jUno
Java Unified Neural Orchestration
Distributed LLM inference and fine-tuning. No Python, no GIL, no Spring.
Run open-source GGUF models locally, in a cluster, or embedded as a JVM library - with an OpenAI-compatible REST API out of the box.
Why jUno?
- Your organization requires on-prem or air-gapped deployments, operates under strict disclosure policies, or runs clusters of commodity or heterogeneous hardware.
- Your organization has mature Java teams and infrastructure, or enforces strict security policies against Python processes running in production.
- You are building Edge or IoT clusters with small heterogeneous machines and need LLM inference without a Python runtime dependency.
- You are building Java-native products - especially in fintech, enterprise tooling, or any domain where the JVM is the standard deployment target.
Quick Start
You can use jUno in several ways. All share the same GGUF model file and the same CLI launcher - only the invocation differs. Full flag reference and examples: docs/howto.md · README.md.
But first, download a model:
# example - TinyLlama 1.1B, ~637 MB
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/\
resolve/main/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf -P models/
Console modes
local — Offline, stand-alone mode: no JVM cluster, no gRPC wire. The fastest way to try a model locally or run LoRA training.
./juno local --model-path models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf
cluster — Offline, stand-alone mode with a JVM cluster on localhost and a real gRPC wire. The fastest way to test a model's distributed behavior locally.
./juno cluster --model-path models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf
lora — Offline, stand-alone mode: no JVM cluster, no gRPC wire. The fastest way to train a model locally using the interactive LoRA console.
./juno lora --model-path models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf
Stand-alone HTTP API server
Adding --api-port N to any local or cluster jUno invocation enables an OpenAI-compatible HTTP server exposing POST /v1/chat/completions and GET /v1/models. Any client that speaks the OpenAI wire format - LangChain, LlamaIndex, the OpenAI SDK - works with only a base-URL change.
./juno local --model-path model.gguf --api-port 8080
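For a quick smoke test from the JVM side, plain java.net.http is enough. A minimal sketch, assuming the server above is running on localhost:8080; the model id below is illustrative - use whatever GET /v1/models reports for your GGUF file:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatCompletionSmokeTest {
    public static void main(String[] args) throws Exception {
        // Standard OpenAI chat.completions payload; the model id is illustrative.
        String body = """
            {
              "model": "TinyLlama-1.1B-Chat-v1.0.Q4_K_M",
              "messages": [{"role": "user", "content": "Say hello in one sentence."}]
            }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // chat.completion JSON, OpenAI wire format
    }
}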
Programmatic integration
Import one POM so every cab.ml module shares the same version:
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>cab.ml</groupId>
      <artifactId>juno-bom</artifactId>
      <version>0.1.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <groupId>cab.ml</groupId>
    <artifactId>juno-player</artifactId>
  </dependency>
</dependencies>
Then play or train your favorite model right from your code.
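A rough sketch of what that looks like. The package path, builder, and method names below are assumptions for illustration, not the verified juno-player signatures - see docs/howto.md for the actual JunoPlayer facade:

// Illustrative only: the package path, builder, and chat() are assumed names,
// not verified juno-player signatures.
import cab.ml.juno.player.JunoPlayer;

public class EmbeddedInferenceSketch {
    public static void main(String[] args) {
        JunoPlayer player = JunoPlayer.builder()                            // assumed entry point
                .modelPath("models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf")   // same GGUF as the CLI
                .build();
        System.out.println(player.chat("Summarize GGUF in one sentence.")); // assumed chat method
    }
}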
Distributed deployment
Deploy juno-master and juno-node across your network for distributed inference. Choose the parallelism type, byte order, and other options to best match your hardware and software. (See full examples in scripts/aws/juno-deploy.sh.)
# coordinator — one active instance per cluster
# JDK 25 required (preview + native access); --add-opens grants reflection and NIO access;
# G1 GC with a pre-touched 4 GB heap; JFR_OPT may hold an optional "--jfr 60s";
# health dashboard on port 8081; --pType: pipeline | tensor
/usr/bin/java \
  --enable-preview --enable-native-access=ALL-UNNAMED \
  --add-opens java.base/java.lang=ALL-UNNAMED \
  --add-opens java.base/java.nio=ALL-UNNAMED \
  -XX:+UseG1GC -XX:+AlwaysPreTouch -Xmx4g \
  ${JFR_OPT:+$JFR_OPT} \
  -DJUNO_HEALTH=true -DJUNO_HEALTH_PORT=8081 \
  -jar juno-master.jar \
  --model-path /models/model.gguf \
  --pType pipeline \
  --dtype FP16          # FP16 | FP32 | INT8
# worker node — repeat on every inference host
# Larger heap for model shards; JUNO_USE_GPU=false falls back to CPU quantized matmul;
# juno.byteOrder: BE (default) | LE; node.id must be unique per node; node.port is the gRPC listen port;
# juno.lora.play.path (optional) applies a LoRA adapter; juno.health.url (optional) reports back to the coordinator.
/usr/bin/java \
  --enable-preview --enable-native-access=ALL-UNNAMED \
  --add-opens java.base/java.lang=ALL-UNNAMED \
  --add-opens java.base/java.nio=ALL-UNNAMED \
  -XX:+UseG1GC -XX:+AlwaysPreTouch -Xmx12g \
  ${JFR_OPT:+$JFR_OPT} \
  -DJUNO_USE_GPU=true \
  -Djuno.byteOrder=BE \
  -Dnode.id=1 \
  -Dnode.port=50051 \
  -Dmodel.path=/models/model.gguf \
  -Djuno.lora.play.path=/adapters/model.lora \
  -Djuno.health.url=http://master:8081/health \
  -jar juno-node.jar cab.ml.juno.node.NodeMain
AWS Cluster Deployment
scripts/aws/juno-deploy.sh is the unified cluster lifecycle script. Hardware is auto-detected during bootstrap: GPU instances set JUNO_USE_GPU=true (CUDA is pre-installed in the golden AMI). The OpenAI-compatible API is available on port 8080 after setup. Full options and LoRA deploy flow: docs/howto.md - AWS.
One-time setup
cd scripts/aws
nano launcher.sh # set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
Commands
| Command | Description |
|---|---|
| setup [options] | Provision, bootstrap, start coordinator, expose OpenAI API on port 8080 (~5 min) |
| start | Start stopped instances |
| stop | Stop instances - EBS and key pair retained |
| teardown | Terminate everything - no lingering costs |
| status | Show current instance states from AWS API |
| scan-regions | Find cheapest AZ for the selected instance type |
# GPU cluster (3 x g4dn.xlarge, T4 16 GB VRAM)
./launcher.sh juno-deploy.sh setup
# CPU cluster (3 x m7i-flex.large)
./launcher.sh juno-deploy.sh setup \
--instance-type m7i-flex.large \
--model-url https://huggingface.co/.../model.gguf
# Deploy with LoRA adapter
./launcher.sh juno-deploy.sh setup \
--lora-play /absolute/path/to/model.lora
# Teardown
./launcher.sh juno-deploy.sh teardown
Architecture
Two distribution strategies, selected with --pType. Pipeline (default): each node holds a contiguous block of layers and activations flow serially from node to node - LAN-friendly, no InfiniBand required. Tensor: every node holds all layers but only a horizontal slice of each weight matrix; the coordinator broadcasts tokens to all nodes in parallel and sums the partial logit vectors (AllReduce).
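To make the tensor mode concrete: each worker returns a partial logit vector over the full vocabulary, and the coordinator reduces them element-wise before sampling. A minimal illustration of that reduction step, not jUno's actual reduce code:

// Sketch of the coordinator-side AllReduce step in tensor mode:
// sum the partial logit vectors returned by the workers, element-wise.
class TensorReduceSketch {
    static float[] reducePartialLogits(java.util.List<float[]> partials) {
        float[] logits = new float[partials.get(0).length];  // vocab-sized output
        for (float[] partial : partials) {
            for (int i = 0; i < logits.length; i++) {
                logits[i] += partial[i];                      // each worker contributes its slice's share
            }
        }
        return logits;                                        // sampled by the coordinator
    }
}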
On-prem and cluster topology
Run juno-master as the coordinator and juno-node on each worker with gRPC between them. Any node count is supported - use pipeline parallelism to scale total VRAM, tensor parallelism to scale throughput. The same roles run locally (forked JVMs), on bare metal, or on cloud instances. Commodity hardware works well; so does modern datacenter hardware.
Sequence: generation loop
On startup, ClusterHarness forks node processes. Each calls ForwardPassHandlerLoader.load(), routing to LlamaTransformerHandler or Phi3TransformerHandler based on general.architecture in the GGUF. EmbeddedNodeServer wires a NodeKVCacheAdapter after loadShard(). On inference: coordinator tokenizes, runs prefill+decode. Session KV blocks are reused across turns - only new tokens are prefilled on turn N+1.
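The routing step is essentially a dispatch on the GGUF metadata value. A simplified stand-in follows; the real ForwardPassHandlerLoader, handler interface, and constructors are not shown here:

// Illustrative dispatch on the GGUF "general.architecture" value.
// The interface and empty classes are stand-ins for the real node-module types.
interface ForwardPassHandler { }
final class LlamaTransformerHandler implements ForwardPassHandler { }
final class Phi3TransformerHandler implements ForwardPassHandler { }

class HandlerRoutingSketch {
    static ForwardPassHandler handlerFor(String architecture) {
        return switch (architecture) {
            case "llama" -> new LlamaTransformerHandler();
            case "phi3"  -> new Phi3TransformerHandler();
            default      -> throw new IllegalArgumentException("Unsupported architecture: " + architecture);
        };
    }
}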
Modules
Multi-module Maven project. All cab.ml artifacts at version 0.1.0 are available on Maven Central. Import juno-bom to align versions. Details: docs/arch.md.
juno-bom
Maven BOM - aligned dependency versions for all cab.ml artifacts.
api
OpenAPI 3.0 spec, protobuf/gRPC contract (inference.proto), JAX-RS interfaces.
registry
Shard planning, model registry, parallelism strategy (ShardPlanner, TensorShardPlanner).
coordinator
Generation loop, request scheduler, OpenAI-compatible REST + SSE, fault-tolerant pipeline.
node
Transformer forward-pass handlers (Llama, Phi-3), GGUF reader, MatVec backends (CPU / CUDA), LoRA overlay.
lora
Adapter tensors, Adam optimizer, .lora checkpoint format, merge-to-GGUF writer.
tokenizer
GGUF BPE tokenizer (SentencePiece + GPT-2 paths auto-detected), chat template formatter.
sampler · kvcache · health · metrics
Shared infrastructure - sampling pipeline, KV cache (GPU + CPU tiers), circuit breaker, JFR metrics extractor.
juno-player
CLI REPL and cluster harness. Exposes JunoPlayer facade, LoraTrainer, JunoHttpClient for JVM integration.
juno-node · juno-master
Shaded fat jars for standalone remote deployment - node process and coordinator process respectively.
Supported Models
Any GGUF file with a supported architecture. Chat template is auto-detected from the model filename.
Architectures
| Architecture | Handler | Template auto-detect |
|---|---|---|
| LLaMA / LLaMA 2 | LlamaTransformerHandler | llama* → chatml; tinyllama* / zephyr* → tinyllama |
| Meta-Llama 3.x | LlamaTransformerHandler | llama3* → llama3 (GPT-2 BPE auto-detected) |
| Mistral | LlamaTransformerHandler | mistral* → mistral |
| Gemma | LlamaTransformerHandler | gemma* → gemma |
| Phi-3 / Phi-3.5 | Phi3TransformerHandler | phi3* / phi-3* → phi3 |
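The auto-detection boils down to a lowercase match on the filename stem. An illustrative version of the mapping in the table above - not the project's actual detector, and the final fallback is an assumption:

// Mirrors the Template auto-detect column above; illustrative, not the real detector.
class TemplateDetectSketch {
    static String detectTemplate(String fileName) {
        String stem = fileName.toLowerCase();
        if (stem.contains("tinyllama") || stem.contains("zephyr")) return "tinyllama";
        if (stem.contains("llama3"))                               return "llama3";
        if (stem.contains("llama"))                                return "chatml";
        if (stem.contains("mistral"))                              return "mistral";
        if (stem.contains("gemma"))                                return "gemma";
        if (stem.contains("phi3") || stem.contains("phi-3"))       return "phi3";
        return "chatml"; // fallback is an assumption, not documented behavior
    }
}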
Quantization types
F32 · F16 · BF16 · Q8_0 · Q4_0 · Q2_K · Q3_K · Q4_K · Q5_K · Q6_K
All quantization types stay compressed in memory; dequantization runs block-by-block inside the matmul loop. On GPU, Llama and Phi-3 dequantize once at load time and keep weights as FP16 on device (DeviceHalfMatrix).
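For intuition, here is the lazy-dequant pattern in the style of a Q8_0 row: 32-element blocks, each stored as one scale plus 32 signed bytes, expanded on the fly inside the dot product. This illustrates the technique, not jUno's kernel:

// Lazy block-wise dequantization inside the dot product (Q8_0-style layout):
// weights stay compressed; each block is expanded only as it is consumed.
class LazyDequantSketch {
    static float dotQuantizedRow(float[] blockScales, byte[] quantized, float[] x) {
        final int blockSize = 32;
        float acc = 0f;
        for (int block = 0; block < blockScales.length; block++) {
            float scale = blockScales[block];
            int base = block * blockSize;
            for (int i = 0; i < blockSize; i++) {
                acc += scale * quantized[base + i] * x[base + i]; // dequantize-and-multiply per element
            }
        }
        return acc;
    }
}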
CLI Reference
./juno is the unified launcher at the project root. Requires JDK 25+ and pre-built jars. Full flag table and examples: docs/howto.md.
Modes
| Command | Description |
|---|---|
| ./juno local | In-process REPL - all shards in one JVM, no forking, no gRPC. Add --api-port N to start the OpenAI-compatible HTTP server alongside the REPL. |
| ./juno (default) | 3-node cluster - forked JVMs with real gRPC. Default --pType pipeline; use --pType tensor for AllReduce mode. Also supports --api-port. |
| ./juno lora | LoRA fine-tuning REPL - single in-process JVM, adapter persisted to .lora checkpoint. Use /train, /train-qa, /save inside the REPL. |
| ./juno merge | Bake a trained .lora adapter into a new standalone GGUF - no sidecar needed at inference time. |
For the full list of flags (--model-path, --dtype, --pType, --heap, --jfr, --lora-play, LoRA-specific flags, merge flags, environment variable overrides) and usage examples, see docs/howto.md.
GPU Acceleration
GPU inference via CUDA 12.x and cuBLAS. All projection weights are uploaded once at load time as IEEE FP16 (DeviceHalfMatrix); forward-pass matmuls use cublasHSSgemvStridedBatched. Per-call H2D transfer is limited to the small input/output activation vectors - CPU cores are idle during generation.
- FP16 resident weights. Both Llama and Phi-3 handlers dequantize once on load and keep weights on device, roughly halving VRAM versus FP32 resident.
- CUDA streams. Per-thread non-blocking streams with cudaMemcpyAsync; cuBLAS calls are serialized with a per-context lock.
- Multi-device. One shared GpuContext per CUDA device. Pin a node to a specific GPU with -Djuno.cuda.device=N.
- OOM fallback. On cudaMalloc failure, partial device buffers are closed and inference falls back to CPU quantized matmul for those projections - no crash, no restart.
- Explicit lifecycle. releaseGpuResources() frees VRAM on shard unload, reload, or handler swap.
LoRA Fine-Tuning
Parameter-efficient fine-tuning via low-rank adapter matrices. The base GGUF is never modified. Adapters persist to a .lora checkpoint; the same handler serves training and inference. Full guide, rank selection, and checkpoint format: docs/LoRA.md.
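The overlay itself is simple: with a rank-r adapter (A is r×d_in, B is d_out×r), the effective projection is y = W·x + (α/r)·B·(A·x), so the frozen quantized W is never rewritten. A minimal sketch of that order of operations, in plain FP32 for clarity:

// LoRA overlay applied at inference: y = W·x + (alpha/rank) · B · (A · x).
// W stays frozen (and quantized in practice); only A and B are trained.
class LoraOverlaySketch {
    static float[] applyLoraOverlay(float[][] w, float[][] a, float[][] b,
                                    float alpha, int rank, float[] x) {
        float[] y   = matVec(w, x);   // frozen base projection
        float[] ax  = matVec(a, x);   // down-project to the rank-sized space
        float[] bax = matVec(b, ax);  // up-project back to the output dimension
        float scale = alpha / rank;
        for (int i = 0; i < y.length; i++) {
            y[i] += scale * bax[i];   // add the low-rank delta
        }
        return y;
    }

    static float[] matVec(float[][] m, float[] v) {
        float[] out = new float[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < v.length; j++) out[i] += m[i][j] * v[j];
        }
        return out;
    }
}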
- ./juno lora - fine-tuning REPL. Use /train for free-text, /train-qa for Q&A pairs (auto-generates 4 phrasings with the model's chat template).
- --lora-play PATH - apply a pre-trained adapter read-only at inference, in both local and cluster modes. In cluster mode the adapter is injected into every forked node JVM.
- ./juno merge - write a new GGUF with the 44 LoRA-patched projection tensors stored as F32 (the LoRA delta is smaller than Q4_K noise; re-quantizing would erase training). All other tensors are copied verbatim.
- JVM API. LoraTrainer (in juno-player) exposes trainRawText(), trainQaPair(), and save() for programmatic use (see the sketch below).
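From Java, the same flow goes through LoraTrainer in juno-player. The method names below come from the list above; the package path, constructor, and argument shapes are assumptions - check the juno-player Javadoc:

// trainRawText / trainQaPair / save are the documented method names;
// the package path, constructor, and argument shapes are assumptions.
import cab.ml.juno.player.LoraTrainer;

public class LoraTrainingSketch {
    public static void main(String[] args) throws Exception {
        LoraTrainer trainer = new LoraTrainer("models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf");
        trainer.trainRawText("jUno is a JVM-native engine for GGUF inference and LoRA fine-tuning.");
        trainer.trainQaPair("What does juno merge do?",
                            "It bakes a trained .lora adapter into a new standalone GGUF.");
        trainer.save("adapters/my-domain.lora");  // persists the .lora checkpoint
    }
}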
Health Dashboard
The health module exposes an auto-refreshing dashboard served at GET / on the health port. Each node runs a HealthReporter that probes local metrics every 5 s. Per-node cards show VRAM, CPU load, circuit state, and a role-conditional metric: Latency P99 on the coordinator, Throughput MB/s on worker nodes.
Circuit-open nodes are highlighted in red; half-open in amber. The coordinator's port 8080 also embeds a simplified health view inline.
JFR Profiling
All hot paths are instrumented with custom Java Flight Recorder events. Pass --jfr DURATION to any command; a juno-<modelStem>-<timestamp>.jfr file is written on exit. Open in JDK Mission Control → Event Browser. Full guide and extraction examples: docs/howto.md → Metrics.
| Event | Key fields |
|---|---|
juno.MatVec | backend (cpu / cuda / cuda-resident-fp16), rows, cols |
juno.ForwardPass | handlerType, requestId, startPosition, layerCount |
juno.TokenProduced | coordinator-side; used to derive aggregate TPS |
juno.Tokenizer | tokenizerType, operation, inputLength, outputLength |
juno.TemplateFormat | modelType, messageCount, outputLength |
juno.LoraTrainStep | step, loss, forwardMs, backwardMs, optimizerMs |
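The events use the standard jdk.jfr API - no agent, no bytecode weaving. A sketch of how a juno.MatVec-style event could be declared and emitted; field names mirror the table, but the project's actual event class may differ in detail:

import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Sketch of a juno.MatVec-style event on the standard jdk.jfr API;
// the real event class in the metrics module may differ in detail.
@Name("juno.MatVec")
@Label("MatVec")
class MatVecEvent extends Event {
    @Label("Backend") String backend;  // cpu / cuda / cuda-resident-fp16
    @Label("Rows")    int rows;
    @Label("Cols")    int cols;
}

class MatVecInstrumentationSketch {
    static void timedMatVec(int rows, int cols) {
        MatVecEvent event = new MatVecEvent();
        event.begin();
        // ... run the matrix-vector product here ...
        event.backend = "cpu";
        event.rows = rows;
        event.cols = cols;
        event.commit();                // effectively free when no recording is active
    }
}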
Key Design Decisions
| Decision | Detail |
|---|---|
| No Python, no subprocess | The JVM reads GGUF binary directly via GgufReader and runs the full transformer forward pass end to end. No sidecar, no IPC, no GIL. |
| No Spring Boot | Javalin for REST. Virtual threads on the gRPC ServerBuilder - required to avoid OS-thread saturation under concurrent prefill sessions. |
| OpenAI wire compatibility | OpenAiChatHandler and OpenAiAdapter are new classes in the coordinator module. No existing classes were modified; the existing native /v1/inference endpoints are untouched. |
| Maven BOM on Central | cab.ml:juno-bom:0.1.0 aligns all artifact versions. Import once in dependencyManagement; downstream modules declare no version. |
| Two parallelism modes | Pipeline: vertical depth scaling, serial gRPC hops, adds total VRAM linearly. Tensor: horizontal width scaling, parallel broadcast + AllReduce, one gRPC round-trip per step. |
| Lazy dequant on CPU / eager upload on GPU | On CPU, one 256-element block is dequantized at a time inside the matmul loop. On GPU, weights are dequantized once at load and kept as FP16 on device. OOM falls back to CPU gracefully. |
| LoRA without modifying the base GGUF | LoraTrainableHandler applies W_eff = W + (α/rank)·B·A at inference. Frozen weights stay quantized. Adapters persist separately; the GGUF is never touched. juno merge bakes them permanently when needed. |
| Session KV cache | A stable sessionId key survives across REPL turns. Turn latency is proportional to new tokens only, not history length. |
| Full JFR instrumentation | Six custom event types across every hot path - observable in JDK Mission Control with no agent or bytecode manipulation. Zero overhead when recording is off. |
| AWS infrastructure fully scripted | juno-deploy.sh handles the full cluster lifecycle. GPU quota is checked before any instance launches; insufficient vCPUs fail hard. State persisted to ~/.juno-deploy-state. |
| Stub mode | EmbeddedNodeServer uses StubForwardPassHandler (zero-filled arrays) before a shard is loaded. Integration tests run in stub mode - no model file, no GPU, boots in seconds. |
Performance
Measured on tinyllama-1.1b-chat-v1.0-q4_k_m.gguf via AWS JFR cluster runs. CPU: m7i-flex.large. GPU: g4dn.2xlarge. TPS = coordinator-side TokenProduced.tps.
- GPU dominates at all concurrency levels. 17–37× faster than CPU; 21× faster on multi-turn where prefill cost compounds harder on CPU.
- CPU pipeline sweet spot: 3 nodes. 1.59 tps peak. Adding nodes beyond 3 degrades TPS - network sync per forward pass outweighs compute savings.
- Tensor-parallel is a net loss on CPU at this model size. 3-node tensor: 0.44 tps versus 1.59 tps pipeline. No benefit on GPU single-node either.
- CPU dtype: FP16 wins. FP16 (1.54) > INT8 (1.13) ≈ FP32 (1.06). No AVX-512 INT8 path - scalar fallback. LE byte order adds ~19% overhead vs BE.
- GPU dtype: flat. FP32 on GPU is only 7% slower than FP16 - driver upcasts internally.
Changelog
Last 5 development sessions. Full history on GitHub.
Session 29 - OpenAI-compatible REST API
POST /v1/chat/completions and GET /v1/models added. Any OpenAI-compatible client (LangChain, LlamaIndex, OpenAI SDK) works with a base-URL change only. OpenAiAdapter and OpenAiChatHandler are new classes; no existing endpoints modified. --api-port N wired into both local and cluster modes. OpenAPI 3.0 spec at api/src/main/resources/juno-api.yaml.
Session 28 - Health dashboard & virtual-thread gRPC
CPU load metric replaces unavailable /sys/class/thermal on EC2. Role-conditional secondary metric: coordinator shows Latency P99, nodes show Throughput MB/s. Root-cause investigation: gRPC thread pool saturation with 9 concurrent sessions fixed by switching to newVirtualThreadPerTaskExecutor() on ServerBuilder.
Session 27 - GPU lifecycle, CUDA streams, multi-device
Llama GPU path upgraded to FP16 resident weights (DeviceHalfMatrix). Per-thread non-blocking CUDA streams with cudaMemcpyAsync. One shared GpuContext per device index. releaseGpuResources() called on shard unload/reload. OOM fallback verified on allocation failure.
Session 26 - LoRA merge + inference overlay + Q&A training
juno merge writes a new GGUF with LoRA-patched tensors as F32 - no sidecar at inference. --lora-play applies a trained adapter read-only in all modes. /train-qa REPL command auto-generates 4 phrasings with the model's chat template. AWS deploy hardening: race conditions, path resolution, and double-base64 cloud-init bugs fixed.
Session 25 - Code quality & dead code removal
CyclicForwardPassHandler moved to test scope; production stub replaced by StubForwardPassHandler inner class. ConsoleMain --cpu fall-through bug fixed. One shared GpuContext per runLocalRepl invocation. Docs fully updated.