# Juno

Distributed LLM inference engine on the JVM. Reads GGUF models directly, runs the full transformer forward pass in pure Java, and shards layers across commodity GPU nodes via gRPC. No Python. No GIL. No subprocesses.
## Status

Verified end-to-end with `TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf` on a 3-node CPU cluster, and against Mistral-7B / Llama-3.2-8B / Llama-3.1-70B.

The GPU forward pass (`GpuForwardPassHandler`, JCublas `cublasSgemv`) is numerically identical to the CPU path within float32 rounding. GPU tests are opt-in via `-Dgroups=gpu` or `-Pgpu`; the full test suite runs on any machine without a GPU.
```
you> hey there, my name is Dima, nice to meet you!
bot> Greetings! Nice to meet you too.
[37 tokens · 7342 ms · FLOAT16]   ← CPU baseline

you> what is my name?
bot> Your name is Dima.
[11 tokens · 8103 ms · FLOAT16]   ← flat KV cache reuse
```
## Quick Start

### Download a model

```bash
# TinyLlama — 637 MB, good for initial testing
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/\
resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
```
### Build

```bash
mvn clean package -DskipTests
```
### Run

```bash
# Interactive REPL — single JVM, no forking (dev)
./run.sh console --model-path /path/to/model.gguf

# 3-node cluster — forked JVMs, real gRPC (production)
./run.sh cluster --model-path /path/to/model.gguf

# Smoke test — 6 automated checks, exits 0/1
./run.sh live --model-path /path/to/model.gguf
```
### Stub mode (no model file)

```bash
# Cluster boots in seconds, all integration tests run stub
./run.sh cluster
```
## Architecture

Pipeline parallelism over tensor parallelism — LAN-friendly, no InfiniBand required. The data plane (gRPC activations) is kept separate from the control plane (Hazelcast state).

Each node runs either `CpuForwardPassHandler` (pure Java, parallel matVec) or `GpuForwardPassHandler` (JCublas `cublasSgemv`). Both implement `ForwardPassHandler` via the `GpuMatVec` interface. Node selection is automatic — GPU nodes use `CublasMatVec`, CPU-only nodes fall back to `CpuMatVec`.

The `GpuMatVec` interface decouples the matmul backend from transformer logic entirely. Swapping `CpuMatVec` → `CublasMatVec` → `RocmMatVec` requires no changes to `GpuForwardPassHandler`.
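As a sketch of what such a pluggable backend boundary can look like (hypothetical names and signatures, not the actual `GpuMatVec` API):

```java
// Hypothetical sketch of a pluggable matvec backend, mirroring the
// GpuMatVec idea described above. Names and signatures are illustrative.
public class MatVecSketch {

    /** One matrix-vector product: out[r] = dot(row r of weights, x). */
    interface MatVec {
        void matVec(float[] weights, int rows, int cols, float[] x, float[] out);
    }

    /** Pure-Java fallback backend, analogous to CpuMatVec. */
    static final MatVec CPU = (w, rows, cols, x, out) -> {
        for (int r = 0; r < rows; r++) {
            float acc = 0f;
            for (int c = 0; c < cols; c++) acc += w[r * cols + c] * x[c];
            out[r] = acc;
        }
    };

    public static void main(String[] args) {
        float[] w = {1, 2, 3, 4};     // 2x2 row-major matrix
        float[] x = {1, 1};
        float[] out = new float[2];
        CPU.matVec(w, 2, 2, x, out);  // row sums: {3, 7}
        System.out.println(out[0] + " " + out[1]);
    }
}
```

A cuBLAS-backed implementation would provide the same interface, so transformer code never sees which backend is active.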
### Hardware target

The design target is 16 commodity machines with 4 GB VRAM each — 64 GB total VRAM at a fraction of the cost of a single 64 GB card. 10 GbE minimum, 25 GbE recommended. Managed switch with jumbo frames.
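A back-of-the-envelope check of that budget (illustrative arithmetic; real usage adds KV cache and runtime overhead on top of the weight shard):

```java
import java.util.Locale;

// Rough shard budget for the 16-node target described above.
// Numbers are illustrative; KV cache and activations are not included.
public class ShardBudget {
    public static void main(String[] args) {
        double vramPerNodeGb = 4.0;
        int nodes = 16;
        double modelGb = 40.0;                       // Llama-3.1-70B Q4_K_M file size
        double totalVramGb = vramPerNodeGb * nodes;  // aggregate VRAM
        double perNodeShardGb = modelGb / nodes;     // weights per node
        System.out.printf(Locale.ROOT, "total=%.1f GB, shard=%.2f GB, headroom=%.2f GB%n",
                totalVramGb, perNodeShardGb, vramPerNodeGb - perNodeShardGb);
    }
}
```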
## Modules

Each module owns one concern and carries its own test suite. Dependencies flow one way.

| Module | Contents |
|---|---|
| `api` | OpenAPI 3.0 spec, JAX-RS interfaces, `inference.proto` |
| `registry` | `NodeDescriptor`, `ShardPlanner`, `ShardMap`, `SeedScorer` |
| `coordinator` | `GenerationLoop`, `RequestScheduler`, `FaultTolerantPipeline`, Javalin REST, SSE |
| `node` | `CpuForwardPassHandler`, `GpuForwardPassHandler`, `GpuMatVec`, `CublasMatVec`, `GgufReader`, `LlamaConfig` |
| `kvcache` | `KVCacheManager`, `GpuKVCache`, `CpuKVCache`, `PrefixCache` trie |
| `tokenizer` | `GgufTokenizer` (SentencePiece BPE from GGUF), `ChatTemplate`, `ChatTemplateFormatter` |
| `sampler` | Temperature, top-k, top-p, repetition penalty — pure Java pipeline steps |
| `health` | `CircuitBreaker`, `HealthEvaluator`, `NodeHealth` (Resilience4j) |
| `player` | `ConsoleMain` REPL, `ClusterHarness`, `ProcessPipelineClient`, `ChatHistory` |
| `integration` | `InProcessClusterIT`, `ThreeNodeClusterIT`, `ModelLiveRunner` (6 real-model checks) |
## Supported Models

Any GGUF file with a LLaMA-compatible architecture. Chat templates: `llama3`, `mistral`, `gemma`, `tinyllama`/`zephyr`, `chatml` (default). The template is derived automatically from the GGUF filename.
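A hypothetical sketch of that filename lookup (illustrative; the real rules live in the `ChatTemplate` code and may differ):

```java
import java.util.Locale;

// Hypothetical sketch of deriving a chat template from a GGUF filename,
// as described above. The actual matching rules may differ.
public class TemplateFromFilename {

    static String templateFor(String filename) {
        String f = filename.toLowerCase(Locale.ROOT);
        if (f.contains("llama-3") || f.contains("llama3")) return "llama3";
        if (f.contains("mistral")) return "mistral";
        if (f.contains("gemma")) return "gemma";
        if (f.contains("tinyllama") || f.contains("zephyr")) return "tinyllama";
        return "chatml"; // default
    }

    public static void main(String[] args) {
        System.out.println(templateFor("mistral-7b-instruct-v0.2.Q4_K_M.gguf"));
        System.out.println(templateFor("my-custom-model.gguf"));
    }
}
```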
| Model | File size | RAM |
|---|---|---|
| TinyLlama-1.1B-Chat Q4_K_M | 637 MB | ~2 GB |
| Mistral-7B-Instruct Q4_K_M | 4.1 GB | ~6 GB |
| Llama-3.2-8B-Instruct Q4_K_M | 4.9 GB | ~8 GB |
| Llama-3.1-70B-Instruct Q4_K_M | 40 GB | 16 × 4 GB nodes |
Quantization types: `F32`, `F16`, `BF16`, `Q8_0`, `Q4_0`, `Q4_K`, `Q6_K`.
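As a concrete example of the simpler formats: `Q8_0` stores blocks of 32 signed 8-bit values with one scale per block, so dequantization is just `scale * q`. A simplified sketch (the on-disk format stores the scale as float16; a plain float is used here to keep it short):

```java
// Simplified Q8_0-style dequantization: 32 int8 quants share one scale.
// The real GGUF block stores the scale as float16.
public class Q8DequantSketch {

    static float[] dequantBlock(float scale, byte[] quants) {
        float[] out = new float[quants.length];
        for (int i = 0; i < quants.length; i++) out[i] = scale * quants[i];
        return out;
    }

    public static void main(String[] args) {
        byte[] q = new byte[32];          // one block of 32 quants
        q[0] = 64; q[1] = -128; q[2] = 1;
        float[] v = dequantBlock(0.5f, q);
        System.out.println(v[0] + " " + v[1] + " " + v[2]);
    }
}
```

The K-quant formats (`Q4_K`, `Q6_K`) add per-sub-block scales and minimums, but the reconstruction principle is the same.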
## CLI Reference

### Commands

| Command | Description |
|---|---|
| `console` | Single-JVM in-process REPL, no forking. Fastest startup, everyday use. |
| `cluster` | 3-node cluster, forked JVMs, real gRPC. GPU deployments and pipeline-parallel scenarios. |
| `live` | 6 automated real-model checks. Exits 0 on pass, 1 on any failure. |
### Flags

| Flag | Default | Description |
|---|---|---|
| `--model-path PATH` | — | Path to GGUF file (required for real inference) |
| `--dtype FLOAT32\|FLOAT16\|INT8` | `FLOAT16` | Activation wire format between nodes |
| `--max-tokens N` | 200 | Max tokens per response |
| `--temperature F` | 0.6 | Sampling temperature |
| `--heap SIZE` | `4g` | JVM heap. Use `8g`+ for 7B models |
| `--nodes N` | 3 | Number of shards (console mode) |
| `--verbose` / `-v` | — | Show gRPC and node logs |

All flags are also available as environment variables: `MODEL_PATH`, `DTYPE`, `MAX_TOKENS`, `TEMPERATURE`, `HEAP`, `NODES`, `JAVA_HOME`.
### Examples

```bash
# 7B model with larger heap and verbose output
./run.sh cluster --model-path /models/mistral-7b.gguf --heap 8g -v

# INT8 activation compression (max bandwidth saving, ~1% error)
./run.sh cluster --model-path /models/llama3.gguf --dtype INT8

# Controlled generation
./run.sh console --model-path /models/llama3.gguf --temperature 0.1 --max-tokens 512

# Environment variable style
MODEL_PATH=/models/llama3.gguf DTYPE=FLOAT16 HEAP=8g ./run.sh cluster
```
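The `--dtype INT8` example implies some form of symmetric activation quantization on the wire; a sketch of the general idea (illustrative only, not the actual wire codec):

```java
// Illustrative symmetric int8 round-trip for an activation vector: one
// float scale plus one byte per element, quartering bandwidth versus
// float32 at the cost of small (~1%) reconstruction error.
public class Int8WireSketch {

    static byte[] quantize(float[] x, float[] scaleOut) {
        float max = 1e-12f;
        for (float v : x) max = Math.max(max, Math.abs(v));
        float scale = max / 127f;           // map [-max, max] onto [-127, 127]
        scaleOut[0] = scale;
        byte[] q = new byte[x.length];
        for (int i = 0; i < x.length; i++) q[i] = (byte) Math.round(x[i] / scale);
        return q;
    }

    static float[] dequantize(byte[] q, float scale) {
        float[] x = new float[q.length];
        for (int i = 0; i < q.length; i++) x[i] = q[i] * scale;
        return x;
    }

    public static void main(String[] args) {
        float[] x = {0.9f, -0.45f, 0.001f, 0.3f};
        float[] scale = new float[1];
        float[] back = dequantize(quantize(x, scale), scale[0]);
        float maxErr = 0f;
        for (int i = 0; i < x.length; i++) maxErr = Math.max(maxErr, Math.abs(x[i] - back[i]));
        System.out.println(maxErr <= scale[0] / 2 + 1e-6f); // error within half a step
    }
}
```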
## Unit & Integration Tests

```bash
# Build (produces shade jars)
mvn clean package -DskipTests

# Unit tests — no model file, no GPU needed
mvn test -pl tokenizer,node,coordinator,sampler,kvcache,health,registry,player

# Integration tests — forks 3 JVM nodes in stub mode (~30s)
mvn verify -pl integration

# Real-model smoke test
./run.sh live --model-path /path/to/model.gguf
```
355 `@Test` methods across all modules. Notable anchors:

- Golden-value regression for Q6_K dequantization — bit-exact output checked against known values.
- Timing regression in `LoadShardsParallelTest` — prevents accidental re-serialization of parallel load.
- Anti-regression for EOS piece filtering in `GenerationLoopEosPieceTest`.
- Fault tolerance in `FaultTolerantPipelineTest`, `HealthReactorTest`, `RetryPolicyTest`.
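The bit-exact style of the golden-value check can be illustrated with `Float.floatToIntBits`, which turns float comparison into exact integer comparison (an illustrative pattern, not the actual test code):

```java
// Illustrative bit-exact comparison, the pattern a golden-value
// dequantization test relies on: floats compared via their raw bits,
// so any change in rounding behavior fails loudly.
public class GoldenValueSketch {

    static boolean bitExact(float[] expected, float[] actual) {
        if (expected.length != actual.length) return false;
        for (int i = 0; i < expected.length; i++) {
            if (Float.floatToIntBits(expected[i]) != Float.floatToIntBits(actual[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        float[] golden = {0.5f, -1.25f};
        System.out.println(bitExact(golden, new float[]{0.5f, -1.25f}));
        System.out.println(bitExact(golden, new float[]{0.5f, -1.25001f}));
    }
}
```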
## GPU Tests

GPU tests are excluded from default CI. Activate them with `-Pgpu` or `-Dgroups=gpu`. They require CUDA 12.x and an NVIDIA GPU.
```bash
# Unit tests — node module only, no model file needed
mvn test -Dgroups=gpu -pl node --enable-native-access=ALL-UNNAMED

# Integration test — requires CUDA + GGUF model
mvn verify -Pgpu -Dit.model.path=/path/to/model.gguf -pl integration \
  --enable-native-access=ALL-UNNAMED

# Via env var
MODEL_PATH=/path/to/model.gguf mvn verify -Pgpu -pl integration \
  --enable-native-access=ALL-UNNAMED
```
`GpuForwardPassIT` is excluded from the default Failsafe scan to prevent JCuda native libraries from loading into the coordinator JVM and poisoning FD inheritance into forked node processes. The `-Pgpu` profile sets `-Djuno.gpu.test=true`, which is the guard checked in `@BeforeAll`.
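The guard itself is an ordinary system-property check; a minimal sketch (illustrative — the real `@BeforeAll` would typically wrap an equivalent check in JUnit's `Assumptions.assumeTrue` to skip rather than fail):

```java
// Sketch of the system-property guard behind -Djuno.gpu.test=true.
// Boolean.getBoolean reads the named system property and parses it.
public class GpuGuardSketch {

    static boolean gpuTestsEnabled() {
        return Boolean.getBoolean("juno.gpu.test");
    }

    public static void main(String[] args) {
        System.out.println(gpuTestsEnabled());      // false unless -Djuno.gpu.test=true was passed
        System.setProperty("juno.gpu.test", "true");
        System.out.println(gpuTestsEnabled());      // true once the flag is set
    }
}
```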
## AWS Setup for GPU Testing

Recommended instance: g4dn.xlarge (T4, 16 GB VRAM, ~$0.50/hr on-demand).

```bash
# 1. Install CUDA 12.x
sudo apt update && sudo apt install -y nvidia-cuda-toolkit

# 2. Verify GPU
nvidia-smi

# 3. Install JDK 25 and Maven
sudo apt install -y openjdk-25-jdk maven

# 4. Clone and build
git clone https://github.com/ml-cab/juno
cd juno && mvn clean package -DskipTests

# 5. Download TinyLlama (637 MB — smallest supported)
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/\
resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# 6. GPU unit tests (validate CUDA wiring, no model needed)
mvn test -Dgroups=gpu -pl node --enable-native-access=ALL-UNNAMED

# 7. GPU integration test
mvn verify -Pgpu \
  -Dit.model.path=$(pwd)/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -pl integration --enable-native-access=ALL-UNNAMED
```
## Key Design Decisions

- No Python, no subprocess. The JVM reads the GGUF binary directly and runs the transformer end to end.
- No Spring Boot. Javalin for REST. gRPC for the data plane. Plain Java for everything else.
- Pipeline parallelism over tensor parallelism. LAN-friendly — no InfiniBand required.
- Two `ActivationDtype` enums by design. Protobuf-generated for the wire; a domain enum for application code. The bridge lives in one file.
- GGUF tokenizer from model metadata. No separate `tokenizer.model` file needed.
- Stub mode. The cluster boots in seconds without a model file. All integration tests run stub.
- GPU tests excluded from default CI via a Failsafe `<excludes>` block and a `-Pgpu` profile.
- `GpuMatVec` interface decouples the matmul backend from transformer logic. Swap backends without touching `GpuForwardPassHandler`.
## Performance

| Session | Change | Latency |
|---|---|---|
| 5 | Baseline — FLOAT32, serial matVec | ~34,891 ms / 10 tokens |
| 6 | Parallel matVec + FLOAT16 default | ~3,802 ms / 10 tokens (9×) |
| 9 | Session KV cache — turn latency now flat | ~7,000–8,000 ms / turn |
| 10 | `GpuForwardPassHandler` (cublasSgemv) — AWS benchmark pending | — |

Session 9 turn latency grows only with the number of new tokens per turn, not with total history length. GPU numbers will be filled in after the first g4dn.xlarge run.
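That flat-latency behavior follows from running the forward pass only over tokens not already cached; a sketch of the bookkeeping (illustrative, not the real `KVCacheManager` API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative session KV cache bookkeeping: each turn, only tokens not
// already cached need a forward pass, so per-turn cost tracks new tokens
// rather than total history. Not the actual KVCacheManager API.
public class SessionKvSketch {
    private final List<Integer> cachedTokens = new ArrayList<>();

    /** Returns how many tokens actually need a forward pass this turn. */
    int admit(List<Integer> fullPrompt) {
        int reused = 0;
        while (reused < cachedTokens.size()
                && reused < fullPrompt.size()
                && cachedTokens.get(reused).equals(fullPrompt.get(reused))) {
            reused++;
        }
        // Drop any diverged suffix, then cache the new tokens.
        cachedTokens.subList(reused, cachedTokens.size()).clear();
        cachedTokens.addAll(fullPrompt.subList(reused, fullPrompt.size()));
        return fullPrompt.size() - reused;
    }

    public static void main(String[] args) {
        SessionKvSketch kv = new SessionKvSketch();
        System.out.println(kv.admit(List.of(1, 2, 3)));       // cold cache: all 3 tokens
        System.out.println(kv.admit(List.of(1, 2, 3, 4, 5))); // warm cache: only the 2 new tokens
    }
}
```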
## Comparison

✓ = yes, ~ = partial, ✗ = no.
| Feature | Juno | llama.cpp | vLLM | Ollama |
|---|---|---|---|---|
| Pure JVM | ✓ | ✗ | ✗ | ✗ |
| Cluster-native | ✓ | ✗ | ~ | ✗ |
| GGUF quantized models | ✓ | ✓ | ~ | ✓ |
| GPU acceleration | ✓ | ✓ | ✓ | ✓ |
| Java ecosystem fit | ✓ | ✗ | ✗ | ✗ |
| Session KV reuse | ✓ | ~ | ✓ | ✗ |
| Continuous batching | roadmap | ✗ | ✓ | ✗ |
## Requirements

- JDK 25+ — virtual threads; `--enable-native-access=ALL-UNNAMED` for the GPU path
- Maven 3.9+
- CUDA 12.x — GPU nodes only. Not required for CPU mode, unit tests, or integration tests.