ML Cabinet · Open Source

Juno.

Distributed LLM inference engine on the JVM. Reads GGUF models directly, runs the full transformer forward pass in pure Java, shards layers across commodity GPU nodes via gRPC. No Python. No GIL. No subprocess.

JDK 25+ CUDA 12.x Maven 3.9+ Apache 2.0 gRPC GGUF

Status

Session 10 — GPU acceleration complete. All modules build. All tests pass.

Verified end-to-end with TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf on a 3-node CPU cluster and against Mistral-7B / Llama-3.2-8B / Llama-3.1-70B.

The GPU forward pass (GpuForwardPassHandler, JCublas cublasSgemv) matches the CPU path to within float32 rounding. GPU tests are opt-in via -Dgroups=gpu or -Pgpu; the full test suite runs on any machine without a GPU.
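
CPU/GPU equivalence of this kind is typically asserted with a relative-tolerance comparison rather than exact equality. A minimal sketch of such a check; the threshold policy below is an assumption, not Juno's actual test code:

```java
// Illustrative relative-tolerance check for "identical within float32
// rounding". The epsilon choice is an assumption for this sketch.
public class Tolerance {
    /** True when a and b differ by at most relEps relative to their magnitude. */
    public static boolean close(float a, float b, float relEps) {
        float diff = Math.abs(a - b);
        float scale = Math.max(Math.abs(a), Math.abs(b));
        return diff <= relEps * Math.max(scale, 1e-30f); // floor avoids 0/0
    }

    public static void main(String[] args) {
        // A one-ULP difference near 1.0 passes; a 10% difference does not.
        System.out.println(close(1.0000001f, 1.0f, 1e-5f));
        System.out.println(close(1.1f, 1.0f, 1e-5f));
    }
}
```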

you> hey there, my name is Dima, nice to meet you!
bot> Greetings! Nice to meet you too.
     [37 tokens · 7342 ms · FLOAT16]   ← CPU baseline

you> what is my name?
bot> Your name is Dima.
     [11 tokens · 8103 ms · FLOAT16]   ← flat KV cache reuse

Quick Start

Download a model

# TinyLlama — 637 MB, good for initial testing
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/\
resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

Build

mvn clean package -DskipTests

Run

# Interactive REPL — single JVM, no forking (dev)
./run.sh console --model-path /path/to/model.gguf

# 3-node cluster — forked JVMs, real gRPC (production)
./run.sh cluster --model-path /path/to/model.gguf

# Smoke test — 6 automated checks, exits 0/1
./run.sh live --model-path /path/to/model.gguf

Stub mode (no model file)

# Cluster boots in seconds, all integration tests run stub
./run.sh cluster

Architecture

Pipeline parallelism over tensor parallelism: LAN-friendly, no InfiniBand required. The data plane (activations over gRPC) is kept separate from the control plane (Hazelcast state).

[Client]  REST (Javalin) / gRPC streaming
    |
[Coordinator]
 |-- GgufTokenizer (BPE from GGUF metadata)
 |-- ChatTemplateFormatter
 |-- RequestScheduler (virtual threads, CompletableFuture)
 |-- Sampler (temperature / top-k / top-p / rep. penalty)
 |-- KVCacheManager (GPU tier + CPU tier + PrefixCache trie)
 +-- GenerationLoop (prefill + decode + session KV reuse)
    |
    | gRPC (activations — FLOAT16/INT8/FLOAT32)
    |
+--------------------------------------+
| Node 1      Node 2      Node 3  ...  |  10/25 GbE
| L 0-7       L 8-14      L 15-21      |
| +embed                  +output proj |
+--------------------------------------+

Each node runs either CpuForwardPassHandler (pure Java, parallel matVec) or GpuForwardPassHandler (JCublas cublasSgemv). Both implement ForwardPassHandler via the GpuMatVec interface. Node selection is automatic — GPU nodes use CublasMatVec, CPU-only nodes fall back to CpuMatVec.

The GpuMatVec interface decouples the matmul backend from transformer logic entirely. Swapping CpuMatVec → CublasMatVec → RocmMatVec requires no changes to GpuForwardPassHandler.
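
As a concrete illustration, a seam of this shape might look like the following. The interface and method signatures here are assumptions for the sketch, not Juno's actual API:

```java
// Illustrative sketch of a matmul seam like GpuMatVec: the forward pass
// depends only on the interface, so backends swap freely.
// All names and signatures are assumptions, not Juno's real code.
public class MatVecSeam {

    /** y = W * x for a row-major (rows x cols) weight matrix. */
    public interface MatVec {
        float[] multiply(float[] w, int rows, int cols, float[] x);
    }

    /** Pure-Java fallback backend (stand-in for a CpuMatVec). */
    public static final MatVec CPU = (w, rows, cols, x) -> {
        float[] y = new float[rows];
        for (int r = 0; r < rows; r++) {
            float acc = 0f;
            for (int c = 0; c < cols; c++) acc += w[r * cols + c] * x[c];
            y[r] = acc;
        }
        return y;
    };

    /** The "forward pass" sees only the interface, never the backend. */
    public static float[] project(MatVec backend, float[] w, int rows, int cols, float[] x) {
        return backend.multiply(w, rows, cols, x);
    }

    public static void main(String[] args) {
        float[] identity = {1, 0, 0, 1}; // 2x2
        float[] y = project(CPU, identity, 2, 2, new float[] {3, 5});
        System.out.println(y[0] + " " + y[1]); // 3.0 5.0
    }
}
```

Because `project` only sees the interface, a CUDA- or ROCm-backed implementation drops in without touching the caller.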

Hardware target

The design target is 16 commodity machines with 4 GB VRAM each — 64 GB total VRAM at a fraction of the cost of a single 64 GB card. 10 GbE minimum, 25 GbE recommended. Managed switch with jumbo frames.
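
A back-of-envelope check of that budget, using the 40 GB Llama-3.1-70B Q4_K_M figure from the model table below (the helper method is illustrative, not Juno's code):

```java
// Sanity check for the 16 x 4 GB hardware target against a 40 GB model.
public class VramBudget {
    /** Weight bytes each node must hold under even pipeline sharding. */
    public static double perNodeWeightsGb(double modelGb, int nodes) {
        return modelGb / nodes;
    }

    public static void main(String[] args) {
        int nodes = 16;
        double vramPerNodeGb = 4.0;
        double weights = perNodeWeightsGb(40.0, nodes); // 2.5 GB of weights per node
        double headroom = vramPerNodeGb - weights;      // ~1.5 GB left for KV cache etc.
        System.out.printf("weights/node=%.2f GB, headroom=%.2f GB%n", weights, headroom);
    }
}
```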

Modules

Each module owns one concern and carries its own test suite. Dependencies flow one way.

api

OpenAPI 3.0 spec, JAX-RS interfaces, inference.proto

registry

NodeDescriptor, ShardPlanner, ShardMap, SeedScorer

coordinator

GenerationLoop, RequestScheduler, FaultTolerantPipeline, Javalin REST, SSE

node

CpuForwardPassHandler, GpuForwardPassHandler, GpuMatVec, CublasMatVec, GgufReader, LlamaConfig

kvcache

KVCacheManager, GpuKVCache, CpuKVCache, PrefixCache trie

tokenizer

GgufTokenizer (SentencePiece BPE from GGUF), ChatTemplate, ChatTemplateFormatter

sampler

Temperature, top-k, top-p, repetition penalty — pure Java pipeline steps

health

CircuitBreaker, HealthEvaluator, NodeHealth (Resilience4j)

player

ConsoleMain REPL, ClusterHarness, ProcessPipelineClient, ChatHistory

integration

InProcessClusterIT, ThreeNodeClusterIT, ModelLiveRunner (6 real-model checks)
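
As one example of the module boundaries above, the sampler module's steps (temperature, top-k, top-p, repetition penalty) compose as pure transforms over the logits. An illustrative sketch of the first two steps, not Juno's actual API:

```java
import java.util.Arrays;

// Illustrative sampler chain: temperature -> top-k. Top-p and repetition
// penalty would be further steps of the same shape. Names are mine.
public class SamplerSketch {
    /** Divide logits by temperature (lower = sharper distribution). */
    public static float[] temperature(float[] logits, float t) {
        float[] out = new float[logits.length];
        for (int i = 0; i < logits.length; i++) out[i] = logits[i] / t;
        return out;
    }

    /** Keep the k largest logits, mask the rest to -infinity. */
    public static float[] topK(float[] logits, int k) {
        float[] sorted = logits.clone();
        Arrays.sort(sorted);
        float cutoff = sorted[sorted.length - k];
        float[] out = new float[logits.length];
        for (int i = 0; i < logits.length; i++)
            out[i] = logits[i] >= cutoff ? logits[i] : Float.NEGATIVE_INFINITY;
        return out;
    }

    public static void main(String[] args) {
        float[] logits = {2f, 1f, 0.5f, -1f};
        float[] filtered = topK(temperature(logits, 0.5f), 2);
        System.out.println(Arrays.toString(filtered)); // [4.0, 2.0, -Infinity, -Infinity]
    }
}
```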

Supported Models

Any GGUF file with a LLaMA-compatible architecture. Chat templates: llama3, mistral, gemma, tinyllama/zephyr, chatml (default). Template is derived automatically from the GGUF filename.
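
A sketch of what filename-based template detection can look like; the matching rules below are illustrative guesses, not Juno's actual logic:

```java
// Hypothetical filename -> chat-template mapping, using the template
// names listed above. Juno's real matching rules may differ.
public class TemplateGuess {
    public static String fromFilename(String name) {
        String n = name.toLowerCase();
        if (n.contains("tinyllama") || n.contains("zephyr")) return "tinyllama/zephyr";
        if (n.contains("llama-3") || n.contains("llama3"))   return "llama3";
        if (n.contains("mistral"))                           return "mistral";
        if (n.contains("gemma"))                             return "gemma";
        return "chatml"; // documented default
    }

    public static void main(String[] args) {
        System.out.println(fromFilename("tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"));
    }
}
```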

Model                           File size   RAM
TinyLlama-1.1B-Chat Q4_K_M      637 MB      ~2 GB
Mistral-7B-Instruct Q4_K_M      4.1 GB      ~6 GB
Llama-3.2-8B-Instruct Q4_K_M    4.9 GB      ~8 GB
Llama-3.1-70B-Instruct Q4_K_M   40 GB       16 × 4 GB nodes

Quantization types: F32, F16, BF16, Q8_0, Q4_0, Q4_K, Q6_K.
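
Of these, Q8_0 has the simplest layout and makes a good mental model for the rest: per the GGUF convention, each 34-byte block is a float16 scale followed by 32 signed int8 weights, and each value is scale × q. A hedged sketch (method names are mine, not Juno's GgufReader API):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of GGUF Q8_0 block dequantization: 2-byte float16 scale,
// then 32 signed int8 quants; value = scale * q.
public class Q8_0 {
    public static final int BLOCK = 32;

    public static float[] dequantizeBlock(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);             // GGUF is little-endian
        float d = Float.float16ToFloat(buf.getShort()); // JDK 20+ fp16 helper
        float[] out = new float[BLOCK];
        for (int i = 0; i < BLOCK; i++) out[i] = d * buf.get();
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer b = ByteBuffer.allocate(34).order(ByteOrder.LITTLE_ENDIAN);
        b.putShort(Float.floatToFloat16(0.5f));         // scale = 0.5
        for (int i = 0; i < BLOCK; i++) b.put((byte) i);
        b.flip();
        float[] w = dequantizeBlock(b);
        System.out.println(w[0] + " " + w[31]);         // 0.0 15.5
    }
}
```

The K-quants (Q4_K, Q6_K) add super-blocks and per-sub-block scales on top of this same scale-times-quant idea.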

CLI Reference

Commands

Command   Description
console   Single-JVM in-process REPL, no forking. Fastest startup, everyday use.
cluster   3-node cluster, forked JVMs, real gRPC. GPU deployments and pipeline-parallel scenarios.
live      6 automated real-model checks. Exits 0 on pass, 1 on any failure.

Flags

Flag                           Default   Description
--model-path PATH                        Path to GGUF file (required for real inference)
--dtype FLOAT32|FLOAT16|INT8   FLOAT16   Activation wire format between nodes
--max-tokens N                 200       Max tokens per response
--temperature F                0.6       Sampling temperature
--heap SIZE                    4g        JVM heap. Use 8g+ for 7B models
--nodes N                      3         Number of shards (console mode)
--verbose / -v                           Show gRPC and node logs

All flags are also available as environment variables: MODEL_PATH, DTYPE, MAX_TOKENS, TEMPERATURE, HEAP, NODES, JAVA_HOME.
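
Flag-over-env-var resolution of this kind usually reduces to a small precedence chain. A hypothetical sketch (the resolve helper and its precedence order are assumptions, not Juno's code):

```java
// Illustrative config resolution: explicit flag wins, then the
// environment variable, then the documented default.
public class ConfigResolve {
    public static String resolve(String flagValue, String envVar, String dflt) {
        if (flagValue != null) return flagValue;
        String env = System.getenv(envVar);
        return env != null ? env : dflt;
    }

    public static void main(String[] args) {
        // With no --heap flag and no HEAP env var set, the default applies.
        System.out.println(resolve(null, "HEAP", "4g"));
    }
}
```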

Examples

# 7B model with larger heap and verbose output
./run.sh cluster --model-path /models/mistral-7b.gguf --heap 8g -v

# INT8 activation compression (max bandwidth saving, ~1% error)
./run.sh cluster --model-path /models/llama3.gguf --dtype INT8

# Controlled generation
./run.sh console --model-path /models/llama3.gguf --temperature 0.1 --max-tokens 512

# Environment variable style
MODEL_PATH=/models/llama3.gguf DTYPE=FLOAT16 HEAP=8g ./run.sh cluster

Unit & Integration Tests

# Build (produces shade jars)
mvn clean package -DskipTests

# Unit tests — no model file, no GPU needed
mvn test -pl tokenizer,node,coordinator,sampler,kvcache,health,registry,player

# Integration tests — forks 3 JVM nodes in stub mode (~30s)
mvn verify -pl integration

# Real-model smoke test
./run.sh live --model-path /path/to/model.gguf

355 @Test methods across all modules. Notable anchors:

  • Golden-value regression for Q6_K dequantization — bit-exact output checked against known values.
  • Timing regression in LoadShardsParallelTest — prevents accidental re-serialization of parallel load.
  • Anti-regression for EOS piece filtering in GenerationLoopEosPieceTest.
  • Fault tolerance in FaultTolerantPipelineTest, HealthReactorTest, RetryPolicyTest.

GPU Tests

GPU tests are excluded from default CI. Activate with -Pgpu or -Dgroups=gpu. Requires CUDA 12.x and an Nvidia GPU.

# Unit tests — node module only, no model file needed
mvn test -Dgroups=gpu -pl node --enable-native-access=ALL-UNNAMED

# Integration test — requires CUDA + GGUF model
mvn verify -Pgpu -Dit.model.path=/path/to/model.gguf -pl integration \
  --enable-native-access=ALL-UNNAMED

# Via env var
MODEL_PATH=/path/to/model.gguf mvn verify -Pgpu -pl integration \
  --enable-native-access=ALL-UNNAMED

GpuForwardPassIT is excluded from default failsafe scan to prevent JCuda native libs loading into the coordinator JVM and poisoning FD inheritance into forked node processes. The -Pgpu profile sets -Djuno.gpu.test=true, which is the guard checked in @BeforeAll.
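
The guard itself reduces to a system-property check. A minimal sketch of the pattern as a plain method (in Juno this sits in a JUnit @BeforeAll via Assumptions; names here are illustrative):

```java
// Sketch of the -Djuno.gpu.test=true guard described above: GPU tests
// skip themselves unless the build explicitly opted in via -Pgpu.
public class GpuGuard {
    /** True only when the build set the opt-in system property. */
    public static boolean gpuTestsEnabled() {
        // Boolean.getBoolean reads a JVM system property
        // (-Djuno.gpu.test=true), not an environment variable.
        return Boolean.getBoolean("juno.gpu.test");
    }

    public static void main(String[] args) {
        if (!gpuTestsEnabled()) {
            // A JUnit @BeforeAll would call Assumptions.assumeTrue here,
            // marking the whole class as skipped rather than failed.
            System.out.println("GPU tests disabled — run with -Pgpu");
            return;
        }
        System.out.println("GPU tests enabled");
    }
}
```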

AWS Setup for GPU Testing

Recommended: g4dn.xlarge (T4, 16 GB VRAM, ~$0.50/hr on-demand).

# 1. Install CUDA 12.x
sudo apt update && sudo apt install -y nvidia-cuda-toolkit

# 2. Verify GPU
nvidia-smi

# 3. Install JDK 25 and Maven
sudo apt install -y openjdk-25-jdk maven

# 4. Clone and build
git clone https://github.com/ml-cab/juno
cd juno && mvn clean package -DskipTests

# 5. Download TinyLlama (637 MB — smallest supported)
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/\
resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

# 6. GPU unit tests (validate CUDA wiring, no model needed)
mvn test -Dgroups=gpu -pl node --enable-native-access=ALL-UNNAMED

# 7. GPU integration test
mvn verify -Pgpu \
  -Dit.model.path=$(pwd)/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -pl integration --enable-native-access=ALL-UNNAMED

Key Design Decisions

  • No Python, no subprocess. JVM reads GGUF binary directly and runs the transformer end to end.
  • No Spring Boot. Javalin for REST. gRPC for the data plane. Plain Java for everything else.
  • Pipeline parallelism over tensor parallelism. LAN-friendly — no InfiniBand required.
  • Two ActivationDtype enums by design. Protobuf-generated for the wire; domain enum for application code. The bridge lives in one file.
  • GGUF tokenizer from model metadata. No separate tokenizer.model file needed.
  • Stub mode. The cluster boots in seconds without a model file. All integration tests run stub.
  • GPU tests excluded from default CI via failsafe <excludes> and a -Pgpu profile.
  • GpuMatVec interface decouples the matmul backend from transformer logic. Swap backends without touching GpuForwardPassHandler.
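
The dtype bridge above can be pictured like this; both enums and the mapping are illustrative stand-ins, not Juno's actual types:

```java
// Sketch of the "two enums, one bridge" decision: a wire-level enum
// (stand-in for the protobuf-generated type) and a domain enum, mapped
// in exactly one place. All names are illustrative.
public class DtypeBridge {
    /** Stand-in for the protobuf-generated wire enum. */
    public enum WireDtype { FLOAT32, FLOAT16, INT8 }

    /** Domain enum used by application code. */
    public enum ActivationDtype {
        FLOAT32, FLOAT16, INT8;

        /** The single place where wire and domain types meet. */
        public static ActivationDtype fromWire(WireDtype w) {
            return ActivationDtype.valueOf(w.name());
        }

        public WireDtype toWire() {
            return WireDtype.valueOf(name());
        }
    }

    public static void main(String[] args) {
        System.out.println(ActivationDtype.fromWire(WireDtype.INT8)); // INT8
    }
}
```

Keeping the protobuf type out of application code means a wire-format change touches only this one file.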

Performance

Session   Change                                      ms / 10 tokens
5         Baseline — FLOAT32, serial matVec           ~34,891 ms
6         Parallel matVec + FLOAT16 default           ~3,802 ms (9×)
9         Session KV cache — turn latency now flat    ~7,000–8,000 ms / turn
10        GpuForwardPassHandler (cublasSgemv)         AWS benchmark pending

Session 9 turn latency grows with new tokens per turn only — not with total history length. GPU numbers will be filled in after the first g4dn.xlarge run.

Comparison

Feature-by-feature comparison with llama.cpp, vLLM, and Ollama covers: pure JVM, cluster-native operation, GGUF quantized model support, GPU acceleration, Java ecosystem fit, session KV reuse, and continuous batching (on Juno's roadmap).

Requirements

  • JDK 25+ — virtual threads, --enable-native-access=ALL-UNNAMED for GPU path
  • Maven 3.9+
  • CUDA 12.x — GPU nodes only. Not required for CPU mode, unit tests, or integration tests.