ML Cabinet · Open Source

JUno

Java Unified Neural Orchestration

Distributed LLM inference and fine-tuning. No Python, no GIL, no Spring.

Run open-source GGUF models locally, in a cluster, or embedded as a JVM library - with an OpenAI-compatible REST API out of the box.

JDK 25+ · CUDA 12.x · ROCm 6+ · Maven 3.9+ · Apache 2.0 · gRPC · GGUF · OpenAI API
0 Python processes
52 TPS on g4dn.2xlarge
475+ Unit tests
4+ Supported families

Why Juno?

  • Your organization accepts only on-prem or air-gapped deployments, has strict disclosure policies, or runs clusters of heterogeneous hardware.
  • Your organization has mature Java teams and infrastructure, or enforces strict security policies against Python processes running in production.
  • You are building Edge or IoT clusters with small heterogeneous machines and need LLM inference without a Python runtime dependency.
  • You are building Java-native products - especially in fintech, enterprise tooling, or any domain where the JVM is the standard deployment target.

Quick Start

All invocation modes share the same GGUF model file and the same CLI launcher. Full flag reference and examples: docs/howto.md · README.md.

Download a model first:

# example - TinyLlama 1.1B, ~637 MB
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/\
resolve/main/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf -P models/

Console modes

local: offline, stand-alone mode in a single JVM, with no cluster and no gRPC wire. The fastest way to try a model locally or run LoRA training.

./juno local --model-path models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf

cluster: offline, stand-alone mode that runs a JVM cluster on localhost over a real gRPC wire. The fastest way to test a model's distributed behavior locally.

./juno cluster --model-path models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf

lora: offline, stand-alone mode in a single JVM, with no cluster and no gRPC wire. The fastest way to train a model locally using the interactive LoRA console.

./juno lora --model-path models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf

Stand-alone HTTP API server

Adding --api-port N to any local or cluster invocation starts an OpenAI-compatible HTTP server that exposes POST /v1/chat/completions and GET /v1/models. Any client that speaks the OpenAI wire format (LangChain, LlamaIndex, the OpenAI SDK) works with only a base-URL change.

./juno local --model-path model.gguf --api-port 8080
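
Because the server speaks the OpenAI wire format, a plain JDK HTTP client is enough to talk to it. The following is a minimal sketch, assuming the server from the command above is listening on localhost:8080; the class name, the helper names, and the model id passed in are illustrative assumptions (valid model ids should come from GET /v1/models).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of Juno: builds and sends an OpenAI-style chat request.
class ChatCompletionSketch {

    // Minimal JSON string escaping: backslash, double quote, newline.
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }

    // Request body with a single user message, in the OpenAI chat-completions shape.
    static String chatBody(String model, String userMessage) {
        return "{\"model\":" + jsonString(model)
             + ",\"messages\":[{\"role\":\"user\",\"content\":" + jsonString(userMessage) + "}]}";
    }

    // POST <baseUrl>/v1/chat/completions with a JSON body.
    static HttpRequest chatRequest(String baseUrl, String body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // Sends the request and returns the raw JSON response body.
    static String send(String baseUrl, String model, String userMessage) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(chatRequest(baseUrl, chatBody(model, userMessage)),
                      HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With the server running, `ChatCompletionSketch.send("http://localhost:8080", "tinyllama", "Hello")` would return the raw JSON response; "tinyllama" here is a placeholder model id.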

Programmatic integration

Import one POM so every cab.ml module shares the same version:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>cab.ml</groupId>
      <artifactId>juno-bom</artifactId>
      <version>0.1.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependencies>
  <dependency>
    <groupId>cab.ml</groupId>
    <artifactId>juno-player</artifactId>
  </dependency>
</dependencies>

Then play or train your favorite model right from your code.

Distributed deployment

Deploy juno-master and juno-node within your network for distributed inference. Choose the parallelism type, byte order, and other options to suit your hardware and software. (See full examples in scripts/aws/juno-deploy.sh.)

# coordinator: one active instance per cluster
#   --enable-preview, --enable-native-access   JDK 25 required
#   --add-opens java.lang / java.nio           reflection and NIO access
#   -XX:+UseG1GC -XX:+AlwaysPreTouch -Xmx4g    GC + heap
#   JFR_OPT                                    optional: --jfr 60s
#   -DJUNO_HEALTH, -DJUNO_HEALTH_PORT          health dashboard
#   --pType                                    pipeline | tensor
#   --dtype                                    FP16 | FP32 | INT8
/usr/bin/java \
  --enable-preview --enable-native-access=ALL-UNNAMED \
  --add-opens java.base/java.lang=ALL-UNNAMED \
  --add-opens java.base/java.nio=ALL-UNNAMED \
  -XX:+UseG1GC -XX:+AlwaysPreTouch -Xmx4g \
  ${JFR_OPT:+$JFR_OPT} \
  -DJUNO_HEALTH=true -DJUNO_HEALTH_PORT=8081 \
  -jar juno-master.jar \
  --model-path /models/model.gguf \
  --pType pipeline \
  --dtype FP16

# worker node: repeat on every inference host
#   -Xmx12g                 larger heap for shards
#   JFR_OPT                 optional JFR profiling
#   -DJUNO_USE_GPU          false → CPU quantized
#   -Djuno.byteOrder        BE (default) | LE
#   -Dnode.id               unique per node
#   -Dnode.port             gRPC listen port
#   -Djuno.lora.play.path   optional LoRA adapter
#   -Djuno.health.url       optional health report-back
/usr/bin/java \
  --enable-preview --enable-native-access=ALL-UNNAMED \
  --add-opens java.base/java.lang=ALL-UNNAMED \
  --add-opens java.base/java.nio=ALL-UNNAMED \
  -XX:+UseG1GC -XX:+AlwaysPreTouch -Xmx12g \
  ${JFR_OPT:+$JFR_OPT} \
  -DJUNO_USE_GPU=true \
  -Djuno.byteOrder=BE \
  -Dnode.id=1 \
  -Dnode.port=50051 \
  -Dmodel.path=/models/model.gguf \
  -Djuno.lora.play.path=/adapters/model.lora \
  -Djuno.health.url=http://master:8081/health \
  -jar juno-node.jar cab.ml.juno.node.NodeMain

AWS Cluster Deployment

scripts/aws/juno-deploy.sh is the unified cluster lifecycle script. Hardware is auto-detected during bootstrap: GPU instances set JUNO_USE_GPU=true (CUDA is pre-installed in the golden AMI). The OpenAI-compatible API is available on port 8080 after setup. Full options and LoRA deploy flow: docs/howto.md - AWS.

Juno AWS cluster chat - multi-turn conversation on a 3-node deployment

One-time setup

cd scripts/aws
nano launcher.sh  # set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION

Commands

| Command | Description |
|---|---|
| setup [options] | Provision, bootstrap, start coordinator, expose OpenAI API on port 8080 (~5 min) |
| start | Start stopped instances |
| stop | Stop instances - EBS and key pair retained |
| teardown | Terminate everything - no lingering costs |
| status | Show current instance states from AWS API |
| scan-regions | Find cheapest AZ for the selected instance type |

# GPU cluster (3 x g4dn.xlarge, T4 16 GB VRAM)
./launcher.sh juno-deploy.sh setup

# CPU cluster (3 x m7i-flex.large)
./launcher.sh juno-deploy.sh setup \
  --instance-type m7i-flex.large \
  --model-url https://huggingface.co/.../model.gguf

# Deploy with LoRA adapter
./launcher.sh juno-deploy.sh setup \
  --lora-play /absolute/path/to/model.lora

# Teardown
./launcher.sh juno-deploy.sh teardown

Architecture

Two distribution strategies, selected with --pType. Pipeline (default): each node holds a contiguous block of layers and activations flow serially from node to node; LAN-friendly, no InfiniBand required. Tensor: every node holds all layers but only a horizontal slice of each weight matrix; the coordinator broadcasts tokens to all nodes in parallel and sums the partial logit vectors (AllReduce).

[Client] REST (Javalin) / gRPC streaming
  |
[Coordinator]
  |-- RequestScheduler (virtual threads, CompletableFuture)
  |-- GenerationLoop (prefill + decode + session KV reuse)
  |-- GgufTokenizer (SentencePiece BPE + GPT-2 BPE auto-detected)
  |-- ChatTemplateFormatter (llama3 / mistral / phi3 / tinyllama / gemma / chatml)
  |-- Sampler (temperature / top-k / top-p / rep. penalty)
  |-- KVCacheManager (GPU tier + CPU tier + PrefixCache trie)
  +-- OpenAiChatHandler (POST /v1/chat/completions, GET /v1/models)
  |
  | gRPC activations - FLOAT16 / INT8 / FLOAT32 (BE or LE wire order)
  |
ForwardPassHandlerLoader routes by GGUF general.architecture field
    phi3             → Phi3TransformerHandler (supported)
    qwen3 / qwen3moe → Qwen3* handlers (under development)
    *                → LlamaTransformerHandler (llama, mistral, tinyllama supported; gemma, qwen2 under development)
  |
-- PIPELINE mode (--pType pipeline, default) ------
+----------------------+----------------------+----------------------+
| Node 1               | Node 2               | Node N               |
| Embed + L 0-10       | L 11-21              | L 22-31 + Output     |
| NodeKVCacheAdapter - serial gRPC hop chain                         |
+----------------------+----------------------+----------------------+
-- TENSOR mode (--pType tensor) --------------------
+----------------------+----------------------+----------------------+
| Node 1               | Node 2               | Node N               |
| All layers, heads    | All layers, heads    | All layers, heads    |
| [0, headEnd1)        | [headEnd1, end2)     | [headEnd2, numHeads) |
| parallel broadcast + coordinator AllReduce sum                     |
+----------------------+----------------------+----------------------+
  |
  | CpuMatVec       | CudaMatVec | RocmMatVec
  | (quantized CPU) | GpuMatVec sealed: CudaMatVec, RocmMatVec
  |
LoraTrainableHandler (read-only overlay when --lora-play; LLaMA-family only)
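
Both modes in the diagram divide an index range across nodes: layers in pipeline mode, attention heads in tensor mode. The sketch below shows one even-split policy that reproduces the layer ranges shown for 32 layers on 3 nodes. It is an assumption for illustration only; the class and method names are hypothetical, and the actual ShardPlanner and TensorShardPlanner may use a different policy.

```java
// Hypothetical illustration of splitting `total` items (layers or heads)
// into contiguous half-open ranges [start, end), one per node.
class ShardRangeSketch {

    // Returns an array of {start, end} pairs; earlier nodes take the remainder first.
    static int[][] split(int total, int nodes) {
        if (nodes <= 0 || total < nodes) {
            throw new IllegalArgumentException("need 1 <= nodes <= total");
        }
        int base = total / nodes;   // minimum items per node
        int extra = total % nodes;  // first `extra` nodes get one more item
        int[][] ranges = new int[nodes][2];
        int start = 0;
        for (int i = 0; i < nodes; i++) {
            int size = base + (i < extra ? 1 : 0);
            ranges[i][0] = start;
            ranges[i][1] = start + size;
            start += size;
        }
        return ranges;
    }
}
```

For example, `split(32, 3)` yields [0, 11), [11, 22), [22, 32), which corresponds to L 0-10, L 11-21, and L 22-31 in the pipeline diagram.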

On-prem and cluster topology

Run juno-master as the coordinator and juno-node on each worker with gRPC between them. Any node count is supported — use pipeline parallelism to scale total VRAM, tensor parallelism to scale throughput. The same roles run locally (forked JVMs), on bare metal, or on cloud instances.

Sequence: generation loop

On startup, ClusterHarness forks the node processes. Each node calls ForwardPassHandlerLoader.load(), which routes by the general.architecture field in the GGUF to LlamaTransformerHandler or Phi3TransformerHandler (both supported), or to the Qwen3 handlers (under development). Gemma uses the Llama handler but is under development. EmbeddedNodeServer wires a NodeKVCacheAdapter after loadShard(). On inference, the coordinator tokenizes the prompt and runs prefill and decode. Session KV blocks are reused across turns, so only new tokens are prefilled on turn N+1.
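
The session KV reuse described above can be pictured as simple bookkeeping: remember how many tokens of a session are already cached, and prefill only the tail. This is a rough sketch under stated assumptions (the whole conversation is re-tokenized each turn and the cached prefix is unchanged); the class and method names are hypothetical and do not reflect Juno's KVCacheManager API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical bookkeeping for per-session prefix reuse.
class SessionPrefillSketch {

    // sessionId -> number of tokens whose KV entries are assumed to be cached
    private final Map<String, Integer> cachedTokens = new HashMap<>();

    // Returns only the tokens that still need a prefill pass for this session.
    int[] tokensToPrefill(String sessionId, int[] tokens) {
        int cached = cachedTokens.getOrDefault(sessionId, 0);
        if (cached > tokens.length) {
            cached = 0; // history shrank, so the cached prefix is no longer valid
        }
        int[] fresh = Arrays.copyOfRange(tokens, cached, tokens.length);
        cachedTokens.put(sessionId, tokens.length);
        return fresh;
    }
}
```

On the first turn every token is returned; on later turns only the tokens appended since the previous call are returned, which is why turn latency tracks new tokens rather than history length.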

Modules

Multi-module Maven project. All cab.ml artifacts at version 0.1.0 are available on Maven Central. Import juno-bom to align versions. Details: docs/arch.md.

juno-bom

Maven BOM - aligned dependency versions for all cab.ml artifacts.

api

OpenAPI 3.0 spec, protobuf/gRPC contract (inference.proto), JAX-RS interfaces.

registry

Shard planning, model registry, parallelism strategy (ShardPlanner, TensorShardPlanner).

coordinator

Generation loop, request scheduler, OpenAI-compatible REST + SSE, fault-tolerant pipeline.

node

Transformer forward-pass handlers (Llama, Phi-3 supported; Gemma and Qwen 2/3/3.5 under development), GGUF reader, MatVec backends (CPU / CUDA / ROCm via GpuMatVec), LoRA overlay.

lora

Adapter tensors, Adam optimizer, .lora checkpoint format, merge-to-GGUF writer.

tokenizer

GGUF BPE tokenizer (SentencePiece + GPT-2 paths auto-detected), chat template formatter.

sampler · kvcache · health · metrics

Shared infrastructure - sampling pipeline, KV cache (GPU + CPU tiers), circuit breaker, JFR metrics extractor.

juno-player

CLI REPL and cluster harness. Exposes JunoPlayer facade, LoraTrainer, JunoHttpClient for JVM integration.

juno-node · juno-master

Shaded fat jars for standalone remote deployment - node process and coordinator process respectively.

Supported Models

Any GGUF file with a supported architecture. Chat template is auto-detected from the model filename. Phi-3 is supported; Gemma, Qwen 2, Qwen3, and Qwen3.5 are under development.

Architectures

| Architecture | Handler | Template auto-detect | Status |
|---|---|---|---|
| LLaMA / LLaMA 2 | LlamaTransformerHandler | llama* → chatml; tinyllama* / zephyr* → tinyllama | Supported |
| Meta-Llama 3.x | LlamaTransformerHandler | llama3* → llama3 (GPT-2 BPE auto-detected) | Supported |
| Mistral | LlamaTransformerHandler | mistral* → mistral | Supported |
| Gemma | LlamaTransformerHandler | gemma* → gemma | Under development |
| Phi-3 / Phi-3.5 | Phi3TransformerHandler | phi3* / phi-3* → phi3 | Supported |
| Qwen 2 / 2.5 | LlamaTransformerHandler (+ QKV bias) | qwen2* / qwen* → chatml | Under development |
| Qwen3 dense | Qwen3TransformerHandler | qwen3* → chatml | Under development |
| Qwen3-MoE | Qwen3MoeTransformerHandler | qwen3*moe* → chatml | Under development |
| Qwen3.5 | none yet | qwen3.5* → chatml (partial) | Under development |
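
The filename patterns in the table imply an ordering: a more specific prefix such as llama3* has to be checked before the general llama*. A minimal sketch of that idea follows, assuming the match is a case-insensitive prefix test on the file name; the class name, the method name, and the empty result for unknown names are assumptions, and the real ChatTemplateFormatter may match differently.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical prefix-based template detection, mirroring the table's patterns.
class TemplateDetectSketch {

    // Insertion order matters: specific prefixes come before general ones.
    private static final Map<String, String> PREFIX_TO_TEMPLATE = new LinkedHashMap<>();
    static {
        PREFIX_TO_TEMPLATE.put("tinyllama", "tinyllama");
        PREFIX_TO_TEMPLATE.put("zephyr", "tinyllama");
        PREFIX_TO_TEMPLATE.put("llama3", "llama3");
        PREFIX_TO_TEMPLATE.put("llama", "chatml");
        PREFIX_TO_TEMPLATE.put("mistral", "mistral");
        PREFIX_TO_TEMPLATE.put("gemma", "gemma");
        PREFIX_TO_TEMPLATE.put("phi3", "phi3");
        PREFIX_TO_TEMPLATE.put("phi-3", "phi3");
        PREFIX_TO_TEMPLATE.put("qwen", "chatml"); // covers qwen2*, qwen3*, qwen3.5*
    }

    // Returns the template name for the first matching prefix, or empty if none match.
    static Optional<String> detect(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : PREFIX_TO_TEMPLATE.entrySet()) {
            if (name.startsWith(e.getKey())) {
                return Optional.of(e.getValue());
            }
        }
        return Optional.empty();
    }
}
```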

Quantization types

F32 · F16 · BF16 · Q8_0 · Q4_0 · Q2_K · Q3_K · Q4_K · Q5_K · Q6_K

On CPU, all quantization types stay compressed in memory; dequantization runs block-by-block inside the matmul loop. On GPU, handlers dequantize once at load time and keep weights as FP16 on device (DeviceHalfMatrix). Phi-3 uses the same FP16 resident path as LLaMA-family models. Gemma and Qwen handlers are under development: no LoRA overlay on those families, and no thinking-mode template on Qwen.
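
To make "dequantization inside the matmul loop" concrete, here is a minimal sketch using a simplified Q8_0-style layout: blocks of 32 signed 8-bit values with one scale per block. Storing the scale as a Java float (GGUF stores it as FP16) and the class and method names are simplifying assumptions; this is not Juno's CpuMatVec.

```java
// Hypothetical block-wise matrix-vector product over Q8_0-style quantized weights.
class LazyDequantMatVec {

    static final int BLOCK = 32; // values per block in this simplified layout

    // y = W·x, where W is rows x cols and stored row-major as (scale, 32 int8) blocks.
    static float[] matVec(float[] scales, byte[] quants, int rows, int cols, float[] x) {
        if (cols % BLOCK != 0) {
            throw new IllegalArgumentException("cols must be a multiple of " + BLOCK);
        }
        int blocksPerRow = cols / BLOCK;
        float[] y = new float[rows];
        float[] tmp = new float[BLOCK]; // the only dequantized data alive at any time
        for (int r = 0; r < rows; r++) {
            float acc = 0f;
            for (int b = 0; b < blocksPerRow; b++) {
                int blockIdx = r * blocksPerRow + b;
                float d = scales[blockIdx];
                int qBase = blockIdx * BLOCK;
                for (int i = 0; i < BLOCK; i++) {
                    tmp[i] = d * quants[qBase + i]; // dequantize one block
                }
                int xBase = b * BLOCK;
                for (int i = 0; i < BLOCK; i++) {
                    acc += tmp[i] * x[xBase + i];   // consume it immediately
                }
            }
            y[r] = acc;
        }
        return y;
    }
}
```

The full-precision matrix is never materialized: memory holds the compressed weights plus one 32-element scratch buffer, which is the trade the CPU path makes in exchange for extra arithmetic per matmul.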

CLI Reference

./juno is the unified launcher at the project root. Requires JDK 25+ and pre-built jars. Full flag table and examples: docs/howto.md.

Modes

| Command | Description |
|---|---|
| ./juno local | In-process REPL - all shards in one JVM, no forking, no gRPC. Add --api-port N to start the OpenAI-compatible HTTP server alongside the REPL. |
| ./juno (default) | 3-node cluster - forked JVMs with real gRPC. Default --pType pipeline; use --pType tensor for AllReduce mode. Also supports --api-port. |
| ./juno lora | LoRA fine-tuning REPL - single in-process JVM, adapter persisted to .lora checkpoint. Use /train, /train-qa, /save inside the REPL. |
| ./juno merge | Bake a trained .lora adapter into a new standalone GGUF - no sidecar needed at inference time. |

For the full list of flags (--model-path, --dtype, --pType, --heap, --jfr, --lora-play, LoRA-specific flags, merge flags, environment variable overrides) and usage examples, see docs/howto.md.

GPU Acceleration

Two GPU backends via Panama FFI (java.lang.foreign) — no JavaCPP, no bytedeco. Backend is auto-selected at startup: CUDA preferred over ROCm over CPU. Override with -Djuno.gpu.backend=cuda|rocm|auto.

NVIDIA — CUDA 12.x / cuBLAS

CudaBindings resolves libcudart.so.12 and libcublas.so.12. All projection weights upload once at load time as IEEE FP16 (DeviceHalfMatrix); forward-pass matmuls use cublasHSSgemvStridedBatched. Per-call H2D transfer is limited to the small input/output activation vectors.

AMD — ROCm 6+ / rocBLAS

RocmBindings resolves libamdhip64.so and librocblas.so. RocmMatVec provides the same three compute paths as CudaMatVec: host FP32, device-resident FP32, and device-resident FP16 via rocblas_sgemv / rocblas_hssgemv_strided_batched. Tested on AMD Radeon RX 7900 XT (gfx1100, ROCm 7.2.x).

Vendor-neutral abstraction

GpuBindings is the vendor-neutral interface implemented by both CudaBindings and RocmBindings. GpuMatVec is a sealed interface (permits CudaMatVec, RocmMatVec) that transformer handlers depend on — not a concrete vendor class. Backend selection runs in GpuContext.selectBindings().

  • FP16 resident weights. Llama and Phi-3 handlers dequantize once on load and keep weights on device, roughly halving VRAM versus FP32 resident. Gemma and Qwen handlers follow the same GPU path but are under development. Works on both CUDA and ROCm.
  • CUDA streams / HIP streams. Per-thread non-blocking streams with async H2D/D2H copies; cuBLAS/rocBLAS calls serialized with a per-context lock.
  • Multi-device. One shared GpuContext per device index. Pin a node to a specific GPU with -Djuno.cuda.device=N.
  • OOM fallback. On cudaMalloc / hipMalloc failure, partial device buffers are closed and inference falls back to CPU quantized matmul for those projections — no crash, no restart.
  • Explicit lifecycle. releaseGpuResources() frees VRAM on shard unload, reload, or handler swap.
  • Force CPU. Pass --cpu or set JUNO_USE_GPU=false to skip GPU detection entirely.
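
The selection order described in this section (CUDA, then ROCm, then CPU, with the -Djuno.gpu.backend override) can be sketched as a small pure function. The names below are hypothetical, and falling back to CPU when a forced backend is unavailable is an assumption; the actual logic lives in GpuContext.selectBindings().

```java
import java.util.Locale;

// Hypothetical backend selection mirroring the documented order and override values.
class BackendSelectSketch {

    enum Backend { CUDA, ROCM, CPU }

    // `override` is the value of -Djuno.gpu.backend (cuda | rocm | auto), or null if unset.
    static Backend select(String override, boolean cudaAvailable, boolean rocmAvailable) {
        String mode = override == null ? "auto" : override.toLowerCase(Locale.ROOT);
        switch (mode) {
            case "cuda":
                return cudaAvailable ? Backend.CUDA : Backend.CPU; // assumed fallback
            case "rocm":
                return rocmAvailable ? Backend.ROCM : Backend.CPU; // assumed fallback
            case "auto":
                if (cudaAvailable) return Backend.CUDA;
                if (rocmAvailable) return Backend.ROCM;
                return Backend.CPU;
            default:
                throw new IllegalArgumentException("unknown backend: " + override);
        }
    }
}
```

A caller would typically pass `System.getProperty("juno.gpu.backend")` as the override and the results of its own library probes as the two availability flags.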

LoRA Fine-Tuning

Parameter-efficient fine-tuning via low-rank adapter matrices. The base GGUF is never modified. Adapters persist to a .lora checkpoint; the same handler serves training and inference. Full guide, rank selection, and checkpoint format: docs/LoRA.md.

  • ./juno lora - fine-tuning REPL. Use /train for free-text, /train-qa for Q&A pairs (auto-generates 4 phrasings with the model's chat template).
  • --lora-play PATH - apply a pre-trained adapter read-only at inference, in both local and cluster modes. In cluster mode the adapter is injected into every forked node JVM.
  • ./juno merge - write a new GGUF with the 44 LoRA-patched projection tensors stored as F32 (the LoRA delta is smaller than Q4_K noise; re-quantizing would erase training). All other tensors are copied verbatim.
  • JVM API. LoraTrainer (in juno-player) exposes trainRawText(), trainQaPair(), and save() for programmatic use.
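
The adapter math behind these commands is the update W_eff = W + (α/rank)·B·A listed under Key Design Decisions. The sketch below applies it as an overlay at inference time without forming W_eff, computing y = W·x + (α/rank)·B·(A·x). Dense float arrays stand in for the quantized base weights, and the class and method names are hypothetical rather than LoraTrainableHandler's API.

```java
// Hypothetical LoRA overlay: W is out x in, A is rank x in, B is out x rank.
class LoraForwardSketch {

    // Plain dense matrix-vector product.
    static float[] matVec(float[][] m, float[] x) {
        float[] y = new float[m.length];
        for (int r = 0; r < m.length; r++) {
            float acc = 0f;
            for (int c = 0; c < x.length; c++) {
                acc += m[r][c] * x[c];
            }
            y[r] = acc;
        }
        return y;
    }

    // y = W·x + (alpha / rank) · B·(A·x); W itself is never modified.
    static float[] forward(float[][] w, float[][] a, float[][] b, float alpha, float[] x) {
        int rank = a.length;
        float[] y = matVec(w, x);      // frozen base projection
        float[] ax = matVec(a, x);     // down-project to `rank` dimensions
        float[] bax = matVec(b, ax);   // up-project back to the output size
        float scale = alpha / rank;
        for (int i = 0; i < y.length; i++) {
            y[i] += scale * bax[i];
        }
        return y;
    }
}
```

Computing B·(A·x) costs two thin matrix-vector products instead of one full-size one, so the overlay stays cheap when the rank is small.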

Health Dashboard

The health module exposes an auto-refreshing dashboard served at GET / on the health port. Each node runs a HealthReporter that probes local metrics every 5 s. Per-node cards show VRAM, CPU load, circuit state, and a role-conditional metric: Latency P99 on the coordinator, Throughput MB/s on worker nodes.

Circuit-open nodes are highlighted in red; half-open in amber. The coordinator's port 8080 also embeds a simplified health view inline.

JFR Profiling

All hot paths are instrumented with custom Java Flight Recorder events. Pass --jfr DURATION to any command; a juno-<modelStem>-<timestamp>.jfr file is written on exit. Open in JDK Mission Control → Event Browser. Full guide and extraction examples: docs/howto.md → Metrics.

| Event | Key fields |
|---|---|
| juno.MatVec | backend (CPU / CUDA / CUDA_RESIDENT / CUDA_RESIDENT_FP16 / ROCM / ROCM_RESIDENT / ROCM_RESIDENT_FP16), rows, cols |
| juno.ForwardPass | handlerType, requestId, startPosition, layerCount |
| juno.TokenProduced | coordinator-side; used to derive aggregate TPS |
| juno.Tokenizer | tokenizerType, operation, inputLength, outputLength |
| juno.TemplateFormat | modelType, messageCount, outputLength |
| juno.LoraTrainStep | step, loss, forwardMs, backwardMs, optimizerMs |
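
For readers new to custom JFR events, this is roughly what one looks like with the standard jdk.jfr API. The event name and fields follow the juno.MatVec row above, but the class itself and the wrapped method are a hypothetical sketch, not Juno's source.

```java
import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Hypothetical custom JFR event wrapped around a plain matrix-vector product.
class JfrEventSketch {

    @Name("juno.MatVec")
    @Label("MatVec")
    @Category("Juno")
    static class MatVecEvent extends Event {
        @Label("Backend") String backend;
        @Label("Rows") int rows;
        @Label("Cols") int cols;
    }

    static float[] timedMatVec(float[][] w, float[] x) {
        MatVecEvent event = new MatVecEvent();
        event.begin(); // start timestamp
        float[] y = new float[w.length];
        for (int r = 0; r < w.length; r++) {
            float acc = 0f;
            for (int c = 0; c < x.length; c++) {
                acc += w[r][c] * x[c];
            }
            y[r] = acc;
        }
        event.backend = "CPU";
        event.rows = w.length;
        event.cols = x.length;
        event.commit(); // written only if a recording is active and the event is enabled
        return y;
    }
}
```

Running with a flight recording enabled (for example via the JVM's -XX:StartFlightRecording option) makes events of this kind appear in JDK Mission Control's Event Browser.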

Key Design Decisions

| Decision | Detail |
|---|---|
| No Python, no subprocess | The JVM reads GGUF binary directly via GgufReader and runs the full transformer forward pass end to end. No sidecar, no IPC, no GIL. |
| Dual GPU backend via Panama FFI | GpuBindings is a vendor-neutral interface implemented by CudaBindings (CUDA 12.x + cuBLAS) and RocmBindings (ROCm 6+ + rocBLAS). GpuMatVec (sealed interface) lets transformer handlers upload weights and run matmuls on any GPU without importing a vendor class. Backend selected at startup: CUDA → ROCm → CPU. |
| No Spring Boot | Javalin for REST. Virtual threads on the gRPC ServerBuilder - required to avoid OS-thread saturation under concurrent prefill sessions. |
| OpenAI wire compatibility | OpenAiChatHandler and OpenAiAdapter are new classes in the coordinator module. No existing classes were modified; the existing native /v1/inference endpoints are untouched. |
| Maven BOM on Central | cab.ml:juno-bom:0.1.0 aligns all artifact versions. Import once in dependencyManagement; downstream modules declare no version. |
| Two parallelism modes | Pipeline: vertical depth scaling, serial gRPC hops, adds total VRAM linearly. Tensor: horizontal width scaling, parallel broadcast + AllReduce, one gRPC round-trip per step. |
| Lazy dequant on CPU / eager upload on GPU | On CPU, one 256-element block is dequantized at a time inside the matmul loop. On GPU, weights are dequantized once at load and kept as FP16 on device. OOM falls back to CPU gracefully. |
| LoRA without modifying the base GGUF | LoraTrainableHandler applies W_eff = W + (α/rank)·B·A at inference. Frozen weights stay quantized. Adapters persist separately; the GGUF is never touched. juno merge bakes them permanently when needed. |
| Session KV cache | A stable sessionId key survives across REPL turns. Turn latency is proportional to new tokens only, not history length. |
| Full JFR instrumentation | Six custom event types across every hot path - observable in JDK Mission Control with no agent or bytecode manipulation. Zero overhead when recording is off. |
| AWS infrastructure fully scripted | juno-deploy.sh handles the full cluster lifecycle. GPU quota is checked before any instance launches; insufficient vCPUs fail hard. State persisted to ~/.juno-deploy-state. |
| Stub mode | EmbeddedNodeServer uses StubForwardPassHandler (zero-filled arrays) before a shard is loaded. Integration tests run in stub mode - no model file, no GPU, boots in seconds. |

Performance

Measured on tinyllama-1.1b-chat-v1.0-q4_k_m.gguf via AWS JFR cluster runs. CPU: m7i-flex.large. GPU: g4dn.2xlarge. TPS = coordinator-side TokenProduced.tps.

22.5 GPU tps · s1
52.3 GPU tps · s9
1.59 CPU peak · 3 nodes
17–37× GPU vs CPU

  • GPU dominates at all concurrency levels. 17–37× faster than CPU; 21× faster on multi-turn where prefill cost compounds harder on CPU.
  • CPU pipeline sweet spot: 3 nodes. 1.59 tps peak. Adding nodes beyond 3 degrades TPS - network sync per forward pass outweighs compute savings.
  • Tensor-parallel is a net loss on CPU at this model size. 3-node tensor: 0.44 tps versus 1.59 tps pipeline. No benefit on GPU single-node either.
  • CPU dtype: FP16 wins. FP16 (1.54) > INT8 (1.13) ≈ FP32 (1.06). No AVX-512 INT8 path - scalar fallback. LE byte order adds ~19% overhead vs BE.
  • GPU dtype: flat. FP32 on GPU is only 7% slower than FP16 - driver upcasts internally.
→ full performance matrix - all test configurations, raw TPS, analysis

Changelog

Last 5 development sessions. Full history on GitHub.

Session 33 - Model support documentation

Docs aligned on model-support policy: Phi-3 / Phi-3.5 supported via Phi3TransformerHandler; Gemma, Qwen 2, Qwen3, and Qwen3.5 under development. Updated README, RELEASE_NOTES, arch.md, howto.md, LoRA.md, and model-support roadmap.

Session 32 - AMD ROCm / rocBLAS backend

Full first-class AMD GPU support alongside NVIDIA CUDA. GpuBindings (vendor-neutral interface), GpuMatVec (sealed interface), RocmBindings, RocmMatVec, and RocmAvailability added. GpuContext refactored to backend-agnostic; auto-selects CUDA → ROCm → CPU. MatVecBackend enum replaces ad-hoc label strings. 55 new tests; all 194 existing tests pass. Tested on RX 7900 XT (ROCm 7.2.x).

Session 31 - Panama FFI for Juno math

org.bytedeco:cuda-platform removed. New CudaBindings class resolves all CUDA Runtime and cuBLAS symbols via java.lang.foreign.Linker and SymbolLookup. CudaMatVec, GpuContext, DeviceFloatMatrix, and DeviceHalfMatrix rewritten to use MethodHandle downcalls and MemorySegment/Arena for all native memory. New CudaBindingsTest covers GPU-present and CPU-only paths.

Session 30 - Maven Central publish

All modules configured for publishing to central.sonatype.org. Sources and Javadoc jars attached at the verify phase via maven-source-plugin and maven-javadoc-plugin. GPG signing moved to the install phase with --pinentry-mode loopback so all attached artifacts are signed. Version set to 0.1.0-RC. distributionManagement wired to the Central Portal publisher.

Session 29 - OpenAI-compatible REST API

POST /v1/chat/completions and GET /v1/models added. Any OpenAI-compatible client (LangChain, LlamaIndex, OpenAI SDK) works with a base-URL change only. OpenAiAdapter and OpenAiChatHandler are new classes; no existing endpoints modified. --api-port N wired into both local and cluster modes. OpenAPI 3.0 spec at api/src/main/resources/juno-api.yaml.