ML Cabinet · Open Source

Juno

Java Unified Neural Orchestration

Distributed LLM inference and fine-tuning. No Python, no GIL, no Spring.

Run open-source GGUF models locally, in a cluster, or embedded as a JVM library - with an OpenAI-compatible REST API out of the box.

JDK 25+ · CUDA 12.x · Maven 3.9+ · Apache 2.0 · gRPC · GGUF · OpenAI API

0 Python processes
52 TPS on g4dn.2xlarge
475+ unit tests
6 architecture families

Why Juno?

  • Your organization accepts only on-prem or air-gapped deployments with strict disclosure policies, or runs clusters of commodity or heterogeneous hardware.
  • Your organization has mature Java teams and infrastructure, or enforces strict security policies against Python processes running in production.
  • You are building Edge or IoT clusters with small heterogeneous machines and need LLM inference without a Python runtime dependency.
  • You are building Java-native products - especially in fintech, enterprise tooling, or any domain where the JVM is the standard deployment target.

Quick Start

You can use Juno in several ways. All modes share the same GGUF model file and the same CLI launcher - only the invocation differs. Full flag reference and examples: docs/howto.md · README.md.

Let's download a model first:

# example - TinyLlama 1.1B, ~637 MB
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/\
resolve/main/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf -P models/

Console modes

local - offline, stand-alone mode: no JVM cluster, no gRPC wire. The fastest way to try a model locally or run LoRA training.

./juno local --model-path models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf

cluster - stand-alone mode with a JVM cluster on localhost over the gRPC wire. The fastest way to test a model's distributed behavior locally.

./juno cluster --model-path models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf

lora - offline, stand-alone mode: no JVM cluster, no gRPC wire. The fastest way to train a model locally in the interactive LoRA console.

./juno lora --model-path models/TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf

Stand-alone HTTP API server

Adding --api-port N to any local or cluster Juno invocation enables an OpenAI-compatible HTTP server exposing POST /v1/chat/completions and GET /v1/models. Any client that speaks the OpenAI wire format - LangChain, LlamaIndex, the OpenAI SDK - works with only a base-URL change.

./juno local --model-path model.gguf --api-port 8080
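Once the server is up, any HTTP client works. A minimal JDK-only sketch of the request - the model name and prompt here are placeholders, but the endpoint path follows the OpenAI wire format the server exposes:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ChatClient {
    /** Builds an OpenAI-style chat completion request against a Juno base URL. */
    static HttpRequest buildRequest(String baseUrl) {
        // Placeholder body in the OpenAI wire format; "model" is whatever the server loaded.
        String body = """
            {"model":"model.gguf","messages":[{"role":"user","content":"Hello from the JVM"}]}""";
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("http://localhost:8080");
        System.out.println(req.method() + " " + req.uri());
        // Sending with java.net.http.HttpClient returns the completion JSON
        // once the server from the command above is running.
    }
}
```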

Programmatic integration

Import one POM so every cab.ml module shares the same version:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>cab.ml</groupId>
      <artifactId>juno-bom</artifactId>
      <version>0.1.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependencies>
  <dependency>
    <groupId>cab.ml</groupId>
    <artifactId>juno-player</artifactId>
  </dependency>
</dependencies>

Then play or train your favorite model right from your code.

Distributed deployment

Deploy juno-master and juno-node across your network for distributed inference. Choose the parallelism type, byte order, and other options that best match your hardware and network. (See full examples in scripts/aws/juno-deploy.sh.)

# Coordinator - one active instance per cluster.
# JDK 25 required: preview features, native access, add-opens for reflection and NIO.
# Flags: --pType pipeline|tensor, --dtype FP16|FP32|INT8.
# Set JFR_OPT for optional JFR profiling (e.g. --jfr 60s); health dashboard on port 8081.
/usr/bin/java \
  --enable-preview --enable-native-access=ALL-UNNAMED \
  --add-opens java.base/java.lang=ALL-UNNAMED \
  --add-opens java.base/java.nio=ALL-UNNAMED \
  -XX:+UseG1GC -XX:+AlwaysPreTouch -Xmx4g \
  ${JFR_OPT:+$JFR_OPT} \
  -DJUNO_HEALTH=true -DJUNO_HEALTH_PORT=8081 \
  -jar juno-master.jar \
  --model-path /models/model.gguf \
  --pType pipeline \
  --dtype FP16
# Worker node - repeat on every inference host; larger heap for shards.
# -DJUNO_USE_GPU=false falls back to CPU quantized matmul.
# -Djuno.byteOrder: BE (default) or LE. -Dnode.id must be unique per node.
# -Djuno.lora.play.path (LoRA adapter) and -Djuno.health.url (health report-back) are optional.
/usr/bin/java \
  --enable-preview --enable-native-access=ALL-UNNAMED \
  --add-opens java.base/java.lang=ALL-UNNAMED \
  --add-opens java.base/java.nio=ALL-UNNAMED \
  -XX:+UseG1GC -XX:+AlwaysPreTouch -Xmx12g \
  ${JFR_OPT:+$JFR_OPT} \
  -DJUNO_USE_GPU=true \
  -Djuno.byteOrder=BE \
  -Dnode.id=1 \
  -Dnode.port=50051 \
  -Dmodel.path=/models/model.gguf \
  -Djuno.lora.play.path=/adapters/model.lora \
  -Djuno.health.url=http://master:8081/health \
  -jar juno-node.jar cab.ml.juno.node.NodeMain
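The -Djuno.byteOrder flag controls how FP16 activations are laid out on the gRPC wire. The JDK's built-in float16 conversions (Float.floatToFloat16, available since JDK 20) are enough to sketch the packing - this is an illustration of the wire layout, not Juno's actual codec:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Fp16Wire {
    /** Pack an activation vector as IEEE FP16 in the chosen wire order. */
    static byte[] pack(float[] v, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(v.length * 2).order(order);
        for (float f : v) buf.putShort(Float.floatToFloat16(f)); // narrow to 16-bit
        return buf.array();
    }

    /** Unpack bytes back to float32, interpreting them in the same order. */
    static float[] unpack(byte[] bytes, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(order);
        float[] out = new float[bytes.length / 2];
        for (int i = 0; i < out.length; i++) out[i] = Float.float16ToFloat(buf.getShort());
        return out;
    }
}
```

Both ends of the wire must agree on the order; mixing BE and LE silently corrupts every activation, which is why the flag defaults consistently to BE.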

AWS Cluster Deployment

scripts/aws/juno-deploy.sh is the unified cluster lifecycle script. Hardware is auto-detected during bootstrap: GPU instances set JUNO_USE_GPU=true (CUDA is pre-installed in the golden AMI). The OpenAI-compatible API is available on port 8080 after setup. Full options and LoRA deploy flow: docs/howto.md - AWS.

Juno AWS cluster chat - multi-turn conversation on a 3-node deployment

One-time setup

cd scripts/aws
nano launcher.sh  # set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION

Commands

Command · Description
setup [options] · Provision, bootstrap, start coordinator, expose OpenAI API on port 8080 (~5 min)
start · Start stopped instances
stop · Stop instances - EBS and key pair retained
teardown · Terminate everything - no lingering costs
status · Show current instance states from AWS API
scan-regions · Find cheapest AZ for the selected instance type

# GPU cluster (3 x g4dn.xlarge, T4 16 GB VRAM)
./launcher.sh juno-deploy.sh setup

# CPU cluster (3 x m7i-flex.large)
./launcher.sh juno-deploy.sh setup \
  --instance-type m7i-flex.large \
  --model-url https://huggingface.co/.../model.gguf

# Deploy with LoRA adapter
./launcher.sh juno-deploy.sh setup \
  --lora-play /absolute/path/to/model.lora

# Teardown
./launcher.sh juno-deploy.sh teardown

Architecture

Two distribution strategies, selected with --pType. Pipeline (default): contiguous layer blocks, serial activation flow, LAN-friendly, no InfiniBand required. Tensor: all nodes hold all layers but a horizontal weight slice; coordinator broadcasts tokens in parallel and sums partial logit vectors (AllReduce).

[Client]
   | REST (Javalin) / gRPC streaming
[Coordinator]
 |-- RequestScheduler (virtual threads, CompletableFuture)
 |-- GenerationLoop (prefill + decode + session KV reuse)
 |-- GgufTokenizer (SentencePiece BPE + GPT-2 BPE auto-detected)
 |-- ChatTemplateFormatter (llama3 / mistral / phi3 / tinyllama / gemma / chatml)
 |-- Sampler (temperature / top-k / top-p / rep. penalty)
 |-- KVCacheManager (GPU tier + CPU tier + PrefixCache trie)
 +-- OpenAiChatHandler (POST /v1/chat/completions, GET /v1/models)
   |
   | gRPC activations - FLOAT16 / INT8 / FLOAT32 (BE or LE wire order)
   | ForwardPassHandlerLoader routes by GGUF general.architecture field
   |
-- PIPELINE mode (--pType pipeline, default) --------------------
+--------------------+--------------------+--------------------+
| Node 1             | Node 2             | Node N             |
| Embed + L 0-10     | L 11-21            | L 22-31 + Output   |
| NodeKVCacheAdapter - serial gRPC hop chain                    |
+--------------------+--------------------+--------------------+
-- TENSOR mode (--pType tensor) ---------------------------------
+--------------------+--------------------+--------------------+
| Node 1             | Node 2             | Node N             |
| All layers, heads  | All layers, heads  | All layers, heads  |
| [0, headEnd1)      | [headEnd1, end2)   | [headEnd2, numHeads) |
| parallel broadcast + coordinator AllReduce sum                |
+--------------------+--------------------+--------------------+
| CpuMatVec  | LlamaTransformerHandler                          |
| CudaMatVec | Phi3TransformerHandler                           |
| LoraTrainableHandler (read-only overlay when --lora-play)     |
+---------------------------------------------------------------+
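Both splits reduce to simple range arithmetic: pipeline partitions the layer index space into contiguous blocks, tensor partitions the attention-head index space the same way. A minimal sketch of that arithmetic (not Juno's ShardPlanner, just the partitioning the diagram shows):

```java
import java.util.Arrays;

public class ShardMath {
    /** Pipeline mode: split numLayers into contiguous [start, end) blocks, one per node. */
    static int[][] pipelineRanges(int numLayers, int nodes) {
        int[][] ranges = new int[nodes][2];
        int base = numLayers / nodes, rem = numLayers % nodes, start = 0;
        for (int n = 0; n < nodes; n++) {
            int len = base + (n < rem ? 1 : 0); // earlier nodes absorb the remainder
            ranges[n] = new int[]{start, start + len};
            start += len;
        }
        return ranges;
    }

    /** Tensor mode: every node holds all layers, but only a contiguous head slice. */
    static int[][] headRanges(int numHeads, int nodes) {
        return pipelineRanges(numHeads, nodes); // same contiguous split, over heads
    }

    public static void main(String[] args) {
        // 32 layers over 3 nodes: matches the diagram's L 0-10 | L 11-21 | L 22-31
        System.out.println(Arrays.deepToString(pipelineRanges(32, 3)));
    }
}
```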

On-prem and cluster topology

Run juno-master as the coordinator and juno-node on each worker with gRPC between them. Any node count is supported - use pipeline parallelism to scale total VRAM, tensor parallelism to scale throughput. The same roles run locally (forked JVMs), on bare metal, or on cloud instances. Commodity hardware works well; so does modern datacenter hardware.

Sequence: generation loop

On startup, ClusterHarness forks node processes. Each calls ForwardPassHandlerLoader.load(), routing to LlamaTransformerHandler or Phi3TransformerHandler based on general.architecture in the GGUF. EmbeddedNodeServer wires a NodeKVCacheAdapter after loadShard(). On inference: coordinator tokenizes, runs prefill+decode. Session KV blocks are reused across turns - only new tokens are prefilled on turn N+1.
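The prefix-reuse rule amounts to: find the longest shared token prefix between the session's cached tokens and the new prompt, and prefill only the remainder. A minimal illustration of that rule (not the KVCacheManager API):

```java
import java.util.List;

public class PrefixReuse {
    /** Length of the shared prefix between cached session tokens and the new prompt. */
    static int reusableTokens(List<Integer> cached, List<Integer> prompt) {
        int n = Math.min(cached.size(), prompt.size()), i = 0;
        while (i < n && cached.get(i).equals(prompt.get(i))) i++;
        return i; // only prompt.size() - i tokens need a fresh prefill on turn N+1
    }

    public static void main(String[] args) {
        // Turn N cached 4 tokens; turn N+1 appends 2 new ones -> only 2 are prefilled.
        int reused = reusableTokens(List.of(1, 2, 3, 4), List.of(1, 2, 3, 4, 9, 9));
        System.out.println("reused=" + reused);
    }
}
```

This is why turn latency scales with new tokens rather than history length, as long as the sessionId stays stable.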

Modules

Multi-module Maven project. All cab.ml artifacts at version 0.1.0 are available on Maven Central. Import juno-bom to align versions. Details: docs/arch.md.

juno-bom

Maven BOM - aligned dependency versions for all cab.ml artifacts.

api

OpenAPI 3.0 spec, protobuf/gRPC contract (inference.proto), JAX-RS interfaces.

registry

Shard planning, model registry, parallelism strategy (ShardPlanner, TensorShardPlanner).

coordinator

Generation loop, request scheduler, OpenAI-compatible REST + SSE, fault-tolerant pipeline.

node

Transformer forward-pass handlers (Llama, Phi-3), GGUF reader, MatVec backends (CPU / CUDA), LoRA overlay.

lora

Adapter tensors, Adam optimizer, .lora checkpoint format, merge-to-GGUF writer.

tokenizer

GGUF BPE tokenizer (SentencePiece + GPT-2 paths auto-detected), chat template formatter.

sampler · kvcache · health · metrics

Shared infrastructure - sampling pipeline, KV cache (GPU + CPU tiers), circuit breaker, JFR metrics extractor.

juno-player

CLI REPL and cluster harness. Exposes JunoPlayer facade, LoraTrainer, JunoHttpClient for JVM integration.

juno-node · juno-master

Shaded fat jars for standalone remote deployment - node process and coordinator process respectively.

Supported Models

Any GGUF file with a supported architecture. Chat template is auto-detected from the model filename.

Architectures

Architecture · Handler · Template auto-detect
LLaMA / LLaMA 2 · LlamaTransformerHandler · llama* → chatml; tinyllama* / zephyr* → tinyllama
Meta-Llama 3.x · LlamaTransformerHandler · llama3* → llama3 (GPT-2 BPE auto-detected)
Mistral · LlamaTransformerHandler · mistral* → mistral
Gemma · LlamaTransformerHandler · gemma* → gemma
Phi-3 / Phi-3.5 · Phi3TransformerHandler · phi3* / phi-3* → phi3
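The filename matching above amounts to ordered prefix checks - llama3* must win over llama*. A simplified, hypothetical sketch of the rule table; the real detection in Juno's formatter may differ:

```java
public class TemplateDetect {
    /** Hypothetical helper mirroring the auto-detect rules above; check order matters. */
    static String detect(String filename) {
        String s = filename.toLowerCase();
        if (s.startsWith("llama3")) return "llama3";        // before the generic llama* rule
        if (s.startsWith("tinyllama") || s.startsWith("zephyr")) return "tinyllama";
        if (s.startsWith("llama")) return "chatml";
        if (s.startsWith("mistral")) return "mistral";
        if (s.startsWith("gemma")) return "gemma";
        if (s.startsWith("phi3") || s.startsWith("phi-3")) return "phi3";
        return "chatml"; // assumed fallback for unrecognized names
    }
}
```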

Quantization types

F32 · F16 · BF16 · Q8_0 · Q4_0 · Q2_K · Q3_K · Q4_K · Q5_K · Q6_K

All quantization types stay compressed in memory; dequantization runs block-by-block inside the matmul loop. On GPU, Llama and Phi-3 dequantize once at load time and keep weights as FP16 on device (DeviceHalfMatrix).
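Block-by-block dequantization means the full weight matrix never exists in float form on the CPU path. A simplified Q8_0-style sketch - one float scale per 32 int8 quants, dequantized inside the dot-product loop; Juno's actual kernels differ:

```java
public class BlockDequant {
    static final int BLOCK = 32; // Q8_0-style: 32 quants share one scale

    /** Dot product that dequantizes per block, never materializing the float weights. */
    static float dot(float[] scales, byte[] quants, float[] x) {
        float acc = 0f;
        for (int b = 0; b < scales.length; b++) {
            float scale = scales[b];
            for (int i = 0; i < BLOCK; i++) {
                int idx = b * BLOCK + i;
                acc += scale * quants[idx] * x[idx]; // w[idx] = scale * q[idx], on the fly
            }
        }
        return acc;
    }
}
```

The compressed weights stay cache-resident; only one 32-element block is expanded at a time.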

CLI Reference

./juno is the unified launcher at the project root. Requires JDK 25+ and pre-built jars. Full flag table and examples: docs/howto.md.

Modes

Command · Description
./juno local · In-process REPL - all shards in one JVM, no forking, no gRPC. Add --api-port N to start the OpenAI-compatible HTTP server alongside the REPL.
./juno (default) · 3-node cluster - forked JVMs with real gRPC. Default --pType pipeline; use --pType tensor for AllReduce mode. Also supports --api-port.
./juno lora · LoRA fine-tuning REPL - single in-process JVM, adapter persisted to a .lora checkpoint. Use /train, /train-qa, /save inside the REPL.
./juno merge · Bake a trained .lora adapter into a new standalone GGUF - no sidecar needed at inference time.

For the full list of flags (--model-path, --dtype, --pType, --heap, --jfr, --lora-play, LoRA-specific flags, merge flags, environment variable overrides) and usage examples, see docs/howto.md.

GPU Acceleration

GPU inference via CUDA 12.x and cuBLAS. All projection weights are uploaded once at load time as IEEE FP16 (DeviceHalfMatrix); forward-pass matmuls use cublasHSSgemvStridedBatched. Per-call H2D transfer is limited to the small input/output activation vectors - CPU cores are idle during generation.

  • FP16 resident weights. Both Llama and Phi-3 handlers dequantize once on load and keep weights on device, roughly halving VRAM versus FP32 resident.
  • CUDA streams. Per-thread non-blocking streams with cudaMemcpyAsync; cuBLAS calls are serialized with a per-context lock.
  • Multi-device. One shared GpuContext per CUDA device. Pin a node to a specific GPU with -Djuno.cuda.device=N.
  • OOM fallback. On cudaMalloc failure, partial device buffers are closed and inference falls back to CPU quantized matmul for those projections - no crash, no restart.
  • Explicit lifecycle. releaseGpuResources() frees VRAM on shard unload, reload, or handler swap.

LoRA Fine-Tuning

Parameter-efficient fine-tuning via low-rank adapter matrices. The base GGUF is never modified. Adapters persist to a .lora checkpoint; the same handler serves training and inference. Full guide, rank selection, and checkpoint format: docs/LoRA.md.

  • ./juno lora - fine-tuning REPL. Use /train for free-text, /train-qa for Q&A pairs (auto-generates 4 phrasings with the model's chat template).
  • --lora-play PATH - apply a pre-trained adapter read-only at inference, in both local and cluster modes. In cluster mode the adapter is injected into every forked node JVM.
  • ./juno merge - write a new GGUF with the 44 LoRA-patched projection tensors stored as F32 (the LoRA delta is smaller than Q4_K noise; re-quantizing would erase training). All other tensors are copied verbatim.
  • JVM API. LoraTrainer (in juno-player) exposes trainRawText(), trainQaPair(), and save() for programmatic use.
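The read-only overlay in --lora-play mode is the standard low-rank update W_eff = W + (α/rank)·B·A, applied at inference while the frozen base weights stay quantized. A self-contained sketch of that arithmetic - illustrative only, not the LoraTrainableHandler implementation:

```java
public class LoraOverlay {
    /**
     * Effective weight: W_eff = W + (alpha / rank) * B * A.
     * A is rank x in, B is out x rank; the base matrix w is never modified.
     */
    static float[][] effective(float[][] w, float[][] b, float[][] a, float alpha) {
        int rank = a.length, out = w.length, in = w[0].length;
        float scale = alpha / rank;
        float[][] eff = new float[out][in];
        for (int i = 0; i < out; i++)
            for (int j = 0; j < in; j++) {
                float delta = 0f;
                for (int r = 0; r < rank; r++) delta += b[i][r] * a[r][j]; // (B·A)[i][j]
                eff[i][j] = w[i][j] + scale * delta;
            }
        return eff;
    }
}
```

Because rank is small, B·A is cheap relative to the full matmul, which is what makes the sidecar overlay viable at inference time.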

Health Dashboard

The health module exposes an auto-refreshing dashboard served at GET / on the health port. Each node runs a HealthReporter that probes local metrics every 5 s. Per-node cards show VRAM, CPU load, circuit state, and a role-conditional metric: Latency P99 on the coordinator, Throughput MB/s on worker nodes.

Circuit-open nodes are highlighted in red; half-open in amber. The coordinator's port 8080 also embeds a simplified health view inline.

JFR Profiling

All hot paths are instrumented with custom Java Flight Recorder events. Pass --jfr DURATION to any command; a juno-<modelStem>-<timestamp>.jfr file is written on exit. Open in JDK Mission Control → Event Browser. Full guide and extraction examples: docs/howto.md → Metrics.

Event · Key fields
juno.MatVec · backend (cpu / cuda / cuda-resident-fp16), rows, cols
juno.ForwardPass · handlerType, requestId, startPosition, layerCount
juno.TokenProduced · coordinator-side; used to derive aggregate TPS
juno.Tokenizer · tokenizerType, operation, inputLength, outputLength
juno.TemplateFormat · modelType, messageCount, outputLength
juno.LoraTrainStep · step, loss, forwardMs, backwardMs, optimizerMs
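For reference, a custom JFR event in the style of juno.MatVec needs only the standard jdk.jfr API - this is an illustrative mirror with assumed field names, not Juno's actual class:

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

@Name("juno.MatVec") // event name as it appears in JDK Mission Control's Event Browser
@Label("MatVec")
public class MatVecEvent extends Event {
    @Label("Backend") String backend; // e.g. cpu / cuda / cuda-resident-fp16
    @Label("Rows") int rows;
    @Label("Cols") int cols;

    static void record(String backend, int rows, int cols, Runnable hotPath) {
        MatVecEvent e = new MatVecEvent();
        e.backend = backend;
        e.rows = rows;
        e.cols = cols;
        e.begin();
        hotPath.run();
        e.commit(); // near-zero cost when no recording is active
    }
}
```

No agent or bytecode manipulation is involved; the event class is plain Java and the runtime discards commits when recording is off.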

Key Design Decisions

Decision · Detail
No Python, no subprocess · The JVM reads GGUF binary directly via GgufReader and runs the full transformer forward pass end to end. No sidecar, no IPC, no GIL.
No Spring Boot · Javalin for REST. Virtual threads on the gRPC ServerBuilder - required to avoid OS-thread saturation under concurrent prefill sessions.
OpenAI wire compatibility · OpenAiChatHandler and OpenAiAdapter are new classes in the coordinator module. No existing classes were modified; the existing native /v1/inference endpoints are untouched.
Maven BOM on Central · cab.ml:juno-bom:0.1.0 aligns all artifact versions. Import once in dependencyManagement; downstream modules declare no version.
Two parallelism modes · Pipeline: vertical depth scaling, serial gRPC hops, adds total VRAM linearly. Tensor: horizontal width scaling, parallel broadcast + AllReduce, one gRPC round-trip per step.
Lazy dequant on CPU / eager upload on GPU · On CPU, one 256-element block is dequantized at a time inside the matmul loop. On GPU, weights are dequantized once at load and kept as FP16 on device. OOM falls back to CPU gracefully.
LoRA without modifying the base GGUF · LoraTrainableHandler applies W_eff = W + (α/rank)·B·A at inference. Frozen weights stay quantized. Adapters persist separately; the GGUF is never touched. juno merge bakes them permanently when needed.
Session KV cache · A stable sessionId key survives across REPL turns. Turn latency is proportional to new tokens only, not history length.
Full JFR instrumentation · Six custom event types across every hot path - observable in JDK Mission Control with no agent or bytecode manipulation. Zero overhead when recording is off.
AWS infrastructure fully scripted · juno-deploy.sh handles the full cluster lifecycle. GPU quota is checked before any instance launches; insufficient vCPUs fail hard. State persisted to ~/.juno-deploy-state.
Stub mode · EmbeddedNodeServer uses StubForwardPassHandler (zero-filled arrays) before a shard is loaded. Integration tests run in stub mode - no model file, no GPU, boots in seconds.

Performance

Measured on tinyllama-1.1b-chat-v1.0-q4_k_m.gguf via AWS JFR cluster runs. CPU: m7i-flex.large. GPU: g4dn.2xlarge. TPS = coordinator-side TokenProduced.tps.

20.7 GPU tps · 1 session
37.2 GPU tps · 9 sessions
1.59 CPU peak tps · 3 nodes
17–37× GPU vs CPU
  • GPU dominates at all concurrency levels. 17–37× faster than CPU; 21× faster on multi-turn where prefill cost compounds harder on CPU.
  • CPU pipeline sweet spot: 3 nodes. 1.59 tps peak. Adding nodes beyond 3 degrades TPS - network sync per forward pass outweighs compute savings.
  • Tensor-parallel is a net loss on CPU at this model size. 3-node tensor: 0.44 tps versus 1.59 tps pipeline. No benefit on GPU single-node either.
  • CPU dtype: FP16 wins. FP16 (1.54) > INT8 (1.13) ≈ FP32 (1.06). No AVX-512 INT8 path - scalar fallback. LE byte order adds ~19% overhead vs BE.
  • GPU dtype: flat. FP32 on GPU is only 7% slower than FP16 - driver upcasts internally.
→ full performance matrix - all test configurations, raw TPS, analysis

Changelog

Last 5 development sessions. Full history on GitHub.

Session 29 - OpenAI-compatible REST API

POST /v1/chat/completions and GET /v1/models added. Any OpenAI-compatible client (LangChain, LlamaIndex, OpenAI SDK) works with a base-URL change only. OpenAiAdapter and OpenAiChatHandler are new classes; no existing endpoints modified. --api-port N wired into both local and cluster modes. OpenAPI 3.0 spec at api/src/main/resources/juno-api.yaml.

Session 28 - Health dashboard & virtual-thread gRPC

CPU load metric replaces unavailable /sys/class/thermal on EC2. Role-conditional secondary metric: coordinator shows Latency P99, nodes show Throughput MB/s. Root-cause investigation: gRPC thread pool saturation with 9 concurrent sessions fixed by switching to newVirtualThreadPerTaskExecutor() on ServerBuilder.

Session 27 - GPU lifecycle, CUDA streams, multi-device

Llama GPU path upgraded to FP16 resident weights (DeviceHalfMatrix). Per-thread non-blocking CUDA streams with cudaMemcpyAsync. One shared GpuContext per device index. releaseGpuResources() called on shard unload/reload. OOM fallback verified on allocation failure.

Session 26 - LoRA merge + inference overlay + Q&A training

juno merge writes a new GGUF with LoRA-patched tensors as F32 - no sidecar at inference. --lora-play applies a trained adapter read-only in all modes. /train-qa REPL command auto-generates 4 phrasings with the model's chat template. AWS deploy hardening: race conditions, path resolution, and double-base64 cloud-init bugs fixed.

Session 25 - Code quality & dead code removal

CyclicForwardPassHandler moved to test scope; production stub replaced by StubForwardPassHandler inner class. ConsoleMain --cpu fall-through bug fixed. One shared GpuContext per runLocalRepl invocation. Docs fully updated.