SCENARIO: long
Single inference request, max output.
Prompt: "write me a long poem about love and war, be as robust as you can"
max-tokens: CPU 50 · GPU 200
Measures: steady-state decode throughput, prefill latency, MatVec backend performance.
SCENARIO: conv
3-turn conversation with history recall.
msg1 "Hello, my name is Viktor…" (34 tok) → reply (26 tok)
msg2 "Could you recall my name?" (43 tok) → reply (20 tok)
msg3 "Thank you, have a nice day!" (30 tok) → reply (17 tok)
Measures: KV-cache reuse (or lack thereof), prefill cost growth across turns.
ANALYSIS ── key findings from completed cells
GPU
GPU (g4dn) is 17–37x faster than CPU across all single-session tests.
Long/s9: GPU 37.24 tps vs CPU 1.05 tps - GPU batches forward passes across concurrent sessions.
Conv/s9: GPU 5.89 tps vs CPU 0.28 tps - the 21x gap widens further because per-session history growth hits CPU prefill time harder.
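The speedup ratios quoted above follow directly from the aggregate throughput numbers; a quick sanity check (values copied from the results above):

```python
# Sanity-check the CPU-vs-GPU speedup ratios from the s9 runs.
results = {
    "long/s9": {"gpu_tps": 37.24, "cpu_tps": 1.05},
    "conv/s9": {"gpu_tps": 5.89, "cpu_tps": 0.28},
}

for scenario, r in results.items():
    speedup = r["gpu_tps"] / r["cpu_tps"]
    print(f"{scenario}: {speedup:.1f}x")  # long/s9: 35.5x, conv/s9: 21.0x
```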
DTYPE
CPU dtype ranking: FP16 (1.54) > INT8 (1.13) ≈ FP32 (1.06) ≈ LoRA (1.05) ≈ LE (0.97).
INT8 is slower than FP16 on CPU - no AVX-512 INT8 path is implemented, so it falls back to scalar code.
LE byte order adds ~19% overhead vs BE on CPU (MatVec total 90,286 ms vs 56,514 ms).
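One plausible mechanism for the LE penalty - this is a guess, assuming the weight files are serialized big-endian (the Java DataOutputStream default) so the LE path pays a per-value byte swap before each MatVec. A toy illustration of the byte-order mechanics with Python's struct module:

```python
import struct

value = 3.14159

# Big-endian encoding (Java DataOutputStream default)
be_bytes = struct.pack(">f", value)
# Little-endian encoding of the same float
le_bytes = struct.pack("<f", value)

# Same value, reversed byte order on the wire.
assert be_bytes == le_bytes[::-1]

# Decoding with the wrong byte order silently yields garbage, so a
# reader whose byte order differs from the file's must swap every
# 4-byte word - a per-element cost the native-order path avoids.
print(struct.unpack("<f", le_bytes)[0], struct.unpack(">f", le_bytes)[0])
```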
GPU DTYPE
GPU dtype ranking: LoRA (22.19) > LE (21.63) > FP16/embed (20.69) ≈ FP32 (20.36) > INT8 (19.87).
LoRA inference overhead is negligible on GPU - the adapter is resident in VRAM.
FP32 on GPU is only 7% slower than FP16 (driver upcasts internally, MatVec p95 identical at 0.241ms).
NODES
CPU pipeline node sweep: peak at 3 nodes (1.59 tps). Adding nodes beyond 3 degrades end-to-end TPS because the per-forward-pass network synchronization cost grows with node count.
7 nodes (embedded) dropped to 1.12 tps - below single-node baseline.
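The shape of the sweep is consistent with a simple cost model: per-token latency = compute/n + sync·(n-1). The coefficients below are invented to illustrate the peak-then-degrade shape, not fitted to the measured data:

```python
# Toy model of pipeline-parallel throughput vs node count.
# compute_ms: single-node per-token compute; sync_ms: per-hop
# network cost per forward pass. Coefficients are illustrative only.
compute_ms = 900.0
sync_ms = 140.0

def tps(nodes: int) -> float:
    per_token_ms = compute_ms / nodes + sync_ms * (nodes - 1)
    return 1000.0 / per_token_ms

for n in range(1, 8):
    print(n, round(tps(n), 2))
```

With these coefficients throughput peaks at 3 nodes and falls below the 1-node baseline by 7 nodes, matching the qualitative pattern in the sweep.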
TENSOR
CPU 3-node tensor parallelism: 0.44 tps - 3.5x worse than single-node pipeline.
Network sync at every layer dominates for this model size.
GPU tensor (1 node): 20.72 tps - within noise of pipeline (19.67). No benefit, non-zero overhead.
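The 3.5x gap tracks the difference in synchronization frequency: pipeline parallelism crosses the network once per stage boundary per token, while tensor parallelism needs a collective at every layer. A rough count (the layer count is an assumption for illustration; the model under test is not specified in these notes):

```python
# Rough count of network synchronizations per generated token.
n_layers = 24  # ASSUMPTION for illustration
n_nodes = 3

# Pipeline: one activation hand-off at each stage boundary.
pipeline_syncs = n_nodes - 1

# Tensor: at least one all-reduce per transformer layer
# (commonly two: one after attention, one after the MLP).
tensor_syncs = n_layers * 2

print(pipeline_syncs, tensor_syncs)  # prints: 2 48
```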
CONV
CPU conv/s1: 0.12 tps - 10x worse than long/s1. Full KV re-prefill each turn.
GPU conv/s1: 1.31 tps - node prefill p95 spiked to 1,530ms vs 142ms for long.
Multi-turn without KV cache is O(n²) in token cost. kvcache branch not yet merged.
CPU conv/s9: 0.28 tps aggregate (~0.031 tps/session) - 4x worse per session than s1 due to compounding prefill cost across 9 growing contexts.
GPU conv/s9: 5.89 tps aggregate (~0.65 tps/session) - 2x worse per session than s1. Decode p95 rose to 230ms (vs 54ms long/s1); GPU batches concurrent sessions but prefill cost scales with history length.
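The quadratic cost is visible directly from the conv scenario's token counts: without a KV cache, each turn re-prefills the entire accumulated history. Using the message/reply sizes from the scenario definition above:

```python
# Prefill tokens per turn when the full history is re-processed
# each turn (no KV cache). Token counts from the conv scenario.
turns = [(34, 26), (43, 20), (30, 17)]  # (user msg, reply)

history = 0
total_prefill = 0
for i, (msg, reply) in enumerate(turns, start=1):
    prefill = history + msg       # everything before this reply
    total_prefill += prefill
    history += msg + reply
    print(f"turn {i}: prefill {prefill} tokens")
print(f"total prefill: {total_prefill} tokens")  # 34 + 103 + 153 = 290
```

With a working KV cache, only the new prompt tokens (34 + 43 + 30 = 107) would need prefill; the remaining 183 tokens are pure re-computation.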
CONCURRENCY
CPU s9: aggregate 1.05 tps across 9 sessions (~0.12 tps per session) - the aggregate merely matches the single-session rate, so concurrency adds nothing.
Node decode p95 jumped to 10,766ms (vs 1,046ms at s1) - sessions are serialized and queue wait dominates.
GPU s9: aggregate 37.24 tps - ~4.1 tps per session. Decode p95 250ms vs 54ms s1 - GPU batches sessions, 4.6x speedup per session under load.
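The per-session figures above are just the aggregates divided by the session count:

```python
sessions = 9
cpu_aggregate_tps = 1.05
gpu_aggregate_tps = 37.24

print(f"CPU per session: {cpu_aggregate_tps / sessions:.3f} tps")  # 0.117
print(f"GPU per session: {gpu_aggregate_tps / sessions:.2f} tps")  # 4.14
```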
LORA
lora-play CPU test produced 3 JFR files - coordinator, node, and a third, empty recording (the lora sidecar process).
LoraTrainStep.count = 0 in all runs: --lora-play loads the adapter weights, but no training steps are triggered in these inference-only scenarios, so the training path remains untested.