══ Juno performance report · model: tinyllama-1.1b-chat-v1.0-q4_k_m.gguf · CPU: m7i-flex.large (max 50 tok) · GPU: g4dn.2xlarge (max 200 tok)
generated 2026-05-11
status legend: done = tps measured · pending = in original plan · add = suggested · n/a = not applicable
s1 = 1 session  │  s9 = 9 concurrent sessions
SCENARIO: long
Single inference request, max output.
Prompt: "write me a long poem about love and war, be as robust as you can"
max-tokens: CPU 50  ·  GPU 200
Measures: steady-state decode throughput, prefill latency, MatVec backend performance.
SCENARIO: conv
3-turn conversation with history recall.
msg1 "Hello, my name is Viktor…" (34 tok)  →  reply (26 tok)
msg2 "Could you recall my name?" (43 tok)  →  reply (20 tok)
msg3 "Thank you, have a nice day!" (30 tok)  →  reply (17 tok)
Measures: KV-cache reuse (or lack thereof), growth of prefill cost across turns.
Results grid columns: # · hw · ptype (parallelism type) · nodes · coord (coordinator) · dtype · bord (byte order) · lora · long/s1 · long/s9 · conv/s1 · conv/s9
ANALYSIS ── key findings from completed cells
GPU
GPU (g4dn) is 17–37x faster than CPU across all single-session tests. Long/s9: GPU 37.24 tps vs CPU 1.05 tps - the GPU batches forward passes across concurrent sessions. Conv/s9: GPU 5.89 tps vs CPU 0.28 tps - a ~21x gap, up from ~11x at conv/s1, as per-session history growth hits CPU prefill time harder.
DTYPE
CPU dtype ranking: FP16 (1.54) > INT8 (1.13) > FP32 (1.06) > LoRA (1.05) > LE (0.97). INT8 is slower than FP16 on CPU: no AVX-512 INT8 path is implemented, so it falls back to scalar. LE byte order costs ~60% more MatVec time than BE on CPU (90,286 ms vs 56,514 ms total).
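For reference, a minimal sketch of what the scalar fallback implies (illustrative only, not this project's kernel; block size, layout, and names are assumptions): every INT8 weight pays a separate convert-and-multiply in a plain loop, with no SIMD to amortize it.

  // Scalar block-quantized INT8 dot product (Q8-style: 32 int8 weights + one fp32 scale per block).
  // Without a vector INT8 path, each element is one scalar multiply-accumulate plus a per-block
  // dequant, which is how INT8 can end up slower than a straight FP16/FP32 loop on CPU.
  final class ScalarQ8DotSketch {
      static final int BLOCK = 32; // assumed block size

      static float dot(byte[] weights, float[] scales, float[] x) {
          float acc = 0f;
          for (int b = 0; b < scales.length; b++) {
              float blockAcc = 0f;
              int base = b * BLOCK;
              for (int i = 0; i < BLOCK; i++) {
                  blockAcc += weights[base + i] * x[base + i]; // scalar MAC per element
              }
              acc += scales[b] * blockAcc; // dequantize once per block
          }
          return acc;
      }

      public static void main(String[] args) {
          byte[] w = new byte[64];
          float[] x = new float[64];
          java.util.Arrays.fill(w, (byte) 2);
          java.util.Arrays.fill(x, 0.5f);
          System.out.println(dot(w, new float[] {0.1f, 0.1f}, x)); // 6.4
      }
  }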
GPU DTYPE
GPU dtype ranking: LoRA (22.19) > LE (21.63) > FP16/embed (20.69) > FP32 (20.36) > INT8 (19.87). LoRA inference overhead is negligible on GPU - the adapter stays resident in VRAM. FP32 on GPU is only 7% slower than FP16 (the driver upcasts internally; MatVec p95 identical at 0.241 ms).
NODES
CPU pipeline node sweep: peak at 3 nodes (1.59 tps). Adding nodes beyond 3 degrades end-to-end TPS due to network synchronization cost per forward pass. 7 nodes (embedded) dropped to 1.12 tps - below single-node baseline.
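A toy per-token cost model makes the sweet spot visible (the constants below are illustrative assumptions, not measured values): compute shrinks as 1/N while sync grows with the number of stage boundaries, so throughput peaks at a small N and then falls.

  // perTokenMs(N) = computeMs / N + syncMs * (N - 1); tps = 1000 / perTokenMs
  final class PipelineSweepModel {
      public static void main(String[] args) {
          double computeMs = 950;  // assumed single-node decode cost per token
          double syncMs    = 140;  // assumed network handoff cost per stage boundary
          for (int n = 1; n <= 7; n++) {
              double perTokenMs = computeMs / n + syncMs * (n - 1);
              System.out.printf("nodes=%d  tps=%.2f%n", n, 1000.0 / perTokenMs);
          }
      }
  }

With these assumed constants the curve peaks near 3 nodes and drops below the single-node value by 7, the same shape as the measured sweep.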
TENSOR
CPU 3-node tensor parallelism: 0.44 tps - 3.5x worse than single-node pipeline. Network sync at every layer dominates for this model size. GPU tensor (1 node): 20.72 tps - within noise of pipeline (19.67). No benefit, non-zero overhead.
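Rough sync-count arithmetic explains the gap (the layer count and per-layer collective count are assumptions about a typical llama-style stack, not measurements): a 3-stage pipeline pays a couple of handoffs per token, while tensor parallelism pays collectives at every layer.

  // Order-of-magnitude count of network synchronizations per generated token.
  final class SyncCountSketch {
      public static void main(String[] args) {
          int layers = 22;                  // assumed layer count for a ~1.1B llama-style model
          int nodes = 3;
          int pipelineSyncs = nodes - 1;    // one activation handoff per stage boundary
          int tensorSyncs = 2 * layers;     // ~2 all-reduces (attention + MLP) per layer
          System.out.println("pipeline syncs/token: " + pipelineSyncs); // 2
          System.out.println("tensor syncs/token:   " + tensorSyncs);   // 44
      }
  }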
CONV
CPU conv/s1: 0.12 tps - 10x worse than long/s1. Full KV re-prefill each turn. GPU conv/s1: 1.31 tps - node prefill p95 spiked to 1,530ms vs 142ms for long. Multi-turn without KV cache is O(n²) in token cost. kvcache branch not yet merged. CPU conv/s9: 0.28 tps aggregate (~0.031 tps/session) - 4x worse per session than s1 due to compounding prefill cost across 9 growing contexts. GPU conv/s9: 5.89 tps aggregate (~0.65 tps/session) - 2x worse per session than s1. Decode p95 rose to 230ms (vs 54ms long/s1); GPU batches concurrent sessions but prefill cost scales with history length.
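Worked token arithmetic for the conv scenario above (prompt/reply sizes taken from the scenario description) shows the quadratic growth: without a KV cache every turn re-prefills the whole history, with one only the new user message needs prefill.

  // Prefill tokens per turn for the 3-turn conv scenario.
  final class ConvPrefillCost {
      public static void main(String[] args) {
          int[] prompts = {34, 43, 30};
          int[] replies = {26, 20, 17};
          int history = 0, noCache = 0, withCache = 0;
          for (int t = 0; t < prompts.length; t++) {
              noCache   += history + prompts[t];  // re-prefill full history + new prompt
              withCache += prompts[t];            // only the new prompt; reply KV entries already exist
              history   += prompts[t] + replies[t];
          }
          System.out.println("prefill tokens without KV cache: " + noCache);   // 34 + 103 + 153 = 290
          System.out.println("prefill tokens with KV cache:    " + withCache); // 34 + 43 + 30 = 107
      }
  }

Already at 3 turns the uncached path prefills ~2.7x more tokens, and the ratio keeps growing with turn count.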
CONCURRENCY
CPU s9: aggregate 1.05 tps across 9 sessions - ~0.12 tps per session, identical to single-session. Node decode p95 jumped to 10,766ms (vs 1,046ms s1) - sessions serialized, queue wait dominates. GPU s9: aggregate 37.24 tps - ~4.1 tps per session. Decode p95 250ms vs 54ms s1 - GPU batches sessions, 4.6x speedup per session under load.
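A quick consistency check on the serialization claim, using only the figures above: if 9 CPU sessions are served one at a time, worst-case decode wait should be roughly 9x the single-session decode time, which lands close to the observed p95.

  // Back-of-envelope check of the s9 numbers quoted above.
  final class ConcurrencyCheck {
      public static void main(String[] args) {
          double cpuS1DecodeP95Ms = 1_046;
          double cpuS9DecodeP95Ms = 10_766;
          System.out.printf("expected p95 if fully serialized: ~%.0f ms%n", 9 * cpuS1DecodeP95Ms); // ~9,414 ms
          System.out.printf("observed s9 p95: %.0f ms (%.1fx s1)%n",
                  cpuS9DecodeP95Ms, cpuS9DecodeP95Ms / cpuS1DecodeP95Ms);                          // ~10.3x
          System.out.printf("per-session tps: CPU %.2f, GPU %.2f%n", 1.05 / 9, 37.24 / 9);         // 0.12 / 4.14
      }
  }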
LORA
The lora-play CPU test produced 3 JFR files: coordinator, node, and an empty third one from the lora sidecar process. LoraTrainStep.count = 0 in all runs. The --lora-play flag loads adapter weights, but no training steps were triggered in these inference-only scenarios, so the training path remains untested.
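The zero-count finding can be re-verified straight from the recordings with the JDK's JFR consumer API; a minimal sketch (the event name comes from the report, the file name is a placeholder):

  import java.nio.file.Path;
  import jdk.jfr.consumer.RecordedEvent;
  import jdk.jfr.consumer.RecordingFile;

  // Counts LoraTrainStep events in a JFR recording; 0 means the training path never ran.
  final class LoraEventCount {
      public static void main(String[] args) throws Exception {
          Path recording = Path.of(args.length > 0 ? args[0] : "lora-play-node.jfr"); // placeholder path
          long count = 0;
          for (RecordedEvent event : RecordingFile.readAllEvents(recording)) {
              if (event.getEventType().getName().endsWith("LoraTrainStep")) {
                  count++;
              }
          }
          System.out.println("LoraTrainStep events: " + count);
      }
  }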