SCENARIO: long
Single inference request, max output.
Prompt: "write me a long poem about love and war, be as robust as you can"
max-tokens: CPU 50 · GPU 200
Measures: steady-state decode throughput, prefill latency, MatVec backend performance.
SCENARIO: conv
3-turn conversation with history recall.
msg1 "Hello, my name is Viktor…" (34 tok) → reply (26 tok)
msg2 "Could you recall my name?" (43 tok) → reply (20 tok)
msg3 "Thank you, have a nice day!" (30 tok) → reply (17 tok)
Measures: KV-cache reuse (or lack thereof), prefill cost growth across turns.
ANALYSIS ── key findings from completed cells
GPU
GPU (g4dn) is 17–37x faster than CPU across all single-session tests.
Long/s9: GPU 37.24 tps vs CPU 1.05 tps - GPU batches forward passes across concurrent sessions.
Conv/s9: GPU 5.89 tps vs CPU 0.28 tps - the 21x gap widens further because per-session history growth hits CPU prefill time harder.
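The speedup ratios quoted above follow directly from the aggregate throughput numbers; a quick sanity check (values copied from the results above):

```python
# Sanity-check the CPU-vs-GPU speedup ratios from the s9 runs.
results = {
    "long/s9": {"gpu_tps": 37.24, "cpu_tps": 1.05},
    "conv/s9": {"gpu_tps": 5.89, "cpu_tps": 0.28},
}

for scenario, r in results.items():
    speedup = r["gpu_tps"] / r["cpu_tps"]
    print(f"{scenario}: {speedup:.1f}x")  # long/s9: 35.5x, conv/s9: 21.0x
```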
DTYPE
CPU dtype ranking: FP16 (1.54) > INT8 (1.13) ≈ FP32 (1.06) ≈ LoRA (1.05) ≈ LE (0.97).
INT8 is slower than FP16 on CPU - no AVX-512 INT8 path is implemented, so it falls back to scalar code.
LE byte order adds ~19% overhead vs BE on CPU (MatVec total 90,286 ms vs 56,514 ms).
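One plausible mechanism for the LE penalty - this is a guess, assuming the weight files are serialized big-endian (the Java DataOutputStream default) so the LE path pays a per-value byte swap before each MatVec. A toy illustration of the byte-order mechanics with Python's struct module:

```python
import struct

value = 3.14159

# Big-endian encoding (Java DataOutputStream default)
be_bytes = struct.pack(">f", value)
# Little-endian encoding of the same float
le_bytes = struct.pack("<f", value)

# Same value, reversed byte order on the wire.
assert be_bytes == le_bytes[::-1]

# Decoding with the wrong byte order silently yields garbage, so a
# reader whose byte order differs from the file's must swap every
# 4-byte word - a per-element cost the native-order path avoids.
print(struct.unpack("<f", le_bytes)[0], struct.unpack(">f", le_bytes)[0])
```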
GPU DTYPE
GPU dtype ranking: LoRA (22.19) > LE (21.63) > FP16/embed (20.69) ≈ FP32 (20.36) > INT8 (19.87).
LoRA inference overhead is negligible on GPU - the adapter is resident in VRAM.
FP32 on GPU is only 7% slower than FP16 (driver upcasts internally, MatVec p95 identical at 0.241ms).
NODES
CPU pipeline node sweep: peak at 3 nodes (1.59 tps). Adding nodes beyond 3 degrades end-to-end TPS because the per-forward-pass network synchronization cost grows with node count.
7 nodes (embedded) dropped to 1.12 tps - below single-node baseline.
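The shape of the sweep is consistent with a simple cost model: per-token latency = compute/n + sync·(n-1). The coefficients below are invented to illustrate the peak-then-degrade shape, not fitted to the measured data:

```python
# Toy model of pipeline-parallel throughput vs node count.
# compute_ms: single-node per-token compute; sync_ms: per-hop
# network cost per forward pass. Coefficients are illustrative only.
compute_ms = 900.0
sync_ms = 140.0

def tps(nodes: int) -> float:
    per_token_ms = compute_ms / nodes + sync_ms * (nodes - 1)
    return 1000.0 / per_token_ms

for n in range(1, 8):
    print(n, round(tps(n), 2))
```

With these coefficients throughput peaks at 3 nodes and falls below the 1-node baseline by 7 nodes, matching the qualitative pattern in the sweep.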
TENSOR
CPU 3-node tensor parallelism: 0.44 tps - 3.5x worse than single-node pipeline.
Network sync at every layer dominates for this model size.
GPU tensor (1 node): 20.72 tps - within noise of pipeline (19.67). No benefit, non-zero overhead.
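The 3.5x gap tracks the difference in synchronization frequency: pipeline parallelism crosses the network once per stage boundary per token, while tensor parallelism needs a collective at every layer. A rough count (the layer count is an assumption for illustration; the model under test is not specified in these notes):

```python
# Rough count of network synchronizations per generated token.
n_layers = 24  # ASSUMPTION for illustration
n_nodes = 3

# Pipeline: one activation hand-off at each stage boundary.
pipeline_syncs = n_nodes - 1

# Tensor: at least one all-reduce per transformer layer
# (commonly two: one after attention, one after the MLP).
tensor_syncs = n_layers * 2

print(pipeline_syncs, tensor_syncs)  # prints: 2 48
```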
CONV
CPU conv/s1: 0.12 tps - 10x worse than long/s1. Full KV re-prefill each turn.
GPU conv/s1: 1.31 tps - node prefill p95 spiked to 1,530ms vs 142ms for long.
Multi-turn without KV cache is O(n²) in token cost. kvcache branch not yet merged.
CPU conv/s9: 0.28 tps aggregate (~0.031 tps/session) - 4x worse per session than s1 due to compounding prefill cost across 9 growing contexts.
GPU conv/s9: 5.89 tps aggregate (~0.65 tps/session) - 2x worse per session than s1. Decode p95 rose to 230ms (vs 54ms long/s1); GPU batches concurrent sessions but prefill cost scales with history length.
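The quadratic cost is visible directly from the conv scenario's token counts: without a KV cache, each turn re-prefills the entire accumulated history. Using the message/reply sizes from the scenario definition above:

```python
# Prefill tokens per turn when the full history is re-processed
# each turn (no KV cache). Token counts from the conv scenario.
turns = [(34, 26), (43, 20), (30, 17)]  # (user msg, reply)

history = 0
total_prefill = 0
for i, (msg, reply) in enumerate(turns, start=1):
    prefill = history + msg       # everything before this reply
    total_prefill += prefill
    history += msg + reply
    print(f"turn {i}: prefill {prefill} tokens")
print(f"total prefill: {total_prefill} tokens")  # 34 + 103 + 153 = 290
```

With a working KV cache, only the new prompt tokens (34 + 43 + 30 = 107) would need prefill; the remaining 183 tokens are pure re-computation.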
CONCURRENCY
CPU s9: aggregate 1.05 tps across 9 sessions (~0.12 tps per session) - the aggregate merely matches the single-session rate, so concurrency adds nothing.
Node decode p95 jumped to 10,766ms (vs 1,046ms at s1) - sessions are serialized and queue wait dominates.
GPU s9: aggregate 37.24 tps - ~4.1 tps per session. Decode p95 250ms vs 54ms s1 - GPU batches sessions, 4.6x speedup per session under load.
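The per-session figures above are just the aggregates divided by the session count:

```python
sessions = 9
cpu_aggregate_tps = 1.05
gpu_aggregate_tps = 37.24

print(f"CPU per session: {cpu_aggregate_tps / sessions:.3f} tps")  # 0.117
print(f"GPU per session: {gpu_aggregate_tps / sessions:.2f} tps")  # 4.14
```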
LORA
lora-play CPU test produced 3 JFR files - coordinator, node, and a third, empty recording (the lora sidecar process).
LoraTrainStep.count = 0 in all runs: --lora-play loads the adapter weights, but no training steps are triggered in these inference-only scenarios, so the training path remains untested.