Strix Halo LLM Inference: Loading & Quantization

Shareable visual notes for the running Qwen3.6-35B-A3B homelab setup.

Source concept: Kilbright's Code - LLM Inference Engine Loading & Quantization

Generated 2026-04-26 | 4 visualizations extracted from transcript, then packaged with the live Strix Halo setup notes.

Live Setup Snapshot

Captured 2026-04-26 22:34 CDT. Screencast framing: Qwen3.6-35B-A3B serving roughly 25 tok/s over a 151k+ token working set at under 100W.

Host: artemis - Minisforum MS-S1 MAX
APU: Ryzen AI Max+ 395, Radeon 8060S
Memory: 128 GB LPDDR5X unified
Backend: llama.cpp b8890, Vulkan RADV
Model: Qwen3.6-35B-A3B UD-Q8_K_XL
Observed: 153,562 max tokens, 24.1 tok/s generation

llama-server -ngl 999 -fa on --no-mmap -t 16 -tb 32 \
  -m Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
  -c 524288 -np 3 --kv-unified \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -b 4096 -ub 4096 --cache-ram 16384 \
  --cache-idle-slots --slot-prompt-similarity 0.8 \
  --mlock --reasoning on --reasoning-budget 65536

Screencast

The recording shows the live llama-swap activity view, GPU power draw, and terminal output while the Strix Halo node serves the long-context run.

1. Memory Hierarchy — Eager vs Lazy Loading

The model loading pipeline runs SSD → RAM → GPU. The visualization contrasts eager loading (all weights read at once) with lazy loading (weights read on demand via memory mapping).

Video timestamp: 0:50 – 5:10
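
To make the eager vs lazy distinction concrete, here is a minimal Python sketch using only the standard library. The file name and the 1 MiB dummy file are hypothetical stand-ins for a real model file, and the sketch covers only the SSD → RAM step, not the upload to the GPU. The `--no-mmap` flag in the launch command above corresponds to the eager path.

```python
import mmap
import os
import tempfile

def make_dummy_weights(path, size_bytes):
    """Write a placeholder 'weights' file (hypothetical stand-in for a model file)."""
    with open(path, "wb") as f:
        f.write(bytes(i % 251 for i in range(size_bytes)))

def load_eager(path):
    """Eager: copy the entire file into process memory before any use."""
    with open(path, "rb") as f:
        return f.read()

def load_lazy(path):
    """Lazy: map the file; the OS faults pages in only when they are touched."""
    f = open(path, "rb")
    return f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

path = os.path.join(tempfile.mkdtemp(), "dummy_weights.bin")
make_dummy_weights(path, 1 << 20)

eager = load_eager(path)       # all bytes are resident after this call
fh, lazy = load_lazy(path)     # nothing has been read from the file yet

# Reading one "tensor" slice touches only the pages that back it.
offset = 123_456
same = eager[offset:offset + 16] == lazy[offset:offset + 16]

lazy.close()
fh.close()
```

Both paths return identical bytes; the difference is when the read cost is paid. Eager loading pays it all at startup, while the mapped file pays it per page on first access.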

2. Quantization — Symmetric vs Asymmetric

Shows how BF16 weights are compressed to int4, comparing single-scale symmetric quantization against per-group asymmetric scaling with a separate min/max for each group.

Video timestamp: 6:40 – 10:03
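
The comparison can be sketched in a few lines of plain Python. This is an illustrative toy, not any library's implementation: Python floats stand in for BF16, the group size of 4 is an assumption chosen to keep the example small, and the function names are hypothetical.

```python
def quantize_symmetric(weights, bits=4):
    """One scale for the whole tensor; integer levels are centered on zero."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [qi * scale for qi in q]                 # dequantized values

def quantize_asymmetric(weights, bits=4, group_size=4):
    """Per-group scale and minimum; levels span each group's [min, max]."""
    steps = 2 ** bits - 1                           # 15 steps for 4-bit
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / steps or 1.0
        q = [round((w - lo) / scale) for w in group]
        out.extend(qi * scale + lo for qi in q)
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Two clusters far from zero: a single zero-centered scale wastes most levels.
weights = [0.02, 0.05, 0.03, 0.04, 0.90, 0.95, 0.92, 0.97]
err_sym = mean_abs_error(weights, quantize_symmetric(weights))
err_asym = mean_abs_error(weights, quantize_asymmetric(weights))
```

On this input the per-group asymmetric scheme reconstructs the weights with a much smaller error, because each group spends all of its levels on its own narrow range. The cost is the extra scale and minimum stored per group.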

3. K-Quants — Hierarchical Scaling + Mixed Precision

A super-block of 256 weights is organized into 8 groups of 32. Some groups use 4-bit (16 levels), others use 6-bit (64 levels). Hierarchical scaling, with a scale per group nested under a scale for the whole super-block, preserves local outliers.

Video timestamp: 10:13 – 11:44
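
A rough sketch of the hierarchical scaling idea, assuming symmetric 4-bit weights and 6-bit group-scale codes under one float scale per super-block. It is a simplification for illustration and does not reproduce llama.cpp's actual K-quant bit layout; all function names are hypothetical. Passing `weight_bits=6` would model the higher-precision case.

```python
import math

SUPER_BLOCK = 256   # weights per super-block (from the notes above)
GROUP = 32          # weights per group, so 8 groups per super-block

def quantize_flat(weights, bits=4):
    """Baseline: a single scale shared by all 256 weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]

def quantize_hierarchical(weights, weight_bits=4, scale_bits=6):
    """Per-group scales, themselves stored as small integer codes."""
    assert len(weights) == SUPER_BLOCK
    qmax = 2 ** (weight_bits - 1) - 1        # 7 for 4-bit weights
    smax = 2 ** scale_bits - 1               # 63 for 6-bit scale codes
    groups = [weights[i:i + GROUP] for i in range(0, SUPER_BLOCK, GROUP)]
    # Level 1: the float scale each group would ideally use.
    ideal = [max(abs(w) for w in g) / qmax for g in groups]
    # Level 2: one float super-scale; each group keeps only an integer code.
    super_scale = max(ideal) / smax or 1.0
    codes = [max(1, round(s / super_scale)) for s in ideal]
    restored = []
    for g, code in zip(groups, codes):
        scale = code * super_scale
        restored.extend(
            max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in g
        )
    return codes, restored

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Small weights everywhere, plus one outlier that lands in the second group.
weights = [0.05 * math.sin(i) for i in range(SUPER_BLOCK)]
weights[40] = 1.0

codes, restored = quantize_hierarchical(weights)
err_flat = mean_abs_error(weights, quantize_flat(weights))
err_hier = mean_abs_error(weights, restored)
```

With a single flat scale, the outlier stretches the step size for all 256 weights. With per-group scales, only the group that contains the outlier pays for it, and the other seven groups keep a fine step size.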

4. Salient Weight Detection (AWQ / EXL2)

Find the most important weights by activation magnitude. These salient weights are protected with higher precision during quantization to preserve model quality.

Video timestamp: 11:51 – 13:43
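
The detection step can be sketched as follows, under stated assumptions: salience is taken as the mean absolute activation per input channel over a small calibration set, and the protected channels are simply quantized at 8-bit instead of 4-bit. This follows the description above rather than the actual AWQ or EXL2 algorithms, and every name and number in it is hypothetical.

```python
def channel_salience(activations):
    """Mean absolute activation per input channel over calibration samples."""
    n = len(activations)
    return [sum(abs(row[c]) for row in activations) / n
            for c in range(len(activations[0]))]

def pick_salient(salience, fraction=0.25):
    """Indices of the top `fraction` of channels by salience (at least one)."""
    k = max(1, round(len(salience) * fraction))
    ranked = sorted(range(len(salience)), key=lambda c: salience[c], reverse=True)
    return set(ranked[:k])

def quantize(values, bits):
    """Symmetric round-to-nearest quantization, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def quantize_protected(weight_rows, salient, low_bits=4, high_bits=8):
    """weight_rows[c] holds the weights that multiply input channel c."""
    return [quantize(row, high_bits if c in salient else low_bits)
            for c, row in enumerate(weight_rows)]

def abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Toy calibration data: 3 samples x 4 channels, channel 2 is consistently large.
activations = [
    [0.10, -0.20, 5.0, 0.05],
    [0.20, 0.10, -4.5, 0.10],
    [-0.10, 0.30, 6.0, -0.02],
]
weight_rows = [
    [0.11, -0.42, 0.27, 0.08],
    [0.35, 0.19, -0.06, 0.44],
    [0.31, -0.12, 0.07, 0.50],
    [-0.21, 0.09, 0.38, -0.15],
]

salient = pick_salient(channel_salience(activations))
protected = quantize_protected(weight_rows, salient)
err_protected = abs_error(weight_rows[2], protected[2])
err_unprotected = abs_error(weight_rows[2], quantize(weight_rows[2], 4))
```

Errors in the weights of a high-activation channel are multiplied by large inputs, so spending extra bits there reduces output error more than spending them anywhere else.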