I’ve been a GPU poor for a long time — my home workstation runs a 16GB 5060 Ti, and that’s been the ceiling on what I can tinker with locally.
That ceiling just moved. I bought — well, financed — a DGX Spark.
Okay, fine: technically it’s a Lenovo ThinkStation PGX, one of the OEM takes on NVIDIA’s DGX Spark reference design. But “DGX Spark” is the name everyone actually knows, so that’s what I’m calling it. Either way it’s the same silicon underneath: the GB10 Grace Blackwell SoC, pairing a Grace CPU with a Blackwell GPU (sm_121) over one big pool of unified memory. It’s the kind of box that’s about to show up everywhere now that the same chip is heading into laptops as the RTX Spark.
So naturally, the first thing I did was put it to work.
I’ve been bringing NVFP4 KV cache — native 4-bit KV on consumer and SoC Blackwell — to the local-inference stacks, validated across Gemma 3, Gemma 4, and DiffusionGemma. The KV cache is the single most valuable thing to shrink on a bandwidth-bound machine, and it turns out that in most cases you can take it to 4 bits without the quality falling over.
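For a rough sense of the stakes, here is a back-of-the-envelope sketch of KV cache size at BF16 versus NVFP4. The model shape is made up for illustration (it is not any Gemma config), and the function name is hypothetical. It assumes full attention in every layer, so models with sliding-window layers would cache less. The NVFP4 figure counts 4-bit values plus one FP8 scale per 16-element block, and ignores the small per-tensor scale.

```python
# Back-of-the-envelope KV cache sizing. All model dimensions are illustrative
# assumptions, not the configuration of any specific model.

BF16_BITS = 16.0
# NVFP4: 4-bit values plus one 8-bit (FP8) scale shared by each 16-element block.
NVFP4_BITS = 4.0 + 8.0 / 16.0  # 4.5 effective bits per element


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_element):
    """Bytes needed to cache K and V for one sequence of num_tokens tokens."""
    # Factor of 2 covers the separate key and value tensors.
    elements = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return elements * bits_per_element / 8


if __name__ == "__main__":
    # Hypothetical model shape and context length, chosen only for the arithmetic.
    shape = dict(num_layers=48, num_kv_heads=8, head_dim=128, num_tokens=32768)

    bf16 = kv_cache_bytes(**shape, bits_per_element=BF16_BITS)
    nvfp4 = kv_cache_bytes(**shape, bits_per_element=NVFP4_BITS)

    gib = 2 ** 30
    print(f"BF16 KV cache:  {bf16 / gib:.2f} GiB")
    print(f"NVFP4 KV cache: {nvfp4 / gib:.2f} GiB")
    print(f"Reduction:      {bf16 / nvfp4:.2f}x")
```

With these made-up numbers, a single 32K-token sequence goes from about 6 GiB of KV cache to about 1.7 GiB, roughly a 3.6x cut in the bytes that have to be read for every decoded token.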
Two threads on the NVFP4 KV cache work have done unexpectedly well:
- NVFP4 KV cache in vLLM for RTX PRO 6000 and DGX Spark — 166 likes, 858K views.
- NVFP4 KV cache, part 2: SGLang — the harder half — 516 likes, 2.1M views.
A longer write-up is coming. For now: the GPU poor days are, at least temporarily, behind me.