Salvatore Sanfilippo (Antirez, of Redis fame) dropped DS4, a narrow-bet inference engine that runs DeepSeek V4 Flash locally on Apple Silicon (Metal) and Linux (CUDA). Not a generic GGUF runner: it is built specifically for DeepSeek V4 Flash, with an OpenAI/Anthropic-compatible server you can point Claude Code at.

Two ideas worth stealing. The first is a 2-bit quantization that actually works: only the routed MoE experts get quantized, while shared experts and projections stay untouched. The result runs the model on a 128GB MacBook Pro and calls tools reliably under coding agents. The second is treating the KV cache as a first-class disk citizen, hashed by SHA1 of the rendered prefix so stateless API clients reuse cached state across sessions and restarts. Antirez also says openly that DS4 was built with strong assistance from GPT-5.5, which is refreshingly honest about how high-end systems code gets written in 2026.
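To make the disk-cache idea concrete, here is a minimal Python sketch of a cache keyed by the SHA1 of the rendered prefix. It is not DS4's code: the `DiskKVCache` class, the `prefix_key` helper, the `.kv` file suffix, and the opaque-bytes representation of the KV state are all assumptions made for illustration. The only detail taken from the description above is that the key is a SHA1 hash of the rendered prefix.

```python
import hashlib
import os
from pathlib import Path


def prefix_key(rendered_prefix: str) -> str:
    """Hypothetical helper: SHA1 hex digest of the rendered prompt prefix."""
    return hashlib.sha1(rendered_prefix.encode("utf-8")).hexdigest()


class DiskKVCache:
    """Illustrative on-disk store; the KV state is treated as opaque bytes."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, rendered_prefix: str) -> Path:
        # The file name depends only on the prefix text, so any client
        # that sends the same prefix maps to the same file.
        return self.root / (prefix_key(rendered_prefix) + ".kv")

    def save(self, rendered_prefix: str, kv_state: bytes) -> None:
        path = self._path(rendered_prefix)
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(kv_state)
        # Rename into place so a reader never sees a half-written file.
        os.replace(tmp, path)

    def load(self, rendered_prefix: str):
        path = self._path(rendered_prefix)
        return path.read_bytes() if path.exists() else None
```

Because the key is derived from the prefix text alone, a stateless client that resends the same system prompt and history after a server restart lands on the same file, with no session ID needed. A real engine would also need to handle partial prefix matches and eviction, which this sketch leaves out.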