Bringing up DeepSeek-V4-Flash on AMD MI300X
A story of sharp edges, segfaults, and standards
Notes on Building AI Systems
The fastest a memory-bound kernel can go is set by the time required to transfer its data to the SMs. How can we do better?
Most conversations in inference have centred on making the experience better for someone waiting. Reducing time-to-first-token. Claude's "fast mode.¹" Groq and Cerebras. The whole technical project of the last few years has assumed that a human, somewhere, is waiting for the response.
Doubleword's batch inference offering keeps costs down by keeping throughput high, which is not easy given the architecture of popular Mixture-of-Experts (MoE) models. Sparse expert weights make MoE models quick to train, but they also mean that at each layer of every forward pass, each request in a batch typically requires different expert weights to be loaded. This makes inference severely memory-bandwidth bound and cuts throughput relative to dense models. However, by reordering inputs so that similar prompts are batched together, we can increase the overlap between the experts each batch needs and reduce the number of unique experts loaded per forward pass. Simply using an embedding model to reorder requests before inference can cut expert loads by approximately 15%, a free throughput gain with no model or kernel changes.
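The counting argument behind that claim can be sketched with a toy example. Everything here is an assumption made for illustration: the routing sets are synthetic, `toy_request` and `unique_expert_loads` are hypothetical helpers, and a cluster label stands in for the ordering an embedding model would produce. The numbers it yields say nothing about the roughly 15% figure measured on real models.

```python
def unique_expert_loads(expert_sets, batch_size):
    """Sum, over consecutive batches, the number of distinct experts touched."""
    total = 0
    for start in range(0, len(expert_sets), batch_size):
        batch = expert_sets[start:start + batch_size]
        total += len(set().union(*batch))
    return total


# Hypothetical toy setup: 4 prompt clusters, each routed into its own
# slice of 16 experts, with 4 experts activated per request.
NUM_CLUSTERS = 4
EXPERTS_PER_CLUSTER = 16
TOP_K = 4


def toy_request(i):
    """Return (cluster label, set of expert ids) for the i-th arriving request."""
    cluster = i % NUM_CLUSTERS   # requests arrive interleaved across clusters
    offset = i // NUM_CLUSTERS   # neighbours within a cluster share experts
    experts = {
        cluster * EXPERTS_PER_CLUSTER + (offset + j) % EXPERTS_PER_CLUSTER
        for j in range(TOP_K)
    }
    return cluster, experts


requests = [toy_request(i) for i in range(32)]

# Arrival order mixes clusters in every batch; grouping by cluster label
# (the stand-in for embedding similarity) keeps similar requests together.
arrival_order = [experts for _, experts in requests]
grouped_order = [experts for _, experts in sorted(requests, key=lambda r: r[0])]

arrival_loads = unique_expert_loads(arrival_order, batch_size=8)
grouped_loads = unique_expert_loads(grouped_order, batch_size=8)
print(f"unique expert loads: arrival={arrival_loads}, grouped={grouped_loads}")
```

In this toy setup the grouped order loads fewer unique experts because each batch draws from a single expert slice. Real routers are far less cleanly separated, which is why the gain on an actual model is much smaller.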
Lossless compression of a target model's KV cache by up to 4×, using a cheaper predictor model to drive an arithmetic coder.
A note on how to interpret different attention mechanisms as tensor networks, and how to use that view to inform kernel decisions at inference time and training time.
An empirical investigation into the byte-level entropy of model weights across numeric formats and model families.
An intro to tANS: a table-based entropy coder that removes rANS's per-symbol division while keeping Shannon-optimal compression.
An intro to rANS: an entropy coding method for losslessly encoding & decoding streams of bytes quickly.
Checkpoint/restore with CRIU and cuda-checkpoint, from 12 minutes to 10 seconds on a B200.
Recently released open source OCR models are starting to replace expert-based OCR systems. This post walks through an evaluation exercise of specialist and general OCR agents.
When a weighted-random fallback rejects samples and retries without replacement, high error rates cause low-weight models to be selected far more often than their weights suggest.
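A minimal sketch of that effect, under assumptions not taken from the post: every attempt fails independently with the same `error_rate`, a failed model is removed from the pool, and the next pick is weighted over the models that remain. The function `attempt_probabilities` and the model names are hypothetical. It enumerates every retry path, so it is only practical for small pools.

```python
def attempt_probabilities(weights, error_rate):
    """Probability that each model is tried at least once for a single request.

    Assumes each attempt fails independently with probability error_rate and
    that retries sample without replacement from the remaining models.
    """
    tried = {name: 0.0 for name in weights}

    def recurse(pool, reach_prob):
        # reach_prob: probability of arriving at this retry step with this pool
        total = sum(weights[m] for m in pool)
        for m in pool:
            p = reach_prob * weights[m] / total
            tried[m] += p
            rest = tuple(x for x in pool if x != m)
            if rest:
                # We only retry if the attempt on m failed.
                recurse(rest, p * error_rate)

    recurse(tuple(weights), 1.0)
    return tried


weights = {"big": 9, "small": 1}  # nominal 90% / 10% split
for error_rate in (0.0, 0.5):
    probs = attempt_probabilities(weights, error_rate)
    print(error_rate, {name: round(p, 3) for name, p in probs.items()})
```

With no errors the low-weight model is tried in line with its weight. As the error rate rises, most failures happen on the high-weight model, and each one hands the retry to whatever is left in the pool, so the low-weight model's share of attempts grows well beyond its nominal weight.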