Doubleword blog

01 / July 9, 2026 / Fergus Finn

Reverse-engineering NVIDIA's cuda-checkpoint for faster cold starts

Freezing a live CUDA process to host memory and thawing it again: what the driver does (and doesn't do) to make that work, and how understanding that lets us restore CUDA processes up to 4x faster.

02 / July 2, 2026 / Fergus Finn

Width vs. depth: speculating on the margin

Some thinking about how to trade off batching and speculative decoding in a running inference engine.

03 / June 30, 2026 / Peter Bhabra

The swarm that designs itself

We rebuilt Moonshot's Kimi agent swarm and pointed it at a real codebase: ~53× fewer tokens and ~45× cheaper than one long-context agent.

04 / June 29, 2026 / Fergus Finn

What happens when you run a CUDA kernel

Tracing one vector-add kernel from nvcc all the way down to the warps that execute it.

05 / June 22, 2026 / Jamie Dborin

Prediction: A Frontier Open Source LLM Will Be Released On 3rd December 2026

Using Artificial Analysis benchmarks, we try to predict when an open-source LLM that matches frontier LLMs will be released.

06 / June 22, 2026 / Jamie Dborin

Anatomy of a Diffusion Language Model

A breakdown of the modelling choices of three new and popular diffusion language models.

07 / June 22, 2026 / Jamie Dborin

FlashOffload: 7x Cheaper Prefills with Offloading

Extending SGLang to improve its offloading capabilities, achieving speedups for compute-bound workloads.

08 / June 22, 2026 / Fergus Finn

Adaptive speculative decoding: picking draft lengths at runtime

A follow-on to "The economics of speculative decoding": we run the inference lab simulator on MTP and DFlash drafters with real acceptance data to find out whether adaptively choosing the draft length is worth it.

09 / June 19, 2026 / Fergus Finn

InfiniBand, RoCE, and all that

How a technically superior but economically isolated solution slowly lost ground to a good-enough one built on infrastructure everyone already owns.

10 / June 12, 2026 / Fergus Finn

UCCL-EP: An expert parallel communications kernel without owning the NIC

How UCCL reimplements the DeepEP kernels for arbitrary hardware by swapping the transport out from under them.

11 / June 10, 2026 / Fergus Finn

Anatomy of a high-performance EP kernel

How expert-parallel dispatch and combine kernels work, built up from scratch: the high-throughput shape and the low-latency one.

12 / June 8, 2026 / Fergus Finn

The economics of speculative decoding

Two underexplored axes: what MoE routing does to the decode roofline, and how compressed attention takes away the slack that used to make speculated tokens free.