Tensor Network Attention
A note on how to interpret different attention mechanisms as tensor networks, and how that view can inform inference-time and training-time kernel decisions.
Notes on Building AI Systems
An empirical investigation into the byte-level entropy of model weights across numeric formats and model families.
An intro to tANS: a table-based entropy coder that removes rANS's per-symbol division while keeping Shannon-optimal compression.
An intro to rANS: an entropy coding method for losslessly encoding & decoding streams of bytes quickly.
Checkpoint/restore with CRIU and cuda-checkpoint, from 12 minutes to 10 seconds on a B200.
Recently released open source OCR models are starting to replace expert-based OCR systems. This post walks through an evaluation exercise of specialist and general OCR agents.
When a weighted-random fallback rejects samples and retries without replacement, high error rates cause low-weight models to be selected far more often than their weights suggest.
Speculative decoding speeds up LLM generation by letting a system propose several “draft” tokens at once, and then having the target model verify them in a single forward pass. The usual question is: where do we get good drafts cheaply? In this post, we explore queue speculation (QueueSpec): draft tokens come from a smaller model that runs while a request is queuing, so verification can start immediately once the request is serviced. At Doubleword we use speculative decoding techniques like this, along with other throughput-specific optimizations, to deliver cheaper inference at scale at the cost of end-to-end latency. If you want to get started with some free credits, sign up here: Doubleword Platform
Today we’re reducing the price of our highest-intelligence model, Qwen3-235B-A22B-Instruct.
Building a content discovery system using parallel primitives and BST-based ranking with LLM comparisons.
A lock-free binary search tree optimized for expensive async comparisons, with a threaded linked list for O(1) sorted iteration.
High throughput inference of LLMs using JIT weight offloading to optimize KV Cache.