Doubleword

Notes on Building AI Systems

01/Jamie Dborin

QueueSpec: Drafting While You Wait

Speculative decoding speeds up LLM generation by letting a system propose several "draft" tokens at once and then having the target model verify them in a single forward pass. The usual question is: where do we get good drafts cheaply? In this post, we explore queue speculation (QueueSpec): draft tokens come from a smaller model that runs while a request is queuing, so verification can start immediately once the request is serviced. At Doubleword we use speculative decoding techniques like this, alongside other throughput-specific optimizations, to deliver cheaper inference at scale by trading away some end-to-end latency. If you want to get started with some free credits, sign up here: Doubleword Platform
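The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration only: `draft_model` and `target_model` are hypothetical stand-in functions over integer "tokens", not real LLMs, and the target's single verification pass is simulated by scoring one position at a time.

```python
def draft_model(prefix):
    # Hypothetical cheap draft model: predicts the next token by a simple rule.
    return prefix[-1] + 1 if prefix else 0

def target_model(prefix):
    # Hypothetical target model: mostly agrees with the draft,
    # but "corrects" every token that would be a multiple of 4.
    nxt = prefix[-1] + 1 if prefix else 0
    return nxt + 1 if nxt % 4 == 0 else nxt

def speculative_step(prefix, k=4):
    # 1. Draft k tokens autoregressively with the cheap model.
    drafts = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        drafts.append(t)
        ctx.append(t)
    # 2. Verify: accept the longest prefix of drafts that the target model
    #    agrees with; on the first mismatch, the target's own token is
    #    emitted instead. (In a real system all k positions are checked
    #    in one forward pass of the target model.)
    accepted = []
    ctx = list(prefix)
    for t in drafts:
        target_t = target_model(ctx)
        if target_t == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_t)  # target's correction replaces the miss
            break
    else:
        # Every draft was accepted; the target contributes one bonus token.
        accepted.append(target_model(ctx))
    return prefix + accepted
```

Each call to `speculative_step` emits at least one guaranteed-correct token (the target's own), so the output matches what the target model alone would produce; the drafts only decide how many tokens are committed per verification pass. QueueSpec's twist is simply *when* the drafting happens: during queuing, so the first verification pass has drafts ready the moment the request is serviced.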

09/Amanda Milberg

Why Batch Inference Matters: Moving from AI Assistants to Autonomous Agents

The initial wave of Generative AI adoption focused on augmenting human work: chatbots that help developers write cleaner code, assistants that polish our emails, and tools that speed up content creation. These productivity enhancements have proven their value many times over, and almost every knowledge worker now has a version of ChatGPT open during their day. But they represent just the beginning of what's possible with AI.