Benchmarking the Doubleword Control Layer
Benchmarking is hard.
Notes on Building AI Systems
How batched, latency-tolerant AI workloads can achieve significantly cheaper token costs using consumer-grade GPUs instead of enterprise hardware like H100s.
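As a back-of-envelope sketch (every price and throughput figure below is an assumption for illustration, not a measured benchmark), cost per token falls directly out of GPU hourly cost and sustained batched throughput:

```python
# Illustrative arithmetic only: plug in your own rental prices and the
# throughput you actually measure under batched load.

def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical figures for the sake of the arithmetic:
enterprise = cost_per_million_tokens(gpu_cost_per_hour=4.00, tokens_per_second=1500)
consumer = cost_per_million_tokens(gpu_cost_per_hour=0.50, tokens_per_second=400)

print(f"enterprise GPU (assumed): ${enterprise:.2f} per 1M tokens")
print(f"consumer GPU (assumed):   ${consumer:.2f} per 1M tokens")
```

The point is not the exact numbers but that latency-tolerant, well-batched workloads let cheaper hardware stay busy enough to win on cost per token.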
Selecting the right AI model for deployment is a critical decision that can significantly impact the performance, cost, and user experience of your application. With a wide variety of models available—each with unique strengths and trade-offs—it’s essential to evaluate them carefully against relevant criteria. In this post, we’ll explore the three key factors to consider when comparing models for deployment: quality, cost, and speed. Understanding how these factors interact and influence your application will help you make informed choices that align with your technical requirements and business goals.
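One rough way to reason about the trade-off is a weighted score per candidate. The weights and scores below are placeholders to be replaced with your own priorities and measurements:

```python
# Toy scoring sketch: normalise each factor to [0, 1] (higher is better,
# so "cost" here means cost-efficiency) and weight by what your app values.
weights = {"quality": 0.5, "cost": 0.3, "speed": 0.2}   # assumed priorities

candidates = {
    "model-a": {"quality": 0.9, "cost": 0.4, "speed": 0.6},   # placeholder scores
    "model-b": {"quality": 0.7, "cost": 0.9, "speed": 0.8},
}

for name, scores in candidates.items():
    total = sum(weights[factor] * scores[factor] for factor in weights)
    print(f"{name}: {total:.2f}")
```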
When technology infrastructure—such as GPUs and servers—is owned and managed by a central IT team, the need to allocate costs back to the business units that benefit from these resources becomes a critical consideration. This is particularly relevant in the context of self-hosting AI models, where the initial investment in high-performance GPUs, servers, and supporting infrastructure can be substantial. Without a clear chargeback mechanism, it becomes difficult to ensure accountability, optimize resource usage, and justify the ROI of such investments.
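A minimal chargeback sketch, assuming usage is already metered per business unit as tokens processed; the team names and figures below are made up:

```python
def allocate_costs(monthly_infra_cost: float, usage_by_team: dict[str, int]) -> dict[str, float]:
    """Split a fixed monthly infrastructure cost in proportion to token usage."""
    total_tokens = sum(usage_by_team.values())
    return {
        team: round(monthly_infra_cost * tokens / total_tokens, 2)
        for team, tokens in usage_by_team.items()
    }

# Hypothetical month: $12,000 of amortised GPU/server cost, three consuming teams.
usage = {"support": 40_000_000, "research": 25_000_000, "marketing": 5_000_000}
print(allocate_costs(monthly_infra_cost=12_000, usage_by_team=usage))
# {'support': 6857.14, 'research': 4285.71, 'marketing': 857.14}
```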
Drawing a parallel between optimizing LLM inference and classical economic principles, specifically the concept of comparative advantage.
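As a toy numeric illustration (the quality scores are invented): even when a large model has an absolute advantage at every task, its comparative advantage is where its relative edge is largest, and that is what should decide the routing:

```python
# Made-up quality scores on two task types.
quality = {
    "large-model": {"hard_tasks": 90, "easy_tasks": 95},
    "small-model": {"hard_tasks": 60, "easy_tasks": 90},
}

for task in ("hard_tasks", "easy_tasks"):
    ratio = quality["large-model"][task] / quality["small-model"][task]
    print(f"{task}: large model is {ratio:.2f}x better")

# hard_tasks: 1.50x, easy_tasks: 1.06x -> the large model's comparative
# advantage is on hard tasks, so easy tasks go to the cheaper small model.
```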
Running LLMs at scale can be expensive. Whether you're building customer-facing chatbots, document extraction pipelines, or research tools, token costs can quickly balloon into thousands of dollars. While infrastructure teams often focus on throughput optimizations (batching requests on the GPU, prefix caching, etc.), there's another lever to pull: endpoint design. One of the most powerful - and under-discussed - endpoint types is the batched endpoint. Instead of prioritizing instant responses, batched endpoints trade latency for cost, cutting your LLM bill in half (or more in some cases).
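A minimal sketch of the pattern; the queue, the `run_batched_inference` stub, and the polling interval are illustrative rather than any particular product's API:

```python
import time
import uuid

pending: dict[str, str] = {}   # job_id -> prompt, waiting to be batched
results: dict[str, str] = {}   # job_id -> finished completion

def run_batched_inference(prompts: list[str]) -> list[str]:
    # Stand-in for one large batched model call against your inference engine.
    return [f"[completion for: {p}]" for p in prompts]

def submit(prompt: str) -> str:
    """Non-blocking submit: the caller gets a job id back immediately."""
    job_id = str(uuid.uuid4())
    pending[job_id] = prompt
    return job_id

def flush() -> None:
    """Run on a timer: drain the queue and process everything in one batch."""
    batch = list(pending.items())
    pending.clear()
    completions = run_batched_inference([prompt for _, prompt in batch])
    for (job_id, _), completion in zip(batch, completions):
        results[job_id] = completion

def get_result(job_id: str, poll_seconds: float = 1.0) -> str:
    """Blocking poll: this wait is the latency traded away for cheaper tokens."""
    while job_id not in results:
        time.sleep(poll_seconds)
    return results.pop(job_id)
```

Because callers tolerate the poll, the server is free to pack requests into large, GPU-efficient batches and run them on off-peak or cheaper capacity.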
Choosing the right LLM for your workload isn’t just about picking the latest open-source release or switching to a cheaper closed model. If you’re self-hosting language models - whether for RAG pipelines, agents, or fine-tuned data tasks - knowing how good a model is (and compared to what) is critical to making the right decision.
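A minimal sketch of task-specific evaluation, assuming you have a small labelled test set; `call_model` and the model names are placeholders for your own endpoint and candidates:

```python
test_set = [
    {"prompt": "Classify the sentiment: 'Great product, would buy again.'", "expected": "positive"},
    {"prompt": "Classify the sentiment: 'Arrived broken, never again.'", "expected": "negative"},
]

def call_model(model_name: str, prompt: str) -> str:
    # Replace with a real request to your self-hosted inference endpoint.
    return "positive"

def accuracy(model_name: str) -> float:
    hits = sum(
        call_model(model_name, ex["prompt"]).strip().lower() == ex["expected"]
        for ex in test_set
    )
    return hits / len(test_set)

for candidate in ["candidate-model-a", "candidate-model-b"]:
    print(candidate, accuracy(candidate))
```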
Inference engines are the backbone of self-hosted LLM stacks. They’re responsible for turning model weights into real-time, token-by-token output.
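To make "token-by-token" concrete, here is a naive greedy decoding loop using Hugging Face transformers, with gpt2 standing in for a real model; this is the unoptimized baseline that inference engines speed up with batching, KV caching, and similar techniques:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Inference engines are", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits    # naive: full forward pass over the whole prefix
        next_id = logits[0, -1].argmax()    # greedy: pick the most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
        print(tokenizer.decode([next_id.item()]), end="", flush=True)
print()
```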
Large Language Models (LLMs) are incredibly powerful but also incredibly resource-hungry. Running them efficiently, especially on self-hosted infrastructure, requires squeezing every bit of performance out of limited compute and memory. That’s where quantization comes in.
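A toy illustration of the core idea behind weight-only quantization: store weights as int8 with a per-tensor scale and dequantize when needed. Real schemes (4-bit formats, group-wise scales, activation-aware methods) are more sophisticated, but the memory arithmetic is the same:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: float32 weights -> int8 plus one scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4096, 4096).astype(np.float32)   # one toy layer
q, scale = quantize_int8(weights)
error = float(np.abs(weights - dequantize(q, scale)).mean())

print(f"memory: {weights.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB")
print(f"mean absolute reconstruction error: {error:.5f}")
```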
AI agents are transforming everything from customer support to autonomous workflows. But under the hood, most AI agent architectures suffer from one major problem: growing latency and cost at scale.
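A rough illustration of why, with assumed figures: a naive agent loop re-sends its entire accumulated context on every step, so billed input tokens (and prefill latency) grow roughly quadratically with the number of steps:

```python
tokens_added_per_step = 500           # assumed tool output + instructions per step
price_per_1k_input_tokens = 0.0005    # assumed input token price

context_tokens = 0
total_billed_tokens = 0
for step in range(1, 21):
    context_tokens += tokens_added_per_step
    total_billed_tokens += context_tokens   # step N re-pays for all N * 500 tokens

print(f"context at step 20: {context_tokens:,} tokens")
print(f"input tokens billed over 20 steps: {total_billed_tokens:,}")
print(f"estimated input cost: ${total_billed_tokens / 1000 * price_per_1k_input_tokens:.2f}")
```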
Most retrieval-augmented generation (RAG) systems assume that documents are clean, structured, and text-based. But in enterprise environments, the reality is different. Documents often contain:
If you’re self-hosting LLMs, scaling out to multiple nodes isn’t optional - and neither is load balancing. But conventional strategies like round-robin or least-connections often fail silently when applied to LLM workloads.
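One sketch of an LLM-aware alternative, with placeholder replica URLs and token estimates: balance on estimated in-flight tokens per replica rather than connection counts, since two "connections" can differ by orders of magnitude in work:

```python
class Replica:
    def __init__(self, url: str) -> None:
        self.url = url
        self.inflight_tokens = 0   # estimated outstanding work on this node

replicas = [Replica("http://node-a:8000"), Replica("http://node-b:8000")]

def acquire(estimated_tokens: int) -> Replica:
    """Route the request to the replica with the least estimated outstanding work."""
    target = min(replicas, key=lambda r: r.inflight_tokens)
    target.inflight_tokens += estimated_tokens
    return target

def release(replica: Replica, estimated_tokens: int) -> None:
    """Call when the request completes to give the budget back."""
    replica.inflight_tokens -= estimated_tokens

# Usage sketch:
r = acquire(estimated_tokens=2_000)
# ... send the request to r.url and stream the response ...
release(r, estimated_tokens=2_000)
```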