Doubleword blog

June 10, 2025 / Jamie Dborin

Behind the Stack, Ep 3: How to Serve 100 Models on a Single GPU with No Cold Starts

In many orgs, self-hosting LLMs starts with a single model. Then comes a customisation request. Then another. Before long, you’ve got dozens of fine-tuned variants - each trained with LoRA or another parameter-efficient technique.

June 4, 2025 / Jamie Dborin

Behind the Stack, Ep 2: How Many Users Can My GPU Serve?

When self-hosting LLMs and productionising AI, one of the first practical questions you’ll run into is: “How many users can this system actually support?”

May 28, 2025 / Jamie Dborin

Behind the Stack, Ep 1: What Should I Be Observing in my LLM Stack?

It’s easy to default to GPU or CPU utilisation to assess LLM system load - but that’s a trap. These metrics were built for traditional compute workloads and fall short in LLM deployments. They can stay flat while your model silently hits capacity, leading to missed scaling signals and degraded performance.