Scheduling in inference engines
A look into how inference engines choose which requests to process.
Notes on Building AI Systems
Drawing a parallel between optimizing LLM inference and classical economic principles, specifically the concept of comparative advantage.
Running LLMs at scale can be expensive. Whether you're building customer-facing chatbots, document extraction pipelines, or research tools, token usage can balloon into thousands of dollars quickly. While infrastructure teams often focus on throughput optimizations (batching requests on the GPU, prefix caching, etc.), there's another lever to pull: endpoint design. One of the most powerful - and under-discussed - endpoint types is the batched endpoint. Instead of prioritizing instant responses, batched endpoints trade latency for cost, which can cut your LLM bill in half (or more in some cases).
Choosing the right LLM for your workload isn’t just about picking the latest open-source release or switching to a cheaper closed model. If you’re self-hosting language models - whether for RAG pipelines, agents, or fine-tuned data tasks - knowing how good a model is (and compared to what) is critical to that decision.
Inference engines are the backbone of self-hosted LLM stacks. They’re responsible for turning model weights into real-time, token-by-token output.
Large Language Models (LLMs) are incredibly powerful but also incredibly resource-hungry. Running them efficiently, especially on self-hosted infrastructure, requires squeezing every bit of performance out of limited compute and memory. That’s where quantization comes in.
AI agents are transforming everything from customer support to autonomous workflows. But under the hood, most AI agent architectures suffer from one major problem: growing latency and cost at scale.
Most retrieval-augmented generation (RAG) systems assume that documents are clean, structured, and text-based. But in enterprise environments, the reality is different. Documents often contain:
If you’re self-hosting LLMs, scaling out to multiple nodes isn’t optional - and neither is load balancing. But conventional strategies like round-robin or least-connections often fail silently when applied to LLM workloads.
In many orgs, self-hosting LLMs starts with a single model. Then comes a customisation request. Then another. And before long, you’ve got dozens of fine-tuned variants - each trained with LoRA or another parameter-efficient technique.
When self-hosting LLMs and productionising AI, one of the first practical questions you’ll run into is: “How many users can this system actually support?”
It’s easy to default to GPU or CPU utilization to assess LLM system load - but that’s a trap. These metrics were built for traditional compute workflows and fall short in LLM deployments. They can stay flat while your model silently hits capacity, leading to missed scaling signals and degraded performance.