Large-Scale Semantic Search Without Embeddings
Applying parallel primitives to search and rank 2.4 million arXiv papers using LLM judgments
Notes on Building AI Systems
Exploring coordination patterns from parallel computing for multi-agent LLM systems
Researchers face an impossible task in staying up to date within their field. In AI and Machine Learning alone, arXiv publishes 50-100 new papers daily. Multiply that across computer science, physics, biology, and other domains, and you're looking at hundreds of potentially relevant papers flooding in every single day.
The initial wave of Generative AI adoption focused on augmenting human work - chatbots that help developers write cleaner code, assistants that polish our emails, and tools that speed up content creation. These productivity enhancements have proven their value many times over: many people now keep a ChatGPT-style assistant open throughout the day. But they represent just the beginning of what's possible with AI.
This episode explores how speculative decoding becomes increasingly valuable in high-throughput, batched inference scenarios, particularly with sparse MoE architectures.
This technical guide explores model parallelism, a critical technique for deploying large language models that exceed single GPU memory capacity.
This article explores speculative decoding, a technique designed to accelerate language model inference by introducing parallelism into the token generation process.
Benchmarking is hard.
How batched, latency-tolerant AI workloads can achieve significantly cheaper token costs using consumer-grade GPUs instead of enterprise hardware like H100s.
Selecting the right AI model for deployment is a critical decision that can significantly impact the performance, cost, and user experience of your application. With a wide variety of models available, each with unique strengths and trade-offs, it’s essential to evaluate them carefully based on relevant criteria. In this post, we’ll explore the three key factors to consider when comparing models for deployment: quality, cost, and speed. Understanding how these factors interact and influence your application will help you make informed choices that align with your technical requirements and business goals.
When technology infrastructure—such as GPUs and servers—is owned and managed by a central IT team, the need to allocate costs back to the business units that benefit from these resources becomes a critical consideration. This is particularly relevant in the context of self-hosting AI models, where the initial investment in high-performance GPUs, servers, and supporting infrastructure can be substantial. Without a clear chargeback mechanism, it becomes difficult to ensure accountability, optimize resource usage, and justify the ROI of such investments.
An idea for using LLMs to help with scheduling.