Price Reduction for Qwen3-235B on Doubleword
Today we’re reducing the price of our highest-intelligence model, Qwen3-235B-A22B-Instruct.
Scaling Curation with LLM Comparisons
Building a content discovery system using parallel primitives and BST-based ranking with LLM comparisons
LLM-Powered Data Structures: A Lock-Free Binary Search Tree
A lock-free binary search tree optimized for expensive async comparisons, with a threaded linked list for O(1) sorted iteration
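The full design is in the post; purely as a sketch of the threading idea (the `ComparisonTree` class and `mock_llm_less_than` comparator below are hypothetical stand-ins, and the lock-free CAS machinery is omitted), each node can carry a successor pointer that is spliced in at insert time, so sorted iteration never re-descends the tree:

```python
import asyncio

class Node:
    def __init__(self, value):
        self.value = value
        self.left = self.right = None
        self.next = None          # in-order successor (the "thread")

class ComparisonTree:
    """BST ordered by an async comparator, plus a threaded sorted list."""

    def __init__(self, less_than):
        self.less_than = less_than   # async callable: (a, b) -> bool
        self.root = None
        self.head = None             # smallest element, start of the sorted list

    async def insert(self, value):
        node = Node(value)
        if self.root is None:
            self.root = self.head = node
            return
        cur, pred, succ = self.root, None, None
        while True:
            if await self.less_than(value, cur.value):
                succ = cur                       # cur will follow the new node
                if cur.left is None:
                    cur.left = node
                    break
                cur = cur.left
            else:
                pred = cur                       # cur will precede the new node
                if cur.right is None:
                    cur.right = node
                    break
                cur = cur.right
        # Splice into the sorted linked list in O(1).
        node.next = succ
        if pred is not None:
            pred.next = node
        else:
            self.head = node

    def sorted_values(self):
        # Walk the thread: O(1) per element, no tree traversal needed.
        cur, out = self.head, []
        while cur is not None:
            out.append(cur.value)
            cur = cur.next
        return out

async def mock_llm_less_than(a, b):
    # Stand-in for an expensive async LLM comparison call.
    await asyncio.sleep(0)
    return a < b

async def demo():
    tree = ComparisonTree(mock_llm_less_than)
    for v in [5, 2, 8, 1, 9, 3]:
        await tree.insert(v)
    print(tree.sorted_values())      # [1, 2, 3, 5, 8, 9]

asyncio.run(demo())
```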
ZeroDP: Just-In-Time Weight Offloading over NVLink for Data Parallelism
High-throughput LLM inference using just-in-time weight offloading to free GPU memory for the KV cache.
Large-Scale Semantic Search Without Embeddings
Applying parallel primitives to search and rank 2.4 million arXiv papers using LLM judgments
Parallel Primitives for Multi-Agent Workflows
Exploring coordination patterns from parallel computing for multi-agent LLM systems
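For a flavour of what such a primitive looks like in practice, here is a minimal fan-out/fan-in sketch. The `agent` coroutine is a hypothetical stand-in for an LLM call, not code from the post:

```python
import asyncio

async def agent(task: str) -> str:
    """Stand-in for an LLM agent call; in practice this hits an inference API."""
    await asyncio.sleep(0.01)            # simulate request latency
    return f"summary of {task!r}"

async def fan_out_fan_in(tasks: list[str]) -> str:
    # Map step: run one agent per task concurrently (the parallel "map" primitive).
    partials = await asyncio.gather(*(agent(t) for t in tasks))
    # Reduce step: a single agent combines the partial results.
    return await agent(" + ".join(partials))

result = asyncio.run(fan_out_fan_in(["paper A", "paper B", "paper C"]))
print(result)
```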
$1 for a Year of Research Digests. That's Less Than a Coffee.
Researchers face a near-impossible task in keeping up with their field. In AI and machine learning alone, arXiv publishes 50-100 new papers daily. Multiply that across computer science, physics, biology, and other domains, and hundreds of potentially relevant papers flood in every single day.
Why Batch Inference Matters: Moving from AI Assistants to Autonomous Agents
The initial wave of generative AI adoption focused on augmenting human work: chatbots that help developers write cleaner code, assistants that polish our emails, and tools that speed up content creation. These productivity enhancements have proven their value many times over, with ChatGPT-style assistants now open in the background of countless working days. But they represent just the beginning of what's possible with AI.
Behind the Stack, Ep 13: Faster Inference: Speculative Decoding for Batched Workloads
This episode explores how speculative decoding becomes increasingly valuable in high-throughput, batched inference scenarios, particularly with sparse MoE architectures.
Behind the Stack, Ep 12: Understanding Model Parallelism
This technical guide explores model parallelism, a critical technique for deploying large language models that exceed single GPU memory capacity.
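As a toy illustration of one form of model parallelism covered in the guide (tensor parallelism), the numpy sketch below splits a weight matrix's columns across four hypothetical devices and checks that the gathered result matches the single-device matmul; it is an assumption-laden sketch, not code from the episode:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))          # activations: batch x hidden
W = rng.normal(size=(512, 2048))       # a weight matrix too large for one "device"

# Column-parallel split: each of 4 hypothetical GPUs holds a slice of W's columns.
shards = np.split(W, 4, axis=1)
partial_outputs = [x @ w for w in shards]               # each device computes its slice
y_parallel = np.concatenate(partial_outputs, axis=1)    # all-gather along columns

assert np.allclose(y_parallel, x @ W)   # matches the single-device result
```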
Behind the Stack, Ep 11: How Speculative Decoding Speeds Up Language Models
This article explores speculative decoding, a technique designed to accelerate language model inference by introducing parallelism into the token generation process.
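As a rough sketch of the mechanic, the toy greedy-acceptance variant below uses two hypothetical stand-in functions (`draft_model`, `target_model`) to show how several draft tokens can be confirmed per pass of the expensive model; real implementations verify the whole draft probabilistically in a single batched forward pass:

```python
def draft_model(prefix):
    """Cheap model: guesses the next token (toy rule over a toy vocabulary)."""
    return (prefix[-1] + 1) % 50

def target_model(prefix):
    """Expensive model: the token we actually want (toy rule, mostly agrees)."""
    nxt = (prefix[-1] + 1) % 50
    return nxt if prefix[-1] % 7 else nxt + 1   # disagree occasionally

def speculative_step(prefix, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft = list(prefix)
    for _ in range(k):
        draft.append(draft_model(draft))
    proposed = draft[len(prefix):]
    # 2. Target model checks each position (stands in for one batched verification pass).
    accepted = []
    for i in range(k):
        expected = target_model(prefix + accepted)
        if proposed[i] == expected:
            accepted.append(proposed[i])     # keep matching draft tokens
        else:
            accepted.append(expected)        # first mismatch: take the target's token, stop
            break
    return prefix + accepted

seq = [1]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)   # several tokens accepted per expensive-model pass
```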