Doubleword

Notes on Building AI Systems

04/Josh Cowan

MoE expert co-activations: Reordering inputs yields easy throughput gains.

Doubleword's batch inference offering keeps costs down by keeping throughput high, which is hard to do given the architecture of popular Mixture-of-Experts (MoE) models. While sparse expert weights make MoE models quick to train, they also mean that at each layer of every forward pass, each request in a batch typically requires different expert weights to be loaded. This makes inference severely memory-bandwidth bound and cuts throughput relative to dense models. However, by reordering inputs so that similar prompts are batched together, we can increase the overlap in the experts they need and reduce the number of unique experts loaded per forward pass. Simply using an embedding model to reorder requests before inference can cut expert loads by approximately 15%, a free throughput gain with no model or kernel changes.
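To make the idea concrete, here is a minimal sketch of reordering by embedding similarity and counting the unique experts each batch would load. Everything in it is an assumption for illustration: the greedy nearest-neighbour ordering, the helper names `reorder_by_similarity` and `expert_loads`, the two-dimensional toy embeddings, and the hand-written expert sets. It does not reproduce the measured result, and a real pipeline would take embeddings from an embedding model and expert activations from the router.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def reorder_by_similarity(embeddings):
    """Greedy ordering (an assumed heuristic): start at request 0, then
    repeatedly append the unvisited request most similar to the last one."""
    if not embeddings:
        return []
    order = [0]
    remaining = set(range(1, len(embeddings)))
    while remaining:
        last = embeddings[order[-1]]
        # Ties are broken in favour of the lower request index.
        nxt = max(remaining, key=lambda i: (cosine(last, embeddings[i]), -i))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def expert_loads(order, experts, batch_size):
    """Sum, over consecutive batches, of the number of unique experts needed."""
    total = 0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        total += len(set().union(*(experts[i] for i in batch)))
    return total

# Toy data (hypothetical): requests 0 and 2 are similar, as are 1 and 3.
embeddings = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]
# Hypothetical expert sets activated by each request at a single layer.
experts = [{0, 1}, {2, 3}, {0, 4}, {3, 5}]

original = list(range(len(embeddings)))
reordered = reorder_by_similarity(embeddings)

print(reordered)                            # [0, 2, 3, 1]
print(expert_loads(original, experts, 2))   # 8
print(expert_loads(reordered, experts, 2))  # 6
```

In this toy case, batching similar requests together lets each batch share an expert, so fewer unique experts are loaded for the same work. The greedy pass is quadratic in the number of requests; clustering or sorting along a projection would be cheaper alternatives for large queues.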