Pushing memory bound kernels beyond the speed of light with lossless decompression

Fergus Finn
Fergus Finn
Founder & Member of Technical Staff, Doubleword

In an earlier post on this blog we measured the entropy of the weights of a lot of modern open-weight models. As a reminder: the Shannon entropy sets a lower bound for the compressibility of a stream of bytes.

We found real headroom between the bit-width the weights are stored at and the bit-width their actual values demand. But compressibility only starts to matter when something downstream runs faster, or fits somewhere it previously didn't. So the question is operational: given that the weights can be compressed, what can we actually do if we compress them?

There are lots of things you could think of doing. Some ideas:

  • Faster weight loading. Pulling the weights off disk or across the network into GPU memory is the path a previous post here pulled on. It gets more interesting on Blackwell, where the decompression engine is in hardware rather than a kernel you have to write yourself.
  • Cheaper collectives. Compress weights or activations in flight between GPUs, so the bytes crossing the interconnect are smaller than the bytes the model nominally operates on.
  • Memory-bound kernels. Keep the weights compressed in HBM and decompress them on the fly inside the kernel that consumes them, so the same arithmetic runs against fewer bytes read, and the kernel runs faster.

The first two seem likely to be possible. The last seems hard. In this post we'll see if we can get any purchase on it.

By way of making it even harder -- it's no longer very interesting to build optimizations that only apply to bf16 weights. Nobody serves a large model in bf16 if they can avoid it. The realistic baseline is fp8, and increasingly fp4, both of which have already taken a large bite out of the same headroom the entropy numbers were measuring. So the right question is not really how much we can compress bf16 weights: it is how much further we can compress weights that are already quantised to fp8 or fp4, and whether what is left is worth the cost of the decompressor.

We'll show here that we canIn fp8, on a consumer GPU. fp4 & other GPUs remain open questions..

What should we compare against?

What we're after is a memory-bound kernel that, given inputs we've compressed offline, runs at an effective bandwidth above the hardware's effective peak HBM throughput. I'm running everything in this post on an RTX 4090, because that's what I have under my desk. The question of whether the trick generalises to bigger GPUs is one I'll come back to at the end.

The kernel we'll use is a vector add. It's the canonical example of a memory-bound workload: you load two operands from HBM, do one addition, and write the result back, which works out to three bytes of memory traffic for every floating-point operation. That puts it deep into the memory-bound regime, where the arithmetic units spend almost all of their time waiting on bytesWe're interested in fp8 really, but I couldn't find a simple fp8 vecadd kernel that could reasonably be called speed of light. The one I wrote hit ~900 GB/s. So bf16 seems fairer..

import torch

# 1 GiB working set per tensor (bf16 = 2 bytes per element)
n = 512 * 1024 * 1024
a = torch.randn(n, dtype=torch.bfloat16, device="cuda")
b = torch.randn(n, dtype=torch.bfloat16, device="cuda")
c = torch.add(a, b)

The maximum achievable bandwidth is something like 920 GB/s, about 92% of the rated peakThe 4090's rated HBM bandwidth is 1008 GB/s; GDDR6X on a 384-bit bus at 21 Gbps, per the Ada architecture whitepaper. On this chip the aten torch.add kernel gets ~922 GB/s, NVIDIA's CCCL binary_transform (their hand-tuned device-wide elementwise primitive) gets ~927 GB/s, and memcpy gets ~900 GB/s. A lot of variation is probably compounded by bad thermals on consumer GPUs, but 90% of peak being the max achievable is fairly typical..

For the operands, we want bytes that look like what a real inference kernel would be reading off HBM. So both operands are drawn from the empirical distribution of FP8 weights in Qwen3-14B-FP8, the same distribution measured in the entropy post.

The quantity we're tracking is what I'll call bandwidth amplification: the ratio of the time the raw kernel takes on uncompressed operands to the time the fused kernel takes on compressed operands, end to end. An amplification of 1 means the compressed path matched raw and we got nothing for our trouble, and anything above 1 is the result we're afterAs a reminder, fp8 weights for typical models are about ~15% compressible. So the best achievable bandwidth amplification factor is 1.15\approx 1.15..

How to do parallelisable compression and decompression

We've written about lossless compression in this series before, in both the rANS and tANS flavours of asymmetric numeral systems. For the rest of this post we'll work with tANS specifically, because ANS coders can sit much closer to the Shannon limit than something like Huffman codingHuffman assigns each symbol an integer number of bits, which is wasteful when a symbol's optimal code length isn't already an integer. ANS coders effectively encode in fractional bits, and routinely sit within a fraction of a percent of the entropy., and that matters when the absolute gains we're chasing are on the order of ten percent.

To make any coder any use for a vector-add kernel, we need to fold it into a massively parallel algorithm. The problem we immediately run into is that a naive tANS decoder is a serial state machine: each byte of weight data, as it is decoded, transforms the internal state one step at a time, so the next byte cannot be decoded until the previous one is done. We therefore have to build a codec on top of tANS that exposes parallel structure for the GPU to chew on.

The solution is to chop the input into many independent streams at encode time, each decoded later by its own GPU threadEach stream gets its own initial tANS state, encodes its own bytes into its own bit-stream, and finalises its state at the end. The compressed object stored in HBM is the concatenation of every stream's bit-stream, prefixed by a small offset table that tells the decoder where each one begins. Ignore the padding in the compressed representation of the data -- it's there to make the diagram line up.. One serial decode becomes many short parallel decodes, at an encoder-side cost we'll come back to.

compression
──────────────────────────────────────────────────────────────────

┌────────────────────────────────────────────────────────────────┐
│                          fp8 weights                           │
└────────────────────────────────────────────────────────────────┘


┌────────────┬────────────┬────────────┬────────────┬────────────┐
│  stream 0  │  stream 1  │  stream 2  │  stream 3  │  stream 4  │
└────────────┴────────────┴────────────┴────────────┴────────────┘
      │            │            │            │            │
 CPU thread   CPU thread   CPU thread   CPU thread   CPU thread
      ▼            ▼            ▼            ▼            ▼
┌─┬──────────┬───────┬────┬─────┬──────┬─────┬──────┬───┬────────┐
│0│░░░░░░░░░░│   1   │░░░░│  2  │░░░░░░│  3  │░░░░░░│ 4 │░░░░░░░░│
└─┴──────────┴───────┴────┴─────┴──────┴─────┴──────┴───┴────────┘

decompression
──────────────────────────────────────────────────────────────────

┌─┬───────┬─────┬─────┬───┐
│0│   1   │  2  │  3  │ 4 │ ← storage, index [0, 1, 8, 13, 18]
└─┴───────┴─────┴─────┴───┘


 CUDA thread  CUDA thread  CUDA thread  CUDA thread  CUDA thread
      │            │            │            │            │
┌─────┴────────────┴────────────┴────────────┴────────────┴──────┐
│     │            │            │            │            │      │
│    tANS         tANS         tANS         tANS         tANS    │
│     ▼            ▼            ▼            ▼            ▼      │
│                              vecadd                            │
└────────────────────────────────────────────────────────────────┘

What we actually compress

As we pointed out in the earlier post compressibility doesn't apply uniformly to every bit of every FP8 byte. An FP8 byte in the E4M3 format is eight bits, broken down as one sign bit, four exponent bits, and three mantissa bits.

fp8 e4m3 bit layout: 1 sign, 4 exponent, 3 significand

Across the weights of a real model, the sign-and-mantissa bits are close to uniform. The entropy is essentially four bits per nibble. No coder is going to do better than that, so we should avoid passing these nibbles through the codec if we can avoid it.

The exponent half is a different story. Across Qwen3-14B-FP8, the bulk of the weights pile up in just a few magnitude bins out of the sixteen possible, and the entropy of the exponent stream lands at 2.79 bits per symbol, not the 4 bits a uniform distribution would carry. So we break each FP8 down into two nibbles, only compress the exponents, and then reassemble fp8s in registers.

Qwen3-14B-FP8 weight distribution by log₂|w| bin (% of mass)


                                          ▄   █              
                                          █   █   ▄          
                                          █   █   █          
                                      ▃   █   █   █          
              ▁   ▂   ▃   ▄   ▆   █   █   █   █   █   ▂      
-17 -16 -15 -14 -13 -12 -11 -10 -9  -8  -7  -6  -5  -4  -3  

There are a few sources of overhead. tANS has to store its table somewhere: for our 4096-entry decoder, 16 KiB of decode table loaded once from HBM into shared memory per kernel invocation. Each tANS-encoded stream has its own final state (MM in tANS parlance) that the decoder starts unwinding from, plus a few bytes of bookkeeping for the last machine word that didn't fill cleanly. And we need an index table giving the start offset of each stream in the packed buffer. Together this lands at roughly 7 bytes of metadata per stream. The result is that we hit a compression rate of about 7 bits per fp8.

There's a parallelism-vs-compression tradeoff. Short streams give the GPU more parallel work but pay the metadata cost more often per byte; long streams compress better but offer fewer threads to keep busy. There are other penalties to long streams too when it comes to writing fast kernels -- for example, they can increase warp divergence, since the ends of streams become increasingly ragged as different streams finish decoding.

For the 4090 peak perf is at about 512 FP8 bytes per stream, about two million streams across a 1 GiB working set, which we find to be a balance short enough to keep the SMs saturated and long enough that the per-stream metadata doesn't dominate. To keep the decoder cheap at that stream count, the codec pair-codes the exponent stream: a single state transition emits two exponent symbols at once, halving the table-lookup work per output byte. The full encoder and decoder kernel are open source at doublewordai/tans-vectoradd.

RTX 4090, 1 GiB working set. Compression improves monotonically with N; amplification peaks where the compression headroom and the GPU's parallelism budget balance. The dashed horizontal line is break-even with the raw kernel.

The result

Putting all the pieces together, we have a fused decode-and-vecadd kernel that runs on a 4090. The kernel does the same vector add as the baseline. The only thing that differs is that it reads the operands as tANS streams and decodes them inside the kernel before adding.

On a 1 GiB working set, the fused kernel takes 3.24 ms, which is about 993 GB/s of logical FP8 throughput. That's roughly 1.09× amplification over torch.add's 922 GB/s, putting the fused kernel well above the ~920 GB/s speed-of-light for an uncompressed elementwise kernel on this chip.

All measurements on the same RTX 4090, 1 GiB working set. torch.add is bf16 (fp8 elementwise isn't supported); the fused bar is fp8 logical throughput. The access pattern is the same in both cases.

The measurement is consistent with the obvious model of what's happening. An uncompressed vecadd reads two bytes per FP8 output and writes one, for three bytes of HBM traffic per output. The fused vecadd reads about 7 compressed bits per input FP8 and writes one byte, for about 2.75 bytes of HBM traffic per output. The ratio of those two is 3 / 2.75 ≈ 1.09. The measured amplification matches the bandwidth-saving ratio to within measurement noise, which means the fused decoder is paying essentially zero overhead in HBM-time, and almost every byte the compression saves lands as a speedup.

Outlook

Part of what makes this idea interesting is the broader shift it points at. Most of the work that has gone into making large models faster has come from exploiting structural properties of the problem: compute graph shape, arithmetic intensity, attention layout. The next step is to look past the structure and into the data itselfInteresting analogue from CPU optimization: modern CPUs have started shipping data memory-dependent prefetchers that inspect loaded cache lines and speculatively dereference any value that looks like a pointer; the speedup depends on what the data actually contains.. Decompressing weights inside a memory-bound kernel sits in this family: the win comes from a property of the weight values themselves, rather than from anything you could read off the shape of the model.

Getting practical results on a 4090 is an existence proof that these sorts of techniques can be useful. But the road to a production kernel is steep. Most forward-pass time at training and prefill is compute-bound: GEMMs saturate the tensor cores, and compression there buys nothing. The kernels where bandwidth actually dominates are the ones with structurally low arithmetic intensity: attention at decode (each token reads the full KV cache), and MoE GEMMs at decode (tokens spread thinly across experts, so each expert's GEMM runs at small effective batch). So to get there we'd have to fuse this into one of those.

How to go beyond consumer GPUs? Datacenter GPUs are a different shape: Hopper-class GPUs have memory bandwidth around 4 TB/s, four to five times faster than a 4090. Blackwell goes up to 8TB/s. The decoder costs a handful of INT32 ops per byte: state transition, LDS lookup, bit extract, renorm. On a 4090 those ops fit inside the time HBM takes to deliver each byte; on Hopper the same INT32 throughput per SM has to keep up with four times the bandwidth, so the per-byte op budget is four times tighter. The INT32 ops-per-cycle per SM doubled on Blackwell, over Ada, so there's some hope. There's no theoretical reason it can't work. But if our 1TB/s 4090 decoder was hard, getting up to Blackwell's 8 TB/s is a mountain to climb.