tANS: precomputing rANS

Fergus Finn
Founder & Member of Technical Staff, Doubleword

In Also-rANS, I walked through rANS: pack a stream of symbols into a single integer via reversible arithmetic, hit the Shannon limit asymptotically, no rounding to whole bits.

The cost shows up at runtime. Encoding takes a division ($\lfloor x / f_s \rfloor$); decoding takes another ($\lfloor x' / M \rfloor$). On modern hardware integer division takes tens of cycles, whereas multiplication takes a few. For a compressor that wants to push gigabytes per second through a single core, division on the hot path is too expensive.

tANS removes the division. We tighten rANS's renormalization invariant until there are only finitely many possible states, then precompute every transition and store it in a table. Per-symbol work collapses to a table lookup and a shift.

The compression is the same as rANS in the limit — same Shannon bound, same fractional-bit costs — but the implementation looks completely different. Encoding becomes a state machine driven by a precomputed transition table; decoding becomes a table walk that reads bits from a stream and emits symbols. tANS is what powers Yann Collet's FSE library and Zstandard's entropy stage.

The slot picture, revisited

ANS coders share a common shape. There's a state $x$ — a single integer that evolves as we encode each symbol. There's a frequency table $\{f_s\}$ over the alphabet, with total $M = \sum_s f_s$. And there's renormalization, which keeps $x$ inside some chosen range $[L, bL)$ by spilling low digits to (or refilling from) an output stream as the state grows or shrinks. $L$ is a lower bound; $b$ is the base of the stream, typically $2$ (bits) or $256$ (bytes).

The state and the alphabet meet in a slot table. Allocate the integers $[0, M)$ to symbols, giving each $s$ exactly $f_s$ slots. For our running alphabet ($f_A = 4$, $f_B = 3$, $f_C = 1$, $M = 8$), one valid allocation is

[A, A, A, A, B, B, B, C]

The decoder consults this table at every step. Compute the slot index $x \bmod M$, look up which symbol owns it. That's the next emitted symbol.

tANS's tightening choice is to set $L = M$ and $b = 2$. The state now lives in $[L, 2L)$, an interval containing exactly $L = M$ integers — the same size as the slot space.

In rANS, the state space is much bigger than the slot space. Many different states share the same slot, and $x \bmod M$ extracts the slot from the state. In tANS, the two are in bijection: the eight states $\{8, 9, \ldots, 15\}$ map one-to-one onto the eight slots $\{0, 1, \ldots, 7\}$, with $x - L$ giving the slot index directly.

This changes what the state is. There are now only $L$ possible states, and each is permanently associated with a single symbol — the one that owns its slot. State $8$ is an $A$ state. State $15$ is a $C$ state. The state isn't a number that contains a slot; it is a slot.

The decoder no longer computes $x \bmod M$ to find the next symbol — the state already is one. The interesting question is what happens to the state next, after a symbol is emitted.

Decoding via a table

The decoder has two jobs at each step: emit the symbol associated with the current state, and transition to the next state. The first is a lookup in the slot table. The second needs work.

In rANS, the decode transition is

$$x_{prev} = \left\lfloor \frac{x'}{M} \right\rfloor \cdot f_s + (x' \bmod M - c_s)$$

For tANS, $x'$ sits in $[L, 2L)$ and $L = M$, so $\lfloor x' / M \rfloor = 1$ always. The formula collapses to

$$x_{prev} = f_s + (x' - L - c_s)$$

The expression $x' - L - c_s$ is the offset of $x'$'s slot inside symbol $s$'s range $[c_s, c_s + f_s)$ — a number in $[0, f_s)$. So $x_{prev}$ lands in $[f_s, 2 f_s)$.

For our running alphabet, that means decoding from an A state lands at $x_{prev} \in [4, 8)$; a B state, $x_{prev} \in [3, 6)$; the C state, $x_{prev} = 1$. All of these are below $L = 8$, so the transition needs renormalization: pull bits from the stream into the low end of $x$ until it's back in $[L, 2L)$.
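As a sanity check, the collapse can be verified by brute force over all eight states — a sketch with the running alphabet's frequencies and cumulative counts hard-coded:

```python
# Running alphabet with contiguous slots: c_A = 0, c_B = 4, c_C = 7.
freqs = {"A": 4, "B": 3, "C": 1}
starts = {"A": 0, "B": 4, "C": 7}
L = M = 8
slot_table = ["A"] * 4 + ["B"] * 3 + ["C"]

for x in range(L, 2 * L):                    # every tANS state
    s = slot_table[x % M]                    # symbol owning this slot
    f, c = freqs[s], starts[s]
    full = (x // M) * f + (x % M - c)        # general rANS transition
    collapsed = f + (x - L - c)              # tANS form: floor(x / M) = 1
    assert full == collapsed
    assert f <= collapsed < 2 * f            # lands in [f_s, 2 f_s)
```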

Crucially, the number of refill bits is fully determined by which state we're decoding from. Decoding from state $15$ produces C and lands at $x_{prev} = 1$, which needs three refills to reach $[8, 16)$. State $12$ produces B and lands at $x_{prev} = 3$, which needs two. States $13$ and $14$ (also B) land at $x_{prev} = 4$ and $5$, each needing one refill. All four A-states need exactly one.

Everything the decoder needs — symbol, pre-refill state, refill count — depends only on the current state. So precompute it:

state  symbol  x_prev  refills
8      A       4       1
9      A       5       1
10     A       6       1
11     A       7       1
12     B       3       2
13     B       4       1
14     B       5       1
15     C       1       3

A decode step is now: look up the row for the current state, emit the symbol, pull refills bits off the stream into the low end of x_prev, set the state to the result. No division.
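That table can be generated mechanically from the slot layout. A sketch (`build_decode_table` is an illustrative helper, not library code, with the contiguous allocation and its cumulative counts passed in):

```python
def build_decode_table(slot_table, freqs, starts, L):
    """One row per state in [L, 2L): (symbol, pre-refill state, refill count)."""
    table = []
    for x in range(L, 2 * L):
        s = slot_table[x - L]                      # the state *is* its slot
        x_prev = freqs[s] + (x - L - starts[s])    # collapsed decode transition
        refills = 0
        while (x_prev << refills) < L:             # bits needed to re-enter [L, 2L)
            refills += 1
        table.append((s, x_prev, refills))
    return table

freqs = {"A": 4, "B": 3, "C": 1}
starts = {"A": 0, "B": 4, "C": 7}
table = build_decode_table(["A"] * 4 + ["B"] * 3 + ["C"], freqs, starts, L=8)
# table[i] is the row for state 8 + i; e.g. table[7] == ("C", 1, 3)
```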

A worked example. We'll construct the encoded form properly in the next section; for now take it on faith that the message $A, B, C$ encodes to starting state $x = 10$ with bit stream $[0, 0, 0, 1, 1, 0]$, rightmost bit popped first.

Decode at $x = 10$: A, $x_{prev} = 6$, 1 refill. Pop $0$ → state $6 \cdot 2 + 0 = 12$.

Decode at $x = 12$: B, $x_{prev} = 3$, 2 refills. Pop $1$ → state $3 \cdot 2 + 1 = 7$. Pop $1$ → state $7 \cdot 2 + 1 = 15$.

Decode at $x = 15$: C, $x_{prev} = 1$, 3 refills. Pop $0$ → $2$, pop $0$ → $4$, pop $0$ → $8$.

The stream is empty and the state is $8$ — the seed. Decoded message: $A, B, C$.

Encoding via a table

Encoding inverts decoding. Given the current state $x \in [L, 2L)$ and a symbol $s$ to encode, find the unique $x' \in [L, 2L)$ such that the decoder, starting at $x'$, emits $s$ and lands back at $x$ after refilling from the stream.

The decoder at $x'$ ran two steps: compute $x_{prev} = f_s + (x' - L - c_s)$, then refill bits from the stream into the low end of $x_{prev}$ until back in $[L, 2L)$. Inverting, the encoder:

  1. Spills enough low bits of $x$ to drop it into $[f_s, 2 f_s)$. The spill count $k$ is the smallest non-negative integer with $\lfloor x / 2^k \rfloor \in [f_s, 2 f_s)$.
  2. Sets $x_{prev} = \lfloor x / 2^k \rfloor$.
  3. Computes $x' = L + c_s + (x_{prev} - f_s)$.

Step 3 inverts the decoder's slot lookup: $x_{prev} - f_s$ is the offset within $s$'s range, $c_s$ shifts to the absolute slot index, and $+L$ converts slot to state.

Both the spill count and the next state depend only on $x$ and $s$, so encoding is also table-driven. Implementations store a (state, symbol) → (spill count, next state) table; the spilled bits themselves are just $x \bmod 2^k$, appended low-bit-first.

For our running alphabet, encoding A always spills 1 bit; encoding C always spills 3. Both match Shannon: $\log_2(8/4) = 1$ and $\log_2(8/1) = 3$. Encoding B spills 1 bit from low states (8–11) and 2 bits from high states (12–15) — averaging closer to $1.5$ bits than Shannon's $\log_2(8/3) \approx 1.415$. The gap is the slot allocation's fault; the next section shows how a smarter spread closes it.
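These costs fall straight out of the spill rule. A short check (a sketch; `spill` is an illustrative helper, not library code):

```python
def spill(x, f):
    """Smallest k with floor(x / 2**k) in [f, 2f)."""
    k = 0
    while (x >> k) >= 2 * f:
        k += 1
    return k

# Per-state spill counts for the contiguous layout, states 8..15.
costs = {s: [spill(x, f) for x in range(8, 16)]
         for s, f in [("A", 4), ("B", 3), ("C", 1)]}
# A: always 1 bit; C: always 3; B: 1 bit from states 8-11, 2 from 12-15.
```

Averaged over the eight possible states, B costs $1.5$ bits, which is the gap described above.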

A worked example. Encode the message $A, B, C$ from seed state $L = 8$. Encoding processes the message in reverse, so encode C first.

Encode C from $x = 8$: $k = 3$. Append the three low bits of $8$: $0, 0, 0$. $x_{prev} = 1$. New state $x' = 8 + 7 + 0 = 15$.

Encode B from $x = 15$: $k = 2$, since $\lfloor 15/2 \rfloor = 7$ overshoots $[3, 6)$. Append the two low bits of $15$: $1, 1$. $x_{prev} = 3$. New state $x' = 8 + 4 + 0 = 12$.

Encode A from $x = 12$: $k = 1$. Append the low bit of $12$: $0$. $x_{prev} = 6$. New state $x' = 8 + 0 + 2 = 10$.

Final state $10$, stream $[0, 0, 0, 1, 1, 0]$ in append order — the same encoded form decoded in the previous section.

Building the tables: spreading symbols

The slot table has been a free parameter throughout. The contiguous allocation $[A, A, A, A, B, B, B, C]$ was an arbitrary choice; the previous section showed the cost — encoding B averages $\sim 1.5$ bits per symbol against Shannon's $1.415$.

The layout matters because the encode spill count depends on the current state. With contiguous slots, the B-states are $12$, $13$, and $14$ — the upper half of $[L, 2L)$, where encoding B costs $2$ bits instead of $1$. Whenever the encoder lands in a B-state and the next symbol is also B, we pay the extra bit. Clustering symbols together creates state correlations that hurt amortized cost.

The fix: spread each symbol's slots roughly uniformly across $[0, L)$. A symbol with $f_s$ slots should land roughly once every $L / f_s$ positions, regardless of which symbol came before.

Yann Collet's heuristic does this with one hash-like walk. Set the stride $\sigma = L/2 + L/8 + 3 \approx 5L/8$. Walk a cursor around $[0, L)$, stepping by $\sigma$ modulo $L$ each time, and place each symbol $f_s$ times in succession before moving to the next. The intuition: $5L/8$ is far from any simple fraction of $L$, so the orbit doesn't cluster; the $+3$ keeps the stride coprime to $L$ for power-of-two sizes, so every position is visited exactly once before the cursor returns.

To see the spread do anything interesting, bump our running example up to $f_A = 8$, $f_B = 6$, $f_C = 2$, $M = L = 16$ — same probabilities, more states. The stride is $\sigma = 8 + 2 + 3 = 13$. Starting from cursor $0$ and stepping by $13$ modulo $16$, the first eight visits go to A, the next six to B, the last two to C:

A: 0, 13, 10, 7, 4, 1, 14, 11
B: 8, 5, 2, 15, 12, 9
C: 6, 3

Sorted by position, the slot table reads

position:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
symbol:    A  A  B  C  A  B  C  A  B  B  A  A  B  A  A  B

Each symbol is sprinkled across the full range with no contiguous runs. With this spread, the encoder's bit costs hit Shannon much more closely. B's amortized cost drifts toward $1.415$, and the slack continues to close as $L$ grows.
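The walk is easy to reproduce directly — a sketch of the stride heuristic with this example's frequencies hard-coded:

```python
L = 16
sigma = (L >> 1) + (L >> 3) + 3          # 8 + 2 + 3 = 13
spread = [None] * L
cursor = 0
for s, f in [("A", 8), ("B", 6), ("C", 2)]:
    for _ in range(f):                   # place each symbol f_s times
        spread[cursor] = s
        cursor = (cursor + sigma) % L
print("".join(spread))                   # AABCABCABBAABAAB
```

After exactly $L$ steps the cursor is back at $0$, confirming that the stride visits every position once.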

Putting it together

Here is complete tANS in pseudocode. Pick a state range $[L, 2L)$ with $L = M = \sum_s f_s$, build the spread and the encode/decode tables, and run the loops.

def build_spread(freq, L):
    """Yann's hash-like walk: place each symbol f_s times, stepping by sigma."""
    sigma = (L >> 1) + (L >> 3) + 3
    spread = [None] * L
    cursor = 0
    for s, (f, _) in freq.items():
        for _ in range(f):
            spread[cursor] = s
            cursor = (cursor + sigma) % L
    return spread

def encode(message, freq, L, encode_table):
    x = L                                # seed state
    stream = []
    for s in reversed(message):          # process LIFO
        f, _ = freq[s]
        while x >= 2 * f:                # spill low bit until x in [f, 2f)
            stream.append(x & 1)
            x >>= 1
        x = encode_table[s][x - f]       # look up next state
    return x, stream

def decode(x, stream, decode_table, L, n):
    message = []
    for _ in range(n):
        s, x_prev, refills = decode_table[x - L]
        message.append(s)
        for _ in range(refills):         # refill bits until back in [L, 2L)
            x_prev = (x_prev << 1) | stream.pop()
        x = x_prev
    return message

The encode_table and decode_table are built once from the spread and reused for every message. encode_table[s][r] is the next state when encoding $s$ with $x_{prev} = f_s + r$. decode_table[i] is the triple $(s, x_{prev}, \text{refills})$ for state $L + i$.
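Table construction is the one piece the pseudocode leaves implicit. A sketch of a `build_tables` function matching those shapes (an assumption made here, consistent with the collapsed formula: each symbol's destination states $[f_s, 2f_s)$ are assigned to its slots in increasing position order, which reduces to the $c_s$ formula for contiguous layouts):

```python
def build_tables(spread, freq, L):
    """Build encode/decode tables from a slot layout.

    encode_table[s][r]: next state when encoding s with x_prev = f_s + r.
    decode_table[i]:    (symbol, x_prev, refills) for state L + i.
    """
    encode_table = {s: [None] * f for s, (f, _) in freq.items()}
    decode_table = [None] * L
    rank = {s: 0 for s in freq}                # slots of s seen so far
    for pos, s in enumerate(spread):
        f, _ = freq[s]
        r, rank[s] = rank[s], rank[s] + 1
        encode_table[s][r] = L + pos           # state owning s's r-th slot
        x_prev = f + r
        refills = 0
        while (x_prev << refills) < L:         # bits to refill back into [L, 2L)
            refills += 1
        decode_table[pos] = (s, x_prev, refills)
    return encode_table, decode_table

# Check against the contiguous running example (L = 8):
freq = {"A": (4, 0), "B": (3, 4), "C": (1, 7)}
enc, dec = build_tables(["A"] * 4 + ["B"] * 3 + ["C"], freq, 8)
assert dec[7] == ("C", 1, 3)     # state 15's row from the decode table
assert enc["C"][0] == 15         # encoding C from x_prev = 1 lands at state 15
```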

The decode inner loop is what makes tANS fast. Per symbol: one table lookup, one append, a handful of shifts and bitwise-ors, a stack pop. No division, no symbol search, no variable-length prefix code. Encoding is similarly cheap, with the spill loop usually running for one or two iterations. This is why tANS sits at the heart of FSE and Zstandard's entropy stage: all the per-symbol work happens in tight loops over precomputed data.

That is tANS. A finite state machine with each state labelled by a symbol, encode and decode tables that turn every step into a few lookups and bit shifts, and a spread that distributes symbols evenly across the state space. Same Shannon-bounded compression as rANS, at table-driven speed.