Price Reduction for Qwen3-235B on Doubleword
Today we’re reducing the price of our highest-intelligence model, Qwen3-235B-A22B-Instruct.
This is a model with intelligence on par with the strongest frontier systems available today, and from now on it’s significantly cheaper to run with Doubleword. We’re dropping pricing (per 1M tokens, input / output):
From: $0.20 / $0.60
To: $0.10 / $0.40
That makes this level of intelligence roughly 180× cheaper than comparable models from Anthropic ($15 / $75).
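For anyone who wants to check the arithmetic, here is a minimal sketch of where the 180× comes from, assuming both providers quote prices per 1M tokens and the comparison is on combined input and output rates:

```python
# Back-of-envelope check of the 180x figure.
# Assumption: both sets of prices are USD per 1M tokens (input / output),
# and the comparison combines input and output rates.
anthropic_input, anthropic_output = 15.00, 75.00      # $ / 1M tokens
doubleword_input, doubleword_output = 0.10, 0.40      # $ / 1M tokens

ratio = (anthropic_input + anthropic_output) / (doubleword_input + doubleword_output)
print(ratio)  # 180.0
```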
This isn’t a promotion or a short-term incentive. It’s a permanent reduction, and it’s the result of how we think about inference from first principles.
Optimising for throughput, not latency
Most of the industry has been optimising its entire tech stack for latency. Sub-second responses became the benchmark, and infrastructure, pricing, and developer habits evolved around that goal. That makes sense for a narrow class of use cases: when a human is sitting there waiting for a response, latency matters enormously.
But what’s quietly become clear over the last year is that most inference workloads don’t look like that at all. The fastest-growing use cases are background jobs such as model evaluations, synthetic data generation, and async agents, which run for minutes or hours at a time. In these cases, whether a result arrives in 200 milliseconds or 20 minutes often makes no practical difference. And yet many teams still run these workloads on real-time APIs by default.
The result is what we think of as the latency tax. If your workload can wait, running it on real-time infrastructure means paying a large premium for speed you don’t need. In practice, teams often overpay by 80–90%, not because the model is any better, but because the entire system is tuned for fast responses their workload doesn’t require.
Inference at Doubleword
All of our inference research starts with a simple question:
How can we make inference 100× cheaper?
That question shapes almost every decision we make, from which hardware to use to which providers to work with. We build our inference and scaling engine around innovations that optimise for throughput (you can see some of our recent research here), and we’ve built a developer experience designed specifically for long-running, high-volume workloads.
Our research and infrastructure teams are constantly lowering the cost floor. Whenever we unlock these gains, we pass them on to our customers.
