Google boosts Gemma 4 speed with new MTP drafter models
- Source: Ars Technica
- Time: 5:13 PM
- Weight: 94/100
Google has introduced Multi-Token Prediction (MTP) drafter models for its Gemma 4 open AI series, significantly increasing local inference speeds. By employing a technique known as speculative decoding, these experimental models allow for up to a threefold increase in token generation speed.
The system works by using a lightweight drafter model to predict several future tokens, which the primary model then verifies in a single parallel pass. Because autoregressive decoding is typically memory-bandwidth-bound rather than compute-bound, this verification reuses compute cycles that would otherwise sit idle. The performance gains vary across hardware configurations, with Google reporting speedups of up to 3.1x on Pixel smartphones and 2.5x on Apple's M4 silicon.
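The draft-then-verify loop described above can be illustrated with a toy sketch. This is not Google's implementation; the `draft_model` and `target_model` functions below are hypothetical stand-ins (simple deterministic token rules) for a small drafter and a large target model, and the verification loop stands in for what would be one batched forward pass on real hardware:

```python
def draft_model(tokens):
    # Hypothetical cheap drafter: guesses the next token as last + 1, capped at 5.
    return min(tokens[-1] + 1, 5)

def target_model(tokens):
    # Hypothetical expensive target model: last + 1, capped at 7.
    # It disagrees with the drafter once values exceed 5.
    return min(tokens[-1] + 1, 7)

def speculative_decode(context, k=4, max_new=8):
    """Greedy speculative decoding sketch.

    The drafter proposes k tokens autoregressively (cheap); the target
    checks every proposed position. In a real system all k checks run
    in one parallel forward pass, which is where the speedup comes from.
    The output always matches what greedy decoding with the target
    alone would produce.
    """
    tokens = list(context)
    while len(tokens) - len(context) < max_new:
        # 1. Drafter speculates k tokens ahead.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Target verifies each position; accept the matching prefix.
        accepted = []
        for i in range(k):
            t = target_model(tokens + draft[:i])
            accepted.append(t)
            if t != draft[i]:
                break  # first mismatch: keep the target's token, discard the rest
        tokens.extend(accepted)
    return tokens
```

When the drafter agrees with the target (tokens up to 5 here), a whole block of k tokens is accepted per target pass; once they diverge, progress falls back to one verified token per pass, which is why reported speedups depend on how often the drafter guesses correctly.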