Gemma 4 ships with lightweight drafter models (gemma-4-E4B-it-assistant, gemma-4-26B-A4B-it-assistant, etc.) that enable Multi-Token Prediction for speculative decoding. llama.cpp already landed support for this and folks are seeing 40-50% faster generation. Would be great to have this in Gemma4.java.
https://x.com/ggerganov/status/2056391115469689330
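
For context, here is a rough sketch of the draft-and-verify loop behind greedy speculative decoding. It is not how llama.cpp or Gemma4.java implement it: the class name, the `generate` signature, and the toy lambdas standing in for the target and drafter models are all hypothetical. A real implementation would verify the whole draft in a single batched forward pass of the target model, which is where the speedup comes from. This sketch calls the target once per position only to keep the logic readable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical sketch, not part of Gemma4.java or llama.cpp.
final class SpeculativeDecodingSketch {

    /**
     * Greedy speculative decoding. Both models map a token context to the
     * next token id. The result always equals plain greedy decoding with
     * the target model; the drafter only affects how much work is reused.
     */
    static List<Integer> generate(ToIntFunction<List<Integer>> target,
                                  ToIntFunction<List<Integer>> drafter,
                                  List<Integer> prompt,
                                  int maxNewTokens,
                                  int draftLen) {
        if (draftLen < 1) {
            throw new IllegalArgumentException("draftLen must be >= 1");
        }
        List<Integer> out = new ArrayList<>(prompt);
        int limit = prompt.size() + maxNewTokens;

        while (out.size() < limit) {
            // 1. Draft: the small model proposes draftLen tokens autoregressively.
            List<Integer> draftCtx = new ArrayList<>(out);
            List<Integer> draft = new ArrayList<>();
            for (int i = 0; i < draftLen; i++) {
                int token = drafter.applyAsInt(draftCtx);
                draft.add(token);
                draftCtx.add(token);
            }

            // 2. Verify: compare each drafted token with the target's choice.
            //    (Assumption: a real engine does this in one batched pass.)
            for (int drafted : draft) {
                if (out.size() >= limit) {
                    break;
                }
                int expected = target.applyAsInt(out);
                out.add(expected); // always keep the target's token
                if (expected != drafted) {
                    break; // first mismatch: discard the rest of the draft
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy target: counts upward modulo 10.
        ToIntFunction<List<Integer>> target =
                ctx -> (ctx.get(ctx.size() - 1) + 1) % 10;
        // Toy drafter: agrees with the target except after token 4.
        ToIntFunction<List<Integer>> drafter = ctx -> {
            int last = ctx.get(ctx.size() - 1);
            return last == 4 ? 0 : (last + 1) % 10;
        };
        System.out.println(generate(target, drafter, List.of(0), 12, 4));
    }
}
```

Because every emitted token is the target model's own greedy choice, a bad drafter only costs speed, never output quality. That property makes it easy to test an integration by comparing against non-speculative output.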