Google has introduced Gemma 4 12B, a model designed for complex multistep reasoning and agentic workflows, much like the larger Gemma variants. It incorporates Multi-Token Prediction (MTP) drafters, which improve speed and efficiency by using otherwise idle processing cycles to predict likely future tokens. MTP versions are also available for other Gemma 4 models, but in this model MTP is a standard capability.
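The article does not spell out how the drafted tokens are consumed. A common pattern for drafters is draft-and-verify (often called speculative decoding), and the toy sketch below assumes that pattern. Every name in it (`target_next`, `drafter_next`, `speculative_step`, `generate`) is hypothetical and stands in for a real model; none of it is Gemma's API.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the main model: next token is (last + 1) mod 10."""
    return (context[-1] + 1) % 10


def drafter_next(context: List[Token]) -> Token:
    """Toy stand-in for a cheap drafter: agrees with the target except after a 4."""
    last = context[-1]
    return 0 if last == 4 else (last + 1) % 10


def speculative_step(context: List[Token], target: NextFn,
                     drafter: NextFn, k: int) -> List[Token]:
    """Draft k tokens cheaply, then keep only what the target agrees with."""
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # A real system would check all drafted positions in one batched pass;
    # this loop checks them one by one for clarity.
    accepted: List[Token] = []
    for tok in draft:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
    return accepted


def generate(prompt: List[Token], n_tokens: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_tokens and report how many draft-and-verify rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, target_next, drafter_next, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


tokens, rounds = generate([0], n_tokens=8, k=4)
print(tokens, rounds)
```

In this toy setup the output matches what the target would produce on its own, but it takes fewer rounds than tokens because each round can accept several drafted tokens at once.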
Gemma 4 12B also gains efficiency from a new approach to multimodal input, which covers text, audio, and images. Most generative AI models, including the other versions of Gemma 4, rely on separate encoders for non-text inputs. The new model instead introduces a streamlined embedding module for vision: it uses a single matrix multiplication and a positional embedding, so image data passes directly into the large language model (LLM) without an intermediary encoder. Audio is likewise handled without an encoder, with raw audio signals projected directly into the vector space used for text tokens.
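To make the "single matrix multiplication plus positional embedding" idea concrete, here is a minimal sketch of an encoder-free vision embedding. It assumes the image is cut into fixed-size patches first, which the article does not state. The patch size, embedding width, random weights, and all function names are illustrative assumptions, not Gemma's actual parameters or code.

```python
import random
from typing import List

Matrix = List[List[float]]


def split_into_patches(image: Matrix, patch: int) -> Matrix:
    """Cut an H x W grayscale image into flattened patch x patch tiles (row-major)."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def embed_image(image: Matrix, patch: int,
                projection: Matrix, positions: Matrix) -> Matrix:
    """Map an image to token-like vectors: one matmul, then add position vectors."""
    patches = split_into_patches(image, patch)   # (n_patches, patch * patch)
    projected = matmul(patches, projection)      # (n_patches, d_model)
    return [[v + p for v, p in zip(vec, pos)]
            for vec, pos in zip(projected, positions)]


# Hypothetical sizes chosen only to keep the demo small.
PATCH, D_MODEL = 2, 8
rng = random.Random(0)
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
projection = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(PATCH * PATCH)]
positions = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(4)]

image_tokens = embed_image(image, PATCH, projection, positions)
print(len(image_tokens), len(image_tokens[0]))
```

The resulting vectors have the same width as the model's text embeddings in this sketch, so they could sit in the same input sequence as text tokens with no separate encoder in between.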
The new model can be used without downloading through platforms such as LM Studio and Google AI Edge Gallery. Users who want to run it locally can download the model weights, which total just under 18GB, from Kaggle and Hugging Face.
Source: arstechnica.com via Google News.

