Diffusion models are showing performance on par with traditional models of the same size while generating text far faster. The researchers behind LLaDA found that their 8-billion-parameter model matches the performance of Llama 3 8B across several benchmarks, performing strongly on tasks like MMLU, ARC, and GSM8K.
Meanwhile, Mercury claims even more significant speed gains. Its Mercury Coder Mini scored 88.0 percent on HumanEval and 77.1 percent on MBPP, roughly matching GPT-4o Mini on those coding benchmarks, while generating 1,109 tokens per second to GPT-4o Mini's 59. That makes Mercury's model about 19 times faster than GPT-4o Mini with similar performance on coding tasks.
According to Mercury’s documentation, its models can run “at over 1,000 tokens/sec on Nvidia H100s,” a speed previously achievable only with specialized hardware from companies like Groq and Cerebras. Even compared with other fast models, Mercury Coder Mini stands out: it is roughly 5.5 times faster than Gemini 2.0 Flash-Lite (201 tokens per second) and 18 times faster than Claude 3.5 Haiku (61 tokens per second).
Exploring New Possibilities in LLMs
While diffusion models have clear benefits, they come with trade-offs. They typically need multiple passes through the network to produce a full response, whereas traditional autoregressive models need only one pass per generated token. Because each diffusion pass updates all tokens in parallel, though, the model can still reach higher overall throughput despite the extra passes, as the sketch below illustrates.
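To make that trade-off concrete, here is a minimal toy sketch in Python. It is not code from Mercury or LLaDA: `ar_model` and `denoise_model` are hypothetical stand-ins for a real network's forward pass, and the simple left-to-right unmasking schedule is a stand-in for the more sophisticated schedules real diffusion LMs use. The point is only the pass count: eight passes for eight tokens autoregressively, versus four parallel denoising passes.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "<mask>"]

def ar_model(prefix):
    # Hypothetical stand-in for an autoregressive forward pass:
    # given the tokens so far, predict the single next token.
    return random.choice(VOCAB[:-1])

def denoise_model(tokens):
    # Hypothetical stand-in for a diffusion forward pass:
    # re-predict every masked position in one shot.
    return [random.choice(VOCAB[:-1]) if t == "<mask>" else t for t in tokens]

def autoregressive_generate(n_tokens):
    # One network pass per emitted token: n_tokens passes in total.
    out = []
    for _ in range(n_tokens):
        out.append(ar_model(out))
    return out

def diffusion_generate(n_tokens, n_steps=4):
    # Start from an all-masked sequence and refine it over a fixed
    # number of denoising passes; each pass touches every token.
    tokens = ["<mask>"] * n_tokens
    for step in range(n_steps):
        proposal = denoise_model(tokens)
        # Freeze a growing prefix each step. Real diffusion LMs use
        # smarter (e.g. confidence-based) unmasking schedules.
        keep = n_tokens * (step + 1) // n_steps
        tokens = proposal[:keep] + tokens[keep:]
    return tokens

print(autoregressive_generate(8))  # 8 passes for 8 tokens
print(diffusion_generate(8))       # only 4 passes for the same 8 tokens
```

If each pass takes similar wall-clock time, the diffusion loop finishes in half as many passes here, and the gap widens as sequences get longer than the number of denoising steps.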
Inception believes these speed gains could significantly enhance tools for code completion, improving developer efficiency. They also see potential in conversational AI, mobile apps, and other AI agents that require quick responses.
If diffusion-based language models can maintain their quality while ramping up speed, they could revolutionize AI text generation. AI researchers are eager to explore these new methods.
Simon Willison, an independent researcher, shared with Ars Technica, “I love that people are experimenting with alternative architectures to transformers. It shows how much of the LLM space is still uncharted.” Former OpenAI researcher Andrej Karpathy mentioned on X that Inception’s model could bring fresh insights and encourage experimentation.
Still, questions linger about whether larger diffusion models can rival frontier models like GPT-4o and Claude 3.7 Sonnet, especially on complex reasoning tasks. For now, they offer a promising alternative for smaller AI language models, one that gains speed without appearing to sacrifice capability.