Watch this engaging argument with Sesame’s CSM, crafted by Gavin Purcell.
Gavin Purcell, co-host of the AI for Humans podcast, shared an intriguing video on Reddit in which a human plays an embezzler arguing with their boss. The exchange is so lively that it is hard to tell which speaker is the human and which is the AI. Judging from our own time with the demo, such a performance is entirely plausible.
“Near-Human Quality”
Sesame’s CSM pairs two AI models, a backbone and a decoder, both based on Meta’s Llama architecture, allowing it to process text and audio together. Sesame trained three model sizes; the largest totals 8.3 billion parameters and was trained on roughly one million hours of predominantly English audio.
Unlike many earlier text-to-speech systems, which generate speech in two separate phases, first producing semantic tokens and then filling in acoustic detail, Sesame’s CSM is a single-stage multimodal transformer that processes interleaved text and audio tokens directly, an approach similar to how OpenAI’s voice model operates.
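To make the single-stage design concrete, here is a minimal sketch in PyTorch of how a backbone-plus-decoder layout over interleaved text and audio tokens could look. Every class name, dimension, vocabulary size, and layer count below is an illustrative assumption, not Sesame’s published configuration.

```python
# A minimal sketch, not Sesame's actual code: a large "backbone" transformer
# consumes text and audio tokens in one sequence, and a smaller "decoder"
# turns its hidden states into audio codec token predictions.
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB, D_MODEL = 32_000, 2_048, 512  # hypothetical sizes

class BackboneWithDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Projecting both modalities into a shared embedding space is what
        # lets text and audio be handled in a single stage rather than two.
        self.text_emb = nn.Embedding(TEXT_VOCAB, D_MODEL)
        self.audio_emb = nn.Embedding(AUDIO_VOCAB, D_MODEL)
        big = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(big, num_layers=6)    # "big" model
        small = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(small, num_layers=2)   # "small" model
        self.audio_head = nn.Linear(D_MODEL, AUDIO_VOCAB)  # predicts codec tokens

    def forward(self, text_ids, audio_ids):
        # Interleaving is approximated here by simple concatenation; the real
        # system's token ordering and masking are not public at this level.
        seq = torch.cat([self.text_emb(text_ids), self.audio_emb(audio_ids)], dim=1)
        hidden = self.backbone(seq)
        return self.audio_head(self.decoder(hidden))  # logits over audio tokens

model = BackboneWithDecoder()
text = torch.randint(0, TEXT_VOCAB, (1, 12))    # dummy text token ids
audio = torch.randint(0, AUDIO_VOCAB, (1, 50))  # dummy audio codec token ids
print(model(text, audio).shape)  # torch.Size([1, 62, 2048])
```

The key design point the sketch illustrates is that merging both token streams into one sequence lets a single transformer attend across modalities, instead of passing intermediate representations between two separately trained stages.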
In blind tests without any conversational context, listeners showed no clear preference between CSM-generated audio and real human recordings, suggesting the model can produce near-human speech in isolated samples. When conversational context was provided, however, evaluators still preferred real human voices, indicating there is room for growth in contextual speech generation.
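As a rough illustration of what “no clear preference” means in a blind paired test, the sketch below tallies listener choices over a set of trials; this is not Sesame’s evaluation code, and the trial outcomes are randomly generated purely for demonstration.

```python
# Illustrative only: in a blind paired test, each trial plays one human clip
# and one model clip, and records which the listener preferred. A win rate
# near 50% means listeners could not reliably tell the two apart.
import random

random.seed(0)
# Hypothetical outcomes: True = listener preferred the model's clip.
trials = [random.random() < 0.5 for _ in range(200)]
win_rate = sum(trials) / len(trials)
print(f"Model clip preferred in {win_rate:.0%} of trials")  # ~50% => indistinguishable
```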
Brendan Iribe, co-founder of Sesame, acknowledged the model’s limitations in a Hacker News comment, pointing out that the system sometimes adopts an inappropriate tone and still struggles with pacing, interruptions, and conversational flow. “Right now, we’re in a valley of challenges, but we believe we can overcome them,” he wrote.