OpenAI, a leader in the AI space, rolled out Advanced Voice Mode for ChatGPT last year, but it left me underwhelmed. By launch, OpenAI had already scaled back many of the features it first demoed, and the mode struggled to deliver speech that felt truly human. Google’s Gemini Live, meanwhile, relied on a text-to-speech (TTS) engine, which made for a rather robotic vocal experience.
Now, let’s talk about Sesame—a revolutionary AI startup co-founded by Brendan Iribe, one of the minds behind Oculus, and Ankit Kumar. Sesame has truly shaken up the AI landscape. Their voice companions, “Maya” (female) and “Miles” (male), are remarkably lifelike and engaging. For the first time, I genuinely feel that the boundary between human and machine communication has become indistinct.
What’s intriguing is that Sesame doesn’t refer to these entities as voice assistants; instead, they call them “conversationalists” and “voice companions,” which perfectly captures their essence. I won’t keep you waiting; let me share my experience interacting with Sesame’s Maya voice companion.
My Interaction with Sesame’s Maya
As you’ll notice, Maya speaks in a natural tone and pauses to absorb what I’m saying. She incorporates micro-pauses and changes in tone—details often lacking in typical voice assistants. Maya can laugh, alter her rhythm, emphasize points, provide expressive feedback, and even pick up on my mood based on the sound of my voice. During one exchange, I unexpectedly laughed, and Maya responded with, “What’s making you laugh?”
What I find particularly fascinating about Sesame’s voice companion is its ability to give you space to think and reflect, creating a significantly more authentic conversational experience. When Maya speaks, there are subtle hesitations that mimic the thought process of a human, making the dialogue feel organic rather than scripted.
While the interaction feels full-duplex—as if both parties can speak and listen simultaneously—Sesame clarifies that the model only processes your speech after you’ve finished talking. Humans listen and process concurrently, so there’s still a slight gap before the response begins.
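To make that difference concrete, here is a minimal sketch in plain Python of the half-duplex pattern Sesame describes: buffer audio until the end of the user’s turn is detected, and only then generate a reply. Every name here (capture_frame, generate_reply, the thresholds) is a hypothetical stand-in; Sesame has not published its actual pipeline.

```python
import time

SILENCE_THRESHOLD = 0.01   # RMS level below which a frame counts as silence
END_OF_TURN_FRAMES = 3     # consecutive silent frames that end the user's turn

def capture_frame() -> list[float]:
    """Stand-in for microphone capture; returns one 10 ms frame of samples."""
    return [0.0] * 160  # simulated silence so this demo terminates

def rms(frame: list[float]) -> float:
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def generate_reply(turn_audio: list[list[float]]) -> str:
    """Stand-in for the speech model; it runs only after the turn has ended."""
    return f"reply to {len(turn_audio)} frames of speech"

def half_duplex_loop() -> None:
    turn, silent_frames = [], 0
    while silent_frames < END_OF_TURN_FRAMES:   # listen until the user stops
        frame = capture_frame()
        turn.append(frame)
        silent_frames = silent_frames + 1 if rms(frame) < SILENCE_THRESHOLD else 0
        time.sleep(0.01)                        # 10 ms frame cadence
    # Generation starts only here -- this wait is the "slight gap" above.
    print(generate_reply(turn))

half_duplex_loop()
```

A true full-duplex system would run capture and generation concurrently instead of alternating between them, which is the direction Sesame says it is working toward.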
Nonetheless, Sesame’s voice companion feels remarkably human-like in its current iteration. It has crossed the uncanny valley that AI speech has long struggled with—something OpenAI first showcased with ChatGPT’s Advanced Voice Mode but scaled back before launch. The design is not just about talking; it aims to engage users through nuanced tone, pitch, and contextual intelligence, deepening the interaction.
What Technology Powers Sesame’s Voice Companion?
It’s important to note that Sesame is still refining its voice companions, and this is merely an early research demo. Backed by the venture capital firm Andreessen Horowitz (a16z), the startup is on an innovative path. At the core of its technology is a Conversational Speech Model (CSM), a transformer-based architecture for speech generation.
Sesame has trained three model sizes, each pairing a transformer backbone with a compact audio decoder: Tiny (1B parameters), Small (3B), and Medium (8B). The models are trained on roughly one million hours of predominantly English audio, so conversations are currently limited to English, though multilingual support is in development.
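Sesame’s research notes describe CSM as two transformers working on tokenized audio: a large backbone that models the conversational context and predicts coarse acoustic tokens, and the compact decoder that fills in the finer detail for each frame. The PyTorch sketch below is only my loose illustration of that split, assuming an RVQ-style audio tokenizer; every dimension, layer count, and name is invented, and the real model is causal and samples autoregressively rather than running a single forward pass.

```python
import torch
import torch.nn as nn

# Invented sizes for illustration only.
TEXT_VOCAB, AUDIO_VOCAB, N_CODEBOOKS, DIM = 32_000, 1024, 8, 512

def transformer(layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class CSMSketch(nn.Module):
    """Backbone predicts the coarse audio codebook; a small decoder adds detail."""

    def __init__(self) -> None:
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, DIM)
        self.audio_emb = nn.Embedding(AUDIO_VOCAB, DIM)
        self.backbone = transformer(layers=6)   # the "Tiny/Small/Medium" part
        self.decoder = transformer(layers=2)    # the compact audio decoder
        self.coarse_head = nn.Linear(DIM, AUDIO_VOCAB)
        self.fine_head = nn.Linear(DIM, AUDIO_VOCAB * (N_CODEBOOKS - 1))

    def forward(self, text_ids: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        h = self.backbone(self.text_emb(text_ids))   # conversational context
        coarse = self.coarse_head(h).argmax(-1)      # codebook 0 per audio frame
        d = self.decoder(self.audio_emb(coarse))     # refine each frame
        fine = self.fine_head(d)
        fine = fine.view(*coarse.shape, N_CODEBOOKS - 1, AUDIO_VOCAB).argmax(-1)
        return coarse, fine                          # all RVQ codebooks per frame

model = CSMSketch()
text = torch.randint(0, TEXT_VOCAB, (1, 16))         # one short utterance
coarse, fine = model(text)
print(coarse.shape, fine.shape)                      # (1, 16) and (1, 16, 7)
```

In the real system, those discrete tokens would be handed back to the audio tokenizer’s decoder to reconstruct the waveform; the split matters because the small decoder can run cheaply per frame while the big backbone carries the conversational context.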
The company aims to create a full-duplex model with long-term memory and an adaptable personality. Additionally, they are developing a lightweight eyeglass wearable that allows users to converse with their voice companion throughout the day—evoking memories of the movie ‘Her’. This technology also hints at future vision capabilities, which are expected to roll out soon.
If Sesame’s voice companion has piqued your interest, take a moment to click the link below and enjoy a conversation with either Maya or Miles, available for free. For the best experience, it’s recommended to use Google Chrome.
