The future of AI is voice


Humanity tends to solve problems with temporary fixes. We often settle for a partial solution—a bridge technology—until the real breakthrough arrives to replace it. In the tech world, we refer to this as a Minimum Viable Product, or MVP. But this concept predates Silicon Valley by centuries.

Think about photography. In the beginning, if the nobility wanted a likeness, they commissioned a painting. It was slow and expensive. Then came the daguerreotype. It wasn’t perfect, but it was a marvel of its time. Eventually, we moved on to black-and-white film photography. It was limited, yes, but the possibilities were fascinating.

It took decades for color photography to emerge, and even then, it faced resistance. Purists and artists like Henri Cartier-Bresson scoffed at it, arguing it was “vulgar” and lacked the soul of the monochrome image. Yet, the technology advanced. Many of us grew up in that era of film and paper, unaware that the medium was about to vanish.

Then came digital. Initially, the adherents of traditional photography looked at these new, clumsy cameras with skepticism. The images were pixelated; the dynamic range was poor. “This is for amateurs,” they said. But the industry kept pouring money into R&D, design, and supply chains. Adoption grew. Newcomers didn’t care about “soul”; they wanted to capture a moment without the hassle of a darkroom. They learned to connect cables to computers and manage memory cards—friction points that simply didn’t exist before. The era of film was ending.

As image quality improved and the internet became ubiquitous, we started uploading these photos to blogs hosted in distant data centers. Cloud computing began enabling instant distribution to the four corners of the planet. Broadband investments accelerated this shift, video gained traction, and camcorders had their brief moment in the sun before the next great convergence.

Born from the need for voice communication, the mobile phone quickly mutated. By the 2000s, SMS was a craze, with teenagers glued to the physical alphanumeric keypads of their Nokia bricks. Then came the BlackBerry, bringing email and a QWERTY keyboard to the palm of the executive.

But the real disruption happened when photos, music, and video collided. Compression standards like MP3 and MPEG made media portable. Steve Jobs, who had returned to Apple in 1997, didn’t miss a beat. He launched the iPod in 2001 and the iPhone in 2007.

The iPhone... It wasn’t just a phone. It was the apex of digital convergence: phone, agenda, calculator, clock, map, browser, and gaming console all in one. In time, it carried two economies within it: the App Store and the iTunes music store. It wasn’t perfect—it famously lacked a “copy and paste” function at launch—but it changed human behavior forever.

Of course, the transformation of the smartphone into a mass consumer object brought anxieties. People worried about radiation; some refused to carry them. But as broadband speeds increased, the stage was set for social media. Pioneers like Orkut fell by the wayside. Facebook (long before the Meta rebrand) pivoted hard to mobile, eventually acquiring Instagram—a platform built entirely on the premise of sharing those digital photos we mentioned earlier.

Here is where we return to the thesis. If you trace the line from medieval paintings to TikTok videos, you see that the human interest in the image remained constant, but the medium evolved ruthlessly.

I believe we are at a similar inflection point with Generative AI. And one of the oldest “MVP” solutions for communicating with computers is about to die.

The typewriter became a commercial reality around 1874, initially to produce legible documents faster than handwriting. With the rise of the personal computer, we adapted this 19th-century mechanical interface—the QWERTY layout—to our modern machines.

But with the arrival of Generative AI, the keyboard’s days are numbered. The candidate to replace it? Voice.

Voice is fluid. It is natural. It is how we have communicated for millennia before we ever touched a key. Imagine a computing system that understands you exactly as you speak. Actually, stop imagining. It exists.

Voice recognition systems have been used commercially since the 1990s. But today, they have gained a new layer of abstraction. They don’t just transcribe; they filter semantic noise and extract intent. They decipher the message and produce a response that emulates a fluent speaker, complete with intonation, idiom, and nuance.

When you couple this recognition with a generative voice model and sufficient processing power, the equation is solved: we have a fully conversational system. This is the next paradigm shift. Exit the keyboard; enter the microphone and the wireless earbud. The hardware has been ready for years. We were just waiting for the “brain”—the GenAI—to make the experience fluid rather than frustrating.
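The loop described above—recognize, extract intent, generate a reply—can be sketched in a few lines. This is a toy illustration, not any vendor’s actual API: a real system would replace `transcribe` with a speech-to-text model and `respond` with a generative model, but the shape of the pipeline is the same.

```python
# Toy sketch of the recognize -> understand -> respond loop.
# Both ends are stubbed so the data flow is visible.

def transcribe(audio: bytes) -> str:
    # Stand-in for a speech-to-text model; here audio is just UTF-8 text.
    return audio.decode("utf-8")

def extract_intent(utterance: str) -> dict:
    # Stand-in for the "semantic filtering" stage: drop filler words,
    # keep the actionable request.
    filler = {"um", "uh", "like", "please", "hey"}
    words = [w for w in utterance.lower().split() if w not in filler]
    return {"action": words[0] if words else None, "args": words[1:]}

def respond(intent: dict) -> str:
    # Stand-in for a generative model producing the spoken reply.
    if intent["action"] == "timer":
        return f"Setting a timer for {' '.join(intent['args'])}."
    return "Sorry, I didn't catch that."

def converse(audio: bytes) -> str:
    return respond(extract_intent(transcribe(audio)))

print(converse(b"hey um timer 5 minutes"))
# -> Setting a timer for 5 minutes.
```

The point of the sketch is the middle stage: once intent extraction is good enough, the keyboard stops being the bottleneck between a thought and a command.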

Naturally, this brings us to the elephant in the room: Privacy.

The idea that a manufacturer has access to your conversations in real time is terrifying. Yet this is a bridge we crossed long ago. We all remember the congressional hearings where Mark Zuckerberg was grilled on data privacy. While he technically denied that Facebook listens to your microphone for ads, the sheer amount of data they do collect renders the distinction almost meaningless. The public suspicion remains: if the device can hear us, it probably is listening.

Consider Alexa. We invite these devices into our homes to perform trivial tasks—telling jokes or setting timers—implicitly trusting that they are only listening for a “wake word.” We have already traded privacy for minor conveniences.
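The “wake word” bargain can be made concrete with a toy gate. In principle, everything before the wake word is discarded on the device, and only speech after it is forwarded for processing. The sketch below is illustrative only—the names are invented, not any vendor’s implementation.

```python
# Toy illustration of wake-word gating: audio without the wake word is
# dropped locally; only the command after it would leave the device.
# (Illustrative names only, not a real assistant's API.)

WAKE_WORD = "alexa"

def gate(utterance: str):
    """Return the command that follows the wake word, or None (dropped)."""
    words = utterance.lower().split()
    if WAKE_WORD in words:
        idx = words.index(WAKE_WORD)
        command = " ".join(words[idx + 1:])
        return command or None
    return None  # no wake word: nothing is forwarded

print(gate("alexa set a timer"))   # -> set a timer
print(gate("just chatting here"))  # -> None
```

The trust we extend is precisely that this gate works as advertised—that the microphone is always on, but the forwarding is not.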

So here is the provocation: If a “dumb” assistant like Alexa—which struggles to deliver complex value—already has the authorization of millions of users to listen in, what happens when an LLM that actually knows everything arrives?

We are about to trade our privacy not for a timer, but for a super-intelligence that replaces the friction of typing. The keyboard is a 150-year-old relic. The first company to make talking to a computer feel less like a command line and more like a conversation will own the future.
