OpenAI and Google showcased their latest and greatest AI technology this week. For the last two years, tech companies have raced to make AI models smarter, but now a new focus has emerged: make them multimodal. OpenAI and Google are zeroing in on AI that can seamlessly switch between its robotic mouth, eyes, and ears.

“Multimodal” is the biggest buzzword as tech companies bet on the most enticing form their AI models can take in your everyday life. AI chatbots have lost some of their luster since ChatGPT’s launch in late 2022, so companies are hoping that talking to an AI assistant and visually sharing things with it will feel more natural than typing. When you see multimodal AI done well, it feels like science fiction come to life.

On Monday, OpenAI showed off GPT-4 Omni, which was oddly reminiscent of Her, the dystopian movie about lost human connection. The “o” stands for “omni,” a nod to the model handling every modality in one system, and OpenAI touted its ability to process video alongside audio. The demo showed ChatGPT looking at a math problem through a phone camera as an OpenAI staff member verbally asked the chatbot to walk them through it. OpenAI says the model is rolling out now to ChatGPT Plus users.

The next day, Google unveiled Project Astra, which promised to do roughly the same thing. Gizmodo’s Florence Ion used it to identify the faux flowers she was looking at, which it correctly pegged as tulips. However, Project Astra seemed a little slower than GPT-4o, and its voice was far more robotic. More Siri than Her, but I’ll let you decide whether that’s a good thing. Google says the project is still in its early stages, however, and even notes some current challenges that OpenAI appears to have overcome.

“While we’ve made incredible progress developing AI systems that can understand multimodal information, getting response time down to something conversational is a difficult engineering challenge,” said Google in a blog post.

Now you might remember Google’s Gemini demo video from Dec. 2023 that turned out to be highly manipulated. Six months later, Google still isn’t ready to release what it showed in that video, but OpenAI is speeding ahead with GPT-4o. Multimodal AI represents the next big race in AI development, and OpenAI seems to be winning.

A key differentiator for GPT-4o is that a single AI model natively processes audio, video, and text. Previously, OpenAI needed separate models to translate speech and video into text so that the underlying GPT-4, which is language-based, could understand these different mediums. Given Astra’s slower response times, it seems Google may still be chaining multiple models together to perform these tasks.
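To make that architectural difference concrete, here’s a toy sketch in Python. Every function here is a hypothetical stub, not OpenAI’s or Google’s actual API; real systems use large neural networks. The point is the shape of the two designs: a pipeline transcribes audio to text before a language model ever sees it (adding a latency hop and discarding tone and timing), while a native multimodal model ingests the raw signal directly.

```python
# Toy illustration of pipeline vs. native multimodal assistants.
# All functions are hypothetical stand-ins, not real OpenAI/Google APIs.

def speech_to_text(audio: bytes) -> str:
    # Pipeline stage 1: a separate transcription model converts audio
    # to text, losing non-textual cues like tone and timing.
    return "what is 2 + 2"

def text_only_llm(prompt: str) -> str:
    # Pipeline stage 2: a language-based model answers, seeing only
    # the transcript produced by the previous stage.
    return f"You asked '{prompt}'. The answer is 4."

def pipeline_assistant(audio: bytes) -> str:
    """Older approach: chain separate models; each hop adds latency."""
    return text_only_llm(speech_to_text(audio))

def native_multimodal_assistant(audio: bytes) -> str:
    """GPT-4o-style approach: one model consumes the raw audio itself,
    so there is no transcription hop between hearing and answering."""
    return "The answer is 4."

print(pipeline_assistant(b"<raw audio>"))
print(native_multimodal_assistant(b"<raw audio>"))
```

The stubs obviously don’t do any AI; they only show why collapsing two model calls into one cuts response time, which is exactly the conversational-latency challenge Google’s blog post describes.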

We’ve also seen wider adoption of AI wearables as tech companies embrace multimodal AI. The Humane AI Pin, Rabbit R1, and Ray-Ban Meta smart glasses are all examples of AI-enabled devices that take advantage of these various mediums. These devices promise to make us less dependent on smartphones, though it’s possible that Siri and Google Assistant will be empowered with multimodal AI soon enough, too.

Multimodal AI is likely something you’ll hear a lot more about in the months and years to come. Its development and integration into products could make AI significantly more useful. The technology ultimately takes the burden of transcribing the world for an LLM off of you, letting the AI “see” and “hear” the world for itself.
