OpenAI has introduced GPT-4o, its latest flagship generative AI model capable of handling text, speech, and video.
The "Omni" model is set to be integrated into various OpenAI products in the coming weeks.
OpenAI CTO Mira Murati emphasized that GPT-4o represents a significant leap forward in user interaction with AI models. "GPT-4o reasons across voice, text, and vision," Murati explained during a streamed presentation.
"This is incredibly important because we're looking at the future of interaction between ourselves and machines."
GPT-4o enhances ChatGPT's existing voice capabilities, enabling users to interact with the chatbot more conversationally.
Users can interrupt ChatGPT mid-answer, and the model responds in real time, even generating voices in a range of emotional styles.
In addition, GPT-4o improves ChatGPT's vision capabilities, allowing it to quickly answer questions about photos or desktop screens, from identifying the brand of a shirt in a picture to analyzing software code.
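For developers, the same multimodal ability is exposed through the OpenAI API, where images can be passed alongside text in a single chat request. The sketch below is illustrative only, assuming the official openai Python SDK with an API key in the environment; the image URL is a hypothetical placeholder.

```python
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI()

# Ask GPT-4o a question about an image, mirroring the
# "identify the brand of a shirt" example above.
# The URL below is a placeholder, not a real image.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What brand is the shirt in this photo?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/shirt-photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```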
Murati envisions future possibilities where GPT-4o could enable ChatGPT to "watch" a live sports game and explain the rules to users. She stressed OpenAI's focus on making the interaction experience more natural and effortless, shifting attention away from the user interface and toward collaboration with ChatGPT.
OpenAI also claims improved multilingual performance across roughly 50 languages. In the OpenAI API, the model is twice as fast as GPT-4 Turbo, costs half as much, and carries higher rate limits.
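In practice, moving existing API code to GPT-4o comes down to swapping in the new model name. As a minimal sketch, again assuming the official openai Python SDK, a plain text request looks like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Selecting GPT-4o is just a matter of passing its model name;
# the rest of the request is unchanged from GPT-4 Turbo usage.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Summarize GPT-4o's launch in one sentence."}
    ],
)

print(response.choices[0].message.content)
```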
However, due to potential misuse concerns, voice capabilities in the GPT-4o API will initially be available only to a select group of trusted partners.
GPT-4o is accessible in the free tier of ChatGPT starting today, as well as for ChatGPT Plus and Team subscribers with increased message limits. The enhanced ChatGPT voice experience will be available in alpha for Plus users in the next month, along with enterprise-focused options.