GPT-4o is OpenAI’s latest flagship model, designed to reason across audio, vision, and text in real time. The “o” in GPT-4o stands for “omni”, reflecting its ability to accept any combination of text, audio, and image as input and to generate any combination of them as output. It can respond to audio inputs in as little as 232 milliseconds, which is similar to human response time in a conversation. GPT-4o matches GPT-4 Turbo performance on text in English and on code, with significant improvement on text in non-English languages, and it is faster and 50% cheaper in the API. It also performs better at vision and audio understanding than existing models.
GPT-4o was trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. It achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new high watermarks on multilingual, audio, and vision capabilities. It sets new high scores on 0-shot CoT MMLU (general knowledge questions) and on 5-shot no-CoT MMLU. It also dramatically improves speech recognition performance over Whisper-v3 across all languages, particularly lower-resourced ones.
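As a rough illustration of what a 0-shot chain-of-thought (CoT) evaluation prompt looks like, the sketch below builds an MMLU-style multiple-choice question and asks the model to reason step by step before answering. The question, the answer choices, and the use of the openai Python package are illustrative assumptions; this is not OpenAI’s actual evaluation harness.

```python
# Minimal sketch of a 0-shot CoT, MMLU-style prompt (illustrative only).
# Assumes the openai Python package (v1+) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

question = "Which gas makes up the largest share of Earth's atmosphere?"
choices = ["A. Oxygen", "B. Nitrogen", "C. Carbon dioxide", "D. Argon"]

# 0-shot: no worked examples; CoT: ask the model to reason before answering.
prompt = (
    f"{question}\n" + "\n".join(choices) +
    "\n\nThink step by step, then give your final answer as a single letter."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```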
The compression achieved by GPT-4o’s new tokenizer was demonstrated across 20 languages spanning different language families: Gujarati, Telugu, Tamil, Marathi, Hindi, Urdu, Arabic, Persian, Russian, Korean, Vietnamese, Chinese, Japanese, Turkish, Italian, German, Spanish, Portuguese, French, and English. For each of these languages, the new tokenizer requires significantly fewer tokens to represent the same text.
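To see the tokenizer’s effect concretely, the sketch below counts tokens for the same sentence under the GPT-4o encoding and the previous GPT-4 Turbo encoding. It assumes the tiktoken library, where these encodings are exposed as “o200k_base” and “cl100k_base”; the sample sentences are placeholders, not the examples used in the announcement.

```python
# Minimal sketch comparing token counts under the GPT-4o tokenizer
# ("o200k_base") and the GPT-4 Turbo tokenizer ("cl100k_base").
# Assumes the tiktoken library; sample sentences are placeholders.
import tiktoken

gpt4o_enc = tiktoken.get_encoding("o200k_base")
gpt4_enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you today?",
    "Hindi": "नमस्ते, आज आप कैसे हैं?",
    "Chinese": "你好，你今天怎么样？",
}

for language, text in samples.items():
    new_tokens = len(gpt4o_enc.encode(text))
    old_tokens = len(gpt4_enc.encode(text))
    print(f"{language}: {old_tokens} -> {new_tokens} tokens "
          f"({old_tokens / new_tokens:.2f}x compression)")
```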
Safety is built into GPT-4o by design across modalities, through techniques such as filtering training data and refining the model’s behavior through post-training. New safety systems have been created to provide guardrails on voice outputs. The model has undergone extensive red teaming with 70+ external experts in domains such as social psychology, bias and fairness, and misinformation, to identify risks introduced or amplified by the newly added modalities. In these evaluations, GPT-4o does not score above Medium risk in any of the cybersecurity, CBRN, persuasion, and model autonomy categories.
GPT-4o is the latest step in pushing the boundaries of deep learning, with a focus on practical usability. The model’s capabilities will be rolled out iteratively, starting with text and image capabilities in ChatGPT. It will be available in the free tier, and to Plus users with up to 5x higher message limits. Developers can also now access GPT-4o in the API as a text and vision model. GPT-4o is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo. Support for GPT-4o’s new audio and video capabilities will be launched to a small group of trusted partners in the API in the coming weeks.
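For developers, a minimal sketch of calling GPT-4o as a text and vision model through the API is shown below. It assumes the official openai Python package (v1+), an OPENAI_API_KEY environment variable, and a placeholder image URL.

```python
# Minimal sketch of a GPT-4o text + vision request via the OpenAI API.
# Assumes the openai Python package (v1+) and OPENAI_API_KEY in the environment;
# the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```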