https://openai.com/index/gpt-4o-system-card/
GPT-4o is an autoregressive omni model, which accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.
GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is notably stronger at vision and audio understanding than existing models.
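Since the post references API access, here is a minimal sketch of a text-plus-image request to GPT-4o via the OpenAI Python SDK. This example is not part of the system card; the prompt and image URL are illustrative placeholders.

    # Sketch: send a combined text + image request to GPT-4o.
    # Assumes OPENAI_API_KEY is set in the environment.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    # Text part of the multimodal prompt (placeholder wording)
                    {"type": "text", "text": "Describe what is happening in this image."},
                    # Image part, passed by URL (placeholder URL)
                    {
                        "type": "image_url",
                        "image_url": {"url": "https://example.com/photo.jpg"},
                    },
                ],
            }
        ],
    )

    # The model's text reply
    print(response.choices[0].message.content)

Both content parts travel in a single request, reflecting the single-network, multimodal design described above.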
In line with our commitment to building AI safely, and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, with a focus on speech-to-speech (voice) while also evaluating text and image capabilities, and we describe the measures we've taken to enhance safety and alignment. We also include third-party assessments of general autonomous capabilities, as well as a discussion of the potential societal impacts of GPT-4o's text and vision capabilities.