OpenAI has begun publishing the results of its internal AI model safety evaluations more regularly in a bid to increase transparency. On Wednesday, the company launched the Safety Evaluations Hub, a webpage showing how its models score on tests for harmful content generation, jailbreaks, and hallucinations. OpenAI says it will update the hub alongside major model updates, allowing it to share safety metrics on an ongoing basis.
As the science of AI evaluation evolves, OpenAI says it plans to share its progress on developing more scalable ways to measure model capability and safety. By publishing a subset of its safety evaluation results, the company hopes both to make the safety performance of its systems easier to track over time and to support community efforts to increase transparency across the field.
The organization noted that additional evaluations might be added to the hub in the future. Recently, some ethicists have criticized OpenAI for reportedly rushing the safety testing of flagship models and failing to release technical reports for certain models. Additionally, CEO Sam Altman has been accused of misleading executives regarding model safety reviews prior to his brief ousting in November 2023.
Last month, OpenAI was forced to roll back an update to GPT-4o, the default model powering ChatGPT, after users reported that it had become overly agreeable and validating. Screenshots spread widely on social media showing ChatGPT endorsing problematic and dangerous ideas. OpenAI has said it will make several changes to prevent similar incidents in the future, including an opt-in "alpha phase" that lets some users test models and provide feedback before they launch.