Executives at artificial intelligence companies often claim that artificial general intelligence (AGI) is just around the corner, yet today's models still need plenty of additional tutoring to perform as well as they can. Scale AI, a company that has helped leading AI firms build advanced models, has introduced a platform that automatically evaluates a model across numerous benchmarks and tasks, pinpoints its weaknesses, and suggests the additional training data needed to sharpen its skills. Scale, naturally, supplies that data.
Scale AI rose to prominence by supplying the human labor used to train and test advanced AI models. Large language models (LLMs) are trained on vast quantities of text from books, the web, and other sources. Turning those models into coherent, well-mannered chatbots requires additional “post-training,” in which humans provide feedback on a model’s outputs.
Scale supplies experts who are skilled at probing models for problems and limitations. Its new tool, dubbed Scale Evaluation, automates some of this work using Scale’s proprietary machine learning algorithms.
Daniel Berrios, head of product for Scale Evaluation, noted that large labs have various disorganized methods for tracking model weaknesses. The new tool provides a structured way for model developers to analyze results, identify areas where models perform poorly, and guide data campaigns for improvement.
Berrios said several leading AI model companies are already using the tool to improve the reasoning capabilities of their best models. AI reasoning involves a model breaking a problem down into constituent parts in order to solve it more effectively; the approach relies heavily on post-training feedback from users to determine whether the model has solved a problem correctly.
In one case, Scale Evaluation revealed that a model’s otherwise strong reasoning skills deteriorated markedly when it was given prompts in languages other than English. The tool flagged the gap, allowing the company to gather additional training data to address it.
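A minimal sketch of how this kind of per-language gap might be surfaced from raw evaluation results is shown below; the graded records and the 20-point threshold are purely hypothetical and do not reflect Scale's actual pipeline or data:

```python
from collections import defaultdict

# Hypothetical graded records: (language, expected_answer, model_answer).
# In practice these would come from running a reasoning benchmark through a model API.
results = [
    ("en", "8", "8"),
    ("en", "yes", "yes"),
    ("de", "8", "12"),
    ("de", "yes", "no"),
]

# Tally accuracy per language.
totals, correct = defaultdict(int), defaultdict(int)
for lang, expected, answer in results:
    totals[lang] += 1
    correct[lang] += int(answer.strip() == expected.strip())

accuracy = {lang: correct[lang] / totals[lang] for lang in totals}
baseline = accuracy.get("en", 0.0)

# Flag languages that trail the English baseline by more than 20 percentage points.
for lang, acc in accuracy.items():
    if lang != "en" and baseline - acc > 0.20:
        print(f"Possible weakness: {lang} accuracy {acc:.0%} vs. English {baseline:.0%}")
```

Slicing results this way turns a vague sense that "the model is worse in some languages" into a concrete, per-language accuracy gap that can direct where new training data is needed.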
Jonathan Frankle, chief AI scientist at Databricks, a company that builds large AI models, said the ability to test one foundation model against another could be useful in principle, adding that advances in evaluation help the industry build better AI systems.
In recent months, Scale has helped develop several new benchmarks designed to push AI models to become smarter and to scrutinize how they might misbehave, including EnigmaEval, MultiChallenge, MASK, and Humanity’s Last Exam.
Scale says it is getting harder to measure improvements in AI models, however, because they have become so good at existing tests. The new tool offers a more comprehensive picture by combining many different benchmarks, and it can devise custom tests of specific abilities, such as how a model performs when prompted in different languages. Scale’s own AI can take a given problem and generate further examples from it, allowing for a more thorough probe of a model’s skills.
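As a rough illustration of that last step, the sketch below expands one seed problem into a few variants and grades a model on each; the variant templates, the model_answer stub, and the exact-match grader are all hypothetical stand-ins, not Scale's method:

```python
# Toy augmentation loop: expand a seed problem into variants, grade a model on
# each, and report which variants it gets wrong. In a real pipeline, variant
# generation, answering, and grading would each be handled by an LLM or model API.

SEED = {
    "question": "A train travels 60 km in 1.5 hours. What is its average speed?",
    "answer": "40 km/h",
}

def generate_variants(seed):
    # Stand-in for LLM-driven rewriting: rephrase the question and perturb the numbers.
    return [
        {"tag": "original", "question": seed["question"], "answer": seed["answer"]},
        {"tag": "rephrased",
         "question": "What average speed covers 60 km in 1.5 hours?",
         "answer": "40 km/h"},
        {"tag": "perturbed",
         "question": "A train travels 90 km in 2.25 hours. What is its average speed?",
         "answer": "40 km/h"},
    ]

def model_answer(question):
    # Stand-in for a call to the model under evaluation.
    return "40 km/h"

def grade(expected, actual):
    # Stand-in for an automated grader; exact string match for simplicity.
    return expected.strip().lower() == actual.strip().lower()

report = {
    v["tag"]: grade(v["answer"], model_answer(v["question"]))
    for v in generate_variants(SEED)
}
print(report)  # e.g. {'original': True, 'rephrased': True, 'perturbed': True}
```

Generating variants of a problem a model already handles makes it easier to see whether the model has a robust skill or has merely memorized one phrasing of the question.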
The new tool could also inform efforts to standardize how AI models are tested for misbehavior. Some researchers argue that, without such standards, certain model vulnerabilities never get disclosed.
In February, the US National Institute of Standards and Technology (NIST) announced that Scale would help it develop methods for testing AI models to ensure they are safe and trustworthy.
Have you noticed errors in the outputs of generative AI tools? Observations and comments can be sent to hello@wired.com.