A recent research paper by DeepMind suggests that Large Language Models (LLMs), which are AI systems trained on vast amounts of data to predict the next token (a word or word fragment), can be viewed as strong data compressors. The study found that LLMs can compress information effectively, sometimes better than widely used compression algorithms. The researchers repurposed LLMs to perform lossless compression using arithmetic coding, achieving impressive compression rates on text, image, and audio data. However, LLMs are not practical compression tools compared to purpose-built compressors, because of their size and speed limitations.
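The link between prediction and compression can be made concrete: an arithmetic coder driven by a predictive model compresses a sequence to within about two bits of its negative log-likelihood under that model, so the model's log-probabilities directly give the (near-)exact compressed size. The sketch below illustrates this with `toy_model`, a hypothetical stand-in for an LLM (the real paper queries an actual language model; the rule here is invented purely for illustration).

```python
import math

def toy_model(context):
    """Hypothetical stand-in for an LLM: given the context so far,
    return a probability distribution over the next symbol.
    (A fixed made-up rule, used only to make the example runnable.)"""
    if context.endswith("a"):
        return {"a": 0.1, "b": 0.8, "c": 0.1}
    return {"a": 0.6, "b": 0.2, "c": 0.2}

def ideal_code_length_bits(sequence, model):
    """Size (in bits) that an arithmetic coder driven by `model` would
    achieve, up to ~2 bits of overhead: -log2 P(sequence)."""
    bits = 0.0
    for i, symbol in enumerate(sequence):
        p = model(sequence[:i])[symbol]  # model's probability of the true next symbol
        bits += -math.log2(p)
    return bits

seq = "ababab"
compressed = ideal_code_length_bits(seq, toy_model)
raw = len(seq) * math.log2(3)  # naive encoding of a 3-symbol alphabet
print(f"{compressed:.2f} bits vs {raw:.2f} bits raw")
```

The better the model predicts the data, the higher the probability it assigns to the true continuation, and the fewer bits the coder emits; this is why a strong LLM doubles as a strong compressor.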
One interesting finding of the study is that the performance of LLMs depends on the scale of both the model and the dataset. Larger models achieve superior compression rates on larger datasets, but their advantage diminishes on smaller datasets: once the model's own size is counted against the compressed output, a large model only pays off when the dataset is big enough to amortize that cost. This suggests that a bigger LLM is not necessarily better for every task, and that compression can serve as an indicator of how well a model has learned the information in its dataset. These findings have implications for evaluating LLMs, especially in addressing the problem of test set contamination in machine learning.
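The metric behind this trade-off can be sketched in a few lines. The paper's compressors are model-based, but an off-the-shelf codec (zlib here) is enough to show the raw compression rate, and a hypothetical `adjusted_rate` helper (my naming, not the paper's) shows how charging for the model's own size penalizes large models on small datasets.

```python
import zlib

def compression_rate(data: bytes) -> float:
    """Raw compression rate: compressed size / original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)

def adjusted_rate(compressed_bits: float, model_bits: float, raw_bits: float) -> float:
    """Rate with the compressor's own description length counted in.
    A huge model can shrink `compressed_bits` yet still lose overall
    when `raw_bits` (the dataset) is too small to amortize `model_bits`."""
    return (compressed_bits + model_bits) / raw_bits

sample = b"the quick brown fox " * 200
print(compression_rate(sample))
# Same per-symbol performance, different dataset sizes: the model cost
# dominates the small dataset and is negligible on the large one.
small = adjusted_rate(compressed_bits=1e4, model_bits=1e9, raw_bits=1e5)
large = adjusted_rate(compressed_bits=1e10, model_bits=1e9, raw_bits=1e11)
print(small, large)
```

On the small dataset the adjusted rate exceeds 1.0 (worse than no compression at all), while on the large one it stays well below 1.0, mirroring the scaling behaviour the study reports.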
Overall, the study provides a fresh perspective on the capabilities of LLMs, viewing them as data compressors rather than just next-token predictors. It highlights their potential to achieve impressive compression rates on various types of data, while recognizing their size and speed limitations relative to classical compression algorithms. The findings also shed light on the relationship between model scale and dataset size, suggesting that compression can be a useful metric for judging how effective and appropriate an LLM is for a specific task.