It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
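The abstract's last claim, that any compressor induces a conditional generative model, can be illustrated with a small sketch. The idea is that a compressor's output length approximates a negative log-probability, so the increase in compressed size when appending a candidate continuation scores that continuation's conditional likelihood. This is my own toy illustration using gzip, not the paper's actual method, and the function names are invented for the example:

```python
import gzip

def compressed_len(data: bytes) -> int:
    """Length in bytes of the gzip-compressed input."""
    return len(gzip.compress(data))

def rank_continuations(context: str, candidates: list[str]) -> list[str]:
    """Rank candidate continuations of `context`.

    A smaller increase in compressed size roughly corresponds to a
    higher conditional probability under the compressor's implicit model.
    """
    base = compressed_len(context.encode())
    return sorted(
        candidates,
        key=lambda c: compressed_len((context + c).encode()) - base,
    )

context = "the quick brown fox jumps over the lazy dog. the quick brown fox "
# "jumps" repeats earlier text, so gzip encodes it as a cheap back-reference;
# "qwxkv" is novel bytes and costs more, so "jumps" ranks first.
print(rank_continuations(context, ["jumps", "qwxkv"]))
```

To turn the ranking into an actual sampler, one would convert the length differences ΔL into weights proportional to 2^(−ΔL) and sample from the normalized distribution; with an arithmetic coder and a learned model in place of gzip, this recovers the usual language-model sampling loop.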
I wonder what a paper like this, especially given the title, does for the legal case regarding copyright and generative AI. Haven't had a chance to read the paper yet, so don't know if the findings are relevant to copyright.
You realize that this is already the case, right? As it stands now, AI-produced works are uncopyrightable. Copyright is reserved for human-produced works. The only exception is when AI plays a minor role in production, like a photo editor using AI to remove a person from a picture: the AI didn't produce the picture, it was just a tool used to help the process along.
Additionally, even if, say, OpenAI's ChatGPT could hold copyright on its works, there would be no use in a word-prediction engine or diffusion engine owning anything, because it can't make decisions for itself. That would be required to, for example, pass the copyright along to someone else.