Here’s a little story. Back when I was doing CS stuff instead of physics, I was presenting a malware detection model I had come up with. I'm still kinda proud of it; it was derived from functional programming theory and showed some promise.
A guy who works at Facebook listens to me explain it and asks a very good question. I had a big dataset of malware samples, which I had divided into training and test sets. He asked whether, since malware is often derived from other malware, those two sets were really independent. There was a chance that some of the malware in the test set was a small variation on malware in the training set, which would "leak data": the model could just memorize that example from the training set, recognize it again in the test set, and it would look like wow, it generalizes.
I didn't really have a great answer for that. I had done some sorting based on how the samples were tagged, but this was a real problem. I did have some other signs of generalization, though. He still thought it was a cool project, and I thanked him for making such a good point that I had totally missed.
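For concreteness, here's a minimal sketch of the kind of split that would have answered his objection: group samples by a family tag and keep every member of a family on the same side of the split, so a test sample can never be a near-copy of a training sample. The CSV file and the `family` column are hypothetical placeholders; the grouped split is the point.

```python
# Leakage-aware train/test split: keep related malware samples
# (same family tag) entirely on one side of the split.
# The file name and column names ("sha256", "family", ...) are
# hypothetical; only the grouped-split idea matters here.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

samples = pd.read_csv("malware_samples.csv")  # hypothetical dataset

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(samples, groups=samples["family"]))

train, test = samples.iloc[train_idx], samples.iloc[test_idx]

# Sanity check: no family appears on both sides, so a test sample can't
# be a trivial variant of something the model saw during training.
assert set(train["family"]).isdisjoint(test["family"])
```

A plain random split would scatter variants of the same family across both sides and quietly inflate the test score, which is exactly the leakage he was pointing at.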
That's how it used to work. Data leakage was a problem, a thing that would mislead you. It was a sneaky way for models to appear better-performing than they really are, and it was something to be very careful of.
Now the LLM folk are training giant models on the entire internet and claiming some world-changing generalization when they recreate things that already exist on the internet.
It's all just data leakage. Of course LLMs have some capacity for generalization, but that capacity is nothing in comparison to their capacity to just compress information that exists out on the internet.
The hype, absurd amounts of capital, and callous disregard of law and ethics that surround LLMs are not based on their capacity to generalize. They're based on an entire industry and its investors making galactically more egregious versions of the same mistake I made.
LLMs are extremely inefficient adaptive compression algorithms, indexed via natural language input and capable of interpolating.
They are systems that maximize data leakage, because a failure that makes a model look better than it really is turns out to be a fantastic way to get investment.
They are the most resource-inefficient con ever devised. A model will look more impressive the more it leaks. So make it leak everything. Every book ever written, every GitHub repo, every screenplay, every painting, every art piece uploaded to the internet, every blog post, every tweet, every reddit post, every documentation page, every website, everything.
We launder the world, at the expense of the world.
I don't think I've ever seen such a perfectly evil dystopian sci-fi premise, so give the LLMs one point for beating humans at that.
RE: https://hexokina.se/notes/ainao43qa0lo0afr