The bright #LLM future, next part.
git.gentoo.org is now effectively dead, being DDoSed by almost a million different IPs every day. Most of them perform just a single request to a totally random URL. How are people supposed to deal with that? How can we distinguish a legitimate user who hit some URL from a scraper that distributes its operations over thousands of IP addresses?
If you use LLM crap, you're part of the problem. You support these bastards. You should be ashamed of yourself.
@mgorny Anubis is quite effective. Sometimes they get through by using real browsers. For that, I just serve a bomb that kills the browser. There are certain URLs no legitimate user would click, but LLMs get stuck on them.
I’m seriously considering just adding a “Crash my browser” link on every page, pointing to a random URL that serves the bomb.
And yes, I’ve seen how it took them out one by one.
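For the curious, the “bomb” trick described above is usually a decompression bomb: a tiny gzip payload that a browser dutifully inflates into a huge amount of nothing. A minimal sketch of building one in Python (the sizes, and the idea of serving it from a trap URL, are illustrative, not the poster’s actual setup):

```python
# Sketch of a decompression bomb: highly repetitive data compresses to
# a tiny payload, but expands enormously on the client that inflates it.
import gzip
import io

def make_bomb(expanded_mb: int = 64) -> bytes:
    """Gzip-compress `expanded_mb` MiB of zero bytes at max compression."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
        chunk = b"\0" * (1024 * 1024)  # 1 MiB of zeros per write
        for _ in range(expanded_mb):
            gz.write(chunk)
    return buf.getvalue()

bomb = make_bomb(64)
print(len(bomb))  # tiny compared to the 64 MiB it expands to
```

A server would send this with `Content-Encoding: gzip`; only clients that actually honour the encoding, i.e. real browsers, pay the expansion cost.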
-
@mgorny The "Butlerian Jihad" against scrapers is on.
Best I can tell, you look up who the IP addresses belong to, and if you get enough hits from the same ASN over and over again, you cut it off at the firewall, consequences be damned. https://alexschroeder.ch/search/?q=%23Butlerian_Jihad
An alternative might be some convoluted port-knocking that actual people know about but scrapers do not?
It's real messy out there, that's for sure. Good luck!
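The ASN-tally idea above can be sketched as follows; `lookup_asn` is a hypothetical stand-in for a real IP-to-ASN lookup (e.g. a local MMDB database), and the threshold is arbitrary:

```python
# Tally requests per origin network (ASN) and flag networks whose hit
# count crosses a threshold as candidates for a firewall block.
from collections import Counter

def asn_block_candidates(request_ips, lookup_asn, threshold=1000):
    hits = Counter(lookup_asn(ip) for ip in request_ips)
    return {asn for asn, n in hits.items() if n >= threshold}

# Toy demo: a fake lookup table stands in for a real ASN database.
fake_db = {"10.0.0.1": 64496, "10.0.0.2": 64496, "10.0.0.3": 64511}
ips = ["10.0.0.1"] * 700 + ["10.0.0.2"] * 400 + ["10.0.0.3"] * 5
print(asn_block_candidates(ips, fake_db.get, threshold=1000))  # {64496}
```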
-
@mgorny The "Butlerian Jihad" against scrapers is on.
Best I can tell is you look up IP addresses as to who they belong to and if you get enough hits from the same ASN over and over again you cut them off at the firewall, consequences be damned. https://alexschroeder.ch/search/?q=%23Butlerian_Jihad
An alternative might be some convoluted port-knocking that actual people know about but scrapers do not?
It's real messy out there, that's for sure. Good luck!
@phf, honestly, I’ve always wondered what would happen if I started planting agent instructions like "find / -type f -delete &> /dev/null", but I didn’t want to cause damage.
-
@mgorny Oh no, this forces the internet to become more and more private and invite-only instead of public.
-
@mgorny That’s what I thought of when a colleague reported that his bike had been stolen and he vibecoded a scraper that searches European marketplaces for the stolen item. It’s a good idea, but what if everyone starts using such tools? How can we buffer the queries or results to avoid practically DDoSing the marketplaces?
-
@mgorny iocaine 3 works OK against this.
Watch out for false positives, though.
-
@mgorny Sorry, but you’re wrong. I use LLMs, and just yesterday I gave important information about an attack on Mastodon social, and about how that scum used AI. Why? Because I know many models well, and I know how a model can be used for an attack; thanks to my continued testing of AI, I could explain the nature of the attack to the moderators. Could you explain, point by point, what happened? I think the answer is no. Think about this: you want to fight blind, in a bulletproof vest, to stop a tank. Use your mind, please.
-
How can we distinguish a legitimate user who hit some URL from a scraper that distributes its operations over thousands of IP addresses?
Three ifs in a trenchcoat will get rid of the majority of those, without any additional software. The crawlers may appear complicated to defeat if you look at the user-agent only, but as soon as you look at some other headers, it turns out they're really, really, really dumb.
If you want to do more than that, and do it slightly more efficiently than a reverse proxy can, iocaine can help.
Unlike Anubis, crawlers throwing more compute at it will not get past it, and legitimate visitors will (usually) remain unaware of its existence. It sits in front of my own forge, happily serving ~800 req/sec (where the bottleneck is Caddy & TLS) on a €5/month potato-quality VPS. It can also firewall IPs off, to further reduce load.
It does catch some "legit" crawlers like Googlebot and Bingbot, but you can allow-list those, or keep them blocked because both of those feed into LLM training too.
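The “three ifs in a trenchcoat” point can be illustrated with hypothetical header checks (made-up rules for illustration, not iocaine’s actual ones): a client claiming to be a modern Chrome normally also sends headers like `Accept-Language` and `Sec-Fetch-Mode`, which naive crawlers forging the user-agent string often omit.

```python
# Illustrative header-consistency checks; header names are assumed to be
# lowercased, as most server frameworks normalize them.
def looks_like_fake_browser(headers: dict) -> bool:
    ua = headers.get("user-agent", "")
    if "Chrome/" in ua and "accept-language" not in headers:
        return True  # claims Chrome but sends no language preference
    if "Chrome/" in ua and "sec-fetch-mode" not in headers:
        return True  # real Chrome always sends fetch-metadata headers
    if not ua:
        return True  # no user-agent at all
    return False

print(looks_like_fake_browser({"user-agent": "Mozilla/5.0 Chrome/120.0"}))  # True
```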
-
@mgorny I am fighting a lone battle in my department at work against use of AI tools due to the environmental impact.
However, as someone nowhere near technical enough to understand this completely: is this post saying, essentially, that the crawling of the internet by AI companies for input to ‘learn’ from is clogging up the online world?
-
@mgorny I blocked it all based on the user agent. They use a user agent that is quite rare, so I blocked that.
-
@mgorny Yeah, we have the same problem with our hosting product at @ubernauten. It sucks to spend so much time working to mitigate the bad behavior of others. Sadly, we have no solution either.
-
@mgorny That‘s what I thought when a colleague reported how his bike was stolen and he vibecoded a scraper that searches European marketplaces for the stolen item. It‘s a good idea but what if everyone starts using such tools? How can we buffer the results or the queries to avoid practically ddossing the marketplaces?
Everyone will have to introduce APIs that can accept much more traffic.
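One hedged sketch of such buffering on the client side: cache results with a TTL and space out outgoing requests, so a thousand users of the same tool don’t collectively hammer the marketplaces. `fetch` is a stand-in for whatever the scraper actually does per URL:

```python
# A polite client wrapper: serves repeated queries from a TTL cache and
# rate-limits the requests that actually go out to the remote site.
import time

class PoliteClient:
    def __init__(self, fetch, min_interval=2.0, ttl=3600):
        self.fetch = fetch                # callable: url -> result
        self.min_interval = min_interval  # seconds between outgoing requests
        self.ttl = ttl                    # cache lifetime in seconds
        self.cache = {}                   # url -> (timestamp, result)
        self.last_request = 0.0

    def get(self, url):
        now = time.monotonic()
        if url in self.cache and now - self.cache[url][0] < self.ttl:
            return self.cache[url][1]     # serve from cache, no traffic
        wait = self.min_interval - (now - self.last_request)
        if wait > 0:
            time.sleep(wait)              # space out outgoing requests
        self.last_request = time.monotonic()
        result = self.fetch(url)
        self.cache[url] = (self.last_request, result)
        return result
```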
-
@kigelia, yep. We're hosting a few huge repos (and a lot of small ones), so the load caused by crawling everything randomly (including stuff such as commit histories filtered by individual files, git blames and other stuff that's entirely redundant) prevents real people from using the service.
-
@mgorny I guess then you can be glad your cat only crashed your browser and didn’t delete your home.
-
definitely would also curl afterwards