The bright #LLM future, next part.
-
@mgorny iocaine 3 works against this ok
watch out for false positives
@davidgerard would you have pointers to "how to guides" for less savvy people? I have a shared hosting account on a web hosting service, I feel like I need to protect myself from these bots and I'm totally lost.
-
@villares no, but I went to https://iocaine.madhouse-project.org/ and faffed about a bit. I used iocaine 3 out of the box. I use nginx, so I had to figure out the correct config. I added exceptions for some specific user-agents I wanted to let through.
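For anyone in a similar spot, a minimal sketch of what such an nginx setup might look like. The iocaine listen address, the backend port, and the allow-listed user-agents here are illustrative assumptions, not taken from the post — check your own iocaine config for the real values:

```nginx
# Map user-agents we want to let through (illustrative examples).
map $http_user_agent $bypass_iocaine {
    default             0;
    "~*feedvalidator"   1;   # e.g. keep feed readers working
    "~*archivebot"      1;   # e.g. an archiver you trust
}

server {
    listen 80;
    server_name example.org;          # placeholder hostname

    location / {
        # Assumed iocaine listen address; adjust to your setup.
        set $upstream http://127.0.0.1:42069;
        if ($bypass_iocaine) {
            # Allow-listed agents go straight to the real site.
            set $upstream http://127.0.0.1:8080;
        }
        proxy_pass $upstream;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

Using a variable in `proxy_pass` with literal IP:port backends avoids needing a `resolver` directive, and keeps the allow-list logic in one `map`.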
-
@davidgerard thank you!
-
@mirabilos @mgorny Wat? It’s stopping LLMs.
-
@mirabilos It does not? Sure deleted a lot of files when I tried it in a container...
Please to "edumacate" me? Or do you refer to the redirection? @mgorny
-
How can we distinguish a legitimate user who hit some URL from a scraper that distributes its operations over thousands of IP addresses?
Three ifs in a trenchcoat will get rid of the majority of those, without any additional software. The crawlers may appear complicated to defeat if you look at the user-agent only, but as soon as you look at some other headers, it turns out they're really, really, really dumb.
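The write-up linked at the end of this post has the details; as a rough sketch (the exact conditions below are my own illustrative guesses, not the author's actual rules), "three ifs in a trenchcoat" in nginx terms looks something like this:

```nginx
# Three illustrative header checks -- NOT the exact rules from the
# linked post. The idea: crawlers claim to be browsers, but get the
# details wrong in ways no real modern browser does.

# 1. Claims an ancient Chrome release (80-109 in this example).
if ($http_user_agent ~ "Chrome/(8[0-9]|9[0-9]|10[0-9])\.") {
    return 403;
}
# 2. Claims to be a browser but omits Sec-Fetch-Mode, which every
#    current mainstream browser sends.
if ($http_sec_fetch_mode = "") {
    return 403;
}
# 3. Omits Accept-Language, which headless scrapers often skip.
if ($http_accept_language = "") {
    return 403;
}
```

Note these naive versions would also block curl, feed readers, and anything else that isn't a browser — in practice you'd scope the checks to user-agents that claim to be one, and add exceptions for clients you care about.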
If you want to do more than that, and do it slightly more efficiently than a reverse proxy can, iocaine can help.
Unlike with Anubis, crawlers cannot get past it by throwing more compute at it, and legit visitors will (usually) remain unaware of its existence. It's in front of my own forge, happily serving ~800 req/sec (where the bottleneck is Caddy & TLS) on a €5/month potato-quality VPS. It can also firewall IPs off, to further reduce load.
It does catch some "legit" crawlers like Googlebot and Bingbot, but you can allow-list those, or keep them blocked because both of those feed into LLM training too.
@algernon @mgorny cc @alderwick re the current bot attacks on our forge
"https://chronicles.mad-scientist.club/tales/surviving-the-crawlers/#three-ifs-in-a-trenchcoat"
-
The bright #LLM future, next part.
git.gentoo.org is now effectively dead, being DDoS-ed by almost a million different IPs every day. Most of them are just performing a single request at a totally random URL. How are people supposed to deal with that? How can we distinguish a legitimate user who hit some URL from a scraper that distributes its operations over thousands of IP addresses?
If you use LLM crap, you're part of the problem. You support these bastards. You should be ashamed of yourself.
@mgorny it does not help to point at people using LLMs for legitimate reasons. It's other people using those same tools, but for nefarious purposes.
I use user-agent filtering and put Anubis in front of the Slackware git infrastructure, and that has helped immensely.
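The user-agent filtering part is easy to express on its own; a minimal nginx sketch (the blocked strings below are common real crawler agents, used as illustrative examples — not the actual Slackware list):

```nginx
# Illustrative deny-list of scraper user-agents.
map $http_user_agent $blocked_ua {
    default          0;
    "~*GPTBot"       1;
    "~*CCBot"        1;
    "~*Bytespider"   1;
}

server {
    listen 80;
    server_name git.example.org;   # placeholder hostname

    if ($blocked_ua) {
        return 403;
    }
    # ... proxy to cgit, with Anubis in front, goes here ...
}
```

This only stops crawlers that identify themselves honestly, which is why it's paired with something like Anubis for the ones that lie.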
I eventually got git.gentoo.org to render and gosh! That's a lot of repositories there. Would it be an idea to distribute the cgit interface over multiple front-end servers? Like, moving all user repos to a different server?
-
@mirabilos I think I get it. It's a bash-ism to redirect stdout and stderr and in (da)sh that doesn't work? @mgorny
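Right — `&>` is a bash extension. In a POSIX shell like dash, `cmd &> file` parses as `cmd &` (run in the background) followed by `> file` (truncate the file), which is a very different result. A small demonstration of the portable spelling that works in bash, dash, and any POSIX sh:

```shell
# bash-only: 'cmd &> file' sends stdout AND stderr to file.
# In dash/POSIX sh the same line means: run 'cmd' in the background,
# then truncate 'file' -- silently destructive if the file mattered.

# Portable equivalent: redirect stdout, then point stderr at it.
{ echo "to stdout"; echo "to stderr" >&2; } > both.log 2>&1
```

The order matters: `> both.log 2>&1` works, while `2>&1 > both.log` duplicates stderr to the terminal *before* stdout is redirected.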
-