The bright #LLM future, next part.
-
@hierkiosk @villares unfortunately, it works real good. contactability is a question for the site being piled under with shit.
„it works real good“ — what does this mean? Zero false-positives, zero false-negatives?
@hierkiosk i think if you apply some thought you'll work it out
-
@mgorny iocaine 3 works against this ok
watch out for false positives
@davidgerard @mgorny I keep teetering on the edge of setting up iocaine for my own code/issues/etc self-hosting using Phorge, which keeps getting hammered into OOM-death by LLM scrapers, and I really *should*, but it just feels so darned depressing
-
How can we distinguish a legitimate user who hit some URL from a scraper that distributes its operations over thousands of IP addresses?
Three ifs in a trenchcoat will get rid of the majority of those, without any additional software. The crawlers may appear complicated to defeat if you look at the user-agent only, but as soon as you look at some other headers, it turns out they're really, really, really dumb.
If you want to do more than that, and do it slightly more efficiently than a reverse proxy can, iocaine can help.
Unlike Anubis, the crawlers throwing more compute on it will not get past it, and legit visitors will (usually) remain unaware of its existence. It's in front of my own forge, happily serves ~800 req/sec (where the bottleneck is Caddy & TLS) on a €5/month potato quality VPS. It can also firewall IPs off, to further reduce load.
It does catch some "legit" crawlers like Googlebot and Bingbot, but you can allow-list those, or keep them blocked because both of those feed into LLM training too.
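As a rough illustration of the "look at some other headers" idea above, here is a minimal Python sketch. The three specific checks, the function name `looks_like_scraper`, and the `ALLOWED_CRAWLERS` tuple are assumptions made for the example; they are not the actual rules from the linked post or from iocaine. `Sec-Fetch-Mode` and `Accept-Language` are real request headers that current mainstream browsers send, but older or unusual browsers may not, which is one way false positives happen.

```python
from typing import Mapping

# Hypothetical allow-list. The thread only says such crawlers *can* be
# allow-listed; a User-Agent match alone is trivially spoofable.
ALLOWED_CRAWLERS = ("googlebot", "bingbot")


def looks_like_scraper(headers: Mapping[str, str]) -> bool:
    """Return True if the request headers look inconsistent for a real browser."""
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    ua = h.get("user-agent", "")

    # If 1 (assumed): no User-Agent at all.
    if not ua:
        return True

    # Optional allow-list for crawlers the operator wants to let through.
    if any(bot in ua.lower() for bot in ALLOWED_CRAWLERS):
        return False

    # If 2 (assumed): claims to be Chrome or Firefox but sends no
    # Sec-Fetch-Mode header.
    if ("Chrome/" in ua or "Firefox/" in ua) and "sec-fetch-mode" not in h:
        return True

    # If 3 (assumed): claims to be a browser but sends no Accept-Language.
    if "Mozilla/" in ua and "accept-language" not in h:
        return True

    return False
```

In a real setup the same conditions would more likely live in the reverse proxy configuration than in application code, and anything flagged could be served a cheap error page or handed to a tool such as iocaine.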
@algernon@come-from.mad-scientist.club At present for my own personal purposes I haven't yet set up iocaine, despite getting somewhat exasperated with outages caused by LLM scraping, due to
- I'm kinda lacklustre with proxy config stuff and there's no guide you have up for Apache2 (I'm sure I could muddle through some setup eventually though if I put actual effort into it)
- Though it's better about this not happening than Anubis, I still seem to get my main current browser of choice (Falkon) caught by Iocaine, enough so that in fact currently I cannot read the documentation as pages like just https://iocaine.madhouse-project.org/documentation/ sit loading forever even when I begrudgingly switch back to Chrome (so I assume either I've now been blocked, or your server has been hammered coincidentally). From https://chronicles.mad-scientist.club/tales/surviving-the-crawlers/#three-ifs-in-a-trenchcoat I got for example
x-request-id: 0ia4BzigBrtETmjH0S2nd
so, as per the text there at the bottom after the AI-aimed gobbledygook, I am telling you
@algernon@come-from.mad-scientist.club The outright pageload failures look to have been Very User Error on my part, but I can further report for example
x-request-id: NLCTLT-0WB7P2fWsXx1bH
from my trying to load https://iocaine.madhouse-project.org/documentation/3/reverse-proxies/ with Falkon here still on Kubuntu 24.04 (and thus Falkon 24.01.75 using QtWebEngine 5.15.16, which is admittedly all pretty outdated in our fast-paced world).
-
@algernon Appreciated, and boy howdy do I know that spoonless feeling
-
@mgorny it does not help pointing to people using LLMs for legitimate reasons. It's other people using those same tools but then for nefarious purposes.
I use user-agent filtering and put Anubis in front of the Slackware git infrastructure, and that has helped immensely.
I eventually got git.gentoo.org to render and gosh! That's a lot of repositories there. Would it be an idea to distribute the cgit interface over multiple front-end servers? Like, moving all user repos to a different server?
-
@mirabilos @js @mgorny Real instructions or trap instructions?
@Epic_Null @js @mgorny you’d have to ask them or look for yourself