@lilacperegrine Well, I have an intuition that the problem I'm looking to solve is akin to the construction of a quadtree:
I have a list of IPv4 /32s. I want to generalize this list using the properties of subnets (a /n subnet contains two /(n+1) subnets).
I don't have an explanation though, and my math skills are rusty... @Soblow
so you have a list of IPs, and are using a binary (or quad) tree in order to classify them into clean vs dirty? -
@lilacperegrine No, not really...
I talked about quadtrees because that's something I've manipulated and it made me think of it.
Let's take a step back and formalize the problem: assume I have a list of unique IPv4 addresses.
They are represented on 32 bits.
I want to construct a list of subnets (so still 32 bits) that summarizes the list of IPs I have. For example, if I have 192.168.1.0 and 192.168.1.1, I could generalize this with the 192.168.1.0/31 subnet (if I'm not mistaken), which contains the previous two IPs without containing any other IPs.
If it helps, represent them in binary and find the common upper bits:11000000.10101000.00000001.00000000 (192.168.1.0)
11000000.10101000.00000001.00000001 (192.168.1.1)
The common part is everything up to the last bit, thus the mask is a `/31`, which is 255.255.255.254, or in binary `11111111.11111111.11111111.11111110`.
Now, I have tens of thousands of IPs and I'd like the smallest list of subnets that includes all bad IPs without including good IPs.
I'm sure there are academic papers about this; it sounds like a problem folks must already have had.
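For what it's worth, the constrained version of this problem (cover every bad IP without ever covering a good one) can be sketched with Python's stdlib `ipaddress` module. The greedy widening below is only an illustration, not a known-minimal algorithm, and `generalize`, `bad_ips`, and `good_ips` are hypothetical names:

```python
import ipaddress

def generalize(bad_ips, good_ips):
    """Greedy sketch: widen each bad /32 to the largest subnet that
    still contains no known-good IP, then merge adjacent results."""
    good = [ipaddress.ip_address(g) for g in good_ips]
    nets = []
    for ip in bad_ips:
        net = ipaddress.ip_network(f"{ip}/32")
        # Keep halving the prefix while the wider subnet stays "clean".
        while net.prefixlen > 0:
            wider = net.supernet()
            if any(g in wider for g in good):
                break
            net = wider
        nets.append(net)
    # collapse_addresses losslessly merges adjacent/overlapping networks.
    return list(ipaddress.collapse_addresses(nets))

print(generalize(["192.168.1.0", "192.168.1.1"], ["192.168.1.2"]))
# [IPv4Network('192.168.1.0/31')]
```

Note the caveat: with an empty good list, every bad IP widens all the way to 0.0.0.0/0, so in practice the good list (or a cap on how far widening may go) has to be non-empty.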
-
I should write an addendum but, right now, my main website is getting hammered at rates similar to what my knowledge website used to be hit at.
Conversely, the "knowledge" website is back to a "normal" background noise of <100 req/min. The banlist now contains so many IPs, and yet they still reach 6k req/min nearly constantly.
At that point, I'm thinking about tinkering with my banip tool to compute optimal subnets instead of always crafting /24 subnets.
@Soblow Yeah… last time my website got bombed like this, I decided that Cloudflare is not such a bad idea… I hope you’ll figure it out.
-
If you self-host services on the internet, you may have seen waves of crawlers hammering your websites without mercy.
To annoy them and protect my services from DDoS, I decided to set up an iocaine instance, along with NSoE... And it worked... too well. Recently, they started flooding my VPS so much it started choking.
If you followed me here on Fedi, you saw my journey to find a way to relieve my server. This is a rant about LLM crawlers, and some observations & conclusions, along with some techniques to help you protect your own services.
Read it here: https://xaselgio.net/posts/26.poisoning-knowledge/
Edit: A follow-up is now available here: https://xaselgio.net/posts/26-1.addendum-poisoning-knowledge
@Soblow I don't always read a blog post til the end. But when I do, it's always a banger
-
@StupidCamille
Glad you liked it! -
@Soblow also, don't be insecure about nginx. I'm still using Apache2 and don't plan on changing anytime soon
-
@StupidCamille I mean, Apache2 is okay for regular users; using nginx is a professional bias at this point
-
@Soblow @StupidCamille i also use nginx mostly, and i set up a fail2ban rule that throws 444 at crawlers
-
@nicole4fox @Soblow yeah fail2ban is a must
-
@Soblow how did that work for ya after a while?
I have a website that's, sadly, not self-hosted (basic PHP host with 4 cores), and when I tried some tarpit methods it just... overloaded its CPU until the website went down for 1h. I ended up just settling with a hidden, no-follow, no-index link on every page that, if you click it 3x, IP-bans you (405 IPs banned so far)
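That click-counting trap can be sketched in a few lines of Python; the class name and threshold below are illustrative stand-ins, not the actual PHP setup described above:

```python
from collections import defaultdict

class HoneypotBan:
    """Sketch of the 'hidden link' trap: any client that requests the
    trap URL `threshold` times gets its IP banned."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.hits = defaultdict(int)   # per-IP hit counter on the trap URL
        self.banned = set()

    def hit(self, ip):
        """Record one request to the hidden link; return ban status."""
        self.hits[ip] += 1
        if self.hits[ip] >= self.threshold:
            self.banned.add(ip)
        return ip in self.banned

trap = HoneypotBan()
trap.hit("203.0.113.7")
trap.hit("203.0.113.7")
print(trap.hit("203.0.113.7"))  # True: third hit on the hidden link bans the IP
```

Legitimate users never see the link (it is hidden and marked no-follow/no-index), so only crawlers that ignore robots hints should ever trip it.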
-
@peq42 That's the topic of a follow-up blogpost I'm writing but, basically, I rate-limit at 1 req/m the IPs that reached a given threshold (50 req/month on my iocaine maze) with an HTTP code 429.
Behind this, if any IP hits 429 too often, they get fail2banned.
So far, I'm getting a background noise of roughly 1.5k req/min and spikes at 5k req/min, but my server survives
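A minimal sketch of that two-stage scheme (hit threshold first, then one request per interval with 429s for the rest), assuming in-memory counters rather than the actual nginx + fail2ban plumbing; the class and parameter names here are invented for illustration:

```python
import time
from collections import defaultdict

class CrawlLimiter:
    """Once an IP exceeds `threshold` total hits, it is limited to one
    allowed request per `interval` seconds; the rest get HTTP 429."""
    def __init__(self, threshold=50, interval=60.0):
        self.threshold = threshold      # hits before limiting kicks in
        self.interval = interval        # seconds between allowed requests
        self.hits = defaultdict(int)
        self.last_allowed = {}

    def check(self, ip, now=None):
        """Return the HTTP status this request should receive."""
        now = time.monotonic() if now is None else now
        self.hits[ip] += 1
        if self.hits[ip] <= self.threshold:
            return 200                  # under threshold: serve normally
        last = self.last_allowed.get(ip)
        if last is None or now - last >= self.interval:
            self.last_allowed[ip] = now
            return 200                  # the one allowed request this interval
        return 429                      # fail2ban would then watch for repeats
```

In the real setup the 429s come from the web server's rate-limiting and fail2ban scans the access log for them; this sketch only shows the decision logic.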
-
@Soblow in my case it lowered the CPU usage, but I don't think the crawlers took the hint to F off as the amount of hits/day remained the same on avg x.x
-
@peq42 you may want to use the ipset ban method then, but keep in mind the limitations!
-
@Soblow
okay, I finally found the time to finish reading this
(I got through most of it like 2w ago, but...)
and I have a potential idea about what made the crawlers go to your main website after a while:
it's the robots.txt file itself!
my guess is that Claude saw the 301 redirect and interpreted it the same way the OG search engines interpreted links: as a way to vouch for another website's worth -
@niacdoial mmh...
That would explain things...
Maybe I should just put a static route on each site instead? -
@Soblow
maybe?
not sure it will help with the existing scrapers, since they don't seem to "abandon" a website like this
in terms of resource use, an identical static route might even be more economical than a 301. -
@niacdoial The reason I didn't have a static route is because of how iocaine redirection works, but I could run some careful tests. -
@Soblow ahhh, a well-thought-out, well-written wall of text on a subject that matters to me, with relevant illustrations and comments from cute entities. What a treat for my mind
thanks for sharing! -
So, as planned, I published a follow-up / addendum blogpost about crawlers overwhelming my services.
It adds a few details about what happened since I put the rate limiting in place, provides an analysis of the attack sources (spoiler: big cloud providers), and some technical details about how everything works.
It also adds a few clarifications that were missing from the previous post. Read it here: https://xaselgio.net/posts/26-1.addendum-poisoning-knowledge
(I also edited the previous post to mention the follow-up)
-
