If you self-host services on the internet, you may have seen waves of crawlers hammering your websites without mercy.
To annoy them and protect my services from DDoS, I decided to set up an iocaine instance, along with NSoE... And it worked... Too well.
Recently, they started flooding my VPS so much it started choking.
If you followed me here on Fedi, you saw my journey to find a way to relieve my server.
This is a rant about LLM crawlers, with some observations & conclusions, along with some techniques to help you protect your own services.
Read it here: https://xaselgio.net/posts/26.poisoning-knowledge/
Edit: A follow-up is now available here: https://xaselgio.net/posts/26-1.addendum-poisoning-knowledge
-
@Soblow I just tried to add your article to my self-hosted Readeck (a read-it-later service), and I think the request got caught by an anti-scraper.
Guess I'll have to read the article soon to understand what happened!
-
@Sargeros Yes, it's something expected, I have the same issue with some other tools.
Maybe it has a User-Agent I can whitelist though?
-
@Soblow It seems like it announces itself as a browser https://codeberg.org/readeck/readeck/src/commit/5a979acdb2afcfe8f87d5385db778e3643322b04/internal/httpclient/client.go#L33
But don't worry, I used the browser extension to save your article and it worked fine

-
@Sargeros
pretty sure that UA got caught in the maze indeed...
good to know!
-
@velvet @Soblow @NafiTheBear Yup, they are the bane of my life. I have an ... aggressive ... filter list, and have absolutely no qualms about just blocking LLMs entirely

-
I should do an addendum, but right now my main website is getting hammered at rates similar to what my knowledge website used to be hit at.
Conversely, the "knowledge" website is back to "normal" background noise of <100 req/min.
The banlist now contains so many IPs, and yet they still reach 6k req/min nearly constantly.
At this point, I'm thinking about tinkering with my banip tool to compute optimal subnets instead of always crafting `/24` subnets.
-
@Soblow I feel like once you have blocked the big data centers, it just becomes whack-a-mole, since tools like these exist

Internet Sharing SDKs: a Closer Look at the Emerging App Monetization Method - Proxyway
Internet sharing SDKs are reshaping app monetization. But what do you need to know before adopting them?
Proxyway (proxyway.com)
-
@Soblow
curious, is the subnet thing using similarities in the IPs to ban specific ranges?
Computing optimal things sounds like a math problem, and if so, I'm game to try it out
-
That's frightening.
Like, legitimately, I'm scared of what that means.
I'll try to rate-limit at 10 req/min per IP.
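For anyone wondering what "10 req/min per IP" means in practice, here is a minimal sketch of a sliding-window limiter in Python. It is only an illustration of the idea, not the tool actually used on this server: the class name `PerIpRateLimiter`, its parameters, and the in-memory storage are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Hypothetical sliding-window limiter: at most `limit` hits per `window` seconds per IP."""

    def __init__(self, limit=10, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock              # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # ip -> timestamps of recently accepted requests

    def allow(self, ip):
        now = self.clock()
        recent = self.hits[ip]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False                # over the limit: reject (or ban, or tarpit)
        recent.append(now)
        return True


limiter = PerIpRateLimiter(limit=10, window=60.0)
print(limiter.allow("203.0.113.7"))     # first request from this IP is accepted
```

Note that this sketch never evicts idle IPs from memory, and a per-IP limit does little against a crawler that spreads its requests over thousands of addresses, which is exactly the situation described above.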
-
@lilacperegrine Are you familiar with the concept of a quadtree?
-
@Soblow
a data structure where every node has 4 children?
I'm sorta familiar with it, but I haven't used it much
-
@Soblow I've been following this with interest
-
@lilacperegrine Well, I have an intuition that the problem I'm looking to solve is akin to the construction of a quadtree:
I have a list of IPv4 `/32` addresses. I want to generalize this list using the properties of subnets (a `/n` subnet contains two `/(n+1)` subnets).
I don't have an explanation though, and my math skills are rusty...
-
@Soblow@eldritch.cafe @lilacperegrine@clockwork.monster Seems like a cool thing to do… if only I could use university time to do that.
-
@Soblow
so you have a list of IPs, and are using a binary (or quad) tree in order to classify them into clean vs dirty?
-
@lilacperegrine No, not really...
I talked about quadtrees because that's something I've manipulated, and this made me think of it.
Let's take a step back and formalize the problem:
Let's assume I have a list of unique IPv4 addresses.
They are represented on 32 bits.
I want to construct a list of subnets (so still 32 bits) that summarizes the list of IPs I have.
For example, if I have `192.168.1.0` and `192.168.1.1`, I could generalize this with the `192.168.1.0/31` subnet (if I'm not mistaken), which contains the previous two IPs without containing any other IPs.
If it helps, represent them in binary and find the common upper bits:
11000000.10101000.00000001.00000000 (192.168.1.0)
11000000.10101000.00000001.00000001 (192.168.1.1)
The common part is everything up to the last bit, thus the mask is a `/31`, which is 255.255.255.254, or in binary `11111111.11111111.11111111.11111110`.
Now, I have tens of thousands of IPs and I'd like the smallest list of subnets that includes all bad IPs without including good IPs.
I'm sure there are academic papers about this; this sounds like a problem folks must already have had.
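A rough sketch of two ways to read that problem, using Python's standard `ipaddress` module. The function names `exact_cover` and `widest_cover` are made up for this example, and the second one assumes you actually have an allowlist of known-good IPs, which the thread does not say exists. Treat it as an illustration, not as what the banip tool does.

```python
import ipaddress


def exact_cover(bad_ips):
    """Subnets that contain the listed IPs and nothing else, merged as far as possible."""
    nets = [ipaddress.ip_network(ip) for ip in bad_ips]  # a bare IP becomes a /32
    return list(ipaddress.collapse_addresses(nets))


def widest_cover(bad_ips, good_ips):
    """Widest subnets that contain every bad IP and no known-good IP.

    Assumption: good_ips is a trustworthy allowlist. With an empty
    allowlist everything collapses into 0.0.0.0/0.
    """
    good = [int(ipaddress.IPv4Address(ip)) for ip in good_ips]
    chosen = set()
    for ip in bad_ips:
        bad = int(ipaddress.IPv4Address(ip))
        # Number of leading bits shared with the closest good IP (-1 if there is none).
        shared = max((32 - (bad ^ g).bit_length() for g in good), default=-1)
        if shared == 32:
            raise ValueError(f"{ip} is listed as both bad and good")
        # One more bit than the longest shared prefix excludes every good IP.
        chosen.add(ipaddress.IPv4Network((bad, shared + 1), strict=False))
    return sorted(chosen)


print(exact_cover(["192.168.1.0", "192.168.1.1"]))   # the /31 from the example above
print(exact_cover(["192.168.1.1", "192.168.1.2"]))   # adjacent but not aligned: stays two /32
print(widest_cover(["10.0.0.5"], ["10.0.1.0"]))      # widens to 10.0.0.0/24
```

`exact_cover` never blocks an address that was not in the list, so it only shrinks the banlist when bad IPs happen to fill aligned blocks. `widest_cover` shrinks it much more aggressively, at the cost of blocking every address that is not explicitly known to be good; it also compares every bad IP against every good IP, so a prefix tree would be a better fit for large allowlists.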
-
@Soblow Yeah… last time my website got bombed like this, I decided that Cloudflare is not such a bad idea… I hope you’ll figure it out.
-
@Soblow I don't always read a blog post til the end. But when I do, it's always a banger
-
@StupidCamille
Glad you liked it!