If you self-host services on the internet, you may have seen waves of crawlers hammering your websites without mercy.
To annoy them and protect my services from DDoS, I decided to set up an iocaine instance, along with NSoE... And it worked... Too well.
Recently, they started flooding my VPS so much it started choking.
If you followed me here on Fedi, you saw my journey to find a way to relieve my server.
This is a rant about LLM crawlers, and some observations & conclusions, along with some techniques to help you protect your own services.
Read it here: https://xaselgio.net/posts/26.poisoning-knowledge/
Edit: A follow-up is now available here: https://xaselgio.net/posts/26-1.addendum-poisoning-knowledge
-
@Soblow Linkding sounds really interesting.
I was looking for a bookmark sync thing/Firefox Sync alternative anyway and wasn't aware of it before.
-
@Soblow WHEW what an adventure
That sounds horrible. I hate the current state of the internet
-
@Soblow Thank you for writing this, reading it has been very educational!
I have a number of things in development that will make iocaine less of a pain, and more suitable for cases like yours - but... that takes a bit of time.
And NSoE's documentation is a mess, indeed. My excuse is that it was never meant to be used by anyone else, it's something I wrote for me. But there was no other option for a long time, and even if iocaine3 has a built-in script now, that's not as good as NSoE (yet).
I have plans to address that shortcoming, so there's an option that isn't NSoE and has useful, navigable documentation that isn't written like a mad scientist's diary.
But all the issues you listed are valid, and you even highlighted shortcomings I wasn't aware of, and tricks I did not consider. Now I have more things to play with!

-
@Soblow OMG that's horrendous. I'm too scared to look at my (recently set up) nginx logs now.
-
@Soblow@eldritch.cafe your prose is so clear and easy to follow! I love it :3
-
@Soblow Thank you, loved reading this.
-
@Soblow Owch. Painful long read, regurgitating the experience of a lot of tech administrators. I like the idea of poisoning AI, but giving them an infinite pile of garbage is a waste of my resources. I'll leave that to people like you

ps. Please learn the difference between scrapping and scraping, and dairy and diary.
-
@booklordofthedings Oh, for bookmark sync, I wouldn't recommend Linkding, but maybe something like Floccus (and maybe Nextcloud bookmarks)

-
@khleedril Oops, thanks for your feedback!
(English isn't my native language)
-
@algernon Thanks for your reply!
It may not be obvious, but I don't consider "mad scientist" a derogatory term, so yeah
(and to be honest, I didn't even realize it was in your domain name)
The main struggles I had were because both iocaine and NSoE use languages I'm not used to, which isn't your fault.
Again, your software works nicely and I'm glad it exists!
(And I know the problem of "this was meant to be a solution for me that I made available to everyone"; again, it's fully understandable, and that's open-source software: it comes with no expectations or warranty).
-
@flann Please do
I'll go read that post when I have time~
-
As promised, I updated the linked repository to add a README (and a license)
-
@flann Okay, I read it. It's highly interesting, and it would've helped if I had seen this earlier
There are things from your blog post I'll likely try later too

-
@Soblow@eldritch.cafe
A nice read! Nothing too new for me, as I was following you live on that journey, but good to hear you found something that helps!
Also using iocaine for my services, I at one point thought "why not let iocaine also serve garbage to empty user-agents?". That'd also catch a lot of the vuln scanners that are convinced I'm using WordPress (I'm not).
There's a surprising amount of legitimate traffic you wouldn't expect to not set a UA
-
@pandro That could be something, but for example the "fursona lookup" tool didn't have a User Agent set (until I told the author)...
-
@Soblow "I spent the last few years building up a tolerance to iocaine powder."
-
@Soblow Hah... I think we're getting iocaine or something when trying to read the article on our phone (iOS Safari, on iOS 15 or something like that). Haven't tried our desktop. Pretty meta, though.
-
@Soblow Hmmm... I'm thinking of serving a knowledge.tld with just static garbage now...