
I have an obnoxious problem with crawlers eating bandwidth on my personal web site—not just the fact that crawlers consume so much bandwidth, but rather a behaviour that is absolutely next-level.

Tags: indieweb, webdev, personalsite
25 posts, 9 posters, 83 views
  • redstrate@mastoart.social wrote:

    @jsstaedtler a dumb solution would be to tell robots not to index the page (robots meta tag) if there are any tag queries, which i assume you can do via PHP.

    edit: or if you want individual tags indexed, at least reject robots for queries of more than one tag?

    jsstaedtler@mastodon.art wrote (#21):

    @redstrate Ah, this sounds promising! I don't want to make my site invisible on the greater Web by blocking all bot crawlers, but I'd be fine with them only loading URLs with no queries/parameters (anything after a ?). I'll look into that meta tag, though I acknowledge the other reply here that bots can happily ignore that.
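
    As a rough sketch of that idea (assuming the pages are rendered by a PHP template; well-behaved crawlers will honour the tag, badly behaved ones may not), the template's <head> could emit a robots meta tag only when the request carries query parameters:

      <?php
      // Sketch: ask crawlers not to index, or follow links on, any URL that has
      // query parameters (e.g. ?tag=a&tag=b), while leaving clean URLs indexable.
      if (!empty($_GET)) {
          echo '<meta name="robots" content="noindex, nofollow">' . "\n";
      }
      ?>

    The nofollow part is what actually limits the crawl: it asks the bot not to descend into the combinatorial tag links on those filtered pages.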

  • jsstaedtler@mastodon.art wrote:

      I'm not fundamentally opposed to web crawlers; I would actually love it if my work were more discoverable. But this is such an obnoxious situation that I'm forced to either accommodate it or protect against it.

      I'm starting to think I need to test for mutually exclusive tags, and if two or more are selected, the resulting page will have no links at all except one to go back. That will deny the bots any more links to dive into.

      But maybe there are better options? I'd wager this is not a novel issue...

      🧵7/7
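
      One way that dead-end idea could look in PHP (the ?tags= parameter name and the tag-index URL are assumptions, not the site's actual code): when more than one tag is requested, render a page whose only link points back to the tag index, so a crawler has nothing further to descend into.

        <?php
        // Sketch: treat multi-tag queries as a dead end. '?tags=a,b' is an
        // assumed URL scheme, not necessarily what the site uses.
        $requestedTags = isset($_GET['tags'])
            ? array_filter(explode(',', $_GET['tags']))
            : [];

        if (count($requestedTags) > 1) {
            echo '<p>No posts match this combination of tags.</p>';
            echo '<p><a href="/tags/">Back to all tags</a></p>';
            exit; // no other links on the page, so nothing left to crawl
        }
        // ...otherwise render the normal tag listing...
        ?>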

      foobarsoft@mastodon.social wrote (#22):

      @jsstaedtler This, I think, is why so many people have moved to having Cloudflare in front of their sites. To block/limit badly behaved bots.

  • jsstaedtler@mastodon.art wrote:

        @vga256 I'm sharing a paid host with a friend. Thanks to relatively low combined popularity, we can get away with a cheap plan, but I really don't want random bots to ruin that

        vga256@mastodon.tomodori.net wrote (#23):

        @jsstaedtler ah okay. i imagine that probably limits you from making any apache/nginx configuration changes (e.g. IP blocklists)

        i'm not familiar with your site generation code - but if you wrote it yourself, i *think* the trick would be to have it 404 when an incorrect tag has been used

        (Linked: "How to create an error 404 page using PHP?" on Stack Overflow, stackoverflow.com)

        at least then the script can die() instead of yielding output. it's anyone's guess if the crawler will still continue to try generating tags when it has encountered a 404, but i *assume* they're built to avoid 404s
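
        A minimal version of that, assuming page.php can check the requested tag against a list of known tags (the list and names below are illustrative), would send a real 404 status before die()-ing, so crawlers see an error rather than a successful-but-empty page:

          <?php
          // Sketch: $validTags stands in for however the site actually
          // enumerates its tags; the values are made up.
          $validTags = ['illustration', 'sketchbook', 'process'];

          $requested = isset($_GET['tag']) ? (array) $_GET['tag'] : [];
          if (array_diff($requested, $validTags)) {
              http_response_code(404); // a genuine 404, not a 200 that says "not found"
              die('Not found');        // stop before emitting any further links
          }
          ?>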

          • cb@boop.bleepbop.space wrote:

          @jsstaedtler I've been using Iocaine, which is specifically intended to mess with AI bots, but it can also help with "normal" bots too
          https://iocaine.madhouse-project.org/

          of course that still eats up some of your server's power. I work for a web hosting company and frequently we'll just make a list of "bad bots" in an .htaccess file to block them. The server still has to reply to their requests but doesn't have to serve them any real data

          jsstaedtler@mastodon.art wrote (#24):

          @cb I also use the .htaccess method to "block" specific agents, so they simply get thousands of 0 byte responses. Whenever it's a known LLM/AI scraper, I'm happy with that solution (and IP blocking ones that don't present a unique user agent).

          I've heard of Iocaine and similar tools but never looked into them, and I guess now is the time!
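
          For reference, the kind of .htaccess user-agent block being described often looks something like this (the bot names are illustrative examples, and this assumes Apache with mod_rewrite enabled; the exact list and response will vary by host):

            # Sketch: refuse requests from known scraper user agents with a 403,
            # so they never receive real page content. Extend the pattern as needed.
            RewriteEngine On
            RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider) [NC]
            RewriteRule ^ - [F,L]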

            • (quoting jsstaedtler@mastodon.art's reply #21 above)

            gemelen@mammut.moe wrote (#25):

            @jsstaedtler @redstrate

            The problem is that lots of crawlers do not respect robots.txt (especially those run by "AI" companies).

            Thus people go for other solutions that make crawling too expensive on the crawler's side, like iocaine - https://firesphere.dev/articles/iocaine-the-deadliest-poison-known-to-ai, or anubis - https://anubis.techaro.lol
