
I have an obnoxious problem with crawlers eating bandwidth on my personal web site—not just the fact that crawlers consume so much bandwidth, but rather a behaviour that is absolutely next-level.

Tags: indieweb, webdev, personalsite
25 posts, 9 posters, 83 views
  • redstrate@mastoart.social wrote:

    @jsstaedtler a dumb solution would be to tell robots not to index the page (robots meta tag) if there are any tag queries, which i assume you can do via PHP.

    edit: or if you want individual tags indexed, at least reject robots for queries of more than one tag?

    jsstaedtler@mastodon.art wrote (#21):

    @redstrate Ah, this sounds promising! I don't want to make my site invisible on the greater Web by blocking all bot crawlers, but I'd be fine with them only loading URLs with no queries/parameters (anything after a ?). I'll look into that meta tag, though I acknowledge the other reply here that bots can happily ignore that.
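
    As a rough sketch of that idea (assuming the pages are rendered by a PHP template; well-behaved crawlers will honour the tag, badly behaved ones may not), the template's <head> could emit a robots meta tag only when the request carries query parameters:

      <?php
      // Sketch: ask crawlers not to index, or follow links on, any URL that has
      // query parameters (e.g. ?tag=a&tag=b), while leaving clean URLs indexable.
      if (!empty($_GET)) {
          echo '<meta name="robots" content="noindex, nofollow">' . "\n";
      }
      ?>

    The nofollow part is what actually limits the crawl: it asks the bot not to descend into the combinatorial tag links on those filtered pages.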

  • jsstaedtler@mastodon.art wrote:

      I'm not fundamentally opposed to web crawlers; I would actually love it if my work were more discoverable. But this is such an obnoxious situation that I'm forced to either accommodate it or protect against it.

      I'm starting to think I need to test for mutually exclusive tags, and if two or more are selected, the resulting page will have no links at all except one to go back. That will deny the bots any more links to dive into.

      But maybe there are better options? I'd wager this is not a novel issue...

      🧵7/7
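
      One way that dead-end idea could look in PHP (the ?tags= parameter name and the tag-index URL are assumptions, not the site's actual code): when more than one tag is requested, render a page whose only link points back to the tag index, so a crawler has nothing further to descend into.

        <?php
        // Sketch: treat multi-tag queries as a dead end. '?tags=a,b' is an
        // assumed URL scheme, not necessarily what the site uses.
        $requestedTags = isset($_GET['tags'])
            ? array_filter(explode(',', $_GET['tags']))
            : [];

        if (count($requestedTags) > 1) {
            echo '<p>No posts match this combination of tags.</p>';
            echo '<p><a href="/tags/">Back to all tags</a></p>';
            exit; // no other links on the page, so nothing left to crawl
        }
        // ...otherwise render the normal tag listing...
        ?>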

      foobarsoft@mastodon.social wrote (#22):

      @jsstaedtler This, I think, is why so many people have moved to having Cloudflare in front of their sites. To block/limit badly behaved bots.

  • jsstaedtler@mastodon.art wrote:

        @vga256 I'm sharing a paid host with a friend. Thanks to relatively low combined popularity, we can get away with a cheap plan, but I really don't want random bots to ruin that

        vga256@mastodon.tomodori.net wrote (#23):

        @jsstaedtler ah okay. i imagine that probably limits you from making any apache/nginx configuration changes (e.g. IP blocklists)

        i'm not familiar with your site generation code - but if you wrote it yourself, i *think* the trick would be to have it 404 when an incorrect tag has been used

        (Linked: "How to create an error 404 page using PHP?" on Stack Overflow, stackoverflow.com)

        at least then the script can die() instead of yielding output. it's anyone's guess if the crawler will still continue to try generating tags when it has encountered a 404, but i *assume* they're built to avoid 404s
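
        A minimal version of that, assuming page.php can check the requested tag against a list of known tags (the list and names below are illustrative), would send a real 404 status before die()-ing, so crawlers see an error rather than a successful-but-empty page:

          <?php
          // Sketch: $validTags stands in for however the site actually
          // enumerates its tags; the values are made up.
          $validTags = ['illustration', 'sketchbook', 'process'];

          $requested = isset($_GET['tag']) ? (array) $_GET['tag'] : [];
          if (array_diff($requested, $validTags)) {
              http_response_code(404); // a genuine 404, not a 200 that says "not found"
              die('Not found');        // stop before emitting any further links
          }
          ?>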

          • cb@boop.bleepbop.space wrote:

          @jsstaedtler I've been using Iocaine, which is specifically intended to mess with AI bots, but it can also help with "normal" bots too
          https://iocaine.madhouse-project.org/

          of course that still eats up some of your server's power. I work for a web hosting company and frequently we'll just make a list of "bad bots" in an .htaccess file to block them. The server still has to reply to their requests but doesn't have to serve them any real data

          jsstaedtler@mastodon.art wrote (#24):

          @cb I also use the .htaccess method to "block" specific agents, so they simply get thousands of 0 byte responses. Whenever it's a known LLM/AI scraper, I'm happy with that solution (and IP blocking ones that don't present a unique user agent).

          I've heard of Iocaine and similar tools but never looked into them, and I guess now is the time!
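
          For reference, the kind of .htaccess user-agent block being described often looks something like this (the bot names are illustrative examples, and this assumes Apache with mod_rewrite enabled; the exact list and response will vary by host):

            # Sketch: refuse requests from known scraper user agents with a 403,
            # so they never receive real page content. Extend the pattern as needed.
            RewriteEngine On
            RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider) [NC]
            RewriteRule ^ - [F,L]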

            • (quoting jsstaedtler@mastodon.art's reply #21 above)

            gemelen@mammut.moe wrote (#25):

            @jsstaedtler @redstrate

            The problem is that lots of crawlers do not respect robots.txt (especially those run by "AI" companies).

            Thus people go for other solutions that make crawling too expensive on the crawler's side, like iocaine - https://firesphere.dev/articles/iocaine-the-deadliest-poison-known-to-ai, or anubis - https://anubis.techaro.lol
