I have an obnoxious problem with crawlers eating bandwidth on my personal web site—not just the fact that crawlers consume so much bandwidth, but rather a behaviour that is absolutely next-level.

Uncategorized · indieweb, webdev, personalsite
25 Posts · 9 Posters · 83 Views

jsstaedtler@mastodon.art
#1

I have an obnoxious problem with crawlers eating bandwidth on my personal web site—not just the fact that crawlers consume so much bandwidth, but rather a behaviour that is absolutely next-level. And I think it's something that precludes the use of caching, but there are probably many of you with more knowledge than I have who may know what can be done.

🧵⤵️

#IndieWeb #WebDev #PersonalSite

vga256@mastodon.tomodori.net
#2

@jsstaedtler I can't remember - are you self-hosting or using a paid host?

jsstaedtler@mastodon.art
#3

My site uses PHP to produce HTML output—no flat HTML files. There is an image gallery, and the images have tags.

It's here if you actually want to see it: https://bigraccoon.ca/gallery

When you filter on a tag, it adds a parameter to the URL, e.g. "domain[dot]com?tag=2026". That loads the gallery, but only displays images tagged with "2026".

You can filter further on more tags, e.g. "?tag=2026%2CPencil" ("%2C" is a URL-encoded comma), which would show images from 2026 drawn in pencil.

🧵2/?
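
To make the URL scheme concrete, here is a minimal PHP sketch of how such a comma-separated ?tag= filter might be parsed and applied. The $images structure is an assumption for illustration, not the site's actual code:

```php
<?php
// Parse "?tag=2026%2CPencil" into ['2026', 'Pencil']; PHP decodes
// the %2C before the script sees the value.
$selected = isset($_GET['tag'])
    ? array_filter(explode(',', $_GET['tag']))
    : [];

// Keep only images carrying *every* selected tag (AND semantics).
// $images is assumed to be a list of ['file' => ..., 'tags' => [...]].
$visible = array_filter($images, function (array $img) use ($selected) {
    return !array_diff($selected, $img['tags']);
});
```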

rubenwardy@hachyderm.io
#4

@jsstaedtler

The exact same thing happened to me recently with tags. I now require users to log in to filter by multiple tags, and I've blocked the subnets of the bots.

If I wanted to allow guest users to search by multiple tags, I'd probably try the following options: (1) changing the filter to a POST request, (2) requiring JavaScript, (3) using Anubis, or (4) IP-mask rate limiting, i.e. one shared rate limit for multiple IP addresses in the same block.

I wrote a blog post about my situation here: https://blog.rubenwardy.com/2026/04/16/contentdb-ddos/

jsstaedtler@mastodon.art
#5

This method of selecting tags allows for invalid combos, like "?tag=2026%2C2025". That selects images that were drawn both in 2026 *and* 2025... which obviously can't exist! The resulting page will tell you that 0 images were found.

A human would generally make sense of the available options, and *not* select two different years simultaneously. I could even code the page so that if one year is already selected, you can't select another one.

🧵3/?

jsstaedtler@mastodon.art
#6

But you can't stop anyone from entering a URL with any combination of tag names. You must decide what page they will see when they do so, and in my case, it's a gallery page with 0 images.

Now: enter the web crawler bot. It finds my site. It grabs all of the links on the front page, then starts loading each one. Then it grabs all of the links on *those* pages, and starts loading all of *them*. Presumably it will stop once all links have been viewed.

🧵4/?

cb@boop.bleepbop.space
#7

@jsstaedtler I've been using Iocaine, which is specifically intended to mess with AI bots, but it can help with "normal" bots too.
https://iocaine.madhouse-project.org/

Of course, that still eats up some of your server's power. I work for a web hosting company, and frequently we'll just make a list of "bad bots" in an .htaccess file to block them. The server still has to reply to their requests but doesn't have to serve them any real data.
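
A sketch of the kind of .htaccess rule described above, assuming Apache with mod_rewrite available; the User-Agent names are examples of commonly blocked crawlers, so build your own list from your access logs:

```apache
# Refuse listed bot User-Agents with a 403; no gallery page is
# generated, so each blocked request costs almost nothing.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Bytespider|PetalBot) [NC]
RewriteRule ^ - [F,L]
```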

jsstaedtler@mastodon.art
#8

So it loads my gallery page, and sees the list of tags: maybe 50 different links, all of which load the gallery page with a new filter applied. So it loads one, like "?tag=2026".

On the resulting page, there are still 50-odd tag links available. So it loads another one, and the URL now includes "?tag=2026%2C2025". Which is nonsense, but the page still loads.

Well, there are 0 images to show on that page, but still more tags to open! So next the bot opens "?tag=2026%2C2025%2C2024"...

🧵5/?

oblomov@sociale.network
#9

@jsstaedtler An easy way to catch this is that these scrapers generally don't send Referer headers, so you can kill them by checking that a valid Referer header is present on tag-search requests. This will have false positives for humans who try to be too smart, though.
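
A minimal PHP sketch of that check, using the gallery's own hostname; as the caveat says, it will also turn away some legitimate visitors whose browsers suppress the Referer header:

```php
<?php
// Require a same-site Referer on any tag-filter request.
if (isset($_GET['tag'])) {
    $referer = $_SERVER['HTTP_REFERER'] ?? '';
    if (parse_url($referer, PHP_URL_HOST) !== 'bigraccoon.ca') {
        http_response_code(403);
        exit('Tag filtering is only available when browsing the gallery.');
    }
}
```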

rubenwardy@hachyderm.io
#10

@jsstaedtler

For your particular case, you should return a 404 if the URL contains both 2025 and 2026. This would stop them getting into invalid combinations. You can make it so the UI never links to these combinations by *replacing* rather than appending years if one already exists
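
A PHP sketch of both halves of that suggestion, assuming year tags are the site's only four-digit tags (the year_link() helper is illustrative, not existing code):

```php
<?php
$tags  = isset($_GET['tag']) ? explode(',', $_GET['tag']) : [];
$years = array_filter($tags, fn($t) => preg_match('/^\d{4}$/', $t));

// Two year tags can never match an image, so refuse the page outright.
if (count($years) > 1) {
    http_response_code(404);
    exit('No images match that tag combination.');
}

// When building a year link, replace any already-selected year instead
// of appending it, so the UI never produces the invalid combination.
function year_link(array $tags, string $year): string {
    $kept   = array_filter($tags, fn($t) => !preg_match('/^\d{4}$/', $t));
    $kept[] = $year;
    return '?tag=' . rawurlencode(implode(',', $kept));
}
```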

jsstaedtler@mastodon.art
#11

How many permutations of tags are there? A butt-tonne, and the bot will diligently check out ALL OF THEM. Thousands and thousands of page loads! And even though all of them have 0 images to display, there will still be a tag list to choose from, and it will always visually update to indicate which tags are currently selected. So the page can't just be saved in a static HTML file, and the bot isn't going to load anything from its own cache.

🧵6/?

oblomov@sociale.network
#12

@jsstaedtler (talking from experience with my self-hosted gitweb for this, BTW)

lumi@snug.moe
#13

@oblomov @jsstaedtler The Referer header only exists for tracking, so many privacy-conscious people configure their browsers not to send it.

The Referer header should not exist in the first place.

jsstaedtler@mastodon.art
#14

I'm not fundamentally opposed to web crawlers; I would actually love it if my work were more discoverable. But this is such an obnoxious situation that I'm forced to accommodate it or protect against it.

I'm starting to think I need to test for mutually exclusive tags, and if two or more are selected, the resulting page will have no links at all except one to go back. That will deny the bots any more links to dive into.

But maybe there are better options? I'd wager this is not a novel issue...

🧵7/7

redstrate@mastoart.social
#15

@jsstaedtler a dumb solution would be to tell robots not to index the page (robots meta tag) if there are any tag queries, which I assume you can do via PHP.

edit: or if you want individual tags indexed, at least reject robots for queries of more than one tag?
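
A sketch of that idea: emit the robots meta tag from PHP inside the page's <head> whenever more than one tag is selected, so single-tag pages stay indexable but deep combinations are not:

```php
<?php
$tags = isset($_GET['tag']) ? explode(',', $_GET['tag']) : [];
if (count($tags) > 1) {
    // noindex keeps the page out of the index; nofollow asks compliant
    // crawlers not to follow the 50-odd tag links on it.
    echo '<meta name="robots" content="noindex, nofollow">' . "\n";
}
```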

rubenwardy@hachyderm.io
#16

@redstrate @jsstaedtler

Many crawlers ignore this in my experience, especially the AI ones

jsstaedtler@mastodon.art
#17

@vga256 I'm sharing a paid host with a friend. Thanks to relatively low combined popularity, we can get away with a cheap plan, but I really don't want random bots to ruin that.

rubenwardy@hachyderm.io
#18

@jsstaedtler

To block the abusive subnets, I used this tool to look up the IP ranges from example IP addresses. You can see all the IP ranges for a particular host: https://www.whatismyip.com/asn/AS150436/

I then blocked them using ipset/iptables, but other options exist depending on your setup.
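
A minimal sketch of that kind of block; 203.0.113.0/24 is a documentation-reserved placeholder, so substitute the ranges found via the ASN lookup:

```sh
# Collect bad-bot networks in one set and drop their packets early.
ipset create badbots hash:net
ipset add badbots 203.0.113.0/24
iptables -I INPUT -m set --match-set badbots src -j DROP
```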

redstrate@mastoart.social
#19

@rubenwardy @jsstaedtler it would at least help with the legitimate ones!

rubenwardy@hachyderm.io
#20

@redstrate @jsstaedtler

Ah yes, worth doing, as it also improves your SEO by not having thousands of similar pages.
