If you self-host services on the internet, you may have seen waves of crawlers hammering your websites without mercy.

57 Posts 27 Posters 2 Views
  • soblow@eldritch.cafe wrote:

    If you self-host services on the internet, you may have seen waves of crawlers hammering your websites without mercy.

    To annoy them and protect my services from DDoS, I decided to set up an iocaine instance, along with NSoE... And it worked... Too well.

    Recently, they started flooding my VPS so much it started choking.
    If you followed me here on Fedi, you saw my journey to find a way to relieve my server.

    This is a rant about LLM crawlers, and some observations & conclusions, along with some techniques to help you protect your own services.

    Read it here: https://xaselgio.net/posts/26.poisoning-knowledge/

    Edit: A follow-up is now available here: https://xaselgio.net/posts/26-1.addendum-poisoning-knowledge

    #selfHosting #iocaine #indieWeb

    stupidcamille@eldritch.cafe wrote (#41):

    @Soblow I don't always read a blog post til the end. But when I do, it's always a banger

      soblow@eldritch.cafe wrote (#42):

      @StupidCamille Glad you liked it!

        stupidcamille@eldritch.cafe wrote (#43):

        @Soblow

          stupidcamille@eldritch.cafe wrote (#44):

          @Soblow also, don't be insecure about nginx. I'm still using Apache2 and don't plan on changing anytime soon

            soblow@eldritch.cafe wrote (#45):

            @StupidCamille I mean, Apache2 is okay for regular users; using nginx is a professional bias at this point

              nicole4fox@datastream.cortexvoid.net wrote (#46):

              @Soblow @StupidCamille I also mostly use nginx, and I set up fail2ban to throw 444 at crawlers

                stupidcamille@eldritch.cafe wrote (#47):

                @nicole4fox @Soblow yeah fail2ban is a must

                  peq42@mastodon.social wrote (#48):

                  @Soblow how did that work for ya after a while?
                  I have a website that's, sadly, not self-hosted (basic PHP host with 4 cores), and when I tried some tarpit methods it just... overloaded its CPU until the website went down for 1h

                  I ended up just settling for a hidden, nofollow, noindex link on every page that IP-bans you if you click it 3x (405 IPs banned so far)
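
The hidden-link trap described above could look roughly like the sketch below. This is not the poster's actual PHP code: the trap URL, the class name, and the in-memory storage are all assumptions made for illustration. Only the "three hits, then ban" rule comes from the post.

```python
from collections import defaultdict

TRAP_PATH = "/trap-link"  # hypothetical URL of the hidden, nofollow link
BAN_THRESHOLD = 3         # from the post: three hits on the trap means a ban


class TrapBanList:
    """Counts trap-link hits per IP and bans an IP once it reaches the threshold."""

    def __init__(self, threshold=BAN_THRESHOLD):
        self.threshold = threshold
        self.hits = defaultdict(int)  # ip -> number of trap hits so far
        self.banned = set()           # IPs that reached the threshold

    def is_banned(self, ip):
        return ip in self.banned

    def record_request(self, ip, path):
        """Return True if the request should be served, False if it is refused."""
        if ip in self.banned:
            return False
        if path == TRAP_PATH:
            self.hits[ip] += 1
            if self.hits[ip] >= self.threshold:
                # The hit that reaches the threshold is itself refused.
                self.banned.add(ip)
                return False
        return True
```

A real deployment would persist the ban list (or hand it to the firewall) rather than keep it in process memory, since a restart of this sketch forgets every ban.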

                    soblow@eldritch.cafe wrote (#49):

                    @peq42 That's the topic of a follow-up blogpost I'm writing but, basically, I rate-limit to 1 req/m, with an HTTP 429 code, the IPs that reached a given threshold (50 req/month on my iocaine maze).
                    Behind this, if any IP hits 429 too often, it gets fail2banned.
                    So far, I'm getting a background noise of roughly 1.5k req/min and spikes at 5k req/min, but my server survives
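
The escalation described above (maze threshold, then throttling with 429, then a ban) can be sketched as a small state machine. This is only an illustration, not the author's nginx/fail2ban setup. The 50-request threshold comes from the post; reading "1 req/m" as one request per minute is an assumption, and so is the number of 429 responses that triggers a ban (`MAX_429`). The monthly reset of the counters is left out.

```python
from dataclasses import dataclass
from typing import Optional

MAZE_THRESHOLD = 50   # from the post: 50 maze requests per month
MIN_INTERVAL = 60.0   # assumption: "1 req/m" read as one request per minute
MAX_429 = 10          # hypothetical: the post only says "too often"


@dataclass
class IpState:
    maze_hits: int = 0
    last_allowed: Optional[float] = None  # time of the last request served while throttled
    count_429: int = 0
    banned: bool = False


class Escalator:
    """Tracks each IP through three stages: normal, throttled, banned."""

    def __init__(self):
        self.states = {}

    def handle(self, ip, now, in_maze):
        """Return "ok", "429", or "banned" for a request arriving at time `now` (seconds)."""
        st = self.states.setdefault(ip, IpState())
        if st.banned:
            return "banned"
        if in_maze:
            st.maze_hits += 1
        if st.maze_hits < MAZE_THRESHOLD:
            return "ok"
        # Throttled: serve at most one request per MIN_INTERVAL seconds.
        if st.last_allowed is None or now - st.last_allowed >= MIN_INTERVAL:
            st.last_allowed = now
            return "ok"
        st.count_429 += 1
        if st.count_429 >= MAX_429:
            st.banned = True  # this is where fail2ban would take over
        return "429"
```

Splitting the work this way keeps the cheap check (is the IP banned?) first, which matters when most of the traffic comes from already-flagged addresses.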

                      peq42@mastodon.social wrote (#50):

                      @Soblow in my case it lowered the CPU usage, but I don't think the crawlers took the hint to F off as the amount of hits/day remained the same on avg x.x

                        soblow@eldritch.cafe wrote (#51):

                        @peq42 you may want to use the ipset ban method then, but keep in mind the limitations!

                          niacdoial@tilde.zone wrote (#52):

                          @Soblow
                          okay, I finally found the time to finish reading this
                          (I got through most of it like 2w ago, but...)
                          and I have a potential idea of what made the crawlers go to your main website after a while:
                          it's the robots.txt file itself!
                          my guess is that Claude saw the 301 redirect and interpreted it in the same way the OG search engines interpreted links: as a way to vouch for another website's worth

                            soblow@eldritch.cafe wrote (#53):

                            @niacdoial mmh...
                            That would explain things...
                            Maybe I should just put a static route on each site instead?

                              niacdoial@tilde.zone wrote (#54):

                              @Soblow
                              maybe?
                              not sure it will help with the existing scrapers, since they don't seem to "abandon" a website like this
                              in terms of resource use, an identical static route might even be more economical than a 301.
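
To make the "static route instead of a 301" idea concrete, here is a minimal sketch of a handler that answers `/robots.txt` directly with a fixed body, so a crawler never gets a redirect pointing at another host. It uses Python's standard `http.server` purely for illustration; the thread's actual setup is nginx plus iocaine, and the robots.txt content shown here is a hypothetical placeholder.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical policy; the real file's content is not given in the thread.
ROBOTS_BODY = b"User-agent: *\nDisallow: /\n"


def respond(path):
    """Return (status, headers, body) for a GET request on `path`."""
    if path == "/robots.txt":
        headers = {
            "Content-Type": "text/plain; charset=utf-8",
            "Content-Length": str(len(ROBOTS_BODY)),
        }
        # Served in place with a 200: no Location header, no second request.
        return 200, headers, ROBOTS_BODY
    return 404, {"Content-Length": "0"}, b""


class StaticRobotsHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around respond()."""

    def do_GET(self):
        status, headers, body = respond(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, one could pass `StaticRobotsHandler` to `http.server.HTTPServer(("127.0.0.1", 8080), StaticRobotsHandler)` and call `serve_forever()`. Whether this changes crawler behaviour is exactly the open question in the thread; the sketch only shows the mechanics.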

                                soblow@eldritch.cafe wrote (#55):

                                @niacdoial I didn't have a static route because of how iocaine redirection works, but I could make some careful tests.

                                  miranda_blue@eldritch.cafe wrote (#56):

                                  @Soblow ahhh a well-thought, well-written wall of text on a subject that matters to me, with relevant illustrations and comments from cute entities. What a treat to my mind 💜 thanks for sharing!

                                    soblow@eldritch.cafe wrote (#57):

                                    So, as planned, I published a follow-up / addendum blogpost about crawlers overwhelming my services.

                                    It adds a few details about what happened since I put the rate limiting in place, provides an analysis of the attack sources (spoiler: big cloud providers), and some technical details about how everything works.
                                    It also adds a few clarifications that were missing from the previous post.

                                    Read it here: https://xaselgio.net/posts/26-1.addendum-poisoning-knowledge

                                    (I also edited the previous post to mention the follow-up)

                                    #selfHosting #iocaine #indieWeb
