
If you self-host services on the internet, you may have seen waves of crawlers hammering your websites without mercy.

  • soblow@eldritch.cafe

    @lilacperegrine Are you familiar with the concept of a quadtree?

    lilacperegrine@clockwork.monster wrote (#34):

    @Soblow
    a data structure where every node has 4 children?
    I'm sorta familiar with it, but I haven't used it much

    • soblow@eldritch.cafe

      That's frightening.
      Like, legitimately, I'm scared of what that means.

      I'll try to rate-limit at 10 req/min per IP.

      ifixcoinops@retro.social wrote (#35):

      @Soblow I've been following this with interest

        soblow@eldritch.cafe wrote (#36):

        @lilacperegrine Well, I have an intuition that the problem I'm looking to solve is akin to the construction of a quadtree:
        I have a list of IPv4 /32 addresses. I want to generalize this list using the properties of subnets (a /n subnet contains two /(n+1) subnets).
        I don't have an explanation though, and my math skills are rusty...
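
The subnet property mentioned here (a /n prefix splits into exactly two /(n+1) prefixes) can be checked with Python's standard `ipaddress` module. This is only an illustrative sketch; the `192.168.1.0/24` network is an arbitrary example, not one taken from the thread.

```python
import ipaddress

# Arbitrary example network (an assumption for illustration).
net = ipaddress.ip_network("192.168.1.0/24")

# Splitting a /n prefix one bit further yields exactly two /(n+1) prefixes.
halves = list(net.subnets(prefixlen_diff=1))
print(halves)  # the two /25 halves of the /24

# Going the other way, two sibling /32 hosts merge into their parent /31.
hosts = [
    ipaddress.ip_network("192.168.1.0/32"),
    ipaddress.ip_network("192.168.1.1/32"),
]
merged = list(ipaddress.collapse_addresses(hosts))
print(merged)  # a single /31 network
```

`collapse_addresses` only merges prefixes that are fully covered by the input, so it never adds an address that was not already in the list.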

          eragon@pl.eragon.re wrote (#37):

          @Soblow@eldritch.cafe @lilacperegrine@clockwork.monster Seems like a cool thing to do… if only I could use university time for that.

            lilacperegrine@clockwork.monster wrote (#38):

            @Soblow
            so you have a list of IPs, and are using a binary (or quad) tree to classify them into clean vs dirty?

              soblow@eldritch.cafe wrote (#39):

              @lilacperegrine No, not really...
              I talked about quadtrees because that's something I have worked with, and this problem made me think of them.
              Let's take a step back and formalize the problem:

              Let's assume I have a list of unique IPv4 addresses.
              They are represented on 32 bits.
              I want to construct a list of subnets (so still 32 bits) that summarizes the list of IPs I have.

              For example, if I have 192.168.1.0 and 192.168.1.1, I could generalize this with the 192.168.1.0/31 subnet (if I'm not mistaken), which contains the previous two IPs without containing any other IPs.
              If it helps, represent them in binary and find the common upper bits:

              11000000.10101000.00000001.00000000 (192.168.1.0)
              11000000.10101000.00000001.00000001 (192.168.1.1)

              The common part is everything up to the last bit, thus the mask is a `/31` which is 255.255.255.254, or in binary `11111111.11111111.11111111.11111110`.
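
The "find the common upper bits" step can be sketched in Python by XOR-ing the two addresses as integers. The helper name `common_prefix` is hypothetical, made up for this example.

```python
import ipaddress


def common_prefix(a: str, b: str) -> ipaddress.IPv4Network:
    """Return the smallest IPv4 subnet that contains both addresses."""
    x = int(ipaddress.IPv4Address(a))
    y = int(ipaddress.IPv4Address(b))
    # XOR keeps a 1-bit wherever the addresses differ; the highest such bit
    # marks where the shared prefix ends.
    prefix_len = 32 - (x ^ y).bit_length()
    # strict=False masks off the host bits of the address we pass in.
    return ipaddress.ip_network(f"{a}/{prefix_len}", strict=False)


print(common_prefix("192.168.1.0", "192.168.1.1"))  # 192.168.1.0/31
```

Note that this covers everything between the two addresses as well, so it is only safe on its own when the pair are siblings, as in the example.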

              Now, I have tens of thousands of IPs and I'd like the smallest list of subnets that includes all bad IPs without including any good IPs.

              I'm sure there are academic papers about this; it sounds like a problem folks must already have had.
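
One possible way to approach the problem as stated is a top-down split of the address space, which is the binary-tree cousin of the quadtree idea. What follows is a sketch under assumptions, not the method of any existing tool: the function name `summarize` is hypothetical, and it assumes that any address in neither list may be swept into a ban.

```python
import ipaddress


def summarize(bad, good):
    """Cover every bad IP with CIDR prefixes that contain no good IP."""
    bad_ints = {int(ipaddress.IPv4Address(ip)) for ip in bad}
    good_ints = {int(ipaddress.IPv4Address(ip)) for ip in good}
    if bad_ints & good_ints:
        raise ValueError("an address is listed as both bad and good")
    result = []

    def walk(base, plen, bads, goods):
        if not bads:
            return  # nothing to ban in this prefix
        if not goods:
            # No good IP inside: ban the whole prefix and stop descending.
            result.append(ipaddress.IPv4Network((base, plen)))
            return
        # Mixed prefix: split it into its two halves and recurse.
        mid = base | (1 << (31 - plen))
        walk(base, plen + 1,
             [b for b in bads if b < mid], [g for g in goods if g < mid])
        walk(mid, plen + 1,
             [b for b in bads if b >= mid], [g for g in goods if g >= mid])

    walk(0, 0, sorted(bad_ints), sorted(good_ints))
    return result


nets = summarize(
    ["192.168.1.0", "192.168.1.1", "192.168.1.2"],  # bad
    ["192.168.1.3"],                                 # good
)
print(nets)  # 192.168.1.0/31 and 192.168.1.2/32
```

Because each emitted prefix is the widest one free of good IPs, the list stays short, but the prefixes can become very wide when known-good addresses are sparse (with an empty good list the result is `0.0.0.0/0`). A practical tool would probably cap the prefix length; if the goal is instead to cover exactly the bad IPs and nothing else, `ipaddress.collapse_addresses` from the standard library already does that.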

                • soblow@eldritch.cafe

                  I should do an addendum but right now, my main website is getting hammered at rates similar to what my knowledge website used to be hit at.
                  Conversely, the "knowledge" website is back to the "normal" background noise of <100 req/min.

                  The banlist now contains so many IPs, and yet they still reach 6k req/min nearly constantly.

                  At this point, I'm thinking about tinkering with my banip tool to compute optimal subnets instead of always crafting /24 subnets.

                  _syhmac@meow.social wrote (#40):

                @Soblow Yeah… the last time my website was bombed like this, I decided that Cloudflare was not such a bad idea… I hope you’ll figure it out.

                  • soblow@eldritch.cafe

                  If you self-host services on the internet, you may have seen waves of crawlers hammering your websites without mercy.

                  To annoy them and protect my services from DDoS, I decided to set up an iocaine instance, along with NSoE... And it worked... too well.

                  Recently, they started flooding my VPS so much it started choking.
                  If you followed me here on Fedi, you saw my journey to find a way to relieve my server.

                  This is a rant about LLM crawlers, and some observations & conclusions, along with some techniques to help you protect your own services.

                  Read it here: https://xaselgio.net/posts/26.poisoning-knowledge/

                  Edit: A follow-up is now available here: https://xaselgio.net/posts/26-1.addendum-poisoning-knowledge

                  #selfHosting #iocaine #indieWeb

                  stupidcamille@eldritch.cafe wrote (#41):

                  @Soblow I don't always read a blog post til the end. But when I do, it's always a banger

                    soblow@eldritch.cafe wrote (#42):

                    @StupidCamille Glad you liked it!

                      stupidcamille@eldritch.cafe wrote (#43):

                      @Soblow

                        stupidcamille@eldritch.cafe wrote (#44):

                        @Soblow also, don't be insecure about nginx. I'm still using Apache2 and don't plan on changing anytime soon

                          soblow@eldritch.cafe wrote (#45):

                          @StupidCamille I mean, Apache2 is okay for regular users, using nginx is a professional bias at this point

                            nicole4fox@datastream.cortexvoid.net wrote (#46):

                            @Soblow @StupidCamille I also mostly use nginx, and I set up a fail2ban that throws 444 at crawlers

                              stupidcamille@eldritch.cafe wrote (#47):

                              @nicole4fox @Soblow yeah fail2ban is a must

                                peq42@mastodon.social wrote (#48):

                                @Soblow how did that work for ya after a while?
                                I have a website that's, sadly, not self-hosted (basic PHP host with 4 cores) and when I tried some tar pit methods it just... overloaded its CPU until the website went down for 1h

                                I ended up just settling on a hidden, nofollow, noindex link on every page that, if you click it 3 times, IP-bans you (405 IPs banned so far)
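
The hidden-link trap described here boils down to counting hits per IP on a URL that no human should follow. A minimal in-memory sketch of that counting logic is given below in Python for illustration (the site in question runs PHP). The class name `TrapCounter` is hypothetical; only the limit of 3 comes from the post.

```python
from collections import Counter


class TrapCounter:
    """Ban an IP once it has followed the hidden trap link `limit` times."""

    def __init__(self, limit=3):
        self.limit = limit
        self.hits = Counter()
        self.banned = set()

    def hit(self, ip):
        """Record one visit to the trap URL; return True if ip is banned."""
        if ip in self.banned:
            return True
        self.hits[ip] += 1
        if self.hits[ip] >= self.limit:
            self.banned.add(ip)
        return ip in self.banned


trap = TrapCounter()
# 203.0.113.7 is a documentation-range address used as a stand-in.
print([trap.hit("203.0.113.7") for _ in range(3)])  # banned on the third hit
```

A real deployment would persist the counters and expire old entries; this sketch keeps everything in memory for brevity.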

                                  soblow@eldritch.cafe wrote (#49):

                                  @peq42 That's the topic of a follow-up blog post I'm writing but, basically, I rate-limit at 1 req/m the IPs that reached a given threshold (50 req/month on my iocaine maze) with an HTTP 429 status code.
                                  Behind this, if any IP hits 429 too often, it gets fail2banned.
                                  So far, I'm getting a background noise of roughly 1.5k req/min and spikes at 5k req/min, but my server survives.
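
The two-stage escalation described here (threshold, then 429, then ban) can be sketched as a small state machine. This is not the actual nginx/fail2ban setup from the thread, just an illustration of the logic. The class name `Escalator` and the `max_429` value are hypothetical, and reading "1 req/m" as one request per minute is an assumption.

```python
class Escalator:
    """Maze hits -> rate limit (429) -> ban, tracked per IP."""

    def __init__(self, maze_threshold=50, min_interval=60.0, max_429=10):
        self.maze_threshold = maze_threshold  # hits allowed before limiting
        self.min_interval = min_interval      # seconds between allowed hits
        self.max_429 = max_429                # 429s tolerated before a ban
        self.maze_hits = {}
        self.last_ok = {}
        self.count_429 = {}
        self.banned = set()

    def handle(self, ip, now):
        """Return 'ok', '429' or 'ban' for a maze request at time `now`."""
        if ip in self.banned:
            return "ban"
        self.maze_hits[ip] = self.maze_hits.get(ip, 0) + 1
        if self.maze_hits[ip] <= self.maze_threshold:
            return "ok"
        # Past the threshold: allow at most one request per min_interval.
        if now - self.last_ok.get(ip, float("-inf")) >= self.min_interval:
            self.last_ok[ip] = now
            return "ok"
        self.count_429[ip] = self.count_429.get(ip, 0) + 1
        if self.count_429[ip] > self.max_429:
            self.banned.add(ip)
            return "ban"
        return "429"
```

In a setup like the one described, the rate-limiting half would typically live in the web server and the ban half in a log watcher; the sketch merges them only to show how the thresholds interact.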

                                    peq42@mastodon.social wrote (#50):

                                    @Soblow in my case it lowered the CPU usage, but I don't think the crawlers took the hint to F off as the amount of hits/day remained the same on avg x.x

                                      soblow@eldritch.cafe wrote (#51):

                                      @peq42 you may want to use the ipset ban method then, but keep in mind the limitations!

                                        niacdoial@tilde.zone wrote (#52):

                                        @Soblow
                                        okay, I finally found the time to finish reading this
                                        (I got through most of it like 2w ago, but...)
                                        and I have a potential idea of what made the crawlers go to your main website after a while:
                                        it's the robots.txt file itself!
                                        my guess is that Claude saw the 301 redirect and interpreted it the same way the OG search engines interpreted links: as a way to vouch for another website's worth

                                          soblow@eldritch.cafe wrote (#53):

                                          @niacdoial mmh...
                                          That would explain things...
                                          Maybe I should just put a static route on each site instead?
