Bluesky is down today.

mcc@mastodon.social

@esm @thisismissem I mean, it's certainly possible that I am simply misinterpreting Rudy's comments about relays! But all we ever get from Rudy are these vague gnomic comments, so this is about the best I can do. I'd rather he spend his time sysadmining and writing Rust than writing up incident reports for public consumption, but it does mean that figuring out wtf is happening to my feed as a Blacksky user is constant detective work.

thisismissem@hachyderm.io

@mcc @esm I think we'll need to wait for the analysis and blog posts that follow.

aeris@firefish.imirhil.fr

@mcc@mastodon.social And in theory it's an attack vector. You can bootstrap thousands of instances, subscribe to as many accounts as possible, and then just shut your instances down.
Any content from a subscribed account will then generate a background job targeting your down instance, hitting the timeout each time.
You can flood an instance like that, continually overflowing its queue with dangling content.

jeromechoo@masto.ai

@aeris @mcc failed deliveries happen all the time. mastodon.social is just another instance. Since Mastodon and Misskey operate on shared inboxes, the number of failed deliveries on the sender's side doesn't scale with the size of the recipient instance.

https://seb.jambor.dev/posts/understanding-activitypub/#:~:text=To%20combat%20this,our%20followers%20internally.
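
A rough sketch of the shared-inbox point, for anyone following along: if deliveries are grouped by shared inbox, a post costs one request per remote server rather than one per follower. The function name, the data layout and the inbox URLs here are made up for illustration and are not taken from Mastodon's or Misskey's code.

```python
from collections import defaultdict


def plan_deliveries(followers):
    """Group followers by shared inbox so each remote server gets one request.

    `followers` is a list of (actor_id, shared_inbox_url) pairs.
    This layout is an assumption made for the sketch.
    """
    by_inbox = defaultdict(list)
    for actor_id, shared_inbox in followers:
        by_inbox[shared_inbox].append(actor_id)
    return dict(by_inbox)


# Hypothetical follower list: two followers on one server, one on another.
followers = [
    ("alice@mastodon.social", "https://mastodon.social/inbox"),
    ("bob@mastodon.social", "https://mastodon.social/inbox"),
    ("carol@masto.ai", "https://masto.ai/inbox"),
]

deliveries = plan_deliveries(followers)
for inbox, actors in deliveries.items():
    print(inbox, "covers", len(actors), "follower(s)")
```

Three followers become two outgoing requests per post, so a down server costs one failed request per post no matter how many followers it hosts.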

aeris@firefish.imirhil.fr

@jeromechoo@masto.ai @mcc@mastodon.social Yes, I know that. The trouble is not one post sent to many, but many posts sent to one.
Each post from an instance is sent only once to mastodon.social, but EACH post is sent.

aeris@firefish.imirhil.fr

@mcc@mastodon.social @jeromechoo@masto.ai So a huge instance sends dozens of posts per second (many posts generated, each delivered only once) to another huge instance, with one background job per post to deliver.

aeris@firefish.imirhil.fr

@mcc@mastodon.social @jeromechoo@masto.ai The trouble scales not with the size of the down instance, but with the size of the alive instance. The more active it is and the more content it generates, the faster its background job queue fills with dangling content.

timbray@cosocial.ca

@mcc When I initially raised my eyebrows at Bluesky's notion of "federation", I was told that anyone can run a relay on a small cheap computer, it's dead easy, etc.…

jeromechoo@masto.ai

@aeris @mcc this post you made probably just failed to deliver to a few instances. One of them could be mastodon.social if it was down.

How does that affect delivery to masto.ai?

mastodon.social being down is no different to your server than any other server being down. One request each.

breizh@pleroma.breizh.pm

@mcc@mastodon.social From what I understand of the protocol, they could just stop using a relay at all, but then it would increase the traffic on all the PDSes that were scraped by the relay until then, since the AppView would have to connect to each of those instead of the relay.

And did switching to another relay solve the issue?

aeris@firefish.imirhil.fr

@jeromechoo@masto.ai @mcc@mastodon.social It affects delivery to masto.ai because EACH of my posts generates a dangling request that hits the timeout. After a while, my workers spend more time on dangling requests taking 2-3s (hitting the timeout) than on trying to send content to masto.ai.

aeris@firefish.imirhil.fr

@mcc@mastodon.social @jeromechoo@masto.ai Each post is a dangling request that consumes 3s of worker time, 10× the 300ms needed for an alive server, and is then scheduled for retry. After a while, all workers are stuck on 3s waits, starving requests to alive servers.
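
To put rough numbers on the starvation argument, here is a back-of-the-envelope sketch using the 3s timeout and 300ms normal delivery quoted above. The fraction of jobs aimed at down servers is a free parameter, and the model (one worker, jobs processed one after another, no retries counted) is a simplifying assumption, not a description of how any particular server schedules its queue.

```python
def mean_job_seconds(dead_fraction, dead_s=3.0, alive_s=0.3):
    """Average time a worker spends per job for a given share of dead targets."""
    return dead_fraction * dead_s + (1.0 - dead_fraction) * alive_s


def dead_time_share(dead_fraction, dead_s=3.0, alive_s=0.3):
    """Share of total worker time burned waiting on timeouts."""
    total = mean_job_seconds(dead_fraction, dead_s, alive_s)
    return dead_fraction * dead_s / total


# Hypothetical dead-job fractions, chosen only to show the trend.
for p in (0.0, 0.05, 0.10, 0.25):
    share = dead_time_share(p)
    rate = 1.0 / mean_job_seconds(p)
    print(f"{p:.0%} dead jobs -> {share:.0%} of worker time, {rate:.2f} jobs/s per worker")
```

Under these assumptions, once 10% of jobs target a down server, more than half of each worker's time goes to waiting on timeouts.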

mcc@mastodon.social

Updates

- Over the last two hours the problem has gone from "I don't see my posts" to "I see my posts 1 hour after I make them" to "17 minutes" to "3 minutes" to "it's fixed". I interpret this as the relay firehose pointer, whatever relay is in use right now, gradually catching up.

- I need to stress the above thread is a mix of fact (ATProto federation is duplicative and often brittle) and conjecture (I can't know what relay is being used internally by Blacksky except if Rudy tells us).

mcc@mastodon.social

@breizh As of this second, Blacksky has resolved the issue. I don't know how.

aeris@firefish.imirhil.fr

@mcc@mastodon.social @jeromechoo@masto.ai After a while, you have 43 minutes of latency for EVERY DELIVERY, even to alive servers. I've experienced that on my own Mastodon instance…

scatty_hannah@federation.network

@mcc@mastodon.social @aeris@firefish.imirhil.fr if that's really the case, if anything, that's an implementation problem. Mail servers have dealt with this problem for ages. That's why they have queues and per server exponentially increasing retry intervals. Push is not inherently bad.
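
For illustration, a minimal sketch of the mail-server approach described above: track failures per destination host and double the retry interval on each failure, so a dead host is skipped instead of costing a timeout on every post. The class name, method names and default intervals are invented for this example. It does not show what Mastodon or any mail server actually implements.

```python
class HostBackoff:
    """Per-host exponential backoff (hypothetical helper, not a real library)."""

    def __init__(self, base=60.0, cap=86400.0):
        self.base = base      # first retry delay in seconds (assumed value)
        self.cap = cap        # upper bound on the delay (assumed value)
        self.failures = {}    # host -> consecutive failure count
        self.retry_at = {}    # host -> earliest time of the next attempt

    def should_attempt(self, host, now):
        """True if the host has no pending backoff window at time `now`."""
        return now >= self.retry_at.get(host, 0.0)

    def record_failure(self, host, now):
        """Double the delay for this host, up to the cap, and return it."""
        count = self.failures.get(host, 0) + 1
        self.failures[host] = count
        delay = min(self.cap, self.base * 2 ** (count - 1))
        self.retry_at[host] = now + delay
        return delay

    def record_success(self, host):
        """Reset the host's state after a successful delivery."""
        self.failures.pop(host, None)
        self.retry_at.pop(host, None)


backoff = HostBackoff(base=60.0)
delays = [backoff.record_failure("down.example", now=0.0) for _ in range(4)]
print(delays)
print(backoff.should_attempt("down.example", now=100.0))
print(backoff.should_attempt("alive.example", now=100.0))
```

The point of keying the state by host rather than by job is that new posts for a known-dead host can be parked without a worker ever opening a connection.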

slothrop@chaos.social

@nasser @mcc Thanks indeed! This is a great explanation.

My own takeaway is that Bluesky is a lost cause in terms of decentralization, because its architecture is designed to resist that outcome.

jeromechoo@masto.ai

@aeris @mcc like I have already said, every post you’ve made just now has failed to deliver to several instances. Your instance is running just fine, is it not?

If mastodon.social goes down, it would add ONE more failed delivery to the queue of thousands your instance is already managing.

aeris@firefish.imirhil.fr

@mcc@mastodon.social @jeromechoo@masto.ai In the end, any worker has just a 7% "chance" (3 out of 42) of hitting a down request, consuming resources for nothing for 2-3s and leaving no time to schedule alive servers. Starvation leaves 13,000 pending requests, many of them for alive servers. Perhaps the 13,000th will be an alive one, but it will only be delivered after 43 minutes on average.

aeris@firefish.imirhil.fr

@jeromechoo@masto.ai @mcc@mastodon.social No, it's not running fine. I ALREADY reported 43 minutes of latency to deliver ANY MESSAGE on Mastodon. This "bug" (in fact bad design) has been known for ages.
https://github.com/mastodon/mastodon/issues/12445

Bluesky is down today.

Rudolph Fraser. (@rude1.blacksky.team)