Bluesky is down today.
-
TLDR
1. My definition of "P2P" or "Federated" is that if server A goes down, servers B and C can still talk to each other.
2. Bluesky/"Atmosphere" fails at this because Blacksky (B) requires Bluesky (A) to talk to me (C).
3. In order for Blacksky to avert this, they have to do something unreasonable and expensive.
4. Blacksky someday *will* do this, but it will depend heavily on massively overworking Rudy and a few other people. This may someday fail.
5. ActivityPub has problems, but not these.
@mcc This is a good take, mcc.
-
@mcc@mastodon.social And it's worse for a huge instance that is still alive. Hundreds of messages per second. Hundreds of jobs per second for the down instance. Hundreds of dead jobs filling the queue because of timeouts, competing for resources with live jobs. At some point, all workers process only dead jobs…
@mcc@mastodon.social I don't know exactly what the effect of a 10-hour downtime like Bluesky's would be if it hit mastodon.social, for example. I'd expect at least delays growing over time, even for traffic that doesn't involve mastodon.social.
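A rough back-of-the-envelope sketch of the saturation argument above, using Little's law (average jobs in flight = arrival rate × time each job holds a worker). The worker count and the 10-second timeout are assumptions picked for illustration, not measurements from any real Mastodon instance; the 100 jobs per second figure just takes "hundreds of jobs per second" at its lower bound.

```python
def dead_job_load(dead_jobs_per_sec: float, timeout_sec: float) -> float:
    """Average number of workers tied up waiting on the unreachable instance.

    Little's law: jobs in flight = arrival rate * time each job holds a worker.
    """
    return dead_jobs_per_sec * timeout_sec


def live_capacity(workers: int, dead_jobs_per_sec: float, timeout_sec: float) -> float:
    """Workers left over for deliveries to reachable instances (never below zero)."""
    return max(0.0, workers - dead_job_load(dead_jobs_per_sec, timeout_sec))


if __name__ == "__main__":
    # Assumed numbers: 50 workers, 10 s connect timeout.
    print(dead_job_load(100, 10))      # workers' worth of demand from dead jobs
    print(live_capacity(50, 100, 10))  # what is left for everyone else
    print(live_capacity(50, 2, 10))    # a much quieter sender
```

Under these assumptions the pool is exhausted as soon as the dead-job rate times the timeout reaches the worker count, which is the "all workers process only dead jobs" situation described above.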
-
@aeris If this problem is real I can imagine multiple ways to mitigate it. This is a software engineering problem.
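One mitigation of the kind alluded to here could be a per-destination circuit breaker: after a few consecutive failures, stop attempting deliveries to that host for a cooldown period so its jobs are deferred immediately instead of holding workers until they time out. This is a generic sketch, not a description of what Mastodon or any other ActivityPub server actually does; `CircuitBreaker`, `deliver`, and the thresholds are hypothetical names and values.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class CircuitBreaker:
    failure_threshold: int = 3          # assumed: failures before we stop trying
    cooldown_sec: float = 60.0          # assumed: how long to skip the host
    clock: Callable[[], float] = time.monotonic
    _failures: Dict[str, int] = field(default_factory=dict)
    _open_until: Dict[str, float] = field(default_factory=dict)

    def allow(self, host: str) -> bool:
        """True if deliveries to this host may be attempted right now."""
        return self.clock() >= self._open_until.get(host, 0.0)

    def record_success(self, host: str) -> None:
        self._failures.pop(host, None)
        self._open_until.pop(host, None)

    def record_failure(self, host: str) -> None:
        count = self._failures.get(host, 0) + 1
        self._failures[host] = count
        if count >= self.failure_threshold:
            # Open the circuit: skip this host until the cooldown has elapsed.
            self._open_until[host] = self.clock() + self.cooldown_sec


def deliver(breaker: CircuitBreaker, host: str, send: Callable[[str], None]) -> str:
    """Attempt one delivery; return 'delivered', 'failed', or 'deferred'."""
    if not breaker.allow(host):
        return "deferred"  # costs no worker time; retry after the cooldown
    try:
        send(host)
    except (TimeoutError, ConnectionError):
        breaker.record_failure(host)
        return "failed"
    breaker.record_success(host)
    return "delivered"
```

The design choice is that a deferred job returns at once, so a down destination costs at most `failure_threshold` timeouts per cooldown window rather than one timeout per post.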
-
@mcc@mastodon.social No, it's a design problem. ActivityPub uses push, whereas ATProto uses pull.
-
@thisismissem @mcc probably worth noting that atproto.africa also appears to be down right now, and some microcosm services also appear to be going up and down
firehose.network and the microcosm relays look to be unaffected for now
@thisismissem @mcc rose also said a few hours ago that they were fighting a DoS attack; i'd assume whoever is doing the attack is targeting multiple notable services in the ecosystem
-
@mcc@mastodon.social they're allowed to succeed so they can be paraded around: "see, it's all super distributed and decentralized".
The moment VCs realize they need RoI, a bunch of "improvements", likely mostly "for security", probably "for safety", definitely "for the children", will add to the already insane architectural costs a bunch of operational burden that makes it impossible for other "instances" to exist. That's the Signal playbook: "sure we can federate, but we won't, for reasons"
CC: @mcc@mastodon.social
-
@mcc@mastodon.social So by design a down instance pollutes everything. You can mitigate that in software, yes, but background task scheduling is a hard field.
Pull problems are simpler to mitigate: the instance that was down only needs to throttle its outgoing requests when it restarts, to avoid hammering other instances while it fills the gap.
-
@esm @mcc yeah, that'd be my guess. It'll be interesting to see if anyone takes responsibility for the attack, if it is an attack as suspected
Tangentially, Russia tried to block bluesky the other day: https://netcrook.com/russia-blocks-bluesky-social-media-crackdown/
-
@mcc@mastodon.social A down ATP instance is really down. No more pulls, no more effect on the network.
A down AP instance is not really down: all other instances keep trying to communicate with it.
-
@esm @thisismissem That's interesting, but
1. If it's true, why would the DDOS differentially impact third-party PDSes on Blacksky while Blacksky PDS runs at normal speed?
2. Did atproto.africa go down because of a DOS or because of some side-effect of an attempt to move over to it as the primary relay?
One possibility is the failures I saw were *because* we switched from bluesky to atproto.africa, causing a short netlag period while atproto.africa caught up to the present? I don't know?
-
@esm @thisismissem I mean it's certainly possible that I am simply misinterpreting Rudy's comments about relays!… but all we ever get from Rudy are these vague gnomic comments, so this is about the best I can do. I'd rather he spend his time sysadmining and writing Rust than writing up incident reports for public consumption, but it does mean that trying to figure out wtf is happening to my feed as a blacksky user is constant detective work
-
@mcc@mastodon.social And it's an attack vector, in theory. You can bootstrap thousands of instances, subscribe to as many accounts as possible, and then just shut your instances down.
Any content from a subscribed account will generate a background job for your down instance, hitting the timeout each time.
You can flood instances like that, continually overflowing their queues with undeliverable content.
-
@mcc@mastodon.social No. Or with a huge delay. Because each of my messages will generate a background job for mastodon.social, leading to queue overflow over time and more and more lag, even for digipres.club delivery.
-
@jeromechoo@masto.ai @mcc@mastodon.social Yes, I know that. The trouble is not one post sent to many, but many posts sent to one.
Each post from an instance is sent only once to mastodon.social, but EACH post is sent.
-
@mcc@mastodon.social @jeromechoo@masto.ai So a huge instance sends dozens of posts per second (many posts generated, but each delivered only once) to another huge instance, with one background job per post to deliver.
-
@mcc@mastodon.social @jeromechoo@masto.ai The trouble scales not with the down instance's size, but with the alive instance's size. The more active it is, with lots of content generated, the faster its background job queue fills with undeliverable content.
-
This appears to be the explanation:
Rudolph Fraser (@rude1.blacksky.team), on Blacksky (blacksky.community):
"Even their relay seems down(?) Trying to switch some things to use atproto.africa https://atproto.africa"
In Bluesky, the PDS talks to the relay, which talks to the appview, which serves the client. Blacksky set up all four last year. But at first they only deployed their PDS and client, and used Bluesky's relay and appview. This wasn't clearly disclosed. Then there was a censorship scare, and they switched to their own appview. But apparently they're still using Bluesky's relay. This wasn't clearly disclosed. Now relay death kills Blacksky.
@mcc When I initially raised my eyebrows at Bluesky's notion of "federation", I was told that anyone can run a relay on a small cheap computer, it's dead easy, etc.…
-
Now, interestingly, this means that Blacksky users can continue talking to Blacksky users. I can read Rudy's posts on Blacksky. Because that bypasses the relay. But¹ to read my *own* posts, *on a self-hosted PDS*, Bluesky is apparently required, because Blacksky relies on Bluesky's "relay" to scrape my PDS before it gets added to the Blacksky appview database.
¹ (if I'm interpreting Rudy's posts correctly, hardly a guarantee)
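A toy model of the dependency described above, for readers who find it easier to see as code. It encodes only the interpretation given in this post (which, as the footnote says, may be wrong): the appview indexes some PDSes directly and reaches all others through the relay. The function name and the `.example` hostnames are hypothetical.

```python
from typing import List, Set, Tuple

Post = Tuple[str, str]  # (PDS host the post lives on, post text)


def appview_can_index(posts: List[Post], directly_ingested: Set[str], relay_up: bool) -> List[str]:
    """Return the post texts the appview can see.

    Assumption: posts on directly ingested PDSes bypass the relay;
    posts on every other PDS arrive only while the relay is up.
    """
    return [text for pds, text in posts if pds in directly_ingested or relay_up]


if __name__ == "__main__":
    posts = [
        ("pds.blacksky.example", "post on the Blacksky PDS"),
        ("selfhosted.example", "post on a self-hosted PDS"),
    ]
    direct = {"pds.blacksky.example"}
    print(appview_can_index(posts, direct, relay_up=True))
    print(appview_can_index(posts, direct, relay_up=False))
```

With the relay down, only the directly ingested PDS remains visible in this model, which matches the symptom of being able to read Rudy's posts but not posts from a self-hosted PDS.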
@mcc@mastodon.social From what I understand of the protocol, they could just stop using a relay at all, but that would increase the traffic on all the PDSes that were being scraped by the relay until then, since the AppView would have to connect to each of those instead of to the relay.
And did switching to another relay solve the issue?