Bluesky is down today.

javascript@app.wafrn.net

But if the problem is the relay, couldn't they just use the microcosm.blue relays and call it a day? They are compatible with the original ones.

And didn't Blacksky also run their own relay at atproto.africa?

aeris@firefish.imirhil.fr

@mcc@mastodon.social No, it's the trouble with the push design of ActivityPub.

mcc@mastodon.social

@javascript Before I attempt to reply to this, please clarify whether you read the post I posted above.

Rudolph Fraser. (@rude1.blacksky.team)

Even their relay seems down(?) Trying to switch some things to use atproto.africa https://atproto.africa

Bluesky Social (bsky.app)

Blacksky (blacksky.community)

Yes, they've been running atproto.africa since last year. But are they *using* it?

aeris@firefish.imirhil.fr

@mcc@mastodon.social Each of your messages generates a background job on a queue, to be delivered to every instance with at least one person following you. If a huge instance is down, all the other instances start filling their background queues with tons of dangling requests, delaying requests to the instances that are still live more and more.
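
To make the fan-out described above concrete, here is a minimal Python sketch. It assumes a single in-memory FIFO queue and a hypothetical `followers_by_instance` mapping; the function names and example hosts are invented for illustration and are not taken from Mastodon or any other ActivityPub server.

```python
from collections import deque

def enqueue_deliveries(queue, post_id, followers_by_instance):
    """Add one delivery job per remote instance that has at least one follower."""
    for instance, followers in followers_by_instance.items():
        if followers:  # instances without followers receive nothing
            queue.append((post_id, instance))

def drain(queue, down_instances):
    """Try every queued job once; jobs for down instances go back on the queue."""
    delivered = []
    for _ in range(len(queue)):
        post_id, instance = queue.popleft()
        if instance in down_instances:
            queue.append((post_id, instance))  # retried later, still occupying the queue
        else:
            delivered.append((post_id, instance))
    return delivered

# Hypothetical follower map: two instances with followers, one without.
followers = {"small.example": ["a"], "huge.example": ["b", "c"], "empty.example": []}

queue = deque()
for post_id in range(3):
    enqueue_deliveries(queue, post_id, followers)

delivered = drain(queue, down_instances={"huge.example"})
print(len(delivered), len(queue))  # 3 3
```

In this toy run, every post leaves one job behind for the down instance, so the backlog grows with each new post for as long as that instance stays unreachable.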

kunev@blewsky.social

@mcc@mastodon.social they're allowed to succeed so they can be paraded around as "see, it's all super distributed and decentralized".

The moment VCs realize they need RoI, a bunch of "improvements", likely mostly "for security", probably "for safety", definitely "for the children", will add to the already insane architectural costs a bunch of operational burden that makes it impossible for other "instances" to exist.

aeris@firefish.imirhil.fr

@mcc@mastodon.social Currently I have around 600 "delayed" jobs because of down instances polluting all delivery. This was reported to Mastodon years ago. Nothing has changed.

esm@wetdry.world

@thisismissem @mcc probably worth noting that atproto.africa also appears to be down right now, and some microcosm services also appear to be going up and down

firehose.network and the microcosm relays look to be unaffected for now

eestileib@tech.lgbt

@nasser @mcc

I have a skywalking friend and he says that if blacksky users had configured something in their app to make blacksky primary (which, to be fair, had never mattered before), their timelines would have remained synced with other blacksky users.

And also that blacksky was getting pulled down by bluesky repeatedly coming up, demanding to know the status of every lily in the field, then crashing.

Sounds like they need to come up with a more graceful recovery process and get bluesky to agree with it.

javascript@app.wafrn.net

@mcc@mastodon.social

I couldn't read the post linked above until you posted it again now, but I thought it was a bug in my software (wafrn) not getting the links right

aeris@firefish.imirhil.fr

@mcc@mastodon.social For a tiny instance, it's not really a problem, because there are few messages and so the queue doesn't fill.
For a huge instance, pretty much every message from every instance will generate a dangling request in the queue. Once the queue fills up, it delays all messages to any other instance, even the live ones.

thisismissem@hachyderm.io

@esm @mcc I'm sure there'll be a full write up soon. They usually do pretty good postmortems

mcc@mastodon.social

@eestileib @nasser Posts hosted on the Blacksky PDS are appearing on the Blacksky AppView immediately. That's definitely true.

aeris@firefish.imirhil.fr

@mcc@mastodon.social And it's worse for a huge instance that is still alive. Hundreds of messages per second. Hundreds of jobs per second for the down instance. Hundreds of dead jobs filling the queue because of timeouts, competing for resources with live jobs. At some point, all workers process only dead jobs…
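
A rough back-of-the-envelope version of that argument, written as Python. Every number here (job rates, per-job cost, timeout) is an assumption picked for illustration, not a measurement from any real instance, and `worker_demand` is a hypothetical helper.

```python
def worker_demand(live_rate, dead_rate, live_cost, timeout):
    """Worker-seconds needed per second of wall-clock time.

    A delivery to a live instance holds a worker for `live_cost` seconds;
    a delivery to a down instance holds one until `timeout` expires.
    """
    live = live_rate * live_cost
    dead = dead_rate * timeout
    return live, dead

# Assumed load: 100 live and 100 dead deliveries per second,
# 50 ms per successful delivery, 10 s timeout on a dead one.
live, dead = worker_demand(live_rate=100, dead_rate=100, live_cost=0.05, timeout=10)
dead_share = dead / (live + dead)
print(f"{dead_share:.1%} of worker time goes to dead jobs")  # 99.5% of worker time goes to dead jobs
```

Under these assumed numbers, the jobs that can never succeed demand about 200 times more worker time than the ones that can, which is the starvation effect the post describes.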

wikisteff@mastodon.social

@mcc This is a good take, mcc.

aeris@firefish.imirhil.fr

@mcc@mastodon.social I don't know exactly what the effect would be if mastodon.social, for example, had a 10-hour downtime like Bluesky's. I'd expect at least delays growing over time, even for communication that doesn't involve mastodon.social.

mcc@mastodon.social

@aeris If this problem is real I can imagine multiple ways to mitigate it. This is a software engineering problem.

aeris@firefish.imirhil.fr

@mcc@mastodon.social No, it's a design problem. ActivityPub uses push, whereas ATProto uses pull.

esm@wetdry.world

@thisismissem @mcc rose also said a few hours ago that they were fighting a DoS attack; i'd assume whoever is doing the attack is targeting multiple notable services in the ecosystem

khm@hj.9fs.net

that's the Signal playbook. "sure we can federate, but we won't, for reasons"

CC: @mcc@mastodon.social

aeris@firefish.imirhil.fr

@mcc@mastodon.social So by design a down instance pollutes everything. You can mitigate that with software, yes, but background task scheduling is a hard field.

Pull problems are simpler to mitigate, because they only require throttling outgoing requests from the down instance when it restarts after a downtime, to avoid hammering other instances while it fills the gap.
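
One way the throttling mentioned above could look, as a minimal Python sketch. It assumes a fixed minimum spacing between requests to the same host; the `Throttle` class, the two-second interval, and the `pds.example` host are hypothetical and not part of any ATProto implementation.

```python
class Throttle:
    """Allow at most one request per `min_interval` seconds to each host."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.next_allowed = {}  # host -> earliest time the next request may start

    def reserve(self, host, now):
        """Return the time at which the next request to `host` may be sent."""
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_interval
        return start

throttle = Throttle(min_interval=2.0)

# After a restart, five catch-up requests for the same host all become ready at t=0.
schedule = [throttle.reserve("pds.example", now=0.0) for _ in range(5)]
print(schedule)  # [0.0, 2.0, 4.0, 6.0, 8.0]
```

The recovering side spreads its own catch-up requests over time, so the cost of the downtime stays with the instance that was down instead of landing on everyone else's queues.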
