Bluesky is down today.

esm@wetdry.world

@thisismissem @mcc probably worth noting that atproto.africa also appears to be down right now, and some microcosm services also appear to be going up and down

firehose.network and the microcosm relays look to be unaffected for now

eestileib@tech.lgbt

@nasser @mcc

I have a skywalking friend and he says that if blacksky users had configured something in their app to make blacksky primary (which, to be fair, had never mattered before), their timelines would have remained synced with other blacksky users.

And also that blacksky was getting pulled down by bluesky repeatedly coming up, demanding to know the status of every lily in the field, then crashing.

Sounds like they need to come up with a more graceful recovery process and get bluesky to agree with it.

javascript@app.wafrn.net

@mcc@mastodon.social

I couldn't read the post linked above until you posted it again now, but I thought it was a bug in my software (wafrn) not getting the links right

aeris@firefish.imirhil.fr

@mcc@mastodon.social For a tiny instance it's not really a problem: there are few messages, so the queue doesn't fill.
For a huge instance, pretty much every message from every instance will generate a dangling request in the queue. Once the queue fills, messages to every other instance are delayed, even to the ones that are alive.

thisismissem@hachyderm.io

@esm @mcc I'm sure there'll be a full write up soon. They usually do pretty good postmortems

mcc@mastodon.social

@eestileib @nasser Posts hosted on the Blacksky PDS are appearing on the Blacksky AppView immediately. That's definitely true.

aeris@firefish.imirhil.fr

@mcc@mastodon.social And it's worse for a huge instance that is still alive. Hundreds of messages per second. Hundreds of jobs per second for the down instance. Hundreds of dead jobs filling the queue because of timeouts, competing for resources with live jobs. At some point, all the workers are processing only dead jobs…
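
The worker starvation described in the post above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: a single shared FIFO queue, a fixed worker pool, and made-up costs (1 tick for a live delivery, 10 ticks for a delivery that runs into a timeout). The function name `simulate` and all of its parameters are hypothetical and do not come from Mastodon or any other ActivityPub implementation.

```python
from collections import deque


def simulate(workers, ticks, live_per_tick, dead_per_tick,
             live_cost=1, dead_cost=10):
    """Discrete-time model of one shared FIFO delivery queue.

    Every tick, new jobs are enqueued and each free worker takes the next job.
    A job for a live instance holds a worker for `live_cost` ticks; a job for
    a down instance holds it for `dead_cost` ticks (the timeout).
    Returns the average number of ticks that live jobs waited in the queue.
    """
    queue = deque()              # entries: (enqueue_tick, cost, is_live)
    busy_until = [0] * workers   # tick at which each worker becomes free
    waits = []                   # queueing delay of every started live job

    for tick in range(ticks):
        for _ in range(live_per_tick):
            queue.append((tick, live_cost, True))
        for _ in range(dead_per_tick):
            queue.append((tick, dead_cost, False))

        for w in range(workers):
            if busy_until[w] <= tick and queue:
                enqueued, cost, is_live = queue.popleft()
                busy_until[w] = tick + cost
                if is_live:
                    waits.append(tick - enqueued)

    return sum(waits) / len(waits) if waits else 0.0


healthy = simulate(workers=4, ticks=200, live_per_tick=2, dead_per_tick=0)
degraded = simulate(workers=4, ticks=200, live_per_tick=2, dead_per_tick=2)
print(f"average wait without dead jobs: {healthy:.1f} ticks")
print(f"average wait with dead jobs:    {degraded:.1f} ticks")
```

With these assumed numbers the pool keeps up easily when every destination is alive. Once timeout-bound jobs share the same queue, the live jobs wait behind them even though their own destinations are fine.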

wikisteff@mastodon.social

@mcc This is a good take, mcc.

aeris@firefish.imirhil.fr

@mcc@mastodon.social I don't know exactly what the effect would be if mastodon.social, for example, had a 10-hour downtime like Bluesky's. I'd expect at least delays growing over time, even for communication that doesn't involve mastodon.social.

mcc@mastodon.social

@aeris If this problem is real I can imagine multiple ways to mitigate it. This is a software engineering problem.

aeris@firefish.imirhil.fr

@mcc@mastodon.social No, it's a design problem. ActivityPub uses push, whereas ATProto uses pull.

esm@wetdry.world

@thisismissem @mcc rose also said a few hours ago that they were fighting a DoS attack; i'd assume whoever is doing the attack is targeting multiple notable services in the ecosystem

khm@hj.9fs.net

that's the Signal playbook. "sure we can federate, but we won't, for reasons"

CC: @mcc@mastodon.social

aeris@firefish.imirhil.fr

@mcc@mastodon.social So by design a down instance pollutes everything. You can mitigate that in software, yes, but background task scheduling is a hard field.

Pull problems are simpler to mitigate, because they only require the instance that was down to throttle its outgoing requests after it restarts, to avoid hammering other instances while it fills the gap.
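
The throttling mentioned in the post above could look roughly like a token bucket placed in front of the catch-up fetches. This is a minimal sketch, not how any ATProto relay or AppView actually does it: `TokenBucket`, `catch_up`, and the `fetch` callback are hypothetical names, and the rate and burst values are arbitrary.

```python
import time


class TokenBucket:
    """Allows `rate` operations per second, with bursts of up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum number of stored tokens
        self.clock = clock          # injectable so tests can fake time
        self.tokens = float(burst)
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; never blocks."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def catch_up(pending, bucket, fetch):
    """Fetch as many pending items as the bucket allows right now.

    Returns the items that still have to be fetched later.
    """
    remaining = list(pending)
    while remaining and bucket.try_acquire():
        fetch(remaining.pop(0))
    return remaining
```

A restarted service would call `catch_up` periodically with whatever backlog is left, so the gap is filled at a bounded rate instead of all at once.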

thisismissem@hachyderm.io

@esm @mcc yeah, that'd be my guess. It'll be interesting to see if anyone takes responsibility for the attack, if it is an attack as suspected

Tangentially, Russia tried to block bluesky the other day: https://netcrook.com/russia-blocks-bluesky-social-media-crackdown/

aeris@firefish.imirhil.fr

@mcc@mastodon.social A down ATP instance is really down: no more pulls, no effect on the network.
A down AP instance is not really down: all the other instances keep trying to communicate with it.

mcc@mastodon.social

@esm @thisismissem That's interesting, but

1. If it's true, why would the DDOS differentially impact third-party PDSes on Blacksky while Blacksky PDS runs at normal speed?

2. Did atproto go down because of a DOS or because of some side-effect of an attempt to move over to it as the primary relay?

One possibility is the failures I saw were *because* we switched from bluesky to atproto.africa, causing a short netlag period while atproto.africa caught up to the present? I don't know?

mcc@mastodon.social

@esm @thisismissem I mean it's certainly possible that I am simply misinterpreting Rudy's comments about relays!… but all we ever get from Rudy are these vague gnomic comments, so this is about the best I can do. I'd rather him be spending his time sysadmining and writing Rust than writing up incident reports for public consumption but it does mean trying to figure out wtf is happening to my feed as a blacksky user is constant detective work

thisismissem@hachyderm.io

@mcc @esm I think we'll need to wait for the analysis and blog posts that follow.

aeris@firefish.imirhil.fr

@mcc@mastodon.social And in theory it's an attack vector. You can bootstrap thousands of instances, subscribe to as many accounts as possible, and then just shut your instances down.
Any content from a subscribed account will generate a background job to your down instances, hitting the timeout each time.
You can flood an instance like that to keep overflowing its queue with dangling content.
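
One generic software mitigation for the dangling jobs described in the post above is a per-destination circuit breaker: after a few consecutive failures, deliveries to that host are skipped for a cooldown period instead of tying up a worker until the timeout. The sketch below is an assumption-laden illustration, not the mechanism of any particular ActivityPub server; `DestinationBreaker`, `deliver`, and the thresholds are hypothetical.

```python
import time


class DestinationBreaker:
    """Tracks consecutive failures per host and skips hosts that look down."""

    def __init__(self, max_failures=3, cooldown=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown    # seconds to skip a host once it trips
        self.clock = clock          # injectable so tests can fake time
        self.failures = {}          # host -> consecutive failure count
        self.open_until = {}        # host -> time before which we skip it

    def allow(self, host):
        """True if a delivery to `host` should be attempted now."""
        return self.clock() >= self.open_until.get(host, 0.0)

    def record(self, host, ok):
        """Record the outcome of one delivery attempt."""
        if ok:
            self.failures.pop(host, None)
            self.open_until.pop(host, None)
            return
        count = self.failures.get(host, 0) + 1
        self.failures[host] = count
        if count >= self.max_failures:
            self.open_until[host] = self.clock() + self.cooldown


def deliver(breaker, host, send):
    """Try one delivery; returns 'skipped', 'ok', or 'failed'."""
    if not breaker.allow(host):
        return "skipped"            # no worker time spent on a timeout
    ok = send(host)
    breaker.record(host, ok)
    return "ok" if ok else "failed"
```

Note the limit of this approach against the attack sketched above: each new host still costs `max_failures` timeouts before it trips, so thousands of fresh throwaway instances would still consume worker time, only less of it per host.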

Rudolph Fraser. (@rude1.blacksky.team)
