Shit happens; if your plan was to figure it out live, that’s not a plan
Disproof
Incident response plans that are actually followed; a named person who's responsible at any given point (and everyone else knowing who that person is); dedicated comms; post-mortems; etc.
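For flavour, a minimal sketch of the "who's responsible right now" lookup, assuming a hard-coded rotation with made-up names and dates; a real rotation lives in your paging tool, but the point stands that finding the current incident commander should be one function call, not a guessing game.

```python
from datetime import datetime, timezone

# Hypothetical weekly rotation: (shift start in UTC, incident commander handle).
IC_ROTATION = [
    ("2024-06-03T00:00:00+00:00", "alice"),
    ("2024-06-10T00:00:00+00:00", "bob"),
    ("2024-06-17T00:00:00+00:00", "carol"),
]

def current_incident_commander(now=None):
    """Return the handle of whoever owns incidents right now."""
    now = now or datetime.now(timezone.utc)
    owner = IC_ROTATION[0][1]
    for start, person in IC_ROTATION:
        if datetime.fromisoformat(start) <= now:
            owner = person
        else:
            break
    return owner

if __name__ == "__main__":
    print(f"Page @{current_incident_commander()} -- they run the incident.")
```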
Consequences
Wasted employee time
Extremely variable resolution times
Destruction of user trust
Mismanaged, ad-libbed public response
Overall reputational dumpster fire
Get outcompeted and die
Causes
Unless your org is genuinely useless, someone at some point probably tried to create a process. That process is likely to be:
a) evolved, sometimes as an over-correction to a prior disaster
b) poorly linked up between infrastructure, technical, and non-technical support teams
c) overlapping in some places and gappy in others
Furthermore, a lack of discipline in a crisis tends to create a divergent feedback loop: the headless chickens run into the electric fences, overload them, and the resulting cascade takes down the grid.
In certain corporate cases, extreme cognitive dissonance, or overconfidence in the preventative measures already in place, prevents anyone from ever solidifying those measures into an actual plan.
Approaches
Again, if you haven’t heard of SRE, start there
If you’re finding out about things breaking from twitter/x/y/z, you’re already in bad shape
If you’re finding out about things breaking from your monitoring but then scratching your head about who to DM to start fixing things, or having an everything-is-on-fire megathread on the general channel (see the routing sketch after this list)... yeah
If you don’t have user support with clear rules of escalation, and/or every case of user-side breakage or user error gets bounced around until it becomes a ‘bug’ with no reproduction steps, that’s also a problem
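To make the last two points concrete, here is a hedged sketch of routing an alert to an owning on-call and escalating on a timer instead of dumping it in the general channel. The service names, escalation chains, timeout, and the notify() stub are all invented for illustration; the real versions belong in your pager/chat tooling, not in a script.

```python
import time

# Hypothetical ownership map: service label -> ordered escalation chain.
ESCALATION = {
    "payments-api": ["payments-oncall", "payments-lead", "incident-commander"],
    "user-support": ["support-triage", "backend-oncall", "incident-commander"],
}
ACK_TIMEOUT_S = 1  # demo value; in real life this would be minutes, not a second

def notify(target, alert):
    """Stand-in for a real pager/chat integration; returns True if acknowledged."""
    print(f"paging {target}: {alert}")
    return False  # pretend nobody acked, to show the whole escalation path

def route(alert, service):
    """Walk the escalation chain for a service instead of posting to #general."""
    for target in ESCALATION.get(service, ["incident-commander"]):
        if notify(target, alert):
            return
        time.sleep(ACK_TIMEOUT_S)  # real tooling uses an async ack timer
    print(f"unacknowledged after the full chain: {alert}")

route("checkout 5xx rate above threshold", "payments-api")
```

The same shape works for user support: a reported breakage either matches an owner and gets acknowledged, or it escalates on a clock, rather than bouncing between teams until it decays into an unreproducible ‘bug’.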