Forgot that things still break

Description

  • Shit happens; if your plan was to figure it out live, that’s not a plan

Disproof

  • Incident Response Plans that are actually followed; someone who’s responsible at any given point (and everyone else knowing who that person is); dedicated comms; post-mortems; etc. (a minimal sketch of the ownership piece follows below)
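
To make “someone who’s responsible at any given point” concrete, the minimum viable version is a single machine-readable source of truth that anyone can query for the current incident owner. A minimal sketch in Python; the rotation, handles and escalation contacts are hypothetical placeholders, not a prescription:

    # Minimal sketch: one source of truth for "who owns an incident right now".
    # The rotation, handles and escalation contacts are hypothetical placeholders.
    from datetime import datetime, timezone

    # Weekly rotation: weekday -> (primary on-call, escalation contact)
    ONCALL_ROTATION = {
        0: ("@alice", "@infra-lead"),   # Monday
        1: ("@bob", "@infra-lead"),
        2: ("@carol", "@infra-lead"),
        3: ("@dave", "@infra-lead"),
        4: ("@erin", "@infra-lead"),
        5: ("@frank", "@cto"),          # weekend escalates higher
        6: ("@grace", "@cto"),
    }

    def current_incident_owner(now: datetime | None = None) -> tuple[str, str]:
        """Return (primary on-call, escalation contact) for the current moment."""
        now = now or datetime.now(timezone.utc)
        return ONCALL_ROTATION[now.weekday()]

    if __name__ == "__main__":
        primary, escalation = current_incident_owner()
        print(f"Page {primary} first; escalate to {escalation} if there's no ack.")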

Consequences

  • Wasted employee time
  • Extremely variable resolution times
  • Destruction of user trust
  • A mismanaged, ad-libbed public response
  • Overall reputational dumpster fire
  • Get outcompeted and die

Causes

  • Unless you’re genuinely useless, someone at some point probably tried to create a process. That process is likely to be: a) evolved piecemeal, sometimes as an over-correction to a prior disaster; b) poorly linked up between infrastructure, technical, and non-technical support teams; c) overlapping or gappy
  • Furthermore, lack of discipline in a crisis tends to create a divergent feedback loop: the headless chickens run into the electric fences, overload them, and the cascade takes down the grid
  • In certain corporate cases, extreme cognitive dissonance, or overconfidence in the preventative measures already in place, stops anyone from actually solidifying them.

Approaches

  • Again, if you haven’t heard of SRE, start there
  • If you’re finding out about things breaking from twitter/x/y/z, you’re already in bad shape
  • If you’re finding out about things breaking from your monitoring but then scratching your head about who to DM to start fixing things, or having an everything-is-on-fire megathread on the general channel... yeah (see the routing sketch after this list)
  • If you don’t have user support with clear rules of escalation, and/or every case of user-side breakage or user error gets bounced around until it becomes a ‘bug’ with no steps to reproduce, that’s also a problem
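
To illustrate the routing point above: the goal is that an alert arrives already knowing which service it belongs to, who owns that service, who to escalate to, and which dedicated channel the response happens in, so nobody is improvising in the general channel. A minimal sketch, assuming a simple service-to-owner map; every service name, channel and handle here is a hypothetical placeholder:

    # Minimal sketch: route an alert to a named owner and a dedicated incident
    # channel instead of a free-for-all in the general channel.
    # All service names, channels and handles are hypothetical placeholders.
    from dataclasses import dataclass

    @dataclass
    class Route:
        oncall: str       # who gets paged first
        escalation: str   # who gets paged if there is no ack
        channel: str      # dedicated incident channel, not the general one
        runbook: str      # where the written plan for this service lives

    SERVICE_ROUTES = {
        "payments-api": Route("@payments-oncall", "@payments-lead",
                              "#inc-payments", "runbooks/payments.md"),
        "web-frontend": Route("@frontend-oncall", "@frontend-lead",
                              "#inc-frontend", "runbooks/frontend.md"),
    }

    def route_alert(service: str) -> Route:
        """Look up who to page and where to coordinate; an unknown service
        still gets a defined catch-all owner rather than a shrug."""
        return SERVICE_ROUTES.get(
            service,
            Route("@platform-oncall", "@platform-lead",
                  "#inc-unrouted", "runbooks/unknown-service.md"),
        )

    if __name__ == "__main__":
        r = route_alert("payments-api")
        print(f"Page {r.oncall}, coordinate in {r.channel}, follow {r.runbook}")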