Can we talk about our failure modes?
Our core product has exactly two states: working, or completely down. There is no middle. No degraded mode, no read-only fallback, no "feature X is unavailable but the rest still works." When something breaks, it doesn't just inconvenience customers; it stops their business entirely.
That is a design choice, not an inevitability. And I think we keep making it because the honest answer to "why is it built this way" is "because it's always been built this way."
Every dependency we have is currently a single point of total failure. A hiccup in one subsystem takes down the whole thing. We treat that as an ops problem to be monitored and paged on, when it's actually an architecture problem we've decided not to solve.
Graceful degradation costs something upfront. You have to define what "partial" means for each subsystem, build the fallback paths, decide what's safe to shed under load, and actually test the failure modes. That's real work. But the alternative is what we have now: every incident is a worst-case incident.
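To make "build the fallback paths" concrete, here is a minimal sketch of what one could look like. Everything in it is hypothetical: `with_fallback`, `Result`, and the recommendation functions are illustrative names, not anything in our codebase, and a real version would catch narrower exceptions and log the failure.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    """A value plus a flag saying whether it came from the fallback path."""
    value: T
    degraded: bool


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> Result[T]:
    """Try the primary path; if it raises, serve the fallback and mark it degraded."""
    try:
        return Result(primary(), degraded=False)
    except Exception:
        # Assumption: any failure of the primary path is safe to paper over.
        # A real system would restrict this to known, recoverable errors.
        return Result(fallback(), degraded=True)


def fetch_recommendations() -> list:
    # Hypothetical non-critical dependency that is currently down.
    raise ConnectionError("recommendation service unavailable")


def default_recommendations() -> list:
    # The "partial" answer: an empty list instead of a failed page.
    return []


page = with_fallback(fetch_recommendations, default_recommendations)
```

The point is the `degraded` flag: the caller still gets something it can render, and we still know the subsystem failed.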
I'm not asking for a ground-up rewrite. I'm asking us to tier our failures. Which ones should be invisible to customers, which should be a minor inconvenience, and which are genuinely catastrophic? I'd bet most of what currently triggers a full outage belongs in the "inconvenient" bucket and could be isolated without rebuilding everything.
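As a rough illustration of what tiering could mean in practice, the sketch below maps subsystems to the three buckets and picks a response per bucket. The subsystem names and their assignments are made up for the example; the real mapping is exactly the thing I'm asking us to work out.

```python
from enum import Enum


class Tier(Enum):
    INVISIBLE = 1      # customers should never notice
    INCONVENIENT = 2   # one feature is unavailable, the rest works
    CATASTROPHIC = 3   # a full outage is genuinely unavoidable


# Hypothetical assignments, for illustration only.
FAILURE_TIERS = {
    "analytics": Tier.INVISIBLE,
    "search": Tier.INCONVENIENT,
    "primary_database": Tier.CATASTROPHIC,
}


def response_to_failure(subsystem: str) -> str:
    """Return the action to take when the given subsystem fails."""
    # Unclassified subsystems default to the worst case, which is
    # effectively how every subsystem is treated today.
    tier = FAILURE_TIERS.get(subsystem, Tier.CATASTROPHIC)
    if tier is Tier.INVISIBLE:
        return "continue"
    if tier is Tier.INCONVENIENT:
        return "disable_feature"
    return "full_outage"
```

Defaulting unknown subsystems to `CATASTROPHIC` keeps the sketch conservative: nothing gets shed until someone has explicitly decided it is safe to shed.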
We accept all-or-nothing outages as normal. We shouldn't.