Recovery-First Automation: Undo Is the Feature

Every automation decision has a mirror question. Most engineers skip it.

Mar 04, 2026

A terminal in the dark. 2 AM. The count is zero. This is the moment before you learn what 'undo' actually means.

There is one action you will regret before you regret anything else. You click something, a script runs, a process completes, and in the space between “done” and “wait,” you realize you can’t go back.

Mine was many years back (I was essentially a kid). A bulk delete. Wrong environment. 2,400 records. Production. No snapshot. The cursor blinked at me while my stomach caught up to what my hands had done. No error message. Just silence and a number: 0 rows remain.

We spend enormous effort designing what automation should do. We spend almost none designing how automation fails backward. That asymmetry is not a gap. It is the bug. And it compounds every time you add another automated step to the system.

Undo Is a System Requirement, Not a Feature

The instinct to treat undo as an afterthought comes from the wrong mental model.

We think of undo as a convenience: copy-paste correction, draft recovery, document history. The real undo problem is not about convenience. It is about blast radius.

When automation moves from script-and-wait to always-on, from manually triggered to schedule-driven, the failure surface changes completely. A human clicking delete can be interrupted. A process running at 3 AM across 800 nodes cannot.

Three properties define a recoverable system. All three are required. Miss one, and you have designed a one-way door.

State is observable before action. You can see what will change before it changes. Not a count, not “47 records affected”: the records themselves, reviewable before any mutation happens.

Actions are time-bounded. Nothing runs forever without a checkpoint. If a job started and you cannot answer “what did it do so far,” the job was not designed to be interruptible. Design it to be interruptible.

Rollback is a first-class operation. Undoing a thing is as designed as doing a thing. It has tests. It has documentation. It is findable at 3 AM by someone who has never seen the codebase.

Miss any of these and you are not shipping automation. You are shipping consequence without control.

Three Patterns That Make Undo Real

This does not have to be abstract. Three concrete patterns address the undo problem at different scopes. They are not mutually exclusive. They stack.

Preview mode. Before any destructive operation, show what will happen. Not a confirmation dialog; confirmation dialogs are theater. “Are you sure?” is a question. “Here are the 47 records this will affect, here is what they contain, here is what will be gone” is evidence. Force the operator to see the blast radius before the blast. The preview prevents the accident. The confirmation dialog only creates the illusion of a check.
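A minimal sketch of preview mode, assuming an in-memory list of records with an `id` field. The names `Preview`, `preview_delete`, and `apply_delete` are hypothetical, invented for illustration. The point is that the delete accepts only the exact set the operator reviewed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preview:
    """The evidence shown to the operator before anything is mutated."""
    matched: list  # the full records, not just a count


def preview_delete(records, predicate):
    """Select what a delete would remove, without touching the store."""
    return Preview(matched=[r for r in records if predicate(r)])


def apply_delete(records, preview, approved_ids):
    """Delete only what was reviewed; refuse if the approval does not match."""
    preview_ids = {r["id"] for r in preview.matched}
    if set(approved_ids) != preview_ids:
        raise ValueError("approved set does not match previewed set; re-run the preview")
    return [r for r in records if r["id"] not in preview_ids]


# Hypothetical data for illustration only.
records = [
    {"id": 1, "env": "staging", "name": "alpha"},
    {"id": 2, "env": "production", "name": "beta"},
    {"id": 3, "env": "staging", "name": "gamma"},
]

preview = preview_delete(records, lambda r: r["env"] == "staging")
for record in preview.matched:
    print(record)  # the operator reads the records themselves

remaining = apply_delete(records, preview, approved_ids={1, 3})
```

Requiring the approved ids to equal the previewed ids is one design choice among several. It means a stale or partial approval fails loudly instead of deleting something nobody looked at.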

Commit with checkpoint. Every automated action that mutates state should write a checkpoint before execution. Not a log line that says “Started at 03:14”: a structured artifact with state before, intended state after, and rollback procedure. Stored somewhere the on-call engineer can reach in under two minutes without digging through a Slack thread.

The checkpoint is the receipt for the action. Without it, you are back in the “someone says they did the thing” problem that receipts were designed to solve. Receipts are how you prove something happened. Checkpoints are how you undo it.
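Here is one way a checkpoint could look, sketched under assumptions: the state is a small JSON-serializable dict, and the checkpoint lands in a local directory. The field names and the `write_checkpoint` and `restore_from_checkpoint` helpers are hypothetical; a real system would pick its own schema and a store the on-call engineer can actually reach.

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def write_checkpoint(directory, action, state_before, intended_after, rollback_procedure):
    """Persist a structured receipt before the action runs."""
    checkpoint = {
        "checkpoint_id": str(uuid.uuid4()),
        "action": action,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "state_before": state_before,
        "intended_state_after": intended_after,
        "rollback_procedure": rollback_procedure,
    }
    path = Path(directory) / f"{checkpoint['checkpoint_id']}.json"
    path.write_text(json.dumps(checkpoint, indent=2))
    return path


def restore_from_checkpoint(path):
    """Read back the state that existed before the action."""
    return json.loads(Path(path).read_text())["state_before"]


# Hypothetical config change for illustration only.
checkpoint_dir = tempfile.mkdtemp()
config = {"max_connections": 100, "timeout_s": 30}
intended = {**config, "max_connections": 500}

checkpoint_path = write_checkpoint(
    checkpoint_dir,
    action="raise-connection-limit",
    state_before=config,
    intended_after=intended,
    rollback_procedure="Load state_before from this file and re-apply it.",
)
applied = dict(intended)  # the mutation happens only after the receipt exists
restored = restore_from_checkpoint(checkpoint_path)
```

The ordering is the whole pattern: the file is written first, so a crash halfway through the mutation still leaves something to roll back to.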

Undo session window. Not all destructive actions require permanent rollback capability. Many just require a delay. Email providers figured this out: “Message queued. Undo?” with a countdown. Gmail’s send delay is the simplest undo design in widespread use, and it works because the action is not permanent yet.

Define a session window for dangerous operations. Within the window: instantly reversible. After the window: committed. The window size is a function of blast radius: a config push to one service needs 60 seconds; a database migration needs 24 hours with manual confirmation on exit from the window. The window is not a security measure. It is a sanity check. A structured pause between “I think this is right” and “this is permanent.”
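The window can be sketched as a deferred commit. This is an illustration under assumptions, not a scheduler: `PendingOperation` is a hypothetical name, the clock is injected so the example runs without waiting, and the manual confirmation step for long windows is left out.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PendingOperation:
    """A destructive action that stays reversible until its window closes."""
    commit: Callable[[], None]                   # the permanent step, deferred
    window_seconds: float                        # sized by blast radius
    clock: Callable[[], float] = time.monotonic  # injectable so the window is testable
    queued_at: float = field(init=False, default=0.0)
    state: str = field(init=False, default="pending")

    def __post_init__(self):
        self.queued_at = self.clock()

    def _window_open(self):
        return self.clock() - self.queued_at < self.window_seconds

    def undo(self):
        """Cancel if still inside the window; report whether it worked."""
        if self.state == "pending" and self._window_open():
            self.state = "cancelled"
            return True
        return False

    def tick(self):
        """Called periodically; commits once the window has elapsed."""
        if self.state == "pending" and not self._window_open():
            self.commit()
            self.state = "committed"


# A fake clock stands in for real time in this illustration.
now = [0.0]
pushed = []

first = PendingOperation(lambda: pushed.append("config-v2"), 60, clock=lambda: now[0])
now[0] = 30.0
undone = first.undo()    # still inside the 60-second window
first.tick()             # cancelled, so nothing is committed

second = PendingOperation(lambda: pushed.append("config-v3"), 60, clock=lambda: now[0])
now[0] = 95.0            # 65 seconds after the second operation was queued
second.tick()            # window elapsed, commit runs
too_late = second.undo()
```

Nothing permanent happens at queue time. That is the same reason the email send delay works: the action is only a promise until the window closes.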

What Undo Does Not Solve

Here is the honest constraint. Some actions are structurally irreversible:

Emails delivered to 50,000 subscribers
Financial transactions that have cleared the bank
Webhook events consumed by downstream services
Time. You cannot undo elapsed time.

For these, the model shifts from undo to compensate. You do not unfire the arrow: you send the correction email, issue the refund, dispatch the correcting event. Compensation is reversibility’s close cousin. It cannot unwind time, but it can minimize harm.

The compensate model has its own design requirements. A compensating action needs to be idempotent: running it twice should not double the damage. It needs to be scoped: fix only what the original action broke. And it needs to be logged with a reference to the original action it is correcting, or your audit trail becomes unreadable.
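Those three requirements fit in a few lines. The sketch below assumes a toy in-memory ledger; `Ledger`, `charge`, and `refund` are hypothetical names, and a real payment system would look nothing like this. It only shows the shape: idempotent, scoped, and logged against the original action.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    balances: dict
    audit_log: list = field(default_factory=list)
    compensated: set = field(default_factory=set)  # original action ids already corrected

    def charge(self, action_id, account, amount):
        """The original action, which cannot be unwound once it has cleared."""
        self.balances[account] -= amount
        self.audit_log.append(
            {"id": action_id, "type": "charge", "account": account, "amount": amount}
        )

    def refund(self, original_action_id):
        """Compensate for one charge. Returns False if it was already compensated."""
        if original_action_id in self.compensated:
            return False  # idempotent: a second run changes nothing
        original = next(
            (e for e in self.audit_log
             if e["id"] == original_action_id and e["type"] == "charge"),
            None,
        )
        if original is None:
            raise KeyError(f"no charge found with id {original_action_id!r}")
        # Scoped: touch only the account and amount the original action touched.
        self.balances[original["account"]] += original["amount"]
        # Logged with a reference back to the action being corrected.
        self.audit_log.append({
            "id": f"refund-of-{original_action_id}",
            "type": "refund",
            "corrects": original_action_id,
            "account": original["account"],
            "amount": original["amount"],
        })
        self.compensated.add(original_action_id)
        return True


# Hypothetical accounts and amounts for illustration only.
ledger = Ledger(balances={"acct-a": 100, "acct-b": 50})
ledger.charge("charge-1", "acct-a", 30)
first_refund = ledger.refund("charge-1")
second_refund = ledger.refund("charge-1")  # retried by a nervous operator
```

The `corrects` field is what keeps the audit trail readable: anyone looking at the refund can find the charge it answers.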

Knowing the difference is the design decision. When you build a new operation, ask: is this undoable, compensatable, or neither? The answer determines how much friction to introduce at the moment of execution. If it is neither undoable nor compensatable, that operation needs the most slowing down, the most confirmation, the smallest scope, and the most human eyeballs before it runs.
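One way to make that question hard to skip is to force every operation to declare its answer and derive the friction from it. The sketch below is illustrative only: the `Reversibility` and `Friction` names are hypothetical, and every number in the table is an assumption chosen to show the ordering, not a recommendation.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    UNDOABLE = "undoable"
    COMPENSATABLE = "compensatable"
    NEITHER = "neither"


@dataclass(frozen=True)
class Friction:
    delay_seconds: int       # how much the operation is slowed down
    approvals_required: int  # how many human eyeballs before it runs
    max_batch_size: int      # how small the scope must be


# Assumed values for illustration; tune them to your own blast radius.
FRICTION = {
    Reversibility.UNDOABLE: Friction(delay_seconds=0, approvals_required=0, max_batch_size=10_000),
    Reversibility.COMPENSATABLE: Friction(delay_seconds=60, approvals_required=1, max_batch_size=1_000),
    Reversibility.NEITHER: Friction(delay_seconds=3_600, approvals_required=2, max_batch_size=50),
}


def friction_for(kind):
    """Look up the friction an operation must accept, given its classification."""
    return FRICTION[kind]


one_way_door = friction_for(Reversibility.NEITHER)
```

An operation that cannot name its class does not get a default. It gets a conversation.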

One side has a handle. The other side has nothing. Every automated operation is one of these two doors. You just don't always know which one you're walking through.

The Checklist

Before shipping any automated operation that touches persistent state, run this:

[ ] Can an operator see what will change before it changes?
[ ] Is there a checkpoint written before the action executes?
[ ] Is there a rollback procedure documented somewhere findable at 3 AM?
[ ] Has the rollback procedure actually been tested?
[ ] If rollback fails: is there a compensating action?
[ ] If compensation fails: what is the escalation path?

Six items. If you cannot check all six, you have designed only the forward path. The backward path is the other half of the system. Ship the whole thing.
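The checklist can also live next to the code as a gate. A minimal sketch, assuming each operation carries a dict of answers; `CHECKLIST` and `missing_items` are hypothetical names, and the wording of each item is condensed from the list above.

```python
CHECKLIST = (
    "operator can see what will change before it changes",
    "checkpoint is written before the action executes",
    "rollback procedure is documented and findable at 3 AM",
    "rollback procedure has actually been tested",
    "compensating action exists if rollback fails",
    "escalation path exists if compensation fails",
)


def missing_items(answers):
    """Return every checklist item not explicitly answered True."""
    return [item for item in CHECKLIST if answers.get(item) is not True]


# Hypothetical answers for an operation that only has its forward path.
answers = {
    "operator can see what will change before it changes": True,
    "checkpoint is written before the action executes": True,
}
gaps = missing_items(answers)
ready_to_ship = not gaps
```

An unanswered item counts as a failure here on purpose. Silence about the backward path is how one-way doors get shipped.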

The Mirror Question

Every automation decision has a mirror question that engineers tend to skip.

“What should happen?” has a partner: “What happens when we need to undo this?”

“What should I build?” has a partner: “What is the smallest reversible unit?”

“When should this run?” has a partner: “How long do we have to cancel it?”

Automation that cannot undo is not more powerful than manual work. It is more dangerous. Speed amplifies both correctness and mistakes. The systems that age well are not the fastest: they are the ones that treated reversibility as a design requirement from the first line.

If you can’t undo it, you must slow it down.

Resources

Undo — Wikipedia. The history and mechanisms of undo in software, including linear and non-linear models.
User Control and Freedom (Usability Heuristic #3) — Nielsen Norman Group. Jakob Nielsen’s foundational work on why users need undo as a safety net.
Memento Pattern — Wikipedia. The design pattern for capturing and restoring object state: the engineering foundation of most undo implementations.

Morphic
