The Saga Pattern¶
The Saga pattern is a strategy for managing failures in sequences of operations that each have side effects. When step N fails, you need to undo the effects of steps 1 through N-1. wf implements this with on_failure forensic tasks.
The Problem¶
Consider an order processing flow:
If step 3 fails, you need to:
- Refund the customer (undo step 2)
- Release the inventory reservation (undo step 1)
Without explicit compensation, the customer is charged for an order that was never shipped.
The wf Solution¶
Wire a compensating transaction to each task via on_failure:
name = "order-processing"
[tasks.validate-order]
cmd = "./validate.sh {{.order_id}}"
register = "validation_status"
[tasks.reserve-inventory]
cmd = "./reserve.sh {{.order_id}}"
depends_on = ["validate-order"]
register = "reservation_id"
on_failure = "release-inventory" # ← compensating tx
[tasks.charge-customer]
cmd = "./charge.sh --amount={{.order_total}} --reservation={{.reservation_id}}"
depends_on = ["reserve-inventory"]
register = "charge_id"
on_failure = "refund-customer" # ← compensating tx
[tasks.create-shipment]
cmd = "./ship.sh --charge={{.charge_id}}"
depends_on = ["charge-customer"]
register = "tracking_number"
on_failure = "cancel-shipment" # ← compensating tx
[tasks.send-confirmation]
cmd = "./notify.sh --tracking={{.tracking_number}}"
depends_on = ["create-shipment"]
ignore_failure = true # notification failure is not fatal
# ── Compensating transactions ────────────────────────────────────────
[tasks.release-inventory]
type = "forensic"
cmd = "./release.sh {{.reservation_id}}"
retries = 2
retry_delay = "5s"
ignore_failure = true # don't mask original error if rollback also fails
[tasks.refund-customer]
type = "forensic"
cmd = "./refund.sh --charge-id={{.charge_id}}"
retries = 3
retry_delay = "10s"
ignore_failure = true
[tasks.cancel-shipment]
type = "forensic"
cmd = "./cancel-ship.sh {{.tracking_number}}"
ignore_failure = true
Execution Flow¶
Happy path:
If charge-customer fails:
validate-order → reserve-inventory → charge-customer (FAILED)
↓
refund-customer (forensic — if charge was attempted)
release-inventory (forensic — because reserve-inventory failed? No.)
Wait — this is an important nuance: each on_failure handler fires for that specific task only. If charge-customer fails, refund-customer fires (wired to charge-customer). But release-inventory is wired to reserve-inventory — it only fires if reserve-inventory itself fails.
If create-shipment fails:
validate-order → reserve-inventory → charge-customer → create-shipment (FAILED)
↓
cancel-shipment (forensic)
Only cancel-shipment fires. refund-customer and release-inventory do not fire — those compensations are only needed if their specific predecessors fail.
Global Notification¶
Wire a workflow-level on_failure for a notification that fires regardless of which step failed:
name = "order-processing"
on_failure = "notify-ops"
# ... tasks ...
[tasks.notify-ops]
type = "forensic"
cmd = """
curl -X POST $PAGERDUTY_URL \
-d "{\"message\": \"Order processing failed at: {{.failed_task}}\",
\"details\": \"{{.error_message}}\"}"
"""
ignore_failure = true
timeout = "30s"
Both task-level and workflow-level handlers can coexist — both fire for the same failure event.
Key Design Rules¶
-
Always set
ignore_failure = trueon compensating transactions. If the rollback itself fails, you don't want to mask the original error with a "rollback failed" error. -
Add
retriesto compensating transactions. A refund or inventory release is critical — give it multiple chances to succeed. -
Compensating transactions should be idempotent. If the workflow is resumed, the compensation may be attempted again. Design the compensation script to be safe to call multiple times.
-
Keep compensating transaction scope narrow. Each task's
on_failureshould only undo that specific task's side effect — not all previous side effects. -
Use
{{.failed_task}}and{{.error_message}}for diagnostics. These are injected automatically and give your notification tasks full context.
Example: Infrastructure Provisioning with Teardown¶
name = "provision-infra"
on_failure = "teardown-everything"
[tasks.provision-network]
cmd = "terraform apply -target=module.network"
register = "vpc_id"
on_failure = "teardown-network"
retries = 1
[tasks.provision-database]
cmd = "terraform apply -target=module.database"
depends_on = ["provision-network"]
register = "db_endpoint"
on_failure = "teardown-database"
retries = 1
[tasks.provision-compute]
cmd = "terraform apply -target=module.compute"
depends_on = ["provision-database"]
register = "cluster_arn"
retries = 1
# ── Teardown handlers ─────────────────────────────────────────────────
[tasks.teardown-network]
type = "forensic"
cmd = "terraform destroy -target=module.network -auto-approve"
timeout = "10m"
ignore_failure = true
[tasks.teardown-database]
type = "forensic"
cmd = "terraform destroy -target=module.database -auto-approve"
timeout = "10m"
ignore_failure = true
[tasks.teardown-everything]
type = "forensic"
cmd = """
echo "Tearing down all infrastructure due to: {{.failed_task}}"
terraform destroy -auto-approve
"""
timeout = "30m"
ignore_failure = true