Forensic Tasks & Failure Handling¶
wf implements a first-class failure handling model inspired by the Saga pattern from distributed systems. Compensating transactions, rollbacks, and notifications are wired directly into the workflow definition — no external orchestration needed.
Overview¶
There are two levels of failure handling:
| Level | Field | Fires when |
|---|---|---|
| Task-level | on_failure on a task | That specific task fails |
| Workflow-level | on_failure at the top of the TOML | Any task in the workflow fails |
Both point to a task with type = "forensic". Forensic tasks are excluded from normal DAG execution — they only run on failure.
Task-Level on_failure¶
Wire a compensating transaction directly to a task:
[tasks.charge-card]
cmd = "./charge.sh --amount={{.order_total}}"
depends_on = ["reserve-inventory"]
on_failure = "refund-card"
[tasks.refund-card]
type = "forensic"
cmd = "./refund.sh --payment-id={{.payment_id}} --reason='charge failed'"
If charge-card fails, refund-card runs immediately. The rest of the workflow is cancelled.
Workflow-Level on_failure¶
A global handler that fires when any task fails (unless a task-level handler fires first):
name = "deploy"
on_failure = "alert-oncall"
[tasks.alert-oncall]
type = "forensic"
cmd = """
curl -X POST $PAGERDUTY_WEBHOOK \
-d '{"message": "Deploy failed: {{.error_message}}", "task": "{{.failed_task}}"}'
"""
Forensic Variables¶
The executor injects special variables into forensic task commands:
| Variable | Scope | Value |
|---|---|---|
{{.failed_task}} | Task-level handlers | ID of the task that failed |
{{.error_message}} | Both levels | stderr/output of the failed task |
{{.failed_dag}} | Workflow-level handler | Name of the workflow |
[tasks.rollback-db]
type = "forensic"
cmd = """
echo "Rolling back due to failure in {{.failed_task}}"
echo "Error: {{.error_message}}"
psql -c "ROLLBACK;"
"""
The Saga Pattern¶
In distributed systems, a Saga is a sequence of operations where each step has a corresponding compensating transaction. If step N fails, compensating transactions for steps N-1, N-2, … are executed in reverse order.
wf implements this naturally with task-level on_failure handlers:
name = "order-processing"
[tasks.reserve-inventory]
cmd = "./reserve.sh {{.order_id}}"
depends_on = ["validate-order"]
register = "reservation_id"
on_failure = "release-inventory"
[tasks.charge-customer]
cmd = "./charge.sh {{.reservation_id}}"
depends_on = ["reserve-inventory"]
register = "charge_id"
on_failure = "refund-customer"
[tasks.create-shipment]
cmd = "./ship.sh {{.charge_id}}"
depends_on = ["charge-customer"]
register = "tracking_number"
on_failure = "cancel-shipment"
# ── Compensating transactions ────────────────────────────
[tasks.release-inventory]
type = "forensic"
cmd = "./release.sh {{.reservation_id}}"
[tasks.refund-customer]
type = "forensic"
cmd = "./refund.sh {{.charge_id}}"
[tasks.cancel-shipment]
type = "forensic"
cmd = "./cancel-ship.sh {{.tracking_number}}"
If create-shipment fails, only cancel-shipment fires (not refund-customer or release-inventory — those only fire if their specific predecessor fails).
Forensic Task Properties¶
A forensic task is any task with type = "forensic". All standard task fields apply:
[tasks.emergency-rollback]
type = "forensic"
cmd = "./rollback.sh"
timeout = "5m"
retries = 2
retry_delay = "10s"
ignore_failure = true # don't fail the handler if rollback itself fails
env = {ROLLBACK_MODE = "force"}
ignore_failure on Forensic Tasks¶
Setting ignore_failure = true on a forensic task prevents a failure in the handler from masking the original error. Without it, a failing handler would itself be reported as the final failure — which obscures what actually went wrong.
[tasks.notify-slack]
type = "forensic"
cmd = "curl $SLACK_WEBHOOK -d '{\"text\": \"Deploy failed\"}'"
ignore_failure = true # Slack being down shouldn't change the run outcome
Execution Order¶
- A task fails
- If the task has
on_failure = "X", task X runs immediately - All remaining normal tasks are cancelled
- If the workflow has a top-level
on_failure = "Y", task Y runs after the run settles - The run is marked
failed
Task-level and workflow-level handlers can coexist — both fire for the same failure.
What Forensic Tasks Cannot Do¶
- Forensic tasks cannot have
depends_onpointing to normal tasks (they are excluded from the normal DAG) - Forensic tasks cannot register variables that downstream normal tasks read (there are no downstream normal tasks left after a failure)
- A forensic task cannot itself trigger another forensic task chain