Forensic Tasks & Failure Handling

wf implements a first-class failure-handling model inspired by the Saga pattern from distributed systems. Compensating transactions, rollbacks, and notifications are wired directly into the workflow definition, so no external orchestration is needed.

Overview

There are two levels of failure handling:

| Level | Field | Fires when |
| --- | --- | --- |
| Task-level | `on_failure` on a task | That specific task fails |
| Workflow-level | `on_failure` at the top of the TOML | Any task in the workflow fails |

Both fields name a task with `type = "forensic"`. Forensic tasks are excluded from normal DAG execution and run only when a failure triggers them.

Task-Level `on_failure`

Wire a compensating transaction directly to a task:

[tasks.charge-card]
cmd        = "./charge.sh --amount={{.order_total}}"
depends_on = ["reserve-inventory"]
on_failure = "refund-card"

[tasks.refund-card]
type = "forensic"
cmd  = "./refund.sh --payment-id={{.payment_id}} --reason='charge failed'"

If `charge-card` fails, `refund-card` runs immediately and the remaining normal tasks in the workflow are cancelled.

Workflow-Level `on_failure`

A global handler that fires when any task fails, after any task-level handler for the failed task has run:

name       = "deploy"
on_failure = "alert-oncall"

[tasks.alert-oncall]
type = "forensic"
cmd  = """
curl -X POST $PAGERDUTY_WEBHOOK \
  -d '{"message": "Deploy failed: {{.error_message}}", "task": "{{.failed_task}}"}'
"""

Forensic Variables

The executor injects special variables into forensic task commands:

| Variable | Scope | Value |
| --- | --- | --- |
| `{{.failed_task}}` | Task-level handlers | ID of the task that failed |
| `{{.error_message}}` | Both levels | stderr/output of the failed task |
| `{{.failed_dag}}` | Workflow-level handler | Name of the workflow |

[tasks.rollback-db]
type = "forensic"
cmd  = """
echo "Rolling back due to failure in {{.failed_task}}"
echo "Error: {{.error_message}}"
psql -c "ROLLBACK;"
"""

The Saga Pattern

In distributed systems, a Saga is a sequence of operations where each step has a corresponding compensating transaction. If step N fails, compensating transactions for steps N-1, N-2, … are executed in reverse order.

wf supports a simplified form of this pattern through task-level `on_failure` handlers, with one compensating task per step:

name = "order-processing"

[tasks.reserve-inventory]
cmd        = "./reserve.sh {{.order_id}}"
depends_on = ["validate-order"]
register   = "reservation_id"
on_failure = "release-inventory"

[tasks.charge-customer]
cmd        = "./charge.sh {{.reservation_id}}"
depends_on = ["reserve-inventory"]
register   = "charge_id"
on_failure = "refund-customer"

[tasks.create-shipment]
cmd        = "./ship.sh {{.charge_id}}"
depends_on = ["charge-customer"]
register   = "tracking_number"
on_failure = "cancel-shipment"

# ── Compensating transactions ────────────────────────────

[tasks.release-inventory]
type = "forensic"
cmd  = "./release.sh {{.reservation_id}}"

[tasks.refund-customer]
type = "forensic"
cmd  = "./refund.sh {{.charge_id}}"

[tasks.cancel-shipment]
type = "forensic"
cmd  = "./cancel-ship.sh {{.tracking_number}}"

If `create-shipment` fails, only `cancel-shipment` fires. `refund-customer` and `release-inventory` do not run, because each handler fires only when the task that names it in `on_failure` fails.
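
Because only the failed task's own handler fires, a failure late in the chain leaves the earlier steps uncompensated unless that handler undoes them itself. The fragment below is a minimal sketch of one way to approximate reverse-order unwinding: a variant of `create-shipment` whose handler runs the earlier compensations in reverse order inside a single multi-line `cmd`. It assumes that variables registered by tasks that already succeeded (`charge_id`, `reservation_id`) remain available to forensic tasks, and that a multi-line `cmd` runs as one shell script. The handler name `unwind-order` is hypothetical.

```toml
[tasks.create-shipment]
cmd        = "./ship.sh {{.charge_id}}"
depends_on = ["charge-customer"]
register   = "tracking_number"
on_failure = "unwind-order"   # hypothetical handler name

# Runs only if create-shipment fails. The earlier steps have succeeded by
# then, so their compensations run here, most recent step first.
[tasks.unwind-order]
type = "forensic"
cmd  = """
./refund.sh {{.charge_id}}
./release.sh {{.reservation_id}}
"""
```

Folding the compensations into one handler keeps within the rule that a forensic task cannot trigger another forensic task.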

Forensic Task Properties

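A forensic task is any task with `type = "forensic"`. Standard task fields such as `timeout`, `retries`, and `env` apply, subject to the restrictions listed under "What Forensic Tasks Cannot Do":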

[tasks.emergency-rollback]
type           = "forensic"
cmd            = "./rollback.sh"
timeout        = "5m"
retries        = 2
retry_delay    = "10s"
ignore_failure = true   # don't fail the handler if rollback itself fails
env            = {ROLLBACK_MODE = "force"}

`ignore_failure` on Forensic Tasks

Setting `ignore_failure = true` on a forensic task prevents a failure in the handler from masking the original error. Without it, a failing handler would itself be reported as the final failure, which obscures what actually went wrong.

[tasks.notify-slack]
type           = "forensic"
cmd            = "curl $SLACK_WEBHOOK -d '{\"text\": \"Deploy failed\"}'"
ignore_failure = true   # Slack being down shouldn't change the run outcome

Execution Order

1. A task fails.
2. If the task has `on_failure = "X"`, task X runs immediately.
3. All remaining normal tasks are cancelled.
4. If the workflow has a top-level `on_failure = "Y"`, task Y runs after the run settles.
5. The run is marked failed.

Task-level and workflow-level handlers can coexist: both fire for the same failure, the task-level handler first.
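
The sequence above can be sketched with both handler levels in one workflow. This is an illustrative fragment that uses only fields shown elsewhere on this page. The workflow name, task names, and scripts (`migrate.sh`, `restart.sh`, `restore.sh`, `notify.sh`) are hypothetical.

```toml
name       = "release"
on_failure = "notify-team"   # workflow-level handler

[tasks.migrate-db]
cmd        = "./migrate.sh"
on_failure = "restore-db"    # task-level handler

[tasks.restart-app]
cmd        = "./restart.sh"
depends_on = ["migrate-db"]

[tasks.restore-db]
type = "forensic"
cmd  = "./restore.sh"

[tasks.notify-team]
type           = "forensic"
cmd            = "./notify.sh {{.failed_dag}}"
ignore_failure = true        # a failed notification should not mask the original error
```

Following the steps listed above, a failure in `migrate-db` would run `restore-db` first, cancel `restart-app`, and then run `notify-team` once the run settles. A failure in `restart-app` would run only `notify-team`, since that task has no handler of its own.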

What Forensic Tasks Cannot Do

- Forensic tasks cannot have `depends_on` pointing to normal tasks (they are excluded from the normal DAG).
- Forensic tasks cannot register variables that downstream normal tasks read (there are no downstream normal tasks left after a failure).
- A forensic task cannot itself trigger another forensic task chain.
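
For illustration, the fragment below marks the three configurations that the restrictions above rule out. The task names and scripts are hypothetical. This page does not say whether wf rejects such a file at load time or handles it some other way, so treat the comments as a checklist of what to avoid, not as a description of wf's validation behavior.

```toml
[tasks.build]
cmd = "./build.sh"

[tasks.cleanup]
type       = "forensic"
cmd        = "./cleanup.sh"
depends_on = ["build"]          # ruled out: depends on a normal task
register   = "cleanup_status"   # no normal task is left to read this after a failure
on_failure = "second-cleanup"   # ruled out: a handler cannot trigger another handler

[tasks.second-cleanup]
type = "forensic"
cmd  = "./cleanup-more.sh"
```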