Error Handling & Retries¶

wf provides four mechanisms for dealing with unreliable tasks: ignore_failure, retries, retry_delay, and timeout. This guide explains each one and how to combine them.

`ignore_failure` — Treat Failure as Success¶

When ignore_failure = true, a task's non-zero exit code is treated as if it were 0. The task is marked success and its dependants continue normally.

[tasks.send-notification]
cmd            = "curl -X POST $SLACK_WEBHOOK -d '{\"text\": \"Deploy done\"}'"
depends_on     = ["deploy"]
ignore_failure = true    # Slack being down should not abort the workflow

When to use:

Notifications, metrics pushes, and other non-critical side effects
Optional cleanup steps
Tasks that produce useful output even when they "fail" (e.g., a linter that exits 1 but still writes a report)

What ignore_failure does NOT do:

It does not suppress the task's error output — the log file still contains everything
It does not stop the task from being marked in the audit trail as having a non-zero exit code
It does not prevent the forensic handler from running if the task is wired with on_failure (the handler fires on non-zero exit regardless)

`retries` — Retry on Failure¶

[tasks.call-api]
cmd         = "curl https://api.example.com/webhook"
retries     = 3
retry_delay = "10s"

The task will be executed up to retries + 1 times. If it succeeds on any attempt, the task is marked success and no further retries occur.

The retry_delay is the pause between the end of one attempt and the start of the next. It is context-aware: if the run timeout fires during a retry delay, the delay is cancelled and the task is marked cancelled.

Total possible execution time for the example above:

attempt 1: fails
retry_delay: 10s
attempt 2: fails
retry_delay: 10s
attempt 3: fails
retry_delay: 10s
attempt 4: passes (or fails — 4th attempt is final)

max time = (task_duration × 4) + (10s × 3)

`timeout` — Bound Task Duration¶

[tasks.integration-test]
cmd     = "pytest tests/integration/"
timeout = "5m"

If the task runs longer than the timeout, the process group is killed with SIGKILL and the task is marked failed. The full timeout details are recorded in the forensic log.

timeout and retries compose naturally:

[tasks.flaky-service-check]
cmd         = "curl -f --max-time 10 https://myservice/health"
timeout     = "15s"    # per-attempt limit
retries     = 4
retry_delay = "5s"

Each attempt has a 15-second limit. If the attempt times out, it is retried (up to 4 times).

Duration syntax: 30s, 5m, 1m30s, 2h.

Combining All Four¶

A production-grade task:

[tasks.deploy]
name           = "Deploy to Production"
cmd            = "./deploy.sh --env=production --version={{.version}}"
depends_on     = ["smoke-test-staging"]
retries        = 2
retry_delay    = "30s"
timeout        = "10m"
ignore_failure = false    # deploy failure IS fatal
on_failure     = "rollback-production"

Breakdown:

Up to 3 attempts, each bounded to 10 minutes
30 seconds between attempts
If all attempts fail, rollback-production forensic task fires

Per-Run Timeout (`--timeout`)¶

In addition to per-task timeouts, the entire run can be bounded:

wf run deploy --timeout 30m

The run-level timeout cancels all running tasks (via SIGKILL) when it fires. Unlike per-task timeout, the run timeout is not retried — the run is immediately marked cancelled.

Failure Propagation¶

By default, when a task fails:

The task is marked failed
All tasks that depend on it (directly or transitively) are marked cancelled
Tasks at the same level that are already running continue to completion
The forensic handler fires (if configured)

With ignore_failure = true, step 1 becomes success and steps 2–4 do not apply.

Best Practices¶

Scenario	Recommendation
Slack/PagerDuty notifications	`ignore_failure = true` — external services should not abort pipelines
External API calls	`retries = 3`, `retry_delay = "10s"`, `timeout = "30s"` per call
Database operations	`retries = 1` max — most DB failures are not transient
CI tests	`timeout = "10m"` — prevent hanging test suites from blocking the pipeline
Deploy tasks	`retries = 2`, `timeout = "15m"`, `on_failure = "rollback"`
Health checks	`retries = 5`, `retry_delay = "5s"`, `ignore_failure = true` — eventual consistency

Inspecting Retry Behaviour¶

The audit trail shows every attempt:

wf audit $RUN_ID | grep -E 'task_started|task_failed|task_retrying|task_success'

wf inspect $RUN_ID shows the attempt count for each task in the task table.

Error Handling & Retries¶

ignore_failure — Treat Failure as Success¶

retries — Retry on Failure¶

timeout — Bound Task Duration¶