Example 08 — Log Analysis

File: files/examples/log-analysis.toml
Industry: SRE / Observability
Tags: observability, logs, sre

Features Demonstrated

  • matrix expansion across five services (api, auth, payments, notifications, search)
  • Parallel log collection and parsing per service
  • register for error counts per service
  • if conditional for SLO breach alert
  • ignore_failure on collection tasks
  • timeout per task
  • working_dir for log storage path

Why this pattern matters

Log analysis at scale has two problems. Collecting from many sources is slow when done sequentially, and the aggregate view (cross-service error correlation) is only useful when the per-service inputs are complete. A matrix-expanded DAG addresses both: the five collection nodes run in parallel, and the correlation task (correlate-errors) has a hard dependency on all five parse nodes, so it cannot start until every service's error count is registered.
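The fan-out/fan-in shape described above can be sketched in plain shell. This is only an analogue of the dependency structure, not how wf schedules tasks: collect_logs and parse_logs are hypothetical stand-ins for the real task commands, and the log content is fabricated for the illustration.

```shell
#!/usr/bin/env bash
# Plain-shell analogue of the DAG shape (assumption: not wf's implementation).
set -u

services=(api auth payments notifications search)
workdir=$(mktemp -d)

# Hypothetical stand-in for a collect-logs task: writes a tiny fake log.
collect_logs() {
  local service=$1
  printf 'INFO start\nERROR %s failed\nINFO done\n' "$service" > "$workdir/$service.log"
}

# Hypothetical stand-in for a parse-logs task: prints the number of ERROR lines.
parse_logs() {
  local service=$1
  grep -c '^ERROR' "$workdir/$service.log"
}

# Fan-out: one background job per service (collect, then parse).
for s in "${services[@]}"; do
  ( collect_logs "$s" && parse_logs "$s" > "$workdir/$s.count" ) &
done

# Fan-in: block until every background job has finished.
wait

# Correlation step: runs only after all five per-service counts exist.
total=0
for s in "${services[@]}"; do
  count=$(cat "$workdir/$s.count")
  total=$((total + count))
done
echo "total_errors=$total"
rm -rf "$workdir"
```

The `wait` call plays the role of the hard dependency: the summing loop cannot observe a partial set of counts. What the script lacks is the per-task run record that the DAG runner keeps.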

The SLO breach check uses if against a registered error rate. The alert condition is therefore evaluated against the number the analysis task actually computed and wrote to stdout, not against a threshold checked in a separate cron script that has no access to that context. When the SLO alert fires, wf audit shows the full chain: what was collected, what was parsed, and what error rate triggered the alert. When it does not fire, the run record shows that too, so passing and failing runs are equally traceable.
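As a rough illustration of the kind of value such a check consumes, the sketch below computes an error rate, prints it to stdout, and compares it with a threshold. Everything here is an assumption made for the example: the log format (level as the first field), the function names error_rate and breaches_slo, and the 5 percent threshold. None of it is taken from log-analysis.toml.

```shell
# Hypothetical log format: one line per event, log level as the first field.
error_rate() {
  # Print the percentage of lines whose first field is ERROR, to two decimals.
  awk '$1 == "ERROR" { errors++ }
       END { if (NR == 0) print "0.00"; else printf "%.2f\n", 100 * errors / NR }' "$1"
}

breaches_slo() {
  # Exit status 0 when rate > threshold; awk handles the floating-point comparison.
  awk -v rate="$1" -v threshold="$2" 'BEGIN { exit !(rate > threshold) }'
}

log=$(mktemp)
printf 'INFO ok\nINFO ok\nERROR timeout\nINFO ok\n' > "$log"

# The value a register step would capture from the task's stdout.
rate=$(error_rate "$log")

# Hypothetical threshold of 5 percent.
if breaches_slo "$rate" 5; then
  decision="alert"
else
  decision="no-alert"
fi
echo "error_rate=$rate decision=$decision"
rm -f "$log"
```

Because the gate reads the same number the analysis step printed, the decision and its input end up in one place, which is the property the paragraph above attributes to the registered value.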

Pipeline Structure

[collect-logs[service=api]]          ─┐
[collect-logs[service=auth]]          │
[collect-logs[service=payments]]      ├→ (per service)
[collect-logs[service=notifications]] │   [parse-logs[service=*]] → [correlate-errors]
[collect-logs[service=search]]        ┘           ↓
                                            (if error_rate > threshold)
                                            [trigger-alert]
                                            [generate-dashboard]
                                            [archive-logs]

Run Commands

# Collect and analyse logs
wf run log-analysis --parallel --print-output

# Work-stealing for five-service fan-out
wf run log-analysis --work-stealing --max-parallel 10 --print-output

# Visualise matrix expansion
wf graph log-analysis --matrix

What to Observe

  • Ten matrix nodes total: five collect-logs + five parse-logs
  • correlate-errors depends directly on the five parse-logs nodes, so it effectively waits for all ten matrix nodes to complete
  • wf inspect shows api_errors, auth_errors, payments_errors, notifications_errors, search_errors
  • trigger-alert is gated by an if condition on error rate
  • generate-dashboard uses all five error count variables in its command

Inspect After Running

RUN_ID=$(wf runs --tag observability --limit 1 | awk 'NR==2{print $1}')
wf inspect $RUN_ID
wf audit   $RUN_ID | grep task_started    # confirm parallel starts

# Diff two analysis runs
RUN_A=$RUN_ID
wf run log-analysis --parallel
RUN_B=$(wf runs --tag observability --limit 1 | awk 'NR==2{print $1}')
wf diff $RUN_A $RUN_B    # compare error counts between runs