# Example 08 — Log Analysis

File: `files/examples/log-analysis.toml`
Industry: SRE / Observability
Tags: observability, logs, sre
## Features Demonstrated

- `matrix` expansion across five services (api, auth, payments, notifications, search)
- Parallel log collection and parsing per service
- `register` for error counts per service
- `if` conditional for SLO breach alert
- `ignore_failure` on collection tasks
- `timeout` per task
- `working_dir` for log storage path
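The features above map onto the workflow file roughly as follows. This is a hedged sketch only: the field names (`matrix`, `register`, `depends_on`, and friends) and the commands are assumptions for illustration, not taken from the real file — the authoritative version is `files/examples/log-analysis.toml`.

```toml
# Sketch only — schema and commands are assumed, not copied from the example file.
[[tasks]]
name = "collect-logs"
matrix = { service = ["api", "auth", "payments", "notifications", "search"] }
command = "fetch-logs ${service} > ${service}.log"   # hypothetical collector command
working_dir = "/var/log/wf"                          # log storage path
timeout = "120s"                                     # per-task timeout
ignore_failure = true                                # one unreachable service must not kill the run

[[tasks]]
name = "parse-logs"
matrix = { service = ["api", "auth", "payments", "notifications", "search"] }
depends_on = ["collect-logs"]
command = "grep -c ERROR ${service}.log"
register = "${service}_errors"                       # e.g. api_errors, auth_errors, ...
```

Each matrix entry expands into its own DAG node, which is what makes the five collections run in parallel rather than in sequence.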
## Why this pattern matters
Log analysis at scale has two problems: collection from many sources is slow if sequential, and the aggregate view (cross-service error correlation) is only useful if the per-service inputs are complete. A matrix-expanded DAG solves both: five parallel collection nodes finish faster, and the correlation task (correlate-errors) has a hard dependency on all five parse nodes — it cannot start until every service's error count is registered.
The SLO breach check uses if against a registered error rate. This means the alert condition is evaluated against the number the analysis task actually computed and wrote to stdout — not a threshold checked in a cron script that has no access to that context. When the SLO alert fires, wf audit shows the full chain: what was collected, what was parsed, what error rate triggered the alert. When it doesn't fire, the run record shows that too — same traceability for passing and failing runs.
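The gating described above might look like this in the workflow file. Again a sketch: the `if` expression syntax, the `error_rate` variable name, the threshold value, and the alert command are all assumptions.

```toml
# Sketch only — expression syntax and threshold are illustrative.
[[tasks]]
name = "trigger-alert"
depends_on = ["correlate-errors"]
if = "${error_rate} > 0.05"                                     # threshold value is illustrative
command = "send-alert 'SLO breach: error rate ${error_rate}'"   # hypothetical alerting command
```

The point of the pattern is that the condition reads the same registered value the analysis task produced, so the audit trail and the alert decision can never disagree.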
## Pipeline Structure

```
[collect-logs[service=api]]           ─┐
[collect-logs[service=auth]]           │
[collect-logs[service=payments]]       ├→ [parse-logs[service=*]] → [correlate-errors]
[collect-logs[service=notifications]]  │      (per service)                 ↓
[collect-logs[service=search]]        ─┘          (if error_rate > threshold)
                                                      [trigger-alert]

[generate-dashboard]
        ↓
[archive-logs]
```
## Run Commands

```bash
# Collect and analyse logs
wf run log-analysis --parallel --print-output

# Work-stealing for five-service fan-out
wf run log-analysis --work-stealing --max-parallel 10 --print-output

# Visualise matrix expansion
wf graph log-analysis --matrix
```
## What to Observe

- Ten matrix nodes total: five `collect-logs` + five `parse-logs`
- `correlate-errors` waits for all ten nodes to complete
- `wf inspect` shows `api_errors`, `auth_errors`, `payments_errors`, `notifications_errors`, `search_errors`
- `trigger-alert` is gated by an `if` condition on error rate
- `generate-dashboard` uses all five error count variables in its command
## Inspect After Running

```bash
RUN_ID=$(wf runs --tag observability --limit 1 | awk 'NR==2{print $1}')
wf inspect $RUN_ID
wf audit $RUN_ID | grep task_started   # confirm parallel starts

# Diff two analysis runs
RUN_A=$RUN_ID
wf run log-analysis --parallel
RUN_B=$(wf runs --tag observability --limit 1 | awk 'NR==2{print $1}')
wf diff $RUN_A $RUN_B                  # compare error counts between runs
```