Skip to content

Example 04 — ML Training Pipeline

File: files/examples/ml-training.toml Industry: Machine Learning / MLOps Tags: ml, training, model

Features Demonstrated

  • Three parallel model trainers (XGBoost, LightGBM, TabNet)
  • register capturing experiment ID and best F1 score
  • if conditional on F1 score for model promotion gate
  • working_dir for experiment artifacts
  • env for experiment tracking
  • ignore_failure on optional Optuna sweep
  • Global on_failure forensic handler
  • Runtime --var for experiment naming
  • timeout on training tasks

Why this pattern matters

Model training pipelines are expensive — hours of GPU time, terabytes of data movement. A failure mid-run without resume capability means restarting from scratch. A promotion gate without an auditable record means "we deployed the best model we had at 2am" is institutional memory rather than a verifiable fact.

The three trainers run in parallel, each registering its F1 score as a named variable. select-champion evaluates those scores and registers the winner. The promotion gate uses if to check the champion's score against a threshold — the threshold is a runtime --var, not hardcoded. Every decision — which experiment ID was selected, what score triggered promotion — is in the run record and retrievable with wf inspect long after the training cluster has been torn down.

Pipeline Structure

[prepare-data]
  ├── [train-xgboost]  ─┐
  ├── [train-lightgbm]  ├→ [select-champion]
  └── [train-tabnet]   ─┘        ↓
  └── [optuna-sweep]             (if best_f1 > 0.90)
                             [promote-champion] → [deploy-champion]

Global forensic: [alert-ml-failure]

Run Commands

# Standard run with experiment name
wf run ml-training --var EXPERIMENT_NAME=run-$(date +%s) --parallel --print-output

# With timeout for training tasks
wf run ml-training \
  --var EXPERIMENT_NAME=experiment-001 \
  --work-stealing \
  --timeout 2h \
  --print-output

# Visualise
wf graph ml-training

What to Observe

  • Three model training tasks run simultaneously
  • wf inspect shows exp_id, best_f1, champion_model variables
  • promote-champion is gated by if = 'best_f1 > "0.90"' — inspect to see whether the condition was met
  • optuna-sweep has ignore_failure = true — it won't abort the run if it fails
  • deploy-champion uses {{.champion_model}} interpolation — confirm the model name appears in the log

Inspect After Running

RUN_ID=$(wf runs --tag ml --limit 1 | awk 'NR==2{print $1}')
wf inspect $RUN_ID                     # exp_id, best_f1, champion_model
wf logs    $RUN_ID select-champion     # champion selection logic output
wf audit   $RUN_ID                     # see if promote-champion was skipped or ran