Example 04 — ML Training Pipeline¶
File: `files/examples/ml-training.toml`
Industry: Machine Learning / MLOps
Tags: ml, training, model
Features Demonstrated¶
- Three parallel model trainers (XGBoost, LightGBM, TabNet)
- `register` capturing experiment ID and best F1 score
- `if` conditional on F1 score for the model promotion gate
- `working_dir` for experiment artifacts
- `env` for experiment tracking
- `ignore_failure` on the optional Optuna sweep
- Global `on_failure` forensic handler
- Runtime `--var` for experiment naming
- `timeout` on training tasks
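The features above can be sketched as a TOML fragment. This is an illustrative sketch, not the shipped `ml-training.toml`: the keys `register`, `working_dir`, `env`, `ignore_failure`, and `timeout` come from the feature list, but the `depends_on` key, command lines, and values are assumptions.

```toml
# Illustrative sketch only; not the actual ml-training.toml.
[train-xgboost]
depends_on = ["prepare-data"]                     # assumed dependency key
command = "python train.py --model xgboost"       # assumed command line
working_dir = "artifacts/{{.EXPERIMENT_NAME}}"    # per-experiment artifacts
env = { EXPERIMENT = "{{.EXPERIMENT_NAME}}" }     # assumed tracking variable
timeout = "2h"
register = ["exp_id", "best_f1"]                  # capture experiment ID and best F1

[optuna-sweep]
depends_on = ["select-champion"]
command = "python sweep.py"                       # assumed command line
ignore_failure = true                             # optional sweep: a failure won't abort the run
```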
Why this pattern matters¶
Model training pipelines are expensive — hours of GPU time, terabytes of data movement. A failure mid-run without resume capability means restarting from scratch. A promotion gate without an auditable record means "we deployed the best model we had at 2am" is institutional memory rather than a verifiable fact.
The three trainers run in parallel, each registering its F1 score as a named variable. `select-champion` evaluates those scores and registers the winner. The promotion gate uses `if` to check the champion's score against a threshold; the threshold is a runtime `--var`, not hardcoded. Every decision (which experiment ID was selected, what score triggered promotion) is in the run record and retrievable with `wf inspect` long after the training cluster has been torn down.
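The selection-and-gate step described above might look like the following. Again an illustrative sketch: only `register` and `if` (with the `'best_f1 > "0.90"'` condition quoted later in this page) are taken from the example's documented features; `depends_on` and the command lines are assumptions.

```toml
# Illustrative sketch: keys other than register/if are assumptions.
[select-champion]
depends_on = ["train-xgboost", "train-lightgbm", "train-tabnet"]
command = "python select.py"                  # assumed: compares the three registered F1 scores
register = ["champion_model", "best_f1"]      # winner's name and score become run variables

[promote-champion]
depends_on = ["select-champion"]
if = 'best_f1 > "0.90"'                       # promotion gate; the threshold could come from a --var
command = "python promote.py {{.champion_model}}"
```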
Pipeline Structure¶
```
[prepare-data]
 ├── [train-xgboost] ─┐
 ├── [train-lightgbm] ├→ [select-champion]
 └── [train-tabnet] ──┘        ↓
                    └── [optuna-sweep]   (if best_f1 > 0.90)
[promote-champion] → [deploy-champion]

Global forensic: [alert-ml-failure]
```
Run Commands¶
```shell
# Standard run with experiment name
wf run ml-training --var EXPERIMENT_NAME=run-$(date +%s) --parallel --print-output

# With timeout for training tasks
wf run ml-training \
  --var EXPERIMENT_NAME=experiment-001 \
  --work-stealing \
  --timeout 2h \
  --print-output

# Visualise
wf graph ml-training
```
What to Observe¶
- Three model training tasks run simultaneously
- `wf inspect` shows the `exp_id`, `best_f1`, and `champion_model` variables
- `promote-champion` is gated by `if = 'best_f1 > "0.90"'`; inspect the run to see whether the condition was met
- `optuna-sweep` has `ignore_failure = true`, so a failed sweep won't abort the run
- `deploy-champion` uses `{{.champion_model}}` interpolation; confirm the model name appears in the log
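The interpolation and global forensic handler noted above might be wired up like this. This is a hypothetical sketch: the `on_failure` name comes from the feature list, but how it is declared globally, and the `depends_on`/`command` keys, are assumptions.

```toml
# Illustrative sketch of interpolation and the global forensic handler.
[deploy-champion]
depends_on = ["promote-champion"]
command = "python deploy.py {{.champion_model}}"  # interpolated champion name shows up in the log

[alert-ml-failure]
command = "python alert.py"                       # assumed forensic command

[global]
on_failure = "alert-ml-failure"                   # assumed wiring: run forensics when any task fails
```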