# AI/ML Supply Chain
Machine learning adds new dimensions to the supply chain: models, datasets, training pipelines, and inference frameworks. Each is a dependency that can fail, be compromised, or disappear.
> **Downloading Weights from Strangers**
>
> You're downloading model weights from strangers on the internet and running them on your machine. Think about that for a moment. That Hugging Face model could have been trained on anything by anyone. Pickle files can execute arbitrary code. Datasets might contain poisoned examples designed to create backdoors. The ML supply chain is the Wild West: all the software supply chain problems, plus entirely new categories of risk.
## The ML Dependency Stack
Traditional software has code dependencies. ML has more:
| Layer | Examples | Risks |
|---|---|---|
| Code | PyTorch, TensorFlow, scikit-learn | Same as any software |
| Models | GPT-2, ResNet, BERT | Provenance, poisoning, bias |
| Datasets | ImageNet, Common Crawl | Bias, licensing, privacy |
| Compute | CUDA, cuDNN, hardware | Availability, reproducibility |
| Weights | Pre-trained parameters | Integrity, versioning |
Each layer can introduce vulnerabilities, biases, and failure modes.
## Model Supply Chain

### Where Models Come From
**Hugging Face Hub:** Largest collection of open models. Anyone can upload, so treat listings as unvetted and pin exact revisions (see the sketch below).

**Official releases:** Model authors publish directly (OpenAI, Meta, Google).

**Third-party training:** Someone else trained it on unknown data.
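For models pulled from the Hub, pinning a full commit SHA rather than a floating branch name means a later upload can't silently change what you download. A minimal sketch using `huggingface_hub` (the repo id and commit SHA are placeholders):

```python
from huggingface_hub import hf_hub_download

# Pin to a full commit SHA, not "main": later pushes to the repo
# then cannot change the artifact you receive.
weights_path = hf_hub_download(
    repo_id="some-org/some-model",   # placeholder
    filename="pytorch_model.bin",
    revision="1234abcd...",          # commit SHA from the model page
)
```

Combine this with checksum verification (below) for defense in depth.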
### Model Risks
**Provenance unknown:** Who trained this? On what data? With what objectives?

**Poisoning attacks:** Models can be trained to behave maliciously on specific inputs while appearing normal on others.1

**Backdoors:** Hidden behaviors triggered by specific patterns.

**Weight tampering:** Modified weights after training.

**Bias:** Models inherit biases from training data.
### Model Verification
**Checksums:** Verify downloaded weights match expected hashes.
```python
import hashlib

def verify_model(filepath, expected_hash):
    """Return True if the file's SHA-256 matches the published hash."""
    sha256 = hashlib.sha256()
    with open(filepath, 'rb') as f:
        # Read in chunks so large weight files never need to fit in memory
        for chunk in iter(lambda: f.read(4096), b''):
            sha256.update(chunk)
    return sha256.hexdigest() == expected_hash
```
**Provenance documentation:** Require model cards with training details.
```yaml
# Model Card (modelcard.yaml)
model_name: my-model
training_data:
  - dataset: imagenet-1k
    version: "2012"
    license: custom-academic
training_compute: 8x A100, 72 hours
carbon_footprint: estimated 450 kg CO2
```
**Behavioral testing:** Test models against known-good outputs.
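One lightweight form of this is a golden-output regression test: run a fixed probe set through the model and compare against outputs recorded from a version you trust. A sketch, assuming a PyTorch model; the probe cases and tolerance are illustrative:

```python
import torch

def check_golden_outputs(model, golden_cases, atol=1e-4):
    """Verify the model still reproduces outputs from a trusted run.

    golden_cases: list of (input_tensor, expected_output) pairs recorded
    from a version of the model you have already vetted.
    """
    model.eval()
    with torch.no_grad():
        for inputs, expected in golden_cases:
            actual = model(inputs)
            if not torch.allclose(actual, expected, atol=atol):
                raise AssertionError(
                    f"Output drifted beyond atol={atol}; "
                    "these may not be the weights you vetted"
                )
```

This won't catch a well-crafted backdoor (the trigger inputs aren't in your probe set), but it catches accidental or clumsy tampering cheaply.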
## Dataset Supply Chain

### Dataset Risks
**Privacy leakage:** Training data may contain personal information that models memorize.2

**License violations:** Dataset may include copyrighted material.

**Label poisoning:** Incorrect labels degrade model quality.

**Data drift:** Real-world distribution differs from training data.
### Dataset Best Practices
**Version your datasets:**
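Tools like DVC or lakeFS do this at scale; the core idea is just immutable, content-addressed snapshots. A minimal hand-rolled sketch (the directory layout is illustrative):

```python
import hashlib
import shutil
from pathlib import Path

def snapshot_dataset(src, versions_dir="data/versions"):
    """Copy a dataset file to an immutable, content-addressed location."""
    # For very large files, hash in chunks rather than one read
    digest = hashlib.sha256(Path(src).read_bytes()).hexdigest()
    dest = Path(versions_dir) / digest[:16] / Path(src).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return digest, dest  # record both alongside your experiment metadata
```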
**Document provenance:**
```yaml
# data/MANIFEST.yaml
dataset:
  name: customer-transactions
  version: "2.0.0"
  created: "2024-01-15"
  sources:
    - system: transactions-db
      query: "SELECT * FROM transactions WHERE year >= 2023"
      date_extracted: "2024-01-15"
  preprocessing:
    - "removed PII columns: [email, phone, address]"
    - aggregated to daily level
  validation:
    row_count: 1250000
    date_range: "2023-01-01 to 2023-12-31"
```
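A manifest only helps if something reads it. A sketch that fails fast when loaded data no longer matches the manifest's expectations (keys follow the example above; requires PyYAML):

```python
import yaml  # PyYAML

def validate_against_manifest(df, manifest_path="data/MANIFEST.yaml"):
    """Raise if a dataframe no longer matches its documented shape."""
    with open(manifest_path) as f:
        expected = yaml.safe_load(f)["dataset"]["validation"]
    if len(df) != expected["row_count"]:
        raise ValueError(
            f"Row count {len(df)} != documented {expected['row_count']}"
        )
```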
**Hash datasets:**
```python
import hashlib
import pandas as pd

def hash_dataframe(df):
    """Stable content hash of a dataframe, for pinning dataset versions."""
    return hashlib.sha256(
        pd.util.hash_pandas_object(df).values.tobytes()
    ).hexdigest()
```
## Framework Dependencies

### The Heavy Dependency Problem
ML frameworks have enormous dependency trees:
```bash
# Installing PyTorch can pull in hundreds of packages
pip install torch
# Plus CUDA, cuDNN, NCCL for GPU support
```
### Framework Versioning
```text
# requirements.txt - pin precisely
torch==2.1.0
torchvision==0.16.0
torchaudio==2.1.0
# These versions must be compatible
```
PyTorch/CUDA compatibility matrix:
| PyTorch | CUDA | cuDNN |
|---|---|---|
| 2.1.x | 11.8, 12.1 | 8.x |
| 2.0.x | 11.7, 11.8 | 8.x |
| 1.13.x | 11.6, 11.7 | 8.x |
Version mismatches cause silent failures or cryptic errors.
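Catching a mismatch at process start is far cheaper than debugging one mid-training. A sketch that asserts the runtime matches what you pinned (the expected versions are illustrative):

```python
import torch

EXPECTED_TORCH = "2.1.0"
EXPECTED_CUDA = "12.1"

def check_environment():
    """Fail loudly at startup instead of cryptically mid-training."""
    if not torch.__version__.startswith(EXPECTED_TORCH):
        raise RuntimeError(f"torch {torch.__version__}, expected {EXPECTED_TORCH}")
    # torch.version.cuda is None on CPU-only builds
    if torch.version.cuda != EXPECTED_CUDA:
        raise RuntimeError(f"CUDA {torch.version.cuda}, expected {EXPECTED_CUDA}")
```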
### Native Dependencies
ML often requires native libraries:
```dockerfile
# Dockerfile for ML workloads
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04

# System dependencies
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3-pip \
    libopenblas-dev \
    libomp-dev \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
```
## Reproducibility Challenges

### Sources of Non-Reproducibility
**Hardware differences:** GPU vs CPU, different GPU architectures.
```python
import torch

# Results differ between devices; force deterministic kernels
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)
```
**Non-deterministic operations:** Some operations are non-deterministic by design for performance.
```python
import os
import random

import numpy as np
import torch

# Set seeds everywhere
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Required for complete reproducibility with deterministic algorithms
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
```
**Floating point non-associativity:** (a + b) + c ≠ a + (b + c) in floating point.
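This is easy to demonstrate, and it is why parallel reductions, which sum in nondeterministic order, diverge across runs and devices:

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c, a + (b + c))    # 0.6000000000000001 0.6
```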
### Experiment Tracking
**Track everything:**
```python
import mlflow
import torch

with mlflow.start_run():
    # Log parameters
    mlflow.log_params({
        'learning_rate': 0.001,
        'batch_size': 32,
        'epochs': 100,
        'seed': 42,
    })

    # Log environment
    mlflow.log_artifact('requirements.txt')
    mlflow.log_param('torch_version', torch.__version__)
    mlflow.log_param('cuda_version', torch.version.cuda)

    # Train...

    # Log metrics
    mlflow.log_metrics({
        'accuracy': 0.95,
        'loss': 0.12,
    })

    # Log model
    mlflow.pytorch.log_model(model, 'model')
```
## Security Considerations

### Model Serialization
**Pickle is dangerous:**
```python
# Models are often saved with pickle
torch.save(model, 'model.pt')  # Uses pickle internally

# Pickle can execute arbitrary code on load
# Only load models from trusted sources
```
**Safer alternatives:**
```python
# Save only weights (safer)
torch.save(model.state_dict(), 'weights.pt')

# Load into a known architecture; weights_only=True (PyTorch 1.13+)
# refuses to unpickle arbitrary objects
model = MyModel()
model.load_state_dict(torch.load('weights.pt', weights_only=True))
```
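Another option is the `safetensors` format, which stores raw tensor data and cannot execute code on load. A sketch, assuming `pip install safetensors` and the same known `MyModel` architecture:

```python
from safetensors.torch import save_file, load_file

# Save: plain tensor bytes plus a JSON header, no pickle involved
save_file(model.state_dict(), 'weights.safetensors')

# Load: only tensors are parsed, so a malicious file cannot run code
model = MyModel()
model.load_state_dict(load_file('weights.safetensors'))
```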
### Model Scanning
Some tools scan models for malicious payloads:
```bash
# Fickling - analyze pickle files for code execution
pip install fickling
fickling model.pt

# ModelScan - broader model security scanning
pip install modelscan
modelscan -p model.pt
```
### Inference Security
**Input validation:**
```python
def predict(input_data):
    # Validate input shape
    if input_data.shape != expected_shape:
        raise ValueError(f"Expected shape {expected_shape}, got {input_data.shape}")

    # Validate input range
    if input_data.min() < -1 or input_data.max() > 1:
        raise ValueError("Input must be normalized to [-1, 1]")

    return model(input_data)
```
**Resource limits:**
```python
# Prevent DoS via large inputs
MAX_INPUT_SIZE = 1024 * 1024  # 1 MB

def predict_safe(input_data):
    if len(input_data) > MAX_INPUT_SIZE:
        raise ValueError("Input too large")
    # Process...
```
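A size check doesn't bound compute time: a small input can still be pathologically slow. One approach, sketched here, is a wall-clock budget around inference; note the worker thread isn't actually killed on timeout, so pair this with process-level limits in production:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

_executor = ThreadPoolExecutor(max_workers=4)  # also caps concurrent requests

def predict_with_budget(input_data, timeout_s=5.0):
    future = _executor.submit(predict_safe, input_data)
    try:
        return future.result(timeout=timeout_s)
    except FuturesTimeout:
        raise RuntimeError(f"Inference exceeded {timeout_s}s budget")
```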
## Model Cards and Documentation

### What to Document
```markdown
# Model Card: sentiment-classifier-v1

## Model Description
- **Model type:** DistilBERT fine-tuned for sentiment classification
- **Training data:** IMDb reviews dataset (50k examples)
- **Intended use:** English movie review sentiment analysis

## Training Details
- **Framework:** PyTorch 2.1.0
- **Training compute:** 1x A100 GPU, 4 hours
- **Hyperparameters:** learning_rate=2e-5, batch_size=16, epochs=3

## Evaluation
| Metric | Value |
|--------|-------|
| Accuracy | 92.5% |
| F1 Score | 0.92 |

## Limitations
- English only
- Trained on movie reviews; may not generalize to other domains
- May exhibit biases present in IMDb reviews

## Ethical Considerations
- Should not be used for decisions affecting individuals
- Review sentiment is subjective; model reflects training data biases
```
### Provenance Chain
Document the full lineage:
```yaml
# provenance.yaml
model:
  name: sentiment-classifier-v1
  version: "1.0.0"
  created: "2024-01-15"

base_model:
  name: distilbert-base-uncased
  source: huggingface
  version: "1.0"
  sha256: abc123...

training_data:
  name: imdb-reviews
  version: "1.0"
  source: huggingface/datasets
  sha256: def456...

dependencies:
  torch: "2.1.0"
  transformers: "4.35.0"
  datasets: "2.15.0"

training:
  script: train.py
  script_sha256: ghi789...
  random_seed: 42
  gpu: "NVIDIA A100"
  duration_hours: 4
```
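A provenance file rots unless something re-checks it. A sketch that re-verifies a recorded hash before a model is promoted (file layout follows the example above; requires PyYAML):

```python
import hashlib
import yaml  # PyYAML

def sha256_file(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def verify_provenance(path="provenance.yaml"):
    """Recompute the training script's hash and compare to the record."""
    with open(path) as f:
        prov = yaml.safe_load(f)
    if sha256_file(prov["training"]["script"]) != prov["training"]["script_sha256"]:
        raise RuntimeError("train.py no longer matches its recorded hash")
```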
> **Silent Failures**
>
> ML failures are often silent. A poisoned model still produces outputs, just wrong ones for certain inputs. A biased dataset produces a biased model that seems to work fine until it doesn't. You won't get an error message; you'll get wrong answers that look plausible. Treat ML artifacts with the same skepticism you'd apply to running code from the internet, because that's exactly what they are.
## Quick Reference

### ML Dependency Checklist
- [ ] Framework versions pinned (PyTorch, TensorFlow, etc.)
- [ ] CUDA/cuDNN versions documented
- [ ] Model weights verified with checksums
- [ ] Dataset provenance documented
- [ ] Random seeds set and documented
- [ ] Training script versioned
- [ ] Model card created
- [ ] Experiment tracked (MLflow, W&B, etc.)
### Security Checklist
- [ ] Models from trusted sources only
- [ ] Avoid loading pickled objects from untrusted sources
- [ ] Input validation for inference
- [ ] Resource limits for inference
- [ ] Models scanned for malicious payloads
### Reproducibility Checklist
- [ ] All random seeds set
- [ ] Deterministic mode enabled
- [ ] Hardware requirements documented
- [ ] Exact package versions in requirements.txt
- [ ] Dataset version pinned
- [ ] Training logs preserved