For Researchers¶
Research software has unique requirements: reproducibility for scientific integrity, compliance for sensitive data, and often limited time and resources. This appendix translates supply chain concepts for academic and research contexts.
Different Pressures
Research code has different pressures. You're not a software engineer—you're a domain expert who happens to write code. The timeline is grant-driven, not sprint-driven. "Ship it" means "publish the paper." I get it. But I've also seen papers retracted because the code couldn't be reproduced, and careers damaged by data breaches from scripts that grew beyond their original scope. The investment in getting this right is small compared to those costs.
The Research Software Reality¶
Most research software:
- Starts as a "quick script" and grows unexpectedly
- Is written by domain experts, not software engineers
- Has limited maintenance resources after publication
- Must be reproducible for scientific integrity
- May handle sensitive data (human subjects, medical records)
These constraints don't make supply chain management less important—they make it more important to get right with minimal overhead.
Reproducibility for Publications¶
When you publish research, your code should produce the same results. Always.
Minimum Reproducibility¶
For any published code:
```bash
# Document versions
python --version > VERSION_INFO.txt
pip freeze > requirements-frozen.txt
git log -1 > GIT_INFO.txt
```
Include in your repository:
````markdown
# README.md

## Requirements
- Python 3.11.4
- See requirements-frozen.txt for exact package versions

## Reproducing Results
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements-frozen.txt
python run_analysis.py --seed 42
```

Expected output: results/table1.csv should match published Table 1.
````
Better Reproducibility¶
Add a container:
```dockerfile
# Dockerfile
FROM python:3.11.4-slim
WORKDIR /analysis
COPY requirements-frozen.txt .
RUN pip install --no-cache-dir -r requirements-frozen.txt
COPY . .
# Default: run the analysis
CMD ["python", "run_analysis.py", "--seed", "42"]
```
Now anyone can reproduce your exact environment.
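A collaborator with Docker installed can rebuild and rerun everything from the repository alone. A minimal sketch (the image tag and the results path are illustrative):

```bash
# Build the image from the repository root
docker build -t my-analysis .

# Run the analysis, mounting a host directory so the outputs persist
docker run --rm -v "$(pwd)/results:/analysis/results" my-analysis
```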
Best Reproducibility¶
For high-stakes research:
```yaml
# .github/workflows/reproducibility.yml
name: Verify Reproducibility
on: [push]
jobs:
  reproduce:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build container
        run: docker build -t analysis .
      - name: Run analysis
        run: docker run analysis > output.txt
      - name: Verify results
        run: diff output.txt expected_output.txt
```
Automated verification that results don't drift.
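The expected_output.txt baseline has to exist before this check can pass. One approach, assuming the same image name as the workflow above, is to generate it once from a run you trust and commit it:

```bash
# Create the baseline from a trusted local run, then commit it
docker build -t analysis .
docker run analysis > expected_output.txt
git add expected_output.txt
git commit -m "Add reproducibility baseline"
```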
Sensitive Data Compliance¶
Research often involves data with compliance requirements.
Common Frameworks¶
| Framework | Applies To | Key Requirements |
|---|---|---|
| HIPAA | US health data | Access controls, encryption, audit logs |
| FERPA | US education records | Student consent, access restrictions |
| GDPR | EU personal data | Consent, data minimization, right to erasure |
| IRB/Ethics | Human subjects | Approval before data collection |
Data Handling Principles¶
Minimize collection:
```python
# Bad: Collect everything, figure it out later
df = pd.read_sql("SELECT * FROM patients", conn)

# Good: Collect only what you need
df = pd.read_sql("""
    SELECT patient_id, age_group, diagnosis_code
    FROM patients
    WHERE consent_research = TRUE
""", conn)
```
Separate identifiers:
```
data/
├── identifiers/             # Restricted access
│   └── patient_keys.csv     # Links IDs to identities
└── analysis/                # For analysis
    └── deidentified.csv     # No direct identifiers
```
Use the medallion architecture:
- Bronze (raw): Original data with identifiers
- Silver (cleaned): Validated, de-identified
- Gold (analysis): Aggregated, ready for research
Each layer has different access controls. Analysis happens on the Gold layer only.
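As a rough sketch of the Bronze-to-Silver step, assuming pandas and illustrative file paths and column names:

```python
import pandas as pd

# Bronze: raw export that still contains direct identifiers
bronze = pd.read_csv("data/bronze/patients_raw.csv")

# Silver: drop direct identifiers, keep only coarse attributes
silver = bronze.drop(columns=["name", "ssn", "address"])
silver["age_group"] = pd.cut(
    bronze["age"],
    bins=[0, 18, 40, 65, 120],
    labels=["0-17", "18-39", "40-64", "65+"],
)
silver = silver.drop(columns=["age"])
silver.to_csv("data/silver/deidentified.csv", index=False)
```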
Dependency Considerations¶
For compliance-sensitive work:
Audit your dependencies:
```bash
# What are you actually running?
pip list
pip show package-name  # Check where it came from

# Any known vulnerabilities?
pip-audit
```
Consider what dependencies access:
```python
# This package might phone home
# Check before using with sensitive data
import some_analytics_package  # Does it send telemetry?
```
Air-gapped environments: When data can't leave a secure environment, you may need:
- Private package mirrors
- Vendored dependencies
- Pre-approved package lists
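If your institution runs a private PyPI mirror, pip can be pointed at it explicitly (the URL below is a placeholder):

```bash
# Install only from the internal mirror, never from public PyPI
pip install --index-url https://pypi.internal.example.edu/simple \
    -r requirements.txt
```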
The "Quick Script" Problem¶
Scripts have a way of becoming critical infrastructure.
AI-assisted coding makes this problem worse. You can generate scripts faster than ever—and with even less scrutiny. The AI produces working code instantly; you run it; it works; you move on. A week later, three colleagues are using it. A month later, it's running in production. You never reviewed what the AI actually wrote, what dependencies it added, or whether those dependencies are maintained. See Vibe Coding for more on managing AI-generated code.
Signs Your Script Is Growing Up¶
- Multiple people use it
- It runs in production
- Results depend on it
- You can't remember how it works
- You used AI to write it and never reviewed the full output
Minimal Hardening¶
When a script starts to matter:
Add requirements.txt:
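One way to capture what the script currently uses is a frozen snapshot of its environment (you may prefer a hand-curated list of direct dependencies):

```bash
# Snapshot the packages installed in the script's environment
pip freeze > requirements.txt
```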
Add basic error handling:
```python
import logging
import sys

logger = logging.getLogger(__name__)

def main():
    try:
        run_analysis()
    except Exception as e:
        logger.error(f"Analysis failed: {e}")
        sys.exit(1)
```
Add a README:
```markdown
# Analysis Script

## Usage
python analyze.py input.csv output.csv

## Requirements
pip install -r requirements.txt

## Notes
- Expects input CSV with columns: id, value, date
- Output includes statistical summary
```
Version control:
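Even a single commit is better than no history; a minimal start (file names are illustrative):

```bash
git init
git add analyze.py requirements.txt README.md
git commit -m "Initial working version of analysis script"
```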
When to Invest More¶
Invest in proper engineering when:
- Multiple papers will depend on this code
- Other researchers will use it
- It processes sensitive data
- Results affect decisions (clinical, policy)
See When Scripts Become Software for the full discussion.
Managing Shared Resources¶
Research computing often means shared infrastructure.
HPC Clusters¶
```bash
# Don't install globally
pip install --user package-name  # Goes to ~/.local

# Better: Use virtual environments
python -m venv ~/venvs/my-project
source ~/venvs/my-project/bin/activate
pip install -r requirements.txt
```
JupyterHub¶
```python
# Each notebook should specify its environment
# (requirements.txt or conda environment.yml)

# In the notebook:
import sys
print(sys.executable)  # Verify which Python
print(sys.version)     # Verify version
```
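If the hub is conda-based, a minimal environment.yml might look like this (package names and versions are illustrative):

```yaml
# environment.yml
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pandas=2.1
  - pip
  - pip:
      - pip-audit
```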
Module Systems¶
```bash
# On HPC with module system
module load python/3.11
module load cuda/12.1

# Document in README
# Required modules: python/3.11, cuda/12.1
```
Archiving for Long-Term Reproducibility¶
Research should be reproducible years later.
What to Archive¶
| Item | Where | Duration |
|---|---|---|
| Code | GitHub + Zenodo | Indefinite |
| Data | Institutional repository | Per policy |
| Environment | Container registry / Zenodo | Indefinite |
| Documentation | With code | Indefinite |
Zenodo Integration¶
```json
// .zenodo.json
{
  "title": "Code for 'My Research Paper'",
  "upload_type": "software",
  "creators": [
    {"name": "Your Name", "orcid": "0000-0000-0000-0000"}
  ],
  "license": "MIT",
  "related_identifiers": [
    {
      "identifier": "10.1234/journal.1234",
      "relation": "isSupplementTo",
      "scheme": "doi"
    }
  ]
}
```
GitHub releases can automatically archive to Zenodo with DOIs.
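With Zenodo's GitHub integration enabled for the repository, it is the release that triggers the archive; tagging the exact commit the paper's results came from keeps the DOI tied to the right code (the tag name is illustrative):

```bash
# Tag the commit the published results were produced from
git tag -a v1.0 -m "Code as used in the published paper"
git push origin v1.0
# Creating a GitHub release from this tag triggers the Zenodo archive
```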
Container Archiving¶
```bash
# Save container image
docker save my-analysis:v1.0 | gzip > my-analysis-v1.0.tar.gz

# Include with archived code
# Can be loaded years later:
# docker load < my-analysis-v1.0.tar.gz
```
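Recording a checksum next to the archived image lets you, or a reader years later, verify it hasn't been corrupted or tampered with before loading it:

```bash
# Record a checksum alongside the archive
sha256sum my-analysis-v1.0.tar.gz > my-analysis-v1.0.tar.gz.sha256

# Verify later, before docker load
sha256sum -c my-analysis-v1.0.tar.gz.sha256
```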
Collaboration Practices¶
Lab/Group Standards¶
Establish minimal standards:
```markdown
# Lab Code Standards

## Required for All Projects
- [ ] README with setup instructions
- [ ] requirements.txt or environment.yml
- [ ] Version control (git)
- [ ] Basic documentation of what code does

## Required for Publications
- [ ] Frozen dependencies (exact versions)
- [ ] Dockerfile or container
- [ ] Script to reproduce figures/tables
- [ ] Archive to Zenodo

## Required for Sensitive Data
- [ ] IRB/ethics approval documented
- [ ] Data access controls documented
- [ ] No sensitive data in repository
- [ ] Compliance review completed
```
Code Review for Research¶
Even informal review helps:
```markdown
## Lab Code Review Checklist
- [ ] Code runs without errors
- [ ] README is accurate
- [ ] Dependencies are specified
- [ ] No hardcoded paths (or documented)
- [ ] No credentials in code
- [ ] Results are reproducible
```
Grant and Publication Requirements¶
Data Management Plans¶
Many funders require data management plans:
```markdown
## Software Management
Code developed under this project will be:
- Version controlled using git
- Archived to Zenodo upon publication
- Licensed under MIT license
- Documented with README and inline comments

Dependencies will be:
- Managed using pip/conda with lock files
- Scanned for vulnerabilities before deployment
- Documented in requirements files
```
Journal Requirements¶
Some journals require:
- Code availability statements
- Data availability statements
- Environment specifications
- Container images
Prepare these from the start, not as an afterthought.
Minimum Viable Rigor
The key is finding the minimum viable rigor for your situation. A script that runs once for a class project needs less than code underlying a clinical trial. But even the simplest analysis benefits from basic version control and documented dependencies. Start simple: requirements.txt, git, a README. Add more as your code becomes more important. But start.
Quick Reference¶
Minimum Reproducibility Kit¶
```
project/
├── README.md           # What this does, how to run
├── requirements.txt    # pip freeze output
├── run_analysis.py     # Main script
└── results/            # Expected outputs
```
Sensitive Data Checklist¶
- IRB/ethics approval obtained
- Data minimization applied
- Identifiers separated from analysis data
- Access controls documented
- No sensitive data in version control
- Dependencies audited for data handling
Publication Checklist¶
- Code archived (Zenodo, institutional repo)
- DOI obtained for code
- Exact dependency versions frozen
- Container available (Dockerfile or image)
- Reproduction instructions tested
- Data availability documented