For Researchers¶
Research software has unique requirements: reproducibility for scientific integrity, compliance for sensitive data, and often limited time and resources. This appendix translates supply chain concepts for academic and research contexts.
Different Pressures
Research code has different pressures. You're not a software engineer—you're a domain expert who happens to write code. The timeline is grant-driven, not sprint-driven. "Ship it" means "publish the paper." I get it. But I've also seen papers retracted because the code couldn't be reproduced, and careers damaged by data breaches from scripts that grew beyond their original scope. The investment in getting this right is small compared to those costs.
The Research Software Reality¶
Most research software:
- Starts as a "quick script" and grows unexpectedly
- Is written by domain experts, not software engineers
- Has limited maintenance resources after publication
- Must be reproducible for scientific integrity
- May handle sensitive data (human subjects, medical records)
These constraints don't make supply chain management less important—they make it more important to get right with minimal overhead.
Reproducibility for Publications¶
When you publish research, your code should produce the same results. Always.
Minimum Reproducibility¶
For any published code:
```bash
# Document versions
python --version > VERSION_INFO.txt
pip freeze > requirements-frozen.txt
git log -1 > GIT_INFO.txt
```
Include in your repository:
````markdown
# README.md

## Requirements
- Python 3.11.4
- See requirements-frozen.txt for exact package versions

## Reproducing Results
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements-frozen.txt
python run_analysis.py --seed 42
```

Expected output: results/table1.csv should match published Table 1.
````
Better Reproducibility¶
Add a container:
```dockerfile
# Dockerfile
FROM python:3.11.4-slim
WORKDIR /analysis
COPY requirements-frozen.txt .
RUN pip install --no-cache-dir -r requirements-frozen.txt
COPY . .
# Default: run the analysis
CMD ["python", "run_analysis.py", "--seed", "42"]
```
Now anyone can reproduce your exact environment.
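A collaborator with Docker installed can rebuild and rerun everything from the repository alone. A minimal sketch (the image tag and the results path are illustrative):

```bash
# Build the image from the repository root
docker build -t my-analysis .

# Run the analysis, mounting a host directory so the outputs persist
docker run --rm -v "$(pwd)/results:/analysis/results" my-analysis
```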
Best Reproducibility¶
For high-stakes research:
```yaml
# .github/workflows/reproducibility.yml
name: Verify Reproducibility
on: [push]
jobs:
  reproduce:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build container
        run: docker build -t analysis .
      - name: Run analysis
        run: docker run analysis > output.txt
      - name: Verify results
        run: diff output.txt expected_output.txt
```
Automated verification that results don't drift.
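The expected_output.txt baseline has to exist before this check can pass. One approach, assuming the same image name as the workflow above, is to generate it once from a run you trust and commit it:

```bash
# Create the baseline from a trusted local run, then commit it
docker build -t analysis .
docker run analysis > expected_output.txt
git add expected_output.txt
git commit -m "Add reproducibility baseline"
```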
Sensitive Data Compliance¶
Research often involves data with compliance requirements.
Common Frameworks¶
| Framework | Applies To | Key Requirements |
|---|---|---|
| HIPAA | US health data | Access controls, encryption, audit logs |
| FERPA | US education records | Student consent, access restrictions |
| GDPR | EU personal data | Consent, data minimization, right to erasure |
| IRB/Ethics | Human subjects | Approval before data collection |
Data Handling Principles¶
Minimize collection:
```python
# Bad: Collect everything, figure it out later
df = pd.read_sql("SELECT * FROM patients", conn)

# Good: Collect only what you need
df = pd.read_sql("""
    SELECT patient_id, age_group, diagnosis_code
    FROM patients
    WHERE consent_research = TRUE
""", conn)
```
Separate identifiers:
```
data/
├── identifiers/             # Restricted access
│   └── patient_keys.csv     # Links IDs to identities
└── analysis/                # For analysis
    └── deidentified.csv     # No direct identifiers
```
Use the medallion architecture:
- Bronze (raw): Original data with identifiers
- Silver (cleaned): Validated, de-identified
- Gold (analysis): Aggregated, ready for research
Each layer has different access controls. Analysis happens on the Gold layer only.
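As a rough sketch of the Bronze-to-Silver step, assuming pandas and illustrative file paths and column names:

```python
import pandas as pd

# Bronze: raw export that still contains direct identifiers
bronze = pd.read_csv("data/bronze/patients_raw.csv")

# Silver: drop direct identifiers, keep only coarse attributes
silver = bronze.drop(columns=["name", "ssn", "address"])
silver["age_group"] = pd.cut(
    bronze["age"],
    bins=[0, 18, 40, 65, 120],
    labels=["0-17", "18-39", "40-64", "65+"],
)
silver = silver.drop(columns=["age"])
silver.to_csv("data/silver/deidentified.csv", index=False)
```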
Dependency Considerations¶
For compliance-sensitive work:
Audit your dependencies:
```bash
# What are you actually running?
pip list
pip show package-name  # Check where it came from

# Any known vulnerabilities?
pip-audit
```
Consider what dependencies access:
```python
# This package might phone home
# Check before using with sensitive data
import some_analytics_package  # Does it send telemetry?
```
Air-gapped environments: When data can't leave a secure environment, you may need:
- Private package mirrors
- Vendored dependencies
- Pre-approved package lists
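If your institution runs a private PyPI mirror, pip can be pointed at it explicitly (the URL below is a placeholder):

```bash
# Install only from the internal mirror, never from public PyPI
pip install --index-url https://pypi.internal.example.edu/simple \
    -r requirements.txt
```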
The "Quick Script" Problem¶
Scripts have a way of becoming critical infrastructure.
AI-assisted coding makes this problem worse. You can generate scripts faster than ever—and with even less scrutiny. The AI produces working code instantly; you run it; it works; you move on. A week later, three colleagues are using it. A month later, it's running in production. You never reviewed what the AI actually wrote, what dependencies it added, or whether those dependencies are maintained. See Vibe Coding for more on managing AI-generated code.
Signs Your Script Is Growing Up¶
- Multiple people use it
- It runs in production
- Results depend on it
- You can't remember how it works
- You used AI to write it and never reviewed the full output
Minimal Hardening¶
When a script starts to matter:
Add requirements.txt:
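One way to capture what the script currently uses is a frozen snapshot of its environment (you may prefer a hand-curated list of direct dependencies):

```bash
# Snapshot the packages installed in the script's environment
pip freeze > requirements.txt
```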
Add basic error handling:
```python
import logging
import sys

logger = logging.getLogger(__name__)

def main():
    try:
        run_analysis()
    except Exception as e:
        logger.error(f"Analysis failed: {e}")
        sys.exit(1)
```
Add a README:
```markdown
# Analysis Script

## Usage
python analyze.py input.csv output.csv

## Requirements
pip install -r requirements.txt

## Notes
- Expects input CSV with columns: id, value, date
- Output includes statistical summary
```
Version control:
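Even a single commit is better than no history; a minimal start (file names are illustrative):

```bash
git init
git add analyze.py requirements.txt README.md
git commit -m "Initial working version of analysis script"
```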
When to Invest More¶
Invest in proper engineering when:
- Multiple papers will depend on this code
- Other researchers will use it
- It processes sensitive data
- Results affect decisions (clinical, policy)
See When Scripts Become Software for the full discussion.
Managing Shared Resources¶
Research computing often means shared infrastructure.
HPC Clusters¶
```bash
# Don't install globally
pip install --user package-name  # Goes to ~/.local

# Better: Use virtual environments
python -m venv ~/venvs/my-project
source ~/venvs/my-project/bin/activate
pip install -r requirements.txt
```
JupyterHub¶
```python
# Each notebook should specify its environment
# (requirements.txt or conda environment.yml)

# In the notebook:
import sys
print(sys.executable)  # Verify which Python
print(sys.version)     # Verify version
```
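If the hub is conda-based, a minimal environment.yml might look like this (package names and versions are illustrative):

```yaml
# environment.yml
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pandas=2.1
  - pip
  - pip:
      - pip-audit
```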
Module Systems¶
```bash
# On HPC with module system
module load python/3.11
module load cuda/12.1

# Document in README
# Required modules: python/3.11, cuda/12.1
```
Archiving for Long-Term Reproducibility¶
Research should be reproducible years later.
What to Archive¶
| Item | Where | Duration |
|---|---|---|
| Code | GitHub + Zenodo | Indefinite |
| Data | Institutional repository | Per policy |
| Environment | Container registry / Zenodo | Indefinite |
| Documentation | With code | Indefinite |
Zenodo Integration¶
```json
// .zenodo.json
{
  "title": "Code for 'My Research Paper'",
  "upload_type": "software",
  "creators": [
    {"name": "Your Name", "orcid": "0000-0000-0000-0000"}
  ],
  "license": "MIT",
  "related_identifiers": [
    {
      "identifier": "10.1234/journal.1234",
      "relation": "isSupplementTo",
      "scheme": "doi"
    }
  ]
}
```
GitHub releases can automatically archive to Zenodo with DOIs.
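With Zenodo's GitHub integration enabled for the repository, it is the release that triggers the archive; tagging the exact commit the paper's results came from keeps the DOI tied to the right code (the tag name is illustrative):

```bash
# Tag the commit the published results were produced from
git tag -a v1.0 -m "Code as used in the published paper"
git push origin v1.0
# Creating a GitHub release from this tag triggers the Zenodo archive
```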
Container Archiving¶
```bash
# Save container image
docker save my-analysis:v1.0 | gzip > my-analysis-v1.0.tar.gz

# Include with archived code
# Can be loaded years later:
# docker load < my-analysis-v1.0.tar.gz
```
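Recording a checksum next to the archived image lets you, or a reader years later, verify it hasn't been corrupted or tampered with before loading it:

```bash
# Record a checksum alongside the archive
sha256sum my-analysis-v1.0.tar.gz > my-analysis-v1.0.tar.gz.sha256

# Verify later, before docker load
sha256sum -c my-analysis-v1.0.tar.gz.sha256
```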
Collaboration Practices¶
Lab/Group Standards¶
Establish minimal standards:
```markdown
# Lab Code Standards

## Required for All Projects
- [ ] README with setup instructions
- [ ] requirements.txt or environment.yml
- [ ] Version control (git)
- [ ] Basic documentation of what code does

## Required for Publications
- [ ] Frozen dependencies (exact versions)
- [ ] Dockerfile or container
- [ ] Script to reproduce figures/tables
- [ ] Archive to Zenodo

## Required for Sensitive Data
- [ ] IRB/ethics approval documented
- [ ] Data access controls documented
- [ ] No sensitive data in repository
- [ ] Compliance review completed
```
Code Review for Research¶
Even informal review helps:
```markdown
## Lab Code Review Checklist
- [ ] Code runs without errors
- [ ] README is accurate
- [ ] Dependencies are specified
- [ ] No hardcoded paths (or documented)
- [ ] No credentials in code
- [ ] Results are reproducible
```
Grant and Publication Requirements¶
Data Management Plans¶
Many funders require data management plans:
```markdown
## Software Management
Code developed under this project will be:
- Version controlled using git
- Archived to Zenodo upon publication
- Licensed under MIT license
- Documented with README and inline comments

Dependencies will be:
- Managed using pip/conda with lock files
- Scanned for vulnerabilities before deployment
- Documented in requirements files
```
Journal Requirements¶
Some journals require:
- Code availability statements
- Data availability statements
- Environment specifications
- Container images
Prepare these from the start, not as an afterthought.
Minimum Viable Rigor
The key is finding the minimum viable rigor for your situation. A script that runs once for a class project needs less than code underlying a clinical trial. But even the simplest analysis benefits from basic version control and documented dependencies. Start simple: requirements.txt, git, a README. Add more as your code becomes more important. But start.
Quick Reference¶
Minimum Reproducibility Kit¶
```
project/
├── README.md           # What this does, how to run
├── requirements.txt    # pip freeze output
├── run_analysis.py     # Main script
└── results/            # Expected outputs
```
Sensitive Data Checklist¶
- IRB/ethics approval obtained
- Data minimization applied
- Identifiers separated from analysis data
- Access controls documented
- No sensitive data in version control
- Dependencies audited for data handling
Publication Checklist¶
- Code archived (Zenodo, institutional repo)
- DOI obtained for code
- Exact dependency versions frozen
- Container available (Dockerfile or image)
- Reproduction instructions tested
- Data availability documented