Research Data Security¶
When your code touches sensitive data—research subjects, medical records, proprietary information—the stakes change. A dependency vulnerability becomes a potential data breach. A misconfigured environment becomes a compliance violation. This chapter covers the practices that protect research data.
Careers End Here
I've seen research projects shut down and careers damaged over data handling mistakes. Not malice—just someone who didn't understand what they were holding. Research data security isn't bureaucracy. It's respect for the people who trusted you with their information.
Data Classification¶
Not all data needs the same protection. Classification helps you match security controls to sensitivity.
Common Classification Levels¶
| Level | Examples | Controls |
|---|---|---|
| Public | Published results, public datasets | Basic hygiene |
| Internal | Preliminary results, internal communications | Access controls |
| Confidential | Unpublished research, proprietary methods | Encryption, audit logging |
| Restricted | PII, PHI, human subjects data | Full compliance controls |
Regulatory Considerations¶
Some data has legal requirements:
HIPAA (PHI): Health data requires specific safeguards, access controls, and audit trails.
FERPA: Student educational records have privacy requirements.
GDPR: EU personal data has consent, access, and deletion requirements.
IRB protocols: Human subjects research has institution-specific requirements.
Export controls: Some research data can't cross borders.
If you're unsure what applies to your data, ask. Your institution has compliance officers for a reason.
Environment Separation¶
The principle: production data belongs in production environments, not on your laptop.
The Problem¶
Researchers often:
- Download real data to local machines for "quick analysis"
- Copy production databases to development environments
- Share datasets via email or cloud storage
- Use real data in demos and presentations
Each of these creates risk:
- Local machines are less secure than managed infrastructure
- Development environments have weaker controls
- Shared copies multiply attack surface
- Demos expose data to unintended audiences
The Solution: Environment Tiers¶
Production Environment:
- Contains real, sensitive data
- Full security controls
- Audit logging
- Access restricted to authorized personnel
- Used only for production workloads
Staging/Test Environment:
- Mirrors production configuration
- Uses synthetic or anonymized data
- Safe for testing changes
- Can be accessed by broader team
Development Environment:
- Individual or team workspaces
- Uses fake/synthetic data only
- May have relaxed security
- Where initial development happens
Data Flow Rules¶
```text
Real Data → Production only
    ↓ Anonymization
Test Data → Staging/Test
    ↓ Synthetic Generation
Fake Data → Development
```
Data flows down through anonymization and synthetic generation, never up.
Test Data Strategies¶
Synthetic Data Generation¶
Create fake data that has the same statistical properties as real data:
```python
# Using Faker for basic synthetic data
import random
from faker import Faker

fake = Faker()
diagnoses = ["E11.9", "I10", "J45.909"]  # example codes, not real records

synthetic_patient = {
    "name": fake.name(),
    "dob": fake.date_of_birth(),
    "mrn": fake.uuid4(),
    "diagnosis": random.choice(diagnoses),
}
```
For more complex needs, libraries like SDV (Synthetic Data Vault) can learn distributions from real data and generate statistically similar synthetic data.
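A rough sketch of that workflow, assuming SDV's single-table API (the input file is a placeholder); note that fitting on real data should happen inside the production environment, with only the synthetic output leaving it:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Hypothetical extract, read inside the production environment
real_df = pd.read_csv("cohort.csv")

# Learn column types and distributions from the real table
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)

# Only this synthetic output leaves production
synthetic_df = synthesizer.sample(num_rows=1000)
```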
Anonymization¶
Remove or obscure identifying information:
De-identification: Remove direct identifiers (names, SSNs, email addresses).
Pseudonymization: Replace identifiers with consistent fake ones.
Aggregation: Use summary statistics instead of individual records.
k-anonymity: Ensure each record matches at least k-1 others on quasi-identifiers.
Differential privacy: Add statistical noise to protect individual records.
For healthcare data, HIPAA defines specific de-identification standards (Safe Harbor and Expert Determination methods).
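For pseudonymization, one common approach is a keyed hash (HMAC), so the same identifier always maps to the same pseudonym but can't be reversed without the key. A minimal sketch with hypothetical field names; the key belongs in a secrets manager, not in code or notebooks:

```python
import hashlib
import hmac
import os

# Provided by the environment or a secrets manager
PSEUDONYM_KEY = os.environ["PSEUDONYM_KEY"].encode()

def pseudonymize(identifier: str) -> str:
    """Map an identifier to a stable, non-reversible pseudonym."""
    return hmac.new(PSEUDONYM_KEY, identifier.encode(), hashlib.sha256).hexdigest()

record = {"mrn": "12345678", "diagnosis": "E11.9"}
record["mrn"] = pseudonymize(record["mrn"])  # same MRN always yields the same pseudonym
```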
Subsetting¶
Use a representative sample rather than the full dataset:
```sql
-- Create a development subset
CREATE TABLE dev_patients AS
SELECT * FROM prod_patients
WHERE patient_id IN (
    SELECT patient_id FROM prod_patients
    ORDER BY RANDOM()
    LIMIT 1000
);
```
Combined with anonymization, this gives developers realistic data without full production access.
The Medallion Architecture¶
For data-intensive research, the medallion (bronze/silver/gold) architecture provides structure:
Bronze Layer (Raw)¶
- Ingested data in original format
- No transformations applied
- Full audit trail of what arrived
- May contain sensitive data in original form
Security: Restricted access, encrypted at rest, audit logged.
Silver Layer (Cleaned)¶
- Validated and cleaned data
- Schema enforced
- Duplicates removed
- Anonymization applied if needed
Security: Access based on data sensitivity, audit logged.
Gold Layer (Refined)¶
- Business/research-ready views
- Aggregated as needed
- Optimized for analysis
- De-identified where appropriate
Security: Broader access for analysis, may be less restricted.
Why This Matters for Security¶
The medallion architecture creates natural checkpoints for security controls:
- Bronze → Silver: Apply anonymization transforms
- Silver → Gold: Apply aggregation, verify de-identification
- Each layer can have different access controls
This beats the alternative: raw data scattered across notebooks and shared drives.
Logging and Audit Trails¶
When working with sensitive data, you need to know: who accessed what, when, and why.
What to Log¶
- Access events: Who queried what data, when
- Data exports: What data left the system, where it went
- Schema changes: Who modified data structure
- Permission changes: Who granted or revoked access
- Authentication events: Login attempts, especially failures
Logging Practices¶
Log enough to reconstruct events:
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "user": "researcher@university.edu",
  "action": "query",
  "resource": "patient_records",
  "row_count": 150,
  "columns_accessed": ["diagnosis", "treatment", "outcome"],
  "ip_address": "192.168.1.50"
}
```
Don't log the sensitive data itself:
```
// BAD
{"action": "query", "result": {"patient_name": "John Doe", "diagnosis": "..."}}

// GOOD
{"action": "query", "row_count": 1, "table": "patients"}
```
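One way to produce records like the "good" example is to emit a structured audit event from the data-access layer rather than logging ad hoc. A minimal sketch using Python's standard logging module; the field names are illustrative, not a standard:

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("audit")
audit_logger.setLevel(logging.INFO)
audit_logger.addHandler(logging.FileHandler("audit.log"))

def log_query(user: str, table: str, row_count: int, columns: list[str]) -> None:
    """Record what was accessed and by whom, never the data itself."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": "query",
        "resource": table,
        "row_count": row_count,
        "columns_accessed": columns,
    }
    audit_logger.info(json.dumps(event))

log_query("researcher@university.edu", "patient_records", 150,
          ["diagnosis", "treatment", "outcome"])
```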
Retain logs appropriately: Long enough for audit purposes and compliant with applicable regulations, often six months to seven years depending on the requirements.
Protect log integrity: Logs that can be modified aren't trustworthy. Use append-only storage and keep log permissions separate from the production systems they cover.
Database Audit Logging¶
Most databases support audit logging:
PostgreSQL:
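Statement-level logging can be switched on through standard server parameters; a minimal sketch (coarse, but a starting point before adopting pgAudit):

```sql
-- Log every statement and connection (postgresql.conf, or ALTER SYSTEM as a superuser)
ALTER SYSTEM SET log_statement = 'all';
ALTER SYSTEM SET log_connections = on;
ALTER SYSTEM SET log_disconnections = on;
SELECT pg_reload_conf();  -- apply without a restart
```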
MySQL:
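The general query log captures every client statement; a minimal sketch (verbose and carries a performance cost, so many deployments prefer a dedicated audit plugin):

```sql
-- Write all client statements to a log file
SET GLOBAL general_log_file = '/var/log/mysql/query.log';
SET GLOBAL general_log = 'ON';
```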
For comprehensive audit logging, consider dedicated tools like pgAudit (PostgreSQL) or database activity monitoring solutions.
Access Control¶
Principle of Least Privilege¶
Users should have the minimum access needed for their role:
| Role | Bronze | Silver | Gold | Notes |
|---|---|---|---|---|
| Data Engineer | Read/Write | Read/Write | Read | Maintains pipelines |
| Analyst | — | Read | Read | Analysis only |
| Researcher | — | — | Read | Uses refined data |
| Admin | Full | Full | Full | System administration |
Implementation Approaches¶
Database-level controls:
```sql
-- Create a role with read-only access to the gold layer (PostgreSQL syntax)
CREATE ROLE analyst;
GRANT USAGE ON SCHEMA gold_schema TO analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA gold_schema TO analyst;

-- Assign users to roles
GRANT analyst TO "researcher@university.edu";
```
Application-level controls: When database controls aren't granular enough, implement in your application layer.
Column-level security: Some databases support restricting access to specific columns—useful for keeping identifiers separate from analysis data.
Row-level security: Restrict access based on row contents (e.g., researchers only see their own study's data).
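As a concrete example of row-level security, PostgreSQL lets you attach a policy to a table so queries return only matching rows; a minimal sketch with hypothetical table and column names:

```sql
-- Researchers see only rows from studies they are assigned to
ALTER TABLE study_results ENABLE ROW LEVEL SECURITY;

CREATE POLICY own_study_only ON study_results
    FOR SELECT
    USING (study_id IN (
        SELECT study_id FROM study_members WHERE member = current_user
    ));
```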
Notebook Security¶
Jupyter notebooks need special attention because they:
- Save outputs (including data samples)
- Are easily shared
- Persist credentials entered in cells
- Can access production data if connected
Notebook Hygiene¶
Don't hardcode credentials: Use environment variables or secrets managers.
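For example, read secrets from the environment (or from your institution's secrets manager) instead of typing them into a cell; the variable and host names below are placeholders:

```python
import os

# Injected by the environment or a secrets manager, never stored in the notebook
db_password = os.environ["RESEARCH_DB_PASSWORD"]
connection_url = f"postgresql://analyst:{db_password}@db.internal:5432/research"
```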
Clear outputs before sharing:
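For example, with nbconvert (the notebook name is a placeholder):

```bash
# Strip all cell outputs in place before committing or sharing
jupyter nbconvert --clear-output --inplace analysis.ipynb
```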
Don't display real data samples:
```python
# BAD: Shows real data in output
df.head()

# BETTER: Show shape and dtypes only
print(df.shape)
print(df.dtypes)
```
Use nbstripout as a pre-commit hook:
```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1
    hooks:
      - id: nbstripout
```
JupyterHub Security¶
For shared JupyterHub environments:
- Authentication: Integrate with institutional SSO
- Authorization: Control who can access which resources
- Resource limits: Prevent runaway computations
- Audit logging: Track user activity
- Network isolation: Limit what notebooks can connect to
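Many of these map to settings in `jupyterhub_config.py`. A rough sketch; the authenticator and spawner classes, and whether resource limits are actually enforced, depend on your deployment:

```python
# jupyterhub_config.py (illustrative; adjust classes to your deployment)
c.JupyterHub.authenticator_class = "oauthenticator.generic.GenericOAuthenticator"
c.Authenticator.allowed_users = {"alice", "bob"}

# Resource limits (enforced by container-based spawners such as DockerSpawner or KubeSpawner)
c.Spawner.mem_limit = "4G"
c.Spawner.cpu_limit = 2.0
```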
Trust Has to Mean Something
I've seen research projects shut down over data handling violations. Not because of malicious intent—because someone emailed a spreadsheet, or left a notebook running with production credentials, or demoed with real patient data.
The researchers weren't careless. They were focused on their science. Data security wasn't part of their training. It wasn't in their grant proposal. It wasn't what they were being evaluated on. But it's still their responsibility. And when something goes wrong, "I didn't know" isn't an acceptable answer.
The practices in this chapter aren't about checking compliance boxes. They're about building habits that protect the people whose data you're using. Those research subjects trusted you with their information. That trust has to mean something.
Use synthetic data for development. Keep production data in production. Know who's accessing what. And when you're not sure if something is appropriate, ask. The inconvenience of doing this right is nothing compared to the consequences of doing it wrong.
Quick Reference¶
Data Environment Checklist¶
- Data classified by sensitivity
- Environments separated (dev/staging/prod)
- Synthetic data available for development
- Anonymization applied before leaving production
- Access controls match data sensitivity
- Audit logging enabled
- Notebooks cleared before sharing
Compliance Quick Guide¶
| Regulation | Applies To | Key Requirements |
|---|---|---|
| HIPAA | Health data (US) | Access controls, audit trails, encryption |
| FERPA | Student records (US) | Consent, access controls |
| GDPR | EU personal data | Consent, access rights, deletion |
| IRB | Human subjects | Institution-specific protocols |