Research Data Security

When your code touches sensitive data—research subjects, medical records, proprietary information—the stakes change. A dependency vulnerability becomes a potential data breach. A misconfigured environment becomes a compliance violation. This chapter covers the practices that protect research data.

Careers End Here

I've seen research projects shut down and careers damaged over data handling mistakes. Not malice—just someone who didn't understand what they were holding. Research data security isn't bureaucracy. It's respect for the people who trusted you with their information.


Data Classification

Not all data needs the same protection. Classification helps you match security controls to sensitivity.

Common Classification Levels

| Level        | Examples                                    | Controls                 |
|--------------|---------------------------------------------|--------------------------|
| Public       | Published results, public datasets          | Basic hygiene            |
| Internal     | Preliminary results, internal communications | Access controls         |
| Confidential | Unpublished research, proprietary methods   | Encryption, audit logging |
| Restricted   | PII, PHI, human subjects data               | Full compliance controls |

Regulatory Considerations

Some data has legal requirements:

HIPAA (PHI): Health data requires specific safeguards, access controls, and audit trails.

FERPA: Student educational records have privacy requirements.

GDPR: EU personal data has consent, access, and deletion requirements.

IRB protocols: Human subjects research has institution-specific requirements.

Export controls: Some research data can't cross borders.

If you're unsure what applies to your data, ask. Your institution has compliance officers for a reason.

Environment Separation

The principle: production data belongs in production environments, not on your laptop.

The Problem

Researchers often:

  • Download real data to local machines for "quick analysis"
  • Copy production databases to development environments
  • Share datasets via email or cloud storage
  • Use real data in demos and presentations

Each of these creates risk:

  • Local machines are less secure than managed infrastructure
  • Development environments have weaker controls
  • Shared copies multiply attack surface
  • Demos expose data to unintended audiences

The Solution: Environment Tiers

Production Environment:

  • Contains real, sensitive data
  • Full security controls
  • Audit logging
  • Access restricted to authorized personnel
  • Used only for production workloads

Staging/Test Environment:

  • Mirrors production configuration
  • Uses synthetic or anonymized data
  • Safe for testing changes
  • Can be accessed by broader team

Development Environment:

  • Individual or team workspaces
  • Uses fake/synthetic data only
  • May have relaxed security
  • Where initial development happens

Data Flow Rules

Real Data → Production only
     │  anonymization
     ▼
Test Data → Staging/Test
     │  synthetic generation
     ▼
Fake Data → Development

Data flows down through anonymization and synthetic generation, never up.

Test Data Strategies

Synthetic Data Generation

Create fake data that has the same statistical properties as real data:

# Using Faker for basic synthetic data
import random

from faker import Faker

fake = Faker()

# Example diagnosis pool; replace with values relevant to your domain
diagnoses = ["hypertension", "type 2 diabetes", "asthma"]

synthetic_patient = {
    "name": fake.name(),
    "dob": fake.date_of_birth(),
    "mrn": fake.uuid4(),
    "diagnosis": random.choice(diagnoses),
}

For more complex needs, libraries like SDV (Synthetic Data Vault) can learn distributions from real data and generate statistically similar synthetic data.
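
A minimal sketch of that workflow, using the SDV 1.x single-table API (class and method names may differ in other versions):

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_df: pd.DataFrame = ...  # real records; this step runs in production only

# Learn column types and distributions from the real data
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real_df)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)

# Sample statistically similar rows that contain no real records
synthetic_df = synthesizer.sample(num_rows=1_000)

Note that fitting the model requires access to the real data, so that step belongs in the production environment; only the synthetic output should flow down to development.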

Anonymization

Remove or obscure identifying information:

De-identification: Remove direct identifiers (names, SSNs, email addresses).

Pseudonymization: Replace identifiers with consistent fake ones.

Aggregation: Use summary statistics instead of individual records.

k-anonymity: Ensure each record matches at least k-1 others on quasi-identifiers.

Differential privacy: Add statistical noise to protect individual records.

For healthcare data, HIPAA defines specific de-identification standards (Safe Harbor and Expert Determination methods).
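
As a concrete example, pseudonymization can be implemented with a keyed hash: the same identifier always maps to the same pseudonym, but the mapping can't be reversed without the key. A minimal sketch (the environment variable name is illustrative):

import hashlib
import hmac
import os

# The key lives outside the code: losing it breaks record linkage,
# leaking it breaks privacy
SECRET_KEY = os.environ["PSEUDONYM_KEY"].encode()

def pseudonymize(identifier: str) -> str:
    """Map an identifier to a stable, non-reversible pseudonym."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

Because the mapping is deterministic, pseudonymized records can still be joined across tables; destroy or rotate the key once that linkage is no longer needed.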

Subsetting

Use a representative sample rather than full dataset:

-- Create a development subset (RANDOM() is PostgreSQL/SQLite; MySQL uses RAND())
CREATE TABLE dev_patients AS
SELECT * FROM prod_patients
WHERE patient_id IN (
  SELECT patient_id FROM prod_patients
  ORDER BY RANDOM()
  LIMIT 1000
);

Combined with anonymization, this gives developers realistic data without full production access.

The Medallion Architecture

For data-intensive research, the medallion (bronze/silver/gold) architecture provides structure:

Bronze Layer (Raw)

  • Ingested data in original format
  • No transformations applied
  • Full audit trail of what arrived
  • May contain sensitive data in original form

Security: Restricted access, encrypted at rest, audit logged.

Silver Layer (Cleaned)

  • Validated and cleaned data
  • Schema enforced
  • Duplicates removed
  • Anonymization applied if needed

Security: Access based on data sensitivity, audit logged.

Gold Layer (Refined)

  • Business/research-ready views
  • Aggregated as needed
  • Optimized for analysis
  • De-identified where appropriate

Security: Broader access for analysis, may be less restricted.

Why This Matters for Security

The medallion architecture creates natural checkpoints for security controls:

  • Bronze → Silver: Apply anonymization transforms
  • Silver → Gold: Apply aggregation, verify de-identification
  • Each layer can have different access controls

This beats the alternative: raw data scattered across notebooks and shared drives.
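
A sketch of what a bronze-to-silver transform might look like with pandas, reusing the keyed-hash helper from the anonymization sketch above (all column names are illustrative):

import hashlib
import hmac
import os

import pandas as pd

SECRET_KEY = os.environ["PSEUDONYM_KEY"].encode()

def pseudonymize(identifier: str) -> str:  # keyed-hash helper from earlier
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

def bronze_to_silver(bronze: pd.DataFrame) -> pd.DataFrame:
    # Enforce schema and drop duplicate records
    silver = bronze.drop_duplicates(subset=["mrn"]).copy()
    silver["age"] = silver["age"].astype("int32")
    # Replace the direct identifier with a stable pseudonym, then drop
    # every direct-identifier column before data leaves the bronze layer
    silver["patient_key"] = silver["mrn"].map(pseudonymize)
    return silver.drop(columns=["name", "mrn", "ssn"])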

Logging and Audit Trails

When working with sensitive data, you need to know: who accessed what, when, and why.

What to Log

  • Access events: Who queried what data, when
  • Data exports: What data left the system, where it went
  • Schema changes: Who modified data structure
  • Permission changes: Who granted or revoked access
  • Authentication events: Login attempts, especially failures

Logging Practices

Log enough to reconstruct events:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "user": "researcher@university.edu",
  "action": "query",
  "resource": "patient_records",
  "row_count": 150,
  "columns_accessed": ["diagnosis", "treatment", "outcome"],
  "ip_address": "192.168.1.50"
}
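
A minimal helper that emits events in this shape using Python's standard logging module (field names follow the example above; adapt them to your schema):

import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("audit")

def log_access(user: str, action: str, resource: str,
               row_count: int, columns: list[str]) -> None:
    """Emit one structured audit event; never include data values."""
    audit_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "resource": resource,
        "row_count": row_count,
        "columns_accessed": columns,
    }))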

Don't log the sensitive data itself:

// BAD
{"action": "query", "result": {"patient_name": "John Doe", "diagnosis": "..."}}

// GOOD
{"action": "query", "row_count": 1, "table": "patients"}

Retain logs appropriately: Long enough for audit purposes and in line with applicable regulations, often six months to seven years depending on the requirement.

Protect log integrity: Logs that can be modified aren't trustworthy. Use append-only storage and keep log permissions separate from production systems.

Database Audit Logging

Most databases support audit logging:

PostgreSQL:

-- Enable statement and connection logging; ALTER SYSTEM writes the
-- setting, and the config must be reloaded for it to take effect
ALTER SYSTEM SET log_statement = 'all';
ALTER SYSTEM SET log_connections = on;
SELECT pg_reload_conf();

MySQL:

-- The audit log plugin is an Enterprise feature; the general query log
-- works in all editions (but is verbose), or use a proxy like ProxySQL
SET GLOBAL general_log = 'ON';

For comprehensive audit logging, consider dedicated tools like pgAudit (PostgreSQL) or database activity monitoring solutions.
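
pgAudit layers structured, class-based audit logging on top of PostgreSQL's native logging. A minimal setup sketch (the extension must first be added to shared_preload_libraries, which requires a server restart):

-- Enable the extension, then choose which statement classes to audit
CREATE EXTENSION pgaudit;
ALTER SYSTEM SET pgaudit.log = 'read, write, ddl, role';
SELECT pg_reload_conf();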

Access Control

Principle of Least Privilege

Users should have the minimum access needed for their role:

| Role          | Bronze     | Silver     | Gold | Notes                 |
|---------------|------------|------------|------|-----------------------|
| Data Engineer | Read/Write | Read/Write | Read | Maintains pipelines   |
| Analyst       | —          | Read       | Read | Analysis only         |
| Researcher    | —          | —          | Read | Uses refined data     |
| Admin         | Full       | Full       | Full | System administration |

Implementation Approaches

Database-level controls:

-- Create a role with read-only access to the gold layer (PostgreSQL syntax)
CREATE ROLE analyst;
GRANT USAGE ON SCHEMA gold_schema TO analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA gold_schema TO analyst;

-- Assign users to roles
GRANT analyst TO "researcher@university.edu";

Application-level controls: When database controls aren't granular enough, implement in your application layer.

Column-level security: Some databases support restricting access to specific columns—useful for keeping identifiers separate from analysis data.

Row-level security: Restrict access based on row contents (e.g., researchers only see their own study's data).
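
PostgreSQL implements this through row-level security policies. A minimal sketch (table and column names are illustrative):

-- Researchers only see rows belonging to their own study
ALTER TABLE study_data ENABLE ROW LEVEL SECURITY;

CREATE POLICY own_study_only ON study_data
    FOR SELECT
    USING (study_owner = current_user);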

Notebook Security

Jupyter notebooks need special attention because they:

  • Save outputs (including data samples)
  • Are easily shared
  • Persist credentials entered in cells
  • Can access production data if connected

Notebook Hygiene

Don't hardcode credentials: Use environment variables or secrets managers.
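
For example, read connection secrets from the environment so they never appear in a cell or in saved outputs:

import os

# Set outside the notebook: shell profile, .env loader, or secrets manager
db_password = os.environ["DB_PASSWORD"]  # illustrative variable name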

Clear outputs before sharing:

jupyter nbconvert --clear-output --inplace notebook.ipynb

Don't display real data samples:

# BAD: Shows real data in output
df.head()

# BETTER: Show shape and dtypes only
print(df.shape)
print(df.dtypes)

Use nbstripout as a pre-commit hook:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1
    hooks:
      - id: nbstripout

JupyterHub Security

For shared JupyterHub environments:

  • Authentication: Integrate with institutional SSO
  • Authorization: Control who can access which resources
  • Resource limits: Prevent runaway computations
  • Audit logging: Track user activity
  • Network isolation: Limit what notebooks can connect to
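
Several of these map directly to settings in jupyterhub_config.py. An illustrative sketch (exact options depend on your authenticator and spawner):

# jupyterhub_config.py
c = get_config()  # provided by JupyterHub when the config file is loaded

c.Authenticator.admin_users = {"hub-admin"}        # who can administer the hub
c.Authenticator.allowed_users = {"alice", "bob"}   # who can log in at all
c.Spawner.mem_limit = "2G"                         # per-user resource limits
c.Spawner.cpu_limit = 2                            # (enforced by container spawners)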

Trust Has to Mean Something

As I said at the start, I've seen research projects shut down over data handling violations. Not because of malicious intent—because someone emailed a spreadsheet, or left a notebook running with production credentials, or demoed with real patient data.

The researchers weren't careless. They were focused on their science. Data security wasn't part of their training. It wasn't in their grant proposal. It wasn't what they were being evaluated on. But it's still their responsibility. And when something goes wrong, "I didn't know" isn't an acceptable answer.

The practices in this chapter aren't about checking compliance boxes. They're about building habits that protect the people whose data you're using. Those research subjects trusted you with their information. That trust has to mean something.

Use synthetic data for development. Keep production data in production. Know who's accessing what. And when you're not sure if something is appropriate, ask. The inconvenience of doing this right is nothing compared to the consequences of doing it wrong.


Quick Reference

Data Environment Checklist

  • Data classified by sensitivity
  • Environments separated (dev/staging/prod)
  • Synthetic data available for development
  • Anonymization applied before leaving production
  • Access controls match data sensitivity
  • Audit logging enabled
  • Notebooks cleared before sharing

Compliance Quick Guide

| Regulation | Applies To           | Key Requirements                          |
|------------|----------------------|-------------------------------------------|
| HIPAA      | Health data (US)     | Access controls, audit trails, encryption |
| FERPA      | Student records (US) | Consent, access controls                  |
| GDPR       | EU personal data     | Consent, access rights, deletion          |
| IRB        | Human subjects       | Institution-specific protocols            |