FLAGSHIP CASE STUDY · 2026

Rebuilding Tokenization on MWAA V3

A ground-up rewrite of the tokenization pipeline that powers RedSail's $9M data monetization platform. Migrated from Airflow V1 to MWAA V3 on AWS, reorganized the DAG into parallel client chains, and pre-hashed archive joins to cut a 35-37 minute wall-clock baseline.

ROLELead architect & engineer

SCALE6B records, 8+ enterprise clients

STACKMWAA V3, Snowflake, Datavant, Terraform, Python

The problem

The V1 pipeline ran clients sequentially on aging Airflow infrastructure. Each new client meant longer wall-clock and more contention for the same Snowflake warehouse. The platform had grown to 6 billion records across pharma, life sciences, and real-world evidence buyers — the orchestration layer was the bottleneck, and Amazon was sunsetting MWAA V1.

Sequential client processing capped throughput regardless of warehouse size.
Archive joins re-hashed PII on every run — expensive and wasteful at 6B-record scale.
MWAA V1 deprecation forced a migration window with concrete deadlines.
Tokenize step had safety gaps that created drift between Snowflake and Datavant outputs.

Architecture

SOURCE transaction_log ~6B rows · daily

→

DE-ID TRANSACTION_DEID expert determination

→

TOKENIZE Datavant V2 pre-hashed archive joins

FAN-OUT: parallel client chains (8+)

pharma analytics life sciences RWE platform claims aggregator market analytics commercial analytics research consortium payer analytics

DELIVERY S3 + Azure per-client SLAs · manifests · audit trail

What I built

Parallel client chains

Restructured the DAG so each client runs as an independent chain off the shared tokenize step. Per-client SLAs, schedules, and failure isolation. One client's outage no longer blocks the others.

Pre-hashed archive joins

Moved hashing upstream of Snowflake COPY (which blocks SHA2/MD5 in load transforms). Archive joins now reference cached hashes instead of recomputing — the single biggest perf win in the rebuild.

MWAA V1 to V3 migration

Reconciled IAM trust policies across V1 and V3 executor ARNs — clients trusting V1 don't auto-trust V3. Inline DAG definitions over assets/factories, which silently failed on V3.

Tokenize safety fixes

Closed drift gaps between Snowflake state and Datavant outputs. Idempotent retries, manifest validation, and explicit failure modes that page rather than silently complete.

Multi-tenant feed orchestration

Per-client schedules, lag awareness (Veeva and Millenium deliver yesterday's date, CPESN runs Mondays only), and consistent transfer DAG conventions across the fleet.

Terraform-managed infrastructure

Snowflake roles, warehouses, and tables; AWS S3 buckets, IAM, MWAA environments. Column ordering conventions to avoid cascading drift on plan logs.

Results

35-37→~41 min Daily wall-clock baseline before perf-tuning round; current XL-warehouse run targets sub-45 min total.

8+ clients Each running as an independent, isolated chain — one slow client no longer blocks the others.

Zero drift Tokenize safety fixes eliminated reconciliation gaps between Snowflake state and Datavant outputs.

V3 in prod Tokenization V2 plus six paused transfer DAGs deployed on MWAA V3, ahead of V1 deprecation.

What I owned

Architecture decisions: parallelization model, where hashing lives, MWAA V3 patterns.
Cross-team coordination: IAM trust policy negotiations with infra; client-side onboarding for migrated feeds.
Performance tuning: warehouse-size validation (XL vs L baselines), clustering follow-ups on TRANSACTION_DEID.
Roadmap: historical tokenization backfill DAG (next), run_pipeline_log + Sigma BI reporting (in flight with the analytics director).