Identifying Hidden Cardiovascular Risk at Scale

Predictive analytics / ML model

ML-Powered Lp(a) Screening Model

INDUSTRY

Healthcare Technology

SERVICE

Predictive Model

DELIVERY MODEL

Outsource

SUMMARY

Enable a new clinical capability: identify patients with likely elevated Lp(a) who would otherwise go undiagnosed, and support the client's broader mission of detection and cardiovascular risk reduction. Secondary goals include reducing unnecessary testing costs through precision patient selection and building a productionizable pipeline for multi-site deployment.

What does the system do?

Waverley developed a machine learning model that predicts which patients are likely to have elevated lipoprotein(a), a genetic risk factor for cardiovascular disease, from their electronic health records (EHR), without requiring a direct Lp(a) lab test. The model ingests structured patient data (demographics, lab results, diagnoses, medications, procedures, vitals) from healthcare systems and outputs a risk score, enabling clinicians to prioritize Lp(a) testing for high-risk patients.

The system targets clinicians and researchers seeking to identify undertested patients and is intended for deployment across partner healthcare systems.

ABOUT THE CLIENT

Our client is a patient-driven nonprofit dedicated to saving generations of families from preventable heart disease through early identification and improved care of two underdiagnosed inherited lipid disorders — familial hypercholesterolemia and elevated lipoprotein(a), or Lp(a) — both of which significantly increase the risk of premature heart attacks and strokes across all ages, races, and ethnicities. Through pioneering research, advocacy, and education, the client aligns patients, clinicians, researchers, and policymakers around solutions that reduce the burden of cardiovascular disease in affected families.

Project Analysis & Challenges

The core challenge: Replicating a complex clinical model

Tasked with replicating a complex clinical model, the team overcame ambiguous domain logic and extreme feature dimensionality by deriving cohort strategies empirically from a feature space of more than 200,000 codes. We resolved significant class imbalance and reverse-engineered undocumented legacy imputation steps to ensure model reproducibility and accuracy. Despite multi-site data heterogeneity and GCP infrastructure instability, we standardized disparate schemas and stabilized workflows to deliver a high-performing, scalable solution.

Empirical Discovery in the Absence of Domain Expertise

Navigating the lack of initial SME guidance by shifting to a data-first approach, where the team independently derived cohort strategies and clinical markers through rigorous empirical analysis

Culling the Curse of Dimensionality

Managing an extreme feature space of over 200,000 variables by implementing sophisticated feature selection techniques to distill a massive library of ICD, CPT, and NDC codes into a high-signal model
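As an illustration of how a large code vocabulary can be turned into features and then thinned out, here is a minimal sketch in plain Python. It is not the project's actual feature generator or selection method: the prevalence filter is a simple stand-in technique, and the function names, the `SYSTEM:code` key format, and the sample events are all hypothetical.

```python
from collections import Counter


def build_code_features(events):
    """Map patient_id -> Counter of 'SYSTEM:code' occurrence counts."""
    features = {}
    for patient_id, system, code in events:
        features.setdefault(patient_id, Counter())[f"{system}:{code}"] += 1
    return features


def select_by_prevalence(features, min_patients):
    """Keep codes that appear for at least `min_patients` distinct patients."""
    patient_counts = Counter()
    for counts in features.values():
        # Each patient contributes at most 1 per code, regardless of repeats.
        patient_counts.update(counts.keys())
    return sorted(c for c, n in patient_counts.items() if n >= min_patients)


# Synthetic, illustrative events: (patient_id, coding system, code).
events = [
    ("p1", "ICD", "E78.5"), ("p1", "NDC", "0000-0000"),
    ("p2", "ICD", "E78.5"), ("p2", "CPT", "80061"),
    ("p3", "ICD", "I25.10"), ("p3", "CPT", "80061"), ("p3", "CPT", "80061"),
]

features = build_code_features(events)
kept = select_by_prevalence(features, min_patients=2)
print(kept)
```

Storing only the codes a patient actually has keeps the representation sparse, which matters when the full vocabulary runs to hundreds of thousands of columns.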

Optimizing for Rare Event Detection

Addressing severe class imbalance through the application of stratified sampling and precision-recall threshold tuning to ensure the model remained sensitive to underrepresented positive outcomes
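Precision-recall threshold tuning can be sketched as follows. This is a plain-Python illustration under assumptions, not the project's code: the selection rule (highest precision subject to a minimum recall), the function names, and the toy labels and scores are hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(y_true, scores, min_recall):
    """Return (threshold, precision, recall) with the highest precision
    among thresholds whose recall is at least `min_recall`, or None."""
    best = None
    for t in sorted(set(scores)):
        precision, recall = precision_recall_at(y_true, scores, t)
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best


# Synthetic, imbalanced toy data: 3 positives out of 10 patients.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.70, 0.90]

best = pick_threshold(y_true, scores, min_recall=0.66)
print(best)
```

With rare positives, accuracy is dominated by the negative class, so choosing the operating point from precision and recall keeps the focus on the patients the model is meant to surface.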

Reverse-Engineering the Legacy "Black Box"

Overcoming the hurdles of undocumented data imputation in the original model to facilitate accurate reproduction and create a meaningful baseline for performance comparison

Harmonizing Multi-Site Data Heterogeneity

Architecting a path through fragmented data landscapes, standardizing disparate schemas and resolving unexpected value discrepancies across multiple clinical sites

Mitigating Cloud Infrastructure Instability

Ensuring development continuity and data integrity despite GCP environment friction, specifically resolving platform-level file transfer and inter-VM communication failures

Technical Work Delivered

Optimize diagnostic costs

Drive cardiovascular risk reduction by identifying undiagnosed patients with elevated Lp(a) through precision screening. This initiative optimizes diagnostic costs via targeted patient selection while establishing a scalable, production-grade pipeline for multi-site deployment.

Phase 1

Full ML development cycle on the client’s data

  1. Data exploration and key identification across clinical tables (labs, diagnosis, medication, procedures, vitals)

  2. Definition of elevated Lp(a) thresholds (125, 150, 200 nmol/L) as model labels

  3. Feature generation pipeline producing 200k+ dimensional feature space from ICD/CPT/NDC codes

  4. Class imbalance analysis and stratified train/validation/holdout splits

  5. Binary classification model training (LightGBM), evaluation (confusion matrix, F-score, MCC), and selection of best-performing thresholds

  6. Deliverables: data dictionary, executable Jupyter runbook, fully documented code, QA validation report, unit test coverage
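Two of the steps above, deriving binary labels at each Lp(a) threshold and scoring a classifier with F-score and MCC, can be sketched in plain Python. This is an illustration rather than the delivered code: the label names, helper functions, and toy predictions are hypothetical, and labels are assumed to be positive when the measured value strictly exceeds the threshold.

```python
import math

THRESHOLDS_NMOL_L = (125, 150, 200)


def make_labels(lpa_nmol_l):
    """One binary label per threshold: 1 if measured Lp(a) exceeds it."""
    return {f"lpa_gt_{t}": int(lpa_nmol_l > t) for t in THRESHOLDS_NMOL_L}


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


labels = make_labels(160)  # a synthetic measurement of 160 nmol/L

# Synthetic ground truth and predictions for one threshold.
y_true = [1, 1, 0, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(labels, f1_score(tp, fp, fn), mcc(tp, fp, fn, tn))
```

MCC uses all four cells of the confusion matrix, which makes it a useful companion to F-score when the positive class is rare.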

Phase 2

Multi-site data engineering and ETL expansion

  1. Onboarding and data ingestion from several healthcare systems

  2. SFTPGo-based secure file transfer infrastructure

  3. Google Cloud Storage restructuring and BigQuery dataset organization per site

  4. Data quality reporting for each site submission (raw and processed QC reports)

  5. CCS code grouping integration; feature generator expansion
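The contents of the per-site QC reports are not described here, so the following is only a hypothetical example of one check such a report might include: a row count and per-column missing-value rate for a single submission. The function name, column names, and sample rows are assumptions.

```python
def qc_report(rows, required_columns):
    """Row count and per-column missing-value rate for one site submission.

    A value counts as missing if the key is absent, None, or an empty string.
    """
    n = len(rows)
    report = {"row_count": n, "missing_rate": {}}
    for col in required_columns:
        missing = sum(1 for r in rows if r.get(col) in (None, ""))
        report["missing_rate"][col] = missing / n if n else 0.0
    return report


# Synthetic submission with a few gaps.
rows = [
    {"patient_id": "a1", "sex": "F", "ldl_mg_dl": 130.0},
    {"patient_id": "a2", "sex": "", "ldl_mg_dl": None},
    {"patient_id": "a3", "sex": "M", "ldl_mg_dl": 101.0},
    {"patient_id": "a4", "sex": "M"},
]

report = qc_report(rows, ["patient_id", "sex", "ldl_mg_dl"])
print(report)
```

Running the same check on raw and processed data makes it easier to see whether a gap came from the site or was introduced during transformation.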

Phase 3

Model retraining and productionalization

  1. Reproduced the provided original results (apples-to-apples comparison)

  2. Ran the final LGBM deployment models (LGBM_125, LGBM_150, LGBM_200) on expanded multi-site dataset

  3. Missing value imputation analysis: non-imputation, iterative imputer, KNN, Random Forest strategies

  4. Missing flag approach and scikit-learn pipeline refactoring

  5. Threshold-based precision/recall analysis across Lp(a) levels

  6. BigQuery imputation table storage and visualization pipeline (Max Doroshenko)

  7. Feature importance pre-selection analysis (Task 2, partial)

  8. Model explainability setup: SHAP/LIME/ELI5 environment (Task 3, initiated)
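The missing flag approach listed above can be illustrated with a small plain-Python sketch: each feature gets a companion indicator column that records whether the value was absent, so the model can still use missingness as a signal after the gap is filled. The function name, fill value, and column names are hypothetical; in a scikit-learn pipeline, `SimpleImputer(add_indicator=True)` offers comparable behavior.

```python
def add_missing_flags(rows, columns, fill_value=0.0):
    """Return new rows with `<col>_missing` indicators added and
    missing values (None or absent) replaced by `fill_value`."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        for col in columns:
            is_missing = row.get(col) is None
            new_row[f"{col}_missing"] = int(is_missing)
            if is_missing:
                new_row[col] = fill_value
        out.append(new_row)
    return out


# Synthetic lab values with gaps.
rows = [
    {"ldl_mg_dl": 130.0, "hdl_mg_dl": None},
    {"ldl_mg_dl": None, "hdl_mg_dl": 52.0},
]

flagged = add_missing_flags(rows, ["ldl_mg_dl", "hdl_mg_dl"])
print(flagged)
```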

Infrastructure: GCP Vertex AI Workbench, BigQuery, GCS, SFTPGo for multi-site data transfer

Tech Stack at a Glance

Python
Scikit-learn
LightGBM (LGBM)
SHAP, LIME, ELI5

RESULTS & OUTCOMES

Measurable improvements across engagement and efficiency

Scalable Onboarding Data Engineering

Workflow documentation added to public repo, enabling faster onboarding of new sites

Pipeline Optimization

Scikit-learn pipeline refactoring reduced manual preprocessing steps and improved reproducibility

Persistent Data Lineage

BigQuery-based storage for imputed tables enabled persistent debugging and experiment tracking across sessions

Waverley delivered a production-ready ML screening model that directly advances the client's clinical mission: surfacing high-risk Lp(a) patients who would otherwise go undiagnosed. Validated on a 34,499-patient hold-out set, the model achieves up to 327% screening enrichment, identifying at-risk patients at more than three times the rate of random screening and enabling more precise allocation of testing resources. The engagement also established a scalable multi-site data pipeline ready for onboarding additional healthcare systems, positioning the client to expand population-level cardiovascular prevention with measurable clinical and economic impact.

Interesting technical decisions:

  1. Three separate LGBM models trained for three clinical thresholds (>125, >150, >200 nmol/L) to support different clinical risk tolerance scenarios.

  2. Imputation strategy chosen after a rigorous comparison: the default iterative imputer showed minimal precision improvement over no imputation, a finding that informed the productionalization approach.

  3. Screening enrichment as a key business metric (rather than pure accuracy), directly translating model performance into clinical workflow value.
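The exact formula behind the screening enrichment metric is not given here. One plausible reading, consistent with the "more than three times the rate of random screening" description, is the ratio of the positive rate among model-flagged patients to the positive rate in the whole population. The sketch below computes that ratio on synthetic data; the function name and the numbers are illustrative assumptions, not project results.

```python
def screening_enrichment(y_true, flagged):
    """Positive rate among flagged patients divided by the overall positive rate.

    A value of 3.0 means flagged patients test positive three times as
    often as patients picked at random (300% when expressed as a percentage).
    """
    baseline = sum(y_true) / len(y_true)
    flagged_labels = [y for y, f in zip(y_true, flagged) if f]
    if not flagged_labels or baseline == 0:
        raise ValueError("need at least one flagged patient and one positive")
    flagged_rate = sum(flagged_labels) / len(flagged_labels)
    return flagged_rate / baseline


# Synthetic cohort: 20 patients, 5 truly elevated (baseline rate 25%).
y_true = [1] * 5 + [0] * 15
# The hypothetical model flags 4 patients, 3 of whom are truly elevated.
flagged = [True, True, True, False, False, True] + [False] * 14

enrichment = screening_enrichment(y_true, flagged)
print(f"{enrichment:.2f}x ({enrichment:.0%})")
```

Framing performance this way ties the model directly to the testing budget: for a fixed number of Lp(a) tests, the enrichment factor indicates how many more elevated patients are found than with unguided testing.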
