Predictive analytics / ML model
INDUSTRY
Healthcare Technology
SERVICE
Predictive Model
DELIVERY MODEL
Outsource
Enable a new clinical capability: identify patients with likely elevated Lp(a) who would otherwise go undiagnosed, and support the client's broader mission of detection and cardiovascular risk reduction. Secondary goals include reducing unnecessary testing costs through precision patient selection and building a productionizable pipeline for multi-site deployment.
What does the system do?
Waverley developed a machine learning model that predicts which patients are likely to have elevated lipoprotein(a) — a genetic risk factor for cardiovascular disease — based on their EHR records, without requiring a direct Lp(a) lab test. The model ingests structured patient data (demographics, lab results, diagnoses, medications, procedures, vitals) from healthcare systems and outputs a risk score, enabling clinicians to prioritize Lp(a) testing for high-risk patients.
The system targets clinicians and researchers seeking to identify undertested patients and is intended for deployment across partner healthcare systems.
ML-Powered Screening Model
Our client is a patient-driven nonprofit dedicated to saving generations of families from preventable heart disease through early identification and improved care of two underdiagnosed inherited lipid disorders — familial hypercholesterolemia and elevated lipoprotein(a), or Lp(a) — both of which significantly increase the risk of premature heart attacks and strokes across all ages, races, and ethnicities. Through pioneering research, advocacy, and education, the client aligns patients, clinicians, researchers, and policymakers around solutions that reduce the burden of cardiovascular disease in affected families.

The core challenge: Replicating a complex clinical model
The team had to replicate a complex clinical model with ambiguous domain logic and extreme feature dimensionality, and did so by deriving cohort strategies empirically from a feature space of more than 200,000 codes. We resolved significant class imbalances and reverse-engineered undocumented legacy imputation steps to ensure model reproducibility and accuracy. Despite multi-site data heterogeneity and GCP infrastructure instability, we standardized disparate schemas and stabilized workflows to deliver a high-performing, scalable solution.
Empirical Discovery in the Absence of Domain Expertise
Navigating the lack of initial SME guidance by shifting to a data-first approach, where the team independently derived cohort strategies and clinical markers through rigorous empirical analysis
Culling the Curse of Dimensionality
Managing an extreme feature space of over 200,000 variables by implementing sophisticated feature selection techniques to distill a massive library of ICD, CPT, and NDC codes into a high-signal model
Optimizing for Rare Event Detection
Addressing severe class imbalance through the application of stratified sampling and precision-recall threshold tuning to ensure the model remained sensitive to underrepresented positive outcomes
Reverse-Engineering the Legacy "Black Box"
Overcoming the hurdles of undocumented data imputation in the original model to facilitate accurate reproduction and create a meaningful baseline for performance comparison
Harmonizing Multi-Site Data Heterogeneity
Architecting a path through fragmented data landscapes, standardizing disparate schemas and resolving unexpected value discrepancies across multiple clinical sites
Mitigating Cloud Infrastructure Instability
Ensuring development continuity and data integrity despite GCP environment friction, specifically resolving platform-level file transfer and inter-VM communication failures
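The precision-recall threshold tuning mentioned under rare event detection can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the toy scores, and the minimum-precision rule are all assumptions made for the example.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged positive."""
    flagged = [t for t, s in zip(y_true, scores) if s >= threshold]
    tp = sum(flagged)
    positives = sum(y_true)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / positives if positives else 0.0
    return precision, recall


def pick_threshold(y_true, scores, min_precision):
    """Return (threshold, precision, recall) for the lowest threshold whose
    precision meets the floor. Recall never increases as the threshold rises,
    so the lowest qualifying threshold also has the highest recall."""
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(y_true, scores, threshold)
        if precision >= min_precision:
            return threshold, precision, recall
    return None  # no threshold satisfies the precision floor


# Toy labels and model scores for illustration only, not client data.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.70, 0.90]
best = pick_threshold(y_true, scores, min_precision=0.6)
```

Fixing a precision floor and maximizing recall beneath it is only one possible tuning rule. A screening program could just as reasonably fix a recall target and report the precision it costs.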
Optimize diagnostic costs
Drive cardiovascular risk reduction by identifying undiagnosed patients with elevated Lp(a) through precision screening. Targeted patient selection reduces unnecessary testing costs, and the work establishes a scalable, production-grade pipeline for multi-site deployment.

Full ML development cycle on the client’s data
Data exploration and key identification across clinical tables (labs, diagnosis, medication, procedures, vitals)
Definition of elevated Lp(a) thresholds (125, 150, 200 nmol/L) as model labels
Feature generation pipeline producing 200k+ dimensional feature space from ICD/CPT/NDC codes
Class imbalance analysis and stratified train/validation/holdout splits
Binary classification model training (LightGBM), evaluation (confusion matrix, F-score, MCC), and selection of best-performing thresholds
Deliverables: data dictionary, executable Jupyter runbook, fully documented code, QA validation report, unit test coverage
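To make the labeling and evaluation steps above concrete, here is a minimal standard-library sketch. It assumes a strict "greater than" cut at each threshold and uses made-up Lp(a) values and predictions. The helper names are hypothetical and do not come from the delivered code.

```python
from math import sqrt

# Thresholds named in the text, in nmol/L.
THRESHOLDS_NMOL_L = (125, 150, 200)


def make_labels(lpa_values, threshold):
    """Return 1 where the measured Lp(a) exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in lpa_values]


def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy measurements and predictions for illustration only, not client data.
lpa = [30, 140, 210, 90, 160, 126]
labels = {t: make_labels(lpa, t) for t in THRESHOLDS_NMOL_L}
y_pred_125 = [0, 1, 1, 1, 1, 0]  # hypothetical output of a >125 model
tp, tn, fp, fn = confusion_counts(labels[125], y_pred_125)
f1_125 = f1_score(tp, fp, fn)
mcc_125 = mcc(tp, tn, fp, fn)
```

MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside the F-score when positives are rare.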
Multi-site data engineering and ETL expansion
Onboarding and data ingestion from several healthcare systems
SFTPGo-based secure file transfer infrastructure
Google Cloud Storage restructuring and BigQuery dataset organization per site
Data quality reporting for each site submission (raw and processed QC reports)
CCS code grouping integration; feature generator expansion
Model retraining and productionalization
Reproduced the provided original results (apples-to-apples comparison)
Ran the final LGBM deployment models (LGBM_125, LGBM_150, LGBM_200) on the expanded multi-site dataset
Missing value imputation analysis: non-imputation, iterative imputer, KNN, Random Forest strategies
Missing flag approach and scikit-learn pipeline refactoring
Threshold-based precision/recall analysis across Lp(a) levels
BigQuery imputation table storage and visualization pipeline
Feature importance pre-selection analysis (partially completed)
Model explainability setup: SHAP/LIME/ELI5 environment (initiated)
Infrastructure: GCP Vertex AI Workbench, BigQuery, GCS, SFTPGo for multi-site data transfer
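The missing flag approach combined with a scikit-learn pipeline might look roughly like the sketch below. It is an assumption-laden illustration: median imputation and logistic regression stand in for the imputers and LightGBM models named above so that the example depends only on scikit-learn and NumPy, and the data is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy feature matrix with gaps; illustration only, not client data.
X = np.array([
    [1.0, np.nan],
    [2.0, 0.5],
    [np.nan, 1.5],
    [4.0, 2.5],
    [5.0, np.nan],
    [6.0, 3.5],
])
y = np.array([0, 0, 0, 1, 1, 1])

pipeline = Pipeline([
    # add_indicator=True appends one 0/1 "was missing" column for each
    # feature that had missing values during fit (the missing flags).
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    # Stand-in classifier; the project itself used LightGBM.
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)

# Inspect what the classifier actually sees: 2 imputed features + 2 flags.
X_imputed = pipeline.named_steps["impute"].transform(X)
predictions = pipeline.predict(X)
```

Keeping imputation inside the pipeline means the fill values are learned from the training split only and reapplied unchanged at prediction time, which supports the reproducibility goal described here.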

Tech Stack at a Glance
Measurable improvements across engagement and efficiency
Scalable Onboarding Data Engineering
Workflow documentation added to public repo, enabling faster onboarding of new sites
Pipeline Optimization
Scikit-learn pipeline refactoring reduced manual preprocessing steps and improved reproducibility
Persistent Data Lineage
BigQuery-based storage for imputed tables enabled persistent debugging and experiment tracking across sessions

Waverley delivered a production-ready ML screening model that directly advances the client's clinical mission: surfacing high-risk Lp(a) patients who would otherwise go undiagnosed. Validated on a 34,499-patient hold-out set, the model achieves up to 327% screening enrichment, identifying at-risk patients at more than three times the rate of random screening and enabling more precise allocation of testing resources. The engagement also established a scalable multi-site data pipeline ready for onboarding additional healthcare systems, positioning the client to expand population-level cardiovascular prevention with measurable clinical and economic impact.
Interesting technical decisions:
Three separate LGBM models trained for three clinical thresholds (>125, >150, >200 nmol/L) to support different clinical risk tolerance scenarios.
Imputation strategy chosen after rigorous comparison: default iterative imputer showed minimal precision improvement over no-imputation — a finding that informed the productionalization approach.
Screening enrichment as a key business metric (rather than pure accuracy), directly translating model performance into clinical workflow value.
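The text does not define screening enrichment. One plausible reading, used here purely as an assumption, is the positive rate among model-flagged patients divided by the positive rate in the whole population, expressed as a percentage. The function name and toy counts below are illustrative and are not the project's results.

```python
def screening_enrichment(y_true, flagged):
    """Positive rate among flagged patients relative to the overall positive
    rate, as a percentage (100 means no better than random screening)."""
    n_total = len(y_true)
    n_positive = sum(y_true)
    n_flagged = sum(flagged)
    if n_positive == 0 or n_flagged == 0:
        raise ValueError("need at least one positive and one flagged patient")
    hits = sum(1 for t, f in zip(y_true, flagged) if t == 1 and f == 1)
    # (hits / n_flagged) / (n_positive / n_total), rearranged to one division.
    return 100.0 * hits * n_total / (n_flagged * n_positive)


# Toy cohort for illustration only: 20 patients, 4 with elevated Lp(a).
y_true = [1, 1, 1, 0, 0, 1] + [0] * 14
# Hypothetical model flags the first 5 patients, 3 of whom are positive.
flagged = [1] * 5 + [0] * 15
enrichment = screening_enrichment(y_true, flagged)
```

In this toy cohort the flagged group has a 60% positive rate against a 20% baseline, so testing only flagged patients finds positives at three times the rate of random selection.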