Non-alcoholic steatohepatitis (NASH) real-world evidence: 
Advanced approaches to achieve high validity

Erin Murray1 MPH, Mahder Teka1 MS, Paddu Vedam MS, Daniel Riskin1 MD, FACS


Regulatory, market access, and prescribing decisions are increasingly based on insights from routine care, also known as real-world evidence (RWE). As RWE increasingly influences the standard of care, data quality has come under scrutiny. For instance, relying on structured data alone can limit accurate capture of certain clinical conditions, such as NASH. In response, the data quality standard of accuracy, completeness, and traceability has been distilled to measurable values.


To estimate the validity of structured and unstructured electronic health record (EHR) data and associated technologies in identifying concepts related to NASH through comparison with a manual reference standard.


The study was a retrospective analysis of EHR data from two data sources. Accuracy was tested in two study arms for 19 pre-selected NASH-related concepts. Data were extracted from 6,087 encounters (from 3,137 patients), each with paired structured and unstructured data. F1-score, the harmonic mean of recall and precision, was used to measure accuracy. Recall, also called sensitivity, is the percentage of data elements identified by the reference standard that were also identified by the automated algorithm. Precision, or positive predictive value, is the percentage of data elements identified by the automated algorithm that were also identified by the reference standard.
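The three metrics defined above can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the function name, the use of encounter IDs as the unit of comparison, and the example IDs are all assumptions.

```python
def precision_recall_f1(reference: set, predicted: set) -> tuple[float, float, float]:
    """Return (recall, precision, F1) for a single concept.

    reference: encounter IDs where the manual reference standard found the concept.
    predicted: encounter IDs where the automated algorithm found the concept.
    """
    true_positives = len(reference & predicted)
    recall = true_positives / len(reference) if reference else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Hypothetical encounter IDs, for illustration only (not study data).
reference = {1, 2, 3, 4, 5}
predicted = {1, 2, 6}
recall, precision, f1 = precision_recall_f1(reference, predicted)
print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```

In this toy case the algorithm finds two of five reference encounters (recall 0.400) and two of its three hits are correct (precision 0.667), giving an F1 of 0.500.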

An F1-score of 80% was considered the threshold for high accuracy. The study arms were compared using a Chi-squared test at a two-sided significance level of 0.05. Encounter occurrence is the number of encounters with at least one occurrence of the concept. Per protocol, concepts with fewer than 20 occurrences within the dataset were excluded.
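One way such a comparison could be run is sketched below. The abstract does not state how the contingency table was built, so the 2x2 layout (study arm by identified/missed), the absence of a continuity correction, the function name, and the counts are all assumptions made for illustration.

```python
import math

MIN_OCCURRENCES = 20  # per protocol: concepts with fewer occurrences are excluded
ALPHA = 0.05          # two-sided significance level


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Assumed layout: rows are study arms (traditional, advanced),
    columns are (identified, missed). Returns (statistic, p_value), 1 df.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("each row and column needs at least one count")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts for one concept, for illustration only (not study data).
traditional_found, traditional_missed = 40, 60
advanced_found, advanced_missed = 95, 5

occurrences = traditional_found + traditional_missed
if occurrences >= MIN_OCCURRENCES:
    statistic, p_value = chi_squared_2x2(
        traditional_found, traditional_missed, advanced_found, advanced_missed
    )
    print(f"chi2={statistic:.2f}, significant={p_value < ALPHA}")
```

The closed-form p-value is valid only for one degree of freedom; a larger table would need a general chi-squared routine from a statistics library.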


The average recall, precision, and F1-score in the traditional study arm were 37.2%, 98.9%, and 54.1%, respectively (Table 2). The average recall, precision, and F1-score in the advanced (NLP plus AI-based inference) study arm were 96.0%, 97.4%, and 96.7%, respectively.

There was a 42.6 percentage-point absolute increase and a 78.7% relative increase in F1-score from the traditional to the advanced approach. The difference between the two arms was statistically significant (p<0.001) for all data elements where data were available.
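The absolute and relative increases follow directly from the two average F1-scores reported above. A short Python check, with illustrative variable names:

```python
# Average F1-scores reported for the two study arms (percent).
traditional_f1 = 54.1
advanced_f1 = 96.7

# Absolute increase is a difference in percentage points.
absolute_increase = advanced_f1 - traditional_f1

# Relative increase expresses that difference as a share of the baseline.
relative_increase = absolute_increase / traditional_f1 * 100

print(f"absolute: {absolute_increase:.1f} points, relative: {relative_increase:.1f}%")
```

The distinction matters when reading the result: the advanced arm scored 42.6 points higher, which is 78.7% above the traditional arm's baseline.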

Cohen’s kappa score indicated 88% inter-rater reliability, reflecting a highly credible reference standard. Use of NLP plus inference improved accuracy for concepts where doctors used variable language (e.g., a doctor referenced a liver scan with a fibrosis stage without ever mentioning liver fibrosis itself, or mentioned fibrosis and referenced the liver only later in the narrative).
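For readers unfamiliar with Cohen's kappa, the sketch below shows how the statistic corrects raw agreement between two annotators for agreement expected by chance. The function name and the labels are hypothetical; the abstract does not describe how the study's annotations were structured.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels (1 = concept present, 0 = absent), not study data.
rater_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa={kappa:.2f}")
```

Here the raters agree on 9 of 10 items, but half of that agreement would be expected by chance, so kappa is 0.80 rather than 0.90.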

In particular, for liver fibrosis and alcohol use (Table 3), F1-scores were 80.1% and 75.3%, respectively, using NLP alone versus 94.7% and 91.2% using NLP plus inference. Applying inference thus gave an average relative increase of 19.7% over NLP alone.
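The 19.7% figure is the mean of the two per-concept relative increases, not the relative increase of the mean F1-scores. A brief Python check using only the values reported above (dictionary names are illustrative):

```python
# F1-scores (percent) reported for the two concepts.
nlp_alone = {"liver fibrosis": 80.1, "alcohol use": 75.3}
nlp_plus_inference = {"liver fibrosis": 94.7, "alcohol use": 91.2}

# Relative increase per concept, as a percentage of the NLP-alone score.
relative_by_concept = {
    concept: (nlp_plus_inference[concept] - baseline) / baseline * 100
    for concept, baseline in nlp_alone.items()
}

average_relative = sum(relative_by_concept.values()) / len(relative_by_concept)

for concept, value in relative_by_concept.items():
    print(f"{concept}: {value:.1f}%")
print(f"average: {average_relative:.1f}%")
```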


Given substantial variability in available data and technology, accuracy should be measured and reported using recall, precision, and F1-score if the study protocol requires high-validity evidence. Traditional NLP alone may be insufficient to achieve high accuracy, as seen in clinically important NASH concepts such as liver fibrosis and alcohol use. Results suggest that augmentation with unstructured data and artificial intelligence represents a viable pathway to support high-quality data aggregation and enable high-validity evidence generation.

Presented at AASLD’s The Liver Meeting, November 11, 2023

1Verantos Inc., 325 Sharon Park Dr., Menlo Park, CA 94025