Healthcare systems continuously generate vast amounts of real-world data (RWD) that contain a wealth of information on patient disease manifestation, progression, and responses to treatment. When appropriately leveraged for research, RWD sources like electronic health records (EHR) and insurance claims offer a gold mine of insights into treatment efficacy and safety everyday clinical settings. With regulatory bodies like FDA increasingly supporting the use of RWD as primary evidence in submissions, there is a significant opportunity to leverage these data sources to reduce required enrollment in clinical trials and bring life-changing therapies to market faster.
To use RWD in regulatory submissions, FDA currently requires conformance of RWD to Clinical Data Interchange Standards Consortium (CDISC) standards. For sponsors submitting RWD to FDA to support substantial evidence of intervention efficacy and safety, this means transforming RWD into the CDISC Study Data Tabulation Model (SDTM). However, because RWD is collected for clinical care, there is a complete mismatch between how RWD is organized and represented and how it needs to be structured for a study and represented in SDTM for the specific regulatory purpose at hand. As a result, the RWD needs to be extensively transformed and mapped into SDTM, which poses major challenges for data reliability. This article examines the reasons why RWD in SDTM is by default too unreliable for regulatory decision-making and how it can be made to be reliable.
Why RWD in SDTM is Unreliable
The SDTM standard is designed to capture patient-level data for a specific study, encompassing study variables like inclusion/exclusion criteria, arm assignments, exposures, covariates, and outcomes. These variables define the study context and serve as the foundation for all downstream analyses. In clinical trials, these study variables are fully specified by the study protocol, meticulously collected into standardized forms in electronic data capture (EDC) systems by research staff, and harmonized according to the CDISC Clinical Data Acquisition Standards Harmonization (CDASH) guidelines. This process allows for a straightforward transformation of the collected data into SDTM following well-established SDTM Implementation Guides (IGs), and leads to very high reliability of the SDTM data in clinical trials.
In RWD however, there are no explicitly defined study variables like arm assignments, baseline covariates, or follow-up outcomes; these must be derived algorithmically from the original RWD. These derivations involve identifying relevant raw clinical data elements (like diagnosis codes, medication records, or lab results) and applying logic to calculate the resulting study variables. The challenge here is that this process is inherently far more complex than the straightforward data collection and mapping processes used in clinical trials. RWD sources like EHR are notoriously noisy, highly diverse, and extremely complex, with no consistent structure across different RWD sources. In fact, each individual patient will have a unique way in which their particular clinical information is represented. This means that transforming and mapping diverse RWD to SDTM is an enormous task. The army of well-intentioned data engineers, clinical experts, and epidemiologists that work diligently to transform these data for a single project will still find that this process results in data loss and errors that render the data unreliable [link to know thy data].

Taking RWD to SDTM is not only slow, iterative, and resource intensive, these unknown errors and information loss that occur in the process of generating the RWD-derived study data introduce unquantifiable biases that ultimately undermine the validity of research findings, rendering the RWD unsuitable for regulatory decision-making.
Making RWD in SDTM Reliable with Traceability: Droice SuperLineage
The key to overcoming these RWD reliability challenges in SDTM is traceability. This is exactly what Droice SuperLineage enables, providing comprehensive traceability for all source data elements—a critical FDA requirement for data reliability (read about Droice’s discussions on SuperLineage with FDA). How does traceability solve these reliability issues? It allows for any RWD-derived study variable to be validated against the original source. By maintaining comprehensive, atomic lineage to each source data element for each patient in the study in a standardized format across all sources, SuperLineage enables RWD in SDTM to be scalably validated to measure the performance of the algorithmic derivations for each study variable. Quantifying this error bound allows for the potential impact of information loss on efficacy and safety inferences to be quantitatively assessed such that the study inferences can be trusted in the face of these errors. Read more about SuperLineage.
The RWD Lineage Project
Because SDTM is intended for clinical trial data, it was never designed to handle the complex and extensive transformations required to bring RWD into SDTM. Droice recognized this gap and worked with CDISC to initiate the RWD Lineage Project, which aims to define a standardized and scalable approach to maintain data lineage for RWD in regulatory submissions. Read about the CDISC RWD Lineage Project.
Droice’s CTO Tasha Nagamine, along with Anita Umesh from Roche, will be presenting the progress so far in the RWD Lineage Project at the 2024 CDISC US Interchange on October 24th. Hope to see you there!
