
Real-world data (RWD) generated in the course of routine health care, such as electronic health records (EHRs) and insurance claims, holds immense value for generating evidence of treatment efficacy and safety for regulatory submissions. With regulatory bodies like FDA increasingly encouraging the use of RWD as primary evidence in submissions, there is a significant opportunity to unlock the utility of these data sources for regulatory applications, including clinical trials. To be submitted to FDA, RWD needs to conform to FDA-mandated clinical study data standards, namely the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM). Unfortunately, bringing RWD into SDTM is not straightforward because RWD differs starkly from clinical trial data, and how the process is handled can have a dramatic impact on both the labor required to transform the data and the acceptability of the resulting SDTM. This article covers the key challenges sponsors face when transforming RWD into reliable SDTM for FDA submissions, and how Droice Hawk and Droice SuperLineage solve these challenges and massively accelerate the process (read about Droice’s discussion on its technologies with FDA).

Does FDA require RWD to be submitted in SDTM?

FDA’s recently finalized guidance document “Data Standards for Drug and Biological Product Submissions Containing Real-World Data” addresses the data formats required for submission of RWD to FDA, stating that: “Currently, and absent a waiver, sponsors submitting clinical and nonclinical study data (including those derived from RWD sources) in submissions subject to section 745A(a) of the FD&C Act are required to use the formats described in the Study Data Guidance and the supported study data standards listed in the Catalog.” For sponsors who submit RWD to FDA to support substantial evidence of efficacy and/or safety of an intervention, this means that the RWD needs to be transformed into SDTM.

While the processes for converting data from randomized controlled trials (RCTs) to SDTM are well established, the same cannot be said for RWD. The FDA RWD data standards guidance acknowledges this, stating:

“FDA recognizes that a range of approaches may be used to apply the supported study data standards (e.g., Clinical Data Interchange Standards Consortium’s (CDISC’s) Study Data Tabulation Model (SDTM) or Analysis Data Model (ADaM)) to RWD sources such as EHR or claims data. FDA encourages sponsors to discuss such approaches with FDA. With adequate documentation of the conformance methods used and their rationale, study data derived from RWD can be transformed into SDTM and ADaM datasets and submitted to FDA in an applicable submission.”
Source: Data Standards for Drug and Biological Product Submissions Containing Real-World Data, Guidance for Industry

This statement by FDA indicates that sponsors have flexibility in how they implement SDTM for RWD, and indeed this has been the case when sponsors have discussed their approach with FDA. However, different decisions on how to handle the challenges and complexities of bringing RWD to SDTM can have a significant impact on both the operational resources and time required to generate SDTM and its reliability for regulatory decision-making. For example, implementations of SDTM for RWD for a single study frequently take more than 5,000 hours, and even after this extensive labor, FDA may still not accept the data if it does not sufficiently conform or if the resulting SDTM is not reliable. With Droice Labs’ technologies, however, sponsors are taking RWD to SDTM in an order of magnitude less time while maintaining conformance to standards and meeting FDA requirements for RWD reliability.

Why is it so difficult to bring RWD to SDTM?

The challenges in transforming RWD into SDTM lie largely in the differences between how data is collected in real-world clinical settings versus controlled trials, and the fact that CDISC standards were designed for RCTs, not RWD. RCT data is collected with the explicit purpose of addressing pre-specified study questions and in a standardized format that can be seamlessly transformed into SDTM by design. In contrast, RWD is collected for clinical care, not research, which has two major implications that complicate the process to bring it to SDTM:

RWD is not study data:

The study variables required to generate SDTM simply don’t exist natively in RWD and must be derived. For example, RWD will need to be algorithmically processed to determine required study variables such as index dates, baseline and follow-up periods, arm assignments, baseline covariates, or the line of therapy a patient is on, to name a few, which are not explicitly captured in the source RWD like they are in clinical trials.

RWD is not collected in any standardized structure or format:

There is no standardized structure or format in which these underlying raw RWD elements like billing codes, lab values, prescriptions, and clinician notes are collected and organized - in fact, there is massive diversity in how patient information is stored and represented in RWD. Even within a single hospital there can be multiple EHR systems that store patient information in completely different schemas, and even within a single EHR system, different providers with differing workflows may store the same patient information in a completely different place or format. Furthermore, only a fraction of this patient information will be represented in standardized medical terminologies (like SNOMED, ICD10, CPT, LOINC, and NDC) - the rest exists in system-specific codings, semi-structured fields, and completely unstructured provider notes that all require semantic interpretation, further complicating comprehensive information extraction.
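The first implication, that study variables must be derived rather than read from the source, can be illustrated with a minimal Python sketch. It assumes a simplified, already cleaned list of prescription records and a hypothetical rule that the index date is a patient's first recorded start of the study drug, with a 180-day baseline period before it. The record layout, the drug names, the function names, and the 180-day window are all assumptions made for illustration; real studies define these rules in the protocol, and real RWD would need far more preprocessing first.

```python
from datetime import date, timedelta

# Hypothetical prescription records: (patient_id, drug_name, start_date).
prescriptions = [
    ("P1", "drug_a", date(2021, 3, 10)),
    ("P1", "drug_a", date(2021, 1, 5)),
    ("P1", "drug_b", date(2020, 11, 1)),
    ("P2", "drug_b", date(2021, 2, 1)),
]


def derive_index_dates(records, study_drug):
    """Return {patient_id: earliest recorded start date of study_drug}."""
    index_dates = {}
    for patient_id, drug, start in records:
        if drug != study_drug:
            continue
        if patient_id not in index_dates or start < index_dates[patient_id]:
            index_dates[patient_id] = start
    return index_dates


def baseline_window(index_date, days=180):
    """Return (first_day, last_day) of the baseline period.

    The window covers the `days` days immediately before the index date
    and ends the day before it (an assumed convention).
    """
    return index_date - timedelta(days=days), index_date - timedelta(days=1)


index_dates = derive_index_dates(prescriptions, "drug_a")
windows = {pid: baseline_window(d) for pid, d in index_dates.items()}
```

In this sketch, patient P2 never receives the hypothetical study drug and therefore has no index date, which shows that even cohort membership is a derived property of RWD rather than a recorded one.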

A Common but Flawed Approach to Transform RWD to SDTM

A popular method has been to treat SDTM as a common data model (CDM), where the raw RWD from multiple sources is mapped directly into SDTM domains and terminologies. For instance, this approach may map diagnosis codes from RWD (e.g. ICD10) to the Medical History (MH) domain, medication information to the Concomitant Medications (CM) or Exposure (EX) domains (depending on whether a specific drug is defined as a study exposure or not), and lab results to the Laboratory Data (LB) domain. This is followed by deriving the study-specific variables from the SDTM “CDM” into non-standard intermediate Analysis Data Model (ADaM) tables, which are then fed back into SDTM to populate the missing study variables. While this method may seem reasonable at first glance, it is undermined by two major drawbacks:

Complexity and variety lead to resource intensiveness and unreliability:

As described above, because RWD is collected using different electronic systems and practice patterns, the distribution, format, and content of patient clinical information in RWD varies widely across sources. Furthermore, raw RWD sources, such as EHR application databases, can be extremely complex and are often customized for specific institutions. This complexity and diversity of RWD, which is orders of magnitude greater than for RCT data, means that mapping the relevant raw clinical information to SDTM domains and terminologies requires labor-intensive, time-consuming, custom programming for each data source. This mapping exercise is an enormous undertaking and is the major reason these implementations take more than 5,000 hours. Furthermore, these mappings are prone to errors and data loss that compromise the reliability of the RWD-derived study data for regulatory decision-making.

Deviation from standard SDTM implementation:

Using ADaM to derive SDTM variables reverses the standard data flow from SDTM to ADaM, which complicates traceability, leads to conformance issues, and creates a mismatch between corresponding clinical trial data and RWD in both SDTM and ADaM.

In short, the “SDTM as a CDM” approach is onerous, does not scale across multiple data sources, does not conform to current CDISC standards, and suffers from significant data reliability issues. This means that in practice, sponsors can invest enormous time and financial resources attempting to map RWD into SDTM for regulatory submissions only to find they are still not able to meet FDA requirements for data reliability.
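To make the direct-mapping step of the “SDTM as a CDM” approach concrete, the following Python sketch routes already normalized source records to the SDTM domains named above. It is an illustration only, not a recommended implementation: the record layout, the `kind` field, and the `assign_domain` function are hypothetical, and the sketch assumes the hard part (extracting and normalizing records from each source system) has already been done.

```python
def assign_domain(record, study_drugs):
    """Return the SDTM domain code for a normalized source record.

    `record` is a hypothetical dict with a "kind" key and, for
    medications, a "name" key. Returns None when no rule applies.
    """
    kind = record["kind"]
    if kind == "diagnosis":
        return "MH"  # Medical History
    if kind == "medication":
        # Study exposures go to EX; all other drugs go to CM.
        return "EX" if record["name"] in study_drugs else "CM"
    if kind == "lab":
        return "LB"  # Laboratory Data
    return None  # unmapped: needs source-specific handling


# Hypothetical input records and study drug list.
study_drugs = {"drug_a"}
records = [
    {"kind": "diagnosis", "code": "E11.9"},
    {"kind": "medication", "name": "drug_a"},
    {"kind": "medication", "name": "drug_b"},
    {"kind": "lab", "code": "4548-4"},
    {"kind": "note", "text": "free-text clinician note"},
]
domains = [assign_domain(r, study_drugs) for r in records]
```

The routing rules themselves are short. The record that falls through to `None` stands in for the semi-structured and unstructured content that each source represents differently, which is where the custom, per-source programming effort described above accumulates.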

The Right Approach: Droice Labs’ Technologies for Efficient Transformation of RWD to Reliable SDTM

Droice Labs’ AI technologies have been designed from the ground up and trained on massive volumes of global RWD to solve the exact challenges involved in taking RWD to SDTM, helping sponsors achieve dramatically faster implementations while meeting FDA requirements for RWD reliability. Droice Hawk is an AI middleware that converts raw, messy RWD into analysis-ready data at scale, which greatly simplifies the transformation of RWD to SDTM and reduces implementation time by an order of magnitude. Droice SuperLineage provides comprehensive traceability for all source data elements, a critical FDA requirement for data reliability and validation (read about Droice’s discussions on SuperLineage with FDA). Pharmaceutical sponsors are using these Droice technologies to enable a scalable and reliable transformation of RWD to SDTM that meets FDA requirements for "accuracy, completeness, and traceability."

To learn more about how Droice technologies are supplying reliable RWD-powered evidence generation infrastructure for life sciences clients from big pharma to innovative biotechs, please visit www.droicelabs.com.