Value sets for the analysis of real-world patient data: Problems, theory, and solutions
Abstract
Observational, retrospective, in silico studies based on real-world data—that is, data for research collected from sources other than randomized clinical trials—cost a minute fraction of randomized clinical trials and are essential for clinical research, pharmacoepidemiology, clinical quality measurement, health system administration, value-based care, clinical guideline compliance, and public health surveillance. They offer an alternative when randomized trials cannot provide large enough patient cohorts or patients representative of real populations in terms of comorbidities, age range, disease severity, or rare conditions. Improvements in the speed, frequency, and quality of research investigations using real-world data have accelerated with the emergence of distributed research networks based on common data models over the past ten years. Analyses of repositories of coded patient data involve data models, controlled medical vocabularies and ontologies, analytic protocols, implementations of query logic, value sets of vocabulary terms, and software platforms for developing and using these. These studies generally rely on clinical data represented using controlled medical vocabularies and ontologies—such as ICD10, SNOMED, RxNorm, CPT, and LOINC—which catalogue and organize clinical phenomena such as conditions, treatments, and observations. Clinicians, researchers, and other medical staff collect patient data into electronic health records, registries, and claims databases, with each phenomenon represented by a code, a concept identifier, from a medical vocabulary. Value sets are groupings of these identifiers that facilitate data collection, representation, harmonization, and analysis. Although medical vocabularies use hierarchical classification and other data structures to represent phenomena at different levels of granularity, value sets are needed when a clinical phenomenon of interest covers a number of codes.
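The idea of a value set as a grouping of concept identifiers drawn from a hierarchical vocabulary can be sketched in a few lines. The following is a minimal, hypothetical illustration: the codes, descriptions, and hierarchy fragment are invented for the example and are not a real ICD10 extract.

```python
# A toy fragment of a vocabulary hierarchy: parent code -> child codes.
# (Illustrative only; real vocabularies have far richer structure.)
HIERARCHY = {
    "E11": ["E11.9", "E11.65"],  # hypothetical type 2 diabetes parent + children
    "E10": ["E10.9"],            # hypothetical type 1 diabetes parent + child
}

def expand(codes):
    """Expand a set of codes to include all descendants in the hierarchy."""
    result = set()
    stack = list(codes)
    while stack:
        code = stack.pop()
        result.add(code)
        stack.extend(HIERARCHY.get(code, []))
    return result

# A value set defined intensionally by one parent code, then expanded
# extensionally into the concrete list of codes used to match records.
type2_diabetes = expand({"E11"})

# Matching coded patient records (patient id, code) against the value set.
records = [("patient-1", "E11.9"), ("patient-2", "E10.9"), ("patient-3", "E11.65")]
cohort = {pid for pid, code in records if code in type2_diabetes}
```

The distinction between the compact parent-code definition and its expansion is why vocabulary hierarchies matter to value set authors: a single well-chosen ancestor can stand in for many leaf codes, but only if the hierarchy actually covers the phenomenon of interest.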
These lists of codes representing medical terms are a common feature of the cohort, phenotype, or other variable definitions used to specify patients with particular clinical conditions in analytic algorithms. Developing and validating original value sets is difficult to do well; it is a relatively small but ubiquitous part of real-world data analysis, it is time-consuming, and it requires a range of clinical, terminological, and informatics expertise. When a value set fails to match all the appropriate records or matches records that do not indicate the phenomenon of interest, study results are compromised. An inaccurate value set can lead to completely wrong study results; when the inaccuracy causes subtler errors, conclusions may be incorrect without ever catching researchers' attention, and the best one can hope for is that researchers notice a problem and track it down to a value set issue. Verifying or measuring value set accuracy is difficult and costly, often impractical, sometimes impossible. Literature recognizing the deleterious effects of poor value set quality on the reliability of observational research results frequently recommends public repositories where high-quality value sets can be stored, maintained, and refined by successive users for reuse. Though such repositories have been available for years and populated with hundreds or thousands of value sets, regular reuse has not been demonstrated. Value set quality has continued to be questioned in the literature, yet reuse has continued to be recommended and generally accepted at face value. The hope for value set repositories has been not only that researchers would have access to expertly designed value sets, but that they would incrementally refine them: over time, researchers would take advantage of others' work, building on it where possible instead of repeating it, evaluating the accuracy of the value sets, and contributing their changes back to the repository.
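The kind of subtle, silent error described above can be illustrated with a hypothetical sketch: an incomplete value set that misses one valid code understates a measured prevalence, and nothing in the analysis itself flags the problem. All codes and counts below are invented for the example.

```python
# Invented coded records: (true condition label, recorded code).
records = (
    [("dm", "E11.9")] * 60      # diabetes recorded with a common code
    + [("dm", "E11.65")] * 20   # diabetes recorded with a less common code
    + [("other", "I10")] * 120  # unrelated records
)

complete = {"E11.9", "E11.65"}
incomplete = {"E11.9"}  # misses E11.65, so 20 true cases silently drop out

def prevalence(value_set):
    """Fraction of records matched by the value set."""
    matched = sum(1 for _, code in records if code in value_set)
    return matched / len(records)

# The complete set yields 80/200 = 0.40; the incomplete set yields
# 60/200 = 0.30. The query runs without error either way, which is
# exactly why such inaccuracies escape researchers' attention.
```

Because both queries execute cleanly and return plausible numbers, only external validation (for example, chart review against the matched records) would reveal that the second estimate is biased.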
Rather than incremental improvement or indications of value sets being vetted and validated, what we see in repositories is proliferation and clutter: new value sets that may or may not have been vetted in any way, and junk value sets, created for some reason but never finished. We have found general agreement in our data that the presence of many alternative value sets for a given condition often leads value set developers to ignore all of them and start from scratch, as there is generally no easy way to tell which will be more appropriate for the researcher's needs. And if they share their value set back to the repository, they further compound the problem, especially if they neglect to document the new value set's intention and provenance. The research offered here casts doubt on the value of reuse with currently available software and infrastructure for value set management. It is about understanding the challenges value sets present; understanding how they are made, used, and reused; and offering practice and software design recommendations to advance the ability of researchers to efficiently make or find accurate value sets for their studies, leveraging and adding to prior value set development efforts. This required fieldwork: with my advisors, I conducted a qualitative study of professionals in the field, an observational user study with the aim of understanding and characterizing normative and real-world practices in value set construction and validation, with a particular focus on how researchers use the knowledge embedded in medical terminologies and ontologies to inform that work. I collected data through an online survey of RWD analysts and researchers, interviews with a subset of survey participants, and observation of certain participants performing actual work to create value sets. We performed open coding and thematic analysis on interview and observation transcripts, interview notes, and open-ended question text from the surveys.
The requirements, recommendations, and theoretical contributions in prior literature have not been sufficient to guide the design of software that could make effective leveraging of shared value sets a reality. This dissertation presents a conceptual framework, real-world experience, and a deep, detailed account of the challenges to reuse, and makes up that deficit with a high-level requirements roadmap for improved value set creation tools. I argue, based on the evidence marshalled throughout, that there is one way to get researchers to reuse appropriate value sets, to follow best practices in determining whether a new value set is truly needed before creating their own, and to dedicate sufficient and appropriate effort to creating value sets well and preparing them for reuse by others: giving them software that pushes them to do these things, mostly by making it easy and obviously beneficial to do them. I offer a start in building such software with Value Set Hub, a platform for browsing, comparing, analyzing, and authoring value sets—a tool in which the presence of multiple, sometimes redundant, value sets for the same condition strengthens rather than stymies efforts to build on the work of prior value set developers. Particular innovations include the presentation of multiple value sets on the same screen for easy comparison, the display of compared value sets in the context of vocabulary hierarchies, the integration of these analytic features with value set authoring, and value set browsing features that encourage users to review existing value sets that may be relevant to their needs. Fitness-for-use is identified as the central challenge for value set developers, and the strategies for addressing this challenge are categorized into two approaches: value-set-focused and code-focused. The concluding recommendations offer a roadmap for future work in building the next generation of value set repository platforms and authoring tools.