Vocabulary management systems in electronic health records (EHRs) standardize how medical information is communicated.
Although primarily intended to improve patient care, they also benefit researchers by bringing structure to complicated, nuanced, and dynamic health data. Built into EHR software, vocabulary management systems direct practitioners to use a standardized clinical nomenclature when documenting a patient’s symptoms, diagnoses, and treatments. Specifically, these vocabulary sets consist of clinical terms that each have a distinct definition and discrete code. Clarifying descriptors similarly have narrow definitions and unique codes. A patient’s broken leg, for example, can be described with fracture of the femur (code = 71620000 in SNOMED CT terminology), motorcycle accident (297186008), and accident caused by blizzard (217724009).
Compared with documenting a patient’s information purely in free-text notes, vocabulary sets make medical data more usable for researchers. More specifically, codified clinical information offers researchers:
- A data format that is queryable;
- A data format that can easily be incorporated into most common statistical methods (e.g., linear regression with categorical predictors);
- Well-defined metrics for analysis (since each code represents a narrow clinical concept);
- Standardized terms, meaning that linguistic variations such as “cardiac disease” and “disorder of the heart” are represented by a single code for “heart disease.”
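To make the first two benefits concrete, here is a minimal sketch of how coded records can be queried and turned into categorical indicators for a statistical model. The three SNOMED CT codes are the ones quoted above; the patient records, field names, and helper functions are hypothetical and exist only for illustration.

```python
from collections import Counter

# SNOMED CT codes quoted in the text above.
FRACTURE_OF_FEMUR = "71620000"
MOTORCYCLE_ACCIDENT = "297186008"
ACCIDENT_CAUSED_BY_BLIZZARD = "217724009"

# Hypothetical records: each patient carries the set of codes documented for them.
records = [
    {"patient_id": "P001", "codes": {FRACTURE_OF_FEMUR, MOTORCYCLE_ACCIDENT}},
    {"patient_id": "P002", "codes": {FRACTURE_OF_FEMUR, ACCIDENT_CAUSED_BY_BLIZZARD}},
    {"patient_id": "P003", "codes": {MOTORCYCLE_ACCIDENT}},
]


def patients_with(code, records):
    """Return the sorted IDs of patients whose record contains the given code."""
    return sorted(r["patient_id"] for r in records if code in r["codes"])


def code_counts(records):
    """Count how many records mention each code."""
    return Counter(code for r in records for code in r["codes"])


def indicator_row(record, codes):
    """Encode a record as 0/1 indicators, one per code, in the order given."""
    return [1 if code in record["codes"] else 0 for code in codes]


if __name__ == "__main__":
    print(patients_with(FRACTURE_OF_FEMUR, records))
    print(code_counts(records))
    print(indicator_row(records[0], [FRACTURE_OF_FEMUR, MOTORCYCLE_ACCIDENT]))
```

The indicator rows are the kind of categorical predictors a regression model can consume directly, which free-text notes cannot provide without additional processing.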
But to best take advantage of the information captured in EHRs, researchers need to appreciate the complexities of these vocabulary sets.
Namely, they should be aware that:
- There are many formal clinical vocabulary sets. Examples include SNOMED CT (a comprehensive nomenclature), ICD-10-CM (diagnostic codes), ICD-10-PCS (procedure codes), LOINC (lab testing codes), and RxNorm (pharmaceutical codes). Each of these has its own niche and its own deficiencies (e.g., in describing timelines).
- Merging EHR datasets can be difficult. Different vocabulary sets do not fully overlap (e.g., about 6% of key terms in SNOMED CT are not in ICD-10-CM/PCS), so translating the nomenclatures is not straightforward.
- Coding systems evolve over time (e.g., in the United States, ICD-10 officially replaced ICD-9 on October 1, 2015), so even a single multi-year EHR dataset may follow different coding structures.
- Vocabulary sets are vast and intricate. There are 141,000 ICD-10 codes, for example. This means researchers need to take care in determining which codes best represent phenomena of interest.
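Two of these complications, coding systems that change over time and vocabularies that only partly overlap, can be sketched in a few lines. The cutover date is the one given above; the crosswalk entries and code labels (`SRC-001`, `TGT-A`, and so on) are placeholders rather than real codes, and a real project would load a published mapping instead.

```python
from datetime import date

# Transition date stated in the text: ICD-10 replaced ICD-9 on October 1, 2015.
ICD10_START = date(2015, 10, 1)


def icd_version(service_date):
    """Guess which ICD revision a record uses from its date of service."""
    return "ICD-10" if service_date >= ICD10_START else "ICD-9"


# Hypothetical crosswalk between two vocabularies; the keys and values are
# placeholders, not real codes.
CROSSWALK = {"SRC-001": "TGT-A", "SRC-002": "TGT-B"}


def translate(codes, crosswalk):
    """Split codes into those with a mapping and those without one.

    Returns (mapped, unmapped): mapped is a dict of source code to target
    code, and unmapped lists the source codes that need manual review.
    """
    mapped = {}
    unmapped = []
    for code in codes:
        if code in crosswalk:
            mapped[code] = crosswalk[code]
        else:
            unmapped.append(code)
    return mapped, unmapped


if __name__ == "__main__":
    print(icd_version(date(2014, 3, 12)))
    print(translate(["SRC-001", "SRC-003"], CROSSWALK))
```

Returning the unmapped codes explicitly, rather than dropping them silently, keeps the gaps between vocabularies visible so that someone with clinical knowledge can decide how to handle them.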
The question now is: how can researchers minimize the time spent mastering different nomenclatures?
A big part of the answer involves publicly available tools that centralize information on the vocabulary sets and help translate terms. Primarily maintained by the National Library of Medicine (NLM), these tools include:
- Value Set Authority Center: a database of current versions of multiple vocabulary sets.
- MedlinePlus Connect: a web application that connects actual EHR data points to NLM documentation and other online resources.
- RxMix: a web application that connects APIs for RxNorm, RxTerms (a prescription writing and history nomenclature), and other medication terminology databases.
Looking forward, centralizing the vocabulary-transformation task at an institutional level or in a medical registry would ultimately benefit researchers. With these efforts centralized, a governance committee can ensure that the tools described above and other resources are properly used to turn patient-centric vocabulary sets into research-centric data dictionaries. Furthermore, these governance committees can bring in clinical data experts, like the Prometheus Research team, to make sure nomenclature issues are addressed as part of the larger data curation effort. Together, these measures will help keep researchers focused on actual research questions.