Electronic Health Record-based Phenotyping Algorithm for Familial Hypercholesterolemia

Familial hypercholesterolemia (FH) is a relatively common Mendelian genetic disorder that is associated with elevated plasma low-density lipoprotein cholesterol (LDL-C) levels and a dramatically increased lifetime risk of premature atherosclerotic cardiovascular disease (ASCVD). FH can be diagnosed based on clinical presentation and/or genetic testing results, with a positive genetic test considered to be the “gold standard”. Clinical diagnosis is based on a set of clinical criteria including lipid panel testing, personal and family history of hypercholesterolemia or premature ASCVD, presence of xanthomas on extensor tendons or thickening of the Achilles tendon, and early corneal arcus. We provide pseudocode to identify cases and controls for primary hypercholesterolemia and then for FH. Structured data are processed using preset codes, and unstructured data are processed using natural language processing (NLP). The final output consists of (i) a case/control/unknown status for primary hypercholesterolemia, (ii) demographics of each individual (age at the time of qualifying LDL-C ascertainment, gender, race/ethnicity), (iii) lipid profile (total cholesterol, LDL-C, HDL-C, triglycerides), (iv) lipid-lowering treatment and the difference in time between the index date and the date of treatment ascertainment, (v) personal history of premature ASCVD and/or hypercholesterolemia, (vi) family history of premature ASCVD, (vii) xanthomas and/or early corneal arcus, and (viii) Dutch Lipid Clinic Network (DLCN) score and case/control/unknown status for FH.

Date Created: 
Thursday, November 10, 2016

Suggested Citation

Safarova MS, Liu H, Arruda-Olson A, Rastegar M, Smith C, Cheng Y, Fan X, Balachandran P, Sohn S, Kullo IJ. Mayo Clinic. Electronic Health Record-based Phenotyping Algorithm for Familial Hypercholesterolemia. PheKB; 2016. Available from: https://phekb.org/phenotype/602



    At Partners HealthCare, our clinical notes are not structured in a standard way, and it is challenging to use NLP to determine family history. We would like to know more about your NLP methods.

    1. Does the program require that notes be in a specific format? Do the notes have to include a family history section?

    2. Are the notes coming from a particular software program?

    3. Did the validation site have the same formatting in their notes?

    4. How does the Java program deal with unstructured notes in other formats? If so, has this been tested and does it work?

    Thanks, Beth Karlson





    Dear Beth, Thank you for this feedback. Please see our comments below. Maya.

    1.       Does the program require that notes be in a specific format? Do the notes have to include a family history section?
    - NLP is run using MedTagger. Per “FH_eAlgorithm_Pseudocode_FullText_2016”: A link to the installation and user guides can be found here:
    There is no specific requirement regarding the format of the patient notes (free text).
    At the primary site, Mayo Clinic, we used solely the “Family History” section of clinical notes.
    Please see “VALIDATION OF THE FH eALGORITHM IN THE GEISINGER HEALTH SYSTEM_2017” regarding the feedback from the validation site: “In selected cases based on the adopted strategy to record encounters in the index implementation center, search space for the family history of early-onset ASCVD could be expanded to the “Personal|Past Medical History”.”

    2.       Are the notes coming from a particular software program?
    Given the diversity of medical language, we advise adapting the NLP system to the strategy used to record encounters at the implementation site.
    Regardless of the EHR vendor, free text within generated clinical notes is amenable to MedTagger.

    3.       Did the validation site have the same formatting in their notes?
    Since primary and validation sites used different EHR vendors, there were differences in the baseline formatting of the clinical notes.

    4. How does the Java program deal with unstructured notes in other formats? If so, has this been tested and does it work?
    Given that the input is text per se from the sections relating to an individual's FHx, there should not be any issue regarding formatting. A list of references that may be helpful can be found here: http://ohnlp.org/index.php/OHNLP_Publications

    For sites like ours that are not able to distinguish fasting from non-fasting glucose tests, can we just use a regular glucose test as the exclusion (Table 2A), and if so, what would the cutoff be (still >220 mg/dl)?

    Here are some examples that include genderless terms (child, kid, sibling, etc.):

    For family history of premature ASCVD: Patient has three siblings, one of whom had a stroke at age 56.

    For family history of premature CHD: The patient has two siblings total one died at age 49 of heart disease and another is living at age 70.

    Maya, Thanks for your response to my initial question. In our EHR, we do not have a separate family history section. Does your program work if there is no clear designation of a section?  Have any of the other sites implemented your program in unstructured notes? Thanks, Beth

    Hi Beth, In EHRs with no designated family history section (structured or unstructured), this system will scan the whole text. In this scenario, we anticipate an increased probability of false-positive results.
    However, we have no feedback to share yet from sites with EHRs without note sections/types. We look forward to learning from your experience. P.S. One option to consider is demarcating a search space, e.g. requiring keywords, negation --> age brackets and relatedness to occur within 2-3 sentences of each other. Thank you! Maya
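The idea of restricting the search space to a few neighbouring sentences can be sketched roughly as follows. This is a minimal illustration only, not MedTagger's logic: the keyword lists, the one-sentence-either-side window, and the function name `family_history_candidates` are all assumptions made for the example, and a real implementation would need far richer dictionaries and negation handling.

```python
import re

# Illustrative (assumed) keyword lists; a site would replace these with its own dictionaries.
RELATIVE = re.compile(
    r"\b(?:mothers?|fathers?|brothers?|sisters?|siblings?|sons?|daughters?|child|children|kids?)\b",
    re.I,
)
EVENT = re.compile(r"\b(?:stroke|heart disease|heart attack|myocardial infarction)\b", re.I)
NEGATION = re.compile(r"\b(?:no|not|denies|without|negative for)\b", re.I)
AGE = re.compile(r"\bage\s+(\d{1,3})\b", re.I)


def family_history_candidates(text):
    """Return event mentions that have a relative term within one sentence either side."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    hits = []
    for i, sentence in enumerate(sentences):
        event = EVENT.search(sentence)
        # Skip sentences without an event keyword or with a negation cue.
        if event is None or NEGATION.search(sentence):
            continue
        # Search window: the event sentence plus its immediate neighbours.
        window = " ".join(sentences[max(0, i - 1): i + 2])
        relative = RELATIVE.search(window)
        if relative is None:
            continue
        hits.append({
            "relative": relative.group(0).lower(),
            "event": event.group(0).lower(),
            "ages": [int(a) for a in AGE.findall(window)],
        })
    return hits
```

The extracted ages could then be compared with the sex-specific age brackets for premature ASCVD; that step is left out here because genderless terms such as "sibling" do not determine which bracket applies.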

    Does MedTagger only run against files, or can it be run against a database as well? In reading the User Guide, it looks as though it only runs against either a single file or multiple files. Is that correct?

    Thanks, Barbara

    Please note that the data dictionary (dd) for demographics has been updated. 'Index date' was removed. Two additional variables pertinent to LDL and the DLCN score were added. Please let me know if you have any questions. Thanks.

    Execution steps:
    (i) Download version 1.0.1 from https://sourceforge.net/projects/ohnlp/files/MedTagger/
    (ii) Follow the installation guide available at http://ohnlp.org/index.php/MedTagger_Install_Guide
    MedTagger is an open source UIMA (Unstructured Information Management Architecture)-based NLP tool. Hence, it does require UIMA experience.
    RunMedTaggerCVD.bat file: java -Xms512M -Xmx1024M -cp resources;desc;descsrc;MedTagger-1.0.1.jar org.apache.uima.tools.cvd.CVD 

    1. Input:
    - The NLP part of the eAlgorithm can use any clinical notes available in the EHR. At the Mayo site, we used predominantly the “family history” section of the EHR along with PPI (patient provided information, which is a structured source). We started by scanning all note types and sections. However, in this quasi-experiment we found that focusing on the FHx section only is sufficient. This improved the accuracy of case detection, reduced the noise, and shortened the processing time. Local practices of recording patient notes certainly differ, so the search space should be tailored at the site level.

    2. Family history
    - For MedTagger, the input can be one file encompassing all notes as well as multiple files per targeted sections. The note does not have to be structured. Sectionizing is not obligatory. However, access to the section numbers/IDs improves accuracy. Notes should be decrypted. Concept detection with MedTagger occurs through converting free text terms to normalized ones. It allows indexing based on dictionaries. Absence of a family history section per se does not preclude pertinent family member data extraction. 

    3. NLP for personal history or physical examination. 
    Please check this file https://phekb.org/sites/phenotype/files/FH_eAlgorithm_Pseudocode_FullTex... --> Figure 1: PE: Algorithm 5&6; PHx of ASCVD Algos 3&4 (pp. 21-26). A different visual depiction could be found here: https://phekb.org/sites/phenotype/files/Appendix_1_0.pdf 

    4. MedTagger and SQL Server
    MedTagger requires physical files.
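
For sites that keep notes in a relational database, one workaround is to export each note to its own text file before running MedTagger. The sketch below assumes a hypothetical table `notes(note_id, note_text)` and uses Python's built-in `sqlite3` module purely as a stand-in; for SQL Server, a driver such as pyodbc would presumably take its place. The function name `export_notes` is illustrative as well.

```python
import sqlite3
import tempfile
from pathlib import Path


def export_notes(conn, out_dir):
    """Write each row of a hypothetical notes(note_id, note_text) table to its own .txt file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for note_id, note_text in conn.execute("SELECT note_id, note_text FROM notes"):
        (out / f"{note_id}.txt").write_text(note_text, encoding="utf-8")
        count += 1
    return count


# Demo: an in-memory database standing in for the real note store.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE notes (note_id INTEGER PRIMARY KEY, note_text TEXT)")
demo_conn.executemany(
    "INSERT INTO notes VALUES (?, ?)",
    [
        (1, "Patient has three siblings, one of whom had a stroke at age 56."),
        (2, "Mother is alive and well."),
    ],
)
demo_dir = tempfile.mkdtemp()
exported = export_notes(demo_conn, demo_dir)
```

The resulting directory can then be given to MedTagger as a set of input files.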

    The time frame for the medication search is 1 year prior to the index date (date of the qualifying LDL), excluding the six weeks immediately before the index date. If there is an order/prescription within this time frame --> label the person as ON LLT --> make an LDL correction assuming a 30% reduction in LDL on a statin ("Recalc LDL" variable). If no statin or other lipid-lowering drug from the list is identified, use the LDL level as-is (still recorded as the "Recalc LDL" variable per the demographics dd). To report the meds in the dd, pull the one closest to the index date and give preference to drugs from the statin class.

    Recalc LDL cannot contain 0 values. Please include uncorrected LDL levels if not on LLT within the prespecified time interval. Thanks!
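
The medication window and the "Recalc LDL" rule can be sketched as below. Two points are assumptions rather than statements from the pseudocode: the window boundaries are treated as inclusive, and "a 30% reduction" is read as measured LDL = untreated LDL x (1 - 0.30), so the correction divides by 0.7. The function and constant names are illustrative.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 365       # "1 year prior to the index date"
EXCLUDED_DAYS = 42        # "six weeks immediately before the index date"
ASSUMED_REDUCTION = 0.30  # assumed LDL reduction on lipid-lowering therapy


def on_llt(index_date, order_dates):
    """True if any order/prescription date falls inside the search window."""
    window_start = index_date - timedelta(days=LOOKBACK_DAYS)
    window_end = index_date - timedelta(days=EXCLUDED_DAYS)
    return any(window_start <= d <= window_end for d in order_dates)


def recalc_ldl(measured_ldl, index_date, order_dates):
    """Return the "Recalc LDL" value in the same units as measured_ldl."""
    if on_llt(index_date, order_dates):
        # Assumption: back-calculate the untreated level from a 30% reduction.
        return measured_ldl / (1 - ASSUMED_REDUCTION)
    # Not on LLT in the window: keep the uncorrected value (never 0 or missing).
    return measured_ldl
```

For example, a measured LDL of 140 mg/dl with a statin order five months before the index date would be recorded as about 200 mg/dl, while the same value with no qualifying order stays at 140 mg/dl.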

    Is there a useful implementation of this algorithm that does not require NLP and relies exclusively on structured data? We ask because NLP development work for this phenotype at KPW would be prohibitively time consuming. This is because 1) KPW notes lack regular section headings or cues that would facilitate identifying family history documentation, and 2) the FH NLP system provided only works with clinical notes stored as individual text files, which adds additional work for sites like KPW that store clinical notes in a relational database. (We note that other phenotype-specific NLP systems for some prior eMERGE phenotypes have accommodated input from databases as well as individual files.)



    Submitted by Xiao Fan on

    Hi David,

    Sure, we totally understand the difficulty of running NLP. You can disregard the NLP component without hurting the algorithm much.

    Basically, the FH algorithm calculates a DLCN score, which is composed of an LDL score, a personal history score, a family history score, and a physical examination score. NLP is involved in searching personal history and family history. If you cannot run NLP, personal history can still be checked using ICD codes. The points from family history will be missed, which cannot be avoided in that case.

    Please feel free to post any other questions. Thank you for implementing the algorithm.

    We understand, as you say, that "personal history can still be checked using ICD codes".  Are there also ICD codes for FAMILY history?  If so, please post.  Also, please clarify which ICD code you want us to include for "personal history."

    Thank you,


    Hi David,

    1. Please see “Input to the eAlgorithm for familial hypercholesterolemia” in the Flowchart_Electronic Health Record-based Phenotyping Algorithm for Familial Hypercholesterolemia.

    To identify personal history of CHD and/or CVD / PAD please start with a set of ICD codes available in Table 4 in https://phekb.org/sites/phenotype/files/FH_eAlgorithm_Pseudocode_FullTex...
    and Table 4 in https://phekb.org/sites/phenotype/files/Map_ICD9_2_ICD10_CS_MSS_03222017...

    General remarks: Premature ASCVD case status is defined by the presence of two or more pertinent diagnosis and/or procedural codes in the EHR before age 56 in men and 66 in women. The two or more corresponding codes should be found during the same time frame and before the gender-specific age cut-offs, with at least 5 days separating the two codes. Assigned codes should be evaluated at discharge from each encounter during the surveillance period.
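
A minimal sketch of this code-based rule is shown below. It assumes the qualifying code dates have already been filtered to the pertinent ICD/procedure codes, and it does not model the "same time frame" or discharge-evaluation conditions, whose exact definition is in the pseudocode document. The function names are illustrative.

```python
from datetime import date

# Sex-specific age cut-offs from the remarks above (codes must occur before these ages).
AGE_CUTOFF = {"M": 56, "F": 66}
MIN_GAP_DAYS = 5


def age_on(birth_date, when):
    """Age in completed years on a given date."""
    had_birthday = (when.month, when.day) >= (birth_date.month, birth_date.day)
    return when.year - birth_date.year - (0 if had_birthday else 1)


def premature_ascvd(sex, birth_date, code_dates):
    """True if two qualifying code dates before the cut-off age are at least 5 days apart."""
    cutoff = AGE_CUTOFF[sex]
    early = sorted(d for d in code_dates if age_on(birth_date, d) < cutoff)
    # The earliest and latest qualifying dates give the widest possible gap.
    return len(early) >= 2 and (early[-1] - early[0]).days >= MIN_GAP_DAYS
```

A man born in 1960 with codes on 2010-03-01 and 2010-03-10 would qualify under this sketch; the same codes two days apart, or with the second code recorded at age 57, would not.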

    To increase the sensitivity (SN) and specificity (SP) of the case-control status identification, please refer to Algorithms 3 and 4 for the NLP logic in https://phekb.org/sites/phenotype/files/FH_eAlgorithm_Pseudocode_FullTex...

    Should any challenges occur with NLP implementation at your site, please proceed with code logic.

    2. To identify family history of ASCVD (=CHD and/or CVD / PAD) or FHx of hypercholesterolemia, we utilized NLP. ICD codes could also be used, with less accuracy. Please see page 26 in https://phekb.org/sites/phenotype/files/Map_ICD9_2_ICD10_CS_MSS_03222017...

    Family history: The following ICD codes will return 1 or 0 for the FHx component:
    ICD9 code V17.3 for Family history of ischemic heart disease.
    ICD10 code Z82.49 for Family history of ischemic heart disease and other diseases of the circulatory system
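
As a minimal sketch, the 1/0 family-history flag from these two codes could be computed as follows. The function name and the assumption that codes arrive as strings with the decimal point included are illustrative, not part of the published algorithm.

```python
# The two family-history codes listed above.
FHX_CODES = {"V17.3", "Z82.49"}


def fhx_component(patient_codes):
    """Return 1 if any of the patient's ICD codes is a listed FHx code, else 0."""
    normalized = {code.strip().upper() for code in patient_codes}
    return 1 if normalized & FHX_CODES else 0
```

Sites that store codes without the decimal point would need to normalize them before the comparison.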

    Thank you.


    Please explain whether the information provided in the pseudocode document under "PPI (Table 5D)" is unique to Mayo Clinic. If it is, please explain how you would like us to implement something similar relevant to our setting (assuming a simple translation is feasible).



    PPI, another section available in our EHR platform (a GE system at the time of the development and implementation of the eAlgo), was utilized as a structured data source. We found that, since FHx is not recorded systematically by health care providers, PPI could be leveraged for this purpose.
    Which EHR system are you currently using? Is there any section within your system that is filled out by the patient and contains any information relevant to FHx? If you scan these data into the chart, are they transcribed?

    Thanks, Maya