CAAD (Carotid Artery Atherosclerosis Disease)

Carotid artery atherosclerosis disease (CAAD) is measured in cases and controls using both structured data, including ICD diagnosis codes, and quantitative measurements of carotid stenosis based on Doppler and other imaging technologies.

The phenotype algorithm includes typical eMERGE pseudo code for implementing the structured data components of the algorithm, as well as a portable natural language processing (NLP) system used to extract percent stenosis measurements from imaging reports.


Suggested Citation

David Carrell. Group Health and UW. CAAD (Carotid Artery Atherosclerosis Disease). PheKB; 2013 Available from:


Estimates of CAAD cases across eMERGE sites based on the eMERGE Record Counter are:

Table 1. Counts of potential CAAD cases by eMERGE site.

Marshfield Clinic
Mayo Clinic
Mt. Sinai
Northwestern University
Vanderbilt University
All Sites

Source: eMERGE Record Counter, 1/14/2014

For details see document "Estimate of network-wide CAAD cases" in the Files section of this page.

Group Health recommends the following rule for selecting documents for CAAD NLP processing. This rule requires querying the full text of each possible procedure report to identify those that contain the minimally necessary textual content. Note that this assumes procedure reports can be distinguished from other types of clinical documents such as progress notes.

Rule: Process all procedure reports that contain text satisfying at least one of the following conditions:

1. Report contains text strings "carotid" and ("doppler" or "us" or "u/s" or "ultrasound").

2. Report contains text strings ("head" or "neck") and ("us" or "u/s" or "ultrasound" or "ct" or "tomography").

3. Report contains text strings ("catheter" or "ct" or "mr") and "angiography".
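
The three conditions above can be sketched as a simple pre-filter. This is a minimal Python illustration, not part of the CAAD NLP system, and the function names (`has_term`, `has_any`, `should_process`) are hypothetical. It assumes matching is case-insensitive and that the short abbreviations ("us", "u/s", "ct", "mr") should match only as whole tokens, so that "us" does not match inside a word such as "because". The rule itself only says "text strings", so a site may prefer plain substring matching.

```python
import re

# Assumption (not stated in the rule): short abbreviations are matched as
# whole tokens; longer terms are matched as case-insensitive substrings.
SHORT_TERMS = {"us", "u/s", "ct", "mr"}


def has_term(text, term):
    """Return True if `term` occurs in `text`, ignoring case."""
    lowered = text.lower()
    if term in SHORT_TERMS:
        # Require a non-alphanumeric character (or text boundary) on each side.
        pattern = r"(?<![a-z0-9])" + re.escape(term) + r"(?![a-z0-9])"
        return re.search(pattern, lowered) is not None
    return term in lowered


def has_any(text, terms):
    """Return True if at least one of `terms` occurs in `text`."""
    return any(has_term(text, t) for t in terms)


def should_process(report_text):
    """Apply the three-condition document selection rule to one report."""
    cond1 = has_term(report_text, "carotid") and has_any(
        report_text, ["doppler", "us", "u/s", "ultrasound"])
    cond2 = has_any(report_text, ["head", "neck"]) and has_any(
        report_text, ["us", "u/s", "ultrasound", "ct", "tomography"])
    cond3 = has_any(report_text, ["catheter", "ct", "mr"]) and has_term(
        report_text, "angiography")
    return cond1 or cond2 or cond3
```

A report that passes this filter is only a candidate for NLP processing; whether it yields a stenosis finding is decided by the NLP system itself.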

If you cannot distinguish procedure reports from other clinical notes (such as progress reports) you can still apply this rule to all documents. This will result in processing more documents than necessary, but is unlikely to yield misleading NLP results. This is because documents other than relevant procedure reports are highly unlikely to contain sufficient information to satisfy the NLP system's rules for identifying carotid stenosis findings.

A zip archive has been posted that includes SAS code illustrating how to identify CAAD cases and controls by post-processing the CAAD NLP system's results data.  The archive (listed above as "SAS code illustrating how to identify cases and controls from the CAAD NLP system's output") includes the SAS code as well as all data files that would be required and created if you processed the CAAD test documents shipped with the CAAD NLP system.  CAAD cases identified by NLP are defined as patients with at least one stenosis measurement greater than or equal to 50%.  CAAD controls identified by NLP are defined as patients whose maximum measured stenosis is less than or equal to 15%.  Keep in mind that cases and controls can also be defined by structured data alone (defined elsewhere).
-David Carrell
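
The NLP-based thresholds described in the post above (case: at least one measurement >= 50%; control: maximum measurement <= 15%) can be sketched as follows. This is an illustrative Python sketch, not the SAS code in the archive. The input layout (pairs of patient identifier and percent stenosis) and the function name `classify_by_nlp` are assumptions, and the label "neither" for patients in between is hypothetical.

```python
def classify_by_nlp(measurements):
    """Classify patients from (patient_id, percent_stenosis) pairs.

    Having at least one measurement >= 50 is the same as having a
    maximum measurement >= 50, so only the per-patient maximum is kept.
    """
    max_by_patient = {}
    for patient_id, percent in measurements:
        previous = max_by_patient.get(patient_id, percent)
        max_by_patient[patient_id] = max(previous, percent)

    status = {}
    for patient_id, max_percent in max_by_patient.items():
        if max_percent >= 50:
            status[patient_id] = "case"
        elif max_percent <= 15:
            status[patient_id] = "control"
        else:
            # Hypothetical label: measured, but meets neither NLP definition.
            status[patient_id] = "neither"
    return status
```

Patients with no stenosis measurement at all do not appear in the result; they would have to be handled by the structured-data criteria.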

We have changed the minimum health system contact criterion (criterion 6.1 in the pseudo code document).  The new criterion is more relaxed than the previous definition, yet still assures that there is sufficient contact with the patients to reduce the likelihood that CAAD is present in a patient but undocumented in the EMR (a threat to the NPV in our control selection algorithm).

SAS code for implementing the new definition is available upon request (email David Carrell).


Hi All,

We have posted the final version of the CAAD phenotype pseudo code (document dated 12/29/2014).  It includes several changes to criteria for defining both cases and controls.  Please let us know if you have any questions.

We included in the pseudo code document a table of counts of cases, controls, and excluded patients for GH/UW subjects broken out by age group.  We hope this table is useful for helping you assess whether the counts of cases and controls you generate are reasonably comparable to ours.  Of course, differences in setting (e.g., managed care versus university hospital) may also account for substantial differences in yield for cases and controls, as may other factors.

Also updated today are the User Guide, which explains how to install and use the CAAD NLP system, and a SAS code file that illustrates how data generated by the NLP system may be post-processed (using SAS) to generate the necessary person-level results for measured carotid stenosis.

-David Carrell

Is the requirement to exclude any subjects who ever have a max total chol level > 400 only for cases, or should we use this exclusion for the controls as well?

Only use the chol > 400 exclusion for CASES.  This is because we'd like to rule out high chol (> 400) as an explanation for the CAAD cases.  If a control has high chol and no CAAD that is fine--indeed, their genetic makeup may protect them from CAAD even in the face of high chol (which makes them even more interesting as a control).
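
As a small illustration of this answer, the exclusion can be written so that it only ever touches cases. This is a Python sketch under assumptions: the function name `apply_chol_exclusion`, the status labels, and the use of `None` for patients with no recorded total cholesterol are all hypothetical.

```python
CHOL_THRESHOLD = 400  # exclusion applies to cases whose max total chol exceeds this


def apply_chol_exclusion(status, max_total_chol):
    """Return the patient's status after the cholesterol exclusion.

    status         -- "case" or "control" (hypothetical labels)
    max_total_chol -- highest total cholesterol ever recorded, or None
    """
    if (status == "case"
            and max_total_chol is not None
            and max_total_chol > CHOL_THRESHOLD):
        return "excluded"
    # Controls are never excluded on cholesterol, however high it is.
    return status
```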


Excluding persons who had the imaging study done elsewhere:  8.2.1 says if they do not have evidence of ever having received an imaging study of the carotid.  Because the NLP is only looking at VU imaging, it seems like we would pick up people who had it elsewhere if we just look at the NLP results.  Do you have suggestions on how to ensure they haven't had the study elsewhere?  Would this be mentioned in the PL or other documents in some way that we should look for?


Criterion 8.2.1 (patients who ... "Do not have evidence (ever) of having received an imaging study of the carotid arteries") is particularly relevant when an eMERGE site has insurance plan data or HMO-style data on patients.  When these data are available we would want to exclude from the controls anyone who ever had a carotid imaging study--even if we do not know the results of that study--because the simple fact that the study was done suggests CAAD disease could be present, and we want to keep the control group as "clean" as possible from possible contamination by patients who have or may have CAAD.  At a site like Group Health/UW, we can know, through medical claims data, if one of our insured patients had a carotid imaging study performed by an external provider (even though we will not know the results of that procedure).  Criterion 8.2.1 was designed to exclude such a patient from the control group.  We also imagined that such data, if available, would be available as structured data via the local site's data warehouse; we did not intend for anyone to develop new NLP systems to discover such information from narrative text.

Our advice for sites without insurance plan/HMO style data is to use any available structured data that may indicate an imaging study was performed (perhaps from structured data for a procedure order to an external provider?) when assessing criterion 8.2.1.  Beyond that, you are free to be as clever as you are inclined to be in attempting to extract additional information from the clinical narrative (but we don't have expectations that you will do this).



We don't really have enrollment/contact data.  I was thinking of using instances of Documents where Documents is NOT Clinical Communications OR Administrative documents.

Would this work?  If not, what would be a better alternative?  Thanks.

That's a great idea--using the presence of clinical documents (related to patient care) in your document repository as evidence of "contact with the health care system."  Please use your judgment as to which source of information would best indicate patient contact within your health care system.  If one source of information (e.g., clinical documents) is likely to cast a wider net than another source (e.g., diagnosis and procedure code data from your data warehouse), then go with the one that casts a wider net.

If you decide to use your clinical documents as an indicator of contact then, yes, it's also a good idea to exclude documents that are less likely to reflect patient care encounters (i.e., excluding "clinical communications" and "administrative" documents, as you mention).


Today we post an updated version of the portable CAAD NLP system that addresses modifications designed to improve performance at Mayo Clinic.  We continue to offer two versions of the NLP system--one which checks each report for a relevant exam type description (e.g., "CAROTID ULTRASOUND") before processing the report, and one that assumes all incoming reports are relevant carotid studies and processes each report regardless of its content. 

The system posted today is the one that checks for exam type before processing each report.

The system posted today appears in the list of downloadable files as "Version 0.5.4 of CAAD Portable NLP System (checks exam type)."  The file that will be downloaded from this link is named ""

We believe these modifications will improve performance at Mayo without degrading performance elsewhere.  (We have therefore hidden the previous version of this NLP system but it is still stored on PheKB and we can make it available if anyone wants it.)

As always, please reach out with any questions or issues you encounter with the CAAD NLP system or pseudo code.

For the remainder of April 2015 please make sure you include David Carrell in your communications as David Cronkite will be out of the country during this time.



Hi All,

Today we upload to PheKB an updated version of the CAAD NLP system.  It appears in the list of downloadable files as "Version 0.5.5 of CAAD Portable NLP System (checks exam type)."  It differs from the previous version in that it now checks for and disqualifies results that are reported for VERTEBRAL ARTERIES or results from CORONARY ANGIOGRAPHY procedures.

As always, please reach out with any questions or issues you encounter with the CAAD NLP system or pseudo code.

For the remainder of April 2015 please make sure you include David Carrell in your communications as David Cronkite will be out of the country during this time.



A question has come up about the minimum contact criteria (section 6.1 of the pseudo code).  We are not changing this specification.  Rather, we are attempting to make the description clearer.

Regarding the criteria establishing minimum contact with the health care system (sec. 6.1), these three criteria are applied within the entire 5-year (or ~1,826-day) period preceding a patient’s last known contact date; these criteria are NOT applied within individual calendar years.  This 5-year period ends on the date of the patient’s most recent contact with the health system, and extends back in time to the date 5 years before that (making a ~1,826 day period).  Within this period test for each of the three conditions:

1) patient has any 2 encounters >= 365 days apart in the relevant ~1,826-day period (e.g., 3/19/2012 and 1/22/2014 – the dates do not have to be in the same calendar year), OR

2) patient has encounters in any 3 separate calendar quarters in the relevant ~20-quarter period (for example, Q1 2013, Q3 2013 and Q4 2013 qualifies, but so does Q4 2010, Q3 2012, and Q1 2014 – the quarters do not have to be in the same calendar year), OR

3) patient has encounters in any 4 separate calendar months in the relevant ~60-month period (for example, Jan 2012, Feb 2012 and Nov 2014 could be three of the four – again, the months do not have to be in the same calendar year).

If any one of the above conditions is met the patient meets the minimum contact criterion.

The following SAS code illustrates this logic.  Here, “adate” is the name of the encounter date variable and “mrn” is a unique patient identifier.

  **  Implement the continuous enrollment requirement (6.1 in pseudo code).
  **    1) Identify the most recent encounter date for each pt.
  **    2) Using the most recent encounter date, get all encounter dates for
  **       the preceding 1,826 days (5 yrs).
  **    3) Calculate the number of days between min and max dates in #2 (and
  **       if >= 365 then set flag_6_1_1 = 1 and 0 otherwise).
  **    4) Add a column to the data from #2 with the encounter date
  **       formatted as SAS format YYQD. (e.g., 2012-1).
  **    5) Add another column to the data from #2 with the encounter date
  **       formatted as SAS format MONYY7. (e.g., DEC2012).
  **    6) Count distinct values of YYQD. (and if >= 3 then set flag_6_1_2 =
  **       1 and 0 otherwise).
  **    7) Count distinct values of MONYY7. (and if >= 4 then set flag_6_1_3 =
  **       1 and 0 otherwise).
  ** ;

%let lookback=1826;

proc sql ;

  /* Steps 1-2: latest encounter per patient and start of the lookback window */
  create table local.caad_01_latest_enc as
  select       b.mrn
             , max(a.adate)            as adate_max            format date9.
             , max(a.adate)-&lookback  as adate_max_minus_5yrs format date9.
  from         em2data.em1_em2_nwigm_all_demog b
               left outer join __vdw.utilization a
                 on a.mrn = b.mrn
  group by     b.mrn
  order by     b.mrn
  ;

  /* Steps 2, 4, 5: encounters inside the window, with quarter and month labels */
  create table local.caad_01_encs_past_5yrs as
  select       put(a.adate, YYQD.)   as adate_yr_qtr
             , put(a.adate, MONYY7.) as adate_yr_mon
             , a.adate               as adate_date   format date9.
             , a.*
  from         __vdw.utilization a
               inner join local.caad_01_latest_enc b
                 on b.mrn = a.mrn
  where        a.adate between b.adate_max_minus_5yrs and b.adate_max
  order by     a.mrn
             , a.adate
  ;

  /* Steps 3, 6, 7 (part 1): per-patient date range and distinct quarter/month counts */
  create table caad_01_encs_summary as
  select       a.mrn
             , max(a.adate) as adate_max
             , min(a.adate) as adate_min
             , count(distinct a.adate_yr_qtr)            as adate_yr_qtr_N
             , count(distinct a.adate_yr_mon)            as adate_yr_mon_N
  from         local.caad_01_encs_past_5yrs a
  group by     a.mrn
  order by     a.mrn
  ;

  /* Steps 3, 6, 7 (part 2): one flag per condition, plus the overall OR of the three */
  create table local.caad_01_encs_summary as
  select       mrn
             , case when adate_max - adate_min >= 365
                      or adate_yr_qtr_N >= 3
                      or adate_yr_mon_N >= 4
                    then 1
                    else 0
                    end as Flag_Meets_Min_Contact
             , adate_max - adate_min as max_days_bn_encs
             , adate_yr_qtr_N
             , adate_yr_mon_N
             , case when adate_max - adate_min >= 365 then 1
                    else 0
                    end as flag_6_1_1
             , case when adate_yr_qtr_N >= 3 then 1
                    else 0
                    end as flag_6_1_2
             , case when adate_yr_mon_N >= 4 then 1
                    else 0
                    end as flag_6_1_3
  from         caad_01_encs_summary
  order by     mrn
  ;
quit ;

proc freq data=local.caad_01_encs_summary ;
    table Flag_Meets_Min_Contact ;
run ;

proc freq noprint data=local.caad_01_encs_summary ;
    table flag_6_1_1    *
          flag_6_1_2    *
          flag_6_1_3
    /missing out=mylist ;
run ;

proc print data=mylist(drop=percent);
    sum count;
run ;

proc contents data=local.caad_01_encs_summary varnum ;
run ;
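
For sites that do not use SAS, the same minimum contact logic can be sketched in Python. This is a minimal sketch under assumptions, not a translation distributed with the phenotype: the function name `meets_min_contact` is hypothetical, and the input is assumed to be the list of encounter dates for a single patient.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 1826  # ~5 years, the lookback window described in sec. 6.1


def meets_min_contact(encounter_dates):
    """Return True if one patient's encounters meet any of the three conditions."""
    if not encounter_dates:
        return False

    # The window ends at the most recent encounter and extends back ~5 years.
    last = max(encounter_dates)
    start = last - timedelta(days=LOOKBACK_DAYS)
    window = [d for d in encounter_dates if start <= d <= last]

    # Condition 1: any two encounters at least 365 days apart,
    # which holds exactly when the earliest and latest are that far apart.
    cond1 = (max(window) - min(window)).days >= 365

    # Condition 2: encounters in at least 3 distinct calendar quarters.
    quarters = {(d.year, (d.month - 1) // 3 + 1) for d in window}
    cond2 = len(quarters) >= 3

    # Condition 3: encounters in at least 4 distinct calendar months.
    months = {(d.year, d.month) for d in window}
    cond3 = len(months) >= 4

    return cond1 or cond2 or cond3
```

For example, `meets_min_contact([date(2012, 3, 19), date(2014, 1, 22)])` is true through the first condition, because the two dates are more than 365 days apart.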

Hope this helps,


Hi All,

I just posted a minor edit to the pseudo code, clarifying that the ICD-9 diagnosis codes for CAAD include 433.1, 433.10, and 433.11.  In the 12/29/2014 pseudo code document the ICD codes are mentioned in two places, but in one of the places only code 433.1 (with an implicit * following the last digit) was listed.  Now all three qualifying codes (433.1, 433.10, and 433.11) are listed in both places.  The pseudo code document date remains the date it was when the final pseudo code was posted (12/29/2014) and a note has been added to the title page acknowledging the edit.

-David Carrell
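
The three qualifying ICD-9 codes named in the post above can be checked with an explicit set rather than a wildcard. This is a minimal Python sketch; the function name `has_caad_icd9` is hypothetical, and it assumes codes are stored as strings with the decimal point included.

```python
# The three qualifying ICD-9 diagnosis codes listed in the pseudo code edit.
CAAD_ICD9_CODES = {"433.1", "433.10", "433.11"}


def has_caad_icd9(diagnosis_codes):
    """Return True if any of the patient's codes is a qualifying CAAD code."""
    return any(code.strip() in CAAD_ICD9_CODES for code in diagnosis_codes)
```

An explicit set avoids accidentally matching neighbouring codes such as 433.2 or 433.30, which a loose prefix match on "433" would pick up.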

We are working on implementing the CAAD algorithm.  The portable NLP does not work well on some reports that include both the result and text describing mild, moderate, and severe stenosis with percentages, because the NLP captures this text as 3 results. Do you have a method to find the result in such a report?

Beth Karlson



Just now seeing this message--apologies for the long delay!  Thank you for sharing with us, by email, two examples of reports where descriptions of reference ranges and a list of abbreviations were misinterpreted as stenosis measurements (along with the correct finding of no stenosis, described normatively). We will follow up to your email with some ideas on how to deal with this, as well as some additional questions.



Hi Marshfield Friends,

While analyzing the CAAD phenotype data, the GH/UW geneticists noticed that the field CAAD_INDEX_DATE in the SUBJECT file is missing for all 505 subjects from the Marshfield eMERGE cohort (table below).  From your email below (6/25/2015) I am guessing that you do not have IRB approval to provide such dates as MM/DD/YYYY.  Can you please provide CAAD_INDEX_DATE as the year (YYYY) only?  It is an important measure for the analysis.

Table 1. Marshfield CAAD phenotype subjects missing CAAD_INDEX_DATE

Subject type      Missing CAAD_INDEX_DATE
Control type 1    35
Control type 2    426
Case type 1       36
Case type 2       8
Total             505

From the pseudo code document,  here is the definition of CAAD_INDEX_DATE, modified for the situation at Marshfield:

For case definition #1, case definition #2, and control definition #1, the index date is the year of the earliest qualifying stenosis measurement.  For control definition #2, the index date is the year of the data pull (or the year of death or loss to follow up if the patient is deceased or lost to follow up at the time of the data pull).

Please let me know if it would be helpful to discuss this, or if you would like me to send you the file name of the file you sent.
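
The year-only definition of CAAD_INDEX_DATE given above can be sketched as follows. This is an illustrative Python sketch under assumptions: the function name `caad_index_year`, the definition labels ("case1", "case2", "control1", "control2"), and the argument layout are hypothetical, and the caller is assumed to have already identified which stenosis measurements qualify.

```python
from datetime import date


def caad_index_year(definition, qualifying_measurement_dates,
                    data_pull_date, end_of_followup_date=None):
    """Return CAAD_INDEX_DATE as a year (YYYY) for one subject.

    definition                   -- "case1", "case2", "control1" or "control2"
    qualifying_measurement_dates -- dates of qualifying stenosis measurements
    data_pull_date               -- date of the data pull
    end_of_followup_date         -- date of death or loss to follow up, if any
    """
    if definition in ("case1", "case2", "control1"):
        # Year of the earliest qualifying stenosis measurement.
        return min(qualifying_measurement_dates).year
    if definition == "control2":
        # Year of death or loss to follow up if known, else year of the data pull.
        reference = end_of_followup_date or data_pull_date
        return reference.year
    raise ValueError("unknown definition: " + repr(definition))
```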