Diverticulosis and Diverticulitis

An algorithm for finding patients with diverticulosis and, among those, patients who also have diverticulitis, as well as control patients.  Control patients will have had a colonoscopy but have no evidence of diverticula.

Simple NLP of colonoscopy reports is the gold standard algorithm (a portable program is posted here, with instructions, and support is available from NU as needed). If the text of colonoscopy reports is not available, an alternate algorithm using CPT & ICD-9 codes, which is also posted, can be used.


Suggested Citation

Jennifer Allen Pacheco, Will Thompson. Northwestern University The Feinberg School of Medicine. Diverticulosis and Diverticulitis. PheKB; 2012. Available from: https://phekb.org/phenotype/92



To assist sites that need help w/ NLP, by the end of this week we will post an executable KNIME workflow into which you can simply plug your colonoscopy (& abdominal CT scan) text reports (either from flat files or by connecting to your database) to run the NLP-based algorithm.

Questions for all sites running this algorithm:

1.1) Do you need procedure codes in order to find the colonoscopies & abdominal CT scan text reports? 

1.2) If so, which codes do you use (CPT &/or ICD-9 procedure codes)?

Codes have been uploaded for those sites that need them.

The KNIME workflow & accompanying Java NLP code files are in .zip files, which currently are not allowed to be uploaded to PheKB.  We are working w/ the Coordinating Center on a solution to this.

In the meantime if you wish to use our KNIME workflow, you can email me and you can also start by completing steps 1 & 2 of the attached KNIME instructions.



Addition made to the ICD-9 code list for diverticulitis.  2 optional repeating variables added for complications of diverticulosis.

The list of CPT codes has been shortened to only include colonoscopies and CT scans, as those are the only tests the algorithm was built to use.

Thanks to the Coordinating Center, the KNIME workflow and accompanying Java library are also uploaded now.  The instructions and accompanying workflow data dictionary have also been updated to be more precise.

A special thanks to Geisinger as well: I copied some of the nice formatting & instructions Geisinger used for their AAA KNIME workflow, which I think make the workflow easier to follow.

Please email me if you have any questions.

The KNIME workflow & instructions were updated only to add some nodes for those who are unfamiliar with KNIME. The essence of the workflow is not really changed, so you only need to download the new workflow & instructions if you are having problems.  Specifically, I added the types of data reader nodes you can use and, based on some feedback from Geisinger, added a Cache node to reduce the number of connections from Input to Process from 2 to 1, & a String to Number node for those who have character patient IDs.

Thanks to Geisinger for the feedback based on their successful run of the workflow.


I'm trying to help GHC import a SAS data set into KNIME.  KNIME has a lab extension node that seems to easily allow this, but when I test it and when GHC tests it, we get different errors.

Has anyone tried to import a SAS file into KNIME?

Or has anyone exported their colonoscopy data into a delimited* text file & imported it into KNIME?

(*Delimiters in KNIME for text files are limited to commas, tabs, spaces, or semicolons. If any of those characters appear in your data set, you must put single or double quotes around the text; if there are quotes in your text, you must either choose a different character to enclose your text or first put an escape character (\) in front of the quotes inside your text.)
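For sites exporting reports outside of SAS, the quoting and escaping described above can be sketched with Python's standard csv module. This is only an illustration under assumptions: the column names patient_id and report_text and the sample row are hypothetical (text_type is the field name from the workflow data dictionary), and the reader on the KNIME side would need to be configured to match whichever quote and escape characters you choose.

```python
import csv
import io


def write_reports(rows, out, delimiter=","):
    """Write report rows as delimited text with every field quoted."""
    writer = csv.writer(
        out,
        delimiter=delimiter,
        quotechar='"',
        quoting=csv.QUOTE_ALL,  # enclose every field in double quotes
        doublequote=False,      # do not write "" for an embedded quote...
        escapechar="\\",        # ...write \" instead
        lineterminator="\n",
    )
    # Hypothetical header; only text_type is named in the data dictionary.
    writer.writerow(["patient_id", "text_type", "report_text"])
    writer.writerows(rows)


# Made-up report text containing commas, a semicolon, and double quotes.
rows = [("1001", "Colonoscopy", 'Sigmoid colon, "scattered" diverticula; no bleeding')]

buf = io.StringIO()  # stand-in for open("reports.csv", "w", newline="")
write_reports(rows, buf)
exported = buf.getvalue()
print(exported)
```

Because every field is enclosed in quotes, the commas and semicolon inside the report text are not treated as delimiters, and the embedded double quotes are written with a backslash in front of them.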

If so, could you please contact me?



Due to feedback from GHC, the KNIME workflow has been updated to include options for inputting data from SAS or Excel files, and the instructions have been updated on how to do so.  Also, there was a slight error in the KNIME workflow data dictionary:  text_type is the correct spelling of the input field that contains your report text type (CT or Colonoscopy).

Hello all,

Just a quick question regarding the timeframes of colonoscopy. Let's say we have a patient who has a colonoscopy without any mention of diverticulosis but this was performed 10 years ago.

Would this patient still qualify as a control, or should he/she be excluded, as he/she may have developed diverticulosis in the interim? I note that there are no timeframes regarding the timing of colonoscopy when assigning controls. That does not apply to cases, as once they have diverticulosis/itis, it's permanent.

Any thoughts about this would be appreciated.

Thanks a lot!


Due to feedback from a couple sites, the KNIME workflow, Java NLP libraries, instructions, and accompanying KNIME data dictionary have been updated as follows:

+ The NLP has been updated to disregard procedure report sections, so that any mention anywhere in a report is counted and so that differences in section headers between sites are no longer an issue (and to not include unnecessary code (=smaller .zip file)).  Re-configure the NLP node by removing all the old libraries & importing the new ones.

+ The calculation of the 2 age variables from the data dictionary, which come from the procedure report dates, has been added; therefore, you now need to add the dates of the procedures to the input Reader node and, in a new separate input Reader node, the birth date for each patient (in order to calculate the ages).
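As a rough illustration of the age calculation (not the workflow's actual KNIME nodes), the 2 age variables could be derived from a birth date and the report dates as sketched below. The function names are hypothetical, ages in completed years are an assumption (check the data dictionary for the required unit), and treating age first Dx as the age at the earliest positive report is also an assumption.

```python
from datetime import date


def age_in_years(birth_date, event_date):
    """Completed years between birth_date and event_date."""
    years = event_date.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the event year.
    if (event_date.month, event_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def patient_ages(birth_date, colonoscopy_dates, positive_report_dates):
    """Return the 2 ages for one patient; None when no date is available."""
    last_colonoscopy = max(colonoscopy_dates) if colonoscopy_dates else None
    first_dx = min(positive_report_dates) if positive_report_dates else None
    return {
        "age_last_colonoscopy": (
            age_in_years(birth_date, last_colonoscopy) if last_colonoscopy else None
        ),
        "age_first_dx": age_in_years(birth_date, first_dx) if first_dx else None,
    }


# Made-up patient: born mid-1950, two colonoscopies, both positive.
ages = patient_ages(
    date(1950, 6, 15),
    [date(2005, 3, 1), date(2010, 9, 1)],
    [date(2010, 9, 1), date(2005, 3, 1)],
)
print(ages)
```

The dates are taken with max and min rather than by position, so the result does not depend on the order in which the reports are listed.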

RE: the previous comment on whether to count patients whose last colonoscopy was 10 yrs ago: we discussed this and decided that it's not worth worrying about, so you can leave those pts as controls. This is partly because some patients, esp. controls, won't have had a recent colonoscopy if their last colonoscopy result was OK.  Also, we recognize that many of our EMRs do not necessarily have all of a given patient's medical data, so we have to work with the data that we do have.  As age at last colonoscopy and year &/or decade of birth are both variables, we can decide later, if necessary, to throw out those w/ colonoscopies > x # yrs ago if we find there are many of them and it seems to affect our analysis.

There was an error in the KNIME workflow where age last colonoscopy was being calculated, which has now been corrected.  Updated workflow attached.  The instructions were also updated to reflect the new name of the workflow, which now includes both executing the NLP to find cases and controls and calculating ages for those cases and controls according to the data dictionary.

There was also an error w/ age first Dx; both were corrected in the updated workflow.  But, as it may be easier for some just to make the correction themselves in their currently configured version of the workflow, here are the 2 simple changes to make:

  1. Open the Algorithm node, then open the Patient set node on the Algorithm page
  2. Change #1: Configure the GroupBy - get date last colonoscopy node on the Patient set page
  3. Under the Options tab, for the dt Column on the right, click the Aggregation & change it from Last to Maximum, & click OK
  4. Change #2: Configure the GroupBy - get date first Diagnosis node on the Patient set page
  5. Under the Options tab, for the dt Column on the right, click the Aggregation & change it from First to Minimum, & click OK
  6. Save the workflow & re-execute the CSV Writer node on the main workflow page ("1-Diverticulosis cases, controls, and ages") to generate corrected data
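The likely reason these 2 changes matter is that a First or Last aggregation depends on the order of the input rows, while Minimum and Maximum depend only on the date values. The sketch below shows the difference in plain Python, not KNIME code; the patient IDs and dates are made up, and the aggregate helper is hypothetical.

```python
from datetime import date

# Made-up (patient ID, report date) rows that are NOT sorted by date.
rows = [
    ("1001", date(2011, 5, 2)),
    ("1001", date(2008, 1, 15)),  # older report listed after a newer one
    ("1002", date(2009, 7, 30)),
]


def aggregate(rows, pick):
    """Group dates by patient ID and reduce each group with `pick`."""
    grouped = {}
    for patient_id, dt in rows:
        grouped.setdefault(patient_id, []).append(dt)
    return {patient_id: pick(dts) for patient_id, dts in grouped.items()}


last_in_row_order = aggregate(rows, lambda dts: dts[-1])   # like "Last"
maximum = aggregate(rows, max)                             # like "Maximum"
first_in_row_order = aggregate(rows, lambda dts: dts[0])   # like "First"
minimum = aggregate(rows, min)                             # like "Minimum"

print(last_in_row_order["1001"], maximum["1001"])
print(first_in_row_order["1001"], minimum["1001"])
```

For patient 1001, the row-order aggregations return the wrong report for both variables, whereas the value-based ones return the most recent and the earliest date regardless of how the input is sorted.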

A new data dictionary is posted; it is just a pared-down version of the prev. ver. Sorry it got posted twice; the 2 files are the same.


Vivian had the following question, & below is the answer for others to see in case they were also confused:

In the flowchart, a case requires at least 1 positive mention of Diverticulosis.
However, when viewing the KNIME workflow, it appears that the requirement is at least 1 positive mention along with 0 negative mentions. If this is correct, does a negative mention at any time prevent the subject from being a case or is it only a negative mention within the same note as the positive mention?

That rule only applies w/in a single colonoscopy or abd. CT scan report, i.e., if 1 report contains both neg. & pos. mentions, that report counts as neg.  But overall, if a pt. has >1 study/report & at least 1 of them is positive, then the pt. is a case.
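The report-level and patient-level logic in this answer can be sketched as follows. This is a minimal illustration, not the KNIME workflow itself: the function names and the input layout are hypothetical, the case threshold follows the flowchart's requirement of at least 1 positive report, and the control rule (a colonoscopy with no positive report) is taken from the overview at the top of this page.

```python
def report_is_positive(positive_mentions, negative_mentions):
    """A report is positive only with >=1 positive and 0 negative mentions."""
    return positive_mentions >= 1 and negative_mentions == 0


def classify_patient(reports):
    """Classify one patient from all of his/her reports.

    `reports` is a list of (text_type, positive_mentions, negative_mentions)
    tuples, where text_type is "Colonoscopy" or "CT".
    """
    # Negative mentions only cancel positives within the same report.
    if any(report_is_positive(pos, neg) for _, pos, neg in reports):
        return "case"
    # Assumption: controls need a colonoscopy and no positive report.
    if any(text_type == "Colonoscopy" for text_type, _, _ in reports):
        return "control"
    return "unclassified"


# A mixed colonoscopy report plus a clearly positive CT report -> case.
print(classify_patient([("Colonoscopy", 1, 1), ("CT", 1, 0)]))
# Only a mixed colonoscopy report -> that report is negative -> control.
print(classify_patient([("Colonoscopy", 1, 1)]))
```

A negative mention in one report therefore never overrides a positive report from a different study for the same patient.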

Thank you Vivian for asking.



A newly validated data dictionary has been uploaded. Within that file, the newly annotated flowchart file, and the new overview file, the difference between diverticulosis and diverticulitis is noted: diverticulitis implies diverticulosis, and both are cases, so if diverticulitis = Yes then case = Yes.