# NLP Group Project

Introduction

The International Classification of Diseases (ICD) standardizes the format for reporting causes of death on death certificates, promoting the international comparability of mortality statistics (CDC, 2015). To reflect changes in the medical field, the ICD has been revised periodically, and there have been ten revisions so far. In the United States, the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) is used to assign codes to diagnoses associated with inpatient, outpatient, and physician office utilization (CDC, 2015). The coding process is crucial: failing to code a significant diagnosis correctly can result in a substantial loss of reimbursement for the hospital. Despite its importance, ICD coding is still done mainly by hand, which is often expensive, time-consuming, and inefficient (Li et al., 2019). In this study, we therefore aim to automate ICD-9 coding by applying a natural language processing (NLP) model to unstructured clinical notes.

Method

Data: The data come from the MIMIC-III (Medical Information Mart for Intensive Care) database, which comprises health information for each encounter at the critical care units of a large tertiary care hospital (Johnson et al., 2016). For this study, 40,000 clinical notes were used. The data are in ICDdata40K.csv in the repository, and the preprocessing and analysis steps are in the NLP_Project_Code.ipynb notebook. Both the dataset and the Python notebook can be uploaded and run on Cheaha. See the Preprocessing and Analysis section for a description of the code.

Preprocessing and Analysis:
To simulate the work of medical coders in hospitals, our goal is to build a model that predicts ICD-9 codes from free-form text. We first apply basic preprocessing with NLTK: lowercasing, stop-word removal, and tokenization. We then use term frequency-inverse document frequency (TF-IDF) to convert the texts to vectors and pass them to four machine learning models: Multinomial Naive Bayes, logistic regression, random forest, and linear SVC. Lastly, the results were evaluated using cross-validation. The random forest model was examined further with a confusion matrix and a classification report.
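
The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy notes, the placeholder labels `CODE_A` and `CODE_B`, the helper names, and the hyperparameters are all assumptions, and scikit-learn's built-in lowercasing and English stop-word list stand in for the NLTK preprocessing.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for clinical notes and their labels (made up for illustration).
notes = [
    "patient with chronic kidney disease on dialysis",
    "elevated creatinine consistent with kidney disease",
    "renal failure and anemia noted, kidney function poor",
    "chronic kidney disease stage four with anemia",
    "patient reports abdominal pain after meals",
    "diffuse abdominal pain and nausea on admission",
    "sharp abdominal pain, no fever, abdomen tender",
    "abdominal pain of unspecified site, tender abdomen",
]
codes = ["CODE_A"] * 4 + ["CODE_B"] * 4


def build_models():
    """Return the four classifier families named in the text."""
    return {
        "MultinomialNB": MultinomialNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
        "LinearSVC": LinearSVC(),
    }


def compare_models(texts, labels, folds=2):
    """Cross-validate a TF-IDF + classifier pipeline for each model."""
    scores = {}
    for name, clf in build_models().items():
        # TfidfVectorizer lowercases, drops stop words, and tokenizes here.
        pipeline = make_pipeline(
            TfidfVectorizer(lowercase=True, stop_words="english"), clf
        )
        fold_scores = cross_val_score(
            pipeline, texts, labels, cv=folds, scoring="accuracy"
        )
        scores[name] = fold_scores.mean()
    return scores


scores = compare_models(notes, codes)
for name, mean_accuracy in scores.items():
    print(f"{name}: mean accuracy {mean_accuracy:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is refit on each training fold, so no information from the held-out fold leaks into the features.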

Result:
The results suggested that the 40,000-note subsample of MIMIC-III is very sparse. The 10 most commonly used ICD-9 codes include chronic kidney diseases, glomerulonephritis, iatrogenic hypotension, anemia in chronic kidney, anemia - other chronic disease, nephritis and nephropathy, hemangioma of intra-abdominal structures, other chronic pain, and abdominal pain, unspecified site.
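
One way to inspect this kind of label sparsity is a simple frequency count. The sketch below is an assumption about how such a check could look, not the notebook's code: the function name `label_frequency_summary` and the sample code strings are hypothetical.

```python
from collections import Counter


def label_frequency_summary(codes, top_n=10):
    """Summarize how unevenly the labels are distributed."""
    counts = Counter(codes)
    # Codes that appear exactly once give a model almost nothing to learn from.
    singletons = sum(1 for n in counts.values() if n == 1)
    return {
        "n_notes": len(codes),
        "n_distinct_codes": len(counts),
        "top_codes": counts.most_common(top_n),
        "share_singleton_codes": singletons / len(counts) if counts else 0.0,
    }


# Illustrative label column (made up); real labels would come from the CSV.
sample_codes = ["5859", "5859", "5859", "78900", "78900", "2859", "4589"]
summary = label_frequency_summary(sample_codes, top_n=2)
print(summary)
```

A high share of codes that occur only once is a sign that many classes cannot be split meaningfully across training and validation folds.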


Discussion
The models did not produce meaningful results, and we discussed a few possible reasons:
1. TF-IDF vectorization may not be suitable for clinical notes.
2. The machine learning models used here may not be suitable for direct ICD-9 code assignment, especially with so many ICD codes. A deep learning model could potentially perform better. Additionally, if the clinical notes could be subset by department, the model would have a more focused sample to learn from.
3. The dataset is insufficient, particularly the training data. This may stem from our not fully understanding the MIMIC-III dataset because of technical difficulties; an arbitrary cutoff on the number of observations may also have hurt the results.
To achieve better results, one can also focus on the first 3 characters of the ICD-9 codes, which determine the category of the diagnosis, instead of the full code.
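
The category-level idea could be sketched as below. The function name `icd9_category` and the example strings are hypothetical. The code assumes codes are stored as plain strings, with or without a decimal point, and additionally assumes that E codes keep four characters (the letter plus three digits), which goes beyond the first-3-characters rule stated above.

```python
def icd9_category(code: str) -> str:
    """Collapse a full ICD-9 code to its category prefix."""
    cleaned = code.strip().upper().replace(".", "")
    # Assumption: E codes use a four-character category; all others use three.
    width = 4 if cleaned.startswith("E") else 3
    return cleaned[:width]


# Illustrative code strings (not taken from the dataset).
examples = ["5859", "585.6", "78900", "V5861", "E8790"]
categories = [icd9_category(code) for code in examples]
print(categories)  # ['585', '585', '789', 'V58', 'E879']
```

Mapping labels this way merges many rare full codes into fewer, better-populated classes, which is the motivation given above.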

Reference


- Centers for Disease Control and Prevention. (2015, November 6). ICD - ICD-9 - International Classification of Diseases, ninth revision. Centers for Disease Control and Prevention. Retrieved November 24, 2021, from https://www.cdc.gov/nchs/icd/icd9.htm.
- Li, M., Fei, Z., Wu, F.-X., Li, Y., Pan, Y., & Wang, J. (2019). Automated ICD-9 coding via a deep learning approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(4), 1193-1202. https://doi.org/10.1109/TCBB.2018.2817488
- Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific data, 3, 160035. https://doi.org/10.1038/sdata.2016.35