# NLP Group Project

Introduction

The International Classification of Diseases (ICD) standardizes how causes of death are reported on death certificates, promoting the international comparability of mortality statistics (CDC, 2015). To reflect changes in the medical field, the ICD has been revised periodically, and there have been ten revisions so far. In the United States, the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) is used to assign codes to diagnoses associated with inpatient, outpatient, and physician office utilization (CDC, 2015). The coding process is crucial: failing to correctly code a significant diagnosis can result in a substantial loss of reimbursement for the hospital. Despite its importance, ICD coding is still done mainly by hand, which is often expensive, time-consuming, and inefficient (Li et al., 2019). Therefore, in this study, we aim to automate ICD-9 coding by applying a natural language processing (NLP) model to unstructured clinical notes.

Method

Data: The data come from the MIMIC-III (Medical Information Mart for Intensive Care) database, which comprises health information for each encounter at the critical care units of a large tertiary care hospital (Johnson et al., 2016). For this study, 40,000 clinical notes were used.

Preprocessing:
To simulate the work of medical coders in hospitals, our goal was to build a model that predicts ICD-9 codes from free-form text. We first applied basic preprocessing with NLTK: converting the text to lowercase, removing stop words, and tokenizing. We then used term frequency-inverse document frequency (TF-IDF) to convert the texts to vectors, which were passed to four machine learning models: Multinomial Naive Bayes, logistic regression, random forest, and linear SVC. Lastly, the results were evaluated using cross-validation. The random forest model was examined further with a confusion matrix and a classification report.
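
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the toy notes and their labels are invented (they are not MIMIC-III data), the small `STOP_WORDS` set and the regex tokenizer stand in for NLTK's stop-word list and tokenizer so the sketch runs without downloading NLTK corpora, and the helper names `preprocess`, `make_models`, `evaluate`, and `random_forest_report`, along with all hyperparameters, are hypothetical.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Assumption: a tiny stand-in for NLTK's English stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "was", "is",
              "to", "for", "in", "on", "due"}


def preprocess(note):
    """Lowercase a note, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", note.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


# Toy notes invented for illustration; one ICD-9 code per note.
HF, DM = "428.0", "250.00"
SAMPLES = [
    ("Patient admitted with congestive heart failure and leg edema.", HF),
    ("Type 2 diabetes mellitus with elevated blood glucose.", DM),
    ("Shortness of breath due to heart failure, started on diuretics.", HF),
    ("Diabetes follow-up, insulin dose adjusted for high glucose.", DM),
    ("History of heart failure with reduced ejection fraction.", HF),
    ("Patient with diabetes mellitus on metformin.", DM),
    ("Heart failure exacerbation with pulmonary edema.", HF),
    ("Poorly controlled diabetes, glucose elevated on admission.", DM),
    ("Worsening edema and dyspnea, known heart failure.", HF),
    ("Insulin started for diabetes mellitus.", DM),
    ("Admitted for heart failure, treated with diuretics.", HF),
    ("Diabetes mellitus, blood glucose monitored, metformin continued.", DM),
]
NOTES = [text for text, _ in SAMPLES]
LABELS = [code for _, code in SAMPLES]


def make_models():
    """Return the four classifiers named in the text (settings are guesses)."""
    return {
        "multinomial_nb": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "linear_svc": LinearSVC(),
    }


def evaluate(notes, labels, cv=3):
    """Mean cross-validated accuracy of each model on TF-IDF features."""
    cleaned = [preprocess(n) for n in notes]
    scores = {}
    for name, model in make_models().items():
        pipeline = make_pipeline(TfidfVectorizer(), model)
        scores[name] = cross_val_score(pipeline, cleaned, labels, cv=cv).mean()
    return scores


def random_forest_report(notes, labels, cv=3):
    """Confusion matrix and classification report from out-of-fold predictions."""
    cleaned = [preprocess(n) for n in notes]
    pipeline = make_pipeline(
        TfidfVectorizer(), RandomForestClassifier(n_estimators=50, random_state=0)
    )
    predicted = cross_val_predict(pipeline, cleaned, labels, cv=cv)
    matrix = confusion_matrix(labels, predicted)
    report = classification_report(labels, predicted, zero_division=0)
    return matrix, report


if __name__ == "__main__":
    for name, score in evaluate(NOTES, LABELS).items():
        print(f"{name}: {score:.2f}")
    matrix, report = random_forest_report(NOTES, LABELS)
    print(matrix)
    print(report)
```

Placing `TfidfVectorizer` inside the pipeline means the vocabulary and IDF weights are learned from the training folds only, so the held-out fold does not leak into the features during cross-validation.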


Discussion
1. TF-IDF vectorization is not well suited to clinical notes.
2. The ML models we tried are not well suited to ICD-9 code assignment, especially with so many ICD codes. A deep learning model might perform better.
3. The dataset was insufficient; we should have been more careful in choosing the data.
4. We did not fully grasp the MIMIC-III dataset.


References


- Centers for Disease Control and Prevention. (2015, November 6). ICD - ICD-9 - International Classification of Diseases, Ninth Revision. Centers for Disease Control and Prevention. Retrieved November 24, 2021, from https://www.cdc.gov/nchs/icd/icd9.htm
- Li, M., Fei, Z., Wu, F.-X., Li, Y., Pan, Y., & Wang, J. (2019). Automated ICD-9 coding via a deep learning approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(4), 1193-1202. https://doi.org/10.1109/TCBB.2018.2817488
- Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035. https://doi.org/10.1038/sdata.2016.35