Commit 4f0ac110 authored by Chia Ying Chiu

Update README.md

parent 1f5289e0
......@@ -6,17 +6,22 @@ The International Classification of Diseases (ICD) standardizes the format for r
Method
Data: Data are from the MIMIC-III (Medical Information Mart for Intensive Care) database, comprising health information for each encounter at the critical care units of a large tertiary care hospital (Johnson et al., 2016). For this study, 40,000 clinical notes were used. The dataset, ICDdata40K.csv, is in the repository; follow the NLP_Project_Code.ipynb notebook for the preprocessing and analysis steps. Both the dataset and the Python notebook can be uploaded and run on Cheaha. See the Preprocessing and Analysis section for a description of the code.
Preprocessing and Analysis:
To simulate the work of coders in hospitals, our goal is to construct a model that predicts ICD-9 codes from free-form text. We first apply basic preprocessing with NLTK and then build models that learn features from the input texts. The preprocessing procedure includes conversion to lowercase, stop-word removal, and tokenization. After term frequency-inverse document frequency (TF-IDF) vectorization converted the texts to vectors, the data were passed to Multinomial Naive Bayes, logistic regression, random forest, and linear SVC machine learning models. Lastly, the results were evaluated using cross-validation. The random forest model was explored further with a confusion matrix and a classification report.
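The preprocessing and TF-IDF steps described above can be sketched in plain Python. This is only a minimal illustration, not the code from NLP_Project_Code.ipynb: the stop-word list is a tiny hand-written stand-in for NLTK's, the tokenizer is a simple regular expression, the weighting is the unsmoothed `tf * log(N / df)` variant, and the example notes are made up rather than taken from MIMIC-III.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); the notebook relies on NLTK instead.
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "was", "is", "to", "in", "for"}


def preprocess(text):
    """Lowercase the text, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty notes
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up example notes, not MIMIC-III data.
notes = [
    "Patient with chronic kidney disease and anemia.",
    "Patient admitted for abdominal pain.",
    "Anemia of chronic disease noted.",
]
vectors = tfidf(notes)
```

Terms that occur in only one note, such as "abdominal", receive a higher weight than terms shared across notes, such as "patient". The resulting sparse vectors are the kind of input that would then be passed to the classifiers named above.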
Result:
The results suggested that the 40,000-note sub-sample of MIMIC-III is very sparse. The top 10 most commonly used ICD-9 codes are for chronic kidney diseases, glomerulonephritis, iatrogenic hypotension, anemia in chronic kidney disease, anemia of other chronic disease, nephritis and nephropathy, hemangioma of intra-abdominal structures, other chronic pain, and abdominal pain, unspecified site.
Discussion
1. TF-IDF vectorization is not suitable for clinical notes.
2. ML models are not suitable for ICD-9 code assignment, especially with so many ICD codes. A deep learning model might perform better.
3. The dataset is insufficient; perhaps we should have been more careful in choosing the data.
4. We did not fully grasp the MIMIC-III dataset.
The results were not meaningful, and we discussed a few possible reasons:
1. TF-IDF vectorization is not suitable for clinical notes.
2. Machine learning models are not suitable for direct ICD-9 code assignment, especially with so many ICD codes. A deep learning model could potentially perform better. Additionally, if information is available to subset the clinical notes by department, this could yield a more focused sample for the model to learn from.
3. The dataset is insufficient, particularly the training data. This could be because we did not fully grasp the MIMIC-III dataset owing to technical difficulties; an arbitrary cutoff on the observation count may also have adversely affected the results.
To achieve better results, one can also focus on the first 3 characters of the ICD-9 codes to determine the category of the diagnosis, instead of the full code.
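Reducing codes to their first 3 characters, as suggested above, can be sketched as follows. The helper name `icd9_category` and the example codes are assumptions chosen for illustration; they are not taken from the project notebook or from ICDdata40K.csv.

```python
from collections import Counter


def icd9_category(code):
    """Return the first three characters of an ICD-9 code, ignoring any decimal point."""
    return code.strip().replace(".", "")[:3]


# Hypothetical example codes, written both with and without the decimal point.
codes = ["585.9", "5856", "285.21", "28529", "789.00"]

# Counting categories instead of full codes gives fewer, denser classes.
category_counts = Counter(icd9_category(c) for c in codes)
```

Here five distinct full codes collapse into three categories (585, 285, and 789), which reduces the number of labels the model has to separate. Note that ICD-9 external-cause codes beginning with "E" use a four-character category, so they may need separate handling.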
Reference
......