Commit 27cc6993 authored by Chia Ying Chiu
parent 647c190f
@@ -13,6 +13,18 @@ Preprocessing:
To simulate the work of clinical coders in hospitals, our goal is to build a model that predicts ICD-10 codes from free-form text. We first apply basic preprocessing with NLTK [8] and then build a neural network model that learns features from the input texts. The preprocessing procedure includes spell checking, lowercasing, stop-word removal, tokenization, and removal of infrequent words. The preprocessed data are then split into training and validation sets with the Scikit-Learn library.
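The preprocessing steps can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the `STOP_WORDS` set, the `min_count` threshold, the helper names, and the sample notes are all assumptions. In the real pipeline, NLTK would supply the tokenizer and stop-word list, and spell checking (omitted here) would run first.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (assumption); NLTK's stop-word corpus
# would normally be used instead.
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "to", "in", "for", "is", "was"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(notes, min_count=2):
    """Tokenize, drop stop words, then drop words seen fewer than min_count times."""
    tokenized = [[t for t in tokenize(n) if t not in STOP_WORDS] for n in notes]
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]


# Made-up example notes (not taken from any dataset).
notes = [
    "Patient admitted with acute chest pain.",
    "Chest pain and shortness of breath.",
    "History of diabetes; admitted for chest pain.",
]
cleaned = preprocess(notes)
for doc in cleaned:
    print(doc)
```

The infrequent-word filter is applied after stop-word removal so that the frequency counts reflect only content words. The train/validation split would follow this step, for example with `sklearn.model_selection.train_test_split`.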
I used term frequency-inverse document frequency (TF-IDF) to vectorize the texts, passed the resulting vectors to Random Forest, SVC, Multinomial Naive Bayes, and Logistic Regression classifiers, and evaluated them using cross-validation. I further explored Random Forest by creating a confusion matrix and a classification report.
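A rough sketch of this TF-IDF experiment with Scikit-Learn is shown below. The toy texts, the placeholder labels `CODE_A` and `CODE_B`, the two-fold setting, and all hyperparameters are assumptions made for illustration; they are not the data or settings used in the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy notes and placeholder labels (assumptions, not real clinical data).
texts = [
    "chest pain radiating to left arm",
    "acute chest pain and sweating",
    "chest tightness with pain on exertion",
    "crushing chest pain at rest",
    "high blood sugar and thirst",
    "elevated glucose frequent urination",
    "blood sugar poorly controlled",
    "increased thirst and high glucose",
]
labels = ["CODE_A"] * 4 + ["CODE_B"] * 4

classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVC": SVC(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Wrapping the vectorizer and classifier in one pipeline means TF-IDF is
# refit on the training portion of every fold, so no test text leaks in.
scores = {}
for name, clf in classifiers.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    scores[name] = cross_val_score(model, texts, labels, cv=2).mean()
    print(f"{name}: mean accuracy {scores[name]:.2f}")

# Closer look at Random Forest using out-of-fold predictions.
rf_model = make_pipeline(
    TfidfVectorizer(), RandomForestClassifier(n_estimators=50, random_state=0)
)
predicted = cross_val_predict(rf_model, texts, labels, cv=2)
cm = confusion_matrix(labels, predicted, labels=["CODE_A", "CODE_B"])
print(cm)
print(classification_report(labels, predicted, zero_division=0))
```

With real ICD labels the confusion matrix grows to one row and column per code, so rare codes are easier to inspect in the per-class rows of the classification report.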
1. TF-IDF vectorization is not suitable for clinical notes.
2. Classical ML models are not suitable for ICD-9 code assignment, especially with so many ICD codes. A deep learning model might perform better.
3. The dataset is insufficient; perhaps we should have been more careful in choosing the data.
4. We did not fully grasp the MIMIC-III dataset.
Just throwing some ideas out there