Commit 4245501a authored by Zaid A Ali

Add code for creating datasets so that they can be easily used in investigations

parent 849455ee
%% Cell type:markdown id:bc36f532 tags:
NLP project ideas:
An article used the same dataset, MIMIC-III, to evaluate ICD-9 code assignment with RNNs and CNNs: https://github.com/lsy3/clinical-notes-diagnosis-dl-nlp
The GitHub repository appears to provide code and cleaned datasets.
Paper: https://arxiv.org/pdf/1802.02311v2.pdf
MIMIC-III: https://paperswithcode.com/dataset/mimic-iii
Link to data:
MIMIC-III: https://uab.box.com/s/pjf41j05n33wjktu93p98vla3mfvhjwu
Use the resources from the GitHub project to assign ICD-9 codes using different multi-label text classification models.
Is it possible to optimize their models?
Build prediction models using the ICD-9 codes with covariates (insurance type, gender*, ethnicity, marital status, admission type) to see which ICD codes are most strongly associated with prolonged length of stay. https://towardsdatascience.com/predicting-hospital-length-of-stay-at-time-of-admission-55dfdfe69598
Compare the predictions of the different multi-label text classification models and check whether the results agree across models.
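One possible way to check agreement between two models is the per-admission Jaccard overlap of their predicted code sets. The sketch below is only an illustration: the `cnn_pred` and `rnn_pred` dictionaries and their values are hypothetical stand-ins, not output from the linked project.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of predicted ICD-9 codes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty predictions count as full agreement
    return len(a & b) / len(a | b)

# Hypothetical predictions keyed by HADM_ID (made-up values).
cnn_pred = {101: ["2762", "4280"], 109: ["40301"], 110: ["486"]}
rnn_pred = {101: ["2762"], 109: ["40301", "486"], 110: ["486"]}

# Score only admissions that both models predicted on.
shared = sorted(cnn_pred.keys() & rnn_pred.keys())
scores = {hadm: jaccard(cnn_pred[hadm], rnn_pred[hadm]) for hadm in shared}
mean_agreement = sum(scores.values()) / len(scores)
print(scores, mean_agreement)
```

A mean close to 1 would suggest the models assign largely the same codes; lower values point to admissions worth inspecting by hand.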
| Task | Assigned To | Deadline |
|------|-------------|----------|
| Run the CNN and RNN models; refer to the GitHub link above | Zaid & Chia | 11/15 |
%% Cell type:markdown id: tags:
For this project, our goal is to create an NLP model that automatically assigns ICD-9 codes given the clinical notes from each encounter.
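Because an encounter can carry several ICD-9 codes at once, the target is naturally a multi-hot label matrix. The sketch below shows one way such a matrix might be built with pandas. The `toy_diagnoses` frame and its values are made up for illustration and only mimic the `HADM_ID` and `ICD9_CODE` columns of `DIAGNOSES_ICD.csv`.

```python
import pandas as pd

# Toy stand-in for DIAGNOSES_ICD.csv; the values are illustrative only.
toy_diagnoses = pd.DataFrame({
    "HADM_ID": [1, 1, 2, 2, 2, 3],
    "ICD9_CODE": ["40301", "486", "2762", "486", "4280", "40301"],
})

# Multi-hot label matrix: one row per admission, one 0/1 column per code.
# crosstab counts occurrences; clip caps repeated codes at 1.
labels = pd.crosstab(
    toy_diagnoses["HADM_ID"], toy_diagnoses["ICD9_CODE"]
).clip(upper=1)
print(labels)
```

Each row of `labels` can then serve as the target vector for a multi-label text classifier.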
%% Cell type:code id: tags:
``` python
#imports
import pandas as pd
print("All modules imported successfully")
```
%% Output
All modules imported successfully
%% Cell type:code id: tags:
``` python
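# Load the MIMIC-III diagnosis codes and clinical notes.
diagnoses = pd.read_csv("DIAGNOSES_ICD.csv")
# The python engine with on_bad_lines="skip" drops rows that fail to parse.
note_events = pd.read_csv("NOTEEVENTS.csv", engine="python", on_bad_lines="skip")
# Inner join: one row per (diagnosis, note) pair within each admission.
full_dataset = pd.merge(diagnoses, note_events, on=["HADM_ID", "SUBJECT_ID"])
# Keep the first 40,000 rows as a working subset.
full_dataset = full_dataset[:40000]
print(full_dataset)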
```
%% Output
ROW_ID_x SUBJECT_ID HADM_ID SEQ_NUM ICD9_CODE ROW_ID_y CHARTDATE \
0 1297 109 172335 1.0 40301 14797 2141-09-24
1 1297 109 172335 1.0 40301 72706 2141-09-21
2 1297 109 172335 1.0 40301 170207 2141-09-18
3 1297 109 172335 1.0 40301 341513 2141-09-21
4 1297 109 172335 1.0 40301 341514 2141-09-21
... ... ... ... ... ... ... ...
39995 801 101 175533 9.0 2762 15782 2196-10-12
39996 801 101 175533 9.0 2762 170036 2196-09-26
39997 801 101 175533 9.0 2762 170037 2196-09-26
39998 801 101 175533 9.0 2762 170038 2196-09-26
39999 801 101 175533 9.0 2762 173709 2196-09-30
CHARTTIME STORETIME CATEGORY \
0 NaN NaN Discharge summary
1 NaN NaN Echo
2 NaN NaN ECG
3 2141-09-21 02:49:00 2141-09-21 02:49:45 Physician
4 2141-09-21 02:49:00 2141-09-21 02:57:11 Physician
... ... ... ...
39995 NaN NaN Discharge summary
39996 NaN NaN ECG
39997 NaN NaN ECG
39998 NaN NaN ECG
39999 NaN NaN ECG
DESCRIPTION CGID ISERROR \
0 Report NaN NaN
1 Report NaN NaN
2 Report NaN NaN
3 Physician Resident Admission Note 17650.0 NaN
4 Physician Resident Admission Note 17650.0 NaN
... ... ... ...
39995 Report NaN NaN
39996 Report NaN NaN
39997 Report NaN NaN
39998 Report NaN NaN
39999 Report NaN NaN
TEXT
0 Admission Date: [**2141-9-18**] ...
1 PATIENT/TEST INFORMATION:\nIndication: Pericar...
2 Sinus rhythm\nRightward axis\nSince previous t...
3 Chief Complaint: hypotension, altered mental ...
4 Chief Complaint: hypotension, altered mental ...
... ...
39995 Admission Date: [**2196-9-26**] Discharge...
39996 Baseline artifact\nSinus rhythm\nGeneralized l...
39997 Baseline artifact\nProbable atrial flutter wit...
39998 Baseline artifact\nProbable atrial flutter wit...
39999 Wide complex tachycardia with a right bundle-b...
[40000 rows x 14 columns]
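%% Cell type:markdown id: tags:
The merged frame printed above repeats every diagnosis row once per note, so a single admission spans many rows. A possible next step, sketched here under assumptions, is to collapse it to one row per admission with the note text and the list of codes. The `toy_merged` frame and its values are made up, and restricting to discharge summaries is an illustrative choice rather than something the notebook does.
%% Cell type:code id: tags:
```python
import pandas as pd

# Toy rows shaped like the merged frame (values are illustrative only).
toy_merged = pd.DataFrame({
    "HADM_ID": [172335, 172335, 172335, 172335, 175533, 175533],
    "ICD9_CODE": ["40301", "40301", "486", "486", "2762", "2762"],
    "ROW_ID_y": [14797, 72706, 14797, 72706, 15782, 170036],
    "CATEGORY": ["Discharge summary", "Echo", "Discharge summary",
                 "Echo", "Discharge summary", "ECG"],
    "TEXT": ["summary A", "echo A", "summary A", "echo A",
             "summary B", "ecg B"],
})

# Unique ICD-9 codes per admission, as a sorted list.
codes = (
    toy_merged.groupby("HADM_ID")["ICD9_CODE"]
    .apply(lambda s: sorted(s.unique()))
    .rename("ICD9_CODES")
)

# Keep each discharge summary once (the merge duplicated it per diagnosis),
# then join the texts belonging to the same admission.
summaries = toy_merged[toy_merged["CATEGORY"] == "Discharge summary"]
notes = (
    summaries.drop_duplicates(subset=["HADM_ID", "ROW_ID_y"])
    .groupby("HADM_ID")["TEXT"]
    .agg("\n".join)
)

# One row per admission: note text plus its list of codes.
per_admission = pd.concat([notes, codes], axis=1).reset_index()
print(per_admission)
```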
%% Cell type:code id:72a0901e tags:
%% Cell type:code id: tags:
``` python
```
......