Commit 0681578a authored by Ryan Melvin's avatar Ryan Melvin

Merge branch 'parsing' into 'master'

Parsing

See merge request center-for-computational-genomics-and-data-science/sciops/covid-19_risk_predictor!3
parents 39bd579a d3d40f6c
# results from scripts
*.csv
*.xlsx
*.txt
*.png
*.zip
.vscode/
# Created by https://www.toptal.com/developers/gitignore/api/python
@@ -151,4 +157,4 @@ Thumbs.db #thumbnail cache on Windows
.prof
# End of https://www.toptal.com/developers/gitignore/api/python
\ No newline at end of file
# End of https://www.toptal.com/developers/gitignore/api/python
The MIT License (MIT)
Copyright (c) 2021 Center for Computational Genomics and Data Science
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
\ No newline at end of file
- [COVID-19_RISK_PREDICTOR](#covid-19_risk_predictor)
- [Data availability](#data-availability)
- [Usage](#usage)
  - [Installation](#installation)
  - [Requirements](#requirements)
  - [Activate conda environment](#activate-conda-environment)
  - [Run parser](#run-parser)
  - [Run model training](#run-model-training)
  - [Build Streamlit app](#build-streamlit-app)
- [Contact information](#contact-information)
# COVID-19_RISK_PREDICTOR
***!!! For research purposes only !!!***
**Aim:** To develop a model that takes in demographics, living style and symptoms/conditions to predict risk of COVID-19 infection for patients.
## Data availability
Data was made available through the UAB Biomedical Research Information Technology Enhancement (U-BRITE) framework. Access to the level-2 i2b2 data was granted on a self-service basis pursuant to an IRB exemption. [link](https://www.uab.edu/ccts/research-commons/berd/55-research-commons/informatics/325-i2b2)
### Directory structure used to parse data from positive and negative cohorts
The dataset used was transformed to adhere to the [OMOP Common Data Model Version 5.3.1](https://ohdsi.github.io/CommonDataModel/cdm531.html) to enable systematic analyses of EHR data from disparate sources.
```
encoded-2-week-filter.csv
Cohorts/
├── positive <--- positive cohort directory
│   ├── measurement.csv - test and results
│   ├── condition_occurance.csv - conditions of patients
│   ├── observation.csv - things like smoking history
│   └── person.csv - demographic information
├── negative <--- negative cohort directory
│   ├── measurement.csv - test and results
│   ├── condition_occurance.csv - conditions of patients
│   ├── observation.csv - things like smoking history
│   └── person.csv - demographic information
└── README.md
```
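For orientation, the sketch below shows how one cohort directory could be read with pandas. The `load_cohort` helper is hypothetical (the actual parsing lives in `src/filter_dataset.py`); the filenames are those shown in the tree above.
```python
from pathlib import Path

import pandas as pd

def load_cohort(cohort_dir):
    """Read the four OMOP-derived tables for one cohort (hypothetical helper)."""
    cohort = Path(cohort_dir)
    return {
        "person": pd.read_csv(cohort / "person.csv"),
        "measurement": pd.read_csv(cohort / "measurement.csv"),
        "condition": pd.read_csv(cohort / "condition_occurance.csv"),
        "observation": pd.read_csv(cohort / "observation.csv"),
    }

positive = load_cohort("Cohorts/positive")
negative = load_cohort("Cohorts/negative")
```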
## Usage
### Installation
Installation simply requires fetching the source code. The following are required:
- Git
To fetch the source code, change into a directory of your choice and run:
```sh
git clone -b master \
git@gitlab.rc.uab.edu:center-for-computational-genomics-and-data-science/sciops/covid-19_risk_predictor.git
```
### Requirements
*OS:*
Currently works only on Linux. Docker versions may need to be explored later to make it usable on macOS (and
potentially Windows).
*Tools:*
- Anaconda3
- Tested with version: 2020.02
### Activate conda environment
Change into the root directory and run the commands below:
```sh
# create conda environment. Needed only the first time.
conda env create --file configs/environment.yaml
# if you need to update existing environment
conda env update --file configs/environment.yaml
# activate conda environment
conda activate rico
```
### Run parser
```
python src/filter_dataset.py --pos Cohorts/positive/ --neg Cohorts/negative/
```
For help, use the `-h` argument:
```
python src/filter_dataset.py -h
```
Parsed files are saved in the `./results` directory.
### Run model training
```
python src/Model.py --input results/encoded-100-week-filter.csv
```
Output files are saved in the `./results` directory.
### Build Streamlit app
As an example, we created a Streamlit app with the results from our model. Please refer to
`src/streamlit/RICO.py`.
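As a hedged illustration (not the actual app code), the sketch below shows how a scorecard written by `src/Model.py` could be served. It assumes scorecardpy's `variable`/`bin`/`points` card layout and the `results/lasso_card_df.csv` file produced during training:
```python
# Illustrative sketch only -- the real app is src/streamlit/RICO.py.
import pandas as pd
import streamlit as st

card = pd.read_csv("results/lasso_card_df.csv")

st.title("RICO: COVID-19 risk score (research use only)")
# Start from the scorecard's base points, then add points per selected bin.
total = int(card.loc[card["variable"] == "basepoints", "points"].sum())
for variable in card.loc[card["variable"] != "basepoints", "variable"].unique():
    rows = card[card["variable"] == variable]
    choice = st.selectbox(variable, rows["bin"].astype(str).tolist())
    total += int(rows.loc[rows["bin"].astype(str) == choice, "points"].iloc[0])
st.metric("Total score", total)
```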
## Contact information
For issues, please send an email with a clear description to:
Tarun Mamidi - tmamidi@uab.edu
Ryan Melvin - rmelvin@uabmc.edu
\ No newline at end of file
person:
- PERSON_ID
- YEAR_OF_BIRTH
- GENDER_SOURCE_VALUE
- RACE_SOURCE_VALUE
- ETHNICITY_SOURCE_VALUE
death:
- PERSON_ID
- DEATH_DATE
measurement:
- PERSON_ID
- MEASUREMENT_DATE
- MEASUREMENT_SOURCE_VALUE
- VALUE_SOURCE_VALUE
condition:
- PERSON_ID
- CONDITION_START_DATE
- CONDITION_SOURCE_VALUE
- CONDITION_SOURCE_CONCEPT_ID
observation:
- PERSON_ID
- OBSERVATION_DATE
- VALUE_AS_STRING
- OBSERVATION_SOURCE_VALUE
- QUALIFIER_SOURCE_VALUE
drug:
- PERSON_ID
- DRUG_EXPOSURE_START_DATE
- SIG
- DRUG_SOURCE_VALUE
- ROUTE_SOURCE_VALUE
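A minimal sketch of how such a table-to-column mapping is consumed, mirroring the `get_col_configs`/`extract_col` pattern in `src/filter_dataset.py` (the config path and table here are illustrative):
```python
import pandas as pd
import yaml

# Load the table-to-column mapping above (config path is illustrative).
with open("configs/columns_config.yaml") as fh:
    config_dict = yaml.safe_load(fh)

# Keep only the configured columns for one table, e.g. the person table.
person = pd.read_csv("Cohorts/positive/person.csv")
person = person[config_dict["person"]]
```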
name: rico
channels:
- conda-forge
- defaults
dependencies:
- python=3.8.5
- pandas=1.2.1
- numpy=1.19.5
- pyyaml=5.4.1
- matplotlib=3.3.4
- scikit-learn=0.24.1
- pip
- pip:
- matplotlib==3.3.4
- researchpy==0.2.3
- scorecardpy==0.1.9.2
- xverse==1.0.5
- scikit-learn==0.23.2
- pandas==1.2.1
- numpy==1.19.2
#!/usr/bin/env python
#libraries
import pandas as pd
import numpy as np
@@ -10,8 +8,10 @@ import sklearn
from sklearn.model_selection import train_test_split, StratifiedKFold
import statsmodels.api as sm
from matplotlib import pyplot
#%matplotlib inline
from joblib import dump, load
import argparse
from scipy import stats
# Functions for computing AUC CI using Delong's method
#!/usr/bin/python
@@ -29,10 +29,6 @@ https://github.com/yandexdataschool/roc_comparison
updated: Raul Sanchez-Vazquez
"""
# AUC comparison adapted from
# https://github.com/Netflix/vmaf/
def compute_midrank(x):
@@ -226,178 +222,185 @@ def delong_roc_variance(ground_truth, predictions, sample_weight=None):
return aucs[0], delongcov
if __name__ == "__main__":
# Data setup
# Read, filter based on missingness and identical limits,
# train/test split, and perform WoE transform.
parser = argparse.ArgumentParser()
parser.add_argument(
"--input",
type=str,
required=True,
help="input encoded file")
args = parser.parse_args()
# load data
encoded = pd.read_csv(args.input)
encoded = encoded.drop(['PERSON_ID'],axis=1)
# filter variable via missing rate, iv, identical value rate
encoded_f = sc.var_filter(encoded
, y="class"
, positive='negative'
, identical_limit = 0.95
, iv_limit = 0
, missing_limit=0.95
, return_rm_reason=False # makes output a dictionary referencing 2 dfs
, var_kp=['f_R06'
, 'f_R05'
, 'f_R50'
, 'f_R53'
, 'f_M79'
, 'f_R09'
, 'f_R51'
, 'f_J44'
, 'f_E11'
, 'f_I25'
, 'f_I10'
]
, var_rm = [
'f_BMI-unknown'
, 'f_Unknown'
]
)
# breaking dt into train and test
train, test = sc.split_df(encoded_f, 'class').values()
# woe binning ------
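# Weight-of-Evidence (WoE) encoding: sc.woebin bins each variable and replaces
# each bin with ln(proportion of one class / proportion of the other) for that
# bin, giving the logistic model comparable, monotone inputs -- standard
# scorecard practice.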
bins = sc.woebin(encoded_f, y="class")
# converting train and test into woe values
train_woe = sc.woebin_ply(train, bins)
test_woe = sc.woebin_ply(test, bins)
# get xs and ys
y_train = train_woe.loc[:,'class']
X_train = train_woe.loc[:,train_woe.columns != 'class']
y_test = test_woe.loc[:,'class']
X_test = test_woe.loc[:,train_woe.columns != 'class']
# Lasso-based regression
# Determine a lambda for Lasso (l1) regularization using
# 10-fold cross validation, get predictions from best model, score, and make scorecard
# logistic regression ------
# lasso
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
lasso_cv = LogisticRegressionCV(penalty='l1'
, Cs = 100
, solver='saga'
, cv = StratifiedKFold(10)
, n_jobs=-1
, max_iter = 10000
, scoring = 'neg_log_loss'
, class_weight = 'balanced'
)
lasso_cv.fit(X_train, y_train)
# plot training ROC
sklearn.metrics.plot_roc_curve(lasso_cv, X_train, y_train)
pyplot.plot([0, 1], [0, 1], color='black', lw=2, linestyle='--')
pyplot.title('LASSO Training ROC')
axes = pyplot.gca()
axes.set_facecolor("white")
axes.set_clip_on(False)
pyplot.savefig('results/training_roc.png')
# plot testing ROC
sklearn.metrics.plot_roc_curve(lasso_cv, X_test, y_test)
pyplot.plot([0, 1], [0, 1], color='black', lw=2, linestyle='--')
pyplot.title('LASSO Testing ROC')
axes = pyplot.gca()
axes.set_facecolor("white")
axes.set_clip_on(False)
pyplot.savefig('results/testing_roc.png')
# predicted probability
train_pred = lasso_cv.predict_proba(X_train)[:,1]
train_pred_class = lasso_cv.predict(X_train)
test_pred = lasso_cv.predict_proba(X_test)[:,1]
test_pred_class = lasso_cv.predict(X_test)
# Make scorecard
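# sc.scorecard converts the fitted logistic coefficients plus the WoE bins into
# an additive points table (one row per bin); a patient's total score is the
# sum of the points for the bins they fall into.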
card = sc.scorecard(bins, lasso_cv, X_train.columns)
# credit score
train_score = sc.scorecard_ply(train, card, print_step=0)
test_score = sc.scorecard_ply(test, card, print_step=0)
# psi
pyplot.rcParams["font.size"] = "18"
fig = sc.perf_psi(
score = {'train':train_score, 'test':test_score},
label = {'train':y_train, 'test':y_test},
x_tick_break=50
)
fig['pic']['score'].set_size_inches(18.5, 10.5)
fig['pic']['score'].savefig('results/dist.png')
card_df = pd.concat(card)
card_df.to_csv('results/lasso_card_df.csv')
scores_lasso_2week = sc.scorecard_ply(encoded, card, only_total_score=True, print_step=0, replace_blank_na=True)
scores_lasso_2week.to_csv('results/scores_lasso.csv')
# Training Metrics and AUC CI
print("Training Metrics")
# calculate accuracy
acc = sklearn.metrics.accuracy_score(y_train, train_pred_class)
print('Accuracy: %.3f' % acc)
auc_score = sklearn.metrics.roc_auc_score(y_train, train_pred)
print('AUC: %.3f' % auc_score)
f_score = sklearn.metrics.f1_score(y_train, train_pred_class)
print('FS: %.3f' % f_score)
# delong ci
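# DeLong's method estimates Var(AUC); the interval below is the normal
# approximation AUC +/- z * sqrt(Var(AUC)), with stats.norm.ppf evaluated at
# the (alpha/2, 1 - alpha/2) quantiles and the upper bound clipped at 1.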
delong_alpha = 0.95
auc, auc_cov = delong_roc_variance(
np.ravel(y_train),
np.ravel(train_pred))
auc_std = np.sqrt(auc_cov)
lower_upper_q = np.abs(np.array([0, 1]) - (1 - delong_alpha) / 2)
ci = stats.norm.ppf(
lower_upper_q,
loc=auc_score,
scale=auc_std)
ci[ci > 1] = 1
print('AUC COV:', round(auc_cov,2))
print('95% AUC CI:', np.round(ci,2))
# Testing Metrics and AUC CI
print("Testing Metrics")
# calculate accuracy
acc = sklearn.metrics.accuracy_score(y_test, test_pred_class)
print('Accuracy: %.3f' % acc)
auc_score = sklearn.metrics.roc_auc_score(y_test, test_pred)
print('AUC: %.3f' % auc_score)
f_score = sklearn.metrics.f1_score(y_test, test_pred_class)
print('FS: %.3f' % f_score)
# delong ci
delong_alpha = 0.95
auc, auc_cov = delong_roc_variance(
np.ravel(y_test),
np.ravel(test_pred))
auc_std = np.sqrt(auc_cov)
lower_upper_q = np.abs(np.array([0, 1]) - (1 - delong_alpha) / 2)
ci = stats.norm.ppf(
lower_upper_q,
loc=auc_score,
scale=auc_std)
ci[ci > 1] = 1
print('AUC COV:', round(auc_cov,2))
print('95% AUC CI:', np.round(ci,2))
\ No newline at end of file
import pandas as pd
pd.set_option('display.max_rows', None)
import re
import yaml
import numpy as np
import argparse
import os
# Load config file
def get_col_configs(config_f):
with open(config_f) as fh:
config_dict = yaml.safe_load(fh)
# print(config_dict)
return config_dict
# Extracts necessary columns from each table according to the config file
def extract_col(config_dict,df,file):
#print('Extracting columns according to config file !....')
df = df[config_dict[file]]
return df
# Parse column values to extract smoking status
def parse_values(x):
if 'Never' in x and 'Tobacco' in x:
return 'never_smoker'
elif 'Former' in x and 'Tobacco' in x:
return 'former_smoker'
elif ('Current' in x or 'current' in x or 'Light' in x) and 'Tobacco' in x:
return 'current_smoker'
elif 'Unknown' in x and 'Tobacco' in x: