Commit 1f6a89ce authored by Tarun Karthik Kumar Mamidi

Merge branch 'unit-test-investigation' into 'master'


Unit test investigation

See merge request center-for-computational-genomics-and-data-science/sciops/covid-19_risk_predictor!4
parents 0681578a e0b53dce
Pipeline #4676 canceled
pipeline {
    agent any
    options {
        timestamps()
        ansiColor('xterm')
    }
    environment {
        GITLAB_API_TOKEN = credentials('GitLabToken')
        BASE_GITLAB_URL = credentials('BaseGitlabUrl')
    }
    stages {
        stage('Static Analysis') {
            agent {
                docker { image '${BASE_GITLAB_URL}/center-for-computational-genomics-and-data-science/utility-images/static-analysis:v1.1' }
            }
            steps {
                sh '/bin/linting.sh'
            }
            post {
                success {
                    sh "curl --request POST --header \"PRIVATE-TOKEN: ${GITLAB_API_TOKEN}\" \"https://gitlab.rc.uab.edu/api/v4/projects/1585/statuses/${GIT_COMMIT}?state=success&name=jenkins_static_analysis\""
                }
                failure {
                    sh "curl --request POST --header \"PRIVATE-TOKEN: ${GITLAB_API_TOKEN}\" \"https://gitlab.rc.uab.edu/api/v4/projects/1585/statuses/${GIT_COMMIT}?state=canceled&name=jenkins_static_analysis\""
                }
            }
        }
        stage('Unit Test') {
            agent {
                docker { image 'continuumio/miniconda3:4.9.2' }
            }
            steps {
                sh 'conda env create --file configs/environment.yaml'
                sh 'python -m unittest -v testing/unit_test.py'
            }
            post {
                success {
                    sh "curl --request POST --header \"PRIVATE-TOKEN: ${GITLAB_API_TOKEN}\" \"https://gitlab.rc.uab.edu/api/v4/projects/1585/statuses/${GIT_COMMIT}?state=success&name=jenkins_unit_tests\""
                }
                failure {
                    sh "curl --request POST --header \"PRIVATE-TOKEN: ${GITLAB_API_TOKEN}\" \"https://gitlab.rc.uab.edu/api/v4/projects/1585/statuses/${GIT_COMMIT}?state=canceled&name=jenkins_unit_tests\""
                }
            }
        }
    }
    post {
        success {
            sh "curl --request POST --header \"PRIVATE-TOKEN: ${GITLAB_API_TOKEN}\" \"https://gitlab.rc.uab.edu/api/v4/projects/1585/statuses/${GIT_COMMIT}?state=success&name=jenkins\""
        }
        failure {
            sh "curl --request POST --header \"PRIVATE-TOKEN: ${GITLAB_API_TOKEN}\" \"https://gitlab.rc.uab.edu/api/v4/projects/1585/statuses/${GIT_COMMIT}?state=canceled&name=jenkins\""
        }
    }
}
# The MIT License (MIT)

Copyright (c) 2021 Center for Computational Genomics and Data Science

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- [COVID-19_RISK_PREDICTOR](#covid-19_risk_predictor)
  - [Data availability](#data-availability)
  - [Usage](#usage)
    - [Installation](#installation)
      - [Requirements](#requirements)
    - [Activate conda environment](#activate-conda-environment)
    - [Run parser](#run-parser)
    - [Run model training](#run-model-training)
    - [Build Streamlit app](#build-streamlit-app)
    - [Unit Testing](#unit-testing)
  - [Contact information](#contact-information)

# COVID-19_RISK_PREDICTOR

***!!! For research purposes only !!!***

**Aim:** To develop a model that takes in demographics, lifestyle, and symptoms/conditions to predict the risk of
COVID-19 infection for patients.
## Data availability
Data was made available through the UAB Biomedical Research Information Technology Enhancement (U-BRITE) framework.
Access to the level-2 i2b2 data was granted upon self-service, pursuant to an IRB exemption.
[link](https://www.uab.edu/ccts/research-commons/berd/55-research-commons/informatics/325-i2b2)
### Directory structure used to parse data from positive and negative cohorts
The dataset used was transformed to adhere to the [OMOP Common Data Model Version 5.3.1](https://ohdsi.github.io/CommonDataModel/cdm531.html)
to enable systematic analyses of EHR data from disparate sources.

```directory
Cohorts/
├── positive <--- positive cohort directory
│   ├── measurement.csv - test and results
...
└── README.md
```
## Usage
### Installation
Installation simply requires fetching the source code. The following are required:
- Git
...

### Activate conda environment

```sh
conda activate rico
```
### Run parser
```sh
python src/filter_dataset.py --pos Cohorts/positive/ --neg Cohorts/negative/
```
For help, use the `-h` argument:
```sh
python src/filter_dataset.py -h
```
Parsed files are saved in the `./results` directory.
### Run model training
```sh
python src/Model.py --input results/encoded-100-week-filter.csv
```
Output files are saved in the `./results` directory; based on `src/Model.py`, these include ROC plots
(`training_roc.png`, `testing_roc.png`), a score-distribution plot (`dist.png`), the scorecard table
(`lasso_card_df.csv`), and per-patient scores (`scores_lasso.csv`).
### Build Streamlit app
To demonstrate the application of these models, one of the four was chosen and a sample Streamlit app was created and
included in the project. Please refer to `src/streamlit/RICO.py`.

**Note** - This Streamlit app demonstrates one of the models; it is not required for the pipeline and exists only to
display score calculation and interpretation. The questionnaire from the models can be used manually without it. The
app is therefore not tested and should be used at your own risk, for demo purposes or as a guide for building on this
work.
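
With the conda environment active, a Streamlit app such as this one is normally launched with Streamlit's own CLI
(assuming Streamlit itself is installed; it is not among the dependencies shown in this diff):

```sh
streamlit run src/streamlit/RICO.py
```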
### Unit Testing
To test the functions in `filter_dataset.py`, use the command below:
```sh
python -m unittest -v testing/unit_test.py
```
To measure test coverage, use the commands below:
```sh
# run the test suite under coverage
coverage run -m unittest -v testing/unit_test.py
# print a coverage report
coverage report
# generate annotated HTML listings
coverage html
```
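
For orientation, here is a minimal sketch of what a parameterized test in `testing/unit_test.py` might look like. It is
a hypothetical example: the test name, paths, and assertion are illustrative stand-ins, not the project's actual tests
(the environment pins `parameterized=0.8.1`, which suggests parameterized cases are used):

```python
import unittest

from parameterized import parameterized


class FilterDatasetTest(unittest.TestCase):
    # Hypothetical cases: each tuple is (test name suffix, cohort path, expected flag).
    @parameterized.expand([
        ("positive_cohort", "Cohorts/positive/", True),
        ("negative_cohort", "Cohorts/negative/", False),
    ])
    def test_cohort_label(self, name, path, expected_positive):
        # Stand-in assertion; a real test would call functions from filter_dataset.py.
        self.assertEqual("positive" in path, expected_positive)


if __name__ == "__main__":
    unittest.main()
```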
**Note** - Functions in `Model.py` are adapted from [this GitHub repo](https://github.com/yandexdataschool/roc_comparison),
which already includes unit tests.
## Contact information
For issues, please send an email with a clear description to:

Tarun Mamidi - tmamidi@uab.edu
Ryan Melvin - rmelvin@uabmc.edu
dependencies:
  ...
  - pyyaml=5.4.1
  - matplotlib=3.3.4
  - scikit-learn=0.24.1
  - black=21.5b0
  - parameterized=0.8.1
  - pip=21.1.1
  - pip:
    - scorecardpy==0.1.9.2
    - xverse==1.0.5
    - coverage==5.5
# libraries
import pandas as pd
import numpy as np
import xverse
...
import scorecardpy as sc  # assumed: this import falls outside the hunk shown, but the script calls sc.* below
import sklearn
from sklearn.model_selection import train_test_split, StratifiedKFold
import statsmodels.api as sm
from matplotlib import pyplot
#%matplotlib inline
from joblib import dump, load
import argparse
from scipy import stats
# Functions for computing AUC CI using Delong's method
#!/usr/bin/python
"""
AUC DeLong CI
...
"""


def compute_midrank(x):
    ...
        j = i
        while j < N and Z[j] == Z[i]:
            j += 1
        T[i:j] = 0.5 * (i + j - 1)
        i = j
    T2 = np.empty(N, dtype=np.float)
    # Note(kazeevn) +1 is due to Python using 0-based indexing
    ...


def fastDeLong_weights(predictions_sorted_transposed, label_1_count, sample_weight):
    ...
    total_negative_weights = sample_weight[m:].sum()
    pair_weights = np.dot(sample_weight[:m, np.newaxis], sample_weight[np.newaxis, m:])
    total_pair_weights = pair_weights.sum()
    aucs = (sample_weight[:m] * (tz[:, :m] - tx)).sum(axis=1) / total_pair_weights
    v01 = (tz[:, :m] - tx[:, :]) / total_negative_weights
    v10 = 1.0 - (tz[:, m:] - ty[:, :]) / total_positive_weights
    sx = np.cov(v01)
    sy = np.cov(v10)
    delongcov = sx / m + sy / n
    ...


def delong_roc_variance(ground_truth, predictions, sample_weight=None):
    """
    ...
    predictions: np.array of floats of the probability of being class 1
    """
    order, label_1_count, ordered_sample_weight = compute_ground_truth_statistics(
        ground_truth, sample_weight
    )
    predictions_sorted_transposed = predictions[np.newaxis, order]
    aucs, delongcov = fastDeLong(
        predictions_sorted_transposed, label_1_count, ordered_sample_weight
    )
    assert len(aucs) == 1, "There is a bug in the code, please forward this to the developers"
    return aucs[0], delongcov
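

# Illustrative usage (added note; the values are made up, not project data):
#   y_true = np.array([0, 0, 1, 1])
#   y_prob = np.array([0.10, 0.40, 0.35, 0.80])
#   auc, auc_cov = delong_roc_variance(y_true, y_prob)
# A 95% normal-approximation CI for the AUC is auc +/- 1.96 * sqrt(auc_cov),
# which is what the __main__ block below computes with stats.norm.ppf.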
if __name__ == "__main__":
    # Data setup
    # Read, filter based on missingness and identical limits,
    # train/test split, and perform WoE transform.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", type=str, required=True, help="input encoded file")
    args = parser.parse_args()

    # load data
    encoded = pd.read_csv(args.input)
    encoded = encoded.drop(["PERSON_ID"], axis=1)
    # filter variables via missing rate, IV, and identical value rate
    encoded_f = sc.var_filter(
        encoded,
        y="class",
        positive="negative",
        identical_limit=0.95,
        iv_limit=0,
        missing_limit=0.95,
        return_rm_reason=False,  # makes output a dictionary referencing 2 dfs
        var_kp=[
            "f_R06",
            "f_R05",
            "f_R50",
            "f_R53",
            "f_M79",
            "f_R09",
            "f_R51",
            "f_J44",
            "f_E11",
            "f_I25",
            "f_I10",
        ],
        var_rm=["f_BMI-unknown", "f_Unknown"],
    )
    # break the data into train and test sets
    train, test = sc.split_df(encoded_f, "class").values()

    # woe binning ------
    bins = sc.woebin(encoded_f, y="class")

    # converting train and test into woe values
    train_woe = sc.woebin_ply(train, bins)
    test_woe = sc.woebin_ply(test, bins)

    # get xs and ys
    y_train = train_woe.loc[:, "class"]
    X_train = train_woe.loc[:, train_woe.columns != "class"]
    y_test = test_woe.loc[:, "class"]
    X_test = test_woe.loc[:, train_woe.columns != "class"]
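
    # Background note (added): the weight-of-evidence value substituted for each bin is
    # WoE = ln(% of events in the bin / % of non-events in the bin), so every feature
    # enters the logistic model on a common log-odds scale.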
    # Lasso-based regression
    # Determine a lambda for Lasso (l1) regularization using
    # 10-fold cross validation, get predictions from best model, score, and make scorecard

    # logistic regression ------
    # lasso
    from sklearn.linear_model import LogisticRegressionCV, LogisticRegression

    lasso_cv = LogisticRegressionCV(
        penalty="l1",
        Cs=100,
        solver="saga",
        cv=StratifiedKFold(10),
        n_jobs=-1,
        max_iter=10000,
        scoring="neg_log_loss",
        class_weight="balanced",
    )
    lasso_cv.fit(X_train, y_train)
    # plot training ROC
    sklearn.metrics.plot_roc_curve(lasso_cv, X_train, y_train)
    pyplot.plot([0, 1], [0, 1], color="black", lw=2, linestyle="--")
    pyplot.title("LASSO Training ROC")
    axes = pyplot.gca()
    axes.set_facecolor("white")
    axes.set_clip_on(False)
    pyplot.savefig("results/training_roc.png")

    # plot testing ROC
    sklearn.metrics.plot_roc_curve(lasso_cv, X_test, y_test)
    pyplot.plot([0, 1], [0, 1], color="black", lw=2, linestyle="--")
    pyplot.title("LASSO Testing ROC")
    axes = pyplot.gca()
    axes.set_facecolor("white")
    axes.set_clip_on(False)
    pyplot.savefig("results/testing_roc.png")
    # predicted probability
    train_pred = lasso_cv.predict_proba(X_train)[:, 1]
    train_pred_class = lasso_cv.predict(X_train)
    test_pred = lasso_cv.predict_proba(X_test)[:, 1]
    test_pred_class = lasso_cv.predict(X_test)

    # Make scorecard
    card = sc.scorecard(bins, lasso_cv, X_train.columns)

    # credit score
    train_score = sc.scorecard_ply(train, card, print_step=0)
    test_score = sc.scorecard_ply(test, card, print_step=0)

    # psi
    pyplot.rcParams["font.size"] = "18"
    fig = sc.perf_psi(
        score={"train": train_score, "test": test_score},
        label={"train": y_train, "test": y_test},
        x_tick_break=50,
    )
    fig["pic"]["score"].set_size_inches(18.5, 10.5)
    fig["pic"]["score"].savefig("results/dist.png")

    card_df = pd.concat(card)
    card_df.to_csv("results/lasso_card_df.csv")
    scores_lasso_2week = sc.scorecard_ply(
        encoded, card, only_total_score=True, print_step=0, replace_blank_na=True
    )
    scores_lasso_2week.to_csv("results/scores_lasso.csv")
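
    # Background note (added): scorecard points are a linear rescaling of the logistic
    # model's log-odds (points = offset - factor * log-odds), so summing a patient's
    # card points reproduces the model's risk prediction on a points scale.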
    # Training Metrics and AUC CI
    print("Training Metrics")
    # calculate accuracy
    acc = sklearn.metrics.accuracy_score(y_train, train_pred_class)
    print("Accuracy: %.3f" % acc)
    auc_score = sklearn.metrics.roc_auc_score(y_train, train_pred)
    print("AUC: %.3f" % auc_score)
    f_score = sklearn.metrics.f1_score(y_train, train_pred_class)
    print("FS: %.3f" % f_score)

    # delong ci
    delong_alpha = 0.95
    auc, auc_cov = delong_roc_variance(np.ravel(y_train), np.ravel(train_pred))
    auc_std = np.sqrt(auc_cov)
    lower_upper_q = np.abs(np.array([0, 1]) - (1 - delong_alpha) / 2)
    ci = stats.norm.ppf(lower_upper_q, loc=auc_score, scale=auc_std)
    ci[ci > 1] = 1
    print("AUC COV:", round(auc_cov, 2))
    print("95% AUC CI:", np.round(ci, 2))
    # Testing Metrics and AUC CI
    print("Testing Metrics")
    # calculate accuracy
    acc = sklearn.metrics.accuracy_score(y_test, test_pred_class)
    print("Accuracy: %.3f" % acc)
    auc_score = sklearn.metrics.roc_auc_score(y_test, test_pred)
    print("AUC: %.3f" % auc_score)
    f_score = sklearn.metrics.f1_score(y_test, test_pred_class)
    print("FS: %.3f" % f_score)

    # delong ci
    delong_alpha = 0.95
    auc, auc_cov = delong_roc_variance(np.ravel(y_test), np.ravel(test_pred))
    auc_std = np.sqrt(auc_cov)
    lower_upper_q = np.abs(np.array([0, 1]) - (1 - delong_alpha) / 2)
    ci = stats.norm.ppf(lower_upper_q, loc=auc_score, scale=auc_std)
    ci[ci > 1] = 1
    print("AUC COV:", round(auc_cov, 2))
    print("95% AUC CI:", np.round(ci, 2))
import streamlit as st
import plotly.graph_objects as go
import pandas as pd  # assumed import: `pd` is used below but its import falls outside the hunk shown

st.title("COVID-19 Risk Predictor")
st.markdown(
    "<h3 style='text-align: right;'>for research purposes only</h3>", unsafe_allow_html=True,
)
"""
"""
pd.options.display.max_colwidth = 500


def imc_chart(imc):
    if imc >= 213:
        color = "red"
        "## Alert: Please take a COVID test immediately."
        # '### You are >20% likely.'
    elif imc >= 170 and imc < 213:
        color = "orange"
        "## Alert: Please consult a doctor to take COVID test"
    elif imc >= 0 and imc < 170:
        color = "lightgreen"
        "## Alert: Please consult a doctor to take COVID test"
    elif imc < 0:
        color = "green"
        "## Alert: Please consult a doctor if you have symptoms"
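
    # Note (added): inside a Streamlit script, the bare string literals above are
    # rendered as markdown by Streamlit's "magic" feature, so each branch shows its alert.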
    fig = go.Figure(go.Indicator(
        mode = "gauge+number+delta",
        domain = {'x': [0, 1], 'y': [0, 1]},
        value = imc,
        title = {'text': "Patient Risk Score"},
        delta = {'reference': 213, 'increasing': {'color': "RebeccaPurple"}},
        gauge = {
            'axis': {'range': [-170, 350], 'tickwidth': 1, 'tickcolor': "darkblue"},
            'bar': {'color': color},
            'steps' : [