RC Data Science
createAndParseSACCT
Commits
b3f16d30
Commit
b3f16d30
authored
Jun 24, 2020
by
Ryan Randles Jones
changed user to jobs per user
parent
871025b9
Changes
1
Jobs-and-Users-ReqMemCPU.ipynb
%% Cell type:markdown id: tags:
# Notebook Setup
%% Cell type:code id: tags:
```
# must run
import sqlite3
import slurm2sql
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import matplotlib.ticker as ticker
```
%% Cell type:code id: tags:
```
from RC_styles import rc_styles as style
```
%% Cell type:code id: tags:
```
# must run
# connects to the existing SQLite database of Slurm job records collected since March 2020
db = sqlite3.connect('/data/rc/rc-team/slurm-since-March.sqlite3')
```
%% Cell type:code id: tags:
```
# must run
# df is the starting dataframe, loaded from the slurm table
df = pd.read_sql('SELECT * FROM slurm', db)
```
%% Cell type:code id: tags:
```
# voluntary
# for displaying all available column options
pd.set_option('display.max_columns', None)
df.head(5)
```
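%% Cell type:markdown id: tags:
The cell above loads the whole table just to look at its column names. As a lighter alternative, the column names could be read straight from SQLite. This is a minimal sketch, not part of the original notebook: it uses a hypothetical in-memory table with three made-up columns so that it runs on its own, and the real `slurm` table has many more columns.
%% Cell type:code id: tags:
```python
import sqlite3

# Hypothetical stand-in for the real database: an in-memory table named
# "slurm" with three illustrative columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slurm (JobID INTEGER, User TEXT, ReqMemCPU INTEGER)")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [row[1] for row in conn.execute("PRAGMA table_info(slurm)")]
print(columns)

conn.close()
```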
%% Cell type:code id: tags:
```
# must run
# converts units in ReqMemCPU column from bytes to gigs (1 gig = 1024**3 bytes)
df['ReqMemCPU'] = df['ReqMemCPU'].div(1024**3)
```
%% Cell type:code id: tags:
```
# must run
# df_completed is dataframe of all completed jobs
df_completed = df[df.State.str.contains('COMPLETED')]
#df_completed.head(5)
```
%% Cell type:code id: tags:
```
# must run
# df_batch is df with only batch jobs
df_batch = df[df.JobName.str.contains('batch')]
#df_batch.head(5)
```
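%% Cell type:markdown id: tags:
Both filters above use `str.contains`, which matches substrings, so a job named `my_batch_run` would also be kept by the `'batch'` filter. The sketch below contrasts substring and exact matching on toy data. The job names are hypothetical, and whether exact matching is wanted depends on how the job steps are named in the real table.
%% Cell type:code id: tags:
```python
import pandas as pd

# Toy data (hypothetical job names, not taken from the real database).
toy = pd.DataFrame({"JobName": ["batch", "my_batch_run", "extern", None]})

# Substring match: keeps every name that contains "batch" anywhere.
# na=False treats missing names as non-matches instead of propagating NaN.
substring_match = toy[toy["JobName"].str.contains("batch", na=False)]

# Exact match: keeps only rows whose name is exactly "batch".
exact_match = toy[toy["JobName"] == "batch"]

print(len(substring_match), len(exact_match))
```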
%% Cell type:markdown id: tags:
# Average RAM per CPU Requested by User
%% Cell type:code id: tags:
```
# must run
# df_2 is a dataframe of completed jobs with only User and ReqMemCPU
# it is used for the user dataframes
# .copy() makes df_2 independent of df_completed so the in-place cleanup below is safe
df_2 = df_completed.loc[:,['User','ReqMemCPU']].copy()
#df_2.head(5)
```
%% Cell type:code id: tags:
```
# must run
# fills empty strings in User column with NaN and then filters them out
# to give a dataset of users with no empty strings
nan_value = float("NaN")
df_2.replace("", nan_value, inplace=True)
df_2.dropna(subset = ["User"], inplace=True)
#df_2.head(5)
```
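%% Cell type:markdown id: tags:
The same cleanup can be written without `inplace=True`, which leaves the source dataframe untouched and limits the replacement to the `User` column. This is a sketch on toy data with hypothetical user names, not a change to the notebook's own cell.
%% Cell type:code id: tags:
```python
import numpy as np
import pandas as pd

# Toy data (hypothetical users): two rows have an empty User string.
toy = pd.DataFrame({
    "User": ["alice", "", "bob", ""],
    "ReqMemCPU": [1.0, 2.0, 4.0, 8.0],
})

# Replace empty strings with NaN in the User column only, then drop those rows.
# assign() and dropna() both return new dataframes, so toy is not modified.
cleaned = (
    toy.assign(User=toy["User"].replace("", np.nan))
       .dropna(subset=["User"])
)
print(cleaned)
```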
%% Cell type:code id: tags:
```
# must run
# count = number of jobs per user
# mean, std, min, 25%, 50%, 75%, and max describe the gigs of memory per CPU
# requested by that user across all their jobs
df_user = df_2.groupby('User')['ReqMemCPU'].describe().reset_index()
#df_user.head(5)
```
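%% Cell type:markdown id: tags:
To make the shape of `df_user` concrete, the sketch below runs the same `groupby(...).describe()` call on a four-row toy dataframe with hypothetical users. One thing it illustrates: a user with a single job has a `std` of NaN, because the sample standard deviation is undefined for one value.
%% Cell type:code id: tags:
```python
import pandas as pd

# Toy data (hypothetical users): alice has three jobs, bob has one.
toy = pd.DataFrame({
    "User": ["alice", "alice", "alice", "bob"],
    "ReqMemCPU": [1.0, 2.0, 3.0, 4.0],
})

# One row per user with count, mean, std, min, quartiles, and max.
summary = toy.groupby("User")["ReqMemCPU"].describe().reset_index()
print(summary[["User", "count", "mean", "std"]])
```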
%% Cell type:code id: tags:
```
# must run
# creates user number column of strings of numbers from 0 to one less than the total number of users
# used in graphs in place of usernames
usernames = df_user['User']
user_numbers = [str(i) for i in range(len(usernames))]
df_user['User Number'] = user_numbers
df_user.head(5)
```
%% Cell type:code id: tags:
```
# voluntary
# description of number of jobs run per user - can be used to choose the Upper Limit Job Count
df_user['count'].describe()
```
%% Cell type:code id: tags:
```
# must run
# variable to be used in names of plots to describe the max job count per user
# max = 367257
UpperlimitJobCount = 50
```
%% Cell type:code id: tags:
```
# must run
# creates dataframe from df_user that keeps only users who ran at most UpperlimitJobCount jobs (defined above)
jobscount_cutoff = df_user[(df_user['count'] <= UpperlimitJobCount)]
jobscount_cutoff.head(5)
```
%% Cell type:code id: tags:
```
# must run
# df_user_graph is jobscount_cutoff sorted in ascending order by count for easy readability of graph
df_user_graph = jobscount_cutoff.sort_values(by='count', ascending=True)
df_user_graph.head(5)
```
%% Cell type:code id: tags:
```
style.default_axes_and_ticks()
style.figsize()
user_graph1 = sns.scatterplot(x="count", y="mean",data=df_user_graph)
plt.title('Average Requested RAM per CPU by User for all Users Running %i Jobs or Fewer'%UpperlimitJobCount)
plt.xlabel('Job Count Per User')
plt.ylabel('Average Requested RAM per CPU (Gigs)')
plt.show()
```
%% Cell type:code id: tags:
```
style.default_axes_and_ticks()
style.figsize()
user_graph = sns.barplot(x="count", y="mean", data=df_user_graph, color='blue')
#user_graph.set_xscale('log')
user_graph.xaxis.set_major_locator(ticker.MultipleLocator(2))
user_graph.xaxis.set_major_formatter(ticker.ScalarFormatter())
plt.title('Average Requested RAM per CPU by User for all Users Running %i Jobs or Fewer'%UpperlimitJobCount)
plt.xlabel('Job Count')
plt.ylabel('Average Requested RAM per CPU (Gigs)')
plt.show()
```
%% Cell type:code id: tags:
```
# bar graph for jobs run per user - shows average requested RAM per CPU for all jobs by user
user_graph2 = px.bar(df_user_graph, x='count', y='mean', color = 'count',
                     hover_data=['max','count'],
                     labels={'mean':'Average Requested RAM per CPU (Gigs)'},
                     height=400)
user_graph2.update_layout(
    xaxis_type = 'category',
    title={
        'text': "Average Requested RAM per CPU by User for all Users",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
user_graph2.show()
```
%% Cell type:markdown id: tags:
# Average RAM per CPU by Job
%% Cell type:code id: tags:
```
# must run
# df_4 is a dataframe of batch jobs with only JobStep, ReqMemCPU, and ArrayJobID
# it is used to pull out needed information and create separate datasets to compare
df_4 = df_batch.loc[:,['JobStep','ReqMemCPU','ArrayJobID']]
#df_4.head(5)
```
%% Cell type:code id: tags:
```
# must run
# variable to be used in names of plots to describe the max gigs measured
UpperlimitGB = 5
# variable for max gigs of RAM requested - charts range from 0 to upperRAMlimit gigs
# ReqMemCPU was already converted from bytes to gigs above, so the limit is in gigs as well
upperRAMlimit = UpperlimitGB # 5 gigs
```
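%% Cell type:markdown id: tags:
Because `ReqMemCPU` was divided by `1024**3` in the setup section, any cutoff compared against it has to be in gigs as well. The arithmetic sketch below shows how far apart the candidate values are. The variable names are illustrative only and are not part of the notebook.
%% Cell type:code id: tags:
```python
GIB = 1024 ** 3  # bytes per gig, the divisor applied to ReqMemCPU in the setup section

limit_gb = 5  # hypothetical cutoff, mirroring UpperlimitGB

# 10e+10 is scientific notation for 10 * 10**10, which is 1e11.
scaled = limit_gb * 10e+10

# The same cutoff expressed in bytes, for a column that has NOT been converted.
limit_bytes = limit_gb * GIB

print(scaled)       # 500000000000.0
print(limit_bytes)  # 5368709120
```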
%% Cell type:code id: tags:
```
# must run
# creates dataframe from df_4 that keeps all jobs requesting RAM per CPU up to the upperRAMlimit defined above
batch_cutoff = df_4[(df_4.ReqMemCPU <= upperRAMlimit)]
#batch_cutoff.head(5)
```
%% Cell type:code id: tags:
```
# must run
# batch_cutoff_graph is batch_cutoff sorted in descending order by ReqMemCPU for easy readability of graph
batch_cutoff_graph = batch_cutoff.sort_values(by='ReqMemCPU', ascending=False)
#batch_cutoff_graph.head(5)
```
%% Cell type:code id: tags:
```
style.default_axes_and_ticks()
style.figsize()
# shows the number of jobs requesting cpu memory for all jobs (array and non array jobs)
Jobs_fig = sns.distplot(batch_cutoff['ReqMemCPU'], kde=False, label='Number of Jobs Requesting RAM per CPU for all Jobs', color = "green")
Jobs_fig.set_yscale('log')
plt.legend(prop={'size': 12},loc='upper right',bbox_to_anchor=(2.25, 1.0),ncol=1)
plt.title('Number of Jobs Requesting RAM per CPU for all Jobs %i gigs or less'%UpperlimitGB)
plt.xlabel('Requested Gigs of RAM')
plt.ylabel('Number of Jobs Requesting')
plt.show()
```
%% Cell type:code id: tags:
```
```
...
...