Commit 22f2bb38 authored by KOMAL BADI's avatar KOMAL BADI

This is the throughput analysis plot for user jobs.

parent dd192b8c
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <h1><center>Throughput Analysis - User Jobs</center></h1>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objective:\n",
" \n",
"Throughput is the amount of output that can be processed in a given period of time. This throughput plot gives statistics about a single user's jobs. Database is created based on this single job and further Running jobs, Pending jobs, Completed jobs, Currently running jobs are calculated.\n",
" \n",
" \n",
"<b>Submit Time </b> - Job submmited time.\n",
"\n",
"<b>Start Time </b>- point of time when job/task has started.\n",
"\n",
"<b>End Time</b> - point time when job/task has ended.\n",
"\n",
"<b>Running Jobs</b> - Cummulative sum of started jobs.\n",
"\n",
"<b>Completed Jobs</b> - Cummulative sum of ended jobs.\n",
"\n",
"<b>Pending Jobs </b>- It is the difference between the total jobs and the running jobs, pending jobs are the jobs which are waiting to be started.\n",
"\n",
"<b>Currently Running Jobs </b>- It is the difference between the cummulative sum of started jobs and cummulative sum of ended jobs. "
]
},
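{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Illustrative sketch only (toy data, not Slurm output): shows how the\n",
"#definitions above reduce to cumulative sums and their difference.\n",
"import pandas as pd\n",
"toy = pd.DataFrame({'started': [2, 1, 0, 3], 'ended': [0, 1, 2, 1]})\n",
"toy['Running'] = toy['started'].cumsum()\n",
"toy['Completed'] = toy['ended'].cumsum()\n",
"toy['Currently_Running'] = toy['Running'] - toy['Completed']\n",
"toy"
]
},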
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Enter the User ID for which you want to do throughput analysis \n",
"user_id='abgvg9'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Libraries Used\n",
"\n",
"1) sqlite3 , slurm2sql , pandas are mandatory for converting slurm to sql data.\n",
"\n",
"2) seaborn , matplotlib, RC_STYLES.rc_styles for visualization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Mandatory\n",
"import sqlite3\n",
"import slurm2sql\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib\n",
"import warnings\n",
"from RC_STYLES import rc_styles as s\n",
"warnings.filterwarnings(\"ignore\")\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Database Connection\n",
"<b>slurm-since-March-allocation.sqlite3 :</b> We're connecting to slurm-since-March-allocation.sqlite3 database which as slurm data. SQLite is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. Some applications can use SQLite for internal data storage. It’s also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Mandatory\n",
"db = sqlite3.connect('/data/rc/rc-team/slurm-since-March-allocation.sqlite3')\n",
"#db = sqlite3.connect('/data/rc/rc-team/slurm-since-March.sqlite3')\n",
"df = pd.read_sql('SELECT * FROM slurm', db)\n",
"df.head()"
]
},
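{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Optional sketch, left commented out. Assumption: slurm2sql's Python API\n",
"#(slurm2sql.slurm2sql(db, sacct_args)) as described in its README; the\n",
"#start date below is a placeholder. Use this only if the .sqlite3 file\n",
"#has to be (re)built from sacct instead of read from /data/rc/rc-team.\n",
"#db = sqlite3.connect('slurm-since-March-allocation.sqlite3')\n",
"#slurm2sql.slurm2sql(db, ['-S', '2020-03-01'])"
]
},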
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Submit_Time , Start_Time , End_Time :\n",
"Termination time of the job. Format output is, YYYY-MM-DDTHH:MM:SS, unless changed through the SLURM_TIME_FORMAT environment variable. Submit_time is decribed as the time the job was submitted. Initiation time of the job in the same format as End. Here submit,start and End columns which are in epoch in the sacct are converted to date_time format and are saved under submit_time , start_time and end_time columns .\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Voluntary\n",
"df['start_time'] = pd.to_datetime(df['Start'],unit='s')\n",
"df['end_time'] = pd.to_datetime(df['End'],unit='s')\n",
"df['time'] = pd.to_datetime(df['Time'],unit='s')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Cleaning\n",
"\n",
"We're dropping the NA values and also calculating the Total RAM requested, Waiting time. We're also converting Elapsed time, Waiting time, CPU time to hours, converting the memory to GB and naming all the cancelled by user jobs as Cancelled jobs. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Voluntary\n",
"#ReqMemNode is Requested memory for each node, in MB. \n",
"#Hence we are converting ReqMemNode in GB for better understanding.\n",
"#Converting ReqMemNodes in GB\n",
"df['ReqMemNode']=df['ReqMemNode']/((1024)*(1024)*(1024)) \n",
"\n",
"#AveRSS is Average resident set size of all tasks in job.\n",
"#Converting AveRSS in GB\n",
"df['AveRSS']=df['AveRSS']/((1024)*(1024)*(1024))\n",
"\n",
"##ReqMemCPU is Requested memory for each CPU, in MB.\n",
"#Converting ReqMemCPU in GB\n",
"df['ReqMemCPU']=df['ReqMemCPU']/((1024)*(1024)*(1024))\n",
"\n",
"###ReqTotalRAM is multiplying Requested memory per each CPU by No. of CPUS requested\n",
"#Computing Total Requested RAM in GB\n",
"df['ReqTotalRAM']=df['NCPUS']*df['ReqMemCPU'] \n",
"\n",
"#Naming all the cancelled by user jobs as Cancelled jobs\n",
"df.loc[df['State'].str.contains('CANCELLED'), 'State'] = 'CANCELLED'\n",
"\n",
"#Waiting time is the time between the job being submitted \n",
"#to slurm scheduluer and the time at which job starts\n",
"#computing waiting time\n",
"df['Waiting'] = df['Start']-df['Submit']\n",
"df1 = df.dropna(subset=['Waiting'])\n",
"\n",
"#Computing waiting time in hours\n",
"df1['Waiting'] = df1['Waiting']/3600 \n",
"\n",
"#Computing Elapsed time in hours\n",
"df1['Elapsed'] = df1['Elapsed']/3600\n",
"\n",
"#Computing CPU time in hours\n",
"df1['CPUTime']=df1['CPUTime']/3600\n",
"\n",
"#droping na values for time(submitted jobs at a particular time)\n",
"#df1 = df1.dropna(subset=['Time']) \n",
"#df1 = df1.dropna(subset=['Submit']) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Succesful User Jobs:\n",
"\n",
"Creating a pandas dataframe in which we seperate all the User jobs where User is not equal to Nan. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_User_jobs=df1.dropna(subset=['User'])\n",
"df_User_jobs.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas Groupby Operator:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Mandatory\n",
"#Calculating number of jobs submitted per user\n",
"User_jobs = df_User_jobs.groupby(\"User\")[\"JobID\"].count().reset_index()\n",
"User_jobs.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"User_jobs=User_jobs[User_jobs!=0].dropna()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pulling out data corresponding to 1 User Job:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Voluntary\n",
"#Sorting the previous pandas data frame in descending order to see \n",
"#highest no. of job for a single user job and pull out that specific user job.\n",
"sample_data=User_jobs.sort_values(by='JobID', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_User_job = df.loc[df['User'] ==user_id]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_User_job.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Submit_Time , Start_Time , End_Time :\n",
"Termination time of the job. Format output is, YYYY-MM-DDTHH:MM:SS, unless changed through the SLURM_TIME_FORMAT environment variable. Submit_time is decribed as the time the job was submitted. Initiation time of the job in the same format as End. Here submit,start and End columns which are in epoch in the sacct are converted to date_time format and are saved under submit_time , start_time and end_time columns ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Mandatory\n",
"sample_User_job['submit_time'] = pd.to_datetime(sample_User_job['Submit'],unit='s')\n",
"sample_User_job['start_time'] = pd.to_datetime(sample_User_job['Start'],unit='s')\n",
"sample_User_job['end_time'] = pd.to_datetime(sample_User_job['End'],unit='s')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Mandatory\n",
"#Creating 3 different dataframes in which each data frame \n",
"#is grouped by submitted,started and end time of user job.\n",
"# Job count of each time is calculated.\n",
"count_jobs_submit_time= sample_User_job.groupby([\"submit_time\"] , as_index=False)[\"JobID\"].count()\n",
"count_jobs_start_time= sample_User_job.groupby([\"start_time\"] , as_index=False)[\"JobID\"].count()\n",
"count_jobs_end_time= sample_User_job.groupby([\"end_time\"] , as_index=False)[\"JobID\"].count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Submitted Jobs Data Frame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"count_jobs_submit_time.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_submit_time = count_jobs_submit_time.rename(columns={'JobID': 'submitted_Job_count'})\n",
"print(df_submit_time)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Voluntary\n",
"#Submit_time as date-time is set as index \n",
"df_submit_time=df_submit_time.set_index('submit_time')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_submit_time.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Started Jobs Data frame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Mandatory\n",
"###Creating dataframe in which data frame \n",
"#is grouped by started time of single user job.\n",
"# Job count of each time is calculated.\n",
"df_start_time = count_jobs_start_time.rename(columns={'JobID': 'Started_Job_count'})\n",
"print(df_start_time)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1_start_time=df_start_time.set_index('start_time')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1_start_time.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1_start_time.plot(figsize=(15,5))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Voluntary\n",
"#Resampling the data\n",
"#Resampling is the method that consists of drawing repeated samples\n",
"#from the original data samples.\n",
"#The method of Resampling is a nonparametric method of statistical inference.\n",
"Running_df=df1_start_time.resample('T').sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Mandatory\n",
"#Creating a new column named Running jobs where \n",
"#Running jobs are cumulative sum of \n",
"#started job count\n",
"Running_df['Running']=Running_df.cumsum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Running_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Running_df.plot(figsize=(15,5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pending Jobs :\n",
" pending jobs are the jobs which are waiting to be started ."
]
},
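{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Sketch only: the original notebook does not compute Pending explicitly,\n",
"#so this is one hedged way to do it from the frames built above.\n",
"#Submitted-job counts are resampled to one-minute bins, cumulatively\n",
"#summed, aligned to Running_df's index, and the running count subtracted.\n",
"Submitted_df = df_submit_time.resample('T').sum()\n",
"Submitted_df['Submitted'] = Submitted_df['submitted_Job_count'].cumsum()\n",
"Pending = Submitted_df['Submitted'].reindex(Running_df.index, method='ffill') - Running_df['Running']\n",
"Pending.plot(figsize=(15,5))"
]
},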
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_User_job['end_time'] = pd.to_datetime(sample_User_job['End'],unit='s')\n",
"count_jobs_end_time= sample_User_job.groupby([\"end_time\"] , as_index=False)[\"JobID\"].count()\n",
"df_end_time=count_jobs_end_time.rename(columns={'JobID': 'End_Job_count'})\n",
"print(df_end_time)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Completed Jobs:\n",
"Jobs gets completed at the time at which a job ends."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1_end_time=df_end_time.set_index('end_time')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1_end_time.index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_df_end_time=df1_end_time.resample('T').sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_df_end_time.tail(10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_df_end_time['Completed']=sample_df_end_time.cumsum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_df_end_time.tail()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_df_end_time['Completed'].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_df_end_time.plot(figsize=(15,5))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"axes = sample_df_end_time.plot( marker='.',alpha=0.5, figsize=(11, 4), subplots=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample_df_end_time.plot(marker='.', alpha=0.5, linestyle='None')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Joining Pandas Dataframes:\n",
" Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list."
]
},
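{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Equivalent alternative (sketch, commented out): DataFrame.join on the\n",
"#index with an inner join gives the same frame as the merge below.\n",
"#mergedDf = Running_df.join(sample_df_end_time, how='inner')"
]
},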
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mergedDf = Running_df.merge(sample_df_end_time, left_index=True, right_index=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mergedDf.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Currently_Running Jobs:\n",
" Currently running jobs are the difference between started jobs and completed jobs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mergedDf['Currently_Running']=mergedDf['Running']-mergedDf['Completed']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mergedDf['Currently_Running'].plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t=pd.DataFrame(mergedDf[['Running','Currently_Running']])\n",
"t.plot(figsize=(15,5))\n",
"plt.title('Throughput Analysis')"
]
}
],
"metadata": {
"language_info": {
"name": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}