Commit dafac808 authored by KOMAL BADI's avatar KOMAL BADI

Added some additional documentation.

Some basic definitions and the variables used.
parent ebdf04b4
......@@ -4,9 +4,45 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Throughput Analysis:\n",
"Throughput is the amount of output that can be processed in a given period of time.\n",
"This throughput plot gives statistics about a single array job."
"\n",
"<h1><center>Throughput Analysis</center></h1>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Objective:\n",
" \n",
"Throughput is the amount of output that can be processed in a given period of time. This throughput plot gives statistics about a single array job. A database is created from this single job, from which Running, Pending, Completed, and Currently Running jobs are calculated.\n",
" \n",
" \n",
"<b>Submit Time</b> - Time at which the job was submitted.\n",
"\n",
"<b>Start Time</b> - Point in time when the job/task started.\n",
"\n",
"<b>End Time</b> - Point in time when the job/task ended.\n",
"\n",
"<b>Running Jobs</b> - Cumulative sum of started jobs.\n",
"\n",
"<b>Completed Jobs</b> - Cumulative sum of ended jobs.\n",
"\n",
"<b>Pending Jobs</b> - The difference between the total array tasks and the running jobs; pending jobs are jobs waiting to be started.\n",
"\n",
"<b>Currently Running Jobs</b> - The difference between the cumulative sum of started jobs and the cumulative sum of ended jobs. "
]
},
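The definitions above can be sketched on toy data (timestamps and variable names here are hypothetical, assuming only pandas):

```python
import pandas as pd

# Hypothetical start/end times for a 3-task toy array job
tasks = pd.DataFrame({
    "start_time": pd.to_datetime(["2020-01-01 00:00:01",
                                  "2020-01-01 00:00:02",
                                  "2020-01-01 00:00:03"]),
    "end_time":   pd.to_datetime(["2020-01-01 00:00:04",
                                  "2020-01-01 00:00:05",
                                  "2020-01-01 00:00:06"]),
})

total_tasks = len(tasks)
# Running jobs: cumulative count of started tasks, indexed by start time
running = tasks.groupby("start_time").size().cumsum()
# Completed jobs: cumulative count of ended tasks, indexed by end time
completed = tasks.groupby("end_time").size().cumsum()
# Pending jobs: total array tasks minus the running count
pending = total_tasks - running
print(running.tolist(), completed.tolist(), pending.tolist())
```

By the last start time all three tasks are running and none are pending, matching the definitions above.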
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# What is an array job?\n",
"\n",
"An array job is a job that shares common parameters, such as the job definition, vCPUs, and memory. It runs as a collection of related, yet separate, basic jobs that may be distributed across multiple hosts and may run concurrently. \n",
"\n",
"\n",
"Initially, the array job id that requires analysis is assigned to a variable; we'll use this id throughout the analysis. "
]
},
{
......@@ -16,7 +52,18 @@
"outputs": [],
"source": [
"#Enter the JobID for which you want to do throughput analysis \n",
"Job_id='5976984'"
"array_job_id='5976984'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Libraries Used\n",
"\n",
"1) sqlite3, slurm2sql, and pandas are mandatory for converting Slurm accounting data to SQL.\n",
"\n",
"2) seaborn, matplotlib, and RC_STYLES.rc_styles are used for visualization."
]
},
{
......@@ -43,7 +90,7 @@
"metadata": {},
"source": [
"# Database Connection\n",
"slurm-since-March-allocation.sqlite3 : This DB is created using the --allocations parameter in the sacct command.Shows data for only the job allocation , not taking batch,extern steps into consideration. SQLite is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. Some applications can use SQLite for internal data storage. It’s also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle."
"<b>throughput_analysis_array_job.db :</b> We're creating a database for the array job that requires throughput analysis. SQLite is a C library that provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language. Some applications can use SQLite for internal data storage. It’s also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle."
]
},
{
......@@ -52,6 +99,7 @@
"metadata": {},
"outputs": [],
"source": [
"#connecting to database\n",
"db = sqlite3.connect('throughput_analysis_array_job.db')"
]
},
......@@ -61,7 +109,8 @@
"metadata": {},
"outputs": [],
"source": [
"slurm2sql.slurm2sql(db, ['-j', Job_id, '-a'])"
"#creating a database based on the array job id\n",
"slurm2sql.slurm2sql(db, ['-j', array_job_id, '-a'])"
]
},
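After slurm2sql populates the database, the notebook reads it back with pandas. A minimal sketch of that round trip, using an in-memory SQLite database and a hypothetical stand-in table (the real table name and columns come from slurm2sql):

```python
import sqlite3
import pandas as pd

db = sqlite3.connect(":memory:")
# Stand-in for the slurm2sql output: a tiny table with a similar shape
db.execute("CREATE TABLE slurm (JobID TEXT, JobName TEXT)")
db.execute("INSERT INTO slurm VALUES ('5976984_1', 'R_array_job')")
db.commit()

# Read the table back into a dataframe, as the notebook does after slurm2sql
df = pd.read_sql_query("SELECT * FROM slurm", db)
print(df.shape)
```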
{
......@@ -85,12 +134,20 @@
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Filter Data "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Filtering the dataframe to keep only rows whose JobName is 'R_array_job', dropping batch and extern steps. \n",
"df= df.loc[df['JobName'] =='R_array_job']"
]
},
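The row filter above can be illustrated on a toy frame (job names hypothetical) that mixes array tasks with batch and extern steps:

```python
import pandas as pd

# Toy frame mixing array-task rows with batch/extern step rows
df = pd.DataFrame({"JobName": ["R_array_job", "batch", "extern", "R_array_job"]})

# Boolean-mask selection keeps only the array-task rows
filtered = df.loc[df["JobName"] == "R_array_job"]
print(len(filtered))
```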
......@@ -119,7 +176,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Cleaning:"
"# Data Cleaning\n",
"\n",
"We're dropping the NA values and calculating the total RAM requested and the waiting time. We also convert elapsed time, waiting time, and CPU time to hours, convert memory to GB, and relabel all jobs cancelled by the user as Cancelled jobs. "
]
},
{
......@@ -174,8 +233,8 @@
"metadata": {},
"source": [
"# Successful Array Jobs:\n",
" Creating a pandas dataframe in which we seperate all the array jobs where ArrayTaskID is not equal to Nan.\n",
" A SLURM job array is a collection of Tasks that differ from each other by only a single index parameter. Creating a job array provides an easy way to group related jobs together.."
"\n",
"Creating a pandas dataframe in which we separate all the array jobs where ArrayTaskID is not NaN. A Slurm job array is a collection of tasks that differ from each other by only a single index parameter. Creating a job array provides an easy way to group related jobs together."
]
},
{
......@@ -256,7 +315,7 @@
"metadata": {},
"outputs": [],
"source": [
"sample_array_job=df1"
"array_job_data=df1"
]
},
{
......@@ -265,7 +324,7 @@
"metadata": {},
"outputs": [],
"source": [
"sample_array_job.head()"
"array_job_data.head()"
]
},
{
......@@ -283,9 +342,9 @@
"outputs": [],
"source": [
"#Mandatory\n",
"sample_array_job['submit_time'] = pd.to_datetime(sample_array_job['Submit'],unit='s')\n",
"sample_array_job['start_time'] = pd.to_datetime(sample_array_job['Start'],unit='s')\n",
"sample_array_job['end_time'] = pd.to_datetime(sample_array_job['End'],unit='s')"
"array_job_data['submit_time'] = pd.to_datetime(array_job_data['Submit'],unit='s')\n",
"array_job_data['start_time'] = pd.to_datetime(array_job_data['Start'],unit='s')\n",
"array_job_data['end_time'] = pd.to_datetime(array_job_data['End'],unit='s')"
]
},
{
......@@ -298,9 +357,9 @@
"#Creating 3 dataframes, grouped by the submit, start,\n",
"#and end times of the array job.\n",
"#The job count at each time is calculated.\n",
"count_jobs_submit_time= sample_array_job.groupby([\"submit_time\"] , as_index=False)[\"ArrayTaskID\"].count()\n",
"count_jobs_start_time= sample_array_job.groupby([\"start_time\"] , as_index=False)[\"ArrayTaskID\"].count()\n",
"count_jobs_end_time= sample_array_job.groupby([\"end_time\"] , as_index=False)[\"ArrayTaskID\"].count()"
"count_jobs_submit_time= array_job_data.groupby([\"submit_time\"] , as_index=False)[\"ArrayTaskID\"].count()\n",
"count_jobs_start_time= array_job_data.groupby([\"start_time\"] , as_index=False)[\"ArrayTaskID\"].count()\n",
"count_jobs_end_time= array_job_data.groupby([\"end_time\"] , as_index=False)[\"ArrayTaskID\"].count()"
]
},
{
......@@ -397,7 +456,7 @@
"#Resample bins the time series into fixed-width intervals;\n",
"#here the start-time job counts are binned into\n",
"#1-second intervals and summed per bin.\n",
"Running_df=df1_start_time.resample('S').sum()"
"running_df=df1_start_time.resample('1s').sum()"
]
},
{
......@@ -408,9 +467,8 @@
"source": [
"#Mandatory\n",
"#Creating a new column named Running jobs where \n",
"#Running jobs are cumulative sum of \n",
"#started job count\n",
"Running_df['Running']=Running_df.cumsum()"
"#Running jobs are cumulative sum of started job count\n",
"running_df['running']=running_df.cumsum()"
]
},
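The resample-then-cumsum step above can be sketched on synthetic timestamps (variable names hypothetical, assuming only pandas):

```python
import pandas as pd

# Synthetic start times: two jobs start in the same second,
# a third starts two seconds later
start_times = pd.to_datetime(["2020-01-01 00:00:00",
                              "2020-01-01 00:00:00",
                              "2020-01-01 00:00:02"])
counts = pd.Series(1, index=start_times)

# Bin into 1-second intervals and sum the counts per bin: 2, 0, 1
per_second = counts.resample("1s").sum()
# Cumulative sum gives the running-jobs curve: 2, 2, 3
running = per_second.cumsum()
print(running.tolist())
```

Note that resampling fills the empty second with a zero-count bin, so the cumulative curve stays flat there instead of skipping it.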
{
......@@ -419,7 +477,7 @@
"metadata": {},
"outputs": [],
"source": [
"Running_df.head()"
"running_df.head()"
]
},
{
......@@ -428,7 +486,8 @@
"metadata": {},
"outputs": [],
"source": [
"Running_df.plot(figsize=(15,10))"
"running_df.plot(figsize=(15,10))\n",
"plt.ylabel('Job_Count')"
]
},
{
......@@ -450,7 +509,7 @@
"#total jobs submitted at a particular time \n",
"#and Jobs that have been started.\n",
"Total_Jobs_Submitted = 50\n",
"Running_df['Pending_jobs'] = Total_Jobs_Submitted - Running_df['Running']"
"running_df['pending_jobs'] = Total_Jobs_Submitted - running_df['running']"
]
},
{
......@@ -459,7 +518,8 @@
"metadata": {},
"outputs": [],
"source": [
"Running_df.plot(figsize=(15,10))"
"running_df.plot(figsize=(15,10))\n",
"plt.ylabel('Job_Count')"
]
},
{
......@@ -506,7 +566,7 @@
"#Resample bins the time series into fixed-width intervals;\n",
"#here the end-time job counts are binned into\n",
"#1-second intervals and summed per bin.\n",
"completed_df=df1_end_time.resample('S').sum()"
"completed_df=df1_end_time.resample('1s').sum()"
]
},
{
......@@ -528,7 +588,7 @@
"#Creating a new column named Completed jobs where \n",
"#Completed jobs are cumulative sum of \n",
"#end job count\n",
"completed_df['Completed']=completed_df.cumsum()"
"completed_df['completed']=completed_df.cumsum()"
]
},
{
......@@ -546,14 +606,15 @@
"metadata": {},
"outputs": [],
"source": [
"completed_df.plot(figsize=(15,10))\n"
"completed_df.plot(figsize=(15,10))\n",
"plt.ylabel('Job_Count')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Joining Pandas Dataframes:\n",
"#### Joining Pandas Dataframes:\n",
" Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list."
]
},
......@@ -565,7 +626,7 @@
"source": [
"#Mandatory\n",
"#Joining the two dataframes to plot the throughput analysis\n",
"merged_df=completed_df.join(Running_df)"
"merged_df=completed_df.join(running_df)"
]
},
{
......@@ -584,7 +645,7 @@
"metadata": {},
"outputs": [],
"source": [
"t=pd.DataFrame(merged_df[['Completed' , 'Running' , 'Pending_jobs']])"
"t=pd.DataFrame(merged_df[['completed' , 'running' , 'pending_jobs']])"
]
},
{
......@@ -595,8 +656,7 @@
"source": [
"#plt.figure(figsize=(15,4))\n",
"t.plot(figsize=(15,10))\n",
"plt.title('Throughput Analysis')\n",
"plt.xlabel('Time')\n",
"\n",
"plt.ylabel('Job_Count')"
]
},
......@@ -615,7 +675,7 @@
"outputs": [],
"source": [
"#Mandatory\n",
"merged_df['Currently_Running'] = merged_df['Running'] - merged_df['Completed'] "
"merged_df['currently_running'] = merged_df['running'] - merged_df['completed'] "
]
},
{
......@@ -633,7 +693,7 @@
"metadata": {},
"outputs": [],
"source": [
"p=pd.DataFrame(merged_df[['Currently_Running']])"
"p=pd.DataFrame(merged_df[['currently_running']])"
]
},
{
......@@ -643,7 +703,7 @@
"outputs": [],
"source": [
"p.plot(figsize=(15,5))\n",
"plt.title('currently_Running_Jobs_Statistics')\n",
"plt.title('Currently_Running_Jobs_Statistics')\n",
"plt.xlabel('Time')\n",
"plt.ylabel('Job_Count')"
]
......@@ -654,7 +714,7 @@
"metadata": {},
"outputs": [],
"source": [
"q=pd.DataFrame(merged_df[['Currently_Running' , 'Pending_jobs']])"
"q=pd.DataFrame(merged_df[['currently_running' , 'pending_jobs']])"
]
},
{
......@@ -664,7 +724,7 @@
"outputs": [],
"source": [
"q.plot(figsize=(15,5))\n",
"plt.title('currently_Running_Jobs vs pending_Jobs statistics')\n",
"plt.title('Currently_Running_Jobs vs pending_Jobs statistics')\n",
"plt.xlabel('Time')\n",
"plt.ylabel('Job_Count')"
]
......