Functionalize the log processing pipeline
Right now, all log processing is run through Jupyter notebooks. JP initially used the notebooks in an automated fashion via `papermill`. I refined the pipeline in another, manually run notebook. This pipeline needs to be functionalized so it can be run easily either from Jupyter notebooks or from shell scripts.
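One possible shape for that interface is sketched below: a plain function a notebook can import, plus a thin `argparse` wrapper so the same logic can be called from shell scripts. The module layout, function name, and arguments are hypothetical, not the final API.

```python
# Hypothetical interface: importable function + CLI wrapper.
# All names here are illustrative only.
import argparse


def aggregate_dataset(dataset_path, age_column=None, output_path=None):
    """Placeholder for the pipeline described under "Pipeline Stages" below."""
    print(f"would process {dataset_path} "
          f"(age_column={age_column}, output_path={output_path})")


def main():
    parser = argparse.ArgumentParser(description="Aggregate parquet log records")
    parser.add_argument("dataset_path")
    parser.add_argument("--age-column", choices=["create", "modify", "access"])
    parser.add_argument("--output-path")
    args = parser.parse_args()
    aggregate_dataset(args.dataset_path, args.age_column, args.output_path)


if __name__ == "__main__":
    main()
```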
This module will assume the data start as a parquet dataset output by the current `convert-to-parquet.sh` script. It will also assume the data were originally generated using the policy files available to the `run-submit-pol-job.py` script (`list-path-external` and `list-path-dirplus`). Other policy formats may or may not work, depending on how closely they resemble those policy definitions.
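For reference, opening such a dataset could look like the sketch below. The path and column names (`tld`, `size`, and the three timestamp columns) are assumptions based on the description above and may not match the actual output of `convert-to-parquet.sh`.

```python
# Sketch: lazily open the parquet dataset produced by convert-to-parquet.sh.
# The path and column names here are assumed for illustration.
import dask.dataframe as dd

dataset_path = "/path/to/parquet-dataset"  # output of convert-to-parquet.sh
ddf = dd.read_parquet(
    dataset_path,
    columns=["tld", "size", "create", "modify", "access"],  # assumed columns
)
print(ddf.dtypes)
```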
Pipeline Stages
- Determine the compute backend and create a local cluster if necessary
- If requested by the user, categorize file records by age using one of the `create`, `modify`, or `access` times
- Aggregate the total file count and size (bytes) for each `tld` (generally a project or user name) and, if requested, each age category
- Return the aggregated data as a regular pandas dataframe
- Optionally, write the aggregated data to a parquet file
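A condensed sketch of how these stages could fit together, fleshing out the `aggregate_dataset` placeholder from the earlier CLI sketch and assuming a dask-backed implementation. The signature, the column names (`tld`, `size`, timestamp columns stored as datetimes), and the age bins are assumptions for illustration, not the final design.

```python
# Sketch of the functionalized pipeline; names, bins, and column choices
# are assumptions for illustration only.
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster


def aggregate_dataset(dataset_path, age_column=None, output_path=None, client=None):
    # Stage 1: reuse an existing compute backend if one is passed in,
    # otherwise spin up a local dask cluster.
    if client is None:
        client = Client(LocalCluster())

    ddf = dd.read_parquet(dataset_path)

    group_cols = ["tld"]
    if age_column is not None:
        # Stage 2: bucket records by the age of the chosen timestamp column
        # (create, modify, or access). Bin edges are illustrative.
        age_days = (pd.Timestamp.now() - ddf[age_column]).dt.days
        ddf["age_category"] = age_days.map_partitions(
            pd.cut,
            bins=[-1, 90, 365, 2 * 365, float("inf")],
            labels=["<90d", "90d-1y", "1y-2y", ">2y"],
        ).astype(str)  # plain strings keep the groupby simple
        group_cols.append("age_category")

    # Stage 3: total file count and size (bytes) per tld (and age category).
    agg = (
        ddf.groupby(group_cols)["size"]
        .agg(["count", "sum"])
        .compute()  # Stage 4: collapse to a regular pandas dataframe
        .rename(columns={"count": "file_count", "sum": "total_bytes"})
        .reset_index()
    )

    # Stage 5: optionally persist the aggregated table.
    if output_path is not None:
        agg.to_parquet(output_path)

    return agg
```

From a notebook this would be called directly with an existing `Client`; from a shell script, the `argparse` wrapper sketched earlier would call it with the default local cluster.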