Functionalize the log processing pipeline
Right now, all log processing is run through Jupyter notebooks. JP initially used the notebooks in an automated fashion via `papermill`. I refined the pipeline in another, manually run notebook. This pipeline needs to be functionalized so it can be run easily either from Jupyter notebooks or from shell scripts.
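One possible shape for that interface is sketched below: a plain function a notebook can import, plus a thin `argparse` wrapper so the same logic can be called from shell scripts. The module layout, function name, and arguments are hypothetical, not the final API.

```python
# Hypothetical interface: importable function + CLI wrapper.
# All names here are illustrative only.
import argparse


def aggregate_dataset(dataset_path, age_column=None, output_path=None):
    """Placeholder for the pipeline described under "Pipeline Stages" below."""
    print(f"would process {dataset_path} "
          f"(age_column={age_column}, output_path={output_path})")


def main():
    parser = argparse.ArgumentParser(description="Aggregate parquet log records")
    parser.add_argument("dataset_path")
    parser.add_argument("--age-column", choices=["create", "modify", "access"])
    parser.add_argument("--output-path")
    args = parser.parse_args()
    aggregate_dataset(args.dataset_path, args.age_column, args.output_path)


if __name__ == "__main__":
    main()
```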
This module will assume the data start as a parquet dataset output by the current `convert-to-parquet.sh` script. It will also assume the data were originally generated using the policy files available to the `run-submit-pol-job.py` script (`list-path-external` and `list-path-dirplus`). Other policy formats may or may not work, depending on how closely they resemble those policy definitions.
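For reference, opening such a dataset could look like the sketch below. The path and column names (`tld`, `size`, and the three timestamp columns) are assumptions based on the description above and may not match the actual output of `convert-to-parquet.sh`.

```python
# Sketch: lazily open the parquet dataset produced by convert-to-parquet.sh.
# The path and column names here are assumed for illustration.
import dask.dataframe as dd

dataset_path = "/path/to/parquet-dataset"  # output of convert-to-parquet.sh
ddf = dd.read_parquet(
    dataset_path,
    columns=["tld", "size", "create", "modify", "access"],  # assumed columns
)
print(ddf.dtypes)
```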
Pipeline Stages
- Determine the compute backend and create a local cluster if necessary
- If requested by the user, categorize file records by age using one of the `create`, `modify`, or `access` times
- Aggregate the total file count and size (bytes) for each `tld` (generally a project or user name) and, if requested, each age category
- Return the aggregated data as a regular pandas dataframe
- Optionally, write the aggregated data to a parquet file
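A condensed sketch of how these stages could fit together, fleshing out the `aggregate_dataset` placeholder from the earlier CLI sketch and assuming a dask-backed implementation. The signature, the column names (`tld`, `size`, timestamp columns stored as datetimes), and the age bins are assumptions for illustration, not the final design.

```python
# Sketch of the functionalized pipeline; names, bins, and column choices
# are assumptions for illustration only.
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster


def aggregate_dataset(dataset_path, age_column=None, output_path=None, client=None):
    # Stage 1: reuse an existing compute backend if one is passed in,
    # otherwise spin up a local dask cluster.
    if client is None:
        client = Client(LocalCluster())

    ddf = dd.read_parquet(dataset_path)

    group_cols = ["tld"]
    if age_column is not None:
        # Stage 2: bucket records by the age of the chosen timestamp column
        # (create, modify, or access). Bin edges are illustrative.
        age_days = (pd.Timestamp.now() - ddf[age_column]).dt.days
        ddf["age_category"] = age_days.map_partitions(
            pd.cut,
            bins=[-1, 90, 365, 2 * 365, float("inf")],
            labels=["<90d", "90d-1y", "1y-2y", ">2y"],
        ).astype(str)  # plain strings keep the groupby simple
        group_cols.append("age_category")

    # Stage 3: total file count and size (bytes) per tld (and age category).
    agg = (
        ddf.groupby(group_cols)["size"]
        .agg(["count", "sum"])
        .compute()  # Stage 4: collapse to a regular pandas dataframe
        .rename(columns={"count": "file_count", "sum": "total_bytes"})
        .reset_index()
    )

    # Stage 5: optionally persist the aggregated table.
    if output_path is not None:
        agg.to_parquet(output_path)

    return agg
```

From a notebook this would be called directly with an existing `Client`; from a shell script, the `argparse` wrapper sketched earlier would call it with the default local cluster.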