Automate conversion of GPFS policy outputs to parquet without Jupyter
Created a set of scripts to parse our standard GPFS policy outputs and save them as a parquet dataset without needing a Jupyter notebook. Iterated off of parquet-list-policy-data.ipynb.
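For context, a minimal sketch of the conversion step, not the repository's actual code: the assumed line layout (metadata and a file path separated by ` -- `) and the function names are hypothetical, used only to illustrate the parse-then-write flow.

```python
import pandas as pd

def parse_line(line: str) -> dict:
    """Split one policy-output line into a record (hypothetical layout).

    Assumes a line looks like '<attributes> -- /path/to/file'; the real
    GPFS list-policy output may carry different or additional fields.
    """
    meta, _, path = line.partition(" -- ")
    return {"meta": meta.strip(), "path": path.strip()}

def convert_part(part_file: str, out_file: str) -> None:
    # Parse every non-empty line of one policy-output part and write parquet.
    with open(part_file) as fh:
        records = [parse_line(line) for line in fh if line.strip()]
    pd.DataFrame.from_records(records).to_parquet(out_file)
```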
Changes
- Simplified parsing algorithm
- Automatically extracts the "top-level directory" (`tld`). Can parse paths from `/data/user`, `/data/user/home`, `/data/project`, and `/scratch`
- Sets the `tld` as the index within each parquet file for faster aggregation later (see the sketch after this list)
- Variable output directory: defaults to `log_dir/parquet` but can be specified elsewhere
- Environment is controlled through a Singularity container (defaults to `daskdev/dask:2024.8.0-py3.12`) but is variable
  - If the container is not specified, the default is automatically downloaded and used
- Parallelization: each part of a policy output is processed individually in an array task. Processing log parts of 5 million lines apiece takes ~3 minutes
- Controlled via the command-line script `run-convert-to-parquet.sh`
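A sketch of how the `tld` could be pulled out of each path and used as the parquet index. It assumes "top-level directory" means the first component under one of the supported mount points, and the regex, function names, and use of plain pandas are illustrative assumptions rather than the script's actual implementation.

```python
import re
import pandas as pd

# Hypothetical pattern: capture the directory immediately under a supported
# mount point. /data/user/home is listed before /data/user so home paths
# match the longer prefix first.
TLD_RE = re.compile(r"^/(?:data/user/home|data/user|data/project|scratch)/([^/]+)")

def extract_tld(path: str) -> str | None:
    m = TLD_RE.match(path)
    return m.group(1) if m else None

def write_with_tld_index(df: pd.DataFrame, out_file: str) -> None:
    # Index each part by tld so later per-user/per-project aggregation
    # only needs to read the relevant row groups.
    df = df.assign(tld=df["path"].map(extract_tld)).set_index("tld")
    df.to_parquet(out_file)
```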
Fixes #8 (closed)
added 1 commit
- 8dd5cb12 - remove SIF option for now. Only use the latest repo container
Let's have a directory structure like this:
```
|-- src
|   |-- legacy-notebooks
|   |   |-- *.ipynb (all the older notebooks)
|   |
|   |-- legacy-scripts
|   |   |-- *.sh (only the older scripts, not the new ones)
|   |
|   |-- (new directories/files here)
|
|-- (other things stay where they are here)
```
We can decide what to do with the policy definitions later in the `policy` directory. I'm content leaving them there for now.

mentioned in issue #10
mentioned in merge request !10 (merged)
Addresses #10. Further reorganization will need to happen in a separate request
mentioned in issue #8 (closed)
mentioned in commit f1b4bbcf