Automate conversion of GPFS policy outputs to parquet without Jupyter
Created a set of scripts to parse our standard GPFS policy outputs and save them as a parquet dataset without needing a Jupyter notebook. Iterated off of parquet-list-policy-data.ipynb.
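For context, a minimal sketch of the conversion step, not the repository's actual code: the assumed line layout (metadata and a file path separated by ` -- `) and the function names are hypothetical, used only to illustrate the parse-then-write flow.

```python
import pandas as pd

def parse_line(line: str) -> dict:
    """Split one policy-output line into a record (hypothetical layout).

    Assumes a line looks like '<attributes> -- /path/to/file'; the real
    GPFS list-policy output may carry different or additional fields.
    """
    meta, _, path = line.partition(" -- ")
    return {"meta": meta.strip(), "path": path.strip()}

def convert_part(part_file: str, out_file: str) -> None:
    # Parse every non-empty line of one policy-output part and write parquet.
    with open(part_file) as fh:
        records = [parse_line(line) for line in fh if line.strip()]
    pd.DataFrame.from_records(records).to_parquet(out_file)
```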
Changes
- Simplified parsing algorithm
- Automatically extracts the "top-level directory" (`tld`). Can parse paths from `/data/user`, `/data/user/home`, `/data/project`, and `/scratch`
- Sets the `tld` as the index within each parquet file for faster aggregation later (see the sketch after this list)
- Variable output directory: defaults to `log_dir/parquet` but can be specified elsewhere
- Environment is controlled through a Singularity container (defaults to `daskdev/dask:2024.8.0-py3.12`) but is variable
  - If the container is not specified, the default is automatically downloaded and used
- Parallelization: each part of a policy output is processed individually in an array task. Processing log parts of 5 million lines apiece takes ~3 minutes
- Controlled via the command-line script `run-convert-to-parquet.sh`
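A sketch of how the `tld` could be pulled out of each path and used as the parquet index. It assumes "top-level directory" means the first component under one of the supported mount points, and the regex, function names, and use of plain pandas are illustrative assumptions rather than the script's actual implementation.

```python
import re
import pandas as pd

# Hypothetical pattern: capture the directory immediately under a supported
# mount point. /data/user/home is listed before /data/user so home paths
# match the longer prefix first.
TLD_RE = re.compile(r"^/(?:data/user/home|data/user|data/project|scratch)/([^/]+)")

def extract_tld(path: str) -> str | None:
    m = TLD_RE.match(path)
    return m.group(1) if m else None

def write_with_tld_index(df: pd.DataFrame, out_file: str) -> None:
    # Index each part by tld so later per-user/per-project aggregation
    # only needs to read the relevant row groups.
    df = df.assign(tld=df["path"].map(extract_tld)).set_index("tld")
    df.to_parquet(out_file)
```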
Fixes #8 (closed)
added 1 commit
- 8dd5cb12 - remove SIF option for now. Only use the latest repo container
Let's have a directory structure like this:
```
|-- src
|   |-- legacy-notebooks
|   |   |-- *.ipynb (all the older notebooks)
|   |
|   |-- legacy-scripts
|   |   |-- *.sh (only the older scripts, not the new ones)
|   |
|   |-- (new directories/files here)
|
|-- (other things stay where they are here)
```
We can decide what to do with the policy definitions later in the `policy` directory. I'm content leaving them there for now.

mentioned in issue #10
mentioned in merge request !10 (merged)
Addresses #10. Further reorganization will need to happen in a separate request
mentioned in issue #8 (closed)
mentioned in commit f1b4bbcf