This PR adds the functionality needed to convert the flat, unsorted parquet structure produced by `convert-to-parquet` into a hive format partitioned by tld and policy run date. Converting the dataset to a hive format allows us to partition on tld alone and compare file changes across GPFS log timepoints, which greatly simplifies the calculation of data churn by imposing a defined organization on these large datasets. Additionally, all hive parquet files are sorted by tld and repartitioned to approximately 100 MiB in memory, the partition size suggested by Dask.
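As a rough illustration of what the hive layout enables (the dataset path, the `acq` run-date partition name, and the `path` column below are assumptions for the example, not values taken from this repository), a hive-partitioned dataset can be filtered on tld and compared across policy run dates without scanning the full flat dataset:

```python
# Sketch only: path, partition, and column names here are hypothetical placeholders.
import dask.dataframe as dd

# Hive layout looks roughly like:
#   /data/gpfs-hive/dataset/tld=<tld>/acq=<policy run date>/part.*.parquet
hive_root = "/data/gpfs-hive/dataset"

# Read a single tld at two policy run dates; partition pruning means only the
# matching tld/acq directories are touched.
before = dd.read_parquet(hive_root, filters=[("tld", "==", "some_lab"), ("acq", "==", "2024-01-01")])
after = dd.read_parquet(hive_root, filters=[("tld", "==", "some_lab"), ("acq", "==", "2024-02-01")])

# Churn-style comparison: which file paths appear in only one of the two snapshots.
changed = before.merge(after, on="path", how="outer", indicator=True)
print(changed["_merge"].value_counts().compute())
```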
Changes include:

- `policy.convert.hivize` function for converting a flat parquet structure to a hive dataset
- `convert-to-hive` CLI function along with batch processing functionality
- Example job scripts for `split-log`, `convert-to-parquet`, and `convert-flat-to-hive`. These are found in the `example-job-scripts` directory.
- `cli.utils.batch_parser`, which can be included in the `parents` field of any argument parser; default values for different Slurm resources can be passed to it to customize default resources for different functions (see the sketch after this list)
- `compute.backend.start_local_cluster` is now automatically imported when calling `import rc_gpfs.compute`
- `rc_gpfs/__init__.py` was updated to improve module load time. This is a temporary measure until lazy imports for heavy packages like `cudf` and `dask` can be implemented
- `--no-clobber` option for `split-log` and `convert-to-parquet` that stops execution if the expected outputs already exist in the specified output directory. This makes it much easier to process a small batch of GPFS logs where indexes may not be known (due to how `find` (un)organizes its results) by submitting a batch array task for every GPFS log; already processed logs are skipped without issue (see the sketch after this list)
- `cli.__init__.py`
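The `batch_parser` item above relies on argparse's standard parent-parser mechanism; the sketch below shows the general pattern. The function signature, option names, and default values are assumptions for illustration, not the actual interface of `cli.utils.batch_parser`.

```python
# Hypothetical sketch of a reusable Slurm-options parent parser; the real
# cli.utils.batch_parser interface may differ.
import argparse

def batch_parser(cpus_per_task=1, mem="8G", time="02:00:00", partition="general"):
    # add_help=False is required for parsers passed via `parents=`
    parser = argparse.ArgumentParser(add_help=False)
    slurm = parser.add_argument_group("Batch processing options")
    slurm.add_argument("--batch", action="store_true", help="Submit as a Slurm batch job")
    slurm.add_argument("-n", "--ntasks", type=int, default=1)
    slurm.add_argument("-c", "--cpus-per-task", type=int, default=cpus_per_task)
    slurm.add_argument("-m", "--mem", default=mem)
    slurm.add_argument("-t", "--time", default=time)
    slurm.add_argument("-p", "--partition", default=partition)
    return parser

# A command-specific parser picks up the shared Slurm options via `parents`,
# overriding only the defaults that make sense for that command.
convert_parser = argparse.ArgumentParser(
    description="convert-to-parquet",
    parents=[batch_parser(cpus_per_task=16, mem="32G")],
)
convert_parser.add_argument("log_path")
args = convert_parser.parse_args(["--batch", "-t", "12:00:00", "/scratch/gpfs.log"])
print(args)
```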
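The `--no-clobber` behavior described above amounts to an early-exit check before any work is done. A minimal sketch of that idea follows; the output naming scheme and function name are assumed, not the tool's actual logic.

```python
# Hypothetical sketch of a --no-clobber guard; the real CLI's output naming
# and exit behavior may differ.
import sys
from pathlib import Path

def run_convert_to_parquet(log_path: str, output_dir: str, no_clobber: bool = False) -> None:
    log = Path(log_path)
    expected = Path(output_dir) / f"{log.stem}.parquet"  # assumed output naming

    if no_clobber and expected.exists():
        # Expected output already present: skip this log without error so a
        # Slurm array task covering every GPFS log can be resubmitted safely.
        print(f"{expected} already exists, skipping {log.name}")
        sys.exit(0)

    ...  # conversion would happen here
```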