Add conversion of flat parquet structure to hive
This PR adds necessary functionality of converting the flat, unsorted parquet structure from convert-to-parquet
to a hive format partitioned by tld
and policy run date. Converting the dataset to a hive format allows us to easily partition only on tld and compare file changes across GPFS log timepoints. This will greatly simplify the calculation of data churn by introducing a defined organization to these large datasets.
Additionally, all hive parquet files are sorted by tld
and repartitioned to be approximately 100 MiB in-memory, the suggested partition size by Dask.
Major Changes
- added
policy.convert.hivize
function for converting a flat parquet structure to a hive - added corresponding
convert-to-hive
CLI function along with batch processing functionality
Minor changes
- Added example batch scripts that will process multiple GPFS logs in separate array jobs. Example scripts exist for
split-log
,convert-to-parquet
, andconvert-flat-to-hive
. These are found in theexample-job-scripts
directory. - Added variable argument parser for Slurm options that are shared among all of the CLI functions under
cli.utils.batch_parser
. It can be included in theparents
field of any argument parser, and default values for different Slurm resources can be passed to it to customize default resources for different functions -
compute.backend.start_local_cluster
is now automatically imported when callingimport rc_gpfs.compute
- Removed imports in
rc_gpfs/__init__.py
to improve module load time. This is temporary until lazy imports for heavy packages likecudf
anddask
can be implemented - Added
--no-clobber
option forsplit-log
andconvert-to-parquet
that will cease execution if the expected outputs already exist in the specified output directory. This makes it much easier to process a small batch of GPFS logs where indexes may not be known (due to howfind
(un)organizes its results) by submitting a batch array task for every GPFS log. Already processed logs will be skipped without issue.
Bugfixes
- Added CLI functions back to
cli.__init__.py
Issues
- Resolves #51 (closed)
- Resolves #47 (closed)
- Resolves #28 (closed)