Skip to content

Add conversion of flat parquet structure to hive

This PR adds necessary functionality of converting the flat, unsorted parquet structure from convert-to-parquet to a hive format partitioned by tld and policy run date. Converting the dataset to a hive format allows us to easily partition only on tld and compare file changes across GPFS log timepoints. This will greatly simplify the calculation of data churn by introducing a defined organization to these large datasets.

Additionally, all hive parquet files are sorted by tld and repartitioned to be approximately 100 MiB in-memory, the suggested partition size by Dask.

Major Changes

  • added policy.convert.hivize function for converting a flat parquet structure to a hive
  • added corresponding convert-to-hive CLI function along with batch processing functionality

Minor changes

  • Added example batch scripts that will process multiple GPFS logs in separate array jobs. Example scripts exist for split-log, convert-to-parquet, and convert-flat-to-hive. These are found in the example-job-scripts directory.
  • Added variable argument parser for Slurm options that are shared among all of the CLI functions under cli.utils.batch_parser. It can be included in the parents field of any argument parser, and default values for different Slurm resources can be passed to it to customize default resources for different functions
  • compute.backend.start_local_cluster is now automatically imported when calling import rc_gpfs.compute
  • Removed imports in rc_gpfs/__init__.py to improve module load time. This is temporary until lazy imports for heavy packages like cudf and dask can be implemented
  • Added --no-clobber option for split-log and convert-to-parquet that will cease execution if the expected outputs already exist in the specified output directory. This makes it much easier to process a small batch of GPFS logs where indexes may not be known (due to how find (un)organizes its results) by submitting a batch array task for every GPFS log. Already processed logs will be skipped without issue.

Bugfixes

  • Added CLI functions back to cli.__init__.py

Issues

Merge request reports

Loading