Draft: Partition parquet dataset for sync with s5cmd

Matthew K Defenderfer requested to merge partition-parquet-dataset into main

For large datasets, we or other users may want to sync large sets of files to external storage such as LTS. This merge request adds scripts that split a subset of a parquet dataset into a number of partitions, each with its own sync commands for use with s5cmd. Each partition is synced in a separate array task. The current setting runs 10 tasks at once to try to minimize the overall impact on the filesystem.
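
As a rough illustration of the partitioning idea (not the scripts in this merge request), here is a minimal Python sketch. It assumes the file paths have already been read from the parquet dataset into a list. The function names `partition_paths` and `s5cmd_commands` and the destination prefix are hypothetical. Each partition's command list could be written to its own file and passed to `s5cmd run` by one array task.

```python
from typing import List


def partition_paths(paths: List[str], n_parts: int) -> List[List[str]]:
    """Split paths into at most n_parts contiguous groups of near-equal size."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    # Never create more partitions than there are paths (but always at least one).
    n_parts = max(1, min(n_parts, len(paths)))
    base, extra = divmod(len(paths), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # The first `extra` partitions each take one additional path.
        size = base + (1 if i < extra else 0)
        parts.append(paths[start:start + size])
        start += size
    return parts


def s5cmd_commands(paths: List[str], dest_prefix: str) -> List[str]:
    """Build one `cp` line per path, in the form read by `s5cmd run`.

    Assumption: paths contain no whitespace; otherwise they would need quoting.
    """
    prefix = dest_prefix.rstrip("/")
    return [f"cp {p} {prefix}/{p.lstrip('/')}" for p in paths]


if __name__ == "__main__":
    # Hypothetical file list and bucket; the real list would come from the dataset.
    files = [f"/data/project/file_{i}.dat" for i in range(5)]
    for idx, part in enumerate(partition_paths(files, 2)):
        with open(f"part_{idx}.txt", "w") as fh:
            fh.write("\n".join(s5cmd_commands(part, "s3://example-bucket/backup")) + "\n")
```

If the array tasks are Slurm array jobs (an assumption; the scheduler is not named above), the limit of 10 concurrent tasks corresponds to a throttle such as `--array=0-99%10`.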
