Draft: Partition parquet dataset for sync with s5cmd

Closed Matthew K Defenderfer requested to merge partition-parquet-dataset into main

For large datasets, we or other users may want to sync large sets of files to external storage such as LTS. This merge request adds scripts that split a subset of a parquet dataset into a number of partitions and generate the sync commands for each partition for use with s5cmd. Each partition is synced in its own array task. The current setting allows 10 tasks to run at once to limit the overall impact on the filesystem.
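
As a rough illustration of the partitioning idea described above, and not the scripts from this merge request, the following Python sketch splits a list of file paths into partitions and builds one list of `cp` command lines per partition, in the form that `s5cmd run` reads from a command file. The function names, the round-robin split, the example paths, and the bucket name are all assumptions made for illustration; the actual scripts read their paths from the parquet dataset.

```python
from pathlib import PurePosixPath


def partition_paths(paths, n_parts):
    """Split paths into n_parts lists, assigning them round-robin.

    Hypothetical helper: the real scripts may partition differently
    (for example by total size rather than by file count).
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts = [[] for _ in range(n_parts)]
    for i, path in enumerate(paths):
        parts[i % n_parts].append(path)
    return parts


def s5cmd_commands(paths, src_root, dest_prefix):
    """Build one s5cmd 'cp' command line per file.

    Each file keeps its path relative to src_root under dest_prefix.
    The quoting here is simplistic and assumes paths contain no
    double quotes.
    """
    prefix = dest_prefix.rstrip("/")
    commands = []
    for path in paths:
        rel = PurePosixPath(path).relative_to(src_root)
        commands.append(f'cp "{path}" "{prefix}/{rel}"')
    return commands


if __name__ == "__main__":
    # Example paths and bucket are made up for illustration.
    files = [f"/data/project/file_{i}.dat" for i in range(5)]
    for idx, part in enumerate(partition_paths(files, 2)):
        # In practice each partition would be written to its own
        # command file and passed to `s5cmd run` by one array task.
        print(f"# partition {idx}")
        for line in s5cmd_commands(part, "/data", "s3://example-bucket/backup"):
            print(line)
```

Each array task would then run s5cmd against the command file for its own partition. If the scheduler is Slurm (an assumption, since the description does not name it), an array specification with a `%10` suffix, such as `--array=0-99%10`, caps the number of simultaneously running tasks at 10.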

Closed by Matthew K Defenderfer 4 months ago (Dec 9, 2024 6:46pm UTC)

Merge details

  • The changes were not merged into main.
