Draft: Partition parquet dataset for sync with s5cmd
For large datasets, we or other users may want to sync large sets of files to external storage such as LTS. This merge request adds scripts that split a piece of a parquet dataset into a number of partitions, each with sync commands for use with s5cmd. Each partition is synced in its own array task. The current setting runs 10 tasks at once to minimize overall impact on the filesystem.
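The partitioning step described above could be sketched roughly as follows. This is a minimal illustration, not the MR's actual code: the function names, the round-robin split, and the use of s5cmd's `run` subcommand (which executes one command per line from a file) are all assumptions here; the destination prefix is a placeholder.

```python
from pathlib import Path

def partition(paths, n_parts):
    """Round-robin split of a list of file paths into n_parts
    roughly equal partitions (hypothetical helper, not the MR's code)."""
    return [paths[i::n_parts] for i in range(n_parts)]

def write_s5cmd_command_files(paths, n_parts, dest_prefix, out_dir="."):
    """Write one command file per partition. Each file can be fed to
    `s5cmd run part_<idx>.txt` inside one array task."""
    out = []
    for idx, part in enumerate(partition(paths, n_parts)):
        cmd_file = Path(out_dir) / f"part_{idx}.txt"
        with cmd_file.open("w") as fh:
            for p in part:
                # placeholder destination; real prefix comes from job config
                fh.write(f"cp {p} {dest_prefix}/{p}\n")
        out.append(cmd_file)
    return out
```

Each array task would then process one command file, e.g. via something like `sbatch --array=0-9%10`, where the `%10` throttle is what limits the job to 10 concurrent tasks.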
Activity
added 10 commits
- 693d24f6 - add build for custom docker container
- 1a970090 - remove creating and running array job from python script
- c764d3af - create runscript for dataset partitioning and transfer
- 6a70466c - for now create a single container to be used for all pieces of analysis. move...
- 18300268 - ignore sif files
- 14426714 - change default sif file and add credentials options for s5cmd
- c4a3dff9 - fix Dockerfile path
- 464ca4c3 - add instructions for creating a Gitlab PAT for accessing the GPFS container
- 20090a6a - correctly calculate group and move group to in-memory calculation
- dea299f6 - force pull container
mentioned in issue #10
mentioned in merge request !10 (merged)
added 15 commits
- f5686380...f1b4bbcf - 13 commits from branch main
- 7fd2308b - add check to see if mode exists in the dataset in the first place
- 0cbe9cad - Merge branch 'main' of gitlab.rc.uab.edu:mdefende/gpfs-policy into partition-parquet-dataset