Draft: Partition parquet dataset for sync with s5cmd
For large datasets, we or other users may want to sync large sets of files to external storage such as LTS. This merge request adds scripts that split a piece of a parquet dataset into a number of partitions, each with sync commands for use with s5cmd. Each partition is synced in its own array task. The current setting runs 10 tasks at once to minimize overall impact on the filesystem.
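The partitioning step described above could be sketched roughly as follows. This is a minimal illustration, not the MR's actual code: the function names, the round-robin split, and the use of s5cmd's `run` subcommand (which executes one command per line from a file) are all assumptions here; the destination prefix is a placeholder.

```python
from pathlib import Path

def partition(paths, n_parts):
    """Round-robin split of a list of file paths into n_parts
    roughly equal partitions (hypothetical helper, not the MR's code)."""
    return [paths[i::n_parts] for i in range(n_parts)]

def write_s5cmd_command_files(paths, n_parts, dest_prefix, out_dir="."):
    """Write one command file per partition. Each file can be fed to
    `s5cmd run part_<idx>.txt` inside one array task."""
    out = []
    for idx, part in enumerate(partition(paths, n_parts)):
        cmd_file = Path(out_dir) / f"part_{idx}.txt"
        with cmd_file.open("w") as fh:
            for p in part:
                # placeholder destination; real prefix comes from job config
                fh.write(f"cp {p} {dest_prefix}/{p}\n")
        out.append(cmd_file)
    return out
```

Each array task would then process one command file, e.g. via something like `sbatch --array=0-9%10`, where the `%10` throttle is what limits the job to 10 concurrent tasks.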
Activity
added 10 commits
- 693d24f6 - add build for custom docker container
- 1a970090 - remove creating and running array job from python script
- c764d3af - create runscript for dataset partitioning and transfer
- 6a70466c - for now create a single container to be used for all pieces of analysis. move...
- 18300268 - ignore sif files
- 14426714 - change default sif file and add credentials options for s5cmd
- c4a3dff9 - fix Dockerfile path
- 464ca4c3 - add instructions for creating a Gitlab PAT for accessing the GPFS container
- 20090a6a - correctly calculate group and move group to in-memory calculation
- dea299f6 - force pull container
mentioned in issue #10
mentioned in merge request !10 (merged)
added 15 commits
- f5686380...f1b4bbcf - 13 commits from branch main
- 7fd2308b - add check to see if mode exists in the dataset in the first place
- 0cbe9cad - Merge branch 'main' of gitlab.rc.uab.edu:mdefende/gpfs-policy into partition-parquet-dataset