Information Lifecycle Management (ILM) via GPFS policy engine
The GPFS policy engine is well described in this white paper. A good presentation overview of the policy file is here. The relevant documentation is available from IBM.
This project focuses on scheduled execution of lifecycle policies to gather and process data about file system objects and issue actions against those objects based on policy.
Getting Started
Gitlab Registry Authentication
To download from Gitlab's package repo, create a ~/.pypirc file and add the following entry:
[gitlab]
repository = https://gitlab.rc.uab.edu/api/v4/projects/2550/packages/pypi
username = __token__
password = <personal_access_token>
Go to your Access Tokens page and create a new token with api privileges. After the token is created, copy its value to the password field.
Installation
Package dependencies should be installed through conda due to some compilation errors on Cheaha when installed through PyPI. Use the following commands to create the environment.
module load Anaconda3
conda env create -n gpfs -f deps.yml
conda activate gpfs
pip install --index-url https://gitlab.rc.uab.edu/api/v4/projects/2550/packages/pypi/simple --no-deps rc-gpfs===<version>
By default, pip will install the latest development version (based on the current version format) if a specific version isn't given. Be sure to include the version number if you would like a stable release as opposed to a development release. As well, be sure to use === instead of == for simple version string matching.
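For example, to pin a specific stable release (the version number below is only a placeholder; substitute a real release from the package registry):

pip install --index-url https://gitlab.rc.uab.edu/api/v4/projects/2550/packages/pypi/simple --no-deps rc-gpfs===1.0.0  # 1.0.0 is a placeholder version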
Jupyter Packages
The deps.yml file only contains packages necessary to run the package from the CLI. If you want to run this package as a Jupyter kernel, activate your environment and use the following command to install the necessary packages.
conda install ipykernel jupyter notebook nbformat
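If the environment does not show up as a kernel in Jupyter automatically, it can typically be registered with ipykernel (a general Jupyter step, not specific to this package; the kernel name and display name below are arbitrary):

conda activate gpfs
python -m ipykernel install --user --name gpfs --display-name "GPFS policy tools"  # name/display-name are arbitrary labels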
Applying Policies
Applying a policy to a fileset is done at a base level through the mmapplypolicy command. This repo contains wrapper scripts that call that command with a specified policy file on a given fileset, where each wrapper has a different level of functionality meant for a different group of users in RC. All scripts are stored in src/run-policy:
- run-mmpol: the main script that calls mmapplypolicy. Generally not invoked on its own.
- submit-pol-job: general wrapper that sets up the Slurm job run-mmpol executes in. Admins can execute a policy run from this level using any policy file they have defined.
- run-submit-pol-job.py: a Python wrapper for submit-pol-job meant specifically for running list policy jobs. This wrapper can be run by specific non-admins who have been given sudo permissions on this file alone. It can only run one of two policies: list-path-external and list-path-dirplus.
The production versions of these scripts are kept in /data/rc/list-gpfs-dirs. Admins can run any one of these scripts from anywhere, but non-admins are only granted sudo privileges on the run-submit-pol-job.py file in that directory.
Note: the command is directed to run on specific nodes by way of arguments to mmapplypolicy. The command is technically not run inside the job reservation, so the resource constraints are imperfect. The goal is to use the scheduler to ensure the policy run does not conflict with existing resource allocations on the cluster.
List Policies (non-admin)
A list policy can be executed with run-submit-pol-job.py using the following command:
sudo run-submit-pol-job.py [-h] [-o OUTDIR] [-f LOG_PREFIX] [--with-dirs]
[-N NODES] [-c CORES] [-p PARTITION] [-t TIME]
[-m MEM_PER_CPU]
device
- outdir: specifies the directory the output log should be saved to. Defaults to /data/rc/gpfs-policy/data
- log-prefix: string to begin the name of the policy output with. Metadata containing the policy file name, Slurm job ID, and run time will be appended to this prefix. Defaults to list-policy_<device>. See below for device. Note: this option is currently non-functional
- --with-dirs: changes the policy file from list-path-external to list-path-dirplus. The only difference is that directories are included in the policy output.
- device: the fileset or directory to apply the policy to.
All other arguments are Slurm directives dictating resource requests. The default parameters are as follows:
- nodes: 1
- cores: 16
- partition: amd-hdr100, medium
- time: 24:00:00
- mem-per-cpu: 8G
This script was written using the default python3 interpreter on Cheaha (version 3.6.8), so no environment needs to be active. Running this script in an environment with a newer Python version may cause unintended errors and effects.
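As a concrete example, the following would submit a list policy run that includes directories in the output, using the default resource requests (the device path is only illustrative):

sudo run-submit-pol-job.py --with-dirs /data/project  # replace /data/project with the fileset or directory of interest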
Run Any Policies (admins)
Any defined policy file can be run using submit-pol-job with the following command:
sudo ./submit-pol-job [ -h ] [ -o | --outdir ] [ -f | --outfile ] [ -P | --policy ]
[ -N | --nodes ] [ -c | --cores ] [ -p | --partition ]
[ -t | --time ] [ -m | --mem ]
device
The only difference here is that a path to the policy file can be specified using -P or --policy. All other arguments are the same and have the same defaults.
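For example, an admin could run a custom policy file against a fileset with a longer walltime (the policy file path and device are only placeholders):

sudo ./submit-pol-job -P /path/to/policy-file -t 48:00:00 /scratch  # /scratch is an example device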
Output
The list-path-external policy provides an efficient tool to gather file stat data into a URL-encoded ASCII text file. The output file can then be processed by downstream tools to create reports on storage patterns and use. Make sure the output directory has sufficient space to hold the resulting file listing (it could be hundreds of gigabytes for a large collection of files).
The Slurm job output file will be local to the directory from which this command was executed. It can be watched to observe progress in the generation of the file list. A listing of hundreds of millions of files may take a couple of hours to generate and consume several hundred gigabytes for the output file.
List Policy Specific Outputs
The standard organization scheme for list policy outputs can be seen below.
.
└── list-policy_<device>_<policy>_slurm-<jobid>_<rundatetime>/
├── raw/
│ └── list-policy_<device>_<policy>_slurm-<jobid>_<rundatetime>.gz
├── chunks
├── parquet
└── reports
chunks, parquet, and reports are not automatically generated; however, they are the default directory names for outputs from downstream preprocessing and processing functions.
The output file contains one line per file object stored under the device. No directories or non-file objects are included in this listing unless the list-path-dirplus policy is used. Each entry is a space-separated set of file attributes selected by the SHOW command in the LIST rule. Entries are encoded according to RFC 3986 URI percent encoding. This means all spaces and special characters will be encoded, making it easy to split lines into fields using the space separator.
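For a quick look at the raw format, a single entry can be printed with each space-separated field on its own line; because paths are percent-encoded, embedded spaces in file names never create extra fields (the file name below follows the naming scheme above):

zcat list-policy_<device>_<policy>_slurm-<jobid>_<rundatetime>.gz | head -n 1 | tr ' ' '\n'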
Processing the output file
Split and compress
Policy outputs generated using list-path-external or list-path-dirplus can be split into smaller log files to facilitate out-of-memory computation for very large filesets using tools such as dask. Policy outputs can be split and recompressed from the command line using the split-log command. This processing can be automatically submitted as a separate batch job using the --batch flag.
split-log [ -h ] [ -l | --lines ] [ -o | --output-dir ] [ --batch ]
[ -n | --cpus-per-task ] [ -p | --partition] [ -t | --time ] [ -m | --mem ]
log
- lines: the max number of lines to include in each split file. Defaults to 5000000
- outdir: directory to store the split files in. Defaults to ${log}.d in log's parent directory.
- log: path to the GPFS policy log. Can be either uncompressed or gzip compressed
- batch: if specified, a separate batch job will be submitted that splits and compresses the log. Otherwise, both operations will be performed using the local compute resources.
All other options specify job resource parameters. Defaults are as follows:
- cpus-per-task: 24
- partition: amd-hdr100
- time: 02:00:00
- mem: 8G
Split files will have the form ${outdir}/list-XXX.gz where XXX is an incrementing index. Files are automatically compressed.
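For example, the following would submit a separate batch job that splits a raw policy log into compressed chunks of at most 5,000,000 lines each (the paths follow the naming scheme above and are only illustrative):

split-log --batch -o ./chunks ./raw/list-policy_<device>_<policy>_slurm-<jobid>_<rundatetime>.gz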
Pre-parse output for Python
Processing GPFS log outputs is done by convert-to-parquet and assumes the GPFS log has been split into a number of files of the form list-XXX.gz. convert-to-parquet parses each log, adds a column specifying the top-level directory (tld) for each file, and saves the data with the appropriate types in parquet format. This processing can be automatically submitted as a separate batch array job using the --batch flag.
This script is written to parse the list-path-external policy format with quoted special characters.
convert-to-parquet [ -h ] [ -o | --outdir ] [ --pool-size ] [ --batch ]
[ -n | --ntasks ] [ -p | --partition] [ -t | --time ] [ -m | --mem ]
[ --slurm-log-dir ]
gpfs_logdir
- output-dir: path to save parquet outputs. Defaults to ${gpfs_logdir}/../parquet
- gpfs_logdir: directory path containing the split log files as *.gz
- pool-size: if performing conversion locally with multiple cores, this controls exactly how many cores within the job are used in the parallel pool. If not specified, all cores assigned to the job are used in the pool.
- batch: if set, processing will be performed in an array job where each array task converts a single log file.
All other options control the array job resources. Default values are as follows:
- ntasks: 1
- mem: 16G
- time: 02:00:00
- partition: amd-hdr100
The default resources can parse a 5-million-line file in approximately 3 minutes, so they should cover all common use cases.
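For example, the following would submit an array job that converts each split file in a chunks directory to parquet (directory paths are only illustrative):

convert-to-parquet --batch -o ./parquet ./chunks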
tld
All policies run on filesets in /data/user, /data/project, /home, or /scratch will automatically have the "top-level directory" (tld) computed and added to the parquet output. The tld is defined as the directory just under any of those specified filesets. For example, a file with path /data/project/datascienceteam/example.txt will have tld set to datascienceteam.
Any files in a directory outside those specified filesets will have tld set to None.
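As a quick illustration using the example path above, the tld is simply the path component directly below the fileset root:

echo /data/project/datascienceteam/example.txt | cut -d/ -f4  # prints: datascienceteam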
Running reports
Disk usage by top-level directories
A useful report is the top-level directory (tld) report. This is akin to running du -s * in a directory of interest, but much faster since there is no walking of directory trees. Only the list policy output file is used, reducing the operation to parsing and summing the data it contains.
Comparing directory similarity
Scheduling regular policy runs via cron
The policy run can be scheduled automatically with the cronwrapper script.
Simply append the above script and arguments to the cronwrapper in a crontab line.
For example, to run it every morning at 4 AM you would add:
0 4 * * * /path/to/cronwrapper submit-pol-job <outdir> <policy> <nodecount> <corespernode> <ram> <partition>