# Information Lifecycle Management (ILM) via GPFS policy engine

The GPFS policy engine is well described in this [white paper](https://www.ibm.com/support/pages/system/files/inline-files/Spectrum%20Scale%20ILM%20Policies_v10.2.pdf).
A good presentation overview of the policy file is [here](https://www.spectrumscaleug.org/event/ssugdigital-spectrum-scale-ilm-policy-engine/).
The relevant [documentation is available from IBM](https://www.ibm.com/docs/en/spectrum-scale/4.2.0?topic=guide-information-lifecycle-management-spectrum-scale).

This project focuses on scheduled execution of lifecycle policies to gather and process data about
file system objects and issue actions against those objects based on policy.

## Running a policy

A policy is executed in the context of a SLURM batch job reservation using the submit-pol-job script:
```
submit-pol-job <outdir> <policy> <nodecount> <corespernode> <ram> <partition> <time>
```
Where the positional arguments are:

- `outdir`: the directory for the output files, should be global to the cluster (e.g. /scratch of the user running the job)
- `policy`: path to the GPFS policy to execute (e.g. in ./policy directory)
- `nodecount`: number of nodes in the cluster that will run the policy
- `corespernode`: number of cores on each node to reserve
- `ram`: RAM per core, can use "G" for gigabytes
- `partition`: the partition to submit the job
- `time`: the time in minutes to reserve for the job

Note: the resource reservation is imperfect. The job wrapper calls the script `run-mmpol.sh`, which
is responsible for executing the `mmapplypolicy` command.
Arguments to `mmapplypolicy` direct the command to run on specific nodes, but the command
itself is technically not run inside the job reservation, so the resource constraints are not
strictly enforced. The goal is to use the scheduler to ensure the policy run does not conflict
with existing resource allocations on the cluster.

## Running the policy "list-policy-external"

The list-policy-external policy provides an efficient tool to gather file stat data into a URL-encoded
ASCII text file. The output file can then be processed downstream to create reports on storage
patterns and use.

An example invocation would be:
```
submit-pol-job /path/to/output/dir \
/absolute/path/policy/list-path-external \
4 24 4G partition_name \
/path/to/listed/dir \
180
```

Some things to keep in mind:
- the `submit-pol-job` script may need a `./` prefix if it is not in your `PATH`
- use absolute paths for all directory arguments to avoid potential confusion
- make sure the output dir has sufficient space to hold the resulting file listing (it could be hundreds of gigabytes for a large collection of files)

The SLURM job output file is written to the directory from which this command was executed. It can be
watched to observe progress in the generation of the file list. A listing of hundreds of millions of files
may take a couple of hours to generate and consume several hundred gigabytes for the output file.

The output file in `/path/to/output/dir` is named as follows:
- a prefix of "list-${SLURM_JOBID}"
- ".list" for the name of the policy rule type of "list"
- a tag for the list name defined in the policy file, "list-gather" for the `list-path-external` policy
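
The naming scheme above can be sketched as a small helper for locating the output of a given job. This is a minimal sketch: the function name `list_output_path` is hypothetical, and joining the rule-type part and the list-name tag with a "." is an assumption, so check an actual output directory before relying on it.

```python
from pathlib import Path


def list_output_path(outdir, job_id, list_name="list-gather"):
    """Build the expected list-output path from its three documented parts."""
    # Parts: "list-<jobid>" prefix, ".list" rule type, then the list-name tag.
    # Assumption: the tag is appended with a "." separator.
    filename = f"list-{job_id}.list.{list_name}"
    return Path(outdir) / filename


if __name__ == "__main__":
    print(list_output_path("/path/to/output/dir", 12345))
```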

The output file contains one line per file object stored under the `/path/to/listed/dir`. No directories
or non-file objects are included in this listing. Each entry is a space-separated set of file attributes
selected by the SHOW clause in the LIST rule. Entries are encoded according to RFC 3986 URI percent
encoding. This means all spaces and special characters will be encoded, making it easy to split lines
into fields using the space separator.
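
As an illustration of why the encoding makes splitting safe, the following minimal sketch splits a line on spaces first and only then decodes each field. The function name `parse_entry` and the sample line are hypothetical; the actual fields and their order depend on the SHOW clause in the policy file.

```python
from urllib.parse import unquote


def parse_entry(line):
    """Split one list-output line into percent-decoded fields."""
    # Split on the literal space first: encoded spaces (%20) stay inside a field.
    raw_fields = line.rstrip("\n").split(" ")
    # Decode after splitting so a decoded space cannot create an extra field.
    # Empty strings from repeated spaces are dropped.
    return [unquote(field) for field in raw_fields if field]


if __name__ == "__main__":
    # Hypothetical entry: a size, a timestamp, and an encoded path.
    sample = "4096 1650000000 /data/my%20project/file%231.txt\n"
    print(parse_entry(sample))
```

Decoding before splitting would break any path that contains a space, which is the case the encoding is meant to handle.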

The output file is an unsorted list of files in uncompressed ASCII. Further processing is desirable
to use less space for storage and provide organized collections of data.
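
One way to reduce the storage footprint is to cut the list into fixed-size pieces and gzip each one. The sketch below is an illustration only, not the project's own tooling: the function name `split_and_compress`, the chunk size, and the chunk naming are all assumptions.

```python
import gzip
from pathlib import Path


def split_and_compress(list_file, outdir, lines_per_chunk=1_000_000):
    """Write list_file as numbered gzip chunks in outdir and return their paths."""
    list_file = Path(list_file)
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    chunks = []
    out = None
    try:
        # Binary mode streams the file line by line without decoding it.
        with open(list_file, "rb") as src:
            for lineno, line in enumerate(src):
                if lineno % lines_per_chunk == 0:
                    # Start a new chunk; close the previous one first.
                    if out is not None:
                        out.close()
                    # Hypothetical naming: <original name>.<chunk number>.gz
                    chunk_path = outdir / f"{list_file.name}.{len(chunks):05d}.gz"
                    out = gzip.open(chunk_path, "wb")
                    chunks.append(chunk_path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunks
```

Because lines are never split across chunks, each compressed piece can later be processed independently.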

## Processing the output file

### Split and compress

### Pre-parse output for Python

## Running reports

### Disk usage by top level directories

A useful report is the top level directory (tld) report. This is akin to running a `du -s *` in a
directory of interest, but much faster since there is no walking of directory trees. Only the list
policy output file is used, reducing the operation to parsing and summing the data in that file.
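
The parse-and-sum idea can be sketched as follows. This is a minimal illustration, not the project's report code: the function name `tld_usage` is hypothetical, and the positions of the size and path fields (`size_idx`, `path_idx`) are assumptions that must be matched to the SHOW clause of the policy actually used.

```python
from collections import defaultdict
from urllib.parse import unquote


def tld_usage(lines, root, size_idx, path_idx):
    """Sum the size field per top-level entry directly under root."""
    root = root.rstrip("/") + "/"
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) <= max(size_idx, path_idx):
            continue  # skip short or malformed lines
        path = unquote(fields[path_idx])
        if not path.startswith(root):
            continue  # entry lies outside the directory of interest
        # First path component below root; a file directly in root counts as itself.
        top = path[len(root):].split("/", 1)[0]
        totals[top] += int(fields[size_idx])
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical entries with the size in field 0 and the path in field 1.
    sample = [
        "100 /data/a/x.txt",
        "50 /data/a/sub/y%20z.txt",
        "7 /data/b/file",
    ]
    print(tld_usage(sample, "/data", size_idx=0, path_idx=1))
```

The totals are in whatever unit the chosen size field carries. Since `lines` can be any iterable, an open file handle can be passed directly so the whole list never has to be held in memory.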

### Comparing directory similarity

## Scheduling regular policy runs via cron

The policy run can be scheduled automatically with the cronwrapper script.
Simply append the `submit-pol-job` command and its arguments to the cronwrapper in a crontab line.
For example, to run it every morning at 4 am you would add:

`0 4 * * * /path/to/cronwrapper submit-pol-job <outdir> <policy> <nodecount> <corespernode> <ram> <partition> <time>`