From c42dbcfd91f2dc89bef4c7f22a9418715bea45a5 Mon Sep 17 00:00:00 2001
From: John-Paul Robinson <jpr@uab.edu>
Date: Wed, 22 May 2024 18:42:53 -0500
Subject: [PATCH] Improve project documentation with instructions on running
 policy

Provide details on running a policy script with the provided
workflow framework.

Information on the list-policy-external policy and generated
output.

Outline of future docs for preprocessing and reports.
---
 README.md | 86 +++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 78 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 2b7306d..22417db 100644
--- a/README.md
+++ b/README.md
@@ -4,19 +4,89 @@
 The GPFS policy engine is well described in this [white paper](https://www.ibm.c
 A good presentation overview of the policy file is [here](https://www.spectrumscaleug.org/event/ssugdigital-spectrum-scale-ilm-policy-engine/).
 The relevant [documentation is available from IBM](https://www.ibm.com/docs/en/spectrum-scale/4.2.0?topic=guide-information-lifecycle-management-spectrum-scale).
 
-This project focuses on scheduled execution of lifecyle policies.
+This project focuses on scheduled execution of lifecycle policies to gather and process data about
+file system objects and issue actions against those objects based on policy.
 
-A policy is executed via the run_job
+## Running a policy
 
-submit-pol-job <outdir> <policy> <nodecount> <corespernode> <ram> <partition>
+A policy is executed in the context of a SLURM batch job reservation using the submit-pol-job script:
+```
+submit-pol-job <outdir> <policy> <nodecount> <corespernode> <ram> <partition> <time>
+```
+Where the positional arguments are:
 
-Where:
-
-outdir - the directory for the policy and job output, should be global to cluster
-policy - path to the policy file
+outdir - the directory for the output files; should be global to the cluster (e.g. /scratch of the user running the job)
+policy - path to the GPFS policy to execute (e.g.
+in ./policy directory)
 nodecount - number of nodes in the cluster that will run the policy
 corespernode - number of cores on each node to reserve
-partition - the partition used to submit the job
+ram - RAM per core; a "G" suffix can be used for gigabytes
+partition - the partition to submit the job to
+time - the time in minutes to reserve for the job
+
+Note: the resource reservation is imperfect. The job wrapper calls a script, `run-mmpol.sh`, which
+is responsible for executing the `mmapplypolicy` command. The command is directed to run on specific
+nodes by way of arguments to `mmapplypolicy`, but it is technically not run inside the job reservation,
+so the resource constraints are imperfect. The goal is to use the scheduler to ensure the policy run
+does not conflict with existing resource allocations on the cluster.
+
+## Running the policy "list-policy-external"
+
+The list-policy-external policy provides an efficient tool to gather file stat data into a URL-encoded
+ASCII text file. The output file can then be processed by downstream tools to create reports on storage
+patterns and use.
+
+An example invocation would be:
+```
+submit-pol-job /path/to/output/dir \
+    /absolute/path/policy/list-path-external \
+    4 24 4G partition_name \
+    /path/to/listed/dir \
+    180
+```
+
+Some things to keep in mind:
+- the `submit-pol-job` script may need a `./` prefix if it is not in your path.
+- use absolute paths for all directory arguments to avoid potential confusion.
+- make sure the output dir has sufficient space to hold the resulting file listing (it could be hundreds of gigabytes for a large collection of files).
+
+The SLURM job output file will be local to the directory from which this command was executed. It can be
+watched to observe progress in the generation of the file list. A listing of hundreds of millions of files
+may take a couple of hours to generate and consume several hundred gigabytes for the output file.
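Because entries in the list file are percent-encoded, embedded spaces never break the space-delimited fields; each field can be split first and decoded afterwards. A minimal bash sketch of this decoding step (the sample line and its fields are hypothetical; the real fields depend on the SHOW clause in your policy file):

```shell
#!/bin/bash
# Sketch: percent-decode the fields of one list-policy output line.
# The sample line below is hypothetical; real fields depend on the
# SHOW clause in the policy file.
line='12345 1048576 -- /data/projects/my%20project/file%231.txt'

# printf '%b' expands \xHH escapes, so rewriting each "%" as "\x"
# performs a simple RFC 3986 percent-decode (bash-specific, not POSIX sh).
urldecode() { printf '%b' "${1//%/\\x}"; }

# Split the entry on spaces (safe because embedded spaces are encoded),
# then decode each field on its own line.
read -r -a fields <<< "$line"
for f in "${fields[@]}"; do
  urldecode "$f"
  echo
done
```

The `printf '%b'` trick is a quick interactive aid; for bulk parsing a proper decoder (e.g. Python's `urllib.parse.unquote`) is more robust.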
+
+The output file in `/path/to/output/dir` is named as follows:
+- a prefix of "list-${SLURM_JOBID}"
+- ".list" for the policy rule type of "list"
+- a tag for the list name defined in the policy file, e.g. "list-gather" for the `list-path-external` policy
+
+The output file contains one line per file object stored under `/path/to/listed/dir`. No directories
+or non-file objects are included in this listing. Each entry is a space-separated set of file attributes
+selected by the SHOW command in the LIST rule. Entries are encoded according to RFC 3986 URI percent
+encoding, so any spaces and special characters within attribute values are escaped, making it easy to
+split lines into fields using the space separator.
+
+The output file is an unsorted list of files in uncompressed ASCII. Further processing is desirable
+to use less storage space and to provide organized collections of data.
+
+## Processing the output file
+
+### Split and compress
+
+### Pre-parse output for Python
+
+## Running reports
+
+### Disk usage by top-level directories
+
+A useful report is the top-level directory (tld) report. This is akin to running `du -s *` in a
+directory of interest, but much faster since no directory trees are walked. Only the list policy
+output file is used, reducing the operation to parsing and summing the data in that file.
+
+### Comparing directory similarity
+
+## Scheduling regular policy runs via cron
 
 The policy run can be scheduled automatically with the cronwrapper script.
 Simply append the above script and arguments to the cronwrapper in a crontab line.
--
GitLab