From c42dbcfd91f2dc89bef4c7f22a9418715bea45a5 Mon Sep 17 00:00:00 2001
From: John-Paul Robinson <jpr@uab.edu>
Date: Wed, 22 May 2024 18:42:53 -0500
Subject: [PATCH] Improve project documentation with instructions on running
 policy

Provide details on running a policy script with the provided
workflow framework.
Information on the list-policy-external policy and generated output.
Outline of future docs for preprocessing and reports.
---
 README.md | 86 +++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 78 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 2b7306d..22417db 100644
--- a/README.md
+++ b/README.md
@@ -4,19 +4,89 @@ The GPFS policy engine is well described in this [white paper](https://www.ibm.c
 A good presentation overview of the policy file is [here](https://www.spectrumscaleug.org/event/ssugdigital-spectrum-scale-ilm-policy-engine/).
 The relevant [documentation is available from IBM](https://www.ibm.com/docs/en/spectrum-scale/4.2.0?topic=guide-information-lifecycle-management-spectrum-scale).
 
-This project focuses on scheduled execution of lifecyle policies.
+This project focuses on scheduled execution of lifecycle policies to gather and process data about
+file system objects and issue actions against those objects based on policy.
 
-A policy is executed via the run_job
+## Running a policy
 
-submit-pol-job <outdir> <policy> <nodecount> <corespernode> <ram> <partition>
+A policy is executed in the context of a SLURM batch job reservation using the `submit-pol-job` script:
+```
+submit-pol-job <outdir> <policy> <nodecount> <corespernode> <ram> <partition> <time>
+```
+Where the positional arguments are:
 
-Where:
-
-outdir - the directory for the policy and job output, should be global to cluster
-policy - path to the policy file
+outdir - the directory for the output files; should be global to the cluster (e.g. /scratch of the user running the job)
+policy - path to the GPFS policy file to execute (e.g. in the ./policy directory)
 nodecount - number of nodes in the cluster that will run the policy
 corespernode - number of cores on each node to reserve
-partition - the partition used to submit the job
+ram - RAM per core; a "G" suffix can be used for gigabytes
+partition - the partition to submit the job to
+time - the time in minutes to reserve for the job
+
+Note: the resource reservation is imperfect.  The job wrapper calls a script, `run-mmpol.sh`, which
+is responsible for executing the `mmapplypolicy` command.
+The command is directed to run on specific nodes by way of arguments to
+`mmapplypolicy`.  Because the command does not technically run inside the job reservation, the resource
+constraints are not strictly enforced.  The goal is to use the scheduler to ensure the policy run does
+not conflict with existing resource allocations on the cluster.
+
+## Running the policy "list-policy-external"
+
+The list-policy-external policy provides an efficient tool to gather file stat data into a URL-encoded
+ASCII text file.  The output file can then be processed by downstream tools to create reports on storage
+patterns and usage.
+
+An example invocation would be:
+```
+submit-pol-job /path/to/output/dir \
+    /absolute/path/policy/list-path-external \
+    4 24 4G partition_name \
+    /path/to/listed/dir \
+    180
+```
+
+Some things to keep in mind:
+- the `submit-pol-job` script may need a `./` prefix if it is not in your path
+- use absolute paths for all directory arguments to avoid potential confusion
+- make sure the output directory has sufficient space to hold the resulting file listing (it could be hundreds of gigabytes for a large collection of files)
+
+The SLURM job output file will be local to the directory from which the command was executed.  It can be
+watched to observe progress in the generation of the file list.  A listing of hundreds of millions of files
+may take a couple of hours to generate and consume several hundred gigabytes for the output file.
+
+The output file in `/path/to/output/dir` is named as follows:
+- a prefix of "list-${SLURM_JOBID}"
+- ".list" for the name of the policy rule type of "list"
+- a tag for the list name defined in the policy file, "list-gather" for the `list-path-external` policy
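
As a quick illustration of the naming rules above (a sketch; it assumes the three parts are dot-joined, following `mmapplypolicy`'s usual `<prefix>.list.<listname>` convention, and that the tag is "list-gather"):

```python
def list_output_name(jobid, tag="list-gather"):
    """Build the expected output file name: a "list-<jobid>" prefix,
    the ".list" rule-type component, and the list name tag."""
    return f"list-{jobid}.list.{tag}"

print(list_output_name(12345))  # → list-12345.list.list-gather
```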
+
+The output file contains one line per file object stored under the `/path/to/listed/dir`.  No directories
+or non-file objects are included in this listing.  Each entry is a space-separated set of file attributes
+selected by the SHOW command in the LIST rule.  Entries are encoded according to RFC 3986 URI percent
+encoding.  This means all spaces and special characters within field values are encoded, making it easy
+to split lines into fields using the space separator.
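
Because spaces inside field values are percent-encoded, splitting on the space separator is safe. A minimal sketch in Python (the two-field size/path layout here is illustrative, not the actual attributes selected by the SHOW clause):

```python
from urllib.parse import unquote

# Hypothetical entry: percent-encoded fields, so the space separator
# never appears inside a field value.
entry = "8388608 /data/project/My%20Report%20%282024%29.pdf"

size, path = entry.split(" ")   # safe: embedded spaces are encoded as %20
size = int(size)
path = unquote(path)            # decode RFC 3986 percent-encoding

print(size, path)               # → 8388608 /data/project/My Report (2024).pdf
```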
+
+The output file is an unsorted list of files in uncompressed ASCII.  Further processing is desirable
+to use less space for storage and provide organized collections of data.
+
+## Processing the output file
+
+### Split and compress
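
One possible approach (a sketch; the chunk size and part naming are assumptions, not part of the framework) is to split the raw listing into fixed-line-count pieces and gzip each piece:

```python
import gzip
from pathlib import Path

def split_and_compress(listing, outdir, lines_per_chunk=5_000_000):
    """Split a raw policy list file into gzip-compressed chunks.

    Chunking keeps each part small enough for downstream tools while
    gzip greatly shrinks the highly repetitive path data.
    """
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    written = []

    def flush(lines, idx):
        part = out / f"list-part-{idx:04d}.gz"  # hypothetical naming scheme
        with gzip.open(part, "wt") as dst:
            dst.writelines(lines)
        written.append(part)

    with open(listing) as src:
        buf, idx = [], 0
        for line in src:
            buf.append(line)
            if len(buf) == lines_per_chunk:
                flush(buf, idx)
                buf, idx = [], idx + 1
        if buf:
            flush(buf, idx)
    return written
```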
+
+### Pre-parse output for Python
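
A minimal sketch of a pre-parser that decodes each percent-encoded entry into a list of fields (the meaning of the fields depends on the policy's SHOW clause, so no names are assigned here):

```python
from urllib.parse import unquote

def parse_list_file(path):
    """Yield one list of decoded fields per entry in a policy list file.

    Assumes each line is a space-separated, percent-encoded record, as
    produced by a LIST rule; blank lines are skipped.
    """
    with open(path) as fh:
        for line in fh:
            if line.strip():
                yield [unquote(field) for field in line.split()]
```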
+
+## Running reports
+
+### Disk usage by top-level directories
+
+A useful report is the top-level directory (tld) report.  This is akin to running `du -s *` in a
+directory of interest, but much faster since there is no walking of directory trees.  Only the list
+policy output file is used, reducing the operation to parsing and summing the data in the list
+policy output file.
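
The parse-and-sum step can be sketched as follows (assuming, for illustration only, that each entry has been decoded into a (size, path) pair; the real field layout depends on the SHOW clause):

```python
from collections import defaultdict

def usage_by_tld(entries, root):
    """Sum file sizes per top-level directory under `root`.

    `entries` yields (size, path) pairs from the decoded list file;
    no directory walking is needed, only a single pass over the data.
    """
    totals = defaultdict(int)
    prefix = root.rstrip("/") + "/"
    for size, path in entries:
        if not path.startswith(prefix):
            continue  # entry outside the directory of interest
        tld = path[len(prefix):].split("/", 1)[0]
        totals[tld] += size
    return dict(totals)
```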
+
+### Comparing directory similarity
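
One possible similarity measure (a sketch; the choice of metric and the use of relative paths are assumptions) is the Jaccard index over the sets of file paths found under two directories:

```python
def jaccard_similarity(paths_a, paths_b):
    """Jaccard index of two path collections: |A ∩ B| / |A ∪ B|.

    1.0 means identical contents, 0.0 means no overlap; comparing
    relative paths lets two directory trees be compared in place.
    """
    a, b = set(paths_a), set(paths_b)
    if not (a or b):
        return 1.0  # two empty directories are trivially identical
    return len(a & b) / len(a | b)
```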
+
+
+## Scheduling regular policy runs via cron
 
 The policy run can be scheduled automatically with the cronwrapper script.
 Simply append the above script and arguments to the cronwrapper in a crontab line.
-- 
GitLab