Restructure data dir and policy output names

The current naming scheme and directory structure for policy output files is confusing to anyone coming in fresh. Most of our past interaction with these data has been through the symlink directories named list-policy_<device>_<date>, each of which points to a directory already containing the chunked and parquet-converted policy data. However, the raw policy logs follow a different naming scheme from those directories, which makes it difficult to orient yourself when starting from the initial policy run step. Each policy run also leaves essentially 3 entries in the data directory (the initial policy log, the directory with the chunks, and the symlink pointing to the chunk dir), which adds clutter.

Instead, I propose we organize the data directory so that a single subdirectory contains all relevant files for each policy run. The directory name would describe the type of policy, the device the policy was applied to, and the corresponding job ID and run datetime. The raw policy log would be named the same way and stored at the top level of the subdirectory. The split parquet dataset would get its own subdirectory at the same level as the policy log. See below for an example.

/data/rc/gpfs-policy/data/
└── list-policy_<job_id>_<device>_%Y%m%dT%H%M%S_<policy_type>/
    ├── list-policy_<job_id>_<device>_%Y%m%dT%H%M%S_<policy_type>.list.gather-info.gz
    ├── [gz-chunks]
    └── parquet/
        ├── list-000.parquet
        ├── list-001.parquet
        └── ...
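
As a rough sketch of how the proposed name and layout could be produced, the helpers below assemble the run name from its parts and create the run directory with its parquet subdirectory. The function names build_run_name and make_run_dir are hypothetical and do not exist in run-mmpol.sh. The timestamp is passed in as an argument so the caller decides when it is taken.

```shell
# Hypothetical helper: assemble the run name from its parts.
# Arguments: job ID, device, policy type, timestamp (%Y%m%dT%H%M%S).
build_run_name() {
    local job_id="$1" device="$2" policy_type="$3" stamp="$4"
    printf 'list-policy_%s_%s_%s_%s' "$job_id" "$device" "$stamp" "$policy_type"
}

# Hypothetical helper: create the per-run directory and its parquet
# subdirectory under the data root, then print the run directory path.
make_run_dir() {
    local data_root="$1" run_name="$2"
    mkdir -p "${data_root}/${run_name}/parquet" || return 1
    printf '%s\n' "${data_root}/${run_name}"
}

# Example call (values are illustrative):
#   stamp=$(date +%Y%m%dT%H%M%S)
#   run_name=$(build_run_name 29582179 scratch list-path-external "$stamp")
#   run_dir=$(make_run_dir /data/rc/gpfs-policy/data "$run_name")
```

Keeping name construction in one place means the directory and the raw log inside it cannot drift apart in naming.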

This would require multiple changes to run-mmpol.sh. An initial look suggests the following:

  1. Probably convert to getopt for passing options instead of relying on environment variable inheritance. While not necessary for the restructuring, it would improve clarity.
  2. Need to actually apply the file tag. The current output log only has the job ID as an identifier (e.g. list-29582179.list.gather-info). I don't see anything resembling the tag in the file names in /data/rc/gpfs-policy/data.
    1. Verified that the mv command on line 57 is not being run. See the end of /data/rc/list-gpfs-dirs/src/run-policy/out/pol-29582179-list-path-external-scratch.out, where it only says outfile= and [[ '' != '' ]]. If anything had been assigned to outfile, it would appear in the log.
  3. No idea what LIST_OUTPUT_FILE refers to, since that string doesn't appear in the list-path-external or list-path-dirplus policy definitions. -M just performs a string replacement in the policy definition based on what's passed to it, so I'm not sure that line is doing anything.
  4. Need to check mmapplypolicy to see exactly how to get the name of the log file. If that's not possible, we can continue to use the current bones and then perform all of the renaming and organization after the fact.
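
If the fallback in item 4 ends up being the route, the after-the-fact step could look roughly like the sketch below. It assumes the raw log keeps its current list-<job_id>.list.gather-info name, and that the log is gzipped on the way in (inferred from the .gz suffix in the proposed layout). The function name organize_policy_log is hypothetical and nothing here is taken from the existing script.

```shell
# Hypothetical helper: move a raw policy log into its own run directory,
# rename it to match the directory, and compress it.
# Arguments: raw log path, data root, device, policy type, timestamp.
organize_policy_log() {
    local raw_log="$1" data_root="$2" device="$3" policy_type="$4" stamp="$5"
    local base job_id run_name run_dir

    # Assumed input name: list-<job_id>.list.gather-info
    base=$(basename "$raw_log")
    job_id=${base#list-}     # drop the leading "list-"
    job_id=${job_id%%.*}     # drop everything from the first "."

    run_name="list-policy_${job_id}_${device}_${stamp}_${policy_type}"
    run_dir="${data_root}/${run_name}"

    mkdir -p "${run_dir}/parquet" || return 1
    mv "$raw_log" "${run_dir}/${run_name}.list.gather-info" || return 1
    # Assumption: compress here so the file ends in .list.gather-info.gz
    gzip "${run_dir}/${run_name}.list.gather-info" || return 1

    printf '%s\n' "$run_dir"
}
```

Because the job ID is recovered from the existing file name, this could also be run over logs already sitting in the data directory, provided the device, policy type, and run time are known for each.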
Edited Sep 13, 2024 by Matthew K Defenderfer