This is the public version of an internal tool we use to deliver large batches of local data files to S3 buckets as selected by a manifest in Excel format.
usage: s3-catapult [-h] [--version] [--env_file ENV_FILE] input_path config_file

s3-catapult cli arg parser

positional arguments:
  input_path           Path to excel
  config_file          Path to the config YAML file

options:
  -h, --help           show this help message and exit
  --version            show program's version number and exit
  --env_file ENV_FILE  Specifies the .env file to be used

See '<command> --help' to read about a specific sub-command.
The main file used is /catapult/entry.py.
main() starts by setting up a base logger via setup_logger(), which is used in the earliest stages of a run. Once the parameters needed by setup_logging() have been gathered by the functions that follow, setup_logging() is run so that logging goes through coloredlogs.
main() then handles the CLI arguments via command_line_parser(), reads the config file using parse_yaml(), applies coloredlogs to the logs via setup_logging(), and loads the environment file using load_env_vars().
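As a rough illustration of the CLI step, the help text above can be reproduced with argparse. This is a minimal sketch, not the project's actual command_line_parser(): the function signature and the version string are assumptions, and only the argument names and help strings come from the usage output.

```python
import argparse


def command_line_parser(argv=None):
    """Parse s3-catapult arguments (sketch reconstructed from the help text)."""
    parser = argparse.ArgumentParser(
        prog="s3-catapult",
        description="s3-catapult cli arg parser",
    )
    # Placeholder version string; the real value is not stated in this README.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    parser.add_argument("--env_file", help="Specifies the .env file to be used")
    parser.add_argument("input_path", help="Path to excel")
    parser.add_argument("config_file", help="Path to the config YAML file")
    return parser.parse_args(argv)


# Example invocation with hypothetical file names.
args = command_line_parser(["manifest.xlsx", "config.yaml", "--env_file", ".env"])
print(args.input_path, args.config_file, args.env_file)
```

Passing argv explicitly, as done here, keeps the parser easy to exercise without touching sys.argv.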
run_command() is then run using the working_dir field from the config file. It returns a return_code that reflects whether the run as a whole succeeded.
run_command() runs the copy_main() function which is found in the /catapult/validation/copy.py file.
Most of the important steps in the project happen inside copy_main().
It first retrieves the manifest using get_data(), which can be found in the /catapult/utils/utils.py file. get_data() can read the data from either an Excel path or a directory.
Next copy_main() verifies that all paths contained within the retrieved manifest are valid paths via verify_path_validity().
Requests submitted by the client currently share a common issue: the header Sample Internal ID (sample_internal_id in the code) is missing a character, making it Sample Internal D (sample_internal_d). verify_path_validity() accounts for this. It also appends any errors it encounters to the passed-in issues, which is a list of Issue objects. The Issue class can be found in the /catapult/utils/issue.py file.
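The path check and the tolerance for the misspelled header could look roughly like the sketch below. It is an assumption-laden illustration, not the real verify_path_validity(): the Issue fields, the fastq_path column name, the helper sample_id_of(), and the function signature are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass
class Issue:
    # Hypothetical fields; the real class lives in /catapult/utils/issue.py.
    sample_id: str
    field: str
    message: str


def sample_id_of(row):
    """Return the sample id whether the header is spelled correctly or not."""
    for key in ("sample_internal_id", "sample_internal_d"):
        if key in row:
            return str(row[key])
    return "<unknown>"


def verify_path_validity(manifest, path_fields, issues):
    """Append an Issue for every listed path that is empty or missing on disk."""
    for row in manifest:
        for field in path_fields:
            path = row.get(field)
            if not path or not os.path.exists(path):
                issues.append(
                    Issue(sample_id_of(row), field, f"invalid path: {path!r}")
                )
    return issues


# Hypothetical manifest rows; "fastq_path" is an assumed column name.
manifest = [
    {"sample_internal_d": "S1", "fastq_path": os.curdir},  # path exists
    {"sample_internal_id": "S2", "fastq_path": "/no/such/file.fq"},  # path missing
]
issues = verify_path_validity(manifest, ["fastq_path"], [])
print(issues)
```

Collecting problems into a list instead of raising on the first one lets a single run report every bad path in the manifest.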
Next is sample validation, done with the validate_samples() function. It includes:
- get_schema(), found in /catapult/validation/schema.py, which retrieves the schema based on the passed-in file name
- sample_validator, created from the SampleValidator found in /catapult/validation/sample.py. This is then used to validate each sample in the passed-in manifest based on the rules outlined for each field per sample in the schema from /catapult/validation/schemas/catapult_schema.yaml, utilizing some custom check_with's created via cerberus in the SampleValidator class.
- For each sample in the manifest, any new errors that arise are converted to the Issue class via convert_errors_to_issues() and then appended to issues.
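The conversion step in the last bullet can be sketched without cerberus itself. The example assumes the errors arrive as a flat mapping of field name to a list of messages, which is the shape a cerberus Validator's errors property takes for a flat schema. The Issue fields and the convert_errors_to_issues() signature are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    # Hypothetical fields; the real class lives in /catapult/utils/issue.py.
    sample_id: str
    field: str
    message: str


def convert_errors_to_issues(sample_id, errors):
    """Flatten a {field: [messages]} mapping into one Issue per message."""
    return [
        Issue(sample_id, field, str(message))
        for field, messages in errors.items()
        for message in messages
    ]


# Hypothetical validation errors for a single sample.
errors = {
    "fastq_path": ["empty values not allowed"],
    "read_count": ["must be of integer type", "min value is 1"],
}
issues = []
issues.extend(convert_errors_to_issues("S1", errors))
for issue in issues:
    print(issue)
```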
Following validate_samples() is the rclone_copy() command, which copies every fastq path to the destination S3 bucket provided in the config file. It too appends any errors encountered to issues.
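One plausible shape for the copy step is a thin wrapper around the rclone command-line tool, sketched below. The remote name, bucket, prefix, function signatures, and the use of plain strings for issues are assumptions for illustration; the real rclone_copy() may build its command differently.

```python
import subprocess


def build_rclone_command(source_path, remote, bucket, prefix=""):
    """Build an `rclone copy <source> <remote>:<bucket>[/<prefix>]` command."""
    dest = f"{remote}:{bucket}"
    if prefix:
        dest = f"{dest}/{prefix.strip('/')}"
    return ["rclone", "copy", source_path, dest]


def rclone_copy(paths, remote, bucket, issues, prefix=""):
    """Copy each path with rclone, recording failures instead of raising."""
    for path in paths:
        cmd = build_rclone_command(path, remote, bucket, prefix)
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            # The rclone executable is not installed or not on PATH.
            issues.append(f"{path}: rclone executable not found")
            continue
        if result.returncode != 0:
            issues.append(f"{path}: {result.stderr.strip()}")
    return issues


# Show the command that would be run for a hypothetical file and remote.
print(build_rclone_command("/data/S1_R1.fastq.gz", "s3remote", "my-bucket", "run1/"))
```

Keeping command construction separate from execution makes the destination logic checkable without rclone installed.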
A metrics file is then created through generate_metrics_file(). Within this function, if the sample_internal_id header is incorrectly named sample_internal_d, it is renamed to sample_internal_id. A CSV is then created containing all fields from the manifest except the two headers relating to results paths.
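The metrics step described above amounts to a header fix followed by a filtered CSV export. Here is a minimal sketch using only the standard library. The two excluded header names are placeholders, since this README does not name them, and the function returns the CSV text instead of writing a file.

```python
import csv
import io

# Placeholder names for the two results-path headers (not given in this README).
EXCLUDED_HEADERS = {"results_path_1", "results_path_2"}


def generate_metrics_csv(manifest, excluded=EXCLUDED_HEADERS):
    """Return CSV text for the manifest with the header fixed and paths dropped."""
    rows = []
    for row in manifest:
        cleaned = {}
        for key, value in row.items():
            # Repair the known misspelling of the sample id header.
            if key == "sample_internal_d":
                key = "sample_internal_id"
            if key not in excluded:
                cleaned[key] = value
        rows.append(cleaned)

    # Preserve first-seen column order across all rows.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical manifest row with the misspelled header.
manifest = [
    {"sample_internal_d": "S1", "fastq_path": "/a.fq", "results_path_1": "/r"},
]
metrics_csv = generate_metrics_csv(manifest)
print(metrics_csv)
```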
After the metrics file is generated, any issues that exist are written to a CSV file in the same way as the metrics file, but using the fields of an Issue. That CSV file is then sent in an email using send_email(), found in /catapult/utils/emailer.py.
If there aren't any issues, the metrics file is sent in an email using send_email().
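The choice between the two reports can be sketched with the standard library's email package. This only builds the message; actually sending it (for example over SMTP) is left out. The function name, subject lines, body text, and attachment file names are hypothetical, and the real send_email() may differ.

```python
from email.message import EmailMessage


def build_report_email(sender, recipient, metrics_csv, issues_csv=None):
    """Attach the issues CSV if there are issues, otherwise the metrics CSV."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    if issues_csv:
        msg["Subject"] = "s3-catapult: issues found"
        msg.set_content("The run reported issues; see the attached CSV.")
        filename, payload = "issues.csv", issues_csv
    else:
        msg["Subject"] = "s3-catapult: metrics"
        msg.set_content("The run completed without issues; metrics attached.")
        filename, payload = "metrics.csv", metrics_csv
    # Bytes payloads need an explicit MIME type.
    msg.add_attachment(
        payload.encode("utf-8"), maintype="text", subtype="csv", filename=filename
    )
    return msg


# Hypothetical addresses and CSV contents.
with_issues = build_report_email(
    "tool@example.com", "team@example.com", "id\nS1\n", "field,message\na,b\n"
)
without_issues = build_report_email("tool@example.com", "team@example.com", "id\nS1\n")
print(with_issues["Subject"])
print(without_issues["Subject"])
```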