BCM-HGSC/s3-catapult

S3-Catapult

This is the public version of an internal tool we use to deliver large batches of local data files to S3 buckets as selected by a manifest in Excel format.

Typical Operation

usage: s3-catapult [-h] [--version] [--env_file ENV_FILE] input_path config_file

s3-catapult cli arg parser

positional arguments:
  input_path           Path to excel
  config_file          Path to the config YAML file

options:
  -h, --help           show this help message and exit
  --version            show program's version number and exit
  --env_file ENV_FILE  Specifies the .env file to be used

See '<command> --help' to read about a specific sub-command.
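
For orientation, the interface shown above could be reproduced with a small argparse setup along the following lines. This is only a sketch inferred from the help text: the function name build_parser() and the version string are placeholders, and the real command_line_parser() in /catapult/entry.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented usage line (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="s3-catapult",
        description="s3-catapult cli arg parser",
    )
    # "0.0.0" is a placeholder; the real version string is not given in this README.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    parser.add_argument("--env_file", help="Specifies the .env file to be used")
    parser.add_argument("input_path", help="Path to excel")
    parser.add_argument("config_file", help="Path to the config YAML file")
    return parser


# Example invocation with hypothetical file names.
args = build_parser().parse_args(
    ["manifest.xlsx", "config.yaml", "--env_file", ".env"]
)
print(args.input_path, args.config_file, args.env_file)
```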

Developer Notes

Entry.py

The main file used is /catapult/entry.py.

Logging

main() begins by calling setup_logger(), which creates a basic logger for the earliest stages of a run. Once subsequent functions have gathered the parameters that setup_logging() needs, setup_logging() is run to switch logging over to coloredlogs.
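
The two-stage pattern described here can be sketched with the standard library alone. The function signatures below are assumptions made for illustration, and a plain logging.Formatter stands in for coloredlogs so the example runs without extra dependencies; the real functions in entry.py may differ.

```python
import logging
import sys


def setup_logger(name: str = "catapult") -> logging.Logger:
    """Stage 1: a minimal logger usable before any configuration is read."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        logger.addHandler(logging.StreamHandler(sys.stderr))
    return logger


def setup_logging(logger: logging.Logger, level: str, fmt: str) -> None:
    """Stage 2: reconfigure once the level and format are known.

    The real tool applies coloredlogs at this point; a plain Formatter
    is used here as a stand-in.
    """
    logger.setLevel(level)
    formatter = logging.Formatter(fmt)
    for handler in logger.handlers:
        handler.setFormatter(formatter)


logger = setup_logger()
logger.info("parsing command line")  # emitted by the stage-1 logger
setup_logging(logger, level="DEBUG", fmt="%(asctime)s %(levelname)s %(message)s")
logger.debug("full logging configured")  # emitted with the stage-2 format
```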

cli, config, coloredlogs, and env

main() then parses the command line via command_line_parser(), reads the config file with parse_yaml(), applies coloredlogs to the logger via setup_logging(), and loads the environment file with load_env_vars().

run_command()

run_command() is then run using the working_dir field from the config file. It returns a return_code that reflects whether the run as a whole succeeded.

Copy.py

copy_main()

run_command() runs the copy_main() function, which is found in /catapult/validation/copy.py.

Most of the important steps in the project take place inside copy_main().

get_data()

It first retrieves the manifest using get_data(), which can be found in /catapult/utils/utils.py. get_data() can read the data from either an Excel file path or a directory.

verify_path_validity()

Next, copy_main() verifies that every path in the retrieved manifest is valid via verify_path_validity().

Requests submitted by the client commonly contain a header with a missing character: Sample Internal ID (sample_internal_id in the code) arrives as Sample Internal D (sample_internal_d). verify_path_validity() accounts for this. It also appends any errors it encounters to the issues argument, which is a list of Issue objects. The Issue class can be found in /catapult/utils/issue.py.
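
A minimal sketch of this step is shown below. The Issue fields, the helper names, and the fastq_path column are assumptions made for illustration; the actual Issue class and verify_path_validity() signature live in the repository and may look different.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Issue:
    # Hypothetical fields; the real class is in /catapult/utils/issue.py.
    sample_id: str
    field: str
    message: str


# Accept both the correct header and the commonly mistyped variant.
ID_KEYS = ("sample_internal_id", "sample_internal_d")


def sample_id_of(row: dict) -> str:
    """Return the sample ID regardless of which header variant is present."""
    for key in ID_KEYS:
        if key in row:
            return str(row[key])
    return "<unknown>"


def verify_paths(rows: list, path_fields: list, issues: list) -> None:
    """Append an Issue for every missing or nonexistent path in the manifest."""
    for row in rows:
        for field in path_fields:
            value = row.get(field)
            if not value or not Path(value).exists():
                issues.append(
                    Issue(sample_id_of(row), field, f"path not found: {value!r}")
                )


issues = []
rows = [{"sample_internal_d": "S1", "fastq_path": "/no/such/file.fastq.gz"}]
verify_paths(rows, ["fastq_path"], issues)
print(issues)
```

Collecting problems in a shared list, rather than raising on the first one, lets later steps report every issue in a single pass.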

Sample Validation

Next is sample validation, done with the validate_samples() function. It includes:

  • get_schema(), found in /catapult/validation/schema.py, which retrieves the schema based on the passed in file name
  • sample_validator, an instance of the SampleValidator class found in /catapult/validation/sample.py. It validates each sample in the passed-in manifest against the per-field rules in the schema at /catapult/validation/schemas/catapult_schema.yaml, using custom cerberus check_with rules defined in the SampleValidator class.
  • For each sample in the manifest, any new errors that arise are converted to the Issue class via convert_errors_to_issues() and then appended to issues.
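
The last step in the list above can be sketched as follows. cerberus reports validation errors as a dictionary mapping each field name to a list of messages; the Issue fields, the function signature, and the read_count and fastq_path field names used here are assumptions for illustration, not the repository's actual definitions.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    # Hypothetical fields; the real class is in /catapult/utils/issue.py.
    sample_id: str
    field: str
    message: str


def convert_errors_to_issues(sample_id: str, errors: dict) -> list:
    """Flatten a {field: [messages]} error mapping into Issue objects."""
    return [
        Issue(sample_id, field, str(message))
        for field, messages in errors.items()
        for message in messages
    ]


# Errors shaped like the `errors` attribute of a cerberus Validator.
errors = {
    "read_count": ["must be of integer type"],
    "fastq_path": ["required field"],
}
issues = []
issues.extend(convert_errors_to_issues("S1", errors))
print(issues)
```
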

rclone_copy()

Following validate_samples(), rclone_copy() copies every fastq path to the destination S3 bucket specified in the config file. It too appends any errors it encounters to issues.
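
As a rough illustration, a copy of a single file could be driven through the rclone command line with subprocess. The helper names, the remote name s3remote, and the bucket and prefix values are hypothetical; how rclone_copy() actually builds and runs its command is not described in this README.

```python
import subprocess


def build_rclone_command(source: str, remote: str, bucket: str, prefix: str = "") -> list:
    """Build an `rclone copy` command for one local file (illustrative only)."""
    destination = f"{remote}:{bucket}/{prefix}".rstrip("/")
    return ["rclone", "copy", source, destination]


def copy_file(source: str, remote: str, bucket: str, prefix: str = "") -> tuple:
    """Run the copy and return (return code, stderr) so the caller can log failures."""
    result = subprocess.run(
        build_rclone_command(source, remote, bucket, prefix),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


# Only the command is built here; nothing is executed.
cmd = build_rclone_command(
    "/data/run1/sample_R1.fastq.gz", "s3remote", "my-bucket", "batch01"
)
print(" ".join(cmd))
```

A non-zero return code from copy_file() would be the natural point at which to append an entry to issues.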

generate_metrics_file()

A metrics file is then created through generate_metrics_file(). Within this function, if the sample_internal_id header is incorrectly named sample_internal_d, it is renamed to sample_internal_id. Then a CSV is created containing all fields from the manifest except the two headers relating to results paths.
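
The header repair and column filtering could look roughly like this. The function name write_metrics(), the excluded parameter, and the results_path and read_count column names are placeholders, since the README does not name the two results-path headers.

```python
import csv
import io


def write_metrics(rows: list, excluded: set, out) -> None:
    """Write manifest rows as CSV, fixing the ID header and dropping excluded columns."""
    cleaned = [
        {
            # Rename the mistyped header in place, preserving column order.
            ("sample_internal_id" if key == "sample_internal_d" else key): value
            for key, value in row.items()
            if key not in excluded
        }
        for row in rows
    ]
    if not cleaned:
        return
    # Assumes every row has the same columns as the first one.
    writer = csv.DictWriter(out, fieldnames=list(cleaned[0]))
    writer.writeheader()
    writer.writerows(cleaned)


buf = io.StringIO()
write_metrics(
    [{"sample_internal_d": "S1", "read_count": 10, "results_path": "/x"}],
    {"results_path"},  # hypothetical name for a results-path column
    buf,
)
print(buf.getvalue())
```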

Emailing

After the metrics file is generated, any issues that exist are written to a CSV file in much the same way as the metrics file, but using the fields of an Issue. That CSV file is then sent by email using send_email(), found in /catapult/utils/emailer.py.

If there aren't any issues, the metrics file is sent in an email using send_email().
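
The choice between the two attachments can be sketched with the standard library's email package. The function name, subjects, addresses, and file names below are assumptions; the real send_email() in /catapult/utils/emailer.py may take different arguments and handle delivery differently.

```python
from email.message import EmailMessage
from typing import Optional


def build_report_email(
    sender: str, recipient: str, metrics_csv: str, issues_csv: Optional[str]
) -> EmailMessage:
    """Attach the issues CSV if there are issues, otherwise the metrics CSV."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    if issues_csv:
        msg["Subject"] = "s3-catapult: issues found"
        filename, payload = "issues.csv", issues_csv
    else:
        msg["Subject"] = "s3-catapult: delivery metrics"
        filename, payload = "metrics.csv", metrics_csv
    msg.set_content("See the attached CSV report.")
    msg.add_attachment(
        payload.encode("utf-8"), maintype="text", subtype="csv", filename=filename
    )
    return msg


# Hypothetical addresses and CSV contents.
msg = build_report_email(
    "catapult@example.org",
    "lab@example.org",
    metrics_csv="sample_internal_id,read_count\r\nS1,10\r\n",
    issues_csv="sample_id,field,message\r\nS1,fastq_path,path not found\r\n",
)
print(msg["Subject"])
```

The resulting message could then be handed to smtplib's SMTP.send_message(); the server details would come from the environment file.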

About

Automated delivery of data batches from local filesystems to S3
