MunGLER - Metadata UNtanGLER

MunGLER uses Large Language Models (LLMs) to untangle the sample metadata attached to DNA sequencing data in the Sequence Read Archive (SRA).

Visual Concept

Installation

First, clone this repository:

git clone https://github.com/fbcorrea/mungler.git

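Create a Conda environment that includes Python and pip, then activate it. I recommend using Mamba.

mamba create -n mungler python pip
conda activate mungler

Install the Python packages with pip from inside the cloned repository:

cd mungler
pip install -r requirements.txt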

Hardware requirements

To run this pipeline effectively, a High-Performance Computing (HPC) system is strongly recommended, ideally one equipped with at least one recent high-end GPU such as the NVIDIA A100. This helps ensure good performance and reliable results.

Hardware requirements depend on the model used. Roughly, the larger the model, the more memory it requires. If you choose to run Llama 2 models from Meta, the smaller models require about 8 GB of VRAM, while the larger ones need up to 64 GB. Other large models, such as Mixtral 8x7B, require around 100 GB of VRAM.
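As a rough sanity check on figures like these, the memory needed just to hold a model's weights can be approximated as parameter count times bytes per parameter. The sketch below is a rule of thumb and not part of mungler: the function name `estimate_weights_gb` is hypothetical, the parameter counts are nominal, and activations, the KV cache and framework overhead are ignored, so real usage will be higher.

```python
def estimate_weights_gb(n_params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory in GB (1 GB = 1e9 bytes) needed to hold the weights only."""
    if n_params_billion <= 0 or bits_per_param <= 0:
        raise ValueError("parameter count and precision must be positive")
    n_bytes = n_params_billion * 1e9 * bits_per_param / 8
    return n_bytes / 1e9


if __name__ == "__main__":
    # Nominal model sizes, used here only for illustration.
    for name, size in [("7B model", 7), ("13B model", 13), ("70B model", 70)]:
        for bits in (16, 8, 4):
            gb = estimate_weights_gb(size, bits)
            print(f"{name} at {bits}-bit: ~{gb:.1f} GB for weights")
```

Under these assumptions, a nominal 7-billion-parameter model needs about 14 GB at 16-bit precision and about 7 GB at 8-bit, which is why quantized models fit on smaller cards.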

Alternatively, some cloud services let users pay per hour of compute, such as Amazon SageMaker (Amazon Web Services), Google Cloud AI Platform (Google Cloud Platform) and Azure Machine Learning (Microsoft Azure).

Usage

Activate Conda environment

If you followed the installation instructions, you must activate your Conda environment to run the mungler main script.

conda activate mungler

Test run

For the first run, we recommend a test run with a toy dataset and tiny models.

mungler --acc-id SAMN06512631

If everything went well, you should get the following message:

Starting mungler test
Creating output
PASSED

Actual workflow

Example 1.

Run mungler on a list of accession numbers. SRA, ENA, BioSample and BioProject accessions are all valid. Invalid IDs are saved to a text file.

mungler \
    --acc-list list-of-ids.txt \
    --output results.txt \
    --invdir invalid-ids.txt
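To pre-check a list of IDs before a long run, a format check along the following lines may help. This is a minimal sketch, not mungler's own validation logic: the names `classify_accession` and `split_accessions` are hypothetical, and the patterns follow the public INSDC accession formats, which may be stricter or looser than what mungler accepts.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Patterns follow the public INSDC accession formats; which of these
# mungler itself accepts is an assumption.
ACCESSION_PATTERNS = {
    "BioSample": re.compile(r"SAM[NED][A-Z]?\d+"),
    "BioProject": re.compile(r"PRJ[NED][A-Z]\d+"),
    "Run": re.compile(r"[SED]RR\d{6,}"),
    "Experiment": re.compile(r"[SED]RX\d{6,}"),
    "Sample": re.compile(r"[SED]RS\d{6,}"),
    "Study": re.compile(r"[SED]RP\d{6,}"),
}


def classify_accession(acc_id: str) -> Optional[str]:
    """Return the accession type, or None if the ID matches no known pattern."""
    cleaned = acc_id.strip()
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(cleaned):
            return kind
    return None


def split_accessions(ids: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split IDs into (valid, invalid) by format, skipping blank lines."""
    valid, invalid = [], []
    for raw in ids:
        acc = raw.strip()
        if not acc:
            continue
        (valid if classify_accession(acc) else invalid).append(acc)
    return valid, invalid


if __name__ == "__main__":
    valid, invalid = split_accessions(["SAMN06512631", "SRR000001", "", "not-an-id"])
    print("valid:", valid)
    print("invalid:", invalid)
```

This checks the format only; it does not confirm that an accession actually exists in the archive.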

Rationale

mungler takes accession numbers from SRA as input, accesses their metadata and tries to fill in the metadata fields.
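One way to picture the field-filling step is to first find which fields of a record are empty or hold only a placeholder. The sketch below is an assumption about how that could look, not mungler's actual code: the function name `find_unfilled_fields` and the example record are hypothetical, and the placeholder list combines the INSDC missing-value terms with a few common free-text variants.

```python
# Placeholder terms from the INSDC missing-value vocabulary, plus a few
# common free-text variants (the extra variants are an assumption).
PLACEHOLDER_VALUES = {
    "", "missing", "not applicable", "not collected", "not provided",
    "restricted access", "na", "n/a", "unknown",
}


def find_unfilled_fields(record: dict) -> list:
    """Return the names of fields whose value is absent or a placeholder."""
    unfilled = []
    for field, value in record.items():
        text = "" if value is None else str(value).strip().lower()
        if text in PLACEHOLDER_VALUES:
            unfilled.append(field)
    return unfilled


if __name__ == "__main__":
    # Hypothetical record, invented for illustration; not real SRA data.
    record = {
        "organism": "soil metagenome",
        "collection_date": "not collected",
        "geo_loc_name": "missing",
        "isolation_source": None,
    }
    print(find_unfilled_fields(record))
```

Fields flagged this way would be the candidates for an LLM to fill from the free-text descriptions attached to the same record.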

Before:

After:

Contributing

Guidelines on how to contribute to the project.

License

Information about the project's license.
