MatrixCurator

An AI-powered tool to automate the extraction of morphological character data from scientific publications and generate standardized, FAIR-compliant NEXUS files for phylogenetic analysis.

Deployment | Link | Purpose
Primary App | Tailscale Funnel | Main application link.
Mirror | Streamlit App | Backup link for redundancy.

Table of Contents

  • About The Project
  • Key Features
  • How It Works
  • Getting Started
  • Configuration
  • Running with Streamlit
  • Running with Docker
  • Project Structure
  • Citation
  • Contributing
  • License
  • Acknowledgments
  • Contact

About The Project

The curation of biological and paleontological datasets, particularly morphological matrices, is a labor-intensive and error-prone process. Data is often locked away in published literature (PDFs, DOCX files) in inconsistent formats, hindering reproducibility and compliance with FAIR (Findable, Accessible, Interoperable, and Reusable) data principles.

MatrixCurator addresses this challenge by leveraging Large Language Models (LLMs) to automate the entire curation workflow. Developed for the MorphoBank repository, this tool transforms unstructured character descriptions from research papers into structured, machine-readable CHARSTATELABELS blocks within a NEXUS file.

This project aims to:

  • Accelerate research by drastically reducing manual data entry time.
  • Improve data quality by minimizing transcription errors and standardizing formats.
  • Enhance data reusability by producing complete, FAIR-compliant NEXUS files.

Key Features

  • Automated Data Extraction: Uses Google's Gemini family of LLMs to intelligently parse and extract character names and their corresponding states from text.
  • Multi-Parser Support: Robustly handles various document formats (.pdf, .docx, .txt) with multiple parsing backends, including:
    • Google's Gemini native multimodal capabilities
    • LlamaParse
    • PyMuPDF
    • python-docx
  • AI-Powered Validation: Employs a multi-agent system where an Evaluator agent scores the accuracy of the extracted data, ensuring high-quality output. The system retries with corrective prompts if the quality is below a set threshold.
  • NEXUS File Generation: Seamlessly integrates the extracted character data into an existing NEXUS file, creating or updating the CHARSTATELABELS block.
  • Web-Based Interface: Built with Streamlit for an intuitive, user-friendly experience that requires no coding to use.
  • Efficient & Cost-Effective: Utilizes LLM context caching to reduce API token consumption by over 90% for large documents, making the process both fast and affordable.
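
As an illustration of the caching pattern, here is a minimal sketch using the google-genai SDK; the model name, TTL, file path, and prompt are illustrative assumptions rather than MatrixCurator's exact calls.

    # Sketch of explicit context caching with the google-genai SDK.
    # Model name, TTL, and file path are illustrative, not the app's exact values.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="your-gemini-api-key")

    # Upload the article once and cache it server-side for a limited time.
    document = client.files.upload(file="parsed_article.pdf")
    cache = client.caches.create(
        model="gemini-2.0-flash-001",
        config=types.CreateCachedContentConfig(
            contents=[document],
            system_instruction="You are a helpful and precise research assistant.",
            ttl="600s",
        ),
    )

    # Each per-character request references the cache instead of resending the
    # document, so only the short prompt counts as new input tokens.
    response = client.models.generate_content(
        model="gemini-2.0-flash-001",
        contents="Extract the character description and states for character index: 12",
        config=types.GenerateContentConfig(cached_content=cache.name),
    )
    print(response.text)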

How It Works

The MatrixCurator pipeline is a multi-step process designed for accuracy and efficiency:

  1. User Input: The user uploads a research article (PDF/DOCX), a base NEXUS file (typically containing only the TAXA and MATRIX blocks), and specifies parameters like the total number of characters and the relevant page range in the article.

  2. Document Parsing: The selected pages of the article are isolated and parsed into a machine-readable format (Markdown or raw text) using the chosen parsing engine.

  3. AI Core - Multi-Agent Extraction & Evaluation:

    • Retriever Agent: For each character number, this agent is prompted to read the parsed document and extract the character's name and its list of states as a JSON object.
    • Evaluator Agent: The extracted data is passed to this agent, which compares it against the source text to assign an accuracy score (1-10).
    • Self-Correction Loop: If the score is below a threshold (e.g., 8), the extraction is retried with a corrective prompt that includes the previous incorrect results, ensuring high-fidelity output (a sketch of this loop follows the list).
  4. NEXUS File Update: The structured JSON data is converted into the CHARSTATELABELS format and inserted into the correct position within the user's original NEXUS file.

  5. Output: The final, complete NEXUS file is made available for download.
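
Steps 2 through 4 can be pictured with the short sketch below. It assumes PyMuPDF for page isolation; extract_character and evaluate_extraction are hypothetical placeholders for the Gemini-backed Retriever and Evaluator agents (the real implementation lives in src/llm/services.py).

    # Illustrative sketch of the parse -> extract -> evaluate -> retry loop.
    import json
    import fitz  # PyMuPDF

    SCORE_THRESHOLD = 8   # minimum Evaluator score to accept an extraction
    MAX_RETRIES = 3       # corrective attempts per character

    def extract_character(source_text: str, index: int, previous_attempts: list) -> str:
        """Retriever agent: prompt the LLM for one character. Placeholder only."""
        raise NotImplementedError

    def evaluate_extraction(source_text: str, index: int, candidate: dict) -> int:
        """Evaluator agent: score the candidate 1-10 against the source. Placeholder only."""
        raise NotImplementedError

    def parse_pages(pdf_path: str, first_page: int, last_page: int) -> str:
        """Extract raw text from the selected page range (1-based, inclusive)."""
        with fitz.open(pdf_path) as doc:
            return "\n".join(doc.load_page(i).get_text()
                             for i in range(first_page - 1, last_page))

    def curate(pdf_path: str, first_page: int, last_page: int, n_characters: int) -> list[dict]:
        source_text = parse_pages(pdf_path, first_page, last_page)
        characters = []
        for index in range(1, n_characters + 1):
            previous_attempts = []
            for _ in range(MAX_RETRIES):
                raw = extract_character(source_text, index, previous_attempts)
                candidate = json.loads(raw)  # expected: {"name": "...", "states": ["...", ...]}
                score = evaluate_extraction(source_text, index, candidate)
                if score >= SCORE_THRESHOLD:
                    break
                previous_attempts.append(candidate)  # fed back via the corrective prompt
            characters.append(candidate)
        return characters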

Getting Started

Follow these instructions to set up and run the MatrixCurator project locally.

Prerequisites

  • Python 3 and pip (a virtual environment is recommended).
  • API keys for the external services used by the app: Google Gemini, LlamaCloud (LlamaParse), and Langfuse (see Configuration).

Installation

  1. Clone the repository:
    git clone https://github.com/tair/matrixcurator.git
    cd matrixcurator
  2. Install Python dependencies: It is recommended to use a virtual environment.
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    pip install -r requirements.txt
  3. Configure API Keys: See the Configuration section below to set up your API keys.

Configuration

MatrixCurator requires API keys to interact with external LLM and parsing services. You can provide these keys via a .streamlit/secrets.toml file.
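
Inside the application, values defined in that file are exposed through Streamlit's built-in st.secrets mapping; a minimal example (not the app's exact code):

    import streamlit as st

    gemini_api_key = st.secrets["GEMINI_API_KEY"]  # raises an error if the key is absent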

  1. Copy the template:

    cp .streamlit/secrets_template.toml .streamlit/secrets.toml
  2. Edit the secrets.toml file and add your keys:

    # .streamlit/secrets.toml
    
    # Required for accessing Google's Gemini family of models.
    # Obtain from Google AI Studio (https://aistudio.google.com/app/apikey)
    GEMINI_API_KEY="your-gemini-api-key"
    
    # Required for accessing LlamaParse.
    # Obtain from your LlamaCloud account dashboard.
    LLAMACLOUD_API_KEY="your-llamacloud-api-key"
    
    # Required for fetching the LLM prompts managed in Langfuse.
    # Obtain from your Langfuse project settings.
    LANGFUSE_PUBLIC_KEY="pk-lf-..."
    LANGFUSE_SECRET_KEY="sk-lf-..."
    LANGFUSE_HOST="https://cloud.langfuse.com" # or your self-hosted instance
    
    # Optional: For error tracking with Sentry.
    SENTRY_DSN=""
  3. Set Up Prompts in Langfuse:

The application dynamically fetches prompts from your Langfuse project. You must create three specific prompts in the Langfuse UI.

Log into your Langfuse project, navigate to the Prompts section, and create the following three prompts.

Important: The application fetches prompts by their unique name. You must use the exact names specified below (system_prompt, extraction_prompt, and evaluation_prompt).

A. Prompt Name: system_prompt

You are a helpful and precise research assistant. Focus on extracting the requested character descriptions and corresponding states accurately from the provided text.

B. Prompt Name: extraction_prompt

Here is a section of text from a phylogenetic research paper. Please extract the character descriptions and their corresponding states for character index: {character_index}

Previous attempts to extract information for this character index have yielded these incorrect results:

C. Prompt Name: evaluation_prompt

Evaluate the generated answer based on the previously provided section of a phylogenetic research paper and the following user query.

User Query: {extraction_prompt}
Generated Answer: {extraction_reponse}

Scoring Criteria:
- 1-3: The generated answer is not relevant to the user query.
- 4-6: The generated answer is relevant to the query but contains mistakes. A score of 4 indicates more significant errors, while 6 indicates minor errors.
- 7-10: The generated answer is relevant and fully correct, accurately extracting the complete character description and all corresponding states for the requested character index. A score of 7 indicates an acceptable answer, while 10 indicates a perfect extraction.
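
At runtime the application fetches these prompts by name and fills in the placeholders. A minimal sketch with the Langfuse Python SDK, assuming the keys from secrets.toml and plain Python str.format substitution (whether the app uses format() or Langfuse's compile() is an assumption; the character index value is illustrative):

    from langfuse import Langfuse

    langfuse = Langfuse(
        public_key="pk-lf-...",
        secret_key="sk-lf-...",
        host="https://cloud.langfuse.com",
    )

    # Fetch the raw prompt texts created in the Langfuse UI.
    system_prompt = langfuse.get_prompt("system_prompt").prompt
    extraction_template = langfuse.get_prompt("extraction_prompt").prompt

    # Fill in the placeholder for one character (index 12 is illustrative).
    extraction_prompt = extraction_template.format(character_index=12)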

Running with Streamlit

Once installed and configured, you can run the web application locally.

streamlit run src/streamlit_app.py

Navigate to http://localhost:8501 in your web browser. From there, you can:

  1. Upload your research article (.pdf, .docx).
  2. Select the document parsing method.
  3. Enter the total number of characters and the page range where they are described.
  4. Choose the LLMs for extraction and evaluation.
  5. Upload the base NEXUS file to be updated.
  6. Click "Generate Updated NEXUS File" and download the result.

Running with Docker

You can run MatrixCurator using a pre-built image or by building it from source.

Note: Both docker run commands require you to mount your secrets file from .streamlit/secrets.toml.

Using the Pre-built Image

  1. Pull the image:
    docker pull ghcr.io/morphobankorg/matrixcurator:latest
  2. Run the container:
    docker run -p 8501:80 -v "$(pwd)/.streamlit/secrets.toml:/app/.streamlit/secrets.toml" ghcr.io/morphobankorg/matrixcurator:latest

Building from Source

  1. Build the image:
    docker build -t matrixcurator .
  2. Run the container:
    docker run -p 8501:80 -v "$(pwd)/.streamlit/secrets.toml:/app/.streamlit/secrets.toml" matrixcurator

Once started, the application is available at http://localhost:8501.

Project Structure

The project is organized into modular components for clarity and maintainability.

morphobankorg-matrixcurator/
├── src/
│   ├── streamlit_app.py        # Main Streamlit application UI and entry point
│   ├── llm/                    # Handles all LLM interactions
│   │   ├── services.py         # High-level service for extraction/evaluation cycle
│   │   └── external_service.py # Direct client for the Gemini API
│   ├── parser/                 # Manages document parsing from different formats
│   │   ├── services.py         # Main service to orchestrate different parsers
│   │   └── external_services.py # Client for LlamaParse
│   ├── nex/                    # Logic for reading and updating NEXUS files
│   │   └── services.py         # Service to build and insert CHARSTATELABELS
│   ├── config.py               # Model configurations and defaults
│   └── utils.py                # General utility functions
├── .streamlit/
│   └── secrets_template.toml   # Template for API keys
├── requirements.txt            # Python dependencies
├── Dockerfile                  # For building the Docker container
└── README.md                   # This file
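
To make the role of nex/services.py concrete, the sketch below builds a CHARSTATELABELS block from extracted characters and inserts it ahead of the MATRIX command; the function names and the regex-based insertion point are illustrative assumptions, not the module's actual API.

    import re

    def build_charstatelabels(characters: list[dict]) -> str:
        """characters: [{"name": "Skull shape", "states": ["rounded", "elongate"]}, ...]"""
        lines = ["CHARSTATELABELS"]
        for i, char in enumerate(characters, start=1):
            states = " ".join(f"'{s}'" for s in char["states"])
            terminator = "," if i < len(characters) else ";"
            lines.append(f"    {i} '{char['name']}' / {states}{terminator}")
        return "\n".join(lines)

    def insert_charstatelabels(nexus_text: str, characters: list[dict]) -> str:
        block = build_charstatelabels(characters)
        # Place the block immediately before the MATRIX command of the CHARACTERS block.
        return re.sub(r"(?im)^(\s*MATRIX\b)",
                      lambda m: block + "\n" + m.group(1),
                      nexus_text, count=1)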

Citation

If you use MatrixCurator or its underlying methodology in your research, please cite the following paper:

Jariwala, S., Long-Fox, B. L., & Berardini, T. Z. (2025). Advancing FAIR Data Management through AI-Assisted Curation of Morphological Data Matrices. (Journal and full citation details to be updated upon publication).

BibTeX:

@article{Jariwala2025MatrixCurator,
  title   = {Advancing FAIR Data Management through AI-Assisted Curation of Morphological Data Matrices},
  author  = {Jariwala, Shreya and Long-Fox, Brooke L. and Berardini, Tanya Z.},
  year    = {2025},
  journal = {To Be Determined}
}

Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is distributed under the GNU GPL v3 License. See LICENSE for more information.

Acknowledgments

  • This work was supported by Phoenix Bioinformatics and the US National Science Foundation (NSF-DBI-2049965 and NSF-EAR-2148768).
  • We thank Dr. Maureen A. O’Leary for her ongoing support.
  • We acknowledge the use of Google's Gemini models, which were instrumental in the development of this tool.

Contact

Project Link: https://github.com/tair/matrixcurator
