CLaSP (Contrastive Language-Structure Pre-training) is a multimodal learning framework that bridges crystal structures and text descriptions from scientific literature.
This repository contains the official implementation of the paper "Bridging text and crystal structures: literature-driven contrastive learning for materials science" (Y. Suzuki, T. Taniai, R. Igarashi et al., 2025, Mach. Learn.: Sci. Technol. 6, 035006).
CLaSP enables:
- Text-based retrieval of crystal structures
- Zero-shot classification of materials based on their properties
- Serving as a foundation model for multimodal tasks bridging crystal structures and text (similar to CLIP)
*The contrastive learning paradigm of CLaSP proceeds in two stages: (1) pre-training on pairs of crystal structures and publication titles, and (2) fine-tuning on pairs of crystal structures and keywords generated from the titles and abstracts by an LLM. (Figure from our paper.)*
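For intuition, here is a minimal sketch of a CLIP-style symmetric contrastive objective of the kind this training uses. It is illustrative only; the function name, temperature handling, and tensor shapes are our assumptions, not the repository's exact implementation:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(crystal_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of (crystal, text) pairs."""
    # L2-normalize both modalities so the dot product is a cosine similarity.
    crystal_emb = F.normalize(crystal_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix; entry (i, j) compares crystal i with text j.
    logits = crystal_emb @ text_emb.t() / temperature
    # Matched pairs sit on the diagonal of the batch.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the crystal-to-text and text-to-crystal cross-entropies.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```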
```bash
# Build Docker image from repository root
docker build -t clasp:v1.0 -f docker/Dockerfile .
```

Requirements:
- Python 3.8+
- PyTorch 2.2+
- CUDA 12.1+ (for GPU support)
- See `docker/Dockerfile` for the complete dependency list
We release both the model checkpoint and the keyword caption dataset used in our experiments.
ℹ️ For full details, see the v1.0.0 release notes.
| Asset | Description | Download |
|---|---|---|
| model checkpoint | Checkpoint fine-tuned with COD structures and text captions | model_finetuned_s30_m05.ckpt |
| keyword captions | JSON keyword caption data used for fine-tuning | keyword_captions_cod_full_20240331.zip |
➡️ Download the checkpoint above and place it under model_weight/ (or any path of your choice) before running the quick start example.
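Optionally, you can sanity-check the downloaded checkpoint before running anything else. A minimal sketch, assuming a Lightning-style checkpoint (which the `.ckpt` extension suggests; the exact contents may differ):

```python
import torch

# weights_only=False is needed because Lightning-style .ckpt files contain
# more than bare tensors (an assumption about the checkpoint format).
ckpt = torch.load("model_weight/model_finetuned_s30_m05.ckpt",
                  map_location="cpu", weights_only=False)
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g. state_dict, hyper_parameters, ...
```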
```bash
# Extract embeddings using the pretrained model
docker run --gpus 1 --rm \
    -v $(pwd):/workspace \
    -w /workspace \
    clasp:v1.0 python examples/extract_embeddings.py \
    --checkpoint_path /workspace/model_weight/model_finetuned_s30_m05.ckpt \
    --cif_list /workspace/demo_data/cif_list.txt \
    --output_path /workspace/demo_data/embeddings.npz \
    --batch_size 32
```

We provide a Jupyter notebook, `examples/embedding_visualization.ipynb`, to reproduce the visualization figures from the paper.
The notebook demonstrates how to:
- Visualize crystal/text embeddings (e.g., t-SNE plots)
- Explore similarities with text queries
- Perform clustering and generate world-map style overviews
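Outside the notebook, the saved embeddings can also be inspected directly. A minimal sketch, assuming `embeddings.npz` stores the structure embeddings under a key such as `"embeddings"` (the actual key names may differ; check the output of `extract_embeddings.py`):

```python
import numpy as np

data = np.load("demo_data/embeddings.npz")
print(data.files)  # list the stored arrays to confirm the actual key names

emb = data["embeddings"]  # hypothetical key: shape (n_structures, dim)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Rank all structures by cosine similarity to the first one.
sims = emb @ emb[0]
print(np.argsort(-sims)[:5])  # indices of the five most similar structures
```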
```bash
# Download the COD metadata
cd clasp/preprocess
python download_cod_metadata.py cod_metadata_YYYYMMDD.csv
```

```bash
# Mirror the COD CIF files
mkdir -p COD
rsync -av --delete rsync://www.crystallography.net/cif/ COD/
```

Training configurations are managed with Hydra; the key parameters are in `configs/training.yaml`. Please modify the dataset path there to point at your local COD copy.
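One quick way to see which keys need editing is to load the config with OmegaConf (installed alongside Hydra). A sketch, where `dataset_path` is a hypothetical key, so check the YAML for the real names:

```python
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/training.yaml")
print(OmegaConf.to_yaml(cfg))  # inspect the available keys

# Hypothetical override: point the training data at the local COD mirror,
# then write the edited config back.
cfg.dataset_path = "/path/to/COD"
OmegaConf.save(cfg, "configs/training.yaml")
```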
```bash
# Pre-training
cd clasp
python train_pretraining.py --config-name training
```

```bash
# Fine-tuning (resumes from a pre-trained checkpoint)
python train_finetuning.py --config-name finetuning \
    finetuning_caption_json_path=path/to/keywords.json \
    resume_ckpt_path=path/to/pretrained.ckpt
```

```bash
# Zero-shot evaluation
cd clasp/eval_scripts
python eval_zero_shot_roc.py \
    --config_path ../configs/training.yaml \
    --checkpoint_path path/to/checkpoint.ckpt
```
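Conceptually, zero-shot evaluation of this kind scores each structure by its similarity to a property prompt and measures how well that score separates positive from negative structures. A minimal sketch of the idea with scikit-learn; the random data, prompt, and labels are stand-ins, not the script's actual interface:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-ins for real outputs: normalized structure embeddings and the
# embedding of a property prompt such as "superconductor".
structure_emb = rng.normal(size=(100, 512))
structure_emb /= np.linalg.norm(structure_emb, axis=1, keepdims=True)
prompt_emb = rng.normal(size=512)
prompt_emb /= np.linalg.norm(prompt_emb)

labels = rng.integers(0, 2, size=100)  # ground-truth property labels

# Cosine similarity to the prompt serves as the zero-shot score;
# ROC-AUC measures how well it separates positives from negatives.
scores = structure_emb @ prompt_emb
print("ROC-AUC:", roc_auc_score(labels, scores))
```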
Run all unit tests:

```bash
# Using Docker
docker run --rm -v $(pwd):/workspace -v /path/to/cod:/cod:ro -w /workspace clasp:v1.0 bash run_tests.sh

# Or run individual test files
docker run --rm -v $(pwd):/workspace -v /path/to/cod:/cod:ro -w /workspace clasp:v1.0 python tests/test_dataloaders.py
```

If you use CLaSP in your research, please cite:
```bibtex
@article{suzuki2025contrastivelanguagestructurepretrainingdriven,
  doi = {10.1088/2632-2153/ade58c},
  url = {https://dx.doi.org/10.1088/2632-2153/ade58c},
  year = {2025},
  month = {jul},
  publisher = {IOP Publishing},
  volume = {6},
  number = {3},
  pages = {035006},
  author = {Suzuki, Yuta and Taniai, Tatsunori and Igarashi, Ryo and Saito, Kotaro and Chiba, Naoya and Ushiku, Yoshitaka and Ono, Kanta},
  title = {Bridging text and crystal structures: literature-driven contrastive learning for materials science},
  journal = {Machine Learning: Science and Technology},
}
```

Out of memory errors:
- Reduce `batch_size` in the configuration
- Enable gradient accumulation (see the sketch below)
- Use mixed precision training (already enabled by default)
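For the gradient accumulation point, the generic PyTorch pattern looks like this. It is a toy, self-contained sketch; how it hooks into the actual training scripts depends on the repository's trainer setup:

```python
import torch
from torch import nn

# Toy setup to illustrate the pattern; the real model and dataloader
# come from the training scripts.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

accum_steps = 4  # effective batch size = 8 * 4 = 32
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    # Scale the loss so gradients average over the accumulation window.
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```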
We welcome contributions! Please:
- Fork the repository
- Create a feature branch
- Make your changes and add tests
- Ensure all tests pass
- Submit a pull request
This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.
For questions or issues, please open an issue on GitHub or contact the authors through the paper correspondence.
Copyright © 2025 Toyota Motor Corporation.
Copyright © 2025 OMRON SINIC X Corporation.
Copyright © 2025 Randeft, Inc.
Copyright © 2025 The University of Osaka.
All Rights Reserved.