🌍 Large-scale Multilingual Translation (LMT)

LMT aims to advance the frontier of Multilingual Machine Translation (MMT) by building Inclusive, Scalable, and High-performance multilingual translation models.

📢 News

🤗 Open Resources

We have made the following resources available:

| Resource | Description | Link |
| --- | --- | --- |
| LMT-60-*B | Our high-performance multilingual translation models cover 60 languages and 234 directions. Available in four sizes: 0.6B / 1.7B / 4B / 8B. | LMT-60-0.6B, LMT-60-1.7B, LMT-60-4B, LMT-60-8B |
| LMT-60-*B-Base | Our continued pre-training of Qwen3 on 90B tokens serves as the foundation for large-scale translation adaptation. Available in four sizes: 0.6B / 1.7B / 4B / 8B. | LMT-60-0.6B-Base, LMT-60-1.7B-Base, LMT-60-4B-Base, LMT-60-8B-Base |
| LMT-60-sft-data | Our SFT dataset, including the FLORES-200 dev set, NTREX-128, SMol, and the WMT14–23 and IWSLT17–24 test sets, totaling 567K samples. | LMT-60-sft-data |
| FLORES-mn_cn | A new Chinese–Mongolian evaluation set annotated by native speakers to extend the FLORES-200 benchmark. | FLORES-mn_cn |
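The model checkpoints load directly with transformers (see Usage below). For the data resources, here is a minimal loading sketch with the datasets library; the repo ids are assumptions based on the names in the table above (follow the table links for the exact paths and splits):

from datasets import load_dataset

# Assumed repo ids, shown for illustration only; use the links in the table above.
sft_data = load_dataset("NiuTrans/LMT-60-sft-data")    # 567K SFT samples
flores_mn = load_dataset("NiuTrans/FLORES-mn_cn")      # Chinese–Mongolian evaluation set

print(sft_data)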

📄 Contents

Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs

Introduction

In this project, we take a step toward overcoming the prevailing English-centric bias in MMT. We introduce LMT, a suite of Chinese-English-centric MMT models trained on 90B mixed monolingual and bilingual tokens, covering 60 languages across 234 translation directions and achieving SOTA performance among models with similar language coverage. Our work makes the following main contributions:

  • We identify and analyze a previously overlooked issue, directional degeneration, in large-scale multilingual SFT with multi-way data, and propose a simple yet effective Strategic Downsampling method to mitigate it (a rough sketch follows this list).
  • We propose Parallel Multilingual Prompting (PMP), which enhances cross-lingual transfer by incorporating an auxiliary parallel sentence into the instruction (see the prompt sketch below).
  • We release LMT, a suite of large-scale Chinese–English-centric multilingual translation models in four sizes (0.6B/1.7B/4B/8B), providing strong baselines for future MMT research.
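The paper describes Strategic Downsampling in detail; the sketch below is only a rough illustration of the general idea, capping the number of samples per translation direction so that over-represented directions of a multi-way corpus do not dominate SFT. The function name, data schema, and cap value are illustrative assumptions, not the recipe used for LMT.

import random
from collections import defaultdict

def downsample_by_direction(samples, cap_per_direction=2000, seed=0):
    """Cap each (src_lang, tgt_lang) direction at `cap_per_direction` examples.

    `samples` is a list of dicts with 'src_lang', 'tgt_lang', 'src', 'tgt' keys
    (an illustrative schema, not the project's actual data format).
    """
    buckets = defaultdict(list)
    for ex in samples:
        buckets[(ex["src_lang"], ex["tgt_lang"])].append(ex)

    rng = random.Random(seed)
    kept = []
    for direction, examples in buckets.items():
        if len(examples) > cap_per_direction:
            examples = rng.sample(examples, cap_per_direction)
        kept.extend(examples)
    return kept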

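PMP attaches an auxiliary parallel sentence, typically in a high-resource language, to the translation instruction. The template below is a minimal sketch of that idea; the exact prompt format LMT was trained with is specified in the paper and in src/inference.py.

def build_pmp_prompt(src_lang, tgt_lang, src_text, aux_lang, aux_text):
    # Auxiliary parallel sentence first, then the source, then an empty target slot.
    return (
        f"Translate the following text from {src_lang} into {tgt_lang}.\n"
        f"{aux_lang}: {aux_text}\n"
        f"{src_lang}: {src_text}\n"
        f"{tgt_lang}: "
    )

prompt = build_pmp_prompt(
    "English", "Icelandic",
    "The concept came from China where plum blossoms were the flower of choice.",
    "Chinese", "这个概念源自中国，在那里梅花是首选的花卉。",  # illustrative Chinese rendering
)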
Support Languages

| Resource Tier | Languages |
| --- | --- |
| High-resource Languages (13) | Arabic(ar), English(en), Spanish(es), German(de), French(fr), Italian(it), Japanese(ja), Dutch(nl), Polish(pl), Portuguese(pt), Russian(ru), Turkish(tr), Chinese(zh) |
| Medium-resource Languages (18) | Bulgarian(bg), Bengali(bn), Czech(cs), Danish(da), Modern Greek(el), Persian(fa), Finnish(fi), Hindi(hi), Hungarian(hu), Indonesian(id), Korean(ko), Norwegian(nb), Romanian(ro), Slovak(sk), Swedish(sv), Thai(th), Ukrainian(uk), Vietnamese(vi) |
| Low-resource Languages (29) | Amharic(am), Azerbaijani(az), Tibetan(bo), Modern Hebrew(he), Croatian(hr), Armenian(hy), Icelandic(is), Javanese(jv), Georgian(ka), Kazakh(kk), Central Khmer(km), Kirghiz(ky), Lao(lo), Chinese Mongolian(mn_cn), Marathi(mr), Malay(ms), Burmese(my), Nepali(ne), Pashto(ps), Sinhala(si), Swahili(sw), Tamil(ta), Telugu(te), Tajik(tg), Tagalog(tl), Uighur(ug), Urdu(ur), Uzbek(uz), Yue Chinese(yue) |

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NiuTrans/LMT-60-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(model_name)

# Build the translation instruction: source text followed by an empty target slot.
prompt = (
    "Translate the following text from English into Chinese.\n"
    "English: The concept came from China where plum blossoms were the flower of choice.\n"
    "Chinese: "
)
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512, num_beams=5, do_sample=False)
# Keep only the newly generated tokens (drop the prompt).
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

response = tokenizer.decode(output_ids, skip_special_tokens=True)

print("response:", response)

For more details, please refer to src/inference.py.
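The tokenizer above is loaded with padding_side='left', which is what batched generation with a decoder-only model expects. Below is a minimal sketch of batched use, reusing the model and tokenizer from the snippet above; the prompts and generation settings are illustrative, and src/inference.py remains the reference implementation.

prompts = [
    "Translate the following text from English into German.\nEnglish: Good morning.\nGerman: ",
    "Translate the following text from English into Japanese.\nEnglish: Good morning.\nJapanese: ",
]
texts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}], tokenize=False, add_generation_prompt=True
    )
    for p in prompts
]
batch = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
generated = model.generate(**batch, max_new_tokens=256, num_beams=5, do_sample=False)
for row in generated:
    # With left padding, everything after the (padded) prompt length is newly generated.
    new_tokens = row[batch.input_ids.shape[1]:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))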

Reference

Email: [email protected]

If you find our work useful for your research, please kindly cite our paper:

@misc{luoyf2025lmt,
      title={Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs}, 
      author={Yingfeng Luo and Ziqiang Xu and Yuxuan Ouyang and Murun Yang and Dingyang Lin and Kaiyan Chang and Tong Zheng and Bei Li and Peinan Feng and Quan Du and Tong Xiao and Jingbo Zhu},
      year={2025},
      eprint={2511.07003},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.07003}, 
}
