
 中文  | English 



FLM-Audio

🤗 Hugging Face   |   🤖 ModelScope   |    📑 Paper    |   🖥️ Demo

FLM-Audio is an audio-language sub-version of RoboEgo/FLM-Ego -- an omnimodal model with native full duplexity. It simultaneously listens, speaks, and composes internal monologue, delivering low‑latency, duplex conversational responses in both English and Chinese. FLM‑Audio is robust to noise and user interruptions, prioritizing responsiveness and naturalness.

Model Card

  • Language(s): Chinese; English;

Technical Report

Motivation & Survey: Toward Embodied AGI: A Review of Embodied AI and the Road Ahead

FLM-Audio Research Paper: FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training

Omnimodal System Card: RoboEgo System Card: An Omnimodal Model with Native Full Duplexity

Online Demo

We provide a simple interactive demo: flm-audio-Demo (with a Backup Demo). The current model already supports streaming input and output: users can interact with FLM-Audio through direct voice conversations, and the model returns streaming speech responses.

Notes:

  • We recommend asking history-related questions, such as “介绍下中国的历史 (Introduce the history of China)” or “介绍下唐太宗李世民 (Tell me about Emperor Taizong of Tang, Li Shimin.)”
  • Server resources are currently limited. If you see the message “Too many concurrent connections. Please try again later!”, please retry after a short wait.
  • Since streaming speech transmission requires low latency, you may get a better experience in a stable and high-quality network environment.

Bias, Risks, and Limitations

Despite extensive data cleaning, FLM‑Audio may still produce undesired content (e.g., biased or offensive language). Users should not disseminate unsafe outputs. Project authors are not responsible for misuse or harmful consequences.

Quick Start

Recommended: Run the Server via Docker (Production/Deployment)

We recommend using the official Docker images published under the cofe-ai organization on GitHub Container Registry:

ghcr.io/cofe-ai/flm-audio

Image variants:

  • ghcr.io/cofe-ai/flm-audio:server-1.0.0-model-v202507: includes the pre‑downloaded model (ideal for offline or fast-startup environments).
  • ghcr.io/cofe-ai/flm-audio:server-1.0.0: downloads the model from Hugging Face at runtime (requires internet access). Either tag can be pre-pulled, as shown below.
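
You can pre-pull whichever tag you plan to use before starting the container:

# pre-pull the image with the pre-downloaded model (no network needed at container startup)
docker pull ghcr.io/cofe-ai/flm-audio:server-1.0.0-model-v202507

# or pre-pull the lighter image (the model is downloaded from Hugging Face on first startup)
docker pull ghcr.io/cofe-ai/flm-audio:server-1.0.0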

Example startup commands:

# Using the image with pre-downloaded model (recommended for offline/fast startup)
docker run -dit --gpus '"device=1"' -p 8990:8990 --restart always --name flm-audio-server ghcr.io/cofe-ai/flm-audio:server-1.0.0-model-v202507

# Or: using the image that downloads the model at runtime (requires network access)
docker run -dit --gpus '"device=1"' -p 8990:8990 --restart always --name flm-audio-server ghcr.io/cofe-ai/flm-audio:server-1.0.0

Notes / Tips:

  • --gpus '"device=1"': the example binds GPU device 1. Adjust as needed (e.g., --gpus all or --gpus '"device=0,1"').
  • Port 8990 is the default server port; adjust host mapping if necessary with -p HOST_PORT:8990.
  • For private images, you may need to docker login ghcr.io with a GitHub personal access token (PAT); see the example after this list.
  • When using the non-preloaded image (server-1.0.0), the container will download the model on first startup. Download time depends on your network.
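
A minimal login sketch for private images; USERNAME and the GHCR_PAT environment variable are placeholders for your own GitHub account and token (with the read:packages scope):

# log in to GitHub Container Registry using a personal access token from an environment variable
echo "$GHCR_PAT" | docker login ghcr.io -u USERNAME --password-stdin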

Please note: on first container startup, this model will perform a short compilation phase to speed up inference (approximately 2 minutes depending on server performance). The server is fully started when you see logs similar to:

[Info] model loaded
[Info] warming up the model
[Info] Access the API directly at http://0.0.0.0:8990/api/chat
======== Running on http://0.0.0.0:8990 ========
(Press CTRL+C to quit)
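
If you started the container in detached mode as in the commands above, you can follow its logs to wait for these lines; flm-audio-server below is the container name used in the startup examples:

# follow the container logs until the "Running on http://0.0.0.0:8990" line appears
docker logs -f flm-audio-server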

Local Development (Optional)

# install dependencies
pip install -r requirements-server.txt
python -m flmaudio.server --port 8990
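
Before starting the server locally, you may want to confirm that a CUDA-capable GPU is visible from your Python environment. This is a minimal, optional check that assumes PyTorch is installed by the requirements file above:

# quick sanity check that a CUDA GPU is visible (assumes torch is among the installed dependencies)
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"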

Start the Web UI

# install dependencies
pip install -r requirements-clientgui.txt
python -m flmaudio.client_gradio --url http://localhost:8990

Then you can open http://localhost:50000 in your browser to try the demo.

Start the CLI

# install dependencies
pip install -r requirements-clientcli.txt
python -m flmaudio.client --url http://localhost:8990

Notes / Tips:

  • For both the Web UI and CLI, replace the --url value with your server’s IP and port, and ensure any firewalls allow access; see the examples after this list.
  • For Web UI debugging, because of Gradio and browser security restrictions, it is recommended to run the client on the same machine as the browser so that the UI can be opened via localhost.
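
For example, if the server runs on a remote host (the address 192.168.1.10 below is a placeholder), point the clients at it directly; if the Web UI itself also runs on that remote host, you can tunnel its port over SSH so your local browser can still use localhost:

# point the CLI (or Web UI) client at a remote server; replace the IP with your server's address
python -m flmaudio.client --url http://192.168.1.10:8990

# if client_gradio runs on the remote host, forward the Web UI port so the browser can open http://localhost:50000
ssh -L 50000:localhost:50000 user@192.168.1.10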

Recommended Environment

  • OS: Linux (preferred for production).
  • GPU: NVIDIA GPU with at least 20 GB VRAM recommended for larger models and stable inference.
  • Software: Docker, NVIDIA Container Toolkit (nvidia-docker/--gpus support), and an appropriate NVIDIA driver; a quick verification command follows this list.
  • Storage: Ensure sufficient disk space for models and logs (model files can require ~16GB).
  • Network: Required only if using the image without pre-downloaded model; the preloaded image does not need internet to start.
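
To check that Docker can reach the GPU through the NVIDIA Container Toolkit, a common verification is to run nvidia-smi inside a CUDA base container (the image tag below is only an example; any CUDA base image works):

# verify the NVIDIA runtime works inside Docker
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi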

FAQ (Brief)

  • Which image should I choose?

    • If your server can access the internet and you don’t mind first-run download: use server-1.0.0.
    • If your server cannot access the internet or you prefer fast startup: use server-1.0.0-model-v202507.
  • How to specify different GPUs?

    • Adjust the --gpus parameter, e.g., --gpus '"device=0"' or --gpus all, depending on your host configuration.

Acknowledgements

This work is supported by the National Science and Technology Major Project (No. 2022ZD0116314).

Citation

If you find our work helpful, please consider citing the following papers.

@article{flm-audio,
  title={Flm-audio: Natural monologues improves native full-duplex chatbots via dual training},
  author={Yao, Yiqun and Li, Xiang and Jiang, Xin and Fang, Xuezhi and Yu, Naitong and Ma, Wenjia and Sun, Aixin and Wang, Yequan},
  journal={arXiv preprint arXiv:2509.02521},
  year={2025}
}
@article{embodied-agi,
  title={Toward embodied agi: A review of embodied ai and the road ahead},
  author={Wang, Yequan and Sun, Aixin},
  journal={arXiv preprint arXiv:2505.14235},
  year={2025}
}
@article{roboego,
  title={RoboEgo System Card: An Omnimodal Model with Native Full Duplexity},
  author={Yao, Yiqun and Li, Xiang and Jiang, Xin and Fang, Xuezhi and Yu, Naitong and Sun, Aixin and Wang, Yequan},
  journal={arXiv preprint arXiv:2506.01934},
  year={2025}
}

License

FLM-Audio is licensed under the Apache License 2.0, except for python code under third_party/moshi, which is licensed under the MIT License. The default voice timbre copyright for FLM-Audio is retained by the original voice owner. This project is intended for research use only in compliance with applicable laws. For commercial use, please contact us.
