Releases: oobabooga/text-generation-webui
v3.18
Changes
- Add the `--cpu-moe` flag for llama.cpp to move MoE model experts to CPU, reducing VRAM usage.
- Add ROCm portable builds for AMD GPUs on Linux. This was made possible by PR oobabooga/llama-cpp-binaries#7. Thanks, @ShortTimeNoSee.
- Remove deprecated macOS 13 wheels (no longer supported by GitHub Actions).
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/10e9780154365b191fb43ca4830659ef12def80f
- Update ExLlamaV3 to 0.0.15
- Update peft to 0.18.*
- Update triton-windows to 3.5.1.post21
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4`.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.17
Changes
- Add `weights_only=True` to `torch.load` in Training_PRO for better security (a minimal sketch of this pattern follows).
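As a hedged illustration of that change (not the actual Training_PRO code; the checkpoint path below is made up), the safer loading pattern looks like this:

```python
import torch

# weights_only=True restricts unpickling to plain tensors and standard
# containers, so a malicious checkpoint cannot run arbitrary code on load.
# The path is a hypothetical example.
state_dict = torch.load(
    "user_data/training/checkpoints/adapter_model.bin",
    map_location="cpu",
    weights_only=True,  # raises an error if the file holds other pickled objects
)
```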
Bug fixes
- Pin huggingface-hub to 0.36.0 to fix manual venv installs.
- fix: Rename 'evaluation_strategy' to 'eval_strategy' in training. Thanks, @inyourface34456.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/230d1169e5bfe04a013b2e20f4662ee56c2454b0 (adds Qwen3-VL support)
- Update exllamav3 to 0.0.12
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4`.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.16
Changes
Bug fixes
- Fixed Python requirements for Apple devices with macOS Tahoe (#7273). Thanks, @drieschel.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/d0660f237a5c31771a3d6d1030ebe3e0c409ba92 (adds Ling-mini-2.0, Ring-mini-2.0 support)
- Update exllamav3 to 0.0.11
- Update triton-windows to 3.5.0.post21
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.15
Changes
- Log an error when a llama-server request exceeds the context size (#7263). Thanks, @mamei16.
- Make `--trust-remote-code` immutable from the UI/API for better security.
Bug fixes
- Fix metadata leaking into branched chats.
- Fix "continue" missing an initial space in chat-instruct/chat modes.
- Fix resuming incomplete downloads after HF moved to Xet.
- Revert exllamav3_hf changes in v3.14 that made it output gibberish.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/f9fb33f2630b4b4ba9081ce9c0c921f8cd8ba4eb.
- Update exllamav3 to 0.0.10.
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.14
Changes
- Better handle multi-GPU setups when using Transformers with bitsandbytes (`load-in-8bit` and `load-in-4bit`); a minimal loading sketch follows this list.
- Implement the `/v1/internal/logits` endpoint for the `exllamav3` and `exllamav3_hf` loaders; a hedged request example also follows this list.
- Make profile picture uploading safer.
- Add `fla` to the requirements for ExLlamav3 to support `qwen3-next` models.
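Two hedged sketches of what these changes touch. First, the Transformers + bitsandbytes loading path that the multi-GPU handling applies to; the model name and the 4-bit choice are illustrative, not taken from the release notes:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical example model
bnb_config = BitsAndBytesConfig(load_in_4bit=True)  # corresponds to the load-in-4bit option

# device_map="auto" lets accelerate shard the quantized layers across all
# visible GPUs, which is the multi-GPU case this release handles better.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Second, a sketch of calling the new `/v1/internal/logits` endpoint. The payload fields and the default port 5000 are assumptions based on the project's OpenAI-compatible API; check the API documentation for the exact schema:

```python
import requests

response = requests.post(
    "http://127.0.0.1:5000/v1/internal/logits",
    json={
        "prompt": "The capital of France is",  # text to score
        "use_samplers": False,  # assumed flag: return raw logits rather than sampler-adjusted probabilities
    },
    timeout=60,
)
print(response.json())  # expected: the most likely next tokens with their scores
```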
Bug fixes
- Fix an issue with loading certain chat histories in Instruct mode. Thanks, @Remowylliams.
- Fix portable builds for macOS x86 missing llama.cpp binaries (#7238). Thanks, @IonoclastBrigham.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/d00cbea63c671cd85a57adaa50abf60b3b87d86f.
- Update transformers to 4.57.
- Update exllamav3 to 0.0.7.
- Update bitsandbytes to 0.48.
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.13
Bug fixes
- Don't use `$ $` for LaTeX, only `$$ $$`, to avoid broken rendering of text like "apples cost $1, oranges cost $2"
- Fix exllamav3 ignoring the stop button
- Fix a transformers issue when using `--bf16` and Flash Attention 2 (#7217). Thanks, @stevenxdavis. (The combination involved is sketched after this list.)
- Fix x86_64 macOS portable builds containing arm64 files
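For context on that `--bf16` + Flash Attention 2 fix, this is roughly the Transformers-level combination the flags map to. It is a sketch, not the webui's loading code, and the model name is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM

# bfloat16 weights plus the FlashAttention-2 kernel: the pairing the --bf16 fix concerns.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",       # hypothetical example model
    torch_dtype=torch.bfloat16,               # what --bf16 requests
    attn_implementation="flash_attention_2",  # requires the flash-attn package and a supported GPU
)
```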
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/7f766929ca8e8e01dcceb1c526ee584f7e5e1408
- Update transformers to 4.56
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.12
Changes
- Characters can now think in `chat-instruct` mode! This was possible thanks to many simplifications and improvements to jinja2 template handling.
- Add support for the Seed-OSS-36B-Instruct template.
- Better handle the growth of the chat input textarea (before/after screenshots)
- Make the `--model` flag work with absolute paths for GGUF models, like `--model /tmp/gemma-3-270m-it-IQ4_NL.gguf`
- Make venv portable installs work with Python 3.13
- Optimize LaTeX rendering during streaming for long replies
- Give streaming instruct messages more vertical space
- Preload the instruct and chat fonts for smoother startup
- Improve right sidebar borders in light mode
- Remove the `--flash-attn` flag (it's always on now in llama.cpp)
- Suppress "Attempted to select a non-interactive or hidden tab" console warnings, reducing the UI CPU usage during streaming
- Statically link MSVC runtime to remove the Visual C++ Redistributable dependency on Windows for the llama.cpp binaries
- Make the llama.cpp terminal output with `--verbose` less verbose
Bug fixes
- llama.cpp: Fix stderr deadlock while loading some models
- llama.cpp: Fix obtaining the maximum sequence length for GPT-OSS
- Fix the UI failing to launch if the Notebook prompt is too long
- Fix LaTeX rendering for equations with asterisks
- Fix italic and quote colors in headings
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/9961d244f2df6baf40af2f1ddc0927f8d91578c8
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.11
Changes
- Add the Tensor Parallelism option to the ExLlamav3/ExLlamav3_HF loaders through the `--enable-tp` and `--tp-backend` options.
- Set multimodal status during model loading instead of checking every generation (#7199). Thanks, @altoiddealer.
- Improve the multimodal API examples slightly.
Bug fixes
- Make web search functional again
- mtmd: Fix a bug when "include past attachments" is unchecked
- Fix code blocks having an extra empty line in the UI
Backend updates
- Update llama.cpp to ggml-org/llama.cpp@6d7f111
- Update ExLlamaV3 to 0.0.6
- Update flash-attention to 2.8.3
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.10 - Multimodal support!
See the Multimodal Tutorial
Changes
- Add multimodal support to the UI and API (a hedged request sketch follows this list)
- With the llama.cpp loader (#7027). This was possible thanks to PR ggml-org/llama.cpp#15108 to llama.cpp. Thanks @65a.
- With ExLlamaV3 through a new ExLlamaV3 loader (#7174). Thanks @Katehuuh.
- Add speculative decoding to the new ExLlamaV3 loader.
- Use ExLlamav3 instead of ExLlamav3_HF by default for EXL3 models, since it supports multimodal and speculative decoding.
- Support loading chat templates from `chat_template.json` files (EXL3/EXL2/Transformers models)
- Default max_tokens to 512 in the API instead of 16
- Better organize the right sidebar in the UI
- llama.cpp: Pass `--swa-full` to llama-server when `streaming-llm` is checked to make it work for models with SWA.
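A hedged sketch of what a multimodal API request can look like after this release. The OpenAI-style `image_url` content part, the base64 data URL, and port 5000 are assumptions based on the project's OpenAI-compatible API; consult the Multimodal Tutorial for the exact format:

```python
import base64
import requests

# Encode a local image as a data URL (the path is a hypothetical example).
with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        "max_tokens": 512,  # the new default mentioned above, set explicitly here
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```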
Bug fixes
- Fix getting the ctx-size for newer EXL3/EXL2/Transformers models
- Fix the exllamav2 loader ignoring add_bos_token
- Fix the color of italic text in chat messages
- Fix edit window and buttons in Messenger theme (#7100). Thanks @mykeehu.
Backend updates
- Bump llama.cpp to ggml-org/llama.cpp@f4586ee
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.9.1
Changes
- Several improvements to the GPT-OSS template handling. Special actions like "Continue" and "Impersonate" now work correctly.
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/5fd160bbd9d70b94b5b11b0001fd7f477005e4a0
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.