Releases: oobabooga/text-generation-webui
v3.18
Changes
- Add the `--cpu-moe` flag for llama.cpp to move MoE model experts to CPU, reducing VRAM usage.
- Add ROCm portable builds for AMD GPUs on Linux. This was made possible by PR oobabooga/llama-cpp-binaries#7. Thanks, @ShortTimeNoSee.
- Remove deprecated macOS 13 wheels (no longer supported by GitHub Actions).
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/10e9780154365b191fb43ca4830659ef12def80f
- Update ExLlamaV3 to 0.0.15
- Update peft to 0.18.*
- Update triton-windows to 3.5.1.post21
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4`.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.17
Changes
- Add `weights_only=True` to `torch.load` in Training_PRO for better security (a minimal sketch of this pattern follows).
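As a hedged illustration of that change (not the actual Training_PRO code; the checkpoint path below is made up), the safer loading pattern looks like this:

```python
import torch

# weights_only=True restricts unpickling to plain tensors and standard
# containers, so a malicious checkpoint cannot run arbitrary code on load.
# The path is a hypothetical example.
state_dict = torch.load(
    "user_data/training/checkpoints/adapter_model.bin",
    map_location="cpu",
    weights_only=True,  # raises an error if the file holds other pickled objects
)
```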
Bug fixes
- Pin huggingface-hub to 0.36.0 to fix manual venv installs.
- fix: Rename 'evaluation_strategy' to 'eval_strategy' in training. Thanks, @inyourface34456.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/230d1169e5bfe04a013b2e20f4662ee56c2454b0 (adds Qwen3-VL support)
- Update exllamav3 to 0.0.12
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4`.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.16
Changes
Bug fixes
- Fixed Python requirements for Apple devices with macOS Tahoe (#7273). Thanks, @drieschel.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/d0660f237a5c31771a3d6d1030ebe3e0c409ba92 (adds Ling-mini-2.0, Ring-mini-2.0 support)
- Update exllamav3 to 0.0.11
- Update triton-windows to 3.5.0.post21
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.15
Changes
- Log an error when a llama-server request exceeds the context size (#7263). Thanks, @mamei16.
- Make `--trust-remote-code` immutable from the UI/API for better security.
Bug fixes
- Fix metadata leaking into branched chats.
- Fix "continue" missing an initial space in chat-instruct/chat modes.
- Fix resuming incomplete downloads after HF moved to Xet.
- Revert exllamav3_hf changes in v3.14 that made it output gibberish.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/f9fb33f2630b4b4ba9081ce9c0c921f8cd8ba4eb.
- Update exllamav3 to 0.0.10.
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.14
Changes
- Better handle multi-GPU setups when using Transformers with bitsandbytes (`load-in-8bit` and `load-in-4bit`); a minimal loading sketch follows this list.
- Implement the `/v1/internal/logits` endpoint for the `exllamav3` and `exllamav3_hf` loaders; a hedged request example also follows this list.
- Make profile picture uploading safer.
- Add `fla` to the requirements for ExLlamav3 to support `qwen3-next` models.
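Two hedged sketches of what these changes touch. First, the Transformers + bitsandbytes loading path that the multi-GPU handling applies to; the model name and the 4-bit choice are illustrative, not taken from the release notes:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical example model
bnb_config = BitsAndBytesConfig(load_in_4bit=True)  # corresponds to the load-in-4bit option

# device_map="auto" lets accelerate shard the quantized layers across all
# visible GPUs, which is the multi-GPU case this release handles better.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Second, a sketch of calling the new `/v1/internal/logits` endpoint. The payload fields and the default port 5000 are assumptions based on the project's OpenAI-compatible API; check the API documentation for the exact schema:

```python
import requests

response = requests.post(
    "http://127.0.0.1:5000/v1/internal/logits",
    json={
        "prompt": "The capital of France is",  # text to score
        "use_samplers": False,  # assumed flag: return raw logits rather than sampler-adjusted probabilities
    },
    timeout=60,
)
print(response.json())  # expected: the most likely next tokens with their scores
```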
Bug fixes
- Fix an issue with loading certain chat histories in Instruct mode. Thanks, @Remowylliams.
- Fix portable builds for macOS x86 missing llama.cpp binaries (#7238). Thanks, @IonoclastBrigham.
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/d00cbea63c671cd85a57adaa50abf60b3b87d86f.
- Update transformers to 4.57.
- Update exllamav3 to 0.0.7.
- Update bitsandbytes to 0.48.
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.13
Bug fixes
- Don't use `$ $` for LaTeX, only `$$ $$`, to avoid broken rendering of text like "apples cost $1, oranges cost $2"
- Fix exllamav3 ignoring the stop button
- Fix a transformers issue when using `--bf16` and Flash Attention 2 (#7217). Thanks, @stevenxdavis. (The combination involved is sketched after this list.)
- Fix x86_64 macOS portable builds containing arm64 files
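For context on that `--bf16` + Flash Attention 2 fix, this is roughly the Transformers-level combination the flags map to. It is a sketch, not the webui's loading code, and the model name is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM

# bfloat16 weights plus the FlashAttention-2 kernel: the pairing the --bf16 fix concerns.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",       # hypothetical example model
    torch_dtype=torch.bfloat16,               # what --bf16 requests
    attn_implementation="flash_attention_2",  # requires the flash-attn package and a supported GPU
)
```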
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/7f766929ca8e8e01dcceb1c526ee584f7e5e1408
- Update transformers to 4.56
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.12
Changes
- Characters can now think in `chat-instruct` mode! This was possible thanks to many simplifications and improvements to jinja2 template handling.
- Add support for the Seed-OSS-36B-Instruct template.
- Better handle the growth of the chat input textarea (before/after screenshots)
- Make the `--model` flag work with absolute paths for GGUF models, like `--model /tmp/gemma-3-270m-it-IQ4_NL.gguf`
- Make venv portable installs work with Python 3.13
- Optimize LaTeX rendering during streaming for long replies
- Give streaming instruct messages more vertical space
- Preload the instruct and chat fonts for smoother startup
- Improve right sidebar borders in light mode
- Remove the `--flash-attn` flag (it's always on now in llama.cpp)
- Suppress "Attempted to select a non-interactive or hidden tab" console warnings, reducing the UI CPU usage during streaming
- Statically link MSVC runtime to remove the Visual C++ Redistributable dependency on Windows for the llama.cpp binaries
- Make the llama.cpp terminal output with `--verbose` less verbose
Bug fixes
- llama.cpp: Fix stderr deadlock while loading some models
- llama.cpp: Fix obtaining the maximum sequence length for GPT-OSS
- Fix the UI failing to launch if the Notebook prompt is too long
- Fix LaTeX rendering for equations with asterisks
- Fix italic and quote colors in headings
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/9961d244f2df6baf40af2f1ddc0927f8d91578c8
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.11
Changes
- Add the Tensor Parallelism option to the ExLlamav3/ExLlamav3_HF loaders through the `--enable-tp` and `--tp-backend` options.
- Set multimodal status during model loading instead of checking every generation (#7199). Thanks, @altoiddealer.
- Improve the multimodal API examples slightly.
Bug fixes
- Make web search functional again
- mtmd: Fix a bug when "include past attachments" is unchecked
- Fix code blocks having an extra empty line in the UI
Backend updates
- Update llama.cpp to ggml-org/llama.cpp@6d7f111
- Update ExLlamaV3 to 0.0.6
- Update flash-attention to 2.8.3
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.10 - Multimodal support!
See the Multimodal Tutorial
Changes
- Add multimodal support to the UI and API (a hedged request sketch follows this list)
- With the llama.cpp loader (#7027). This was possible thanks to PR ggml-org/llama.cpp#15108 to llama.cpp. Thanks @65a.
- With ExLlamaV3 through a new ExLlamaV3 loader (#7174). Thanks @Katehuuh.
- Add speculative decoding to the new ExLlamaV3 loader.
- Use ExLlamav3 instead of ExLlamav3_HF by default for EXL3 models, since it supports multimodal and speculative decoding.
- Support loading chat templates from `chat_template.json` files (EXL3/EXL2/Transformers models)
- Default max_tokens to 512 in the API instead of 16
- Better organize the right sidebar in the UI
- llama.cpp: Pass `--swa-full` to llama-server when `streaming-llm` is checked to make it work for models with SWA.
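A hedged sketch of what a multimodal API request can look like after this release. The OpenAI-style `image_url` content part, the base64 data URL, and port 5000 are assumptions based on the project's OpenAI-compatible API; consult the Multimodal Tutorial for the exact format:

```python
import base64
import requests

# Encode a local image as a data URL (the path is a hypothetical example).
with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json={
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        "max_tokens": 512,  # the new default mentioned above, set explicitly here
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```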
Bug fixes
- Fix getting the ctx-size for newer EXL3/EXL2/Transformers models
- Fix the exllamav2 loader ignoring add_bos_token
- Fix the color of italic text in chat messages
- Fix edit window and buttons in Messenger theme (#7100). Thanks @mykeehu.
Backend updates
- Bump llama.cpp to ggml-org/llama.cpp@f4586ee
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
v3.9.1
Changes
- Several improvements to the GPT-OSS template handling. Special actions like "Continue" and "Impersonate" now work correctly.
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/5fd160bbd9d70b94b5b11b0001fd7f477005e4a0
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.