
strix-halo-testing

This repository contains my testing and development for the AMD Strix Halo (Ryzen AI Max+ 395) APU (particularly the gfx1151 RDNA 3.5 GPU).

All this work was done on a pre-production Framework Desktop / remote Framework cluster courtesy of Framework (thanks guys!), with the goal of seeing whether Strix Halo could be made genuinely useful/usable for local AI, and of producing documentation along the lines of my RDNA3 AI/ML doc.

My WIP documentation is here: https://llm-tracker.info/_TOORG/Strix-Halo but since the release of the Framework Desktop, I've been focusing my documentation efforts on the AI section of the Strix Halo HomeLab Wiki. That should be considered the most up-to-date starting point; it covers a summary of the technical capabilities, basic system setup and tweaks, as well as some docs on llama.cpp and vLLM setup.

ROCm environment scripts

I have some legacy rocm-env.sh scripts that may be useful, but for a mamba/env setup that uses the latest pip-based ROCm/TheRock release, see my torch-therock/00-setup-env.sh script, which leverages the rocm-sdk tool to generate the proper ROCm paths.
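
For a rough idea of what that looks like, here is a minimal sketch; the rocm-sdk subcommand/flag names are assumptions based on TheRock's pip distribution, so treat 00-setup-env.sh as the source of truth:

```bash
# Minimal sketch: derive ROCm paths from the pip-installed TheRock SDK.
# Assumes the wheels are installed (e.g. `pip install rocm[libraries,devel]`)
# and that `rocm-sdk path --root` prints the install root (flag name assumed;
# check `rocm-sdk --help`).
ROCM_PATH="$(rocm-sdk path --root)"
export ROCM_PATH
export HIP_PATH="$ROCM_PATH"
export PATH="$ROCM_PATH/bin:$PATH"
export LD_LIBRARY_PATH="$ROCM_PATH/lib:${LD_LIBRARY_PATH:-}"
```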

Most Useful Parts

This is a "working" repo, but there are a few things that are most worth looking at:

  • hardware-test - if you're looking for raw memory-bandwidth and performance testing, this is a good folder to look at
  • llm-bench - I ran a wide range of performance sweeps (including longer contexts across multiple llama.cpp backends) to characterize LLM performance; see the llama-bench sketch after this list for the basic shape of a run. For some more up-to-date (pp512/tg128) numbers, check out kyuz0's Interactive Viewer - I also have a llama.cpp performance doc that goes over more of how to test, and things to consider/look out for
  • rpc-test - basic testing of how to use the llama.cpp RPC backend for clustering (a minimal invocation sketch also follows this list). If you have an interest in this, you'll probably want to check out Jeff Geerling's work on the topic
  • torch-therock - as of 2025-10-15 there is no AOTriton for PyTorch (and hence no Flash Attention!) being built automatically (see ROCm/TheRock #1408), but this is the script I use to build my own PyTorch + AOTriton
  • vllm - I use this to build my own vLLM. AFAIK this was a first, but it is not for the faint of heart, and kyuz0 and others have since used this approach to build easier-to-use versions. If you're not already knee-deep in monkey-patching/struggling w/ vLLM builds, you'll probably want to check out this AMD Strix Halo — vLLM Toolbox/Container (gfx1151, PyTorch + AOTriton) instead!
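
For reference, the llm-bench sweeps boil down to llama-bench runs along these lines; the model path is a placeholder, and -d adds context depth, producing the pp2048@dN / tg32@dN rows shown in the tables further below:

```bash
# Illustrative llama-bench sweep across context depths (model path is a placeholder).
./build/bin/llama-bench \
  -m models/gpt-oss-120b-mxfp4.gguf \
  -p 2048 -n 32 \
  -d 0,4096,8192,16384,32768 \
  -ub 512
```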

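And for rpc-test, the basic llama.cpp RPC flow is roughly the following (hostnames and ports are illustrative):

```bash
# On each worker node: expose the local backend over RPC.
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the head node: point llama.cpp at the workers.
./build/bin/llama-cli -m models/model.gguf \
  --rpc worker1:50052,worker2:50052 \
  -ngl 99 -p "Hello"
```

Note that the RPC backend has no authentication or encryption, so only use it on a trusted network.
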
Strix Halo Notes

  • I've done a fair amount of posting on the Framework Forums on some Strix Halo details that may be useful (no one has time to read all of that, of course, but linking just in case)

AMD Strix Halo vs Nvidia DGX Spark

I don't have an Nvidia DGX Spark, but I may get SSH access to one and do a bit of poking around. If you think of the Strix Halo GPU as a Radeon RX 7600 XT with 128GB of LPDDR5X, you can think of the Spark as a (very low power) RTX 5070 with 128GB of LPDDR5X. Beyond that, the Spark has a very slick getting-started experience and a huge number of Playbooks, which, along with its CUDA support, make it well suited for those just getting started with AI/ML development.

That being said, the Spark has a much more mature software ecosystem and a lot more compute, which gives it a significant prefill/prompt-processing advantage for LLMs and makes it much better suited for image, audio, and video generation. For token generation/decode on big LLMs, though, the two devices are essentially neck-and-neck at short context with Vulkan; the Spark pulls ahead at longer context and with the ROCm backend.

ggerganov ran a fair number of llama.cpp performance sweeps at launch (perf has actually improved/been updated since). I was curious and ran some comparisons vs my Strix Halo (Framework Desktop, Arch 6.17.0-1-mainline, all optimizations (amd_iommu, tuned) set properly). I only tested against his gpt-oss-120b results (the ggml-org version, so Q8/MXFP4).

This was tested with a TheRock/ROCm nightly (7.10.0a20251017) and the latest Vulkan drivers (RADV 25.2.4-2, AMDVLK 2025.Q2.1-1), and I've picked the faster overall numbers for Vulkan (AMDVLK atm) and ROCm (regular hipblas w/ rocWMMA). The llama.cpp build is 6792, almost the same as ggerganov's build (6767).
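
For reference, the rocWMMA-enabled llama.cpp build corresponds to roughly this CMake configuration (standard llama.cpp build flags; assumes a working ROCm/HIP toolchain, and your paths/targets may vary):

```bash
# Illustrative llama.cpp HIP build with rocWMMA flash attention for gfx1151.
cmake -B build \
      -DGGML_HIP=ON \
      -DGGML_HIP_ROCWMMA_FATTN=ON \
      -DAMDGPU_TARGETS=gfx1151 \
      -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```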

All percentages below are the Spark's advantage, computed as (DGX - STXH) / STXH. Token generation/decode performance is essentially even at short context with Vulkan (Spark +5.6% at 2K context), while the Spark pulls ahead with ROCm (+13.6% at 2K context, +117.5% at 32K). The CUDA backend also degrades less as context grows than either Vulkan or ROCm, and on Strix Halo Vulkan holds up better than ROCm with depth: at 32K context, Vulkan tg is 1.8X ROCm! For prompt processing/prefill, ROCm performance has improved significantly but Strix Halo still lags the Spark: with ROCm the gap starts at +67.8% at 2K context and grows to +445.6% at 32K, while with Vulkan it ranges from +131.7% to +790.9%.

Note on Vulkan drivers and batch sizes (see the driver-selection snippet after this list):

  • AMDVLK (first set of Vulkan tables below) uses an optimal -ub 512 and has better pp performance
  • RADV (last set of tables below) uses an optimal -ub 1024, with lower pp but tg that decreases less at depth
  • ROCm was tested with the standard -ub 2048
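
Driver selection for the Vulkan runs is just a loader ICD override, e.g. (the ICD paths are the typical Arch Linux locations; VK_ICD_FILENAMES is the older equivalent variable):

```bash
# AMDVLK with its optimal -ub 512:
VK_DRIVER_FILES=/usr/share/vulkan/icd.d/amd_icd64.json \
  ./build/bin/llama-bench -m models/gpt-oss-120b-mxfp4.gguf -p 2048 -n 32 -ub 512

# RADV with its optimal -ub 1024:
VK_DRIVER_FILES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json \
  ./build/bin/llama-bench -m models/gpt-oss-120b-mxfp4.gguf -p 2048 -n 32 -ub 1024
```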

Vulkan AMDVLK

Test             DGX (t/s)   STXH (t/s)   DGX advantage
pp2048             1689.47       729.10         +131.7%
pp2048@d4096       1733.41       562.15         +208.4%
pp2048@d8192       1705.93       424.50         +301.9%
pp2048@d16384      1514.78       249.68         +506.7%
pp2048@d32768      1221.23       137.08         +790.9%

Test             DGX (t/s)   STXH (t/s)   DGX advantage
tg32                 52.87        50.05           +5.6%
tg32@d4096           51.02        46.11          +10.6%
tg32@d8192           48.46        43.15          +12.3%
tg32@d16384          44.78        38.46          +16.4%
tg32@d32768          38.76        31.54          +22.9%

ROCm w/ rocWMMA

Test             DGX (t/s)   STXH (t/s)   DGX advantage
pp2048             1689.47      1006.65          +67.8%
pp2048@d4096       1733.41       790.45         +119.3%
pp2048@d8192       1705.93       603.83         +182.5%
pp2048@d16384      1514.78       405.53         +273.5%
pp2048@d32768      1221.23       223.82         +445.6%

Test             DGX (t/s)   STXH (t/s)   DGX advantage
tg32                 52.87        46.56          +13.6%
tg32@d4096           51.02        38.25          +33.4%
tg32@d8192           48.46        32.65          +48.4%
tg32@d16384          44.78        25.50          +75.6%
tg32@d32768          38.76        17.82         +117.5%

Vulkan RADV

Test             DGX (t/s)   STXH (t/s)   DGX advantage
pp2048             1689.47       977.22          +72.9%
pp2048@d4096       1733.41       878.54          +97.3%
pp2048@d8192       1705.93       743.36         +129.5%
pp2048@d16384      1514.78       587.25         +157.9%
pp2048@d32768      1221.23       407.87         +199.4%

Test             DGX (t/s)   STXH (t/s)   DGX advantage
tg32                 52.87        48.97           +8.0%
tg32@d4096           51.02        45.42          +12.3%
tg32@d8192           48.46        43.55          +11.3%
tg32@d16384          44.78        40.91           +9.5%
tg32@d32768          38.76        36.43           +6.4%
