
strix-halo-testing

This repository contains my testing and development for the AMD Strix Halo (Ryzen AI Max+ 395) APU (particularly the gfx1151 RDNA 3.5 GPU).

All this work was done on a pre-production Framework Desktop / remote Framework cluster courtesy of Framework (thanks guys!), with the goal of seeing whether Strix Halo could be made genuinely useful/usable for local AI, and of producing documentation along the lines of my RDNA3 AI/ML doc.

My WIP documentation is here: https://llm-tracker.info/_TOORG/Strix-Halo but since the release of the Framework Desktop, I've been focusing my documentation efforts on the AI section of the Strix Halo HomeLab Wiki. That should be considered the most up-to-date starting point; it covers a summary of the technical capabilities, basic system setup and tweaks, as well as some docs on llama.cpp and vLLM setup.

ROCm environment scripts

I have some legacy rocm-env.sh scripts that may be useful, but for a mamba/env setup that uses the latest pip-based ROCm/TheRock release, see my torch-therock/00-setup-env.sh script, which leverages the rocm-sdk tool to generate the proper ROCm paths.
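
For a rough idea of what that looks like, here is a minimal sketch; the rocm-sdk subcommand/flag names are assumptions based on TheRock's pip distribution, so treat 00-setup-env.sh as the source of truth:

```bash
# Minimal sketch: derive ROCm paths from the pip-installed TheRock SDK.
# Assumes the wheels are installed (e.g. `pip install rocm[libraries,devel]`)
# and that `rocm-sdk path --root` prints the install root (flag name assumed;
# check `rocm-sdk --help`).
ROCM_PATH="$(rocm-sdk path --root)"
export ROCM_PATH
export HIP_PATH="$ROCM_PATH"
export PATH="$ROCM_PATH/bin:$PATH"
export LD_LIBRARY_PATH="$ROCM_PATH/lib:${LD_LIBRARY_PATH:-}"
```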

Most Useful Parts

This is a "working" repo, but there are a few things that are most worth looking at:

  • hardware-test - if you're looking for raw memory-bandwidth and performance testing, this is a good folder to look at
  • llm-bench - I ran a wide range of performance sweeps (including longer contexts across multiple llama.cpp backends) to characterize LLM performance; see the llama-bench sketch after this list for the basic shape of a run. For some more up-to-date (pp512/tg128) numbers, check out kyuz0's Interactive Viewer - I also have a llama.cpp performance doc that goes over more of how to test, and things to consider/look out for
  • rpc-test - basic testing of how to use the llama.cpp RPC backend for clustering (a minimal invocation sketch also follows this list). If you have an interest in this, you'll probably want to check out Jeff Geerling's work on the topic
  • torch-therock - as of 2025-10-15 there is no AOTriton for PyTorch (and hence no Flash Attention!) being built automatically (see ROCm/TheRock #1408), but this is the script I use to build my own PyTorch + AOTriton
  • vllm - I use this to build my own vLLM. AFAIK this was a first, but it is not for the faint of heart, and kyuz0 and others have since used this approach to build easier-to-use versions. If you're not already knee-deep in monkey-patching/struggling w/ vLLM builds, you'll probably want to check out this AMD Strix Halo — vLLM Toolbox/Container (gfx1151, PyTorch + AOTriton) instead!
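
For reference, the llm-bench sweeps boil down to llama-bench runs along these lines; the model path is a placeholder, and -d adds context depth, producing the pp2048@dN / tg32@dN rows shown in the tables further below:

```bash
# Illustrative llama-bench sweep across context depths (model path is a placeholder).
./build/bin/llama-bench \
  -m models/gpt-oss-120b-mxfp4.gguf \
  -p 2048 -n 32 \
  -d 0,4096,8192,16384,32768 \
  -ub 512
```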

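And for rpc-test, the basic llama.cpp RPC flow is roughly the following (hostnames and ports are illustrative):

```bash
# On each worker node: expose the local backend over RPC.
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the head node: point llama.cpp at the workers.
./build/bin/llama-cli -m models/model.gguf \
  --rpc worker1:50052,worker2:50052 \
  -ngl 99 -p "Hello"
```

Note that the RPC backend has no authentication or encryption, so only use it on a trusted network.
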
Strix Halo Notes

  • I've done a fair amount of posting on the Framework Forums on some Strix Halo details that may be useful (no one has time to read all of that, of course, but linking just in case)

AMD Strix Halo vs Nvidia DGX Spark

I don't have an Nvidia DGX Spark, but I may get SSH access to one and do a bit of poking around. If you think of the Strix Halo GPU as a Radeon RX 7600 XT with 128GB of LPDDR5X, you can think of the Spark as a (very low power) RTX 5070 with 128GB of LPDDR5X. Beyond that, the Spark has a very slick getting-started experience and a huge number of Playbooks, which, along with its CUDA support, make it well suited for those just getting started with AI/ML development.

That being said, the Spark has a much more mature software ecosystem and a lot more compute, which gives it a significant prefill/prompt-processing advantage for LLMs and makes it much better suited for image, audio, and video generation. For token generation/decode on big LLMs, though, the two devices are essentially neck-and-neck at short context with Vulkan; the Spark pulls ahead at longer context and with the ROCm backend.

ggerganov ran a fair number of llama.cpp performance sweeps at launch (perf has actually improved/been updated since). I was curious and ran some comparisons vs my Strix Halo (Framework Desktop, Arch 6.17.0-1-mainline, all optimizations (amd_iommu, tuned) set properly). I only tested against his gpt-oss-120b results (the ggml-org version, so Q8/MXFP4).

This was tested with a TheRock/ROCm nightly (7.10.0a20251017) and the latest Vulkan drivers (RADV 25.2.4-2, AMDVLK 2025.Q2.1-1), and I've picked the faster overall numbers for Vulkan (AMDVLK atm) and ROCm (regular hipblas w/ rocWMMA). The llama.cpp build is 6792, almost the same as ggerganov's build (6767).
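
For reference, the rocWMMA-enabled llama.cpp build corresponds to roughly this CMake configuration (standard llama.cpp build flags; assumes a working ROCm/HIP toolchain, and your paths/targets may vary):

```bash
# Illustrative llama.cpp HIP build with rocWMMA flash attention for gfx1151.
cmake -B build \
      -DGGML_HIP=ON \
      -DGGML_HIP_ROCWMMA_FATTN=ON \
      -DAMDGPU_TARGETS=gfx1151 \
      -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```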

All percentages below are the Spark's advantage, computed as (DGX - STXH) / STXH. Token generation/decode performance is essentially even at short context with Vulkan (Spark +5.6% at 2K context), while the Spark pulls ahead with ROCm (+13.6% at 2K context, +117.5% at 32K). The CUDA backend also degrades less as context grows than either Vulkan or ROCm, and on Strix Halo Vulkan holds up better than ROCm with depth: at 32K context, Vulkan tg is 1.8X ROCm! For prompt processing/prefill, ROCm performance has improved significantly but Strix Halo still lags the Spark: with ROCm the gap starts at +67.8% at 2K context and grows to +445.6% at 32K, while with Vulkan it ranges from +131.7% to +790.9%.

Note on Vulkan drivers and batch sizes (see the driver-selection snippet after this list):

  • AMDVLK (first set of Vulkan tables below) uses an optimal -ub 512 and has better pp performance
  • RADV (last set of tables below) uses an optimal -ub 1024, with lower pp but tg that decreases less at depth
  • ROCm was tested with the standard -ub 2048
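
Driver selection for the Vulkan runs is just a loader ICD override, e.g. (the ICD paths are the typical Arch Linux locations; VK_ICD_FILENAMES is the older equivalent variable):

```bash
# AMDVLK with its optimal -ub 512:
VK_DRIVER_FILES=/usr/share/vulkan/icd.d/amd_icd64.json \
  ./build/bin/llama-bench -m models/gpt-oss-120b-mxfp4.gguf -p 2048 -n 32 -ub 512

# RADV with its optimal -ub 1024:
VK_DRIVER_FILES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json \
  ./build/bin/llama-bench -m models/gpt-oss-120b-mxfp4.gguf -p 2048 -n 32 -ub 1024
```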

Vulkan AMDVLK

Test             DGX (t/s)   STXH (t/s)   DGX advantage
pp2048             1689.47       729.10         +131.7%
pp2048@d4096       1733.41       562.15         +208.4%
pp2048@d8192       1705.93       424.50         +301.9%
pp2048@d16384      1514.78       249.68         +506.7%
pp2048@d32768      1221.23       137.08         +790.9%

Test             DGX (t/s)   STXH (t/s)   DGX advantage
tg32                 52.87        50.05           +5.6%
tg32@d4096           51.02        46.11          +10.6%
tg32@d8192           48.46        43.15          +12.3%
tg32@d16384          44.78        38.46          +16.4%
tg32@d32768          38.76        31.54          +22.9%

ROCm w/ rocWMMA

Test             DGX (t/s)   STXH (t/s)   DGX advantage
pp2048             1689.47      1006.65          +67.8%
pp2048@d4096       1733.41       790.45         +119.3%
pp2048@d8192       1705.93       603.83         +182.5%
pp2048@d16384      1514.78       405.53         +273.5%
pp2048@d32768      1221.23       223.82         +445.6%

Test             DGX (t/s)   STXH (t/s)   DGX advantage
tg32                 52.87        46.56          +13.6%
tg32@d4096           51.02        38.25          +33.4%
tg32@d8192           48.46        32.65          +48.4%
tg32@d16384          44.78        25.50          +75.6%
tg32@d32768          38.76        17.82         +117.5%

Vulkan RADV

Test             DGX (t/s)   STXH (t/s)   DGX advantage
pp2048             1689.47       977.22          +72.9%
pp2048@d4096       1733.41       878.54          +97.3%
pp2048@d8192       1705.93       743.36         +129.5%
pp2048@d16384      1514.78       587.25         +157.9%
pp2048@d32768      1221.23       407.87         +199.4%

Test             DGX (t/s)   STXH (t/s)   DGX advantage
tg32                 52.87        48.97           +8.0%
tg32@d4096           51.02        45.42          +12.3%
tg32@d8192           48.46        43.55          +11.3%
tg32@d16384          44.78        40.91           +9.5%
tg32@d32768          38.76        36.43           +6.4%
