Posted by u/NunzeCs
4x AMD R9700 (128GB VRAM) + Threadripper 9955WX Build
**Disclaimer:** I am from Germany and my English is not perfect, so I used an LLM to help me structure and write this post.

**Context & Motivation:** I built this system for my small company. The main reason for buying all-new hardware is that I received a 50% subsidy/refund from my local municipality for digitalization investments. To qualify for this funding, I had to buy new hardware and build a proper "server-grade" system. My goal was to run large models (120B+) locally for data privacy. With the subsidy in mind, I had a budget of around 10,000€ (pre-refund). I initially considered NVIDIA, but I wanted to maximize VRAM, so I went with 4x AMD RDNA4 cards (ASRock R9700) for 128GB VRAM total and used the rest of the budget for a solid Threadripper platform.

**Hardware Specs**

Total cost: ~9,800€ (I get ~50% back, so effectively ~4,900€ for me).

* CPU: AMD Ryzen Threadripper PRO 9955WX (16 cores)
* Mainboard: ASRock WRX90 WS EVO
* RAM: 128GB DDR5-5600
* GPU: 4x ASRock Radeon AI PRO R9700 32GB (128GB VRAM total)
  * Configuration: all cards running at full PCIe 5.0 x16 bandwidth
* Storage: 2x 2TB PCIe 4.0 SSD
* PSU: Seasonic 2200W
* Cooling: Alphacool Eisbaer Pro Aurora 360 CPU AIO
* Case: PHANTEKS Enthoo Pro 2 Server
* Fans: 11x Arctic P12 Pro

**Benchmark Results**

I tested various models ranging from 8B to 235B parameters.

**Llama.cpp (focus: single-user latency)**

Settings: Flash Attention ON, batch size 2048 (see the command sketch at the end of this post).

|Model|NGL|Prompt t/s|Gen t/s|Size|
|:-|:-|:-|:-|:-|
|GLM-4.7-REAP-218B-A32B-Q3_K_M|999|504.15|17.48|97.6 GB|
|GLM-4.7-REAP-218B-A32B-Q4_K_M|65|428.80|9.48|123.0 GB|
|gpt-oss-120b-GGUF|999|2977.83|97.47|58.4 GB|
|Meta-Llama-3.1-70B-Instruct-Q4_K_M|999|399.03|12.66|39.6 GB|
|Meta-Llama-3.1-8B-Instruct-Q4_K_M|999|3169.16|81.01|4.6 GB|
|MiniMax-M2.1-Q4_K_M|55|668.99|34.85|128.83 GB|
|Qwen2.5-32B-Instruct-Q4_K_M|999|848.68|25.14|18.5 GB|
|Qwen3-235B-A22B-Instruct-2507-Q3_K_M|999|686.45|24.45|104.7 GB|

Side note: with PCIe 5.0, standard pipeline parallelism (layer split) is significantly faster (~97 t/s) than tensor parallelism/row split (~67 t/s) for a single user on this setup.

**vLLM (focus: throughput)**

Model: GPT-OSS-120B (bfloat16), TP=4, tested with 20 requests (see the launch sketch at the end of this post).

* Total generation throughput: ~314 tokens/s
* Prompt processing: ~5339 tokens/s
* Single-user throughput: 50 tokens/s

I used ROCm 7.1.1 for llama.cpp; I also tested the Vulkan backend, but it was slower.

If I could do it again, I would have used the budget to buy a single NVIDIA RTX Pro 6000 Blackwell (96GB). Maybe I still will: if local AI works out well for my use case, I may swap the R9700s for a Pro 6000 in the future.

**Edit:** nicer view for the results.
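For anyone trying to reproduce the llama.cpp numbers: the settings above correspond roughly to a llama-bench run like the sketch below. This is a minimal sketch, not my exact command; the model path is a placeholder, and the exact flag syntax (especially for flash attention) can differ between llama.cpp versions.

```bash
# Layer split = pipeline parallelism (faster for a single user here, ~97 t/s on gpt-oss-120b)
./llama-bench -m ./gpt-oss-120b.gguf -ngl 999 -fa 1 -b 2048 -sm layer

# Row split = tensor parallelism (slower here, ~67 t/s)
./llama-bench -m ./gpt-oss-120b.gguf -ngl 999 -fa 1 -b 2048 -sm row
```

`-ngl 999` simply means "offload all layers"; for the models that do not fully fit into 128GB VRAM (the GLM Q4_K_M and MiniMax quants), I had to lower it, which is why the table shows 65 and 55 there.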
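For the vLLM run, a minimal sketch of the corresponding launch, assuming the openai/gpt-oss-120b Hugging Face model ID (again not my exact command; the 20-request test itself was driven by a separate benchmark client, which is not shown here):

```bash
# Serve GPT-OSS-120B in bfloat16, tensor parallel across all 4 R9700s
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --dtype bfloat16
```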