No NVIDIA? No Problem. My 2018 "Potato" 8th Gen i3 hits 10 TPS on 16B MoE.


I’m writing this from Burma. Out here, we can’t all afford the latest NVIDIA 4090s or high-end MacBooks. If you're on a tight budget, corporate AI like ChatGPT will try to gatekeep you: ask it whether you can run a 16B model on an old dual-core i3 and it’ll tell you it’s "impossible." I spent a month proving it wrong. After 30 days of squeezing every drop of performance out of my hardware, I found the peak: DeepSeek-Coder-V2-Lite (16B MoE) running on an HP ProBook 650 G5 (i3-8145U, 16GB dual-channel RAM) at near-human reading speed.

#### The Battle: CPU vs iGPU

I ran a 20-question head-to-head test with no token limits and real-time streaming.

| Device | Average Speed | Peak Speed | My Rating |
| --- | --- | --- | --- |
| CPU | 8.59 t/s | 9.26 t/s | 8.5/10 – snappy, with solid logic |
| iGPU (UHD 620) | 8.99 t/s | 9.73 t/s | 9.0/10 – a beast once it warms up |

The result: the iGPU (via OpenVINO) wins, proving that even integrated Intel graphics can handle heavy lifting if you set it up right.

## How I Squeezed the Performance

* MoE is the cheat code: 16B parameters sounds huge, but only about 2.4B are computed per token. That makes it faster and smarter than 3B–4B dense models.
* Dual-channel RAM is mandatory: I’m running 16GB (2x8GB). If you have single-channel, don't even bother; your bandwidth will choke (rough math in the first sketch at the bottom of the post).
* Linux is king: I did this on Ubuntu. Windows background processes are a luxury my "potato" can't afford.
* OpenVINO integration: don't use OpenVINO alone; it's dependency hell. Use it as a backend for llama-cpp-python.

## The Reality Check

1. First-run lag: the iGPU takes time to compile its kernels, so it might look stuck. Give it a minute; the "GPU" is just having his coffee.
2. Language drift: on the iGPU it sometimes slips into Chinese tokens, but the logic never breaks.

I’m sharing this because a lack of money shouldn't stop you from learning AI. If I can do this on an i3 in Burma, you can do it too.

## Clarifications (Edit)

For those looking for OpenVINO CMake flags in the core llama.cpp repo or documentation: **it is not in the upstream core yet.** I am not using upstream llama.cpp directly. Instead, I am using llama-cpp-python, built from source with the OpenVINO backend enabled. OpenVINO support hasn't been merged into the main llama.cpp master branch, but llama-cpp-python already supports it through a custom CMake build path. Install llama-cpp-python like this (a minimal loading sketch follows at the bottom of the post):

`CMAKE_ARGS="-DGGML_OPENVINO=ON" pip install llama-cpp-python`

**Benchmark specifics.** For clarity, here is the benchmark output. This measures decode speed (after prefill), using a fixed max_tokens=256, averaged across 10 runs with n_ctx=4096 (a timing sketch is also at the bottom of the post):

* CPU avg decode: ~9.6 t/s
* iGPU avg decode: ~9.6 t/s

When I say "~10 TPS," I am specifically referring to decode TPS (tokens per second), not prefill speed.

You can check the detailed comparison between DeepSeek-V2-Lite and GPT-OSS-20B on this same hardware here: https://www.reddit.com/r/LocalLLaMA/comments/1qycn5s/deepseekv2lite_vs_gptoss20b_on_my_2018_potato/
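Here's the back-of-envelope math behind the dual-channel point. The quantization density and sustained bandwidth figures are my assumptions, not measurements; the point is the shape of the math, not the exact numbers:

```python
# Back-of-envelope: why dual-channel RAM sets the speed ceiling on a MoE model.
# Every decoded token must stream the *active* weights through RAM.
# Assumptions: Q4_K_M-style quantization ~4.5 bits/weight, and sustained DDR4-2400
# bandwidth well below the 38.4 GB/s theoretical dual-channel peak.
active_params = 2.4e9                # DeepSeek-V2-Lite activates ~2.4B of 16B per token
bytes_per_param = 4.5 / 8            # ~4.5 bits per weight after Q4 quantization
gb_per_token = active_params * bytes_per_param / 1e9   # ~1.35 GB read per token

dual_channel_gbps = 20.0             # assumed realistic sustained dual-channel rate
single_channel_gbps = 10.0           # half the channels, half the bandwidth

print(f"dual-channel ceiling:   {dual_channel_gbps / gb_per_token:.1f} t/s")
print(f"single-channel ceiling: {single_channel_gbps / gb_per_token:.1f} t/s")
# ~15 t/s vs ~7 t/s -- the measured ~9 t/s only fits under the dual-channel ceiling.
```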
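For anyone who wants to see the setup end-to-end, here's a minimal loading sketch using the standard llama-cpp-python API. The model filename is a placeholder, and I'm assuming the OpenVINO build honors the usual `n_gpu_layers` offload convention:

```python
# Minimal sketch: load DeepSeek-Coder-V2-Lite with llama-cpp-python.
# Requires a build with the OpenVINO backend enabled (see CMAKE_ARGS above).
from llama_cpp import Llama

llm = Llama(
    model_path="./deepseek-coder-v2-lite-instruct.Q4_K_M.gguf",  # placeholder filename
    n_ctx=4096,        # same context size as the benchmark
    n_threads=4,       # i3-8145U: 2 cores / 4 threads
    n_gpu_layers=-1,   # offload everything to the iGPU; set 0 for a pure CPU run
)

out = llm("Write a Python function that reverses a string.", max_tokens=256)
print(out["choices"][0]["text"])
```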
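And here's roughly how you can reproduce the decode-TPS number yourself: stream the tokens, start the clock at the first one so prefill is excluded, and divide. This is a sketch of the method, not my exact benchmark script:

```python
# Sketch: measure decode TPS (tokens/second after prefill) via the streaming API.
import time
from llama_cpp import Llama

llm = Llama(model_path="./deepseek-coder-v2-lite-instruct.Q4_K_M.gguf",  # placeholder
            n_ctx=4096, n_threads=4)

def decode_tps(prompt: str, max_tokens: int = 256) -> float:
    n, t0 = 0, None
    for _ in llm(prompt, max_tokens=max_tokens, stream=True):
        if t0 is None:
            t0 = time.perf_counter()  # first token arrived: prefill is done
            continue                  # don't count the first token's interval
        n += 1
    return n / (time.perf_counter() - t0)

runs = [decode_tps("Explain MoE routing in two sentences.") for _ in range(10)]
print(f"avg decode: {sum(runs) / len(runs):.2f} t/s")
```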
