Train MoE models 12x faster with 30% less memory! (<15GB VRAM)


Hey r/LocalLLaMA! We're excited to introduce ~12x faster Mixture of Experts (MoE) training with **>35% less VRAM** and **~6x longer context** via our new custom Triton kernels and math optimizations (no accuracy loss). Unsloth repo: https://github.com/unslothai/unsloth

* Unsloth now supports fast training for MoE architectures including gpt-oss, Qwen3 (30B, 235B, VL, Coder), DeepSeek R1/V3 and GLM (4.5-Air, 4.7, Flash).
* gpt-oss-20b fine-tunes in **12.8GB VRAM**; Qwen3-30B-A3B (16-bit LoRA) uses 63GB.
* Our kernels work on data-center GPUs (B200, H100) as well as consumer and older GPUs (e.g. RTX 3090), and support FFT, LoRA and QLoRA.
* The larger the model and the longer the context, **the more pronounced the memory savings from our Unsloth kernels become**.
* We previously introduced Unsloth Flex Attention for gpt-oss, and these optimizations make it even more efficient.

In collaboration with Hugging Face, we standardized all MoE training runs on PyTorch's new `torch._grouped_mm` function. Transformers v5 was recently optimized to run MoE ~6x faster than v4, and Unsloth pushes this even further with custom Triton grouped-GEMM + LoRA kernels for an **additional** ~2x speedup, >35% VRAM reduction and >6x longer context (a 12-30x overall speedup vs v4). A minimal sketch of the grouped-GEMM idea is at the end of this post.

You can read our educational blogpost for detailed analysis, benchmarks and more: https://unsloth.ai/docs/new/faster-moe

We also recently released support for embedding model fine-tuning.

You can use our free MoE fine-tuning notebooks:

|**gpt-oss (20b)** **(free)**|gpt-oss (500K context)|GLM-4.7-Flash (A100)|
|:-|:-|:-|
|gpt-oss-120b (A100)|Qwen3-30B-A3B (A100)|TinyQwen3 MoE T4 (free)|

To update Unsloth so it automatically makes training faster, update our Docker image or run:

    pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo

Thanks for reading and hope y'all have a lovely week. We hear it'll be a busy week! :)
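For anyone curious what the grouped-GEMM trick refers to, here is a minimal PyTorch sketch. It is not Unsloth's actual Triton kernel, and all names and shapes are illustrative; it just shows the per-expert loop that a grouped GEMM (e.g. `torch._grouped_mm` or a fused Triton kernel) collapses into a single launch:

```python
# Minimal sketch (not Unsloth's kernel) of the grouped-GEMM idea behind MoE layers:
# tokens are routed to experts, and instead of launching one small matmul per expert,
# the per-expert GEMMs are executed as one "grouped" operation in a single kernel.
import torch

def moe_ffn_reference(x, expert_ids, w_up, w_down):
    """Naive per-expert loop: the baseline that grouped kernels replace.

    x:          (num_tokens, hidden)        tokens, one routed expert per token
    expert_ids: (num_tokens,)               which expert each token goes to
    w_up:       (num_experts, hidden, ffn)  per-expert up-projection weights
    w_down:     (num_experts, ffn, hidden)  per-expert down-projection weights
    """
    out = torch.empty_like(x)
    for e in range(w_up.shape[0]):
        mask = expert_ids == e
        if mask.any():
            h = torch.nn.functional.silu(x[mask] @ w_up[e])  # expert-specific GEMM
            out[mask] = h @ w_down[e]
        # A grouped-GEMM kernel performs all of these slices in one launch,
        # avoiding the Python loop and the idle time on small expert batches.
    return out

# Tiny smoke test with made-up sizes
if __name__ == "__main__":
    torch.manual_seed(0)
    tokens, hidden, ffn, experts = 64, 128, 256, 8
    x = torch.randn(tokens, hidden)
    ids = torch.randint(0, experts, (tokens,))
    w_up = torch.randn(experts, hidden, ffn) * 0.02
    w_down = torch.randn(experts, ffn, hidden) * 0.02
    print(moe_ffn_reference(x, ids, w_up, w_down).shape)  # (64, 128)
```

And a rough idea of what a LoRA setup looks like with Unsloth. The model id, sequence length and target modules below are placeholders; the notebooks above show the exact settings for each MoE architecture:

```python
# Rough sketch of an Unsloth LoRA/QLoRA setup for an MoE model.
# model_name, max_seq_length and target_modules are illustrative placeholders;
# see the official notebooks for the exact values used per architecture.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",  # placeholder: pick the MoE model you want to tune
    max_seq_length=4096,
    load_in_4bit=True,                 # QLoRA; set False for 16-bit LoRA
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # placeholder module list
)
# From here, train with TRL's SFTTrainer as in the notebooks.
```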
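Both snippets are sketches under the assumptions noted in their comments, not Unsloth internals; the blogpost and notebooks linked above are the authoritative references.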
