7x Longer Context Reinforcement Learning in Unsloth


Hey r/LocalLlama! We're excited to show how Unsloth now enables **7x longer context lengths** (up to 12x) for Reinforcement Learning! Using 3 new techniques we developed, you can train gpt-oss 20b QLoRA at up to **20K context on a 24GB card**, all with **no accuracy degradation**. Unsloth GitHub: https://github.com/unslothai/unsloth

* For larger GPUs, Unsloth now trains gpt-oss QLoRA with **380K context** on a single 192GB NVIDIA B200 GPU
* Qwen3-8B GRPO reaches **110K context** on an 80GB VRAM H100 via vLLM and QLoRA, and **65K** for gpt-oss with BF16 LoRA
* Unsloth GRPO RL works with Llama, Gemma and all other models, which automatically support the longer contexts

All of Unsloth's features can also be combined and work well together:

1. Unsloth's weight-sharing feature with vLLM and our Standby feature in Memory Efficient RL
2. Unsloth's Flex Attention for long-context gpt-oss and our 500K Context Training
3. Float8 training in FP8 RL and Unsloth's async gradient checkpointing, and much more

You can read our educational blog post for detailed analysis, benchmarks and more: https://unsloth.ai/docs/new/grpo-long-context

And you can of course train any model using our new features and kernels via our free fine-tuning notebooks: https://docs.unsloth.ai/get-started/unsloth-notebooks

Some free Colab notebooks with the 7x longer context support baked in:

* GSPO Colab: gpt-oss-20b-GRPO.ipynb
* Vision RL: Qwen3-VL-8B-Vision-GRPO.ipynb
* Qwen3-8B FP8 (L4 GPU)

To update Unsloth so training automatically gets faster, run:

```
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth_zoo
```

And to enable GRPO runs in Unsloth, do:

```python
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"  # Standby = extra 30% context lengths!

from unsloth import FastLanguageModel
import torch

max_seq_length = 20000  # Can increase for longer reasoning traces
lora_rank = 32          # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False,   # False for LoRA 16bit
    fast_inference = True,  # Enable vLLM fast inference
    max_lora_rank = lora_rank,
)
```

Hope you all have a great rest of the week and thank you!
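For context, here's a minimal sketch of what a full GRPO run on top of that loaded model could look like, using TRL's `GRPOTrainer` API (which the Unsloth GRPO notebooks build on). The LoRA settings, hyperparameters, toy dataset and `reward_len` reward function below are illustrative placeholders, not the notebooks' exact recipe:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Attach LoRA adapters (illustrative settings; tune for your model/GPU)
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = lora_rank * 2,
    use_gradient_checkpointing = "unsloth",  # Unsloth's memory-efficient checkpointing
    random_state = 3407,
)

# Toy prompt dataset -- replace with your own; GRPOTrainer expects a "prompt" column
dataset = Dataset.from_dict({"prompt": ["Solve step by step: 13 * 17 = ?"] * 64})

# Toy reward: slightly prefer shorter completions (purely illustrative)
def reward_len(completions, **kwargs):
    return [-len(c) / 1000.0 for c in completions]

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [reward_len],
    args = GRPOConfig(
        learning_rate = 5e-6,
        per_device_train_batch_size = 8,
        num_generations = 8,   # completions sampled per prompt; GRPO compares them
        max_prompt_length = 256,
        max_completion_length = max_seq_length - 256,
        max_steps = 50,
        output_dir = "outputs",
    ),
    train_dataset = dataset,
)
trainer.train()
```

The real notebooks linked above use task-specific reward functions (e.g. checking answers) rather than a length penalty; the point here is just the shape of the setup around the standby-enabled model.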
