Posted by u/petroslamb
[D] Why Mamba rewrote its core algorithm and Microsoft abandoned RetNet
Mamba-2 restructured its core recurrence from parallel scans (10-20% Tensor Core utilization) to block-diagonal GEMMs (60-70%). The architecture bent to fit the silicon (a toy version of this restructuring is sketched at the end of the post).

RetNet was published by Microsoft Research in July 2023 with promising results at 6.7B parameters. Five months later, the same organization shipped Phi-2, a dense Transformer. Then Phi-3. Then Phi-4. The co-authors didn't bet on their own architecture.

I wrote an analysis of why this pattern keeps repeating. The short version: Transformers and NVIDIA GPUs co-evolved into a stable attractor. Breaking out requires clearing two reinforcing gates at once, hardware compatibility and institutional backing, and each gate makes the other harder to pass. At frontier scale, no pure alternative has cleared both.

The essay includes the Tensor Core utilization numbers, an analysis of alternative chip vendors, and three falsifiable predictions for 2028.
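To make the Mamba-2 point concrete, here's a minimal sketch in plain NumPy (not the actual Mamba-2/SSD kernel, which handles matrix-valued states and is written for numerical stability): the same linear recurrence h_t = a_t * h_{t-1} + x_t computed two ways, as a sequential scan and as a chunked evaluation where the intra-chunk work collapses into a lower-triangular matmul. Function names and the chunk size are mine, purely for illustration.

```python
# Toy illustration (not the actual Mamba-2/SSD kernel): the same linear
# recurrence h_t = a_t * h_{t-1} + x_t, computed two ways.
import numpy as np

def sequential_scan(a, x):
    # One dependent elementwise step per timestep: no matmuls anywhere,
    # which is why this shape of work leaves Tensor Cores idle.
    h, out = 0.0, np.empty_like(x)
    for t in range(len(x)):
        h = a[t] * h + x[t]
        out[t] = h
    return out

def chunked_matmul(a, x, chunk=64):
    # Within a chunk, unrolling the recurrence gives
    #   h_t = cp[t] * h_in + sum_{j<=t} (cp[t] / cp[j]) * x_j,
    # where cp[t] = a_0 * ... * a_t. The sum is a lower-triangular
    # matrix-vector product, i.e. GEMM-shaped work; chunks are then
    # stitched together by carrying the final state forward.
    out, h = np.empty_like(x), 0.0
    for s in range(0, len(x), chunk):
        ac, xc = a[s:s + chunk], x[s:s + chunk]
        cp = np.cumprod(ac)
        # M[t, j] = cp[t] / cp[j] for j <= t (decay from step j to t).
        # NB: this ratio form underflows for long chunks / strong decay;
        # real kernels keep things stable (e.g. log-space products).
        M = np.tril(np.outer(cp, 1.0 / cp))
        hc = M @ xc + cp * h
        out[s:s + len(xc)], h = hc, hc[-1]
    return out

rng = np.random.default_rng(0)
a = rng.uniform(0.9, 1.0, 512)   # per-step decay factors
x = rng.standard_normal(512)
assert np.allclose(sequential_scan(a, x), chunked_matmul(a, x))
```

Both functions compute identical outputs; the difference is that the second spends most of its FLOPs inside `M @ xc`, exactly the dense-matmul shape the hardware is built around. That's the whole "architecture bent to fit the silicon" argument in ~30 lines.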