Optimizing LLM Pre-Training: Muon, Latent Attention, and MoE in Practice
Muon is a geometry-aware optimizer reported to roughly halve pre-training time for large language models. It speeds up LLM pre-training by orthogonalizing each weight matrix's momentum update via an approximate polar decomposition, which amounts to steepest descent under the spectral norm. Muon also scales well to large batch sizes and combines cleanly with architectural techniques such as Multi-Head Latent Attention.
Source: HackerNoon →
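The core idea in the article is the orthogonalized momentum update. Below is a minimal PyTorch sketch of that idea: a heavy-ball momentum buffer whose matrix is pushed toward the orthogonal factor of its polar decomposition with a Newton-Schulz iteration before it is applied to the weights. The iteration coefficients, learning rate, and function names (`newton_schulz_orthogonalize`, `muon_step`) are illustrative assumptions, not the article's exact recipe.

```python
# Sketch of a Muon-style update step. Coefficients and hyperparameters below
# are illustrative assumptions, not values taken from the article.
import torch


def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the orthogonal factor of G's polar decomposition.

    Pushes the singular values of G toward 1 while keeping its singular
    vectors, i.e. G = U S V^T is mapped approximately to U V^T.
    """
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic iteration coefficients (assumed)
    X = G / (G.norm() + 1e-7)              # Frobenius scaling bounds the spectral norm by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                          # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X


def muon_step(weight: torch.Tensor, grad: torch.Tensor,
              momentum_buf: torch.Tensor, lr: float = 0.02,
              beta: float = 0.95) -> None:
    """One geometry-aware update: momentum, then orthogonalize, then step."""
    momentum_buf.mul_(beta).add_(grad)                  # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf)  # spectral-norm-friendly direction
    weight.add_(update, alpha=-lr)


# Usage on a single 2-D weight matrix (Muon operates per matrix parameter):
W = torch.randn(512, 256)
g = torch.randn_like(W)        # stand-in for a real gradient
buf = torch.zeros_like(W)
muon_step(W, g, buf)
```

Implementations of this scheme typically apply the orthogonalized update only to 2-D weight matrices, while embeddings, normalization scales, and other 1-D parameters keep an Adam-style optimizer.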