Optimizing LLM Pre-Training: Muon, Latent Attention, and MoE in Practice
Muon is a geometry-aware optimizer that halves training time for large language models. It uses polar decomposition and spectral n...
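For readers curious what the polar-decomposition step looks like in practice, here is a minimal NumPy sketch of the Newton-Schulz orthogonalization commonly associated with Muon. The quintic coefficients and step count below are assumptions taken from the public reference implementation, not from this article; treat it as an illustration, not a production optimizer.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximate the orthogonal factor of the polar decomposition of G.

    Uses a quintic Newton-Schulz iteration that drives all singular
    values of G toward 1 while preserving its singular vectors.
    Coefficients (3.4445, -4.7750, 2.0315) follow the public Muon
    reference implementation (an assumption here).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    # Scale by the Frobenius norm so every singular value is <= 1,
    # which is required for the iteration to converge.
    X = G / (np.linalg.norm(G) + 1e-7)
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T  # work with the wide orientation for a smaller X @ X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transpose:
        X = X.T
    return X
```

In a Muon-style update, this routine would be applied to the momentum buffer of each 2D weight matrix before the learning-rate step, so the update has roughly uniform spectral scale in every direction.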