Multilingual Isn’t Cross-Lingual: Inside My Benchmark of 11 LLMs on Mid- & Low-Resource Languages
I built an evaluation pipeline to measure multilingual and cross-lingual LLM performance across 11 mid- and low-resource languages (e.g., Basque, Kazakh, Amharic, Hausa, Sundanese). The pipeline combines native-language datasets (KazMMLU, BertaQA, BLEnD), zero-shot chain-of-thought prompting, and a new metric, LASS (Language-Aware Semantic Score), which rewards both semantic correctness and answering in the requested language.

Findings:
1. Scale helps, but with diminishing returns.
2. Reasoning-optimized models often beat larger non-reasoning models.
3. The best open-weight model trails the best closed model by roughly 7%.
4. "Multilingual" models underperform on culturally specific cross-lingual tasks once evaluation moves beyond translated English content.

Code & data: see the GitHub link in Reproducibility.
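To illustrate the idea behind a language-aware semantic score, here is a minimal sketch. The article does not give the LASS formula, so everything below is an assumption: the `sentence-transformers` model used for semantic similarity, the `langdetect` language check, and the multiplicative `wrong_lang_penalty` are all placeholders, not the benchmark's actual implementation.

```python
# Minimal sketch of a language-aware semantic score (not the official LASS).
# Assumptions: sentence-transformers for multilingual embedding similarity,
# langdetect for language identification, and a simple multiplicative penalty
# when the response is not in the requested language.
from langdetect import detect
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def lass_score(response: str, reference: str, target_lang: str,
               wrong_lang_penalty: float = 0.5) -> float:
    """Score semantic agreement with the reference answer, discounted
    if the response is not written in the requested language."""
    # Cosine similarity between multilingual sentence embeddings, clipped to [0, 1].
    emb = _model.encode([response, reference], convert_to_tensor=True)
    semantic = max(float(util.cos_sim(emb[0], emb[1])), 0.0)

    # Language check: full credit only if the detected language matches the target.
    try:
        detected = detect(response)
    except Exception:
        detected = "unknown"
    lang_factor = 1.0 if detected == target_lang else wrong_lang_penalty

    return semantic * lang_factor

# Example: a semantically correct answer given in English instead of Basque ("eu")
# is discounted by wrong_lang_penalty.
print(lass_score("The capital is Vitoria-Gasteiz.", "Hiriburua Gasteiz da.", "eu"))
```

The key design point this sketch captures is that semantic overlap alone is not enough: a model that understands the question but replies in English still loses credit, which is what separates a cross-lingual evaluation from a translated-English one.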
Source: HackerNoon