
Nov 20, 2025

Multilingual Isn’t Cross-Lingual: Inside My Benchmark of 11 LLMs on Mid- & Low-Resource Languages

I built an evaluation pipeline for multilingual and cross-lingual LLM performance on 11 mid- and low-resource languages (e.g., Basque, Kazakh, Amharic, Hausa, Sundanese). It combines native-language datasets (KazMMLU, BertaQA, BLEnD), zero-shot chain-of-thought prompts, and a new metric, LASS (Language-Aware Semantic Score), that rewards both semantic correctness and answering in the requested language. Findings: (1) scale helps, but with diminishing returns; (2) reasoning-optimized models often beat larger non-reasoning models; (3) the best open-weight model trails the best closed model by ~7%; (4) "multilingual" models underperform on culturally specific cross-lingual tasks once evaluations move beyond translated English content. Code & data: see the GitHub link in Reproducibility.
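The post does not give the exact LASS formula, so the sketch below is only an illustrative assumption: a semantic score gated by a language-compliance check, where a response in the wrong language forfeits its semantic credit. The scorer and detector names here (`toy_semantic`, `toy_detect`) are hypothetical stand-ins for an embedding-based similarity model and a language identifier.

```python
# Hypothetical sketch of a LASS-style score (not the author's implementation):
# assumed form: LASS = semantic_score, multiplied by lang_penalty if the
# answer is not in the requested language.

def lass(prediction: str, reference: str, target_lang: str,
         semantic_score, detect_lang, lang_penalty: float = 0.0) -> float:
    """Language-Aware Semantic Score (illustrative only).

    semantic_score(prediction, reference) -> float in [0, 1]
    detect_lang(text) -> ISO 639-1 language code
    lang_penalty: fraction of semantic credit retained when the answer
    is not in target_lang (0.0 = full penalty, assumed here).
    """
    sem = semantic_score(prediction, reference)
    in_target = detect_lang(prediction) == target_lang
    return sem if in_target else sem * lang_penalty

# Toy components so the sketch runs end to end:
def toy_semantic(pred: str, ref: str) -> float:
    # Word-overlap Jaccard as a stand-in for embedding similarity.
    a, b = set(pred.lower().split()), set(ref.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def toy_detect(text: str) -> str:
    # Crude stand-in for a language identifier.
    return "eu" if "bai" in text.lower() else "en"

print(lass("bai hori da", "bai hori da", "eu", toy_semantic, toy_detect))  # 1.0
print(lass("yes that is", "bai hori da", "eu", toy_semantic, toy_detect))  # 0.0
```

In this assumed form, a semantically correct answer in the wrong language scores zero, which matches the stated goal of rewarding answers in the requested language rather than translations back into English.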

Source: HackerNoon →

