Multilingual Isn’t Cross-Lingual: Inside My Benchmark of 11 LLMs on Mid- & Low-Resource Languages

I built an evaluation pipeline for multilingual and cross-lingual LLM performance on 11 mid- and low-resource languages (e.g., Basque, Kazakh, Amharic, Hausa, Sundanese). The pipeline combines native-language datasets (KazMMLU, BertaQA, BLEnD), zero-shot chain-of-thought prompts, and a new metric, LASS (Language-Aware Semantic Score), which rewards answers that are both semantically correct and written in the requested language. Findings: (1) scale helps, but with diminishing returns; (2) reasoning-optimized models often beat larger non-reasoning models; (3) the best open-weight model trails the best closed model by about 7%; (4) "multilingual" models underperform on culturally specific cross-lingual tasks once evaluation moves beyond translated English content. Code & data: see the GitHub link in Reproducibility.
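
The post names LASS but does not publish its formula, so here is a minimal sketch of what a language-aware semantic score could look like: a semantic-correctness score in [0, 1] (from embedding similarity or an LLM judge, both assumed) gated by whether the answer comes back in the requested language. Everything below, including the lang_weight knob, is hypothetical rather than the author's implementation.

```python
# Hypothetical sketch of a LASS-style (Language-Aware Semantic Score) metric.
# The post describes LASS as rewarding semantic correctness AND answering in
# the requested language, but gives no formula; this is one plausible shape,
# not the author's actual implementation.

def lass(semantic_score: float, detected_lang: str, target_lang: str,
         lang_weight: float = 0.5) -> float:
    """Combine semantic correctness with language compliance.

    semantic_score: correctness of the answer's meaning, in [0, 1]
        (e.g. from embedding similarity or an LLM judge -- assumed inputs).
    detected_lang / target_lang: ISO 639-1 codes for the language the
        answer was written in and the language the prompt requested.
    lang_weight: share of the score reserved for language compliance
        (hypothetical knob; the real metric may gate or weight differently).
    """
    if not 0.0 <= semantic_score <= 1.0:
        raise ValueError("semantic_score must lie in [0, 1]")
    lang_ok = 1.0 if detected_lang == target_lang else 0.0
    # Interpolate: a semantically perfect answer in the wrong language
    # still forfeits the language-compliance share of the score.
    return (1.0 - lang_weight) * semantic_score \
        + lang_weight * (semantic_score * lang_ok)

# Example: a correct answer returned in English instead of Basque ("eu")
# keeps only half its score under the default weighting.
print(lass(0.9, "en", "eu"))  # -> 0.45
print(lass(0.9, "eu", "eu"))  # -> 0.9
```

One design note on this sketch: multiplying the language term by semantic_score, rather than adding a flat language bonus, keeps a fluent but meaningless answer in the right language from scoring anything at all.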

Source: HackerNoon

