ScatterAI
Issue #7 · March 17, 2026

Low-Resource Languages Expose a Structural Gap in Code LLMs

Research


LLMs score well on Python, Java, and C++. That success has obscured a structural problem: general-purpose languages with thin training corpora aren’t just harder for these models — they expose a failure mode that standard augmentation strategies don’t fix, and in some cases actively worsen.

CangjieBench targets Cangjie, a low-resource general-purpose programming language developed by Huawei, chosen precisely because it sits outside the high-resource cluster that most code benchmarks optimize for. The benchmark contains 248 samples manually translated from HumanEval and ClassEval, covering both Text-to-Code (natural language to code) and Code-to-Code (translation between languages) tasks. Manual translation matters here: automated conversion of benchmark samples is a known contamination vector, and the evaluation is designed to stay clean.
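One way to picture the benchmark's two-track structure is as a typed record per sample. The field names below are illustrative, not taken from the paper; only the task split, sample count, and HumanEval/ClassEval provenance come from the source.

```python
from dataclasses import dataclass
from enum import Enum

class Task(Enum):
    TEXT_TO_CODE = "text-to-code"   # natural-language prompt -> Cangjie
    CODE_TO_CODE = "code-to-code"   # source-language code -> Cangjie

@dataclass
class BenchmarkSample:
    sample_id: str
    task: Task
    origin: str      # "HumanEval" or "ClassEval", per the paper
    prompt: str      # NL description or source-language code
    reference: str   # manually translated Cangjie solution
    tests: str       # executable checks for functional correctness

# The full suite would be 248 such records:
suite: list[BenchmarkSample] = []
```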

The mechanism behind the performance gap is informative. Syntax-Constrained Generation — providing the model with formal grammar rules before generation — produces consistent improvements over direct generation. Models can follow structural rules when given them explicitly. That points to a specific failure mode: the bottleneck is syntactic knowledge, not reasoning capability. The models can reason through the problem; they don’t know what valid Cangjie looks like.

RAG (Retrieval-Augmented Generation) performs worse than expected and in several configurations falls below direct generation baselines. Retrieved code snippets in low-resource settings are sparse and often low quality, meaning the retrieval step injects noise rather than useful signal. RAG’s core assumption, that retrieved examples are informative, breaks down when the corpus is thin.
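The syntax-constrained setup described above amounts to prompt assembly: prepend formal grammar rules, then the task. A minimal sketch, assuming a plain-text prompt interface; the EBNF fragment is illustrative, not Cangjie's actual grammar.

```python
# Sketch: syntax-constrained generation = explicit grammar rules before the task.
# The EBNF fragment below is illustrative, NOT Cangjie's real grammar.
GRAMMAR_SNIPPET = """\
func_decl ::= "func" identifier "(" params? ")" (":" type)? block
block     ::= "{" statement* "}"
var_decl  ::= ("let" | "var") identifier (":" type)? "=" expression
"""

def build_syntax_constrained_prompt(task_description: str,
                                    grammar: str = GRAMMAR_SNIPPET) -> str:
    """Assemble a prompt that gives the model structural rules up front."""
    return (
        "You are writing Cangjie code. Follow this grammar exactly:\n"
        f"{grammar}\n"
        f"Task: {task_description}\n"
        "Return only valid Cangjie code."
    )

prompt = build_syntax_constrained_prompt("Return the sum of a list of integers.")
```

The direct-generation baseline is the same call without the grammar prefix, which is what makes the comparison clean.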

The agent setting shows the most headroom. When models can iterate, execute, and self-correct, performance climbs beyond both RAG and syntax-constrained approaches, though a substantial gap relative to high-resource language performance persists across all settings. No single configuration closes it.
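The iterate/execute/self-correct loop can be sketched as follows. `generate` and `run_tests` are stand-ins for a model call and a Cangjie compile-and-test harness; the toy implementations exist only to make the control flow runnable.

```python
from typing import Callable

def agent_loop(
    task: str,
    generate: Callable[[str, str], str],          # (task, feedback) -> candidate code
    run_tests: Callable[[str], tuple[bool, str]], # code -> (passed, error output)
    max_rounds: int = 4,
) -> tuple[str, bool]:
    """Generate, execute, and self-correct until tests pass or the budget runs out."""
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        code = generate(task, feedback)
        passed, errors = run_tests(code)
        if passed:
            return code, True
        feedback = errors  # compiler/test errors become the next round's hint
    return code, False

# Toy stand-ins: this "model" only emits valid code after seeing an error message,
# mimicking the self-correction that execution feedback enables.
def toy_generate(task: str, feedback: str) -> str:
    return "valid" if feedback else "invalid"

def toy_run_tests(code: str) -> tuple[bool, str]:
    return (True, "") if code == "valid" else (False, "syntax error: unexpected token")

code, ok = agent_loop("sum a list", toy_generate, toy_run_tests)
```

The key design point is that the error channel substitutes for the syntactic knowledge the model lacks up front, which is consistent with the benchmark's finding that execution feedback outperforms both RAG and grammar prompting.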

The limitation is scope: Cangjie is one language, and the 248-sample set, while high quality, is small. Generalization claims to other low-resource general-purpose languages need direct validation.

For teams building or evaluating code assistants in enterprise environments — where proprietary or niche languages are common — the syntax-constrained result is the most immediately actionable finding; the RAG result is the cautionary one: retrieval pipelines don’t transfer cleanly to low-resource settings.

Key takeaways:

- The bottleneck for low-resource languages is syntactic knowledge, not reasoning: supplying formal grammar rules consistently beats direct generation.
- RAG can fall below direct generation baselines when the retrieval corpus is thin; sparse, low-quality snippets inject noise rather than signal.
- Agentic iterate-execute-correct loops show the most headroom, but no single configuration closes the gap with high-resource languages.

Source: CangjieBench: Benchmarking LLMs on a Low-Resource General-Purpose Programming Language