ScatterAI
Issue #5 · March 15, 2026

Research

03 [Code] LLMs That Ace Python Collapse on a General-Purpose Language With Thin Training Data

High benchmark scores on Python and Java mask a structural gap: models degrade sharply on general-purpose languages that simply didn't appear much in pretraining corpora. Most low-resource language research targets DSLs (domain-specific languages) such as SQL or regex, which are narrowly scoped by design. General-purpose languages that happen to be data-scarce get far less attention, even though they impose the same breadth of programming demands as mainstream languages.

CangjieBench tests this directly on Cangjie, a general-purpose language with minimal web presence. 248 samples manually translated from HumanEval and ClassEval cover both Text-to-Code and Code-to-Code tasks. Four settings are tested: Direct Generation, Syntax-Constrained Generation, RAG (Retrieval-Augmented Generation), and Agent. Direct generation collapses across the board. RAG recovers meaningful performance by injecting Cangjie syntax examples at inference time, and the Agent setting pushes further, though neither fully closes the gap to high-resource language performance.

The ceiling here is data, not model architecture or reasoning capacity. Any new general-purpose language launching without a large public code corpus faces this same wall, regardless of how capable the underlying LLM is. For teams evaluating LLMs for proprietary or emerging languages, RAG over a curated syntax reference is the practical first move, ahead of fine-tuning.
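
To make that recommendation concrete, here is a minimal sketch of RAG over a curated syntax reference: a handful of hand-written "syntax cards", a naive keyword-overlap retriever, and a prompt that places the retrieved examples ahead of the task. The card contents, the scoring heuristic, and the prompt wording are illustrative assumptions, not the CangjieBench pipeline; a production setup would swap in an embedding index over the full language manual.

    # Sketch: retrieval-augmented prompting over a curated syntax reference.
    # All names, card contents, and prompt text below are illustrative
    # placeholders, not taken from the CangjieBench paper.
    from dataclasses import dataclass

    @dataclass
    class SyntaxCard:
        topic: str    # what the card demonstrates, e.g. "define a function"
        snippet: str  # a short, correct example in the target language

    # Hand-curated reference; in practice one card per language construct.
    # Snippets are placeholders rather than verified Cangjie code.
    REFERENCE = [
        SyntaxCard("define a function",
                   "<placeholder: minimal function definition>"),
        SyntaxCard("loop over a range",
                   "<placeholder: for-loop over a numeric range>"),
        SyntaxCard("declare a variable",
                   "<placeholder: typed variable declaration>"),
    ]

    def retrieve(task: str, reference: list[SyntaxCard], k: int = 2) -> list[SyntaxCard]:
        # Crude relevance score: word overlap between the task and each card's topic.
        words = set(task.lower().split())
        return sorted(
            reference,
            key=lambda card: len(words & set(card.topic.lower().split())),
            reverse=True,
        )[:k]

    def build_prompt(task: str, reference: list[SyntaxCard]) -> str:
        # Inject the retrieved syntax examples ahead of the task description.
        cards = retrieve(task, reference)
        examples = "\n\n".join(f"// {c.topic}\n{c.snippet}" for c in cards)
        return (
            "You are writing code in a language with little public training data.\n"
            "Follow the reference syntax below exactly.\n\n"
            f"{examples}\n\n"
            f"Task: {task}\n"
            "Respond with code only."
        )

    if __name__ == "__main__":
        print(build_prompt("Define a function that adds two integers", REFERENCE))

The structural point is that the retrieved snippets put the unfamiliar syntax directly in the context window, so the model is not relying on whatever it memorized about the language during pretraining.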

Key takeaways:

- Models that ace Python and Java collapse under direct generation on a general-purpose language that is barely represented in pretraining data.
- Injecting curated syntax examples at inference time (RAG) recovers much of the lost performance, and the Agent setting pushes further, but neither closes the gap to high-resource languages.
- The bottleneck is data, not model capability: for proprietary or emerging languages, RAG over a curated syntax reference is the practical first move, ahead of fine-tuning.

Source: CangjieBench: Benchmarking LLMs on a Low-Resource General-Purpose Programming Language