ScatterAI
Issue #5 · March 14, 2026

Text-to-image models fail at complex text because glyph templates were never in the loop

Research

01 [Image Gen] Text-to-image models fail at complex text because glyph templates were never in the loop

Text-to-image models have gotten remarkably capable at visual composition, lighting, and style. Ask one to render a mathematical formula or a string of complex characters, and it falls apart. The failure mode is specific: these prompts sit outside the training distribution, so the model’s instruction-following breaks before generation even begins.

GlyphBanana bypasses this by injecting glyph templates (pre-rendered character shapes) directly into two places the model already attends to: the latent space (compressed internal representation where the model processes information) and the attention maps (how the model decides what to focus on). An agentic workflow then iterates, checking output quality and refining until the rendered text converges. The pipeline calls auxiliary tools at each step rather than relying on a single forward pass to get it right.
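The paper's actual injection code isn't public in the abstract, but the agentic outer loop can be sketched in a few lines. This is a minimal illustration under stated assumptions: `render_with_glyph_guidance` is a hypothetical stand-in for the guided diffusion call, and the accuracy model is simulated, not real.

```python
from dataclasses import dataclass


@dataclass
class RenderResult:
    image: str            # stand-in for pixel data
    text_accuracy: float  # fraction of target glyphs rendered correctly


def render_with_glyph_guidance(prompt: str, glyph_template: str,
                               strength: float) -> RenderResult:
    # Stub for the diffusion call: in the real pipeline the glyph template
    # would be injected into the latents and attention maps. Here accuracy
    # is simulated as improving with guidance strength.
    return RenderResult(image=f"<image:{prompt}>",
                        text_accuracy=min(1.0, 0.4 + strength))


def agentic_render(prompt: str, glyph_template: str,
                   threshold: float = 0.95, max_iters: int = 5) -> RenderResult:
    """Re-render iteratively, tightening glyph guidance until the text converges."""
    strength = 0.2
    result = render_with_glyph_guidance(prompt, glyph_template, strength)
    for _ in range(max_iters):
        if result.text_accuracy >= threshold:
            break  # output quality check passed; stop refining
        strength += 0.2  # the agent's refinement step: increase template influence
        result = render_with_glyph_guidance(prompt, glyph_template, strength)
    return result
```

The point of the sketch is the control flow: quality is checked after every pass and the template's influence is dialed up until convergence, rather than hoping a single forward pass gets the glyphs right.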

The approach is training-free, meaning it drops into existing T2I (Text-to-Image) models without retraining or fine-tuning. GlyphBanana also ships with a dedicated benchmark for complex characters and formulas, covering a gap that generic text-rendering evaluations leave open. For teams building design tools, document generation pipelines, or any product that needs reliable formula or CJK character rendering on top of a diffusion model (generates output by gradually removing noise), this is a direct plug-in path.

Source: GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows


02 [Evaluation] LLMs Can Pass Wine Theory But Fail at the Glass

Cultural and multilingual benchmarks almost always test knowledge that lives comfortably in text — historical facts, social norms, language conventions. SommBench tests something structurally different: whether a model trained exclusively on text can emulate expert sensory judgment in a domain where expertise is built through smell and taste.

The benchmark covers three tasks. Wine Theory Question Answering (WTQA) tests declarative knowledge — the kind that exists in textbooks and can be memorized from text. Wine Feature Completion (WFC) asks models to infer sensory characteristics from partial wine profiles, bridging textual description and perceptual inference. Food-Wine Pairing (FWP) requires integrating flavor, texture, and cultural convention simultaneously. The gap between WTQA performance and WFC/FWP performance is the signal worth watching: a model that aces theory but collapses on sensory completion reveals exactly where textual grounding stops substituting for embodied experience.
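That theory-versus-sensory gap is a one-line computation once per-task scores exist. A minimal sketch with hypothetical scores (the abstract reports no numbers; the function name and values here are illustrative only):

```python
def capability_gap(scores: dict[str, float]) -> float:
    """Gap between declarative theory (WTQA) and sensory tasks (WFC, FWP).

    A large positive gap flags a model that has memorized wine text
    but cannot do perceptual inference from partial profiles.
    """
    sensory = (scores["WFC"] + scores["FWP"]) / 2
    return scores["WTQA"] - sensory


# Hypothetical accuracies for one model, not results from the paper.
model_scores = {"WTQA": 0.88, "WFC": 0.61, "FWP": 0.57}
gap = capability_gap(model_scores)  # large gap: theory aced, sensation missed
```

Averaging WFC and FWP is one defensible choice; reporting the two sensory tasks separately would also work, since WFC isolates perceptual inference while FWP adds cultural convention on top.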

The limitation is real. Wine expertise is also culturally and linguistically distributed, with regional vocabularies for describing tannin, acidity, and finish differing substantially across French, Italian, and Japanese sommelier traditions. A model fluent in English wine criticism may fail not because it lacks sensory grounding, but because it lacks multilingual sensory grounding. The benchmark’s multilingual design is the right call, but performance gaps across languages will be hard to disentangle from gaps in sensory reasoning itself.

Source: SommBench: Assessing Sommelier Expertise of Language Models


03 [RAG] Brain MRI Diagnosis Models Hallucinate Because They Skip the Measurements

Current VLMs (Vision-Language Models) applied to brain MRI produce fluent diagnostic summaries with a structural flaw: the language output is disconnected from the underlying volumetric data. Classifiers collapse a full scan into a single label. Volumetric pipelines produce measurements nobody interprets. VLMs fill the gap with plausible-sounding text that may have no grounding in what the scan actually shows.

LoV3D (Longitudinal Volume 3D) routes around this by forcing the diagnostic chain to pass through the numbers. The pipeline first extracts region-level volumetric measurements from longitudinal T1-weighted brain MRI, then compares those measurements to a prior scan before generating any text. The language model reasons only about quantified anatomical change — hippocampal volume loss, ventricular expansion, cortical thinning deltas — rather than raw image pixels. The final three-class output (Cognitively Normal, Mild Cognitive Impairment, or Dementia) is synthesized from that structured intermediate, making the reasoning auditable at each step.
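The structured intermediate is the part worth copying. A minimal sketch, assuming hypothetical names throughout (region names, thresholds, and volumes are illustrative, not LoV3D's actual schema): regional volumes from two scans become percentage deltas, and only findings that clear a threshold are handed to the language model.

```python
from dataclasses import dataclass


@dataclass
class RegionDelta:
    """Quantified change in one brain region between two longitudinal scans."""
    region: str
    baseline_ml: float
    followup_ml: float

    @property
    def pct_change(self) -> float:
        return 100.0 * (self.followup_ml - self.baseline_ml) / self.baseline_ml


def structured_findings(deltas: list[RegionDelta],
                        threshold_pct: float = 3.0) -> list[str]:
    """Turn quantified regional change into auditable findings.

    The language model reasons over these strings, never the raw pixels,
    so every claim in the summary traces back to a measurement.
    """
    findings = []
    for d in deltas:
        if d.pct_change <= -threshold_pct:
            findings.append(f"{d.region}: {d.pct_change:.1f}% volume loss")
        elif d.pct_change >= threshold_pct:
            findings.append(f"{d.region}: {d.pct_change:.1f}% volume increase")
    return findings


# Illustrative values only, not clinical data.
scan_pair = [
    RegionDelta("hippocampus", baseline_ml=3.5, followup_ml=3.2),
    RegionDelta("lateral ventricles", baseline_ml=25.0, followup_ml=27.5),
]
```

The design choice is that the threshold and the per-region deltas are explicit and inspectable, which is exactly what makes the downstream diagnostic text auditable, and exactly what a raw-pixel VLM cannot offer.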

The limitation is real: the abstract reports no performance numbers, and the pipeline depends on accurate volumetric segmentation upstream, so garbage measurements propagate directly into the diagnostic summary. For teams building clinical AI pipelines, the design pattern matters regardless: grounding language generation in structured intermediate representations is a generalizable defense against hallucination in high-stakes medical contexts.

Source: LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments