ScatterAI
Issue #2 · March 12, 2026

Diffusion Models Don’t Fail at Text Because They Can’t Reason — They Fail Because They’ve Never Seen the Input

Research

01 [Image Gen] Diffusion Models Don’t Fail at Text Because They Can’t Reason — They Fail Because They’ve Never Seen the Input

Text-to-image models collapse on complex characters and mathematical formulas not from insufficient reasoning capacity but from a distribution gap: prompts involving LaTeX-style notation or non-Latin scripts fall outside anything the model was trained to handle. No amount of scaling a standard T2I pipeline closes that gap.

GlyphBanana bypasses the distribution problem entirely by injecting glyph templates directly into the latent space and attention maps, rather than asking the model to hallucinate correct letterforms from text descriptions alone. Think of it as handing the model a stencil instead of a dictionary definition. An agentic workflow wraps this injection with iterative refinement — the model generates, evaluates, and corrects across multiple passes using auxiliary tools. The approach is training-free and plugs into existing T2I backbones without retraining.
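To make the mechanism concrete, here is a minimal Python sketch of the inject, generate, evaluate, correct loop. None of these names come from the paper; rasterize_glyph, inject_template, and glyph_score are toy stand-ins for a real font rasterizer, a frozen T2I backbone, and an OCR-style checker.

```python
import numpy as np

# Hypothetical sketch only: GlyphBanana's real interfaces are not shown in
# the summary above. Toy stand-ins mark where real components would sit.

def rasterize_glyph(text: str, size: int = 32) -> np.ndarray:
    """Stand-in for a font rasterizer: a deterministic binary stencil of
    the target string (a real system renders actual letterforms here)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return (rng.random((size, size)) > 0.5).astype(np.float32)

def inject_template(latent: np.ndarray, stencil: np.ndarray,
                    strength: float = 0.6) -> np.ndarray:
    """Blend the stencil into the latent where text should appear, instead
    of asking the model to hallucinate letterforms from the prompt alone.
    (The paper also injects into attention maps; same idea, different site.)"""
    return (1.0 - strength) * latent + strength * stencil

def render(latent: np.ndarray) -> np.ndarray:
    """Stand-in for the frozen T2I backbone's decode step."""
    return np.clip(latent, 0.0, 1.0)

def glyph_score(image: np.ndarray, stencil: np.ndarray) -> float:
    """Stand-in evaluator (an OCR check in a real agentic loop)."""
    return 1.0 - float(np.abs(image - stencil).mean())

def agentic_render(text: str, max_iters: int = 4, target: float = 0.95):
    """Generate, evaluate, correct: each pass pulls the latent toward the
    stencil, so fidelity climbs without any retraining."""
    stencil = rasterize_glyph(text)
    latent = np.random.default_rng(0).random(stencil.shape).astype(np.float32)
    image = render(latent)
    for i in range(max_iters):
        latent = inject_template(latent, stencil)  # structural prior in
        image = render(latent)                     # generate
        score = glyph_score(image, stencil)        # evaluate
        print(f"pass {i}: glyph fidelity {score:.3f}")
        if score >= target:
            break
        latent = image                             # correct and retry
    return image

agentic_render("E = mc^2")
```

The structural point the sketch preserves: the stencil enters the latent directly, so fidelity improves with each corrective pass rather than depending on the model having seen the glyph distribution.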

The catch: training-free agentic loops add inference latency with each refinement iteration, and the quality ceiling still depends on how well the glyph templates themselves are sourced and aligned. For teams building document generation, scientific figure automation, or multilingual design tooling, this is a practical unblock — inject structure where the model is blind, rather than waiting for a model that somehow learns every glyph distribution from scratch.

Key takeaways:

- Text rendering failures in T2I models are a distribution gap, not a reasoning gap; scaling a standard pipeline does not close it.
- GlyphBanana injects glyph templates into latents and attention maps, training-free, on top of existing backbones.
- Costs: added inference latency per refinement pass, and a quality ceiling set by how well the templates are sourced and aligned.

Source: GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows


02 [Evaluation] LLMs Can Pass Wine Theory Exams But Fail at the Sensory Judgment That Actually Defines Expertise

Text-heavy wine knowledge — grape varieties, regional classifications, production regulations — is well represented in training corpora. It is tempting to assume that means LLMs can reason like sommeliers; the first sensory task breaks that assumption. SommBench separates declarative wine knowledge from perceptual judgment, and the gap between them is the finding.

Three tasks, increasing sensory demand. Wine Theory Question Answering (WTQA) tests codified knowledge directly available in text. Wine Feature Completion (WFC) requires inferring a wine’s sensory profile — acidity, tannin, aroma — from partial descriptors. Food-Wine Pairing (FWP) demands integrating sensory judgment across two domains simultaneously. Models that score well on WTQA do not reliably carry that performance through WFC and FWP. The multilingual structure adds a second axis: cultural encoding of sensory vocabulary varies by language, and model performance degrades unevenly across languages — not proportionally to general multilingual capability.
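A sketch of what keeping the three tasks separate looks like in an evaluation harness. This is a hypothetical schema, not SommBench's published format; the scorer and items are toys chosen to show why per-task, per-language aggregation matters.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical harness: SommBench's real schema and metrics are not shown
# above. The point is the aggregation, never collapsing tasks or languages.

@dataclass
class Item:
    task: str       # "WTQA" (theory), "WFC" (feature completion), "FWP" (pairing)
    lang: str       # language code, since degradation is uneven across languages
    prompt: str
    reference: str

def score(prediction: str, reference: str) -> float:
    """Toy exact-match scorer; a real harness would use task-specific metrics."""
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model: Callable[[str], str], items: list[Item]) -> dict:
    """Report mean score per (task, language) bucket so declarative knowledge
    and sensory judgment are never averaged into a single number."""
    buckets: dict[tuple[str, str], list[float]] = {}
    for item in items:
        buckets.setdefault((item.task, item.lang), []).append(
            score(model(item.prompt), item.reference))
    return {k: sum(v) / len(v) for k, v in buckets.items()}

items = [
    Item("WTQA", "en", "Which grape is Barolo made from?", "Nebbiolo"),
    Item("WFC", "en", "Young Barolo: tannin level?", "high"),
    Item("FWP", "en", "Braised beef: Barolo or Pinot Grigio?", "Barolo"),
]
# A model that only recalls facts aces theory and fails both sensory tasks.
print(evaluate(lambda prompt: "Nebbiolo", items))
```

The stand-in model knows one WTQA fact and nothing else, so it scores 1.0 on theory and 0.0 on both sensory tasks, which is exactly the dissociation the benchmark is built to surface.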

The limitation is real and structural: there is no ground-truth sensory signal in any LLM’s training data, only human textual descriptions of sensory experience. SommBench cannot close that gap — it measures how far textual grounding stretches before it fails. For teams building LLM applications in food, beverage, fragrance, or any domain where expert judgment is fundamentally embodied, this is a diagnostic tool worth running before deploying.

Key takeaways:

- Strong scores on codified wine theory (WTQA) do not carry over to sensory inference (WFC) or cross-domain pairing (FWP).
- Performance degrades unevenly across languages, not in proportion to a model's general multilingual capability.
- No LLM has ground-truth sensory signal in its training data; SommBench measures how far textual grounding stretches, not how to extend it.

Source: SommBench: Assessing Sommelier Expertise of Language Models


03 [Grounding] Brain MRI Diagnosis Models Hallucinate Because They Skip the Measurements

VLMs applied to brain MRI produce fluent diagnostic summaries. The problem: they skip the intermediate step of actually measuring anything. Without grounded volumetric evidence, fluent output and accurate output are two different things.

LoV3D inserts a mandatory measurement layer between raw 3D MRI and diagnostic conclusion. The pipeline extracts region-level anatomical volumes, runs explicit longitudinal comparison against a prior scan, then conditions the final three-class diagnosis (Cognitively Normal, MCI, or Dementia) and narrative summary on those measurements. The chain is: perceive → measure → compare → conclude. Each step is forced before the next unlocks. Hallucination gets harder when every claim must trace back to a regional volume number.
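A minimal sketch of that measure-before-conclude discipline, with hypothetical region names, thresholds, and decision rule (LoV3D's actual segmentation and diagnostic conditioning are not specified above). Each function consumes the previous step's output, so the conclusion cannot be produced without the measurements.

```python
import numpy as np

# Hypothetical sketch: region names, cutoffs, and the decision rule are
# illustrative, not LoV3D's published interface.

REGIONS = ("hippocampus_l", "hippocampus_r", "ventricles")

def measure(masks: dict[str, np.ndarray]) -> dict[str, float]:
    """Perceive -> measure: region volume in voxels from segmentation masks."""
    return {r: float(masks[r].sum()) for r in REGIONS}

def compare(current: dict[str, float], prior: dict[str, float]) -> dict[str, float]:
    """Measure -> compare: longitudinal fractional change per region."""
    return {r: (current[r] - prior[r]) / prior[r] for r in REGIONS}

def conclude(change: dict[str, float]) -> tuple[str, str]:
    """Compare -> conclude: the label and narrative are conditioned on the
    measured changes, so every claim traces back to a volume number."""
    atrophy = -(change["hippocampus_l"] + change["hippocampus_r"]) / 2
    if atrophy > 0.05:          # illustrative cutoffs, not clinical ones
        label = "Dementia"
    elif atrophy > 0.02:
        label = "MCI"
    else:
        label = "Cognitively Normal"
    narrative = (f"Mean hippocampal volume change {-atrophy:+.1%} versus the "
                 f"prior scan; assessment: {label}.")
    return label, narrative

# Each step consumes the previous step's output; skipping measurement is
# structurally impossible, which is the anti-hallucination property.
rng = np.random.default_rng(0)
masks_prior = {r: rng.random((8, 8, 8)) > 0.50 for r in REGIONS}
masks_now = {r: rng.random((8, 8, 8)) > 0.55 for r in REGIONS}
print(conclude(compare(measure(masks_now), measure(masks_prior))))
```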

The limitation is real: this is a pipeline paper evaluated on a specific neurological progression task, and grounding quality depends entirely on how well the volumetric extraction step performs on out-of-distribution scanners or acquisition protocols. For teams building clinical AI on top of VLMs, the takeaway is direct: any diagnostic language model without an explicit measurement grounding step is generating plausible-sounding output, not evidence-based output.

Key takeaways:

- Fluent diagnostic summaries without volumetric grounding are plausible text, not evidence.
- LoV3D forces a perceive → measure → compare → conclude chain, so every claim traces back to a regional volume number.
- Generalization hinges on the volumetric extraction step holding up on out-of-distribution scanners and acquisition protocols.

Source: LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments