Also Worth Noting
04 [RAG] New Attention Trick Stops Deep AI Models From Forgetting Early Insights A new attention mechanism called MoDA lets each layer of an AI model selectively look back at the outputs of earlier layers, preventing useful information from getting washed out as it travels through a very deep network. The deeper a model gets, the harder it becomes to preserve signals formed early on: existing designs overwrite their internal representation with every additional layer, a fundamental structural flaw. Models built with MoDA could be made significantly deeper and more powerful without the usual quality trade-off, meaning smarter AI assistants and tools without needing to start from scratch. link
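The item doesn't spell out MoDA's exact mechanics, so here is a minimal PyTorch sketch of the general idea: each layer attends, per token, over the outputs of every earlier layer instead of seeing only its immediate predecessor. The class name, shapes, and residual mixing are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class DepthAttention(nn.Module):
    """Sketch: each layer attends over earlier layers' outputs.
    Hypothetical illustration, not MoDA's real API."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, history):
        # x:       (batch, seq, dim)  current layer input
        # history: list of (batch, seq, dim) outputs from all earlier layers
        h = torch.stack(history, dim=2)          # (batch, seq, depth, dim)
        q = self.q(x).unsqueeze(2)               # (batch, seq, 1, dim)
        k, v = self.k(h), self.v(h)
        attn = (q * k).sum(-1) * self.scale      # score each earlier layer per token
        w = attn.softmax(dim=-1).unsqueeze(-1)   # (batch, seq, depth, 1)
        return x + (w * v).sum(dim=2)            # mix early-layer signal back in

# Toy usage: an 8-block stack where every block can read all earlier outputs,
# so a signal formed in layer 1 is never more than one attention hop away.
blocks = [DepthAttention(64) for _ in range(8)]
x = torch.randn(2, 16, 64)
history = [x]
for block in blocks:
    x = block(x, history)
    history.append(x)
```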
05 [Evaluation] New Benchmark Tests AI Agents on Real Enterprise Workflows Most AI benchmarks test chatbots on simple, one-shot tasks — but real office work involves multi-step plans where earlier actions permanently change what’s possible later. EnterpriseOps-Gym is a new testing environment that simulates exactly that: complex professional workflows with persistent state changes and strict access controls that mirror actual workplace systems. This gives companies a much more honest way to measure whether an AI agent is truly ready to handle real business operations, not just perform well on artificial tests. link
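To make "persistent state changes and strict access controls" concrete, here is a toy environment in that spirit. All names, the role scheme, and the step interface are invented for illustration; the summary does not describe EnterpriseOps-Gym's real API.

```python
from dataclasses import dataclass, field

@dataclass
class ToyEnterpriseEnv:
    """Hypothetical mini-environment: persistent state plus role-based access."""
    role: str = "analyst"
    records: dict = field(default_factory=dict)
    deleted: set = field(default_factory=set)

    PERMISSIONS = {"analyst": {"read"}, "admin": {"read", "write", "delete"}}

    def step(self, action, key, value=None):
        if action not in self.PERMISSIONS[self.role]:
            return {"ok": False, "error": f"role '{self.role}' may not {action}"}
        if key in self.deleted:  # earlier deletes are permanent: no reset between steps
            return {"ok": False, "error": "record permanently deleted"}
        if action == "read":
            return {"ok": True, "value": self.records.get(key)}
        if action == "write":
            self.records[key] = value
            return {"ok": True}
        if action == "delete":
            self.records.pop(key, None)
            self.deleted.add(key)  # a state change later steps cannot undo
            return {"ok": True}
        return {"ok": False, "error": "unknown action"}

env = ToyEnterpriseEnv(role="admin")
env.step("write", "invoice-7", {"total": 120})
env.step("delete", "invoice-7")
print(env.step("read", "invoice-7"))  # {'ok': False, 'error': 'record permanently deleted'}
```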
06 [Evaluation] LLMs That Optimize Themselves Using Feedback and Rewards An AI system called POLCA lets a language model act as its own optimizer, automatically improving complex AI pipelines — like multi-step agents or prompts — by learning from numerical scores and written feedback. Getting this right is genuinely hard because the search space is vast and the feedback is noisy, making it easy for naive approaches to chase dead ends instead of real improvements. Anyone building AI products that require tedious manual prompt tuning or agent debugging could use this to automate that iteration loop entirely. link
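The summary describes the loop only at a high level; a stripped-down version of "a model optimizes its own pipeline from numeric scores plus written feedback" might look like the sketch below. The llm() stub and function names are placeholders, not POLCA's interface.

```python
def llm(prompt: str) -> str:
    """Stub for any chat-completion call; swap in a real client here."""
    raise NotImplementedError

def optimize_prompt(candidate: str, evaluate, rounds: int = 5):
    # evaluate(candidate) -> (numeric score, written critique)
    best, best_score = candidate, float("-inf")
    for _ in range(rounds):
        score, feedback = evaluate(candidate)
        if score > best_score:                   # keep the best seen: feedback is
            best, best_score = candidate, score  # noisy, later rounds can regress
        candidate = llm(
            f"Current prompt:\n{candidate}\n\n"
            f"Score: {score}\nCritique: {feedback}\n\n"
            "Rewrite the prompt to address the critique while keeping what works."
        )
    return best, best_score
```

Returning the best-so-far rather than the latest candidate is one simple guard against the "chasing dead ends" failure the item mentions.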
07 [Code] Two Rival AIs That Force Each Other to Write Better Code A system called Code-A1 pits two separate AI models against each other — one writes code, the other writes tests to try to break it — and each gets smarter by trying to outsmart the other. Keeping them separate prevents a known failure mode where a single model quietly “cheats” by writing tests it knows its own code will pass, making progress look real when it isn’t. This kind of adversarial setup could mean more reliable AI coding tools that actually catch real bugs rather than just appearing to pass quality checks. link
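A bare-bones version of that adversarial loop, with two separate model handles so the tester never grades code it wrote itself, could look like the following. The function names and the pytest-based harness are assumptions for illustration, not Code-A1's actual setup.

```python
import subprocess
import tempfile
from pathlib import Path

def run_tests(code: str, tests: str) -> bool:
    """Run the generated tests against the generated code in a scratch directory."""
    with tempfile.TemporaryDirectory() as d:
        Path(d, "solution.py").write_text(code)
        Path(d, "test_solution.py").write_text(tests)
        result = subprocess.run(["pytest", "-q", d], capture_output=True)
        return result.returncode == 0

def adversarial_round(coder_llm, tester_llm, task: str, code: str):
    # The tester is a *different* model, so it cannot write tests it
    # secretly knows the coder's solution will pass.
    tests = tester_llm(f"Task: {task}\nWrite pytest tests that expose bugs in:\n{code}")
    if not run_tests(code, tests):
        code = coder_llm(f"Task: {task}\nFix this code to pass these tests:\n{tests}\n\n{code}")
    return code, tests
```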
08 [Safety] Teaching AI to Judge Which Research Ideas Are Worth Pursuing Scientists don’t just execute experiments; they instinctively know which ideas are worth pursuing, and a new system called Reinfo tries to teach that same instinct to AI. Most AI research tools focus on carrying out experiments faster or better, but judging which questions matter in the first place is a harder, more human skill that has largely been ignored. If AI can reliably separate high-potential ideas from dead ends, it could dramatically speed up discovery by pointing human researchers toward work that actually moves the needle. link
09 [Evaluation] Benchmark Tests AI Agents on Evolving, Real-World Codebases Most AI coding tests give agents a single problem to solve and call it done, but EvoClaw instead challenges them to maintain and evolve software over time — handling the messy, compounding complexity that real projects actually accumulate. Building this kind of benchmark is hard because it requires capturing genuine temporal dependencies, where earlier decisions create technical debt that later tasks must navigate. Any team deploying AI agents to manage long-running software projects now has a more honest way to measure whether those agents can handle the job. link
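The key mechanical difference from one-shot benchmarks is that the repository state carries over between tasks, so a harness in this spirit is roughly a fold over an ordered task list. Everything named here is an assumption; the summary does not describe EvoClaw's real interface.

```python
def evaluate_on_evolving_repo(agent, tasks, repo_state):
    """Sequential evaluation sketch: no resets between tasks, so an agent's
    earlier edits (and any technical debt they create) constrain later tasks."""
    passed = 0
    for task in tasks:                               # tasks are ordered in time
        repo_state = agent(task["goal"], repo_state) # agent mutates the codebase
        passed += task["check"](repo_state)          # graded on the *evolved* repo
    return passed / len(tasks)                       # fraction of checkpoints passed
```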