ScatterAI
Issue #7 · March 17, 2026

Ensemble weighting that punishes disagreement outperforms static mixing in non-stationary sequential tasks

Research

01 [Evaluation] Ensemble weighting that punishes disagreement outperforms static mixing in non-stationary sequential tasks

Static ensemble weights are a known liability when the environment shifts — the model that dominated last week may be the worst performer today. The standard fix is offline reweighting, but that just bakes in a different kind of staleness. EARCP (Ensemble Auto-Régulé par Cohérence et Performance) treats the weighting problem as an online learning problem and adds a second signal most ensembles ignore entirely: how much the component models agree with each other right now.

The mechanism runs in two coupled loops. The first is a multiplicative weight update, a classical online learning algorithm that multiplies each expert's weight by a factor proportional to its recent loss. The novel addition is a coherence-based regularization term that penalizes models whose predictions diverge from the current ensemble consensus. When a single expert starts drifting from the group, its weight decays faster than its raw performance alone would justify. The combined update comes with formal regret bounds: the gap between EARCP's cumulative loss and that of the best fixed-weight ensemble in hindsight grows sublinearly in time. That guarantee holds even in non-stationary environments where the optimal expert changes over time.
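The paper's exact update rule is not given in the abstract, but the two coupled signals can be sketched as a single exponential-weights step where each expert is charged both its own loss and a penalty for diverging from the consensus. Everything below (the squared-loss choice, the `eta` and `lam` parameters, the additive combination) is an illustrative assumption, not EARCP's published formulation:

```python
import numpy as np

def earcp_style_update(weights, preds, y, eta=0.5, lam=0.3):
    """One hypothetical EARCP-style step (a sketch, not the paper's exact rule).

    weights: current expert weights, nonnegative, summing to 1
    preds:   each expert's prediction for this round
    y:       the observed target
    eta:     learning rate for the multiplicative update (assumed)
    lam:     strength of the coherence/disagreement penalty (assumed)
    """
    consensus = weights @ preds                 # current ensemble prediction
    losses = (preds - y) ** 2                   # per-expert squared loss
    divergence = (preds - consensus) ** 2       # disagreement with consensus
    # Each expert is penalized both for its own error and for drifting
    # from the group, so a drifting expert's weight decays faster than
    # its raw loss alone would justify.
    new_w = weights * np.exp(-eta * (losses + lam * divergence))
    return new_w / new_w.sum()                  # renormalize to a distribution
```

With three experts predicting `[1.0, 1.1, 5.0]` against a target of `1.0`, the third expert's weight drops fastest: it is charged both a large loss and a large divergence term, which is precisely the coupling the paper's coherence regularizer introduces.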

The limitation is real: the paper is a formalization and theoretical contribution, not a large-scale empirical benchmark. Performance numbers across diverse real-world sequential decision tasks are absent from the abstract, which means practitioners cannot yet read off how large the coherence penalty term’s contribution is relative to the base multiplicative update. The coherence signal is only useful if component models are genuinely heterogeneous — if all experts share an architecture or training distribution, consensus becomes a noisy proxy for majority error rather than a robustness signal.

For teams running multi-model inference pipelines where task distribution shifts over time (retrieval-augmented pipelines across shifting document corpora, or multi-agent routing where some specialists degrade as the query mix evolves), the coherence signal is worth evaluating. A model that looks individually competent but consistently disagrees with the rest of the ensemble is often the first to fail on distribution shifts. Building that signal into the weighting mechanism rather than monitoring it separately is the practical move.
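For teams that want to evaluate the signal before wiring it into the weighting mechanism itself, the disagreement of each expert with the weighted consensus is cheap to compute per step. This is a hypothetical monitoring diagnostic suggested by the article's argument, not an API from the paper:

```python
import numpy as np

def disagreement_scores(weights, preds):
    """Per-expert squared divergence from the weighted ensemble consensus.

    A model whose score stays persistently high while its individual loss
    looks fine is the kind of expert the article flags as first to fail
    under distribution shift. (Illustrative diagnostic, assumed names.)
    """
    consensus = weights @ preds
    per_expert = (preds - consensus) ** 2
    total = weights @ per_expert    # weighted disagreement of the ensemble
    return per_expert, total
```

Tracking `per_expert` over a rolling window gives the standalone monitor; EARCP's contribution is to fold that same quantity directly into the weight update.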

Key takeaways:

- EARCP couples a multiplicative weight update with a coherence penalty, so an expert that drifts from the ensemble consensus loses weight faster than its raw loss alone would justify.
- The update carries sublinear regret guarantees against the best fixed-weight ensemble in hindsight, including in non-stationary environments.
- The contribution is theoretical: empirical benchmarks are absent, and the coherence signal is only informative when the component models are genuinely heterogeneous.

Source: EARCP: Self-Regulating Coherence-Aware Ensemble Architecture for Sequential Decision Making