Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing prompt distributional overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to 11.8 percentage points over TextGrad and 16.5 percentage points over REVOLVE.
As feedback-driven prompt optimizers iterate, they tend to append rules, exceptions, and stylistic constraints that fit a handful of training examples. The resulting prompts swell in length while their rules drift toward narrow, sample-specific patches. We call this failure mode prompt distributional overfitting and frame it as representational inefficiency: the coupled growth of capacity cost and scope narrowness.
Problem illustration. Conventional methods produce long prompts saturated with narrow rules that degrade on OOD inputs (left). TextReg yields compact prompts of broadly applicable rules for stronger OOD generalization (right).
Capacity cost. Longer prompts consume context budget and make it harder for the model to reliably locate and leverage relevant instructions.
Scope narrowness. Accumulated rules apply only to a small subset of inputs, functioning as ad-hoc patches rather than reusable task principles.
Representational inefficiency. 𝓘(p) = C(p) · W(p), where C(p) is the capacity cost and W(p) the scope narrowness of prompt p. Their coupled growth wastes prompt capacity and drives OOD failure.
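To make the product form concrete, here is a minimal sketch that scores a prompt with two stand-in proxies. Both proxies are assumptions made for illustration and are not the paper's definitions: capacity cost is approximated by a whitespace token count relative to a hypothetical budget, and scope narrowness by one minus the mean fraction of inputs each rule applies to. The `Rule` class, `coverage` field, and `budget_tokens` parameter are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    coverage: float  # assumed: fraction of inputs the rule applies to, in [0, 1]


def capacity_cost(rules, budget_tokens=200):
    """Hypothetical proxy for C(p): whitespace token count over a token budget."""
    n_tokens = sum(len(r.text.split()) for r in rules)
    return n_tokens / budget_tokens


def scope_narrowness(rules):
    """Hypothetical proxy for W(p): one minus the mean rule coverage."""
    if not rules:
        return 0.0
    return 1.0 - sum(r.coverage for r in rules) / len(rules)


def inefficiency(rules, budget_tokens=200):
    """Product form I(p) = C(p) * W(p), using the two proxies above."""
    return capacity_cost(rules, budget_tokens) * scope_narrowness(rules)


# One broad rule versus the same rule plus two narrow, sample-specific patches.
compact = [Rule("Track every swap in order and report the final holder.", 0.9)]
patched = compact + [
    Rule("If Alice starts with the red ball and swaps twice, answer Bob.", 0.05),
    Rule("When three swaps involve Claire, recheck the second swap.", 0.1),
]
```

Under these proxies the patched prompt is both longer and narrower, so its score grows in both factors at once, which is the coupled growth the measure is meant to capture.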
We decompose every prompt update into a purified task gradient that drives empirical improvement, and a regularization gradient that opposes the growth of representational inefficiency. TextReg realizes this view through three stages, executed in sequence per optimization step.
Overview of TextReg. (Left) Dual-Evidence Gradient Purification filters raw task gradients via local batch and RuleBank recurrence evidence. (Middle) Semantic Edit Regularization diagnoses capacity and scope degradation and synthesizes the regularization gradient. (Right) Regularization-Guided Prompt Update selects the task-faithful candidate most compatible with the regularization signal.
Dual-Evidence Gradient Purification filters raw task gradients at the source by combining local batch evidence (how case-specific a proposed rule is) with global recurrence evidence from a persistent RuleBank. Only gradients that are broadly applicable and recurrently supported survive; narrow patches and purely stylistic edits are rejected.
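The two evidence checks can be sketched as a simple filter. This is an illustrative sketch under stated assumptions, not the paper's implementation: the `RuleBank` class shown here is a plain counter, the candidate fields (`key`, `supported_cases`, `stylistic`) are hypothetical, and the thresholds `min_batch_support` and `min_recurrence` are made-up values.

```python
from collections import Counter


class RuleBank:
    """Hypothetical persistent store counting how often a rule is proposed."""

    def __init__(self):
        self.counts = Counter()

    def record(self, rule_key):
        self.counts[rule_key] += 1

    def recurrence(self, rule_key):
        return self.counts[rule_key]


def purify(candidates, bank, batch_size, min_batch_support=0.5, min_recurrence=2):
    """Keep candidate rules that pass both the local and the global check.

    Each candidate is a dict with:
      'key'             - identifier of the proposed rule
      'supported_cases' - number of batch examples the rule would help
      'stylistic'       - True if the edit only changes style
    """
    kept = []
    for c in candidates:
        if c["stylistic"]:
            continue  # purely stylistic edits are rejected outright
        bank.record(c["key"])  # the current proposal counts toward recurrence
        local_support = c["supported_cases"] / batch_size
        if (local_support >= min_batch_support
                and bank.recurrence(c["key"]) >= min_recurrence):
            kept.append(c["key"])
    return kept


bank = RuleBank()
batch = [
    {"key": "check-order-constraints", "supported_cases": 3, "stylistic": False},
    {"key": "alice-red-ball-patch", "supported_cases": 1, "stylistic": False},
    {"key": "use-bullet-points", "supported_cases": 4, "stylistic": True},
]
first_step = purify(batch, bank, batch_size=4)
second_step = purify(batch, bank, batch_size=4)
```

In this toy run the broadly supported rule is held back on the first step and admitted only once it recurs, while the narrow patch and the stylistic edit never pass.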
Semantic Edit Regularization diagnoses the most recent edit using a finite-difference view: per-channel triggers detect when length grows beyond a threshold or when rule scope narrows. When a trigger fires, it synthesizes a textual regularization gradient encouraging compression, merging, or generalization.
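A minimal sketch of the per-channel triggers follows. It assumes each prompt version is summarized by a token length and a mean rule-coverage score, and it uses fixed template strings where the method would have an LLM synthesize the regularization gradient. The field names and both thresholds are hypothetical.

```python
def diagnose_edit(prev, curr, length_threshold=0.15, scope_threshold=0.05):
    """Compare two prompt versions and report which channels trigger.

    prev, curr: dicts with 'length' (tokens) and 'scope' (mean rule
    coverage in [0, 1]); both fields are assumed summaries.
    """
    rel_length_growth = (curr["length"] - prev["length"]) / max(prev["length"], 1)
    scope_change = curr["scope"] - prev["scope"]
    return {
        "capacity": rel_length_growth > length_threshold,
        "scope": scope_change < -scope_threshold,
    }


def synthesize_regularization_gradient(triggers):
    """Template stand-in for the LLM-written regularization gradient."""
    hints = []
    if triggers["capacity"]:
        hints.append("Compress the prompt and merge overlapping rules.")
    if triggers["scope"]:
        hints.append("Generalize case-specific rules into reusable task principles.")
    return " ".join(hints)  # empty string when no channel triggers


previous = {"length": 100, "scope": 0.8}
bloated = {"length": 130, "scope": 0.7}
benign = {"length": 105, "scope": 0.8}
```

Keeping the two channels separate lets a step that only lengthens the prompt receive a compression hint without also being told to generalize, and a step that changes little receives no regularization at all.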
Regularization-Guided Prompt Update selects, among task-faithful rewrite candidates, the one whose edit is most compatible with the regularization gradient. A task-dominance fallback ensures the task signal stays primary when no candidate aligns with regularization.
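The selection rule and its fallback can be sketched as below. This assumes each candidate already carries a task-faithfulness flag, a task score, and a numeric alignment score with the regularization gradient; how those values are produced is outside the sketch, and all field names and the `min_alignment` cutoff are hypothetical.

```python
def select_candidate(candidates, min_alignment=0.0):
    """Pick a rewrite candidate, keeping the task signal primary.

    Each candidate is a dict with 'prompt', 'task_faithful' (bool),
    'task_score' (float), and 'reg_alignment' (float).
    """
    faithful = [c for c in candidates if c["task_faithful"]]
    if not faithful:
        return None  # nothing preserves the task gradient; keep the old prompt

    aligned = [c for c in faithful if c["reg_alignment"] > min_alignment]
    if aligned:
        # Most compatible with the regularization gradient.
        return max(aligned, key=lambda c: c["reg_alignment"])

    # Task-dominance fallback: no candidate aligns, so follow the task signal.
    return max(faithful, key=lambda c: c["task_score"])


candidates = [
    {"prompt": "A", "task_faithful": True, "task_score": 0.70, "reg_alignment": 0.2},
    {"prompt": "B", "task_faithful": True, "task_score": 0.72, "reg_alignment": 0.6},
    {"prompt": "C", "task_faithful": False, "task_score": 0.90, "reg_alignment": 0.9},
]
unaligned = [
    {"prompt": "D", "task_faithful": True, "task_score": 0.65, "reg_alignment": -0.3},
    {"prompt": "E", "task_faithful": True, "task_score": 0.71, "reg_alignment": -0.1},
]
```

Filtering on task faithfulness before ranking by alignment is what keeps regularization from overriding the task: candidate C has the highest alignment but is never considered.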
Prompts are optimized on Logical Deduction (3 obj), Tracking Shuffled Objects (3 obj), and GSM8K, then evaluated out-of-distribution on harder variants and unseen arithmetic benchmarks across four LLM test engines. TextReg achieves the best or second-best accuracy on nearly every (test engine, dataset) cell, with the largest gains on harder variants of the source task.
| Test engine | Method | Logical Ded. 3obj → 5obj | Logical Ded. 3obj → 7obj | Tracking Shuf. 3obj → 5obj | Tracking Shuf. 3obj → 7obj | GSM8K → SVAMP | GSM8K → MultiArith |
|---|---|---|---|---|---|---|---|
| Qwen2-7B-Instruct | CoT | 51.6 | 47.4 | 42.0 | 33.9 | 89.8 | 94.7 |
| | TextGrad | 51.3 | 46.6 | 38.2 | 33.1 | 89.3 | 95.8 |
| | REVOLVE | 54.4 | 47.6 | 40.3 | 40.2 | 89.2 | 96.2 |
| | TextReg | 55.3 | 47.8 | 45.4 | 39.1 | 90.1 | 96.5 |
| Llama-3.1-8B-Instruct | CoT | 59.7 | 50.4 | 58.8 | 48.5 | 85.5 | 96.0 |
| | TextGrad | 59.8 | 50.4 | 69.7 | 66.7 | 84.8 | 96.0 |
| | REVOLVE | 57.7 | 50.6 | 76.0 | 65.4 | 84.6 | 95.3 |
| | TextReg | 61.1 | 51.0 | 79.7 | 76.6 | 85.5 | 96.7 |
| Llama-3-8B-Instruct | CoT | 52.6 | 48.8 | 35.6 | 36.5 | 83.7 | 94.7 |
| | TextGrad | 42.1 | 41.1 | 45.9 | 38.0 | 83.2 | 94.7 |
| | REVOLVE | 51.5 | 48.8 | 46.1 | 41.8 | 83.9 | 94.7 |
| | TextReg | 53.3 | 52.0 | 54.3 | 48.3 | 83.9 | 94.9 |
| Phi-3.5-Mini-Instruct | CoT | 57.0 | 50.5 | 89.6 | 89.1 | 90.4 | 97.0 |
| | TextGrad | 46.1 | 46.4 | 83.6 | 92.8 | 80.6 | 86.8 |
| | REVOLVE | 43.4 | 38.7 | 86.8 | 87.7 | 87.0 | 95.1 |
| | TextReg | 57.9 | 55.2 | 94.1 | 92.2 | 88.5 | 94.7 |
Cross-dataset, cross-engine generalization accuracy (%). Bold: best. Underlined: second-best. TextReg is best or second-best in 23 of the 24 (test engine, dataset) cells; the exception is MultiArith on Phi-3.5-Mini-Instruct.
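The summary claims can be checked directly from the reported numbers. The short script below transcribes the table and counts the cells in which TextReg ranks first or second (ties share the better rank), then computes the largest gains over TextGrad and REVOLVE. The variable and function names are ours; only the accuracy values come from the table.

```python
# Column order: LogDed 5obj, LogDed 7obj, Track 5obj, Track 7obj, SVAMP, MultiArith
TABLE = {
    "Qwen2-7B-Instruct": {
        "CoT":      [51.6, 47.4, 42.0, 33.9, 89.8, 94.7],
        "TextGrad": [51.3, 46.6, 38.2, 33.1, 89.3, 95.8],
        "REVOLVE":  [54.4, 47.6, 40.3, 40.2, 89.2, 96.2],
        "TextReg":  [55.3, 47.8, 45.4, 39.1, 90.1, 96.5],
    },
    "Llama-3.1-8B-Instruct": {
        "CoT":      [59.7, 50.4, 58.8, 48.5, 85.5, 96.0],
        "TextGrad": [59.8, 50.4, 69.7, 66.7, 84.8, 96.0],
        "REVOLVE":  [57.7, 50.6, 76.0, 65.4, 84.6, 95.3],
        "TextReg":  [61.1, 51.0, 79.7, 76.6, 85.5, 96.7],
    },
    "Llama-3-8B-Instruct": {
        "CoT":      [52.6, 48.8, 35.6, 36.5, 83.7, 94.7],
        "TextGrad": [42.1, 41.1, 45.9, 38.0, 83.2, 94.7],
        "REVOLVE":  [51.5, 48.8, 46.1, 41.8, 83.9, 94.7],
        "TextReg":  [53.3, 52.0, 54.3, 48.3, 83.9, 94.9],
    },
    "Phi-3.5-Mini-Instruct": {
        "CoT":      [57.0, 50.5, 89.6, 89.1, 90.4, 97.0],
        "TextGrad": [46.1, 46.4, 83.6, 92.8, 80.6, 86.8],
        "REVOLVE":  [43.4, 38.7, 86.8, 87.7, 87.0, 95.1],
        "TextReg":  [57.9, 55.2, 94.1, 92.2, 88.5, 94.7],
    },
}


def textreg_rank(rows, col):
    """Rank of TextReg in one cell: 1 + number of methods strictly above it."""
    ours = rows["TextReg"][col]
    return 1 + sum(1 for m, vals in rows.items() if m != "TextReg" and vals[col] > ours)


def max_gain(baseline):
    """Largest TextReg-minus-baseline difference over all cells, in points."""
    return max(
        rows["TextReg"][col] - rows[baseline][col]
        for rows in TABLE.values()
        for col in range(6)
    )


ranks = [textreg_rank(rows, col) for rows in TABLE.values() for col in range(6)]
top2_cells = sum(1 for r in ranks if r <= 2)
gain_textgrad = round(max_gain("TextGrad"), 1)
gain_revolve = round(max_gain("REVOLVE"), 1)
```

On these numbers, `top2_cells` is 23 out of 24, and the maximum gains are 11.8 points over TextGrad (Logical Ded. 5obj on Phi-3.5-Mini-Instruct) and 16.5 points over REVOLVE (Logical Ded. 7obj on the same engine), matching the figures quoted in the abstract.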
Beyond headline accuracy, we ask what each stage contributes (ablation) and how robust TextReg is when individual LLM roles in the optimization loop are weakened (resilience).
Ablation. Removing any of the three TextReg components produces a clear and consistent drop in mean OOD accuracy across four test engines.
Disabling Dual-Evidence Gradient Purification, Semantic Edit Regularization, or Regularization-Guided Prompt Update each yields a clear accuracy drop, so the three components contribute non-redundantly.
Downgrading any single LLM role to Qwen2.5-7B-Instruct retains strong performance, indicating TextReg's regularization signal is structural rather than capability-bound.
We replace one of three LLM-driven roles at a time with the substantially weaker Qwen2.5-7B-Instruct: Gradient (feedback & RuleBank), Regularization (semantic edit analysis & gradient synthesis), or Optimizer (prompt rewriting). TextReg retains strong performance under all three weakenings; on the hardest variant (Tracking Shuf. 7 obj), Weak-Regularization and Weak-Optimizer even surpass the All-Strong baseline.
Figure panels: Logical Ded. 5 obj, Logical Ded. 7 obj, Tracking Shuf. 5 obj, Tracking Shuf. 7 obj.
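The one-role-at-a-time weakening described above amounts to a small configuration grid, sketched below. The role keys and the `resilience_configs` helper are our own naming, and `"strong-model-placeholder"` is a placeholder because the strong model is not named in this section; only Qwen2.5-7B-Instruct and the All-Strong and Weak-* labels come from the text.

```python
ROLES = ("gradient", "regularization", "optimizer")


def resilience_configs(strong, weak):
    """Build the All-Strong setting plus one setting per weakened role."""
    configs = {"All-Strong": {role: strong for role in ROLES}}
    for weakened in ROLES:
        assignment = {role: strong for role in ROLES}
        assignment[weakened] = weak  # downgrade exactly one role
        configs[f"Weak-{weakened.capitalize()}"] = assignment
    return configs


configs = resilience_configs(
    strong="strong-model-placeholder",  # placeholder: not specified in this section
    weak="Qwen2.5-7B-Instruct",
)
```

Changing a single role per setting is what allows any accuracy difference from All-Strong to be attributed to that role alone.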
@article{fu2026textreg,
title={TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization},
author={Fu, Lucheng and Yu, Ye and Wang, Yiyang and Jin, Yiqiao and Jin, Haibo and Prakash, B Aditya and Wang, Haohan},
journal={arXiv preprint arXiv:2605.21318},
year={2026}
}