TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

Abstract

Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness and attributes prompt distributional overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to 11.8 percentage points over TextGrad and 16.5 percentage points over REVOLVE.

Motivation

Prompts grow long and narrow — and then fail out of distribution.

As feedback-driven prompt optimizers iterate, they tend to append rules, exceptions, and stylistic constraints that fit a handful of training examples. The result: prompts that swell in length while their rules drift toward narrow, sample-specific patches. We call this failure mode prompt distributional overfitting and frame it as representational inefficiency — the coupled growth of capacity cost and scope narrowness.

Illustration of prompt distributional overfitting

Problem illustration. Conventional methods produce long prompts saturated with narrow rules that degrade on OOD inputs (left). TextReg yields compact prompts of broadly applicable rules for stronger OOD generalization (right).

Capacity cost C(p)

Longer prompts consume context budget and make it harder for the model to reliably locate and leverage relevant instructions.

Scope narrowness W(p)

Accumulated rules apply only to a small subset of inputs, functioning as ad-hoc patches rather than reusable task principles.

Representational inefficiency

𝓘(p) = C(p) · W(p) — their coupled growth wastes prompt capacity and drives OOD failure.
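To make the product form concrete, here is a minimal sketch of how the two factors multiply. Both proxies are assumptions chosen for illustration, not the paper's definitions: capacity cost is approximated by whitespace token count relative to a hypothetical budget, and scope narrowness by one minus the mean fraction of inputs each rule applies to.

```python
def capacity_cost(prompt: str, budget_tokens: int = 200) -> float:
    """Hypothetical proxy for C(p): whitespace token count relative to a budget."""
    return len(prompt.split()) / budget_tokens


def scope_narrowness(rule_coverage: list[float]) -> float:
    """Hypothetical proxy for W(p): 1 minus the mean fraction of inputs each rule covers."""
    if not rule_coverage:
        return 0.0
    return 1.0 - sum(rule_coverage) / len(rule_coverage)


def inefficiency(prompt: str, rule_coverage: list[float], budget_tokens: int = 200) -> float:
    """I(p) = C(p) * W(p), following the product form stated above."""
    return capacity_cost(prompt, budget_tokens) * scope_narrowness(rule_coverage)


# A compact prompt with two broad rules versus the same prompt plus 20 narrow patches.
compact = "Think step by step. Track every constraint before answering."
patches = " ".join(f"If the input mentions item {i}, answer carefully." for i in range(20))
bloated = compact + " " + patches

compact_score = inefficiency(compact, rule_coverage=[0.9, 0.8])
bloated_score = inefficiency(bloated, rule_coverage=[0.9, 0.8] + [0.05] * 20)
print(f"compact: {compact_score:.4f}  bloated: {bloated_score:.4f}")
```

Because the factors multiply, a prompt that grows longer while its rules also narrow is penalized far more than one that degrades along a single axis.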

Framework

TextReg: a regularized textual-gradient framework

We decompose every prompt update into a purified task gradient that drives empirical improvement, and a regularization gradient that opposes the growth of representational inefficiency. TextReg realizes this view through three stages, executed in sequence per optimization step.

Overview of TextReg. (Left) Dual-Evidence Gradient Purification filters raw task gradients via local batch and RuleBank recurrence evidence. (Middle) Semantic Edit Regularization diagnoses capacity and scope degradation and synthesizes the regularization gradient. (Right) Regularization-Guided Prompt Update selects the task-faithful candidate most compatible with the regularization signal.

1. Dual-Evidence Gradient Purification

Filters raw task gradients at the source by combining local batch evidence (how case-specific a proposed rule is) with global recurrence evidence from a persistent RuleBank. Only gradients that are broadly applicable and recurrently supported survive; narrow patches and pure stylistic edits are rejected.
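The filtering logic described here can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `RuleBank` and `Proposal` classes, the exact-string rule matching, and the two thresholds are all hypothetical, and a real system would likely match rules semantically using an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class RuleBank:
    """Hypothetical persistent store counting how often each rule has been proposed."""
    counts: dict[str, int] = field(default_factory=dict)

    def record(self, rule: str) -> int:
        self.counts[rule] = self.counts.get(rule, 0) + 1
        return self.counts[rule]


@dataclass
class Proposal:
    rule: str             # candidate rule extracted from LLM feedback
    batch_support: float  # fraction of the mini-batch the rule applies to (local evidence)
    stylistic_only: bool  # True if the edit changes wording but not behavior


def purify(proposals, bank, min_support=0.5, min_recurrence=2):
    """Keep proposals that are broadly applicable locally and recur in the RuleBank."""
    kept = []
    for p in proposals:
        if p.stylistic_only:
            continue  # pure stylistic edits are rejected outright
        recurrence = bank.record(p.rule)  # global evidence accumulates across steps
        if p.batch_support >= min_support and recurrence >= min_recurrence:
            kept.append(p)
    return kept


bank = RuleBank()
step1 = [Proposal("verify each constraint", 0.8, False),
         Proposal("if the name is Alice, pick left", 0.1, False)]
step2 = [Proposal("verify each constraint", 0.7, False),
         Proposal("use a friendlier tone", 1.0, True)]
first = purify(step1, bank)
second = purify(step2, bank)
print([p.rule for p in first], [p.rule for p in second])
```

In this toy run the broad rule is only admitted on its second appearance, while the narrow patch and the stylistic edit never pass.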

2. Semantic Edit Regularization

Diagnoses the most recent edit using a finite-difference view: per-channel triggers detect when length grows beyond threshold or when rule scope narrows. When triggered, it synthesizes a textual regularization gradient encouraging compression, merging, or generalization.
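A minimal sketch of the per-channel triggers, assuming numeric stand-ins for quantities the framework obtains through LLM analysis: the scope scores, the relative length threshold, and the wording of the advice strings are hypothetical.

```python
def diagnose_edit(prev_len, new_len, prev_scope, new_scope, length_threshold=0.10):
    """Return the channels triggered by the most recent edit.

    prev_scope / new_scope: hypothetical scores in [0, 1]; higher means rules apply more broadly.
    length_threshold: hypothetical relative length growth tolerated per step.
    """
    triggers = []
    delta_len = (new_len - prev_len) / max(prev_len, 1)  # finite difference in length
    delta_scope = new_scope - prev_scope                 # finite difference in scope
    if delta_len > length_threshold:
        triggers.append("capacity")
    if delta_scope < 0:
        triggers.append("scope")
    return triggers


def regularization_gradient(triggers):
    """Turn triggered channels into a textual instruction for the rewriting LLM."""
    advice = {
        "capacity": "Compress the prompt: merge overlapping rules and drop redundant wording.",
        "scope": "Generalize narrow, case-specific rules into reusable task principles.",
    }
    return " ".join(advice[t] for t in triggers)


grew_and_narrowed = diagnose_edit(prev_len=100, new_len=130, prev_scope=0.7, new_scope=0.5)
benign = diagnose_edit(prev_len=100, new_len=105, prev_scope=0.7, new_scope=0.7)
print(grew_and_narrowed, regularization_gradient(grew_and_narrowed))
print(benign, repr(regularization_gradient(benign)))
```

When no channel fires, the regularization gradient is empty and the step is driven by the task signal alone.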

3. Regularization-Guided Prompt Update

Among task-faithful rewrite candidates, selects the one whose edit is most compatible with the regularization gradient. A task-dominance fallback ensures the task signal stays primary when no candidate aligns with regularization.
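The selection rule and its fallback can be sketched as below. The candidate fields and numeric scores are hypothetical placeholders for judgments the framework would obtain from an LLM; only the decision order (task faithfulness first, then regularization compatibility, then the task-dominance fallback) follows the description above.

```python
def select_update(candidates, min_compat=0.0):
    """Pick the next prompt from scored rewrite candidates.

    Each candidate is a dict with hypothetical fields:
      'prompt'     - the rewritten prompt text
      'task_ok'    - True if the rewrite faithfully applies the task gradient
      'task_score' - how well the rewrite serves the task signal
      'reg_compat' - compatibility of the edit with the regularization gradient
    """
    faithful = [c for c in candidates if c["task_ok"]]
    if not faithful:
        raise ValueError("no task-faithful candidate to choose from")
    aligned = [c for c in faithful if c["reg_compat"] > min_compat]
    if aligned:
        return max(aligned, key=lambda c: c["reg_compat"])
    # Task-dominance fallback: nothing aligns with regularization, so rank by task signal.
    return max(faithful, key=lambda c: c["task_score"])


candidates = [
    {"prompt": "A", "task_ok": True, "task_score": 0.9, "reg_compat": 0.2},
    {"prompt": "B", "task_ok": True, "task_score": 0.7, "reg_compat": 0.8},
    {"prompt": "C", "task_ok": False, "task_score": 0.1, "reg_compat": 1.0},
]
chosen = select_update(candidates)
fallback = select_update([dict(c, reg_compat=-1.0) for c in candidates])
print(chosen["prompt"], fallback["prompt"])
```

Candidate C is excluded despite its high compatibility because it is not task-faithful, which keeps the task signal primary.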

Inputs

prompt p_t, mini-batch 𝓑_t, RuleBank 𝓡_t

Gradient signals

g̃_task(p_t) + g_reg(p_t)

Output

p_{t+1} (task-faithful and regularized)

Results

Stronger OOD generalization across datasets and engines

Prompts are optimized on Logical Deduction (3 obj), Tracking Shuffled Objects (3 obj), and GSM8K, then evaluated out-of-distribution on harder variants and unseen arithmetic benchmarks across four LLM test engines. TextReg achieves the best or second-best accuracy on nearly every (test engine, dataset) cell, with the largest gains on harder variants of the source task.

Phi-3.5-Mini · Logical Ded. 5obj: 57.9% (+11.8 points vs TextGrad)

Llama-3.1-8B · Tracking Shuf. 7obj: 76.6% (+9.9 points vs TextGrad)

Llama-3-8B · Tracking Shuf. 7obj: 48.3% (+10.3 points vs TextGrad)

Phi-3.5-Mini · Logical Ded. 7obj: 55.2% (+16.5 points vs REVOLVE)

Test engine	Method	Logical Ded. 3obj → 5obj	Logical Ded. 3obj → 7obj	Tracking Shuf. 3obj → 5obj	Tracking Shuf. 3obj → 7obj	GSM8K → SVAMP	GSM8K → MultiArith
Qwen2-7B-Instruct	CoT	51.6	47.4	42.0	33.9	89.8	94.7
	TextGrad	51.3	46.6	38.2	33.1	89.3	95.8
	REVOLVE	54.4	47.6	40.3	40.2	89.2	96.2
	TextReg	55.3	47.8	45.4	39.1	90.1	96.5
Llama-3.1-8B-Instruct	CoT	59.7	50.4	58.8	48.5	85.5	96.0
	TextGrad	59.8	50.4	69.7	66.7	84.8	96.0
	REVOLVE	57.7	50.6	76.0	65.4	84.6	95.3
	TextReg	61.1	51.0	79.7	76.6	85.5	96.7
Llama-3-8B-Instruct	CoT	52.6	48.8	35.6	36.5	83.7	94.7
	TextGrad	42.1	41.1	45.9	38.0	83.2	94.7
	REVOLVE	51.5	48.8	46.1	41.8	83.9	94.7
	TextReg	53.3	52.0	54.3	48.3	83.9	94.9
Phi-3.5-Mini-Instruct	CoT	57.0	50.5	89.6	89.1	90.4	97.0
	TextGrad	46.1	46.4	83.6	92.8	80.6	86.8
	REVOLVE	43.4	38.7	86.8	87.7	87.0	95.1
	TextReg	57.9	55.2	94.1	92.2	88.5	94.7

Cross-dataset, cross-engine generalization accuracy (%). Bold: best. Underlined: second-best. TextReg is best or second-best in all but one cell (Phi-3.5-Mini-Instruct on MultiArith, where it ranks third).
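The headline gains can be rechecked directly from the table. The short script below copies the relevant accuracies and computes the absolute difference in percentage points; the dictionary layout and the `gain` helper are illustrative only.

```python
# Accuracy (%) copied from the table above: (engine, benchmark) -> {method: accuracy}
table = {
    ("Phi-3.5-Mini", "Logical Ded. 5obj"): {"TextGrad": 46.1, "REVOLVE": 43.4, "TextReg": 57.9},
    ("Phi-3.5-Mini", "Logical Ded. 7obj"): {"TextGrad": 46.4, "REVOLVE": 38.7, "TextReg": 55.2},
    ("Llama-3.1-8B", "Tracking Shuf. 7obj"): {"TextGrad": 66.7, "REVOLVE": 65.4, "TextReg": 76.6},
    ("Llama-3-8B", "Tracking Shuf. 7obj"): {"TextGrad": 38.0, "REVOLVE": 41.8, "TextReg": 48.3},
}


def gain(cell, baseline):
    """Absolute gain of TextReg over a baseline, in percentage points."""
    return round(table[cell]["TextReg"] - table[cell][baseline], 1)


for cell in table:
    print(cell, {b: gain(cell, b) for b in ("TextGrad", "REVOLVE")})
```

These are differences between accuracies, so they are percentage points rather than relative percent improvements.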

Analysis

Each component contributes — and the framework is resilient.

Beyond headline accuracy, we ask what each stage contributes (ablation) and how robust TextReg is when individual LLM roles in the optimization loop are weakened (resilience).

Ablation. Removing any of the three TextReg components produces a clear and consistent drop in mean OOD accuracy across four test engines.

All three stages are necessary

Disabling Dual-Evidence Gradient Purification, Semantic Edit Regularization, or Regularization-Guided Update each yields a clear accuracy drop — the components contribute non-redundantly.

Resilient to weaker engines

Downgrading any single LLM role to Qwen2.5-7B-Instruct retains strong performance, indicating TextReg's regularization signal is structural rather than capability-bound.

Resilience under role-wise engine degradation

We replace one of three LLM-driven roles at a time with the substantially weaker Qwen2.5-7B-Instruct: Gradient (feedback & RuleBank), Regularization (semantic edit analysis & gradient synthesis), or Optimizer (prompt rewriting). TextReg retains strong performance under all three weakenings; on the hardest variant (Tracking Shuf. 7 obj), Weak-Regularization and Weak-Optimizer even surpass the All-Strong baseline.
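The experimental grid described above can be enumerated as in the following sketch. The strong engine is not named in this section, so `STRONG` is a placeholder string, and the `Weak-Gradient` label is assumed by analogy with the Weak-Regularization and Weak-Optimizer labels used in the text.

```python
ROLES = ("gradient", "regularization", "optimizer")
STRONG = "strong-engine"      # placeholder: the strong engine is not named in this section
WEAK = "Qwen2.5-7B-Instruct"  # weaker engine named in the text


def degradation_configs():
    """Yield (label, role -> engine) for the all-strong setup and each single-role weakening."""
    yield "All-Strong", {role: STRONG for role in ROLES}
    for weakened in ROLES:
        engines = {role: (WEAK if role == weakened else STRONG) for role in ROLES}
        yield f"Weak-{weakened.capitalize()}", engines


configs = dict(degradation_configs())
for label, engines in configs.items():
    print(label, engines)
```

Weakening exactly one role per configuration isolates how much each role depends on engine capability.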

Figure panels: Logical Ded. 5 obj, Logical Ded. 7 obj, Tracking Shuf. 5 obj, Tracking Shuf. 7 obj.

BibTeX

If you find this work useful, please cite

@article{fu2026textreg,
  title={TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization},
  author={Fu, Lucheng and Yu, Ye and Wang, Yiyang and Jin, Yiqiao and Jin, Haibo and Prakash, B Aditya and Wang, Haohan},
  journal={arXiv preprint arXiv:2605.21318},
  year={2026}
}