Prompt Optimization · OOD Generalization

TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

Lucheng Fu1, Ye Yu2, Yiyang Wang1, Yiqiao Jin1, Haibo Jin2, B. Aditya Prakash1, Haohan Wang2
1 Georgia Institute of Technology    2 University of Illinois Urbana-Champaign
Corresponding authors
Up to +11.8%: best-cell OOD gain over TextGrad
Up to +16.5%: best-cell OOD gain over REVOLVE
4 test-engine LLM backends
9 reasoning benchmarks evaluated

Abstract

Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, and we attribute prompt distributional overfitting to the coupled growth of these two factors during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with absolute accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.

Motivation

Prompts grow long and narrow — and then fail out of distribution.

As feedback-driven prompt optimizers iterate, they tend to append rules, exceptions, and stylistic constraints that fit a handful of training examples. The result: prompts that swell in length while their rules drift toward narrow, sample-specific patches. We call this failure mode prompt distributional overfitting and frame it as representational inefficiency — the coupled growth of capacity cost and scope narrowness.

Illustration of prompt distributional overfitting

Problem illustration. Conventional methods produce long prompts saturated with narrow rules that degrade on OOD inputs (left). TextReg yields compact prompts of broadly applicable rules for stronger OOD generalization (right).

Capacity cost C(p)

Longer prompts consume context budget and make it harder for the model to reliably locate and leverage relevant instructions.

Scope narrowness W(p)

Accumulated rules apply only to a small subset of inputs, functioning as ad-hoc patches rather than reusable task principles.

Representational inefficiency

𝓘(p) = C(p) · W(p) — their coupled growth wastes prompt capacity and drives OOD failure.
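To make the product form concrete, here is a minimal sketch that scores a prompt under assumed proxies. Capacity cost is approximated as whitespace-token count divided by a budget, and scope narrowness as one minus the mean fraction of inputs that each rule applies to. Both proxies, the `budget` parameter, and the `applies` predicate are illustrative assumptions, not the paper's definitions of C(p) and W(p).

```python
def capacity_cost(prompt: str, budget: int = 200) -> float:
    """Assumed proxy for C(p): whitespace-token count relative to a budget."""
    return len(prompt.split()) / budget


def scope_narrowness(rules, inputs, applies) -> float:
    """Assumed proxy for W(p): one minus the mean fraction of inputs a rule covers."""
    if not rules or not inputs:
        return 0.0
    coverage = [sum(applies(r, x) for x in inputs) / len(inputs) for r in rules]
    return 1.0 - sum(coverage) / len(coverage)


def representational_inefficiency(prompt, rules, inputs, applies, budget=200):
    """I(p) = C(p) * W(p), the product form stated above."""
    return capacity_cost(prompt, budget) * scope_narrowness(rules, inputs, applies)


# Toy data (hypothetical): a "rule" is a keyword that applies when it occurs in the input.
def keyword_in_input(rule: str, x: str) -> bool:
    return rule in x


inputs = ["order three books", "order five books", "order seven books", "swap two balls"]

# Same prompt length, different rule scope.
broad = representational_inefficiency("a b c d", ["books"], inputs, keyword_in_input, budget=8)
narrow = representational_inefficiency("a b c d", ["seven"], inputs, keyword_in_input, budget=8)
print(broad, narrow)
```

With length held fixed, the prompt whose rule covers one input in four scores three times higher than the one whose rule covers three in four. Under these proxies, the multiplicative form penalizes narrow rules more heavily the longer the prompt is.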

Framework

TextReg: a regularized textual-gradient framework

We decompose every prompt update into a purified task gradient that drives empirical improvement, and a regularization gradient that opposes the growth of representational inefficiency. TextReg realizes this view through three stages, executed in sequence per optimization step.

TextReg framework overview

Overview of TextReg. (Left) Dual-Evidence Gradient Purification filters raw task gradients via local batch and RuleBank recurrence evidence. (Middle) Semantic Edit Regularization diagnoses capacity and scope degradation and synthesizes the regularization gradient. (Right) Regularization-Guided Prompt Update selects the task-faithful candidate most compatible with the regularization signal.

1. Dual-Evidence Gradient Purification

Filters raw task gradients at the source by combining local batch evidence (how case-specific a proposed rule is) with global recurrence evidence from a persistent RuleBank. Only gradients that are broadly applicable and recurrently supported survive; narrow patches and pure stylistic edits are rejected.
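The filtering logic might look roughly like the sketch below. It assumes each proposed rule carries a `batch_support` score (the fraction of the mini-batch it applies to) and that the RuleBank can be modeled as a counter of how often a rule has been proposed. The `CandidateRule` type, the thresholds, and the `is_stylistic` flag are hypothetical stand-ins for judgments that an LLM would make in the actual system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateRule:
    key: str               # normalized rule identifier (assumed)
    text: str
    batch_support: float   # fraction of the mini-batch the rule applies to (assumed)
    is_stylistic: bool = False


def purify(candidates, rulebank: Counter, min_batch_support=0.5, min_recurrence=2):
    """Keep only rules backed by both local batch and global recurrence evidence."""
    kept = []
    for rule in candidates:
        if rule.is_stylistic:
            continue  # pure stylistic edits are rejected outright
        rulebank[rule.key] += 1  # record this proposal as recurrence evidence
        broad = rule.batch_support >= min_batch_support
        recurrent = rulebank[rule.key] >= min_recurrence
        if broad and recurrent:
            kept.append(rule)
    return kept


# Hypothetical usage: "restate" was already proposed once in an earlier step.
rulebank = Counter({"restate": 1})
proposals = [
    CandidateRule("restate", "Restate all constraints before answering.", 0.75),
    CandidateRule("case_17", "If the input mentions a red book, answer (B).", 0.125),
    CandidateRule("tone", "Use a friendlier tone.", 1.0, is_stylistic=True),
]
survivors = purify(proposals, rulebank)
print([r.key for r in survivors])
```

Because the RuleBank persists across steps, a broadly applicable rule that fails the recurrence check once can still pass when it is proposed again later.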

2. Semantic Edit Regularization

Diagnoses the most recent edit using a finite-difference view: per-channel triggers detect when prompt length grows beyond a threshold or when rule scope narrows. When a trigger fires, it synthesizes a textual regularization gradient that encourages compression, merging, or generalization.
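One way to picture the per-channel triggers is the sketch below, which compares two consecutive prompt states and emits textual feedback only for the channels that degrade. The `PromptState` fields, the two thresholds, and the hint strings are assumptions for illustration; the source does not specify how length and scope are measured or what the synthesized gradient says.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PromptState:
    length: int    # e.g. token count (assumed capacity proxy)
    scope: float   # mean rule coverage in [0, 1] (assumed scope proxy)


def regularization_gradient(
    prev: PromptState,
    curr: PromptState,
    max_length_growth: int = 20,
    max_scope_drop: float = 0.05,
) -> Optional[str]:
    """Return textual regularization feedback, or None if no channel triggers."""
    d_length = curr.length - prev.length   # finite difference, capacity channel
    d_scope = curr.scope - prev.scope      # finite difference, scope channel
    hints = []
    if d_length > max_length_growth:
        hints.append("Compress the prompt and merge overlapping rules.")
    if d_scope < -max_scope_drop:
        hints.append("Generalize narrow rules into broadly applicable principles.")
    return " ".join(hints) or None


before = PromptState(length=100, scope=0.8)
print(regularization_gradient(before, PromptState(length=150, scope=0.6)))
print(regularization_gradient(before, PromptState(length=105, scope=0.8)))
```

Keeping the channels separate means a long but still general edit only draws a compression hint, while a short but narrow edit only draws a generalization hint.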

3. Regularization-Guided Prompt Update

Among task-faithful rewrite candidates, selects the one whose edit is most compatible with the regularization gradient. A task-dominance fallback ensures the task signal stays primary when no candidate aligns with regularization.
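The selection rule can be sketched as a two-level choice. The version below assumes each candidate can be scored by two hypothetical functions, `task_score` (faithfulness to the task gradient) and `reg_score` (compatibility with the regularization gradient). The thresholds are also assumptions; in practice both judgments would likely come from an LLM.

```python
def select_update(candidates, task_score, reg_score, min_task_score=0.5, min_reg_score=0.0):
    """Pick the task-faithful candidate most compatible with regularization."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # Restrict to task-faithful candidates; if none qualify, consider all of them.
    faithful = [c for c in candidates if task_score(c) >= min_task_score]
    pool = faithful or list(candidates)
    best = max(pool, key=reg_score)
    if reg_score(best) > min_reg_score:
        return best
    # Task-dominance fallback: no candidate aligns with regularization,
    # so the task signal alone decides.
    return max(pool, key=task_score)


# Hypothetical scores for three rewrite candidates.
task = {"A": 0.9, "B": 0.7, "C": 0.2}
aligned = {"A": -0.2, "B": 0.4, "C": 0.9}
unaligned = {"A": -0.2, "B": -0.1, "C": 0.9}

print(select_update(["A", "B", "C"], task.get, aligned.get))    # B
print(select_update(["A", "B", "C"], task.get, unaligned.get))  # A
```

Candidate C is the most regularization-friendly but is never chosen, because it fails the task-faithfulness filter. That ordering is what keeps the task signal primary.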

Inputs: prompt p_t, mini-batch 𝓑_t, RuleBank 𝓡_t
Gradient signals: g_task(p_t) + g_reg(p_t)
Output: p_{t+1} (task-faithful & regularized)

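Putting the three stages together, one optimization step could be wired as in the sketch below. The `Stages` container and all five hook signatures are hypothetical; each hook stands in for an LLM call whose prompt and output format the source does not describe.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stages:
    """Hypothetical hooks; in practice each would wrap an LLM call."""
    task_gradient: Callable[[str, list], List[str]]
    purify: Callable[[List[str], list, dict], List[str]]
    regularize: Callable[[str, str], Optional[str]]
    propose: Callable[[str, List[str], Optional[str]], List[str]]
    select: Callable[[List[str], Optional[str]], str]


def textreg_step(prev_prompt: str, prompt: str, batch: list, rulebank: dict, stages: Stages) -> str:
    """One step: purify the task gradient, diagnose the last edit, then update."""
    raw = stages.task_gradient(prompt, batch)
    g_task = stages.purify(raw, batch, rulebank)        # stage 1
    g_reg = stages.regularize(prev_prompt, prompt)      # stage 2
    candidates = stages.propose(prompt, g_task, g_reg)  # rewrite candidates
    return stages.select(candidates, g_reg)             # stage 3
```

Passing both the previous and the current prompt to `regularize` reflects the finite-difference view: the regularization signal is computed from the most recent edit rather than from the current prompt alone.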
Results

Stronger OOD generalization across datasets and engines

Prompts are optimized on Logical Deduction (3 obj), Tracking Shuffled Objects (3 obj), and GSM8K, then evaluated out-of-distribution on harder variants and unseen arithmetic benchmarks across four LLM test engines. TextReg achieves the best or second-best accuracy on nearly every (test engine, dataset) cell, with the largest gains on harder variants of the source task.

Phi-3.5-Mini · Logical Ded. 5obj: 57.9% (+11.8 vs TextGrad)
Llama-3.1-8B · Tracking Shuf. 7obj: 76.6% (+9.9 vs TextGrad)
Llama-3-8B · Tracking Shuf. 7obj: 48.3% (+10.3 vs TextGrad)
Phi-3.5-Mini · Logical Ded. 7obj: 55.2% (+16.5 vs REVOLVE)

Test engine             Method    Logical Ded. (3obj →)   Tracking Shuf. (3obj →) GSM8K →
                                  5obj        7obj        5obj        7obj        SVAMP       MultiArith
Qwen2-7B-Instruct       CoT       51.6        47.4        42.0        33.9        89.8        94.7
                        TextGrad  51.3        46.6        38.2        33.1        89.3        95.8
                        REVOLVE   54.4        47.6        40.3        40.2        89.2        96.2
                        TextReg   55.3        47.8        45.4        39.1        90.1        96.5
Llama-3.1-8B-Instruct   CoT       59.7        50.4        58.8        48.5        85.5        96.0
                        TextGrad  59.8        50.4        69.7        66.7        84.8        96.0
                        REVOLVE   57.7        50.6        76.0        65.4        84.6        95.3
                        TextReg   61.1        51.0        79.7        76.6        85.5        96.7
Llama-3-8B-Instruct     CoT       52.6        48.8        35.6        36.5        83.7        94.7
                        TextGrad  42.1        41.1        45.9        38.0        83.2        94.7
                        REVOLVE   51.5        48.8        46.1        41.8        83.9        94.7
                        TextReg   53.3        52.0        54.3        48.3        83.9        94.9
Phi-3.5-Mini-Instruct   CoT       57.0        50.5        89.6        89.1        90.4        97.0
                        TextGrad  46.1        46.4        83.6        92.8        80.6        86.8
                        REVOLVE   43.4        38.7        86.8        87.7        87.0        95.1
                        TextReg   57.9        55.2        94.1        92.2        88.5        94.7

Cross-dataset, cross-engine generalization accuracy (%). TextReg is best or second-best in every cell except Phi-3.5-Mini-Instruct on MultiArith.
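The headline gains can be rechecked directly from the table. The short script below transcribes the TextGrad, REVOLVE, and TextReg rows and finds the cell with the largest TextReg-minus-baseline gap. The column labels and the `best_gain` helper are names introduced here for illustration.

```python
COLUMNS = ["LD 5obj", "LD 7obj", "TS 5obj", "TS 7obj", "SVAMP", "MultiArith"]

# Accuracy (%) transcribed from the results table.
RESULTS = {
    "Qwen2-7B-Instruct": {
        "TextGrad": [51.3, 46.6, 38.2, 33.1, 89.3, 95.8],
        "REVOLVE": [54.4, 47.6, 40.3, 40.2, 89.2, 96.2],
        "TextReg": [55.3, 47.8, 45.4, 39.1, 90.1, 96.5],
    },
    "Llama-3.1-8B-Instruct": {
        "TextGrad": [59.8, 50.4, 69.7, 66.7, 84.8, 96.0],
        "REVOLVE": [57.7, 50.6, 76.0, 65.4, 84.6, 95.3],
        "TextReg": [61.1, 51.0, 79.7, 76.6, 85.5, 96.7],
    },
    "Llama-3-8B-Instruct": {
        "TextGrad": [42.1, 41.1, 45.9, 38.0, 83.2, 94.7],
        "REVOLVE": [51.5, 48.8, 46.1, 41.8, 83.9, 94.7],
        "TextReg": [53.3, 52.0, 54.3, 48.3, 83.9, 94.9],
    },
    "Phi-3.5-Mini-Instruct": {
        "TextGrad": [46.1, 46.4, 83.6, 92.8, 80.6, 86.8],
        "REVOLVE": [43.4, 38.7, 86.8, 87.7, 87.0, 95.1],
        "TextReg": [57.9, 55.2, 94.1, 92.2, 88.5, 94.7],
    },
}


def best_gain(baseline: str):
    """Return (gain, engine, column) for the largest TextReg-minus-baseline gap."""
    gains = [
        (round(ours - base, 1), engine, column)
        for engine, rows in RESULTS.items()
        for column, ours, base in zip(COLUMNS, rows["TextReg"], rows[baseline])
    ]
    return max(gains)


print(best_gain("TextGrad"))  # (11.8, 'Phi-3.5-Mini-Instruct', 'LD 5obj')
print(best_gain("REVOLVE"))   # (16.5, 'Phi-3.5-Mini-Instruct', 'LD 7obj')
```

Both maxima fall on Phi-3.5-Mini-Instruct and on the Logical Deduction variants, and both are absolute differences in accuracy points.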

Analysis

Each component contributes — and the framework is resilient.

Beyond headline accuracy, we ask what each stage contributes (ablation) and how robust TextReg is when individual LLM roles in the optimization loop are weakened (resilience).

Ablation study across components

Ablation. Removing any of the three TextReg components produces a clear and consistent drop in mean OOD accuracy across four test engines.

All three stages are necessary

Disabling Dual-Evidence Gradient Purification, Semantic Edit Regularization, or Regularization-Guided Update each yields a clear accuracy drop — the components contribute non-redundantly.

Resilient to weaker engines

With any single LLM role downgraded to Qwen2.5-7B-Instruct, TextReg retains strong performance, which suggests that its regularization signal is structural rather than capability-bound.

Resilience under role-wise engine degradation

We replace one of three LLM-driven roles at a time with the substantially weaker Qwen2.5-7B-Instruct: Gradient (feedback & RuleBank), Regularization (semantic edit analysis & gradient synthesis), or Optimizer (prompt rewriting). TextReg retains strong performance under all three weakenings; on the hardest variant (Tracking Shuf. 7 obj), Weak-Regularization and Weak-Optimizer even surpass the All-Strong baseline.
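The experimental grid described here amounts to four role-to-engine assignments, which the sketch below enumerates. The strong engine is not named in this section, so `strong_engine` is a placeholder argument. The label "Weak-Gradient" is inferred by analogy with the two labels that do appear in the text.

```python
ROLES = ("gradient", "regularization", "optimizer")
WEAK_ENGINE = "Qwen2.5-7B-Instruct"


def role_configs(strong_engine: str) -> dict:
    """Build the All-Strong baseline plus one weakened configuration per role."""
    configs = {"All-Strong": {role: strong_engine for role in ROLES}}
    for weak_role in ROLES:
        assignment = {role: strong_engine for role in ROLES}
        assignment[weak_role] = WEAK_ENGINE  # downgrade exactly one role
        configs[f"Weak-{weak_role.capitalize()}"] = assignment
    return configs


# "strong-engine" is a placeholder, not a model name from the source.
for name, assignment in role_configs("strong-engine").items():
    print(name, assignment)
```

Weakening exactly one role per configuration is what lets any change in accuracy be attributed to that role alone.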

Figure panels: resilience on Logical Ded. 5 obj, Logical Ded. 7 obj, Tracking Shuf. 5 obj, and Tracking Shuf. 7 obj.

BibTeX

If you find this work useful, please cite

@article{fu2026textreg,
  title={TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization},
  author={Fu, Lucheng and Yu, Ye and Wang, Yiyang and Jin, Yiqiao and Jin, Haibo and Prakash, B Aditya and Wang, Haohan},
  journal={arXiv preprint arXiv:2605.21318},
  year={2026}
}