Clipped SGD Algorithms for Performative Prediction: Tight Bounds for Clipping Bias and Remedies

1Department of SEEM, The Chinese University of Hong Kong, Hong Kong
2Faculty of Engineering, Bar-Ilan University, Israel
International Conference on Machine Learning (ICML 2025) Main Conference
CUHK Logo
BIU Logo

Abstract

This paper studies the convergence of clipped stochastic gradient descent (SGD) algorithms with decision-dependent data distribution. Our setting is motivated by privacy preserving optimization algorithms that interact with performative data where the prediction models can influence future outcomes. This challenging setting involves the non-smooth clipping operator and non-gradient dynamics due to distribution shifts. We make two contributions in pursuit for a performative stable solution using clipped SGD algorithms. First, we characterize the clipping bias with projected clipped SGD (PCSGD) algorithm which is caused by the clipping operator that prevents PCSGD from reaching a stable solution. When the loss function is strongly convex, we quantify the lower and upper bounds for this clipping bias and demonstrate a bias amplification phenomenon with the sensitivity of data distribution. When the loss function is non-convex, we bound the magnitude of stationarity bias. Second, we propose remedies to mitigate the bias either by utilizing an optimal step size design for PCSGD, or to apply the recent DiceSGD algorithm [Zhang et al., 2023]. Our analysis is also extended to show that the latter algorithm is free from clipping bias in the performative setting. Numerical experiments verify our findings.

Keywords

Performative Prediction (Non)convex Optimization Clipping SGD Algorithm Differential Privacy Distribution Shift

Performative Prediction

Performative Prediction refers to a stochastic optimization problem in which the data distribution depends on the decision variable. This framework is especially relevant in a range of real-world applications:

Performative Prediction
  • Loan approval prediction: When training a classifier for loan applications, the published decision rule (e.g., from a bank) may influence how applicants behave. In anticipation of the classifier, applicants might manipulate their profiles before submission—such as by increasing credit card usage or making artificial transactions—to improve their chances of approval. This feedback loop can impact the convergence and stability of the learning algorithm itself.
  • Signal degeneracy in financial markets: As discussed in Shifting from Beta to Alpha, traditional alpha factors often degrade under changing macroeconomic regimes, reducing model accuracy. Additionally, trading behavior influenced by predictive models can itself reshape market dynamics, affecting the distribution of financial data in future periods.
The optmization problem is formulated as following:

$\min_{\prm \in \RR^d} \EE_{Z\sim \mathcal{D}(\prm)} [\lossfn(\prm; Z)]$

where ${\cal D}(\prm)$ capture the distribution induced by decision $\prm$.

Algorithms: PCSGD and DiceSGD

Motivation

Gradient clipping techniques is used to deal with multiple obstacles in learning algorithms such as the need for privacy preservation [Abadi et al., 2016], dealing with gradient explosion in non-smooth learning [Shor, 2012] such as training neural networks [Mikolov et al., 2012]; [Zhang et al., 2020a], [Hazan et al., 2015], etc.

Research Gap

Most existing works have focused on analyzing SGD algorithms with a smooth drift term. An open problem in the performative prediction literature is to analyze the behavior of clipped SGD algorithms which limit the magnitude of the stochastic gradient at every update step.

📌 Key Contributions

  • 🔄 Convergence Analysis under Performative Feedback: We analyze the convergence of PC-SGD when data distribution depends on the model's decision, showing convergence to a biased performative stable solution under both convex and non-convex settings.
  • 📊 Quantification of Clipping Bias: We characterize the asymptotic bias introduced by gradient clipping and show it is amplified by distribution shift and clipping sensitivity. A matching lower bound is provided in the strongly convex case.
  • ✂️ Bias Mitigation via Double Clipping: We study the recently proposed DICE-SGD algorithm and prove that its double clipping scheme can eliminate or significantly reduce the bias, achieving near-exact convergence.
  • 🔒 Privacy–Efficiency Tradeoff: We identify a tradeoff between differential privacy guarantees and computational efficiency, which informs optimal step size selection in practice.

Theoretical Results

Algorithm Loss Function Convergence Bias Level
PCSGD Strongly Convex ${\cal O}(1/t) + \text{bias}$ ${\bf \Theta}( 1 / (\mu - L \beta)^2 )$
PCSGD Non-Convex ${\cal O}(1/\sqrt{T}) + \text{bias}$ ${\cal O}(\beta + \max\{G-c, 0\}^2)$
DiceSGD Strongly Convex ${\cal O}(1/t)$ N/A
DiceSGD Non-Convex ${\cal O}(1/\sqrt{T}) + \text{bias}$ ${\cal O}(\beta)$

Note: $\beta$ represents the strength of distribution shift. $G$ is bound of gradient and $c$ is the clipping threshold, $t$ is the iteration index, $T$ is the total number of iterations.

PCSGD Algorithm

Deploy: $Z_{t+1} \sim {\cal D}( \prm_t )$

Update: $\prm_{t+1} = {\cal P}_{\cal X} \big(\prm_t - \gamma_{t+1} \clip_c( \grad \lossfn(\prm_t; Z_{t+1})) \big)$

where ${\cal P}_{\cal X}(\cdot)$ denotes the Euclidean projection operator onto ${\cal X}$, and $\clip_c(\cdot)$ is the clipping operator: for any $\bm{g} \in \RR^d$,

$\clip_{c} ( \bm{g} ): \bm{g} \in \RR^d \mapsto \min\left\{1, {c} / {\norm{\bm{g}}_2} \right\} \bm{g}$

where $c>0$ is a clipping parameter.

DiceSGD Algorithm

Input: Parameters $C_1$, $C_2$, step sizes, privacy parameters $\varepsilon$, $\delta$, initialization $\prm_0$, $e_0 = \bm{0}$.

For $t=0$ to $T-1$:

  1. Draw new sample $Z_{t+1}\sim {\cal D}(\prm_t)$ and Gaussian noise $\zeta_{t+1} \sim {\cal N}(0, \sigma^2 \bm{I})$.
  2. $v_{t+1} = \clip_{C_1}(\grad \lossfn(\prm_t; Z_{t+1})) + \clip_{C_2}(e_t)$.
  3. $\prm_{t+1} = \prm_t - \gamma_{t+1} (v_{t+1} + \zeta_{t+1})$
  4. $e_{t+1} = e_t + \grad \lossfn(\prm_t; Z_{t+1}) - v_{t+1}$.

Output: Last iterate $\prm_{T}$.

Simulation Results

We consider a scalar performative risk optimization problem with synthetic data:

$\min_{\prm \in {\cal X}}~ \EE_{z\sim {\cal D}(\prm)}[(\prm + az)^2/2]$

where ${\cal D}(\prm)$ is a uniform distribution over the data points $\{ b \tilde{Z}_i - \beta \prm \}_{i=1}^m$ such that $\tilde{Z}_i \sim {\cal B}(p)$ is Bernoulli distribution.

Experimental Results
Quadratic Minimization
Privacy vs Utility
Parameter vs Utility

Figure: (left) The performative stability gap $\|\prm_t - \prm_{PS}\|^2$. (middle) Trade off between privacy budget $\varepsilon$ and bias. (right) Bias amplification effect due to $\beta$.

Observations

  • PCSGD cannot converge to $\prm_{PS}$ due to the clipping bias which increases as $\beta \uparrow$.
  • DiceSGD finds a bias-free solution as it converges to $\prm_{PS}$ at rate ${\cal O}(1/t)$.
  • As the privacy budget decreases $\varepsilon \downarrow 0$ or $\beta \uparrow \frac{\mu}{L}$, the bias increases.

Contact

Welcome to check our paper for more details of this research work. If you have any comments or questions, please feel free to contact us.

If you find our paper and repo useful, please cite us as follows:

@inproceedings{li2025clipped,
  title={Clipped SGD Algorithms for Performative Prediction: Tight Bounds for Clipping Bias and Remedies},
  author={Li, Qiang and Yemini, Michal and Wai, Hoi-To},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025},
  organization={PMLR}
}