Derivation of Some Robust and Distributionally Robust Optimization Problems

In the two-step Bayesian pipeline, a joint probability distribution of the asset prices is first predicted from the training data. In the second step, this predicted distribution is used as input (either via its parameters or as samples) to the portfolio optimization (PO), which provides the optimal portfolio weights depending on the trade-off between the mean and the risk of the portfolio return. Classical methods like the mean-variance (Markowitz) and mean-expected shortfall models have been used widely. However, as predicted quantities, the distribution parameters (for instance, the mean vector and the covariance matrix in the case of Markowitz PO) can also carry uncertainties, which are not taken into account by traditional PO approaches. Robust optimization (RO) and distributionally robust optimization (DRO) models make it possible to model such uncertainties directly. In this document, we provide a general framework of recent DRO models. Besides the derivation of the original models, we also suggest some additional and more general ones.

## Robust optimization models

In the case of robust optimization (RO) models, we introduce uncertainty sets for some PO parameters. In a rather general setting, the RO models we are interested in have the following concise form: $$\max_{w \in \mathcal{W}} \min_{\xi \in \mathcal{U}} o(w, \xi), \label{eq:ro}$$ where $w \in \mathcal{W}$ is the decision vector (i.e. portfolio weights), $\xi$ is the vector of the distribution parameters, $\mathcal{U}$ is the uncertainty set for these parameters and $o(w, \xi)$ is the objective function (e.g. expected return). More specifically, we will consider robust optimization approaches within the mean-variance (aka Markowitz) portfolio optimization framework (MV-PO) using an uncertainty set on the mean and applying the maximization of the expected return formulation: \begin{aligned} \max_{w} & \ R \\ \mathrm{s.t.} & \ R = \min_{\mu \in \mathcal{U}} w^{T}\mu, \\ & \ V \le \overline{V}, \\ & \ V = w^{T} \Sigma w, \\ & \ w \in \mathcal{W}, \\ \end{aligned} \label{eq:ro_mean} where $w$ is the portfolio weight vector, $\mu$ and $\Sigma$ are the mean vector and covariance matrix of the joint distribution of the assets' prices, $R$ and $V$ are the mean and variance variables of the portfolio return, $\overline{V}$ is the maximum variance (risk) and $\mathcal{W}$ is a convex set. Unlike in conventional MV-PO, in this model we maximize the worst-case expected return, which is obtained from a minimization problem over all possible mean vectors in the uncertainty set $\mathcal{U}$.
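To make the maximin structure concrete, consider the simple interval uncertainty set $\mathcal{U} = \prod_{i} [\hat{\mu}_{i} - \Delta_{i}, \hat{\mu}_{i} + \Delta_{i}]$. The inner minimum then separates per coordinate: $\min_{\mu \in \mathcal{U}} w^{T}\mu = w^{T}\hat{\mu} - \sum_{i} \Delta_{i} |w_{i}|$. A minimal numerical sketch (all data below are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical data: predicted means, per-asset deviations, fixed weights.
mu_hat = np.array([0.04, 0.02, 0.06, 0.01])
delta  = np.array([0.02, 0.01, 0.03, 0.02])
w      = np.array([0.4, -0.1, 0.5, 0.2])   # shorting allowed here

# Closed form: each coordinate moves against the sign of w_i.
worst_closed = w @ mu_hat - delta @ np.abs(w)

# Random points of the box never produce a smaller expected return ...
mu_samples = mu_hat + delta * rng.uniform(-1.0, 1.0, size=(20000, 4))
sampled_min = (mu_samples @ w).min()

# ... and the corner mu_i = mu_hat_i - sign(w_i) * Delta_i attains the minimum.
mu_worst = mu_hat - np.sign(w) * delta
```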

### Bertsimas model

In the case of the Bertsimas model [Bertsimas04], the uncertainty set is defined as: $$\mathcal{U}_{\mathrm{Bertsimas}} = \left\{ \mu \ \begin{array}{|ll} \hat{\mu}_{i} - z_{i}\Delta_{i} \le \mu_{i} \le \hat{\mu}_{i} + z_{i}\Delta_{i} & \forall i \\ 0 \le z_{i} \le 1 & \forall i \\ \sum_{i} z_{i} \le \Gamma \\ \end{array} \label{eq:bertsimas_uncertainty_set} \right\}.$$ This leads to the following optimization problem for the worst-case expected return: \begin{aligned} \min_{\mu, z} & \ w^{T}\mu \\ \mathrm{s.t.} & \ \hat{\mu}_{i} - \mu_{i} - z_{i}\Delta_{i} \le 0 & \forall i, \\ & \ -\hat{\mu}_{i} + \mu_{i} - z_{i}\Delta_{i} \le 0 & \forall i, \\ & \ z_{i} - 1 \le 0 & \forall i, \\ & \ -z_{i} \le 0 & \forall i, \\ & \sum_{i} z_{i} - \Gamma \le 0. \\ \end{aligned} \label{eq:bertsimas_uncertainty_model} However, incorporating this optimization into \ref{eq:ro_mean} leads to a maximin problem that cannot be straightforwardly solved. Instead, we consider the dual problem [Boyd04], which can be easily derived since the above is a linear programming (LP) problem (for which strong duality holds). First, we consider the Lagrangian of problem \ref{eq:bertsimas_uncertainty_model}: \begin{aligned} L(\mu, z, \pi^{+}, \pi^{-}, \gamma, \theta, \lambda) =\ & w^{T} \mu + \\ & \sum_{i} (\hat{\mu}_{i} - \mu_{i} - z_{i} \Delta_{i}) \pi_{i}^{+} + \\ & \sum_{i} (\mu_{i} - \hat{\mu}_{i} - z_{i} \Delta_{i}) \pi_{i}^{-} + \\ & \sum_{i} (z_{i} - 1) \theta_{i} + \\ & \sum_{i} (-z_{i}) \gamma_{i} + \\ & \left(\sum_{i} z_{i} - \Gamma\right) \lambda, \end{aligned} where $\pi^{+}, \pi^{-}, \gamma, \theta, \lambda$ are the Lagrange multipliers of the corresponding inequality constraints.
The Lagrange dual function then can be obtained as: \begin{aligned} g(\pi^{+}, \pi^{-}, \gamma, \theta, \lambda) = & \inf_{\mu, z} L(\mu, z, \pi^{+}, \pi^{-}, \gamma, \theta, \lambda) \\ = & \sum_{i} \left( (\pi_{i}^{+} - \pi_{i}^{-}) \hat{\mu}_{i} -\theta_{i} \right) - \Gamma \lambda + \\ & \inf_{\mu} \sum_{i} (w_{i} - \pi_{i}^{+} + \pi_{i}^{-}) \mu_{i} + \\ & \inf_{z} \sum_{i} (-\Delta_{i} \pi_{i}^{+} -\Delta_{i} \pi_{i}^{-} + \theta_{i} -\gamma_{i} + \lambda) z_{i}. \end{aligned} The last two terms in the above equation are bounded below only if $w_{i} - \pi_{i}^{+} + \pi_{i}^{-} = 0$ and $-\Delta_{i} \pi_{i}^{+} -\Delta_{i} \pi_{i}^{-} + \theta_{i} -\gamma_{i} + \lambda = 0$ for all $i$. The corresponding dual problem, therefore, can be written as: \begin{aligned} \max_{\pi^{+}, \pi^{-}, \theta, \lambda} &\ \sum_{i} \left( (\pi_{i}^{+} - \pi_{i}^{-}) \hat{\mu}_{i} -\theta_{i} \right) - \Gamma \lambda \\ \mathrm{s.t.} & \ w_{i} - \pi_{i}^{+} + \pi_{i}^{-} = 0 & \forall i, \\ & \ -\Delta_{i} \pi_{i}^{+} -\Delta_{i} \pi_{i}^{-} + \theta_{i} + \lambda \ge 0 & \forall i, \\ & \ \pi_{i}^{+} \ge 0 & \forall i, \\ & \ \pi_{i}^{-} \ge 0 & \forall i, \\ & \ \theta_{i} \ge 0 & \forall i, \\ & \ \lambda \ge 0 , \\ \end{aligned} where we got rid of the variable $\gamma$, since: $-\Delta_{i} \pi_{i}^{+} -\Delta_{i} \pi_{i}^{-} + \theta_{i} -\gamma_{i} + \lambda = 0$ and $\gamma_{i} \ge 0 \ \Rightarrow -\Delta_{i} \pi_{i}^{+} -\Delta_{i} \pi_{i}^{-} + \theta_{i} + \lambda = \gamma_{i} \ge 0$ for all $i$. 
The above maximization problem can be easily incorporated into problem \ref{eq:ro_mean}: \begin{array}{c} \mathbf{(Bertsimas\ model)} \\ \begin{aligned} \max_{w, \pi^{+}, \pi^{-}, \theta, \lambda} &\ R \\ \mathrm{s.t.} & \ R = \sum_{i} \left( (\pi_{i}^{+} - \pi_{i}^{-}) \hat{\mu}_{i} -\theta_{i} \right) - \Gamma \lambda, \\ & \ w_{i} = \pi_{i}^{+} - \pi_{i}^{-} & \forall i, \\ & \ \Delta_{i}( \pi_{i}^{+} + \pi_{i}^{-}) - \theta_{i} \le \lambda & \forall i, \\ & \ \pi_{i}^{+} \ge 0 & \forall i, \\ & \ \pi_{i}^{-} \ge 0 & \forall i, \\ & \ \theta_{i} \ge 0 & \forall i, \\ & \ \lambda \ge 0, \\ & \ V \le \overline{V}, \\ & \ V = w^{T} \Sigma w, \\ & \ w \in \mathcal{W} . \\ \end{aligned} \end{array}
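As a numerical sanity check of the derivation, by LP strong duality the dual optimum must coincide with the worst-case expected return of problem \ref{eq:bertsimas_uncertainty_model}, which for $w \ge 0$ amounts to spending the budget $\Gamma$ on the largest $\Delta_{i} w_{i}$ products (fractionally for the last unit). A minimal sketch, with hypothetical data, using `scipy`:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative data (hypothetical values, n = 3 assets).
mu_hat = np.array([0.05, 0.03, 0.07])   # predicted mean vector
delta  = np.array([0.02, 0.01, 0.03])   # per-asset deviations Delta_i
w      = np.array([0.5, 0.3, 0.2])      # fixed portfolio weights, w_i >= 0
gamma  = 1.5                            # budget of uncertainty Gamma
n = len(w)

# Dual LP from the Bertsimas model: variables x = (pi+, pi-, theta, lambda).
# Maximize sum((pi+_i - pi-_i) mu_hat_i - theta_i) - Gamma * lambda
# (linprog minimizes, so the objective is negated).
c = np.concatenate([-mu_hat, mu_hat, np.ones(n), [gamma]])
A_eq = np.hstack([np.eye(n), -np.eye(n), np.zeros((n, n)), np.zeros((n, 1))])
b_eq = w                                   # pi+_i - pi-_i = w_i
A_ub = np.hstack([np.diag(delta), np.diag(delta), -np.eye(n), -np.ones((n, 1))])
b_ub = np.zeros(n)                         # Delta_i (pi+ + pi-) - theta_i <= lambda
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (3 * n + 1))
worst_dual = -res.fun

# Primal worst case, solved directly: for w >= 0 the inner minimum pushes
# mu_i to mu_hat_i - z_i * Delta_i, so it reduces to spending Gamma greedily
# on the largest Delta_i * w_i products.
d = np.sort(delta * w)[::-1]
budget, drop = gamma, 0.0
for di in d:
    take = min(1.0, budget)
    drop += take * di
    budget -= take
worst_primal = w @ mu_hat - drop

print(worst_dual, worst_primal)  # both equal approximately 0.035 for this data
```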

### Ben-Tal model

The Ben-Tal model [BenTal00] defines the uncertainty set on the mean as: $$\mathcal{U}_{\mathrm{BenTal}} = \left\{ \mu \ \begin{array}{|ll} \sqrt{(\mu - \hat{\mu})^{T} \Sigma^{-1} (\mu - \hat{\mu})} \le \delta \\ \end{array} \right\}. \label{eq:bental_uncertainty_set}$$ The corresponding optimization problem for the worst-case expected return has the following concise form: \begin{aligned} \min_{\mu} & \ w^{T}\mu \\ \mathrm{s.t.} & \ \| (\Sigma^{-1/2})^{T} (\mu - \hat{\mu})\|_2 \le \delta, \end{aligned} \label{eq:bental_uncertainty_model} where $\Sigma^{-1/2}$ denotes the inverse of the upper triangular Cholesky factor $\Sigma^{1/2}$ of $\Sigma$, i.e. $\Sigma = \left( \Sigma^{1/2} \right)^{T} \Sigma^{1/2}$ and hence $\Sigma^{-1} = \Sigma^{-1/2} \left( \Sigma^{-1/2} \right)^{T}$. However, similarly to the Bertsimas model, this optimization problem cannot be directly utilized within problem \ref{eq:ro_mean} and we will again use the dual problem. For this, we first note that \ref{eq:bental_uncertainty_model} is a second order cone program (SOCP). The general form of an SOCP is: \begin{aligned} \min_{x} & \ f^{T}x \\ \mathrm{s.t.} & \ \| A_{i}x + b_{i} \|_{2} \le c_{i}^{T}x + d_{i} \quad i = 1,\dots,N. \end{aligned} \label{eq:socp} As is well known [Lobo98], the dual problem of the optimization problem \ref{eq:socp} is given by: \begin{aligned} \max_{\lambda, \theta} & \ -\sum_{i=1}^{N} \left( \lambda_{i}^{T}b_{i} + \theta_{i}d_{i} \right)\\ \mathrm{s.t.} & \ \sum_{i=1}^{N} \left( A_{i}^{T} \lambda_{i} + c_{i}\theta_{i} \right) = f, \\ & \ \| \lambda_{i} \|_{2} \le \theta_{i} \quad i = 1,\dots,N. \end{aligned} \label{eq:socp_dual} Identifying that in problem \ref{eq:bental_uncertainty_model}: $N = 1$, $x = \mu$, $f = w$, $A = \left(\Sigma^{-1/2}\right)^{T}$, $b = -\left(\Sigma^{-1/2}\right)^{T} \hat{\mu}$, $c = 0$ and $d = \delta$, the corresponding dual problem of \ref{eq:bental_uncertainty_model} can be written as: \begin{aligned} \max_{\lambda, \theta} & \ \lambda^{T} \left(\Sigma^{-1/2} \right)^{T} \hat{\mu} - \theta \delta \\ \mathrm{s.t.} & \ \Sigma^{-1/2} \lambda = w, \\ & \ \| \lambda \|_{2} \le \theta, \\ \end{aligned} where the equality constraint gives us the possibility to get rid of the dual variable $\lambda = \Sigma^{1/2} w$ and write: \begin{aligned} \max_{\theta} & \ w^{T} \hat{\mu} - \theta \delta \\ \mathrm{s.t.} & \ \left \| \Sigma^{1/2} w \right \|_{2} \le \theta . \\ \end{aligned} Using the above equation in problem \ref{eq:ro_mean}, the final form of the optimization problem is: \begin{array}{c} \mathbf{(Ben-Tal\ model)} \\ \begin{aligned} \max_{w, \theta} &\ R \\ \mathrm{s.t.} & \ R = w^{T} \hat{\mu} - \theta \delta, \\ & \ \left \| \Sigma^{1/2} w \right \|_{2} \le \theta, \\ & \ V \le \overline{V}, \\ & \ V = w^{T} \Sigma w, \\ & \ w \in \mathcal{W} . \\ \end{aligned} \end{array}
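The Ben-Tal worst case admits the closed form $w^{T}\hat{\mu} - \delta\sqrt{w^{T}\Sigma w}$, attained at $\mu^{*} = \hat{\mu} - \delta \Sigma w / \sqrt{w^{T}\Sigma w}$, which gives an easy numerical check. A sketch with random illustrative data (note that `numpy` returns the lower-triangular Cholesky factor $L$ with $\Sigma = L L^{T}$, which is equally valid for sampling the ellipsoid):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data (hypothetical): random covariance, mean and weights.
n = 4
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)          # positive definite covariance
mu_hat = rng.normal(0.05, 0.02, size=n)
w = rng.dirichlet(np.ones(n))            # long-only weights summing to 1
delta = 0.5

# Closed-form worst case and its minimizer.
s = np.sqrt(w @ Sigma @ w)
worst_closed = w @ mu_hat - delta * s
mu_star = mu_hat - delta * Sigma @ w / s

# mu* lies on the boundary of the ellipsoidal uncertainty set ...
radius = np.sqrt((mu_star - mu_hat) @ np.linalg.solve(Sigma, mu_star - mu_hat))

# ... and no sampled point of the set gives a smaller expected return.
L = np.linalg.cholesky(Sigma)            # Sigma = L L^T
u = rng.normal(size=(10000, n))
u *= (delta * rng.random(10000) ** (1 / n) / np.linalg.norm(u, axis=1))[:, None]
samples = mu_hat + u @ L.T               # (mu - mu_hat)^T Sigma^{-1} (mu - mu_hat) <= delta^2
sampled_min = (samples @ w).min()
```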

## Distributionally robust optimization models

In the case of distributionally robust optimization (DRO) models, we treat the PO parameters as random variables and introduce an ambiguity set of the probability measures on these variables. We are interested in models that have the following concise form: $$\max_{w \in \mathcal{W}} \min_{F(\xi) \in \mathcal{A}} \mathbb{E}_{F(\xi)} \left[ u(w, \xi) \right], \label{eq:dro_utility}$$ where $w \in \mathcal{W}$ is the decision vector, as before, $\xi$ is the vector of the PO parameters treated as a random variable, $F(\xi)$ is a probability measure, $\mathcal{A}$ is the ambiguity set of the probability measures and $u(w, \xi)$ is a utility function. We will see later that the ambiguity set is actually parameterized by the uncertainty set on $\xi$, i.e. $\mathcal{A} = \mathcal{A}(\mathcal{U})$. Utility functions are usually monotonic and quasi-concave. In the DRO literature, an alternative formulation to problem \ref{eq:dro_utility} is also widely applied by using a loss function $\ell(w, \xi)$ instead of the utility function: $$\min_{w \in \mathcal{W}} \max_{F(\xi) \in \mathcal{A}} \mathbb{E}_{F(\xi)} \left[ \ell(w, \xi) \right]. \label{eq:dro_loss}$$ It is easy to see that the two formulations are equivalent by setting $u(w, \xi) = -\ell(w, \xi)$.

### Delage model

The ambiguity set of $F(\xi)$ in the Delage model [Delage10] is defined as: $$\mathcal{A}_{\mathrm{Delage}} = \left\{ F(\xi) \in \mathcal{M}(\mathcal{U}) \ \begin{array}{|l} \mathbb{P}_{F(\xi)}\left(\xi \in \mathcal{U} \right) = 1 \\ \left(\mathbb{E}_{F(\xi)}\left[ \xi \right] - \hat{\mu}\right)^{T} \hat{\Sigma}^{-1}\left(\mathbb{E}_{F(\xi)}\left[ \xi \right] - \hat{\mu}\right) \le \gamma_{1} \\ \mathbb{E}_{F(\xi)}\left[ \left( \xi - \hat{\mu} \right) \left( \xi - \hat{\mu} \right)^{T} \right] \preceq \gamma_{2} \hat{\Sigma} \\ \end{array} \right\}, \label{eq:delage_ambiguity_set}$$ where $\mathcal{M}(\mathcal{U})$ is the set of all probability measures supported on the uncertainty set $\mathcal{U}$, which is a closed convex set containing the support of $F$; $\hat{\mu}$, $\hat{\Sigma}$, $\gamma_{1}$ and $\gamma_{2}$ are parameters. Using an arbitrary loss function, the inner maximization problem of \ref{eq:dro_loss} becomes: \begin{aligned} \max_{F} &\ \int_{\mathcal{U}} \ell(w, \xi) \mathrm{d}F(\xi) \\ \mathrm{s.t.} & \int_{\mathcal{U}} \mathrm{d}F(\xi) = 1, \\ & \ \int_{\mathcal{U}}\begin{bmatrix} \hat{\Sigma} & \left( \xi - \hat{\mu} \right) \\ \left( \xi - \hat{\mu} \right)^{T} & \gamma_{1}\end{bmatrix} \mathrm{d}F(\xi) \succeq 0, \\ & \ \int_{\mathcal{U}} \left( \xi - \hat{\mu} \right) \left( \xi - \hat{\mu} \right)^{T} \mathrm{d}F(\xi) \preceq \gamma_{2} \hat{\Sigma}, \\ & \ F \in \mathcal{M}(\mathcal{U}), \\ \end{aligned} where for the second inequality constraint we used the Schur complement, i.e. for a symmetric positive definite matrix $A$, vector $b$ and scalar $c$: $b^{T}A^{-1}b \le c \iff \begin{bmatrix} A & b \\ b^{T} & c \end{bmatrix} \succeq 0$.
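The Schur-complement equivalence invoked above is easy to confirm numerically for a random positive definite $A$ (a minimal sketch with illustrative data):

```python
import numpy as np

rng = np.random.default_rng(1)

def schur_psd(A, b, c):
    """True iff the block matrix [[A, b], [b^T, c]] is positive semidefinite."""
    M = np.block([[A, b[:, None]], [b[None, :], np.array([[c]])]])
    return bool(np.linalg.eigvalsh(M).min() >= -1e-10)

# Random positive definite A and vector b (illustrative data).
n = 5
G = rng.normal(size=(n, n))
A = G @ G.T + np.eye(n)
b = rng.normal(size=n)
q = b @ np.linalg.solve(A, b)    # b^T A^{-1} b

# b^T A^{-1} b <= c  <=>  [[A, b], [b^T, c]] >= 0, tested on both sides of q.
results = [schur_psd(A, b, c) for c in (q - 0.1, q + 1e-6, q + 0.1)]
```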
The Lagrangian of the above problem is: \begin{aligned} L(F, r, P, p, s, Q) = &\ \int_{\mathcal{U}} \ell(w, \xi) \mathrm{d}F(\xi) + \\ & r \left( 1 - \int_{\mathcal{U}} \mathrm{d}F(\xi) \right) + \\ & \begin{bmatrix} P & p \\ p^{T} & s \end{bmatrix} \bullet \left( \int_{\mathcal{U}}\begin{bmatrix} \hat{\Sigma} & \left( \xi - \hat{\mu} \right) \\ \left( \xi - \hat{\mu} \right)^{T} & \gamma_{1}\end{bmatrix} \mathrm{d}F(\xi) \right) + \\ & Q \bullet \left( \gamma_{2} \hat{\Sigma} - \left( \int_{\mathcal{U}} \left( \xi - \hat{\mu} \right) \left( \xi - \hat{\mu} \right)^{T} \mathrm{d}F(\xi) \right) \right), \end{aligned} where $X \bullet Y = \sum_{ij}X_{ij}Y_{ij}$ refers to the Frobenius inner product and the dual variables are the symmetric matrices $P$ and $Q$, the vector $p$ and the scalars $r$ and $s$, with $Q \succeq 0$ and $\begin{bmatrix} P & p \\ p^{T} & s \end{bmatrix} \succeq 0$. The Lagrange dual function can be obtained as: \begin{aligned} g(r, P, p, s, Q) = &\ r + P \bullet \hat{\Sigma} - 2p^{T} \hat{\mu} + s \gamma_{1} + Q \bullet \left( \gamma_{2} \hat{\Sigma} \right) - Q \bullet \left( \hat{\mu} \hat{\mu}^{T} \right) + \\ & \sup_{F} \int_{\mathcal{U}} \left( \ell(w, \xi) - r + 2p^{T} \xi - \xi^{T} Q \xi + 2\xi^{T} Q \hat{\mu} \right) \mathrm{d}F(\xi). \end{aligned} Given $F(\xi)$ is a probability measure, the last term is bounded above if $\ell(w, \xi) - r + 2p^{T} \xi - \xi^{T} Q \xi + 2\xi^{T} Q \hat{\mu} \le 0$ for all $\xi \in \mathcal{U}$. The corresponding dual problem, therefore, has the following form: \begin{aligned} \min_{r, P, p, s, Q} &\ r + P \bullet \hat{\Sigma} - 2p^{T} \hat{\mu} + s \gamma_{1} + Q \bullet \left( \gamma_{2} \hat{\Sigma} \right) - \hat{\mu}^{T} Q \hat{\mu} \\ \mathrm{s.t.} &\ -\ell(w, \xi) + r + \xi^{T} Q \xi - 2\xi^{T} (Q \hat{\mu} + p) \ge 0 \quad \forall \xi \in \mathcal{U}, \\ & \ \begin{bmatrix} P & p \\ p^{T} & s \end{bmatrix} \succeq 0, \\ & \ Q \succeq 0 . 
\end{aligned} \label{eq:delage_dual_loss} The final step in the derivation is to incorporate the above equation into problem \ref{eq:dro_loss}. In a more general setup, we will consider a piecewise linear utility function: $u(w, \xi) = \min \limits_{k} a_{k}\xi^{T}w + b_{k}$ with $a_{k} \ge 0 \quad \forall k = 1, \dots, K$. Using the corresponding loss function, i.e. $\ell(w, \xi) = \max \limits_{k} -a_{k} \xi^{T} w - b_{k}$ and introducing the variable $q = -2(p + Q \hat{\mu})$ we can write: \begin{aligned} \min_{w, r, P, p, s, Q, q} &\ r + P \bullet \hat{\Sigma} - 2p^{T} \hat{\mu} + s \gamma_{1} + Q \bullet \left( \gamma_{2} \hat{\Sigma} \right) - \hat{\mu}^{T} Q \hat{\mu} \\ \mathrm{s.t.} &\ q = -2(p + Q \hat{\mu}), \\ & \ \xi^{T} Q \xi + \xi^{T}q + r \ge -a_{k}\xi^{T}w - b_{k} \quad \forall \xi \in \mathcal{U} \quad k = 1, \dots, K, \\ & \ \begin{bmatrix} P & p \\ p^{T} & s \end{bmatrix} \succeq 0, \\ & \ Q \succeq 0, \\ & \ w \in \mathcal{W} . \end{aligned} \label{eq:delage_dual_xi} Now we will investigate two cases depending on the uncertainty set $\mathcal{U}$.

1. Let $\mathcal{U} = \mathbb{R}^{n}$. In this case, for each $k$: $\xi^{T} Q \xi + \xi^{T}q + r + a_{k}\xi^{T}w + b_{k} \ge 0 \iff \left( \min \limits_{\xi}\ \xi^{T} Q \xi + \xi^{T}q + r + a_{k}\xi^{T}w + b_{k} \right) \ge 0$. The minimizer can be obtained easily by taking the derivative of the quadratic form wrt. $\xi$. We find that $\xi^{*} = -\frac{1}{2}Q^{g}(q + a_{k}w)$, where $Q^{g}$ is the generalized inverse of $Q$. For the optimal value of the quadratic form we can write: $-\frac{1}{4}(q + a_{k}w)^{T}Q^{g}(q + a_{k}w) + r + b_{k} \ge 0$. Using again the Schur complement, the equivalent matrix inequality is: $\begin{bmatrix} Q & \frac{1}{2}(q + a_{k}w) \\ \frac{1}{2}(q + a_{k}w)^{T} & r + b_{k} \end{bmatrix} \succeq 0$. Using this expression in problem \ref{eq:delage_dual_xi} leads to the final form of the Delage model we use: \begin{array}{c} \mathbf{(Delage\ simple\ model)} \\ \begin{aligned} \min_{w, r, P, p, s, Q, q} &\ r + P \bullet \hat{\Sigma} - 2p^{T} \hat{\mu} + s \gamma_{1} + Q \bullet \left( \gamma_{2} \hat{\Sigma} \right) - \hat{\mu}^{T} Q \hat{\mu} \\ \mathrm{s.t.} &\ q = -2(p + Q \hat{\mu}), \\ & \ \begin{bmatrix} Q & \frac{1}{2}(q + a_{k}w) \\ \frac{1}{2}(q + a_{k}w)^{T} & r + b_{k} \end{bmatrix} \succeq 0 \quad k = 1, \dots, K, \\ & \ \begin{bmatrix} P & p \\ p^{T} & s \end{bmatrix} \succeq 0, \\ & \ Q \succeq 0, \\ & \ w \in \mathcal{W} . \end{aligned} \end{array}
2. We can also have some bounds on the components of $\xi$. Let $\mathcal{U} = \left[\underline{\xi}_{1}, \overline{\xi}_{1} \right] \times \dots \times \left[\underline{\xi}_{n}, \overline{\xi}_{n} \right]$ be a multidimensional box. The corresponding optimization problems for each $k$ then become: \begin{aligned} \min_{\xi} &\ \xi^{T} Q \xi + \xi^{T}q + r + a_{k}\xi^{T}w + b_{k} \\ \mathrm{s.t.} &\ \underline{\xi} - \xi \preceq 0, \\ &\ \xi- \overline{\xi} \preceq 0 . \\ \end{aligned} The Lagrangian of the above problem is: $$L(\xi, \underline{\lambda}, \overline{\lambda}) = \xi^{T} Q \xi + (q + a_{k}w - \underline{\lambda} + \overline{\lambda})^{T} \xi + r + b_{k} + \underline{\lambda}^{T} \underline{\xi} - \overline{\lambda}^{T} \overline{\xi},$$ with $\underline{\lambda}$ and $\overline{\lambda}$ dual variables. The Lagrange dual $g(\underline{\lambda}, \overline{\lambda}) = \inf \limits_{\xi} L(\xi, \underline{\lambda}, \overline{\lambda})$ can be obtained straightforwardly by solving $\frac{\partial L}{\partial \xi} = 0$ for $\xi$. 
This yields the minimizer: $\xi^{*} = -\frac{1}{2}Q^{g} (q + a_{k}w - \underline{\lambda} + \overline{\lambda})$ and the Lagrange dual: $g(\underline{\lambda}, \overline{\lambda}) = -\frac{1}{4}(q + a_{k}w - \underline{\lambda} + \overline{\lambda})^{T}Q^{g}(q + a_{k}w - \underline{\lambda} + \overline{\lambda}) + r + b_{k} + \underline{\lambda}^{T} \underline{\xi} - \overline{\lambda}^{T} \overline{\xi}$.\\[1.1mm] The dual problem is, therefore: \begin{aligned} \max_{\nu_{k}, \underline{\lambda}, \overline{\lambda}} &\ \nu_{k} \\ \mathrm{s.t.} &\ \nu_{k} \le -\frac{1}{4}(q + a_{k}w - \underline{\lambda} + \overline{\lambda})^{T}Q^{g}(q + a_{k}w - \underline{\lambda} + \overline{\lambda}) + r + b_{k} + \underline{\lambda}^{T} \underline{\xi} - \overline{\lambda}^{T} \overline{\xi}, \\ & \ \underline{\lambda} \succeq 0, \\ & \ \overline{\lambda} \succeq 0 , \\ \end{aligned} where we introduced the scalar variable $\nu_{k}$ for convenience ($\nu_{k} \ge 0$). Applying the Schur complement for the first constraint, the equivalent matrix inequality is: $$\begin{bmatrix} Q & \frac{1}{2}(q + a_{k}w - \underline{\lambda} + \overline{\lambda}) \\ \frac{1}{2}(q + a_{k}w - \underline{\lambda} + \overline{\lambda})^{T} & r + b_{k} + \underline{\lambda}^{T} \underline{\xi} - \overline{\lambda}^{T} \overline{\xi} - \nu_{k} \end{bmatrix} \succeq 0 .$$ The corresponding Delage model: \begin{array}{c} \mathbf{(Delage\ box\ model)} \\ \begin{aligned} \max_{w, r, P, p, s, Q, q, \nu, \underline{\lambda}, \overline{\lambda}} &\ -\left( r + P \bullet \hat{\Sigma} - 2p^{T} \hat{\mu} + s \gamma_{1} + Q \bullet \left( \gamma_{2} \hat{\Sigma} \right) - \hat{\mu}^{T} Q \hat{\mu} \right)\\ \mathrm{s.t.} &\ q = -2(p + Q \hat{\mu}), \\ & \ \begin{bmatrix} Q & \frac{1}{2}(q + a_{k}w - \underline{\lambda} + \overline{\lambda}) \\ \frac{1}{2}(q + a_{k}w - \underline{\lambda} + \overline{\lambda})^{T} & r + b_{k} + \underline{\lambda}^{T} \underline{\xi} - \overline{\lambda}^{T} \overline{\xi} - \nu_{k} 
\end{bmatrix} \succeq 0 & k = 1, \dots, K, \\ & \ \begin{bmatrix} P & p \\ p^{T} & s \end{bmatrix} \succeq 0, \\ & \ Q \succeq 0, \\ & \ \nu_{k} \ge 0 & k = 1, \dots, K, \\ & \ \underline{\lambda} \succeq 0, \\ & \ \overline{\lambda} \succeq 0, \\ & \ w \in \mathcal{W} . \end{aligned} \end{array}
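Both cases above rest on two facts that can be checked numerically: the unconstrained minimum of $\xi^{T} Q \xi + \xi^{T} v + c$ equals $-\frac{1}{4} v^{T} Q^{g} v + c$ and is nonnegative exactly when the Schur-complement LMI holds, and, in the box case, the Lagrange dual $g(\underline{\lambda}, \overline{\lambda})$ lower-bounds the constrained minimum by weak duality. A small sketch of both (random illustrative data with $n = 2$ and $Q \succ 0$, so $Q^{g} = Q^{-1}$):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: Q positive definite, v plays the role of q + a_k w.
G = rng.normal(size=(2, 2))
Q = G @ G.T + np.eye(2)
Qinv = np.linalg.inv(Q)
v = rng.normal(size=2)
c = 5.0                                   # plays the role of r + b_k

def f(xi):
    return xi @ Q @ xi + xi @ v + c

# Case 1: unconstrained minimum at xi* = -1/2 Q^{-1} v ...
xi_star = -0.5 * Qinv @ v
opt_val = -0.25 * v @ Qinv @ v + c
# ... certified nonnegative exactly when the Schur-complement LMI is PSD.
M = np.block([[Q, 0.5 * v[:, None]], [0.5 * v[None, :], np.array([[c]])]])
lmi_psd = np.linalg.eigvalsh(M).min() >= -1e-10

# Case 2: box [-1, 1]^2; dense-grid approximation of the constrained minimum.
t = np.linspace(-1.0, 1.0, 201)
X, Y = np.meshgrid(t, t)
pts = np.stack([X.ravel(), Y.ravel()], axis=1)
grid_min = (np.einsum('ni,ij,nj->n', pts, Q, pts) + pts @ v + c).min()

lo, hi = -np.ones(2), np.ones(2)
def g(lam_lo, lam_hi):
    # Lagrange dual value for multipliers lam_lo, lam_hi >= 0.
    u = v - lam_lo + lam_hi
    return -0.25 * u @ Qinv @ u + c + lam_lo @ lo - lam_hi @ hi

# Weak duality: no nonnegative multiplier pair exceeds the box minimum.
best_dual = max(g(rng.random(2), rng.random(2)) for _ in range(500))
```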

### Yang model

In a more general framework, DRO models can also include constraints depending on some functional of the utility / loss function [Rahimian19]. For instance, problem \ref{eq:dro_utility} can be extended by a constraint on the worst-case expected shortfall (aka conditional value at risk) of the utility function: $$\max_{w \in \mathcal{W}} \min_{F(\xi) \in \mathcal{A}} \left\{ \mathbb{E}_{F(\xi)} \left[ u(w, \xi) \right] \left| \max_{F(\xi) \in \mathcal{A}} \mathbb{ES}_{F(\xi)}^{\alpha} \left[ u(w, \xi) \right] \le \sigma \right. \right\}, \label{eq:dro_utility_es}$$ where the expected shortfall of the utility function with $0 < \alpha < 1$ quantile level is defined as: $$\mathbb{ES}_{F(\xi)}^{\alpha}\left[ u(w, \xi) \right] = -\frac{1}{\alpha} \int \limits_{\xi:\ u(w, \xi) \le q_{\alpha}(u(w, \xi))} u(w, \xi) \mathrm{d}F(\xi), \label{eq:es_utility}$$ where $q_{\alpha}$ is the quantile function with $\alpha$-quantile level. Similarly, DRO models with a corresponding loss function can be written as: $$\min_{w \in \mathcal{W}} \max_{F(\xi) \in \mathcal{A}} \left\{ \mathbb{E}_{F(\xi)} \left[ \ell(w, \xi) \right] \left| \max_{F(\xi) \in \mathcal{A}} \mathbb{ES}_{F(\xi)}^{\beta} \left[ \ell(w, \xi) \right] \le \sigma \right. \right\}, \label{eq:dro_loss_es}$$ where the expected shortfall of the loss function with $\beta = 1 - \alpha$ confidence (probability) level is defined as: $$\mathbb{ES}_{F(\xi)}^{\beta}\left[ \ell(w, \xi) \right] = \frac{1}{1 - \beta} \int \limits_{\xi:\ \ell(w, \xi) \ge q_{\beta}(\ell(w, \xi))} \ell(w, \xi) \mathrm{d}F(\xi). \label{eq:es_loss}$$ As before, the equivalence of these equations can be easily seen for $\ell(w, \xi) = -u(w, \xi)$. 
Based on the famous result of Rockafellar and Uryasev [Rockafellar00], the expected shortfall can also be computed by the following optimization problem: \begin{aligned} \mathbb{ES}_{F(\xi)}^{\beta}\left[ \ell(w, \xi) \right] & = \min_{\delta \in \mathbb{R}} \Phi_{\beta}(w, \delta), \\ \Phi_{\beta}(w, \delta) & = \delta + \frac{1}{1-\beta} \mathbb{E}_{F(\xi)} \left[ \left[ \ell(w, \xi) - \delta \right]^{+}\right] = \delta + \frac{1}{1-\beta} \int_{\mathcal{U}} \left[ \ell(w, \xi) - \delta \right]^{+} \mathrm{d}F(\xi), \\ \left[ t \right]^{+} & = \max \left\{ 0, t \right\}, \end{aligned} \label{eq:es_convex} where $\Phi_{\beta}(w, \delta)$ is a convex and continuously differentiable function of $\delta$. In the Yang model [Yang14], the ambiguity set is exactly the same as the one defined in the Delage model but the optimization problem includes an inequality constraint on the worst-case expected shortfall: \begin{aligned} \min_{w \in W} &\ \max_{F(\xi) \in \mathcal{A}} \mathbb{E}_{F(\xi)} \left[ \ell(w, \xi) \right] \\ \mathrm{s.t.} &\ \max_{F(\xi) \in \mathcal{A}} \min_{\delta \in \mathbb{R}} \left( \delta + \frac{1}{1-\beta} \mathbb{E}_{F(\xi)} \left[ \left[ \ell(w, \xi) - \delta \right]^{+} \right] \right) \le \sigma , \end{aligned} \label{eq:yang_opt} where $\mathcal{A} = \mathcal{A}_{\mathrm{Delage}}$. Based on problem \ref{eq:delage_dual_loss}, for the expression $\max \limits_{F(\xi) \in \mathcal{A}} \mathbb{E}_{F(\xi)} \left[ \ell(w, \xi) \right]$ we have: \begin{aligned} \min_{r_{1}, P_{1}, p_{1}, s_{1}, Q_{1}, q_{1}} &\ r_{1} + P_{1} \bullet \hat{\Sigma} - 2p_{1}^{T} \hat{\mu} + s_{1} \gamma_{1} + Q_{1} \bullet \left( \gamma_{2} \hat{\Sigma} \right) - \hat{\mu}^{T} Q_{1} \hat{\mu} \\ \mathrm{s.t.} &\ q_{1} = -2(p_{1} + Q_{1} \hat{\mu}), \\ & \ -\ell(w, \xi) + r_{1} + \xi^{T} Q_{1} \xi + \xi^{T} q_{1} \ge 0 \quad \forall \xi \in \mathcal{U}, \\ & \ \begin{bmatrix} P_{1} & p_{1} \\ p_{1}^{T} & s_{1} \end{bmatrix} \succeq 0, \\ & \ Q_{1} \succeq 0 . 
\\ \end{aligned} \label{eq:yang_dual_loss} For the expected shortfall term, first we consider the Lagrange dual function of the expression $\max \limits_{F(\xi) \in \mathcal{A}} \left( \delta + \frac{1}{1-\beta} \mathbb{E}_{F(\xi)} \left[ \left[ \ell(w, \xi) - \delta \right]^{+} \right] \right)$: \begin{aligned} & g(r_{2}, P_{2}, p_{2}, s_{2}, Q_{2}) = \ r_{2} + P_{2} \bullet \hat{\Sigma} - 2p_{2}^{T} \hat{\mu} + s_{2} \gamma_{1} + Q_{2} \bullet \left( \gamma_{2} \hat{\Sigma} \right) - Q_{2} \bullet \left( \hat{\mu} \hat{\mu}^{T} \right) + \\ & \sup_{F} \int_{\mathcal{U}} \left( \delta + \max \left\{ 0, \frac{\ell(w, \xi) - \delta}{1-\beta} \right\} - r_{2} + 2p_{2}^{T} \xi - \xi^{T} Q_{2} \xi + 2\xi^{T} Q_{2} \hat{\mu} \right) \mathrm{d}F(\xi) . \end{aligned} It is easy to see that the above expression leads to the following optimization problem for the worst-case expected shortfall in problem \ref{eq:yang_opt}: \begin{aligned} \min_{\delta, r_{2}, P_{2}, p_{2}, s_{2}, Q_{2}, q_{2}} &\ r_{2} + P_{2} \bullet \hat{\Sigma} - 2p_{2}^{T} \hat{\mu} + s_{2} \gamma_{1} + Q_{2} \bullet \left( \gamma_{2} \hat{\Sigma} \right) - \hat{\mu}^{T} Q_{2} \hat{\mu} \\ \mathrm{s.t.} &\ q_{2} = -2(p_{2} + Q_{2} \hat{\mu}), \\ & \ r_{2} + \xi^{T} Q_{2} \xi + \xi^{T} q_{2} \ge \delta \quad \forall \xi \in \mathcal{U}, \\ & \ -\frac{\ell(w, \xi)}{1-\beta} + r_{2} + \xi^{T} Q_{2} \xi + \xi^{T} q_{2} \ge \delta - \frac{\delta}{1-\beta} \quad \forall \xi \in \mathcal{U}, \\ & \ \begin{bmatrix} P_{2} & p_{2} \\ p_{2}^{T} & s_{2} \end{bmatrix} \succeq 0, \\ & \ Q_{2} \succeq 0. \\ \end{aligned} \label{eq:yang_es_dual_loss} As for the Delage model, we use again a rather general piecewise linear loss function, i.e. $\ell(w, \xi) = \max \limits_{k} -a_{k}\xi^{T}w - b_{k}$ with $a_{k} \ge 0 \quad \forall k$. Also, as before, we provide two general models depending on the uncertainty set $\mathcal{U}$. We omit the details of the derivations as they follow the same lines as before.
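The Rockafellar--Uryasev representation \ref{eq:es_convex} is easy to verify on a discrete sample: $\Phi_{\beta}$ is convex and piecewise linear there, its minimum over $\delta$ is attained at one of the sample points, and (when $(1-\beta)N$ is an integer) it reproduces the plain tail average. A sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(4)

# N loss samples (illustrative) and confidence level beta.
N, beta = 1000, 0.95
losses = rng.normal(size=N)
k = int(round((1 - beta) * N))          # 50 tail scenarios here

# Direct expected shortfall: average of the (1-beta)*N largest losses.
es_direct = np.sort(losses)[-k:].mean()

# Rockafellar-Uryasev: ES = min over delta of
#   Phi(delta) = delta + 1/((1-beta)N) * sum_i [loss_i - delta]^+ .
def phi(delta):
    return delta + np.maximum(losses - delta, 0.0).sum() / ((1 - beta) * N)

# Phi is piecewise linear in delta, so searching the sample points suffices.
es_min = min(phi(d) for d in losses)
```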

1. For $\mathcal{U} = \mathbb{R}^{n}$ the general model has the following form: \begin{array}{c} \mathbf{(Yang\ simple\ model)} \\ \begin{aligned} \max_{\substack{{w, \delta} \\ {r_{1}, P_{1}, p_{1}, s_{1}, Q_{1}, q_{1}} \\ {r_{2}, P_{2}, p_{2}, s_{2}, Q_{2}, q_{2}} \\ {\nu, \rho, \tau}}} &\ -\left( r_{1} + P_{1} \bullet \hat{\Sigma} - 2p_{1}^{T} \hat{\mu} + s_{1} \gamma_{1} + Q_{1} \bullet \left( \gamma_{2} \hat{\Sigma} \right) - \hat{\mu}^{T} Q_{1} \hat{\mu} \right) \\ \mathrm{s.t.} &\ r_{2} + P_{2} \bullet \hat{\Sigma} - 2p_{2}^{T} \hat{\mu} + s_{2} \gamma_{1} + Q_{2} \bullet \left( \gamma_{2} \hat{\Sigma} \right) - \hat{\mu}^{T} Q_{2} \hat{\mu} \le \sigma, \\ &\ q_{1} = -2(p_{1} + Q_{1} \hat{\mu}), \quad q_{2} = -2(p_{2} + Q_{2} \hat{\mu}), \\ &\ \begin{bmatrix} Q_{1} & \frac{1}{2}(q_{1} + a_{k}w) \\ \frac{1}{2}(q_{1} + a_{k}w)^{T} & r_{1} + b_{k} - \nu_{k} \end{bmatrix} \succeq 0 & k = 1, \dots, K, \\ &\ \begin{bmatrix} Q_{2} & \frac{1}{2}q_{2} \\ \frac{1}{2}q_{2}^{T} & r_{2} - \rho_{k} \end{bmatrix} \succeq 0 & k = 1, \dots, K, \\ &\ \begin{bmatrix} Q_{2} & \frac{1}{2}\left(q_{2} + \frac{a_{k}w}{1-\beta}\right) \\ \frac{1}{2}\left(q_{2} + \frac{a_{k}w}{1-\beta}\right)^{T} & r_{2} + \frac{b_{k}}{1-\beta} - \tau_{k} \end{bmatrix} \succeq 0 & k = 1, \dots, K, \\ &\ \begin{bmatrix} P_{1} & p_{1} \\ p_{1}^{T} & s_{1} \end{bmatrix} \succeq 0, \quad \begin{bmatrix} P_{2} & p_{2} \\ p_{2}^{T} & s_{2} \end{bmatrix} \succeq 0, \\ &\ Q_{1} \succeq 0, \quad Q_{2} \succeq 0, \\ &\ \nu_{k} \ge 0 & k = 1, \dots, K, \\ &\ \rho_{k} \ge \delta & k = 1, \dots, K, \\ &\ \tau_{k} \ge \left(1 - \frac{1}{1-\beta} \right) \delta & k = 1, \dots, K, \\ &\ w \in \mathcal{W} . \end{aligned} \end{array}
2. For $\mathcal{U} = \left[ \underline{\xi}, \overline{\xi} \right]$ we can write: \begin{array}{c} \mathbf{(Yang\ box\ model)} \\ \begin{aligned} \max_{\substack{{w, \delta} \\ {r_{1}, P_{1}, p_{1}, s_{1}, Q_{1}, q_{1}} \\ {r_{2}, P_{2}, p_{2}, s_{2}, Q_{2}, q_{2}} \\ {\nu, \rho, \tau, \underline{\lambda}_{1}, \overline{\lambda}_{1}, \underline{\lambda}_{2}, \overline{\lambda}_{2}, \underline{\lambda}_{3}, \overline{\lambda}_{3}}}} &\ -\left( r_{1} + P_{1} \bullet \hat{\Sigma} - 2p_{1}^{T} \hat{\mu} + s_{1} \gamma_{1} + Q_{1} \bullet \left( \gamma_{2} \hat{\Sigma} \right) - \hat{\mu}^{T} Q_{1} \hat{\mu} \right) \\ \mathrm{s.t.} &\ r_{2} + P_{2} \bullet \hat{\Sigma} - 2p_{2}^{T} \hat{\mu} + s_{2} \gamma_{1} + Q_{2} \bullet \left( \gamma_{2} \hat{\Sigma} \right) - \hat{\mu}^{T} Q_{2} \hat{\mu} \le \sigma, \\ &\ q_{1} = -2(p_{1} + Q_{1} \hat{\mu}), \quad q_{2} = -2(p_{2} + Q_{2} \hat{\mu}), \\ &\ \begin{bmatrix} Q_{1} & \frac{1}{2}(q_{1} + a_{k}w - \underline{\lambda}_{1} + \overline{\lambda}_{1}) \\ \frac{1}{2}(q_{1} + a_{k}w - \underline{\lambda}_{1} + \overline{\lambda}_{1})^{T} & r_{1} + b_{k} + \underline{\lambda}_{1}^{T} \underline{\xi} - \overline{\lambda}_{1}^{T} \overline{\xi} - \nu_{k} \end{bmatrix} \succeq 0 & k = 1, \dots, K, \\ &\ \begin{bmatrix} Q_{2} & \frac{1}{2}(q_{2} - \underline{\lambda}_{2} + \overline{\lambda}_{2}) \\ \frac{1}{2}(q_{2} - \underline{\lambda}_{2} + \overline{\lambda}_{2})^{T} & r_{2} + \underline{\lambda}_{2}^{T} \underline{\xi} - \overline{\lambda}_{2}^{T} \overline{\xi} - \rho_{k} \end{bmatrix} \succeq 0 & k = 1, \dots, K, \\ &\ \begin{bmatrix} Q_{2} & \frac{1}{2}\left( q_{2} + \frac{a_{k}w}{1-\beta} - \underline{\lambda}_{3} + \overline{\lambda}_{3} \right) \\ \frac{1}{2}\left( q_{2} + \frac{a_{k}w}{1-\beta} - \underline{\lambda}_{3} + \overline{\lambda}_{3} \right)^{T} & r_{2} + \frac{b_{k}}{1-\beta} + \underline{\lambda}_{3}^{T} \underline{\xi} - \overline{\lambda}_{3}^{T} \overline{\xi} - \tau_{k} \end{bmatrix} \succeq 0 & k = 1, \dots, K, 
\\ &\ \begin{bmatrix} P_{1} & p_{1} \\ p_{1}^{T} & s_{1} \end{bmatrix} \succeq 0, \quad \begin{bmatrix} P_{2} & p_{2} \\ p_{2}^{T} & s_{2} \end{bmatrix} \succeq 0, \\ &\ Q_{1} \succeq 0, \quad Q_{2} \succeq 0, \\ &\ \nu_{k} \ge 0 & k = 1, \dots, K, \\ &\ \rho_{k} \ge \delta & k = 1, \dots, K, \\ &\ \tau_{k} \ge \left(1 - \frac{1}{1-\beta} \right) \delta & k = 1, \dots, K, \\ &\ \underline{\lambda}_{1} \succeq 0, \quad \overline{\lambda}_{1} \succeq 0, \quad \underline{\lambda}_{2} \succeq 0, \quad \overline{\lambda}_{2} \succeq 0,\quad \underline{\lambda}_{3} \succeq 0,\quad \overline{\lambda}_{3} \succeq 0, \\ &\ w \in \mathcal{W} . \end{aligned} \end{array}
Finally, we note that the actual Yang model is obtained from the above optimization problem by setting $K = 1$, $a_{1} = 1$ and $b_{1} = 0$.

### Du model

A widely used approach for data-driven decision making is to approximate the distribution $F(\xi)$ with the discrete empirical probability distribution [Bertsimas18] [Esfahani18]: $$\widehat{F}(\xi) = \frac{1}{N} \sum_{i=1}^{N}\delta_{\hat{\xi}_{i}},$$ where $\delta_{\hat{\xi}_{i}}$ is the Dirac delta (point mass) at the observed position $\hat{\xi}_{i}$ and $N$ is the number of observations. Using the discrete empirical probability distribution, the expected value of some loss function has a simple expression: $$\mathbb{E}_{\widehat{F}(\xi)} \left[ \ell(w, \xi) \right] = \frac{1}{N}\sum_{i=1}^{N} \ell(w, \hat{\xi}_{i}).$$ We can consider now an ambiguity set of distributions that are within a given probability metric from this empirical probability distribution. One of the most widely used examples is the Wasserstein metric defined as: $$d^{(p)}_{\mathrm{W}}(F_{1}, F_{2}) = \left( \inf_{\phi \in \Phi(F_{1}, F_{2})} \int_{\mathcal{U} \times \mathcal{U}} \| \xi_{1} - \xi_{2} \|^{p} \mathrm{d}\phi(\xi_{1}, \xi_{2}) \right)^{1/p},$$ where $\Phi(F_{1}, F_{2})$ denotes all measures on $\mathcal{U} \times \mathcal{U}$ with marginals $F_{1}$ and $F_{2}$ and $\| \cdot \|$ represents an arbitrary norm. Using the special case $p=1$ (i.e. the 1-Wasserstein metric, which is also called the Kantorovich metric) we can define the following ambiguity set [Esfahani18]: $$\mathcal{A}_{\mathrm{W}}^{\varepsilon} = \left\{ F(\xi) \in \mathcal{M}(\mathcal{U}) \ \begin{array}{|l} d^{(1)}_{\mathrm{W}}(\widehat{F}, F) \le \varepsilon \\ \end{array} \right\}. \label{eq:du_ambiguity_set}$$ Let us consider now the following optimization problem: $$\max_{F(\xi) \in \mathcal{A}_{\mathrm{W}}^{\varepsilon}} \mathbb{E} \left[ \ell(w, \xi) \right]. \label{eq:wasserstein_opt}$$ As it has been proven in [Esfahani18], if $\mathcal{U}$ is convex and closed and using a loss function, which is defined as the pointwise maximum of some elementary measurable loss functions, i.e. 
$\ell(w, \xi) = \max \limits_{k} \ell_{k}(w, \xi)$ with $k = 1, \dots, K$, where for each $k$ the function $-\ell_{k}(w, \xi)$ is proper, convex, and lower semicontinuous in $\xi$, then the above optimization problem is equivalent to the following finite convex program: \begin{aligned} \min_{\lambda, \{s_{i}\}, \{y_{ik}\}, \{a_{ik}\}} &\ \lambda \varepsilon + \frac{1}{N} \sum_{i=1}^{N}s_{i} \\ \mathrm{s.t.} &\ \left[ -\ell_{k} \right]^{*}(w, y_{ik} - a_{ik}) + \sigma_{\mathcal{U}}(a_{ik}) - y_{ik}^{T} \hat{\xi}_{i} \le s_{i} & i = 1, \dots, N &\ k = 1, \dots, K, \\ &\ \| y_{ik} \|_{*} \le \lambda & i = 1, \dots, N &\ k = 1, \dots, K, \end{aligned} \label{eq:kuhn_opt} where $\left[ -\ell_{k} \right]^{*}(w, y_{ik} - a_{ik})$ is the conjugate of $-\ell_{k}$ with respect to $\xi$, evaluated at $y_{ik} - a_{ik}$, $\| \cdot \|_{*}$ is the dual of the norm used in the Kantorovich metric, and $\sigma_{\mathcal{U}}(a_{ik})$ is the support function of the support $\mathcal{U}$, i.e. the conjugate of its characteristic function. In the Du model [Du21], we consider the following optimization problem with a trade-off objective: $$\min_{w \in \mathcal{W}} \max_{F(\xi) \in \mathcal{A}_{\mathrm{W}}^{\varepsilon}} \eta \mathbb{E}_{F(\xi)} \left[ \ell(w, \xi) \right] + (1-\eta) \mathbb{ES}_{F(\xi)}^{\beta} \left[ \ell(w, \xi) \right],$$ where $\eta \in [0, 1]$ is the risk preference parameter. 
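To see how problem \ref{eq:kuhn_opt} behaves in a familiar limit, the following worked special case (a well-known result from [Esfahani18], reproduced here under the assumptions $K = 1$, $\ell(w, \xi) = -\xi^{T}w$ and $\mathcal{U} = \mathbb{R}^{n}$) shows that the worst-case expectation reduces to the empirical expected loss plus a dual-norm regularization term:

```latex
% Special case of \eqref{eq:kuhn_opt}: K = 1, \ell(w, \xi) = -\xi^{T}w, \mathcal{U} = \mathbb{R}^{n}.
% Since \sigma_{\mathbb{R}^{n}}(a) is finite only for a = 0, we must take a_{i1} = 0, and
% [-\ell]^{*}(w, y_{i1}) = \sup_{\xi} \left( y_{i1}^{T}\xi - \xi^{T}w \right) is finite (= 0) only if y_{i1} = w.
% The constraints collapse to -w^{T}\hat{\xi}_{i} \le s_{i} and \|w\|_{*} \le \lambda, so the
% worst-case expected loss over the Wasserstein ball becomes
\varepsilon \|w\|_{*} - \frac{1}{N}\sum_{i=1}^{N} w^{T}\hat{\xi}_{i},
% i.e. the empirical expected loss regularized by the dual norm of w with weight \varepsilon.
```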
Setting $\ell = -\xi^{T}w$ and using the optimization problem for the expected shortfall (problem \ref{eq:es_convex}), an equivalent formulation of the above problem can be obtained as: \begin{aligned} & \min_{w \in \mathcal{W}, \delta} \max_{F(\xi) \in \mathcal{A}_{\mathrm{W}}^{\varepsilon}} \eta \mathbb{E}_{F(\xi)} \left[ -\xi^{T}w \right] + (1-\eta) \left( \delta + \frac{1}{1-\beta} \mathbb{E}_{F(\xi)} \left[ \max \left\{ -\xi^{T}w - \delta, 0 \right\} \right] \right) = \\ & \min_{w \in \mathcal{W}, \delta} \max_{F(\xi) \in \mathcal{A}_{\mathrm{W}}^{\varepsilon}} \mathbb{E}_{F(\xi)} \left[ \max \left\{ \underbrace{-\left( \eta + \frac{1-\eta}{1-\beta} \right) \xi^{T}w + (1-\eta)\left( 1 - \frac{1}{1-\beta} \right) \delta}_{\ell_{1}(w, \xi)}, \underbrace{-\eta \xi^{T}w + (1-\eta)\delta}_{\ell_{2}(w, \xi)} \right\} \right] = \\ & \min_{w \in \mathcal{W}, \delta} \max_{F(\xi) \in \mathcal{A}_{\mathrm{W}}^{\varepsilon}} \mathbb{E}_{F(\xi)} \left[ \max \left\{ \ell_{1}(w, \xi), \ell_{2}(w, \xi) \right\} \right]. \end{aligned} Comparing the above problem to problems \ref{eq:wasserstein_opt} and \ref{eq:kuhn_opt} with $K = 2$ and the two elementary loss functions as defined above, we get the following general convex reduction: \begin{aligned} \min_{\substack{{w, \delta} \\ {\lambda, \{s_{i}\}, \{y_{i}\}, \{z_{i}\}, \{a_{i}\}, \{b_{i}\}}}} &\ \lambda \varepsilon + \frac{1}{N} \sum_{i=1}^{N} s_{i} \\ \mathrm{s.t.} &\ \left[ -\ell_{1} \right]^{*}(w, y_{i} - a_{i}) + \sigma_{\mathcal{U}}(a_{i}) - y_{i}^{T}\hat{\xi}_{i} \le s_{i} & i = 1, \dots, N, \\ &\ \left[ -\ell_{2} \right]^{*}(w, z_{i} - b_{i}) + \sigma_{\mathcal{U}}(b_{i}) - z_{i}^{T}\hat{\xi}_{i} \le s_{i} & i = 1, \dots, N, \\ &\ \| y_{i} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ \| z_{i} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ w \in \mathcal{W} . \end{aligned} \label{eq:du_opt_loss} Based on the definition of the conjugate function, i.e. 
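The algebraic step that folds the trade-off objective into a pointwise maximum of $\ell_{1}$ and $\ell_{2}$ can be spot-checked numerically. A small pure-Python sketch, where the values of $\eta$, $\beta$, $\delta$ and the test points are arbitrary:

```python
# Spot-check of the identity used above: for l = -xi^T w,
#   eta*l + (1-eta)*(delta + max(l - delta, 0)/(1-beta)) == max(l1, l2)
# with l1, l2 as defined in the derivation. Parameter values are arbitrary.

def tradeoff(l, delta, eta, beta):
    # eta * mean-term integrand + (1-eta) * expected-shortfall integrand
    return eta * l + (1 - eta) * (delta + max(l - delta, 0.0) / (1 - beta))

def max_form(l, delta, eta, beta):
    l1 = (eta + (1 - eta) / (1 - beta)) * l + (1 - eta) * (1 - 1 / (1 - beta)) * delta
    l2 = eta * l + (1 - eta) * delta
    return max(l1, l2)

eta, beta, delta = 0.4, 0.95, 0.02
for l in [-0.05, 0.0, 0.019, 0.02, 0.021, 0.1]:   # points on both sides of delta
    assert abs(tradeoff(l, delta, eta, beta) - max_form(l, delta, eta, beta)) < 1e-12
print("pointwise-max identity verified")
```

The check covers both branches: for $l \ge \delta$ the maximum is attained by $\ell_{1}$, otherwise by $\ell_{2}$.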
$f^{*}(y) = \sup \limits_{x} \left( y^{T}x - f(x) \right)$, we have that: \begin{aligned} \left[ -\ell_{1} \right]^{*}(w, y_{i} - a_{i}) &= \sup_{\xi} \left( \xi^{T}(y_{i} - a_{i}) - \left( \eta + \frac{1-\eta}{1-\beta} \right) \xi^{T}w + (1-\eta)\left( 1 - \frac{1}{1-\beta} \right) \delta \right) \\ &= \begin{cases} (1-\eta)\left( 1 - \frac{1}{1-\beta} \right) \delta & \mathrm{if\ } y_{i} = \left( \eta + \frac{1-\eta}{1-\beta} \right)w + a_{i}, \\ +\infty & \mathrm{otherwise}, \end{cases} \quad i = 1, \dots, N . \end{aligned} Similarly, for the other loss function: \begin{aligned} \left[ -\ell_{2} \right]^{*}(w, z_{i} - b_{i}) &= \sup_{\xi} \left( \xi^{T}(z_{i} - b_{i}) - \eta \xi^{T}w + (1-\eta)\delta \right) \\ &= \begin{cases} (1-\eta) \delta & \mathrm{if\ } z_{i} = \eta w + b_{i}, \\ +\infty & \mathrm{otherwise}, \end{cases} \quad i = 1, \dots, N . \end{aligned} Since finiteness forces the above identities for the $y_{i}$'s and $z_{i}$'s, inserting these two results into problem \ref{eq:du_opt_loss} and eliminating the $y_{i}$'s and $z_{i}$'s we get: \begin{aligned} \min_{\substack{{w, \delta} \\ {\lambda, \{s_{i}\}, \{a_{i}\}, \{b_{i}\}}}} &\ \lambda \varepsilon + \frac{1}{N} \sum_{i=1}^{N} s_{i} \\ \mathrm{s.t.} &\ (1-\eta)\left( 1 - \frac{1}{1-\beta} \right) \delta + \sigma_{\mathcal{U}}(a_{i}) - \left( \eta + \frac{1-\eta}{1-\beta} \right)w^{T}\hat{\xi}_{i} - a_{i}^{T}\hat{\xi}_{i} \le s_{i} & i = 1, \dots, N, \\ &\ (1-\eta) \delta + \sigma_{\mathcal{U}}(b_{i}) - \eta w^{T} \hat{\xi}_{i} - b_{i}^{T}\hat{\xi}_{i} \le s_{i} & i = 1, \dots, N, \\ &\ \| \left( \eta + \frac{1-\eta}{1-\beta} \right)w + a_{i} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ \| \eta w + b_{i} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ w \in \mathcal{W} . \end{aligned} The final piece of the general Du model is the assumption that the uncertainty set has a non-empty conic representation, i.e. $\mathcal{U} = \left\{ \xi \in \mathbb{R}^{n}: \exists u \in \mathbb{R}^{m}: A\xi + Bu + c \in C \right\}$, where $C$ is a closed convex pointed cone, $A$ and $B$ are matrices and $c$ is a vector. 
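The finiteness argument above rests on a basic fact: the conjugate of a function linear in $\xi$ is $+\infty$ everywhere except at its slope. A toy one-dimensional illustration (the numbers are arbitrary; the supremum over a ball of radius $r$ stands in for the supremum over all of $\mathbb{R}$):

```python
# Toy illustration of the finiteness argument: for f(xi) = c*xi (linear),
# sup_xi (y*xi - f(xi)) over |xi| <= r equals |y - c| * r, which stays 0 when
# y == c and diverges with r otherwise. Numbers are arbitrary.

def sup_linear(y, c, r):
    # sup over xi in [-r, r] of (y - c) * xi, attained at xi = r * sign(y - c)
    return abs(y - c) * r

c = 2.0
assert sup_linear(2.0, c, 1e6) == 0.0    # y == c: conjugate value is 0 (finite)
assert sup_linear(3.0, c, 1e6) == 1e6    # y != c: grows without bound as r -> inf
assert sup_linear(3.0, c, 1e9) == 1e9
print("conjugate of a linear function is finite only at its slope")
```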
Using the conic duality theorem we can eliminate $\xi$ and $u$ from the $\sigma_{\mathcal{U}}(a_{i})$'s: \begin{aligned} \sigma_{\mathcal{U}}(a_{i}) = & \sup_{\xi, u} \left\{ a_{i}^{T}\xi: A\xi + Bu + c \in C \right\} = \\ & \inf_{\phi_{i}} \sup_{\xi, u} \left\{ L(\xi, u, \phi_{i}): \phi_{i} \in C^{*} \right\} = \\ & \inf_{\phi_{i}} \sup_{\xi, u} \left\{ a_{i}^{T}\xi + (A\xi + Bu + c)^{T} \phi_{i}: \phi_{i} \in C^{*} \right\} = \\ & \inf_{\phi_{i}} \left\{ c^{T} \phi_{i}: A^{T}\phi_{i} = -a_{i}, B^{T}\phi_{i} = 0, \phi_{i} \in C^{*} \right\} \quad i = 1, \dots, N , \end{aligned} where $C^{*}$ denotes the dual cone of $C$. Similarly, for the $\sigma_{\mathcal{U}}(b_{i})$'s we get: $$\sigma_{\mathcal{U}}(b_{i}) = \inf_{\psi_{i}} \left\{ c^{T} \psi_{i}: A^{T}\psi_{i} = -b_{i}, B^{T}\psi_{i} = 0, \psi_{i} \in C^{*} \right\} \quad i = 1, \dots, N .$$ Replacing these expressions in the optimization problem, the form of the general Du problem becomes: \begin{aligned} \min_{\substack{{w, \delta} \\ {\lambda, \{s_{i}\}, \{\phi_{i}\}, \{\psi_{i}\}}}} &\ \lambda \varepsilon + \frac{1}{N} \sum_{i=1}^{N} s_{i} \\ \mathrm{s.t.} &\ (1-\eta)\left( 1 - \frac{1}{1-\beta} \right) \delta - \left( \eta + \frac{1-\eta}{1-\beta} \right)w^{T}\hat{\xi}_{i} + \phi_{i}^{T} (c + A\hat{\xi}_{i}) \le s_{i} & i = 1, \dots, N, \\ &\ (1-\eta) \delta - \eta w^{T} \hat{\xi}_{i} + \psi_{i}^{T}(c + A\hat{\xi}_{i}) \le s_{i} & i = 1, \dots, N, \\ &\ \| \left( \eta + \frac{1-\eta}{1-\beta} \right)w - A^{T}\phi_{i} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ \| \eta w - A^{T}\psi_{i} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ B^{T} \phi_{i} = 0 & i = 1, \dots, N, \\ &\ B^{T} \psi_{i} = 0 & i = 1, \dots, N, \\ &\ \phi_{i} \in C^{*} & i = 1, \dots, N, \\ &\ \psi_{i} \in C^{*} & i = 1, \dots, N, \\ &\ w \in \mathcal{W} . \end{aligned} \label{eq:du_general} Depending on the actual form of the cone we investigate three variants: box, budget and ellipsoid uncertainty sets. 
In all cases we set $\phi_{i} = [\nu_{i}^{(1)}, \tau_{i}^{(1)}]$ and $\psi_{i} = [\nu_{i}^{(2)}, \tau_{i}^{(2)}]$, where $\nu_{i}^{(1)}, \nu_{i}^{(2)} \in \mathbb{R}^{n}$ and $\tau_{i}^{(1)}, \tau_{i}^{(2)} \in \mathbb{R}$ for all $i = 1, \dots, N$.

1. Box uncertainty set: $\mathcal{U} = \left\{ \xi \in \mathbb{R}^{n}: |\xi_{i}| \le \Lambda, i=1, \dots, n \right\}$. In the corresponding cone representation $A = \begin{bmatrix} I_{n \times n} \\ 0_{1 \times n} \end{bmatrix}$, $B = 0_{(n+1) \times m}$, $c = \begin{bmatrix} 0_{n \times 1} \\ \Lambda \end{bmatrix}$ and $C = \left\{ [x; t] \in \mathbb{R}^{n} \times \mathbb{R}: t \ge \|x\|_{\infty} \right\}$ with dual cone $C^{*} = \left\{ [x; t] \in \mathbb{R}^{n} \times \mathbb{R}: t \ge \|x\|_{1} \right\}$. Substituting these expressions into problem \ref{eq:du_general} the following model is obtained: \begin{array}{c} \mathbf{(Du\ box\ model)} \\ \begin{aligned} \min_{\substack{{w, \delta} \\ {\lambda, \{s_{i}\}} \\ {\{\nu_{i}^{(1)}\}, \{\nu_{i}^{(2)}\}} \\ {\{\tau_{i}^{(1)}\}, \{\tau_{i}^{(2)}\}}}} &\ \lambda \varepsilon + \frac{1}{N} \sum_{i=1}^{N} s_{i} \\ \mathrm{s.t.} &\ (1-\eta)\left( 1 - \frac{1}{1-\beta} \right) \delta - \left( \eta + \frac{1-\eta}{1-\beta} \right)w^{T}\hat{\xi}_{i} + (\nu_{i}^{(1)})^{T}\hat{\xi}_{i} + \tau_{i}^{(1)}\Lambda \le s_{i} & i = 1, \dots, N, \\ &\ (1-\eta) \delta - \eta w^{T} \hat{\xi}_{i} + (\nu_{i}^{(2)})^{T}\hat{\xi}_{i} + \tau_{i}^{(2)}\Lambda \le s_{i} & i = 1, \dots, N, \\ &\ \| \left( \eta + \frac{1-\eta}{1-\beta} \right)w - \nu_{i}^{(1)} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ \| \eta w - \nu_{i}^{(2)} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ \tau_{i}^{(1)} \ge \| \nu_{i}^{(1)} \|_{1} & i = 1, \dots, N, \\ &\ \tau_{i}^{(2)} \ge \| \nu_{i}^{(2)} \|_{1} & i = 1, \dots, N, \\ &\ w \in \mathcal{W} . \end{aligned} \end{array}
2. Budget uncertainty set: $\mathcal{U} = \left\{ \xi \in \mathbb{R}^{n}: \sum \limits_{i=1}^{n} \left| \frac{\xi_{i}}{\Delta_{i}} \right| \le \Gamma \right\}$. In the corresponding cone representation $A = \begin{bmatrix} Q^{-1} \\ 0_{1 \times n} \end{bmatrix}$ with $Q = \mathrm{diag}(\Delta_{1}, \dots, \Delta_{n})$, $B = 0_{(n+1) \times m}$, $c = \begin{bmatrix} 0_{n \times 1} \\ \Gamma \end{bmatrix}$ and $C = \left\{ [x; t] \in \mathbb{R}^{n} \times \mathbb{R}: t \ge \|x\|_{1} \right\}$ with dual cone $C^{*} = \left\{ [x; t] \in \mathbb{R}^{n} \times \mathbb{R}: t \ge \|x\|_{\infty} \right\}$. Substituting these expressions into problem \ref{eq:du_general} the following model is obtained: \begin{array}{c} \mathbf{(Du\ budget\ model)} \\ \begin{aligned} \min_{\substack{{w, \delta} \\ {\lambda, \{s_{i}\}} \\ {\{\nu_{i}^{(1)}\}, \{\nu_{i}^{(2)}\}} \\ {\{\tau_{i}^{(1)}\}, \{\tau_{i}^{(2)}\}}}} &\ \lambda \varepsilon + \frac{1}{N} \sum_{i=1}^{N} s_{i} \\ \mathrm{s.t.} &\ (1-\eta)\left( 1 - \frac{1}{1-\beta} \right) \delta - \left( \eta + \frac{1-\eta}{1-\beta} \right)w^{T}\hat{\xi}_{i} + (\nu_{i}^{(1)})^{T}Q^{-1}\hat{\xi}_{i} + \tau_{i}^{(1)}\Gamma \le s_{i} & i = 1, \dots, N, \\ &\ (1-\eta) \delta - \eta w^{T} \hat{\xi}_{i} + (\nu_{i}^{(2)})^{T}Q^{-1}\hat{\xi}_{i} + \tau_{i}^{(2)}\Gamma \le s_{i} & i = 1, \dots, N, \\ &\ \| \left( \eta + \frac{1-\eta}{1-\beta} \right)w - Q^{-1}\nu_{i}^{(1)} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ \| \eta w - Q^{-1}\nu_{i}^{(2)} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ \tau_{i}^{(1)} \ge \| \nu_{i}^{(1)} \|_{\infty} & i = 1, \dots, N, \\ &\ \tau_{i}^{(2)} \ge \| \nu_{i}^{(2)} \|_{\infty} & i = 1, \dots, N, \\ &\ w \in \mathcal{W} . \end{aligned} \end{array}
3. Ellipsoid uncertainty set: $\mathcal{U} = \left\{ \xi \in \mathbb{R}^{n}: \sum \limits_{i=1}^{n} \left( \frac{\xi_{i}}{\Delta_{i}} \right)^{2} \le \Omega^{2} \right\}$. In the corresponding cone representation $A = \begin{bmatrix} Q^{-1} \\ 0_{1 \times n} \end{bmatrix}$ with $Q = \mathrm{diag}(\Delta_{1}, \dots, \Delta_{n})$, $B = 0_{(n+1) \times m}$, $c = \begin{bmatrix} 0_{n \times 1} \\ \Omega \end{bmatrix}$ and $C = \left\{ [x; t] \in \mathbb{R}^{n} \times \mathbb{R}: t \ge \|x\|_{2} \right\}$ with dual cone $C^{*} = \left\{ [x; t] \in \mathbb{R}^{n} \times \mathbb{R}: t \ge \|x\|_{2} \right\}$. Substituting these expressions into problem \ref{eq:du_general} the following model is obtained: \begin{array}{c} \mathbf{(Du\ ellipsoid\ model)} \\ \begin{aligned} \min_{\substack{{w, \delta} \\ {\lambda, \{s_{i}\}} \\ {\{\nu_{i}^{(1)}\}, \{\nu_{i}^{(2)}\}} \\ {\{\tau_{i}^{(1)}\}, \{\tau_{i}^{(2)}\}}}} &\ \lambda \varepsilon + \frac{1}{N} \sum_{i=1}^{N} s_{i} \\ \mathrm{s.t.} &\ (1-\eta)\left( 1 - \frac{1}{1-\beta} \right) \delta - \left( \eta + \frac{1-\eta}{1-\beta} \right)w^{T}\hat{\xi}_{i} + (\nu_{i}^{(1)})^{T}Q^{-1}\hat{\xi}_{i} + \tau_{i}^{(1)}\Omega \le s_{i} & i = 1, \dots, N, \\ &\ (1-\eta) \delta - \eta w^{T} \hat{\xi}_{i} + (\nu_{i}^{(2)})^{T}Q^{-1}\hat{\xi}_{i} + \tau_{i}^{(2)}\Omega \le s_{i} & i = 1, \dots, N, \\ &\ \| \left( \eta + \frac{1-\eta}{1-\beta} \right)w - Q^{-1}\nu_{i}^{(1)} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ \| \eta w - Q^{-1}\nu_{i}^{(2)} \|_{*} \le \lambda & i = 1, \dots, N, \\ &\ \tau_{i}^{(1)} \ge \| \nu_{i}^{(1)} \|_{2} & i = 1, \dots, N, \\ &\ \tau_{i}^{(2)} \ge \| \nu_{i}^{(2)} \|_{2} & i = 1, \dots, N, \\ &\ w \in \mathcal{W} . \end{aligned} \end{array}
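The three variants ultimately differ only in which support function $\sigma_{\mathcal{U}}$ gets dualized. As a sanity check, the closed-form support functions of the three sets can be compared against explicit boundary maximizers; the sketch below uses plain Python with illustrative parameter values, writing the ellipsoid set as $\|Q^{-1}\xi\|_{2} \le \Omega$:

```python
import math

# Closed-form support functions sigma_U(a) = sup_{xi in U} a^T xi of the three
# uncertainty sets, each checked against an explicit maximizer on the boundary.
# Parameter values (Lambda, Gamma, Omega, Delta_i, a) are illustrative.

def sigma_box(a, lam):
    # U = {xi: |xi_i| <= Lambda}  ->  sigma(a) = Lambda * ||a||_1
    return lam * sum(abs(ai) for ai in a)

def sigma_budget(a, deltas, gamma):
    # U = {xi: sum_i |xi_i / Delta_i| <= Gamma}  ->  sigma(a) = Gamma * max_i Delta_i |a_i|
    return gamma * max(d * abs(ai) for d, ai in zip(deltas, a))

def sigma_ellipsoid(a, deltas, omega):
    # U = {xi: ||Q^{-1} xi||_2 <= Omega}, Q = diag(Delta)  ->  sigma(a) = Omega * ||Q a||_2
    return omega * math.sqrt(sum((d * ai) ** 2 for d, ai in zip(deltas, a)))

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

a, deltas = [0.3, -0.7], [1.0, 2.0]

# Box, Lambda = 2: maximizer xi_i = Lambda * sign(a_i).
xi = [2.0 * math.copysign(1.0, ai) for ai in a]
assert abs(dot(xi, a) - sigma_box(a, 2.0)) < 1e-12

# Budget, Gamma = 1.5: put the whole budget on the best-scaled coordinate.
j = max(range(len(a)), key=lambda i: deltas[i] * abs(a[i]))
xi = [0.0] * len(a)
xi[j] = 1.5 * deltas[j] * math.copysign(1.0, a[j])
assert abs(dot(xi, a) - sigma_budget(a, deltas, 1.5)) < 1e-12

# Ellipsoid, Omega = 1: maximizer xi = Omega * Q^2 a / ||Q a||_2.
nq = math.sqrt(sum((d * ai) ** 2 for d, ai in zip(deltas, a)))
xi = [d * d * ai / nq for d, ai in zip(deltas, a)]
assert abs(dot(xi, a) - sigma_ellipsoid(a, deltas, 1.0)) < 1e-12
print("support-function values match their boundary maximizers")
```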

## References

[Bertsimas04]: D. Bertsimas and M. Sim, "The Price of Robustness", Operations Research, 52, pp. 35 (2004).

[Boyd04]: S. Boyd and L. Vandenberghe, "Convex Optimization", New York: Cambridge University Press, (2004).

[BenTal00]: A. Ben-Tal and A. Nemirovski, "Robust solutions of Linear Programming problems contaminated with uncertain data", Mathematical Programming, 88, pp. 411 (2000).

[Lobo98]: M. S. Lobo, L. Vandenberghe, S. Boyd and H. Lebret, "Applications of second-order cone programming", Linear Algebra and its Applications, 284, pp. 193 (1998).

[Delage10]: E. Delage and Y. Ye, "Distributionally Robust Optimization Under Moment Uncertainty with Application to Data-Driven Problems", Operations Research, 58, pp. 595 (2010).

[Rahimian19]: H. Rahimian and S. Mehrotra, "Distributionally Robust Optimization: A Review", arXiv:1908.05659 (2019).

[Rockafellar00]: R. T. Rockafellar and S. Uryasev, "Optimization of Conditional Value-at-Risk", Journal of Risk, 2, pp. 21 (2000).

[Yang14]: L. Yang, Y. Li, Z. Zhou and K. Chen, "Distributionally Robust Return-Risk Optimization Models and Their Applications", Journal of Applied Mathematics, pp. 1 (2014).

[Bertsimas18]: D. Bertsimas, V. Gupta and N. Kallus, "Data-Driven Robust Optimization", Math. Program., 167, pp. 235 (2018).

[Esfahani18]: P. M. Esfahani and D. Kuhn, "Data-Driven Distributionally Robust Optimization Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations", Math. Program., 171, pp. 115 (2018).

[Du21]: N. Du, Y. Liu, and Y. Liu. "A New Data-Driven Distributionally Robust Portfolio Optimization Method Based on Wasserstein Ambiguity Set", IEEE Access, 9, pp. 3174 (2021).