CMA-ES - Algorithm

Algorithm

In the following the most commonly used (μ/μw, λ)-CMA-ES is outlined, where in each iteration step a weighted combination of the μ best out of λ new candidate solutions is used to update the distribution parameters. The main loop consists of three parts: 1) sampling of new solutions, 2) re-ordering of the sampled solutions based on their fitness, 3) updating the internal state variables based on the re-ordered samples. In pseudocode, the algorithm reads as follows.

set λ   // number of samples per iteration, at least two, generally > 4
initialize m, σ, C = I, p_σ = 0, p_c = 0   // initialize state variables
while not terminate do   // iterate
    for i in {1, …, λ} do   // sample λ new solutions and evaluate them
        x_i = sample_multivariate_normal(mean = m, covariance_matrix = σ² C)
        f_i = fitness(x_i)
    x_{1…λ} ← x_{s(1)…s(λ)} with s = argsort(f_1, …, f_λ)   // sort solutions
    m′ = m   // we need m − m′ and x_i − m′ later
    m ← update_m(x_1, …, x_λ)   // move mean to better solutions
    p_σ ← update_ps(p_σ, σ⁻¹ C^{−1/2} (m − m′))   // update isotropic evolution path
    p_c ← update_pc(p_c, σ⁻¹ (m − m′), ‖p_σ‖)   // update anisotropic evolution path
    C ← update_C(C, p_c, (x_1 − m′)/σ, …, (x_λ − m′)/σ)   // update covariance matrix
    σ ← update_sigma(σ, ‖p_σ‖)   // update step-size using isotropic path length
return m or x_1   // final mean or best sampled solution

The order of the five update assignments is relevant. In the following, the update equations for the five state variables are specified.

Given are the search space dimension n and the iteration step k. The five state variables are

m_k ∈ ℝ^n, the distribution mean and current favorite solution to the optimization problem,
σ_k > 0, the step-size,
C_k, a symmetric and positive definite n × n covariance matrix with C_0 = I, and
p_σ ∈ ℝ^n and p_c ∈ ℝ^n, two evolution paths, initially set to the zero vector.
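For concreteness, this initialization might be sketched in NumPy as follows; the dimension and the starting values of m and σ are problem-dependent assumptions, not prescribed by the algorithm.

```python
import numpy as np

n = 10                   # search space dimension (assumed for illustration)
m = np.zeros(n)          # distribution mean m_0, an assumed starting point
sigma = 0.3              # step-size sigma_0, an assumed starting value
C = np.eye(n)            # covariance matrix, C_0 = I
p_sigma = np.zeros(n)    # isotropic evolution path, initially zero
p_c = np.zeros(n)        # anisotropic evolution path, initially zero
```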

The iteration starts with sampling λ > 1 candidate solutions x_i ∈ ℝ^n from a multivariate normal distribution, i.e. for i = 1, …, λ


\begin{align} x_i \ &\sim\ \mathcal{N}(m_k,\sigma_k^2 C_k) \\&\sim\ m_k + \sigma_k\times\mathcal{N}(0,C_k)
\end{align}
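This sampling step can be sketched with NumPy by transforming standard normal draws with a matrix square root of C_k; the state values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 5, 12                        # assumed dimension and sample count
m, sigma, C = np.zeros(n), 0.5, np.eye(n)

# x_i ~ m + sigma * N(0, C): if A A^T = C (Cholesky factor), then A z ~ N(0, C)
A = np.linalg.cholesky(C)
z = rng.standard_normal((lam, n))     # rows z_i ~ N(0, I)
X = m + sigma * (z @ A.T)             # one candidate solution x_i per row
```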

The second line of the sampling equation suggests the interpretation as perturbation (mutation) of the current favorite solution vector m_k (the distribution mean vector). The candidate solutions x_i are evaluated on the objective function f : ℝ^n → ℝ to be minimized. Denoting the f-sorted candidate solutions as

 \{x_{i:\lambda}\;|\;i=1\dots\lambda\} = \{x_i\;|\;i=1\dots\lambda\} \;\;\text{and}\;\; f(x_{1:\lambda})\le\dots\le f(x_{\mu:\lambda})\le f(x_{\mu+1:\lambda}) \dots,

the new mean value is computed as


\begin{align} m_{k+1} &= \sum_{i=1}^{\mu} w_i\, x_{i:\lambda} \\ &= m_k + \sum_{i=1}^{\mu} w_i\, (x_{i:\lambda} - m_k)
\end{align}

where the positive (recombination) weights w_1 ≥ w_2 ≥ … ≥ w_μ > 0 sum to one. Typically, μ ≤ λ/2 and the weights are chosen such that μ_w := 1 / Σ_{i=1}^{μ} w_i² ≈ λ/4. The only feedback used from the objective function here and in the following is an ordering of the sampled candidate solutions due to the indices i:λ.
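The ranking and recombination can be sketched as follows; the log-rank weights are one common choice (an assumption here, not forced by the equations above), and the sphere function stands in for the objective.

```python
import numpy as np

lam = 12
mu = lam // 2                               # mu <= lambda/2
w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
w /= w.sum()                                # positive weights, sum to one
mu_w = 1.0 / np.sum(w**2)                   # variance effective selection mass

rng = np.random.default_rng(1)
X = rng.standard_normal((lam, 4))           # stand-in candidate solutions
f = lambda x: float(np.sum(x**2))           # example objective (sphere)
order = np.argsort([f(x) for x in X])       # ranking gives the indices i:lambda
m_new = w @ X[order[:mu]]                   # m_{k+1} = sum_i w_i x_{i:lambda}
```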

The step-size σ_k is updated using cumulative step-size adaptation (CSA), sometimes also denoted as path length control. The evolution path (or search path) p_σ is updated first.

 p_\sigma \gets \underbrace{(1-c_\sigma)}_{\!\!\!\!\!\text{discount factor}\!\!\!\!\!}\, p_\sigma + \overbrace{\sqrt{1 - (1-c_\sigma)^2}}^{ \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\text{complements for discounted variance} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!} \underbrace{\sqrt{\mu_w} \,C_k^{\;-1/2} \, \frac{\overbrace{m_{k+1} - m_k}^{\!\!\!\text{displacement of}\; m\!\!\!}}{\sigma_k}}_{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \text{distributed as}\; \mathcal{N}(0,I)\;\text{under neutral selection} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!}
 \sigma_{k+1} = \sigma_k \times \exp\bigg(\frac{c_\sigma}{d_\sigma} \underbrace{\left(\frac{\|p_\sigma\|}{E\|\mathcal{N}(0,I)\|} - 1\right)}_{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \text{unbiased about 0 under neutral selection} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!
}\bigg)

where

c_σ⁻¹ ≈ n/3 is the backward time horizon for the evolution path p_σ and larger than one,
μ_w = (Σ_{i=1}^{μ} w_i²)⁻¹ is the variance effective selection mass and 1 ≤ μ_w ≤ μ by definition of the w_i,
C_k^{−1/2} is the unique symmetric square root of C_k⁻¹, the inverse of C_k, and
d_σ is the damping parameter, usually close to one. For d_σ = ∞ or c_σ = 0 the step-size remains unchanged.

The step-size σ_k is increased if and only if ‖p_σ‖ is larger than the expected value

\begin{align}E\|\mathcal{N}(0,I)\| &= \sqrt{2}\,\Gamma((n+1)/2)/\Gamma(n/2) \\&\approx \sqrt{n}\,(1-1/(4\,n)+1/(21\,n^2)) \end{align}

and decreased if it is smaller. For this reason, the step-size update tends to make consecutive steps C_k⁻¹-conjugate, in that after the adaptation has been successful (m_{k+2} − m_{k+1})^T C_k⁻¹ (m_{k+1} − m_k) ≈ 0.
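The two CSA assignments translate to NumPy as below; c_σ, d_σ, μ_w and the state values are illustrative assumptions rather than tuned defaults. Since the mean displacement here is small, ‖p_σ‖ ends up below its expectation and the step-size shrinks.

```python
import numpy as np

n = 5
c_sigma, d_sigma, mu_w = 0.3, 1.0, 3.0     # assumed illustrative parameters
sigma, C = 0.5, np.eye(n)
p_sigma = np.zeros(n)
m_old, m_new = np.zeros(n), 0.1 * np.ones(n)

# C^{-1/2}: unique symmetric square root of C^{-1}, via eigendecomposition
vals, vecs = np.linalg.eigh(C)
C_inv_sqrt = vecs @ np.diag(vals**-0.5) @ vecs.T

p_sigma = ((1 - c_sigma) * p_sigma
           + np.sqrt(1 - (1 - c_sigma)**2) * np.sqrt(mu_w)
           * C_inv_sqrt @ (m_new - m_old) / sigma)

# E||N(0,I)|| via the approximation given above
chi_n = np.sqrt(n) * (1 - 1/(4*n) + 1/(21*n**2))
sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))
```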

Finally, the covariance matrix is updated, where again the respective evolution path is updated first.

 p_c \gets \underbrace{(1-c_c)}_{\!\!\!\!\!\text{discount factor}\!\!\!\!\!}\, p_c + \underbrace{\mathbf{1}_{[0,\alpha\sqrt{n}]}(\|p_\sigma\|)}_{\text{indicator function}} \overbrace{\sqrt{1 - (1-c_c)^2}}^{ \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\text{complements for discounted variance} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!} \underbrace{\sqrt{\mu_w} \, \frac{m_{k+1} - m_k}{\sigma_k}}_{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \text{distributed as}\; \mathcal{N}(0,C_k)\;\text{under neutral selection} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!}
 C_{k+1} = \underbrace{(1 - c_1 - c_\mu + c_s)}_{\!\!\!\!\!\text{discount factor}\!\!\!\!\!} \, C_k + c_1 \underbrace{p_c p_c^T}_{ \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \text{rank one matrix} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!} + \,c_\mu \underbrace{\sum_{i=1}^\mu w_i \frac{x_{i:\lambda} - m_k}{\sigma_k} \left( \frac{x_{i:\lambda} - m_k}{\sigma_k} \right)^T}_{ \text{rank} \;\min(\mu,n)\; \text{matrix}}

where (⋅)^T denotes the transpose and

c_c⁻¹ ≈ n/4 is the backward time horizon for the evolution path p_c and larger than one,
α ≈ 1.5, and the indicator function 1_{[0,α√n]}(‖p_σ‖) evaluates to one iff ‖p_σ‖ ∈ [0, α√n] or, in other words, ‖p_σ‖ ≤ α√n, which is usually the case,
c_s = (1 − 1_{[0,α√n]}(‖p_σ‖)²) c_1 c_c (2 − c_c) makes partly up for the small variance loss in case the indicator is zero,
c_1 ≈ 2/n² is the learning rate for the rank-one update of the covariance matrix and
c_μ ≈ μ_w/n² is the learning rate for the rank-μ update of the covariance matrix and must not exceed 1 − c_1.

The covariance matrix update tends to increase the likelihood for p_c and for (x_{i:λ} − m_k)/σ_k to be sampled from N(0, C_{k+1}). This completes the iteration step.
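The path and covariance assignments can be sketched as follows; the learning rates and inputs are assumed illustrative values, with h standing in for the value of the indicator function.

```python
import numpy as np

n, mu = 5, 3
c_1, c_c = 2 / n**2, 4 / (n + 4)           # assumed illustrative learning rates
c_mu, mu_w = 0.1, 3.0
C, p_c = np.eye(n), np.zeros(n)
d = 0.1 * np.ones(n)                       # (m_{k+1} - m_k)/sigma_k, assumed
w = np.full(mu, 1 / mu)                    # stand-in recombination weights
Y = np.tile(d, (mu, 1))                    # stand-in steps (x_{i:lam} - m_k)/sigma_k

h = 1.0                                    # indicator value, usually one
p_c = (1 - c_c) * p_c + h * np.sqrt(1 - (1 - c_c)**2) * np.sqrt(mu_w) * d
c_s = (1 - h**2) * c_1 * c_c * (2 - c_c)   # compensates variance loss if h = 0
rank_one = np.outer(p_c, p_c)              # rank-one matrix p_c p_c^T
rank_mu = (Y * w[:, None]).T @ Y           # sum_i w_i y_i y_i^T
C = (1 - c_1 - c_mu + c_s) * C + c_1 * rank_one + c_mu * rank_mu
```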

The number of candidate samples per iteration, λ, is not determined a priori and can vary in a wide range. Smaller values, for example λ = 10, lead to more local search behavior. Larger values, for example λ = 10n with default value μ_w ≈ λ/4, render the search more global. Sometimes the algorithm is repeatedly restarted with λ increasing by a factor of two for each restart. Besides setting λ (or possibly μ instead, if for example λ is predetermined by the number of available processors), the above introduced parameters are not specific to the given objective function and are therefore not meant to be modified by the user.
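Putting the pieces together, a minimal NumPy implementation of the loop might read as follows. The parameter formulas follow widely used default settings, which are an assumption of this sketch (the text above only fixes their rough magnitudes), and termination is simply a fixed iteration budget.

```python
import numpy as np

def cma_es(f, m0, sigma0, iters=400, seed=0):
    """Minimal (mu/mu_w, lambda)-CMA-ES sketch; parameter settings are
    assumed common defaults, not mandated by the update equations."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m0, dtype=float)
    sigma = float(sigma0)
    n = len(m)
    lam = 4 + int(3 * np.log(n))                 # assumed default population size
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                 # positive weights, sum to one
    mu_w = 1.0 / np.sum(w**2)                    # variance effective selection mass
    c_sigma = (mu_w + 2) / (n + mu_w + 5)
    d_sigma = 1 + c_sigma + 2 * max(0.0, np.sqrt((mu_w - 1) / (n + 1)) - 1)
    c_c = (4 + mu_w / n) / (n + 4 + 2 * mu_w / n)
    c_1 = 2 / ((n + 1.3)**2 + mu_w)
    c_mu = min(1 - c_1, 2 * (mu_w - 2 + 1 / mu_w) / ((n + 2)**2 + mu_w))
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n**2))  # E||N(0,I)||
    C, p_sigma, p_c = np.eye(n), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        vals, vecs = np.linalg.eigh(C)
        vals = np.maximum(vals, 1e-20)           # guard against round-off
        A = vecs * np.sqrt(vals)                 # A A^T = C
        inv_sqrt = (vecs / np.sqrt(vals)) @ vecs.T   # C^{-1/2}
        z = rng.standard_normal((lam, n))
        X = m + sigma * (z @ A.T)                # sample new solutions
        order = np.argsort([f(x) for x in X])    # rank by fitness
        steps = (X[order[:mu]] - m) / sigma      # (x_{i:lam} - m_k)/sigma_k
        d = w @ steps                            # (m_{k+1} - m_k)/sigma_k
        m = m + sigma * d                        # move mean to better solutions
        p_sigma = ((1 - c_sigma) * p_sigma
                   + np.sqrt(1 - (1 - c_sigma)**2) * np.sqrt(mu_w) * (inv_sqrt @ d))
        h = float(np.linalg.norm(p_sigma) <= 1.5 * np.sqrt(n))  # indicator, alpha = 1.5
        p_c = (1 - c_c) * p_c + h * np.sqrt(1 - (1 - c_c)**2) * np.sqrt(mu_w) * d
        c_s = (1 - h**2) * c_1 * c_c * (2 - c_c)
        C = ((1 - c_1 - c_mu + c_s) * C
             + c_1 * np.outer(p_c, p_c)
             + c_mu * (steps * w[:, None]).T @ steps)
        C = (C + C.T) / 2                        # enforce symmetry against round-off
        sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))
    return m
```

On a smooth unimodal test function such as the sphere, this sketch should drive the mean very close to the optimum within a few hundred iterations.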
