Blind Optimization

Derivative-free optimization, sometimes referred to as black-box optimization, is a discipline in mathematical optimization that does not rely on derivative information in the classical sense to find optimal solutions. This is because sometimes information about the derivative of the objective function denoted as $f$ , is either unavailable, unreliable, or impractical to obtain. For example, $f$ might be non-smooth, time-consuming to evaluate, or inherently noisy, rendering methods that depend on derivatives or approximate them using finite differences of limited utility.

The problem of finding optimal solutions in such situations is referred to as derivative-free optimization, and algorithms that do not use derivatives or finite differences are commonly known as derivative-free algorithms.

Examples of such algorithms are:

Genetic Evolution
Differential Evolution
Particle Swarm Optimization

Particle Swarms

Particle Swarm Optimization was proposed by Kennedy and Eberhart in 1995. As mentioned in the original paper, sociobiologists believe a school of fish or a flock of birds that moves in a group “can profit from the experience of all other members”. In other words, while a bird flies and searches randomly for food, for instance, all birds in the flock can share their discovery and help the entire flock get the best hunt.

PSO is best used to find the maximum or minimum of a function defined on a multidimensional vector space. Assume we have a function $f(x)$ that produces a real value from a vector parameter $X$ (such as coordinate $(x,y)$ in a plane) and can take on virtually any value in the space (for example, $f(X)$ is the altitude and we can find one for any point on the plane), then we can apply PSO. The PSO algorithm will return the parameter it found that produces the minimum $f(X)$ .

Let’s start with the following function:

f(x, y) = (x-3.14)^2 + (y-2.72)^2 + \sin{(3x+1.41)} + \sin{(4y-1.73)}

As we can see from the plot above, this function looks like a curved egg carton. It is not a convex function and therefore it is hard to find its minimum because a local minimum found is not necessarily the global minimum.

So how can we find the minimum point in this function? For sure, we can resort to an exhaustive search: If we check the value of $f(x,y)$ for every point on the plane, we can find the minimum point. Or we can just randomly find some sample points on the plane and see which one gives the lowest value on $f(x,y)$ if we think it is too expensive to search every point. However, we also note from the shape of $f(x,y)$ that if we have found a point with a smaller value of $f(x,y)$ , it is easier to find an even smaller value around its proximity.

This is how a particle swarm optimization does. Similar to the flock of birds looking for food, we start with a number of random points on the plane (call them particles) and let them look for the minimum point in random directions. At each step, every particle should search around the minimum point it ever found as well as around the minimum point found by the entire swarm of particles. After certain iterations, we consider the minimum point of the function as the minimum point ever explored by this swarm of particles.

Algorithm Details

Assume we have P particles and we denote the position of particle i at iteration t as $X^i(t)$ , which in the example above, we have as a coordinate $X^i(t) = (x^i(t), y^i(t))$ . Besides the position, we also have a velocity for each particle, denoted as $V^i(t) = (v^i_x(t), v^i_y(t))$ . At the next iteration, the position of each particle would be updated as $X^i(t+1) = X^i(t) + V^i(t+1)$ .

At the same time, the velocities are also updated by the rule:

V^i(t+1) = wV^i(t)+c_1r_1(pbest^i-X^i(t)) + c_2r_2(gbest-X^i(t))

Where $r_1$ and $r_2$ are random numbers between 0 and 1, constants w, $c_1$ , and $c_2$ are parameters to the PSO algorithm, and $pbest^i$ is the position that gives the best $f(X)$ value ever explored by particle i and $gbest$ is that explored by all the particles in the swarm.

We call the parameter w the inertia weight constant. It is between 0 and 1 and determines how much should the particle keep on with its previous velocity (i.e., speed and direction of the search). The parameters $c_1$ and $c_2$ are called the cognitive and the social coefficients respectively. They control how much weight should be given between refining the search result of the particle itself and recognizing the search result of the swarm. We can consider these parameters control the trade-off between exploration and exploitation.

The positions $pbest^i$ and $gbest$ are updated in each iteration to reflect the best position ever found thus far.

One interesting property of this algorithm that distinguishes it from other optimization algorithms is that it does not depend on the gradient of the objective function. In gradient descent, for example, we look for the minimum of a function $f(X)$ by moving X to the direction $-\nabla f(X)$ of as it is where the function going down the fastest. For any particle at the position X at the moment, how it moves does not depend on which direction is the “down hill” but only on where are $pbest^i$ and $gbest$ . This makes PSO particularly suitable if differentiating $f(X)$ is difficult.

Another property of PSO is that it can be parallelized easily. As we are manipulating multiple particles to find the optimal solution, each particle can be updated in parallel and we only need to collect the updated value once per iteration. This makes map-reduce architecture a perfect candidate to implement PSO.

Variations

All PSO algorithms are mostly the same as we mentioned above. In the above example, we set the PSO to run in a fixed number of iterations. It is trivial to set the number of iterations to run dynamically in response to the progress. For example, we can make it stop once we cannot see any update to the global best solution $gbest$ in several iterations.

Research on PSO was mostly on how to determine the hyperparameters w, $c_1$ , and $c_2$ or varying their values as the algorithm progressed. For example, proposals are making the inertia weight linear decrease. There are also proposals trying to make the cognitive coefficient $c_1$ decrease while the social coefficient $c_2$ increases to bring more exploration at the beginning and more exploitation at the end.

PreviousSPAM or HAM NextGradient descent

Last updated 5 months ago