Notes on L-BFGS and Wolfe condition

Recently, while working on a problem, I wanted to try a second-order method such as L-BFGS instead of the usual SGD or Adam.

In this post, I’ve summarised my notes into an intuitive yet self-contained introduction, which only requires a minimal calculus background, such as Taylor’s theorem. We will discuss the derivation, the algorithm, a convergence result, and a few practical computational issues of L-BFGS.

1. Introduction

Suppose \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\), the objective function we want to minimize, is twice continuously differentiable, with its minimum at \(x^{\ast}\). Given an initial point \(x_0\), a typical approach is to use an iterative method

\[ x_{k + 1} = x_{k} + \alpha_k p_k, \quad k = 0, 1, \dots, \]

to approximate \(x^{\ast}\), where \(p_k\) is the search direction, and \(\alpha_k\) is the step length. So we have two tasks:

  1. find a descent direction \(p_k\);
  2. determine how far we move in that direction, i.e. step length \(\alpha_k\).
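To make the structure concrete, here is a minimal sketch of this generic iteration in Julia; `direction` and `steplength` are placeholders for the two tasks above, which the rest of this post is about.

using LinearAlgebra

# Generic descent iteration x_{k+1} = x_k + α_k p_k.
# `direction` and `steplength` are supplied by the caller; the rest of the
# post is about good choices for them (BFGS directions, Wolfe line search).
function minimize(f, ∇f, x0, direction, steplength; iters=100, tol=1e-8)
    x = copy(x0)
    for _ in 1:iters
        g = ∇f(x)
        norm(g) <= tol && break          # stop when the gradient is small
        p = direction(x, g)              # task 1: a descent direction
        α = steplength(f, ∇f, x, p)      # task 2: how far to move along p
        x = x + α * p
    end
    return x
end

# Toy usage: steepest descent with a fixed step on f(x) = ‖x‖².
f(x) = sum(abs2, x)
∇f(x) = 2x
minimize(f, ∇f, [1.0, 2.0], (x, g) -> -g, (f, ∇f, x, p) -> 0.25)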

2. Find descent direction \(p_k\)

We use \(\nabla f\) to denote the gradient of \(f\), and use \(\nabla f_k\) as a short form of \(\nabla f(x_k)\). Similarly, we use \(\nabla^{2}f\) to denote the Hessian matrix of \(f\), and \(\nabla^{2}f_k\) is the Hessian matrix’s value at \(x_k\).

At the minimizer \(x^{\ast}\), we have \(\nabla f(x^{\ast}) = 0\), and we assume \(\nabla^{2}f(x^{\ast})\) is a symmetric positive definite matrix.

Quasi-Newton and secant equation

Applying Taylor’s theorem to \(\nabla f\) around \(x_k\) and evaluating at \(x^{\ast}\) gives us

\[\begin{equation} \nabla f(x^{\ast}) = \nabla f_k + \nabla^{2}f_k \cdot (x^{\ast} - x_k) + o(\| x^{\ast} - x_k \|) \end{equation}\]

Ignoring the higher-order term, we get the approximation

\[\begin{eqnarray} \nabla f_k + \nabla^{2}f_k \cdot (x^{\ast} - x_k) & \approx & \nabla f(x^{\ast}) = 0 \\ x^{\ast} & \approx & x_k - (\nabla^{2}f_k)^{-1}\nabla{f_k} \end{eqnarray}\]

Newton’s method uses the right-hand side to update \(x_k\),

\[ x_{k+1} = x_k - (\nabla^{2}f_k)^{-1}\nabla f_k \]
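For reference, a single Newton step in Julia could look like the sketch below; it solves the linear system \(\nabla^{2}f_k \, p = -\nabla f_k\) instead of forming the inverse, and the gradient and Hessian functions are assumed to be supplied by the caller.

using LinearAlgebra

# One Newton step: x_{k+1} = x_k - (∇²f_k)⁻¹ ∇f_k, computed by solving
# the linear system ∇²f_k p = -∇f_k rather than inverting the Hessian.
newton_step(x, ∇f, ∇²f) = x + (∇²f(x) \ (-∇f(x)))

# Toy usage on f(x) = xᵀAx/2 - bᵀx, whose minimizer solves A x = b.
A = [4.0 1.0; 1.0 3.0]; b = [1.0, 2.0]
∇f(x) = A * x - b
∇²f(x) = A
x1 = newton_step(zeros(2), ∇f, ∇²f)   # lands on the minimizer in one step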

However, forming the Hessian and solving with it at every iteration is usually expensive, and sometimes we cannot even guarantee that \(\nabla^{2}f_k\) is positive definite.

The idea of quasi-Newton methods is to replace the Hessian with another symmetric positive definite matrix \(B_k\), and to refine it at every iteration at a much smaller computational cost.

Applying the same Taylor expansion at \(x_{k+1}\) instead of \(x^{\ast}\), we have

\[\begin{equation} \nabla^{2}f_k \cdot (x_{k + 1} - x_k) \approx \nabla f_{k+1} - \nabla f_{k} \end{equation}\]

Let \[\begin{eqnarray} s_k & = & x_{k+1} - x_k \\ y_k & = & \nabla f_{k+1} - \nabla f_{k} \tag{1} \end{eqnarray}\]

The next matrix \(B_{k+1}\) we are looking for should satisfy this approximate relation,

\[\begin{equation} B_{k+1} s_k = y_k \tag{2} \end{equation}\]

This equation is called the secant equation.
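In one dimension, for example, the secant equation simply says that \(B_{k+1}\) is the secant (finite-difference) approximation of the second derivative, which is where its name comes from:

\[ B_{k+1} = \frac{y_k}{s_k} = \frac{f^{\prime}(x_{k+1}) - f^{\prime}(x_k)}{x_{k+1} - x_k} \approx f^{\prime\prime}(x_{k+1}) \]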

There is another approach to get the secant equation. Let

\[ m_k(p) = f_k + \nabla f_{k}^{T}p + \frac{1}{2}p^{T}B_k{p} \]

be the second-order approximation of \(f\) at \(x_k\). It is a function of \(p\), with gradient

\[ m_k^{\prime}(p) = \nabla f_k + B_k p \]

When we iterate from \(k\) to \(k + 1\), it’s easy to check that \(m_{k+1}(0)= f_{k+1}\) and \(m_{k+1}^{\prime}(0) = \nabla f_{k+1}\). This means \(m_{k+1}\) matches \(f\) in both function value and gradient at \(x_{k+1}\). If we want \(m_{k+1}\) to approximate \(f\) even better, we may also require its gradient to match that of \(f\) at the previous iterate \(x_k\). Since \(x_k = x_{k+1} - \alpha_k p_k\), this corresponds to \(p = -\alpha_k p_k = -s_k\), and we have

\[\begin{eqnarray} m_{k+1}^{\prime}(-s_k) = \nabla f_{k+1} - B_{k+1}s_k & = & \nabla f_k \\ B_{k+1} s_k & = & \nabla f_{k+1} - \nabla f_k \\ B_{k+1} s_k & = & y_k \end{eqnarray}\]

Rank-two update

So the next question is how to update \(B_k\). The classical answer is a rank-two update; the earliest such formula is due to Davidon, Fletcher and Powell (DFP) (Fletcher and Powell 1963). Written in terms of \(B_k\), the update we will work with is

\[\begin{equation} B_{k+1} = B_k - \frac{B_k s_k s_k^{T} B_k}{s_k^{T} B_k s_k} + \frac{y_k y_k^{T}}{y_k^{T}s_k} \tag{3} \end{equation}\]

(Strictly speaking, (3) is the BFGS update of \(B_k\); the DFP formula is its dual, obtained by exchanging the roles of \(B \leftrightarrow H = B^{-1}\) and \(s_k \leftrightarrow y_k\). This is also why inverting (3) below yields the BFGS formula for \(H_k\).)

In Nocedal and Wright (2006), the authors give another interpretation: each update can be characterized as the solution of a matrix optimization problem. For example, the DFP update of \(B_k\) is the solution of

\[\begin{eqnarray} \min_{B} \|B - B_k\|_{W} \\ s.t. \quad B = B^{T}, \quad B s_k = y_k \end{eqnarray}\]

where the norm \(\|A\|_{W}\) is defined as

\[ \|A\|_{W} = \| W^{1/2}A W^{1/2} \|_{F} \]

with \(\|\cdot\|_{F}\) the Frobenius norm and \(W\) any matrix satisfying \(W y_k = s_k\). The BFGS update of \(H_k\), derived below, solves the analogous problem posed for \(H\), with the constraint \(H y_k = s_k\) and a weight satisfying \(W s_k = y_k\). This is a more intuitive point of view; however, I haven’t finished the detailed proof of this result, so I may revise this part later.

BFGS

Rank-two updates like (3) keep \(B_k\) symmetric and, as long as the curvature condition \(y_k^{T} s_k > 0\) holds, positive definite. But we still have to solve a linear system (or compute an inverse) to get \(p_k = -B_k^{-1}\nabla f_k\). To avoid that, we can approximate \(H_k = B_k^{-1}\) directly.

Many references say that simply applying the Sherman–Morrison formula gives an update equation for \(B_k^{-1}\). However, this matrix derivation is not trivial. After a little calculation, I think the Woodbury matrix identity is a more straightforward route.

The Woodbury matrix identity states

\[\begin{equation} (A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1} \tag{4} \end{equation}\]
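As a quick numerical sanity check of (4) in Julia (random matrices, nothing specific to BFGS):

using LinearAlgebra

# Verify (A + UCV)⁻¹ = A⁻¹ - A⁻¹U(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹ on random matrices.
n, r = 6, 2
A = randn(n, n) + 5I                      # shift to keep A well conditioned
U = randn(n, r); C = randn(r, r) + 5I; V = randn(r, n)
lhs = inv(A + U * C * V)
rhs = inv(A) - inv(A) * U * inv(inv(C) + V * inv(A) * U) * V * inv(A)
@assert lhs ≈ rhs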

Let \(\rho = \frac{1}{y_k^{T}s_{k}}\) and drop the subscript \(k\) to rewrite (3) as

\[\begin{eqnarray} B_{k+1} & = & B - \frac{Bss^{T}B}{s^{T}Bs} + \frac{y y^{T}}{y^{T}s} \\ & = & B + (y \quad Bs)\left( \begin{array}{cc} \rho & 0 \\ 0 & -\frac{1}{s^{T}Bs} \end{array} \right) \left( \begin{array}{c} y^{T} \\ s^{T}B \end{array} \right) \\ & = & B + UCV \tag{5} \end{eqnarray}\]

So, taking \(A = B\) in (4) (hence \(A^{-1} = H = B^{-1}\)), we get

\[\begin{eqnarray} C^{-1} + VA^{-1}U & = & \left( \begin{array}{cc} \frac{1}{\rho} & 0 \\ 0 & -s^{T}Bs \end{array} \right) + \left( \begin{array}{c} y^{T} \\ s^{T}B \end{array} \right) H \left(y,\quad Bs\right) \\ & = & \left( \begin{array}{cc} 1/\rho + y^{T}Hy & 1/\rho \\ 1/\rho & 0 \end{array} \right) \end{eqnarray}\]

We can easily check that

\[\begin{equation} \left( \begin{array}{cc} a + b & a \\ a & 0 \end{array} \right) \left( \begin{array}{cc} 0 & \frac{1}{a} \\ \frac{1}{a} & -\frac{a+b}{a^2} \end{array} \right) = I \end{equation}\]

Using this property, we can get

\[\begin{eqnarray} (C^{-1} + VA^{-1}U)^{-1} & = & \left( \begin{array}{cc} 0 & \rho \\ \rho & -\rho - \rho^{2} y^{T}Hy \end{array} \right) \end{eqnarray}\]

Substituting this back into (4), together with (5) (still dropping the subscript \(k\) for simplicity), gives

\[\begin{eqnarray} H_{k+1} & = & H - H (y \quad Bs)\left( \begin{array}{cc} 0 & \rho \\ \rho & -\rho - \rho^{2} y^{T}Hy \end{array} \right) \left( \begin{array}{c} y^{T} \\ s^{T}B \end{array} \right) H \\ & = & H - \rho s y^{T}H - \rho H y s^{T} + \rho^{2} s y^{T} H y s^{T} + \rho s s^{T} \\ & = & (I - \rho s y^{T})H(I - \rho y s^{T}) + \rho s s^{T} \end{eqnarray}\]

Adding back the subscript \(k\), we finally get the BFGS formula

\[\begin{equation} H_{k+1} = (I - \rho_k s_k y_k^{T})H_k(I - \rho_k y_k s_k^{T}) + \rho_k s_k s_k^{T}, \quad \rho_k = \frac{1}{y_k^{T}s_k} \tag{6} \end{equation}\]
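For concreteness, here is a direct dense implementation of the update (6) in Julia; the function name is my own, and for large problems one would never form \(H\) explicitly (that is the point of Section 4).

using LinearAlgebra

# One BFGS update of the inverse-Hessian approximation H, formula (6).
# Requires the curvature condition yᵀs > 0 so that ρ is positive.
function bfgs_update(H, s, y)
    ρ = 1 / dot(y, s)
    V = I - ρ * y * s'                # I - ρ y sᵀ
    return V' * H * V + ρ * s * s'    # (I - ρ s yᵀ) H (I - ρ y sᵀ) + ρ s sᵀ
end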

Initial \(H_0\)

A simple choice of \(H_0\) is the identity matrix \(I\). A similar strategy is to introduce a scaling factor \(\beta\) and let \(H_0 = \beta I\).
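A popular way to choose such a scaling in practice (Nocedal and Wright 2006), also used for the per-iteration initial matrix \(H_k^{0}\) in L-BFGS below, is

\[ H_k^{0} = \frac{s_{k-1}^{T} y_{k-1}}{y_{k-1}^{T} y_{k-1}} I \]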

There are many other approaches. For instance, we can also use another optimization method to “warm up”: run a few iterations to obtain a better approximation of \(\nabla^{2}f\) (or its inverse), then switch to BFGS.

Convergence

I’m going to discuss the convergence theory of Newton and quasi-Newton methods in a separate article. Here I only state the convergence result.

Theorem Suppose that \(f\) is twice continuously differentiable and that the iterates generated by the BFGS algorithm converge to a minimizer \(x^{\ast}\) near which the Hessian matrix \(G\) is Lipschitz continuous, i.e., for some positive constant \(L\),

\[\begin{equation} \| G(x) - G(x^{\ast}) \| \le L \|x - x^{\ast}\| \end{equation}\]

Suppose the sequence also satisfies

\[\begin{equation} \sum_{k=1}^{\infty} \| x_k - x^{\ast} \| < \infty \end{equation}\]

Then \(x_k\) converges to \(x^{\ast}\) at a superlinear rate.

For L-BFGS and stochastic L-BFGS, some convergence discussion can be found in Mokhtari and Ribeiro (2015).

3. Determine step length \(\alpha_k\)

With the BFGS update formula, we have solved the task of finding a descent direction \(p_k\). Now we turn to determining how far we should go in that direction.

Wolfe condition

We introduce a helper function

\[\phi(\alpha) = f(x_k + \alpha p_k), \quad \alpha > 0\]
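By the chain rule, its derivative, which will appear in the conditions below, is

\[ \phi^{\prime}(\alpha) = \nabla f(x_k + \alpha p_k)^{T} p_k, \qquad \phi^{\prime}(0) = \nabla f_k^{T} p_k < 0, \]

where the sign follows because \(p_k\) is a descent direction.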

The minimizer of \(\phi(\alpha)\) is what we want. However, solving this univariate minimization problem exactly can be too expensive. An inexact solution is acceptable, as long as \(\phi(\alpha) = f(x_k + \alpha p_k) < f_k\).

However, a simple reduction in the function value is sometimes not enough. The picture below shows an example of this situation. We need some notion of sufficient decrease to rule it out.

Figure 1: Insufficient reduction (Figure 3.2 from Nocedal and Wright 2006)

The following Wolfe conditions formalize this requirement of sufficient decrease.

\[\begin{eqnarray} f(x_k + \alpha_k p_k) & \le & f(x_k) + c_1 \alpha_k \nabla f_k^{T} p_k \\ \nabla f(x_k + \alpha_k p_k)^{T} p_k & \ge & c_2 \nabla f_k^{T} p_k \tag{7} \end{eqnarray}\]

where \(0 < c_1 < c_2 < 1\).

The right-hand side of the first inequality is a line \(l(\alpha) = f(x_k) + c_1 \alpha \nabla f_k^{T} p_k\) starting at \(f(x_k)\), whose slope is a small fraction \(c_1\) of the initial slope \(\phi^{\prime}(0) = \nabla f_k^{T} p_k\). The intuition behind the first inequality is that the function value at step \(\alpha_k\) should lie below this line \(l(\alpha)\). This condition is usually called the Armijo condition, or the sufficient decrease condition.

The second inequality uses information about the curvature of \(f\). Since \(p_k\) is a descent direction, we have \(\nabla f_k^{T} p_k < 0\). If the slope of \(\phi\) at \(\alpha_k\) is still steeper (more negative) than \(c_2\) times the initial slope \(\nabla f_k^{T} p_k\), we can safely move further along \(p_k\) to reach an even lower objective value. Therefore, an acceptable step should end at a point where the slope has become less negative. This second condition is usually referred to as the curvature condition.

Figure 2: The curvature condition (Figure 3.4 from Nocedal and Wright 2006)

There is a chance that the step \(\alpha_k\) goes so far that the slope becomes strongly positive. To rule this out, we can use the strong Wolfe conditions

\[\begin{eqnarray} f(x_k + \alpha_k p_k) & \le & f(x_k) + c_1 \alpha_k \nabla f_k^{T} p_k \\ | \nabla f(x_k + \alpha_k p_k)^{T} p_k | & \le & c_2 |\nabla f_k^{T} p_k| \tag{8} \end{eqnarray}\]

where \(0 < c_1 < c_2 < 1\).
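As a small illustration, a direct check of the strong Wolfe conditions (8) in Julia might look like the sketch below; the function name and the use of Flux’s `gradient` are my own choices, not from any particular library.

using Flux, LinearAlgebra   # Flux re-exports `gradient` for automatic differentiation

# Check the strong Wolfe conditions (8) for a trial step α along direction p.
function strong_wolfe(f, x, p, α; c1=1e-4, c2=0.9)
    ∇f(z) = gradient(f, z)[1]
    armijo    = f(x + α * p) <= f(x) + c1 * α * dot(∇f(x), p)
    curvature = abs(dot(∇f(x + α * p), p)) <= c2 * abs(dot(∇f(x), p))
    return armijo && curvature
end

# Example: f(x) = ‖x‖², x = [1, 1], p = -∇f(x), α = 0.5.
strong_wolfe(x -> sum(abs2, x), [1.0, 1.0], [-2.0, -2.0], 0.5)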

Existence and convergence

The next question is whether such \(\alpha_k\) exist, and if they do, how to find them.

The following lemma and theorem (Nocedal and Wright 2006) guarantee the existence of step lengths \(\alpha_k\) satisfying the Wolfe conditions, and show superlinear convergence under certain assumptions.

Lemma Suppose that \(f\) is continuously differentiable. Let \(p_k\) be a descent direction at \(x_k\), and assume that \(f\) is bounded below along \(\{x_k + \alpha p_k \mid \alpha > 0\}\). Then for \(0 < c_1 < c_2 < 1\), there exist intervals of step lengths satisfying the Wolfe conditions (7) and the strong Wolfe conditions (8).

Theorem Suppose that \(f\) is twice continuously differentiable. Consider the iteration \(x_{k+1} = x_k + \alpha_k p_k\), where \(p_k\) is a descent direction and \(\alpha_k\) satisfies the Wolfe conditions (7) with \(c_1 \le 1/2\). If the sequence \(\{x_k\}\) converges to a point \(x^{\ast}\) such that \(\nabla f(x^{\ast}) = 0\) and \(\nabla^{2}f(x^{\ast})\) is positive definite, and if the search direction satisfies

\[\begin{equation} \lim_{k\to \infty}\frac{\| \nabla f_k + \nabla^{2}f_k p_k \|}{\|p_k\|} = 0 \end{equation}\]

then

  1. the step length \(\alpha_k = 1\) is admissible for all \(k\) greater than a certain index \(k_0\);
  2. if \(\alpha_k = 1\) for all \(k > k_0\), \(\{x_k\}\) converges to \(x^{\ast}\) superlinearly.

Line search algorithm

We use a line search algorithm to locate a valid \(\alpha\). The idea is to generate a monotonically increasing sequence \(\{\alpha_i\}\) in \((0, \alpha_{max})\). If some \(\alpha_i\) satisfies the Wolfe conditions, return that step; otherwise, narrow down a bracketing interval and zoom in on it.

The algorithm is sketched in Julia below. The helper `choose` stands for any rule that picks a trial step length inside the current interval; a simple midpoint is used here, while practical implementations use quadratic or cubic interpolation.

using Flux   # provides `gradient` for automatic differentiation

# Trial-step selection. A simple midpoint is used here as a placeholder;
# practical implementations use quadratic or cubic interpolation.
choose(lo, hi) = (lo + hi) / 2

function line_search(ϕ, α_0=0.0, α_max=1.0, c1=1e-4, c2=0.9)
    ϕ′(x) = gradient(ϕ, x)[1]        # derivative of ϕ via automatic differentiation
    ϕ0, dϕ0 = ϕ(0.0), ϕ′(0.0)        # value and slope at α = 0
    α = Dict(0 => α_0)
    α[1] = choose(α_0, α_max)
    i = 1
    while true
        y = ϕ(α[i])
        # sufficient decrease violated, or no progress since the previous trial:
        # an acceptable step is bracketed by (α[i-1], α[i])
        if y > ϕ0 + c1 * α[i] * dϕ0 || (i > 1 && y >= ϕ(α[i - 1]))
            return zoom(ϕ, ϕ′, α[i - 1], α[i], c1, c2)
        end
        dy = ϕ′(α[i])
        # curvature condition already holds: accept this step
        if abs(dy) <= -c2 * dϕ0
            return α[i]
        end
        # slope is non-negative: we have stepped past a minimizer
        if dy >= 0
            return zoom(ϕ, ϕ′, α[i], α[i - 1], c1, c2)
        end
        α[i + 1] = choose(α[i], α_max)
        i += 1
    end
end

function zoom(ϕ, ϕ′, α_lo, α_hi, c1=1e-4, c2=0.9)
    ϕ0, dϕ0 = ϕ(0.0), ϕ′(0.0)
    while true
        # use quadratic, cubic, or bisection interpolation to pick a trial step
        α = choose(α_lo, α_hi)
        y = ϕ(α)
        if y > ϕ0 + c1 * α * dϕ0 || y >= ϕ(α_lo)
            α_hi = α
        else
            dy = ϕ′(α)
            if abs(dy) <= -c2 * dϕ0
                return α
            end
            if dy * (α_hi - α_lo) >= 0
                α_hi = α_lo
            end
            α_lo = α
        end
    end
end

Notice that in zoom(), the two endpoints may swap during the iterations: \(\alpha_{lo}\) is always the endpoint with the smallest function value seen so far, and it is not necessarily the smaller of the two step lengths.
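As a small usage sketch of the code above (the quadratic is just a toy example): for \(f(x) = (x - 1)^2\) with \(x_k = 0\) and \(p_k = 1\), we have \(\phi(\alpha) = (\alpha - 1)^2\), and

ϕ(α) = (α - 1)^2        # toy case: x_k = 0, p_k = 1, f(x) = (x - 1)²
α = line_search(ϕ)      # with the midpoint `choose`, the first trial α = 0.5
                        # already satisfies both Wolfe conditions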

4. Limit memory scenario

We have discussed how to use BFGS algorithm to find a descent direction \(p_k\) and how to use linear search to choose a step length \(\alpha_k\) that satisfies the Wolfe condition. Now let’s talk about some practical computation issues.

The BFGS algorithm needs to store and update the approximate (inverse) Hessian at each iteration. This requires \(O(n^2)\) memory, which is not feasible when the parameter count \(n\) of modern deep learning models is in the millions or hundreds of millions.

Limited-memory BFGS, or L-BFGS, is a variation of BFGS that addresses this issue. Instead of storing \(H_k\) and multiplying with it directly, L-BFGS uses the \(m\) most recent vector pairs \(\{s_i, y_i\}, i = k-m, \dots, k-1\), to reconstruct the action of \(H_k\). This reduces the memory cost from \(O(n^2)\) to \(O(mn)\).

L-BFGS: vanilla iteration

Recall (1) and the BFGS formula (6), and for simplicity let

\[\begin{equation} V_k = I - \rho_k y_k s_k^{T} \end{equation}\]

The BFGS formula becomes

\[\begin{equation} H_{k + 1} = V_k^{T} H_k V_k + \rho_k s_k s_k^{T} \tag{9} \end{equation}\]

We can compute \(H_k \nabla f_k\) directly in a vanilla iterative way (let \(q = \nabla f_k\)). Applying one update of the form (9) to \(q\) costs:

  1. \(V_k q = q - \rho_k y_k s_k^{T} q\): calculate the scalar \(\alpha = \rho_k s_k^{T} q\) first in \(n\) multiplications, then calculate \(q - \alpha y_k\) in another \(n\) multiplications.
  2. Suppose \(H_k^{0}\) is a diagonal matrix, so we can calculate \(H_k^{0}(V_k q)\) in \(n\) multiplications.
  3. Multiplying by \(V_k^{T}\) is analogous and needs \(2n\) multiplications as well.
  4. Finally, computing \(\rho_k s_k s_k^{T} q\) and adding the results together needs another \(2n\) multiplications.

So this vanilla iteration requires \(7n\) multiplications per update, leading to a total cost of about \(7nm\) multiplications over the \(m\) stored pairs.

Can we do better?

L-BFGS: two-loop recursion

L-BFGS has a two-loop recursion algorithm, which is quite brilliant.

# s, y, ρ hold the m most recent pairs, indexed by the iteration number
# (e.g. Dicts with keys k-m, ..., k-1); H0 is the initial matrix H_k^0.
function L_BFGS(H0, ∇f_k, s, y, ρ, k, m)
    q = ∇f_k
    α = Dict{Int,Float64}()
    for i = k - 1 : -1 : k - m           # newest pair to oldest
        α[i] = ρ[i] * transpose(s[i]) * q
        q = q - α[i] * y[i]
    end
    w = H0 * q                           # apply the initial matrix H_k^0
    for i = k - m : k - 1                # oldest pair to newest
        β = ρ[i] * transpose(y[i]) * w
        w = w + s[i] * (α[i] - β)
    end
    return w                             # w = H_k ∇f_k
end

The return value \(w = H_k \nabla f_k\) is exactly what we need, and it takes only about \(4nm\) multiplications (plus the cost of applying \(H_k^{0}\)).

The data flow in the two-loop recursion is not easy to grasp at first glance. To make it clearer, we expand (9) for \(m\) steps and get

\[\begin{aligned} H_{k}=&\left(V_{k-1}^{T} \cdots V_{k-m}^{T}\right) H_{k}^{0}\left(V_{k-m} \cdots V_{k-1}\right) \\ &+\rho_{k-m}\left(V_{k-1}^{T} \cdots V_{k-m+1}^{T}\right) s_{k-m} s_{k-m}^{T}\left(V_{k-m+1} \cdots V_{k-1}\right) \\ &+\rho_{k-m+1}\left(V_{k-1}^{T} \cdots V_{k-m+2}^{T}\right) s_{k-m+1} s_{k-m+1}^{T}\left(V_{k-m+2} \cdots V_{k-1}\right) \\ &+\cdots \\ &+\rho_{k-2} V_{k-1}^{T} s_{k-2} s_{k-2}^{T} V_{k-1} \\ &+\rho_{k-1} s_{k-1} s_{k-1}^{T} \end{aligned}\]

Some key observations to help you understand the two-loop recursion (a numerical check follows after this list):

  • In the first loop, \(\alpha_i = \rho_i s_i^{T} V_{i + 1} \cdots V_{k-1} \nabla f_k\)
  • After the first loop, \(q = V_{k-m} \cdots V_{k-1} \nabla f_k\); after w = H0 * q, we have \(w = H_k^{0} V_{k-m} \cdots V_{k-1} \nabla f_k\)
  • In the second loop, after iteration \(i\), the vector \(w\) will be
\[\begin{aligned} w =&\left(V_{i}^{T} \cdots V_{k-m}^{T}\right) H_{k}^{0}\left(V_{k-m} \cdots V_{k-1}\nabla f_k\right) \\ &+\rho_{k-m}\left(V_{i}^{T} \cdots V_{k-m+1}^{T}\right) s_{k-m} s_{k-m}^{T}\left(V_{k-m+1} \cdots V_{k-1}\nabla f_k\right) \\ &+\cdots \\ &+\rho_{i} s_{i} s_{i}^{T} V_{i + 1} \cdots V_{k-1} \nabla f_k \end{aligned}\]
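To convince yourself that the recursion really reproduces \(H_k \nabla f_k\), here is a small numerical check in Julia; everything below (the random data, the 1-based storage with the oldest pair first, the helper name) is my own construction for the test, not part of the algorithm.

using LinearAlgebra, Random

# Compare the two-loop recursion against H_k ∇f_k with H_k built explicitly
# from the BFGS formula (6), starting from H_k^0 = I. Pairs are stored
# 1-based, oldest first, which matches the loop order i = k-m, ..., k-1.
function two_loop_check(n=10, m=5)
    Random.seed!(0)
    A = let M = randn(n, n); M' * M + I end     # an SPD "Hessian"
    s = [randn(n) for _ in 1:m]
    y = [A * sk for sk in s]                    # then yᵀs > 0 automatically
    ρ = [1 / dot(y[i], s[i]) for i in 1:m]
    ∇f = randn(n)

    # explicit H_k from (6), oldest update first
    H = Matrix{Float64}(I, n, n)
    for i in 1:m
        V = I - ρ[i] * y[i] * s[i]'
        H = V' * H * V + ρ[i] * s[i] * s[i]'
    end

    # two-loop recursion with H_k^0 = I
    q = copy(∇f)
    α = zeros(m)
    for i in m:-1:1                             # newest to oldest
        α[i] = ρ[i] * dot(s[i], q)
        q -= α[i] * y[i]
    end
    w = q
    for i in 1:m                                # oldest to newest
        β = ρ[i] * dot(y[i], w)
        w += s[i] * (α[i] - β)
    end

    return H * ∇f ≈ w                           # should be true
end

two_loop_check()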

5. Summary

In this article, we have covered the basic idea of quasi-Newton methods. We carefully derived the rank-two update and the BFGS formula, and showed how to find a descent direction with them. We also discussed how to use the Wolfe conditions and a line search to find a feasible step length. Finally, we demonstrated how the two-loop recursion makes L-BFGS a fast and memory-efficient iteration.

In most of the materials I’ve found about L-BFGS, many non-trivial details are omitted. I’ve tried to make this post as self-contained and as clear as possible. Maybe it will help a newcomer or two to this topic get past some of the obscure steps.


Fletcher, Roger, and Michael JD Powell. 1963. “A Rapidly Convergent Descent Method for Minimization.” The Computer Journal 6 (2): 163–68.
Mokhtari, Aryan, and Alejandro Ribeiro. 2015. “Global Convergence of Online Limited Memory BFGS.” The Journal of Machine Learning Research 16 (1): 3151–81.
Nocedal, Jorge, and Stephen Wright. 2006. Numerical Optimization. Springer Science & Business Media.

