Notes on L-BFGS and the Wolfe condition

Recently, while working on a problem, I wanted to try a second-order method such as L-BFGS instead of the usual SGD or Adam.

In this post, I've summarised my notes into an intuitive yet self-contained introduction, which only requires a minimal calculus background, such as Taylor's theorem. We will discuss the derivation, the algorithm, convergence results, and a few practical computational issues of L-BFGS.

1. Introduction

Suppose $f: \mathbb{R}^n \to \mathbb{R}$, the objective function we want to minimize, is twice continuously differentiable, with its minimum at $x^*$. Given an initial point $x_0$, a typical optimization approach is to use an iterative method

$$x_{k+1} = x_k + \alpha_k p_k, \qquad k = 0, 1, \dots$$

to approximate $x^*$, where $p_k$ is the search direction and $\alpha_k$ is the step length. So we have two tasks:

  1. find a descent direction $p_k$;
  2. determine how far to move in that direction, i.e. the step length $\alpha_k$.

2. Find the descent direction $p_k$

We use $\nabla f$ to denote the gradient of $f$, and $\nabla f_k$ as a short form of $\nabla f(x_k)$. Similarly, we use $\nabla^2 f$ to denote the Hessian matrix of $f$, and $\nabla^2 f_k$ for its value at $x_k$.

At $x^*$, we have $\nabla f(x^*) = 0$, and we assume $\nabla^2 f(x^*)$ is a symmetric positive definite matrix.

Quasi-Newton and secant equation

Applying Taylor's theorem to $\nabla f$ at $x_k$ gives us

$$\nabla f(x) = \nabla f_k + \nabla^2 f_k (x - x_k) + o(\lVert x - x_k \rVert)$$

Ignoring the higher-order term and evaluating at $x^*$, we get the approximation

$$\nabla f_k + \nabla^2 f_k (x^* - x_k) \approx \nabla f(x^*) = 0 \quad\Longrightarrow\quad x^* \approx x_k - (\nabla^2 f_k)^{-1} \nabla f_k$$

Newton's method uses the right-hand side to update $x_k$,

$$x_{k+1} = x_k - (\nabla^2 f_k)^{-1} \nabla f_k$$

However, computing the Hessian matrix's inverse is usually expensive, and sometimes we cannot guarantee that $\nabla^2 f_k$ is positive definite.
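
For concreteness, here is a minimal sketch of the plain Newton iteration in Julia; the gradient `g`, the Hessian `h`, and the toy problem are hypothetical placeholders of my own, and the backslash solve of the $n \times n$ system is exactly the expensive step that quasi-Newton methods try to avoid.

```julia
using LinearAlgebra

# Plain Newton iteration; g and h are user-supplied gradient and Hessian
# functions (hypothetical placeholders for illustration).
function newton(g, h, x0; iters=20, tol=1e-8)
    x = copy(x0)
    for _ in 1:iters
        gx = g(x)
        norm(gx) < tol && break
        x = x - h(x) \ gx      # the n-by-n linear solve is the expensive step
    end
    return x
end

# example: f(x) = x1^4 + x2^2, minimized at the origin
g(x) = [4 * x[1]^3, 2 * x[2]]
h(x) = Diagonal([12 * x[1]^2, 2.0])
newton(g, h, [1.0, 1.0])
```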

The idea of quasi-Newton methods is to replace the Hessian matrix with another symmetric positive definite matrix $B_k$, and to refine it in every iteration at a much smaller computational cost.

From the Taylor expansion above, evaluated at $x = x_{k+1}$, we have

$$\nabla^2 f_k (x_{k+1} - x_k) \approx \nabla f_{k+1} - \nabla f_k$$

Let

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f_{k+1} - \nabla f_k \tag{1}$$

The next matrix $B_{k+1}$ we are looking for should satisfy this approximation exactly,

$$B_{k+1} s_k = y_k \tag{2}$$

This equation is called the secant equation.

There is another way to arrive at the secant equation. Let

$$m_k(p) = f_k + \nabla f_k^T p + \frac{1}{2} p^T B_k p$$

be the second-order approximation to $f$ at $x_k$. It is a function of $p$ with gradient

$$\nabla m_k(p) = \nabla f_k + B_k p$$

When we move from iteration $k$ to $k+1$, it is easy to check that $m_{k+1}(0) = f_{k+1}$ and $\nabla m_{k+1}(0) = \nabla f_{k+1}$. This means $m_{k+1}(p)$ matches $f$ in both function value and gradient at $x_{k+1}$. If we want $m_{k+1}(p)$ to approximate $f$ even better, we may also require its gradient to equal that of $f$ at the previous iterate $x_k$. Since $x_k = x_{k+1} - \alpha_k p_k$, this corresponds to $p = -\alpha_k p_k = -s_k$, and we have

$$\nabla m_{k+1}(-s_k) = \nabla f_{k+1} - B_{k+1} s_k = \nabla f_k \quad\Longrightarrow\quad B_{k+1} s_k = \nabla f_{k+1} - \nabla f_k \quad\Longrightarrow\quad B_{k+1} s_k = y_k$$

Rank-two update

So the next question becomes how to update $B_k$. The classical answer is a rank-two update, an idea going back to the Davidon-Fletcher-Powell (DFP) formula (Fletcher and Powell 1963). The rank-two update used here, which is the one underlying BFGS, is

$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} \tag{3}$$
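
As a quick sanity check (my own illustration, not taken from the references), the following sketch implements the rank-two update (3) on random data and verifies that the result satisfies the secant equation (2).

```julia
using LinearAlgebra

# rank-two update (3): B_{k+1} = B - (B s s' B)/(s' B s) + (y y')/(y' s)
function update_B(B, s, y)
    Bs = B * s
    return B - (Bs * Bs') / dot(s, Bs) + (y * y') / dot(y, s)
end

# verify the secant equation B_{k+1} s = y on random data
n = 5
A = randn(n, n)
B = A * A' + I                 # a symmetric positive definite B_k
s = randn(n)
y = randn(n)
y = sign(dot(y, s)) * y        # ensure y's > 0, so the update is well defined
B1 = update_B(B, s, y)
@assert isapprox(B1 * s, y; atol=1e-8)
```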

In Nocedal and Wright (2006), the authors provide another interpretation of such rank-two updates: the updated matrix is viewed as the solution of an optimization problem of the form

$$\min_B \; \lVert B - B_k \rVert_W \qquad \text{s.t.}\quad B = B^T, \; B s_k = y_k$$

where the $\lVert A \rVert_W$ norm is defined as

$$\lVert A \rVert_W = \lVert W^{1/2} A W^{1/2} \rVert_F$$

with $\lVert \cdot \rVert_F$ the Frobenius norm and $W$ a suitable weight matrix (for instance one satisfying $W y_k = s_k$). This is a more intuitive viewpoint; however, I haven't finished the detailed proof of this conclusion, so I will revise this part later.

BFGS

The rank-two update (3) preserves properties of $B_k$ such as symmetry and positive definiteness. But we still have to compute an inverse to get $p_k = -B_k^{-1} \nabla f_k$. To avoid the matrix inversion, we approximate the matrix $H_k = B_k^{-1}$ directly.

Many materials say we can simply apply the Sherman–Morrison formula to obtain an update equation for $B_k^{-1}$. However, this matrix derivation is not trivial. After a little calculation, I think applying the Woodbury matrix identity is a more straightforward way.

The Woodbury matrix identity says

$$(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1} \tag{4}$$

Let $\rho = \frac{1}{y_k^T s_k}$ and drop the subscript $k$ to rewrite (3) as

$$B_{k+1} = B - \frac{B s s^T B}{s^T B s} + \frac{y y^T}{y^T s} = B + \begin{pmatrix} y & Bs \end{pmatrix} \begin{pmatrix} \rho & 0 \\ 0 & -\frac{1}{s^T B s} \end{pmatrix} \begin{pmatrix} y^T \\ s^T B \end{pmatrix} = B + UCV \tag{5}$$

So, with $A = B$ and $A^{-1} = H$,

$$C^{-1} + V A^{-1} U = \begin{pmatrix} \frac{1}{\rho} & 0 \\ 0 & -s^T B s \end{pmatrix} + \begin{pmatrix} y^T \\ s^T B \end{pmatrix} H \begin{pmatrix} y & Bs \end{pmatrix} = \begin{pmatrix} \frac{1}{\rho} + y^T H y & \frac{1}{\rho} \\ \frac{1}{\rho} & 0 \end{pmatrix}$$

We can easily check that

$$\begin{pmatrix} a+b & a \\ a & 0 \end{pmatrix} \begin{pmatrix} 0 & \frac{1}{a} \\ \frac{1}{a} & -\frac{a+b}{a^2} \end{pmatrix} = I$$

Using this identity with $a = \frac{1}{\rho}$ and $b = y^T H y$, we get

$$(C^{-1} + V A^{-1} U)^{-1} = \begin{pmatrix} 0 & \rho \\ \rho & -\rho - \rho^2 y^T H y \end{pmatrix}$$

Substituting this back into (4) together with (5) (still omitting the subscript $k$ for simplicity),

$$\begin{aligned} H_{k+1} &= H - H \begin{pmatrix} y & Bs \end{pmatrix} \begin{pmatrix} 0 & \rho \\ \rho & -\rho - \rho^2 y^T H y \end{pmatrix} \begin{pmatrix} y^T \\ s^T B \end{pmatrix} H \\ &= H - \rho s y^T H - \rho H y s^T + \rho^2 (y^T H y) s s^T + \rho s s^T \\ &= (I - \rho s y^T) H (I - \rho y s^T) + \rho s s^T \end{aligned}$$

Adding back the subscript $k$, we finally get the BFGS formula

$$H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k} \tag{6}$$
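
Here is the corresponding sketch of (6), again just my own illustration: it updates $H$ directly and checks the inverse form of the secant equation, $H_{k+1} y_k = s_k$.

```julia
using LinearAlgebra

# BFGS inverse-Hessian update (6):
# H_{k+1} = (I - ρ s y') H (I - ρ y s') + ρ s s',  with ρ = 1 / (y' s)
function update_H(H, s, y)
    ρ = 1 / dot(y, s)
    V = I - ρ * y * s'
    return V' * H * V + ρ * s * s'
end

# the updated matrix satisfies the "inverse" secant equation H_{k+1} y = s
n = 5
H = Matrix(1.0I, n, n)
s = randn(n)
y = randn(n)
y = sign(dot(y, s)) * y
H1 = update_H(H, s, y)
@assert isapprox(H1 * y, s; atol=1e-8)
```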

Initial $H_0$

A simple choice of $H_0$ is the identity matrix $I$. A similar strategy is to introduce a scalar $\beta$ and let $H_0 = \beta I$.

There are many other approaches. For instance, we can also use another optimization method to "warm up" (do a few iterations in order to get a better approximation of $\nabla^2 f$), then switch to BFGS.
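
For example, one scaling that is commonly suggested in the literature (see e.g. Nocedal and Wright 2006) picks $\beta$ from the most recent curvature pair; a minimal sketch, with a hypothetical helper name of my own:

```julia
using LinearAlgebra

# One commonly used scaling: set H_0 = β I with β = (s' y) / (y' y),
# computed from the most recent pair (s, y).
initial_scaling(s, y) = dot(s, y) / dot(y, y)

# usage: H_0 acts as β * I, so applying it to a vector q is just β .* q
```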

Convergence

I am going to discuss the convergence theory of Newton and quasi-Newton methods in another article. Here I only state the convergence result.

Theorem. Suppose that $f$ is twice continuously differentiable and that the iterates generated by the BFGS algorithm converge to a minimizer $x^*$ at which the Hessian matrix $G$ is Lipschitz continuous in a neighborhood of $x^*$ with some positive constant $L$,

$$\lVert G(x) - G(x^*) \rVert \le L \lVert x - x^* \rVert$$

Suppose also that the sequence satisfies

$$\sum_{k=1}^{\infty} \lVert x_k - x^* \rVert < \infty$$

Then $x_k$ converges to $x^*$ at a superlinear rate.

For L-BFGS and stochastic L-BFGS, some convergence discussion can be found in Mokhtari and Ribeiro (2015).

3. Determine the step length $\alpha_k$

With the BFGS update formula, we have solved the task of finding the direction $p_k$. Now we turn to determining how far we should go in this direction.

Wolfe condition

We introduce a helper function

$$\phi(\alpha) = f(x_k + \alpha p_k), \qquad \alpha > 0$$

The minimizer of $\phi(\alpha)$ is what we need. However, solving this univariate minimization problem accurately could be too expensive. An inexact solution is acceptable as long as $\phi(\alpha) = f(x_k + \alpha p_k) < f_k$.

However, a simple reduction in the function value is sometimes not enough. The picture below shows an example of this situation. We need a notion of sufficient decrease to rule it out.

Figure 1: Insufficient reduction (Figure 3.2 from Nocedal and Wright 2006).

The following Wolfe condition formalizes this notion of sufficient decrease:

$$\begin{aligned} f(x_k + \alpha_k p_k) &\le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k \\ \nabla f(x_k + \alpha_k p_k)^T p_k &\ge c_2 \nabla f_k^T p_k \end{aligned} \tag{7}$$

where $0 < c_1 < c_2 < 1$.

The right-hand side of the first inequality is a line $l(\alpha)$ starting from $f(x_k)$ whose slope $c_1 \nabla f_k^T p_k$ is a fraction of the initial slope of $\phi(\alpha)$. The intuition behind the first inequality is that the function value at step $\alpha_k$ should lie below this line $l(\alpha)$. This condition is usually called the Armijo condition, or the sufficient decrease condition.

The intuition behind the second inequality uses more information about $f$'s curvature. Since $p_k$ is a descent direction, we have $\nabla f_k^T p_k < 0$. If the slope of $\phi(\alpha)$ at step $\alpha_k$ is still steeper than $c_2$ times the slope at the starting point, i.e. $c_2 \nabla f_k^T p_k$, then we can safely go further to reach an even lower objective value. Therefore, an acceptable step should stop at a point with a more gentle slope. This second condition is usually referred to as the curvature condition.

Figure 2: The curvature condition (Figure 3.4 from Nocedal and Wright 2006).

There is a chance that the step $\alpha_k$ goes so far that the slope becomes positive. To prevent this case, we can use the strong Wolfe condition

$$\begin{aligned} f(x_k + \alpha_k p_k) &\le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k \\ \lvert \nabla f(x_k + \alpha_k p_k)^T p_k \rvert &\le c_2 \lvert \nabla f_k^T p_k \rvert \end{aligned} \tag{8}$$

where $0 < c_1 < c_2 < 1$.
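
To make the two conditions concrete, here is a small sketch (the function name and default constants are my own) that checks a trial step against (7) and (8), given $\phi(\alpha) = f(x_k + \alpha p_k)$ and its derivative $\phi'(\alpha) = \nabla f(x_k + \alpha p_k)^T p_k$:

```julia
# Check the Wolfe conditions (7) and the strong Wolfe conditions (8) for a
# trial step α.  ϕ(α) = f(x_k + α p_k), dϕ(α) = ∇f(x_k + α p_k)' p_k.
function check_wolfe(ϕ, dϕ, α; c1=1e-4, c2=0.9)
    armijo    = ϕ(α) <= ϕ(0) + c1 * α * dϕ(0)   # sufficient decrease
    curvature = dϕ(α) >= c2 * dϕ(0)             # curvature condition in (7)
    strong    = abs(dϕ(α)) <= c2 * abs(dϕ(0))   # strong curvature in (8)
    return (wolfe = armijo && curvature, strong_wolfe = armijo && strong)
end
```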

Existence and convergence

The next question is whether such $\alpha_k$ exist, and if they do, how to find them.

The following lemma and theorem (Nocedal and Wright 2006) guarantee the existence of step lengths satisfying the Wolfe condition, and show superlinear convergence under certain conditions.

Lemma. Suppose that $f$ is continuously differentiable. Let $p_k$ be a descent direction at $x_k$, and assume that $f$ is bounded below along $\{x_k + \alpha p_k \mid \alpha > 0\}$. Then for $0 < c_1 < c_2 < 1$, there exist intervals of step lengths satisfying the Wolfe condition (7) and the strong Wolfe condition (8).

Theorem. Suppose that $f$ is twice continuously differentiable. Consider the iteration $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions (7) with $c_1 \le 1/2$. If the sequence $\{x_k\}$ converges to a point $x^*$ such that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite, and if the search direction satisfies

$$\lim_{k \to \infty} \frac{\lVert \nabla f_k + \nabla^2 f_k p_k \rVert}{\lVert p_k \rVert} = 0$$

then

  1. the step length $\alpha_k = 1$ is admissible for all $k$ greater than a certain index $k_0$; and
  2. if $\alpha_k = 1$ for all $k > k_0$, then $\{x_k\}$ converges to $x^*$ superlinearly.

Line search algorithm

We use a line search algorithm to locate a valid $\alpha$. The idea is to generate a monotonically increasing sequence $\{\alpha_i\}$ in $(0, \alpha_{\max})$. If $\alpha_i$ satisfies the Wolfe condition, return that step; otherwise narrow the search interval.

We show the algorithm in Julia pseudo-code below. Here `choose()` stands for any heuristic that picks a trial step inside a given interval (interpolation, or simply bisection as written), and Flux is used only for the automatic differentiation of $\phi$.

```julia
using Flux            # only for gradient(); any AD package would do

# choose() stands for any heuristic picking a trial step strictly inside
# (lo, hi); here we simply bisect.
choose(lo, hi) = (lo + hi) / 2

function line_search(ϕ, α_max=1.0; c1=1e-4, c2=0.9)
    ϕ′(x) = gradient(ϕ, x)[1]
    ϕ0, dϕ0 = ϕ(0.0), ϕ′(0.0)
    α_prev, α = 0.0, choose(0.0, α_max)
    i = 1
    while true
        y = ϕ(α)
        if y > ϕ0 + c1 * α * dϕ0 || (i > 1 && y >= ϕ(α_prev))
            return zoom(ϕ, ϕ′, α_prev, α, c1, c2)
        end
        dy = ϕ′(α)
        if abs(dy) <= -c2 * dϕ0
            return α                    # strong Wolfe condition holds
        end
        if dy >= 0
            return zoom(ϕ, ϕ′, α, α_prev, c1, c2)
        end
        α_prev, α = α, choose(α, α_max)
        i += 1
    end
end

function zoom(ϕ, ϕ′, α_lo, α_hi, c1, c2)
    ϕ0, dϕ0 = ϕ(0.0), ϕ′(0.0)
    while true
        # use quadratic or cubic interpolation, or bisection, to pick a trial α
        α = choose(α_lo, α_hi)
        y = ϕ(α)
        if y > ϕ0 + c1 * α * dϕ0 || y >= ϕ(α_lo)
            α_hi = α
        else
            dy = ϕ′(α)
            if abs(dy) <= -c2 * dϕ0
                return α
            end
            if dy * (α_hi - α_lo) >= 0
                α_hi = α_lo
            end
            α_lo = α
        end
    end
end
```

Notice that in the function `zoom()`, the order of the arguments $\alpha_{i-1}$ and $\alpha_i$ can swap: `α_lo` always corresponds to the endpoint with the smallest function value found so far.
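
A quick usage sketch of `line_search` on a toy one-dimensional $\phi$ (a quadratic along the search direction, so the numbers are easy to follow):

```julia
# toy example: f(x) = x^2 with x_k = 1 and p_k = -1, so ϕ(α) = (1 - α)^2
ϕ(α) = (1 - α)^2
α = line_search(ϕ)   # a step length satisfying the (strong) Wolfe condition
```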

4. Limited-memory scenario

We have discussed how to use the BFGS algorithm to find a descent direction $p_k$ and how to use a line search to choose a step length $\alpha_k$ that satisfies the Wolfe condition. Now let's talk about some practical computational issues.

The BFGS algorithm needs to store and update the inverse Hessian approximation $H_k$ at each iteration. This requires $O(n^2)$ memory, which is not feasible at all, since the parameter count $n$ of many modern deep learning models is in the millions or hundreds of millions.

Limited-memory BFGS, or L-BFGS, is a variation of BFGS that addresses this issue. Instead of doing the matrix multiplication directly, L-BFGS uses the $m$ most recent vector pairs $\{s_i, y_i\}$, $i = k-1, \dots, k-m$, to reconstruct the product $H_k \nabla f_k$. This reduces the memory cost from $O(n^2)$ to $O(mn)$.

L-BFGS: vanilla iteration

Recall (1) and BFGS (6), and for simplicity, let

$$V_k = I - \rho_k y_k s_k^T$$

The BFGS formula becomes

$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T \tag{9}$$

We can use a vanilla iterative way to calculate $H_k \nabla f_k$ directly (let $q = \nabla f_k$):

  1. $V_k q = q - \rho_k y_k s_k^T q$: calculate the scalar $\alpha = \rho_k s_k^T q$ first in $n$ multiplications, then calculate $q - \alpha y_k$ in another $n$ multiplications.
  2. Suppose $H_k^0$ is a diagonal matrix, so we can calculate $H_k^0 (V_k q)$ in $n$ multiplications.
  3. Multiplying by $V_k^T$ is analogous, which needs $2n$ multiplications as well.
  4. Finally, computing $\rho_k s_k s_k^T q$ and adding the results together needs another $2n$ multiplications.

So this vanilla iteration requires about $7n$ multiplications per stored pair, leading to a total cost of roughly $7nm$ multiplications.
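
The vanilla scheme can be written as a straightforward recursion over (9). The sketch below is my own illustration, where `s`, `y`, `ρ` hold the stored pairs (indexed by iteration number) and `H0` stands for the diagonal initial matrix $H_k^0$.

```julia
using LinearAlgebra

# Vanilla evaluation of H_k * q by unrolling (9):
#   H_{i+1} q = V_i' ( H_i (V_i q) ) + ρ_i s_i (s_i' q),   V_i = I - ρ_i y_i s_i'
function apply_H_vanilla(H0, q, s, y, ρ, k, m)
    function recurse(i, q)
        i < k - m && return H0 * q                # bottom out at H_k^0
        Vq = q - ρ[i] * y[i] * dot(s[i], q)       # V_i q
        w  = recurse(i - 1, Vq)                   # H_i (V_i q)
        w  = w - ρ[i] * s[i] * dot(y[i], w)       # apply V_i' = I - ρ_i s_i y_i'
        return w + ρ[i] * s[i] * dot(s[i], q)     # add ρ_i s_i s_i' q
    end
    return recurse(k - 1, q)
end
```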

Can we do better?

L-BFGS: two-loop recursion

L-BFGS has a two-loop recursion algorithm, which is quite brilliant.

```julia
function L_BFGS(H0, ∇f_k, s, y, ρ, k, m)
    # s, y, ρ hold the m most recent pairs, indexed by iteration number
    # (e.g. Dicts keyed by i = k-m, …, k-1); H0 is the initial matrix H_k^0.
    q = ∇f_k
    α = Dict{Int, Float64}()
    # first loop: newest pair to oldest, pushing q through V_{k-1}, …, V_{k-m}
    for i = k - 1 : -1 : k - m
        α[i] = ρ[i] * transpose(s[i]) * q
        q = q - α[i] * y[i]
    end
    w = H0 * q               # apply the initial (usually diagonal) H_k^0
    # second loop: oldest pair to newest, accumulating the rank-one terms
    for i = k - m : k - 1
        β = ρ[i] * transpose(y[i]) * w
        w = w + s[i] * (α[i] - β)
    end
    return w                 # w = H_k * ∇f_k
end
```

The return value $w = H_k \nabla f_k$ is what we need, and it takes only about $5nm$ multiplications.

The data manipulation in the two-loop recursion is not that easy to understand at first glance. To make it clearer, we expand (9) for $m$ steps to get

$$\begin{aligned} H_k ={}& (V_{k-1}^T \cdots V_{k-m}^T) H_k^0 (V_{k-m} \cdots V_{k-1}) \\ &+ \rho_{k-m} (V_{k-1}^T \cdots V_{k-m+1}^T) s_{k-m} s_{k-m}^T (V_{k-m+1} \cdots V_{k-1}) \\ &+ \rho_{k-m+1} (V_{k-1}^T \cdots V_{k-m+2}^T) s_{k-m+1} s_{k-m+1}^T (V_{k-m+2} \cdots V_{k-1}) \\ &+ \cdots \\ &+ \rho_{k-2} V_{k-1}^T s_{k-2} s_{k-2}^T V_{k-1} \\ &+ \rho_{k-1} s_{k-1} s_{k-1}^T \end{aligned}$$

Some key observations to help you understand the two-loop recursion:

  • In the first loop, $\alpha_i = \rho_i s_i^T V_{i+1} \cdots V_{k-1} \nabla f_k$.
  • After the first loop, $q = V_{k-m} \cdots V_{k-1} \nabla f_k$, so after `w = H0 * q` we have $w = H_k^0 V_{k-m} \cdots V_{k-1} \nabla f_k$.
  • In the second loop, after processing index $i$, the vector $w$ equals
$$\begin{aligned} w ={}& (V_i^T \cdots V_{k-m}^T) H_k^0 (V_{k-m} \cdots V_{k-1} \nabla f_k) \\ &+ \rho_{k-m} (V_i^T \cdots V_{k-m+1}^T) s_{k-m} s_{k-m}^T (V_{k-m+1} \cdots V_{k-1} \nabla f_k) \\ &+ \cdots \\ &+ \rho_i s_i s_i^T V_{i+1} \cdots V_{k-1} \nabla f_k \end{aligned}$$
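
Putting the pieces together, a minimal driver loop might look like the sketch below. It combines the two-loop recursion above with the Wolfe line search from Section 3; the names, the defaults, and the use of Flux for gradients are my own assumptions rather than a reference implementation.

```julia
using Flux, LinearAlgebra

# Minimal L-BFGS driver: f is the objective, x0 the starting point, m the
# number of stored (s, y) pairs.  Relies on L_BFGS() and line_search() above.
function lbfgs(f, x0; m=10, iters=100, tol=1e-6)
    ∇f(x) = gradient(f, x)[1]
    x = copy(x0)
    s = Dict{Int,Vector{Float64}}()
    y = Dict{Int,Vector{Float64}}()
    ρ = Dict{Int,Float64}()
    for k in 0:iters-1
        g = ∇f(x)
        norm(g) < tol && break
        mk = min(m, k)                              # pairs stored so far
        p = k == 0 ? -g : -L_BFGS(I, g, s, y, ρ, k, mk)
        α = line_search(a -> f(x + a * p))          # Wolfe step along p
        x_new = x + α * p
        s[k] = x_new - x
        y[k] = ∇f(x_new) - g
        ρ[k] = 1 / dot(y[k], s[k])
        delete!(s, k - m); delete!(y, k - m); delete!(ρ, k - m)  # keep m pairs
        x = x_new
    end
    return x
end
```

In practice one would also guard against $y_k^T s_k \le 0$ (for example by skipping that update), which the sketch omits for brevity.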

5. Summary

In this article, we have talked about the basic idea of quasi-Newton methods. We carefully derived the secant equation and the BFGS formula, and showed how to find a descent direction with them. We also discussed how to use the Wolfe condition and a line search to find a feasible step length. Finally, we demonstrated how the two-loop recursion of L-BFGS gives a fast and memory-efficient iteration.

In most of the materials I've found about L-BFGS, many non-trivial details are omitted. I've tried to make this post as self-contained and as clear as possible. Maybe it will help a newcomer or two of this topic overcome some of the obscure steps.


Fletcher, Roger, and Michael JD Powell. 1963. “A Rapidly Convergent Descent Method for Minimization.” The Computer Journal 6 (2): 163–68.
Mokhtari, Aryan, and Alejandro Ribeiro. 2015. “Global Convergence of Online Limited Memory BFGS.” The Journal of Machine Learning Research 16 (1): 3151–81.
Nocedal, Jorge, and Stephen Wright. 2006. Numerical Optimization. Springer Science & Business Media.

