Notes on L-BFGS and Wolfe condition
Recently, while working on a problem, I wanted to try a second-order method such as L-BFGS instead of the usual SGD or Adam.
In this post, I’ve summarised my notes into an intuitive yet self-contained introduction, which only requires a minimal calculus background, such as Taylor’s theorem. We will discuss the derivation, the algorithm, convergence results, and a few practical computational issues of L-BFGS.
1. Introduction
Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is a smooth function whose minimizer $x^*$ we want to find. Line search methods generate a sequence of iterates
$$x_{k+1} = x_k + \alpha_k p_k$$
to approximate $x^*$. At each iteration $k$, we need to:

- find a descent direction $p_k$;
- determine how far we move in that direction, i.e. the step length $\alpha_k$.
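To make the scheme concrete, here is a minimal Julia sketch of the generic iteration. The toy objective, the steepest-descent direction, and the fixed step length are placeholder assumptions; the rest of this post is about replacing them with a BFGS direction and a Wolfe line search.

```julia
using Flux  # provides gradient()

# A toy smooth objective (an arbitrary choice: the Rosenbrock function).
f(x) = (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2

function minimize(f, x; iters = 1000, α = 1e-3)
    for k in 1:iters
        g = gradient(f, x)[1]   # ∇f(x_k)
        p = -g                  # descent direction: steepest descent for now
        x = x + α * p           # x_{k+1} = x_k + α_k p_k, with a fixed α_k
    end
    return x
end

minimize(f, [-1.0, 1.0])
```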
2. Find descent direction $p_k$
We use a search direction of the form
$$p_k = -H_k \nabla f_k, \tag{1}$$
where $\nabla f_k = \nabla f(x_k)$ and $H_k$ is some symmetric positive definite matrix.
For any such $H_k$, the direction $p_k$ is indeed a descent direction, since $\nabla f_k^T p_k = -\nabla f_k^T H_k \nabla f_k < 0$ whenever $\nabla f_k \neq 0$.
Quasi-Newton and secant equation
Apply Taylor’s theorem to $\nabla f$ around $x_k$:
$$\nabla f(x_{k+1}) = \nabla f(x_k) + \nabla^2 f(x_k)\,(x_{k+1} - x_k) + o\big(\lVert x_{k+1} - x_k \rVert\big).$$
Ignoring the high-order term, we get the approximation
$$\nabla f(x_{k+1}) - \nabla f(x_k) \approx \nabla^2 f(x_k)\,(x_{k+1} - x_k).$$
Newton’s method uses the right hand side to update $x_k$: setting the gradient of the local quadratic model to zero gives
$$x_{k+1} = x_k - \big(\nabla^2 f(x_k)\big)^{-1} \nabla f(x_k),$$
i.e. $H_k = \big(\nabla^2 f_k\big)^{-1}$ in (1), with $\alpha_k = 1$.
However, the calculation of the Hessian matrix’s inverse is usually expensive, and sometimes we cannot even guarantee that $\nabla^2 f_k$ is positive definite.
The idea of the quasi-Newton method is to replace the Hessian matrix with another positive definite symmetric matrix $B_k$ that is cheap to update.
From the Newton method formula, $B_{k+1}$ should carry the same curvature information as the Hessian. Let
$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f_{k+1} - \nabla f_k.$$
The approximation above then says the next matrix $B_{k+1}$ should satisfy
$$B_{k+1}\, s_k = y_k. \tag{2}$$
This equation is called the secant equation.
There is another approach to get the secant equation. Let
$$m_{k+1}(p) = f_{k+1} + \nabla f_{k+1}^T p + \frac{1}{2} p^T B_{k+1} p$$
be the second-order approximation to $f$ around $x_{k+1}$.
When we iterate from $x_k$ to $x_{k+1} = x_k + \alpha_k p_k$, it is natural to require the gradient of $m_{k+1}$ to match the gradient of $f$ at both $x_{k+1}$ and $x_k$. The first requirement holds automatically, since $\nabla m_{k+1}(0) = \nabla f_{k+1}$. The second requirement gives
$$\nabla m_{k+1}(-\alpha_k p_k) = \nabla f_{k+1} - \alpha_k B_{k+1} p_k = \nabla f_k,$$
which rearranges to $B_{k+1} \alpha_k p_k = \nabla f_{k+1} - \nabla f_k$, i.e. exactly the secant equation (2), since $s_k = \alpha_k p_k$.
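As a quick sanity check of (2), here is a tiny Julia sketch; the quadratic, the matrix, and the two points are arbitrary assumptions of mine. For $f(x) = \frac{1}{2} x^T A x$ the Hessian is constant, so $A$ itself satisfies the secant equation exactly for any pair of points:

```julia
using LinearAlgebra

A = [4.0 1.0; 1.0 3.0]       # symmetric positive definite Hessian of f
x0, x1 = [1.0, 2.0], [0.5, -1.0]

s = x1 - x0                  # s_k
y = A * x1 - A * x0          # y_k = ∇f(x₁) - ∇f(x₀), since ∇f(x) = A * x
@assert A * s ≈ y            # the secant equation B s = y holds with B = A
```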
Rank-two update
So the next question becomes how to iterate $B_k$, i.e. how to build a $B_{k+1}$ that satisfies the secant equation (2). A classical ansatz is a symmetric rank-two update $B_{k+1} = B_k + a\,uu^T + b\,vv^T$, with the vectors and scalars chosen so that (2) holds.
In Nocedal and Wright (2006), the authors provide another interpretation.
They view $B_{k+1}$ as the solution of the problem
$$\min_B \ \lVert B - B_k \rVert_W \quad \text{subject to} \quad B = B^T, \quad B s_k = y_k,$$
where $\lVert A \rVert_W = \lVert W^{1/2} A W^{1/2} \rVert_F$ is a weighted Frobenius norm,
where $W$ is any matrix satisfying $W y_k = s_k$, for concreteness the inverse of the average Hessian along the step. With this choice, the unique solution is the DFP formula
$$B_{k+1} = \left(I - \rho_k y_k s_k^T\right) B_k \left(I - \rho_k s_k y_k^T\right) + \rho_k y_k y_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}. \tag{3}$$
BFGS
The DFP formula maintains the desired properties of $B_k$: the update stays symmetric, stays positive definite (provided $y_k^T s_k > 0$), and satisfies the secant equation. The BFGS formula comes from the same variational argument with the roles of $B_k$ and its inverse (equivalently, of $s_k$ and $y_k$) interchanged; written in terms of $B_k$, it is the rank-two update
$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. \tag{4}$$
However, what we actually use in (1) is the inverse $H_k = B_k^{-1}$.
Many materials say that just applying the Sherman–Morrison formula to (4) yields an update equation for $H_k$, but they omit the details, so let us work them out here.
The Woodbury matrix identity, the generalization of Sherman–Morrison, says
$$\left(A + UCV\right)^{-1} = A^{-1} - A^{-1} U \left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}. \tag{5}$$
Let (ignoring the subscript $k$ for the moment, and writing $H = B^{-1}$)
$$U = \begin{bmatrix} y & Bs \end{bmatrix}, \qquad C = \begin{bmatrix} \frac{1}{y^T s} & 0 \\ 0 & -\frac{1}{s^T B s} \end{bmatrix}, \qquad V = U^T.$$
So (4) becomes exactly $B_{\text{new}} = B + UCV$, and (5) applies with $A = B$.
We can easily check that
$$V H U = \begin{bmatrix} y^T H y & y^T s \\ y^T s & s^T B s \end{bmatrix}, \qquad C^{-1} = \begin{bmatrix} y^T s & 0 \\ 0 & -s^T B s \end{bmatrix}, \qquad H U = \begin{bmatrix} Hy & s \end{bmatrix}.$$
Using this property, we can get
$$\left(C^{-1} + V H U\right)^{-1} = \begin{bmatrix} y^T s + y^T H y & y^T s \\ y^T s & 0 \end{bmatrix}^{-1} = \begin{bmatrix} 0 & \frac{1}{y^T s} \\ \frac{1}{y^T s} & -\frac{y^T s + y^T H y}{\left(y^T s\right)^2} \end{bmatrix}.$$
Substituting this back into (5) together with (4) (still ignoring the subscript $k$) and expanding the products, we obtain
$$H_{\text{new}} = H - \frac{s y^T H + H y s^T}{y^T s} + \frac{y^T s + y^T H y}{\left(y^T s\right)^2}\, s s^T = \left(I - \frac{s y^T}{y^T s}\right) H \left(I - \frac{y s^T}{y^T s}\right) + \frac{s s^T}{y^T s}.$$
Adding back the subscript $k$ and writing $\rho_k = \frac{1}{y_k^T s_k}$, we get the BFGS update
$$H_{k+1} = \left(I - \rho_k s_k y_k^T\right) H_k \left(I - \rho_k y_k s_k^T\right) + \rho_k s_k s_k^T. \tag{6}$$
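To double-check the derivation numerically, here is a small Julia sketch; the random test data is purely an assumption for the check. It builds $B_{k+1}$ from (4) and $H_{k+1}$ from (6), then verifies that they are inverses of each other and that $H_{k+1}$ satisfies the secant equation in its inverse form $H_{k+1} y_k = s_k$:

```julia
using LinearAlgebra

n = 5
M = randn(n, n)
B = M * M' + n * I              # a random symmetric positive definite B_k
H = inv(B)
s = randn(n)
y = randn(n)
y = y * sign(dot(y, s))         # ensure the curvature condition yᵀs > 0

ρ = 1 / dot(y, s)
B_new = B - (B * s) * (s' * B) / dot(s, B * s) + y * y' / dot(y, s)  # update (4)
H_new = (I - ρ * s * y') * H * (I - ρ * y * s') + ρ * s * s'         # update (6)

@assert H_new * B_new ≈ I(n)    # (6) really is the inverse of (4)
@assert H_new * y ≈ s           # secant equation, inverse form
```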
Initial $H_0$
A simple choice of $H_0$ is the identity matrix $I$, or a scaled identity $\beta I$; a common heuristic is to rescale it to $\frac{y_0^T s_0}{y_0^T y_0} I$ after the first step, so that its magnitude roughly matches the true inverse Hessian.
There are many other approaches.
For instance, we can also use another optimization method to “warm up”: run a few of its iterations first in order to get a better approximation of the inverse Hessian at the starting point.
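As a concrete sketch of the simple choice, here is the standard scaling heuristic from Nocedal and Wright (2006), applied once the first pair $(s_0, y_0)$ is available:

```julia
using LinearAlgebra

# Scaled identity: γ = (sᵀy) / (yᵀy) estimates the size of the inverse
# Hessian along the most recent step direction.
initial_H(s, y) = (dot(s, y) / dot(y, y)) * Matrix{Float64}(I, length(s), length(s))
```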
Convergence
I’m going to discuss the convergence theory of Newton and quasi-Newton methods in another article later. Therefore, here I only state the convergence theorem without proof.
Theorem
Suppose that $f$ is twice continuously differentiable, and that the iterates generated by the BFGS algorithm converge to a minimizer $x^*$ at which the Hessian matrix is Lipschitz continuous.
Suppose the sequence also satisfies
$$\sum_{k=1}^{\infty} \lVert x_k - x^* \rVert < \infty.$$
Then $x_k$ converges to $x^*$ at a superlinear rate.
For L-BFGS and stochastic L-BFGS, some convergence discussion can be found in Mokhtari and Ribeiro (2015).
3. Determine step length $\alpha_k$
With the BFGS updating formula, we have solved the task of how to find the direction $p_k$. The remaining task is to determine how far we should go along it, i.e. the step length $\alpha_k$.
Wolfe condition
We introduce a helper function
$$\phi(\alpha) = f(x_k + \alpha p_k), \qquad \alpha > 0.$$
The minimizer of $\phi$ would be the ideal step length, but locating it exactly requires too many evaluations of $f$ and $\nabla f$; in practice, we settle for an $\alpha_k$ that merely reduces the function value.
However, a simple function value reduction may not be enough sometimes. The picture below shows an example of this situation: every step decreases $f$, yet the decreases shrink so fast that the iterates never approach the minimizer. We need a kind of sufficient decrease to avoid this.
Figure 1: Insufficient reduction1
The following Wolfe condition is the formalization of this sufficient decrease:
$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k, \tag{7}$$
$$\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f_k^T p_k, \tag{8}$$
where $0 < c_1 < c_2 < 1$; typical values are $c_1 = 10^{-4}$ and $c_2 = 0.9$.
The right hand side of the first inequality is a line in $\alpha$ with the slightly negative slope $c_1 \nabla f_k^T p_k$; condition (7) requires the achieved decrease to be at least proportional to both the step length and the directional derivative $\nabla f_k^T p_k$.
The intuition of the second inequality uses more information, namely the slope $\phi'(\alpha) = \nabla f(x_k + \alpha p_k)^T p_k$: if the slope at $\alpha_k$ is still strongly negative, we could make further progress by moving farther, so the curvature condition (8) rules out steps that are too short.
Figure 2: The curvature condition2
There is a chance that a step satisfies (7) and (8) but is still far from a stationary point of $\phi$, for example when $\phi'(\alpha_k)$ is large and positive. The strong Wolfe condition excludes this case by replacing (8) with
$$\left| \nabla f(x_k + \alpha_k p_k)^T p_k \right| \le c_2 \left| \nabla f_k^T p_k \right|,$$
where, as before, $0 < c_1 < c_2 < 1$.
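The two conditions translate directly into code. Below is a small Julia sketch of a checker; the function name and the use of Flux’s gradient are my own choices for illustration, not part of any particular library:

```julia
using Flux           # provides gradient()
using LinearAlgebra  # provides dot()

# Check the Wolfe conditions (7)-(8), or their strong variant, for a step α.
function wolfe_satisfied(f, x, p, α; c1 = 1e-4, c2 = 0.9, strong = true)
    g0 = gradient(f, x)[1]           # ∇f(x_k)
    gα = gradient(f, x + α * p)[1]   # ∇f(x_k + α p_k)
    d0 = dot(g0, p)                  # φ′(0); negative for a descent direction
    dα = dot(gα, p)                  # φ′(α)
    sufficient_decrease = f(x + α * p) <= f(x) + c1 * α * d0       # (7)
    curvature = strong ? abs(dα) <= c2 * abs(d0) : dα >= c2 * d0   # (8)
    return sufficient_decrease && curvature
end
```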
Existence and convergence
The next question is: do such $\alpha_k$ always exist, and what do they buy us?
The following lemma and theorem (Nocedal and Wright 2006) guarantee the existence of step lengths satisfying the (strong) Wolfe condition and describe the convergence behaviour we get from using them.
Lemma
Suppose that $f$ is continuously differentiable, $p_k$ is a descent direction at $x_k$, and $f$ is bounded below along the ray $\{x_k + \alpha p_k \mid \alpha > 0\}$. If $0 < c_1 < c_2 < 1$, then there exist intervals of step lengths satisfying the Wolfe condition (7)-(8), and also the strong Wolfe condition.
Theorem
Suppose that $f$ is twice continuously differentiable, and consider the iteration $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe condition with $c_1 \le 1/2$. If the sequence converges to a point $x^*$ such that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite, and if the search directions satisfy
$$\lim_{k \to \infty} \frac{\lVert \nabla f_k + \nabla^2 f_k\, p_k \rVert}{\lVert p_k \rVert} = 0,$$
then

- the step length $\alpha_k = 1$ is admissible for all $k$ greater than a certain index $k_0$;
- if $\alpha_k = 1$ for all $k > k_0$, then $x_k$ converges to $x^*$ superlinearly.
Line search algorithm
We use a line search algorithm to locate a valid $\alpha_k$ satisfying the strong Wolfe condition.
We show the algorithm in Julia below; the helper choose(), which picks a trial point inside an interval, is kept deliberately simple here (bisection), while practical implementations use quadratic or cubic interpolation.
```julia
using Flux  # provides gradient()

# Pick a trial step strictly inside (lo, hi). Bisection is the simplest
# choice; quadratic or cubic interpolation converges faster in practice.
choose(lo, hi) = (lo + hi) / 2

function line_search(ϕ, α_0 = 0.0, α_max = 1.0, c1 = 1e-4, c2 = 0.9)
    ϕ′(x) = gradient(ϕ, x)[1]
    α = Dict(0 => α_0)
    α[1] = choose(α_0, α_max)
    i = 1
    while true
        y = ϕ(α[i])
        # Sufficient decrease failed, or ϕ stopped decreasing:
        # a valid step is bracketed between α[i - 1] and α[i].
        if y > ϕ(0.0) + c1 * α[i] * ϕ′(0.0) || (i > 1 && y >= ϕ(α[i - 1]))
            return zoom(ϕ, ϕ′, α[i - 1], α[i], c1, c2)
        end
        dy = ϕ′(α[i])
        if abs(dy) <= -c2 * ϕ′(0.0)   # strong Wolfe condition holds
            return α[i]
        end
        if dy >= 0                    # slope turned non-negative: bracket found
            return zoom(ϕ, ϕ′, α[i], α[i - 1], c1, c2)
        end
        α[i + 1] = choose(α[i], α_max)
        i = i + 1
    end
end

function zoom(ϕ, ϕ′, α_lo, α_hi, c1, c2)
    while true
        # Use quadratic, cubic, or bisection to find a trial step length α.
        α = choose(α_lo, α_hi)
        y = ϕ(α)
        if y > ϕ(0.0) + c1 * α * ϕ′(0.0) || y >= ϕ(α_lo)
            α_hi = α
        else
            dy = ϕ′(α)
            if abs(dy) <= -c2 * ϕ′(0.0)
                return α
            end
            if dy * (α_hi - α_lo) >= 0
                α_hi = α_lo
            end
            α_lo = α
        end
    end
end
```
Notice that in function zoom(), the order of the two arguments matters: α_lo and α_hi are not sorted by value, and α_hi may well be smaller than α_lo. The invariants are that the interval between them always contains step lengths satisfying the strong Wolfe condition, and that α_lo is always the endpoint with the lower function value among those satisfying the sufficient decrease condition.
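As a quick usage sketch (the quadratic ϕ below, with its minimizer at α = 0.3, is a toy assumption of mine):

```julia
ϕ(α) = (α - 0.3)^2          # a toy one-dimensional model function
α = line_search(ϕ)
@assert 0 < α <= 1          # α satisfies the strong Wolfe condition for ϕ
println("step length: ", α)
```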
4. Limited memory scenario
We have discussed how to use the BFGS algorithm to find a descent direction $p_k$, and how to determine the step length $\alpha_k$ with a line search.
The BFGS algorithm needs to store and update the inverse Hessian approximation $H_k$ at each iteration.
This requires $O(n^2)$ memory and $O(n^2)$ work per iteration, which is prohibitive when $n$ is large.
Limited-memory BFGS, or L-BFGS, is a variation of BFGS addressing this issue.
Instead of storing $H_k$ and doing the matrix multiplication directly,
L-BFGS uses only the $m$ most recent pairs $(s_i, y_i)$, where $m$ is small (say, 5 to 20), to compute the product $H_k \nabla f_k$ implicitly, cutting the memory cost to $O(mn)$.
L-BFGS: vanilla iteration
Recall (1) and BFGS (6), and for simplicity, let
$$V_k = I - \rho_k y_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}.$$
The BFGS formula becomes
$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T. \tag{9}$$
Starting from some cheap initial matrix $H_k^0$ and applying (9) with the $m$ stored pairs gives $H_k$; note that what we actually need in (1) is not $H_k$ itself but the product $H_k \nabla f_k$.
We can use a vanilla iterative way to calculate $H_k \nabla f_k$, applying the recursion (9) from the inside out so that only matrix-vector products appear (a code sketch of the elementary operation follows this list):

- To multiply a vector $z$ by $V_i = I - \rho_i y_i s_i^T$: calculate $s_i^T z$ first in $n$ multiplications to get a scalar, then calculate $z - \left(\rho_i s_i^T z\right) y_i$ in another $n + 1$ multiplications.
- Suppose $H_k^0$ is a diagonal matrix, so we can calculate $H_k^0 z$ in $n$ multiplications.
- Multiplying by $V_i^T$ is analogous, which needs about $2n$ multiplications as well.
- Finally, forming each rank-one term $\rho_i s_i \left(s_i^T z\right)$ and adding the results together needs roughly another $2n$ multiplications per stored pair.

So this vanilla iteration requires $O(mn)$ multiplications in total; implemented naively, it also keeps the $m$ intermediate vectors around, costing $O(mn)$ extra memory (or recomputes them at an even higher cost).
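For instance, the multiplication by $V_i$ from the first bullet takes just one inner product and one scaled vector subtraction; a sketch:

```julia
using LinearAlgebra

# z ↦ V * z = (I - ρ * y * s') * z in O(n) multiplications:
# one inner product sᵀz, then one scaled vector subtraction.
apply_V(z, ρ, y, s) = z - (ρ * dot(s, z)) * y
```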
Can we do better?
L-BFGS: two-loop recursion
L-BFGS has a two-loop recursion algorithm, which is quite brilliant.
```julia
using LinearAlgebra

# Two-loop recursion: computes w = H_k * ∇f_k from the initial matrix H0 and
# the m most recent pairs s[i], y[i] with ρ[i] = 1 / (y[i]ᵀ s[i]), stored for
# i = k - m, ..., k - 1 (e.g. in Dicts keyed by the iteration index).
function L_BFGS(H0, ∇f_k, s, y, ρ, k, m)
    q = ∇f_k
    α = Dict{Int, Float64}()
    for i = k - 1 : -1 : k - m           # newest pair first
        α[i] = ρ[i] * dot(s[i], q)
        q = q - α[i] * y[i]              # q ← V_i * q
    end
    w = H0 * q
    for i = k - m : k - 1                # oldest pair first
        β = ρ[i] * dot(y[i], w)
        w = w + s[i] * (α[i] - β)        # w ← V_iᵀ * w + α_i * s_i
    end
    return w
end
```
The return value $w$ is exactly $H_k \nabla f_k$, so the search direction is simply $p_k = -w$.
The data manipulation in the two-loop recursion is not that easy to understand at first glimpse.
To make it more clear, we expand (9) recursively $m$ times, starting from the initial matrix $H_k^0$:
$$
\begin{aligned}
H_k ={}& \left(V_{k-1}^T \cdots V_{k-m}^T\right) H_k^0 \left(V_{k-m} \cdots V_{k-1}\right) \\
&+ \rho_{k-m} \left(V_{k-1}^T \cdots V_{k-m+1}^T\right) s_{k-m} s_{k-m}^T \left(V_{k-m+1} \cdots V_{k-1}\right) \\
&+ \cdots \\
&+ \rho_{k-1} s_{k-1} s_{k-1}^T.
\end{aligned}
$$
Some key observations to help you understand the two-loop recursion (a numerical check against this expansion follows the list):

- In the first loop, after the iteration for index $i$, the vector $q$ equals $V_i V_{i+1} \cdots V_{k-1} \nabla f_k$, and the stored scalar is $\alpha_i = \rho_i s_i^T \left(V_{i+1} \cdots V_{k-1} \nabla f_k\right)$, exactly the factor that multiplies $s_i$’s term in the expansion above.
- After the first loop and w = H0 * q, we have $w = H_k^0 \left(V_{k-m} \cdots V_{k-1}\right) \nabla f_k$, the innermost part of the first term.
- In the second loop, after the iteration for index $i$, the vector will be $w \leftarrow V_i^T w + \alpha_i s_i$; unrolling this from $i = k - m$ up to $k - 1$ applies the remaining transposed factors and adds the rank-one terms, reproducing the whole expansion applied to $\nabla f_k$.
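To convince ourselves, here is a small Julia check; the quadratic test data below is an assumption, chosen so that the curvature condition $y_i^T s_i > 0$ holds automatically. It maintains the dense $H_k$ via (9) and compares $H_k \nabla f_k$ with the output of the two-loop recursion above:

```julia
using LinearAlgebra

n, m, k = 6, 4, 4                       # m = k, so all stored pairs are used
M = randn(n, n)
A = M * M' + n * I                      # SPD Hessian of a quadratic test problem
s = Dict(i => randn(n) for i in 0:k-1)
y = Dict(i => A * s[i] for i in 0:k-1)  # y_i = A s_i, hence y_iᵀ s_i > 0
ρ = Dict(i => 1 / dot(y[i], s[i]) for i in 0:k-1)

# Reference: maintain the dense H explicitly via the recursion (9).
function dense_H(H0, s, y, ρ, k)
    H = copy(H0)
    for i in 0:k-1
        V = I - ρ[i] * y[i] * s[i]'
        H = V' * H * V + ρ[i] * s[i] * s[i]'
    end
    return H
end

H0 = Matrix{Float64}(I, n, n)           # initial matrix H_k⁰
∇f = randn(n)
@assert dense_H(H0, s, y, ρ, k) * ∇f ≈ L_BFGS(H0, ∇f, s, y, ρ, k, m)
```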
5. Summary
In this article, we have talked about the basic idea of quasi-Newton method. We carefully derive the DFP and BFGS formula, and show how to find descent direction with them. We have also discussed how to use Wolfe condition and linear search to find a feasible step length. And at last, we demonstrate how to use two-loop recursion L-BFGS to apply a fast and memory-efficient iteration.
In most of the materials I’ve found about L-BFGS, many non-trivial details are omitted. I’ve tried to make this post as self-contained and as clear as possible. Maybe it will help one or two newcomers of this topic to overcome some obscure steps.
Figure 3.2 from Nocedal and Wright (2006).↩︎
Figure 3.4 from Nocedal and Wright (2006).↩︎