Notes on L-BFGS and the Wolfe conditions
Recently, while working on a problem, I wanted to try a second-order method such as L-BFGS instead of the usual SGD or Adam.
In this post, I have summarised my notes into an intuitive yet self-contained introduction that requires only a minimal calculus background, such as Taylor's theorem. We will discuss the derivation, the algorithm, convergence results, and a few practical computational issues of L-BFGS.
1. Introduction
Suppose we want to minimize a smooth function $f: \mathbb{R}^n \to \mathbb{R}$ with an iterative method $x_{k+1} = x_k + \alpha_k p_k$ to approximate a minimizer $x^*$. At each iteration $k$ we need to:

- find a descent direction $p_k$;
- determine how far we move in that direction, i.e. the step length $\alpha_k$.
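The iteration above can be sketched as follows; the quadratic objective, the plain negative-gradient direction, and the fixed step length are my own toy choices, standing in for the smarter $p_k$ and $\alpha_k$ the rest of the post develops.

```python
import numpy as np

# Minimize f(x) = 0.5 x^T A x - b^T x with the generic iteration
# x_{k+1} = x_k + alpha_k * p_k. Here p_k is the plain negative
# gradient and alpha_k is a fixed small step, just to illustrate
# the loop structure.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

x = np.zeros(2)
for k in range(200):
    p = -grad(x)      # descent direction
    alpha = 0.1       # step length (fixed here)
    x = x + alpha * p

x_star = np.linalg.solve(A, b)   # exact minimizer of the quadratic
print(np.allclose(x, x_star, atol=1e-6))  # True
```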
2. Finding a descent direction
We use the notation $\nabla f_k = \nabla f(x_k)$ and $\nabla^2 f_k = \nabla^2 f(x_k)$ throughout. For $p_k$ to be a descent direction we need $p_k^\top \nabla f_k < 0$, so that $f$ decreases along $p_k$ for small enough steps.
Quasi-Newton and secant equation
Apply Taylor's theorem to $\nabla f$ around $x_k$:

$$\nabla f(x) = \nabla f_k + \nabla^2 f_k \, (x - x_k) + o(\lVert x - x_k \rVert).$$

Ignoring the high-order term, we get the approximation

$$\nabla f(x) \approx \nabla f_k + \nabla^2 f_k \, (x - x_k).$$

Setting the right-hand side to zero and solving for $x$, Newton's method uses this form to update

$$x_{k+1} = x_k - \left(\nabla^2 f_k\right)^{-1} \nabla f_k.$$

However, the calculation of the Hessian matrix's inverse is usually expensive, and sometimes we cannot even guarantee that $\nabla^2 f_k$ is positive definite.

The idea of the quasi-Newton method is to replace the Hessian matrix with another positive definite symmetric matrix $B_k$ that is cheap to update. From the Newton method formula, we have

$$x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f_k. \tag{1}$$

Let $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$. The next matrix $B_{k+1}$ should mimic the Taylor approximation of $\nabla f$ over the step just taken, which requires

$$B_{k+1} s_k = y_k. \tag{2}$$

This equation is called the secant equation.
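To see why the secant equation is the natural requirement, here is a concrete sketch on a quadratic test problem of my own choosing: for $f(x) = \tfrac12 x^\top A x - b^\top x$, the gradient difference $y_k$ equals $A s_k$ exactly, so the secant equation asks $B_{k+1}$ to act like the true Hessian along the step.

```python
import numpy as np

# For a quadratic f(x) = 0.5 x^T A x - b^T x, grad f(x) = A x - b,
# so y_k = grad_{k+1} - grad_k = A (x_{k+1} - x_k) = A s_k exactly.
# The secant equation B_{k+1} s_k = y_k therefore asks B_{k+1} to
# reproduce the true Hessian's action on s_k.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b

x_k  = np.array([0.0, 0.0])
x_k1 = np.array([0.5, -0.2])
s = x_k1 - x_k
y = grad(x_k1) - grad(x_k)

print(np.allclose(y, A @ s))  # True: y_k = A s_k for quadratics
```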
There is another approach to get the secant equation. Let

$$m_{k+1}(p) = f_{k+1} + \nabla f_{k+1}^\top p + \tfrac{1}{2}\, p^\top B_{k+1}\, p$$

be the second-order approximation to $f$ around $x_{k+1}$. When we iterate from $x_k$ to $x_{k+1}$, it is natural to require the gradient of $m_{k+1}$ to match the gradient of $f$ at both points. At $x_{k+1}$ (i.e. $p = 0$) this holds automatically; at $x_k$ (i.e. $p = -s_k$) it reads $\nabla f_{k+1} - B_{k+1} s_k = \nabla f_k$, which is again the secant equation $B_{k+1} s_k = y_k$.
Rank-two update

So the next question becomes how to iterate from $B_k$ to a matrix $B_{k+1}$ satisfying the secant equation. The classical answer is a rank-two correction of $B_k$ built from the directions $s_k$ and $y_k$.

In Nocedal and Wright (2006), the authors provide another interpretation. They view $B_{k+1}$ as the solution of the least-change problem

$$\min_{B} \lVert B - B_k \rVert_W \quad \text{subject to} \quad B = B^\top, \; B s_k = y_k,$$

where $\lVert A \rVert_W = \lVert W^{1/2} A W^{1/2} \rVert_F$ is a weighted Frobenius norm, and where $W$ is any symmetric positive definite matrix satisfying $W y_k = s_k$. The solution is the DFP formula

$$B_{k+1} = \left(I - \rho_k y_k s_k^\top\right) B_k \left(I - \rho_k s_k y_k^\top\right) + \rho_k y_k y_k^\top, \qquad \rho_k = \frac{1}{y_k^\top s_k}. \tag{3}$$
BFGS

The DFP formula maintains the symmetry and positive definiteness of $B_k$. The BFGS formula applies the same least-change reasoning with the roles of $s_k$ and $y_k$ interchanged; written for $B_k$, the update is

$$B_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k}. \tag{4}$$

Many materials say that just applying the Sherman–Morrison formula to (4) gives an update equation for $H_{k+1} = B_{k+1}^{-1}$, but they rarely spell out the computation, so let us do it here. The Woodbury matrix identity (the generalization of Sherman–Morrison) says

$$(A + UCV)^{-1} = A^{-1} - A^{-1} U \left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}.$$

Let (ignoring the subscript $k$ for the moment, and writing $H = B^{-1}$, $\rho = 1/(y^\top s)$)

$$U = \begin{bmatrix} B s & y \end{bmatrix}, \qquad C = \begin{bmatrix} -\dfrac{1}{s^\top B s} & 0 \\ 0 & \rho \end{bmatrix}, \qquad V = U^\top.$$

So (4) can be written compactly as $B_{+} = B + U C V$. We can easily check that

$$C^{-1} + V B^{-1} U = \begin{bmatrix} 0 & y^\top s \\ y^\top s & y^\top s + y^\top H y \end{bmatrix}, \tag{5}$$

using $V B^{-1} U = \begin{bmatrix} s^\top B s & s^\top y \\ y^\top s & y^\top H y \end{bmatrix}$. Using this property, we can get

$$\left(C^{-1} + V B^{-1} U\right)^{-1} = \begin{bmatrix} -\rho^2 \left(y^\top s + y^\top H y\right) & \rho \\ \rho & 0 \end{bmatrix}.$$

Substitute this back into the Woodbury identity applied to (4) with (5) (still ignoring the subscript $k$), and note $B^{-1} U = \begin{bmatrix} s & H y \end{bmatrix}$:

$$H_{+} = H + \rho^2 \left(y^\top s + y^\top H y\right) s s^\top - \rho \left(s y^\top H + H y s^\top\right).$$

Add back the subscript $k$ and factor; this is the BFGS update for the inverse Hessian approximation:

$$H_{k+1} = \left(I - \rho_k s_k y_k^\top\right) H_k \left(I - \rho_k y_k s_k^\top\right) + \rho_k s_k s_k^\top, \qquad \rho_k = \frac{1}{y_k^\top s_k}. \tag{6}$$
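As a sanity check on the derivation, here is a small numeric sketch (test matrices are mine): the inverse-form BFGS update (6) should be exactly the inverse of the direct update (4), and the direct update should satisfy the secant equation.

```python
import numpy as np

# Check numerically that the BFGS inverse update is the inverse of
# the direct update, and that the secant equation holds.
rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)            # symmetric positive definite B_k
s = rng.standard_normal(n)
y = B @ s + 0.1 * rng.standard_normal(n)
assert y @ s > 0                       # curvature condition y^T s > 0

# Direct BFGS update of B_k
Bs = B @ s
B_next = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# BFGS update of H_k = B_k^{-1}
H = np.linalg.inv(B)
rho = 1.0 / (y @ s)
V = np.eye(n) - rho * np.outer(y, s)
H_next = V.T @ H @ V + rho * np.outer(s, s)

print(np.allclose(H_next, np.linalg.inv(B_next)))  # True
print(np.allclose(B_next @ s, y))                  # True: secant equation
```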
Initial $H_0$

A simple choice of $H_0$ is the identity matrix $I$, or a scaled identity $\beta I$. A popular refinement is the scaling $H_0 = \frac{s_0^\top y_0}{y_0^\top y_0} I$, which tries to match the size of the true inverse Hessian along the most recent step.
There are many other approaches.
For instance, we can also use another optimization method to "warm up": run a few iterations of it first in order to get a better approximation of the curvature before switching to quasi-Newton updates.
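Here is a small sketch of why the scaling heuristic is sensible (the test matrix is mine): for a quadratic with Hessian $A$, the scalar $\frac{s^\top y}{y^\top y}$ always lies within the spectrum of $A^{-1}$.

```python
import numpy as np

# The scaling heuristic gamma = (s^T y)/(y^T y) matches H_0 to the
# curvature seen along the step. For a quadratic with Hessian A we
# have y = A s, so gamma = (s^T A s)/(s^T A^2 s), which lies between
# 1/lambda_max and 1/lambda_min, i.e. inside the spectrum of A^{-1}.
A = np.diag([1.0, 4.0, 10.0])
s = np.array([1.0, 1.0, 1.0])
y = A @ s                     # for quadratics, y = A s

gamma = (s @ y) / (y @ y)
print(1 / 10.0 <= gamma <= 1 / 1.0)   # True
```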
Convergence
I’m going to discuss the convergence theory of Newton and quasi-Newton methods in another article later. Therefore, here I only state the convergence theorem.
Theorem

Suppose that $f$ is twice continuously differentiable and that the iterates generated by the BFGS algorithm converge to a minimizer $x^*$ at which the Hessian $\nabla^2 f$ is Lipschitz continuous. Suppose the sequence also satisfies

$$\sum_{k=1}^{\infty} \lVert x_k - x^* \rVert < \infty.$$

Then $x_k$ converges to $x^*$ at a superlinear rate.
For L-BFGS and stochastic L-BFGS, some convergence results can be found in Mokhtari and Ribeiro (2015).
3. Determining the step length
With the BFGS updating formula, we have solved the task of finding the direction $p_k = -H_k \nabla f_k$. The remaining task is to determine how far to move along it, i.e. to choose the step length $\alpha_k$.
Wolfe conditions
We introduce a helper function

$$\phi(\alpha) = f(x_k + \alpha p_k), \qquad \alpha > 0.$$

The minimizer of $\phi$ would be the ideal step length, but computing it exactly is usually too expensive to be worthwhile; an inexact step that merely reduces the function value is much cheaper. However, a simple function value reduction may not be enough sometimes. The picture below shows an example of this situation, where ever-smaller decreases fail to approach a minimizer. We need a kind of sufficient decrease to avoid this.
![Insufficient reduction[^fig3.2]](insufficient-decrease.png)
Figure 1: Insufficient reduction[^fig3.2]
The following Wolfe conditions are the formalization of this sufficient decrease:

$$\begin{aligned} f(x_k + \alpha p_k) &\le f(x_k) + c_1 \alpha \nabla f_k^\top p_k, \\ \nabla f(x_k + \alpha p_k)^\top p_k &\ge c_2 \nabla f_k^\top p_k, \end{aligned} \tag{7}$$

where $0 < c_1 < c_2 < 1$; common choices are $c_1 = 10^{-4}$ and $c_2 = 0.9$.

The right-hand side of the first inequality is a line $l(\alpha) = f(x_k) + c_1 \alpha \nabla f_k^\top p_k$ with a slightly negative slope. The condition demands that $\phi(\alpha)$ lie below this line, i.e. that the decrease be at least proportional to both the step length and the directional derivative $\nabla f_k^\top p_k$.

The intuition of the second inequality uses more information of $\phi$, namely its slope $\phi'(\alpha) = \nabla f(x_k + \alpha p_k)^\top p_k$: if the slope at $\alpha$ is still strongly negative, we could decrease $f$ further by moving on, so the step is too short. The condition therefore requires the slope at $\alpha$ to have risen to at least $c_2$ times the initial slope $\phi'(0)$.
![The curvature condition[^fig3.4]](wolfe-condition.png)
Figure 2: The curvature condition[^fig3.4]
There is a chance that a step satisfies (7) and yet lands where $\phi'(\alpha)$ is strongly positive, i.e. far past a minimizer of $\phi$. The strong Wolfe conditions exclude such points by bounding the magnitude of the slope:

$$\begin{aligned} f(x_k + \alpha p_k) &\le f(x_k) + c_1 \alpha \nabla f_k^\top p_k, \\ \left| \nabla f(x_k + \alpha p_k)^\top p_k \right| &\le c_2 \left| \nabla f_k^\top p_k \right|, \end{aligned} \tag{8}$$

where $0 < c_1 < c_2 < 1$ as before.
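A minimal sketch of checking the strong Wolfe conditions numerically; the quadratic test function and the candidate steps are my own illustration.

```python
import numpy as np

# Check the strong Wolfe conditions (sufficient decrease + curvature)
# for a candidate step alpha along direction p.
def strong_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    fx, gx = f(x), grad(x)
    x_new = x + alpha * p
    armijo = f(x_new) <= fx + c1 * alpha * (gx @ p)
    curvature = abs(grad(x_new) @ p) <= c2 * abs(gx @ p)
    return armijo and curvature

A = np.array([[2.0, 0.0], [0.0, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
p = -grad(x)                                   # steepest descent direction
print(strong_wolfe(f, grad, x, p, alpha=0.5))  # True: a reasonable step
print(strong_wolfe(f, grad, x, p, alpha=1e-8)) # False: step far too short
```

The tiny step fails the curvature condition: the slope at $\alpha = 10^{-8}$ is essentially unchanged from $\phi'(0)$, signalling that more progress is available.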
Existence and convergence

The next question is whether step lengths satisfying these conditions always exist.
The following lemma and theorem (Nocedal and Wright 2006) guarantee the existence of such step lengths and describe the resulting convergence behaviour.
Lemma

Suppose that $f$ is continuously differentiable, that $p_k$ is a descent direction at $x_k$, and that $f$ is bounded below along the ray $\{x_k + \alpha p_k : \alpha > 0\}$. Then if $0 < c_1 < c_2 < 1$, there exist intervals of step lengths satisfying the Wolfe conditions (7) and the strong Wolfe conditions (8).
Theorem

Suppose that $f$ is twice continuously differentiable, that the iterates are generated by $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions with $c_1 \le 1/2$, and that $\{x_k\}$ converges to a point $x^*$ at which $\nabla^2 f(x^*)$ is positive definite. If the search directions satisfy

$$\lim_{k \to \infty} \frac{\lVert \nabla f_k + \nabla^2 f_k \, p_k \rVert}{\lVert p_k \rVert} = 0,$$

then

- the step length $\alpha_k = 1$ is admissible for all $k$ greater than a certain index $k_0$;
- if $\alpha_k = 1$ for all $k > k_0$, $\{x_k\}$ converges to $x^*$ superlinearly.
Line search algorithm

We use a line search algorithm to locate a valid step length $\alpha$ satisfying the strong Wolfe conditions (8).
We show the algorithm in Julia pseudo-code below; choose() stands for any rule that picks a trial point strictly inside the given interval.
```julia
using Flux  # provides gradient(); any derivative oracle for ϕ works

function linear_search(ϕ, α_0=0, α_max=1, c1=0.0001, c2=0.9)
    α = Dict(0 => α_0)
    α[1] = choose(α_0, α_max)      # pick a trial point in (α_0, α_max)
    ϕ′(x) = gradient(ϕ, x)[1]
    i = 1
    while true
        y = ϕ(α[i])
        # sufficient decrease violated, or no progress: a valid α is bracketed
        if y > ϕ(0) + c1 * α[i] * ϕ′(0) || (i > 1 && y >= ϕ(α[i - 1]))
            return zoom(ϕ, ϕ′, α[i - 1], α[i], c1, c2)
        end
        dy = ϕ′(α[i])
        if abs(dy) <= -c2 * ϕ′(0)  # strong Wolfe curvature condition holds
            return α[i]
        end
        if dy >= 0                 # slope turned positive: minimizer bracketed
            return zoom(ϕ, ϕ′, α[i], α[i - 1], c1, c2)
        end
        α[i + 1] = choose(α[i], α_max)
        i = i + 1
    end
end

function zoom(ϕ, ϕ′, α_lo, α_hi, c1=0.0001, c2=0.9)
    while true
        # use quadratic, cubic, or bisection interpolation to pick a trial α
        α = choose(α_lo, α_hi)
        y = ϕ(α)
        if y > ϕ(0) + c1 * α * ϕ′(0) || y >= ϕ(α_lo)
            α_hi = α
        else
            dy = ϕ′(α)
            if abs(dy) <= -c2 * ϕ′(0)
                return α
            end
            if dy * (α_hi - α_lo) >= 0
                α_hi = α_lo
            end
            α_lo = α
        end
    end
end
```
Notice that in the function zoom(), the order of the two endpoints matters: α_lo is not necessarily smaller than α_hi. The invariant is that α_lo always has the lower function value of the two, and the interval between α_lo and α_hi always contains step lengths satisfying the strong Wolfe conditions.
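Since the Julia version above is pseudo-code (choose() is left abstract), here is a runnable Python sketch of the same bracketing-plus-zoom scheme, using plain bisection as the choose() rule; the test function is my own.

```python
# Bracketing line search with a bisection-based zoom, returning a step
# that satisfies the strong Wolfe conditions on the test problem below.
def line_search(phi, dphi, alpha_max=1.0, c1=1e-4, c2=0.9, max_iter=50):
    phi0, dphi0 = phi(0.0), dphi(0.0)

    def zoom(lo, hi):
        for _ in range(max_iter):
            a = 0.5 * (lo + hi)          # bisection as the simplest choose()
            if phi(a) > phi0 + c1 * a * dphi0 or phi(a) >= phi(lo):
                hi = a
            else:
                d = dphi(a)
                if abs(d) <= -c2 * dphi0:
                    return a
                if d * (hi - lo) >= 0:
                    hi = lo
                lo = a
        return a

    prev, a = 0.0, 0.5 * alpha_max
    for i in range(1, max_iter):
        if phi(a) > phi0 + c1 * a * dphi0 or (i > 1 and phi(a) >= phi(prev)):
            return zoom(prev, a)
        d = dphi(a)
        if abs(d) <= -c2 * dphi0:
            return a
        if d >= 0:
            return zoom(a, prev)
        prev, a = a, 0.5 * (a + alpha_max)
    return a

# phi(a) = f(x + a p) for f(x) = 5 x^2, x = 1, p = -grad f(x) = -10.
phi = lambda a: 5.0 * (1 - 10 * a) ** 2
dphi = lambda a: -100.0 * (1 - 10 * a)
a = line_search(phi, dphi)
ok = phi(a) <= phi(0) + 1e-4 * a * dphi(0) and abs(dphi(a)) <= 0.9 * abs(dphi(0))
print(ok)  # True: the returned step satisfies the strong Wolfe conditions
```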
4. Limited-memory scenario
We have discussed how to use the BFGS algorithm to find a descent direction $p_k = -H_k \nabla f_k$.
The BFGS algorithm needs to store and update the inverse Hessian approximation $H_k$ at each iteration.
This requires $O(n^2)$ memory, which is prohibitive when $n$ is large.
Limited-memory BFGS, or L-BFGS, is a variation of BFGS addressing this issue.
Instead of storing $H_k$ and doing the matrix multiplication directly,
L-BFGS uses only the $m$ most recent pairs $(s_i, y_i)$ to compute the product $H_k \nabla f_k$ implicitly, which requires only $O(mn)$ memory.
L-BFGS: vanilla iteration

Recall (1) and the BFGS update (6), and for simplicity, let

$$\rho_k = \frac{1}{y_k^\top s_k}, \qquad V_k = I - \rho_k y_k s_k^\top.$$

The BFGS formula becomes

$$H_{k+1} = V_k^\top H_k V_k + \rho_k s_k s_k^\top. \tag{9}$$
We can use a vanilla iterative way to calculate the product $H_{k+1} q$ for a vector $q$ (e.g. $q = \nabla f_{k+1}$) directly from (9), unrolling the recursion over the $m$ stored pairs:

- To apply $V_k$ to a vector $q$: calculate $s_k^\top q$ first in $n$ multiplications to get a scalar, then calculate $q - \rho_k (s_k^\top q)\, y_k$ in another $n$ multiplications.
- Suppose $H_k^0$, the matrix at the bottom of the recursion, is a diagonal matrix, so we can calculate its product with a vector in $n$ multiplications.
- Multiplying by $V_k^\top$ is analogous, which needs $2n$ multiplications as well.
- Finally, forming $\rho_k (s_k^\top q)\, s_k$ and adding the results together needs another $O(n)$ multiplications per level.

So this vanilla iteration requires $O(mn)$ multiplications, but it is recursive and needs bookkeeping for the intermediate vectors at every level.
Can we do better?
L-BFGS: two-loop recursion
L-BFGS has a two-loop recursion algorithm, which is quite brilliant.
```julia
function L_BFGS(H0, ∇f_k, s, y, ρ, k, m)
    q = ∇f_k
    α = Dict{Int, Any}()
    for i = k - 1 : -1 : k - m
        α[i] = ρ[i] * transpose(s[i]) * q
        q = q - α[i] * y[i]        # q ← V_i q
    end
    w = H0 * q
    for i = k - m : k - 1
        β = ρ[i] * transpose(y[i]) * w
        w = w + s[i] * (α[i] - β)  # w ← V_iᵀ w + α_i s_i
    end
    return w                       # w = H_k ∇f_k
end
```
The return value $w$ is the product $H_k \nabla f_k$, so the search direction is $p_k = -w$.
The data manipulation in the two-loop recursion is not that easy to understand at first glimpse.
To make it more clear, we expand (9):

$$H_k = \left(V_{k-1}^\top \cdots V_{k-m}^\top\right) H_k^0 \left(V_{k-m} \cdots V_{k-1}\right) + \rho_{k-m} \left(V_{k-1}^\top \cdots V_{k-m+1}^\top\right) s_{k-m} s_{k-m}^\top \left(V_{k-m+1} \cdots V_{k-1}\right) + \cdots + \rho_{k-1} s_{k-1} s_{k-1}^\top.$$

Some key observations to help you understand the two-loop recursion:

- In the first loop, the update $q \leftarrow q - \alpha_i y_i$ equals $V_i q$, so after the iteration with index $i$ we have $q = V_i V_{i+1} \cdots V_{k-1} \nabla f_k$, and the saved scalar is $\alpha_i = \rho_i s_i^\top V_{i+1} \cdots V_{k-1} \nabla f_k$.
- After the first loop and w = H0 * q, we have $w = H_k^0 V_{k-m} \cdots V_{k-1} \nabla f_k$, the innermost part of the first term in the expansion.
- In the second loop, the update $w \leftarrow w + s_i (\alpha_i - \beta)$ equals $V_i^\top w + \alpha_i s_i$; after the iteration with index $i$, the vector will be

$$w = \left(V_i^\top \cdots V_{k-m}^\top\right) H_k^0 \left(V_{k-m} \cdots V_{k-1}\right) \nabla f_k + \sum_{j=k-m}^{i} \rho_j \left(V_i^\top \cdots V_{j+1}^\top\right) s_j s_j^\top \left(V_{j+1} \cdots V_{k-1}\right) \nabla f_k.$$

When $i = k - 1$, this is exactly $H_k \nabla f_k$ from the expansion of (9).
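To convince yourself that the two-loop recursion computes the same product as the dense updates, here is a Python sketch (the test data and the list-based bookkeeping are mine; the Julia pseudo-code above uses index dictionaries instead).

```python
import numpy as np

# Two-loop recursion computing H_k @ g from the m stored (s_i, y_i)
# pairs, checked against an explicitly maintained dense H_k.
# s_list and y_list are ordered oldest (k-m) to newest (k-1).
def two_loop(g, s_list, y_list, H0_diag):
    q = g.copy()
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alpha = []
    for s, y, r in reversed(list(zip(s_list, y_list, rho))):  # newest first
        a = r * (s @ q)
        alpha.append(a)
        q = q - a * y                       # q <- V_i q
    w = H0_diag * q                          # H_k^0 assumed diagonal
    for (s, y, r), a in zip(zip(s_list, y_list, rho), reversed(alpha)):
        beta = r * (y @ w)
        w = w + s * (a - beta)               # w <- V_i^T w + alpha_i s_i
    return w

rng = np.random.default_rng(1)
n, m = 5, 3
s_list = [rng.standard_normal(n) for _ in range(m)]
# elementwise-positive multiple of s guarantees y^T s > 0
y_list = [s * (1.0 + 0.5 * np.abs(rng.standard_normal(n))) for s in s_list]
g = rng.standard_normal(n)
H0_diag = np.ones(n)

# Reference: apply the dense update m times starting from H0 = I.
H, I = np.diag(H0_diag), np.eye(n)
for s, y in zip(s_list, y_list):
    r = 1.0 / (y @ s)
    V = I - r * np.outer(y, s)
    H = V.T @ H @ V + r * np.outer(s, s)

print(np.allclose(two_loop(g, s_list, y_list, H0_diag), H @ g))  # True
```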
5. Summary
In this article, we have talked about the basic idea of quasi-Newton methods. We carefully derived the DFP and BFGS formulas and showed how to find a descent direction with them. We also discussed how to use the Wolfe conditions and a line search to find a feasible step length. And at last, we demonstrated how the two-loop recursion of L-BFGS yields a fast and memory-efficient iteration.
In most of the materials I’ve found about L-BFGS, many non-trivial details are omitted. I’ve tried to make this post as self-contained and as clear as possible. Maybe it will help a newcomer or two to this topic overcome some of the obscure steps.
[^fig3.2]: Figure 3.2 from Nocedal and Wright (2006).
[^fig3.4]: Figure 3.4 from Nocedal and Wright (2006).