# Notes on L-BFGS and Wolfe condition

Recently while I was working on a problem, instead of regular SGD or Adam, I wanted to try some second-order methods, like L-BFGS.

In this post, I’ve summarised my notes to give an intuitive yet self-contained introduction, which only requires a minimum calculus background, like Taylor Theorem. We will discuss the derivation, algorithm, convergence result, along with a few other practical computation issues of L-BFGS.

# 1. Introduction

Suppose \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\), the objective function we want to minimized, is twice continuously differentiable, with its minimum at \(x^{\ast}\). Given an initial point \(x_0\), a typical optimization problem is to use an iterative method

\[ x_{k + 1} = x_{k} + \alpha_k p_k, \quad k = 0, 1, \dots, \]

to approximate \(x^{\ast}\), where \(p_k\) is the *search direction*, and \(\alpha_k\) is the *step length*. So we have two tasks:

- find a descent direction \(p_k\);
- determine how far we move in that direction, i.e. step length \(\alpha_k\).

# 2. Find descent direction \(p_k\)

We use \(\nabla f\) to denote the gradient of \(f\), and use \(\nabla f_k\) as a short form of \(\nabla f(x_k)\). Similarly, we use \(\nabla^{2}f\) to denote the Hessian matrix of \(f\), and \(\nabla^{2}f_k\) is the Hessian matrix’s value at \(x_k\).

For \(x^{\ast}\), we have \(\nabla f(x^{\ast}) = 0\), and \(\nabla^{2}f(x^{\ast})\) is a positive definite symmetric matrix.

## Quasi-Newton and secant equation

Apply Taylor theorem to \(\nabla f\) at \(x^{\ast}\) give us

\[\begin{equation} \nabla f(x^{\ast}) = \nabla f_k + \nabla^{2}f_k \cdot (x^{\ast} - x_k) + o(\| x^{\ast} - x_k \|) \end{equation}\]

Ignoring high-order term, we get approximate

\[\begin{eqnarray} \nabla f_k + \nabla^{2}f_k \cdot (x^{\ast} - x_k) & \approx & \nabla f(x^{\ast}) = 0 \\ x^{\ast} & \approx & x_k - (\nabla^{2}f_k)^{-1}\nabla{f_k} \end{eqnarray}\]

Newton method uses the right hand side form to update \(x_k\),

\[ x_{k+1} = x_k - (\nabla^{2}f_k)^{-1}\nabla f_k \]

However, the calculation of Hessian matrix’s inverse is usually expensive, and sometimes we cannot guarantee \(\nabla^{2}f_k\) is positive definite.

The idea of quasi-Newton method is to replace Hessian matrix with another positive definite symmetric matrix \(B_k\), and refine it in every iteration with a much smaller computation cost.

From Newton method formula, we have

\[\begin{equation} \nabla^{2}f_k \cdot (x_{k + 1} - x_k) \approx \nabla f_{k+1} - \nabla f_{k} \end{equation}\]

Let \[\begin{eqnarray} s_k & = & x_{k+1} - x_k \\ y_k & = & \nabla f_{k+1} - \nabla f_{k} \tag{1} \end{eqnarray}\]

The next matrix \(B_{k+1}\) we are looking for should satisfy the above approximate,

\[\begin{equation} B_{k+1} s_k = y_k \tag{2} \end{equation}\]

This equation is called *secant equation*.

There is another approach to get the secant equation. Let

\[ m_k(p) = f_k + \nabla f_{k}^{T}p + \frac{1}{2}p^{T}B_k{p} \]

be the second-order approximate to \(f\) at \(x_k\). It is a function of \(p\) with gradient

\[ m_k^{\prime}(p) = \nabla f_k + B_k p \]

When we iterate from \(k\) to \(k + 1\), it’s easy to check that \(m_{k+1}(0)= f_{k+1}\) and \(m_{k+1}^{\prime}(0) = \nabla f_{k+1}\). This means \(m_{k+1}(p)\) equals \(f\) with both function value and gradient at \(x_{k+1}\). If we want this \(m_{k+1}(p)\) to approximate \(f\) even better, we may ask its gradient to be equal with \(f\) at last step \(x_k\) as well. At this point, \(x_k = x_{k+1} - \alpha_k p_k\), So \(p = -\alpha_k p_k = -s_k\), and we have

\[\begin{eqnarray} m_{k+1}^{\prime}(-s_k) = \nabla f_{k+1} - B_{k+1}s_k & = & \nabla f_k \\ B_{k+1} s_k & = & \nabla f_{k+1} - \nabla f_k \\ B_{k+1} s_k & = & y_k \end{eqnarray}\]

## Rank-two update

So the next question becomes how to iterate \(B_k\). The Davidon-Fletcher-Powell (DFP) formula (Fletcher and Powell 1963) gives a rank-two matrix update approach.

\[\begin{equation} B_{k+1} = B_k - \frac{B_k s_k s_k^{T} B_k}{s_k^{T} B_k s_k} + \frac{y_k y_k^{T}}{y_k^{T}s_k} \tag{3} \end{equation}\]

In Nocedal and Wright (2006), the authors provides another interpretation. They view \(B_{k+1}\) as the solution of the following optimization problem.

\[\begin{eqnarray} \min_{B} \|B - B_k\|_{W} \\ s.t. B = B^{T}, \quad B s_k = y_k \end{eqnarray}\]

where \(\|A\|_{W}\) norm is defined as

\[ \|A\|_{W} = \| W^{1/2}A W^{1/2} \|_{F} \]

where \(\|\cdot\|_{F}\) is Frobenius norm and \(W\) is any matrix satisfying \(W y_k = s_k\). This is a more intuitive way, however, I haven’t finish the detailed proof of this conclusion. I would revise this part later.

## BFGS

The DFP formula maintains properties of \(B_k\) like symmetric and positive definite. But we still have to calculate its inverse to get \(p_k = -B_k^{-1}\nabla f_k\). To avoid the inverse matrix calculation, we need to approximate matrix \(H_k = B_k^{-1}\) directly.

Many materials say just applying Sherman–Morrison formula and we can get a iterate equation for \(B_k^{-1}\). However, this matrix deduction is not trivial. After a little calculation, I think applying Woodbury matrix identity would be a more straightforward way.

Woodbury matrix identity says

\[\begin{equation} (A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1} \tag{4} \end{equation}\]

Let \(\rho = \frac{1}{y_k^{T}s_{k}}\) and drop subscript \(k\) to rewrite DFP as

\[\begin{eqnarray} B_{k+1} & = & B - \frac{Bss^{T}B}{s^{T}Bs} + \frac{y y^{T}}{y^{T}s} \\ & = & B + (y \quad Bs)\left( \begin{array}{cc} \rho & 0 \\ 0 & -\frac{1}{s^{T}Bs} \end{array} \right) \left( \begin{array}{c} y^{T} \\ s^{T}B \end{array} \right) \\ & = & B + UCV \tag{5} \end{eqnarray}\]

So

\[\begin{eqnarray} C^{-1} + VA^{-1}U & = & \left( \begin{array}{cc} \frac{1}{\rho} & 0 \\ 0 & -s^{T}Bs \end{array} \right) + \left( \begin{array}{c} y^{T} \\ s^{T}B \end{array} \right) H \left(y,\quad Bs\right) \\ & = & \left( \begin{array}{cc} 1/\rho + y^{T}Hy & 1/\rho \\ 1/\rho & 0 \end{array} \right) \end{eqnarray}\]

We can easily check that

\[\begin{equation} \left( \begin{array}{cc} a + b & a \\ a & 0 \end{array} \right) \left( \begin{array}{cc} 0 & \frac{1}{a} \\ \frac{1}{a} & -\frac{a+b}{a^2} \end{array} \right) = I \end{equation}\]

Using this property, we can get

\[\begin{eqnarray} (C^{-1} + VA^{-1}U)^{-1} & = & \left( \begin{array}{cc} 0 & \rho \\ \rho & -\rho - \rho y^{T}Hy \end{array} \right) \end{eqnarray}\]

Substitute this back to (4) with (5) (still ignore subscript \(k\) for simplicity)

\[\begin{eqnarray} H_{k+1} & = & H - H (y \quad Bs)\left( \begin{array}{cc} 0 & \rho \\ \rho & -\rho - \rho y^{T}Hy \end{array} \right) \left( \begin{array}{c} y^{T} \\ s^{T}B \end{array} \right) H \\ & = & H - \rho s y^{T}H - \rho H y s^{T} + \rho s y^{T} H y s^{T} + \rho s s^{T} \\ & = & (I - \rho s y^{T})H(I - \rho y s^{T}) + \rho s s^{T} \end{eqnarray}\]

Add back subscript \(k\) and we finally get the BFGS formula

\[\begin{equation} H_{k+1} = (I - \rho_k s_k y_k^{T})H_k(I - \rho_k y_k s_k^{T}) + \rho_k s_k s_k^{T}, \quad \rho_k = \frac{1}{y_k^{T}s_k} \tag{6} \end{equation}\]

## Initial \(H_0\)

A simple choice of \(H_0\) is setting it to identical matrix \(I\). A similar strategy is to introduce a scalar \(\beta\), then let \(H_0 = \beta I\).

There are many other approaches. For instance, we can also use other optimization method to “warm up”—do a few iterations in order to get a better approximate of \(\nabla^{2}f\), then change back to BFGS.

## Convergence

I’m going to discuss the convergence theory of Newton and quasi-Newton method in another article later. Therefore, here I only mention the convergence result theorem.

**Theorem**
*Suppose that \(f\) is twice continuously differentiable and that the iterates generated by the BFGS algorithm converge to a minimizer \(x^{\ast}\) at which the Hessian matrix \(G\) satisfies Lipschitz continuous in a neighborhood with some positive constant \(L\),*

\[\begin{equation} \| G(x) - G(x^{\ast}) \| \le L \|x - x^{\ast}\| \end{equation}\]

*Suppose the sequence also satisfies*

\[\begin{equation} \sum_{k=1}^{\infty} \| x_k - x^{\ast} \| < \infty \end{equation}\]

*Then \(x_k\) converges to \(x^{\ast}\) at a superlinear rate.*

For L-BFGS and stochastic L-BFGS, some convergence discussion can be found in Mokhtari and Ribeiro (2015).

# 3. Determine step length \(\alpha_k\)

With BFGS updating formula, we have solved the task of how to find direction \(p_k\). Now we turn to determine how far we should go in this direction.

## Wolfe condition

We introduce a helper function

\[\phi(\alpha) = f(x_k + \alpha p_k), \quad \alpha > 0\]

The minimizer of \(\phi(\alpha)\) is what we need. However, solving this univariate minimum problem accurately could be too expensive. An inexact solution is acceptable as long as \(\phi(\alpha) = f(x_k + \alpha p_k) < f_k\).

However, a simple function value reduction may not be enough sometimes.
The picture below show an example of this situation.
We need a kind of *sufficient decrease* to avoid this.

The following *Wolfe condition* is the formalization of this sufficient decrease.

\[\begin{eqnarray} f(x_k + \alpha_k p_k) & \le & f(x_k) + c_1 \alpha_k \nabla f_k^{T} p_k \\ \nabla f(x_k + \alpha_k p_k)^{T} p_k & \ge & c_2 \nabla f_k^{T} p_k \tag{7} \end{eqnarray}\]

where \(0 < c_1 < c_2 < 1\).

The right hand side of the first inequality is a line \(l(\alpha)\) drawn from \(x_k\) with the same slope of \(\phi(\alpha)\).
Thus the intuition behind the first inequality is that the function value at step \(\alpha_k\) should below this line \(l(\alpha)\).
This condition is usually called *Armijo condition*, or *sufficient decrease condition*.

The intuition of the second inequality uses more information of \(f\)’s curvature.
Since \(p_k\) is the descent direction, we have \(\nabla f_k^{T} p_k < 0\).
Suppose the slope of \(\phi(\alpha)\) at step \(\alpha_k\) is smaller (steeper) than the slope at the start point,
i.e. \(\nabla f_k^{T} p_k\), then we can go further safely to reach a even low objective value.
Therefore, a sufficient descent step should be at a point with a more gentle slope.
This second condition is usually referred to as *curvature condition*.

There is a chance that step \(\alpha_k\) goes too further that the slope becomes positive.
To prevent this case, we can use *strong Wolfe condition*

\[\begin{eqnarray} f(x_k + \alpha_k p_k) & \le & f(x_k) + c_1 \alpha_k \nabla f_k^{T} p_k \\ \| \nabla f(x_k + \alpha_k p_k)^{T} p_k \| & \le & c_2 \|\nabla f_k^{T} p_k\| \tag{8} \end{eqnarray}\]

where \(0 < c_1 < c_2 < 1\).

## Existance and convergence

The next question is does these \(\alpha_k\) exist; and if does, how to find them.

The following lemma and theorem (Nocedal and Wright 2006) guarantee the existence of \(\alpha_k\) that satisfies Wolfe condition, and show us the superlinearly convergence under certain conditions.

**Lemma**
*Suppose that \(f\) is continuously differentiable. Let \(p_k\) be a descent direction at \(x_k\),
and assume that \(f\) is bounded below along \(\{x_k + \alpha p_k | \alpha > 0\}\).
For \(0 < c_1 < c_2 < 1\),
there exist intervals of step lengths that satisfying the Wolfe condition (7) and strong Wolfe condition (8).*

**Theorem**
*Suppose that \(f\) is twice continuously differentiable. Consider the iteration \(x_{k+1} = x_k + \alpha_k p_k\) , where \(p_k\) is a descent direction and \(\alpha_k\) satisfies the Wolfe conditions (7) with \(c_1 \le 1/2\). If the sequence \(\{x_k\}\) converges to a point \(x^{\ast}\) such that \(\nabla f(x^{\ast})\) and \(\nabla^{2}f(x^{\ast})\) is positive definite, and if the search direction satisfies*

\[\begin{equation} \lim_{k\to \infty}\frac{\| \nabla f_k + \nabla^{2}f_k p_k \|}{\|p_k\|} = 0 \end{equation}\]

then

*the step length \(\alpha_k = 1\) is admissible for all \(k\) greater than a certain index \(k_0\);**if \(\alpha_k = 1\) for all \(k > k_0\), \(\{x_k\}\) converges to \(x^{\ast}\) superlinearly.*

## Linear search algorithm

We use linear search algorithm to locate a valid \(\alpha\). The idea is to generate a sequence of monotonically increasing \(\{\alpha_i\}\) in \((0, \alpha_{max})\). If \(\alpha_i\) satisfies Wolfe condition, return that step; otherwise narrow the search interval.

We show the algorithm in Julia pseudo code below.

```
using Flux
function linear_search(ϕ, α_0=0, α_max=1, c1=0.0001, c2=0.9)
α = Dict(0 => α_0)
α[1] = choose(α_0, α_max)
ϕ′(x) = gradient(ϕ, x)[1]
i = 1
while true
y = ϕ(α[i])
if y > ϕ(0) + c1 * α[i] * ϕ′(0) or (ϕ(α[i]) >= ϕ(α[i - 1]) and i > 1)
return zoom(α[i - 1], α[i])
end
dy = ϕ′(α[i])
if abs(dy) <= -c2 * ϕ′(0)
return α[i]
end
if dy >= 0
return zoom(α[i], α[i - 1])
end
α[i + 1] = choose(α[i], α_max)
i = i + 1
end
end
function zoom(ϕ, α_lo, α_hi, c2=0.9)
while true
# using quadratic, cubic, or bisection to find a trial step length α
α = choose(α_lo, α_hi)
y = ϕ(α)
if y > ϕ(0) + c1 * α * ϕ′(0) or y >= ϕ(α_lo)
α_hi = α
else
dy = ϕ′(α)
if abs(dy) <= -c2 * ϕ′(0)
return α
end
if dy * (α_hi - α_lo) >= 0
α_hi = α_lo
end
α_lo = α
end
end
end
```

Notice that in function `zoom()`

, the order of \(\alpha_i\) and \(\alpha_{i - 1}\) could swap.
\(\alpha_{lo}\) always gives the smallest function value.

# 4. Limit memory scenario

We have discussed how to use BFGS algorithm to find a descent direction \(p_k\) and how to use linear search to choose a step length \(\alpha_k\) that satisfies the Wolfe condition. Now let’s talk about some practical computation issues.

The BFGS algorithm needs to store and update the Hessian matrix at each iteration. This requires \(O(n^2)\) memory space, which is not feasible at all, since the parameter size \(n\) of many modern deep learning models could be millions or hundreds of millions.

*Limit memory BFGS*, or *L-BFGS*, is a BFGS’s variation addressing this issue.
Instead of doing matrix multiplication directly,
L-BFGS use \(m\) previous vectors, \(\{s_i, y_i\}, i = k-1, \dots, k-m\),
to reconstruct \(H_k\).
This reduce memory cost from \(O(n^2)\) to \(O(m n)\).

## L-BFGS: vanilla iteration

Recall (1) and BFGS (6), and for simplicity, let

\[\begin{equation} V_k = I - \rho_k y_k s_k^{T} \end{equation}\]

The BFGS formula becomes

\[\begin{equation} H_{k + 1} = V_k^{T} H_k V_k + \rho_k s_k s_k^{T} \tag{9} \end{equation}\]

We can use a vanilla iterative way to calculate \(H_k \nabla f_k\) directly (let \(q = \nabla f_k\))

- \(V_k \nabla f_k = q - \rho_k y_k s_k^{T} q\): calculate \(\rho s^{T} q\) first in \(n\) multiplications to get a scalar \(\alpha\), then calculate \(q - \alpha y\) in another \(n\) multiplications.
- Suppose \(H_k^{0}\) is a diagonal matrix, so we can calculate \(H_k^{0}V_k q\) in \(n\) multiplications.
- Multiply \(V_k^{T}\) is analogous, which needs \(2n\) multiplications as well.
- Finally, \(\rho s s^{T}\) and add the results together needs another \(2n\) multiplications.

So this vanilla iteration requires \(7n\) multiplications, leading to a total cost of \(7nm\) multiplications.

Can we do better?

## L-BFGS: two-loop recursion

L-BFGS has a *two-loop recursion* algorithm, which is quite brilliant.

```
function L_BFGS(H0, ∇f_k, s, y, ρ, k, m)
q = ∇f_k
for i = k - 1 : -1 : k - m
α[i] = ρ[i] * transpose(s[i]) * q
q = q - α[i] * y[i]
end
w = H0 * q
for i = k - m : k - 1
β = ρ[i] * transpose(y[i]) * w
w = w + s[i] * (α[i] - β)
end
return w
end
```

The return value \(w = H_k \nabla f_k\) is what we need, and there are only \(5nm\) multiplications in it.

The data manipulation in the two-loop recursion is not that easy to understand at first glimpse. To make it more clear, we expand (9) \(m\) steps to get

\[\begin{aligned} H_{k}=&\left(V_{k-1}^{T} \cdots V_{k-m}^{T}\right) H_{k}^{0}\left(V_{k-m} \cdots V_{k-1}\right) \\ &+\rho_{k-m}\left(V_{k-1}^{T} \cdots V_{k-m+1}^{T}\right) s_{k-m} s_{k-m}^{T}\left(V_{k-m+1} \cdots V_{k-1}\right) \\ &+\rho_{k-m+1}\left(V_{k-1}^{T} \cdots V_{k-m+2}^{T}\right) s_{k-m+1} s_{k-m+1}^{T}\left(V_{k-m+2} \cdots V_{k-1}\right) \\ &+\cdots \\ &+\rho_{k-2} V_{k-1}^{T} s_{k-2} s_{k-2}^{T} V_{k-1} \\ &+\rho_{k-1} s_{k-1} s_{k-1}^{T} \end{aligned}\]Some key observations to help you understand the two-loop recursion:

- In the first loop, \(\alpha_i = \rho_i s_i^{T} V_{i + 1} \cdots V_{k-1} \nabla f_k\)
- After the first loop and
`w = H0 * q`

, \(q = H_k^{0} V_{k-m} \cdots V_{k-1}\) - In the second loop, after iteration \(i\), the vector \(w\) will be

# 5. Summary

In this article, we have talked about the basic idea of quasi-Newton method. We carefully derive the DFP and BFGS formula, and show how to find descent direction with them. We have also discussed how to use Wolfe condition and linear search to find a feasible step length. And at last, we demonstrate how to use two-loop recursion L-BFGS to apply a fast and memory-efficient iteration.

In most of the materials I’ve found about L-BFGS, many non-trivial details are omitted. I’ve tried to make this post as self-contained and as clear as possible. Maybe it will help one or two newcomers of this topic to overcome some obscure steps.

*The Computer Journal*6 (2): 163–68.

*The Journal of Machine Learning Research*16 (1): 3151–81.

*Numerical Optimization*. Springer Science & Business Media.

Figure 3.2 from Nocedal and Wright (2006).↩︎

Figure 3.4 from Nocedal and Wright (2006).↩︎