English Posts on Free Verse Polynomial
https://fshen.org/en/
Recent content in English Posts on Free Verse Polynomial
© 2006–{year} F. Shen · All Rights Reserved
Sun, 22 Nov 2020 23:29:54 +0800

Notes on L-BFGS and Wolfe condition
https://fshen.org/en/2020/1122lbfgsandwolfecondition/
Sun, 22 Nov 2020 23:29:54 +0800
<script src="https://fshen.org/rmarkdownlibs/headerattrs/headerattrs.js"></script>
<link href="https://fshen.org/rmarkdownlibs/anchorsections/anchorsections.css" rel="stylesheet" />
<script src="https://fshen.org/rmarkdownlibs/anchorsections/anchorsections.js"></script>
<div id="TOC">
<ul>
<li><a href="#introduction">1. Introduction</a></li>
<li><a href="#finddescentdirectionp_k">2. Find descent direction <span class="math inline">\(p_k\)</span></a>
<ul>
<li><a href="#quasinewtonandsecantequation">Quasi-Newton and secant equation</a></li>
<li><a href="#ranktwoupdate">Rank-two update</a></li>
<li><a href="#bfgs">BFGS</a></li>
<li><a href="#initialh_0">Initial <span class="math inline">\(H_0\)</span></a></li>
<li><a href="#convergence">Convergence</a></li>
</ul></li>
<li><a href="#determinesteplengthalpha_k">3. Determine step length <span class="math inline">\(\alpha_k\)</span></a>
<ul>
<li><a href="#wolfecondition">Wolfe condition</a></li>
<li><a href="#existanceandconvergence">Existence and convergence</a></li>
<li><a href="#linearsearchalgorithm">Line search algorithm</a></li>
</ul></li>
<li><a href="#limitmemoryscenario">4. Limited-memory scenario</a>
<ul>
<li><a href="#lbfgsvanillaiteration">L-BFGS: vanilla iteration</a></li>
<li><a href="#lbfgstwolooprecursion">L-BFGS: two-loop recursion</a></li>
</ul></li>
<li><a href="#summary">5. Summary</a></li>
</ul>
</div>
<p>Recently, while working on a problem, I wanted to try a second-order method like L-BFGS instead of regular SGD or Adam.</p>
<p>In this post, I’ve summarised my notes to give an intuitive yet self-contained introduction, which only requires a minimal calculus background, such as <a href="https://en.wikipedia.org/wiki/Taylor%27s_theorem">Taylor’s theorem</a>. We will discuss the derivation, the algorithm, convergence results, and a few practical computational issues of L-BFGS.</p>
<div id="introduction" class="section level1">
<h1>1. Introduction</h1>
<p>Suppose <span class="math inline">\(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\)</span>, the objective function we want to minimize, is twice continuously differentiable, with its minimum at <span class="math inline">\(x^{\ast}\)</span>. Given an initial point <span class="math inline">\(x_0\)</span>, a typical approach is to use an iterative method</p>
<p><span class="math display">\[ x_{k + 1} = x_{k} + \alpha_k p_k, \quad k = 0, 1, \dots, \]</span></p>
<p>to approximate <span class="math inline">\(x^{\ast}\)</span>, where <span class="math inline">\(p_k\)</span> is the <em>search direction</em>, and <span class="math inline">\(\alpha_k\)</span> is the <em>step length</em>. So we have two tasks:</p>
<ol style="list-style-type: lower-alpha">
<li>find a descent direction <span class="math inline">\(p_k\)</span>;</li>
<li>determine how far we move in that direction, i.e. step length <span class="math inline">\(\alpha_k\)</span>.</li>
</ol>
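<p>To make the template concrete, here is a minimal Python sketch (the names and toy objective are illustrative, not from the post) using the simplest choice <span class="math inline">\(p_k = -\nabla f_k\)</span> (steepest descent) with a fixed step length:</p>

```python
import numpy as np

def iterate(grad, x0, step=0.1, tol=1e-8, max_iter=1000):
    """Generic scheme x_{k+1} = x_k + alpha_k * p_k with p_k = -grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        p = -grad(x)               # search direction: steepest descent
        if np.linalg.norm(p) < tol:
            break
        x = x + step * p           # fixed step length alpha_k
    return x

# f(x) = (x1 - 1)^2 + 2 * (x2 + 3)^2 has its minimum at (1, -3)
grad = lambda x: np.array([2 * (x[0] - 1.0), 4 * (x[1] + 3.0)])
x_star = iterate(grad, [0.0, 0.0])
```

<p>The rest of the post is about smarter choices of <span class="math inline">\(p_k\)</span> and <span class="math inline">\(\alpha_k\)</span> than these fixed ones.</p>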
</div>
<div id="finddescentdirectionp_k" class="section level1">
<h1>2. Find descent direction <span class="math inline">\(p_k\)</span></h1>
<p>We use <span class="math inline">\(\nabla f\)</span> to denote the gradient of <span class="math inline">\(f\)</span>, and use <span class="math inline">\(\nabla f_k\)</span> as a short form of <span class="math inline">\(\nabla f(x_k)\)</span>. Similarly, we use <span class="math inline">\(\nabla^{2}f\)</span> to denote the Hessian matrix of <span class="math inline">\(f\)</span>, and <span class="math inline">\(\nabla^{2}f_k\)</span> is the Hessian matrix’s value at <span class="math inline">\(x_k\)</span>.</p>
<p>At the minimizer <span class="math inline">\(x^{\ast}\)</span>, we have <span class="math inline">\(\nabla f(x^{\ast}) = 0\)</span>, and we assume <span class="math inline">\(\nabla^{2}f(x^{\ast})\)</span> is a symmetric positive definite matrix.</p>
<div id="quasinewtonandsecantequation" class="section level2">
<h2>Quasi-Newton and secant equation</h2>
<p>Applying Taylor’s theorem to <span class="math inline">\(\nabla f\)</span> at <span class="math inline">\(x^{\ast}\)</span> gives us</p>
<p><span class="math display">\[\begin{equation}
\nabla f(x^{\ast}) = \nabla f_k + \nabla^{2}f_k \cdot (x^{\ast} - x_k) + o(\| x^{\ast} - x_k \|)
\end{equation}\]</span></p>
<p>Ignoring the high-order term, we get the approximation</p>
<p><span class="math display">\[\begin{eqnarray}
\nabla f_k + \nabla^{2}f_k \cdot (x^{\ast} - x_k) & \approx & \nabla f(x^{\ast}) = 0 \\
x^{\ast} & \approx & x_k - (\nabla^{2}f_k)^{-1}\nabla{f_k}
\end{eqnarray}\]</span></p>
<p>Newton’s method uses the right-hand side to update <span class="math inline">\(x_k\)</span>,</p>
<p><span class="math display">\[
x_{k+1} = x_k - (\nabla^{2}f_k)^{-1}\nabla f_k
\]</span></p>
<p>However, calculating the Hessian matrix’s inverse is usually expensive, and sometimes we cannot guarantee that <span class="math inline">\(\nabla^{2}f_k\)</span> is positive definite.</p>
<p>The idea of the quasi-Newton method is to replace the Hessian matrix with another symmetric positive definite matrix <span class="math inline">\(B_k\)</span>, and to refine it in every iteration at a much smaller computational cost.</p>
<p>Applying the same Taylor expansion between <span class="math inline">\(x_k\)</span> and <span class="math inline">\(x_{k+1}\)</span>, we have</p>
<p><span class="math display">\[\begin{equation}
\nabla^{2}f_k \cdot (x_{k + 1} - x_k) \approx \nabla f_{k+1} - \nabla f_{k}
\end{equation}\]</span></p>
<p>Let
<span class="math display" id="eq:skyk">\[\begin{eqnarray}
s_k & = & x_{k+1} - x_k \\
y_k & = & \nabla f_{k+1} - \nabla f_{k}
\tag{1}
\end{eqnarray}\]</span></p>
<p>The next matrix <span class="math inline">\(B_{k+1}\)</span> we are looking for should satisfy the above approximation,</p>
<p><span class="math display" id="eq:secant">\[\begin{equation}
B_{k+1} s_k = y_k
\tag{2}
\end{equation}\]</span></p>
<p>This equation is called <em>secant equation</em>.</p>
<p>There is another approach to get the secant equation. Let</p>
<p><span class="math display">\[ m_k(p) = f_k + \nabla f_{k}^{T}p + \frac{1}{2}p^{T}B_k{p} \]</span></p>
<p>be the second-order approximation to <span class="math inline">\(f\)</span> at <span class="math inline">\(x_k\)</span>. It is a function of <span class="math inline">\(p\)</span> with gradient</p>
<p><span class="math display">\[ m_k^{\prime}(p) = \nabla f_k + B_k p \]</span></p>
<p>When we iterate from <span class="math inline">\(k\)</span> to <span class="math inline">\(k + 1\)</span>, it’s easy to check that <span class="math inline">\(m_{k+1}(0)= f_{k+1}\)</span> and <span class="math inline">\(m_{k+1}^{\prime}(0) = \nabla f_{k+1}\)</span>. This means <span class="math inline">\(m_{k+1}(p)\)</span> matches <span class="math inline">\(f\)</span> in both function value and gradient at <span class="math inline">\(x_{k+1}\)</span>. If we want <span class="math inline">\(m_{k+1}(p)\)</span> to approximate <span class="math inline">\(f\)</span> even better, we may ask its gradient to match that of <span class="math inline">\(f\)</span> at the previous step <span class="math inline">\(x_k\)</span> as well. At this point, <span class="math inline">\(x_k = x_{k+1} - \alpha_k p_k\)</span>, so <span class="math inline">\(p = -\alpha_k p_k = -s_k\)</span>, and we have</p>
<p><span class="math display">\[\begin{eqnarray}
m_{k+1}^{\prime}(-s_k) = \nabla f_{k+1} - B_{k+1}s_k & = & \nabla f_k \\
B_{k+1} s_k & = & \nabla f_{k+1} - \nabla f_k \\
B_{k+1} s_k & = & y_k
\end{eqnarray}\]</span></p>
</div>
<div id="ranktwoupdate" class="section level2">
<h2>Rank-two update</h2>
<p>So the next question becomes how to update <span class="math inline">\(B_k\)</span>. The Davidon–Fletcher–Powell (DFP) formula <span class="citation">(<a href="#reffletcher1963rapidly" role="docbiblioref">Fletcher and Powell 1963</a>)</span> gives a rank-two matrix update.</p>
<p><span class="math display" id="eq:DFP">\[\begin{equation}
B_{k+1} = B_k - \frac{B_k s_k s_k^{T} B_k}{s_k^{T} B_k s_k} + \frac{y_k y_k^{T}}{y_k^{T}s_k}
\tag{3}
\end{equation}\]</span></p>
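<p>As a quick sanity check (an illustrative Python sketch, not from the post), one can verify numerically that the DFP update (3) satisfies the secant equation (2) and preserves symmetry:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = np.eye(n)                          # current symmetric positive definite B_k
s = rng.standard_normal(n)             # s_k = x_{k+1} - x_k
y = s + 0.1 * rng.standard_normal(n)   # chosen so that y^T s > 0

Bs = B @ s
B_next = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

print(np.allclose(B_next @ s, y))      # secant equation: B_{k+1} s_k = y_k
print(np.allclose(B_next, B_next.T))   # symmetry preserved
```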
<p>In <span class="citation"><a href="#refnocedal2006numerical" role="docbiblioref">Nocedal and Wright</a> (<a href="#refnocedal2006numerical" role="docbiblioref">2006</a>)</span>, the authors provide another interpretation:
they view <span class="math inline">\(B_{k+1}\)</span> as the solution of the following optimization problem.</p>
<p><span class="math display">\[\begin{eqnarray}
\min_{B} \|B - B_k\|_{W} \\
\text{s.t.} \quad B = B^{T}, \quad B s_k = y_k
\end{eqnarray}\]</span></p>
<p>where the <span class="math inline">\(\|A\|_{W}\)</span> norm is defined as</p>
<p><span class="math display">\[ \|A\|_{W} = \| W^{1/2}A W^{1/2} \|_{F} \]</span></p>
<p>where <span class="math inline">\(\|\cdot\|_{F}\)</span> is the <a href="https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm">Frobenius norm</a> and <span class="math inline">\(W\)</span> is any matrix satisfying <span class="math inline">\(W y_k = s_k\)</span>.
This interpretation is more intuitive;
however, I haven’t finished the detailed proof of this conclusion
and will revise this part later.</p>
</div>
<div id="bfgs" class="section level2">
<h2>BFGS</h2>
<p>The DFP formula keeps <span class="math inline">\(B_k\)</span> symmetric and positive definite.
But we still have to calculate its inverse to get <span class="math inline">\(p_k = -B_k^{-1}\nabla f_k\)</span>.
To avoid computing the inverse matrix,
we need to approximate the matrix <span class="math inline">\(H_k = B_k^{-1}\)</span> directly.</p>
<p>Many materials say that simply applying the Sherman–Morrison formula yields an update equation for <span class="math inline">\(B_k^{-1}\)</span>.
However, this matrix derivation is not trivial.
After a little calculation, I think applying the <a href="https://en.wikipedia.org/wiki/Woodbury_matrix_identity">Woodbury matrix identity</a> is a more straightforward way.</p>
<p>Woodbury matrix identity says</p>
<p><span class="math display" id="eq:Woodbury">\[\begin{equation}
(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}
\tag{4}
\end{equation}\]</span></p>
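<p>The identity itself is easy to verify numerically; a small illustrative sketch (dimensions and seed arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # a well-conditioned invertible A
U = rng.standard_normal((n, k))
C = np.eye(k)
V = rng.standard_normal((k, n))

lhs = np.linalg.inv(A + U @ C @ V)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv
print(np.allclose(lhs, rhs))
```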
<p>Let <span class="math inline">\(\rho = \frac{1}{y_k^{T}s_{k}}\)</span> and drop subscript <span class="math inline">\(k\)</span> to rewrite DFP as</p>
<p><span class="math display" id="eq:dfpwoodbury">\[\begin{eqnarray}
B_{k+1} & = & B - \frac{Bss^{T}B}{s^{T}Bs} + \frac{y y^{T}}{y^{T}s} \\
& = & B + (y \quad Bs)\left(
\begin{array}{cc}
\rho & 0 \\
0 & -\frac{1}{s^{T}Bs}
\end{array}
\right)
\left(
\begin{array}{c}
y^{T} \\
s^{T}B
\end{array}
\right) \\
& = & B + UCV
\tag{5}
\end{eqnarray}\]</span></p>
<p>So</p>
<p><span class="math display">\[\begin{eqnarray}
C^{-1} + VA^{-1}U & = & \left(
\begin{array}{cc}
\frac{1}{\rho} & 0 \\
0 & -s^{T}Bs
\end{array}
\right) + \left(
\begin{array}{c}
y^{T} \\
s^{T}B
\end{array}
\right) H \left(y,\quad Bs\right) \\
& = & \left(
\begin{array}{cc}
1/\rho + y^{T}Hy & 1/\rho \\
1/\rho & 0
\end{array}
\right)
\end{eqnarray}\]</span></p>
<p>We can easily check that</p>
<p><span class="math display">\[\begin{equation}
\left(
\begin{array}{cc}
a + b & a \\
a & 0
\end{array}
\right)
\left(
\begin{array}{cc}
0 & \frac{1}{a} \\
\frac{1}{a} & -\frac{a+b}{a^2}
\end{array}
\right) = I
\end{equation}\]</span></p>
<p>Using this property, we can get</p>
<p><span class="math display">\[\begin{eqnarray}
(C^{-1} + VA^{-1}U)^{-1} & = & \left(
\begin{array}{cc}
0 & \rho \\
\rho & -\rho - \rho^{2} y^{T}Hy
\end{array}
\right)
\end{eqnarray}\]</span></p>
<p>Substituting this back into <a href="#eq:Woodbury">(4)</a> with <a href="#eq:dfpwoodbury">(5)</a> (still omitting subscript <span class="math inline">\(k\)</span> for simplicity),</p>
<p><span class="math display">\[\begin{eqnarray}
H_{k+1} & = & H - H (y \quad Bs)\left(
\begin{array}{cc}
0 & \rho \\
\rho & -\rho - \rho^{2} y^{T}Hy
\end{array}
\right)
\left(
\begin{array}{c}
y^{T} \\
s^{T}B
\end{array}
\right) H \\
& = & H - \rho s y^{T}H - \rho H y s^{T} + \rho^{2} s y^{T} H y s^{T} + \rho s s^{T} \\
& = & (I - \rho s y^{T})H(I - \rho y s^{T}) + \rho s s^{T}
\end{eqnarray}\]</span></p>
<p>Add back subscript <span class="math inline">\(k\)</span> and we finally get the BFGS formula</p>
<p><span class="math display" id="eq:bfgs">\[\begin{equation}
H_{k+1} = (I - \rho_k s_k y_k^{T})H_k(I - \rho_k y_k s_k^{T}) + \rho_k s_k s_k^{T}, \quad
\rho_k = \frac{1}{y_k^{T}s_k}
\tag{6}
\end{equation}\]</span></p>
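<p>A quick numerical sanity check of (6), again an illustrative sketch rather than code from the post: starting from a symmetric positive definite <span class="math inline">\(H_k\)</span>, the update stays symmetric and satisfies the inverse secant equation <span class="math inline">\(H_{k+1} y_k = s_k\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
H = np.eye(n)                          # current H_k
s = rng.standard_normal(n)
y = s + 0.1 * rng.standard_normal(n)   # keeps y^T s > 0
rho = 1.0 / (y @ s)

I = np.eye(n)
H_next = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

print(np.allclose(H_next @ y, s))      # inverse secant equation H_{k+1} y_k = s_k
print(np.allclose(H_next, H_next.T))   # symmetry preserved
```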
</div>
<div id="initialh_0" class="section level2">
<h2>Initial <span class="math inline">\(H_0\)</span></h2>
<p>A simple choice of <span class="math inline">\(H_0\)</span> is the identity matrix <span class="math inline">\(I\)</span>.
A similar strategy is to introduce a scalar <span class="math inline">\(\beta\)</span> and let <span class="math inline">\(H_0 = \beta I\)</span>.</p>
<p>There are many other approaches.
For instance, we can use another optimization method to “warm up”: do a few iterations to get a better approximation of <span class="math inline">\(\nabla^{2}f\)</span>, then switch back to BFGS.</p>
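<p>Beyond a fixed <span class="math inline">\(\beta\)</span>, a widely used data-driven choice (see Nocedal and Wright 2006) sets <span class="math inline">\(H_k^{0} = \gamma_k I\)</span> with <span class="math inline">\(\gamma_k = s_{k-1}^{T} y_{k-1} / y_{k-1}^{T} y_{k-1}\)</span>, recomputed each iteration. A small illustrative Python sketch:</p>

```python
import numpy as np

def initial_scaling(s_prev, y_prev):
    """gamma_k = s^T y / y^T y estimates the inverse of f's curvature
    along the most recent step, so H_0 = gamma * I starts the BFGS
    update at a sensible scale."""
    return (s_prev @ y_prev) / (y_prev @ y_prev)

# Toy example: the gradient changed twice as fast as the step,
# suggesting curvature ~2, so gamma should be ~0.5.
s_prev = np.array([1.0, 0.0])
y_prev = np.array([2.0, 0.0])
gamma = initial_scaling(s_prev, y_prev)
```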
</div>
<div id="convergence" class="section level2">
<h2>Convergence</h2>
<p>I’m going to discuss the convergence theory of the Newton and quasi-Newton methods in another article later.
Here I only state the convergence theorem.</p>
<p><strong>Theorem</strong>
<em>Suppose that <span class="math inline">\(f\)</span> is twice continuously differentiable and that the iterates generated by the BFGS algorithm converge to a minimizer <span class="math inline">\(x^{\ast}\)</span> at which the Hessian matrix <span class="math inline">\(G\)</span> is Lipschitz continuous in a neighborhood of <span class="math inline">\(x^{\ast}\)</span> with some positive constant <span class="math inline">\(L\)</span>,</em></p>
<p><span class="math display">\[\begin{equation}
\| G(x) - G(x^{\ast}) \| \le L \|x - x^{\ast}\|
\end{equation}\]</span></p>
<p><em>Suppose the sequence also satisfies</em></p>
<p><span class="math display">\[\begin{equation}
\sum_{k=1}^{\infty} \| x_k - x^{\ast} \| < \infty
\end{equation}\]</span></p>
<p><em>Then <span class="math inline">\(x_k\)</span> converges to <span class="math inline">\(x^{\ast}\)</span> at a superlinear rate.</em></p>
<p>For L-BFGS and stochastic L-BFGS, some convergence discussion can be found in <span class="citation"><a href="#refmokhtari2015global" role="docbiblioref">Mokhtari and Ribeiro</a> (<a href="#refmokhtari2015global" role="docbiblioref">2015</a>)</span>.</p>
</div>
</div>
<div id="determinesteplengthalpha_k" class="section level1">
<h1>3. Determine step length <span class="math inline">\(\alpha_k\)</span></h1>
<p>With BFGS updating formula, we have solved the task of how to find direction <span class="math inline">\(p_k\)</span>.
Now we turn to determine how far we should go in this direction.</p>
<div id="wolfecondition" class="section level2">
<h2>Wolfe condition</h2>
<p>We introduce a helper function</p>
<p><span class="math display">\[\phi(\alpha) = f(x_k + \alpha p_k), \quad \alpha > 0\]</span></p>
<p>The minimizer of <span class="math inline">\(\phi(\alpha)\)</span> is what we need.
However, solving this univariate minimization problem exactly could be too expensive.
An inexact solution is acceptable as long as <span class="math inline">\(\phi(\alpha) = f(x_k + \alpha p_k) < f_k\)</span>.</p>
<p>However, a simple reduction in function value may not be enough.
The picture below shows an example of this situation.
We need a kind of <em>sufficient decrease</em> to avoid it.</p>
<div class="figure" style="textalign: center"><span id="fig:unnamedchunk2"></span>
<img src="insufficientdecrease.png" alt="Insufficient reduction" width="85%" />
<p class="caption">
Figure 1: Insufficient reduction<a href="#fn1" class="footnoteref" id="fnref1"><sup>1</sup></a>
</p>
</div>
<p>The following <em>Wolfe condition</em> is the formalization of this sufficient decrease.</p>
<p><span class="math display" id="eq:wolfe">\[\begin{eqnarray}
f(x_k + \alpha_k p_k) & \le & f(x_k) + c_1 \alpha_k \nabla f_k^{T} p_k \\
\nabla f(x_k + \alpha_k p_k)^{T} p_k & \ge & c_2 \nabla f_k^{T} p_k
\tag{7}
\end{eqnarray}\]</span></p>
<p>where <span class="math inline">\(0 < c_1 < c_2 < 1\)</span>.</p>
<p>The right-hand side of the first inequality is a line <span class="math inline">\(l(\alpha)\)</span> starting from <span class="math inline">\(f(x_k)\)</span> whose slope is <span class="math inline">\(c_1\)</span> times the initial slope of <span class="math inline">\(\phi(\alpha)\)</span>.
Thus the intuition behind the first inequality is that the function value at step <span class="math inline">\(\alpha_k\)</span> should lie below this line <span class="math inline">\(l(\alpha)\)</span>.
This condition is usually called the <em>Armijo condition</em>, or <em>sufficient decrease condition</em>.</p>
<p>The second inequality uses more information about <span class="math inline">\(f\)</span>’s curvature.
Since <span class="math inline">\(p_k\)</span> is a descent direction, we have <span class="math inline">\(\nabla f_k^{T} p_k < 0\)</span>.
If the slope of <span class="math inline">\(\phi(\alpha)\)</span> at step <span class="math inline">\(\alpha_k\)</span> is still strongly negative (steeper than <span class="math inline">\(c_2\)</span> times the initial slope <span class="math inline">\(\nabla f_k^{T} p_k\)</span>), then we can safely go further to reach an even lower objective value.
Therefore, a sufficient step should end at a point with a gentler slope.
This second condition is usually referred to as the <em>curvature condition</em>.</p>
<div class="figure" style="textalign: center"><span id="fig:unnamedchunk3"></span>
<img src="wolfecondition.png" alt="The curvature condition" width="85%" />
<p class="caption">
Figure 2: The curvature condition<a href="#fn2" class="footnoteref" id="fnref2"><sup>2</sup></a>
</p>
</div>
<p>There is a chance that the step <span class="math inline">\(\alpha_k\)</span> goes so far that the slope becomes strongly positive.
To rule out this case, we can use the <em>strong Wolfe condition</em></p>
<p><span class="math display" id="eq:strongwolfe">\[\begin{eqnarray}
f(x_k + \alpha_k p_k) & \le & f(x_k) + c_1 \alpha_k \nabla f_k^{T} p_k \\
\left| \nabla f(x_k + \alpha_k p_k)^{T} p_k \right| & \le & c_2 \left| \nabla f_k^{T} p_k \right|
\tag{8}
\end{eqnarray}\]</span></p>
<p>where <span class="math inline">\(0 < c_1 < c_2 < 1\)</span>.</p>
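<p>Checking the conditions in code is straightforward; a minimal Python sketch on a one-dimensional quadratic (the helper name and constants are illustrative):</p>

```python
import numpy as np

def satisfies_strong_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the strong Wolfe conditions (8) at step length alpha."""
    g0 = grad(x) @ p                       # initial slope, negative for a descent direction
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * g0
    curvature = abs(grad(x + alpha * p) @ p) <= c2 * abs(g0)
    return bool(armijo and curvature)

# f(x) = x^2, starting at x = 2 with descent direction p = -1
f = lambda x: float(x @ x)
grad = lambda x: 2 * x
x = np.array([2.0])
p = np.array([-1.0])

ok_small = satisfies_strong_wolfe(f, grad, x, p, alpha=0.01)  # tiny step
ok_good = satisfies_strong_wolfe(f, grad, x, p, alpha=2.0)    # reaches the minimizer
```

<p>Here the tiny step fails the curvature condition (the slope at the new point is still steep), while the step that reaches the minimizer satisfies both conditions.</p>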
</div>
<div id="existanceandconvergence" class="section level2">
<h2>Existence and convergence</h2>
<p>The next question is whether such <span class="math inline">\(\alpha_k\)</span> exist, and if so, how to find them.</p>
<p>The following lemma and theorem <span class="citation">(<a href="#refnocedal2006numerical" role="docbiblioref">Nocedal and Wright 2006</a>)</span> guarantee the existence of <span class="math inline">\(\alpha_k\)</span> satisfying the Wolfe condition, and show superlinear convergence under certain conditions.</p>
<p><strong>Lemma</strong>
<em>Suppose that <span class="math inline">\(f\)</span> is continuously differentiable. Let <span class="math inline">\(p_k\)</span> be a descent direction at <span class="math inline">\(x_k\)</span>,
and assume that <span class="math inline">\(f\)</span> is bounded below along <span class="math inline">\(\{x_k + \alpha p_k \mid \alpha > 0\}\)</span>.
For <span class="math inline">\(0 < c_1 < c_2 < 1\)</span>,
there exist intervals of step lengths satisfying the Wolfe condition <a href="#eq:wolfe">(7)</a> and the strong Wolfe condition <a href="#eq:strongwolfe">(8)</a>.</em></p>
<p><strong>Theorem</strong>
<em>Suppose that <span class="math inline">\(f\)</span> is twice continuously differentiable. Consider the iteration <span class="math inline">\(x_{k+1} = x_k + \alpha_k p_k\)</span>, where <span class="math inline">\(p_k\)</span> is a descent direction and <span class="math inline">\(\alpha_k\)</span> satisfies the Wolfe conditions <a href="#eq:wolfe">(7)</a> with <span class="math inline">\(c_1 \le 1/2\)</span>. If the sequence <span class="math inline">\(\{x_k\}\)</span> converges to a point <span class="math inline">\(x^{\ast}\)</span> such that <span class="math inline">\(\nabla f(x^{\ast}) = 0\)</span> and <span class="math inline">\(\nabla^{2}f(x^{\ast})\)</span> is positive definite, and if the search direction satisfies</em></p>
<p><span class="math display">\[\begin{equation}
\lim_{k\to \infty}\frac{\| \nabla f_k + \nabla^{2}f_k p_k \|}{\|p_k\|} = 0
\end{equation}\]</span></p>
<p>then</p>
<ol style="list-style-type: decimal">
<li><em>the step length <span class="math inline">\(\alpha_k = 1\)</span> is admissible for all <span class="math inline">\(k\)</span> greater than a certain index <span class="math inline">\(k_0\)</span>;</em></li>
<li><em>if <span class="math inline">\(\alpha_k = 1\)</span> for all <span class="math inline">\(k > k_0\)</span>, <span class="math inline">\(\{x_k\}\)</span> converges to <span class="math inline">\(x^{\ast}\)</span> superlinearly.</em></li>
</ol>
</div>
<div id="linearsearchalgorithm" class="section level2">
<h2>Line search algorithm</h2>
<p>We use a line search algorithm to locate a valid <span class="math inline">\(\alpha\)</span>.
The idea is to generate a monotonically increasing sequence <span class="math inline">\(\{\alpha_i\}\)</span> in <span class="math inline">\((0, \alpha_{max})\)</span>.
If <span class="math inline">\(\alpha_i\)</span> satisfies the Wolfe condition, return that step; otherwise narrow the search interval.</p>
<p>We show the algorithm in Julia pseudocode below.</p>
<pre class="julia"><code>using Flux  # provides gradient()

# `choose(lo, hi)` is a placeholder: pick a trial step strictly inside
# (lo, hi), e.g. by bisection or polynomial interpolation.
choose(lo, hi) = (lo + hi) / 2

function line_search(ϕ; α_max=1.0, c1=1e-4, c2=0.9)
    ϕ′(x) = gradient(ϕ, x)[1]
    α = Dict(0 => 0.0)
    α[1] = choose(0.0, α_max)
    i = 1
    while true
        y = ϕ(α[i])
        if y > ϕ(0) + c1 * α[i] * ϕ′(0) || (i > 1 && y >= ϕ(α[i - 1]))
            return zoom(ϕ, ϕ′, α[i - 1], α[i], c1, c2)
        end
        dy = ϕ′(α[i])
        if abs(dy) <= -c2 * ϕ′(0)    # strong Wolfe curvature condition
            return α[i]
        end
        if dy >= 0                   # slope turned positive: minimum is bracketed
            return zoom(ϕ, ϕ′, α[i], α[i - 1], c1, c2)
        end
        α[i + 1] = choose(α[i], α_max)
        i += 1
    end
end

function zoom(ϕ, ϕ′, α_lo, α_hi, c1, c2)
    while true
        # use quadratic, cubic, or bisection to find a trial step length α
        α = choose(α_lo, α_hi)
        y = ϕ(α)
        if y > ϕ(0) + c1 * α * ϕ′(0) || y >= ϕ(α_lo)
            α_hi = α
        else
            dy = ϕ′(α)
            if abs(dy) <= -c2 * ϕ′(0)
                return α
            end
            if dy * (α_hi - α_lo) >= 0
                α_hi = α_lo
            end
            α_lo = α
        end
    end
end</code></pre>
<p>Notice that in the calls to <code>zoom()</code>, the order of <span class="math inline">\(\alpha_i\)</span> and <span class="math inline">\(\alpha_{i-1}\)</span> can swap:
<span class="math inline">\(\alpha_{lo}\)</span> always gives the smallest function value.</p>
</div>
</div>
<div id="limitmemoryscenario" class="section level1">
<h1>4. Limited-memory scenario</h1>
<p>We have discussed how to use the BFGS algorithm to find a descent direction <span class="math inline">\(p_k\)</span> and how to use a line search to choose a step length <span class="math inline">\(\alpha_k\)</span> that satisfies the Wolfe condition.
Now let’s talk about some practical computational issues.</p>
<p>The BFGS algorithm needs to store and update the approximate (inverse) Hessian matrix at each iteration.
This requires <span class="math inline">\(O(n^2)\)</span> memory, which is infeasible
when the parameter size <span class="math inline">\(n\)</span> reaches millions or hundreds of millions, as in many modern deep learning models.</p>
<p><em>Limited-memory BFGS</em>, or <em>L-BFGS</em>, is a variation of BFGS that addresses this issue.
Instead of storing and multiplying the full matrix,
L-BFGS uses the <span class="math inline">\(m\)</span> most recent vector pairs, <span class="math inline">\(\{s_i, y_i\}, i = k-1, \dots, k-m\)</span>,
to reconstruct <span class="math inline">\(H_k\)</span>.
This reduces the memory cost from <span class="math inline">\(O(n^2)\)</span> to <span class="math inline">\(O(mn)\)</span>.</p>
<div id="lbfgsvanillaiteration" class="section level2">
<h2>L-BFGS: vanilla iteration</h2>
<p>Recall <a href="#eq:skyk">(1)</a> and BFGS <a href="#eq:bfgs">(6)</a>, and for simplicity, let</p>
<p><span class="math display">\[\begin{equation}
V_k = I - \rho_k y_k s_k^{T}
\end{equation}\]</span></p>
<p>The BFGS formula becomes</p>
<p><span class="math display" id="eq:bfgsv">\[\begin{equation}
H_{k + 1} = V_k^{T} H_k V_k + \rho_k s_k s_k^{T}
\tag{9}
\end{equation}\]</span></p>
<p>We can use a vanilla iterative way to calculate <span class="math inline">\(H_k \nabla f_k\)</span> directly (let <span class="math inline">\(q = \nabla f_k\)</span>)</p>
<ol style="list-style-type: decimal">
<li><span class="math inline">\(V_k \nabla f_k = q - \rho_k y_k s_k^{T} q\)</span>: calculate <span class="math inline">\(\rho_k s_k^{T} q\)</span> first in <span class="math inline">\(n\)</span> multiplications to get a scalar <span class="math inline">\(\alpha\)</span>, then calculate <span class="math inline">\(q - \alpha y_k\)</span> in another <span class="math inline">\(n\)</span> multiplications.</li>
<li>Suppose <span class="math inline">\(H_k^{0}\)</span> is a diagonal matrix, so we can calculate <span class="math inline">\(H_k^{0}(V_k q)\)</span> in <span class="math inline">\(n\)</span> multiplications.</li>
<li>Multiplying by <span class="math inline">\(V_k^{T}\)</span> is analogous and needs <span class="math inline">\(2n\)</span> multiplications as well.</li>
<li>Finally, computing <span class="math inline">\(\rho_k s_k s_k^{T} \nabla f_k\)</span> and adding the results together needs another <span class="math inline">\(2n\)</span> multiplications.</li>
</ol>
<p>So this vanilla iteration requires <span class="math inline">\(7n\)</span> multiplications per level, leading to a total cost of <span class="math inline">\(7nm\)</span> multiplications.</p>
<p>Can we do better?</p>
</div>
<div id="lbfgstwolooprecursion" class="section level2">
<h2>L-BFGS: two-loop recursion</h2>
<p>L-BFGS has a <em>two-loop recursion</em> algorithm, which is quite brilliant.</p>
<pre class="julia"><code>function L_BFGS(H0, ∇f_k, s, y, ρ, k, m)
    q = ∇f_k
    α = Dict{Int,Float64}()
    for i = k - 1 : -1 : k - m        # newest pair down to oldest
        α[i] = ρ[i] * transpose(s[i]) * q
        q = q - α[i] * y[i]
    end
    w = H0 * q
    for i = k - m : k - 1             # oldest pair back to newest
        β = ρ[i] * transpose(y[i]) * w
        w = w + s[i] * (α[i] - β)
    end
    return w                          # w = H_k * ∇f_k
end</code></pre>
<p>The return value <span class="math inline">\(w = H_k \nabla f_k\)</span> is what we need,
and it requires only <span class="math inline">\(5nm\)</span> multiplications.</p>
<p>The data flow in the two-loop recursion is not easy to grasp at first glance.
To make it clearer, we expand <a href="#eq:bfgsv">(9)</a> for <span class="math inline">\(m\)</span> steps to get</p>
<span class="math display">\[\begin{aligned}
H_{k}=&\left(V_{k-1}^{T} \cdots V_{k-m}^{T}\right) H_{k}^{0}\left(V_{k-m} \cdots V_{k-1}\right) \\
&+\rho_{k-m}\left(V_{k-1}^{T} \cdots V_{k-m+1}^{T}\right) s_{k-m} s_{k-m}^{T}\left(V_{k-m+1} \cdots V_{k-1}\right) \\
&+\rho_{k-m+1}\left(V_{k-1}^{T} \cdots V_{k-m+2}^{T}\right) s_{k-m+1} s_{k-m+1}^{T}\left(V_{k-m+2} \cdots V_{k-1}\right) \\
&+\cdots \\
&+\rho_{k-2} V_{k-1}^{T} s_{k-2} s_{k-2}^{T} V_{k-1} \\
&+\rho_{k-1} s_{k-1} s_{k-1}^{T}
\end{aligned}\]</span>
<p>Some key observations to help you understand the two-loop recursion:</p>
<ul>
<li>In the first loop, <span class="math inline">\(\alpha_i = \rho_i s_i^{T} V_{i + 1} \cdots V_{k-1} \nabla f_k\)</span></li>
<li>After the first loop, <span class="math inline">\(q = V_{k-m} \cdots V_{k-1} \nabla f_k\)</span>, so <code>w = H0 * q</code> gives <span class="math inline">\(w = H_k^{0} V_{k-m} \cdots V_{k-1} \nabla f_k\)</span></li>
<li>In the second loop, after iteration <span class="math inline">\(i\)</span>, the vector <span class="math inline">\(w\)</span> will be</li>
</ul>
<span class="math display">\[\begin{aligned}
w =&\left(V_{i}^{T} \cdots V_{k-m}^{T}\right) H_{k}^{0}\left(V_{k-m} \cdots V_{k-1}\nabla f_k\right) \\
&+\rho_{k-m}\left(V_{i}^{T} \cdots V_{k-m+1}^{T}\right) s_{k-m} s_{k-m}^{T}\left(V_{k-m+1} \cdots V_{k-1}\nabla f_k\right) \\
&+\cdots \\
&+\rho_{i} s_{i} s_{i}^{T} V_{i + 1} \cdots V_{k-1} \nabla f_k
\end{aligned}\]</span>
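<p>To convince yourself the recursion is correct, you can compare its output against <span class="math inline">\(H_k \nabla f_k\)</span> where <span class="math inline">\(H_k\)</span> is built densely with (9). A Python sketch of this check (illustrative, not the post’s Julia code; the pair lists are ordered oldest first):</p>

```python
import numpy as np

def two_loop(grad_k, s_list, y_list, H0_diag):
    """L-BFGS two-loop recursion: returns H_k @ grad_k using the m most
    recent pairs (s_i, y_i), oldest first in the lists."""
    q = grad_k.copy()
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):   # newest pair to oldest
        alpha[i] = rho[i] * (s_list[i] @ q)
        q = q - alpha[i] * y_list[i]
    w = H0_diag * q                          # H_k^0 assumed (scaled) diagonal
    for i in range(len(s_list)):             # oldest pair back to newest
        beta = rho[i] * (y_list[i] @ w)
        w = w + s_list[i] * (alpha[i] - beta)
    return w

# Reference: build H_k explicitly by applying the BFGS update (9) m times
rng = np.random.default_rng(3)
n, m = 5, 3
s_list = [rng.standard_normal(n) for _ in range(m)]
y_list = [s + 0.1 * rng.standard_normal(n) for s in s_list]  # keeps y^T s > 0
g = rng.standard_normal(n)

H = np.eye(n)
for s, y in zip(s_list, y_list):
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(y, s)
    H = V.T @ H @ V + rho * np.outer(s, s)

print(np.allclose(two_loop(g, s_list, y_list, 1.0), H @ g))
```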
</div>
</div>
<div id="summary" class="section level1">
<h1>5. Summary</h1>
<p>In this article, we have covered the basic idea of the quasi-Newton method.
We carefully derived the DFP and BFGS formulas and showed how to find a descent direction with them.
We also discussed how to use the Wolfe condition and a line search to find a feasible step length.
Finally, we demonstrated how the two-loop recursion makes L-BFGS a fast and memory-efficient iteration.</p>
<p>In most of the materials I’ve found about L-BFGS, many nontrivial details are omitted.
I’ve tried to make this post as self-contained and as clear as possible.
Maybe it will help a few newcomers to this topic get past some obscure steps.</p>
<hr />
<div id="refs" class="references cslbibbody hangingindent">
<div id="reffletcher1963rapidly" class="cslentry">
Fletcher, Roger, and Michael J. D. Powell. 1963. <span>“A Rapidly Convergent Descent Method for Minimization.”</span> <em>The Computer Journal</em> 6 (2): 163–68.
</div>
<div id="refmokhtari2015global" class="cslentry">
Mokhtari, Aryan, and Alejandro Ribeiro. 2015. <span>“Global Convergence of Online Limited Memory BFGS.”</span> <em>The Journal of Machine Learning Research</em> 16 (1): 3151–81.
</div>
<div id="refnocedal2006numerical" class="cslentry">
Nocedal, Jorge, and Stephen Wright. 2006. <em>Numerical Optimization</em>. Springer Science & Business Media.
</div>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Figure 3.2 from <span class="citation"><a href="#refnocedal2006numerical" role="docbiblioref">Nocedal and Wright</a> (<a href="#refnocedal2006numerical" role="docbiblioref">2006</a>)</span>.<a href="#fnref1" class="footnoteback">↩︎</a></p></li>
<li id="fn2"><p>Figure 3.4 from <span class="citation"><a href="#refnocedal2006numerical" role="docbiblioref">Nocedal and Wright</a> (<a href="#refnocedal2006numerical" role="docbiblioref">2006</a>)</span>.<a href="#fnref2" class="footnoteback">↩︎</a></p></li>
</ol>
</div>

Better Intent Classification via BERT and MMS
https://fshen.org/en/2020/0918betterintentclassificationviabertandmms/
Fri, 18 Sep 2020 19:09:18 +0800
https://fshen.org/en/2020/0918betterintentclassificationviabertandmms/
<script src="https://fshen.org/rmarkdownlibs/headerattrs/headerattrs.js"></script>
<link href="https://fshen.org/rmarkdownlibs/anchorsections/anchorsections.css" rel="stylesheet" />
<script src="https://fshen.org/rmarkdownlibs/anchorsections/anchorsections.js"></script>
<script src="https://fshen.org/rmarkdownlibs/kePrint/kePrint.js"></script>
<link href="https://fshen.org/rmarkdownlibs/lightable/lightable.css" rel="stylesheet" />
<div id="TOC">
<ul>
<li><a href="#background">1. Background</a></li>
<li><a href="#berthigheraccuracyandlesshandcraft">2. BERT: higher accuracy and less handcrafting</a>
<ul>
<li><a href="#replacemanualfeatures">Replace manual features</a></li>
<li><a href="#replaceclassifierfordensefeature">Replace classifier for dense feature</a></li>
</ul></li>
<li><a href="#scalingwithmxnetmodelserver">3. Scaling with MXNet Model Server</a>
<ul>
<li><a href="#customization">Customization</a></li>
<li><a href="#loadtest">Load test</a></li>
</ul></li>
<li><a href="#recap">4. Recap</a></li>
</ul>
</div>
<p>In this article, I will give a brief introduction to improving intent classification using the pretrained model <em>BERT</em> and <em>MXNet Model Server (MMS)</em>. Most of this work was done last year, when I was still at my previous company, a chatbot solution provider.</p>
<div id="background" class="section level1">
<h1>1. Background</h1>
<p>Intent classification, or intent recognition, is a core part of natural language understanding (NLU). When building a chatbot, it’s usually the very first module of the whole system, helping us to figure out what users want, like making a dentist appointment, asking a question about insurance premium, etc.</p>
<p>We usually formulate intent recognition as a multi-class classification problem. It wouldn’t be a difficult problem in the context of supervised learning if we only had to deal with a single chatbot. However, as a chatbot service provider, we want to scale this procedure to different business clients, and there are some obstacles:</p>
<ul>
<li><strong>Scarcity of labeled data</strong>.
Clients often have very little, sometimes even no, dialogue data beforehand. Most of the time, we have to train a classifier with several, or at most a dozen, examples per intent. The shortage of training data rules out many powerful models, like deep neural networks.</li>
<li><strong>Lack of transferability between domains</strong>.
The <em>intents</em>, i.e. the classification labels, as well as the <em>corpus characteristics</em>, vary a lot since the clients come from different industries. We couldn’t simply reuse a financial chatbot model in an airline-booking scenario. Hence whenever a new project arrived, we had to do some amount of feature engineering and model fine-tuning manually.</li>
</ul>
<p>The previous workflow could therefore achieve reasonably high accuracy, but at the cost of manual data augmentation, handcrafted feature engineering, and classifier fine-tuning. What we wanted was a more scalable approach that works well across domains and achieves high accuracy with very little training data.</p>
<p>BERT along with MMS is a possible approach to tackle this problem.</p>
</div>
<div id="berthigheraccuracyandlesshandcraft" class="section level1">
<h1>2. BERT: higher accuracy and less handcraft</h1>
<p>Back then, <a href="https://arxiv.org/abs/1810.04805">BERT</a> was still the go-to pretrained model for language understanding problems. This huge general language model can provide a “good enough semantic representation” for arbitrary text, on top of which a more general, domain-agnostic intent model could be built.</p>
<div id="replacemanualfeatures" class="section level2">
<h2>Replace manual features</h2>
<p>The initial idea was to replace all handcrafted features — <em>keywords, POS tags, regexp patterns</em> — with BERT output, while keeping the other model components fixed.</p>
<p>I used <code>gluonnlp.model.bert.get_bert_model</code> to initialize a BERT network, and set <code>use_pooler=True</code> to obtain the pooler vector. Another option is to use all the tokens’ outputs to construct a new representation vector. I tried both, but the difference wasn’t significant.</p>
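<p>To make the two pooling choices above concrete, here is a pure-Python sketch with made-up toy vectors (not actual gluonnlp output): a sentence-level feature comes either from the pooler vector, which BERT builds on the first token’s output, or from averaging all token outputs:</p>

```python
# Toy sketch of the two sentence-representation choices discussed above.
# The "token outputs" are invented numbers standing in for the per-token
# vectors a BERT encoder would return.

def mean_pool(token_vectors):
    """Average all token output vectors into one sentence vector."""
    n, dim = len(token_vectors), len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# 3 tokens, hidden size 4 (illustrative only)
tokens = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [2.0, 2.0, 1.0, 1.0],
]

cls_based = tokens[0]         # the pooler vector is derived from the first token
averaged = mean_pool(tokens)  # alternative: pool over every token
```

<p>Either vector can then be fed to the downstream classifier; as noted above, the two choices gave similar accuracy in my experiments.</p>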
<p>While several datasets were tested in this experiment, I will only demonstrate one of them here, with all the sensitive data removed.</p>
<p>This dataset included more than ten intents; examples per intent ranged from under 10 to more than 50.</p>
<p><img src="https://fshen.org/en/20200918betterintentclassificationviabertandmms_files/figurehtml/unnamedchunk21.png" width="672" /></p>
<p>Surprisingly, using BERT as a feature didn’t directly improve the intent model’s accuracy, either in a single-feature setup replacing <code>tfidf</code>, or in a composite-feature setup replacing the manual features.</p>
</div>
<div id="replaceclassifierfordensefeature" class="section level2">
<h2>Replace classifier for dense feature</h2>
<p>Why didn’t BERT work? The reason lay in the classifier. As stated before, the scarcity of labeled data forced us to use classifiers like SVM, naive Bayes, or logistic regression. These classifiers worked well on the previous <em>sparse</em> features — tf-idf, one-hot keywords, etc. — but they couldn’t fully exploit BERT’s dense<a href="#fn1" class="footnoteref" id="fnref1"><sup>1</sup></a> features.</p>
<p>Based on this assumption, I tested a new classifier, a small fully-connected network<a href="#fn2" class="footnoteref" id="fnref2"><sup>2</sup></a>, and this setup outperformed the previous results.</p>
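<p>The classifier from footnote 2 is just a two-layer network. A minimal pure-Python sketch of its forward pass (toy weights and dimensions, standing in for the real MXNet implementation):</p>

```python
import math

def dense(v, W, b):
    """Fully-connected layer: W is output_dim x input_dim."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

def intent_probs(feature, W1, b1, W2, b2):
    # dense -> ReLU -> dense -> softmax
    return softmax(dense(relu(dense(feature, W1, b1)), W2, b2))

# Toy example: 3-dim "BERT feature", 2 hidden units, 2 intents
probs = intent_probs(
    [0.5, -1.0, 2.0],
    W1=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], b1=[0.0, 0.0],
    W2=[[1.0, 0.0], [0.0, 1.0]], b2=[0.0, 0.0],
)
```

<p>The output is a probability distribution over intents; in practice the weights are trained end-to-end on the few labeled examples per intent.</p>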
<table class="table table-striped table-hover table-condensed table-responsive" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;">feature</th>
<th style="text-align:left;">classifier</th>
<th style="text-align:right;">accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">tfidf + manual</td>
<td style="text-align:left;">svm</td>
<td style="text-align:right;">0.884615</td>
</tr>
<tr>
<td style="text-align:left;">tfidf + manual</td>
<td style="text-align:left;">dense</td>
<td style="text-align:right;">0.897436</td>
</tr>
<tr>
<td style="text-align:left;">bert</td>
<td style="text-align:left;">dense</td>
<td style="text-align:right;">0.935897</td>
</tr>
<tr>
<td style="text-align:left;">tfidf + bert</td>
<td style="text-align:left;">dense</td>
<td style="text-align:right;">0.923077</td>
</tr>
</tbody>
</table>
<p><img src="https://fshen.org/en/20200918betterintentclassificationviabertandmms_files/figurehtml/unnamedchunk31.png" width="672" /></p>
<p>Across all the experimental datasets, the pretraining approach improved accuracy by several percentage points. More importantly, it freed us from manual feature engineering for every new corpus, reducing the cost of PoC projects dramatically.</p>
</div>
</div>
<div id="scalingwithmxnetmodelserver" class="section level1">
<h1>3. Scaling with MXNet Model Server</h1>
<p>Apart from some private client projects, deploying the BERT intent model inside every chatbot was not a viable option:</p>
<ul>
<li>Huge memory usage: BERT, like other pretrained language models, is very large.</li>
<li>Inference speed: many deployed instances didn’t have a GPU, so inference would be slow.</li>
<li>Extra dependencies: deploying such a model meant introducing all the deep learning framework dependencies on the client side, which was resource-consuming and hard to maintain.</li>
</ul>
<p>Therefore, serving BERT through a standalone service was a natural idea. Other teams had explored many <a href="https://medium.com/apachemxnet/howcimpressdeliverscloudinferenceforitsimageprocessingservicesc38951732f97">model service options</a>. Since we already used <a href="https://mxnet.apache.org/">MXNet</a> and <a href="https://gluonnlp.mxnet.io/">gluonnlp</a> as our deep learning toolkit, MMS<a href="#fn3" class="footnoteref" id="fnref3"><sup>3</sup></a> became the convenient choice.</p>
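<p>From a chatbot instance, calling the service is then just an HTTP request. A hedged standard-library sketch; the <code>/predictions/&lt;model&gt;</code> path follows the endpoint visible in the load test below, while the JSON payload shape is my own assumption:</p>

```python
import json
import urllib.request

def build_predict_request(host, model_name, text):
    # Endpoint follows MMS's /predictions/<model_name> convention;
    # the {"text": ...} payload shape is illustrative, not MMS-mandated.
    url = "{}/predictions/{}".format(host, model_name)
    data = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})

req = build_predict_request("http://127.0.0.1:9123", "bertintent",
                            "I want to book a dentist appointment")
# urllib.request.urlopen(req) would send it to a running MMS instance
```

<p>Because the request carries a body, it is sent as a POST, which is what the model server expects for inference calls.</p>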
<div id="customization" class="section level2">
<h2>Customization</h2>
<p>There were some customizations in our MMS setup. The first was the trade-off between throughput and latency. A bigger batch size is more computationally efficient, but the backend service could be left waiting, since requests arrive at an unstable frequency. MMS tunes this through <code>batch_size</code> and <code>max_batch_delay</code>: a batch of up to <code>batch_size</code> requests is sent to the backend service, as long as the waiting time stays within <code>max_batch_delay</code>. These parameters were chosen carefully based on the actual service load.</p>
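<p>The batching behavior can be pictured with a small simulation (my own sketch of the semantics, not MMS code): collect requests until either the batch is full or the delay budget runs out.</p>

```python
import queue
import time

def collect_batch(request_queue, batch_size, max_batch_delay):
    """Gather up to batch_size requests, waiting at most max_batch_delay
    seconds after the first one arrives (sketch of the MMS semantics)."""
    batch = [request_queue.get()]  # block until at least one request
    deadline = time.monotonic() + max_batch_delay
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(3):
    q.put({"req_id": i})
# batch_size is never reached, so we return once max_batch_delay expires
small_batch = collect_batch(q, batch_size=8, max_batch_delay=0.05)
```

<p>With a heavy load the batch fills up quickly and latency stays low; with a light load the delay bound caps how long any single request waits.</p>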
<p>The second customization was a wrapper around the gluonnlp BERT model class. Our application had some special processing for <em>out-of-vocabulary</em> tokens, so besides the encoder tensors, we had to return the input tokens and pooler vectors as well. I therefore created a custom <code>mxnet.gluon.HybridBlock</code> subclass to wrap the BERT model, and exported the hybridized model for later deployment.</p>
</div>
<div id="loadtest" class="section level2">
<h2>Load test</h2>
<p>I did a simple load test with ApacheBench:</p>
<pre><code>ab -k -l -n 10000 -c 10 -T "application/json" \
   -p test.json http://127.0.0.1:9123/predictions/bertintent
Requests per second: 11.63 [#/sec] (mean)
Time per request: 859.758 [ms] (mean)
Time per request: 85.976 [ms] (mean, across all concurrent requests)
Transfer rate: 8346.39 [Kbytes/sec] received
3.79 kb/s sent
8350.18 kb/s total</code></pre>
<p>Well, my former team had very limited and out-of-date computing resources. On this single-GPU server, the BERT intent model added about 86 ms per request, which was not ideal but tolerable for a chatbot application.</p>
</div>
</div>
<div id="recap" class="section level1">
<h1>4. Recap</h1>
<p>Intent classification is an essential part of a chatbot system. A pretrained language model like BERT can eliminate most of the manual feature engineering for a new domain corpus and improve classification accuracy. Meanwhile, MMS provides a practical solution for deploying a huge machine learning model as a service.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Some special architectures, like Wide-and-Deep, could get better results on such sparse + dense features. I will discuss these issues in another post; this article only covers my work at my former company.<a href="#fnref1" class="footnoteback">↩︎</a></p></li>
<li id="fn2"><p>Its structure is <code>dense -> ReLU -> dense -> softmax</code>.<a href="#fnref2" class="footnoteback">↩︎</a></p></li>
<li id="fn3"><p>At the time of our experiment, MMS was still called <em>MXNet Model Server</em>. It has since been migrated to <a href="https://github.com/awslabs/multi-model-server">awslabs/multi-model-server</a>.<a href="#fnref3" class="footnoteback">↩︎</a></p></li>
</ol>
</div>

Distributed Majority Voting Algorithm in Julia
https://fshen.org/en/2020/0908distributedmajorityvotingalgorithminjulia/
Tue, 08 Sep 2020 16:05:46 +0800
https://fshen.org/en/2020/0908distributedmajorityvotingalgorithminjulia/
<script src="https://fshen.org/rmarkdownlibs/headerattrs/headerattrs.js"></script>
<link href="https://fshen.org/rmarkdownlibs/anchorsections/anchorsections.css" rel="stylesheet" />
<script src="https://fshen.org/rmarkdownlibs/anchorsections/anchorsections.js"></script>
<div id="TOC">
<ul>
<li><a href="#introduction">1. Introduction</a></li>
<li><a href="#sequentialboyermoorealgorithm">2. Sequential Boyer-Moore algorithm</a></li>
<li><a href="#distributedmajorityvotingalgorithm">3. Distributed majority voting algorithm</a>
<ul>
<li><a href="#multiprocessing">Multiprocessing</a></li>
<li><a href="#multithreading">Multi-threading</a></li>
<li><a href="#complexityanalysisfordistributedversion">Complexity analysis for distributed version</a></li>
</ul></li>
<li><a href="#benchmark">4. Benchmark</a></li>
<li><a href="#summary">5. Summary</a></li>
</ul>
</div>
<div id="introduction" class="section level1">
<h1>1. Introduction</h1>
<p>The original majority voting problem is to find the element that appears more than half of the time in a given array; if there is no such majority element, the algorithm should return an empty result.</p>
<p>There are some extended versions of this problem, like <a href="https://leetcode.com/problems/majorityelementii/">leetcode: Majority Element II</a>. We can consider a general form: for an array of <span class="math inline">\(n\)</span> elements and a given integer <span class="math inline">\(k \ge 2\)</span>, find all elements that appear more than <span class="math inline">\(n/k\)</span> times in that array. The original problem is then the <span class="math inline">\(k = 2\)</span> case.</p>
<p>In this post, I will give both sequential and distributed algorithms for this general-form majority voting problem, along with complexity analysis. I will also use this algorithm as an example to explore how to write parallel programs in Julia. It turns out to be a very pleasant journey.</p>
</div>
<div id="sequentialboyermoorealgorithm" class="section level1">
<h1>2. Sequential Boyer-Moore algorithm</h1>
<p>The brute-force solution could be sorting the array, or counting occurrences of all elements, like Python’s <code>collections.Counter</code>. However, that would cost either <span class="math inline">\(O(n\log n)\)</span> time or an extra <span class="math inline">\(O(n)\)</span> space. The <a href="https://gregable.com/2013/10/majorityvotealgorithmfindmajority.html">Boyer-Moore algorithm</a> is the classic algorithm for this problem, running in <span class="math inline">\(O(n)\)</span> time with <span class="math inline">\(O(1)\)</span> extra space.</p>
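<p>For reference, the counting baseline just mentioned really is a few lines with <code>collections.Counter</code>, at the price of <span class="math inline">\(O(n)\)</span> extra space; the Julia code in this post implements the smarter approach.</p>

```python
from collections import Counter

def majority_brute_force(arr, k=2):
    """O(n) extra-space baseline: count everything, then keep the
    elements appearing more than len(arr) / k times."""
    bar = len(arr) // k + 1
    return sorted(x for x, c in Counter(arr).items() if c >= bar)

majority_brute_force([1, 2, 1, 3, 1], 2)        # 1 appears 3 > 5/2 times
majority_brute_force([1, 2, 3, 1, 2, 1, 2], 3)  # 1 and 2 appear > 7/3 times
```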
<p>I will not elaborate on the original algorithm, since this post is about the general <span class="math inline">\(1/k\)</span> form and the distributed scenario. The core insight is the same as in the <span class="math inline">\(k=2\)</span> case: each non-majority element can “annihilate” one occurrence of a candidate, and true majority elements will still have occurrences remaining after all such annihilations.</p>
<p>For a given <span class="math inline">\(k\)</span>, there are at most <span class="math inline">\(k - 1\)</span> majority elements. We can use a hash map to record all the candidates. See the Julia code below for details.</p>
<pre class="julia"><code>function BoyerMoore(A::AbstractArray{T, 1}, k::Int=2)::Dict{T, Int} where T
    candidates = Dict{T, Int}()
    for a in A
        if length(candidates) < k - 1 || haskey(candidates, a)
            candidates[a] = get!(candidates, a, 0) + 1
        else
            to_del = Vector{T}()
            for key in keys(candidates)
                candidates[key] -= 1
                candidates[key] <= 0 && push!(to_del, key)
            end
            for key in to_del
                pop!(candidates, key)
            end
        end
    end
    return candidates
end

function majority_element(A::Vector{T}, k::Int=2)::Vector{T} where T
    @assert k >= 2 "k must be an integer no less than 2"
    candidates = BoyerMoore(A, k)
    for key in keys(candidates)
        candidates[key] = 0
    end
    for a in A
        haskey(candidates, a) && (candidates[a] += 1)
    end
    bar = div(length(A), k) + 1
    return [key for (key, v) in candidates if v >= bar]
end</code></pre>
<p>The potential update time for <code>candidates</code> in each iteration is <span class="math inline">\(O(k)\)</span>, which adds up to <span class="math inline">\(O(n k)\)</span> in total. The verification phase costs <span class="math inline">\(O(n)\)</span>. So the time complexity of the sequential general-form majority voting algorithm is <span class="math inline">\(O(n k) + O(n) = O(n k)\)</span>, and the extra space complexity is <span class="math inline">\(O(k)\)</span>.</p>
</div>
<div id="distributedmajorityvotingalgorithm" class="section level1">
<h1>3. Distributed majority voting algorithm</h1>
<p>The majority element property extends easily to a distributed setting. The candidates obtained from each subarray can be viewed as a simplified “counter”, and we can merge them to get the global candidates.</p>
<p>I will give both multiprocessing and multithreading approaches in Julia.</p>
<p>The only tricky part is the merge phase. When a distinct element shows up, instead of decreasing every candidate’s count by one, we have to subtract the current minimum count, which may come from the new element itself rather than the existing candidates. See <code>merge_candidates!</code> below.</p>
<div id="multiprocessing" class="section level2">
<h2>Multiprocessing</h2>
<p>With the help of <code>SharedArrays</code> and the <code>@distributed</code> macro, the parallel code in Julia is very neat.</p>
<pre class="julia"><code>@everywhere function merge_candidates!(X::Dict{T, Int}, Y::Dict{T, Int}, k::Int=2) where T
    for (key, v) in Y
        if length(X) < k - 1 || haskey(X, key)
            X[key] = get!(X, key, 0) + v
        else
            min_v = min(minimum(values(X)), v)
            to_del = Vector{T}()
            for a in keys(X)
                X[a] -= min_v
                X[a] <= 0 && push!(to_del, a)
            end
            for a in to_del
                pop!(X, a)
            end
            v > min_v && (X[key] = v - min_v)
        end
    end
    return X
end

function distributed_majority_element(A::Vector{T}, p::Int, k::Int=2)::Vector{T} where T
    @assert k >= 2 "k must be an integer no less than 2"
    n = length(A)
    step = n ÷ p
    A = SharedVector(A)
    candidates = @distributed merge_candidates! for i = 1:p
        left = (i - 1) * step + 1
        right = i == p ? n : i * step
        BoyerMoore(view(A, left:right), k)
    end
    global_counter = @distributed mergewith(+) for i = 1:p
        counter = Dict(key => 0 for (key, v) in candidates)
        left = (i - 1) * step + 1
        right = i == p ? n : i * step
        for a in view(A, left:right)
            haskey(counter, a) && (counter[a] += 1)
        end
        counter
    end
    bar = n ÷ k + 1
    return [key for (key, v) in global_counter if v >= bar]
end</code></pre>
</div>
<div id="multithreading" class="section level2">
<h2>Multi-threading</h2>
<p>In the recently released <code>v1.5</code>, multi-threading is no longer an experimental feature in Julia.</p>
<pre class="julia"><code>function parallel_majority_element(A::Vector{T}, p::Int, k::Int=2)::Vector{T} where T
    @assert k >= 2 "k must be an integer no less than 2"
    n = length(A)
    step = n ÷ p
    pool = Vector{Dict{T, Int}}(undef, p)
    Threads.@threads for i = 1:p
        left = (i - 1) * step + 1
        right = i == p ? n : i * step
        pool[i] = BoyerMoore(view(A, left:right), k)
    end
    candidates = reduce(merge_candidates!, pool)
    Threads.@threads for i = 1:p
        pool[i] = Dict(key => 0 for (key, v) in candidates)
        left = (i - 1) * step + 1
        right = i == p ? n : i * step
        for a in view(A, left:right)
            haskey(pool[i], a) && (pool[i][a] += 1)
        end
    end
    counter = reduce(mergewith(+), pool)
    bar = n ÷ k + 1
    return [key for (key, v) in counter if v >= bar]
end</code></pre>
</div>
<div id="complexityanalysisfordistributedversion" class="section level2">
<h2>Complexity analysis for distributed version</h2>
<p>We parallelize the procedure with <span class="math inline">\(p\)</span> processes or threads, so each subarray has <span class="math inline">\(O(\frac{n}{p})\)</span> elements and the updating time is <span class="math inline">\(O(\frac{n}{p}k)\)</span>. In <code>merge_candidates!</code>, the arguments <code>X</code> and <code>Y</code> each have at most <span class="math inline">\((k - 1)\)</span> elements, so one merge costs <span class="math inline">\(O(k^2)\)</span>, which gives <span class="math inline">\(O(k^2 \log p)\)</span> for the whole reduction.</p>
<p>The verification phase costs less than the above, and we also ignore latency and bandwidth, since they are negligible in this case. Therefore the total cost of the distributed general-form majority voting algorithm is
<span class="math display">\[ O(\frac{nk}{p} + k^2\log p) \]</span>
The extra space is also <span class="math inline">\(O(k)\)</span> on each node.</p>
</div>
</div>
<div id="benchmark" class="section level1">
<h1>4. Benchmark</h1>
<p>For a parallel computing benchmark, I started Julia on my laptop with <code>julia -p 7 -t 8</code>.</p>
<pre class="julia"><code>julia> using BenchmarkTools
julia> nprocs()
8
julia> Threads.nthreads()
8
julia> A = vcat([repeat([i], i) for i = 1:10]...);
julia> A_big = repeat(A, 2_000_000);
julia> @btime y = majority_element($A_big, 11)
3.980 s (8 allocations: 896 bytes)
5-element Array{Int64,1}:
7
9
10
8
6
julia> @btime y = distributed_majority_element($A_big, 8, 11)
4.840 s (2019 allocations: 94.38 KiB)
5-element Array{Int64,1}:
7
9
10
8
6
julia> @btime y = parallel_majority_element($A_big, 8, 11)
1.215 s (180 allocations: 25.23 KiB)
5-element Array{Int64,1}:
7
9
10
8
6</code></pre>
<p>We can see that the multiprocessing version is slower than the sequential version; the far greater number of allocations (2019 vs. 8) could be the cause.</p>
<p>For comparison, I also implemented a sequential majority voting algorithm in Python. Its execution time on the same data is <code>33.415s</code>, way behind Julia’s.</p>
<center>
<table>
<thead>
<tr class="header">
<th>lang</th>
<th>exec mode</th>
<th>time (s)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>python 3.8.5</td>
<td>sequential</td>
<td>33.415</td>
</tr>
<tr class="even">
<td>julia 1.5.1</td>
<td>sequential</td>
<td>3.980</td>
</tr>
<tr class="odd">
<td>julia 1.5.1</td>
<td>multiprocessing</td>
<td>4.840</td>
</tr>
<tr class="even">
<td>julia 1.5.1</td>
<td>multithreading</td>
<td>1.215</td>
</tr>
</tbody>
</table>
</center>
</div>
<div id="summary" class="section level1">
<h1>5. Summary</h1>
<p>I presented a general form of the majority voting algorithm, i.e. finding each element that appears in more than a <span class="math inline">\(1/k\)</span> fraction of a given array. The sequential algorithm has a time complexity of <span class="math inline">\(O(n k)\)</span>, and the distributed version costs <span class="math inline">\(O(\frac{nk}{p} + k^2\log p)\)</span>. Both use <span class="math inline">\(O(k)\)</span> extra space locally. You can find the complete Julia implementation in this <a href="https://github.com/shenfei/Highbury.jl/pull/1/files">pull request</a>. As the benchmark shows, parallel computation in Julia is easy and fast.</p>
</div>

Migrate to blogdown
https://fshen.org/en/2019/0324migratetoblogdown/
Sun, 24 Mar 2019 22:51:00 +0800
https://fshen.org/en/2019/0324migratetoblogdown/
<script src="https://fshen.org/rmarkdownlibs/headerattrs/headerattrs.js"></script>
<link href="https://fshen.org/rmarkdownlibs/anchorsections/anchorsections.css" rel="stylesheet" />
<script src="https://fshen.org/rmarkdownlibs/anchorsections/anchorsections.js"></script>
<p>Finally I have migrated both my personal and technical blogs from jekyll to <a href="https://github.com/rstudio/blogdown">blogdown</a>.
This is the first step of my plan to reorganize my whole writing system.<a href="#fn1" class="footnoteref" id="fnref1"><sup>1</sup></a></p>
<p>There are already many comprehensive articles on how to set up a blogdown site,
so I don’t plan to draft yet another step-by-step tutorial.
This post just records the main steps of my blog migration,
along with some other thoughts from the process.</p>
<div id="motivation" class="section level1">
<h1>1. Motivation</h1>
<div id="whyyoushouldstartyourownblog" class="section level2">
<h2>Why should you start your own blog?</h2>
<p>Before we dive into the technical details, a question should be asked first:
<em>why do you need your own blog?</em></p>
<p>We are living in a world where everyone wants to seize your attention within a second.
People are absorbed in short videos and timeline feeds with provocative headlines,
and if a passage runs over 280 characters, a tl;dr is expected.</p>
<p>So in 2019, a blog, especially one on your own domain, looks like quixotic behavior.
But as a native resident of the digital world and a disciple of the open web,
I do believe you should have your own blog if you are a creator.
Some benefits are obvious:</p>
<ul>
<li>Take back your own data. Don’t let tech giants tell you what to read or what to say. Keep faith in the open web.</li>
<li>Writing is a fantastic form of “output”. It helps you deepen your understanding and review your knowledge in a more systematic way.</li>
<li>Like a <a href="https://www.pottermore.com/writingbyjkrowling/pensieve">pensieve</a>, your blog provides a place to deposit your thoughts. It will pay you back in the future.</li>
<li>You may build a personal brand from this.</li>
</ul>
</div>
<div id="andwhyblogdownrmarkdown" class="section level2">
<h2>And why blogdown/RMarkdown?</h2>
<p>I’ve tried many writing tools and blog services; they all have pros and cons.
The most recent was jekyll with GitHub Pages. The markdown format and git workflow are very convenient and flexible.
However, since I know nothing about Ruby, the Gem dependencies often drove me crazy.
Rendering speed wasn’t a major concern either, until I experienced <a href="https://gohugo.io/">Hugo</a>.</p>
<p>Therefore, the <a href="https://github.com/rstudio/blogdown">blogdown</a> framework provides an excellent choice,
combining the power of <a href="https://rmarkdown.rstudio.com/">RMarkdown</a> and Hugo:</p>
<ul>
<li>Clean and more powerful markdown syntax (thanks to <a href="https://pandoc.org/">Pandoc</a>).</li>
<li>Literate programming with <a href="https://yihui.name/knitr/">knitr</a>.
You can show and run reproducible R/Python/C++/Julia code inside the document,
and render the results, diagrams, and interactive widgets along with the article,
even as inline expressions.</li>
<li><span class="math inline">\(\LaTeX\)</span> and code highlights support.</li>
<li><em>Plain text gives you freedom</em>.<a href="#fn2" class="footnoteref" id="fnref2"><sup>2</sup></a> You don’t have to bear the git mess of jupyter notebooks anymore.</li>
<li>Powerful static site generator, easy to configure and customize.</li>
</ul>
<p>Even though blogdown is written in R, you can use it without learning R programming.
Xie Yihui, the author of blogdown, has published a very nice <a href="https://bookdown.org/yihui/blogdown/">online book</a>.
You can read it and learn blogdown within a couple of hours.</p>
</div>
</div>
<div id="blogbuilding" class="section level1">
<h1>2. Blog building</h1>
<p>Of course you have to study the basics of blogdown first,
but I encourage you to get your hands dirty soon and learn the details in practice.</p>
<div id="pickahugotheme" class="section level2">
<h2>Pick a Hugo theme</h2>
<p>I have to confess that I spent too much time choosing suitable blog themes,
since I have several sites to migrate.
I highly recommend following <a href="https://bookdown.org/yihui/blogdown/otherthemes.html">Yihui’s advice</a> on themes.
And don’t panic, you can always change it later if you’re not content.
Hugo’s project structure makes this very easy.</p>
<p>After you pick a style you love, you can start your blog with just one line of code:</p>
<pre class="r"><code># change the `theme` argument to any other one you like
blogdown::new_site(theme = "kakawait/hugo-tranquilpeak-theme")</code></pre>
</div>
<div id="customizeyourblog" class="section level2">
<h2>Customize your blog</h2>
<p>It’s not a joke:</p>
<blockquote>
<p>Nearly all of my front-end knowledge comes from my impulse to modify static blog themes.
No, no JS yet; CSS alone has already driven me crazy.</p>
</blockquote>
<p>Almost every time you choose a theme, there are some aesthetic details you don’t like.
My (rational) suggestion is to always put writing before appearance. However, if you’re not afraid of HTML/CSS,
you can customize a theme as you like.</p>
<p>My situation was a little tricky: I was planning to have two <em>post sections</em>, for English and Chinese articles respectively.
So I copied <code>layouts/section/post.html</code> to <code>zh.html</code> and <code>en.html</code>. I also created a new <code>layouts/partials/li_list.html</code>, since I didn’t like the default post-list style.
The Academic theme provides a nice tag-cloud widget; however, I still wanted standalone pages for tags and categories, like the taxonomy style of my old jekyll theme <a href="http://codinfox.github.io/blog/categories/">codinfox-lanyon</a>.
So I handcrafted a <code>layouts/_default/terms.html</code> as well.</p>
<p>There were some other small modifications, like RSS links, fonts, colors, icons, etc.
Below I list some tips I’ve learned from my recent struggle with the hugo-academic theme.
I hope they will save you a minute of your time.</p>
<ul>
<li>Upgrade both blogdown and your Hugo theme to the latest version. I once thought I had caught a new bug in blogdown,
but when I went to create a minimal reproducible example, I found it had already been fixed in blogdown’s newest version.</li>
<li>Do read <a href="https://bookdown.org/yihui/blogdown/templates.html">this chapter on Hugo templates</a>;
it helps you understand some fundamental structures.</li>
<li>Keep track of the upstream theme via a git submodule, and do all your customization in separate directories under your repo’s root path.
For example, <code>layouts</code> for HTML templates, <code>static</code> or <code>assets</code> for CSS and images.</li>
<li>If you’re puzzled by some expressions, or want more functionality,
check out Hugo’s <a href="https://gohugo.io/documentation/">documentation</a>.</li>
<li>Modern browsers ship with developer tools. Use them to find out why some customization didn’t work.</li>
</ul>
</div>
<div id="migratefromjekyll" class="section level2">
<h2>Migrate from jekyll</h2>
<p>Jekyll and many other static site generators use markdown as their default format,
so it shouldn’t be difficult to translate those <code>.md</code> files into <code>.Rmd</code> ones.
If your old blog contains only a few posts, you could just modify them manually;
otherwise, you may consider writing a script to do the job.</p>
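<p>As a sketch of such a script (pure Python, assuming the jekyll posts already use YAML front matter; real posts may need extra tweaks to dates or URLs), renaming <code>.md</code> files to <code>.Rmd</code> could look like this:</p>

```python
from pathlib import Path

def migrate_posts(src_dir, dst_dir):
    """Copy every jekyll .md post into the blogdown content directory
    as an .Rmd file, preserving the contents as-is."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    migrated = []
    for post in sorted(Path(src_dir).glob("*.md")):
        target = dst / (post.stem + ".Rmd")
        target.write_text(post.read_text(encoding="utf-8"), encoding="utf-8")
        migrated.append(target.name)
    return migrated
```

<p>From there you can rebuild the site and fix whatever Pandoc complains about post by post.</p>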
</div>
<div id="buyadomainanddeployyourblog" class="section level2">
<h2>Buy a domain and deploy your blog</h2>
<p>There are many places to deploy a static site, some of them free.
But I still encourage you to buy your own domain name.
Sadly, the <em>URL</em> is not as important as before, since mobile apps have taken dominance, but it is still a core part of the open web.
Services rise and fall, while a domain name is always an identity of your own brand.</p>
<p>My old solution was to use GitHub Pages for jekyll deployment, and to customize DNS via Cloudflare,
because back then GitHub didn’t support https for custom-domain pages.</p>
<p>A new solution is to use <a href="https://www.netlify.com/">Netlify</a>,
which makes the whole process much smoother.
I will not list every step, since the configuration details may change in the future;
here is just an outline:</p>
<ul>
<li>Create a Netlify account and add your new domain to it.</li>
<li>Follow the domain verification steps, and add the DNS information to your domain registrar’s nameserver list.</li>
<li>Push your blogdown git repo to GitHub. (GitLab and Bitbucket are also supported)</li>
<li>Install Netlify app in your GitHub applications.</li>
<li>Give access to only selected repositories<a href="#fn3" class="footnoteref" id="fnref3"><sup>3</sup></a>, like the one you just pushed.</li>
<li>Add a <a href="https://www.netlify.com/docs/netlifytomlreference/"><code>netlify.toml</code></a> in your repo.</li>
<li>Create a “New site from git”, link it to your repo and domain, and remember to enable https.</li>
</ul>
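<p>For reference, a minimal <code>netlify.toml</code> for a Hugo site could look like the following; the version number is a placeholder, so pin whatever Hugo version you build with locally:</p>

```toml
[build]
  publish = "public"
  command = "hugo"

[context.production.environment]
  HUGO_VERSION = "0.54.0"
```

<p>Netlify then builds the site with the pinned Hugo version and serves the <code>public</code> directory.</p>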
<p>After all of these steps, you should head to your domain address and see your site is online.</p>
</div>
<div id="customdomainemails" class="section level2">
<h2>Custom domain emails</h2>
<p>With a custom domain name, you can also set up personal email addresses at that domain.
I used Mailgun for email forwarding before; its free plan is enough for me.
However, Mailgun’s service looks a little rough,
so I’m also considering alternatives like Fastmail or ProtonMail.</p>
</div>
</div>
<div id="whatwegot" class="section level1">
<h1>3. What we got</h1>
<p>So, what we have now is a blog on a custom domain.
You can easily write posts with RMarkdown, manage content with version control on GitHub, and deploy the site automatically via Netlify.
It will be your little garden, marking some moments in your life.</p>
<p>Finally, there are some other things we could do with a new blog, like an RSS feed, Google Analytics, etc.
But the most important thing is to develop a habit of writing.
I admit that I haven’t done well on this front.
Although I have some other places to write, I still didn’t write enough, especially on professional topics.
I’m very satisfied with the current blogdown workflow, so I hope this will be a new beginning.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Only one side-project site to go;
I guess I could finish it within an hour if I could quickly make up my mind on which Hugo theme to choose.<a href="#fnref1" class="footnoteback">↩︎</a></p></li>
<li id="fn2"><p>I will write another piece on this topic later, perhaps even a podcast episode.<a href="#fnref2" class="footnoteback">↩︎</a></p></li>
<li id="fn3"><p>Both public and private repositories are supported.
Although the final blog content will go public, you can hide your writing procedure in a private repo.<a href="#fnref3" class="footnoteback">↩︎</a></p></li>
</ol>
</div>