Advanced Optimization, Lecture 8: Proximal Gradient Descent (and Acceleration) - Hoàng Nam Dũng

Proximal Gradient Descent (and Acceleration)
Hoàng Nam Dũng
Khoa Toán - Cơ - Tin học, Đại học Khoa học Tự nhiên, Đại học Quốc gia Hà Nội

Last time: subgradient method

Consider the problem
    min_x f(x)
with f convex and dom(f) = Rⁿ.
Subgradient method: choose an initial x^(0) ∈ Rⁿ and repeat
    x^(k) = x^(k−1) − t_k · g^(k−1),  k = 1, 2, 3, . . .
where g^(k−1) ∈ ∂f(x^(k−1)). We use pre-set rules for the step sizes (e.g., the diminishing step size rule).

If f is Lipschitz, then the subgradient method has convergence rate O(1/ε²).
Upside: very generic. Downside: can be slow — addressed today.

Outline

Today

▶ Proximal gradient descent
▶ Convergence analysis
▶ ISTA, matrix completion
▶ Special cases
▶ Acceleration

Decomposable functions

Suppose
    f(x) = g(x) + h(x),
where
▶ g is convex, differentiable, dom(g) = Rⁿ
▶ h is convex, not necessarily differentiable.

If f were differentiable, then the gradient descent update would be
    x⁺ = x − t · ∇f(x).

Recall the motivation: minimize the quadratic approximation f̃_t to f around x, replacing ∇²f(x) by (1/t)I:
    x⁺ = argmin_z f̃_t(z) := f(x) + ∇f(x)ᵀ(z − x) + (1/(2t))‖z − x‖₂².

Decomposable functions

In our case f is not differentiable, but f = g + h with g differentiable. Why don't we make a quadratic approximation to g and leave h alone? I.e., update
    x⁺ = argmin_z g̃_t(z) + h(z)
       = argmin_z g(x) + ∇g(x)ᵀ(z − x) + (1/(2t))‖z − x‖₂² + h(z)
       = argmin_z (1/(2t))‖z − (x − t∇g(x))‖₂² + h(z).

The term (1/(2t))‖z − (x − t∇g(x))‖₂² keeps z close to the gradient update for g, while h(z) also keeps h small.

Proximal mapping

The proximal mapping (or prox-operator) of a convex function h is defined as
    prox_h(x) = argmin_z (1/2)‖x − z‖₂² + h(z).

Examples:
▶ h(x) = 0: prox_h(x) = x.
▶ h(x) is the indicator function of a closed convex set C: prox_h is the projection onto C,
    prox_h(x) = argmin_{z∈C} (1/2)‖x − z‖₂² = P_C(x).
▶ h(x) = ‖x‖₁: prox_h is the 'soft-threshold' (shrinkage) operation,
    prox_h(x)_i = x_i − 1 if x_i ≥ 1;  0 if |x_i| ≤ 1;  x_i + 1 if x_i ≤ −1.

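As a quick aside (not from the original slides), the soft-thresholding prox is one line of NumPy. A minimal sketch, with prox_l1 a name of our choosing; the slide's example is the special case lam = 1:

    import numpy as np

    def prox_l1(x, lam=1.0):
        # Elementwise soft-thresholding: the prox of lam * ||.||_1
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
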
Proximal mapping

Theorem
If h is convex and closed (has closed epigraph), then
    prox_h(x) = argmin_z (1/2)‖x − z‖₂² + h(z)
exists and is unique for all x.

Proof.
See proxop.pdf. Uniqueness holds since the objective function is strictly convex.

Optimality condition:
    z = prox_h(x) ⇔ x − z ∈ ∂h(z)
                  ⇔ h(u) ≥ h(z) + (x − z)ᵀ(u − z) for all u.

Properties of proximal mapping

Theorem
Proximal mappings are firmly nonexpansive (co-coercive with constant 1):
    (prox_h(x) − prox_h(y))ᵀ(x − y) ≥ ‖prox_h(x) − prox_h(y)‖₂².

Proof.
With u = prox_h(x) and v = prox_h(y) we have x − u ∈ ∂h(u) and y − v ∈ ∂h(v). From the monotonicity of the subdifferential we get
    ((x − u) − (y − v))ᵀ(u − v) ≥ 0,
which rearranges to the inequality above.

From firm nonexpansiveness and the Cauchy-Schwarz inequality we get nonexpansiveness (Lipschitz continuity with constant 1):
    ‖prox_h(x) − prox_h(y)‖₂ ≤ ‖x − y‖₂.

Proximal gradient descent

Proximal gradient descent: choose an initial x^(0), repeat
    x^(k) = prox_{t_k h}(x^(k−1) − t_k · ∇g(x^(k−1))),  k = 1, 2, 3, . . .

To make this update step look familiar, we can rewrite it as
    x^(k) = x^(k−1) − t_k · G_{t_k}(x^(k−1)),
where G_t is the generalized gradient of f,
    G_t(x) = (x − prox_{th}(x − t∇g(x))) / t.

For h = 0 it is gradient descent.

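A minimal sketch of this iteration in Python, assuming user-supplied callables grad_g and prox_h, where prox_h(x, t) computes prox_{th}(x); these names are placeholders, not from the slides:

    def proximal_gradient(x0, grad_g, prox_h, t, num_iters=100):
        # x^(k) = prox_{t h}(x^(k-1) - t * grad g(x^(k-1))), fixed step size t
        x = x0
        for _ in range(num_iters):
            x = prox_h(x - t * grad_g(x), t)
        return x

With prox_h returning its first argument unchanged (h = 0), this reduces to plain gradient descent.
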
Examples

    minimize g(x) + h(x)

Gradient method: special case with h(x) = 0,
    x⁺ = x − t∇g(x).

Gradient projection method: special case with h(x) = δ_C(x) (indicator of C),
    x⁺ = P_C(x − t∇g(x)).

[Figure: the gradient step x − t∇g(x) may leave C; x⁺ is its projection back onto C.]

What good did this do?

You have a right to be suspicious ... it may look like we just swapped one minimization problem for another.

The key point is that prox_h(·) can be computed analytically for many important functions h.

Note:
▶ The mapping prox_h(·) doesn't depend on g at all, only on h.
▶ The smooth part g can be complicated; we only need to compute its gradients.

Convergence analysis will be in terms of the number of iterations of the algorithm. Each iteration evaluates prox_h(·) once, and this can be cheap or expensive depending on h.

Example: ISTA (Iterative Shrinkage-Thresholding Algorithm)

Given y ∈ Rⁿ, X ∈ Rⁿˣᵖ, recall the lasso criterion
    f(β) = (1/2)‖y − Xβ‖₂² + λ‖β‖₁,
with smooth part g(β) = (1/2)‖y − Xβ‖₂² and nonsmooth part h(β) = λ‖β‖₁.

The proximal mapping is now
    prox_{th}(β) = argmin_z (1/(2t))‖β − z‖₂² + λ‖z‖₁ = S_{λt}(β),
where S_λ(β) is the soft-thresholding operator
    [S_λ(β)]_i = β_i − λ if β_i > λ;  0 if −λ ≤ β_i ≤ λ;  β_i + λ if β_i < −λ,
for i = 1, . . . , p.

Example: ISTA (Iterative Shrinkage-Thresholding Algorithm)

Recall ∇g(β) = −Xᵀ(y − Xβ), hence the proximal gradient update is
    β⁺ = S_{λt}(β + tXᵀ(y − Xβ)).

Often called the iterative soft-thresholding algorithm (ISTA)². A very simple algorithm.

[Figure: f − f* versus iteration k for the lasso, comparing the subgradient method and proximal gradient (ISTA); ISTA converges much faster.]

²Beck and Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems".

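Putting the pieces together, a sketch of ISTA for the lasso in NumPy, assuming the standard fixed step size t = 1/L with L = ‖X‖₂² (the Lipschitz constant of ∇g); the function names are ours:

    import numpy as np

    def soft_threshold(x, tau):
        # Vector soft-thresholding S_tau(x), applied elementwise
        return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

    def ista(X, y, lam, num_iters=500):
        # beta^+ = S_{lam*t}(beta + t * X^T (y - X beta)), t = 1/||X||_2^2
        t = 1.0 / np.linalg.norm(X, 2) ** 2
        beta = np.zeros(X.shape[1])
        for _ in range(num_iters):
            beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
        return beta
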
Backtracking line search

Backtracking for proximal gradient descent works similarly to before (in gradient descent), but operates on g and not f.

Choose parameter 0 < β < 1. At each iteration, start at t = t_init, and while
    g(x − tG_t(x)) > g(x) − t∇g(x)ᵀG_t(x) + (t/2)‖G_t(x)‖₂²,
shrink t = βt. Else perform the proximal gradient update.

(Alternative formulations exist that require less computation, i.e., fewer calls to prox.)

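One way this rule could be coded, as a sketch under the same assumptions as before (g, grad_g, prox_h user-supplied, with prox_h(x, t) = prox_{th}(x)):

    def backtracking_prox_step(x, g, grad_g, prox_h, t_init=1.0, beta=0.5):
        # Shrink t until the sufficient-decrease condition on g holds,
        # then return the accepted proximal gradient step x - t*G_t(x)
        t = t_init
        while True:
            x_new = prox_h(x - t * grad_g(x), t)
            G = (x - x_new) / t                    # generalized gradient G_t(x)
            if g(x_new) <= g(x) - t * (grad_g(x) @ G) + (t / 2) * (G @ G):
                return x_new
            t = beta * t

Note that each trial value of t costs one prox evaluation, which is exactly the expense the alternative formulations mentioned above try to avoid.
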
Convergence analysis

For the criterion f(x) = g(x) + h(x), we assume:
▶ g is convex, differentiable, dom(g) = Rⁿ, and ∇g is Lipschitz continuous with constant L > 0.
▶ h is convex, and prox_{th}(x) = argmin_z {‖x − z‖₂²/(2t) + h(z)} can be evaluated.

Theorem
Proximal gradient descent with fixed step size t ≤ 1/L satisfies
    f(x^(k)) − f* ≤ ‖x^(0) − x*‖₂² / (2tk),
and the same result holds for backtracking with t replaced by β/L.

Proximal gradient descent has convergence rate O(1/k), i.e., O(1/ε). Same as gradient descent! (But remember, the prox cost matters ...)

Proof: See lectures/proxgrad.pdf.

Example: matrix completion

Given a matrix Y ∈ Rᵐˣⁿ, we only observe the entries Y_ij, (i, j) ∈ Ω. Suppose we want to fill in the missing entries (e.g., for a recommender system), so we solve a matrix completion problem³
    min_B (1/2) Σ_{(i,j)∈Ω} (Y_ij − B_ij)² + λ‖B‖_tr.

Here ‖B‖_tr is the trace (or nuclear) norm of B,
    ‖B‖_tr = Σ_{i=1}^{r} σ_i(B),
where r = rank(B) and σ₁(B) ≥ · · · ≥ σ_r(B) ≥ 0 are the singular values⁴.

³Wikipedia: in the case of the Netflix problem the ratings matrix is expected to be low-rank, since user preferences can often be described by a few factors, such as the movie genre and time of release.
⁴https://math.berkeley.edu/~hutching/teach/54-2017/svd-notes.pdf

Example: matrix completion

Define P_Ω, the projection operator onto the observed set:
    [P_Ω(B)]_ij = B_ij if (i, j) ∈ Ω;  0 if (i, j) ∉ Ω.

Then the criterion is
    f(B) = (1/2)‖P_Ω(Y) − P_Ω(B)‖_F² + λ‖B‖_tr,
with g(B) = (1/2)‖P_Ω(Y) − P_Ω(B)‖_F² and h(B) = λ‖B‖_tr.

Two ingredients are needed for proximal gradient descent:
▶ Gradient calculation:
    ∇g(B) = −(P_Ω(Y) − P_Ω(B)).
▶ Prox function:
    prox_{th}(B) = argmin_Z (1/(2t))‖B − Z‖_F² + λ‖Z‖_tr.

Example: matrix completion

Claim: prox_{th}(B) = S_{λt}(B), matrix soft-thresholding at the level λt. Here S_λ(B) is defined by
    S_λ(B) = UΣ_λVᵀ,
where B = UΣVᵀ is an SVD, and Σ_λ is diagonal with
    (Σ_λ)_ii = max{Σ_ii − λ, 0}.

Proof: note that prox_{th}(B) = Z, where Z satisfies
    0 ∈ Z − B + λt · ∂‖Z‖_tr.
Helpful fact: if Z = UΣVᵀ, then
    ∂‖Z‖_tr = {UVᵀ + W : ‖W‖_op ≤ 1, UᵀW = 0, WV = 0}.
Now plug in Z = S_{λt}(B) and check that we can get 0.

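A sketch of matrix soft-thresholding via the SVD (NumPy; the helper name is ours):

    import numpy as np

    def matrix_soft_threshold(B, tau):
        # S_tau(B) = U diag(max(sigma_i - tau, 0)) V^T
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt
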
Example: matrix completion

Hence the proximal gradient update step is
    B⁺ = S_{λt}(B + t(P_Ω(Y) − P_Ω(B))).

Note that ∇g(B) is Lipschitz continuous with L = 1, so we can choose the fixed step size t = 1. The update step is now
    B⁺ = S_λ(P_Ω(Y) + P_Ω^⊥(B)),
where P_Ω^⊥ projects onto the unobserved set, so P_Ω(B) + P_Ω^⊥(B) = B.

This is the soft-impute algorithm⁵, a simple and effective method for matrix completion.

⁵Mazumder et al. (2011), "Spectral regularization algorithms for learning large incomplete matrices".

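A sketch of the resulting soft-impute iteration, reusing matrix_soft_threshold from the previous sketch; mask is assumed to be a boolean array marking the observed entries Ω:

    import numpy as np

    def soft_impute(Y, mask, lam, num_iters=100):
        B = np.zeros_like(Y)
        for _ in range(num_iters):
            filled = np.where(mask, Y, B)           # P_Omega(Y) + P_Omega^perp(B)
            B = matrix_soft_threshold(filled, lam)  # threshold singular values at lam
        return B
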
Special cases

Proximal gradient descent is also called composite gradient descent, or generalized gradient descent.

Why "generalized"? This refers to the several special cases when minimizing f = g + h:
▶ h = 0 – gradient descent
▶ h = I_C – projected gradient descent
▶ g = 0 – proximal minimization algorithm.

Therefore these algorithms all have O(1/ε) convergence rate.

Projected gradient descent

Given a closed, convex set C ⊆ Rⁿ,
    min_{x∈C} g(x) ⟺ min_x g(x) + I_C(x),
where I_C(x) = 0 if x ∈ C; ∞ if x ∉ C, is the indicator function of C.

We have
    prox_{tI_C}(x) = argmin_z (1/(2t))‖x − z‖₂² + I_C(z)
                   = argmin_{z∈C} ‖x − z‖₂²,
i.e., prox_{tI_C}(x) = P_C(x), the projection operator onto C.

Projected gradient descent

Therefore the proximal gradient update step is
    x⁺ = P_C(x − t∇g(x)),
i.e., perform the usual gradient update and then project back onto C. Called projected gradient descent.

[Figure: iterates of projected gradient descent in the plane; each gradient step is followed by a projection back onto C.]

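A sketch of projected gradient descent, using the unit Euclidean ball as a set C whose projection has a closed form (the names are ours):

    import numpy as np

    def projected_gradient(x0, grad_g, project, t, num_iters=100):
        # Usual gradient step, then project back onto C
        x = x0
        for _ in range(num_iters):
            x = project(x - t * grad_g(x))
        return x

    # Example: minimize ||x - a||_2^2 over the unit ball; the minimizer is a/||a||_2
    a = np.array([2.0, 1.0])
    project_ball = lambda z: z / max(1.0, np.linalg.norm(z))
    x_hat = projected_gradient(np.zeros(2), lambda x: 2 * (x - a), project_ball, t=0.25)
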
Proximal minimization algorithm

Consider, for h convex (not necessarily differentiable),
    min_x h(x).

The proximal gradient update step is just
    x⁺ = argmin_z (1/(2t))‖x − z‖₂² + h(z).

Called the proximal minimization algorithm. Faster than the subgradient method, but not implementable unless we know the prox in closed form.

What happens if we can't evaluate prox?

The theory for proximal gradient, with f = g + h, assumes that the prox function can be evaluated, i.e., assumes that the minimization
    prox_{th}(x) = argmin_z (1/(2t))‖x − z‖₂² + h(z)
can be done exactly. In general, it is not clear what happens if we just minimize this approximately.

But, if you can precisely control the errors in approximating the prox operator, then you can recover the original convergence rates⁶. In practice, if the prox evaluation is done approximately, then it should be done to decently high accuracy.

⁶Schmidt et al. (2011), "Convergence rates of inexact proximal-gradient methods for convex optimization".

Acceleration

It turns out we can accelerate proximal gradient descent to achieve the optimal O(1/√ε) convergence rate. Four ideas (three acceleration methods) by Nesterov:
▶ 1983: original acceleration idea for smooth functions
▶ 1988: another acceleration idea for smooth functions
▶ 2005: smoothing techniques for nonsmooth functions, coupled with the original acceleration idea
▶ 2007: acceleration idea for composite functions⁷.

We will follow Beck and Teboulle (2008), an extension of Nesterov (1983) to composite functions⁸.

⁷Each step uses the entire history of previous steps and makes two prox calls.
⁸Each step uses information from the last two steps and makes one prox call.

Accelerated proximal gradient method

As before, consider
    min_x g(x) + h(x),
with g convex and differentiable, and h convex.

Accelerated proximal gradient method: choose an initial point x^(0) = x^(−1) ∈ Rⁿ and repeat, for k = 1, 2, 3, . . . :
    v = x^(k−1) + ((k − 2)/(k + 1))(x^(k−1) − x^(k−2))
    x^(k) = prox_{t_k h}(v − t_k ∇g(v)).

▶ The first step k = 1 is just the usual proximal gradient update.
▶ After that, v = x^(k−1) + ((k − 2)/(k + 1))(x^(k−1) − x^(k−2)) carries some "momentum" from previous iterations.
▶ h = 0 gives the accelerated gradient method.

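A sketch of the accelerated iteration with a fixed step size (same assumed callables as in the earlier sketches). Initializing x = x_prev makes the first step a plain proximal gradient update, matching x^(0) = x^(−1):

    def accelerated_proximal_gradient(x0, grad_g, prox_h, t, num_iters=100):
        x = x_prev = x0
        for k in range(1, num_iters + 1):
            v = x + (k - 2) / (k + 1) * (x - x_prev)   # momentum step
            x_prev = x
            x = prox_h(v - t * grad_g(v), t)
        return x
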
Accelerated proximal gradient method

[Figure: momentum weights (k − 2)/(k + 1) versus k; the weights start at −1/2 for k = 1 and increase toward 1.]

Accelerated proximal gradient method

Back to the lasso example: acceleration can really help!

[Figure: f − f* versus iteration k for the subgradient method, proximal gradient, and Nesterov acceleration; the accelerated method converges fastest.]

Note: accelerated proximal gradient is not a descent method.

Backtracking line search

Backtracking can be combined with acceleration in different ways.

A simple approach: fix β < 1, t₀ = 1. At iteration k, start with t = t_{k−1}, and while
    g(x⁺) > g(v) + ∇g(v)ᵀ(x⁺ − v) + (1/(2t))‖x⁺ − v‖₂²,
shrink t = βt and recompute x⁺ = prox_{th}(v − t∇g(v)). Else keep x⁺.

Note that this strategy forces us to take decreasing step sizes ... (more complicated strategies exist which avoid this).

Convergence analysis

For the criterion f(x) = g(x) + h(x), we assume as before:
▶ g is convex, differentiable, dom(g) = Rⁿ, and ∇g is Lipschitz continuous with constant L > 0.
▶ h is convex, and prox_{th}(x) = argmin_z {‖x − z‖₂²/(2t) + h(z)} can be evaluated.

Theorem
The accelerated proximal gradient method with fixed step size t ≤ 1/L satisfies
    f(x^(k)) − f* ≤ 2‖x^(0) − x*‖₂² / (t(k + 1)²),
and the same result holds for backtracking, with t replaced by β/L.

This achieves the optimal rate O(1/k²), i.e., O(1/√ε), for first-order methods.

FISTA (Fast ISTA)

Back to the lasso problem
    min_β (1/2)‖y − Xβ‖₂² + λ‖β‖₁.

Recall ISTA (Iterative Soft-thresholding Algorithm):
    β^(k) = S_{λt_k}(β^(k−1) + t_k Xᵀ(y − Xβ^(k−1))),  k = 1, 2, 3, . . . ,
with S_λ(·) being vector soft-thresholding.

Applying acceleration gives us FISTA (F is for Fast)⁹: for k = 1, 2, 3, . . . ,
    v = β^(k−1) + ((k − 2)/(k + 1))(β^(k−1) − β^(k−2))
    β^(k) = S_{λt_k}(v + t_k Xᵀ(y − Xv)).

⁹Beck and Teboulle (2008) actually call their general acceleration technique (for general g, h) FISTA, which may be somewhat confusing.

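A sketch of FISTA for the lasso, reusing soft_threshold from the ISTA sketch and the fixed step size t = 1/‖X‖₂²:

    import numpy as np

    def fista(X, y, lam, num_iters=500):
        t = 1.0 / np.linalg.norm(X, 2) ** 2
        beta = beta_prev = np.zeros(X.shape[1])
        for k in range(1, num_iters + 1):
            v = beta + (k - 2) / (k + 1) * (beta - beta_prev)  # momentum
            beta_prev = beta
            beta = soft_threshold(v + t * X.T @ (y - X @ v), lam * t)
        return beta
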
ISTA vs. FISTA

Lasso regression, 100 instances (with n = 100, p = 500):

[Figure: f(k) − f* versus iteration k; FISTA converges much faster than ISTA.]

ISTA vs. FISTA

Lasso logistic regression, 100 instances (n = 100, p = 500):

[Figure: f(k) − f* versus iteration k; again FISTA converges much faster than ISTA.]

Is acceleration always useful?

Acceleration can be a very effective speedup tool ... but should it always be used?

In practice, the speedup from acceleration is diminished in the presence of warm starts. E.g., suppose we want to solve the lasso problem for tuning parameter values
    λ₁ > λ₂ > · · · > λ_r.
▶ When solving for λ₁, initialize x^(0) = 0, and record the solution x̂(λ₁).
▶ When solving for λ_j, initialize x^(0) = x̂(λ_{j−1}), the recorded solution for λ_{j−1}.

Over a fine enough grid of λ values, proximal gradient descent can often perform just as well without acceleration (see the sketch below).

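A sketch of the warm-start strategy, reusing soft_threshold from the ISTA sketch; the routine name and iteration budget are our choices:

    import numpy as np

    def lasso_path(X, y, lambdas, num_iters=200):
        # Solve over a decreasing grid of lambda values, warm-starting each
        # solve at the recorded solution for the previous lambda
        t = 1.0 / np.linalg.norm(X, 2) ** 2
        beta = np.zeros(X.shape[1])                 # x^(0) = 0 for lambda_1
        path = {}
        for lam in sorted(lambdas, reverse=True):
            for _ in range(num_iters):
                beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
            path[lam] = beta.copy()                 # recorded solution beta_hat(lambda)
        return path
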
Is acceleration always useful?

Sometimes backtracking and acceleration can be disadvantageous! Recall the matrix completion problem: the proximal gradient update is
    B⁺ = S_λ(B + t(P_Ω(Y) − P_Ω(B))),
where S_λ is the matrix soft-thresholding operator ... which requires an SVD.

▶ One backtracking loop evaluates the generalized gradient G_t(x), i.e., evaluates prox_t(x), across various values of t. For matrix completion, this means multiple SVDs ...
▶ Acceleration changes the argument we pass to the prox: v − t∇g(v) instead of x − t∇g(x). For matrix completion (and t = 1),
    B − ∇g(B) = P_Ω(Y) + P_Ω^⊥(B), sparse plus low rank ⇒ fast SVD,
    V − ∇g(V) = P_Ω(Y) + P_Ω^⊥(V), sparse plus not necessarily low rank ⇒ slow SVD.

References and further reading

Nesterov's four ideas (three acceleration methods):
▶ Y. Nesterov (1983), "A method for solving a convex programming problem with convergence rate O(1/k²)"
▶ Y. Nesterov (1988), "On an approach to the construction of optimal methods of minimization of smooth convex functions"
▶ Y. Nesterov (2005), "Smooth minimization of non-smooth functions"
▶ Y. Nesterov (2007), "Gradient methods for minimizing composite objective function"

References and further reading

Extensions and/or analyses:
▶ A. Beck and M. Teboulle (2008), "A fast iterative shrinkage-thresholding algorithm for linear inverse problems"
▶ S. Becker, J. Bobin and E. Candes (2009), "NESTA: a fast and accurate first-order method for sparse recovery"
▶ P. Tseng (2008), "On accelerated proximal gradient methods for convex-concave optimization"

References and further reading

Helpful lecture notes/books:
▶ E. Candes, Lecture notes for Math 301, Stanford University, Winter 2010-2011
▶ Y. Nesterov (1998), "Introductory lectures on convex optimization: a basic course", Chapter 2
▶ L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011-2012
