Study Notes on 《非线性最优化基础》 (Fundamentals of Nonlinear Optimization)
August 2, 2015
《非线性最优化基础》 (Fundamentals of Nonlinear Optimization), by 福嶋雅夫 (Masao Fukushima) (douban page: http://book.douban.com/subject/6510671/). Masao Fukushima is a professor in the Department of Systems and Mathematical Sciences, Faculty of Science and Engineering, Nanzan University, Japan; professor emeritus of Kyoto University; and a sometime visiting professor at the University of Waterloo (Canada), the University of Namur (Belgium), and the University of New South Wales (Australia). Homepage: http://www.seto.nanzan-u.ac.jp/~fuku/index.html.
These are notes from a lecture on nonlinear optimization given by Professor 冯象初 (Feng Xiangchu), Department of Mathematics, Xidian University. Homepage: http://web.xidian.edu.cn/xcfeng/.
Main contents
Theoretical foundations
- Convex functions and closed functions
- Conjugate functions
- Saddle point problems
- The Lagrange dual problem
- Generalizations of Lagrange duality
- Fenchel duality
Algorithms
- Proximal gradient methods
- Dual proximal gradient methods
- Fast proximal gradient methods
- Fast dual proximal gradient methods
Theoretical foundations
Convex functions and closed functions
Given a function $f: \mathbb{R}^n \to [-\infty, +\infty]$, the subset of $\mathbb{R}^{n+1}$
$$\mathrm{gph}\, f = \{ (x, f(x)) : x \in \mathbb{R}^n \}$$
is called the graph of $f$, while the set of all points lying on or above the graph of $f$,
$$\mathrm{epi}\, f = \{ (x, \beta) \in \mathbb{R}^n \times \mathbb{R} : f(x) \le \beta \},$$
is called the epigraph of $f$. If the epigraph $\mathrm{epi}\, f$ is a convex set, then $f$ is called a convex function.
Theorem 2.27. Let $I$ be an arbitrary nonempty index set and let $f_i: \mathbb{R}^n \to (-\infty, +\infty]$, $i \in I$, all be convex functions. Then the function $f$ defined by
$$f(x) = \sup_{i \in I} f_i(x)$$
is convex. Moreover, if $I$ is a finite index set, each $f_i$ is a proper convex function, and $\bigcap_{i \in I} \mathrm{dom}\, f_i \neq \emptyset$, then $f$ is a proper convex function.
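For instance (a standard illustration, not from the book): every affine function is convex, so by Theorem 2.27 the pointwise maximum
$$f(x) = \max_{1 \le i \le m} \{ \langle a_i, x \rangle + b_i \}$$
is a proper convex function; in particular $|x| = \max\{x, -x\}$ is convex.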
If for every sequence $\{x^k\}$ converging to $x$ we have
$$f(x) \ge \limsup_{k \to \infty} f(x^k),$$
then $f$ is said to be upper semicontinuous at $x$; conversely, when
$$f(x) \le \liminf_{k \to \infty} f(x^k)$$
holds, $f$ is said to be lower semicontinuous at $x$. If $f$ is both upper and lower semicontinuous at $x$, then $f$ is said to be continuous at $x$.
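A one-dimensional example (my illustration, not the book's): the step function
$$f(x) = \begin{cases} 0, & x \le 0 \\ 1, & x > 0 \end{cases}$$
is lower semicontinuous at $x = 0$, since $\liminf_k f(x^k) \ge 0 = f(0)$ for every $x^k \to 0$, but not upper semicontinuous there: for $x^k = 1/k$ we get $\limsup_k f(x^k) = 1 > f(0)$.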
Conjugate functions
Given a proper convex function $f: \mathbb{R}^n \to (-\infty, +\infty]$, the function $f^*$ defined by
$$f^*(\xi) = \sup_{x \in \mathbb{R}^n} \{ \langle \xi, x \rangle - f(x) \}$$
is called the conjugate function of $f$.
Theorem 2.36. The conjugate function $f^*$ of a proper convex function $f$ is always a closed proper convex function.
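As a quick check of Theorem 2.36 (a standard example, not the book's): for $f(x) = \frac{1}{2}\|x\|^2$ the supremum in the definition of $f^*$ is attained at $x = \xi$, so
$$f^*(\xi) = \sup_{x} \left\{ \langle \xi, x \rangle - \tfrac{1}{2}\|x\|^2 \right\} = \tfrac{1}{2}\|\xi\|^2,$$
which is indeed closed, proper, and convex; this $f$ is self-conjugate.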
Saddle point problems
Let $X$ and $Y$ be nonempty subsets of $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively. Given a function $K: X \times Y \to \mathbb{R}$ with domain $X \times Y$, define two functions $p$ and $d$ as follows:
$$p(x) = \sup_{y \in Y} K(x, y), \qquad d(y) = \inf_{x \in X} K(x, y).$$
Lemma 4.1. For any $x \in X$ and $y \in Y$,
$$d(y) \le K(x, y) \le p(x)$$
holds (take the infimum over $x$ and the supremum over $y$ in the definitions). Furthermore,
$$\sup_{y \in Y} d(y) \le \inf_{x \in X} p(x).$$
Theorem 4.1. A point $(\bar{x}, \bar{y}) \in X \times Y$ is a saddle point of the function $K$ if and only if $\bar{x}$ and $\bar{y}$ satisfy
$$p(\bar{x}) = \inf_{x \in X} p(x) = \sup_{y \in Y} d(y) = d(\bar{y}).$$
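A small worked example (my illustration): for $K(x, y) = x^2 - y^2$ on $X = Y = \mathbb{R}$,
$$p(x) = \sup_{y} (x^2 - y^2) = x^2, \qquad d(y) = \inf_{x} (x^2 - y^2) = -y^2,$$
so $\inf_x p(x) = 0 = \sup_y d(y)$ and $(\bar{x}, \bar{y}) = (0, 0)$ is a saddle point, as Theorem 4.1 predicts.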
The Lagrange dual problem
Consider the following nonlinear programming problem:
$$\min_{x} \ f(x) \quad \text{s.t.} \quad g(x) \le 0,$$
where $f: \mathbb{R}^n \to \mathbb{R}$ and $g: \mathbb{R}^n \to \mathbb{R}^m$.
Relaxing the constraints into the objective with a multiplier vector $\lambda \ge 0$ gives the Lagrangian $L(x, \lambda) = f(x) + \langle \lambda, g(x) \rangle$.
Lemma 4.5. The Lagrangian $L$ and the objective function $f$ are related as follows:
$$\sup_{\lambda \ge 0} L(x, \lambda) = \begin{cases} f(x), & g(x) \le 0, \\ +\infty, & \text{otherwise}. \end{cases}$$
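By Lemma 4.5, the primal problem can be written as $\inf_x \sup_{\lambda \ge 0} L(x, \lambda)$. Exchanging $\inf$ and $\sup$ gives the Lagrange dual problem (a standard step, spelled out here for completeness):
$$\theta(\lambda) = \inf_{x} L(x, \lambda), \qquad \max_{\lambda \ge 0} \ \theta(\lambda),$$
and weak duality $\sup_{\lambda \ge 0} \theta(\lambda) \le \inf \{ f(x) : g(x) \le 0 \}$ follows from Lemma 4.1 applied to $K = L$.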
Generalizations of Lagrange duality
For the primal problem, consider a function $F: \mathbb{R}^n \times \mathbb{R}^m \to [-\infty, +\infty]$ such that for each fixed $x$, $F(x, \cdot)$ is a closed proper convex function, and which satisfies
$$F(x, 0) = f(x).$$
Example 4.7. For the nonlinear programming problem above, one can construct such an $F$ from a closed proper convex function satisfying a suitable condition, defining $F$ as follows:
Fenchel duality
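In outline (the standard statement, not quoted from the book): for a proper convex function $f$ and a proper concave function $g$ on $\mathbb{R}^n$, Fenchel duality relates
$$\inf_{x} \{ f(x) - g(x) \} = \sup_{\xi} \{ g_*(\xi) - f^*(\xi) \},$$
where $g_*(\xi) = \inf_x \{ \langle \xi, x \rangle - g(x) \}$ is the concave conjugate; equality holds under a regularity condition such as $\mathrm{ri}(\mathrm{dom}\, f) \cap \mathrm{ri}(\mathrm{dom}\, g) \neq \emptyset$.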
Algorithms
1. Proximal Gradient Method
Reference: lecture notes from “02930 Algorithms for Large-Scale Convex Optimization” (DTU, 2010), taught by Per Christian Hansen (pch@imm.dtu.dk) and Professor Lieven Vandenberghe (http://www.seas.ucla.edu/~vandenbe/) at Danmarks Tekniske Universitet (http://www.kurser.dtu.dk/2010-2011/02930.aspx?menulanguage=en-GB). A download link can be found on the page of “EE227BT: Convex Optimization - Fall 2013”, taught by Laurent El Ghaoui at Berkeley (http://www.eecs.berkeley.edu/~elghaoui/Teaching/EE227A/lecture18.pdf). Both lectures reference the book “Convex Optimization” by Stephen Boyd and Lieven Vandenberghe (http://stanford.edu/~boyd/cvxbook/) and the software “CVX”, a MATLAB package for disciplined convex programming (http://cvxr.com/cvx/). A similar lecture note on the proximal gradient method appears in “EE236C - Optimization Methods for Large-Scale Systems (Spring 2013-14)” at UCLA (http://www.seas.ucla.edu/~vandenbe/ee236c.html).
Proximal mapping
The proximal mapping (or proximal operator) of a convex function $f$ is
$$\mathrm{prox}_f(x) = \operatorname*{argmin}_{u} \left\{ f(u) + \frac{1}{2}\|u - x\|_2^2 \right\}$$
Examples
1. $f(x) = 0$: $\mathrm{prox}_f(x) = x$
2. $f(x) = I_C(x)$ (indicator function of $C$): $\mathrm{prox}_f$ is projection on $C$,
$$\mathrm{prox}_f(x) = P_C(x) = \operatorname*{argmin}_{u \in C} \|u - x\|_2^2$$
3. $f(x) = t\|x\|_1$: $\mathrm{prox}_f$ is the shrinkage (soft threshold) operation,
$$\mathrm{prox}_f(x)_i = \begin{cases} x_i - t, & x_i \ge t \\ 0, & |x_i| \le t \\ x_i + t, & x_i \le -t \end{cases}$$
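Examples 2 and 3 are simple enough to state in code. Below is a minimal NumPy sketch; the box set in `project_box` is an illustrative choice of $C$, not from the notes:

```python
import numpy as np

def prox_l1(x, t):
    """Prox of f(x) = t*||x||_1: elementwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def project_box(x, lo, hi):
    """Projection onto the box C = {u : lo <= u <= hi},
    i.e. the prox of the indicator function I_C."""
    return np.clip(x, lo, hi)

x = np.array([-2.0, 0.3, 1.5])
print(prox_l1(x, 0.5))            # [-1.5  0.   1. ]
print(project_box(x, -1.0, 1.0))  # [-1.   0.3  1. ]
```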
Proximal gradient method
Unconstrained problem with cost function split in two components:
$$\min_x \ f(x) = g(x) + h(x)$$
- $g$ convex, differentiable, with $\mathrm{dom}\, g = \mathbb{R}^n$
- $h$ closed, convex, possibly nondifferentiable; $\mathrm{prox}_h$ is inexpensive
Proximal gradient algorithm:
$$x^{(k)} = \mathrm{prox}_{t_k h}\left( x^{(k-1)} - t_k \nabla g(x^{(k-1)}) \right)$$
Step size $t_k > 0$ is constant or determined by line search.
Interpretation
From the definition of the proximal operator, $x^+ = \mathrm{prox}_{th}(x - t\nabla g(x))$ satisfies
$$x^+ = \operatorname*{argmin}_u \left\{ h(u) + g(x) + \nabla g(x)^T (u - x) + \frac{1}{2t}\|u - x\|_2^2 \right\},$$
i.e., $x^+$ minimizes $h(u)$ plus a simple quadratic local model of $g(u)$ around $x$.
Examples
- gradient method: $h(x) = 0$, i.e., minimize $g(x)$:
$$x^+ = x - t\nabla g(x)$$
- gradient projection method: $h(x) = I_C(x)$, i.e., minimize $g$ over the closed convex set $C$:
$$x^+ = P_C(x - t\nabla g(x))$$
- iterative soft-thresholding: $h(x) = \|x\|_1$, i.e., minimize $g(x) + \|x\|_1$:
$$x^+ = \mathrm{prox}_{th}(x - t\nabla g(x)),$$
where
$$\mathrm{prox}_{th}(u)_i = \begin{cases} u_i - t, & u_i \ge t \\ 0, & -t \le u_i \le t \\ u_i + t, & u_i \le -t \end{cases}$$
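The iterative soft-thresholding case is easy to make concrete. Below is a minimal NumPy sketch for the lasso problem $\min_x \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$; the data, function name, and iteration count are illustrative choices, not from the notes:

```python
import numpy as np

def ista(A, b, lam, iters=500):
    """Proximal gradient (ISTA) for min 0.5*||Ax - b||^2 + lam*||x||_1.
    g(x) = 0.5*||Ax - b||^2 has Lipschitz gradient with L = ||A||_2^2,
    so the constant step size t = 1/L is valid."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                               # gradient of g
        z = x - t * grad                                       # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)  # prox of t*lam*||.||_1
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true
print(ista(A, b, lam=0.1)[:8])  # first entries should be near 1, rest near 0
```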
2. Dual Proximal Gradient Methods
Reference: L. Vandenberghe, EE236C (Spring 2013-14).
Composite structure in the dual
Consider a problem of the form
$$\min_x \ f(x) + g(Ax),$$
whose dual can be written as
$$\min_z \ f^*(-A^T z) + g^*(z).$$
The dual has the right structure for the proximal gradient method if
- the prox-operator of $g^*$ (or $g$) is cheap (closed form or simple algorithm)
- $f$ is strongly convex with parameter $\mu$ ($f(x) - \frac{\mu}{2}\|x\|_2^2$ is convex); this implies that $f^*(-A^T z)$ has Lipschitz continuous gradient ($L = \|A\|_2^2/\mu$):
$$\left\| A\nabla f^*(-A^T z) - A\nabla f^*(-A^T \tilde{z}) \right\|_2 \le \frac{\|A\|_2^2}{\mu}\|z - \tilde{z}\|_2,$$
because $\nabla f^*$ is Lipschitz continuous with constant $1/\mu$.
Dual proximal gradient update
$$z^+ = \mathrm{prox}_{t g^*}\left( z + t A \nabla f^*(-A^T z) \right)$$
Equivalent expression in terms of $\hat{x} = \nabla f^*(-A^T z)$:
$$\hat{x} = \operatorname*{argmin}_x \left\{ f(x) + \langle z, Ax \rangle \right\}, \qquad z^+ = \mathrm{prox}_{t g^*}(z + t A \hat{x})$$
1. if $f$ is separable, calculation of $\hat{x}$ decomposes into independent problems
2. step size $t$ is constant or from backtracking line search
Alternating minimization interpretation
The Moreau decomposition gives an alternate expression for the $z$-update:
$$z^+ = z + t(A\hat{x} - \hat{y}),$$
where
$$\hat{x} = \operatorname*{argmin}_x \left\{ f(x) + \langle z, Ax \rangle \right\}, \qquad \hat{y} = \operatorname*{argmin}_y \left\{ g(y) - \langle z, y \rangle + \frac{t}{2}\|A\hat{x} - y\|_2^2 \right\}.$$
Each iteration is thus an alternating minimization of:
1. the Lagrangian $f(x) + g(y) + \langle z, Ax - y \rangle$ over $x$
2. the augmented Lagrangian $f(x) + g(y) + \langle z, Ax - y \rangle + \frac{t}{2}\|Ax - y\|_2^2$ over $y$
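To make the update concrete, here is a runnable NumPy sketch for one special case of the scheme above, with my own choice of instance: $f(x) = \frac{1}{2}\|x - c\|_2^2$ (so $\mu = 1$ and $\hat{x} = c - A^T z$) and $g$ the indicator of $\{b\}$ (so $g^*(z) = \langle b, z \rangle$ and $\mathrm{prox}_{tg^*}(u) = u - tb$), which solves $\min \frac{1}{2}\|x - c\|_2^2$ subject to $Ax = b$:

```python
import numpy as np

def dual_prox_grad(A, b, c, iters=2000):
    """Dual proximal gradient sketch for
        min 0.5*||x - c||^2  s.t.  Ax = b,
    i.e. f(x) = 0.5*||x - c||^2 (strongly convex, mu = 1)
    and g = indicator of {b}, so prox_{t g*}(u) = u - t*b.
    Here grad f*(-A^T z) = c - A^T z =: xhat."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2    # t = mu / ||A||_2^2
    z = np.zeros(A.shape[0])
    for _ in range(iters):
        xhat = c - A.T @ z                 # minimize the Lagrangian over x
        z = z + t * (A @ xhat - b)         # z-update: prox_{t g*}(z + t*A*xhat)
    return c - A.T @ z

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 8)); b = rng.standard_normal(3)
c = rng.standard_normal(8)
x = dual_prox_grad(A, b, c)
print(np.linalg.norm(A @ x - b))  # should be near 0
```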
Regularized norm approximation
$$\min_x \ f(x) + \|Ax - b\|$$
is a special case with $g(y) = \|y - b\|$, whose conjugate is
$$g^*(z) = \langle b, z \rangle + I_C(z),$$
where $C$ is the unit norm ball for the dual norm $\|\cdot\|_*$.
Dual gradient projection update:
$$\hat{x} = \operatorname*{argmin}_x \left\{ f(x) + \langle z, Ax \rangle \right\}, \qquad z^+ = P_C\left( z + t(A\hat{x} - b) \right)$$
Example
With the Euclidean norm $\|Ax - b\|_2$, the dual gradient projection update is as above with $C$ the unit Euclidean norm ball in $\mathbb{R}^m$, since $\|\cdot\|_2$ is its own dual norm; the projection is $P_C(u) = u / \max\{1, \|u\|_2\}$.
Minimization over intersection of convex sets
$$\min_x \ f(x) \quad \text{s.t.} \quad x \in C_1 \cap \cdots \cap C_m$$
- $f$ strongly convex; e.g., $f(x) = \frac{1}{2}\|x - a\|_2^2$ for projecting $a$ on the intersection
- the sets $C_i$ are closed, convex, and easy to project onto
Dual proximal gradient update:
$$\hat{x} = \operatorname*{argmin}_x \left\{ f(x) + \Big\langle \sum_{i=1}^m z_i, x \Big\rangle \right\}, \qquad z_i^+ = z_i + t\hat{x} - t P_{C_i}\!\left( \frac{z_i}{t} + \hat{x} \right)$$
Decomposition of separable problems
$$\min \ \sum_{j} f_j(x_j) + g\Big( \sum_j A_j x_j \Big)$$
- each $f_j$ is strongly convex; $g$ has an inexpensive prox-operator
Dual proximal gradient update:
$$\hat{x}_j = \operatorname*{argmin}_{x_j} \left\{ f_j(x_j) + \langle z, A_j x_j \rangle \right\} \ \text{(independent problems)}, \qquad z^+ = \mathrm{prox}_{t g^*}\Big( z + t \sum_j A_j \hat{x}_j \Big)$$
3. Fast proximal gradient methods
Reference: L. Vandenberghe, EE236C (Spring 2013-14).
FISTA (basic version)
$$\min_x \ f(x) = g(x) + h(x)$$
- $g$ convex, differentiable with $\mathrm{dom}\, g = \mathbb{R}^n$
- $h$ closed, convex, with inexpensive $\mathrm{prox}_{th}$ operator
Algorithm: choose any $x^{(0)} = x^{(-1)}$; for $k \ge 1$, repeat the steps
$$y = x^{(k-1)} + \frac{k-2}{k+1}\left( x^{(k-1)} - x^{(k-2)} \right)$$
$$x^{(k)} = \mathrm{prox}_{t_k h}\left( y - t_k \nabla g(y) \right)$$
- step size $t_k$ fixed or determined by line search
The acronym stands for ‘Fast Iterative Shrinkage-Thresholding Algorithm’.
Interpretation
- the first iteration ($k = 1$) is a proximal gradient step at $y = x^{(0)}$
- the next iterations are proximal gradient steps at extrapolated points $y$
- note: $x^{(k)}$ is feasible (in $\mathrm{dom}\, h$); $y$ may be outside $\mathrm{dom}\, h$
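A runnable NumPy sketch of the basic version, reusing the lasso setup from the ISTA example above (data and iteration count are again illustrative):

```python
import numpy as np

def fista(A, b, lam, iters=500):
    """Basic FISTA for min 0.5*||Ax - b||^2 + lam*||x||_1,
    fixed step size t = 1/L with L = ||A||_2^2."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2
    x = x_prev = np.zeros(A.shape[1])
    for k in range(1, iters + 1):
        y = x + (k - 2.0) / (k + 1.0) * (x - x_prev)            # extrapolation
        z = y - t * (A.T @ (A @ y - b))                          # gradient step at y
        x_prev = x
        x = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)   # prox step
    return x
```

Compared with the ISTA sketch, only the extrapolation line changes; the improved $O(1/k^2)$ convergence rate is the payoff.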
Reformulation of FISTA
Define $\theta_k = 2/(k+1)$ and introduce an intermediate variable $v^{(k)}$.
Algorithm: choose $x^{(0)} = v^{(0)}$; for $k \ge 1$, repeat the steps
$$y = (1 - \theta_k) x^{(k-1)} + \theta_k v^{(k-1)}$$
$$x^{(k)} = \mathrm{prox}_{t_k h}\left( y - t_k \nabla g(y) \right)$$
$$v^{(k)} = x^{(k-1)} + \frac{1}{\theta_k}\left( x^{(k)} - x^{(k-1)} \right)$$
Nesterov’s second method
Algorithm: choose $x^{(0)} = v^{(0)}$; for $k \ge 1$, repeat the steps
$$y = (1 - \theta_k) x^{(k-1)} + \theta_k v^{(k-1)}$$
$$v^{(k)} = \mathrm{prox}_{(t_k/\theta_k) h}\left( v^{(k-1)} - \frac{t_k}{\theta_k} \nabla g(y) \right)$$
$$x^{(k)} = (1 - \theta_k) x^{(k-1)} + \theta_k v^{(k)}$$
Use $\theta_k = 2/(k+1)$ and $t_k = 1/L$, or one of the line search methods.
Identical to FISTA if $h(x) = 0$.
Unlike in FISTA, $y$ is feasible (in $\mathrm{dom}\, h$) if we take $x^{(0)} \in \mathrm{dom}\, h$, since every iterate is then a convex combination of points in $\mathrm{dom}\, h$.
4. Fast dual proximal gradient methods
Reference: “A Fast Dual Proximal Gradient Algorithm for Convex Minimization and Applications”, Amir Beck and Marc Teboulle, October 10, 2013.
The method applies to $\min_u \{ f(u) + g(Au) \}$ with $f$ $\sigma$-strongly convex and $g$ closed proper convex; the dual is solved by proximal gradient steps written in primal terms.

The Dual Proximal Gradient Method (DPG)
Initialization: $L \ge \|A\|_2^2/\sigma$, $y^0 \in \mathbb{R}^m$.
General step ($k \ge 0$):
$$u^k = \operatorname*{argmax}_u \left\{ \langle u, A^T y^k \rangle - f(u) \right\}$$
$$y^{k+1} = y^k - \frac{1}{L}\left( A u^k - \mathrm{prox}_{L g}\left( A u^k - L y^k \right) \right)$$

The Fast Dual-Based Proximal Gradient Method (FDPG)
Input: $L \ge \|A\|_2^2/\sigma$.
Step 0. Take $w^1 = y^0 \in \mathbb{R}^m$, $t_1 = 1$.
Step $k$. ($k \ge 1$) Compute
$$u^k = \operatorname*{argmax}_u \left\{ \langle u, A^T w^k \rangle - f(u) \right\}$$
$$y^k = w^k - \frac{1}{L}\left( A u^k - \mathrm{prox}_{L g}\left( A u^k - L w^k \right) \right)$$
$$t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$$
$$w^{k+1} = y^k + \frac{t_k - 1}{t_{k+1}}\left( y^k - y^{k-1} \right)$$
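A runnable NumPy sketch of FDPG under the assumptions above, with my own illustrative instance: $f(u) = \frac{1}{2}\|u - c\|_2^2$ (so $\sigma = 1$ and the $u$-update is $c + A^T w$) and $g = \lambda\|\cdot\|_1$ (so $\mathrm{prox}_{Lg}$ is soft-thresholding at level $L\lambda$):

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding: prox of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fdpg(A, c, lam, iters=300):
    """FDPG sketch for min_u 0.5*||u - c||^2 + lam*||A u||_1:
    f(u) = 0.5*||u - c||^2 is 1-strongly convex, so
    argmax_u {<u, A^T w> - f(u)} = c + A^T w, and
    prox_{L g} with g = lam*||.||_1 is soft-thresholding at level L*lam."""
    L = np.linalg.norm(A, 2) ** 2          # L >= ||A||_2^2 / sigma with sigma = 1
    y = y_prev = np.zeros(A.shape[0])
    w = np.zeros(A.shape[0])
    t = 1.0
    for _ in range(iters):
        u = c + A.T @ w                                    # u-update
        y_prev, y = y, w - (A @ u - soft(A @ u - L * w, L * lam)) / L
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2          # momentum parameter
        w = y + (t - 1) / t_next * (y - y_prev)            # extrapolation
        t = t_next
    return c + A.T @ y                                     # primal point from dual y

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
c = rng.standard_normal(4)
print(fdpg(A, c, lam=0.1))
```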