The Expectation Maximization Algorithm and Finite Mixture Models
July 24, 2015
The topic came out of a discussion with a junior labmate. He mentioned that an early paper by Prof. Geoffrey John McLachlan, an expert on the Expectation Maximization Algorithm (EM algorithm), already fully covered the ideas behind several recent top-journal papers. Following that lead to the professor's homepage, I found these two books: Finite Mixture Models and The EM Algorithm and Extensions.
Recent related papers by Prof. Geoffrey John McLachlan
- McLachlan, G.J. and Ng, S.K. (2009). The EM Algorithm. In The Top-Ten Algorithms in Data Mining, X. Wu and V. Kumar (Eds.). Boca Raton, Florida: Chapman & Hall/CRC, pp. 93-115.
- McLachlan, G.J., Ng, S.K., and Wang, K. (2010). Clustering of high-dimensional and correlated data. In Studies in Classification, Data Analysis, and Knowledge Organization: Data Analysis and Classification, C. Lauro, F. Palumbo, and M. Greenacre (Eds.). Berlin: Springer-Verlag, pp. 3-11.
- McLachlan, G.J. and Baek, J. (2010). Clustering of high-dimensional data via finite mixture models. In Advances in Data Analysis, Data Handling and Business Intelligence, A. Fink, B. Lausen, W. Seidel, and A. Ultsch (Eds.). Berlin: Springer-Verlag, pp. 33-44.
- Baek, J., McLachlan, G.J., and Flack, L. (2010). Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualisation of high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1298-1309.
- Nikulin, V., Huang, T.-H., Ng, S.K., Rathnayake, S.I., and McLachlan, G.J. (2011). A very fast algorithm for matrix factorization. Statistics & Probability Letters 81, 773-782.
- Lee, S.X. and McLachlan, G.J. (2011). On the fitting of mixtures of multivariate skew t-distributions via the EM algorithm. Preprint arXiv:1109.4706v2.
- Lee, S. and McLachlan, G.J. (2014). Finite mixtures of multivariate skew t-distributions: some recent and new results. Statistics and Computing 24, 181-202. See also amended version with corrections.
- Lin, T.-I., McLachlan, G.J., and Lee, S.X. (2016). Extending mixtures of factor models using the restricted multivariate skew-normal distribution. Journal of Multivariate Analysis. To appear. Preprint arXiv:1307.1748.
- McLachlan, G.J. and Lee, S.X. (2014). Comment on “Comparing two formulations of skew distributions with special reference to model-based clustering” by A. Azzalini, R. Browne, M. Genton, and P. McNicholas. Preprint arXiv:1404.1733.
- Lee, S.X., McLachlan, G.J., and Pyne, S. (2014). Supervised classification of flow cytometric samples via the joint clustering and matching (JCM) procedure. Preprint arXiv:1411.0685.
Gaussian Mixture Models
In applied work, the term heard most often is the Gaussian Mixture Model (GMM). The simplest case is the single Gaussian model, which assumes that a multivariate variable follows one Gaussian distribution. Viewed geometrically, the density contours of a Gaussian are approximately elliptical in two dimensions and ellipsoidal in three. When a single Gaussian is not expressive enough to describe the distribution of a high-dimensional random variable, a weighted combination of several Gaussians is used to approximate it. By analogy with wavelet decomposition, given enough Gaussian components one can approximate essentially any distribution over high-dimensional data. A further motivation comes from the central limit theorem: the sum (or mean) of many independent and identically distributed (i.i.d.) random variables is approximately Gaussian once the sample is large enough.
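As a small illustration (not from the original post), the following Python sketch evaluates a one-dimensional Gaussian mixture density as a weighted sum of component densities. The weights, means, and standard deviations are made-up values for the example.

```python
# Minimal sketch: a 1-D Gaussian mixture density as a weighted sum of
# component densities. All parameter values below are illustrative.
import numpy as np
from scipy.stats import norm

weights = np.array([0.5, 0.3, 0.2])   # mixing weights pi_k (sum to 1)
means   = np.array([-2.0, 0.0, 3.0])  # component means mu_k
stds    = np.array([0.8, 1.0, 1.5])   # component standard deviations sigma_k

def gmm_pdf(x):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

xs = np.linspace(-6, 8, 5)
print(gmm_pdf(xs))  # mixture density evaluated at a few points
```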
GMM and the K-means algorithm are similar in many respects. K-means alternates between assigning each sample to its nearest cluster center and recomputing each center as the mean of the samples assigned to it. A GMM instead estimates a mean and covariance for each Gaussian component and computes, for every sample, the probability that it belongs to each component. In that sense, a GMM can be viewed as a soft, probabilistic version of K-means. K-means behaves like an identification/classification method, assigning each sample to exactly one of the K cluster centers; the corresponding GMM behaves like a verification method, computing for each sample the probability of belonging to each of the K Gaussian components.
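The contrast can be made concrete with a short sketch: for the same hypothetical 1-D samples and centers, K-means-style hard assignment picks a single nearest center per sample, while GMM-style responsibilities give each sample a probability for every component. The data, centers, and shared variance below are assumptions for illustration only.

```python
# Hard assignment (K-means style) vs. soft responsibilities (GMM style).
import numpy as np
from scipy.stats import norm

x = np.array([-2.1, -0.3, 0.2, 2.9])   # hypothetical samples
centers = np.array([-2.0, 0.0, 3.0])   # hypothetical cluster centers
sigma = 1.0                            # assumed shared standard deviation
weights = np.array([1/3, 1/3, 1/3])    # equal mixing weights

# K-means: each sample goes to the single nearest center.
hard = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)

# GMM: each sample gets a probability of belonging to each component.
dens = weights * norm.pdf(x[:, None], loc=centers[None, :], scale=sigma)
soft = dens / dens.sum(axis=1, keepdims=True)

print(hard)   # hard labels, e.g. [0 1 1 2]
print(soft)   # responsibilities; each row sums to 1
```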
Given samples together with their class labels, the GMM parameters can be estimated by maximum likelihood (ML). When the labels are unknown, the Expectation Maximization (EM) algorithm is used instead, treating each sample's component membership probabilities as latent quantities. Because the likelihood involves products of many tiny probabilities, working with log-probabilities greatly improves numerical stability.
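Below is a minimal EM sketch for a two-component 1-D mixture, under assumed synthetic data and initial values. It illustrates both the E-step/M-step iteration and the use of log-densities with logsumexp so that products of small probabilities do not underflow.

```python
# Minimal EM for a 1-D Gaussian mixture; data and initial values are
# synthetic and illustrative, not taken from the original post.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

K = 2
pi = np.full(K, 1.0 / K)            # mixing weights
mu = np.array([-1.0, 1.0])          # initial means
sigma = np.array([1.0, 1.0])        # initial standard deviations

for _ in range(50):
    # E-step: log responsibilities, stabilized with logsumexp.
    log_p = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    log_resp = log_p - logsumexp(log_p, axis=1, keepdims=True)
    resp = np.exp(log_resp)

    # M-step: re-estimate weights, means, and standard deviations.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)  # should approach the generating parameters
```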