The Expectation Maximization Algorithm and Finite Mixture Models
July 24, 2015
The topic came out of a discussion with a junior labmate. He mentioned that an early paper by Prof. Geoffrey John McLachlan, an expert on the Expectation Maximization Algorithm (EM algorithm), already fully covered the ideas behind several recent top-journal papers. Following that lead to the professor's homepage, I found these two books: Finite Mixture Models and The EM Algorithm and Extensions.
Recent related papers by Prof. Geoffrey John McLachlan
- McLachlan, G.J. and Ng, S.K. (2009). The EM Algorithm. In The Top-Ten Algorithms in Data Mining, X. Wu and V. Kumar (Eds.). Boca Raton, Florida: Chapman & Hall/CRC, pp. 93-115.
- McLachlan, G.J., Ng, S.K., and Wang, K. (2010). Clustering of high-dimensional and correlated data. In Studies in Classification, Data Analysis, and Knowledge Organization: Data Analysis and Classification, C. Lauro, F. Palumbo, and M. Greenacre (Eds.). Berlin: Springer-Verlag, pp. 3-11.
- McLachlan, G.J. and Baek, J. (2010). Clustering of high-dimensional data via finite mixture models. In Advances in Data Analysis, Data Handling and Business Intelligence, A. Fink, B. Lausen, W. Seidel, and A. Ultsch (Eds.). Berlin: Springer-Verlag, pp. 33-44.
- Baek, J., McLachlan, G.J., and Flack, L. (2010). Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualisation of high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1298-1309.
- Nikulin, V., Huang, T.-H., Ng, S.K., Rathnayake, S.I., and McLachlan, G.J. (2011). A very fast algorithm for matrix factorization. Statistics & Probability Letters 81, 773-782.
- Lee, S.X. and McLachlan, G.J. (2011). On the fitting of mixtures of multivariate skew t-distributions via the EM algorithm. Preprint arXiv:1109.4706v2.
- Lee, S. and McLachlan, G.J. (2014). Finite mixtures of multivariate skew t-distributions: some recent and new results. Statistics and Computing 24, 181-202. See also amended version with corrections.
- Lin, T.-I., McLachlan, G.J., and Lee, S.X. (2016). Extending mixtures of factor models using the restricted multivariate skew-normal distribution. Journal of Multivariate Analysis. To appear. Preprint arXiv:1307.1748.
- McLachlan, G.J. and Lee, S.X. (2014). Comment on “Comparing two formulations of skew distributions with special reference to model-based clustering” by A. Azzalini, R. Browne, M. Genton, and P. McNicholas. Preprint arXiv:1404.1733.
- Lee, S.X., McLachlan, G.J., and Pyne, S. (2014). Supervised classification of flow cytometric samples via the joint clustering and matching (JCM) procedure. Preprint arXiv:1411.0685.
Gaussian Mixture Models
In applied work, the term heard most often is the Gaussian Mixture Model (GMM). The simplest case is the single Gaussian model, which assumes that a multivariate variable follows one Gaussian distribution. Viewed geometrically, the density contours of a Gaussian are approximately elliptical in two dimensions and ellipsoidal in three. When a single Gaussian is not expressive enough to describe the distribution of a high-dimensional random variable, a weighted combination of several Gaussians is used to approximate it. By analogy with wavelet decomposition, given enough Gaussian components one can approximate essentially any distribution over high-dimensional data. A further motivation comes from the central limit theorem: the sum (or mean) of many independent and identically distributed (i.i.d.) random variables is approximately Gaussian once the sample is large enough.
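As a small illustration (not from the original post), the following Python sketch evaluates a one-dimensional Gaussian mixture density as a weighted sum of component densities. The weights, means, and standard deviations are made-up values for the example.

```python
# Minimal sketch: a 1-D Gaussian mixture density as a weighted sum of
# component densities. All parameter values below are illustrative.
import numpy as np
from scipy.stats import norm

weights = np.array([0.5, 0.3, 0.2])   # mixing weights pi_k (sum to 1)
means   = np.array([-2.0, 0.0, 3.0])  # component means mu_k
stds    = np.array([0.8, 1.0, 1.5])   # component standard deviations sigma_k

def gmm_pdf(x):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

xs = np.linspace(-6, 8, 5)
print(gmm_pdf(xs))  # mixture density evaluated at a few points
```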
GMM and the K-means algorithm are similar in many respects. K-means alternates between assigning each sample to its nearest cluster center and recomputing each center as the mean of the samples assigned to it. A GMM instead estimates a mean and covariance for each Gaussian component and computes, for every sample, the probability that it belongs to each component. In that sense, a GMM can be viewed as a soft, probabilistic version of K-means. K-means behaves like an identification/classification method, assigning each sample to exactly one of the K cluster centers; the corresponding GMM behaves like a verification method, computing for each sample the probability of belonging to each of the K Gaussian components.
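The contrast can be made concrete with a short sketch: for the same hypothetical 1-D samples and centers, K-means-style hard assignment picks a single nearest center per sample, while GMM-style responsibilities give each sample a probability for every component. The data, centers, and shared variance below are assumptions for illustration only.

```python
# Hard assignment (K-means style) vs. soft responsibilities (GMM style).
import numpy as np
from scipy.stats import norm

x = np.array([-2.1, -0.3, 0.2, 2.9])   # hypothetical samples
centers = np.array([-2.0, 0.0, 3.0])   # hypothetical cluster centers
sigma = 1.0                            # assumed shared standard deviation
weights = np.array([1/3, 1/3, 1/3])    # equal mixing weights

# K-means: each sample goes to the single nearest center.
hard = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)

# GMM: each sample gets a probability of belonging to each component.
dens = weights * norm.pdf(x[:, None], loc=centers[None, :], scale=sigma)
soft = dens / dens.sum(axis=1, keepdims=True)

print(hard)   # hard labels, e.g. [0 1 1 2]
print(soft)   # responsibilities; each row sums to 1
```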
Given samples together with their class labels, the GMM parameters can be estimated by maximum likelihood (ML). When the labels are unknown, the Expectation Maximization (EM) algorithm is used instead, treating each sample's component membership probabilities as latent quantities. Because the likelihood involves products of many tiny probabilities, working with log-probabilities greatly improves numerical stability.
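Below is a minimal EM sketch for a two-component 1-D mixture, under assumed synthetic data and initial values. It illustrates both the E-step/M-step iteration and the use of log-densities with logsumexp so that products of small probabilities do not underflow.

```python
# Minimal EM for a 1-D Gaussian mixture; data and initial values are
# synthetic and illustrative, not taken from the original post.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

K = 2
pi = np.full(K, 1.0 / K)            # mixing weights
mu = np.array([-1.0, 1.0])          # initial means
sigma = np.array([1.0, 1.0])        # initial standard deviations

for _ in range(50):
    # E-step: log responsibilities, stabilized with logsumexp.
    log_p = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    log_resp = log_p - logsumexp(log_p, axis=1, keepdims=True)
    resp = np.exp(log_resp)

    # M-step: re-estimate weights, means, and standard deviations.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)  # should approach the generating parameters
```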