Principal Component Analysis (PCA)
Aim
Given a collection of data points \(y_1, \dots, y_n \in \mathbb{R}^m\), find a low-dimensional representation
\[y_i = Ab_i + c + v_i, \quad i = 1, \dots, n,\]where \(A \in \mathbb{R}^{m\times k}\ (k < m)\) is a basis matrix; \(b_i \in \mathbb{R}^k\) is the coefficient vector for \(y_i\); \(c \in \mathbb{R}^m\) is the base point (the mean, in statistical terms); and \(v_i\) is the noise or modeling error, assumed to be Gaussian.
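To make the model concrete, here is a minimal numpy sketch of the generative process (all names and dimensions here are illustrative, not from the original):

```python
import numpy as np

# A minimal sketch of the generative model y_i = A b_i + c + v_i.
rng = np.random.default_rng(0)
m, k, n = 10, 3, 500                                   # illustrative dimensions

A_true = np.linalg.qr(rng.standard_normal((m, k)))[0]  # orthonormal basis A
b = rng.standard_normal((k, n))                        # coefficients b_i (columns)
c = rng.standard_normal(m)                             # base point c
v = 0.05 * rng.standard_normal((m, n))                 # Gaussian noise v_i

Y = A_true @ b + c[:, None] + v                        # column i is y_i
```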
Remove ambiguity
- We can write \(y_i = A(b_i - b_0) + (Ab_0 + c) + v_i\) for any \(b_0\), so there is an ambiguity in translation.
- We can write \(y_i = AUU^{-1}b_i + c + v_i\) for any invertible \(U\), so there is an ambiguity in the choice of basis.
As a result, we constrain the free variables by introducing (see the sketch after this list):
- \(A^TA=I\) (the basis is orthonormal)
- \(\sum_{i=1}^n b_i = 0\) (the coefficients are centered)
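A small sketch of how one might enforce both constraints on an arbitrary factorization (illustrative names; QR orthonormalization is one choice among several):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 10, 3, 500
A = rng.standard_normal((m, k))        # arbitrary basis
B = rng.standard_normal((k, n))        # arbitrary coefficients (columns b_i)

# Fix the basis ambiguity: QR gives an orthonormal A with A^T A = I,
# and absorbing R into B leaves the product A @ B unchanged.
A, R = np.linalg.qr(A)
B = R @ B

# Fix the translation ambiguity: move the coefficient mean b_0 into the
# base point (c becomes c + A @ b_0), so that the b_i sum to zero.
b0 = B.mean(axis=1, keepdims=True)
B = B - b0

assert np.allclose(A.T @ A, np.eye(k))
assert np.allclose(B.mean(axis=1), 0.0)
```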
From the perspective of minimizing errors
Since \(v_i\) is Gaussian, fitting the model by maximum likelihood is equivalent to minimizing the squared error
\[F = \sum_{i=1}^n ||y_i - Ab_i - c||^2_2.\]The gradient with respect to \(c\) is
\[\frac{\partial F}{\partial c} = \sum_{i=1}^n 2(Ab_i - y_i + c) = 2n\left(c - \frac{1}{n}\sum_{i=1}^n y_i\right),\]where the second equality uses the constraint \(\sum_{i=1}^n b_i = 0\). Setting this to zero gives \(c = \frac{1}{n}\sum_{i=1}^n y_i\), which is to center all the data.
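In code this step is just centering the data; a minimal sketch with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((10, 500))   # m x n data matrix, one point per column

c = Y.mean(axis=1)                   # c = (1/n) * sum_i y_i
Y_bar = Y - c[:, None]               # centered data matrix \bar{Y}
```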
Define \(\bar{Y} \in \mathbb{R}^{m\times n}\) to be the centered data matrix, whose \(i\)-th column is \(y_i - c\), and \(B \in \mathbb{R}^{k\times n}\) to be the coefficient matrix, whose \(i\)-th column is \(b_i\). We are then minimizing
\[F = ||\bar{Y} - AB||_F^2\]Notice that
\[\frac{\partial F}{\partial B} = -2A^T\bar{Y} + 2A^TAB = 2(B - A^T\bar{Y}),\]where the second equality uses \(A^TA = I\). Setting this to zero gives \(B = A^T\bar{Y}\).
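Given an orthonormal \(A\), the optimal coefficients are plain orthogonal projections; a minimal sketch (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 10, 3, 500
Y_bar = rng.standard_normal((m, n))                 # stand-in for centered data
A = np.linalg.qr(rng.standard_normal((m, k)))[0]    # any orthonormal basis

B = A.T @ Y_bar        # optimal coefficients B = A^T \bar{Y}
Y_hat = A @ B          # rank-k reconstruction A A^T \bar{Y}
```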
As a result, we want to minimize
\[F = ||\bar{Y} - AA^T\bar{Y}||_F^2 = tr(\bar{Y}^T\bar{Y}) - tr(\bar{Y}^TAA^T\bar{Y}).\]Since the first term does not depend on \(A\), it is equivalent to maximize
\[F = tr(\bar{Y}^TAA^T\bar{Y}) = tr(B^TB) = tr(A^T\bar{Y}\bar{Y}^TA).\]
From the perspective of maximizing variance
Notice that the mean of the \(b_i\) is zero, so the (unnormalized) total variance of the coefficients is
\[\sum_{i=1}^n b_i^Tb_i = tr(B^TB).\]As a result, minimizing the reconstruction error and maximizing the variance of the coefficients are equivalent.
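This equivalence is the identity \(tr(\bar{Y}^T\bar{Y}) = ||\bar{Y} - AA^T\bar{Y}||_F^2 + tr(B^TB)\), which is easy to check numerically; a sketch with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 10, 3, 500
Y_bar = rng.standard_normal((m, n))
Y_bar -= Y_bar.mean(axis=1, keepdims=True)          # center the data
A = np.linalg.qr(rng.standard_normal((m, k)))[0]    # any orthonormal basis
B = A.T @ Y_bar

total = np.linalg.norm(Y_bar) ** 2                  # tr(Y^T Y)
error = np.linalg.norm(Y_bar - A @ B) ** 2          # reconstruction error
variance = np.trace(B.T @ B)                        # captured variance
assert np.isclose(total, error + variance)          # less error <=> more variance
```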
Find the basis matrix
Write \(A = \begin{bmatrix} A_{1} & \cdots & A_{k} \end{bmatrix}\), where \(A_j \in \mathbb{R}^m\) is the \(j\)-th column, and let \(\Sigma = \bar{Y}\bar{Y}^T\) be the (unnormalized) covariance matrix of the centered data. We can rewrite the problem as
\[\begin{aligned} \max \ & \sum _{j=1}^{k} A_{j}^{T} \Sigma A_{j}\\ \text{s.t.} \ & A_{j}^{T} A_{j} = 1, \quad j = 1, \dots, k. \end{aligned}\]Form the Lagrangian and set its gradient with respect to each \(A_j\) to zero:
\[2 (\Sigma A_j - \lambda_j A_j) = 0.\]Notice that this is exactly the eigenvalue equation! As a result, the columns of \(A\) are eigenvectors of the covariance matrix; since each column \(A_j\) contributes its eigenvalue \(\lambda_j\) to the objective, \(F\) is maximized by taking the eigenvectors associated with the \(k\) largest eigenvalues.
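Putting the pieces together, a minimal PCA sketch via eigendecomposition of the covariance matrix (names are illustrative; in practice one would often use the SVD of \(\bar{Y}\) instead):

```python
import numpy as np

def pca(Y, k):
    """PCA of an m x n data matrix Y (one data point per column)."""
    c = Y.mean(axis=1)                        # base point c
    Y_bar = Y - c[:, None]                    # centered data \bar{Y}
    Sigma = Y_bar @ Y_bar.T                   # (unnormalized) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues, largest first
    A = eigvecs[:, order[:k]]                 # top-k eigenvectors = basis matrix
    B = A.T @ Y_bar                           # coefficients
    return A, B, c

# Usage: reconstruct the data from the k-dimensional representation.
rng = np.random.default_rng(0)
Y = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 500)) \
    + 0.05 * rng.standard_normal((10, 500))
A, B, c = pca(Y, k=3)
Y_hat = A @ B + c[:, None]
print("relative error:", np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y))
```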