Principal Component Analysis (PCA)

Aim

Given a collection of data points \(y_1, \dots, y_n \in \mathbb{R}^m\), find a low-dimensional representation

\[y_i = Ab_i + c + v_i, \quad i = 1, \dots, n,\]

where \(A \in \mathbb{R}^{m\times k}\) (with \(k < m\)) is a basis matrix, \(b_i \in \mathbb{R}^k\) is the coefficient vector for \(y_i\), \(c \in \mathbb{R}^m\) is the base (the mean, in statistical terms), and \(v_i\) is the noise or modeling error, assumed to be Gaussian.
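To make the model concrete, here is a minimal NumPy sketch (all names and dimensions are illustrative, not from the text) that generates synthetic data from a planted model; later snippets reuse these variables:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 10, 2, 500                 # ambient dim, latent dim, number of points

# Planted model: orthonormal basis A, coefficients b_i, offset c, Gaussian noise v_i
A_true, _ = np.linalg.qr(rng.standard_normal((m, k)))   # satisfies A^T A = I
B_true = rng.standard_normal((k, n))                    # columns are the b_i
c_true = rng.standard_normal(m)
V = 0.05 * rng.standard_normal((m, n))

Y = A_true @ B_true + c_true[:, None] + V               # columns are y_i = A b_i + c + v_i
```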

Remove ambiguity

  1. We can rewrite \(y_i = A(b_i - b_0) + (Ab_0 + c) + v_i\) for any \(b_0\), so there is an ambiguity in translation.
  2. We can rewrite \(y_i = AUU^{-1} b_i + c + v_i\) for any invertible \(U\), so there is an ambiguity in the choice of basis.

As a result, we remove the ambiguity by constraining the free variables (both ambiguities are verified numerically in the sketch after this list):

  1. \(A^TA=I\) (the basis is orthonormal)
  2. \(\sum_{i=1}^n b_i = 0\) (the coefficients are centered)
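Both ambiguities are easy to verify numerically. Continuing the synthetic example above, with an arbitrary shift \(b_0\) and an arbitrary invertible \(U\):

```python
b0 = rng.standard_normal(k)
U = rng.standard_normal((k, k))      # a generic square matrix is invertible with probability 1

recon   = A_true @ B_true + c_true[:, None]
shifted = A_true @ (B_true - b0[:, None]) + (A_true @ b0 + c_true)[:, None]
rebased = (A_true @ U) @ (np.linalg.inv(U) @ B_true) + c_true[:, None]

assert np.allclose(recon, shifted) and np.allclose(recon, rebased)
```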

From the perspective of minimizing errors

Since \(v_i\) is Gaussian, maximum-likelihood estimation of the parameters is equivalent to minimizing the sum of squared errors

\[F = \sum_{i=1}^n ||y_i - Ab_i - c||^2_2.\]

Using the constraint \(\sum_{i=1}^n b_i = 0\), the gradient with respect to \(c\) is

\[\frac{\partial F}{\partial c} = \sum_{i=1}^n 2(Ab_i - y_i + c) = 2n\Big(c - \frac{1}{n}\sum_{i=1}^n y_i\Big),\]

thus we set \(c = \frac{1}{n}\sum_{i=1}^n y_i\); that is, we center all the data.
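In code, centering is one line; continuing the example above:

```python
c_hat = Y.mean(axis=1)               # optimal c: the sample mean of the y_i
Y_bar = Y - c_hat[:, None]           # centered data matrix \bar{Y}
```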

Define \(\bar{Y} \in \mathbb{R}^{m\times n}\) to be the centered data matrix, whose \(i\)-th column is \(y_i - c\), and \(B \in \mathbb{R}^{k\times n}\) to be the coefficient matrix whose columns are the \(b_i\). We are then minimizing

\[F = ||\bar{Y} - AB||_F^2\]

Notice that, using \(A^TA = I\),

\[\frac{\partial F}{\partial B} = -2A^T\bar{Y} + 2A^TAB = 2(B - A^T\bar{Y})\]

so we set \(B = A^T\bar{Y}\): each \(b_i\) is the projection of the centered data point onto the basis.
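For illustration, with the planted basis standing in for an arbitrary orthonormal \(A\), the optimal coefficients are the projections of the centered data:

```python
B_hat = A_true.T @ Y_bar             # B = A^T Ybar
residual = Y_bar - A_true @ B_hat    # F is the squared Frobenius norm of this matrix
```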

As a result, we want to minimize

\[F = ||\bar{Y} - AA^T\bar{Y}||_F^2 = tr(\bar{Y}^T\bar{Y}) - tr(\bar{Y}^TAA^T\bar{Y})\]

Since \(tr(\bar{Y}^T\bar{Y})\) is fixed by the data, it is equivalent to maximize

\[F = tr(\bar{Y}^TAA^T\bar{Y}) = tr(B^TB) = tr(A^T\bar{Y}\bar{Y}^TA).\]
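These trace identities can be checked numerically with the variables defined above:

```python
F_val = np.trace(Y_bar.T @ A_true @ A_true.T @ Y_bar)
assert np.isclose(F_val, np.trace(B_hat.T @ B_hat))
assert np.isclose(F_val, np.trace(A_true.T @ Y_bar @ Y_bar.T @ A_true))
```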

From the perspective of maximizing variance

Notice that the mean of the \(b_i\) is zero, so the total variance of the \(b_i\) (up to a factor of \(1/n\)) is

\[\sum_{i=1}^n b_i^Tb_i = tr(B^TB)\]

As a result, minimizing the reconstruction error and maximizing the variance of the coefficients are equivalent.
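Numerically, the equivalence is the fact that reconstruction error and captured variance sum to the constant \(tr(\bar{Y}^T\bar{Y})\); continuing the example:

```python
total    = np.trace(Y_bar.T @ Y_bar)                          # fixed by the data
error    = np.linalg.norm(Y_bar - A_true @ B_hat, 'fro')**2   # reconstruction error
variance = np.trace(B_hat.T @ B_hat)                          # coefficient variance (unnormalized)
assert np.isclose(total, error + variance)
```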

Find the basis matrix

Define \(A = \begin{bmatrix} A_{1} & \cdots & A_{k} \end{bmatrix}\), where \(A_j \in \mathbb{R}^m\), and let \(\Sigma = \bar{Y}\bar{Y}^T\) be the (unnormalized) covariance matrix of the centered data. We can rewrite the problem as

\[\begin{aligned} \max & \ \sum _{j=1}^{k} A_{j}^{T} \Sigma A_{j}\\ s.t. & \ A_{j}^{T} A_{j} =1 \end{aligned}\]

Form the Lagrangian and set its gradient with respect to \(A_j\) to zero:

\[2 (\Sigma A_j - \lambda A_j) = 0\]

Notice that this is exactly the eigenvalue equation \(\Sigma A_j = \lambda A_j\)! As a result, the columns of \(A\) are eigenvectors of the covariance matrix, and to maximize the objective we take the eigenvectors corresponding to the \(k\) largest eigenvalues.
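Putting everything together, here is a minimal PCA sketch using only `numpy`. Since \(A\) is only identified up to an orthogonal transform within the subspace, the check compares projection matrices rather than the bases themselves:

```python
Sigma = Y_bar @ Y_bar.T                    # (unnormalized) covariance matrix, m x m
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns ascending eigenvalues for symmetric input
A_pca = eigvecs[:, ::-1][:, :k]            # eigenvectors of the k largest eigenvalues
B_pca = A_pca.T @ Y_bar                    # low-dimensional coefficients

# The planted and recovered subspaces should agree up to the noise level
assert np.allclose(A_true @ A_true.T, A_pca @ A_pca.T, atol=0.1)
```

In practice the same basis is often computed from the thin SVD of \(\bar{Y}\), since its left singular vectors are exactly the eigenvectors of \(\bar{Y}\bar{Y}^T\).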

References

  1. The Matrix Cookbook, for the matrix derivative identities used above