Principle

Principal component analysis (PCA) is a dimensionality-reduction algorithm. Suppose we have $m$ samples in an $n$-dimensional space, written $X=[x^{(1)},x^{(2)},...,x^{(m)}]$, and we want to compress them by projecting onto a $k$-dimensional space, $k<n$, while losing as little information as possible. Let $W=[w^{(1)},w^{(2)},...,w^{(k)}]$ be an orthonormal basis of this $k$-dimensional space, where $w^{(i)}\in R^n$ satisfies $\|w^{(i)}\|_2=1$ and $(w^{(i)})^Tw^{(j)}=0\ (i\ne j)$. Let $Z$ be the matrix representation of $X$ in the projected space, so that:

$$X_{n\times m}=W_{n\times k}\cdot Z_{k\times m}$$


Clearly $Z=W^TX$ (left-multiply $X=WZ$ by $W^T$ and use the orthonormality $W^TW=E$). Mapping the projected matrix $Z$ back into the original $n$-dimensional space reconstructs $X$, giving $X^*=WZ=WW^TX$. Since the information loss should be minimal, a natural idea is to minimize the "distance" between the reconstruction $X^*$ and the original $X$, which leads to the following optimization problem:

$$\min\limits_W\|X-X^*\|_F^2=\min\limits_W\|X-WW^TX\|_F^2 \quad s.t.\ W^TW=E$$


Simplifying: $\min\limits_W\|X-X^*\|_F^2=\min\limits_W tr((X-WW^TX)^T(X-WW^TX))$ (using the relation between the Frobenius norm and the trace: $\|A\|_F=\sqrt{tr(A^TA)}$)

$=\min\limits_W tr(X^TX-X^TWW^TX-X^TWW^TX+X^TWW^TWW^TX)$ (the term $X^TX$ does not involve $W$ and cannot affect the optimum, so it is dropped)

$=\min\limits_W tr(-2X^TWW^TX+X^TWW^TX)=\min\limits_W tr(-X^TWW^TX)=\max\limits_W tr(X^TWW^TX)$ (here the constraint $W^TW=E$ collapses $X^TWW^TWW^TX$ to $X^TWW^TX$; next we use the cyclic invariance of the trace, $tr(ABC)=tr(CAB)=tr(BCA)$, to reorder)


$=\max\limits_W tr(W^TXX^TW)$, which finally gives the optimization problem:

$$\max\limits_W tr(W^TXX^TW) \quad s.t.\ W^TW=E \tag{1}$$

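Before solving $(1)$, the chain of identities above can be sanity-checked numerically. The following is a minimal sketch (NumPy assumed; the matrix sizes are arbitrary) confirming that $\|X-WW^TX\|_F^2=\|X\|_F^2-tr(W^TXX^TW)$ for any $W$ with orthonormal columns:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 5, 20, 2
X = rng.standard_normal((n, m))

# A random n x k matrix with orthonormal columns (so W^T W = E), built via QR.
W, _ = np.linalg.qr(rng.standard_normal((n, k)))

# ||X - W W^T X||_F^2  ==  ||X||_F^2 - tr(W^T X X^T W)
lhs = np.linalg.norm(X - W @ W.T @ X, "fro") ** 2
rhs = np.linalg.norm(X, "fro") ** 2 - np.trace(W.T @ X @ X.T @ W)
print(np.isclose(lhs, rhs))  # True
```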

Solving this with the method of Lagrange multipliers reduces to finding the extremum of the following Lagrangian (cf. 南瓜书, the "Pumpkin Book"):

$$L(W,\Theta)=tr(W^TXX^TW)+\langle\Theta,\,E-W^TW\rangle$$


Here $\Theta\in R^{k\times k}$ is a matrix of Lagrange multipliers. Unpacking the constraint $W^TW=E$, it requires both $(w^{(i)})^Tw^{(i)}=1$ and $(w^{(i)})^Tw^{(j)}=0\ (i\ne j)$. Keep only the former (the unit-norm conditions) and set the orthogonality conditions aside for now (reading on will make clear why this is harmless); that leaves exactly $k$ scalar constraints, hence only $k$ multipliers, so $\Theta$ is a diagonal matrix. Writing the new multiplier matrix as $\Lambda=diag\{\lambda_1,\lambda_2,...,\lambda_k\}$, the Lagrangian becomes:

$$L(W,\Lambda)=tr(W^TXX^TW)+tr(\Lambda^T(E-W^TW))$$


Differentiating with respect to $W$ and setting the result to zero gives $\frac{\partial L(W,\Lambda)}{\partial W}=\frac{\partial tr(W^TXX^TW)}{\partial W}-\frac{\partial tr(\Lambda^TW^TW)}{\partial W}=2XX^TW-2W\Lambda=0$, hence $XX^TW=W\Lambda$.

Note: to compute $\frac{\partial tr(W^TXX^TW)}{\partial W}$, observe that a trace is a scalar, so this is a scalar-by-matrix derivative; see the Wikipedia material on matrix calculus. It rests on the following rules:

$$tr(A\pm B)=tr(A)\pm tr(B) \\ tr(A)=tr(A^T) \\ tr(ABC)=tr(CAB)=tr(BCA) \\ d(AB)=d(A)B+A\,d(B) \\ d(A^T)=(dA)^T \\ d(A\pm B)=d(A)\pm d(B) \\ d(tr(A))=tr(d(A))$$

Thus
$$d(tr(W^TXX^TW))=tr(d(W^TXX^TW))=tr(d(W^T)XX^TW+W^Td(XX^TW)) \\ =tr(d(W^T)XX^TW)+tr(W^Td(XX^TW))=tr(XX^TWd(W^T))+tr(W^TXX^Td(W)) \\ =tr((XX^TWd(W^T))^T)+tr(W^TXX^Td(W))=tr(d(W)W^TXX^T)+tr(W^TXX^Td(W)) \\ =tr(W^TXX^Td(W))+tr(W^TXX^Td(W))=tr(2W^TXX^Td(W))=tr((2XX^TW)^Td(W))$$

For $f=f(x_{1,1},...,x_{1,n},x_{2,1},...,x_{2,n},...,x_{m,1},...,x_{m,n})=f(X)$ we have $df=\frac{\partial f}{\partial x_{1,1}}dx_{1,1}+...+\frac{\partial f}{\partial x_{m,n}}dx_{m,n}=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n}\frac{\partial f}{\partial x_{i,j}}dx_{i,j}=tr((\frac{\partial f}{\partial X})^TdX)$, where

$$X=\left[\begin{array}{cccc}{x_{1,1}} & {x_{1,2}} & {\cdots} & {x_{1,n}} \\ {x_{2,1}} & {x_{2,2}} & {\cdots} & {x_{2,n}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {x_{m,1}} & {x_{m,2}} & {\cdots} & {x_{m,n}}\end{array}\right]$$


Matching $tr((2XX^TW)^Td(W))$ against this identity, it follows immediately that $\frac{\partial tr(W^TXX^TW)}{\partial W}=2XX^TW$.

Next, for $\frac{\partial tr(\Lambda^TW^TW)}{\partial W}$, proceed in the same way:
$$d(tr(\Lambda^TW^TW))=tr(d(\Lambda^TW^TW))=tr(d(\Lambda^T)W^TW+\Lambda^Td(W^TW))=tr(\Lambda^Td(W^TW)) \\ =tr(\Lambda^T(d(W^T)W+W^Td(W)))=tr(\Lambda^Td(W^T)W)+tr(\Lambda^TW^Td(W)) \\ =tr(W^Td(W)\Lambda)+tr(\Lambda^TW^Td(W))=tr(\Lambda W^Td(W))+tr(\Lambda^TW^Td(W)) \\ =tr(2\Lambda^TW^Td(W))=tr((2W\Lambda)^Td(W))$$ (using $\Lambda^T=\Lambda$)

It follows at once that $\frac{\partial tr(\Lambda^TW^TW)}{\partial W}=2W\Lambda$. (End of note.)
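
Both gradient formulas from this note can be checked against central finite differences; here is a small sketch (NumPy assumed; since both traces are quadratic in $W$, the central difference matches up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 2
X = rng.standard_normal((n, 10))
W = rng.standard_normal((n, k))
Lam = np.diag(rng.standard_normal(k))  # diagonal Lambda, so Lam.T == Lam

def f(W):  # tr(W^T X X^T W)
    return np.trace(W.T @ X @ X.T @ W)

def g(W):  # tr(Lambda^T W^T W)
    return np.trace(Lam.T @ W.T @ W)

def num_grad(fun, W, eps=1e-6):
    """Entry-wise central finite-difference gradient."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps
            G[i, j] = (fun(W + E) - fun(W - E)) / (2 * eps)
    return G

print(np.allclose(num_grad(f, W), 2 * X @ X.T @ W))  # True: gradient is 2 X X^T W
print(np.allclose(num_grad(g, W), 2 * W @ Lam))      # True: gradient is 2 W Lambda
```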

XXTW=WΛXX^TW=W\Lambda展开来看,有:

$$XX^Tw^{(i)}=\lambda_i w^{(i)} \tag{2}$$

XXTW=WΛXX^TW=W\Lambda展开来看,有:XXTw(i)=λiw(i)XX^Tw^{(i)}=\lambda_i w^{(i)}(式(2)(2)

for $i\in\{1,2,...,k\}$. This is exactly the defining equation of eigenvalues and eigenvectors. In fact $XX^T$ is a real symmetric matrix ($(XX^T)^T=XX^T$), eigenvectors of a real symmetric matrix belonging to distinct eigenvalues are mutually orthogonal, and distinct eigenvectors sharing the same eigenvalue can be made orthogonal by Gram-Schmidt. So the $W$ computed above under only the unit-norm constraints $(w^{(i)})^Tw^{(i)}=1$ can simultaneously satisfy the remaining constraints $(w^{(i)})^Tw^{(j)}=0\ (i\ne j)$.
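
This is also why, in practice, no explicit Gram-Schmidt step is usually needed: symmetric eigensolvers such as `numpy.linalg.eigh` already return an orthonormal set of eigenvectors, so the constraint $W^TW=E$ holds automatically. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 50))
S = X @ X.T  # real symmetric

# eigh is designed for symmetric matrices and returns orthonormal eigenvectors,
# even in the presence of repeated eigenvalues.
eigvals, eigvecs = np.linalg.eigh(S)
print(np.allclose(eigvecs.T @ eigvecs, np.eye(5)))  # True: W^T W = E for free
```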

Substituting back into Eq. $(1)$:

$$\max\limits_W tr(W^TXX^TW)=\max\limits_W\sum\limits_{i=1}^{k}(w^{(i)})^TXX^Tw^{(i)}=\max\limits_W\sum\limits_{i=1}^{k}(w^{(i)})^T\lambda_i w^{(i)}=\max\limits_W\sum\limits_{i=1}^{k}\lambda_i$$ (the first step expands the trace into a sum over columns, the second applies Eq. $(2)$, and the last uses $(w^{(i)})^Tw^{(i)}=1$)


So it suffices to take $\lambda_1,\lambda_2,...,\lambda_k$ to be the $k$ largest eigenvalues of $XX^T$: then objective $(1)$ attains its optimum, i.e. the loss from dimensionality reduction is minimal. In short, the procedure is: compute the eigenvalues of $XX^T$, pick the eigenvectors corresponding to the $k$ largest eigenvalues, normalize them (and Gram-Schmidt-orthogonalize within repeated eigenvalues), and stack these eigenvectors as columns to obtain the desired $W$.

In practice the data are first preprocessed so that the sample mean becomes 0. My understanding is that, since different features can differ greatly in value or order of magnitude, this prevents features with small values from being overwhelmed; after centering, $XX^T$ is exactly the covariance matrix of the features.
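
A one-line check of that last claim (a sketch assuming NumPy; note that `np.cov` treats each row as one feature by default):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 100))           # n features, m samples (one per column)
Xc = X - X.mean(axis=1, keepdims=True)      # center each feature to zero mean

# After centering, (1/m) * Xc Xc^T is exactly the (biased) covariance matrix.
print(np.allclose(Xc @ Xc.T / X.shape[1], np.cov(X, bias=True)))  # True
```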

The PCA algorithm:

Input: sample set $X=\{x^{(1)},x^{(2)},...,x^{(m)}\}$; projection-space dimension $k$

Procedure:

  1. Preprocess: $x^{(i)}=x^{(i)}-\frac{1}{m}\sum\limits_{j=1}^mx^{(j)}$
  2. Compute the sample covariance matrix $XX^T$ (the constant factor in front of the covariance matrix is dropped here, because in an eigendecomposition $A\xi=\lambda\xi$, multiplying both sides by a constant $a$ leaves the eigenvectors unchanged: $aA\xi=a\lambda\xi=\lambda^{'}\xi$)
  3. Perform an eigenvalue decomposition of the covariance matrix
  4. Take the $k$ largest eigenvalues and their corresponding eigenvectors (normalized and orthogonalized): $w^{(1)},w^{(2)},...,w^{(k)}$

Output: $W=[w^{(1)},w^{(2)},...,w^{(k)}]$. The projection matrix is then $W^T$, and $Z=W^TX$ is the dimensionality-reduced data.
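
Putting the four steps together, a minimal NumPy sketch of the boxed algorithm could look as follows (the function name `pca` and the example data are illustrative, not from the source):

```python
import numpy as np

def pca(X, k):
    """PCA as derived above. X is n x m (one sample per column);
    returns W (n x k) and Z (k x m). A sketch, not a production implementation."""
    Xc = X - X.mean(axis=1, keepdims=True)   # 1. center each feature
    S = Xc @ Xc.T                            # 2. X X^T (constant factor dropped)
    eigvals, eigvecs = np.linalg.eigh(S)     # 3. eigendecomposition (ascending order)
    W = eigvecs[:, ::-1][:, :k]              # 4. top-k eigenvectors, already orthonormal
    Z = W.T @ Xc                             # project: Z = W^T X
    return W, Z

# Example: compress 10-dimensional samples down to 2 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 200))
W, Z = pca(X, k=2)
print(W.shape, Z.shape)  # (10, 2) (2, 200)
```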

Kernel PCA

——The derivation in this section follows [1]

Given a data set $X=[x_1,x_2,...,x_m]$, $x_i\in R^n$, map all samples into a high-dimensional feature space (a Hilbert space) via a nonlinear transformation $\phi:R^n\rightarrow R^N$. Let $z_i=\phi(x_i)$ denote the image of the sample $x_i$ in the high-dimensional space. We then run PCA in this space ($R^N$), reducing to $d$ dimensions. To run PCA, first compute the covariance matrix:

$$C=\frac{1}{m}\sum\limits_{i=1}^{m}\left(\phi(x_i)-\frac{1}{m}\sum\limits_{j=1}^{m}\phi(x_j)\right)\left(\phi(x_i)-\frac{1}{m}\sum\limits_{j=1}^{m}\phi(x_j)\right)^T$$


For convenience, let $\psi(x_i)=\phi(x_i)-\frac{1}{m}\sum\limits_{j=1}^{m}\phi(x_j)$, so the covariance matrix is $C=\sum\limits_{i=1}^{m}\psi(x_i)\psi(x_i)^T$ (the constant factor $\frac{1}{m}$ has been dropped). By Eq. $(2)$, $Cv_j=\lambda_jv_j$, $j\in\{1,2,...,d\}$ (taking the $d$ largest eigenvalues $\lambda_j$), where $v_j$ is the eigenvector of $C$ for eigenvalue $\lambda_j$, i.e.:

$$\left(\sum\limits_{i=1}^{m}\psi(x_i)\psi(x_i)^T\right)v_j=\lambda_jv_j$$

Hence $v_j=\sum\limits_{i=1}^{m}\frac{1}{\lambda_j}\psi(x_i)\psi(x_i)^Tv_j$; letting $\alpha_i^j=\frac{1}{\lambda_j}\psi(x_i)^Tv_j$, this becomes:

$$v_j=\sum\limits_{i=1}^{m}\psi(x_i)\alpha_i^j$$


Cvj=λjvjCv_j=\lambda_jv_j两边左乘ψ(xk)T\psi(x_k)^T得到ψ(xk)TCvj=ψ(xk)Tλjvj\psi(x_k)^TCv_j=\psi(x_k)^T\lambda_jv_j,化简该式:左式=ψ(xk)T(i=1mψ(xi)ψ(xi)T)(i=1mψ(xi)αij)\text{左式}=\psi(x_k)^T(\sum\limits_{i=1}^m\psi(x_i)\psi(x_i)^T)(\sum\limits_{i=1}^m\psi(x_i)\alpha_i^j)\quad(根据(i=1mf(xi))(i=1ng(xi))=i=1mj=1n(f(xi)g(xj))=ψ(xk)Ti=1ml=1m(ψ(xi)ψ(xi)Tψ(xl)αlj)=i=1ml=1m(ψ(xk)Tψ(xi)ψ(xi)Tψ(xl)αlj)=i=1m([ψ(xk)Tψ(xi)ψ(xi)Tψ(x1),(\text{根据}(\sum\limits_{i=1}^mf(x_i))·(\sum\limits_{i=1}^ng(x_i))=\sum\limits_{i=1}^m\sum\limits_{j=1}^n(f(x_i)g(x_j)))\\=\psi(x_k)^T\sum\limits_{i=1}^m\sum\limits_{l=1}^m(\psi(x_i)\psi(x_i)^T·\psi(x_l)\alpha_l^j)=\sum\limits_{i=1}^m\sum\limits_{l=1}^m(\psi(x_k)^T\psi(x_i)\psi(x_i)^T\psi(x_l)·\alpha_l^j)\\=\sum\limits_{i=1}^m([\psi(x_k)^T\psi(x_i)\psi(x_i)^T\psi(x_1),ψ(xk)Tψ(xi)ψ(xi)Tψ(x2),...,\psi(x_k)^T\psi(x_i)\psi(x_i)^T\psi(x_2),...,ψ(xk)Tψ(xi)ψ(xi)Tψ(xm)][α1jα2j...αmj])=[ψ(xk)T(i=1mψ(xi)ψ(xi)T)ψ(x1),\psi(x_k)^T\psi(x_i)\psi(x_i)^T\psi(x_m)]\begin{bmatrix}\alpha_1^j \\ \alpha_2^j \\.\\.\\.\\ \alpha_m^j\end{bmatrix})\\=[\psi(x_k)^T·(\sum\limits_{i=1}^m\psi(x_i)\psi(x_i)^T)·\psi(x_1),ψ(xk)T(i=1mψ(xi)ψ(xi)T)ψ(x2),...,\psi(x_k)^T·(\sum\limits_{i=1}^m\psi(x_i)\psi(x_i)^T)·\psi(x_2),...,ψ(xk)T(i=1mψ(xi)ψ(xi)T)ψ(xm)][α1jα2j...αmj]=右式=λji=1m(ψ(xk)Tψ(xi)αij)=λj[ψ(xk)Tψ(x1),ψ(xk)Tψ(x2),...,\psi(x_k)^T·(\sum\limits_{i=1}^m\psi(x_i)\psi(x_i)^T)·\psi(x_m)]\begin{bmatrix}\alpha_1^j \\ \alpha_2^j \\.\\.\\.\\ \alpha_m^j\end{bmatrix}\\=\text{右式}=\lambda_j\sum\limits_{i=1}^m(\psi(x_k)^T\psi(x_i)·\alpha_i^j)=\lambda_j[\psi(x_k)^T\psi(x_1),\psi(x_k)^T\psi(x_2),...,ψ(xk)Tψ(xm)][α1jα2j...αmj]\psi(x_k)^T\psi(x_m)]\begin{bmatrix}\alpha_1^j \\ \alpha_2^j \\.\\.\\.\\ \alpha_m^j\end{bmatrix}

Cvj=λjvjCv_j=\lambda_jv_j两边左乘ψ(xk)T\psi(x_k)^T得到ψ(xk)TCvj=ψ(xk)Tλjvj\psi(x_k)^TCv_j=\psi(x_k)^T\lambda_jv_j,化简该式:左式=[ψ(xk)T(i=1mψ(xi)ψ(xi)T)ψ(x1),\text{左式}=[\psi(x_k)^T·(\sum\limits_{i=1}^m\psi(x_i)\psi(x_i)^T)·\psi(x_1),ψ(xk)T(i=1mψ(xi)ψ(xi)T)ψ(x2),...,\psi(x_k)^T·(\sum\limits_{i=1}^m\psi(x_i)\psi(x_i)^T)·\psi(x_2),...,ψ(xk)T(i=1mψ(xi)ψ(xi)T)ψ(xm)]\psi(x_k)^T·(\sum\limits_{i=1}^m\psi(x_i)\psi(x_i)^T)·\psi(x_m)][α1j,α2j,...,αmj]T[\alpha_1^j , \alpha_2^j ,... , \alpha_m^j]^T=右式=λj[ψ(xk)Tψ(x1),ψ(xk)Tψ(x2),...,=\text{右式}=\lambda_j[\psi(x_k)^T\psi(x_1),\psi(x_k)^T\psi(x_2),...,ψ(xk)Tψ(xm)]\psi(x_k)^T\psi(x_m)][α1j,α2j,...,αmj]T[\alpha_1^j , \alpha_2^j ,... , \alpha_m^j]^T

where $k\in\{1,2,...,m\}$. Applying this equality for every $k$ and stacking the rows yields $\overline{K}^2\alpha^j=\lambda_j\overline{K}\alpha^j$ (if the forward derivation is hard to follow, it can be verified backwards), where $\overline{K}^2=\overline{K}\cdot\overline{K}$ (matrix product) and $\overline{K},\alpha^j$ are:

$$\overline{K}=\begin{bmatrix}{\psi(x_1)^T\psi(x_1)} & {\psi(x_1)^T\psi(x_2)} & {\cdots} & {\psi(x_1)^T\psi(x_m)} \\ {\psi(x_2)^T\psi(x_1)} & {\psi(x_2)^T\psi(x_2)} & {\cdots} & {\psi(x_2)^T\psi(x_m)} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {\psi(x_m)^T\psi(x_1)} & {\psi(x_m)^T\psi(x_2)} & {\cdots} & {\psi(x_m)^T\psi(x_m)}\end{bmatrix} \qquad \alpha^j=[\alpha_1^j,\alpha_2^j,...,\alpha_m^j]^T$$


Hence:

$$\overline{K}\alpha^j=\lambda_j\alpha^j \tag{3}$$


for $j\in\{1,2,...,d\}$. Clearly, Eq. $(3)$ is just an eigendecomposition problem (taking the eigenvectors of $\overline{K}$ for its $d$ largest eigenvalues). It remains to compute $\overline{K}$:

$$\overline{K}_{i,j}=\psi(x_i)^T\psi(x_j)=\left(\phi(x_i)-\frac{1}{m}\sum\limits_{l=1}^{m}\phi(x_l)\right)^T\left(\phi(x_j)-\frac{1}{m}\sum\limits_{l=1}^{m}\phi(x_l)\right) \\ =\phi(x_i)^T\phi(x_j)-\frac{1}{m}\phi(x_i)^T\sum\limits_{l=1}^m\phi(x_l)-\frac{1}{m}\left(\sum\limits_{l=1}^m\phi(x_l)^T\right)\phi(x_j)+\frac{1}{m^2}\left(\sum\limits_{l=1}^m\phi(x_l)^T\right)\left(\sum\limits_{l=1}^m\phi(x_l)\right)$$


Introduce the kernel function $\mathbf{k}(\cdot,\cdot)$:

$$\mathbf{k}(x_i,x_j)=\phi(x_i)^T\phi(x_j)$$


Its kernel matrix is:

$$K=\begin{bmatrix}{\mathbf{k}(x_1,x_1)} & {\mathbf{k}(x_1,x_2)} & {\cdots} & {\mathbf{k}(x_1,x_m)} \\ {\mathbf{k}(x_2,x_1)} & {\mathbf{k}(x_2,x_2)} & {\cdots} & {\mathbf{k}(x_2,x_m)} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {\mathbf{k}(x_m,x_1)} & {\mathbf{k}(x_m,x_2)} & {\cdots} & {\mathbf{k}(x_m,x_m)}\end{bmatrix} \tag{4}$$


Ki,j\overline{K}_{i,j}继续化简得到Ki,j=Ki,j1ml=1m(ϕ(xi)Tϕ(xl))1ml=1m(ϕ(xl)Tϕ(xj))+1m2l=1mp=1m(ϕ(xl)Tϕ(xp))\overline{K}_{i,j}=K_{i,j}-\frac{1}{m}\sum\limits_{l=1}^m(\phi(x_i)^T\phi(x_l))-\frac{1}{m}\sum\limits_{l=1}^m(\phi(x_l)^T\phi(x_j))+\frac{1}{m^2}\sum\limits_{l=1}^m\sum\limits_{p=1}^m(\phi(x_l)^T\phi(x_p))=Ki,j1ml=1mKi,l1ml=1mKl,j+1m2l=1mp=1mKl,p=K_{i,j}-\frac{1}{m}\sum\limits_{l=1}^mK_{i,l}-\frac{1}{m}\sum\limits_{l=1}^mK_{l,j}+\frac{1}{m^2}\sum\limits_{l=1}^m\sum\limits_{p=1}^mK_{l,p}

Hence (again, if this is hard to see directly, it can be verified backwards):

$$\overline{K}=K-KI_m-I_mK+I_mKI_m \tag{5}$$


Here $I$ is the $m\times m$ matrix whose entries are all 1, and $I_m$ denotes $\frac{1}{m}I$. With $\overline{K}$ in hand, the $\alpha^j$ can be solved for, and then $v_j=\sum\limits_{i=1}^m\psi(x_i)\alpha_i^j=\psi(X)\alpha^j$, where $\psi(X)=[\psi(x_1),\psi(x_2),...,\psi(x_m)]$. By the PCA algorithm above, the $v_j$ must be unit-norm and mutually orthogonal; orthogonality is automatic for distinct eigenvalues, since $v_j^Tv_{j'}=(\alpha^j)^T\overline{K}\alpha^{j'}=\lambda_{j'}(\alpha^j)^T\alpha^{j'}=0$. The $v_j$ themselves can never be computed explicitly, but $\alpha^j$ can be rescaled to enforce $\|v_j\|=1$: from $\|v_j\|^2=v_j^Tv_j=(\alpha^j)^T\psi(X)^T\psi(X)\alpha^j=(\alpha^j)^T\overline{K}\alpha^j=\lambda_j(\alpha^j)^T\alpha^j=1$ we obtain the rescaled $\alpha^j$ ($j\in\{1,2,...,d\}$). Now, given a sample $x_i$ from the original space, its representation in the KPCA projection space is $V^T\psi(x_i)$ (first map $x_i$ into the high-dimensional space to get $\psi(x_i)$, then project down from the Hilbert space by left-multiplying the projection matrix $V^T$), where $V=[v_1,v_2,...,v_d]=\psi(X)[\alpha^1,\alpha^2,...,\alpha^d]$, so

$$V^T\psi(x_i)=\begin{bmatrix}(\alpha^1)^T\\\vdots\\(\alpha^d)^T\end{bmatrix}\psi(X)^T\psi(x_i)=\begin{bmatrix}(\alpha^1)^T\\\vdots\\(\alpha^d)^T\end{bmatrix}_{d\times m}\begin{bmatrix}\psi(x_1)^T\psi(x_i)\\\vdots\\\psi(x_m)^T\psi(x_i)\end{bmatrix}_{m\times 1}=\begin{bmatrix}(\alpha^1)^T\\\vdots\\(\alpha^d)^T\end{bmatrix}\overline{K}_{:,i}\in R^d$$

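Assembling the pieces, here is a minimal NumPy sketch of KPCA as derived above (the RBF kernel and the `gamma` parameter are illustrative assumptions; the derivation does not fix a particular kernel function):

```python
import numpy as np

def kernel_pca(X, d, gamma=1.0):
    """KPCA sketch. X is n x m (one sample per column);
    returns the d x m projection of the training samples."""
    m = X.shape[1]
    # Kernel matrix K (Eq. (4)); here k(x, y) = exp(-gamma * ||x - y||^2).
    sq = np.sum(X * X, axis=0)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X.T @ X))
    # Centered kernel matrix K_bar (Eq. (5)), with I_m = (1/m) * ones.
    Im = np.full((m, m), 1.0 / m)
    K_bar = K - K @ Im - Im @ K + Im @ K @ Im
    # Eigendecomposition of K_bar (Eq. (3)); take the d largest eigenpairs.
    eigvals, eigvecs = np.linalg.eigh(K_bar)
    eigvals, eigvecs = eigvals[::-1][:d], eigvecs[:, ::-1][:, :d]
    # Rescale alpha^j so that ||v_j|| = 1, i.e. lambda_j * (alpha^j)^T alpha^j = 1.
    alphas = eigvecs / np.sqrt(eigvals)
    # Projection of sample i is [alpha^1,...,alpha^d]^T K_bar[:, i].
    return alphas.T @ K_bar

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))
Z = kernel_pca(X, d=2)
print(Z.shape)  # (2, 40)
```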

Note: I recommend the video lectures by 李振轩, which treat PCA from the angle of "which projection axes maximize the variance of the projected samples, so that the projected points spread out as much as possible (clearly, projecting everything onto a single point would lose all the information, so the more dispersed the projected points, the better)"; the derivation in that form is also simpler. See my notes in [3].