The Glowing Python: Principal Component Analysis with numpy

Friday, July 22, 2011

Principal Component Analysis with numpy

The following function is a three-line implementation of the Principal Component Analysis (PCA). It is inspired by the function princomp of the matlab's statistics toolbox.

from numpy import mean,cov,double,cumsum,dot,linalg,array,rank
from pylab import plot,subplot,axis,stem,show,figure

def princomp(A):
 """ performs principal components analysis 
     (PCA) on the n-by-p data matrix A
     Rows of A correspond to observations, columns to variables. 

 Returns :  
  coeff :
    is a p-by-p matrix, each column containing coefficients 
    for one principal component.
  score : 
    the principal component scores; that is, the representation 
    of A in the principal component space. Rows of SCORE 
    correspond to observations, columns to components.
  latent : 
    a vector containing the eigenvalues 
    of the covariance matrix of A.
 """
 # computing eigenvalues and eigenvectors of covariance matrix
 M = (A-mean(A.T,axis=1)).T # subtract the mean (along columns)
 [latent,coeff] = linalg.eig(cov(M)) # attention:not always sorted
 score = dot(coeff.T,M) # projection of the data in the new space
 return coeff,score,latent

(In this other post you can find an updated version of this function).
In the following test a 2D dataset wil be used. The result of this test is a plot with the two principal components (dashed lines), the original data (blue dots) and the new data (red stars). As we expected the first principal component describes the direction of maximum variance and the second is orthogonal to the first.

A = array([ [2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9],
            [2.5,0.5,2.2,1.9,3.1,2.3,2,1,1.5,1.1] ])

coeff, score, latent = princomp(A.T)

figure()
subplot(121)
# every eigenvector describe the direction
# of a principal component.
m = mean(A,axis=1)
plot([0, -coeff[0,0]*2]+m[0], [0, -coeff[0,1]*2]+m[1],'--k')
plot([0, coeff[1,0]*2]+m[0], [0, coeff[1,1]*2]+m[1],'--k')
plot(A[0,:],A[1,:],'ob') # the data
axis('equal')
subplot(122)
# new data
plot(score[0,:],score[1,:],'*g')
axis('equal')
show()

In this second example princomp(.) is tested on a 4D dataset. In this example the matrix of the data is rank deficient and only the first two components are necessary to bring the information of the entry dataset.

A = array([[-1, 1, 2, 2],
           [-2, 3, 1, 0],
           [ 4, 0, 3,-1]],dtype=double)

coeff, score, latent = princomp(A)
perc = cumsum(latent)/sum(latent)
figure()
# the following plot shows that first two components 
# account for 100% of the variance.
stem(range(len(perc)),perc,'--b')
axis([-0.3,4.3,0,1.3])
show()
print 'the principal component scores'
print score.T # only the first two columns are nonzero
print 'The rank of A is'
print rank(A)  # indeed, the rank of A is 2

Coefficients for principal components
[[  1.464140e+00   1.588382e+00   0.000000e+00  -4.440892e-16]
 [  2.768170e+00  -1.292503e+00  -2.775557e-17   6.557254e-16]
 [ -4.232310e+00  -2.958795e-01   1.110223e-16  -3.747002e-16]]
The rank of A is
2

12 comments:

Olivier GriselJuly 22, 2011 at 1:37 PM
Note you can also do PCA using the Singular Value Decomposition (numpy.linalg.svd) as done in scikit-learn:

http://scikit-learn.sourceforge.net/dev/modules/decomposition.html#principal-component-analysis-pca

Also if you have very large dataset (e.g. more than 5000 observations and features / dimensions) then the default SVD is too expensive to compute. In that case you can use the randomized approximate SVD by Halko et al. implemented both in scikit-learn and gensim:

https://github.com/scikit-learn/scikit-learn/blob/babe4a5d0637ca172d47e1dfdd2f6f3c3ecb28db/scikits/learn/utils/extmath.py#L105

https://github.com/piskvorky/gensim/blob/develop/gensim/models/lsimodel.py#L555

Those latter methods also works with scipy.sparse matrices.
ReplyDelete
Replies
AinaJanuary 9, 2012 at 7:05 PM
Hi, I have a question on this script. It is very convenient to use, however I am not sure what the results of 4D sets are telling us. I have a 4D set and I am trying to figure out whether there is a dependence between the dimensions. So if dimension 1 and 2 would be strongly correlated, would it make coefficients of principal components big or small? Thanks, Aina.
ReplyDelete
Replies
JustGlowingJanuary 9, 2012 at 8:01 PM
Hi Aina,
in the second example I used a 4D dataset where the information is stored in only two dimensions out of the four. Indeed, the transformation shows that only two column of the result are non zero. Keep in mind that the example do not reduces the dimensionality of the data and the PCA is a linear transformation of the variables into a lower dimensional space which retain maximal amount of information about the variables.

You can find the answer to your question in the following property of the PCA: Subsequent PCs are defined as the projected variable which is uncorrelated with the earlier PCs and has maximal variance with arbitrary sign.

By the way, if you want to find two variables are correlated, I suggest to try with a correlation coefficient.
ReplyDelete
Replies
JerryApril 4, 2012 at 10:38 PM
A slightly annoying issue: the latent values are not sorted, so the eigenvectors are not in order... any quick fix?
ReplyDelete
Replies
JustGlowingApril 5, 2012 at 9:26 AM
Hi Jerry, there's an updated version of the princuomp function in the post http://glowingpython.blogspot.it/2011/07/pca-and-image-compression-with-numpy.html
ReplyDelete
Replies
Jiangang HaoAugust 10, 2012 at 12:37 AM
score :
the principal component scores; that is, the representation
of A in the principal component space. Rows of SCORE
correspond to observations, columns to components.

This is not correct
ReplyDelete
Replies
JustGlowingAugust 10, 2012 at 6:39 AM
Hi Jiangag, I can't see the mistake. Explain your point.
ReplyDelete
Replies
AbhijeetAugust 14, 2012 at 6:05 PM
What he means is there is a missing transpose in score calculation. It should be:
score = dot(coeff.T,M).T
ReplyDelete
Replies
jamesMarch 21, 2015 at 6:48 PM
This is helpful, but I believe This may be incorrect (possibly extra transpose) maybe). :)
for 2D dataset containing 2 attributes(x1, and x2) and 10 data points, you are supposed to have:
- 2 pairs of eigenvectors
- 2 eigenvalues for x1 and x2
- 2 means: 1 for x1 and 1 for x2

You are supposed to subtract the mean of each attributes. But you have 10 means. In your code:
mean(A.T,axis=1)
# this results in 10 means. This is calculating attribute1 + attribute2 and average them.

You are supposed to calculate the mean over 10 data points for each attribute.
you should do
mean(A,axis=1)

Your covariance matrix is also a 10x10 matrix. this is not correct. You are suppose to have 2x2 matrix for 2D dataset.
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.

Friday, July 22, 2011

Principal Component Analysis with numpy

12 comments:

Quote