Principal Component Analysis
- class autoprot.analysis.PCA.AutoPCA(dataframe: DataFrame, clabels: list[str] | None, rlabels: list[str] | None = None, batch: list[str] | None = None)
Conduct principal component analyses.
The class encompasses a set of helpful visualizations for further investigating the results of the PCA. It needs the matrix on which the PCA is performed as well as row labels (rlabels) and column labels (clabels) corresponding to the provided matrix.
Notes
PCA is a method that lets you visually investigate the underlying structure of your data after reducing its dimensionality. With AutoPCA you can perform a PCA and also generate exploratory figures. Intro to PCA: https://learnche.org/pid/latent-variable-modelling/principal-component-analysis/index
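To make the idea of dimensionality reduction concrete, here is a minimal pure-Python sketch of what a PCA computes for its first component: the direction of largest variance in the centered data, found here by power iteration on the covariance matrix. The toy data and the helper names (`covariance_matrix`, `first_component`) are hypothetical and not part of autoprot; this is not a description of how AutoPCA computes its components internally.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows` (a list of equal-length lists)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_component(cov, iterations=100):
    """Dominant eigenvector and eigenvalue of `cov`, estimated by power iteration."""
    p = len(cov)
    # Assumption: the start vector is not orthogonal to the dominant eigenvector.
    vec = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        product = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    # Rayleigh quotient: the variance captured along the unit vector `vec`.
    cov_vec = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(vec[i] * cov_vec[i] for i in range(p))
    return vec, eigenvalue


# Hypothetical toy data: two strongly correlated variables.
toy_data = [
    [1.0, 1.0], [-1.0, -1.0], [2.0, 2.0],
    [-2.0, -2.0], [0.5, -0.5], [-0.5, 0.5],
]
cov = covariance_matrix(toy_data)
pc1, variance_pc1 = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_by_pc1 = variance_pc1 / total_variance
```

In this toy dataset a single direction carries more than 95% of the total variance, which is the situation in which reducing two variables to one component loses little information.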
Examples
PCA does not allow missing values in the dataset, so we first filter out incomplete rows and store the complete dataframe. We then extract the matrix of quantitative values for the conditions of interest (here we only use the first replicate for clarity). Next, we generate appropriate names for the columns and rows of the matrix: in this example the columns represent the conditions, and we are not interested in the rows (which are the genes), so rlabels is None.
The scree plot shows how much of the total variance of the dataset is explained by the first n components. Since you want to explain as much variance as possible with as few components as possible, choosing the number of components just to the right of the steep descent of the plot is usually a good idea.
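import pandas as pd
import autoprot.preprocessing as pp
import autoprot.analysis as ana

# Load the data and log2-transform the normalized ratio columns
prot = pd.read_csv("../data/proteinGroups_minimal.zip", sep="\t", low_memory=False)
protRatio = prot.filter(regex=r"Ratio .\/. normalized").columns
protLog = pp.log(prot, protRatio, base=2)

# Keep only rows without missing values in the log2 ratio columns
temp = protLog[~protLog.filter(regex="log2.*norm").isnull().any(axis=1)]

# Use only the first replicate of each condition
dataframe = temp.filter(regex="log2.*norm.*_1$")
clabels = dataframe.columns
rlabels = None

autopca = ana.AutoPCA(dataframe=dataframe, clabels=clabels, rlabels=rlabels)
autopca.scree()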
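The quantity shown in a scree plot can be sketched without any plotting: each component's share of the total variance, plus the running total. The eigenvalues below are made-up numbers, and `explained_variance_ratio` and `components_needed` are hypothetical helpers rather than autoprot functions. The cumulative-variance threshold used here is a common alternative to reading the elbow of the plot by eye, not the rule described above.

```python
from itertools import accumulate


def explained_variance_ratio(eigenvalues):
    """Share of the total variance carried by each component."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def components_needed(ratios, threshold):
    """Smallest number of leading components whose cumulative share reaches `threshold`."""
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return count
    return len(ratios)


# Hypothetical component variances (eigenvalues), sorted from largest to smallest.
eigenvalues = [4.0, 2.0, 1.0, 1.0]

ratios = explained_variance_ratio(eigenvalues)
cumulative_ratios = list(accumulate(ratios))
n_components = components_needed(ratios, threshold=0.75)
```

With these numbers the first two components together reach the 75% threshold, so `n_components` is 2.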
The corr_comp heatmap shows the PCA loadings (i.e. how much a principal component is influenced by a change in that variable) for each variable (i.e. the experimental conditions). If a weight (colorbar) is close to zero, the corresponding PC is barely influenced by that variable.
autopca.corr_comp(annot=False)
The bar loading plot is a different way to represent the weights/loadings for each condition and principal component. High values indicate a strong influence of the variable/condition on the PC.
autopca.bar_load(pc=1)
autopca.bar_load(pc=2)
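One way to read such loadings numerically is to rank the conditions by the absolute size of their loading on a given PC. The sketch below uses made-up loading values and a hypothetical helper `top_loadings`; it does not reproduce how bar_load selects or orders its bars.

```python
def top_loadings(loadings, n=25):
    """Return the n (label, loading) pairs with the largest absolute loading."""
    ranked = sorted(loadings.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]


# Hypothetical loadings of four conditions on PC1.
pc1_loadings = {"cond_A": 0.70, "cond_B": -0.05, "cond_C": -0.60, "cond_D": 0.38}

strongest = top_loadings(pc1_loadings, n=2)
```

Ranking by absolute value treats strongly negative loadings as influential too; the sign only indicates the direction in which the condition moves the PC.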
The score plot shows how the different data points (i.e. proteins) are positioned with respect to two principal components. In more detail, the scores are the original data values multiplied by the weights of each value for each principal component. Usually they will separate more in the direction of PC1, as this component explains the largest share of the data variance.
autopca.score_plot(pc1=1, pc2=2)
The loading plot is the 2D representation of the bar loading plots and shows how strongly each variable influences the two PCs.
autopca.loading_plot(pc1=1, pc2=2, labeling=True)
The biplot is a combination of loading plot and score plot: it shows the scores for each protein as points and the weights for each variable as vectors.
autopca.bi_plot(pc1=1, pc2=2)
- bar_load(pc: int = 1, n: int = 25) None
Plot the loadings of a given component in a barplot.
- Parameters:
pc (int, optional) – Component to draw. The default is 1.
n (int, optional) – Plot only the n first rows. The default is 25.
- Return type:
None.
- bi_plot(pc1: int = 1, pc2: int = 2, num_load: Literal['all'] | int = 'all', figsize: tuple[int, int] = (5, 5), **kwargs) None
Generate a biplot, a combined loadings and score plot.
- Parameters:
pc1 (int, optional) – Number of the first PC to plot. The default is 1.
pc2 (int, optional) – Number of the second PC to plot. The default is 2.
num_load ('all' or int, optional) – Plot only the n first rows. The default is “all”.
figsize (tuple of int, optional) – Figure size. The default is (5,5).
**kwargs – Passed to plt.scatter.
Notes
In the biplot, scores are shown as points and loadings as vectors.
- Return type:
None.
- corr_comp(annot=False, ax: axis | None = None) None
Plot heatmap of PCA weights vs. variables.
- Parameters:
annot (bool, optional) – If True, write the data value in each cell. If an array-like with the same shape as data, then use this to annotate the heatmap instead of the data. Note that DataFrames will match on position, not index. The default is False.
ax (plt.axis, optional) – axis to plot on. Default is None.
Notes
2D representation of how strongly each variable (e.g. log protein ratio) weighs on each principal component.
- Return type:
None.
- loading_plot(pc1: int = 1, pc2: int = 2, labeling: bool = False, ax: axis | None = None, figsize: tuple[int] = (5, 5))
Generate a PCA loading plot.
- Parameters:
pc1 (int, optional) – Number of the first PC to plot. The default is 1.
pc2 (int, optional) – Number of the second PC to plot. The default is 2.
labeling (bool, optional) – If True, points are labelled with the corresponding column labels. The default is False.
figsize (tuple of int, optional) – The size of the figure object. Will be ignored if ax is not None. The default is (5,5).
ax (plt.axis, optional) – The axis to plot on. Default is None.
Notes
This will generate a scatter plot with as many points as there are variables (i.e. conditions) in the dataset. For each variable, the loadings on the two PCs are plotted; they describe how much each condition influences the magnitude of the respective PC.
- Return type:
None.
- pair_plot(n: int = 0) None
Draw a pair plot of the PCA for the given number of dimensions.
- Parameters:
n (int, optional) – Plot only the n first rows. The default is 0.
Notes
Be careful: for large datasets this might crash your PC, so it is better to specify n.
- Return type:
None.
- return_load(pc: int = 1, n: int = 25) DataFrame
Return the load for a given principal component.
- Parameters:
pc (int, optional) – Component whose loadings are returned. The default is 1.
n (int, optional) – Return only the n first rows. The default is 25.
- Returns:
Dataframe containing load vs. condition.
- Return type:
pd.DataFrame
- return_score() DataFrame
Return a dataframe of all scores for all principal components.
- Returns:
scores – Dataframe holding the principal components as colnames and the scores for each protein on that PC as values.
- Return type:
pd.DataFrame
- score_plot(pc1: int = 1, pc2: int = 2, labeling: bool = False, file: str | None = None, figsize: tuple[int | float, int | float] = (5, 5)) None
Generate a PCA score plot.
- Parameters:
pc1 (int, optional) – Number of the first PC to plot. The default is 1.
pc2 (int, optional) – Number of the second PC to plot. The default is 2.
labeling (bool, optional) – If True, points are labelled with the corresponding column labels. The default is False.
file (str, optional) – Path to save the plot. The default is None.
figsize (tuple of int, optional) – Figure size. The default is (5,5).
Notes
This will generate a scatter plot with as many points as there are entries (i.e. protein IDs). The scores for each PC are the original protein ratios multiplied by the loading weights. The score plot shows the position of each protein on the plane spanned by the pc1 and pc2 vectors.
- Return type:
None.