Transformations
Module Name: transformation
This module contains functions for data transformation and preprocessing.
- autoprot.preprocessing.transformation.calculate_iBAQ(intensity, gene_name=None, protein_id=None, organism='human', get_seq='online', uniprot=None) → float
Convert raw intensities to ‘intensity-based absolute quantification’ or iBAQ intensities. Given intensities are divided by the number of theoretically observable tryptic peptides.
- Parameters:
intensity (int) – Raw MS intensity to transform.
gene_name (str) – Gene name of the protein related to the intensity given.
protein_id (str) – UniProt protein ID of the protein related to the intensity given.
organism (str, optional) – Organism matching the UniProt identifiers used in the data. Possible organisms: ‘mouse’, ‘human’, ‘rat’, ‘sheep’, ‘SARSCoV2’, ‘guinea pig’, ‘cow’, ‘hamster’, ‘fruit fly’, ‘dog’, ‘rabbit’, ‘pig’, ‘chicken’, ‘frog’, ‘quail’, ‘horse’, ‘goat’, ‘papillomavirus’, ‘water buffalo’, ‘marmoset’, ‘turkey’, ‘cat’, ‘starfish’, ‘torpedo’, ‘SARSCoV1’, ‘green monkey’, ‘ferret’. The default is “human”.
get_seq (str, "local" or "online") – Defines whether the sequence is fetched locally or downloaded from UniProt. Providing a locally loaded dataframe is advised when the function is used in batch processing.
uniprot (pd.DataFrame, optional) – Contains sequences listed by gene names and UniProt IDs.
Notes
By default, this function fetches the protein sequence online from UniProt, which can be slow. For batch processing, it is advisable to provide local sequence data or to use the local copy of UniProt in autoprot; make sure to keep that copy up to date.
- Returns:
iBAQ intensity
- Return type:
float
Examples
>>> calculate_iBAQ(1000, gene_name="TP53")
0.000203376
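The division by the number of theoretically observable tryptic peptides can be illustrated with a minimal sketch. It is not autoprot's implementation: the helper names, the cleavage rule (after K or R, not before P, no missed cleavages) and the 6 to 30 residue length window are assumptions chosen for illustration, and the sequence is a made-up toy string.

```python
import re


def count_tryptic_peptides(sequence, min_len=6, max_len=30):
    """Count fully tryptic peptides within a length window (hypothetical helper)."""
    # Assumed rule: trypsin cleaves C-terminal to K or R unless the next residue is P.
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    # Assumed window: only peptides of min_len..max_len residues count as observable.
    return sum(min_len <= len(p) <= max_len for p in peptides)


def ibaq_sketch(intensity, sequence):
    """Divide a raw intensity by the number of observable peptides (sketch only)."""
    n_peptides = count_tryptic_peptides(sequence)
    if n_peptides == 0:
        return float("nan")  # no observable peptide, so no iBAQ value
    return intensity / n_peptides


# Toy sequence: digests into MAAAAAK, GGGGGGRPLLLLLLK (no cut before P) and TTK.
toy_sequence = "MAAAAAKGGGGGGRPLLLLLLKTTK"
n_observable = count_tryptic_peptides(toy_sequence)
ibaq_value = ibaq_sketch(1000, toy_sequence)
```

Here two of the three peptides fall inside the length window, so the raw intensity of 1000 is divided by 2.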
- autoprot.preprocessing.transformation.collapse_rows(df: DataFrame, columns: list | str, numeric_func: callable = np.nanmedian, delimiter: str = ';')
Merge rows of a dataframe that share the same values in the given column(s). Non-numeric values are concatenated and numeric values are aggregated with the given function.
- Parameters:
df (pd.DataFrame) – Input dataframe
columns (list or str) – Column name(s) to collapse on
numeric_func (callable, optional) – Function to aggregate the numeric columns. Default is np.nanmedian.
delimiter (str, optional) – Str to concatenate the non-numeric values. Default is semicolon.
- Returns:
collapsed
- Return type:
pd.DataFrame
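The collapsing behaviour described for collapse_rows can be approximated with a pandas groupby. The following is a sketch under assumptions, not the library's code: the function name `collapse_rows_sketch` and the example column names are hypothetical, and details such as the handling of missing strings may differ from autoprot.

```python
import numpy as np
import pandas as pd


def collapse_rows_sketch(df, columns, numeric_func=np.nanmedian, delimiter=";"):
    """Group on `columns`; aggregate numeric columns, join the others (sketch)."""
    keys = [columns] if isinstance(columns, str) else list(columns)
    agg = {}
    for col in df.columns:
        if col in keys:
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric columns are reduced with the user-supplied function.
            agg[col] = lambda s, f=numeric_func: f(s.to_numpy())
        else:
            # Non-numeric columns are concatenated with the delimiter.
            agg[col] = lambda s, d=delimiter: d.join(s.dropna().astype(str))
    return df.groupby(keys, as_index=False).agg(agg)


example = pd.DataFrame(
    {
        "Gene": ["A", "A", "B"],
        "Intensity": [1.0, 3.0, 5.0],
        "Protein": ["P1", "P2", "P3"],
    }
)
collapsed = collapse_rows_sketch(example, "Gene")
```

The two rows for gene A collapse into one row holding the median intensity and the joined protein IDs.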
- autoprot.preprocessing.transformation.exp_semi_col(df: DataFrame, columns: list | str, suffix: str = '_exploded', delimiter: str = ';', cast_to: object | None = None)
Expand a string column containing semicolon-separated values and generate a new column from its content.
- Parameters:
df (pd.DataFrame) – Dataframe whose columns are expanded.
columns (str or list) – Name(s) of the column(s) containing semicolon-separated values.
suffix (str) – Will be appended to the name of the newly generated split column.
delimiter (str, optional) – the delimiter to split the strings on. Default is semicolon.
cast_to (dtype, optional) – If provided new column will be set to the provided dtype. The default is None.
- Returns:
df – Dataframe with the semicolon-separated values on separate rows.
- Return type:
pd.DataFrame
Examples
>>> expSemi = phos.sample(100)
>>> expSemi["Proteins"].head()
0    P61255;B1ARA3;B1ARA5
0    P61255;B1ARA3;B1ARA5
0    P61255;B1ARA3;B1ARA5
1    Q6XZL8;F7CVL0;F6SJX8
1    Q6XZL8;F7CVL0;F6SJX8
Name: Proteins, dtype: object
>>> expSemi = autoprot.preprocessing.exp_semi_col(expSemi, "Proteins", "SingleProts")
>>> expSemi["SingleProts"].head()
0    P61255
0    B1ARA3
0    B1ARA5
1    Q6XZL8
1    F7CVL0
Name: SingleProts, dtype: object
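The underlying idea, splitting each string on the delimiter and placing every value on its own row, can be sketched with plain pandas. This is an illustration only: the function name `explode_delimited` and the rule for naming the new column (original name plus suffix) are assumptions and may not match autoprot's behaviour.

```python
import pandas as pd


def explode_delimited(df, column, suffix="_exploded", delimiter=";", cast_to=None):
    """Split a delimited string column and put each value on its own row (sketch)."""
    out = df.copy()
    new_col = f"{column}{suffix}"  # assumed naming scheme for the new column
    out[new_col] = out[column].str.split(delimiter)
    out = out.explode(new_col)  # one row per list element, index is repeated
    if cast_to is not None:
        out[new_col] = out[new_col].astype(cast_to)
    return out


table = pd.DataFrame({"Proteins": ["P61255;B1ARA3", "Q6XZL8"], "Score": [1, 2]})
exploded = explode_delimited(table, "Proteins")
```

The original index is repeated for rows that stem from the same entry, which matches the repeated index values visible in the example output above.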
- autoprot.preprocessing.transformation.expand_site_table(df: DataFrame, cols: list[str], replace_zero: bool = True)
Convert a phosphosite table into a phosphopeptide table.
This function is used for Phospho (STY)Sites.txt files. It converts the phosphosite table into a phosphopeptide table. After expansion, peptides with no quantitative information are dropped. You might want to remove some columns after the expansion. For example, if you expanded on the normalized ratios, it might be good to remove the non-normalized ones, or vice versa.
- Parameters:
df (pd.DataFrame) – Dataframe to be expanded. Must contain a column named “id”.
cols (list of str) – Cols which are going to be expanded (format: Ratio.*___.).
replace_zero (bool) – If True, 0 values in the provided columns are replaced by NaN (default). Set to False if you explicitly want to keep the 0 values after expansion.
- Raises:
ValueError – Raised if the dataframe does not contain all columns corresponding to the provided columns without __n extension.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
Examples
>>> phosRatio = phos.filter(regex="^Ratio .\/.( | normalized )R.___").columns
>>> phosLog = autoprot.preprocessing.log(phos, phosRatio, base=2)
>>> phosRatio = phosLog.filter(regex="log2_Ratio .\/. normalized R.___").columns
>>> phos_expanded = autoprot.preprocessing.expand_site_table(phosLog, phosRatio)
47936 phosphosites in dataframe.
47903 phosphopeptides in dataframe after expansion.
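The expansion from sites to peptides can be pictured as stacking the ___1, ___2 and ___3 columns on top of each other. The sketch below is a simplified illustration and not autoprot's implementation: the function name, the "Multiplicity" column and the omission of the ValueError check are assumptions made here.

```python
import numpy as np
import pandas as pd


def expand_sites_sketch(df, cols, replace_zero=True):
    """Stack ___1/___2/___3 columns into one row per site and multiplicity (sketch)."""
    # Base names are the column names without the ___n suffix.
    bases = sorted({c.rsplit("___", 1)[0] for c in cols})
    frames = []
    for mult in ("1", "2", "3"):
        rename = {f"{b}___{mult}": b for b in bases}
        present = [c for c in rename if c in df.columns]
        if not present:
            continue
        part = df[["id"] + present].rename(columns=rename)
        part["Multiplicity"] = int(mult)  # hypothetical bookkeeping column
        frames.append(part)
    stacked = pd.concat(frames, ignore_index=True)
    if replace_zero:
        stacked[bases] = stacked[bases].replace(0, np.nan)
    # Rows without any quantitative information are dropped.
    return stacked.dropna(subset=bases, how="all").reset_index(drop=True)


sites = pd.DataFrame(
    {
        "id": [0, 1],
        "Ratio H/L___1": [1.5, 0.0],
        "Ratio H/L___2": [np.nan, 2.0],
        "Ratio H/L___3": [np.nan, np.nan],
    }
)
expanded = expand_sites_sketch(sites, [c for c in sites.columns if "___" in c])
```

Of the six site and multiplicity combinations in this toy table, only two carry a value once zeros are treated as missing, so two rows remain.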
- autoprot.preprocessing.transformation.log(df: DataFrame, cols: Sequence[str], base: int = 2, invert: Sequence[int] | None = None, return_cols: bool = False, ratio_identifier: str = '(\\w)(/)(\\w)', ratio_replace: str = '\\3\\2\\1')
Perform log transformation.
- Parameters:
df (pd.DataFrame) – Input dataframe.
cols (list of str) – Cols which are transformed.
base (int, optional) – Base of log. The default is 2.
invert (list of int, optional) – Vector with one entry per column in cols. Each column is multiplied by the corresponding number. The default is None.
return_cols (bool, optional) – Whether to return a list of names corresponding to the columns added to the dataframe. The default is False.
ratio_identifier (str, optional) – Regular expression used to find ratios so that their labels can be inverted when invert is provided.
ratio_replace (str, optional) – Regular expression to reinsert the ratio identifier after transformation. Default is the inversion along a divide sign (i.e. H/L -> L/H).
- Returns:
pd.DataFrame – The log transformed dataframe.
list – A list of column names (if return_cols is True).
Examples
First collect colnames holding the intensity ratios.
>>> protRatio = prot.filter(regex="^Ratio .\/.( | normalized )B").columns
>>> phosRatio = phos.filter(regex="^Ratio .\/.( | normalized )R.___").columns
Some ratios need to be inverted as a result of label switches. This can be accomplished using the invert parameter. Log transformations with arbitrary bases can be used; however, 2 and 10 are most commonly applied.
>>> invert = [-1., -1., 1., 1., -1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
...           1., 1., 1., 1., 1., 1., 1., -1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
...           1., -1.]
>>> prot2 = autoprot.preprocessing.log(prot, protRatio, base=2, invert=invert + invert)
>>> phos2 = autoprot.preprocessing.log(phos, phosRatio, base=10)
The resulting dataframe contains log ratios or NaN.
>>> prot2.filter(regex="log.+_Ratio M/L BC18_1$").head()
   log2_Ratio M/L BC18_1
0                    NaN
1              -0.478609
2                    NaN
3                    NaN
4               1.236503
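The interplay of base, invert and the ratio regular expressions can be sketched as follows. This is an assumption-laden illustration rather than the library's code: the function name `log_sketch`, the treatment of non-positive values as NaN, and the choice to relabel only the columns whose invert entry is -1 are all guesses made for the example. The two regular expressions are the documented defaults.

```python
import re

import numpy as np
import pandas as pd


def log_sketch(df, cols, base=2, invert=None,
               ratio_identifier=r"(\w)(/)(\w)", ratio_replace=r"\3\2\1"):
    """Log-transform columns and optionally flip the sign of some of them (sketch)."""
    out = df.copy()
    signs = invert if invert is not None else [1] * len(cols)
    for col, sign in zip(cols, signs):
        values = out[col].astype(float)
        values = values.where(values > 0)  # assumed: non-positive values become NaN
        logged = np.log(values) / np.log(base) * sign
        # Assumed: an inverted column gets its ratio label swapped (H/L -> L/H).
        label = re.sub(ratio_identifier, ratio_replace, col) if sign == -1 else col
        out[f"log{base}_{label}"] = logged
    return out


ratios = pd.DataFrame({"Ratio H/L": [2.0, 0.0, 8.0], "Ratio M/L": [4.0, 1.0, 0.5]})
logged = log_sketch(ratios, ["Ratio H/L", "Ratio M/L"], base=2, invert=[-1, 1])
```

Multiplying a log ratio by -1 is equivalent to taking the log of the reciprocal ratio, which is why a label switch can be corrected after the transformation.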
- autoprot.preprocessing.transformation.make_sim_score(m1: Sequence, m2: Sequence, corr: Literal['Pearson', 'Spearman'] = 'pearson') → float
Calculate similarity score.
The score quantitatively describes the resemblance between the temporal profiles observed after subjecting the cells to the two treatments. Implemented as described in [1].
- Parameters:
m1 (array-like) – Time course of SILAC ratios after treatment 1.
m2 (array-like) – Time course of SILAC ratios after treatment 2.
corr (str, optional) – Correlation parameter. ‘Pearson’ or ‘Spearman’. The default is “pearson”.
- Returns:
S-score that describes both the resemblance of the patterns of regulation and the resemblance between the degrees of regulation in the range from zero to infinity.
- Return type:
float
Examples
Similar temporal profiles result in high S-scores
>>> s1 = [1,1,1,2,3,4,4]
>>> s2 = [1,1,1,2,3,3,4]
>>> autoprot.preprocessing.make_sim_score(s1, s2)
50.97173553835997
Low resemblance results in low scores
>>> s2 = [1.1,1.1,1,1,1,1,1]
>>> autoprot.preprocessing.make_sim_score(s1, s2)
16.33374591446012
References
- autoprot.preprocessing.transformation.merge_semi_cols(m1: DataFrame, m2: DataFrame, semicolon_col1: str, semicolon_col2: str | None = None, how: Literal['left', 'right', 'outer', 'inner'] = 'left')
Merge two dataframes on a semicolon separated column.
By default, m2 is merged onto m1 (left merge), so entries in m2 that are not matched to m1 are dropped.
- Parameters:
m1 (pd.DataFrame) – First dataframe to merge.
m2 (pd.DataFrame) – Second dataframe to merge with first.
semicolon_col1 (str) – Colname of a column containing semicolon-separated values in m1.
semicolon_col2 (str, optional) – Colname of a column containing semicolon-separated values in m2. If semicolon_col2 is None, it is assumed to be the same as semicolon_col1. The default is None.
how ('left', 'right', 'outer', 'inner') – How to perform the merge. The default is 'left'.
- Returns:
Merged dataframe with expanded columns.
- Return type:
pd.DataFrame
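One way to merge on semicolon-separated columns is to explode both columns into single keys and then join on those keys. The sketch below illustrates that idea only: the function name, the helper column `_merge_key` and the suffixes are hypothetical, and autoprot may post-process the result differently (for example by collapsing rows again).

```python
import pandas as pd


def merge_on_delimited(m1, m2, col1, col2=None, how="left", delimiter=";"):
    """Merge two dataframes on the individual values of delimited columns (sketch)."""
    col2 = col1 if col2 is None else col2
    key = "_merge_key"  # hypothetical helper column holding single identifiers
    left = m1.assign(**{key: m1[col1].str.split(delimiter)}).explode(key)
    right = m2.assign(**{key: m2[col2].str.split(delimiter)}).explode(key)
    return left.merge(right, on=key, how=how, suffixes=("_m1", "_m2"))


first = pd.DataFrame({"Proteins": ["P1;P2", "P3"], "ratio": [1.0, 2.0]})
second = pd.DataFrame({"Proteins": ["P2;P9", "P7"], "score": [10, 20]})
merged = merge_on_delimited(first, second, "Proteins")
```

With the default left merge, every identifier from the first dataframe is kept, while identifiers that occur only in the second one (P9 and P7 here) are dropped.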