Transformations

Module Name: transformation

This module contains functions for data transformation and preprocessing.

Authors

Wignand Mühlhäuser
Julian Bender <julian.bender@uni-wuerzburg.de>
Johannes Zimmermann <johannes.zimmermann@uni-wuerzburg.de>

autoprot.preprocessing.transformation.calculate_iBAQ(intensity, gene_name=None, protein_id=None, organism='human', get_seq='online', uniprot=None) → float

Convert a raw intensity to an ‘intensity-based absolute quantification’ (iBAQ) intensity. The given intensity is divided by the number of theoretically observable tryptic peptides of the protein.

Parameters:
  • intensity (int) – Raw MS intensity to transform.

  • gene_name (str) – Gene name of the protein related to the given intensity.

  • protein_id (str) – UniProt protein ID of the protein related to the given intensity.

  • organism (str, optional) – Organism of the UniProt identifiers used in the data; the conversion is based on these identifiers. Possible organisms: ‘mouse’, ‘human’, ‘rat’, ‘sheep’, ‘SARSCoV2’, ‘guinea pig’, ‘cow’, ‘hamster’, ‘fruit fly’, ‘dog’, ‘rabbit’, ‘pig’, ‘chicken’, ‘frog’, ‘quail’, ‘horse’, ‘goat’, ‘papillomavirus’, ‘water buffalo’, ‘marmoset’, ‘turkey’, ‘cat’, ‘starfish’, ‘torpedo’, ‘SARSCoV1’, ‘green monkey’, ‘ferret’. The default is “human”.

  • get_seq (str, "local" or "online") – Defines whether the sequence is fetched locally or downloaded from UniProt. For batch processing, it is advised to provide a locally loaded dataframe.

  • uniprot (pd.DataFrame, optional) – Sequences listed by gene name and UniProt ID.

Notes

By default, this function fetches the protein sequence online from UniProt, which can be slow. For batch processing, it is advisable to provide local sequence data or to use the local copy of UniProt in autoprot. Make sure to keep that copy up to date.

Returns:

iBAQ intensity

Return type:

float

Examples

>>> calculate_iBAQ(1000, gene_name="TP53")
0.000203376
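
The division described above can be sketched in plain Python. This is an illustration only, not autoprot's implementation: the function names are hypothetical, and the digestion rule (cleavage after K or R unless followed by P, no missed cleavages) and the 6 to 30 residue window for "observable" peptides are assumptions based on common practice.

```python
import re


def count_tryptic_peptides(sequence, min_len=6, max_len=30):
    """Count fully tryptic peptides whose length lies within the window.

    Assumed rules (not taken from autoprot): cleave after K or R unless
    the next residue is P; count peptides of min_len..max_len residues.
    """
    # Zero-width split points after K/R that are not followed by P.
    # Splitting on empty matches requires Python 3.7 or newer.
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    return sum(min_len <= len(p) <= max_len for p in peptides)


def ibaq_sketch(intensity, sequence):
    """Divide a raw intensity by the number of observable peptides."""
    n_peptides = count_tryptic_peptides(sequence)
    return intensity / n_peptides if n_peptides else float("nan")


# Toy sequence: "CCK" is too short, and "RP" is not cleaved.
toy_sequence = "AAAAAAK" + "CCK" + "DDDDDDDRPEEEK" + "GGGGGGGG"
n_observable = count_tryptic_peptides(toy_sequence)  # 3
toy_ibaq = ibaq_sketch(3000, toy_sequence)           # 1000.0
```

A sequence without any peptide in the window yields NaN here rather than raising an error; how calculate_iBAQ handles that case is not stated in this documentation.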
autoprot.preprocessing.transformation.collapse_rows(df: DataFrame, columns: list | str, numeric_func: callable = <function nanmedian>, delimiter: str = ';')

Merge rows of a dataframe that share the same value(s) in the given column(s). Non-numeric values are concatenated and numeric values are aggregated with the given function.

Parameters:
  • df (pd.DataFrame) – Input dataframe

  • columns (list or str) – Column name(s) to collapse on

  • numeric_func (callable, optional) – Function to aggregate the numeric columns. Default is np.nanmedian.

  • delimiter (str, optional) – String used to join the non-numeric values. Default is a semicolon.

Returns:

collapsed

Return type:

pd.DataFrame
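
The collapsing logic can be sketched on a list of dicts, so that the grouping is visible without pandas. This is a minimal illustration under assumptions: the helper name and the toy data are hypothetical, the median ignores NaN (mimicking np.nanmedian), and details such as the handling of duplicate strings may differ in collapse_rows.

```python
import math
from statistics import median


def collapse_rows_sketch(rows, key, delimiter=";"):
    """Group dict rows by `key`: NaN-ignoring median for numeric
    columns, delimiter-joined strings for all other columns."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)

    collapsed = []
    for key_value, members in groups.items():
        out = {key: key_value}
        for col in members[0]:
            if col == key:
                continue
            values = [m[col] for m in members]
            if all(isinstance(v, (int, float)) for v in values):
                finite = [v for v in values if not math.isnan(v)]
                out[col] = median(finite) if finite else math.nan
            else:
                out[col] = delimiter.join(str(v) for v in values)
        collapsed.append(out)
    return collapsed


# Toy data with made-up identifiers
rows = [
    {"Gene": "geneA", "Protein": "protA1", "log2_ratio": 1.0},
    {"Gene": "geneA", "Protein": "protA2", "log2_ratio": 3.0},
    {"Gene": "geneB", "Protein": "protB1", "log2_ratio": math.nan},
]
collapsed = collapse_rows_sketch(rows, "Gene")
# geneA -> Protein "protA1;protA2", log2_ratio 2.0
# geneB -> Protein "protB1", log2_ratio NaN
```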

autoprot.preprocessing.transformation.exp_semi_col(df: DataFrame, columns: list | str, suffix: str = '_exploded', delimiter: str = ';', cast_to: object | None = None)

Expand a string column containing semicolon-separated values and generate a new column from its content.

Parameters:
  • df (pd.DataFrame) – Dataframe whose columns are expanded.

  • columns (str or list) – Name(s) of the column(s) containing semicolon-separated values.

  • suffix (str) – Suffix appended to the name of the newly generated split column.

  • delimiter (str, optional) – The delimiter to split the strings on. Default is a semicolon.

  • cast_to (dtype, optional) – If provided, the new column is cast to this dtype. The default is None.

Returns:

df – Dataframe with the semicolon-separated values on separate rows.

Return type:

pd.DataFrame

Examples

>>> expSemi = phos.sample(100)
>>> expSemi["Proteins"].head()
0    P61255;B1ARA3;B1ARA5
0    P61255;B1ARA3;B1ARA5
0    P61255;B1ARA3;B1ARA5
1    Q6XZL8;F7CVL0;F6SJX8
1    Q6XZL8;F7CVL0;F6SJX8
Name: Proteins, dtype: object
>>> expSemi = autoprot.preprocessing.exp_semi_col(expSemi, "Proteins", "SingleProts")
>>> expSemi["SingleProts"].head()
0    P61255
0    B1ARA3
0    B1ARA5
1    Q6XZL8
1    F7CVL0
Name: SingleProts, dtype: object
autoprot.preprocessing.transformation.expand_site_table(df: DataFrame, cols: list[str], replace_zero: bool = True)

Convert a phosphosite table into a phosphopeptide table.

This function is intended for Phospho (STY)Sites.txt files. It converts the phosphosite table into a phosphopeptide table. After expansion, peptides with no quantitative information are dropped. Consider removing some columns after the expansion: for example, if you expanded on the normalized ratios, it may be sensible to remove the non-normalized ones, or vice versa.

Parameters:
  • df (pd.DataFrame) – Dataframe to be expanded. Must contain a column named “id”.

  • cols (list of str) – Columns to be expanded (format: Ratio.*___.).

  • replace_zero (bool) – If True (default), 0 values in the provided columns are replaced by NaN. Set to False if you explicitly want to keep the 0 values after expansion.

Raises:

ValueError – Raised if the dataframe does not contain all columns corresponding to the provided columns without __n extension.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame

Examples

>>> phosRatio = phos.filter(regex=r"^Ratio .\/.( | normalized )R.___").columns
>>> phosLog = autoprot.preprocessing.log(phos, phosRatio, base=2)
>>> phosRatio = phosLog.filter(regex=r"log2_Ratio .\/. normalized R.___").columns
>>> phos_expanded = autoprot.preprocessing.expand_site_table(phosLog, phosRatio)
47936 phosphosites in dataframe.
47903 phosphopeptides in dataframe after expansion.
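
The idea behind the expansion can be sketched on a list of dicts: each site row carries one value per multiplicity suffix (___1, ___2, ___3), and each suffix becomes its own peptide row. This is a simplified illustration, not autoprot's code. The helper name, the "Multiplicity" column, and the toy data are hypothetical, and the sketch takes base column names instead of the suffixed names that expand_site_table expects.

```python
import math


def expand_sites_sketch(rows, base_cols, replace_zero=True, n_mult=3):
    """Turn one row per site into one row per (site, multiplicity)."""
    expanded = []
    for row in rows:
        for mult in range(1, n_mult + 1):
            new_row = {"id": row["id"], "Multiplicity": mult}
            for base in base_cols:
                value = row[f"{base}___{mult}"]
                if replace_zero and value == 0:
                    value = math.nan  # treat 0 as missing
                new_row[base] = value
            # Keep only peptides with at least one quantitative value
            if not all(math.isnan(new_row[b]) for b in base_cols):
                expanded.append(new_row)
    return expanded


nan = math.nan
sites = [
    {"id": 0, "Ratio H/L___1": 1.5, "Ratio H/L___2": nan, "Ratio H/L___3": 0.0},
    {"id": 1, "Ratio H/L___1": nan, "Ratio H/L___2": 0.8, "Ratio H/L___3": 2.1},
]
peptides = expand_sites_sketch(sites, ["Ratio H/L"])
kept_zero = expand_sites_sketch(sites, ["Ratio H/L"], replace_zero=False)
```

With replace_zero=True the two toy sites give three peptide rows; with replace_zero=False the 0.0 entry of site 0 is kept, giving four.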
autoprot.preprocessing.transformation.log(df: DataFrame, cols: Sequence[str], base: int = 2, invert: Sequence[int] | None = None, return_cols: bool = False, ratio_identifier: str = '(\\w)(/)(\\w)', ratio_replace: str = '\\3\\2\\1')

Perform log transformation.

Parameters:
  • df (pd.DataFrame) – Input dataframe.

  • cols (list of str) – Columns to be transformed.

  • base (int, optional) – Base of the logarithm. The default is 2.

  • invert (list of int, optional) – Vector whose length matches the number of columns. Each column is multiplied by the corresponding number (e.g. -1 to invert a ratio). The default is None.

  • return_cols (bool, optional) – Whether to return a list of names corresponding to the columns added to the dataframe. The default is False.

  • ratio_identifier (str, optional) – Regular expression used to find ratios and invert their labels when invert is used.

  • ratio_replace (str, optional) – Replacement pattern used to reinsert the ratio identifier after transformation. The default inverts the labels around the division sign (i.e. H/L -> L/H).

Returns:

  • pd.DataFrame – The log-transformed dataframe.

  • list – A list of column names (if return_cols is True).

Examples

First, collect the column names holding the intensity ratios.

>>> protRatio = prot.filter(regex=r"^Ratio .\/.( | normalized )B").columns
>>> phosRatio = phos.filter(regex=r"^Ratio .\/.( | normalized )R.___").columns

Some ratios need to be inverted as a result of label switches. This can be accomplished using the invert parameter. Log transformations with arbitrary bases can be used; however, 2 and 10 are most commonly applied.

>>> invert = [-1., -1., 1., 1., -1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
...       1., 1., 1., 1., 1., 1., 1., -1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
...       1., -1.]
>>> prot2 = autoprot.preprocessing.log(prot, protRatio, base=2, invert=invert + invert)
>>> phos2 = autoprot.preprocessing.log(phos, phosRatio, base=10)

The resulting dataframe contains log ratios or NaN.

>>> prot2.filter(regex="log.+_Ratio M/L BC18_1$").head()
   log2_Ratio M/L BC18_1
0                    NaN
1              -0.478609
2                    NaN
3                    NaN
4               1.236503
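
The effect of invert on a single column can be sketched as follows. The two regular expressions are the documented defaults of ratio_identifier and ratio_replace; everything else (the helper name, applying the label swap only for -1, mapping non-positive values to NaN, the order of prefixing and renaming) is an assumption made for illustration.

```python
import math
import re

RATIO_IDENTIFIER = r"(\w)(/)(\w)"  # documented default
RATIO_REPLACE = r"\3\2\1"          # documented default: H/L -> L/H


def log_column_sketch(name, values, base=2, invert=1):
    """Log-transform one column; multiply by `invert` and swap the
    ratio label when the column is inverted."""
    log_values = [
        invert * math.log(v, base) if v > 0 else math.nan
        for v in values
    ]
    if invert == -1:
        name = re.sub(RATIO_IDENTIFIER, RATIO_REPLACE, name)
    return f"log{base}_{name}", log_values


new_name, new_values = log_column_sketch(
    "Ratio H/L", [4.0, 0.25, math.nan], base=2, invert=-1
)
# new_name is "log2_Ratio L/H"
```

Multiplying a log ratio by -1 is the same as taking the log of the reciprocal ratio, which is why swapping the label from H/L to L/H keeps the column name consistent with its content.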
autoprot.preprocessing.transformation.make_sim_score(m1: Sequence, m2: Sequence, corr: Literal['Pearson', 'Spearman'] = 'pearson') → float

Calculate similarity score.

The score quantitatively describes the resemblance between the temporal profiles observed after subjecting the cells to the two treatments. Implemented as described in [1].

Parameters:
  • m1 (array-like) – Time course of SILAC ratios after treatment 1.

  • m2 (array-like) – Time course of SILAC ratios after treatment 2.

  • corr (str, optional) – Correlation method, either ‘Pearson’ or ‘Spearman’. The default is “pearson”.

Returns:

S-score that describes both the resemblance of the patterns of regulation and the resemblance between the degrees of regulation in the range from zero to infinity.

Return type:

float

Examples

Similar temporal profiles result in high S-scores

>>> s1 = [1,1,1,2,3,4,4]
>>> s2 = [1,1,1,2,3,3,4]
>>> autoprot.preprocessing.make_sim_score(s1, s2)
50.97173553835997

Low resemblance results in low scores

>>> s2 = [1.1,1.1,1,1,1,1,1]
>>> autoprot.preprocessing.make_sim_score(s1, s2)
16.33374591446012

References

[1] https://www.doi.org/10.1126/scisignal.2001570

autoprot.preprocessing.transformation.merge_semi_cols(m1: DataFrame, m2: DataFrame, semicolon_col1: str, semicolon_col2: str | None = None, how: Literal['left', 'right', 'outer', 'inner'] = 'left')

Merge two dataframes on a semicolon-separated column.

By default, m2 is merged onto m1 (left merge), so entries in m2 that are not matched to m1 are dropped.

Parameters:
  • m1 (pd.DataFrame) – First dataframe to merge.

  • m2 (pd.DataFrame) – Second dataframe to merge with the first.

  • semicolon_col1 (str) – Name of a column containing semicolon-separated values in m1.

  • semicolon_col2 (str, optional) – Name of a column containing semicolon-separated values in m2. If semicolon_col2 is None, it is assumed to be the same as semicolon_col1. The default is None.

  • how ('left', 'right', 'outer', 'inner') – How to perform the merge.

Returns:

Merged dataframe with expanded columns.

Return type:

pd.DataFrame
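
A left merge on semicolon-separated identifiers can be sketched on lists of dicts. This illustrates the idea only and is not autoprot's implementation: the helper name, the "_m2" suffix for columns coming from m2, the toy data, and the rule that an m1 row matches an m2 row when they share at least one identifier are all assumptions.

```python
def merge_semi_sketch(m1, m2, col1, col2=None, delimiter=";"):
    """Left-merge m2 onto m1 where the two rows share an identifier."""
    col2 = col1 if col2 is None else col2

    # Index the rows of m2 by every single identifier they contain
    lookup = {}
    for row in m2:
        for key in row[col2].split(delimiter):
            lookup.setdefault(key, []).append(row)

    merged = []
    for row in m1:
        # Collect each matching m2 row once, in order of first match
        matches = []
        for key in row[col1].split(delimiter):
            for hit in lookup.get(key, []):
                if hit not in matches:
                    matches.append(hit)
        if not matches:
            merged.append(dict(row))  # left merge: keep unmatched m1 rows
        for hit in matches:
            suffixed = {f"{k}_m2": v for k, v in hit.items()}
            merged.append({**row, **suffixed})
    return merged


# Toy data with made-up identifiers
m1 = [{"Proteins": "A;B", "ratio": 1.2}, {"Proteins": "C", "ratio": 0.7}]
m2 = [{"Proteins": "B;D", "site": "S15"}, {"Proteins": "E", "site": "T3"}]
merged = merge_semi_sketch(m1, m2, "Proteins")
```

Here the first m1 row matches the first m2 row through the shared identifier "B", the second m1 row is kept without m2 columns, and the m2 row for "E" is dropped because nothing in m1 refers to it.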