Sequence annotation for data preprocessing

Autoprot Preprocessing Functions.

@author: Wignand, Julian, Johannes

@documentation: Julian

autoprot.preprocessing.annotation.annotate_phosphosite(df, ps, cols_to_keep=None)

Annotate phosphosites with information derived from PhosphoSitePlus.

Parameters:
  • df (pd.DataFrame) – Dataframe containing the phosphosites (PS) of interest.

  • ps (str) – Column containing info about the PS. Format: GeneName_AminoacidPosition (e.g. AKT_T308).

  • cols_to_keep (list, optional) – Columns of the input dataframe (df) to keep in the output. The default is None.

Returns:

The input dataframe with the kept columns and additional phosphosite cols.

Return type:

pd.DataFrame
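The GeneName_AminoacidPosition format expected in the ps column can be illustrated with a short sketch. The helpers `make_ps_key` and `parse_ps_key` are hypothetical names introduced here for illustration and are not part of autoprot; the regular expression assumes a one-letter amino acid code followed by an integer position.

```python
import re

# Assumed layout: <gene name>_<one-letter amino acid><position>, e.g. AKT_T308.
PS_KEY = re.compile(r"^(?P<gene>.+)_(?P<aa>[A-Z])(?P<pos>\d+)$")


def make_ps_key(gene, amino_acid, position):
    """Build a phosphosite key such as 'AKT_T308' (hypothetical helper)."""
    return f"{gene}_{amino_acid}{position}"


def parse_ps_key(key):
    """Split a phosphosite key into (gene, amino acid, position)."""
    match = PS_KEY.match(key)
    if match is None:
        raise ValueError(f"not a GeneName_AminoacidPosition key: {key!r}")
    return match["gene"], match["aa"], int(match["pos"])


key = make_ps_key("AKT", "T", 308)
print(key)
print(parse_ps_key(key))
```

A column built this way could then be passed by name as the ps argument.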

autoprot.preprocessing.annotation.get_subcellular_loc(series, database='compartments', loca=None, colname='Gene names')

Annotate the df with subcellular localization.

For the compartments database, gene names are required.

Parameters:
  • series (pd.Series) – Must contain the colname to identify genes.

  • database (str, optional) – Possible values are “compartments” and “hpa”. The default is “compartments”.

  • loca (str, optional) – Only used for the compartments database. Filter the returned localisation table by this string, which must exactly match a localisation term in the compartments DB. The default is None.

  • colname (str, optional) – Name of the index entry holding the gene names. The default is “Gene names”.

Raises:

ValueError – If an unsupported value is provided for the database argument.

Notes

The compartments database is obtained from https://compartments.jensenlab.org/Downloads. The hpa database is the Human Protein Atlas, available at https://www.proteinatlas.org.

Returns:

  • pd.DataFrame – Dataframe with columns “ENSMBL”, “Gene name”, “LOCID”, “LOCNAME”, “SCORE” for compartments database.

  • tuple of lists (main_loc, alt_loc) – Lists of main and alternative localisations if the hpa database was chosen.

Examples

>>> series = pd.Series(['PEX14',], index=['Gene names'])

Find all subcellular localisations of PEX14. The second statement filters the returned dataframe so that only the rows with the highest score are retained. The result is converted to a sorted list for better readability.

>>> loc_df = autoprot.preprocessing.get_subcellular_loc(series)
>>> sorted(loc_df.loc[loc_df[loc_df['SCORE'] == loc_df['SCORE'].max()].index,
...                   'LOCNAME'].tolist())
['Bounding membrane of organelle', 'Cellular anatomical entity', 'Cytoplasm', 'Intracellular',
'Intracellular membrane-bounded organelle', 'Intracellular organelle', 'Membrane', 'Microbody',
'Microbody membrane', 'Nucleus', 'Organelle', 'Organelle membrane', 'Peroxisomal membrane', 'Peroxisome',
'Whole membrane', 'cellular_component', 'membrane-bounded organelle', 'protein-containing complex']

Get the score for PEX14 being localised to the peroxisome.

>>> loc_df = autoprot.preprocessing.get_subcellular_loc(series, loca='Peroxisome')
>>> loc_df['SCORE'].tolist()[0]
5.0

With the Human Protein Atlas, a tuple of two lists containing the main and alternative localisations is returned.

>>> autoprot.preprocessing.get_subcellular_loc(series, database='hpa')
(['Peroxisomes'], ['Nucleoli fibrillar center'])

autoprot.preprocessing.annotation.go_annot(prots: DataFrame, gos: list, only_prots: bool = False, exact: bool = True) → DataFrame | Series

Filter a list of experimentally determined gene names by GO annotation.

The Homo_sapiens.gene_info and gene2go files are needed for annotation.

If a line contains multiple gene names (e.g. AKT1;PKB), only the first name is extracted.

Parameters:
  • prots (list of str) – List of gene names.

  • gos (list of str) – List of GO terms.

  • only_prots (bool, optional) – If True, return only the gene names annotated with the terms instead of the full dataframe. The default is False.

  • exact (bool, optional) – Whether the GO term must match exactly. If False, related terms also match (e.g. MAPK activity <-> regulation of MAPK activity). The default is True.
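The two behaviours described above (taking only the first gene name of a semicolon-separated entry, and exact versus partial GO term matching) can be sketched in plain Python. This is an illustration under assumptions, not autoprot's implementation: `first_gene_name` and `term_matches` are hypothetical helpers, partial matching is modelled as a case-sensitive substring test, and the real function may differ in both respects.

```python
def first_gene_name(entry, sep=";"):
    """Return the first name of an entry such as 'AKT1;PKB' (hypothetical helper)."""
    return entry.split(sep)[0].strip()


def term_matches(go_term, query, exact=True):
    """Compare an annotated GO term with a query term.

    exact=True  -> the two strings must be identical
    exact=False -> the query only has to occur inside the annotated term
                   (assumed substring semantics)
    """
    if exact:
        return go_term == query
    return query in go_term


print(first_gene_name("AKT1;PKB"))
print(term_matches("regulation of MAPK activity", "MAPK activity", exact=True))
print(term_matches("regulation of MAPK activity", "MAPK activity", exact=False))
```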

Returns:

Dataframe with columns “index”, “Gene names”, “GeneID”, “GO_ID”, “GO_term”, or a Series with gene names if only_prots is True.

Return type:

pd.DataFrame or pd.Series

Examples

>>> gos = ["ribosome"]
>>> go = autoprot.preprocessing.go_annot(prot["Gene names"],gos, only_prots=False)
>>> go.head()
   index Gene names  GeneID       GO_ID   GO_term
0   1944      RPS27    6232  GO:0005840  ribosome
1   6451      RPS25    6230  GO:0005840  ribosome
2   7640     RPL36A    6173  GO:0005840  ribosome
3  11130      RRBP1    6238  GO:0005840  ribosome
4  16112        SF1    7536  GO:0005840  ribosome

autoprot.preprocessing.annotation.motif_annot(df, motif, col='Sequence window')

Search for a phosphorylation motif in the provided dataframe.

If not specified, the “Sequence window” column is searched. The phosphorylated central residue in a motif has to be indicated with “S/T”. Arbitrary amino acids can be denoted with x.

Parameters:
  • df (pd.DataFrame) – Input dataframe.

  • motif (str) – Target motif, e.g. “RxRxxS/T”, “PxS/TP” or “RxRxxS/TxSxxR”.

  • col (str, optional) – Column to search instead of “Sequence window”. The default is “Sequence window”.

Returns:

Dataframe with an additional boolean column indicating (True/False) whether the motif was found in the respective row.

Return type:

pd.DataFrame
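The motif syntax described above can be illustrated with a small regular-expression sketch. This is not autoprot's implementation: `motif_to_regex` and `has_motif` are hypothetical helper names, and the sketch assumes that the phosphorylated residue sits at the centre of the sequence window (index `len(window) // 2`) and that the wildcard x is always lower case.

```python
import re


def motif_to_regex(motif):
    """Translate e.g. 'RxRxxS/T' into (compiled regex, offset of the S/T residue)."""
    if "S/T" not in motif:
        raise ValueError("the phosphorylated residue must be written as 'S/T'")
    before, after = motif.split("S/T", 1)
    # 'x' stands for an arbitrary amino acid, so it becomes the regex wildcard.
    pattern = before.replace("x", ".") + "[ST]" + after.replace("x", ".")
    return re.compile(pattern), len(before)


def has_motif(window, motif):
    """Check whether the motif is present with S/T on the central residue."""
    regex, offset = motif_to_regex(motif)
    start = len(window) // 2 - offset
    if start < 0:
        return False
    # Pattern.match with a start position anchors the match at that index.
    return regex.match(window, start) is not None


window = "VSNESTSSSPGKEGHSPEGSTVTYHLLGPQE"  # PEX14 window used elsewhere on this page
print(has_motif(window, "S/TP"))
print(has_motif(window, "RxRxxS/T"))
```

Anchoring the match at the window centre, rather than searching the whole window, avoids reporting motifs that occur around a neighbouring, non-quantified residue.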

autoprot.preprocessing.annotation.to_canonical_ps(series, organism='human', get_seq='online', uniprot=None, print_alignment=False)

Convert phosphosites to “canonical” phosphosites.

Parameters:
  • series (pd.Series) – Series containing the indices “Gene names” and “Sequence window”. Corresponds e.g. to a row in MQ Phospho(STY)Sites.txt.

  • organism (str, optional) – This conversion is based on the UniProt identifiers used in the PSP data. Possible organisms: ‘mouse’, ‘human’, ‘rat’, ‘sheep’, ‘SARSCoV2’, ‘guinea pig’, ‘cow’, ‘hamster’, ‘fruit fly’, ‘dog’, ‘rabbit’, ‘pig’, ‘chicken’, ‘frog’, ‘quail’, ‘horse’, ‘goat’, ‘papillomavirus’, ‘water buffalo’, ‘marmoset’, ‘turkey’, ‘cat’, ‘starfish’, ‘torpedo’, ‘SARSCoV1’, ‘green monkey’, ‘ferret’. The default is “human”.

  • get_seq ("local" or "online") – Source of the protein sequence: a local file (see uniprot) or an online lookup. The default is “online”.

  • uniprot (str, optional) – Path to a gzipped uniprot.tsv file. Required if get_seq is ‘local’.

  • print_alignment (bool, optional) – If True, the alignments from which the new phosphosite information is derived are printed. The default is False.

Notes

This function compares the given gene name with the genes found in the PhosphoSitePlus (https://www.phosphosite.org) phosphorylation site dataset.

Returns:

  • list of (str, str, str) – (UniProt ID, position of the phosphosite in the UniProt sequence, score)

  • Proteins with two gene names separated by a semicolon are returned in the same format and order.

Examples

The correct position of the phosphorylation is returned independent of the completeness of the sequence window.

>>> series=pd.Series(['PEX14', "VSNESTSSSPGKEGHSPEGSTVTYHLLGPQE"], index=['Gene names', 'Sequence window'])
>>> autoprot.preprocessing.to_canonical_ps(series, organism='human')
['O75381', '282', '31.0']
>>> series=pd.Series(['PEX14', "_____TSSSPGKEGHSPEGSTVTYHLLGP__"], index=['Gene names', 'Sequence window'])
>>> autoprot.preprocessing.to_canonical_ps(series, organism='human')
['O75381', '282', '31.0']
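Why a padded or truncated sequence window can still yield the same site position can be sketched as follows. This is a simplified illustration, not the library's method: the print_alignment option suggests that autoprot aligns sequences, whereas this sketch swaps in exact substring matching. `central_site_position` is a hypothetical helper, and `toy_protein` is a made-up sequence that merely embeds the PEX14 window from the examples above; it is not the real PEX14 sequence.

```python
def central_site_position(window, protein_seq, pad="_"):
    """Return the 1-based position of the window's central residue in protein_seq.

    Assumes the modified residue is the central character of the window and
    that missing flanking residues are padded with '_' (hypothetical helper).
    """
    centre = len(window) // 2
    n_lead = len(window) - len(window.lstrip(pad))  # leading padding characters
    core = window.strip(pad)
    offset = centre - n_lead  # index of the central residue inside the core
    if not core or not 0 <= offset < len(core):
        return None
    start = protein_seq.find(core)  # exact match instead of an alignment
    if start == -1:
        return None
    return start + offset + 1


# Made-up sequence for illustration only.
toy_protein = "MA" + "VSNESTSSSPGKEGHSPEGSTVTYHLLGPQE" + "KK"

full = "VSNESTSSSPGKEGHSPEGSTVTYHLLGPQE"
padded = "_____TSSSPGKEGHSPEGSTVTYHLLGP__"
print(central_site_position(full, toy_protein))
print(central_site_position(padded, toy_protein))
```

Both calls report the same position because the leading padding is subtracted before the central residue is located in the matched stretch.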