Sequence annotation for data preprocessing
Autoprot Preprocessing Functions.
@author: Wignand, Julian, Johannes
@documentation: Julian
- autoprot.preprocessing.annotation.annotate_phosphosite(df, ps, cols_to_keep=None)
Annotate phosphosites with information derived from PhosphoSitePlus.
- Parameters:
df (pd.DataFrame) – Dataframe containing the phosphosites (PS) of interest.
ps (str) – Column containing info about the PS. Format: GeneName_AminoacidPosition (e.g. AKT_T308).
cols_to_keep (list, optional) – Which columns from original dataframe (input df) to keep in output. The default is None.
- Returns:
The input dataframe with the kept columns and additional phosphosite cols.
- Return type:
pd.DataFrame
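The GeneName_AminoacidPosition format expected in the ps column can be illustrated with a small parser. This is a minimal sketch, not autoprot code: split_phosphosite and PS_PATTERN are hypothetical names, and the sketch assumes a single-letter residue followed by a numeric position after the last underscore.

```python
import re

# Hypothetical helper, not part of autoprot.
# Assumes labels look like "<gene>_<one-letter residue><position>".
PS_PATTERN = re.compile(r"^(?P<gene>.+)_(?P<residue>[A-Z])(?P<position>\d+)$")


def split_phosphosite(label):
    """Return (gene, residue, position) for a label such as 'AKT_T308'."""
    match = PS_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Not in GeneName_AminoacidPosition format: {label!r}")
    return match["gene"], match["residue"], int(match["position"])


print(split_phosphosite("AKT_T308"))  # ('AKT', 'T', 308)
```

Checking labels this way before calling annotate_phosphosite may help catch malformed entries early.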
- autoprot.preprocessing.annotation.get_subcellular_loc(series, database='compartments', loca=None, colname='Gene names')
Annotate the df with subcellular localization.
For the compartments database, gene names are required.
- Parameters:
series (pd.Series) – Must contain the colname to identify genes.
database (str, optional) – Possible values are “compartments” and “hpa”. The default is “compartments”.
loca (str, optional) – Only required for the compartments database. Filter the returned localisation table by this string. Must match exactly to the localisation terms in the compartments DB. The default is None.
colname (str, optional) – Colname holding the gene names. The default is “Gene names”.
- Raises:
ValueError – If an unsupported value is provided for the database argument.
Notes
The compartments database is obtained from https://compartments.jensenlab.org/Downloads . The hpa database is the human protein atlas available at https://www.proteinatlas.org .
- Returns:
pd.DataFrame – Dataframe with columns “ENSMBL”, “Gene name”, “LOCID”, “LOCNAME”, “SCORE” for compartments database.
tuple of lists (main_loc, alt_loc) – Lists of main and alternative localisations if the hpa database was chosen.
Examples
>>> series = pd.Series(['PEX14',], index=['Gene names'])
Find all subcellular localisations of PEX14. The second line filters the returned dataframe so that only values with the highest score are retained. The dataframe is converted to list for better visualisation.
>>> loc_df = autoprot.preprocessing.get_subcellular_loc(series)
>>> sorted(loc_df.loc[loc_df[loc_df['SCORE'] == loc_df['SCORE'].max()].index,
...                   'LOCNAME'].tolist())
['Bounding membrane of organelle', 'Cellular anatomical entity', 'Cytoplasm', 'Intracellular', 'Intracellular membrane-bounded organelle', 'Intracellular organelle', 'Membrane', 'Microbody', 'Microbody membrane', 'Nucleus', 'Organelle', 'Organelle membrane', 'Peroxisomal membrane', 'Peroxisome', 'Whole membrane', 'cellular_component', 'membrane-bounded organelle', 'protein-containing complex']
Get the score for PEX14 being peroxisomally localised.
>>> loc_df = autoprot.preprocessing.get_subcellular_loc(series, loca='Peroxisome')
>>> loc_df['SCORE'].tolist()[0]
5.0
Using the Human Protein Atlas, a tuple of two lists containing the main and alternative localisations is returned.
>>> autoprot.preprocessing.get_subcellular_loc(series, database='hpa')
(['Peroxisomes'], ['Nucleoli fibrillar center'])
- autoprot.preprocessing.annotation.go_annot(prots: DataFrame, gos: list, only_prots: bool = False, exact: bool = True) DataFrame | Series
Filter a list of experimentally determined gene names by GO annotation.
The Homo_sapiens.gene_info and gene2go files are needed for annotation.
In case of multiple gene names per line (e.g. AKT1;PKB), only the first name will be extracted.
- Parameters:
prots (list of str) – List of Gene names.
gos (list of str) – List of GO terms.
only_prots (bool, optional) – Whether to return a dataframe or only the list of gene names annotated with the terms. The default is False.
exact (bool, optional) – Whether the GO term must match exactly. If False, a query such as “MAPK activity” also matches related terms such as “regulation of MAPK activity”. The default is True.
- Returns:
Dataframe with columns “index”, “Gene names”, “GeneID”, “GO_ID”, “GO_term” or Series with gene names
- Return type:
pd.DataFrame or pd.Series
Examples
>>> gos = ["ribosome"]
>>> go = autoprot.preprocessing.go_annot(prot["Gene names"], gos, only_prots=False)
>>> go.head()
   index Gene names  GeneID       GO_ID   GO_term
0   1944      RPS27    6232  GO:0005840  ribosome
1   6451      RPS25    6230  GO:0005840  ribosome
2   7640     RPL36A    6173  GO:0005840  ribosome
3  11130      RRBP1    6238  GO:0005840  ribosome
4  16112        SF1    7536  GO:0005840  ribosome
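The two behaviours described for go_annot, taking only the first of several semicolon-separated gene names and matching GO terms either exactly or loosely, can be sketched in plain Python. The helper names first_gene_name and term_matches are hypothetical, and treating a non-exact match as substring containment is an assumption rather than a statement about the library's internals.

```python
# Hypothetical helpers, not part of autoprot.

def first_gene_name(entry):
    """Keep only the first name of a semicolon-separated entry."""
    return entry.split(";")[0].strip()


def term_matches(query, go_term, exact=True):
    """Compare a query against a GO term.

    Assumption: a non-exact match means the query occurs as a
    substring of the GO term.
    """
    if exact:
        return query == go_term
    return query in go_term


print(first_gene_name("AKT1;PKB"))  # AKT1
print(term_matches("MAPK activity", "regulation of MAPK activity"))  # False
print(term_matches("MAPK activity", "regulation of MAPK activity", exact=False))  # True
```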
- autoprot.preprocessing.annotation.motif_annot(df, motif, col='Sequence window')
Search for phosphorylation motif in the provided dataframe.
If not specified, the “Sequence window” column is searched. The phosphorylated central residue in a motif has to be indicated with “S/T”. Arbitrary amino acids can be denoted with x.
- Parameters:
df (pd.DataFrame) – Input dataframe.
motif (str) – Target motif. E.g. “RxRxxS/T”, “PxS/TP” or “RxRxxS/TxSxxR”
col (str, optional) – Alternative column to be searched in if Sequence window is not desired. The default is “Sequence window”.
- Returns:
Dataframe with an additional boolean column indicating (True/False) whether the motif is found in the row.
- Return type:
pd.DataFrame
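The motif notation (x for any amino acid, S/T for the phosphorylated central residue) maps naturally onto a regular expression. The following is a minimal sketch of that idea and not the autoprot implementation: motif_to_regex and has_motif are hypothetical names, and the sketch assumes the phosphorylated residue sits at the middle index of the sequence window.

```python
import re

# Hypothetical helpers, not part of autoprot.

def motif_to_regex(motif):
    """Translate e.g. 'RxRxxS/T' into (compiled regex, offset of the S/T residue)."""
    if "S/T" not in motif:
        raise ValueError("motif must mark the phosphorylated residue with 'S/T'")
    # Every character before 'S/T' stands for exactly one residue.
    center_offset = motif.index("S/T")
    pattern = motif.replace("S/T", "[ST]").replace("x", ".")
    return re.compile(pattern), center_offset


def has_motif(window, motif):
    """Check whether the motif fits with its S/T on the central residue."""
    regex, center_offset = motif_to_regex(motif)
    center = len(window) // 2  # assumption: phosphosite is the middle residue
    start = center - center_offset
    if start < 0:
        return False
    # Pattern.match with a start position anchors the match at that index.
    return regex.match(window, start) is not None


# Toy 31-residue window with 'S' at the central index 15.
window = "A" * 10 + "RARAAS" + "A" * 15
print(has_motif(window, "RxRxxS/T"))  # True
print(has_motif(window, "PxS/TP"))    # False
```

Anchoring the match on the central residue, rather than searching anywhere in the window, avoids reporting motifs that surround a different serine or threonine.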
- autoprot.preprocessing.annotation.to_canonical_ps(series, organism='human', get_seq='online', uniprot=None, print_alignment=False)
Convert phosphosites to “canonical” phosphosites.
- Parameters:
series (pd.Series) – Series containing the indices “Gene names” and “Sequence window”. Corresponds e.g. to a row in MQ Phospho(STY)Sites.txt.
organism (str, optional) – This conversion is based on the UniProt identifier used in PSP data. Possible organisms: ‘mouse’, ‘human’, ‘rat’, ‘sheep’, ‘SARSCoV2’, ‘guinea pig’, ‘cow’, ‘hamster’, ‘fruit fly’, ‘dog’, ‘rabbit’, ‘pig’, ‘chicken’, ‘frog’, ‘quail’, ‘horse’, ‘goat’, ‘papillomavirus’, ‘water buffalo’, ‘marmoset’, ‘turkey’, ‘cat’, ‘starfish’, ‘torpedo’, ‘SARSCoV1’, ‘green monkey’, ‘ferret’. The default is “human”.
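get_seq ("local" or "online") – Whether the protein sequence is retrieved online or read from a local UniProt file. The default is “online”.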
uniprot (str, optional) – Path to a gzipped uniprot.tsv file. Required if get_seq is ‘local’
print_alignment (bool, optional) – If True, alignments from which the new phosphosite information is derived are printed.
Notes
This function compares a given gene name to the genes found in the PhosphoSitePlus (https://www.phosphosite.org) phosphorylation site dataset.
- Returns:
list of (str, str, str) – (UniProt ID, Position of phosphosite in the UniProt sequence, score)
Proteins with two gene names separated by a semicolon are returned in the same way and order.
Examples
The correct position of the phosphorylation is returned independent of the completeness of the sequence window.
>>> series = pd.Series(['PEX14', "VSNESTSSSPGKEGHSPEGSTVTYHLLGPQE"], index=['Gene names', 'Sequence window'])
>>> autoprot.preprocessing.to_canonical_ps(series, organism='human')
['O75381', '282', '31.0']
>>> series = pd.Series(['PEX14', "_____TSSSPGKEGHSPEGSTVTYHLLGP__"], index=['Gene names', 'Sequence window'])
>>> autoprot.preprocessing.to_canonical_ps(series, organism='human')
['O75381', '282', '31.0']
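Why a padded and an unpadded sequence window yield the same position can be shown with a simplified stand-in. The sketch below uses an exact substring search instead of the alignment that to_canonical_ps reports a score for, so it is an illustration of the bookkeeping only. The function central_site_position and the toy protein sequence are hypothetical, and the sketch assumes the phosphosite is the middle residue of the window and that padding uses underscores.

```python
# Hypothetical helper, not part of autoprot. Uses exact substring search,
# not the alignment performed by to_canonical_ps.

def central_site_position(window, protein_sequence, pad="_"):
    """Return the 1-based position of the window's central residue, or None."""
    center = len(window) // 2  # assumption: phosphosite is the middle residue
    left_pad = len(window) - len(window.lstrip(pad))
    fragment = window.strip(pad)
    start = protein_sequence.find(fragment)
    if start == -1:
        return None
    # Offset of the central residue inside the unpadded fragment.
    return start + (center - left_pad) + 1


# Toy sequence for illustration only, not the real PEX14 sequence.
toy_protein = "MAAA" + "VSNESTSSSPGKEGHSPEGSTVTYHLLGPQE" + "KKK"

full = central_site_position("VSNESTSSSPGKEGHSPEGSTVTYHLLGPQE", toy_protein)
padded = central_site_position("_____TSSSPGKEGHSPEGSTVTYHLLGP__", toy_protein)
print(full, padded)  # 20 20
```

Subtracting the left padding from the central index is what keeps the reported position stable when the window is incomplete.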