Filtering
Autoprot Preprocessing Functions.
@author: Wignand, Julian, Johannes
@documentation: Julian
- autoprot.preprocessing.filtering.cleaning(df, file='proteinGroups')
Remove contaminant, reverse and identified by site only entries.
- Parameters:
df (pd.DataFrame) – Dataframe to clean up.
file (str, optional) – Which MaxQuant file the dataframe was read from. Possible values are “proteinGroups”, “Phospho (STY)”, “evidence”, “modificationSpecificPeptides” or “peptides”. The default is “proteinGroups”.
- Returns:
df – The cleaned dataframe.
- Return type:
pd.DataFrame
Examples
Cleaning can target different MQ txt files such as proteinGroups and phospho (STY) tables. The variables phos and prot are parsed MQ results tables.
>>> prot_clean = pp.cleaning(prot, "proteinGroups")
4910 rows before filter operation.
4624 rows after filter operation.
>>> phos_clean = pp.cleaning(phos, file="Phospho (STY)")
47936 rows before filter operation.
47420 rows after filter operation.
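To illustrate what removing contaminant, reverse and identified-by-site-only entries amounts to, the following is a minimal pandas sketch and not autoprot's actual implementation. It assumes the MaxQuant convention that flagged rows carry a "+" in the columns "Potential contaminant", "Reverse" and "Only identified by site"; the demo table and the helper drop_flagged are hypothetical.

```python
import pandas as pd

# Hypothetical miniature proteinGroups table. The three marker columns are
# assumed to follow MaxQuant naming, where "+" flags an entry.
prot_demo = pd.DataFrame({
    "Protein IDs": ["P1", "CON__P2", "REV__P3", "P4", "P5"],
    "Potential contaminant": [None, "+", None, None, None],
    "Reverse": [None, None, "+", None, None],
    "Only identified by site": [None, None, None, "+", None],
})


def drop_flagged(df, flag_cols):
    """Keep rows without a '+' in any of the given flag columns (illustrative helper)."""
    # Ignore flag columns that are absent, since not every MaxQuant table has all three.
    present = [c for c in flag_cols if c in df.columns]
    flagged = df[present].eq("+").any(axis=1)
    return df.loc[~flagged]


cleaned = drop_flagged(
    prot_demo, ["Potential contaminant", "Reverse", "Only identified by site"]
)
print(f"{len(prot_demo)} rows before filter operation.")
print(f"{len(cleaned)} rows after filter operation.")
```

Which marker columns exist depends on the MaxQuant file type and version, which is presumably why cleaning() takes the file argument.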
- autoprot.preprocessing.filtering.filter_loc_prob(df, thresh=0.75)
Filter by localization probability.
- Parameters:
df (pd.DataFrame) – Dataframe to filter.
thresh (float, optional) – Entries with a localization probability below this threshold will be removed. The default is 0.75.
Examples
The .filter_loc_prob() function filters a Phospho (STY)Sites.txt file. You can provide the desired threshold with the thresh parameter.
>>> phos_filter = pp.filter_loc_prob(phos, thresh=.75)
47936 rows before filter operation.
33311 rows after filter operation.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
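The thresholding can be pictured with plain pandas. This is a sketch under the assumption that the probability is stored in a column named "Localization prob", as in MaxQuant's Phospho (STY)Sites.txt; the demo values are made up, and whether autoprot keeps rows exactly at the threshold is inferred from the wording "below will be removed".

```python
import pandas as pd

# Hypothetical phosphosite table with an assumed MaxQuant-style column name.
phos_demo = pd.DataFrame({
    "Localization prob": [0.99, 0.74, 0.75, 0.50],
})

thresh = 0.75
# Rows strictly below the threshold are removed; rows at the threshold stay.
phos_kept = phos_demo[phos_demo["Localization prob"] >= thresh]

print(f"{len(phos_demo)} rows before filter operation.")
print(f"{len(phos_kept)} rows after filter operation.")
```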
- autoprot.preprocessing.filtering.filter_seq_cov(df, thresh, cols=None)
Filter by sequence coverage.
- Parameters:
df (pd.DataFrame) – Dataframe to filter.
thresh (int) – Entries below this value will be excluded from the dataframe.
cols (list of str, optional) – List of sequence coverage column names. A row is excluded from the final dataframe if the value in any of the provided columns is below the threshold. The default is None.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
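The "any column below the threshold" rule described for cols can be sketched in plain pandas. This is an illustration only, not the library's code; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical table with two per-experiment sequence coverage columns (in percent).
cov_demo = pd.DataFrame({
    "Sequence coverage exp1 [%]": [40.0, 5.0, 30.0],
    "Sequence coverage exp2 [%]": [35.0, 50.0, 8.0],
})

thresh = 10
cols = list(cov_demo.columns)

# A row survives only if every listed column reaches the threshold,
# i.e. it is dropped as soon as any column falls below it.
keep = (cov_demo[cols] >= thresh).all(axis=1)
cov_kept = cov_demo[keep]

print(f"{len(cov_demo)} rows before filter operation.")
print(f"{len(cov_kept)} rows after filter operation.")
```

Note that in this sketch a missing value compares as False and therefore also drops the row; whether filter_seq_cov() treats NaN the same way is not stated here.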
- autoprot.preprocessing.filtering.filter_vv(df, groups, n=2, valid_values=True)
Filter dataframe for minimum number of valid values.
- Parameters:
df (pd.DataFrame) – Dataframe to be filtered.
groups (list of lists of str) – Lists of column names of the experimental groups. Each group is filtered for at least n valid values.
n (int, optional) – Minimum number of valid values. The default is 2.
valid_values (bool, optional) – True to require a minimum number of valid values; False to allow a maximum number of missing values. The default is True.
- Returns:
pd.DataFrame – Filtered dataframe.
set (optional) – Set of indices after filtering.
Examples
The function filter_vv() filters the dataframe for a minimum number of valid values per group. You have to provide the data, the groups and the desired number of valid values. If the specified n is not reached in one or more groups, the respective row is dropped. Setting the keyword valid_values=False inverts the logic and filters the dataframe for a maximum number of missing values.
>>> protRatio = prot.filter(regex="Ratio .\/. normalized")
>>> protLog = pp.log(prot, protRatio, base=2)
>>> a = ['log2_Ratio H/M normalized BC18_1','log2_Ratio M/L normalized BC18_2',
...      'log2_Ratio H/M normalized BC18_3','log2_Ratio H/L normalized BC36_1',
...      'log2_Ratio H/M normalized BC36_2','log2_Ratio M/L normalized BC36_2']
>>> b = ["log2_Ratio H/L normalized BC18_1","log2_Ratio H/M normalized BC18_2",
...      "log2_Ratio H/L normalized BC18_3","log2_Ratio M/L normalized BC36_1",
...      "log2_Ratio H/L normalized BC36_2","log2_Ratio H/M normalized BC36_2"]
>>> c = ["log2_Ratio M/L normalized BC18_1","log2_Ratio H/L normalized BC18_2",
...      "log2_Ratio M/L normalized BC18_3", "log2_Ratio H/M normalized BC36_1",
...      "log2_Ratio M/L normalized BC36_2","log2_Ratio H/L normalized BC36_2"]
>>> protFilter = pp.filter_vv(protLog, groups=[a,b,c], n=3)
4910 rows before filter operation.
2674 rows after filter operation.
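The per-group logic can be made concrete with a small pandas sketch. It is an illustration under stated assumptions and not autoprot's implementation: keep_mask is a hypothetical helper, the demo columns are made up, and valid_values=False is interpreted as "at most n missing values per group".

```python
import numpy as np
import pandas as pd

# Hypothetical data: two experimental groups with two replicates each.
demo = pd.DataFrame({
    "a1": [1.0, np.nan, 1.0, np.nan],
    "a2": [2.0, 2.0, np.nan, np.nan],
    "b1": [3.0, 3.0, 3.0, 3.0],
    "b2": [4.0, np.nan, 4.0, 4.0],
})
groups = [["a1", "a2"], ["b1", "b2"]]


def keep_mask(df, groups, n=2, valid_values=True):
    """Boolean mask of rows that satisfy the criterion in every group (illustrative helper)."""
    mask = pd.Series(True, index=df.index)
    for cols in groups:
        if valid_values:
            # At least n non-missing values within this group.
            ok = df[cols].notna().sum(axis=1) >= n
        else:
            # Assumed reading: at most n missing values within this group.
            ok = df[cols].isna().sum(axis=1) <= n
        mask &= ok
    return mask


min_valid = demo[keep_mask(demo, groups, n=2)]
max_missing = demo[keep_mask(demo, groups, n=1, valid_values=False)]
print(list(min_valid.index), list(max_missing.index))
```

A row is dropped as soon as a single group fails the criterion, which matches the description that the respective row is dropped if n is not reached in one or more groups.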
- autoprot.preprocessing.filtering.remove_non_quant(df, cols)
Remove entries without quantitative data.
- Parameters:
df (pd.DataFrame) – Dataframe to filter.
cols (list of str) – Columns to be evaluated for missing values.
- Returns:
Filtered dataframe.
- Return type:
pd.DataFrame
Examples
>>> df = pd.DataFrame({'a':[1,2,np.nan,4], 'b':[4,0,np.nan,1], 'c':[None, None, 1, 1]})
>>> pp.remove_non_quant(df, cols=['a', 'b'])
4 rows before filter operation.
3 rows after filter operation.
     a    b    c
0  1.0  4.0  NaN
1  2.0  0.0  NaN
3  4.0  1.0  1.0
Rows are only removed if all values in the specified columns are NaN.
>>> pp.remove_non_quant(df, cols=['b', 'c'])
4 rows before filter operation.
4 rows after filter operation.
     a    b    c
0  1.0  4.0  NaN
1  2.0  0.0  NaN
2  NaN  NaN  1.0
3  4.0  1.0  1.0
Example with real data.
>>> phosRatio = phos.filter(regex="^Ratio .\/.( | normalized )R.___").columns
>>> phosQuant = pp.remove_non_quant(phosLog, phosRatio)
47936 rows before filter operation.
39398 rows after filter operation.
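The rule that a row is removed only when all of the selected columns are NaN corresponds to pandas' dropna with how='all'. The sketch below reproduces the toy example with plain pandas; it is an equivalent formulation for illustration, not necessarily how remove_non_quant() is implemented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan, 4],
    'b': [4, 0, np.nan, 1],
    'c': [None, None, 1, 1],
})

# how='all' drops a row only if every column listed in subset is missing.
kept_ab = df.dropna(subset=['a', 'b'], how='all')  # row 2 has neither a nor b
kept_bc = df.dropna(subset=['b', 'c'], how='all')  # every row has b or c

print(len(kept_ab), len(kept_bc))
```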