Filtering

Autoprot Preprocessing Functions.

@author: Wignand, Julian, Johannes

@documentation: Julian

autoprot.preprocessing.filtering.cleaning(df, file='proteinGroups')

Remove contaminant, reverse and identified by site only entries.

Parameters:
  • df (pd.DataFrame) – Dataframe to clean up.

  • file (str, optional) – Which file is provided in the dataframe. Possible values are “proteinGroups”, “Phospho (STY)”, “evidence”, “modificationSpecificPeptides” or “peptides”. The default is “proteinGroups”.

Returns:

df – The cleaned dataframe.

Return type:

pd.DataFrame

Examples

Cleaning can be applied to different MaxQuant (MQ) txt files, such as the proteinGroups and Phospho (STY) tables. The variables prot and phos are parsed MQ result tables.

>>> prot_clean = pp.cleaning(prot, "proteinGroups")
4910 rows before filter operation.
4624 rows after filter operation.
>>> phos_clean = pp.cleaning(phos, file = "Phospho (STY)")
47936 rows before filter operation.
47420 rows after filter operation.
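
To make the row removal concrete, here is a minimal sketch in plain pandas rather than autoprot. It assumes the MaxQuant convention of marking flagged rows with "+" in the columns Potential contaminant, Reverse and Only identified by site. The toy table and the helper drop_flagged are hypothetical and not part of autoprot, and the actual implementation may differ.

```python
import pandas as pd

# Hypothetical miniature proteinGroups table (assumption: flagged rows carry "+").
prot_demo = pd.DataFrame({
    "Protein IDs": ["P1", "CON__P2", "REV__P3", "P4", "P5"],
    "Potential contaminant": [None, "+", None, None, None],
    "Reverse": [None, None, "+", None, None],
    "Only identified by site": [None, None, None, "+", None],
})


def drop_flagged(df, flag_cols):
    """Drop rows that carry a "+" in any of the given flag columns."""
    flagged = df[flag_cols].eq("+").any(axis=1)
    return df.loc[~flagged]


# Assumed MaxQuant flag columns for a proteinGroups file.
flag_cols = ["Potential contaminant", "Reverse", "Only identified by site"]
prot_demo_clean = drop_flagged(prot_demo, flag_cols)
print(f"{len(prot_demo)} rows before filter operation.")
print(f"{len(prot_demo_clean)} rows after filter operation.")
```

In this toy table only the entries P1 and P5 remain, because each of the other three rows is flagged in one of the columns.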
autoprot.preprocessing.filtering.filter_loc_prob(df, thresh=0.75)

Filter by localization probability.

Parameters:
  • df (pd.DataFrame) – Dataframe to filter.

  • thresh (float, optional) – Entries with a localization probability below this value will be removed. The default is 0.75.

Examples

The filter_loc_prob() function filters a parsed Phospho (STY)Sites.txt table. Set the desired cutoff with the thresh parameter.

>>> phos_filter = pp.filter_loc_prob(phos, thresh=.75)
47936 rows before filter operation.
33311 rows after filter operation.
Returns:

Filtered dataframe.

Return type:

pd.DataFrame
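
As an illustration of the threshold semantics described above (entries below thresh are removed, so entries equal to thresh are kept), here is a small pandas sketch. It assumes the probability is stored in a column named Localization prob, as in MaxQuant site tables. The toy data are hypothetical and the snippet is not the autoprot implementation.

```python
import pandas as pd

# Hypothetical phosphosite table with an assumed "Localization prob" column.
phos_demo = pd.DataFrame({
    "Amino acid": ["S", "T", "Y", "S"],
    "Localization prob": [0.99, 0.75, 0.74, 0.50],
})

thresh = 0.75
# Keep sites whose localization probability is at least the threshold.
phos_demo_filter = phos_demo[phos_demo["Localization prob"] >= thresh]
print(f"{len(phos_demo)} rows before filter operation.")
print(f"{len(phos_demo_filter)} rows after filter operation.")
```

The site at exactly 0.75 is retained, while the sites at 0.74 and 0.50 are dropped.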

autoprot.preprocessing.filtering.filter_seq_cov(df, thresh, cols=None)

Filter by sequence coverage.

Parameters:
  • df (pd.DataFrame) – Dataframe to filter.

  • thresh (int) – Entries with a sequence coverage below this value will be excluded from the dataframe.

  • cols (list of str, optional) – List of sequence coverage colnames. A row is excluded from the final dataframe if the value in any of the provided columns is below the threshold. The default is None.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame
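
The rule for cols (a row is dropped if any of the listed columns is below the threshold) can be sketched in plain pandas as follows. The column names and values are hypothetical, and the snippet only mirrors the documented behaviour when cols is given; it is not taken from the autoprot source.

```python
import pandas as pd

# Hypothetical per-replicate sequence coverage columns (values in percent).
cov_demo = pd.DataFrame({
    "Sequence coverage rep1 [%]": [40.0, 12.0, 30.0],
    "Sequence coverage rep2 [%]": [35.0, 50.0, 5.0],
})

cov_cols = list(cov_demo.columns)
thresh = 20

# A row is kept only if every listed column reaches the threshold,
# i.e. it is excluded as soon as one column falls below it.
keep = cov_demo[cov_cols].ge(thresh).all(axis=1)
cov_demo_filter = cov_demo.loc[keep]
print(f"{len(cov_demo)} rows before filter operation.")
print(f"{len(cov_demo_filter)} rows after filter operation.")
```

Only the first row passes, since the second row falls below the threshold in rep1 and the third in rep2.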

autoprot.preprocessing.filtering.filter_vv(df, groups, n=2, valid_values=True)

Filter dataframe for minimum number of valid values.

Parameters:
  • df (pd.DataFrame) – Dataframe to be filtered.

  • groups (list of lists of str) – Lists of colnames of the experimental groups. Each group is filtered for at least n valid values.

  • n (int, optional) – Minimum number of valid values. The default is 2.

  • valid_values (bool, optional) – If True, n is the minimum number of valid values per group; if False, n is the maximum number of missing values per group. The default is True.

Returns:

  • pd.DataFrame – Filtered dataframe.

  • set (optional) – Set of indices after filtering.

Examples

The function filter_vv() filters the dataframe for a minimum number of valid values per group. You have to provide the data, the groups and the desired number of valid values. If the specified n is not reached in one or more groups, the respective row is dropped. Setting the keyword valid_values=False inverts the logic and filters the dataframe for a maximum number of missing values.

>>> protRatio = prot.filter(regex=r"Ratio .\/. normalized")
>>> protLog = pp.log(prot, protRatio, base=2)
>>> a = ['log2_Ratio H/M normalized BC18_1','log2_Ratio M/L normalized BC18_2',
...      'log2_Ratio H/M normalized BC18_3','log2_Ratio H/L normalized BC36_1',
...      'log2_Ratio H/M normalized BC36_2','log2_Ratio M/L normalized BC36_2']
>>> b = ["log2_Ratio H/L normalized BC18_1","log2_Ratio H/M normalized BC18_2",
...      "log2_Ratio H/L normalized BC18_3","log2_Ratio M/L normalized BC36_1",
...      "log2_Ratio H/L normalized BC36_2","log2_Ratio H/M normalized BC36_2"]
>>> c = ["log2_Ratio M/L normalized BC18_1","log2_Ratio H/L normalized BC18_2",
...      "log2_Ratio M/L normalized BC18_3", "log2_Ratio H/M normalized BC36_1",
...      "log2_Ratio M/L normalized BC36_2","log2_Ratio H/L normalized BC36_2"]
>>> protFilter = pp.filter_vv(protLog, groups=[a,b,c], n=3)
4910 rows before filter operation.
2674 rows after filter operation.
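
The per-group logic can be illustrated with a short pandas sketch on toy data. It covers only the default case (at least n valid values in every group); the helper enough_valid and the column names are hypothetical and do not reproduce the autoprot implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical log ratios for two experimental groups with three replicates each.
vv_demo = pd.DataFrame({
    "a1": [1.0, np.nan, 0.5, np.nan],
    "a2": [0.8, 1.2, np.nan, np.nan],
    "a3": [1.1, 0.9, np.nan, 0.3],
    "b1": [2.0, 1.5, 1.0, 0.7],
    "b2": [np.nan, 1.4, 1.2, 0.6],
    "b3": [2.2, np.nan, 0.9, 0.5],
})
groups = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
n = 2


def enough_valid(df, groups, n):
    """Return a boolean mask: True where every group has at least n non-NaN values."""
    mask = pd.Series(True, index=df.index)
    for cols in groups:
        mask &= (df[cols].notna().sum(axis=1) >= n)
    return mask


vv_demo_filter = vv_demo.loc[enough_valid(vv_demo, groups, n)]
print(f"{len(vv_demo)} rows before filter operation.")
print(f"{len(vv_demo_filter)} rows after filter operation.")
```

Rows 2 and 3 are dropped because group a holds only one valid value in each, even though group b is complete for both.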
autoprot.preprocessing.filtering.remove_non_quant(df, cols)

Remove entries without quantitative data.

Parameters:
  • df (pd.DataFrame) – Dataframe to filter.

  • cols (list of str) – Columns to be evaluated for missing values.

Returns:

Filtered dataframe.

Return type:

pd.DataFrame

Examples

>>> df = pd.DataFrame({'a':[1,2,np.nan,4], 'b':[4,0,np.nan,1], 'c':[None, None, 1, 1]})
>>> pp.remove_non_quant(df, cols=['a', 'b'])
4 rows before filter operation.
3 rows after filter operation.
     a    b    c
0  1.0  4.0  NaN
1  2.0  0.0  NaN
3  4.0  1.0  1.0

Rows are only removed if all values in the specified columns are NaN.

>>> pp.remove_non_quant(df, cols=['b', 'c'])
4 rows before filter operation.
4 rows after filter operation.
     a    b    c
0  1.0  4.0  NaN
1  2.0  0.0  NaN
2  NaN  NaN  1.0
3  4.0  1.0  1.0
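
For comparison, the same row selection can be reproduced with pandas alone. This sketch assumes that the behaviour shown above corresponds to dropping rows in which all of the listed columns are NaN, which DataFrame.dropna expresses with subset and how="all". Whether autoprot uses dropna internally is not stated here.

```python
import numpy as np
import pandas as pd

# Same toy dataframe as in the examples above.
df = pd.DataFrame({'a': [1, 2, np.nan, 4],
                   'b': [4, 0, np.nan, 1],
                   'c': [None, None, 1, 1]})

# Drop a row only if every column in the subset is NaN.
kept_ab = df.dropna(subset=['a', 'b'], how='all')
kept_bc = df.dropna(subset=['b', 'c'], how='all')

print(list(kept_ab.index))
print(list(kept_bc.index))
```

With cols 'a' and 'b', row 2 is removed; with 'b' and 'c', every row has at least one value and all four rows are kept.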

Example with real data.

>>> phosRatio = phos.filter(regex=r"^Ratio .\/.( | normalized )R.___").columns
>>> phosQuant = pp.remove_non_quant(phosLog, phosRatio)
47936 rows before filter operation.
39398 rows after filter operation.