Imputation#

Autoprot Preprocessing Functions.

@author: Wignand, Julian, Johannes

@documentation: Julian

autoprot.preprocessing.imputation.dima(df, cols: list[str] | Index, selection_substr=None, ttest_substr='cluster', methods='fast', npat=20, performance_metric='RMSE', print_r=True, min_values_for_imputation=0, return_cols=False)[source]#

Perform Data-Driven Selection of an Imputation Algorithm.

Parameters:
  • df (pd.DataFrame) – Input dataframe.

  • cols (list of str or pd.Index) – Colnames to perform imputation on. NOTE: if used on intensities, use log-transformed values.

  • selection_substr (str) – pattern to extract columns for processing during DIMA run.

  • ttest_substr (2-element list or str) –

    For statistical interpretation based on the t-test, the RMSEt ≔ RMSE(tR, tI) serves as rank criterion, where t is the t-test statistics calculated from the observed data R and the imputed data O.Todefine the null hypothesis H0, the group assignments of the samples have to be specified by the user.

    If is string, two elements need to be separated by ‘,’ If is list, concatenation will be done automatically. The two elements must be substrings of the columns to compare. Make sure that for each substring at least two matching colnames are present in the data.

  • methods (str or list of str, optional) – Methods to evaluate. Default ‘fast’ for the 9 most used imputation methods. Possible values are ‘impSeqRob’,’impSeq’,’missForest’, ‘imputePCA’,’ppca’,’bpca’, …

  • npat (int, optional) – Number of missing value patterns to evaluate

  • performance_metric (str, optional) – Metric used to select the best algorithm. Possible values are Dev, RMSE, RSR, pF, Acc, PCC, RMSEt.

  • min_values_for_imputation (int, optional) – Minimum number of non-missing values for imputation. Default is 0, which means that all values will be imputed.

  • print_r (bool) – Whether to print the R output to the Python console.

  • return_cols (bool, optional) – Whether to return the columns that were imputed. The default is False.

Returns:

  • pd.DataFrame – Input dataframe with imputed values.

  • pd.DataFrame – Overview of performance metrices of the different algorithms.

  • list of str – Columns that were imputed.

Examples

We will use a standard sample dataframe and generate some missing values to demonstrate the imputation.

>>> from autoprot import preprocessing as pp
>>> import seaborn as sns
>>> import pandas as pd
>>> import numpy as np
>>> iris = sns.load_dataset('iris')
>>> _ = iris.pop('species')
>>> for col in iris.columns:
...     iris.loc[iris.sample(frac=0.1).index, col] = np.nan
>>> imp, perf = pp.dima(
...     iris, iris.columns, performance_metric="RMSEt", ttest_substr=["petal", "sepal"]
... )
>>> imp.head()
   sepal_length  sepal_width  petal_length  ...  sepal_width_imputed  petal_length_imputed  petal_width_imputed
0           5.1          3.5           1.4  ...                  3.5                   1.4                  0.2
1           4.9          3.0           1.4  ...                  3.0                   1.4                  0.2
2           4.7          3.2           1.3  ...                  3.2                   1.3                  0.2
3           4.6          3.1           1.5  ...                  3.1                   1.5                  0.2
4           5.0          3.6           1.4  ...                  3.6                   1.4                  0.2

[5 rows x 9 columns]

>>> perf.head()
            Deviation      RMSE       RSR  p-Value_F-test   Accuracy       PCC  RMSEttest
impSeqRob    0.404402  0.531824  0.265112        0.924158  94.735915  0.997449   0.222656
impSeq       0.348815  0.515518  0.256984        0.943464  95.413732  0.997563   0.223783
missForest   0.348815  0.515518  0.256984        0.943464  95.413732  0.997563   0.223783
imputePCA    0.404402  0.531824  0.265112        0.924158  94.735915  0.997449   0.222656
ppca         0.377638  0.500354  0.249424        0.933919  95.000000  0.997721   0.199830

It is also possible to specify the minimum number of non-missing values that are required for imputation.

>>> for col in iris.columns:
...     iris.loc[iris.sample(frac=0.4).index, col] = np.nan
>>> imp, perf = pp.dima(
...     iris, iris.columns, performance_metric="RMSEt", min_values_for_imputation=2
... )

References

Egert, J., Brombacher, E., Warscheid, B. & Kreutz, C. DIMA: Data-Driven Selection of an Imputation Algorithm.

Journal of Proteome Research 20, 3489–3496 (2021-06).

autoprot.preprocessing.imputation.imp_min_prob(df: DataFrame, cols_to_impute: list[str] | Index, min_missing: int | None = None, downshift: int | float = 1.8, width: int | float = 0.3, return_cols: bool = False)[source]#

Perform an imputation by modeling a distribution on the far left site of the actual distribution.

The final distribution will be mean shifted and has a smaller variation. Intensities should be log-transformed before being supplied to this function.

Downsshift: mean - downshift*sigma Var: width*sigma

Parameters:
  • df (pd.dataframe) – Dataframe on which imputation is performed.

  • cols_to_impute (list) – Columns to impute. Should correspond to a single condition (i.e. control).

  • min_missing (int, optional) – How many missing values have to be missing across all columns to perfom imputation If None all values have to be missing. The default is None.

  • downshift (float, optional) – How many Stds to lower values the mean of the new population is shifted. The default is 1.8.

  • width (float, optional) – How to scale the Std of the new distribution with respect to the original. The default is .3.

  • return_cols (bool, optional) – Whether to return the columns that were imputed. The default is False.

Returns:

  • pd.dataframe – The dataframe with imputed values.

  • list of str – Columns that were imputed.

Examples

phos = pd.read_csv("../data/Phospho (STY)Sites_minimal.zip", sep="\t", low_memory=False)
forImp = np.log10(phos.filter(regex="Int.*R1").replace(0, np.nan))
impProt = pp.imp_min_prob(forImp, phos.filter(regex="Int.*R1").columns, width=.4, downshift=2.5)
fig, ax1 = plt.subplots(1)
imputed_values = impProt.filter(regex="Int.*R1$").isnull()
ax1.hist(impProt.filter(regex="Int.*R1_min_imputed").values[~imputed_values],
          density=True, bins=50, label="not Imputed", alpha=.5)
ax1.hist(impProt.filter(regex="Int.*R1_min_imputed").values[imputed_values],
          density=True, bins=50, label="Imputed", alpha=.5)
ax1.set_xlabel("log10 Intensity")
ax1.set_ylabel("Density")

plt.legend()
plt.show()

(Source code, png, hires.png, pdf)

../_images/imputation-1.png
autoprot.preprocessing.imputation.imp_seq(df, cols: list[str] | Index, print_r=False, return_cols=False)[source]#

Perform sequential imputation in R using impSeq from rrcovNA.

See https://rdrr.io/cran/rrcovNA/man/impseq.html for a description of the algorithm. SEQimpute starts from a complete subset of the data set Xc and estimates sequentially the missing values in an incomplete observation, say x*, by minimizing the determinant of the covariance of the augmented data matrix X* = [Xc; x’]. Then the observation x* is added to the complete data matrix and the algorithm continues with the next observation with missing values.

Parameters:
  • df (pd.DataFrame) – Input dataframe.

  • cols (list of str) – Colnames to perform imputation of.

  • print_r (bool, optional) – Whether to print the output of R, default is False.

  • return_cols (bool, optional) – Whether to return the columns that were imputed. The default is False.

Returns:

  • pd.DataFrame – Dataframe with imputed values. Cols with imputed values are named _imputed. Contains a col UID that was used for processing.

  • list of str – Columns that were imputed.