Garfield.preprocessing.DataProcess

Garfield.preprocessing.DataProcess(adata_list, profile, data_type=None, sub_data_type=None, sample_col='batch', genome=None, weight=None, graph_const_method=None, use_gene_weight=True, user_cache_path=None, use_top_pcs=False, used_hvg=True, min_features=100, min_cells=3, keep_mt=False, target_sum=10000.0, rna_n_top_features=3000, atac_n_top_features=10000, n_components=50, n_neighbors=15, metric='correlation', svd_solver='arpack')

Processes single- or multi-modal data (e.g., RNA, ATAC, ADT, spatial), applying optional preprocessing steps such as normalization, feature selection, and dimensionality reduction.

Parameters:
  • adata_list (list of AnnData or MuData objects) – List of AnnData or MuData objects to be concatenated and processed.

  • profile (str) – Data profile type, e.g., ‘RNA’, ‘ATAC’, ‘ADT’, ‘multi-modal’, or ‘spatial’.

  • data_type (str, optional) – Type of data being processed, e.g., ‘single-cell’, ‘bulk’. Default is None.

  • sub_data_type (list[str], optional) – List of sub-data types for multi-modal data, e.g., [‘rna’, ‘atac’] or [‘rna’, ‘adt’]. Default is None.

  • sample_col (str, optional) – Column in the dataset used to indicate batch or sample groupings. Default is ‘batch’.

  • genome (str, optional) – Reference genome for the dataset. Default is None.

  • weight (float or None, optional) – Weight for certain data processing steps, such as graph construction. Default is None.

  • graph_const_method (str, optional) – Method for constructing the graph if applicable, e.g., ‘knn’. Default is None.

  • use_gene_weight (bool, optional) – Whether to use gene weights in the preprocessing steps. Default is True.

  • user_cache_path (str, optional) – Path to the user’s cache directory. Default is None.

  • use_top_pcs (bool, optional) – Whether to use the top principal components during dimensionality reduction. Default is False.

  • used_hvg (bool, optional) – Whether to use highly variable genes (HVG) for the analysis. Default is True.

  • min_features (int, optional) – Minimum number of features required for a cell to be included. Default is 100.

  • min_cells (int, optional) – Minimum number of cells required for a feature to be included. Default is 3.

  • keep_mt (bool, optional) – Whether to keep mitochondrial genes in the dataset. Default is False.

  • target_sum (float, optional) – Target sum for normalization. Default is 1e4.

  • rna_n_top_features (int, optional) – Number of top features to keep for RNA data. Default is 3000.

  • atac_n_top_features (int, optional) – Number of top features to keep for ATAC data. Default is 10000.

  • n_components (int, optional) – Number of components for dimensionality reduction (e.g., PCA). Default is 50.

  • n_neighbors (int, optional) – Number of neighbors for graph-based algorithms. Default is 15.

  • metric (str, optional) – Distance metric to use in graph construction. Default is ‘correlation’.

  • svd_solver (str, optional) – Solver to use for singular value decomposition (SVD). Default is ‘arpack’.
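
As a rough illustration of what min_features, min_cells, and target_sum conventionally mean, the sketch below applies them to a toy cells-by-features count matrix with NumPy. It is not Garfield's implementation: the helper name filter_and_normalize, the order of the two filters, and the toy data are assumptions made for this example only.

```python
import numpy as np


def filter_and_normalize(counts, min_features=100, min_cells=3, target_sum=1e4):
    """Filter a cells x features count matrix, then scale each cell to target_sum.

    Hypothetical helper for illustration; not part of Garfield.
    """
    counts = np.asarray(counts, dtype=float)

    # Keep cells in which at least `min_features` features are detected.
    cell_mask = (counts > 0).sum(axis=1) >= min_features
    counts = counts[cell_mask]

    # Keep features detected in at least `min_cells` of the remaining cells.
    feature_mask = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, feature_mask]

    # Scale every cell so that its counts sum to `target_sum`.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against division by zero for empty cells
    return counts / totals * target_sum, cell_mask, feature_mask


# Toy data: 4 cells x 4 features, with small thresholds so the effect is visible.
toy = np.array([
    [5, 0, 3, 1],
    [2, 1, 0, 4],
    [0, 0, 0, 7],
    [1, 0, 2, 2],
])
normalized, cell_mask, feature_mask = filter_and_normalize(
    toy, min_features=2, min_cells=3, target_sum=1e4
)
```

With these toy thresholds, the third cell is dropped (only one detected feature) and the two middle features are dropped (detected in fewer than three of the remaining cells), leaving a 3 x 2 matrix whose rows each sum to target_sum.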

Returns:

Preprocessed single- or multi-modal data, according to the specified profile and sub_data_type.

Return type:

AnnData or MuData
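
A call for an RNA-only dataset might look like the sketch below. The keyword names and default values are copied from the signature above; the import path is assumed to follow the documented dotted name, and the wrapper preprocess_rna and the dictionary rna_params are hypothetical names introduced here. Whether other optional arguments (for example data_type or genome) must also be set for a given profile is not stated on this page.

```python
# Keyword arguments taken from the documented signature; values are the
# documented defaults, spelled out here only for illustration.
rna_params = {
    "profile": "RNA",
    "sample_col": "batch",
    "used_hvg": True,
    "min_features": 100,
    "min_cells": 3,
    "keep_mt": False,
    "target_sum": 1e4,
    "rna_n_top_features": 3000,
    "n_components": 50,
    "n_neighbors": 15,
    "metric": "correlation",
    "svd_solver": "arpack",
}


def preprocess_rna(adata_list, **overrides):
    """Run DataProcess on a list of AnnData objects (hypothetical wrapper).

    Requires Garfield to be installed; the import is done lazily so that
    this sketch can be loaded without it.
    """
    # Assumed import path, based on the dotted name in this page's title.
    from Garfield.preprocessing import DataProcess

    # Per-call overrides take precedence over the illustrative defaults.
    return DataProcess(adata_list, **{**rna_params, **overrides})
```

Calling preprocess_rna([adata_a, adata_b], min_cells=5), with adata_a and adata_b standing in for your own AnnData objects, would pass both objects through DataProcess with a stricter feature filter.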