dataset_utils

class scconfluence.dataset_utils.BatchSampler(indices: ndarray, batch_size: int, shuffle: bool, drop_last: bool | int = False)

Bases: Sampler

Custom torch Sampler that returns a list of indices of size batch_size. Code imported and adapted from https://github.com/scverse/scvi-tools/blob/master/scvi/dataloaders/_ann_dataloader.py.
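The batching behaviour described above can be sketched without PyTorch. The snippet below is a minimal, dependency-free illustration and not the library's implementation: `iter_index_batches` is a hypothetical helper, it only covers a boolean `drop_last`, and the `seed` argument is an assumption added for reproducibility.

```python
import random


def iter_index_batches(indices, batch_size, shuffle, drop_last=False, seed=None):
    """Yield lists of at most `batch_size` indices (hypothetical helper)."""
    order = list(indices)
    if shuffle:
        # Shuffle a copy so the caller's sequence is left untouched.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            # Skip the trailing incomplete batch.
            break
        yield batch


batches = list(iter_index_batches(range(10), batch_size=4, shuffle=False))
batches_dropped = list(
    iter_index_batches(range(10), batch_size=4, shuffle=False, drop_last=True)
)
```

Yielding a whole list of indices per step, rather than one index at a time, lets the dataset fetch all observations of a mini-batch in a single indexing operation.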

class scconfluence.dataset_utils.DictDataset(dict_data)

Bases: Dataset

Dataset class for dictionary data. It is a subclass of torch.utils.data.Dataset. It is meant to be used to load data from a dictionary where all entries are indexed by the same observations. The keys of the dictionary are the fields and the values are the data to be loaded.

Parameters: dict_data – dictionary with the data.
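To make the "all entries are indexed by the same observations" requirement concrete, here is a minimal sketch of a dictionary-backed, map-style dataset. It is an illustration only: `DictDatasetSketch` is a hypothetical plain-Python class (the real class subclasses torch.utils.data.Dataset), and the field names `x` and `batch_index` are made up for the example.

```python
class DictDatasetSketch:
    """Map-style dataset over a dict of equally long sequences (illustration only)."""

    def __init__(self, dict_data):
        lengths = {len(values) for values in dict_data.values()}
        if len(lengths) != 1:
            raise ValueError("all fields must share the same number of observations")
        self.dict_data = dict_data
        self.n_obs = lengths.pop()

    def __len__(self):
        return self.n_obs

    def __getitem__(self, idx):
        # Return every field for one observation, keyed by field name.
        return {field: values[idx] for field, values in self.dict_data.items()}


# Field names below are made up for the example.
data = DictDatasetSketch({"x": [[1.0, 2.0], [3.0, 4.0]], "batch_index": [0, 1]})
first = data[0]
```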

split_train_val(ratio_val: float) → tuple[numpy.ndarray, numpy.ndarray]

Split the dataset into a training and a validation set.

Parameters: ratio_val – proportion of cells to be used for validation.
Returns: the indices of the observations to be used for training and validation.
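A ratio-based split of observation indices could look like the sketch below. It rests on assumptions that are not confirmed by the documentation: the random shuffle, the `seed` argument and the rounding rule (`int(n_obs * ratio_val)`) are all choices made for the example, and `split_train_val_sketch` is a hypothetical helper that returns Python lists where the real method returns NumPy arrays.

```python
import random


def split_train_val_sketch(n_obs, ratio_val, seed=0):
    """Return (train_indices, val_indices) as sorted lists (hypothetical helper)."""
    if not 0.0 <= ratio_val <= 1.0:
        raise ValueError("ratio_val must lie in [0, 1]")
    order = list(range(n_obs))
    # Shuffle so that validation observations are drawn at random.
    random.Random(seed).shuffle(order)
    n_val = int(n_obs * ratio_val)
    val_idx = sorted(order[:n_val])
    train_idx = sorted(order[n_val:])
    return train_idx, val_idx


train_idx, val_idx = split_train_val_sketch(n_obs=10, ratio_val=0.2)
```

The two index lists are disjoint and together cover every observation, so each cell ends up in exactly one of the two sets.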

class scconfluence.dataset_utils.MModalDataset(datasets: dict[str, scconfluence.dataset_utils.DictDataset], cross_relations: dict[str, numpy.ndarray])

Bases: Dataset

Dataset class for multi-modal data. It is a subclass of torch.utils.data.Dataset.

Parameters:

datasets – dictionary with the data for each modality. The keys are the names of the modalities and the values are the data for each modality stored in DictDataset objects.
cross_relations – dictionary with the cross-modal relations. The keys are the names of the pair of modalities compared and the values are the cost matrices.
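The relationship between the two arguments can be illustrated with a small consistency check. Everything specific here is an assumption made for the sketch: the pair key format (`"rna+atac"`, two modality names joined by `+`), the convention that a cost matrix has one row per observation of the first modality and one column per observation of the second, and the helper `check_cross_relations` itself, which is hypothetical and not part of the library.

```python
def check_cross_relations(n_obs_per_modality, cross_relations, separator="+"):
    """Check each cost matrix shape against the two modalities it links.

    Assumes keys look like "<mod_a><separator><mod_b>" (hypothetical format).
    """
    for pair, cost in cross_relations.items():
        mod_a, mod_b = pair.split(separator)
        expected = (n_obs_per_modality[mod_a], n_obs_per_modality[mod_b])
        actual = (len(cost), len(cost[0]))
        if actual != expected:
            raise ValueError(f"{pair}: expected shape {expected}, got {actual}")
    return True


# Toy sizes and a toy 3 x 2 cost matrix stored as nested lists.
n_obs = {"rna": 3, "atac": 2}
cross = {"rna+atac": [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]}
ok = check_cross_relations(n_obs, cross)
```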

split_train_val(ratio_val: float) → tuple[numpy.ndarray, numpy.ndarray]

Split the dataset into a training and a validation set.

Parameters: ratio_val – proportion of cells to be used for validation.
Returns: the indices of the observations to be used for training and validation.

scconfluence.dataset_utils.configure_multimodal_dataset(mdata: MuData, modality_pairs: list[str], unimodal_datasets: dict[str, scconfluence.dataset_utils.DictDataset]) → MModalDataset

Create the MModalDataset object from the MuData object.

Parameters:

mdata – the input data
modality_pairs – the pairs of modalities for which a cost matrix is available
unimodal_datasets – the unimodal datasets

Returns:

a MModalDataset which can be used for training and inference

scconfluence.dataset_utils.configure_unimodal_dataset(adata: AnnData, rep_in: None | str, rep_out: None | str, batch_key: str, modality: str) → tuple[scconfluence.dataset_utils.DictDataset, int, int, dict[str, int]]

Create a DictDataset from an AnnData object.

Parameters:

adata – the input AnnData object.
rep_in – string indicating the entry of the AnnData object where to look for the input data, i.e. the data used as input of the encoder. If not None, the input data will be extracted from the obsm field of the AnnData object. If None, the input data is assumed to be the X field of the AnnData object.
rep_out – string indicating the entry of the AnnData object where to look for the output data, i.e. the data compared with the output of the decoder. If not None, the output data will be extracted from the layers field of the AnnData object. If None, the output data is assumed to be the X field of the AnnData object.
batch_key – key indicating where to find the batch information in the adata object.
modality – name of the data modality

Returns:

a DictDataset object, the input dimension, the output dimension and a dictionary mapping batch indices to their original names.
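The rules for rep_in and rep_out can be summarised in a few lines. The sketch below is not the library's code: `select_input_output` is a hypothetical helper, a `SimpleNamespace` stands in for a real AnnData object (only its X, obsm and layers attributes are mimicked), and the keys `X_pca` and `counts` are example names.

```python
from types import SimpleNamespace


def select_input_output(adata, rep_in=None, rep_out=None):
    """Pick the encoder input and the decoder target following the documented rules."""
    # Input: X by default, otherwise the requested obsm entry.
    x_in = adata.X if rep_in is None else adata.obsm[rep_in]
    # Output: X by default, otherwise the requested layers entry.
    x_out = adata.X if rep_out is None else adata.layers[rep_out]
    return x_in, x_out


# Stand-in for an AnnData object with 2 observations and 3 features.
toy = SimpleNamespace(
    X=[[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]],
    obsm={"X_pca": [[0.1, 0.2], [0.3, 0.4]]},
    layers={"counts": [[1, 0, 2], [0, 3, 1]]},
)
x_in, x_out = select_input_output(toy, rep_in="X_pca", rep_out="counts")
dim_in, dim_out = len(x_in[0]), len(x_out[0])
```

In this toy case the input and output dimensions differ (2 and 3), which shows why the function reports them separately.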

scconfluence.dataset_utils.format_batch(batch: dict) → dict

Format the batch to be used in the model. It removes the batch dimension from the data and the cell index.

Parameters: batch – mini-batch of input data.
Returns: formatted mini-batch.

scconfluence.dataset_utils.inference_dl_trainer(dataset: DictDataset | MModalDataset, use_cuda: bool = True, batch_size: int = 512, pin_memory: bool = True, num_workers: int = 0) → tuple[torch.utils.data.dataloader.DataLoader, pytorch_lightning.trainer.trainer.Trainer]

Wrapper that creates a DataLoader and a Trainer for prediction once the model has been trained.

Parameters:

dataset – dataset to be used for the prediction (which has also been used for the training).
use_cuda – whether to use GPU acceleration if cuda is available.
batch_size – size of the mini-batches. Not to be confused with the experimental batches.
pin_memory – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them.
num_workers – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process.

Returns:

a DataLoader and a Trainer for the prediction.