In a recent study posted to the bioRxiv* preprint server, researchers developed and described a ‘bridge integration’ method to harmonize single-cell ribonucleic acid (RNA) sequencing (scRNA-seq) datasets with single-cell datasets that measure other modalities.
Mapping new scRNA-seq datasets to reference sets is an exciting and growing opportunity in single-cell genomics. Unlike unsupervised analysis, supervised mapping leverages extensive, well-curated reference datasets to annotate query profiles, an approach made possible by expert curation and newer computational tools. Although existing mapping approaches are powerful, their references are constructed from scRNA-seq data, so they cannot annotate datasets that do not measure gene expression.
The Human Cell Atlas (HCA), the Human BioMolecular Atlas Program (HuBMAP), and the Chan Zuckerberg Biohub have produced carefully curated references annotated by experts. Mapping datasets to these references helps harmonize data and compare scRNA-seq datasets across different experimental conditions and disease states. Mapping additional molecular modalities is challenging because they measure different features than scRNA-seq. Examples include the single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq), single-cell bisulfite sequencing (scBS-seq) for DNA methylation assessment, cytometry by time of flight (CyTOF) for protein levels, and single-cell cleavage under targets and tagmentation (scCUT&Tag) for histone modifications.
In the present study, researchers introduced ‘bridge integration’ to integrate single-cell datasets that measure disparate modalities. The method leverages another dataset as a ‘bridge’ in which both modalities are measured in the same cells. Bridge integration relies on dictionary learning, a technique commonly used in image analysis. Dictionary learning represents input data (for instance, a noisy image) in terms of individual elements (image patches) termed atoms, which collectively constitute the ‘dictionary’. Reconstructing the image as a weighted linear combination of these atoms can be effective for denoising, and it amounts to converting the image (dataset) into the dictionary-defined space.
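The idea of reconstructing a signal as a weighted linear combination of atoms can be illustrated with a small sketch. This is not the authors' implementation: it assumes a fixed toy dictionary of random atoms and uses ridge-regularized least squares in place of the sparse coding typically used in dictionary learning. The function name `reconstruct_from_atoms` and all parameter values are hypothetical.

```python
import numpy as np

def reconstruct_from_atoms(signal, atoms, ridge=1e-3):
    """Represent `signal` as a weighted linear combination of dictionary atoms.

    atoms: (n_atoms, n_features) array, one atom per row.
    Returns (weights, reconstruction).
    """
    # Closed-form solution of min_w ||signal - w @ atoms||^2 + ridge * ||w||^2.
    gram = atoms @ atoms.T + ridge * np.eye(atoms.shape[0])
    weights = np.linalg.solve(gram, atoms @ signal)
    return weights, weights @ atoms

rng = np.random.default_rng(0)
atoms = rng.normal(size=(5, 200))          # toy dictionary: 5 atoms of length 200
true_weights = np.array([1.0, 0.0, -2.0, 0.5, 0.0])
clean = true_weights @ atoms               # noise-free signal built from the atoms
noisy = clean + rng.normal(scale=0.5, size=200)

weights, denoised = reconstruct_from_atoms(noisy, atoms)
print("error before:", np.linalg.norm(noisy - clean))
print("error after: ", np.linalg.norm(denoised - clean))
```

Because the reconstruction is restricted to the span of the five atoms, most of the added noise is discarded, and the vector `weights` is the signal's representation in the dictionary-defined space.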
The rationale for bridge integration was to combine single-cell sequencing datasets that each measure a different modality (single-modality datasets). Although the authors previously described the conversion of one feature set into another, that transformation makes strict biological assumptions about the relationship between modalities, and these assumptions may not always hold.
The authors leveraged a multi-omic dataset as a bridge to translate between separate modalities, using dictionary learning for (bridge) integration at single-cell resolution. Essentially, the multi-omic dataset was treated as a dictionary, and the multi-omic profile of each individual cell represented an atom. Next, a dictionary representation of each of the disparate unimodal datasets was inferred based on these atoms. The distinct datasets were thereby described in a space with the same features and were finally aligned.
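The steps above can be sketched on simulated data. The sketch below is a simplified illustration under stated assumptions, not the published algorithm: it simulates three cell types with linear "RNA" and "ATAC" readouts, expresses each unimodal cell as a ridge-regularized combination of bridge cells (the atoms), and transfers labels by nearest neighbor in the resulting bridge-cell space. All names, sizes, and the ridge value are hypothetical.

```python
import numpy as np

def dictionary_weights(cells, atoms, ridge=50.0):
    """Express each row of `cells` as a ridge-regularized linear combination
    of the rows of `atoms` (bridge cells measured in the same modality)."""
    gram = atoms @ atoms.T + ridge * np.eye(atoms.shape[0])
    return np.linalg.solve(gram, atoms @ cells.T).T   # (n_cells, n_bridge)

rng = np.random.default_rng(0)
n_types, n_genes, n_peaks = 3, 30, 50
gene_loadings = rng.normal(size=(n_types, n_genes))   # latent state -> RNA
peak_loadings = rng.normal(size=(n_types, n_peaks))   # latent state -> ATAC

def latent_states(labels):
    # One latent axis per cell type, plus cell-to-cell variation.
    return 4.0 * np.eye(n_types)[labels] + 0.3 * rng.normal(size=(len(labels), n_types))

def measure(latent, loadings):
    # Linear readout of the latent state with measurement noise.
    return latent @ loadings + 0.1 * rng.normal(size=(latent.shape[0], loadings.shape[1]))

bridge_labels = np.repeat(np.arange(n_types), 20)
ref_labels = np.repeat(np.arange(n_types), 30)
query_labels = np.repeat(np.arange(n_types), 20)

# Bridge: both modalities measured in the same cells.
bridge_latent = latent_states(bridge_labels)
bridge_rna = measure(bridge_latent, gene_loadings)
bridge_atac = measure(bridge_latent, peak_loadings)

# Unimodal datasets: RNA-only reference and ATAC-only query.
ref_rna = measure(latent_states(ref_labels), gene_loadings)
query_atac = measure(latent_states(query_labels), peak_loadings)

# Both datasets are now described by the same features: weights over bridge cells.
ref_w = dictionary_weights(ref_rna, bridge_rna)
query_w = dictionary_weights(query_atac, bridge_atac)

# Transfer labels from the nearest reference cell in bridge-cell space.
dist = ((query_w[:, None, :] - ref_w[None, :, :]) ** 2).sum(axis=-1)
predicted = ref_labels[dist.argmin(axis=1)]
accuracy = float((predicted == query_labels).mean())
print("label transfer accuracy:", accuracy)
```

Note that the reference and query never share a measured feature; they become comparable only because each is rewritten in terms of the same bridge cells.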
The bridge integration method makes no (biological) assumptions about the relationships between the distinct modalities; instead, these relationships are learned automatically from the multi-omic dataset. The disparate datasets are then transformed so that they are represented by a shared set of features. After transformation, a final alignment step is performed, which is compatible with other single-cell integration methods such as Harmony, Seurat, and mnnCorrect.
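Once datasets share a feature space, alignment methods can operate on them directly. As one illustration, the sketch below finds mutual nearest neighbor pairs between two batches, the idea underlying mnnCorrect. It is a toy version for intuition only: the function name and example coordinates are hypothetical, and it omits the correction step that such methods apply after pairs are found.

```python
import numpy as np

def mutual_nearest_neighbors(a, b, k=1):
    """Return (i, j) pairs where b[j] is among the k nearest neighbors of a[i]
    and a[i] is among the k nearest neighbors of b[j]."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    nn_of_a = np.argsort(dist, axis=1)[:, :k]      # for each a[i], nearest b indices
    nn_of_b = np.argsort(dist, axis=0)[:k, :].T    # for each b[j], nearest a indices
    return [(i, int(j)) for i in range(len(a)) for j in nn_of_a[i] if i in nn_of_b[j]]

# Two toy batches in a shared two-dimensional feature space.
batch_a = np.array([[0.0, 0.0], [10.0, 10.0]])
batch_b = np.array([[0.1, 0.0], [10.0, 10.1], [4.0, 5.0]])

pairs = mutual_nearest_neighbors(batch_a, batch_b)
print(pairs)
```

The third cell in `batch_b` has no partner, because neither cell in `batch_a` has it as a nearest neighbor; only mutually agreed pairs are kept as anchors.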
The bridge integration technique was applied to map scRNA-seq and scATAC-seq samples of human bone marrow mononuclear cells (BMMCs). These samples span the full spectrum of hematopoietic differentiation, including hematopoietic stem cells (HSCs), multipotent progenitors, and fully differentiated cells. An scRNA-seq reference dataset of BMMCs with 297,627 cells, termed the Azimuth reference, was constructed from publicly available datasets (HuBMAP). The scATAC-seq (query) dataset of BMMCs was mapped to this reference dataset, and a 10x multiome dataset (32,368 cells with paired scRNA-seq and scATAC-seq data) was used as the molecular bridge.
The authors reported that the query dataset mapped successfully to the Azimuth reference, which allowed both the scRNA-seq and scATAC-seq data to be visualized and annotated. They noted that CD34+ BMMC fractions mapped exclusively to HSCs and progenitors in the reference, indicating the robustness of the bridge integration strategy.
Additionally, unlike unsupervised analysis, bridge integration annotated rare, high-resolution subpopulations. For example, monocytes were further divided into CD16+ and CD14+ fractions, and natural killer cells into CD56bright and CD56dim subgroups. Of note, this method also identified rare innate lymphoid cells and AXL+SIGLEC6+ dendritic cells (ASDCs).
Although reducing the size of the bridge dataset produced concordant results, the annotation accuracy for rare cell types could be compromised. Further assessments across bridge sizes indicated that a bridge dataset comprising at least 50 cells (atoms) per subpopulation should produce acceptable annotations, including for rare cell types.
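In practice, this guideline could be checked before running an analysis. The sketch below is a hypothetical helper that counts annotated cells per subpopulation in a bridge dataset and flags any that fall below the reported threshold of 50 cells. The label names and counts are made up for illustration.

```python
from collections import Counter

def underpowered_populations(bridge_labels, min_cells=50):
    """Return subpopulations with fewer than `min_cells` bridge cells (atoms)."""
    counts = Counter(bridge_labels)
    return {label: n for label, n in counts.items() if n < min_cells}

# Hypothetical per-cell annotations of a bridge dataset.
bridge_labels = ["CD14 Mono"] * 120 + ["NK CD56bright"] * 50 + ["ASDC"] * 12

flagged = underpowered_populations(bridge_labels)
print(flagged)  # {'ASDC': 12}
```

Subpopulations flagged this way are the ones for which annotation accuracy would be expected to suffer most.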
The current study demonstrated the successful application of the bridge integration technique. Moreover, combining the methodology with atomic sketching could scale the approach to harmonize large datasets comprising millions of cells. The bridge integration method is suitable for studies in which a multi-omic technique is applied to a subset of the experimental samples rather than to all of them.
bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
- Yuhan Hao, Tim Stuart, Madeline Kowalski, Saket Choudhary, Paul Hoffman, Austin Hartman, Avi Srivastava, Gesmira Molla, Shaista Madad, Carlos Fernandez-Granda, Rahul Satija. (2022). Dictionary learning for integrative, multimodal, and scalable single-cell analysis. bioRxiv. doi: https://doi.org/10.1101/2022.02.24.481684 https://www.biorxiv.org/content/10.1101/2022.02.24.481684v1