Initialization is critical for preserving global data structure in both t-SNE and UMAP

arising from Becht, E. et al. Nature Biotechnology https://doi.org/10.1038/nbt.4314 (2019)

One of the most ubiquitous analysis tools in single-cell transcriptomics and cytometry is t-distributed stochastic neighbor embedding (t-SNE)¹, which is used to visualize individual cells as points on a two-dimensional scatterplot such that similar cells are positioned close together². A related algorithm, called uniform manifold approximation and projection (UMAP)³, has attracted substantial attention in the single-cell community⁴. In Nature Biotechnology, Becht et al.⁴ argued that UMAP is preferable to t-SNE because it better preserves the global structure of the data and is more consistent across runs. Here we show that this alleged superiority of UMAP can be entirely attributed to different choices of initialization in the implementations used by Becht et al.: the t-SNE implementations by default used random initialization, while the UMAP implementation used a technique called Laplacian eigenmaps (LE)⁵ to initialize the embedding. We show that UMAP with random initialization preserves global structure as poorly as t-SNE with random initialization, while t-SNE with informative initialization performs as well as UMAP with informative initialization. On the basis of these observations, we argue that there is currently no evidence that the UMAP algorithm per se has any advantage over t-SNE in terms of preserving global structure. We also contend that these algorithms should always use informative initialization by default.
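To make the distinction between random and informative initialization concrete, the sketch below builds both kinds of starting layout in plain Python. It is an illustration under stated assumptions, not the code used in the study: the function names (`pca_init`, `random_init`, `top_component`) are hypothetical, PCA stands in for any informative initialization (UMAP's default uses Laplacian eigenmaps instead), and the target standard deviation of 0.0001 is an assumed convention for keeping the initial layout small.

```python
import math
import random


def center(data):
    """Subtract the per-feature mean from every row."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def top_component(data, n_iter=100, seed=0):
    """Approximate the leading principal axis of centered data by power iteration."""
    rng = random.Random(seed)
    d = len(data[0])
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(n_iter):
        # One step of v <- X^T (X v), followed by normalization.
        proj = [sum(x * vi for x, vi in zip(row, v)) for row in data]
        w = [sum(p * row[j] for p, row in zip(proj, data)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pca_init(data, n_components=2, target_std=1e-4):
    """Informative initialization: leading principal components, rescaled.

    The rescaling (first coordinate has standard deviation `target_std`)
    is an assumption of this sketch, not something stated in the text.
    """
    X = center(data)
    coords = []
    for k in range(n_components):
        v = top_component(X, seed=k)
        scores = [sum(x * vi for x, vi in zip(row, v)) for row in X]
        coords.append(scores)
        # Deflate: remove the variance already explained by this axis.
        X = [[x - s * vi for x, vi in zip(row, v)] for row, s in zip(X, scores)]
    std1 = math.sqrt(sum(s * s for s in coords[0]) / len(coords[0]))
    scale = target_std / std1
    return [[coords[k][i] * scale for k in range(n_components)]
            for i in range(len(data))]


def random_init(n_points, n_components=2, std=1e-4, seed=0):
    """Uninformative initialization: small isotropic Gaussian noise."""
    rng = random.Random(seed)
    return [[rng.gauss(0, std) for _ in range(n_components)]
            for _ in range(n_points)]
```

With `pca_init`, two well-separated groups of cells already start on opposite sides of the layout, so the optimization only has to refine local structure. With `random_init`, the starting positions carry no information about the data, and the relative placement of distant clusters in the final embedding depends on the random seed.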

Data availability

The data in this study were sourced from refs. ⁸,⁹,¹⁰.

Code availability

The R code extending the analysis of Becht et al. is available at https://github.com/linqiaozhi/DR_benchmark_initialization. The Python code used to produce Fig. 1 is available at https://github.com/dkobak/tsne-umap-init.

References

1. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
2. Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019).
3. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://arxiv.org/abs/1802.03426 (2018).
4. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–40 (2019).
5. Belkin, M. & Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 585–591 (2002).
6. Coifman, R. R. & Lafon, S. Diffusion maps. Appl. Comput. Harmon. Anal. 21, 5–30 (2006).
7. Linderman, G. C., Rachh, M., Hoskins, J. G., Steinerberger, S. & Kluger, Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat. Methods 16, 243–245 (2019).
8. Samusik, N., Good, Z., Spitzer, M. H., Davis, K. L. & Nolan, G. P. Automated mapping of phenotype space with single-cell data. Nat. Methods 13, 493–496 (2016).
9. Wong, M. T. et al. A high-dimensional atlas of human T cell diversity reveals tissue-specific trafficking and cytokine signatures. Immunity 45, 442–456 (2016).
10. Han, X. et al. Mapping the mouse cell atlas by Microwell-seq. Cell 172, 1091–1107 (2018).
11. Policar, P. G., Strazar, M. & Zupan, B. openTSNE: a modular Python library for t-SNE dimensionality reduction and embedding. Preprint at bioRxiv https://doi.org/10.1101/731877 (2019).
12. Böhm, J. N., Berens, P. & Kobak, D. A unifying perspective on neighbor embeddings along the attraction–repulsion spectrum. Preprint at https://arxiv.org/abs/2007.08902 (2020).

Acknowledgements

The authors thank P. Berens, S. Steinerberger and Y. Kluger for discussions and helpful comments. D.K. was supported by the Deutsche Forschungsgemeinschaft (BE5601/4-1 and the Cluster of Excellence ‘Machine Learning—New Perspectives for Science’, EXC 2064, project number 390727645), the Federal Ministry of Education and Research (FKZ 01GQ1601 and 01IS18039A) and the National Institute of Mental Health of the National Institutes of Health under award number U19MH114830. G.C.L. was supported by the National Human Genome Research Institute (F30HG010102) and National Institutes of Health MSTP training grant T32GM007205. A portion of the benchmarks were run on computational resources funded by the National Institutes of Health (R01GM131642, principal investigator: Y. Kluger). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Authors and Affiliations

Institute for Ophthalmic Research, University of Tübingen, Tübingen, Germany
Dmitry Kobak
Applied Mathematics Program, Yale University, New Haven, CT, USA
George C. Linderman

Contributions

The authors contributed equally.

Corresponding authors

Correspondence to Dmitry Kobak or George C. Linderman.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Preservation of pairwise distances in embeddings.

The exact analogue of Fig. 5 in the original publication by Becht et al.⁴ To quote the original caption: ‘Box plots represent distances across pairs of points in the embeddings, binned using 50 equal-width bins over the pairwise distances in the original space using 10,000 randomly selected points, leading to 49,995,000 pairs of pairwise distances. […] The value of the Pearson correlation coefficient computed over the pairs of pairwise distances is reported. For the box plots, the central bar represents the median, and the top and bottom boundary of the boxes represent the 75th and 25th percentiles, respectively. The whiskers represent 1.5 times the interquartile range above (or, respectively, below) the top (or, respectively, bottom) box boundary, truncated to the data range if applicable.’ We recomputed all embeddings (except for the UMAP with LE initialization of the Wong et al.⁹ dataset, which was loaded from an external source, as in the code accompanying the original publication). All algorithms were run with the same parameters as in the original publication (these were always the default parameters, apart from n_neighbors set to 30 in UMAP for the Han et al.¹⁰ dataset; we kept this value for both initializations). We used the same version of FIt-SNE as in the original publication to ensure that all default parameters stayed the same. The y-axis runs from zero to the maximum pairwise distance in all subplots.
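The distance-preservation measure described in this caption can be sketched in a few lines of plain Python. This is one reading of the quoted caption, not the authors' R code: the function names are hypothetical, and the sketch assumes Euclidean distances in both the original space and the embedding.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def distance_preservation(original, embedding):
    """Correlation between original-space and embedding-space pairwise distances."""
    return pearson(pairwise_distances(original), pairwise_distances(embedding))


def equal_width_bins(values, n_bins=50):
    """Assign each value to one of n_bins equal-width bins over the data range.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# 10,000 points give C(10000, 2) unordered pairs, matching the caption.
n_pairs = math.comb(10_000, 2)
```

Here `n_pairs` evaluates to 49,995,000, the number of pairs quoted in the caption. A correlation near 1 from `distance_preservation` means that points far apart in the original space are also far apart in the embedding, which is the sense of "global structure" used in this comparison.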

Extended Data Fig. 2 Reproducibility of large-scale structures in embeddings.

The exact analogue of Fig. 6 in the original publication⁴. To quote the original caption: ‘Bar plots represent the average unsigned Pearson correlation coefficient of the points’ coordinates in the embedding of subsamples versus in the embedding of the full dataset, thus measuring the correlation of coordinates in subsamples versus in the embedding of the full dataset, up to symmetries along the graph axes. Bar heights represent the average across three replicates and vertical bars the corresponding s.d.’
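The reproducibility score in this caption can likewise be sketched in plain Python. The sketch follows one reading of the quoted text: for each embedding axis, correlate the subsample coordinates with the coordinates of the same points in the full embedding, take the absolute value, and average over axes. The names `reproducibility` and `sub_indices` are hypothetical and do not come from the original code.

```python
import math
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def reproducibility(full_embedding, sub_embedding, sub_indices):
    """Average unsigned per-axis correlation between a subsample embedding
    and the positions of the same points in the full-dataset embedding.

    sub_indices[i] is the row of full_embedding that corresponds to
    sub_embedding[i].
    """
    reference = [full_embedding[i] for i in sub_indices]
    n_dims = len(reference[0])
    scores = []
    for k in range(n_dims):
        full_axis = [p[k] for p in reference]
        sub_axis = [p[k] for p in sub_embedding]
        # The absolute value makes the score insensitive to axis flips.
        scores.append(abs(pearson(full_axis, sub_axis)))
    return sum(scores) / n_dims


def summarize(replicate_scores):
    """Bar height and error bar: mean and s.d. across replicates."""
    return mean(replicate_scores), stdev(replicate_scores)
```

Taking the absolute value is what makes the score invariant "up to symmetries along the graph axes": a subsample embedding that is a mirror image of the full embedding still scores 1. The score does not correct for rotations or for the two axes being swapped.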

Extended Data Fig. 3 Qualitative assessment of the reproducibility of embeddings using the Samusik et al.⁸ dataset.

The exact analogue of Supplementary Fig. 7a from the original publication⁴. To quote the original caption: ‘Embeddings of full datasets as well as subsamples of varying sizes replicated thrice for [four] dimensionality reduction methods. The color-code is generated using the embedding of the full dataset and propagated to the subsamples.’

Extended Data Fig. 4 Qualitative assessment of the reproducibility of embeddings using the Wong et al.⁹ dataset.

The exact analogue of Supplementary Fig. 7b from the original publication⁴. To quote the original caption: ‘Embeddings of full datasets as well as subsamples of varying sizes replicated thrice for [four] dimensionality reduction methods. The color-code is generated using the embedding of the full dataset and propagated to the subsamples.’

Extended Data Fig. 5 Qualitative assessment of the reproducibility of embeddings using the Han et al.¹⁰ dataset.

The exact analogue of Supplementary Fig. 7c from the original publication⁴. To quote the original caption: ‘Embeddings of full datasets as well as subsamples of varying sizes replicated thrice for [four] dimensionality reduction methods. The color-code is generated using the embedding of the full dataset and propagated to the subsamples.’

Extended Data Fig. 6 Annotated embeddings of the Samusik_01 dataset (sample size n=86,864).

Top row: UMAP with random initialization (left) and t-SNE with random initialization (right). Bottom row: UMAP with default initialization (left) and t-SNE with PCA initialization (right). The bottom-left and top-right panels are analogues of Fig. 2a,b from the original publication⁴. Note that the T cells are not colocalized in the UMAP embedding with random initialization.

Cite this article

Kobak, D., Linderman, G.C. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nat Biotechnol 39, 156–157 (2021). https://doi.org/10.1038/s41587-020-00809-z

Received: 02 December 2019
Accepted: 23 December 2020
Published: 01 February 2021
Version of record: 01 February 2021
Issue date: February 2021
DOI: https://doi.org/10.1038/s41587-020-00809-z