WO2020198942A1

WO2020198942A1 - Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering

Info

Publication number: WO2020198942A1
Application number: PCT/CN2019/080443
Authority: WO
Inventors: 瞿昆; 方靖文; 黎斌; 李杨
Original assignee: University of Science and Technology of China (中国科学技术大学)
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2020-10-08

Abstract

A single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering. The method comprises: aligning single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result, calling peaks on the basis of the alignment result, and calculating the readings within each peak to obtain a cell*peak reading matrix; and calculating a mathematical distance between the peaks in the cell*peak reading matrix, clustering the peaks, and merging the cell*peak reading matrix into a cell*accesson reading matrix, wherein an accesson is a group of peaks clustered together. The method provides the first scATAC-seq data analysis method and system covering the steps from fastq to clustering, visualization and developmental path reconstruction, and significantly improves the clustering effect.

Description

Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering

Technical field

The invention belongs to the technical field of biological sequencing data analysis, and specifically relates to a single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering.

Background technique

Since its invention in 2012, ATAC-seq has been widely used in biological research owing to its simplicity, low cost, and low cell input requirement, and it has contributed to breakthroughs in research on embryonic development, stem cell differentiation, cancer mechanisms and cancer typing, among other areas. For example, a 2017 paper in Cancer Cell (IF=24) found that ATAC-seq could be used to explain the pathogenesis of T-cell lymphoma and to classify it for precision drug treatment. In 2018, ATAC-seq data entered the TCGA database. To study cell heterogeneity further, the scATAC-seq sequencing technology was proposed in 2015, and a variety of technical solutions have been developed in the years since. What follows from this is the problem of analyzing and interpreting scATAC-seq sequencing results.

The main purpose of scATAC-seq data analysis is to recover, from the sequencing results, the main cell populations or the developmental differentiation pathways in a mixed biological sample. However, scATAC-seq technology is still relatively new, and the signal-to-noise ratio of its data is low. scATAC-seq data analysis therefore requires an easy-to-use set of analysis methods that recovers cell heterogeneity information to the greatest possible extent. The currently published scATAC-seq data analysis methods fall short in two respects. On the one hand, none offers a complete and easy-to-use analysis process from fastq to clustering, visualization, and developmental path reconstruction. On the other hand, when evaluated with ARI on gold standard test data sets, that is, test data sets in which the subgroup of each cell or its position in the developmental differentiation path is known, existing methods remain ineffective at information recovery and urgently need to be improved. For these reasons, the field does not currently have a unified method for scATAC-seq analysis.

There are three analysis methods in the prior art: ChromVAR, LSI and Cicero.

In the ChromVAR method, the input data is the cell*peak reading matrix and the sequence information of each peak. This method uses known transcription factor motif information to calculate the preference of each transcription factor for each peak. From this it constructs a cell*transcription factor preference score matrix and uses that matrix for information recovery.

In the LSI method, the input data is the cell*peak reading matrix. This method uses the TF-IDF algorithm (TF stands for term frequency, IDF for inverse document frequency) to transform the matrix, and then uses the new matrix for information recovery.

In the Cicero method, the input data is the cell*peak reading matrix and the position of each peak on the chromosome. This method merges the readings of peaks that lie within a certain absolute distance of each other on the chromatin (for example, peaks within 250 kb), and then uses the resulting matrix for downstream information recovery.
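To make the TF-IDF transformation mentioned for the LSI method concrete, the following is a minimal sketch in plain Python. It is not taken from any LSI implementation: the function name `tf_idf`, the toy matrix, and the particular IDF formula log(1 + N/df) are assumptions chosen for illustration, and published pipelines use several slightly different variants.

```python
import math

def tf_idf(counts):
    """Apply a TF-IDF weighting to a cell x peak count matrix (list of rows)."""
    n_cells = len(counts)
    n_peaks = len(counts[0])
    # Term frequency: each cell's counts divided by that cell's total reads in peaks.
    totals = [sum(row) for row in counts]
    # Document frequency: number of cells with at least one read in each peak.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_peaks)]
    # Inverse document frequency; log(1 + N / df) is an assumed, commonly seen variant.
    idf = [math.log(1 + n_cells / df) if df > 0 else 0.0 for df in doc_freq]
    return [
        [(row[j] / total if total > 0 else 0.0) * idf[j] for j in range(n_peaks)]
        for row, total in zip(counts, totals)
    ]

# Toy cell x peak matrix: 3 cells (rows) by 3 peaks (columns).
counts = [
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
]
weighted = tf_idf(counts)
```

Peaks that are open in few cells receive a larger weight than peaks open in most cells, which is the intent of the IDF factor.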

Summary of the invention

In view of this, the present invention proposes a complete, easy-to-use scATAC-seq data analysis method and system for biological samples that efficiently recovers cell heterogeneity information.

In order to achieve the above objective, in one aspect, the present invention proposes a single-cell chromatin accessibility sequencing data analysis method based on peak clustering, including:

aligning the single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result, calling peaks based on the alignment result, and calculating the readings within each peak to obtain a cell*peak reading matrix;

calculating the mathematical distance between the peaks in the cell*peak reading matrix, clustering the peaks, and merging the cell*peak reading matrix into a cell*accesson reading matrix, where an accesson is a group of peaks clustered together.

In some embodiments, the method further includes reducing the dimensionality of the cell*accesson reading matrix to a two-dimensional visualization matrix. Preferably, the dimensionality reduction method includes PCA, t-SNE or UMAP.

In some embodiments, the method further includes clustering the cells according to the cell*accesson reading matrix. Preferably, the clustering algorithm includes KNN clustering, kernel clustering or Louvain clustering.

In some embodiments, the method further includes using the cell*accesson reading matrix to construct the pseudotime of the cell development path. Preferably, the algorithm used to construct the pseudotime of the cell development path includes SPRING or monocle.

In another aspect, the present invention provides a single-cell chromatin accessibility sequencing data analysis system based on peak clustering, including a preprocessing module and an accesson building module;

The preprocessing module includes: a) an alignment unit, used to align single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result; b) a peak finding unit, used to merge the alignment results of all single cells and then call peaks; and c) a reading calculation unit, used to calculate the readings within each peak to obtain the cell*peak reading matrix;

The accesson building module includes: a) a peak distance calculation unit, used to calculate the mathematical distance between peaks in the cell*peak reading matrix; b) a peak clustering unit, used to cluster the peaks based on the mathematical distance between them; and c) a matrix conversion unit, used to merge the cell*peak reading matrix into the cell*accesson reading matrix, where an accesson is a group of peaks clustered together.

In some embodiments, the system further includes a visualization module for reducing the dimensionality of the cell*accesson reading matrix to a two-dimensional visualization matrix. Preferably, the dimensionality reduction method includes PCA, t-SNE or UMAP.

In some embodiments, the system further includes a cell clustering module, which is used to cluster the cells according to the cell*accesson reading matrix. Preferably, the clustering algorithm includes KNN clustering, kernel clustering or Louvain clustering.

In some embodiments, the system further includes a cell development path remodeling module, which is used to construct the pseudotime of the cell development path using the cell*accesson reading matrix. Preferably, the algorithm used to construct the pseudotime of the cell development path includes SPRING or monocle.

In some embodiments, the mathematical distance includes Euclidean distance, Pearson correlation coefficient, or cityblock distance.

In some embodiments, the peak clustering method includes KNN, DBSCAN, or K-means.

In some embodiments, the method of combining the reading matrix of the cell*peak into the reading matrix of the cell*accesson includes taking the sum of the peak readings in the accesson, the average of the peak readings, the median of the peak readings, or the variance of the peak readings.

In another aspect, the present invention also provides a single-cell chromatin accessibility sequencing data analysis device based on peak clustering, including:

a processor; and

a memory having instructions stored thereon which, when executed by the processor, cause the processor to execute the analysis method.

In yet another aspect, the present invention also provides a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to execute the analysis method.

Compared with the prior art, the present invention has the following beneficial effects:

The present invention provides the first scATAC-seq data analysis method and system from fastq to clustering, visualization and developmental path reshaping;

The present invention proposes an accesson construction method based on peak clustering as the key module of scATAC-seq data analysis. The transformed cell*accesson reading matrix is used for subsequent clustering, visualization and cell development path remodeling. In tests on gold standard data sets, the clustering effect (measured by ARI) is statistically significantly better than that of existing methods.

Description of the drawings

Figure 1 is a schematic diagram of accesson construction and downstream analysis based on peak clustering in an embodiment of the present invention;

Figure 2 shows the relationship between the number of accesson and the clustering effect ARI in an embodiment of the present invention (gold standard test data set 1);

Figure 3 is the scATAC-seq data of human leukemia cells and related lineage cells in an embodiment of the present invention: A. Data clustering (hierarchical clustering) and B. Visualization effect (tSNE);

Figure 4 shows the scATAC-seq data of the human hematopoietic stem cell development and differentiation lineage in an embodiment of the present invention: data development path remodeling (monocle);

Figure 5 is the scATAC-seq data of mouse forebrain nerve cells in an embodiment of the present invention: data clustering (KNN) and visualization (tSNE);

Figures 6A-6D show mouse thymic T cell scATAC-seq data in an embodiment of the present invention: data clustering (Louvain, hierarchical clustering), visualization (tSNE) and developmental path remodeling (monocle);

Figure 7 is a comparison of the clustering effect and running time of an embodiment of the present invention with existing methods (gold standard test data set 1);

Figure 8 is a comparison of the clustering effect and running time of an embodiment of the present invention with existing methods (gold standard test data set 2).

Detailed description

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

For ease of understanding, the domain-specific terms used in this document are explained here together and will not be explained again later.

Cells: The basic units of life activity in mammals (such as humans and mice), and often the site where diseases originate. Examples include nerve cells, epithelial cells, and tumor cells.

Cell heterogeneity: Biological tissue samples (such as tumor tissue, brain tissue) are composed of a large number of cells, and the physiological functions of the constituent cells are not the same. Common cell heterogeneity has the following two manifestations: 1) The constituent cells are composed of a variety of clear cell populations (discrete). 2) The constituent cells are in a continuous cell differentiation path (continuous).

Genome: The whole DNA sequence of an organism, composed of four ATCG bases arranged in an orderly manner. The genomes of major mammals such as humans and mice have all been sequenced.

Genes: Genes (genetic factors) are all DNA sequences required to produce a polypeptide chain or functional RNA. A gene is generally one or more segments of DNA in the genome.

Transcription factor: A protein that binds to DNA to initiate or regulate gene expression. It binds to DNA often by recognizing specific DNA sequence patterns (Motif).

Chromatin: A linear composite structure composed of DNA, histones, non-histone proteins and a small amount of RNA in the nucleus. Its basic unit is the nucleosome, formed by DNA wound around histones.

Chromatin accessibility: A measure of whether a given stretch of DNA is wound around histones. In general there are two cases: 1) the DNA is tightly wound around nucleosomes, called closed DNA; 2) the DNA is not wound around nucleosomes and is exposed, called open DNA.

Chromatin Accessibility Sequencing (ATAC-seq): A sequencing technology developed by Stanford University in 2012 to detect chromatin accessibility in biological samples (>500 cells).

TCGA: The Cancer Genome Atlas (TCGA). Contains different omics sequencing data of cancer tissues and normal tissues from 33 different cancers and 11,000 patients.

Single-cell chromatin accessibility sequencing (scATAC-seq): A collective term for several existing sequencing methods used to detect the chromatin accessibility of single cells, including single-nucleus chromatin accessibility sequencing (snATAC-seq), single-cell combinatorial indexing chromatin accessibility sequencing (sciATAC-seq), and flow cytometry-based single-cell chromatin accessibility sequencing (FACS scATAC-seq).

Sequence reads: The DNA fragment sequences obtained in omics sequencing.

Mapping: Aligning the short sequences to the known genome to find the position of each short sequence on the genome.

Peak calling: Using the mapping results to find the open positions of the DNA. Each such position is called a peak and is assigned a number.

Readings: The number of short sequences in each sample and each peak.

Accesson: The result of peak clustering proposed by the present invention; each accesson is one cluster of peaks. For example, Accesson 1 = peak 2, peak 3, peak 5; Accesson 2 = peak 1, peak 4.

ARI (Adjusted Rand index) is a commonly used evaluation index for clustering algorithms, which is used to evaluate the consistency of the clustering results of the algorithm with the actual clustering results.
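As an illustration of how ARI scores a clustering against known labels, the sketch below computes it from the standard pair-counting formula using only the Python standard library. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions, not part of the patent; an established library implementation would normally be preferred in practice.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: treated as trivially consistent
    # Pairs of cells grouped together in both labelings, in the true labeling,
    # and in the predicted labeling, respectively.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are the same trivial partition
    return (together_both - expected) / (maximum - expected)

# The same grouping under different label names scores 1.0.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A labeling that splits every true group scores below zero.
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A value of 1 means the predicted clusters match the known groups exactly, values near 0 correspond to chance-level agreement, and negative values indicate agreement worse than chance.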

An embodiment of the present invention proposes a single-cell chromatin accessibility (scATAC-seq) sequencing data analysis system based on peak clustering (hereinafter referred to as APEC), which includes the following modules:

1) Preprocessing module: including a) an alignment unit, used to align fastq files (single-cell chromatin accessibility sequencing data) to the genome sequence to produce bam files; b) a peak finding unit, used to merge the bam files of all single cells into one merge_bam file and call peaks on that basis; c) a reading calculation unit, used to count the reads within each peak and finally output the cell*peak reading matrix.

2) Accesson building module: including a) a peak distance calculation unit, which calculates the mathematical distance between peaks (including but not limited to Euclidean distance, Pearson correlation coefficient and cityblock distance) from the cell*peak reading matrix; b) a peak clustering unit, which clusters the peaks using the mathematical distance between them. Each cluster of peaks is called an accesson. The clustering methods include but are not limited to KNN and DBSCAN; c) a matrix conversion unit, which merges the cell*peak reading matrix into the cell*accesson matrix according to the accesson information. The merging method includes but is not limited to taking the sum, average, median or variance of the peak readings in the accesson.

3) Visualization module: reduces the dimensionality of the cell*accesson reading matrix to a two-dimensional visualization matrix. The dimensionality reduction and visualization methods used include but are not limited to PCA, t-SNE and UMAP.

4) Cell clustering module: uses the cell*accesson reading matrix to cluster cells. Clustering algorithms include but are not limited to KNN clustering, kernel clustering, and Louvain clustering.

5) Cell development path remodeling module: uses the cell*accesson reading matrix to construct the pseudotime of the cell development path. Algorithms used include but are not limited to SPRING and monocle.

The following describes the use of APEC on 4 different gold standard test data sets in an embodiment of the present invention, to illustrate the general applicability of APEC to scATAC-seq data analysis of different biological samples. The data sets are: 1) scATAC-seq data of human leukemia cells and related lineage cells; 2) scATAC-seq data of the human hematopoietic stem cell development and differentiation lineage; 3) scATAC-seq data of mouse forebrain nerve cells; 4) scATAC-seq data of mouse thymic T cells.

The analysis process of the scATAC-seq analysis system (APEC) based on peak clustering of the present invention includes the following steps:

1) Data input:

The input data is in fastq format, which can be either a) a separate fastq file for each cell, or b) a mixed fastq file from which the data of each cell can be separated using the splitting rule given by the data provider, such as an index sequence (for example, splitting on the first 5-10 bases of each fastq read).
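The splitting of a mixed fastq file by index sequence can be sketched as follows. This is a simplified illustration only: it assumes the cell index occupies the first `barcode_len` bases of each read (6 here, within the 5-10 range mentioned above), that exact barcode matches are sufficient, and that records are plain four-line FASTQ entries. The function name `split_fastq_by_barcode` and the toy reads are hypothetical; the actual rule is whatever the data provider specifies.

```python
from collections import defaultdict
from io import StringIO

def split_fastq_by_barcode(handle, barcode_len=6):
    """Group four-line FASTQ records by the first `barcode_len` bases of each read."""
    cells = defaultdict(list)
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        barcode = seq[:barcode_len]
        # Trim the barcode from the read and its quality string before alignment.
        cells[barcode].append((header, seq[barcode_len:], plus, qual[barcode_len:]))
    return dict(cells)

# Toy mixed input: two reads from one cell (AAAAAA) and one from another (CCCCCC).
records = [("read1", "AAAAAAGATTACA"), ("read2", "CCCCCCTTGACCA"), ("read3", "AAAAAACGTACGT")]
mixed = StringIO("".join(f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n" for name, seq in records))
per_cell = split_fastq_by_barcode(mixed)
```

Each key of `per_cell` then corresponds to one cell, and its records can be written to a per-cell fastq file for the alignment step.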

2) Data preprocessing:

The alignment unit can align the input data to the genome of the relevant species; for example, data sets 1 and 2 are aligned to the human genome and data sets 3 and 4 to the mouse genome. Alternatively, the data can be aligned to a genome designated by the data provider. The alignment produces a bam file, which records the position on the genome of each read in the fastq. Processing the bam file with the peak finding unit defines the chromatin open sites in the biological sample; combined with the reading calculation unit, this yields the reading matrix (m×n) of each cell (m) by each peak (n).
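The reading calculation step can be illustrated with a small sketch that counts, for each cell, how many aligned read positions fall inside each peak. The data layout is an assumption made for the example: reads are given as (chromosome, position) pairs per cell rather than parsed from a bam file, peaks are non-overlapping half-open intervals, and the function name `count_reads_in_peaks` is hypothetical.

```python
from bisect import bisect_right

def count_reads_in_peaks(cell_reads, peaks):
    """Build a cell x peak reading matrix.

    cell_reads: dict mapping cell id -> list of (chrom, position) for aligned reads.
    peaks: list of (chrom, start, end), half-open [start, end), non-overlapping.
    """
    # Index the peaks of each chromosome, sorted by start coordinate.
    by_chrom = {}
    for idx, (chrom, start, end) in enumerate(peaks):
        by_chrom.setdefault(chrom, []).append((start, end, idx))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {chrom: [s for s, _, _ in iv] for chrom, iv in by_chrom.items()}

    cell_ids = sorted(cell_reads)
    matrix = [[0] * len(peaks) for _ in cell_ids]
    for row, cell in enumerate(cell_ids):
        for chrom, pos in cell_reads[cell]:
            if chrom not in by_chrom:
                continue  # no peaks on this chromosome
            # Last peak whose start is <= pos; the read counts only if pos < end.
            k = bisect_right(starts[chrom], pos) - 1
            if k >= 0:
                start, end, idx = by_chrom[chrom][k]
                if start <= pos < end:
                    matrix[row][idx] += 1
    return cell_ids, matrix

peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
cell_reads = {
    "cellA": [("chr1", 120), ("chr1", 199), ("chr1", 550), ("chr3", 10)],
    "cellB": [("chr1", 200), ("chr2", 60)],
}
cell_ids, reading_matrix = count_reads_in_peaks(cell_reads, peaks)
```

Here `reading_matrix` has one row per cell (m) and one column per peak (n), matching the m×n layout described above.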

3) Accesson construction:

Fig. 1 is a schematic diagram of accesson construction and downstream analysis based on peak clustering in an embodiment of the present invention. In the accesson construction, the m×n reading matrix is first transferred to the accesson building module.

In the peak distance calculation unit, Euclidean distance can be used to calculate the relative distance between peaks (data sets 1, 2, 3, 4). Other commonly used vector distance calculation methods can also be used, such as Pearson correlation coefficient, cityblock distance, etc.
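The three measures named for the peak distance calculation unit can be sketched as follows, treating each peak as the vector of its readings across all cells (one column of the cell*peak matrix). The function names and the toy matrix are illustrative assumptions. Note that the Pearson correlation coefficient is a similarity rather than a distance; how it is converted (for example as 1 minus the coefficient) is not specified in the text and is left open here.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cityblock(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def pearson(u, v):
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for a constant vector; 0 is an assumed convention
    return cov / (norm_u * norm_v)

def pairwise(matrix, metric):
    """Apply `metric` to every pair of peak columns of a cell x peak matrix."""
    columns = list(zip(*matrix))
    return [[metric(u, v) for v in columns] for u in columns]

# Toy cell x peak matrix: 3 cells (rows) by 3 peaks (columns).
counts = [
    [1, 2, 0],
    [2, 4, 0],
    [3, 6, 1],
]
euclid = pairwise(counts, euclidean)
manhattan = pairwise(counts, cityblock)
corr = pairwise(counts, pearson)
```

In this toy matrix the first two peaks vary together across cells, so their correlation is close to 1 even though their Euclidean distance is not zero, which shows why the choice of measure can change which peaks end up grouped together.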

In the peak clustering unit, the KNN algorithm can be used to cluster the peaks into a specified number of accessons (data sets 1, 2, 3, 4). Other common vector clustering algorithms can also be used, such as DBSCAN, K-means, etc. The specified number of accessons does not affect the result over a wide range (Figure 2), so the default is 2000, which can be adjusted for specific data.
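As a rough illustration of clustering peaks into a specified number of accessons, the sketch below uses a plain K-means (Lloyd's algorithm), which the text lists as one of the possible algorithms. It is not the KNN-based procedure used for data sets 1 to 4, and the deterministic initialization, the function name `kmeans_peaks`, and the toy peak vectors are assumptions made so that the example is small and reproducible.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_peaks(peak_vectors, k, n_iter=20):
    """Lloyd's K-means over peak vectors; returns one cluster label per peak."""
    # Deterministic initialization (an assumption): first k distinct vectors.
    centroids = []
    for v in peak_vectors:
        if list(v) not in centroids:
            centroids.append(list(v))
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("fewer distinct peak vectors than requested clusters")
    labels = [0] * len(peak_vectors)
    for _ in range(n_iter):
        # Assignment step: each peak joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in peak_vectors
        ]
        # Update step: each centroid becomes the mean of its member peaks.
        for c in range(k):
            members = [v for v, lab in zip(peak_vectors, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels

# Readings of peaks 1..5 across four cells (each vector is one peak).
peak_vectors = [
    [0, 0, 5, 6],  # peak 1
    [5, 6, 0, 0],  # peak 2
    [6, 5, 0, 1],  # peak 3
    [0, 1, 6, 5],  # peak 4
    [5, 5, 1, 0],  # peak 5
]
labels = kmeans_peaks(peak_vectors, k=2)

# Invert the labels into accesson -> list of peak numbers.
accessons = {}
for peak_number, label in enumerate(labels, start=1):
    accessons.setdefault(label, []).append(peak_number)
```

The resulting grouping puts peaks 2, 3 and 5 in one accesson and peaks 1 and 4 in the other, mirroring the example in the definition of an accesson; the label numbers themselves are arbitrary.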

In the matrix conversion unit, the accessons are first filtered according to their basic properties, for example by removing accessons whose number of peaks is less than a specified value, or removing accessons whose internal Gini coefficient is less than a specified value. After that, according to the accesson information, the cell*peak reading matrix is merged into the cell*accesson matrix. The merging method is to take the sum of the peak readings in the accesson (data sets 1, 2, 3, 4). Other simple vector summary statistics can also be used, such as the average, the median, or the variance of the readings.
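The filtering and merging performed by the matrix conversion unit can be sketched as below. Only the peak-count filter is shown; the Gini coefficient filter is omitted because the text does not say what it is computed over. The function name `merge_peaks_into_accessons`, the `min_peaks` parameter, the use of population variance, and the toy data are all assumptions for illustration.

```python
import statistics

# Merging options named in the text; population variance is an assumed choice.
AGGREGATORS = {
    "sum": sum,
    "mean": statistics.mean,
    "median": statistics.median,
    "variance": statistics.pvariance,
}

def merge_peaks_into_accessons(matrix, accessons, how="sum", min_peaks=1):
    """Merge a cell x peak matrix into a cell x accesson matrix.

    matrix: list of rows, one per cell, one column per peak.
    accessons: dict mapping accesson id -> list of 0-based peak column indices.
    """
    aggregate = AGGREGATORS[how]
    # Filter step: drop accessons with fewer peaks than the specified value.
    kept = {acc: cols for acc, cols in accessons.items() if len(cols) >= min_peaks}
    accesson_ids = sorted(kept)
    merged = [
        [aggregate([row[j] for j in kept[acc]]) for acc in accesson_ids]
        for row in matrix
    ]
    return accesson_ids, merged

# Two cells by six peaks; accesson 3 holds a single peak and is filtered out.
matrix = [
    [0, 5, 6, 0, 5, 1],
    [5, 0, 0, 6, 1, 0],
]
accessons = {1: [1, 2, 4], 2: [0, 3], 3: [5]}
ids_sum, merged_sum = merge_peaks_into_accessons(matrix, accessons, how="sum", min_peaks=2)
ids_med, merged_med = merge_peaks_into_accessons(matrix, accessons, how="median", min_peaks=2)
```

With the sum, the six sparse peak columns collapse into two accesson columns that separate the two cells clearly, which is the densifying effect the merged matrix is meant to provide.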

4) Data clustering and visualization

In this step, the visualization module can be used to reduce the dimensionality of the cell*accesson reading matrix to a two-dimensional visualization matrix, and/or the cell clustering module can be used to cluster the cells, and/or the cell development path remodeling module can be used to construct the pseudotime of the cell development path.

Figure 3 shows scATAC-seq data of human leukemia cells and related lineage cells: A. Data clustering (hierarchical clustering) and B. Visualization effect (tSNE);

Figure 4 shows scATAC-seq data of the human hematopoietic stem cell development and differentiation lineage: data development path remodeling (monocle);

Figure 5 shows scATAC-seq data of mouse forebrain nerve cells: data clustering (KNN) and visualization (tSNE);

Figures 6A-6D are mouse thymic T cell scATAC-seq data: Figure 6A is Louvain clustering, Figure 6B is hierarchical clustering, Figure 6C is visualization (tSNE), and Figure 6D is developmental path remodeling (monocle).

It can be seen that the present invention covers the whole process from fastq to clustering, visualization and developmental path reconstruction. In tests on gold standard data sets, its clustering effect (ARI) is statistically significantly better than that of existing methods, as shown in Figures 7 and 8. It can efficiently recover cell heterogeneity information because the accesson construction method it proposes is a filtering process that reduces noise and amplifies signal. Specifically: 1) compared with LSI and ChromVAR, the present invention transforms the sparse cell*peak matrix into a denser cell*accesson matrix, which reduces the noise in subsequent analysis; 2) compared with the Cicero method, which merges peaks based on chromatin position, the present invention uses a mathematical distance and a clustering algorithm to cluster the peaks and merge them. Peaks clustered together in this way have similar patterns of readings across cells, so the resulting accessons are more biologically meaningful; for example, the peaks within an accesson may be regulated by the same transcription factor, or lie closer together in the three-dimensional structure of chromatin. The transformed cell*accesson matrix therefore further amplifies the heterogeneity of the cells.

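The present invention also provides a single-cell chromatin accessibility sequencing data analysis device based on peak clustering, including:

a processor; and

a memory having instructions stored thereon which, when executed by the processor, cause the processor to execute the analysis method.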

The present invention also provides a computer-readable storage medium storing instructions, which when executed by a processor cause the processor to execute the analysis method.

It should be noted that each functional module/unit in the present invention can be hardware, for example, the hardware can be a circuit, including a digital circuit, an analog circuit, and so on. The physical realization of the hardware structure includes but is not limited to physical devices, which includes but is not limited to transistors, memristors, and so on. The data processing module can be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on. The storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, etc.

Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the above division of functional modules is used as an example. In practical applications, the above functions can be allocated to different functional modules as required; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.

The specific embodiments described above further describe the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the present invention. Within the spirit and principle of the present invention, any modification, equivalent replacement, improvement, etc., shall be included in the protection scope of the present invention.

Claims

1. A single-cell chromatin accessibility sequencing data analysis method based on peak clustering, comprising:

aligning the single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result, calling peaks based on the alignment result, and calculating the readings within each peak to obtain a cell*peak reading matrix; and

calculating the mathematical distance between the peaks in the cell*peak reading matrix, clustering the peaks, and merging the cell*peak reading matrix into a cell*accesson reading matrix, where an accesson is a group of peaks clustered together.
2. The analysis method according to claim 1, wherein the method further comprises reducing the dimensionality of the cell*accesson reading matrix to a two-dimensional visualization matrix; preferably, the dimensionality reduction method includes PCA, t-SNE or UMAP.
3. The analysis method according to claim 1 or 2, wherein the method further comprises clustering the cells according to the cell*accesson reading matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering or Louvain clustering.
4. The analysis method according to any one of claims 1 to 3, wherein the method further comprises using the cell*accesson reading matrix to construct the pseudotime of the cell development path; preferably, the algorithm used to construct the pseudotime of the cell development path includes SPRING or monocle.
5. The analysis method according to any one of claims 1 to 4, wherein the mathematical distance includes Euclidean distance, Pearson correlation coefficient or cityblock distance.
6. The analysis method according to any one of claims 1 to 5, wherein the method of peak clustering comprises KNN, DBSCAN or K-means.
7. The analysis method according to any one of claims 1 to 6, wherein the method of merging the cell*peak reading matrix into the cell*accesson reading matrix comprises taking the sum of the peak readings in the accesson, the average of the peak readings, the median of the peak readings or the variance of the peak readings.
8. A single-cell chromatin accessibility sequencing data analysis system based on peak clustering, including a preprocessing module and an accesson building module;

wherein the preprocessing module includes: a) an alignment unit, used to align single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result; b) a peak finding unit, used to merge the alignment results of all single cells and then call peaks; and c) a reading calculation unit, used to calculate the readings within each peak to obtain the cell*peak reading matrix;

and the accesson building module includes: a) a peak distance calculation unit, used to calculate the mathematical distance between peaks in the cell*peak reading matrix; b) a peak clustering unit, used to cluster the peaks based on the mathematical distance between them; and c) a matrix conversion unit, used to merge the cell*peak reading matrix into the cell*accesson reading matrix, where an accesson is a group of peaks clustered together.
9. The analysis system according to claim 8, wherein the system further comprises a visualization module for reducing the dimensionality of the cell*accesson reading matrix to a two-dimensional visualization matrix; preferably, the dimensionality reduction method includes PCA, t-SNE or UMAP.
10. The analysis system according to claim 8 or 9, wherein the system further comprises a cell clustering module for clustering the cells according to the cell*accesson reading matrix; preferably, the clustering algorithm includes KNN clustering, kernel clustering or Louvain clustering.
11. The analysis system according to any one of claims 8 to 10, wherein the system further comprises a cell development path remodeling module for constructing the pseudotime of the cell development path using the cell*accesson reading matrix; preferably, the algorithm used to construct the pseudotime of the cell development path includes SPRING or monocle.
12. The analysis system according to any one of claims 8 to 11, wherein the mathematical distance includes Euclidean distance, Pearson correlation coefficient or cityblock distance.
13. The analysis system according to any one of claims 8 to 12, wherein the method of peak clustering comprises KNN, DBSCAN or K-means.
14. The analysis system according to any one of claims 8 to 13, wherein the method of merging the cell*peak reading matrix into the cell*accesson reading matrix comprises taking the sum of the peak readings in the accesson, the average of the peak readings, the median of the peak readings or the variance of the peak readings.
15. A single-cell chromatin accessibility sequencing data analysis device based on peak clustering, comprising:

a processor; and

a memory having instructions stored thereon which, when executed by the processor, cause the processor to execute the analysis method according to any one of claims 1 to 7.
16. A computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to execute the analysis method according to any one of claims 1 to 7.