CN111755071B

CN111755071B - Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering

Info

Publication number: CN111755071B
Application number: CN201910256667.0A
Authority: CN
Inventors: 瞿昆; 方靖文; 黎斌; 李杨
Original assignee: University of Science and Technology of China USTC
Current assignee: Qu Kun; University of Science and Technology of China USTC
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2023-04-21
Anticipated expiration: 2039-03-29
Also published as: CN111755071A

Abstract

A single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering, the method comprising: aligning the single-cell chromatin accessibility sequencing data to the genome of the corresponding biological sample to obtain alignment results, calling peaks on the basis of the alignment results, and counting the reads in each peak to obtain a cell-by-peak read count matrix; calculating mathematical distances between peaks in the cell-by-peak read count matrix, clustering the peaks, and merging the cell-by-peak read count matrix into a cell-by-accesson read count matrix, wherein an accesson is a cluster of peaks. The invention provides a method and a system for analyzing scATAC-seq data from fastq through clustering, visualization and developmental path remodeling, and the clustering performance is markedly improved.

Description

Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering

Technical Field

The invention belongs to the technical field of biological sequencing data analysis, and particularly relates to a single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering.

Background

Since 2012, ATAC-seq has been widely used in biological research and, owing to its simplicity, low cost, and small number of required cells, has contributed to breakthroughs in research on embryonic development, stem cell differentiation, and cancer mechanisms and typing. For example, a 2017 paper in Cancer Cell (impact factor 24) found that ATAC-seq could explain the pathogenesis and precise drug typing of T-cell lymphoma, and in 2018 ATAC-seq data entered the TCGA database. To further investigate cellular heterogeneity, scATAC-seq was proposed in 2015 and has since been developed into a number of different protocols, which in turn created the need to analyze and interpret scATAC-seq sequencing data.

The primary purpose of scATAC-seq data analysis is to recover, from the sequencing results, the main cell subpopulations or the developmental differentiation pathways in a mixed biological sample. However, scATAC-seq is still a relatively cutting-edge technique, and the signal-to-noise ratio of its data is low. scATAC-seq data analysis therefore requires an easy-to-use analysis method that recovers cellular heterogeneity information to the greatest possible extent. On the one hand, the currently published scATAC-seq data analysis methods do not provide a complete, easy-to-use workflow from fastq files through clustering, visualization and developmental path reconstruction. On the other hand, when evaluated on gold-standard test datasets, i.e., test datasets in which the subpopulation of each cell or its position in the developmental differentiation pathway is known, the existing methods still recover information poorly (as assessed by ARI) and need improvement. As a result, there is currently no industry-standard method for scATAC-seq analysis.

In the prior art, the following three analysis methods exist: chromVAR, LSI and Cicero.

In the chromVAR method, the input data are a cell-by-peak read count matrix and the sequence information of each peak. From these, a cell-by-transcription-factor preference score matrix is constructed, and this matrix is used for information recovery.

In the LSI method, the input data is a cell-by-peak read count matrix. The method transforms the matrix with the TF-IDF algorithm (TF: term frequency; IDF: inverse document frequency) and then performs information recovery using the new matrix.
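As an illustration of the TF-IDF step used by LSI-style methods, the following is a minimal sketch only. It assumes one common formulation (TF as the read count divided by the total reads of the cell, IDF as log(1 + number of cells / number of cells in which the peak has reads)); published LSI pipelines differ in these details. The function name `tfidf_transform` and the toy matrix are hypothetical and do not come from the patent.

```python
import math


def tfidf_transform(counts):
    """Apply a TF-IDF weighting to a cell-by-peak count matrix (list of rows).

    Assumed formulation (one of several in use):
      TF[i][j] = counts[i][j] / total reads of cell i
      IDF[j]   = log(1 + n_cells / number of cells with a read in peak j)
    """
    n_cells = len(counts)
    n_peaks = len(counts[0])
    cell_totals = [sum(row) for row in counts]
    # Number of cells in which each peak has at least one read
    peak_cells = [sum(1 for row in counts if row[j] > 0) for j in range(n_peaks)]
    idf = [math.log(1 + n_cells / c) if c > 0 else 0.0 for c in peak_cells]
    return [
        [(row[j] / total) * idf[j] if total > 0 else 0.0 for j in range(n_peaks)]
        for row, total in zip(counts, cell_totals)
    ]


# Toy cell-by-peak matrix: 3 cells (rows) x 3 peaks (columns)
counts = [
    [2, 0, 1],
    [1, 0, 0],
    [1, 3, 0],
]
weighted = tfidf_transform(counts)
```

Peaks that are open in every cell receive a low IDF weight, while peaks open in few cells are weighted up.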

In the Cicero method, the input data are a cell-by-peak read count matrix and the position of each peak on the chromosome. Downstream information recovery is then performed using this matrix.

Disclosure of Invention

In view of the above, the present invention provides a complete, easy-to-use method and system for analyzing scATAC-seq data of biological samples, capable of efficiently recovering cell heterogeneity information.

In order to achieve the above object, in one aspect, the present invention provides a method for analyzing single-cell chromatin accessibility sequencing data based on peak clustering, comprising:

aligning the single-cell chromatin accessibility sequencing data to the genome of the corresponding biological sample to obtain alignment results, calling peaks on the basis of the alignment results, and counting the reads in each peak to obtain a cell-by-peak read count matrix;

calculating mathematical distances between peaks in the cell-by-peak read count matrix, clustering the peaks, and merging the cell-by-peak read count matrix into a cell-by-accesson read count matrix, wherein an accesson is a cluster of peaks.

In some embodiments, the method further comprises reducing the dimension of the cell-by-accesson read count matrix to a two-dimensional visualization matrix; preferably, the dimension reduction method comprises PCA, t-SNE or UMAP.

In some embodiments, the method further comprises clustering cells according to the cell-by-accesson read count matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.

In some embodiments, the method further comprises constructing a cell development path pseudotime trajectory using the cell-by-accesson read count matrix; preferably, the algorithm used in constructing the pseudotime trajectory comprises SPRING or Monocle.

In another aspect, the invention provides a single-cell chromatin accessibility sequencing data analysis system based on peak clustering, which comprises a preprocessing module and an accesson construction module;

the preprocessing module comprises a) an alignment unit for aligning single-cell chromatin accessibility sequencing data to the genome of the corresponding biological sample to obtain alignment results; b) a peak-calling unit for merging the alignment results of all single cells and then calling peaks; c) a read counting unit for counting the reads in each peak to obtain a cell-by-peak read count matrix;

the accesson construction module comprises a) a peak distance calculation unit for calculating mathematical distances between peaks in the cell-by-peak read count matrix; b) a peak clustering unit for clustering peaks according to the mathematical distances between peaks; c) a matrix conversion unit for merging the cell-by-peak read count matrix into a cell-by-accesson read count matrix, wherein an accesson is a cluster of peaks.

In some embodiments, the system further comprises a visualization module for reducing the dimension of the cell-by-accesson read count matrix to a two-dimensional visualization matrix; preferably, the dimension reduction method comprises PCA, t-SNE or UMAP.

In some embodiments, the system further comprises a cell clustering module for clustering cells according to the cell-by-accesson read count matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering or Louvain clustering.

In some embodiments, the system further comprises a cell development path remodeling module for constructing a cell development path pseudotime trajectory using the cell-by-accesson read count matrix; preferably, the algorithm used in constructing the pseudotime trajectory comprises SPRING or Monocle.

In some embodiments, the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance.

In some embodiments, the method of peak clustering comprises KNN, DBSCAN, or K-Means.

In some embodiments, the method of merging the cell-by-peak read count matrix into the cell-by-accesson read count matrix comprises taking the sum, the average, the median, or the variance of the peak read counts within each accesson.

In yet another aspect, the present invention also provides a single-cell chromatin accessibility sequencing data analysis device based on peak clustering, including:

a processor;

a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method.

In yet another aspect, the present invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.

Compared with the prior art, the invention has the following beneficial effects:

the present invention provides, for the first time, a scATAC-seq data analysis method and system covering the workflow from fastq through clustering, visualization and developmental path remodeling;

the invention provides an accesson construction method based on peak clustering, which serves as the key module for scATAC-seq data analysis. The transformed cell-by-accesson read count matrix is used for subsequent clustering, visualization and cell development path remodeling. On the gold-standard test datasets, the clustering performance (measured by ARI) is statistically significantly higher than that of the existing methods.

Drawings

FIG. 1 is a schematic diagram of an accesson construction and downstream analysis based on peak clustering in an embodiment of the present invention;

FIG. 2 is a graph showing the relationship between the number of accessons and the clustering performance measured by ARI (gold-standard test dataset 1);

FIG. 3 shows the scATAC-seq data for human leukemia cells and related lineage cells according to an embodiment of the present invention: A. data clustering (hierarchical clustering) and B. visualization (tSNE);

FIG. 4 shows the scATAC-seq data relating to the developmental differentiation lineage of human hematopoietic stem cells according to an embodiment of the present invention: developmental path remodeling (Monocle);

FIG. 5 shows the mouse forebrain nerve cell scATAC-seq data according to an embodiment of the present invention: data clustering (KNN) and visualization (tSNE);

FIGS. 6A-6D show the mouse thymus T cell scATAC-seq data according to an embodiment of the present invention: data clustering (Louvain, hierarchical clustering), visualization (tSNE), and developmental path remodeling (Monocle);

FIG. 7 is a graph comparing the clustering performance and running time of the present invention with those of the prior art methods (gold-standard test dataset 1);

FIG. 8 is a graph comparing the clustering performance and running time of the present invention with those of the prior art methods (gold-standard test dataset 2).

Detailed Description

The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.

For ease of understanding, the technical terms used herein are explained together here and are not explained again below.

Cell: the basic unit of life activity in mammals (e.g., humans, mice) and often the site of pathogenesis of various diseases; examples include nerve cells, epithelial cells, and tumor cells.

Cell heterogeneity: biological tissue samples (e.g., tumor tissue, brain tissue) are composed of a large number of cells whose physiological functions differ. Cellular heterogeneity commonly takes two forms: 1) the constituent cells consist of a variety of well-defined cell populations (discrete); 2) the constituent cells lie along a continuous cell differentiation pathway (continuous).

Genome: the whole DNA sequence of an organism, consisting of the four bases A, T, C and G in ordered arrangement. The genomes of major mammals such as human and mouse have been completely sequenced.

Gene: genes (genetic factors) are all DNA sequences required to produce one polypeptide chain or functional RNA. A gene is typically one or more stretches of DNA on the genome.

Transcription factor: a protein that binds to DNA to initiate or regulate gene expression. Binding to DNA is often accomplished by recognizing a specific DNA sequence pattern (motif).

Chromatin: a linear composite structure in the nucleus consisting of DNA, histones, non-histone proteins and small amounts of RNA. Its basic unit is the nucleosome, formed by DNA wound around histones.

Chromatin accessibility: a measure of whether a stretch of DNA is wound around histones. In general there are two cases: 1) the DNA is tightly wound around nucleosomes, called closed DNA; 2) the DNA is not wound around nucleosomes and is exposed, called open DNA.

Chromatin accessibility sequencing (ATAC-seq): a sequencing technology developed at Stanford University in 2012 for detecting the chromatin accessibility of biological samples (> 500 cells).

TCGA: The Cancer Genome Atlas, a cancer genomics program comprising various kinds of sequencing data from cancer tissue and normal tissue, covering 33 different cancers and 11,000 patients.

Single-cell chromatin accessibility sequencing (scATAC-seq): sequencing methods for detecting the chromatin accessibility of individual cells. Several such methods exist, including single-nucleus chromatin accessibility sequencing (snATAC-seq), single-cell combinatorial indexing chromatin accessibility sequencing (sciATAC-seq), and flow-cytometry-based single-cell chromatin accessibility sequencing (FACS scATAC-seq).

Short sequences (sequencing reads): the DNA fragments obtained by sequencing.

Alignment (mapping): the short sequences are compared with known genomic information to find the position of each short sequence on the genome.

Peak calling: the open positions of the DNA are identified from the alignment results; each such position is called a peak and is assigned a number.

Read count: the number of short sequences per sample, per peak.

Accesson: the result of the peak clustering provided by the invention is called an accesson, i.e., a cluster of peaks. For example, accesson 1 = peak 2, peak 3, peak 5; accesson 2 = peak 1, peak 4.

ARI (Adjusted Rand index) is an evaluation index commonly used in clustering algorithms for evaluating the consistency of the algorithm clustering results with the actual clustering results.
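For reference, the ARI can be computed from the contingency table of two labelings with the standard pair-counting formula. The following is a minimal pure-Python sketch for illustration; the function name `adjusted_rand_index` and the toy labels are assumptions and are not taken from the patent.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between a known labeling and a predicted labeling."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the labelings trivially agree

    # Contingency counts and their row/column marginals
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in one cluster in both
    return (sum_pairs - expected) / (max_index - expected)


# Identical partitions with renamed cluster labels
ari_perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Predicted clusters that cut across the known groups
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An ARI of 1 means the clustering matches the known labels exactly (up to renaming), values near 0 correspond to chance-level agreement, and negative values indicate agreement worse than chance.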

One embodiment of the present invention proposes a single-cell chromatin accessibility sequencing (scATAC-seq) data analysis system based on peak clustering (hereinafter abbreviated as APEC), which comprises the following modules:

1) Preprocessing module: comprises a) an alignment unit for aligning fastq files (i.e., single-cell chromatin accessibility sequencing data) to the genome sequence to produce bam files; b) a peak-calling unit for merging the bam files of all single-cell alignment results into a merge_bam file and calling peaks on that basis; c) a read counting unit for counting the reads in each peak and finally outputting a cell-by-peak read count matrix.

2) Accesson construction module: comprises a) a peak distance calculation unit for calculating mathematical distances (including but not limited to Euclidean distance, Pearson correlation coefficient, and cityblock distance) between peaks from the cell-by-peak read count matrix; b) a peak clustering unit for clustering the peaks by the mathematical distances between them, wherein each cluster of peaks is called an accesson, and the clustering method includes, but is not limited to, KNN and DBSCAN; c) a matrix conversion unit for merging the cell-by-peak read count matrix into a cell-by-accesson read count matrix according to the accesson information, wherein the merging method includes, but is not limited to, taking the sum, the average, the median, or the variance of the peak read counts within each accesson.

3) Visualization module: the cell-by-accesson read count matrix is reduced to a two-dimensional visualization matrix using dimension reduction methods including, but not limited to, PCA, t-SNE and UMAP.

4) Cell clustering module: cells are clustered using the cell-by-accesson read count matrix; clustering algorithms include, but are not limited to, KNN clustering, kernel clustering and Louvain clustering.

5) Cell development path remodeling module: a pseudotime trajectory of the cell development path is constructed from the cell-by-accesson read count matrix, using algorithms including, but not limited to, SPRING and Monocle.

The following describes the use of APEC on 4 different gold-standard test datasets in an embodiment of the invention, illustrating the versatility of APEC in analyzing scATAC-seq datasets from different biological samples. The datasets comprise: 1) human leukemia cells and related lineage cell scATAC-seq data; 2) scATAC-seq data relating to the developmental differentiation lineage of human hematopoietic stem cells; 3) mouse forebrain nerve cell scATAC-seq data; 4) mouse thymus T cell scATAC-seq data.

The analysis flow using the peak-clustering-based scATAC-seq analysis system (APEC) of the present invention comprises the following steps:

1) Data input:

the input data are fastq files, whose format can be: a) a single fastq file per cell; or b) one mixed fastq file that can be split into per-cell data by a splitting rule given by the data provider, such as an index sequence (e.g., splitting on the first 5-10 bases of each read in the fastq file).
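Splitting a mixed fastq file by index sequence can be sketched as follows. This is a minimal illustration that assumes the cell index occupies the first `barcode_length` bases of each read and that every record spans exactly four lines; the actual splitting rule is whatever the data provider specifies. The function name `split_fastq_by_index` and the example reads are hypothetical.

```python
import io
from collections import defaultdict


def split_fastq_by_index(handle, barcode_length=8):
    """Group FASTQ records by the first `barcode_length` bases of each read.

    Returns a dict mapping each index sequence to a list of
    (header, trimmed sequence, separator, trimmed quality) tuples.
    """
    cells = defaultdict(list)
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        barcode = sequence[:barcode_length]
        # Remove the index bases (and their qualities) before alignment
        record = (header, sequence[barcode_length:], separator, quality[barcode_length:])
        cells[barcode].append(record)
    return dict(cells)


# Toy mixed FASTQ with three reads from two cells (assumed 8-base index)
mixed_fastq = io.StringIO(
    "@read1\nACGTACGTTTGACCA\n+\n" + "I" * 15 + "\n"
    "@read2\nTTTTCCCCGGATTAC\n+\n" + "I" * 15 + "\n"
    "@read3\nACGTACGTCCATGGA\n+\n" + "I" * 15 + "\n"
)
per_cell = split_fastq_by_index(mixed_fastq, barcode_length=8)
```

Each key of `per_cell` then plays the role of one cell's fastq data in case a) above.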

2) Data preprocessing:

The input data can be aligned by the alignment unit to the genome of the corresponding biological sample, for example datasets 1 and 2 to the human genome and datasets 3 and 4 to the mouse genome, or to a genome specified by the data provider. The alignment produces a bam file that records the position on the genome to which each read in the fastq file aligns. Processing the bam files with the peak-calling unit defines the open chromatin sites in the biological sample, and combining this with the read counting unit yields the read count matrix (m×n) of each peak (n) in each cell (m).
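The read counting step can be illustrated with a small sketch. It assumes that each aligned read has already been reduced to a (chromosome, position) pair and that peaks are half-open intervals [start, end); a real pipeline would read these from bam and peak files. The function name `count_reads_in_peaks` and the toy coordinates are hypothetical.

```python
def count_reads_in_peaks(cell_reads, peaks):
    """Build the m x n read count matrix (m cells, n peaks).

    cell_reads: one list per cell of (chromosome, position) aligned reads
    peaks:      list of (chromosome, start, end) intervals, end exclusive
    """
    matrix = []
    for reads in cell_reads:
        row = [0] * len(peaks)
        for chrom, position in reads:
            # Simple linear scan; adequate for a sketch, slow for real data
            for j, (peak_chrom, start, end) in enumerate(peaks):
                if chrom == peak_chrom and start <= position < end:
                    row[j] += 1
        matrix.append(row)
    return matrix


peaks = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 50, 120)]
cell_reads = [
    [("chr1", 150), ("chr1", 199), ("chr2", 60)],   # cell 1
    [("chr1", 510), ("chr1", 200), ("chr2", 119)],  # cell 2
]
read_matrix = count_reads_in_peaks(cell_reads, peaks)
```

Rows correspond to cells and columns to peaks, so `read_matrix[i][j]` is the read count of peak j in cell i.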

3) accesson construction:

FIG. 1 is a schematic diagram of accesson construction and downstream analysis based on peak clustering in an embodiment of the present invention. In accesson construction, the m×n read count matrix is first passed into the accesson construction module.

In the peak distance calculation unit, the relative distance between peaks may be calculated using the Euclidean distance (as for datasets 1, 2, 3 and 4); other commonly used vector distance measures, such as the Pearson correlation coefficient or the cityblock distance, may also be used.
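The three distance measures named above can be sketched as follows, treating each peak as a vector of read counts across cells (a column of the cell-by-peak matrix). This is an illustration only: converting the Pearson correlation coefficient r into a distance as 1 - r is an assumption, and all function names are hypothetical.

```python
import math


def peak_vectors(matrix):
    """Transpose a cell-by-peak matrix so that each peak is a vector over cells."""
    return [list(column) for column in zip(*matrix)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cityblock(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def pearson_distance(u, v):
    """1 - Pearson r (assumed way of turning the correlation into a distance)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 1.0  # correlation undefined; treated here as uncorrelated
    return 1 - cov / math.sqrt(var_u * var_v)


def pairwise_distances(vectors, metric):
    """Full peak-by-peak distance matrix for the chosen metric."""
    n = len(vectors)
    return [[metric(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


# Toy cell-by-peak matrix: 4 cells x 3 peaks
read_matrix = [
    [1, 2, 0],
    [0, 0, 3],
    [2, 4, 0],
    [0, 0, 1],
]
vectors = peak_vectors(read_matrix)
euclidean_matrix = pairwise_distances(vectors, euclidean)
```

In this toy matrix the first two peaks are perfectly correlated across cells, so their Pearson distance is zero even though their Euclidean distance is not.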

In the peak clustering unit, peaks can be clustered into a specified number of accessons using the KNN algorithm (as for datasets 1, 2, 3 and 4). Other common vector clustering algorithms, such as DBSCAN or K-Means, may also be used. The specified number of accessons does not affect the result over a wide range (FIG. 2); it therefore defaults to 2000 and can be adjusted for the specific data.
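As an illustration of clustering peaks into a specified number of accessons, the sketch below uses K-Means, one of the alternative algorithms named above, rather than the KNN-based clustering used for the four datasets. The deterministic farthest-point initialization, the function name `kmeans_peaks`, and the toy data (with 2 accessons instead of the default 2000) are assumptions made for the example.

```python
def kmeans_peaks(vectors, n_clusters, n_iterations=100):
    """Cluster peak vectors (one vector of read counts per peak) with K-Means.

    Returns one cluster label per peak; each label identifies an accesson.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Deterministic farthest-point initialization (assumption of this sketch)
    centers = [list(vectors[0])]
    while len(centers) < n_clusters:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(list(farthest))

    labels = None
    for _ in range(n_iterations):
        # Assign every peak to its nearest center
        new_labels = [
            min(range(n_clusters), key=lambda k: sq_dist(v, centers[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its member peaks
        for k in range(n_clusters):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centers[k] = [sum(values) / len(members) for values in zip(*members)]
    return labels


# Five peaks described by their read counts in four cells
peak_profiles = [
    [0, 0, 3, 4],  # peak 1
    [5, 4, 0, 0],  # peak 2
    [4, 5, 0, 1],  # peak 3
    [0, 1, 4, 3],  # peak 4
    [5, 5, 1, 0],  # peak 5
]
accesson_labels = kmeans_peaks(peak_profiles, n_clusters=2)
```

With these toy profiles, peaks 2, 3 and 5 end up in one accesson and peaks 1 and 4 in the other, mirroring the example given in the definition of an accesson.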

In the matrix conversion unit, the accessons are first screened according to their basic properties, for example by removing accessons that contain fewer peaks than a specified value or accessons whose internal Gini coefficient is smaller than a specified value. Then, according to the accesson information, the cell-by-peak read count matrix is merged into the cell-by-accesson read count matrix by taking the sum of the peak read counts within each accesson (as for datasets 1, 2, 3 and 4). Other simple summary statistics, such as the average, median, or variance of the read counts, can also be used.
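The screening and merging described above can be sketched as follows. The example implements only the peak-number filter and the sum-based merge; the function name `merge_peaks_into_accessons`, the `min_peaks` parameter and the toy data are hypothetical.

```python
def merge_peaks_into_accessons(matrix, peak_labels, min_peaks=2):
    """Merge a cell-by-peak matrix into a cell-by-accesson matrix by summation.

    matrix:      cell-by-peak read counts (list of rows, one row per cell)
    peak_labels: accesson label of each peak (one label per column)
    min_peaks:   accessons with fewer peaks than this are removed
    """
    members = {}
    for peak_index, label in enumerate(peak_labels):
        members.setdefault(label, []).append(peak_index)

    # Screening: keep only accessons that contain enough peaks
    kept = {
        label: indices
        for label, indices in sorted(members.items())
        if len(indices) >= min_peaks
    }

    # Merging: sum the read counts of the peaks within each kept accesson
    merged = [
        [sum(row[j] for j in indices) for indices in kept.values()]
        for row in matrix
    ]
    return merged, kept


# Toy cell-by-peak matrix: 3 cells x 6 peaks
read_matrix = [
    [0, 5, 4, 0, 5, 1],
    [3, 0, 0, 4, 1, 0],
    [4, 0, 1, 3, 0, 2],
]
peak_labels = [0, 1, 1, 0, 1, 2]  # accesson 2 contains a single peak
accesson_matrix, accessons = merge_peaks_into_accessons(read_matrix, peak_labels)
```

Replacing `sum` in the merging step with a mean, median or variance would give the other merging options mentioned above.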

4) Data clustering and visualization

In this step, the cell-by-accesson read count matrix may be reduced to a two-dimensional visualization matrix using the visualization module, and/or the cells may be clustered using the cell clustering module, and/or the pseudotime trajectory of the cell development path may be constructed using the cell development path remodeling module.

FIG. 3 shows the human leukemia cell and related lineage cell scATAC-seq data: A. data clustering (hierarchical clustering) and B. visualization (tSNE);

FIG. 4 shows the scATAC-seq data relating to the developmental differentiation lineage of human hematopoietic stem cells: developmental path remodeling (Monocle);

FIG. 5 shows the mouse forebrain nerve cell scATAC-seq data: data clustering (KNN) and visualization (tSNE);

FIGS. 6A-6D show the mouse thymus T cell scATAC-seq data, wherein FIG. 6A is Louvain clustering, FIG. 6B is hierarchical clustering, FIG. 6C is visualization (tSNE), and FIG. 6D is developmental path remodeling (Monocle).

It can be seen that the present invention covers the whole workflow from fastq through clustering, visualization and developmental path remodeling, and that its clustering performance (ARI) on the gold-standard test datasets is statistically significantly higher than that of the existing methods, as shown in FIGS. 7 and 8. Cell heterogeneity information can be recovered efficiently because accesson construction acts as a filtering process that reduces noise and amplifies signal. Specifically: 1) compared with LSI and chromVAR, the invention converts the originally sparse cell-by-peak matrix into a denser cell-by-accesson matrix, thereby reducing noise in subsequent analysis; 2) compared with the Cicero method, which merges peaks based on chromatin position, the invention clusters peaks by mathematical distance with a clustering algorithm and then merges them. Peaks clustered together in this way have similar accessibility patterns across cells, so the resulting accessons are more biologically meaningful; for example, the peaks within an accesson may be regulated by the same transcription factor or be more closely related in the three-dimensional chromatin structure. The transformed cell-by-accesson matrix therefore further amplifies cell heterogeneity.

The invention also provides a single-cell chromatin accessibility sequencing data analysis device based on peak clustering, which comprises:

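a processor;

a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method.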

The invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.

It should be noted that each functional module/unit in the present invention may be hardware, for example a circuit, including a digital circuit, an analog circuit, and so on. Physical implementations of the hardware structures include, but are not limited to, physical devices such as transistors and memristors. The data processing module may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP or ASIC. The storage unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM or HMC.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.

The foregoing description of the embodiments has been provided to illustrate the general principles of the invention and is not meant to limit the invention thereto; any modifications, equivalents, and improvements may be made without departing from the spirit and principles of the invention.

Claims

1. A method for analyzing single cell chromatin accessibility sequencing data based on peak clustering, comprising:

calculating mathematical distances between peaks in a cell-by-peak read count matrix, clustering the peaks, and merging the cell-by-peak read count matrix into a cell-by-accesson read count matrix, wherein an accesson is a cluster of peaks;

the method further comprises constructing a cell development path pseudotime trajectory using the cell-by-accesson read count matrix.

2. The analysis method of claim 1, wherein the method further comprises reducing the dimension of the cell-by-accesson read count matrix to a two-dimensional visualization matrix.

3. The analysis method of claim 2, wherein the dimension reduction method comprises PCA, t-SNE or UMAP.

4. The analysis method of claim 1, wherein the method further comprises clustering cells according to the cell-by-accesson read count matrix.

5. The analysis method of claim 1, wherein the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.

6. The analysis method of claim 1, wherein the algorithm used in constructing the cell development path pseudotime trajectory comprises SPRING or Monocle.

7. The analysis method of any one of claims 1-6, wherein the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance; the peak clustering method comprises KNN, DBSCAN or K-Means.

8. The analysis method of claim 1, wherein merging the cell-by-peak read count matrix into the cell-by-accesson read count matrix comprises taking the sum, the average, the median, or the variance of the peak read counts within each accesson.

9. A single-cell chromatin accessibility sequencing data analysis system based on peak clustering, comprising a preprocessing module and an accesson construction module;

the accesson construction module comprises i) a peak distance calculation unit for calculating mathematical distances between peaks in a cell-by-peak read count matrix; ii) a peak clustering unit for clustering peaks according to the mathematical distances between peaks; iii) a matrix conversion unit for merging the cell-by-peak read count matrix into a cell-by-accesson read count matrix, wherein an accesson is a cluster of peaks;

the system further includes a cell development path remodeling module for constructing a cell development path pseudotime trajectory using the cell-by-accesson read count matrix.

10. The analysis system of claim 9, wherein the system further comprises a visualization module for reducing the dimension of the cell-by-accesson read count matrix to a two-dimensional visualization matrix.

11. The analysis system of claim 10, wherein the dimension reduction method comprises PCA, t-SNE or UMAP.

12. The analysis system of claim 9, wherein the system further comprises a cell clustering module for clustering cells according to the cell-by-accesson read count matrix.

13. The analysis system of claim 12, wherein the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.

14. The analysis system of claim 9, wherein the algorithm used in constructing the cell development path pseudotime trajectory comprises SPRING or Monocle.

15. The analysis system of claim 9 or 10, wherein the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance; the peak clustering method comprises KNN, DBSCAN or K-Means.

16. The analysis system of claim 9, wherein merging the cell-by-peak read count matrix into the cell-by-accesson read count matrix comprises taking the sum, the average, the median, or the variance of the peak read counts within each accesson.

17. A single cell chromatin accessibility sequencing data analysis device based on peak clustering, comprising:

a processor;

a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method of any of claims 1-8.

18. A computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method of any one of claims 1-8.