CN111755071A - Single cell chromatin accessibility sequencing data analysis method and system based on peak clustering

Info

Publication number: CN111755071A
Application number: CN201910256667.0A
Authority: CN
Inventors: 瞿昆; 方靖文; 黎斌; 李杨
Original assignee: University of Science and Technology of China USTC
Current assignee: Qu Kun; University of Science and Technology of China USTC
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2020-10-09
Anticipated expiration: 2039-03-29
Also published as: CN111755071B

Abstract

A single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering, the method comprising: aligning the single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result, finding peaks on the basis of the alignment result, and calculating the reading in each peak to obtain a cell-peak reading matrix; calculating the mathematical distance between peaks in the cell-peak reading matrix, clustering the peaks, and merging the cell-peak reading matrix into a cell-accesson reading matrix, wherein an accesson is a cluster of peaks. The invention provides the first method and system for analyzing data from fastq files through clustering, visualization, and development path remodeling, and markedly improves the clustering effect.

Description

Single cell chromatin accessibility sequencing data analysis method and system based on peak clustering

Technical Field

The invention belongs to the technical field of biological sequencing data analysis, and particularly relates to a single cell chromatin accessibility sequencing data analysis method and system based on peak clustering.

Background

Since its invention in 2012, ATAC-seq has been widely adopted in biological research owing to its simplicity, low cost, and small cell-number requirement, and it has contributed to breakthroughs in research on embryonic development, stem cell differentiation, cancer mechanisms and typing, and other areas. For example, a 2017 study published in Cancer Cell (IF = 24) found that ATAC-seq can be used to explain the pathogenesis of T-cell lymphoma and its precise medication typing, and ATAC-seq data were entered into the TCGA database in 2018. To further investigate cellular heterogeneity, scATAC-seq sequencing technology was proposed in 2015, and a number of different technical implementations have appeared over several years of development, bringing with them the task of analyzing and interpreting scATAC-seq sequencing results.

The primary purpose of scATAC-seq data analysis is to recover, from the sequencing results, the major cell populations or the developmental differentiation pathways in a mixed biological sample. However, scATAC-seq is still a frontier technique, and the signal-to-noise ratio of its data is relatively low. scATAC-seq data analysis therefore requires a set of easy-to-use analysis methods that recover cell heterogeneity information as fully as possible. On the one hand, existing scATAC-seq data analysis methods offer no complete, easy-to-use analysis process that runs from fastq files through clustering, visualization, and development path reconstruction. On the other hand, when evaluated by ARI on gold-standard test datasets, i.e., test datasets in which the subpopulation of each cell, or its position in the developmental differentiation pathway, is known, existing methods still recover this information poorly and need to be improved. As a result, the industry currently has no unified analysis method for scATAC-seq data.

The prior art has the following three analysis methods: ChromVAR, LSI and Cicero.

In the ChromVAR method, the input data are a cell-peak reading matrix and the sequence information of each peak. From these, a cell-by-transcription-factor preference score matrix is constructed, and this matrix is used for information restoration.

In the LSI method, the input data is a cell-peak reading matrix, which is transformed by the TF-IDF algorithm (TF means term frequency; IDF means inverse document frequency); information restoration is then performed with the new matrix.

In the Cicero method, the input data are a cell-peak reading matrix and the positions of the peaks on the chromosome. A matrix built from these inputs is then used for downstream information restoration.
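As an illustration of the TF-IDF transformation mentioned for the LSI method, the following is a minimal sketch rather than the implementation used by any particular tool. It assumes one common weighting, in which the term frequency is the reading divided by the cell's total reading and the inverse document frequency is log(1 + N / df), where N is the number of cells and df is the number of cells in which the peak is observed; the function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(counts):
    """Apply a TF-IDF weighting to a cell-peak reading matrix (list of rows).

    Assumed variant: TF = reading / total reading of the cell,
    IDF = log(1 + n_cells / number of cells in which the peak is nonzero).
    """
    n_cells = len(counts)
    n_peaks = len(counts[0])
    # Document frequency: in how many cells each peak has a nonzero reading.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_peaks)]
    # Peaks that are never observed receive a weight of zero.
    idf = [math.log(1 + n_cells / d) if d else 0.0 for d in df]
    weighted = []
    for row in counts:
        total = sum(row)
        weighted.append(
            [(x / total if total else 0.0) * idf[j] for j, x in enumerate(row)]
        )
    return weighted


# Two cells, two peaks: peak 0 is open in both cells, peak 1 only in the second.
example = tf_idf([[1, 0], [1, 1]])
```

Peaks that are open in nearly every cell receive a small weight, so the transformed matrix emphasizes peaks that distinguish cells from one another.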

Disclosure of Invention

In view of the above, the present invention provides a complete, easy-to-use, and efficient method and system for analyzing scATAC-seq data of biological samples, capable of recovering cell heterogeneity information.

In order to achieve the above object, in one aspect, the present invention provides a single-cell chromatin accessibility sequencing data analysis method based on peak clustering, comprising:

aligning the single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result, finding peaks on the basis of the alignment result, and calculating the reading in each peak to obtain a reading matrix of cells and peaks;

calculating the mathematical distance between peaks in the reading matrix of the cell peaks, clustering the peaks, and combining the reading matrix of the cell peaks into the reading matrix of the cell accesson, wherein the accesson is the clustered peaks.

In some embodiments, the method further comprises reducing the cell-accesson reading matrix to a two-dimensional visualization matrix; preferably, the dimension reduction method comprises PCA, t-SNE, or UMAP.

In some embodiments, the method further comprises clustering cells according to the cell-accesson reading matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.

In some embodiments, the method further comprises constructing a pseudo-temporal profile of the cell development pathway using the cell-accesson reading matrix; preferably, the algorithm used in constructing the pseudo-temporal profile comprises SPRING or monocle.

In another aspect, the invention provides a single-cell chromatin accessibility sequencing data analysis system based on peak clustering, which comprises a preprocessing module and an accesson construction module;

the preprocessing module comprises a) an alignment unit for aligning the single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result; b) a peak finding unit for merging the alignment results of all single cells and then finding peaks; c) a reading calculation unit for calculating the reading in each peak to obtain a cell-peak reading matrix;

the accesson construction module comprises a) a peak distance calculation unit for calculating the mathematical distance between peaks in the cell-peak reading matrix; b) a peak clustering unit for clustering peaks according to the mathematical distance between the peaks; c) a matrix conversion unit for merging the cell-peak reading matrix into a cell-accesson reading matrix, wherein an accesson is a cluster of peaks.

In some embodiments, the system further comprises a visualization module for reducing the cell-accesson reading matrix to a two-dimensional visualization matrix; preferably, the dimension reduction method comprises PCA, t-SNE, or UMAP.

In some embodiments, the system further comprises a cell clustering module for clustering cells according to the cell-accesson reading matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.

In some embodiments, the system further comprises a cell development pathway remodeling module for constructing a pseudo-temporal profile of the cell development pathway using the cell-accesson reading matrix; preferably, the algorithm used in constructing the pseudo-temporal profile comprises SPRING or monocle.

In some embodiments, the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance.

In some embodiments, the method of peak clustering comprises KNN, DBSCAN, or K-Means.

In some embodiments, the method of merging the cell-peak reading matrix into the cell-accesson reading matrix comprises taking the sum, the mean, the median, or the variance of the peak readings within each accesson.

In another aspect, the present invention further provides a single-cell chromatin accessibility sequencing data analysis apparatus based on peak clustering, including:

a processor;

a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method.

In yet another aspect, the present invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.

Compared with the prior art, the invention has the following beneficial effects:

the invention provides the first method and system for analyzing scATAC-seq data that runs from fastq files through clustering, visualization, and development path remodeling;

the invention provides an accesson construction method based on peak clustering, which serves as the key module for scATAC-seq data analysis. The transformed cell-accesson matrix is used for subsequent clustering, visualization, and cell development pathway remodeling. On gold-standard test datasets, the clustering effect (measured by ARI) is statistically significantly higher than that of the existing methods.

Drawings

FIG. 1 is a schematic diagram of the construction and downstream analysis of accesson based on peak clustering in the embodiment of the present invention;

FIG. 2 shows the relationship between the number of accessons and the ARI (gold-standard test dataset 1) in the embodiment of the present invention;

FIG. 3 shows scATAC-seq data of human leukemia cells and related lineage cells in the embodiment of the present invention: A. data clustering (hierarchical clustering) and B.

FIG. 4 shows scATAC-seq data of the human hematopoietic stem cell developmental differentiation lineage in the embodiment of the present invention: data development pathway remodeling (monocle);

FIG. 5 shows scATAC-seq data of mouse forebrain neural cells in the embodiment of the present invention: data clustering (KNN) and visualization (tSNE);

FIGS. 6A-6D show scATAC-seq data of mouse thymic T cells in the embodiment of the present invention: data clustering (Louvain, hierarchical clustering), visualization (tSNE), and developmental pathway remodeling (monocle);

FIG. 7 shows a comparison of clustering effect and running time between the embodiment of the present invention and the existing methods (gold-standard test dataset 1);

FIG. 8 shows a comparison of clustering effect and running time between the embodiment of the present invention and the existing methods (gold-standard test dataset 2).

Detailed Description

In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.

For ease of understanding, the technical terms used herein are explained together below to avoid repetition.

Cell: the basic unit of life activity in mammals (such as human and mouse) and also where the pathogenesis of various diseases takes place; examples include nerve cells, epithelial cells, and tumor cells.

Cellular heterogeneity: biological tissue samples (e.g., tumor tissue, brain tissue) are composed of a large number of cells with different physiological functions. Cellular heterogeneity commonly takes two forms: 1) the constituent cells form a variety of well-defined cell populations (discrete); 2) the constituent cells lie along a continuous cellular differentiation pathway (continuum).

Genome: the complete DNA sequence of an organism, composed of an ordered arrangement of the four bases A, T, C, and G. The genomes of major mammals such as human and mouse have been completely sequenced.

Gene: a gene (genetic element) is the entire DNA sequence required to produce a polypeptide chain or functional RNA. A gene is typically one or more DNA segments of a genome.

Transcription factors: a protein that binds to DNA, initiates or regulates expression of a gene. Its binding to DNA is often through the recognition of specific DNA sequence patterns (Motif).

Chromatin: a linear complex in the cell nucleus consisting of DNA, histones, non-histone proteins, and a small amount of RNA. Its basic unit is the nucleosome, formed by DNA wound around histones.

Chromatin accessibility: a measure of whether a stretch of DNA is wound around histones. In general, chromatin is in one of two states: 1) DNA tightly wound around nucleosomes, termed closed DNA; 2) DNA not wound around nucleosomes and in a naked state, termed open DNA.

Chromatin accessibility sequencing (ATAC-seq): a sequencing technique developed at Stanford University in 2012 for detecting chromatin accessibility in biological samples (> 500 cells).

TCGA: The Cancer Genome Atlas project. It contains sequencing data of different omics types for cancer tissues and normal tissues, covering 33 different cancers and 11,000 patients.

Single cell chromatin accessibility sequencing (scATAC-seq): a group of sequencing methods used to detect chromatin accessibility in individual cells, including single-nucleus chromatin accessibility sequencing (snATAC-seq), single-cell combinatorial indexing chromatin accessibility sequencing (sciATAC-seq), and flow-cytometry-based single-cell chromatin accessibility sequencing (FACS scATAC-seq).

Short sequences (Sequence reads): the DNA fragments obtained in biological sequencing.

Alignment (Mapping): the short sequences are compared to known genomic information to find the location of each short sequence on the genome.

Peak finding (Peak Calling): locating the positions of open DNA by analyzing the alignment results; each such position is called a peak and is assigned a number.

Reading: i.e., the number of short sequences per sample, per peak.

Accesson: the peak clustering result proposed by the invention is referred to as an accesson, i.e., a cluster of peaks. For example, accesson 1 = {peak 2, peak 3, peak 5}; accesson 2 = {peak 1, peak 4}.

ARI (Adjusted Rand Index): a commonly used evaluation index for clustering algorithms, used to evaluate the consistency between the clustering result of an algorithm and the actual grouping.
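Because ARI is the evaluation index used throughout the comparisons, a short sketch of how it can be computed may be helpful. The code below follows the standard pair-counting definition of the Adjusted Rand Index; the function name `adjusted_rand_index` and the handling of degenerate inputs (returning 1.0 when both labelings are trivial) are choices made for this illustration and are not taken from the patent.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between a known grouping and a clustering result."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: nothing to disagree on
    # Contingency counts: cells sharing both a true label and a predicted label.
    joint = Counter(zip(labels_true, labels_pred))
    same_both = sum(comb(c, 2) for c in joint.values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / total_pairs
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (same_both - expected) / (maximum - expected)


# The cluster names differ, but the partition of the four cells is identical.
score = adjusted_rand_index(["a", "a", "b", "b"], [1, 1, 0, 0])
```

A value of 1 means the clustering reproduces the known grouping exactly, values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance.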

One embodiment of the present invention proposes a single-cell chromatin accessibility sequencing (scATAC-seq) data analysis system based on peak clustering (hereinafter abbreviated as APEC). The system comprises the following modules:

1) A preprocessing module: comprising a) an alignment unit for aligning a fastq file (i.e., the single-cell chromatin accessibility sequencing data) to a genome sequence to produce a bam file; b) a peak finding unit for merging the bam files of all single-cell alignment results into a merge_bam file and finding peaks on that basis; c) a reading calculation unit for counting the reads in each peak and finally outputting the cell-peak reading matrix.

2) An accesson construction module: comprising a) a peak distance calculation unit for calculating the mathematical distance between peaks from the cell-peak reading matrix (including but not limited to the Euclidean distance, the Pearson correlation coefficient, and the cityblock distance); b) a peak clustering unit for clustering peaks according to the mathematical distance between them, wherein each cluster of peaks is called an accesson and the clustering method includes but is not limited to KNN and DBSCAN; c) a matrix conversion unit for merging the cell-peak reading matrix into the cell-accesson matrix according to the accesson information, wherein the merging method includes but is not limited to taking the sum, the mean, the median, or the variance of the peak readings within each accesson.

3) A visualization module: reduces the dimension of the cell-accesson reading matrix to a two-dimensional visualization matrix, wherein the dimension reduction visualization method includes but is not limited to PCA, t-SNE, and UMAP.

4) A cell clustering module: clusters the cells using the cell-accesson reading matrix, wherein the clustering algorithm includes but is not limited to KNN clustering, kernel clustering, and Louvain clustering.

5) A cell developmental pathway remodeling module: constructs a pseudo-temporal profile of the cell development pathway using the cell-accesson reading matrix, with algorithms including but not limited to SPRING and monocle.

The following are use cases of APEC on 4 different gold-standard test datasets, illustrating the general applicability of APEC to scATAC-seq data analysis of different biological samples in the embodiment of the present invention. The datasets are: 1) human leukemia cell and related lineage cell scATAC-seq data; 2) human hematopoietic stem cell developmental differentiation lineage associated scATAC-seq data; 3) mouse forebrain nerve cell scATAC-seq data; 4) mouse thymic T cell scATAC-seq data.

The analysis process of the scATAC-seq analysis system (APEC) based on peak clustering according to the present invention comprises the following steps:

1) data input:

the input data is in fastq format and may take either of two forms: a) one fastq file per cell; b) a single fastq file in which the reads of all cells are mixed together, which can be split into per-cell data according to a splitting rule given by the data provider, such as index sequences (cells are distinguished by the first 5-10 bases of each read in the fastq file).
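The splitting of a mixed fastq file by index sequence can be sketched as follows. This is only an illustration: the actual rule is defined by the data provider, and here it is assumed that the cell index is simply the first `barcode_len` bases of each read sequence. The function name `split_fastq_by_index` and its parameters are hypothetical.

```python
from collections import defaultdict


def split_fastq_by_index(lines, barcode_len=8):
    """Group 4-line fastq records by the first `barcode_len` bases of each read.

    Assumption for this sketch: the cell index sits at the start of the read.
    Returns {index sequence: [(header, sequence, separator, quality), ...]}.
    """
    per_cell = defaultdict(list)
    usable = len(lines) - len(lines) % 4  # ignore a trailing incomplete record
    for i in range(0, usable, 4):
        header, seq, separator, qual = lines[i:i + 4]
        index = seq[:barcode_len]
        # Trim the index from sequence and quality so only genomic bases remain.
        per_cell[index].append(
            (header, seq[barcode_len:], separator, qual[barcode_len:])
        )
    return dict(per_cell)


mixed = [
    "@read1", "AAAACCGT", "+", "IIIIIIII",
    "@read2", "TTTTGGCA", "+", "IIIIIIII",
    "@read3", "AAAATTTT", "+", "IIIIIIII",
]
cells = split_fastq_by_index(mixed, barcode_len=4)
```

In this toy input, `read1` and `read3` share the index `AAAA` and are assigned to the same cell, while `read2` forms a second cell.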

2) Data preprocessing:

the input data can be aligned to different biological sample genomes through the alignment unit; for example, data sets 1 and 2 are aligned to the human genome, and data sets 3 and 4 are aligned to the mouse genome. Alternatively, a biological sample genome specified by the data provider may be used. The alignment produces a bam file that indicates where the reads in each fastq file align on the genome. The bam file is processed by the peak finding unit to define the open chromatin sites in the biological sample, and, together with the reading calculation unit, a reading matrix (m × n) of each peak (n) in each cell (m) is obtained.
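The reading calculation step can be illustrated with a small sketch that counts, for each cell, how many aligned reads fall inside each peak. It makes several simplifying assumptions that are not stated in the patent: each read is reduced to a single (chromosome, position) pair, peaks are non-overlapping half-open intervals [start, end), and the names `build_reading_matrix`, `peaks`, and `reads_by_cell` are hypothetical. In practice the positions would come from the bam file.

```python
from bisect import bisect_right
from collections import defaultdict


def build_reading_matrix(peaks, reads_by_cell):
    """Count reads per peak per cell; returns (cell names, m x n matrix).

    peaks: list of (chrom, start, end), non-overlapping, half-open intervals.
    reads_by_cell: {cell name: [(chrom, position), ...]}.
    """
    # Index the peaks of each chromosome by start coordinate for binary search.
    by_chrom = defaultdict(list)
    for idx, (chrom, start, end) in enumerate(peaks):
        by_chrom[chrom].append((start, end, idx))
    for chrom in by_chrom:
        by_chrom[chrom].sort()
    starts = {c: [p[0] for p in plist] for c, plist in by_chrom.items()}

    cell_names = sorted(reads_by_cell)
    matrix = []
    for cell in cell_names:
        row = [0] * len(peaks)
        for chrom, pos in reads_by_cell[cell]:
            if chrom not in by_chrom:
                continue  # read on a chromosome without any peak
            k = bisect_right(starts[chrom], pos) - 1
            if k >= 0:
                start, end, idx = by_chrom[chrom][k]
                if start <= pos < end:
                    row[idx] += 1
        matrix.append(row)
    return cell_names, matrix


peaks = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 150)]
reads = {
    "cellA": [("chr1", 150), ("chr1", 199), ("chr1", 200), ("chr2", 60)],
    "cellB": [("chr1", 350), ("chr3", 10)],
}
names, reading_matrix = build_reading_matrix(peaks, reads)
```

Each row of the result corresponds to one cell and each column to one peak, which matches the m × n layout described above.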

3) The accesson constructs:

fig. 1 is a schematic diagram of the construction and downstream analysis of accesson based on peak clustering in the embodiment of the present invention. In the accesson construction, an m × n matrix of readings is first passed into the accesson construction module.

In the peak distance calculation unit, the relative distance between peaks may be calculated using the Euclidean distance (as for data sets 1, 2, 3, 4); other commonly used vector distance calculation methods, such as the Pearson correlation coefficient or the cityblock distance, may also be used.
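The three distance options named above can be sketched as follows, treating each peak as a vector of its readings across all cells (one column of the m × n matrix). Converting the Pearson correlation coefficient r into a distance as 1 - r, and assigning a distance of 1 to constant vectors for which r is undefined, are assumptions of this sketch; the patent does not specify the conversion. All function names are hypothetical.

```python
import math


def peak_vectors(matrix):
    """Transpose an m x n cell-peak matrix into n peak vectors of length m."""
    return [list(column) for column in zip(*matrix)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cityblock(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def pearson_distance(u, v):
    """1 - Pearson r (assumed conversion of the correlation into a distance)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # r is undefined for a constant vector; treat as uncorrelated
    return 1.0 - cov / (norm_u * norm_v)


def pairwise_distances(vectors, metric):
    """Symmetric n x n matrix of distances between all pairs of peak vectors."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = metric(vectors[i], vectors[j])
    return dist


# Three cells (rows) by three peaks (columns).
vectors = peak_vectors([[1, 2, 0], [3, 6, 0], [5, 10, 1]])
distances = pairwise_distances(vectors, euclidean)
```

Note that the Euclidean and cityblock distances are sensitive to the overall magnitude of the readings, whereas the correlation-based distance compares only the pattern of accessibility across cells.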

In the peak clustering unit, peaks can be clustered into a specified number of accessons using the KNN algorithm (as for data sets 1, 2, 3, 4). Other common vector clustering algorithms, such as DBSCAN or K-Means, may also be used. The specified number of accessons does not affect the result over a wide range (FIG. 2); the default is 2000, which can be adjusted according to the specific data.
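To make the idea of clustering peaks into a specified number of accessons concrete, the sketch below uses K-Means, one of the alternative algorithms named above, rather than the KNN-based procedure of the embodiment, whose details the text does not spell out. It is a plain implementation of Lloyd's algorithm on Euclidean distance with a deterministic farthest-point initialization; the function names and the initialization strategy are assumptions of this illustration.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans_peaks(vectors, k, n_iter=50):
    """Cluster peak vectors into k groups; returns one accesson label per peak."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of peaks")
    # Deterministic initialization: start from the first peak, then repeatedly
    # add the peak that is farthest from all centers chosen so far.
    centers = [list(vectors[0])]
    while len(centers) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centers),
        )
        centers.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each peak joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its member peaks.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Five toy peak vectors that form two obvious groups.
toy_peaks = [[0, 0], [0, 1], [10, 10], [10, 11], [0, 0.5]]
accesson_labels = kmeans_peaks(toy_peaks, k=2)
```

With real data, `k` would play the role of the specified number of accessons (2000 by default in the embodiment), and each resulting label identifies the accesson to which a peak belongs.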

In the matrix conversion unit, the accessons are first screened according to their basic properties; for example, accessons whose number of peaks is smaller than a specified value are removed, or accessons whose internal damping coefficient is smaller than a specified value are removed. The cell-peak reading matrix is then merged into a cell-accesson matrix according to the accesson information by summing the peak readings within each accesson (as for data sets 1, 2, 3, 4). Other simple vector property calculations, such as the mean, median, or variance of the readings, may also be used.
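The matrix conversion can be sketched as below, using the accesson example from the glossary (accesson 1 contains peaks 2, 3, and 5; accesson 2 contains peaks 1 and 4). The function name `build_accesson_matrix`, the `min_peaks` screening parameter, and the use of the population variance for the variance option are assumptions of this sketch; screening by the coefficient mentioned above is not shown.

```python
from statistics import mean, median, pvariance

# Merging options named in the text; population variance is an assumption here.
AGGREGATORS = {"sum": sum, "mean": mean, "median": median, "variance": pvariance}


def build_accesson_matrix(matrix, peak_labels, how="sum", min_peaks=1):
    """Merge a cell-peak matrix into a cell-accesson matrix.

    peak_labels[j] is the accesson of peak j. Accessons with fewer than
    `min_peaks` peaks are screened out. Returns (kept accessons, matrix).
    """
    members = {}
    for peak_index, label in enumerate(peak_labels):
        members.setdefault(label, []).append(peak_index)
    kept = sorted(a for a, idx in members.items() if len(idx) >= min_peaks)
    aggregate = AGGREGATORS[how]
    merged = [
        [aggregate([row[j] for j in members[a]]) for a in kept]
        for row in matrix
    ]
    return kept, merged


# Peaks 1..5 belong to accessons 2, 1, 1, 2, 1 respectively.
labels = [2, 1, 1, 2, 1]
cell_peak = [[1, 0, 2, 3, 4]]  # one cell, five peaks
accessons, cell_accesson = build_accesson_matrix(cell_peak, labels, how="sum")
```

For the single cell in this example, the readings of peaks 2, 3, and 5 are summed into accesson 1 and those of peaks 1 and 4 into accesson 2, so five sparse columns become two denser ones.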

4) Data clustering and visualization

In this step, the visualization module may be used to reduce the cell-accesson reading matrix to a two-dimensional visualization matrix, and/or the cell clustering module may be used to cluster the cells, and/or the cell development pathway remodeling module may be used to construct the pseudo-temporal profile of the cell development pathway.

FIG. 3 shows scATAC-seq data of human leukemia cells and related lineage cells: A. data clustering (hierarchical clustering) and B.

FIG. 4 shows scATAC-seq data of the human hematopoietic stem cell developmental differentiation lineage: data development pathway remodeling (monocle);

FIG. 5 shows scATAC-seq data of mouse forebrain nerve cells: data clustering (KNN) and visualization (tSNE);

FIGS. 6A-6D show scATAC-seq data of mouse thymic T cells, in which FIG. 6A is Louvain clustering, FIG. 6B is hierarchical clustering, FIG. 6C is visualization (tSNE), and FIG. 6D is developmental pathway remodeling (monocle).

Therefore, the method covers the whole analysis from fastq files through clustering, visualization, and development path remodeling. Moreover, on the gold-standard test datasets its clustering effect (ARI) is statistically significantly higher than that of the existing methods, as shown in FIGS. 7 and 8. The method recovers cell heterogeneity information efficiently because the accesson construction it provides acts as a filtering process that reduces noise and amplifies signal, as detailed below: 1) compared with LSI and ChromVAR, the method converts the originally sparse cell-peak matrix into a more compact cell-accesson matrix, which reduces noise in subsequent analysis; 2) compared with the Cicero method, which combines peaks based on chromatin position, the invention merges peaks that are clustered together by mathematical distance and a clustering algorithm. Peaks clustered together in this way have similar accessibility patterns across cells, so the resulting accessons are more biologically meaningful; for example, the peaks inside an accesson may be regulated by the same transcription factor or may lie closer together in the three-dimensional structure of chromatin. The transformed cell-accesson matrix thus further amplifies the cell heterogeneity signal.

The invention also provides a single cell chromatin accessibility sequencing data analysis device based on peak clustering, which comprises:

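a processor;

a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method.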

The invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.

It should be noted that each functional module/unit in the present invention may be hardware, for example, the hardware may be a circuit, including a digital circuit, an analog circuit, and the like. Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like. The data processing module may be any suitable hardware processor such as a CPU, GPU, FPGA, DSP, ASIC, etc. The memory unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, etc.

It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A single cell chromatin accessibility sequencing data analysis method based on peak clustering, comprising:

2. The analysis method according to claim 1, wherein the method further comprises reducing the cell-accesson reading matrix to a two-dimensional visualization matrix; preferably, the dimension reduction method comprises PCA, t-SNE, or UMAP.

3. The analysis method according to claim 1, wherein the method further comprises clustering cells according to the cell-accesson reading matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.

4. The analysis method according to claim 1, wherein the method further comprises using the cell-accesson reading matrix to construct a pseudo-temporal profile of the cell development pathway; preferably, the algorithm used to construct the pseudo-temporal profile comprises SPRING or monocle.

5. The analysis method of any of claims 1-4, wherein the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance;

preferably, the method of peak clustering comprises KNN, DBSCAN, or K-Means;

preferably, the method of merging the cell-peak reading matrix into the cell-accesson reading matrix comprises taking the sum, the mean, the median, or the variance of the peak readings within each accesson.

6. A single cell chromatin accessibility sequencing data analysis system based on peak clustering comprises a preprocessing module and an accesson construction module;

7. The analysis system according to claim 6, wherein the system further comprises a visualization module for reducing the cell-accesson reading matrix to a two-dimensional visualization matrix; preferably, the dimension reduction method comprises PCA, t-SNE, or UMAP;

preferably, the system further comprises a cell clustering module for clustering cells according to the cell-accesson reading matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering;

preferably, the system further comprises a cell development pathway remodeling module for constructing a pseudo-temporal profile of the cell development pathway using the cell-accesson reading matrix; preferably, the algorithm used in constructing the pseudo-temporal profile comprises SPRING or monocle.

8. The analysis system of claim 6 or 7, wherein the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance;

preferably, the method of peak clustering comprises KNN, DBSCAN, or K-Means;

9. A single cell chromatin accessibility sequencing data analysis apparatus based on peak clustering, comprising:

a processor;

a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method of any of claims 1-5.

10. A computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method of any one of claims 1-5.