CN111755071B - Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering - Google Patents
Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering
- Publication number
- CN111755071B
- Authority
- CN
- China
- Prior art keywords
- cell
- peak
- clustering
- matrix
- readings
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
Abstract
A method and system for analyzing single-cell chromatin accessibility sequencing data based on peak clustering, the method comprising: aligning the single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain alignment results, calling peaks on the basis of the alignment results, and calculating the reading in each peak to obtain a cell-by-peak reading matrix; calculating mathematical distances between peaks in the cell-by-peak reading matrix, clustering the peaks, and merging the cell-by-peak reading matrix into a cell-by-accesson reading matrix, wherein an accesson is a cluster of peaks. The invention provides a method and a system for analyzing scATAC-seq data from fastq files through clustering, visualization and development path remodeling, and the clustering performance is markedly improved.
Description
Technical Field
The invention belongs to the technical field of biological sequencing data analysis, and particularly relates to a single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering.
Background
ATAC-seq has been widely used in biological research since 2012 and, thanks to its simplicity, low cost, and small number of required cells, has contributed to breakthroughs in research on embryonic development, stem cell differentiation, cancer mechanisms and typing, and other areas. For example, a 2017 paper in Cancer Cell (IF = 24) found that ATAC-seq could explain the pathogenesis of T-cell lymphoma and support precise drug typing, and in 2018 ATAC-seq data entered the TCGA database. To further investigate cellular heterogeneity, scATAC-seq sequencing technology was proposed in 2015 and, over several years of development, has been implemented in a number of different protocols, which in turn created the need to analyze and interpret scATAC-seq sequencing data.
The primary purpose of scATAC-seq data analysis is to recover, from the sequencing results, the main cell populations or the developmental differentiation pathways in a mixed biological sample. However, scATAC-seq is still a relatively cutting-edge technique, and the signal-to-noise ratio of its data is low. scATAC-seq data analysis therefore requires an easy-to-use analysis method that recovers as much cellular heterogeneity information as possible. On the one hand, the currently published scATAC-seq data analysis methods do not offer a complete, easy-to-use analysis workflow from fastq files through clustering, visualization and development path reconstruction. On the other hand, when evaluated on gold-standard test datasets, i.e., test datasets in which the subpopulation of each cell or its position in the developmental differentiation pathway is known, the existing methods still recover information poorly (as assessed by ARI) and need improvement. As a result, there is currently no industry-standard method for scATAC-seq analysis.
In the prior art, the following three analysis methods exist: chromVAR, LSI and Cicero.
In the chromVAR method, the input data are a cell-by-peak reading matrix and the sequence information of each peak. From these, a matrix of transcription factor preference scores for each cell is constructed, and this matrix is used for information recovery.
In the LSI method, the input data is a cell-by-peak reading matrix. The method transforms the matrix with the TF-IDF algorithm (TF means term frequency; IDF means inverse document frequency) and then performs information recovery on the new matrix.
In the Cicero method, the input data are a cell-by-peak reading matrix and the positions of the peaks on the chromosomes. Downstream information recovery is then performed using this matrix.
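To make the TF-IDF step mentioned for the LSI method more concrete, the following is a minimal sketch in plain Python. It is not taken from the patent or from any LSI implementation: the smoothed IDF formula and the function name `tf_idf` are assumptions chosen for illustration, and real tools may use a different TF-IDF variant.

```python
import math

def tf_idf(counts):
    """Apply a TF-IDF weighting to a cell-by-peak count matrix.

    counts: list of rows (one per cell), each a list of per-peak read counts.
    The smoothed IDF used here is one common variant (an assumption).
    """
    n_cells = len(counts)
    n_peaks = len(counts[0])
    # Document frequency: number of cells in which each peak has at least one read.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_peaks)]
    idf = [math.log(1 + n_cells / (1 + df[j])) for j in range(n_peaks)]
    weighted = []
    for row in counts:
        total = sum(row)
        # Term frequency: share of the cell's reads that fall in each peak.
        tf = [c / total if total else 0.0 for c in row]
        weighted.append([tf[j] * idf[j] for j in range(n_peaks)])
    return weighted

# Toy matrix: 3 cells x 3 peaks (illustrative values only).
matrix = [
    [2, 0, 1],
    [1, 0, 0],
    [3, 1, 0],
]
weighted = tf_idf(matrix)
```

Peaks that are open in few cells receive a larger weight than peaks open in almost every cell, which is the intended effect of the IDF term.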
Disclosure of Invention
In view of the above, the present invention provides a complete, easy-to-use method and system for analyzing scATAC-seq data of biological samples that can efficiently recover cell heterogeneity information.
In order to achieve the above object, in one aspect, the present invention provides a method for analyzing single-cell chromatin accessibility sequencing data based on peak clustering, comprising:
aligning the single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain alignment results, calling peaks on the basis of the alignment results, and calculating the reading in each peak to obtain a cell-by-peak reading matrix;
calculating mathematical distances between peaks in the cell-by-peak reading matrix, clustering the peaks, and merging the cell-by-peak reading matrix into a cell-by-accesson reading matrix, wherein an accesson is a cluster of peaks.
In some embodiments, the method further comprises reducing the cell-by-accesson reading matrix to a two-dimensional visualization matrix; preferably, the dimensionality reduction method comprises PCA, t-SNE or UMAP.
In some embodiments, the method further comprises clustering cells according to the cell-by-accesson reading matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.
In some embodiments, the method further comprises constructing a pseudotime trajectory of the cell development path using the cell-by-accesson reading matrix; preferably, the algorithm used in constructing the pseudotime trajectory comprises SPRING or Monocle.
In another aspect, the invention provides a single-cell chromatin accessibility sequencing data analysis system based on peak clustering, which comprises a preprocessing module and an accesson construction module;
the preprocessing module comprises: a) an alignment unit for aligning single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain alignment results; b) a peak calling unit for merging the alignment results of all single cells and then calling peaks; c) a reading calculation unit for calculating the reading in each peak to obtain a cell-by-peak reading matrix;
the accesson construction module comprises: a) a peak distance calculation unit for calculating mathematical distances between peaks in the cell-by-peak reading matrix; b) a peak clustering unit for clustering peaks according to the mathematical distances between them; c) a matrix conversion unit for merging the cell-by-peak reading matrix into a cell-by-accesson reading matrix, wherein an accesson is a cluster of peaks.
In some embodiments, the system further comprises a visualization module for reducing the cell-by-accesson reading matrix to a two-dimensional visualization matrix; preferably, the dimensionality reduction method comprises PCA, t-SNE or UMAP.
In some embodiments, the system further comprises a cell clustering module for clustering cells according to the cell-by-accesson reading matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering or Louvain clustering.
In some embodiments, the system further comprises a cell development path remodeling module for constructing a pseudotime trajectory of the cell development path using the cell-by-accesson reading matrix; preferably, the algorithm used in constructing the pseudotime trajectory comprises SPRING or Monocle.
In some embodiments, the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance.
In some embodiments, the method of peak clustering comprises KNN, DBSCAN, or K-means.
In some embodiments, the method of merging the cell-by-peak reading matrix into the cell-by-accesson reading matrix comprises taking the sum, the average, the median, or the variance of the peak readings in each accesson.
In yet another aspect, the present invention also provides a single-cell chromatin accessibility sequencing data analysis device based on peak clustering, including:
a processor;
a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method.
In yet another aspect, the present invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.
Compared with the prior art, the invention has the following beneficial effects:
the present invention provides the first scATAC-seq data analysis method and system covering the full workflow from fastq files to clustering, visualization and developmental path remodeling;
the invention provides an accesson construction method based on peak clustering, which serves as the key module for scATAC-seq data analysis. The converted cell-by-accesson reading matrix is used for subsequent clustering, visualization and cell development path remodeling. In tests on gold-standard datasets, the clustering performance (measured by ARI) is statistically significantly better than that of the existing methods.
Drawings
FIG. 1 is a schematic diagram of an accesson construction and downstream analysis based on peak clustering in an embodiment of the present invention;
FIG. 2 is a graph showing the relationship between the number of accessons and the clustering performance measured by ARI (gold-standard test dataset 1);
FIG. 3 shows scATAC-seq data of human leukemia cells and related lineage cells in an embodiment of the present invention: A. data clustering (hierarchical clustering) and B. visualization (tSNE);
FIG. 4 shows scATAC-seq data of the human hematopoietic stem cell developmental differentiation lineage in an embodiment of the present invention: development path remodeling (monocle);
FIG. 5 shows scATAC-seq data of mouse forebrain nerve cells in an embodiment of the present invention: data clustering (KNN) and visualization (tSNE);
FIGS. 6A-6D show scATAC-seq data of mouse thymus T cells in an embodiment of the present invention: data clustering (Louvain, hierarchical clustering), visualization (tSNE), and developmental path remodeling (monocle);
FIG. 7 compares the clustering performance and running time of the present invention with those of the prior-art methods (gold-standard test dataset 1);
FIG. 8 compares the clustering performance and running time of the present invention with those of the prior-art methods (gold-standard test dataset 2).
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
For ease of understanding, the technical terms used herein are explained together here and are not explained again below.
Cell: the basic unit of life activity in mammals (e.g., humans, mice) and often the site where various diseases originate; examples include nerve cells, epithelial cells, and tumor cells.
Cell heterogeneity: biological tissue samples (e.g., tumor tissue, brain tissue) are composed of a large number of cells whose physiological functions differ. Cell heterogeneity commonly takes two forms: 1) the constituent cells consist of several well-defined cell populations (discrete); 2) the constituent cells lie along a continuous cell differentiation pathway (continuous).
Genome: the whole DNA sequence of an organism, consisting of the four bases A, T, C and G in ordered arrangement. The genomes of major mammals such as human and mouse have been completely sequenced.
Gene: genes (genetic factors) are all DNA sequences required to produce one polypeptide chain or functional RNA. A gene is typically one or more stretches of DNA on the genome.
Transcription factor: a protein that binds to DNA to initiate or regulate gene expression. Binding to DNA is usually achieved by recognizing a specific DNA sequence pattern (motif).
Chromatin: a linear composite structure in the nucleus consisting of DNA, histones, non-histone proteins and small amounts of RNA. Its basic unit is the nucleosome, formed by DNA wound around histones.
Chromatin accessibility: a measure of whether a stretch of DNA is wound around histones. In general, chromatin accessibility falls into two cases: 1) DNA that is tightly wound around nucleosomes, called closed DNA; 2) DNA that is not wound around nucleosomes and is therefore exposed, called open DNA.
Chromatin accessibility sequencing (ATAC-seq): a sequencing technology developed at Stanford University in 2012 for detecting chromatin accessibility of biological samples (> 500 cells).
TCGA: The Cancer Genome Atlas (TCGA) program. It comprises sequencing data of different omics types from cancer tissue and normal tissue, covering 33 different cancers and 11,000 patients.
Single-cell chromatin accessibility sequencing (scATAC-seq): several sequencing methods exist for detecting chromatin accessibility of individual cells, including single-nucleus chromatin accessibility sequencing (snATAC-seq), single-cell combinatorial indexing chromatin accessibility sequencing (sciATAC-seq), and flow-cytometry-based single-cell chromatin accessibility sequencing (FACS scATAC-seq).
Reads (sequencing reads): the short DNA fragment sequences obtained by sequencing.
Alignment (mapping): the reads are compared with known genome information to find the position of each read on the genome.
Peak calling: the open positions of the DNA are identified from the alignment results; each such position is called a peak and is given a number.
Reading: the number of reads per sample, per peak.
Accesson: the peak clustering result provided by the invention is called an accesson, i.e., a cluster of peaks. E.g. Accesson 1 = peak 2, peak 3, peak 5; Accesson 2 = peak 1, peak 4.
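As a small illustration of the accesson definition, the sketch below stores the example membership (Accesson 1 = peaks 2, 3, 5; Accesson 2 = peaks 1, 4) as a dictionary and converts it to and from a per-peak label list. The function names `peak_labels` and `groups_from_labels` are hypothetical and not part of the patent; the representation is only one convenient choice.

```python
from collections import defaultdict

# Accesson membership from the example in the text (1-based peak numbers).
accessons = {1: [2, 3, 5], 2: [1, 4]}

def peak_labels(accessons, n_peaks):
    """Return a list whose entry i is the accesson of peak i+1 (None if unassigned)."""
    labels = [None] * n_peaks
    for accesson_id, peaks in accessons.items():
        for peak in peaks:
            labels[peak - 1] = accesson_id
    return labels

def groups_from_labels(labels):
    """Inverse operation: rebuild the accesson -> peaks mapping from the labels."""
    groups = defaultdict(list)
    for index, accesson_id in enumerate(labels):
        if accesson_id is not None:
            groups[accesson_id].append(index + 1)
    return dict(groups)

labels = peak_labels(accessons, n_peaks=5)
```

The label list is the form most clustering routines return, while the dictionary form is convenient when merging peak readings per accesson.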
ARI (Adjusted Rand index) is an evaluation index commonly used in clustering algorithms for evaluating the consistency of the algorithm clustering results with the actual clustering results.
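For readers unfamiliar with ARI, the following is a minimal sketch of the standard adjusted Rand index computed from pair counts, written in plain Python. The function name and the toy labels are illustrative assumptions; the patent does not specify which implementation was used for its evaluation.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a known grouping and a predicted clustering."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the partitions trivially agree
    # Pairs of cells placed together in both partitions, in the truth, and in the prediction.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0
    return (together_both - expected) / (maximum - expected)

# Same grouping with different label names gives a perfect score.
perfect = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# A clustering that splits every true group scores below zero.
poor = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A value of 1 means the clustering matches the known grouping exactly, and values near 0 mean agreement no better than chance.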
One embodiment of the present invention provides a single-cell chromatin accessibility (scATAC-seq) sequencing data analysis system based on peak clustering (hereinafter abbreviated as APEC), which comprises the following modules:
1) Preprocessing module: comprises a) an alignment unit for aligning fastq files (i.e., single-cell chromatin accessibility sequencing data) to the genome sequence to produce bam files; b) a peak calling unit for merging the bam files of all single-cell alignment results into a merge_bam file and calling peaks on that basis; c) a reading calculation unit for counting the reads in each peak and finally outputting a cell-by-peak reading matrix.
2) Accesson construction module: comprises a) a peak distance calculation unit for calculating mathematical distances (including but not limited to Euclidean distance, Pearson correlation coefficient, and cityblock distance) between peaks from the cell-by-peak reading matrix; b) a peak clustering unit for clustering the peaks by the mathematical distances between them, the clustered peaks being called accessons, with clustering methods including but not limited to KNN and DBSCAN; c) a matrix conversion unit for merging the cell-by-peak reading matrix into a cell-by-accesson reading matrix according to the accesson information, with merging methods including but not limited to taking the sum, mean, median, or variance of the peak readings within each accesson.
3) Visualization module: reduces the cell-by-accesson reading matrix to a two-dimensional visualization matrix using dimensionality reduction methods including but not limited to PCA, t-SNE, and UMAP.
4) Cell clustering module: clusters cells using the cell-by-accesson reading matrix, with clustering algorithms including but not limited to KNN clustering, kernel clustering, and Louvain clustering.
5) Cell development pathway remodeling module: constructs the pseudotime trajectory of the cell development pathway from the cell-by-accesson reading matrix, using algorithms including but not limited to SPRING and Monocle.
The following describes the use of APEC on 4 different gold-standard test datasets in an embodiment of the invention, illustrating the versatility of APEC in analyzing scATAC-seq datasets from different biological samples. The datasets comprise: 1) scATAC-seq data of human leukemia cells and related lineage cells; 2) scATAC-seq data of the human hematopoietic stem cell developmental differentiation lineage; 3) scATAC-seq data of mouse forebrain nerve cells; 4) scATAC-seq data of mouse thymus T cells.
The analysis workflow using the peak-clustering-based scATAC-seq analysis system (APEC) of the present invention comprises the following steps:
1) Data input:
The input data are fastq files, in either of two formats: a) a separate fastq file for each cell; or b) a single fastq file in which all cells are mixed together but which can be split into per-cell data using a splitting rule given by the data provider, for example an index sequence (splitting on the distinct first 5-10 bases of each read in the fastq file).
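The splitting of a mixed fastq file by index sequence could look roughly like the sketch below. It assumes, purely for illustration, that the index occupies a fixed number of bases at the start of each read and that records are plain four-line FASTQ entries; the actual splitting rule is defined by the data provider, and the function name `split_fastq_by_index` is hypothetical.

```python
from collections import defaultdict

def split_fastq_by_index(lines, index_len=8):
    """Group 4-line FASTQ records by the first `index_len` bases of each read.

    Assumption: the cell index is a fixed-length prefix of the read sequence.
    The index is trimmed from both the sequence and the quality string.
    """
    per_cell = defaultdict(list)
    usable = len(lines) - len(lines) % 4  # ignore a trailing incomplete record
    for i in range(0, usable, 4):
        header, seq, plus, qual = lines[i:i + 4]
        barcode = seq[:index_len]
        per_cell[barcode].append((header, seq[index_len:], plus, qual[index_len:]))
    return dict(per_cell)

# Toy input: three reads from two cells, with a 5-base index (illustrative only).
fastq_lines = [
    "@read1", "AAAAAGGCTA", "+", "IIIIIIIIII",
    "@read2", "CCCCCTTGAC", "+", "IIIIIIIIII",
    "@read3", "AAAAACGTAC", "+", "IIIIIIIIII",
]
cells = split_fastq_by_index(fastq_lines, index_len=5)
```

Each key of `cells` then plays the role of one per-cell fastq file in input format a).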
2) Data preprocessing:
The alignment unit aligns the input data to the genome of the corresponding biological sample, for example datasets 1 and 2 to the human genome and datasets 3 and 4 to the mouse genome, or to a genome specified by the data provider. Alignment produces bam files that record where on the genome each read in each fastq file aligns. Processing the bam files with the peak calling unit defines the chromatin open sites in the biological sample, and the reading calculation unit then yields the reading matrix (m × n) of each peak (n) in each cell (m).
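The construction of the m × n reading matrix can be sketched as follows. This is a simplified illustration rather than the patent's implementation: alignments are represented as hypothetical `(cell, chromosome, position)` tuples instead of BAM records, peaks as half-open `(chromosome, start, end)` intervals, and a linear scan stands in for the interval index a real pipeline would use.

```python
def count_reads_in_peaks(alignments, peaks, cells):
    """Build a cell-by-peak reading matrix (m cells x n peaks).

    alignments: iterable of (cell, chrom, position) tuples (simplified stand-in
                for aligned reads; an assumption of this sketch).
    peaks:      list of (chrom, start, end) half-open intervals.
    cells:      list of cell identifiers fixing the row order.
    """
    row_of = {cell: i for i, cell in enumerate(cells)}
    matrix = [[0] * len(peaks) for _ in cells]
    for cell, chrom, pos in alignments:
        for j, (peak_chrom, start, end) in enumerate(peaks):
            if chrom == peak_chrom and start <= pos < end:
                matrix[row_of[cell]][j] += 1
    return matrix

peaks = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 50, 120)]
cells = ["cellA", "cellB"]
alignments = [
    ("cellA", "chr1", 150), ("cellA", "chr1", 199), ("cellA", "chr2", 60),
    ("cellB", "chr1", 510), ("cellB", "chr1", 200), ("cellB", "chr2", 119),
]
reading_matrix = count_reads_in_peaks(alignments, peaks, cells)
```

In this toy input the read at position 200 falls outside every peak, so it contributes to no entry of the matrix.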
3) accesson construction:
FIG. 1 is a schematic diagram of an accesson construction and downstream analysis based on peak clustering in an embodiment of the present invention. In the accesson construction, the m × n matrix of readings is first passed to the accesson construction module.
In the peak distance calculation unit, the distance between peaks may be calculated using the Euclidean distance (as for datasets 1, 2, 3 and 4); other common vector distance measures, such as the Pearson correlation coefficient or the cityblock distance, may also be used.
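The three distance measures named above can be sketched in plain Python as follows, treating each peak as the vector of its readings across cells (one column of the cell-by-peak matrix). Turning the Pearson correlation coefficient r into a distance as 1 - r is an assumption of this sketch, since the text does not state the conversion; the function names are illustrative.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cityblock(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def pearson_distance(u, v):
    """1 - Pearson r (assumed conversion); returns 1.0 if either vector is constant."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 1.0
    return 1.0 - cov / math.sqrt(var_u * var_v)

def pairwise(vectors, metric):
    """Full symmetric distance matrix between all peak vectors."""
    return [[metric(u, v) for v in vectors] for u in vectors]

# Toy cell-by-peak matrix (3 cells x 3 peaks); peaks are the columns.
cell_by_peak = [[1, 2, 0], [2, 4, 1], [3, 6, 0]]
peak_vectors = [list(column) for column in zip(*cell_by_peak)]
distances = pairwise(peak_vectors, euclidean)
```

Peaks 1 and 2 in the toy data are perfectly correlated, so their Pearson distance is 0 even though their Euclidean distance is not, which shows how the choice of measure changes which peaks look similar.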
In the peak clustering unit, the peaks can be clustered into a specified number of accessons using the KNN algorithm (as for datasets 1, 2, 3 and 4). Other common vector clustering algorithms, such as DBSCAN or K-means, may also be used. The specified number of accessons does not affect the result over a wide range (FIG. 2), so it defaults to 2000 and can be adjusted according to the specific data.
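As an illustration of grouping peaks into a specified number of accessons, the sketch below uses K-means, which the text lists as an alternative algorithm. It is not the KNN-based procedure applied to datasets 1 to 4, whose details are not given here. The deterministic farthest-point initialization and the toy value of 2 accessons (instead of the default 2000) are assumptions made to keep the example small and reproducible.

```python
import math

def kmeans(points, k, n_iter=50):
    """Plain K-means over peak vectors; returns one cluster label per peak."""
    # Farthest-point initialization (an assumption chosen for reproducibility).
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every peak to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Move each center to the mean of its member peaks.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Each row is one peak described by its readings in 4 cells (toy values).
peak_profiles = [
    [5, 4, 0, 0],  # peak 1
    [0, 0, 6, 5],  # peak 2
    [4, 5, 0, 1],  # peak 3
    [0, 1, 5, 6],  # peak 4
    [6, 5, 1, 0],  # peak 5
]
accesson_labels = kmeans(peak_profiles, k=2)
```

Peaks 1, 3 and 5 are open in the same cells and end up in one accesson, while peaks 2 and 4 form the other.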
In the matrix conversion unit, the accessons are first screened according to their basic properties; for example, accessons containing fewer peaks than a specified value, or accessons whose internal Gini coefficient is below a specified value, are removed. Then, according to the accesson information, the cell-by-peak reading matrix is merged into the cell-by-accesson reading matrix by taking the sum of the peak readings within each accesson (as for datasets 1, 2, 3 and 4). Other simple summary statistics, such as the mean, median, or variance of the readings, may also be used.
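The screening and merging performed by the matrix conversion unit can be sketched as below. Only the minimum-peak-count filter is shown; the threshold `min_peaks`, the function name `build_accesson_matrix`, and the use of population variance for the variance option are assumptions of this sketch rather than details stated in the text.

```python
import statistics

AGGREGATORS = {
    "sum": sum,
    "mean": statistics.mean,
    "median": statistics.median,
    "variance": statistics.pvariance,  # population variance (assumed)
}

def build_accesson_matrix(cell_by_peak, labels, min_peaks=2, how="sum"):
    """Merge a cell-by-peak matrix into a cell-by-accesson matrix.

    labels[j] is the accesson of peak j. Accessons with fewer than
    `min_peaks` peaks are removed before merging.
    """
    members = {}
    for j, label in enumerate(labels):
        members.setdefault(label, []).append(j)
    kept = sorted(label for label, cols in members.items() if len(cols) >= min_peaks)
    aggregate = AGGREGATORS[how]
    matrix = [
        [aggregate([row[j] for j in members[label]]) for label in kept]
        for row in cell_by_peak
    ]
    return kept, matrix

cell_by_peak = [
    [1, 0, 2, 0, 3, 7],
    [0, 4, 0, 5, 1, 0],
]
labels = [0, 1, 0, 1, 0, 2]  # accesson 2 holds a single peak and is filtered out
kept, cell_by_accesson = build_accesson_matrix(cell_by_peak, labels)
```

Passing `how="median"` or `how="variance"` switches to the other merging options mentioned in the text without changing the filtering step.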
4) Data clustering and visualization
In this step, the visualization module may be used to reduce the cell-by-accesson reading matrix to a two-dimensional visualization matrix, and/or the cell clustering module may be used to cluster the cells, and/or the cell development path remodeling module may be used to construct the pseudotime trajectory of the cell development path.
FIG. 3 shows scATAC-seq data of human leukemia cells and related lineage cells: A. data clustering (hierarchical clustering) and B. visualization (tSNE);
FIG. 4 shows scATAC-seq data of the human hematopoietic stem cell developmental differentiation lineage: development path remodeling (monocle);
FIG. 5 shows scATAC-seq data of mouse forebrain nerve cells: data clustering (KNN) and visualization (tSNE);
FIGS. 6A-6D show scATAC-seq data of mouse thymus T cells, wherein FIG. 6A is Louvain clustering, FIG. 6B is hierarchical clustering, FIG. 6C is visualization (tSNE), and FIG. 6D is developmental path remodeling (monocle).
It can be seen that the present invention covers the full workflow from fastq files to clustering, visualization and developmental pathway remodeling. On the gold-standard test datasets, its clustering performance (ARI) is statistically significantly better than that of the existing methods, as shown in FIGS. 7 and 8. Cell heterogeneity information can be recovered efficiently because accesson construction acts as a filtering process that reduces noise and amplifies signal, specifically: 1) compared with LSI and chromVAR, the invention converts the originally sparse cell-by-peak matrix into a denser cell-by-accesson matrix, thereby reducing noise in subsequent analysis; 2) compared with the Cicero method, which merges peaks based on chromatin position, the invention clusters and merges peaks using a mathematical distance and a clustering algorithm. Peaks clustered together in this way have similar signal patterns across cells, so the resulting accessons are more biologically meaningful; for example, the peaks within an accesson may be regulated by the same transcription factor or lie closer together in the three-dimensional chromatin structure. The converted cell-by-accesson matrix therefore further amplifies cell heterogeneity.
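The densifying effect described in point 1) can be checked on a toy example: summing sparse peak readings within accessons lowers the fraction of zero entries. The matrix values, the accesson membership, and the helper name `zero_fraction` are made up for illustration and do not come from the patent's datasets.

```python
def zero_fraction(matrix):
    """Fraction of entries in a matrix that are exactly zero."""
    values = [value for row in matrix for value in row]
    return sum(1 for value in values if value == 0) / len(values)

# Sparse toy cell-by-peak matrix: 3 cells x 6 peaks.
cell_by_peak = [
    [1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
]
# Hypothetical accessons given as 0-based peak column indices.
accessons = [[0, 2, 4], [1, 3, 5]]

# Merge by summing the readings of the peaks inside each accesson.
cell_by_accesson = [
    [sum(row[j] for j in columns) for columns in accessons]
    for row in cell_by_peak
]

sparsity_before = zero_fraction(cell_by_peak)
sparsity_after = zero_fraction(cell_by_accesson)
```

Here the share of zeros falls from 13 of 18 entries to 3 of 6, while the difference between the first two cells and the third cell remains visible in the merged matrix.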
The invention also provides a single-cell chromatin accessibility sequencing data analysis device based on peak clustering, which comprises:
a processor;
a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method.
The invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.
It should be noted that each functional module/unit in the present invention may be hardware, for example, the hardware may be a circuit, including a digital circuit, an analog circuit, and so on. Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like. The data processing module may be any suitable hardware processor such as CPU, GPU, FPGA, DSP and ASIC, etc. The storage unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, etc.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not meant to limit the invention; any modifications, equivalent substitutions, and improvements made within the spirit and principles of the invention fall within its scope.
Claims (18)
1. A method for analyzing single-cell chromatin accessibility sequencing data based on peak clustering, comprising:
comparing the single-cell chromatin accessibility sequencing data with the corresponding biological sample genome data to obtain a comparison result, searching for peaks on the basis of the comparison result, and calculating the readings in each peak to obtain a cell × peak reading matrix;
calculating mathematical distances between peaks in the cell × peak reading matrix, clustering the peaks, and merging the cell × peak reading matrix into a cell × accesson reading matrix, wherein an accesson is a cluster of peaks;
the method further comprises constructing a cell development pathway pseudotime profile using the cell × accesson reading matrix.
2. The analysis method of claim 1, wherein the method further comprises reducing the dimensionality of the cell × accesson reading matrix to obtain a two-dimensional visualization matrix.
3. The analysis method of claim 2, wherein the dimensionality reduction method comprises PCA, t-SNE or UMAP.
4. The analysis method of claim 1, wherein the method further comprises clustering cells according to the cell × accesson reading matrix.
5. The analysis method of claim 1, wherein the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.
6. The analysis method of claim 1, wherein the algorithm used in constructing the cell development pathway pseudotime profile comprises SPRING or Monocle.
7. The analysis method of any one of claims 1-6, wherein the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a city-block distance; the peak clustering method comprises KNN, DBSCAN or K-means.
8. The analysis method of claim 1, wherein merging the cell × peak reading matrix into the cell × accesson reading matrix comprises taking the sum, the average, the median, or the variance of the peak readings in each accesson.
9. A single-cell chromatin accessibility sequencing data analysis system based on peak clustering, comprising a preprocessing module and an accesson construction module;
the preprocessing module comprises: a) a comparison unit for comparing single-cell chromatin accessibility sequencing data with the corresponding biological sample genome data to obtain comparison results; b) a peak searching unit for merging the comparison results of all single cells and then searching for peaks; c) a reading calculation unit for calculating the readings in each peak to obtain a cell × peak reading matrix;
the accesson construction module comprises: i) a peak distance calculation unit for calculating mathematical distances between peaks in the cell × peak reading matrix; ii) a peak clustering unit for clustering peaks according to the mathematical distances between peaks; iii) a matrix conversion unit for merging the cell × peak reading matrix into a cell × accesson reading matrix, wherein an accesson is a cluster of peaks;
the system further comprises a cell development pathway remodeling module for constructing a cell development pathway pseudotime profile using the cell × accesson reading matrix.
10. The analysis system of claim 9, wherein the system further comprises a visualization module for reducing the dimensionality of the cell × accesson reading matrix to obtain a two-dimensional visualization matrix.
11. The analysis system of claim 10, wherein the dimensionality reduction method comprises PCA, t-SNE or UMAP.
12. The analysis system of claim 9, wherein the system further comprises a cell clustering module for clustering cells according to the cell × accesson reading matrix.
13. The analysis system of claim 12, wherein the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.
14. The analysis system of claim 9, wherein the algorithm used in constructing the cell development pathway pseudotime profile comprises SPRING or Monocle.
15. The analysis system of claim 9 or 10, wherein the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a city-block distance; the peak clustering method comprises KNN, DBSCAN or K-means.
16. The analysis system of claim 9, wherein the means for merging the cell × peak reading matrix into the cell × accesson reading matrix comprises taking the sum, the average, the median, or the variance of the peak readings in each accesson.
17. A single-cell chromatin accessibility sequencing data analysis device based on peak clustering, comprising:
a processor;
a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method of any of claims 1-8.
18. A computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method of any one of claims 1-8.
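The optional steps recited in claims 2, 3 and 8 can be illustrated with a short sketch. It is an example under stated assumptions rather than the claimed implementation: the peak-to-accesson labels are assigned by hand, `merge_peaks` and `pca_2d` are hypothetical helper names, the toy matrix is invented, and PCA is computed with a plain NumPy SVD.

```python
import numpy as np

# The four merging options named in claim 8. np.var uses the population
# variance (ddof=0) by default; that choice is an assumption of this sketch.
AGGREGATORS = {
    "sum": np.sum,
    "mean": np.mean,
    "median": np.median,
    "variance": np.var,
}


def merge_peaks(cell_by_peak, peak_labels, how="sum"):
    """Merge the peak columns that share a label into one accesson column."""
    aggregate = AGGREGATORS[how]
    groups = np.unique(peak_labels)
    return np.stack(
        [aggregate(cell_by_peak[:, peak_labels == g], axis=1) for g in groups],
        axis=1,
    )


def pca_2d(matrix):
    """Project the rows of `matrix` onto its first two principal axes."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Toy data (assumption): 6 cells x 8 peaks and hand-assigned accesson labels.
cell_by_peak = np.array([
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 0, 1, 1],
])
peak_labels = np.array([0, 0, 1, 1, 2, 2, 2, 2])

merged_sum = merge_peaks(cell_by_peak, peak_labels, how="sum")
coords_2d = pca_2d(merged_sum)  # one (x, y) point per cell for plotting
print(merged_sum)
print(coords_2d.shape)
```

In this toy case the first principal axis separates the first three cells from the last three. t-SNE or UMAP would replace `pca_2d` with a call into the corresponding library.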
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910256667.0A CN111755071B (en) | 2019-03-29 | 2019-03-29 | Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910256667.0A CN111755071B (en) | 2019-03-29 | 2019-03-29 | Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111755071A (en) | 2020-10-09 |
CN111755071B (en) | 2023-04-21 |
Family
ID=72672727
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910256667.0A Active CN111755071B (en) | 2019-03-29 | 2019-03-29 | Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111755071B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112270953A (en) * | 2020-10-29 | 2021-01-26 | 哈尔滨因极科技有限公司 | Analysis method, device and equipment based on BD single cell transcriptome sequencing data |
CN115050416A (en) * | 2021-03-08 | 2022-09-13 | 中国科学院上海营养与健康研究所 | Single cell transcriptome calculation analysis method and system fused with deep learning model |
CN112992267B (en) * | 2021-04-13 | 2024-02-09 | 中国人民解放军军事科学院军事医学研究院 | Single-cell transcription factor regulation network prediction method and device |
CN113178233B (en) * | 2021-04-27 | 2023-04-28 | 西安电子科技大学 | Large-scale single-cell transcriptome data efficient clustering method |
US20240185946A1 (en) * | 2022-02-08 | 2024-06-06 | Chromatintech Beijing Co, Ltd | Method for identifying a chromatin structural characteristic from a hi-c matrix, non-transitory computer readable medium storing a program for identifying a chromatin structural characteristic from a hi-c matrix, and methods for diagnosing and treating a medical condition or disease |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030162219A1 (en) * | 2000-12-29 | 2003-08-28 | Sem Daniel S. | Methods for predicting functional and structural properties of polypeptides using sequence models |
WO2014152091A2 (en) * | 2013-03-15 | 2014-09-25 | Carnegie Institution Of Washington | Methods of genome sequencing and epigenetic analysis |
SG11201508985VA (en) * | 2013-05-23 | 2015-12-30 | Univ Leland Stanford Junior | Transposition into native chromatin for personal epigenomics |
CN103955629A (en) * | 2014-02-18 | 2014-07-30 | 吉林大学 | Micro genome segment clustering method based on fuzzy k-mean |
CN105930862A (en) * | 2016-04-13 | 2016-09-07 | 江南大学 | Density peak clustering algorithm based on density adaptive distance |
CN107368701A (en) * | 2017-07-31 | 2017-11-21 | 浙江绍兴千寻生物科技有限公司 | In high volume unicellular ATAC seq data quality controls and analysis method |
Also Published As
Publication number | Publication date |
---|---|
CN111755071A (en) | 2020-10-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 20240130; Address after: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96; Patentee after: University of Science and Technology of China; Country or region after: China; Patentee after: Qu Kun; Address before: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96; Patentee before: University of Science and Technology of China; Country or region before: China |