CN111755071A - Single cell chromatin accessibility sequencing data analysis method and system based on peak clustering - Google Patents
- Publication number
- CN111755071A (application number CN201910256667.0A)
- Authority
- CN
- China
- Prior art keywords
- clustering
- cell
- peaks
- peak
- accesson
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
Abstract
A method and system for analyzing single-cell chromatin accessibility sequencing data based on peak clustering, the method comprising: aligning the single-cell chromatin accessibility sequencing data to the genome of the corresponding biological sample to obtain an alignment result, calling peaks on the basis of the alignment result, and counting the reads in each peak to obtain a cell-by-peak read-count matrix; calculating the mathematical distance between peaks in the cell-by-peak read-count matrix, clustering the peaks, and merging the cell-by-peak read-count matrix into a cell-by-accesson read-count matrix, wherein an accesson is a cluster of peaks. The invention provides the first method and system covering the full analysis from fastq input through clustering, visualization, and developmental pathway remodeling, and markedly improves the clustering result.
Description
Technical Field
The invention belongs to the technical field of biological sequencing data analysis, and particularly relates to a single cell chromatin accessibility sequencing data analysis method and system based on peak clustering.
Background
Since its invention in 2012, ATAC-seq has been widely adopted in biological research because it is simple, inexpensive, and requires few cells, and it has contributed to breakthroughs in the study of embryonic development, stem cell differentiation, and cancer mechanisms and typing. For example, a 2017 paper in Cancer Cell (IF = 24) showed that ATAC-seq can be used to explain the pathogenesis and precise typing of T-cell lymphoma, and ATAC-seq data were added to the TCGA database in 2018. To further investigate cellular heterogeneity, the scATAC-seq sequencing technology was proposed in 2015, and a number of different technical implementations have appeared in the years since; with them came the need to analyze and interpret scATAC-seq sequencing results.
The primary purpose of scATAC-seq data analysis is to restore, from the sequencing results, the major cell populations or the developmental differentiation pathways in a mixed biological sample. However, although the scATAC-seq technique is at the leading edge, the signal-to-noise ratio of its data is relatively low. scATAC-seq data analysis therefore requires a set of easy-to-use analysis methods that restore as much cell heterogeneity information as possible. On the one hand, existing scATAC-seq data analysis methods offer no complete, easy-to-use workflow that runs from fastq input through clustering, visualization, and developmental pathway remodeling. On the other hand, when evaluated with ARI on gold-standard test datasets, that is, test datasets in which the subpopulation or the position in the developmental differentiation pathway of each cell is known, existing methods still restore information poorly and need improvement. As a result, the industry currently has no unified approach to scATAC-seq analysis.
The prior art has the following three analysis methods: ChromVAR, LSI and Cicero.
In the ChromVAR method, the input data are the cell-by-peak read-count matrix and the sequence information of each peak. From these, a cell-by-transcription-factor preference score matrix is constructed, and this matrix is used for information restoration.
In the LSI method, the input data is the cell-by-peak read-count matrix, which is transformed by the TF-IDF algorithm (TF means term frequency; IDF means inverse document frequency); the new matrix is then used for information restoration.
In the Cicero method, the input data are the cell-by-peak read-count matrix and the position of each peak on the chromosome. Peaks are merged on the basis of chromatin position, and the resulting matrix is then used for downstream information restoration.
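As a rough illustration of the TF-IDF step mentioned for the LSI method, the sketch below reweights a small cell-by-peak count matrix. The weighting convention (TF as a cell's count divided by its total, IDF as log(1 + number of cells / number of cells in which the peak is detected)) is only one of several in common use and is an assumption here, as is the function name `tf_idf`; the source does not specify which variant the prior-art method uses.

```python
import math

def tf_idf(counts):
    """Reweight a cell-by-peak count matrix (list of rows, one row per cell).

    Assumed convention: TF = count / total counts in the cell,
    IDF = log(1 + n_cells / number of cells in which the peak has reads).
    """
    n_cells = len(counts)
    n_peaks = len(counts[0])
    # For each peak, the number of cells with at least one read in it.
    cells_with_peak = [sum(1 for row in counts if row[j] > 0) for j in range(n_peaks)]
    idf = [math.log(1 + n_cells / c) if c > 0 else 0.0 for c in cells_with_peak]
    weighted = []
    for row in counts:
        total = sum(row)
        weighted.append([(x / total if total else 0.0) * idf[j] for j, x in enumerate(row)])
    return weighted

# Toy matrix: 3 cells (rows) by 3 peaks (columns).
counts = [[2, 0, 1], [0, 3, 1], [1, 0, 0]]
weighted = tf_idf(counts)
```

Peaks detected in few cells receive a larger IDF weight, so rare but informative peaks are not swamped by peaks that are open in nearly every cell.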
Disclosure of Invention
In view of the above, the present invention provides a complete, easy-to-use, and efficient method and system for analyzing scATAC-seq data of biological samples, capable of restoring cell heterogeneity information.
In order to achieve the above object, in one aspect, the present invention provides a single-cell chromatin accessibility sequencing data analysis method based on peak clustering, comprising:
aligning the single-cell chromatin accessibility sequencing data to the genome of the corresponding biological sample to obtain an alignment result, calling peaks on the basis of the alignment result, and counting the reads in each peak to obtain a cell-by-peak read-count matrix;
calculating the mathematical distance between peaks in the cell-by-peak read-count matrix, clustering the peaks, and merging the cell-by-peak read-count matrix into a cell-by-accesson read-count matrix, wherein an accesson is a cluster of peaks.
In some embodiments, the method further comprises reducing the cell-by-accesson read-count matrix to a two-dimensional visualization matrix; preferably, the dimension reduction method comprises PCA, t-SNE, or UMAP.
In some embodiments, the method further comprises clustering cells according to the cell-by-accesson read-count matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.
In some embodiments, the method further comprises constructing a pseudo-time trajectory of the cell development pathway using the cell-by-accesson read-count matrix; preferably, the algorithm used comprises SPRING or Monocle.
In another aspect, the invention provides a single-cell chromatin accessibility sequencing data analysis system based on peak clustering, comprising a preprocessing module and an accesson construction module;
the preprocessing module comprises: a) an alignment unit for aligning the single-cell chromatin accessibility sequencing data to the genome of the corresponding biological sample to obtain an alignment result; b) a peak calling unit for merging the alignment results of all single cells and then calling peaks; c) a read counting unit for counting the reads in each peak to obtain a cell-by-peak read-count matrix;
the accesson construction module comprises: a) a peak distance calculation unit for calculating the mathematical distance between peaks in the cell-by-peak read-count matrix; b) a peak clustering unit for clustering peaks according to the mathematical distance between them; c) a matrix conversion unit for merging the cell-by-peak read-count matrix into a cell-by-accesson read-count matrix, wherein an accesson is a cluster of peaks.
In some embodiments, the system further comprises a visualization module for reducing the cell-by-accesson read-count matrix to a two-dimensional visualization matrix; preferably, the dimension reduction method comprises PCA, t-SNE, or UMAP.
In some embodiments, the system further comprises a cell clustering module for clustering cells according to the cell-by-accesson read-count matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.
In some embodiments, the system further comprises a cell development pathway remodeling module for constructing a pseudo-time trajectory of the cell development pathway using the cell-by-accesson read-count matrix; preferably, the algorithm used comprises SPRING or Monocle.
In some embodiments, the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance.
In some embodiments, the method of peak clustering includes KNN, DBSCAN, or K-means.
In some embodiments, the method of merging the cell-by-peak read-count matrix into a cell-by-accesson read-count matrix comprises taking the sum, the mean, the median, or the variance of the read counts of the peaks in each accesson.
In another aspect, the present invention further provides a single-cell chromatin accessibility sequencing data analysis apparatus based on peak clustering, including:
a processor;
a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method.
In yet another aspect, the present invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides the first method and system that analyze scATAC-seq data end to end, from fastq input through clustering, visualization, and developmental pathway remodeling;
the invention provides an accesson construction method based on peak clustering, which serves as the key module of scATAC-seq data analysis. The transformed cell-by-accesson matrix is used for subsequent clustering, visualization, and cell development pathway remodeling. In tests on gold-standard datasets, the clustering performance (ARI) is statistically significantly higher than that of existing methods.
Drawings
FIG. 1 is a schematic diagram of the construction and downstream analysis of accesson based on peak clustering in the embodiment of the present invention;
FIG. 2 shows the relationship between the number of accessons and the ARI (gold-standard test data set 1) in an embodiment of the present invention;
FIG. 3 shows scATAC-seq data of human leukemia cells and related lineage cells in an embodiment of the present invention: A. data clustering (hierarchical clustering) and B.
FIG. 4 is data of human hematopoietic stem cell developmental differentiation lineage associated scATAC-seq in accordance with an embodiment of the present invention: data development pathway remodeling (monocle);
FIG. 5 shows data of mouse forebrain neural cells scATAC-seq in examples of the present invention: data clustering (KNN) and visualization (tSNE);
FIGS. 6A-6D show mouse thymic T cell scATAC-seq data in an embodiment of the present invention: data clustering (Louvain, hierarchical clustering), visualization (tSNE), and developmental pathway remodeling (monocle);
FIG. 7 shows a comparison of clustering performance and run time between an embodiment of the present invention and existing methods (gold-standard test data set 1);
FIG. 8 shows a comparison of clustering performance and run time between an embodiment of the present invention and existing methods (gold-standard test data set 2).
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
For ease of understanding and to avoid repetition, the technical terms used herein are explained below.
Cell: the basic unit of life activity in mammals (such as human and mouse) and also the basis of the pathogenesis of various diseases; examples include nerve cells, epithelial cells, and tumor cells.
Cellular heterogeneity: biological tissue samples (e.g., tumor tissue, brain tissue) are composed of a large number of cells, whose constituent cells have different physiological functions. The following two manifestations of common cellular heterogeneity exist: 1) the constituent cells are composed of a variety of well-defined cell populations (discrete). 2) The constituent cells are in a continuous cellular differentiation pathway (continuum).
Genome: the complete DNA sequence of an organism, composed of an ordered arrangement of the four bases A, T, C, and G. The genomes of major mammals such as human and mouse have been completely sequenced.
Gene: a gene (genetic element) is the entire DNA sequence required to produce a polypeptide chain or functional RNA. A gene is typically one or more DNA segments of a genome.
Transcription factors: a protein that binds to DNA, initiates or regulates expression of a gene. Its binding to DNA is often through the recognition of specific DNA sequence patterns (Motif).
Chromatin: a linear complex structure in the cell nucleus consisting of DNA, histones, non-histone proteins, and a small amount of RNA. Its basic unit is the nucleosome, formed by DNA wound around histones.
Chromatin accessibility: a measure of whether a stretch of DNA is wound around histones. In general, chromatin is in one of two states: 1) DNA tightly wound around nucleosomes, termed closed DNA; 2) DNA not wound around nucleosomes and in a naked state, termed open DNA.
Chromatin accessibility sequencing (ATAC-seq): a sequencing technique developed at Stanford University in 2012 for detecting chromatin accessibility in biological samples (> 500 cells).
TCGA: The Cancer Genome Atlas project. It contains various omics sequencing data for cancer tissues and normal tissues, covering 33 different cancers and 11,000 patients.
Single-cell chromatin accessibility sequencing (scATAC-seq): a group of sequencing methods used to detect chromatin accessibility in individual cells, including single-nucleus chromatin accessibility sequencing (snATAC-seq), single-cell combinatorial indexing chromatin accessibility sequencing (sciATAC-seq), and flow-cytometry-based single-cell chromatin accessibility sequencing (FACS scATAC-seq).
Short sequences (sequence reads): the DNA fragments obtained in biological sequencing.
Alignment (Mapping): the short sequences are compared to known genomic information to find the location of each short sequence on the genome.
Peak finding (peak calling): locating the positions of open DNA from the alignment results; each such position is called a peak and is assigned a number.
Reading (read count): the number of short sequences per sample in each peak.
Accesson: the peak clustering result proposed by the invention; each accesson is a cluster of peaks. For example, accesson 1 = {peak 2, peak 3, peak 5}; accesson 2 = {peak 1, peak 4}.
ARI (Adjusted Rand Index): a commonly used evaluation index for clustering algorithms, used to evaluate the consistency between the clustering result of an algorithm and the actual grouping.
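To make the ARI definition concrete, the following is a minimal sketch of the standard pair-counting formula for the Adjusted Rand Index, written with the Python standard library only. The function name `adjusted_rand_index` and the toy labels are illustrative and do not come from the source.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between a known grouping and a clustering result."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the groupings trivially agree
    # Pairs of items placed together in both groupings, in the true grouping,
    # and in the predicted grouping, respectively.
    both = sum(comb(v, 2) for v in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(v, 2) for v in Counter(labels_true).values())
    in_pred = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = in_true * in_pred / total_pairs  # agreement expected by chance
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both groupings
    return (both - expected) / (maximum - expected)

known = [0, 0, 1, 1]                              # gold-standard cell labels
same = adjusted_rand_index(known, [1, 1, 0, 0])   # same grouping, renamed labels
mixed = adjusted_rand_index(known, [0, 1, 0, 1])  # every known group is split
```

The index depends only on which cells are grouped together, not on the label names, so a relabelled but identical clustering scores 1.0, while a clustering that agrees less than chance would scores below zero.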
One embodiment of the present invention proposes a single-cell chromatin accessibility sequencing (scATAC-seq) data analysis system based on peak clustering (hereinafter abbreviated as APEC). The system comprises the following modules:
1) Preprocessing module: comprises a) an alignment unit for aligning the fastq files (i.e., the single-cell chromatin accessibility sequencing data) to the genome sequence to produce bam files; b) a peak calling unit for merging the bam files of all single-cell alignment results into a merge_bam file and calling peaks on it; c) a read counting unit for counting the reads in each peak and outputting the cell-by-peak read-count matrix.
2) Accesson construction module: comprises a) a peak distance calculation unit for calculating the mathematical distance between peaks (including but not limited to Euclidean distance, Pearson correlation coefficient, and cityblock distance) from the cell-by-peak read-count matrix; b) a peak clustering unit for clustering peaks according to the mathematical distance between them, the clustered peaks being called accessons, where the clustering method includes but is not limited to KNN and DBSCAN; c) a matrix conversion unit for merging the cell-by-peak read-count matrix into a cell-by-accesson matrix according to the accesson information, where the merging method includes but is not limited to taking the sum, mean, median, or variance of the read counts of the peaks in each accesson.
3) Visualization module: reduces the cell-by-accesson read-count matrix to a two-dimensional visualization matrix, using dimension reduction methods including but not limited to PCA, t-SNE, and UMAP.
4) Cell clustering module: clusters the cells using the cell-by-accesson read-count matrix, with clustering algorithms including but not limited to KNN clustering, kernel clustering, and Louvain clustering.
5) Cell development pathway remodeling module: constructs a pseudo-time trajectory of the cell development pathway from the cell-by-accesson read-count matrix, using algorithms including but not limited to SPRING and Monocle.
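As an illustration of what the visualization module does, the sketch below projects a small cell-by-accesson matrix onto its first two principal components. PCA is one of the methods the text names; computing it by power iteration with deflation is a choice made here only to keep the example free of external libraries, and the function name `top_components` and the toy matrix are hypothetical. t-SNE and UMAP would in practice come from dedicated libraries.

```python
import math

def top_components(X, n_components=2, n_iter=200):
    """Project the rows of X (cells) onto the leading principal components."""
    m, n = len(X), len(X[0])
    means = [sum(row[j] for row in X) / m for j in range(n)]
    Xc = [[row[j] - means[j] for j in range(n)] for row in X]  # centre each column
    # Unnormalised covariance matrix of the columns (accessons).
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(m)) for b in range(n)] for a in range(n)]
    components = []
    for _ in range(n_components):
        v = [1.0] * n
        for _step in range(n_iter):
            w = [sum(C[a][b] * v[b] for b in range(n)) for a in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break  # no variance left in the deflated matrix
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged direction.
        lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(n)) for a in range(n))
        components.append(v)
        # Deflation: remove the found direction before searching for the next one.
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(n)] for a in range(n)]
    return [[sum(r[j] * comp[j] for j in range(n)) for comp in components] for r in Xc]

# Toy cell-by-accesson matrix: cells 0-1 and cells 2-3 form two groups.
X = [[5, 0, 1], [4, 1, 0], [0, 9, 8], [1, 8, 9]]
scores = top_components(X)
```

Each row of `scores` is a two-dimensional coordinate for one cell, suitable for a scatter plot; in the toy data the two groups of cells fall on opposite sides of the first axis.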
The following use cases apply APEC to 4 different gold-standard test data sets to illustrate its generality in analyzing scATAC-seq data from different biological samples. The data sets are: 1) human leukemia cell and related lineage cell scATAC-seq data; 2) human hematopoietic stem cell developmental differentiation lineage associated scATAC-seq data; 3) mouse forebrain nerve cell scATAC-seq data; 4) mouse thymic T cell scATAC-seq data.
The analysis process of the scATAC-seq analysis system (APEC) based on peak clustering according to the present invention comprises the following steps:
1) data input:
The input data are fastq files, in either of two formats: a) one fastq file per cell; b) a single fastq file in which the cells are mixed together but can be split into per-cell data using a splitting rule given by the data provider, such as index sequences (for example, cells distinguished by the first 5-10 bases of each read in the fastq).
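For input format b), the splitting step might look like the sketch below. It assumes that the cell index is the first `index_len` bases of each read and that the data provider supplies a table from index sequence to cell name; both assumptions, and the names `split_fastq_by_index` and `index_to_cell`, are hypothetical, since the actual splitting rule depends on the data provider.

```python
from collections import defaultdict
from io import StringIO

def split_fastq_by_index(handle, index_to_cell, index_len):
    """Group 4-line fastq records by the first `index_len` bases of each read."""
    per_cell = defaultdict(list)
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        cell = index_to_cell.get(seq[:index_len])
        if cell is None:
            continue  # index not in the provider's table: drop the read
        # Trim the index bases and their quality values before alignment.
        per_cell[cell].append((header, seq[index_len:], plus, qual[index_len:]))
    return dict(per_cell)

# Toy mixed fastq with three reads; the third carries an unknown index.
demo = StringIO(
    "@r1\nACGTATTTGGGC\n+\nIIIIIIIIIIII\n"
    "@r2\nTTGCACCCAAAT\n+\nIIIIIIIIIIII\n"
    "@r3\nGGGGGAAAAAAA\n+\nIIIIIIIIIIII\n"
)
cells = split_fastq_by_index(demo, {"ACGTA": "cell_1", "TTGCA": "cell_2"}, 5)
```

The records collected for each cell could then be written to one fastq file per cell, which reduces format b) to format a).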
2) Data preprocessing:
The alignment unit can align the input data to the genome of the relevant organism: for example, data sets 1 and 2 are aligned to the human genome and data sets 3 and 4 to the mouse genome. A genome specified by the data provider may also be used. Alignment produces a bam file that records where each read in the fastq aligns to the genome. The bam file is processed by the peak calling unit to define the open chromatin sites in the biological sample, and the read counting unit then yields an m × n matrix of read counts per cell (m cells) per peak (n peaks).
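The read counting step can be sketched as follows, assuming the alignments have already been reduced to (cell, chromosome, position) tuples and the peaks to half-open (chromosome, start, end) intervals. These simplified inputs and the function name `count_matrix` are assumptions made for illustration; parsing real bam files would require a dedicated library.

```python
def count_matrix(alignments, peaks, cells):
    """Build the m x n matrix of read counts per cell (rows) per peak (columns).

    alignments: iterable of (cell, chrom, pos) tuples
    peaks:      list of (chrom, start, end) intervals, half-open [start, end)
    cells:      list of cell names fixing the row order
    """
    row_of = {cell: i for i, cell in enumerate(cells)}
    matrix = [[0] * len(peaks) for _ in cells]
    for cell, chrom, pos in alignments:
        # Brute-force scan over peaks; fine for a toy example.
        for j, (p_chrom, start, end) in enumerate(peaks):
            if chrom == p_chrom and start <= pos < end:
                matrix[row_of[cell]][j] += 1
    return matrix

cells = ["cell_1", "cell_2"]
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
alignments = [
    ("cell_1", "chr1", 120), ("cell_1", "chr1", 199), ("cell_1", "chr2", 60),
    ("cell_2", "chr1", 550), ("cell_2", "chr1", 200), ("cell_2", "chr2", 149),
]
matrix = count_matrix(alignments, peaks, cells)
```

The scan over all peaks for every read is quadratic; for realistic peak counts an interval index or sorted peak list would be the natural replacement.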
3) Accesson construction:
FIG. 1 is a schematic diagram of the construction and downstream analysis of accessons based on peak clustering in the embodiment of the present invention. In accesson construction, the m × n read-count matrix is first passed into the accesson construction module.
In the peak distance calculation unit, the distance between peaks may be calculated using the Euclidean distance (as for data sets 1, 2, 3, and 4); other commonly used vector distance measures, such as the Pearson correlation coefficient or the cityblock distance, may also be used.
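The three measures named here can be written out directly. In the sketch below each peak is represented by its column of the cell-by-peak matrix, so the distance is computed between vectors of per-cell counts. The helper names and the toy matrix are illustrative; note that the Pearson coefficient is a similarity rather than a distance, and converting it with 1 - r is a common convention assumed here, not something the source states.

```python
import math

def column(matrix, j):
    """Per-cell count vector of peak j."""
    return [row[j] for row in matrix]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cityblock(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def pearson(u, v):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0
    return cov / (su * sv)

# Toy matrix: 3 cells (rows) by 3 peaks (columns).
reads = [[3, 2, 0], [1, 1, 4], [0, 0, 5]]
p0, p1, p2 = column(reads, 0), column(reads, 1), column(reads, 2)
d_euclid = euclidean(p0, p1)
d_city = cityblock(p0, p2)
corr_distance = 1 - pearson(p0, p1)  # assumed conversion from similarity to distance
```

Peaks 0 and 1 rise and fall together across the three cells and are therefore close under all three measures, while peak 2 follows the opposite pattern.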
In the peak clustering unit, peaks can be clustered into a specified number of accessons using the KNN algorithm (as for data sets 1, 2, 3, and 4). Other common vector clustering algorithms, such as DBSCAN or K-means, may also be used. The specified number of accessons does not affect the result over a wide range (FIG. 2); the default is 2000, which can be adjusted for the specific data.
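The sketch below groups peaks into a specified number of accessons with a plain K-means loop, one of the alternative algorithms named in the text; it is not the KNN-based procedure used for the four data sets, whose details the source does not give. The deterministic farthest-point initialisation and the names `kmeans_peaks` and `sq_dist` are choices made for this example.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two peak vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_peaks(peak_vectors, k, n_iter=20):
    """Assign each peak (a vector of per-cell counts) to one of k accessons."""
    # Farthest-point initialisation keeps the example deterministic.
    centers = [list(peak_vectors[0])]
    while len(centers) < k:
        far = max(peak_vectors, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(peak_vectors)
    for _ in range(n_iter):
        # Assignment step: each peak joins its nearest centre.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in peak_vectors]
        # Update step: each centre moves to the mean of its member peaks.
        for c in range(k):
            members = [p for p, lab in zip(peak_vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels

# Four peaks described by their counts in three cells.
peak_vectors = [[3, 1, 0], [2, 1, 0], [0, 4, 5], [0, 5, 4]]
labels = kmeans_peaks(peak_vectors, k=2)
```

Here `labels[j]` is the accesson to which peak j belongs; with real data k would be set to the desired number of accessons, such as the default of 2000 mentioned in the text.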
In the matrix conversion unit, the accessons are first screened according to their basic properties: for example, accessons containing fewer peaks than a specified value, or accessons whose internal damping coefficient is smaller than a specified value, are removed. The cell-by-peak read counts are then merged into a cell-by-accesson matrix according to the accesson information by summing the read counts of the peaks in each accesson (as for data sets 1, 2, 3, and 4). Other simple vector statistics, such as the mean, median, or variance of the read counts, may also be used.
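The conversion step can be sketched as below, assuming the accesson membership is given as one label per peak. Only the screening by minimum number of peaks is shown; the function name `merge_by_accesson`, the `min_peaks` parameter, and the use of population variance for the variance option are assumptions made for illustration.

```python
from statistics import mean, median, pvariance

def merge_by_accesson(matrix, labels, how="sum", min_peaks=1):
    """Collapse a cell-by-peak matrix into a cell-by-accesson matrix.

    labels[j] is the accesson of peak j. Accessons with fewer than
    `min_peaks` peaks are dropped before merging.
    """
    reducers = {"sum": sum, "mean": mean, "median": median, "variance": pvariance}
    reduce = reducers[how]
    groups = {}
    for j, lab in enumerate(labels):
        groups.setdefault(lab, []).append(j)  # peak columns of each accesson
    kept = sorted(lab for lab, cols in groups.items() if len(cols) >= min_peaks)
    merged = [[reduce([row[j] for j in groups[lab]]) for lab in kept] for row in matrix]
    return merged, kept

# 3 cells by 4 peaks; peaks 0-1 form accesson 0 and peaks 2-3 form accesson 1.
reads = [[3, 2, 0, 0], [1, 1, 4, 5], [0, 0, 5, 4]]
labels = [0, 0, 1, 1]
merged_sum, kept = merge_by_accesson(reads, labels)               # default: sum
merged_mean, _ = merge_by_accesson(reads, labels, how="mean")
filtered, kept_strict = merge_by_accesson(reads, labels, min_peaks=3)
```

Summation turns the sparse 3 × 4 peak matrix into a denser 3 × 2 accesson matrix, which is the compaction effect the method relies on.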
4) Data clustering and visualization:
In this step, the visualization module may be used to reduce the cell-by-accesson read-count matrix to a two-dimensional visualization matrix, and/or the cell clustering module may be used to cluster the cells, and/or the cell development pathway remodeling module may be used to construct a pseudo-time trajectory of the cell development pathway.
FIG. 3 shows scATAC-seq data of human leukemia cells and related lineage cells: A. data clustering (hierarchical clustering) and B.
FIG. 4 shows scATAC-seq data associated with the human hematopoietic stem cell developmental differentiation lineage: developmental pathway remodeling (monocle);
FIG. 5 shows mouse forebrain nerve cell scATAC-seq data: data clustering (KNN) and visualization (tSNE);
FIGS. 6A-6D show mouse thymic T cell scATAC-seq data: FIG. 6A is Louvain clustering, FIG. 6B is hierarchical clustering, FIG. 6C is visualization (tSNE), and FIG. 6D is developmental pathway remodeling (monocle).
Thus, the method covers the full workflow from fastq through clustering, visualization, and developmental pathway remodeling, and in tests on gold-standard datasets its clustering performance (ARI) is statistically significantly higher than that of existing methods, as shown in FIGS. 7 and 8. Cell heterogeneity information can be restored efficiently because the proposed accesson construction acts as a filtering process that reduces noise and amplifies signal. In detail: 1) compared with LSI and ChromVAR, the method converts the originally sparse cell-by-peak matrix into a more compact cell-by-accesson matrix, which reduces noise in subsequent analysis; 2) compared with the Cicero method, which merges peaks on the basis of chromatin position, the invention merges peaks that are clustered together by mathematical distance and a clustering algorithm. Peaks clustered together in this way have similar patterns across cells, so the resulting accessons are more biologically meaningful: for example, the peaks within an accesson may be regulated by the same transcription factor or may lie close together in the three-dimensional structure of chromatin. The transformed cell-by-accesson matrix therefore further amplifies the cell heterogeneity signal.
The invention also provides a single cell chromatin accessibility sequencing data analysis device based on peak clustering, which comprises:
a processor;
a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method.
The invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.
It should be noted that each functional module/unit in the present invention may be hardware, for example, the hardware may be a circuit, including a digital circuit, an analog circuit, and the like. Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like. The data processing module may be any suitable hardware processor such as a CPU, GPU, FPGA, DSP, ASIC, etc. The memory unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, etc.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A single cell chromatin accessibility sequencing data analysis method based on peak clustering, comprising:
comparing the single-cell chromatin accessibility sequencing data with corresponding biological sample genome data to obtain a comparison result, searching peaks on the basis of the comparison result, and calculating the reading in each peak to obtain a reading matrix of cells and peaks;
calculating the mathematical distance between peaks in the reading matrix of the cell peaks, clustering the peaks, and combining the reading matrix of the cell peaks into the reading matrix of the cell accesson, wherein the accesson is the clustered peaks.
2. The analysis method according to claim 1, wherein the method further comprises reducing the reading matrix of the cell accesson to a two-dimensional visualization matrix, preferably wherein the dimension reduction method comprises PCA, t-SNE or UMAP.
3. The analysis method according to claim 1, wherein the method further comprises clustering cells according to the reading matrix of the cell accesson, preferably wherein the clustering algorithm comprises KNN clustering, kernel clustering or Louvain clustering.
4. The analysis method according to claim 1, wherein the method further comprises using the reading matrix of the cell accesson to construct cell development pathway pseudo-temporal profiles, preferably wherein the algorithm used to construct the cell development pathway pseudo-temporal profiles comprises SPRING or Monocle.
5. The analysis method of any of claims 1-4, wherein the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance;
preferably, the method of peak clustering comprises KNN, DBSCAN or K-means;
preferably, the method of merging the reading matrix of cells and peaks into a reading matrix of cell accesson comprises taking the sum, the mean, the median or the variance of the peak readings in each accesson.
6. A single cell chromatin accessibility sequencing data analysis system based on peak clustering, comprising a preprocessing module and an accesson construction module;
the preprocessing module comprises: a) a comparison unit for comparing the single-cell chromatin accessibility sequencing data with the corresponding biological sample genome data to obtain a comparison result; b) a peak searching unit for merging the comparison results of all the single cells and then searching for peaks; and c) a reading calculation unit for calculating the reading in each peak to obtain a reading matrix of cells and peaks;
the accesson construction module comprises: a) a peak distance calculation unit for calculating the mathematical distance between peaks in the reading matrix of cells and peaks; b) a peak clustering unit for clustering the peaks according to the mathematical distance between them; and c) a matrix conversion unit for merging the reading matrix of cells and peaks into a reading matrix of cell accesson, wherein an accesson is a cluster of peaks.
7. The analysis system according to claim 6, wherein the system further comprises a visualization module for reducing the reading matrix of the cell accesson to a two-dimensional visualization matrix, preferably wherein the dimension reduction method comprises PCA, t-SNE or UMAP;
preferably, the system further comprises a cell clustering module for clustering cells according to the reading matrix of the cell accesson, preferably wherein the clustering algorithm comprises KNN clustering, kernel clustering or Louvain clustering;
preferably, the system further comprises a cell development pathway remodeling module for constructing cell development pathway pseudo-temporal profiles using the reading matrix of the cell accesson, preferably wherein the algorithm used to construct the cell development pathway pseudo-temporal profiles comprises SPRING or Monocle.
8. The analysis system of claim 6 or 7, wherein the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance;
preferably, the method of peak clustering comprises KNN, DBSCAN or K-means;
preferably, the method of combining the reading matrices of cell peaks into a reading matrix of cell accesson comprises taking the sum of the peak readings in accesson, the mean of the peak readings, the median of the peak readings or the variance of the peak readings.
9. A single cell chromatin accessibility sequencing data analysis apparatus based on peak clustering, comprising:
a processor;
a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method of any of claims 1-5.
10. A computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method of any one of claims 1-5.
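The accesson construction recited in claims 1 and 5 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes K-means with Euclidean distance (one of the listed options) for peak clustering and the sum of peak readings for merging, and the names `kmeans`, `build_accesson_matrix`, `counts`, `accesson_matrix` and `peak_labels` are hypothetical, as is the toy reading matrix.

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means (Euclidean); returns one cluster label per row."""
    # Deterministic farthest-point initialisation keeps the sketch reproducible.
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def build_accesson_matrix(cell_by_peak, n_accessons, reducer=np.sum):
    """Cluster peaks (columns) and merge each cluster into one accesson column."""
    peak_profiles = cell_by_peak.T.astype(float)  # one row per peak
    labels = kmeans(peak_profiles, n_accessons)
    cell_by_accesson = np.column_stack([
        reducer(cell_by_peak[:, labels == j], axis=1) for j in range(n_accessons)
    ])
    return cell_by_accesson, labels

# Toy reading matrix (assumed data): 4 cells x 6 peaks.
# Peaks 0-2 are accessible in cells 0-1, peaks 3-5 in cells 2-3.
counts = np.array([
    [2, 1, 2, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 2],
])
accesson_matrix, peak_labels = build_accesson_matrix(counts, n_accessons=2)
print(peak_labels)      # cluster index assigned to each peak
print(accesson_matrix)  # cells x accessons, summed peak readings
```

Passing `reducer=np.mean` or `reducer=np.median` would give the mean or median merge options instead; a different distance or clustering method from claim 5 would replace `kmeans` without changing the merging step.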
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910256667.0A CN111755071B (en) | 2019-03-29 | 2019-03-29 | Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111755071A (en) | 2020-10-09 |
CN111755071B (en) | 2023-04-21 |
Family
ID=72672727
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910256667.0A Active CN111755071B (en) | 2019-03-29 | 2019-03-29 | Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111755071B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112270953A (en) * | 2020-10-29 | 2021-01-26 | 哈尔滨因极科技有限公司 | Analysis method, device and equipment based on BD single cell transcriptome sequencing data |
CN112992267A (en) * | 2021-04-13 | 2021-06-18 | 中国人民解放军军事科学院军事医学研究院 | Single-cell transcription factor regulation network prediction method and device |
CN113178233A (en) * | 2021-04-27 | 2021-07-27 | 西安电子科技大学 | Efficient clustering method for large-scale single-cell transcriptome data |
WO2022188785A1 (en) * | 2021-03-08 | 2022-09-15 | 中国科学院上海营养与健康研究所 | Single cell transcriptome computation and analysis method and system incorporating deep learning model |
CN116981779A (en) * | 2022-02-08 | 2023-10-31 | 染色质(北京)科技有限公司 | Method for identifying chromatin structural features from a Hi-C matrix, non-transitory computer readable medium storing a program for identifying chromatin structural features from a Hi-C matrix |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030162219A1 (en) * | 2000-12-29 | 2003-08-28 | Sem Daniel S. | Methods for predicting functional and structural properties of polypeptides using sequence models |
CN103955629A (en) * | 2014-02-18 | 2014-07-30 | 吉林大学 | Micro genome segment clustering method based on fuzzy k-mean |
CN105339503A (en) * | 2013-05-23 | 2016-02-17 | 斯坦福大学托管董事会 | Transposition into native chromatin for personal epigenomics |
US20160097088A1 (en) * | 2013-03-15 | 2016-04-07 | Carnegie Institution Of Washington | Methods of Genome Sequencing and Epigenetic Analysis |
CN105930862A (en) * | 2016-04-13 | 2016-09-07 | 江南大学 | Density peak clustering algorithm based on density adaptive distance |
CN107368701A (en) * | 2017-07-31 | 2017-11-21 | 浙江绍兴千寻生物科技有限公司 | In high volume unicellular ATAC seq data quality controls and analysis method |
- 2019-03-29: CN application CN201910256667.0A, patent CN111755071B (en), status Active
Non-Patent Citations (2)
Title |
---|
ZHANG TAO ET AL.: "Identification, classification and phylogenetic analysis of SET domain gene in barley", 《2010 4TH INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICAL ENGINEERING》 * |
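高胜寒 等 (GAO Shenghan et al.): "复杂基因组测序技术研究进展" [Research progress in sequencing technologies for complex genomes], 《遗传》 (Hereditas (Beijing)) * |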
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112270953A (en) * | 2020-10-29 | 2021-01-26 | 哈尔滨因极科技有限公司 | Analysis method, device and equipment based on BD single cell transcriptome sequencing data |
WO2022188785A1 (en) * | 2021-03-08 | 2022-09-15 | 中国科学院上海营养与健康研究所 | Single cell transcriptome computation and analysis method and system incorporating deep learning model |
CN112992267A (en) * | 2021-04-13 | 2021-06-18 | 中国人民解放军军事科学院军事医学研究院 | Single-cell transcription factor regulation network prediction method and device |
CN112992267B (en) * | 2021-04-13 | 2024-02-09 | 中国人民解放军军事科学院军事医学研究院 | Single-cell transcription factor regulation network prediction method and device |
CN113178233A (en) * | 2021-04-27 | 2021-07-27 | 西安电子科技大学 | Efficient clustering method for large-scale single-cell transcriptome data |
CN113178233B (en) * | 2021-04-27 | 2023-04-28 | 西安电子科技大学 | Large-scale single-cell transcriptome data efficient clustering method |
CN116981779A (en) * | 2022-02-08 | 2023-10-31 | 染色质(北京)科技有限公司 | Method for identifying chromatin structural features from a Hi-C matrix, non-transitory computer readable medium storing a program for identifying chromatin structural features from a Hi-C matrix |
Also Published As
Publication number | Publication date |
---|---|
CN111755071B (en) | 2023-04-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 20240130
Address after: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96
Patentee after: University of Science and Technology of China
Country or region after: China
Patentee after: Qu Kun
Address before: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96
Patentee before: University of Science and Technology of China
Country or region before: China