CN111755071B - Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering - Google Patents

Info

Publication number
CN111755071B
CN111755071B CN201910256667.0A CN201910256667A CN111755071B CN 111755071 B CN111755071 B CN 111755071B CN 201910256667 A CN201910256667 A CN 201910256667A CN 111755071 B CN111755071 B CN 111755071B
Authority
CN
China
Prior art keywords
cell
peak
clustering
matrix
readings
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910256667.0A
Other languages
Chinese (zh)
Other versions
CN111755071A (en)
Inventor
瞿昆
方靖文
黎斌
李杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qu Kun
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN201910256667.0A
Publication of CN111755071A
Application granted
Publication of CN111755071B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method and system for analyzing single-cell chromatin accessibility sequencing data based on peak clustering. The method comprises: aligning the single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result, calling peaks on the basis of the alignment result, and counting the readings in each peak to obtain a cell × peak reading matrix; then calculating mathematical distances between peaks in the cell × peak reading matrix, clustering the peaks, and merging the cell × peak reading matrix into a cell × accesson reading matrix, where an accesson is a cluster of peaks. The invention provides a method and a system for analyzing scATAC-seq data from fastq files through clustering, visualization, and developmental path reconstruction, and the grouping effect is remarkably improved.

Description

Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering
Technical Field
The invention belongs to the technical field of biological sequencing data analysis, and particularly relates to a single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering.
Background
ATAC-seq has been widely used in biological research since 2012. Because it is simple, inexpensive, and requires few cells, it has contributed to breakthroughs in research on embryonic development, stem cell differentiation, and cancer mechanisms and typing. For example, a 2017 paper in Cancer Cell (impact factor 24) found that ATAC-seq could explain the pathogenesis of T cell lymphoma and support precise drug-based typing, and in 2018 ATAC-seq data entered the TCGA database. To further investigate cellular heterogeneity, scATAC-seq sequencing technology was proposed in 2015 and has since developed into a number of different protocols, which in turn created the need to analyze and interpret scATAC-seq sequencing data.
The primary purpose of scATAC-seq data analysis is to recover, from the sequencing results, the main cell populations or the developmental differentiation pathways in a mixed biological sample. However, scATAC-seq is still a relatively new technique, and the signal-to-noise ratio of its data is low. scATAC-seq data analysis therefore needs an easy-to-use analysis method that recovers as much cellular heterogeneity information as possible. On the one hand, the currently published scATAC-seq data analysis methods do not offer a complete, easy-to-use analysis workflow that runs from fastq files through clustering, visualization, and developmental path reconstruction. On the other hand, when evaluated on gold-standard test datasets, i.e., test datasets in which the subpopulation of each cell, or its position in the developmental differentiation pathway, is known, the existing methods still recover this information poorly (as assessed by ARI) and need improvement. As a result, there is currently no industry-standard analysis method for scATAC-seq data.
In the prior art, the following three analysis methods exist: chromVAR, LSI and Cicero.
In the chromVAR method, the input data are a cell × peak reading matrix and the sequence information of each peak. From these, the method constructs a cell × transcription factor preference score matrix and uses that matrix for information recovery.
In the LSI method, the input data is a cell × peak reading matrix. The method transforms the matrix with the TF-IDF algorithm (TF means term frequency; IDF means inverse document frequency) and then performs information recovery on the new matrix.
In the Cicero method, the input data are a cell × peak reading matrix and the position of each peak on the chromosome. Peaks are merged according to their chromatin positions, and downstream information recovery is then performed on the resulting matrix.
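To make the TF-IDF step of the LSI approach concrete, the following is a minimal sketch in Python with NumPy. It is not taken from any LSI implementation: the binarization step, the exact weighting formula (log(1 + N / n_j)), and the function name tfidf are assumptions chosen for illustration, and published pipelines use several variants.

```python
import numpy as np


def tfidf(cell_by_peak):
    """Weight a cell x peak reading matrix (rows = cells) by TF-IDF."""
    counts = np.asarray(cell_by_peak, dtype=float)
    binary = (counts > 0).astype(float)  # assumption: binarize before weighting

    # Term frequency: share of the cell's open peaks taken by each peak.
    open_per_cell = binary.sum(axis=1, keepdims=True)
    tf = np.divide(binary, open_per_cell,
                   out=np.zeros_like(binary), where=open_per_cell > 0)

    # Inverse document frequency: peaks open in few cells weigh more.
    n_cells = binary.shape[0]
    cells_per_peak = np.maximum(binary.sum(axis=0), 1.0)
    idf = np.log(1.0 + n_cells / cells_per_peak)

    return tf * idf


# Hypothetical toy matrix: 3 cells x 3 peaks.
readings = np.array([[1, 0, 2],
                     [0, 0, 1],
                     [3, 1, 0]])
weighted = tfidf(readings)
```

In the toy matrix, the peak that is open in only one cell receives a larger weight than the peaks shared by two cells, which is the effect the transform is meant to produce.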
Disclosure of Invention
In view of the above, the present invention provides a complete, easy-to-use method and system for analyzing scATAC-seq data of biological samples that efficiently recovers cell heterogeneity information.
In order to achieve the above object, in one aspect, the present invention provides a method for analyzing single-cell chromatin accessibility sequencing data based on peak clustering, comprising:
aligning the single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result, calling peaks on the basis of the alignment result, and counting the readings in each peak to obtain a cell × peak reading matrix;
calculating mathematical distances between peaks in the cell × peak reading matrix, clustering the peaks, and merging the cell × peak reading matrix into a cell × accesson reading matrix, wherein an accesson is a cluster of peaks.
In some embodiments, the method further comprises reducing the dimensionality of the cell × accesson reading matrix to a two-dimensional visualization matrix; preferably, the dimensionality reduction method comprises PCA, t-SNE, or UMAP.
In some embodiments, the method further comprises clustering cells according to the cell × accesson reading matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.
In some embodiments, the method further comprises constructing the pseudotime trajectory of the cell development path using the cell × accesson reading matrix; preferably, the algorithm used comprises SPRING or Monocle.
In another aspect, the invention provides a single-cell chromatin accessibility sequencing data analysis system based on peak clustering, which comprises a preprocessing module and an accesson construction module;
the preprocessing module comprises: a) an alignment unit for aligning the single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain alignment results; b) a peak calling unit for merging the alignment results of all single cells and then calling peaks; c) a reading calculation unit for counting the readings in each peak to obtain a cell × peak reading matrix;
the accesson construction module comprises: a) a peak distance calculation unit for calculating mathematical distances between peaks in the cell × peak reading matrix; b) a peak clustering unit for clustering peaks according to the mathematical distances between them; c) a matrix conversion unit for merging the cell × peak reading matrix into a cell × accesson reading matrix, wherein an accesson is a cluster of peaks.
In some embodiments, the system further comprises a visualization module for reducing the dimensionality of the cell × accesson reading matrix to a two-dimensional visualization matrix; preferably, the dimensionality reduction method comprises PCA, t-SNE, or UMAP.
In some embodiments, the system further comprises a cell clustering module for clustering cells according to the cell × accesson reading matrix; preferably, the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.
In some embodiments, the system further comprises a cell development path reconstruction module for constructing the pseudotime trajectory of the cell development path using the cell × accesson reading matrix; preferably, the algorithm used comprises SPRING or Monocle.
In some embodiments, the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance.
In some embodiments, the method of peak clustering comprises KNN, DBSCAN, or K-means.
In some embodiments, merging the cell × peak reading matrix into the cell × accesson reading matrix comprises taking the sum, the mean, the median, or the variance of the peak readings within each accesson.
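The four merging options named above can be sketched as a single function that collapses the peak columns belonging to each accesson. This is an illustrative sketch only: the function name merge_peaks, the REDUCERS table, and the toy matrix are hypothetical, and the variance shown is NumPy's default population variance.

```python
import numpy as np

# The four merging options named in the text, mapped to NumPy reducers.
REDUCERS = {"sum": np.sum, "mean": np.mean, "median": np.median, "variance": np.var}


def merge_peaks(cell_by_peak, accesson_of_peak, how="sum"):
    """Collapse a cell x peak matrix into a cell x accesson matrix.

    accesson_of_peak[j] is the accesson label of peak (column) j.
    Returns the sorted accesson labels and the merged matrix.
    """
    matrix = np.asarray(cell_by_peak, dtype=float)
    labels = np.asarray(accesson_of_peak)
    reduce = REDUCERS[how]
    accessons = np.unique(labels)
    columns = [reduce(matrix[:, labels == a], axis=1) for a in accessons]
    return accessons, np.column_stack(columns)


# Hypothetical toy data: 2 cells x 5 peaks.
# Peaks 2, 3 and 5 belong to accesson 1; peaks 1 and 4 belong to accesson 2.
readings = np.array([[1, 0, 2, 0, 1],
                     [0, 3, 0, 1, 0]])
labels = [2, 1, 1, 2, 1]

ids, merged_sum = merge_peaks(readings, labels, how="sum")
_, merged_mean = merge_peaks(readings, labels, how="mean")
```

With how="sum" each accesson column is the total reading of its member peaks, so the overall number of counted reads is preserved; the other reducers trade that property for robustness to accesson size.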
In yet another aspect, the present invention also provides a single-cell chromatin accessibility sequencing data analysis device based on peak clustering, including:
a processor;
a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method.
In yet another aspect, the present invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides the first scATAC-seq data analysis method and system that runs from fastq files through clustering, visualization, and developmental path reconstruction;
the invention provides an accesson construction method based on peak clustering, which serves as the key module of scATAC-seq data analysis. The transformed cell × accesson reading matrix is used for subsequent clustering, visualization, and cell development path reconstruction. On gold-standard test datasets, the grouping effect (measured by ARI) is statistically significantly better than that of the existing methods.
Drawings
FIG. 1 is a schematic diagram of an accesson construction and downstream analysis based on peak clustering in an embodiment of the present invention;
FIG. 2 is a graph showing the relationship between the number of accessons and the clustering effect measured by ARI (gold-standard test dataset 1);
FIG. 3 shows the scATAC-seq data of human leukemia cells and related lineage cells according to an embodiment of the present invention: A. data clustering (hierarchical clustering) and B. visualization (t-SNE);
FIG. 4 shows the scATAC-seq data of the human hematopoietic stem cell developmental differentiation lineage according to an embodiment of the present invention: developmental path reconstruction (Monocle);
FIG. 5 shows the mouse forebrain nerve cell scATAC-seq data in an embodiment of the present invention: data clustering (KNN) and visualization (t-SNE);
FIGS. 6A-6D show the mouse thymus T cell scATAC-seq data in an embodiment of the present invention: data clustering (Louvain, hierarchical clustering), visualization (t-SNE), and developmental path reconstruction (Monocle);
FIG. 7 compares the clustering effect and running time of the present invention with those of the prior-art methods (gold-standard test dataset 1);
FIG. 8 compares the clustering effect and running time of the present invention with those of the prior-art methods (gold-standard test dataset 2).
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
For ease of understanding, the technical terms used herein are explained together here and are not explained again below.
Cell: the basic unit of life activity in mammals (e.g., humans, mice), and often the site where diseases originate; examples include nerve cells, epithelial cells, and tumor cells.
Cell heterogeneity: biological tissue samples (e.g., tumor tissue, brain tissue) are composed of a large number of cells whose physiological functions differ. Cellular heterogeneity commonly takes two forms: 1) the constituent cells form several well-defined cell populations (discrete); 2) the constituent cells lie along a continuous cell differentiation pathway (continuous).
Genome: the complete DNA sequence of an organism, an ordered arrangement of the four bases A, T, C, and G. The genomes of major mammals such as human and mouse have been completely sequenced.
Gene: genes (genetic factors) are all DNA sequences required to produce one polypeptide chain or functional RNA. A gene is typically one or more stretches of DNA on the genome.
Transcription factor: a protein that binds DNA and initiates or regulates gene expression. Binding is often accomplished by recognizing a specific DNA sequence pattern (motif).
Chromatin: a linear complex in the nucleus consisting of DNA, histones, non-histone proteins, and a small amount of RNA. Its basic unit is the nucleosome, formed by DNA wound around histones.
Chromatin accessibility: a measure of whether a stretch of DNA is wound around histones. In general, there are two cases: 1) the DNA is tightly wound around nucleosomes, called closed DNA; 2) the DNA is not wound around nucleosomes and is exposed, called open DNA.
Chromatin accessibility sequencing (ATAC-seq): a sequencing technology developed at Stanford University in 2012 for detecting the chromatin accessibility of biological samples (> 500 cells).
TCGA: The Cancer Genome Atlas. It contains several types of sequencing data from cancer tissue and normal tissue, covering 33 different cancers and 11,000 patients.
Single-cell chromatin accessibility sequencing (scATAC-seq): sequencing methods for detecting the chromatin accessibility of individual cells. Several exist, including single-nucleus chromatin accessibility sequencing (snATAC-seq), single-cell combinatorial indexing chromatin accessibility sequencing (sciATAC-seq), and flow-cytometry-based single-cell chromatin accessibility sequencing (FACS scATAC-seq).
Short sequences (sequencing reads): the DNA fragment sequences obtained by sequencing.
Alignment (mapping): comparing the short sequences with known genome information to find the position of each short sequence on the genome.
Peak calling: searching the alignment results for the positions where DNA is open; each such position is called a peak and is given a number.
Reading: the number of short sequences in each peak for each sample.
Accesson: the result of the peak clustering proposed by the invention, i.e., a group of peaks that cluster together. For example, accesson 1 = peak 2, peak 3, peak 5; accesson 2 = peak 1, peak 4.
ARI (Adjusted Rand index) is an evaluation index commonly used in clustering algorithms for evaluating the consistency of the algorithm clustering results with the actual clustering results.
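For readers unfamiliar with ARI, the following self-contained sketch computes it from two label lists using the standard pair-counting formula. The function name and the toy labels are illustrative; returning 1.0 for degenerate inputs (for example, both partitions putting every cell in one cluster) is a convention chosen for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Adjusted Rand index between two clusterings of the same cells."""
    n = len(true_labels)
    if n != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: nothing to disagree about

    # Pairs of cells grouped together in both, in the truth, and in the prediction.
    together_both = sum(comb(c, 2)
                        for c in Counter(zip(true_labels, predicted_labels)).values())
    together_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted_labels).values())

    expected = together_true * together_pred / total_pairs  # chance agreement
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (together_both - expected) / (maximum - expected)


# Same grouping under different label names scores 1.0.
perfect = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
# A grouping that splits every true pair scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 for identical partitions regardless of how clusters are named, is close to 0 for random assignments, and can be negative when agreement is worse than chance.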
One embodiment of the present invention provides a single-cell chromatin accessibility (scATAC-seq) sequencing data analysis system based on peak clustering (hereinafter abbreviated as APEC), which comprises the following modules:
1) Preprocessing module: comprises a) an alignment unit for aligning fastq files (i.e., single-cell chromatin accessibility sequencing data) to the genome sequence to produce bam files; b) a peak calling unit for merging the bam files of all single-cell alignment results into a merge_bam file and calling peaks on that basis; c) a reading calculation unit for counting the reads in each peak and finally outputting a cell × peak reading matrix.
2) Accesson construction module: comprises a) a peak distance calculation unit for calculating mathematical distances between peaks (including but not limited to Euclidean distance, Pearson correlation coefficient, and cityblock distance) from the cell × peak reading matrix; b) a peak clustering unit for clustering the peaks by the mathematical distances between them, where each cluster of peaks is called an accesson and the clustering methods include but are not limited to KNN and DBSCAN; c) a matrix conversion unit for merging the cell × peak reading matrix into a cell × accesson reading matrix according to the accesson information, where the merging methods include but are not limited to taking the sum, mean, median, or variance of the peak readings within each accesson.
3) Visualization module: reduces the cell × accesson reading matrix to a two-dimensional visualization matrix using dimensionality reduction methods including but not limited to PCA, t-SNE, and UMAP.
4) Cell clustering module: clusters cells using the cell × accesson reading matrix, with clustering algorithms including but not limited to KNN clustering, kernel clustering, and Louvain clustering.
5) Cell development path reconstruction module: constructs the pseudotime trajectory of the cell development path from the cell × accesson reading matrix, using algorithms including but not limited to SPRING and Monocle.
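As an illustration of what the visualization module produces, the sketch below reduces a cell × accesson matrix to two dimensions with PCA, the first method the text lists, implemented through a singular value decomposition in NumPy. The function name pca_2d and the toy matrix are hypothetical; t-SNE and UMAP would require dedicated libraries and are not shown.

```python
import numpy as np


def pca_2d(cell_by_accesson):
    """Project cells onto their first two principal components."""
    x = np.asarray(cell_by_accesson, dtype=float)
    if min(x.shape) < 2:
        raise ValueError("need at least two cells and two accessons")
    centered = x - x.mean(axis=0)  # center each accesson column
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # one (x, y) coordinate pair per cell


# Hypothetical toy matrix: 5 cells x 3 accessons.
cells = np.array([[2, 0, 1],
                  [3, 1, 0],
                  [0, 4, 5],
                  [1, 5, 4],
                  [2, 2, 2]])
coords = pca_2d(cells)
```

Each row of coords is one cell's position in the two-dimensional visualization matrix and can be passed directly to a scatter plot.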
The following describes the use of APEC on four different gold-standard test datasets in embodiments of the invention, illustrating the versatility of APEC in analyzing scATAC-seq datasets from different biological samples. The datasets comprise: 1) human leukemia cells and related lineage cell scATAC-seq data; 2) human hematopoietic stem cell developmental differentiation lineage related scATAC-seq data; 3) mouse forebrain nerve cell scATAC-seq data; 4) mouse thymus T cell scATAC-seq data.
The analysis workflow using the peak-clustering-based scATAC-seq analysis system (APEC) of the present invention comprises the following steps:
1) Data input:
The input data are fastq files, in either of two formats: a) a separate fastq file for each cell; b) a single fastq file in which the cells are mixed together but can be split into per-cell data by a splitting rule given by the data provider, such as index sequences (e.g., splitting on the first 5-10 bases of each fastq read).
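The splitting rule for a mixed fastq file can be sketched as follows. The sketch assumes, purely for illustration, that the index occupies the first index_length bases of each read and should be trimmed from the sequence and quality strings; real protocols may instead store the index in a separate read or in the header, so the actual rule must come from the data provider. The function name and the toy records are hypothetical.

```python
from collections import defaultdict
from itertools import islice


def split_fastq_by_index(lines, index_length=8):
    """Group FASTQ records by the first `index_length` bases of each read.

    `lines` is any iterable of text lines (e.g. an open file). Returns a dict
    mapping each index sequence to a list of (header, sequence, plus, quality)
    tuples with the index trimmed off. A trailing incomplete record is ignored.
    """
    per_cell = defaultdict(list)
    stream = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(stream, 4)]
        if len(record) < 4:
            break
        header, sequence, plus, quality = record
        index = sequence[:index_length]
        per_cell[index].append(
            (header, sequence[index_length:], plus, quality[index_length:]))
    return dict(per_cell)


# Hypothetical mixed file with a 5-base index at the start of each read.
mixed = [
    "@r1", "AAAAACGTACGT", "+", "IIIIIIIIIIII",
    "@r2", "CCCCCTTGGAAC", "+", "IIIIIIIIIIII",
    "@r3", "AAAAAGGGTTTA", "+", "IIIIIIIIIIII",
]
cells = split_fastq_by_index(mixed, index_length=5)
```

Here reads r1 and r3 share the index AAAAA and end up in the same cell, while r2 forms a second cell; each group could then be written to its own per-cell fastq file.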
2) Data preprocessing:
The alignment unit can align the input data to the genome of the relevant organism, e.g., datasets 1 and 2 to the human genome and datasets 3 and 4 to the mouse genome, or to a genome specified by the data provider. Alignment produces bam files that record where the reads in each fastq file align on the genome. Processing the bam files with the peak calling unit defines the chromatin opening sites in the biological sample, and the reading calculation unit then yields the m × n reading matrix, i.e., the reading of each of the m cells in each of the n peaks.
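The reading calculation step can be illustrated with a small sketch that counts, for each cell, how many aligned reads fall inside each peak. It makes several simplifying assumptions that are not from the source: each read is summarized by a single (chromosome, position) pair, peaks are non-overlapping half-open intervals, and the alignments have already been extracted from the bam files into Python lists. The function name is hypothetical.

```python
from bisect import bisect_right

import numpy as np


def count_reads_in_peaks(reads_per_cell, peaks):
    """Build the m x n reading matrix (m cells, n peaks).

    reads_per_cell: one list per cell of (chrom, position) tuples.
    peaks: list of (chrom, start, end) with start <= position < end,
           assumed not to overlap each other.
    """
    # Index the peaks of each chromosome by sorted start coordinate.
    by_chrom = {}
    for column, (chrom, start, end) in enumerate(peaks):
        by_chrom.setdefault(chrom, []).append((start, end, column))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {chrom: [s for s, _, _ in iv] for chrom, iv in by_chrom.items()}

    matrix = np.zeros((len(reads_per_cell), len(peaks)), dtype=int)
    for row, reads in enumerate(reads_per_cell):
        for chrom, position in reads:
            if chrom not in by_chrom:
                continue
            # Last peak whose start is <= position, if any.
            i = bisect_right(starts[chrom], position) - 1
            if i >= 0:
                _, end, column = by_chrom[chrom][i]
                if position < end:
                    matrix[row, column] += 1
    return matrix


peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
reads_per_cell = [
    [("chr1", 120), ("chr1", 199), ("chr1", 200), ("chr2", 60)],  # cell 0
    [("chr1", 550), ("chr3", 10)],                                # cell 1
]
reading_matrix = count_reads_in_peaks(reads_per_cell, peaks)
```

In the example, the read at position 200 lies just outside the first peak and the read on chr3 matches no peak, so neither is counted.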
3) Accesson construction:
FIG. 1 is a schematic diagram of accesson construction and downstream analysis based on peak clustering in an embodiment of the present invention. In accesson construction, the m × n reading matrix is first passed into the accesson construction module.
In the peak distance calculation unit, the distance between peaks can be calculated using the Euclidean distance (as in datasets 1, 2, 3, and 4); other commonly used vector distance measures, such as the Pearson correlation coefficient or the cityblock distance, may also be used.
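The three distance options can be sketched with NumPy by treating each peak as a vector of its readings across cells. The function name is hypothetical, and turning the Pearson correlation coefficient r into a distance as 1 - r is an assumption of this sketch, since the text names the coefficient but not the conversion. The broadcasting used here builds an n × n × m array, so it is only suitable for small examples.

```python
import numpy as np


def peak_distances(cell_by_peak, metric="euclidean"):
    """Pairwise distances between peaks (columns) of a cell x peak matrix."""
    peaks = np.asarray(cell_by_peak, dtype=float).T  # one row per peak
    if metric in ("euclidean", "cityblock"):
        diff = peaks[:, None, :] - peaks[None, :, :]
        if metric == "euclidean":
            return np.sqrt((diff ** 2).sum(axis=2))
        return np.abs(diff).sum(axis=2)
    if metric == "pearson":
        # Assumption: distance = 1 - r, so perfectly correlated peaks are at 0.
        # Peaks with constant readings have undefined correlation (NaN).
        return 1.0 - np.corrcoef(peaks)
    raise ValueError(f"unknown metric: {metric}")


# Hypothetical toy matrix: 3 cells x 3 peaks; peak 2 is twice peak 0.
readings = np.array([[1, 0, 2],
                     [3, 0, 6],
                     [0, 4, 0]])
d_euclidean = peak_distances(readings, "euclidean")
d_cityblock = peak_distances(readings, "cityblock")
d_pearson = peak_distances(readings, "pearson")
```

The example shows why the choice matters: peaks 0 and 2 differ in magnitude, so their Euclidean and cityblock distances are nonzero, yet they are perfectly correlated, so their Pearson-based distance is essentially zero.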
In the peak clustering unit, peaks can be clustered into a specified number of accessons using the KNN algorithm (as in datasets 1, 2, 3, and 4). Other common vector clustering algorithms, such as DBSCAN or K-means, may also be used. The specified number of accessons does not affect the result over a wide range (FIG. 2); it therefore defaults to 2000 and can be adjusted for specific data.
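The text does not spell out how the KNN-based clustering is carried out, so the sketch below swaps in K-means, one of the alternatives it names, to show how peaks can be grouped into a specified number of accessons. The function name, the farthest-point initialization, and the toy matrix are assumptions made for illustration; this is not the APEC implementation.

```python
import numpy as np


def cluster_peaks_kmeans(cell_by_peak, n_accessons, max_iter=100):
    """Assign each peak (column) to one of n_accessons clusters with K-means."""
    peaks = np.asarray(cell_by_peak, dtype=float).T  # one row per peak
    if not 1 <= n_accessons <= len(peaks):
        raise ValueError("n_accessons must be between 1 and the number of peaks")

    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [peaks[0]]
    for _ in range(1, n_accessons):
        nearest = np.min([np.linalg.norm(peaks - c, axis=1) for c in centers], axis=0)
        centers.append(peaks[nearest.argmax()])
    centers = np.array(centers)

    labels = None
    for _ in range(max_iter):
        distances = np.linalg.norm(peaks[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for k in range(n_accessons):
            members = peaks[labels == k]
            if len(members):  # leave an empty cluster's center where it was
                centers[k] = members.mean(axis=0)
    return labels


# Hypothetical toy matrix: 4 cells x 6 peaks. Even-numbered peaks are open in
# cells 0-1, odd-numbered peaks in cells 2-3.
readings = np.array([[5, 0, 4, 0, 6, 0],
                     [4, 0, 5, 1, 5, 0],
                     [0, 6, 0, 5, 0, 4],
                     [1, 5, 0, 4, 0, 5]])
accesson_of_peak = cluster_peaks_kmeans(readings, n_accessons=2)
```

The returned array gives one accesson label per peak; peaks that are open in the same cells receive the same label. With thousands of peaks and the default of 2000 accessons, a library implementation would be the practical choice.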
In the matrix conversion unit, the accessons are first screened according to their basic properties, e.g., removing accessons that contain fewer peaks than a specified value, or accessons whose internal Gini coefficient is smaller than a specified value. Then, according to the accesson information, the cell × peak reading matrix is merged into the cell × accesson reading matrix by taking the sum of the peak readings within each accesson (as in datasets 1, 2, 3, and 4). Other simple vector statistics, such as the mean, median, or variance of the readings, can also be used.
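The screening step can be sketched as a filter over accesson labels. Two points are assumptions of this sketch rather than statements from the source: the coefficient-based criterion is read as a Gini coefficient, and that coefficient is computed on each accesson's summed readings across cells (a value near 0 means the accesson is equally open in every cell and so carries little heterogeneity information). The threshold values and function names are hypothetical.

```python
import numpy as np


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    x = np.sort(np.asarray(values, dtype=float))
    total = x.sum()
    if total == 0:
        return 0.0
    n = len(x)
    ranks = np.arange(1, n + 1)
    return float(2.0 * (ranks * x).sum() / (n * total) - (n + 1) / n)


def filter_accessons(cell_by_peak, accesson_of_peak, min_peaks=2, min_gini=0.1):
    """Return the accesson labels that pass both screening criteria."""
    matrix = np.asarray(cell_by_peak, dtype=float)
    labels = np.asarray(accesson_of_peak)
    kept = []
    for accesson in np.unique(labels):
        members = labels == accesson
        if members.sum() < min_peaks:
            continue  # too few peaks in this accesson
        signal = matrix[:, members].sum(axis=1)  # summed reading per cell
        if gini(signal) < min_gini:
            continue  # signal is nearly uniform across cells
        kept.append(int(accesson))
    return kept


# Hypothetical toy data: 4 cells x 5 peaks in three accessons.
readings = np.array([[3, 2, 1, 1, 4],
                     [4, 3, 1, 1, 0],
                     [0, 0, 1, 1, 0],
                     [0, 1, 1, 1, 0]])
labels = [0, 0, 1, 1, 2]
kept = filter_accessons(readings, labels)
```

In the example, accesson 1 is dropped because its summed signal is identical in every cell, and accesson 2 is dropped because it contains a single peak; only accesson 0 would go on to the merging step.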
4) Data clustering and visualization:
In this step, the visualization module can reduce the cell × accesson reading matrix to a two-dimensional visualization matrix, and/or the cell clustering module can cluster the cells, and/or the cell development path reconstruction module can construct the pseudotime trajectory of the cell development path.
FIG. 3 shows the scATAC-seq data of human leukemia cells and related lineage cells: A. data clustering (hierarchical clustering) and B. visualization (t-SNE);
FIG. 4 shows the scATAC-seq data of the human hematopoietic stem cell developmental differentiation lineage: developmental path reconstruction (Monocle);
FIG. 5 shows the mouse forebrain nerve cell scATAC-seq data: data clustering (KNN) and visualization (t-SNE);
FIGS. 6A-6D show the mouse thymus T cell scATAC-seq data, wherein FIG. 6A is Louvain clustering, FIG. 6B is hierarchical clustering, FIG. 6C is visualization (t-SNE), and FIG. 6D is developmental path reconstruction (Monocle).
It can be seen that the present invention covers the full workflow from fastq files through clustering, visualization, and developmental path reconstruction. On the gold-standard test datasets, its grouping effect (ARI) is statistically significantly better than that of the existing methods, as shown in FIGS. 7 and 8. Cell heterogeneity information can be recovered efficiently because constructing accessons acts as a filtering process that reduces noise and amplifies signal. In detail: 1) compared with LSI and chromVAR, the invention converts the originally sparse cell × peak matrix into a denser cell × accesson matrix, thereby reducing noise in subsequent analysis; 2) compared with the Cicero method, which merges peaks based on chromatin position, the invention clusters peaks by mathematical distance with a clustering algorithm and then merges them. Peaks clustered together in this way have similar patterns across cells, so the resulting accessons are more biologically meaningful; e.g., peaks within an accesson may be regulated by the same transcription factor or lie close together in the three-dimensional chromatin structure. The transformed cell × accesson matrix thus further amplifies the cell heterogeneity signal.
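The density argument in point 1) can be checked on synthetic data. The sketch below builds a hypothetical sparse binary cell × peak matrix, merges every 50 consecutive peaks by summation as a stand-in for real accessons, and compares the fraction of nonzero entries before and after. The matrix sizes, the 3% fill rate, and the arbitrary grouping are all assumptions; random data of this kind illustrates only the change in density, not the noise reduction or the biological meaning claimed for real accessons.

```python
import numpy as np


def density(matrix):
    """Fraction of entries that are nonzero."""
    matrix = np.asarray(matrix)
    return np.count_nonzero(matrix) / matrix.size


rng = np.random.default_rng(0)
n_cells, n_peaks, n_accessons = 200, 1000, 20

# Hypothetical sparse matrix: each cell-peak entry is nonzero with probability 0.03.
cell_by_peak = (rng.random((n_cells, n_peaks)) < 0.03).astype(int)

# Stand-in for peak clustering: 50 consecutive peaks per accesson.
accesson_of_peak = np.arange(n_peaks) // (n_peaks // n_accessons)

# Merge by summing the peak readings within each accesson.
cell_by_accesson = np.stack(
    [cell_by_peak[:, accesson_of_peak == a].sum(axis=1) for a in range(n_accessons)],
    axis=1)

density_before = density(cell_by_peak)
density_after = density(cell_by_accesson)
```

Because an accesson entry is nonzero whenever any of its member peaks is, density_after is necessarily at least as large as density_before, and on data this sparse it is much larger.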
The invention also provides a single-cell chromatin accessibility sequencing data analysis device based on peak clustering, which comprises:
a processor;
a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method.
The invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method.
It should be noted that each functional module/unit in the present invention may be hardware; for example, the hardware may be a circuit, including digital circuits, analog circuits, and so on. Physical implementations of the hardware structures include, but are not limited to, physical devices such as transistors and memristors. The data processing module may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. The storage unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, or HMC.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
The foregoing embodiments illustrate the objects, technical solutions, and advantages of the invention in further detail. They are specific embodiments only and are not intended to limit the invention; any modifications, equivalent substitutions, and improvements made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (18)

1. A method for analyzing single-cell chromatin accessibility sequencing data based on peak clustering, comprising:
aligning the single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain an alignment result, calling peaks on the basis of the alignment result, and counting the readings in each peak to obtain a cell × peak reading matrix;
calculating mathematical distances between peaks in the cell × peak reading matrix, clustering the peaks, and merging the cell × peak reading matrix into a cell × accesson reading matrix, wherein an accesson is a cluster of peaks;
the method further comprises constructing the pseudotime trajectory of the cell development path using the cell × accesson reading matrix.
2. The analysis method of claim 1, wherein the method further comprises reducing the dimensionality of the cell × accesson reading matrix to a two-dimensional visualization matrix.
3. The analysis method of claim 2, wherein the dimensionality reduction method comprises PCA, t-SNE, or UMAP.
4. The analysis method of claim 1, wherein the method further comprises clustering cells according to the cell × accesson reading matrix.
5. The analysis method of claim 1, wherein the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.
6. The analysis method of claim 1, wherein the algorithm used in constructing the pseudotime trajectory of the cell development path comprises SPRING or Monocle.
7. The analysis method of any one of claims 1-6, wherein the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance; the peak clustering method comprises KNN, DBSCAN, or K-means.
8. The analysis method of claim 1, wherein merging the cell × peak reading matrix into the cell × accesson reading matrix comprises taking the sum, the mean, the median, or the variance of the peak readings within each accesson.
9. A single-cell chromatin accessibility sequencing data analysis system based on peak clustering, comprising a preprocessing module and an accesson construction module;
the preprocessing module comprises: a) an alignment unit for aligning single-cell chromatin accessibility sequencing data to the genome data of the corresponding biological sample to obtain alignment results; b) a peak calling unit for merging the alignment results of all single cells and then calling peaks; c) a reading calculation unit for counting the readings in each peak to obtain a cell × peak reading matrix;
the accesson construction module comprises: i) a peak distance calculation unit for calculating mathematical distances between peaks in the cell × peak reading matrix; ii) a peak clustering unit for clustering peaks according to the mathematical distances between them; iii) a matrix conversion unit for merging the cell × peak reading matrix into a cell × accesson reading matrix, wherein an accesson is a cluster of peaks;
the system further comprises a cell development path reconstruction module for constructing the pseudotime trajectory of the cell development path using the cell × accesson reading matrix.
10. The analysis system of claim 9, wherein the system further comprises a visualization module for reducing the dimensionality of the cell × accesson reading matrix to a two-dimensional visualization matrix.
11. The analysis system of claim 10, wherein the dimensionality reduction method comprises PCA, t-SNE, or UMAP.
12. The analysis system of claim 9, wherein the system further comprises a cell clustering module for clustering cells according to the cell × accesson reading matrix.
13. The analysis system of claim 12, wherein the clustering algorithm comprises KNN clustering, kernel clustering, or Louvain clustering.
14. The analysis system of claim 9, wherein the algorithm used in constructing the pseudotime trajectory of the cell development path comprises SPRING or Monocle.
15. The analysis system of claim 9 or 10, wherein the mathematical distance comprises a Euclidean distance, a Pearson correlation coefficient, or a cityblock distance; the peak clustering method comprises KNN, DBSCAN, or K-means.
16. The analysis system of claim 9, wherein merging the cell × peak reading matrix into the cell × accesson reading matrix comprises taking the sum, the mean, the median, or the variance of the peak readings within each accesson.
17. A single cell chromatin accessibility sequencing data analysis device based on peak clustering, comprising:
a processor;
a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the analysis method of any of claims 1-8.
18. A computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the analysis method of any one of claims 1-8.
CN201910256667.0A 2019-03-29 2019-03-29 Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering Active CN111755071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910256667.0A CN111755071B (en) 2019-03-29 2019-03-29 Single-cell chromatin accessibility sequencing data analysis method and system based on peak clustering

Publications (2)

Publication Number Publication Date
CN111755071A (en) 2020-10-09
CN111755071B (en) 2023-04-21

Family

ID=72672727

Country Status (1)

Country Link
CN (1) CN111755071B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240130

Address after: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96

Patentee after: University of Science and Technology of China

Country or region after: China

Patentee after: Qu Kun

Address before: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96

Patentee before: University of Science and Technology of China

Country or region before: China