CN111951896A - Chromatin accessibility data analysis method based on clinical samples - Google Patents

Chromatin accessibility data analysis method based on clinical samples

Info

Publication number
CN111951896A
Authority
CN
China
Prior art keywords
analysis
chromatin
sample
clinical
sequencing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010843055.4A
Other languages
Chinese (zh)
Other versions
CN111951896B (en)
Inventor
方靖文
瞿昆
李杨
朱连邦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hanyin Life Technology Co Ltd
Original Assignee
Hangzhou Hanyin Life Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hanyin Life Technology Co Ltd filed Critical Hangzhou Hanyin Life Technology Co Ltd
Priority to CN202010843055.4A priority Critical patent/CN111951896B/en
Publication of CN111951896A publication Critical patent/CN111951896A/en
Application granted granted Critical
Publication of CN111951896B publication Critical patent/CN111951896B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application relates to a chromatin accessibility data analysis method based on clinical samples. Starting from the study of ATAC-seq sequencing data and aiming at application to clinical samples, it constructs a multifunctional ATAC-seq sequencing data analysis pipeline. The pipeline includes not only the preliminary analysis of the ATAC-seq sequencing data of a single clinical sample but also the comprehensive analysis of the ATAC-seq sequencing data of multiple groups of clinical samples, and further includes the transcription factor analysis, copy number variation analysis and other analyses required by clinical research. The invention analyzes chromatin accessibility data across multiple clinical sample groups and is helpful for finding the corresponding clinical biomarkers and for guiding medication for clinical diseases.

Description

Chromatin accessibility data analysis method based on clinical samples
Technical Field
The application relates to the technical field of ATAC-seq, in particular to a chromatin accessibility data analysis method based on clinical samples.
Background
Chromatin accessibility directly provides information on the binding of RNA polymerase and transcription factors to DNA, on enhancers, and on a variety of other factors that regulate gene transcription. In recent years, ATAC-seq has been increasingly used in biological and clinical research, for example in studies of the dynamic regulation of the chromatin opening state, of cell differentiation and development, and in personalized studies of healthy individuals and patients. ATAC-seq achieves higher sensitivity with fewer cells and has therefore become the most widely used assay for chromatin accessibility.
However, as ATAC-seq has become popular, analysis methods for the data output by ATAC-seq sequencing have lagged behind. Existing conventional methods only preprocess and integrate the data output after ATAC-seq sequencing and lack downstream analysis; in particular, there is no method for comprehensively analyzing ATAC-seq sequencing data across multiple sample groups of clinical samples.
Disclosure of Invention
Based on this, there is a need to provide a chromatin accessibility data analysis method based on clinical samples, in response to the problem of lack of analysis methods available for comprehensive analysis of ATAC-seq sequencing data among multiple sample sets of clinical samples.
The present application provides a method for chromatin accessibility data analysis based on clinical samples, comprising:
setting a plurality of sample groups, wherein each sample group comprises a plurality of clinical samples;
obtaining an original sequencing file output after each clinical sample is subjected to ATAC-seq sequencing;
performing data processing on the original sequencing file, performing quality control analysis on the original sequencing file subjected to data processing to generate a sequencing quality control analysis result, and visualizing the sequencing quality control analysis result;
acquiring an open area of each sample group, and performing information annotation on the chromatin open area of each sample group;
performing difference analysis, cluster analysis and similarity analysis among a plurality of sample groups according to the open area of each sample group;
according to the analysis result of the previous step, carrying out enrichment analysis on the transcription factors, and searching for the enriched transcription factors;
selecting a transcription factor related to a preset research direction, and carrying out binding footprint analysis on the transcription factor;
deconvoluting the original sequencing file of each clinical sample to obtain the percentage of the cell number of different types of cells in each clinical sample in the total number of cells;
and performing CNV analysis on the original sequencing file of each clinical sample to obtain DNA fragment difference information among different clinical samples, and visualizing the DNA fragment difference information among different clinical samples.
The application relates to a chromatin accessibility data analysis method based on clinical samples. Starting from the study of ATAC-seq sequencing data and aiming at application to clinical samples, it constructs a multifunctional ATAC-seq sequencing data analysis pipeline. The pipeline includes not only the preliminary analysis of the ATAC-seq sequencing data of a single clinical sample but also the comprehensive analysis of the ATAC-seq sequencing data of multiple groups of clinical samples, and further includes the transcription factor analysis, copy number variation analysis and other analyses required by clinical research. The invention analyzes chromatin accessibility data across multiple clinical sample groups and is helpful for finding the corresponding clinical biomarkers and for guiding medication for clinical diseases.
Drawings
FIG. 1 is a flow chart of a method for clinical specimen-based analysis of chromatin accessibility data provided in an embodiment of the present application;
FIG. 2 is a graph of a transcriptional start site enrichment analysis in a clinical sample-based chromatin accessibility data analysis method provided in an embodiment of the present application;
FIG. 3 is a graph of sequencing fragment distribution analysis in a method of chromatin accessibility data analysis based on a clinical sample according to an embodiment of the present application;
FIG. 4 is a graph illustrating chromatin alignment in a clinical sample-based chromatin accessibility data analysis method provided in an embodiment of the present application;
FIG. 5 is a chromatin opening region annotation diagram in a clinical sample-based chromatin accessibility data analysis method provided in an embodiment of the present application;
FIG. 6 is a graph showing regions of intergroup variability in a method of chromatin accessibility data analysis based on a clinical sample according to an embodiment of the present application;
FIG. 7 is a graph of cluster analysis of intergroup differential open area in a method of chromatin accessibility based on clinical samples according to an embodiment of the present application;
FIG. 8 is a graph of inter-sample cluster analysis in a method of chromatin accessibility data analysis based on clinical samples according to an embodiment of the present application;
FIG. 9 is a graph of a test for similarity between groups in a method of chromatin accessibility based on a clinical sample according to an embodiment of the present application;
FIG. 10 is a graph of a transcription factor enrichment analysis in a clinical sample-based chromatin accessibility data analysis method provided by an embodiment of the present application;
FIG. 11 is a graph of transcription factor enrichment score cluster analysis in a clinical sample-based chromatin accessibility data analysis method provided by an embodiment of the present application;
FIG. 12 is a heat map of sequencing fragments of a selected region of a transcription factor motif in a method of chromatin accessibility data analysis based on a clinical sample according to an embodiment of the present application;
FIG. 13 is a graph of a cell type ratio deconvolution analysis in a method of chromatin accessibility based on a clinical sample according to an embodiment of the present application;
FIG. 14 is a graph of copy number variation analysis in a method of clinical specimen-based chromatin accessibility data analysis provided by an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The present application provides a method for chromatin accessibility data analysis based on clinical samples. The chromatin accessibility data analysis method based on the clinical sample is applied to analysis of data output after ATAC-seq sequencing.
In addition, the executing entity of the clinical sample-based chromatin accessibility data analysis method provided herein is not limited. Optionally, the executing entity may be a chromatin accessibility data analysis terminal.
As shown in FIG. 1, in an embodiment of the present application, the method for analyzing chromatin accessibility data based on clinical samples includes the following steps S100 to S900:
S100, setting a plurality of sample groups. Each sample group includes a plurality of clinical samples.
S200, obtaining an original sequencing file output after each clinical sample is subjected to ATAC-seq sequencing.
S300, performing data processing on the original sequencing file. Further, performing quality control analysis on the processed original sequencing file to generate a sequencing quality control analysis result, and visualizing the sequencing quality control analysis result. See FIG. 4 for the visualization result.
S400, acquiring the open regions of each sample group, and performing information annotation on the chromatin open regions of each sample group.
S500, performing difference analysis, cluster analysis and similarity analysis among the plurality of sample groups according to the open regions of each sample group.
S600, carrying out enrichment analysis on the transcription factors according to the analysis result of step S500, and searching for the enriched transcription factors.
S700, selecting the transcription factors related to the preset research direction, and carrying out binding footprint analysis on the transcription factors.
S800, deconvoluting the original sequencing file of each clinical sample to obtain the percentage of each cell type in the total number of cells in each clinical sample.
S900, performing CNV analysis on the original sequencing file of each clinical sample to obtain DNA fragment difference information among different clinical samples, and visualizing the DNA fragment difference information among different clinical samples.
Specifically, the number of sample groups is not limited, and the number of clinical samples included in each sample group is not limited. Optionally, one sample group may be set as a control group and the other sample groups as experimental groups.
In step S900, the principle of the CNV analysis is to use the background noise in the original sequencing file to define the average coverage at different DNA fragment positions in chromatin, and to evaluate, from the average coverage, the copy number variation of different chromatin and of different DNA fragments within it. The CNV analysis in step S900 makes it possible to explore DNA fragment differences between samples, provides a reference for the diagnosis of some clinical symptoms, indirectly mines more clinically relevant information from the original sequencing file, and avoids additional clinical tests. The visualized result in step S900 is a copy number variation analysis chart.
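The coverage-based idea behind step S900 can be sketched as follows. This is only an illustration under stated assumptions: the function name, the fixed bin size, the pseudocount, and the choice of the median bin as the baseline are hypothetical and are not specified in the application, and the caller is assumed to pass only background reads (reads outside chromatin open regions).

```python
import numpy as np

def binned_log2_ratio(read_starts, chrom_length, bin_size=1_000_000, pseudo=1.0):
    """Log2 ratio of per-bin background coverage to the median bin coverage.

    read_starts: start positions of background reads on one chromosome.
    Bins well above 0 suggest a copy number gain, bins well below 0 a loss.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    bins = np.asarray(read_starts, dtype=int) // bin_size
    counts = np.bincount(bins, minlength=n_bins).astype(float)
    # The pseudocount keeps empty bins finite after the logarithm.
    return np.log2((counts + pseudo) / (np.median(counts) + pseudo))
```

Comparing these per-bin ratios between clinical samples gives one possible view of the DNA fragment differences mentioned above.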
In this embodiment, a multifunctional ATAC-seq sequencing data analysis pipeline is constructed, starting from the study of ATAC-seq sequencing data and aiming at application to clinical samples. The pipeline includes not only the preliminary analysis of the ATAC-seq sequencing data of a single clinical sample but also the comprehensive analysis of the ATAC-seq sequencing data of multiple groups of clinical samples, and further includes the transcription factor analysis, copy number variation analysis and other analyses required by clinical research. The invention analyzes chromatin accessibility data across multiple clinical sample groups and is helpful for finding the corresponding clinical biomarkers and for guiding medication for clinical diseases.
In an embodiment of the present application, the step S300 includes the following steps S310 to S360:
S310, selecting the original sequencing file of a clinical sample, and removing the adapter sequences from it. Further, performing chromatin alignment and format conversion on the adapter-trimmed original sequencing file to generate a sequencing information file.
S320, performing transcription start site enrichment analysis on the sequencing information file to generate a transcription start site enrichment analysis map of the clinical sample, shown in FIG. 2, and performing sequencing fragment distribution analysis on the sequencing information file to generate a sequencing fragment distribution analysis map of the clinical sample, shown in FIG. 3.
S330, generating a chromatin alignment result visualization graph based on the chromatin alignment result.
S340, repeating steps S310 to S330 to generate a sequencing information file, a transcription start site enrichment analysis map, a sequencing fragment distribution analysis map and a chromatin alignment result visualization graph for each clinical sample.
S350, merging the sequencing information files of the clinical samples in a sample group into a group sequencing information file. Further, performing transcription start site enrichment analysis and sequencing fragment distribution analysis on the group sequencing information file to generate a transcription start site enrichment analysis map and a sequencing fragment distribution analysis map of the sample group.
S360, repeating step S350 to generate a group sequencing information file, a transcription start site enrichment analysis map and a sequencing fragment distribution analysis map for each sample group.
Specifically, the original sequencing file is a FASTQ-format file comprising a plurality of sequencing fragments. Optionally, in step S310, the FASTQ-format original sequencing file is converted using one or more of Bowtie2, Picard, BedTools, AWK and SamTools, generating a sequencing information file in BAM format. Optionally, during the format conversion, the start position of each sequencing read is shifted by +4 bp or -5 bp (the offsets commonly applied to forward-strand and reverse-strand ATAC-seq reads, respectively), and the read is extended to 50 bp with the shifted start as its midpoint, to eliminate the ATAC bias.
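The read shifting and extension described above can be illustrated with a small sketch. It assumes that the "± 4/5" offset means the usual +4 bp (forward strand) and -5 bp (reverse strand) correction, and that coordinates are 0-based and half-open; the function name and signature are hypothetical.

```python
def shift_and_extend(start, end, strand, width=50):
    """Return a `width`-bp interval centred on the shifted read start.

    Assumption: forward reads are shifted by +4 bp from their start
    coordinate and reverse reads by -5 bp from their end coordinate.
    """
    if strand == "+":
        cut = start + 4
    elif strand == "-":
        cut = end - 5
    else:
        raise ValueError("strand must be '+' or '-'")
    # Centre the extended interval on the shifted position, clipping at 0.
    new_start = max(0, cut - width // 2)
    return new_start, new_start + width
```

In practice this adjustment would be applied to every aligned read while the BAM-format sequencing information file is produced.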
FIG. 2 shows the transcription start site enrichment analysis. As shown in FIG. 2, the center of the abscissa is the enrichment center of the transcription start site; the ordinate value there is larger than elsewhere, which shows that the transcription start site enrichment analysis was successful.
FIG. 3 shows the sequencing fragment distribution analysis, from which the density distribution of the sequencing fragments of each clinical sample can be observed.
In the transcription start site enrichment analysis and the sequencing fragment distribution analysis, the following quantities need to be obtained or calculated for each clinical sample: the average number of sequencing reads at each base position within 4000 bp around the gene start site, the length distribution of all detected sequencing fragments, the total number of reads detected, the number and proportion of reads that can be mapped back to the DNA, the number of duplicate reads, the number of low-quality reads, the proportion of reads falling on mitochondrial DNA, the proportion of reads falling in the sequencing blacklist, and the proportion of duplicate reads.
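As an illustration of the first quantity in the list above, the per-base average read coverage in a 4000 bp window around transcription start sites could be computed roughly as follows. This is a simplified sketch: it handles a single chromosome, ignores gene strand, and the function name and input layout are assumptions.

```python
import numpy as np

def tss_profile(reads, tss_sites, flank=2000):
    """Average per-base read coverage in windows of 2*flank bp centred on each TSS.

    reads: (start, end) half-open intervals on one chromosome.
    tss_sites: transcription start site positions on the same chromosome.
    """
    profile = np.zeros(2 * flank, dtype=float)
    for tss in tss_sites:
        win_start = tss - flank
        for start, end in reads:
            # Clip the read to the window and convert to window coordinates.
            lo = max(start, win_start) - win_start
            hi = min(end, tss + flank) - win_start
            if lo < hi:
                profile[lo:hi] += 1
    return profile / max(len(tss_sites), 1)
```

A profile that peaks at the center of the window corresponds to the shape described for FIG. 2.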
The visualization chart of the chromatin alignment result is fig. 4, and fig. 4 can show the result of the chromatin alignment.
After step S340 has generated the sequencing information file, the transcription start site enrichment analysis map and the sequencing fragment distribution analysis map of each clinical sample, the sequencing information files of the clinical samples in a sample group are also merged into one group sequencing information file, and a transcription start site enrichment analysis map and a sequencing fragment distribution analysis map are generated for each sample group. Because the individual clinical samples within a group are not identical, merging the sequencing information files before performing the transcription start site enrichment analysis and the sequencing fragment distribution analysis reveals the average behavior of the clinical samples in a sample group.
Optionally, the generated BAM-format sequencing information file can be displayed visually with IGV software or the UCSC Genome Browser, as shown in FIG. 4.
In this embodiment, the quality control analysis of each sample and the quality control analysis of each sample group are realized by data processing of the original sequencing file, enrichment analysis of the transcription start site, and distribution analysis of the sequencing fragments.
In an embodiment of the present application, the step S400 includes the following steps S410 to S450:
S410, obtaining a plurality of potential chromatin open regions for each sample group using the MACS2 algorithm, based on the group sequencing information file of each sample group.
S420, screening chromatin open regions from the plurality of potential chromatin open regions based on one or more of the fold difference parameter, the chromatin open region P-value, and the FDR.
S430, combining the chromatin open regions of all the sample groups to generate an open region list.
S440, calculating the number of sequencing reads of each clinical sample on each chromatin open region, and generating a first read number matrix.
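$$H = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1N} \\ X_{21} & X_{22} & \cdots & X_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ X_{M1} & X_{M2} & \cdots & X_{MN} \end{bmatrix}$$
Wherein H is the first read number matrix, M is the number of chromatin open regions, N is the serial number of the sample, and $X_{MN}$ is the number of sequencing reads of the N-th sample on the M-th chromatin open region.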
S450, performing position annotation and genome function annotation on the chromatin open regions of each sample group to generate a chromatin open region annotation map. The chromatin open region annotation map is shown in FIG. 5.
The location annotation comprises one or more of a promoter annotation, an enhancer annotation, and a heterochromatin region annotation. The genome function annotation annotates gene function by the GREAT algorithm.
Specifically, in step S410, the method of probing chromatin opening regions for each sample set can use MACS2 software to search potential chromatin opening regions for each sample set. The search method can be specifically that peaks are detected for each sample group by MACS2 software. Based on the peak quality in the standard results output by MACS2 software, high quality peaks were selected, each high quality peak being an open area of chromatin.
In step S420, the fold difference parameter is also referred to as the Fold Change parameter. The value of the Fold Change parameter may be set to 1.5. The value of the chromatin open region P-value may be set to 0.05. The value of the FDR may be set to 0.01.
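The screening in step S420 can be pictured as a simple threshold filter. The sketch below applies all three criteria at once, which is only one of the combinations the text allows; the dictionary keys and the direction of each comparison are assumptions for illustration.

```python
def screen_peaks(peaks, min_fold=1.5, max_p=0.05, max_fdr=0.01):
    """Keep candidate regions that pass the fold change, P-value and FDR thresholds.

    peaks: list of dicts with hypothetical keys 'fold_change', 'p_value', 'fdr'.
    """
    return [
        p for p in peaks
        if p["fold_change"] >= min_fold
        and p["p_value"] < max_p
        and p["fdr"] < max_fdr
    ]
```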
In step S450, position annotation and genome function annotation can be performed on the chromatin open regions according to known annotation information, such as the annotation files on the Roadmap Epigenomics website. As shown in FIG. 5, the position annotation includes one or more of promoter annotation, enhancer annotation, and heterochromatin region annotation. The genome function annotation annotates gene function by the GREAT algorithm.
In step S450, if the annotation information is unknown, the near end (within 1 kb) of the gene region and the far end (outside 1 kb) of the gene region can be annotated according to the gene region.
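The 1 kb fallback rule can be sketched as a distance check between an open region and a gene region. The function name, the half-open coordinate convention, and the treatment of a distance of exactly 1 kb as proximal are assumptions.

```python
def annotate_by_distance(region_start, region_end, gene_start, gene_end, cutoff=1000):
    """Label an open region 'proximal' if it lies within `cutoff` bp of the gene region."""
    if region_end <= gene_start:
        dist = gene_start - region_end      # region lies before the gene
    elif region_start >= gene_end:
        dist = region_start - gene_end      # region lies after the gene
    else:
        dist = 0                            # region overlaps the gene
    return "proximal" if dist <= cutoff else "distal"
```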
The embodiment can realize effective detection of the chromatin open areas of each sample group, and can realize quantification of the openness degree of each clinical sample on all open areas through genome function annotation.
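Steps S430 and S440 can be sketched together: the open regions of all groups are merged into one list, and reads are then counted per region and per sample to fill the first read number matrix H. The naive overlap loop below is for illustration only; the function names and the tuple layout `(chrom, start, end)` are assumptions, and a real pipeline would typically use an interval index or BedTools.

```python
import numpy as np

def merge_regions(region_lists):
    """Union of (chrom, start, end) regions; overlapping or touching regions are merged."""
    regions = sorted(r for group in region_lists for r in group)
    merged = []
    for chrom, start, end in regions:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(r) for r in merged]

def count_matrix(regions, samples):
    """H[m, n] = number of reads of sample n that overlap region m."""
    H = np.zeros((len(regions), len(samples)), dtype=int)
    for n, reads in enumerate(samples):
        for chrom, start, end in reads:
            for m, (r_chrom, r_start, r_end) in enumerate(regions):
                if chrom == r_chrom and start < r_end and end > r_start:
                    H[m, n] += 1
    return H
```

Each column of H then quantifies the openness of one clinical sample over all open regions.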
In an embodiment of the present application, the step S500 includes the following steps S511 to S513:
S511, performing normalization analysis on the first read number matrix.
S512, performing difference analysis on every pair of the plurality of sample groups according to the normalized first read number matrix to obtain inter-group differential open regions, and generating an inter-group differential region display diagram, shown in FIG. 6. An inter-group differential open region is a chromatin open region that differs significantly between two sample groups.
S513, calculating the number of sequencing reads of each clinical sample on each inter-group differential open region, and generating a second read number matrix.
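$$M = \begin{bmatrix} Y_{11} & Y_{12} & \cdots & Y_{1N} \\ Y_{21} & Y_{22} & \cdots & Y_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ Y_{K1} & Y_{K2} & \cdots & Y_{KN} \end{bmatrix}$$
Wherein M is the second read number matrix, K is the number of inter-group differential open regions, N is the serial number of the clinical sample, and $Y_{KN}$ is the number of sequencing reads of the N-th sample on the K-th inter-group differential open region.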
Specifically, this example describes the analysis of the variability between groups. In step S511, the first read-length number matrix may be subjected to normalization analysis by using a standard method of the existing software package DESeq.
In step S512, the difference analysis among the plurality of sample groups is performed by pairwise comparison, and the inter-group differential open regions obtained from any comparison are retained. The P-value and FDR are calculated during the difference analysis. When the P-value is less than 0.05, the obtained inter-group differential open region is considered credible; otherwise it is not.
The inter-group differential regions are shown in FIG. 6. FIG. 6 contains two dashed lines: all solid points above the upper dashed line and all solid points below the lower dashed line are inter-group differential regions.
In this embodiment, normalization analysis and difference analysis make it possible to find the open regions that differ significantly between sample groups.
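The normalization in step S511 is attributed to the DESeq package. As a rough illustration of the kind of computation involved, the sketch below implements median-of-ratios size factors in NumPy; it is not the DESeq package itself, the function name is hypothetical, and the differential test is not reproduced here.

```python
import numpy as np

def median_of_ratios_normalize(H):
    """Scale each sample (column) of a read number matrix by a median-of-ratios size factor."""
    H = np.asarray(H, dtype=float)
    keep = (H > 0).all(axis=1)  # use only regions with reads in every sample
    if not keep.any():
        raise ValueError("no region has reads in every sample")
    log_counts = np.log(H[keep])
    # Per-region log geometric mean across samples serves as the reference.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    size_factors = np.exp(np.median(log_counts - log_geo_mean, axis=0))
    return H / size_factors, size_factors
```

After this scaling, samples sequenced to different depths become comparable before the pairwise difference analysis.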
In an embodiment of the present application, the step S500 further includes the following steps:
S521, performing cluster analysis on the second read number matrix by a Euclidean clustering or hierarchical clustering method to obtain cluster information of the inter-group differential open regions, and generating an inter-group differential open region cluster analysis diagram, shown in FIG. 7.
Specifically, this embodiment describes cluster analysis of the second read number matrix generated by the difference analysis. In FIG. 7, each row represents an inter-group differential open region, and each column represents a clinical sample. Specific clustering algorithms include, but are not limited to, hierarchical clustering, Euclidean clustering, K-means clustering, KNN clustering, and the like.
In this embodiment, by performing cluster analysis on the second read length number matrix, the behavior of the inter-group difference open regions in different clinical samples can be shown.
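One way to realize the clustering of step S521 is hierarchical clustering with Euclidean distance, sketched here with SciPy. The average linkage method and the fixed number of clusters are assumptions chosen for illustration, and the function name is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_regions(M, n_clusters=2):
    """Assign each row (differential open region) of the matrix to a cluster label."""
    Z = linkage(np.asarray(M, dtype=float), method="average", metric="euclidean")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

The same linkage result can also be used to order the rows of a heat map such as the one described for FIG. 7.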
In an embodiment of the present application, the step S500 further includes the following steps:
S531, calculating similarity coefficients between different clinical samples, and performing cluster analysis on the calculation results to generate an inter-sample cluster analysis graph, shown in FIG. 8.
Specifically, this embodiment describes inter-group similarity analysis. It is applicable when it is uncertain whether a single control group and multiple experimental groups are present. The similarity coefficient includes, but is not limited to, the Pearson correlation coefficient, the cosine similarity coefficient, the Spearman correlation coefficient, and the like.
In this embodiment, by calculating the similarity coefficient between different clinical samples, it can be determined which clinical samples have differences.
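For the Pearson case, the similarity coefficients of step S531 can be sketched with NumPy. The sketch assumes that rows of the input are open regions and columns are clinical samples; the function name is hypothetical.

```python
import numpy as np

def sample_correlation(H):
    """Pearson correlation between samples (columns) of a read number matrix."""
    return np.corrcoef(np.asarray(H, dtype=float), rowvar=False)
```

The resulting N x N matrix can then be clustered to produce a graph like FIG. 8.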
In an embodiment of the present application, when the sample group includes a control group and a plurality of experimental groups, the step S500 further includes the steps of:
S541, selecting an experimental group, and acquiring the inter-group differential open regions between this experimental group and the control group as first inter-group differential open regions.
S542, selecting another experimental group, and acquiring the inter-group differential open regions between this experimental group and the control group as second inter-group differential open regions.
S543, acquiring the overlapping open regions of the first inter-group differential open regions and the second inter-group differential open regions, performing Fisher's exact test on the overlapping open regions to calculate a test value, and judging whether the test value is less than 0.05.
S544, if the test value is less than 0.05, it indicates a correlation between the two experimental groups.
S545, if the test value is greater than or equal to 0.05, it indicates no correlation between the two experimental groups.
S546, repeating steps S541 to S545 to obtain the difference relations among the plurality of experimental groups, and generating an inter-group similarity test chart, shown in FIG. 9.
Specifically, this embodiment is applicable when it is certain that there are a single control group and a plurality of experimental groups. The example of FIG. 9 has 1 control group and 2 experimental groups: the left circle, covering 4657, represents the difference between the first experimental group and the control group; the right circle, covering 192, represents the difference between the second experimental group and the control group; and the overlapping area of the two circles, 1989, represents the overlapping open regions, i.e., the common differences.
In this embodiment, the difference relations among the plurality of experimental groups are obtained by performing Fisher's exact test on the overlapping open regions of the first and second inter-group differential open regions.
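The Fisher's exact test on the overlap can be sketched as a one-sided hypergeometric tail probability. The text does not state the background used for the test, so the total number of candidate open regions (`n_total`) is an assumption here, as are the function name and the one-sided form.

```python
from math import comb

def overlap_fisher_p(n_total, n_first, n_second, n_overlap):
    """One-sided Fisher's exact test p-value for the overlap of two region sets.

    n_total: assumed background, e.g. all open regions in the open region list.
    n_first, n_second: sizes of the first and second differential region sets.
    n_overlap: number of regions present in both sets.
    """
    upper = min(n_first, n_second)
    # Probability of observing an overlap at least as large by chance.
    tail = sum(
        comb(n_first, i) * comb(n_total - n_first, n_second - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_second)
```

The returned value plays the role of the test value compared against 0.05 in steps S543 to S545.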
In an embodiment of the present application, the step S600 includes the following steps S610 to S650:
S610, according to a preset research direction, screening specific inter-group differential open regions from the inter-group differential open regions, and generating a list of specific inter-group differential open regions.
S620, running HOMER software to search for transcription factors in a clinical sample according to the list of specific inter-group differential open regions, and calculating the enrichment score of each transcription factor in the clinical sample. An enrichment score less than 0.05 indicates that the specific inter-group differential open regions are regulated by the transcription factor in the clinical sample. An enrichment score greater than or equal to 0.05 indicates that they are not regulated by the transcription factor in the clinical sample.
S630, repeating step S620 to calculate the enrichment scores of different transcription factors in different clinical samples, and generating an enrichment score matrix and a transcription factor enrichment analysis chart, shown in FIG. 10.
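$$W = \begin{bmatrix} Z_{11} & Z_{12} & \cdots & Z_{1N} \\ Z_{21} & Z_{22} & \cdots & Z_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{L1} & Z_{L2} & \cdots & Z_{LN} \end{bmatrix}$$
Wherein W is the enrichment score matrix, L is the number of the transcription factor, N is the serial number of the clinical sample, and $Z_{LN}$ is the enrichment score of the L-th transcription factor in the N-th clinical sample.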
And S640, performing cluster analysis on the enrichment fraction matrix to generate a transcription factor enrichment fraction cluster analysis diagram. A transcription factor-enriched score cluster analysis chart is shown in FIG. 11.
S650, screening the enriched fraction matrix for the enriched transcription factors by using a t test.
Specifically, the preset research direction in step S610 depends on the research direction of the researcher, i.e., on the disease the researcher wants to study. FIG. 10 shows the enrichment scores of the different transcription factors. FIG. 11 shows the regulatory status of the different transcription factors in the different clinical samples.
In this embodiment, searching for enriched transcription factors yields the regulatory status of different transcription factors in different clinical samples.
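The assembly of the enrichment score matrix and the 0.05 threshold can be illustrated with a short Python sketch. It assumes that the per-sample scores have already been parsed from the HOMER output into a nested dictionary; the function names, the dictionary layout, the sample names and the numeric values are all hypothetical.

```python
def build_enrichment_matrix(scores_by_sample, factors, samples):
    """Return W as a list of rows, where W[l][n] is the enrichment score
    of transcription factor `factors[l]` in clinical sample `samples[n]`."""
    return [[scores_by_sample[s][tf] for s in samples] for tf in factors]


def regulating_factors(matrix, factors, samples, threshold=0.05):
    """For each sample, list the factors whose score is below the threshold,
    i.e. the factors taken to regulate the specific differential regions."""
    return {
        sample: [tf for l, tf in enumerate(factors) if matrix[l][n] < threshold]
        for n, sample in enumerate(samples)
    }


# Hypothetical scores for two transcription factors in two samples.
scores = {
    "sample_1": {"CTCF": 0.001, "GATA3": 0.20},
    "sample_2": {"CTCF": 0.030, "GATA3": 0.60},
}
factors = ["CTCF", "GATA3"]
samples = ["sample_1", "sample_2"]

W = build_enrichment_matrix(scores, factors, samples)
hits = regulating_factors(W, factors, samples)
```

Rows of `W` correspond to transcription factors and columns to clinical samples, matching the layout of the matrix defined in step S630.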
In an embodiment of the present application, the step S700 includes the following steps S710 to S720:
S710, selecting the transcription factors related to the preset research direction from the transcription factor library.
S720, searching for a plurality of binding sites of the transcription factors related to the preset research direction on a preset DNA region, calculating the length of the sequencing fragments around each binding site, calculating the distance from the sequencing fragments around each binding site to the binding site, and generating a sequencing fragment heat map of the selected transcription factor motif region according to the calculation results. The sequencing fragment heat map of the selected transcription factor motif region is shown in FIG. 12.
Specifically, the transcription factor related to the preset research direction may be CTCF. CTCF (CCCTC-binding factor) is a versatile transcription factor that is widely present in eukaryotes. Its central segment contains 11 zinc fingers, which can be used in different combinations, so that CTCF can regulate the expression of a variety of target genes. CTCF is therefore chosen as the representative enriched transcription factor for the analysis in this embodiment.
The preset DNA region is designated by the researcher and depends on the researcher's research direction. FIG. 12 shows, for each transcription factor, the signal distribution of ATAC sequencing fragments at the binding sites of the transcription factor in the preset DNA region.
In this embodiment, analyzing the binding footprint of the enriched transcription factor on the DNA region yields the signal distribution of ATAC sequencing fragments at the binding sites of the transcription factor in the preset DNA region.
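The two quantities computed in step S720, fragment length and fragment-to-site distance, can be sketched in Python as follows. The sketch assumes fragments are given as half-open `(start, end)` coordinates on the same chromosome as the binding site and measures distance from the fragment midpoint; the function names, the flank width and the bin sizes are illustrative assumptions, not values from the source.

```python
from collections import Counter


def fragment_footprint(fragments, site, flank=100):
    """Return (distance, length) for every fragment near a binding site.

    fragments: iterable of half-open (start, end) coordinates.
    site: coordinate of the binding site (e.g. motif center).
    flank: keep fragments whose midpoint lies within this distance.
    """
    points = []
    for start, end in fragments:
        length = end - start
        midpoint = start + length // 2
        distance = midpoint - site  # negative means upstream of the site
        if abs(distance) <= flank:
            points.append((distance, length))
    return points


def heatmap_counts(points, flank=100, dist_bin=10, len_bin=50):
    """Bin (distance, length) pairs into a sparse 2D grid for a heat map."""
    counts = Counter()
    for distance, length in points:
        counts[((distance + flank) // dist_bin, length // len_bin)] += 1
    return counts


# Toy fragments around a hypothetical binding site at position 1000.
fragments = [(950, 1050), (1000, 1200), (5000, 5100)]
points = fragment_footprint(fragments, site=1000, flank=100)
grid = heatmap_counts(points, flank=100)
```

Pooling the `(distance, length)` pairs over all binding sites of one transcription factor and plotting the binned counts gives a heat map of the kind described for FIG. 12.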
In an embodiment of the present application, the step S800 includes the following steps:
S810, using the CIBERSORT algorithm to recover, by deconvolution, the cell type proportion distribution of each clinical sample, and generating a cell type proportion deconvolution analysis chart. The cell type proportion deconvolution analysis chart is shown in FIG. 13.
Specifically, five common T cell subtypes may be selected for deconvolution: Th1, Th2, Th17, Treg, and a fifth subtype whose name appears only as an inline image in the source (Figure BDA0002642124920000141). The method is not limited to these five cell types. As shown in FIG. 13, the cell type distribution of each clinical sample can thus be obtained.
The original sequencing file output after ATAC-seq sequencing can be imported into the CIBERSORT software, and the cell signature vectors of the five cell types are then entered in the CIBERSORT interface, so that the CIBERSORT software automatically outputs the proportion distribution of each cell type.
In this embodiment, by analyzing the proportions of cell types in each clinical sample, the deconvolution analysis chart can be consulted during clinical medication, thereby coupling cell information with gene information.
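The idea of deconvolution can be illustrated with a deliberately simplified Python sketch. It is not CIBERSORT, which is based on support vector regression; instead it substitutes non-negative least squares solved by projected gradient descent, which conveys the same principle of expressing a mixed signal as a non-negative combination of cell type signatures. The function name and all numeric values are hypothetical.

```python
def deconvolve(signature, mixture, iterations=500):
    """Estimate cell type proportions by projected-gradient NNLS.

    signature[r][c]: signal of region r in pure cell type c.
    mixture[r]: signal of region r in the bulk clinical sample.
    Returns proportions that are non-negative and sum to 1.
    """
    n_regions, n_types = len(signature), len(signature[0])
    # 1 / (sum of squared entries) never exceeds 1 / (largest eigenvalue of
    # S^T S), so this fixed step size is small enough to converge.
    step = 1.0 / sum(v * v for row in signature for v in row)
    fractions = [1.0 / n_types] * n_types
    for _ in range(iterations):
        residual = [
            sum(signature[r][c] * fractions[c] for c in range(n_types)) - mixture[r]
            for r in range(n_regions)
        ]
        gradient = [
            sum(signature[r][c] * residual[r] for r in range(n_regions))
            for c in range(n_types)
        ]
        # Gradient step, then projection back onto non-negative values.
        fractions = [max(0.0, f - step * g) for f, g in zip(fractions, gradient)]
    total = sum(fractions)
    return [f / total for f in fractions] if total > 0 else fractions


# Toy signature for two cell types over three regions, and a sample that
# is an exact 70% / 30% mixture of them.
signature = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0]]
mixture = [7.3, 3.7, 5.0]
proportions = deconvolve(signature, mixture)
```

In practice the signature matrix would hold accessibility signals of reference cell types such as Th1 or Treg, and a dedicated tool would be used in place of this toy solver.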
In an embodiment of the present application, the step S900 includes the following steps S910 to S920:
S910, dividing the chromatin of each clinical sample into a plurality of different chromatin regions of fixed length; calculating, according to the chromatin open regions of each clinical sample and the sequencing information file of the clinical sample, the average number of sequencing reads at the non-open positions of each clinical sample in each chromatin region; recording this average as the chromatin copy number score; and generating a copy number score matrix.
$$P = \begin{bmatrix} \tau_{11} & \cdots & \tau_{1N} \\ \vdots & \ddots & \vdots \\ \tau_{i1} & \cdots & \tau_{iN} \end{bmatrix}$$
where P is the copy number score matrix, i is the serial number of the chromatin regions formed by dividing chromatin into fixed lengths, N is the serial number of the clinical sample, and $\tau_{iN}$ is the copy number score of the i-th chromatin region in the N-th clinical sample.
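One possible reading of the copy number score in step S910 is sketched below in Python: for each fixed-length chromatin region, count the reads that start at non-open positions and divide by the number of non-open positions in that region. This interpretation, the function name, the use of read start positions and the half-open interval convention are assumptions; the per-base mask is only practical at toy scale.

```python
def copy_number_scores(read_starts, open_regions, chrom_length, bin_size):
    """Per-region background read density, ignoring chromatin open regions.

    read_starts: start coordinate of each sequencing read.
    open_regions: half-open (start, end) intervals of chromatin open regions.
    Returns one score per fixed-length chromatin region.
    """
    n_bins = (chrom_length + bin_size - 1) // bin_size
    # Mark every base that lies inside a chromatin open region.
    is_open = [False] * chrom_length
    for start, end in open_regions:
        for pos in range(max(0, start), min(end, chrom_length)):
            is_open[pos] = True
    # Count reads that start at non-open positions, per region.
    read_counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < chrom_length and not is_open[pos]:
            read_counts[pos // bin_size] += 1
    scores = []
    for b in range(n_bins):
        lo, hi = b * bin_size, min((b + 1) * bin_size, chrom_length)
        closed = sum(1 for p in range(lo, hi) if not is_open[p])
        scores.append(read_counts[b] / closed if closed else 0.0)
    return scores


# Toy chromosome of 20 bases split into two regions of 10 bases.
scores = copy_number_scores(
    read_starts=[0, 1, 3, 12, 15, 15],
    open_regions=[(2, 4)],
    chrom_length=20,
    bin_size=10,
)
```

Computing one such list per clinical sample and stacking the lists as columns gives a matrix with the layout of P.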
S920, performing copy number variation analysis on the copy number score matrix, generating a copy number variation analysis chart according to the result of the copy number variation analysis, and screening out chromatin regions whose copy number scores are significantly greater than those of other chromatin regions.
Specifically, the copy number variation analysis chart of step S920 is shown in FIG. 14. The purpose of step S920 is to screen out chromatin regions with significantly greater copy number scores. The screening is performed in two steps.
In the first step, a chromatin region is selected, a one-sample t-test is performed between its copy number score and the copy number scores of all other chromatin regions, and a P_Value is calculated. If the P_Value is less than 0.05, the region proceeds to the second screening step. If the P_Value is greater than or equal to 0.05, the copy number score of the chromatin region is not significant.
In the second step, the average of all copy number scores is calculated and compared with the copy number score of the chromatin region. If the copy number score of the chromatin region is greater than the average, the copy number score of the chromatin region is significantly greater than those of the other chromatin regions. If the copy number score of the chromatin region is less than or equal to the average, the copy number score of the chromatin region is likewise not significant.
The first and second screening steps are performed on all chromatin regions to obtain the final screening result.
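The two-step screening can be sketched in Python under one interpretation of the first step: the scores of all other regions are tested, with a one-sample t-test, against the selected region's score as the hypothesized mean. That interpretation and the function names are assumptions. The p-value is obtained by numerically integrating the Student's t density so that the sketch needs only the standard library.

```python
import math
from statistics import mean, stdev


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value of Student's t, by Simpson integration of the density."""
    a = abs(t)
    if a == 0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)


def screen_regions(scores, alpha=0.05):
    """Indices of regions whose copy number score is significantly greater."""
    overall_mean = mean(scores)
    selected = []
    for i, score in enumerate(scores):
        others = scores[:i] + scores[i + 1:]
        spread = stdev(others)
        if spread == 0:
            continue  # the t statistic is undefined without variation
        # Step 1: one-sample t-test of the other regions against this score.
        t = (mean(others) - score) / (spread / math.sqrt(len(others)))
        p_value = t_two_sided_p(t, len(others) - 1)
        # Step 2: keep the region only if it also exceeds the overall mean.
        if p_value < alpha and score > overall_mean:
            selected.append(i)
    return selected


# Toy copy number scores for seven regions of one sample.
amplified = screen_regions([1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 5.0])
```

With these toy scores only the last region passes both steps, because it is the only one above the overall mean.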
In this embodiment, CNV analysis makes it possible to explore differences in DNA fragments between different samples, which provides a reference for the clinical diagnosis of some diseases and indirectly extracts more clinically relevant information from the original sequencing file, thereby avoiding additional clinical tests.
The technical features of the embodiments described above may be combined arbitrarily, and the order of execution of the method steps is not limited. For brevity, not all possible combinations of the technical features in the embodiments are described; however, as long as a combination of the technical features involves no contradiction, it should be considered to fall within the scope of this description.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the appended claims.

Claims (11)

1. A method for analyzing chromatin accessibility data based on a clinical specimen, comprising:
S100, setting a plurality of sample groups, wherein each sample group comprises a plurality of clinical samples;
S200, obtaining the original sequencing file output after each clinical sample is subjected to ATAC-seq sequencing;
S300, performing data processing on the original sequencing file, performing quality control analysis on the processed original sequencing file to generate a sequencing quality control analysis result, and visualizing the sequencing quality control analysis result;
S400, acquiring the chromatin open regions of each sample group, and performing information annotation on the chromatin open regions of each sample group;
S500, performing difference analysis, cluster analysis and multi-group similarity analysis among the plurality of sample groups according to the chromatin open regions of each sample group;
S600, performing transcription factor enrichment analysis according to the analysis result of step S500, and searching for enriched transcription factors;
S700, selecting a transcription factor related to a preset research direction, and performing binding footprint analysis on the transcription factor;
S800, deconvoluting the original sequencing file of each clinical sample to obtain, for each clinical sample, the percentage of the total number of cells accounted for by each type of cell;
S900, performing CNV analysis on the original sequencing file of each clinical sample, acquiring DNA fragment difference information among different clinical samples, and visualizing the DNA fragment difference information among different clinical samples.
2. The method for clinical specimen-based chromatin accessibility data analysis of claim 1, wherein said step S300 comprises:
S310, selecting the original sequencing file of a clinical sample, removing the adapter sequences in the original sequencing file, and performing chromatin alignment and format conversion on the original sequencing file from which the adapter sequences have been removed to generate a sequencing information file;
S320, performing transcription start site enrichment analysis and sequencing fragment distribution analysis on the sequencing information file to generate a transcription start site enrichment analysis chart and a sequencing fragment distribution analysis chart of the clinical sample;
S330, generating a chromatin alignment result visualization chart based on the chromatin alignment result;
S340, repeating steps S310 to S330 to generate a sequencing information file, a transcription start site enrichment analysis chart, a sequencing fragment distribution analysis chart and a chromatin alignment result visualization chart for each clinical sample;
S350, merging the sequencing information files of the plurality of clinical samples in a sample group into a group sequencing information file, performing transcription start site enrichment analysis and sequencing fragment distribution analysis on the group sequencing information file, and generating a transcription start site enrichment analysis chart and a sequencing fragment distribution analysis chart of the sample group;
S360, repeating step S350 to generate a group sequencing information file, a transcription start site enrichment analysis chart and a sequencing fragment distribution analysis chart for each sample group.
3. The method for clinical specimen-based chromatin accessibility data analysis of claim 2, wherein said step S400 comprises:
S410, acquiring a plurality of potential chromatin open regions of each sample group by using the MACS2 algorithm based on the group sequencing information file of each sample group;
S420, screening chromatin open regions from the plurality of potential chromatin open regions based on one or more of the fold difference parameter, the P_value and the FDR of the chromatin open regions;
S430, combining the chromatin open regions of all the sample groups to generate an open region list;
S440, calculating the number of sequencing reads of each clinical sample on each chromatin open region, and generating a first read number matrix;
$$H = \begin{bmatrix} X_{11} & \cdots & X_{1N} \\ \vdots & \ddots & \vdots \\ X_{M1} & \cdots & X_{MN} \end{bmatrix}$$
wherein H is the first read number matrix, M is the serial number of the chromatin open region, N is the serial number of the sample, and $X_{MN}$ is the number of sequencing reads of the N-th sample on the M-th chromatin open region;
S450, performing position annotation and genome function annotation on the chromatin open regions of each sample group to generate a chromatin open region annotation chart; the position annotation comprises one or more of a priming enhancer annotation, a heterochromatin region annotation, and a functional annotation; the genome function annotation annotates gene function by the GREAT algorithm.
4. The method for clinical specimen-based chromatin accessibility data analysis of claim 3, wherein said step S500 comprises:
S511, performing normalization analysis on the first read number matrix;
S512, performing difference analysis on every two sample groups according to the normalized first read number matrix to obtain inter-group differential open regions, and generating an inter-group differential region display chart; the inter-group differential open regions are chromatin open regions that differ significantly between the two sample groups;
S513, calculating the number of sequencing reads of each clinical sample on each inter-group differential open region, and generating a second read number matrix;
$$M = \begin{bmatrix} Y_{11} & \cdots & Y_{1N} \\ \vdots & \ddots & \vdots \\ Y_{K1} & \cdots & Y_{KN} \end{bmatrix}$$
wherein M is the second read number matrix, K is the serial number of the inter-group differential open region, N is the serial number of the clinical sample, and $Y_{KN}$ is the number of sequencing reads of the N-th sample on the K-th inter-group differential open region.
5. The method for clinical specimen-based chromatin accessibility data analysis of claim 4, wherein said step S500 further comprises:
S521, performing cluster analysis on the second read number matrix by a Euclidean clustering or hierarchical clustering method to obtain cluster information of the inter-group differential open regions, and generating an inter-group differential open region cluster analysis chart.
6. The method for clinical specimen-based chromatin accessibility data analysis of claim 5, wherein said step S500 further comprises:
S531, calculating similarity coefficients among different clinical samples, and performing cluster analysis on the calculation results to generate an inter-sample cluster analysis chart.
7. The method for analyzing chromatin accessibility data based on clinical samples of claim 6, wherein when the sample group comprises a control group and a plurality of experimental groups, the step S500 further comprises:
S541, selecting an experimental group, and acquiring the inter-group differential open regions between the experimental group and the control group as first inter-group differential open regions;
S542, selecting another experimental group, and acquiring the inter-group differential open regions between that experimental group and the control group as second inter-group differential open regions;
S543, acquiring the coincident open regions of the first inter-group differential open regions and the second inter-group differential open regions, performing Fisher's exact test on the coincident open regions, and calculating a test value;
S543, judging whether the test value is less than 0.05;
S544, if the test value is less than 0.05, indicating that there is a difference between the two experimental groups;
S545, if the test value is greater than or equal to 0.05, indicating that there is no correlation between the two experimental groups;
S546, repeating steps S541 to S545 to obtain the difference relationships among the plurality of experimental groups, and generating an inter-group similarity test chart.
8. The method for clinical specimen-based chromatin accessibility data analysis of claim 7, wherein said step S600 comprises:
S610, screening out specific inter-group differential open regions from the inter-group differential open regions according to a preset research direction, and generating a list of specific inter-group differential open regions;
S620, running HOMER software on the list of specific inter-group differential open regions to search for transcription factors in a clinical sample, and calculating the enrichment score of each transcription factor in the clinical sample; an enrichment score less than 0.05 indicates that the specific inter-group differential open regions are regulated by the transcription factor in the clinical sample; an enrichment score greater than or equal to 0.05 indicates that the specific inter-group differential open regions are not regulated by the transcription factor in the clinical sample;
S630, repeating step S620 to calculate the enrichment scores of different transcription factors in different clinical samples, and generating an enrichment score matrix and a transcription factor enrichment analysis chart;
$$W = \begin{bmatrix} Z_{11} & \cdots & Z_{1N} \\ \vdots & \ddots & \vdots \\ Z_{L1} & \cdots & Z_{LN} \end{bmatrix}$$
wherein W is the enrichment score matrix, L is the serial number of the transcription factor, N is the serial number of the clinical sample, and $Z_{LN}$ is the enrichment score of the L-th transcription factor in the N-th clinical sample;
S640, performing cluster analysis on the enrichment score matrix to generate a transcription factor enrichment score cluster analysis chart;
S650, screening the enrichment score matrix for enriched transcription factors using a t-test.
9. The method for analyzing chromatin accessibility data based on a clinical specimen according to claim 8, wherein said step S700 comprises:
S710, selecting transcription factors related to a preset research direction from a transcription factor library;
S720, searching for a plurality of binding sites of the transcription factors related to the preset research direction on a preset DNA region, calculating the length of the sequencing fragments around each binding site, calculating the distance from the sequencing fragments around each binding site to the binding site, and generating a sequencing fragment heat map of the selected transcription factor motif region according to the calculation results.
10. The method for analyzing chromatin accessibility data based on a clinical specimen of claim 9, wherein the step S800 comprises:
S810, using the CIBERSORT algorithm to recover, by deconvolution, the cell type proportion distribution of each clinical sample, and generating a cell type proportion deconvolution analysis chart.
11. The method for clinical specimen-based chromatin accessibility data analysis of claim 10, wherein said step S900 comprises:
S910, dividing the chromatin of each clinical sample into a plurality of different chromatin regions of fixed length, calculating, according to the chromatin open regions of each clinical sample and the sequencing information file of the clinical sample, the average number of sequencing reads at the non-open positions of each clinical sample in each chromatin region, recording this average as the chromatin copy number score, and generating a copy number score matrix;
$$P = \begin{bmatrix} \tau_{11} & \cdots & \tau_{1N} \\ \vdots & \ddots & \vdots \\ \tau_{i1} & \cdots & \tau_{iN} \end{bmatrix}$$
wherein P is the copy number score matrix, i is the serial number of the chromatin regions formed by dividing chromatin into fixed lengths, N is the serial number of the clinical sample, and $\tau_{iN}$ is the copy number score of the i-th chromatin region in the N-th clinical sample;
S920, performing copy number variation analysis on the copy number score matrix, generating a copy number variation analysis chart according to the result of the copy number variation analysis, and screening out chromatin regions whose copy number scores are significantly greater than those of other chromatin regions.
CN202010843055.4A 2020-08-20 2020-08-20 Chromatin accessibility data analysis method based on clinical samples Active CN111951896B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010843055.4A CN111951896B (en) 2020-08-20 2020-08-20 Chromatin accessibility data analysis method based on clinical samples

Publications (2)

Publication Number Publication Date
CN111951896A true CN111951896A (en) 2020-11-17
CN111951896B CN111951896B (en) 2023-10-20

Family

ID=73358505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010843055.4A Active CN111951896B (en) 2020-08-20 2020-08-20 Chromatin accessibility data analysis method based on clinical samples

Country Status (1)

Country Link
CN (1) CN111951896B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113774136A (en) * 2021-09-17 2021-12-10 杭州瀚因生命科技有限公司 Quantitative detection method for chromatin patency in specific region of genome
CN115083517A (en) * 2022-07-07 2022-09-20 南华大学附属第一医院 Data processing method and system for identifying enhancer and super enhancer

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090125248A1 (en) * 2007-11-09 2009-05-14 Soheil Shams System, Method and computer program product for integrated analysis and visualization of genomic data
CN107832585A (en) * 2017-11-23 2018-03-23 南宁科城汇信息科技有限公司 A kind of RNAseq data analysing methods
CN109666719A (en) * 2018-12-09 2019-04-23 华中农业大学 A method of improving cellular resolution chromatin accessibility
CN110544509A (en) * 2019-08-20 2019-12-06 广州基迪奥生物科技有限公司 single-cell ATAC-seq data analysis method
CN111095422A (en) * 2017-06-19 2020-05-01 琼格拉有限责任公司 Interpretation of Gene and genomic variants by comprehensive computational and Experimental deep mutation learning frameworks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴杰;全建平;叶勇;吴珍芳;杨杰;杨明;郑恩琴;: "染色质转座酶可及性测序研究进展", 遗传, no. 04 *

Also Published As

Publication number Publication date
CN111951896B (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant