CN114446384A - Prediction method and prediction system of chromosome topological association domains - Google Patents
- Publication number
- CN114446384A, CN202210245600.9A
- Authority
- CN
- China
- Prior art keywords
- quasi
- interaction
- block
- chromosome
- genomic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/20—Protein or domain folding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
Description
Technical Field
The invention belongs to the field of computer technology, and in particular relates to a method and a system for predicting chromosome topological association domains.
Background
In recent years, the advent of high-throughput chromosome conformation capture (Hi-C), a genome-wide chromosome conformation capture technology, has advanced the understanding of the hierarchical spatial structure of chromosomes. Researchers converted Hi-C sequencing data from mammalian cells into Hi-C interaction matrices and visualized them, and thereby found highly self-interacting regions at resolutions below 100 kb; such regions are topological association domains (Topologically Associating Domain, TAD). A Hi-C interaction matrix is constructed as follows: a chromosome is divided into N segments of equal length, and an N×N matrix M is built to represent the interaction signal between every pair of segments on that chromosome. Each equal-length unit segment is called a genomic block, and the size of the genomic block is tied to the resolution of the Hi-C interaction matrix. The matrix is filled by counting how the sequencing reads produced by Hi-C align to pairs of genomic blocks, which gives the interaction frequency between the N genomic blocks. For example, each time the two parts of a split sequencing read align to genomic block i and genomic block j, the matrix elements M_{i,j} and M_{j,i} are each incremented by 1.
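The matrix construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `build_contact_matrix`, the representation of a read pair as two genomic coordinates, and the choice to count a pair that falls inside a single block only once are all assumptions made for the example.

```python
def build_contact_matrix(read_pairs, chrom_length, bin_size):
    """Accumulate aligned read pairs into a symmetric N x N interaction matrix.

    read_pairs: iterable of (pos_a, pos_b) coordinates on one chromosome
    (hypothetical input format chosen for this sketch).
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division gives N
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            # Mirror the count so that M[i][j] == M[j][i].
            # Assumption: a pair inside one block is counted once.
            matrix[j][i] += 1
    return matrix


pairs = [(5, 25), (12, 18), (3, 7)]
M = build_contact_matrix(pairs, chrom_length=30, bin_size=10)
```

With a block size of 10, the three pairs land in block pairs (0, 2), (1, 1) and (0, 0), so only those cells and the mirrored cell (2, 0) are non-zero.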
At present, owing to the limits of microscopy and biotechnology, researchers still cannot observe TADs directly and completely, and the mechanism by which TADs form remains unclear. TAD information must therefore be obtained indirectly, for example by building a Hi-C interaction matrix from the interactions between chromosome fragments captured in Hi-C sequencing data and then predicting TADs with a suitable algorithm. In recent years, researchers have proposed machine-learning methods for predicting TADs, but applying these methods across cell lines is severely limited: each cell line usually requires a large amount of its own matching information for feature extraction and model training, which places an additional burden on researchers.
Existing TAD prediction algorithms mainly predict TADs from the interaction preference at boundaries, the similarity within a TAD, the difference between TAD and non-TAD regions, or the change in contact-frequency density within a TAD. These methods either focus only on finding boundaries and miss the information inside the TAD, or require user-defined parameters to control the TAD size, the clustering termination threshold, local extrema and so on, which makes TAD identification unstable and subjective. Moreover, because a TAD is a structure that has not been precisely defined, it should not be predicted by constraining its own properties.
Summary of the Invention
One object of the present invention is to provide a method for predicting chromosome topological association domains with high reliability, good accuracy and good performance.
Another object of the present invention is to provide a prediction system that implements the method for predicting chromosome topological association domains.
The method for predicting chromosome topological association domains provided by the present invention comprises the following steps:
S1. Obtain each genomic block in the interaction matrix between genomic blocks, and identify its corresponding high-frequency interaction region with a clustering algorithm;
S2. For each genomic block, determine from its high-frequency interaction region whether a quasi-nucleus centered on that genomic block exists:
if the high-frequency interaction region contains a quasi-nucleus centered on the genomic block, proceed to the subsequent steps;
if the high-frequency interaction region contains no quasi-nucleus centered on the genomic block, split the high-frequency interaction region and repeat the determination, until the split region no longer contains the genomic block;
S3. For the quasi-nuclei identified on each chromosome, process each pair of adjacent quasi-nuclei according to the relationship between them to obtain mutually non-overlapping quasi-nuclei;
S4. According to the correlation between the quasi-nuclei, merge the non-overlapping quasi-nuclei on a chromosome, and take each merged region as the nucleus of a chromosome topological association domain to be predicted;
S5. Determine the affiliation of each genomic block in the attachment candidate regions and, together with the nuclei obtained in step S4, obtain the final predicted chromosome topological association domains.
Step S1 specifically comprises: using whole-genome conformation capture and sequencing technology to obtain each genomic block in the interaction matrix between genomic blocks, and clustering with the K-means algorithm with k = 2 to identify the corresponding high-frequency interaction region.
Step S1 specifically includes the following steps:
S1.1. Use whole-genome conformation capture and sequencing technology to obtain the interaction matrix between genomic blocks;
S1.2. Set to 0 the diagonal entries of the interaction matrix obtained in step S1.1, that is, the interaction value of each genomic block with itself;
S1.3. For any genomic block i, use the K-means algorithm with k = 2 to cluster the other genomic blocks whose interaction value with block i is not 0;
S1.4. Define the high-frequency interaction region $[l_i, r_i]$ of each genomic block i, where $l_i$ is the smallest block number among the genomic blocks in the high-interaction class of block i, and $r_i$ is the largest block number among them.
The following function is used as the classification function for the other genomic blocks in step S1.3:
$$c_j = \arg\min_{k \in \{1, 2\}} \lVert M_{i,j} - \mu_k \rVert_2$$
where $M_{i,j}$ is the interaction value between genomic block i and genomic block j; $\mu_k$ is the mean of the k-th center; $\arg\min_k$ returns the class number of the center closest to $M_{i,j}$; and $\lVert \cdot \rVert_2$ is the 2-norm. The initial centers $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two designated positions in the ascending-sorted non-zero interaction values, with $\mu_1$ the center of the low-frequency interaction class and $\mu_2$ the center of the high-frequency interaction class;
Solving the classification function assigns genomic block j to the class whose center is at the smallest distance from its interaction value.
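Steps S1.2 to S1.4 can be sketched as a one-dimensional two-class K-means over a single row of the interaction matrix. This is an illustrative sketch only. The function name `high_frequency_region` is hypothetical, and the initial centers are taken as the smallest and largest non-zero values purely as a stand-in assumption, since the positions used by the source for initialization are not reproduced in the text.

```python
def high_frequency_region(row, i, max_iter=100):
    """Return (l_i, r_i) for genomic block i, or None if no high class exists.

    row: interaction values of block i with every block on the chromosome.
    """
    # S1.2 / S1.3: ignore the self-interaction and all zero entries.
    values = {j: v for j, v in enumerate(row) if j != i and v != 0}
    if not values:
        return None
    ordered = sorted(values.values())
    # Assumption: min / max as initial low / high centers (stand-in only).
    centers = [ordered[0], ordered[-1]]
    labels = {}
    for _ in range(max_iter):
        # Assign each block to the nearest center (0 = low, 1 = high).
        new_labels = {
            j: 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for j, v in values.items()
        }
        if new_labels == labels:
            break
        labels = new_labels
        # Update each center to the mean of its members.
        for k in (0, 1):
            members = [values[j] for j, lab in labels.items() if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    high = [j for j, lab in labels.items() if lab == 1]
    # S1.4: the region spans the smallest to the largest high-class block.
    return (min(high), max(high)) if high else None


region = high_frequency_region([1, 9, 0, 10, 8, 1, 2], i=2)
```

For the example row, blocks 1, 3 and 4 fall in the high-interaction class, so the region for block 2 is `(1, 4)`.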
Step S2 specifically includes the following steps:
S2.1. Compute the average interaction value of the sub-matrix that the high-frequency interaction region $[l_i, r_i]$ of genomic block i forms in the interaction matrix between genomic blocks, that is, the sub-matrix spanned by rows and columns $l_i$ to $r_i$;
S2.2. Compare the average interaction value obtained in step S2.1 with the average interaction values of five adjacent sub-matrices of the same window size:
if the average interaction value obtained in step S2.1 is greater than the average interaction values of the five adjacent sub-matrices, the high-frequency interaction region $[l_i, r_i]$ is determined to be the quasi-nucleus of genomic block i;
if the average interaction value obtained in step S2.1 is not greater than the average interaction values of the five adjacent sub-matrices, the high-frequency interaction region $[l_i, r_i]$ is split, and the determination is repeated on the split region until the split region no longer contains genomic block i;
The five adjacent sub-matrices of the same window size are three sub-matrices above, one sub-matrix to the right, and one sub-matrix below.
Splitting the high-frequency interaction region $[l_i, r_i]$ and repeating the determination until the split region no longer contains genomic block i specifically includes the following steps:
First, within the high-frequency interaction region $[l_i, r_i]$, take the genomic block $m_i$ whose summed interaction with the other genomic blocks in the region is smallest as the split point, and divide the region into two high-frequency interaction sub-regions, one on each side of $m_i$;
Then, make the following determination:
If $i = m_i$, it is determined that no quasi-nucleus centered on genomic block i exists;
If $i < m_i$, take the sub-region on the lower-numbered side of $m_i$ as the high-frequency interaction region of genomic block i, and repeat steps S2.1 to S2.2 to determine the quasi-nucleus;
If $i > m_i$, take the sub-region on the higher-numbered side of $m_i$ as the high-frequency interaction region of genomic block i, and repeat steps S2.1 to S2.2 to determine the quasi-nucleus.
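The split-and-retry loop above can be sketched as follows. This is a hedged illustration: `find_quasi_nucleus` is a hypothetical name, the comparison of steps S2.1 to S2.2 is passed in as a caller-supplied predicate `is_quasi_nucleus` rather than implemented, and the sub-regions are assumed to exclude the split block $m_i$ itself, which the text does not confirm.

```python
def find_quasi_nucleus(matrix, i, left, right, is_quasi_nucleus):
    """Shrink [left, right] around block i until it passes the test or i is cut.

    is_quasi_nucleus(matrix, left, right) -> bool is a hypothetical stand-in
    for the sub-matrix comparison of steps S2.1 to S2.2.
    """
    if not left <= i <= right:
        return None
    while True:
        if is_quasi_nucleus(matrix, left, right):
            return left, right

        def inner_sum(block):
            # Summed interaction of one block with the rest of the region.
            return sum(matrix[block][other]
                       for other in range(left, right + 1) if other != block)

        m_i = min(range(left, right + 1), key=inner_sum)  # split point
        if m_i == i:
            return None  # no quasi-nucleus centered on block i
        if i < m_i:
            right = m_i - 1  # assumption: sub-region excludes m_i
        else:
            left = m_i + 1   # assumption: sub-region excludes m_i


def dense(mat, l, r):
    """Toy predicate for the example: mean off-diagonal value of at least 5."""
    vals = [mat[a][b] for a in range(l, r + 1) for b in range(l, r + 1) if a != b]
    return bool(vals) and sum(vals) / len(vals) >= 5


n = 5
toy = [[0 if a == b else (9 if a < 3 and b < 3 else 1) for b in range(n)]
       for a in range(n)]
found = find_quasi_nucleus(toy, 1, 0, 4, dense)
```

In the toy matrix, blocks 0 to 2 interact strongly and blocks 3 and 4 weakly, so the region for block 1 shrinks from `(0, 4)` to `(0, 2)`, while block 3 is itself the weakest block and yields no quasi-nucleus. The loop always terminates because the region shrinks by at least one block per iteration.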
Step S3 specifically includes the following steps:
S3.1. For the quasi-nuclei identified on each chromosome, determine the relationship between two adjacent quasi-nuclei:
If one of the two adjacent quasi-nuclei contains the other, keep the contained quasi-nucleus and discard the containing one;
If the two adjacent quasi-nuclei overlap, test further: if the two still satisfy the definition of a quasi-nucleus after being merged, merge them into one quasi-nucleus; otherwise, keep the one with the larger average interaction value and discard the other;
S3.2. Repeat step S3.1 until all quasi-nuclei on the whole chromosome have been examined and processed, finally obtaining mutually non-overlapping quasi-nuclei.
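The containment and overlap rules of step S3 can be sketched as a single left-to-right pass over sorted intervals. This is one possible reading, not the patent's procedure: `resolve_quasi_nuclei`, `mean_value` and `still_quasi_nucleus` are hypothetical names, and the two callables stand in for the average interaction value and the quasi-nucleus test defined elsewhere in the document.

```python
def resolve_quasi_nuclei(intervals, mean_value, still_quasi_nucleus):
    """Reduce inclusive (left, right) intervals to a non-overlapping list.

    mean_value(interval) -> float and still_quasi_nucleus(interval) -> bool
    are caller-supplied stand-ins (assumptions of this sketch).
    """
    result = []
    for cur in sorted(set(intervals)):
        if not result or cur[0] > result[-1][1]:
            result.append(cur)           # disjoint from the previous one
            continue
        prev = result[-1]
        if cur[1] <= prev[1]:
            result[-1] = cur             # cur lies inside prev: keep cur
        elif cur[0] == prev[0]:
            pass                         # prev lies inside cur: keep prev
        else:
            merged = (prev[0], cur[1])   # partial overlap
            if still_quasi_nucleus(merged):
                result[-1] = merged
            else:
                result[-1] = max(prev, cur, key=mean_value)
    return result


means = {(1, 3): 5.0, (4, 8): 7.0, (0, 4): 2.0, (3, 7): 3.0}
nested = resolve_quasi_nuclei([(0, 5), (1, 3), (4, 8), (10, 12)],
                              lambda iv: means.get(iv, 1.0),
                              lambda iv: False)
```

Here `(1, 3)` replaces the interval `(0, 5)` that contains it, after which `(4, 8)` and `(10, 12)` no longer overlap anything and are kept unchanged.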
Step S4 specifically comprises: computing the cosine similarity between all adjacent quasi-nuclei; merging each run of consecutive adjacent quasi-nuclei whose cosine similarity is above a set threshold and whose average interaction value between adjacent quasi-nuclei is greater than the mean of the non-zero interaction values on the whole chromosome into a new region; and taking that region as the nucleus in the nucleus-attachment structure model of the chromosome topological association domain to be predicted.
The cosine similarity between adjacent quasi-nuclei $pc_i$ and $pc_j$ is computed as:
$$\cos(pc_i, pc_j) = \frac{\langle \mathbf{v}_i, \mathbf{v}_j \rangle}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert}$$
where $\mathbf{v}_i$ is the feature vector formed by the average interaction values between $pc_i$ and all other quasi-nuclei, with components $a_{k,i}$, the average interaction value between quasi-nuclei $pc_k$ and $pc_i$; $\mathbf{v}_j$ is the feature vector formed by the average interaction values between $pc_j$ and all other quasi-nuclei, with components $a_{k,j}$, the average interaction value between quasi-nuclei $pc_k$ and $pc_j$; $\langle \cdot, \cdot \rangle$ is the inner product of two vectors; and $\lVert \cdot \rVert$ is the modulus of a vector.
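The similarity measure can be illustrated with a short sketch. The helper names are hypothetical, and one detail is an assumption: the feature vector here includes an entry for every quasi-nucleus, including the quasi-nucleus itself, so that the two vectors being compared have the same length.

```python
import math


def mean_interaction(matrix, a, b):
    """Average interaction value between inclusive block ranges a and b."""
    vals = [matrix[r][c]
            for r in range(a[0], a[1] + 1)
            for c in range(b[0], b[1] + 1)]
    return sum(vals) / len(vals)


def feature_vector(matrix, quasi_nuclei, i):
    """Mean interaction of quasi-nucleus i with each quasi-nucleus k.

    Assumption: k runs over all quasi-nuclei, including i itself.
    """
    return [mean_interaction(matrix, quasi_nuclei[k], quasi_nuclei[i])
            for k in range(len(quasi_nuclei))]


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


toy = [[0, 2], [2, 0]]
vec = feature_vector(toy, [(0, 0), (1, 1)], 0)
```

Two quasi-nuclei with proportional interaction profiles, such as `[1, 2]` and `[2, 4]`, get a similarity of 1 up to floating-point rounding, while orthogonal profiles get 0.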
Step S5 specifically comprises: defining the region between two nuclei as an attachment region, and determining, for each genomic block in each attachment region, which neighboring nucleus it belongs to, thereby obtaining the final predicted chromosome topological association domains; each chromosome topological association domain consists of one nucleus and the attachment regions on both sides of that nucleus.
Step S5 specifically includes the following steps:
S5.1. For the genomic blocks between two adjacent nuclei, filter out those whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values on the whole chromosome;
S5.2. Building on step S5.1, remove the background signal from the sub-matrix formed by the two adjacent nuclei and the genomic blocks between them; the background signal is defined as the mean of the non-zero interaction values in the sub-matrix formed by the genomic blocks between the two adjacent nuclei;
S5.3. Building on step S5.2, among the genomic blocks between the two adjacent nuclei, filter out those that have no non-zero interaction value with any genomic block within that genomic region;
S5.4. Building on step S5.3, compute the average interaction value of the sub-matrix associated with each remaining genomic block between the two adjacent nuclei, and take the genomic block whose sub-matrix has the smallest average interaction value as the split point; genomic blocks upstream of the split point are assigned as attachments of the upstream nucleus, and genomic blocks downstream of the split point are assigned as attachments of the downstream nucleus, giving the final predicted chromosome topological association domains.
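The final assignment in step S5.4 amounts to choosing the lowest-scoring block as a split point and dividing the remaining blocks around it. The sketch below assumes the per-block average interaction values have already been computed and are supplied through a hypothetical callable `block_score`; how the split block itself is assigned is not stated in the text, so it is returned separately.

```python
def assign_attachments(gap_blocks, block_score):
    """Split the blocks between two adjacent nuclei into two attachment sets.

    gap_blocks: block numbers remaining after steps S5.1 to S5.3.
    block_score(block) -> float: average interaction value of the block's
    sub-matrix (hypothetical stand-in, computed elsewhere).
    Returns (split_point, upstream_blocks, downstream_blocks).
    """
    blocks = sorted(gap_blocks)
    if not blocks:
        return None, [], []
    split = min(blocks, key=block_score)
    upstream = [b for b in blocks if b < split]    # go to the upstream nucleus
    downstream = [b for b in blocks if b > split]  # go to the downstream nucleus
    return split, upstream, downstream


scores = {5: 3.0, 6: 1.0, 7: 2.5, 8: 4.0}
assignment = assign_attachments([5, 6, 7, 8], scores.get)
```

With the example scores, block 6 has the smallest value and becomes the split point, block 5 attaches upstream, and blocks 7 and 8 attach downstream.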
The present invention also provides a prediction system that implements the method for predicting chromosome topological association domains, comprising, connected in series, a high-frequency interaction region identification module, a quasi-nucleus identification module, a quasi-nucleus processing module, a chromosome topological association domain nucleus identification module, and a chromosome topological association domain identification module. The high-frequency interaction region identification module obtains each genomic block in the interaction matrix between genomic blocks, identifies the corresponding high-frequency interaction region with a clustering algorithm, and passes the resulting regions to the quasi-nucleus identification module. The quasi-nucleus identification module determines, for each genomic block, whether its high-frequency interaction region contains a quasi-nucleus centered on that block, and passes the resulting quasi-nuclei to the quasi-nucleus processing module. The quasi-nucleus processing module processes the quasi-nuclei identified on each chromosome according to the relationship between each pair of adjacent quasi-nuclei to obtain mutually non-overlapping quasi-nuclei, and passes them to the chromosome topological association domain nucleus identification module. The chromosome topological association domain nucleus identification module merges the non-overlapping quasi-nuclei on a chromosome according to the correlation between them, takes each merged region as the nucleus of a chromosome topological association domain to be predicted, and passes the nuclei to the chromosome topological association domain identification module. The chromosome topological association domain identification module determines the affiliation of each genomic block in the attachment candidate regions, combines the result with the received nuclei to obtain the final predicted chromosome topological association domains, and outputs them.
The prediction method and prediction system provided by the present invention make full use of the global information in Hi-C data and narrow the range in which candidate boundaries are located, which reduces false-positive results. The invention also requires no predefined parameters from the user. It can therefore predict topological association domains accurately, with high reliability, good accuracy and good performance.
Description of Drawings
Fig. 1 is a schematic flowchart of the method of the present invention.
Fig. 2 is a schematic flowchart of an embodiment of the method of the present invention.
Fig. 3 is a schematic structural diagram of the system of the present invention.
Detailed Description
Fig. 1 shows a schematic flowchart of the method of the present invention. The method for predicting chromosome topological association domains provided by the present invention comprises the following steps:
S1. Obtain each genomic block in the interaction matrix between genomic blocks, and identify its corresponding high-frequency interaction region with a clustering algorithm; specifically, use whole-genome conformation capture and sequencing technology to obtain each genomic block in the interaction matrix between genomic blocks (the Hi-C interaction matrix for short), and cluster with the K-means algorithm with k = 2 to identify the corresponding high-frequency interaction region;
In a specific implementation, this includes the following steps:
S1.1. Use whole-genome conformation capture and sequencing technology to obtain the interaction matrix between genomic blocks;
S1.2. Set to 0 the diagonal entries of the interaction matrix obtained in step S1.1, that is, the interaction value of each genomic block with itself;
S1.3. For any genomic block i, use the K-means algorithm with k = 2 to cluster the other genomic blocks whose interaction value with block i is not 0; the following function is used as the classification function for the other genomic blocks:
$$c_j = \arg\min_{k \in \{1, 2\}} \lVert M_{i,j} - \mu_k \rVert_2$$
where $M_{i,j}$ is the interaction value between genomic block i and genomic block j; $\mu_k$ is the mean of the k-th center; $\arg\min_k$ returns the class number of the center closest to $M_{i,j}$; and $\lVert \cdot \rVert_2$ is the 2-norm. The initial centers $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two designated positions in the ascending-sorted non-zero interaction values, with $\mu_1$ the center of the low-frequency interaction class and $\mu_2$ the center of the high-frequency interaction class;
Solving the classification function assigns genomic block j to the class whose center is at the smallest distance from its interaction value;
S1.4. Define the high-frequency interaction region $[l_i, r_i]$ of each genomic block i, where $l_i$ is the smallest block number among the genomic blocks in the high-interaction class of block i, and $r_i$ is the largest block number among them;
S2. For each genomic block, determine from its high-frequency interaction region whether a quasi-nucleus centered on that genomic block exists:
if the high-frequency interaction region contains a quasi-nucleus centered on the genomic block, proceed to the subsequent steps;
if the high-frequency interaction region contains no quasi-nucleus centered on the genomic block, split the high-frequency interaction region and repeat the determination, until the split region no longer contains the genomic block;
In a specific implementation, this includes the following steps:
S2.1. Compute the average interaction value of the sub-matrix that the high-frequency interaction region $[l_i, r_i]$ of genomic block i forms in the interaction matrix between genomic blocks, that is, the sub-matrix spanned by rows and columns $l_i$ to $r_i$;
S2.2. Compare the average interaction value obtained in step S2.1 with the average interaction values of five adjacent sub-matrices of the same window size:
if the average interaction value obtained in step S2.1 is greater than the average interaction values of the five adjacent sub-matrices, the high-frequency interaction region $[l_i, r_i]$ is determined to be the quasi-nucleus of genomic block i;
if the average interaction value obtained in step S2.1 is not greater than the average interaction values of the five adjacent sub-matrices, the high-frequency interaction region $[l_i, r_i]$ is split, and the determination is repeated on the split region until the split region no longer contains genomic block i;
The five adjacent sub-matrices of the same window size are three sub-matrices above, one sub-matrix to the right, and one sub-matrix below;
Splitting the high-frequency interaction region (l_i, r_i), repeating the judgment and identification after the split, and stopping when the split region no longer contains genomic block i specifically comprises the following steps:
First, the genomic block m_i in the high-frequency interaction region (l_i, r_i) whose summed interaction with the other genomic blocks in that region is smallest is taken as the split point, and the region is divided into two high-frequency interaction regions, one on each side of m_i;
Then, a judgment is made:
If i = m_i, it is determined that no quasi-nucleus centered on genomic block i exists;
If i < m_i, the resulting region that contains genomic block i is taken as the high-frequency interaction region of the genomic block, and steps S2.1 to S2.2 are repeated to judge the quasi-nucleus;
If i > m_i, the resulting region that contains genomic block i is taken as the high-frequency interaction region of the genomic block, and steps S2.1 to S2.2 are repeated to judge the quasi-nucleus;
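The comparison in steps S2.1 to S2.2 can be sketched as follows. This is only an illustration: the exact positions of the five neighbouring sub-matrices are not recoverable from the text, so the offsets in `ASSUMED_NEIGHBOURS` are an assumption, as are the choices to require the window to exceed every neighbour individually and to skip neighbours that fall outside the matrix. The names `block_mean` and `is_quasi_nucleus` are hypothetical.

```python
from typing import Optional, Sequence

Matrix = Sequence[Sequence[float]]

# ASSUMPTION: (row_shift, col_shift) of the five neighbours, in units of the
# window size w: three above, one to the right, one below. Illustrative only.
ASSUMED_NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 0)]


def block_mean(m: Matrix, r0: int, c0: int, w: int) -> Optional[float]:
    """Mean of the w x w window with top-left cell (r0, c0); None if out of bounds."""
    n = len(m)
    if r0 < 0 or c0 < 0 or r0 + w > n or c0 + w > n:
        return None
    total = sum(m[r][c] for r in range(r0, r0 + w) for c in range(c0, c0 + w))
    return total / (w * w)


def is_quasi_nucleus(m: Matrix, l: int, r: int,
                     neighbours=ASSUMED_NEIGHBOURS) -> bool:
    """True if the diagonal window l..r has a larger mean than each in-bounds neighbour."""
    w = r - l + 1
    own = block_mean(m, l, l, w)
    for dr, dc in neighbours:
        other = block_mean(m, l + dr * w, l + dc * w, w)
        # ASSUMPTION: neighbours outside the matrix are ignored.
        if other is not None and own <= other:
            return False
    return True
```

Passing a different `neighbours` list is enough to test another layout of the five sub-matrices without touching the rest of the function.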
S3. The quasi-nuclei identified on each chromosome are processed according to the relationship between each pair of adjacent quasi-nuclei, to obtain mutually non-overlapping quasi-nuclei; this specifically comprises the following steps:
S3.1. For the quasi-nuclei identified on each chromosome, determine the relationship between two adjacent quasi-nuclei:
If one of the two adjacent quasi-nuclei contains the other, the contained (inner) quasi-nucleus is retained and the containing (outer) quasi-nucleus is filtered out;
If the two adjacent quasi-nuclei overlap, a further judgment is made: if the two quasi-nuclei still satisfy the definition of a quasi-nucleus after being merged, they are merged into one quasi-nucleus; otherwise, the quasi-nucleus with the larger average interaction value is retained and the other is filtered out;
S3.2. Repeat step S3.1 until all quasi-nuclei on the whole chromosome have been judged and processed, finally obtaining mutually non-overlapping quasi-nuclei;
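The containment and overlap rules of step S3 can be sketched in a single left-to-right pass, assuming quasi-nuclei are given as inclusive (start, end) block intervals. The function name `resolve_overlaps` and the two callbacks (`mean_value`, returning the average interaction value of an interval, and `is_valid`, checking the quasi-nucleus definition) are hypothetical; the text describes repeating the pairwise step, and the single sorted pass here is one possible way to do that, not necessarily the patented procedure.

```python
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) block indices


def resolve_overlaps(
    quasi: List[Interval],
    mean_value: Callable[[Interval], float],
    is_valid: Callable[[Interval], bool],
) -> List[Interval]:
    """Return non-overlapping quasi-nuclei following the rules of step S3.1."""
    out: List[Interval] = []
    for cur in sorted(set(quasi)):
        if out and cur[0] <= out[-1][1]:       # cur touches the last kept interval
            prev = out.pop()
            if cur[1] <= prev[1]:
                pass                           # prev contains cur: keep the contained one
            elif cur[0] == prev[0]:
                cur = prev                     # cur contains prev: keep the contained one
            else:                              # partial overlap
                merged = (prev[0], cur[1])
                if is_valid(merged):
                    cur = merged               # merged region is still a quasi-nucleus
                elif mean_value(prev) >= mean_value(cur):
                    cur = prev                 # keep the stronger of the two
        out.append(cur)
    return out
```

Because the input is sorted by start position, whatever survives a comparison always starts after the end of the previously kept interval, so the output list stays non-overlapping.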
S4. According to the correlation between the quasi-nuclei, the non-overlapping quasi-nuclei on a chromosome are merged, and each merged nucleus is taken as the nucleus of a chromosome topological association domain (TAD) to be predicted; specifically, the cosine similarity between all adjacent quasi-nuclei is calculated, and any run of consecutive adjacent quasi-nuclei whose cosine similarity is above a set threshold and whose average interaction value between adjacent quasi-nuclei is greater than the mean of the non-zero interaction values on the whole chromosome is merged into a new region, which serves as the nucleus in the nucleus-attachment structure model of the TAD to be predicted;
In a specific implementation, the cosine similarity of adjacent quasi-nuclei pc_i and pc_j is calculated as: $$\cos(pc_i, pc_j) = \frac{\langle \mathbf{v}_i, \mathbf{v}_j \rangle}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert}$$
where $\mathbf{v}_i$ is the feature vector composed of the average interaction values between pc_i and all other quasi-nuclei, its k-th component $a_{ki}$ being the average interaction value between quasi-nuclei pc_k and pc_i; $\mathbf{v}_j$ is the feature vector composed of the average interaction values between pc_j and all other quasi-nuclei, its k-th component $a_{kj}$ being the average interaction value between quasi-nuclei pc_k and pc_j; $\langle \cdot, \cdot \rangle$ is the inner product of two vectors; and $\lVert \cdot \rVert$ is the norm of a vector;
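As a rough illustration of step S4, the sketch below computes the cosine similarity of two feature vectors and merges runs of adjacent quasi-nuclei that pass both conditions. The function names, the argument layout (`features[k]` is the feature vector of the k-th quasi-nucleus, `between_mean[k]` the average interaction value between quasi-nuclei k and k+1) and the zero-norm fallback are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) block indices


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # ASSUMPTION: zero vectors count as dissimilar


def merge_adjacent(
    quasi: List[Interval],
    features: List[Sequence[float]],
    between_mean: List[float],
    sim_threshold: float,
    chrom_nonzero_mean: float,
) -> List[Interval]:
    """Merge consecutive quasi-nuclei that are similar and strongly interacting."""
    if not quasi:
        return []
    nuclei = [quasi[0]]
    for k in range(1, len(quasi)):
        linked = (
            cosine_similarity(features[k - 1], features[k]) > sim_threshold
            and between_mean[k - 1] > chrom_nonzero_mean
        )
        if linked:
            nuclei[-1] = (nuclei[-1][0], quasi[k][1])  # extend the current nucleus
        else:
            nuclei.append(quasi[k])                    # start a new nucleus
    return nuclei
```

The threshold itself is left as a parameter because the text only says it is "set" and gives no value.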
S5. Determine the affiliation of each genomic block in the attachment candidate regions and, together with the TAD nuclei obtained in step S4, obtain the finally predicted chromosome topological association domains; specifically, the region between two nuclei is defined as an attachment region, and for each genomic block in each attachment region the adjacent TAD nucleus to which it belongs is determined, thereby obtaining the finally predicted TADs; each TAD comprises one nucleus and the attachment regions on both sides of that nucleus;
In a specific implementation, this comprises the following steps:
S5.1. For the genomic blocks lying between two adjacent nuclei, filter out the genomic blocks whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values on the whole chromosome;
S5.2. On the basis of step S5.1, remove the background signal from the sub-matrix formed by the two adjacent nuclei and the genomic blocks between them; the background signal is defined as the mean of the non-zero interaction values in the sub-matrix formed by the genomic blocks between the two adjacent nuclei;
S5.3. On the basis of step S5.2, for the genomic blocks lying between the two adjacent nuclei, filter out the genomic blocks that have no non-zero interaction value with any genomic block within the genomic region;
S5.4. On the basis of step S5.3, calculate, for each remaining genomic block between the two adjacent nuclei, the average interaction value of the sub-matrix in which it lies, and take the genomic block whose sub-matrix has the smallest average interaction value as the split point; the genomic blocks upstream of the split point are identified as attachments of the upstream nucleus, and the genomic blocks downstream of the split point are identified as attachments of the downstream nucleus; the finally predicted chromosome topological association domains are thereby obtained.
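The final assignment in step S5.4 can be sketched as below, together with the "mean of non-zero interaction values" that steps S5.1 and S5.2 rely on. The per-block score (the average interaction value of each block's sub-matrix) is taken as an input because the extent of that sub-matrix is not recoverable from the text; `nonzero_mean` and `assign_attachments` are hypothetical names, and leaving the split block itself unassigned is an assumption.

```python
from typing import Dict, Iterable, List, Tuple


def nonzero_mean(values: Iterable[float]) -> float:
    """Mean of the non-zero entries (0.0 if there are none)."""
    nz = [v for v in values if v != 0]
    return sum(nz) / len(nz) if nz else 0.0


def assign_attachments(
    blocks: List[int], score: Dict[int, float]
) -> Tuple[int, List[int], List[int]]:
    """Split the remaining blocks between two nuclei at the lowest-scoring block.

    blocks: indices of the blocks that survived steps S5.1 to S5.3.
    score:  average interaction value of each block's sub-matrix.
    Returns (split_block, attachments_of_upstream_nucleus,
             attachments_of_downstream_nucleus).
    """
    split = min(blocks, key=lambda b: score[b])   # first minimum wins on ties
    upstream = [b for b in blocks if b < split]
    downstream = [b for b in blocks if b > split]
    # ASSUMPTION: the split block is not assigned to either nucleus.
    return split, upstream, downstream
```

For example, with remaining blocks 10, 11, 13, 14 and scores 3.0, 0.5, 2.0, 4.0, block 11 becomes the split point, block 10 attaches to the upstream nucleus and blocks 13 and 14 to the downstream one.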
The method of the present invention is further described below with reference to an embodiment:
FIG. 2 shows the steps of the method for predicting chromosome topological association domains based on the nucleus-attachment structure model provided by the embodiment; the Hi-C map in the figure is the KR-normalized Hi-C interaction matrix of GM12878_combined at 50 kb resolution from the GSE63525 dataset, and the segment shown is genomic blocks 120 to 200 of chromosome 1;
Step S1. For each genomic block in the interaction matrix between genomic blocks obtained by genome-wide conformation capture and sequencing (the Hi-C interaction matrix for short), identify its high-frequency interaction region using K-means clustering;
As shown in FIG. 2-① (preprocessing of the Hi-C interaction matrix), the interaction value of each genomic block with itself, on the diagonal of the KR-normalized Hi-C interaction matrix of GM12878_combined at 50 kb resolution, is set to 0;
As shown in FIG. 2-② (identification of high-frequency interaction regions), for each genomic block i, the other genomic blocks whose interaction value with i is non-zero are clustered by the K-means algorithm with k = 2; the classification function for the other genomic blocks is: $$\mathrm{class}(j) = \arg\min_{k \in \{1, 2\}} \lvert a_{ij} - c_k \rvert$$
where $a_{ij}$ is the interaction value between genomic blocks i and j, and $c_k$ is the mean of the k-th center. The initial center values $c_1$ and $c_2$ of the two classes are set to the interaction values found at two given positions in the ascending-sorted non-zero interaction values; $c_1$ corresponds to the center of the low-frequency interaction class and $c_2$ to the center of the high-frequency interaction class; by solving the classification function, genomic block j is assigned the class whose center value is at the smallest distance;
A high-frequency interaction region (l_i, r_i) is defined for each genomic block i, where l_i is the smallest block number among the genomic blocks in the high-interaction class of genomic block i and r_i is the largest block number among them; a schematic of the high-frequency interaction region is shown in FIG. 2-②b;
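Step S1 can be sketched for a single genomic block as a one-dimensional two-class K-means over its non-zero interaction values. The positions used for the two initial centers are not recoverable from the text, so the quantile parameters `lo_q` and `hi_q` below are assumptions, as are the function name and the iteration cap.

```python
from typing import Optional, Sequence, Tuple


def high_frequency_region(
    row: Sequence[float], i: int,
    lo_q: float = 0.25, hi_q: float = 0.75, max_iter: int = 100,
) -> Optional[Tuple[int, int]]:
    """Return (l_i, r_i) for block i, or None if no high-frequency class exists.

    row: interaction values of block i with every block on the chromosome.
    lo_q, hi_q: ASSUMED positions (as fractions of the sorted non-zero values)
                of the initial low and high centers.
    """
    partners = [(j, v) for j, v in enumerate(row) if j != i and v != 0]
    if not partners:
        return None
    values = sorted(v for _, v in partners)
    c_low = values[int(lo_q * (len(values) - 1))]
    c_high = values[int(hi_q * (len(values) - 1))]
    for _ in range(max_iter):
        # Assign each value to the nearer center, then recompute the centers.
        high = [v for v in values if abs(v - c_high) < abs(v - c_low)]
        low = [v for v in values if abs(v - c_high) >= abs(v - c_low)]
        new_low = sum(low) / len(low) if low else c_low
        new_high = sum(high) / len(high) if high else c_high
        if (new_low, new_high) == (c_low, c_high):
            break
        c_low, c_high = new_low, new_high
    members = [j for j, v in partners if abs(v - c_high) < abs(v - c_low)]
    if not members:
        return None
    return min(members), max(members)
```

Skipping `j == i` has the same effect as zeroing the diagonal in the preprocessing step, so the function can be applied to a row of the raw matrix as well.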
Step S2. As shown in FIG. 2-③a (FIG. 2-③ is the construction of TAD quasi-nuclei), for each genomic block, judge and identify from its high-frequency interaction region whether a quasi-nucleus centered on that genomic block exists;
A quasi-nucleus is defined as follows: if the sub-matrix that the high-frequency interaction region (l_i, r_i) of genomic block i forms in the Hi-C interaction matrix has an average interaction value greater than those of the five adjacent sub-matrices of the same window size, namely the three sub-matrices above, the one sub-matrix to the right, and the one sub-matrix below, then the high-frequency interaction region is the quasi-nucleus of genomic block i;
If the sub-matrix that the high-frequency interaction region (l_i, r_i) of genomic block i forms in the Hi-C interaction matrix has an average interaction value not greater than those of the five adjacent sub-matrices of the same window size, the high-frequency interaction region is split and the quasi-nucleus is judged and identified again, stopping only when the split region no longer contains genomic block i;
When splitting the high-frequency interaction region (l_i, r_i) of genomic block i, the genomic block m_i in the region whose summed interaction with the other genomic blocks in the region is smallest is first taken as the split point, and the region is divided into two high-frequency interaction regions, one on each side of m_i;
Further, when i = m_i, it is judged that no quasi-nucleus centered on genomic block i exists; when i < m_i or i > m_i, the judgment and identification of the quasi-nucleus continue on the resulting region that contains genomic block i; the judgment and identification proceed as described above;
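The splitting loop described above can be sketched as follows. The quasi-nucleus test itself is passed in as a callback so the sketch does not depend on how the five neighbouring sub-matrices are laid out. The exact boundaries of the two sub-regions are not recoverable from the text; excluding the split block m_i from both is an assumption, and `find_quasi_nucleus` is a hypothetical name.

```python
from typing import Callable, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def find_quasi_nucleus(
    m: Matrix, i: int, l: int, r: int,
    is_quasi_nucleus: Callable[[Matrix, int, int], bool],
) -> Optional[Tuple[int, int]]:
    """Shrink the region (l, r) around block i until it passes the test or i is lost."""
    while l <= i <= r:
        if is_quasi_nucleus(m, l, r):
            return (l, r)
        # m_i: block with the smallest summed interaction with the rest of the region.
        split = min(
            range(l, r + 1),
            key=lambda b: sum(m[b][c] for c in range(l, r + 1) if c != b),
        )
        if split == i:
            return None          # no quasi-nucleus centered on block i
        # ASSUMPTION: the split block is dropped from the region that is kept.
        if i < split:
            r = split - 1
        else:
            l = split + 1
    return None
```

Each pass removes at least one block from the region, so the loop always terminates: either the test succeeds, or the region shrinks to block i alone, which then becomes its own split point.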
Step S3. As shown in FIG. 2-③b and 2-③c, the quasi-nuclei identified on each chromosome are filtered or merged according to the containment or overlap relationship between each pair of adjacent quasi-nuclei, to obtain mutually non-overlapping quasi-nuclei;
When one of two adjacent quasi-nuclei contains the other, the contained (inner) quasi-nucleus is retained and the containing (outer) quasi-nucleus is filtered out;
When two adjacent quasi-nuclei overlap, they are merged into one quasi-nucleus if the merged region still satisfies the definition of a quasi-nucleus; otherwise, only the one with the larger average interaction frequency is retained;
After one group of pairwise adjacent quasi-nuclei has been processed, the search for pairwise adjacent quasi-nuclei in a containment or overlap relationship resumes from the next quasi-nucleus and the same processing is applied, until no overlapping quasi-nuclei remain on the whole chromosome;
Step S4. As shown in FIG. 2-④ (construction of the nuclei in the nucleus-attachment structure model of TADs), the non-overlapping quasi-nuclei on a chromosome are merged according to the correlation between quasi-nuclei, and each merged nucleus is regarded as the nucleus of a chromosome topological association domain (TAD) to be predicted;
The correlation between every pair of adjacent quasi-nuclei pc_i and pc_j is calculated with the cosine similarity, using the following formula: $$\cos(pc_i, pc_j) = \frac{\langle \mathbf{v}_i, \mathbf{v}_j \rangle}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert}$$ where $\mathbf{v}_i$ and $\mathbf{v}_j$ are the feature vectors composed of the average interaction values between pc_i, respectively pc_j, and all other quasi-nuclei;
A correlation threshold is set; two or more consecutive adjacent quasi-nuclei whose similarity is above the threshold and whose average interaction value between adjacent quasi-nuclei is greater than the mean of the non-zero interaction values on the whole chromosome are merged into a new region, which serves as the nucleus in the nucleus-attachment structure model of a TAD;
Step S5. As shown in FIG. 2-⑤ (establishment of the complete nucleus-attachment structure model of TADs), the region between two nuclei is defined as an attachment candidate region, and for each genomic block in an attachment candidate region it is determined to which adjacent nucleus the block belongs; each finally predicted TAD consists of one nucleus and the attachments on both sides of it; in a specific implementation, this comprises the following steps:
S5.1. For the genomic blocks lying between two adjacent nuclei, filter out the genomic blocks whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values on the whole chromosome;
S5.2. On the basis of step S5.1, remove the background signal from the sub-matrix formed by the two adjacent nuclei and the genomic blocks between them; the background signal is defined as the mean of the non-zero interaction values in the sub-matrix formed by the genomic blocks between the two adjacent nuclei;
S5.3. On the basis of step S5.2, for the genomic blocks lying between the two adjacent nuclei, filter out the genomic blocks that have no non-zero interaction value with any genomic block within the genomic region;
S5.4. On the basis of step S5.3, calculate, for each remaining genomic block between the two adjacent nuclei, the average interaction value of the sub-matrix in which it lies, and take the genomic block whose sub-matrix has the smallest average interaction value as the split point; the genomic blocks upstream of the split point are identified as attachments of the upstream nucleus, and the genomic blocks downstream of the split point are identified as attachments of the downstream nucleus; the finally predicted chromosome topological association domains are thereby obtained.
FIG. 3 is a schematic structural diagram of the system of the present invention. The present invention also provides a prediction system implementing the above method for predicting chromosome topological association domains, comprising, connected in series, a high-frequency interaction region identification module, a quasi-nucleus identification module, a quasi-nucleus processing module, a TAD nucleus identification module and a TAD identification module. The high-frequency interaction region identification module takes each genomic block in the interaction matrix between genomic blocks, identifies the corresponding high-frequency interaction region with a clustering algorithm, and passes the resulting regions to the quasi-nucleus identification module. The quasi-nucleus identification module judges and identifies, for each genomic block, whether its high-frequency interaction region holds a quasi-nucleus centered on that block, and passes the resulting quasi-nuclei to the quasi-nucleus processing module. The quasi-nucleus processing module processes the quasi-nuclei identified on each chromosome according to the relationship between each pair of adjacent quasi-nuclei, obtains mutually non-overlapping quasi-nuclei, and passes them to the TAD nucleus identification module. The TAD nucleus identification module merges the non-overlapping quasi-nuclei on a chromosome according to the correlation between them, takes each merged nucleus as the nucleus of a TAD to be predicted, and passes the resulting nuclei to the TAD identification module. The TAD identification module determines the affiliation of each genomic block in the attachment candidate regions and, combined with the received TAD nuclei, obtains and outputs the finally predicted chromosome topological association domains.
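The serial connection of the five modules can be pictured as a chain in which each stage consumes the output of the previous one. The sketch below shows only that wiring; `run_pipeline` and the stage callables are hypothetical, and in practice later stages would also need access to the interaction matrix (for example by passing a shared context object through the chain).

```python
from functools import reduce
from typing import Any, Callable, Sequence

Stage = Callable[[Any], Any]


def run_pipeline(stages: Sequence[Stage], data: Any) -> Any:
    """Feed `data` through the stages in order; each stage receives the previous result."""
    return reduce(lambda current, stage: stage(current), stages, data)


# Hypothetical usage, one callable per module of FIG. 3:
# tads = run_pipeline(
#     [identify_high_frequency_regions, identify_quasi_nuclei,
#      process_quasi_nuclei, identify_tad_nuclei, identify_tads],
#     interaction_matrix,
# )
```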
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210245600.9A CN114446384B (en) | 2022-03-14 | 2022-03-14 | Prediction method and prediction system for chromosome topological association domains |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114446384A (en) | 2022-05-06 |
CN114446384B (en) | 2024-11-05 |
Family
ID=81358910
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210245600.9A Active CN114446384B (en) | 2022-03-14 | 2022-03-14 | Prediction method and prediction system for chromosome topological association domains |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114446384B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190005191A1 (en) * | 2015-07-14 | 2019-01-03 | Whitehead Institute For Biomedical Research | Chromosome neighborhood structures and methods relating thereto |
US20190295684A1 (en) * | 2018-03-22 | 2019-09-26 | The Regents Of The University Of Michigan | Method and apparatus for analysis of chromatin interaction data |
Non-Patent Citations (1)
Title |
---|
许希伦: "染色体相互作用密度与拓扑域相关分析", 电脑知识与技术, no. 03, 25 January 2020 (2020-01-25) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114944190A (en) * | 2022-05-12 | 2022-08-26 | 南开大学 | TAD (TAD-based data analysis) identification method and system based on Hi-C sequencing data |
CN114944190B (en) * | 2022-05-12 | 2024-04-19 | 南开大学 | TAD identification method and system based on Hi-C sequencing data |
Also Published As
Publication number | Publication date |
---|---|
CN114446384B (en) | 2024-11-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||