CN114446384A

CN114446384A - Prediction method and prediction system of chromosome topological association domains

Info

Publication number: CN114446384A
Application number: CN202210245600.9A
Authority: CN
Inventors: 彭小清; 李一鸣; 孔祥艳; 盛羽; 段桂华
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2022-03-14
Filing date: 2022-03-14
Publication date: 2022-05-06
Anticipated expiration: 2042-03-14
Also published as: CN114446384B

Abstract

The invention discloses a method for predicting chromosome topological association domains. The method comprises: obtaining each genomic block in the interaction matrix between genomic blocks and identifying its high-frequency interaction region; for each genomic block, identifying a quasi-core from its high-frequency interaction region; processing the quasi-cores identified on each chromosome to obtain non-overlapping quasi-cores; merging the non-overlapping quasi-cores on a chromosome to obtain the cores of the chromosome topological association domains to be predicted; and determining the affiliation of each genomic block in the attachment candidate regions and combining the result with the cores to obtain the final predicted chromosome topological association domains. The invention also discloses a prediction system that implements the method. The invention makes full use of the global information in Hi-C data, narrows the range in which candidate boundaries are located, and requires no user-defined parameters; it can therefore predict topological association domains accurately and reliably.

Description

Prediction method and prediction system of chromosome topological association domains

Technical Field

The invention belongs to the field of computer technology, and in particular relates to a method and a system for predicting chromosome topological association domains.

Background

In recent years, the emergence of genome-wide chromosome conformation capture technology (high-throughput chromosome conformation capture, Hi-C) has advanced our understanding of the hierarchy of chromosome spatial structure. Researchers converted Hi-C sequencing data from mammalian cells into Hi-C interaction matrices and visualized them, and thereby discovered highly self-interacting regions at resolutions below 100 kb; such regions are topologically associating domains (TADs). A Hi-C interaction matrix is constructed as follows: a chromosome is divided into N segments of equal length, and an N × N matrix M is built to represent the interaction signal between every pair of segments on the chromosome. Each equal-length unit segment is called a genomic block, and the size of a genomic block is determined by the resolution of the Hi-C interaction matrix. The matrix is filled by counting how the sequencing reads produced by Hi-C align to pairs of genomic blocks, which gives the interaction frequency between the N genomic blocks. For example, each time a sequencing read can be split and aligned to genomic block i and genomic block j, the matrix elements $M_{i,j}$ and $M_{j,i}$ are each incremented by 1.
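The counting rule in the last sentence can be sketched in Python. This is a minimal illustration rather than the procedure used to produce any particular dataset: the helper name `build_contact_matrix`, the representation of a read pair as two 0-based positions on one chromosome, and the choice to count a pair that falls inside a single block once on the diagonal are assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def build_contact_matrix(
    read_pairs: Iterable[Tuple[int, int]],
    chrom_length: int,
    block_size: int,
) -> List[List[int]]:
    """Count read pairs per pair of genomic blocks (hypothetical helper)."""
    n_blocks = (chrom_length + block_size - 1) // block_size  # ceiling division
    matrix = [[0] * n_blocks for _ in range(n_blocks)]
    for pos_a, pos_b in read_pairs:
        i = pos_a // block_size  # block that holds the first read end
        j = pos_b // block_size  # block that holds the second read end
        matrix[i][j] += 1
        if i != j:
            # Keep the matrix symmetric: M[i][j] and M[j][i] both grow by 1.
            matrix[j][i] += 1
    return matrix


# Three toy read pairs on a 150 kb chromosome at 50 kb resolution (3 blocks).
pairs = [(10_000, 120_000), (60_000, 130_000), (20_000, 30_000)]
M = build_contact_matrix(pairs, chrom_length=150_000, block_size=50_000)
```

The block size plays the role of the matrix resolution: halving it doubles N and spreads the same reads over four times as many matrix entries.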

At present, owing to the limits of microscopy and biotechnology, researchers still cannot observe TADs directly and completely, and the mechanism by which TADs form remains unclear. TAD information must therefore be obtained indirectly, for example by building a Hi-C interaction matrix from the interactions between chromosome fragments captured in Hi-C sequencing data and then predicting TADs with a suitable algorithm. In recent years, researchers have proposed methods that predict TADs with machine learning; however, applying these methods across different cell lines is severely limited, because each cell line usually requires a large amount of its own specific information for feature extraction and model training, which places an additional burden on researchers.

Existing TAD prediction algorithms mainly predict TADs from the interaction preference at boundaries, the similarity within a TAD, the difference between TAD and non-TAD regions, or the change in contact-frequency density within a TAD. These methods either focus solely on finding boundaries and miss the information inside the TAD, or require user-defined parameters that control TAD size, clustering termination thresholds, local extrema, and so on. As a result, TAD identification is highly variable and subjective. Moreover, since a TAD is a structure without a precise definition, it should not be predicted by constraining its own properties.

Summary of the Invention

One object of the present invention is to provide a method for predicting chromosome topological association domains that is reliable and accurate.

A second object of the present invention is to provide a prediction system that implements the method for predicting chromosome topological association domains.

The method for predicting chromosome topological association domains provided by the present invention comprises the following steps:

S1. Obtain each genomic block in the interaction matrix between genomic blocks, and use a clustering algorithm to identify the corresponding high-frequency interaction region;

S2. For each genomic block, determine from the corresponding high-frequency interaction region whether a quasi-core centered on that genomic block exists:

If the high-frequency interaction region contains a quasi-core centered on the genomic block, proceed to the subsequent steps;

If the high-frequency interaction region does not contain a quasi-core centered on the genomic block, split the high-frequency interaction region and repeat the determination and identification of the quasi-core until the region obtained by splitting no longer contains the genomic block;

S3. Process the quasi-cores identified on each chromosome according to the relationship between each pair of adjacent quasi-cores to obtain non-overlapping quasi-cores;

S4. Merge the non-overlapping quasi-cores on a chromosome according to the correlation between the quasi-cores, and take the merged cores as the cores of the chromosome topological association domains to be predicted;

S5. Determine the affiliation of each genomic block in the attachment candidate regions and, together with the cores obtained in step S4, obtain the final predicted chromosome topological association domains.

Step S1 specifically comprises: using genome-wide conformation capture and sequencing to obtain each genomic block in the interaction matrix between genomic blocks, and clustering with the K-means algorithm with k = 2 to identify the corresponding high-frequency interaction region.

Step S1 specifically includes the following steps:

S1.1. Use genome-wide conformation capture and sequencing to obtain the interaction matrix between genomic blocks;

S1.2. Set to 0 the diagonal entries of the interaction matrix obtained in step S1.1, that is, the interaction value of each genomic block with itself;

S1.3. For any genomic block i, use the K-means clustering algorithm with k = 2 to cluster the other genomic blocks whose interaction value with genomic block i is not 0;

S1.4. For each genomic block i, define the corresponding high-frequency interaction region $(l_i, r_i)$, where $l_i$ is the smallest block number and $r_i$ the largest block number among the genomic blocks in the high-interaction class of genomic block i.

The following function is used in step S1.3 as the classification function $c_j$ of the other genomic blocks:

$$c_j = \underset{k \in \{1, 2\}}{\arg\min}\ \lVert M_{i,j} - \mu_k \rVert_2$$

where $M_{i,j}$ is the interaction value between genomic block i and genomic block j; $\mu_k$ is the mean of the k-th center; $\arg\min$ returns the class number of the center nearest to $M_{i,j}$; and $\lVert \cdot \rVert_2$ is the 2-norm. The initial centers $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two specified positions of the non-zero interaction values sorted in ascending order; $\mu_1$ is the center of the low-frequency interaction class and $\mu_2$ is the center of the high-frequency interaction class.

Solving the classification function assigns genomic block j to the class whose center is nearest to it.
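A minimal Python sketch of this two-class clustering for one row of the interaction matrix follows. For scalar interaction values the 2-norm reduces to an absolute difference. The function name `two_means_labels` is hypothetical, and the two starting positions (the lower and upper quartile of the sorted non-zero values) are an assumption chosen only to make the example runnable; the positions prescribed by the method are not reproduced here.

```python
from typing import Dict, List


def two_means_labels(row: List[float], i: int, max_iter: int = 100) -> Dict[int, int]:
    """Label each block j with a non-zero interaction to block i: 1 = low, 2 = high."""
    values = {j: v for j, v in enumerate(row) if j != i and v != 0}
    if not values:
        return {}
    ordered = sorted(values.values())
    # ASSUMPTION: start from the lower- and upper-quartile positions.
    mu = [ordered[len(ordered) // 4], ordered[(3 * len(ordered)) // 4]]
    labels: Dict[int, int] = {}
    for _ in range(max_iter):
        # Assignment step: nearest center wins (|x - mu| is the scalar 2-norm).
        new_labels = {
            j: 1 if abs(v - mu[0]) <= abs(v - mu[1]) else 2
            for j, v in values.items()
        }
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for k in (1, 2):
            members = [values[j] for j, c in labels.items() if c == k]
            if members:
                mu[k - 1] = sum(members) / len(members)
    return labels


# Row of block 0: strong contacts with blocks 1 and 2, weak ones with 3 and 5.
labels = two_means_labels([0, 9, 8, 1, 0, 2], i=0)
```

Block 4 has a zero interaction value with block 0, so it takes no part in the clustering and receives no label.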

Step S2 specifically includes the following steps:

S2.1. Compute the average interaction value of the submatrix that the high-frequency interaction region $(l_i, r_i)$ of genomic block i forms in the interaction matrix between genomic blocks;

S2.2. Compare the average interaction value obtained in step S2.1 with the average interaction values of the five neighboring submatrices of the same window size:

If the average interaction value obtained in step S2.1 is greater than the average interaction values of the five neighboring submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is determined to be the quasi-core of genomic block i;

If the average interaction value obtained in step S2.1 is not greater than the average interaction values of the five neighboring submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is split; the determination and identification are repeated after splitting, and stop when the region obtained by splitting no longer contains genomic block i;

The five neighboring submatrices of the same window size are the three submatrices above, one submatrix to the right, and one submatrix below.
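The comparison in step S2.2 can be sketched as below. The exact positions of the five neighboring windows are not recoverable from the text, so the layout used here (three windows in the row of windows directly above the diagonal window, one directly to the right, one directly below) is an assumption, as is the choice to ignore neighbors that fall outside the matrix and to require the region to exceed every in-bounds neighbor. `block_mean` and `is_quasi_core` are hypothetical names.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def block_mean(M: Matrix, r0: int, c0: int, w: int) -> Optional[float]:
    """Mean of the w x w window with top-left corner (r0, c0); None if out of bounds."""
    n = len(M)
    if r0 < 0 or c0 < 0 or r0 + w > n or c0 + w > n:
        return None
    total = sum(M[r][c] for r in range(r0, r0 + w) for c in range(c0, c0 + w))
    return total / (w * w)


def is_quasi_core(M: Matrix, l: int, r: int) -> bool:
    """True if the diagonal window [l, r] beats every in-bounds neighbor window."""
    w = r - l + 1
    center = block_mean(M, l, l, w)
    if center is None:
        return False
    # ASSUMED layout: (row offset, column offset) of the five neighbors.
    offsets: List[Tuple[int, int]] = [(-w, -w), (-w, 0), (-w, w), (0, w), (w, 0)]
    neighbors = [block_mean(M, l + dr, l + dc, w) for dr, dc in offsets]
    return all(m is None or center > m for m in neighbors)


# 6 x 6 toy matrix with one dense pair of blocks (2 and 3), diagonal already 0.
M = [[0.0] * 6 for _ in range(6)]
M[2][3] = M[3][2] = 10.0
dense = is_quasi_core(M, 2, 3)
empty = is_quasi_core(M, 0, 1)
```

Because the neighbor windows have the same size as the candidate region, the test is scale-free: no window length has to be supplied by the user.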

Splitting the high-frequency interaction region $(l_i, r_i)$, repeating the determination and identification after splitting, and stopping when the region obtained by splitting no longer contains genomic block i specifically includes the following steps:

First, take as the split point the genomic block $m_i$ in the high-frequency interaction region $(l_i, r_i)$ whose summed interaction with the other genomic blocks in the region is smallest, and divide the region at $m_i$ into two high-frequency interaction regions, one on the lower-numbered side of $m_i$ and one on the higher-numbered side;

Then decide as follows:

If $i = m_i$, it is determined that no quasi-core centered on genomic block i exists;

If $i < m_i$, take the region on the lower-numbered side of $m_i$ as the high-frequency interaction region of genomic block i, and repeat steps S2.1 to S2.2 to determine the quasi-core;

If $i > m_i$, take the region on the higher-numbered side of $m_i$ as the high-frequency interaction region of genomic block i, and repeat steps S2.1 to S2.2 to determine the quasi-core.
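One splitting step can be sketched as follows. The names `split_point` and `next_region` are hypothetical, and excluding the split point itself from both halves is an assumption; the text does not say to which half, if either, block $m_i$ belongs.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]
Region = Tuple[int, int]


def split_point(M: Matrix, l: int, r: int) -> int:
    """Block in [l, r] whose summed interaction with the other blocks of [l, r] is smallest."""
    def within_sum(m: int) -> float:
        return sum(M[m][j] for j in range(l, r + 1) if j != m)
    return min(range(l, r + 1), key=within_sum)


def next_region(M: Matrix, i: int, l: int, r: int) -> Optional[Region]:
    """Region to examine next for block i, or None when i is the split point."""
    m = split_point(M, l, r)
    if i == m:
        return None  # no quasi-core centered on block i
    # ASSUMPTION: the split point is dropped from both halves.
    return (l, m - 1) if i < m else (m + 1, r)


# Blocks 0-1 and 3-4 interact strongly; block 2 is a weak bridge between them.
M = [
    [0, 9, 1, 0, 0],
    [9, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 9],
    [0, 0, 1, 9, 0],
]
left = next_region(M, 0, 0, 4)
right = next_region(M, 4, 0, 4)
none = next_region(M, 2, 0, 4)
```

Each split strictly shrinks the region, so repeating the step for a given block always terminates.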

Step S3 specifically includes the following steps:

S3.1. For the quasi-cores identified on each chromosome, determine the relationship between each pair of adjacent quasi-cores:

If one of two adjacent quasi-cores contains the other, keep the contained quasi-core and discard the containing quasi-core;

If two adjacent quasi-cores overlap, make a further determination: if the region obtained by merging the two quasi-cores still satisfies the definition of a quasi-core, merge them into a single quasi-core; otherwise, keep whichever of the two has the larger average interaction value and discard the other;

S3.2. Repeat step S3.1 until all quasi-cores on the whole chromosome have been examined and processed, yielding non-overlapping quasi-cores.
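The containment and overlap rules of step S3 can be sketched as a single left-to-right pass. The function `resolve_overlaps` is hypothetical; the quasi-core test and the average interaction value are passed in as callables so that the sketch does not commit to a particular definition, and processing the quasi-cores in sorted order is an assumption about how "adjacent" pairs are visited.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int]


def resolve_overlaps(
    cores: List[Region],
    is_quasi_core: Callable[[Region], bool],
    mean_interaction: Callable[[Region], float],
) -> List[Region]:
    """Return non-overlapping quasi-cores following the keep/merge rules of S3.1."""
    result: List[Region] = []
    for cur in sorted(set(cores)):
        while result and cur[0] <= result[-1][1]:  # cur touches the previous core
            prev = result.pop()
            if prev[0] <= cur[0] and cur[1] <= prev[1]:
                continue  # prev contains cur: keep the contained one (cur)
            if cur[0] <= prev[0] and prev[1] <= cur[1]:
                cur = prev  # cur contains prev: keep the contained one (prev)
                continue
            # Partial overlap: merge if the union is still a quasi-core.
            merged = (min(prev[0], cur[0]), max(prev[1], cur[1]))
            if is_quasi_core(merged):
                cur = merged
            else:
                cur = max(prev, cur, key=mean_interaction)
        result.append(cur)
    return result


cores = [(0, 5), (1, 3), (3, 6), (10, 12)]
scores = {(1, 3): 5.0, (3, 6): 7.0}
kept = resolve_overlaps(cores, lambda reg: False, lambda reg: scores.get(reg, 0.0))
fused = resolve_overlaps(cores, lambda reg: True, lambda reg: scores.get(reg, 0.0))
```

In the example, (0, 5) is dropped because it contains (1, 3); the overlapping pair (1, 3) and (3, 6) is then either reduced to the stronger one or merged, depending on the quasi-core test.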

Step S4 specifically comprises: computing the cosine similarity between all adjacent quasi-cores; merging into a new region every run of consecutive adjacent quasi-cores whose cosine similarity is above a set threshold and whose average interaction value between adjacent quasi-cores is greater than the mean of the non-zero interaction values on the whole chromosome; and taking that region as the core in the core-attachment structure model of the chromosome topological association domain to be predicted.

The cosine similarity of adjacent quasi-cores $pc_i$ and $pc_j$ is computed with the following formula:

$$\cos(pc_i, pc_j) = \frac{\langle V_i, V_j \rangle}{\lVert V_i \rVert \, \lVert V_j \rVert}$$

where $V_i$ is the feature vector composed of the average interaction values between $pc_i$ and all other quasi-cores, its k-th component being the average interaction value between quasi-core $pc_k$ and $pc_i$; $V_j$ is the feature vector composed of the average interaction values between $pc_j$ and all other quasi-cores, its k-th component being the average interaction value between quasi-core $pc_k$ and $pc_j$; $\langle \cdot, \cdot \rangle$ is the inner product of vectors; and $\lVert \cdot \rVert$ is the modulus of a vector.
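The feature vectors and their cosine similarity can be sketched as follows. The helper names are hypothetical, and giving each vector one component per quasi-core on the chromosome (including the quasi-core itself, so that both vectors have the same length) is an assumption about how "all other quasi-cores" is indexed.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Region = Tuple[int, int]


def mean_between(M: Matrix, a: Region, b: Region) -> float:
    """Average interaction value between the blocks of region a and those of region b."""
    rows = range(a[0], a[1] + 1)
    cols = range(b[0], b[1] + 1)
    return sum(M[r][c] for r in rows for c in cols) / (len(rows) * len(cols))


def feature_vector(M: Matrix, cores: Sequence[Region], i: int) -> List[float]:
    # ASSUMPTION: one component per quasi-core k, including k == i.
    return [mean_between(M, cores[k], cores[i]) for k in range(len(cores))]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product divided by the product of the vector norms (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


M = [[0.0, 2.0], [2.0, 0.0]]
cores = [(0, 0), (1, 1)]
v0 = feature_vector(M, cores, 0)
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Two quasi-cores score close to 1 when they contact the rest of the chromosome in the same proportions, regardless of the overall strength of those contacts.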

Step S5 specifically comprises: defining the region between two cores as an attachment region, and determining, for each genomic block in each attachment region, the neighboring core to which it belongs, thereby obtaining the final predicted chromosome topological association domains; each chromosome topological association domain includes one core and the attachment regions on both sides of that core.

Step S5 specifically includes the following steps:

S5.1. For the genomic blocks between two adjacent cores, filter out the genomic blocks whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values on the whole chromosome;

S5.2. On the basis of step S5.1, remove the background signal from the submatrix formed by the two adjacent cores and the genomic blocks between them; the background signal is defined as the mean of the non-zero interaction values in the submatrix formed by the genomic blocks between the two adjacent cores;

S5.3. On the basis of step S5.2, among the genomic blocks between the two adjacent cores, filter out those that have no non-zero interaction value with any genomic block in the genomic region concerned;

S5.4. On the basis of step S5.3, compute the average interaction value of the submatrix associated with each remaining genomic block between the two adjacent cores, and take the genomic block with the smallest submatrix average as the split point; genomic blocks upstream of the split point are assigned as attachments of the upstream core, and genomic blocks downstream of the split point are assigned as attachments of the downstream core, which yields the final predicted chromosome topological association domains.
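The final assignment in step S5.4 can be sketched as below, together with the chromosome-wide mean used as the filter threshold in step S5.1. Both function names are hypothetical. Because the submatrix associated with each block is not specified in the recoverable text, the sketch takes the per-block averages as an already computed input, and it returns the split block separately instead of assuming which core it joins.

```python
from typing import Dict, List, Tuple


def nonzero_mean(M: List[List[float]]) -> float:
    """Mean of the non-zero interaction values of a matrix (0.0 if there are none)."""
    vals = [v for row in M for v in row if v != 0]
    return sum(vals) / len(vals) if vals else 0.0


def assign_attachments(block_scores: Dict[int, float]) -> Tuple[List[int], int, List[int]]:
    """Split the blocks between two cores at the block with the smallest average.

    block_scores maps each remaining block number to its submatrix average.
    Returns (blocks attached to the upstream core, split block,
    blocks attached to the downstream core).
    """
    blocks = sorted(block_scores)
    split = min(blocks, key=block_scores.__getitem__)  # ties: lowest block number
    upstream = [b for b in blocks if b < split]
    downstream = [b for b in blocks if b > split]
    return upstream, split, downstream


threshold = nonzero_mean([[0.0, 2.0], [4.0, 0.0]])
scores = {5: 4.0, 6: 3.0, 7: 0.5, 8: 2.0, 9: 6.0}
up, split, down = assign_attachments(scores)
```

The boundary between two neighboring domains is thus placed at the weakest block of the gap, so each core extends outward until the contact signal reaches its minimum.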

The present invention also provides a prediction system that implements the method for predicting chromosome topological association domains. The system comprises, connected in series, a high-frequency interaction region identification module, a quasi-core identification module, a quasi-core processing module, a chromosome topological association domain core identification module, and a chromosome topological association domain identification module. The high-frequency interaction region identification module obtains each genomic block in the interaction matrix between genomic blocks, identifies the corresponding high-frequency interaction region with a clustering algorithm, and passes the resulting regions to the quasi-core identification module. The quasi-core identification module determines, for each genomic block, whether the corresponding high-frequency interaction region contains a quasi-core centered on that genomic block, and passes the resulting quasi-cores to the quasi-core processing module. The quasi-core processing module processes the quasi-cores identified on each chromosome according to the relationship between each pair of adjacent quasi-cores to obtain non-overlapping quasi-cores, and passes them to the chromosome topological association domain core identification module. That module merges the non-overlapping quasi-cores on a chromosome according to the correlation between the quasi-cores, takes the merged cores as the cores of the chromosome topological association domains to be predicted, and passes them to the chromosome topological association domain identification module. The chromosome topological association domain identification module determines the affiliation of each genomic block in the attachment candidate regions and, together with the received cores, obtains and outputs the final predicted chromosome topological association domains.
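The serial connection of the five modules can be pictured as a simple chain in which each module consumes the output of the previous one. The sketch below is only a structural illustration: `run_modules`, the module names, and the placeholder stages (which merely tag the data) are all hypothetical and stand in for the processing of steps S1 to S5.

```python
from typing import Any, Callable, List, Tuple

Module = Tuple[str, Callable[[Any], Any]]


def run_modules(modules: List[Module], interaction_matrix: Any) -> Tuple[Any, List[str]]:
    """Pass the output of each module to the next; also report the execution order."""
    data = interaction_matrix
    order: List[str] = []
    for name, module in modules:
        data = module(data)
        order.append(name)
    return data, order


# Placeholder stages: each one only records that it ran.
modules: List[Module] = [
    ("high_frequency_region_identification", lambda d: d + ["S1"]),
    ("quasi_core_identification", lambda d: d + ["S2"]),
    ("quasi_core_processing", lambda d: d + ["S3"]),
    ("domain_core_identification", lambda d: d + ["S4"]),
    ("domain_identification", lambda d: d + ["S5"]),
]
result, order = run_modules(modules, [])
```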

The method and system for predicting chromosome topological association domains provided by the present invention make full use of the global information in Hi-C data and narrow the range in which candidate boundaries are located, which reduces false positives. The invention also requires no user-defined parameters. It can therefore predict topological association domains accurately and reliably.

Description of the Drawings

FIG. 1 is a schematic flow chart of the method of the present invention.

FIG. 2 is a schematic flow chart of an embodiment of the method of the present invention.

FIG. 3 is a schematic structural diagram of the system of the present invention.

Detailed Description

FIG. 1 shows a schematic flow chart of the method of the present invention. The method for predicting chromosome topological association domains provided by the present invention comprises the following steps:

S1. Obtain each genomic block in the interaction matrix between genomic blocks, and use a clustering algorithm to identify the corresponding high-frequency interaction region. Specifically, genome-wide conformation capture and sequencing are used to obtain each genomic block in the interaction matrix between genomic blocks (the Hi-C interaction matrix for short), and the K-means clustering algorithm with k = 2 is applied to identify the corresponding high-frequency interaction region;

In a specific implementation, this includes the following steps:

S1.1. Use genome-wide conformation capture and sequencing to obtain the interaction matrix between genomic blocks;

S1.2. Set to 0 the diagonal entries of the interaction matrix obtained in step S1.1, that is, the interaction value of each genomic block with itself;

S1.3. For any genomic block i, use the K-means clustering algorithm with k = 2 to cluster the other genomic blocks whose interaction value with genomic block i is not 0; the following function is used as the classification function $c_j$ of the other genomic blocks:

$$c_j = \underset{k \in \{1, 2\}}{\arg\min}\ \lVert M_{i,j} - \mu_k \rVert_2$$

where $M_{i,j}$ is the interaction value between genomic block i and genomic block j; $\mu_k$ is the mean of the k-th center; $\arg\min$ returns the class number of the center nearest to $M_{i,j}$; and $\lVert \cdot \rVert_2$ is the 2-norm. The initial centers $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two specified positions of the non-zero interaction values sorted in ascending order; $\mu_1$ is the center of the low-frequency interaction class and $\mu_2$ is the center of the high-frequency interaction class;

Solving the classification function assigns genomic block j to the class whose center is nearest to it;

S1.4. For each genomic block i, define the corresponding high-frequency interaction region $(l_i, r_i)$, where $l_i$ is the smallest block number and $r_i$ the largest block number among the genomic blocks in the high-interaction class of genomic block i;

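S2. For each genomic block, determine from the corresponding high-frequency interaction region whether a quasi-core centered on that genomic block exists; specifically:

S2.1. Compute the average interaction value of the submatrix that the high-frequency interaction region $(l_i, r_i)$ of genomic block i forms in the interaction matrix between genomic blocks;

S2.2. Compare the average interaction value obtained in step S2.1 with the average interaction values of the five neighboring submatrices of the same window size:

If the average interaction value obtained in step S2.1 is greater than the average interaction values of the five neighboring submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is determined to be the quasi-core of genomic block i;

If the average interaction value obtained in step S2.1 is not greater than the average interaction values of the five neighboring submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is split; the determination and identification are repeated after splitting, and stop when the region obtained by splitting no longer contains genomic block i;

The five neighboring submatrices of the same window size are the three submatrices above, one submatrix to the right, and one submatrix below;

The high-frequency interaction region $(l_i, r_i)$ is split as follows:

First, take as the split point the genomic block $m_i$ in the high-frequency interaction region $(l_i, r_i)$ whose summed interaction with the other genomic blocks in the region is smallest, and divide the region at $m_i$ into two high-frequency interaction regions, one on the lower-numbered side of $m_i$ and one on the higher-numbered side;

Then decide as follows:

If $i = m_i$, it is determined that no quasi-core centered on genomic block i exists;

If $i < m_i$, take the region on the lower-numbered side of $m_i$ as the high-frequency interaction region of genomic block i, and repeat steps S2.1 to S2.2 to determine the quasi-core;

If $i > m_i$, take the region on the higher-numbered side of $m_i$ as the high-frequency interaction region of genomic block i, and repeat steps S2.1 to S2.2 to determine the quasi-core;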

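S3. Process the quasi-cores identified on each chromosome according to the relationship between each pair of adjacent quasi-cores to obtain non-overlapping quasi-cores; this specifically includes the following steps:

S3.1. For the quasi-cores identified on each chromosome, determine the relationship between each pair of adjacent quasi-cores:

If one of two adjacent quasi-cores contains the other, keep the contained quasi-core and discard the containing quasi-core;

If two adjacent quasi-cores overlap, make a further determination: if the region obtained by merging the two quasi-cores still satisfies the definition of a quasi-core, merge them into a single quasi-core; otherwise, keep whichever of the two has the larger average interaction value and discard the other;

S3.2. Repeat step S3.1 until all quasi-cores on the whole chromosome have been examined and processed, yielding non-overlapping quasi-cores;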

S4. Merge the non-overlapping quasi-cores on a chromosome according to the correlation between the quasi-cores, and take the merged cores as the cores of the chromosome topological association domains (TADs) to be predicted. Specifically, compute the cosine similarity between all adjacent quasi-cores; merge into a new region every run of consecutive adjacent quasi-cores whose cosine similarity is above a set threshold and whose average interaction value between adjacent quasi-cores is greater than the mean of the non-zero interaction values on the whole chromosome; and take that region as the core in the core-attachment structure model of the chromosome topological association domain to be predicted;

In a specific implementation, the cosine similarity of adjacent quasi-cores $pc_i$ and $pc_j$ is computed with the following formula:

$$\cos(pc_i, pc_j) = \frac{\langle V_i, V_j \rangle}{\lVert V_i \rVert \, \lVert V_j \rVert}$$

where $V_i$ is the feature vector composed of the average interaction values between $pc_i$ and all other quasi-cores, its k-th component being the average interaction value between quasi-core $pc_k$ and $pc_i$; $V_j$ is the feature vector composed of the average interaction values between $pc_j$ and all other quasi-cores, its k-th component being the average interaction value between quasi-core $pc_k$ and $pc_j$; $\langle \cdot, \cdot \rangle$ is the inner product of vectors; and $\lVert \cdot \rVert$ is the modulus of a vector;
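The merging rule of step S4 can be sketched as a scan over consecutive quasi-cores. The function `merge_quasi_cores` is hypothetical; the pairwise similarities and pairwise average interaction values are taken as precomputed inputs, and letting a merged core span from the start of the first quasi-core to the end of the last one in the run (so that any blocks lying between them are included) is an assumption.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int]


def merge_quasi_cores(
    cores: Sequence[Region],
    similarity: Sequence[float],    # similarity[k]: cores[k] versus cores[k + 1]
    pair_mean: Sequence[float],     # pair_mean[k]: average interaction of that pair
    sim_threshold: float,
    chrom_mean: float,              # mean of non-zero values on the chromosome
) -> List[Region]:
    """Merge runs of adjacent quasi-cores that satisfy both conditions of step S4."""
    if not cores:
        return []
    merged: List[Region] = [cores[0]]
    for k in range(1, len(cores)):
        linked = similarity[k - 1] > sim_threshold and pair_mean[k - 1] > chrom_mean
        if linked:
            # ASSUMPTION: the merged core spans both quasi-cores and the gap between them.
            merged[-1] = (merged[-1][0], cores[k][1])
        else:
            merged.append(cores[k])
    return merged


cores = [(0, 3), (5, 8), (10, 12), (20, 25)]
result = merge_quasi_cores(
    cores,
    similarity=[0.9, 0.95, 0.2],
    pair_mean=[4.0, 1.0, 5.0],
    sim_threshold=0.8,
    chrom_mean=2.0,
)
```

In the example the second pair is similar enough but interacts too weakly, and the third pair interacts strongly but is dissimilar, so only the first pair is merged.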

S5. Determine the affiliation of each genomic block in the attachment candidate regions and, together with the cores obtained in step S4, obtain the final predicted chromosome topological association domains. Specifically, the region between two cores is defined as an attachment region, and for each genomic block in each attachment region the neighboring core to which it belongs is determined, thereby obtaining the final predicted chromosome topological association domains; each chromosome topological association domain includes one core and the attachment regions on both sides of that core;

S5.1. For the genomic blocks between two adjacent cores, filter out the genomic blocks whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values on the whole chromosome;

S5.2. On the basis of step S5.1, remove the background signal from the submatrix formed by the two adjacent cores and the genomic blocks between them; the background signal is defined as the mean of the non-zero interaction values in the submatrix formed by the genomic blocks between the two adjacent cores;

S5.3. On the basis of step S5.2, among the genomic blocks between the two adjacent cores, filter out those that have no non-zero interaction value with any genomic block in the genomic region concerned;

S5.4. On the basis of step S5.3, compute the average interaction value of the submatrix associated with each remaining genomic block between the two adjacent cores, and take the genomic block with the smallest submatrix average as the split point; genomic blocks upstream of the split point are assigned as attachments of the upstream core, and genomic blocks downstream of the split point are assigned as attachments of the downstream core, which yields the final predicted chromosome topological association domains.

The method of the present invention is further described below with reference to an embodiment:

As shown in FIG. 2, the method for predicting chromosome topological association domains based on the core-attachment structure model provided by the embodiment includes the following steps. The Hi-C map in the figure shows the KR-normalized Hi-C interaction matrix of GM12878_combined at 50 kb resolution from the GSE63525 dataset; the segment shown is genomic blocks 120 to 200 of chromosome 1;

Step S1. For each genomic block in the interaction matrix between genomic blocks obtained by genome-wide conformation capture and sequencing (the Hi-C interaction matrix for short), identify its high-frequency interaction region with K-means clustering;

As shown in FIG. 2-① (which shows the preprocessing of the Hi-C interaction matrix), the diagonal entries of the KR-normalized Hi-C interaction matrix of GM12878_combined at 50 kb resolution, that is, the interaction value of each genomic block with itself, are set to 0;
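The preprocessing step amounts to clearing the diagonal of the matrix. The sketch below also shows how a block number maps to genomic coordinates at 50 kb resolution; both helper names are hypothetical, and the 0-based, half-open interval convention is an assumption, since the block numbering convention of the dataset is not stated here.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def zero_diagonal(M: Matrix) -> Matrix:
    """Return a copy of M in which every self-interaction value M[i][i] is 0."""
    return [
        [0.0 if r == c else v for c, v in enumerate(row)]
        for r, row in enumerate(M)
    ]


def block_to_interval(block: int, resolution: int = 50_000) -> Tuple[int, int]:
    """Half-open interval [start, end) covered by a 0-based block number (assumed convention)."""
    return block * resolution, (block + 1) * resolution


cleaned = zero_diagonal([[5.0, 1.0], [1.0, 7.0]])
start, end = block_to_interval(120)
```

Clearing the diagonal keeps each block's interaction with itself, which is typically the largest value in its row, from dominating the clustering that follows.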

As shown in FIG. 2-② (which shows the identification of high-frequency interaction regions), for each genomic block i, the other genomic blocks whose interaction value with it is not 0 are clustered with the K-means algorithm with k = 2. The classification function of the other genomic blocks is:

$$c_j = \underset{k \in \{1, 2\}}{\arg\min}\ \lVert M_{i,j} - \mu_k \rVert_2$$

where $M_{i,j}$ is the interaction value between genomic blocks i and j, and $\mu_k$ is the mean of the k-th center. The initial centers $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two specified positions of the non-zero interaction values sorted in ascending order; $\mu_1$ is the center of the low-frequency interaction class and $\mu_2$ is the center of the high-frequency interaction class. Solving the classification function assigns genomic block j to the class whose center is nearest to it;

For each genomic block i, its high-frequency interaction region $(l_i, r_i)$ is defined, where $l_i$ is the smallest block number and $r_i$ the largest block number among the genomic blocks in the high-interaction class of genomic block i; a schematic of the high-frequency interaction region is shown in FIG. 2-②b;
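Given the class labels produced by the clustering, the region $(l_i, r_i)$ is simply the span from the smallest to the largest block number in the high-interaction class. A minimal sketch, assuming the labels are stored as a mapping from block number to class number with 2 denoting the high-frequency class (the function name and this encoding are hypothetical):

```python
from typing import Dict, Optional, Tuple


def high_frequency_region(
    labels: Dict[int, int], high_class: int = 2
) -> Optional[Tuple[int, int]]:
    """Return (l_i, r_i): min and max block number in the high class, or None if empty."""
    high = [j for j, c in labels.items() if c == high_class]
    if not high:
        return None
    return min(high), max(high)


# Labels for the partners of some block i: blocks 4, 5 and 7 are in the high class.
region = high_frequency_region({3: 1, 4: 2, 5: 2, 7: 2, 9: 1})
```

The region is contiguous by construction, so it can also cover blocks that were not themselves placed in the high class, such as block 6 in the example.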

步骤S2、如图2-③a所示（图2-③为TADs准核的构建过程），对每个基因组区块，从其高频互作区中判断并识别是否存在以该基因组区块为中心的准核；Step S2, as shown in Figure 2-③a (Figure 2-③ is the construction process of TADs quasi-nucleation), for each genomic block, judge and identify whether there is a genomic block based on its high-frequency interaction area. approval by the Centre;

A quasi-nucleus is defined as follows: if the submatrix formed in the Hi-C interaction matrix by the high-frequency interaction region (l_i, r_i) of genomic block i has an average interaction value greater than those of the five adjacent submatrices of the same window size (three submatrices above, one submatrix to the right, and one submatrix below), then the high-frequency interaction region (l_i, r_i) is the quasi-nucleus of genomic block i;
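The quasi-nucleus test above can be sketched as a comparison of window means. The exact positions of the five neighboring submatrices are given by formulas that are not reproduced in this text, so the offsets in `NEIGHBOR_OFFSETS` (three windows one window-size above, one to the right, one below) are an assumption, as are the helper names and the decision to skip windows that fall outside the matrix.

```python
def block_mean(m, r0, c0, w):
    """Mean of the w-by-w window with top-left corner (r0, c0); None if out of range."""
    n = len(m)
    if r0 < 0 or c0 < 0 or r0 + w > n or c0 + w > n:
        return None
    total = sum(m[r][c] for r in range(r0, r0 + w) for c in range(c0, c0 + w))
    return total / (w * w)

# ASSUMPTION: (row, column) shifts in units of the window size:
# upper-left, upper, upper-right, right, below.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 0)]

def is_quasi_nucleus(m, l, r, offsets=NEIGHBOR_OFFSETS):
    """True if the diagonal window (l, r) has a larger mean than every neighbor window."""
    w = r - l + 1
    center = block_mean(m, l, l, w)
    for dr, dc in offsets:
        neighbor = block_mean(m, l + dr * w, l + dc * w, w)
        # ASSUMPTION: neighbor windows that leave the matrix are ignored.
        if neighbor is not None and center <= neighbor:
            return False
    return True
```

Passing the offsets as a parameter keeps the comparison logic separate from the window layout, so the layout can be replaced once the original definitions are at hand.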

If the average interaction value of the submatrix formed in the Hi-C interaction matrix by the high-frequency interaction region (l_i, r_i) of genomic block i is not greater than those of the five adjacent submatrices of the same window size, the high-frequency interaction region is split and the quasi-nucleus test is repeated on the resulting region, until the split region no longer contains genomic block i;

Splitting: to split the high-frequency interaction region of genomic block i, the genomic block m_i whose summed interaction with the other genomic blocks in the region is smallest is first taken as the split point, and the region is divided at m_i into two high-frequency interaction regions;

Further, when i = m_i, it is determined that no quasi-nucleus centered on genomic block i exists; when i < m_i, the quasi-nucleus test is repeated on the first of the two split regions; when i > m_i, the quasi-nucleus test is repeated on the second of the two split regions; the test itself proceeds as described above;
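The split-and-retest procedure can be sketched as a loop that shrinks the region around block i. This is an illustrative sketch only: `find_quasi_nucleus` is a hypothetical name, the quasi-nucleus test is passed in as a callable, and the assumption that the split point m_i itself is excluded from both sub-regions is not confirmed by the text, whose sub-region formulas are not reproduced.

```python
def find_quasi_nucleus(m, i, l, r, is_quasi_nucleus):
    """Return the quasi-nucleus (l, r) centered on block i, or None if none exists."""
    while l <= i <= r:
        if is_quasi_nucleus(m, l, r):
            return (l, r)

        def inner_sum(b):
            # Summed interaction of block b with the other blocks of the region.
            return sum(m[b][c] for c in range(l, r + 1) if c != b)

        # Split point m_i: the block with the smallest summed interaction.
        m_i = min(range(l, r + 1), key=inner_sum)
        if m_i == i:
            return None  # no quasi-nucleus centered on block i
        if i < m_i:
            r = m_i - 1  # ASSUMPTION: keep the part left of m_i, excluding m_i
        else:
            l = m_i + 1  # ASSUMPTION: keep the part right of m_i, excluding m_i
    return None
```

Because the region loses at least one block per iteration, the loop always terminates; a region reduced to block i alone yields m_i = i and therefore no quasi-nucleus.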

Step S3. As shown in Figures 2-③b and 2-③c, the quasi-nuclei identified on each chromosome are filtered or merged according to the containment or overlap relationship between each pair of adjacent quasi-nuclei, yielding non-overlapping quasi-nuclei;

When one of two adjacent quasi-nuclei contains the other, the contained quasi-nucleus is retained and the containing quasi-nucleus is filtered out;

When two adjacent quasi-nuclei overlap, they are merged into one quasi-nucleus if the merged region still satisfies the definition of a quasi-nucleus; otherwise, only the one with the larger average interaction frequency is retained;

After one pair of adjacent quasi-nuclei has been processed, the search for adjacent quasi-nuclei that contain or overlap each other resumes from the next quasi-nucleus, with the same processing applied, until no overlapping quasi-nuclei remain on the entire chromosome;
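The filtering and merging rules of step S3 can be sketched as repeated processing of neighboring pairs in a list sorted by start position. The function names, the representation of a quasi-nucleus as an inclusive `(l, r)` tuple, and the two callables `mean_interaction` and `still_quasi_nucleus` are illustrative assumptions, not part of the original text.

```python
def resolve_pair(a, b, mean_interaction, still_quasi_nucleus):
    """Apply the S3 rules to two quasi-nuclei a, b with a[0] <= b[0]."""
    (al, ar), (bl, br) = a, b
    if bl > ar:
        return [a, b]                      # disjoint: nothing to do
    if al <= bl and br <= ar:
        return [b]                         # a contains b: keep the contained one
    if bl <= al and ar <= br:
        return [a]                         # b contains a: keep the contained one
    merged = (al, br)
    if still_quasi_nucleus(*merged):
        return [merged]                    # overlap, merged region still qualifies
    # Overlap without a valid merge: keep the one with the larger mean interaction.
    return [a] if mean_interaction(a) >= mean_interaction(b) else [b]

def make_non_overlapping(cores, mean_interaction, still_quasi_nucleus):
    """Repeat pairwise processing until no neighboring quasi-nuclei overlap."""
    cores = sorted(set(cores))
    changed = True
    while changed:
        changed = False
        for k in range(len(cores) - 1):
            kept = resolve_pair(cores[k], cores[k + 1],
                                mean_interaction, still_quasi_nucleus)
            if kept != [cores[k], cores[k + 1]]:
                cores[k:k + 2] = kept
                cores.sort()
                changed = True
                break
    return cores
```

Checking only neighbors in the sorted list is sufficient here, because two intervals that are not neighbors in start order can overlap only if a neighboring pair between them also overlaps.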

Step S4. As shown in Figure 2-④ (the construction of the nuclei in the nucleus-attachment structure model of TADs), the non-overlapping quasi-nuclei on a chromosome are merged according to the correlation between them, and each merged region is taken as the nucleus of a chromosome topological association domain (TAD) to be predicted;

Cosine similarity is used to compute the correlation between every pair of adjacent quasi-nuclei pc_i and pc_j, as follows:

$$\cos(pc_i, pc_j) = \frac{pc_i \cdot pc_j}{\lVert pc_i \rVert \, \lVert pc_j \rVert}$$

A correlation threshold is set. Two or more consecutive adjacent quasi-nuclei whose similarity exceeds the threshold, and for which the average interaction value between adjacent quasi-nuclei is greater than the mean of the non-zero interaction values on the entire chromosome, are merged into a new region, which serves as the nucleus in the nucleus-attachment structure model of a TAD;
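The merging rule of step S4 can be sketched as a single left-to-right pass over the non-overlapping quasi-nuclei. How each quasi-nucleus is turned into a vector for the cosine similarity is not stated in this text, so the `vectors` argument (one profile per quasi-nucleus, supplied by the caller) is an assumption, as are the names `merge_quasi_nuclei`, `between_mean`, and `chrom_mean`.

```python
import math

def cosine_similarity(u, v):
    """Standard cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def merge_quasi_nuclei(cores, vectors, between_mean, chrom_mean, threshold):
    """Merge runs of adjacent quasi-nuclei into TAD nuclei.

    cores        -- sorted, non-overlapping (l, r) tuples
    vectors      -- ASSUMPTION: one numeric profile per quasi-nucleus
    between_mean -- between_mean[k] is the mean interaction between cores[k] and cores[k + 1]
    chrom_mean   -- mean of the non-zero interaction values on the chromosome
    """
    if not cores:
        return []
    merged = [cores[0]]
    for k in range(1, len(cores)):
        similar = cosine_similarity(vectors[k - 1], vectors[k]) > threshold
        dense = between_mean[k - 1] > chrom_mean
        if similar and dense:
            merged[-1] = (merged[-1][0], cores[k][1])  # extend the current nucleus
        else:
            merged.append(cores[k])                    # start a new nucleus
    return merged
```

Extending the last merged entry rather than pairing quasi-nuclei independently is what allows several consecutive quasi-nuclei to collapse into one nucleus.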

Step S5. As shown in Figure 2-⑤ (the construction of the complete nucleus-attachment structure model of TADs), the region between two nuclei is defined as an attachment candidate region, and each genomic block in an attachment candidate region is assigned to one of the neighboring nuclei; each finally predicted TAD consists of one nucleus and the attachments on its two sides. In a specific implementation, this comprises the following steps:

S5.1. Among the genomic blocks between two adjacent nuclei, filter out those whose average interaction value within their high-frequency interaction region is less than the mean of the non-zero interaction values on the entire chromosome;

S5.2. On the basis of step S5.1, remove the background signal from the submatrix formed by the two adjacent nuclei and the genomic blocks between them; the background signal is defined as the mean of the non-zero interaction values in the submatrix formed by the genomic blocks between the two adjacent nuclei;

S5.3. On the basis of step S5.2, among the genomic blocks between the two adjacent nuclei, filter out those that have no non-zero interaction value with any genomic block within the genomic region concerned (the symbol of this region is not reproduced in this text);

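S5.4. On the basis of step S5.3, a calculation is made on the submatrix in which each remaining genomic block between the two adjacent nuclei is located (the formula is not reproduced in this text) to determine the attachments; the genomic blocks downstream of the split point are assigned to a nucleus (the remainder of this step is truncated in this text).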
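The background removal of step S5.2 can be illustrated as follows. The text defines the background signal but does not say how it is removed, so subtracting it and clipping negative values to zero is an assumption, as are the function name `remove_background` and the inclusive `(l, r)` representation of a nucleus.

```python
def remove_background(m, left_nucleus, right_nucleus):
    """Return the background-corrected submatrix spanning both nuclei and the gap."""
    (l1, r1), (l2, r2) = left_nucleus, right_nucleus
    gap = range(r1 + 1, l2)  # genomic blocks between the two nuclei
    # Background: mean of the non-zero values among the gap blocks.
    nonzero = [m[a][b] for a in gap for b in gap if m[a][b] != 0]
    background = sum(nonzero) / len(nonzero) if nonzero else 0.0
    span = range(l1, r2 + 1)
    # ASSUMPTION: removal means subtraction, with negative results clipped to 0.
    return [[max(m[a][b] - background, 0.0) for b in span] for a in span]
```

With this choice, interactions at or below the background level become zero, which is the condition that the subsequent filtering in step S5.3 checks for.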

Figure 3 is a schematic diagram of the structure of the system of the present invention. The present invention also provides a prediction system implementing the above method for predicting chromosome topological association domains, comprising, connected in series in this order, a high-frequency interaction region identification module, a quasi-nucleus identification module, a quasi-nucleus processing module, a chromosome topological association domain nucleus identification module, and a chromosome topological association domain identification module. The high-frequency interaction region identification module obtains each genomic block in the interaction matrix between genomic blocks, identifies the corresponding high-frequency interaction region using a clustering algorithm, and passes the resulting regions to the quasi-nucleus identification module. The quasi-nucleus identification module determines, for each genomic block, whether the corresponding high-frequency interaction region contains a quasi-nucleus centered on that genomic block, and passes the resulting quasi-nuclei to the quasi-nucleus processing module. The quasi-nucleus processing module processes the quasi-nuclei identified on each chromosome according to the relationship between each pair of adjacent quasi-nuclei, obtains non-overlapping quasi-nuclei, and passes them to the chromosome topological association domain nucleus identification module. The chromosome topological association domain nucleus identification module merges the non-overlapping quasi-nuclei on a chromosome according to the correlation between them, takes each merged region as the nucleus of a chromosome topological association domain to be predicted, and passes the resulting nuclei to the chromosome topological association domain identification module. The chromosome topological association domain identification module determines the affiliation of each genomic block in the attachment candidate regions, combines it with the received nuclei to obtain the final predicted chromosome topological association domains, and outputs them.

Claims

1. A method for predicting chromosome topological association domains, characterized by comprising the following steps:

S1. For each genomic block in the interaction matrix between genomic blocks, use a clustering algorithm to identify the corresponding high-frequency interaction region;

S2. For each genomic block, determine from the corresponding high-frequency interaction region whether a quasi-nucleus centered on that genomic block exists:

If a quasi-nucleus centered on the genomic block exists in the high-frequency interaction region, proceed to the subsequent steps;

If no quasi-nucleus centered on the genomic block exists in the high-frequency interaction region, split the high-frequency interaction region and repeat the quasi-nucleus test until the split region no longer contains the genomic block;

S3. The quasi-nuclei identified on each chromosome are processed according to the relationship between two adjacent quasi-nuclei to obtain non-overlapping quasi-nuclei;

S4. Merge the non-overlapping quasi-nuclei on a chromosome according to the correlation between the quasi-nuclei, and use the merged nucleus as the nucleus of the chromosome topological association domain to be predicted;

S5. Determine the affiliation of each genomic block in the attachment candidate region, and combine the nucleus of the chromosome topological association domain obtained in step S4 to obtain the final predicted chromosome topological association domain.

2. The method for predicting chromosome topological association domains according to claim 1, wherein in step S1 the interaction matrix between genomic blocks is obtained using whole-genome conformation capture technology and sequencing technology, and each genomic block is clustered using the K-means clustering algorithm with k = 2 to identify the corresponding high-frequency interaction region.

3. The method for predicting chromosome topological association domains according to claim 2, wherein step S1 specifically comprises the following steps:

S1.1. Use whole-genome conformation capture technology and sequencing technology to obtain the interaction matrix between genomic blocks;

S1.2. The interaction value between each genome block and itself on the diagonal of the interaction matrix between the genome blocks obtained in step S1.1 is assigned 0;

S1.3. For any genomic block i, use the K-means clustering algorithm with k = 2 to cluster the other genomic blocks whose interaction value with genomic block i is not 0; the classification function used for these other genomic blocks in step S1.3 (the formula is not reproduced in this text) is defined in terms of the interaction value between genomic block i and genomic block j, the mean of the k-th center, and the 2-norm; the initial center values of the two classes are set to the interaction values at two specified positions of the non-zero interaction values sorted in ascending order, the first center corresponding to the low-frequency interaction class and the second to the high-frequency interaction class;

By solving the classification function, genomic block j is assigned to the class whose center is at the smallest distance;

S1.4. Define the corresponding high-frequency interaction region for each genomic block i

4. The method for predicting chromosome topological association domains according to claim 3, wherein step S2 specifically comprises the following steps:

S2.1. Calculate the average interaction value of the submatrix formed, in the interaction matrix between genomic blocks, by the high-frequency interaction region in which genomic block i is located;

S2.2. Compare the average interaction value obtained in step S2.1 with the average interaction values of the five adjacent submatrices of the same window size:

If the average interaction value obtained in step S2.1 is greater than those of the five adjacent submatrices of the same window size, the high-frequency interaction region is determined to be the quasi-nucleus of genomic block i;

If the average interaction value obtained in step S2.1 is not greater than those of the five adjacent submatrices of the same window size, the high-frequency interaction region is split and the quasi-nucleus test is repeated;

The five adjacent submatrices of the same window size are, specifically, three submatrices above, one submatrix to the right, and one submatrix below.

5. The method for predicting chromosome topological association domains according to claim 4, wherein the splitting of the high-frequency interaction region specifically comprises:

First, in the high-frequency interaction region, the genomic block m_i whose summed interaction with the other genomic blocks in the region is smallest is taken as the split point, and the high-frequency interaction region is divided into two high-frequency interaction regions;

Then, make a judgment:

If i = m_i, it is determined that no quasi-nucleus centered on genomic block i exists;

If i < m_i, the quasi-nucleus test is repeated on the first of the two high-frequency interaction regions;

If i > m_i, the quasi-nucleus test is repeated on the second of the two high-frequency interaction regions.

6. The method for predicting chromosome topological association domains according to claim 5, wherein step S3 specifically comprises the following steps:

S3.1. For the quasi-nuclei identified on each chromosome, determine the relationship between each pair of adjacent quasi-nuclei:

If one of two adjacent quasi-nuclei contains the other, the contained quasi-nucleus is retained and the containing quasi-nucleus is filtered out;

If two adjacent quasi-nuclei overlap, a further judgment is made: if the two quasi-nuclei still meet the definition of a quasi-nucleus after merging, they are merged into one quasi-nucleus; otherwise, the quasi-nucleus with the larger average interaction value is retained and the other is filtered out;

S3.2. Repeat step S3.1 until all quasi-nuclei on the entire chromosome have been judged and processed, and finally non-overlapping quasi-nuclei are obtained.

7. The method for predicting chromosome topological association domains according to claim 6, wherein in step S4 the cosine similarity between all pairs of adjacent quasi-nuclei is calculated, and two or more consecutive adjacent quasi-nuclei whose cosine similarity is higher than a set threshold, and for which the average interaction value between adjacent quasi-nuclei is greater than the mean of the non-zero interaction values on the entire chromosome, are merged into a new region, which is used as the nucleus in the nucleus-attachment structure model of the chromosome topological association domain to be predicted.

8. The method for predicting chromosome topological association domains according to claim 7, wherein in step S5 the region between two nuclei is defined as an attachment region, and for each genomic block in each attachment region the adjacent nucleus to which it belongs is determined, thereby obtaining the final predicted chromosome topological association domains; each chromosome topological association domain comprises a nucleus and the attachment regions on both sides of the nucleus.

9. The method for predicting chromosomal topological association domains according to claim 8, wherein the step S5 specifically comprises the following steps:

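S5.1. Among the genomic blocks between two adjacent nuclei, filter out those whose average interaction value within their high-frequency interaction region is less than the mean of the non-zero interaction values on the entire chromosome;

S5.2. On the basis of step S5.1, remove the background signal from the submatrix formed by the two adjacent nuclei and the genomic blocks between them; the background signal is defined as the mean of the non-zero interaction values in the submatrix formed by the genomic blocks between the two adjacent nuclei;

S5.3. On the basis of step S5.2, among the genomic blocks between the two adjacent nuclei, filter out those that have no non-zero interaction value with any genomic block within the genomic region concerned (the symbol of this region is not reproduced in this text);

S5.4. On the basis of step S5.3, a calculation is made on the submatrix in which each remaining genomic block between the two adjacent nuclei is located (the remainder of this step is truncated in this text).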

10. A prediction system implementing the method for predicting chromosome topological association domains according to any one of claims 1 to 9, characterized by comprising, connected in series in this order, a high-frequency interaction region identification module, a quasi-nucleus identification module, a quasi-nucleus processing module, a chromosome topological association domain nucleus identification module, and a chromosome topological association domain identification module; the high-frequency interaction region identification module is used to obtain each genomic block in the interaction matrix between genomic blocks, identify the corresponding high-frequency interaction region using a clustering algorithm, and pass the resulting high-frequency interaction regions to the quasi-nucleus identification module; the quasi-nucleus identification module is used to determine, for each genomic block, whether the corresponding high-frequency interaction region contains a quasi-nucleus centered on that genomic block, and to pass the resulting quasi-nuclei to the quasi-nucleus processing module; the quasi-nucleus processing module is used to process the quasi-nuclei identified on each chromosome according to the relationship between each pair of adjacent quasi-nuclei, obtain non-overlapping quasi-nuclei, and pass them to the chromosome topological association domain nucleus identification module; the chromosome topological association domain nucleus identification module is used to merge the non-overlapping quasi-nuclei on a chromosome according to the correlation between them, take each merged region as the nucleus of a chromosome topological association domain to be predicted, and pass the resulting nuclei to the chromosome topological association domain identification module; the chromosome topological association domain identification module is used to determine the affiliation of each genomic block in the attachment candidate regions, combine it with the received nuclei to obtain the final predicted chromosome topological association domains, and output them.