CN114446384A - Prediction method and prediction system of chromosome topological association domains - Google Patents

Info

Publication number
CN114446384A
Authority
CN
China
Prior art keywords
quasi
interaction
block
chromosome
genomic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210245600.9A
Other languages
Chinese (zh)
Other versions
CN114446384B (en)
Inventor
彭小清
李一鸣
孔祥艳
盛羽
段桂华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202210245600.9A
Publication of CN114446384A
Application granted
Publication of CN114446384B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20: Protein or domain folding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30: Unsupervised data analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for predicting chromosome topological association domains. The method comprises: obtaining each genomic block in the interaction matrix between genomic blocks and identifying its high-frequency interaction region; for each genomic block, identifying a quasi-nucleus from the high-frequency interaction region; processing the quasi-nuclei identified on each chromosome to obtain non-overlapping quasi-nuclei; merging the non-overlapping quasi-nuclei on a chromosome to obtain the nuclei of the chromosome topological association domains to be predicted; and determining the affiliation of each genomic block in the attachment candidate regions and combining it with the nuclei to obtain the final predicted chromosome topological association domains. The invention also discloses a prediction system that implements the method. The invention makes full use of the global information in Hi-C data, narrows the range in which candidate boundaries are located, requires no user-defined parameters, and can predict topological association domains accurately and reliably.

Description

Prediction method and prediction system for chromosome topological association domains

Technical field

The invention belongs to the field of computer technology, and in particular relates to a prediction method and a prediction system for chromosome topological association domains.

Background

In recent years, the emergence of genome-wide chromosome conformation capture technology (high-throughput chromosome conformation capture, Hi-C) has advanced the understanding of the hierarchical spatial structure of chromosomes. Researchers converted Hi-C sequencing data from mammalian cells into Hi-C interaction matrices and visualized them, and thereby found highly self-interacting regions at resolutions below 100 kb; such regions are topologically associating domains (TADs), referred to in this document as chromosome topological association domains. A Hi-C interaction matrix is built as follows: a chromosome is divided into N segments of equal length, and an N×N matrix M is constructed to represent the interaction signal between every pair of segments on the chromosome. Each equal-length unit segment is called a genomic block, and the size of a genomic block is related to the resolution of the Hi-C interaction matrix. The matrix is filled by counting how the sequencing reads produced by Hi-C align to pairs of genomic blocks, which gives the interaction frequencies between the N genomic blocks. For example, each time a sequencing read can be split and aligned to genomic block i and genomic block j, the matrix elements $M_{i,j}$ and $M_{j,i}$ are each incremented by 1.
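The counting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `build_contact_matrix`, the representation of a read as a pair of coordinates on the same chromosome, and the choice to count a read whose two ends fall in the same block only once are all assumptions made for the example.

```python
def build_contact_matrix(read_pairs, chrom_length, bin_size):
    """Count read pairs per pair of genomic blocks on one chromosome.

    read_pairs: iterable of (pos_a, pos_b) coordinates (hypothetical input format).
    """
    n = -(-chrom_length // bin_size)  # ceiling division gives the number of blocks N
    matrix = [[0] * n for _ in range(n)]
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # mirror the count so the matrix stays symmetric
        # ASSUMPTION: a read with both ends in one block is counted once.
    return matrix


pairs = [(5, 25), (12, 27), (31, 8), (22, 24)]
hic = build_contact_matrix(pairs, chrom_length=40, bin_size=10)
```

Real pipelines usually also normalize the matrix for coverage and distance effects; that step is outside what the text above describes.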

At present, owing to the limits of microscopy and biotechnology, researchers still cannot observe TADs directly and completely, and the mechanism by which TADs form remains unclear. TAD information therefore has to be obtained indirectly, for example by building a Hi-C interaction matrix from the interactions between chromosome fragments captured in Hi-C sequencing data and then predicting TADs with a suitable algorithm. In recent years, researchers have proposed machine-learning methods for predicting TADs. Applying these methods across cell lines is severely limited, however, because each cell line usually needs a large amount of matching, cell-line-specific information for feature extraction and model training, which places an extra burden on researchers.

Existing TAD prediction algorithms mainly work from the interaction preference at boundaries, the similarity within a TAD, the difference between TAD and non-TAD regions, or the change in contact-frequency density within a TAD. These methods either focus only on finding boundaries and miss the information inside the TAD, or rely on user-defined parameters to control TAD size, clustering termination thresholds, local extrema and so on. As a result, TAD identification is highly variable and subjective. Moreover, since a TAD is a structure without a precise definition, it should not be predicted by constraining its own properties.

Summary of the invention

One object of the present invention is to provide a method for predicting chromosome topological association domains that is reliable and accurate.

Another object of the present invention is to provide a prediction system that implements the method for predicting chromosome topological association domains.

The method for predicting chromosome topological association domains provided by the present invention comprises the following steps:

S1. Obtain each genomic block in the interaction matrix between genomic blocks, and use a clustering algorithm to identify the corresponding high-frequency interaction region;

S2. For each genomic block, judge from the corresponding high-frequency interaction region whether a quasi-nucleus centered on that genomic block exists:

If the high-frequency interaction region contains a quasi-nucleus centered on the genomic block, proceed to the subsequent steps;

If the high-frequency interaction region contains no quasi-nucleus centered on the genomic block, split the region and repeat the judgment on the split region, until the split region no longer contains the genomic block;

S3. Process the quasi-nuclei identified on each chromosome according to the relationship between each pair of adjacent quasi-nuclei, to obtain non-overlapping quasi-nuclei;

S4. Merge the non-overlapping quasi-nuclei on a chromosome according to the correlation between them, and take each merged region as the nucleus of a chromosome topological association domain to be predicted;

S5. Determine the affiliation of each genomic block in the attachment candidate regions and combine it with the nuclei obtained in step S4 to obtain the final predicted chromosome topological association domains.

Step S1 specifically uses genome-wide conformation capture and sequencing to obtain each genomic block in the interaction matrix between genomic blocks, and clusters with the K-means algorithm with k = 2 to identify the corresponding high-frequency interaction region.

Step S1 specifically comprises the following steps:

S1.1. Use genome-wide conformation capture and sequencing to obtain the interaction matrix between genomic blocks;

S1.2. Set to 0 the diagonal entries of the interaction matrix obtained in step S1.1, that is, the interaction value of each genomic block with itself;

S1.3. For any genomic block i, use the K-means algorithm with k = 2 to cluster the other genomic blocks whose interaction value with block i is non-zero;

S1.4. For each genomic block i, define the corresponding high-frequency interaction region $[l_i, r_i]$, where $l_i$ is the smallest block number among the genomic blocks in the high-interaction class of block i, and $r_i$ is the largest block number among them.

The following function is used as the classification function $c(j)$ for the other genomic blocks in step S1.3:

$$c(j) = \arg\min_{k \in \{1, 2\}} \lVert M_{i,j} - \mu_k \rVert_2$$

where $M_{i,j}$ is the interaction value between genomic block i and genomic block j; $\mu_k$ is the mean of the k-th center; $\arg\min_k$ returns the class number of the center nearest to $M_{i,j}$; and $\lVert \cdot \rVert_2$ is the 2-norm. The initial centers $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values found at two prescribed positions in the ascending-sorted list of non-zero interaction values (the position formulas are images in the source and are not reproduced here); $\mu_1$ is the center of the low-frequency interaction class and $\mu_2$ is the center of the high-frequency interaction class.

Solving the classification function assigns genomic block j to the class whose center is nearest.
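Steps S1.2 to S1.4 can be sketched for a single row of the interaction matrix as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `high_frequency_region` is hypothetical, and because the source gives the initial-center positions only as images, the lower- and upper-quartile positions used here are a stand-in chosen for the example. For scalar interaction values the 2-norm reduces to an absolute difference.

```python
def high_frequency_region(row, i, max_iter=100):
    """Return (l_i, r_i) for block i from its row of the matrix, or None."""
    # S1.2/S1.3: ignore the diagonal entry and zero interactions.
    values = {j: v for j, v in enumerate(row) if j != i and v != 0}
    if not values:
        return None
    ordered = sorted(values.values())
    # ASSUMPTION: initial centers at the quartile positions of the sorted values.
    low = float(ordered[len(ordered) // 4])
    high = float(ordered[(3 * len(ordered)) // 4])
    labels = {}
    for _ in range(max_iter):
        # Assignment step: nearest center; ties go to the low-frequency class.
        new_labels = {j: (1 if abs(v - high) < abs(v - low) else 0)
                      for j, v in values.items()}
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        lows = [values[j] for j, c in labels.items() if c == 0]
        highs = [values[j] for j, c in labels.items() if c == 1]
        if lows:
            low = sum(lows) / len(lows)
        if highs:
            high = sum(highs) / len(highs)
    members = [j for j, c in labels.items() if c == 1]
    if not members:
        return None
    # S1.4: smallest and largest block number in the high-interaction class.
    return min(members), max(members)


row = [1, 2, 9, 0, 8, 10, 1, 0]
region = high_frequency_region(row, i=3)
```

In this toy row, blocks 2, 4 and 5 form the high-interaction class of block 3, so the region spans blocks 2 to 5.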

Step S2 specifically comprises the following steps:

S2.1. Calculate the average interaction value of the sub-matrix that the high-frequency interaction region $[l_i, r_i]$ of genomic block i forms in the interaction matrix between genomic blocks;

S2.2. Compare the average interaction value obtained in step S2.1 with the average interaction values of the five neighboring sub-matrices of the same window size:

If the average interaction value obtained in step S2.1 is greater than the average interaction values of the five neighboring sub-matrices of the same window size, the high-frequency interaction region $[l_i, r_i]$ is taken as the quasi-nucleus of genomic block i;

If the average interaction value obtained in step S2.1 is not greater than the average interaction values of the five neighboring sub-matrices of the same window size, the high-frequency interaction region $[l_i, r_i]$ is split; the judgment is repeated after splitting, and stops when the split region no longer contains genomic block i;

The five neighboring sub-matrices of the same window size are three sub-matrices above, one sub-matrix on the right, and one sub-matrix below (their index ranges are given as formula images in the source and are not reproduced here).

Splitting the high-frequency interaction region $[l_i, r_i]$, repeating the judgment after splitting, and stopping when the split region no longer contains genomic block i specifically comprises the following steps:

First, take as the split point the genomic block $m_i$ in the high-frequency interaction region whose summed interaction with the other genomic blocks in the region is smallest, and divide the region at $m_i$ into two high-frequency interaction regions, one on the lower-numbered side of $m_i$ and one on the higher-numbered side;

Then judge:

If $i = m_i$, no quasi-nucleus centered on genomic block i exists;

If $i < m_i$, take the region on the lower-numbered side of $m_i$ as the high-frequency interaction region of the genomic block and repeat steps S2.1 to S2.2 to test for a quasi-nucleus;

If $i > m_i$, take the region on the higher-numbered side of $m_i$ as the high-frequency interaction region of the genomic block and repeat steps S2.1 to S2.2 to test for a quasi-nucleus.
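The split-and-retest loop can be sketched as below. The quasi-nucleus test is passed in as a callable so the example stays self-contained; the names `weakest_block` and `find_quasi_nucleus` are hypothetical, and dropping the split point from both halves is an assumption, since the source gives the two sub-regions only as formula images.

```python
def weakest_block(matrix, left, right):
    """Block in [left, right] with the smallest summed interaction with the others."""
    def inner_sum(b):
        return sum(matrix[b][c] for c in range(left, right + 1) if c != b)
    return min(range(left, right + 1), key=inner_sum)


def find_quasi_nucleus(matrix, i, left, right, is_quasi_nucleus):
    """Shrink [left, right] around block i until the test passes or i is cut out."""
    while left <= i <= right:
        if is_quasi_nucleus(matrix, left, right):
            return left, right
        m = weakest_block(matrix, left, right)
        if m == i:
            return None  # no quasi-nucleus centered on block i
        # ASSUMPTION: the split point itself belongs to neither half.
        if i < m:
            right = m - 1
        else:
            left = m + 1
    return None


M = [
    [0, 9, 8, 1, 1],
    [9, 0, 9, 1, 1],
    [8, 9, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
]
# Stand-in predicate for the example: accept any window of at most three blocks.
accept_small = lambda matrix, left, right: right - left + 1 <= 3
found = find_quasi_nucleus(M, 1, 0, 4, accept_small)
```

Every pass either returns or removes at least one block from the region, so the loop always terminates.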

Step S3 specifically comprises the following steps:

S3.1. For the quasi-nuclei identified on each chromosome, determine the relationship between two adjacent quasi-nuclei:

If one of the two adjacent quasi-nuclei contains the other, keep the contained (inner) quasi-nucleus and discard the containing (outer) one;

If the two adjacent quasi-nuclei overlap, make a further judgment: if the two quasi-nuclei still satisfy the definition of a quasi-nucleus after merging, merge them into one quasi-nucleus; otherwise, keep the one with the larger average interaction value and discard the other;

S3.2. Repeat step S3.1 until all quasi-nuclei on the whole chromosome have been judged and processed, giving a set of non-overlapping quasi-nuclei.
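One way to express the containment and overlap rules of step S3 is a single left-to-right sweep, sketched below. The function name `resolve_overlaps`, the interval representation `(left, right)`, and the two callables (`mean_value` for the average interaction value of a quasi-nucleus, `is_valid` for the quasi-nucleus definition) are assumptions made for the example; the source does not prescribe a traversal order.

```python
def resolve_overlaps(quasi_nuclei, mean_value, is_valid):
    """Return a non-overlapping list of (left, right) quasi-nuclei."""
    result = []
    for cur in sorted(set(quasi_nuclei)):
        while result:
            prev = result[-1]
            if cur[0] > prev[1]:
                break  # disjoint from the previous one: nothing to resolve
            result.pop()
            if prev[0] <= cur[0] and cur[1] <= prev[1]:
                pass  # cur lies inside prev: keep the inner one (cur)
            elif cur[0] <= prev[0] and prev[1] <= cur[1]:
                cur = prev  # prev lies inside cur: keep the inner one (prev)
            else:
                merged = (min(prev[0], cur[0]), max(prev[1], cur[1]))
                if is_valid(merged):
                    cur = merged  # the merged window is still a quasi-nucleus
                else:
                    cur = max(prev, cur, key=mean_value)  # keep the denser one
        result.append(cur)
    return result


candidates = [(0, 5), (1, 3), (3, 6), (8, 9)]
means = {(1, 3): 5.0, (3, 6): 7.0}
mean_of = lambda q: means.get(q, 0.0)
kept = resolve_overlaps(candidates, mean_of, is_valid=lambda q: False)
merged = resolve_overlaps(candidates, mean_of, is_valid=lambda q: True)
```

In the example, (1, 3) replaces the window (0, 5) that contains it; (1, 3) and (3, 6) share block 3, so they are either merged into (1, 6) or reduced to the denser of the two, depending on `is_valid`.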

Step S4 specifically calculates the cosine similarity between all adjacent quasi-nuclei. Runs of consecutive adjacent quasi-nuclei whose cosine similarity is above a set threshold, and whose average interaction value between adjacent quasi-nuclei is greater than the mean of the non-zero interaction values on the whole chromosome, are merged into a new region. This region serves as the nucleus in the nucleus-attachment structure model of the chromosome topological association domain to be predicted.

The cosine similarity of adjacent quasi-nuclei $pc_i$ and $pc_j$ is calculated as:

$$\cos(pc_i, pc_j) = \frac{\langle V_i, V_j \rangle}{\lVert V_i \rVert \, \lVert V_j \rVert}$$

where $V_i$ is the feature vector formed by the average interaction values between $pc_i$ and all other quasi-nuclei, its k-th component being the average interaction value between quasi-nucleus $pc_k$ and $pc_i$; $V_j$ is the feature vector formed by the average interaction values between $pc_j$ and all other quasi-nuclei, its k-th component being the average interaction value between $pc_k$ and $pc_j$; $\langle \cdot, \cdot \rangle$ is the inner product of two vectors; and $\lVert \cdot \rVert$ is the norm (length) of a vector. The index range of k is given as a formula image in the source and is not reproduced here.
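The merging rule of step S4 can be sketched as follows. The names `cosine_similarity` and `merge_quasi_nuclei`, the `pair_mean` table of average interaction values between quasi-nuclei, and the `threshold` argument are illustrative assumptions. Because the index range of k is not recoverable from the source, this sketch leaves the two quasi-nuclei being compared out of their own feature vectors; that too is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Inner product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_quasi_nuclei(quasi_nuclei, pair_mean, threshold, chrom_mean):
    """Merge runs of adjacent quasi-nuclei into nuclei.

    pair_mean[a][b]: average interaction between quasi-nuclei a and b (symmetric).
    chrom_mean: mean of the non-zero interaction values on the chromosome.
    """
    n = len(quasi_nuclei)
    nuclei = []
    start = 0
    for a in range(n - 1):
        b = a + 1
        # ASSUMPTION: feature vectors skip the two quasi-nuclei being compared.
        u = [pair_mean[k][a] for k in range(n) if k not in (a, b)]
        v = [pair_mean[k][b] for k in range(n) if k not in (a, b)]
        similar = cosine_similarity(u, v) > threshold
        if not (similar and pair_mean[a][b] > chrom_mean):
            # The run ends here: close the current nucleus and start a new one.
            nuclei.append((quasi_nuclei[start][0], quasi_nuclei[a][1]))
            start = b
    if n:
        nuclei.append((quasi_nuclei[start][0], quasi_nuclei[n - 1][1]))
    return nuclei


quasi = [(0, 2), (3, 5), (8, 9), (10, 12)]
P = [
    [0, 6, 1, 2],
    [6, 0, 1, 2],
    [1, 1, 0, 0.5],
    [2, 2, 0.5, 0],
]
nuclei = merge_quasi_nuclei(quasi, P, threshold=0.9, chrom_mean=3)
```

Here only the first two quasi-nuclei satisfy both conditions, so they merge into one nucleus spanning blocks 0 to 5, and the other two remain separate nuclei.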

Step S5 specifically defines the region between two nuclei as an attachment region and determines, for each genomic block in each attachment region, which neighboring nucleus of a chromosome topological association domain it belongs to, thereby giving the final predicted chromosome topological association domains. Each chromosome topological association domain comprises one nucleus and the attachment regions on both sides of it.

Step S5 specifically comprises the following steps:

S5.1. For the genomic blocks lying between two adjacent nuclei, filter out those whose high-frequency interaction region has an average interaction value below the mean of the non-zero interaction values on the whole chromosome;

S5.2. On the basis of step S5.1, remove the background signal from the sub-matrix formed by the two adjacent nuclei and the genomic blocks between them; the background signal is defined as the mean of the non-zero interaction values in the sub-matrix formed by the genomic blocks between the two adjacent nuclei;

S5.3. On the basis of step S5.2, among the genomic blocks lying between the two adjacent nuclei, filter out those that have no non-zero interaction value with any genomic block in the genomic region under consideration;

S5.4. On the basis of step S5.3, calculate the average interaction value of the sub-matrix associated with each genomic block remaining between the two adjacent nuclei, and take the genomic block whose sub-matrix has the smallest average interaction value as the split point. Genomic blocks upstream of the split point are assigned as attachments of the first of the two nuclei, and genomic blocks downstream of the split point are assigned as attachments of the second, giving the final predicted chromosome topological association domains.

The symbols for the two nuclei, the blocks between them, the genomic region in S5.3 and the sub-matrix in S5.4 are formula images in the source and are not reproduced here.
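The final assignment in S5.4 can be sketched once a score is available for each block that survived the filters of S5.1 to S5.3. How that score's sub-matrix is defined is not recoverable from the source, so the sketch takes the scores as input. The function name `assign_attachments`, the `block_scores` mapping, and the choice to treat the split block as a boundary that joins neither domain are all assumptions.

```python
def assign_attachments(left_nucleus, right_nucleus, block_scores):
    """Extend two adjacent nuclei with the attachment blocks between them.

    left_nucleus, right_nucleus: (first_block, last_block) of each nucleus.
    block_scores: {block: average interaction value of that block's sub-matrix}
                  for the blocks remaining between the nuclei.
    """
    if not block_scores:
        return left_nucleus, right_nucleus
    # Split point: the remaining block with the smallest average interaction.
    split = min(sorted(block_scores), key=block_scores.get)
    upstream = [b for b in block_scores if b < split]
    downstream = [b for b in block_scores if b > split]
    # ASSUMPTION: the split block is a boundary and joins neither domain.
    left_end = max(upstream) if upstream else left_nucleus[1]
    right_start = min(downstream) if downstream else right_nucleus[0]
    return (left_nucleus[0], left_end), (right_start, right_nucleus[1])


scores = {6: 4.0, 7: 3.0, 8: 1.0, 9: 2.5}
domains = assign_attachments((0, 5), (10, 15), scores)
```

In the example, block 8 has the lowest score and becomes the split point, so blocks 6 and 7 attach to the first nucleus and block 9 attaches to the second.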

The present invention also provides a prediction system that implements the method for predicting chromosome topological association domains. The system comprises, connected in series, a high-frequency interaction region identification module, a quasi-nucleus identification module, a quasi-nucleus processing module, a chromosome topological association domain nucleus identification module, and a chromosome topological association domain identification module. The high-frequency interaction region identification module obtains each genomic block in the interaction matrix between genomic blocks, identifies the corresponding high-frequency interaction regions with a clustering algorithm, and passes them to the quasi-nucleus identification module. The quasi-nucleus identification module judges, for each genomic block, whether the corresponding high-frequency interaction region contains a quasi-nucleus centered on that block, and passes the quasi-nuclei found to the quasi-nucleus processing module. The quasi-nucleus processing module processes the quasi-nuclei identified on each chromosome according to the relationship between each pair of adjacent quasi-nuclei to obtain non-overlapping quasi-nuclei, and passes them to the nucleus identification module. The nucleus identification module merges the non-overlapping quasi-nuclei on a chromosome according to the correlation between them, takes the merged regions as the nuclei of the chromosome topological association domains to be predicted, and passes the nuclei to the chromosome topological association domain identification module. The chromosome topological association domain identification module determines the affiliation of each genomic block in the attachment candidate regions, combines it with the received nuclei to obtain the final predicted chromosome topological association domains, and outputs them.

The prediction method and prediction system provided by the present invention make full use of the global information in Hi-C data and narrow the range in which candidate boundaries are located, which can reduce false positives. The invention also requires no user-defined parameters. It can therefore predict topological association domains accurately and reliably.

Description of drawings

FIG. 1 is a schematic flowchart of the method of the present invention.

FIG. 2 is a schematic flowchart of an embodiment of the method of the present invention.

FIG. 3 is a schematic structural diagram of the system of the present invention.

Detailed description

FIG. 1 shows a schematic flowchart of the method of the present invention. The method for predicting chromosome topological association domains provided by the present invention comprises the following steps:

S1. Obtain each genomic block in the interaction matrix between genomic blocks, and use a clustering algorithm to identify the corresponding high-frequency interaction region; specifically, use genome-wide conformation capture and sequencing to obtain each genomic block in the interaction matrix between genomic blocks (the Hi-C interaction matrix for short), and cluster with the K-means algorithm with k = 2 to identify the corresponding high-frequency interaction region;

In a specific implementation, this comprises the following steps:

S1.1. Use genome-wide conformation capture and sequencing to obtain the interaction matrix between genomic blocks;

S1.2. Set to 0 the diagonal entries of the interaction matrix obtained in step S1.1, that is, the interaction value of each genomic block with itself;

S1.3. For any genomic block i, use the K-means clustering algorithm with k = 2 to cluster the other genomic blocks whose interaction value with genomic block i is not 0. The following function is used as the classification function $c_i(j)$ of the other genomic blocks:

$$c_i(j) = \arg\min_{k \in \{1, 2\}} \lVert x_{ij} - \mu_k \rVert_2$$

where $x_{ij}$ is the interaction value between genomic block i and genomic block j; $\mu_k$ is the mean of the k-th center; $\arg\min$ returns the class number of the center nearest to $x_{ij}$; $\lVert \cdot \rVert_2$ is the 2-norm. The initial center values $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two prescribed positions of the non-zero interaction values sorted in ascending order (the two position expressions are formula images in the original and are not reproduced here), with $\mu_1$ corresponding to the center of the low-frequency interaction class and $\mu_2$ to the center of the high-frequency interaction class;

By solving the classification function, genomic block j is assigned to the class whose center is nearest to it;

S1.4. Define for each genomic block i its high-frequency interaction region $(l_i, r_i)$, where $l_i$ is the smallest block number among the genomic blocks in the high-interaction class of genomic block i, and $r_i$ is the largest block number among them;
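Steps S1.2 to S1.4 can be sketched as a one-dimensional 2-means clustering of one row of the Hi-C matrix. This is a minimal illustration, not the patented implementation: the function name `high_frequency_region` and the quantile parameters `q_low` and `q_high` are hypothetical, and the quantile positions used to pick the initial centers are an assumption, because the source gives those positions only as formula images.

```python
import numpy as np

def high_frequency_region(matrix, i, q_low=0.25, q_high=0.75, max_iter=100):
    """Return (l_i, r_i) for block i, or None when no high-frequency class exists.

    q_low and q_high are ASSUMED quantile positions for the two initial
    centers; the source gives the real positions only as formula images.
    """
    row = np.asarray(matrix, dtype=float)[i].copy()
    row[i] = 0.0                               # S1.2: drop the self-interaction
    idx = np.flatnonzero(row)                  # blocks with a non-zero interaction
    if idx.size < 2:
        return None
    vals = row[idx]
    ordered = np.sort(vals)
    last = ordered.size - 1
    mu = np.array([ordered[int(q_low * last)], ordered[int(q_high * last)]])

    def assign(centers):
        # c_i(j) = argmin_k |x_ij - mu_k| (the 2-norm of a scalar is its absolute value)
        return np.argmin(np.abs(vals[:, None] - centers[None, :]), axis=1)

    for _ in range(max_iter):
        labels = assign(mu)
        new_mu = np.array([vals[labels == k].mean() if np.any(labels == k) else mu[k]
                           for k in (0, 1)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu

    high = idx[assign(mu) == 1]                # class 1 starts from the larger center
    if high.size == 0:
        return None
    return int(high.min()), int(high.max())    # S1.4: smallest and largest block number
```

Because the data are scalar interaction values, the nearest-center rule reduces to comparing absolute differences, so no general-purpose clustering library is needed for the sketch.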

S2. For each genomic block, judge and identify from the corresponding high-frequency interaction region whether a quasi-core centered on that genomic block exists:

If the high-frequency interaction region contains a quasi-core centered on the genomic block, continue with the subsequent steps;

If the high-frequency interaction region contains no quasi-core centered on the genomic block, split the high-frequency interaction region and then judge and identify the quasi-core again, until the split region no longer contains the genomic block;

In a specific implementation, step S2 includes the following steps:

S2.1. Compute the average interaction value of the submatrix $M_{(l_i, r_i)}$ that the high-frequency interaction region $(l_i, r_i)$ of genomic block i forms in the interaction matrix between genomic blocks;

S2.2. Compare the average interaction value obtained in step S2.1 with the average interaction values of the five neighboring submatrices of the same window size:

If the average interaction value obtained in step S2.1 is greater than those of the five neighboring submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is determined to be the quasi-core of genomic block i;

If the average interaction value obtained in step S2.1 is not greater than those of the five neighboring submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is split; the judgment and identification are then repeated on the split result, stopping when the split region no longer contains genomic block i;

The five neighboring submatrices of the same window size are the three submatrices above the submatrix of step S2.1, the one submatrix to its right, and the one submatrix below it (the index ranges of these five submatrices are given as formula images in the original and are not reproduced here);
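The quasi-core test of steps S2.1 and S2.2 can be sketched as a comparison of window means. The index ranges of the five neighboring windows are not recoverable from the source, so the layout used here is an assumption: three windows in the row band directly above (upper-left, directly above, upper-right), one window directly to the right, and one window on the diagonal below. The sketch also assumes that the center must exceed each neighbor individually and that neighbors falling outside the matrix are skipped. `window_mean` and `is_quasi_core` are hypothetical names.

```python
import numpy as np

def window_mean(matrix, rows, cols):
    """Mean of matrix[rows, cols] for inclusive index ranges; None if out of bounds."""
    n = matrix.shape[0]
    (r0, r1), (c0, c1) = rows, cols
    if r0 < 0 or c0 < 0 or r1 >= n or c1 >= n:
        return None
    return float(matrix[r0:r1 + 1, c0:c1 + 1].mean())

def is_quasi_core(matrix, l, r):
    """True if the diagonal window (l, r) has a larger mean than its five neighbors."""
    matrix = np.asarray(matrix, dtype=float)
    w = r - l + 1
    here, up, down = (l, r), (l - w, l - 1), (r + 1, r + w)
    core = window_mean(matrix, here, here)
    if core is None:
        return False
    neighbours = [
        (up, up), (up, here), (up, down),   # three windows above (ASSUMED layout)
        (here, down),                       # one window to the right
        (down, down),                       # one window below (ASSUMED: on the diagonal)
    ]
    means = [window_mean(matrix, rows, cols) for rows, cols in neighbours]
    # ASSUMPTION: neighbors outside the matrix are ignored rather than failing the test.
    return all(m is None or core > m for m in means)
```

If the original definition turns out to use different offsets, only the `neighbours` list needs to change.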

Splitting the high-frequency interaction region $(l_i, r_i)$ and repeating the judgment and identification until the split region no longer contains genomic block i specifically includes the following steps:

First, within the high-frequency interaction region $(l_i, r_i)$, take as the split point the genomic block $m_i$ whose total interaction with the other genomic blocks in the region is the smallest, and divide $(l_i, r_i)$ at $m_i$ into two high-frequency interaction regions, one on each side of $m_i$ (their exact bounds are given as formula images in the original and are not reproduced here);

Then make the following judgment:

If $i = m_i$, it is determined that no quasi-core centered on genomic block i exists;

If $i < m_i$, the first of the two sub-regions (the one on the low-numbered side of $m_i$) is taken as the high-frequency interaction region of the genomic block, and steps S2.1 to S2.2 are repeated to judge the quasi-core;

If $i > m_i$, the second of the two sub-regions (the one on the high-numbered side of $m_i$) is taken as the high-frequency interaction region of the genomic block, and steps S2.1 to S2.2 are repeated to judge the quasi-core;
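The split-and-retry loop described above can be sketched as follows. The bounds of the two sub-regions are not recoverable from the source, so this sketch assumes they are $(l, m_i - 1)$ and $(m_i + 1, r)$, that is, the split point itself is dropped. The quasi-core test is passed in as a callable so the sketch stays independent of how that test is defined; `find_quasi_core` and its parameters are hypothetical names.

```python
import numpy as np

def find_quasi_core(matrix, i, l, r, is_quasi_core):
    """Shrink region (l, r) around block i until the test accepts it, else return None.

    is_quasi_core(matrix, l, r) -> bool is supplied by the caller.
    """
    matrix = np.asarray(matrix, dtype=float)
    while l <= i <= r:
        if is_quasi_core(matrix, l, r):
            return l, r
        sub = matrix[l:r + 1, l:r + 1]
        # Total interaction of each block with the OTHER blocks of the region.
        totals = sub.sum(axis=1) - np.diag(sub)
        m = l + int(np.argmin(totals))        # split point m_i
        if m == i:
            return None                        # no quasi-core centered on block i
        if i < m:
            r = m - 1                          # ASSUMED bounds of the left sub-region
        else:
            l = m + 1                          # ASSUMED bounds of the right sub-region
    return None
```

The region shrinks by at least one block on every pass, so the loop always terminates.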

S3. Process the quasi-cores identified on each chromosome according to the relationship between each pair of adjacent quasi-cores, to obtain mutually non-overlapping quasi-cores. This specifically includes the following steps:

S3.1. For the quasi-cores identified on each chromosome, determine the relationship between two adjacent quasi-cores:

If one of the two adjacent quasi-cores contains the other, keep the contained quasi-core and filter out the containing one;

If the two adjacent quasi-cores overlap, make a further judgment: if the two quasi-cores still satisfy the definition of a quasi-core after being merged, merge them into one quasi-core; otherwise, keep whichever of the two has the larger average interaction value and filter out the other;

S3.2. Repeat step S3.1 until all quasi-cores on the whole chromosome have been judged and processed, finally obtaining mutually non-overlapping quasi-cores;
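The containment and overlap rules of step S3 can be sketched on quasi-cores represented as inclusive `(l, r)` tuples. The helper name `resolve_overlaps` is hypothetical, and the two callables `mean_interaction` and `is_quasi_core` are placeholders supplied by the caller; the sketch only illustrates the control flow, under the assumption that a merged quasi-core spans from the start of the first to the end of the second.

```python
def resolve_overlaps(cores, mean_interaction, is_quasi_core):
    """Return mutually non-overlapping quasi-cores, sorted by position.

    cores: iterable of inclusive (l, r) tuples.
    mean_interaction((l, r)) -> float and is_quasi_core(l, r) -> bool
    are caller-supplied (hypothetical) helpers.
    """
    cores = sorted(set(cores))
    changed = True
    while changed:
        changed = False
        for a, b in zip(cores, cores[1:]):
            if b[0] > a[1]:
                continue                                  # disjoint neighbors
            if a[0] <= b[0] and b[1] <= a[1]:
                keep = b                                  # a contains b: keep the contained one
            elif b[0] <= a[0] and a[1] <= b[1]:
                keep = a                                  # b contains a
            else:
                merged = (a[0], b[1])                     # partial overlap: try merging
                if is_quasi_core(*merged):
                    keep = merged
                else:
                    keep = max((a, b), key=mean_interaction)
            cores = sorted((set(cores) - {a, b}) | {keep})
            changed = True
            break                                         # restart on the updated list
    return cores
```

Sorting by start position guarantees that any overlapping pair implies an overlapping adjacent pair, so checking neighbors only is sufficient.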

S4. According to the correlation between the quasi-cores, merge the mutually non-overlapping quasi-cores on a chromosome, and take each merged region as the core of a chromosome topologically associating domain (TAD) to be predicted. Specifically, compute the cosine similarity between all adjacent quasi-cores; any run of consecutive adjacent quasi-cores whose cosine similarity is above a set threshold and whose average interaction value between adjacent quasi-cores is greater than the mean of the non-zero interaction values over the whole chromosome is merged into one new region, and that region is taken as the core in the core-attachment structure model of the TAD to be predicted;

In a specific implementation, the cosine similarity $\cos(pc_i, pc_j)$ of adjacent quasi-cores $pc_i$ and $pc_j$ is computed as:

$$\cos(pc_i, pc_j) = \frac{\langle V_i, V_j \rangle}{\lVert V_i \rVert \, \lVert V_j \rVert}$$

where $V_i = (v_{i1}, v_{i2}, \dots, v_{in})$ is the feature vector composed of the average interaction values between $pc_i$ and all other quasi-cores, with $v_{ik}$ ($k = 1, \dots, n$) the average interaction value between quasi-cores $pc_k$ and $pc_i$; $V_j = (v_{j1}, v_{j2}, \dots, v_{jn})$ is the feature vector composed of the average interaction values between $pc_j$ and all other quasi-cores, with $v_{jk}$ ($k = 1, \dots, n$) the average interaction value between quasi-cores $pc_k$ and $pc_j$; $\langle \cdot, \cdot \rangle$ is the inner product of vectors; $\lVert \cdot \rVert$ is the modulus of a vector;
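The cosine-similarity computation can be sketched with NumPy. Each quasi-core is an inclusive `(l, r)` range, and row `a` of the returned array is the feature vector of quasi-core `a`. Whether the vector includes the quasi-core's interaction with itself is not recoverable from the source; this sketch includes it, which is an assumption. `core_feature_vectors` and `cosine_similarity` are hypothetical names.

```python
import numpy as np

def core_feature_vectors(matrix, cores):
    """feats[a, b] = mean interaction between quasi-cores a and b (inclusive ranges)."""
    matrix = np.asarray(matrix, dtype=float)
    n = len(cores)
    feats = np.zeros((n, n))
    for a, (la, ra) in enumerate(cores):
        for b, (lb, rb) in enumerate(cores):
            # ASSUMPTION: the a == b entry (self-interaction) is kept in the vector.
            feats[a, b] = matrix[la:ra + 1, lb:rb + 1].mean()
    return feats

def cosine_similarity(u, v):
    """Inner product divided by the product of the vector norms; 0.0 for a zero vector."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0
```

The similarity of adjacent quasi-cores `k` and `k + 1` would then be `cosine_similarity(feats[k], feats[k + 1])`.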

S5. Determine which core each genomic block in the attachment candidate regions belongs to and, together with the TAD cores obtained in step S4, obtain the finally predicted TADs. Specifically, the region between two cores is defined as an attachment region; for every genomic block in every attachment region, determine which neighboring TAD core it belongs to, thereby obtaining the finally predicted TADs. Each TAD consists of one core and the attachment regions on both sides of that core;

In a specific implementation, step S5 includes the following steps:

S5.1. Among the genomic blocks lying between two adjacent cores, filter out those whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values over the whole chromosome;

S5.2. On the basis of step S5.1, remove the background signal from the submatrix formed by the two adjacent cores and the genomic blocks between them. The background signal is defined as the mean of the non-zero interaction values in the submatrix formed by the genomic blocks between the two adjacent cores;

S5.3. On the basis of step S5.2, among the genomic blocks lying between the two adjacent cores, filter out those that have no non-zero interaction value with any genomic block in the genomic region under consideration (the bounds of this region are given as a formula image in the original and are not reproduced here);

S5.4. On the basis of step S5.3, for each remaining genomic block between the two adjacent cores, compute the average interaction value of the submatrix in which that block lies (the bounds of this submatrix are given as a formula image in the original), and take the genomic block whose submatrix has the smallest average interaction value as the split point. The genomic blocks upstream of the split point are assigned as attachments of the upstream core, and the genomic blocks downstream of the split point are assigned as attachments of the downstream core, which yields the finally predicted TADs.
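Two pieces of step S5 can be sketched in isolation: background removal (S5.2) and the split-point assignment (S5.4). The source does not say how the background is removed, so this sketch assumes subtraction followed by clipping at zero. The per-block score of S5.4 depends on a submatrix whose bounds are not recoverable, so it is passed in as a caller-supplied function. `remove_background`, `assign_attachments` and `score` are hypothetical names.

```python
import numpy as np

def remove_background(sub):
    """Subtract the mean of the non-zero entries and clip at zero (ASSUMED removal rule)."""
    sub = np.asarray(sub, dtype=float)
    nonzero = sub[sub != 0]
    if nonzero.size == 0:
        return sub.copy()
    background = nonzero.mean()
    return np.where(sub > background, sub - background, 0.0)

def assign_attachments(blocks, score):
    """Split the remaining blocks between two cores at the block with the lowest score.

    blocks: ascending block numbers left after filtering.
    score(block) -> float: mean interaction of that block's submatrix (caller-supplied).
    Returns (split_point, upstream_blocks, downstream_blocks).
    """
    split = min(blocks, key=score)
    upstream = [b for b in blocks if b < split]     # attachments of the upstream core
    downstream = [b for b in blocks if b > split]   # attachments of the downstream core
    return split, upstream, downstream
```

The sketch leaves the split point itself unassigned, because the source does not state which side it joins.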

The method of the present invention is further described below with reference to an embodiment:

FIG. 2 shows the TAD prediction method based on the core-attachment structure model provided by the embodiment, which comprises the following steps. The Hi-C map shown in the figure is the KR-normalized Hi-C interaction matrix of GM12878_combined at 50 kb resolution from the GSE63525 dataset; the segment shown covers genomic blocks 120 to 200 of chromosome 1;

Step S1. For each genomic block in the interaction matrix between genomic blocks (the Hi-C interaction matrix for short) obtained by genome-wide chromosome conformation capture combined with sequencing, identify its high-frequency interaction region with the K-means clustering method;

As shown in FIG. 2-① (the preprocessing of the Hi-C interaction matrix), the diagonal entries of the KR-normalized Hi-C interaction matrix of GM12878_combined at 50 kb resolution, that is, the interaction value of each genomic block with itself, are set to 0;

As shown in FIG. 2-② (the identification of high-frequency interaction regions), for each genomic block i, the other genomic blocks whose interaction value with it is not 0 are clustered with the K-means algorithm with k = 2. The classification function of the other genomic blocks is:

$$c_i(j) = \arg\min_{k \in \{1, 2\}} \lVert x_{ij} - \mu_k \rVert_2$$

where $x_{ij}$ is the interaction value between genomic blocks i and j, and $\mu_k$ is the mean of the k-th center. The initial center values $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two prescribed positions of the non-zero interaction values sorted in ascending order (the two position expressions are formula images in the original and are not reproduced here); $\mu_1$ corresponds to the center of the low-frequency interaction class and $\mu_2$ to the center of the high-frequency interaction class. By solving the classification function, genomic block j is assigned to the class whose center is nearest to it;

For each genomic block i, define its high-frequency interaction region $(l_i, r_i)$, where $l_i$ is the smallest block number among the genomic blocks in the high-interaction class of genomic block i and $r_i$ is the largest block number among them. A schematic of the high-frequency interaction region is shown in FIG. 2-②b;

Step S2. As shown in FIG. 2-③a (FIG. 2-③ is the construction of the quasi-cores of TADs), for each genomic block, judge and identify from its high-frequency interaction region whether a quasi-core centered on that genomic block exists;

A quasi-core is defined as follows: if the submatrix $M_{(l_i, r_i)}$ that the high-frequency interaction region $(l_i, r_i)$ of genomic block i forms in the Hi-C interaction matrix has an average interaction value greater than those of the five neighboring submatrices of the same window size, namely the three submatrices above it, the one submatrix to its right and the one submatrix below it (their index ranges are given as formula images in the original and are not reproduced here), then the high-frequency interaction region $(l_i, r_i)$ is the quasi-core of genomic block i;

If the submatrix $M_{(l_i, r_i)}$ that the high-frequency interaction region $(l_i, r_i)$ of genomic block i forms in the Hi-C interaction matrix has an average interaction value that is not greater than those of the five neighboring submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is split and the quasi-core is judged and identified again, stopping only when the split region no longer contains genomic block i;

To split the high-frequency interaction region $(l_i, r_i)$ of genomic block i, first take as the split point the genomic block $m_i$ in the region whose total interaction with the other genomic blocks in the region is the smallest, and divide $(l_i, r_i)$ at $m_i$ into two high-frequency interaction regions, one on each side of $m_i$ (their exact bounds are given as formula images in the original and are not reproduced here);

Further, when $i = m_i$, it is determined that no quasi-core centered on genomic block i exists; when $i < m_i$, the judgment and identification of the quasi-core continue on the first of the two sub-regions; when $i > m_i$, they continue on the second of the two sub-regions. The judgment and identification proceed as described above;

Step S3. As shown in FIG. 2-③b and FIG. 2-③c, the quasi-cores identified on each chromosome are filtered or merged according to the containment or overlap relationship between each pair of adjacent quasi-cores, to obtain mutually non-overlapping quasi-cores;

When one of two adjacent quasi-cores contains the other, the contained quasi-core is kept and the containing one is filtered out;

When two adjacent quasi-cores overlap, they are merged into one quasi-core if the merged region still satisfies the definition of a quasi-core; otherwise, only the one with the larger average interaction frequency is kept;

After one pair of adjacent quasi-cores has been processed, the search for adjacent quasi-cores that contain or overlap each other continues from the next quasi-core, and the same processing is applied, until no overlapping quasi-cores remain on the whole chromosome;

Step S4. As shown in FIG. 2-④ (the construction of the cores in the core-attachment structure model of TADs), the mutually non-overlapping quasi-cores on a chromosome are merged according to the correlation between quasi-cores, and each merged region is regarded as the core of a TAD to be predicted;

The correlation between every pair of adjacent quasi-cores $pc_i$ and $pc_j$ is computed with the cosine similarity:

$$\cos(pc_i, pc_j) = \frac{\langle V_i, V_j \rangle}{\lVert V_i \rVert \, \lVert V_j \rVert}$$

where $V_i$ and $V_j$ are the feature vectors composed of the average interaction values between $pc_i$, respectively $pc_j$, and all other quasi-cores;

A correlation threshold is set. Two or more consecutive adjacent quasi-cores whose similarity is above the threshold, and whose average interaction value between adjacent quasi-cores is greater than the mean of the non-zero interaction values over the whole chromosome, are merged into one new region, which serves as the core in the core-attachment structure model of one TAD.
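The merging rule can be sketched as a single left-to-right pass over the sorted, non-overlapping quasi-cores. The per-pair similarities and inter-core mean interaction values are assumed to be precomputed and passed in as lists aligned with the adjacent pairs; `merge_adjacent_cores` and all parameter names are hypothetical, and the threshold value is left to the caller because the source does not state one.

```python
def merge_adjacent_cores(cores, similarity, inter_mean, sim_threshold, chrom_mean):
    """Merge runs of adjacent quasi-cores into TAD cores.

    cores: sorted, non-overlapping inclusive (l, r) tuples.
    similarity[k], inter_mean[k]: cosine similarity and mean interaction
    between cores[k] and cores[k + 1].
    """
    if not cores:
        return []
    merged = [cores[0]]
    for k in range(1, len(cores)):
        if similarity[k - 1] > sim_threshold and inter_mean[k - 1] > chrom_mean:
            merged[-1] = (merged[-1][0], cores[k][1])   # extend the current run
        else:
            merged.append(cores[k])                     # start a new core
    return merged
```

A run is extended only when both conditions hold for the pair at its right end, so a single weak link splits a chain of quasi-cores into two cores.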

Step S5. As shown in FIG. 2-⑤ (the construction of the complete core-attachment structure model of TADs), the region between two cores is defined as an attachment candidate region, and for each genomic block in an attachment candidate region it is determined which neighboring core the block belongs to. Each finally predicted TAD consists of one core and the attachments on both sides of it. In a specific implementation, this includes the following steps:

S5.1. Among the genomic blocks lying between two adjacent cores, filter out those whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values over the whole chromosome;

S5.2. On the basis of step S5.1, remove the background signal from the submatrix formed by the two adjacent cores and the genomic blocks between them. The background signal is defined as the mean of the non-zero interaction values in the submatrix formed by the genomic blocks between the two adjacent cores;

S5.3. On the basis of step S5.2, among the genomic blocks lying between the two adjacent cores, filter out those that have no non-zero interaction value with any genomic block in the genomic region under consideration (the bounds of this region are given as a formula image in the original and are not reproduced here);

S5.4. On the basis of step S5.3, for each remaining genomic block between the two adjacent cores, compute the average interaction value of the submatrix in which that block lies (the bounds of this submatrix are given as a formula image in the original), and take the genomic block whose submatrix has the smallest average interaction value as the split point. The genomic blocks upstream of the split point are assigned as attachments of the upstream core, and the genomic blocks downstream of the split point are assigned as attachments of the downstream core, which yields the finally predicted TADs.

FIG. 3 is a schematic structural diagram of the system of the present invention. The present invention also provides a prediction system that implements the above method for predicting chromosome topologically associating domains. The system comprises, connected in series, a high-frequency interaction region identification module, a quasi-core identification module, a quasi-core processing module, a TAD core identification module and a TAD identification module.

The high-frequency interaction region identification module takes each genomic block in the interaction matrix between genomic blocks, identifies the corresponding high-frequency interaction region with a clustering algorithm, and passes the resulting high-frequency interaction regions to the quasi-core identification module. The quasi-core identification module judges and identifies, for each genomic block, whether its high-frequency interaction region contains a quasi-core centered on that block, and passes the resulting quasi-cores to the quasi-core processing module. The quasi-core processing module processes the quasi-cores identified on each chromosome according to the relationship between each pair of adjacent quasi-cores, obtains mutually non-overlapping quasi-cores, and passes them to the TAD core identification module. The TAD core identification module merges the mutually non-overlapping quasi-cores on a chromosome according to the correlation between the quasi-cores, takes each merged region as the core of a TAD to be predicted, and passes the resulting cores to the TAD identification module. The TAD identification module determines which core each genomic block in the attachment candidate regions belongs to and, together with the received TAD cores, obtains and outputs the finally predicted TADs.
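The serial arrangement of the five modules can be sketched as a chain of functions that each receive and return a shared state. This is only an illustration of the data flow: `run_pipeline`, the dictionary-based state, and the stand-in module functions in the usage lines are hypothetical and do not reflect any interface defined in the source.

```python
def run_pipeline(state, modules):
    """Pass a state dict through the modules in order and return the final state."""
    for module in modules:
        state = module(state)    # each module hands its result to the next one
    return state

# Usage with stand-in modules that only record the order in which they ran.
def make_stage(name):
    def stage(state):
        return {**state, "trace": state.get("trace", []) + [name]}
    return stage

stages = [make_stage(n) for n in (
    "high_frequency_regions", "quasi_cores", "non_overlapping_quasi_cores",
    "tad_cores", "tads",
)]
result = run_pipeline({"hic_matrix": None}, stages)
```

Keeping the Hi-C matrix inside the shared state lets later stages read it even though they are only connected to their immediate predecessor.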

Claims (10)

1. A method for predicting chromosome topologically associating domains, characterized by comprising the following steps:
S1. For each genomic block in the interaction matrix between genomic blocks, identify the corresponding high-frequency interaction region with a clustering algorithm;
S2. For each genomic block, judge and identify from the corresponding high-frequency interaction region whether a quasi-core centered on that genomic block exists: if the high-frequency interaction region contains a quasi-core centered on the genomic block, continue with the subsequent steps; if it does not, split the high-frequency interaction region and then judge and identify the quasi-core again, until the split region no longer contains the genomic block;
S3. Process the quasi-cores identified on each chromosome according to the relationship between each pair of adjacent quasi-cores, to obtain mutually non-overlapping quasi-cores;
S4. According to the correlation between the quasi-cores, merge the mutually non-overlapping quasi-cores on a chromosome, and take each merged region as the core of a chromosome topologically associating domain to be predicted;
S5. Determine which core each genomic block in the attachment candidate regions belongs to and, together with the cores obtained in step S4, obtain the finally predicted chromosome topologically associating domains.

2. The method for predicting chromosome topologically associating domains according to claim 1, characterized in that in step S1 the interaction matrix between genomic blocks is obtained by genome-wide chromosome conformation capture combined with sequencing, and each genomic block in it is clustered with the K-means algorithm with k = 2 to identify the corresponding high-frequency interaction region.

3. The method for predicting chromosome topologically associating domains according to claim 2, characterized in that step S1 specifically comprises the following steps:
S1.1. Obtain the interaction matrix between genomic blocks by genome-wide chromosome conformation capture combined with sequencing;
S1.2. Set to 0 the diagonal entries of the interaction matrix obtained in step S1.1, that is, the interaction value of each genomic block with itself;
S1.3. For any genomic block i, use the K-means clustering algorithm with k = 2 to cluster the other genomic blocks whose interaction value with genomic block i is not 0. The following function is used as the classification function $c_i(j)$ of the other genomic blocks in step S1.3:

$$c_i(j) = \arg\min_{k \in \{1, 2\}} \lVert x_{ij} - \mu_k \rVert_2$$

where $x_{ij}$ is the interaction value between genomic block i and genomic block j; $\mu_k$ is the mean of the k-th center; $\arg\min$ returns the class number of the center nearest to $x_{ij}$; $\lVert \cdot \rVert_2$ is the 2-norm; the initial center values $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two prescribed positions of the non-zero interaction values sorted in ascending order (the two position expressions are formula images in the original and are not reproduced here), with $\mu_1$ corresponding to the center of the low-frequency interaction class and $\mu_2$ to the center of the high-frequency interaction class;
By solving the classification function, genomic block j is assigned to the class whose center is nearest to it;
S1.4. Define the corresponding high-frequency interaction region for each genomic block i
Figure DEST_PATH_IMAGE024
wherein, li corresponds to the minimum block number of the genome block in the high interaction class of the genome block i , and ri corresponds to the largest block number of the genome block in the high interaction class of the genome block i .
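Steps S1.2 to S1.4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `high_frequency_region` and the parameters `lo_q` and `hi_q` are hypothetical, and because the two positions used for the initial centers survive only as formula images, the sketch assumes the lower and upper quartile positions of the sorted non-zero values.

```python
import numpy as np

def high_frequency_region(matrix, i, lo_q=0.25, hi_q=0.75, max_iter=100):
    """Return (l_i, r_i) for block i, or None if block i has no non-zero partner.

    lo_q and hi_q are assumptions: the claim takes the initial centers from two
    positions of the sorted non-zero values, but those positions are not
    recoverable from the text.
    """
    row = np.asarray(matrix, dtype=float)[i].copy()
    row[i] = 0.0                        # S1.2: ignore the self-interaction
    partners = np.flatnonzero(row)      # blocks j with a non-zero interaction
    if partners.size == 0:
        return None
    values = row[partners]
    ordered = np.sort(values)
    n = ordered.size
    # Initial centers: low-frequency class first, high-frequency class second.
    centers = np.array([ordered[int(lo_q * (n - 1))],
                        ordered[int(hi_q * (n - 1))]])
    for _ in range(max_iter):
        # Assign every value to its nearest center (for scalars the 2-norm
        # is the absolute difference).
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        new_centers = centers.copy()
        for k in (0, 1):
            if np.any(labels == k):
                new_centers[k] = values[labels == k].mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # S1.4: the region spans the smallest to the largest block number
    # found in the high-interaction class.
    high = partners[labels == int(np.argmax(centers))]
    return int(high.min()), int(high.max())
```

For a toy row `[1, 9, 50, 10, 8, 1]` at block 2, the self-interaction 50 is dropped, the values 9, 10 and 8 form the high-interaction class, and the function returns `(1, 4)`.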
4. The method for predicting chromosome topological association domains according to claim 3, characterized in that step S2 specifically comprises the following steps:
S2.1. Compute the average interaction value of the submatrix [formula image] that the high-frequency interaction region $[l_i, r_i]$ of genomic block i forms in the interaction matrix between genomic blocks;
S2.2. Compare the average interaction value obtained in step S2.1 with the average interaction values of five neighboring submatrices of the same window size:
if the average interaction value obtained in step S2.1 is greater than the average interaction values of the five neighboring submatrices of the same window size, the high-frequency interaction region $[l_i, r_i]$ is determined to be the quasi-core of genomic block i;
if the average interaction value obtained in step S2.1 is not greater than the average interaction values of the five neighboring submatrices of the same window size, the high-frequency interaction region $[l_i, r_i]$ is split; the determination is repeated after splitting, and stops when the split region no longer contains genomic block i;
the five neighboring submatrices of the same window size are, specifically, the three submatrices above ([formula image], [formula image] and [formula image]), one submatrix to the right ([formula image]), and one submatrix below ([formula image]).
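The comparison in steps S2.1 and S2.2 can be sketched as follows. This is an illustration under stated assumptions, not the claimed procedure itself: the index ranges of the five neighboring windows are only available as formula images, so `ASSUMED_LAYOUT` is a hypothetical placement (three windows above, one to the right, one below, each shifted by whole window widths), the function names are invented for the example, and "greater than" is read as greater than each neighbor individually.

```python
import numpy as np

# Hypothetical neighbor layout as (row shift, column shift) in units of the
# window size w = r - l + 1: three windows above, one right, one below.
ASSUMED_LAYOUT = [(-1, 0), (-1, 1), (-1, 2), (0, 1), (1, 0)]

def window_mean(matrix, row0, col0, w):
    """Mean of the w x w window with top-left corner (row0, col0).

    Returns None when the window does not fit inside the matrix.
    """
    n = matrix.shape[0]
    if row0 < 0 or col0 < 0 or row0 + w > n or col0 + w > n:
        return None
    return float(matrix[row0:row0 + w, col0:col0 + w].mean())

def is_quasi_core(matrix, l, r, layout=ASSUMED_LAYOUT):
    """True if the window [l, r] x [l, r] beats every in-bounds neighbor."""
    matrix = np.asarray(matrix, dtype=float)
    w = r - l + 1
    center = window_mean(matrix, l, l, w)            # S2.1
    for d_row, d_col in layout:                      # S2.2
        other = window_mean(matrix, l + d_row * w, l + d_col * w, w)
        if other is not None and center <= other:
            return False
    return True
```

Neighbor windows that fall outside the matrix are skipped here; that choice is also an assumption, made so that regions near the chromosome ends can still be tested.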
5. The method for predicting chromosome topological association domains according to claim 4, characterized in that splitting the high-frequency interaction region $[l_i, r_i]$, repeating the determination after splitting, and stopping when the split region no longer contains genomic block i specifically comprises the following steps:
first, take as the split point the genomic block $m_i$ in the high-frequency interaction region $[l_i, r_i]$ whose sum of interactions with the other genomic blocks in $[l_i, r_i]$ is smallest, and divide $[l_i, r_i]$ into two high-frequency interaction regions: [formula image] on the $l_i$ side of $m_i$ and [formula image] on the $r_i$ side of $m_i$;
then, make the following determination:
if $i = m_i$, it is determined that no quasi-core centered on genomic block i exists;
if $i < m_i$, take the high-frequency interaction region on the $l_i$ side of $m_i$ as the high-frequency interaction region of the genomic block, and repeat steps S2.1 to S2.2 to determine the quasi-core;
if $i > m_i$, take the high-frequency interaction region on the $r_i$ side of $m_i$ as the high-frequency interaction region of the genomic block, and repeat steps S2.1 to S2.2 to determine the quasi-core.
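The splitting loop described above can be sketched in Python. The names `split_point` and `find_quasi_core` are hypothetical, the quasi-core test is passed in as a callable `is_core(matrix, l, r)` so the sketch stays self-contained, and the exact end points of the two sub-regions are an assumption (the split block is excluded from both), since they are given only as formula images.

```python
import numpy as np

def split_point(matrix, l, r):
    """Block m in [l, r] with the smallest summed interaction inside [l, r]."""
    sub = np.asarray(matrix, dtype=float)[l:r + 1, l:r + 1].copy()
    np.fill_diagonal(sub, 0.0)       # exclude each block's self-interaction
    return l + int(np.argmin(sub.sum(axis=1)))

def find_quasi_core(matrix, i, l, r, is_core):
    """Shrink [l, r] around block i until is_core(matrix, l, r) holds.

    Returns (l, r) of the quasi-core, or None when block i itself becomes
    the split point, i.e. no quasi-core is centered on block i.
    """
    while l <= i <= r:
        if is_core(matrix, l, r):
            return l, r
        m = split_point(matrix, l, r)
        if m == i:
            return None
        if i < m:
            r = m - 1                # assumed: the left part excludes block m
        else:
            l = m + 1                # assumed: the right part excludes block m
    return None
```

Every pass either returns or strictly shrinks the region, so the loop terminates; a single-block region that fails the test makes that block its own split point and yields `None`.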
6. The method for predicting chromosome topological association domains according to claim 5, characterized in that step S3 specifically comprises the following steps:
S3.1. For the quasi-cores identified on each chromosome, determine the relationship between two adjacent quasi-cores:
if one of the two adjacent quasi-cores contains the other, keep the contained quasi-core and filter out the containing quasi-core;
if the two adjacent quasi-cores overlap, make a further determination: if the two quasi-cores, once merged, still satisfy the definition of a quasi-core, merge them into one quasi-core; otherwise, keep the quasi-core with the larger average interaction value and filter out the other;
S3.2. Repeat step S3.1 until all quasi-cores on the whole chromosome have been examined and processed, finally obtaining mutually non-overlapping quasi-cores.
7. The method for predicting chromosome topological association domains according to claim 6, characterized in that step S4 specifically comprises: computing the cosine similarity between all adjacent quasi-cores; merging into one new region every run of consecutive adjacent quasi-cores whose cosine similarity is higher than a set threshold and whose average interaction value between adjacent quasi-cores is greater than the mean of the non-zero interaction values on the whole chromosome; and taking that region as the core in the core-attachment structural model of the chromosome topological association domain to be predicted.
8. The method for predicting chromosome topological association domains according to claim 7, characterized in that step S5 specifically comprises: defining the region between two cores as an attachment region, and determining, for each genomic block in each attachment region, the neighboring chromosome topological association domain core to which it belongs, thereby obtaining the final predicted chromosome topological association domains; each chromosome topological association domain comprises one core and the attachment regions on both sides of that core.
9. The method for predicting chromosome topological association domains according to claim 8, characterized in that step S5 specifically comprises the following steps:
S5.1. Among the genomic blocks [formula image] between two adjacent cores [formula image] and [formula image], filter out the genomic blocks whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values on the whole chromosome;
S5.2. On the basis of step S5.1, remove the background signal from the submatrix formed by the two adjacent cores and the genomic blocks between them; the background signal is defined as the mean of the non-zero interaction values in the submatrix formed by the genomic blocks between the two adjacent cores;
S5.3. On the basis of step S5.2, among the genomic blocks between the two adjacent cores, filter out the genomic blocks that have no non-zero interaction value with any genomic block in the genomic region [formula image];
S5.4. On the basis of step S5.3, compute the average interaction value of the submatrix [formula image] in which each remaining genomic block between the two adjacent cores is located, and take the genomic block whose submatrix has the smallest average interaction value as the split point; the genomic blocks upstream of the split point are assigned as attachments of the upstream core, and the genomic blocks downstream of the split point are assigned as attachments of the downstream core; the final predicted chromosome topological association domains are thereby obtained.
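The merging of step S4 (claim 7) and the split of step S5.4 can be sketched together in Python. Several details are assumptions introduced for illustration only: each quasi-core is represented by the mean interaction profile of its rows, the "average interaction value between adjacent quasi-cores" is taken as the mean of the off-diagonal window linking them, the threshold 0.8 is arbitrary, the submatrix of a gap block is taken as its row restricted to the span of the two cores, and the split block itself is assigned downstream. The filtering of steps S5.1 to S5.3 is omitted, and the function names are hypothetical.

```python
import numpy as np

def merge_quasi_cores(matrix, cores, threshold=0.8):
    """S4 sketch: merge runs of adjacent quasi-cores, given as sorted (l, r)."""
    matrix = np.asarray(matrix, dtype=float)
    if not cores:
        return []
    background = matrix[matrix != 0].mean()     # mean of non-zero values
    merged = [list(cores[0])]
    for prev, cur in zip(cores, cores[1:]):
        # Assumed representation: mean interaction profile of each quasi-core.
        a = matrix[prev[0]:prev[1] + 1].mean(axis=0)
        b = matrix[cur[0]:cur[1] + 1].mean(axis=0)
        norm = np.linalg.norm(a) * np.linalg.norm(b)
        cosine = float(a @ b / norm) if norm > 0 else 0.0
        between = matrix[prev[0]:prev[1] + 1, cur[0]:cur[1] + 1].mean()
        if cosine > threshold and between > background:
            merged[-1][1] = cur[1]              # extend the current run
        else:
            merged.append(list(cur))            # start a new core
    return [tuple(c) for c in merged]

def assign_attachments(matrix, core_a, core_b):
    """S5.4 sketch: split the gap between two adjacent cores.

    Returns (blocks attached to core_a, blocks attached to core_b).
    """
    matrix = np.asarray(matrix, dtype=float)
    gap = list(range(core_a[1] + 1, core_b[0]))
    if not gap:
        return [], []
    lo, hi = core_a[0], core_b[1]
    # Assumed submatrix of block g: its row restricted to columns lo..hi.
    means = [matrix[g, lo:hi + 1].mean() for g in gap]
    cut = gap[int(np.argmin(means))]
    # Assumed: the split block itself goes with the downstream core.
    return [g for g in gap if g < cut], [g for g in gap if g >= cut]
```

A predicted domain is then one core plus the gap blocks assigned to it on either side, which matches the core-attachment structure described in claim 8.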
10. A prediction system implementing the method for predicting chromosome topological association domains according to any one of claims 1 to 9, characterized by comprising a high-frequency interaction region identification module, a quasi-core identification module, a quasi-core processing module, a chromosome topological association domain core identification module, and a chromosome topological association domain identification module, connected in series in that order;
the high-frequency interaction region identification module is used to obtain each genomic block in the interaction matrix between genomic blocks, identify the corresponding high-frequency interaction region using a clustering algorithm, and pass the resulting high-frequency interaction regions to the quasi-core identification module;
the quasi-core identification module is used, for each genomic block, to determine from the corresponding high-frequency interaction region whether a quasi-core centered on that genomic block exists, and to pass the resulting quasi-cores to the quasi-core processing module;
the quasi-core processing module is used to process the quasi-cores identified on each chromosome according to the relationship between each pair of adjacent quasi-cores, obtain mutually non-overlapping quasi-cores, and pass them to the chromosome topological association domain core identification module;
the chromosome topological association domain core identification module is used to merge the non-overlapping quasi-cores on a chromosome according to the correlation between quasi-cores, take each merged region as the core of a chromosome topological association domain to be predicted, and pass the resulting cores to the chromosome topological association domain identification module;
the chromosome topological association domain identification module is used to determine the affiliation of each genomic block in the attachment candidate regions, combine it with the received cores to obtain the final predicted chromosome topological association domains, and output them.
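The serial arrangement of the five modules can be sketched as a simple chain in Python. The module names in `MODULE_ORDER` and the calling convention (each module receives the interaction matrix and the previous module's output) are hypothetical; the claim fixes only the order of the modules and what each one passes on.

```python
from typing import Any, Callable, Dict, List

# Hypothetical identifiers for the five modules, in the claimed order.
MODULE_ORDER: List[str] = [
    "high_frequency_region_identification",
    "quasi_core_identification",
    "quasi_core_processing",
    "tad_core_identification",
    "tad_identification",
]

def run_system(modules: Dict[str, Callable[[Any, Any], Any]],
               interaction_matrix: Any) -> Any:
    """Run the modules in series; each one consumes its predecessor's output."""
    data = None                       # the first module has no predecessor
    for name in MODULE_ORDER:
        data = modules[name](interaction_matrix, data)
    return data
```

Keeping the order in one list makes the series connection explicit: swapping an implementation means replacing one entry of `modules`, while the data flow between stages stays fixed.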
CN202210245600.9A 2022-03-14 2022-03-14 Prediction method and prediction system for chromosome topological association domains Active CN114446384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210245600.9A CN114446384B (en) 2022-03-14 2022-03-14 Prediction method and prediction system for chromosome topological association domains

Publications (2)

Publication Number Publication Date
CN114446384A true CN114446384A (en) 2022-05-06
CN114446384B CN114446384B (en) 2024-11-05

Family

ID=81358910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210245600.9A Active CN114446384B (en) 2022-03-14 2022-03-14 Prediction method and prediction system for chromosome topological association domains

Country Status (1)

Country Link
CN (1) CN114446384B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190005191A1 (en) * 2015-07-14 2019-01-03 Whitehead Institute For Biomedical Research Chromosome neighborhood structures and methods relating thereto
US20190295684A1 (en) * 2018-03-22 2019-09-26 The Regents Of The University Of Michigan Method and apparatus for analysis of chromatin interaction data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
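Xu Xilun (许希伦): "Correlation analysis of chromosome interaction density and topological domains" (染色体相互作用密度与拓扑域相关分析), 电脑知识与技术 (Computer Knowledge and Technology), no. 03, 25 January 2020 (2020-01-25) *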

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114944190A (en) * 2022-05-12 2022-08-26 南开大学 TAD (TAD-based data analysis) identification method and system based on Hi-C sequencing data
CN114944190B (en) * 2022-05-12 2024-04-19 南开大学 TAD identification method and system based on Hi-C sequencing data

Also Published As

Publication number Publication date
CN114446384B (en) 2024-11-05

Similar Documents

Publication Publication Date Title
CN114332568B (en) Training method, system, device and storage medium for domain-adapted image classification network
WO2023217290A1 (en) Genophenotypic prediction based on graph neural network
WO2017173929A1 (en) Unsupervised feature selection method and device
CN108805002A (en) Monitor video accident detection method based on deep learning and dynamic clustering
CN110689091A (en) Weakly supervised fine-grained object classification method
CN102184216A (en) Automatic clustering method based on data field grid division
CN104102706A (en) Hierarchical clustering-based suspicious taxpayer detection method
CN110493221A (en) A kind of network anomaly detection method based on the profile that clusters
CN101923604A (en) Weighted KNN Tumor Gene Expression Profile Classification Method Based on Neighborhood Rough Sets
CN104572985A (en) Industrial data sample screening method based on complex network community discovery
CN111710364A (en) A kind of acquisition method, device, terminal and storage medium of flora marker
CN114446384A (en) Prediction method and prediction system of chromosome topological association domains
CN116206327A (en) Image classification method based on online knowledge distillation
CN114842507A (en) Reinforced pedestrian attribute identification method based on group optimization reward
CN111461440A (en) Link prediction method, system and terminal equipment
CN115691661A (en) Gene coding breeding prediction method and device based on graph clustering
CN116861226A (en) Data processing method and related device
CN112418522B (en) Industrial heating furnace steel temperature prediction method based on three-branch integrated prediction model
WO2022011855A1 (en) False positive structural variation filtering method, storage medium, and computing device
CN111192638B (en) High-dimensional low-sample gene data screening and protein network analysis method and system
CN110097922B (en) A method for differential analysis of hierarchical TADs in Hi-C contact matrix based on online machine learning
CN116129999A (en) Method, device, equipment and storage medium for constructing tumor virtual three-dimensional transcriptome
CN116403713A (en) Method for predicting autism spectrum barrier risk genes based on multiclass unsupervised feature extraction method
CN112735532B (en) Metabolite identification system based on molecular fingerprint prediction and its application method
CN115828120A (en) Self-adaptive recognition method, system and computer equipment for ship traffic behavior pattern

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant