CN114446384A - Prediction method and prediction system of chromosome topological correlation structure domain - Google Patents
- Publication number
- CN114446384A (application CN202210245600.9A)
- Authority
- CN
- China
- Prior art keywords
- genome
- quasi
- chromosome
- interaction
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/20—Protein or domain folding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
Abstract
The invention discloses a method for predicting chromosome topologically associating domains (TADs), comprising: obtaining the interaction matrix between genome blocks and identifying a high-frequency interaction region for each genome block; identifying, for each genome block, a quasi-nucleus from its high-frequency interaction region; processing the quasi-nuclei identified on each chromosome to obtain non-overlapping quasi-nuclei; merging the non-overlapping quasi-nuclei on a chromosome to obtain the nuclei of the TADs to be predicted; and determining the membership of each genome block in the attachment candidate regions and combining it with the TAD nuclei to obtain the final predicted TADs. The invention also discloses a prediction system implementing the method. The invention makes full use of the global information in Hi-C data, narrows the range in which candidate boundaries are located, requires no user-supplied predefined parameters, and can predict TADs accurately and reliably.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a prediction method and a prediction system of a chromosome topology association domain.
Background
In recent years, the emergence of genome-wide high-throughput chromosome conformation capture technology (Hi-C) has advanced the understanding of the hierarchical spatial structure of chromosomes. Researchers have converted Hi-C sequencing data from mammalian cells into Hi-C interaction matrices and visualized them, finding highly self-interacting regions at resolutions below 100 kb; such regions are called topologically associating domains (TADs). A Hi-C interaction matrix is constructed as follows: a chromosome is divided into $N$ segments of equal length, and an $N \times N$ matrix $M$ is built to characterize the interaction signal between every two segments on the chromosome. Each equal-length segment is called a genome block, and the size of a genome block corresponds to the resolution of the Hi-C interaction matrix. The matrix is filled by counting, over the sequenced fragment reads produced by Hi-C, the interaction frequency between each pair of the $N$ genome blocks. For example, when the two ends of a sequenced fragment align to genome block $i$ and genome block $j$ respectively, 1 is added to the running totals of matrix elements $M_{i,j}$ and $M_{j,i}$.
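The counting procedure described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_contact_matrix`, the representation of a read pair as two base-pair positions, and the choice to count a within-block pair once on the diagonal are all assumptions.

```python
import numpy as np

def build_contact_matrix(read_pairs, chrom_length, bin_size):
    """Count aligned read pairs into a symmetric N x N matrix of genome-block contacts.

    read_pairs: iterable of (pos_a, pos_b) base-pair positions on one chromosome
    (a hypothetical input format chosen for this sketch).
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division gives N
    M = np.zeros((n_bins, n_bins), dtype=np.int64)
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        M[i, j] += 1
        if i != j:
            # Mirror the count so that M stays symmetric.
            # ASSUMPTION: a pair falling inside one block is counted once.
            M[j, i] += 1
    return M
```

With a 50 kb resolution, `bin_size` would be 50000; the diagonal entries produced here are later set to 0 by the method, so the within-block convention does not affect the downstream steps.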
At present, because of the limitations of microscopy and biotechnology, researchers still cannot observe TADs directly and completely, and the mechanism of TAD formation remains unclear. Information about TADs therefore has to be obtained indirectly, for example by constructing a Hi-C interaction matrix from the interactions between chromosome segments captured in Hi-C sequencing data and then predicting TADs with a suitable algorithm. In recent years, researchers have proposed TAD prediction methods based on machine learning algorithms; however, these methods transfer poorly between cell lines, because each cell line usually requires a large amount of its own specific information to extract features and train the model, which places an extra burden on researchers.
Existing TAD prediction algorithms mainly predict TADs from the interaction preference at boundaries, the similarity inside a TAD, the difference between TAD and non-TAD regions, the change in contact frequency density inside a TAD, and similar properties. These methods either focus only on finding boundaries and ignore the information inside the TAD, or require user-defined parameters to control the TAD size, the clustering termination threshold, the local maximum, and so on, which makes TAD identification unstable and subjective. Furthermore, since the TAD is a structure without a precise definition, it should not be predicted by constraining its own properties.
Disclosure of Invention
The invention aims to provide a method for predicting a chromosome topological correlation domain, which has high reliability, good accuracy and good effect.
The other object of the present invention is to provide a prediction system for implementing the method for predicting the chromosome topology association domain.
The method for predicting the topological associated domain of the chromosome, provided by the invention, comprises the following steps:
S1. obtaining the interaction matrix between genome blocks and, for each genome block in it, identifying the corresponding high-frequency interaction region with a clustering algorithm;
S2. for each genome block, determining from the corresponding high-frequency interaction region whether a quasi-nucleus centered on the genome block exists:
if the high-frequency interaction region has a quasi-nucleus centered on the genome block, continuing with the subsequent steps;
if the high-frequency interaction region has no quasi-nucleus centered on the genome block, splitting the high-frequency interaction region and repeating the determination until the split region no longer contains the genome block;
S3. processing the quasi-nuclei identified on each chromosome according to the relationship between every two adjacent quasi-nuclei to obtain non-overlapping quasi-nuclei;
S4. merging the non-overlapping quasi-nuclei on a chromosome according to the correlation among the quasi-nuclei, and using the merged quasi-nuclei as the nuclei of the chromosome TADs to be predicted;
S5. determining the membership of each genome block in the attachment candidate regions, and combining it with the TAD nuclei obtained in step S4 to obtain the final predicted chromosome TADs.
Step S1 specifically uses a whole-genome conformation capture technology and a sequencing technology to obtain the interaction matrix between genome blocks and, for each genome block, clusters with a K-means clustering algorithm with $k = 2$ to identify the corresponding high-frequency interaction region.
The step S1 specifically includes the following steps:
S1.1. obtaining the interaction matrix between genome blocks using a whole-genome conformation capture technology and a sequencing technology;
S1.2. setting to 0 the interaction value of each genome block with itself, i.e. the diagonal of the interaction matrix obtained in step S1.1;
S1.3. for any genome block $i$, using a K-means clustering algorithm with $k = 2$ to cluster the other genome blocks whose interaction value with genome block $i$ is not 0;
S1.4. for each genome block $i$, defining the corresponding high-frequency interaction region $(l_i, r_i)$, where $l_i$ is the minimum block number among the genome blocks in the high-interaction class of genome block $i$, and $r_i$ is the maximum block number among the genome blocks in the high-interaction class of genome block $i$.
The following function is used as the classification function for the other genome blocks in step S1.3:
$$c_j = \arg\min_{k \in \{1, 2\}} \lVert M_{i,j} - \mu_k \rVert_2$$
where $M_{i,j}$ is the interaction value of genome block $i$ and genome block $j$; $\mu_k$ is the mean of the $k$-th center; $\arg\min$ returns the class number of the center nearest to $M_{i,j}$; and $\lVert \cdot \rVert_2$ is the 2-norm. The initial center values $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at positions […] and […] of the non-zero interaction values sorted in ascending order; $\mu_1$ corresponds to the center of the low-frequency interaction class and $\mu_2$ to the center of the high-frequency interaction class;
solving the classification function assigns genome block $j$ to the class whose center is nearest.
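Steps S1.2 to S1.4 can be sketched for a single genome block as below. This is only an illustrative sketch under stated assumptions: the function name `high_frequency_region` is hypothetical, the two initial centers are taken as the smallest and largest non-zero value (the source's exact initial positions are not available), and returning `None` when fewer than two distinct non-zero values exist is a choice made here, not something the text specifies.

```python
import numpy as np

def high_frequency_region(M, i, max_iter=100):
    """Return (l_i, r_i) for genome block i using 1-D two-class K-means on row i."""
    row = np.asarray(M[i], dtype=float).copy()
    row[i] = 0.0                      # S1.2: ignore the self-interaction
    idx = np.flatnonzero(row)         # S1.3: only blocks with non-zero interaction
    if idx.size < 2:
        return None
    vals = row[idx]
    # ASSUMPTION: initial centers are the min and max non-zero values.
    centers = np.array([vals.min(), vals.max()])
    if centers[0] == centers[1]:
        return None                   # no low/high separation possible
    for _ in range(max_iter):
        # Assign each value to the nearest center (class 0 = low, class 1 = high).
        labels = np.argmin(np.abs(vals[:, None] - centers[None, :]), axis=1)
        new_centers = np.array([vals[labels == k].mean() for k in (0, 1)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    high = idx[labels == 1]           # S1.4: blocks in the high-interaction class
    return int(high.min()), int(high.max())
```

Because the data are one-dimensional, the 2-norm in the classification function reduces to an absolute difference, which is what the sketch uses.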
The step S2 specifically includes the following steps:
S2.1. calculating the average interaction value of the submatrix that the high-frequency interaction region $(l_i, r_i)$ of genome block $i$ forms in the interaction matrix between genome blocks;
S2.2. comparing the average interaction value obtained in step S2.1 with the average interaction values of the 5 adjacent submatrices of the same window size:
if the average interaction value obtained in step S2.1 is larger than the average interaction values of the 5 adjacent submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is determined to be the quasi-nucleus of genome block $i$;
if the average interaction value obtained in step S2.1 is not larger than the average interaction values of the 5 adjacent submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is split, and the determination is repeated after splitting until the split region no longer contains genome block $i$;
the 5 adjacent submatrices of the same window size are, specifically, the 3 submatrices above the submatrix, the 1 submatrix to its right, and the 1 submatrix below it.
Splitting the high-frequency interaction region $(l_i, r_i)$ and repeating the determination until the split region no longer contains genome block $i$ specifically comprises the following steps:
first, within the high-frequency interaction region $(l_i, r_i)$, the genome block $m_i$ whose summed interaction with the other genome blocks in the region is smallest is taken as the split point, and the region $(l_i, r_i)$ is split at $m_i$ into two high-frequency interaction regions, one on each side of $m_i$;
then, a determination is made:
if $i = m_i$, it is determined that no quasi-nucleus centered on genome block $i$ exists;
if $i < m_i$, steps S2.1-S2.2 are repeated on the split region on the lower-numbered side of $m_i$, which is the high-frequency interaction region in which genome block $i$ is located, to determine the quasi-nucleus;
if $i > m_i$, steps S2.1-S2.2 are repeated on the split region on the higher-numbered side of $m_i$, which is the high-frequency interaction region in which genome block $i$ is located, to determine the quasi-nucleus.
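The split-point selection and the three-way decision above might look like the following sketch. The names `split_point` and `side_to_search` are hypothetical, regions are treated as inclusive block-number intervals, and excluding the split block $m_i$ from both sub-regions is an assumption, since the exact sub-region bounds are not given in the text.

```python
import numpy as np

def split_point(M, l, r):
    """Block m in [l, r] whose summed interaction with the other blocks in [l, r] is smallest."""
    sub = np.asarray(M, dtype=float)[l:r + 1, l:r + 1].copy()
    np.fill_diagonal(sub, 0.0)        # exclude each block's interaction with itself
    return l + int(np.argmin(sub.sum(axis=1)))

def side_to_search(i, m, l, r):
    """Sub-region that still contains block i, or None when i is the split point."""
    if i == m:
        return None                   # no quasi-nucleus centered on block i
    # ASSUMPTION: the split block m itself belongs to neither sub-region.
    return (l, m - 1) if i < m else (m + 1, r)
```

In a full implementation these two helpers would be called in a loop, re-running the S2.1-S2.2 test on the returned sub-region until a quasi-nucleus is found or `None` is returned.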
The step S3 specifically includes the following steps:
S3.1. for the quasi-nuclei identified on each chromosome, determining the relationship between two adjacent quasi-nuclei:
if the two adjacent quasi-nuclei are in an inclusion relationship, the containing quasi-nucleus is retained and the contained quasi-nucleus is filtered out;
if the two adjacent quasi-nuclei are in an overlapping relationship, a further determination is made: if the two quasi-nuclei still satisfy the definition of a quasi-nucleus after being merged, they are merged into one quasi-nucleus; otherwise, the quasi-nucleus with the larger average interaction value is retained and the other is filtered out;
S3.2. repeating step S3.1 until all quasi-nuclei on the whole chromosome have been examined and processed, finally obtaining non-overlapping quasi-nuclei.
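The relationship test in S3.1 and the handling of the overlapping case can be sketched as below. This is an illustration under assumptions: quasi-nuclei are represented as inclusive `(start, end)` block intervals, the function names are hypothetical, and the "still satisfies the definition of a quasi-nucleus" check is passed in as a caller-supplied predicate `is_quasi_nucleus` rather than implemented here.

```python
import numpy as np

def relation(a, b):
    """Classify two inclusive (start, end) intervals as disjoint, inclusion, or overlap."""
    (a0, a1), (b0, b1) = sorted([a, b])
    if b0 > a1:
        return "disjoint"
    if b1 <= a1 or a0 == b0:
        return "inclusion"            # one interval lies entirely inside the other
    return "overlap"

def mean_interaction(M, region):
    """Average interaction value of the square submatrix spanned by a region."""
    s, e = region
    return float(np.asarray(M, dtype=float)[s:e + 1, s:e + 1].mean())

def resolve_overlap(M, a, b, is_quasi_nucleus):
    """Merge two overlapping quasi-nuclei if the union qualifies, else keep the stronger one."""
    merged = (min(a[0], b[0]), max(a[1], b[1]))
    if is_quasi_nucleus(merged):
        return merged
    return max(a, b, key=lambda reg: mean_interaction(M, reg))
```

Passing the definition check as a callable keeps the sketch independent of how the S2.2 neighbor comparison is implemented.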
Step S4 specifically calculates the cosine similarity between all adjacent quasi-nuclei, merges into a new region every run of consecutive adjacent quasi-nuclei whose cosine similarity is higher than a set threshold and whose average interaction value between adjacent quasi-nuclei is greater than the mean of the non-zero interaction values over the whole chromosome, and uses the new region as a nucleus in the nucleus-attachment structure model of the chromosome TAD to be predicted.
The cosine similarity of adjacent quasi-nuclei $pc_i$ and $pc_j$ is calculated with the following formula:
$$\cos(pc_i, pc_j) = \frac{\langle \mathbf{v}_i, \mathbf{v}_j \rangle}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert}$$
where $\mathbf{v}_i$ is the feature vector of $pc_i$, consisting of its average interaction values with all other quasi-nuclei, each component $a_{k,i}$ being the average interaction value between quasi-nucleus $pc_k$ and $pc_i$; $\mathbf{v}_j$ is the feature vector of $pc_j$, consisting of its average interaction values with all other quasi-nuclei, each component $a_{k,j}$ being the average interaction value between quasi-nucleus $pc_k$ and $pc_j$; $\langle \cdot, \cdot \rangle$ is the inner product of vectors; and $\lVert \cdot \rVert$ is the vector norm.
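A minimal sketch of the cosine similarity between two quasi-nuclei follows. The function names are hypothetical, quasi-nuclei are inclusive `(start, end)` block intervals, and building both feature vectors over the quasi-nuclei other than $pc_i$ and $pc_j$ (so that the two vectors share the same index set) is an assumption; the source only says "all other quasi-nuclei".

```python
import numpy as np

def mean_interaction_between(M, a, b):
    """Average interaction value between the blocks of region a and the blocks of region b."""
    return float(M[a[0]:a[1] + 1, b[0]:b[1] + 1].mean())

def cosine_similarity(M, regions, i, j):
    """Cosine similarity of the feature vectors of quasi-nuclei i and j."""
    M = np.asarray(M, dtype=float)
    # ASSUMPTION: components run over every quasi-nucleus k other than i and j.
    others = [k for k in range(len(regions)) if k not in (i, j)]
    v_i = np.array([mean_interaction_between(M, regions[k], regions[i]) for k in others])
    v_j = np.array([mean_interaction_between(M, regions[k], regions[j]) for k in others])
    denom = np.linalg.norm(v_i) * np.linalg.norm(v_j)
    return float(v_i @ v_j / denom) if denom > 0 else 0.0
```

Returning 0.0 for a zero-norm vector is a convention chosen here to avoid division by zero; the text does not address that case.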
Step S5 specifically defines the region between two adjacent nuclei as an attachment region and determines, for each genome block in each attachment region, the adjacent TAD nucleus to which it belongs, thereby obtaining the final predicted chromosome TADs; each chromosome TAD comprises a nucleus and the attachment regions flanking the nucleus.
The step S5 specifically includes the following steps:
S5.1. among the genome blocks between two adjacent nuclei, filtering out the genome blocks whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values over the whole chromosome;
S5.2. on the basis of step S5.1, forming a submatrix from the two adjacent nuclei and the genome blocks between them, and removing the background signal; the background signal is defined as the average of the non-zero interaction values in the submatrix formed by the genome blocks between the two adjacent nuclei;
S5.3. on the basis of step S5.2, among the genome blocks between the two adjacent nuclei, filtering out the genome blocks that have no non-zero interaction value with any genome block in the genomic region […];
S5.4. on the basis of step S5.3, calculating, for each genome block between the two adjacent nuclei, the average interaction value of the submatrix […] in which that genome block is located, and taking the genome block with the smallest average interaction value as the partition point; the genome blocks upstream of the partition point are taken as attachments of the upstream nucleus, and the genome blocks downstream of the partition point are taken as attachments of the downstream nucleus, thereby obtaining the final predicted chromosome TADs.
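Step S5.2 can be illustrated with the short sketch below. It is hedged in two ways: the function name `remove_background` is hypothetical, and "removing" the background is interpreted here as subtracting it and clipping negative values to 0, which is an assumption (the text defines the background signal but not the removal operation). Nuclei are inclusive `(start, end)` block intervals.

```python
import numpy as np

def remove_background(M, left_nucleus, right_nucleus):
    """Background-corrected submatrix spanning two adjacent nuclei and the blocks between them."""
    M = np.asarray(M, dtype=float)
    start, end = left_nucleus[0], right_nucleus[1]
    gap_lo, gap_hi = left_nucleus[1] + 1, right_nucleus[0]   # blocks strictly between the nuclei
    gap = M[gap_lo:gap_hi, gap_lo:gap_hi]
    nonzero = gap[gap != 0]
    # Background = mean of non-zero interaction values among the in-between blocks.
    background = float(nonzero.mean()) if nonzero.size else 0.0
    sub = M[start:end + 1, start:end + 1]
    # ASSUMPTION: removal means subtract-and-clip, so weak signals become exactly 0.
    return np.clip(sub - background, 0.0, None)
```

Under this reading, blocks whose interactions all fall to 0 after correction are the ones that the subsequent filtering step would discard.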
The invention also provides a prediction system implementing the above method for predicting chromosome TADs, comprising a high-frequency interaction region identification module, a quasi-nucleus identification module, a quasi-nucleus processing module, a TAD nucleus identification module, and a TAD identification module connected in series in that order. The high-frequency interaction region identification module obtains the interaction matrix between genome blocks, identifies the corresponding high-frequency interaction region of each genome block with a clustering algorithm, and uploads the high-frequency interaction regions to the quasi-nucleus identification module. The quasi-nucleus identification module determines, for each genome block, whether a quasi-nucleus centered on the genome block exists in the corresponding high-frequency interaction region, and uploads the quasi-nuclei to the quasi-nucleus processing module. The quasi-nucleus processing module processes the quasi-nuclei identified on each chromosome according to the relationship between every two adjacent quasi-nuclei to obtain non-overlapping quasi-nuclei, and uploads them to the TAD nucleus identification module. The TAD nucleus identification module merges the non-overlapping quasi-nuclei on a chromosome according to the correlation among the quasi-nuclei, uses the merged quasi-nuclei as the nuclei of the chromosome TADs to be predicted, and uploads the nuclei to the TAD identification module. The TAD identification module determines the membership of each genome block in the attachment candidate regions, combines it with the received TAD nuclei to obtain the final predicted chromosome TADs, and outputs them.
The prediction method and the prediction system of the chromosome topological correlation structure domain fully utilize the global information of Hi-C data, reduce the range of candidate boundary positioning, and further reduce the occurrence of false positive results; meanwhile, the invention does not need the user to give predefined parameters, so the invention can accurately predict the topological correlation structural domain and has high reliability, good accuracy and better effect.
Drawings
FIG. 1 is a schematic process flow diagram of the process of the present invention.
FIG. 2 is a schematic flow chart of an embodiment of the method of the present invention.
FIG. 3 is a schematic diagram of the system of the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of the method of the present invention. The method for predicting chromosome TADs provided by the invention comprises the following steps:
S1. obtaining the interaction matrix between genome blocks and, for each genome block in it, identifying the corresponding high-frequency interaction region with a clustering algorithm; specifically, a whole-genome conformation capture technology and a sequencing technology are used to obtain the interaction matrix between genome blocks (Hi-C interaction matrix for short), and a K-means clustering algorithm with $k = 2$ is used for clustering to identify the corresponding high-frequency interaction region;
when the method is implemented, the method comprises the following steps:
S1.1. obtaining the interaction matrix between genome blocks using a whole-genome conformation capture technology and a sequencing technology;
S1.2. setting to 0 the interaction value of each genome block with itself, i.e. the diagonal of the interaction matrix obtained in step S1.1;
S1.3. for any genome block $i$, using a K-means clustering algorithm with $k = 2$ to cluster the other genome blocks whose interaction value with genome block $i$ is not 0; the following function is used as the classification function for the other genome blocks:
$$c_j = \arg\min_{k \in \{1, 2\}} \lVert M_{i,j} - \mu_k \rVert_2$$
where $M_{i,j}$ is the interaction value of genome block $i$ and genome block $j$; $\mu_k$ is the mean of the $k$-th center; $\arg\min$ returns the class number of the center nearest to $M_{i,j}$; and $\lVert \cdot \rVert_2$ is the 2-norm. The initial center values $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at positions […] and […] of the non-zero interaction values sorted in ascending order; $\mu_1$ corresponds to the center of the low-frequency interaction class and $\mu_2$ to the center of the high-frequency interaction class;
solving the classification function assigns genome block $j$ to the class whose center is nearest;
S1.4. for each genome block $i$, defining the corresponding high-frequency interaction region $(l_i, r_i)$, where $l_i$ is the minimum block number among the genome blocks in the high-interaction class of genome block $i$, and $r_i$ is the maximum block number among the genome blocks in the high-interaction class of genome block $i$;
S2. for each genome block, determining from the corresponding high-frequency interaction region whether a quasi-nucleus centered on the genome block exists:
if the high-frequency interaction region has a quasi-nucleus centered on the genome block, continuing with the subsequent steps;
if the high-frequency interaction region has no quasi-nucleus centered on the genome block, splitting the high-frequency interaction region and repeating the determination until the split region no longer contains the genome block;
when the method is implemented, the method comprises the following steps:
S2.1. calculating the average interaction value of the submatrix that the high-frequency interaction region $(l_i, r_i)$ of genome block $i$ forms in the interaction matrix between genome blocks;
S2.2. comparing the average interaction value obtained in step S2.1 with the average interaction values of the 5 adjacent submatrices of the same window size:
if the average interaction value obtained in step S2.1 is larger than the average interaction values of the 5 adjacent submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is determined to be the quasi-nucleus of genome block $i$;
if the average interaction value obtained in step S2.1 is not larger than the average interaction values of the 5 adjacent submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is split, and the determination is repeated after splitting until the split region no longer contains genome block $i$;
the 5 adjacent submatrices of the same window size are, specifically, the 3 submatrices above the submatrix, the 1 submatrix to its right, and the 1 submatrix below it;
splitting the high-frequency interaction region $(l_i, r_i)$ and repeating the determination until the split region no longer contains genome block $i$ specifically comprises the following steps:
first, within the high-frequency interaction region $(l_i, r_i)$, the genome block $m_i$ whose summed interaction with the other genome blocks in the region is smallest is taken as the split point, and the region $(l_i, r_i)$ is split at $m_i$ into two high-frequency interaction regions, one on each side of $m_i$;
then, a determination is made:
if $i = m_i$, it is determined that no quasi-nucleus centered on genome block $i$ exists;
if $i < m_i$, steps S2.1-S2.2 are repeated on the split region on the lower-numbered side of $m_i$, which is the high-frequency interaction region in which genome block $i$ is located, to determine the quasi-nucleus;
if $i > m_i$, steps S2.1-S2.2 are repeated on the split region on the higher-numbered side of $m_i$, which is the high-frequency interaction region in which genome block $i$ is located, to determine the quasi-nucleus;
S3. processing the quasi-nuclei identified on each chromosome according to the relationship between every two adjacent quasi-nuclei to obtain non-overlapping quasi-nuclei; specifically:
S3.1. for the quasi-nuclei identified on each chromosome, determining the relationship between two adjacent quasi-nuclei:
if the two adjacent quasi-nuclei are in an inclusion relationship, the containing quasi-nucleus is retained and the contained quasi-nucleus is filtered out;
if the two adjacent quasi-nuclei are in an overlapping relationship, a further determination is made: if the two quasi-nuclei still satisfy the definition of a quasi-nucleus after being merged, they are merged into one quasi-nucleus; otherwise, the quasi-nucleus with the larger average interaction value is retained and the other is filtered out;
S3.2. repeating step S3.1 until all quasi-nuclei on the whole chromosome have been examined and processed, finally obtaining non-overlapping quasi-nuclei;
S4. merging the non-overlapping quasi-nuclei on a chromosome according to the correlation among the quasi-nuclei, and using the merged quasi-nuclei as the nuclei of the chromosome topologically associating domains (TADs) to be predicted; specifically, the cosine similarity between all adjacent quasi-nuclei is calculated, every run of consecutive adjacent quasi-nuclei whose cosine similarity is higher than a set threshold and whose average interaction value between adjacent quasi-nuclei is greater than the mean of the non-zero interaction values over the whole chromosome is merged into a new region, and the new region is used as a nucleus in the nucleus-attachment structure model of the chromosome TAD to be predicted;
in a specific implementation, the cosine similarity of adjacent quasi-nuclei $pc_i$ and $pc_j$ is calculated with the following formula:
$$\cos(pc_i, pc_j) = \frac{\langle \mathbf{v}_i, \mathbf{v}_j \rangle}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert}$$
where $\mathbf{v}_i$ is the feature vector of $pc_i$, consisting of its average interaction values with all other quasi-nuclei, each component $a_{k,i}$ being the average interaction value between quasi-nucleus $pc_k$ and $pc_i$; $\mathbf{v}_j$ is the feature vector of $pc_j$, consisting of its average interaction values with all other quasi-nuclei, each component $a_{k,j}$ being the average interaction value between quasi-nucleus $pc_k$ and $pc_j$; $\langle \cdot, \cdot \rangle$ is the inner product of vectors; and $\lVert \cdot \rVert$ is the vector norm;
S5. determining the membership of each genome block in the attachment candidate regions, and combining it with the TAD nuclei obtained in step S4 to obtain the final predicted chromosome TADs; specifically, the region between two adjacent nuclei is defined as an attachment region, and for each genome block in each attachment region the adjacent TAD nucleus to which it belongs is determined, thereby obtaining the final predicted chromosome TADs; each chromosome TAD comprises a nucleus and the attachment regions on both sides of the nucleus;
when the method is implemented, the method comprises the following steps:
S5.1. among the genome blocks between two adjacent nuclei, filtering out the genome blocks whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values over the whole chromosome;
S5.2. on the basis of step S5.1, forming a submatrix from the two adjacent nuclei and the genome blocks between them, and removing the background signal; the background signal is defined as the average of the non-zero interaction values in the submatrix formed by the genome blocks between the two adjacent nuclei;
S5.3. on the basis of step S5.2, among the genome blocks between the two adjacent nuclei, filtering out the genome blocks that have no non-zero interaction value with any genome block in the genomic region […];
S5.4. on the basis of step S5.3, calculating, for each genome block between the two adjacent nuclei, the average interaction value of the submatrix […] in which that genome block is located, and taking the genome block with the smallest average interaction value as the partition point; the genome blocks upstream of the partition point are taken as attachments of the upstream nucleus, and the genome blocks downstream of the partition point are taken as attachments of the downstream nucleus, thereby obtaining the final predicted chromosome TADs.
The process of the invention is further illustrated below with reference to one example:
The prediction method of the chromosome topological correlation structure domain based on the nucleus-attachment structure model provided by this embodiment, as shown in FIG. 2, comprises the following steps; the Hi-C map used is the KR-normalized Hi-C interaction matrix of GM12878_combined at a resolution of 50 kb from the GSE63525 dataset, and the segment shown covers the 120th to 200th genome blocks of chromosome 1;
Step S1, using a K-means clustering method to identify the high-frequency interaction region of each genome block in the interaction matrix between genome blocks (Hi-C interaction matrix for short) obtained by a whole-genome conformation capture technology and a sequencing technology;
As shown in FIG. 2-① (the preprocessing of the Hi-C interaction matrix), the interaction value of each genome block with itself, on the diagonal of the KR-normalized Hi-C interaction matrix of GM12878_combined at a resolution of 50 kb, is set to 0;
As shown in FIG. 2-② (the process of identifying the high-frequency interaction region), for each genome block i, a K-means clustering algorithm with k = 2 is used to cluster the other genome blocks whose interaction value with i is not 0 into k = 2 classes; the classification function of the other genome blocks is:
$$c_j = \arg\min_{k \in \{1, 2\}} \lVert x_{ij} - \mu_k \rVert_2$$
where $x_{ij}$ is the interaction value of genome blocks i and j, and $\mu_k$ is the mean of the k-th center. The initial center values $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two positions of the non-zero interaction values sorted in ascending order; $\mu_1$ corresponds to the center of the low-frequency interaction class and $\mu_2$ to the center of the high-frequency interaction class. By solving the classification function, genome block j is assigned to the class whose center value is at the smallest distance;
For each genome block i, its high-frequency interaction region $(l_i, r_i)$ is defined, where $l_i$ is the minimum block number among the genome blocks in the high-frequency interaction class of genome block i, and $r_i$ is the maximum block number among them; a schematic diagram of the high-frequency interaction region is shown in FIG. 2-b;
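Step S1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `high_frequency_region` is hypothetical, and because the text does not preserve the two positions used for the initial centers, the sketch simply assumes the lower and upper quartiles of the non-zero values.

```python
import numpy as np

def high_frequency_region(matrix, i, max_iter=100):
    """Return (l_i, r_i) for genome block i, or None if no high-frequency class is found."""
    row = np.asarray(matrix, dtype=float)[i].copy()
    row[i] = 0.0                         # self-interaction on the diagonal is set to 0
    idx = np.flatnonzero(row)            # other blocks with a non-zero interaction value
    if idx.size < 2:
        return None
    values = row[idx]
    # ASSUMPTION: initial centers are the lower and upper quartiles of the values.
    low, high = np.quantile(values, [0.25, 0.75])
    for _ in range(max_iter):
        # Assign each value to the nearer of the two centers (True = high-frequency class).
        is_high = np.abs(values - high) < np.abs(values - low)
        if is_high.all() or not is_high.any():
            break
        new_low, new_high = values[~is_high].mean(), values[is_high].mean()
        if new_low == low and new_high == high:
            break                        # centers no longer move: converged
        low, high = new_low, new_high
    is_high = np.abs(values - high) < np.abs(values - low)
    members = idx[is_high]
    if members.size == 0:
        return None
    # l_i and r_i are the smallest and largest block numbers in the high-frequency class.
    return int(members.min()), int(members.max())
```

Calling `high_frequency_region(hic, i)` for every row `i` of a Hi-C matrix `hic` would give the per-block regions that the later steps operate on.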
Step S2, as shown in FIG. 2-③ (the construction process of the quasi-nuclei of TADs), for each genome block, judging and identifying from its high-frequency interaction region whether there is a quasi-nucleus centered on the genome block;
A quasi-nucleus is defined as follows: if the average interaction value of the submatrix formed in the Hi-C interaction matrix by the high-frequency interaction region $(l_i, r_i)$ of genome block i is larger than those of the 5 adjacent submatrices of the same window size, namely the three submatrices above it, the one submatrix to its right and the one submatrix below it, then the high-frequency interaction region $(l_i, r_i)$ is the quasi-nucleus of genome block i;
If the average interaction value of the submatrix formed in the Hi-C interaction matrix by the high-frequency interaction region $(l_i, r_i)$ of genome block i does not exceed those of the 5 adjacent submatrices of the same window size, that is, the definition above is not met, the high-frequency interaction region is split and the quasi-nucleus is judged and identified again; this is repeated until the split region no longer contains genome block i;
The splitting is performed as follows: within the high-frequency interaction region $(l_i, r_i)$ of genome block i, the genome block $m_i$ whose sum of interaction values with the other genome blocks in the region is smallest is taken as the split point, and the region is divided at $m_i$ into two high-frequency interaction regions;
Further, when $i = m_i$, it is determined that there is no quasi-nucleus centered on genome block i; when $i < m_i$, the judgment and identification are repeated on the resulting region on the lower-numbered side of $m_i$; when $i > m_i$, they are repeated on the resulting region on the higher-numbered side of $m_i$; the judgment and identification process is as described above;
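The quasi-nucleus test and the splitting loop of step S2 can be sketched as follows. Several details are assumptions made only for illustration: the function names are hypothetical, the five neighbouring windows are taken to be the three windows in the row above (upper-left, upper, upper-right), the window to the right and the window below, windows that fall outside the matrix are skipped, and the split point itself is excluded from both halves.

```python
import numpy as np

def window_mean(matrix, r0, c0, w):
    """Mean of the w x w window with top-left corner (r0, c0); None if out of bounds."""
    n = matrix.shape[0]
    if r0 < 0 or c0 < 0 or r0 + w > n or c0 + w > n:
        return None
    return float(matrix[r0:r0 + w, c0:c0 + w].mean())

def is_quasi_nucleus(matrix, l, r):
    """True if the diagonal window [l, r] has a larger mean than its five neighbours."""
    w = r - l + 1
    centre = window_mean(matrix, l, l, w)
    # ASSUMED neighbour layout as (row offset, column offset) in units of w.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 0)]
    means = [window_mean(matrix, l + dr * w, l + dc * w, w) for dr, dc in offsets]
    return all(centre > m for m in means if m is not None)

def find_quasi_nucleus(matrix, i, l, r):
    """Shrink the region (l, r) around block i until it is a quasi-nucleus, or give up."""
    matrix = np.asarray(matrix, dtype=float)
    while l <= i <= r:
        if is_quasi_nucleus(matrix, l, r):
            return (l, r)
        sub = matrix[l:r + 1, l:r + 1]
        # Sum of each block's interactions with the OTHER blocks inside the region.
        sums = sub.sum(axis=1) - np.diag(sub)
        m = l + int(np.argmin(sums))      # split point m_i
        if m == i:
            return None                   # no quasi-nucleus centered on block i
        if i < m:
            r = m - 1                     # keep the lower-numbered part
        else:
            l = m + 1                     # keep the higher-numbered part
    return None
```

Each pass either accepts the region or strictly shrinks it, so the loop always terminates.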
Step S3, as shown in FIG. 2-③ b, c, filtering or merging the quasi-nuclei identified on each chromosome according to the inclusion or overlapping relationship between two adjacent quasi-nuclei, to obtain non-overlapping quasi-nuclei;
When two adjacent quasi-nuclei are in an inclusion relationship, the containing quasi-nucleus is retained and the contained quasi-nucleus is filtered out;
When two adjacent quasi-nuclei are in an overlapping relationship, if the region obtained by merging the two quasi-nuclei still meets the definition of a quasi-nucleus, the two quasi-nuclei are merged into one quasi-nucleus; otherwise, only the quasi-nucleus with the higher average interaction frequency of the two is retained;
After one pair of adjacent quasi-nuclei has been processed, the search for two adjacent quasi-nuclei in an inclusion or overlapping relationship continues from the next quasi-nucleus and the same processing is applied, until no mutually overlapping quasi-nuclei remain on the whole chromosome;
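The filtering and merging rules of step S3 can be illustrated with a short interval sweep. This is only a sketch: `resolve_pair` and `resolve_overlaps` are hypothetical names, and the quasi-nucleus definition and the average interaction value are passed in as the callables `is_valid` and `mean_value` so that the example stays independent of how they are computed.

```python
def resolve_pair(a, b, mean_value, is_valid):
    """Apply the inclusion / overlap rules to two intersecting intervals (l, r)."""
    if a[0] <= b[0] and b[1] <= a[1]:
        return a                          # a contains b: keep the containing one
    if b[0] <= a[0] and a[1] <= b[1]:
        return b                          # b contains a
    merged = (min(a[0], b[0]), max(a[1], b[1]))
    if is_valid(merged):
        return merged                     # merged region still meets the definition
    # Otherwise keep the quasi-nucleus with the higher average interaction value.
    return a if mean_value(a) >= mean_value(b) else b

def resolve_overlaps(quasi_nuclei, mean_value, is_valid):
    """Return non-overlapping quasi-nuclei, sorted by start position."""
    result = []
    for cur in sorted(set(quasi_nuclei)):
        if result and cur[0] <= result[-1][1]:   # intersects the previous interval
            cur = resolve_pair(result.pop(), cur, mean_value, is_valid)
        result.append(cur)
    return result
```

Because the intervals are processed in order of their start position, every new interval only needs to be compared with the last one kept.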
Step S4, as shown in FIG. 2-④ (the process of constructing the nuclei in the nucleus-attachment structure model of TADs), merging the non-overlapping quasi-nuclei on a chromosome according to the correlation between the quasi-nuclei, and taking the merged nuclei as the nuclei of the chromosome topological correlation structure domains (TADs) to be predicted;
Cosine similarity is used to calculate the correlation of every pair of adjacent normalized quasi-nuclei $pc_i$ and $pc_j$; the calculation formula is:
$$\mathrm{sim}(pc_i, pc_j) = \frac{pc_i \cdot pc_j}{\lVert pc_i \rVert \, \lVert pc_j \rVert}$$
A correlation threshold is set, and two or more consecutive adjacent quasi-nuclei whose similarity is higher than the threshold, and whose average interaction value between adjacent quasi-nuclei is larger than the mean of the non-zero interaction values on the whole chromosome, are merged into a new region; the new region is used as a nucleus in the nucleus-attachment structure model of the TAD;
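A possible reading of step S4 is sketched below. The text does not preserve how a quasi-nucleus is turned into a vector, so the sketch assumes, purely for illustration, that each quasi-nucleus is represented by the mean interaction profile of its rows across the chromosome; the function names and the default threshold of 0.8 are likewise hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def merge_quasi_nuclei(matrix, quasi_nuclei, threshold=0.8):
    """Merge runs of adjacent, non-overlapping quasi-nuclei (l, r) into nuclei."""
    if not quasi_nuclei:
        return []
    matrix = np.asarray(matrix, dtype=float)
    background = matrix[matrix != 0].mean()   # mean of non-zero values on the chromosome
    nuclei = [list(quasi_nuclei[0])]
    for prev, cur in zip(quasi_nuclei, quasi_nuclei[1:]):
        # ASSUMPTION: a quasi-nucleus is represented by the mean of its rows.
        profile_prev = matrix[prev[0]:prev[1] + 1].mean(axis=0)
        profile_cur = matrix[cur[0]:cur[1] + 1].mean(axis=0)
        # Average interaction value between the two quasi-nuclei.
        between = matrix[prev[0]:prev[1] + 1, cur[0]:cur[1] + 1].mean()
        similar = cosine_similarity(profile_prev, profile_cur) > threshold
        if similar and between > background:
            nuclei[-1][1] = cur[1]            # extend the current nucleus
        else:
            nuclei.append(list(cur))          # start a new nucleus
    return [tuple(n) for n in nuclei]
```

Both conditions from the text must hold for a pair to be merged, so a highly similar but weakly interacting pair stays separate.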
Step S5, as shown in FIG. 2-⑤ (the process of building the complete nucleus-attachment structure model of TADs), the region between two adjacent nuclei is defined as an attachment candidate region, and it is determined to which nucleus each genome block in the attachment candidate region belongs; each finally predicted TAD is composed of one nucleus and the attachments on both sides of it. In a specific implementation, step S5 comprises the following steps:
S5.1, for the genome blocks located between two adjacent nuclei, filtering out the genome blocks whose high-frequency interaction region has an average interaction value less than the mean of the non-zero interaction values across the entire chromosome;
S5.2, on the basis of step S5.1, forming a submatrix from the two adjacent nuclei and the genome blocks between them, and removing the background signal; the background signal is defined as the average of the non-zero interaction values in the submatrix formed by the genome blocks between the two adjacent nuclei;
S5.3, on the basis of step S5.2, for the genome blocks located between the two adjacent nuclei, filtering out the genome blocks that have no non-zero interaction value with any genome block within the genomic region concerned;
S5.4, on the basis of step S5.3, calculating, for each remaining genome block between the two adjacent nuclei, the average interaction value of the submatrix in which the genome block is located; the genome block with the smallest submatrix average interaction value is taken as the partition point, the genome blocks upstream of the partition point are taken as the attachment of the upstream nucleus, and the genome blocks downstream of the partition point are taken as the attachment of the downstream nucleus; the final predicted chromosome topological correlation structure domains are thereby obtained.
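Steps S5.2 to S5.4 can be sketched as below; step S5.1 is left out because it needs the per-block high-frequency regions from step S1. The sketch rests on several assumptions that the text does not settle: the function name is hypothetical, background removal is taken to mean subtracting the background and clipping at zero, the "submatrix in which the genome block is located" is taken to be the block's row within the span of the two nuclei, and the partition block itself is assigned to the downstream nucleus.

```python
import numpy as np

def assign_attachments(matrix, nucleus_a, nucleus_b):
    """Split the blocks between two adjacent nuclei (l, r) into two attachments."""
    matrix = np.asarray(matrix, dtype=float)
    lo, hi = nucleus_a[0], nucleus_b[1]
    gap = np.arange(nucleus_a[1] + 1, nucleus_b[0])   # attachment candidate blocks
    if gap.size == 0:
        return nucleus_a, nucleus_b
    # S5.2: background = mean of non-zero values among the in-between blocks.
    between = matrix[gap[0]:gap[-1] + 1, gap[0]:gap[-1] + 1]
    nonzero = between[between != 0]
    background = nonzero.mean() if nonzero.size else 0.0
    # ASSUMPTION: removing the background means subtracting it and clipping at 0.
    sub = np.clip(matrix[lo:hi + 1, lo:hi + 1] - background, 0.0, None)
    # S5.3: drop candidate blocks with no remaining non-zero interaction.
    # S5.4: score the others by their average interaction within the span.
    scores = {int(g): float(sub[g - lo].mean()) for g in gap if sub[g - lo].any()}
    if not scores:
        return nucleus_a, nucleus_b
    cut = min(scores, key=scores.get)                 # partition point
    # ASSUMPTION: the partition block joins the downstream domain.
    return (nucleus_a[0], cut - 1), (cut, nucleus_b[1])
```

The two returned intervals are the predicted domains: each consists of a nucleus plus the attachment blocks on the side facing its neighbour.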
FIG. 3 is a schematic structural diagram of the system of the present invention. The invention also provides a prediction system for implementing the prediction method of the chromosome topological correlation structure domain, comprising a high-frequency interaction region identification module, a quasi-nucleus identification module, a quasi-nucleus processing module, a chromosome topological correlation structure domain nucleus identification module and a chromosome topological correlation structure domain identification module, which are connected in series in that order; the high-frequency interaction region identification module is used for acquiring each genome block in the interaction matrix between genome blocks, identifying the corresponding high-frequency interaction region with a clustering algorithm, and passing the obtained high-frequency interaction region to the quasi-nucleus identification module; the quasi-nucleus identification module is used for judging and identifying, for each genome block, whether a quasi-nucleus centered on the genome block exists in the corresponding high-frequency interaction region, and passing the obtained quasi-nuclei to the quasi-nucleus processing module; the quasi-nucleus processing module is used for processing the quasi-nuclei identified on each chromosome according to the relationship between every two adjacent quasi-nuclei to obtain non-overlapping quasi-nuclei, and passing the non-overlapping quasi-nuclei to the chromosome topological correlation structure domain nucleus identification module; the chromosome topological correlation structure domain nucleus identification module is used for merging the non-overlapping quasi-nuclei on a chromosome according to the correlation between the quasi-nuclei, taking the merged nuclei as the nuclei of the chromosome topological correlation structure domains to be predicted, and passing the obtained nuclei to the chromosome topological correlation structure domain identification module; and the chromosome topological correlation structure domain identification module is used for determining the nucleus to which each genome block in the attachment candidate regions belongs, combining the result with the received nuclei of the chromosome topological correlation structure domains to obtain the final predicted chromosome topological correlation structure domains, and outputting them.
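The series connection of the modules can be pictured as a simple chain in which each module consumes the output of the previous one. The sketch below is purely illustrative: `run_in_series` is a hypothetical helper, and the lambdas in the usage lines are placeholders standing in for the five modules, not their actual logic.

```python
from typing import Any, Callable, List, Tuple

Module = Callable[[Any], Any]

def run_in_series(modules: List[Tuple[str, Module]], data: Any) -> Any:
    """Pass data through each named module in order; each output feeds the next module."""
    for _name, module in modules:
        data = module(data)
    return data

# Placeholder modules: each one only records that it has run.
pipeline = [
    ("high-frequency interaction region identification", lambda d: d + ["regions"]),
    ("quasi-nucleus identification", lambda d: d + ["quasi-nuclei"]),
    ("quasi-nucleus processing", lambda d: d + ["non-overlapping quasi-nuclei"]),
    ("nucleus identification", lambda d: d + ["nuclei"]),
    ("domain identification", lambda d: d + ["domains"]),
]
stages = run_in_series(pipeline, [])
```

Keeping the modules as an ordered list makes the data flow explicit and lets a single stage be replaced without touching the others.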
Claims (10)
1. A method for predicting a chromosome topological correlation structure domain, comprising the following steps:
S1, acquiring each genome block in the interaction matrix between genome blocks, and identifying the corresponding high-frequency interaction region with a clustering algorithm;
S2, for each genome block, judging and identifying from the corresponding high-frequency interaction region whether a quasi-nucleus centered on the genome block exists:
if the high-frequency interaction region has a quasi-nucleus centered on the genome block, continuing with the subsequent steps;
if the high-frequency interaction region does not have a quasi-nucleus centered on the genome block, splitting the high-frequency interaction region and then judging and identifying the quasi-nucleus again, until the split region does not contain the genome block;
S3, processing the quasi-nuclei identified on each chromosome according to the relationship between every two adjacent quasi-nuclei to obtain non-overlapping quasi-nuclei;
S4, merging the non-overlapping quasi-nuclei on a chromosome according to the correlation between the quasi-nuclei, and taking the merged nuclei as the nuclei of the chromosome topological correlation structure domains to be predicted;
S5, determining the nucleus to which each genome block in the attachment candidate regions belongs, and combining the result with the nuclei of the chromosome topological correlation structure domains obtained in step S4 to obtain the final predicted chromosome topological correlation structure domains.
2. The method for predicting a chromosome topological correlation structure domain according to claim 1, wherein step S1 comprises obtaining each genome block in the interaction matrix between genome blocks by using a whole-genome conformation capture technology and a sequencing technology, and clustering with a K-means clustering algorithm with k = 2, so as to identify the corresponding high-frequency interaction region.
3. The method for predicting a chromosome topological correlation structure domain according to claim 2, wherein step S1 specifically comprises the following steps:
S1.1, acquiring the interaction matrix between genome blocks by using a whole-genome conformation capture technology and a sequencing technology;
S1.2, setting to 0 the interaction value of each genome block with itself, on the diagonal of the interaction matrix between genome blocks obtained in step S1.1;
S1.3, for any genome block i, using a K-means clustering algorithm with k = 2 to cluster the other genome blocks whose interaction value with genome block i is not 0; the following function is used as the classification function for the other genome blocks in step S1.3:
$$c_j = \arg\min_{k \in \{1, 2\}} \lVert x_{ij} - \mu_k \rVert_2$$
in the formula, $x_{ij}$ is the interaction value of genome block i and genome block j; $\mu_k$ is the mean of the k-th center; $\arg\min$ is the operation that returns the class number corresponding to the nearest center; $\lVert \cdot \rVert_2$ is the 2-norm; the initial center values $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two positions of the non-zero interaction values sorted in ascending order, $\mu_1$ corresponding to the center of the low-frequency interaction class and $\mu_2$ to the center of the high-frequency interaction class;
by solving the classification function, genome block j is assigned to the class whose center value is at the smallest distance;
S1.4, for each genome block i, defining the corresponding high-frequency interaction region $(l_i, r_i)$; wherein $l_i$ is the minimum block number among the genome blocks in the high-frequency interaction class of genome block i, and $r_i$ is the maximum block number among the genome blocks in the high-frequency interaction class of genome block i.
4. The method for predicting a chromosome topological correlation structure domain according to claim 3, wherein step S2 specifically comprises the following steps:
S2.1, calculating the average interaction value of the submatrix formed, in the interaction matrix between genome blocks, by the high-frequency interaction region $(l_i, r_i)$ of genome block i;
S2.2, comparing the average interaction value obtained in step S2.1 with the average interaction values of the 5 adjacent submatrices of the same window size:
if the average interaction value obtained in step S2.1 is larger than the average interaction values of the 5 adjacent submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is judged to be the quasi-nucleus of genome block i;
if the average interaction value obtained in step S2.1 is not larger than the average interaction values of the 5 adjacent submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is split; the judgment and identification are performed again after splitting, until the split region does not contain genome block i.
5. The method for predicting a chromosome topological correlation structure domain according to claim 4, wherein splitting the high-frequency interaction region $(l_i, r_i)$, and judging and identifying again after splitting until the split region does not contain genome block i, specifically comprises the following steps:
first, within the high-frequency interaction region $(l_i, r_i)$, the genome block $m_i$ whose sum of interaction values with the other genome blocks in the region is smallest is taken as the split point, and the high-frequency interaction region is divided at $m_i$ into two high-frequency interaction regions;
then, a judgment is made:
if $i = m_i$, it is determined that there is no quasi-nucleus centered on genome block i;
if $i < m_i$, steps S2.1 to S2.2 are repeated on the resulting high-frequency interaction region in which genome block i is located, to judge the quasi-nucleus;
6. The method for predicting a chromosome topological correlation structure domain according to claim 5, wherein step S3 specifically comprises the following steps:
S3.1, for the quasi-nuclei identified on each chromosome, judging the relationship between two adjacent quasi-nuclei:
if the two adjacent quasi-nuclei are in an inclusion relationship, the containing quasi-nucleus is retained and the contained quasi-nucleus is filtered out;
if the two adjacent quasi-nuclei are in an overlapping relationship, a further judgment is made: if the region obtained by merging the two quasi-nuclei still meets the definition of a quasi-nucleus, the two quasi-nuclei are merged into one quasi-nucleus; otherwise, the quasi-nucleus with the larger average interaction value of the two is retained and the other quasi-nucleus is filtered out;
S3.2, repeating step S3.1 until all the quasi-nuclei on the whole chromosome have been judged and processed, finally obtaining the non-overlapping quasi-nuclei.
7. The method for predicting a chromosome topological correlation structure domain according to claim 6, wherein step S4 specifically comprises calculating the cosine similarity between all adjacent quasi-nuclei, merging consecutive adjacent quasi-nuclei whose cosine similarity is higher than a predetermined threshold, and whose average interaction value between adjacent quasi-nuclei is greater than the mean of the non-zero interaction values on the whole chromosome, into a new region, and using the new region as a nucleus in the nucleus-attachment structure model of the chromosome topological correlation structure domain to be predicted.
8. The method for predicting a chromosome topological correlation structure domain according to claim 7, wherein step S5 specifically comprises defining the region between two adjacent nuclei as an attachment region, and determining, for each genome block in each attachment region, which of the adjacent nuclei the block belongs to, so as to obtain the final predicted chromosome topological correlation structure domains; each chromosome topological correlation structure domain comprises a nucleus and the attachment regions on both sides of the nucleus.
9. The method for predicting a chromosome topological correlation structure domain according to claim 8, wherein step S5 specifically comprises the following steps:
S5.1, for the genome blocks located between two adjacent nuclei, filtering out the genome blocks whose high-frequency interaction region has an average interaction value less than the mean of the non-zero interaction values across the entire chromosome;
S5.2, on the basis of step S5.1, forming a submatrix from the two adjacent nuclei and the genome blocks between them, and removing the background signal; the background signal is defined as the average of the non-zero interaction values in the submatrix formed by the genome blocks between the two adjacent nuclei;
S5.3, on the basis of step S5.2, for the genome blocks located between the two adjacent nuclei, filtering out the genome blocks that have no non-zero interaction value with any genome block within the genomic region concerned;
S5.4, on the basis of step S5.3, calculating, for each remaining genome block between the two adjacent nuclei, the average interaction value of the submatrix in which the genome block is located; the genome block with the smallest submatrix average interaction value is taken as the partition point, the genome blocks upstream of the partition point are taken as the attachment of the upstream nucleus, and the genome blocks downstream of the partition point are taken as the attachment of the downstream nucleus; the final predicted chromosome topological correlation structure domains are thereby obtained.
10. A prediction system for implementing the method for predicting a chromosome topological correlation structure domain according to any one of claims 1 to 9, characterized by comprising a high-frequency interaction region identification module, a quasi-nucleus identification module, a quasi-nucleus processing module, a chromosome topological correlation structure domain nucleus identification module and a chromosome topological correlation structure domain identification module, which are connected in series in that order; the high-frequency interaction region identification module is used for acquiring each genome block in the interaction matrix between genome blocks, identifying the corresponding high-frequency interaction region with a clustering algorithm, and passing the obtained high-frequency interaction region to the quasi-nucleus identification module; the quasi-nucleus identification module is used for judging and identifying, for each genome block, whether a quasi-nucleus centered on the genome block exists in the corresponding high-frequency interaction region, and passing the obtained quasi-nuclei to the quasi-nucleus processing module; the quasi-nucleus processing module is used for processing the quasi-nuclei identified on each chromosome according to the relationship between every two adjacent quasi-nuclei to obtain non-overlapping quasi-nuclei, and passing the non-overlapping quasi-nuclei to the chromosome topological correlation structure domain nucleus identification module; the chromosome topological correlation structure domain nucleus identification module is used for merging the non-overlapping quasi-nuclei on a chromosome according to the correlation between the quasi-nuclei, taking the merged nuclei as the nuclei of the chromosome topological correlation structure domains to be predicted, and passing the obtained nuclei to the chromosome topological correlation structure domain identification module; and the chromosome topological correlation structure domain identification module is used for determining the nucleus to which each genome block in the attachment candidate regions belongs, combining the result with the received nuclei of the chromosome topological correlation structure domains to obtain the final predicted chromosome topological correlation structure domains, and outputting them.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210245600.9A CN114446384A (en) | 2022-03-14 | 2022-03-14 | Prediction method and prediction system of chromosome topological correlation structure domain |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114446384A (en) | 2022-05-06 |
Family
ID=81358910
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210245600.9A Pending CN114446384A (en) | 2022-03-14 | 2022-03-14 | Prediction method and prediction system of chromosome topological correlation structure domain |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114446384A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190005191A1 (en) * | 2015-07-14 | 2019-01-03 | Whitehead Institute For Biomedical Research | Chromosome neighborhood structures and methods relating thereto |
US20190295684A1 (en) * | 2018-03-22 | 2019-09-26 | The Regents Of The University Of Michigan | Method and apparatus for analysis of chromatin interaction data |
Non-Patent Citations (1)
Title |
---|
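XU Xilun (许希伦): "Correlation analysis of chromosome interaction density and topological domains" (染色体相互作用密度与拓扑域相关分析), Computer Knowledge and Technology (电脑知识与技术), no. 03, 25 January 2020 (2020-01-25) * |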
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114944190A (en) * | 2022-05-12 | 2022-08-26 | 南开大学 | TAD (TAD-based data analysis) identification method and system based on Hi-C sequencing data |
CN114944190B (en) * | 2022-05-12 | 2024-04-19 | 南开大学 | TAD (transcription activator) identification method and system based on Hi-C sequencing data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||