CN114446384A - Prediction method and prediction system of chromosome topological correlation structure domain - Google Patents

Publication number
CN114446384A
Authority
CN
China
Prior art keywords
genome
quasi
chromosome
interaction
block
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210245600.9A
Other languages
Chinese (zh)
Inventor
彭小清
李一鸣
孔祥艳
盛羽
段桂华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Application filed by Central South University
Priority to CN202210245600.9A
Publication of CN114446384A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis

Abstract

The invention discloses a method for predicting chromosome topologically associating domains (TADs). The method comprises: identifying, for each genome block in an interaction matrix between genome blocks, the corresponding high-frequency interaction region; identifying, for each genome block, a quasi-core from its high-frequency interaction region; processing the quasi-cores identified on each chromosome to obtain non-overlapping quasi-cores; merging the non-overlapping quasi-cores on a chromosome to obtain the cores of the TADs to be predicted; and determining the membership of each genome block in the candidate attachment regions and combining it with the TAD cores to obtain the final predicted TADs. The invention also discloses a prediction system that implements this method. The invention makes full use of the global information in Hi-C data, narrows the range in which candidate boundaries are located, and requires no user-supplied predefined parameters; it can therefore predict TADs accurately and reliably.

Description

Prediction method and prediction system of chromosome topological correlation structure domain
Technical Field
The invention belongs to the field of computer technology, and in particular relates to a method and a system for predicting chromosome topologically associating domains (TADs).
Background
In recent years, the emergence of genome-wide high-throughput chromosome conformation capture technology (Hi-C) has advanced the understanding of the hierarchy of chromosome spatial structure. Researchers have converted Hi-C sequencing data from mammalian cells into Hi-C interaction matrices and visualized them, finding highly self-interacting regions at resolutions below 100 kb; such regions are called topologically associating domains (TADs). A Hi-C interaction matrix is constructed as follows: a chromosome is divided into $N$ equal-length segments, and an $N \times N$ matrix $M$ is built to characterize the interaction signal between every pair of segments. Each equal-length segment is called a genome block, and the size of a genome block is related to the resolution of the Hi-C interaction matrix. The matrix is filled by counting, over the sequenced fragment reads generated by Hi-C, the interaction frequency between each pair of the $N$ genome blocks. For example, if the two ends of a sequenced fragment align to genome block $i$ and genome block $j$ respectively, the matrix elements $M_{i,j}$ and $M_{j,i}$ are each incremented by 1.
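The matrix construction described above can be sketched as follows. This is a minimal illustration, not the inventors' code: the function name `build_interaction_matrix`, the representation of a read pair as two integer coordinates, and the choice to increment a diagonal element only once when both ends fall in the same block are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def build_interaction_matrix(
    read_pairs: Iterable[Tuple[int, int]],
    chrom_length: int,
    block_size: int,
) -> List[List[int]]:
    """Count read pairs per pair of genome blocks in a symmetric N x N matrix."""
    # Number of equal-length genome blocks (ceiling division).
    n_blocks = (chrom_length + block_size - 1) // block_size
    matrix = [[0] * n_blocks for _ in range(n_blocks)]
    for pos_a, pos_b in read_pairs:
        i = pos_a // block_size  # genome block of the first read end
        j = pos_b // block_size  # genome block of the second read end
        matrix[i][j] += 1
        # ASSUMPTION: a pair landing inside one block is counted once.
        if i != j:
            matrix[j][i] += 1
    return matrix
```

With `block_size=10` on a chromosome of length 30, a read pair at coordinates (5, 25) increments both `matrix[0][2]` and `matrix[2][0]`.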
At present, owing to the limitations of microscopy and biotechnology, researchers still cannot observe TADs directly and completely, and the mechanism of TAD formation remains unclear. Information about TADs must therefore be obtained indirectly, for example by constructing a Hi-C interaction matrix from the interactions between chromosome segments captured in Hi-C sequencing data and then applying an algorithm to predict TADs. In recent years, researchers have proposed TAD prediction methods based on machine learning; however, these methods are of limited use across cell lines, because each cell line typically requires a large amount of specific additional information from which to extract features and train the model, which places an extra burden on researchers.
Existing TAD prediction algorithms mainly predict TADs from the interaction preference at boundaries, the similarity within a TAD, the difference between TAD and non-TAD regions, the change in contact frequency density within a TAD, and similar properties. These methods either focus only on finding boundaries and thereby lose the information inside the TAD, or require user-defined parameters to control the TAD size, the clustering termination threshold, the local maximum, and so on, which makes TAD identification highly variable and subjective. Furthermore, because a TAD is a structure without a precise definition, it should not be predicted by imposing constraints on its properties.
Disclosure of Invention
The invention aims to provide a method for predicting chromosome topologically associating domains (TADs) with high reliability and good accuracy.
Another object of the invention is to provide a prediction system that implements the TAD prediction method.
The method for predicting chromosome topologically associating domains provided by the invention comprises the following steps:
S1. For each genome block in the interaction matrix between genome blocks, identifying the corresponding high-frequency interaction region with a clustering algorithm;
S2. For each genome block, determining from the corresponding high-frequency interaction region whether a quasi-core centered on the genome block exists:
if the high-frequency interaction region contains a quasi-core centered on the genome block, proceeding to the subsequent steps;
if it does not, splitting the high-frequency interaction region and repeating the determination until the split region no longer contains the genome block;
S3. Processing the quasi-cores identified on each chromosome according to the relationship between each pair of adjacent quasi-cores to obtain non-overlapping quasi-cores;
S4. Merging the non-overlapping quasi-cores on a chromosome according to the correlation between quasi-cores, and taking each merged region as the core of a TAD to be predicted;
S5. Determining the membership of each genome block in the candidate attachment regions and combining it with the TAD cores obtained in step S4 to obtain the final predicted TADs.
In step S1, the interaction matrix between genome blocks is obtained using whole-genome chromosome conformation capture and sequencing, and each genome block is clustered with the K-means algorithm with $k=2$ to identify its high-frequency interaction region.
Step S1 specifically comprises the following steps:
S1.1. Obtaining the interaction matrix between genome blocks using whole-genome chromosome conformation capture and sequencing;
S1.2. Setting to 0 the diagonal of the interaction matrix obtained in step S1.1, that is, the interaction value of each genome block with itself;
S1.3. For any genome block $i$, using the K-means algorithm with $k=2$ to cluster the other genome blocks whose interaction value with genome block $i$ is non-zero;
S1.4. For each genome block $i$, defining the corresponding high-frequency interaction region $[l_i, r_i]$, where $l_i$ is the smallest block number and $r_i$ is the largest block number among the genome blocks in the high-interaction class of genome block $i$.
In step S1.3, the other genome blocks are classified with the following function:
$$c_j = \arg\min_{k \in \{1, 2\}} \left\| M_{i,j} - \mu_k \right\|_2$$
where $M_{i,j}$ is the interaction value between genome block $i$ and genome block $j$; $\mu_k$ is the mean (center) of the $k$-th class; $\arg\min$ returns the class number of the center nearest to $M_{i,j}$; and $\|\cdot\|_2$ is the 2-norm. The initial centers $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two specified positions in the ascending-sorted non-zero interaction values (the positions are given by formula images that are not reproduced in this text), with $\mu_1$ corresponding to the center of the low-frequency interaction class and $\mu_2$ to the center of the high-frequency interaction class.
Solving the classification function assigns genome block $j$ to the class whose center is nearest to its interaction value.
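Steps S1.2 to S1.4 can be sketched for a single row of the interaction matrix as below. This is an illustration under stated assumptions rather than the patented procedure: the function name `high_frequency_region` and the iteration cap are hypothetical, and because the source gives the initial center positions only as formula images, the sketch simply starts from the smallest and largest non-zero values.

```python
from typing import Dict, List, Optional, Set, Tuple


def high_frequency_region(
    row: List[float], i: int, max_iter: int = 100
) -> Optional[Tuple[int, int]]:
    """Return (l_i, r_i) for genome block i, or None if no split is possible."""
    # S1.2 / S1.3: ignore the diagonal entry and zero interaction values.
    values: Dict[int, float] = {
        j: v for j, v in enumerate(row) if j != i and v != 0
    }
    if len(values) < 2:
        return None
    # ASSUMPTION: initial centers are the minimum and maximum non-zero values;
    # the source specifies other positions that are not recoverable here.
    mu_low, mu_high = min(values.values()), max(values.values())
    if mu_low == mu_high:
        return None
    high: Set[int] = set()
    for _ in range(max_iter):
        # Assignment step: in one dimension the 2-norm is the absolute value.
        new_high = {
            j for j, v in values.items() if abs(v - mu_high) < abs(v - mu_low)
        }
        if new_high == high:
            break
        high = new_high
        # Update step: each center becomes the mean of its class.
        high_vals = [values[j] for j in high]
        low_vals = [v for j, v in values.items() if j not in high]
        mu_high = sum(high_vals) / len(high_vals)
        mu_low = sum(low_vals) / len(low_vals)
    # S1.4: the region spans the smallest and largest block in the high class.
    return min(high), max(high)
```

For the row `[0, 1, 9, 0, 10, 8, 1, 0]` with `i = 3`, the high class is blocks 2, 4 and 5, so the region is `(2, 5)`.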
Step S2 specifically comprises the following steps:
S2.1. Calculating the average interaction value of the submatrix that the high-frequency interaction region $[l_i, r_i]$ of genome block $i$ forms in the interaction matrix between genome blocks, that is, rows and columns $l_i$ through $r_i$;
S2.2. Comparing the average interaction value obtained in step S2.1 with the average interaction values of the 5 adjacent submatrices of the same window size:
if the average interaction value obtained in step S2.1 is larger than the average interaction values of the 5 adjacent submatrices of the same window size, the high-frequency interaction region $[l_i, r_i]$ is determined to be the quasi-core of genome block $i$;
otherwise, the high-frequency interaction region $[l_i, r_i]$ is split and the determination is repeated on the split region, stopping when the split region no longer contains genome block $i$;
the 5 adjacent submatrices of the same window size are the 3 submatrices above the submatrix, the 1 submatrix to its right, and the 1 submatrix below it (their index ranges are given by formula images that are not reproduced in this text).
The splitting of the high-frequency interaction region $[l_i, r_i]$, repeated until the split region no longer contains genome block $i$, proceeds as follows:
first, the genome block $m_i$ in $[l_i, r_i]$ whose sum of interactions with the other genome blocks in $[l_i, r_i]$ is smallest is taken as the split point, dividing $[l_i, r_i]$ into a sub-region on the left of $m_i$ and a sub-region on the right of $m_i$;
then the following determination is made:
if $i = m_i$, it is determined that no quasi-core centered on genome block $i$ exists;
if $i < m_i$, the left sub-region is taken as the high-frequency interaction region containing the genome block, and steps S2.1 to S2.2 are repeated on it to determine the quasi-core;
if $i > m_i$, the right sub-region is taken as the high-frequency interaction region containing the genome block, and steps S2.1 to S2.2 are repeated on it to determine the quasi-core.
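The split-and-retest loop of step S2 can be sketched as follows. The names `mean_interaction`, `split_point` and `find_quasi_core` are hypothetical. Two points are assumptions: the quasi-core test is passed in as a caller-supplied predicate (`is_quasi_core`) because the positions of the 5 neighbouring windows are not recoverable from this text, and the split point itself is excluded from both sub-regions.

```python
from typing import Callable, List, Optional, Tuple

Region = Tuple[int, int]  # inclusive block range (l, r)
Matrix = List[List[float]]


def mean_interaction(matrix: Matrix, l: int, r: int) -> float:
    """Average interaction value of the square submatrix over blocks l..r."""
    size = r - l + 1
    total = sum(matrix[a][b] for a in range(l, r + 1) for b in range(l, r + 1))
    return total / (size * size)


def split_point(matrix: Matrix, l: int, r: int) -> int:
    """Block in [l, r] with the smallest interaction sum with the other blocks."""
    def interaction_sum(m: int) -> float:
        return sum(matrix[m][b] for b in range(l, r + 1) if b != m)
    return min(range(l, r + 1), key=interaction_sum)


def find_quasi_core(
    matrix: Matrix,
    i: int,
    region: Region,
    is_quasi_core: Callable[[Matrix, int, int], bool],
) -> Optional[Region]:
    """Shrink the region around block i until it passes the test or i is lost."""
    l, r = region
    while l <= i <= r:
        if is_quasi_core(matrix, l, r):
            return (l, r)
        m = split_point(matrix, l, r)
        if m == i:
            return None  # no quasi-core centered on block i
        # ASSUMPTION: the split point is excluded from both sub-regions.
        if i < m:
            r = m - 1
        else:
            l = m + 1
    return None
```

Each pass removes at least the split point from the region, so the loop always terminates.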
Step S3 specifically comprises the following steps:
S3.1. For the quasi-cores identified on each chromosome, determining the relationship between two adjacent quasi-cores:
if one of the two adjacent quasi-cores contains the other, only one of the two is retained and the other is filtered out;
if the two adjacent quasi-cores overlap, a further determination is made: if the region obtained by merging the two quasi-cores still satisfies the definition of a quasi-core, the two are merged into one quasi-core; otherwise, the quasi-core with the larger average interaction value is retained and the other is filtered out;
S3.2. Repeating step S3.1 until all quasi-cores on the whole chromosome have been processed, finally obtaining the non-overlapping quasi-cores.
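The containment and overlap handling of step S3 can be sketched as a single left-to-right pass over sorted regions. This is only an illustration: `resolve_quasi_cores` is a hypothetical name, the quasi-core test and the average interaction value are supplied by the caller, and the sketch assumes that in a containment relationship the containing (larger) quasi-core is the one retained, which the text does not state unambiguously.

```python
from typing import Callable, Iterable, List, Tuple

Region = Tuple[int, int]  # inclusive block range (l, r)


def resolve_quasi_cores(
    quasi_cores: Iterable[Region],
    mean_value: Callable[[Region], float],
    is_quasi_core: Callable[[Region], bool],
) -> List[Region]:
    """Reduce quasi-cores to a sorted list of non-overlapping regions."""
    result: List[Region] = []
    # Sorting by (start, end) guarantees prev[0] <= cur[0] below.
    for cur in sorted(set(quasi_cores)):
        if not result or cur[0] > result[-1][1]:
            result.append(cur)  # disjoint from the previous region
            continue
        prev = result[-1]
        if cur[1] <= prev[1]:
            # prev contains cur. ASSUMPTION: keep the containing region.
            continue
        if cur[0] == prev[0]:
            # cur contains prev (same start, later end): keep cur.
            result[-1] = cur
            continue
        # Partial overlap: merge if the union still qualifies as a quasi-core,
        # otherwise keep the region with the larger average interaction value.
        merged = (prev[0], cur[1])
        if is_quasi_core(merged):
            result[-1] = merged
        elif mean_value(cur) > mean_value(prev):
            result[-1] = cur
    return result
```

Because every replacement keeps a start coordinate no smaller than the previous region's start, the list stays non-overlapping after each step.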
In step S4, the cosine similarity between all adjacent quasi-cores is calculated; consecutive adjacent quasi-cores whose cosine similarity is higher than a set threshold and whose average interaction value between adjacent quasi-cores is greater than the mean of the non-zero interaction values on the whole chromosome are merged into a new region, and this region is taken as the core in the core-attachment structure model of the TAD to be predicted.
The cosine similarity of adjacent quasi-cores $pc_i$ and $pc_j$ is calculated with the following formula:
$$\cos(V_i, V_j) = \frac{V_i \cdot V_j}{\|V_i\| \, \|V_j\|}$$
where $V_i = (a_{1,i}, a_{2,i}, \dots, a_{n,i})$ is the feature vector of $pc_i$, consisting of its average interaction values with all other quasi-cores, and $a_{k,i}$ is the average interaction value between quasi-core $pc_k$ and $pc_i$; $V_j = (a_{1,j}, a_{2,j}, \dots, a_{n,j})$ is the feature vector of $pc_j$, defined in the same way, and $a_{k,j}$ is the average interaction value between quasi-core $pc_k$ and $pc_j$; $V_i \cdot V_j$ is the inner product of the vectors; and $\|\cdot\|$ is the modulus of a vector.
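The cosine similarity itself is a standard computation and can be sketched as below. The function name `cosine_similarity` is hypothetical, the feature vectors are assumed to be already built as equal-length lists of average interaction values, and returning 0.0 for a zero vector is a convention chosen for this example rather than something the source specifies.

```python
import math
from typing import Sequence


def cosine_similarity(v_i: Sequence[float], v_j: Sequence[float]) -> float:
    """Inner product of two vectors divided by the product of their moduli."""
    if len(v_i) != len(v_j):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm_i = math.sqrt(sum(a * a for a in v_i))
    norm_j = math.sqrt(sum(b * b for b in v_j))
    # ASSUMPTION: a zero vector is treated as having similarity 0.
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)
```

Two quasi-cores whose interaction profiles differ only by a scale factor have similarity 1, while profiles with no shared non-zero positions have similarity 0.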
In step S5, the region between two cores is defined as an attachment region, and for each genome block in each attachment region the adjacent TAD core to which it belongs is determined, thereby obtaining the final predicted TADs; each TAD comprises a core and the attachment regions on both sides of the core.
Step S5 specifically comprises the following steps:
S5.1. For the genome blocks between two adjacent cores, filtering out the genome blocks whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values on the whole chromosome;
S5.2. On the basis of step S5.1, forming a submatrix from the two adjacent cores and the genome blocks between them, and removing the background signal; the background signal is defined as the average of the non-zero interaction values in the submatrix formed by the genome blocks between the two adjacent cores;
S5.3. On the basis of step S5.2, for the genome blocks between the two adjacent cores, filtering out the genome blocks that have no non-zero interaction value with any genome block in the corresponding genomic region (the region is defined by a formula image that is not reproduced in this text);
S5.4. On the basis of step S5.3, calculating the average interaction value of the submatrix in which each genome block between the two adjacent cores is located, and taking the genome block whose submatrix has the smallest average interaction value as the split point; the genome blocks upstream of the split point are taken as the attachment of the first core and the genome blocks downstream of the split point as the attachment of the second core, thereby obtaining the final predicted TADs.
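The split-point assignment of step S5.4 can be sketched as below. This is a hedged illustration: `assign_attachments` is a hypothetical name, the per-block submatrix average is supplied by the caller because the submatrix definition is not recoverable from this text, and placing the split point itself on the downstream side is an assumption.

```python
from typing import Callable, List, Sequence, Tuple


def assign_attachments(
    gap_blocks: Sequence[int],
    submatrix_mean: Callable[[int], float],
) -> Tuple[List[int], List[int]]:
    """Split the blocks between two adjacent cores into two attachments."""
    if not gap_blocks:
        return [], []
    blocks = sorted(gap_blocks)
    # The block with the smallest average interaction value is the split point.
    split = min(blocks, key=submatrix_mean)
    upstream = [b for b in blocks if b < split]
    # ASSUMPTION: the split point itself is assigned to the downstream core.
    downstream = [b for b in blocks if b >= split]
    return upstream, downstream
```

The first returned list would be attached to the first (upstream) core and the second list to the second (downstream) core.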
The invention also provides a prediction system that implements the above TAD prediction method. The system comprises a high-frequency interaction region identification module, a quasi-core identification module, a quasi-core processing module, a TAD core identification module, and a TAD identification module, connected in series in that order. The high-frequency interaction region identification module identifies, for each genome block in the interaction matrix between genome blocks, the corresponding high-frequency interaction region with a clustering algorithm, and passes the regions to the quasi-core identification module. The quasi-core identification module determines, for each genome block, whether a quasi-core centered on that genome block exists in the corresponding high-frequency interaction region, and passes the identified quasi-cores to the quasi-core processing module. The quasi-core processing module processes the quasi-cores identified on each chromosome according to the relationship between each pair of adjacent quasi-cores to obtain non-overlapping quasi-cores, and passes them to the TAD core identification module. The TAD core identification module merges the non-overlapping quasi-cores on a chromosome according to the correlation between quasi-cores, takes each merged region as the core of a TAD to be predicted, and passes the cores to the TAD identification module. The TAD identification module determines the membership of each genome block in the candidate attachment regions and combines it with the received TAD cores to obtain and output the final predicted TADs.
The prediction method and prediction system make full use of the global information in Hi-C data and narrow the range in which candidate boundaries are located, which reduces false-positive results; in addition, the invention requires no user-supplied predefined parameters. It can therefore predict TADs accurately and reliably.
Drawings
FIG. 1 is a schematic process flow diagram of the process of the present invention.
FIG. 2 is a schematic flow chart of an embodiment of the method of the present invention.
FIG. 3 is a schematic diagram of the system of the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of the method of the invention. The method for predicting chromosome topologically associating domains (TADs) provided by the invention comprises the following steps:
S1. For each genome block in the interaction matrix between genome blocks, identifying the corresponding high-frequency interaction region with a clustering algorithm; specifically, the interaction matrix between genome blocks (the Hi-C interaction matrix for short) is obtained using whole-genome chromosome conformation capture and sequencing, and each genome block is clustered with the K-means algorithm with $k=2$ to identify its high-frequency interaction region;
In a specific implementation, step S1 comprises the following steps:
S1.1. Obtaining the interaction matrix between genome blocks using whole-genome chromosome conformation capture and sequencing;
S1.2. Setting to 0 the diagonal of the interaction matrix obtained in step S1.1, that is, the interaction value of each genome block with itself;
S1.3. For any genome block $i$, using the K-means algorithm with $k=2$ to cluster the other genome blocks whose interaction value with genome block $i$ is non-zero; the other genome blocks are classified with the following function:
$$c_j = \arg\min_{k \in \{1, 2\}} \left\| M_{i,j} - \mu_k \right\|_2$$
where $M_{i,j}$ is the interaction value between genome block $i$ and genome block $j$; $\mu_k$ is the mean (center) of the $k$-th class; $\arg\min$ returns the class number of the center nearest to $M_{i,j}$; and $\|\cdot\|_2$ is the 2-norm. The initial centers $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at two specified positions in the ascending-sorted non-zero interaction values (the positions are given by formula images that are not reproduced in this text), with $\mu_1$ corresponding to the center of the low-frequency interaction class and $\mu_2$ to the center of the high-frequency interaction class;
solving the classification function assigns genome block $j$ to the class whose center is nearest to its interaction value;
S1.4. For each genome block $i$, defining the corresponding high-frequency interaction region $[l_i, r_i]$, where $l_i$ is the smallest block number and $r_i$ is the largest block number among the genome blocks in the high-interaction class of genome block $i$;
S2. For each genome block, determining from the corresponding high-frequency interaction region whether a quasi-core centered on the genome block exists:
if the high-frequency interaction region contains a quasi-core centered on the genome block, proceeding to the subsequent steps;
if it does not, splitting the high-frequency interaction region and repeating the determination until the split region no longer contains the genome block;
In a specific implementation, step S2 comprises the following steps:
S2.1. Calculating the average interaction value of the submatrix that the high-frequency interaction region $[l_i, r_i]$ of genome block $i$ forms in the interaction matrix between genome blocks, that is, rows and columns $l_i$ through $r_i$;
S2.2. Comparing the average interaction value obtained in step S2.1 with the average interaction values of the 5 adjacent submatrices of the same window size:
if the average interaction value obtained in step S2.1 is larger than the average interaction values of the 5 adjacent submatrices of the same window size, the high-frequency interaction region $[l_i, r_i]$ is determined to be the quasi-core of genome block $i$;
otherwise, the high-frequency interaction region $[l_i, r_i]$ is split and the determination is repeated on the split region, stopping when the split region no longer contains genome block $i$;
the 5 adjacent submatrices of the same window size are the 3 submatrices above the submatrix, the 1 submatrix to its right, and the 1 submatrix below it (their index ranges are given by formula images that are not reproduced in this text).
The splitting of the high-frequency interaction region $[l_i, r_i]$, repeated until the split region no longer contains genome block $i$, proceeds as follows:
first, the genome block $m_i$ in $[l_i, r_i]$ whose sum of interactions with the other genome blocks in $[l_i, r_i]$ is smallest is taken as the split point, dividing $[l_i, r_i]$ into a sub-region on the left of $m_i$ and a sub-region on the right of $m_i$;
then the following determination is made:
if $i = m_i$, it is determined that no quasi-core centered on genome block $i$ exists;
if $i < m_i$, the left sub-region is taken as the high-frequency interaction region containing the genome block, and steps S2.1 to S2.2 are repeated on it to determine the quasi-core;
if $i > m_i$, the right sub-region is taken as the high-frequency interaction region containing the genome block, and steps S2.1 to S2.2 are repeated on it to determine the quasi-core;
S3. Processing the quasi-cores identified on each chromosome according to the relationship between each pair of adjacent quasi-cores to obtain non-overlapping quasi-cores; this specifically comprises the following steps:
S3.1. For the quasi-cores identified on each chromosome, determining the relationship between two adjacent quasi-cores:
if one of the two adjacent quasi-cores contains the other, only one of the two is retained and the other is filtered out;
if the two adjacent quasi-cores overlap, a further determination is made: if the region obtained by merging the two quasi-cores still satisfies the definition of a quasi-core, the two are merged into one quasi-core; otherwise, the quasi-core with the larger average interaction value is retained and the other is filtered out;
S3.2. Repeating step S3.1 until all quasi-cores on the whole chromosome have been processed, finally obtaining the non-overlapping quasi-cores;
s4, merging the non-overlapping quasi-nuclei on a chromosome according to the correlation among the quasi-nuclei, and using the merged nuclei as the nuclei of a Topological Associated Domain (TAD) of the chromosome to be predicted; calculating cosine similarity between all adjacent quasi-kernels, combining a plurality of continuous adjacent quasi-kernels of which the cosine similarity is higher than a set threshold and the average interaction value between the adjacent quasi-kernels is larger than the mean value of non-zero interaction values on the whole chromosome into a new region, and taking the region as a kernel in a kernel-attachment structure model of a chromosome topology association structure domain to be predicted;
in a specific implementation, the cosine similarity of adjacent quasi-cores $pc_i$ and $pc_j$ is calculated by the following formula:

$$\cos(pc_i, pc_j) = \frac{\langle V_i, V_j \rangle}{\lVert V_i \rVert \, \lVert V_j \rVert}$$

where $V_i = (a_{i1}, a_{i2}, \ldots, a_{in})$ is the feature vector of $pc_i$, composed of its average interaction values with all other quasi-cores, and $a_{ik}$ is the average interaction value between quasi-cores $pc_k$ and $pc_i$; $V_j = (a_{j1}, a_{j2}, \ldots, a_{jn})$ is the feature vector of $pc_j$, composed of its average interaction values with all other quasi-cores, and $a_{jk}$ is the average interaction value between quasi-cores $pc_k$ and $pc_j$; $\langle \cdot, \cdot \rangle$ is the inner product of two vectors; $\lVert \cdot \rVert$ is the modulus of a vector;
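As an illustration of the similarity calculation in step S4, the following Python sketch builds the feature vector of a quasi-core and evaluates the cosine similarity of two such vectors. The function names and the inclusive `(left, right)` representation of a quasi-core are assumptions made for the example. So that both vectors share the same index set, the sketch indexes the feature vector over all quasi-cores, including the quasi-core itself; that choice, and returning 0 for an all-zero vector, are assumptions rather than statements of the method.

```python
import math
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive (left, right) block numbers


def mean_block_interaction(matrix: Sequence[Sequence[float]],
                           a: Interval, b: Interval) -> float:
    """Average interaction value between the blocks of a and the blocks of b."""
    values = [matrix[r][c]
              for r in range(a[0], a[1] + 1)
              for c in range(b[0], b[1] + 1)]
    return sum(values) / len(values)


def feature_vector(matrix: Sequence[Sequence[float]],
                   quasi_cores: List[Interval], i: int) -> List[float]:
    """Average interaction of quasi-core i with every quasi-core (assumed index set)."""
    return [mean_block_interaction(matrix, quasi_cores[i], pc) for pc in quasi_cores]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of u and v divided by the product of their moduli."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is treated as uncorrelated
    return dot / (norm_u * norm_v)
```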
S5, determining which core each genome block in the attachment candidate regions belongs to, and combining the result with the TAD cores obtained in step S4 to obtain the finally predicted TADs; specifically, the region between two adjacent cores is defined as an attachment candidate region, and for each genome block in each attachment candidate region it is determined to which adjacent TAD core the block belongs, thereby obtaining the finally predicted TADs; each TAD comprises one core and the attachment regions on both sides of the core;
in a specific implementation, this comprises the following steps:
S5.1, for the genome blocks $b$ between two adjacent cores $core_k$ and $core_{k+1}$, filtering out the genome blocks whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values on the whole chromosome;
S5.2, on the basis of step S5.1, forming a submatrix from the two adjacent cores $core_k$ and $core_{k+1}$ and the genome blocks $b$ between them, and removing the background signal; the background signal is defined as the mean of the non-zero interaction values in the submatrix formed by the genome blocks between the two adjacent cores;
S5.3, on the basis of step S5.2, for the genome blocks $b$ between the two adjacent cores $core_k$ and $core_{k+1}$, filtering out the genome blocks that have no non-zero interaction value with any genome block in the genomic region $R$ under consideration;
S5.4, on the basis of step S5.3, calculating, for each genome block $b$ between the two adjacent cores $core_k$ and $core_{k+1}$, the average interaction value of the submatrix $M_b$ in which the block is located; the genome block with the smallest submatrix average interaction value is taken as the partition point, the genome blocks upstream of the partition point are taken as attachments of $core_k$, and the genome blocks downstream of the partition point are taken as attachments of $core_{k+1}$; the finally predicted TADs are thereby obtained.
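The partition of step S5.4 can be illustrated with the short Python sketch below. It covers only the choice of the partition point and omits the filtering of steps S5.1 to S5.3. Because the exact submatrix associated with each genome block is not reproduced here, the sketch assumes that a block is scored by its mean interaction with every other block in the region spanning the two cores; that scoring rule and the name `assign_attachments` are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive (left, right) block numbers of a core


def assign_attachments(matrix: Sequence[Sequence[float]],
                       left_core: Interval,
                       right_core: Interval) -> Tuple[List[int], List[int]]:
    """Split the blocks between two adjacent cores into two attachment lists."""
    gap = list(range(left_core[1] + 1, right_core[0]))
    if not gap:
        return [], []
    lo, hi = left_core[0], right_core[1]

    def score(b: int) -> float:
        # ASSUMPTION: mean interaction of block b with all other blocks in [lo, hi].
        values = [matrix[b][c] for c in range(lo, hi + 1) if c != b]
        return sum(values) / len(values)

    split = min(gap, key=score)  # block with the smallest average interaction
    upstream = [b for b in gap if b < split]    # attachments of the left core
    downstream = [b for b in gap if b > split]  # attachments of the right core
    return upstream, downstream
```

In this sketch the partition block itself is assigned to neither core and acts as the boundary between the two predicted domains.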
The method of the invention is further illustrated below with reference to an example:
as shown in FIG. 2, the TAD prediction method based on the core-attachment structure model provided by this embodiment comprises the following steps; the Hi-C map shown is the KR-normalized Hi-C interaction matrix of GM12878_combined at a resolution of 50 kb from the GSE63525 dataset, and the segment shown covers genome blocks 120 to 200 of chromosome 1;
Step S1, identifying, by a K-means clustering method, the high-frequency interaction region of each genome block in the interaction matrix between genome blocks obtained by a whole-genome conformation capture technique and a sequencing technique (the Hi-C interaction matrix for short);
as shown in FIG. 2-① (FIG. 2-① is the preprocessing of the Hi-C interaction matrix), the interaction value of each genome block with itself, on the diagonal of the KR-normalized Hi-C interaction matrix of GM12878_combined at a resolution of 50 kb, is set to 0;
as shown in FIG. 2-② (FIG. 2-② is the process of identifying the high-frequency interaction region), for each genome block $i$, a K-means clustering algorithm with $k = 2$ is used to cluster the other genome blocks whose interaction value with block $i$ is not 0 into two classes; the classification function for the other genome blocks is:

$$c(j) = \underset{k \in \{1, 2\}}{\arg\min} \, \lVert x_{ij} - \mu_k \rVert_2$$

where $x_{ij}$ is the interaction value between genome blocks $i$ and $j$, and $\mu_k$ is the mean (center) of the $k$-th class. The initial center values $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at positions $p_1$ and $p_2$ of the non-zero interaction values sorted in ascending order; $\mu_1$ corresponds to the center of the low-frequency interaction class and $\mu_2$ to the center of the high-frequency interaction class; by evaluating the classification function, genome block $j$ is assigned to the class whose center is nearest to its interaction value;
for each genome block $i$, its high-frequency interaction region $(l_i, r_i)$ is defined, where $l_i$ is the minimum block number of the genome blocks in the high-frequency interaction class of genome block $i$, and $r_i$ is the maximum block number of the genome blocks in that class; a schematic diagram of the high-frequency interaction region is shown in FIG. 2-b;
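The identification of the high-frequency interaction region in step S1 can be sketched in Python as below. This is a plain one-dimensional two-class K-means written for illustration. The initial centers are taken here as the smallest and largest non-zero interaction values, which is an assumption, since the positions used for initialization are not reproduced in this text. The function name `high_frequency_region` and the iteration cap `max_iter` are likewise hypothetical.

```python
from typing import Optional, Sequence, Tuple


def high_frequency_region(row: Sequence[float], i: int,
                          max_iter: int = 100) -> Optional[Tuple[int, int]]:
    """Return (l_i, r_i) for genome block i, or None if no high class exists.

    `row` is row i of the interaction matrix; the self-interaction is ignored.
    """
    # Other genome blocks whose interaction value with block i is not 0.
    items = [(j, x) for j, x in enumerate(row) if j != i and x != 0]
    if not items:
        return None
    values = sorted(x for _, x in items)
    low, high = values[0], values[-1]  # ASSUMPTION: initial centers = min and max
    for _ in range(max_iter):
        # Assign each value to the nearest center (ties go to the low class).
        high_vals = [x for _, x in items if abs(x - high) < abs(x - low)]
        low_vals = [x for _, x in items if abs(x - high) >= abs(x - low)]
        if not high_vals or not low_vals:
            break
        new_low = sum(low_vals) / len(low_vals)
        new_high = sum(high_vals) / len(high_vals)
        if (new_low, new_high) == (low, high):
            break  # centers no longer move
        low, high = new_low, new_high
    members = [j for j, x in items if abs(x - high) < abs(x - low)]
    # l_i and r_i are the smallest and largest block numbers in the high class.
    return (min(members), max(members)) if members else None
```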
Step S2, as shown in FIG. 2-③a (FIG. 2-③ is the construction process of TAD quasi-cores), for each genome block, judging and identifying from its high-frequency interaction region whether a quasi-core centered on the genome block exists;
a quasi-core is defined as follows: if the average interaction value of the submatrix $M_{(l_i, r_i)}$ formed in the Hi-C interaction matrix by the high-frequency interaction region $(l_i, r_i)$ of genome block $i$ is larger than that of each of the 5 adjacent submatrices of the same window size, namely the three submatrices above it, the one submatrix to its right and the one submatrix below it, then the high-frequency interaction region $(l_i, r_i)$ is the quasi-core of genome block $i$;
if the average interaction value of the submatrix $M_{(l_i, r_i)}$ formed in the Hi-C interaction matrix by the high-frequency interaction region $(l_i, r_i)$ of genome block $i$ is not larger than that of the 5 adjacent submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is split and the quasi-core is judged and identified again, until the split region no longer contains genome block $i$;
when splitting the high-frequency interaction region $(l_i, r_i)$ of genome block $i$: first, the genome block $m_i$ in the region whose sum of interactions with the other genome blocks in the region is smallest is taken as the split point, and the region is divided into two high-frequency interaction regions, one to the left of $m_i$ and one to the right of $m_i$; then, when $i = m_i$, it is determined that no quasi-core centered on genome block $i$ exists; when $i < m_i$, the judgment and identification are repeated on the left high-frequency interaction region; when $i > m_i$, the judgment and identification are repeated on the right high-frequency interaction region; the judgment and identification proceed as described above;
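The quasi-core test and the splitting loop of step S2 can be sketched as follows. The sketch is illustrative only and rests on several assumptions that are labeled in the code: the exact positions of the five neighboring windows are not reproduced in this text, so `NEIGHBOUR_SHIFTS` is a hypothetical choice (three windows above, one to the right, one below along the diagonal); windows are clipped at the matrix border; and the split block $m_i$ is excluded from both sub-regions. All function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]

# ASSUMPTION: (row shift, column shift) of the 5 neighboring windows, in units
# of the window size; the original layout is not reproduced in the text.
NEIGHBOUR_SHIFTS = ((-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1))


def window_mean(m: Matrix, r0: int, c0: int, w: int) -> Optional[float]:
    """Mean of the w-by-w window starting at (r0, c0), clipped to the matrix."""
    n = len(m)
    vals = [m[r][c]
            for r in range(max(r0, 0), min(r0 + w, n))
            for c in range(max(c0, 0), min(c0 + w, n))]
    return sum(vals) / len(vals) if vals else None


def is_quasi_core(m: Matrix, l: int, r: int) -> bool:
    """True if the window on (l, r) has a larger mean than every neighbor window."""
    w = r - l + 1
    centre = window_mean(m, l, l, w)
    for dr, dc in NEIGHBOUR_SHIFTS:
        nb = window_mean(m, l + dr * w, l + dc * w, w)
        if nb is not None and centre <= nb:
            return False
    return True


def find_quasi_core(m: Matrix, i: int, l: int, r: int) -> Optional[Tuple[int, int]]:
    """Shrink the region (l, r) around block i until it is a quasi-core."""
    while l <= i <= r:
        if is_quasi_core(m, l, r):
            return (l, r)
        # Split point m_i: block with the smallest interaction sum in the region.
        sums = {b: sum(m[b][c] for c in range(l, r + 1) if c != b)
                for b in range(l, r + 1)}
        split = min(sums, key=sums.get)
        if split == i:
            return None   # no quasi-core centered on block i
        if i < split:
            r = split - 1  # continue with the left sub-region
        else:
            l = split + 1  # continue with the right sub-region
    return None
```

Each pass either accepts the current region or removes at least one block from it, so the loop ends after at most as many passes as the region has blocks.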
Step S3, as shown in FIG. 2-③ b, c, filtering or merging the quasi-cores identified on each chromosome according to the containment or overlapping relationship between two adjacent quasi-cores, to obtain non-overlapping quasi-cores;
when two adjacent quasi-cores are in a containment relationship, the containing quasi-core is retained and the contained quasi-core is filtered out;
when two adjacent quasi-cores are in an overlapping relationship, if the region obtained by merging the two quasi-cores still satisfies the definition of a quasi-core, the two quasi-cores are merged into one quasi-core; otherwise, only the quasi-core with the higher average interaction frequency of the two is retained;
after one pair of adjacent quasi-cores has been processed, the search continues from the next quasi-core for two adjacent quasi-cores that are in a containment or overlapping relationship, and the same processing is applied, until no mutually overlapping quasi-cores remain on the whole chromosome;
Step S4, as shown in FIG. 2-④ (FIG. 2-④ is the process of constructing the cores in the core-attachment structure model of TADs), merging the non-overlapping quasi-cores on a chromosome according to the correlation between quasi-cores, and taking the merged regions as the cores of the chromosome topologically associating domains (TADs) to be predicted;
the correlation of every pair of adjacent quasi-cores $pc_i$ and $pc_j$ is computed using cosine similarity, with the following formula:

$$\cos(pc_i, pc_j) = \frac{\langle V_i, V_j \rangle}{\lVert V_i \rVert \, \lVert V_j \rVert}$$

a correlation threshold is set; two or more consecutive adjacent quasi-cores whose similarity is higher than the threshold, and whose average interaction value between adjacent quasi-cores is greater than the mean of the non-zero interaction values on the whole chromosome, are merged into a new region, and the new region is taken as a core in the core-attachment structure model of the TAD;
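The merging rule of step S4 can be illustrated with the following Python sketch. It assumes that the cosine similarity and the average interaction value of each adjacent pair have already been computed and are passed in as lists, where entry `k` describes the pair formed by quasi-core `k` and quasi-core `k + 1`. The function name `merge_into_cores`, its parameters and the numeric values in the usage note are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive (left, right) block numbers


def merge_into_cores(quasi_cores: Sequence[Interval],
                     similarity: Sequence[float],
                     inter_mean: Sequence[float],
                     sim_threshold: float,
                     chrom_mean: float) -> List[Interval]:
    """Merge runs of consecutive adjacent quasi-cores into cores.

    similarity[k] and inter_mean[k] refer to the pair
    (quasi_cores[k], quasi_cores[k + 1]); quasi_cores must be sorted.
    """
    if not quasi_cores:
        return []
    cores: List[Interval] = [quasi_cores[0]]
    for k in range(1, len(quasi_cores)):
        # Both conditions of step S4 must hold for the pair to be linked.
        linked = (similarity[k - 1] > sim_threshold
                  and inter_mean[k - 1] > chrom_mean)
        if linked:
            cores[-1] = (cores[-1][0], quasi_cores[k][1])  # extend current core
        else:
            cores.append(quasi_cores[k])                   # start a new core
    return cores
```

For example, with quasi-cores `[(0, 3), (5, 8), (10, 12)]`, similarities `[0.9, 0.95]`, pair averages `[3.0, 1.0]`, a threshold of `0.8` and a chromosome-wide mean of `2.0`, only the first pair satisfies both conditions, so the first two quasi-cores are merged and the third stays separate.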
Step S5, as shown in FIG. 2-⑤ (FIG. 2-⑤ is the process of building the complete core-attachment structure model of TADs), the region between two adjacent cores is defined as an attachment candidate region, and it is determined to which core each genome block in the attachment candidate region belongs; each finally predicted TAD is composed of one core and the attachments on both sides of it; in a specific implementation, this comprises the following steps:
S5.1, for the genome blocks $b$ between two adjacent cores $core_k$ and $core_{k+1}$, filtering out the genome blocks whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values on the whole chromosome;
S5.2, on the basis of step S5.1, forming a submatrix from the two adjacent cores $core_k$ and $core_{k+1}$ and the genome blocks $b$ between them, and removing the background signal; the background signal is defined as the mean of the non-zero interaction values in the submatrix formed by the genome blocks between the two adjacent cores;
S5.3, on the basis of step S5.2, for the genome blocks $b$ between the two adjacent cores $core_k$ and $core_{k+1}$, filtering out the genome blocks that have no non-zero interaction value with any genome block in the genomic region $R$ under consideration;
S5.4, on the basis of step S5.3, calculating, for each genome block $b$ between the two adjacent cores $core_k$ and $core_{k+1}$, the average interaction value of the submatrix $M_b$ in which the block is located; the genome block with the smallest submatrix average interaction value is taken as the partition point, the genome blocks upstream of the partition point are taken as attachments of $core_k$, and the genome blocks downstream of the partition point are taken as attachments of $core_{k+1}$; the finally predicted TADs are thereby obtained.
FIG. 3 is a schematic structural diagram of the system of the present invention. The invention also provides a prediction system for implementing the above method for predicting chromosome topologically associating domains (TADs), comprising a high-frequency interaction region identification module, a quasi-core identification module, a quasi-core processing module, a TAD core identification module and a TAD identification module connected in series in that order. The high-frequency interaction region identification module is used for obtaining each genome block in the interaction matrix between genome blocks, identifying the corresponding high-frequency interaction region by a clustering algorithm, and sending the obtained high-frequency interaction regions to the quasi-core identification module; the quasi-core identification module is used for judging and identifying, for each genome block, whether a quasi-core centered on the genome block exists in the corresponding high-frequency interaction region, and sending the obtained quasi-cores to the quasi-core processing module; the quasi-core processing module is used for processing the quasi-cores identified on each chromosome according to the relationship between every two adjacent quasi-cores to obtain non-overlapping quasi-cores, and sending the non-overlapping quasi-cores to the TAD core identification module; the TAD core identification module is used for merging the non-overlapping quasi-cores on a chromosome according to the correlation between quasi-cores, taking the merged regions as the cores of the TADs to be predicted, and sending the obtained cores to the TAD identification module; and the TAD identification module is used for determining to which core each genome block in the attachment candidate regions belongs, combining the result with the received TAD cores to obtain the finally predicted TADs, and outputting them.

Claims (10)

1. A method for predicting a chromosome topologically associating domain (TAD), comprising the following steps:
S1, obtaining each genome block in an interaction matrix between genome blocks, and identifying the corresponding high-frequency interaction region by a clustering algorithm;
S2, for each genome block, judging and identifying from the corresponding high-frequency interaction region whether a quasi-core centered on the genome block exists:
if the high-frequency interaction region has a quasi-core centered on the genome block, continuing with the subsequent steps;
if the high-frequency interaction region has no quasi-core centered on the genome block, splitting the high-frequency interaction region and then judging and identifying the quasi-core again, until the split region no longer contains the genome block;
S3, processing the quasi-cores identified on each chromosome according to the relationship between every two adjacent quasi-cores, to obtain non-overlapping quasi-cores;
S4, merging the non-overlapping quasi-cores on a chromosome according to the correlation between quasi-cores, and taking the merged regions as the cores of the TADs to be predicted;
S5, determining to which core each genome block in the attachment candidate regions belongs, and combining the result with the TAD cores obtained in step S4 to obtain the finally predicted TADs.
2. The method for predicting a chromosome topologically associating domain according to claim 1, wherein step S1 specifically comprises obtaining each genome block in the interaction matrix between genome blocks by a whole-genome conformation capture technique and a sequencing technique, and clustering with a K-means clustering algorithm with $k = 2$, so as to identify the corresponding high-frequency interaction region.
3. The method for predicting a chromosome topologically associating domain according to claim 2, wherein step S1 specifically comprises the following steps:
S1.1, obtaining the interaction matrix between genome blocks by a whole-genome conformation capture technique and a sequencing technique;
S1.2, setting to 0 the interaction value of each genome block with itself, on the diagonal of the interaction matrix between genome blocks obtained in step S1.1;
S1.3, for any genome block $i$, using a K-means clustering algorithm with $k = 2$ to cluster the other genome blocks whose interaction value with genome block $i$ is not 0; in step S1.3 the following function is used as the classification function for the other genome blocks:

$$c(j) = \underset{k \in \{1, 2\}}{\arg\min} \, \lVert x_{ij} - \mu_k \rVert_2$$

where $x_{ij}$ is the interaction value between genome block $i$ and genome block $j$; $\mu_k$ is the mean (center) of the $k$-th class; $\arg\min$ returns the class number of the center nearest to $x_{ij}$; $\lVert \cdot \rVert_2$ is the 2-norm; the initial center values $\mu_1$ and $\mu_2$ of the two classes are set to the interaction values at positions $p_1$ and $p_2$ of the non-zero interaction values sorted in ascending order, with $\mu_1$ corresponding to the center of the low-frequency interaction class and $\mu_2$ to the center of the high-frequency interaction class;
by solving the classification function, genome block $j$ is assigned to the class whose center is nearest to its interaction value;
S1.4, for each genome block $i$, defining the corresponding high-frequency interaction region $(l_i, r_i)$, where $l_i$ is the minimum block number of the genome blocks in the high-frequency interaction class of genome block $i$, and $r_i$ is the maximum block number of the genome blocks in that class.
4. The method for predicting a chromosome topologically associating domain according to claim 3, wherein step S2 specifically comprises the following steps:
S2.1, calculating the average interaction value of the submatrix $M_{(l_i, r_i)}$ formed in the interaction matrix between genome blocks by the high-frequency interaction region $(l_i, r_i)$ of genome block $i$;
S2.2, comparing the average interaction value obtained in step S2.1 with the average interaction values of the 5 adjacent submatrices of the same window size:
if the average interaction value obtained in step S2.1 is larger than the average interaction values of the 5 adjacent submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is determined to be the quasi-core of genome block $i$;
if the average interaction value obtained in step S2.1 is not larger than the average interaction values of the 5 adjacent submatrices of the same window size, the high-frequency interaction region $(l_i, r_i)$ is split; judgment and identification are performed again after splitting, until the split region no longer contains genome block $i$;
the 5 adjacent submatrices of the same window size are, specifically, the three submatrices above the submatrix, the one submatrix to its right and the one submatrix below it.
5. The method for predicting a chromosome topologically associating domain according to claim 4, wherein splitting the high-frequency interaction region $(l_i, r_i)$, and judging and identifying again after splitting until the split region no longer contains genome block $i$, specifically comprises the following steps:
first, the genome block $m_i$ in the high-frequency interaction region $(l_i, r_i)$ whose sum of interactions with the other genome blocks in the region is smallest is taken as the split point, and the high-frequency interaction region $(l_i, r_i)$ is divided into a high-frequency interaction region to the left of $m_i$ and a high-frequency interaction region to the right of $m_i$;
then, a judgment is made:
if $i = m_i$, it is determined that no quasi-core centered on genome block $i$ exists;
if $i < m_i$, the left high-frequency interaction region is taken as the high-frequency interaction region of the genome block, and steps S2.1 to S2.2 are repeated to judge the quasi-core;
if $i > m_i$, the right high-frequency interaction region is taken as the high-frequency interaction region of the genome block, and steps S2.1 to S2.2 are repeated to judge the quasi-core.
6. The method for predicting a chromosome topologically associating domain according to claim 5, wherein step S3 specifically comprises the following steps:
S3.1, for the quasi-cores identified on each chromosome, judging the relationship between two adjacent quasi-cores:
if the two adjacent quasi-cores are in a containment relationship, the containing quasi-core is retained and the contained quasi-core is filtered out;
if the two adjacent quasi-cores are in an overlapping relationship, a further judgment is made: if the two quasi-cores still satisfy the definition of a quasi-core after being merged, they are merged into one quasi-core; otherwise, the quasi-core with the larger average interaction value is retained and the other is filtered out;
S3.2, repeating step S3.1 until all quasi-cores on the whole chromosome have been judged and processed, finally obtaining the non-overlapping quasi-cores.
7. The method for predicting a chromosome topologically associating domain according to claim 6, wherein step S4 specifically comprises calculating the cosine similarity between all adjacent quasi-cores, merging runs of consecutive adjacent quasi-cores whose cosine similarity is higher than a set threshold and whose average interaction value between adjacent quasi-cores is greater than the mean of the non-zero interaction values on the whole chromosome into a new region, and taking the new region as a core in the core-attachment structure model of the TAD to be predicted.
8. The method for predicting a chromosome topologically associating domain according to claim 7, wherein step S5 specifically comprises defining the region between two adjacent cores as an attachment region, and determining, for each genome block in each attachment region, the adjacent TAD core to which the block belongs, so as to obtain the finally predicted TADs; each TAD comprises one core and the attachment regions on both sides of the core.
9. The method for predicting a chromosome topologically associating domain according to claim 8, wherein step S5 specifically comprises the following steps:
S5.1, for the genome blocks $b$ between two adjacent cores $core_k$ and $core_{k+1}$, filtering out the genome blocks whose high-frequency interaction region has an average interaction value smaller than the mean of the non-zero interaction values on the whole chromosome;
S5.2, on the basis of step S5.1, forming a submatrix from the two adjacent cores $core_k$ and $core_{k+1}$ and the genome blocks $b$ between them, and removing the background signal; the background signal is defined as the mean of the non-zero interaction values in the submatrix formed by the genome blocks between the two adjacent cores;
S5.3, on the basis of step S5.2, for the genome blocks $b$ between the two adjacent cores $core_k$ and $core_{k+1}$, filtering out the genome blocks that have no non-zero interaction value with any genome block in the genomic region $R$ under consideration;
S5.4, on the basis of step S5.3, calculating, for each genome block $b$ between the two adjacent cores $core_k$ and $core_{k+1}$, the average interaction value of the submatrix $M_b$ in which the block is located; the genome block with the smallest submatrix average interaction value is taken as the partition point, the genome blocks upstream of the partition point are taken as attachments of $core_k$, and the genome blocks downstream of the partition point are taken as attachments of $core_{k+1}$; the finally predicted TADs are thereby obtained.
10. A prediction system for implementing the method for predicting a chromosome topologically associating domain according to any one of claims 1 to 9, characterized by comprising a high-frequency interaction region identification module, a quasi-core identification module, a quasi-core processing module, a TAD core identification module and a TAD identification module connected in series in that order; the high-frequency interaction region identification module is used for obtaining each genome block in the interaction matrix between genome blocks, identifying the corresponding high-frequency interaction region by a clustering algorithm, and sending the obtained high-frequency interaction regions to the quasi-core identification module; the quasi-core identification module is used for judging and identifying, for each genome block, whether a quasi-core centered on the genome block exists in the corresponding high-frequency interaction region, and sending the obtained quasi-cores to the quasi-core processing module; the quasi-core processing module is used for processing the quasi-cores identified on each chromosome according to the relationship between every two adjacent quasi-cores to obtain non-overlapping quasi-cores, and sending the non-overlapping quasi-cores to the TAD core identification module; the TAD core identification module is used for merging the non-overlapping quasi-cores on a chromosome according to the correlation between quasi-cores, taking the merged regions as the cores of the TADs to be predicted, and sending the obtained cores to the TAD identification module; and the TAD identification module is used for determining to which core each genome block in the attachment candidate regions belongs, combining the result with the received TAD cores to obtain the finally predicted TADs, and outputting them.
CN202210245600.9A 2022-03-14 2022-03-14 Prediction method and prediction system of chromosome topological correlation structure domain Pending CN114446384A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210245600.9A CN114446384A (en) 2022-03-14 2022-03-14 Prediction method and prediction system of chromosome topological correlation structure domain

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210245600.9A CN114446384A (en) 2022-03-14 2022-03-14 Prediction method and prediction system of chromosome topological correlation structure domain

Publications (1)

Publication Number Publication Date
CN114446384A true CN114446384A (en) 2022-05-06

Family

ID=81358910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210245600.9A Pending CN114446384A (en) 2022-03-14 2022-03-14 Prediction method and prediction system of chromosome topological correlation structure domain

Country Status (1)

Country Link
CN (1) CN114446384A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190005191A1 (en) * 2015-07-14 2019-01-03 Whitehead Institute For Biomedical Research Chromosome neighborhood structures and methods relating thereto
US20190295684A1 (en) * 2018-03-22 2019-09-26 The Regents Of The University Of Michigan Method and apparatus for analysis of chromatin interaction data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
许希伦;: "染色体相互作用密度与拓扑域相关分析", 电脑知识与技术, no. 03, 25 January 2020 (2020-01-25) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114944190A (en) * 2022-05-12 2022-08-26 南开大学 TAD (TAD-based data analysis) identification method and system based on Hi-C sequencing data
CN114944190B (en) * 2022-05-12 2024-04-19 南开大学 TAD (transcription activator) identification method and system based on Hi-C sequencing data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination