CN118038990A

CN118038990A - Multi-level chromatin topological structure domain identification method and system based on community discovery

Info

Publication number: CN118038990A
Application number: CN202410430488.5A
Authority: CN
Inventors: 柳军涛; 刘洋洋
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2024-04-11
Filing date: 2024-04-11
Publication date: 2024-05-14

Abstract

The disclosure provides a method and a system for identifying a multilevel chromatin topological structure domain based on community discovery, which relate to the technical field of biological gene identification and comprise the following steps: acquiring an original Hi-C contact matrix of chromosome sequencing, and preprocessing the Hi-C contact matrix to obtain an undirected weighted subnetwork; dividing the undirected weighted subnetwork into mutually independent communities by using a community discovery algorithm, searching candidate double boundaries in all communities and forming a set; extracting boundary local feature representations of candidate double boundaries, and inputting the boundary local feature representations into an MLP model to obtain a prediction score of each candidate double boundary; screening out candidate double boundaries to be reserved according to the prediction scores, wherein the reserved candidate double boundaries form a reliable boundary set; searching the candidate double boundaries in the reliable boundary set, and identifying the chromosome topological structure domain TAD forming multiple layers.

Description

Multi-level chromatin topological structure domain identification method and system based on community discovery

Technical Field

The disclosure relates to the technical field of biological gene identification, in particular to a method and a system for identifying a multilevel chromatin topological structure domain based on community discovery.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

The genome in the eukaryotic cell nucleus is organized in a three-dimensional, highly folded manner. High-throughput chromosome conformation capture (Hi-C) is widely used to study the spatial organization of the genome within the nucleus because it can profile chromatin interactions across the whole genome. It produces up to billions of paired-end reads that can be binned into a contact matrix, a two-dimensional symmetric matrix whose elements reflect the interaction frequency between the corresponding pairs of genomic loci. Using Hi-C, researchers have found that the human, mouse and Drosophila genomes are linearly divided into megabase-scale regions with strong internal connectivity but limited interaction with other regions, known as topologically associating domains (TADs). Studies have shown that TADs are widely conserved across species and are closely related to histone modification, long-range gene regulation and epigenetics, while misfolding of a particular TAD can lead to genetic disease of varying severity. Thus, studying TADs helps to better understand the relationship between three-dimensional chromatin organization and epigenetic inheritance.

Currently, many computational methods have been designed to identify TAD structures in Hi-C matrices. Depending on the type of TAD they identify, these algorithms fall into two classes: the first identifies non-nested TADs, i.e., no two of the identified TADs intersect; the second identifies nested TADs, i.e., the identified TADs may intersect or contain one another.

However, the inventors have found that existing TAD prediction algorithms also suffer from the following drawbacks:

1. Limitations of using a single algorithm: existing methods each rely on only one statistical or combinatorial optimization algorithm. A single algorithm inevitably carries its own systematic errors, which reduces the accuracy of TAD prediction;

2. Insufficient grasp of TAD boundary features: predicting TAD boundaries is closely tied to predicting TADs, yet current TAD algorithms pay little or no attention to TAD boundary features, which affects the accuracy of TAD prediction;

3. Limitations in predicting fine TAD structure: current research indicates that TADs contain further TAD structures within them, typically referred to as nested TADs or sub-TADs, but most current TAD prediction algorithms predict only non-nested TAD structures, i.e., TADs distributed linearly and without intersection along the genome, and therefore cannot predict these finer sub-TADs;

4. Poor robustness to data of different sequencing depths: Hi-C technology has inherent systematic errors, and sequencing depth strongly affects the quality of Hi-C data and hence the accuracy of TAD prediction algorithms, yet existing algorithms are generally not robust to Hi-C sequencing depth.

Disclosure of Invention

In order to solve the above problems, the disclosure provides a multi-level chromatin topological domain identification method and system based on community discovery. The proposed BINDER (bound-anchored Infomap and Neural network-based TAD identifier) method converts the TAD identification problem into a community identification problem in a network, so that the traditional community discovery algorithm Infomap and a neural network method are integrated effectively and reasonably, their advantages complement each other, and TADs can be predicted more accurately.

According to some embodiments, the present disclosure employs the following technical solutions:

A multi-level chromatin topological structure domain identification method based on community discovery comprises the following steps:

Acquiring an original Hi-C contact matrix of chromosome sequencing, and preprocessing the Hi-C contact matrix to obtain an undirected weighted subnetwork;

Dividing the undirected weighted subnetwork into mutually independent communities by using a community discovery algorithm, searching candidate double boundaries in all communities and forming a set;

Extracting boundary local feature representations of candidate double boundaries, and inputting the boundary local feature representations into an MLP model to obtain a prediction score of each candidate double boundary; screening out candidate double boundaries to be reserved according to the prediction scores, wherein the reserved candidate double boundaries form a reliable boundary set;

Searching the candidate double boundaries in the reliable boundary set, and identifying the chromosome topological structure domain TAD forming multiple layers.

A multi-level chromatin topology domain identification system based on community discovery, comprising:

The data acquisition module is used for acquiring an original Hi-C contact matrix of chromosome sequencing, and obtaining an undirected weighted subnetwork after preprocessing the Hi-C contact matrix;

The boundary searching module is used for dividing the undirected weighted subnetwork into mutually independent communities by using a community discovery algorithm, searching candidate double boundaries in all communities and forming a set;

The boundary feature extraction and prediction module is used for extracting boundary local feature representations of the candidate double boundaries and inputting the boundary local feature representations into the MLP model to obtain a prediction score of each candidate double boundary; screening out candidate double boundaries to be reserved according to the prediction scores, wherein the reserved candidate double boundaries form a reliable boundary set;

The multi-level TAD generation module is used for searching the candidate double boundaries in the reliable boundary set and identifying and forming a multi-level chromosome topological structure domain TAD.

Compared with the prior art, the beneficial effects of the present disclosure are:

1) The present disclosure converts the TAD identification problem into a community identification problem in a network and integrates the traditional community discovery algorithm Infomap with a neural network method effectively and reasonably, so that their advantages complement each other and TADs can be predicted more accurately;

2) The present disclosure assumes that the anchoring of the TAD boundary plays a decisive role in the formation of the TAD, and therefore, by grasping the TAD boundary features, three boundary features describing the TAD boundary in the Hi-C matrix are defined, so that the algorithm more accurately recognizes the TAD boundary;

3) The present disclosure can predict finer multi-level TAD structures, including nested TADs (sub TADs) and partially overlapping TADs, and define corresponding levels of TADs, facilitating subsequent research;

4) The model for predicting the TAD designed by the disclosure can keep stable and accurate output under Hi-C data with different sequencing depths, and has excellent robustness.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.

Fig. 1 is a schematic diagram of the CTCF-based evaluation criterion (CEC) of an embodiment of the present disclosure;

Fig. 2 is a flow chart of the BINDER method of an embodiment of the present disclosure.

Detailed Description

The disclosure is further described below with reference to the drawings and examples.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

Term interpretation:

Hi-C: high-throughput chromosome conformation capture;

TAD: topologically associating domain;

Infomap algorithm: a community discovery method based on information theory. Infomap uses the probability flow of random walks on a network to simulate information flow in a system, and decomposes the network into communities by compressing the description of that probability flow.

BINDER method: the method proposed by the present disclosure (bound-anchored Infomap and Neural network-based TAD identifier), a multi-level chromatin topological domain (multi-level TAD) identification algorithm based on the community discovery algorithm Infomap and a neural network.

Example 1

An embodiment of the present disclosure provides a method for identifying multi-level chromatin topological domains based on community discovery, namely BINDER (bound-anchored Infomap and Neural network-based TAD identifier), which is based on the community discovery algorithm Infomap and a neural network, and includes:

Step one: acquiring an original Hi-C contact matrix of chromosome sequencing, and preprocessing the Hi-C contact matrix to obtain an undirected weighted subnetwork;

step two: dividing the undirected weighted subnetwork into mutually independent communities by using a community discovery algorithm, searching candidate double boundaries in all communities and forming a set;

Step three: extracting boundary local feature representations of candidate double boundaries, and inputting the boundary local feature representations into an MLP model to obtain a prediction score of each candidate double boundary; screening out candidate double boundaries to be reserved according to the prediction scores, wherein the reserved candidate double boundaries form a reliable boundary set;

Step four: searching the candidate double boundaries in the reliable boundary set, and identifying the chromosome topological structure domain TAD forming multiple layers.

As one example, the multi-level chromatin topological domain identification method of the present disclosure is based on the chromatin loop extrusion model. It assumes that the anchoring of TAD boundaries determines the structure of a TAD and that each TAD boundary consists of two adjacent bins, one on the left and one on the right (a bin is the minimum unit length into which the genome is split, i.e., the resolution of the Hi-C matrix). Based on this assumption, a TAD identification method is developed whose central task is to search for, filter and determine TAD boundaries. Because the interaction frequency between genomic loci inside a TAD is high while the interaction frequency with loci outside it is low, the search for TAD boundaries is converted into a search for "community" boundaries in a network and is implemented with the Infomap method. Specific embodiments of the disclosed method include:

1) Obtaining an original Hi-C contact matrix of chromosome sequencing and preprocessing;

An original Hi-C contact matrix of chromosome sequencing is obtained. To reduce bias in the raw data and improve the accuracy of subsequent prediction, the original Hi-C matrix is first globally normalized using the Sequential Component Normalization (SCN) method, which is applicable to any genome contact map and is independent of the experimental protocol.
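The normalization step can be illustrated with a minimal sketch of SCN as it is commonly described: columns and rows are alternately scaled to unit Euclidean norm for a few rounds, and the result is symmetrized. The function name `scn_normalize`, the default of three iterations and the handling of empty rows are assumptions for illustration; the text above only names the method.

```python
import math


def scn_normalize(matrix, iterations=3):
    """Sketch of SCN on a square contact matrix given as a list of lists."""
    n = len(matrix)
    m = [[float(v) for v in row] for row in matrix]
    for _ in range(iterations):  # iteration count is an assumption
        # Scale each column to unit Euclidean norm (empty columns stay zero).
        for j in range(n):
            norm = math.sqrt(sum(m[i][j] ** 2 for i in range(n)))
            if norm > 0:
                for i in range(n):
                    m[i][j] /= norm
        # Scale each row to unit Euclidean norm.
        for i in range(n):
            norm = math.sqrt(sum(v * v for v in m[i]))
            if norm > 0:
                m[i] = [v / norm for v in m[i]]
    # Alternating column/row scaling breaks exact symmetry, so average with the transpose.
    return [[(m[i][j] + m[j][i]) / 2 for j in range(n)] for i in range(n)]
```

For example, `scn_normalize([[10, 4, 1], [4, 8, 2], [1, 2, 6]])` returns a symmetric 3 x 3 matrix whose entries all lie between 0 and 1.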

The Hi-C contact matrix is conceptualized as a weighted undirected network. Since TADs exhibit frequent internal interactions and infrequent interactions with external regions, the problem of identifying TADs is treated as a community discovery problem in the network. To solve it, BINDER adopts the Infomap algorithm, a community discovery method based on information theory. Infomap models the information flow within a system with the probability flow of random walks on the network, and decomposes the network into communities by compressing a description of that flow. Because of computational cost, running Infomap on the entire Hi-C matrix can take a significant amount of time. The present disclosure therefore slides a window along the diagonal of the Hi-C matrix to capture fixed-size sub-matrices, and applies Infomap to the sub-networks corresponding to these sub-matrices.

Specifically, the size of a TAD is generally below 2 Mb (2,000,000 base pairs), so the sliding window size of BINDER is set to 2 Mb / r, where r is the resolution of the Hi-C matrix; for example, the window size for a Hi-C matrix with a resolution of 10 kb is 200 bins. The step size of the sliding window is set to 10 bins. Each diagonal sub-matrix captured by the sliding window is then treated as an undirected weighted network (ignoring the elements on the diagonal, i.e., the self-loops of the nodes), and the Infomap algorithm is applied to it to detect the community structure of the network.
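The window arithmetic above can be sketched as follows. The helper names `sliding_windows` and `submatrix_edges`, the extra final window that covers the chromosome end, and the dropping of zero-weight edges are assumptions for illustration; the community detection call itself is left out.

```python
def sliding_windows(n_bins, resolution_bp, max_tad_bp=2_000_000, step_bins=10):
    """Return (start, end) bin ranges of the diagonal sub-matrices, end exclusive."""
    window = max_tad_bp // resolution_bp  # e.g. 2 Mb / 10 kb = 200 bins
    if n_bins <= window:
        return [(0, n_bins)]
    starts = list(range(0, n_bins - window + 1, step_bins))
    # Assumption: add one last window so the end of the chromosome is always covered.
    if starts[-1] + window < n_bins:
        starts.append(n_bins - window)
    return [(s, s + window) for s in starts]


def submatrix_edges(matrix, start, end):
    """Weighted undirected edges of one window, without self-loops or zero entries."""
    edges = []
    for a in range(start, end):
        for b in range(a + 1, end):  # b > a skips the diagonal and mirrored entries
            if matrix[a][b] > 0:
                edges.append((a, b, matrix[a][b]))
    return edges
```

With a 10 kb matrix of 1,000 bins, `sliding_windows(1000, 10_000)` yields 81 windows of 200 bins each, from `(0, 200)` to `(800, 1000)`; the edge list of each window would then be passed to a community detection routine.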

2) Searching candidate TAD double boundaries

Candidate double boundaries are extracted from the mutually independent communities. For any sub-network, Infomap divides it into mutually independent communities C_1, C_2, …, C_m, each represented as a monotonically increasing bin sequence C_i = (B_1, …, B_n). Each bin sequence is checked to determine whether all of its interior bins are balanced: if so, the sequence is a balanced sequence, otherwise it is a partially balanced sequence. A balanced sequence yields two double boundaries.

A partially balanced sequence contains a right-balanced bin set and a left-balanced bin set, from which a set of double boundaries is extracted. Finally, the double boundaries found in all sub-matrices form the candidate double boundary set.

Specifically, for any j = 2, …, n-1: if B_j = B_{j-1} + 1 = B_{j+1} - 1, B_j is called balanced in C_i; if B_j = B_{j+1} - 1 and B_j > B_{j-1} + 1, B_j is called right balanced in C_i; if B_j = B_{j-1} + 1 and B_j < B_{j+1} - 1, B_j is called left balanced in C_i. C_i is considered balanced if all B_j (j = 2, …, n-1) in C_i are balanced; otherwise it is partially balanced. If C_i is balanced, two double boundaries, (B_1 - 1, B_1) and (B_n, B_n + 1), each represented by two bins, can be derived from C_i. If C_i is partially balanced, it contains a right-balanced bin set and a left-balanced bin set, and a set of double boundaries is extracted from C_i.

The set of candidate double boundaries found in all the sub-matrices is denoted S_dual. In addition, the Infomap support count of every double boundary b_i = (i, i+1) is recorded, i.e., the total number of occurrences of b_i across all sub-networks, denoted info(b_i).
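The classification of interior bins can be sketched as below. The sketch reads "right balanced" as adjacent to the next bin only and "left balanced" as adjacent to the previous bin only, which is an interpretation of the definitions; the label "isolated" for bins adjacent to neither neighbour and both function names are hypothetical. Only the balanced case is turned into double boundaries here.

```python
def classify_bins(community):
    """Label the interior bins B_j (j = 2..n-1) of a sorted community of bin indices."""
    labels = {}
    for j in range(1, len(community) - 1):
        prev_b, b, next_b = community[j - 1], community[j], community[j + 1]
        adjacent_left = b == prev_b + 1
        adjacent_right = b == next_b - 1
        if adjacent_left and adjacent_right:
            labels[b] = "balanced"
        elif adjacent_right:
            labels[b] = "right balanced"
        elif adjacent_left:
            labels[b] = "left balanced"
        else:
            labels[b] = "isolated"  # hypothetical label, not named in the text
    return labels


def balanced_double_boundaries(community):
    """Return (B_1 - 1, B_1) and (B_n, B_n + 1) for a balanced community, else None."""
    if any(label != "balanced" for label in classify_bins(community).values()):
        return None
    return [(community[0] - 1, community[0]), (community[-1], community[-1] + 1)]
```

For the contiguous community `[4, 5, 6, 7]` this gives the double boundaries `(3, 4)` and `(7, 8)`, while `[1, 2, 3, 7, 8, 9]` is only partially balanced because bin 3 is left balanced and bin 7 is right balanced.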

3) Boundary local feature extraction for candidate double boundary

Three boundary local features are used, all derived entirely from the Hi-C contact matrix: the local interaction density, the directionality index (DI), and the p-value of the Wilcoxon rank-sum test. The local interaction density measures the local interaction frequency of a double boundary in the Hi-C contact matrix; the directionality index measures the degree of upstream or downstream bias of a given bin; and the p-value of the Wilcoxon rank-sum test captures the difference between the interactions around a given bin and the internal interactions.

(A) Local interaction density features

The difference in internal and external interaction frequencies is a significant feature of the TAD structure in the Hi-C data, whereby a "local interaction density" feature is defined for each double boundary to measure its local interaction frequency in the Hi-C contact matrix.

Given a double boundary b_i = (i, i+1), consider the 11 double boundaries within a radius of 5 bins around it. Then, given a window size w₁, the local interaction density d(b_n) of each b_n (n = i-5, …, i+5) is defined as:

where M(a, b) represents the value in row a and column b of the SCN-normalized Hi-C matrix.

Thus, the local interaction density of double boundary b_i at window size w₁ can be represented as an 11-dimensional vector D(b_i|w₁) = [d(b_{i-5}), …, d(b_{i+5})]. Finally, the local interaction density feature vector of double boundary b_i is the concatenation of D(b_i|w₁) over all window sizes W₁ = [5, 10, 20, 30, 50, 60, 70, 80, 90, 100], i.e., 110 dimensions, denoted D(b_i).

(B) Directionality index feature

Because the interaction frequencies of the regions flanking a topological domain are strongly biased, the directionality index (DI) was developed to quantify the degree of upstream or downstream bias of a given bin. DI is defined as follows:

where the three quantities in the definition are the sum of the interaction frequencies between a given bin i and the w₂ bins upstream of it, the sum of the interaction frequencies between bin i and the w₂ bins downstream of it, and the expected interaction frequency.

Now, given a double boundary b_i = (i, i+1) and a step w₂, consider the 10 bins within a radius of 5 bins around b_i: i-4, …, i+5. Then, the DI vector of b_i at step w₂ is defined as DI(b_i|w₂) = [DI(i-4|w₂), …, DI(i+5|w₂)]. Finally, the DI feature vector of double boundary b_i is the concatenation of DI(b_i|w₂) over all steps W₂ = [5, 10, 20, 30, 50, 60, 70, 80, 90, 100], i.e., 100 dimensions, denoted DI(b_i).
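As an illustration of how the 100-dimensional DI feature is assembled, the sketch below assumes the widely used form of DI from the Hi-C literature, which may differ in detail from the formula intended here: with A the upstream sum, B the downstream sum and E = (A + B) / 2, DI = sign(B - A) * ((A - E)^2 + (B - E)^2) / E. The function names and the zero padding for bins outside the chromosome are assumptions.

```python
def directionality_index(matrix, i, w):
    """DI of bin i with step w under the assumed standard definition."""
    n = len(matrix)
    upstream = sum(matrix[i][k] for k in range(max(0, i - w), i))
    downstream = sum(matrix[i][k] for k in range(i + 1, min(n, i + w + 1)))
    expected = (upstream + downstream) / 2
    if expected == 0 or upstream == downstream:
        return 0.0
    sign = 1.0 if downstream > upstream else -1.0
    return sign * ((upstream - expected) ** 2 + (downstream - expected) ** 2) / expected


def di_feature(matrix, i, steps=(5, 10, 20, 30, 50, 60, 70, 80, 90, 100)):
    """Concatenate DI(k | w) for k = i-4..i+5 over all steps: 10 x 10 = 100 values."""
    n = len(matrix)
    feature = []
    for w in steps:
        for k in range(i - 4, i + 6):
            # Assumption: bins that fall outside the chromosome contribute 0.
            feature.append(directionality_index(matrix, k, w) if 0 <= k < n else 0.0)
    return feature
```

A bin with upstream sum 2 and downstream sum 6 has E = 4 and DI = (4 + 4) / 4 = 2, a positive value that indicates a downstream bias.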

(C) P-value feature of the Wilcoxon rank-sum test

The p-value of the Wilcoxon rank-sum test (typically used to test whether two samples come from the same distribution) is used to capture the difference between the interactions around a given bin and the internal interactions; the smaller the p-value, the more likely the bin is a TAD boundary.

Now, a double boundary b_i = (i, i+1) and a window size w₃ are given. At this window size, two sets, inter(b_i|w₃) (between regions) and intra(b_i|w₃) (within regions), are defined as follows:

Thus, at window size w₃, the p-value of the Wilcoxon rank-sum test between inter(b_i|w₃) and intra(b_i|w₃) of double boundary b_i is denoted P(b_i|w₃). The 10-dimensional p-value feature P(b_i) of b_i is then defined as the concatenation of P(b_i|w₃) over all window sizes W₃ = [5, 10, 20, 30, 50, 60, 70, 80, 90, 100].

Finally, the three boundary features are combined to give the feature representation of double boundary b_i as a 220-dimensional vector [D(b_i), DI(b_i), P(b_i)].

4) MLP neural network model-based prediction in BINDER methods

In order to quantify the reliability of the detected candidate double boundaries, an MLP neural network model is designed that learns the characteristics of TAD double boundaries from the extracted 220-dimensional feature representation of each double boundary b_i.

Specifically, the MLP neural network model consists of 6 layers, where the first layer consists of 220 neurons for accepting 220-dimensional feature vectors, the 4 hidden layers contain 512, 128, 32 and 4 neurons, respectively, and the last layer consists of one neuron, representing a double-boundary reliability score. Each linear layer is activated by a ReLU function.
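The layer layout can be sketched in plain Python as follows. The weights here are random and untrained, the initialization scale is an assumption, and applying ReLU after the output layer follows the statement that each linear layer is activated by ReLU; the trained model and its training procedure are not reproduced.

```python
import random

LAYER_SIZES = [220, 512, 128, 32, 4, 1]  # input, four hidden layers, output


def init_mlp(seed=0):
    """Random (untrained) parameters; each layer is (weights[out][in], biases[out])."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        scale = (2.0 / n_in) ** 0.5  # assumed initialization, not from the text
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, features):
    """Apply each linear layer followed by ReLU and return the single output value."""
    x = list(features)
    for weights, biases in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x[0]
```

Calling `forward(init_mlp(), [0.1] * 220)` returns one non-negative number, which plays the role of the reliability score of a double boundary in this sketch.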

The feature representation of double boundary b_i is input into the trained MLP neural network model to obtain the prediction score of each candidate double boundary, i.e., the value predicted for b_i by the trained MLP model.

The present disclosure trains the neural network model on Hi-C maps at four resolutions (100 kb, 50 kb, 25 kb and 10 kb) that span the entire genome of the IMR90 cell line (22 autosomes and the X chromosome). To balance the positive and negative samples, surplus negative samples are randomly deleted so that the numbers of positive and negative samples are equal. After balancing, the dataset contains 147,982 samples in total and is randomly split into a training set of 88,790 samples and a test set of 29,596 samples. The validation set, which is randomly drawn from the training data, contains 29,596 samples. Finally, the trained neural network is used to predict the reliability score of each double boundary.

5) Five-part boundary filtering to screen the candidate double boundaries to be retained

The results of the Infomap community discovery algorithm are combined with the predictions of the pre-trained MLP model through a five-part boundary filter, which filters the candidate TAD boundaries a second time and outputs the final TAD boundary set. The candidate double boundaries to be retained are screened as follows: all candidate double boundaries in the set are divided into five parts, a retention threshold is set for each part, and each prediction score is compared with the threshold of its part; when the prediction score is larger than the threshold, the corresponding candidate double boundary is retained. The retained candidate double boundaries form the reliable boundary set.

The present disclosure first divides all double boundaries in the candidate set S_dual into five parts according to their Infomap support count info(b_i), i.e., the total number of occurrences of b_i = (i, i+1) across all sub-networks. The five parts are P₁ (info(b_i) = 0), P₂ (info(b_i) = 1), P₃ (info(b_i) = 2), P₄ (2 < info(b_i) ≤ 8) and P₅ (info(b_i) > 8), where b_i belongs to S_dual. Then, a threshold is set for each of the five parts, and a double boundary b_i in P_k (k = 1, 2, 3, 4, 5) is retained if it satisfies the following condition:

s(b_i) > t_k

where s(b_i) is the value predicted for b_i in P_k by the MLP neural network model and t_k is the threshold set for P_k (k = 1, 2, 3, 4, 5).

Here, t₁ defaults to 0.77. t₂, t₃ and t₄ are set to the 85th, 80th and 30th percentiles, respectively, of the MLP model's predicted values for all double boundaries in the corresponding part. t₅ defaults to 0. The set of all double boundaries retained by this filtering strategy is denoted S_reliable.
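The filter can be sketched as below, assuming the scores and Infomap support counts are available as dictionaries keyed by double boundary. The function names and the linear-interpolation percentile are assumptions; the text does not say which percentile convention is used.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def part_index(support):
    """Map an Infomap support count to its part number k (1..5)."""
    if support <= 2:
        return support + 1  # counts 0, 1, 2 map to parts 1, 2, 3
    return 4 if support <= 8 else 5


def five_part_filter(scores, supports):
    """Return the sorted double boundaries whose score exceeds their part's threshold."""
    parts = {k: [] for k in range(1, 6)}
    for boundary in scores:
        parts[part_index(supports[boundary])].append(boundary)
    thresholds = {1: 0.77, 5: 0.0}  # fixed defaults for the first and fifth parts
    for k, q in ((2, 85), (3, 80), (4, 30)):
        if parts[k]:
            thresholds[k] = percentile([scores[b] for b in parts[k]], q)
    return sorted(b for k in parts for b in parts[k] if scores[b] > thresholds[k])
```

Because the comparison is strict, a boundary that is alone in a percentile-based part is dropped in this sketch, since its score equals the threshold computed from it.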

6) Identification of chromosome topology domain TADs forming multiple hierarchy levels

The present disclosure proposes a strategy for generating TADs and their hierarchy levels from S_reliable. The final multi-level TAD result is output by a TAD generation strategy based on TAD boundary anchoring and a TAD hierarchy generation strategy based on containment relationships, as follows:

Given a double boundary b_i = (i, i+1), i and i+1 are defined as the left and right units of b_i, respectively. For the boundary b_i = (i, i+1), its left unit i searches leftward along the same chromosome and tries to form a candidate TAD with the right unit j+1 of a boundary b_j = (j, j+1), where j+1 < i. Whether [j+1, i] forms a TAD then depends on whether the Infomap algorithm detects only one community in the sub-network represented by the diagonal sub-matrix of the Hi-C matrix in which the candidate TAD lies.

The TAD hierarchy is defined as follows. Given two TADs T₁ = [i₁, j₁] and T₂ = [i₂, j₂], if i₂ ≥ i₁ and j₂ ≤ j₁, T₁ is said to contain T₂. A directed graph G is built in which each TAD T_i is a node n_i, and two nodes n_i and n_j are connected by a directed edge n_i → n_j if their TADs satisfy this containment relation. According to graph G, the hierarchy level of any TAD T_i is defined as:

where LP(n_i, n_j) is the length of the longest directed path in G from n_i to n_j if n_j is reachable from n_i, and LP(n_i, n_j) = 0 otherwise.
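The containment graph and the longest-path computation can be sketched as follows. Two choices here are assumptions, not taken from the text: edges are drawn from the containing TAD to the contained TAD, and the level of T_i is taken as the maximum of LP(n_i, n_j) over all n_j, so a TAD that contains no other TAD gets level 0.

```python
from functools import lru_cache


def tad_levels(tads):
    """Map each TAD (start, end) to the longest containment chain starting at it."""
    tads = sorted(set(tads))

    def contains(outer, inner):
        return outer != inner and inner[0] >= outer[0] and inner[1] <= outer[1]

    # Directed edges n_i -> n_j whenever T_i contains T_j (assumed direction).
    children = {t: [u for u in tads if contains(t, u)] for t in tads}

    @lru_cache(maxsize=None)
    def longest_path(t):
        # Strict containment of distinct intervals cannot form a cycle, so this terminates.
        return max((1 + longest_path(u) for u in children[t]), default=0)

    return {t: longest_path(t) for t in tads}
```

For the TADs `(0, 100)`, `(10, 50)` and `(20, 30)`, which nest inside one another, the sketch returns levels 2, 1 and 0 respectively; under the opposite edge direction the numbering would be reversed.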

Evaluation method

(1) Evaluation criterion CEC based on CTCF

In mammals, TAD boundaries are often enriched for the chromatin structural proteins CCCTC-binding factor (CTCF) and cohesin, which are thought to jointly drive "loop extrusion" to build TADs. Furthermore, a recent model suggests that TAD formation in the genome depends on mandatory alternation of CTCF site clusters. It is therefore considered reasonable, in the absence of quantitative evaluation methods, to use CTCF-based information to evaluate the quality of TAD boundaries.

Here, the present disclosure proposes a CTCF-based evaluation criterion (CEC) for assessing the quality of TAD boundary prediction from CTCF markers. As shown in Fig. 1, assuming that boundary x lies at the junction of two bins b_i and b_{i+1}, the neighborhood of x whose radius is half the resolution length is defined as the anchoring region of boundary x. If a CTCF binding site intersects this anchoring region, boundary x is called a CTCF-matched boundary and is treated as a true TAD boundary. The "true" TAD boundaries of the whole genome are defined accordingly. Given a set of chromosomes S_c = {c₁, c₂, …, c_n} and the TAD boundary predictions S_tb on S_c given by a TAD prediction algorithm, precision, recall and F1 score are defined as follows:

where the reference set is the union of the true boundaries of chromosomes c₁, c₂, …, c_n.
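The CEC idea can be sketched as below, taking the first of the three scores to be precision in the usual sense and using the standard set-based definitions. The function names, the closed-interval overlap test and the representation of boundaries as base-pair positions are assumptions for illustration.

```python
def is_ctcf_matched(boundary_bp, resolution_bp, ctcf_sites):
    """True if any CTCF site (start, end) overlaps the half-resolution neighbourhood of x."""
    lo = boundary_bp - resolution_bp / 2
    hi = boundary_bp + resolution_bp / 2
    return any(start <= hi and end >= lo for start, end in ctcf_sites)


def cec_scores(predicted, truth):
    """Precision, recall and F1 of predicted boundaries against the CTCF-matched set."""
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

At 10 kb resolution, a boundary at position 100,000 is matched by a CTCF site spanning 104,000 to 104,500 but not by one spanning 106,000 to 106,500, because the anchoring region extends only 5,000 bp on either side.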

(2) Weighted similarity

Given two sets of TADs, M and N, the similarity score of any two TADs, one from M and one from N, is first defined as follows:

where the definition uses the number of bins shared by the two TADs and the number of bins in each of the two TADs.

Then, the similarity score of a TAD of M on N is defined as:

Thus, the weighted similarity of M with respect to N is defined as follows:

Because the weighted similarity of M with respect to N is not equal to that of N with respect to M, the mutual weighted similarity of M and N is defined as follows:

Method validity verification

In this study, BINDER was compared with eight state-of-the-art TAD prediction methods, namely TopDom, MSTD, SBTD, SpectralTAD, OnTAD, Insulation Score (IS), deDoc (deDoc(M) and deDoc(E)) and CATAD, and the CEC defined above was used to quantitatively evaluate the quality of the TAD boundaries identified by the different algorithms. The performance of BINDER and the other eight algorithms was compared on four cell lines (GM12878, K562, NHEK and HMEC) in terms of three indicators (precision, recall and F1 score) on the corresponding Hi-C datasets at resolutions of 50 kb, 25 kb and 10 kb, as shown in Table 1. The comparison shows that BINDER performed best among all compared methods on all three indicators and at all resolutions. Notably, at a resolution of 10 kb, BINDER improves on the second-best TAD algorithm by an average of 17.05% in precision, 13.07% in recall and 35.26% in F1 score across the four cell lines. Higher precision indicates that more of the TAD boundaries predicted by BINDER are anchored by CTCF, while higher recall indicates that BINDER recovers more CTCF-matched boundaries. This evaluation demonstrates BINDER's advantages in identifying TADs.

Table 1 method predictive dominance assessment

(2) BINDER shows very strong robustness on downsampled Hi-C data

Because Hi-C techniques are increasingly widely used and often require deep sequencing, TAD prediction algorithms must remain robust and reliable on sparse Hi-C data. To test the robustness of the TAD prediction algorithms on downsampled Hi-C data, the 10 kb resolution Hi-C matrix of chromosome 17 of the GM12878 cell line was downsampled to 50%, 20%, 10% and 1%, and 9 TAD prediction methods including BINDER were run on the downsampled data. The robustness and stability of each algorithm were evaluated by the variation in the number of predicted TAD boundaries and TADs at different downsampling rates, the difference between the TADs predicted on the original Hi-C data and those predicted at different downsampling rates, and the F1 score of the TADs predicted on the downsampled Hi-C data. Downsampling was performed randomly ten times at each sampling rate, and the average of the corresponding results was calculated for each TAD prediction algorithm.

Test results show that BINDER is superior to all other methods in stability and robustness. As shown in Table 2, the number of TADs and TAD boundaries predicted by each method remains substantially the same as the downsampling rate increases. To evaluate the similarity between the TADs predicted on the original Hi-C dataset and those predicted on the downsampled datasets, the weighted similarity was calculated for each TAD prediction algorithm using the evaluation method described above. The results show that BINDER has strong TAD prediction stability and can generate reliable TADs even at extremely low sequencing depth. In addition, the robustness of the different methods was evaluated by calculating their F1 scores at different downsampling rates; BINDER again showed the strongest prediction stability, with its F1 score remaining at 0.199 at the 1% downsampling rate while those of the other methods were all below 0.04. Taken together, these results clearly demonstrate that BINDER is robust and reliable in predicting TADs at different sequencing depths.

Table 2 test results

Example 2

In one embodiment of the present disclosure, a multi-level chromatin topology domain identification system based on community discovery is provided, comprising:

The boundary feature extraction and prediction module is used for extracting boundary local feature representations of the candidate double boundaries and inputting the boundary local feature representations into the MLP model to obtain a prediction score of each candidate double boundary; and screening out the candidate double boundaries to be reserved according to the prediction scores, wherein the reserved candidate double boundaries form a reliable boundary set.

The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the specific embodiments of the present disclosure have been described above with reference to the drawings, it should be understood that the present disclosure is not limited to the embodiments, and that various modifications and changes can be made by one skilled in the art without inventive effort on the basis of the technical solutions of the present disclosure while remaining within the scope of the present disclosure.

Claims

1. The multi-level chromatin topological structure domain identification method based on community discovery is characterized by comprising the following steps of:

2. The community discovery-based multi-level chromatin topology domain identification method of claim 1, wherein the preprocessing comprises: performing global normalization on the Hi-C contact matrix; treating the Hi-C contact matrix as a weighted undirected network; and capturing sub-matrices of a fixed size with a sliding window along the diagonal of the Hi-C contact matrix, each sub-matrix being regarded as an undirected weighted sub-network.
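The sliding-window step of this claim can be sketched as follows. The window size, the step and the function names `sliding_submatrices` and `to_weighted_edges` are hypothetical; the normalization step is omitted because its exact form is not stated in this passage.

```python
def sliding_submatrices(matrix, window, step):
    """Yield (start_bin, sub_matrix) for square windows on the diagonal.

    `matrix` is a symmetric list of lists. Windows that would run past
    the end of the matrix are not produced, so the last few bins may be
    left uncovered when (n - window) is not a multiple of `step`.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    n = len(matrix)
    for start in range(0, n - window + 1, step):
        rows = matrix[start:start + window]
        yield start, [row[start:start + window] for row in rows]


def to_weighted_edges(sub_matrix, offset=0):
    """Read a sub-matrix as an undirected weighted network.

    Each bin is a node and each positive off-diagonal contact is an edge
    (node_a, node_b, weight), reported once per unordered pair.
    """
    edges = []
    size = len(sub_matrix)
    for i in range(size):
        for j in range(i + 1, size):
            if sub_matrix[i][j] > 0:
                edges.append((offset + i, offset + j, sub_matrix[i][j]))
    return edges
```

Passing the window's `start_bin` as `offset` keeps node labels in genome-wide bin coordinates, which makes boundaries found in different windows directly comparable.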

3. The method for identifying multi-level chromatin topology domains based on community discovery of claim 1, wherein probability flows of random walks on the network are used to simulate the information flow in the system, and the weighted undirected network is decomposed into communities by compressing the description of these probability flows, any sub-network being divided into mutually independent communities by the community discovery algorithm.

4. The method for identifying multi-level chromatin topology domains based on community discovery of claim 1, wherein extracting candidate double boundaries from the mutually independent communities comprises: judging whether the points in each bin sequence are all balance points or only partly balance points; if all points in the bin sequence are balance points, the bin sequence is a balanced sequence, otherwise it is a partially balanced sequence; and if the bin sequence is a balanced sequence, two double boundaries are obtained from the bin sequence.

5. The method for identifying a multi-level chromatin topology domain based on community discovery according to claim 4, wherein, when the bin sequence is a partially balanced sequence, the bin sequence comprises a right balanced bin set and a left balanced bin set, a double boundary set is extracted from the partially balanced sequence, and the double boundaries found in all the sub-matrices form the candidate double boundary set.

6. The community discovery-based multi-level chromatin topology domain identification method of claim 1, wherein the boundary local features of the candidate double boundaries comprise three features derived entirely from the Hi-C contact matrix: local interaction density, directionality index and Wilcoxon rank-sum test p-value; the local interaction density is a measure of the local interaction frequency of the double boundary in the Hi-C contact matrix, and the directionality index is a measure of the degree of upstream or downstream bias of a given partition.
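Two of the features named in this claim can be sketched under stated assumptions. The index that measures upstream or downstream bias is computed here with the widely used chi-square-like form (upstream sum A, downstream sum B, expected E = (A + B) / 2), and the rank-sum p-value uses a normal approximation with average ranks for ties. The exact formulas, the window size and both function names are assumptions; the disclosure may define them differently.

```python
import math


def directionality_index(matrix, i, window):
    """Signed bias of bin i toward downstream (+) or upstream (-) contacts."""
    n = len(matrix)
    upstream = sum(matrix[i][j] for j in range(max(0, i - window), i))
    downstream = sum(
        matrix[i][j] for j in range(i + 1, min(n, i + window + 1))
    )
    expected = (upstream + downstream) / 2
    if expected == 0 or upstream == downstream:
        return 0.0
    sign = 1.0 if downstream > upstream else -1.0
    deviation = (upstream - expected) ** 2 + (downstream - expected) ** 2
    return sign * deviation / expected


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation).

    Ties receive average ranks; no tie or continuity correction is
    applied to the variance.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    k = 0
    while k < len(pooled):
        m = k
        while m < len(pooled) and pooled[m][0] == pooled[k][0]:
            m += 1
        average_rank = (k + 1 + m) / 2  # mean of ranks k+1 .. m
        rank_sum_x += average_rank * sum(
            1 for t in range(k, m) if pooled[t][1] == 0
        )
        k = m
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))
```

For a candidate boundary, `x` and `y` could hold, for instance, within-domain and cross-boundary contact values, so that a small p-value indicates a clear contrast across the boundary; that choice of samples is likewise an assumption.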

7. The method for identifying multi-level chromatin topological structure domains based on community discovery according to claim 1, wherein the boundary local features of each candidate double boundary are finally represented as a multi-dimensional feature vector, and an MLP neural network model is designed to learn the features of TAD double boundaries; the MLP neural network model consists of 6 layers, wherein the first layer consists of a plurality of neurons that receive the multi-dimensional feature vector, the 4 hidden layers comprise 512, 128, 32 and 4 neurons respectively, and the last layer consists of one neuron representing the reliability score of the double boundary.
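The layer sizes in this claim can be illustrated with a minimal forward pass. Only the hidden sizes 512, 128, 32 and 4 and the single output neuron come from the claim; the ReLU hidden activations, the sigmoid output, the random initialization and the names `init_mlp` and `forward` are assumptions, and no training is shown.

```python
import math
import random

HIDDEN_SIZES = [512, 128, 32, 4]  # hidden layer widths stated in the claim


def init_mlp(input_dim, seed=0):
    """Create randomly initialized (weights, biases) for each layer."""
    rng = random.Random(seed)
    sizes = [input_dim] + HIDDEN_SIZES + [1]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / fan_in)  # He-style scaling (assumption)
        weights = [
            [rng.gauss(0.0, scale) for _ in range(fan_in)]
            for _ in range(fan_out)
        ]
        layers.append((weights, [0.0] * fan_out))
    return layers


def _sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def forward(layers, features):
    """Map a feature vector to a reliability score in [0, 1]."""
    activation = list(features)
    last = len(layers) - 1
    for index, (weights, biases) in enumerate(layers):
        pre = [
            sum(w * a for w, a in zip(row, activation)) + b
            for row, b in zip(weights, biases)
        ]
        if index < last:
            activation = [max(0.0, v) for v in pre]  # ReLU (assumption)
        else:
            activation = [_sigmoid(v) for v in pre]  # sigmoid (assumption)
    return activation[0]
```

Counting the input layer, the four hidden layers and the output neuron gives the six layers of the claim, which correspond to five weight matrices in this sketch.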

8. The method for identifying multi-level chromatin topology domains based on community discovery of claim 1, wherein screening out the candidate double boundaries to be reserved comprises: dividing all candidate double boundaries in the set into five regions, setting a corresponding reservation threshold for each of the five regions, and comparing each prediction score with the reservation threshold of its region; when the prediction score is larger than the reservation threshold, the candidate double boundary corresponding to the prediction score is reserved, and the reserved candidate double boundaries form a reliable boundary set.

9. The community discovery-based multi-level chromatin topology domain identification method of claim 8, wherein all candidate double boundaries in the set are divided into five regions according to the number of times each is supported by the community discovery algorithm, 5 reservation thresholds are then set for the five regions respectively, and the candidate double boundaries meeting the set condition are reserved.
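The screening in claims 8 and 9 can be sketched as a lookup of a per-region threshold followed by a comparison. The cut points that split support counts into five regions and the five threshold values are not given in this passage, so the ones used in the usage note below are hypothetical, as is the function name `screen_boundaries`.

```python
from bisect import bisect_right


def screen_boundaries(candidates, count_edges, thresholds):
    """Return the candidate double boundaries that pass their threshold.

    candidates:  iterable of (boundary, support_count, score), where
                 support_count is how often community discovery
                 supported the boundary and score is the MLP output.
    count_edges: 4 ascending cut points that split support counts
                 into 5 regions (hypothetical values).
    thresholds:  5 reservation thresholds, one per region.
    """
    if len(count_edges) != 4 or len(thresholds) != 5:
        raise ValueError("need 4 cut points and 5 thresholds")
    reliable = []
    for boundary, support_count, score in candidates:
        region = bisect_right(count_edges, support_count)  # 0 .. 4
        if score > thresholds[region]:
            reliable.append(boundary)
    return reliable
```

For example, with cut points `[2, 4, 6, 8]` and thresholds `[0.9, 0.7, 0.5, 0.3, 0.1]`, a boundary supported 9 times is kept with a score of 0.2, while a boundary supported once needs a score above 0.9. Lower thresholds for better-supported regions are a design choice assumed here, not one stated in the claims.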

10. A community discovery-based multi-level chromatin topology domain identification system, comprising: