CN110097922B

CN110097922B - Method for differential analysis of hierarchical topologically associated domains (TADs) in Hi-C contact matrices based on online machine learning

Info

Publication number: CN110097922B
Application number: CN201910315741.1A
Authority: CN
Inventors: 吕红强; 刘聪毅; 韩九强
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2019-04-19
Filing date: 2019-04-19
Publication date: 2020-12-08
Anticipated expiration: 2039-04-19
Also published as: CN110097922A

Abstract

A method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning. Hi-C data are normalized to eliminate systematic experimental bias and improve comparability between data sets. From the normalized data, the average interaction frequency between the upstream and downstream regions of each bin is calculated and recorded as binSignal(i). The binSignal sequence is curve-fitted and a rank-sum test is applied to obtain the boundary region points of the TADs. All possible hierarchical TADs are derived from the boundary region points, and a mathematical model is proposed relating the interaction frequencies in the Hi-C contact matrix to all possible hierarchical TADs. An objective function is established for the model, the online machine learning algorithm FTRL is applied for the first time to solve the hierarchical TADs difference analysis model, and the hierarchical TADs that differ between cell lines are identified. The weight coefficients of all hierarchical TADs obtained with FTRL are used to identify the differential TADs among different cell lines.

Description

Method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning

Technical Field

The invention belongs to the technical field of biology, relates to the differential analysis of hierarchical topologically associated domains (TADs) in different cell lines, and particularly relates to a method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning.

Background

Hi-C is a high-throughput chromatin conformation capture technology; a Hi-C experiment yields interaction information between any pair of sites across the whole genome. Hi-C data are generally represented as a matrix, called the contact matrix, which is a symmetric square matrix whose elements are called interaction frequencies. As Hi-C technology developed, researchers studying Hi-C data found that each chromosome can be roughly divided into two kinds of compartments (A/B compartments): the chromatin state of the A compartment is active, and that of the B compartment is inactive. Building on this discovery, at higher resolution, researchers also found genomic regions within the compartments that have higher interaction strength, called topologically associated domains (TADs); the interaction strength between sites inside a TAD is far higher than that with sites outside it.

Numerous biological experiments have shown that TADs are essential elements in regulating gene transcription: TADs constrain the interactions between enhancers and promoters during gene regulation, and disruption of TAD boundaries may lead to diseases such as cancer. Studies have shown that most TADs have a hierarchical structure, and only a very small number of TADs are independent. Differential analysis of hierarchical TADs can therefore give deeper insight into the mechanism by which gene expression influences cell differentiation.

Existing methods for differential analysis of TADs give insufficient consideration to the hierarchical structure of TADs, which reduces the identification rate of differential TADs.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide a hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning, which aims at performing hierarchical TADs difference analysis on Hi-C data under different cell lines.

In order to achieve the purpose, the invention adopts the technical scheme that:

A method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning, characterized by comprising the following steps:

step 1, normalizing the Hi-C data to eliminate the systematic bias of Hi-C experiments and improve comparability between data sets;

step 2, for the normalized Hi-C data, calculating the average interaction frequency between the upstream and downstream regions of each bin, and recording it as binSignal(i);

step 3, fitting the binSignal sequence with a curve fitting algorithm, and preliminarily taking the local minimum points of the fitted curve as boundary region points of TADs;

step 4, filtering out false positive TAD boundary region points with a rank-sum test to obtain the final TAD boundary region points;

step 5, obtaining all possible hierarchical TADs from the TAD boundary region points, and proposing a mathematical model between the interaction frequencies M_ij in the Hi-C contact matrix M and all possible hierarchical TADs;

step 6, establishing an objective function for the model, and solving it with the online machine learning algorithm FTRL;

step 7, identifying, from the solution, the hierarchical TADs that differ between cell lines.

The specific method for normalizing the Hi-C data in step 1 is as follows:

First, a cross-cell-line Hi-C data normalization method is applied to the Hi-C data of the different cell lines as a preliminary step, to eliminate systematic bias between cell lines as far as possible; the preliminarily processed Hi-C data are then normalized with the CPM (counts per million) method to further improve the comparability of Hi-C data between cell lines.
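The CPM part of this normalization can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes the library size is the sum of all matrix entries, and the function name `cpm_normalize` and the toy matrix are hypothetical. The cross-cell-line normalization that precedes CPM is not shown.

```python
def cpm_normalize(matrix):
    """Rescale a contact matrix so that its entries sum to one million.

    Assumption: the library size is the sum over all entries of the matrix.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("contact matrix has no counts")
    scale = 1_000_000 / total
    return [[value * scale for value in row] for row in matrix]


# Toy 2 x 2 contact matrix (hypothetical counts).
raw = [[4, 1], [1, 4]]
normalized = cpm_normalize(raw)
```

After this step, two matrices with different sequencing depths are expressed on the same scale, which is the comparability the text asks for.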

The specific method for calculating binSignal(i) in step 2 is as follows:

For a bin with index i, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i), and the window size of the upstream and downstream regions is w. binSignal(i) is calculated according to the following formula:

$$\mathrm{binSignal}(i) = \frac{1}{w^2} \sum_{l=1}^{w} \sum_{m=1}^{w} \mathrm{cont.freq}\big(U_i(l), D_i(m)\big)$$

where U_i(l) denotes a bin in the upstream region of the bin with index i, D_i(m) denotes a bin in the downstream region of the bin with index i, cont.freq denotes the interaction frequency between an upstream-region bin and a downstream-region bin, l denotes the index of an element in the upstream window, and m denotes the index of an element in the downstream window.
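The windowed average described above can be sketched in a few lines. The sketch makes assumptions the text does not state: the upstream window is taken as the w bins before bin i, the downstream window as the w bins after it, and windows are truncated at the ends of the chromosome. The function names and the toy matrix are hypothetical.

```python
def bin_signal(contact, i, w):
    """Mean contact frequency between the bins upstream and downstream of bin i."""
    n = len(contact)
    upstream = range(max(0, i - w), i)                # assumed: bins i-w .. i-1
    downstream = range(i + 1, min(n, i + w + 1))      # assumed: bins i+1 .. i+w
    pairs = [(l, m) for l in upstream for m in downstream]
    if not pairs:
        return 0.0  # no upstream or no downstream bins at the chromosome ends
    return sum(contact[l][m] for l, m in pairs) / len(pairs)


def bin_signal_sequence(contact, w):
    """binSignal(i) for every bin of the contact matrix."""
    return [bin_signal(contact, i, w) for i in range(len(contact))]


# Toy matrix with two 3-bin domains: strong contacts inside, weak across.
contact = [[10 if (r < 3) == (c < 3) else 1 for c in range(6)] for r in range(6)]
signal = bin_signal_sequence(contact, w=2)
```

In the toy matrix the signal drops at the bins adjacent to the domain border, which is the behaviour the later local-minimum search relies on.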

The specific method for filtering false positive TAD boundary region points with the rank-sum test in step 4 is as follows:

Within a certain range of the TAD boundary region to be tested, collect the interaction frequencies S₁ between the upstream and downstream regions and the interaction frequencies S₂ within the upstream or the downstream region, and use a rank-sum test to check whether S₁ and S₂ differ significantly, where a significant difference means that the p-value of the rank-sum test is less than 0.05. If there is a significant difference, the TAD boundary region is considered a false positive and is filtered out; otherwise it is considered a true positive.
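The significance check on S₁ and S₂ can be illustrated with a small Wilcoxon rank-sum implementation. This is a sketch under stated assumptions: it uses the two-sided normal approximation with average ranks for ties and no tie correction of the variance, and the function names and sample values are hypothetical. It only reports whether the two samples differ at the 0.05 level; the decision to keep or filter a boundary follows the rule given in the text.

```python
import math


def rank_sum_p_value(s1, s2):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    pooled = sorted((v, g) for g, sample in enumerate((s1, s2)) for v in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average = (i + 1 + j) / 2          # mean of the tied ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average
        i = j
    n1, n2 = len(s1), len(s2)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of s1
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def differs_significantly(s1, s2, alpha=0.05):
    """True when the rank-sum p-value is below alpha (0.05 in the text)."""
    return rank_sum_p_value(s1, s2) < alpha


# Hypothetical samples: S1 clearly lower than S2.
s1_example = [1, 2, 3, 4, 5, 6, 7, 8]
s2_example = [11, 12, 13, 14, 15, 16, 17, 18]
significant = differs_significantly(s1_example, s2_example)
```

For small samples an exact test would be more appropriate than the normal approximation; the approximation is used here only to keep the sketch short.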

In step 5, the contact matrix M is a symmetric square matrix, each element of which is called an interaction frequency. The mathematical model between the interaction frequencies M_ij and all possible hierarchical TADs is:

$$Y = XB + N(0, \sigma^2), \quad B \ge 0$$

where Y is the matrix formed from the interaction frequencies of all Hi-C contact matrices, each column of Y being the vector of interaction frequencies in the lower triangular region (including the diagonal) of one contact matrix; X represents the positional relationship between the interaction frequencies and all possible hierarchical TADs; B represents the weight coefficients of all possible hierarchical TADs; N(0, σ²) represents normally distributed noise; and σ is the standard deviation of the normal distribution. The mathematical model can be described specifically as:

where m denotes the dimension of the contact matrix, n denotes the total number of all possible hierarchical TADs, c₁ denotes the first cell line, c₂ denotes the second cell line, K₁ denotes the total number of Hi-C data samples of the first cell line, and K₂ denotes the total number of Hi-C data samples of the second cell line. The entries of Y include the interaction frequency in the m-th row and m-th column of the K₁-th Hi-C contact matrix of the first cell line and the interaction frequency in the m-th row and m-th column of the K₂-th Hi-C contact matrix of the second cell line. Each x_i,z is set to 1 if the corresponding y_i,j lies inside the z-th hierarchical TAD, and to 0 otherwise. In the model, B contains the weight coefficients of the hierarchical TADs, up to the n-th, in the first cell line and in the second cell line.
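The indicator matrix X can be sketched as below. The sketch rests on assumptions not spelled out in the text: every pair of boundary points is treated as one possible hierarchical TAD, a TAD is the half-open bin interval [start, end), and the lower triangle is enumerated row by row. All function names are hypothetical.

```python
from itertools import combinations


def candidate_tads(boundaries):
    """All intervals (start, end) between two boundary points, start < end."""
    return list(combinations(sorted(boundaries), 2))


def lower_triangle_index(m):
    """(row, column) pairs of the lower triangle, diagonal included."""
    return [(i, j) for i in range(m) for j in range(i + 1)]


def design_matrix(m, tads):
    """X[r][z] = 1 if the r-th lower-triangle entry lies inside TAD z, else 0."""
    return [
        [1 if start <= j and i < end else 0 for (start, end) in tads]
        for (i, j) in lower_triangle_index(m)
    ]


# Toy case: 4 bins with boundary points at bins 0, 2 and 4.
tads = candidate_tads([0, 2, 4])
X = design_matrix(4, tads)
```

With this layout each row of X corresponds to one interaction frequency and each column to one candidate TAD, so the product of X with a weight vector gives the modelled frequency for every lower-triangle entry.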

The specific method for solving the objective function in step 6 is as follows:

First, an objective function is established:

where B ≥ 0,

and where K₁ denotes the total number of Hi-C data samples of the first cell line; K₂ denotes the total number of Hi-C data samples of the second cell line; j denotes the sample index; Y_1j denotes the vector of all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix of the first cell line; Y_2j denotes the vector of all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix of the second cell line; B_:,1 denotes the weight vector of the hierarchical TADs of the first cell line; B_:,2 denotes the weight vector of the hierarchical TADs of the second cell line; X represents the positional relationship between the interaction frequencies and all possible hierarchical TADs; and B is the matrix formed by concatenating the vectors B_:,1 and B_:,2. The first term of the objective function is an ordinary mean squared error, which keeps the overall difference between the model-computed and the true interaction frequencies as small as possible; the second term is an L₁ regularization term, which increases the sparsity of the solution; the third term is an L₂ regularization term, which makes the solution smoother and prevents overfitting; the fourth term is the core term of the objective function, whose purpose is to make the overall difference between the vectors B_:,1 and B_:,2 as small as possible.
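Since the four terms are described only verbally here, one possible written-out form consistent with that description is sketched below. The regularization weights λ₁, λ₂ and λ₃, the use of an L₁ norm for the fourth term, and the omission of averaging factors are assumptions made for illustration; they are not taken from the text. The two sums together play the role of the first (squared-error) term.

```latex
\min_{B \ge 0}\;
\sum_{j=1}^{K_1} \lVert Y_{1j} - X B_{:,1} \rVert_2^2
+ \sum_{j=1}^{K_2} \lVert Y_{2j} - X B_{:,2} \rVert_2^2
+ \lambda_1 \lVert B \rVert_1
+ \lambda_2 \lVert B \rVert_2^2
+ \lambda_3 \lVert B_{:,1} - B_{:,2} \rVert_1
```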

The model is solved with the online machine learning algorithm FTRL, whose advantages are that it saves computer memory, improves solving efficiency, and yields sparse solutions.
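To make the online character of FTRL concrete, the following is a minimal per-coordinate FTRL-Proximal sketch for a squared-error linear model. It is not the solver of the invention: the fourth (difference) term of the objective is omitted, the constraint B ≥ 0 is approximated by clipping weights at zero, and the class name, hyperparameter values, and toy data are all assumptions.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for squared-error linear regression."""

    def __init__(self, dim, alpha=0.5, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim   # accumulated adjusted gradients
        self.n = [0.0] * dim   # accumulated squared gradients

    def weights(self):
        w = []
        for z_i, n_i in zip(self.z, self.n):
            if abs(z_i) <= self.l1:
                w.append(0.0)  # the L1 threshold produces exact zeros (sparsity)
            else:
                sign = 1.0 if z_i > 0 else -1.0
                step = (self.beta + math.sqrt(n_i)) / self.alpha + self.l2
                # Assumption: clip at zero as a simple stand-in for B >= 0.
                w.append(max(-(z_i - sign * self.l1) / step, 0.0))
        return w

    def update(self, x, y):
        """Process one sample (x, y); only z and n are kept in memory."""
        w = self.weights()
        error = sum(w_i * x_i for w_i, x_i in zip(w, x)) - y
        for i, x_i in enumerate(x):
            g = error * x_i  # gradient of 0.5 * error**2 with respect to w_i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g


# Toy stream generated by the weights [2, 0].
model = FTRLProximal(dim=2)
for _ in range(200):
    model.update([1, 0], 2.0)
    model.update([0, 1], 0.0)
learned = model.weights()
```

Because each update touches one sample and two small state vectors, the full set of interaction frequencies never has to be held in memory at once, which matches the memory advantage named above; a positive `l1` would additionally drive small weights to exactly zero.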

The specific method for identifying the differential hierarchical TADs from the solution in step 7 is as follows:

The solution consists of two vectors, B_:,1 and B_:,2, where B_i1 is the weight coefficient of the i-th hierarchical TAD in cell line c₁ and B_i2 is the weight coefficient of the i-th hierarchical TAD in cell line c₂. abs(B_i1 - B_i2) represents the degree of difference between cell lines c₁ and c₂ at the i-th hierarchical TAD: the larger its value, the greater the difference between c₁ and c₂ at that TAD.
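The ranking by abs(B_i1 - B_i2) amounts to a few lines of code. The weight values below are made up for illustration, and the function name is hypothetical.

```python
def rank_differential_tads(b1, b2):
    """(index, |B_i1 - B_i2|) pairs, sorted from most to least different."""
    diffs = [(i, abs(x - y)) for i, (x, y) in enumerate(zip(b1, b2))]
    return sorted(diffs, key=lambda item: item[1], reverse=True)


# Hypothetical weight vectors for three candidate TADs.
weights_c1 = [0.5, 2.0, 0.0]
weights_c2 = [0.5, 0.5, 1.0]
ranking = rank_differential_tads(weights_c1, weights_c2)
```

Here the second candidate TAD ranks first because its weights differ most between the two cell lines.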

Compared with the prior art, the invention has the advantages that: based on a large number of biological experiment results, a mathematical model between interaction frequency of a Hi-C contact matrix and hierarchical TADs is provided, an objective function is established, weight coefficients of all hierarchical TADs are obtained by adopting an online machine learning algorithm FTRL, and differential TADs among different cell lines are identified.

Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 shows heat maps of the Hi-C data of cell lines HUVEC and IMR90 at 500 kb resolution.

FIG. 3 shows the actual hierarchical differential TADs and the result obtained by analysis and fitting.

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the drawings and examples.

As shown in FIG. 1, the differential analysis of hierarchical TADs in two cell lines based on the identification of TADs boundary region of the present invention comprises the following steps:

Step 1, normalizing the Hi-C data to eliminate the systematic bias of Hi-C experiments and improve comparability between data sets, using the cross-cell-line normalization followed by the CPM normalization described above.

Step 2, for the normalized Hi-C data, calculating the average interaction frequency between the upstream and downstream regions of each bin, and recording it as binSignal(i). The specific method is as follows:

For a bin with index i, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i), and the window size of the upstream and downstream regions is w. binSignal(i) is calculated according to the following equation:

$$\mathrm{binSignal}(i) = \frac{1}{w^2} \sum_{l=1}^{w} \sum_{m=1}^{w} \mathrm{cont.freq}\big(U_i(l), D_i(m)\big)$$

Step 3, fitting the binSignal sequence with a curve fitting algorithm, and preliminarily taking the local minimum points of the fitted curve as boundary region points of TADs.
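The search for candidate boundaries in step 3 can be sketched as smoothing followed by a local-minimum scan. The text does not name the curve fitting algorithm, so a centered moving average is swapped in here purely as a stand-in; the function names and the toy signal are hypothetical.

```python
def smooth(values, half_width=1):
    """Centered moving average (a stand-in for the unspecified curve fit)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


def local_minima(values):
    """Interior indices lower than the previous value and not above the next."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] <= values[i + 1]
    ]


# Toy binSignal sequence with two dips.
toy_signal = [8, 6, 3, 1, 3, 6, 8, 6, 3, 1, 3, 6, 8]
minima = local_minima(smooth(toy_signal))
```

The indices returned are only preliminary boundary candidates; in the method they are subsequently filtered by the rank-sum test of step 4.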

Step 4, filtering out false positive TAD boundary region points with a rank-sum test to obtain the final TAD boundary region points. The specific method is as follows:

Within a certain range of the TAD boundary region to be tested, collect the interaction frequencies S₁ between the upstream and downstream regions and the interaction frequencies S₂ within the upstream or the downstream region, and use a rank-sum test to check whether S₁ and S₂ differ significantly, where a significant difference means that the p-value of the rank-sum test is less than 0.05. If there is a significant difference, the TAD boundary region is considered a false positive and is filtered out; otherwise it is considered a true positive.

Step 5, obtaining all possible hierarchical TADs from the TAD boundary region points, and establishing a mathematical model between the interaction frequencies M_ij in the Hi-C contact matrix M and all possible hierarchical TADs:

$$Y = XB + N(0, \sigma^2), \quad B \ge 0$$

where m denotes the dimension of the contact matrix, n denotes the total number of all possible hierarchical TADs, c₁ denotes the first cell line, c₂ denotes the second cell line, K₁ denotes the total number of Hi-C data samples of the first cell line, and K₂ denotes the total number of Hi-C data samples of the second cell line. The entries of Y include the interaction frequency in the m-th row and m-th column of the K₂-th Hi-C contact matrix of the second cell line. Each x_i,z is set to 1 if the corresponding y_i,j lies inside the z-th hierarchical TAD, and to 0 otherwise. In the model, B contains the weight coefficients of the hierarchical TADs in the first cell line and in the second cell line.

Step 6, establishing an objective function for the model, and solving it with the online machine learning algorithm FTRL. The specific method is as follows:

First, an objective function is established:

where B ≥ 0,

and where K₁ denotes the total number of Hi-C data samples of the first cell line; K₂ denotes the total number of Hi-C data samples of the second cell line; j denotes the sample index; Y_1j denotes the vector of all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix of the first cell line; Y_2j denotes the vector of all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix of the second cell line; B_:,1 denotes the weight vector of the hierarchical TADs of the first cell line; B_:,2 denotes the weight vector of the hierarchical TADs of the second cell line; X represents the positional relationship between the interaction frequencies and all possible hierarchical TADs; and B is the matrix formed by concatenating the vectors B_:,1 and B_:,2. The first term of the objective function is an ordinary mean squared error, which keeps the overall difference between the model-computed and the true interaction frequencies as small as possible; the second term is an L₁ regularization term, which increases the sparsity of the solution; the third term is an L₂ regularization term, which makes the solution smoother and prevents overfitting; the fourth term is the core term of the objective function, whose purpose is to make the overall difference between the vectors B_:,1 and B_:,2 as small as possible.

Step 7, identifying, from the solution, the hierarchical TADs that differ between cell lines. The specific method is as follows:

The solution consists of two vectors, B_:,1 and B_:,2, where B_i1 is the weight coefficient of the i-th hierarchical TAD in cell line c₁ and B_i2 is the weight coefficient of the i-th hierarchical TAD in cell line c₂. abs(B_i1 - B_i2) represents the degree of difference between cell lines c₁ and c₂ at the i-th hierarchical TAD: the larger its value, the greater the difference between c₁ and c₂ at that TAD.

The procedure of the invention is illustrated below by identifying the hierarchical TADs that differ between the HUVEC and IMR90 cell lines.

(1) The data are Hi-C data of cell lines HUVEC and IMR90 at 500 kb resolution, with two data sets per cell line. Heat maps of the data are shown in FIG. 2: the two heat maps on the left are the HUVEC data, and the two on the right are the IMR90 data.

(2) The data are normalized using the cross-cell-line Hi-C data normalization method MultiHiCcomp followed by the CPM normalization method.

(3) The TAD boundary region points of the normalized data are identified as follows: the value of binSignal(i) is calculated for each bin, curve fitting yields the initial local minimum points, and the rank-sum test filters out false positive TAD boundary region points.

(4) All possible hierarchical TADs are obtained from the TAD boundary region points, and the weight coefficient of each possible hierarchical TAD is solved for.

(5) The final analysis result is obtained from the weight coefficients and visualized. The visualization is shown in FIG. 3: the left panel shows the actual hierarchical differential TADs, and the right panel shows the result obtained by analysis and fitting. The fitted result is close to the actual hierarchical differential TADs.

Claims

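1. A method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning, characterized by comprising the following steps:

step 1, normalizing the Hi-C data to eliminate the systematic bias of Hi-C experiments and improve comparability between data sets;

step 2, for the normalized Hi-C data, calculating the average interaction frequency between the upstream and downstream regions of each bin, and recording it as binSignal(i);

step 3, fitting the sequence binSignal(i) with a curve fitting algorithm, and preliminarily taking the local minimum points of the fitted curve as boundary region points of TADs;

step 4, filtering out false positive TAD boundary region points with a rank-sum test to obtain the final TAD boundary region points;

step 5, obtaining all possible hierarchical TADs from the TAD boundary region points, and establishing a mathematical model between the interaction frequencies M_ij in the Hi-C contact matrix M and all possible hierarchical TADs;

step 6, establishing an objective function for the model, and solving it with the online machine learning algorithm FTRL;

step 7, identifying, from the solution, the hierarchical TADs that differ between cell lines.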

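2. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, wherein the specific method for normalizing the Hi-C data in step 1 is as follows: first, a cross-cell-line Hi-C data normalization method is applied to the Hi-C data of the different cell lines as a preliminary step; the preliminarily processed Hi-C data are then normalized with the CPM (counts per million) method.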

3. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, wherein the specific method for calculating binSignal(i) in step 2 is:

$$\mathrm{binSignal}(i) = \frac{1}{w^2} \sum_{l=1}^{w} \sum_{m=1}^{w} \mathrm{cont.freq}\big(U_i(l), D_i(m)\big)$$

where w is the window size of the upstream and downstream regions, U_i(l) denotes a bin in the upstream region of the bin with index i, D_i(m) denotes a bin in the downstream region of the bin with index i, cont.freq denotes the interaction frequency between an upstream-region bin and a downstream-region bin, l denotes the index of an element in the upstream window, and m denotes the index of an element in the downstream window.

4. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, wherein the specific method for filtering false positive TAD boundary region points with the rank-sum test in step 4 is:

within a certain range of the TAD boundary region to be tested, collecting the interaction frequencies S₁ between the upstream and downstream regions and the interaction frequencies S₂ within the upstream or the downstream region, and using a rank-sum test to check whether S₁ and S₂ differ significantly; if there is a significant difference, the TAD boundary region is considered a false positive and is filtered out, otherwise it is considered a true positive; a significant difference means that the p-value of the rank-sum test is less than 0.05.

5. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, wherein the contact matrix M in step 5 is a symmetric square matrix, each element of which is called an interaction frequency, and the mathematical model between the interaction frequencies M_ij and all possible hierarchical TADs is:

$$Y = XB + N(0, \sigma^2), \quad B \ge 0$$

where Y is the matrix formed from the interaction frequencies of all Hi-C contact matrices, each column of Y being the vector of interaction frequencies in the lower triangular region (including the diagonal) of one contact matrix; X represents the positional relationship between the interaction frequencies and all possible hierarchical TADs; B represents the weight coefficients of all possible hierarchical TADs; N(0, σ²) represents normally distributed noise; and σ is the standard deviation of the normal distribution.

6. The hierarchical TADs difference analysis method in the Hi-C contact matrix based on online machine learning according to claim 5, wherein the mathematical model is specifically described as:

7. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 5 or 6, wherein the specific method for solving the objective function in step 6 is:

first, an objective function is established:

where B ≥ 0,

and where K₁ denotes the total number of Hi-C data samples of the first cell line; K₂ denotes the total number of Hi-C data samples of the second cell line; j denotes the sample index; Y_1j denotes the vector of all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix of the first cell line; Y_2j denotes the vector of all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix of the second cell line; B_:,1 denotes the weight vector of the hierarchical TADs of the first cell line; B_:,2 denotes the weight vector of the hierarchical TADs of the second cell line; X represents the positional relationship between the interaction frequencies and all possible hierarchical TADs; and B is the matrix formed by concatenating the vectors B_:,1 and B_:,2; the first term of the objective function is an ordinary mean squared error, which keeps the overall difference between the model-computed and the true interaction frequencies as small as possible; the second term is an L₁ regularization term, which increases the sparsity of the solution; the third term is an L₂ regularization term, which makes the solution smoother and prevents overfitting; the fourth term is the core term of the objective function, whose purpose is to make the overall difference between the vectors B_:,1 and B_:,2 as small as possible.

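8. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 7, wherein the specific method for identifying the differential hierarchical TADs from the solution in step 7 is: the solution consists of two vectors, B_:,1 and B_:,2, where B_i1 is the weight coefficient of the i-th hierarchical TAD in cell line c₁ and B_i2 is the weight coefficient of the i-th hierarchical TAD in cell line c₂; abs(B_i1 - B_i2) represents the degree of difference between cell lines c₁ and c₂ at the i-th hierarchical TAD, and the larger its value, the greater the difference between c₁ and c₂ at that TAD.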