CN110097922A

CN110097922A - Method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning

Info

Publication number: CN110097922A
Application number: CN201910315741.1A
Authority: CN
Inventors: 吕红强; 刘聪毅; 韩九强
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2019-04-19
Filing date: 2019-04-19
Publication date: 2019-08-06
Anticipated expiration: 2039-04-19
Also published as: CN110097922B

Abstract

A method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning. Hi-C data are standardized to eliminate systematic experimental bias and to improve comparability between data sets. From the standardized data, the average interaction frequency between the upstream and downstream regions of each bin is calculated and denoted binSignal(i). The binSignal sequence is fitted with a curve and a rank-sum test is applied to obtain the boundary region points of TADs. All possible hierarchical TADs are derived from the boundary region points, and a mathematical model relating the interaction frequencies in the Hi-C contact matrix to all possible hierarchical TADs is proposed. An objective function is established for the model, and the online machine learning algorithm FTRL is used, for the first time, to solve the hierarchical TAD differential analysis model and identify the hierarchical TADs that differ between cell lines. The invention thus proposes a mathematical model relating the interaction frequencies in the Hi-C contact matrix to hierarchical TADs, obtains the weight coefficients of all hierarchical TADs with FTRL, and identifies the TADs that differ between cell lines.

Description

Method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning

Technical field

The invention belongs to the field of biotechnology and relates to the differential analysis of hierarchical TADs across cell lines, in particular to a method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning.

Background art

Hi-C is a high-throughput chromatin conformation capture technique; a Hi-C experiment yields interaction information between any pair of loci across the whole genome. Hi-C data are the data obtained from a Hi-C experiment and generally take the form of a matrix, called the contact matrix. The contact matrix is a symmetric square matrix, and each of its elements is called an interaction frequency. As Hi-C technology developed, researchers studying Hi-C data found that each chromosome can be roughly divided into two compartments with active and inactive chromatin states (A/B compartments): chromatin in compartment A is active and chromatin in compartment B is inactive. Building on the discovery of these two compartment types, researchers found that, at higher resolution, the compartments contain genomic regions with higher interaction strength, called topologically associating domains (TADs). The strength of interactions between loci inside a TAD is significantly higher than the strength of interactions with loci outside it.

A large number of biological experiments have shown that TADs are basic functional units in the regulation of gene transcription and expression. During gene regulation, TADs constrain the regulatory action of enhancers and promoters; in addition, disruption of TAD boundaries may lead to disease, such as cancer. Studies have shown that most TADs have a hierarchical structure and that only a very small number of TADs are independent. Differential analysis of hierarchical TADs therefore helps to understand in depth the mechanisms of gene expression and cell differentiation.

Existing methods for analyzing differences between TADs do not adequately account for the hierarchical structure of TADs, which impairs the identification of differential TADs.

Summary of the invention

To overcome the above shortcomings of the prior art, the object of the present invention is to provide a method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning, which performs differential analysis of hierarchical TADs on Hi-C data from different cell lines.

To achieve this object, the technical solution adopted by the present invention is as follows:

A method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning, characterized by comprising the following steps:

Step 1: standardize the Hi-C data to eliminate the systematic bias of the Hi-C experiments and improve comparability between data sets;

Step 2: from the standardized Hi-C data, calculate the average interaction frequency between the upstream and downstream regions of each bin, denoted binSignal(i);

Step 3: fit the binSignal sequence with a curve fitting algorithm and provisionally treat the local minimum points of the fitted curve as boundary region points of TADs;

Step 4: filter out false-positive TAD boundary region points with a rank-sum test to obtain the final boundary region points of TADs;

Step 5: derive all possible hierarchical TADs from the boundary region points of TADs, and propose a mathematical model relating the interaction frequencies M_ij in the Hi-C contact matrix M to all possible hierarchical TADs;

Step 6: establish the objective function of the model and solve it with the online machine learning algorithm FTRL;

Step 7: based on the solution, identify the hierarchical TADs that differ between cell lines.

The Hi-C data are standardized in step 1 as follows:

First, multiHiCcompare, a Hi-C data normalization method that works across cell lines, is used for preliminary processing of the Hi-C data from the different cell lines, to eliminate the systematic bias between cell lines as far as possible. The preliminarily processed Hi-C data are then processed with the normalization method CPM (counts per million), to further improve the comparability of the Hi-C data between cell lines.
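The CPM stage can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name `cpm_normalize` is hypothetical, the matrix is a plain list of lists, and the library size is assumed to be the sum over the upper triangle including the diagonal, which the text does not specify. The multiHiCcompare stage is an R/Bioconductor package and is not reproduced here.

```python
def cpm_normalize(matrix):
    """Scale a symmetric contact matrix to counts per million (CPM)."""
    n = len(matrix)
    # Assumption: library size = sum over the upper triangle (diagonal
    # included), so each symmetric pair of bins is counted only once.
    total = sum(matrix[i][j] for i in range(n) for j in range(i, n))
    if total == 0:
        raise ValueError("contact matrix has no counts")
    scale = 1_000_000 / total
    return [[value * scale for value in row] for row in matrix]


# Toy 3-bin contact matrix (made-up counts, for illustration only).
raw = [
    [10, 4, 1],
    [4, 8, 2],
    [1, 2, 5],
]
normalized = cpm_normalize(raw)
```

After this scaling, two samples sequenced to different depths have the same total count, which is what makes their interaction frequencies directly comparable.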

binSignal(i) is calculated in step 2 as follows:

For the bin with index i, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i). The window size of the upstream and downstream regions is w, and binSignal(i) is calculated according to the following formula:

binSignal(i) = (1/w²) · Σ_{l=1}^{w} Σ_{m=1}^{w} cont.freq(U_i(l), D_i(m))

where U_i(l) denotes a bin in the upstream region of the bin with index i, D_i(m) denotes a bin in the downstream region of the bin with index i, cont.freq denotes the interaction frequency between an upstream bin and a downstream bin, l is the index of an element in the upstream window, and m is the index of an element in the downstream window.
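A short Python sketch of the binSignal calculation follows. It assumes 0-based bin indices, an upstream window that ends at bin i and a downstream window that starts at bin i+1, and windows truncated at the ends of the chromosome; none of these conventions is stated in the text, and the function name `bin_signal` is hypothetical.

```python
def bin_signal(matrix, w):
    """Average upstream-downstream interaction frequency for every bin."""
    n = len(matrix)
    signal = []
    for i in range(n):
        # Assumed windows: upstream = bins i-w+1 .. i, downstream = i+1 .. i+w,
        # truncated where they would run past the ends of the matrix.
        upstream = range(max(0, i - w + 1), i + 1)
        downstream = range(i + 1, min(n, i + w + 1))
        pairs = [matrix[u][d] for u in upstream for d in downstream]
        # The last bin has no downstream region; report 0.0 for it.
        signal.append(sum(pairs) / len(pairs) if pairs else 0.0)
    return signal


# Toy matrix with two 3-bin domains: 10 inside a domain, 1 across domains.
toy = [[10 if (i < 3) == (j < 3) else 1 for j in range(6)] for i in range(6)]
signal = bin_signal(toy, w=2)
```

In this toy example the signal dips at bin 2, the last bin of the first domain, because every upstream-downstream pair around it crosses the domain boundary.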

False-positive TAD boundary region points are filtered out with the rank-sum test in step 4 as follows:

Within a certain range of the TAD boundary region to be tested, collect the interaction frequencies S₁ between the upstream and downstream regions, and the interaction frequencies S₂ within the upstream region or within the downstream region. A rank-sum test is used to check whether S₁ and S₂ differ significantly, where a significant difference means that the p-value of the rank-sum test is less than 0.05. If there is a significant difference, the TAD boundary region is considered a false positive and is filtered out; otherwise, the TAD boundary region is considered a true positive.
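The test can be illustrated with a self-contained Python sketch of the two-sided Wilcoxon rank-sum test. It uses the normal approximation with mid-ranks for ties and no tie correction of the variance, which is an assumption; the text names only a rank-sum test. The decision rule in `is_false_positive` follows the wording of the text exactly, and both function names are hypothetical.

```python
import math


def rank_sum_p_value(sample_a, sample_b):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    combined = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    rank_sum_a = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # average of the tied ranks i+1 .. j
        rank_sum_a += mid_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    mean = n_a * (n_a + n_b + 1) / 2
    std = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (rank_sum_a - mean) / std
    return math.erfc(abs(z) / math.sqrt(2))


def is_false_positive(s1, s2, alpha=0.05):
    """Decision rule as worded in the text: a significant difference
    between S1 and S2 marks the boundary region as a false positive."""
    return rank_sum_p_value(s1, s2) < alpha


# Made-up interaction frequencies, for illustration only.
separated = is_false_positive([1, 2, 3, 4, 5, 6, 7, 8],
                              [11, 12, 13, 14, 15, 16, 17, 18])
identical = is_false_positive([1, 2, 3, 4, 5, 6, 7, 8],
                              [1, 2, 3, 4, 5, 6, 7, 8])
```

For small samples an exact test would be more appropriate than the normal approximation used here.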

The contact matrix M in step 5 is a symmetric square matrix, and each of its elements is called an interaction frequency. The mathematical model relating the interaction frequencies M_ij to all possible hierarchical TADs is:

Y = XB + N(0, σ²), subject to B ≥ 0

where Y is the matrix composed of the interaction frequencies of all Hi-C contact matrices, each column of Y being the vector formed by the interaction frequencies in the lower triangular region (including the diagonal) of one contact matrix; X encodes the positional relationship between the interaction frequencies and all possible hierarchical TADs; B contains the weight coefficients of all possible hierarchical TADs; N(0, σ²) denotes noise drawn from a normal distribution; and σ is the standard deviation of that normal distribution.

Written out element by element, the model uses the following notation: m denotes the dimension of the contact matrix, n the total number of all possible hierarchical TADs, c₁ the first cell line, c₂ the second cell line, K₁ the total number of Hi-C data samples under the first cell line, and K₂ the total number of Hi-C data samples under the second cell line. The entries of Y run up to the interaction frequency at row m, column m of the K₁-th Hi-C contact matrix under the first cell line and of the K₂-th Hi-C contact matrix under the second cell line. The element x_{i,z} is set to 1 if its corresponding y_{i,j} lies inside the z-th hierarchical TAD, and to 0 otherwise. The entries of B run up to the weight coefficient of the n-th hierarchical TAD under the first cell line and under the second cell line.
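To make the role of X concrete, the Python sketch below enumerates candidate hierarchical TADs and builds the 0/1 matrix. It rests on two assumptions that the text does not spell out: every pair of boundary points (s, e) with s < e defines one candidate TAD covering bins s to e-1, and the lower triangle is vectorized row by row. All function names are hypothetical.

```python
from itertools import combinations


def candidate_tads(boundaries):
    """Assumption: each pair of boundary points spans one candidate TAD."""
    return list(combinations(sorted(boundaries), 2))


def lower_triangle_index(m):
    """(row, column) pairs of the lower triangle, diagonal included."""
    return [(r, c) for r in range(m) for c in range(r + 1)]


def design_matrix(m, tads):
    """X[i][z] = 1 if the i-th lower-triangle entry lies inside TAD z."""
    return [
        # Since c <= r, the entry is inside [s, e) iff s <= c and r < e.
        [1 if s <= c and r < e else 0 for (s, e) in tads]
        for (r, c) in lower_triangle_index(m)
    ]


# Toy example: a 4-bin matrix with boundary points at bins 0, 2 and 4.
tads = candidate_tads([0, 2, 4])
X = design_matrix(4, tads)
```

Here the candidate (0, 4) is the parent domain and (0, 2) and (2, 4) are nested inside it, so an entry such as (3, 2) belongs to both the parent and the second child.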

The objective function is solved in step 6 as follows:

The objective function is established first. In it, K₁ denotes the total number of Hi-C data samples under the first cell line, K₂ the total number of Hi-C data samples under the second cell line, and j the index of a sample. Y_1j is the vector formed by all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix under the first cell line, and Y_2j is the corresponding vector for the j-th Hi-C contact matrix under the second cell line. B_{:,1} is the weight vector of the hierarchical TADs for the first cell line and B_{:,2} is the weight vector for the second cell line. X encodes the positional relationship between the interaction frequencies and all possible hierarchical TADs, and B is the matrix formed by concatenating the vectors B_{:,1} and B_{:,2}.

The first term of the objective function is the ordinary mean squared error; its purpose is to make the overall gap between the interaction frequencies calculated from the model and the true interaction frequencies as small as possible. The second term is an l₁ regularization term, which increases the sparsity of the solution. The third term is an l₂ regularization term, which makes the solution smoother and prevents overfitting. The fourth term is the core term of the loss function; through optimization it makes the overall difference between the vectors B_{:,1} and B_{:,2} as small as possible.
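The four terms described above can be evaluated with the Python sketch below. The formula itself is not reproduced in the text, so the exact form here is an assumption: squared errors are summed over samples without averaging, the fourth term is taken as the l₁ norm of B_{:,1} − B_{:,2}, and the weights `lam1`, `lam2`, `lam3` are hypothetical hyperparameters.

```python
def predict(X, b):
    """Model prediction X @ b for one cell line."""
    return [sum(x * w for x, w in zip(row, b)) for row in X]


def squared_error(y, y_hat):
    return sum((a - p) ** 2 for a, p in zip(y, y_hat))


def objective(X, Y1, Y2, b1, b2, lam1, lam2, lam3):
    """Assumed form: fit error + l1 term + l2 term + between-line difference."""
    fit = sum(squared_error(y, predict(X, b1)) for y in Y1)
    fit += sum(squared_error(y, predict(X, b2)) for y in Y2)
    l1 = sum(abs(w) for w in b1 + b2)                  # sparsity term
    l2 = sum(w * w for w in b1 + b2)                   # smoothness term
    diff = sum(abs(u - v) for u, v in zip(b1, b2))     # assumed fourth term
    return fit + lam1 * l1 + lam2 * l2 + lam3 * diff


# Toy data: two lower-triangle entries, two candidate TADs, one sample each.
X_toy = [[1, 0], [1, 1]]
value = objective(X_toy, Y1=[[2, 3]], Y2=[[2, 2]],
                  b1=[2, 1], b2=[2, 0], lam1=0.1, lam2=0.01, lam3=1.0)
```

Because the fourth term penalizes differences between the two weight vectors, only differences that the data support strongly survive the optimization.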

Solving the model with the online machine learning algorithm FTRL has the following advantages: it saves computer memory, improves solving efficiency, and can produce sparse solutions.
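The memory and sparsity properties can be seen in a minimal Python sketch of an FTRL solver. It assumes the widely used per-coordinate FTRL-Proximal update with a squared loss, handles one cell line only (the coupling term between cell lines is omitted), and enforces B ≥ 0 by clipping; all of these are assumptions, since the text names FTRL without giving its update rule. The class name and the hyperparameters `alpha` and `beta` are hypothetical.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for squared loss (illustrative sketch)."""

    def __init__(self, n_features, alpha=0.5, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * n_features  # accumulated adjusted gradients
        self.n = [0.0] * n_features  # accumulated squared gradients

    def weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # the l1 threshold yields exact zeros (sparsity)
        step = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        w = -(z - math.copysign(self.l1, z)) / step
        return max(w, 0.0)  # assumption: clip to satisfy B >= 0

    def weights(self):
        return [self.weight(i) for i in range(len(self.z))]

    def update(self, x, y):
        """Process one (features, target) pair; only O(n_features) state."""
        w = [self.weight(i) for i in range(len(x))]
        error = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            if xi == 0:
                continue
            g = error * xi
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g


# Toy problem whose exact non-negative solution is weights (2, 0).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 2.0)]
model = FTRLProximal(n_features=2)
for _ in range(500):
    for x, y in samples:
        model.update(x, y)
learned = model.weights()
```

Each update touches one row of X at a time, so the full matrix Y never has to be held in memory, and a positive `l1` sets weakly supported weights to exactly zero.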

Differential hierarchical TADs are identified from the solution in step 7 as follows:

The solution consists of two vectors, B_{:,1} and B_{:,2}, where B_i1 is the weight coefficient of the i-th hierarchical TAD under cell line c₁ and B_i2 is the weight coefficient of the i-th hierarchical TAD under cell line c₂. abs(B_i1 − B_i2) represents the degree of difference between cell lines c₁ and c₂ at the i-th hierarchical TAD: the larger the value of abs(B_i1 − B_i2), the greater the difference between c₁ and c₂ at the i-th hierarchical TAD.
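The ranking step is simple enough to show directly in Python. The function name `rank_differential_tads` and the toy weights are illustrative only; the text does not state a cut-off for calling a TAD differential, so the sketch only sorts.

```python
def rank_differential_tads(b1, b2):
    """Sort hierarchical TADs by abs(B_i1 - B_i2), largest difference first."""
    diffs = [abs(u - v) for u, v in zip(b1, b2)]
    order = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return [(i, diffs[i]) for i in order]


# Made-up weight vectors for three candidate hierarchical TADs.
ranked = rank_differential_tads([0.9, 0.1, 0.5], [0.2, 0.1, 0.6])
```

The first element of `ranked` is the candidate TAD whose weight differs most between the two cell lines.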

Compared with the prior art, the advantages of the present invention are as follows: based on a large body of biological experimental results, a mathematical model relating the interaction frequencies of the Hi-C contact matrix to hierarchical TADs is proposed, an objective function is established, and the online machine learning algorithm FTRL is used to obtain the weight coefficients of all hierarchical TADs and identify the TADs that differ between cell lines.

Description of the drawings

Fig. 1 is a flow chart of the present invention.

Fig. 2 shows heat maps of the Hi-C data at 500 kb resolution for the cell lines HUVEC and IMR90.

Fig. 3 shows the actual hierarchical differential TADs and the result obtained by analysis and fitting.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to the drawings and examples.

As shown in Fig. 1, the differential analysis of hierarchical TADs between two cell lines, based on the identification of TAD boundary regions, comprises the following steps:

Step 1: standardize the Hi-C data to eliminate the systematic bias of the Hi-C experiments and improve comparability between data sets. As described above, multiHiCcompare is applied first, followed by CPM.

Step 2: from the standardized Hi-C data, calculate the average interaction frequency between the upstream and downstream regions of each bin, denoted binSignal(i), as follows:

For the bin with index i, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i). The window size of the upstream and downstream regions is w, and binSignal(i) is calculated according to the following formula:

binSignal(i) = (1/w²) · Σ_{l=1}^{w} Σ_{m=1}^{w} cont.freq(U_i(l), D_i(m))

Step 3: fit the binSignal sequence with a curve fitting algorithm and provisionally treat the local minimum points of the fitted curve as boundary region points of TADs.
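Step 3 can be sketched in Python as follows. The text does not name the curve fitting algorithm, so a centered moving average is swapped in here as a stand-in smoother; the function names, the `half_width` parameter and the toy signal are all hypothetical.

```python
def moving_average(values, half_width=1):
    """Stand-in for the unspecified curve fit: centered moving average."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def local_minima(values):
    """Indices whose value is strictly below both neighbours."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    ]


# Made-up binSignal sequence with two dips.
toy_signal = [9.0, 8.0, 3.0, 2.0, 3.5, 8.5, 9.0, 4.0, 1.0, 4.5, 9.0]
candidates = local_minima(moving_average(toy_signal))
```

The indices in `candidates` are only provisional boundary region points; step 4 decides which of them are kept.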

Step 4: filter out false-positive TAD boundary region points with a rank-sum test to obtain the final boundary region points of TADs, as follows:

Within a certain range of the TAD boundary region to be tested, collect the interaction frequencies S₁ between the upstream and downstream regions, and the interaction frequencies S₂ within the upstream region or within the downstream region. A rank-sum test is used to check whether S₁ and S₂ differ significantly, where a significant difference means that the p-value of the rank-sum test is less than 0.05. If there is a significant difference, the TAD boundary region is considered a false positive and is filtered out; otherwise, the TAD boundary region is considered a true positive.

Step 5: derive all possible hierarchical TADs from the boundary region points of TADs, and establish a mathematical model relating the interaction frequencies M_ij in the Hi-C contact matrix M to all possible hierarchical TADs, with the formula:

Y = XB + N(0, σ²), subject to B ≥ 0

where m denotes the dimension of the contact matrix, n the total number of all possible hierarchical TADs, c₁ the first cell line, c₂ the second cell line, K₁ the total number of Hi-C data samples under the first cell line, and K₂ the total number of Hi-C data samples under the second cell line. The entries of Y run up to the interaction frequency at row m, column m of the K₁-th Hi-C contact matrix under the first cell line and of the K₂-th Hi-C contact matrix under the second cell line. The element x_{i,z} is set to 1 if its corresponding y_{i,j} lies inside the z-th hierarchical TAD, and to 0 otherwise. The entries of B run up to the weight coefficient of the n-th hierarchical TAD under the first cell line and under the second cell line.

Step 6: establish the objective function of the model and solve it with the online machine learning algorithm FTRL, as follows:

The objective function is established first. In it, K₁ denotes the total number of Hi-C data samples under the first cell line, K₂ the total number of Hi-C data samples under the second cell line, and j the index of a sample. Y_1j is the vector formed by all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix under the first cell line, and Y_2j is the corresponding vector for the j-th Hi-C contact matrix under the second cell line. B_{:,1} is the weight vector of the hierarchical TADs for the first cell line and B_{:,2} is the weight vector for the second cell line. X encodes the positional relationship between the interaction frequencies and all possible hierarchical TADs, and B is the matrix formed by concatenating the vectors B_{:,1} and B_{:,2}.

The first term of the objective function is the ordinary mean squared error; its purpose is to make the overall gap between the interaction frequencies calculated from the model and the true interaction frequencies as small as possible. The second term is an l₁ regularization term, which increases the sparsity of the solution. The third term is an l₂ regularization term, which makes the solution smoother and prevents overfitting. The fourth term is the core term of the objective function; through optimization it makes the overall difference between the vectors B_{:,1} and B_{:,2} as small as possible.

Step 7: based on the solution, identify the hierarchical TADs that differ between cell lines, as follows:

The solution consists of two vectors, B_{:,1} and B_{:,2}, where B_i1 is the weight coefficient of the i-th hierarchical TAD under cell line c₁ and B_i2 is the weight coefficient of the i-th hierarchical TAD under cell line c₂. abs(B_i1 − B_i2) represents the degree of difference between cell lines c₁ and c₂ at the i-th hierarchical TAD: the larger the value of abs(B_i1 − B_i2), the greater the difference between c₁ and c₂ at the i-th hierarchical TAD.

The procedure of the invention is illustrated below by identifying the hierarchical TADs that differ between the cell lines HUVEC and IMR90.

(1) The data are Hi-C data at 500 kb resolution for the cell lines HUVEC and IMR90, with two data sets prepared for each cell line. The heat maps of the data are shown in Fig. 2: the two heat maps on the left are the HUVEC data and the two on the right are the IMR90 data.

(2) The data are standardized, using the cross-cell-line Hi-C data normalization method multiHiCcompare and the normalization method CPM.

(3) TAD boundary region points are identified in the standardized data. This comprises calculating binSignal(i) for each bin, fitting a curve to binSignal to obtain preliminary local minimum points, and filtering out false-positive TAD boundary region points with the rank-sum test.

(4) All possible hierarchical TADs are derived from the TAD boundary region points, and the weight coefficient of each possible hierarchical TAD is solved for.

(5) The final analysis result is obtained from the weight coefficients and visualized. The visualization is shown in Fig. 3, where the left panel shows the actual hierarchical differential TADs and the right panel shows the result obtained by analysis and fitting. The result obtained by analysis and fitting is close to the actual hierarchical differential TADs.

Claims

1. A method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning, characterized by comprising the following steps:

Step 1: standardize the Hi-C data to eliminate the systematic bias of the Hi-C experiments and improve comparability between data sets;

Step 2: from the standardized Hi-C data, calculate the average interaction frequency between the upstream and downstream regions of each bin, denoted binSignal(i);

Step 3: fit the binSignal sequence with a curve fitting algorithm and provisionally treat the local minimum points of the fitted curve as boundary region points of TADs;

Step 4: filter out false-positive TAD boundary region points with a rank-sum test to obtain the final boundary region points of TADs;

Step 5: derive all possible hierarchical TADs from the boundary region points of TADs, and establish a mathematical model relating the interaction frequencies M_ij in the Hi-C contact matrix M to all possible hierarchical TADs;

2. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, characterized in that the Hi-C data are standardized in step 1 as follows:

First, multiHiCcompare, a Hi-C data normalization method that works across cell lines, is used for preliminary processing of the Hi-C data from the different cell lines, to eliminate the systematic bias between cell lines as far as possible. The preliminarily processed Hi-C data are then processed with the normalization method CPM (counts per million), to further improve the comparability of the Hi-C data between cell lines.

3. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, characterized in that binSignal(i) is calculated in step 2 as follows:

For the bin with index i, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i). The window size of the upstream and downstream regions is w, and binSignal(i) is calculated according to the following formula:

binSignal(i) = (1/w²) · Σ_{l=1}^{w} Σ_{m=1}^{w} cont.freq(U_i(l), D_i(m))

where U_i(l) denotes a bin in the upstream region of the bin with index i, D_i(m) denotes a bin in the downstream region of the bin with index i, cont.freq denotes the interaction frequency between an upstream bin and a downstream bin, l is the index of an element in the upstream window, and m is the index of an element in the downstream window.

4. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, characterized in that false-positive TAD boundary region points are filtered out with the rank-sum test in step 4 as follows:

Within a certain range of the TAD boundary region to be tested, collect the interaction frequencies S₁ between the upstream and downstream regions, and the interaction frequencies S₂ within the upstream region or within the downstream region. A rank-sum test is used to check whether S₁ and S₂ differ significantly. If there is a significant difference, the TAD boundary region is considered a false positive and is filtered out; otherwise, the TAD boundary region is considered a true positive.

5. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 4, characterized in that a significant difference means that the p-value of the rank-sum test is less than 0.05.

6. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, characterized in that the contact matrix M in step 5 is a symmetric square matrix, each element of which is called an interaction frequency, and the mathematical model relating the interaction frequencies M_ij to all possible hierarchical TADs is:

Y = XB + N(0, σ²), subject to B ≥ 0

where Y is the matrix composed of the interaction frequencies of all Hi-C contact matrices, each column of Y being the vector formed by the interaction frequencies in the lower triangular region (including the diagonal) of one contact matrix; X encodes the positional relationship between the interaction frequencies and all possible hierarchical TADs; B contains the weight coefficients of all possible hierarchical TADs; N(0, σ²) denotes noise drawn from a normal distribution; and σ is the standard deviation of that normal distribution.

7. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 6, characterized in that the mathematical model is described in detail with the following notation:

m denotes the dimension of the contact matrix, n the total number of all possible hierarchical TADs, c₁ the first cell line, c₂ the second cell line, K₁ the total number of Hi-C data samples under the first cell line, and K₂ the total number of Hi-C data samples under the second cell line. The entries of Y run up to the interaction frequency at row m, column m of the K₁-th Hi-C contact matrix under the first cell line and of the K₂-th Hi-C contact matrix under the second cell line. The element x_{i,z} is set to 1 if its corresponding y_{i,j} lies inside the z-th hierarchical TAD, and to 0 otherwise. The entries of B run up to the weight coefficient of the n-th hierarchical TAD under the first cell line and under the second cell line.

8. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 6 or 7, characterized in that the objective function is solved in step 6 as follows:

The objective function is established first. In it, K₁ denotes the total number of Hi-C data samples under the first cell line, K₂ the total number of Hi-C data samples under the second cell line, and j the index of a sample. Y_1j is the vector formed by all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix under the first cell line, and Y_2j is the corresponding vector for the j-th Hi-C contact matrix under the second cell line. B_{:,1} is the weight vector of the hierarchical TADs for the first cell line and B_{:,2} is the weight vector for the second cell line. X encodes the positional relationship between the interaction frequencies and all possible hierarchical TADs, and B is the matrix formed by concatenating the vectors B_{:,1} and B_{:,2}. The first term of the objective function is the ordinary mean squared error; its purpose is to make the overall gap between the interaction frequencies calculated from the model and the true interaction frequencies as small as possible. The second term is an l₁ regularization term, which increases the sparsity of the solution. The third term is an l₂ regularization term, which makes the solution smoother and prevents overfitting. The fourth term is the core term of the loss function; through optimization it makes the overall difference between the vectors B_{:,1} and B_{:,2} as small as possible.

9. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 8, characterized in that differential hierarchical TADs are identified from the solution in step 7 as follows:

The solution consists of two vectors, B_{:,1} and B_{:,2}, where B_i1 is the weight coefficient of the i-th hierarchical TAD under cell line c₁ and B_i2 is the weight coefficient of the i-th hierarchical TAD under cell line c₂. abs(B_i1 − B_i2) represents the degree of difference between cell lines c₁ and c₂ at the i-th hierarchical TAD: the larger the value of abs(B_i1 − B_i2), the greater the difference between c₁ and c₂ at the i-th hierarchical TAD.