CN110097922A - Hierarchical TADs difference analysis method in Hi-C contact matrix based on online machine learning - Google Patents
- Publication number
- CN110097922A (CN201910315741.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
Abstract
A method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning. Hi-C data are normalized to remove experimental systematic bias and to improve comparability between datasets. For the normalized data, the average interaction frequency between the upstream and downstream regions of each bin is computed and denoted binSignal(i). The binSignal sequence is curve-fitted and a rank-sum test is applied to obtain the TAD boundary points. All possible hierarchical TADs are derived from the boundary points, and a mathematical model is proposed that relates the interaction frequencies in the Hi-C contact matrix to all possible hierarchical TADs. An objective function is established for the model and, for the first time, the online machine learning algorithm FTRL (Follow-the-Regularized-Leader) is used to solve the hierarchical TAD differential analysis model, yielding the weight coefficients of all hierarchical TADs and identifying the hierarchical TADs that differ between cell lines.
Description
Technical field
The invention belongs to the field of biotechnology and relates to the differential analysis of hierarchical TADs across cell lines, in particular to a method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning.
Background art
Hi-C is a high-throughput chromatin conformation capture technique. A Hi-C experiment yields interaction information between arbitrary loci across the whole genome. Hi-C data, the data obtained from a Hi-C experiment, usually take the form of a matrix called the contact matrix. The contact matrix is a symmetric square matrix, and each of its elements is called an interaction frequency. As Hi-C technology developed, researchers studying Hi-C data found that each chromosome can be broadly divided into two compartments (A/B compartments) according to chromatin state: chromatin in the A compartment is active, and chromatin in the B compartment is inactive. Building on the discovery of these two compartment types, researchers working at higher resolution further found genomic regions within compartments that have higher interaction strength, called topologically associating domains (TADs). The interaction strength between loci inside a TAD is much higher than the strength of their interactions with loci outside it.
A large number of biological experiments have shown that TADs are basic functional units in the regulation of gene transcription and expression. During gene regulation, TADs constrain the regulatory interactions between enhancers and promoters. In addition, disruption of TAD boundaries may lead to disease, such as cancer. Studies have shown that most TADs have a hierarchical structure and only a small minority are independent. Differential analysis of hierarchical TADs can therefore deepen the understanding of the mechanisms of gene expression and cell differentiation.
Existing methods for analyzing differences between TADs give insufficient consideration to the hierarchical structure of TADs, which impairs the identification of differential TADs.
Summary of the invention
To overcome the shortcomings of the prior art described above, the object of the present invention is to provide a method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning, which performs hierarchical TAD differential analysis on Hi-C data from different cell lines.
To achieve this object, the technical solution adopted by the present invention is as follows:
A method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning, characterized by comprising the following steps:
Step 1, normalize the Hi-C data to remove the systematic bias of the Hi-C experiments and to improve comparability between datasets;
Step 2, for the normalized Hi-C data, compute the average interaction frequency between the upstream and downstream regions of each bin, denoted binSignal(i);
Step 3, fit the binSignal sequence with a curve-fitting algorithm, and provisionally treat the local minimum points of the fitted curve as TAD boundary points;
Step 4, filter out false-positive TAD boundary points with a rank-sum test to obtain the final TAD boundary points;
Step 5, obtain all possible hierarchical TADs from the TAD boundary points, and establish a mathematical model between the interaction frequencies $M_{ij}$ in the Hi-C contact matrix $M$ and all possible hierarchical TADs;
Step 6, establish the objective function of the model and solve it with the online machine learning algorithm FTRL;
Step 7, based on the solution, identify the hierarchical TADs that differ between cell lines.
The specific method for normalizing the Hi-C data in step 1 is as follows:
First, the cross-cell-line Hi-C data normalization method multiHiCcompare is applied to the Hi-C data from the different cell lines as a preliminary treatment, to remove systematic bias between cell lines as far as possible. Then the data normalization method CPM (counts per million) is applied to the pre-processed Hi-C data, to further improve the comparability of Hi-C data between cell lines.
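The CPM step can be illustrated with a minimal sketch. The function name `cpm_normalize` is hypothetical, and the choice to take the library size as the sum over the full matrix is an assumption; the source does not say whether the total is taken over the full matrix or over one triangle. The multiHiCcompare step is not reproduced here.

```python
def cpm_normalize(matrix):
    """Scale a contact matrix so that its counts sum to one million.

    Assumption: the library size is the sum over the full matrix.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("contact matrix has no counts")
    scale = 1_000_000 / total
    return [[value * scale for value in row] for row in matrix]
```

After this scaling, two samples of the same contact pattern sequenced at different depths give the same normalized matrix, which is the comparability the step is meant to provide.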
The specific method for computing binSignal(i) in step 2 is as follows:
For the bin with index $i$, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i). With a window size of $w$ for the upstream and downstream regions, binSignal(i) is computed as:
$$\mathrm{binSignal}(i) = \frac{1}{w^2} \sum_{l=1}^{w} \sum_{m=1}^{w} \mathrm{cont.freq}\big(U_i(l), D_i(m)\big)$$
where $U_i(l)$ denotes a bin in the upstream region of bin $i$, $D_i(m)$ denotes a bin in the downstream region of bin $i$, cont.freq denotes the interaction frequency between an upstream bin and a downstream bin, $l$ is the index of an element in the upstream window, and $m$ is the index of an element in the downstream window.
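The binSignal computation can be sketched as follows. The function name `bin_signal` is hypothetical, and two conventions here are assumptions rather than statements from the source: the upstream window is taken as bins i-w to i-1 and the downstream window as bins i+1 to i+w, and windows are truncated at the ends of the matrix, averaging only over the pairs that exist.

```python
def bin_signal(matrix, w):
    """Return binSignal(i) for every bin i of a square contact matrix.

    Assumed window convention: upstream bins are i-w .. i-1 and downstream
    bins are i+1 .. i+w, truncated at the matrix edges.
    """
    n = len(matrix)
    signal = []
    for i in range(n):
        upstream = range(max(0, i - w), i)
        downstream = range(i + 1, min(n, i + w + 1))
        pairs = [matrix[u][d] for u in upstream for d in downstream]
        # Bins with no upstream or no downstream neighbours get 0.0.
        signal.append(sum(pairs) / len(pairs) if pairs else 0.0)
    return signal
```

On a matrix with two dense diagonal blocks, the signal dips at the bins between the blocks, because the upstream and downstream windows there fall in different blocks.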
The specific method for filtering out false-positive TAD boundary points with the rank-sum test in step 4 is as follows:
Within a certain range around the TAD boundary region under test, collect the interaction frequencies between the upstream and downstream sides, $S_1$, and the interaction frequencies within the upstream side or within the downstream side, $S_2$. A rank-sum test is used to check whether $S_1$ and $S_2$ differ significantly, where a significant difference means that the p-value of the rank-sum test is less than 0.05. If there is a significant difference, the TAD boundary region is considered a false positive and is filtered out; otherwise the TAD boundary region is considered a true positive.
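The test itself can be sketched in plain Python. This is an illustrative implementation of the two-sided Wilcoxon rank-sum test using the normal approximation without a tie correction; the function names are hypothetical, and both samples are assumed to be non-empty. The decision rule in `is_false_positive` follows the rule exactly as stated in the text above.

```python
import math


def rank_sum_p_value(s1, s2):
    """Two-sided rank-sum p-value (normal approximation, no tie correction)."""
    pooled = sorted((value, group)
                    for group, sample in enumerate((s1, s2))
                    for value in sample)
    rank_sum_s1 = 0.0
    pos = 0
    while pos < len(pooled):
        end = pos
        while end < len(pooled) and pooled[end][0] == pooled[pos][0]:
            end += 1
        # Tied values share the mean of the ranks pos+1 .. end.
        average_rank = (pos + 1 + end) / 2
        rank_sum_s1 += average_rank * sum(
            1 for k in range(pos, end) if pooled[k][1] == 0)
        pos = end
    n1, n2 = len(s1), len(s2)
    mean = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_s1 - mean) / std
    return math.erfc(abs(z) / math.sqrt(2))


def is_false_positive(s1, s2, alpha=0.05):
    """Rule as stated in the text: a significant S1/S2 difference is filtered."""
    return rank_sum_p_value(s1, s2) < alpha
```

For small samples an exact test would be more appropriate than the normal approximation used here.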
The contact matrix $M$ in step 5 is a symmetric square matrix whose elements are called interaction frequencies. The mathematical model between the interaction frequencies $M_{ij}$ and all possible hierarchical TADs is:
$$Y = XB + N(0, \sigma^2), \quad \text{subject to } B \ge 0$$
where $Y$ is the matrix formed from the interaction frequencies of all Hi-C contact matrices, each column of $Y$ being the vector of interaction frequencies in the lower triangular region (including the diagonal) of one contact matrix; $X$ encodes the positional relationship between interaction frequencies and all possible hierarchical TADs; $B$ holds the weight coefficients of all possible hierarchical TADs; $N(0, \sigma^2)$ is noise drawn from a normal distribution; and $\sigma$ is the standard deviation of that normal distribution. Written element by element, the model is:
$$y_{i,j} = \sum_{z=1}^{n} x_{i,z} B_{zc} + \varepsilon_{i,j}, \quad \varepsilon_{i,j} \sim N(0, \sigma^2), \quad B_{zc} \ge 0$$
where $c = 1$ for samples of the first cell line and $c = 2$ for samples of the second cell line; $m$ is the dimension of the contact matrix; $n$ is the total number of possible hierarchical TADs; $c_1$ denotes the first cell line and $c_2$ the second cell line; $K_1$ is the total number of Hi-C data samples under the first cell line and $K_2$ the total number under the second cell line. The entries of $Y$ run up to the interaction frequency at row $m$, column $m$ of the $K_1$-th Hi-C contact matrix under the first cell line and of the $K_2$-th Hi-C contact matrix under the second cell line. The entry $x_{i,z}$ is set to 1 if its corresponding $y_{i,j}$ lies inside the $z$-th hierarchical TAD and to 0 otherwise. In the model, $B_{n1}$ is the weight coefficient of the $n$-th hierarchical TAD under the first cell line, and $B_{n2}$ is the weight coefficient of the $n$-th hierarchical TAD under the second cell line.
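The construction of $X$ and of the columns of $Y$ can be sketched as follows. All function names are hypothetical. The enumeration rule is an assumption: every interval between two boundary bins is treated as a candidate hierarchical TAD, with intervals half-open as `[start, end)`. The source does not spell out how the candidates are enumerated.

```python
from itertools import combinations


def candidate_tads(boundaries):
    """Every interval [start, end) between two boundary bins (assumed rule)."""
    return list(combinations(sorted(boundaries), 2))


def lower_triangle_pairs(m):
    """(row, col) index pairs of the lower triangle, diagonal included."""
    return [(r, c) for r in range(m) for c in range(r + 1)]


def vectorize(matrix):
    """One column of Y: the lower-triangle entries of a contact matrix."""
    return [matrix[r][c] for r, c in lower_triangle_pairs(len(matrix))]


def design_matrix(m, tads):
    """X[i][z] is 1 if the i-th lower-triangle entry lies inside TAD z, else 0."""
    # Since col <= row, the entry is inside [start, end) when
    # start <= col and row < end.
    return [[1 if start <= c and r < end else 0 for start, end in tads]
            for r, c in lower_triangle_pairs(m)]
```

With boundaries at bins 0, 2 and 4 of a 4-bin matrix, this yields three candidates: the two small domains and the larger domain that contains them both, which is how nesting enters the model.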
The specific method for solving the objective function in step 6 is as follows:
First, the objective function is established; it consists of four terms.
Here $K_1$ is the total number of Hi-C data samples under the first cell line, $K_2$ is the total number of Hi-C data samples under the second cell line, and $j$ is the sample index. $Y_{1j}$ is the vector of all interaction frequencies in the lower triangular region (including the diagonal) of the $j$-th Hi-C contact matrix under the first cell line, and $Y_{2j}$ is the corresponding vector for the $j$-th Hi-C contact matrix under the second cell line. $B_{:,1}$ is the weight vector of the hierarchical TADs for the first cell line, $B_{:,2}$ is the weight vector of the hierarchical TADs for the second cell line, $X$ encodes the positional relationship between interaction frequencies and all possible hierarchical TADs, and $B$ is the matrix formed by concatenating the vectors $B_{:,1}$ and $B_{:,2}$. The first term of the objective function is the ordinary mean squared error, whose purpose is to make the overall gap between the interaction frequencies computed from the model and the observed interaction frequencies as small as possible. The second term is the $l_1$ regularization term, which increases the sparsity of the solution. The third term is the $l_2$ regularization term, which makes the solution smoother and prevents overfitting. The fourth term is the core term of the objective function, whose purpose is to make the overall difference between the vectors $B_{:,1}$ and $B_{:,2}$ as small as possible through optimization.
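The four terms can be made concrete with a small sketch. The exact formula is not reproduced in this text, so several details here are assumptions: the regularization weights `lam1`, `lam2` and `lam3` are hypothetical names, the fit term is written as a sum of squared errors, and the fourth term is written as the squared $l_2$ distance between the two weight vectors.

```python
def predict(X, b):
    """Model output X b for one weight vector."""
    return [sum(x * w for x, w in zip(row, b)) for row in X]


def squared_error(y, y_hat):
    return sum((a - p) ** 2 for a, p in zip(y, y_hat))


def objective(X, samples_1, samples_2, b1, b2, lam1, lam2, lam3):
    """Four-term objective; the form of each penalty is an assumption."""
    fit_1, fit_2 = predict(X, b1), predict(X, b2)
    # Term 1: fit error over all samples of both cell lines.
    fit = (sum(squared_error(y, fit_1) for y in samples_1)
           + sum(squared_error(y, fit_2) for y in samples_2))
    # Term 2: l1 penalty (sparsity). Term 3: l2 penalty (smoothness).
    l1 = sum(abs(w) for w in b1 + b2)
    l2 = sum(w * w for w in b1 + b2)
    # Term 4: penalty on the difference between the two weight vectors.
    diff = sum((u - v) ** 2 for u, v in zip(b1, b2))
    return fit + lam1 * l1 + lam2 * l2 + lam3 * diff
```

Under this reading, a larger `lam3` pulls the two cell lines toward shared weights, so only TADs with strong evidence of a difference keep distinct coefficients.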
The model is solved with the online machine learning algorithm FTRL, which has the following advantages: it saves computer memory, improves solving efficiency, and yields sparse solutions.
The specific method for identifying differential hierarchical TADs from the solution in step 7 is as follows:
The solution consists of two vectors, $B_{:,1}$ and $B_{:,2}$, where $B_{i1}$ is the weight coefficient of the $i$-th hierarchical TAD under cell line $c_1$ and $B_{i2}$ is the weight coefficient of the $i$-th hierarchical TAD under cell line $c_2$. The quantity $\mathrm{abs}(B_{i1} - B_{i2})$ measures the degree of difference between cell lines $c_1$ and $c_2$ on the $i$-th hierarchical TAD: the larger its value, the greater the difference between $c_1$ and $c_2$ on the $i$-th hierarchical TAD.
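The ranking in step 7 amounts to sorting the candidates by the absolute weight difference. A minimal sketch, with the hypothetical function name `rank_differential_tads` and TADs represented as `(start, end)` tuples by assumption:

```python
def rank_differential_tads(tads, b1, b2):
    """Sort candidate TADs by abs(B_i1 - B_i2), largest difference first."""
    scored = [(abs(w1 - w2), tad) for tad, w1, w2 in zip(tads, b1, b2)]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

The source does not give a cutoff for calling a TAD differential, so this sketch returns the full ranking and leaves any threshold to the caller.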
Compared with the prior art, the advantages of the present invention are as follows: based on a large body of biological experimental results, a mathematical model between the interaction frequencies of the Hi-C contact matrix and hierarchical TADs is proposed, an objective function is established, and the online machine learning algorithm FTRL is used to obtain the weight coefficients of all hierarchical TADs, thereby identifying the TADs that differ between cell lines.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Fig. 2 shows heat maps of the Hi-C data at 500 kb resolution for the cell lines HUVEC and IMR90.
Fig. 3 shows the actual hierarchical differential TADs and the result obtained by the analysis and fitting.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the drawings and examples.
As shown in Fig. 1, the differential analysis of hierarchical TADs between two cell lines, based on the identification of TAD boundary regions, comprises the following steps:
Step 1, normalize the Hi-C data to remove the systematic bias of the Hi-C experiments and to improve comparability between datasets. The specific method is as follows:
First, the cross-cell-line Hi-C data normalization method multiHiCcompare is applied to the Hi-C data from the different cell lines as a preliminary treatment, to remove systematic bias between cell lines as far as possible. Then the data normalization method CPM (counts per million) is applied to the pre-processed Hi-C data, to further improve the comparability of Hi-C data between cell lines.
Step 2, for the normalized Hi-C data, compute the average interaction frequency between the upstream and downstream regions of each bin, denoted binSignal(i). The specific method is as follows:
For the bin with index $i$, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i). With a window size of $w$ for the upstream and downstream regions, binSignal(i) is computed as:
$$\mathrm{binSignal}(i) = \frac{1}{w^2} \sum_{l=1}^{w} \sum_{m=1}^{w} \mathrm{cont.freq}\big(U_i(l), D_i(m)\big)$$
where $U_i(l)$ denotes a bin in the upstream region of bin $i$, $D_i(m)$ denotes a bin in the downstream region of bin $i$, cont.freq denotes the interaction frequency between an upstream bin and a downstream bin, $l$ is the index of an element in the upstream window, and $m$ is the index of an element in the downstream window.
Step 3, fit the binSignal sequence with a curve-fitting algorithm, and provisionally treat the local minimum points of the fitted curve as TAD boundary points.
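Step 3 can be sketched as a smoothing pass followed by a search for local minima. The source does not name the curve-fitting algorithm, so the moving average below is only a stand-in for it; the function names and the `half_width` parameter are hypothetical.

```python
def smooth(values, half_width=1):
    """Moving average, a simple stand-in for the unspecified curve fit."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half_width): i + half_width + 1]
        out.append(sum(window) / len(window))
    return out


def local_minima(values):
    """Indices whose value is strictly lower than both neighbours."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]
```

Applying `local_minima(smooth(signal))` to a binSignal sequence gives the provisional boundary bins. Because the comparison is strict, a flat-bottomed dip is not reported, which a fuller implementation would need to handle.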
Step 4, filter out false-positive TAD boundary points with a rank-sum test to obtain the final TAD boundary points. The specific method is as follows:
Within a certain range around the TAD boundary region under test, collect the interaction frequencies between the upstream and downstream sides, $S_1$, and the interaction frequencies within the upstream side or within the downstream side, $S_2$. A rank-sum test is used to check whether $S_1$ and $S_2$ differ significantly, where a significant difference means that the p-value of the rank-sum test is less than 0.05. If there is a significant difference, the TAD boundary region is considered a false positive and is filtered out; otherwise the TAD boundary region is considered a true positive.
Step 5, obtain all possible hierarchical TADs from the TAD boundary points, and establish the mathematical model between the interaction frequencies $M_{ij}$ in the Hi-C contact matrix $M$ and all possible hierarchical TADs:
$$Y = XB + N(0, \sigma^2), \quad \text{subject to } B \ge 0$$
where $Y$ is the matrix formed from the interaction frequencies of all Hi-C contact matrices, each column of $Y$ being the vector of interaction frequencies in the lower triangular region (including the diagonal) of one contact matrix; $X$ encodes the positional relationship between interaction frequencies and all possible hierarchical TADs; $B$ holds the weight coefficients of all possible hierarchical TADs; $N(0, \sigma^2)$ is noise drawn from a normal distribution; and $\sigma$ is the standard deviation of that normal distribution. Written element by element, the model is:
$$y_{i,j} = \sum_{z=1}^{n} x_{i,z} B_{zc} + \varepsilon_{i,j}, \quad \varepsilon_{i,j} \sim N(0, \sigma^2), \quad B_{zc} \ge 0$$
where $c = 1$ for samples of the first cell line and $c = 2$ for samples of the second cell line; $m$ is the dimension of the contact matrix; $n$ is the total number of possible hierarchical TADs; $c_1$ denotes the first cell line and $c_2$ the second cell line; $K_1$ is the total number of Hi-C data samples under the first cell line and $K_2$ the total number under the second cell line. The entries of $Y$ run up to the interaction frequency at row $m$, column $m$ of the $K_1$-th Hi-C contact matrix under the first cell line and of the $K_2$-th Hi-C contact matrix under the second cell line. The entry $x_{i,z}$ is set to 1 if its corresponding $y_{i,j}$ lies inside the $z$-th hierarchical TAD and to 0 otherwise. In the model, $B_{n1}$ is the weight coefficient of the $n$-th hierarchical TAD under the first cell line, and $B_{n2}$ is the weight coefficient of the $n$-th hierarchical TAD under the second cell line.
Step 6, establish the objective function of the model and solve it with the online machine learning algorithm FTRL. The specific method is as follows:
First, the objective function is established; it consists of four terms.
Here $K_1$ is the total number of Hi-C data samples under the first cell line, $K_2$ is the total number of Hi-C data samples under the second cell line, and $j$ is the sample index. $Y_{1j}$ is the vector of all interaction frequencies in the lower triangular region (including the diagonal) of the $j$-th Hi-C contact matrix under the first cell line, and $Y_{2j}$ is the corresponding vector for the $j$-th Hi-C contact matrix under the second cell line. $B_{:,1}$ is the weight vector of the hierarchical TADs for the first cell line, $B_{:,2}$ is the weight vector of the hierarchical TADs for the second cell line, $X$ encodes the positional relationship between interaction frequencies and all possible hierarchical TADs, and $B$ is the matrix formed by concatenating the vectors $B_{:,1}$ and $B_{:,2}$. The first term of the objective function is the ordinary mean squared error, whose purpose is to make the overall gap between the interaction frequencies computed from the model and the observed interaction frequencies as small as possible. The second term is the $l_1$ regularization term, which increases the sparsity of the solution. The third term is the $l_2$ regularization term, which makes the solution smoother and prevents overfitting. The fourth term is the core term of the objective function, whose purpose is to make the overall difference between the vectors $B_{:,1}$ and $B_{:,2}$ as small as possible through optimization.
The model is solved with the online machine learning algorithm FTRL, which has the following advantages: it saves computer memory, improves solving efficiency, and yields sparse solutions.
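To show why FTRL is memory-light and sparse, here is a minimal sketch of the per-coordinate FTRL-Proximal update for a squared-error loss. It is a simplification under stated assumptions, not the patent's solver: it fits one weight vector at a time, it omits the term that couples the two cell lines, and it enforces the constraint of non-negative weights by taking the closed-form minimizer over $w \ge 0$. The function name and the parameters `alpha`, `beta`, `lam1`, `lam2` and `epochs` are hypothetical.

```python
import math


def ftrl_fit(rows, targets, alpha=0.5, beta=1.0, lam1=0.0, lam2=0.0,
             epochs=200):
    """Per-coordinate FTRL-Proximal for squared loss with weights >= 0."""
    dim = len(rows[0])
    z = [0.0] * dim  # accumulated adjusted gradients
    n = [0.0] * dim  # accumulated squared gradients

    def weight(i):
        # Closed-form minimiser of the per-coordinate problem over w >= 0.
        if z[i] >= -lam1:
            return 0.0
        return -(z[i] + lam1) / ((beta + math.sqrt(n[i])) / alpha + lam2)

    for _ in range(epochs):
        for x, y in zip(rows, targets):
            w = [weight(i) for i in range(dim)]
            error = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                if xi == 0:
                    continue  # zero features leave the state untouched
                g = error * xi
                sigma = (math.sqrt(n[i] + g * g) - math.sqrt(n[i])) / alpha
                z[i] += g - sigma * w[i]
                n[i] += g * g
    return [weight(i) for i in range(dim)]
```

Only the two state vectors `z` and `n` are kept, and each sample touches only its non-zero features, which fits the memory and efficiency claims above. Weights whose accumulated gradient stays within the `lam1` threshold are exactly zero, which is where the sparsity comes from.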
Step 7, based on the solution, identify the hierarchical TADs that differ between cell lines. The specific method is as follows:
The solution consists of two vectors, $B_{:,1}$ and $B_{:,2}$, where $B_{i1}$ is the weight coefficient of the $i$-th hierarchical TAD under cell line $c_1$ and $B_{i2}$ is the weight coefficient of the $i$-th hierarchical TAD under cell line $c_2$. The quantity $\mathrm{abs}(B_{i1} - B_{i2})$ measures the degree of difference between cell lines $c_1$ and $c_2$ on the $i$-th hierarchical TAD: the larger its value, the greater the difference between $c_1$ and $c_2$ on the $i$-th hierarchical TAD.
The procedure of the invention is illustrated below by identifying the hierarchical TADs that differ between the cell lines HUVEC and IMR90.
(1) The data are Hi-C data at 500 kb resolution for the cell lines HUVEC and IMR90, with two datasets prepared for each cell line. Heat maps of the data are shown in Fig. 2: the two heat maps on the left are the HUVEC data, and the two on the right are the IMR90 data.
(2) The data are normalized using the cross-cell-line Hi-C data normalization method multiHiCcompare and the data normalization method CPM.
(3) TAD boundary points are identified in the normalized data. This comprises computing binSignal(i) for each bin, curve-fitting binSignal to obtain preliminary local minimum points, and filtering out false-positive TAD boundary points with the rank-sum test.
(4) All possible hierarchical TADs are obtained from the TAD boundary points, and the weight coefficient of each possible hierarchical TAD is solved for.
(5) The final analysis result is obtained from the solved weight coefficients and visualized. The visualization is shown in Fig. 3, where the left panel shows the actual hierarchical differential TADs and the right panel shows the result obtained by the analysis and fitting. The fitted result is close to the actual hierarchical differential TADs.
Claims (9)
1. A method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning, characterized by comprising the following steps:
Step 1, normalize the Hi-C data to remove the systematic bias of the Hi-C experiments and to improve comparability between datasets;
Step 2, for the normalized Hi-C data, compute the average interaction frequency between the upstream and downstream regions of each bin, denoted binSignal(i);
Step 3, fit the binSignal sequence with a curve-fitting algorithm, and provisionally treat the local minimum points of the fitted curve as TAD boundary points;
Step 4, filter out false-positive TAD boundary points with a rank-sum test to obtain the final TAD boundary points;
Step 5, obtain all possible hierarchical TADs from the TAD boundary points, and establish a mathematical model between the interaction frequencies $M_{ij}$ in the Hi-C contact matrix $M$ and all possible hierarchical TADs;
Step 6, establish the objective function of the model and solve it with the online machine learning algorithm FTRL;
Step 7, based on the solution, identify the hierarchical TADs that differ between cell lines.
2. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, characterized in that the specific method for normalizing the Hi-C data in step 1 is:
First, the cross-cell-line Hi-C data normalization method multiHiCcompare is applied to the Hi-C data from the different cell lines as a preliminary treatment, to remove systematic bias between cell lines as far as possible; then the data normalization method CPM (counts per million) is applied to the pre-processed Hi-C data, to further improve the comparability of Hi-C data between cell lines.
3. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, characterized in that the specific method for computing binSignal(i) in step 2 is:
For the bin with index $i$, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i); with a window size of $w$ for the upstream and downstream regions, binSignal(i) is computed as:
$$\mathrm{binSignal}(i) = \frac{1}{w^2} \sum_{l=1}^{w} \sum_{m=1}^{w} \mathrm{cont.freq}\big(U_i(l), D_i(m)\big)$$
where $U_i(l)$ denotes a bin in the upstream region of bin $i$, $D_i(m)$ denotes a bin in the downstream region of bin $i$, cont.freq denotes the interaction frequency between an upstream bin and a downstream bin, $l$ is the index of an element in the upstream window, and $m$ is the index of an element in the downstream window.
4. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, characterized in that the specific method for filtering out false-positive TAD boundary points with the rank-sum test in step 4 is:
Within a certain range around the TAD boundary region under test, collect the interaction frequencies between the upstream and downstream sides, $S_1$, and the interaction frequencies within the upstream side or within the downstream side, $S_2$; use a rank-sum test to check whether $S_1$ and $S_2$ differ significantly; if there is a significant difference, the TAD boundary region is considered a false positive and is filtered out; otherwise the TAD boundary region is considered a true positive.
5. The method for differential analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 4, characterized in that a significant difference means that the p-value of the rank-sum test is less than 0.05.
6. The hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning according to claim 1, characterized in that the contact matrix M in step 5 is a symmetric square matrix, each element of the contact matrix is called an interaction frequency, and the mathematical model relating the interaction frequency $M_{ij}$ to all possible hierarchical TADs is:
$$Y = XB + N(0, \sigma^2), \quad \text{subject to } B \ge 0$$
where Y denotes the matrix composed of the interaction frequencies of all Hi-C contact matrices, each column of Y being the vector formed by the interaction frequencies in the lower triangular region (including the diagonal) of one contact matrix; X denotes the positional relationship between the interaction frequencies and all possible hierarchical TADs; B denotes the weight coefficients of all possible hierarchical TADs; $N(0, \sigma^2)$ denotes noise drawn from a normal distribution; and σ denotes the standard deviation of that normal distribution.
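To make the layout of Y concrete, the following Python sketch flattens the lower triangular region (including the diagonal) of a contact matrix into a vector. The function name and the row-by-row ordering of the entries are assumptions made for illustration; the claim does not fix an ordering.

```python
def lower_triangle_vector(matrix):
    """Flatten the lower triangle (with diagonal) of a square matrix, row by row."""
    return [matrix[r][c] for r in range(len(matrix)) for c in range(r + 1)]


# Two toy 3 x 3 symmetric contact matrices; each yields one column of Y.
sample_a = [[5, 2, 1],
            [2, 6, 3],
            [1, 3, 7]]
sample_b = [[4, 1, 0],
            [1, 5, 2],
            [0, 2, 6]]
columns_of_y = [lower_triangle_vector(sample_a), lower_triangle_vector(sample_b)]
```

A matrix of dimension m produces a vector of length m(m+1)/2, so each column of Y in this toy case has six entries.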
7. The hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning according to claim 6, characterized in that the mathematical model is specifically described as:
where m denotes the dimension of the contact matrix; n denotes the total number of all possible hierarchical TADs; $c_1$ denotes the first cell line and $c_2$ denotes the second cell line; $K_1$ denotes the total number of Hi-C data samples under the first cell line and $K_2$ denotes the total number of Hi-C data samples under the second cell line; the entries of Y represent the interaction frequencies of the individual contact matrices, for example the interaction frequency at row m, column m of the $K_1$-th Hi-C contact matrix under the first cell line and the interaction frequency at row m, column m of the $K_2$-th Hi-C contact matrix under the second cell line; $x_{i,z}$ is set to 1 if its corresponding $y_{i,j}$ lies inside the z-th hierarchical TADs and is set to 0 otherwise; and the coefficients in the model represent the weight coefficient of the n-th hierarchical TADs under the first cell line and the weight coefficient of the n-th hierarchical TADs under the second cell line, respectively.
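The 0/1 rule for $x_{i,z}$ can be sketched in Python as below. This is a hedged illustration, not the claimed construction: the function name is hypothetical, each candidate hierarchical TAD is assumed to be given as an inclusive bin interval (start, end), and the rows follow a row-by-row enumeration of the lower triangle including the diagonal.

```python
def build_design_matrix(m, tads):
    """Return X with one row per lower-triangle entry and one column per candidate TAD.

    tads: list of (start, end) bin intervals, both ends inclusive (assumed format).
    An entry (r, c) with c <= r lies inside a TAD when start <= c and r <= end.
    """
    positions = [(r, c) for r in range(m) for c in range(r + 1)]
    return [
        [1 if start <= c and r <= end else 0 for (start, end) in tads]
        for (r, c) in positions
    ]


# Toy example: a 3-bin matrix with a small TAD nested in a larger one.
design = build_design_matrix(3, [(0, 1), (0, 2)])
```

In the toy example the first column marks the three entries covered by the nested TAD, while the second column marks all six entries.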
8. The hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning according to claim 6 or 7, characterized in that the specific method for solving the objective function in step 6 is:
The objective function is established first:
where $K_1$ denotes the total number of Hi-C data samples under the first cell line; $K_2$ denotes the total number of Hi-C data samples under the second cell line; j denotes the sample index; $Y_{1j}$ denotes the vector formed by all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix under the first cell line; $Y_{2j}$ denotes the vector formed by all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix under the second cell line; $B_{:,1}$ denotes the weight vector of the hierarchical TADs of the first cell line; $B_{:,2}$ denotes the weight vector of the hierarchical TADs of the second cell line; X denotes the positional relationship between the interaction frequencies and all possible hierarchical TADs; and B is the matrix formed by concatenating the vectors $B_{:,1}$ and $B_{:,2}$.
The first term of the objective function is the usual mean squared error, whose purpose is to make the overall gap between the interaction frequencies computed from the model and the true interaction frequencies as small as possible. The second term is the $l_1$ regularization term, whose purpose is to increase the sparsity of the solution. The third term is the $l_2$ regularization term, whose purpose is to make the solution smoother and prevent overfitting. The fourth term is the core term of the loss function, whose purpose is to make the overall difference between the vectors $B_{:,1}$ and $B_{:,2}$ as small as possible through optimization.
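The four terms described in this claim can be combined as in the Python sketch below. The exact formula is not reproduced in this text, so the sketch rests on stated assumptions: a sum of squared errors for the first term, a squared $l_2$ norm for the third term, a squared difference between the two weight vectors for the fourth term, and hypothetical weighting parameters `lam1`, `lam2`, `lam3`. It only evaluates the objective; it does not perform the online optimization or enforce B ≥ 0.

```python
def objective(X, Y1, Y2, b1, b2, lam1, lam2, lam3):
    """Evaluate an assumed four-term objective for two cell lines.

    X: design matrix (list of rows); Y1, Y2: lists of sample vectors;
    b1, b2: weight vectors of the first and second cell line.
    """
    def predict(b):
        return [sum(x * w for x, w in zip(row, b)) for row in X]

    def squared_error(samples, b):
        predicted = predict(b)
        return sum((y - p) ** 2 for sample in samples for y, p in zip(sample, predicted))

    fit = squared_error(Y1, b1) + squared_error(Y2, b2)    # term 1: data fit
    l1 = sum(abs(w) for w in b1 + b2)                      # term 2: sparsity
    l2 = sum(w * w for w in b1 + b2)                       # term 3: smoothness
    gap = sum((u - v) ** 2 for u, v in zip(b1, b2))        # term 4: cell-line gap
    return fit + lam1 * l1 + lam2 * l2 + lam3 * gap


value = objective(
    X=[[1, 0], [0, 1]],
    Y1=[[1, 2]],
    Y2=[[1, 4]],
    b1=[1, 2],
    b2=[1, 3],
    lam1=1, lam2=1, lam3=1,
)
```

With these toy inputs the four terms evaluate to 1, 7, 15 and 1 respectively.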
9. The hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning according to claim 8, characterized in that the specific method for identifying differential hierarchical TADs based on the solution result in step 7 is:
The solution result consists of two vectors $B_{:,1}$ and $B_{:,2}$, where $B_{i1}$ denotes the weight coefficient of the i-th hierarchical TADs under cell line $c_1$ and $B_{i2}$ denotes the weight coefficient of the i-th hierarchical TADs under cell line $c_2$. The value $abs(B_{i1} - B_{i2})$ represents the degree of difference between cell line $c_1$ and cell line $c_2$ on the i-th hierarchical TADs: the larger this value, the greater the difference between the two cell lines on the i-th hierarchical TADs.
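The comparison in this claim reduces to ranking the candidate TADs by the absolute difference of their weights, which the short Python sketch below illustrates. The function name and the toy weights are hypothetical.

```python
def rank_differential_tads(b1, b2):
    """Return TAD indices ordered from largest to smallest abs(B_i1 - B_i2)."""
    differences = [abs(u - v) for u, v in zip(b1, b2)]
    return sorted(range(len(differences)), key=lambda i: differences[i], reverse=True)


# Toy weight vectors for three candidate hierarchical TADs.
weights_c1 = [0.2, 1.5, 0.0]
weights_c2 = [0.1, 0.3, 0.9]
ranking = rank_differential_tads(weights_c1, weights_c2)
```

The index at the front of the ranking is the candidate on which the two cell lines differ most.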
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910315741.1A CN110097922B (en) | 2019-04-19 | 2019-04-19 | Hierarchical TADs (TADs-related analysis) difference analysis method in Hi-C contact matrix based on online machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110097922A (en) | 2019-08-06 |
CN110097922B CN110097922B (en) | 2020-12-08 |
Family
ID=67445233
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910315741.1A Expired - Fee Related CN110097922B (en) | 2019-04-19 | 2019-04-19 | Hierarchical TADs (TADs-related analysis) difference analysis method in Hi-C contact matrix based on online machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110097922B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023092303A1 (en) * | 2021-11-23 | 2023-06-01 | Chromatintech Beijing Co, Ltd | Method for generating an enhanced hi-c matrix, non-transitory computer readable medium storing a program for generating an enhanced hi-c matrix, method for identifying a structural chromatin aberration in an enhanced hi-c matrix |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017011710A2 (en) * | 2015-07-14 | 2017-01-19 | Whitehead Institute For Biomedical Research | Chromosome neighborhood structures and methods relating thereto |
CN108647492A (en) * | 2018-05-02 | 2018-10-12 | 中国人民解放军军事科学院军事医学研究院 | A kind of characterizing method and device of chromatin topology relevant domain |
CN109448783A (en) * | 2018-08-07 | 2019-03-08 | 清华大学 | Method for analyzing chromatin topological structure domain boundary |
WO2019060683A2 (en) * | 2017-09-21 | 2019-03-28 | The Penn State Research Foundation | Systems, methods, and processor-readable media for detecting disease causal variants |
CN109637579A (en) * | 2018-12-18 | 2019-04-16 | 长沙学院 | A kind of key protein matter recognition methods based on tensor random walk |
Non-Patent Citations (3)
Title |
---|
MARIE ZUFFEREY ET AL: "Comparison of computational methods for the identification of topologically associating domains", 《GENOME BIOLOGY》 * |
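Wang Xiaotao: "Research on the structure and function of chromatin topologically associating domains", 《China Doctoral Dissertations Full-text Database, Basic Sciences》 * |
Han Jiuqiang et al.: "Research status and development trends of HERV based on bioinformatics", 《Chinese Journal of Bioinformatics》 * |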
Also Published As
Publication number | Publication date |
---|---|
CN110097922B (en) | 2020-12-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20201208 |