CN110097922A - Hierarchical TADs difference analysis method in Hi-C contact matrix based on online machine learning - Google Patents

Info

Publication number
CN110097922A
Authority
CN
China
Prior art keywords
tads
hierarchical
cell line
indicate
contact matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910315741.1A
Other languages
Chinese (zh)
Other versions
CN110097922B (en)
Inventor
吕红强
刘聪毅
韩九强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201910315741.1A
Publication of CN110097922A
Application granted
Publication of CN110097922B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Optimization (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Analysis (AREA)
  • Biotechnology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Molecular Biology (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Complex Calculations (AREA)

Abstract

A method for difference analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning. The Hi-C data are normalized to eliminate experimental systematic bias and to improve comparability between datasets. For the normalized data, the average interaction frequency between the upstream and downstream regions of each bin is computed and denoted binSignal(i). The binSignal sequence is fitted with a curve and subjected to a rank-sum test to obtain the TAD boundary points. All possible hierarchical TADs are derived from the boundary points, and a mathematical model relating the interaction frequencies in the Hi-C contact matrix to all possible hierarchical TADs is proposed. The objective function of the model is established and, for the first time, the online machine learning algorithm FTRL is used to solve the hierarchical TAD difference analysis model, identifying the hierarchical TADs that differ between cell lines. The invention thus proposes a mathematical model relating the interaction frequencies in the Hi-C contact matrix to hierarchical TADs, obtains the weight coefficients of all hierarchical TADs with the online machine learning algorithm FTRL, and identifies the TADs that differ between cell lines.

Description

Hierarchical TADs difference analysis method in Hi-C contact matrix based on online machine learning
Technical field
The invention belongs to the field of biotechnology and relates to the difference analysis of hierarchical TADs across different cell lines, in particular to a method for difference analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning.
Background
Hi-C is a high-throughput chromatin conformation capture technique; a Hi-C experiment yields interaction information between arbitrary loci across the whole genome. Hi-C data are the data obtained from a Hi-C experiment and generally take the form of a matrix, called the contact matrix. The contact matrix is a symmetric square matrix, and each of its elements is called an interaction frequency. As Hi-C technology developed, researchers studying Hi-C data found that each chromosome can be roughly divided into two compartments (A/B compartments) according to chromatin state: chromatin in the A compartment is active, and chromatin in the B compartment is inactive. Building on the discovery of these two compartment types, researchers found at higher resolution that compartments contain genomic regions with stronger interactions, called topologically associating domains (TADs). The interaction strength between loci inside a TAD is much higher than the interaction strength with loci outside it.
Numerous biological experiments show that TADs are basic functional units in the regulation of gene transcription and expression. During gene regulation, TADs constrain the regulatory interactions between enhancers and promoters; in addition, disruption of TAD boundaries may lead to disease, such as cancer. Studies show that most TADs have a hierarchical structure and only a small minority are independent. Difference analysis of hierarchical TADs can give deeper insight into the mechanisms of gene expression and cell differentiation.
Existing methods for analyzing TAD differences give insufficient consideration to the hierarchical structure of TADs, which impairs the identification of differential TADs.
Summary of the invention
To overcome the above shortcomings of the prior art, the object of the present invention is to provide a method for difference analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning, which performs hierarchical TAD difference analysis on Hi-C data from different cell lines.
To achieve this object, the technical solution adopted by the present invention is:
A method for difference analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning, characterized by comprising the following steps:
Step 1: normalize the Hi-C data to eliminate the systematic bias of the Hi-C experiments and improve comparability between datasets;
Step 2: for the normalized Hi-C data, compute the average interaction frequency between the upstream and downstream regions of each bin, denoted binSignal(i);
Step 3: fit the binSignal sequence with a curve-fitting algorithm, and take the local minima of the fitted curve as preliminary TAD boundary points;
Step 4: filter out false-positive TAD boundary points with a rank-sum test to obtain the final TAD boundary points;
Step 5: derive all possible hierarchical TADs from the TAD boundary points, and propose a mathematical model relating the interaction frequencies $M_{ij}$ in the Hi-C contact matrix $M$ to all possible hierarchical TADs;
Step 6: establish the objective function of the model and solve it with the online machine learning algorithm FTRL;
Step 7: based on the solution, identify the hierarchical TADs that differ between cell lines.
The specific method for normalizing the Hi-C data in step 1 is:
First, the cross-cell-line Hi-C normalization method multiHiCcompare is applied to the Hi-C data from the different cell lines as a preliminary treatment, to eliminate systematic bias between cell lines as far as possible. Then the normalization method CPM (counts per million) is applied to the pre-processed Hi-C data to further improve the comparability of Hi-C data between cell lines.
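The CPM stage can be sketched as follows. This is a minimal illustration, assuming that the library size is the sum of all entries of the contact matrix; the function name `cpm_normalize` and the toy matrices are hypothetical, and the multiHiCcompare stage (an R package) is not reproduced here.

```python
def cpm_normalize(matrix):
    """Scale a contact matrix so that its counts sum to one million.

    Assumption: the library size is the sum over the whole matrix.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("contact matrix has no counts")
    scale = 1_000_000 / total
    return [[value * scale for value in row] for row in matrix]


# Two toy symmetric contact matrices that differ only in sequencing depth.
shallow = [[10, 4, 1], [4, 12, 5], [1, 5, 8]]
deep = [[value * 3 for value in row] for row in shallow]

cpm_shallow = cpm_normalize(shallow)
cpm_deep = cpm_normalize(deep)
```

After scaling, the two toy matrices become directly comparable because the depth factor cancels out.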
The specific method for computing binSignal(i) in step 2 is:
For the bin with index i, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i). The window size of the upstream and downstream regions is w, and binSignal(i) is computed by the following formula:
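The formula itself is not reproduced in this text. A form consistent with the definitions given here, assuming that each window holds w bins and that the average runs over every pair of one upstream bin and one downstream bin, would be:

```latex
\mathrm{binSignal}(i) = \frac{1}{w^{2}} \sum_{l=1}^{w} \sum_{m=1}^{w}
  \mathrm{cont.freq}\bigl(U_i(l),\, D_i(m)\bigr)
```

The factor $1/w^{2}$ is an assumption that follows from averaging over the $w \times w$ bin pairs.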
where $U_i(l)$ denotes a bin in the upstream region of the bin with index i, $D_i(m)$ denotes a bin in the downstream region of the bin with index i, cont.freq denotes the interaction frequency between an upstream bin and a downstream bin, l is the index of an element in the upstream window, and m is the index of an element in the downstream window.
The specific method for filtering out false-positive TAD boundary points with the rank-sum test in step 4 is:
Within a certain range around the TAD boundary region to be tested, collect the interaction frequencies between upstream and downstream bins, $S_1$, and the interaction frequencies within the upstream region or within the downstream region, $S_2$. A rank-sum test is used to check whether $S_1$ and $S_2$ differ significantly, where a significant difference means that the p-value of the rank-sum test is less than 0.05. If there is a significant difference, the TAD boundary region is considered a false positive and is filtered out; otherwise the TAD boundary region is considered a true positive.
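A rank-sum comparison of $S_1$ and $S_2$ can be sketched in plain Python as below. This is a hand-rolled two-sided Wilcoxon rank-sum test using the normal approximation, with average ranks for ties and no tie correction of the variance. The function name and the sample values are hypothetical, and a library routine such as `scipy.stats.ranksums` could be used instead.

```python
import math


def rank_sum_test(s1, s2):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p_value). Ties receive average ranks.
    """
    pooled = sorted((v, g) for g, sample in enumerate((s1, s2)) for v in sample)
    rank_total_s1 = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_total_s1 += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(s1), len(s2)
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_total_s1 - expected) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical samples: cross-boundary frequencies versus same-side frequencies.
s1 = [1, 2, 3, 4, 5, 6, 7, 8]
s2 = [11, 12, 13, 14, 15, 16, 17, 18]
z, p_value = rank_sum_test(s1, s2)
significant = p_value < 0.05  # the 0.05 threshold is the one stated in the text
```

The sketch only reports whether the difference is significant; the decision rule described in the text is then applied to that flag.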
The contact matrix $M$ in step 5 is a symmetric square matrix whose elements are called interaction frequencies. The mathematical model relating the interaction frequencies $M_{ij}$ to all possible hierarchical TADs is:
$Y = XB + N(0, \sigma^2)$, subject to $B \ge 0$
where $Y$ is the matrix formed from the interaction frequencies of all Hi-C contact matrices, each column of $Y$ being the vector of interaction frequencies in the lower triangular region (including the diagonal) of one contact matrix; $X$ encodes the positional relationship between each interaction frequency and all possible hierarchical TADs; $B$ holds the weight coefficients of all possible hierarchical TADs; $N(0, \sigma^2)$ denotes noise drawn from a normal distribution; and $\sigma$ is the standard deviation of that distribution. The mathematical model can then be written out in detail as:
where m is the dimension of the contact matrix, n is the total number of all possible hierarchical TADs, $c_1$ denotes the first cell line, $c_2$ denotes the second cell line, $K_1$ is the total number of Hi-C data samples under the first cell line, and $K_2$ is the total number of Hi-C data samples under the second cell line. The entries of $Y$ run up to the interaction frequency at row m, column m of the $K_1$-th Hi-C contact matrix under the first cell line and of the $K_2$-th Hi-C contact matrix under the second cell line. An entry $x_{i,z}$ of $X$ is set to 1 if its corresponding $y_{i,j}$ lies inside the z-th hierarchical TAD and to 0 otherwise. The entries of $B$ run up to the weight coefficient of the n-th hierarchical TAD under the first cell line and under the second cell line.
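The construction of the candidate TADs and of the indicator matrix $X$ can be sketched as follows. The sketch assumes that every pair of boundary points delimits one candidate hierarchical TAD, that a TAD `(start, end)` covers bins `start` to `end - 1`, and that lower-triangle entries are ordered row by row; all function names are hypothetical.

```python
from itertools import combinations


def enumerate_hierarchical_tads(boundaries):
    """Treat every pair of boundaries (start, end), start < end, as a candidate TAD."""
    return list(combinations(sorted(boundaries), 2))


def lower_triangle_indices(m):
    """(row, col) pairs of the lower triangle of an m x m matrix, diagonal included."""
    return [(r, c) for r in range(m) for c in range(r + 1)]


def vectorize_lower_triangle(matrix):
    """One column of Y: the lower-triangle entries of a contact matrix."""
    return [matrix[r][c] for r, c in lower_triangle_indices(len(matrix))]


def build_design_matrix(m, tads):
    """X[i][z] is 1 if the i-th lower-triangle entry lies inside the z-th TAD."""
    return [
        [1 if start <= c and r < end else 0 for start, end in tads]
        for r, c in lower_triangle_indices(m)
    ]


tads = enumerate_hierarchical_tads([0, 2, 4])
X = build_design_matrix(4, tads)
```

With boundaries at 0, 2 and 4 on a 4-bin matrix, the candidates are two small TADs and the larger TAD that contains both, which is the nesting the model is meant to capture.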
The specific method for solving the objective function in step 6 is:
The objective function is established first:
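The expression itself is not reproduced in this text. One form that matches the four terms described here (a squared-error term, an $l_1$ term, an $l_2$ term, and a term that pulls the two weight vectors together) is sketched below. The weights $\lambda_1, \lambda_2, \lambda_3$, the choice of norm in the last term, and the absence of normalizing constants are all assumptions, not details taken from the source:

```latex
\min_{B \ge 0} \;
\Bigl( \sum_{j=1}^{K_1} \lVert Y_{1j} - X B_{:,1} \rVert_2^2
     + \sum_{j=1}^{K_2} \lVert Y_{2j} - X B_{:,2} \rVert_2^2 \Bigr)
+ \lambda_1 \lVert B \rVert_1
+ \lambda_2 \lVert B \rVert_2^2
+ \lambda_3 \lVert B_{:,1} - B_{:,2} \rVert_2^2
```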
where $K_1$ is the total number of Hi-C data samples under the first cell line, $K_2$ is the total number of Hi-C data samples under the second cell line, j is the sample index, $Y_{1j}$ is the vector of all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix under the first cell line, $Y_{2j}$ is the corresponding vector for the j-th Hi-C contact matrix under the second cell line, $B_{:,1}$ is the weight vector of the hierarchical TADs for the first cell line, $B_{:,2}$ is the weight vector of the hierarchical TADs for the second cell line, $X$ encodes the positional relationship between the interaction frequencies and all possible hierarchical TADs, and $B$ is the matrix formed by concatenating the vectors $B_{:,1}$ and $B_{:,2}$. The first term of the objective function is the usual mean squared error, whose purpose is to make the overall gap between the interaction frequencies computed from the model and the true interaction frequencies as small as possible. The second term is the $l_1$ regularization term, which increases the sparsity of the solution. The third term is the $l_2$ regularization term, which makes the solution smoother and prevents overfitting. The fourth term is the core term of the loss function, whose purpose is to make the overall difference between the vectors $B_{:,1}$ and $B_{:,2}$ as small as possible through optimization.
The model is solved with the online machine learning algorithm FTRL, which has the advantages of saving computer memory, improving solution efficiency, and yielding sparse solutions.
The specific method for identifying differential hierarchical TADs from the solution in step 7 is:
The solution consists of two vectors $B_{:,1}$ and $B_{:,2}$, where $B_{i1}$ is the weight coefficient of the i-th hierarchical TAD under cell line $c_1$ and $B_{i2}$ is the weight coefficient of the i-th hierarchical TAD under cell line $c_2$. The value abs($B_{i1} - B_{i2}$) represents the degree of difference between cell lines $c_1$ and $c_2$ on the i-th hierarchical TAD: the larger this value, the greater the difference between the two cell lines on that TAD.
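Ranking the candidates by this difference can be sketched as follows; the function name, the TAD coordinates, and the weight values are hypothetical and serve only to illustrate the abs($B_{i1} - B_{i2}$) score.

```python
def rank_differential_tads(tads, b1, b2):
    """Sort candidate TADs by abs(B_i1 - B_i2), largest difference first."""
    scored = [(abs(w1 - w2), tad) for tad, w1, w2 in zip(tads, b1, b2)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical candidates and solved weights for two cell lines.
candidate_tads = [(0, 2), (0, 4), (2, 4)]
weights_c1 = [0.9, 0.5, 0.0]
weights_c2 = [0.1, 0.5, 0.3]

ranked = rank_differential_tads(candidate_tads, weights_c1, weights_c2)
ranked_tads = [tad for _, tad in ranked]
```

The source does not state a cut-off on this score, so the sketch returns the full ranking rather than a thresholded list.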
Compared with the prior art, the advantages of the invention are as follows: based on a large number of biological experimental results, a mathematical model relating the interaction frequencies of the Hi-C contact matrix to hierarchical TADs is proposed, an objective function is established, and the online machine learning algorithm FTRL is used to obtain the weight coefficients of all hierarchical TADs, thereby identifying the TADs that differ between cell lines.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Fig. 2 shows heat maps of the Hi-C data at 500 kb resolution for cell lines HUVEC and IMR90.
Fig. 3 shows the actual differential hierarchical TADs and the result obtained by analysis and fitting.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the drawings and examples.
As shown in Fig. 1, the difference analysis of hierarchical TADs between two cell lines, based on the identification of TAD boundary regions, comprises the following steps:
Step 1: normalize the Hi-C data to eliminate the systematic bias of the Hi-C experiments and improve comparability between datasets. The specific method is as follows:
First, the cross-cell-line Hi-C normalization method multiHiCcompare is applied to the Hi-C data from the different cell lines as a preliminary treatment, to eliminate systematic bias between cell lines as far as possible. Then the normalization method CPM (counts per million) is applied to the pre-processed Hi-C data to further improve the comparability of Hi-C data between cell lines.
Step 2: for the normalized Hi-C data, compute the average interaction frequency between the upstream and downstream regions of each bin, denoted binSignal(i). The specific method is as follows:
For the bin with index i, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i). The window size of the upstream and downstream regions is w, and binSignal(i) is computed by the following formula:
where $U_i(l)$ denotes a bin in the upstream region of the bin with index i, $D_i(m)$ denotes a bin in the downstream region of the bin with index i, cont.freq denotes the interaction frequency between an upstream bin and a downstream bin, l is the index of an element in the upstream window, and m is the index of an element in the downstream window.
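The binSignal computation can be sketched in a few lines. The sketch assumes that the upstream window is the w bins before bin i and the downstream window is the w bins after it (bin i itself excluded), that windows are truncated at the ends of the matrix, and that bins with an empty window get no value; these conventions and the function name are assumptions rather than details stated in the source.

```python
def bin_signal(matrix, w):
    """Average interaction frequency between the upstream and downstream windows of each bin."""
    n = len(matrix)
    signal = []
    for i in range(n):
        upstream = range(max(0, i - w), i)
        downstream = range(i + 1, min(n, i + w + 1))
        pairs = [(u, d) for u in upstream for d in downstream]
        if not pairs:
            signal.append(None)  # no upstream or no downstream bins at the matrix ends
            continue
        signal.append(sum(matrix[u][d] for u, d in pairs) / len(pairs))
    return signal


# Toy 6-bin matrix with two dense 3-bin blocks (two TAD-like domains).
toy = [[10 if (r < 3) == (c < 3) else 1 for c in range(6)] for r in range(6)]
toy_signal = bin_signal(toy, w=2)
```

On the toy matrix the signal dips at the bins next to the junction of the two blocks, which is the behaviour the boundary search in the next step relies on.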
Step 3: fit the binSignal sequence with a curve-fitting algorithm, and take the local minima of the fitted curve as preliminary TAD boundary points.
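The source does not name the curve-fitting algorithm. The sketch below uses a moving average purely as a stand-in for the fitting step and then picks the local minima of the smoothed sequence; both function names and the example signal are hypothetical.

```python
def smooth(values, half_width=1):
    """Moving-average smoothing, a simple stand-in for the unspecified curve fit."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half_width): i + half_width + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def local_minima(values):
    """Indices lower than the left neighbor and not higher than the right neighbor."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] <= values[i + 1]
    ]


# Hypothetical binSignal sequence with two dips.
signal = [5.0, 4.0, 1.0, 4.0, 6.0, 5.0, 2.0, 5.0, 6.0]
candidate_boundaries = local_minima(smooth(signal))
```

Any smoother or parametric fit could replace `smooth` here; the minima search is independent of that choice.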
Step 4: filter out false-positive TAD boundary points with a rank-sum test to obtain the final TAD boundary points. The specific method is as follows:
Within a certain range around the TAD boundary region to be tested, collect the interaction frequencies between upstream and downstream bins, $S_1$, and the interaction frequencies within the upstream region or within the downstream region, $S_2$. A rank-sum test is used to check whether $S_1$ and $S_2$ differ significantly, where a significant difference means that the p-value of the rank-sum test is less than 0.05. If there is a significant difference, the TAD boundary region is considered a false positive and is filtered out; otherwise the TAD boundary region is considered a true positive.
Step 5: derive all possible hierarchical TADs from the TAD boundary points, and establish a mathematical model relating the interaction frequencies $M_{ij}$ in the Hi-C contact matrix $M$ to all possible hierarchical TADs. The formula is:
$Y = XB + N(0, \sigma^2)$, subject to $B \ge 0$
where $Y$ is the matrix formed from the interaction frequencies of all Hi-C contact matrices, each column of $Y$ being the vector of interaction frequencies in the lower triangular region (including the diagonal) of one contact matrix; $X$ encodes the positional relationship between each interaction frequency and all possible hierarchical TADs; $B$ holds the weight coefficients of all possible hierarchical TADs; $N(0, \sigma^2)$ denotes noise drawn from a normal distribution; and $\sigma$ is the standard deviation of that distribution. The mathematical model can then be written out in detail as:
where m is the dimension of the contact matrix, n is the total number of all possible hierarchical TADs, $c_1$ denotes the first cell line, $c_2$ denotes the second cell line, $K_1$ is the total number of Hi-C data samples under the first cell line, and $K_2$ is the total number of Hi-C data samples under the second cell line. The entries of $Y$ run up to the interaction frequency at row m, column m of the $K_1$-th Hi-C contact matrix under the first cell line and of the $K_2$-th Hi-C contact matrix under the second cell line. An entry $x_{i,z}$ of $X$ is set to 1 if its corresponding $y_{i,j}$ lies inside the z-th hierarchical TAD and to 0 otherwise. The entries of $B$ run up to the weight coefficient of the n-th hierarchical TAD under the first cell line and under the second cell line.
Step 6: establish the objective function of the model and solve it with the online machine learning algorithm FTRL. The specific method is as follows:
The objective function is established first:
where $K_1$ is the total number of Hi-C data samples under the first cell line, $K_2$ is the total number of Hi-C data samples under the second cell line, j is the sample index, $Y_{1j}$ is the vector of all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix under the first cell line, $Y_{2j}$ is the corresponding vector for the j-th Hi-C contact matrix under the second cell line, $B_{:,1}$ is the weight vector of the hierarchical TADs for the first cell line, $B_{:,2}$ is the weight vector of the hierarchical TADs for the second cell line, $X$ encodes the positional relationship between the interaction frequencies and all possible hierarchical TADs, and $B$ is the matrix formed by concatenating the vectors $B_{:,1}$ and $B_{:,2}$. The first term of the objective function is the usual mean squared error, whose purpose is to make the overall gap between the interaction frequencies computed from the model and the true interaction frequencies as small as possible. The second term is the $l_1$ regularization term, which increases the sparsity of the solution. The third term is the $l_2$ regularization term, which makes the solution smoother and prevents overfitting. The fourth term is the core term of the objective function, whose purpose is to make the overall difference between the vectors $B_{:,1}$ and $B_{:,2}$ as small as possible through optimization.
The model is solved with the online machine learning algorithm FTRL, which has the advantages of saving computer memory, improving solution efficiency, and yielding sparse solutions.
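To illustrate how an FTRL-style solver processes one interaction frequency at a time, the sketch below implements the widely used per-coordinate FTRL-Proximal update for a squared-error loss. It is not the patent's exact solver: the coupling term between the two cell lines is omitted, non-negativity is imposed by simple clipping, and the class name, hyperparameters and toy data are all assumptions.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for squared-error regression (illustrative only)."""

    def __init__(self, dim, alpha=0.5, beta=1.0, l1=0.01, l2=0.01):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated adjusted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weights(self):
        w = []
        for z_i, n_i in zip(self.z, self.n):
            if abs(z_i) <= self.l1:
                w.append(0.0)  # the l1 threshold yields exact zeros (sparsity)
                continue
            step = (self.beta + math.sqrt(n_i)) / self.alpha + self.l2
            w_i = -(z_i - math.copysign(self.l1, z_i)) / step
            w.append(max(0.0, w_i))  # assumption: clip to keep weights non-negative
        return w

    def update(self, x, y):
        """Process one sample: indicator row x and observed frequency y."""
        w = self.weights()
        error = sum(w_i * x_i for w_i, x_i in zip(w, x)) - y
        for i, x_i in enumerate(x):
            if x_i == 0:
                continue
            g = error * x_i  # gradient of 0.5 * error**2 with respect to w_i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g


# Toy problem: two hypothetical candidate TADs, only the first carries signal.
X = [[1, 0], [1, 1], [0, 1], [1, 1]]
y = [2.0, 2.0, 0.0, 2.0]
model = FTRLProximal(dim=2)
for _ in range(200):
    for x_row, target in zip(X, y):
        model.update(x_row, target)
learned = model.weights()
```

Because each update touches only one row of $X$ and keeps two numbers per coefficient, memory use stays small even when the number of candidate TADs is large, which is in line with the advantages listed in the text.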
Step 7: based on the solution, identify the hierarchical TADs that differ between cell lines. The specific method is as follows:
The solution consists of two vectors $B_{:,1}$ and $B_{:,2}$, where $B_{i1}$ is the weight coefficient of the i-th hierarchical TAD under cell line $c_1$ and $B_{i2}$ is the weight coefficient of the i-th hierarchical TAD under cell line $c_2$. The value abs($B_{i1} - B_{i2}$) represents the degree of difference between cell lines $c_1$ and $c_2$ on the i-th hierarchical TAD: the larger this value, the greater the difference between the two cell lines on that TAD.
The procedure of the invention is illustrated below by identifying the hierarchical TADs that differ between cell lines HUVEC and IMR90.
(1) The data are Hi-C data at 500 kb resolution for cell lines HUVEC and IMR90, with two datasets prepared for each cell line. The heat maps of the data are shown in Fig. 2: the two heat maps on the left are the HUVEC data, and the two on the right are the IMR90 data.
(2) The data are normalized, using the cross-cell-line Hi-C normalization method multiHiCcompare and the normalization method CPM.
(3) TAD boundary points are identified on the normalized data, specifically by computing binSignal(i) for each bin, fitting a curve to obtain preliminary local minima, and filtering out false-positive TAD boundary points with the rank-sum test.
(4) All possible hierarchical TADs are derived from the TAD boundary points, and the weight coefficient of each possible hierarchical TAD is solved.
(5) The final analysis result is obtained from the solved weight coefficients and visualized. The visualization is shown in Fig. 3, where the left panel shows the actual differential hierarchical TADs and the right panel shows the result obtained by analysis and fitting. The fitted result is close to the actual differential hierarchical TADs.

Claims (9)

1. A method for difference analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning, characterized by comprising the following steps:
Step 1: normalize the Hi-C data to eliminate the systematic bias of the Hi-C experiments and improve comparability between datasets;
Step 2: for the normalized Hi-C data, compute the average interaction frequency between the upstream and downstream regions of each bin, denoted binSignal(i);
Step 3: fit the binSignal sequence with a curve-fitting algorithm, and take the local minima of the fitted curve as preliminary TAD boundary points;
Step 4: filter out false-positive TAD boundary points with a rank-sum test to obtain the final TAD boundary points;
Step 5: derive all possible hierarchical TADs from the TAD boundary points, and establish a mathematical model relating the interaction frequencies $M_{ij}$ in the Hi-C contact matrix $M$ to all possible hierarchical TADs;
Step 6: establish the objective function of the model and solve it with the online machine learning algorithm FTRL;
Step 7: based on the solution, identify the hierarchical TADs that differ between cell lines.
2. The method for difference analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, characterized in that the specific method for normalizing the Hi-C data in step 1 is:
First, the cross-cell-line Hi-C normalization method multiHiCcompare is applied to the Hi-C data from the different cell lines as a preliminary treatment, to eliminate systematic bias between cell lines as far as possible. Then the normalization method CPM (counts per million) is applied to the pre-processed Hi-C data to further improve the comparability of Hi-C data between cell lines.
3. The method for difference analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, characterized in that the specific method for computing binSignal(i) in step 2 is:
For the bin with index i, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i). The window size of the upstream and downstream regions is w, and binSignal(i) is computed by the following formula:
where $U_i(l)$ denotes a bin in the upstream region of the bin with index i, $D_i(m)$ denotes a bin in the downstream region of the bin with index i, cont.freq denotes the interaction frequency between an upstream bin and a downstream bin, l is the index of an element in the upstream window, and m is the index of an element in the downstream window.
4. The method for difference analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, characterized in that the specific method for filtering out false-positive TAD boundary points with the rank-sum test in step 4 is:
Within a certain range around the TAD boundary region to be tested, collect the interaction frequencies between upstream and downstream bins, $S_1$, and the interaction frequencies within the upstream region or within the downstream region, $S_2$. A rank-sum test is used to check whether $S_1$ and $S_2$ differ significantly. If there is a significant difference, the TAD boundary region is considered a false positive and is filtered out; otherwise the TAD boundary region is considered a true positive.
5. The method for difference analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 4, characterized in that a significant difference means that the p-value of the rank-sum test is less than 0.05.
6. The method for difference analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 1, characterized in that the contact matrix $M$ in step 5 is a symmetric square matrix whose elements are called interaction frequencies, and the mathematical model relating the interaction frequencies $M_{ij}$ to all possible hierarchical TADs is:
$Y = XB + N(0, \sigma^2)$, subject to $B \ge 0$
where $Y$ is the matrix formed from the interaction frequencies of all Hi-C contact matrices, each column of $Y$ being the vector of interaction frequencies in the lower triangular region (including the diagonal) of one contact matrix; $X$ encodes the positional relationship between each interaction frequency and all possible hierarchical TADs; $B$ holds the weight coefficients of all possible hierarchical TADs; $N(0, \sigma^2)$ denotes noise drawn from a normal distribution; and $\sigma$ is the standard deviation of that distribution.
7. The method for difference analysis of hierarchical TADs in Hi-C contact matrices based on online machine learning according to claim 6, characterized in that the mathematical model is written out in detail as:
where m is the dimension of the contact matrix, n is the total number of all possible hierarchical TADs, $c_1$ denotes the first cell line, $c_2$ denotes the second cell line, $K_1$ is the total number of Hi-C data samples under the first cell line, and $K_2$ is the total number of Hi-C data samples under the second cell line. The entries of $Y$ run up to the interaction frequency at row m, column m of the $K_1$-th Hi-C contact matrix under the first cell line and of the $K_2$-th Hi-C contact matrix under the second cell line. An entry $x_{i,z}$ of $X$ is set to 1 if its corresponding $y_{i,j}$ lies inside the z-th hierarchical TAD and to 0 otherwise. The entries of $B$ run up to the weight coefficient of the n-th hierarchical TAD under the first cell line and under the second cell line.
8. The hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning according to claim 6 or 7, characterized in that the specific method for solving the objective function described in step 6 is:
First, the objective function is established:
wherein $K_1$ denotes the total number of Hi-C data samples under the first type of cell line; $K_2$ denotes the total number of Hi-C data samples under the second type of cell line; j denotes the sample index; $Y_{1j}$ denotes the vector formed by all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix under the first type of cell line; $Y_{2j}$ denotes the vector formed by all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix under the second type of cell line; $B_{:,1}$ denotes the weight vector of the hierarchical TADs of the first type of cell line; $B_{:,2}$ denotes the weight vector of the hierarchical TADs of the second type of cell line; X denotes the positional relationship between the interaction frequencies and all possible hierarchical TADs; and B is the matrix formed by concatenating the vectors $B_{:,1}$ and $B_{:,2}$. The first term of the objective function is the ordinary mean squared error, whose purpose is to make the overall gap between the interaction frequencies computed from the model and the true interaction frequencies as small as possible; the second term is an $l_1$ regularization term, whose purpose is to increase the sparsity of the solution; the third term is an $l_2$ regularization term, whose purpose is to make the solution smoother and prevent overfitting; the fourth term is the core term of the loss function, whose purpose is to make, through optimization, the overall difference between the vectors $B_{:,1}$ and $B_{:,2}$ as small as possible.
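The four terms described above can be evaluated as in the sketch below. It is not the formula of the claim, which is not reproduced in this text: the weights `lam1`, `lam2` and `lam3`, the normalization of the error term, and the use of a squared $l_2$ distance for the fourth term are all assumptions made for illustration, and the function name `objective` is hypothetical.

```python
import numpy as np


def objective(X, Y1, Y2, B, lam1=0.1, lam2=0.1, lam3=0.1):
    """Evaluate an assumed four-term objective for the two cell lines.

    X  : (p, n) indicator matrix; Y1, Y2 : (p, K1) and (p, K2) sample columns;
    B  : (n, 2) non-negative weights, one column per cell line.
    """
    B = np.asarray(B, dtype=float)
    if np.any(B < 0):
        raise ValueError("B must satisfy the constraint B >= 0")
    b1, b2 = B[:, 0], B[:, 1]
    # Term 1: mean squared error between modelled and observed frequencies.
    r1 = Y1 - (X @ b1)[:, None]
    r2 = Y2 - (X @ b2)[:, None]
    mse = (np.sum(r1 ** 2) + np.sum(r2 ** 2)) / (r1.size + r2.size)
    # Term 2: l1 penalty (sparsity). Term 3: l2 penalty (smoothness).
    l1 = lam1 * np.abs(B).sum()
    l2 = lam2 * np.sum(B ** 2)
    # Term 4: penalty on the difference between the two weight vectors
    # (squared l2 distance is an assumption).
    diff = lam3 * np.sum((b1 - b2) ** 2)
    return mse + l1 + l2 + diff
```

A solver would minimize this value over B subject to B ≥ 0; the online update rule itself is not described in these claims and is therefore not sketched here.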
9. The hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning according to claim 8, characterized in that the specific method for identifying differential hierarchical TADs based on the solving result described in step 7 is:
The solving result is two vectors $B_{:,1}$ and $B_{:,2}$, wherein $B_{i1}$ denotes the weight coefficient of the i-th hierarchical TAD under cell line $c_1$, $B_{i2}$ denotes the weight coefficient of the i-th hierarchical TAD under cell line $c_2$, and $\mathrm{abs}(B_{i1} - B_{i2})$ denotes the degree of difference between cell line $c_1$ and cell line $c_2$ on the i-th hierarchical TAD; the larger the value of $\mathrm{abs}(B_{i1} - B_{i2})$, the greater the difference between cell line $c_1$ and cell line $c_2$ on the i-th hierarchical TAD.
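The ranking implied by claim 9 can be sketched as below: the absolute difference between the two columns of B is computed per TAD and the TADs are sorted from most to least different. The function name `rank_differential_tads` and the optional `top` cut-off are hypothetical; the claim itself does not specify a threshold.

```python
import numpy as np


def rank_differential_tads(B, top=None):
    """Rank TAD indices by abs(B[i, 0] - B[i, 1]), largest difference first."""
    B = np.asarray(B, dtype=float)
    score = np.abs(B[:, 0] - B[:, 1])
    # Stable sort keeps the original order among equal scores.
    order = np.argsort(-score, kind="stable")
    if top is not None:
        order = order[:top]
    return [(int(i), float(score[i])) for i in order]
```

Each returned pair holds the index i of a hierarchical TAD and its difference score, so the first entries are the candidates that differ most between the two cell lines.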
CN201910315741.1A 2019-04-19 2019-04-19 Hierarchical TADs (TADs-related analysis) difference analysis method in Hi-C contact matrix based on online machine learning Expired - Fee Related CN110097922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910315741.1A CN110097922B (en) 2019-04-19 2019-04-19 Hierarchical TADs (TADs-related analysis) difference analysis method in Hi-C contact matrix based on online machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910315741.1A CN110097922B (en) 2019-04-19 2019-04-19 Hierarchical TADs (TADs-related analysis) difference analysis method in Hi-C contact matrix based on online machine learning

Publications (2)

Publication Number Publication Date
CN110097922A true CN110097922A (en) 2019-08-06
CN110097922B CN110097922B (en) 2020-12-08

Family

ID=67445233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910315741.1A Expired - Fee Related CN110097922B (en) 2019-04-19 2019-04-19 Hierarchical TADs (TADs-related analysis) difference analysis method in Hi-C contact matrix based on online machine learning

Country Status (1)

Country Link
CN (1) CN110097922B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023092303A1 (en) * 2021-11-23 2023-06-01 Chromatintech Beijing Co, Ltd Method for generating an enhanced hi-c matrix, non-transitory computer readable medium storing a program for generating an enhanced hi-c matrix, method for identifying a structural chromatin aberration in an enhanced hi-c matrix

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017011710A2 (en) * 2015-07-14 2017-01-19 Whitehead Institute For Biomedical Research Chromosome neighborhood structures and methods relating thereto
CN108647492A (en) * 2018-05-02 2018-10-12 中国人民解放军军事科学院军事医学研究院 A kind of characterizing method and device of chromatin topology relevant domain
CN109448783A (en) * 2018-08-07 2019-03-08 清华大学 Method for analyzing chromatin topological structure domain boundary
WO2019060683A2 (en) * 2017-09-21 2019-03-28 The Penn State Research Foundation Systems, methods, and processor-readable media for detecting disease causal variants
CN109637579A (en) * 2018-12-18 2019-04-16 长沙学院 A kind of key protein matter recognition methods based on tensor random walk

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MARIE ZUFFEREY ET AL: "Comparison of computational methods for the identification of topologically associating domains", 《GENOME BIOLOGY》 *
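王小滔: "Research on the structure and function of chromatin topologically associating domains", 《China Doctoral Dissertations Full-text Database, Basic Sciences》 *
韩九强 et al.: "Research status and development trends of HERV based on bioinformatics", 《Chinese Journal of Bioinformatics》 *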

Also Published As

Publication number Publication date
CN110097922B (en) 2020-12-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201208
