CN110097922B - Hierarchical TADs (topologically associating domains) difference analysis method in Hi-C contact matrix based on online machine learning

Publication number
CN110097922B
Authority
CN
China
Prior art keywords
tads
hierarchical
cell line
contact matrix
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910315741.1A
Other languages
Chinese (zh)
Other versions
CN110097922A (en
Inventor
吕红强
刘聪毅
韩九强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201910315741.1A
Publication of CN110097922A
Application granted
Publication of CN110097922B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Biophysics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Genetics & Genomics (AREA)
  • Algebra (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

A hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning. The Hi-C data are first normalized to eliminate systematic experimental bias and make the data more comparable. From the normalized data, the average interaction frequency between the upstream and downstream regions of each bin is calculated and recorded as binSignal(i). The binSignal sequence is fitted with a curve and a rank-sum test is applied to obtain the TADs boundary region points. All possible hierarchical TADs are derived from the boundary region points, and a mathematical model relating the interaction frequencies in the Hi-C contact matrix to all possible hierarchical TADs is proposed. An objective function is established for the model, and the hierarchical TADs difference analysis model is solved, for the first time, with the online machine learning algorithm FTRL to identify the hierarchical TADs that differ between cell lines. The invention thus provides a mathematical model relating interaction frequencies in the Hi-C contact matrix to hierarchical TADs, and uses FTRL to obtain the weight coefficients of all hierarchical TADs so that differential TADs between cell lines can be identified.

Description

Hierarchical TADs (topologically associating domains) difference analysis method in Hi-C contact matrix based on online machine learning
Technical Field
The invention belongs to the technical field of biology. It relates to the differential analysis of hierarchical TADs across different cell lines, and in particular to a hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning.
Background
Hi-C is a high-throughput chromatin conformation capture technology; a Hi-C experiment yields interaction information between any pair of sites across the whole genome. Hi-C data are the data obtained from a Hi-C experiment and are generally stored as a matrix called the contact matrix. The contact matrix is a symmetric square matrix, and each of its elements is called an interaction frequency. As Hi-C technology developed, scientists studying Hi-C data found that each chromosome can be roughly divided into two kinds of compartments (A/B compartments) with active and inactive chromatin states: chromatin in A compartments is active, and chromatin in B compartments is inactive. Building on this discovery, at higher resolution scientists also found genomic regions with stronger internal interactions inside the compartments, called topologically associating domains (TADs); the interaction strength between sites inside a TAD is far higher than with sites outside it.
Numerous biological experiments have shown that TADs are essential elements in regulating the transcriptional expression of genes and that TADs constrain the interplay of enhancers and promoters during gene regulation. In addition, disruption of TADs boundaries may lead to diseases such as cancer. Studies have shown that most TADs have a hierarchical structure and that only a very small number of TADs are independent. Differential analysis of hierarchical TADs can therefore deepen the understanding of how gene expression influences cell differentiation.
Existing methods for the differential analysis of TADs do not adequately account for the hierarchical structure of TADs, which lowers the identification rate of differential TADs.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning, for the differential analysis of hierarchical TADs in Hi-C data from different cell lines.
To achieve this aim, the invention adopts the following technical scheme:
A hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning, characterized by comprising the following steps:
step 1, normalizing the Hi-C data to eliminate the systematic bias of the Hi-C experiments and make the data more comparable;
step 2, for the normalized Hi-C data, calculating the average interaction frequency between the upstream and downstream regions of each bin and recording it as binSignal(i);
step 3, fitting the binSignal sequence with a curve fitting algorithm and provisionally taking the local minimum points of the fitted curve as TADs boundary region points;
step 4, filtering out false positive TADs boundary region points with a rank-sum test to obtain the final TADs boundary region points;
step 5, obtaining all possible hierarchical TADs from the TADs boundary region points, and proposing a mathematical model relating the interaction frequencies $M_{ij}$ in the Hi-C contact matrix $M$ to all possible hierarchical TADs;
step 6, establishing an objective function for the model and solving it with the online machine learning algorithm FTRL;
step 7, identifying, from the solution, the hierarchical TADs that differ between cell lines.
The specific method for normalizing the Hi-C data in step 1 is as follows:
first, a cross-cell-line Hi-C data normalization method is applied to the Hi-C data from the different cell lines to eliminate, as far as possible, the systematic bias between cell lines; the preprocessed Hi-C data are then normalized with CPM (counts per million) to further improve the comparability of Hi-C data between cell lines.
The specific method for calculating binSignal(i) in step 2 is as follows:
for the bin with index i, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i), the window size of the upstream and downstream regions is w, and binSignal(i) is calculated according to the following formula:
Figure GDA0002693272490000031
where $U_i(l)$ denotes a bin in the upstream region of bin $i$, $D_i(m)$ denotes a bin in the downstream region of bin $i$, cont.freq denotes the interaction frequency between an upstream-region bin and a downstream-region bin, $l$ is the index of an element in the upstream window, and $m$ is the index of an element in the downstream window.
The specific method for filtering out false positive TADs boundary region points with the rank-sum test in step 4 is as follows:
within a certain range of the TADs boundary region to be tested, collect the interaction frequencies between the upstream and downstream regions, $S_1$, and the interaction frequencies within the upstream or downstream region, $S_2$, and use the rank-sum test to check whether $S_1$ and $S_2$ differ significantly, where a significant difference means that the p-value of the rank-sum test is less than 0.05. If there is a significant difference, the TADs boundary region is considered a false positive and is filtered out; otherwise it is considered a true positive.
In step 5, the contact matrix $M$ is a symmetric square matrix, each element of the contact matrix is called an interaction frequency, and the mathematical model relating the interaction frequencies $M_{ij}$ to all possible hierarchical TADs is:
$Y = XB + N(0, \sigma^2)$, where $B \ge 0$
where $Y$ denotes the matrix formed by the interaction frequencies of all Hi-C contact matrices, each column of $Y$ being the vector of interaction frequencies in the lower triangular region (diagonal included) of one contact matrix; $X$ denotes the positional relationship between the interaction frequencies and all possible hierarchical TADs; $B$ denotes the weight coefficients of all possible hierarchical TADs; $N(0, \sigma^2)$ denotes normally distributed noise; and $\sigma$ denotes the standard deviation of the normal distribution. The mathematical model can be described specifically as:
Figure GDA0002693272490000041
where $m$ denotes the dimension of the contact matrix, $n$ denotes the total number of all possible hierarchical TADs, $c_1$ denotes the first cell line, $c_2$ denotes the second cell line, $K_1$ denotes the total number of Hi-C data samples for the first cell line, and $K_2$ denotes the total number of Hi-C data samples for the second cell line;
Figure GDA0002693272490000042
denotes the interaction frequency in row $m$, column $m$ of the $K_1$-th Hi-C contact matrix of the first cell line;
Figure GDA0002693272490000043
denotes the interaction frequency in row $m$, column $m$ of the $K_2$-th Hi-C contact matrix of the second cell line. An element $x_{i,z}$ is set to 1 if the corresponding $y_{i,j}$ lies inside the $z$-th hierarchical TADs, and to 0 otherwise. In the model,
Figure GDA0002693272490000044
denotes the weight coefficient of the $n$-th hierarchical TADs for the first cell line, and
Figure GDA0002693272490000045
denotes the weight coefficient of the $n$-th hierarchical TADs for the second cell line.
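As a rough illustration of how the columns of Y could be assembled, the sketch below flattens the lower triangle (diagonal included) of each contact matrix and stacks one vector per sample. The function names and the row-by-row ordering are assumptions made for this example; the text only states that each column holds the lower-triangle interaction frequencies of one contact matrix.

```python
def lower_triangle_vector(matrix):
    """Flatten the lower triangle (diagonal included) of a square matrix, row by row."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("contact matrix must be square")
    return [matrix[i][j] for i in range(n) for j in range(i + 1)]


def build_response_matrix(samples):
    """Stack one lower-triangle vector per Hi-C sample as the columns of Y."""
    columns = [lower_triangle_vector(m) for m in samples]
    # Transpose so that rows index matrix entries and columns index samples.
    return [list(row) for row in zip(*columns)]


toy = [[4, 1, 0],
       [1, 5, 2],
       [0, 2, 6]]
Y = build_response_matrix([toy, toy])  # two identical toy samples
```

A 3 x 3 contact matrix contributes 6 entries per column, so Y here has 6 rows and 2 columns.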
The specific method for solving the objective function in step 6 is as follows:
first, an objective function is established:
Figure GDA0002693272490000046
Figure GDA0002693272490000047
where $B \ge 0$
where $K_1$ denotes the total number of Hi-C data samples for the first cell line, $K_2$ denotes the total number of Hi-C data samples for the second cell line, $j$ denotes the sample index, $Y_{1j}$ denotes the vector of all interaction frequencies in the lower triangular region (diagonal included) of the $j$-th Hi-C contact matrix of the first cell line, $Y_{2j}$ denotes the vector of all interaction frequencies in the lower triangular region (diagonal included) of the $j$-th Hi-C contact matrix of the second cell line, $B_{:,1}$ denotes the weight vector of the hierarchical TADs for the first cell line, $B_{:,2}$ denotes the weight vector of the hierarchical TADs for the second cell line, $X$ denotes the positional relationship between the interaction frequencies and all possible hierarchical TADs, and $B$ is the matrix formed by concatenating the vectors $B_{:,1}$ and $B_{:,2}$. The first term of the objective function is the ordinary mean squared error, which keeps the overall difference between the interaction frequencies computed from the model and the observed interaction frequencies as small as possible; the second term is an $\ell_1$ regularization term, which increases the sparsity of the solution; the third term is an $\ell_2$ regularization term, which makes the solution smoother and prevents overfitting; the fourth term is the core term of the loss function, whose purpose is to keep the overall difference between the vectors $B_{:,1}$ and $B_{:,2}$ as small as possible.
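The four terms described above can be sketched as a plain Python function. This is only an illustration: the exact formula is in the patent figure and is not reproduced in the text, so the coefficient names `lam1`, `lam2`, `lam3`, the averaging of the squared error, and the squared-difference form of the fourth term are all assumptions.

```python
def matvec(X, b):
    """Multiply the indicator matrix X by a weight vector b."""
    return [sum(x * w for x, w in zip(row, b)) for row in X]


def objective(X, Y1, Y2, b1, b2, lam1=0.1, lam2=0.1, lam3=0.1):
    """Y1 and Y2 are lists of sample vectors for cell lines 1 and 2."""
    fit1, fit2 = matvec(X, b1), matvec(X, b2)
    squared = sum((y - f) ** 2 for sample in Y1 for y, f in zip(sample, fit1))
    squared += sum((y - f) ** 2 for sample in Y2 for y, f in zip(sample, fit2))
    mse = squared / (len(X) * (len(Y1) + len(Y2)))   # term 1: mean squared error
    l1 = sum(abs(w) for w in b1 + b2)                # term 2: L1 penalty (sparsity)
    l2 = sum(w * w for w in b1 + b2)                 # term 3: L2 penalty (smoothness)
    diff = sum((u - v) ** 2 for u, v in zip(b1, b2)) # term 4: assumed squared difference
    return mse + lam1 * l1 + lam2 * l2 + lam3 * diff


X_toy = [[1, 0], [1, 1]]
value = objective(X_toy, [[1, 3]], [[1, 3]], [1, 2], [1, 2])
```

In this toy call both cell lines are fitted exactly and share the same weights, so only the two regularization terms contribute to the value.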
The model is solved with the online machine learning algorithm FTRL (Follow-The-Regularized-Leader), which has the following advantages: it saves computer memory, improves solving efficiency, and yields sparse solutions.
The specific method for identifying the differential hierarchical TADs from the solution in step 7 is as follows:
the solution consists of two vectors $B_{:,1}$ and $B_{:,2}$, where $B_{i1}$ is the weight coefficient of the $i$-th hierarchical TADs for cell line $c_1$ and $B_{i2}$ is the weight coefficient of the $i$-th hierarchical TADs for cell line $c_2$; $\mathrm{abs}(B_{i1} - B_{i2})$ represents the degree of difference between cell line $c_1$ and cell line $c_2$ in the $i$-th hierarchical TADs, and the larger the value of $\mathrm{abs}(B_{i1} - B_{i2})$, the greater the difference between cell line $c_1$ and cell line $c_2$ in the $i$-th hierarchical TADs.
Compared with the prior art, the invention has the following advantages: based on the results of a large number of biological experiments, it provides a mathematical model relating the interaction frequencies of a Hi-C contact matrix to hierarchical TADs, establishes an objective function, obtains the weight coefficients of all hierarchical TADs with the online machine learning algorithm FTRL, and identifies the differential TADs between different cell lines.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 shows heat maps of Hi-C data at 500 Kb resolution for the cell lines HUVEC and IMR90.
Fig. 3 is the results of actual hierarchical differential TADs and analytical fitting.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
As shown in FIG. 1, the differential analysis of hierarchical TADs between two cell lines, which builds on the identification of TADs boundary regions, comprises the following steps:
Step 1, normalizing the Hi-C data to eliminate the systematic bias of the Hi-C experiments and make the data more comparable, specifically as follows:
first, a cross-cell-line Hi-C data normalization method is applied to the Hi-C data from the different cell lines to eliminate, as far as possible, the systematic bias between cell lines; the preprocessed Hi-C data are then normalized with CPM (counts per million) to further improve the comparability of Hi-C data between cell lines.
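The CPM step can be illustrated with a short Python sketch. It covers only the counts-per-million scaling, not the cross-cell-line normalization that precedes it. The function name is hypothetical, and taking the library size as the sum over the upper triangle (so that each symmetric contact is counted once) is an assumption; the text does not say how the total is computed.

```python
def cpm_normalize(matrix):
    """Scale a symmetric contact matrix to counts per million."""
    n = len(matrix)
    # Assumed library size: each contact counted once (upper triangle plus diagonal).
    total = sum(matrix[i][j] for i in range(n) for j in range(i, n))
    if total == 0:
        raise ValueError("contact matrix has no counts")
    scale = 1_000_000 / total
    return [[value * scale for value in row] for row in matrix]


normalized = cpm_normalize([[2, 1],
                            [1, 1]])
```

After scaling, matrices sequenced to different depths share the same total, which is what makes their interaction frequencies comparable.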
Step 2, for the normalized Hi-C data, calculating the average interaction frequency between the upstream and downstream regions of each bin and recording it as binSignal(i), specifically as follows:
for the bin with index i, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i), and the window size of the upstream and downstream regions is w. binSignal(i) is calculated according to the following formula:
Figure GDA0002693272490000061
where $U_i(l)$ denotes a bin in the upstream region of bin $i$, $D_i(m)$ denotes a bin in the downstream region of bin $i$, cont.freq denotes the interaction frequency between an upstream-region bin and a downstream-region bin, $l$ is the index of an element in the upstream window, and $m$ is the index of an element in the downstream window.
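The averaging described above can be sketched in Python as follows. The exact window definition is in the figure and is not reproduced in the text, so this example assumes that the upstream window holds the w bins before bin i, the downstream window holds the w bins after it, windows are truncated at the chromosome ends, and bins with an empty window get a signal of 0.0. The function name is hypothetical.

```python
def bin_signal(matrix, w):
    """Average interaction frequency between upstream and downstream windows of each bin."""
    n = len(matrix)
    signal = []
    for i in range(n):
        upstream = range(max(0, i - w), i)               # assumed: the w bins before i
        downstream = range(i + 1, min(n, i + 1 + w))     # assumed: the w bins after i
        pairs = [matrix[u][d] for u in upstream for d in downstream]
        # End bins have no upstream or downstream window; 0.0 is a placeholder choice.
        signal.append(sum(pairs) / len(pairs) if pairs else 0.0)
    return signal


# Toy matrix with two TADs (bins 0-2 and 3-5): strong contacts inside, weak between.
toy = [[10 if (a < 3) == (b < 3) else 1 for b in range(6)] for a in range(6)]
signal = bin_signal(toy, w=2)
```

The signal drops at the bins next to the domain border because every upstream/downstream pair there crosses the two TADs.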
Step 3, fitting the binSignal sequence with a curve fitting algorithm and provisionally taking the local minimum points of the fitted curve as TADs boundary region points.
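The text does not name the curve fitting algorithm, so the sketch below substitutes a centered moving average as a stand-in smoother and then picks interior local minima. Both function names and the smoothing choice are assumptions for illustration, not the method of the invention.

```python
def moving_average(values, half_window=1):
    """Stand-in for curve fitting: centered moving average, truncated at the ends."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def local_minima(values):
    """Indices of interior points lower than the left neighbour and not above the right one."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] <= values[i + 1]]


raw_signal = [5, 4, 1, 4, 5, 4, 1, 4, 5]
candidates = local_minima(moving_average(raw_signal))
```

The returned indices are only candidate boundary points; the rank-sum test in the next step decides which ones are kept.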
Step 4, filtering out false positive TADs boundary region points with a rank-sum test to obtain the final TADs boundary region points, specifically as follows:
within a certain range of the TADs boundary region to be tested, collect the interaction frequencies between the upstream and downstream regions, $S_1$, and the interaction frequencies within the upstream or downstream region, $S_2$, and use the rank-sum test to check whether $S_1$ and $S_2$ differ significantly, where a significant difference means that the p-value of the rank-sum test is less than 0.05. If there is a significant difference, the TADs boundary region is considered a false positive and is filtered out; otherwise it is considered a true positive.
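A dependency-free sketch of the significance check is given below. It computes a two-sided Wilcoxon rank-sum p-value with the normal approximation, using average ranks for ties and no tie correction of the variance, which is a simplification. The function name is hypothetical, and the code only reports the p-value; how a significant result is then interpreted follows the rule stated in the text.

```python
import math


def rank_sum_p_value(s1, s2):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, average ranks for ties)."""
    n1, n2 = len(s1), len(s2)
    pooled = sorted((v, g) for g, sample in enumerate((s1, s2)) for v in sample)
    rank_sum_1 = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_1 += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_1 - mean) / std
    return math.erfc(abs(z) / math.sqrt(2))


p_separated = rank_sum_p_value([1, 2, 3, 4, 5, 6, 7, 8], [11, 12, 13, 14, 15, 16, 17, 18])
p_identical = rank_sum_p_value([1, 2, 3, 4], [1, 2, 3, 4])
differs_significantly = p_separated < 0.05
```

For small samples an exact test would be more appropriate than the normal approximation used here.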
Step 5, obtaining all possible hierarchical TADs from the TADs boundary region points, and establishing a mathematical model relating the interaction frequencies $M_{ij}$ in the Hi-C contact matrix $M$ to all possible hierarchical TADs, according to the formula:
$Y = XB + N(0, \sigma^2)$, where $B \ge 0$
where $Y$ denotes the matrix formed by the interaction frequencies of all Hi-C contact matrices, each column of $Y$ being the vector of interaction frequencies in the lower triangular region (diagonal included) of one contact matrix; $X$ denotes the positional relationship between the interaction frequencies and all possible hierarchical TADs; $B$ denotes the weight coefficients of all possible hierarchical TADs; $N(0, \sigma^2)$ denotes normally distributed noise; and $\sigma$ denotes the standard deviation of the normal distribution. The mathematical model can be described specifically as:
Figure GDA0002693272490000071
where $m$ denotes the dimension of the contact matrix, $n$ denotes the total number of all possible hierarchical TADs, $c_1$ denotes the first cell line, $c_2$ denotes the second cell line, $K_1$ denotes the total number of Hi-C data samples for the first cell line, and $K_2$ denotes the total number of Hi-C data samples for the second cell line;
Figure GDA0002693272490000072
denotes the interaction frequency in row $m$, column $m$ of the $K_1$-th Hi-C contact matrix of the first cell line;
Figure GDA0002693272490000081
denotes the interaction frequency in row $m$, column $m$ of the $K_2$-th Hi-C contact matrix of the second cell line. An element $x_{i,z}$ is set to 1 if the corresponding $y_{i,j}$ lies inside the $z$-th hierarchical TADs, and to 0 otherwise. In the model,
Figure GDA0002693272490000082
denotes the weight coefficient of the $n$-th hierarchical TADs for the first cell line, and
Figure GDA0002693272490000083
denotes the weight coefficient of the $n$-th hierarchical TADs for the second cell line.
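One possible reading of "all possible hierarchical TADs" is that every pair of boundary points spans one candidate domain. Under that assumption, the sketch below enumerates the candidates and builds the 0/1 indicator matrix X, with one row per lower-triangle entry and one column per candidate. The function names and the half-open interval convention [start, end) are hypothetical.

```python
from itertools import combinations


def candidate_tads(boundaries):
    """Assumed enumeration: every pair of boundaries (start < end) spans one candidate TAD."""
    return list(combinations(sorted(boundaries), 2))


def design_matrix(n_bins, tads):
    """Row per lower-triangle entry (i, j), column per TAD; 1 if both bins lie inside it."""
    rows = []
    for i in range(n_bins):
        for j in range(i + 1):  # lower triangle, diagonal included, so j <= i
            rows.append([1 if start <= j and i < end else 0 for start, end in tads])
    return rows


tads = candidate_tads([0, 2, 4])
X = design_matrix(4, tads)
```

With boundaries at bins 0, 2 and 4 this gives two small domains nested inside one large domain, so entries inside a small domain are covered by two columns of X.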
Step 6, establishing an objective function of the model, and solving the objective optimization function by using an online machine learning algorithm FTRL, wherein the specific method comprises the following steps:
first, an objective function is established:
Figure GDA0002693272490000084
Figure GDA0002693272490000085
where $B \ge 0$
where $K_1$ denotes the total number of Hi-C data samples for the first cell line, $K_2$ denotes the total number of Hi-C data samples for the second cell line, $j$ denotes the sample index, $Y_{1j}$ denotes the vector of all interaction frequencies in the lower triangular region (diagonal included) of the $j$-th Hi-C contact matrix of the first cell line, $Y_{2j}$ denotes the vector of all interaction frequencies in the lower triangular region (diagonal included) of the $j$-th Hi-C contact matrix of the second cell line, $B_{:,1}$ denotes the weight vector of the hierarchical TADs for the first cell line, $B_{:,2}$ denotes the weight vector of the hierarchical TADs for the second cell line, $X$ denotes the positional relationship between the interaction frequencies and all possible hierarchical TADs, and $B$ is the matrix formed by concatenating the vectors $B_{:,1}$ and $B_{:,2}$. The first term of the objective function is the ordinary mean squared error, which keeps the overall difference between the interaction frequencies computed from the model and the observed interaction frequencies as small as possible; the second term is an $\ell_1$ regularization term, which increases the sparsity of the solution; the third term is an $\ell_2$ regularization term, which makes the solution smoother and prevents overfitting; the fourth term is the core term of the objective function, whose purpose is to keep the overall difference between the vectors $B_{:,1}$ and $B_{:,2}$ as small as possible.
The model is solved with the online machine learning algorithm FTRL (Follow-The-Regularized-Leader), which has the following advantages: it saves computer memory, improves solving efficiency, and yields sparse solutions.
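To show how an FTRL solver processes one observation at a time, here is a minimal per-coordinate FTRL-Proximal sketch for a single cell line. It fits only the squared-error, L1 and L2 terms with the constraint B >= 0; the fourth, cross-cell-line term is left out because its exact form is given only in the figure. The class name, the hyperparameters `alpha` and `beta`, and the toy data are assumptions for illustration.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for squared-error regression with B >= 0."""

    def __init__(self, dim, alpha=0.5, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated, shifted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weights(self):
        w = []
        for z_i, n_i in zip(self.z, self.n):
            if abs(z_i) <= self.l1:
                w.append(0.0)  # L1 threshold: this is where sparsity comes from
                continue
            scale = (self.beta + math.sqrt(n_i)) / self.alpha + self.l2
            w_i = -(z_i - math.copysign(self.l1, z_i)) / scale
            w.append(max(0.0, w_i))  # enforce the constraint B >= 0
        return w

    def update(self, x, y):
        """Consume one observation: a row x of X and one interaction frequency y."""
        w = self.weights()
        error = sum(w_i * x_i for w_i, x_i in zip(w, x)) - y
        for i, x_i in enumerate(x):
            if x_i == 0:
                continue  # only coordinates active in this row are touched
            g = error * x_i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g


def mean_squared_error(X, y, w):
    return sum((sum(a * b for a, b in zip(row, w)) - t) ** 2 for row, t in zip(X, y)) / len(y)


X = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]]
y = [3.0, 3.0, 1.0, 1.0, 4.0]  # toy data generated from weights [3, 0, 1]
model = FTRLProximal(dim=3, l1=0.01, l2=0.01)
mse_before = mean_squared_error(X, y, model.weights())
for _ in range(200):
    for row, target in zip(X, y):
        model.update(row, target)
mse_after = mean_squared_error(X, y, model.weights())
```

Only two numbers per coefficient are stored and each row is visited in isolation, which is consistent with the memory advantage mentioned above.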
Step 7, identifying, from the solution, the hierarchical TADs that differ between cell lines, specifically as follows:
the solution consists of two vectors $B_{:,1}$ and $B_{:,2}$, where $B_{i1}$ is the weight coefficient of the $i$-th hierarchical TADs for cell line $c_1$ and $B_{i2}$ is the weight coefficient of the $i$-th hierarchical TADs for cell line $c_2$; $\mathrm{abs}(B_{i1} - B_{i2})$ represents the degree of difference between cell line $c_1$ and cell line $c_2$ in the $i$-th hierarchical TADs, and the larger the value of $\mathrm{abs}(B_{i1} - B_{i2})$, the greater the difference between cell line $c_1$ and cell line $c_2$ in the $i$-th hierarchical TADs.
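The ranking implied by this step can be written in a few lines of Python. The function name and the example weights are hypothetical; the score is the abs(B_i1 - B_i2) quantity defined above.

```python
def rank_differential_tads(b1, b2):
    """Return (TAD index, abs weight difference) pairs, most different first."""
    scores = [abs(u - v) for u, v in zip(b1, b2)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


ranking = rank_differential_tads([3.0, 0.0, 1.0], [1.0, 0.0, 1.5])
```

The text does not give a cut-off for calling a TAD differential, so the sketch returns the full ranking rather than applying a threshold.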
The procedure of the invention is illustrated below by identifying the hierarchical TADs that differ between the HUVEC and IMR90 cell lines.
(1) The data are Hi-C data at 500 Kb resolution for the cell lines HUVEC and IMR90, with two data sets per cell line. Heat maps of the data are shown in FIG. 2: the two heat maps on the left are the HUVEC data and the two heat maps on the right are the IMR90 data.
(2) The data are normalized, using the cross-cell-line Hi-C data normalization method MultiHiCcomp followed by the data normalization method CPM.
(3) The TADs boundary region points of the normalized data are identified as follows: the value of binSignal(i) is calculated for each bin, the initial local minimum points are obtained by curve fitting, and false positive TADs boundary region points are filtered out with the rank-sum test.
(4) All possible hierarchical TADs are obtained from the TADs boundary region points, and the weight coefficient of each possible hierarchical TADs is solved.
(5) The final analysis result is obtained from the weight coefficients and visualized. The visualization is shown in FIG. 3: the left panel shows the actual hierarchical differential TADs and the right panel shows the result obtained by analysis and fitting. The fitted result is close to the actual hierarchical differential TADs.

Claims (8)

1. A hierarchical TADs difference analysis method in a Hi-C contact matrix based on online machine learning, characterized by comprising the following steps:
step 1, normalizing the Hi-C data to eliminate the systematic bias of the Hi-C experiments and make the data more comparable;
step 2, for the normalized Hi-C data, calculating the average interaction frequency between the upstream and downstream regions of each bin and recording it as binSignal(i);
step 3, fitting the sequence binSignal(i) with a curve fitting algorithm and provisionally taking the local minimum points of the fitted curve as TADs boundary region points;
step 4, filtering out false positive TADs boundary region points with a rank-sum test to obtain the final TADs boundary region points;
step 5, obtaining all possible hierarchical TADs from the TADs boundary region points, and establishing a mathematical model relating the interaction frequencies $M_{ij}$ in the Hi-C contact matrix $M$ to all possible hierarchical TADs;
step 6, establishing an objective function for the model and solving it with the online machine learning algorithm FTRL;
step 7, identifying, from the solution, the hierarchical TADs that differ between cell lines.
2. The hierarchical TADs difference analysis method in the Hi-C contact matrix based on online machine learning according to claim 1, wherein the specific method for normalizing the Hi-C data in step 1 is as follows:
first, a cross-cell-line Hi-C data normalization method is applied to the Hi-C data from the different cell lines to eliminate, as far as possible, the systematic bias between cell lines; the preprocessed Hi-C data are then normalized with CPM (counts per million) to further improve the comparability of Hi-C data between cell lines.
3. The hierarchical TADs difference analysis method in the Hi-C contact matrix based on online machine learning according to claim 1, wherein the specific method for calculating binSignal(i) in step 2 is as follows:
for the bin with index i, the average interaction frequency between its upstream and downstream regions is denoted binSignal(i), the window size of the upstream and downstream regions is w, and binSignal(i) is calculated according to the following formula:
Figure FDA0002709995040000021
where $U_i(l)$ denotes a bin in the upstream region of bin $i$, $D_i(m)$ denotes a bin in the downstream region of bin $i$, cont.freq denotes the interaction frequency between an upstream-region bin and a downstream-region bin, $l$ is the index of an element in the upstream window, and $m$ is the index of an element in the downstream window.
4. The hierarchical TADs difference analysis method in the Hi-C contact matrix based on online machine learning according to claim 1, wherein the specific method for filtering out false positive TADs boundary region points with the rank-sum test in step 4 is as follows:
within a certain range of the TADs boundary region to be tested, collect the interaction frequencies between the upstream and downstream regions, $S_1$, and the interaction frequencies within the upstream or downstream region, $S_2$, and use the rank-sum test to check whether $S_1$ and $S_2$ differ significantly; if there is a significant difference, the TADs boundary region is considered a false positive and is filtered out, otherwise it is considered a true positive, where a significant difference means that the p-value of the rank-sum test is less than 0.05.
5. The hierarchical TADs difference analysis method in the Hi-C contact matrix based on online machine learning according to claim 1, wherein the contact matrix $M$ in step 5 is a symmetric square matrix, each element of the contact matrix is called an interaction frequency, and the mathematical model relating the interaction frequencies $M_{ij}$ to all possible hierarchical TADs is:
$Y = XB + N(0, \sigma^2)$, where $B \ge 0$
where $Y$ denotes the matrix formed by the interaction frequencies of all Hi-C contact matrices, each column of $Y$ being the vector of interaction frequencies in the lower triangular region (diagonal included) of one contact matrix; $X$ denotes the positional relationship between the interaction frequencies and all possible hierarchical TADs; $B$ denotes the weight coefficients of all possible hierarchical TADs; $N(0, \sigma^2)$ denotes normally distributed noise; and $\sigma$ denotes the standard deviation of the normal distribution.
6. The hierarchical TADs difference analysis method in the Hi-C contact matrix based on online machine learning according to claim 5, wherein the mathematical model is specifically described as:
Figure FDA0002709995040000031
where m denotes the dimension of the contact matrix, n denotes the total number of all possible hierarchical TADs, c_1 represents the first cell line, c_2 represents the second cell line, K_1 denotes the total number of Hi-C data samples in the first cell line, and K_2 denotes the total number of Hi-C data samples in the second cell line;
the symbol in Figure FDA0002709995040000032 represents the interaction frequency in the m-th row and m-th column of the K_1-th Hi-C contact matrix of the first cell line;
the symbol in Figure FDA0002709995040000033 represents the interaction frequency in the m-th row and m-th column of the K_2-th Hi-C contact matrix of the second cell line;
x_{i,z} is set to 1 if its corresponding y_{i,j} lies inside the z-th hierarchical TADs, and to 0 otherwise;
the symbol in Figure FDA0002709995040000034 represents the weight coefficient of the n-th hierarchical TADs under the first cell line;
and the symbol in Figure FDA0002709995040000035 represents the weight coefficient of the n-th hierarchical TADs under the second cell line.
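The 0/1 rule for the entries of X can be sketched as follows. This is an illustration under stated assumptions: each candidate hierarchical TAD is represented as an inclusive bin interval `(start, end)`, rows of X follow a row-by-row ordering of the lower triangle, and the name `build_indicator_matrix` is hypothetical. None of these representation choices is specified in the claim.

```python
def build_indicator_matrix(m, candidate_tads):
    """Return X: one row per lower-triangle entry (i, j), one column per candidate TAD.

    An entry is 1 when the bin pair (i, j) with j <= i lies inside the candidate
    TAD, i.e. start <= j and i <= end, and 0 otherwise.
    """
    rows = []
    for i in range(m):
        for j in range(i + 1):
            rows.append([1 if start <= j and i <= end else 0
                         for start, end in candidate_tads])
    return rows


# Two nested candidates on a 3-bin matrix: a sub-TAD (bins 0-1) inside a TAD (bins 0-2).
X = build_indicator_matrix(3, [(0, 1), (0, 2)])
```

Nested candidates produce overlapping columns, so an interaction frequency inside a sub-TAD is modelled as the sum of the weights of every candidate that contains it.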
7. The hierarchical TADs difference analysis method in the Hi-C contact matrix based on online machine learning according to claim 5 or 6, wherein the specific method for solving the objective function in step 6 is:
first, an objective function is established:
Figure FDA0002709995040000041
where B ≥ 0,
wherein K_1 denotes the total number of Hi-C data samples in the first cell line, K_2 denotes the total number of Hi-C data samples in the second cell line, j denotes the index of the sample, Y_{1j} represents the vector formed by all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix of the first cell line, Y_{2j} represents the vector formed by all interaction frequencies in the lower triangular region (including the diagonal) of the j-th Hi-C contact matrix of the second cell line, B_{:,1} represents the weight vector of the hierarchical TADs of the first cell line, B_{:,2} represents the weight vector of the hierarchical TADs of the second cell line, X represents the positional relationship between the interaction frequencies and all possible hierarchical TADs, and B is the matrix formed by concatenating the vectors B_{:,1} and B_{:,2}; the first term of the objective function is an ordinary mean square error, whose purpose is to make the overall difference between the interaction frequencies computed by the model and the real interaction frequencies as small as possible; the second term is an ℓ1 regularization term, which increases the sparsity of the solution; the third term is an ℓ2 regularization term, which makes the solution smoother and prevents overfitting; the fourth term is the core term of the loss function, whose purpose is to make the overall difference between the vector B_{:,1} and the vector B_{:,2} as small as possible.
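The four terms described in this claim can be made concrete with a small evaluation routine. The exact formula is given only in the figure, which is not reproduced in this text, so everything below is an assumption made for illustration: the per-cell-line averaging, the penalty weights `lam1`, `lam2`, `lam3`, and the use of an absolute-value penalty on B_{:,1} - B_{:,2} for the fourth term. The function names are hypothetical, and the sketch only evaluates the loss; it does not perform the online optimization.

```python
def predict(X, b):
    """Model output X b for one cell line (X as a list of rows, b as a list of weights)."""
    return [sum(x * w for x, w in zip(row, b)) for row in X]


def mean_squared_error(X, samples, b):
    """Average squared residual over all samples and all lower-triangle entries."""
    fitted = predict(X, b)
    total = sum((y - f) ** 2 for sample in samples for y, f in zip(sample, fitted))
    return total / (len(samples) * len(fitted))


def objective(X, Y1, Y2, b1, b2, lam1, lam2, lam3):
    """Assumed form: fit error + l1 penalty + l2 penalty + difference penalty, with B >= 0."""
    if any(w < 0 for w in b1 + b2):
        raise ValueError("weights must satisfy B >= 0")
    fit = mean_squared_error(X, Y1, b1) + mean_squared_error(X, Y2, b2)
    l1 = sum(abs(w) for w in b1 + b2)                # sparsity term
    l2 = sum(w * w for w in b1 + b2)                 # smoothness term
    diff = sum(abs(u - v) for u, v in zip(b1, b2))   # difference between the two cell lines
    return fit + lam1 * l1 + lam2 * l2 + lam3 * diff


# Toy data (invented): two lower-triangle entries, two candidate TADs, one sample per cell line.
X = [[1, 0], [1, 1]]
Y1 = [[1, 3]]
Y2 = [[2, 2]]
value = objective(X, Y1, Y2, [1, 2], [2, 0], lam1=0.1, lam2=0.01, lam3=1.0)
```

In this toy case both cell lines are fitted exactly, so the value is driven entirely by the three penalty terms; raising `lam3` pushes the two weight vectors toward each other, which leaves large values of |B_{i1} - B_{i2}| only where the data support a difference.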
8. The hierarchical TADs difference analysis method in the Hi-C contact matrix based on online machine learning according to claim 7, wherein the specific method for identifying the hierarchical TADs based on the solution result in step 7 is:
the solution result is two vectors B_{:,1} and B_{:,2}, in which B_{i1} represents the weight coefficient of the i-th hierarchical TADs under cell line c_1, B_{i2} represents the weight coefficient of the i-th hierarchical TADs under cell line c_2, and abs(B_{i1} - B_{i2}) represents the degree of difference between cell line c_1 and cell line c_2 in the i-th hierarchical TADs; the larger the value of abs(B_{i1} - B_{i2}), the greater the difference between cell line c_1 and cell line c_2 in the i-th hierarchical TADs.
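The scoring rule of this claim translates directly into a ranking step. The sketch below assumes the two solved weight vectors are available as plain Python lists; the function name `rank_differential_tads` and the example weights are hypothetical.

```python
def rank_differential_tads(b1, b2):
    """Score each candidate TAD by abs(B_i1 - B_i2) and sort from most to least different."""
    scores = [abs(u - v) for u, v in zip(b1, b2)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Invented weights for three candidate hierarchical TADs in cell lines c_1 and c_2.
ranking = rank_differential_tads([0.5, 2.0, 0.0], [0.4, 0.0, 1.0])
```

Here the candidate with index 1 comes first because its weights differ most between the two cell lines; any cut-off on the score for calling a TAD differential would be a further choice that the claim does not specify.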
CN201910315741.1A 2019-04-19 2019-04-19 Hierarchical TADs (TADs-related analysis) difference analysis method in Hi-C contact matrix based on online machine learning Expired - Fee Related CN110097922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910315741.1A CN110097922B (en) 2019-04-19 2019-04-19 Hierarchical TADs (TADs-related analysis) difference analysis method in Hi-C contact matrix based on online machine learning

Publications (2)

Publication Number Publication Date
CN110097922A CN110097922A (en) 2019-08-06
CN110097922B CN110097922B (en) 2020-12-08

Family

ID=67445233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910315741.1A Expired - Fee Related CN110097922B (en) 2019-04-19 2019-04-19 Hierarchical TADs (TADs-related analysis) difference analysis method in Hi-C contact matrix based on online machine learning

Country Status (1)

Country Link
CN (1) CN110097922B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023092303A1 (en) * 2021-11-23 2023-06-01 Chromatintech Beijing Co, Ltd Method for generating an enhanced hi-c matrix, non-transitory computer readable medium storing a program for generating an enhanced hi-c matrix, method for identifying a structural chromatin aberration in an enhanced hi-c matrix

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017011710A2 (en) * 2015-07-14 2017-01-19 Whitehead Institute For Biomedical Research Chromosome neighborhood structures and methods relating thereto
CN108647492A (en) * 2018-05-02 2018-10-12 Academy of Military Medical Sciences, Academy of Military Sciences of the Chinese People's Liberation Army A kind of characterizing method and device of chromatin topology relevant domain
CN109448783A (en) * 2018-08-07 2019-03-08 Tsinghua University Method for analyzing chromatin topological structure domain boundary
WO2019060683A2 (en) * 2017-09-21 2019-03-28 The Penn State Research Foundation Systems, methods, and processor-readable media for detecting disease causal variants

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109637579B (en) * 2018-12-18 2022-04-15 Changsha University Tensor random walk-based key protein identification method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Comparison of computational methods for the identification of topologically associating domains;Marie Zufferey et al;《Genome Biology》;20181210;1-18 *
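Research status and development trends of HERV based on bioinformatics;Han Jiuqiang et al;《Chinese Journal of Bioinformatics》;20140630;117-122 *
Study on the structure and function of chromatin topologically associating domains;Wang Xiaotao;《China Doctoral Dissertations Full-text Database, Basic Sciences》;20190115;A006-4 *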

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20201208)