CN114944190B

CN114944190B - TAD (topologically associating domain) identification method and system based on Hi-C sequencing data

Info

Publication number: CN114944190B
Application number: CN202210512716.4A
Authority: CN
Inventors: 刘健; 李平静; 陈娇
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2022-05-12
Filing date: 2022-05-12
Publication date: 2024-04-19
Anticipated expiration: 2042-05-12
Also published as: CN114944190A

Abstract

The invention discloses a TAD identification method and system based on Hi-C sequencing data. The method comprises the following steps: obtaining Hi-C sequencing data of a single chromosome; segmenting the Hi-C sequencing data of the single chromosome to generate a plurality of chromosome fragments; performing TAD structure identification on each chromosome fragment; and identifying false positive results among the identified TAD structures. The Hi-C sequencing data of the whole chromosome are fully utilized, which improves precision; in addition, a random walk with restart algorithm and a penalty operation are introduced, and the influence of genetic variation is effectively limited through the penalty coefficient.

Description

TAD (topologically associating domain) identification method and system based on Hi-C sequencing data

Technical Field

The invention relates to the technical field of gene sequencing, in particular to a TAD identification method and system based on Hi-C sequencing data.

Background

The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.

Research on the three-dimensional structure of chromosomes has made considerable progress. Chromosome conformation capture (3C) is a technique developed by biologists to study one-to-one interactions between chromosomal sites; on this basis, one-to-many (4C), many-to-many (5C) and all-to-all techniques were developed in turn. The all-to-all technique is called high-throughput chromosome conformation capture, i.e., Hi-C sequencing technology. Using Hi-C, scientists have successively discovered spatial structures formed by chromosomes in cells, such as topologically associating domains (TADs), A/B compartments, and chromatin loops.

Topologically associating domains (TADs) are chromosomal regions that are highly folded, so that sites within the same region interact with one another; they are closely related to heredity, development, disease and evolution. Identifying TAD structures is therefore required in a wide range of applications, such as studying the spatial conformation and function of chromosomes.

The inventors found that existing algorithms for identifying TADs in Hi-C sequencing data generally require several input parameters, which is inconvenient for biological researchers. Moreover, their results tend to be sensitive to these parameters: a tiny change in a parameter can lead to very different results.

Chinese patent CN 113178230A, "Method and system for detecting nested TAD structures in three-dimensional genome Hi-C data", detects nested TAD structures. It uses deep learning to enhance the original Hi-C data, which avoids the large amount of resources needed to acquire high-precision data. However, that patent introduces simulated data into TAD identification and does not propose a TAD identification method based on the original data.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides a TAD identification method and system based on Hi-C sequencing data. It reduces the sensitivity of the results to input parameters, which benefits researchers and clinicians without specialized bioinformatics knowledge. It can also use the Hi-C data of the whole chromosome to obtain a more accurate identification result.

In a first aspect, the invention provides a TAD identification method based on Hi-C sequencing data.

A TAD identification method based on Hi-C sequencing data, comprising:

obtaining Hi-C sequencing data of a single chromosome, and segmenting the Hi-C sequencing data of the single chromosome to generate a plurality of chromosome fragments;

performing TAD structure identification on each chromosome fragment; and

identifying false positive results among the identified TAD structures.

In a second aspect, the invention provides a TAD identification system based on Hi-C sequencing data.

A TAD identification system based on Hi-C sequencing data, comprising:

an acquisition module configured to: obtain Hi-C sequencing data of a single chromosome, and segment the Hi-C sequencing data of the single chromosome to generate a plurality of chromosome fragments;

a TAD structure identification module configured to: perform TAD structure identification on each chromosome fragment; and

a false positive identification module configured to: identify false positive results among the identified TAD structures.

In a third aspect, the invention also provides an electronic device, comprising:

a memory for non-transitory storage of computer-readable instructions; and

a processor for executing the computer-readable instructions,

wherein the computer-readable instructions, when executed by the processor, perform the method of the first aspect described above.

In a fourth aspect, the invention also provides a storage medium that non-transitorily stores computer-readable instructions, wherein the method of the first aspect is performed when the computer-readable instructions are executed by a computer.

In a fifth aspect, the invention also provides a computer program product comprising a computer program for implementing the method of the first aspect described above when run on one or more processors.

Compared with the prior art, the invention has the beneficial effects that:

The disclosed scheme relates TADs in chromosome Hi-C data to communities in a correlation graph, which provides a new idea for subsequent research. The Hi-C sequencing data of the whole chromosome are fully utilized, which improves precision. A random walk with restart algorithm and a penalty operation are introduced, and the influence of genetic variation is effectively limited through the penalty coefficient. In addition, unlike other published algorithms, the present disclosure greatly reduces the sensitivity of the results to input parameters. In summary, the disclosure introduces a new approach to TAD identification, reduces the dependence of the identification result on parameters, improves the utilization of chromosome Hi-C sequencing data, and improves the accuracy of the identification result. For researchers and clinicians without specialized bioinformatics knowledge, the scheme reduces the number of parameters to choose and the difficulty of choosing their values, and can provide accurate analysis results.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

Fig. 1 is a flow chart of a method according to a first embodiment.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms also are intended to include the plural forms, and furthermore, it is to be understood that the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusions, such as, for example, processes, methods, systems, products or devices that comprise a series of steps or units, are not necessarily limited to those steps or units that are expressly listed, but may include other steps or units that are not expressly listed or inherent to such processes, methods, products or devices.

Embodiments of the invention and features of the embodiments may be combined with each other without conflict.

All data in this embodiment were acquired and used lawfully, in compliance with applicable laws and regulations and with the consent of the users.

Example 1

This embodiment provides a TAD identification method based on Hi-C sequencing data.

As shown in Fig. 1, the TAD identification method based on Hi-C sequencing data includes:

S101: obtaining Hi-C sequencing data of a single chromosome, and segmenting the Hi-C sequencing data of the single chromosome to generate a plurality of chromosome fragments;

S102: performing TAD structure identification on each chromosome fragment;

S103: identifying false positive results among the identified TAD structures.

A false positive is an incorrect TAD region produced by experimental or computational error.

Further, S101 (obtaining Hi-C sequencing data of a single chromosome and segmenting it to generate a plurality of chromosome fragments) specifically comprises:

obtaining Hi-C sequencing data of a single chromosome, wherein the Hi-C sequencing data of the single chromosome has a matrix structure;

calculating the local contact frequency of each bin in the single chromosome;

after a local contact frequency value has been calculated for each bin, selecting the bins located at local minima;

starting from each bin located at a local minimum, finding, to the left and to the right, the farthest bin up to which the local contact frequency rises strictly monotonically; the difference between the right and left limits of each minimum is called the maximum rising distance;

sorting the maximum rising distances in descending order, and taking the bins corresponding to the top-ranked values as TAD boundaries;

dividing the whole chromosome at the TAD boundaries to obtain a plurality of chromosome fragments.

Further, the local contact frequency of each bin in the chromosome is calculated as follows (a bin is a fixed-length segment of the chromosome; its length is called the resolution, and bins index the rows and columns of the Hi-C matrix):

where w is the user-input resolution divided by 2 Mb; cont.freq is the contact frequency between two bins, whose value is the corresponding entry of the matrix formed by the Hi-C sequencing data; and U and D refer to the upstream (up) and downstream (down) regions, respectively. The local contact frequency value (local density) describes the sum of the contacts between a bin and the bins within distance w upstream and downstream of it; it has a maximum at the center of a TAD and a minimum at a TAD boundary.
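Formula (1) itself does not appear in this text, so the following Python sketch only illustrates the verbal description above. It assumes that the local density of bin i is the sum of the contact frequencies between bin i and the bins within w positions upstream (U) and downstream (D). The function names, the toy matrix, and the reading of w as a 2 Mb span expressed in bins are assumptions made for illustration, not the patent's definition.

```python
def window_in_bins(resolution_bp, span_bp=2_000_000):
    """Assumed reading of w: a 2 Mb span expressed as a number of bins."""
    return max(1, span_bp // resolution_bp)


def local_density(matrix, w):
    """For each bin i, sum its contacts with bins up to w upstream and downstream.

    `matrix` is a square, symmetric Hi-C contact matrix (list of lists).
    This is an assumed form of formula (1), which is missing from the text.
    """
    n = len(matrix)
    density = []
    for i in range(n):
        upstream = range(max(0, i - w), i)            # region U
        downstream = range(i + 1, min(n, i + w + 1))  # region D
        total = sum(matrix[i][j] for j in upstream)
        total += sum(matrix[i][j] for j in downstream)
        density.append(total)
    return density


# Toy matrix: two 3-bin blocks with strong internal contacts.
toy = [
    [0, 9, 8, 1, 0, 0],
    [9, 0, 9, 1, 1, 0],
    [8, 9, 0, 2, 1, 1],
    [1, 1, 2, 0, 9, 8],
    [0, 1, 1, 9, 0, 9],
    [0, 0, 1, 8, 9, 0],
]
density = local_density(toy, w=1)
print(density)
```

In this toy case the bins in the middle of each block score higher than the two bins that face the other block, which matches the stated behavior (high inside a TAD, low near a boundary). The first and last bins also score low only because their window is truncated at the matrix edge.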

According to formula (1), a local contact frequency value is calculated for each bin, and the bins located at local minima are selected. A local minimum is defined mathematically as a value smaller than both its left and right neighbors.

Starting from each bin located at a local minimum, the farthest bin up to which the local contact frequency rises strictly monotonically is found to the left and to the right; the difference between the right and left limits of each minimum is called the maximum rising distance.
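The two definitions above can be sketched in Python as follows. This is a minimal illustration, assuming the local contact frequencies are already available as a list; the function names and the sample values are hypothetical.

```python
def local_minima(values):
    """Indices whose value is strictly smaller than both neighbours."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    ]


def rising_distance(values, i):
    """Width of the span around minimum i over which values rise strictly."""
    left = i
    while left > 0 and values[left - 1] > values[left]:
        left -= 1   # still rising when walking to the left
    right = i
    while right < len(values) - 1 and values[right + 1] > values[right]:
        right += 1  # still rising when walking to the right
    return right - left


density = [5, 9, 7, 3, 6, 8, 10, 4, 6, 5]
minima = local_minima(density)
distances = {i: rising_distance(density, i) for i in minima}
print(minima, distances)
```

A deep, wide valley (bin 3 here) gets a larger maximum rising distance than a shallow dip (bin 7), which is why the distance is a useful ranking key for candidate boundaries.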

The maximum rising distances are sorted in descending order and the top-ranked values are taken (the proportion taken may be entered by the user; the preset value is 45%). The bins corresponding to these values are determined to be TAD boundaries.

The whole chromosome is divided at these boundaries to obtain a plurality of chromosome fragments.
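A minimal Python sketch of the boundary selection and splitting steps is given below. It assumes the maximum rising distances are held in a dictionary keyed by bin index; the function names, the rounding rule, and the sample values are assumptions of this sketch, while the 45% default mirrors the preset mentioned above.

```python
import math


def select_boundaries(distances, fraction=0.45):
    """Keep the minima with the largest maximum rising distances.

    `distances` maps bin index -> maximum rising distance. Rounding the
    number of kept bins up is an assumption of this sketch.
    """
    keep = max(1, math.ceil(len(distances) * fraction))
    ranked = sorted(distances, key=lambda b: distances[b], reverse=True)
    return sorted(ranked[:keep])


def split_chromosome(n_bins, boundaries):
    """Cut bins [0, n_bins) at each boundary into half-open (start, end) fragments."""
    cuts = [0] + [b for b in boundaries if 0 < b < n_bins] + [n_bins]
    return [(s, e) for s, e in zip(cuts, cuts[1:]) if e > s]


distances = {3: 5, 7: 2, 12: 6, 15: 1}
boundaries = select_boundaries(distances)
fragments = split_chromosome(20, boundaries)
print(boundaries, fragments)
```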

It should be noted that TAD structures exist within a single chromosome and not between chromosomes; therefore, the input of the algorithm should be the Hi-C matrix of a single chromosome.

Further, S102 (performing TAD structure identification on each chromosome fragment) specifically comprises:

for each chromosome fragment, using the random walk with restart (RWR) algorithm to obtain the similarity between every pair of bins in the current chromosome fragment;

dividing the similarity between each pair of bins by the distance between the two bins (the penalty operation) to obtain the penalized result;

taking the penalized result as the input data of the label propagation algorithm (Label Propagation) and performing label propagation on it, wherein the output of label propagation is a community structure, and a community corresponds in meaning to a TAD structure.

The label propagation algorithm defines a community as a region whose internal correlation is higher than its correlation with other regions.

It should be understood that the random walk with restart algorithm makes full use of global data to quickly find the degree of association between every pair of bins.
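As an illustration of the random walk with restart step, the following Python sketch computes a pairwise similarity matrix by power iteration. The text does not give the restart probability, the number of iterations, or the normalization, so `restart`, `iterations`, the row normalization and the function name are all assumptions of this sketch.

```python
def rwr_similarity(matrix, restart=0.5, iterations=50):
    """Random walk with restart from every bin.

    Row i of the result holds the visiting probabilities of a walker that
    restarts at bin i. `restart` and `iterations` are illustrative values,
    not values from the source.
    """
    n = len(matrix)
    row_sums = [sum(row) or 1.0 for row in matrix]
    # transition[i][j]: probability of stepping from bin i to bin j
    transition = [[matrix[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
    similarity = []
    for seed in range(n):
        p = [1.0 if k == seed else 0.0 for k in range(n)]
        for _ in range(iterations):
            stepped = [sum(p[i] * transition[i][j] for i in range(n)) for j in range(n)]
            p = [
                (1 - restart) * stepped[j] + (restart if j == seed else 0.0)
                for j in range(n)
            ]
        similarity.append(p)
    return similarity


# Toy contact matrix: two 3-bin blocks with strong internal contacts.
toy = [
    [0, 9, 8, 1, 0, 0],
    [9, 0, 9, 1, 1, 0],
    [8, 9, 0, 2, 1, 1],
    [1, 1, 2, 0, 9, 8],
    [0, 1, 1, 9, 0, 9],
    [0, 0, 1, 8, 9, 0],
]
sim = rwr_similarity(toy)
print([round(v, 3) for v in sim[0]])
```

Because the walker can reach any bin through intermediate bins, the similarity between two bins reflects the whole fragment rather than only their direct contact count.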

It should be understood that dividing the similarity between two bins by the distance between them (the penalty operation) prevents super nodes from spreading their labels too quickly during the subsequent label propagation and reduces errors caused by factors such as chromosomal variation. This improves the accuracy of the subsequent unsupervised learning and eliminates the influence of factors such as copy number variation (CNV) and chromosomal translocation. An unsupervised learning algorithm is then run to complete the community discovery process; the algorithm used here is the label propagation algorithm.
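The penalty operation and the label propagation step can be sketched in Python as follows. This is a simplified, deterministic variant written for illustration: bins are visited in index order and ties go to the smallest label, whereas label propagation is usually randomized. The handling of the diagonal, the reading of communities as contiguous runs of bins, and all function names are assumptions; the toy matrix stands in for an RWR similarity matrix.

```python
def penalize_by_distance(similarity):
    """Divide each pairwise similarity by the bin distance |i - j|.

    The diagonal (distance 0) is set to 0 here; the source does not say
    how it is handled.
    """
    n = len(similarity)
    return [
        [similarity[i][j] / abs(i - j) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]


def label_propagation(weights, max_rounds=100):
    """Each bin repeatedly adopts the label with the largest total weight."""
    n = len(weights)
    labels = list(range(n))  # every bin starts in its own community
    for _ in range(max_rounds):
        changed = False
        for i in range(n):
            score = {}
            for j in range(n):
                if j != i and weights[i][j] > 0:
                    score[labels[j]] = score.get(labels[j], 0.0) + weights[i][j]
            if not score:
                continue
            best = min(score, key=lambda lab: (-score[lab], lab))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels


def contiguous_runs(labels):
    """Turn per-bin labels into half-open (start, end) runs of equal labels."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((start, i))
            start = i
    return runs


toy_similarity = [
    [0, 9, 8, 1, 0, 0],
    [9, 0, 9, 1, 1, 0],
    [8, 9, 0, 2, 1, 1],
    [1, 1, 2, 0, 9, 8],
    [0, 1, 1, 9, 0, 9],
    [0, 0, 1, 8, 9, 0],
]
penalized = penalize_by_distance(toy_similarity)
labels = label_propagation(penalized)
communities = contiguous_runs(labels)
print(labels, communities)
```

Dividing by distance weakens long-range links, so a bin with unusually many distant contacts has less pull on bins that are far away from it.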

Further, S103 (identifying false positive results among the identified TAD structures) specifically comprises:

according to biological findings, the standard size of a TAD structure ranges from 180 kb to 2 Mb;

the user input parameter is the Hi-C matrix resolution; the two endpoints of the standard range (180 kb and 2 Mb) are divided by the resolution to obtain the range of the number of bins contained in a topologically associating domain at that resolution;

identified TAD structures whose number of bins falls outside this range are filtered out as false positives.
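The size filter can be sketched in Python as below. The 180 kb and 2 Mb endpoints come from the text; the rounding of the endpoints to whole bins, the representation of a TAD as a half-open (start, end) bin interval, and the function names are assumptions of this sketch.

```python
import math


def tad_size_range_in_bins(resolution_bp, min_bp=180_000, max_bp=2_000_000):
    """Convert the 180 kb to 2 Mb range into a range of bin counts.

    Rounding the lower endpoint up and the upper endpoint down is an
    assumption; the source only says to divide by the resolution.
    """
    return math.ceil(min_bp / resolution_bp), max_bp // resolution_bp


def filter_by_size(candidates, resolution_bp):
    """Keep candidate TADs, given as (start, end) bins, whose size is in range."""
    low, high = tad_size_range_in_bins(resolution_bp)
    return [(s, e) for s, e in candidates if low <= e - s <= high]


candidates = [(0, 3), (3, 30), (30, 100), (100, 110)]
kept = filter_by_size(candidates, resolution_bp=40_000)
print(tad_size_range_in_bins(40_000), kept)
```

At 40 kb resolution the range works out to 5 to 50 bins, so the 3-bin and 70-bin candidates are discarded.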

Further, the method further comprises: filtering false positives further according to two quality indexes, namely the difference in average interaction frequency between the interior of a topologically associating domain and its adjacent topologically associating domain (DIFF), and the Pearson correlation coefficient (PCC).

The Pearson correlation coefficient is a statistical measure of the correlation within data, and it is equally applicable to describing the correlation within a topologically associating domain. Because the correlation inside a TAD is extremely high, results with a Pearson correlation coefficient below 0.6 are regarded as false positives.
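The text does not state how a single PCC value is derived for a TAD. The Python sketch below assumes one plausible reading: the PCC of a candidate TAD is the average Pearson correlation between the contact profiles (matrix rows) of every pair of bins inside it. The 0.6 threshold comes from the text; the averaging scheme, the function names and the toy matrix are assumptions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant profile; treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def mean_pcc(matrix, start, end):
    """Average PCC over all pairs of bins in the half-open interval [start, end)."""
    pairs = list(combinations(range(start, end), 2))
    if not pairs:
        return 0.0
    return sum(pearson(matrix[i], matrix[j]) for i, j in pairs) / len(pairs)


# Toy matrix with a filled diagonal: two 3-bin blocks.
toy = [
    [10, 9, 8, 1, 0, 0],
    [9, 10, 9, 1, 1, 0],
    [8, 9, 10, 2, 1, 1],
    [1, 1, 2, 10, 9, 8],
    [0, 1, 1, 9, 10, 9],
    [0, 0, 1, 8, 9, 10],
]
candidates = [(0, 3), (2, 4), (3, 6)]
passed = [c for c in candidates if mean_pcc(toy, *c) >= 0.6]
print(passed)
```

The candidate (2, 4) straddles the two blocks, so its bins have opposite contact profiles and it fails the threshold.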

The contact frequency difference (DIFF) between the interior of a topologically associating domain and its adjacent topologically associating domain is a quality evaluation index in TAD identification studies. The index calculates the sum of the contacts among all bins within one TAD, and the sum of the contacts between bins that fall in adjacent TADs.

The contact frequency between bins within a TAD is extremely high, and the contact frequency between bins of different TADs is extremely low. DIFF is the difference between the contact frequency inside a topologically associating domain and the contact frequency between it and the adjacent topologically associating domain; results with a DIFF below 20 are regarded as false positives.
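A minimal Python sketch of the DIFF index follows. It assumes DIFF is the mean contact frequency between distinct bins inside a TAD minus the mean contact frequency between its bins and those of one adjacent TAD, in line with the "difference of average interaction frequency" wording. Which neighbor is used, the exclusion of the diagonal, and the function names are assumptions; the threshold of 20 comes from the text.

```python
def mean_contact(matrix, rows, cols):
    """Mean of matrix[i][j] over the given rows and columns, skipping i == j."""
    values = [matrix[i][j] for i in rows for j in cols if i != j]
    return sum(values) / len(values) if values else 0.0


def diff_score(matrix, tad, neighbour):
    """Mean intra-TAD contact minus mean contact with the adjacent TAD.

    `tad` and `neighbour` are half-open (start, end) bin intervals.
    """
    inside = range(*tad)
    outside = range(*neighbour)
    return mean_contact(matrix, inside, inside) - mean_contact(matrix, inside, outside)


toy = [
    [0, 90, 80, 10, 0, 0],
    [90, 0, 90, 10, 10, 0],
    [80, 90, 0, 20, 10, 10],
    [10, 10, 20, 0, 90, 80],
    [0, 10, 10, 90, 0, 90],
    [0, 0, 10, 80, 90, 0],
]
good = diff_score(toy, (0, 3), (3, 6))  # a real block against its neighbour
bad = diff_score(toy, (2, 4), (4, 6))   # a candidate straddling two blocks
print(round(good, 3), bad)
```

With the threshold of 20, the first candidate is kept and the second is discarded.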

In addition, DIFF and PCC are often used as TAD evaluation indexes in related research. The method includes the calculation of these indexes and also uses them to filter false positive results; the remaining communities are finally taken as the TAD identification result.

The disclosed scheme provides a user-friendly identification algorithm that makes bioinformatics analysis easier for researchers and clinicians without specialized bioinformatics knowledge, and provides more accurate results.

Example 2

This embodiment provides a TAD identification system based on Hi-C sequencing data.

A TAD identification system based on Hi-C sequencing data, comprising:

It should be noted that the acquisition module, the TAD structure identification module, and the false positive identification module correspond to steps S101 to S103 of the first embodiment; the examples and application scenarios implemented by the modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the first embodiment. It should also be noted that the modules may be implemented as part of a system in a computer system, for example as a set of computer-executable instructions.

Each of the foregoing embodiments has its own emphasis; for details not described in one embodiment, refer to the related description of another embodiment.

The proposed system may be implemented in other ways. The system embodiments described above are merely illustrative; for example, the division into modules is only a division by logical function, and other divisions are possible in an actual implementation: multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.

Example 3

The embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of the first embodiment.

It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.

In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software.

The method of the first embodiment may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory; the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided here.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

Example 4

The present embodiment also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the method of embodiment one.

The above describes only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art can make various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims

1. A TAD identification method based on Hi-C sequencing data, comprising:

obtaining Hi-C sequencing data of a single chromosome, and segmenting the Hi-C sequencing data of the single chromosome to generate a plurality of chromosome fragments;

performing TAD structure identification on each chromosome fragment;

identifying false positive results among the identified TAD structures;

wherein obtaining the Hi-C sequencing data of a single chromosome and segmenting it to generate a plurality of chromosome fragments specifically comprises:

obtaining Hi-C sequencing data of a single chromosome, wherein the Hi-C sequencing data of the single chromosome has a matrix structure; calculating the local contact frequency of each bin in the single chromosome; after a local contact frequency value has been calculated for each bin, selecting the bins located at local minima; starting from each bin located at a local minimum, finding, to the left and to the right, the farthest bin up to which the local contact frequency rises strictly monotonically, the difference between the right and left limits of each minimum being called the maximum rising distance; sorting the maximum rising distances in descending order, and taking the bins corresponding to the top-ranked values as TAD boundaries; dividing the whole chromosome at the TAD boundaries to obtain a plurality of chromosome fragments;

wherein the local contact frequency of each bin in the single chromosome is calculated as follows:

where w is the user-input resolution divided by 2 Mb; cont.freq is the contact frequency between two bins, whose value is the corresponding entry of the matrix formed by the Hi-C sequencing data; U and D refer to the upstream (up) and downstream (down) regions, respectively; the local contact frequency value (local density) describes the sum of the contacts between a bin and the bins within distance w upstream and downstream of it; it has a maximum at the center of a TAD and a minimum at a TAD boundary;

wherein performing TAD structure identification on each chromosome fragment specifically comprises:

for each chromosome fragment, obtaining the similarity between every pair of bins in the current chromosome fragment using a random walk with restart algorithm; dividing the similarity between each pair of bins by the distance between the two bins to obtain a penalized result; taking the penalized result as input data of a label propagation algorithm, and performing label propagation on the input data, wherein the output of label propagation is a community structure, and a community corresponds in meaning to a TAD structure;

the label propagation algorithm defines a community as a region whose internal correlation is higher than its correlation with other regions;

wherein identifying false positive results among the identified TAD structures specifically comprises:

according to biological findings, the standard size of a TAD structure ranges from 180 kb to 2 Mb;

the user input parameter is the Hi-C matrix resolution, and the two endpoints of the standard range are divided by the resolution to obtain the range of the number of bins contained in a topologically associating domain at that resolution; identified TAD structures whose number of bins falls outside this range are filtered out as false positives; false positives are further filtered according to two quality indexes, namely the contact frequency difference between the interior of a topologically associating domain and its adjacent topologically associating domain, and the Pearson correlation coefficient; and finally, the remaining communities are taken as the TAD identification result.

2. TAD identification system based on Hi-C sequencing data, based on a TAD identification method based on Hi-C sequencing data according to claim 1, characterized by comprising:

3. An electronic device, comprising:

a memory for non-transitory storage of computer readable instructions; and

A processor for executing the computer-readable instructions,

Wherein the computer readable instructions, when executed by the processor, perform the method of claim 1.

4. A storage medium that non-transitorily stores computer-readable instructions, wherein the method of claim 1 is performed when the computer-readable instructions are executed by a computer.