CN108897986B

CN108897986B - Genome sequence splicing method based on protein information

Info

Publication number: CN108897986B
Application number: CN201810530874.6A
Authority: CN
Inventors: 王建新; 尚娟; 李洪东
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2018-05-29
Filing date: 2018-05-29
Publication date: 2020-11-27
Anticipated expiration: 2038-05-29
Also published as: CN108897986A

Abstract

The invention discloses a genome sequence splicing method based on protein information, which comprises the following steps: acquiring comparison information between a DNA sequence to be spliced and a protein sequence; determining the adjacent relation between corresponding DNA sequences on each protein sequence; constructing connecting edges between adjacent DNA sequences and obtaining the support information of the corresponding connecting edges of each DNA sequence on each protein sequence; sequentially carrying out denoising treatment on the support information of the connecting edge of each DNA sequence; sequentially denoising the front node and the rear node of each DNA sequence based on a weight scoring function; calculating the connection distance of all DNA sequence connection edges with the support information; and sequentially connecting the front node and the rear node of each DNA sequence in series based on the connection intervals of the connection edges of all the DNA sequences to obtain a genome sequence splicing path. The sensitivity and the accuracy of the genome sequence splicing result are improved by the method.

Description

Genome sequence splicing method based on protein information

Technical Field

The invention belongs to the field of bioinformatics, and particularly relates to a genome sequence splicing method based on protein information.

Background

New sequencing technologies that have evolved at low cost have greatly changed the pattern of whole genome sequencing, enabling scientists to launch numerous genomic projects to decode the genomes of previously unsequenced organisms. Sequencing technology can complete deep sequencing on most species, including mammals, in a short period of days. However, DNA sequencing techniques do not directly produce complete sequences at the chromosomal level, but rather they generate a large number of reads, sampling from different parts of the genome continuous bases ranging in length from tens to thousands. Long sequences of genomic splices are made up of millions or billions of short DNA sequence sequencing reads generated by sequencing techniques.

Most species lack a reference genome and must have the sequence spliced de novo to the reads before the genome can be analyzed. Unfortunately, genome splicing remains a very difficult problem. Genome splicing software combines reads into longer sequences called protein sequences (contigs) based on their overlap, and constructs long genome splicing sequences (scaffolds) by determining the orientation, order, and distance between contigs. However, current sequencing technologies face a number of challenges that prevent splicing tools from reconstructing a complete chromosome, including read errors and large numbers of repeated regions in the genome. The limitations of genome sequencing technologies and the inherent complexity of genomes have led to the perfection of none of the current splicing algorithms.

The structure of the reducing gene is one of the important targets for the splicing of genome sequences. In some low-quality splices, even high-quality splices, certain gene regions remain incomplete. In order to obtain a new sequence with a more complete gene structure, it is effective to use a protein sequence to assist scaffold.

The genes are DNA fragments with genetic effect (the genes of the virus can be RNA) and are arranged linearly on the chromosome. The expression of a gene is achieved by synthesizing a protein from DNA, the base arrangement of the DNA sequence determines the base arrangement of the mRNA sequence, which in turn determines the amino acid arrangement of the protein, which ultimately determines the primary structure and functional specificity of the protein, resulting in the expression of different genetic traits in an organism. Therefore, there is a correspondence between genes, DNAs, and proteins. A gene may contain multiple exons, the exons of the gene may be on two different contigs, and contigs may be ligated based on the split-gene information to construct longer scfolds, see figure 1.

The invention of protein sequence determination technology is prior to DNA sequence sequencing technology, and many protein sequence databases contain sufficient protein resource information, which lays a foundation for protein sequence guide sequence splicing. Uniprot (http:// www.uniprot.org /) is a central resource library of protein sequences and annotation catalogues with comprehensive functions, and has the most abundant information and the most extensive resources. Uniprot integrates three databases of Swiss-Prot, TrEMBL and PIR-PSD, wherein protein data in the Swiss-Prot database is subjected to manual annotation analysis, while the TrEMBL database is automatically predicted and is not manually annotated, so that the Swiss-Prot protein is more reliable than the TrEMBL protein.

For genome function studies and evolutionary analysis, the identification of gene structures is a major goal of genome sequencing projects. Although double-ended short reads or long reads can increase the value of gene sequence N50, it is still difficult to perfect all gene structures. Therefore, it is necessary to develop a new genome sequence splicing method (scaffold method) to restore a gene region. Several methods of scaffold using protein information as a guide to increase genome continuity have been developed, such as ESPRIT, SWiPS, and PEP _ scaffold.

The accuracy of the tool ESPRIT depends on the predicted gene set, and the extremely time-consuming nature of gene prediction limits the practical application of ESPRIT. Similarly, SWiPS requires many alignments of protein sequences and genome sequences to improve the matching between sequences, and also requires a lot of time. The tool PEP _ scaffold will directly delete the protein sequence that matches the too large coverage, compromising the sensitivity of the scaffold result to some extent. Therefore, the existing scaffold method cannot meet the requirements of sensitivity and accuracy.

Disclosure of Invention

The invention aims to provide a genome sequence splicing method based on protein information, which improves the sensitivity and accuracy of a scaffold result.

The invention provides a genome sequence splicing method based on protein information, which comprises the following steps:

s1: acquiring comparison information between a DNA sequence to be spliced and a protein sequence;

the method comprises the following steps of obtaining a piece of comparison information when one DNA sequence is matched with one protein sequence, wherein the protein sequence corresponding to each piece of obtained comparison information at least matches two DNA sequences, and the comparison information at least comprises the following steps: matching value, initial position coordinates and end position coordinates of a matching region on the protein sequence, and comparison direction;

the protein sequence and the DNA sequence belong to the same species or homologous species;

s2: acquiring the protein sequences with the comparison information, and respectively determining the adjacent relation between the corresponding DNA sequences on each protein sequence according to the initial position and the end position of the matching region on the protein sequence in each comparison information of each protein sequence;

s3: constructing a connecting edge between adjacent DNA sequences based on the adjacent relation between the DNA sequences, and acquiring the support information of the corresponding connecting edge of each DNA sequence on each protein sequence;

when a DNA sequence connection edge is matched with N protein sequences, N pieces of supporting information correspondingly exist on the DNA sequence connection edge, and N is a positive integer;

the support information is used for representing the connection type of two DNA sequences in the connection side of the corresponding DNA sequences on one protein sequence, and the connection type comprises L¹L²A connection type and a gap connection type;

s4: denoising the support information of the connecting edge of each DNA sequence in sequence based on the comparison information;

wherein, only one piece of support information exists on the connecting edge of each DNA sequence after denoising treatment;

s5: sequentially denoising the front node and the rear node of each DNA sequence based on a weight scoring function;

the number of the front nodes and the rear nodes of each DNA sequence after denoising does not exceed 1, the weight fractions of the front nodes and the rear nodes are respectively greater than the weight fractions of the rest front nodes and the rest rear nodes of the corresponding DNA sequences, the support information of the corresponding DNA sequence connecting edges is reserved according to the front nodes and the rear nodes of each DNA sequence after denoising, and the rest support information is deleted;

wherein, the DNA sequence at the front end in the connection side of one DNA sequence is a front node, and the DNA sequence at the rear end is a rear node;

s6: calculating the connection intervals corresponding to the connection edges of all DNA sequences with the support information;

s7: obtaining a genome sequence splicing path based on the connection intervals of all DNA sequence connection edges in S6 and the serial connection of the front node and the rear node of each DNA sequence;

wherein, the distance between the adjacent DNA sequences corresponds to the connection distance corresponding to the connection edge of the DNA sequences.

L¹L²The support information of the connection type does not directly include the gap value, and the information of the gap connection type directly includes the gap value. The gap value is the distance between two contigs. The connection distance corresponding to the connecting side of the DNA sequence in S6 is also the distance between two DNA sequence nodes representing the connecting side of the DNA sequence. Each DNA sequence after the step of S5 was divided into three categories: (1) without a previous node, this DNA sequence node is called the start node; (2) there is no back node, and this DNA sequence node is called the end node; (3) the DNA sequence nodes having both the front node junction and the back node junction are called intermediate nodes. And when the nodes are connected in series, starting from each node in the starting node set, sequentially connecting the nodes in series according to the connecting edges among the DNA sequence nodes until the nodes are ended, and generating a scaffold path.

Further preferably, the process of denoising the front and rear nodes of a DNA sequence in step S5 is as follows:

sequentially calculating the weight fraction of each node in all front nodes and all rear nodes of a DNA sequence to be denoised, and respectively judging whether the weight fraction of a unique node exists in all the front nodes and all the rear nodes is the maximum value;

if the information does not exist, deleting the DNA sequence to be denoised and the corresponding support information of all front nodes or all corresponding back nodes; if the DNA sequence exists, respectively selecting the node with the highest weight score from all front nodes and all rear nodes as the front node and the rear node of the DNA sequence to be denoised; finally, according to the selected front and rear nodes of each DNA sequence, retaining the support information of the corresponding DNA sequence connection edge, and deleting the rest support information;

wherein, the calculation formula of the weight fraction is as follows:

S＝M+P*w₁+α*w₂+β*w₃)/F²，w₁<w₂<w₃

in the formula, S is the weight fraction of a front node or a rear node of a DNA sequence to be denoised, and M is a matching value corresponding to the front node or the rear node, wherein the front node or the rear node and the DNA sequence to be denoised determine a DNA sequence connecting edge so as to determine a matched protein sequence, and further determine the unique matching value of the front node or the rear node; p is the protein support degree of the DNA protein connecting edge determined by the DNA sequence to be denoised and the front node or the rear node, F is the number of the front nodes of the front node or the rear node, w₁、w₂、w₃Respectively, a first weight coefficient, a second weight coefficient and a third weight coefficient, wherein alpha and beta are respectively support coefficients;

wherein the protein support degree P of the connection side of the DNA sequence is equal to the number of protein sequences of which the connection direction of the connection side of the DNA sequence meets the corresponding uniform connection direction in S4;

the values of the support coefficients alpha and beta are determined according to the support information of the connecting edge of the DNA sequence formed by the DNA sequence to be denoised and the front or rear node;

if the support information of the DNA sequence connection edge formed by the DNA sequence to be denoised and the front or rear node is L¹L²When the connection type is adopted, the support coefficient alpha is 1 or 0, and the support coefficient beta is 0; in the case of the gap junction type, the support coefficient α is 0 and the support coefficient β is 1.

Further preferably, before calculating the weight fraction of each node before and after the DNA sequence to be denoised, the method further comprises the steps of respectively obtaining the support coefficients alpha and beta and the number F of the nodes in the calculation of the weight fraction;

wherein, S51: acquiring a DNA sequence connecting edge formed by a DNA sequence to be processed and each front or rear node, and acquiring corresponding support information of each DNA sequence connecting edge;

s52: obtaining the supporting information L in S51¹L²All DNA sequences of the ligation type are ligated together and the terminal length L of the ligation side of each DNA sequence is calculated¹And sequentially judging the terminal length L of the connecting side of each DNA sequence¹Whether the accuracy is met;

if the precision is met, the support coefficient alpha of the corresponding DNA sequence connection edge is 1, and beta is 0; if not, the support coefficient alpha of the corresponding DNA sequence connection side is 0, and beta is 0;

wherein the length L of the terminal end of the connecting side of the DNA sequence¹Indicates the length of the ends of the front DNA sequence that are not aligned to the corresponding protein sequence;

s53: acquiring all DNA sequence connection edges of which the support information is a gap connection type in S51, wherein the support coefficients alpha of the acquired all DNA sequence connection edges are 0, and beta is 1;

s54: and acquiring the number of front nodes of each front node or each back node of the DNA sequence to be processed.

More preferably, the terminal length L of the connecting side of each DNA sequence is determined in S52¹The procedure of whether the accuracy is satisfied is as follows:

first, the slave support information is L¹L²The terminal length L of the joining side of all DNA sequences of the type of ligation¹In the selection of the terminal length L¹And calculating the length L of the end of the connecting side of each DNA sequence in turn¹A ratio to the maximum value;

then, sequentially judging whether the ratio corresponding to the connecting edge of each DNA sequence is greater than 0.9, if so, meeting the precision; otherwise, the accuracy is not satisfied.

Further preferably, in S4, the process of denoising the support information of the connecting edge of each DNA sequence is as follows:

s41: acquiring the number of protein sequences supporting forward connection and the number of protein sequences supporting reverse connection on the connection side of the same DNA sequence according to the comparison direction in the comparison information;

identifying whether two DNA sequences on the connecting side of the DNA sequences are the same in the comparison direction of the same protein sequence according to the comparison direction in the comparison information, if so, the connecting side of the DNA sequences is in forward connection with the protein sequence; otherwise, the connection is reverse connection;

s42: judging whether the number of the protein sequences supporting forward connection is equal to the number of the protein sequences supporting reverse connection;

if the DNA sequence is equal to the original DNA sequence, deleting all supporting information of the connection side of the DNA sequence;

if the support information is not equal, retaining the support information corresponding to all protein sequences in the direction with the maximum support number in the support information of the DNA sequence connection side, deleting the corresponding residual support information and recording the number of the protein sequences in the direction with the maximum support number;

wherein, the number of the recorded proteins is the protein support degree of the connection side of the DNA sequence;

s43: judging whether the support information of the DNA sequence connection edge exceeds 1;

s44: if the number of the strips does not exceed 1, denoising is finished; if yes, sequentially judging whether the support information of the connection side of the DNA sequence contains the support information of the gap connection type;

if yes, retaining the support information containing the gap connection type and deleting the corresponding residual support information, and then retaining the support information corresponding to the maximum matching degree from the retained support information and deleting the corresponding residual support information;

if not, retaining the support information corresponding to the maximum matching degree from all the support information of the DNA sequence connection edge and deleting the corresponding residual support information;

wherein the matching degree represents the matching degree of the connecting edge of the DNA sequence and the protein sequence, and the matching degree is equal to the sum of the matching values of the two DNA sequences of the connecting edge of the DNA sequence and the corresponding protein sequence.

Further preferably, the determination of the adjacent relationship of the DNA sequences having matching relationship on each protein sequence according to each alignment information on each protein sequence in S2 is performed as follows:

s21: determining the sequence of the DNA sequences with matching relation on each protein sequence according to the initial position and the end position of the matching region on the protein sequence in the alignment information;

s22: identifying whether position overlapping exists in matching blocks corresponding to different DNA sequences on each protein sequence based on the sequence among the DNA sequences, if so, performing noise reduction processing on the alignment information of the DNA sequences with the position overlapping on the protein sequences, and if not, executing S23;

wherein, the matching block is a matching region of the DNA sequence on the protein sequence, and the process of denoising two DNA sequences with overlapped positions on the protein sequence comprises the following steps:

firstly judging whether the size of the overlapping block is larger than 1/2 of the size of the matching block of the front-end DNA sequence, if so, respectively calculating the ratio of the matching value of the two DNA sequences to the size of the corresponding matching block, and deleting the comparison information between the DNA sequence with the small ratio and the protein sequence;

otherwise, judging whether the size of the overlapping block is larger than the size of the matching block of the rear-end DNA sequence or not 1/2, if so, respectively calculating the ratio of the matching value of the two DNA sequences to the size of the corresponding matching block, and deleting the comparison information between the DNA sequence with the small ratio and the protein sequence; if not, go to S23;

the size of the matching block is equal to the difference between the ending position and the starting position of the corresponding DNA sequence in the matching region of the protein sequence, the DNA sequence which is close to the starting end of the protein sequence in the two DNA sequences with overlapped positions is a front end DNA sequence, and the other DNA sequence is a rear end DNA sequence;

s23: based on the sequence between the DNA sequences obtained at S21 and the alignment information obtained at S22, the adjacent relationship between the DNA sequences having the alignment information on each protein sequence is obtained.

It is further preferable that the calculation of the support information corresponding to the connecting side of each DNA sequence on each protein sequence in S3 is performed as follows:

s31: calculation of the first length of the connecting edge of the DNA sequenceAnd judging whether the degree is greater than a first threshold value or not, and if so, generating L¹L²Support information of a connection type;

L_(a,b)＝L₁-L_1.b+L_2.a

in the formula, L_(a,b)Is the first length of the connecting side of the DNA sequence, the first length represents the sum of the end unmatched length of the front DNA sequence and the front unmatched length of the back DNA sequence in the connecting side of the DNA sequence, L₁The length of the leader DNA sequence among the DNA sequences, L_1.bCoordinates of the termination position of the matching region for the front DNA sequence, L_2.aCoordinates of the starting position of the matching region for the rear-end DNA sequence;

s32: if not, calculating the length of an intron on the connection side of the DNA sequence and judging whether the length of the intron is less than 0, if so, generating support information of the gap connection type with the gap value of 0, otherwise, generating the support information of the gap connection type with the gap value equal to the length of the intron;

wherein, the calculation formula of the length of the intron is as follows:

I＝(l_2.a-l_1.b)*3-L_(a,b)

wherein I represents the length of an intron on the connecting side of the DNA sequence, l_2.aFor the starting position coordinates of the matching region on the protein sequence corresponding to the rear DNA sequence,/_1.bThe coordinates of the termination positions of the corresponding matching regions on the protein sequence are for the front DNA sequence.

The value range of the first threshold is 300-600, preferably 500.

Further preferably, the execution process of S1 is as follows:

s11: comparing the DNA sequence to be spliced with the protein sequence one by one to obtain a comparison file;

the comparison file comprises comparison information between all matched DNA sequences and protein sequences, and if one DNA sequence and one protein sequence have n matching positions, n pieces of comparison information are correspondingly generated;

s12: acquiring a protein sequence which only has a matching relation with one DNA sequence, and deleting corresponding comparison information of the acquired protein sequence in the comparison file;

s13: acquiring a protein sequence with more than one matching position with the same DNA sequence, selecting one piece of comparison information from all comparison information corresponding to the acquired protein sequence, keeping the comparison information in a comparison file, and deleting the corresponding residual comparison information;

wherein, the comparison information in the comparison file after the step S13 is the comparison information obtained in the step S1.

Further preferably, in S13, when one piece of alignment information is selected from all pieces of alignment information corresponding to protein sequences and retained, the piece of alignment information with the largest matching value is selected according to the matching values in the alignment information.

Further preferably, the process of calculating the connection distance of the connection edges of all DNA sequences in which the supporting information exists in S6 is as follows:

s61: is obtained as L¹L²Connecting edges of the DNA sequences corresponding to the support information of the connection type, sequentially calculating the first length of each obtained DNA sequence connecting edge, and sequentially judging whether the first length of each obtained DNA sequence connecting edge is smaller than a second threshold value;

wherein the first length represents the sum of the end unmatched length of the front DNA sequence among the DNA sequence connecting edges and the front unmatched length of the rear DNA sequence;

if the length is smaller than the first threshold value, the connection distance corresponding to the connection edge of the DNA sequence is equal to the difference between the second threshold value and the first length of the connection edge of the DNA sequence;

otherwise, the connection distance corresponding to the connection of the DNA sequence is 100;

s62: acquiring DNA sequence connection edges corresponding to the support information of the gap connection type, and determining connection intervals corresponding to the DNA sequence connection edges according to gap values in the support information; wherein, the connection distance corresponding to the connection side of the DNA sequence is equal to the gap value in the corresponding support information.

Advantageous effects

1. According to the invention, the second generation double-ended short reading or the third generation long reading is not needed, the scaffold can be effectively assisted by only the protein sequence, and the gene structure can be more directly recovered by utilizing integrated protein information to perform sequence splicing. According to the comparison information of the guide protein sequence and the sequence to be spliced, multiple comparison and gene prediction are not needed, so that the running time is saved; meanwhile, the invention not only considers the single factor of protein support degree, but also filters conflict branches based on the weight scoring function comprehensively constructed by multiple factors, and selects the optimal front and rear nodes, thereby more effectively constructing the scfolds.

2. The method provided by the invention compares the method with other methods for integrating protein information sequence splicing by simulating double verification of a data set and a real data set, and further verifies that the method for processing the conflict nodes based on the weight scoring function obviously improves the sensitivity, the accuracy and the like of sequence splicing.

Drawings

FIG. 1 is a diagram showing the relationship among DNA sequences, genes and proteins.

FIG. 2 is a flow chart of the present invention.

FIG. 3 is a diagram of overlap match block processing.

Fig. 4 is a view of a scaffold connection.

Fig. 5 is a contigs connection diagram.

FIG. 6 is a schematic diagram of contigs in a sequence file.

Detailed Description

The present invention will be further described with reference to the following examples. In this example, contig described below is represented as a DNA sequence.

As shown in fig. 2, the method for generating a genome splicing sequence scaffold based on protein information provided by the invention comprises the following steps:

step 1: and (4) preprocessing.

Its purpose is S1: and acquiring the comparison information between the DNA sequence to be spliced and the protein sequence. The specific execution process comprises the following steps:

s11: and (3) comparing the DNA sequence to be spliced with the protein sequence one by one to obtain a comparison file (out).

The comparison file comprises comparison information between all matched DNA sequences and protein sequences, and each row in the comparison file represents one piece of comparison information. If a DNA sequence and a protein sequence have n matching positions, n pieces of comparison information are correspondingly generated, wherein each piece of comparison information at least comprises: matching value m (matches), start position coordinates (qStart) and end position coordinates (qEnd) of a matching region on a protein sequence, protein sequence length (qSize), start position coordinates (tStart) and end position coordinates (tEnd) of a matching region on a DNA sequence, alignment direction (strand), DNA sequence length (tSize);

s12: and acquiring a protein sequence which has a matching relation with only one DNA sequence, and deleting the corresponding comparison information of the acquired protein sequence in the comparison file. The objective was to filter out protein sequences that aligned only the last contigs, as this protein sequence was not able to direct scaffold.

S13: acquiring a protein sequence with more than one matching position with the same DNA sequence, selecting one piece of comparison information from all comparison information corresponding to the acquired protein sequence, keeping the comparison information in a comparison file, and deleting the corresponding residual comparison information.

Since one contig can be aligned to multiple positions on the same protein sequence, so that multiple alignment records are generated corresponding to one contig, the alignment information with the maximum matching value of each contig on the protein sequence is extracted, other alignment information is deleted, conflicts are processed, one contig is only aligned to one position of one protein, and simultaneously, one contig is allowed to be aligned to different protein sequences. Therefore, the comparison information in the comparison file after executing the step S13 is the comparison information obtained in the step S1.

In this example protein sequence data was downloaded from a Uniprot central repository.

Step 2: the adjacent relation between the DNA sequences corresponding to each protein sequence is determined.

And combining the initial position qStart of each contig on the protein, the matching value matches of the matching blocks and the final position qEnd of each contig on the protein, arranging the matching blocks of the contigs on each protein, and iteratively processing contigs with overlapped matching, thereby determining the adjacent relation between contigs.

The specific process of iteratively processing contigs with overlapping matches is as follows:

firstly judging whether the size of the overlapping block is larger than 1/2 of the size of the matching block of the front contig, if so, respectively calculating the ratio of the matching value of the two contigs to the size of the corresponding matching block, and deleting the comparison information between the contigs with small ratios and the protein sequence, namely deleting the contigs with poor matching quality on the protein sequence;

otherwise, judging whether the size of the overlapping block is larger than 1/2 of the matching block size of the back-end contig, if so, respectively calculating the ratio of the matching value of the two contigs to the size of the corresponding matching block, and deleting the comparison information between the contigs with small ratios and the protein sequence; if not, the next overlapping matching block is processed iteratively.

After the overlap region is processed, adjacent contigs can be determined based on the arrangement of the respective contigs on each protein.

The size of the matching block is equal to the difference between the ending position qEnd and the starting position qStart of the matching region of the corresponding contig on the protein sequence, the contig close to the starting end of the protein sequence in two contigs with overlapped positions is the front contig, and the other contig is the back contig. As shown in FIG. 3, contigs were aligned in which contig2 and contig3 present an overlap on protein P1, and the alignment information of contig2 and protein was removed by the above-described treatment. Since contigs may be much longer than the protein sequence, the dotted lines in the figure indicate variable lengths.

And step 3: building contigs connections and obtaining support information of each contigs connection.

Where contigs junctions represent the joining edge between two adjacent contigs constructed, i.e. the joining edge of a DNA sequence. Since one contig can match a plurality of protein sequences, contigs links of two adjacent contigs may have a matching relationship with a plurality of protein sequences, and when one contig link has a matching relationship with N protein sequences, the one contig link corresponds to N pieces of support information.

In different protein sequences, for a contig, its pre-and post-nodes may not be unique. As shown in FIG. 4, if protein P1 supports ligation (contig1, contig3), (contig3, contig4), (contig4, contig5), protein P2 supports ligation (contig3, contig6), (contig4, contig7), (contig7, contig8), protein P3 supports ligation (contig1, contig10), (contig8, contig9), a ligation map as shown in FIG. 4 was constructed based on proteins P1, P2, P3. We can know which two contigs are adjacent with protein support from the constructed connection diagram, wherein the contigs are nodes, and the connection relationship between the contigs represents an edge.

The generation process of the support information is as follows:

s31: calculating the first length L of the connecting side of the DNA sequence_(a,b)And judging whether the L is larger than a first threshold value, if so, generating L¹L²Support information for connection type.

L¹L²Support information of connection type: l is¹: l (contig1.b) and L²: l (contig2. a). L is¹Indicates that contig1 is not aligned to the terminal length on the protein, and is represented by the length L of contig1₁(tSize) coordinates L of the start position of the matching area on contig1_1.bThe difference between (tEnd) is given by L²Indicates that contig2 is not aligned to the length of the head on the protein, and the value is equal to the starting position coordinate L of the matching region on contig2_2.a(tStart)，

L_(a,b)＝L¹+L²＝L₁-L_1.b+L_2.a

In the formula, L_(a,b)Is the first length of the connecting side of the DNA sequence, L₁The length of the leader DNA sequence among the DNA sequences, L_1.bCoordinates of the termination position of the matching region for the front DNA sequence, L_2.aCoordinates of the starting position of the matching region for the rear-end DNA sequence;

s32: if not, calculating the length of an intron on the connection side of the DNA sequence, judging whether the length of the intron is less than 0, if so, generating the support information of the gap connection type with the gap value of 0, and otherwise, generating the support information of the gap connection type with the gap value equal to the length of the intron;

where a gap value equal to 0 indicates that two contigs are directly linked with no other base in between. A gap value equal to the intron length I (introLen) ligation type indicates that I bases need to be filled between two ligated contigs;

I＝(l_2.a-l_1.b)*3-L_(a,b)

wherein I represents the length of an intron on the connecting side of the DNA sequence, l_2.aStarting position coordinates (qStart), l for the region of the rear DNA sequence corresponding to the match on the protein sequence_1.bThe coordinates (qEnd) of the termination positions of the corresponding matching regions on the protein sequence are for the front DNA sequence. The value range of the first threshold is 300-600, and 500 is preferred in the embodiment.

And 4, step 4: orientation conflicts were handled according to protein support. Sequentially carrying out denoising treatment on the support information of the connecting edge of each DNA sequence;

the method is mainly executed by two parts, the first part is used for processing the direction conflict of contigs connection, one contigs connection may have the support information of one protein or a plurality of proteins, the connection between two contigs supported by a plurality of proteins in comparison may be forward connection or reverse connection, and the support directions are mutually contradictory, so that further processing is needed. And the other part is that the support information with the optimal matching degree is selected for each contigs connection.

The first part is as follows:

s41: acquiring the number of protein sequences supporting forward connection and the number of protein sequences supporting reverse connection connected by the same contigs according to the comparison direction in the comparison information;

identifying whether two contigs connected with contigs are the same in the comparison direction of the same protein sequence according to the comparison direction in the comparison information, wherein if the two contigs are the same, the contigs connection and the protein sequence are in forward connection; otherwise, it is a reverse connection. As shown in FIG. 5, the two contigs are contig1 and contig2, respectively, and the two matching block sequences on the matching protein P1 are: p1₁,P1₂. As can be seen from the figure, the positive alignment of Contig1 in the (a) figure is P1₁Contig2 positive alignment to P1₂Meaning that the 3 'end of Contig1 is associated with the 5' end of Contig2. (b) Contig1 positive alignment in the figure to P1₁Contig2 reverse alignment to P1₂Meaning that the 3 'end of Contig1 is associated with the 3' end of Contig2. (a) Type means that Contig1 and Contig2 are connected in the same direction, which is forward direction, and type (b) means that Contig1 and Contig2 are not in the same direction, which is reverse direction.

S42: judging whether the number of the protein sequences supporting forward connection is the same as that of the protein sequences supporting reverse connection;

if the information is the same, deleting all the support information of the contigs connection;

if not, retaining the support information corresponding to all the protein sequences in the direction with the maximum number of supports in the support information of the contigs connection, deleting the corresponding residual support information and recording the number of the protein sequences in the direction with the maximum number of supports;

wherein the number of proteins recorded is the protein support to which the contigs are linked.

After the directional conflict is processed, each contigs link may still have a plurality of pieces of information supporting the same directional protein, and at this time, the support information of the contigs link is processed, and only the support information of the protein sequence with the maximum matching value with the contigs is reserved. The specific implementation procedure for each contigs connection is as follows:

s43: judging whether the support information of contigs connection exceeds 1;

s44: if the number of the strips does not exceed 1, denoising is finished; if yes, sequentially judging whether the support information of contigs connection contains the support information of gap connection type;

if yes, retaining the support information containing the gap connection and deleting the corresponding residual support information, and then retaining the support information corresponding to the maximum matching degree from the retained support information and deleting the corresponding residual support information; if not, retaining the support information corresponding to the maximum matching degree from all the support information connected with the contigs and deleting the corresponding residual support information;

wherein the matching degree represents the matching degree of contigs connection and protein sequence, and the matching degree is equal to the sum of the matching values of two contigs connected with contigs and corresponding protein sequence. Namely, the principle of keeping the protein support information with the size of gap directly as the support information as much as possible is carried out.

And 5: and setting a weight scoring function to process branch conflicts. Namely, the front node and the rear node of each contig are subjected to denoising treatment in sequence based on the weight scoring function.

A contig node may have multiple back nodes or multiple front nodes. If such a connection is constructed (contig1, contig2), contig1 is the previous node of contig2 and contig2 is the previous node of contig1. The following node processing is explained as an example in this embodiment:

first, parameters in the weight scoring function are obtained. And secondly, respectively calculating the weight fraction of each back node of each contig, and then respectively selecting the back node with the maximum weight fraction from all the back nodes of each contig as a unique back node.

(1) The process of acquiring parameters is as follows:

s51: and acquiring contigs formed by the contigs to be processed and each rear node, and acquiring corresponding support information of each contigs connection. Only one piece of support information is reserved for each contigs connection after the processing of S4. Wherein, the support information does not directly contain gap length, namely L¹L²The connection type is subjected to the processing of S52 as follows; the following process of S53 is performed for the case where the gap length is directly included, that is, the gap connection type.

S52: obtaining the supporting information L in S51¹L²All contigs of the ligation type are ligated and the end length L of each contigs ligation is calculated¹And sequentially determining the terminal length L of each contigs junction¹Whether the accuracy is met;

if the condition is satisfied,appending the end length L to the support information corresponding to the contigs connection¹The support coefficient α corresponding to a contigs connection is 1 when the support information contains the spequal information (denoted herein as spequal information). Since no gap value is contained, β is 0.

If the information does not satisfy the requirement, the support information corresponding to contigs connections is not added with the spequal information, and if the support information does not contain the spequal information, the support coefficient alpha is 0 on the side corresponding to DNA sequence connections. Since no gap value is contained, β is 0.

S53: and acquiring all contigs connections of which the support information is a gap connection type in the S51, wherein the support coefficients alpha of all contigs connections are 0, and beta is 1. This is based on the inclusion of a gap value, β being 1; when the support information does not contain the spequal information, α is 0.

S54: and acquiring the number of front nodes of each rear node of the DNA sequence to be processed. That is, no matter what kind of support information is, the number of front nodes of each back node needs to be acquired.

(2) Calculating the weight fraction:

sequentially calculating the weight fraction of each node in all the back nodes of the contig to be denoised, and respectively judging whether the weight fraction of the unique node in all the back nodes is the maximum value;

if not, deleting the DNA sequence to be denoised and the support information corresponding to all the rear nodes;

if yes, respectively selecting the node with the highest weight fraction from all the rear nodes as the rear node of the contig to be denoised;

and finally, retaining the support information of the corresponding contigs connection according to the selected back node of each contig, and deleting the corresponding residual support information. It should be understood that, at this time, the back node of each contig is already determined, and thus, the unique back node of each contig may be determined, that is, the back branch collision of each contig is eliminated, and at this time, the support information of the eliminated back branch collision contigs connection is correspondingly deleted.

The weight fraction calculation formula is as follows:

S＝M+P*w₁+α*w₂+β*w₃)/F²，w₁<w₂<w₃

in the formula, S is the weight fraction of the rear node of the DNA sequence to be denoised, M is the matching value corresponding to the rear node, the rear node and the DNA sequence to be denoised determine a DNA sequence connecting edge, further determine the matched protein sequence, and further determine the unique matching value of the rear node. And P is the protein support of the DNA protein connecting edge determined by the DNA sequence to be denoised and the front node or the rear node, wherein the obtaining manner of the protein support of the DNA protein connecting edge has been explained in step S42. F is the number of front nodes of the rear node, w₁、w₂、w₃Respectively, a first weight coefficient, a second weight coefficient, and a third weight coefficient. w is a₁The value range of (1) is 500-1000, the default value is 999, w₂The value range of (1) is 3000-5000, the default value is 5000, w₃The value range of (1) is 8000-10000, and the default value is 10000. Alpha and beta are respectively support coefficients;

a back node contig has multiple different front node contigs connected, similar to the branch conflict after processing, only the branch with the largest weight fraction is reserved, and after the back branch conflict and the front branch conflict are processed until a node only has a front node or a back node, no longer having the connection with mutual conflict.

Step 6: the ligation spacing of the joining edges of all DNA sequences for which supporting information exists is calculated. The specific process is as follows:

s61: is obtained as L¹L²Connecting edges of DNA sequences corresponding to the support information of the connection type, and sequentially calculating the first length L of the connecting edges of each obtained DNA sequence_(a,b)And sequentially judging the first length L of the connecting edge of each obtained DNA sequence_(a,b)Whether or not it is less than a second threshold value M₂；

If the number of the DNA sequence is less than the first threshold value M, the connection distance corresponding to the connection edge of the DNA sequence is equal to the second threshold value M₂First length L of the side to which the DNA sequence is attached_(a,b)The difference between the two; otherwise, the connection distance corresponding to the connection side of the DNA sequence is 100. The invention sets M according to experimental values₂Value ofThe range is 3000-5000, and the preferred value is 3500.

S62: acquiring DNA sequence connection edges corresponding to the support information of the gap connection type, and determining connection intervals corresponding to the DNA sequence connection edges according to gap values in the support information;

wherein, the connection distance corresponding to the connection side of the DNA sequence is equal to the corresponding gap value, and the gap: at 0, the gap value is 0, and the gap: for introLen, the gap value is the introLen value.

And 7: the scaffold path is connected. That is, the scaffold path is obtained based on the connection distance of all the DNA sequence connection edges in S6 and the sequential concatenation of the front and back nodes of each DNA sequence

contig nodes can be divided into three categories: (1) none of the contigs is its predecessor, called the start node; (2) none of the contigs is its back node, which is called an end node; (3) nodes with both front and back node connections are called intermediate nodes. And traversing the scaffold graph from each node in the starting node set and outputting a scaffold path.

And 8: and reading a prestored contigs sequence file according to the output scaffold path set, constructing scaffolds, filling N (N is the number equal to the connection distance) according to the interval information, wherein N represents interval characters, and generating a scaffold result.

For example, one generated scaffold path is: "4-2001:300004-30001:4000014-40001:50000",

as shown in fig. 6, corresponding contigs in the sequence file read the contigs sequence file according to the names of the contigs to find corresponding real sequences, and fill the gaps, so that the generated scaffold result is "TTGAAAAATAAAAACTATTTATGATGCTTTATTGCTTTGAAGNATTGGTATTTTCTCTGGGATA"; when gap equals 0, there is no need to fill 'N', and when gap is greater than 0, a corresponding number of 'N's are filled.

Simulation and verification:

the biologically relevant data set used in the invention.

The simulated dataset was from Drosophila species contigs from PEP _ scaffholder, and the real dataset contained human species contigs from PEP _ scaffholder. Drosophila mimicry contigs were obtained by disrupting the Drosophila reference genome into fragments of the same length of 10 kbp. The guide proteins included human proteins, human homologous rodent proteins and human homologous mammalian proteins, all from PEP _ scaffolders, from the Swiss-Prot database and the TrEMBL database. Staphylococcus aureus authentic data set the scaffolds data obtained from the splicing tool ABySS was from GAGE, and its corresponding guide protein sequence was from the Swiss-Prot database.

One amino acid in the protein sequence corresponds to three nucleic acids of the DNA sequence, and the proportion of the protein sequence to the total sequence of the genome is roughly measured by calculating the ratio of the size of the protein sequence used in the genomic sequences of the species drosophila and human scaffolding to its reference genome. The ratio calculated by Drosophila is about 41.97%, and the ratio calculated by human is only 1.36%.

And (3) verifying the effectiveness of the method for splicing the integrated protein information genome sequence.

The tools using the protein information scaffold were evaluated on the simulated dataset and the real dataset, and the method of the invention is hereinafter described as GCSA, and the tools compared to GCSA are PEP _ scaffold, SWiPS and ESPRIT. To further validate the protein message, super-scfafolds were constructed from the protein sequence using staphylococcus aureus scfafolds obtained from the splicing tool ABySS.

a. Experimental analysis of Drosophila simulation data set.

Four linkages constructed with the tool of the protein information scaffold were first evaluated against the Drosophila species simulation dataset and the results are shown in Table 1. The ESPRIT tool relies on homologous species reference genomes for scaffold based on comparative genomics thought, which is the highest precision, however ESPRIT has the least number of correct junctions, poor sensitivity, and much greater total run time than other tools. The PEP _ scfolder has the highest number of correct connections next to ESPRIT, while the GCSA has the lowest number of correct connections next to PEP _ scfolder and the lowest running time.

TABLE 1 results of ligation evaluation of contigs of different tools on Drosophila simulation dataset

b. Human true data set experimental analysis.

The splicing tools SWiPS and ESPRIT unreasonably consume a large amount of running time when splicing small data set drosophila, have low sensitivity and poor practical effect, so that only the tools GCSA and PEP _ scaffold are compared when splicing human real large data sets. Firstly, the GCSA and the PEP _ scaffold are respectively evaluated by using the proteins of the two databases, and finally, the proteins of the two databases are comprehensively evaluated. The results of Contigs ligation evaluations are shown in tables 2, 3 and 4.

When human proteins were used as the guide proteins, the contigs ligation results are shown in Table 2. The number of human protein Swiss-Prot databases was 20207, the number of TrEMBL databases was 126454, and 146661 protein sequences were found in total. Whether the human proteins in the Swiss-Prot database were used alone, the human proteins in the TrEMBL database were used alone, or the proteins in both databases were used together, the number of correct connections and the accuracy of GCSA were better than PEP _ scaffold.

TABLE 2 results of ligation evaluation of the human proteins GCSA and PEP _ scaffold

When human homologous rodent proteins were used as guide proteins, the contigs ligation results are shown in Table 3. The number of human protein Swiss-Prot databases was 26337, the number of TrEMBL databases was 245600, and a total of 271937 protein sequences were present. Whether the rodent proteins in the Swiss-Prot database were used alone, the rodent proteins in the TrEMBL database were used alone, or the proteins in both databases were used together, the number of correct connections for GCSA was significantly higher than for PEP _ scaffolders, and the accuracy was also better than for PEP _ scaffolders.

TABLE 3 results of evaluation of the ligation of the rodent proteins GCSA and PEP _ scaffold

When human homologous mammalian proteins are used as the guide proteins, the ligation results of contigs are shown in Table 4. The number of human protein Swiss-Prot databases was 19830, the number of TrEMBL databases was 929543, and a total of 949373 protein sequences were present. The number of correctly linked GCSA was greater than that of PEP _ scaffold with little accuracy than that of PEP _ scaffold using mammalian proteins from the Swiss-Prot database alone. When mammalian proteins in the TrEMBL database were used alone or proteins in both databases were used in combination, the correct number of connections and the accuracy of GCSA were better than for PEP _ scaffold. Notably, GCSA using the proteins in the TrEMBL database alone outperformed the results of the comprehensive utilization of the protein sequences.

TABLE 4 results of ligation evaluation of the mammalian proteins GCSA and PEP _ scaffold

In summary, in the human authentic data set, whether the protein sequence is from the Swiss-Prot database or the TrEMBL database, the number of correct junctions of GCSA is better than that of PEP _ scaffolders, and only when the mammalian proteins in the Swiss-Prot database are used alone is the accuracy lower than that of PEP _ scaffolders.

Analysis of the results of the super-scaffolds experiment.

The validity of the protein sequence for splicing the genome was demonstrated by further splicing with protein sequences from staphylococcus aureus scafffolds obtained from the splicing tool ABySS. The results of the Quast4.1 evaluation are shown in Table 5.

The following results show that GCSA can further construct super-scaffolds, the genome component number, NA50 and NGA50 are improved under the condition of slightly increasing relocation, the effectiveness of protein sequences in supporting genome sequence splicing is further proved, and to a certain extent, the protein sequences are used for guiding splicing, so that more gene structures can be recovered. On low coverage genomes, the effect of scafffolding is more pronounced.

TABLE 5 effects of GCSA splicing on ABySS tool scaffolds

In conclusion, the splicing method of the integrated protein information provided by the invention has an important role in the aspects of accuracy, sensitivity and the like of sequence splicing of the integrated protein information by utilizing a weight scoring function integrating multiple factors, can perfect a splicing result by using a protein sequence, and can better reduce a gene structure, particularly for low-coverage splicing.

It should be emphasized that the examples described herein are illustrative and not restrictive, and thus the invention is not to be limited to the examples described herein, but rather to other embodiments that may be devised by those skilled in the art based on the teachings herein, and that various modifications, alterations, and substitutions are possible without departing from the spirit and scope of the present invention.

Claims

1. A genome sequence splicing method based on protein information is characterized in that: the method comprises the following steps:

wherein, only one piece of support information exists on each DNA sequence connection edge after denoising treatment, and the process is as follows:

s41: acquiring the number of protein sequences supporting forward connection and the number of protein sequences supporting reverse connection on the connection side of the same DNA sequence according to the comparison direction in the comparison information; s42: judging whether the number of the protein sequences supporting forward connection is equal to the number of the protein sequences supporting reverse connection; if the DNA sequence is equal to the original DNA sequence, deleting all supporting information of the connection side of the DNA sequence; if the support information is not equal, retaining the support information corresponding to all protein sequences in the direction with the maximum support number in the support information of the DNA sequence connection side, deleting the corresponding residual support information and recording the number of the protein sequences in the direction with the maximum support number; s43: judging whether the support information of the DNA sequence connection edge exceeds 1; s44: if the number of the strips does not exceed 1, denoising is finished; if yes, sequentially judging whether the support information of the connection side of the DNA sequence contains the support information of the gap connection type; if yes, retaining the support information containing the gap connection type and deleting the corresponding residual support information, and then retaining the support information corresponding to the maximum matching degree from the retained support information and deleting the corresponding residual support information; if not, retaining the support information corresponding to the maximum matching degree from all the support information of the DNA sequence connection edge and deleting the corresponding residual support information;

s6: calculating the connection distance of all DNA sequence connection edges with the support information;

2. The method of claim 1, wherein: the process of denoising the front and rear nodes of a DNA sequence in step S5 is as follows:

if the information does not exist, deleting the DNA sequence to be denoised and the corresponding support information of all front nodes or all corresponding back nodes;

if the DNA sequence exists, respectively selecting the node with the highest weight score from all front nodes and all rear nodes as the front node and the rear node of the DNA sequence to be denoised;

finally, according to the selected front and rear nodes of each DNA sequence, retaining the support information of the corresponding DNA sequence connection edge, and deleting the rest support information;

wherein, the calculation formula of the weight fraction is as follows:

S＝M+P*w₁+α*w₂+β*w₃)/F²，w₁<w₂<w₃

wherein S is the weight fraction of the front node or the rear node of the DNA sequence to be denoised, M is the matching value corresponding to the front node or the rear node, P is the protein support degree of the DNA protein connecting edge determined by the DNA sequence to be denoised and the front node or the rear node, F is the number of the front node or the rear node, w is the weight fraction of the front node or the rear node of the DNA sequence to be denoised, W is the weight fraction of the front node or the rear node of the DNA sequence to be denoised, M is the matching value corresponding to₁、w₂、w₃Respectively, a first weight coefficient, a second weight coefficient and a third weight coefficient, wherein alpha and beta are respectively support coefficients;

3. The method of claim 2, wherein: before calculating the weight fraction of each node before and after the DNA sequence to be denoised, respectively obtaining the support coefficients alpha and beta and the node number F in the calculation of the weight fraction;

4. The method of claim 3, wherein: s52 sequencing the terminal length L of the joining side of each DNA sequence¹The procedure of whether the accuracy is satisfied is as follows:

5. The method of claim 1, wherein: in S4, identifying whether two DNA sequences on the connecting side of the DNA sequences are the same in the comparison direction of the same protein sequence according to the comparison direction in the comparison information, if so, the connecting side of the DNA sequences is in forward connection with the protein sequence; otherwise, the connection is reverse connection;

the number of the recorded proteins is the protein support degree of the DNA sequence connection side; the matching degree represents the matching degree of the connecting edge of the DNA sequence and the protein sequence, and the matching degree is equal to the sum of the matching values of the two DNA sequences of the connecting edge of the DNA sequence and the corresponding protein sequence.

6. The method of claim 1, wherein: the implementation process of determining the adjacent relationship of the DNA sequences having matching relationship on each protein sequence according to each alignment information on each protein sequence in S2 is as follows:

7. The method of claim 1, further comprising: the calculation of the support information corresponding to the connecting edge of each DNA sequence on each protein sequence in S3 is performed as follows:

s31: calculating the first length of the DNA sequence connecting edge and judging whether the first length is larger than a first threshold value, if so, generating L¹L²Support information of a connection type;

L_(a,b)＝L₁-L_1.b+L_2.a

wherein, the calculation formula of the length of the intron is as follows:

I＝(l_2.a-l_1.b)*3-L_(a,b)

8. The method of claim 1, wherein: the execution process of S1 is as follows:

9. The method of claim 8, wherein: in S13, when one piece of alignment information is selected from all the alignment information corresponding to the protein sequences and retained, the piece of alignment information with the largest matching value is selected according to the matching value in the alignment information.

10. The method of claim 1, wherein: the process of calculating the connection distance of the connection edges of all DNA sequences with the support information in S6 is as follows:

otherwise, the corresponding connection interval of the DNA sequence connection edge is 100;

wherein, the connection distance corresponding to the connection side of the DNA sequence is equal to the gap value in the corresponding support information.