CN112634989A

CN112634989A - Double-sided genome fragment filling method and device based on fragment contig

Info

Publication number: CN112634989A
Application number: CN202011597411.5A
Authority: CN
Inventors: 柳楠; 李春良; 李胜华; 朱永琦; 李晓峰; 郑晶玲; 尤宝山; 王向辉
Original assignee: Shandong Jianzhu University
Current assignee: Shandong Jianzhu University
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2021-04-09

Abstract

The invention discloses a genome fragment filling method and device based on fragment contigs. The method comprises the following steps: computing the missing genes; classifying the maximum missing gene strings; merging the maximum missing gene strings that meet the merging conditions; searching for Type-1 strings and executing the Type-1 string insertion algorithm; searching for Type-3-II strings without slots and executing the no-slot-Type-3-II string insertion algorithm; and searching for Type-2 and Type-3 strings, processing the common adjacency relations related to contradictory common genes, and executing the Type-2&3 string insertion algorithm. Because the computation is based on fragment contigs, the method is more general and more widely applicable. The filling method offers high search speed and high filling efficiency, reduces the time and space complexity of contig-based genome fragment filling, and improves the sensitivity and specificity of filling.

Description

Double-sided genome fragment filling method and device based on fragment contig

Technical Field

The invention relates to a method and a device for filling double-sided genome fragments based on fragment contig, belonging to the technical field of genetic engineering.

Background

With the continuous development of gene sequencing technology, sequencing scale and speed have greatly improved and sequencing cost has fallen, but it is still difficult to obtain a complete genome sequence by sequencing alone. Generally, a whole genome sequence is obtained as follows: a gene sequencer produces a large number of short base sequences; a splicing (assembly) algorithm assembles these short gene fragments into larger gene fragments (fragment contigs); and the order of all fragment contigs in the genome and the distance between them are determined, yielding a larger gene structure, the genome framework. Many genes are missing from such a framework. Multiple frameworks of the same genome can be obtained by repeated sequencing; genome fragment filling then inserts the missing genes into the gaps between the fragment contigs of an incomplete sequence, which improves the integrity of the framework and reduces the cost of biological sequencing.

Genome fragment filling inserts missing genes into the gaps between the fragment contigs of an incomplete sequence, which improves the integrity and accuracy of the genome fragments and reduces the cost of gene sequencing. Double-sided genome fragment filling based on fragment contigs is a more general form of the earlier double-sided genome fragment filling based on ordinary sequences. The input changes from an ordinary genome fragment sequence to the fragment contig sequence that is more commonly used in practice. The insertion position of a missing gene is no longer between any two genes but is restricted to the positions between fragment contigs, which ensures that gene structures already known to be meaningful are not destroyed by newly inserted genes. Liu derived an upper bound on the number of common adjacencies in an optimal solution by analyzing the classification of breakpoints in double-sided filling and the relation between the number of inserted genes and the number of common adjacencies generated, and designed a greedy approximation algorithm with a performance ratio of 1.5. Ma first bounded the number of adjacencies that a filled fragment can add, and then found an approximate maximum independent set in a 5-connected or 7-connected claw-free graph, proposing a 1.4-approximation algorithm. However, these two algorithms can only solve double-sided genome fragment filling based on ordinary sequences, cannot be completed in polynomial time, and cannot be applied to double-sided genome fragment filling based on fragment contigs.

Therefore, a method that solves double-sided genome fragment filling based on fragment contigs and completes in polynomial time is urgently needed.

Disclosure of Invention

Aiming at the defects and limitations of the prior art, the invention provides a double-sided genome segment filling method based on segment contig, which can complete the filling of double-sided genome segments based on the segment contig, greatly reduces the time complexity and the space complexity, has faster filling speed and higher accuracy, and ensures that the integrity and the accuracy of genome segments are higher. An apparatus for implementing the method is also provided.

In a first aspect, embodiments of the present invention provide a double-sided genome fragment filling method based on fragment contigs, comprising the following steps:

(1) computing the missing genes;

comparing the elements of permutation A and permutation B with each other to obtain the set X of genes missing from A and the set Y of genes missing from B.

(2) Classifying the maximum missing gene strings;

types of maximum missing gene strings, composed of elements of X and Y, in an optimal solution: let the string length be n, i.e., the string consists of n missing genes.

Type-1: forms n+1 adjacencies.

Type-1-I: a string composed of elements of X and Y together.

Type-1-II: a string consisting of only elements in X or only elements in Y.

Type-2: forms n adjacencies.

Type-2-I: a string of X and Y elements together.

Type-2-II: a string of X and Y elements together.

Type-2-III: a string that destroys an original adjacency while forming n adjacencies. Such missing strings generally occur in pairs; otherwise the string can be equivalently converted into two strings: the string formed by the original adjacency, and a Type-3 string obtained by placing the string at the end of the permutation. A missing string of this kind must come from a single missing gene set.

Type-3: forms n-1 adjacencies.

Type-3-I: a string of X and Y elements together.

Type-3-II: a string consisting of only elements in X or only elements in Y.

(3) Merging the maximum missing gene strings that meet the merging conditions;

merging operation: the slots between consecutive maximum missing gene strings are removed (the outermost slots remain unchanged).

(4) Searching for Type-1 strings and executing the Type-1 string insertion algorithm;

(5) searching for Type-3-II strings without slots and executing the no-slot-Type-3-II string insertion algorithm;

(6) searching for Type-2 and Type-3 strings, processing the common adjacency relations related to contradictory common genes, and executing the Type-2&3 string insertion algorithm;

(7) inserting all remaining missing genes at the end of each permutation.

In a second aspect, embodiments of the present invention provide a double-sided genome fragment filling apparatus based on fragment contigs, comprising:

an input unit: receiving two genome permutations based on a set of fragment contigs;

an initialization unit: computing the missing gene sets from the input permutations;

a classification unit: classifying the maximum missing gene strings;

a merging unit: merging the maximum missing gene strings that meet the merging conditions;

a Type-1 string insertion unit: searching for Type-1 strings and executing the Type-1 string insertion algorithm;

a no-slot-Type-3-II string insertion unit: searching for Type-3-II strings without slots and executing the no-slot-Type-3-II string insertion algorithm;

a Type-2&3 string insertion unit: searching for Type-2 and Type-3 strings, processing the common adjacency relations related to contradictory common genes, and executing the Type-2&3 string insertion algorithm;

a remaining missing gene insertion unit: inserting all remaining missing genes at the end of each permutation;

an output unit: outputting the two filled genome permutations.

In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, and a bus, wherein,

the processor and the memory are communicated with each other through the bus;

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform a method comprising:

computing the missing genes;

classifying the maximum missing gene strings;

merging the maximum missing gene strings that meet the merging conditions;

searching for Type-1 strings and executing the Type-1 string insertion algorithm;

searching for Type-3-II strings without slots and executing the no-slot-Type-3-II string insertion algorithm;

searching for Type-2 and Type-3 strings, processing the common adjacency relations related to contradictory common genes, and executing the Type-2&3 string insertion algorithm;

inserting all remaining missing genes at the end of each permutation.

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, wherein:

the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform a method comprising:

computing the missing genes;

classifying the maximum missing gene strings;

merging the maximum missing gene strings that meet the merging conditions;

inserting all remaining missing genes at the end of each permutation.

The input permutations of the invention are genome permutations based on fragment contigs; genome fragment filling is carried out on these contig-based permutations, and the fragment contigs are precisely defined so as to be closer to the real structure. The insertion position of a missing gene is no longer between any two genes but is restricted to the positions between fragment contigs, which ensures that meaningful gene structures are not destroyed by inserted genes. Compared with genome fragment filling methods based on ordinary sequences, the search speed and accuracy are markedly improved, and the method completes in polynomial time.

Drawings

FIG. 1 is a flow chart of the fragment contig-based genome fragment filling method of the present invention.

FIG. 2 is a schematic diagram of the node structures of the present invention.

FIG. 3 is a schematic diagram of the X-Y twist common adjacency relation of the present invention.

FIG. 4 is a schematic diagram of the X-X, Y-Y twist common adjacency relation of the present invention.

FIG. 5 is a schematic diagram of the common adjacency relation of Type-2-II and Type-2-III strings of the present invention.

FIG. 6 is a schematic diagram of the transformation edges of the X-X, Y-Y twist common adjacency relation of the present invention.

FIG. 7 is a schematic diagram of a supplement-I edge of the present invention.

FIG. 8 is a schematic diagram of a supplement-II edge in a slot-preemption common adjacency of the present invention.

FIG. 9 is a schematic structural diagram of the fragment contig-based genome fragment filling apparatus of the present invention.

FIG. 10 is a block diagram of an electronic device of the present invention.

Detailed Description

First, concepts regarding arrangement, sequence, segment contig, slot, maximum deletion gene string, common gene, adjacent deletion string, contradictory common gene, and the like will be explained.

Permutation: given a gene set Σ, if each element of Σ appears only once in P, P is called a permutation; the set of elements in permutation P is denoted by c(P).

Sequence: given a gene set Σ, if elements of Σ may appear multiple times in A, then A is called a sequence; the set of elements in sequence A is denoted by c(A).

Adjacency: given two sequences $A = a_1 a_2 a_3 \ldots a_n$ and $B = b_1 b_2 b_3 \ldots b_n$, if $a_i a_{i+1} = b_j b_{j+1}$ (or $a_i a_{i+1} = b_{j+1} b_j$), where $a_i a_{i+1} \in P_A$ and $b_j b_{j+1} \in P_B$, then $a_i a_{i+1}$ and $b_j b_{j+1}$ are said to match each other. In a maximum matching of the pairs in $P_A$ and $P_B$, a matched pair $a_i a_{i+1}$ is called a common adjacency in A relative to B.

Breakpoint: a pair $a_j a_{j+1}$ that has no match is called a breakpoint in A relative to B.

Fragment contig (contig): is a string composed of the genes in Σ, and the content of this string is fixed.

Slot: the insertion position between two fragment contigs is called a slot. A slot on the left side of a gene is called an L-slot, and a slot on the right side of a gene is called an R-slot. Slots are generally denoted by ·.

Maximum missing gene string: the longest string of consecutive missing genes (not crossing contig boundaries) is called a maximum missing gene string. If a maximum missing gene string is itself a whole contig, it is called an independent maximum missing gene string and has two slots; otherwise it is called a non-independent (dependent) maximum missing gene string. Maximum missing gene strings that are adjacent to one another are called consecutive maximum missing gene strings.

Public genes: genes present in both input permutations.

Adjacent missing string: for a common gene G in input permutation A, the consecutive genes immediately to its left or right that are not present in B form a missing string of B, called an adjacent missing string of G.

Contradictory common gene: a common gene G that satisfies the following conditions: G has at least one adjacent missing string in each permutation; G has at least one slot across the two permutations; and, if G has at least one slot in permutation A, then every adjacent missing string of G in permutation B has at least one slot.
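The adjacency and breakpoint definitions above can be illustrated with a minimal Python sketch. It assumes that P_A and P_B are the multisets of unordered adjacent gene pairs of A and B, which the text does not state explicitly; under that assumption a maximum matching of pairs is simply the multiset intersection. The function names are hypothetical and are not part of the disclosed method.

```python
from collections import Counter


def adjacent_pairs(seq):
    """Multiset of unordered adjacent gene pairs (assumed meaning of P_A / P_B)."""
    return Counter(tuple(sorted(pair)) for pair in zip(seq, seq[1:]))


def common_adjacencies(a, b):
    """Size of a maximum matching between the adjacent pairs of a and b."""
    # Counter intersection keeps the minimum multiplicity of each pair.
    return sum((adjacent_pairs(a) & adjacent_pairs(b)).values())


def breakpoints(a, b):
    """Number of adjacent pairs of a that have no match in b."""
    return max(len(a) - 1, 0) - common_adjacencies(a, b)
```

For example, with A = a b c d and B = b a c d, the pairs ab and cd match, so there are two common adjacencies and one breakpoint (bc) in A relative to B.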

The invention is further described with reference to the following figures and detailed description.

The embodiment of the invention, a genome fragment filling method based on fragment contig, as shown in FIG. 1, comprises the following steps:

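(1) computing the missing genes;

(2) classifying the maximum missing gene strings;

(3) merging the maximum missing gene strings that meet the merging conditions;

(4) searching for Type-1 strings and executing the Type-1 string insertion algorithm;

(5) searching for Type-3-II strings without slots and executing the no-slot-Type-3-II string insertion algorithm;

(6) searching for Type-2 and Type-3 strings, processing the common adjacency relations related to contradictory common genes, and executing the Type-2&3 string insertion algorithm;

(7) inserting all remaining missing genes at the end of each permutation.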

Preferably, in the above fragment contig-based genome fragment filling method, the missing genes are computed in step (1) by comparing the elements of permutation A and permutation B with each other to obtain the set X of genes missing from A and the set Y of genes missing from B.
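Step (1), together with the definition of a maximum missing gene string, can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes each permutation is represented as a list of contigs and each contig as a list of gene names (a hypothetical representation), and it takes X to be the genes present in B but absent from A and Y the genes present in A but absent from B, which is inferred from the later statement that S_B belongs to X and S_A belongs to Y.

```python
def missing_gene_sets(perm_a, perm_b):
    """Return (X, Y): genes missing from A and genes missing from B."""
    genes_a = {gene for contig in perm_a for gene in contig}
    genes_b = {gene for contig in perm_b for gene in contig}
    return genes_b - genes_a, genes_a - genes_b


def maximal_missing_strings(perm, missing_in_other):
    """List (contig_index, genes, is_independent) for every maximal run of
    genes in `perm` that the other permutation lacks. Runs never cross
    contig boundaries; a run is independent when it is a whole contig."""
    runs = []
    for index, contig in enumerate(perm):
        run = []
        for gene in list(contig) + [None]:  # None is a sentinel closing the last run
            if gene in missing_in_other:
                run.append(gene)
            elif run:
                runs.append((index, run, len(run) == len(contig)))
                run = []
    return runs
```

With A = [[a, 1, 2, b], [3], [c]] and B = [[a, b], [c, 9]], the sketch yields X = {9} and Y = {1, 2, 3}; the strings 1 2 (non-independent) and 3 (independent) are the maximum missing gene strings of A.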

Preferably, in the above fragment contig-based genome fragment filling method, the maximum missing gene strings are classified in step (2) according to the following rules:

Type-1: forms n+1 adjacencies.

Type-1-I: a string composed of elements of X and Y together.

Type-1-II: a string consisting of only elements in X or only elements in Y.

Type-2: forms n adjacencies.

Type-2-I: a string of X and Y elements together.

Type-2-II: a string of X and Y elements together.

Type-3: forms n-1 adjacencies.

Type-3-I: a string of X and Y elements together.

Type-3-II: a string consisting of only elements in X or only elements in Y.
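The main classification rule above depends only on the string length n and the number of adjacencies the string forms. A minimal sketch follows, assuming the number of adjacencies formed is already known; the function name and return format are hypothetical. Because the sub-type labels depend on further conditions given in the text, the sketch reports only the main type and whether the string mixes genes from X and Y.

```python
def classify_missing_string(genes, adjacencies_formed, x, y):
    """Return (main_type, composition) for a maximum missing gene string."""
    n = len(genes)
    main_type = {n + 1: "Type-1", n: "Type-2", n - 1: "Type-3"}.get(adjacencies_formed)
    if main_type is None:
        raise ValueError("a string of n genes forms n+1, n, or n-1 adjacencies")
    # A string is mixed when it contains genes from both missing gene sets.
    mixed = any(g in x for g in genes) and any(g in y for g in genes)
    return main_type, "X and Y together" if mixed else "single set"
```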

Preferably, in the above method for filling genome fragments based on a fragment contig, the merging of the eligible maximum deletion gene strings in step (3) specifically includes the following steps:

merging operation: removing slots between consecutive maximum deletion gene strings (the outermost slots remain unchanged);

eligible merge operations:

when a run of consecutive maximum missing gene strings contains both a dependent maximum missing gene string and independent maximum missing gene strings (the dependent string is necessarily at the leftmost or rightmost end of the run), the dependent string at the leftmost or rightmost end is taken as the initial string, and the consecutive independent strings are merged with it; the number and types (R-slot, L-slot) of the slots of the merged string are the same as those of the dependent string;

when a run of consecutive maximum missing gene strings contains only independent maximum missing gene strings, the merging operation is executed; the merged string is still an independent maximum missing gene string and still has two slots, one on its left side and one on its right side.
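The merging operation can be sketched as follows, assuming a run of consecutive maximum missing gene strings is given as a flat token list in which the string "·" marks a slot (a hypothetical representation chosen for illustration). Slots between the first and last gene of the run are removed; any slot outside them is kept.

```python
SLOT = "·"


def merge_run(tokens):
    """Remove the inner slots of a run of consecutive maximum missing gene
    strings, leaving the outermost slots unchanged."""
    gene_positions = [i for i, token in enumerate(tokens) if token != SLOT]
    if not gene_positions:
        return list(tokens)
    first, last = gene_positions[0], gene_positions[-1]
    inner = [token for token in tokens[first:last + 1] if token != SLOT]
    return list(tokens[:first]) + inner + list(tokens[last + 1:])
```

In this representation, the independent strings · 1 2 · 3 · merge into · 1 2 3 ·, which keeps two slots, while a run that starts with a dependent string, 1 · 2 ·, merges into 1 2 ·, which keeps only the slot of the dependent string.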

Preferably, in the above fragment contig-based genome fragment filling method, the missing gene strings between two genes after step (3) are as follows:

Let S_A be a consecutive maximum missing string in A, with left adjacent common gene l and right adjacent common gene r. The missing string between the two common genes is one of the following (the example formulas are missing from the extracted text):

① a maximum missing string without slots;

② a maximum missing string with one slot;

③ an independent maximum missing string with two slots;

④ two maximum missing strings separated by a slot.

Preferably, in the above fragment contig-based genome fragment filling method, Type-1 strings are searched for in step (4) and the Type-1 string insertion algorithm is executed. The Type-1 string insertion algorithm is as follows:

The consecutive maximum missing strings in permutations A and B are scanned in order from left to right. Let S_A be a consecutive maximum missing string in A, with left adjacent common gene l and right adjacent common gene r.

If l·r or r·l is present in B, then regardless of which of the above cases S_A belongs to, S_A or the reverse of S_A can be inserted into the slot of l·r or r·l and locked (i.e., all slots of l·r or r·l are deleted); S_A or its reverse is then a Type-1-II string. Note that if S_A is case ①, a Type-1-II string can be formed only in this way. In addition:

I. When the missing string S_A is case ②, let it be [formula]. If B contains [formula] (or [formula]) or [formula] (or [formula]), then S_B (or the reverse of S_B) is inserted into A and locked, forming [formula] (or [formula]), and S_A (or the reverse of S_A) is inserted into B and locked after insertion, forming [formula] (or [formula]). The string inserted in this case is a Type-1-I string.

II. When the missing string S_A is case (Y), let it be [formula]. If B contains an S_B that belongs to case II or III (one or three or two) and whose adjacent common genes are l and r, the insertion operation described in I can be carried out; the string is locked after insertion and forms a Type-1-I string.

III. When S_A is [case label missing from the extracted text], a Type-1-I string cannot be formed.
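The first situation of the Type-1 string insertion algorithm, in which l·r or r·l occurs in B, can be sketched as follows. This is a minimal illustration under stated assumptions: permutation B is a flat token list in which "·" marks a slot, exactly one slot lies between l and r, and the function name is hypothetical. The sub-cases I to III are not covered.

```python
SLOT = "·"


def insert_between_common_genes(b_tokens, left, right, s_a):
    """Insert S_A (or its reverse) into the slot of l·r (or r·l) in B and lock
    it by removing that slot. Returns the new token list, or None if neither
    l·r nor r·l occurs in B."""
    for i in range(len(b_tokens) - 2):
        window = list(b_tokens[i:i + 3])
        if window == [left, SLOT, right]:
            return list(b_tokens[:i + 1]) + list(s_a) + list(b_tokens[i + 2:])
        if window == [right, SLOT, left]:
            return list(b_tokens[:i + 1]) + list(reversed(s_a)) + list(b_tokens[i + 2:])
    return None
```

If A contains a 1 2 b and B is a · b · c, inserting S_A = 1 2 gives a 1 2 b · c. The n = 2 inserted genes form n+1 = 3 adjacencies (a1, 12, 2b), as required of a Type-1 string.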

Preferably, in the above fragment contig-based genome fragment filling method, Type-3-II strings without slots are searched for in step (5) and the no-slot-Type-3-II string insertion algorithm is executed. The no-slot-Type-3-II string insertion algorithm is as follows:

The maximum missing strings in permutations A and B are scanned in order from left to right. Let S_A be a maximum missing string in A, with left adjacent common gene l and right adjacent common gene r. If S_A has no slot, it is inserted at the end of permutation B and a slot is added at the end. The Type-3-II strings found by this algorithm are called no-slot-Type-3-II strings.

Preferably, in the above fragment contig-based genome fragment filling method, Type-2 and Type-3 strings are searched for in step (6), the common adjacency relations related to contradictory common genes are processed, and the Type-2&3 string insertion algorithm is executed. When searching for Type-2 missing strings, if the common adjacency relations of each permutation were established only from the original common adjacencies, then, because a slot exists between a contradictory common gene and a missing string, inserting a missing gene into that slot could destroy an original common adjacency. The common adjacency relations involving contradictory common genes must therefore be processed at the same time, to avoid insertion errors caused by establishing contradictory common adjacency relations. The common adjacencies formed by contradictory common genes and maximum missing gene strings are first classified:

Suppose the contradictory common gene is T, the maximum missing string adjacent to it in permutation A is S_A, and the missing string adjacent to it in permutation B is S_B (· denotes an available slot):

① Twist common adjacency relation

If A contains [formula] and B contains [formula], the result permutations can form [formula]. Here S_B belongs to the missing gene set X and S_A belongs to the missing gene set Y; this kind of twisted matching is called an X-Y twist common adjacency relation, and the missing string formed in the result is of Type-2-I. Many forms can produce the insertion result [formula]; they are listed in Table 1, where the symmetric forms obtained by exchanging A and B are not listed.

Table 1. Specific forms of the twist common adjacency

In addition, there is a special twist common adjacency formed by two adjacent maximum missing strings in the same permutation; it is called a Y-Y twist common adjacency in permutation B and an X-X twist common adjacency in permutation A. If the X-X/Y-Y twist common adjacency can form a common adjacency with an adjacent common gene, the missing string is of Type-2-I; otherwise, it is of Type-3-II.

② Slot-preemption common adjacency relation

If A contains [formula] and B contains [formula], the two result permutations can form only the [formula] common adjacency or only the [formula] common adjacency, because S_A and S_B preempt the slot of T; the result is [formula] or [formula]. This is called a slot-preemption common adjacency. Many forms can constitute a slot-preemption common adjacency; they are listed in Table 2, where the symmetric forms obtained by exchanging A and B are not listed.

Table 2. Specific forms that can constitute a slot-preemption common adjacency

Note that if the contradictory common gene T has, in both permutations, a slot on the side where no missing string is present, for example A contains [formula] and B contains [formula], then the result permutation A can finally form [formula] and the result permutation B can form [formula]. This is a special slot-preemption common adjacency.

Preferably, in the above fragment contig-based genome fragment filling method, Type-2 and Type-3 strings are searched for in step (6), the common adjacency relations related to contradictory common genes are processed, and the Type-2&3 string insertion algorithm is executed. Because of the contradictory common genes, the ordinary bipartite maximum matching used in the one-sided fragment filling problem can no longer produce correct results. The Type-2&3 string insertion algorithm builds a weighted maximum matching model on a general graph and applies the idea of the weighted blossom algorithm to obtain a maximum matching of the common adjacency relations; missing strings are inserted according to this matching, which yields all Type-2 and Type-3 strings of some optimal solution. To obtain the optimal number of adjacencies with the weighted blossom algorithm, the following must be determined: (1) the structure of the nodes; (2) the rules for establishing edges between nodes; (3) the weights of the edges.
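The matching model described above can be illustrated on a toy graph. The sketch below is deliberately not the weighted blossom algorithm: it is an exhaustive search that runs in exponential time and is shown only to make concrete what a maximum-weight matching on a general (not necessarily bipartite) graph is. The node names and edge weights in the usage note are hypothetical, loosely modelled on the h, 6, 5, k example of FIG. 6.

```python
def max_weight_matching(edges):
    """Exhaustively find a maximum-weight matching.

    `edges` maps a node pair (u, v) to a positive weight. Returns
    (total_weight, chosen_edges). Exponential time: an illustrative
    stand-in for the weighted blossom algorithm, usable only on tiny graphs.
    """
    edge_list = sorted(edges.items())
    best = (0.0, [])

    def search(start, used, weight, chosen):
        nonlocal best
        if weight > best[0]:
            best = (weight, list(chosen))
        for k in range(start, len(edge_list)):
            (u, v), w = edge_list[k]
            if u not in used and v not in used:  # each node is matched at most once
                chosen.append((u, v))
                search(k + 1, used | {u, v}, weight + w, chosen)
                chosen.pop()

    search(0, frozenset(), 0.0, [])
    return best
```

For the path h, 6, 5, k with weights 1.77 on (h, 6), 0.19 on (6, 5) and 1.77 on (5, k), the search selects the two outer edges, so each missing string is matched with a common gene rather than the two missing strings being matched with each other.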

(1) The structure of the node is as follows:

In the Type-2&3 string insertion algorithm, the nodes of the graph cannot simply be the missing gene strings and the common genes carrying slots, because the contradictory common genes give rise to twist common adjacencies and slot-preemption common adjacencies, which the nodes must also be able to represent. The node structure of a missing string adjacent to contradictory common genes must specify its relation to those genes, because one missing string may be adjacent to two contradictory common genes and one contradictory common gene may be adjacent to two missing strings. So that the final insertion of each missing string node can be determined, each node structure contains at most one missing string. Thus, a node contains at most one missing string and its two adjacent contradictory common genes; some other node structures also exist, as shown in FIG. 2, where 6 is a missing gene string, h and F are common genes, and the respective slots are indicated for both the missing gene string and the common genes.

(2) Establishing rules of edges between nodes:

In the Type-2&3 string insertion algorithm, the basic rule is to establish an edge between two nodes if their elements, when joined together, form a common adjacency, i.e., a common adjacency relation. The weights of the edges are defined according to the type of common adjacency relation, so that the number of new adjacencies obtained is maximized. Because Type-3-I strings exist, when missing strings are matched to form Type-2 strings under otherwise equal conditions, it must be considered whether missing strings without slots should be selected first. This is the main reason why Type-2 and Type-3 strings are searched for together. However, because of the contradictory common genes, some special common adjacency relations must be established first, and the edges they form must be given higher weights. Compared with a Type-3 string, a Type-2 string yields more new common adjacencies, and this is also reflected in the weight of the corresponding edge.

The edges in a weighted general graph can be divided into five major categories:

① X-Y, X-X, Y-Y twist common adjacency: an edge between the slots of two missing strings, in which both end points are slots (·) of the nodes; it generally represents an X-Y, X-X, or Y-Y twist common adjacency. The X-Y twist common adjacency relation is shown in FIG. 3, and the X-X, Y-Y twist common adjacency relation is shown in FIG. 4.

② Type-2-II and Type-2-III string common adjacency: an edge between a missing string and the slot of a common gene, joining the box that represents the missing string in one node to the · that represents the slot of the common gene in the other; it generally represents a Type-2-II or Type-2-III string common adjacency, as shown in FIG. 5.

③ Transformation edges of the X-X, Y-Y twist common adjacency: so that the maximum matching technique for general graphs can be applied uniformly, the edge representation of an X-X, Y-Y twist common adjacency that can form a Type-2 string is specially processed, and transformation edges of the X-X, Y-Y twist common adjacency are added. As shown in FIG. 6, the common adjacency h56 is represented by the edge between the slot of h and the missing string 6, the common adjacency 56k is represented by the edge between the slot of k and the missing string 5, the edge between 5 and 6 indicates that 56 can constitute a Type-3-II string, and the edge between the missing string 5 (6) and h (k) indicates that 5 (6) alone constitutes a Type-2-II string.

④ Supplement-I edge: if the slot into which a missing string would be inserted lies inside an X-Y twist common adjacency, the edge lowers the weight of the edge formed by that twist common adjacency. Such an edge is marked; it takes part in the maximum matching but must also be considered for supplementary selection afterwards, because the slot it targets may be released after the maximum matching, so that a common adjacency can be formed again. It is therefore established as a supplement-I edge. Note that neither the missing string nor its adjacent common gene contains a slot; otherwise an X-Y twist or slot-preemption adjacency would be formed. The transformation edges of the X-X, Y-Y common adjacency may also be supplement-I edges. As shown in FIG. 7, the missing strings d and nk can be connected to the slot of the common gene h in the X-Y twist common adjacency, so the weight of the 6nh twist common adjacency edge is reduced. If the maximum matching over the solid edges forms the 6JF common adjacency, d may then form an adjacency with h.

⑤ Supplement-II edge: an edge that marks a slot-preemption common adjacency. A slot-preemption common adjacency usually means choosing one of several common adjacency relations that cannot coexist, so it forms new adjacencies inefficiently, and placing it directly in the graph to take part in the maximum matching would also affect the correctness of the result. Common adjacency relations of this type are therefore only marked when the graph is built: they influence the weights of other related edges but do not take part in the maximum matching computation. After the maximum matching is finished, if such a relation can still form a common adjacency, its edge is selected as a supplement; it is called a supplement-II edge. As shown in FIG. 8, the large dashed box indicates the range of the slot-preemption common adjacency, the two small dashed boxes indicate the nodes in the preemption common adjacency, and the established edges are drawn as dashed lines. After 6nk forms a Type-2-I string, the edge between L and R of the slot-preemption common adjacency can still be selected, and L forms a Type-2-II string.

(3) Weight of edge

In a weighted general graph, the weight represents the value of an edge. The more efficiently the common adjacency relation it represents forms new adjacencies, the higher the weight of the edge. The more common adjacency relations a node can form, the lower the weights of its related edges may be. This is easy to understand: a node with only one common adjacency edge has a greater need for that edge to be matched.

In the weighted general graph that is built, the variables representing edge weights are I, K₀, K₁, K₂, K₃, K₄, K₅, K₆, K₇, K₈, which satisfy I > 0, K₈ < K₇ < K₆ < K₅ < K₄ < K₃ < K₂ < K₁ < K₀ < 2I, K₁ + K₅ > K₃, K₅ + K₇ > K₁ + K₈, and 2K₈ > K₆. The weights are summarized below; if an edge satisfies several weight rules, the smallest applicable weight is taken as the edge weight. For clarity, each variable is given a specific value that satisfies the conditions (shown in parentheses). The edge weights fall into the following categories:

① The edge weight is 2I (2): in an X-Y twisted common adjacency relation, if the slot of the common gene has no other connected edge and the missing string has no other slot into which it can be inserted, the weight of the edge formed by the twisted common adjacency relation is 2I (2).

② The edge weight is K₀ (1.9): in an X-Y twisted common adjacency relation, when the slot of the common gene is connected only to dashed edges (namely the two types of supplement edges), the weight of the edge formed by the twisted common adjacency relation is reduced from 2I (2) to K₀ (1.9).

③ The edge weight is K₁ (1.8): when the transformation edge of an X-X or Y-Y twisted common adjacency relation is adjacent to a common gene on only one side, or when the slot of the inserted common gene has neither an edge forming an X-Y twisted common adjacency relation nor an edge formed by any missing string other than those in the X-X or Y-Y twisted common adjacency relation, the weight of the transformation edge is K₁ (1.8).

④ The edge weight is K₂ (1.78): in an X-Y twisted common adjacency relation, when the slot of the common gene is connected to edges of other types besides dashed edges, or the missing string can be inserted into other slots, the weight of the edge formed by the twisted common adjacency relation is reduced from 2I to K₂ (1.78).

⑤ The edge weight is K₃ (1.77): the weight of an edge forming a Type-2-II string is K₃ (1.77).

⑥ The edge weight is K₄ (1.76): the weight of an edge forming a Type-2-II string is K₄ (1.76).

⑦ The edge weight is K₅ (1.75):

i. For an edge of weight K₃ (1.77) forming a Type-2-II string, if the missing string or the slot of the common gene lies within the range of a slot-preemption common adjacency and the missing string is a node of a supplement-I type edge, the weight is reduced from K₃ (1.77) to K₅ (1.75);

ii. When a missing string shares two common genes with two consecutive X-X or Y-Y twisted common adjacencies, the weight of the edges of the common adjacency relation between the missing string and the two common genes is K₅ (1.75).

⑧ The edge weight is K₆ (0.19): the weight of an edge forming a Type-3-II string in an X-X or Y-Y twisted common adjacency relation is K₆ (0.19).

⑨ The edge weight is K₇ (0.18): the weight of an edge forming a Type-3-I string is K₇ (0.18).

⑩ The edge weight is K₈ (0.1):

i. For the transformation edge of an X-X or Y-Y twisted common adjacency relation, when X-X/Y-Y is adjacent to common genes on both sides: if the slot of an inserted common gene has an edge forming an X-Y twisted common adjacency relation, or an edge formed by a missing string other than those in the X-X or Y-Y twisted common adjacency relation, the weight of the transformation edge is reduced from K₀ (1.9) to K₈ (0.1), and the weight of the Type-3-I edge formed by a missing string in the transformation edge and another missing string is likewise reduced from K₇ (0.18) to K₈ (0.1). When the weights of the transformation edges of the X-X or Y-Y twisted common adjacency relation can all be K₀ (1.9), one of them is selected at random and its weight is reduced to K₈ (0.1), and the weight of the Type-3-I edge formed by a missing string in the transformation edge and another missing string is likewise reduced from K₇ (0.18) to K₈ (0.1).

ii. When X-X/Y-Y is adjacent to only one common gene, the weight of the transformation edge is K₁ (1.8), and the weight of the Type-3-I edge formed by a missing string in the transformation edge and another missing string is reduced from K₇ (0.18) to K₈ (0.1).
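As a quick sanity check, the illustrative values quoted in parentheses above (I = 1) can be tested against the stated constraints. The sketch below is only an illustration; the names `weights_are_valid` and `edge_weight` are hypothetical and do not come from the original text. `edge_weight` encodes the stated rule that an edge qualifying for several categories takes the smaller weight.

```python
from fractions import Fraction

# Illustrative values quoted in the text (I = 1). Fractions avoid float
# rounding when checking the strict inequalities.
I = Fraction(1)
K = [Fraction(s) for s in
     ("1.9", "1.8", "1.78", "1.77", "1.76", "1.75", "0.19", "0.18", "0.1")]


def weights_are_valid(I, K):
    """Check the constraints stated for I and K0..K8."""
    K0, K1, K2, K3, K4, K5, K6, K7, K8 = K
    # Pairs (2I, K0), (K0, K1), ..., (K7, K8) must be strictly decreasing.
    strictly_decreasing = all(a > b for a, b in zip([2 * I] + K, K))
    return (
        I > 0
        and strictly_decreasing          # K8 < K7 < ... < K0 < 2I
        and K1 + K5 > K3
        and K5 + K7 > K1 + K8
        and 2 * K8 > K6
    )


def edge_weight(applicable):
    """When an edge qualifies for several categories, the smaller weight wins."""
    return min(applicable)
```

With these values the check passes; for instance K₅ + K₇ = 1.93 exceeds K₁ + K₈ = 1.9, and 2K₈ = 0.2 exceeds K₆ = 0.19.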

Preferably, in the genome fragment filling method based on fragment contigs, step (6) searches for Type-2 and Type-3 strings, processes the common adjacency relations related to contradictory common genes, and executes the Type-2&3 string insertion algorithm. The detailed process of the Type-2&3 string insertion algorithm is as follows:

The arrangements A and B are scanned sequentially from left to right.

① Search for the contradictory common genes in A and B. By definition, a contradictory common gene must be present in a node in both A and B, so all contradictory common genes can be found most quickly by searching whichever of A and B has fewer missing strings.

② Construct the weighted general graph of the matching relations between A and B:

i. Initialize the nodes. Nodes of every type that establishes a common adjacency are added for each arrangement. The node types are nodes containing common genes and missing strings, nodes containing only missing strings, and nodes containing only common genes. When a missing gene string lies in two X-Y twisted common adjacency relations at the same time, or in one X-Y twisted common adjacency relation and one slot-preemption common adjacency relation at the same time, the node consists of the missing string, the two common genes, and the corresponding slots. The nodes form two sets, Π_A and Π_B. Find the input arrangement with fewer missing gene strings and assume it is A. Scan the missing strings in arrangement A from left to right. If a missing string belongs to an X-Y twisted common adjacency relation, the missing string and its one or two adjacent contradictory common genes form a node in Π_A, and the missing string and the adjacent contradictory common gene located in arrangement B in that common adjacency relation form a node in Π_B. If two missing strings separated by a slot form an X-X twisted common adjacency relation, the two missing strings are added to Π_A as nodes, wherein if a common gene adjacent to a missing string has a slot in arrangement B, that adjacent common gene is placed in the corresponding missing-string node. If a missing string can be matched with a common gene in arrangement B to form a Type-2-II or Type-2-III string, the missing string forms a node in Π_A and the corresponding common gene in B forms a node in Π_B. If a missing string belongs to a slot-preemption common adjacency relation, nodes are formed in Π_A and Π_B, respectively, from the missing strings in arrangements A and B that belong to that common adjacency relation and the contradictory common genes.
Then scan arrangement B from left to right and process the remaining missing strings in B in the same way; all nodes in B related to contradictory common genes have already been initialized while processing A.

ii. Initialize the edges. Edges are established between nodes according to the type of the common adjacency relation. If the nodes form an X-Y, X-X, or Y-Y twisted common adjacency relation or a Type-3-I common adjacency relation, the edge connects the corresponding slots of the two missing gene strings in the nodes; otherwise, one end of the edge connects to the missing string and the other end connects to the slot of the common gene. The supplement-I type edges and supplement-II type edges associated with slot-preemption common adjacency relations are specially marked.

iii. Assign each edge the corresponding weight according to its weight category.

③ Use the weighted blossom algorithm on the graph to obtain the maximum-matching edge set Θ₁, insert the missing strings at the corresponding positions, lock them, and update the missing string set and the two arrangements. Note that insertions for the X-X/Y-Y common adjacency relations in Type-3-II strings are performed last, after first checking whether there are released slots on the left and right sides of the adjacent common genes.

④ Perform maximum weight matching on the supplement-I type edges according to their weights to obtain the edge set Θ₂, insert the missing strings at the corresponding positions, lock them, and update the missing string set and the two arrangements.

⑤ Select the remaining supplement-II type edges to obtain the edge set Θ₃, insert the missing strings at the corresponding positions, lock them, and update the missing string set and the two arrangements.

⑥ Insert the remaining missing strings at the end of each arrangement, separated by slots, to form the final arrangements A* and B*.
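The two matching phases in steps ③ and ④ can be illustrated with a small sketch. It is not the patented implementation: the exhaustive search below is a deliberately simple stand-in for the weighted blossom (flower tree) algorithm and is only practical for tiny graphs, and the node names, edge lists, and function names (`max_weight_matching`, `phased_selection`) are hypothetical. Step ⑤ and the actual insertion and locking of missing strings are not modeled.

```python
def max_weight_matching(edges, blocked=frozenset()):
    """Exhaustive maximum-weight matching on a general graph.

    edges: list of (u, v, weight). blocked: nodes that may not be matched.
    Exponential time: a stand-in for the weighted blossom algorithm, usable
    only on tiny illustrative graphs.
    Returns (total_weight, list_of_chosen_pairs).
    """
    def best(i, used):
        if i == len(edges):
            return 0, []
        # Option 1: skip edge i.
        total, picked = best(i + 1, used)
        u, v, w = edges[i]
        # Option 2: take edge i when both endpoints are free.
        if u not in used and v not in used:
            t2, p2 = best(i + 1, used | {u, v})
            if t2 + w > total:
                total, picked = t2 + w, [(u, v)] + p2
        return total, picked

    return best(0, frozenset(blocked))


def phased_selection(main_edges, supplement_i_edges):
    """Steps 3 and 4: match the main edges first, lock the matched nodes,
    then match supplement-I edges among the nodes that are still free."""
    _, theta1 = max_weight_matching(main_edges)
    locked = {n for e in theta1 for n in e}
    _, theta2 = max_weight_matching(supplement_i_edges, blocked=locked)
    return theta1, theta2
```

For the path a-b-c-d with weights 1.8, 2.0, 1.8, a greedy choice of the heaviest edge b-c yields a total of 2.0, whereas the maximum-weight matching selects a-b and c-d for a total of 3.6, which is why a matching algorithm is used rather than picking edges by weight alone.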

Fig. 9 is a schematic structural diagram of a genome fragment filling apparatus according to an embodiment of the present invention. As shown in Fig. 9, the embodiment provides a genome fragment filling apparatus including an input unit 1001, an initialization unit 1002, a classification unit 1003, a merging unit 1004, a Type-1 string insertion unit 1005, a no-slot-Type-3-II string insertion unit 1006, a Type-2&3 unit 1007, a remaining missing gene insertion unit 1008, and an output unit 1009, where:

the input unit 1001 is configured to obtain the two input genome arrangements based on fragment contig sets; the initialization unit 1002 is configured to calculate the missing genes; the classification unit 1003 is configured to classify the maximum missing gene strings; the merging unit 1004 is configured to merge the maximum missing gene strings that meet the conditions; the Type-1 string insertion unit 1005 is configured to search for Type-1 strings and execute the Type-1 string insertion algorithm; the no-slot-Type-3-II string insertion unit 1006 is configured to search for slot-free Type-3-II strings and execute the no-slot-Type-3-II string insertion algorithm; the Type-2&3 unit 1007 is configured to search for Type-2 and Type-3 strings, process the common adjacency relations related to contradictory common genes, and execute the Type-2&3 string insertion algorithm; the remaining missing gene insertion unit 1008 is configured to insert all remaining missing genes at the end of each arrangement; and the output unit 1009 is configured to output the filled genomes.

The genome segment filling device based on the segment contig provided by the embodiment of the invention can improve the efficiency and accuracy of genome segment filling by initializing, classifying and combining the input gene arrangement, then searching various types of strings and inserting the strings into a proper position.

The genome fragment filling apparatus based on the fragment contig provided by the embodiment of the present invention can be specifically used for executing the processing procedures of the above method embodiments, and the functions thereof are not described herein again, and reference can be made to the detailed description of the above method embodiments.

Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 10, the electronic device includes a processor 1101, a memory 1102, and a bus 1103;

the processor 1101 and the memory 1102 communicate with each other through the bus 1103;

the processor 1101 is configured to call the program instructions in the memory 1102 to perform the methods provided by the above-mentioned method embodiments, for example, including: calculating to obtain a deletion gene; classifying the maximum deletion gene string; merging the maximum deletion gene strings meeting the conditions; searching a Type-1 string and executing a Type-1 string insertion algorithm; searching a Type-3-II string without slot, and executing a no-slot-Type-3-II string insertion algorithm; searching Type-2 and Type-3 Type strings, processing the public adjacency relation related to the contradictory public genes, and executing a Type-2&3 string insertion algorithm; inserting all remaining deletion genes to the end of each permutation; finally outputting the filled gene.

The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-mentioned method embodiments, for example, comprising: calculating to obtain a deletion gene; classifying the maximum deletion gene string; merging the maximum deletion gene strings meeting the conditions; searching a Type-1 string and executing a Type-1 string insertion algorithm; searching a Type-3-II string without slot, and executing a no-slot-Type-3-II string insertion algorithm; searching Type-2 and Type-3 Type strings, processing the public adjacency relation related to the contradictory public genes, and executing a Type-2&3 string insertion algorithm; inserting all remaining deletion genes to the end of each permutation; finally outputting the filled gene.

The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: calculating to obtain a deletion gene; classifying the maximum deletion gene string; merging the maximum deletion gene strings meeting the conditions; searching a Type-1 string and executing a Type-1 string insertion algorithm; searching a Type-3-II string without slot, and executing a no-slot-Type-3-II string insertion algorithm; searching Type-2 and Type-3 Type strings, processing the public adjacency relation related to the contradictory public genes, and executing a Type-2&3 string insertion algorithm; inserting all remaining deletion genes to the end of each permutation; finally outputting the filled gene.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The above-described embodiments of the electronic device and the like are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM, RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Although the embodiments of the present invention have been described above, the above descriptions are only for the convenience of understanding the present invention, and are not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A genome fragment filling method based on fragment contigs, comprising the steps of:

calculating to obtain a deletion gene;

classifying the maximum deletion gene string;

merging the maximum deletion gene strings meeting the conditions;

inserting all remaining deletion genes at the end of each arrangement.

2. The genome fragment filling method based on fragment contigs according to claim 1, wherein, in classifying the maximum missing gene strings, the maximum missing gene strings composed of elements of X and Y in the optimal solution, each with string length n (that is, each string consists of n missing genes), are of the following types: Type-1; Type-1-I; Type-1-II; Type-2; Type-2-I; Type-2-II; Type-2-III; Type-3; Type-3-I; Type-3-II.

3. The method for filling genome segments based on segment contigs as claimed in claim 1, wherein the maximum missing gene strings meeting the condition are merged by removing the slots between the consecutive maximum missing gene strings (the outermost slots remain unchanged).

4. The genome fragment filling method based on fragment contigs according to claim 1, wherein, in searching for Type-2 and Type-3 strings, processing the common adjacency relations related to contradictory common genes, and executing the Type-2&3 string insertion algorithm: when searching for missing strings of Type-2, if the adjacency matching of each arrangement is performed according to the original adjacencies, then, because there is a slot between a contradictory common gene and a missing string, inserting a missing gene into that slot may destroy the original adjacency on which the match was based, so the common adjacency relations related to the contradictory common genes need to be processed at the same time to avoid erroneous insertion results caused by contradictory matches; therefore, the adjacency matches between contradictory common genes and maximum missing gene strings are first classified into: (1) twisted common adjacency relations; (2) slot-preemption common adjacency relations.

5. The genome fragment filling method based on fragment contigs according to claim 1, wherein, in searching for Type-2 and Type-3 strings, processing the common adjacency relations related to contradictory common genes, and executing the Type-2&3 string insertion algorithm: because of the existence of contradictory common genes, the ordinary bipartite-graph maximum matching used in the one-sided fragment filling problem cannot obtain the correct result; the Type-2&3 string insertion algorithm obtains the maximum adjacency matching by establishing a weighted maximum matching model on a general graph and applying the idea of the weighted blossom algorithm, inserts the missing strings according to the matching, and obtains all Type-2 and Type-3 strings in a certain optimal solution; to obtain the optimal number of adjacencies using the weighted blossom algorithm, the following are determined: (1) the structure of the nodes; (2) the rules for establishing edges between nodes; (3) the weights of the edges.

6. The genome fragment filling method based on fragment contigs according to claim 5, wherein, in the weighted general graph, the rules for establishing edges between nodes are divided into five categories:

(1) X-Y, X-X, Y-Y twisted common adjacency;

(2) the common adjacency relation of the Type-2-II and Type-2-III strings;

(3) transformation edges of X-X, Y-Y twisted common adjacency;

(4) supplement-I type edges;

(5) supplement-II type edges.

7. The method of claim 5, wherein the edge weight variables are I, K₀, K₁, K₂, K₃, K₄, K₅, K₆, K₇, K₈, which satisfy I > 0, K₈ < K₇ < K₆ < K₅ < K₄ < K₃ < K₂ < K₁ < K₀ < 2I, K₁ + K₅ > K₃, K₅ + K₇ > K₁ + K₈, and 2K₈ > K₆;

the weights are summarized as follows, and if an edge satisfies several weight conditions, the smaller weight is taken as the edge weight; for clarity of illustration, each variable may be represented by a specific value in the figures;

the edge weights are specifically classified into the following categories: (1) the edge weight is 2I (2); (2) the edge weight is K₀ (1.9); (3) the edge weight is K₁ (1.8); (4) the edge weight is K₂ (1.78); (5) the edge weight is K₃ (1.77); (6) the edge weight is K₄ (1.76); (7) the edge weight is K₅ (1.75); (8) the edge weight is K₆ (0.19); (9) the edge weight is K₇ (0.18); (10) the edge weight is K₈ (0.1).

8. An apparatus for genome fragment filling based on fragment contigs, comprising:

an input unit, configured to obtain two genome arrangements based on fragment contig sets;

a classification unit, configured to classify the maximum deletion gene strings;

an output unit, configured to output the two filled genome arrangements.

9. An electronic device for genome fragment filling based on fragment contigs, comprising: a processor, a memory, and a bus, wherein,

the processor and the memory are communicated with each other through the bus;

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 7.

10. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 7.