CN113889186A - Double-sided genome fragment filling method and device containing repeated genes based on fragment contig - Google Patents


Publication number
CN113889186A
Authority
CN
China
Prior art keywords
gene
correlation
genes
string
deletion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111310669.7A
Other languages
Chinese (zh)
Inventor
柳楠
李胜华
朱永琦
崔晓宇
李晓峰
任燕
卞忠勇
李洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jianzhu University
Original Assignee
Shandong Jianzhu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jianzhu University
Priority to CN202111310669.7A
Publication of CN113889186A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a fragment-contig-based method and device for filling double-sided genome fragments containing repeated genes. The method mainly comprises the following steps: computing the missing gene sets; classifying the maximal missing gene strings into three types, n-Type-1, n-Type-2 and n-Type-3 strings, where n is the length of the missing gene string; classifying the relationship between each maximal missing gene string and the repeated genes into three types, no correlation, semi-correlation and correlation, where no correlation means that the insertion string is not adjacent to a repeated gene and the insertion position is not related to a repeated gene, semi-correlation means that the insertion string is not adjacent to a repeated gene and the insertion position may be related to a repeated gene but alternative insertion positions exist, and correlation means that the insertion string is adjacent to a repeated gene or the insertion position is completely related to a repeated gene; finding the insertion strings of the no-correlation and semi-correlation types and executing the insertion algorithm for those strings; and constructing an auxiliary graph for the strings of the correlation type and inserting them using a backtracking algorithm and a maximum matching algorithm. The filling method is fast and efficient. Because the invention performs genome filling based on fragment contigs, it can improve filling accuracy and completion rate, and it is general and practical.

Description

Double-sided genome fragment filling method and device containing repeated genes based on fragment contig
Technical Field
The invention relates to a fragment-contig-based method and device for filling double-sided genome fragments containing repeated genes, and belongs to the technical field of genetic engineering.
Background
The Human Genome Project was proposed as early as the 1980s; its research content covers the construction of genetic maps, physical maps, sequence maps and transcription maps. In recent years, whole-genome sequencing has attracted considerable attention. Although sequencing has developed through first-, second- and third-generation technologies, with large gains in scale and speed and a large drop in cost, a complete genome sequence is still difficult to obtain by sequencing alone. The whole genome sequence is obtained by using computational techniques, namely assembly algorithms, to join short gene fragments into larger ones. Real genome data consist of a series of contiguous fragment contigs (contigs); a larger gene structure, the genome framework (scaffold), is obtained by determining the order of all fragment contigs in the genome and the distance between them.
Computational genomics is the discipline that uses computer and information technology to analyze, model and compute on genome research data in order to obtain biological information. The genome fragment filling problem is an emerging combinatorial optimization problem in computational genomics: the missing genes are filled into incomplete gene fragments, and the difference between the filled fragments is then computed. The measures of difference between fragments include the genome recombination distance, the genome sampling distance, the breakpoint distance, the minimum common string partition distance, the maximum common adjacency distance, and others. Double-sided genome fragment filling based on fragment contigs is a more general form of the earlier double-sided genome fragment filling based on ordinary sequences. Liu et al. designed an approximation algorithm with a greedy strategy by classifying fragment breakpoints and missing string types, reaching an approximation ratio of 1.5; Ma et al. further improved the approximation ratio to 1.4 by constructing a 5-connected fluxless graph and a 7-connected fluxless graph to find the largest independent set. However, these two algorithms only solve double-sided genome fragment filling for ordinary sequences and cannot be applied to double-sided genome fragment filling based on fragment contigs. More recently, Li et al. proposed a double-sided genome fragment filling algorithm based on fragment contigs; although it covers only one class of instances, it provides a useful reference for the field. Because genes are numerous and variable, that algorithm cannot be applied to fragment-contig-based double-sided genome fragment filling when repeated genes are present.
Therefore, how to solve the fragment-contig-based double-sided genome fragment filling problem with repeated genes, and how to design an approximation algorithm for it, has become an active problem in this technical field.
Disclosure of Invention
The invention aims to address the shortcomings of the prior art by providing a new fragment-contig-based double-sided genome fragment filling algorithm that handles repeated genes. Through extensive research and testing, the invention provides a double-sided genome fragment filling method based on a backtracking algorithm and a maximum matching algorithm, designs a new approximation algorithm that can obtain a more accurate genome sequence, and also provides a device implementing the technique, which benefits further research and development in genomics.
Specifically, in the first aspect, the embodiment of the present invention provides a double-sided genome fragment filling method containing repeated genes based on fragment contig, comprising the following steps:
Step 1: compute the missing gene sets;
The elements of sequence A and sequence B are compared with each other to obtain the set X of genes missing from sequence A and the set Y of genes missing from sequence B.
Step 2: classify the maximal missing gene strings;
The strings are classified into three types according to the number of common adjacencies generated when they are combined with the elements of the gene sample sequence:
n-Type-1: consists of n missing genes and forms n + 1 common adjacencies after insertion.
n-Type-2: consists of n missing genes and forms n common adjacencies after insertion.
n-Type-3: consists of n missing genes and forms n - 1 common adjacencies after insertion.
Step 3: determine the relationship between the maximal missing gene string and the repeated genes;
The relationship between a missing gene string and the repeated genes is classified into three types according to the position of the string in the sequence and its possible insertion positions:
No correlation: the maximal missing string does not involve a repeated gene, and its insertion position does not involve a repeated gene.
Semi-correlation: the maximal missing string does not involve a repeated gene, and its insertion position may involve a repeated gene, but alternative insertion positions exist.
Correlation: the maximal missing string involves a repeated gene, or its insertion position is completely tied to a repeated gene.
Step 4: first insert the Type-1 missing strings that have a no-correlation or semi-correlation relationship with the repeated genes, then insert the remaining no-correlation and semi-correlation missing strings by constructing a bipartite graph and using a maximum matching method;
Based on the relationship between the missing strings and the repeated genes, the Type-1 missing strings with a no-correlation or semi-correlation relationship are inserted first.
After the Type-1 missing strings are inserted, a maximum matching is found in a bipartite graph and used to insert the remaining no-correlation and semi-correlation missing strings.
Step 5: update the sequences, find the missing genes that have a correlation relationship with the repeated genes, and construct an auxiliary graph;
Based on the sequences updated by the previous insertions, the missing strings with a correlation relationship are found and an auxiliary graph is constructed:
When the auxiliary graph is constructed, only a single gene is considered at a time. If the gene has an insertion position, the missing gene is connected to the slot of that insertion position by a solid line; if the gene has no insertion position, its adjacent missing genes are found and connected to it by a dashed line.
Step 6: complete the insertion of the missing genes with a correlation relationship by using a backtracking algorithm and a maximum matching algorithm;
Starting from the insertion positions: if an insertion position is connected to only one solid edge, the insertion is determined directly, and the genes connected by dashed edges are greedily merged and inserted together; if an insertion position is connected to several solid edges, the algorithm backtracks over the missing genes and selects the missing gene that has fewer solid edges.
Step 7: insert all the remaining Type-3 strings while ensuring that existing common adjacencies are not destroyed;
Provided that no existing adjacency is destroyed, an n-Type-3 string can be inserted into any slot; without loss of generality, strings of this type are inserted into the rightmost slot of the gene sequence.
In a second aspect, the embodiments of the present invention provide a double-sided genome filling apparatus containing repeated genes based on segment contig, including the following:
an input unit, for receiving two fragment-contig-based genome sequences containing repeated genes, both of which are incomplete;
an initialization unit, for traversing the input sequences to obtain the missing gene sets;
a classification unit, for classifying the gene elements in the gene sample sequences;
an identification unit, for determining the relationship between the maximal missing gene string and the repeated genes;
a no-correlation and semi-correlation unit, for inserting the remaining no-correlation and semi-correlation missing strings by constructing a bipartite graph and using a maximum matching method;
a correlation unit, for finding the missing genes that have a correlation relationship with the repeated genes, constructing an auxiliary graph, and completing the insertion of those genes by using a backtracking algorithm and a maximum matching algorithm;
a remaining-missing-gene insertion unit, for inserting all the remaining n-Type-3 strings into the gene sequences while ensuring that no existing adjacency is destroyed;
an output unit, for outputting the two genome sequences obtained after filling.
In a third aspect, an embodiment of the present invention provides a server including a processor, a memory, and a bus, wherein:
the processor and the memory communicate with each other through the bus;
the memory stores program instructions for execution by the processor;
the processor, when executing the program instructions, is capable of performing the fragment-contig-based double-sided genome fragment filling method containing repeated genes of the first aspect.
With the fragment-contig-based double-sided genome fragment filling method and device provided by the embodiments of the invention, two gene sequences that are both incomplete are obtained, each missing gene in the gene sample is classified according to the adjacencies it forms with the gene sequence, and the relationship between each missing string and the repeated genes is then identified, yielding a fragment-contig-based double-sided genome fragment filling method that handles repeated genes. Because the method is based on fragment contigs, a missing gene cannot be inserted at an arbitrary position but only at the two ends of a fragment contig, and the more complex insertion relationships that arise when repeated genes are present must be taken into account. This preserves the existing gene structure in the gene sequence and improves the efficiency and accuracy of gene filling.
Drawings
To illustrate the feasibility of the solution, the drawings used in the technical description are briefly introduced below.
FIG. 1 is a flow chart of the fragment-contig-based double-sided genome fragment filling method containing repeated genes according to the present invention;
FIG. 2 is a schematic structural diagram of the fragment-contig-based double-sided genome fragment filling device containing repeated genes according to the second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of the server entity provided in the third embodiment of the present invention.
Detailed Description
First, the concepts of fragment contigs, genome fragment filling, breakpoints and common adjacencies are introduced to aid understanding by one of ordinary skill in the art.
Fragment contig (contig): high-throughput gene sequencing yields a large number of base sequences (reads), which are then assembled by computational techniques into fragment contigs.
Genome fragment filling: the process of designing an algorithm that fills the missing genes into the two ends of the fragment contigs in a genome framework is called genome fragment filling. It is divided into the single-sided and the double-sided genome fragment filling problem according to whether the two gene sample sequences are complete: if only one gene sample sequence has missing genes, the problem is single-sided; if both gene sample sequences have missing genes, the problem is double-sided.
Sequence: given a symbol set Σ and a gene sequence A, let c(A) denote the set of all symbols in A; the gene sample is called a sequence if symbols of Σ may occur multiple times in A.
Permutation: given a symbol set Σ and a gene sequence A, let c(A) denote the set of all symbols in A; the gene sample is called a permutation if each symbol of Σ occurs in A only once.
Common adjacency: let $A = a_1 a_2 \dots a_n$ and $B = b_1 b_2 \dots b_m$ be two gene sequences over $\Sigma$, with $P_A = (a_1 a_2, a_2 a_3, \dots, a_{n-1} a_n)$ and $P_B = (b_1 b_2, b_2 b_3, \dots, b_{m-1} b_m)$. For any pair $a_i a_{i+1}$ in $P_A$ and any pair $b_j b_{j+1}$ in $P_B$, if $a_i a_{i+1} = b_j b_{j+1}$ (or $a_{i+1} a_i = b_j b_{j+1}$), then $a_i a_{i+1}$ and $b_j b_{j+1}$ form a common adjacency.
Breakpoint: let $A = a_1 a_2 \dots a_n$ and $B = b_1 b_2 \dots b_m$ be two gene sequences over $\Sigma$. For any pair $a_i a_{i+1}$ in $A$ and any pair $b_j b_{j+1}$ in $B$, if $a_i a_{i+1} \neq b_j b_{j+1}$ and $a_{i+1} a_i \neq b_j b_{j+1}$, then $a_i a_{i+1}$ and $b_j b_{j+1}$ form a breakpoint relative to each other.
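The two definitions above can be made concrete with a short sketch. It treats an adjacency as an unordered pair and pairs adjacencies one-to-one between the two sequences; the one-to-one pairing and the function names are assumptions made for illustration and are not stated in the source.

```python
from collections import Counter

def adjacencies(seq):
    """Multiset of unordered adjacent pairs, so that ab and ba count as the same adjacency."""
    return Counter(tuple(sorted(pair)) for pair in zip(seq, seq[1:]))

def common_adjacency_count(a, b):
    """Number of adjacencies that can be paired one-to-one between a and b (assumed reading)."""
    return sum((adjacencies(a) & adjacencies(b)).values())

def breakpoint_count(a, b):
    """Adjacencies of a that are left without a partner in b."""
    return max(len(a) - 1, 0) - common_adjacency_count(a, b)
```

For example, `common_adjacency_count(list("abcab"), list("bacb"))` pairs ab, bc and ca once each, leaving the second ab of the first sequence as a breakpoint.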
Common genes: genes present in both gene sample sequences.
Slot: the positions at the two ends of a fragment contig are called slots and are generally denoted by a pair of boundary elements in angle brackets, where $\alpha_i$ and $\beta_i$ are the head and tail elements of fragment contig $C_i$; the open slots at the two ends of the gene sample sequence are denoted $\langle -\infty, \alpha_1 \rangle$ and $\langle \beta_n, +\infty \rangle$.
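As an illustration of the slot notion, the sketch below enumerates the slots of a sequence given as a list of contigs. It assumes that an inner slot is bounded by the tail of one contig and the head of the next; the boundary markers `NEG_INF` and `POS_INF` and the function name are hypothetical.

```python
NEG_INF, POS_INF = "-inf", "+inf"  # hypothetical markers for the two open ends

def slots(contigs):
    """Return the slots of a contig list as (left boundary, right boundary) pairs."""
    heads = [c[0] for c in contigs]   # alpha_i: first gene of contig i
    tails = [c[-1] for c in contigs]  # beta_i: last gene of contig i
    # Pair each tail (or the left open end) with the next head (or the right open end).
    return list(zip([NEG_INF] + tails, heads + [POS_INF]))
```

A sequence with n contigs therefore has n + 1 slots under this reading, including the two open ones.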
No correlation: the maximal missing string does not involve a repeated gene, and its insertion position does not involve a repeated gene.
Semi-correlation: the maximal missing string does not involve a repeated gene, and its insertion position may involve a repeated gene, but alternative insertion positions exist.
Correlation: the maximal missing string involves a repeated gene, or its insertion position is completely tied to a repeated gene.
Maximum deletion gene string: the largest string of consecutive missing genes in the same segment contig.
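One possible reading of this definition can be sketched as follows: the genes missing from one sequence are located in the contigs of the other sequence, and each maximal run of consecutive missing genes inside a single contig is taken as one string. This reading, the treatment of `missing` as a plain set of symbols, and the function name are assumptions.

```python
def maximal_missing_strings(contigs, missing):
    """Collect maximal runs of consecutive missing genes, never crossing a contig boundary."""
    runs = []
    for contig in contigs:
        run = []
        for gene in contig:
            if gene in missing:
                run.append(gene)
            elif run:
                runs.append(run)  # a non-missing gene closes the current run
                run = []
        if run:
            runs.append(run)      # the contig end also closes the run
    return runs
```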
Second, so that those skilled in the art can better understand the present invention, the embodiments are further described below with reference to the accompanying drawings and the detailed description. The embodiments are described in detail for illustration only and not for limitation. All embodiments obtained from the embodiments of the present invention by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Example one
The method implementation of the present invention will be specifically described below with reference to the accompanying drawings and examples.
As shown in FIG. 1, the embodiment of the present invention mainly includes the following steps:
Step 1: compute the missing gene sets;
Step 2: classify the maximal missing gene strings;
Step 3: determine the relationship between the maximal missing gene string and the repeated genes;
Step 4: first insert the Type-1 missing strings that have a no-correlation or semi-correlation relationship with the repeated genes, then insert the remaining no-correlation and semi-correlation missing strings by constructing a bipartite graph and using a maximum matching method;
Step 5: update the sequences, find the missing genes that have a correlation relationship with the repeated genes, and construct an auxiliary graph;
Step 6: complete the insertion of the missing genes with a correlation relationship by using a backtracking algorithm and a maximum matching algorithm;
Step 7: insert all the remaining Type-3 strings while ensuring that existing common adjacencies are not destroyed.
In the above fragment-contig-based double-sided genome fragment filling method containing repeated genes, the missing gene sets of step 1 are computed as follows:
Traverse gene sequence A and gene sequence B from left to right, compute $c(A)$ and $c(B)$, and let $X = c(B) - c(A)$ and $Y = c(A) - c(B)$, which gives the missing gene sets X and Y.
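A minimal sketch of this step is given below. The source writes c(·) as a set; the sketch uses multisets instead, as an assumption, so that a repeated gene with fewer copies in one sequence is also reported as missing. The function name and the representation of a sequence as a list of contigs are illustrative.

```python
from collections import Counter

def missing_gene_sets(contigs_a, contigs_b):
    """Return (X, Y): genes missing from A and genes missing from B, as multisets."""
    count_a = Counter(g for contig in contigs_a for g in contig)
    count_b = Counter(g for contig in contigs_b for g in contig)
    x = count_b - count_a  # X = c(B) - c(A), keeping only positive counts
    y = count_a - count_b  # Y = c(A) - c(B)
    return x, y
```

With plain sets, as in the source text, the same step reduces to `set(count_b) - set(count_a)` and its mirror image.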
In the above double-sided genome fragment filling method containing repeated genes based on fragment contig, the maximum deletion gene string is classified in step 2, and gene elements are classified into the following three types based on the difference in the number of common neighbors generated by the combination of each element in the gene sample sequence:
n-Type-1: consists of n missing genes and forms n + 1 common adjacencies after insertion.
n-Type-2: consists of n missing genes and forms n common adjacencies after insertion.
n-Type-3: consists of n missing genes and forms n - 1 common adjacencies after insertion.
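The three types can be told apart by comparing the number of common adjacencies a string forms with its length. The sketch below takes Type-1 as n + 1, as in the step list of the disclosure; the function name and the returned labels are illustrative.

```python
def classify_missing_string(n, new_adjacencies):
    """Map a string of n missing genes to its type from the adjacencies it forms once inserted."""
    labels = {1: "Type-1", 0: "Type-2", -1: "Type-3"}  # keyed by new_adjacencies - n
    offset = new_adjacencies - n
    if offset not in labels:
        raise ValueError("a string of n genes forms between n - 1 and n + 1 adjacencies")
    return labels[offset]
```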
In the above double-sided genome fragment filling method containing repeat genes based on fragment contig, the relationship between the maximum deletion gene string and the repeat genes is determined in step 3, and based on the relationship between the deletion string and the repeat genes, the deletion string is divided into the following three types:
No correlation: the maximal missing string does not involve a repeated gene, and its insertion position does not involve a repeated gene.
Semi-correlation: the maximal missing string does not involve a repeated gene, and its insertion position may involve a repeated gene, but alternative insertion positions exist.
Correlation: the maximal missing string involves a repeated gene, or its insertion position is completely tied to a repeated gene.
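The three relationships can be sketched as a small decision function. It assumes that "involves a repeated gene" means that the string contains a repeated gene or that a slot boundary is a repeated gene, and that "completely tied" means every candidate slot is affected; both readings, and all names, are assumptions.

```python
def relation_to_repeats(string, candidate_slots, repeated):
    """Classify a missing string given its candidate slots and the set of repeated genes."""
    string_involved = any(gene in repeated for gene in string)
    # A slot is a (left boundary, right boundary) pair; it is affected if either boundary repeats.
    slot_involved = [any(b in repeated for b in slot) for slot in candidate_slots]
    if string_involved or (slot_involved and all(slot_involved)):
        return "correlation"
    if any(slot_involved):
        return "semi-correlation"  # some slots are affected, but an unaffected alternative exists
    return "no correlation"
```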
In the double-sided genome fragment filling method based on fragment contig and containing the repetitive genes, the Type-1 Type deletion string with no correlation and half correlation with the repetitive genes is preferentially inserted in the step 4, and then the remaining deletion strings with no correlation and half correlation are inserted by using a maximum matching method through constructing a bipartite graph, and the specific implementation method is as follows:
Based on the gene types in the no-correlation and semi-correlation relationships, a Type-1 insertion algorithm is executed so that the Type-1 missing genes are inserted first; a maximum matching is then found in a bipartite graph and used to insert the remaining genes in the no-correlation and semi-correlation relationships.
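The maximum matching part of this step can be sketched with the standard augmenting-path method for bipartite graphs. The source does not say which matching algorithm is used, so this choice is an assumption, as are the restriction to one string per slot and the input format (a mapping from each missing string to the slots it may enter).

```python
def maximum_matching(candidates):
    """Match missing strings to slots; candidates maps a string id to its admissible slots."""
    slot_owner = {}  # slot -> string id currently placed there

    def try_assign(string_id, visited):
        for slot in candidates[string_id]:
            if slot in visited:
                continue
            visited.add(slot)
            # Take a free slot, or move the current owner to another slot if possible.
            if slot not in slot_owner or try_assign(slot_owner[slot], visited):
                slot_owner[slot] = string_id
                return True
        return False

    for string_id in candidates:
        try_assign(string_id, set())
    return {string_id: slot for slot, string_id in slot_owner.items()}
```

In the test case below, the string `s1` is first placed in slot 0 and later moved to slot 1 so that `s2`, which can only use slot 0, is also inserted.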
In the double-sided genome fragment filling method based on the fragment contig and containing the repetitive genes, the missing genes which have the correlation with the repetitive genes are searched in the step 5, and the auxiliary map is constructed, and the specific construction method of the auxiliary map is as follows:
Among the genes with a correlation relationship, only a single gene is considered at a time: if the gene has an insertion position, the missing gene is connected to the slot of that insertion position by a solid line; if it has no insertion position, its adjacent missing genes are found and connected to it by a dashed line.
In the auxiliary graph, a gene that has an insertion position may also be connected to its adjacent missing genes by a dashed line.
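A minimal sketch of such an auxiliary graph follows. It stores solid edges (gene to slot) and dashed edges (gene to neighbouring missing gene) as index pairs, and it draws a dashed edge between every pair of neighbouring missing genes, which is one reading of the text. The predicate `can_insert` is a hypothetical stand-in for the test of whether a gene has an insertion position at a slot.

```python
def build_auxiliary_graph(missing_string, slots, can_insert):
    """Return (solid, dashed) edge lists for one string of missing genes."""
    # Solid edges: gene position i -> every slot where that gene can be inserted.
    solid = [(i, slot)
             for i, gene in enumerate(missing_string)
             for slot in slots
             if can_insert(gene, slot)]
    # Dashed edges: consecutive missing genes, identified by position to handle repeats.
    dashed = [(i, i + 1) for i in range(len(missing_string) - 1)]
    return solid, dashed
```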
In the above double-sided genome fragment filling method containing repeated genes based on fragment contig, the insertion of the deleted genes having a correlation relationship is completed by using a backtracking algorithm and a maximum matching algorithm described in step 6, and the specific implementation method is as follows:
In the auxiliary graph obtained above, starting from the insertion positions: if an insertion position is connected to only one solid edge, the insertion is determined directly, and the genes connected by dashed edges are greedily merged and inserted together; if an insertion position is connected to several solid edges, the algorithm backtracks over the missing genes and selects the missing gene that has fewer solid edges.
If a missing gene is connected to dashed edges on both sides and is itself connected to a solid edge, only one of the dashed edges can be selected. If a string of consecutive missing genes in the auxiliary graph is connected by dashed edges and only the genes at its two ends have solid edges, the consecutive missing genes can be merged into one string and only one solid edge is taken.
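The backtracking part can be illustrated with a simplified sketch that works on solid edges only: slots with the fewest candidate genes are decided first, and the search backtracks whenever a gene is already used. It is an exhaustive search meant for small inputs, it ignores the dashed-edge merging rules, and its names and input format are assumptions rather than the procedure of the source.

```python
def backtrack_assign(solid):
    """solid maps each slot to its candidate genes; return the largest slot -> gene assignment found."""
    order = sorted(solid, key=lambda slot: len(solid[slot]))  # single-edge slots come first
    best = {}

    def search(i, used, current):
        nonlocal best
        if len(current) > len(best):
            best = dict(current)
        if i == len(order):
            return
        slot = order[i]
        for gene in solid[slot]:
            if gene not in used:
                used.add(gene)
                current[slot] = gene
                search(i + 1, used, current)
                del current[slot]  # undo the choice and try the next candidate
                used.remove(gene)
        search(i + 1, used, current)  # also consider leaving this slot empty

    search(0, set(), {})
    return best
```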
In the above fragment-contig-based double-sided genome fragment filling method containing repeated genes, all the remaining n-Type-3 strings are inserted in step 7 as follows:
After the insertion of the correlated genes is completed, all the remaining missing genes in the missing gene sets (Type-3 strings) are inserted under a relatively open principle: inserting a Type-3 string of length n is guaranteed to generate n - 1 new common adjacencies. Provided that the original adjacencies are not destroyed, the Type-3 string is inserted into the slot at the leftmost or rightmost end of the gene sequence.
Here, in order to better implement the embodiment of the present invention, the insertion of the n-Type-3 string is completed by uniformly inserting the Type-3 string into the rightmost end of the gene sequence.
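Inserting into the rightmost slot amounts to appending each remaining string after the last contig, which leaves every existing contig, and therefore every existing adjacency, untouched. A minimal sketch, with an illustrative function name and a list-of-contigs representation assumed:

```python
def append_type3_strings(contigs, type3_strings):
    """Place each remaining Type-3 string in the rightmost slot, after the last contig."""
    filled = [list(contig) for contig in contigs]  # copy so the input is not modified
    for string in type3_strings:
        filled.append(list(string))                # each string stays contiguous
    return filled
```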
Thus, the filled gene sequences $A^*$ and $B^*$ are obtained after all the steps.
Example two
FIG. 2 is a schematic structural diagram of the fragment-contig-based double-sided genome fragment filling device containing repeated genes according to the second embodiment of the present invention. The device includes an input unit 1001, an initialization unit 1002, a classification unit 1003, an identification unit 1004, a no-correlation and semi-correlation unit 1005, a correlation unit 1006, a remaining-missing-gene insertion unit 1007, and an output unit 1008.
An input unit 1001 for obtaining two input gene sample sequences; the initialization unit 1002 is configured to traverse the input sequence to obtain a missing gene set; the classification unit 1003 is used for classifying gene elements in the gene sample sequence; the identifying unit 1004 is used for determining the relationship between the maximum deletion gene string and the repetitive genes and dividing the relationship between the deletion string and the repetitive genes into three types; the no correlation and half correlation unit 1005 is used to insert the remaining missing strings without correlation and half correlation using the maximum matching method by constructing a bipartite graph; the correlation unit 1006 is configured to search for a missing gene having a correlation with a repeated gene, construct an auxiliary graph, and complete insertion of the missing gene having a correlation using a backtracking algorithm and a maximum matching algorithm; the residual deletion gene inserting unit 1007 is used for uniformly inserting all residual deletion genes into slots at the rightmost end of a gene sample sequence; the output unit 1008 is used to output the filled genes.
According to the double-sided genome fragment filling device containing the repetitive genes based on the fragment contig, the efficiency and the accuracy of double-sided genome fragment filling containing the repetitive genes based on the fragment contig are improved by sequentially initializing, classifying, identifying, inserting irrelevant and semi-relevant missing gene strings, inserting relevant missing gene strings and inserting Type-3 strings into an input gene sample sequence.
The double-sided genome fragment filling apparatus containing repetitive genes based on the fragment contig provided by the second embodiment of the present invention can be used for executing the processing procedure of the first embodiment of the present invention.
EXAMPLE III
Fig. 3 is a schematic structural diagram of a server entity provided in the third embodiment of the present invention, and as shown in fig. 3, the server includes: a processor (processor)1101, a memory 1102, a bus 1103, and computer programs 1104.
The processor 1101 and the memory 1102 of the third embodiment communicate with each other through the bus 1103; the computer program 1104 is stored in the memory 1102 and can be called and executed by the processor 1101.
The processor 1101 of the third embodiment may be a central processing unit (CPU), which is the most commonly used processor, or an application-specific integrated circuit (ASIC), a digital signal processor (DSP), or the like.
The memory 1102 of the third embodiment may be an internal storage unit of the server, such as the server's memory or hard disk, or an external storage device, such as a removable hard disk or a flash memory card.
The computer program 1104 according to the third embodiment of the present invention implements the double-sided genome fragment filling method including repeated genes based on the fragment contig according to the first embodiment of the present invention, which mainly includes: initializing a gene sample sequence to obtain a deletion gene set; classifying gene elements in the gene sample sequence; determining the relationship between the maximum deletion gene string and the repetitive genes; inserting the remaining missing strings without correlation and half correlation relations by constructing a bipartite graph by using a maximum matching method; searching for a missing gene having a correlation with the repeated gene, constructing an auxiliary graph, and completing the insertion of the missing gene having the correlation by using a backtracking algorithm and a maximum matching algorithm; uniformly inserting all the remaining deletion genes into a slot at the rightmost end of a gene sample sequence; and finally, outputting the filled gene sample sequence.
One of ordinary skill in the art will appreciate that, in practical applications, the functions may be distributed among different module units as required; that is, each module unit may exist independently, or several modules may be integrated into one unit, and the integrated unit may be implemented in hardware or as a software functional unit. The aforementioned computer program may be stored in a computer storage medium such as a ROM, a RAM, or an optical disk. In addition, the names of the functional modules serve only to distinguish them from one another and do not limit the protection scope of the present application. For the specific execution process of each functional module in the system, reference may be made to the corresponding process in the foregoing method embodiment, which is not repeated here.
The three embodiments above describe the present invention in detail with different emphases; where one embodiment does not describe a detail, reference may be made to the detailed description in the other embodiments, which is not repeated here.
The above-described embodiments of the server and the like are only exemplary. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual requirements to implement the present embodiment. One of ordinary skill in the art can understand and implement the embodiments of the present invention without undue experimentation.
From the above description of the embodiments, it is clear to those skilled in the art that the foregoing embodiments can be implemented by software together with a necessary general-purpose hardware platform, and of course can also be implemented by hardware alone. With this understanding, the above technical solutions, or the parts thereof that contribute over the prior art, may be embodied in the form of a software product, which may be stored on a computer-readable storage medium such as a ROM, a RAM, a magnetic disk, or an optical disk, and which includes program instructions for causing a computing device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the various embodiments or portions thereof.
The above examples are only for describing the invention in detail, and are not intended to limit the invention; it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A double-sided genome fragment filling method containing repeated genes based on fragment contigs, characterized by mainly comprising the following steps:
step 1: computing the missing gene sets;
step 2: classifying the maximal missing gene strings;
step 3: determining the relationship between each maximal missing gene string and the repeated genes;
step 4: first inserting the Type-1 missing strings that are uncorrelated or semi-correlated with the repeated genes, and then inserting the remaining uncorrelated and semi-correlated missing strings by constructing a bipartite graph and using a maximum matching method;
step 5: updating the sequence, searching for the missing genes that are correlated with the repeated genes, and constructing an auxiliary graph;
step 6: inserting the correlated missing genes using a backtracking algorithm and a maximum matching algorithm;
step 7: inserting all remaining Type-3 strings while ensuring that existing common adjacencies are not broken.
2. The double-sided genome fragment filling method containing repeated genes based on fragment contigs as claimed in claim 1, wherein in step 1 the missing gene sets X and Y are computed by traversing the two genome sequences, with fragment A and fragment B used as references.
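The computation in claim 2 can be illustrated as a multiset difference. The following is a minimal sketch, assuming that genes are represented as strings and each fragment as a list, and that X collects the genes B holds in excess of A while Y collects the genes A holds in excess of B. The function name `missing_gene_sets` and the direction assigned to X and Y are assumptions made for illustration; the claim text does not fix them.

```python
from collections import Counter

def missing_gene_sets(a, b):
    """Return (X, Y) as multisets of missing genes.

    Assumed convention: X = genes that B has in excess of A,
    Y = genes that A has in excess of B. Repeated genes are
    counted with multiplicity.
    """
    count_a = Counter(a)
    count_b = Counter(b)
    x = count_b - count_a  # Counter subtraction keeps only positive counts
    y = count_a - count_b
    return x, y

# Hypothetical input fragments; "a" is a repeated gene.
A = ["a", "b", "c", "a", "d"]
B = ["a", "c", "e", "a", "a", "b"]
X, Y = missing_gene_sets(A, B)
```

Using `Counter` rather than `set` matters here because the method targets sequences with repeated genes, so one copy of a gene can be missing even when other copies are present.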
3. The double-sided genome fragment filling method containing repeated genes based on fragment contigs as claimed in claim 1, wherein in step 2 the maximal missing strings are classified; in the optimal solution, the maximal missing strings composed of elements of X and Y, where the length of a string is n, i.e., the string consists of n missing genes, are specifically of the following types: n-Type-1 strings; n-Type-2 strings; and n-Type-3 strings.
4. The double-sided genome fragment filling method containing repeated genes based on fragment contigs as claimed in claim 1, wherein the relationship between a maximal missing gene string and the repeated genes determined in step 3 is one of three types: uncorrelated, semi-correlated, and correlated; uncorrelated means that the maximal missing string does not involve a repeated gene and the insertion position does not involve a repeated gene; semi-correlated means that the maximal missing string does not involve a repeated gene and the insertion position may involve a repeated gene, whether or not an alternative insertion position exists; correlated means that the maximal missing string involves a repeated gene or that the insertion position is entirely involved with repeated genes.
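The three relations of claim 4 (no correlation, semi-correlation, correlation) can be expressed as a small decision function. This is only a sketch under stated assumptions: an insertion position is modelled as the pair of genes flanking a slot, "may involve a repeated gene" is read as "exactly one flanking gene is repeated", and "entirely involved" is read as "both flanking genes are repeated". The name `relation` and this reading of the claim are illustrative, not taken from the source.

```python
def relation(missing_string, slot, repeated):
    """Classify a (missing string, slot) pair.

    missing_string: list of genes in the maximal missing string
    slot: (left, right) genes flanking the insertion position (assumed model)
    repeated: set of genes that occur more than once (assumed precomputed)
    """
    string_has_repeat = any(g in repeated for g in missing_string)
    flank_is_repeat = [g in repeated for g in slot]
    # Correlated: the string itself contains a repeated gene,
    # or the slot is flanked only by repeated genes.
    if string_has_repeat or all(flank_is_repeat):
        return "correlated"
    # Semi-correlated: repeat-free string, slot partly flanked by a repeat.
    if any(flank_is_repeat):
        return "semi-correlated"
    return "uncorrelated"

repeated_genes = {"r"}
```

For example, with `repeated_genes` as above, `relation(["x"], ("r", "b"), repeated_genes)` falls in the semi-correlated case because the string is repeat-free but one flank is a repeated gene.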
5. The double-sided genome fragment filling method containing repeated genes based on fragment contigs as claimed in claim 1, wherein in step 4 the Type-1 missing strings that are uncorrelated or semi-correlated with the repeated genes are inserted first, since inserting a Type-1 missing string of length n generates n + 1 adjacencies; a bipartite graph is then constructed and a maximum matching is found, the remaining uncorrelated and semi-correlated missing strings are inserted into slots according to the matching, and each such slot is locked so that no other missing string can be inserted into it.
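The bipartite matching of claim 5 can be sketched with the standard augmenting-path method (Kuhn's algorithm). The sketch assumes the candidate slots for each missing string have already been determined and that each slot accepts at most one string, which models the locking described in the claim. The claim does not say which maximum matching algorithm is used, so this choice, the function name `maximum_matching`, and the input layout are assumptions.

```python
def maximum_matching(candidates, num_slots):
    """Maximum bipartite matching by augmenting paths (Kuhn's algorithm).

    candidates[i] lists the slot indices into which missing string i may go.
    Returns a dict mapping string index -> slot index.
    """
    slot_owner = [None] * num_slots  # a matched slot is locked to one string

    def try_assign(i, visited):
        for s in candidates[i]:
            if s in visited:
                continue
            visited.add(s)
            # Take a free slot, or move the current owner to another slot.
            if slot_owner[s] is None or try_assign(slot_owner[s], visited):
                slot_owner[s] = i
                return True
        return False

    for i in range(len(candidates)):
        try_assign(i, set())
    return {owner: s for s, owner in enumerate(slot_owner) if owner is not None}

# Hypothetical instance: three missing strings, three slots.
matching = maximum_matching([[0, 1], [0], [1, 2]], num_slots=3)
```

In this instance string 0 first takes slot 0, is then moved to slot 1 so that string 1 (which can only use slot 0) can be placed, and string 2 ends up in slot 2. The sketch is unweighted; a version that prefers slots yielding more adjacencies would need a weighted matching instead.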
6. The method of claim 1, wherein in step 5 the sequence is updated, the missing genes that are correlated with the repeated genes are searched for, and an auxiliary graph is constructed in which only a single gene is considered at a time: if the missing gene has an insertion position, it is connected to the slot at that insertion position by a solid line; if it has no insertion position, its adjacent missing genes are searched for and connected to it by dotted lines.
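The auxiliary graph of claim 6 has two kinds of edges, which can be sketched as two edge lists. The claim does not define when a gene "has an insertion position", so the sketch assumes a hypothetical test: gene g fits a slot flanked by (left, right) if g is adjacent to left or to right in the reference sequence. Genes with no such slot get dotted edges to the missing genes they neighbour in the reference. The names `adjacencies` and `build_auxiliary_graph` are likewise illustrative.

```python
def adjacencies(seq):
    """Set of unordered pairs of neighbouring genes in a sequence."""
    return {frozenset(pair) for pair in zip(seq, seq[1:])}

def build_auxiliary_graph(missing, slots, reference):
    """Return (solid, dotted) edge lists.

    solid:  (gene, slot index) when the gene has an insertion position
    dotted: (gene, other missing gene) when it has none
    The insertion test below is an assumption, not the patent's definition.
    """
    ref_adj = adjacencies(reference)
    solid, dotted = [], []
    for g in missing:  # only a single gene is considered at a time
        hits = [i for i, (left, right) in enumerate(slots)
                if frozenset((left, g)) in ref_adj
                or frozenset((g, right)) in ref_adj]
        if hits:
            solid.extend((g, i) for i in hits)
        else:
            dotted.extend((g, h) for h in missing
                          if h != g and frozenset((g, h)) in ref_adj)
    return solid, dotted

# Hypothetical data: x, y, z are missing between a and b.
reference = ["a", "x", "y", "z", "b"]
solid, dotted = build_auxiliary_graph(["x", "y", "z"], [("a", "b")], reference)
```

Here x and z each receive a solid edge to slot 0 because they neighbour a and b in the reference, while y has no insertion position of its own and is linked by dotted edges to x and z.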
7. The double-sided genome fragment filling method containing repeated genes based on fragment contigs as claimed in claim 1, wherein in step 6 the correlated missing genes are inserted using a backtracking algorithm and a maximum matching algorithm: starting from the insertion positions, if an insertion position is connected to only one solid edge, the insertion can be determined directly, and the missing genes connected to it by dotted edges are merged and inserted together; if an insertion position is connected to a plurality of solid edges, the algorithm backtracks over the missing genes and the position is taken by the missing gene with fewer solid edges.
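The backtracking part of claim 7 can be illustrated by a small search over the solid edges. This sketch assumes that each slot receives at most one missing gene, that the goal is to place as many genes as possible, and that genes with fewer solid edges are tried first, which is one way to read the preference stated in the claim. The merging of genes joined by dotted edges is left out, and the name `assign_by_backtracking` is hypothetical.

```python
def assign_by_backtracking(solid):
    """Assign missing genes to slots by backtracking.

    solid: dict mapping gene -> list of slot indices (its solid edges).
    Returns a dict gene -> slot that places as many genes as possible.
    """
    # Genes with fewer solid edges are decided first (assumed heuristic).
    genes = sorted(solid, key=lambda g: len(solid[g]))
    best = {}

    def search(k, used, current):
        nonlocal best
        # Prune: even placing every remaining gene cannot beat the best so far.
        if len(current) + (len(genes) - k) <= len(best):
            return
        if k == len(genes):
            best = dict(current)
            return
        g = genes[k]
        for s in solid[g]:
            if s not in used:
                used.add(s)
                current[g] = s
                search(k + 1, used, current)
                used.remove(s)   # undo the choice (backtrack)
                del current[g]
        search(k + 1, used, current)  # alternative: leave g unplaced

    search(0, set(), {})
    return best

# Hypothetical solid edges: x has a single option, so it is fixed first.
placement = assign_by_backtracking({"x": [0], "y": [0, 1], "z": [1, 2]})
```

A gene with a single solid edge, such as x here, is effectively determined directly, and the remaining genes are resolved around it.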
8. The double-sided genome fragment filling method containing repeated genes based on fragment contigs as claimed in claim 1, wherein in step 7 all remaining missing genes (Type-3 strings) are inserted; a Type-3 string of length n generates n - 1 common adjacencies and can be inserted into any open slot without destroying existing common adjacencies, and the Type-3 strings are inserted into the slot at the rightmost end of the gene sequence.
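The final step of claim 8 is simple to sketch: appending at the right end splits no existing pair of neighbouring genes, so no adjacency already present can be broken, and each appended string contributes its own n - 1 internal adjacencies. The sketch assumes sequences are lists of gene names and that the earlier steps have already identified which strings are Type-3; the function names are illustrative.

```python
from collections import Counter

def adjacency_multiset(seq):
    """Multiset of unordered pairs of neighbouring genes."""
    return Counter(frozenset(pair) for pair in zip(seq, seq[1:]))

def append_type3(seq, type3_strings):
    """Insert every remaining Type-3 string into the rightmost slot."""
    filled = list(seq)  # leave the input sequence unchanged
    for string in type3_strings:
        filled.extend(string)
    return filled

# Hypothetical sequence and Type-3 strings.
A = ["a", "b", "c"]
filled = append_type3(A, [["x", "y", "z"], ["u", "v"]])
# Adjacencies of A that no longer occur after filling (expected: none).
broken = adjacency_multiset(A) - adjacency_multiset(filled)
```

Checking that `broken` is empty confirms, for this input, that the adjacencies of the original sequence all survive the insertion.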
9. A double-sided genome fragment filling device containing repeated genes based on fragment contigs, characterized by comprising the following units:
an input unit: receiving two fragment-contig-based genome sequences containing repeated genes, both of which are incomplete sequences;
an initialization unit: traversing the input sequences to obtain the missing gene sets;
a classification unit: classifying the gene elements in the gene sample sequence;
an identification unit: determining the relationship between each maximal missing gene string and the repeated genes;
an uncorrelated and semi-correlated unit: inserting the remaining uncorrelated and semi-correlated missing strings by constructing a bipartite graph and using a maximum matching method;
a correlated unit: searching for the missing genes that are correlated with the repeated genes, constructing an auxiliary graph, and inserting those genes using a backtracking algorithm and a maximum matching algorithm;
a remaining missing gene insertion unit: inserting all remaining Type-3 strings into the gene sequences while ensuring that existing adjacencies are not broken;
an output unit: outputting the two genome sequences obtained after filling.
10. A server comprising a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor;
the processor, when executing the program instructions, is capable of performing the method of any one of claims 1 to 8.
CN202111310669.7A 2021-11-05 2021-11-05 Double-sided genome fragment filling method and device containing repeated genes based on fragment contig Pending CN113889186A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111310669.7A CN113889186A (en) 2021-11-05 2021-11-05 Double-sided genome fragment filling method and device containing repeated genes based on fragment contig

Publications (1)

Publication Number Publication Date
CN113889186A true CN113889186A (en) 2022-01-04

Family

ID=79016814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111310669.7A Pending CN113889186A (en) 2021-11-05 2021-11-05 Double-sided genome fragment filling method and device containing repeated genes based on fragment contig

Country Status (1)

Country Link
CN (1) CN113889186A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination