CN112397148B - Sequence comparison method, sequence correction method and device thereof - Google Patents

Sequence comparison method, sequence correction method and device thereof Download PDF

Info

Publication number
CN112397148B
CN112397148B CN201910787734.1A CN201910787734A CN112397148B CN 112397148 B CN112397148 B CN 112397148B CN 201910787734 A CN201910787734 A CN 201910787734A CN 112397148 B CN112397148 B CN 112397148B
Authority
CN
China
Prior art keywords
sequence
comparison
coverage
preset
reference sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910787734.1A
Other languages
Chinese (zh)
Other versions
CN112397148A (en
Inventor
胡江
韩悦
汪德鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Hope Group Biotechnology Co ltd
Original Assignee
Wuhan Hope Group Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Hope Group Biotechnology Co ltd filed Critical Wuhan Hope Group Biotechnology Co ltd
Priority to CN201910787734.1A priority Critical patent/CN112397148B/en
Publication of CN112397148A publication Critical patent/CN112397148A/en
Application granted granted Critical
Publication of CN112397148B publication Critical patent/CN112397148B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20Sequence assembly

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a sequence comparison method, a sequence correction method and a sequence correction device, and relates to the technical field of biological information. The sequence alignment method mainly comprises interval alignment and base alignment, and is matched with sequence compression and alignment result optimizing storage means, so that efficient and high-accuracy alignment is realized. The sequence correction method is based on comparison results, and the low-quality area is further corrected through positive sequence scoring and reverse sequence backtracking, so that efficient and high-accuracy comparison is realized.

Description

Sequence comparison method, sequence correction method and device thereof
Technical Field
The invention relates to the technical field of biological information, in particular to a sequence comparison method, a sequence correction method and a sequence correction device.
Background
With the rapid development of high-throughput sequencing technology, it has become an important means for researching biological problems such as species evolution and diagnosing disease causes.
The second generation sequencing technology mainly comprising the Illumina sequencing platform is widely applied in the field of genomics research due to the characteristics of high data output, low cost and high base accuracy. Genome assembly refers to reconstructing an intact genome sequence through a sequencing sequence result, and the accuracy and the integrity of the assembly result have important influences on the aspects of subsequent gene annotation analysis, SNP, structural variation detection and the like. Due to the complex conditions of repeated sequences, heterozygosity and the like, the short fragment sequences generated by adopting the second generation sequencing technology are difficult to assemble to obtain complete genome sequences. Taking the human genome as an example, 5% -10% of undefined complex regions still exist in the human reference genome sequence assembled by the second generation sequencing.
The third generation sequencing technology platform, represented by Pacific Biosciences and Oxford Nanopore Technologies, has the remarkable characteristics of long reading length, can generate long fragment sequences larger than 10k, has no preference on sequencing results of areas with higher GC content, and brings great revolution to the field of genome research. The third generation sequencing technology improves the sequence length, and compared with the second generation sequencing technology (error rate is 0.1%), the third generation sequencing technology has the defect that the base sequence error rate is higher (error rate is 15% -20%), and error information in the sequencing sequence has a great influence on subsequent data analysis. For the third generation sequencing data, the strategy of the overlay-Layout-Consensus (OLC) algorithm is generally adopted for genome assembly, but the complexity of assembly and the resource consumption for calculating the interrelationship between sequences are greatly increased due to the defect of high sequence error rate. The existing genome assembly software aiming at the third generation sequencing data adopts a method of firstly correcting the sequencing data and then assembling the genome or adopts a method of firstly assembling the genome and then correcting the assembly result. Both methods involve mutual alignment of sequences and correct for error information in the sequencing data based on the alignment. In the existing comparison method, a large amount of invalid comparison exists, the accuracy of correcting linkage influence is achieved by the invalid comparison, and meanwhile, the comparison result needs to occupy huge storage space, so that the comparison and correction steps occupy the longest time consumption in genome assembly, and the memory requirement is huge. Taking the three-generation sequencing data of assembled mammalian genome as an example, the existing software consumes tens of thousands of CPU cores, and can not handle the assembly of ultra-large genome, thus seriously impeding the wide application of the three-generation sequencing technology. Therefore, development of a comparison and correction method for targeted processing of long-reading long-sequencing data is needed, so that the calculation resource consumption of the existing method is effectively reduced, the operation efficiency is improved, and the potential of the long-reading long-sequencing technology in the field of genomics research is fully unlocked. Such long-read long sequencing includes, but is not limited to, single-molecule real-time Sequencing (SMRT) techniques and nanopore sequencing techniques.
Disclosure of Invention
The invention aims to provide a sequence alignment method, a sequence correction method and a sequence correction device, which can well reduce invalid alignment number and improve the accuracy of sequence correction.
The invention provides a sequence comparison method, which comprises the following steps: s101, acquiring a sequencing sequence, screening a first sequence in the sequencing sequence according to a first preset data length, and screening a second sequence in the sequencing sequence according to a second preset length; s102, comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence; s103, calculating the coverage of the first comparison sequence to the first reference sequence, screening the first reference sequence and the first comparison sequence based on the coverage, and obtaining a second reference sequence and a second comparison sequence; s104, calculating a comparison path between the second reference sequence and the second comparison sequence, and screening the second reference sequence and the second comparison sequence based on the editing distance to obtain a third reference sequence and a third comparison sequence.
Further, the method includes converting and storing a sequence based on a binary number, the sequence including one or more of the sequencing sequence, the first sequence, the second sequence, the first reference sequence, the first alignment sequence, the second reference sequence, the second alignment sequence, the third reference sequence, and the third alignment sequence;
Preferably, the step of converting and storing the sequence based on binary numbers includes: dividing the sequence into a plurality of base combinations according to a preset group; wherein the preset grouping includes determining four adjacent bases as a base combination; converting the base of the sequence into binary numbers according to a preset conversion relation between the base and the binary numbers; and carrying out bit filling expansion on the combinations of less than four bases in the sequence by adopting a designated binary number to obtain a binary sequence meeting the preset grouping.
Further, the method includes storage optimization of the comparison results: recording the number of the comparison sequence, the comparison direction, the beginning of the comparison interval of the comparison sequence, the end of the comparison interval of the comparison sequence, the number of the reference sequence, the beginning of the comparison interval of the reference sequence and the end of the comparison interval of the reference sequence, and storing the comparison result; preferably, the comparison sequence number difference value, the reference sequence comparison interval length and the comparison sequence comparison interval length difference value are recorded and the comparison result is stored; wherein the alignment sequence comprises one or more of a first alignment sequence, a second alignment sequence and a third alignment sequence, and the reference sequence comprises one or more of a first reference sequence, a second reference sequence and a third reference sequence.
Further, the method for storing the comparison result comprises the following steps: storing according to 4 bytes; using 7-bit directory values per byte, and filling with specific binary numbers less than 7 bits; the remaining 1 bit identifies whether the comparison is terminated using a particular binary number.
Further, in step S103: the coverage comprises window coverage and overall coverage; the window coverage is obtained by dividing the first reference sequence into a plurality of windows according to a preset first base number, and calculating the coverage of the first comparison sequence on each window; the overall coverage is an average coverage of the first reference sequence; the method for screening the first reference sequence and the first comparison sequence based on the screening comprises the steps of determining the first comparison sequence meeting a preset first coverage condition as the second comparison sequence until the overall coverage of the second comparison sequence on the first reference sequence reaches a preset first coverage; the first coverage condition is that the window coverage is smaller than a preset second coverage and smaller than a first preset multiple of the whole coverage; preferably, the coverage calculation and the screening are performed according to the sequence from long to short of the overlapping length of the first comparison sequence and the first reference sequence.
Further, step S103 further includes filtering the chimeric sequences, that is, eliminating the first reference sequence determined as the chimeric sequence and the alignment information thereof, where the chimeric sequence is a sequence in the first reference sequence that meets a preset second coverage condition; the second coverage condition is that at least one window exists, the window coverage of the window is smaller than a preset third coverage, and the window coverage of a specified number of windows adjacent to the window is larger than a preset fourth coverage; or at least one window exists, and the window coverage of the window is smaller than a preset fifth coverage; wherein the fifth coverage threshold is a second preset multiple of the window coverage average of a specified number of windows adjacent to the current window.
Further, step S103 further includes filtering the inclusion sequence, that is, eliminating the first reference sequence determined as the inclusion sequence and the alignment information thereof, where the inclusion sequence is the first reference sequence satisfying the following conditions: at least one first comparison sequence exists, the number of bases from the start of the comparison interval of the first comparison sequence and the first reference sequence to the start of the first reference sequence, and the number of bases from the end of the comparison interval of the first comparison sequence and the first reference sequence to the end of the first reference sequence are all within a preset second number of bases.
Further, in step S104, the following steps are specifically performed in the alignment interval between the second reference sequence and the second alignment sequence: s141, calculating a comparison path between the second reference sequence and the second comparison sequence by using a greedy algorithm, selecting an optimal comparison path and determining an editing distance of the optimal comparison path; s142, selecting a second comparison sequence which is not more than a preset editing distance condition from the second comparison sequences as a third comparison sequence, wherein a corresponding reference sequence is the third reference sequence; the preset editing distance condition is a third preset multiple of the sum of the lengths of the first reference sequence and the second comparison sequence in the comparison interval; s143, the steps S141 to S142 are circulated until the coverage of the third comparison sequence to the third reference sequence reaches the preset sixth coverage.
Preferably, in step S141, the greedy algorithm calculates a constraint line range between a minimum constraint line and a maximum constraint line when subtracting a specified distance value from the maximum extension distance under the constraint of the edit distance of the comparison path.
Preferably, in step S143, the coverage interval is calculated by determining, according to the optimal alignment path, an alignment interval between the third alignment sequence and the third reference sequence by backtracking, searching a start position where the first preset third base number in the alignment interval is completely matched, and a final position where the last preset third base number is completely matched, and determining the start position and the final position as the start position and the final position of the interval.
The invention provides a sequence correction method, which comprises the following steps: s201, comparing a plurality of sequencing fragments to obtain a reference sequence and a comparison sequence; s202, dividing the reference sequence by taking k-mers as a unit, and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments; s203, calculating the scores of all bases at each position by taking a k-mer fragment as a unit from the positive sequence; s204, determining the last base as the highest scoring base at the last position from the reverse sequence, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining a third last base according to the k-mer fragment corresponding to the second last base; and by analogy, obtaining a k-mer score chain; s205, replacing the reference sequence by using a k-mer score chain to obtain a first correction result.
Preferably, the k value in the sequence correction method is 3.
Further, in step S203, according to the formula score (p, b) =max { score (p-1, b) +count k-mer Calculating the score of all bases at each position; wherein score (p, b) is the fraction of bases at the p position of the reference sequence, b is base A, T, G, C or deleted, count k-mer A number of occurrences of the k-mer score chain in the alignment sequence; c is the coverage of the reference sequence by the alignment sequence at the p position of the reference sequence.
Further, the correction method further includes: s206, using a comparison interval sequence of the comparison sequences covering the low-quality sequences in the comparison result of the step S201 as a seed sequence, and correcting the low-quality sequences by using a directed graph algorithm based on the seed sequence; replacing the low quality sequence with the corrected sequence to obtain a second correction result; the low-quality sequence is a sequence which meets all conditions that the number of continuous bases is not less than a preset first value, the average value of the interval is less than a longest sequence of a preset second value, the length of the interval exceeds a preset third value, and the number of continuous high-quality bases in the interval is not greater than the preset first value.
Preferably, the number of the seed sequences in the directed graph algorithm correction does not exceed a preset fifth value, and preferably the seed sequences included in the alignment sequences with the length order at the middle preset fifth value are included.
Further, in step S206, the specific steps of the directed graph algorithm for correction are as follows: s261, optionally drawing a directed graph by taking one seed sequence as a motif; s262, calculating the base scores in the seed sequences according to the score matrix, and updating the directed graph; s263, performing topological sorting on the directed graph; s264, determining the base of each node as the base with the highest occurrence probability on the node according to the topological sequencing result, and obtaining the corrected sequence.
Further, in step S262, the following steps are performed for all the seed sequences: a. initializing a node score according to the formula score (x) =max { score (x-1) } -GP, wherein score (x) is the node score, x is the node index, GP is a gap penalty; setting node score of the node without any predecessor as 0; b. traversing each base of the seed sequence, updating a scoring matrix according to a formula,
wherein x is a node index, y is a base index in the seed sequence, GP is a gap penalty, and M is a mismatch or matching score; c. selecting a predecessor node with the highest score by taking the highest score node with the degree of 0 as an origin, backtracking, and taking the node index as 0 or the base index as 0 as an end point; d. if the base index of the end point is greater than 0, adding a base positioned before the base index of the end point in the seed sequence into a directed graph; and if the base index of the starting point is smaller than the length of the seed sequence, adding the base positioned behind the base index of the starting point in the seed sequence into a directed graph.
Further, before the directed graph algorithm correction is performed, the first correction result is filtered according to any one of the following conditions, that is, the directed graph algorithm correction is not performed on the first correction result satisfying any one of the following conditions: a. the accumulated length of the low-quality sequence exceeds the first correction result of a preset fourth value; b. the number of the seed sequences is smaller than the first correction result of a preset first value; c. and the comparison sequence with the length exceeding a preset fourth value is subjected to the first correction result with the duty ratio exceeding a fourth preset multiple.
Further, the correction method further includes: s207, correcting again by using the seed sequence as the comparison sequence and the second correction result as the reference sequence by using the methods described in the steps S201, S202, S203 and S204 to obtain a third correction result.
Further, in step S201, the comparison method described in the foregoing steps S101, S102, S103, S104 is used for comparison; wherein the reference sequence comprises one or more of the first reference sequence, the second reference sequence, and the third reference sequence, and the alignment sequence comprises one or more of the first alignment sequence, the second alignment sequence, and the third alignment sequence.
The invention provides a sequence comparison device, which comprises: the sequence acquisition module is used for acquiring a sequencing sequence, screening a first sequence from the sequencing sequence according to a first preset data length, and screening a second sequence from the sequencing sequence according to a second preset length; the first comparison module is used for comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence; the second comparison module is used for calculating the coverage of the first comparison sequence to the first reference sequence, screening the first reference sequence and the first comparison sequence based on the coverage, and obtaining a second reference sequence and a second comparison sequence; and the third comparison module is used for calculating a comparison path between the second reference sequence and the second comparison sequence, screening the second reference sequence and the second comparison sequence based on the editing distance, and obtaining a third reference sequence and a third comparison sequence.
The invention provides a sequence correction device, which comprises: the sequence comparison module is used for comparing a plurality of sequencing fragments to obtain a reference sequence and a comparison sequence; the sequence dividing module is used for dividing the reference sequence and the comparison sequence by taking k-mer as a unit to obtain k-mer fragments; the score calculating module is used for calculating the scores of all bases at each position by taking the k-mer fragments as units from the positive sequence; the base determining module is used for determining the last base as the highest scoring base at the last position from the reverse sequence, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining a third last base according to the k-mer fragment corresponding to the second last base; and analogizing to obtain a k-mer score chain; and the correction module is used for replacing the reference sequence by using a k-mer score chain to obtain a first correction result.
The invention provides a sequence comparison method and a device, wherein the method comprises the following steps: s101, acquiring a sequencing sequence, screening a first sequence in the sequencing sequence according to a first preset data length, and screening a second sequence in the sequencing sequence according to a second preset length; s102, comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence; s103, calculating the coverage of the first comparison sequence to the first reference sequence, screening the first reference sequence and the first comparison sequence based on the coverage, and obtaining a second reference sequence and a second comparison sequence; s104, calculating a comparison path between the second reference sequence and the second comparison sequence, and screening the second reference sequence and the second comparison sequence based on the edited distance to obtain a third reference sequence and a third comparison sequence. According to the sequence comparison method, the interval comparison is performed firstly, then the base comparison is performed, and the reference sequence and the comparison sequence are screened for multiple times, so that the ineffective comparison quantity can be reduced, the comparison accuracy can be improved, and the data storage and operation space can be reduced. In addition, the comparison results between the sequencing sequences are provided in the mode, and can be combined with various correction methods for correcting based on the comparison results, so that the sequence correction method is not limited.
The invention provides a sequence correction method and a device, wherein the method comprises the following steps: s201, comparing a plurality of sequencing fragments to obtain a reference sequence and a comparison sequence; s202, dividing the reference sequence by taking k-mers as a unit, and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments; s203, calculating the scores of all bases at each position by taking a k-mer fragment as a unit from the positive sequence; s204, determining the last base as the highest scoring base at the last position from the reverse sequence, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining a third last base according to the k-mer fragment corresponding to the second last base; and by analogy, obtaining a k-mer score chain; s205, replacing the reference sequence by using a k-mer score chain to obtain a first correction result. According to the invention, the sequencing fragments are compared in the sequence correction mode, and then the correction result is obtained based on the k-mer fragments, so that the correction result caused by single base comparison is prevented from having preference, the accuracy of sequence correction is improved, and the processing efficiency of sequence correction is also improved. In addition, the method provided by the invention is based on correction of the sequence alignment result, does not limit the alignment method for obtaining the alignment result, and can be combined with various sequence alignment methods.
It can be understood that the above sequence alignment method and the above sequence alignment method provided by the present invention can be combined to improve the accuracy of sequence alignment.
The sequence is marked by the first, second, third, etc., the steps are numbered by the S101, S102, S201, S202, etc., and the preset conditions or preset values are marked by the first, second, third, etc., which are only for the sake of more clear description. Wherein each sequence is described in the form of a set, meaning that each sequence in the set is operated on. For example, in step S102, the first sequence is aligned with the second sequence, and all sequences in the first sequence are aligned with all sequences in the second sequence in sequence. In another example, in step S202, the reference sequences are divided, the comparison sequences are correspondingly divided, that is, k-mer division is performed on each reference sequence, and all comparison sequences having a comparison relationship with the reference sequences are correspondingly divided.
In addition, the sequence comparison method, the sequence correction method, the device and the equipment thereof provided by the invention are suitable for processing long-reading long-sequencing results. The source of the long-reading long-sequencing result is not limited, and the source is not limited to the third generation sequencing technology. The third generation sequencing technology platform is not limited to the sequence, sequence II platform of Pacific Biosciences, minION, promethION, gridION platform of Oxford Nanopore Technologies.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will briefly explain the embodiments of the present invention or the drawings needed in the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a sequence alignment method according to a first embodiment of the present invention;
FIG. 2 is a diagram of a binary format converted sequence according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of binary form comparison result optimized storage according to a third embodiment of the present invention;
FIG. 4 is a flowchart of a sequence correction method according to a fourth embodiment of the present invention;
FIG. 5 is a diagram of searching for a low quality sequence according to a fourth embodiment of the present invention;
fig. 6 is a block diagram of a sequence alignment apparatus according to a seventh embodiment of the present invention;
fig. 7 is a block diagram of a sequence correction device according to an eighth embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It should be noted that the specific values mentioned in the examples, such as the first base number 64, the first coverage 40, the first preset multiple 1.3, the first value 4, and the gap penalty 2, are only illustrative and should not be construed as limiting.
Currently, the data volume required for ultra-large genome assembly often reaches the Tb level, and software in the prior art (such as Falcon, canu) cannot be used at all; the sequence alignment and sequence correction method and device provided by the embodiment of the invention can reduce the invalid alignment and improve the accuracy of sequence correction by considering the problems of large invalid alignment and low accuracy of the existing sequence correction scheme for large and ultra-large genomes. The techniques may be applied in the context of sequence alignment and/or sequence correction of oversized genomes, which generally refers to genomes of sizes exceeding 30GB.
Example 1
The present embodiment provides a sequence alignment method. Referring to the sequence alignment method flowchart shown in fig. 1, the method comprises the steps of:
s101, acquiring a sequencing sequence, screening a first sequence in the sequencing sequence according to a first preset data length, and screening a second sequence in the sequencing sequence according to a second preset length;
S102, comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence;
s103, calculating the coverage of the first comparison sequence to the first reference sequence, screening the first reference sequence and the first comparison sequence based on the coverage, and obtaining a second reference sequence and a second comparison sequence;
s104, calculating a comparison path between the second reference sequence and the second comparison sequence, screening the second reference sequence and the second comparison sequence based on the editing distance, and obtaining a third reference sequence and a third comparison sequence. The third reference sequence and the third alignment sequence are the alignment of the sequencing sequences. The reference sequence and the comparison sequence have comparison relation due to sequence similarity, and the interval with the comparison relation is the comparison interval, namely the comparison result is reflected; wherein the reference sequence comprises one or more of a first reference sequence, a second reference sequence and a third reference sequence, and the alignment sequence comprises one or more of a first alignment sequence, a second alignment sequence and a third alignment sequence.
The sequence alignment method provided by the embodiment of the invention comprises the following steps: acquiring a sequencing sequence, screening a first sequence in the sequencing sequence according to a first preset data length, and screening a second sequence in the sequencing sequence according to a second preset length; comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence; calculating the coverage of the first comparison sequence to the first reference sequence, screening the first reference sequence and the first comparison sequence based on the coverage, and obtaining a second reference sequence and a second comparison sequence; and calculating a comparison path between the second reference sequence and the second comparison sequence, and screening the second reference sequence and the second comparison sequence based on the editing distance to obtain a third reference sequence and a third comparison sequence. According to the embodiment, through the sequence comparison method, the reference sequence and the comparison sequence are subjected to two-step comparison and multiple times of screening, invalid comparison numbers can be reduced, comparison accuracy is improved, and data storage and operation space is reduced. In addition, the embodiment provides the comparison result between the sequencing sequences in the above manner, and can be combined with various correction methods for correcting based on the comparison result, and the sequence correction method is not limited.
In step S101 of the present embodiment, the following is referred to for specific implementation. Selecting a sequencing sequence with the data length being greater than a first preset length (such as the longest 40×sequence length) from the sequencing sequence reads as a first sequence, and outputting the first sequence to a first sequence file; and selecting a sequencing sequence with a second preset length (such as higher than 1 k) from the sequencing sequences as a second sequence, and outputting the second sequence to a second sequence file.
In order to facilitate subsequent multitasking parallel operation, the first sequence file and the second sequence file may be cut, and the number of cut portions is not limited in this embodiment.
In step S102 of this embodiment, based on the first sequence and the second sequence, a section comparison method is provided, and all reads in the first sequence and all reads in the second sequence are compared in pairs to obtain a first reference sequence and a first comparison sequence. This step may be implemented using existing alignment software, such as minimap2. In the prior art, only short sequences (for example, less than 1K) are usually filtered, and all the filtered sequences are aligned, that is, the second sequence is aligned with the second sequence, so that the alignment between the non-first sequences in the second sequence is included, and the sequences are shorter (for example, the longest 40×sequence length is between 1K) and are relatively redundant in number. In this embodiment, only the first sequence and the second sequence are compared, so that redundant comparison can be avoided, the comparison efficiency can be improved, and the negative influence of the redundant comparison on the accuracy of subsequent correction can be avoided.
In step S103 of this embodiment, in order to effectively reduce the data set of the sequences for correction, and simultaneously reduce multiple alignment and alignment noise caused by repeated sequences and alignment errors in the genome, the first reference sequence and the first alignment sequence are screened based on coverage, and a second reference sequence and a second alignment sequence with more accurate alignment results are obtained. Meanwhile, after coverage screening, the coverage level of each obtained second reference sequence is relatively consistent, so that subsequent correction is facilitated.
The coverage includes window coverage, overall coverage; the window coverage is that a first reference sequence is divided into a plurality of windows according to a preset first base number (such as 64), and the coverage of the first reference sequence on each window is calculated; the overall coverage is the average coverage of the first reference sequence. The rule for screening the first reference sequence and the first comparison sequence based on the coverage is as follows: determining the first comparison sequence meeting the preset first coverage condition as a second comparison sequence, wherein the overall coverage of the second comparison sequence to the first reference sequence reaches the preset first coverage (such as 40); the first coverage condition is that the window coverage is smaller than a preset second coverage (such as 50) and smaller than a first preset multiple of the whole coverage (such as 1.3). In order to perform coverage calculation and screening of the first reference sequence and the first comparison sequence more efficiently, the sequence operation of the length between the first comparison sequence and the first reference sequence from long to short is performed. Specific implementations may be referred to as follows.
S131: calculating the length of a comparison interval between the first comparison sequence and the first reference sequence, and arranging the first comparison sequence from long to short according to the comparison interval; and for the first comparison sequences with equal comparison interval lengths, sequencing the comparison sequences according to the numbers from large to small.
S132: for the first reference sequence, the first reference sequence is divided into a plurality of windows according to the length of 64 bases, and the window coverage and the whole coverage are initialized to 0.
S133: and for each window of the current first reference sequence, sequentially searching the first comparison sequence according to the S131 arrangement order. If the window coverage of the window is smaller than the preset second coverage 50 and is lower than the first preset multiple 1.3 of the total coverage of the current first reference sequence, the first comparison sequence is considered to be effective, the first comparison sequence is output, and the window coverage of the window corresponding to the first comparison sequence is added with 1; otherwise, the first comparison sequence is redundant and discarded. Until the current overall coverage of the first reference sequence reaches the preset first coverage 40, which means that the first reference sequence has enough first comparison sequences, and all subsequent first comparison sequences are discarded.
Before step S103, the chimeric sequences and the inclusion sequences in the first reference sequence are further filtered, respectively. And filtering to further remove the comparison results of low reliability and redundancy in the first reference sequence and the first comparison sequence.
The chimeric sequence is a sequence which satisfies a preset second coverage condition in the first reference sequence; the second coverage condition is that at least one window exists, the window coverage of the window is smaller than a preset third coverage (such as 3), and the window coverage of the specified number of windows adjacent to the window is larger than a preset fourth coverage (such as 20); or at least one window exists, and the window coverage of the window is smaller than the preset fifth coverage; wherein the fifth coverage threshold is a second preset multiple (e.g., 1/5) of the window coverage average for a specified number of windows adjacent to the current window.
Specific implementations can be referred to as follows: if for any of the first reference sequences the window coverage of a window is less than the third coverage 3 and the window coverage of a nearby adjacent plurality of windows is greater than the fourth coverage 20; or the window coverage of a certain window is smaller than a second preset multiple 1/5 of the average value of the window coverage of a plurality of adjacent windows; the first reference sequence is defined as a chimeric sequence and the first reference sequence and alignment thereof are discarded.
The inclusion sequence is a first reference sequence that satisfies the following condition: at least one first comparison sequence exists, and the number of bases from the beginning of the comparison interval of the first comparison sequence and the first reference sequence to the beginning of the first reference sequence, and the number of bases from the ending of the comparison interval of the first comparison sequence and the first reference sequence to the ending of the first reference sequence are all within a range of a preset second number of bases (such as 300).
Specific implementations can be referred to as follows: if the starting point of the alignment interval is within 300bp from the starting point of the first reference sequence and the end point of the alignment interval is also within 300bp from the end point of the first reference sequence for any one of the first reference sequences, the first reference sequence is defined as a contained sequence, and the first reference sequence and the alignment information thereof are discarded.
In step S104 of this embodiment, more accurate base alignment is further performed in the alignment interval of the second reference sequence and the second alignment sequence based on the interval alignment of step S102 and coverage screening of step S103. Base alignment is performed on the interval alignment results, rather than directly, because more efficient alignment can be performed based on less but more accurate data. Through base comparison, the credible comparison information can be further reserved, the unreliable comparison information can be removed, the redundant comparison information can be removed, the comparison result can be further reduced, and the comparison precision can be improved.
The embodiment provides a specific implementation mode of base alignment between the second reference sequence and the second alignment sequence, and the following steps are executed in the alignment interval of the second reference sequence and the second alignment sequence:
S141, calculating a comparison path between the second reference sequence and the second comparison sequence by using a greedy algorithm, selecting an optimal comparison path and determining an editing distance of the optimal comparison path;
s142, selecting a sequence which is not more than a preset editing distance condition from the second comparison sequences as a third comparison sequence, wherein a corresponding reference sequence is the third reference sequence; the preset editing distance condition is a third preset multiple (e.g. 0.4) of the sum of the lengths of the first reference sequence and the second comparison sequence in the comparison interval;
s143, the steps S141 to S142 are repeated until the coverage of the third reference sequence by the third comparison sequence reaches a preset sixth coverage (e.g. 90).
Further, in order to accelerate the calculation speed of the greedy algorithm in step S141, the range of constraint lines during calculation of the greedy algorithm is limited to be between the minimum constraint line and the maximum constraint line when the specified distance value (e.g. 150) is subtracted from the maximum extension distance under the constraint of the edit distance of the comparison path. The reason for this is that for each edit distance d, all constraint lines k are calculated, while most k lines have no meaning for finding the currently optimal base alignment.
Further, in order to calculate the coverage more accurately, the comparison interval of the calculated coverage is determined again as follows: and according to the optimal comparison path, backtracking to determine a comparison interval of the third comparison sequence and the third reference sequence, searching a starting position completely matched with the first preset third base number and a final ending position completely matched with the last preset third base number (e.g. 8) in the comparison interval, and determining the starting position and the ending position as the starting position and the ending position of the interval. The problem of higher error rate of the boundary (i.e. the end point) between the comparison areas can be effectively solved through the step.
Specific implementations can refer to the following steps (1) to (4):
(1) Using a Diff greedy algorithm, at k e[k min ,k max ]And (3) calculating the editing distance of each path, selecting the optimal comparison path and determining the editing distance of the optimal comparison path. Wherein k is min The constraint line is the k line that is the smallest when the maximum value of the maximum extension distance minus the specified distance value 150 is found under the constraint of the current edit distance d. k (k) max The constraint line is the k line that is found the maximum value of the maximum extension distance minus the specified distance value 150 under the constraint of the current edit distance d. The optimal editing distance is the corresponding editing distance when the number of editing operations is the minimum.
(2) And judging whether the optimal editing distance is larger than a preset editing distance condition. The preset edit distance condition is (l1+l2) ×0.4, where L1 is the data length of the second reference sequence and L2 is the data length of the second alignment sequence.
If yes, the second reference sequence and the second comparison sequence are represented as error comparison, and the second comparison sequence corresponding to the current optimal editing distance is abandoned; if not, the second reference sequence is correctly compared with the second comparison sequence, and the second comparison sequence corresponding to the current optimal editing distance is reserved, namely a third comparison sequence, and the second comparison sequence corresponds to the third reference sequence.
(3) Backtracking to a comparison interval according to the optimal editing distance, searching from front to back in the comparison interval, and resetting a starting point of which the length of the first third reference sequence is completely matched with the length of the third comparison sequence which is the preset third base number 8 as a starting point of the comparison interval; searching from back to front, and setting the final point of the complete matching of the length of the last third reference sequence and the length of the third comparison sequence with the preset third base number 8 as the key point of the comparison interval.
(4) And calculating the coverage of the comparison interval until the coverage of the third comparison sequence to the third reference sequence reaches a preset sixth coverage 90.
In summary, the sequence alignment method provided in this embodiment effectively reduces ineffective alignment and eliminates erroneous alignment through twice alignment and multiple screening, and realizes rapid, accurate and efficient alignment.
Example two
In practical applications, the data size of the sequencing sequences is very large, and particularly for large and ultra-large genomes, a large hard disk storage space is required. The specific sequence is converted into a binary form, so that compression can be effectively performed, and the storage space is saved. Based on this, the sequence alignment method provided in this embodiment further includes: converting and storing the sequence based on the binary number; the sequences include one or more of the sequencing sequences, first sequences, second sequences, first reference sequences, first alignment sequences, second reference sequences, second alignment sequences, third reference sequences, and third alignment sequences of example 1.
The steps include: (1) Dividing the sequence into a plurality of base combinations according to a preset group; wherein the preset grouping strategy comprises determining four adjacent bases as a base combination; (2) Converting the base of the sequence into binary numbers according to a preset conversion relation between the base and the binary numbers; (3) The combination of less than four bases in the sequence is subjected to bit filling expansion by adopting a specified binary number, so that a binary sequence meeting grouping is obtained; (4) The binary sequence is stored in accordance with the divided base combinations.
Specifically, the sequencing sequence contains four forms of adenine (a), cytosine (C), guanine (G) and thymine (T), each of which requires 1 byte (8 bits) to store data. In this embodiment the above bases may first be converted in binary form, such as: a=00, c=01, g=10, t=11, i.e. each base is converted into a binary representation occupying two bits. Then, forming a byte by taking four adjacent bases as a group for storage; therefore, the sequence after binary compression is about one quarter of the original data size, so that the occupied memory space is effectively reduced.
For sequences in which the sequence length is a multiple of four in the sequencing sequence, every four base combinations can be represented as a complete one byte form. For sequences with a sequence length that is not four times that of the sequence, the base combination of the last group cannot occupy eight bits, and the base combination of the last group can be subjected to bit-filling expansion by 00, so that the base combination of the last group can meet the form of eight bits, namely one byte.
For ease of understanding, an example of binary format conversion of a sequencing sequence is given below:
such as the original sequencing sequences:
>6 36535
TTTGCTATTAGT
>252 33213
AAGTATTA AT
the original sequencing sequence is binary-form converted in the above manner, the sequence of the number 636535 is 111111100111001111001011 after conversion, and the sequence of the number 25233213 is 000010110011110000110000 after conversion. In particular, as shown in fig. 2, different memory cells are distinguished in fig. 2 by different background fills.
Meanwhile, data can be compressed and decompressed by adopting bit operation, the bit operation consumes less CPU, and the writing and reading speeds can be improved.
Example III
The comparison result can be stored by recording the characteristics of the comparison section, so as to reduce the storage space as much as possible. The sequence alignment method provided in this embodiment further includes: recording the number of the comparison sequence, the comparison direction, the beginning of the comparison interval of the comparison sequence, the end of the comparison interval of the comparison sequence, the number of the reference sequence, the beginning of the comparison interval of the reference sequence and the end of the comparison interval of the reference sequence, and storing the comparison result; preferably, the comparison sequence number difference value, the reference sequence comparison interval length and the comparison sequence comparison interval length difference value are recorded and the comparison result is stored; the alignment sequence comprises one or more of a first alignment sequence, a second alignment sequence and a third alignment sequence in the first embodiment, and the reference sequence comprises one or more of a first reference sequence, a second reference sequence and a third reference sequence. Furthermore, storing is performed in 4 bytes; and, each byte uses the record value of 7 bits, the use of specific binary numbers of less than 7 bits is filled; the remaining 1 bit identifies whether the comparison is terminated using a particular binary number.
For ease of understanding, reference may be made to the following exemplary description. The comparison result of the first reference sequence and the first comparison sequence may include the following comparison information: a first alignment number (first column), an alignment direction (second column; indicating the direction over alignment, stored in 8 bits of one byte, the first 7 bits being 0, the 8 th bits being stored as 0 or 1, 1 representing the forward direction, 0 representing the reverse direction), a first alignment interval start (third column), a first alignment interval end (fourth column), a first reference sequence number (fifth column), a first reference sequence alignment interval start (sixth column), a first reference sequence alignment interval end (seventh column).
In practical application, the comparison result of the first reference sequence and the first comparison sequence can be further processed: the first column records the difference between the current first comparison sequence number and the last comparison sequence number, if the value is smaller than zero, the second column value is added by two, the fifth column records the difference between the current first reference sequence number and the last first reference sequence number, if the value is smaller than zero, the second column value is added by four, the third column value and the sixth column value are unchanged, the fourth column records the length of the first comparison sequence comparison section, the seventh column records the difference between the first comparison sequence and the length of the first reference sequence comparison section, and if the value is smaller than zero, the second column value is added by eight.
Since the current stored value is stored in 4 bytes (32 bits), the value in the output result is typically very small, requiring only 2 bytes or less, with the higher bits being zero and the invalid bits. Therefore, in order to further reduce the storage space, the following compression processing may be performed on the values of each column in the first comparison sequence: dividing every seven bits as a group, discarding the combination with the value of 0 in the front-end packet, setting the highest position of the reserved packet except the highest position of the last group as 0 and the highest positions of other groups as 1, and filling the byte combination after numerical compression by using 0 in the last group of less than seven bits. The size of the first comparison result after the compression processing in the mode is usually 1/3 of the original size, and the storage space is obviously occupied to be smaller.
For ease of understanding, reference may be made to the following specific examples of columns, values, binary.
The compression example is described by taking two records in the first comparison result as an example.
Comparison result:
the simplified result is:
the binary storage can be referred to in fig. 3, specifically:
the double underlined salient numbers in the binary stored sequence are filling and identification numbers.
Example IV
The present embodiment provides a sequence correction method. Referring to the sequence correction method flowchart shown in fig. 4, the method comprises the steps of:
s201, comparing a plurality of sequencing fragments to obtain a reference sequence and a comparison sequence;
s202, dividing the reference sequence by taking k-mers as a unit, and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments;
s203, calculating the scores of all bases at each position by taking a k-mer fragment as a unit from the positive sequence;
s204, determining the last base as the highest scoring base at the last position from the reverse sequence, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining the third last base according to the k-mer fragment corresponding to the second last base; and by analogy, obtaining a k-mer score chain;
s205, replacing the reference sequence by using a k-mer score chain to obtain a first correction result.
Wherein, k-mer, monomeric unit (mer) corresponds to nt or bp, and 100mer DNA corresponds to 100nt per strand, then the whole strand is 100bp. Reads of length m in general can be divided into m-k+1 k-mers. The above sequences to be corrected are divided by taking k-mers as 3 and by taking k-mers as 3, for example, to obtain a plurality of 3-mer fragments, such as the first 3-mer at positions 1, 2 and 3, the second 3-mer at positions 2, 3 and 4, and so on.
The sequence correction method provided by the embodiment of the invention comprises the following steps: comparing the sequencing fragments to obtain a reference sequence and a comparison sequence; dividing the reference sequence by taking k-mer as a unit, and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments; calculating the scores of all bases at each position from the positive sequence in units of k-mer fragments; determining the last base as the highest scoring base at the last position from the reverse order, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining a third-to-last base according to the k-mer fragment corresponding to the second-to-last base; and by analogy, obtaining a k-mer score chain; a first calibration result is obtained using a k-mer score chain to replace the reference sequence. According to the embodiment, the sequencing fragments are compared in the sequence correction mode, and then the correction result is obtained based on the k-mer fragment, so that the correction result caused by single base comparison is prevented from being favored, the accuracy of sequence correction is improved, and the processing efficiency of sequence correction is also improved. In addition, the above manner provided in this embodiment is based on correction of the sequence alignment result, and the alignment method for obtaining the alignment result is not limited, and may be combined with various sequence alignment methods.
In the above step S203, the scores of all bases at each position are calculated by the following calculation formula:
score(p,b)=max{score(p-1,b)+count k-mer }-C×0.3
wherein score (p, b) is the score of the base at position p of the reference sequence, b is base A, T, G, C or deletion, count k-mer The number of occurrences of the k-mer fragment in the alignment sequence; c is the coverage of the reference sequence by the alignment sequence at the p position of the reference sequence.
In one specific implementation, the k value may be 3, i.e., the k-mer is a 3-mer.
Further, the situation that the accuracy distribution is uneven in the long-reading long sequencing technology is considered. Taking the three-generation sequencing result as an example, the accuracy of certain intervals is extremely low (< 70%), the reliability of the comparison result of the intervals is poor, and therefore, the accuracy of the correction sequence obtained by correcting based on the comparison result is low. The accuracy can be further improved by finding a low accuracy interval and correcting it again for that interval. Thus, the sequence correction method may further include:
step S206, using the comparison interval sequence of the comparison sequence covering the low-quality sequence in the comparison result of step S201 as a seed sequence, and correcting the low-quality sequence by using a directed graph algorithm based on the seed sequence; replacing the low-quality sequence with the corrected sequence to obtain a second correction result;
The low-quality sequence is a longest sequence with the number of continuous bases not smaller than a preset first value (such as 4), the average value of the interval being smaller than a preset second value (such as 40), the length of the interval exceeding a preset third value (such as 16), and the number of continuous high-quality bases in the interval not larger than the preset first value (such as 4).
In one specific implementation, the base matrix value Q is defined as q=count 3-mer C×100; determining a base with a base matrix value greater than a preset first value of 4 as a low-quality base, and determining a base with a base quality value not greater than the first value of 4 as a high-quality base; the interval average homogeneity value is the average of all base matrix values within the interval.
Looking up the low quality sequence in the first calibration result based on the low quality base and the high quality base by referring to the following steps (a) to (c):
(a) And (3) carrying out low-quality base search on the first correction sequence according to the reverse order, and recording the first low-quality base position lq_s.
(b) And (3) taking the lq_s position as a starting position, continuing searching low-quality bases according to the reverse sequence, recording the number lq_c of the searched continuous low-quality bases, and judging the relation between the lq_c and the first numerical value 4.
If lq_c is not less than the first value 4, continuing to search in the reverse order, and if high quality bases are found, recording the number hq_c of continuous high quality bases. If hq_c is larger than the first value 4, initializing lq_c and hq_c to 0, searching low-quality alkali base again according to the reverse sequence, and repeating the steps a and b. If hq_c is not greater than the first value 4, initializing hq_c to 0, continuously searching according to the reverse order, continuously recording lq_c, and executing the step (c);
If lq_c is smaller than the first value 4, initializing lq_c to 0, searching low-quality bases again according to the reverse order, and repeating the steps a and b.
(c) And calculating a section average homogeneity value, and judging whether the section length lq_l is not smaller than the third value 16 when the section average homogeneity value is the maximum value of the second value 40 or less.
If the base position lq_e is larger than or equal to the third numerical value, the last base position lq_e searched in the reverse sequence is recorded as a termination position, and the sequence between the lq_s position and the lq_e position is a low-quality sequence. And initializing the length lq_l, the number lq_c of low-quality bases and the number hq_c of high-quality bases of the low-quality interval to be 0, searching the low-quality bases again according to the reverse sequence, and repeating the steps a, b and c until the searching of all the reference sequences is completed.
If the length lq_l of the low quality interval, the number lq_c of the low quality bases and the number hq_c of the high quality bases are initialized to be 0, the low quality bases are searched again according to the reverse sequence, and the steps a, b and c are repeated until the searching of all the reference sequences is completed.
For ease of understanding, reference may be made to the following example, for finding low quality sequences for the first correction result, and reference may be made in particular to fig. 5:
wherein the lower case portion is a low quality base, the upper case portion is a high quality base, and the underlined portion is a found low quality interval sequence.
In step S206, the specific steps for correction by the directed graph algorithm are:
s261, optionally drawing a directed graph by taking a seed sequence as a motif.
In this embodiment, in order to reduce the comparison of invalid data, before performing the directed graph algorithm correction, the first correction result may be filtered according to any one of the following conditions, that is, the directed graph algorithm correction is not performed on the first correction result that satisfies any one of the following conditions: a. a first correction result that the accumulated length of the low-quality sequence exceeds a preset fourth value (such as 10000); b. the number of the seed sequences is smaller than a first correction result of a preset first value (such as 4); c. and a first correction result that the ratio of the comparison sequence with the length exceeding a preset fourth value (such as 10000) exceeds a fourth preset multiple (such as 1/3). The filtered data is characterized by low credibility and overlength, so that the bottleneck problem of extremely slow calculation speed caused by overlength sequences in a directed graph algorithm can be solved.
Specifically, each low-quality sequence is traversed, and all comparison sequences covering the low-quality sequence are found out by utilizing the starting point position and the end point position of the low-quality sequence. If the found alignment sequences are greater than 60, stopping searching the alignment sequences of the low-quality interval sequences.
Further, the searched comparison sequences are sequenced from long to short according to the sequence length, and seed sequences are selected. In a preferred embodiment, the number of seed sequences does not exceed a predetermined fifth value (e.g., 30), and preferably the seed sequences included in the aligned sequences are aligned at intermediate predetermined fifth values (e.g., 30). If the number of the searched comparison sequences is less than 30, taking all the comparison sequences as seed sequences S.
Optionally drawing a directed graph G by taking one of the seed sequences S as a motif; the Node of the directed graph is a base in the sequence, the directed Edge of the directed graph is a connecting line between any two continuous bases, and the initial Node of the directed graph is a initial base of the sequence.
S262, calculating the base scores in the seed sequences according to the score matrix, and updating the directed graph, namely adding all the seed sequences into the directed graph.
In a specific implementation, the steps (1) to (4) are performed as follows:
(1) Initializing a node score according to the formula score (x) =max { score (x-1) } -GP, wherein score (x) is the node score, x is the node index, GP is a gap penalty (e.g., 2); the node score of no predecessor node is set to 0.
(2) The present embodiment may traverse nodes in the directed graph, traversing all predecessor nodes of each node based on that node. For any predecessor node, traversing each base in S, and updating the matrix score according to the following calculation formula:
Where x is the node index, y is the base index in the seed sequence, GP is the gap penalty (e.g., 2), M is the mismatch score (e.g., 2) or the match score (e.g., 1).
(3) And selecting the predecessor node with the highest score from the highest scoring node with the degree of 0 as a starting point, backtracking, and taking the node index as 0 or the base index as 0 as an end point.
Specifically, traversing nodes in the directed graph, if the degree of departure of the nodes is 0, defining the nodes as terminal nodes, and recording the score and index of the terminal nodes. After finding all the end points, finding out the end point with the largest score, determining the end point as a backtracking start point, and recording the node index of the start point and the corresponding base index. And (3) circularly tracing back from the starting point, searching for the predecessor node and predecessor base with highest score of any node x until the node index is 0 or the sequence index is 0, determining the node as a tracing back end point, and recording the node index of the end point and the corresponding base index.
(4) If the base index of the end point is greater than 0, adding a base before the base index of the end point in the seed sequence into the directed graph; if the base index of the starting point is smaller than the length of the seed sequence, adding the base following the base index of the starting point in the seed sequence to the directed graph.
The specific operation of adding the directed graph is as follows: if the current base is a match at the current node, skipping the base; if the node is mismatched, adding the current base into a base list contained in the current node; if so, then add to the directed graph with the new node and new edge.
S263, performing topological sorting on the directed graph; topology ordering can also be performed on the updated directed graph, so that any pair of vertices u and v in the updated directed graph, and if the edges (u, v) E (G) are satisfied, u appears before v in the ordered directed graph.
S264, determining the base of each node as the base with the highest occurrence probability on the node according to the topological sequencing result, and obtaining the corrected sequence.
In a specific implementation manner, according to the topological sorting result, the base with the largest occurrence number is selected on each node, so that a corrected low-quality sequence is determined, and the corresponding sequence of the first correction result is replaced, so that a second correction result is obtained.
To further improve accuracy, the sequence correction method may further include:
s207, correcting again by using the seed sequence as a comparison sequence and the second correction result as a reference sequence by using the sequence correction method of the steps S201 to S205, thereby obtaining a third correction result. Specific details are not described again.
In summary, the sequence correction method provided in the above embodiment can realize efficient and high-accuracy correction of the sequence.
Example five
In step S202 of example 4, the higher the k-mer score, the better the specificity of the k-mer fragment and the more accurate the correction. However, the greater the k value, the less the probability of a perfectly uniform k-mer. When the sequencing accuracy is p, the probability of occurrence of a perfectly identical k-mer is p k
As can be seen from the above formulas of score (p, b), the score correlates with the number of exactly identical k-mers, and if the number is too low, the number of sequences counted for correction is too small, which can seriously affect the collation effect. Taking the example of 85% of the average accuracy of the third-generation sequencing at the present stage, the probability of complete coincidence of the 2-mer is 72.25%, the probability of complete coincidence of the 3-mer is 61.41%, and the probability of complete coincidence of the 4-mer is 52.20%. It is apparent that the probability is too low for 4-mers and larger than 4-mers. To further illustrate the technical effect of k-mer partitioning, an algorithm is performed by way of example, as follows.
Table 1 shows exemplary data for processing aligned sequence fragments including Read1 to Read7, with position 1 being the start site and position 5 being the end site.
TABLE 1 data information
Example 1:
The k-mer was set to 3, and correction was performed according to the method of this example, with the score chain determined by backtracking being GGAGT in order from position 1 to 5.
The k-mer is used as 2 for division, correction is performed by referring to the method of the embodiment, and the correction chain determined by backtracking is GGCGT from the position 1 to the position 5.
The correction result for position 3 is different compared to the two correction results. As before, the greater the k value, the better the specificity of the k-mer and the more accurate the correction result. Thus, 3-mers are the best choice.
Example 2:
the alignment times of single bases are simply corrected, the highest alignment times of each position is selected as a corrected sequence, and the statistics of the alignment times of single bases of each position are shown in table 2.
TABLE 2 number of single base alignments
The result of example 2 is GGCGT, which is consistent with the k=2 result in example 1. It can be seen that the number of single base alignment times cannot be accurately corrected by statistics.
In step S204 of example 4, from the reverse order, determining the last base as the highest scoring base at the last position, and determining the penultimate base from the k-mer fragment to which the last base corresponds; determining a third last base according to the k-mer fragment corresponding to the second last base; and so on, obtaining a k-mer score chain.
For a clearer explanation, the calculation is performed by example 3, specifically as follows.
Table 3 shows exemplary data, and the alignment fragments include Read1 to Read5, and position 1 is set as the start position, and position 7 is set as the end position, and correction is performed by the method of example 4. Wherein the k-mer is set to be 3-mer, and the effective coverage of each position is 5.
TABLE 3 data information
Using the method of this example, the k-mer score chain obtained is ACAGACC in order from positions 1 to 7.
If no backtracking is performed, only a single base is considered, and the connection relationship between adjacent bases is not considered, i.e., selection is performed by the highest score. Position 3, highest scoring base is a; position 4, highest scoring base G; position 5, the highest scoring base is T; position 6, the highest scoring base is G; at position 7, the highest scoring base is C. And finally, the positions 3 to 7 of the score chain are AGTGC, which are inconsistent with the result after backtracking, so that the backtracking effect is reflected.
As can be seen from the above examples, when the correction is performed by using the method of embodiment 4, the combination situation of the forward sequence and the current base is considered, that is, the continuous small fragment which is most likely to exist is considered as the correction result, so that the preference influence of only counting the number of single base comparison times on the correction result is effectively avoided.
Example six
The present embodiment provides a sequence correction method, wherein the method described in the first embodiment is used to align a sequencing sequence; based on the third reference sequence, the third alignment sequence, correction of the sequence was performed using the method described in example four. For convenience and brevity of description, the specific working procedures of the sequence comparison method and the sequence correction method described in this embodiment may refer to the corresponding procedures in the first to fifth embodiments of the foregoing method, and will not be described herein again.
The embodiment of the invention also provides an example of an actual application scene, and the example is referred to as follows:
the PacBIO Sequal platform and NanoPore PromethION platform standard procedures are used for sequencing a human sample, all sequencing data of chromosome 20 are taken as test data, and two data disclosed by the chromosome are used as standard data for evaluating the technical effects of the comparison method and the correction method, as shown in Table 4.
TABLE 4 Table 4
/>
The corrected yield refers to the proportion of data with the corrected reading length exceeding 5kb to the original data, the average length of reads and the longest reading length of the invention, canu and Falcon, refer to corrected data, the ratio of reads with the length exceeding 99% and matched with standard data, the ratio of similarity exceeding 97% refers to the ratio of reads with the similarity matched with standard data, and the average similarity of the ratio exceeding 99% and matched with standard data.
34 first sequences were randomly extracted from the chromosome 20 Nanopore sequencing result, and a total of 1588683bp was obtained. At the time of alignment, the methods described in example 1 (optimization methods) were used, respectively, and the method of restricting the constraint line range in step S141 in example 1 (non-optimization method); the method described in example 4 was used for correction. The results showed that the optimized process run time was 38.2s and the non-optimized process run time was 813.6s.
Example seven
Based on the sequence alignment method provided in the foregoing embodiment, the present embodiment provides a sequence alignment device, referring to a structural block diagram of the sequence alignment device shown in fig. 6, the device includes:
the sequence acquisition module 301 is configured to acquire a sequencing sequence, screen a first sequence in the sequencing sequence according to a first preset data length, and screen a second sequence in the sequencing sequence according to a second preset length;
the first comparing module 302 is configured to compare the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence;
a second comparison module 303, configured to calculate coverage of the first comparison sequence to the first reference sequence, screen the first reference sequence and the first comparison sequence based on the coverage, and obtain a second reference sequence and a second comparison sequence;
And a third comparison module 304, configured to calculate a comparison path between the second reference sequence and the second comparison sequence, and screen the second reference sequence and the second comparison sequence based on the editing distance, so as to obtain a third reference sequence and a third comparison sequence.
In one embodiment, the sequence alignment device further includes a sequence storage module, where the sequence storage module is configured to: converting and storing the sequence based on the binary number; the sequences include one or more of a sequencing sequence, a first sequence, a second sequence, a first reference sequence, a first alignment sequence, a second reference sequence, a second alignment sequence, a third reference sequence, and a third alignment sequence; preferably, the step of converting and storing the sequence based on binary numbers comprises: dividing the sequence into a plurality of base combinations according to a preset group; wherein the preset grouping includes determining four adjacent bases as a base combination; converting the base of the sequence into binary numbers according to a preset conversion relation between the base and the binary numbers; the method comprises the steps of carrying out bit filling expansion on combinations of less than four bases in a sequence by adopting a specified binary number to obtain a binary sequence meeting a preset group; the binary sequence is stored in accordance with the divided base combinations.
In one embodiment, the sequence alignment device further includes an alignment storage optimization module, where the alignment storage optimization module is configured to: recording the number of the comparison sequence, the comparison direction, the beginning of the comparison interval of the comparison sequence, the end of the comparison interval of the comparison sequence, the number of the reference sequence, the beginning of the comparison interval of the reference sequence and the end of the comparison interval of the reference sequence, and storing the comparison result; preferably, the comparison sequence number difference value, the reference sequence comparison interval length and the comparison sequence comparison interval length difference value are recorded and the comparison result is stored; wherein the alignment sequence comprises one or more of a first alignment sequence, a second alignment sequence and a third alignment sequence, and the reference sequence comprises one or more of a first reference sequence, a second reference sequence and a third reference sequence.
In one embodiment, the above-mentioned comparison storage optimization module is further configured to: storing according to 4 bytes; and, each byte uses 7 bits to record the numerical value, and uses the specific binary number to fill in less than 7 bits; the remaining 1 bit identifies whether the comparison is terminated using a particular binary number.
In one embodiment, in the second comparison module 303, the coverage includes a window coverage and an overall coverage; the window coverage is obtained by dividing a first reference sequence into a plurality of windows according to a preset first base number, and calculating the coverage of the first comparison sequence on each window; the overall coverage is the average coverage of the first reference sequence; the screening method comprises the steps of determining a first comparison sequence meeting a preset first coverage condition as a second comparison sequence, wherein the overall coverage of the second comparison sequence to a first reference sequence reaches the preset first coverage; the first coverage condition is that the window coverage is smaller than a preset second coverage and smaller than a first preset multiple of the whole coverage; preferably, the coverage calculation and screening are performed in the order of the overlapping length of the first comparison sequence and the first reference sequence from long to short.
In one embodiment, the sequence comparison device further includes a filtering module for a chimeric sequence, where the chimeric sequence is a sequence in the first reference sequence that meets a preset second coverage condition; the second coverage condition is that at least one window exists, the window coverage of the window is smaller than a preset third coverage, and the window coverage of a specified number of windows adjacent to the window is larger than a preset fourth coverage; or at least one window exists, and the window coverage of the window is smaller than the preset fifth coverage; wherein the fifth coverage threshold is a second preset multiple of the window coverage average of a specified number of windows adjacent to the current window.
In one embodiment, the sequence comparison device further includes a filtering module for a sequence, wherein the sequence includes a first reference sequence satisfying the following conditions: at least one first comparison sequence exists, and the number of bases from the start of the comparison interval of the first comparison sequence and the first reference sequence to the start of the first reference sequence, and the number of bases from the end of the comparison interval of the first comparison sequence and the first reference sequence to the end of the first reference sequence are all within a preset second number of bases.
In one embodiment, in the third alignment module 304, the following steps are performed during the alignment interval between the second reference sequence and the second alignment sequence: s141, calculating a comparison path between the second reference sequence and the second comparison sequence by using a greedy algorithm, selecting an optimal comparison path and determining an editing distance of the optimal comparison path; s142, selecting a second comparison sequence which is not more than a preset editing distance condition from the second comparison sequences as a third comparison sequence, wherein a corresponding reference sequence is the third reference sequence; the preset editing distance condition is a third preset multiple of the sum of the lengths of the first reference sequence and the second comparison sequence in the comparison interval; s143, circulating the steps S141 to S142 until the coverage of the third comparison sequence on the third reference sequence reaches a preset sixth coverage;
preferably, in step S141, the constraint line range during the calculation of the greedy algorithm is between the minimum constraint line and the maximum constraint line when the specified distance value is subtracted from the maximum extension distance under the constraint of the edit distance of the comparison path;
preferably, in step S143, the coverage calculation section is that, according to the optimal comparison path, the comparison section of the third comparison sequence and the third reference sequence is determined by backtracking, the starting position and the ending position of the first preset third base number and the last preset third base number in the comparison section are searched for and determined as the starting position and the ending position of the section.
Example eight
Based on the sequence correction method provided in the above embodiment, the present embodiment provides a sequence correction device, referring to the structural block diagram of the sequence correction device shown in fig. 7, the device includes:
a sequence comparison module 401, configured to compare a plurality of sequencing fragments to obtain a reference sequence and a comparison sequence;
the sequence dividing module 402 is configured to divide a reference sequence by using k-mers as a unit, and correspondingly divide the comparison sequence to obtain a plurality of k-mer segments;
a score calculation module 403 for calculating the scores of all bases at each position in units of k-mer fragments from the positive sequence;
a base determination module 404 for determining, from the reverse order, that the last base is the highest scoring base at the last position, and determining the penultimate base from the k-mer fragment to which the last base corresponds; determining the third last base according to the k-mer fragment corresponding to the second last base; and by analogy, obtaining a k-mer score chain;
a correction module 405 for replacing the reference sequence with a k-mer score chain to obtain a first correction result.
In one embodiment, in the score calculation module 403, according to the formula score (p, b) =max { score (p-1, b) +count k-mer Calculating the score of all bases at each position;
wherein score (p, b) is the p position of the reference sequenceScoring the base, b is base A, T, G, C or missing, count k-mer The number of occurrences of the k-mer fragment in the alignment sequence; c is the coverage of the comparison sequence at the p position of the reference sequence to the reference sequence; preferably, k has a value of 3.
In one embodiment, the sequence correction device further includes: the low-quality sequence correction module is used for correcting the low-quality sequence in the first correction result by using a comparison region sequence covering the comparison sequence of the low-quality sequence in the comparison result of the sequence comparison module 401 as a seed sequence and using a directed graph algorithm based on the seed sequence; replacing the low quality sequence with the corrected sequence to obtain a second correction result; the low-quality sequence is a sequence which meets all conditions that the number of continuous high-quality bases in a section is not more than a preset first value, the number of continuous low-quality bases in the section is not more than a preset second value, the length of the section is more than a preset third value, and the number of continuous high-quality bases in the section is not more than the preset first value; preferably, the number of seed sequences in the directed graph algorithm correction does not exceed a preset fifth value, and preferably the seed sequences included in the aligned sequences with the length order positioned at the middle preset fifth value are included.
In one embodiment, the low quality sequence correction module is further configured to: s261, optionally drawing a directed graph by taking a seed sequence as a motif; s262, calculating the base score in the seed sequence according to the score matrix, and updating the directed graph; s263, performing topological sorting on the directed graph; s264, determining the base of each node as the base with the highest occurrence probability on the node according to the topological sequencing result, and obtaining the corrected sequence.
In one embodiment, in the step S262, the following steps are performed for all seed sequences: a. initializing a node score according to the formula score (x) =max { score (x-1) } -GP, wherein score (x) is the node score, x is the node index, GP is the gap penalty; setting node score of the node without any predecessor as 0;
b. traversing each base of the seed sequence, updating the scoring matrix according to a formula,
wherein x is a node index, y is a base index in the seed sequence, GP is a gap penalty, and M is a mismatch or matching score;
c. selecting a predecessor node with the highest score by taking the highest score node with the degree of 0 as an origin, backtracking, and taking the node index as 0 or the base index as 0 as an end point; d. if the base index of the end point is greater than 0, adding a base before the base index of the end point in the seed sequence into the directed graph; if the base index of the starting point is smaller than the length of the seed sequence, adding the base after the base index of the starting point in the seed sequence into the directed graph.
In one embodiment, the first correction result is filtered before the directed graph algorithm correction is performed according to any one of the following conditions: a. a first correction result that the accumulated length of the low-quality sequence exceeds a preset fourth value; b. the number of the seed sequences is smaller than a first correction result of a preset first value; c. and a first correction result that the comparison sequence ratio of which the length exceeds a preset fourth value exceeds a fourth preset multiple.
In one embodiment, the sequence correction device further includes: and the re-correction module is used for carrying out correction again by taking the seed sequence as a comparison sequence and taking the second correction result as a reference sequence by using the correction method of any one of the above steps to obtain a third correction result.
In one embodiment, the sequence alignment module 401 uses any of the above alignment methods for alignment; wherein the reference sequence comprises one or more of a first reference sequence, a second reference sequence, and a third reference sequence, and the alignment sequence comprises one or more of a first alignment sequence, a second alignment sequence, and a third alignment sequence.
The relative steps of the modules and steps set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise. In all examples shown and described for these embodiments, any particular values are to be construed as merely illustrative, and not limitative, and therefore other examples of exemplary embodiments may have different values.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, apparatus and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced equivalently; such modifications and substitutions do not depart from the spirit of the invention.

Claims (15)

1. A method of alignment, the method comprising:
s101, acquiring a sequencing sequence, screening a first sequence from the sequencing sequences according to a first preset data length, and selecting the sequencing sequence with the data length larger than the first preset length from the sequencing sequences as the first sequence; screening a second sequence in the sequencing sequence according to a second preset length;
s102, comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence;
s103, calculating the coverage of the first comparison sequence to the first reference sequence, screening the first reference sequence and the first comparison sequence based on the coverage, and obtaining a second reference sequence and a second comparison sequence;
s104, calculating a comparison path between the second reference sequence and the second comparison sequence, and screening the second reference sequence and the second comparison sequence based on the editing distance to obtain a third reference sequence and a third comparison sequence;
in step S103, the coverage includes window coverage and overall coverage; the window coverage is obtained by dividing the first reference sequence into a plurality of windows according to a preset first base number, and calculating the coverage of the first comparison sequence on each window; the overall coverage is an average coverage of the first reference sequence;
The screening method comprises the steps of determining the first comparison sequence meeting a preset first coverage condition as the second comparison sequence, wherein the overall coverage of the second comparison sequence to the first reference sequence reaches a preset first coverage; the first coverage condition is that the window coverage is smaller than a preset second coverage and smaller than a first preset multiple of the whole coverage;
performing the coverage calculation and the screening according to the sequence from long to short of the overlapping length of the first comparison sequence and the first reference sequence;
in step S104, in the alignment interval between the second reference sequence and the second alignment sequence, the following steps are performed:
s141, calculating a comparison path between the second reference sequence and the second comparison sequence by using a greedy algorithm, selecting an optimal comparison path and determining an editing distance of the optimal comparison path; the optimal editing distance of the optimal comparison path is as follows: the corresponding editing distance when the editing operation times are the least;
s142, selecting a second comparison sequence which is not more than a preset editing distance condition from the second comparison sequences as a third comparison sequence, wherein a corresponding reference sequence is the third reference sequence; the preset editing distance condition is a third preset multiple of the sum of the lengths of the first reference sequence and the second comparison sequence in the comparison interval;
S143, circulating the steps S141 to S142 until the coverage of the third comparison sequence on the third reference sequence reaches a preset sixth coverage;
in step S141, the constraint line range during greedy algorithm calculation is between the minimum constraint line and the maximum constraint line when the specified distance value is subtracted from the maximum extension distance under the constraint of the edit distance of the comparison path;
in step S143, the coverage interval is calculated as,
and backtracking to determine a comparison interval of a third comparison sequence and a third reference sequence according to the optimal comparison path, searching a starting position of the first preset third base number in the comparison interval and a final position of the last preset third base number in the comparison interval, and determining the starting position and the final position as the starting position and the final position of the interval.
2. The method according to claim 1, characterized in that the method comprises:
converting and storing the sequence based on the binary number; the sequence comprises one or more of the sequencing sequence, the first sequence, the second sequence, the first reference sequence, the first alignment sequence, the second reference sequence, the second alignment sequence, the third reference sequence, and the third alignment sequence;
The step of converting and storing the sequence based on the binary number comprises the following steps:
dividing the sequence into a plurality of base combinations according to a preset group; wherein the preset grouping includes determining four adjacent bases as a base combination;
converting the base of the sequence into binary numbers according to a preset conversion relation between the base and the binary numbers;
and carrying out bit filling expansion on the combinations of less than four bases in the sequence by adopting a designated binary number to obtain a binary sequence meeting the preset grouping.
3. The method of claim 1, comprising storage optimization of the comparison results:
recording the number of the comparison sequence, the comparison direction, the beginning of the comparison interval of the comparison sequence, the end of the comparison interval of the comparison sequence, the number of the reference sequence, the beginning of the comparison interval of the reference sequence and the end of the comparison interval of the reference sequence, and storing the comparison result;
recording the comparison sequence number difference value, the reference sequence number difference value, and storing the comparison result between the length of the reference sequence comparison interval and the length difference value of the comparison interval of the comparison sequence;
wherein the alignment sequence comprises one or more of a first alignment sequence, a second alignment sequence and a third alignment sequence, and the reference sequence comprises one or more of a first reference sequence, a second reference sequence and a third reference sequence.
4. A method according to claim 3, wherein the method of storing the comparison result comprises:
storing according to 4 bytes;
and, each byte uses 7 bits to record the numerical value, and uses the specific binary number to fill in less than 7 bits; the remaining 1 bit identifies whether the comparison is terminated using a particular binary number.
5. The method of claim 1, wherein prior to step S103, the method further comprises filtering the chimeric sequences, wherein the chimeric sequences are,
sequences meeting a preset second coverage condition in the first reference sequences;
the second coverage condition is that at least one window exists, the window coverage of the window is smaller than a preset third coverage, and the window coverage of a specified number of windows adjacent to the window is larger than a preset fourth coverage; or at least one window exists, and the window coverage of the window is smaller than a preset fifth coverage; wherein the fifth coverage threshold is a second preset multiple of the window coverage average of a specified number of windows adjacent to the current window.
6. The method of claim 1, wherein prior to step S103, the method further comprises filtering a containment sequence, wherein the containment sequence is,
The first reference sequence satisfying the following conditions: at least one first comparison sequence exists, the number of bases from the start of the comparison interval of the first comparison sequence and the first reference sequence to the start of the first reference sequence, and the number of bases from the end of the comparison interval of the first comparison sequence and the first reference sequence to the end of the first reference sequence are all within a preset second number of bases.
7. A method of sequence correction, the method comprising:
s201, comparing a plurality of sequencing fragments to obtain a reference sequence and a comparison sequence;
s202, dividing the reference sequence by taking k-mers as a unit, and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments;
s203, calculating the scores of all bases at each position by taking a k-mer fragment as a unit from the positive sequence;
s204, determining the last base as the highest scoring base at the last position from the reverse sequence, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining a third last base according to the k-mer fragment corresponding to the second last base; and by analogy, obtaining a k-mer score chain;
S205, replacing the reference sequence by using a k-mer score chain to obtain a first correction result;
the k value in the sequence correction method is 3;
in step S201, an alignment is performed using the sequence alignment method according to any one of claims 1 to 6.
8. The method according to claim 7, wherein in step S203,
according to the formulaCalculating the scores of all bases at each position;
wherein ,a score for a base at the p position of the reference sequence, b being the base A, T, G, C or deletion, count k-mer A number of occurrences of the k-mer fragment in the alignment sequence; c is the coverage of the reference sequence by the alignment sequence at the p position of the reference sequence.
9. The method of claim 7, wherein the method further comprises:
s206, using a comparison interval sequence of the comparison sequences covering the low-quality sequences in the comparison result of the step S201 as a seed sequence, and correcting the low-quality sequences by using a directed graph algorithm based on the seed sequence; replacing the low quality sequence with the corrected sequence to obtain a second correction result;
the low-quality sequence is a sequence which meets all conditions that the number of continuous bases is not less than a preset first value, the average value of the interval is less than the longest sequence of the preset second value, the length of the interval exceeds a preset third value, and the number of continuous high-quality bases in the interval is not greater than the preset first value;
And correcting the seed sequences in the directed graph algorithm, wherein the seed sequences are included in the comparison sequences with the length ordering at the middle preset fifth value.
10. The method according to claim 9, wherein in step S206, the specific step of correcting the directed graph algorithm is:
s261, optionally drawing a directed graph by taking one seed sequence as a motif;
s262, calculating the base scores in the seed sequences according to the score matrix, and updating the directed graph;
s263, performing topological sorting on the directed graph;
s264, determining the base of each node as the base with the highest occurrence probability on the node according to the topological sequencing result, and obtaining the corrected sequence.
11. The method according to claim 10, characterized in that in step S262 the following steps are performed for all the seed sequences:
a. initializing a node score according to the formula score (x) =max { score (x-1) } -GP, wherein score (x) is the node score, x is the node index, GP is a gap penalty; setting node score of the node without any predecessor as 0;
b. traversing each base of the seed sequence, updating the following scoring matrix according to the formula:
wherein x is a node index, y is a base index in the seed sequence, GP is a gap penalty, and M is a mismatch or matching score;
c. Selecting a predecessor node with the highest score by taking the highest score node with the degree of 0 as a starting point, backtracking, and taking the node index as 0 or the base index as 0 as an end point;
d. if the base index of the end point is greater than 0, adding a base positioned before the base index of the end point in the seed sequence into a directed graph; and if the base index of the starting point is smaller than the length of the seed sequence, adding the base positioned behind the base index of the starting point in the seed sequence into a directed graph.
12. The method of claim 9, wherein the first correction result is filtered prior to the directed graph algorithm correction according to any of the following conditions:
a. the accumulated length of the low-quality sequence exceeds the first correction result of a preset fourth value;
b. the number of the seed sequences is smaller than the first correction result of a preset first value;
c. and the comparison sequence with the length exceeding a preset fourth value is subjected to the first correction result with the duty ratio exceeding a fourth preset multiple.
13. The method of claim 9, the method further comprising:
s207, using the seed sequence as the comparison sequence, using the second correction result as the reference sequence, and performing correction again by using the correction method according to any one of claims 7 to 8 to obtain a third correction result.
14. A sequence comparison device, the device comprising:
the sequence acquisition module is used for acquiring a sequencing sequence, screening a first sequence from the sequencing sequences according to a first preset data length, and selecting a sequencing sequence with a data length larger than the first preset length from the sequencing sequences as the first sequence; screening a second sequence in the sequencing sequence according to a second preset length;
the first comparison module is used for comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence;
the second comparison module is used for calculating the coverage of the first comparison sequence to the first reference sequence, screening the first reference sequence and the first comparison sequence based on the coverage, and obtaining a second reference sequence and a second comparison sequence;
the third comparison module is used for calculating a comparison path between the second reference sequence and the second comparison sequence, screening the second reference sequence and the second comparison sequence based on the editing distance, and obtaining a third reference sequence and a third comparison sequence;
in the second comparison module, the coverage comprises window coverage and overall coverage; the window coverage is obtained by dividing the first reference sequence into a plurality of windows according to a preset first base number, and calculating the coverage of the first comparison sequence on each window; the overall coverage is an average coverage of the first reference sequence;
The second comparison module is further configured to determine the first comparison sequence that meets a preset first coverage condition as the second comparison sequence, where the overall coverage of the second comparison sequence to the first reference sequence reaches a preset first coverage; the first coverage condition is that the window coverage is smaller than a preset second coverage and smaller than a first preset multiple of the whole coverage; performing the coverage calculation and the screening according to the sequence from long to short of the overlapping length of the first comparison sequence and the first reference sequence;
the third alignment module is further configured to perform the following steps in the alignment interval between the second reference sequence and the second alignment sequence:
s141, calculating a comparison path between the second reference sequence and the second comparison sequence by using a greedy algorithm, selecting an optimal comparison path and determining an editing distance of the optimal comparison path; the optimal editing distance of the optimal comparison path is as follows: the corresponding editing distance when the editing operation times are the least;
s142, selecting a second comparison sequence which is not more than a preset editing distance condition from the second comparison sequences as a third comparison sequence, wherein a corresponding reference sequence is the third reference sequence; the preset editing distance condition is a third preset multiple of the sum of the lengths of the first reference sequence and the second comparison sequence in the comparison interval;
S143, circulating the steps S141 to S142 until the coverage of the third comparison sequence on the third reference sequence reaches a preset sixth coverage;
in step S141, the constraint line range during greedy algorithm calculation is between the minimum constraint line and the maximum constraint line when the specified distance value is subtracted from the maximum extension distance under the constraint of the edit distance of the comparison path;
in step S143, the coverage interval is calculated by determining a comparison interval between the third comparison sequence and the third reference sequence by backtracking according to the optimal comparison path, searching a starting position where the first preset third base number is completely matched and a final position where the last preset third base number is completely matched in the comparison interval, and determining the starting position and the final position as the starting position and the final position of the interval.
15. A sequence correction device, the device comprising:
the sequence comparison module is used for comparing a plurality of sequencing fragments to obtain a reference sequence and a comparison sequence;
the sequence dividing module is used for dividing the reference sequence by taking k-mers as a unit and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments;
The score calculating module is used for calculating the scores of all bases at each position by taking the k-mer fragments as units from the positive sequence;
the base determining module is used for determining that the last base is the highest scoring base at the last position from the reverse sequence, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining a third last base according to the k-mer fragment corresponding to the second last base; and by analogy, obtaining a k-mer score chain;
the correction module is used for replacing the reference sequence by using a k-mer score chain to obtain a first correction result;
the k value in the sequence correction device is 3;
the sequence alignment module for use in alignment using the sequence alignment method as claimed in any one of claims 1 to 6.
CN201910787734.1A 2019-08-23 2019-08-23 Sequence comparison method, sequence correction method and device thereof Active CN112397148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910787734.1A CN112397148B (en) 2019-08-23 2019-08-23 Sequence comparison method, sequence correction method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910787734.1A CN112397148B (en) 2019-08-23 2019-08-23 Sequence comparison method, sequence correction method and device thereof

Publications (2)

Publication Number Publication Date
CN112397148A CN112397148A (en) 2021-02-23
CN112397148B true CN112397148B (en) 2023-10-03

Family

ID=74603657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910787734.1A Active CN112397148B (en) 2019-08-23 2019-08-23 Sequence comparison method, sequence correction method and device thereof

Country Status (1)

Country Link
CN (1) CN112397148B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421608B (en) * 2021-07-03 2023-12-01 南京世和基因生物技术股份有限公司 Construction method of liver cancer early screening model, detection device and computer readable medium
CN116564414B (en) * 2023-07-07 2024-03-26 腾讯科技(深圳)有限公司 Molecular sequence comparison method and device, electronic equipment, storage medium and product

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102789553A (en) * 2012-07-23 2012-11-21 中国水产科学研究院 Method and device for assembling genomes by utilizing long transcriptome sequencing result
CN103955630A (en) * 2014-03-26 2014-07-30 田埂 Method for preparing reference database and performing target area sequence alignment on to-be-tested free nucleic acid samples
CN105303068A (en) * 2015-10-27 2016-02-03 华中农业大学 Reference genome and de novo assembly combination based next-generation sequencing data assembly method
CN107103206A (en) * 2017-04-27 2017-08-29 福建师范大学 The DNA sequence dna cluster of local sensitivity Hash based on standard entropy
CN107229842A (en) * 2017-06-02 2017-10-03 肖传乐 A kind of three generations's sequencing sequence bearing calibration based on Local map
CN107895104A (en) * 2017-11-13 2018-04-10 深圳华大基因科技服务有限公司 Assess and verify the method and apparatus of the sequence assembling result of three generations's sequencing
CN109411020A (en) * 2018-11-01 2019-03-01 中国水产科学研究院 The method for carrying out whole genome sequence filling-up hole using long sequencing read

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016176847A1 (en) * 2015-05-06 2016-11-10 安诺优达基因科技(北京)有限公司 Reagent kit, apparatus, and method for detecting chromosome aneuploidy
US20190080045A1 (en) * 2017-09-13 2019-03-14 The Jackson Laboratory Detection of high-resolution structural variants using long-read genome sequence analysis

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102789553A (en) * 2012-07-23 2012-11-21 中国水产科学研究院 Method and device for assembling genomes by utilizing long transcriptome sequencing result
CN103955630A (en) * 2014-03-26 2014-07-30 田埂 Method for preparing reference database and performing target area sequence alignment on to-be-tested free nucleic acid samples
CN105303068A (en) * 2015-10-27 2016-02-03 华中农业大学 Reference genome and de novo assembly combination based next-generation sequencing data assembly method
CN107103206A (en) * 2017-04-27 2017-08-29 福建师范大学 The DNA sequence dna cluster of local sensitivity Hash based on standard entropy
CN107229842A (en) * 2017-06-02 2017-10-03 肖传乐 A kind of three generations's sequencing sequence bearing calibration based on Local map
CN107895104A (en) * 2017-11-13 2018-04-10 深圳华大基因科技服务有限公司 Assess and verify the method and apparatus of the sequence assembling result of three generations's sequencing
CN109411020A (en) * 2018-11-01 2019-03-01 中国水产科学研究院 The method for carrying out whole genome sequence filling-up hole using long sequencing read

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王春宇.生物高通量测序片段拼接与分子标记识别算法研究.《中国博士学位论文全文数据库 信息科技辑》.2016,正文全文. *

Also Published As

Publication number Publication date
CN112397148A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
CN112397148B (en) Sequence comparison method, sequence correction method and device thereof
WO2018218788A1 (en) Third-generation sequencing sequence alignment method based on global seed scoring optimization
KR101313087B1 (en) Method and Apparatus for rearrangement of sequence in Next Generation Sequencing
WO2017143585A1 (en) Method and apparatus for assembling separated long fragment sequences
CN105303068A (en) Reference genome and de novo assembly combination based next-generation sequencing data assembly method
WO2018218787A1 (en) Third-generation sequencing sequence correction method based on local graph
CN110021355B (en) Haploid typing and variation detection method and device for diploid genome sequencing segment
US20150178446A1 (en) Iterative clustering of sequence reads for error correction
Roberts et al. A preprocessor for shotgun assembly of large genomes
CN102867134B (en) A kind of system and method that gene order fragment is spliced
JP2006075162A (en) Transcript mapping method of gene and system therefor
He et al. De novo assembly methods for next generation sequencing data
US20140121983A1 (en) System and method for aligning genome sequence
CN107784198B (en) Combined assembly method and system for second-generation sequence and third-generation single-molecule real-time sequencing sequence
US11250931B2 (en) Systems and methods for detecting recombination
CA2400890A1 (en) Method and system for the assembly of a whole genome using a shot-gun data set
CN115641911B (en) Method for detecting overlapping between sequences
CN115662523B (en) Group-oriented genome index representation and construction method and equipment
EP3663890B1 (en) Alignment method, device and system
CN108491687B (en) Scafffolding method based on contig quality evaluation classification and graph optimization
Heo Improving quality of high-throughput sequencing reads
CN115148290A (en) Hole filling method based on third-generation sequencing data
CN112786110B (en) Sequence assembling method and system
CN115602246B (en) Sequence alignment method based on group genome
CN111445949A (en) Method for annotating genome of high-altitude polyploid fish by using nanopore sequencing data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 430000 No.01, 19th floor, building C11, Wuhan software new town phase 2, No.8 Huacheng Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province

Applicant after: Wuhan hope group Biotechnology Co.,Ltd.

Address before: 430000 No.01, 19th floor, building C11, Wuhan software new town phase 2, No.8 Huacheng Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province

Applicant before: WUHAN NEXTOMICS BIOSCIENCES Co.,Ltd.

GR01 Patent grant
GR01 Patent grant