CN112397148A - Sequence comparison method, sequence correction method and device thereof - Google Patents

Sequence comparison method, sequence correction method and device thereof Download PDF

Info

Publication number
CN112397148A
CN112397148A CN201910787734.1A CN201910787734A CN112397148A CN 112397148 A CN112397148 A CN 112397148A CN 201910787734 A CN201910787734 A CN 201910787734A CN 112397148 A CN112397148 A CN 112397148A
Authority
CN
China
Prior art keywords
sequence
comparison
alignment
base
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910787734.1A
Other languages
Chinese (zh)
Other versions
CN112397148B (en
Inventor
胡江
韩悦
汪德鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Nextomics Biosciences Co ltd
Original Assignee
Wuhan Nextomics Biosciences Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Nextomics Biosciences Co ltd filed Critical Wuhan Nextomics Biosciences Co ltd
Priority to CN201910787734.1A priority Critical patent/CN112397148B/en
Publication of CN112397148A publication Critical patent/CN112397148A/en
Application granted granted Critical
Publication of CN112397148B publication Critical patent/CN112397148B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20Sequence assembly

Abstract

The invention provides a sequence comparison method, a sequence correction method and a device thereof, relating to the technical field of biological information. The sequence comparison method mainly comprises interval comparison, base comparison, sequence compression and comparison result optimized storage means, and realizes high-efficiency and high-accuracy comparison. The sequence correction method is based on comparison results, and further corrects low-quality regions through positive sequence scoring and reverse sequence backtracking, so that efficient and high-accuracy comparison is realized.

Description

Sequence comparison method, sequence correction method and device thereof
Technical Field
The invention relates to the technical field of biological information, in particular to a sequence comparison method, a sequence correction method and a sequence correction device.
Background
With the rapid development of high throughput sequencing technology, it has become an important means for studying biological problems such as species evolution and diagnosing disease causes.
The second generation sequencing technology mainly based on the Illumina sequencing platform has the characteristics of high data output, low cost and high base accuracy rate, and is widely applied to the field of genomics research. Genome assembly refers to reconstructing a complete genome sequence through a sequencing sequence result, and the accuracy and the integrity of an assembly result have important influence on the aspects of subsequent gene annotation analysis, SNP (single nucleotide polymorphism) and structural variation detection and the like. Due to the complex conditions of repetitive sequences, heterozygosity and the like, the short fragment sequences generated by the second-generation sequencing technology are difficult to assemble to obtain a complete genome sequence. Taking human genome as an example, 5% -10% of undefined complex regions still exist in the existing human reference genome sequence assembled by second-generation sequencing.
The third generation sequencing technology platform, represented by Pacific Biosciences and Oxford Nanopore Technologies, has the remarkable characteristic of long read length, can generate long fragment sequences larger than 10k, has no preference for region sequencing results with high GC content, and brings important revolution for the field of genome research. The third generation sequencing technology improves the sequence length, and simultaneously has the defect of high error rate (error rate is 15-20%) of the base sequence compared with the second generation sequencing technology (error rate is 0.1%), and the error information in the sequencing sequence has great influence on the subsequent data analysis. For the third generation sequencing data, the strategy of the Overlap-Layout-consensus (OLC) algorithm is usually adopted for genome assembly, but due to the defect of high sequence error rate, the assembly complexity and the resource consumption for calculating the interrelation between sequences are greatly increased. The existing genome assembly software aiming at the third generation sequencing data adopts a method of firstly correcting the sequencing data and then assembling the genome or a method of firstly assembling the genome and then correcting the assembly result. Both of the above methods involve mutual alignment between sequences and correct for erroneous information in sequencing data based on the alignment results. The existing alignment method has a large number of invalid alignments which affect the accuracy of correction, and the alignment result needs to occupy huge storage space, so that the alignment and correction steps occupy the longest time consumption in genome assembly, and the memory requirement is huge. Taking the assembly of mammalian genome third generation sequencing data as an example, the existing software can not deal with the assembly of ultra-large genome when tens of thousands of CPU cores are consumed, thus seriously hindering the wide application of the third generation sequencing technology. Therefore, development of a comparison and correction method for processing long-read long-sequencing data in a targeted manner is urgently needed, so that the computing resource consumption of the existing method is effectively reduced, the operation efficiency is improved, and the potential of the long-read long-sequencing technology in the field of genomics research is fully unlocked. The long read length sequencing includes, but is not limited to, single molecule real time Sequencing (SMRT) technology and nanopore sequencing technology.
Disclosure of Invention
The invention aims to provide a sequence comparison method, a sequence correction method and a sequence correction device, which can better reduce the number of invalid comparisons and improve the accuracy of sequence correction.
The invention provides a sequence comparison method, which comprises the following steps: s101, obtaining a sequencing sequence, screening a first sequence in the sequencing sequence according to a first preset data length, and screening a second sequence in the sequencing sequence according to a second preset length; s102, comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence; s103, calculating the coverage of the first reference sequence by the first comparison sequence, and screening the first reference sequence and the first comparison sequence based on the coverage to obtain a second reference sequence and a second comparison sequence; s104, calculating a comparison path between the second reference sequence and the second comparison sequence, and screening the second reference sequence and the second comparison sequence based on the edit distance to obtain a third reference sequence and a third comparison sequence.
Further, the method includes converting and storing sequences based on binary numbers, the sequences including one or more of the sequencing sequence, the first sequence, the second sequence, the first reference sequence, the first alignment sequence, the second reference sequence, the second alignment sequence, the third reference sequence, and the third alignment sequence;
preferably, the step of converting and storing the sequence based on the binary number includes: dividing the sequence into a plurality of base combinations according to a preset grouping; wherein the pre-set grouping comprises determining four bases adjacent to each other as a base combination; converting the base of the sequence into a binary number according to a preset conversion relation between the base and the binary number; and for the combination of less than four bases in the sequence, carrying out bit filling expansion on the combination of less than four bases by adopting a specified binary number to obtain a binary sequence meeting the preset grouping.
Further, the method comprises storage optimization of the alignment results: recording the comparison sequence number, the comparison direction, the comparison interval start of the comparison sequence, the comparison interval stop of the comparison sequence, the reference sequence number, the comparison interval start of the reference sequence and the comparison interval stop of the reference sequence to store the comparison result; preferably, the difference value of the serial numbers of the recorded comparison sequences, the difference value of the serial numbers of the reference sequences, the length of the comparison interval of the reference sequences and the length of the comparison interval of the comparison sequences are stored for comparison results; wherein the aligned sequences comprise one or more of a first aligned sequence, a second aligned sequence, and a third aligned sequence, and the reference sequence comprises one or more of a first reference sequence, a second reference sequence, and a third reference sequence.
Further, the method for storing the alignment result comprises: storing according to 4 bytes; each byte uses a 7-bit record value, and less than 7 bits are filled by using a specific binary number; the remaining 1 bit identifies whether the comparison is terminated using a specific binary number.
Further, in step S103: the coverage degree comprises window coverage degree and integral coverage degree; the window coverage degree is obtained by dividing the first reference sequence into a plurality of windows according to a preset first base number and calculating the coverage degree of the first comparison sequence to each window; the overall coverage is the average coverage of the first reference sequence; the method for screening the first reference sequence and the first comparison sequence based on the method comprises the steps of determining the first comparison sequence meeting a preset first coverage condition as the second comparison sequence until the overall coverage of the second comparison sequence on the first reference sequence reaches a preset first coverage; the first covering condition is that the window covering degree is smaller than a preset second covering degree and smaller than a first preset multiple of the integral covering degree; preferably, the coverage calculation and the screening are performed according to the sequence of the overlapping length of the first comparison sequence and the first reference sequence from long to short.
Further, step S103 includes filtering the chimera sequence, that is, removing the first reference sequence determined as the chimera sequence and the comparison information thereof, where the chimera sequence is a sequence that satisfies a preset second coverage condition in the first reference sequence; the second coverage condition is that at least one window exists, the window coverage of the window is less than a preset third coverage, and the window coverage of a specified number of windows adjacent to the window is greater than a preset fourth coverage; or at least one window exists, and the window coverage of the window is smaller than a preset fifth coverage; wherein the fifth coverage threshold is a second preset multiple of the average of the window coverage of a specified number of windows adjacent to the current window.
Further, step S103 further includes filtering the included sequence, that is, rejecting the first reference sequence determined as the included sequence and the alignment information thereof, wherein the included sequence is the first reference sequence satisfying the following conditions: at least one first alignment sequence exists, and the number of bases between the beginning of the alignment interval of the first alignment sequence and the first reference sequence and the number of bases between the end of the alignment interval of the first alignment sequence and the first reference sequence and the end of the first reference sequence are within a preset second base number range.
Further, in step S104, in the alignment interval between the second reference sequence and the second alignment sequence, the following steps are performed: s141, calculating an alignment path between the second reference sequence and the second alignment sequence by using a greedy algorithm, selecting an optimal alignment path and determining an editing distance of the optimal alignment path; s142, selecting a second alignment sequence not greater than a preset editing distance condition from the second alignment sequences as a third alignment sequence, wherein a corresponding reference sequence is a third reference sequence; the preset editing distance condition is a third preset multiple of the sum of the lengths of the first reference sequence and the second comparison sequence in the comparison interval; s143, the steps S141 to S142 are repeated until the coverage of the third reference sequence by the third alignment sequence reaches a preset sixth coverage.
Preferably, in step S141, the range of the constraint line in the calculation by the greedy algorithm is between the minimum constraint line and the maximum constraint line when the specified distance value is subtracted from the maximum extension distance under the constraint of the edit distance of the comparison path.
Preferably, in step S143, the interval of the coverage is calculated by determining, according to the optimal comparison path, a comparison interval of a third comparison sequence and a third reference sequence by backtracking, searching for an initial position where a first preset third base number is completely matched and an end position where a last preset third base number is completely matched in the comparison interval, and determining the initial position and the end position as the initial position and the end position of the interval.
The invention provides a sequence correction method, which comprises the following steps: s201, comparing the sequencing fragments to obtain a reference sequence and a comparison sequence; s202, dividing the reference sequence by taking the k-mer as a unit, and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments; s203, calculating scores of all bases at each position by taking the k-mer fragment as a unit from the positive sequence; s204, determining the last base as the highest scoring base at the last position from reverse order, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining a penultimate base according to the k-mer fragment corresponding to the penultimate base; and so on to obtain a k-mer scoring branch; s205, replacing the reference sequence by using the k-mer score chain to obtain a first correction result.
Preferably, the sequence correction method has a k value of 3.
Further, in step S203, max { score (p-1, b) + count is set according to the formula score (p, b) ═ max { (p-1, b) + countk-merCalculate the score for all bases at each position, } -Cx 0.3; wherein score (p, b) is the base score at position p of the reference sequence, and b is base A, T, G, C or a deletion, countk-merThe number of occurrences of the k-mer scoring strands in the aligned sequence is determined; c is at the p position of the reference sequenceCoverage of the reference sequence by the aligned sequences.
Further, the correction method further includes: s206, for the low-quality sequence in the first correction result, using the alignment interval sequence of the alignment sequence covering the low-quality sequence in the alignment result of the step S201 as a seed sequence, and performing correction by using a directed graph algorithm based on the seed sequence; replacing the low quality sequence with the corrected sequence to obtain a second correction result; the low-quality sequence is a sequence which meets all conditions that the number of continuous low-quality bases is not less than a preset first value, the longest sequence with the interval average quality value less than a preset second value is taken as the starting point, the interval length exceeds a preset third value, and the number of continuous high-quality bases in the interval is not more than the preset first value.
Preferably, in the correction of the directed graph algorithm, the number of the seed sequences does not exceed a preset fifth value, and preferably, the seed sequences included in the aligned sequences with the length sequence in the middle of the preset fifth value are arranged.
Further, in step S206, the specific steps of correcting the directed graph algorithm are as follows: s261, optionally drawing a directed graph by taking one seed sequence as a motif; s262, calculating the base score in the seed sequence according to the score matrix, and updating the directed graph; s263, topological sorting is carried out on the directed graph; and S264, determining the base of each node as the base with the highest occurrence probability on the node according to the topological sequencing result, and obtaining a corrected sequence.
Further, in step S262, the following steps are performed for all the seed sequences: a. initializing a node score according to the formula score (x) max { score (x-1) } -GP, where score (x) is the node score, x is the node index, and GP is the gap penalty; setting the node score without the former node as 0; b. traversing each base of the seed sequence, updating a scoring matrix according to a formula,
Figure BDA0002177874370000041
wherein x is a node index, y is a base index in the seed sequence, GP is a gap penalty, and M is a mismatch or match score; c. taking the highest scoring node with the out degree of 0 as a starting point, selecting the predecessor node with the highest score, backtracking, and taking the node index as 0 or the base index as 0 as an end point; d. if the base index of the end point is more than 0, adding a base positioned before the base index of the end point in the seed sequence into a directed graph; and if the base index of the starting point is smaller than the length of the seed sequence, adding a base positioned after the base index of the starting point in the seed sequence into the directed graph.
Further, before the directed graph algorithm correction, the first correction result is filtered according to any one of the following conditions, that is, the directed graph algorithm correction is not performed on the first correction result satisfying any one of the following conditions: a. the accumulated length of the low-quality sequence exceeds the first correction result of a preset fourth value; b. the first correction result of which the number of the seed sequences is smaller than a preset first numerical value; c. and the comparison sequence with the length exceeding a preset fourth value accounts for the first correction result with the length exceeding a fourth preset multiple.
Further, the correction method further includes: s207, correcting again by using the seed sequence as the alignment sequence and the second correction result as the reference sequence by using the methods of the steps S201, S202, S203 and S204 to obtain a third correction result.
Further, in step S201, the comparison is performed by using the comparison method described in the foregoing steps S101, S102, S103, and S104; wherein the reference sequence comprises one or more of the first reference sequence, the second reference sequence, and the third reference sequence, and the aligned sequence comprises one or more of the first aligned sequence, the second aligned sequence, and the third aligned sequence.
The invention provides a sequence comparison device, which comprises: the sequence acquisition module is used for acquiring a sequencing sequence, screening a first sequence in the sequencing sequence according to a first preset data length, and screening a second sequence in the sequencing sequence according to a second preset length; the first comparison module is used for comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence; the second comparison module is used for calculating the coverage degree of the first comparison sequence to the first reference sequence, screening the first reference sequence and the first comparison sequence based on the coverage degree, and obtaining a second reference sequence and a second comparison sequence; and the third comparison module is used for calculating a comparison path between the second reference sequence and the second comparison sequence, screening the second reference sequence and the second comparison sequence based on the edit distance, and obtaining a third reference sequence and a third comparison sequence.
The invention provides a sequence correction device, which comprises: the sequence comparison module is used for comparing the plurality of sequencing fragments to obtain a reference sequence and a comparison sequence; the sequence dividing module is used for dividing the reference sequence and the comparison sequence by taking the k-mer as a unit to obtain a k-mer segment; a score calculation module for calculating the scores of all bases at each position from the positive sequence in units of k-mer fragments; a base determination module, configured to determine, from reverse order, a last base as a highest-scoring base at a last position, and determine a penultimate base according to the k-mer fragment corresponding to the last base; determining a penultimate base according to the k-mer fragment corresponding to the penultimate base; and so on to obtain a k-mer branch chain; and the correction module is used for replacing the reference sequence by using the k-mer score chain to obtain a first correction result.
The invention provides a sequence comparison method and a device, wherein the method comprises the following steps: s101, obtaining a sequencing sequence, screening a first sequence in the sequencing sequence according to a first preset data length, and screening a second sequence in the sequencing sequence according to a second preset length; s102, comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence; s103, calculating the coverage of the first reference sequence by the first comparison sequence, and screening the first reference sequence and the first comparison sequence based on the coverage to obtain a second reference sequence and a second comparison sequence; s104, calculating a comparison path between the second reference sequence and the second comparison sequence, and screening the second reference sequence and the second comparison sequence based on the editing distance to obtain a third reference sequence and a third comparison sequence. According to the invention, through the sequence comparison mode, interval comparison is carried out firstly, then base comparison is carried out, and the reference sequence and the comparison sequence are screened for multiple times, so that the number of invalid comparisons can be reduced, the comparison accuracy is improved, and the data storage and operation space is reduced. In addition, the present invention provides the alignment result between the sequenced sequences by the above-mentioned method, and can be combined with various correction methods for performing correction based on the alignment result, and the sequence correction method is not limited.
The invention provides a sequence correction method and a device, wherein the method comprises the following steps: s201, comparing the sequencing fragments to obtain a reference sequence and a comparison sequence; s202, dividing the reference sequence by taking the k-mer as a unit, and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments; s203, calculating scores of all bases at each position by taking the k-mer fragment as a unit from the positive sequence; s204, determining the last base as the highest scoring base at the last position from reverse order, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining a penultimate base according to the k-mer fragment corresponding to the penultimate base; and so on to obtain a k-mer scoring branch; s205, replacing the reference sequence by using the k-mer score chain to obtain a first correction result. According to the invention, sequencing fragments are compared firstly by the sequence correction mode, and then a correction result is obtained based on the k-mer fragment, so that the correction result bias caused by counting single base comparison can be avoided, the accuracy of sequence correction is improved, and the processing efficiency of sequence correction can be improved. In addition, the above-described mode provided by the present invention is based on the correction of the sequence alignment result, and the alignment method for obtaining the alignment result is not limited, and can be combined with various sequence alignment methods.
It is understood that the sequence alignment method and the sequence correction method provided by the present invention can be combined to improve the accuracy of sequence correction.
It should be noted that, for the sake of clarity, the sequence is marked by using the first, second, third, etc., the steps are numbered by using S101, S102, S201, S202, etc., and the preset conditions or preset values are marked by using the first, second, third, etc. Wherein each sequence is described in the form of a set, meaning that each sequence in the set is operated on. For example, in step S102, the first sequence is aligned with the second sequence, and all sequences in the first sequence are sequentially aligned with all sequences in the second sequence. For another example, in step S202, the dividing of the reference sequences and the corresponding dividing of the alignment sequences are performed by performing k-mer dividing on each reference sequence and performing corresponding dividing on all alignment sequences having an alignment relationship with the reference sequences.
In addition, the sequence alignment method, the sequence correction method, the device and the equipment thereof provided by the invention are suitable for processing the long read length sequencing result. The source of the long read length sequencing result is not limited, and the source is not limited to the third generation sequencing technology. The third generation sequencing technology platform is not limited to the sequence, sequence II platform of Pacific Biosciences, MinION, PromethION, GridION platform of Oxford Nanopore Technologies.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a sequence alignment method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a sequence of binary format conversion according to a second embodiment of the present invention;
fig. 3 is a schematic diagram of the binary format comparison result optimized storage according to the third embodiment of the present invention;
FIG. 4 is a flowchart of a sequence correction method according to a fourth embodiment of the present invention;
fig. 5 is a schematic diagram of searching for a low quality sequence according to a fourth embodiment of the present invention;
FIG. 6 is a block diagram of a sequence alignment apparatus according to a seventh embodiment of the present invention;
fig. 7 is a block diagram of a sequence correction apparatus according to an eighth embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the specific values mentioned in the examples, such as the first base number 64, the first coverage 40, the first predetermined multiple 1.3, the first value 4, and the gap penalty 2, are only illustrative and should not be construed as limiting.
At present, the data volume required by the assembly of a super-large genome often reaches Tb level, and software (such as Falcon and Canu) in the prior art cannot be used at all; in view of the problems of large number of invalid alignments and low accuracy of the existing sequence correction schemes for large and ultra-large genomes, the sequence alignment and correction method and device provided by the embodiment of the invention can reduce the number of invalid alignments and improve the accuracy of sequence correction. The technology can be applied to the scenes of sequence alignment and/or sequence correction of the super large genome, wherein the super large genome generally means that the genome size exceeds 30 GB.
Example one
This example provides a method of sequence alignment. Referring to the flowchart of the sequence alignment method shown in FIG. 1, the method comprises the following steps:
s101, obtaining a sequencing sequence, screening a first sequence in the sequencing sequence according to a first preset data length, and screening a second sequence in the sequencing sequence according to a second preset length;
s102, comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence;
s103, calculating the coverage of the first reference sequence by the first comparison sequence, and screening the first reference sequence and the first comparison sequence based on the coverage to obtain a second reference sequence and a second comparison sequence;
s104, calculating a comparison path between the second reference sequence and the second comparison sequence, and screening the second reference sequence and the second comparison sequence based on the edit distance to obtain a third reference sequence and a third comparison sequence. The third reference sequence and the third alignment sequence are the alignment results of the sequencing sequence. The reference sequence and the comparison sequence have a comparison relation due to sequence similarity, and the interval with the comparison relation is a comparison interval, namely, a comparison result is reflected; wherein the reference sequence comprises one or more of a first reference sequence, a second reference sequence and a third reference sequence, and the aligned sequence comprises one or more of a first aligned sequence, a second aligned sequence and a third aligned sequence.
The sequence alignment method provided by the embodiment of the invention comprises the following steps: obtaining a sequencing sequence, screening a first sequence in the sequencing sequence according to a first preset data length, and screening a second sequence in the sequencing sequence according to a second preset length; comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence; calculating the coverage degree of the first comparison sequence to the first reference sequence, and screening the first reference sequence and the first comparison sequence based on the coverage degree to obtain a second reference sequence and a second comparison sequence; and calculating an alignment path between the second reference sequence and the second alignment sequence, and screening the second reference sequence and the second alignment sequence based on the editing distance to obtain a third reference sequence and a third alignment sequence. In this embodiment, through the above sequence comparison method, by performing two-step comparison and performing multiple screening on the reference sequence and the comparison sequence, the number of invalid comparisons can be reduced, the comparison accuracy can be improved, and the data storage and operation space can be reduced. In addition, the present embodiment provides the alignment result between the sequenced sequences in the above manner, and can be combined with various correction methods for performing correction based on the alignment result, and the sequence correction method is not limited.
In step S101 of this embodiment, the following may be referred to for specific implementation. Selecting a sequencing sequence with the data length larger than a first preset length (such as the longest 40 multiplied by the sequence length) in the sequencing sequence reads as a first sequence, and outputting the first sequence to a first sequence file; and selecting a sequencing sequence with a second preset length (for example, higher than 1k) from the sequencing sequences as a second sequence, and outputting the second sequence to a second sequence file.
In order to facilitate the subsequent multitask parallel operation, the first sequence file and the second sequence file may be cut, and the number of the cut copies is not limited in this embodiment.
In step S102 of this embodiment, based on the first sequence and the second sequence, an interval alignment method is provided, in which all reads in the first sequence and all reads in the second sequence are aligned pairwise to obtain a first reference sequence and a first alignment sequence. This step can be implemented using existing alignment software, such as minimap 2. In the prior art, only short sequences (e.g. less than 1K) are usually filtered, and all of the filtered sequences are aligned, i.e. the second sequence is aligned with the second sequence, so that the alignment between non-first sequences in the second sequence is included, such sequences are short (e.g. longest 40 × sequence length, 1K), and many in number, and the alignment results are relatively redundant. In this embodiment, only the first sequence and the second sequence are compared, so that redundant comparison can be avoided, the comparison efficiency can be improved, and the negative influence of the redundant comparison on the subsequent correction accuracy can be avoided.
In step S103 of this embodiment, in order to effectively reduce the data set of sequences for correction and reduce multiple alignments and alignment noises caused by repeated sequences and alignment errors in the genome, the first reference sequence and the first alignment sequence are screened based on the coverage to obtain a second reference sequence and a second alignment sequence with more accurate alignment results. Meanwhile, after coverage screening, the obtained coverage levels of the second reference sequences are relatively consistent, and subsequent correction is facilitated.
The coverage degree comprises window coverage degree and integral coverage degree; the window coverage is that the first reference sequence is divided into a plurality of windows according to a preset first base number (such as 64), and the coverage of the first comparison sequence to each window is calculated; the overall coverage is the average coverage of the first reference sequence. The rule for screening the first reference sequence and the first comparison sequence based on the coverage degree is as follows: determining the first alignment sequence meeting the preset first coverage condition as a second alignment sequence, wherein the overall coverage of the second alignment sequence to the first reference sequence reaches the preset first coverage (such as 40); the first coverage condition is that the window coverage is smaller than a preset second coverage (such as 50) and smaller than a first preset multiple (such as 1.3) of the overall coverage. In order to more efficiently perform coverage calculation and screening of the first reference sequence and the first comparison sequence, the comparison interval length between the first comparison sequence and the first reference sequence is operated from long to short. Specific implementations can be referred to as follows.
S131: calculating the length of an alignment interval between the first alignment sequence and the first reference sequence, and arranging the first alignment sequence from long to short according to the alignment interval; and sequencing the first comparison sequences with the same comparison interval length from large to small according to the numbers of the first comparison sequences.
S132: the first reference sequence was divided into a plurality of windows of 64 bases in length, and the window coverage and the entire coverage were initialized to 0.
S133: and for each window of the current first reference sequence, sequentially searching for a first comparison sequence according to the S131 arrangement sequence. If the window coverage of the window is less than the preset second coverage 50 and less than a first preset multiple 1.3 of the overall coverage of the current first reference sequence, the first comparison sequence is considered to be effective, the first comparison sequence is output, and the window coverage of the window corresponding to the first comparison sequence is added with 1; otherwise, the first comparison sequence is redundant and discarded. Until the overall coverage of the current first reference sequence reaches the preset first coverage 40, which means that the first reference sequence has enough first comparison sequences, discarding all subsequent first comparison sequences.
Before step S103, the chimera sequence and the contained sequence in the first reference sequence are further filtered separately. And through filtering, the comparison results with low reliability and redundancy in the first reference sequence and the first comparison sequence are further removed.
The chimera sequence is a sequence which meets a preset second covering condition in the first reference sequence; wherein the second coverage condition is that at least one window exists, the window coverage of the window is less than a preset third coverage (e.g., 3), and the window coverage of a specified number of windows adjacent to the window is greater than a preset fourth coverage (e.g., 20); or at least one window exists, and the window coverage of the window is smaller than the preset fifth coverage; wherein the fifth coverage threshold is a second preset multiple (e.g., 1/5) of the average of the window coverage for a specified number of windows adjacent to the current window.
The specific implementation can refer to the following contents: if the window coverage of a certain window is less than the third coverage 3 and the window coverage of a plurality of nearby adjacent windows is greater than the fourth coverage 20 for any first reference sequence; or the window coverage of a window is less than a second preset multiple 1/5 of the average of the window coverage of a plurality of nearby adjacent windows; defining the first reference sequence as a chimeric sequence, and discarding the first reference sequence and the alignment information thereof.
The inclusion sequence is a first reference sequence satisfying the following conditions: at least one first alignment sequence exists, and the number of bases between the beginning of the alignment interval of the first alignment sequence and the first reference sequence and the number of bases between the end of the alignment interval of the first alignment sequence and the first reference sequence and the end of the first reference sequence are within a preset second number of bases (such as 300).
The specific implementation can refer to the following contents: if the starting point of the alignment interval of any one first reference sequence is within 300bp of the second base number from the starting point of the first reference sequence, and the end point of the alignment interval is within 300bp of the second base number from the end point of the first reference sequence, the first reference sequence is defined as the included sequence, and the first reference sequence and the alignment information thereof are discarded.
In step S104 of this embodiment, based on the interval alignment of step S102 and the coverage screening of step S103, a more precise base alignment is further performed in the alignment interval of the second reference sequence and the second aligned sequence. Base alignments are performed on the interval alignment results, rather than directly, because more efficient alignments can be performed based on less but more accurate data. Through base comparison, credible, unreliable-eliminated and redundant comparison information can be further reserved, comparison results are further reduced, and comparison accuracy is improved.
This example provides a specific implementation of the base alignment between the second reference sequence and the second aligned sequence, and within the alignment interval between the second reference sequence and the second aligned sequence, the following steps are performed:
s141, calculating an alignment path between the second reference sequence and the second alignment sequence by using a greedy algorithm, selecting an optimal alignment path and determining an editing distance of the optimal alignment path;
s142, selecting a sequence which is not more than a preset editing distance condition from the second comparison sequence as a third comparison sequence, wherein a corresponding reference sequence is a third reference sequence; wherein, the preset edit distance condition is a third preset multiple (such as 0.4) of the sum of the lengths of the first reference sequence and the second alignment sequence in the alignment interval;
s143, the steps S141 to S142 are repeated until the coverage of the third reference sequence by the third alignment sequence reaches a preset sixth coverage (e.g., 90).
Further, in order to accelerate the calculation speed of the greedy algorithm in step S141, the range of the constraint line in calculation of the greedy algorithm is limited to be between the minimum constraint line and the maximum constraint line when the specified distance value (e.g., 150) is subtracted from the maximum extension distance under the constraint of the edit distance of the comparison path. The reason for this is that all constraint lines k are calculated for each edit distance d, and most of the k lines are meaningless for finding the current optimal base alignment result.
Further, in order to further accurately calculate the coverage, the comparison interval for calculating the coverage is determined again as follows: and backtracking and determining a comparison interval of the third comparison sequence and the third reference sequence according to the optimal comparison path, searching a starting position with a first preset third base number completely matched and a final end position with a last preset third base number (such as 8) completely matched in the comparison interval, and determining the starting position and the end position as the starting position and the end position of the interval. The problem of higher error rate of the boundary (namely, the endpoint) between the comparison areas can be effectively solved through the steps.
The specific implementation can refer to the following steps (1) to (4):
(1) using a Diff greedy algorithm for evaluating the k ∈ [ k ]min,kmax]Within the constraint range of (3), calculating the edit distance of each path, selecting the optimal comparison path and determining the edit distance. Wherein k isminThe constraint line is the smallest k-line when the maximum value of the maximum extension distance minus the specified distance value 150 is found under the constraint of the current edit distance d. k is a radical ofmaxThe constraint line is the largest k-line when the maximum value of the maximum extension distance minus the specified distance value 150 is found under the constraint of the current edit distance d. The optimal editing distance is the editing distance corresponding to the minimum number of editing operations.
(2) And judging whether the optimal editing distance is larger than a preset editing distance condition. The default edit distance condition is (L1+ L2) × 0.4, where L1 is the data length of the second reference sequence and L2 is the data length of the second aligned sequence.
If so, the second reference sequence and the second comparison sequence are in error comparison, and the second comparison sequence corresponding to the current optimal editing distance is discarded; if not, the second reference sequence and the second comparison sequence are correctly compared, the second comparison sequence corresponding to the current optimal editing distance is reserved, namely the third comparison sequence, and the third reference sequence corresponds to the second comparison sequence.
(3) Backtracking to a comparison interval according to the optimal editing distance, searching from front to back in the comparison interval, and resetting a starting point of complete matching of the first third reference sequence and the third comparison sequence with the length of a preset third base number of 8 as a starting point of the comparison interval; searching from back to front, and setting the end point of the complete matching of the last third reference sequence and the third alignment sequence with the length of the preset third base number of 8 as the key point of the alignment interval.
(4) And calculating the coverage of the alignment interval until the coverage of the third reference sequence by the third alignment sequence reaches a preset sixth coverage 90.
In summary, the sequence alignment method provided in this embodiment effectively reduces invalid alignments and eliminates erroneous alignments by performing two alignments and multiple screenings, and realizes fast, accurate and efficient alignments.
Example two
In practical applications, the data size of the sequencing sequence is very large, and especially for large and ultra-large genomes, a large hard disk storage space is required. The specific sequence is converted into a binary form, so that the compression can be effectively carried out, and the storage space is saved. Based on this, the sequence comparison method provided in this embodiment further includes: converting and storing the sequence based on the binary number; the sequences include one or more of the sequencing sequence, the first sequence, the second sequence, the first reference sequence, the first alignment sequence, the second reference sequence, the second alignment sequence, the third reference sequence, and the third alignment sequence of example 1.
The steps comprise: (1) dividing the sequence into a plurality of base combinations according to a preset group; the preset grouping strategy comprises the steps of determining four adjacent bases as a base combination; (2) converting the base of the sequence into a binary number according to a preset conversion relation between the base and the binary number; (3) for the combination of less than four bases in the sequence, adopting the appointed binary number to carry out bit complement expansion on the combination of less than four bases to obtain a binary sequence meeting grouping; (4) the binary sequences are stored in accordance with the divided base combinations.
Specifically, the sequence contains four forms of adenine (a), cytosine (C), guanine (G) and thymine (T), each base requiring 1 byte (8 bits) to store data. In this embodiment, the bases can be first converted in binary form, such as: a-00, C-01, G-10, T-11, i.e. each base is converted to a binary representation occupying two bits. Then, storing four adjacent bases as a group to form a byte; therefore, the sequencing sequence compressed by adopting the binary form is about one fourth of the size of the original data, and the occupied memory space is effectively reduced.
For sequences in which the sequence length is a multiple of four in the sequenced sequence, every four base combinations can be represented as a complete one byte form. For sequences with the length not being a multiple of four, the base combination of the last group cannot occupy eight bits, and the base combination of the last group can be subjected to bit-filling expansion by 00 so as to meet the form of eight bits, namely one byte.
For ease of understanding, an example of binary format conversion of a sequencing sequence is given below:
such as the original sequencing sequence:
>6 36535
TTTGCTATTAGT
>252 33213
AAGTATTA AT
the original sequencing sequence is converted into a binary format in the above way, the sequence with the number of 636535 is converted into 111111100111001111001011, and the sequence with the number of 25233213 is converted into 000010110011110000110000. As shown in fig. 2, different memory cells are filled with different backgrounds in fig. 2.
Meanwhile, the data can be compressed and decompressed by adopting bit operation, the consumption of the bit operation on a CPU is low, and the writing and reading speed can be improved.
EXAMPLE III
The comparison result can be stored by recording the characteristics of the comparison interval, so as to reduce the storage space as much as possible. The sequence alignment method provided by the present embodiment further comprises: recording the comparison sequence number, the comparison direction, the comparison interval start of the comparison sequence, the comparison interval stop of the comparison sequence, the reference sequence number, the comparison interval start of the reference sequence and the comparison interval stop of the reference sequence to store the comparison result; preferably, the difference value of the serial numbers of the recorded alignment sequences, the difference value of the serial numbers of the reference sequences, the length of the alignment interval of the reference sequences and the length of the alignment interval of the alignment sequences are stored for comparison results; wherein the aligned sequences comprise one or more of the first aligned sequence, the second aligned sequence, and the third aligned sequence of example one, and the reference sequence comprises one or more of the first reference sequence, the second reference sequence, and the third reference sequence. In addition, the storage is performed by 4 bytes; moreover, each byte uses a 7-bit record value, and less than 7 bits are filled by using a specific binary number; the remaining 1 bit identifies whether the comparison is terminated using a specific binary number.
For ease of understanding, reference may be made to the following exemplary description. The alignment result of the first reference sequence and the first comparison sequence may include the following alignment information: a first comparison sequence number (first column), an alignment direction (second column; which indicates the direction of alignment, stored in 8 bits per byte, the first 7 bits being 0, the 8 th bit being stored as 0 or 1, 1 indicating the forward direction, 0 indicating the reverse direction), a first comparison sequence alignment interval start (third column), a first comparison sequence alignment interval end (fourth column), a first reference sequence number (fifth column), a first reference sequence alignment interval start (sixth column), and a first reference sequence alignment interval end (seventh column).
In practical applications, the alignment result of the first reference sequence and the first comparison sequence can be further processed: the first column records the difference value between the current first comparison sequence number and the last comparison sequence number, if the difference value is less than zero, the numerical value of the second column is added with two, the fifth column records the difference value between the current first reference sequence number and the last first reference sequence number, if the difference value is less than zero, the numerical value of the second column is added with four, the numerical value of the third column and the sixth column is unchanged, the fourth column records the length of the comparison interval of the first comparison sequence, the seventh column records the difference value between the length of the comparison interval of the first comparison sequence and the first reference sequence, and if the difference value is less than zero, the numerical value of the second column is added with eight.
Because the current stored value is stored in 4 bytes (32 bits), the value in the output result is generally very small, only 2 bytes or less are needed, and the bits of the upper bits are all zero and are invalid bits. Therefore, in order to further reduce the storage space, the following compression process may be performed on the values of each column in the first comparison sequence: and dividing every seven bit positions as a group, discarding the combination with the median value of 0 in the front group, setting the highest positions of other groups of the reserved group except the highest position of the last group as 0 as 1, and filling 0 in the last group with less than seven bit positions to store the byte combination after the numerical value compression. The size of the first comparison result after the compression processing in the above manner is 1/3 which is usually the original size, and obviously occupies less storage space.
For ease of understanding, reference may be made to the following specific column number, numeric value, binary example.
For example, two records in the first alignment result are taken as an example for the compression.
And (3) comparing the results:
Figure BDA0002177874370000121
the simplified result is:
Figure BDA0002177874370000122
the binary storage can refer to fig. 3, which specifically includes:
Figure BDA0002177874370000123
Figure BDA0002177874370000131
the double underline highlighted digits in the binary stored sequence are padding and identification digits.
Example four
The present embodiment provides a sequence correction method. Referring to the flowchart of the sequence correction method shown in fig. 4, the method includes the following steps:
s201, comparing the sequencing fragments to obtain a reference sequence and a comparison sequence;
s202, dividing the reference sequence by taking the k-mer as a unit, and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments;
s203, calculating scores of all bases at each position by taking the k-mer fragment as a unit from the positive sequence;
s204, determining the last base as the highest scoring base at the last position from reverse order, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining a penultimate base according to a k-mer fragment corresponding to the penultimate base; and so on to obtain a k-mer scoring branch;
s205, replacing the reference sequence by using the k-mer score chain to obtain a first correction result.
Where k-mer, monomeric unit (mer), corresponds to nt or bp, and 100mer DNA corresponds to 100nt per strand, the entire strand is 100 bp. Reads of length m can be divided into m-k +1 k-mers. Taking k-mer as 3 for example, dividing the sequence to be corrected by taking k-mer as 3 to obtain a plurality of 3-mer fragments, for example, sites 1, 2 and 3 are the first 3-mer, sites 2, 3 and 4 are the second 3-mer, and so on.
The sequence correction method provided by the embodiment of the invention comprises the following steps: comparing the sequencing fragments to obtain a reference sequence and a comparison sequence; dividing the reference sequence by taking the k-mer as a unit, and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments; calculating the score of all bases at each position from the positive sequence in units of k-mer fragments; determining the last base as the highest scoring base at the last position from reverse order, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining a penultimate base according to the k-mer fragment corresponding to the penultimate base; and so on to obtain a k-mer scoring branch; replacing the reference sequence with a k-mer scoring chain to obtain a first correction result. In this embodiment, the sequencing fragments are compared first by the sequence correction method, and then the correction result is obtained for the sequence based on the k-mer fragment, so that the bias of the correction result caused by counting only single base comparison can be avoided, the accuracy of sequence correction can be improved, and the processing efficiency of sequence correction can be improved. In addition, the above-mentioned manner provided in this embodiment is based on the calibration of the sequence alignment result, and the alignment method for obtaining the alignment result is not limited, and can be combined with various sequence alignment methods.
In step S203, the score of all bases at each position is calculated by the following calculation formula:
score(p,b)=max{score(p-1,b)+countk-mer}-C×0.3
where score (p, b) is the score for the base at position p of the reference sequence, and b is base A, T, G, C or a deletion, countk-merIs the number of occurrences of k-mer fragments in the aligned sequences; c is the coverage of the reference sequence by the aligned sequences at position p of the reference sequence.
In a particular implementation, the value of k may be 3, i.e., the k-mer is a 3-mer.
Further, the long-read-length sequencing technology is considered to have the condition of uneven accuracy distribution. Taking the third generation sequencing results as an example, the accuracy of some intervals is extremely low (< 70%), the alignment results in the intervals are poor in reliability, and therefore the accuracy of the corrected sequences obtained by performing correction based on the alignment results is low. By finding the interval of low accuracy and correcting it again, the accuracy can be further improved. Therefore, the sequence correction method may further include:
step S206, regarding the low-quality sequence in the first correction result, using the alignment interval sequence of the alignment sequence covering the low-quality sequence in the alignment result of the step S201 as a seed sequence, and performing correction by using a directed graph algorithm based on the seed sequence; replacing the low quality sequence with the corrected sequence to obtain a second correction result;
wherein, the low-quality sequence is the longest sequence starting with the low-quality base with the number of continuous bases not less than the preset first value (e.g. 4), the interval average quality value less than the preset second value (e.g. 40), the interval length exceeding the preset third value (e.g. 16), and the number of continuous high-quality bases in the interval not more than the preset first value (e.g. 4).
In a specific implementation, the base quality value Q is defined as Q ═ count3-merC × 100; determining bases with the base matrix quantity value larger than a preset first numerical value 4 as low-quality bases, and determining bases with the base quality value not larger than the first numerical value 4 as high-quality bases; the interval average quality value is the average of all base quality values within the interval.
The low-quality sequence in the first corrected result can be found based on the low-quality base and the high-quality base with reference to the following steps (a) to (c):
(a) the first corrected sequence was subjected to low quality base search in reverse order and the first low quality base position lq _ s was recorded.
(b) And (4) taking the position lq _ s as an initial position, continuing low-quality base searching in a reverse order, recording the number lq _ c of the searched continuous low-quality bases, and judging the relation between lq _ c and the first numerical value 4.
If lq _ c is not less than the first value 4, continuing to search in reverse order, and if the high-quality base is searched, recording the number hq _ c of continuous high-quality bases. If hq _ c is larger than the first value 4, initializing lq _ c and hq _ c as 0, searching low-quality bases again in a reverse order, and repeating the steps a and b. If hq _ c is not larger than the first value 4, initializing hq _ c to be 0, continuously searching in a reverse order, continuously recording lq _ c, and executing the step (c);
if lq _ c is less than the first value of 4, initializing lq _ c to 0, searching again for low-quality bases in reverse order, and repeating the steps a and b.
(c) And calculating the interval average quality value, and judging whether the interval length lq _ l is not less than the third numerical value 16 when the interval average quality value is less than or equal to the maximum value of the second numerical value 40.
If the value is larger than or equal to the third numerical value, recording the last base position lq _ e searched in reverse order as a termination position, and marking the sequence between the lq _ s position and the lq _ e position as a low-quality sequence. Meanwhile, initializing the length lq _ l of the low-quality interval, the number lq _ c of low-quality bases and the number hq _ c of high-quality bases to be 0, re-searching the low-quality bases in reverse order, and repeating the steps a, b and c until the searching of all reference sequences is completed.
If the number is less than the third numerical value, initializing the length lq _ l of the low-quality interval, the number lq _ c of the low-quality bases and the number hq _ c of the high-quality bases to be 0, searching the low-quality bases again according to a reverse order, and repeating the steps a, b and c until the searching of all the reference sequences is completed.
For ease of understanding, the low quality sequence may be found for the first correction result with reference to the following example, and in particular with reference to fig. 5:
Figure BDA0002177874370000151
where the lower case portion is a low quality base, the upper case portion is a high quality base, and the underlined portion is the sequence found for the low quality region.
In step S206, the specific steps of the directed graph algorithm for correction are:
s261, optionally a seed sequence, is mapped as a motif.
In this embodiment, in order to reduce the comparison of invalid data, before performing the directed graph algorithm correction, the first correction result may be filtered according to any one of the following conditions, that is, the directed graph algorithm correction is not performed on the first correction result satisfying any one of the following conditions: a. a first correction result that the accumulated length of the low-quality sequence exceeds a preset fourth value (such as 10000); b. a first correction result in which the number of seed sequences is less than a predetermined first value (e.g., 4); c. the length of the aligned sequence exceeding a predetermined fourth value (e.g., 10000) accounts for the first correction result exceeding a fourth predetermined multiple (e.g., 1/3). The filtered data is characterized by low reliability and overlong, and the bottleneck problem of extremely low calculation speed caused by overlong sequences in the directed graph algorithm can be solved.
Specifically, each low-quality sequence is traversed, and all alignment sequences covering the low-quality sequence are found by using the starting position and the end position of the low-quality sequence. And if the searched alignment sequences are more than 60, stopping searching the alignment sequences of the low-quality interval sequences.
Further, the searched comparison sequences are sorted according to the length of the sequences from long to short, and seed sequences are selected. In a preferred embodiment, the number of seed sequences does not exceed a predetermined fifth value (e.g., 30), and preferably the seed sequences are included in the aligned sequences with the length sequence at the middle predetermined fifth value (e.g., 30). And if the searched comparison sequences are less than 30, taking all the comparison sequences as seed sequences S.
Optionally drawing a histogram G as a motif in the seed sequence S; wherein, the Node of the directed graph is a base in the sequence, the directed Edge of the directed graph is a connecting line between any two continuous bases, and the initial Node of the directed graph is an initial base of the sequence.
And S262, calculating the base score in the seed sequence according to the score matrix, and updating the directed graph, namely adding the directed graph to all the seed sequences.
In a specific implementation, the following steps (1) to (4) are performed:
(1) initializing a node score according to the formula score (x) max { score (x-1) } -GP, where score (x) is the node score, x is the node index, and GP is the gap penalty (e.g., 2); the node score without predecessor nodes is set to 0.
(2) The present embodiment may traverse the nodes in the directed graph, traversing all predecessor nodes of each node on a per node basis. For any predecessor node, traversing each base in S, and updating the matrix score according to the following calculation formula:
Figure BDA0002177874370000161
wherein x is node index, y is base index in seed sequence, GP is gap penalty (e.g., 2), and M is mismatch score (e.g., -2) or match score (e.g., 1).
(3) And taking the highest-score node with the out degree of 0 as a starting point, selecting the predecessor node with the highest score, and backtracking by taking the node index as 0 or the base index as 0 as an end point.
Specifically, when the out-degree of a node is 0, the node is defined as a final node, and the score and the index of the final node are recorded. And after all the end nodes are found, finding out the end node with the maximum score, determining the end node as a backtracking start point, and recording the start node index and the corresponding base index. And circularly tracing back from the starting point, searching the predecessor node and the predecessor base with the highest score of any node x until the node index is 0 or the sequence index is 0, determining the node index as a backtracking end point, and recording the node index of the end point and the corresponding base index.
(4) If the base index of the end point is more than 0, adding a base positioned in the seed sequence before the base index of the end point into the directed graph; if the base index of the starting point is smaller than the length of the seed sequence, adding a base located after the base index of the starting point in the seed sequence to the directed graph.
The operation of adding the directed graph specifically comprises the following steps: skipping the current base if the current base is matched under the current node; if the base is mismatched, adding the current base into the base list contained in the current node; if the insertion is performed, the new node and the new edge are added into the directed graph.
S263, topological sorting is carried out on the directed graph; and performing topological sorting on the updated directed graph, so that if any pair of vertexes u and v in the updated directed graph satisfies an edge (u, v) epsilon E (G), u appears before v in the sorted directed graph.
And S264, determining the base of each node as the base with the highest occurrence probability on the node according to the topological sequencing result, and obtaining a corrected sequence.
In a specific implementation manner, according to the topological sorting result, the base with the largest occurrence number is selected on each node, so that the corrected low-quality sequence is determined, and the corresponding sequence of the first correction result is replaced to obtain the second correction result.
To further improve the accuracy, the sequence correction method may further include:
s207, using the seed sequence as the alignment sequence and the second calibration result as the reference sequence, and using the sequence calibration method of the foregoing steps S201 to S205, calibration is performed again to obtain a third calibration result. Details are not repeated.
In summary, the sequence correction method provided in the above embodiments can achieve sequence correction with high efficiency and high accuracy.
EXAMPLE five
In step S202 of example 4, the larger the k value is, the better the specificity of the k-mer fragment is, and the more accurate the correction is when k-mer segmentation is performed. However, the greater the k value, the less the probability of a perfectly consistent k-mer. When the sequencing accuracy is p, the probability of occurrence of a perfectly consistent k-mer is pk
As can be seen from the above formula for score (p, b), the score is related to the number of completely consistent k-mers, and if the number is too low, the number of sequences to be corrected is too small, which may seriously affect the calibration effect. Taking the average accuracy of the third-generation sequencing at the present stage as an example, the probability of complete agreement of 2-mers is 72.25%, the probability of complete agreement of 3-mers is 61.41%, and the probability of complete agreement of 4-mers is 52.20%. It is clear that the probability is too low for 4-mers and greater than 4-mers. To further illustrate the technical effect of k-mer partitioning, the algorithm is performed by way of example, as follows.
Table 1 shows exemplary data, aligned sequence fragment including Read 1-Read 7, set position 1 as the start site, position 5 as the end site, processing.
TABLE 1 data information
Figure BDA0002177874370000171
Example 1:
setting k-mer as 3, correcting according to the method of the embodiment, and determining the score chains from positions 1 to 5 as GGAGT in sequence.
Dividing by using k-mer as 2, correcting by referring to the method of the embodiment, and sequentially determining a correction chain from positions 1 to 5 as GGCGT.
The correction result at position 3 is different from the correction results. As before, the larger the k value, the better the specificity of the k-mer and the more accurate the calibration results. Thus, the 3-mer is the best choice.
Example 2:
the calibration is simply carried out according to the alignment times of single bases, the highest alignment time is selected as a calibrated sequence at each position, and the statistics of the alignment times of the single bases at each position are shown in Table 2.
TABLE 2 Single base number of alignments
Figure BDA0002177874370000172
The result of example 2 is GGCGT, which is consistent with the k-2 result in example 1. It can be seen that the number of single base alignments can not be accurately corrected by counting only the number of single base alignments.
In step S204 of example 4, the last base is determined as the highest scoring base at the last position from the reverse order, and the penultimate base is determined from the k-mer fragment corresponding to the last base; determining a penultimate base according to the k-mer fragment corresponding to the penultimate base; and the analogy is repeated to obtain the k-mer score chain.
For a more clear explanation, the calculation is performed by example 3, specifically as follows.
Table 3 shows exemplary data, aligned sequence fragments including Read 1-Read 5, set position 1 as the start position, and position 7 as the end position, and were corrected using the method of example 4. Wherein, the k-mer is set as 3-mer, and the effective coverage of each position is 5.
TABLE 3 data information
Figure BDA0002177874370000181
Using the method of this example, the k-mer fraction obtained is ACAGACC in order from positions 1 to 7.
Without backtracking, only a single base is considered, and the linkage between adjacent bases is not considered, i.e., the selection is made by the highest score. Position 3, the highest scoring base is a; position 4, the highest scoring base is G; position 5, the highest scoring base is T; position 6, the highest scoring base is G; at position 7, the highest scoring base is C. And the positions 3 to 7 of the finally determined score chain are AGTGC which is inconsistent with the result after backtracking, so that the backtracking effect is reflected.
As can be seen from the above example, when the method of this embodiment 4 is used for correction, the combination of the forward sequence and the current base is considered, that is, the most likely existing continuous small segments are considered as the correction result, so that the influence of counting only the number of single-base alignments on the preference of the correction result is effectively avoided.
EXAMPLE six
This example provides a sequence alignment method, which is performed by aligning the sequenced sequences using the method described in example one; based on the third reference sequence, the third aligned sequence, the sequence is corrected using the method described in example four. For convenience and simplicity of description, the specific working processes of the sequence alignment method and the sequence correction method described in this embodiment may refer to the corresponding processes in the first to fifth embodiments of the foregoing method, and are not described herein again.
The embodiment of the invention also provides an example of an actual application scenario, which refers to the following contents:
a human sample is sequenced by using a PacBio Sequal platform and a NanoPore PromethION platform standard process, all sequencing data of chromosome 20 are called as test data, and second generation data disclosed by the chromosome are used as standard data to evaluate the technical effects of the comparison method and the correction method disclosed by the invention, as shown in Table 4.
TABLE 4
Figure BDA0002177874370000182
Figure BDA0002177874370000191
The corrected yield refers to the proportion of data with read length exceeding 5kb in the original data after correction, the ' average reads length ' and the ' longest read length ' of Canu and Falcon are both directed at the corrected data, the ' coverage ratio of more than or equal to 99% refers to the coverage ratio of more than 99% of the length capable of being matched with the standard data, the ' similarity ratio of more than or equal to 97% refers to the coverage ratio of more than 97% of the similarity matched with the standard data, and the ' average similarity of more than or equal to 99% of the coverage "refers to the average similarity of more than 99% of the reads capable of being matched with the standard data and the standard data.
From the aforementioned results of the sequencing of chromosome 20, 34 first sequences were randomly extracted, totaling 1588683 bp. In the comparison, the method (optimization method) as described in embodiment 1, the method (non-optimization method) in which the line range is not limited in step S141 in embodiment 1, is used, respectively; in the correction, the method described in example 4 was used. The results show that the optimized method run time is 38.2s and the non-optimized method run time is 813.6 s.
EXAMPLE seven
Based on the sequence alignment method provided in the above embodiments, this embodiment provides a sequence alignment apparatus, referring to the structural block diagram of the sequence alignment apparatus shown in fig. 6, the apparatus includes:
a sequence obtaining module 301, configured to obtain a sequencing sequence, screen a first sequence in the sequencing sequence according to a first preset data length, and screen a second sequence in the sequencing sequence according to a second preset length;
a first comparison module 302, configured to compare the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence;
the second alignment module 303 is configured to calculate a coverage of the first reference sequence by the first alignment sequence, and screen the first reference sequence and the first alignment sequence based on the coverage to obtain a second reference sequence and a second alignment sequence;
the third alignment module 304 is configured to calculate an alignment path between the second reference sequence and the second alignment sequence, and screen the second reference sequence and the second alignment sequence based on the edit distance to obtain a third reference sequence and a third alignment sequence.
In one embodiment, the sequence alignment apparatus further comprises a sequence storage module, wherein the sequence storage module is configured to: converting and storing the sequence based on the binary number; the sequence comprises one or more of a sequencing sequence, a first sequence, a second sequence, a first reference sequence, a first alignment sequence, a second reference sequence, a second alignment sequence, a third reference sequence and a third alignment sequence; preferably, the step of converting and storing the sequence based on binary numbers comprises: dividing the sequence into a plurality of base combinations according to a preset group; wherein the presetting of the groups comprises determining four bases adjacent to each other as one base combination; converting the base of the sequence into a binary number according to a preset conversion relation between the base and the binary number; for the combination of less than four bases in the sequence, adopting a specified binary number to carry out bit-filling expansion on the combination of less than four bases to obtain a binary sequence meeting the preset grouping; the binary sequences are stored in accordance with the divided base combinations.
In one embodiment, the sequence alignment apparatus further comprises an alignment storage optimization module, wherein the alignment storage optimization module is configured to: recording the comparison sequence number, the comparison direction, the comparison interval start of the comparison sequence, the comparison interval stop of the comparison sequence, the reference sequence number, the comparison interval start of the reference sequence and the comparison interval stop of the reference sequence to store the comparison result; preferably, the difference value of the record comparison sequence numbers, the difference value of the reference sequence numbers, the difference value of the comparison interval lengths of the reference sequences and the comparison interval lengths of the comparison sequences are stored for comparison results; the alignment sequence comprises one or more of a first alignment sequence, a second alignment sequence and a third alignment sequence, and the reference sequence comprises one or more of a first reference sequence, a second reference sequence and a third reference sequence.
In an embodiment, the comparison storage optimization module is further configured to: storing according to 4 bytes; moreover, each byte uses 7 bits to record numerical value, and less than 7 bits are filled by using specific binary number; the remaining 1 bit uses a specific binary number to identify whether the comparison is terminated.
In an embodiment, in the second comparing module 303, the coverage includes a window coverage and an overall coverage; the window coverage is that the first reference sequence is divided into a plurality of windows according to a preset first base number, and the coverage of the first comparison sequence to each window is calculated; the overall coverage is the average coverage of the first reference sequence; the screening method comprises the steps of determining a first comparison sequence meeting a preset first coverage condition as a second comparison sequence, wherein the overall coverage of the second comparison sequence on a first reference sequence reaches a preset first coverage; the first coverage condition is that the window coverage is smaller than a preset second coverage and smaller than a first preset multiple of the overall coverage; preferably, the coverage calculation and screening are performed according to the sequence of the overlapping length of the first comparison sequence and the first reference sequence from long to short.
In an embodiment, the sequence comparison apparatus further includes a filter module for a chimera sequence, where the chimera sequence is a sequence that satisfies a preset second coverage condition in the first reference sequence; the second coverage condition is that at least one window exists, the window coverage of the window is less than a preset third coverage, and the window coverage of a specified number of windows adjacent to the window is greater than a preset fourth coverage; or at least one window exists, and the window coverage of the window is smaller than the preset fifth coverage; wherein the fifth coverage threshold is a second preset multiple of the average of the window coverage of the specified number of windows adjacent to the current window.
In one embodiment, the sequence comparison apparatus further comprises a filter module for filtering a contained sequence, wherein the contained sequence is a first reference sequence satisfying the following condition: at least one first comparison sequence exists, the number of bases between the beginning of the comparison interval of the first comparison sequence and the first reference sequence and the beginning of the first reference sequence and the number of bases between the end of the comparison interval of the first comparison sequence and the first reference sequence and the end of the first reference sequence are both within a preset second base number range.
In one embodiment, the third alignment module 304 performs the following steps within the alignment interval between the second reference sequence and the second alignment sequence: s141, calculating an alignment path between the second reference sequence and the second alignment sequence by using a greedy algorithm, selecting an optimal alignment path and determining an editing distance of the optimal alignment path; s142, selecting a second alignment sequence not greater than a preset editing distance condition from the second alignment sequences as a third alignment sequence, wherein a corresponding reference sequence is a third reference sequence; the preset editing distance condition is a third preset multiple of the sum of the lengths of the first reference sequence and the second comparison sequence in the comparison interval; s143, repeating the steps S141 to S142 until the coverage of the third reference sequence by the third alignment sequence reaches a preset sixth coverage;
preferably, in step S141, the constraint line range during calculation by the greedy algorithm is between the minimum constraint line and the maximum constraint line when the specified distance value is subtracted from the maximum extension distance under the constraint of the edit distance of the comparison path;
preferably, in step S143, the section of the coverage degree is calculated by determining, according to the optimal comparison path, a comparison section of the third comparison sequence and the third reference sequence by backtracking, searching for an initial position where the first preset third base number is completely matched and an end position where the last preset third base number is completely matched in the comparison section, and determining the initial position and the end position as the initial position and the end position of the section.
Example eight
Based on the sequence correction method provided in the foregoing embodiment, the present embodiment provides a sequence correction apparatus, referring to a structural block diagram of the sequence correction apparatus shown in fig. 7, the apparatus including:
a sequence alignment module 401, configured to perform alignment between multiple sequencing fragments to obtain a reference sequence and an alignment sequence;
a sequence dividing module 402, configured to divide a reference sequence by taking a k-mer as a unit, and perform corresponding division on the sequence by comparison to obtain a plurality of k-mer segments;
a score calculating module 403 for calculating scores of all bases at each position in units of k-mer fragments from the positive sequence;
a base determining module 404, configured to determine, from the reverse order, the last base as the highest scoring base at the last position, and determine the penultimate base according to the k-mer segment corresponding to the last base; determining a penultimate base according to a k-mer fragment corresponding to the penultimate base; and so on to obtain a k-mer scoring branch;
a correction module 405, configured to replace the reference sequence with the k-mer score chain to obtain a first correction result.
In one embodiment, in the score calculating module 403, max { score (p-1, b) + count is calculated according to the formula score (p, b) ═ max { (p, b) + countk-merCalculate the score for all bases at each position, } -C.times.0.3;
where score (p, b) is the score for the base at position p of the reference sequence, and b is base A, T, G, C or a deletion, countk-merIs the number of occurrences of k-mer fragments in the aligned sequences; c is the coverage of the reference sequence by the alignment sequence at the position of the reference sequence p; preferably, the value of k is 3.
In one embodiment, the sequence correction apparatus further includes: a low-quality sequence correction module, configured to use, as a seed sequence, an alignment region sequence of an alignment sequence covering a low-quality sequence in the alignment result of the sequence alignment module 401 for the low-quality sequence in the first correction result, and perform correction using a directed graph algorithm based on the seed sequence; replacing the low quality sequence with the corrected sequence to obtain a second correction result; the low-quality sequence is a sequence which meets all the following conditions, the longest sequence with the interval average quality value smaller than a preset second value and the interval length exceeding a preset third value is started by using low-quality bases with the number of continuous bases not smaller than a preset first value, and the number of continuous high-quality bases in the interval is not larger than the preset first value; preferably, the number of seed sequences in the correction of the directed graph algorithm does not exceed a preset fifth value, and the seed sequences included in the aligned sequences with the length sequence in the middle of the preset fifth value are preferably selected.
In an embodiment, the low quality sequence correction module is further configured to: s261, optionally drawing a directed graph by taking one seed sequence as a base sequence; s262, calculating the base score in the seed sequence according to the score matrix, and updating the directed graph; s263, topological sorting is carried out on the directed graph; and S264, determining the base of each node as the base with the highest occurrence probability on the node according to the topological sequencing result, and obtaining a corrected sequence.
In one embodiment, in step S262, the following steps are performed for all seed sequences: a. initializing a node score according to the formula score (x) max { score (x-1) } -GP, where score (x) is the node score, x is the node index, and GP is the gap penalty; setting the node score without the former node as 0;
b. traversing each base of the seed sequence, updating the scoring matrix according to a formula,
Figure BDA0002177874370000221
wherein x is a node index, y is a base index in the seed sequence, GP is a gap penalty, and M is a mismatch or match score;
c. taking the highest scoring node with the out degree of 0 as a starting point, selecting the predecessor node with the highest score, backtracking, and taking the node index as 0 or the base index as 0 as an end point; d. if the base index of the end point is more than 0, adding a base positioned in the seed sequence before the base index of the end point into the directed graph; and if the base index of the starting point is smaller than the length of the seed sequence, adding a base positioned after the base index of the starting point in the seed sequence into the directed graph.
In one embodiment, before performing the directed graph algorithm correction, the first correction result is filtered according to any one of the following conditions: a. the accumulated length of the low-quality sequence exceeds a first correction result of a preset fourth numerical value; b. a first correction result with the seed sequence number smaller than a preset first value; c. and the comparison sequence with the length exceeding a preset fourth value accounts for a first correction result with the comparison exceeding a fourth preset multiple.
In one embodiment, the sequence correction apparatus further includes: and the secondary correction module is used for correcting again by using the seed sequence as the alignment sequence and the second correction result as the reference sequence by using any one of the correction methods to obtain a third correction result.
In one embodiment, the sequence alignment module 401 is configured to perform alignment using any one of the alignment methods described above; wherein the reference sequence comprises one or more of a first reference sequence, a second reference sequence and a third reference sequence, and the alignment sequence comprises one or more of a first alignment sequence, a second alignment sequence and a third alignment sequence.
The relative steps of the modules and steps set forth in these embodiments do not limit the scope of the invention unless specifically stated otherwise. In all of the examples shown and described in these embodiments, any particular value should be construed as merely illustrative, and not restrictive, and thus other examples of exemplary embodiments may have different values.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, apparatus, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, those skilled in the art will appreciate that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (18)

1. A method of sequence alignment, the method comprising:
s101, obtaining a sequencing sequence, screening a first sequence in the sequencing sequence according to a first preset data length, and screening a second sequence in the sequencing sequence according to a second preset length;
s102, comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence;
s103, calculating the coverage of the first reference sequence by the first comparison sequence, and screening the first reference sequence and the first comparison sequence based on the coverage to obtain a second reference sequence and a second comparison sequence;
s104, calculating a comparison path between the second reference sequence and the second comparison sequence, and screening the second reference sequence and the second comparison sequence based on the edit distance to obtain a third reference sequence and a third comparison sequence.
2. The method according to claim 1, characterized in that it comprises:
converting and storing the sequence based on the binary number; the sequence comprises one or more of the sequencing sequence, the first sequence, the second sequence, the first reference sequence, the first alignment sequence, the second reference sequence, the second alignment sequence, the third reference sequence, and the third alignment sequence;
preferably, the step of converting and storing the sequence based on the binary number includes:
dividing the sequence into a plurality of base combinations according to a preset grouping; wherein the pre-set grouping comprises determining four bases adjacent to each other as one base combination;
converting the base of the sequence into a binary number according to a preset conversion relation between the base and the binary number;
and for the combination of less than four bases in the sequence, carrying out bit filling expansion on the combination of less than four bases by adopting a specified binary number to obtain a binary sequence meeting the preset grouping.
3. The method of claim 1, comprising optimizing the storage of the alignment results by:
recording the comparison sequence number, the comparison direction, the comparison interval start of the comparison sequence, the comparison interval stop of the comparison sequence, the reference sequence number, the comparison interval start of the reference sequence and the comparison interval stop of the reference sequence to store the comparison result;
preferably, the difference value of the serial numbers of the recorded alignment sequences, the difference value of the serial numbers of the reference sequences, the length of the alignment interval of the reference sequences and the length of the alignment interval of the alignment sequences are stored for comparison results;
wherein the aligned sequences comprise one or more of a first aligned sequence, a second aligned sequence, and a third aligned sequence, and the reference sequence comprises one or more of a first reference sequence, a second reference sequence, and a third reference sequence.
4. The method of claim 3, wherein the method of storing the alignment results comprises:
storing according to 4 bytes;
moreover, each byte uses 7 bits to record numerical value, and less than 7 bits are filled by using specific binary number; the remaining 1 bit uses a specific binary number to identify whether the comparison is terminated.
5. The method of claim 1, wherein in step S103,
the coverage degree comprises window coverage degree and integral coverage degree; the window coverage is to divide the first reference sequence into a plurality of windows according to a preset first base number, and calculate the coverage of the first comparison sequence to each window; the overall coverage is the average coverage of the first reference sequence;
the screening method comprises the steps of determining the first comparison sequence meeting a preset first coverage condition as the second comparison sequence until the overall coverage of the second comparison sequence on the first reference sequence reaches a preset first coverage; the first covering condition is that the window covering degree is smaller than a preset second covering degree and smaller than a first preset multiple of the integral covering degree;
preferably, the coverage calculation and the screening are performed according to the sequence of the overlapping length of the first comparison sequence and the first reference sequence from long to short.
6. The method of claim 5, further comprising filtering the chimeric sequence, wherein the chimeric sequence is,
sequences meeting a preset second coverage condition in the first reference sequence;
the second coverage condition is that at least one window exists, the window coverage of the window is less than a preset third coverage, and the window coverage of a specified number of windows adjacent to the window is greater than a preset fourth coverage; or at least one window exists, and the window coverage of the window is smaller than a preset fifth coverage; wherein the fifth coverage threshold is a second preset multiple of the average of the window coverage of the specified number of windows adjacent to the current window.
7. The method of claim 5, further comprising filtering the inclusion sequence, wherein the inclusion sequence is,
the first reference sequence satisfying the following condition: at least one first alignment sequence exists, and the number of bases between the beginning of the alignment interval of the first alignment sequence and the first reference sequence and the number of bases between the end of the alignment interval of the first alignment sequence and the first reference sequence and the end of the first reference sequence are within a preset second base number range.
8. The method of claim 1, wherein in step S104, during the alignment interval between the second reference sequence and the second alignment sequence, the following steps are performed:
s141, calculating a comparison path between the second reference sequence and the second comparison sequence by using a greedy algorithm, selecting an optimal comparison path and determining an editing distance of the optimal comparison path;
s142, selecting a second alignment sequence not greater than a preset editing distance condition from the second alignment sequences as a third alignment sequence, wherein a corresponding reference sequence is a third reference sequence; the preset editing distance condition is a third preset multiple of the sum of the lengths of the first reference sequence and the second comparison sequence in the comparison interval;
s143, the steps S141 to S142 are repeated until the coverage of the third reference sequence by the third alignment sequence reaches a preset sixth coverage;
preferably, in step S141, the constraint line range during calculation by the greedy algorithm is between the minimum constraint line and the maximum constraint line when the specified distance value is subtracted from the maximum extension distance under the constraint of the edit distance of the comparison path;
preferably, in step S143, the section for calculating the coverage is,
according to the optimal comparison path, backtracking and determining a comparison interval of a third comparison sequence and a third reference sequence, searching an initial position of complete matching of a first preset third base number and an end position of complete matching of a last preset third base number in the comparison interval, and determining the initial position and the end position as the initial position and the end position of the interval.
9. A method of sequence correction, the method comprising:
s201, comparing the sequencing fragments to obtain a reference sequence and a comparison sequence;
s202, dividing the reference sequence by taking the k-mer as a unit, and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments;
s203, calculating scores of all bases at each position by taking the k-mer fragment as a unit from the positive sequence;
s204, determining the last base as the highest scoring base at the last position from reverse order, and determining the penultimate base according to the k-mer fragment corresponding to the last base; determining a penultimate base according to the k-mer fragment corresponding to the penultimate base; and so on to obtain a k-mer scoring branch;
s205, replacing the reference sequence by using a k-mer score chain to obtain a first correction result;
preferably, the sequence correction method has a k value of 3.
10. The method of claim 9, wherein in step S203,
according to the formula score (p, b) ═ max { score (p-1, b) + countk-merCalculate the score for all bases at each position, } -C.times.0.3;
wherein score (p, b) is the score of the base at position p of the reference sequence, and b is base A, T, G, C or a deletion, countk-merIs the number of occurrences of the k-mer fragment in the aligned sequence; c is the coverage of the reference sequence by the aligned sequence at position p of the reference sequence.
11. The method of claim 9, further comprising:
s206, for the low-quality sequence in the first correction result, using the alignment interval sequence of the alignment sequence covering the low-quality sequence in the alignment result of the step S201 as a seed sequence, and performing correction by using a directed graph algorithm based on the seed sequence; replacing the low quality sequence with the corrected sequence to obtain a second correction result;
the low-quality sequence is a sequence which meets all conditions that the number of continuous low-quality bases is not less than a preset first value, the longest sequence with the interval average quality value less than a preset second value is taken as the starting point, the interval length exceeds a preset third value, and the number of continuous high-quality bases in the interval is not more than the preset first value;
preferably, in the correction of the directed graph algorithm, the number of the seed sequences does not exceed a preset fifth value, and preferably, the seed sequences included in the aligned sequences with the length sequence located at the middle preset fifth value are included.
12. The method according to claim 11, wherein in step S206, the step of correcting by the directed graph algorithm comprises:
s261, optionally drawing a directed graph by taking one seed sequence as a motif;
s262, calculating the base score in the seed sequence according to the score matrix, and updating the directed graph;
s263, topological sorting is carried out on the directed graph;
and S264, determining the base of each node as the base with the highest occurrence probability on the node according to the topological sequencing result, and obtaining a corrected sequence.
13. The method according to claim 12, wherein in step S262, the following steps are performed for all of the seed sequences:
a. initializing a node score according to the formula score (x) max { score (x-1) } -GP, where score (x) is the node score, x is the node index, and GP is the gap penalty; setting the node score without the former node as 0;
b. traversing each base of the seed sequence, and updating the following scoring matrix according to a formula:
Figure FDA0002177874360000041
wherein x is a node index, y is a base index in the seed sequence, GP is a gap penalty, and M is a mismatch or match score;
c. taking the highest scoring node with the out degree of 0 as a starting point, selecting the predecessor node with the highest score, and backtracking by taking the node index as 0 or the base index as 0 as an end point;
d. if the base index of the end point is more than 0, adding a base positioned before the base index of the end point in the seed sequence into a directed graph; and if the base index of the starting point is smaller than the length of the seed sequence, adding a base positioned after the base index of the starting point in the seed sequence into the directed graph.
14. The method of claim 11, wherein prior to performing the directed graph algorithm correction, filtering the first correction result according to any one of the following conditions:
a. the accumulated length of the low-quality sequence exceeds the first correction result of a preset fourth numerical value;
b. the first correction result of which the number of the seed sequences is smaller than a preset first numerical value;
c. and the comparison sequence with the length exceeding a preset fourth value accounts for the first correction result with the length exceeding a fourth preset multiple.
15. The method of claim 11, further comprising:
s207, taking the seed sequence as the alignment sequence, taking the second correction result as the reference sequence, and performing correction again by using the correction method of any one of claims 9 to 10 to obtain a third correction result.
16. The method according to claim 9, wherein in step S201, the sequence alignment method according to any one of claims 1 to 8 is used for alignment;
wherein the reference sequence comprises one or more of the first reference sequence, the second reference sequence, and the third reference sequence, and the aligned sequence comprises one or more of the first aligned sequence, the second aligned sequence, and the third aligned sequence.
17. An apparatus for sequence alignment, the apparatus comprising:
the sequence acquisition module is used for acquiring a sequencing sequence, screening a first sequence in the sequencing sequence according to a first preset data length, and screening a second sequence in the sequencing sequence according to a second preset length;
the first comparison module is used for comparing the first sequence with the second sequence to obtain a first reference sequence and a first comparison sequence;
the second comparison module is used for calculating the coverage of the first comparison sequence to the first reference sequence, screening the first reference sequence and the first comparison sequence based on the coverage, and obtaining a second reference sequence and a second comparison sequence;
and the third comparison module is used for calculating a comparison path between the second reference sequence and the second comparison sequence, screening the second reference sequence and the second comparison sequence based on the edit distance, and obtaining a third reference sequence and a third comparison sequence.
18. A sequence correction apparatus, characterized in that the apparatus comprises:
the sequence comparison module is used for comparing the plurality of sequencing fragments to obtain a reference sequence and a comparison sequence;
the sequence dividing module is used for dividing the reference sequence by taking the k-mer as a unit, and correspondingly dividing the comparison sequence to obtain a plurality of k-mer fragments;
a score calculation module for calculating the scores of all bases at each position from the positive sequence in units of k-mer fragments;
a base determination module, configured to determine, from reverse order, a last base as a highest-scoring base at a last position, and determine a penultimate base according to the k-mer fragment corresponding to the last base; determining a penultimate base according to the k-mer fragment corresponding to the penultimate base; and so on to obtain a k-mer scoring branch;
and the correction module is used for replacing the reference sequence by using the k-mer score chain to obtain a first correction result.
CN201910787734.1A 2019-08-23 2019-08-23 Sequence comparison method, sequence correction method and device thereof Active CN112397148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910787734.1A CN112397148B (en) 2019-08-23 2019-08-23 Sequence comparison method, sequence correction method and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910787734.1A CN112397148B (en) 2019-08-23 2019-08-23 Sequence comparison method, sequence correction method and device thereof

Publications (2)

Publication Number Publication Date
CN112397148A true CN112397148A (en) 2021-02-23
CN112397148B CN112397148B (en) 2023-10-03

Family

ID=74603657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910787734.1A Active CN112397148B (en) 2019-08-23 2019-08-23 Sequence comparison method, sequence correction method and device thereof

Country Status (1)

Country Link
CN (1) CN112397148B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421608A (en) * 2021-07-03 2021-09-21 南京世和基因生物技术股份有限公司 Construction method, detection device and computer readable medium of liver cancer early screening model
CN116564414A (en) * 2023-07-07 2023-08-08 腾讯科技(深圳)有限公司 Molecular sequence comparison method and device, electronic equipment, storage medium and product

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102789553A (en) * 2012-07-23 2012-11-21 中国水产科学研究院 Method and device for assembling genomes by utilizing long transcriptome sequencing result
CN103955630A (en) * 2014-03-26 2014-07-30 田埂 Method for preparing reference database and performing target area sequence alignment on to-be-tested free nucleic acid samples
CN105303068A (en) * 2015-10-27 2016-02-03 华中农业大学 Reference genome and de novo assembly combination based next-generation sequencing data assembly method
CN107103206A (en) * 2017-04-27 2017-08-29 福建师范大学 The DNA sequence dna cluster of local sensitivity Hash based on standard entropy
CN107229842A (en) * 2017-06-02 2017-10-03 肖传乐 A kind of three generations's sequencing sequence bearing calibration based on Local map
CN107895104A (en) * 2017-11-13 2018-04-10 深圳华大基因科技服务有限公司 Assess and verify the method and apparatus of the sequence assembling result of three generations's sequencing
US20180201990A1 (en) * 2015-05-06 2018-07-19 Nanjing Annoroad Gene Technology Co., Ltd. Kit, apparatus, and method for detecting chromosome aneuploidy
CN109411020A (en) * 2018-11-01 2019-03-01 中国水产科学研究院 The method for carrying out whole genome sequence filling-up hole using long sequencing read
US20190080045A1 (en) * 2017-09-13 2019-03-14 The Jackson Laboratory Detection of high-resolution structural variants using long-read genome sequence analysis

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102789553A (en) * 2012-07-23 2012-11-21 中国水产科学研究院 Method and device for assembling genomes by utilizing long transcriptome sequencing result
CN103955630A (en) * 2014-03-26 2014-07-30 田埂 Method for preparing reference database and performing target area sequence alignment on to-be-tested free nucleic acid samples
US20180201990A1 (en) * 2015-05-06 2018-07-19 Nanjing Annoroad Gene Technology Co., Ltd. Kit, apparatus, and method for detecting chromosome aneuploidy
CN105303068A (en) * 2015-10-27 2016-02-03 华中农业大学 Reference genome and de novo assembly combination based next-generation sequencing data assembly method
CN107103206A (en) * 2017-04-27 2017-08-29 福建师范大学 The DNA sequence dna cluster of local sensitivity Hash based on standard entropy
CN107229842A (en) * 2017-06-02 2017-10-03 肖传乐 A kind of three generations's sequencing sequence bearing calibration based on Local map
US20190080045A1 (en) * 2017-09-13 2019-03-14 The Jackson Laboratory Detection of high-resolution structural variants using long-read genome sequence analysis
CN107895104A (en) * 2017-11-13 2018-04-10 深圳华大基因科技服务有限公司 Assess and verify the method and apparatus of the sequence assembling result of three generations's sequencing
CN109411020A (en) * 2018-11-01 2019-03-01 中国水产科学研究院 The method for carrying out whole genome sequence filling-up hole using long sequencing read

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王春宇: "生物高通量测序片段拼接与分子标记识别算法研究", 《中国博士学位论文全文数据库 信息科技辑》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421608A (en) * 2021-07-03 2021-09-21 南京世和基因生物技术股份有限公司 Construction method, detection device and computer readable medium of liver cancer early screening model
CN113421608B (en) * 2021-07-03 2023-12-01 南京世和基因生物技术股份有限公司 Construction method of liver cancer early screening model, detection device and computer readable medium
CN116564414A (en) * 2023-07-07 2023-08-08 腾讯科技(深圳)有限公司 Molecular sequence comparison method and device, electronic equipment, storage medium and product
CN116564414B (en) * 2023-07-07 2024-03-26 腾讯科技(深圳)有限公司 Molecular sequence comparison method and device, electronic equipment, storage medium and product

Also Published As

Publication number Publication date
CN112397148B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
Zhbannikov et al. SeqyClean: a pipeline for high-throughput sequence data preprocessing
CN107403075B (en) Comparison method, device and system
US8271206B2 (en) DNA sequence assembly methods of short reads
CN107133493B (en) Method for assembling genome sequence, method for detecting structural variation and corresponding system
WO2018218788A1 (en) Third-generation sequencing sequence alignment method based on global seed scoring optimization
WO2017143585A1 (en) Method and apparatus for assembling separated long fragment sequences
KR20080026153A (en) Method of processing and/or genome mapping of ditag sequences
KR101313087B1 (en) Method and Apparatus for rearrangement of sequence in Next Generation Sequencing
JP4912646B2 (en) Gene transcript mapping method and system
US10810239B2 (en) Sequence data analyzer, DNA analysis system and sequence data analysis method
CN110021355B (en) Haploid typing and variation detection method and device for diploid genome sequencing segment
WO2018218787A1 (en) Third-generation sequencing sequence correction method based on local graph
CN110692101A (en) Method for aligning targeted nucleic acid sequencing data
Roberts et al. A preprocessor for shotgun assembly of large genomes
CN112397148A (en) Sequence comparison method, sequence correction method and device thereof
CN115631789A (en) Pangenome-based group joint variation detection method
CA2400890A1 (en) Method and system for the assembly of a whole genome using a shot-gun data set
CN107784198B (en) Combined assembly method and system for second-generation sequence and third-generation single-molecule real-time sequencing sequence
EP3663890B1 (en) Alignment method, device and system
CN114627967A (en) Method for accurately annotating three-generation full-length transcript
Martin Algorithms and tools for the analysis of high throughput DNA sequencing data
CN111445949A (en) Method for annotating genome of high-altitude polyploid fish by using nanopore sequencing data
CN107688727B (en) Method and device for identifying transcript subtypes in biological sequence clustering and full-length transcription group
CN115602246B (en) Sequence alignment method based on group genome
CN115602244B (en) Genome variation detection method based on sequence alignment skeleton

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 430000 No.01, 19th floor, building C11, Wuhan software new town phase 2, No.8 Huacheng Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province

Applicant after: Wuhan hope group Biotechnology Co.,Ltd.

Address before: 430000 No.01, 19th floor, building C11, Wuhan software new town phase 2, No.8 Huacheng Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province

Applicant before: WUHAN NEXTOMICS BIOSCIENCES Co.,Ltd.

GR01 Patent grant
GR01 Patent grant