WO2019023978A1 - 比对方法、装置及系统 - Google Patents

比对方法、装置及系统 Download PDF

Info

Publication number
WO2019023978A1
WO2019023978A1 PCT/CN2017/095612 CN2017095612W WO2019023978A1 WO 2019023978 A1 WO2019023978 A1 WO 2019023978A1 CN 2017095612 W CN2017095612 W CN 2017095612W WO 2019023978 A1 WO2019023978 A1 WO 2019023978A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
extended
seed
segment
positioning result
Prior art date
Application number
PCT/CN2017/095612
Other languages
English (en)
French (fr)
Inventor
徐伟彬
金欢
颜钦
姜泽飞
周志良
Original Assignee
深圳市瀚海基因生物科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市瀚海基因生物科技有限公司 filed Critical 深圳市瀚海基因生物科技有限公司
Priority to EP17919940.1A priority Critical patent/EP3663890B1/en
Priority to US16/635,996 priority patent/US11482304B2/en
Priority to PCT/CN2017/095612 priority patent/WO2019023978A1/zh
Publication of WO2019023978A1 publication Critical patent/WO2019023978A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20Sequence assembly
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/10Design of libraries
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50Compression of genetic data

Definitions

  • the present invention relates to the field of data processing, and in particular, to a sequence alignment method, a sequence alignment device, and a sequence alignment system.
  • mainstream comparison software such as bwa, bowtie, and soap are mainly developed for the sequence generated by the second generation sequencing platform, and software such as blast and mapq are mainly developed for long sequences.
  • the embodiments of the present invention aim to solve at least one of the technical problems existing in the prior art or at least provide a commercial means.
  • an alignment method includes the steps of: converting each read segment into a set of short segments corresponding to the read segment, obtaining a plurality of sets of short segments; determining a correspondence of the short segments in the reference library Position to obtain a first positioning result
  • the reference library is a hash table constructed based on a reference sequence, the reference library contains a plurality of entries, and one entry of the reference library corresponds to a seed sequence, and the so-called seed sequence can be referenced to the reference sequence
  • the at least one sequence of the matching is matched, and the distance between the two seed sequences corresponding to the two adjacent entries of the reference library is less than the length of the short segment; and any one of the first positioning result is located in the reference library adjacent entry.
  • a short segment of the eye obtains a second positioning result; and is extended based on the short segment from the same reading in the second positioning result to obtain a comparison result of the reading.
  • a comparison apparatus for implementing all or part of the steps of the above comparison method, the apparatus comprising: a conversion module for converting each read segment into a corresponding one of the read segments a set of short segments to obtain a plurality of short segments; a search module for determining a corresponding position of the short segments in the reference library to obtain a first positioning result, wherein the reference library is a hash table constructed based on the reference sequence, and the reference library Containing multiple entries, one entry of the reference library corresponds to a seed sequence, and the so-called seed sequence can match at least one sequence on the reference sequence, and two seed sequences corresponding to two adjacent entries of the reference library are on the reference sequence.
  • a computer readable medium for storing/hosting a computer executable program, and those skilled in the art can understand that the above comparison method can be completed by instructing related hardware when executing the program. All or part of the steps.
  • the so-called medium may include: a read only memory, a random access memory, a magnetic disk or an optical disk, and the like.
  • a comparison system includes: an input device for inputting data; an output device for outputting data; a processor for executing a computer executable program, executing the computer executable
  • the program includes the completion of the above comparison method; a storage device for storing data, including the computer executable program.
  • the comparison method, device and/or system of the invention can convert the read segment into short segments and convert the sequence information into position information, has high sensitivity and accuracy, and can efficiently and accurately process various sequencing platforms.
  • Machine data including long read length and short read length data. This is especially true for highly accurate and accurate localization of reads containing unrecognized bases, ie readings containing gaps.
  • FIG. 1 is a schematic flow chart of a comparison method in an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of the distances of two adjacent entries of a reference library in an embodiment of the present invention.
  • FIG 3 is a schematic view of a communication length in an embodiment of the present invention.
  • FIG. 4 is a schematic structural view of a comparison device in an embodiment of the present invention.
  • Fig. 5 is a schematic structural view of a comparison device in an embodiment of the present invention.
  • Fig. 6 is a schematic structural view of a comparison device in an embodiment of the present invention.
  • Fig. 7 is a schematic structural view of a comparison device in an embodiment of the present invention.
  • Fig. 8 is a schematic structural view of a comparison device in an embodiment of the present invention.
  • first and second are used for descriptive purposes only, and are not to be construed as indicating or implying relative importance and/or order.
  • connection should be understood broadly. For example, it may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; it may be directly connected. It can also be connected indirectly through an intermediate medium, which can be the internal communication between two components.
  • intermediate medium which can be the internal communication between two components.
  • reading is meant a sequence fragment obtained by determining a DNA/RNA/protein sequence, including sequences obtained by assaying at least a portion of a DNA/RNA/protein sequence using a sequencing platform.
  • the sequencing platform can be selected but not limited to Illumina's Hisq/Miseq/Nextseq sequencing platform, Thermo Fisher (Life) Technologies) Ion Torrent platform, BGI's BGISEQ platform and single-molecule sequencing platform, the sequencing method can choose single-end sequencing, or you can choose double-end sequencing, the obtained offline data is the read-out fragment, called read ( Reads).
  • alignment refers to sequence alignment, including the process of locating a read to a reference sequence, and the process of obtaining a read position/match result.
  • the method includes the following steps: S110 converts each read segment into a set of short segments corresponding to the read segment to obtain multiple sets of short segments; S120 determines a corresponding position of the short segment in the reference library to obtain a first positioning result, the reference library is a hash table constructed based on the reference sequence, the reference library includes multiple entries, and one entry of the reference library corresponds to a seed sequence, and the seed sequence can Matching with at least one sequence on the reference sequence, the distance between the two seed sequences corresponding to the two adjacent entries of the reference library on the reference sequence is smaller than the length of the short segment; S130 removes the reference to the reference library adjacent entry in the first positioning result.
  • a short segment on any of the entries obtains a second positioning result; S140 extends based on the short segment from the same segment in the second positioning result to obtain a comparison result of the reading.
  • the comparison method converts the read segment into a short segment and converts the sequence information into position information, that is, the sequence form is converted into a digital form, which facilitates fast and accurate alignment of the offline data of various sequencing platforms.
  • reads containing unrecognized bases ie fast and accurate alignments of reads containing gaps or N, such as alignments of reads obtained due to poor sequencing quality, poor base recognition, etc.
  • the reference sequence is a predetermined sequence, which may be a self-determined assembly of DNA and/or RNA sequences in advance, or may be determined by others to disclose the disclosed DNA and/or RNA sequences, which may be a pre-obtained target.
  • any reference template in a biological category to which an individual belongs for example, all or at least a portion of a published genomic assembly sequence of the same biological category, if the target individual is a human, its genomic reference sequence (also referred to as a reference genome or reference genome)
  • the HG19 provided by the NCBI database may be selected; further, a resource library containing more reference sequences may be pre-configured, and before the sequence comparison, the target individual's gender, race, region, and other factors are selected or determined to be assembled. Close sequences are used as reference sequences to aid in the subsequent obtaining of more accurate sequence analysis results.
  • the reference sequence contains the chromosome number and positional information of each locus on the chromosome.
  • the so-called reference library is essentially a hash table, which can directly use the so-called seed sequence as a key (key name) and the position of the so-called seed sequence on the reference sequence as a value (key value).
  • the reference library is constructed; the so-called seed sequence can be first converted into a numeric or integer string, and the reference library is established by using the number or an integer string as a key and the position of the seed sequence on the reference sequence as a value.
  • the position of the seed sequence on the reference sequence is a value, which may be one or more positions corresponding to the seed sequence on the reference sequence/chromosome, and the position may be directly represented by a true value or a numerical range, or may be re-encoded. Custom characters and/or numeric representations.
  • This reference library can be built by saving in binary.
  • the hash table may also be divided into block storage, and the block header key and the block tail key are set in the block header, for example, for the sequential sequence block ⁇ 5, 6, 7, 8, ..., 19, 20 ⁇ , Set the block header and block tail (or header and footer) 5 and 20. If there is a number of 3, because 3 ⁇ 5, it can be seen that 3 does not belong to the sequence block.
  • the so-called reference library can be built when the sequence is to be compared, or it can be pre-built and saved.
  • the method is based on the relationship between the length of the seed sequence and the reference sequence established by the inventor repeatedly hypothesized experimental verification, so that the constructed reference library can include the comprehensive seed sequence with the associated information of the corresponding positions of the various sub-sequences on the reference sequence.
  • the reference library is compact, has a small memory footprint and can be used for high-speed access queries in sequence location analysis.
  • An entry of the reference library obtained according to this embodiment contains only one key, one key corresponding to at least one value.
  • a method for generating all possible seed sequences and obtaining a seed sequence set is not limited.
  • elements in the set may be traversed to obtain a specific length and all possible element combinations, for example, Implemented using recursive algorithms and/or looping algorithms.
  • the reference sequence is a human genome comprising approximately 3 billion bases
  • the length of the reads to be processed is no less than 25 bp
  • L is an integer in [11, 15] for efficient alignment.
  • the reference sequence is a human cDNA reference genome
  • the total number of bases of the reference genome is totalBase
  • the length L of the seed sequence is set based on the total number of bases
  • L(seed) log(totalBase)* ⁇
  • Base types based on L and DNA sequences include A, T, C, and G.
  • the reference library on the reference genome is constructed to obtain the reference library.
  • the reference sequence is the DNA genome and transcriptome of a species
  • the total number of bases of the reference sequence is totalBase
  • the length of the seed sequence is set based on the total number of bases L
  • L(seed) log(totalBase )* ⁇
  • the base species constituting the DNA sequence include four kinds of A, T, C, and G
  • the base types constituting the RNA sequence include A, U, C, and G
  • B L B ⁇ ATCG ⁇ AUCG ⁇ ; determining a seed sequence set to match the seed sequence of the reference sequence and the seed
  • the matching position of the sequence is constructed by using a seed sequence capable of matching the reference sequence as a key and a position of the seed sequence on the reference sequence as a value to obtain the reference library.
  • a seed sequence can be converted into a string of numeric characters, with the string being The key to build the library can improve the speed of accessing the reference library built by the query.
  • the seed sequence is encoded as follows:
  • the base coding rule can be the same as above, and the same regular coding and conversion can be performed on the reference sequence, which facilitates obtaining the seed sequence in the reference sequence quickly.
  • the corresponding location information also helps to improve the access query speed of the reference library built.
  • determining a seed sequence capable of matching a seed sequence with a reference sequence and a matching position of the seed sequence comprises: sliding a reference sequence with a window of size L, and seeding the seed sequence in the seed sequence Matching with the window sequence of the sliding window to determine the seed sequence that can match the seed sequence of the reference sequence and the matching position of the seed, and the matching fault tolerance is ⁇ 1 .
  • the corresponding position information of the seed sequence on the reference sequence can be quickly obtained, which facilitates rapid construction of the reference library.
  • the so-called fault tolerance is the ratio of the allowed mismatched bases, and the mismatch is selected from at least one of substitutions, insertions, and deletions.
  • the so-called match is a strict match, that is, the fault tolerance rate ⁇ 1 is zero.
  • the position of the sliding window sequence is corresponding to the seed sequence on the reference sequence. s position.
  • the so-called match is fault-tolerant, and the fault tolerance ⁇ 1 is greater than zero.
  • the sliding window sequence The position is the corresponding position of the seed sequence on the reference sequence.
  • the corresponding position of the seed sequence on the reference sequence is encoded, and the reference library is constructed with the encoded characters, such as numeric characters, as values.
  • the constructed reference library may be a seed as a key, or each of the seed templates corresponding to the seed may be a key, the key is different from the key, and one key corresponds to at least one value.
  • the step size of the sliding window of the reference sequence is determined in accordance with L and ⁇ 1 when determining the corresponding position of the seed sequence on the reference sequence.
  • the step size of the sliding window is not less than L* ⁇ 1 .
  • the reference sequence is a human genome comprising about 3 billion bases
  • the length of the read to be processed is not less than 25 bp
  • L is 14 bp
  • ⁇ 1 is 0.2-0.3
  • the step size of the sliding window is 3 bp. -5 bp enables adjacent windows in the sliding window positioning process to cross consecutive error combinations under ⁇ 1 conditions, facilitating rapid positioning.
  • the distance between two adjacent entries of the constructed reference library is the step size of the sliding window.
  • S110 includes: sliding a read segment with a window of size L to obtain a set of short segments corresponding to the read segment, the sliding window having a step size of 1 bp.
  • K a read with a length of K
  • obtain (K-L+1) short segments of length L convert the reads into short segments, and use the high-speed access query reference library to determine the corresponding positions of the short segments in the reference library. And then obtain the short segment corresponding to the information of the reads in the reference library.
  • S120 includes: matching the short segment with the seed sequence corresponding to the entry of the reference library to determine the location of the short segment at the reference library, and performing a matching fault tolerance rate of ⁇ 2 .
  • the so-called match is a strict match, that is, the fault tolerance rate ⁇ 2 is zero.
  • the so-called match is a fault-tolerant match, the fault tolerance ⁇ 2 is greater than zero, and the proportion of bases that do not match the seed or seed template corresponding to one or more entries of the reference sequence is less than the fault tolerance ⁇ At 2 o'clock, the position information of the short sequence on the reference library is obtained.
  • ⁇ 2 ⁇ 1 and not zero, enabling as much valid data as possible.
  • the distance between the two seed sequences X1 and X2 corresponding to two adjacent entries of the reference library in the reference sequence ref can be divided into the following two cases. :
  • the key and value of the two entries of the reference library are unique, that is, an entry corresponds to a [key, value]
  • Figure 2a which is equivalent to the X1 and X2 are uniquely matched with the reference sequence (X1 and X2 are both Matching only one position to the reference sequence)
  • the distance is the distance between X1 and X2 corresponding to the two positions on the reference sequence, bold black lines show these two positions; when two entries in the reference library
  • the key of at least one entry corresponds to a plurality of values.
  • At least one of the two seed sequences X1 and X2 is non-uniquely matched with the reference sequence, ie, at least one of X1 and X2 matches at least one of the reference sequences.
  • the position, the distance referred to is the distance between the two closest positions of the X1 and X2 corresponding to each other on the reference sequence, and the bold black line shows the two positions.
  • This embodiment does not limit the representation of the distance between two sequences, for example, it can be expressed as the distance from either end of one end of the sequence to either end of the other sequence, or can be expressed as a sequence. The distance from the center to the center of another sequence.
  • S130 further includes: removing a short segment whose connectivity length is less than a predetermined threshold, and replacing the second positioning result with the removed result, the connectivity length being the second alignment result. Short segments from the same read segment and positioned to different entries in the reference library are mapped to the total length of the reference sequence. This process facilitates the removal of some of the transitionally redundant and/or relatively low quality data, which facilitates increasing the speed of the alignment.
  • the connected length can be expressed as the sum of the lengths of short segments from the same read and positioned to different entries of the reference library minus the length of the overlap between the short segments mapped onto the reference sequence.
  • there are four short segments from one read segment and positioned to different entries in the reference library denoted as Y1, Y2, Y3, and Y4, each having a length of L1, L2, L3, and L4, respectively.
  • the overlapping portion has a length J and the connected length is (L1+L2+L3+L4-J).
  • the lengths of different short segments are all L, and the predetermined threshold is L, so that it can be lost. In the case of losing part of the data that is valid but relatively low in quality, the speed of the comparison is increased.
  • S130 further includes: judging the positioning result of the reading according to the positioning result of the short segment from the same reading in the second positioning result, and removing the evaluation result. Reads that do not meet the predetermined requirements. Removing the read also removes the short segment corresponding to the read. In this way, under the premise of satisfying certain sensitivity and accuracy, based on the second positioning result, the exact matching/local fast comparison can be directly performed, and the comparison can be accelerated.
  • This embodiment does not limit the method of evaluation, and for example, a method of quantitative scoring can be utilized.
  • the short positioning results of the short clips from the same read segment are scored according to the fact that the matching points of the reference sequence are scored, and the points that do not match the reference sequence are scored; After the positioning result, the positioning result of the read segment is scored according to the positioning result of the short segment from the same read segment in the second positioning result, and the read segment whose score is not greater than the first preset value is removed.
  • the read length is 25 bp
  • the short segments from the same read are sequence constructed to obtain a reconstructed sequence.
  • the base type of a certain site can be determined according to having more short sequence support, if If the short fragment that is not supported by the locus is not aligned to the locus, then the base type of the locus is indeterminate and can be represented by N, so as to obtain the reconstructed sequence, it can be seen that the reconstructed sequence and the read segment are Correspondingly, the length of the reconstructed sequence is a read length; the reconstructed sequence is decremented by a point matching the reference sequence (ref), and the point that does not match the reference sequence is added by one point, and the ratio of the fault tolerance is a read.
  • the segment/reconstruction sequence allows a mismatch ratio of 0.12, the length of the tolerance error is 3 bp (25*0.12), the initial score Score init is the read length, and the first preset value is 22 (25-3), thus,
  • a bit operation and a dynamic programming algorithm [G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic progamming. Journal of the ACM, 46(3): 395-415, 1999] is used.
  • Score init Length(read)
  • Score init Length(read)
  • the short positioning results of the short clips from the same read segment are scored according to the points that match the reference sequence are added, and the points that do not match the reference sequence are scored; After the positioning result, the positioning result of the read segment is scored according to the positioning result of the short segment from the same read segment in the second positioning result, and the short segment corresponding to the read segment whose score is not less than the second preset value is removed.
  • the read length is 25 bp, and the short segments from the same read are sequence constructed to obtain a reconstructed sequence.
  • the base type of a certain site can be determined according to having more short sequence support, if If the short fragment that is not supported by the locus is not aligned to the locus, then the base type of the locus is indeterminate and can be represented by N, so as to obtain the reconstructed sequence, it can be seen that the reconstructed sequence and the read segment are Correspondingly, the length of the reconstructed sequence is a read length; the reconstructed sequence is added with a reference point (ref) to add a point, and the reference sequence does not match the point of the reference sequence by one point, and the ratio of the fault tolerance is a read.
  • the segment/reconstruction sequence allows a mismatch ratio of 0.12, the length of the tolerance for the error is 3 bp (25*0.12), the initial score Score init is -25, and the second default value is -22 (-25-3). As such, the reconstructed sequence with scores greater than -22 is removed, and the alignment is accelerated with the loss of partially valid but relatively low quality data.
  • the extending in the S140 based on the short segment from the same read segment in the second positioning result comprises: performing sequence construction based on the short segment from the same read segment to obtain the reconstructed sequence; A common portion of the reference sequence corresponding to the reconstructed sequence is extended to obtain an extended sequence.
  • the short segment and the short segment positioning information are converted into the positioning information of the short segment corresponding to the read segment (herein referred to as a reconstructed sequence), which facilitates the fast and accurate processing of the subsequent alignment processing.
  • the so-called public part is the part shared by multiple sequences.
  • the so-called common parts are common substrings and/or common subsequences.
  • a common substring refers to a contiguous portion shared among multiple sequences, and a common subsequence does not have to be contiguous.
  • the common subsequence is BCBA and the common substring is AB.
  • the so-called sequence is constructed based on short fragments from the same read segment to obtain a reconstructed sequence.
  • the base type of a certain site on the reconstructed sequence can be determined according to having more short segment support, if a certain bit If a short segment that is not supported by the point, that is, no short segment is aligned to the reference sequence, the base type uncertainty of the site may be represented by N, thereby obtaining the so-called reconstructed sequence. It can be seen that the reconstructed sequence corresponds to the read segment, and the length of the reconstructed sequence is the read length.
  • the reference sequence corresponding to the reconstructed sequence is a reference sequence that matches the reconstructed sequence, and the length of the reference sequence is not less than the read length.
  • the reference sequence corresponding to the reconstructed sequence has the same length as the reconstructed sequence and is of a read length.
  • the reconstructed sequence is allowed to be fault-tolerantly matched with the corresponding reference sequence, and the length of the reference sequence corresponding to the reconstructed sequence is the length of the reconstructed sequence plus twice the length of the fault-tolerant matching, for example, the length of the reconstructed sequence is read.
  • the length is 25 bp, and the matching of the reconstructed sequence and the reference sequence allows 12% mismatch, and can be reconstructed by the reference sequence of the reconstructed sequence comparison and the 3 bp (25*12%) sequences at both ends of the reference sequence.
  • the reference sequence corresponding to the sequence.
  • the so-called common part is a common substring.
  • Determining, in S140, the short segment from the same read segment in the second positioning result including: searching for a common substring of the reference sequence corresponding to the reconstructed sequence and determining the reconstructed sequence and the The longest common substring of the reference sequence corresponding to the reconstructed sequence; extending the longest common substring based on the edit distance to obtain an extended sequence. In this way, the alignment result including the longer matching sequence can be obtained more accurately.
  • the so-called common part is a common subsequence.
  • Deriving in the S140 based on the short segment from the same read segment in the second positioning result including: searching for a common subsequence of the reference sequence corresponding to the reconstructed sequence and determining the reconstructed sequence and the The longest common subsequence of the reference sequence corresponding to the reconstructed sequence; extending the longest common subsequence based on the edit distance to obtain an extended sequence.
  • edit distance also known as the Levenshtein distance
  • Levenshtein distance refers to the minimum number of edit operations required between two strings, one from one to another. Editing operations include replacing one character with another, inserting one character, and deleting one character. In general, the smaller the edit distance, the greater the similarity between the two strings.
  • the longest common substring of the reference sequence corresponding to the reconstructed sequence and the reconstructed sequence can be represented as two strings x 1 x 2 ...
  • the common substring of x i and y 1 y 2 ... y j , the lengths of the strings are m and n respectively, and the length c[i, j] of the common substring of the two strings is calculated, and the transfer equation can be obtained:
  • the hole/gap means inserting or deleting a character.
  • the gap in the formula indicates the penalty required to insert or delete a character (corresponding to the position in the sequence).
  • the match means that the two characters are the same.
  • the match in the two matches the score when the two characters are the same.
  • the mismatch indicates that the two characters are not equal/different.
  • the mismatch in the formula indicates the valve points indicating that the two characters are not equal/different.
  • d[i,j] takes the smallest of the three. In a specific example, a gap is 3 points, a continuous gap increases the valve by 1 point, a position is mismatched by 2 points, and the position matches 0 points. Thus, it facilitates efficient alignment with gap sequences.
  • the so-called common part is a common subsequence.
  • S140 includes: searching for a common subsequence of short segments of the same entry in the second positioning result that are located to the reference library, determining a longest common subsequence corresponding to each read segment; extending based on the edit distance The longest common subsequence to obtain an extended sequence.
  • the longest common subsequence of the reference sequence corresponding to the reconstructed sequence and the reconstructed sequence is searched for, based on the longest common subsequence, the one corresponding to the longest common subsequence
  • the segment reconstruction sequence is transformed into the reference sequence corresponding to the longest common subsequence, and the Smith Waterman algorithm can be used to find the edit distance of the two segments, for the two strings x 1 x 2 ... x i and y 1 y 2 ...y j can be obtained by the following formula:
  • represents the score function
  • ⁇ (i,j) represents the score of the character (site) x i and y j mismatch or match
  • ⁇ (-,j) represents the score of x i vacancy (deletion) or y j insertion
  • ⁇ (i, -) represents the score of y j deletion or x i insertion
  • a gap is 3 points for a gap, 1 point for a continuous gap, 2 points for a mismatch, and 4 points for a match. In this way, efficient alignment of gap-containing sequences can be achieved and sequences with both gaps and other sites with high accuracy can be retained.
  • S140 further includes: truncating the extended sequence from at least one end of the extended sequence, calculating a proportion of the incorrectly located position of the truncated extended sequence, and stopping the truncation after the truncated extended sequence is satisfied
  • the proportion of the incorrect positioning site is smaller than the third preset value.
  • the extension sequence is truncated based on: i, calculating a first error rate and a second error rate, and if the first error rate is less than the second error rate, the first from the extended sequence The end truncates the extended sequence. If the first error rate is greater than the second error rate, the extended sequence is truncated from the second end of the extended sequence to obtain the truncated extended sequence, and the first error rate is extended.
  • the proportion of the mis-localization site of the extended sequence; ii, replacing the extended sequence with the truncated extended sequence for i, until the proportion of the incorrectly located position of the extended sequence after the truncation is less than the fourth predetermined value.
  • the double-ended truncation and culling method can better preserve the well-matched local sequence, which is beneficial to improve the efficiency of the data.
  • the length of the extended sequence is 25 bp
  • the fourth preset value is set to 0.12.
  • S140 further includes: sliding a window from the extended sequence from at least one end of the extended sequence, calculating a proportion of the wrong positioning site of the window sequence obtained by the sliding window, and locating the proportion according to the error of the window sequence
  • the extension sequence is truncated, and the truncation is stopped when the following conditions are met: the proportion of the incorrectly located position of the window sequence of the sliding window is greater than the fifth preset value.
  • the extension sequence is truncated based on: i, calculating a third error rate and a fourth error rate, and if the third error rate is less than the fourth error rate, the second from the extended sequence The end truncates the extended sequence.
  • the extended sequence is truncated from the first end of the extended sequence to obtain the truncated extended sequence, and the third error rate is extended from Sequence number
  • the ratio of the wrong positioning position of the window sequence obtained by sliding the window to the extended sequence at one end, the so-called fourth error rate is the sliding window of the extended sequence from the second end of the extended sequence, and the wrong positioning position of the obtained window sequence
  • the ratio of points; ii, replacing the extended sequence with the truncated extended sequence for i until the proportion of the incorrectly located position of the window sequence is greater than the sixth preset value.
  • the window of the sliding window is no larger than the length of the extended sequence.
  • the length of the extended sequence is 25 bp
  • the window size of the sliding window is 10 bp
  • the sixth preset value is 0.12.
  • the truncation size is 1 bp, i.e., one truncation is one base removed. In this way, it is possible to efficiently obtain alignment results containing more long sequences.
  • Bowtie http://bowtie-bio.sourceforge.net/index.shtml
  • BWA http://bio-bwa.sourceforge.net/
  • the above comparison method are used to simulate the same batch.
  • the data was sequence aligned and the simulation data was based on a human reference genome setup containing 100K sequences of 100 bp in length.
  • the space, time, ratio of the reference sequence (Map rate) and accuracy of each software/method are equivalent.
  • the time and memory required to utilize the alignment method in this embodiment is slightly longer and larger than that of Bowtie or BWA, but the ratio on the alignment of the alignment method using the embodiment and the alignment are accurate.
  • Sexuality reached 98.9% and 99.9%, both slightly higher than the use of Bowtie and BWA.
  • a comparison device is provided.
  • the device is used to implement the method in any of the above embodiments/embodiments, and the device 100 includes: a conversion module 10 for reading each read The segment is converted into a set of short segments corresponding to the read segment, and a plurality of short segments are obtained; the searching module 20 is configured to determine a corresponding position of the short segment in the reference library to obtain a first positioning result, where the reference library is a hash table constructed based on a reference sequence, the reference library comprising a plurality of entries, an entry of the reference library corresponding to a seed sequence, the seed sequence being capable of matching at least one sequence on the reference sequence, the reference The distance between the two seed sequences corresponding to the two adjacent entries of the library on the reference sequence is smaller than the length of the short segment; the culling module 30 is configured to remove the reference to the reference library in the first positioning result. a short segment on any of the neighbor entries to obtain a second positioning result; a growing module
  • the determined seed sequence set can match the seed sequence of the reference sequence and the matching position of the seed sequence, including: sliding the reference sequence with a window of size L, The seed sequence in the seed sequence set is matched with the window sequence obtained by the sliding window to determine that the seed sequence set can match the seed sequence of the reference sequence and the matching position of the seed, and the fault tolerance rate of the matching is ⁇ . 1 .
  • the step size of the sliding window is determined in accordance with L and ⁇ 1 .
  • the step size of the sliding window is not less than L* ⁇ 1 .
  • the distance between two adjacent entries of the reference library is greater than the step size of the sliding window.
  • the conversion module 10 is used to: perform a sliding window on the read segment by using a window of size L to obtain a set of short segments corresponding to the read segment, and perform the sliding window.
  • the step size is 1 bp.
  • a first screening module 32 is further included.
  • the first screening module 32 is connected to the culling module 30 for performing the following results on the second positioning result from the culling module 30: Removing a short segment whose connectivity length is less than a predetermined threshold, and replacing the second positioning result with a result of the removal, the connectivity length being different from the same read segment in the second alignment result and located in the hash table The short segment of the entry is mapped to the total length of the reference sequence.
  • a second screening module 34 is further included, and the second screening module 34 is connected to the culling module 30, for: from the same reading according to the second positioning result.
  • the result of the positioning of the short segment is used to judge the positioning result of the read segment, and the short segment corresponding to the read segment whose evaluation result does not meet the predetermined requirement is removed.
  • the second screening module 34 is configured to score the positioning result of the read segment according to the positioning result of the short segment from the same read segment in the second positioning result, and remove the score. A read that is not larger than the first preset value.
  • the second screening module 34 is configured to score the positioning result of the read segment according to the positioning result of the short segment from the same read segment in the second positioning result, and remove the score. A read that is not less than the second preset value.
  • the growth module 40 is configured to: perform sequence construction based on the short segments from the same read segment to obtain a reconstructed sequence; and refer to the reconstructed sequence based on the reconstructed sequence The common portion of the sequence is extended to obtain an extended sequence.
  • the common part is a common substring.
  • the growth module 40 is configured to: find a common substring of the reference sequence corresponding to the reconstructed sequence, and determine a longest common sub of the reference sequence corresponding to the reconstructed sequence and the reconstructed sequence a string; based on the edit distance, extending the longest common substring to obtain an extended sequence.
  • the common part is a common subsequence.
  • the growth module 40 is configured to: find a common subsequence of the reference sequence corresponding to the reconstructed sequence and the reconstructed sequence, and determine the longest common sub of the reference sequence corresponding to the reconstructed sequence and the reconstructed sequence a sequence; extending the longest common subsequence based on an edit distance to obtain an extended sequence.
  • a truncation module 50 is further included for: At least one end of the extended sequence of the growth module 40 truncates the extended sequence, and calculates the proportion of the incorrectly located site of the truncated extended sequence, and stops the truncation by the following condition: the mislocated localization of the truncated extended sequence The ratio is less than the third preset value.
  • the truncation module 50 is further configured to: i calculate a first error rate and a second error rate, and if the first error rate is less than the second error rate, extend from the The first end of the sequence truncates the extended sequence to obtain a truncated extended sequence, and if the first error rate is greater than the second error rate, the extended sequence is extended from the second end of the extended sequence Performing truncation to obtain a truncated stretch sequence, the first error rate being a ratio of an incorrectly located site of the truncated extended sequence obtained by truncating the extended sequence from the first end of the extended sequence,
  • the second error rate is a ratio of the misalignment site of the truncated extension sequence obtained by truncating the extension sequence from the second end of the extension sequence; ii, replacing the extension sequence with the truncated extension sequence i is performed until the proportion of the mis-positioned sites of the truncated extended sequence is
  • a truncation module 50 is further configured to: perform sliding window on the extended sequence from at least one end of the extended sequence, and calculate a proportion of an incorrect positioning position of the window sequence obtained by the sliding window, according to The ratio of the erroneous locating sites of the window sequence is truncated to the extended sequence, and the truncation is stopped if the ratio of the erroneous locating sites of the window sequence obtained by the sliding window is less than the fifth preset value.
  • the truncation module 50 is further configured to: i calculate a third error rate and a fourth error rate, and if the third error rate is less than the fourth error rate, extend from the The second end of the sequence truncates the extended sequence to obtain a truncated extended sequence, and if the third error rate is greater than the fourth error rate, the extended sequence is extended from the first end of the extended sequence Performing truncation to obtain a truncated extended sequence, the third error rate being a sliding window from the first end of the extended sequence, and a ratio of incorrect positioning sites of the obtained window sequence, the first The fourth error rate is a ratio of the erroneous locating site of the window sequence obtained by sliding the extended sequence from the second end of the extended sequence; ii, replacing the extended sequence with the truncated extended sequence for i, Until the proportion of the incorrectly located location of the window sequence is greater than the sixth predetermined value.
  • the truncation is 1 bp in size.
  • the window of the sliding window is no larger than the length of the extended sequence.
  • a computer readable medium is used to carry some or all of the steps of the comparison method in any of the above embodiments.
  • the so-called media includes, but is not limited to, read only memory, random access memory, magnetic disks, optical disks, and the like.
  • a system 1000 includes: an input device 100 for inputting data; an output device 200 for outputting data; and a processor 300 for executing a computer executable program Executing the computer executable program includes performing the comparison method of any of the above embodiments; the storage device 400 is configured to store data, including the computer executable program.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Library & Information Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

一种比对方法、装置及系统,比对方法包括:将每条读段转化成与该读段对应的一组短片段,获得多组短片段;确定短片段在参考库的对应位置,以获得第一定位结果,所称的参考库为基于参考序列构建的哈希表,参考库包含多个条目,参考库的一个条目对应一条种子序列,种子序列能够与参考序列上的至少一段序列匹配,参考库的相邻两个条目对应的两条种子序列在参考序列上的距离小于短片段的长度;去除第一定位结果中定位到参考库相邻条目中的任一条目上的短片段,获得第二定位结果;基于第二定位结果中来自相同读段的短片段的公共部分进行延伸,以获得读段的比对结果。该比对方法能够对测序数据进行高效准确的处理及定位。

Description

比对方法、装置及系统 技术领域
本发明涉及数据处理领域,具体地,涉及一种序列比对方法、一种序列比对装置及一种序列比对系统。
背景技术
在生物数据处理和分析中,比对作为生物信息分析的基础环节,对时间和效率提出了一定的要求,不同的比对模型和算法具有不同的敏感性和准确性。
目前主流的比对软件如bwa、bowtie、soap主要是针对二代测序平台产生的序列而开发的,而如blast和mapq等软件,主要是针对长序列而开发的。
对特定类型数据的敏感性、比对时间和/或比对效率,现有的比对方法有待提高。
发明内容
本发明实施方式旨在至少解决现有技术中存在的技术问题之一或者至少提供一种商业手段。
依据本发明的一方面提供的一种比对方法,包括如下步骤:将每条读段转化成与该读段对应的一组短片段,获得多组短片段;确定短片段在参考库的对应位置,以获得第一定位结果,所称的参考库为基于参考序列构建的哈希表,参考库包含多个条目,参考库的一个条目对应一条种子序列,所称的种子序列能够与参考序列上的至少一段序列匹配,参考库的相邻两个条目对应的两条种子序列在参考序列上的距离小于短片段的长度;去除第一定位结果中定位到参考库相邻条目中的任一条目上的短片段,获得第二定位结果;基于所述第二定位结果中来自相同读段的短片段进行延伸,以获得读段的比对结果。
依据本发明的另一方面提供的一种比对装置,用以实施上述比对方法的全部或部分步骤,该装置包括:转化模块,用于将每条读段转化成与该读段对应的一组短片段,获得多组短片段;查找模块,用于确定短片段在参考库的对应位置,以获得第一定位结果,所称的参考库为基于参考序列构建的哈希表,参考库包含多个条目,参考库的一个条目对应一条种子序列,所称的种子序列能够与参考序列上的至少一段序列匹配,参考库的相邻两个条目对应的两条种子序列在参考序列上的距离小于短片段的长度;剔除模块,用于去除第一定位结果中定位到参考库相邻条目中的任一条目上的短片段,获得第二定位结果;生长模块,用于基于所述第二定位结果中来自相同读段的短片段进行延伸,以获得读段的比对结果。
依据本发明的又一方面提供的一种计算机可读介质,用于存储/承载计算机可执行程序,本领域普通技术人员可以理解,在执行该程序时,通过指令相关硬件可完成上述比对方法的全部或部分步骤。所称介质可以包括:只读存储器、随机存储器、磁盘或光盘等。
依据本发明的再一方面提供的一种比对系统,包括:输入装置,用于输入数据;输出装置,用于输出数据;处理器,用于执行计算机可执行程序,执行所述计算机可执行程序包括完成上述比对方法;存储装置,用于存储数据,其中包括所述计算机可执行程序。
本发明的比对方法、装置和/或系统通过将读段转化成短片段以及将序列信息转化成位置信息,具有较高的灵敏性和准确性,能够高效准确地处理各种测序平台的下机数据,包括长读长和短读长的下机数据。特别是对于包含未能识别的碱基的读段,即包含gap的读段的高效精确的定位,尤其适用。
本发明的附加方面和优点将在下面的描述中部分给出,部分将从下面的描述中变得明显,或通过本发明的实践了解到。
附图说明
本发明的上述和/或附加的方面和优点从结合下面附图对实施例的描述中将变得明显和容易理解,其中:
图1是本发明实施方式中的比对方法的流程示意图。
图2是本发明实施方式中的参考库的相邻两个条目的距离的示意图。
图3是本发明实施方式中的连通长度示意图。
图4是本发明实施方式中的比对装置的结构示意图。
图5是本发明实施方式中的比对装置的结构示意图。
图6是本发明实施方式中的比对装置的结构示意图。
图7是本发明实施方式中的比对装置的结构示意图。
图8是本发明实施方式中的比对装置的结构示意图。
具体实施方式
下面详细描述本发明的实施例,所述实施例的示例在附图中示出,其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的,仅用于解释本发明,而不能理解为对本发明的限制。
在本发明的描述中,所称的“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性和/或具有先后顺序。
所称的“相连”、“连接”应做广义理解,例如,可以是固定连接,也可以是可拆卸连接,或一体地连接;可以是机械连接,也可以是电连接;可以是直接相连,也可以通过中间媒介间接相连,可以是两个元件内部的连通。对于本领域的普通技术人员而言,可以具体情况理解上述术语在本发明中的具体含义。
所称的“读段”指测定DNA/RNA/蛋白序列所获得的序列片段,包括利用测序平台对DNA/RNA/蛋白序列的至少一部分进行测定识别所获得的序列。测序平台可选择但不限于Illumina公司的Hisq/Miseq/Nextseq测序平台、Thermo Fisher(Life  Technologies)公司的Ion Torrent平台、BGI的BGISEQ平台和单分子测序平台,测序方式可以选择单端测序,也可以选择双末端测序,获得的下机数据是测读出来的片段,称为读段(reads)。
所称的“比对”指序列比对,包括将读段定位到参考序列上的过程,也包括获得读段定位/匹配结果的过程。
依据本发明的实施方式提供的一种比对方法,请参考图1,该方法包括如下步骤:S110将每条读段转化成与该读段对应的一组短片段,获得多组短片段;S120确定短片段在参考库的对应位置,以获得第一定位结果,参考库为基于参考序列构建的哈希表,参考库包含多个条目,参考库的一个条目对应一条种子序列,种子序列能够与参考序列上的至少一段序列匹配,参考库的相邻两个条目对应的两条种子序列在参考序列上的距离小于短片段的长度;S130去除第一定位结果中定位到参考库相邻条目中的任一条目上的短片段,获得第二定位结果;S140基于所述第二定位结果中来自相同读段的短片段进行延伸,以获得读段的比对结果。该比对方法通过将读段转化成短片段以及将读段序列信息转化成位置信息,即将序列形态转成数字形态,利于快速准确地实现各种测序平台的下机数据的比对定位。特别是对于包含未能识别的碱基的读段,即包含gap或者N的读段的快速准确比对,比如由于测序质量不佳、碱基识别不佳等所获得的读段的比对分析,尤其适用。
所称的参考序列(reference,ref)为预先确定的序列,可以是自己预先测定组装的DNA和/或RNA序列,也可以是他人测定公开的DNA和/或RNA序列,可以是预先获得的目标个体所属生物类别中的任意的参考模板,例如,同一生物类别的已公开的基因组组装序列的全部或者至少一部分,若目标个体是人类,其基因组参考序列(也称为参考基因组或者参考染色体组)可选择NCBI数据库提供的HG19;进一步地,也可以预先配置包含更多参考序列的资源库,在进行序列比对前,先依据目标个体的性别、人种、地域等因素选择或测定组装出更接近的序列来作为参考序列,有助于后续获得更准确的序列分析结果。参考序列包含染色体编号以及各个位点在染色体上的位置信息。所称的参考库实质为哈希表(hash table),可以直接以所称的种子序列为键(键名)、以所称的种子序列在参考序列上的位置(position)为值(键值)构建该参考库;也可以先将所称的种子序列转成数字或者整数字符串,以该数字或者整数字符串为键、以种子序列在参考序列上的位置为值建立该参考库。所称的以种子序列在参考序列上的位置为值,可以是该种子序列在参考序列/染色体上对应的一个或多个位置,位置可直接以真实数值或者数值范围表示,也可以重新编码以自定义的字符和/或数字表示。根据本发明的一个实施例,利用C++的向量vector实现哈希表的构建,可表示为:Hash(seed)=Vector(position),所称的向量vector是一种对象实体,能够容纳许多其他类型相同的元素,因此也被称为容器。可用二进制保存,以此建成该参考库。另外,也可以将哈希表分成块(block)存储,在block头设置块头键和块尾键,例如,对于顺序序列块{5,6,7,8...,19,20}, 设置块头和块尾(或者说表头和表尾)5和20,若有个数为3,因3<5,可知3不属于该顺序序列块,若有个数为10,因5<10<20,可知10属于这个序列块。如此在查询的时候可以选择全局索引,也可以通过比较块头键和块尾键快速定位到所在的block,可不需要全局索引。
所称的参考库可以在要进行序列比对时构建,也可以预先构建保存。根据本发明的具体实施方式,预先构建参考库保存备用,参考库的构建包括:依据参考序列的碱基总数totalBase,确定种子序列的长度L,L=μ*log(totalBase),
Figure PCTCN2017095612-appb-000001
且L小于待比对分析的读段的长度(读长);基于所述种子序列的长度,生成所有可能的种子序列,获得种子序列集;确定种子序列集中能够匹配到参考序列的种子序列以及该种子序列的匹配位置,以获得该参考库。该方法基于发明人多次假设试验验证而建立的种子序列长度与参考序列的关系,能够使构建得的参考库包含全面的种子序列以各种子序列在参考序列上对应的位置的关联信息,该参考库结构紧凑,内存占用小且能够用于序列定位分析中的高速访问查询。根据该实施方式获得的参考库的一个条目只包含一个键,一个键对应至少一个值。
本发明的该实施方式,对生成所有可能的种子序列、获得种子序列集的方法不作限制,对于输入一个集合,可遍历该集合中的元素来获得特定长度的、所有可能的元素组合,例如可以利用递归算法和/或循环算法来实现。
在一个例子中,参考序列为人基因组,包含大约30亿个碱基,待处理的读段的长度为不小于25bp,L取[11,15]中的整数,利于高效比对。
在一个例子中,参考序列为人cDNA参考基因组,统计该参考基因组的碱基总数totalBase,基于碱基总数设定种子序列(seed)的长度L,L(seed)=log(totalBase)*μ,
Figure PCTCN2017095612-appb-000002
基于L以及DNA序列的碱基种类包括A、T、C和G四种,利用递归算法,生成所有可能的种子序列的集合,获得种子序列集,该过程可表示为seed=B1B2...BL,B∈{ATCG};确定种子序列集中能够匹配到该参考基因组的种子序列以及该种子序列的匹配位置,以能够匹配到该参考基因组的种子序列为键、以该种子序列在参考基因组上的位置位为值来构建获得该参考库。
在一个例子中,参考序列为某物种的DNA基因组和转录组,统计该参考序列的碱基总数totalBase,基于碱基总数设定种子序列(seed)的长度L,L(seed)=log(totalBase)*μ,
Figure PCTCN2017095612-appb-000003
基于L、组成DNA序列的碱基种类包括A、T、C和G四种以及组成RNA序列的碱基种类包括A、U、C和G四种,利用递归算法,生成所有可能的种子序列的集合,获得种子序列集,该过程可表示为seed=B1B2...BL,B∈{ATCG}∪{AUCG};确定种子序列集中能够匹配到该参考序列的种子序列以及该种子序列的匹配位置,以能够匹配到该参考序列的种子序列为键、以该种子序列在参考序列上的位置位为值来构建获得该参考库。
在一个例子中,可将种子序列转化成由数字字符组成的字符串,以该字符串为 键来建库,能够提高访问查询所建参考库的速度。例如,在获得能够匹配到参考序列的种子序列后,对种子序列进行如下编码:
Figure PCTCN2017095612-appb-000004
又例如,在获得种子序列集后,对种子序列集中的种子序列进行编码,碱基编码规则可与上面相同,且对参考序列也可以进行同样规则的编码转换,利于快速获得种子序列在参考序列上对应的位置信息,也利于提高所建参考库的访问查询速度。
根据本发明的具体实施方式,确定种子序列集中能够匹配到参考序列的种子序列以及该种子序列的匹配位置,包括:利用大小为L的窗口对参考序列进行滑窗,将种子序列集中的种子序列与滑窗得的窗口序列进行匹配,以确定种子序列集中能够匹配到参考序列的种子序列以及该种子的匹配位置,进行匹配的容错率为ε1。如此,能够快速获得种子序列在参考序列上的对应位置信息,利于快速构建获得参考库。所称的容错率为允许的错配碱基所占的比例,错配选自置换、插入和缺失中的至少一种。
在一个例子中,所称的匹配为严格匹配,即容错率ε1为零,当种子序列与一条或多条滑窗序列完全一致时,滑窗序列的位置为该种子序列在参考序列上对应的位置。在另一个例子中,所称的匹配为容错匹配,容错率ε1大于零,当种子序列与一条或多条滑窗序列的相同位置的碱基不一致的比例小于容错率时,滑窗序列的位置为该种子序列在参考序列上对应的位置。在一个例子中,对种子序列在参考序列上对应的位置进行编码,以编码后的字符例如数字字符为值进行参考库的构建。
换个角度,容错率ε1为不为零,相当于将一条种子序列转变成ε1允许下的一组种子模板序列(seed template),例如seed=ATCG,ε1为0.25即四个碱基中允许一个错误,则seed template可以为ATCG、TTCG、CTCG、GTCG、AACG、ACCG、AGCG等等。在ε1为0.25下确定seed=ATCG在参考序列上的位置时,相当于确定该seed对应的所有seed template在参考序列的位置,例如ref=ATCG,前面所示的所有seed template都可匹配到该位置,ref=TTCG,seed template为ATCG、TTCG、CTCG或者GTCG可以匹配到该位置。进而,构建得的参考库可以以一条seed为键,也可以以该条seed对应的所有seed template中的每一条为键,键与键不同,一个键至少对应一个值。
根据本发明的具体实施方式,在确定种子序列在参考序列上的对应位置时,对参考序列进行滑窗的步长依据L和ε1来确定。在一个例子中,进行滑窗的步长不小于L*ε1。在一个具体例子中,参考序列为人基因组,包含大约30亿个碱基,待处理的读段的长度为不小于25bp,L为14bp,ε1取0.2-0.3,进行滑窗的步长取3bp-5bp,使滑窗定位过程中相邻两个窗口能够跨过ε1条件下的连续错误组合,利于快速定 位。在一个例子中,构建得的参考库的相邻两个条目之间的距离为滑窗的步长。
根据本发明的具体实施方式,S110包括:利用大小为L的窗口对读段进行滑窗,以获得与该读段对应的一组短片段,该滑窗的步长为1bp。如此,对于一条长度为K的reads,获得(K-L+1)条长度为L的短片段,将reads转成短片段,利用高速访问查询参考库,确定各短片段在参考库的对应位置,进而获得短片段对应的reads在参考库的信息。
根据本发明的具体实施方式,S120包括:将短片段与参考库的条目对应的种子序列进行匹配,以确定短片段在参考库的位置,进行匹配的容错率为ε2
在一个例子中,所称的匹配为严格匹配,即容错率ε2为零,当一条短片段与参考库的一个条目对应的seed或者seed template完全一致时,获得该短序列在参考库上的位置信息。在另一个例子中,所称的匹配为容错匹配,容错率ε2大于零,当短序列与参考库的一个或多个条目对应的seed或者seed template不匹配的碱基的比例小于容错率ε2时,获得该短序列在参考库上的位置信息。在一个具体例子中,ε2=ε1且不为零,使能够获得尽可能多的有效数据。
根据本发明的具体实施方式,参考图2,S120中,所称的参考库的相邻两个条目对应的两条种子序列X1和X2在参考序列ref上的距离,可分为以下两种情形:当参考库的两个条目的键和值均为唯一,即一个条目对应一[键,值],参考图2a,相当于该X1和X2与参考序列均为唯一匹配时(X1和X2都只匹配到参考序列一个位置),所称的距离为X1和X2在参考序列上对应的这两个位置之间的距离,加粗黑线显示这两个位置;当参考库的两个条目中至少一个条目的键对应多个值,参考图2b,相当于该两条种子序列X1和X2中的至少一条与参考序列为非唯一匹配即X1和X2中至少有一条匹配到参考序列的多个位置,所称的距离为该X1和X2在参考序列上对应的相距最近的两个位置之间的距离,加粗黑线显示这两个位置。该实施方式对两条序列之间的距离的表示方法不作限制,例如,可以表示为一条序列的两个末端中的任一末端到另一条序列的任一末端的距离,也可以表示为一条序列的中心到另一条序列的中心的距离。
根据本发明的具体实施方式,在获得第二定位结果之后,S130还包括:去除连通长度小于预定阈值的短片段,以去除后的结果替代第二定位结果,连通长度为第二比对结果中的来自相同读段且定位到参考库不同条目的短片段映射到参考序列的总长度。该处理有利于去除一些过渡冗余的和/或相对低质量的数据,利于提高比对速度。
连通长度可表示为来自相同读段且定位到参考库不同条目的短片段的长度总和减去映射到参考序列上短片段之间的重叠部分的长度。在一个例子中,来自一条读段且定位到参考库不同条目的短片段有4条,表示为Y1、Y2、Y3和Y4,各自的长度分别为L1、L2、L3和L4,定位位置示意如图3,其中的X1和X2映射到参考序列的位置有重叠,重叠部分的长度为J,连通长度为(L1+L2+L3+L4-J)。在一个例子中,不同短片段的长度均为L,所称的预定阈值为L,如此,可在允许丢 失部分有效但质量相对低的数据的情况下,提高比对速度。
根据本发明的具体实施方式,在获得第二定位结果之后,S130还包括:依据第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行评判,去除评判结果不符合预定要求的读段。去除读段同时也是去除了该读段对应的短片段。如此,在满足一定的敏感性和准确性的前提下,基于第二定位结果,直接进行精确匹配/局部快速比对,能够加速比对。
该实施方式对评判的方法不作限定,例如可以利用量化打分的方式。在一个例子中,对来自相同读段的短片短的定位结果进行打分,打分规则是:与参考序列匹配的位点作减分,与参考序列不匹配的位点作加分;在获得第二定位结果之后,依据第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行计分,去除得分不大于第一预设值的读段。根据一个具体示例,读长为25bp,对来自相同读段的短片段进行序列构建,以获得重构序列,例如,可以根据具有更多短序列支持来确定某位点的碱基类型,若某位点没有支持的短片段即没有短片段比对到该位点,则该位点碱基类型不确定可以N来表示,以此来获得重构序列,可看出重构序列与读段是对应的,重构序列的长度为读长;重构序列与参考序列(ref)匹配的位点作减一分,与参考序列不匹配的位点作加一分,比对容错率即一条读段/重构序列允许的错配比例为0.12,比对容许错误的长度为3bp(25*0.12),初始分数Scoreinit为读长,第一预设值为22(25-3),如此,去除得分小于22即匹配不上参考序列的位点占比超过比对容错率的重构序列,利于在允许丢失部分有效但质量相对低的数据的情况下,加速比对。根据一个具体例子,使用位运算和动态规划算法[G.Myers.A fast bit-vector algorithm for approximate string matching based on dynamic progamming.Journal of the ACM,46(3):395-415,1999],对于每条重构序列,读入每个位点i的位置,利用64位的二进制掩模进行快速匹配计分,每个位点一分,初始分数Scoreinit为读长,可表示为Scoreinit=length(read),匹配计分获得分数Score,可表示为:
Figure PCTCN2017095612-appb-000005
在一个例子中,对来自相同读段的短片短的定位结果进行打分,打分规则是:与参考序列匹配的位点作加分,与参考序列不匹配的位点作减分;在获得第二定位结果之后,依据第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行计分,去除得分不小于第二预设值的读段对应的短片段。根据一个具体示例,读长为25bp,对来自相同读段的短片段进行序列构建,以获得重构序列,例如,可以根据具有更多短序列支持来确定某位点的碱基类型,若某位点没有支持的短片段即没有短片段比对到该位点,则该位点碱基类型不确定可以N来表示,以此来获得重构序列,可看出重构序列与读段是对应的,重构序列的长度为读长;重构序列与参考序列(ref)匹配的位点作加一分,与参考序列不匹配的位点作减一分,比对容错率即一条读段/重构序列允许的错配比例为0.12,比对容许错误的长度为 3bp(25*0.12),初始分数Scoreinit为-25,第二预设值为-22(-25-3),如此,去除得分大于-22的重构序列,在允许丢失部分有效但相对低质量的数据的情况下,加速比对。
根据本发明的具体实施方式,S140中的基于第二定位结果中来自相同读段的短片段进行延伸,包括:基于来自相同读段的短片段进行序列构建,获得重构序列;基于重构序列与该重构序列对应的参考序列的公共部分进行延伸,以获得延伸序列。如此,将短片段及短片段定位信息转化成短片段对应的读段(在此称为重构序列)的定位信息,利于后续比对处理快速准确的进行。
所称的公共部分,为多条序列共有的部分。根据本发明的具体实施方式,所称的公共部分为公共子串和/或公共子序列。公共子串指多条序列中共有的连续部分,公共子序列则不须连续。例如,对于ABCBDAB和BDCABA,公共子序列是BCBA,公共子串是AB。
所称的基于来自相同读段的短片段进行序列构建,获得重构序列,在一个例子中,可根据具有更多短片段支持来确定重构序列上某位点的碱基类型,若某位点没有支持的短片段即没有短片段比对到参考序列该位点,则该位点碱基类型不确定可以以N来表示,以此来获得所称的重构序列。可看出,重构序列与读段是对应的,重构序列的长度为读长。
所称的重构序列对应的参考序列,为与重构序列匹配的一段参考序列,该段参考序列的长度不小于读长。在一个例子中,重构序列对应的参考序列的长度与重构序列相同,均为读长。在另外一个例子中,允许重构序列与对应的参考序列容错匹配,重构序列对应的参考序列的长度为重构序列的长度加上两倍的容错匹配长度,例如,重构序列长度即读长为25bp,重构序列与参考序列的匹配允许错配12%,可以以重构序列对比上的那段参考序列以及该段参考序列两端各3bp(25*12%)序列来作为重构序列对应的参考序列。
根据本发明的一个具体例子,所称的公共部分为公共子串。S140中的基于第二定位结果中来自相同读段的短片段进行延伸,包括:查找所述重构序列与所述重构序列对应的参考序列的公共子串,确定所述重构序列和所述重构序列对应的参考序列的最长公共子串;基于编辑距离,延伸所述最长公共子串以获得延伸序列。如此,能够更加准确获得包含更长匹配序列的比对结果。
根据本发明的一个具体例子,所称的公共部分为公共子序列。S140中的基于第二定位结果中来自相同读段的短片段进行延伸,包括:查找所述重构序列与所述重构序列对应的参考序列的公共子序列,确定所述重构序列和所述重构序列对应的参考序列的最长公共子序列;基于编辑距离,延伸所述最长公共子序列以获得延伸序列。
所称的编辑距离,也叫Levenshtein距离,是指两个字符串之间,由一个转成另一个所需的最少编辑操作次数。编辑操作包括将一个字符替换成另一个字符、插入一个字符以及删除一个字符。一般来说,编辑距离越小,两个串的相似度越大。
在一个例子中,对于一条重构序列/读段,查找该重构序列与该重构序列对应的参考序列的最长公共子串,可表示为求两个字符串x1x2...xi和y1y2...yj的公共子串,字符串的长度分别为m和n,计算这两字符串的公共子串的长度c[i,j],可以得到转移方程:
Figure PCTCN2017095612-appb-000006
解方程可得这两条序列的最长公共子串的长度为max(c[i,j]),i∈{1,...,m},j∈{1,...,n};接着利用编辑距离,将最长公共子串转化成对应的参考序列,可使最长公共子串两端不断生长,找出两个字符串之间需要的最小字符操作(替换,删除,插入)。可以使用动态规划算法确定编辑距离,该问题具备最优子结构,编辑距离d[i,j]的计算可表示为下列公式:
Figure PCTCN2017095612-appb-000007
其中,洞/空缺(gap)表示插入或者删除一个字符,公式中的gap表示插入或者删除一个字符(对应序列中的位点)所需的罚分,匹配(match)表示两个字符一样,公式中的match表示两个字符一样时的得分,错配(mismatch)表示两个字符不相等/不同,公式中的mismatch表示表示两个字符不相等/不同时的阀分。d[i,j]取三者中最小的一项。在一个具体例子中,一gap罚3分,连续gap增加阀1分,一个位点错配罚2分,位点匹配得0分。如此,利于含gap序列的高效比对。
根据本发明的具体实施方式,所称的公共部分为公共子序列。根据本发明的具体实施方式,S140包括:查找第二定位结果中定位到参考库的相同条目的短片段的公共子序列,确定每条读段对应的最长公共子序列;基于编辑距离,延伸最长公共子序列以获得延伸序列。
在一个例子中,对于一条重构序列/读段,查找重构序列与该重构序列对应的参考序列的最长公共子序列,基于最长公共子序列,将最长公共子序列对应的那段重构序列转化为最长公共子序列对应的那段参考序列,可利用Smith Waterman算法找出这两段序列的编辑距离,对两个字符串x1x2...xi和y1y2...yj,可以通过以下公式求得:
Figure PCTCN2017095612-appb-000008
其中,
σ表示记分函数,σ(i,j)表示字符(位点)xi和yj错配或者匹配的得分,σ(-,j)表示xi空缺(删除)或者yj插入的得分,σ(i,-)表示yj删除或者xi插入的得分;接着,可利用前面示例中的计算编辑距离的方法,将最长公共子序列对应的那段重构序列转化成重构序列对应的参考序列,可在最长公共子序列对应的那段重构序列的两端不断生长,找出最小字符操作(替换,删除,插入)。在一个具体例子中,一gap罚3分,连续gap增加罚1分,一个位点错配罚2分,位点匹配得4分。如此,能够实现含gap的序列的高效比对且能够保留既含gap而其它位点准确度高的序列。
根据本发明的具体实施方式,S140还包括:从延伸序列的至少一端对延伸序列进行截断,计算截断后的延伸序列的错误定位位点的比例,满足以下条件停止截断:截断后的延伸序列的错误定位位点的比例小于第三预设值。如此,采用截断和剔除的方式,可以较好的保留匹配良好的局部序列,利于提高数据的有效率。
具体地,根据本发明的具体实施方式,基于以下对延伸序列进行截断:i、计算第一错误率和第二错误率,若第一错误率小于第二错误率,则从延伸序列的第一端对延伸序列进行截断,若第一错误率大于第二错误率,则从延伸序列的第二端对延伸序列进行截断,以获得截断后的延伸序列,所称的第一错误率为从延伸序列的第一端对延伸序列进行截断获得的截断后的延伸序列的错误定位位点的比例,所称的第二错误率为从延伸序列的第二端对延伸序列进行截断、获得的截断后的延伸序列的错误定位位点的比例;ii、以截断后的延伸序列替代延伸序列进行i,直至截断后的延伸序列的错误定位位点的比例小于第四预设值。如此,采用双端截断和剔除的方式,可以较好的保留匹配良好的局部序列,利于提高数据的有效率。根据一个具体例子,延伸序列的长度为25bp,第四预设值为第三预设置为0.12。
根据本发明的具体实施方式,S140还包括:从延伸序列的至少一端对延伸序列进行滑窗,计算滑窗得的窗口序列的错误定位位点的比例,根据窗口序列的错误定位位点的比例对延伸序列进行截断,满足以下条件停止截断:滑窗得的窗口序列的错误定位位点的比例大于第五预设值。如此,采用截断和剔除的方式,可以较好的保留匹配良好的局部序列,利于提高数据的有效率。
具体地,根据本发明的具体实施方式,基于以下对延伸序列进行截断:i、计算第三错误率和第四错误率,若第三错误率小于第四错误率,则从延伸序列的第二端对延伸序列进行截断,若第三错误率大于第四错误率,则从延伸序列的第一端对延伸序列进行截断,以获得截断后的延伸序列,所称的第三错误率为从延伸序列的第 一端对延伸序列进行滑窗、获得的窗口序列的错误定位位点的比例,所称的第四错误率为从延伸序列的第二端对延伸序列进行滑窗、获得的窗口序列的错误定位位点的比例;ii、以截断后的延伸序列替代延伸序列进行i,直至窗口序列的错误定位位点的比例大于第六预设值。如此,采用双端截断和剔除的方式,可以较好的保留匹配良好的局部序列,利于提高数据的有效率。
根据本发明的具体实施方式,滑窗的窗口不大于延伸序列的长度。根据一个具体例子,延伸序列的长度为25bp,滑窗的窗口大小为10bp,第六预设值为第五预设值为0.12。
根据本发明的具体实施方式,截断的大小为1bp,即一次截断为去掉1个碱基。如此,能够高效的获得包含更多长序列的比对结果。
在一个具体例子中,利用Bowtie(http://bowtie-bio.sourceforge.net/index.shtml)、BWA(http://bio-bwa.sourceforge.net/)以及上述比对方法对同一批模拟数据进行序列比对,模拟数据基于人类参考基因组设置、包含100K条长度为100bp的序列。各软件/方法运行所需空间、时间、比对上参考序列的比例(Map rate)以及准确性相当。在该示例中,利用该实施方式中的比对方法所需的时间和内存较Bowtie或BWA的稍长和大,但利用该实施方式的比对方法的序列比对上的比例以及比对准确性达到98.9%和99.9%,均较利用Bowtie和BWA的稍高。
依据本发明的实施方式提供的一种比对装置,参考图4,该装置用以实现上述任一实施例/实施方式中的方法,该装置100包括:转化模块10,用于将每条读段转化成与该读段对应的一组短片段,获得多组短片段;查找模块20,用于确定所述短片段在参考库的对应位置,以获得第一定位结果,所述参考库为基于参考序列构建的哈希表,所述参考库包含多个条目,所述参考库的一个条目对应一条种子序列,所述种子序列能够与所述参考序列上的至少一段序列匹配,所述参考库的相邻两个条目对应的两条种子序列在所述参考序列上的距离小于所述短片段的长度;剔除模块30,用于去除所述第一定位结果中定位到所述参考库相邻条目中的任一条目上的短片段,获得第二定位结果;生长模块40,用于基于所述第二定位结果中来自相同读段的短片段进行延伸,以获得所述读段的比对结果。
上述对本发明任一实施方式中的比对方法的技术特征和效果的描述,同样适用本发明这一实施方式中的比对装置,在此不再赘述。
例如,根据本发明的具体实施方式,参考图5,该装置100还包括建库模块12,用于构建所述参考库,利用所述建库模块12进行以下:依据所述参考序列的碱基总数totalBase,确定种子序列的长度L,L=μ*log(totalBase),
Figure PCTCN2017095612-appb-000009
基于所述种子序列的长度,生成所有可能的种子序列,获得种子序列集;确定所述种子序列集中能够匹配到所述参考序列的种子序列以及该种子序列的匹配位置,以获得所述参考库。
根据本发明的具体实施方式,所称的确定种子序列集中能够匹配到参考序列的种子序列以及该种子序列的匹配位置,包括:利用大小为L的窗口对所述参考序列 进行滑窗,将所述种子序列集中的种子序列与滑窗得的窗口序列进行匹配,以确定所述种子序列集中能够匹配到所述参考序列的种子序列以及该种子的匹配位置,进行所述匹配的容错率为ε1
根据本发明的具体实施方式,进行所述滑窗的步长依据L和ε1来确定。
根据本发明的具体实施方式,进行所述滑窗的步长不小于L*ε1
根据本发明的具体实施方式,所述参考库的相邻两个条目之间的距离大于所述滑窗的步长。
根据本发明的具体实施方式,利用所述转化模块10进行以下:利用大小为L的窗口对所述读段进行滑窗,以获得与该读段对应的一组短片段,进行所述滑窗的步长为1bp。
根据本发明的具体实施方式,参考图6,还包括第一筛选模块32,所述第一筛选模块32与所述剔除模块30相连,用于对来自剔除模块30的第二定位结果进行以下:去除连通长度小于预定阈值的短片段,以去除后的结果替代所述第二定位结果,所述连通长度为所述第二比对结果中的来自相同读段且定位到所述哈希表不同条目的短片段映射到参考序列的总长度。
根据本发明的具体实施方式,参考图7,还包括第二筛选模块34,所述第二筛选模块34与所述剔除模块30相连,用于:依据所述第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行评判,去除评判结果不符合预定要求的读段对应的短片段。
根据本发明的具体实施方式,所述第二筛选模块34用于:依据所述第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行计分,去除得分不大于第一预设值的读段。
根据本发明的具体实施方式,所述第二筛选模块34用于:依据所述第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行计分,去除得分不小于第二预设值的读段。
根据本发明的具体实施方式,所述生长模块40用于:基于所述来自相同读段的短片段进行序列构建,获得重构序列;基于所述重构序列与所述重构序列对应的参考序列的公共部分进行延伸,以获得延伸序列。
根据本发明的具体实施方式,所述公共部分为公共子串。所述生长模块40用于:查找所述重构序列与所述重构序列对应的参考序列的公共子串,确定所述重构序列和所述重构序列对应的参考序列的最长公共子串;基于编辑距离,延伸所述最长公共子串以获得延伸序列。
根据本发明的具体实施方式,所述公共部分为公共子序列。所述生长模块40用于:查找所述重构序列与所述重构序列对应的参考序列的公共子序列,确定所述重构序列和所述重构序列对应的参考序列的最长公共子序列;基于编辑距离,延伸所述最长公共子序列以获得延伸序列。
根据本发明的具体实施方式,参考图8,还包括截断模块50,用于:从来自所 述生长模块40的延伸序列的至少一端对所述延伸序列进行截断,计算截断后的延伸序列的错误定位位点的比例,满足以下条件停止所述截断:截断后的延伸序列的错误定位位点的比例小于第三预设值。
根据本发明的具体实施方式,还包括截断模块50,用于:i、计算第一错误率和第二错误率,若所述第一错误率小于所述第二错误率,则从所述延伸序列的第一端对所述延伸序列进行截断,获得截断后的延伸序列,若所述第一错误率大于所述第二错误率,则从所述延伸序列的第二端对所述延伸序列进行截断,获得截断后的延伸序列,所述第一错误率为从所述延伸序列的第一端对所述延伸序列进行截断获得的截断后的延伸序列的错误定位位点的比例,所述第二错误率为从所述延伸序列的第二端对所述延伸序列进行截断、获得的截断后的延伸序列的错误定位位点的比例;ii、以截断后的延伸序列替代所述延伸序列进行i,直至所述截断后的延伸序列的错误定位位点的比例小于第四预设值。
根据本发明的具体实施方式,还包括截断模块50,用于:从所述延伸序列的至少一端对所述延伸序列进行滑窗,计算滑窗得的窗口序列的错误定位位点的比例,根据所述窗口序列的错误定位位点的比例对所述延伸序列进行截断,满足以下条件停止所述截断:滑窗得的窗口序列的错误定位位点的比例小于第五预设值。
根据本发明的具体实施方式,还包括截断模块50,用于:i、计算第三错误率和第四错误率,若所述第三错误率小于所述第四错误率,则从所述延伸序列的第二端对所述延伸序列进行截断,获得截断后的延伸序列,若所述第三错误率大于所述第四错误率,则从所述延伸序列的第一端对所述延伸序列进行截断,获得截断后的延伸序列,所述第三错误率为从所述延伸序列的第一端对所述延伸序列进行滑窗、获得的窗口序列的错误定位位点的比例,所述第四错误率为从所述延伸序列的第二端对所述延伸序列进行滑窗、获得的窗口序列的错误定位位点的比例;ii、以截断后的延伸序列替代所述延伸序列进行i,直至所述窗口序列的错误定位位点的比例大于第六预设值。
根据本发明的具体实施方式,所述截断的大小为1bp。
根据本发明的具体实施方式,所述滑窗的窗口不大于所述延伸序列的长度。
依据本发明的一种实施方式提供的一种计算机可读介质,该介质用以承载上述任一实施方式中的比对方法的一部分或者全部步骤。所称介质包括但不限于只读存储器、随机存储器、磁盘和光盘等。
依据本发明的一种实施方式提供的一种比对系统,该系统1000包括:输入装置100,用于输入数据;输出装置200,用于输出数据;处理器300,用于执行计算机可执行程序,执行所述计算机可执行程序包括完成上述任一实施方式的比对方法;存储装置400,用于存储数据,其中包括所述计算机可执行程序。
在本说明书的描述中,参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本发明的至少一个实施例或示例中。在本说明书中,对上述术语 的示意性表述不一定指的是相同的实施例或示例。而且,描述的具体特征、结构、材料或者特点可以在任何的一个或多个实施例或示例中以合适的方式结合。
尽管已经示出和描述了本发明的实施例,本领域的普通技术人员可以理解:在不脱离本发明的原理和宗旨的情况下可以对这些实施例进行多种变化、修改、替换和变型,本发明的范围由权利要求及其等同限定。

Claims (46)

  1. 一种比对方法,其特征在于,包括:
    将每条读段转化成与该读段对应的一组短片段,获得多组短片段;
    确定所述短片段在参考库的对应位置,以获得第一定位结果,
    所述参考库为基于参考序列构建的哈希表,所述参考库包含多个条目,所述参考库的一个条目对应一条种子序列,所述种子序列能够与所述参考序列上的至少一段序列匹配,
    所述参考库的相邻两个条目对应的两条种子序列在所述参考序列上的距离小于所述短片段的长度;
    去除所述第一定位结果中定位到所述参考库相邻条目中的任一条目上的短片段,获得第二定位结果;
    基于所述第二定位结果中来自相同读段的短片段进行延伸,以获得所述读段的比对结果。
  2. 权利要求1的方法,其特征在于,所述参考库的构建包括:
    依据所述参考序列的碱基总数totalBase,确定种子序列的长度L,L=μ*log(totalBase),
    Figure PCTCN2017095612-appb-100001
    基于所述种子序列的长度,生成所有可能的种子序列,获得种子序列集;
    确定所述种子序列集中能够匹配到所述参考序列的种子序列以及该种子序列的匹配位置,以获得所述参考库。
  3. 权利要求2的方法,其特征在于,所述确定种子序列集中能够匹配到参考序列的种子序列以及该种子序列的匹配位置,包括:
    利用大小为L的窗口对所述参考序列进行滑窗,将所述种子序列集中的种子序列与滑窗得的窗口序列进行匹配,以确定所述种子序列集中能够匹配到所述参考序列的种子序列以及该种子的匹配位置,进行所述匹配的容错率为ε1
  4. 权利要求3的方法,其特征在于,进行所述滑窗的步长依据L和ε1来确定。
  5. 权利要求3的方法,其特征在于,进行所述滑窗的步长不小于L*ε1
  6. 权利要求3的方法,其特征在于,所述参考库的相邻两个条目之间的距离大于或者等于所述滑窗的步长。
  7. 权利要求1-6任一方法,其特征在于,所述将每条读段转化成与该读段对应的一组短片段,获得多组短片段,包括:
    利用大小为L的窗口对所述读段进行滑窗,以获得与该读段对应的一组短片段,进行所述滑窗的步长为1bp。
  8. 权利要求1-7任一方法,其特征在于,所述确定短片段在参考库的对应位置,以获得第一定位结果,包括:
    将所述短片段与所述参考库的条目对应的种子序列进行匹配,以确定所述短片 段在所述参考库的位置,进行所述匹配的容错率为ε2。
  9. 权利要求1-8任一方法,其特征在于,在获得所述第二定位结果之后,
    去除连通长度小于预定阈值的短片段,以去除后的结果替代所述第二定位结果,所述连通长度为所述第二比对结果中的来自相同读段且定位到所述参考库不同条目的短片段映射到参考序列的总长度。
  10. 权利要求1-9任一方法,其特征在于,在获得所述第二定位结果之后,
    依据所述第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行评判,去除评判结果不符合预定要求的读段。
  11. 权利要求10的方法,其特征在于,在获得所述第二定位结果之后,
    依据所述第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行计分,去除得分不大于第一预设值的读段。
  12. 权利要求10的方法,其特征在于,在获得所述第二定位结果之后,
    依据所述第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行计分,去除得分不小于第二预设值的读段。
  13. 权利要求1-12任一方法,其特征在于,所述基于第二定位结果中来自相同读段的短片段进行延伸,包括:
    基于所述来自相同读段的短片段进行序列构建,获得重构序列;
    基于所述重构序列与所述重构序列对应的参考序列的公共部分进行延伸,以获得延伸序列。
  14. 权利要求13的方法,其特征在于,所述公共部分为公共子串。
  15. 权利要求14的方法,其特征在于,所述基于第二定位结果中来自相同读段的短片段进行延伸,包括:
    查找所述重构序列与所述重构序列对应的参考序列的公共子串,确定所述重构序列和所述重构序列对应的参考序列的最长公共子串;
    基于编辑距离,延伸所述最长公共子串以获得延伸序列。
  16. 权利要求13任一方法,其特征在于,所述公共部分为公共子序列。
  17. 权利要求16的方法,其特征在于,所述基于第二定位结果中来自相同读段的短片段进行延伸,包括:
    查找所述重构序列与所述重构序列对应的参考序列的公共子序列,确定所述重构序列和所述重构序列对应的参考序列的最长公共子序列;
    基于编辑距离,延伸所述最长公共子序列以获得延伸序列。
  18. 权利要求15或17的方法,其特征在于,还包括:
    从所述延伸序列的至少一端对所述延伸序列进行截断,计算截断后的延伸序列的错误定位位点的比例,满足以下条件停止所述截断:截断后的延伸序列的错误定位位点的比例小于第三预设值。
  19. 权利要求15或17的方法,其特征在于,还包括:
    i、计算第一错误率和第二错误率,若所述第一错误率小于所述第二错误率,则 从所述延伸序列的第一端对所述延伸序列进行截断,若所述第一错误率大于所述第二错误率,则从所述延伸序列的第二端对所述延伸序列进行截断,以获得截断后的延伸序列,
    所述第一错误率为从所述延伸序列的第一端对所述延伸序列进行截断获得的截断后的延伸序列的错误定位位点的比例,
    所述第二错误率为从所述延伸序列的第二端对所述延伸序列进行截断、获得的截断后的延伸序列的错误定位位点的比例;
    ii、以截断后的延伸序列替代所述延伸序列进行i,直至所述截断后的延伸序列的错误定位位点的比例小于第四预设值。
  20. 权利要求15或17的方法,其特征在于,还包括:
    从所述延伸序列的至少一端对所述延伸序列进行滑窗,计算滑窗得的窗口序列的错误定位位点的比例,
    根据所述窗口序列的错误定位位点的比例对所述延伸序列进行截断,满足以下条件停止所述截断:滑窗得的窗口序列的错误定位位点的比例大于第五预设值。
  21. 权利要求15或17的方法,其特征在于,还包括:
    i、计算第三错误率和第四错误率,若所述第三错误率小于所述第四错误率,则从所述延伸序列的第二端对所述延伸序列进行截断,
    若所述第三错误率大于所述第四错误率,则从所述延伸序列的第一端对所述延伸序列进行截断,以获得截断后的延伸序列,
    所述第三错误率为从所述延伸序列的第一端对所述延伸序列进行滑窗、获得的窗口序列的错误定位位点的比例,
    所述第四错误率为从所述延伸序列的第二端对所述延伸序列进行滑窗、获得的窗口序列的错误定位位点的比例;
    ii、以截断后的延伸序列替代所述延伸序列进行i,直至所述窗口序列的错误定位位点的比例大于第六预设值。
  22. 权利要求18-21任一方法,其特征在于,所述截断的大小为1bp。
  23. 权利要求20或21的方法,其特征在于,所述滑窗的窗口不大于所述延伸序列的长度。
  24. 一种比对装置,其特征在于,包括:
    转化模块,用于将每条读段转化成与该读段对应的一组短片段,获得多组短片段;
    查找模块,用于确定所述短片段在参考库的对应位置,以获得第一定位结果,
    所述参考库为基于参考序列构建的哈希表,所述参考库包含多个条目,所述参考库的一个条目对应一条种子序列,所述种子序列能够与所述参考序列上的至少一段序列匹配,
    所述参考库的相邻两个条目对应的两条种子序列在所述参考序列上的距离小于所述短片段的长度;
    剔除模块,用于去除所述第一定位结果中定位到所述参考库相邻条目中的任一条目上的短片段,获得第二定位结果;
    生长模块,用于基于所述第二定位结果中来自相同读段的短片段进行延伸,以获得所述读段的比对结果。
  25. 权利要求24的装置,其特征在于,还包括建库模块,用于构建所述参考库,利用所述建库模块进行以下:
    依据所述参考序列的碱基总数totalBase,确定种子序列的长度L,L=μ*log(totalBase),
    Figure PCTCN2017095612-appb-100002
    基于所述种子序列的长度,生成所有可能的种子序列,获得种子序列集;
    确定所述种子序列集中能够匹配到所述参考序列的种子序列以及该种子序列的匹配位置,以获得所述参考库。
  26. 权利要求25的装置,其特征在于,所述确定种子序列集中能够匹配到参考序列的种子序列以及该种子序列的匹配位置,包括:
    利用大小为L的窗口对所述参考序列进行滑窗,将所述种子序列集中的种子序列与滑窗得的窗口序列进行匹配,以确定所述种子序列集中能够匹配到所述参考序列的种子序列以及该种子的匹配位置,进行所述匹配的容错率为ε1
  27. 权利要求26的装置,其特征在于,进行所述滑窗的步长依据L和ε1来确定。
  28. 权利要求26的装置,其特征在于,进行所述滑窗的步长不小于L*ε1
  29. 权利要求26的装置,其特征在于,所述参考库的相邻两个条目之间的距离大于所述滑窗的步长。
  30. 权利要求24-29任一装置,其特征在于,利用所述转化模块进行以下:
    利用大小为L的窗口对所述读段进行滑窗,以获得与该读段对应的一组短片段,进行所述滑窗的步长为1bp。
  31. 权利要求24-30任一装置,其特征在于,还包括第一筛选模块,所述第一筛选模块与所述剔除模块相连,用于对来自剔除模块的第二定位结果进行以下:
    去除连通长度小于预定阈值的短片段,以去除后的结果替代所述第二定位结果,所述连通长度为所述第二比对结果中的来自相同读段且定位到所述哈希表不同条目的短片段映射到参考序列的总长度。
  32. 权利要求24-31任一装置,其特征在于,还包括第二筛选模块,所述第二筛选模块与所述剔除模块相连,用于:
    依据所述第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行评判,去除评判结果不符合预定要求的读段对应的短片段。
  33. 权利要求32的装置,其特征在于,所述第二筛选模块用于,
    依据所述第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行计分,去除得分不大于第一预设值的读段。
  34. 权利要求32的装置,其特征在于,所述第二筛选模块用于,
    依据所述第二定位结果中来自相同读段的短片段的定位结果,对该读段的定位结果进行计分,去除得分不小于第二预设值的读段。
  35. 权利要求24-34任一装置,其特征在于,所述生长模块用于:
    基于所述来自相同读段的短片段进行序列构建,获得重构序列;
    基于所述重构序列与所述重构序列对应的参考序列的公共部分进行延伸,以获得延伸序列。
  36. 权利要求35的装置,其特征在于,所述公共部分为公共子串。
  37. 权利要求36的装置,其特征在于,所述生长模块用于:
    查找所述重构序列与所述重构序列对应的参考序列的公共子串,确定所述重构序列和所述重构序列对应的参考序列的最长公共子串;
    基于编辑距离,延伸所述最长公共子串以获得延伸序列。
  38. 权利要求35装置,其特征在于,所述公共部分为公共子序列。
  39. 权利要求38的装置,其特征在于,所述生长模块用于:
    查找所述重构序列与所述重构序列对应的参考序列的公共子序列,确定所述重构序列和所述重构序列对应的参考序列的最长公共子序列;
    基于编辑距离,延伸所述最长公共子序列以获得延伸序列。
  40. 权利要求37或39的装置,其特征在于,还包括截断模块,用于:
    从来自所述生长模块的延伸序列的至少一端对所述延伸序列进行截断,计算截断后的延伸序列的错误定位位点的比例,满足以下条件停止所述截断:截断后的延伸序列的错误定位位点的比例小于第三预设值。
  41. 权利要求37或39的装置,其特征在于,还包括截断模块,用于:
    i、计算第一错误率和第二错误率,若所述第一错误率小于所述第二错误率,则从所述延伸序列的第一端对所述延伸序列进行截断,获得截断后的延伸序列,
    若所述第一错误率大于所述第二错误率,则从所述延伸序列的第二端对所述延伸序列进行截断,获得截断后的延伸序列,
    所述第一错误率为从所述延伸序列的第一端对所述延伸序列进行截断获得的截断后的延伸序列的错误定位位点的比例,
    所述第二错误率为从所述延伸序列的第二端对所述延伸序列进行截断、获得的截断后的延伸序列的错误定位位点的比例;
    ii、以截断后的延伸序列替代所述延伸序列进行i,直至所述截断后的延伸序列的错误定位位点的比例小于第四预设值。
  42. 权利要求37或39的装置,其特征在于,还包括截断模块,用于:
    从所述延伸序列的至少一端对所述延伸序列进行滑窗,计算滑窗得的窗口序列的错误定位位点的比例,
    根据所述窗口序列的错误定位位点的比例对所述延伸序列进行截断,满足以下条件停止所述截断:滑窗得的窗口序列的错误定位位点的比例小于第五预设值。
  43. 权利要求37或39的装置,其特征在于,还包括截断模块,用于:
    i、计算第三错误率和第四错误率,若所述第三错误率小于所述第四错误率,则从所述延伸序列的第二端对所述延伸序列进行截断,获得截断后的延伸序列,
    若所述第三错误率大于所述第四错误率,则从所述延伸序列的第一端对所述延伸序列进行截断,获得截断后的延伸序列,
    所述第三错误率为从所述延伸序列的第一端对所述延伸序列进行滑窗、获得的窗口序列的错误定位位点的比例,
    所述第四错误率为从所述延伸序列的第二端对所述延伸序列进行滑窗、获得的窗口序列的错误定位位点的比例;
    ii、以截断后的延伸序列替代所述延伸序列进行i,直至所述窗口序列的错误定位位点的比例大于第六预设值。
  44. 权利要求30-43任一装置,其特征在于,所述截断的大小为1bp。
  45. 权利要求42或43的装置,其特征在于,所述滑窗的窗口不大于所述延伸序列的长度。
  46. 一种比对系统,其特征在于,包括:
    输入装置,用于输入数据;
    输出装置,用于输出数据;
    处理器,用于执行计算机可执行程序,执行所述计算机可执行程序包括完成权利要求1-23任一方法;
    存储装置,用于存储数据,其中包括所述计算机可执行程序。
PCT/CN2017/095612 2017-08-02 2017-08-02 比对方法、装置及系统 WO2019023978A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP17919940.1A EP3663890B1 (en) 2017-08-02 2017-08-02 Alignment method, device and system
US16/635,996 US11482304B2 (en) 2017-08-02 2017-08-02 Alignment methods, devices and systems
PCT/CN2017/095612 WO2019023978A1 (zh) 2017-08-02 2017-08-02 比对方法、装置及系统

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/095612 WO2019023978A1 (zh) 2017-08-02 2017-08-02 比对方法、装置及系统

Publications (1)

Publication Number Publication Date
WO2019023978A1 true WO2019023978A1 (zh) 2019-02-07

Family

ID=65232099

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/095612 WO2019023978A1 (zh) 2017-08-02 2017-08-02 比对方法、装置及系统

Country Status (3)

Country Link
US (1) US11482304B2 (zh)
EP (1) EP3663890B1 (zh)
WO (1) WO2019023978A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534158A (zh) * 2019-08-16 2019-12-03 浪潮电子信息产业股份有限公司 一种基因序列比对方法、装置、服务器及介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782097B (zh) * 2021-09-07 2022-06-24 中国人民解放军国防科技大学 一种基于布隆过滤器的锚点筛选方法、装置和计算机设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110257889A1 (en) * 2010-02-24 2011-10-20 Pacific Biosciences Of California, Inc. Sequence assembly and consensus sequence determination
US20120089338A1 (en) * 2009-03-13 2012-04-12 Life Technologies Corporation Computer implemented method for indexing reference genome
CN103294932A (zh) * 2012-02-24 2013-09-11 三星Sds株式会社 用于碱基序列分析的参考序列处理系统及方法
CN103451279A (zh) * 2013-09-11 2013-12-18 北京华生恒业科技有限公司 一种基于solid测序技术的基因snp位点检测方法
CN105243297A (zh) * 2015-10-09 2016-01-13 人和未来生物科技(长沙)有限公司 一种参考基因组上基因序列片段的快速比对定位方法

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2904533A4 (en) * 2012-10-08 2016-06-01 Spiral Genetics Inc METHODS AND SYSTEMS FOR IDENTIFYING, FROM READING SYMBOL SEQUENCES, VARIATIONS TO A SEQUENCE OF REFERENCE SYMBOLS
KR101508816B1 (ko) 2012-10-29 2015-04-07 삼성에스디에스 주식회사 염기 서열 정렬 시스템 및 방법
US10364468B2 (en) 2016-01-13 2019-07-30 Seven Bridges Genomics Inc. Systems and methods for analyzing circulating tumor DNA
CN106096332A (zh) 2016-06-28 2016-11-09 深圳大学 面向存储的dna序列的并行快速匹配方法及其系统
CA3104546A1 (en) * 2019-05-24 2020-12-03 Illumina, Inc. Flexible seed extension for hash table genomic mapping

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120089338A1 (en) * 2009-03-13 2012-04-12 Life Technologies Corporation Computer implemented method for indexing reference genome
US20110257889A1 (en) * 2010-02-24 2011-10-20 Pacific Biosciences Of California, Inc. Sequence assembly and consensus sequence determination
CN103294932A (zh) * 2012-02-24 2013-09-11 三星Sds株式会社 用于碱基序列分析的参考序列处理系统及方法
CN103451279A (zh) * 2013-09-11 2013-12-18 北京华生恒业科技有限公司 一种基于solid测序技术的基因snp位点检测方法
CN105243297A (zh) * 2015-10-09 2016-01-13 人和未来生物科技(长沙)有限公司 一种参考基因组上基因序列片段的快速比对定位方法

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
G. MYERS: "A fast bit-vector algorithm for approximate string matching based on dynamic progamming", JOURNAL OF THE ACM, vol. 46, no. 3, 1999, pages 395 - 415, XP058340666, DOI: 10.1145/316542.316550
RIZK, G. ET AL.: "GASSST: global alignment short sequence search tool", BIOINFORMATICS, vol. 26, no. 20, 15 October 2010 (2010-10-15), pages 2534 - 2540, XP055567315 *
See also references of EP3663890A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534158A (zh) * 2019-08-16 2019-12-03 浪潮电子信息产业股份有限公司 一种基因序列比对方法、装置、服务器及介质
CN110534158B (zh) * 2019-08-16 2023-08-04 浪潮电子信息产业股份有限公司 一种基因序列比对方法、装置、服务器及介质

Also Published As

Publication number Publication date
US20200381082A1 (en) 2020-12-03
US11482304B2 (en) 2022-10-25
EP3663890B1 (en) 2024-04-10
EP3663890C0 (en) 2024-04-10
EP3663890A4 (en) 2021-03-24
EP3663890A1 (en) 2020-06-10

Similar Documents

Publication Publication Date Title
CN107403075B (zh) 比对方法、装置及系统
US11560598B2 (en) Systems and methods for analyzing circulating tumor DNA
Hoffmann et al. Fast mapping of short sequences with mismatches, insertions and deletions using index structures
JP3672242B2 (ja) パターン検索方法、パターン検索装置、コンピュータプログラム及び記憶媒体
US8301437B2 (en) Tokenization platform
WO2018218788A1 (zh) 一种基于全局种子打分优选的三代测序序列比对方法
CN110692101B (zh) 用于比对靶向的核酸测序数据的方法
CN100356392C (zh) 一种字符识别的后处理方法
CN105760706A (zh) 一种二代测序数据的压缩方法
KR20070083641A (ko) 전사 맵핑을 위한 유전자 식별 기호 분석방법
He et al. De novo assembly methods for next generation sequencing data
CN115631789A (zh) 一种基于泛基因组的群体联合变异检测方法
WO2019023978A1 (zh) 比对方法、装置及系统
CN108140071B (zh) 使用分级反向索引表的dna比对
Vaddadi et al. Read mapping on genome variation graphs
CN112397148B (zh) 序列比对方法、序列校正方法及其装置
JP3630414B2 (ja) 塩基配列のクラスタ生成システム、塩基配列のクラスタ生成方法、該クラスタ生成方法を実行するためのプログラム、および該プログラムを記憶したコンピュータ可読な記録媒体、および塩基配列情報提供システム
CN102637204B (zh) 一种基于互索引结构的文本查询方法
CN113535962B (zh) 数据入库方法、装置、电子装置、程序产品及存储介质
Chen et al. CGAP-align: a high performance DNA short read alignment tool
CN110534158B (zh) 一种基因序列比对方法、装置、服务器及介质
Muggli et al. A succinct solution to Rmap alignment
JP2008102641A (ja) 検索装置、検索方法及びプログラム
CN115602246B (zh) 一种基于群体基因组的序列比对方法
CN117292751A (zh) 一种基于最长路径搜索的三代序列比对方法

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017919940

Country of ref document: EP

Effective date: 20200302