CN106682393B - Genome sequence comparison method and device - Google Patents

Genome sequence comparison method and device


Publication number
CN106682393B
Authority
CN
China
Prior art keywords
genome sequence
compared
reads
alignment algorithm
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611074255.8A
Other languages
Chinese (zh)
Other versions
CN106682393A (en)
Inventor
何光铸
王东辉
蔡文君
刘凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ronglian Technology Group Co., Ltd
Original Assignee
UNITED ELECTRONICS CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by UNITED ELECTRONICS CO Ltd filed Critical UNITED ELECTRONICS CO Ltd
Priority to CN201611074255.8A priority Critical patent/CN106682393B/en
Publication of CN106682393A publication Critical patent/CN106682393A/en
Application granted granted Critical
Publication of CN106682393B publication Critical patent/CN106682393B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a genome sequence comparison method and device. The method comprises: reading a partial genome sequence from a genome sequence file to be aligned; aligning the partial genome sequence against a reference genome sequence according to a bidirectional BWT alignment algorithm, a single-end dynamic programming alignment algorithm and a paired-end dynamic programming alignment algorithm; after alignment with any of these algorithms, when no unaligned sequence remains in the partial genome sequence, reading a new partial genome sequence from the genome sequence file to be aligned and continuing the alignment according to the above steps; and repeating the above steps until the entire genome sequence file has been aligned, then outputting the alignment result. The proposed method and device address the long running time, slow processing and high resource consumption of genome sequence alignment.

Description

Genome sequence comparison method and device
Technical field
The present invention relates to the technical field of data processing, and in particular to a genome sequence comparison method and device.
Background art
Genome sequence alignment is a basic, general-purpose processing step in genomic data analysis; its purpose is to locate the position of each sequencing read on a reference genome. The reference sequence of the human genome is about 3GB long, sequencing reads are generally between 100bp and 150bp long, and the total amount of sequence data from a typical whole-genome sequencing run is about 100GB. To align these sequences, the industry generally uses open-source alignment software, the best known at present being BWA and Bowtie2. Processing typically takes 10 hours or more, which makes alignment the most time-consuming step in genomic data analysis. In other words, these commonly used second-generation sequencing alignment algorithms generally suffer from long running times, slow processing and high resource consumption.
Summary of the invention
In view of this, an object of the invention is to propose a genome sequence comparison method and device that address the long running time, slow processing and high resource consumption of genome sequence alignment algorithms.
Based on the above object, the genome sequence comparison method provided by the invention comprises:
Reading a partial genome sequence from a genome sequence file to be aligned;
Aligning the partial genome sequence against a reference genome sequence according to a bidirectional BWT alignment algorithm;
After alignment according to the bidirectional BWT alignment algorithm, when the partial genome sequence contains at least one pair of reads in which only one read has been aligned, realigning each such pair of reads against the reference genome sequence according to a single-end dynamic programming alignment algorithm;
After alignment according to the single-end dynamic programming alignment algorithm, when the partial genome sequence still contains at least one pair of reads in which neither read has been aligned, realigning each such pair of reads against the reference genome sequence according to a paired-end dynamic programming alignment algorithm;
After alignment according to any of the foregoing alignment algorithms, when no unaligned sequence remains in the partial genome sequence, reading a new partial genome sequence from the genome sequence file to be aligned and continuing the alignment according to the above steps;
Repeating the above steps until the entire genome sequence file to be aligned has been aligned, and outputting the alignment result.
In some optional embodiments, the method of aligning genome sequences according to the bidirectional BWT alignment algorithm specifically includes:
Segmenting the reads using the pigeonhole principle;
Building the BWT of the reference genome sequence, the suffix array, and the BWT of the reversed reference genome sequence;
Using backward search and forward search to look up, from right to left and from left to right respectively, the position of each read or read segment on the reference genome sequence.
In some optional embodiments, the method of aligning genome sequences according to the single-end dynamic programming alignment algorithm specifically includes:
Determining the specific position on the reference genome sequence to which one read of a pair of reads has been aligned;
Selecting a specific range around the specific position according to a preset position range threshold;
Aligning the other, unaligned read of the pair within the specific range using a dynamic programming algorithm.
In some optional embodiments, the method of aligning genome sequences according to the paired-end dynamic programming alignment algorithm specifically includes:
Constructing seeds for each read of a pair of reads;
Aligning each seed to the reference genome sequence;
If, in some region of the reference genome sequence, both reads of the pair each have a corresponding seed aligned, taking that region as a candidate region for the final alignment position;
Aligning the two reads of the pair within the candidate region using a dynamic programming algorithm.
Another aspect of the invention provides a genome sequence comparison device, comprising:
A data acquisition module, for reading a partial genome sequence from a genome sequence file to be aligned;
An alignment module, for aligning the partial genome sequence against a reference genome sequence according to a bidirectional BWT alignment algorithm; after alignment according to the bidirectional BWT alignment algorithm, when the partial genome sequence contains at least one pair of reads in which only one read has been aligned, realigning each such pair of reads against the reference genome sequence according to a single-end dynamic programming alignment algorithm; after alignment according to the single-end dynamic programming alignment algorithm, when the partial genome sequence still contains at least one pair of reads in which neither read has been aligned, realigning each such pair of reads against the reference genome sequence according to a paired-end dynamic programming alignment algorithm; and, after alignment according to any of the foregoing alignment algorithms, when no unaligned sequence remains in the partial genome sequence, reading a new partial genome sequence from the genome sequence file to be aligned and continuing the alignment according to the above steps.
In some optional embodiments, the alignment module is specifically used for:
Segmenting the reads using the pigeonhole principle;
Building the BWT of the reference genome sequence, the suffix array, and the BWT of the reversed reference genome sequence;
Using backward search and forward search to look up, from right to left and from left to right respectively, the position of each read or read segment on the reference genome sequence.
In some optional embodiments, the alignment module is specifically used for:
Determining the specific position on the reference genome sequence to which one read of a pair of reads has been aligned;
Selecting a specific range around the specific position according to a preset position range threshold;
Aligning the other, unaligned read of the pair within the specific range using a dynamic programming algorithm.
In some optional embodiments, the alignment module is specifically used for:
Constructing seeds for each read of a pair of reads;
Aligning each seed to the reference genome sequence;
If, in some region of the reference genome sequence, both reads of the pair each have a corresponding seed aligned, taking that region as a candidate region for the final alignment position;
Aligning the two reads of the pair within the candidate region using a dynamic programming algorithm.
As can be seen from the above embodiments, the genome sequence comparison method and device provided by the invention use multi-level alignment algorithms: once one level has finished, the part that was not aligned is passed on to the next-level alignment algorithm, so that the complexity of the algorithm matches the complexity of the data. Each level is also optimized, which improves the speed of the algorithm as a whole. With the method and device provided by the invention, using the same resources and with alignment accuracy guaranteed, the alignment time for a human whole-genome sequence can be shortened to about 4 hours, significantly shorter than in the prior art, which improves sequencing efficiency.
Brief description of the drawings
Fig. 1 is a flow diagram of one embodiment of the genome sequence comparison method provided by the invention;
Fig. 2 is a schematic diagram of the module structure of one embodiment of the genome sequence comparison device provided by the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the invention serve to distinguish two non-identical entities or two non-identical parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the invention; subsequent embodiments do not repeat this note.
Based on the foregoing object, a first aspect of the embodiments of the invention proposes a genome sequence comparison method that addresses the long running time, slow processing and high resource consumption of genome sequence alignment. Fig. 1 is a flow diagram of one embodiment of the genome sequence comparison method provided by the invention.
The genome sequence comparison method comprises the following steps:
Step 101: obtain the reference genome sequence and the genome sequence file to be aligned. The files are obtained in a conventional manner. The genome sequence file to be aligned may be in FASTQ format.
The genome sequence comparison method divides sequence alignment into 3 levels. Each time, a portion of the sequences is read from the input genome sequence file to be aligned, and the level 1, level 2 and level 3 alignment algorithms are executed in turn; sequences that were not aligned at one level enter the alignment algorithm of the next level. The specific steps are as follows.
Step 102: read a partial genome sequence from the genome sequence file to be aligned.
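Reading the input in portions, as in Step 102, can be pictured with a short sketch. It assumes the usual four-line FASTQ record layout (header, sequence, separator, quality); the function name `read_fastq_batches` and the `batch_size` parameter are hypothetical and do not come from the patent.

```python
import io
from itertools import islice

def read_fastq_batches(handle, batch_size):
    """Yield lists of (name, sequence, quality) tuples, batch_size records at a time.

    Assumes well-formed FASTQ with exactly four lines per record.
    """
    while True:
        lines = list(islice(handle, 4 * batch_size))
        if not lines:
            return
        batch = []
        for i in range(0, len(lines) - 3, 4):
            name = lines[i].rstrip("\n")[1:]   # drop the leading '@'
            seq = lines[i + 1].rstrip("\n")
            qual = lines[i + 3].rstrip("\n")
            batch.append((name, seq, qual))
        yield batch

# Toy input standing in for a real file opened with open(path).
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n")
batches = list(read_fastq_batches(demo, batch_size=2))
```

With three records and a batch size of two, the generator yields one batch of two records followed by one batch of one record, so only one portion of the file is held in memory at a time.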
Step 103: according to the bidirectional BWT alignment algorithm (level 1; bidirectional BWT: Bi-directional Burrows-Wheeler Transform), align the partial genome sequence against the reference genome sequence. The bidirectional BWT alignment algorithm handles the alignment of reads with at most 4 base mismatches. Reads are the sequencing sequences obtained in high-throughput sequencing; each read is a short base sequence. During bioinformatic analysis, each read is aligned to the reference genome, which reveals the differences between the sequenced sample and the reference genome and thereby identifies variants.
Optionally, the method of aligning genome sequences according to the bidirectional BWT alignment algorithm may specifically include the following steps:
Segmenting the reads using the pigeonhole principle, each segment allowing 0 to 2 base mismatches;
Then searching and aligning with the bidirectional BWT alignment algorithm, including:
Building the BWT of the reference genome sequence, the suffix array, and the BWT of the reversed reference genome sequence;
Using backward search and forward search to look up, from right to left and from left to right respectively, the position of each read or read segment on the reference genome sequence.
Bidirectional BWT alignment is relatively inefficient when it has to handle many base mismatches. When at most 4 base mismatches are allowed, the reads are segmented according to the pigeonhole principle so that each segment allows 0 to 2 base mismatches. The bidirectional BWT then only has to handle alignments with at most 2 base mismatches, which greatly improves efficiency.
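The pigeonhole argument can be made concrete with a small sketch. It assumes the read is cut into two nearly equal segments, a choice the patent does not state; the helper names `split_read` and `per_segment_budget` are hypothetical. If a read aligns with at most k mismatches and is cut into p segments, at least one segment carries at most floor(k / p) mismatches, so with k = 4 and p = 2 some segment has at most 2.

```python
def split_read(read, num_segments):
    """Split a read into num_segments contiguous, nearly equal pieces.

    Returns a list of (start_offset, segment) tuples.
    """
    base, extra = divmod(len(read), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        end = start + base + (1 if i < extra else 0)
        segments.append((start, read[start:end]))
        start = end
    return segments

def per_segment_budget(max_mismatches, num_segments):
    """Pigeonhole bound: some segment has at most this many mismatches."""
    return max_mismatches // num_segments

segments = split_read("ACGTACGTAC", 2)
budget = per_segment_budget(4, 2)
```

Each segment can then be searched with the smaller mismatch budget, and a hit for any one segment anchors the verification of the whole read.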
The commonly used alignment software BWA, after building the BWT of the reference sequence together with the corresponding index and the SA (suffix array), uses backward search: it searches from right to left for the position of each read or read segment in the genome. The bidirectional BWT used in this patent, in addition to the traditional BWT index (denoted B), also builds a BWT index of the reversed reference sequence (denoted B'). Using B, B' and the SA, the positions of reads or seeds in the genome are searched in both directions by backward and forward search, which significantly improves the efficiency of sequence alignment.
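The two index directions can be illustrated with a toy FM-index. This is a minimal sketch, not the patent's implementation: suffix sorting is naive, `occ` is a linear scan, only exact matching is shown, and all function names are made up for illustration. Searching the reversed pattern in the index of the reversed reference consumes the read's bases from left to right, which is the sense in which B' supports forward search. A real bidirectional BWT keeps the intervals in B and B' synchronized so that a match can be extended in either direction at every step; that bookkeeping is omitted here.

```python
def build_index(text):
    """Return (suffix array, BWT, C table) for text + '$' using naive sorting."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)   # text[-1] is '$' when i == 0
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):                # characters smaller than ch
        c_table[ch] = total
        total += counts[ch]
    return sa, bwt, c_table

def occ(bwt, ch, i):
    """Occurrences of ch in bwt[:i]; real indexes use sampled counts instead."""
    return bwt.count(ch, 0, i)

def backward_search(pattern, bwt, c_table):
    """Return the half-open suffix-array interval [lo, hi) of exact matches."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):             # pattern is consumed right to left
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + occ(bwt, ch, lo)
        hi = c_table[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi

def find_with_b(pattern, ref):
    """Backward search on the index B of the reference."""
    sa, bwt, c_table = build_index(ref)
    lo, hi = backward_search(pattern, bwt, c_table)
    return sorted(sa[lo:hi])

def find_with_b_prime(pattern, ref):
    """Forward search: backward search of the reversed pattern on B'."""
    sa_r, bwt_r, c_r = build_index(ref[::-1])
    lo, hi = backward_search(pattern[::-1], bwt_r, c_r)
    n, m = len(ref), len(pattern)
    # A hit at position j in the reversed text starts at n - j - m in ref.
    return sorted(n - sa_r[k] - m for k in range(lo, hi))

ref = "ACGTACGTTACG"
hits_b = find_with_b("ACG", ref)
hits_b_prime = find_with_b_prime("ACG", ref)
```

Both searches report the same reference positions for the same pattern; the difference is only the order in which the pattern's bases are consumed.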
Step 104: determine whether the partial genome sequence contains at least one pair of reads in which only one read has been aligned; if so, go to step 105; if not, go to step 106.
Step 105: according to the single-end dynamic programming alignment algorithm (level 2), each pair of reads in the partial genome sequence in which only one read has been aligned is realigned against the reference genome sequence. When, after the level 1 bidirectional BWT alignment algorithm, one read of a pair (A, A') has been aligned to the reference genome sequence (A or A') but the other (A' or A) has not, alignment continues with the level 2 alignment algorithm.
Optionally, the method of aligning genome sequences according to the single-end dynamic programming alignment algorithm may specifically include the following steps:
Determining the specific position (position pos) on the reference genome sequence to which one read (A or A') of a pair of reads (A, A') has been aligned; the reads produced by paired-end sequencing come in pairs, so if one read (A or A') of a pair (A, A') is aligned to position pos on the reference genome sequence, the theoretical alignment position of the other read (A' or A) lies in a certain area around position pos, namely the candidate region;
Therefore, selecting a specific range around the specific position (position pos) according to a preset position range threshold; the preset position range threshold can be chosen according to actual needs, for example set with reference to the error tolerance. Specifically, in paired-end sequencing, when a pair of reads is aligned to the genome, the distance between the two reads plus the lengths of the two reads equals the length of the sequencing fragment, and the position of the candidate region is determined on this principle. For example, if the sequencing fragment is 500bp and each read is 150bp, then after alignment to the genome the theoretical distance between the two reads is 200bp. Because sequencing fragment lengths vary, the theoretical distance is roughly 100bp to 200bp;
Aligning the other, unaligned read (A' or A) of the pair within the specific range using a dynamic programming algorithm.
Step 106: determine whether the partial genome sequence contains at least one pair of reads in which neither read has been aligned; if so, go to step 107; if not, go to step 108.
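The single-end stage of Step 105 can be sketched as a window computation followed by a small dynamic program. This is an illustration under stated assumptions, not the patent's algorithm: the anchored read is taken to be the leftmost read of the fragment, fragment length bounds are supplied by the caller, scoring is plain edit distance, and the names `inner_distance`, `candidate_window` and `align_in_window` are hypothetical.

```python
def inner_distance(fragment_len, read_len):
    """Gap between the two reads: fragment length minus both read lengths."""
    return fragment_len - 2 * read_len

def candidate_window(anchor_pos, read_len, min_fragment, max_fragment, ref_len):
    """Reference interval [start, end) that can contain the unaligned mate.

    Assumes the anchored read starts the fragment at anchor_pos.
    """
    start = max(0, anchor_pos + min_fragment - read_len)
    end = min(ref_len, anchor_pos + max_fragment)
    return start, end

def align_in_window(read, window):
    """Semi-global edit-distance DP: whole read against any substring of window.

    Returns (best_distance, end_offset_in_window).
    """
    prev = [0] * (len(window) + 1)           # the match may start anywhere
    for i in range(1, len(read) + 1):
        cur = [i] + [0] * len(window)
        for j in range(1, len(window) + 1):
            cost = 0 if read[i - 1] == window[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or mismatch
                         prev[j] + 1,         # extra base in the read
                         cur[j - 1] + 1)      # extra base in the window
        prev = cur
    best_end = min(range(len(prev)), key=prev.__getitem__)
    return prev[best_end], best_end

gap = inner_distance(500, 150)                        # the 500bp / 150bp example
window = candidate_window(1000, 150, 400, 600, 10_000)
exact = align_in_window("ACGTTGCA", "CCCACGTTGCAGGG")
one_off = align_in_window("ACGTAGCA", "CCCACGTTGCAGGG")
```

Restricting the dynamic program to the window keeps its cost proportional to read length times window length instead of read length times genome length, which is what makes this level affordable.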
Step 107: according to the paired-end dynamic programming alignment algorithm (level 3), each pair of reads in the partial genome sequence in which neither read has been aligned is realigned against the reference genome sequence. When, after the level 1 bidirectional BWT alignment algorithm and the level 2 single-end dynamic programming alignment algorithm, neither A nor A' of a pair of reads (A, A') has been aligned to the reference genome sequence, alignment continues with the level 3 alignment algorithm.
Optionally, the method of aligning genome sequences according to the paired-end dynamic programming alignment algorithm may specifically include the following steps:
Constructing seeds (substrings of a read) for each read (A and A') of a pair of reads;
Specifically, each read of a pair of reads (A, A') is divided into many segments to construct seeds (substrings of a read); when a pair of reads is aligned to the genome, the distance between the two reads lies within a certain range, so the distance between the seeds of the two reads should also lie within a certain range;
Aligning each seed to the reference genome sequence;
Specifically, the regions to which paired seeds (i.e. seeds whose mutual distance meets the requirement) align are retrieved to determine the candidate alignment region for the pair of reads. The reads are then aligned to the candidate region with a dynamic programming algorithm.
If, in some region of the reference genome sequence, both reads (A and A') of the pair each have a corresponding seed aligned, that region is a candidate region for the final alignment position;
Aligning the two reads (A and A') of the pair within the candidate region using a dynamic programming algorithm; after the alignment is complete, proceeding to step 108;
Step 108: determine whether the entire genome sequence file to be aligned has been aligned; if not, return to step 102; if so, go to step 109.
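The seed pairing of Step 107 can be sketched as follows. The sketch makes several simplifying assumptions that are not in the patent: seeds are fixed-length, non-overlapping substrings; seed lookup uses plain string search in place of the BWT index; both reads are given in reference orientation; and the names `make_seeds`, `seed_hits`, `candidate_regions` and the `max_span` limit are hypothetical.

```python
def make_seeds(read, seed_len, step):
    """Return (offset_in_read, seed) for substrings taken every step bases."""
    return [(i, read[i:i + seed_len])
            for i in range(0, len(read) - seed_len + 1, step)]

def seed_hits(read, ref, seed_len, step):
    """Implied read start positions on ref for every exact seed match."""
    hits = set()
    for offset, seed in make_seeds(read, seed_len, step):
        pos = ref.find(seed)
        while pos != -1:
            hits.add(pos - offset)           # shift seed hit back to read start
            pos = ref.find(seed, pos + 1)
    return sorted(hits)

def candidate_regions(read1, read2, ref, seed_len=4, step=4, max_span=40):
    """Pairs (start1, start2) whose seeds land close enough to form a fragment."""
    h1 = seed_hits(read1, ref, seed_len, step)
    h2 = seed_hits(read2, ref, seed_len, step)
    return [(a, b) for a in h1 for b in h2
            if b >= a and b + len(read2) - a <= max_span]

read1 = "ACGGTCAAGCTA"
read2 = "GGATCCTTGACA"
ref = "TTTTTT" + read1 + "CCCCCCCC" + read2 + "TTTTTT"
regions = candidate_regions(read1, read2, ref)
# A read carrying one mismatch still yields the region through its other seeds.
regions_mut = candidate_regions("ACGGTCATGCTA", read2, ref)
```

Each candidate pair would then be verified by running a dynamic programming alignment of both reads inside the corresponding region, as the last step of the list describes.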
Step 109: output the alignment result. Optionally, the output file of the genome sequence alignment is a BAM file. BAM is a format for storing genome sequence alignment results; it records the position of each sequence on the reference genome sequence and the details of its alignment.
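The control flow of steps 102 to 109 can be summarized in a short sketch. The three levels are passed in as functions; the toy stand-ins used here (`exact`, `near`, `give_up`) are hypothetical placeholders for the bidirectional BWT, single-end and paired-end stages, and results are kept in a dictionary where a real pipeline would write BAM records.

```python
REF = "TTGACCATGCAAGGCTTACGATCGGATCCAATGC"

def exact(read):
    """Toy level 1: exact string search on the reference."""
    pos = REF.find(read)
    return None if pos == -1 else pos

def near(read, mate_pos, radius=30, max_mismatches=2):
    """Toy level 2: mismatch-tolerant scan in a window around the aligned mate."""
    lo = max(0, mate_pos - radius)
    hi = min(len(REF) - len(read), mate_pos + radius)
    for s in range(lo, hi + 1):
        window = REF[s:s + len(read)]
        if sum(a != b for a, b in zip(read, window)) <= max_mismatches:
            return s
    return None

def give_up(read1, read2):
    """Toy level 3: placeholder that leaves both reads unaligned."""
    return None, None

def align_batch(pairs, level1, level2, level3):
    """Cascade one batch of read pairs through the three levels."""
    results = {}
    for idx, (r1, r2) in enumerate(pairs):
        p1, p2 = level1(r1), level1(r2)
        if (p1 is None) != (p2 is None):     # exactly one read aligned
            if p1 is None:
                p1 = level2(r1, p2)
            else:
                p2 = level2(r2, p1)
        if p1 is None and p2 is None:        # neither read aligned
            p1, p2 = level3(r1, r2)
        results[idx] = (p1, p2)
    return results

def align_file(batches, level1, level2, level3):
    """Process every batch in turn and collect the per-batch results."""
    return [align_batch(b, level1, level2, level3) for b in batches]

batch = [("GACCATGC", "ATCGGATC"),   # both reads match exactly
         ("GACCATGC", "ATCGCATC"),   # second read has one mismatch
         ("AAAAAAAA", "CCCCCCCC")]   # neither read is in the reference
output = align_file([batch], exact, near, give_up)
```

Only the pairs that a cheap level fails on reach the more expensive level, which is the cascade idea the embodiment relies on.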
As can be seen from the above embodiment, the genome sequence comparison method provided by the invention uses multi-level alignment algorithms: once one level has finished, the part that was not aligned is passed on to the next-level alignment algorithm, so that the complexity of the algorithm matches the complexity of the data. Each level is also optimized, which improves the speed of the algorithm as a whole. With the genome sequence comparison method provided by the invention, using the same resources and with alignment accuracy guaranteed, the alignment time for a human whole-genome sequence can be shortened to about 4 hours, significantly shorter than in the prior art, which improves sequencing efficiency.
A second aspect of the embodiments of the invention proposes a genome sequence comparison device that addresses the long running time, slow processing and high resource consumption of genome sequence alignment algorithms. Fig. 2 is a schematic diagram of the module structure of one embodiment of the genome sequence comparison device provided by the invention.
The genome sequence comparison device comprises:
A data acquisition module 201, for reading a partial genome sequence from a genome sequence file to be aligned;
An alignment module 202, for aligning the partial genome sequence against a reference genome sequence according to a bidirectional BWT alignment algorithm; after alignment according to the bidirectional BWT alignment algorithm, when the partial genome sequence contains at least one pair of reads in which only one read has been aligned, realigning each such pair of reads against the reference genome sequence according to a single-end dynamic programming alignment algorithm; after alignment according to the single-end dynamic programming alignment algorithm, when the partial genome sequence still contains at least one pair of reads in which neither read has been aligned, realigning each such pair of reads against the reference genome sequence according to a paired-end dynamic programming alignment algorithm; and, after alignment according to any of the foregoing alignment algorithms, when no unaligned sequence remains in the partial genome sequence, reading a new partial genome sequence from the genome sequence file to be aligned and continuing the alignment according to the above steps;
After the alignment module 202 has finished aligning a partial genome sequence, if the genome sequence file to be aligned has not yet been completely aligned, the data acquisition module 201 reads the next partial genome sequence from the file and sends it to the alignment module 202 for alignment, until the entire genome sequence file to be aligned has been aligned.
In some optional embodiments, the alignment module 202 is specifically used for:
Segmenting the reads using the pigeonhole principle;
Building the BWT of the reference genome sequence, the suffix array, and the BWT of the reversed reference genome sequence;
Using backward search and forward search to look up, from right to left and from left to right respectively, the position of each read or read segment on the reference genome sequence.
In some optional embodiments, the alignment module 202 is specifically used for:
Determining the specific position on the reference genome sequence to which one read of a pair of reads has been aligned;
Selecting a specific range around the specific position according to a preset position range threshold;
Aligning the other, unaligned read of the pair within the specific range using a dynamic programming algorithm.
In some optional embodiments, the alignment module 202 is specifically used for:
Constructing seeds for each read of a pair of reads;
Aligning each seed to the reference genome sequence;
If, in some region of the reference genome sequence, both reads of the pair each have a corresponding seed aligned, taking that region as a candidate region for the final alignment position;
Aligning the two reads of the pair within the candidate region using a dynamic programming algorithm.
As can be seen from the above embodiment, the genome sequence comparison device provided by the invention uses multi-level alignment algorithms: once one level has finished, the part that was not aligned is passed on to the next-level alignment algorithm, so that the complexity of the algorithm matches the complexity of the data. Each level is also optimized, which improves the speed of the algorithm as a whole. With the genome sequence comparison device provided by the invention, using the same resources and with alignment accuracy guaranteed, the alignment time for a human whole-genome sequence can be shortened to about 4 hours, significantly shorter than in the prior art, which improves sequencing efficiency.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the disclosure (including the claims) is limited to these examples. Within the spirit of the invention, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be carried out in any order, and there are many other variations of the different aspects of the invention as described above which, for brevity, are not given in detail.
In addition, to simplify the description and discussion, and so as not to obscure the invention, well-known power and ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, devices may be shown in block diagram form to avoid obscuring the invention, which also takes into account the fact that the details of implementing such block diagram devices depend heavily on the platform on which the invention is to be implemented (i.e., these details should be well within the understanding of those skilled in the art). Where specific details (e.g., circuits) are set forth to describe exemplary embodiments of the invention, it will be apparent to those skilled in the art that the invention can be implemented without these specific details or with variations of them. These descriptions should therefore be regarded as illustrative rather than restrictive.
Although the invention has been described in conjunction with specific embodiments, many alternatives, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, the embodiments discussed may be used with other memory architectures (e.g., dynamic RAM (DRAM)).
The embodiments of the invention are intended to cover all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included in the scope of protection of the invention.

Claims (4)

1. A genome sequence comparison method, characterized by comprising:
Step 101: reading a partial genome sequence from a genome sequence file to be aligned;
Step 102: aligning the partial genome sequence against a reference genome sequence according to a bidirectional BWT alignment algorithm;
Step 103: after alignment according to the bidirectional BWT alignment algorithm, when the partial genome sequence contains at least one pair of reads in which only one read has been aligned, realigning each such pair of reads against the reference genome sequence according to a single-end dynamic programming alignment algorithm;
Step 104: after alignment according to the single-end dynamic programming alignment algorithm, when the partial genome sequence still contains at least one pair of reads in which neither read has been aligned, realigning each such pair of reads against the reference genome sequence according to a paired-end dynamic programming alignment algorithm;
Step 105: after alignment according to any of the foregoing alignment algorithms, when no unaligned sequence remains in the partial genome sequence, reading a new partial genome sequence from the genome sequence file to be aligned and continuing the alignment according to the above steps;
Repeating step 102 to step 105 until the entire genome sequence file to be aligned has been aligned, and outputting the alignment result;
Wherein the method of aligning genome sequences according to the single-end dynamic programming alignment algorithm specifically includes: determining the specific position on the reference genome sequence to which one read of a pair of reads has been aligned; selecting a specific range around the specific position according to a preset position range threshold; and aligning the other, unaligned read of the pair within the specific range using a dynamic programming algorithm;
The method of aligning genome sequences according to the paired-end dynamic programming alignment algorithm specifically includes: constructing seeds for each read of a pair of reads; aligning each seed to the reference genome sequence; if, in some region of the reference genome sequence, both reads of the pair each have a corresponding seed aligned, taking that region as a candidate region for the final alignment position; and aligning the two reads of the pair within the candidate region using a dynamic programming algorithm.
2. The method according to claim 1, wherein the method of aligning genome sequences according to the bidirectional BWT alignment algorithm specifically includes:
Segmenting the reads using the pigeonhole principle;
Building the BWT of the reference genome sequence, the suffix array, and the BWT of the reversed reference genome sequence;
Using backward search and forward search to look up, from right to left and from left to right respectively, the position on the reference genome sequence of each read or of each segment of a read.
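The segmentation and search steps of claim 2 can be sketched as follows. By the pigeonhole principle, a read that aligns with at most k mismatches, once cut into k + 1 segments, must contain at least one segment that matches exactly, so exact search on the segments yields candidate positions. This is a minimal sketch, not the patented implementation: the index is built naively by sorting suffixes (suitable only for short sequences), all function names are assumptions, and only backward search is shown. Forward search can be obtained by running the same routine on an index of the reversed reference with the reversed pattern.

```python
def build_fm_index(text):
    """Build a suffix array, C table and occurrence table for text (naive construction)."""
    text += "$"  # sentinel, sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c]: number of characters in text strictly smaller than c.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return sorted start positions of exact matches, scanning the pattern right to left."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


def pigeonhole_segments(read, max_mismatches):
    """Split read into max_mismatches + 1 contiguous (offset, segment) pieces."""
    parts = max_mismatches + 1
    size, extra = divmod(len(read), parts)
    segments, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        segments.append((start, read[start:end]))
        start = end
    return segments


def candidate_positions(read, max_mismatches, index):
    """Candidate read start positions implied by exact segment hits."""
    sa, C, occ = index
    candidates = set()
    for offset, segment in pigeonhole_segments(read, max_mismatches):
        if not segment:
            continue
        for hit in backward_search(segment, sa, C, occ):
            if hit >= offset:
                candidates.add(hit - offset)
    return sorted(candidates)
```

Each candidate position still has to be verified against the full read, since only one segment is known to match exactly.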
3. A genome sequence comparison device, characterized by comprising:
A data acquisition module, configured to read a partial genome sequence from the genome sequence file to be compared;
A comparison module, configured to: align the partial genome sequence against the reference genome sequence according to the bidirectional BWT alignment algorithm; after alignment with the bidirectional BWT alignment algorithm, when the partial genome sequence contains at least one pair of reads in which only one read has been aligned, re-align each such pair of reads against the reference genome sequence according to the single-end dynamic programming alignment algorithm; after alignment with the single-end dynamic programming alignment algorithm, when the partial genome sequence still contains at least one pair of reads in which neither read has been aligned, re-align each such pair of reads against the reference genome sequence according to the paired-end dynamic programming alignment algorithm; and, after alignment with any of the foregoing alignment algorithms, when no unaligned sequence remains in the partial genome sequence, read a new partial genome sequence from the genome sequence file to be compared and continue the alignment by repeating the steps from aligning the partial genome sequence against the reference genome sequence according to the bidirectional BWT alignment algorithm through re-aligning, according to the paired-end dynamic programming alignment algorithm, each pair of reads in which neither read has been aligned;
The comparison module is specifically configured to carry out the single-end dynamic programming alignment algorithm, which includes: determining the specific position on the reference genome sequence to which one read of a pair of reads has been aligned; selecting, according to a preset position range threshold, a specific range around the specific position; and aligning, within the specific range, the other, unaligned read of the pair using a dynamic programming algorithm;
The comparison module is further configured to carry out the paired-end dynamic programming alignment algorithm, which specifically includes: constructing seeds for each read of a pair of reads; aligning each seed to the reference genome sequence; if, in some region of the reference genome sequence, both reads of the pair have corresponding seeds aligned, taking that region as a candidate region for the final alignment position; and aligning the two reads of the pair separately within the candidate region using a dynamic programming algorithm.
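The candidate-region selection of the paired-end dynamic programming algorithm can be illustrated with a short Python sketch. It is written under several assumptions that the claim does not specify: seeds are taken to be fixed-length k-mers looked up in a plain dictionary index, "the same region" is interpreted as seed hits of the two reads lying within `max_span` bases of each other, and both reads are assumed to be given in reference orientation. All names and default values are hypothetical.

```python
from collections import defaultdict


def kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_hits(read, index, k, step):
    """Reference positions at which any seed (k-mer) of the read matches exactly."""
    hits = set()
    for offset in range(0, len(read) - k + 1, step):
        hits.update(index.get(read[offset:offset + k], []))
    return hits


def candidate_regions(reference, read1, read2, k=5, step=1, max_span=50):
    """Merged reference intervals in which both reads of a pair have seed hits."""
    index = kmer_index(reference, k)
    hits1 = seed_hits(read1, index, k, step)
    hits2 = seed_hits(read2, index, k, step)
    # One interval per pair of nearby hits, covering both seeds.
    spans = sorted(
        (min(p1, p2), max(p1, p2) + k)
        for p1 in hits1
        for p2 in hits2
        if abs(p1 - p2) <= max_span
    )
    # Merge overlapping intervals into candidate regions.
    regions = []
    for lo, hi in spans:
        if regions and lo <= regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], hi))
        else:
            regions.append((lo, hi))
    return regions
```

Each returned interval would then be handed to a dynamic programming aligner, once per read, so that the expensive step runs only where both reads have seed support.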
4. The device according to claim 3, wherein the comparison module is specifically configured to:
Segment the reads using the pigeonhole principle;
Build the BWT of the reference genome sequence, the suffix array, and the BWT of the reversed reference genome sequence;
Use backward search and forward search to look up, from right to left and from left to right respectively, the position on the reference genome sequence of each read or of each segment of a read.
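Taken together, the claims describe a tiered fallback: every read pair is first aligned with the bidirectional BWT algorithm, pairs with exactly one aligned read go to single-end dynamic programming (step 103), and pairs with no aligned read go to paired-end dynamic programming (step 104), with the input file processed one partial sequence at a time. The following Python sketch shows only this control flow. The three aligners are passed in as callables, and their signatures, the `chunk_size` parameter and the use of `None` for "unaligned" are assumptions made for the example.

```python
def read_chunks(pairs, chunk_size):
    """Yield successive lists of read pairs (the 'partial genome sequences')."""
    chunk = []
    for pair in pairs:
        chunk.append(pair)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def align_chunk(chunk, bwt_align, rescue_align, paired_dp_align):
    """Run the three alignment tiers on one chunk of read pairs."""
    # Tier 1: BWT-based alignment of every read; None means unaligned.
    hits = [[bwt_align(r1), bwt_align(r2)] for r1, r2 in chunk]
    # Tier 2 (step 103): pairs in which exactly one read is aligned.
    for (r1, r2), hit in zip(chunk, hits):
        if hit[0] is None and hit[1] is not None:
            hit[0] = rescue_align(r1, hit[1])
        elif hit[1] is None and hit[0] is not None:
            hit[1] = rescue_align(r2, hit[0])
    # Tier 3 (step 104): pairs in which neither read is aligned.
    for (r1, r2), hit in zip(chunk, hits):
        if hit[0] is None and hit[1] is None:
            hit[0], hit[1] = paired_dp_align(r1, r2)
    return [tuple(hit) for hit in hits]


def align_all(pairs, chunk_size, bwt_align, rescue_align, paired_dp_align):
    """Process the input chunk by chunk and collect all results (step 105 loop)."""
    results = []
    for chunk in read_chunks(pairs, chunk_size):
        results.extend(align_chunk(chunk, bwt_align, rescue_align, paired_dp_align))
    return results
```

Ordering the tiers from cheapest to most expensive means the dynamic programming aligners see only the reads that the index-based search could not place.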
CN201611074255.8A 2016-11-29 2016-11-29 Genome sequence comparison method and device Active CN106682393B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611074255.8A CN106682393B (en) 2016-11-29 2016-11-29 Genome sequence comparison method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611074255.8A CN106682393B (en) 2016-11-29 2016-11-29 Genome sequence comparison method and device

Publications (2)

Publication Number Publication Date
CN106682393A CN106682393A (en) 2017-05-17
CN106682393B true CN106682393B (en) 2019-05-17

Family

ID=58866137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611074255.8A Active CN106682393B (en) 2016-11-29 2016-11-29 Genome sequence comparison method and device

Country Status (1)

Country Link
CN (1) CN106682393B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480468B (en) * 2017-07-06 2020-10-02 荣联科技集团股份有限公司 Gene sample analysis method and electronic device
CN108763871B (en) * 2018-06-05 2022-05-31 北京诺禾致源科技股份有限公司 Hole filling method and device based on third-generation sequencing sequence
CN109326325B (en) * 2018-07-25 2022-02-18 郑州云海信息技术有限公司 Method, system and related assembly for gene sequence comparison
CN111312333B (en) * 2020-02-15 2022-06-21 苏州浪潮智能科技有限公司 Method, apparatus, device and medium for improving BWT table look-up performance
CN111402956A (en) * 2020-02-28 2020-07-10 苏州浪潮智能科技有限公司 Sequence comparison method, device, equipment and medium
CN112634988B (en) * 2021-01-07 2021-10-08 内江师范学院 Python language-based gene variation detection method and system
CN115602246B (en) * 2022-10-31 2023-06-20 哈尔滨工业大学 Sequence alignment method based on group genome

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2507187A1 (en) * 2005-05-11 2006-11-11 National Research Council Of Canada Molecular computer
CN101794346A (en) * 2008-12-12 2010-08-04 深圳华大基因研究院 Detection method and system of collinearity homologous zone of chromosome
CN102010912A (en) * 2010-12-01 2011-04-13 杭州师范大学 Ab initio prediction control method for micro-ribonucleic acid genes based on viral genome sequence
CN103902852A (en) * 2014-03-21 2014-07-02 深圳华大基因科技有限公司 Gene expression quantitative method and device
CN104531848A (en) * 2014-12-11 2015-04-22 杭州和壹基因科技有限公司 Method and system for assembling genome sequence
WO2014052909A3 (en) * 2012-09-27 2015-07-30 The Children's Mercy Hospital System for genome analysis and genetic disease diagnosis
CN105830078A (en) * 2013-10-21 2016-08-03 七桥基因公司 Systems and methods for using paired-end data in directed acyclic structure

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2507187A1 (en) * 2005-05-11 2006-11-11 National Research Council Of Canada Molecular computer
CN101794346A (en) * 2008-12-12 2010-08-04 深圳华大基因研究院 Detection method and system of collinearity homologous zone of chromosome
CN102010912A (en) * 2010-12-01 2011-04-13 杭州师范大学 Ab initio prediction control method for micro-ribonucleic acid genes based on viral genome sequence
WO2014052909A3 (en) * 2012-09-27 2015-07-30 The Children's Mercy Hospital System for genome analysis and genetic disease diagnosis
CN105830078A (en) * 2013-10-21 2016-08-03 七桥基因公司 Systems and methods for using paired-end data in directed acyclic structure
CN103902852A (en) * 2014-03-21 2014-07-02 深圳华大基因科技有限公司 Gene expression quantitative method and device
CN104531848A (en) * 2014-12-11 2015-04-22 杭州和壹基因科技有限公司 Method and system for assembling genome sequence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
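Research on parallel algorithms for sequence alignment based on dynamic programming; Li Dawei; Journal of Jinggangshan University (Natural Science Edition); 2011-05-30; Vol. 32, No. 3; pp. 80-84
Research on fast gene sequence comparison methods based on parallel computing; Yang Rui; Wanfang Database; 2016-06-30; pp. 1-32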

Also Published As

Publication number Publication date
CN106682393A (en) 2017-05-17

Similar Documents

Publication Publication Date Title
CN106682393B (en) Genome sequence comparison method and device
US10262105B2 (en) Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
CN109416927B (en) System and method for secondary analysis of nucleotide sequencing data
US20140309944A1 (en) Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
WO2014186604A1 (en) Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
CN101464955A (en) Pattern identification unit generation method, information processing apparatus, computer program, and storage medium
WO2018218788A1 (en) Third-generation sequencing sequence alignment method based on global seed scoring optimization
CN112735528A (en) Gene sequence comparison method and system
CN104160396A (en) Finding a best matching string among a set of strings
Debray et al. Identification and assessment of variable single-copy orthologous (SCO) nuclear loci for low-level phylogenomics: a case study in the genus Rosa (Rosaceae)
CN107784198B (en) Combined assembly method and system for second-generation sequence and third-generation single-molecule real-time sequencing sequence
CN103218544B (en) Gene recognition method based on sequence similarity and spectral 3-periodicity
CN102841988B (en) System and method for matching nucleic acid sequence information
ES2883166T3 (en) Data compression / decompression method and apparatus for identification of genomic variants
CN106754903A (en) Primer pair and method for whole-genome amplification of human mitochondria
CN116130001A (en) Third-generation sequence comparison algorithm based on k-mer positioning
Ananya et al. Novel approach to find the various stages of chronic myeloid leukemia using dynamic short distance pattern matching algorithm
Zhang et al. A three-level scoring system for fast similarity evaluation based on smith-waterman algorithm
CN105631243B (en) Method and device for detecting pathogenic microorganisms
CN114520024B (en) Sequence combination method based on k-mer
TWI785847B (en) Data processing system for processing gene sequencing data
Anya Research on Algorithms for Planted (1, d) Motif Search
CN115705916A (en) Prediction method, prediction device, editing method and editing device for single base editing
CN104751016B (en) Method for constructing DNA tags based on variable neighborhood search
CN113936741A (en) RNA solvent accessibility prediction method based on context-aware computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 1002-1, 10th floor, No.56, Beisihuan West Road, Haidian District, Beijing 100080

Patentee after: Ronglian Technology Group Co., Ltd

Address before: 100080, Beijing, Haidian District, No. 56 West Fourth Ring Road, glorious Times Building, 10, 1002-1

Patentee before: UNITED ELECTRONICS Co.,Ltd.
