CN104899476A - Parallel accelerating method for BWT index construction for multiple sequences - Google Patents

Info

Publication number
CN104899476A
CN104899476A CN201510328718.8A CN201510328718A CN104899476A CN 104899476 A CN104899476 A CN 104899476A CN 201510328718 A CN201510328718 A CN 201510328718A CN 104899476 A CN104899476 A CN 104899476A
Authority
CN
China
Prior art keywords
suffix
bwt
character
sequence
bucket
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510328718.8A
Other languages
Chinese (zh)
Inventor
彭绍亮
朱小谦
王恒
卢宇彤
杨灿群
吴诚堃
崔英博
刘欣
王海强
程乾
夏徐伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201510328718.8A
Publication of CN104899476A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a parallel acceleration method for BWT index construction for multiple sequences. It aims to solve the problem that existing BWT index construction for large-scale sequence sets is slow and inefficient, because it sorts the sequence set block by block and then recursively merges the sorted blocks in pairs. The technical scheme is as follows: traverse all suffixes of each sequence in the sequence set R, inspect the first l characters of each suffix, and place suffixes that share the same first l characters in the same memory block; sort the suffixes within each block independently and in parallel; concatenate the sorted blocks to obtain the order of all suffixes in R; then take the BWT character of each suffix in ascending lexicographic order and concatenate these characters to obtain the BWT index of R. The method effectively speeds up BWT index construction for multiple sequences and reduces whole-genome assembly time by about 90%.

Description

A method for parallel acceleration of BWT index construction for multiple sequences
Technical field: The present invention relates to whole-genome assembly methods in the field of bioinformatics, and in particular to a parallel acceleration method for constructing the Burrows-Wheeler transform (hereinafter BWT) index of a large-scale short-read set (on the order of 100 million sequences) during whole-genome assembly.
Background art:
Whole-genome assembly is a core problem in bioinformatics and is the basis and prerequisite for other research in genomics. The genome of a typical organism contains millions to billions of bases, whereas current sequencing technology can read only fragments of a few hundred bases at a time. Reconstructing the original genome from the short reads obtained by sequencing, using the overlaps between them, is called genome assembly. For N sequence fragments, directly computing their pairwise overlaps has time complexity O(N^2); for eukaryotes, sequencing yields up to several hundred million fragments, so the overlaps cannot be computed within a practical time. Research has found that, once the BWT index of the fragment set is known, the overlaps between fragments can be computed within a few hours. The BWT index of a sequence set is defined as follows.
Let Σ = {c_1, c_2, ..., c_σ} be a finite alphabet with c_1 < c_2 < ... < c_σ, where '<' denotes lexicographic order, σ is the number of characters in Σ, and 1, 2, ..., σ are the serial numbers of the letters in the alphabet. Let S = s_1 s_2 ... s_i ... s_{k-1} be a finite string with s_i ∈ Σ. In addition, define the string terminator '$', which is lexicographically smaller than every character in Σ. S can then be written as a string of length k, S = s_1 s_2 ... s_{k-1} s_k, where s_k = '$'. We write S[i, j] = s_i s_{i+1} ... s_j for the substring of S from the i-th character to the j-th character, where 1 ≤ i ≤ j ≤ k. A substring of the form S[1, i] is called a prefix of S, and a substring of the form S[j, k] is called a suffix of S, where 1 ≤ i, j ≤ k. The character s_{j-1} is called the BWT character of the suffix S[j, k], where 1 < j ≤ k; the BWT character of the suffix S[1, k] is defined to be '$'. Let R = {S_1, S_2, ..., S_m} denote m strings over the alphabet Σ, where each S_i has length k and S_i[k] = '$'. To distinguish the strings from one another, define S_i[k] < S_j[k] for 1 ≤ i < j ≤ m. Sort all suffixes of the sequences in R in lexicographic order; the string formed by taking the BWT character of each suffix in that order is called the BWT of the sequence set R.
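As a point of reference, the definition above can be turned directly into a small Python sketch. It only illustrates the definition and is not the method of the invention: it materialises every suffix and sorts them all at once, which is what becomes infeasible at scale. The function name `bwt_of_set` is hypothetical, and the rule S_i[k] < S_j[k] is modelled here by breaking ties between identical suffixes with the sequence index.

```python
def bwt_of_set(seqs):
    """BWT of a sequence set by direct suffix sorting (definition only, not scalable)."""
    entries = []
    for i, s in enumerate(seqs):           # i orders the per-sequence '$' terminators
        text = s + "$"
        for j in range(len(text)):         # 0-based start of the suffix text[j:]
            # BWT character: the character preceding the suffix,
            # or '$' when the suffix is the whole string
            bwt_char = text[j - 1] if j > 0 else "$"
            entries.append((text[j:], i, bwt_char))
    # '$' (ASCII 36) sorts before 'A', 'C', 'G', 'T'; identical suffixes of
    # different sequences are ordered by sequence index
    entries.sort(key=lambda e: (e[0], e[1]))
    return "".join(e[2] for e in entries)


if __name__ == "__main__":
    print(bwt_of_set(["AC", "CA"]))
```

For m sequences of length k (terminator included) the result has m × k characters, one per suffix.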
As the definition of the BWT shows, the key step in building the BWT is sorting the suffixes of all sequences in the set in lexicographic order. For a large-scale short-read set, however, directly sorting all suffixes of its sequences requires main memory on the order of terabytes. Take the human genome as an example: it contains about 3 billion bases, each of which can be represented by one of the characters A, C, G, T. Sequencing at a typical depth of 30× produces about 1 billion reads of about 100 bases each, and merely enumerating all suffixes of these reads requires 1.25 TB of space, far exceeding the memory of existing computing devices.
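The 1.25 TB figure can be reproduced with a short calculation. The sketch below assumes that each base is stored in 2 bits and that the total length of all suffixes of one read is approximated by k^2/2; neither assumption is stated in the text, but together they give exactly the quoted value. The function name is hypothetical.

```python
def suffix_storage_bytes(m, k, bits_per_char=2):
    """Approximate space needed to enumerate every suffix of m reads of length k."""
    # A read of length k has suffixes of lengths 1..k, about k*k/2 characters in total
    total_chars = m * k * k // 2
    # Assumed encoding: 2 bits per base (A, C, G, T), 8 bits per byte
    return total_chars * bits_per_char // 8


if __name__ == "__main__":
    size = suffix_storage_bytes(10**9, 100)
    print(size / 1e12, "TB")   # with 1 TB taken as 10^12 bytes
```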
For this reason, researchers proposed a method that sorts block by block and then recursively merges the blocks in pairs. Take BWT index construction for the sequence set R = {S_1, S_2, ..., S_8} as an example, and suppose the main memory of the computing device is large enough to sort all suffixes of two sequences. R is then divided into 4 blocks R_1, R_2, R_3, R_4, where R_i = {S_{2i-1}, S_{2i}}, 1 ≤ i ≤ 4. First, all suffixes in R_1, R_2, R_3, R_4 are sorted in turn, giving Sort_1, Sort_2, Sort_3, Sort_4 respectively. A merge sort is then applied to Sort_1 and Sort_2 to obtain Sort_12, and to Sort_3 and Sort_4 to obtain Sort_34; after that, Sort_12 and Sort_34 are merged to obtain Sort_1234, which is the lexicographic order of all suffixes in R. Finally, the BWT characters of the suffixes in Sort_1234 are taken in ascending lexicographic order, and the resulting string is the BWT index of the sequence set R. This method of block sorting followed by recursive merging solves the excessive memory demand of sorting the suffixes of all sequences directly. However, because it executes serially, its time efficiency is poor, and it cannot meet the timeliness requirements of BWT construction for large-scale short-read sets.
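The block-sort-then-merge scheme described above can be sketched in a few lines of Python. This is a minimal in-memory illustration under simplifying assumptions, not the cited researchers' implementation: the helper names are hypothetical, suffixes are stored as plain strings, and `heapq.merge` from the standard library stands in for the merge step.

```python
import heapq


def suffix_entries(seq_index, seq):
    """All (suffix, sequence index, BWT character) triples of one sequence."""
    text = seq + "$"
    return [(text[j:], seq_index, text[j - 1] if j > 0 else "$")
            for j in range(len(text))]


def bwt_block_merge(seqs, block_size=2):
    # Sort the suffixes of each block of `block_size` sequences on its own
    runs = []
    for start in range(0, len(seqs), block_size):
        block = []
        for i in range(start, min(start + block_size, len(seqs))):
            block.extend(suffix_entries(i, seqs[i]))
        runs.append(sorted(block))
    # Merge the sorted runs two at a time until a single run remains
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[a:a + 2])) for a in range(0, len(runs), 2)]
    # Read off the BWT characters in sorted suffix order
    return "".join(c for _, _, c in runs[0]) if runs else ""


if __name__ == "__main__":
    reads = ["ACGT", "CGTA", "GTAC", "TACG", "AACC", "CCGG", "GGTT", "TTAA"]
    print(bwt_block_merge(reads))
```

With 8 reads and `block_size=2` this follows the example in the text: four sorted runs, then two, then one.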
To accelerate BWT index construction, researchers proposed a parallel acceleration method based on the block-and-merge method above. Again take BWT index construction for the string set R = {S_1, S_2, ..., S_8} as an example. First, a cluster system containing 4 nodes P_1, P_2, P_3, P_4 is set up, where the main memory of each node is large enough to sort all suffixes of two sequences. R is then divided into 4 blocks R_1, R_2, R_3, R_4, where R_i = {S_{2i-1}, S_{2i}}, 1 ≤ i ≤ 4. All suffixes in R_i are sorted on node P_i simultaneously to obtain Sort_i, 1 ≤ i ≤ 4. Next, Sort_1 and Sort_2 are merged on node P_1 to obtain Sort_12 while Sort_3 and Sort_4 are merged on node P_3 to obtain Sort_34; after that, Sort_12 and Sort_34 are merged on node P_1 to obtain Sort_1234, which is the lexicographic order of all suffixes in R. Finally, the BWT characters of the suffixes in Sort_1234 are taken in ascending order, and the resulting string is the BWT index of R. In the initial stage the degree of parallelism of this strategy equals the total number of blocks, 4; it halves with every round of merging and falls to 1 at the final merge, which executes fully serially, so the overall average degree of parallelism is low. Experiments show that the speedup ratio is only about one third, which still cannot meet the timeliness requirements of BWT index construction for large-scale sequences.
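The shrinking degree of parallelism can be made concrete with a small sketch that counts how many nodes are busy in each stage of this scheme. It only counts active nodes and makes no claim about running time; the function name is hypothetical.

```python
def active_nodes_per_stage(num_blocks):
    """Nodes busy in each stage: one sort stage, then successive pairwise merge rounds."""
    stages = [num_blocks]          # stage 0: every node sorts its own block
    runs = num_blocks
    while runs > 1:
        merges = runs // 2         # each merge combines two sorted runs on one node
        stages.append(merges)
        runs -= merges             # every merge replaces two runs by one
    return stages


if __name__ == "__main__":
    stages = active_nodes_per_stage(4)
    print(stages, sum(stages) / len(stages))
```

For the 4-node example the stages use 4, 2 and 1 nodes, so on average fewer than 3 of the 4 nodes are working, and the last and largest merge runs on a single node.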
Both methods above partition the large-scale sequence set directly into blocks and then sort each block. This way of partitioning solves the excessive memory demand of direct sorting. However, because there is no defined order relation between the suffixes of different blocks, suffixes from different blocks must still be compared during the recursive merging of blocks, which greatly reduces overall efficiency.
Summary of the invention:
The technical problem to be solved by the present invention is that existing BWT index construction for large-scale sequence sets sorts the sequence set block by block and then recursively merges the sorted blocks in pairs, which makes BWT index construction for large-scale sequence sets slow and inefficient.
The technical scheme adopted to solve this problem is as follows: first, traverse all suffixes of each sequence in the sequence set R, inspect the first l characters of each suffix, and place suffixes with the same first l characters in the same memory block; next, sort the suffixes of each block in lexicographic order, independently within each block and in parallel across blocks; then concatenate the sorted blocks in ascending lexicographic order of the l-character prefixes that correspond to the blocks, which gives the lexicographic order of all suffixes in R; finally, take the BWT character of each suffix in ascending lexicographic order and concatenate these characters to obtain the BWT index of R.
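The four stages of this scheme can be sketched in a single process as follows. This is an illustration under stated assumptions rather than the patented implementation: the alphabet is assumed to be DNA, the function names are hypothetical, buckets are keyed by the padded prefix string instead of a number, and the terminator '$' is mapped to the smallest letter when forming the bucket key (the text does not say how '$' inside the first l characters is handled).

```python
from collections import defaultdict

ALPHABET = "ACGT"  # assumed DNA alphabet, listed in lexicographic order


def bucket_key(suffix, l):
    # First l characters; '$' and missing positions become the smallest
    # letter (the '$' handling is an assumption of this sketch)
    head = suffix[:l].replace("$", ALPHABET[0])
    return head.ljust(l, ALPHABET[0])


def bwt_by_prefix_buckets(seqs, l):
    # Stage 1: distribute every suffix into the bucket of its first l characters
    buckets = defaultdict(list)
    for i, s in enumerate(seqs):
        text = s + "$"
        for j in range(len(text)):
            bwt_char = text[j - 1] if j > 0 else "$"
            buckets[bucket_key(text[j:], l)].append((text[j:], i, bwt_char))
    out = []
    # Stage 3: visit buckets in lexicographic order of their prefix
    for key in sorted(buckets):
        # Stage 2: each bucket is sorted on its own (this is the parallel part)
        for _, _, bwt_char in sorted(buckets[key]):
            out.append(bwt_char)        # Stage 4: collect the BWT characters
    return "".join(out)


if __name__ == "__main__":
    print(bwt_by_prefix_buckets(["AC", "CA"], 2))
```

Because every suffix in one bucket precedes every suffix in the next bucket, no comparison between suffixes of different buckets is needed once the buckets are filled.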
The specific technical scheme is as follows:
Step 1: Determine the length l of the prefix string used to partition the suffixes, according to the sequence scale and the processor memory size. For a sequence set R = {S_1, ..., S_i, ..., S_m}, where each S_i has length k, 1 ≤ i ≤ m, the alphabet Σ contains σ characters, and the processor memory is M bytes, l = ⌈log_σ(m·k^2/(2M))⌉.
Step 2: Build a cluster system containing σ^l processors (CPUs), numbered p_1, p_2, ..., p_{σ^l}.
Step 3: Allocate σ^l dynamic memory blocks (hereinafter called buckets) in the main memory of the cluster system, each with an initial size of m·k^2/(4σ^l) bytes, labelled 1 to σ^l.
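Steps 1 to 3 amount to a small sizing calculation, sketched below. The prefix length uses the formula l = ⌈log_σ(m·k^2/(2M))⌉ stated in claim 1, evaluated with integer arithmetic as the smallest l with σ^l ≥ m·k^2/(2M) to avoid floating-point rounding. The function name and the returned tuple layout are hypothetical.

```python
def plan_buckets(m, k, sigma, mem_bytes):
    """Return (l, number of buckets, initial bucket bytes, growth step bytes)."""
    # Step 1: smallest l with sigma**l >= m*k^2 / (2*M)
    l = 0
    while sigma ** l * 2 * mem_bytes < m * k * k:
        l += 1
    # Step 2 and Step 3: one processor and one bucket per l-character prefix
    n = sigma ** l
    initial = m * k * k // (4 * n)    # Step 3: initial size of each bucket
    growth = m * k * k // (16 * n)    # increment used when a bucket fills up (Step 4.2.2)
    return l, n, initial, growth


if __name__ == "__main__":
    # Values of the worked example: 10^9 reads of length 100, DNA, 64 GB per node
    print(plan_buckets(10**9, 100, 4, 64 * 2**30))
```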
Step 4: Partition the m × k suffixes contained in the sequence set R.
Step 4.1: Set i = 1;
Step 4.2: Partition the suffixes of the i-th sequence.
Step 4.2.1: Set j = 1;
Step 4.2.2: Inspect the first l characters S_i[j, j+l-1] of the suffix S_i[j, k] of the i-th sequence. For a suffix shorter than l, append the character c_1 at its end until its length reaches l (as described in the background art, c_1 ∈ Σ, Σ = {c_1, c_2, ..., c_σ}; the sequences in R contain only the σ characters c_1, c_2, ..., c_σ, with c_1 < c_2 < ... < c_σ in lexicographic order, and 1, 2, ..., σ are the serial numbers of the letters in the alphabet). If S_i[j, j+l-1] = c_{i_1} c_{i_2} ... c_{i_l}, where i_1, i_2, ..., i_l are the serial numbers in Σ = {c_1, c_2, ..., c_σ} of the l characters of S_i[j, j+l-1], then put the suffix S_i[j, k] into the bucket numbered h, where h = (i_1 − 1) × σ^(l−1) + (i_2 − 1) × σ^(l−2) + ... + (i_l − 1) × σ^(l−l) + 1. If the memory of the bucket is insufficient to store the new suffix, expand it by m·k^2/(16σ^l) bytes.
Step 4.2.3: Set j = j + 1;
Step 4.2.4: If j ≤ k, go to Step 4.2.2; otherwise go to Step 4.3.
Step 4.3: Set i = i + 1;
Step 4.4: If i ≤ m, go to Step 4.2; otherwise go to Step 5.
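The double loop of Step 4 and the bucket number h of Step 4.2.2 can be sketched as follows. Several details are assumptions of this sketch and are not taken from the text: the function names are hypothetical, each bucket stores the pair (i, j) that identifies suffix S_i[j, k] instead of a copy of the suffix, bucket growth is left to Python lists, and the terminator '$' is given the same rank as c_1 when it falls within the first l characters.

```python
def bucket_number(prefix, alphabet, l):
    """h = (i_1-1)*sigma^(l-1) + ... + (i_l-1)*sigma^0 + 1, buckets numbered from 1."""
    sigma = len(alphabet)
    rank = {c: r for r, c in enumerate(alphabet)}   # rank = serial number - 1
    h = 0
    for pos in range(l):
        # Positions past the end of the suffix are padded with c_1 (rank 0);
        # treating '$' as rank 0 as well is an assumption of this sketch
        c = prefix[pos] if pos < len(prefix) else alphabet[0]
        h = h * sigma + rank.get(c, 0)              # Horner form of the sum
    return h + 1


def partition_suffixes(seqs, alphabet, l):
    buckets = {h: [] for h in range(1, len(alphabet) ** l + 1)}
    for i, s in enumerate(seqs, start=1):           # Steps 4.1, 4.3, 4.4
        text = s + "$"
        for j in range(1, len(text) + 1):           # Steps 4.2.1, 4.2.3, 4.2.4
            suffix = text[j - 1:]                   # S_i[j, k] in 1-based notation
            buckets[bucket_number(suffix[:l], alphabet, l)].append((i, j))
    return buckets


if __name__ == "__main__":
    print(bucket_number("TTTT", "ACGT", 4))
    print(partition_suffixes(["AC", "CA"], "ACGT", 2)[1])
```

Storing (i, j) pairs keeps each bucket entry at a fixed size; whether the original method stores suffix text or positions is not stated.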
Step 5: Sort the suffixes in the σ^l buckets in lexicographic order, in parallel on the σ^l processors; processor p_t sorts the suffixes in the bucket numbered t, 1 ≤ t ≤ σ^l.
Step 6: Concatenate the sorted suffixes of the buckets in order of bucket number from 1 to σ^l to obtain the order of all suffixes in R. The sequence set R contains m sequences and each sequence has k suffixes, so R contains m × k suffixes in total; let the lexicographic order of these suffixes be Suffix_1 < Suffix_2 < ... < Suffix_{m×k}.
Step 7: Take the BWT characters of Suffix_1, Suffix_2, ..., Suffix_{m×k} in turn and concatenate them to obtain the BWT index of R.
Step 8: Output the BWT index.
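Steps 5 to 8 can be sketched with the standard library as follows. `ThreadPoolExecutor` is only a stand-in for the σ^l processors of the cluster: CPython threads do not sort in parallel on multiple cores, so a real deployment would use separate processes or nodes. The function names and the layout of bucket entries, (suffix text, sequence index, BWT character), are assumptions of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def sort_bucket(entries):
    # Entries are (suffix_text, sequence_index, bwt_char); the sequence index
    # orders identical suffixes that come from different sequences
    return sorted(entries)


def bwt_from_buckets(buckets, workers=4):
    """buckets: dict mapping bucket number h to a list of entries."""
    numbers = sorted(buckets)                       # bucket numbers in ascending order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Step 5: every bucket is sorted independently of all others;
        # pool.map returns the results in the order of its input
        sorted_buckets = list(pool.map(sort_bucket, (buckets[h] for h in numbers)))
    # Step 6: concatenate in bucket-number order; Step 7: read off BWT characters
    return "".join(c for bucket in sorted_buckets for _, _, c in bucket)


if __name__ == "__main__":
    # Buckets for the reads "AC" and "CA" with l = 2 (numbers 1, 2 and 5 are non-empty)
    example = {
        1: [("A$", 1, "C"), ("$", 0, "C"), ("$", 1, "A")],
        2: [("AC$", 0, "$")],
        5: [("CA$", 1, "$"), ("C$", 0, "A")],
    }
    print(bwt_from_buckets(example))                # Step 8: output
```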
The beneficial effect of the invention is that it provides a parallel acceleration method for BWT index construction that effectively speeds up the construction of the BWT index for multiple sequences and reduces the time required for whole-genome assembly by about 90%. The method can also be used in other applications that involve large-scale sequences, and is easy to port and to popularize.
Brief description of the drawings:
Fig. 1 is the overall flow chart of the present invention.
Detailed description of the embodiments:
With reference to Fig. 1, the present invention is described in further detail below, taking as an example the construction of the BWT index of 1 billion DNA sequences of length 100 on a cluster in which each node has 64 GB of memory (hereinafter "this example"). The DNA alphabet is Σ = {A, C, G, T}, of size 4, i.e. σ = 4.
As shown in Fig. 1, the parallel acceleration algorithm for BWT index construction proposed by the present invention comprises 8 main steps.
Step 1: Determine the length l of the prefix string used to partition the suffixes, according to the sequence scale and the processor memory size. For this example, m = 10^9, k = 100, σ = 4, M = 64 × 2^30, which gives l = ⌈log_4(10^9 × 100^2/(2 × 64 × 2^30))⌉ = 4.
Step 2: Compute σ^l = 4^4 = 256. A cluster system containing 256 processors (CPUs) is set up, numbered p_1, p_2, ..., p_256.
Step 3: Allocate σ^l = 256 buckets in the memory of the cluster system, labelled 1 to 256, each with an initial size of 10^9 × 100^2/(4 × 256) bytes, about 10 GB. The buckets store the suffixes that begin with AAAA through TTTT respectively.
Step 4: Scan all suffixes of the 1 billion DNA sequences of length 100 (10^9 × 100 = 10^11 suffixes in total) and inspect the first 4 characters of each suffix.
Step 4.1: Set i = 1;
Step 4.2: Partition the suffixes of the i-th sequence.
Step 4.2.1: Set j = 1;
Step 4.2.2: Partition according to the first 4 characters S_i[j, j+3] of the suffix S_i[j, 100] of the i-th sequence. For DNA the alphabet is Σ = {A, C, G, T}, so c_1 = 'A', c_2 = 'C', c_3 = 'G', c_4 = 'T'. The serial number of 'A' is 1, of 'C' is 2, of 'G' is 3, and of 'T' is 4. For a suffix shorter than 4, append the character 'A' at its end until its length reaches 4. Suffixes beginning with AAAA are put into bucket h = (1−1) × 4^3 + (1−1) × 4^2 + (1−1) × 4^1 + (1−1) × 4^0 + 1 = 1, suffixes beginning with AAAC into bucket h = (1−1) × 4^3 + (1−1) × 4^2 + (1−1) × 4^1 + (2−1) × 4^0 + 1 = 2, and so on; suffixes beginning with TTTT are put into bucket h = (4−1) × 4^3 + (4−1) × 4^2 + (4−1) × 4^1 + (4−1) × 4^0 + 1 = 256. If the memory of a bucket is insufficient to store a new suffix, expand it by 10^9 × 100^2/(16 × 256) bytes, about 2.5 GB.
Step 4.2.3: Set j = j + 1;
Step 4.2.4: If j ≤ 100, go to Step 4.2.2; otherwise go to Step 4.3.
Step 4.3: Set i = i + 1;
Step 4.4: If i ≤ 10^9, go to Step 4.2; otherwise go to Step 5.
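The bucket numbers of Step 4.2.2 can be checked by enumerating all 256 four-letter DNA prefixes in lexicographic order and applying the formula for h to each. The sketch below is only a consistency check; the names `SERIAL` and `h` are chosen for illustration.

```python
from itertools import product

SERIAL = {"A": 1, "C": 2, "G": 3, "T": 4}   # serial numbers from Step 4.2.2


def h(prefix):
    # h = (i1-1)*4^3 + (i2-1)*4^2 + (i3-1)*4^1 + (i4-1)*4^0 + 1
    return sum((SERIAL[c] - 1) * 4 ** (3 - pos) for pos, c in enumerate(prefix)) + 1


# product() over a sorted alphabet yields the prefixes in lexicographic order
prefixes = ["".join(p) for p in product("ACGT", repeat=4)]
numbers = [h(p) for p in prefixes]

if __name__ == "__main__":
    print(prefixes[0], numbers[0], prefixes[1], numbers[1], prefixes[-1], numbers[-1])
    print(numbers == list(range(1, 257)))
```

The bucket number of a prefix equals its lexicographic rank among all prefixes, which is why concatenating the sorted buckets from No. 1 to No. 256 yields a globally sorted list of suffixes.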
Step 5: Sort the suffixes in the 256 buckets in parallel on the 256 processors; specifically, the t-th processor p_t sorts the suffixes in the t-th bucket, 1 ≤ t ≤ 256.
Step 6: Concatenate the sorted results in order from bucket No. 1 to bucket No. 256 to obtain the order of all suffixes. For the set of 10^9 sequences of length 100, m = 10^9 and k = 100, so there are 10^9 × 100 = 10^11 suffixes in total; let their lexicographic order be Suffix_1 < Suffix_2 < ... < Suffix_{10^11}.
Step 7: Take the BWT characters of Suffix_1, Suffix_2, ..., Suffix_{10^11} in turn and concatenate them to form the BWT index of the 1 billion DNA sequences.
Step 8: Output the BWT index.

Claims (2)

1. A method for parallel acceleration of BWT index construction for multiple sequences, characterized by comprising the following steps:
Step 1: Determine the length l of the prefix string used to partition the suffixes, according to the sequence scale and the processor memory size; let Σ = {c_1, c_2, ..., c_σ} be a finite alphabet with c_1 < c_2 < ... < c_σ, where '<' denotes lexicographic order and σ is the number of characters in Σ; let S = s_1 s_2 ... s_{k-1} s_k be a finite string, where s_k = '$', '$' is the end-of-string marker, and s_i ∈ Σ for 1 ≤ i < k; S[i, j] = s_i s_{i+1} ... s_j denotes the substring of S from the i-th character to the j-th character, where 1 ≤ i ≤ j ≤ k; a substring of the form S[j, k] is called a suffix of S, where 1 ≤ j ≤ k; s_{j-1} is called the BWT character of the suffix S[j, k], where 1 < j ≤ k; the BWT character of the suffix S[1, k] is defined to be '$'; let R = {S_1, S_2, ..., S_m} denote m strings over the alphabet Σ, where each S_i has length k and S_i[k] = '$'; for a processor memory size of M bytes, l = ⌈log_σ(m·k^2/(2M))⌉;
Step 2: Build a cluster system containing σ^l processors, numbered p_1, p_2, ..., p_{σ^l};
Step 3: Allocate σ^l dynamic buckets in the main memory of the cluster system, each with an initial size of m·k^2/(4σ^l) bytes, labelled 1 to σ^l, a bucket being a memory block;
Step 4: Partition the m × k suffixes contained in the sequence set R,
Step 4.1: Set i = 1;
Step 4.2: Partition the suffixes of the i-th sequence,
Step 4.2.1: Set j = 1;
Step 4.2.2: Inspect the first l characters S_i[j, j+l-1] of the suffix S_i[j, k] of the i-th sequence; for a suffix shorter than l, append the character c_1 at its end until its length reaches l, c_1 ∈ Σ; if S_i[j, j+l-1] = c_{i_1} c_{i_2} ... c_{i_l}, put the suffix S_i[j, k] into the bucket numbered h, where h = (i_1 − 1) × σ^(l−1) + (i_2 − 1) × σ^(l−2) + ... + (i_l − 1) × σ^(l−l) + 1 and i_1, i_2, ..., i_l are the serial numbers in the alphabet Σ = {c_1, c_2, ..., c_σ} of the l characters of S_i[j, j+l-1]; if the memory of the bucket is insufficient to store the new suffix, expand it by m·k^2/(16σ^l) bytes;
Step 4.2.3: Set j = j + 1;
Step 4.2.4: If j ≤ k, go to Step 4.2.2; otherwise go to Step 4.3;
Step 4.3: Set i = i + 1;
Step 4.4: If i ≤ m, go to Step 4.2; otherwise go to Step 5;
Step 5: Sort the suffixes in the σ^l buckets in lexicographic order, in parallel on the σ^l processors, processor p_t sorting the suffixes in the bucket numbered t, 1 ≤ t ≤ σ^l;
Step 6: Concatenate the sorted suffixes of the buckets in order of bucket number from 1 to σ^l to obtain the lexicographic order of all suffixes in R; the sequence set R contains m sequences and each sequence has k suffixes, so R contains m × k suffixes in total; let the lexicographic order of these suffixes be Suffix_1 < Suffix_2 < ... < Suffix_{m×k};
Step 7: Take the BWT characters of Suffix_1, Suffix_2, ..., Suffix_{m×k} in turn and concatenate them to obtain the BWT index of R;
Step 8: Output the BWT index.
2. The method for parallel acceleration of BWT index construction for multiple sequences according to claim 1, characterized in that the σ^l processors described in Step 2 are σ^l CPUs.
CN201510328718.8A 2015-06-15 2015-06-15 Parallel accelerating method for BWT index construction for multiple sequences Pending CN104899476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510328718.8A CN104899476A (en) 2015-06-15 2015-06-15 Parallel accelerating method for BWT index construction for multiple sequences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510328718.8A CN104899476A (en) 2015-06-15 2015-06-15 Parallel accelerating method for BWT index construction for multiple sequences

Publications (1)

Publication Number Publication Date
CN104899476A true CN104899476A (en) 2015-09-09

Family

ID=54032138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510328718.8A Pending CN104899476A (en) 2015-06-15 2015-06-15 Parallel accelerating method for BWT index construction for multiple sequences

Country Status (1)

Country Link
CN (1) CN104899476A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090193213A1 (en) * 2007-02-12 2009-07-30 Xyratex Technology Limited Method and apparatus for data transform
CN102929900A (en) * 2012-01-16 2013-02-13 中国科学院北京基因组研究所 Method and device for matching character strings
CN102750461A (en) * 2012-06-14 2012-10-24 东北大学 Biological sequence local comparison method capable of obtaining complete solution
CN102841988A (en) * 2012-07-28 2012-12-26 盛司潼 System and method for matching nucleotide sequence information
CN103117748A (en) * 2013-01-29 2013-05-22 中国科学院计算技术研究所 Method and system for sequencing suffixes in BWT (burrows-wheeler transform) implementation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHI-MAN LIU ET AL: "GPU-Accelerated BWT Construction for Large Collection of Short Reads", 《ARXIV》 *
程思远 et al.: "Research on CUDA parallel data compression technology", 《软件设计开发》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484865A (en) * 2016-10-10 2017-03-08 哈尔滨工程大学 One kind is based on four word chained list dictionary tree searching algorithm of DNA k mer index problem
CN109375989A (en) * 2018-09-10 2019-02-22 中山大学 A kind of parallel suffix sort method and system
CN109375989B (en) * 2018-09-10 2022-04-08 中山大学 Parallel suffix ordering method and system
CN109783052A (en) * 2018-12-27 2019-05-21 深圳市轱辘汽车维修技术有限公司 Data reordering method, device, server and computer readable storage medium
CN109783052B (en) * 2018-12-27 2021-11-12 深圳市轱辘车联数据技术有限公司 Data sorting method, device, server and computer readable storage medium
CN111653318A (en) * 2019-05-24 2020-09-11 北京哲源科技有限责任公司 Acceleration method and device for gene comparison, storage medium and server
CN111653318B (en) * 2019-05-24 2023-09-15 北京哲源科技有限责任公司 Acceleration method and device for gene comparison, storage medium and server
CN113299344A (en) * 2021-06-23 2021-08-24 深圳华大医学检验实验室 Gene sequencing analysis method, gene sequencing analysis device, storage medium and computer equipment
CN115662523A (en) * 2022-10-21 2023-01-31 哈尔滨工业大学 Method and equipment for expressing and constructing population genome-oriented index

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150909