CN104899476A - Parallel accelerating method for BWT index construction for multiple sequences - Google Patents
- Publication number
- CN104899476A (application CN201510328718.8A)
- Authority
- CN
- China
- Prior art keywords
- suffix
- bwt
- character
- sequence
- bucket
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a parallel acceleration method for constructing the BWT index of multiple sequences. It aims to solve the slow speed and low efficiency of existing BWT index construction for large-scale sequence sets, which partitions the sequence set into blocks, sorts each block, and then recursively merges the sorted blocks in pairs. The technical scheme is as follows: traverse all suffixes of each sequence in the sequence set R, inspect the first l characters of each suffix, and place suffixes that share the same first l characters in the same memory sub-block; sort the suffixes within each sub-block independently and in parallel; concatenate the sorted sub-blocks to obtain the order of all suffixes in R; then take the BWT character of each suffix in ascending lexicographical order and concatenate these characters to obtain the BWT index of R. The method effectively speeds up BWT index construction for multiple sequences and reduces whole-genome assembly time by about 90%.
Description
Technical field: The present invention relates to whole-genome assembly in the field of bioinformatics, and in particular to a parallel acceleration method for constructing the Burrows-Wheeler Transform (hereinafter BWT) index of a large-scale set of short sequences (on the order of 100 million sequences) during whole-genome assembly.
Background:
Whole-genome assembly is a core problem in bioinformatics and is the basis and prerequisite for other research in genomics. The genome of a typical organism contains millions to billions of bases, whereas current sequencing technology can read only fragments of a few hundred bases at a time. Restoring the original genome from these short sequences according to the overlaps between them is called genome assembly. For N sequence fragments, directly computing the pairwise overlaps takes O(N^2) time. Sequencing a eukaryote yields up to several hundred million fragments, so the overlaps cannot be computed in a practical amount of time. Research has found that, given the BWT index of the set of sequence fragments, the overlaps between fragments can be computed within a few hours. The BWT index of a sequence set is defined as follows.
Let Σ = {c_1, c_2, ..., c_σ} be a finite alphabet with c_1 < c_2 < ... < c_σ, where '<' denotes lexicographical order, σ is the number of characters in Σ, and 1, 2, ..., σ are the serial numbers of the letters in the alphabet. Let S = s_1 s_2 ... s_i ... s_{k-1} be a finite string with s_i ∈ Σ. The end-of-string character is denoted '$' and is lexicographically smaller than every character in Σ. S can therefore be written as a string of length k, S = s_1 s_2 ... s_{k-1} s_k, where s_k = '$'. We write S[i, j] = s_i s_{i+1} ... s_j for the substring of S from the i-th character to the j-th character, where 1 ≤ i ≤ j ≤ k. A substring of the form S[1, i] is called a prefix of S, and a substring of the form S[j, k] is called a suffix of S, where 1 ≤ i, j ≤ k. The character s_{j-1} is called the BWT character of the suffix S[j, k], for 1 < j ≤ k; the BWT character of the suffix S[1, k] is defined to be '$'. Let R = {S_1, S_2, ..., S_m} be a set of m strings over the alphabet Σ, where each S_i has length k and S_i[k] = '$'. To distinguish the strings from one another, define S_i[k] < S_j[k] for 1 ≤ i < j ≤ m. Sort all suffixes of the sequences in R lexicographically; the string formed by taking the BWT character of each suffix in that order is called the BWT of the sequence set R.
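The definition above can be illustrated with a direct, memory-hungry implementation that sorts every suffix at once. This is only a minimal sketch for small inputs: the names `suffix_key` and `naive_bwt` are illustrative, the reads are stored without their terminator, and the per-sequence terminator order S_i[k] < S_j[k] is modelled by a negative integer appended to each sort key (an encoding chosen here, not taken from the text).

```python
ALPHABET = "ACGT"                                  # c_1 < c_2 < ... < c_sigma
RANK = {c: r for r, c in enumerate(ALPHABET)}      # letter ranks 0..sigma-1

def suffix_key(reads, i, j):
    """Sort key for the suffix of reads[i] that starts at 0-based offset j.

    Letters map to non-negative ranks. The terminator of sequence i maps to
    i - m, which is negative (smaller than every letter) and increases with i,
    so terminators of different sequences are ordered by sequence index.
    """
    return tuple(RANK[c] for c in reads[i][j:]) + (i - len(reads),)

def naive_bwt(reads):
    """BWT of a read set, built by sorting all suffixes at once."""
    suffixes = [(i, j) for i, r in enumerate(reads) for j in range(len(r) + 1)]
    suffixes.sort(key=lambda s: suffix_key(reads, *s))
    # BWT character: the character before the suffix, or '$' for a whole read.
    return "".join(reads[i][j - 1] if j > 0 else "$" for i, j in suffixes)

print(naive_bwt(["AC", "CA"]))  # CAC$A$
```

For the two reads AC and CA the sorted suffixes are $_1, $_2, A$_2, AC$_1, C$_1, CA$_2, whose preceding characters spell CAC$A$.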
As the definition shows, the key step in building the BWT is sorting the suffixes of all sequences in the set lexicographically. For a large-scale set of short sequences, however, sorting all suffixes directly requires main memory on the order of terabytes. Take the human genome as an example: it contains about 3 billion bases, each represented by one of the characters A, C, G, T. Sequencing at a typical depth of 30× produces about 1 billion sequences of about 100 bases each, and merely enumerating all suffixes of these sequences takes 1.25 TB of space, far more than the memory of existing computing equipment.
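The 1.25 TB figure can be reproduced with a short calculation. This is a sketch under two assumptions that the text does not state explicitly: each base is stored in 2 bits, and the k suffixes of one sequence hold about k^2/2 characters in total. The function name is illustrative.

```python
def suffix_storage_bytes(m, k, bits_per_char=2):
    """Approximate bytes needed to enumerate all suffixes of m reads of length k.

    Assumes about k*k/2 suffix characters per read and bits_per_char bits per
    character (2 bits is enough for the four DNA letters).
    """
    total_chars = m * k * k / 2
    return total_chars * bits_per_char / 8

# 1 billion reads of about 100 bases, as in the human-genome example.
print(suffix_storage_bytes(10**9, 100))  # 1250000000000.0 bytes, i.e. 1.25 TB
```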
For this reason, researchers have proposed sorting in blocks and then recursively merging the blocks in pairs. Take the construction of the BWT index of the sequence set R = {S_1, S_2, ..., S_8} as an example, and suppose the main memory of the computing device is large enough to sort all suffixes of two sequences. R is then divided into 4 blocks R_1, R_2, R_3, R_4, where R_i = {S_{2i-1}, S_{2i}}, 1 ≤ i ≤ 4. First, all suffixes in R_1, R_2, R_3, R_4 are sorted in turn, giving Sort_1, Sort_2, Sort_3, Sort_4 respectively. A merge-sort algorithm then merges Sort_1 and Sort_2 into Sort_{12}, merges Sort_3 and Sort_4 into Sort_{34}, and finally merges Sort_{12} and Sort_{34} into Sort_{1234}, which is the lexicographical order of all suffixes in R. Taking the BWT character of each suffix in Sort_{1234} in ascending lexicographical order yields the BWT index of R. Building the BWT index by block sorting followed by recursive merging avoids the excessive memory demand of sorting the suffixes of all sequences at once. However, the method executes serially, so its time efficiency is poor and it cannot meet the timeliness requirements of BWT construction for large-scale sets of short sequences.
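The block-sort-then-merge scheme described above might look as follows for a toy input. This is a hedged sketch, not the cited researchers' implementation: all suffixes are kept in memory here, the function names are illustrative, and the standard-library `heapq.merge` stands in for the merge step.

```python
from heapq import merge

ALPHABET = "ACGT"
RANK = {c: r for r, c in enumerate(ALPHABET)}

def suffix_key(reads, i, j):
    # Letters rank 0..3; the terminator of read i is encoded as i - m (< 0).
    return tuple(RANK[c] for c in reads[i][j:]) + (i - len(reads),)

def sort_block(reads, ids):
    """Sort all suffixes of the reads whose indices are in ids (one block R_i)."""
    block = [(i, j) for i in ids for j in range(len(reads[i]) + 1)]
    return sorted(block, key=lambda s: suffix_key(reads, *s))

def block_merge_order(reads, block_size=2):
    """Serially sort blocks of block_size reads, then merge them pairwise."""
    key = lambda s: suffix_key(reads, *s)
    runs = [sort_block(reads, range(b, min(b + block_size, len(reads))))
            for b in range(0, len(reads), block_size)]
    while len(runs) > 1:
        # One merge round: Sort_1 + Sort_2 -> Sort_12, Sort_3 + Sort_4 -> Sort_34, ...
        runs = [list(merge(*runs[p:p + 2], key=key))
                for p in range(0, len(runs), 2)]
    return runs[0] if runs else []

def bwt_from_order(reads, order):
    """Read off the BWT characters of the suffixes in sorted order."""
    return "".join(reads[i][j - 1] if j > 0 else "$" for i, j in order)
```

Calling `bwt_from_order(reads, block_merge_order(reads))` on eight short reads reproduces the Sort_1 ... Sort_{1234} sequence of merges; every merge compares suffixes that come from different blocks.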
To accelerate the construction of the BWT index, researchers have proposed a parallel acceleration method based on the block-and-merge approach above. Again take the construction of the BWT index of the string set R = {S_1, S_2, ..., S_8} as an example. First, a cluster of 4 nodes P_1, P_2, P_3, P_4 is set up, where the main memory of each node is large enough to sort all suffixes of two sequences. R is then divided into 4 blocks R_1, R_2, R_3, R_4, where R_i = {S_{2i-1}, S_{2i}}, 1 ≤ i ≤ 4. The nodes work simultaneously: node P_i sorts all suffixes in R_i to obtain Sort_i, 1 ≤ i ≤ 4. Next, node P_1 merges Sort_1 and Sort_2 into Sort_{12} while node P_3 merges Sort_3 and Sort_4 into Sort_{34}. Node P_1 then merges Sort_{12} and Sort_{34} into Sort_{1234}, which is the lexicographical order of all suffixes in R. Finally, taking the BWT character of each suffix in Sort_{1234} in ascending order yields the BWT index of R. In the initial stage the degree of parallelism of this strategy equals the number of blocks, 4. It halves with every round of merging and falls to 1 in the final merge, which executes entirely serially, so the average degree of parallelism is low. Experiments show a speed-up ratio of only about one third, so the method still cannot meet the timeliness requirements of BWT index construction for large-scale sequence sets.
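The shrinking degree of parallelism can be tabulated with a few lines of code. This is a rough illustration only: it counts busy nodes per round and ignores the fact that later rounds merge larger lists, so it does not model running time. The function name is illustrative.

```python
def merge_round_parallelism(num_blocks):
    """Busy nodes per round: the initial block sort, then each pairwise-merge round."""
    rounds = [num_blocks]            # every node sorts its own block
    runs = num_blocks
    while runs > 1:
        rounds.append(runs // 2)     # one node per pair of sorted runs
        runs = (runs + 1) // 2       # an unpaired run is carried to the next round
    return rounds

print(merge_round_parallelism(4))    # [4, 2, 1]
```

For the 4-node example the last round keeps a single node busy while it merges every suffix in R, which is why adding nodes helps little.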
Both methods above partition the large-scale sequence set directly and then sort each block, which solves the excessive memory demand of direct sorting. However, because there is no fixed order relation between the suffixes of different blocks, suffixes from different blocks must still be compared with one another during the recursive merging, which greatly reduces overall efficiency.
Summary of the invention:
The technical problem to be solved by the present invention is that existing BWT index construction for large-scale sequence sets sorts the sequence set in blocks and then recursively merges the blocks in pairs, which makes construction slow and inefficient.
The technical scheme adopted to solve this problem is as follows. First, traverse all suffixes of each sequence in the sequence set R, inspect the first l characters of each suffix, and place suffixes with the same first l characters in the same memory block. Next, sort the suffixes within each block lexicographically, independently and in parallel. Then concatenate the sorted blocks in ascending lexicographical order of their l-character prefixes to obtain the lexicographical order of all suffixes in R. Finally, take the BWT character of each suffix in ascending lexicographical order and concatenate these characters to obtain the BWT index of R.
The specific technical scheme is as follows:
Step 1: Determine the length l of the prefix string used to partition the suffixes, according to the scale of the sequence set and the processor memory size. For a sequence set R = {S_1, ..., S_i, ..., S_m} in which each S_i has length k, 1 ≤ i ≤ m, an alphabet Σ containing σ characters, and processor memory of M bytes, take l = ⌈log_σ(m k^2 / (2M))⌉.
Step 2: Build a cluster containing σ^l processors (CPUs), numbered p_1, p_2, ..., p_{σ^l}.
Step 3: Allocate σ^l dynamic memory blocks (hereinafter buckets) in the main memory of the cluster, each with an initial size of m k^2 / (4 σ^l) bytes, labelled 1 to σ^l.
Step 4: Partition the m × k suffixes contained in the sequence set R.
Step 4.1: Set i = 1.
Step 4.2: Partition the suffixes of the i-th sequence.
Step 4.2.1: Set j = 1.
Step 4.2.2: Inspect the first l characters S_i[j, j+l-1] of the suffix S_i[j, k] of the i-th sequence. If the suffix is shorter than l, append the character c_1 to its end until its length reaches l (as described in the background, c_1 ∈ Σ, Σ = {c_1, c_2, ..., c_σ}, the sequences in R contain only the σ characters c_1, c_2, ..., c_σ, the lexicographical order is c_1 < c_2 < ... < c_σ, and 1, 2, ..., σ are the serial numbers of the letters in the alphabet). Let S_i[j, j+l-1] = c_{i_1} c_{i_2} ... c_{i_l}, where i_1, i_2, ..., i_l are the serial numbers in Σ = {c_1, c_2, ..., c_σ} of the l characters of S_i[j, j+l-1]. Put the suffix S_i[j, k] into the bucket numbered h, where h = (i_1 - 1) × σ^{l-1} + (i_2 - 1) × σ^{l-2} + ... + (i_l - 1) × σ^{l-l} + 1. If the bucket does not have enough memory to store the new suffix, expand its memory by m k^2 / (16 σ^l) bytes.
Step 4.2.3: Set j = j + 1.
Step 4.2.4: If j ≤ k, go to Step 4.2.2; otherwise go to Step 4.3.
Step 4.3: Set i = i + 1.
Step 4.4: If i ≤ m, go to Step 4.2; otherwise go to Step 5.
Step 5: Sort the suffixes in the σ^l buckets lexicographically, in parallel on the σ^l processors: processor p_t sorts the suffixes in the bucket numbered t, 1 ≤ t ≤ σ^l.
Step 6: Concatenate the sorted suffixes of the buckets in order of bucket number from 1 to σ^l to obtain the order of all suffixes in R. R contains m sequences and each sequence has k suffixes, so R contains m × k suffixes in total; let their lexicographical order be Suffix_1 < Suffix_2 < ... < Suffix_{m×k}.
Step 7: Take the BWT characters of Suffix_1, Suffix_2, ..., Suffix_{m×k} in turn and concatenate them to obtain the BWT index of R.
Step 8: Output the BWT index.
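Steps 3 to 7 can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a thread pool stands in for the σ^l processors (in standard CPython, threads do not speed up CPU-bound sorting), buckets are plain Python lists rather than pre-sized memory blocks, the prefix length l is passed in rather than computed, and the end-of-string position is treated like padding with c_1 when a suffix is shorter than l. All function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

ALPHABET = "ACGT"
SIGMA = len(ALPHABET)
SERIAL = {c: n for n, c in enumerate(ALPHABET, start=1)}   # serial numbers 1..sigma

def bucket_number(prefix, l):
    """h = (i_1-1)*sigma^(l-1) + ... + (i_l-1)*sigma^0 + 1, as in Step 4.2.2."""
    padded = prefix[:l].ljust(l, ALPHABET[0])   # pad short prefixes with c_1
    h = 0
    for ch in padded:
        h = h * SIGMA + (SERIAL[ch] - 1)
    return h + 1

def suffix_key(reads, i, j):
    # Letters map to 1..sigma; the terminator of read i is encoded as i - m (< 0).
    return tuple(SERIAL[c] for c in reads[i][j:]) + (i - len(reads),)

def bucketed_bwt(reads, l, workers=4):
    buckets = {h: [] for h in range(1, SIGMA ** l + 1)}            # Step 3
    for i, read in enumerate(reads):                               # Step 4
        for j in range(len(read) + 1):
            buckets[bucket_number(read[j:j + l], l)].append((i, j))
    key = lambda s: suffix_key(reads, *s)
    with ThreadPoolExecutor(max_workers=workers) as pool:          # Step 5
        sorted_buckets = list(pool.map(
            lambda b: sorted(b, key=key),
            (buckets[h] for h in range(1, SIGMA ** l + 1))))
    order = [s for b in sorted_buckets for s in b]                 # Step 6
    return "".join(reads[i][j - 1] if j > 0 else "$" for i, j in order)  # Step 7

print(bucketed_bwt(["AC", "CA"], l=2))  # CAC$A$
```

Because a padded prefix is the smallest bucket label that starts with the suffix's real characters, concatenating the buckets in numeric order gives the same order as sorting all suffixes together, and no suffix is ever compared with one from another bucket.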
The beneficial effect of the invention is that it provides a parallel acceleration method for BWT index construction that effectively speeds up the construction of the multi-sequence BWT index and reduces the time required for whole-genome assembly by about 90%. The method can also be used in other applications that involve large-scale sequence sets, and is easy to port and generalize.
Brief description of the drawings:
Fig. 1 is the overall flow chart of the present invention.
Detailed description of the embodiment:
The present invention is described in further detail below with reference to Fig. 1, taking as an example (hereinafter this example) the construction of the BWT index of 1 billion DNA sequences of length 100 on a cluster whose nodes each have 64 GB of memory. The DNA alphabet is Σ = {A, C, G, T}, of size 4, i.e. σ = 4.
As shown in Fig. 1, the parallel acceleration algorithm for BWT index construction proposed by the present invention comprises 8 main steps.
Step 1: Determine the length l of the prefix string used to partition the suffixes according to the scale of the sequence set and the processor memory size, taking l = ⌈log_σ(m k^2 / (2M))⌉. For this example, m = 10^9, k = 100, σ = 4, M = 64 × 2^30, which gives l = 4.
Step 2: Compute σ^l = 4^4 = 256. A cluster containing 256 processors (CPUs) is set up, numbered p_1, p_2, ..., p_256.
Step 3: Allocate σ^l = 256 buckets in the memory of the cluster, labelled 1 to 256, each with an initial size of 10^9 × 100^2 / (4 × 256) bytes, about 10 GB. The buckets store the suffixes beginning with AAAA to TTTT respectively.
Step 4: Scan all suffixes of the 1 billion DNA sequences of length 100 (10^9 × 100 = 10^11 suffixes in total) and inspect the first 4 characters of each suffix.
Step 4.1: Set i = 1.
Step 4.2: Partition the suffixes of the i-th sequence.
Step 4.2.1: Set j = 1.
Step 4.2.2: Inspect the first 4 characters S_i[j, j+3] of the suffix S_i[j, 100] of the i-th sequence and assign the suffix to a bucket. For DNA the alphabet is Σ = {A, C, G, T}, so c_1 = 'A', c_2 = 'C', c_3 = 'G', c_4 = 'T'; the serial number of 'A' is 1, of 'C' is 2, of 'G' is 3, and of 'T' is 4. If a suffix is shorter than 4, append 'A' to its end until its length reaches 4. A suffix beginning with AAAA goes into bucket h = (1-1) × 4^3 + (1-1) × 4^2 + (1-1) × 4^1 + (1-1) × 4^0 + 1 = 1; a suffix beginning with AAAC goes into bucket h = (1-1) × 4^3 + (1-1) × 4^2 + (1-1) × 4^1 + (2-1) × 4^0 + 1 = 2; and so on, up to a suffix beginning with TTTT, which goes into bucket h = (4-1) × 4^3 + (4-1) × 4^2 + (4-1) × 4^1 + (4-1) × 4^0 + 1 = 256. If a bucket does not have enough memory to store a new suffix, expand its memory by 10^9 × 100^2 / (16 × 256) bytes, about 2.5 GB.
Step 4.2.3: Set j = j + 1.
Step 4.2.4: If j ≤ 100, go to Step 4.2.2; otherwise go to Step 4.3.
Step 4.3: Set i = i + 1.
Step 4.4: If i ≤ 10^9, go to Step 4.2; otherwise go to Step 5.
Step 5: Sort the suffixes in the 256 buckets in parallel on the 256 processors: the t-th processor p_t sorts the suffixes in the t-th bucket, 1 ≤ t ≤ 256.
Step 6: Concatenate the sorted results in order from bucket 1 to bucket 256 to obtain the order of all suffixes. For the set of 10^9 sequences of length 100, m = 10^9 and k = 100, so there are 10^9 × 100 = 10^11 suffixes in total; let their lexicographical order be Suffix_1 < Suffix_2 < ... < Suffix_{10^11}.
Step 7: Take the BWT characters of Suffix_1, Suffix_2, ..., Suffix_{10^11} in turn and concatenate them to form the BWT index of the 1 billion DNA sequences.
Step 8: Output the BWT index.
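The sizing figures of this example can be checked with a short calculation. The sketch below uses the formula for l stated in claim 1, reading its bracket as a ceiling because that is what the example's σ^l = 4^4 implies; the clamp to at least 1 and the function name are assumptions added here.

```python
import math

def prefix_length(m, k, sigma, mem_bytes):
    """l = ceil(log_sigma(m * k^2 / (2 * M))), clamped to at least 1 (assumed)."""
    ratio = m * k * k / (2 * mem_bytes)
    # Floating-point rounding could matter if ratio were an exact power of sigma.
    return max(1, math.ceil(math.log(ratio, sigma)))

m, k, sigma, M = 10**9, 100, 4, 64 * 2**30

l = prefix_length(m, k, sigma, M)            # log_4(72.76) is about 3.09, so l = 4
buckets = sigma ** l                         # 256 buckets and processors
initial_bytes = m * k * k / (4 * buckets)    # initial bucket size
growth_bytes = m * k * k / (16 * buckets)    # size of each expansion

print(l, buckets, initial_bytes, growth_bytes)
# 4 256 9765625000.0 2441406250.0
```

The last two values are roughly 9.8 × 10^9 and 2.4 × 10^9 bytes, which the example rounds to about 10 GB and about 2.5 GB.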
Claims (2)
1. A parallel acceleration method for multi-sequence BWT index construction, characterized by comprising the following steps:
Step 1: determine the length l of the prefix string used to partition the suffixes according to the scale of the sequence set and the processor memory size; let Σ = {c_1, c_2, ..., c_σ} be a finite alphabet with c_1 < c_2 < ... < c_σ, where '<' denotes lexicographical order and σ is the number of characters in the alphabet Σ; let S = s_1 s_2 ... s_i ... s_{k-1} s_k be a finite string, where s_k = '$', '$' is the end-of-string mark, and s_i ∈ Σ for 1 ≤ i < k; S[i, j] = s_i s_{i+1} ... s_j denotes the substring of S from the i-th character to the j-th character, where 1 ≤ i ≤ j ≤ k; a substring of the form S[j, k] is called a suffix of S, where 1 ≤ j ≤ k; s_{j-1} is called the BWT character of the suffix S[j, k], where 1 < j ≤ k; the BWT character of the suffix S[1, k] is '$'; let R = {S_1, S_2, ..., S_m} denote m strings over the alphabet Σ, where S_i has length k and S_i[k] = '$'; for a processor memory size of M bytes, l = ⌈log_σ(m k^2 / (2M))⌉;
Step 2: build a cluster containing σ^l processors, numbered p_1, p_2, ..., p_{σ^l};
Step 3: allocate σ^l dynamic buckets in the main memory of the cluster, each with an initial size of m k^2 / (4 σ^l) bytes, labelled 1 to σ^l, a bucket being a memory block;
Step 4: partition the m × k suffixes contained in the sequence set R,
Step 4.1: set i = 1;
Step 4.2: partition the suffixes of the i-th sequence,
Step 4.2.1: set j = 1;
Step 4.2.2: inspect the first l characters S_i[j, j+l-1] of the suffix S_i[j, k] of the i-th sequence; if the suffix is shorter than l, append the character c_1 to its end until its length reaches l, c_1 ∈ Σ; let S_i[j, j+l-1] = c_{i_1} c_{i_2} ... c_{i_l}, then put the suffix S_i[j, k] into the bucket numbered h, where h = (i_1 - 1) × σ^{l-1} + (i_2 - 1) × σ^{l-2} + ... + (i_l - 1) × σ^{l-l} + 1, and i_1, i_2, ..., i_l are the serial numbers in the alphabet Σ = {c_1, c_2, ..., c_σ} of the l characters of S_i[j, j+l-1]; if the bucket does not have enough memory to store the new suffix, expand its memory by m k^2 / (16 σ^l) bytes;
Step 4.2.3: set j = j + 1;
Step 4.2.4: if j ≤ k, go to Step 4.2.2, otherwise go to Step 4.3;
Step 4.3: set i = i + 1;
Step 4.4: if i ≤ m, go to Step 4.2, otherwise go to Step 5;
Step 5: sort the suffixes in the σ^l buckets lexicographically, in parallel on the σ^l processors, processor p_t sorting the suffixes in the bucket numbered t, 1 ≤ t ≤ σ^l;
Step 6: concatenate the sorted suffixes of the buckets in order of bucket number from 1 to σ^l to obtain the lexicographical order of all suffixes in R; the sequence set R contains m sequences and each sequence has k suffixes, so R contains m × k suffixes in total; let the lexicographical order of these suffixes be Suffix_1 < Suffix_2 < ... < Suffix_{m×k};
Step 7: take the BWT characters of Suffix_1, Suffix_2, ..., Suffix_{m×k} in turn and concatenate them to obtain the BWT index of R;
Step 8: output the BWT index.
2. The parallel acceleration method for multi-sequence BWT index construction according to claim 1, characterized in that the σ^l processors described in Step 2 are σ^l CPUs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510328718.8A CN104899476A (en) | 2015-06-15 | 2015-06-15 | Parallel accelerating method for BWT index construction for multiple sequences |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104899476A true CN104899476A (en) | 2015-09-09 |
Family
ID=54032138
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510328718.8A Pending CN104899476A (en) | 2015-06-15 | 2015-06-15 | Parallel accelerating method for BWT index construction for multiple sequences |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104899476A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20150909 |