CN104899476A

CN104899476A - Parallel accelerating method for BWT index construction for multiple sequences

Info

Publication number: CN104899476A
Application number: CN201510328718.8A
Authority: CN
Inventors: 彭绍亮; 朱小谦; 王恒; 卢宇彤; 杨灿群; 吴诚堃; 崔英博; 刘欣; 王海强; 程乾; 夏徐伟
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2015-06-15
Filing date: 2015-06-15
Publication date: 2015-09-09

Abstract

The invention discloses a parallel acceleration method for BWT index construction over multiple sequences. It aims to solve the slow speed and low efficiency of existing BWT index construction for large-scale sequence sets, which partitions the sequence set, sorts each partition, and then recursively merge-sorts the partitions in pairs. The technical scheme is: traverse all suffixes of each sequence in the sequence set R, inspect the first l characters of each suffix, and place suffixes with the same first l characters in the same memory sub-block; sort the suffixes within each sub-block independently and in parallel; concatenate the sorted sub-blocks to obtain the order of all suffixes in R; then take the BWT character of each suffix in ascending lexicographic order and concatenate these characters to obtain the BWT index of R. The method effectively speeds up BWT index construction for multiple sequences and reduces whole-genome assembly time by about 90%.

Description

A parallel acceleration method for multi-sequence BWT index construction

Technical field: The present invention relates to whole-genome assembly methods in bioinformatics, and in particular to a parallel acceleration method for constructing the Burrows-Wheeler transform (hereinafter BWT) index of a large-scale set of short sequences (on the order of 100 million sequences) during whole-genome assembly.

Background art:

Whole-genome assembly is a core problem in bioinformatics and the basis and prerequisite for other research in genomics. The genome of a typical organism contains millions or even billions of bases, whereas current sequencing technology can only read fragments of a few hundred bases at a time. Reconstructing the original genome from these short sequences, using the overlap relations among them, is called genome assembly. For N sequence fragments, directly computing the pairwise overlaps takes O(N²) time. Sequencing a eukaryote can yield several hundred million fragments, so the overlap computation cannot be completed in a practical amount of time. Research has found that, once the BWT index of the fragment set is known, the overlaps between fragments can be computed within a few hours. The BWT index of a sequence set is defined below.

Let Σ = {c₁, c₂, ..., c_σ} be a finite alphabet with c₁ < c₂ < ... < c_σ, where '<' denotes lexicographic order, σ is the number of characters in Σ, and 1, 2, ..., σ are the serial numbers of the letters in the alphabet. Let S = s₁s₂...s_i...s_{k-1} be a finite string with s_i ∈ Σ. In addition, define the end-of-string character '$', which is lexicographically smaller than every character in Σ. S can then be written as a string of length k, S = s₁s₂...s_{k-1}s_k, where s_k = '$'. We write S[i, j] = s_i s_{i+1}...s_j for the substring of S from the i-th to the j-th character, where 1 ≤ i ≤ j ≤ k. A substring of the form S[1, i] is called a prefix of S, and a substring of the form S[j, k] is called a suffix of S, where 1 ≤ i, j ≤ k. The character S[j-1] is called the BWT character of the suffix S[j, k], where 1 < j ≤ k; the BWT character of the suffix S[1, k] is defined to be '$'. Let R = {S₁, S₂, ..., S_m} be a set of m strings over Σ, where each S_i has length k and S_i[k] = '$'. To distinguish the strings, define S_i[k] < S_j[k] for 1 ≤ i < j ≤ m. Sort all suffixes of the sequences in R in lexicographic order; the string formed by taking the BWT character of each suffix in that order is called the BWT of the sequence set R.
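The definition above can be illustrated with a minimal sketch that sorts every suffix directly. The function name `bwt_of_set` is hypothetical, and the code relies on two assumptions: the ASCII order `'$' < 'A'` stands in for the rule that the end marker is smaller than every letter, and ties between identical suffixes are broken by sequence index to model S_i[k] < S_j[k] for i < j.

```python
def bwt_of_set(seqs):
    """Naive BWT of a sequence set: sort all suffixes, read off BWT characters."""
    entries = []
    for i, s in enumerate(seqs):
        t = s + "$"                              # append the end-of-string marker
        for j in range(len(t)):
            suffix = t[j:]
            # BWT character: the character preceding the suffix,
            # or '$' when the suffix is the whole string.
            bwt_char = t[j - 1] if j > 0 else "$"
            entries.append((suffix, i, bwt_char))
    # Sort by suffix, then by sequence index (models S_i[k] < S_j[k] for i < j).
    entries.sort(key=lambda e: (e[0], e[1]))
    return "".join(e[2] for e in entries)


print(bwt_of_set(["AC", "CA"]))  # CAC$A$
```

This direct approach is only practical for tiny inputs, since it materializes every suffix at once.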

As the definition shows, the key step in building the BWT is sorting the suffixes of all sequences in the set in lexicographic order. For a large-scale set of short sequences, however, sorting all of the suffixes directly requires main memory on the order of terabytes. Take humans as an example: the human genome contains about 3 billion bases, each of which can be represented by one of the characters A, C, G, T. Sequencing at a typical depth of 30× produces about 1 billion sequences of roughly 100 bases each; merely enumerating all suffixes of these sequences requires 1.25 TB of space, far beyond the memory capacity of existing computing equipment.
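The 1.25 TB figure can be reproduced with a short calculation. This is a sketch under assumptions that the text does not state explicitly: each base is packed into 2 bits, and the total suffix length per sequence is approximated as k²/2 characters. The function names are hypothetical.

```python
def reads_for_depth(genome_bases, depth, read_len):
    """Approximate number of reads produced at a given sequencing depth."""
    return genome_bases * depth // read_len


def suffix_storage_bytes(m, k, bits_per_base=2):
    """Bytes needed to enumerate all suffixes of m reads of length k.

    Assumption: about k^2 / 2 characters per read, 2 bits per character.
    """
    total_chars = m * k * k // 2
    return total_chars * bits_per_base // 8


print(reads_for_depth(3 * 10**9, 30, 100))      # 900000000, i.e. about 1 billion
print(suffix_storage_bytes(10**9, 100) / 1e12)  # 1.25 (TB)
```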

For this reason, researchers proposed a method that sorts blocks and then recursively merges them in pairs. Take BWT index construction for the sequence set R = {S₁, S₂, ..., S₈} as an example, and suppose the main memory of the computing device is large enough to sort all suffixes of two sequences. R is then divided into four blocks R₁, R₂, R₃, R₄, where R_i = {S_{2i-1}, S_{2i}}, 1 ≤ i ≤ 4. First, all suffixes in R₁, R₂, R₃, R₄ are sorted in turn to obtain Sort₁, Sort₂, Sort₃, Sort₄. A merge-sort algorithm then merges Sort₁ and Sort₂ into Sort₁₂ and Sort₃ and Sort₄ into Sort₃₄, after which Sort₁₂ and Sort₃₄ are merged into Sort₁₂₃₄, which is the lexicographic order of all suffixes in R. Finally, the BWT characters of the suffixes in Sort₁₂₃₄ are taken in ascending lexicographic order, and the resulting string is the BWT index of R. Building the BWT index by block sorting followed by recursive merging avoids the excessive memory demand of sorting the suffixes of all sequences directly. However, because the method executes serially, its time efficiency is poor and it cannot meet the timeliness requirements of BWT construction for large-scale short-sequence sets.
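A minimal in-memory sketch of this block-sort-then-merge scheme follows. It is illustrative only: the names `suffixes` and `bwt_by_pairwise_merge` are hypothetical, everything is held in ordinary Python lists rather than being staged through limited main memory, and `heapq.merge` is used as the two-way merge.

```python
from heapq import merge


def suffixes(seqs, offset=0):
    """(suffix, global sequence index, BWT char) triples for one block."""
    out = []
    for i, s in enumerate(seqs, start=offset):
        t = s + "$"
        for j in range(len(t)):
            out.append((t[j:], i, t[j - 1] if j > 0 else "$"))
    return out


def bwt_by_pairwise_merge(seqs, block_size=2):
    # Sort the suffixes of each block independently (Sort_1, Sort_2, ...).
    runs = [sorted(suffixes(seqs[b:b + block_size], offset=b))
            for b in range(0, len(seqs), block_size)]
    # Merge sorted runs two at a time until a single run remains.
    while len(runs) > 1:
        runs = [list(merge(*runs[p:p + 2])) for p in range(0, len(runs), 2)]
    return "".join(ch for _, _, ch in runs[0]) if runs else ""


print(bwt_by_pairwise_merge(["AC", "CA"], block_size=1))  # CAC$A$
```

Every merge round compares suffixes from different blocks again, which is the cost the later sections set out to avoid.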

To accelerate BWT index construction, some researchers proposed a parallel acceleration method based on the block-merge approach above. Again take BWT index construction for the sequence set R = {S₁, S₂, ..., S₈} as an example. First, a cluster with four nodes P₁, P₂, P₃, P₄ is set up, each node having enough main memory to sort all suffixes of two sequences. R is then divided into four blocks R₁, R₂, R₃, R₄, where R_i = {S_{2i-1}, S_{2i}}, 1 ≤ i ≤ 4. All suffixes in R_i are sorted on node P_i simultaneously to obtain Sort_i, 1 ≤ i ≤ 4. Node P₁ then merges Sort₁ and Sort₂ into Sort₁₂ while node P₃ merges Sort₃ and Sort₄ into Sort₃₄; afterwards node P₁ merges Sort₁₂ and Sort₃₄ into Sort₁₂₃₄, which is the lexicographic order of all suffixes in R. Finally, the BWT characters of the suffixes in Sort₁₂₃₄ are taken in ascending order, and the resulting string is the BWT index of R. In the initial stage the degree of parallelism of this strategy equals the number of blocks, 4. It halves with each round of merging and falls to 1 in the final merge, which is fully serial, so the overall average parallelism is low. Experiments show a speed-up ratio of only about one third, which still cannot meet the timeliness requirements of BWT index construction for large-scale sequence sets.
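The shrinking parallelism can be made concrete with a small sketch that counts the busy nodes in each phase. The function name `merge_schedule` is hypothetical, and the count ignores the fact that merged runs double in size each round, so the final serial merge also handles the most data.

```python
def merge_schedule(num_blocks):
    """Busy nodes per phase: the initial sort, then each pairwise merge round."""
    phases = [num_blocks]            # every node sorts its own block
    runs = num_blocks
    while runs > 1:
        merges = runs // 2           # each pairwise merge occupies one node
        phases.append(merges)
        runs = merges + runs % 2     # merged runs plus any unpaired leftover
    return phases


print(merge_schedule(4))  # [4, 2, 1]
```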

Both of the methods above partition the large-scale sequence set directly and then sort each block. This partitioning avoids the excessive memory demand of direct sorting. However, because there is no particular order relation between the suffixes of different blocks, suffixes from different blocks must still be compared during the recursive merging, which greatly reduces overall efficiency.

Summary of the invention:

The technical problem to be solved by the invention is that existing BWT index construction for large-scale sequence sets sorts the sequence set block by block and then recursively merge-sorts the blocks in pairs, which makes construction slow and inefficient.

The technical scheme adopted to solve this problem is as follows. First, traverse all suffixes of each sequence in the sequence set R, inspect the first l characters of each suffix, and place suffixes with the same first l characters in the same memory block. Next, sort the suffixes within each block in lexicographic order, independently and in parallel. Then concatenate the sorted blocks in ascending lexicographic order of the l-character prefixes that define them, which yields the lexicographic order of all suffixes in R. Finally, take the BWT character of each suffix in ascending lexicographic order and concatenate these characters to obtain the BWT index of R.
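The scheme just described can be sketched end to end as follows. This is an illustration under stated assumptions, not the patented implementation: the name `bwt_by_prefix_buckets` is hypothetical; a thread pool stands in for the separate processors (in CPython it does not give true CPU parallelism); and the end marker '$' is treated like the smallest letter when forming the bucket key, which keeps the bucket order consistent with lexicographic order because '$' sorts below every letter inside a bucket.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def bwt_by_prefix_buckets(seqs, l=2, pad="A", workers=4):
    """Partition suffixes by their first l characters, sort buckets, concatenate."""
    buckets = defaultdict(list)
    for i, s in enumerate(seqs):
        t = s + "$"
        for j in range(len(t)):
            suffix = t[j:]
            # Bucket key: first l characters, with '$' mapped to the smallest
            # letter and short prefixes padded with it (assumption, see lead-in).
            key = suffix[:l].replace("$", pad).ljust(l, pad)
            buckets[key].append((suffix, i, t[j - 1] if j > 0 else "$"))
    keys = sorted(buckets)           # ascending order of the l-character prefixes
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each bucket is sorted independently; no cross-bucket comparisons occur.
        sorted_buckets = list(pool.map(sorted, [buckets[key] for key in keys]))
    return "".join(ch for bucket in sorted_buckets for _, _, ch in bucket)


print(bwt_by_prefix_buckets(["AC", "CA"]))  # CAC$A$
```

Because every suffix in one bucket precedes every suffix in the next, the final step is plain concatenation rather than a merge.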

Concrete technical scheme is as follows:

Step 1: Determine the length l of the separating string used to partition the suffixes, according to the sequence scale and the processor memory size. For the sequence set R = {S₁, ..., S_i, ..., S_m}, where each S_i has length k, 1 ≤ i ≤ m, the alphabet Σ contains σ characters, and the processor memory is M bytes, l = ⌈log_σ(mk² / (2M))⌉.

Step 2: Build a cluster containing σ^l processors (CPUs), numbered consecutively p₁, p₂, ..., p_{σ^l}.

Step 3: Allocate σ^l dynamic memory blocks (hereinafter "buckets") in the main memory of the cluster, each with an initial size of mk² / (4σ^l) bytes, labeled 1 to σ^l.

Step 4: Partition the m × k suffixes contained in the sequence set R.

Step 4.1: Set i = 1;

Step 4.2: Partition the suffixes of the i-th sequence.

Step 4.2.1: Set j = 1;

Step 4.2.2: Inspect the first l characters S_i[j, j+l-1] of the suffix S_i[j, k] of the i-th sequence. If the suffix is shorter than l, append the character c₁ to its end until its length reaches l (as described in the background, c₁ ∈ Σ, Σ = {c₁, c₂, ..., c_σ}; the sequences in R contain only the σ characters c₁, c₂, ..., c_σ, with lexicographic order c₁ < c₂ < ... < c_σ, and 1, 2, ..., σ are the serial numbers of the letters in the alphabet). Let S_i[j, j+l-1] = c_{i₁}c_{i₂}...c_{i_l}, where i₁, i₂, ..., i_l are the serial numbers in Σ = {c₁, c₂, ..., c_σ} of the l characters of S_i[j, j+l-1]. Then place the suffix S_i[j, k] in the bucket numbered h, where h = (i₁-1) × σ^{l-1} + (i₂-1) × σ^{l-2} + ... + (i_l-1) × σ^{l-l} + 1. If the bucket does not have enough memory to hold the new suffix, expand its memory by mk² / (16σ^l) bytes.

Step 4.2.3: Set j = j + 1;

Step 4.2.4: If j ≤ k, go to Step 4.2.2; otherwise go to Step 4.3.

Step 4.3: Set i = i + 1;

Step 4.4: If i ≤ m, go to Step 4.2; otherwise go to Step 5.

Step 5: Sort the suffixes in the σ^l buckets in lexicographic order, in parallel on the σ^l processors; processor p_t sorts the suffixes in the bucket numbered t, 1 ≤ t ≤ σ^l.

Step 6: Concatenate the sorted suffixes of the buckets in order of bucket number from 1 to σ^l to obtain the order of all suffixes in R. The sequence set R contains m sequences, each with k suffixes, so R contains m × k suffixes in total; let their lexicographic order be Suffix₁ < Suffix₂ < ... < Suffix_{m × k}.

Step 7: Take the BWT characters of Suffix₁, Suffix₂, ..., Suffix_{m × k} in turn and concatenate them to obtain the BWT index of R.

Step 8: Output the BWT index.
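Two of the steps above reduce to short formulas: the prefix length l of Step 1 and the bucket number h of Step 4.2.2. The sketch below implements both, taking the expression for l from claim 1 and reading its brackets as a ceiling. The function names and the default alphabet "ACGT" are assumptions for illustration, and the bucket function expects a prefix that has already been padded to length l and contains only alphabet letters.

```python
import math


def prefix_length(m, k, sigma, mem_bytes):
    """l = ceil(log_sigma(m * k^2 / (2 * M))), the expression given in claim 1."""
    return math.ceil(math.log(m * k * k / (2 * mem_bytes), sigma))


def bucket_number(prefix, alphabet="ACGT"):
    """1-based bucket number h for an l-character prefix (Step 4.2.2)."""
    sigma = len(alphabet)
    h = 0
    for ch in prefix:
        # alphabet.index(ch) is (serial number - 1), one base-sigma digit.
        h = h * sigma + alphabet.index(ch)
    return h + 1


print(prefix_length(10**9, 100, 4, 64 * 2**30))  # 4
print(bucket_number("AAAC"))                     # 2
```

Evaluating the sum in Horner form gives the same value as the written expression, since the leading digit ends up multiplied by σ^{l-1} and the last by σ⁰.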

The beneficial effect of the invention is a parallel acceleration method for BWT index construction that effectively speeds up the construction of multi-sequence BWT indexes and reduces the time required for whole-genome assembly by about 90%. The method can also be used in other applications involving large-scale sequence sets and is easy to port and adopt.

Brief description of the drawings:

Fig. 1 is overview flow chart of the present invention.

Detailed description of the embodiment:

The invention is described in further detail below with reference to Fig. 1, taking as an example the construction of the BWT index of 1 billion DNA sequences of length 100 on a cluster with 64 GB of memory per node (hereinafter "this example"). The DNA alphabet is Σ = {A, C, G, T}, of size 4, so σ = 4.

As shown in Fig. 1, the parallel acceleration algorithm for BWT index construction proposed by the invention comprises 8 main steps.

Step 1: Determine the length l of the separating string used to partition the suffixes, according to the sequence scale and the processor memory size. For this example, m = 10⁹, k = 100, σ = 4, M = 64 × 2³⁰, which gives l = ⌈log₄(10⁹ × 100² / (2 × 64 × 2³⁰))⌉ = ⌈log₄ 72.8⌉ = 4.

Step 2: Compute σ^l = 4⁴ = 256 and set up a cluster containing 256 processors (CPUs), numbered p₁, p₂, ..., p₂₅₆.

Step 3: Allocate σ^l = 256 buckets in the memory of the cluster, labeled 1 to 256, each with an initial size of 10⁹ × 100² / (4 × 256) bytes, about 10 GB. The buckets hold the suffixes beginning with AAAA through TTTT, respectively.

Step 4: Scan all suffixes of the 1 billion DNA sequences of length 100 (10⁹ × 100 = 10¹¹ suffixes in total) and inspect the first 4 characters of each suffix.

Step 4.1: Set i = 1;

Step 4.2: Partition the suffixes of the i-th sequence.

Step 4.2.1: Set j = 1;

Step 4.2.2: Inspect the first 4 characters S_i[j, j+3] of the suffix S_i[j, 100] of the i-th sequence and partition accordingly. For DNA the alphabet is Σ = {A, C, G, T}, so c₁ = 'A', c₂ = 'C', c₃ = 'G', c₄ = 'T'. The serial number of 'A' is 1, of 'C' is 2, of 'G' is 3, and of 'T' is 4. If a suffix is shorter than 4, append the character 'A' to its end until its length reaches 4. A suffix beginning with AAAA is placed in bucket h = (1-1) × 4³ + (1-1) × 4² + (1-1) × 4¹ + (1-1) × 4⁰ + 1 = 1; a suffix beginning with AAAC is placed in bucket h = (1-1) × 4³ + (1-1) × 4² + (1-1) × 4¹ + (2-1) × 4⁰ + 1 = 2; and so on, up to a suffix beginning with TTTT, which is placed in bucket h = (4-1) × 4³ + (4-1) × 4² + (4-1) × 4¹ + (4-1) × 4⁰ + 1 = 256. If a bucket does not have enough memory to hold the new suffix, expand its memory by 10⁹ × 100² / (16 × 256) bytes, about 2.5 GB.

Step 4.2.3: Set j = j + 1;

Step 4.2.4: If j ≤ 100, go to Step 4.2.2; otherwise go to Step 4.3.

Step 4.3: Set i = i + 1;

Step 4.4: If i ≤ 10⁹, go to Step 4.2; otherwise go to Step 5.

Step 5: Sort the suffixes in the 256 buckets in parallel on the 256 processors; specifically, the t-th processor p_t sorts the suffixes in the t-th bucket, 1 ≤ t ≤ 256.

Step 6: Concatenate the sorted results in order from bucket 1 to bucket 256 to obtain the order of all suffixes. For the set of 10⁹ sequences of length 100, m = 10⁹ and k = 100, so there are 10⁹ × 100 = 10¹¹ suffixes in total; let their lexicographic order be Suffix₁ < Suffix₂ < ... < Suffix_{10¹¹}.

Step 7: Take the BWT characters of Suffix₁, Suffix₂, ..., Suffix_{10¹¹} in turn and concatenate them to form the BWT index of the 1 billion DNA sequences.

Step 8: Output the BWT index.
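The bucket sizes quoted in Steps 3 and 4.2.2 of this example can be checked with a few lines of arithmetic. The function name `bucket_sizes` is hypothetical; the two expressions are the ones stated in the steps. The exact values are roughly 9.77 × 10⁹ and 2.44 × 10⁹ bytes, which the text rounds to about 10 GB and about 2.5 GB.

```python
def bucket_sizes(m, k, sigma, l):
    """Number of buckets, initial bucket size and growth increment, in bytes."""
    n_buckets = sigma ** l
    initial = m * k * k // (4 * n_buckets)    # m k^2 / (4 sigma^l)
    growth = m * k * k // (16 * n_buckets)    # m k^2 / (16 sigma^l)
    return n_buckets, initial, growth


print(bucket_sizes(10**9, 100, 4, 4))  # (256, 9765625000, 2441406250)
```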

Claims

1. A parallel acceleration method for multi-sequence BWT index construction, characterized by comprising the following steps:

Step 1: Determine the length l of the separating string used to partition the suffixes, according to the sequence scale and the processor memory size. Let Σ = {c₁, c₂, ..., c_σ} be a finite alphabet with c₁ < c₂ < ... < c_σ, where '<' denotes lexicographic order and σ is the number of characters in Σ; let S = s₁s₂...s_i...s_{k-1}s_k be a finite string, where s_k = '$', '$' is the end-of-string marker, and s_i ∈ Σ, 1 ≤ i < k; S[i, j] = s_i s_{i+1}...s_j denotes the substring of S from the i-th to the j-th character, where 1 ≤ i ≤ j ≤ k; a substring of the form S[j, k] is called a suffix of S, where 1 ≤ j ≤ k; the character S[j-1] is called the BWT character of the suffix S[j, k], where 1 < j ≤ k; the BWT character of the suffix S[1, k] is defined to be '$'; let R = {S₁, S₂, ..., S_m} be a set of m strings over Σ, where each S_i has length k and S_i[k] = '$'; for a processor memory size of M bytes, l = ⌈log_σ(mk² / (2M))⌉;

Step 2: Build a cluster containing σ^l processors, numbered consecutively p₁, p₂, ..., p_{σ^l};

Step 3: Allocate σ^l dynamic buckets in the main memory of the cluster, each with an initial size of mk² / (4σ^l) bytes, labeled 1 to σ^l, a bucket being a memory block;

Step 4: Partition the m × k suffixes contained in the sequence set R,

Step 4.1: Set i = 1;

Step 4.2: Partition the suffixes of the i-th sequence,

Step 4.2.1: Set j = 1;

Step 4.2.2: Inspect the first l characters S_i[j, j+l-1] of the suffix S_i[j, k] of the i-th sequence; if the suffix is shorter than l, append the character c₁ to its end until its length reaches l, c₁ ∈ Σ; let S_i[j, j+l-1] = c_{i₁}c_{i₂}...c_{i_l}; then place the suffix S_i[j, k] in the bucket numbered h, where h = (i₁-1) × σ^{l-1} + (i₂-1) × σ^{l-2} + ... + (i_l-1) × σ^{l-l} + 1, and i₁, i₂, ..., i_l are the serial numbers in Σ = {c₁, c₂, ..., c_σ} of the l characters of S_i[j, j+l-1]; if the bucket does not have enough memory to hold the new suffix, expand its memory by mk² / (16σ^l) bytes;

Step 4.2.3: Set j = j + 1;

Step 4.2.4: If j ≤ k, go to Step 4.2.2; otherwise go to Step 4.3;

Step 4.3: Set i = i + 1;

Step 4.4: If i ≤ m, go to Step 4.2; otherwise go to Step 5;

Step 5: Sort the suffixes in the σ^l buckets in lexicographic order, in parallel on the σ^l processors, processor p_t sorting the suffixes in the bucket numbered t, 1 ≤ t ≤ σ^l;

Step 6: Concatenate the sorted suffixes of the buckets in order of bucket number from 1 to σ^l to obtain the lexicographic order of all suffixes in R; the sequence set R contains m sequences, each with k suffixes, so R contains m × k suffixes in total; let their lexicographic order be Suffix₁ < Suffix₂ < ... < Suffix_{m × k};

Step 7: Take the BWT characters of Suffix₁, Suffix₂, ..., Suffix_{m × k} in turn and concatenate them to obtain the BWT index of R;

Step 8: Output the BWT index.

2. The parallel acceleration method for multi-sequence BWT index construction according to claim 1, characterized in that the σ^l processors in Step 2 are σ^l CPUs.