CN105224828B

CN105224828B - Key-value index data compression method for fast positioning of gene sequence fragments

Info

Publication number: CN105224828B
Application number: CN201510648867.2A
Authority: CN
Inventors: 宋卓; 李�根
Original assignee: Human And Future Biotechnology (changsha) Co Ltd
Current assignee: Human And Future Biotechnology (changsha) Co Ltd
Priority date: 2015-10-09
Filing date: 2015-10-09
Publication date: 2017-10-27
Anticipated expiration: 2035-10-09
Also published as: CN105224828A

Abstract

The invention discloses a key-value index data compression method for fast positioning of gene sequence fragments, comprising the steps of: 1) initializing a compression result set Set_comp and setting the prefix length n used for data compression; 2) taking one current gene sequence fragment Key to be compressed out of the gene sequence data set to be compressed Set_orig; 3) cyclically shifting the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) having common prefixes, where n is the prefix length, and adding each gene sequence fragment sequence to the compression result set Set_comp as a common prefix together with its own cyclic shift count and suffix; 4) judging whether the data set to be compressed Set_orig is empty; if it is not empty, taking out the next current gene sequence fragment Key to be compressed and returning to step 2); otherwise, outputting the compression result set Set_comp. The invention improves query efficiency for large data volumes and has the advantages of strong compression capability and small space occupation.

Description

Key-value index data compression method for fast positioning of gene sequence fragments

Technical field

The present invention relates to bioinformatic analysis of gene sequencing data, and in particular to a key-value index data compression method for fast positioning of gene sequence fragments.

Background technology

Positioning of sequencing reads is the basis of current high-throughput gene sequencing data analysis. Sequence fragments are generally matched with methods such as BWA, which perform optimal string matching that tolerates some errors. Practical experiments show, however, that in most cases the majority of sequenced fragments can be broken up into shorter gene sequence fragments (36 bp) and matched completely, accurately and quickly by an exact Key-Value mapping method.

To allow short gene sequences to be matched quickly and accurately against the reference sequence, a Key-Value index database must first be built from the reference sequence data, as follows. Suppose the reference sequence data are ACGTGCA and a database for matching short sequences is to be built from key-value pairs (Key-Value pairs) whose keys are 4 characters long, as shown in Fig. 1. Referring to Fig. 1, the reference sequence data are scanned one character at a time and a key of 4 characters is taken at each position, which yields 4 Key-Value pairs as the data of the query database. If sequencing yields the short sequence "GTGC", the Key-Value mapping quickly gives the Offset at which GTGC lies in the reference sequence, namely position 2. However, this method has one important problem: the reference sequence used as the database is generally long and may in practice exceed 2*10⁹ characters. If Key-Value pairs are made with fragments of 36 characters, the index data made up of the Keys alone amount to (2*10⁹–36)*36 Bytes ≈ 67.05 GB. Such a huge index consumes a large amount of the memory of the computing system and sharply lowers the cache hit rate of the Key-Value system; if memory is insufficient, page swapping additionally causes severe jitter in system performance, so that exact matching, which ought to be very efficient, performs far worse in an engineering implementation. The existing compressed prefix tree method can capture characters in the index data whose position and value are both identical and merge them in an index tree, thereby reducing the size of the index. However, data compressed in this way must be queried through the tree structure, and the query efficiency is closely related to the depth of the tree and the volume of data: when the data volume is large the tree becomes deeper and query efficiency drops markedly. In addition, the space taken by the large number of pointers needed to build the compressed prefix tree largely offsets the compression gained.
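The uncompressed Key-Value index described above can be illustrated with a minimal Python sketch. The function name `build_index` is hypothetical, and keeping only the first offset of a repeated key is an assumption, since the text does not say how repeated keys are handled. The size estimate at the end treats the "GB" of the text as 2^30 bytes.

```python
def build_index(reference: str, k: int) -> dict[str, int]:
    """Map every k-character window of the reference to its offset."""
    index: dict[str, int] = {}
    for offset in range(len(reference) - k + 1):
        # Assumption: a repeated key keeps the first offset at which it occurs.
        index.setdefault(reference[offset:offset + k], offset)
    return index


index = build_index("ACGTGCA", 4)
print(len(index))        # 4 Key-Value pairs: ACGT, CGTG, GTGC, TGCA
print(index["GTGC"])     # 2

# Bytes needed for the keys alone: 2*10^9-character reference, 36-character
# keys, one byte per character.
key_bytes = (2 * 10**9 - 36) * 36
key_gib = key_bytes / 2**30
print(key_gib)
```

Every key repeats most of the characters of its neighbours, which is the redundancy the compression method targets.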

Summary of the invention

The technical problem to be solved by the present invention: in view of the above problems of the prior art, to provide a key-value index data compression method for fast positioning of gene sequence fragments that improves query efficiency for large data volumes, has strong compression capability and occupies little space.

In order to solve the above-mentioned technical problem, the technical solution adopted by the present invention is：

A key-value index data compression method for fast positioning of gene sequence fragments, comprising the steps of:

1) initialize a compression result set Set_comp and set the prefix length n used for data compression;

2) take one current gene sequence fragment Key to be compressed out of the gene sequence data set to be compressed Set_orig;

3) cyclically shift the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) having common prefixes, where n is the prefix length, and add each gene sequence fragment sequence to the compression result set Set_comp as a common prefix together with its own cyclic shift count and suffix;

4) judge whether the data set to be compressed Set_orig is empty; if it is not empty, take out the next current gene sequence fragment Key to be compressed and return to step 2); otherwise, output the compression result set Set_comp.

Preferably, the detailed steps of step 3) include:

3.1) cyclically shift the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) having common prefixes, where n is the prefix length;

3.2) select one gene sequence fragment sequence Key_ri from the gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) as the current gene sequence fragment sequence;

3.3) split the current gene sequence fragment sequence Key_ri according to the prefix length n into a prefix Prefix_ri and a suffix Postfix_ri, the sum of the lengths of the prefix Prefix_ri and the suffix Postfix_ri being the length of the current gene sequence fragment sequence Key_ri;

3.4) judge whether the mapping relation set corresponding to the prefix Prefix_ri already exists in the compression result set Set_comp; if it exists, go to step 3.5); otherwise go to step 3.6);

3.5) judge whether the data <i, Postfix_ri> of the current gene sequence fragment sequence Key_ri already exists in the mapping relation set corresponding to the prefix Prefix_ri; if it does not exist, add the data <i, Postfix_ri> of the current gene sequence fragment sequence Key_ri to the mapping relation set corresponding to the prefix Prefix_ri, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is its suffix, and go to step 3.7); otherwise, ignore the subsequent gene sequence fragment sequences of the current gene sequence fragment sequence Key_ri that have the common prefix, and go to step 4);

3.6) create a new mapping relation Prefix_ri→{<i, Postfix_ri>} for the prefix Prefix_ri and add it to the compression result set Set_comp, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is its suffix, and go to step 3.7);

3.7) judge whether the gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) have all been processed; if not, select the next current gene sequence fragment sequence Key_ri and go to step 3.3); otherwise go to step 4).
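One literal reading of steps 2) to 4) and 3.1) to 3.7) can be sketched in Python as follows. The names `rotate_left` and `compress` are hypothetical, fragments are held as plain strings, and Set_comp is modelled as a dictionary from prefix to a set of (shift count, suffix) pairs; none of these choices come from the patent text. The sketch records every shifted sequence it visits, which is only one way to read the steps and should not be taken as the patentee's implementation.

```python
def rotate_left(seq: str, i: int) -> str:
    """Cyclically shift seq to the left by i positions."""
    i %= len(seq)
    return seq[i:] + seq[:i]


def compress(set_orig: list[str], n: int) -> dict[str, set[tuple[int, str]]]:
    set_comp: dict[str, set[tuple[int, str]]] = {}       # step 1)
    for key in set_orig:                                 # steps 2) and 4)
        for i in range(n):                               # steps 3.1), 3.2), 3.7)
            key_ri = rotate_left(key, i)
            prefix, postfix = key_ri[:n], key_ri[n:]     # step 3.3)
            if prefix in set_comp:                       # step 3.4)
                if (i, postfix) in set_comp[prefix]:     # step 3.5): already stored,
                    break                                # skip the remaining shifts
                set_comp[prefix].add((i, postfix))
            else:                                        # step 3.6)
                set_comp[prefix] = {(i, postfix)}
    return set_comp


# Toy input: three adjacent 4-character windows, prefix length 2.
result = compress(["CGTG", "GTGC", "TGCA"], 2)
for prefix, entries in sorted(result.items()):
    print(prefix, sorted(entries))
```

A dictionary keyed by prefix keeps the structure at two levels (prefix, then suffix), in line with the stated aim of avoiding a tree whose depth grows with the data.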

Preferably, the detailed steps of setting the prefix length n used for data compression in step 1) include:

1.1) construct the compression ratio function f(n) of the prefix length n;

1.2) find the prefix length n for which the value of the compression ratio function f(n) is maximal.

Preferably, the compression ratio function constructed in step 1.1) is given by formula (1):

$$f(n) = \frac{TL \cdot SL \cdot b}{8 \cdot S(n)} \qquad (1)$$

In formula (1), f(n) is the compression ratio function, TL is the length of the data set to be compressed Set_orig, SL is the length of a gene sequence fragment Key to be compressed in the data set to be compressed Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in the data set to be compressed Set_orig, and S(n) is the byte estimation function for the index data when the length of the prefix Prefix is n; the expression for the byte estimation function S(n) is given by formula (2);

In formula (2), S(n) is the byte estimation function for the index data when the length of the prefix Prefix is n, TL is the length of the data set to be compressed Set_orig, SL is the length of a gene sequence fragment Key to be compressed in the data set to be compressed Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in the data set to be compressed Set_orig, and n is the prefix length.

Preferably, the cyclic shift is a cyclic left shift.

Preferably, the prefix length n is 32.

The key-value index data compression method for fast positioning of gene sequence fragments of the present invention has the following advantages. The invention cyclically shifts the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) having common prefixes, where n is the prefix length, and adds each gene sequence fragment sequence to the compression result set Set_comp as a common prefix together with its own cyclic shift count and suffix. Each gene sequence fragment Key is split into two parts, a prefix (Prefix) and a suffix (Postfix). By cyclically shifting the gene sequence fragments a certain number of times, as many identical prefix sequences as possible are captured among adjacent short fragment sequences; the prefix sequences of these gene sequence fragments Key are merged, and the suffix sequence together with the code of the cyclic shift count uniquely identifies one specific gene sequence fragment Key. This greatly reduces the storage needed to index the short sequences. Moreover, because there are only two levels, prefix and suffix, the invention does not suffer from the defect of the traditional compressed prefix tree, in which the number of levels grows with the scale of the data. The invention therefore improves query efficiency for large data volumes and has the advantages of strong compression capability and small space occupation.

Brief description of the drawings

Fig. 1 is a schematic diagram of the principle by which the prior art builds a key-value database of gene sequence fragments.

Fig. 2 is a flow chart of the method of the embodiment of the present invention.

Fig. 3 is a schematic diagram of the principle by which the embodiment of the present invention builds a key-value database of gene sequence fragments.

Fig. 4 is a flow chart of step 3) of the method of the embodiment of the present invention.

Embodiment

As shown in Fig. 2, the steps of the key-value index data compression method for fast positioning of gene sequence fragments of this embodiment include:

From the construction of the key-value database it can be seen that, because the gene sequence fragment Key data are sequences of a fixed length intercepted starting at each successive character, adjacent short sequences intercepted within n characters of one another actually share most of their characters. In this embodiment, each of the n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) obtained after cyclic shifting is encapsulated as a common prefix together with its own cyclic shift count and suffix and added to the compression result set Set_comp. Define the cyclic shift operator <<_R n as cyclically shifting the elements of a sequence n times. As shown in Fig. 3, take as an example the short character sequences intercepted at 3 adjacent positions from the data set to be compressed Set_orig, around T, G, C, A. The gene sequence fragments Key are cyclically shifted 0 to (n-1) times respectively, forming the gene sequence fragment sequences T, G, C, A; T, G, C, G; and T, G, C, G. The gene sequence fragment sequences can therefore be written as TG <<_R0 CA, TG <<_R1 CG and TG <<_R2 CG respectively. The cyclic shift operator <<_R n carries the number of cyclic shifts, the TG in front of the operator is the common prefix, and the part behind the operator is the suffix. It should be noted that gene sequence fragments Key of 4 bases are used here only as an illustration; gene sequence fragments with other numbers of bases can be used as needed, and since the principle is the same as in this embodiment it is not repeated here.

As can be seen from Fig. 3, when the cyclic shift count is 0, the suffix characters C and A both lie behind the common prefix TG before the shift; when the cyclic shift count is 1, the suffix character C lies behind the common prefix TG before the shift and the suffix character G lies in front of it; when the cyclic shift count is 2, the suffix characters C and G both lie in front of the common prefix TG before the shift. In other words, the three fragments before shifting are TGCA, GTGC and CGTG. Based on this principle of cyclic shifting, the original data of a gene sequence fragment sequence can be restored quickly from the compressed gene sequence fragment sequence. In this embodiment the cyclic shift is a cyclic left shift; the basic principle of a cyclic right shift is of course the same, so its implementation details are not repeated here.
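The Fig. 3 example and the restoration it implies can be checked with a short Python sketch. The helper names `rotate_left`, `rotate_right` and `restore` are hypothetical, and the three unshifted fragments TGCA, GTGC and CGTG are inferred from the positions of the suffix characters described above rather than quoted from the figure.

```python
def rotate_left(seq: str, i: int) -> str:
    """Cyclically shift seq to the left by i positions."""
    i %= len(seq)
    return seq[i:] + seq[:i]


def rotate_right(seq: str, i: int) -> str:
    """Inverse of rotate_left: cyclically shift seq to the right by i positions."""
    return rotate_left(seq, -i)


def restore(prefix: str, shift: int, postfix: str) -> str:
    """Rebuild the original fragment from <prefix, shift count, suffix>."""
    return rotate_right(prefix + postfix, shift)


originals = ["TGCA", "GTGC", "CGTG"]    # three adjacent windows of ...CGTGCA...
encoded = []
for shift, key in enumerate(originals):
    shifted = rotate_left(key, shift)    # left shift by 0, 1 and 2
    encoded.append((shifted[:2], shift, shifted[2:]))

print(encoded)   # [('TG', 0, 'CA'), ('TG', 1, 'CG'), ('TG', 2, 'CG')]
print([restore(*entry) for entry in encoded])   # ['TGCA', 'GTGC', 'CGTG']
```

Note that the second and third entries have the same prefix and suffix and differ only in the shift count, which is why the shift count has to be stored alongside the suffix.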

In this embodiment, the detailed steps of setting the prefix length n used for data compression in step 1) include:

1.1) construct the compression ratio function f(n) of the prefix length n;

The compression ratio function constructed in step 1.1) is given by formula (1):

$$f(n) = \frac{TL \cdot SL \cdot b}{8 \cdot S(n)} \qquad (1)$$

As shown in Fig. 4, the detailed steps of step 3) include:

In this embodiment, the data structure of the compression result set Set_comp is as follows:

{
prefix₁→{<rotate₁,postfix₁>,<rotate₂,postfix₂>,…},
prefix₂→{<rotate₃,postfix₃>,…},
…
}

In the above data structure, prefix₁ is the common prefix of the first group of gene sequence fragments, and prefix₁→{<rotate₁,postfix₁>,<rotate₂,postfix₂>,…} is the mapping relation set corresponding to the prefix prefix₁; rotate₁ is the cyclic shift count of the first gene sequence fragment sequence having the prefix prefix₁, and postfix₁ is the suffix of that sequence; rotate₂ is the cyclic shift count of the second gene sequence fragment sequence having the prefix prefix₁, and postfix₂ is the suffix of that sequence. Likewise, prefix₂ is the common prefix of the second group of gene sequence fragments, and prefix₂→{<rotate₃,postfix₃>,…} is the mapping relation set corresponding to the prefix prefix₂; rotate₃ is the cyclic shift count of the first gene sequence fragment sequence having the prefix prefix₂, and postfix₃ is the suffix of that sequence.
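As an illustration only, the structure above maps naturally onto a Python dictionary of sets. The alias `SetComp` and the helper `stored_characters` are hypothetical names, and the contents are the three entries of the Fig. 3 example. The helper counts base characters only and ignores the space taken by the shift counts.

```python
# prefix -> set of <rotate, postfix> pairs
SetComp = dict[str, set[tuple[int, str]]]

set_comp: SetComp = {
    "TG": {(0, "CA"), (1, "CG"), (2, "CG")},
}


def stored_characters(sc: SetComp) -> int:
    """Count base characters held in the structure; each prefix is counted once."""
    return sum(
        len(prefix) + sum(len(postfix) for _, postfix in entries)
        for prefix, entries in sc.items()
    )


plain = 3 * 4                          # three 4-character keys stored separately
grouped = stored_characters(set_comp)  # one 2-character prefix + three 2-character suffixes
print(plain, grouped)                  # 12 8
```

The saving comes from storing the shared prefix once per group, and it grows with the ratio of prefix length to suffix length.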

In this embodiment, the length of the data set to be compressed Set_orig is TL = 2*10⁹, the length of a gene sequence fragment Key is SL = 36, and each element of a gene sequence fragment Key occupies b = 2 bits of storage (because the reference sequence is composed only of A, C, G and T). The prefix length used for data compression is chosen as n = 32, for which the byte estimation function for the index data evaluates to S(n) = 6500000000 Bytes, i.e. 6.05 GB. Relative to the size of the uncompressed data (TL*SL*b/8 = 2*10⁹ × 36 × 2/8 Bytes = 16.76 GB), the key-value index data compression method of this embodiment achieves a compression ratio of nearly 2.8. The embodiment therefore improves query efficiency for large data volumes and has the advantages of strong compression capability and small space occupation.
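The arithmetic of this embodiment can be reproduced with a few lines of Python. The value of S(32) is taken directly from the text, because the expression for S(n) in formula (2) is not reproduced here; the function name `compression_ratio` is hypothetical, and "GB" is read as 2^30 bytes, which is what makes the quoted figures come out.

```python
TL = 2 * 10**9          # length of the data set to be compressed
SL = 36                 # length of one gene sequence fragment Key
B = 2                   # bits per element (A, C, G, T)
S_32 = 6_500_000_000    # bytes of index data for n = 32, as stated in the text
GIB = 2**30


def compression_ratio(tl: int, sl: int, bits: int, s_n: int) -> float:
    """Formula (1): f(n) = TL*SL*b / (8*S(n))."""
    return tl * sl * bits / (8 * s_n)


uncompressed_bytes = TL * SL * B // 8
ratio = compression_ratio(TL, SL, B, S_32)

print(f"{uncompressed_bytes / GIB:.2f}")   # 16.76
print(f"{S_32 / GIB:.2f}")                 # 6.05
print(f"{ratio:.2f}")                      # 2.77
```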

The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to the above embodiment; all technical solutions within the concept of the present invention fall within its scope of protection. It should be pointed out that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A key-value index data compression method for fast positioning of gene sequence fragments, characterized in that the steps include:

1) initialize a compression result set Set_comp and set the prefix length n used for data compression;

2) take one current gene sequence fragment Key to be compressed out of the gene sequence data set to be compressed Set_orig;

3) cyclically shift the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) having common prefixes, where n is the prefix length, and add each gene sequence fragment sequence to the compression result set Set_comp as a common prefix together with its own cyclic shift count and suffix;

4) judge whether the data set to be compressed Set_orig is empty; if it is not empty, take out the next current gene sequence fragment Key to be compressed and return to step 2); otherwise, output the compression result set Set_comp.

2. The key-value index data compression method for fast positioning of gene sequence fragments according to claim 1, characterized in that the detailed steps of step 3) include:

3.1) cyclically shift the current gene sequence fragment Key 0 to (n-1) times to form n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) having common prefixes, where n is the prefix length;

3.2) select one gene sequence fragment sequence Key_ri from the gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) as the current gene sequence fragment sequence;

3.3) split the current gene sequence fragment sequence Key_ri according to the prefix length n into a prefix Prefix_ri and a suffix Postfix_ri, the sum of the lengths of the prefix Prefix_ri and the suffix Postfix_ri being the length of the current gene sequence fragment sequence Key_ri;

3.4) judge whether the mapping relation set corresponding to the prefix Prefix_ri already exists in the compression result set Set_comp; if it exists, go to step 3.5); otherwise go to step 3.6);

3.5) judge whether the data <i, Postfix_ri> of the current gene sequence fragment sequence Key_ri already exists in the mapping relation set corresponding to the prefix Prefix_ri; if it does not exist, add the data <i, Postfix_ri> of the current gene sequence fragment sequence Key_ri to the mapping relation set corresponding to the prefix Prefix_ri, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is its suffix, and go to step 3.7); otherwise, ignore the subsequent gene sequence fragment sequences of the current gene sequence fragment sequence Key_ri that have the common prefix, and go to step 4);

3.6) create a new mapping relation Prefix_ri→{<i, Postfix_ri>} for the prefix Prefix_ri and add it to the compression result set Set_comp, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is its suffix, and go to step 3.7);

3.7) judge whether the gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) have all been processed; if not, select the next current gene sequence fragment sequence Key_ri and go to step 3.3); otherwise go to step 4).

3. The key-value index data compression method for fast positioning of gene sequence fragments according to claim 2, characterized in that the detailed steps of setting the prefix length n used for data compression in step 1) include:

1.1) construct the compression ratio function f(n) of the prefix length n;

1.2) find the prefix length n for which the value of the compression ratio function f(n) is maximal.

4. The key-value index data compression method for fast positioning of gene sequence fragments according to claim 3, characterized in that the compression ratio function constructed in step 1.1) is given by formula (1):

$$f(n) = \frac{TL \cdot SL \cdot b}{8 \cdot S(n)} \qquad (1)$$

In formula (1), f(n) is the compression ratio function, TL is the length of the data set to be compressed Set_orig, SL is the length of a gene sequence fragment Key to be compressed in the data set to be compressed Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in the data set to be compressed Set_orig, and S(n) is the byte estimation function for the index data when the length of the prefix Prefix is n; the expression for the byte estimation function S(n) is given by formula (2);

In formula (2), S(n) is the byte estimation function for the index data when the length of the prefix Prefix is n, TL is the length of the data set to be compressed Set_orig, SL is the length of a gene sequence fragment Key to be compressed in the data set to be compressed Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in the data set to be compressed Set_orig, and n is the prefix length.

5. The key-value index data compression method for fast positioning of gene sequence fragments according to any one of claims 1 to 4, characterized in that the cyclic shift is a cyclic left shift.

6. The key-value index data compression method for fast positioning of gene sequence fragments according to claim 5, characterized in that the prefix length n is 32.