CN105224828B - Key-value index data compression method for fast localization of gene sequence fragments - Google Patents

Key-value index data compression method for fast localization of gene sequence fragments

Info

Publication number
CN105224828B
Authority
CN
China
Prior art keywords
key
prefix
gene order
mrow
order fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510648867.2A
Other languages
Chinese (zh)
Other versions
CN105224828A (en)
Inventor
宋卓
李�根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Human And Future Biotechnology (changsha) Co Ltd
Original Assignee
Human And Future Biotechnology (changsha) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Human And Future Biotechnology (changsha) Co Ltd
Priority to CN201510648867.2A
Publication of CN105224828A
Application granted
Publication of CN105224828B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a key-value index data compression method for fast localization of gene sequence fragments. The steps include: 1) initialize the compression result set Set_comp and set the prefix length n used for data compression; 2) take one current gene sequence fragment Key to be compressed from the gene sequence data set Set_orig to be compressed; 3) cyclically shift the current gene sequence fragment Key 0 to (n-1) times, respectively, forming n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) that share a common prefix, where n is the prefix length, and add each gene sequence fragment sequence to the compression result set Set_comp as its common prefix together with its own cyclic shift count and suffix; 4) determine whether the data set Set_orig to be compressed is empty; if it is not empty, take out the next current gene sequence fragment Key to be compressed and jump to step 2); otherwise, output the compression result set Set_comp. The invention can improve query efficiency for large data volumes and has the advantages of strong compression and small space usage.

Description

Key-value index data compression method for fast localization of gene sequence fragments
Technical field
The present invention relates to bioinformatic analysis of gene sequencing data, and in particular to a key-value index data compression method for fast localization of gene sequence fragments.
Background
Mapping sequenced reads to a reference is the foundation of current high-throughput gene sequencing data analysis. Sequence fragments are usually matched with methods such as BWA, which perform optimal string matching that tolerates some errors. Practical experiments show, however, that in most cases the majority of sequenced fragments can be broken into shorter gene sequence fragments (36 bp) and matched exactly and quickly by an exact Key-Value mapping method.
To match short gene sequences against the reference chain quickly and accurately, a Key-Value index database must first be built from the reference chain data, as follows. Suppose the reference chain data is ACGTGCA and a database for matching short sequences is to be built from key-value pairs (Key-Value pairs) whose keys are 4 characters long, as shown in Fig. 1. Referring to Fig. 1, starting at each character of the reference chain data in turn and taking 4 characters each time yields 4 Key-Value pairs, which serve as the data of the query database. If sequencing yields the short sequence "GTGC", the Key-Value mapping immediately gives the position of GTGC in the reference sequence, namely Offset 2.
This method has one important problem, however: the reference sequence chain used as the database is usually long and may in practice exceed 2*10^9 characters. If the Key-Value pairs are made from 36-character fragments, the index data consisting of the keys alone amounts to (2*10^9 - 36)*36 Bytes ≈ 67.05 GB. Index data of this size consumes a large amount of memory and sharply lowers the cache hit rate of the Key-Value system; when memory is insufficient, page swapping also causes large swings in system performance, so that exact matching, which should be very efficient, performs far worse in an engineering implementation.
The existing compressed prefix tree method captures the characters in the index data that are identical in both position and value and merges them in the index tree, which reduces the size of the data index. However, data compressed in this way must be queried through the tree structure, and the query efficiency is closely tied to the depth of the tree and the volume of data: as the data volume grows, the tree deepens and query efficiency drops markedly. In addition, the space occupied by the many pointers needed to build the compressed prefix tree largely cancels out the compression.
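The index construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_index` is hypothetical, and keeping the first offset when a key repeats is an assumption, since the text does not say how repeated keys are handled.

```python
def build_index(reference: str, k: int) -> dict[str, int]:
    """Map every k-character window of the reference chain to its 0-based offset."""
    index: dict[str, int] = {}
    for offset in range(len(reference) - k + 1):
        key = reference[offset:offset + k]
        # Assumption: if a key occurs more than once, keep the first offset.
        index.setdefault(key, offset)
    return index


index = build_index("ACGTGCA", 4)
print(index)          # the 4 Key-Value pairs of the example
print(index["GTGC"])  # offset of the sequenced short sequence
```

With one byte per character, storing every 36-character key of a 2*10^9-character reference this way is what produces the roughly 67 GB of key data mentioned in the text.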
Summary of the invention
The technical problem to be solved by the present invention: in view of the above problems of the prior art, to provide a key-value index data compression method for fast localization of gene sequence fragments that improves query efficiency for large data volumes, compresses strongly, and occupies little space.
To solve the above technical problem, the technical solution adopted by the present invention is:
A key-value index data compression method for fast localization of gene sequence fragments, the steps of which include:
1) Initialize the compression result set Set_comp and set the prefix length n used for data compression;
2) Take one current gene sequence fragment Key to be compressed from the gene sequence data set Set_orig to be compressed;
3) Cyclically shift the current gene sequence fragment Key 0 to (n-1) times, respectively, forming n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) that share a common prefix, where n is the prefix length, and add each gene sequence fragment sequence to the compression result set Set_comp as its common prefix together with its own cyclic shift count and suffix;
4) Determine whether the data set Set_orig to be compressed is empty; if it is not empty, take out the next current gene sequence fragment Key to be compressed and jump to step 2); otherwise, output the compression result set Set_comp.
Preferably, the detailed steps of step 3) include:
3.1) Cyclically shift the current gene sequence fragment Key 0 to (n-1) times, respectively, forming n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) that share a common prefix, where n is the prefix length;
3.2) Select one gene sequence fragment sequence Key_ri from Key_r0, Key_r1, …, Key_r(n-1) as the current gene sequence fragment sequence;
3.3) Split the current gene sequence fragment sequence Key_ri, according to the prefix length n, into a prefix Prefix_ri and a suffix Postfix_ri, where the lengths of Prefix_ri and Postfix_ri sum to the length of Key_ri;
3.4) Determine whether a mapping relation set corresponding to the prefix Prefix_ri already exists in the compression result set Set_comp; if it exists, jump to step 3.5); otherwise jump to step 3.6);
3.5) Determine whether the data <i, Postfix_ri> of the current gene sequence fragment sequence Key_ri already exists in the mapping relation set corresponding to the prefix Prefix_ri. If it does not exist, add <i, Postfix_ri> to the mapping relation set corresponding to Prefix_ri, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is its suffix, and jump to step 3.7); otherwise, ignore the subsequent gene sequence fragment sequences that share the common prefix with Key_ri and jump to step 4);
3.6) Create a new mapping relation Prefix_ri → {<i, Postfix_ri>} for the prefix Prefix_ri and add it to the compression result set Set_comp, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is its suffix, and jump to step 3.7);
3.7) Determine whether all of the gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) have been processed; if not, select the next current gene sequence fragment sequence Key_ri and jump to step 3.3); otherwise jump to step 4).
Preferably, the detailed steps of setting the prefix length n used for data compression in step 1) include:
1.1) Construct the compression ratio function f(n) of the prefix length n;
1.2) Find the prefix length n for which the value of the compression ratio function f(n) reaches its maximum.
Preferably, the compression ratio function constructed in step 1.1) is given by formula (1):
$$f(n) = \frac{TL \cdot SL \cdot b}{8 \cdot S(n)} \qquad (1)$$
In formula (1), f(n) is the compression ratio function, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in Set_orig, and S(n) is the estimated number of bytes occupied by the index data when the length of the prefix Prefix is n. S(n) is calculated as shown in formula (2):
$$S(n) = \frac{\left(\log_2(SL - n) + (SL - n) \cdot b\right) \cdot TL}{8} + \frac{n \cdot b \cdot TL}{8 \cdot (SL - n)} \qquad (2)$$
In formula (2), S(n) is the estimated number of bytes occupied by the index data when the length of the prefix Prefix is n, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in Set_orig, and n is the prefix length.
Preferably, the cyclic shift is a cyclic left shift.
Preferably, the prefix length n is 32.
The key-value index data compression method for fast localization of gene sequence fragments of the present invention has the following advantages. The present invention cyclically shifts the current gene sequence fragment Key 0 to (n-1) times, respectively, forming n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) that share a common prefix, where n is the prefix length, and adds each gene sequence fragment sequence to the compression result set Set_comp as its common prefix together with its own cyclic shift count and suffix. By splitting the gene sequence fragment Key into a prefix (Prefix) and a suffix (Postfix) and applying a certain number of cyclic shifts to the fragment, as many identical prefix sequences as possible are captured among adjacent short fragment sequences. The prefix sequences of these gene sequence fragments Key are merged, and the suffix sequence together with the encoded cyclic shift count uniquely represents a specific gene sequence fragment Key, which greatly reduces the storage needed to index these short sequences. Moreover, because there are only two levels of sequence, prefix and suffix, the present invention does not suffer from the defect of the traditional prefix compression tree, whose number of levels grows with the data scale. The invention can therefore improve query efficiency for large data volumes and has the advantages of strong compression and small space usage.
Brief description of the drawings
Fig. 1 is a schematic diagram of how the prior art builds a key-value pair database of gene sequence fragments.
Fig. 2 is a flow chart of the method of the embodiment of the present invention.
Fig. 3 is a schematic diagram of how the embodiment of the present invention builds a key-value pair database of gene sequence fragments.
Fig. 4 is a flow chart of step 3) of the method of the embodiment of the present invention.
Detailed description of the embodiments
As shown in Fig. 2, the steps of the key-value index data compression method for fast localization of gene sequence fragments of this embodiment include:
1) Initialize the compression result set Set_comp and set the prefix length n used for data compression;
2) Take one current gene sequence fragment Key to be compressed from the gene sequence data set Set_orig to be compressed;
3) Cyclically shift the current gene sequence fragment Key 0 to (n-1) times, respectively, forming n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) that share a common prefix, where n is the prefix length, and add each gene sequence fragment sequence to the compression result set Set_comp as its common prefix together with its own cyclic shift count and suffix;
4) Determine whether the data set Set_orig to be compressed is empty; if it is not empty, take out the next current gene sequence fragment Key to be compressed and jump to step 2); otherwise, output the compression result set Set_comp.
From the way the key-value pair database is built, it can be seen that the gene sequence fragment Key data are sequences of a fixed length intercepted starting at each character in turn, so adjacent short sequences (n characters) obtained by successive interceptions actually consist largely of identical repeated characters. In this embodiment, each of the n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) obtained by cyclic shifting is packaged as its common prefix together with its own cyclic shift count and suffix, and added to the compression result set Set_comp. Define the cyclic shift operator <<Rn as cyclically shifting the sequence elements n times. As shown in Fig. 3, take as an example the short-sequence strings obtained by 3 adjacent interceptions from the data set Set_orig to be compressed, involving T, G, C, A: cyclically shifting the gene sequence fragments Key (T, G, C, A) 0 to (n-1) times, respectively, forms the gene sequence fragment sequence T, G, C, A, the gene sequence fragment sequence T, G, C, G, and the gene sequence fragment sequence T, G, C, G. The gene sequence fragment sequences can therefore be written as TG<<R0 CA, TG<<R1 CG and TG<<R2 CG, respectively. The cyclic shift operator <<Rn carries the cyclic shift count; the TG in front of the operator <<Rn is the common prefix, and what follows the operator <<Rn is the suffix. Note that the 4-base gene sequence fragment Key is used here only as an illustrative example; gene sequence fragments with other numbers of bases can be used as needed, and the principle is the same as in this embodiment, so it is not repeated here.
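As a hedged illustration of the <<Rn notation, the following Python sketch reproduces the three encodings listed for Fig. 3. It assumes that the three adjacent interceptions are CGTG, GTGC and TGCA (taken from the hypothetical reference fragment CGTGCA), which is inferred from the shift counts and suffixes given above and is not stated explicitly in the text; the helper names are also hypothetical.

```python
def rotate_left(seq: str, times: int) -> str:
    """Cyclically shift seq to the left by `times` positions."""
    times %= len(seq)
    return seq[times:] + seq[:times]


def encode(key: str, shift: int, prefix_len: int) -> tuple[str, int, str]:
    """Rotate the key, then split it into (prefix, shift count, suffix)."""
    rotated = rotate_left(key, shift)
    return rotated[:prefix_len], shift, rotated[prefix_len:]


# Assumed input: three adjacent 4-base windows of "CGTGCA".
keys = ["CGTG", "GTGC", "TGCA"]
shifts = [2, 1, 0]  # the shift that brings the shared "TG" to the front
for key, shift in zip(keys, shifts):
    prefix, count, suffix = encode(key, shift, prefix_len=2)
    print(f"{key} -> {prefix}<<R{count} {suffix}")
```

Under these assumptions all three keys share the prefix TG, so the prefix needs to be stored only once, followed by the three (shift count, suffix) pairs.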
As can be seen from Fig. 3, when the cyclic shift count is 0, the suffix characters C and A were both located behind the common prefix TG before the shift; when the cyclic shift count is 1, the suffix character C was located behind the common prefix TG before the shift, and the suffix character G was located in front of it; when the cyclic shift count is 2, the suffix characters C and G were both located in front of the common prefix TG before the shift. Based on this principle of cyclic shifting, the original gene sequence fragment data can therefore be restored quickly from the compressed gene sequence fragment sequence. In this embodiment the cyclic shift is a cyclic left shift; the basic principle of a cyclic right shift is the same as that of a cyclic left shift, so its implementation details are not repeated here.
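The restoration described above amounts to undoing the left rotation. A minimal Python sketch, with hypothetical helper names and assuming the cyclic left shift used in this embodiment:

```python
def rotate_right(seq: str, times: int) -> str:
    """Cyclically shift seq to the right by `times` positions."""
    times %= len(seq)
    if times == 0:
        return seq
    return seq[-times:] + seq[:-times]


def restore_key(prefix: str, shift: int, suffix: str) -> str:
    """Rebuild the original fragment from prefix, shift count and suffix."""
    return rotate_right(prefix + suffix, shift)


print(restore_key("TG", 0, "CA"))  # both suffix characters stay behind TG
print(restore_key("TG", 1, "CG"))  # G moves back in front of TG
print(restore_key("TG", 2, "CG"))  # C and G both move back in front of TG
```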
In this embodiment, the detailed steps of setting the prefix length n used for data compression in step 1) include:
1.1) Construct the compression ratio function f(n) of the prefix length n;
1.2) Find the prefix length n for which the value of the compression ratio function f(n) reaches its maximum.
The compression ratio function constructed in step 1.1) is given by formula (1):
$$f(n) = \frac{TL \cdot SL \cdot b}{8 \cdot S(n)} \qquad (1)$$
In formula (1), f(n) is the compression ratio function, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in Set_orig, and S(n) is the estimated number of bytes occupied by the index data when the length of the prefix Prefix is n. S(n) is calculated as shown in formula (2):
$$S(n) = \frac{\left(\log_2(SL - n) + (SL - n) \cdot b\right) \cdot TL}{8} + \frac{n \cdot b \cdot TL}{8 \cdot (SL - n)} \qquad (2)$$
In formula (2), S(n) is the estimated number of bytes occupied by the index data when the length of the prefix Prefix is n, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in Set_orig, and n is the prefix length.
As shown in Fig. 4, the detailed steps of step 3) include:
3.1) Cyclically shift the current gene sequence fragment Key 0 to (n-1) times, respectively, forming n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) that share a common prefix, where n is the prefix length;
3.2) Select one gene sequence fragment sequence Key_ri from Key_r0, Key_r1, …, Key_r(n-1) as the current gene sequence fragment sequence;
3.3) Split the current gene sequence fragment sequence Key_ri, according to the prefix length n, into a prefix Prefix_ri and a suffix Postfix_ri, where the lengths of Prefix_ri and Postfix_ri sum to the length of Key_ri;
3.4) Determine whether a mapping relation set corresponding to the prefix Prefix_ri already exists in the compression result set Set_comp; if it exists, jump to step 3.5); otherwise jump to step 3.6);
3.5) Determine whether the data <i, Postfix_ri> of the current gene sequence fragment sequence Key_ri already exists in the mapping relation set corresponding to the prefix Prefix_ri. If it does not exist, add <i, Postfix_ri> to the mapping relation set corresponding to Prefix_ri, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is its suffix, and jump to step 3.7); otherwise, ignore the subsequent gene sequence fragment sequences that share the common prefix with Key_ri and jump to step 4);
3.6) Create a new mapping relation Prefix_ri → {<i, Postfix_ri>} for the prefix Prefix_ri and add it to the compression result set Set_comp, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is its suffix, and jump to step 3.7);
3.7) Determine whether all of the gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) have been processed; if not, select the next current gene sequence fragment sequence Key_ri and jump to step 3.3); otherwise jump to step 4).
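Steps 1) to 4) and 3.1) to 3.7) can be sketched in Python as follows. This is a literal reading of the steps as worded, not a definitive implementation: Set_comp is modelled as a dict from prefix to a set of (shift count, suffix) pairs, the function names are hypothetical, and the number of shifts is exposed as a separate hypothetical parameter `shifts` (the step text ties it to n) so that the 4-base example with a 2-base prefix can be reproduced. The input keys are the assumed adjacent windows CGTG, GTGC and TGCA.

```python
def rotate_left(seq: str, times: int) -> str:
    times %= len(seq)
    return seq[times:] + seq[:times]


def add_fragment(set_comp: dict, key: str, n: int, shifts: int) -> None:
    """Steps 3.1) to 3.7) for one gene sequence fragment Key."""
    for i in range(shifts):                          # 3.1), 3.2), 3.7)
        rotated = rotate_left(key, i)
        prefix, postfix = rotated[:n], rotated[n:]   # 3.3)
        entries = set_comp.get(prefix)               # 3.4)
        if entries is None:
            set_comp[prefix] = {(i, postfix)}        # 3.6)
        elif (i, postfix) not in entries:
            entries.add((i, postfix))                # 3.5), not yet present
        else:
            break                                    # 3.5), already present


def compress(set_orig: list[str], n: int, shifts: int) -> dict:
    """Steps 1) to 4): process every fragment and return Set_comp."""
    set_comp: dict[str, set[tuple[int, str]]] = {}
    for key in set_orig:
        add_fragment(set_comp, key, n, shifts)
    return set_comp


set_comp = compress(["CGTG", "GTGC", "TGCA"], n=2, shifts=3)
print(sorted(set_comp["TG"]))
```

In this literal reading, the entry for the prefix TG collects exactly the three pairs shown for Fig. 3; the other prefixes in the result come from rotations that do not line up with a neighbouring fragment.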
In this embodiment, the data structure of the compression result set Set_comp is as follows:
{
prefix_1 → {<rotate_1, postfix_1>, <rotate_2, postfix_2>, ...},
prefix_2 → {<rotate_3, postfix_3>, ...},
…}
In the above data structure, prefix_1 is the common prefix of the first gene sequence fragment, and prefix_1 → {<rotate_1, postfix_1>, <rotate_2, postfix_2>, ...} is the mapping relation set corresponding to the prefix prefix_1; rotate_1 is the cyclic shift count of the first gene sequence fragment sequence with prefix prefix_1, and postfix_1 is its suffix; rotate_2 is the cyclic shift count of the second gene sequence fragment sequence with prefix prefix_1, and postfix_2 is its suffix. prefix_2 is the common prefix of the second gene sequence fragment, and prefix_2 → {<rotate_3, postfix_3>, ...} is the mapping relation set corresponding to the prefix prefix_2; rotate_3 is the cyclic shift count of the first gene sequence fragment sequence with prefix prefix_2, and postfix_3 is its suffix.
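To illustrate how a two-level structure of this shape could be queried, the following Python sketch checks whether a short sequence is represented in Set_comp. The query procedure is an assumption made for illustration and is not described in this section; `contains` and `shifts` are hypothetical names, and the sample data are the three Fig. 3 entries.

```python
def contains(set_comp: dict, key: str, n: int, shifts: int) -> bool:
    """Try each cyclic left shift of the query; one prefix lookup plus one
    membership test per shift, with no tree traversal."""
    for i in range(shifts):
        rotated = key[i:] + key[:i]
        prefix, postfix = rotated[:n], rotated[n:]
        if (i, postfix) in set_comp.get(prefix, ()):
            return True
    return False


# Set_comp holding only the Fig. 3 example: TG -> {<0,CA>, <1,CG>, <2,CG>}
set_comp = {"TG": {(0, "CA"), (1, "CG"), (2, "CG")}}
print(contains(set_comp, "GTGC", n=2, shifts=3))
print(contains(set_comp, "AAAA", n=2, shifts=3))
```

The lookup cost in this sketch depends on the number of shifts tried rather than on the depth of a tree, which is the property the text contrasts with the compressed prefix tree.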
In this embodiment, the length of the data set Set_orig to be compressed is TL = 2*10^9, the length of a gene sequence fragment Key is SL = 36, and each element of a gene sequence fragment Key occupies b = 2 bits of storage (because the reference sequence consists only of A, C, G and T). Choosing the prefix length n = 32 for data compression, the estimated size of the index data when the length of the prefix Prefix is 32 is S(32) = 6500000000 Bytes, i.e. 6.05 GB. Compared with the uncompressed data size (TL*SL*b/8 = 2*10^9 × 36 × 2/8 Bytes = 16.76 GB), the key-value index data compression method for fast localization of gene sequence fragments of this embodiment achieves a compression ratio of nearly 2.8. This embodiment can therefore improve query efficiency for large data volumes and has the advantages of strong compression and small space usage.
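The figures above can be checked by evaluating S(n) = ((log2(SL-n) + (SL-n)*b)*TL)/8 + (n*b*TL)/(8*(SL-n)) and f(n) = (TL*SL*b)/(8*S(n)) directly. The Python sketch below uses hypothetical function names and assumes that "GB" in the text means 2^30 bytes, which is what reproduces 6.05 and 16.76.

```python
import math


def index_bytes(n: int, tl: int, sl: int, b: int) -> float:
    """S(n): estimated bytes of index data for prefix length n."""
    per_key_bits = math.log2(sl - n) + (sl - n) * b   # shift count + suffix
    shared_prefix_bytes = n * b * tl / (8 * (sl - n))
    return per_key_bits * tl / 8 + shared_prefix_bytes


def compression_ratio(n: int, tl: int, sl: int, b: int) -> float:
    """f(n): uncompressed key bytes divided by S(n)."""
    return tl * sl * b / (8 * index_bytes(n, tl, sl, b))


TL, SL, B = 2 * 10**9, 36, 2
s32 = index_bytes(32, TL, SL, B)
print(s32, s32 / 2**30)                  # bytes and GB (2**30 bytes)
print(TL * SL * B / 8 / 2**30)           # uncompressed size in GB
print(compression_ratio(32, TL, SL, B))  # compression ratio at n = 32
```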
The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions that fall under the idea of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (6)

1. A key-value index data compression method for fast localization of gene sequence fragments, characterized in that the steps include:
1) Initialize the compression result set Set_comp and set the prefix length n used for data compression;
2) Take one current gene sequence fragment Key to be compressed from the gene sequence data set Set_orig to be compressed;
3) Cyclically shift the current gene sequence fragment Key 0 to (n-1) times, respectively, forming n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) that share a common prefix, where n is the prefix length, and add each gene sequence fragment sequence to the compression result set Set_comp as its common prefix together with its own cyclic shift count and suffix;
4) Determine whether the data set Set_orig to be compressed is empty; if it is not empty, take out the next current gene sequence fragment Key to be compressed and jump to step 2); otherwise, output the compression result set Set_comp.
2. The key-value index data compression method for fast localization of gene sequence fragments according to claim 1, characterized in that the detailed steps of step 3) include:
3.1) Cyclically shift the current gene sequence fragment Key 0 to (n-1) times, respectively, forming n gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) that share a common prefix, where n is the prefix length;
3.2) Select one gene sequence fragment sequence Key_ri from Key_r0, Key_r1, …, Key_r(n-1) as the current gene sequence fragment sequence;
3.3) Split the current gene sequence fragment sequence Key_ri, according to the prefix length n, into a prefix Prefix_ri and a suffix Postfix_ri, where the lengths of Prefix_ri and Postfix_ri sum to the length of Key_ri;
3.4) Determine whether a mapping relation set corresponding to the prefix Prefix_ri already exists in the compression result set Set_comp; if it exists, jump to step 3.5); otherwise jump to step 3.6);
3.5) Determine whether the data <i, Postfix_ri> of the current gene sequence fragment sequence Key_ri already exists in the mapping relation set corresponding to the prefix Prefix_ri. If it does not exist, add <i, Postfix_ri> to the mapping relation set corresponding to Prefix_ri, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is its suffix, and jump to step 3.7); otherwise, ignore the subsequent gene sequence fragment sequences that share the common prefix with Key_ri and jump to step 4);
3.6) Create a new mapping relation Prefix_ri → {<i, Postfix_ri>} for the prefix Prefix_ri and add it to the compression result set Set_comp, where i is the number of cyclic shifts of the current gene sequence fragment sequence Key_ri and Postfix_ri is its suffix, and jump to step 3.7);
3.7) Determine whether all of the gene sequence fragment sequences Key_r0, Key_r1, …, Key_r(n-1) have been processed; if not, select the next current gene sequence fragment sequence Key_ri and jump to step 3.3); otherwise jump to step 4).
3. The key-value index data compression method for fast localization of gene sequence fragments according to claim 2, characterized in that the detailed steps of setting the prefix length n used for data compression in step 1) include:
1.1) Construct the compression ratio function f(n) of the prefix length n;
1.2) Find the prefix length n for which the value of the compression ratio function f(n) reaches its maximum.
4. The key-value index data compression method for fast localization of gene sequence fragments according to claim 3, characterized in that the compression ratio function constructed in step 1.1) is given by formula (1):
$$f(n) = \frac{TL \cdot SL \cdot b}{8 \cdot S(n)} \qquad (1)$$
In formula (1), f(n) is the compression ratio function, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in Set_orig, and S(n) is the estimated number of bytes occupied by the index data when the length of the prefix Prefix is n. S(n) is calculated as shown in formula (2):
$$S(n) = \frac{\left(\log_2(SL - n) + (SL - n) \cdot b\right) \cdot TL}{8} + \frac{n \cdot b \cdot TL}{8 \cdot (SL - n)} \qquad (2)$$
In formula (2), S(n) is the estimated number of bytes occupied by the index data when the length of the prefix Prefix is n, TL is the length of the data set Set_orig to be compressed, SL is the length of a gene sequence fragment Key to be compressed in Set_orig, b is the number of bits of storage occupied by each element of a gene sequence fragment Key to be compressed in Set_orig, and n is the prefix length.
5. The key-value index data compression method for fast localization of gene sequence fragments according to any one of claims 1 to 4, characterized in that the cyclic shift is a cyclic left shift.
6. The key-value index data compression method for fast localization of gene sequence fragments according to claim 5, characterized in that the prefix length n is 32.
CN201510648867.2A 2015-10-09 2015-10-09 Key-value index data compression method for fast localization of gene sequence fragments Active CN105224828B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510648867.2A CN105224828B (en) 2015-10-09 2015-10-09 Key-value index data compression method for fast localization of gene sequence fragments

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510648867.2A CN105224828B (en) 2015-10-09 2015-10-09 Key-value index data compression method for fast localization of gene sequence fragments

Publications (2)

Publication Number Publication Date
CN105224828A CN105224828A (en) 2016-01-06
CN105224828B true CN105224828B (en) 2017-10-27

Family

ID=54993793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510648867.2A Active CN105224828B (en) 2015-10-09 2015-10-09 Key-value index data compression method for fast localization of gene sequence fragments

Country Status (1)

Country Link
CN (1) CN105224828B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930104B (en) * 2016-05-17 2019-01-18 百度在线网络技术(北京)有限公司 Date storage method and device
CN106484865A (en) * 2016-10-10 2017-03-08 哈尔滨工程大学 One kind is based on four word chained list dictionary tree searching algorithm of DNA k mer index problem
CN106897582B (en) * 2017-01-25 2018-03-09 人和未来生物科技(长沙)有限公司 A kind of heterogeneous platform understood towards gene data
CN110428868B (en) * 2018-04-27 2021-11-26 人和未来生物科技(长沙)有限公司 Method and system for compressing, preprocessing and decompressing and reducing gene sequencing mass data
CN110060731B (en) * 2019-04-12 2022-10-21 福建师范大学 Method for determining number of overlapped gene pairs among genes based on distributed calculation
CN110782946A (en) * 2019-10-17 2020-02-11 南京医基云医疗数据研究院有限公司 Method and device for identifying repeated sequence, storage medium and electronic equipment
CN112765113B (en) * 2021-01-31 2024-04-09 云知声智能科技股份有限公司 Index compression method, index compression device, computer readable storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101036141A (en) * 2004-03-26 2007-09-12 甲骨文国际有限公司 A database management system with persistent, user-accessible bitmap values
CN101499094A (en) * 2009-03-10 2009-08-05 焦点科技股份有限公司 Data compression storing and retrieving method and system
CN102831224A (en) * 2012-08-24 2012-12-19 北京百度网讯科技有限公司 Method for creating a data index library, and method and device for generating search suggestions
CN103870492A (en) * 2012-12-14 2014-06-18 腾讯科技(深圳)有限公司 Data storing method and device based on key sorting

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9715525B2 (en) * 2013-06-28 2017-07-25 Khalifa University Of Science, Technology And Research Method and system for searching and storing data

Also Published As

Publication number Publication date
CN105224828A (en) 2016-01-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant