CN104284189A - Improved BWT data compression method and hardware implementing system thereof - Google Patents

Info

Publication number: CN104284189A
Authority: CN (China)
Legal status (as listed; an assumption, not a legal conclusion): Granted, active
Application number: CN201410571262.3A
Other languages: Chinese (zh)
Other versions: CN104284189B (en)
Inventors: 李冰, 陈帅, 董乾, 刘勇, 赵霞, 王刚
Current assignee (as listed; may be inaccurate): Southeast University
Original assignee: Southeast University
Application filed by: Southeast University
Priority: CN201410571262.3A
Prior art keywords: lyndon, word, module, character string, length

Abstract

The invention provides a hardware implementation system for an improved BWT data compression method (the BWTS method). The system comprises an input caching module, a Lyndon word searching module, a Lyndon word caching module, a Lyndon word length caching module, a transposition module, a transposition caching module, a ranking module and an output caching module. The input caching module temporarily stores the character strings to be processed and synchronizes data input with data processing. The Lyndon word searching module searches for the Lyndon words of a data block. The Lyndon word caching module caches the Lyndon words. The transposition module transposes all the Lyndon words. The transposition caching module caches the transposition results. The ranking module sorts all the character strings produced by transposition in lexicographic order and takes the last column as the output of the BWTS method. The output caching module caches the output character string for use by subsequent modules. The improved BWT data compression method and its hardware implementation system remove the restriction of the existing BWT method, in which a character string can only be restored with the help of a constant generated by the forward transform, and thereby improve the operating efficiency of the data compression method.

Description

Improved BWT data compression method and hardware implementation system thereof
Technical Field
The present invention relates to the technical field of data compression, and in particular to an improved BWT data compression method and a hardware implementation system thereof.
Background
Data compression has long been a research focus in information science and is widely used in data storage and transmission. Although the capacity of storage devices keeps growing and network transmission speeds keep rising, the diversity and explosive growth of data make efficient compression methods an important means of reducing storage and transmission costs. Data compression is divided into lossless compression and lossy compression. Lossy compression tolerates a certain degree of information loss and is widely used in fields such as multimedia interactive systems, video transmission services and home entertainment. Lossless compression is a reversible coding based on the principle of information entropy: it removes redundant information from the source without affecting its entropy, so that the compressed information can be restored. It is widely used in remote sensing image processing, medical image processing, the preservation and analysis of historical archives, and many hybrid image compression methods. Removing as much redundant information as possible is the goal of lossless compression. The main indicators currently used to evaluate a compression method include compression ratio and compression speed. The Burrows-Wheeler transform (BWT) is a transform that Mike Burrows refined from an idea proposed by David Wheeler and successfully applied to practical data compression; it is a current research focus in lossless compression. The BWT is a reversible data transform that operates on data blocks. Its core idea is to sort the character matrix obtained by cyclically rotating the character string and to derive the transform output from the sorted matrix. The transform itself does not reduce the amount of data, but the transformed data are easier to compress, so the BWT serves as a preprocessing step before compression.
Fig. 1 shows a prior-art Bzip2 data compression system, an efficient open-source system based on the BWT method. As shown in Fig. 1, after the BWT method is applied, the character string S contains runs of identical characters. It is then processed by the MTF (move-to-front) method, which yields runs of 0 and a series of small integers and further reduces the overall entropy. Finally, Huffman coding compresses the data using a binary tree with minimum weighted path length, giving a higher compression ratio. In addition, because of the similarity between the BWT and the suffix array, the BWT is used for string matching in the FM-index method. The BWT method gives the data in a character block stronger clustering, that is, identical characters gather together, and this property gives the subsequent compression method a better compression ratio. The method removes the limitation that compression methods had to process data as a stream and makes block-wise processing possible, which was a revolutionary advance in lossless compression. The BWT method is also applied in bioinformatics, for whole-genome alignment, genome annotation and measuring the distance between two genome sequences, and it is often used for channel coding in communication systems.
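The effect of the MTF step on clustered data can be illustrated with a minimal Python sketch. The function name `move_to_front` and the choice of initial symbol table are assumptions made for illustration only; they are not taken from Bzip2 or from this patent.

```python
def move_to_front(data, alphabet):
    """Encode `data` as a list of move-to-front indices.

    `alphabet` is the assumed initial symbol table (illustrative choice).
    """
    table = list(alphabet)
    out = []
    for symbol in data:
        idx = table.index(symbol)        # current position of the symbol
        out.append(idx)
        table.insert(0, table.pop(idx))  # move the symbol to the front
    return out


# Runs of identical characters turn into runs of 0.
print(move_to_front("aaabbbb", "ab"))  # [0, 0, 0, 1, 0, 0, 0]
```

Because a repeated character is always found at the front of the table, each run in the input contributes one small index followed by zeros, which is the property the entropy coder benefits from.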
Fig. 2 is a schematic diagram of prior-art data compression based on the BWT method. Fig. 2 illustrates the basic principle of the BWT method, which processes data block by block. Suppose the input character string (block) is S = 'ABRACA' with length n. The cyclic shifts of S form an n×n matrix M, and sorting the rows of M in lexicographic order gives the matrix Q. The last column of Q is the output sequence L = 'CARAAB', and the position (row number, counting from 0) of the source string in Q is the output constant index = 1. In many applications, however, the BWT method is used as a preprocessing step: a character string of length n becomes, after the BWT method, a character string of length n plus a constant. This constant causes trouble for much of the subsequent processing. For example, when the BWT is used for channel coding and the constant is lost because of noise, the character string cannot be recovered. When it is used for lossless compression, a character string of length n becomes a character string of length n+1 after processing, which changes the entropy of the character string. So far, no research at home or abroad on a BWT method without this suffix constant has been found.
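The conventional transform described above can be reproduced with a short Python sketch. It builds the full rotation matrix explicitly, which is only practical for small blocks; the function name `bwt` is an illustrative assumption.

```python
def bwt(s):
    """Return (last column of Q, row index of s in Q) for a non-empty string s."""
    n = len(s)
    m = [s[i:] + s[:i] for i in range(n)]  # rows of M: all cyclic shifts of s
    q = sorted(m)                          # rows of Q: M sorted lexicographically
    last_column = "".join(row[-1] for row in q)
    index = q.index(s)                     # 0-based row of the source string
    return last_column, index


print(bwt("ABRACA"))  # ('CARAAB', 1)
```

The second element of the result is the constant that has to accompany the transformed string so that the inverse transform knows which row to recover.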
In view of this, and given the problems of the current BWT method, an improved BWT data compression method is needed that removes the requirement of the existing BWT method that a character string can only be recovered with the constant generated by the forward transform, so as to improve the operating efficiency of the data compression method.
Summary of the Invention
To overcome the above shortcomings of the prior art, the present invention aims to provide an improved BWT data compression method that removes the requirement of the existing BWT method that a character string can only be recovered with the constant generated by the forward transform, so as to improve the operating efficiency of the data compression method.
To achieve the above object, the invention provides a BWTS method and a hardware implementation system thereof, the system comprising: an input buffer module, for temporarily storing the character string to be processed, synchronizing data input with data processing, and outputting the character string to the Lyndon word search module; a Lyndon word search module, for finding the longest Lyndon words in the character string received from the input buffer module, outputting each longest Lyndon word found to the Lyndon word cache module, and outputting the length of each longest Lyndon word to the Lyndon word length cache module; a Lyndon word cache module, for temporarily storing the Lyndon words output by the Lyndon word search module for use by the transposition module; a Lyndon word length cache module, for temporarily storing the lengths and the number of all Lyndon words found by the Lyndon word search module for use by the sorting module; a transposition module, for transposing all Lyndon words found by the Lyndon word search module and storing the results in the transposition cache module; a transposition cache module, for temporarily storing the transposition results output by the transposition module for use by the sorting module; a sorting module, for sorting all character strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module; and an output buffer module, for temporarily storing the output character string for use by subsequent modules.
The Lyndon word search module further comprises: a string fetch submodule, for fetching characters from the input buffer module and recording the length of the string fetched so far, wherein characters are read one at a time starting from the first character of the string, each time the string grows by one character it is passed to the subsequent submodules for the Lyndon word test, and once a Lyndon word has been determined its length is written to the Lyndon word length cache module, the length counter is reset to zero, and the next fetch starts from the last character fetched in the current round; a shift submodule, for passing the string from the string fetch submodule to the Lyndon word judging submodule, shifting the string successively, and writing all shifted strings to the N×N register; an N×N register, for storing all shifted versions of the string under test received from the shift submodule, for processing by the Lyndon word judging submodule; and a Lyndon word judging submodule, for taking the shifted strings out of the N×N register one by one and comparing each with the original string, wherein the number of comparisons equals the length of the string under test, and if the comparisons show that the original string is the smallest in lexicographic order, the string is a Lyndon word and is output to the Lyndon word cache module.
The transposition module further comprises: a longest Lyndon word length discrimination submodule, for determining, from the contents of the Lyndon word length cache module, the length of the longest Lyndon word of the processed character string and sending this value to the string expansion submodule; a string expansion submodule, for expanding all Lyndon words in the Lyndon word cache module to the length of the longest Lyndon word, for use by the cyclic shift submodule; and a cyclic shift submodule, for cyclically shifting each Lyndon word received from the string expansion submodule and storing the results in the transposition cache module.
The sorting module comprises: a sorting submodule, for sorting the character strings in the transposition cache module in lexicographic order for the BWTS result acquisition module; and a BWTS result acquisition module, for reading the last column of the sorted result from the sorting submodule as the output of the BWTS method and storing it in the output buffer module.
To achieve the above object, the invention also provides an improved BWT data compression method, comprising: inputting a character string, temporarily storing the character string to be processed in the input buffer module, synchronizing data input with data processing, and outputting the character string to the Lyndon word search module; finding, by the Lyndon word search module, the longest Lyndon words in the character string received from the input buffer module, outputting each longest Lyndon word found to the Lyndon word cache module, and outputting the length of each longest Lyndon word to the Lyndon word length cache module; temporarily storing, in the Lyndon word cache module, the Lyndon words output by the Lyndon word search module for use by the transposition module; temporarily storing, in the Lyndon word length cache module, the lengths and the number of all Lyndon words found by the Lyndon word search module for use by the sorting module; transposing, by the transposition module, all Lyndon words found by the Lyndon word search module and storing the results in the transposition cache module; temporarily storing, in the transposition cache module, the transposition results output by the transposition module for use by the sorting module; sorting, by the sorting module, all character strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module; and temporarily storing the output character string in the output buffer module for use by subsequent modules.
Searching for the Lyndon words of the data block further comprises: fetching characters from the input buffer module and recording the length of the string fetched so far, wherein characters are read one at a time starting from the first character of the string, each time the string grows by one character it is passed to the subsequent submodules for the Lyndon word test, and once a Lyndon word has been determined its length is written to the Lyndon word length cache module, the length counter is reset to zero, and the next fetch starts from the last character fetched in the current round; passing the string from the string fetch submodule to the Lyndon word judging submodule, shifting the string successively, and writing all shifted strings to the N×N register; storing, in the N×N register, all shifted versions of the string under test received from the shift submodule, for processing by the Lyndon word judging submodule; and taking the shifted strings out of the N×N register one by one and comparing each with the original string, wherein the number of comparisons equals the length of the string under test, and if the comparisons show that the original string is the smallest in lexicographic order, the string is a Lyndon word and is output to the Lyndon word cache module.
Transposing all the Lyndon words further comprises: determining, from the contents of the Lyndon word length cache module, the length of the longest Lyndon word of the processed character string and sending this value to the string expansion submodule; expanding all Lyndon words in the Lyndon word cache module to the length of the longest Lyndon word, for use by the cyclic shift submodule; and cyclically shifting each Lyndon word received from the string expansion submodule and storing the results in the transposition cache module.
Sorting all the transposed character strings in lexicographic order further comprises: sorting the character strings in the transposition cache module in lexicographic order for the BWTS result acquisition module; and reading the last column of the sorted result from the sorting submodule as the output of the BWTS method and storing it in the output buffer module.
The improved BWT data compression method and hardware implementation system disclosed by the invention remove the requirement of the existing BWT method that a character string can only be recovered with the constant generated by the forward transform, and thereby improve the operating efficiency of the data compression method.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned by practice of the invention.
Brief Description of the Drawings
Fig. 1 shows a prior-art Bzip2 data compression system, an efficient open-source system based on the BWT method;
Fig. 2 is a schematic diagram of prior-art data compression based on the BWT method;
Fig. 3 is a schematic diagram of Lyndon word division;
Fig. 4 shows the hardware implementation system of the improved BWT data compression method provided by the invention;
Fig. 5 is a structural diagram of the Lyndon word search module of the hardware implementation system provided by the invention;
Fig. 6 is a structural diagram of the transposition module of the hardware implementation system provided by the invention;
Fig. 7 is a structural diagram of the sorting module of the hardware implementation system provided by the invention.
Detailed Description
Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in this specification refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" used herein includes any and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries are to be understood as having a meaning consistent with their meaning in the context of the prior art and, unless specifically defined herein, are not to be interpreted in an idealized or overly formal sense.
The present invention proposes an improved BWT data compression method (called the "BWTS method"), which comprises two parts: Lyndon word division and transposition. The method is as follows:
1. Longest Lyndon word division
The Lyndon word was introduced in 1954 by the mathematician Roger Lyndon, who called it a standard lexicographic sequence. A Lyndon word is a string that is lexicographically smaller than all of its cyclic shifts. (Lexicographic order is a way of ordering sequences: they are arranged in ascending order, alphabetically for letters or from small to large for numbers.)
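The definition above translates directly into a brute-force test, sketched here in Python. The function name `is_lyndon` is an illustrative assumption, and the comparison is taken to be strict, so a string that equals one of its own proper rotations (such as 'aa') is not treated as a Lyndon word.

```python
def is_lyndon(w):
    """True if the non-empty string w is strictly smaller than every proper rotation."""
    return len(w) > 0 and all(w < w[i:] + w[:i] for i in range(1, len(w)))


for candidate in ["a", "an", "na", "ana", "aa"]:
    print(candidate, is_lyndon(candidate))
```

This mirrors the comparison against every cyclic shift; it performs one string comparison per shift, which is fine for short words but not tuned for long inputs.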
Fig. 3 is a schematic diagram of Lyndon word division. As shown in Fig. 3, the longest Lyndon word division of the present invention extends backward from the first character of the string to find the longest Lyndon word, then finds the next longest Lyndon word starting from the character that follows it, and so on until the end of the string. Take the string S = 'banana' as an example. First the character 'b' is read in; a single character is obviously a Lyndon word, so the search continues for the longest Lyndon word starting with this character. The character 'a' is read in next, and 'ba' is clearly not a Lyndon word. Once a string that is not a Lyndon word is detected, the strings starting with 'b' need not be checked any further, so 'b' is the longest Lyndon word starting with 'b'. The next step searches for the longest Lyndon word starting from the character 'a' that follows 'b'. 'a' is obviously a Lyndon word; 'n' is read in and 'an' is also a Lyndon word; 'a' is read in and 'ana' is not a Lyndon word, so 'an' is the longest Lyndon word starting with this 'a'. The next search starts from the character 'a' that follows the longest Lyndon word 'an'; it proceeds exactly as before, and 'an' is again the longest Lyndon word starting from this 'a'. The next step starts from the character 'a' that follows this longest Lyndon word; since the end of the string has been reached, 'a' itself is the longest Lyndon word here. For the input string S = 'banana', the Lyndon word output is therefore 'b', 'an', 'an', 'a'.
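The division procedure walked through above can be sketched in Python as follows. The function names are illustrative assumptions. The sketch follows the rule as the text states it: extend one character at a time and stop at the first extension that is not a Lyndon word. This greedy rule is not Duval's algorithm for the standard Lyndon factorization, and the two can differ on some inputs (for example 'aab').

```python
def is_lyndon(w):
    """True if w is strictly smaller than every proper rotation of itself."""
    return all(w < w[i:] + w[:i] for i in range(1, len(w)))


def divide_longest_lyndon(s):
    """Split s into words using the extend-until-failure rule described in the text."""
    words = []
    start = 0
    while start < len(s):
        end = start + 1  # a single character is always a Lyndon word
        # Keep extending while the grown string is still a Lyndon word.
        while end < len(s) and is_lyndon(s[start:end + 1]):
            end += 1
        words.append(s[start:end])
        start = end      # the next search starts right after this word
    return words


print(divide_longest_lyndon("banana"))  # ['b', 'an', 'an', 'a']
```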
2. Shift transposition and sorting
Shift transposition brings each longest Lyndon word produced by the division to the same length. Take S = 'banana' as an example: its longest Lyndon word division gives 'b', 'an', 'an', 'a'. The longest of these is 'an', with length 2, so every word shorter than 2 is expanded to length 2 by cyclic repetition: 'b' is expanded to 'bb' and 'a' to 'aa'. The longest word 'an' is cyclically shifted to form a 2×2 matrix, namely 'an', 'na'. The output of shift transposition is therefore 'bb', 'an', 'na', 'an', 'na', 'aa'.
Sorting arranges the strings generated by shift transposition in lexicographic order. For the example above, the sorted result is 'aa', 'an', 'an', 'bb', 'na', 'na'.
The last column of this sorted result is the output of the suffix-free BWT method. For this example the output is L = 'annbaa', with no suffix constant.
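The transposition, sorting and last-column steps can be sketched in Python, starting from an already divided list of words. The function names are illustrative assumptions. The expansion rule used here (repeat the word cyclically and cut it at the maximum length) is also an assumption: the examples in the text only cover words whose length divides the maximum length, so behaviour outside that case is not confirmed by the source.

```python
def shift_transpose(words):
    """Expand each word to the longest word's length, then emit len(word) cyclic shifts."""
    width = max(len(w) for w in words)
    rows = []
    for w in words:
        repeats = -(-width // len(w))         # ceiling division
        expanded = (w * repeats)[:width]      # assumed expansion rule
        for i in range(len(w)):               # one row per character of the word
            rows.append(expanded[i:] + expanded[:i])
    return rows


def bwts_from_words(words):
    """Sort the transposed rows and read off the last column."""
    rows = sorted(shift_transpose(words))
    return "".join(row[-1] for row in rows)


words = ["b", "an", "an", "a"]
print(shift_transpose(words))  # ['bb', 'an', 'na', 'an', 'na', 'aa']
print(bwts_from_words(words))  # annbaa
```

Each word of length k contributes k rows, so the output has the same length as the input string and no extra index has to be stored.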
In essence, the forward transform uses Lyndon word division to hide the starting position needed by the inverse transform inside the output sequence, so no suffix constant is required to indicate it.
Fig. 4 shows the BWTS method and hardware implementation system provided by the invention, the system comprising: an input buffer module, for temporarily storing the character string to be processed, synchronizing data input with data processing, and outputting the character string to the Lyndon word search module; a Lyndon word search module, for finding the longest Lyndon words in the character string received from the input buffer module, outputting each longest Lyndon word found to the Lyndon word cache module, and outputting the length of each longest Lyndon word to the Lyndon word length cache module; a Lyndon word cache module, for temporarily storing the Lyndon words output by the Lyndon word search module for use by the transposition module; a Lyndon word length cache module, for temporarily storing the lengths and the number of all Lyndon words found by the Lyndon word search module for use by the sorting module; a transposition module, for transposing all Lyndon words found by the Lyndon word search module and storing the results in the transposition cache module; a transposition cache module, for temporarily storing the transposition results output by the transposition module for use by the sorting module; a sorting module, for sorting all character strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module; and an output buffer module, for temporarily storing the output character string for use by subsequent modules.
Fig. 5 is a structural diagram of the Lyndon word search module of the hardware implementation system provided by the invention. As shown in Fig. 5, the Lyndon word search module further comprises: a string fetch submodule, for fetching characters from the input buffer module and recording the length of the string fetched so far, wherein characters are read one at a time starting from the first character of the string, each time the string grows by one character it is passed to the subsequent submodules for the Lyndon word test, and once a Lyndon word has been determined its length is written to the Lyndon word length cache module, the length counter is reset to zero, and the next fetch starts from the last character fetched in the current round; a shift submodule, for passing the string from the string fetch submodule to the Lyndon word judging submodule, shifting the string successively, and writing all shifted strings to the N×N register; an N×N register, for storing all shifted versions of the string under test received from the shift submodule, for processing by the Lyndon word judging submodule; and a Lyndon word judging submodule, for taking the shifted strings out of the N×N register one by one and comparing each with the original string, wherein the number of comparisons equals the length of the string under test, and if the comparisons show that the original string is the smallest in lexicographic order, the string is a Lyndon word and is output to the Lyndon word cache module.
Fig. 6 is a structural diagram of the transposition module of the hardware implementation system provided by the invention. As shown in Fig. 6, the transposition module further comprises: a longest Lyndon word length discrimination submodule, for determining, from the contents of the Lyndon word length cache module, the length of the longest Lyndon word of the processed character string and sending this value to the string expansion submodule; a string expansion submodule, for expanding all Lyndon words in the Lyndon word cache module to the length of the longest Lyndon word, for use by the cyclic shift submodule; and a cyclic shift submodule, for cyclically shifting each Lyndon word received from the string expansion submodule and storing the results in the transposition cache module.
Fig. 7 is a structural diagram of the sorting module of the hardware implementation system provided by the invention. As shown in Fig. 7, the sorting module comprises: a sorting submodule, for sorting the character strings in the transposition cache module in lexicographic order for the BWTS result acquisition module; and a BWTS result acquisition module, for reading the last column of the sorted result from the sorting submodule as the output of the BWTS method and storing it in the output buffer module.
Specific embodiment: taking the character string "icanucan" as an example.
First, "icanucan" is stored in the input buffer 102. The string fetch submodule 202 fetches the character "i" and passes it to the shift submodule and the Lyndon word judging submodule 208. The length of "i" is 1, so no shift is needed. The string fetch submodule 202 then fetches the character "c" and passes "ic" to the shift submodule and the Lyndon word judging submodule 208. The shift submodule 204 shifts "ic" to "ci" and stores it in the N×N register. The Lyndon word judging submodule compares the two: in lexicographic order ic > ci, so ic is clearly not a Lyndon word. The further extensions "ica", "ican", ... are then not Lyndon words either. The longest Lyndon word at this point is therefore "i", which is stored in the Lyndon word cache module 108, and its length 1 is stored in the Lyndon word length cache module 106.
Fetching then restarts from "c". The string fetch submodule 202 fetches the character "c" and passes it to the shift submodule and the Lyndon word judging submodule 208. The length of "c" is 1, so no shift is needed. The string fetch submodule 202 then fetches the character "a" and passes "ca" to the shift submodule and the Lyndon word judging submodule 208. The shift submodule 204 shifts "ca" to "ac" and stores it in the N×N register. The Lyndon word judging submodule compares the two: in lexicographic order ca > ac, so ca is clearly not a Lyndon word. The further extensions "can", "canu", ... are then not Lyndon words either. The longest Lyndon word at this point is therefore "c", which is stored in the Lyndon word cache module 108, and its length 1 is stored in the Lyndon word length cache module 106.
Fetching then restarts from "a". The string fetch submodule 202 fetches the character "a" and passes it to the shift submodule and the Lyndon word judging submodule 208. The length of "a" is 1, so no shift is needed. The string fetch submodule 202 then fetches the character "n" and passes "an" to the shift submodule and the Lyndon word judging submodule 208. The shift submodule 204 shifts "an" to "na" and stores it in the N×N register. The comparison gives an < na, so an is a Lyndon word, but whether it is the longest one still has to be determined. The string fetch submodule 202 fetches the character "u" and passes "anu" on. The shift submodule 204 shifts "anu" to "nua" and "uan" and stores them in the N×N register. The comparison gives anu < nua < uan, so anu is a Lyndon word, but whether it is the longest one still has to be determined. The string fetch submodule 202 fetches the character "c" and passes "anuc" on. The shift submodule 204 shifts "anuc" to "nuca", "ucan" and "canu" and stores them in the N×N register. The comparison gives anuc < canu < nuca < ucan, so anuc is a Lyndon word, but whether it is the longest one still has to be determined. The string fetch submodule 202 then fetches the character "a" and passes "anuca" on. The shift submodule 204 shifts "anuca" to "nucaa", "ucaan", "caanu" and "aanuc" and stores them in the N×N register. The comparison gives anuca > aanuc, so anuca is clearly not a Lyndon word, and the further extension "anucan" is not a Lyndon word either. The longest Lyndon word at this point is therefore "anuc", which is stored in the Lyndon word cache module 108, and its length 4 is stored in the Lyndon word length cache module 106.
Reading then restarts from "a". The string fetch submodule 202 fetches the character "a" and passes it to the shift submodule and the Lyndon word judging submodule 208. The length of "a" is 1, so no shift is needed. The string fetch submodule 202 then fetches the character "n" and passes "an" to the shift submodule and the Lyndon word judging submodule 208. The shift submodule 204 shifts "an" to "na" and stores it in the N×N register. The comparison gives an < na, so an is a Lyndon word. Since the end of the string has been reached, "an" is stored in the Lyndon word cache module and its length 2 in the Lyndon word length cache module. The whole sequence has now been processed. The longest Lyndon words held in the Lyndon word cache module are "i", "c", "anuc", "an", and the contents of the Lyndon word length cache module are 1, 1, 4, 2.
The transposition stage follows. The longest Lyndon Word length determination submodule 302 first reads the contents 1, 1, 4, 2 of Lyndon Word length cache module 106 and finds the maximum value, 4. Using the lengths 1, 1, 4, 2 supplied by submodule 302, the character string expansion submodule 304 reads the longest Lyndon Words one after another: "i", "c", "anuc", "an". Each is expanded to the maximum length 4: "i" becomes "iiii", "c" becomes "cccc", "anuc" already has length 4 and needs no expansion, and "an" becomes "anan". The cyclic shift submodule 306 reads in the expanded strings "iiii", "cccc", "anuc", "anan", cyclically shifts each of them, and stores every sequence produced during shifting in transposition cache module 112. "iiii" is identical to the original sequence after shifting, so no shift is needed; the same holds for "cccc". "anuc" is shifted and "nuca", "ucan", "canu" are stored. "anan" is shifted and "nana" is stored. The sorting module 114 reads the sequences held in the transposition cache module: "iiii", "cccc", "anuc", "nuca", "ucan", "canu", "anan", "nana". The sorting submodule 402 arranges all sequences in lexicographic order, giving: "anan", "anuc", "canu", "cccc", "iiii", "nana", "nuca", "ucan".
The BWTS result acquisition module 404 takes the last character of each of the above sequences to form the BWTS output "ncuciaan", which is sent to output buffer module 116.
The process can be summarized as follows:
icanucan => [i] [c] [anuc] [an] => sorted rows [anan] [anuc] [canu] [cccc] [iiii] [nana] [nuca] [ucan] => ncuciaan
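The transposition, sorting and last-column steps of this example can be reproduced with a short Python sketch. The function names are hypothetical, and one point is an assumption: the text does not say how a word is expanded when the maximum length is not a multiple of its own length, so the sketch simply repeats the word and truncates it.

```python
def expand(word: str, length: int) -> str:
    """Repeat word periodically and cut it to the requested length (assumed rule)."""
    reps = -(-length // len(word))  # ceiling division
    return (word * reps)[:length]


def bwts_from_words(words: list[str]) -> str:
    """Expand, cyclically shift, sort, and read the last column."""
    longest = max(len(w) for w in words)
    rows = []
    for w in words:
        expanded = expand(w, longest)
        # Shifting by len(w) or more only repeats earlier rows when
        # `longest` is a multiple of len(w), so len(w) shifts are enough.
        for k in range(len(w)):
            rows.append(expanded[k:] + expanded[:k])
    rows.sort()  # lexicographic order
    return "".join(row[-1] for row in rows)


print(bwts_from_words(["i", "c", "anuc", "an"]))  # ncuciaan
```

For the four words of the example the sorted rows are exactly the eight sequences listed above, and the last column reads "ncuciaan".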
To achieve the above object, the present invention also provides an improved BWT data compression method, comprising: inputting a character string, temporarily storing the character string to be processed in the input buffer module, synchronizing data input with data processing, and outputting the character string to the Lyndon Word search module after processing; searching, by the Lyndon Word search module, for the longest Lyndon Words in the character string from the input buffer module, outputting each longest Lyndon Word found to the Lyndon Word cache module, and outputting the length of each longest Lyndon Word to the Lyndon Word length cache module; temporarily storing, in the Lyndon Word cache module, the Lyndon Words output by the Lyndon Word search module for use by the transposition module; temporarily storing, in the Lyndon Word length cache module, the lengths and the number of all Lyndon Words found by the Lyndon Word search module for use by the sorting module; completing, by the transposition module, the transposition of all Lyndon Words found by the Lyndon Word search module, and storing the transposition result in the transposition cache module; temporarily storing, in the transposition cache module, the transposition result output by the transposition module for use by the sorting module; sorting, by the sorting module, all character strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module; and temporarily storing the output character string in the output buffer module for use by subsequent modules.
Searching for the Lyndon Words of the data block further comprises: fetching characters from the input buffer module and recording the length of the character string fetched so far, where reading starts from the first character of the string and proceeds one character at a time, each newly extended string is passed to the subsequent modules for Lyndon Word judgment, and, when a Lyndon Word has been found, its length is input to the Lyndon Word length cache module, the length counter is reset to zero, and the next fetch starts from the last character fetched this time; inputting the character string from the character string fetch submodule to the Lyndon Word judgment submodule, shifting the string successively, and inputting all shifted strings to the N*N register; storing, in the N*N register, all shifted strings of the string under judgment that come from the shift submodule, for processing by the Lyndon Word judgment submodule; and taking the shifted strings from the N*N register one by one and comparing each with the original string, where the number of comparisons equals the length of the string under judgment, and, if the comparison shows that the original string is the smallest in lexicographic order, the string is a Lyndon Word and is output to the Lyndon Word cache module.
Completing the transposition of all Lyndon Words further comprises: determining the length of the longest Lyndon Word of the processed character string from the contents of the Lyndon Word length cache module, and sending this value to the character string expansion submodule; expanding all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule; and cyclically shifting, in turn, each Lyndon Word coming from the character string expansion submodule, and storing the results in the transposition cache module.
Sorting all transposed character strings in lexicographic order further comprises: sorting the character strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module; and reading the last column of the sorting result of the sorting submodule as the output of the BWTS method, and storing it in the output buffer module.
The improved BWT data compression method and its hardware implementation system disclosed by the invention remove the restriction of the existing BWT method, in which the character string can only be recovered with the help of a constant generated by the forward transform. In data compression applications this change can improve the compression ratio and, in particular, makes block-wise processing in subsequent steps more convenient in hardware implementations of data compression. In addition, the BWT algorithm is often used in channel coding; if the constant it generates is corrupted or lost during transmission over a noisy channel, the whole character string cannot be recovered. The improved algorithm addresses this problem: the inverse transform of the improved BWT algorithm always starts from character zero, so an error in one character affects only one or two characters of the string rather than the whole string. For instance, in the example above, 'ncuciaan' is the output for 'icanucan' under the improved BWT algorithm; if a transmission error turns it into 'ncuchaan', the inverse transform generates 'icanhcan', in which only one character is wrong, which greatly reduces the error rate.
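The inverse transform is not spelled out in this passage. As a rough illustration of how a string can be recovered without any stored index, the following Python sketch uses the cycle-following inverse commonly used for the bijective (Lyndon-word-based) variant of the transform; it is an assumption that this corresponds to the inverse intended here, and the function name is hypothetical.

```python
def inverse_bwts_sketch(last: str) -> str:
    """Recover the input from the last column by following permutation cycles."""
    n = len(last)
    # Stable sort of positions by character: order[j] is the position in
    # `last` that supplies the j-th character of the sorted first column.
    order = sorted(range(n), key=lambda i: last[i])
    visited = [False] * n
    words = []
    for start in range(n):
        if visited[start]:
            continue
        chars = []
        j = start
        while not visited[j]:  # each cycle spells out one Lyndon word
            visited[j] = True
            chars.append(last[order[j]])
            j = order[j]
        words.append("".join(chars))
    # Cycles are found from smallest to largest word; the original string
    # concatenates the words in the opposite order.
    return "".join(reversed(words))


print(inverse_bwts_sketch("ncuciaan"))  # icanucan
```

No start index or other constant is passed in, which is the property the paragraph above relies on. Running the same sketch on the corrupted string 'ncuchaan' also returns a string that differs from 'icanucan' in exactly one position; which position is affected may depend on the details of the inverse procedure, which this sketch only assumes.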
Those skilled in the art will appreciate that the present invention may involve apparatus for performing one or more of the operations described in this application. The apparatus may be specially designed and manufactured for the required purposes, or may comprise known devices in a general-purpose computer that is selectively activated or reconfigured by a program stored in it. Such a computer program may be stored in a device-readable (e.g., computer-readable) medium, or in any type of medium suitable for storing electronic instructions and coupled to a bus, including but not limited to any type of disk (floppy disk, hard disk, optical disc, CD-ROM and magneto-optical disk), random access memory (RAM), read-only memory (ROM), electrically programmable ROM, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, magnetic card or optical card. A readable medium includes any mechanism for storing or transmitting information in a form readable by a device (e.g., a computer). For example, readable media include random access memory (RAM), read-only memory (ROM), magnetic disk storage media, optical storage media, flash memory devices, and signals propagated in electrical, optical, acoustic or other forms (e.g., carrier waves, infrared signals, digital signals).
Those skilled in the art will appreciate that each block of these structural diagrams and/or block diagrams and/or flow diagrams, and combinations of blocks in them, can be implemented with computer program instructions. These computer program instructions can be supplied to the processor of a general-purpose computer, a special-purpose computer or other programmable data processing apparatus to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing apparatus create means for implementing what is specified in the block or blocks of the structural diagrams and/or block diagrams and/or flow diagrams.
Those skilled in the art will appreciate that the steps, measures and schemes in the various operations, methods and flows discussed in the present invention can be alternated, changed, combined or deleted. Further, other steps, measures and schemes in the various operations, methods and flows discussed in the present invention can also be alternated, changed, rearranged, decomposed, combined or deleted. Further, steps, measures and schemes in the prior art that correspond to the various operations, methods and flows disclosed in the present invention can also be alternated, changed, rearranged, decomposed, combined or deleted.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention.

Claims (8)

1. A hardware implementation system for an improved BWT data compression method, characterized by comprising:
an input buffer module, configured to temporarily store the character string to be processed, synchronize data input with data processing, and output the character string to the Lyndon Word search module after processing;
a Lyndon Word search module, configured to search for the longest Lyndon Words in the character string from the input buffer module, output each longest Lyndon Word found to the Lyndon Word cache module, and output the length of each longest Lyndon Word to the Lyndon Word length cache module;
a Lyndon Word cache module, configured to temporarily store the Lyndon Words output by the Lyndon Word search module for use by the transposition module;
a Lyndon Word length cache module, configured to temporarily store the lengths and the number of all Lyndon Words found by the Lyndon Word search module for use by the sorting module;
a transposition module, configured to complete the transposition of all Lyndon Words found by the Lyndon Word search module and store the transposition result in the transposition cache module;
a transposition cache module, configured to temporarily store the transposition result output by the transposition module for use by the sorting module;
a sorting module, configured to sort all character strings in the transposition cache module in lexicographic order, take the last column as the output of the BWTS method, and store it in the output buffer module;
an output buffer module, configured to temporarily store the output character string for use by subsequent modules.
2. The system according to claim 1, characterized in that the Lyndon Word search module further comprises:
a character string fetch submodule, configured to fetch characters from the input buffer module and record the length of the character string fetched so far, where reading starts from the first character of the string and proceeds one character at a time, each newly extended string is passed to the subsequent modules for Lyndon Word judgment, and, when a Lyndon Word has been found, its length is input to the Lyndon Word length cache module, the length counter is reset to zero, and the next fetch starts from the last character fetched this time;
a shift submodule, configured to input the character string from the character string fetch submodule to the Lyndon Word judgment submodule, shift the string successively, and input all shifted strings to the N*N register;
an N*N register, configured to store all shifted strings of the string under judgment that come from the shift submodule, for processing by the Lyndon Word judgment submodule;
a Lyndon Word judgment submodule, configured to take the shifted strings from the N*N register one by one and compare each with the original string, where the number of comparisons equals the length of the string under judgment, and, if the comparison shows that the original string is the smallest in lexicographic order, the string is a Lyndon Word and is output to the Lyndon Word cache module.
3. The system according to claim 1, characterized in that the transposition module further comprises:
a longest Lyndon Word length determination submodule, configured to determine the length of the longest Lyndon Word of the processed character string from the contents of the Lyndon Word length cache module, and send this value to the character string expansion submodule;
a character string expansion submodule, configured to expand all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule;
a cyclic shift submodule, configured to cyclically shift, in turn, each Lyndon Word coming from the character string expansion submodule, and store the results in the transposition cache module.
4. The system according to claim 1, characterized in that the sorting module comprises:
a sorting submodule, configured to sort the character strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module;
a BWTS result acquisition module, configured to read the last column of the sorting result of the sorting submodule as the output of the BWTS method, and store it in the output buffer module.
5. An improved BWT data compression method, characterized by comprising:
inputting a character string, temporarily storing the character string to be processed in the input buffer module, synchronizing data input with data processing, and outputting the character string to the Lyndon Word search module after processing;
searching, by the Lyndon Word search module, for the longest Lyndon Words in the character string from the input buffer module, outputting each longest Lyndon Word found to the Lyndon Word cache module, and outputting the length of each longest Lyndon Word to the Lyndon Word length cache module;
temporarily storing, in the Lyndon Word cache module, the Lyndon Words output by the Lyndon Word search module for use by the transposition module;
temporarily storing, in the Lyndon Word length cache module, the lengths and the number of all Lyndon Words found by the Lyndon Word search module for use by the sorting module;
completing, by the transposition module, the transposition of all Lyndon Words found by the Lyndon Word search module, and storing the transposition result in the transposition cache module;
temporarily storing, in the transposition cache module, the transposition result output by the transposition module for use by the sorting module;
sorting, by the sorting module, all character strings in the transposition cache module in lexicographic order, taking the last column as the output of the BWTS method, and storing it in the output buffer module;
temporarily storing the output character string in the output buffer module for use by subsequent modules.
6. The method according to claim 5, characterized in that searching for the Lyndon Words of the data block further comprises:
fetching characters from the input buffer module and recording the length of the character string fetched so far, where reading starts from the first character of the string and proceeds one character at a time, each newly extended string is passed to the subsequent modules for Lyndon Word judgment, and, when a Lyndon Word has been found, its length is input to the Lyndon Word length cache module, the length counter is reset to zero, and the next fetch starts from the last character fetched this time;
inputting the character string from the character string fetch submodule to the Lyndon Word judgment submodule, shifting the string successively, and inputting all shifted strings to the N*N register;
storing, in the N*N register, all shifted strings of the string under judgment that come from the shift submodule, for processing by the Lyndon Word judgment submodule;
taking the shifted strings from the N*N register one by one and comparing each with the original string, where the number of comparisons equals the length of the string under judgment, and, if the comparison shows that the original string is the smallest in lexicographic order, the string is a Lyndon Word and is output to the Lyndon Word cache module.
7. The method according to claim 5, characterized in that completing the transposition of all Lyndon Words further comprises:
determining the length of the longest Lyndon Word of the processed character string from the contents of the Lyndon Word length cache module, and sending this value to the character string expansion submodule;
expanding all Lyndon Words in the Lyndon Word cache module to the length of the longest Lyndon Word for use by the cyclic shift submodule;
cyclically shifting, in turn, each Lyndon Word coming from the character string expansion submodule, and storing the results in the transposition cache module.
8. The method according to claim 5, characterized in that sorting all transposed character strings in lexicographic order further comprises:
sorting the character strings in the transposition cache module in lexicographic order for use by the BWTS result acquisition module;
reading the last column of the sorting result of the sorting submodule as the output of the BWTS method, and storing it in the output buffer module.
CN201410571262.3A 2014-10-23 2014-10-23 Improved BWT data compression method and hardware implementing system thereof Active CN104284189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410571262.3A CN104284189B (en) 2014-10-23 2014-10-23 Improved BWT data compression method and hardware implementing system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410571262.3A CN104284189B (en) 2014-10-23 2014-10-23 Improved BWT data compression method and hardware implementing system thereof

Publications (2)

Publication Number Publication Date
CN104284189A (en) 2015-01-14
CN104284189B (en) 2017-06-16

Family

ID=52258598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410571262.3A Active CN104284189B (en) 2014-10-23 2014-10-23 Improved BWT data compression method and hardware implementing system thereof

Country Status (1)

Country Link
CN (1) CN104284189B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6674908B1 (en) * 2002-05-04 2004-01-06 Edward Lasar Aronov Method of compression of binary data with a random number generator
US20130019029A1 (en) * 2011-07-13 2013-01-17 International Business Machines Corporation Lossless compression of a predictive data stream having mixed data types
CN103810228A (en) * 2012-11-01 2014-05-21 辉达公司 System, method, and computer program product for parallel reconstruction of a sampled suffix array
CN103117748A (en) * 2013-01-29 2013-05-22 中国科学院计算技术研究所 Method and system for sequencing suffixes in BWT (burrows-wheeler transform) implementation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
S MARCUS: "On Two-Dimensional Lyndon Words", 《INTERNATIONAL SYMPOSIUM ON STRING PROCESSING AND INFORMATION RETRIEVAL》 *
王宁: "高速数据压缩与缓存的FPGA实现", 《微计算机信息》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005464A (en) * 2015-07-02 2015-10-28 东南大学 Burrows Wheeler Transform hardware processing apparatus
CN105005464B (en) * 2015-07-02 2017-10-10 东南大学 A kind of Burrows Wheeler mapping hardware processing units
CN107342102A (en) * 2016-04-29 2017-11-10 上海磁宇信息科技有限公司 A kind of MRAM chip and searching method with function of search
CN116821967A (en) * 2023-08-30 2023-09-29 山东远联信息科技有限公司 Intersection computing method and system for privacy protection
CN116821967B (en) * 2023-08-30 2023-11-21 山东远联信息科技有限公司 Intersection computing method and system for privacy protection

Also Published As

Publication number Publication date
CN104284189B (en) 2017-06-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant