CN102737132A - Multi-rule combined compression method based on database row and column mixed storage - Google Patents


Publication number
CN102737132A
Authority
CN
China
Prior art keywords
rule
data
compression
attribute column
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012102093622A
Other languages
Chinese (zh)
Inventor
曹晖
冯柯
毛云青
何清法
周丽霞
蒋志勇
赵殿奎
关刚
王效忠
李海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Original Assignee
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority to CN2012102093622A
Publication of CN102737132A

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a multi-rule combined compression method based on database row and column mixed storage. In view of current software and hardware trends and the severe performance bottleneck in the database industry, it provides a mixed storage and compression mode that organizes data by tuple rows and compresses it by attribute columns, combining the high compression ratio of column storage with the convenient random positioning and access of row storage. It further provides several intra-column rule encoding methods matched to different data distribution characteristics and, in particular, inter-column compression rules that exploit possible relationships among the attribute columns of a database table. Together with a back-end general-purpose compression algorithm, the method efficiently supplies a multi-level combined compression function to upper-layer database applications and guarantees the maximum encoding and decoding speed at a specified compression ratio.

Description

Multi-rule combined compression method based on database row and column mixed storage
Technical field
The present invention relates to data storage, data compression, and data retrieval technology, and in particular to a multi-rule combined compression method based on database row and column mixed storage.
Background technology
Query processing and data storage are the two key elements of a database. They complement each other and together ensure that the database can provide efficient data management and retrieval services to users. As the information revolution deepens, however, real-world applications constantly generate massive amounts of new data, and users increasingly prefer to retain historical data over longer periods, so the limit on storage capacity has become a serious and urgent problem. At the same time, the development of storage hardware has lagged far behind that of other computer system hardware, and the storage system has become a serious bottleneck restricting overall database performance. Under these circumstances the pressure on the storage system keeps growing. To support massive data storage while ensuring that the storage system does not slow down the database as a whole, users can often only pile up more storage hardware, which brings uncontrollable cost and scaling problems.
For this reason, database compression technology emerged. Embedded compression compresses the stored data, which greatly reduces the storage volume and also reduces the I/O that query processing imposes on the storage system, thereby lowering the cost of the whole storage system and indirectly improving its performance. After compression is introduced, the database must compress and decompress data, which consumes more processor resources. However, because storage systems have developed far more slowly than processors, which follow Moore's Law, processor resources are in relative surplus compared with storage performance in the overall system, so this additional consumption does not affect database performance.
Because database compression can significantly alleviate the existing storage performance bottleneck, mature commercial databases such as ORACLE and DB2 have all introduced compression. Current database compression is essentially dictionary-based: frequently occurring data patterns are extracted into a symbol table, and shorter reference symbols replace them in actual storage to achieve compression. Dictionary-based compression gives good results for most application data. However, more and more real-world applications exhibit data distributions with their own application-specific characteristics. In such cases dictionary-based compression alone is far from sufficient to reach the best compression effect, and the database needs to provide targeted compression methods according to the data distribution characteristics.
On the other hand, the underlying storage used for current database compression can be divided into row storage and column storage. In row storage, data is stored and read contiguously in tuple-row form, but because the attributes within each tuple row of a table are largely unrelated, a good compression effect cannot be obtained. Column storage is the opposite: it stores the data of each attribute column of a table separately and contiguously, which greatly increases the similarity of adjacent data and yields a higher compression ratio, but it breaks the tuple-row organization and makes traditional row-oriented queries extremely inefficient. To balance data storage against query efficiency, the database therefore needs an underlying data organization that combines the advantages of both column storage and row storage.
To address these problems, the present invention provides a multi-rule combined compression method based on row and column mixed storage in a database. User application data is organized by row and compressed by column, and a suitable encoding rule is selected adaptively according to the data distribution characteristics, thereby ensuring a higher compression ratio.
Summary of the invention
The object of the present invention is to provide a multi-rule combined compression method based on database row and column mixed storage.
The technical solution adopted by the present invention to solve its technical problem is as follows:
1. A multi-rule combined compression method based on database row and column mixed storage, characterized in that it is implemented by the following steps:
1) receive the data imported by the user, and reorganize all the data into a plurality of attribute columns according to the attribute schema of the user table;
2) use the dictionary-rule compression method to build a dictionary structure and a weight table for the data of each attribute column in the current data packet;
3) for each attribute column, use the dictionary and weight information already built to estimate the size of the column after encoding with each intra-column compression rule, and by comparison choose the compression rule that occupies the least space;
4) execute the inter-column compression rule discovery step based on the dictionary information of each attribute column; after a suitable rule is found, compare its estimated encoded size with that of the intra-column compression rule of the corresponding column, and choose the most space-efficient scheme;
5) apply rule-based compression encoding to the data of each attribute column in the data packet according to the selected optimal compression rules;
6) according to the target compression level, use a general-purpose compression method to perform combined compression on the rule-encoded column data; compression of the data packet is complete once the expected compression ratio is reached.
2. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: step 1) uses a compression buffer to buffer the externally imported data and stores the imported data in a mixed row-and-column manner; that is, whenever the buffer has received a certain number of rows, they are treated as a whole as an independent data packet, and then, in accordance with the schema definition of the corresponding database table, the data within the packet is split by attribute column and stored column by column.
3. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: the purpose of the dictionary encoding in step 2) is to build compression-related statistics for each column of data in the packet for use in subsequent rule selection; dictionary encoding is tightly integrated with the data import and attribute column splitting process so that data is encoded dynamically, and by the time the compression buffer has received a complete data packet, the data in it is already stored as a whole in dictionary-encoded form; the method is implemented as follows:
1) create and initialize the auxiliary data structures for dictionary encoding on each attribute column in the data packet; the dictionary table uses a static hash structure whose initial number of entries is twice the number of rows in the packet, to keep the collision rate low;
2) after a new data tuple is imported into the compression buffer, each attribute element in the row is assigned to the dictionary encoding structure of its attribute column, and the value and length of each element are obtained in order to accumulate the raw size of each attribute column when no compression rule is used;
3) for each newly added attribute element, the attribute column computes a hash value into the dictionary table from the attribute value, locates the entry for that value through the hash index, and updates the weight of that entry. If a hash collision occurs because the entry is already occupied by a different attribute value, quadratic probing is used to find the next candidate entry, and the preceding check and operation are repeated. Once the dictionary entry for the attribute element is found, the compression buffer replaces the attribute value with the number of its dictionary entry and stores that number in the reference list of the current attribute column;
4) during dictionary table maintenance, the overall size of the current dictionary encoding is assessed after every certain number of inserted attribute elements. First the raw, uncompressed size of all attribute values in the current attribute column is obtained, and at the same time the storage size after dictionary encoding is estimated from all existing dictionary entries and the size of the reference list;
5) when the total number of tuples received by the compression buffer reaches the packet's row limit, the dictionary encoding structure of each attribute column is complete. At this point the dictionary table of each attribute column is sorted with a quicksort suited to the attribute's data type: all existing entries are sorted in ascending order and reinserted into the dictionary table, and the reference list of the attribute column is updated accordingly;
4. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: step 3) performs, for each attribute column and on top of the dictionary encoding structure built earlier, a comprehensive assessment of the intra-column compression rules using the basic dictionary statistics; the process is as follows:
1) assess the constant encoding rule for the attribute column: scan the whole dictionary table for the entry with the largest weight as the most likely constant default value; the number of exception-table entries can be estimated from the weight of this default value and the total number of rows in the packet, and combined with the length occupied by the attribute this gives the estimated size after constant encoding;
2) assess the run-length encoding rule for the attribute column: scan the reference list that maps the attribute column to its dictionary table, merge consecutive repeated items during the sequential traversal to obtain the run-length entries, and finally combine each entry with its dictionary entry to obtain the overall size after run-length encoding;
3) assess the sequence encoding rule for the attribute column: traverse the attribute column in row order, compute the difference of each row relative to the first row's attribute value as the base, and then obtain the compressed size under the sequence encoding rule by summing the byte lengths of the individual differences;
5. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: the inter-column compression rules of step 4) perform further deep compression on top of the intra-column compression rules, achieving compression at a coarser granularity by mining data relationships between attribute columns; the specific implementation is as follows:
1) sort all attribute columns in ascending order by the total number of entries in their dictionary tables; then traverse them in order, comparing attribute columns pairwise in the manner of bubble sort, and during the traversal build an inter-column rule mining auxiliary structure for each attribute column in turn;
2) judge the inter-column equality rule condition: take the current attribute column as the column to be compressed, traverse all subsequent attribute columns in turn, determine whether any column has an equality rule with the current column, and estimate the size of the current column after equality-rule compression;
3) judge the inter-column derivation rule condition: take the current attribute column as the deriving column and traverse all subsequent attribute columns in turn, determine whether there is a target column to which the current column can apply the derivation rule, and estimate the size of the target column after derivation-rule compression;
6. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: after intra-column and inter-column rule discovery and encoding assessment have been completed for all attribute columns, step 5) traverses all attribute columns in the current data packet one by one and encodes each with the compression rule that has the highest estimated compression ratio.
7. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: step 6) goes on to apply a back-end general-purpose compression algorithm on top of the rule-encoded data in the current compression buffer, performing deep combined compression. This further reduces the overall storage space, and the final compression ratio can be controlled through a user-specified compression level, so that the time and processor resources needed for encoding and decoding are minimized while the compression still meets the needs of the actual database application. The specific steps are as follows:
1) sort all rule-encoded attribute column data in descending order of encoded size;
2) compress each attribute column with the LZOP method; if the space after compression increases, keep the rule-encoded storage; if the overall compression ratio reaches the compression level set by the upper-layer application, stop the compression process;
3) try compressing each attribute column with the LZMA method instead; as soon as the overall compression ratio reaches the configured compression level, stop the compression process;
4) if the total size of the remaining attribute columns is less than one tenth of the total compressed size of the attribute columns already processed, stop the compression process.
Description of drawings
Fig. 1 is a flowchart of the implementation steps of the present invention.
Fig. 2 is a schematic diagram of the working principle of the multi-rule combined compression method.
Embodiment
The technical solution of the present invention is now further described with reference to a specific implementation and examples.
1. As shown in Fig. 1 and Fig. 2, the implementation process and working principle of the present invention are as follows:
1) Receive the data imported by the user, and reorganize all the data into a plurality of attribute columns according to the attribute schema of the user table.
2) Use the dictionary-rule compression method to build a dictionary structure and a weight table for the data of each attribute column in the current data packet.
3) For each attribute column, use the dictionary and weight information already built to estimate the size of the column after encoding with each intra-column compression rule, and by comparison choose the compression rule that occupies the least space.
4) Perform inter-column compression rule discovery based on the dictionary information of each attribute column; after a suitable rule is found, compare its estimated encoded size with that of the intra-column compression rule of the corresponding column, and choose the most space-efficient scheme.
5) Apply rule-based compression encoding to the data of each attribute column in the data packet according to the selected optimal compression rules.
6) According to the target compression level, use a general-purpose compression method to perform combined compression on the rule-encoded column data; compression of the data packet is complete once the expected compression ratio is reached.
Step 1) uses a compression buffer to buffer the externally imported data and stores the imported data in a mixed row-and-column manner. A row-count threshold N can be specified externally in advance; whenever the buffer has received N rows, all the data received so far is treated as a whole as an independent data packet, and then, in accordance with the schema definition of the corresponding database table, the data within the packet is split by attribute column and stored column by column. Gathering data by column inside the packet increases the density of similar data compared with a pure row format. At the same time, when the upper layer of the database needs to access data in tuple-row form, it only has to read this one packet to obtain all attributes of the required tuple row quickly, which avoids the need, as in pure column storage, to traverse every column to find the attributes of a tuple.
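The packet organization described for step 1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `build_packets`, `split_by_column`, and `read_tuple` are hypothetical, the schema is reduced to a list of column names, and flushing a final partially filled packet is an assumption the text does not address.

```python
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Row = Tuple[object, ...]
Packet = Dict[str, List[object]]  # column name -> values for this packet's rows


def split_by_column(buffer: List[Row], schema: Sequence[str]) -> Packet:
    """Regroup the buffered rows so that each attribute column is contiguous."""
    return {name: [row[i] for row in buffer] for i, name in enumerate(schema)}


def build_packets(rows: Iterable[Row], schema: Sequence[str], n: int) -> Iterator[Packet]:
    """Buffer incoming rows and emit one column-organized packet per n rows."""
    buffer: List[Row] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == n:  # row-count threshold N reached
            yield split_by_column(buffer, schema)
            buffer = []
    if buffer:  # assumption: flush a final, partially filled packet
        yield split_by_column(buffer, schema)


def read_tuple(packet: Packet, schema: Sequence[str], row_no: int) -> Row:
    """Rebuild one tuple row by reading a single packet only."""
    return tuple(packet[name][row_no] for name in schema)
```

Because every attribute of a given row lives in the same packet, `read_tuple` never has to look outside that packet, which is the access advantage the text attributes to the mixed layout.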
The purpose of the dictionary encoding in step 2) is to build compression-related statistics for each column of data in the packet for use in subsequent rule selection, and at the same time to replace the data with a reference list into the dictionary table. Dictionary encoding is tightly integrated with the data import and attribute column splitting process so that data is encoded dynamically; by the time the compression buffer has received a complete data packet, the data in it is already stored as a whole in dictionary-encoded form. The method is implemented as follows:
1) Create and initialize the auxiliary data structures for dictionary encoding on each attribute column in the data packet. The dictionary table uses a static hash structure whose initial number of entries is twice the number of rows in the packet, to keep the collision rate low.
2) After a new data tuple is imported into the compression buffer, each attribute element in the row is assigned to the dictionary encoding structure of its attribute column, and the value and length of each element are obtained in order to accumulate the raw size of each attribute column when no compression rule is used.
3) For each newly added attribute element, the attribute column computes a hash value into the dictionary table from the attribute value and locates the entry for that value through the hash index. If the entry already exists, its weight within the packet is incremented by one. If it does not exist, a new entry for the attribute value is created in place with its weight set to one. If a hash collision occurs because the entry is already occupied by a different attribute value, quadratic probing is used: the HASH value of the attribute value is squared to find the next candidate entry, and the preceding check and operation are repeated. Once the dictionary entry for the attribute element is found, the compression buffer replaces the attribute value with the number of its dictionary entry and stores that number in the reference list of the current attribute column.
4) During dictionary table maintenance, the overall size of the current dictionary encoding is assessed after every certain number of inserted attribute elements. First the raw, uncompressed size of all attribute values in the current attribute column is obtained, and at the same time the storage size after dictionary encoding is estimated from all existing dictionary entries and the size of the reference list. If the estimated dictionary-encoded size exceeds the raw uncompressed size, dictionary encoding is abandoned and the original content of the attribute elements is stored directly.
5) When the total number of tuples received by the compression buffer reaches the packet's row limit, the dictionary encoding structure of each attribute column is complete. At this point the dictionary table of each attribute column is sorted with a quicksort suited to the attribute's data type: all existing entries are sorted in ascending order and reinserted into the dictionary table. During sorting, a mapping table between entry numbers before and after the sort is recorded. Once the dictionary table has been sorted, the reference list of the attribute column is reassigned according to this mapping table so that it reflects the new dictionary entry numbers.
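The dictionary-building steps above can be illustrated with a short sketch. It is a simplification under stated assumptions: a Python `dict` stands in for the static hash table with quadratic probing, Python's built-in sort stands in for the type-specific quicksort, values are strings, and the 4-byte reference width in `dictionary_pays_off` is an arbitrary choice. All function names are hypothetical.

```python
from typing import Dict, List, Tuple


def dictionary_encode(column: List[str]) -> Tuple[List[str], List[int], List[int]]:
    """Return (sorted dictionary, weights, reference list) for one column."""
    entry_of: Dict[str, int] = {}  # value -> entry number in insertion order
    weights: List[int] = []        # weights[e] = occurrences of entry e
    refs: List[int] = []           # one entry number per row
    for value in column:
        entry = entry_of.get(value)
        if entry is None:          # new value: add an entry with weight one
            entry = len(weights)
            entry_of[value] = entry
            weights.append(1)
        else:                      # existing value: add one to its weight
            weights[entry] += 1
        refs.append(entry)

    # Sort entries ascending and record the old -> new entry number mapping.
    values = sorted(entry_of)
    remap = {entry_of[v]: new for new, v in enumerate(values)}
    sorted_weights = [0] * len(values)
    for old, new in remap.items():
        sorted_weights[new] = weights[old]
    # Reassign the reference list through the mapping table.
    return values, sorted_weights, [remap[r] for r in refs]


def dictionary_pays_off(column: List[str], values: List[str], ref_bytes: int = 4) -> bool:
    """Compare the estimated dictionary-encoded size with the raw size."""
    raw = sum(len(v.encode()) for v in column)
    encoded = sum(len(v.encode()) for v in values) + ref_bytes * len(column)
    return encoded < raw
```

Decoding a row is then a single lookup, `values[refs[row_no]]`, and the weights feed directly into the rule assessments of step 3).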
Step 3) performs, for each attribute column and on top of the dictionary encoding structure built earlier, a comprehensive assessment of the intra-column compression rules using the basic dictionary statistics. The process is as follows:
1) Assess the constant encoding rule for the attribute column. Constant encoding targets attribute columns in which one value is repeated many times; the encoded structure consists of a default value and an exception table. The default value holds the constant that is repeated most in the attribute column, and every attribute element not equal to this constant is stored in the exception table together with its row number and attribute value. Before the size under the constant rule is assessed, the whole dictionary table is scanned for the entry with the largest weight, which is taken as the most likely constant default value. The number of exception-table entries can then be estimated from the weight of this default value and the total number of rows in the packet, and combined with the length occupied by the attribute this gives the estimated size after constant encoding.
2) Assess the run-length encoding rule for the attribute column. Run-length encoding targets data whose distribution is well clustered, so that many identical values appear consecutively. An attribute column with this distribution is converted into a representation of attribute value plus starting position, so that many consecutive attribute values can be replaced by a single record, which greatly reduces storage overhead. Assessing the size of run-length encoding requires scanning the reference list that maps the attribute column to its dictionary table, merging consecutive repeated items during the sequential traversal to obtain the run-length entries, and finally combining each entry with its dictionary entry to obtain the overall size after run-length encoding.
3) Assess the sequence encoding rule for the attribute column. The sequence rule targets ordered sequences that may appear in a database, in particular primary key attributes. When attribute values increase with the row number within the table and the values themselves occupy a lot of space, the column can instead store a base value for the attribute column and the difference between each attribute value and this base value. In addition, given the ordering of the data, most of the differences on a typical attribute column can go on to use dictionary-rule encoding or even constant-rule encoding. The compressed size under the sequence encoding rule is obtained by summing the byte lengths of the individual differences.
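The three size assessments above reduce to simple arithmetic over the dictionary statistics. The sketch below shows one possible form. The patent gives no byte layout, so `ROW_NO_BYTES`, the `value_bytes` parameter, and the signed-integer width used in `byte_len` are all assumptions, and the function names are hypothetical.

```python
from typing import List

ROW_NO_BYTES = 4  # assumed width of a row number or run start position


def constant_size(weights: List[int], n_rows: int, value_bytes: int) -> int:
    """Default value plus an exception table of (row number, value) pairs."""
    exceptions = n_rows - max(weights)  # rows that differ from the default
    return value_bytes + exceptions * (ROW_NO_BYTES + value_bytes)


def run_length_size(refs: List[int], value_bytes: int) -> int:
    """One (value, start position) record per run of equal neighbors."""
    runs = sum(1 for i, r in enumerate(refs) if i == 0 or r != refs[i - 1])
    return runs * (value_bytes + ROW_NO_BYTES)


def byte_len(n: int) -> int:
    """Bytes sufficient to hold the signed integer n."""
    return max(1, (n.bit_length() + 8) // 8)


def sequence_size(column: List[int], value_bytes: int) -> int:
    """Base value plus the difference of every later row from the first row."""
    base = column[0]
    return value_bytes + sum(byte_len(v - base) for v in column[1:])
```

For example, with weights `[8, 1, 1]` over 10 rows of 8-byte values, `constant_size` counts 2 exceptions and estimates 8 + 2 × 12 = 32 bytes, against 80 bytes for the raw column.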
The inter-column compression rules of step 4) are a deep compression scheme applied on top of the intra-column compression rules; they achieve compression at a coarser granularity by mining data relationships between attribute columns. The implementation is as follows:
1) Sort all attribute columns in ascending order by the total number of entries in their dictionary tables; then traverse them in order, comparing attribute columns pairwise in the manner of bubble sort, and during the traversal build an inter-column rule mining auxiliary structure for each attribute column in turn. The auxiliary structure is built by traversing the dictionary reference list of each attribute column and, from the row number and dictionary entry of each item in the reference list, reconstructing for every dictionary entry a table of the set of positions at which it appears.
2) Judge the inter-column equality rule condition. While all attribute columns in the packet are being traversed in order, once the rule mining auxiliary structure has been built for the current column, the current attribute column is taken as the column to be compressed and all subsequent attribute columns are traversed in turn to determine whether any of them has an equality rule with the current column. The equality rule judgment mainly comprises the following steps: a) scan the dictionary table of the current column in ascending order of weight, and for each dictionary entry find all the row positions of that entry from the auxiliary structure built earlier; b) traverse the reference list of the target column at all the corresponding row numbers and find the corresponding dictionary entries; c) for each dictionary entry, count the entries that are not equal to the target column at the same row numbers and add the count to the global exception count; d) estimate from the exception count the size of the attribute column after equality-rule compression; if it exceeds half of the size of the current attribute column after intra-column rule encoding, abandon equality rule mining between the current column and the target column.
3) Judge the inter-column derivation rule condition: determine whether a one-to-one relationship exists between the attribute element values of two attribute columns; if it does, the target attribute column need only be stored as a derivation table that maps the attribute values of the reference column to its own. The derivation rule judgment mainly comprises the following steps: a) scan the dictionary table of the current column in descending order of weight, and find all the row positions of each dictionary entry from the dictionary entry position table of the auxiliary structure; b) obtain the attribute values of the target column at all the corresponding row numbers, choose the one that occurs most often as the derivation relationship between the current column and the target column for this dictionary entry, and add the positions at which other attribute values appear in the target column to the global exception count; c) estimate from the exception count the size of the attribute column after derivation-rule compression; if it exceeds half of the size of the attribute column after intra-column rule encoding, abandon derivation rule mining between the current column and the target column.
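The auxiliary position table and the two inter-column judgments can be sketched as below. This works on raw column values rather than dictionary entry numbers for brevity, ignores the weight-ordered scan (which affects only traversal order, not the counts), and uses hypothetical function names throughout.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Tuple


def position_table(column: List[Hashable]) -> Dict[Hashable, List[int]]:
    """Auxiliary structure: each distinct value -> row numbers where it occurs."""
    table: Dict[Hashable, List[int]] = defaultdict(list)
    for row_no, value in enumerate(column):
        table[value].append(row_no)
    return dict(table)


def equality_exceptions(current: List[Hashable], target: List[Hashable]) -> int:
    """Count rows where the two columns differ (the global exception count)."""
    exceptions = 0
    for value, rows in position_table(current).items():
        exceptions += sum(1 for r in rows if target[r] != value)
    return exceptions


def derive(current: List[Hashable], target: List[Hashable]) -> Tuple[Dict[Hashable, Hashable], int]:
    """Build the derivation table: current value -> most common target value."""
    mapping: Dict[Hashable, Hashable] = {}
    exceptions = 0
    for value, rows in position_table(current).items():
        counts = Counter(target[r] for r in rows)
        best, hits = counts.most_common(1)[0]
        mapping[value] = best
        exceptions += len(rows) - hits  # rows the derivation table gets wrong
    return mapping, exceptions


def keep_inter_column_rule(inter_size: int, intra_size: int) -> bool:
    """Keep the rule only if it costs at most half of the intra-column size."""
    return inter_size * 2 <= intra_size
```

With a city column `["Paris", "Lyon", "Paris", "Berlin", "Paris"]` and a country column `["FR", "FR", "FR", "DE", "DE"]`, `derive` maps Paris to FR and records the last row as the single exception.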
After intra-column and inter-column rule discovery and encoding assessment have been completed for all attribute columns, step 5) traverses all attribute columns in the current data packet one by one. If the current attribute column can be compressed by another attribute column through an inter-column rule, it is encoded directly according to that inter-column compression rule. Otherwise, the estimated sizes of the four intra-column compression rules assessed earlier for this attribute column are checked, and the intra-column rule with the highest estimated compression ratio is chosen to encode the current attribute column. After all attribute columns have been rule-encoded, a basic descriptor of the rule encoding is added to the header of the compressed data, and the result is stored contiguously in the current compression buffer.
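The selection logic of step 5) is small enough to show in full. In this sketch the rule labels ("dictionary", "constant", "run_length", "sequence", and the "equal:..." form) and the function names are hypothetical; the text names the rules but not any identifiers or descriptor format.

```python
from typing import Dict, Optional


def choose_rule(intra_sizes: Dict[str, int], inter_rule: Optional[str] = None) -> str:
    """Pick the encoding rule for one attribute column."""
    if inter_rule is not None:  # an accepted inter-column rule takes priority
        return inter_rule
    # Otherwise take the intra-column rule with the smallest estimated size.
    return min(intra_sizes, key=lambda rule: intra_sizes[rule])


def plan_packet(intra: Dict[str, Dict[str, int]], inter: Dict[str, str]) -> Dict[str, str]:
    """Choose a rule for every column; `inter` holds accepted inter-column rules."""
    return {col: choose_rule(sizes, inter.get(col)) for col, sizes in intra.items()}
```

Choosing the smallest estimated size is equivalent to choosing the highest estimated compression ratio, since all estimates for a column are measured against the same raw size.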
Step 6) then applies a back-end general-purpose compression algorithm to the rule-encoded data already in the current compression buffer, performing a deeper combined compression. This further reduces the storage space of the overall data, and the final compression ratio can be controlled through a user-specified compression level, so that once the compression ratio meets the needs of the actual database application, the time and processor resources required for final encoding and decoding are kept to a minimum. The specific implementation steps are as follows:
1) Sort the rule-encoded attribute column data in descending order of encoded size.
2) Compress each attribute column with the LZOP method; if the compressed size is larger than before, keep the rule-encoded storage. If the overall compression ratio reaches the compression level set by the upper-layer application, stop the compression process.
3) Try compressing each attribute column with the LZMA method instead; as soon as the overall compression ratio reaches the configured compression level, stop the compression process.
4) If the total size of the remaining attribute columns is less than one tenth of the total compressed size of the attribute columns already processed, stop the compression process.
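The four back-end steps can be sketched as follows. Several points are assumptions made only for this illustration: Python's standard library has no LZOP binding, so `zlib` at level 1 stands in for the fast first pass; the compression ratio is taken as raw size divided by stored size; and the one-tenth check of step 4) is applied inside each pass, ending that pass early. The function name `backend_compress` is hypothetical.

```python
import lzma
import zlib


def backend_compress(columns, raw_size, target_ratio):
    """columns: name -> rule-encoded bytes. Returns name -> (method, stored bytes)."""
    # 1) process columns in descending order of rule-encoded size.
    order = sorted(columns, key=lambda name: len(columns[name]), reverse=True)
    out = {name: ("rule", columns[name]) for name in order}

    def ratio():
        return raw_size / max(1, sum(len(data) for _, data in out.values()))

    passes = (
        ("fast", lambda data: zlib.compress(data, 1)),  # stand-in for LZOP (assumption)
        ("lzma", lzma.compress),
    )
    for method, compress in passes:
        for i, name in enumerate(order):
            # 2)/3) stop as soon as the requested overall ratio is reached.
            if ratio() >= target_ratio:
                return out
            # 4) skip the tail when it is under one tenth of what was already processed.
            done = sum(len(out[n][1]) for n in order[:i])
            remaining = sum(len(out[n][1]) for n in order[i:])
            if i > 0 and remaining < done / 10:
                break
            packed = compress(columns[name])
            # Keep the previous form if this pass does not shrink the column.
            if len(packed) < len(out[name][1]):
                out[name] = (method, packed)
    return out
```

Storing the method name next to each column's bytes lets a decoder choose the matching decompressor per column.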

Claims (7)

1. A multi-rule combined compression method based on database row and column mixed storage, characterized in that it is implemented by the following steps:
1) receive the data imported by the user, and reorganize all the data into a plurality of attribute columns according to the attribute schema of the user table;
2) for the data of each attribute column in the current data packet, build a dictionary structure and a weight table using the dictionary rule compression method;
3) for each attribute column, use the dictionary and weight information already built to estimate the column's size under each intra-column compression rule encoding, and by comparison choose the compression rule with the smallest space usage for each attribute column;
4) perform inter-column compression rule discovery based on the dictionary information of each attribute column; when a suitable rule is found, compare the estimated encoded size with that of the corresponding column's intra-column compression rule and choose the most space-efficient scheme;
5) rule-encode the data of each attribute column in the data packet according to the selected optimal compression rule scheme;
6) according to the target compression level, apply a general-purpose compression method to the rule-encoded column data for combined compression; compression of this data packet is complete once the expected compression ratio is reached.
2. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: in step 1) a compression buffer is used to buffer the externally imported data, and the imported data is stored in a row-column mixed manner; that is, each time the buffer has received a certain number of rows, those rows are treated as one independent data packet, and then, using the schema definition of the corresponding database table, the data within the packet is split by attribute column and stored column by column.
3. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: the purpose of dictionary encoding in step 2) is to build, for each column of data in the data packet, compression-related statistics for the subsequent rule selection; dictionary encoding is tightly integrated with data import and attribute column splitting so that the data is encoded dynamically; by the time the compression buffer has received a complete data packet, the data inside it is already stored as a whole in dictionary-encoded form; the method is implemented as follows:
1) create the auxiliary data structures for dictionary encoding on each attribute column in the data packet and initialize them; the dictionary table uses a static hash structure whose initial number of entries is twice the number of rows in the data packet, to keep the collision rate low;
2) after a new data tuple is imported into the compression buffer, each attribute element in that row is dispatched to the dictionary encoding structure of its attribute column, and the value and length of each element are obtained and accumulated to record the raw data size of each attribute column when no compression rule is applied;
3) for each newly added attribute element, the attribute column computes a hash into the dictionary table from the element's value, then uses the hash index to find the entry for that value in the dictionary table and updates the entry's weight. If a hash collision occurs, that is, the entry is already occupied by a different attribute value, quadratic probing is used to find the next candidate entry, and the preceding check and operation are repeated. Once the dictionary table entry for the attribute element has been found, the compression buffer replaces the attribute value with the number of its dictionary table entry and stores it in the reference list of the current attribute column;
4) during dictionary table maintenance, the overall size of the current dictionary encoding is evaluated each time a certain number of attribute elements has been inserted. First the raw size of all values of the current attribute column without compression is obtained; the storage size after dictionary encoding is then estimated from all existing entries of the dictionary table and the size of the reference list;
5) once the total number of data tuples received by the compression buffer reaches the row limit of the data packet, the dictionary encoding structure of each attribute column is complete. At this point the dictionary table of each attribute column is sorted with a quicksort suited to the column's data type: all existing entries are sorted in ascending order and re-inserted into the dictionary table, and the attribute column's reference list into the dictionary table is updated accordingly.
4. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: in step 3), for each attribute column, a comprehensive evaluation of the intra-column compression rules is carried out on the previously built dictionary encoding structure using the basic dictionary statistics; the specific process is as follows:
1) evaluate the constant encoding rule for the attribute column: scan the whole dictionary table for the entry with the largest weight and take it as the most likely constant default value; from the weight of this default value and the total number of rows in the data packet, estimate the number of entries in the exception table, and together with the storage length of the attribute estimate the size after constant encoding;
2) evaluate the run-length encoding rule for the attribute column: scan the attribute column's reference list into the dictionary table, merging consecutively repeated items during the sequential traversal into run-length entries, and finally combine the dictionary entry corresponding to each run to obtain the overall compressed size under run-length encoding;
3) evaluate the sequence encoding rule for the attribute column: traverse the attribute column in row order, compute the difference of each row relative to the value of the first row as the base, and then sum the byte lengths of the distinct differences to obtain the compressed size under the sequence encoding rule.
5. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: the inter-column compression rules proposed in step 4) perform further deep compression on top of the intra-column compression rules, achieving compression at a coarser granularity by mining the data relationships between attribute columns; the specific implementation is as follows:
1) sort all attribute columns in ascending order by the total number of entries in their dictionary tables; then traverse them in order, comparing attribute columns pairwise for rules in a bubble-sort fashion, and during the traversal build for each attribute column in turn an auxiliary structure for inter-column rule mining;
2) evaluate the inter-column equality rule: take the current attribute column as the column to be compressed and traverse all subsequent attribute columns in turn, determining whether any column satisfies the equality rule with the current column, and estimate the size of the current column after equality-rule compression;
3) evaluate the inter-column derivation rule: take the current attribute column as the deriving column and traverse all subsequent attribute columns in turn, determining whether there is a target column to which the current column can apply the equality rule, and estimate the size of the target column after derivation-rule compression.
6. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: after intra-column and inter-column rule discovery and encoding evaluation have been completed for all attribute columns, step 5) traverses all attribute columns in the current data packet one by one and rule-encodes each attribute column with the compression rule estimated to give the highest compression ratio.
7. The multi-rule combined compression method based on database row and column mixed storage according to claim 1, characterized in that: step 6) applies a back-end general-purpose compression algorithm to the rule-encoded data already in the current compression buffer, performing a deeper combined compression; this further reduces the storage space of the overall data, and the final compression ratio can be controlled through a user-specified compression level, so that once the compression meets the needs of the actual database application, the time and processor resources required for final encoding and decoding are kept to a minimum. The specific implementation steps are as follows:
1) sort the rule-encoded attribute column data in descending order of encoded size;
2) compress each attribute column with the LZOP method; if the compressed size is larger than before, keep the rule-encoded storage; if the overall compression ratio reaches the compression level set by the upper-layer application, stop the compression process;
3) try compressing each attribute column with the LZMA method instead; as soon as the overall compression ratio reaches the configured compression level, stop the compression process;
4) if the total size of the remaining attribute columns is less than one tenth of the total compressed size of the attribute columns already processed, stop the compression process.
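As an illustration of the dictionary construction in claim 3 and two of the size estimates in claim 4, a minimal sketch follows. It is not the claimed implementation: the function names, the use of Python's built-in `hash`, the `None` sentinel for empty entries (so `None` cannot be a column value here), and the byte-cost model in `estimate_constant` are assumptions. The table size of twice the row count and the quadratic probing follow the claim text.

```python
def build_dictionary(values):
    """Dictionary-encode one attribute column; returns (slots, weights, refs)."""
    size = max(1, 2 * len(values))   # entries = twice the packet row count
    slots = [None] * size            # attribute value held by each entry
    weights = [0] * size             # occurrence count (weight) of each entry
    refs = []                        # reference list: entry number for every row
    for value in values:
        home = hash(value) % size
        for k in range(size):
            idx = (home + k * k) % size          # quadratic probing on collision
            if slots[idx] is None or slots[idx] == value:
                break
        else:
            raise RuntimeError("probe sequence exhausted")
        slots[idx] = value
        weights[idx] += 1
        refs.append(idx)
    return slots, weights, refs


def estimate_constant(weights, n_rows, value_len, row_id_len=4):
    """Constant encoding: one default value plus an exception table (hypothetical costs)."""
    exceptions = n_rows - max(weights)
    return value_len + exceptions * (row_id_len + value_len)


def count_runs(refs):
    """Run-length encoding: number of runs of consecutive equal references."""
    return sum(1 for i, r in enumerate(refs) if i == 0 or r != refs[i - 1])
```

For the column `[7, 7, 3, 7, 19, 3]` the table has 12 entries; 19 collides with 7 at entry 7 and is placed at entry 8 by the first probe, and the original values are recovered as `[slots[r] for r in refs]`.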
CN2012102093622A 2012-06-25 2012-06-25 Multi-rule combined compression method based on database row and column mixed storage Pending CN102737132A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012102093622A CN102737132A (en) 2012-06-25 2012-06-25 Multi-rule combined compression method based on database row and column mixed storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012102093622A CN102737132A (en) 2012-06-25 2012-06-25 Multi-rule combined compression method based on database row and column mixed storage

Publications (1)

Publication Number Publication Date
CN102737132A true CN102737132A (en) 2012-10-17

Family

ID=46992633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012102093622A Pending CN102737132A (en) 2012-06-25 2012-06-25 Multi-rule combined compression method based on database row and column mixed storage

Country Status (1)

Country Link
CN (1) CN102737132A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473276A (en) * 2013-08-26 2013-12-25 广东电网公司电力调度控制中心 Storage method of very large data and distributed database system and retrieval method thereof
CN104424287A (en) * 2013-08-30 2015-03-18 深圳市腾讯计算机系统有限公司 Query method and query device for data
CN104462334A (en) * 2014-12-03 2015-03-25 天津南大通用数据技术股份有限公司 Data compression method and device for packing database
CN104572893A (en) * 2014-12-24 2015-04-29 天津南大通用数据技术股份有限公司 Hybrid storage method for data in database
CN104657426A (en) * 2015-01-22 2015-05-27 江苏瑞中数据股份有限公司 Unified-view-based row and column hybrid data storage model establishment method
CN104753539A (en) * 2013-12-26 2015-07-01 中国移动通信集团公司 Data compression method and device
CN105306063A (en) * 2015-10-12 2016-02-03 浙江大学 Optimization and recovery methods for record type data storage space
CN105589969A (en) * 2015-12-23 2016-05-18 浙江大华技术股份有限公司 Data processing method and device
CN106033377A (en) * 2015-03-13 2016-10-19 中国移动通信集团浙江有限公司 Data disaster tolerance method and disaster tolerance server
CN106528896A (en) * 2016-12-29 2017-03-22 网易(杭州)网络有限公司 Database optimization method and apparatus
CN106557494A (en) * 2015-09-25 2017-04-05 北京国双科技有限公司 Update the method and device of row storage table
CN107682016A (en) * 2017-09-26 2018-02-09 深信服科技股份有限公司 A kind of data compression method, data decompression method and related system
CN107851063A (en) * 2015-07-28 2018-03-27 华为技术有限公司 The dynamic coding algorithm of intelligently encoding accumulator system
CN109101516A (en) * 2017-11-30 2018-12-28 新华三大数据技术有限公司 A kind of data query method and server
WO2019114753A1 (en) * 2017-12-12 2019-06-20 清华大学 Method and device for storing time sequence data with adaptive code length
CN109995373A (en) * 2018-01-03 2019-07-09 上海艾拉比智能科技有限公司 A kind of mixing packing compression method of integer array
CN110147202A (en) * 2019-05-15 2019-08-20 杭州云象网络技术有限公司 A method of reducing block chain intelligence contract code storage volume
CN110268397A (en) * 2016-12-30 2019-09-20 日彩电子科技(深圳)有限公司 Effectively optimizing data layout method applied to data warehouse
CN113688127A (en) * 2020-05-19 2021-11-23 Sap欧洲公司 Data compression technique
CN114782148A (en) * 2022-06-16 2022-07-22 青岛农村商业银行股份有限公司 Agricultural product purchase management platform and business data compression method thereof
CN116185971A (en) * 2023-04-27 2023-05-30 济宁市质量计量检验检测研究院(济宁半导体及显示产品质量监督检验中心、济宁市纤维质量监测中心) Intelligent processing system for electronic pressure weighing data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1770150A (en) * 2004-11-03 2006-05-10 北京神舟航天软件技术有限公司 Database compression and decompression method
CN102112962A (en) * 2008-07-31 2011-06-29 微软公司 Efficient column based data encoding for large-scale data storage

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473276A (en) * 2013-08-26 2013-12-25 广东电网公司电力调度控制中心 Storage method of very large data and distributed database system and retrieval method thereof
CN103473276B (en) * 2013-08-26 2017-08-25 广东电网公司电力调度控制中心 Ultra-large type date storage method, distributed data base system and its search method
CN104424287B (en) * 2013-08-30 2019-06-07 深圳市腾讯计算机系统有限公司 Data query method and apparatus
CN104424287A (en) * 2013-08-30 2015-03-18 深圳市腾讯计算机系统有限公司 Query method and query device for data
CN104753539A (en) * 2013-12-26 2015-07-01 中国移动通信集团公司 Data compression method and device
CN104462334A (en) * 2014-12-03 2015-03-25 天津南大通用数据技术股份有限公司 Data compression method and device for packing database
CN104572893A (en) * 2014-12-24 2015-04-29 天津南大通用数据技术股份有限公司 Hybrid storage method for data in database
CN104572893B (en) * 2014-12-24 2018-02-27 天津南大通用数据技术股份有限公司 A kind of data mixing storage method in database
CN104657426A (en) * 2015-01-22 2015-05-27 江苏瑞中数据股份有限公司 Unified-view-based row and column hybrid data storage model establishment method
CN104657426B (en) * 2015-01-22 2018-07-03 江苏瑞中数据股份有限公司 A kind of method for building up of the ranks blended data storage model based on unified view
CN106033377B (en) * 2015-03-13 2019-06-28 中国移动通信集团浙江有限公司 Data disaster tolerance method and Disaster Recovery Service
CN106033377A (en) * 2015-03-13 2016-10-19 中国移动通信集团浙江有限公司 Data disaster tolerance method and disaster tolerance server
CN107851063A (en) * 2015-07-28 2018-03-27 华为技术有限公司 The dynamic coding algorithm of intelligently encoding accumulator system
CN107851063B (en) * 2015-07-28 2020-12-25 华为技术有限公司 Dynamic coding algorithm for intelligent coding memory system
CN106557494B (en) * 2015-09-25 2019-09-20 北京国双科技有限公司 Update the method and device of column storage table
CN106557494A (en) * 2015-09-25 2017-04-05 北京国双科技有限公司 Update the method and device of row storage table
CN105306063A (en) * 2015-10-12 2016-02-03 浙江大学 Optimization and recovery methods for record type data storage space
CN105306063B (en) * 2015-10-12 2018-11-02 浙江大学 A kind of optimization of recordable data memory space and restoration methods
CN105589969A (en) * 2015-12-23 2016-05-18 浙江大华技术股份有限公司 Data processing method and device
CN106528896B (en) * 2016-12-29 2019-05-14 网易(杭州)网络有限公司 A kind of database optimizing method and device
CN106528896A (en) * 2016-12-29 2017-03-22 网易(杭州)网络有限公司 Database optimization method and apparatus
CN110268397B (en) * 2016-12-30 2023-06-13 日彩电子科技(深圳)有限公司 Efficient optimized data layout method applied to data warehouse system
CN110268397A (en) * 2016-12-30 2019-09-20 日彩电子科技(深圳)有限公司 Effectively optimizing data layout method applied to data warehouse
CN107682016A (en) * 2017-09-26 2018-02-09 深信服科技股份有限公司 A kind of data compression method, data decompression method and related system
CN109101516A (en) * 2017-11-30 2018-12-28 新华三大数据技术有限公司 A kind of data query method and server
US11269881B2 (en) 2017-11-30 2022-03-08 New H3C Big Data Technologies Co., Ltd. Data query
WO2019114753A1 (en) * 2017-12-12 2019-06-20 清华大学 Method and device for storing time sequence data with adaptive code length
US11101818B2 (en) 2017-12-12 2021-08-24 Tsinghua University Method and device for storing time series data with adaptive length encoding
CN109995373A (en) * 2018-01-03 2019-07-09 上海艾拉比智能科技有限公司 A kind of mixing packing compression method of integer array
CN109995373B (en) * 2018-01-03 2023-08-15 上海艾拉比智能科技有限公司 Mixed packing compression method for integer arrays
CN110147202A (en) * 2019-05-15 2019-08-20 杭州云象网络技术有限公司 A method of reducing block chain intelligence contract code storage volume
CN113688127A (en) * 2020-05-19 2021-11-23 Sap欧洲公司 Data compression technique
CN114782148A (en) * 2022-06-16 2022-07-22 青岛农村商业银行股份有限公司 Agricultural product purchase management platform and business data compression method thereof
CN114782148B (en) * 2022-06-16 2022-09-02 青岛农村商业银行股份有限公司 Agricultural product purchase management platform and business data compression method thereof
CN116185971A (en) * 2023-04-27 2023-05-30 济宁市质量计量检验检测研究院(济宁半导体及显示产品质量监督检验中心、济宁市纤维质量监测中心) Intelligent processing system for electronic pressure weighing data

Similar Documents

Publication Publication Date Title
CN102737132A (en) Multi-rule combined compression method based on database row and column mixed storage
CN102521386B (en) Method for grouping space metadata based on cluster storage
CN102419752B (en) Industrial database message storage method
CN105205146B (en) A method of calculating microblog users influence power
CN103020296B (en) The large data processing method of a kind of High-precision multi-dimensional counting Bloom Filter
CN105468642A (en) Data storage method and apparatus
CN103714145A (en) Relational and Key-Value type database spatial data index method
CN101963982A (en) Method for managing metadata of redundancy deletion and storage system based on location sensitive Hash
CN102737123B (en) A kind of multidimensional data distribution method
CN101478608A (en) Fast operating method for mass data based on two-dimensional hash
CN104750432B (en) A kind of date storage method and device
CN103345496A (en) Multimedia information searching method and system
CN103019887A (en) Data backup method and device
CN106326475A (en) High-efficiency static hash table implement method and system
Islambekov et al. Unsupervised space–time clustering using persistent homology
CN102420831A (en) Multi-domain network packet classification method
Dai et al. Improving load balance for data-intensive computing on cloud platforms
CN101499097A (en) Hash table based data stream frequent pattern internal memory compression and storage method
CN106407221B (en) Address data retrieval method and device
CN105302838A (en) Classification method as well as search method and device
Tang et al. A hybrid index for multi-dimensional query in HBase
CN104112025A (en) Partitioning method for processing virtual asset data based on perception of node computing power
CN102831146A (en) Database substring filtering index system and method for constructing and inquiring database substring filtering index system
CN102567432A (en) Intelligent information adaptation method and device for the same
CN104794237A (en) Web page information processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
DD01 Delivery of document by public notice

Addressee: Tianjin Shenzhou General Data Co., Ltd.

Document name: the First Notification of an Office Action

DD01 Delivery of document by public notice

Addressee: Tianjin Shenzhou General Data Co., Ltd.

Document name: Notification that Application Deemed to be Withdrawn

C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20121017