CN106599112A - Massive incomplete data storage and operation method

Info

Publication number: CN106599112A
Application number: CN201611081152.4A
Authority: CN
Inventors: 王妍; 杨钧; 李俊; 吴阳; 宋宝燕
Original assignee: Liaoning University
Current assignee: Liaoning University
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2017-04-26

Abstract

The invention relates to a method for storing and operating on massive incomplete data. The method applies different processing strategies to complete and incomplete data according to their characteristics, marks the fields of incomplete data whose attributes are missing, and builds a corresponding index. The index file is then compressed again, partitioned by attribute, to save storage space; queries are answered without decompression by means of an encoding dictionary; and deletion, modification and insertion of massive incomplete data are implemented on the basis of the query. The method skips data cleaning and operates directly on the massive incomplete data, so storage space can be substantially reduced, the compressed position of incomplete data can be located quickly, and fast queries, accurate deletion, complete modification results and efficient insertion are ensured.

Description

A method for storing and operating on massive incomplete data

Technical field

The present invention relates to a method for storing and operating on massive incomplete data, and belongs to the technical field of big data.

Background art

In recent years, with the rapid development of the Internet, the scale of data has grown continuously. Equipment failures and human factors can cause data loss and thereby produce massive incomplete data. These problems seriously limit the practical value of the data, so storing and operating on massive data with missing attribute values is of practical significance.

At present, incomplete data is usually handled by filling in the missing attributes, but this predictive filling often introduces errors into the data. For massive data, cleaning the data first and operating on it afterwards has several drawbacks. First, the time overhead of cleaning massive data is excessive. Second, the cleaning result can be affected by uncertain factors, so it may introduce new "noise" that makes the result inaccurate. Finally, data cleaning also raises timeliness problems, so that much time-sensitive data loses its meaning. The present work skips data cleaning and compresses and operates on massive incomplete data directly, whereas most existing methods process complete data. A more efficient processing method suited to incomplete data is therefore designed: a compression-based storage and operation method for massive incomplete data (compression-based Operated Method for Massive Incomplete data, OM-MI). The method can quickly locate the compressed position of the data to be operated on, improves operating efficiency, and can also significantly reduce storage space.

Massive data is generally compressed by means of an encoding dictionary. Such methods store the data by row, and operations on the compressed code positions are translated into processing of the original data. The encoding dictionary approach allows direct querying without decompression, which improves efficiency, but it adds the cost of matching against and maintaining the dictionary, so it is generally used in systems where operations are infrequent. Current storage and operation methods for massive incomplete data have many shortcomings. Most existing methods clean first and process afterwards, but the cost of cleaning massive data is too high, and cleaning also leads to data loss and timeliness problems. The OM-MI method is therefore proposed: it skips data cleaning and operates on massive incomplete data directly, and it can greatly reduce storage space, quickly locate compressed files, and improve operating efficiency.

Summary of the invention

In view of the deficiencies of the prior art, the present invention provides a method for storing and operating on massive incomplete data.

The present invention is achieved through the following technical solution: a method for storing and operating on massive incomplete data, characterized in that it comprises the following steps:

(1) When massive incomplete data is stored, complete data and incomplete data are compressed and stored separately. The implementation steps are as follows:

(1.1) For a massive-information database system, collect statistics on the frequently used data manipulation statements, i.e. the predicates that appear after WHERE in query statements, and divide these predicates into deterministic predicates Def_Val and uncertain predicates Undef_Val;

Here, a deterministic predicate Def_Val is a predicate that is already fully determined before the operation is issued, typically a frequently used fixed-range operation such as "Age>55". The attribute name and attribute value of a deterministic predicate are fixed and appear as a whole.

An uncertain predicate Undef_Val is a predicate that cannot be fully determined before the operation is issued, typically a frequently used equality operation whose value is not fixed. Such a predicate tests whether a certain attribute value exists in a record, for example "Name=***". The attribute name of an uncertain predicate is fixed, while its attribute value is variable.

(1.2) After all deterministic predicates and uncertain predicates have been obtained, during compressed storage the attribute values of the uncertain predicates and the deterministic predicates satisfied by each tuple are stored in the form of a data index, while the tuple itself is stored in the corresponding to-be-compressed cache block. When a cache block is full, it is compressed and stored in order, and the address of the compressed file that holds the tuples is stored in the database;

The index table includes the following fields: Id, Tp_Id, Undef_Val_1 ... Undef_Val_i, Def_Val, Block_Id, Delet_Flag, Com_Flag;

where Id is the index number; Tp_Id is the tuple number; field Undef_Val_i is the attribute value of the i-th uncertain predicate of the current tuple; field Def_Val stores, in bit-encoded form, the deterministic predicates that the current tuple satisfies;

Block_Id is the number of the cache block that contains the current tuple;

Delet_Flag is the deletion flag: it is set to 1 if the data needs to be deleted, otherwise 0;

Com_Flag is the tuple completeness flag: it is set to 1 if the tuple is complete, otherwise 0;

The compressed file address table includes the following fields: Block_Id, Address; where Block_Id is the number of the cache block that contains the current tuple, and field Address is the compressed file that corresponds to the cache block after compression;

For n deterministic predicates Q_1, Q_2, ..., Q_n, let the bit encoding of field Def_Val of the current tuple be B_1B_2...B_n. If the current tuple satisfies condition Q_i, then B_i=1, otherwise B_i=0. If the data needs to be deleted, Delet_Flag=1, otherwise it is set to 0. For m predicates Q_1, Q_2, ..., Q_m, if the values of all Q_i exist, then Com_Flag=1; otherwise Com_Flag=0;
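The flag definitions above can be illustrated with a minimal Python sketch. It assumes that a tuple is a dictionary and that a missing attribute is represented by `None`; the function names `def_val_bits` and `com_flag`, the attribute names and the sample values are hypothetical and are not taken from the patent.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]  # assumption: None marks a missing attribute value


def def_val_bits(t: Row, predicates: List[Callable[[Row], bool]]) -> str:
    """Return B_1...B_n, where B_i is '1' if t satisfies deterministic predicate Q_i."""
    return "".join("1" if q(t) else "0" for q in predicates)


def com_flag(t: Row, attributes: List[str]) -> int:
    """Return 1 if every listed attribute has a value in t, otherwise 0."""
    return 1 if all(t.get(a) is not None for a in attributes) else 0


# Hypothetical deterministic predicates Q_1: Age>35 and Q_2: Salary>6000.
predicates = [lambda t: t["Age"] > 35, lambda t: t["Salary"] > 6000]
complete_row = {"Name": "Li", "Age": 40, "Salary": 5000}
incomplete_row = {"Name": "Wang", "Age": None, "Salary": 7000}

bits = def_val_bits(complete_row, predicates)
flags = (com_flag(complete_row, ["Name", "Age", "Salary"]),
         com_flag(incomplete_row, ["Name", "Age", "Salary"]))
```

The sketch evaluates Def_Val only for a complete tuple; an incomplete tuple needs the separate treatment described for "*"-marked fields.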

(1.3) The complete index file is obtained by joining the index table and the compressed file address table on attribute Block_Id. Different Def_Val values correspond to different cache blocks and therefore to different compressed file addresses. After a cache block has been compressed and stored, a new value is assigned to Block_Id, so the same Def_Val value can correspond to different cache blocks and compressed file addresses;
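The join on Block_Id can be sketched as follows. This is a minimal illustration that assumes both tables are lists of dictionaries; the function name `join_on_block_id` and the sample rows are hypothetical.

```python
def join_on_block_id(index_table, address_table):
    """Attach the Address of each row's block; None if the block is not compressed yet."""
    address_of = {row["Block_Id"]: row["Address"] for row in address_table}
    return [{**row, "Address": address_of.get(row["Block_Id"])} for row in index_table]


# Hypothetical sample: two tuples share Def_Val "10" but sit in different blocks.
index_table = [
    {"Tp_Id": 1, "Def_Val": "10", "Block_Id": 0},
    {"Tp_Id": 2, "Def_Val": "10", "Block_Id": 1},
]
address_table = [{"Block_Id": 0, "Address": "block_0.z"}]
index_file = join_on_block_id(index_table, address_table)
```

In this sample the second row has no address yet, which corresponds to a cache block that is still being filled.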

(1.4) Let D* be the massive incomplete data, and assume that D* contains m tuples and n attributes. After compression, an index file and compressed data are obtained. Assuming that the index file contains i tuples and j attributes, then i ≥ m and j ≤ n. For each tuple t of D*, first compute the Def_Val value of the deterministic predicates that t satisfies and the value of Com_Flag; write t into the to-be-compressed cache block Block_{Def_Val,Com_Flag} assigned to that Def_Val and Com_Flag; and insert the attribute values Undef_Vals of the uncertain query conditions of t, together with Def_Val, Block_Id_{Def_Val,Com_Flag}, Delet_Flag=0 and Com_Flag, into the index table. If Block_{Def_Query,Com_Flag} reaches the specified number of tuples, compress Block_{Def_Query,Com_Flag} with a chosen compression algorithm and write Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag} into the address table. The index file is compressed using K-of-N encoding, which finally yields the compressed data K, the compressed index file T and the encoding dictionary M;
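The buffering and flushing behaviour of step (1.4) can be sketched as below. This is only an illustration under stated assumptions: `zlib` and JSON stand in for the unspecified compression algorithm, an in-memory dictionary stands in for files on disk, and the class name `CompressedStore`, its methods and the `block_N.z` address format are hypothetical. The K-of-N compression of the index file is not shown.

```python
import json
import zlib
from collections import defaultdict


class CompressedStore:
    def __init__(self, block_size=2):
        self.block_size = block_size       # "specified number of tuples" per block
        self.buffers = defaultdict(list)   # (Def_Val, Com_Flag) -> pending tuples
        self.block_ids = {}                # (Def_Val, Com_Flag) -> Block_Id in use
        self.next_block_id = 0
        self.index_table = []              # rows of the index table
        self.address_table = {}            # Block_Id -> Address
        self.files = {}                    # Address -> compressed bytes

    def _block_id(self, key):
        if key not in self.block_ids:      # a fresh block gets a new Block_Id
            self.block_ids[key] = self.next_block_id
            self.next_block_id += 1
        return self.block_ids[key]

    def add(self, tp_id, t, undef_vals, def_val, com_flag):
        key = (def_val, com_flag)
        block_id = self._block_id(key)
        self.buffers[key].append((tp_id, t))
        self.index_table.append({
            "Id": len(self.index_table), "Tp_Id": tp_id, "Undef_Vals": undef_vals,
            "Def_Val": def_val, "Block_Id": block_id,
            "Delet_Flag": 0, "Com_Flag": com_flag,
        })
        if len(self.buffers[key]) >= self.block_size:
            self._flush(key)

    def _flush(self, key):
        block_id = self.block_ids.pop(key)  # same key will get a new Block_Id later
        payload = json.dumps(self.buffers.pop(key)).encode("utf-8")
        address = f"block_{block_id}.z"
        self.files[address] = zlib.compress(payload)
        self.address_table[block_id] = address
```

Because a flushed key receives a new Block_Id on its next use, the same Def_Val value can map to several blocks and addresses, as stated in step (1.3).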

(2) Based on the compressed storage of the massive incomplete data, the query operation on massive incomplete data is carried out. The implementation steps are as follows:

(2.1) First generate a query index from the query conditions. The generation rule of the query index is: the query statement is represented by Undef_Query and Def_Query;

(2.2) ① If only Def_Query exists in the query index, perform the selection directly on the compressed index file according to Def_Query;

② If Undef_Query exists in the query index, the corresponding code must first be looked up in the encoding dictionary according to Undef_Query;

(2.3) Perform the selection and projection operations on the compressed index file;

(2.4) Decompress the query result;
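The query steps can be sketched in Python as follows. The sketch makes several assumptions that the source does not specify: Def_Query is written as a bit pattern in which "?" means "any bit", each index row holds a single dictionary code in a field called `Undef_Code`, and blocks are `zlib`-compressed JSON held in memory. All function and field names beyond those in the index table are hypothetical.

```python
import json
import zlib


def matches(pattern, def_val):
    """True if every fixed bit of pattern equals the matching bit of def_val."""
    return all(p in ("?", b) for p, b in zip(pattern, def_val))


def query(index_rows, address_table, files, dictionary, def_query=None, undef_query=None):
    # Step (2.2): translate the uncertain predicate value into its dictionary code.
    code = None
    if undef_query is not None:
        code = dictionary.get(undef_query)
        if code is None:
            return []  # the value was never stored, so nothing can match
    # Step (2.3): selection on the index, then projection onto Block_Id and Tp_Id.
    wanted = {}
    for row in index_rows:
        if row["Delet_Flag"] == 1:
            continue
        if def_query is not None and not matches(def_query, row["Def_Val"]):
            continue
        if code is not None and row["Undef_Code"] != code:
            continue
        wanted.setdefault(row["Block_Id"], set()).add(row["Tp_Id"])
    # Step (2.4): decompress only the blocks that contain hits.
    results = []
    for block_id, tp_ids in wanted.items():
        block = json.loads(zlib.decompress(files[address_table[block_id]]))
        results.extend(t for tp_id, t in block if tp_id in tp_ids)
    return results


# Hypothetical store with one block and a dictionary-encoded Name column.
files = {"block_0.z": zlib.compress(json.dumps(
    [[1, {"Name": "Li", "Age": 40}], [2, {"Name": "Wang", "Age": 30}]]).encode("utf-8"))}
address_table = {0: "block_0.z"}
dictionary = {"Li": "10", "Wang": "01"}
index_rows = [
    {"Tp_Id": 1, "Undef_Code": "10", "Def_Val": "10", "Block_Id": 0, "Delet_Flag": 0},
    {"Tp_Id": 2, "Undef_Code": "01", "Def_Val": "00", "Block_Id": 0, "Delet_Flag": 0},
]
```

The selection runs entirely on codes and bit strings, so no data block is decompressed until the matching addresses are known.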

(3) Based on the compressed storage of the massive incomplete data, the deletion operation on massive incomplete data is carried out. The implementation steps are as follows:

(3.1) Parse the delete statement into a deletion index;

(3.2) ① If an uncertain predicate exists, look up its corresponding code in the encoding dictionary, then perform a selection on the tuples of the compressed index file T whose deletion flag equals 0 and whose completeness flag equals 1, and set their deletion flag to 1. Project the selection result onto Address, decompress the compressed packages at the corresponding addresses, delete the tuples according to the delete statement, import the tuples that were not deleted into the compression cache, and compress them once this is complete;

② If deterministic predicates exist and no uncertain predicate exists, perform a selection by Def_Val value on the tuples in T whose completeness flag equals 1, set their deletion flag to 1, project the selection result onto Address, and directly delete the compressed packages that correspond to those addresses. When the number of tuples whose deletion flag equals 1 exceeds a threshold, delete them as a whole, then delete the corresponding encoding dictionary entries, and obtain the result;
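Case ① of the deletion can be sketched as below. The sketch is a simplification under stated assumptions: surviving tuples are recompressed in place at the same address rather than being routed through a separate compression cache, blocks are `zlib`-compressed JSON held in memory, and the field `Undef_Code` and the function name `delete_by_undef` are hypothetical.

```python
import json
import zlib


def delete_by_undef(index_rows, address_table, files, dictionary, value):
    """Flag matching complete tuples as deleted and rewrite the affected blocks."""
    code = dictionary.get(value)           # look up the code of the predicate value
    if code is None:
        return 0
    doomed = {}                            # Block_Id -> set of Tp_Id to remove
    for row in index_rows:
        if row["Delet_Flag"] == 0 and row["Com_Flag"] == 1 and row["Undef_Code"] == code:
            row["Delet_Flag"] = 1          # mark in the index first
            doomed.setdefault(row["Block_Id"], set()).add(row["Tp_Id"])
    for block_id, tp_ids in doomed.items():
        address = address_table[block_id]  # projection onto Address
        block = json.loads(zlib.decompress(files[address]))
        survivors = [entry for entry in block if entry[0] not in tp_ids]
        files[address] = zlib.compress(json.dumps(survivors).encode("utf-8"))
    return sum(len(ids) for ids in doomed.values())


# Hypothetical store with one block holding two complete tuples.
files = {"block_0.z": zlib.compress(json.dumps(
    [[1, {"Name": "Li"}], [2, {"Name": "Wang"}]]).encode("utf-8"))}
address_table = {0: "block_0.z"}
dictionary = {"Li": "10", "Wang": "01"}
index_rows = [
    {"Tp_Id": 1, "Undef_Code": "10", "Block_Id": 0, "Delet_Flag": 0, "Com_Flag": 1},
    {"Tp_Id": 2, "Undef_Code": "01", "Block_Id": 0, "Delet_Flag": 0, "Com_Flag": 1},
]
deleted = delete_by_undef(index_rows, address_table, files, dictionary, "Li")
```

Only the blocks that actually contain flagged tuples are decompressed and rewritten.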

(4) Based on the compressed storage of the massive incomplete data, the modification operation on massive incomplete data is carried out. The implementation steps are as follows:

(4.1) First use the query algorithm to find the set Address_Set that needs to be modified;

(4.2) Decompress the corresponding compressed packages and import them into the database, then perform the modification to obtain the set Tp_ID_SET of operated tuples;

(4.3) Import the tuples that were not operated on back into their previous compressed package; import the operated tuples into their Update_Buffer_{Def_Query,Com_Flag}, which is compressed once it reaches a certain number of tuples, usually 150;

(4.4) Modify the tuples in index file T whose Tp_ID belongs to Tp_ID_SET, including Def_Query, Com_Flag, Undef_Query and Block_ID;

(4.5) Modify the encoding dictionary;

(5) Based on the compressed storage of the massive incomplete data, the insertion operation on massive incomplete data is carried out. The implementation steps are as follows:

(5.1) For each tuple to be inserted, first compute its Def_Query and Com_Flag (the purpose of computing these two values is to determine the number of the cache block into which the tuple will subsequently be compressed);

(5.2) ① If the tuple is complete, write t into the to-be-compressed cache block Block_{Def_Query,Com_Flag};

② If the tuple is incomplete, write t into multiple to-be-compressed cache blocks Block_{Def_Query,Com_Flag};

(5.3) Once a cache block is full, compress it as a whole to obtain Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag};

(5.4) Insert the information of the tuple into the original encoding dictionary;

(5.5) Insert the new tuple into the compressed index file according to the encoding dictionary.
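The dictionary and index updates of steps (5.4) and (5.5) can be sketched as follows. The sketch assumes that the candidate Def_Val values and the target Block_Id of each cache block are already known, and it uses consecutive integers as stand-in dictionary codes instead of K-of-N codewords. The names `code_for`, `insert_tuple`, `Undef_Code` and `block_ids` are hypothetical.

```python
def code_for(dictionary, value):
    """Return the code of value, adding a new dictionary entry if it is unseen."""
    if value not in dictionary:
        dictionary[value] = len(dictionary)  # next free integer code (illustrative)
    return dictionary[value]


def insert_tuple(compressed_index, dictionary, tp_id, undef_val, def_vals, com_flag, block_ids):
    """Append one index row per target block; an incomplete tuple has several def_vals."""
    code = code_for(dictionary, undef_val)   # step (5.4): extend the dictionary
    for def_val in def_vals:                 # step (5.2): one block or several
        compressed_index.append({            # step (5.5): row stored in encoded form
            "Tp_Id": tp_id, "Undef_Code": code, "Def_Val": def_val,
            "Block_Id": block_ids[(def_val, com_flag)],
            "Delet_Flag": 0, "Com_Flag": com_flag,
        })


# Hypothetical state before the insert.
dictionary = {"Li": 0}
compressed_index = []
block_ids = {("01", 0): 7, ("11", 0): 8, ("10", 1): 3}
insert_tuple(compressed_index, dictionary, 2, "Wang", ["01", "11"], 0, block_ids)
```

An incomplete tuple therefore produces one index row per cache block it is written to, while the dictionary gains at most one new entry per distinct value.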

Beneficial effects of the present invention: compared with the prior art, the present invention proposes a method for storing and operating on massive incomplete data. The method skips data cleaning and operates directly on the massive incomplete data. It can significantly reduce storage space and quickly locate the compressed position of incomplete data, and it ensures fast queries, accurate deletion, complete modification results and efficient insertion.

Description of the drawings

Fig. 1 shows the structure of the index table.

Fig. 2 shows the structure of the address table.

Fig. 3 is a flow chart of incomplete data compression.

Fig. 4 is a flow chart of querying compressed incomplete data.

Fig. 5 is a flow chart of deleting compressed incomplete data.

Fig. 6 is a flow chart of modifying compressed incomplete data.

Fig. 7 is a flow chart of inserting compressed incomplete data.

Specific embodiments

A method for storing and operating on massive incomplete data, comprising the following steps:

Step 1: For a massive-information database system, collect statistics on the frequently used data manipulation statements, i.e. the predicates that appear after WHERE in query statements, and divide these predicates into deterministic predicates Def_Val and uncertain predicates Undef_Val;

Step 2: After all deterministic predicates and uncertain predicates have been obtained, during compressed storage the attribute values of the uncertain predicates and the deterministic predicates satisfied by each tuple are stored in the form of a data index, while the tuple itself is stored in the corresponding to-be-compressed cache block. When a cache block is full, it is compressed and stored in order, and the address of the compressed file that holds the tuples is stored in the database;

Block_Id is the number of the cache block that contains the current tuple;

The compressed file address table includes the following attributes: Block_Id, Address; where Block_Id is the number of the cache block that contains the current tuple, and field Address is the compressed file that corresponds to the cache block after compression;

Step 3: The complete index file is obtained by joining the index table and the compressed file address table on attribute Block_Id. Different Def_Val values correspond to different cache blocks and therefore to different compressed file addresses. After a cache block has been compressed and stored, a new value is assigned to Block_Id, so the same Def_Val value may correspond to different cache blocks and compressed file addresses;

Step 4: Let D* be the massive incomplete data, and assume that D* contains m tuples and n attributes. After compression, an index file and compressed data are obtained. Assuming that the index file contains i tuples and j attributes, then i ≥ m and j ≤ n. For each tuple t of D*, first compute the Def_Val value of the deterministic predicates that t satisfies and the value of Com_Flag; write t into the to-be-compressed cache block Block_{Def_Val,Com_Flag} assigned to that Def_Val and Com_Flag; and insert the attribute values Undef_Vals of the uncertain query conditions of t, together with Def_Val, Block_Id_{Def_Val,Com_Flag}, Delet_Flag=0 and Com_Flag, into the index table. If Block_{Def_Query,Com_Flag} reaches the specified number of tuples, compress Block_{Def_Query,Com_Flag} with a chosen compression algorithm and write Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag} into the address table. The index file is compressed using K-of-N encoding, which finally yields the compressed data K, the compressed index file T and the encoding dictionary M.

A concrete application of the method is as follows:

Table 1 is a regional population information table. A field whose attribute value is missing is marked with "*". Since such a field could hold any value, it should be regarded as satisfying any predicate. To ensure that query results are credible and meaningful, tuples that contain attributes marked with "*" must be compressed repeatedly. That is, for a tuple t that contains a "*"-marked field, two or even more Def_Val values may be computed, and t must then be written into each of the to-be-compressed cache blocks Block_{Def_Val,Com_Flag} assigned to these Def_Val values. Subsequent processing is the same as for tuples without marked fields.

Table 1: Population information table marked with "*"

Assume that the uncertain predicate is Name=*** and that the deterministic predicates are Age>35 and Salary>6000. Computation then gives a Def_Val value of 10 for tuple 1, while the Def_Val value of tuple 2 is 01 or 11, so tuple 2 must be compressed into both of the compressed files that correspond to 01 and 11. The resulting index file is shown in Table 2:

Table 2: Index file example
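The Def_Val values quoted in this example can be reproduced with a short Python sketch. Because the contents of Table 1 are not reproduced here, the attribute values below are hypothetical and were chosen only so that they yield the Def_Val values stated in the text; `None` stands for a field marked with "*", and the function name `candidate_def_vals` is an assumption.

```python
from itertools import product


def candidate_def_vals(t, predicates):
    """Enumerate every Def_Val that t may satisfy; a missing value allows bit 0 or 1."""
    options = []
    for attribute, test in predicates:
        value = t.get(attribute)
        if value is None:
            options.append("01")                       # "*" field: both bits possible
        else:
            options.append("1" if test(value) else "0")
    return ["".join(bits) for bits in product(*options)]


# Deterministic predicates from the example: Age>35 and Salary>6000.
predicates = [("Age", lambda v: v > 35), ("Salary", lambda v: v > 6000)]
tuple_1 = {"Name": "Li", "Age": 40, "Salary": 5000}      # hypothetical values
tuple_2 = {"Name": "Wang", "Age": None, "Salary": 7000}  # Age marked with "*"

vals_1 = candidate_def_vals(tuple_1, predicates)
vals_2 = candidate_def_vals(tuple_2, predicates)
```

A tuple with k missing attributes among the deterministic predicates yields up to 2^k candidate values, which is why the index file can contain more rows than the original data has tuples.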

Table 3: Compressed index file example

The index file is compressed using K-of-N encoding; after compression, the encoding dictionary M and the compressed index file Tc are obtained. Table 3 and Table 4 show, respectively, the index file obtained by encoding-dictionary compression and the encoding dictionary of the Address field.

Table 4: Encoding dictionary example for the Address field
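The source does not spell out how K-of-N codewords are assigned. The following sketch assumes the common reading in which each codeword is an N-bit string with exactly K ones, and it assigns codewords to the distinct values of one field in sorted order. The function names and the sample addresses are hypothetical.

```python
from itertools import combinations


def k_of_n_codewords(k, n):
    """Yield every n-bit string that contains exactly k ones."""
    for ones in combinations(range(n), k):
        yield "".join("1" if i in ones else "0" for i in range(n))


def build_dictionary(values, k, n):
    """Map each distinct value to one K-of-N codeword."""
    distinct = sorted(set(values))
    codes = list(k_of_n_codewords(k, n))
    if len(distinct) > len(codes):
        raise ValueError("not enough K-of-N codewords for the distinct values")
    return dict(zip(distinct, codes))


# Hypothetical Address column of an index file.
addresses = ["block_1.z", "block_0.z", "block_1.z", "block_2.z"]
dictionary = build_dictionary(addresses, k=2, n=4)
encoded_column = [dictionary[a] for a in addresses]
```

With K=2 and N=4 there are six codewords, so up to six distinct field values can be encoded; equality predicates can then be evaluated by comparing codewords without decoding the column.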

The operations above complete the compressed storage of the massive incomplete data. The compressed data obtained can then be operated on; the steps of the data operations are as follows.

1. Querying massive incomplete data:

Step 1: First generate a query index from the query conditions. The generation rule of the query index is: the query statement is represented by Undef_Query and Def_Query;

Step 2: ① If only Def_Query exists in the query index, perform the selection directly on the compressed index file according to Def_Query;

Step 3: Perform the selection and projection operations on the compressed index file;

Step 4: Decompress the query result.

2. Deleting massive incomplete data:

Step 1: Parse the delete statement into a deletion index;

Step 2: ① If an uncertain predicate exists, look up its corresponding code in the encoding dictionary, then perform a selection on the tuples of the compressed index file T whose deletion flag equals 0 and whose completeness flag equals 1, and set their deletion flag to 1. Project the selection result onto Address, decompress the compressed packages at the corresponding addresses, delete the tuples according to the delete statement, import the tuples that were not deleted into the compression cache, and compress them once this is complete;

② If deterministic predicates exist and no uncertain predicate exists, perform a selection by Def_Val value on the tuples in T whose completeness flag equals 1, set their deletion flag to 1, project the selection result onto Address, and directly delete the compressed packages that correspond to those addresses. When the number of tuples whose deletion flag equals 1 exceeds a threshold, delete them as a whole, then delete the corresponding encoding dictionary entries, and obtain the result.

3. Modifying massive incomplete data:

Step 1: First use the query algorithm to find the set Address_Set that needs to be modified;

Step 2: Decompress the corresponding compressed packages and import them into the database, then perform the modification to obtain the set Tp_ID_SET of operated tuples;

Step 3: Import the tuples that were not operated on back into their previous compressed package; import the operated tuples into their Update_Buffer_{Def_Query,Com_Flag}, which is compressed once it reaches a certain number of tuples;

Step 4: Modify the tuples in index file T whose Tp_ID belongs to Tp_ID_SET, including Def_Query, Com_Flag, Undef_Query and Block_ID;

Step 5: Modify the encoding dictionary.

4. Inserting massive incomplete data:

Step 1: For each tuple to be inserted, first compute its Def_Query and Com_Flag;

Step 2: ① If the tuple is complete, write t into the to-be-compressed cache block Block_{Def_Query,Com_Flag};

② If the tuple is incomplete, write t into multiple to-be-compressed cache blocks Block_{Def_Query,Com_Flag};

Step 3: Once a cache block is full, compress it as a whole to obtain Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag};

Step 4: Insert the information of the tuple into the original encoding dictionary;

Step 5: Insert the new tuple into the compressed index file according to the encoding dictionary.

Claims

1. A method for storing and operating on massive incomplete data, characterized in that it comprises the following steps:

(1) When massive incomplete data is stored, complete data and incomplete data are compressed and stored separately. The implementation steps are as follows:

(1.1) For a massive-information database system, collect statistics on the frequently used data manipulation statements, i.e. the predicates that appear after WHERE in query statements, and divide these predicates into deterministic predicates Def_Val and uncertain predicates Undef_Val;

(1.2) After all deterministic predicates and uncertain predicates have been obtained, during compressed storage the attribute values of the uncertain predicates and the deterministic predicates satisfied by each tuple are stored in the form of a data index, while the tuple itself is stored in the corresponding to-be-compressed cache block. When a cache block is full, it is compressed and stored in order, and the address of the compressed file that holds the tuples is stored in the database;

The index table includes the following fields: Id, Tp_Id, Undef_Val_1 ... Undef_Val_i, Def_Val, Block_Id, Delet_Flag, Com_Flag;

where Id is the index number; Tp_Id is the tuple number; field Undef_Val_i is the attribute value of the i-th uncertain predicate of the current tuple; field Def_Val stores, in bit-encoded form, the deterministic predicates that the current tuple satisfies;

Block_Id is the number of the cache block that contains the current tuple;

The compressed file address table includes the following fields: Block_Id, Address; where Block_Id is the number of the cache block that contains the current tuple, and field Address is the compressed file that corresponds to the cache block after compression;

For n deterministic predicates Q_1, Q_2, ..., Q_n, let the bit encoding of field Def_Val of the current tuple be B_1B_2...B_n. If the current tuple satisfies condition Q_i, then B_i=1, otherwise B_i=0. If the data needs to be deleted, Delet_Flag=1, otherwise it is set to 0. For m predicates Q_1, Q_2, ..., Q_m, if the values of all Q_i exist, then Com_Flag=1; otherwise Com_Flag=0;

(1.3) The complete index file is obtained by joining the index table and the compressed file address table on attribute Block_Id. Different Def_Val values correspond to different cache blocks and therefore to different compressed file addresses. After a cache block has been compressed and stored, a new value is assigned to Block_Id, so the same Def_Val value can correspond to different cache blocks and compressed file addresses;

(1.4) Let D* be the massive incomplete data, and assume that D* contains m tuples and n attributes. After compression, an index file and compressed data are obtained. Assuming that the index file contains i tuples and j attributes, then i ≥ m and j ≤ n. For each tuple t of D*, first compute the Def_Val value of the deterministic predicates that t satisfies and the value of Com_Flag; write t into the to-be-compressed cache block Block_{Def_Val,Com_Flag} assigned to that Def_Val and Com_Flag; and insert the attribute values Undef_Vals of the uncertain query conditions of t, together with Def_Val, Block_Id_{Def_Val,Com_Flag}, Delet_Flag=0 and Com_Flag, into the index table. If Block_{Def_Query,Com_Flag} reaches the specified number of tuples, compress Block_{Def_Query,Com_Flag} with a chosen compression algorithm and write Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag} into the address table. The index file is compressed using K-of-N encoding, which finally yields the compressed data K, the compressed index file T and the encoding dictionary M;

(2) Based on the compressed storage of the massive incomplete data, the query operation on massive incomplete data is carried out. The implementation steps are as follows:

(2.1) First generate a query index from the query conditions. The generation rule of the query index is: the query statement is represented by Undef_Query and Def_Query;

(2.2) ① If only Def_Query exists in the query index, perform the selection directly on the compressed index file according to Def_Query;

② If Undef_Query exists in the query index, the corresponding code must first be looked up in the encoding dictionary according to Undef_Query;

(2.3) Perform the selection and projection operations on the compressed index file;

(2.4) Decompress the query result;

(3) Based on the compressed storage of the massive incomplete data, the deletion operation on massive incomplete data is carried out. The implementation steps are as follows:

(3.1) Parse the delete statement into a deletion index;

(3.2) ① If an uncertain predicate exists, look up its corresponding code in the encoding dictionary, then perform a selection on the tuples of the compressed index file T whose deletion flag equals 0 and whose completeness flag equals 1, and set their deletion flag to 1. Project the selection result onto Address, decompress the compressed packages at the corresponding addresses, delete the tuples according to the delete statement, import the tuples that were not deleted into the compression cache, and compress them once this is complete;

② If deterministic predicates exist and no uncertain predicate exists, perform a selection by Def_Val value on the tuples in T whose completeness flag equals 1, set their deletion flag to 1, project the selection result onto Address, and directly delete the compressed packages that correspond to those addresses. When the number of tuples whose deletion flag equals 1 exceeds a threshold, delete them as a whole, then delete the corresponding encoding dictionary entries, and obtain the result;

(4) Based on the compressed storage of the massive incomplete data, the modification operation on massive incomplete data is carried out. The implementation steps are as follows:

(4.2) Decompress the corresponding compressed packages and import them into the database, then perform the modification to obtain the set Tp_ID_SET of operated tuples;

(4.5) Modify the encoding dictionary;

(5) Based on the compressed storage of the massive incomplete data, the insertion operation on massive incomplete data is carried out. The implementation steps are as follows:

(5.1) For each tuple to be inserted, first compute its Def_Query and Com_Flag;

(5.4) Insert the information of the tuple into the original encoding dictionary;