CN106599112A - Massive incomplete data storage and operation method - Google Patents

Massive incomplete data storage and operation method

Info

Publication number
CN106599112A
Authority
CN
China
Prior art keywords
compressed
flag
tuple
data
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611081152.4A
Other languages
Chinese (zh)
Inventor
王妍
杨钧
李俊
吴阳
宋宝燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning University
Original Assignee
Liaoning University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning University
Priority to CN201611081152.4A
Publication of CN106599112A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for storing and operating on massive incomplete data. The method applies different processing strategies to complete data and incomplete data according to their characteristics, marks the fields of incomplete data whose attributes are missing, and builds a corresponding index. The index file is then compressed again, partitioned by attribute, to save storage space; queries are answered without decompression by means of a coding dictionary; and deletion, modification and insertion of massive incomplete data are implemented on the basis of the query operation. The method skips data cleaning and operates directly on massive incomplete data. It can substantially reduce storage space and quickly locate the compressed position of incomplete data, ensuring fast queries, accurate deletions, complete modification results and efficient insertions.

Description

A massive incomplete data storage and operation method
Technical field
The present invention relates to a method for storing and operating on massive incomplete data, and belongs to the field of big data technology.
Background art
In recent years, with the rapid development of the Internet, data volumes have grown continuously. Equipment failures and human factors can cause data loss, producing massive incomplete data. These problems seriously limit the practical value of the data, so storing and operating on massive data with missing attribute values is of practical significance.
At present, incomplete data is usually handled by filling in the missing attributes, but such predictive filling often introduces errors into the data. For massive data, cleaning the data first and operating on it afterwards has several drawbacks. First, the time overhead of cleaning massive data is too high. Second, the cleaning result is affected by uncertain factors and may introduce new "noise", making the result inaccurate. Finally, data cleaning causes timeliness problems, so that much time-sensitive data loses its meaning. The approach studied here skips data cleaning and compresses and operates on massive incomplete data directly, whereas most existing methods handle only complete data. A more efficient processing method suited to incomplete data is therefore designed: the compression-based operated method for massive incomplete data (OM-MI). The method can quickly locate the compressed position of the data to be operated on, improves operating efficiency, and significantly reduces storage space.
Massive data is usually compressed with a coding dictionary. Such methods store the data by row and convert processing of the original data into operations on the compressed code bits. A coding dictionary allows direct querying without decompression, which improves efficiency, but it adds the cost of matching against and maintaining the dictionary, so it is generally used in systems where operations are infrequent. Current methods for storing and operating on massive incomplete data have many shortcomings. Most of them clean first and process afterwards, but cleaning massive data is too costly and also leads to data loss and timeliness problems. The OM-MI method is therefore proposed: it skips data cleaning and operates directly on massive incomplete data, which can greatly reduce storage space, quickly locate compressed files, and improve operating efficiency.
Summary of the invention
To address the deficiencies of the prior art, the present invention provides a method for storing and operating on massive incomplete data.
The invention is achieved through the following technical solution: a method for storing and operating on massive incomplete data, characterized by comprising the following steps:
(1) When massive incomplete data is stored, complete data and incomplete data are compressed and stored separately. The implementation steps are as follows:
(1.1) For a massive-data database system, collect statistics on the frequently used data manipulation statements, that is, the predicates that appear after WHERE in query statements, and divide these predicates into deterministic predicates Def_Val and uncertain predicates Undef_Val;
Here, a deterministic predicate Def_Val is a predicate that is already determined before the operation is issued, typically a frequently used fixed-range operation such as "Age>55". The attribute name and attribute value of a deterministic predicate are fixed and appear as a whole.
An uncertain predicate Undef_Val is a predicate that cannot be fully determined before the operation is issued. Such predicates are typically frequently used but not fixed, for example an equality test on whether a certain attribute value exists in a record, such as "Name=***". The attribute name of an uncertain predicate is fixed, while its attribute value is variable.
(1.2) After all deterministic and uncertain predicates are obtained, during compressed storage the attribute values of the uncertain predicates and the deterministic predicates satisfied by each tuple are stored as index data, and the tuple itself is stored in the corresponding cache block awaiting compression. When a cache block is full, it is compressed and stored in order, and the address of the compressed file containing the tuple is recorded in the database;
The index table contains the following fields: Id, Tp_Id, Undef_Val_1 ... Undef_Val_i, Def_Val, Block_Id, Delet_Flag, Com_Flag;
where Id is the index number; Tp_Id is the tuple number; the field Undef_Val_i is the attribute value of the i-th uncertain predicate for the current tuple; the field Def_Val stores, in bit-encoded form, the deterministic predicates satisfied by the current tuple;
Block_Id is the number of the cache block that holds the current tuple;
Delet_Flag is the deletion flag bit: it is set to 1 if the data is to be deleted and 0 otherwise;
Com_Flag is the tuple integrity flag bit: it is set to 1 if the tuple is complete and 0 otherwise;
The compressed file address table contains the following fields: Block_Id, Address; where Block_Id is the number of the cache block that holds the current tuple, and the field Address is the compressed file produced by compressing that cache block;
For n deterministic predicates Q1, Q2, ..., Qn, let the bit encoding of the Def_Val field of the current tuple be B1B2...Bn; if the current tuple satisfies condition Qi, then Bi=1, otherwise Bi=0. If the data is to be deleted, set Delet_Flag=1, otherwise set it to 0. For m predicates Q1, Q2, ..., Qm, if the values for every Qi all exist, then Com_Flag=1; otherwise Com_Flag=0;
(1.3) The complete index file is obtained by joining the index table and the compressed file address table on the attribute Block_Id. Different Def_Val values correspond to different cache blocks and therefore to different compressed file addresses. After a cache block is compressed and stored, Block_Id is assigned a new value, so the same Def_Val value may correspond to different cache blocks and compressed file addresses;
(1.4) For massive incomplete data D*, assume D* contains m tuples and n attributes. After compression, an index file and compressed data are obtained. Assume the index file contains i tuples and j attributes; then i ≥ m and j ≤ n. For each tuple t of D*, first compute the Def_Val value of the deterministic predicates that t satisfies and the value of Com_Flag, and write t into the cache block Block_{Def_Val,Com_Flag} awaiting compression that is allocated for that Def_Val and Com_Flag; then insert the attribute values Undef_Vals of t's uncertain query conditions, Def_Val, Block_Id_{Def_Val,Com_Flag}, Delet_Flag=0 and Com_Flag into the index table. If Block_{Def_Query,Com_Flag} reaches the specified number of tuples, compress Block_{Def_Query,Com_Flag} with a compression algorithm, and write Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag} into the address table. The index file is compressed using K-of-N encoding, which finally yields the compressed data K, the compressed index file T and the coding dictionary M;
(2) Based on the compressed storage of massive incomplete data, the query operation on massive incomplete data is carried out as follows:
(2.1) First generate a query index from the query conditions. The generation rule is: represent the query statement using Undef_Query and Def_Query;
(2.2) 1. If the query index contains only Def_Query, perform the selection directly on the compressed index file according to Def_Query;
2. If the query index contains Undef_Query, look up the corresponding code in the coding dictionary according to Undef_Query;
(2.3) Perform the selection and projection operations on the compressed index file;
(2.4) Decompress the query result;
(3) Based on the compressed storage of massive incomplete data, the deletion operation on massive incomplete data is carried out as follows:
(3.1) Parse the delete statement into a delete index;
(3.2) 1. If an uncertain predicate exists, look up its corresponding code in the coding dictionary, then select from the compressed index file T the tuples whose deletion flag equals 0 and whose integrity flag equals 1, and set their deletion flag to 1. Project the selection result onto Address, decompress the compressed packages at the corresponding addresses, delete tuples according to the delete statement, import the tuples that were not deleted into the compression cache, and compress them when done;
2. If a deterministic predicate exists and no uncertain predicate exists, select from T, by Def_Val value, the tuples whose integrity flag equals 1, set their deletion flag to 1, project the selection result onto Address, and directly delete the compressed packages at the corresponding addresses. When the number of tuples whose deletion flag equals 1 exceeds a threshold, delete them as a whole, then delete the corresponding coding dictionary entries to obtain the result;
(4) Based on the compressed storage of massive incomplete data, the modification operation on massive incomplete data is carried out as follows:
(4.1) First use the query algorithm to find the Address_Set that needs modification;
(4.2) Decompress the corresponding compressed packages and import them into the database, perform the modification, and obtain the set Tp_ID_SET of operated tuples;
(4.3) Import the tuples that were not operated on back into their previous compressed package; import the operated tuples into their Update_Buffer_{Def_Query,Com_Flag}, and compress once a certain number is reached, usually 150;
(4.4) Modify the tuples in the index file T whose Tp_ID belongs to Tp_ID_SET, including Def_Query, Com_Flag, Undef_Query and Block_ID;
(4.5) Modify the coding dictionary;
(5) Based on the compressed storage of massive incomplete data, the insertion operation on massive incomplete data is carried out as follows:
(5.1) For each tuple t to be inserted, first compute its Def_Query and Com_Flag (these two values are computed in order to determine the number of the subsequent compression cache block);
(5.2) 1. If t is complete, write t into the cache block Block_{Def_Query,Com_Flag} awaiting compression;
2. If t is incomplete, write t into multiple cache blocks Block_{Def_Query,Com_Flag} awaiting compression;
(5.3) When a cache block is full, compress it as a whole, obtaining Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag};
(5.4) Insert the information of the tuple into the original coding dictionary;
(5.5) Insert the new tuple into the compressed index file according to the coding dictionary.
Beneficial effects of the invention: compared with the prior art, the present invention proposes a method for storing and operating on massive incomplete data. The method skips data cleaning and operates directly on massive incomplete data. It can significantly reduce storage space and quickly locate the compressed position of incomplete data, ensuring fast queries, accurate deletions, complete modification results and efficient insertions.
Brief description of the drawings
Fig. 1 shows the index table structure.
Fig. 2 shows the address table structure.
Fig. 3 is a flow chart of incomplete data compression.
Fig. 4 is a flow chart of querying compressed incomplete data.
Fig. 5 is a flow chart of deleting compressed incomplete data.
Fig. 6 is a flow chart of modifying compressed incomplete data.
Fig. 7 is a flow chart of inserting compressed incomplete data.
Detailed description of the embodiments
A method for storing and operating on massive incomplete data comprises the following steps:
(1) When massive incomplete data is stored, complete data and incomplete data are compressed and stored separately. The implementation steps are as follows:
Step 1: For a massive-data database system, collect statistics on the frequently used data manipulation statements, that is, the predicates that appear after WHERE in query statements, and divide these predicates into deterministic predicates Def_Val and uncertain predicates Undef_Val;
Here, a deterministic predicate Def_Val is a predicate that is already determined before the operation is issued, typically a frequently used fixed-range operation such as "Age>55". The attribute name and attribute value of a deterministic predicate are fixed and appear as a whole.
An uncertain predicate Undef_Val is a predicate that cannot be fully determined before the operation is issued. Such predicates are typically frequently used but not fixed, for example an equality test on whether a certain attribute value exists in a record, such as "Name=***". The attribute name of an uncertain predicate is fixed, while its attribute value is variable.
Step 2: After all deterministic and uncertain predicates are obtained, during compressed storage the attribute values of the uncertain predicates and the deterministic predicates satisfied by each tuple are stored as index data, and the tuple itself is stored in the corresponding cache block awaiting compression. When a cache block is full, it is compressed and stored in order, and the address of the compressed file containing the tuple is recorded in the database;
The index table contains the following fields: Id, Tp_Id, Undef_Val_1 ... Undef_Val_i, Def_Val, Block_Id, Delet_Flag, Com_Flag;
where Id is the index number; Tp_Id is the tuple number; the field Undef_Val_i is the attribute value of the i-th uncertain predicate for the current tuple; the field Def_Val stores, in bit-encoded form, the deterministic predicates satisfied by the current tuple;
Block_Id is the number of the cache block that holds the current tuple;
Delet_Flag is the deletion flag bit: it is set to 1 if the data is to be deleted and 0 otherwise;
Com_Flag is the tuple integrity flag bit: it is set to 1 if the tuple is complete and 0 otherwise;
The compressed file address table contains the following fields: Block_Id, Address; where Block_Id is the number of the cache block that holds the current tuple, and the field Address is the compressed file produced by compressing that cache block;
For n deterministic predicates Q1, Q2, ..., Qn, let the bit encoding of the Def_Val field of the current tuple be B1B2...Bn; if the current tuple satisfies condition Qi, then Bi=1, otherwise Bi=0. If the data is to be deleted, set Delet_Flag=1, otherwise set it to 0. For m predicates Q1, Q2, ..., Qm, if the values for every Qi all exist, then Com_Flag=1; otherwise Com_Flag=0;
Step 3: The complete index file is obtained by joining the index table and the compressed file address table on the attribute Block_Id. Different Def_Val values correspond to different cache blocks and therefore to different compressed file addresses. After a cache block is compressed and stored, Block_Id is assigned a new value, so the same Def_Val value may correspond to different cache blocks and compressed file addresses;
Step 4: For massive incomplete data D*, assume D* contains m tuples and n attributes. After compression, an index file and compressed data are obtained. Assume the index file contains i tuples and j attributes; then i ≥ m and j ≤ n. For each tuple t of D*, first compute the Def_Val value of the deterministic predicates that t satisfies and the value of Com_Flag, and write t into the cache block Block_{Def_Val,Com_Flag} awaiting compression that is allocated for that Def_Val and Com_Flag; then insert the attribute values Undef_Vals of t's uncertain query conditions, Def_Val, Block_Id_{Def_Val,Com_Flag}, Delet_Flag=0 and Com_Flag into the index table. If Block_{Def_Query,Com_Flag} reaches the specified number of tuples, compress Block_{Def_Query,Com_Flag} with a compression algorithm, and write Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag} into the address table. The index file is compressed using K-of-N encoding, which finally yields the compressed data K, the compressed index file T and the coding dictionary M.
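The buffering and addressing described in Steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `BlockStore`, the block capacity of 2, the use of `zlib` and `pickle` as "a compression algorithm", and storing the compressed bytes directly in the Address field are all assumptions made for the example. Def_Val and Com_Flag are taken as already computed inputs.

```python
import pickle
import zlib
from collections import defaultdict


class BlockStore:
    """Hypothetical sketch: one open cache block per (Def_Val, Com_Flag) key."""

    def __init__(self, capacity=2):
        self.capacity = capacity          # assumed "specified number of tuples"
        self.index_table = []             # one row per stored tuple copy
        self.address_table = {}           # Block_Id -> Address (compressed bytes here)
        self.buffers = defaultdict(list)  # key -> rows awaiting compression
        self.open_block = {}              # key -> Block_Id of the block being filled
        self.next_block_id = 0

    def _new_block(self, key):
        # A key receives a fresh Block_Id whenever a new block is opened.
        self.open_block[key] = self.next_block_id
        self.next_block_id += 1

    def store(self, tp_id, row, undef_val, def_val, com_flag):
        key = (def_val, com_flag)
        if key not in self.open_block:
            self._new_block(key)
        block_id = self.open_block[key]
        self.buffers[key].append(row)
        self.index_table.append({
            "Id": len(self.index_table), "Tp_Id": tp_id, "Undef_Val": undef_val,
            "Def_Val": def_val, "Block_Id": block_id,
            "Delet_Flag": 0, "Com_Flag": com_flag,
        })
        if len(self.buffers[key]) >= self.capacity:
            # Block is full: compress it, record its address, open a new block.
            self.address_table[block_id] = zlib.compress(pickle.dumps(self.buffers[key]))
            self.buffers[key] = []
            self._new_block(key)

    def full_index(self):
        """Join index table and address table on Block_Id (flushed blocks only)."""
        return [dict(r, Address=self.address_table[r["Block_Id"]])
                for r in self.index_table if r["Block_Id"] in self.address_table]
```

Because a new Block_Id is issued after each flush, tuples with the same Def_Val can end up under different Block_Id and Address values, which is the behaviour Step 3 describes.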
The concrete application process is as follows:
Table 1 is a regional population information table. An attribute value marked "*" is missing; it could be any value and should therefore be treated as satisfying any predicate. To keep query results credible and meaningful, tuples containing attributes marked "*" must be compressed repeatedly. That is, for a tuple t containing a field marked "*", two or more Def_Val values may be computed, and t must be written into each of the cache blocks Block_{Def_Val,Com_Flag} awaiting compression that are allocated for those Def_Val values. Subsequent processing is the same as for tuples without marked fields.
Table 1: Population information table marked with "*"
Assume the uncertain predicate is Name=*** and the deterministic predicates are Age>35 and Salary>6000. Computation then gives a Def_Val value of 10 for tuple 1, while the Def_Val value of tuple 2 is 01 or 11, so tuple 2 must be compressed into both of the 2 compressed files corresponding to 01 and 11. Its index file is shown in Table 2:
Table 2: Index file example
Table 3: Compressed index file example
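The Def_Val computation in this example can be sketched in a few lines. The tuple values below are hypothetical (the contents of Table 1 are not reproduced here); they are chosen only so that the results match the Def_Val values 10 and 01/11 stated in the text. `None` stands for a field marked "*", and the helper names are illustrative. Com_Flag is computed over all fields of the tuple, which is an assumption about how completeness is checked.

```python
from itertools import product

# Deterministic predicates Q1 and Q2 from the example: Age>35, Salary>6000.
DEF_PREDICATES = [
    ("Age", lambda v: v > 35),
    ("Salary", lambda v: v > 6000),
]


def com_flag(t):
    """1 if every field of the tuple is present, otherwise 0."""
    return int(all(v is not None for v in t.values()))


def def_val_candidates(t):
    """All bit strings B1..Bn the tuple may take; a missing value allows 0 or 1."""
    per_bit = []
    for attr, pred in DEF_PREDICATES:
        v = t[attr]
        if v is None:
            per_bit.append("01")  # missing: may or may not satisfy Qi
        else:
            per_bit.append("1" if pred(v) else "0")
    return ["".join(bits) for bits in product(*per_bit)]


# Hypothetical tuples consistent with the Def_Val values given in the text.
tuple_1 = {"Name": "Ann", "Age": 40, "Salary": 5000}
tuple_2 = {"Name": "Bob", "Age": None, "Salary": 8000}
```

With these inputs, `def_val_candidates(tuple_1)` yields a single value and `def_val_candidates(tuple_2)` yields two, so tuple 2 would be written to two cache blocks.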
The index file is compressed using K-of-N encoding. Compression yields the coding dictionary M and the compressed index file Tc. Table 3 and Table 4 show, respectively, the index file obtained by dictionary compression and the coding dictionary of the Address field.
Table 4: Coding dictionary example for the Address field
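The text does not spell out which K-of-N variant is used. As a hedged illustration, the sketch below follows the common reading of a k-of-N code, in which every distinct value is assigned an N-bit code word with exactly k bits set. The function names, the choice k=2 and the sample addresses are assumptions for the example.

```python
from itertools import combinations
from math import comb


def build_k_of_n_dictionary(values, k=2):
    """Map each distinct value to an n-bit code word with exactly k ones."""
    distinct = sorted(set(values))
    n = k
    while comb(n, k) < len(distinct):  # smallest n with enough code words
        n += 1
    codes = ("".join("1" if i in ones else "0" for i in range(n))
             for ones in combinations(range(n), k))
    return dict(zip(distinct, codes))


def encode_column(values, dictionary):
    """Replace every value of a column by its code word."""
    return [dictionary[v] for v in values]


# Hypothetical Address column of an index file.
addresses = ["addr_a", "addr_b", "addr_c", "addr_a"]
M = build_k_of_n_dictionary(addresses, k=2)
encoded = encode_column(addresses, M)
```

An equality test can then run on the encoded column by comparing code words, for example `[i for i, c in enumerate(encoded) if c == M["addr_a"]]`, without decoding any entry.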
The operations above complete the compressed storage of massive incomplete data. The resulting compressed data can then be operated on; the data operation steps are as follows.
1. Querying massive incomplete data:
Step 1: First generate a query index from the query conditions. The generation rule is: represent the query statement using Undef_Query and Def_Query;
Step 2: 1. If the query index contains only Def_Query, perform the selection directly on the compressed index file according to Def_Query;
2. If the query index contains Undef_Query, look up the corresponding code in the coding dictionary according to Undef_Query;
Step 3: Perform the selection and projection operations on the compressed index file;
Step 4: Decompress the query result.
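The four query steps can be sketched as follows. Everything concrete here is assumed for illustration: the sample tuples, the code words in `name_dictionary`, the addresses `f0` to `f2`, the representation of Def_Query as a mapping from bit position to required bit, and the use of `zlib` with `pickle` for the compressed blocks.

```python
import pickle
import zlib

# Hypothetical compressed data: Address -> block of (Tp_Id, Name, Age, Salary) rows.
blocks = {
    "f0": zlib.compress(pickle.dumps([(1, "Ann", 40, 5000)])),
    "f1": zlib.compress(pickle.dumps([(2, "Bob", None, 8000)])),
    "f2": zlib.compress(pickle.dumps([(2, "Bob", None, 8000)])),
}
name_dictionary = {"Ann": "110", "Bob": "101"}  # coding dictionary (assumed codes)
compressed_index = [                             # compressed index file
    {"Tp_Id": 1, "Undef_Val": "110", "Def_Val": "10", "Delet_Flag": 0, "Address": "f0"},
    {"Tp_Id": 2, "Undef_Val": "101", "Def_Val": "01", "Delet_Flag": 0, "Address": "f1"},
    {"Tp_Id": 2, "Undef_Val": "101", "Def_Val": "11", "Delet_Flag": 0, "Address": "f2"},
]


def query(def_query=None, undef_query=None):
    """def_query: {bit position: required bit}; undef_query: a Name value."""
    code = None
    if undef_query is not None:
        code = name_dictionary.get(undef_query)  # look up the code, no decoding
        if code is None:
            return []
    addresses = []
    for row in compressed_index:                 # selection on the index
        if row["Delet_Flag"] == 1:
            continue
        if def_query and any(row["Def_Val"][i] != b for i, b in def_query.items()):
            continue
        if code is not None and row["Undef_Val"] != code:
            continue
        if row["Address"] not in addresses:      # projection onto Address
            addresses.append(row["Address"])
    found = {}
    for address in addresses:                    # decompress only the hits
        for rec in pickle.loads(zlib.decompress(blocks[address])):
            if undef_query is None or rec[1] == undef_query:
                found[rec[0]] = rec              # repeated copies collapse by Tp_Id
    return [found[k] for k in sorted(found)]
```

In this sketch `query(def_query={0: "1"})` stands for Age>35. It touches only blocks `f0` and `f2`, and it returns tuple 2 as a possible match because its Age is missing.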
2. Deleting massive incomplete data:
Step 1: Parse the delete statement into a delete index;
Step 2: 1. If an uncertain predicate exists, look up its corresponding code in the coding dictionary, then select from the compressed index file T the tuples whose deletion flag equals 0 and whose integrity flag equals 1, and set their deletion flag to 1. Project the selection result onto Address, decompress the compressed packages at the corresponding addresses, delete tuples according to the delete statement, import the tuples that were not deleted into the compression cache, and compress them when done;
2. If a deterministic predicate exists and no uncertain predicate exists, select from T, by Def_Val value, the tuples whose integrity flag equals 1, set their deletion flag to 1, project the selection result onto Address, and directly delete the compressed packages at the corresponding addresses. When the number of tuples whose deletion flag equals 1 exceeds a threshold, delete them as a whole, then delete the corresponding coding dictionary entries to obtain the result.
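The two deletion cases can be sketched as below, under stated assumptions: the index is a list of dictionaries, blocks are `zlib`-compressed pickles keyed by Address, the first field of every stored row is its Tp_Id, and the deterministic case matches on the full Def_Val bit string. The function names are hypothetical, and the threshold-based bulk clean-up of flagged rows and dictionary entries is left out.

```python
import pickle
import zlib


def pack(rows):
    """Compress a list of rows into one block."""
    return zlib.compress(pickle.dumps(rows))


def delete_by_def_val(index, blocks, def_val):
    """Deterministic predicate only: drop whole blocks without decompressing."""
    hit = [r for r in index
           if r["Def_Val"] == def_val and r["Com_Flag"] == 1 and r["Delet_Flag"] == 0]
    for r in hit:
        r["Delet_Flag"] = 1
    for address in {r["Address"] for r in hit}:
        blocks.pop(address, None)
    return len(hit)


def delete_by_name(index, blocks, dictionary, name):
    """Uncertain predicate: decompress the affected blocks and rewrite them."""
    code = dictionary.get(name)
    if code is None:
        return 0
    hit = [r for r in index
           if r["Undef_Val"] == code and r["Com_Flag"] == 1 and r["Delet_Flag"] == 0]
    for r in hit:
        r["Delet_Flag"] = 1
    doomed = {r["Tp_Id"] for r in hit}
    for address in {r["Address"] for r in hit}:
        rows = pickle.loads(zlib.decompress(blocks[address]))
        # Tuples that were not deleted go back into a recompressed block.
        blocks[address] = pack([row for row in rows if row[0] not in doomed])
    return len(hit)
```

The deterministic case is the cheaper one: because a block holds only tuples with one Def_Val, the whole compressed package can be removed by address.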
3. Modifying massive incomplete data:
Step 1: First use the query algorithm to find the Address_Set that needs modification;
Step 2: Decompress the corresponding compressed packages and import them into the database, perform the modification, and obtain the set Tp_ID_SET of operated tuples;
Step 3: Import the tuples that were not operated on back into their previous compressed package; import the operated tuples into their Update_Buffer_{Def_Query,Com_Flag}, and compress once a certain number is reached;
Step 4: Modify the tuples in the index file T whose Tp_ID belongs to Tp_ID_SET, including Def_Query, Com_Flag, Undef_Query and Block_ID;
Step 5: Modify the coding dictionary.
4. Inserting massive incomplete data:
Step 1: For each tuple t to be inserted, first compute its Def_Query and Com_Flag;
Step 2: 1. If t is complete, write t into the cache block Block_{Def_Query,Com_Flag} awaiting compression;
2. If t is incomplete, write t into multiple cache blocks Block_{Def_Query,Com_Flag} awaiting compression;
Step 3: When a cache block is full, compress it as a whole, obtaining Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag};
Step 4: Insert the information of the tuple into the original coding dictionary;
Step 5: Insert the new tuple into the compressed index file according to the coding dictionary.

Claims (1)

1. A method for storing and operating on massive incomplete data, characterized by comprising the following steps:
(1) When massive incomplete data is stored, complete data and incomplete data are compressed and stored separately. The implementation steps are as follows:
(1.1) For a massive-data database system, collect statistics on the frequently used data manipulation statements, that is, the predicates that appear after WHERE in query statements, and divide these predicates into deterministic predicates Def_Val and uncertain predicates Undef_Val;
(1.2) After all deterministic and uncertain predicates are obtained, during compressed storage the attribute values of the uncertain predicates and the deterministic predicates satisfied by each tuple are stored as index data, and the tuple itself is stored in the corresponding cache block awaiting compression. When a cache block is full, it is compressed and stored in order, and the address of the compressed file containing the tuple is recorded in the database;
The index table contains the following fields: Id, Tp_Id, Undef_Val_1 ... Undef_Val_i, Def_Val, Block_Id, Delet_Flag, Com_Flag;
where Id is the index number; Tp_Id is the tuple number; the field Undef_Val_i is the attribute value of the i-th uncertain predicate for the current tuple; the field Def_Val stores, in bit-encoded form, the deterministic predicates satisfied by the current tuple;
Block_Id is the number of the cache block that holds the current tuple;
Delet_Flag is the deletion flag bit: it is set to 1 if the data is to be deleted and 0 otherwise;
Com_Flag is the tuple integrity flag bit: it is set to 1 if the tuple is complete and 0 otherwise;
The compressed file address table contains the following fields: Block_Id, Address; where Block_Id is the number of the cache block that holds the current tuple, and the field Address is the compressed file produced by compressing that cache block;
For n deterministic predicates Q1, Q2, ..., Qn, let the bit encoding of the Def_Val field of the current tuple be B1B2...Bn; if the current tuple satisfies condition Qi, then Bi=1, otherwise Bi=0. If the data is to be deleted, set Delet_Flag=1, otherwise set it to 0. For m predicates Q1, Q2, ..., Qm, if the values for every Qi all exist, then Com_Flag=1; otherwise Com_Flag=0;
(1.3) The complete index file is obtained by joining the index table and the compressed file address table on the attribute Block_Id. Different Def_Val values correspond to different cache blocks and therefore to different compressed file addresses. After a cache block is compressed and stored, Block_Id is assigned a new value, so the same Def_Val value may correspond to different cache blocks and compressed file addresses;
(1.4) For massive incomplete data D*, assume D* contains m tuples and n attributes. After compression, an index file and compressed data are obtained. Assume the index file contains i tuples and j attributes; then i ≥ m and j ≤ n. For each tuple t of D*, first compute the Def_Val value of the deterministic predicates that t satisfies and the value of Com_Flag, and write t into the cache block Block_{Def_Val,Com_Flag} awaiting compression that is allocated for that Def_Val and Com_Flag; then insert the attribute values Undef_Vals of t's uncertain query conditions, Def_Val, Block_Id_{Def_Val,Com_Flag}, Delet_Flag=0 and Com_Flag into the index table. If Block_{Def_Query,Com_Flag} reaches the specified number of tuples, compress Block_{Def_Query,Com_Flag} with a compression algorithm, and write Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag} into the address table. The index file is compressed using K-of-N encoding, which finally yields the compressed data K, the compressed index file T and the coding dictionary M;
(2) Based on the compressed storage of massive incomplete data, the query operation on massive incomplete data is carried out as follows:
(2.1) First generate a query index from the query conditions. The generation rule is: represent the query statement using Undef_Query and Def_Query;
(2.2) 1. If the query index contains only Def_Query, perform the selection directly on the compressed index file according to Def_Query;
2. If the query index contains Undef_Query, look up the corresponding code in the coding dictionary according to Undef_Query;
(2.3) Perform the selection and projection operations on the compressed index file;
(2.4) Decompress the query result;
(3) Based on the compressed storage of massive incomplete data, the deletion operation on massive incomplete data is carried out as follows:
(3.1) Parse the delete statement into a delete index;
(3.2) 1. If an uncertain predicate exists, look up its corresponding code in the coding dictionary, then select from the compressed index file T the tuples whose deletion flag equals 0 and whose integrity flag equals 1, and set their deletion flag to 1. Project the selection result onto Address, decompress the compressed packages at the corresponding addresses, delete tuples according to the delete statement, import the tuples that were not deleted into the compression cache, and compress them when done;
2. If a deterministic predicate exists and no uncertain predicate exists, select from T, by Def_Val value, the tuples whose integrity flag equals 1, set their deletion flag to 1, project the selection result onto Address, and directly delete the compressed packages at the corresponding addresses. When the number of tuples whose deletion flag equals 1 exceeds a threshold, delete them as a whole, then delete the corresponding coding dictionary entries to obtain the result;
(4) Based on the compressed storage of the massive incomplete data, the modification operation on the massive incomplete data is completed by the following steps:
(4.1) The Address_Set that needs modification is first found using the query algorithm;
(4.2) The corresponding compressed packages are decompressed and imported into the database, and the modification is performed, which yields the set of operated tuples Tp_ID_SET;
(4.3) Tuples that were not operated on are imported back into their previous compressed package. Operated tuples are imported into their Update_Buffer_{Def_Query,Com_Flag} and are compressed once a certain number is reached, usually 150;
(4.4) The tuples in index file T whose Tp_ID belongs to Tp_ID_SET are modified, including Def_Query, Com_Flag, Undef_Query, and Block_ID;
(4.5) The encoding dictionary is modified;
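Steps (4.2) and (4.3) can be sketched as below. The sketch assumes that the set of addresses from step (4.1) is already known, and it omits the index and dictionary updates of steps (4.4) and (4.5). The callbacks `matches`, `update`, and `key_of`, the `zlib`/JSON package format, and all function names are hypothetical; only the flush threshold of 150 comes from the text.

```python
import json
import zlib
from collections import defaultdict

FLUSH_AT = 150  # "usually 150" according to step (4.3)

def pack(tuples):
    """Compress a list of tuples (zlib and JSON are assumptions)."""
    return zlib.compress(json.dumps(tuples).encode("utf-8"))

def unpack(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))

def modify(packages, address_set, matches, update, key_of, buffers,
           flush_at=FLUSH_AT):
    """Apply `update` to matching tuples and route them to update buffers.

    key_of(t) returns the (Def_Query, Com_Flag) pair that names the buffer.
    Returns (Tp_ID_SET, newly compressed blocks keyed by buffer key).
    """
    tp_id_set = set()
    for addr in address_set:
        untouched = []
        for t in unpack(packages[addr]):          # (4.2) decompress
            if matches(t):
                t = update(t)
                tp_id_set.add(t["tp_id"])
                buffers[key_of(t)].append(t)      # (4.3) operated tuples
            else:
                untouched.append(t)
        packages[addr] = pack(untouched)          # (4.3) untouched tuples go back

    new_blocks = {}
    for key, buf in list(buffers.items()):
        if len(buf) >= flush_at:                  # compress buffers that are full
            new_blocks[key] = pack(buf)
            buf.clear()
    return tp_id_set, new_blocks
```

A caller would pass `buffers = defaultdict(list)` and keep it alive between calls so that partially filled buffers accumulate until they reach the threshold.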
(5) Based on the compressed storage of the massive incomplete data, the insertion operation on the massive incomplete data is completed by the following steps:
(5.1) For each tuple t to be inserted, its Def_Query and Com_Flag are computed first;
(5.2) 1. If t is complete, t is written to the to-be-compressed cache block Block_{Def_Query,Com_Flag};
2. If t is incomplete, t is written to multiple to-be-compressed cache blocks Block_{Def_Query,Com_Flag};
(5.3) When a cache block is full, it is compressed as a whole, which yields Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag};
(5.4) The information of the tuple is inserted into the original encoding dictionary;
(5.5) The new tuple is inserted into the compressed index file according to the encoding dictionary.
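The insertion steps can be sketched in Python under several stated assumptions. This passage does not say how Def_Query is derived or which blocks receive an incomplete tuple, so the sketch assumes that Def_Query is built from one designated, always-present attribute (`key_attr`) and that an incomplete tuple is written to one block per missing attribute. The block capacity, the `state` layout, and all function names are likewise hypothetical.

```python
import json
import zlib
from collections import defaultdict

def new_state():
    """Empty store: cache blocks, compressed packages, dictionary, index."""
    return {"blocks": defaultdict(list), "packages": [],
            "dictionary": {}, "index": []}

def block_keys(t, key_attr):
    """(5.1) Compute the (Def_Query, Com_Flag) keys of the blocks that receive t.

    Assumption: a complete tuple maps to one block with Com_Flag 1, and an
    incomplete tuple maps to one block per missing attribute with Com_Flag 0.
    """
    def_query = f"{key_attr}={t[key_attr]}"
    missing = sorted(k for k, v in t.items() if v is None)
    if not missing:
        return [(def_query, 1)]
    return [(f"{def_query};{attr}=?", 0) for attr in missing]

def insert(state, t, key_attr, capacity=2):
    for key in block_keys(t, key_attr):
        def_query, com_flag = key
        if com_flag == 0:
            # (5.4) register the uncertain pattern in the encoding dictionary
            state["dictionary"].setdefault(def_query, len(state["dictionary"]))
        block = state["blocks"][key]
        block.append(t)                                   # (5.2)
        if len(block) >= capacity:
            # (5.3) the block is full: compress it as a whole
            address = len(state["packages"])
            state["packages"].append(
                zlib.compress(json.dumps(block).encode("utf-8")))
            # (5.5) add one index row per tuple, carrying its dictionary code
            for member in block:
                state["index"].append({
                    "tp_id": member["tp_id"], "def_query": def_query,
                    "com_flag": com_flag,
                    "code": state["dictionary"].get(def_query),
                    "address": address,
                })
            block.clear()
```

In this sketch a tuple becomes visible in the index only after its block has been compressed, so tuples still sitting in a cache block would need a separate lookup path.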
CN201611081152.4A 2016-11-30 2016-11-30 Massive incomplete data storage and operation method Pending CN106599112A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611081152.4A CN106599112A (en) 2016-11-30 2016-11-30 Massive incomplete data storage and operation method

Publications (1)

Publication Number Publication Date
CN106599112A true CN106599112A (en) 2017-04-26

Family

ID=58594035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611081152.4A Pending CN106599112A (en) 2016-11-30 2016-11-30 Massive incomplete data storage and operation method

Country Status (1)

Country Link
CN (1) CN106599112A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111459971A (en) * 2020-04-01 2020-07-28 辽宁大学 Skyline-join query processing method based on crowdsourcing
CN112199366A (en) * 2019-04-28 2021-01-08 杭州数梦工场科技有限公司 Data table processing method, device and equipment
CN113505578A (en) * 2021-05-26 2021-10-15 中国再保险(集团)股份有限公司 Mass file quick checking method for typhoon and disaster great model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060101206A1 (en) * 2004-11-05 2006-05-11 Wood David A Adaptive cache compression system
CN101599072A (en) * 2009-07-03 2009-12-09 南开大学 Intelligent computer systems building method based on information inference
WO2011129818A1 (en) * 2010-04-13 2011-10-20 Empire Technology Development Llc Adaptive compression
CN104750860A (en) * 2015-04-16 2015-07-01 东北大学 Data storage method of uncertain data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
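王妍: "Compression-based approximate query method for massive incomplete data", Journal of Computer Research and Development (《计算机研究与发展》) *
赵锴: "Compressed storage and data operation algorithms for massive data based on a predicate index", Computer Science (《计算机科学》) *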

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199366A (en) * 2019-04-28 2021-01-08 杭州数梦工场科技有限公司 Data table processing method, device and equipment
CN111459971A (en) * 2020-04-01 2020-07-28 辽宁大学 Skyline-join query processing method based on crowdsourcing
CN111459971B (en) * 2020-04-01 2023-11-10 辽宁大学 Skyline-join query processing method based on crowdsourcing
CN113505578A (en) * 2021-05-26 2021-10-15 中国再保险(集团)股份有限公司 Mass file quick checking method for typhoon and disaster great model
CN113505578B (en) * 2021-05-26 2024-07-30 中国再保险(集团)股份有限公司 Rapid verification method for mass files of typhoon disaster model

Similar Documents

Publication Publication Date Title
US7076486B2 (en) Method and system for efficiently identifying differences between large files
US9691164B2 (en) System and method for symbol-space based compression of patterns
KR102407510B1 (en) Method, apparatus, device and medium for storing and querying data
US10726016B2 (en) In-memory column-level multi-versioned global dictionary for in-memory databases
CN109446221B (en) Interactive data exploration method based on semantic analysis
CN104040541B (en) For more efficiently using memory to the technology of CPU bandwidth
CN111899089A (en) Enterprise risk early warning method and system based on knowledge graph
CN106599112A (en) Massive incomplete data storage and operation method
JP7426907B2 (en) Advanced database decompression
CN111309930B (en) Medical knowledge graph entity alignment method based on representation learning
WO2020098315A1 (en) Information matching method and terminal
CN109558166A (en) A kind of code search method of facing defects positioning
CN113377758A (en) Data quality auditing engine and auditing method thereof
CN105488471B (en) A kind of font recognition methods and device
US20100125614A1 (en) Systems and processes for functionally interpolated increasing sequence encoding
CN105302915A (en) High-performance data processing system based on memory calculation
CN114997181A (en) Intelligent question-answering method and system based on user feedback correction
CN104731908A (en) ETL-based data cleaning method
CN110866407B (en) Analysis method, device and equipment for determining similarity between text of mutual translation
CN113627132A (en) Data deduplication mark code generation method and system, electronic device and storage medium
CN115470355A (en) Rail transit information query method and device, electronic equipment and storage medium
US20180349443A1 (en) Edge store compression in graph databases
US10366067B2 (en) Adaptive index leaf block compression
CN106598492A (en) Compression optimization method applied to mass incomplete data
CN110572160A (en) Compression method for decoding module code of instruction set simulator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170426