CN106599112A - Massive incomplete data storage and operation method - Google Patents
Massive incomplete data storage and operation method
- Publication number
- CN106599112A
- Authority
- CN
- China
- Prior art keywords
- compressed
- flag
- tuple
- data
- query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method for storing and operating on massive incomplete data. The method applies different processing strategies to complete and incomplete data according to their characteristics, marks the fields of incomplete data whose attribute values are missing, and builds a corresponding index. The index file is then compressed again, partitioned by attribute, to save storage space; queries are answered without decompression by means of an encoding dictionary; and deletion, modification and insertion of massive incomplete data are implemented on top of the query operation. The method skips data cleaning and operates directly on the massive incomplete data. It substantially reduces storage space, quickly locates the compressed position of incomplete data, and ensures fast queries, accurate deletions, complete modification results and efficient insertions.
Description
Technical field
The present invention relates to a method for storing and operating on massive incomplete data, and belongs to the field of big data technology.
Background
In recent years, with the rapid development of the Internet, data volumes have kept growing, and equipment failures and human factors cause data loss, producing massive incomplete data. These problems seriously limit the practical value of the data, so storing and operating on massive data with missing attribute values is of practical significance.
At present, incomplete data is usually handled by filling in the missing attribute values, but such predictive filling often introduces errors into the data. For massive data, cleaning the data first and operating on it afterwards has several drawbacks. First, the time overhead of cleaning massive data is too high. Second, the cleaning result is affected by uncertain factors and may introduce new "noise", making the result inaccurate. Finally, data cleaning raises timeliness problems, so much time-sensitive data loses its meaning. The approach studied here skips data cleaning and compresses and operates on massive incomplete data directly, whereas most existing methods process complete data. A more efficient processing method suited to incomplete data is therefore designed: a compression-based storage and operation method for massive incomplete data (compression-based operated method for massive incomplete data, OM-MI). The method quickly locates the compressed position of the data to be operated on, improves operating efficiency, and significantly reduces storage space.
Massive data is generally compressed with an encoding dictionary. Such methods store the data by column, and operations on the compressed codes are translated into operations on the original data. The encoding-dictionary approach allows queries to run directly without decompression, which improves efficiency, but it adds the cost of matching against and maintaining the dictionary, so it is generally used in systems where operations are infrequent. Current methods for storing and operating on massive incomplete data have many shortcomings. Most existing methods clean first and process afterwards, but cleaning massive data is too expensive and also leads to data loss and timeliness problems. The OM-MI method is therefore proposed: it skips data cleaning and operates directly on massive incomplete data, which significantly reduces storage space, quickly locates compressed files, and improves operating efficiency.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a method for storing and operating on massive incomplete data.
The invention is realized by the following technical solution: a method for storing and operating on massive incomplete data, characterized by comprising the following steps:
(1) When storing massive incomplete data, complete data and incomplete data are compressed and stored separately. The implementation steps are as follows:
(1.1) For a massive-data database system, collect the frequently used data manipulation statements, i.e. the predicates that appear after WHERE in query statements, and divide these predicates into definite predicates Def_Val and indefinite predicates Undef_Val.
A definite predicate Def_Val is a predicate that is fully determined before the operation is issued, typically a frequently used fixed-range operation such as "Age>55". The attribute name and attribute value of a definite predicate are fixed, and they appear as a whole.
An indefinite predicate Undef_Val is a predicate that cannot be fully determined before the operation is issued, typically a frequently used equality operation whose value is not fixed, such as testing whether a given attribute value exists in the records, e.g. "Name=***". The attribute name of an indefinite predicate is fixed, but its attribute value is variable.
(1.2) After all definite and indefinite predicates have been obtained, during compressed storage the attribute values of the indefinite predicates and the definite predicates satisfied by each tuple are stored in a data index table, while the tuple itself is written to the corresponding cache block awaiting compression. When a cache block is full, it is compressed and stored in order, and the address of the compressed file holding the tuple is recorded in the database.
The index table contains the following fields: Id, Tp_Id, Undef_Val_1, ..., Undef_Val_i, Def_Val, Block_Id, Delet_Flag, Com_Flag.
Id is the index number. Tp_Id is the tuple number. Undef_Val_i is the current tuple's attribute value for the i-th indefinite predicate. Def_Val stores, as a bit code, the definite predicates that the current tuple satisfies.
Block_Id is the number of the cache block that holds the current tuple.
Delet_Flag is the deletion flag: 1 if the data is to be deleted, otherwise 0.
Com_Flag is the tuple completeness flag: 1 if the tuple is complete, otherwise 0.
The compressed-file address table contains the following fields: Block_Id, Address. Block_Id is the number of the cache block that holds the current tuple, and Address is the compressed file produced by compressing that cache block.
For n definite predicates Q1, Q2, ..., Qn, let the bit code of field Def_Val of the current tuple be B1B2...Bn; if the current tuple satisfies condition Qi then Bi=1, otherwise Bi=0. If the data is to be deleted, set Delet_Flag=1, otherwise set it to 0. For m predicates Q1, Q2, ..., Qm, if the values of all Qi exist then Com_Flag=1, otherwise Com_Flag=0.
(1.3) Joining the index table and the compressed-file address table on attribute Block_Id yields the complete index file. Different Def_Val values correspond to different cache blocks, and hence to different compressed-file addresses. After a cache block has been compressed and stored, Block_Id is assigned a new value, so the same Def_Val value can correspond to several cache blocks and compressed-file addresses.
(1.4) Let D* be the massive incomplete data, containing m tuples and n attributes. Compression produces an index file and compressed data. If the index file contains i tuples and j attributes, then i ≥ m and j ≤ n. For each tuple t of D*, first compute the Def_Val value of the definite predicates that t satisfies and the value of Com_Flag, and write t to the cache block awaiting compression assigned to that pair, Block_{Def_Val,Com_Flag}. Then insert the attribute values Undef_Vals of t's indefinite query conditions, together with Def_Val, Block_Id_{Def_Val,Com_Flag}, Delet_Flag=0 and Com_Flag, into the index table. When Block_{Def_Query,Com_Flag} reaches the specified number of tuples, compress it with a compression algorithm and write Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag} to the address table. Finally, the index file is compressed with K-of-N encoding, which yields the compressed data K, the compressed index file T and the encoding dictionary M.
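The buffering logic of step (1.4) can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `BlockStore`, the capacity of 3 tuples, the use of `zlib` with JSON serialization, and the in-memory `files` dictionary standing in for disk are all assumptions made for the example.

```python
import json
import zlib
from collections import defaultdict

BLOCK_CAPACITY = 3  # hypothetical "specified number of tuples" per cache block


class BlockStore:
    """Toy model of the cache blocks, index table and address table."""

    def __init__(self):
        self.buffers = defaultdict(list)  # (Def_Val, Com_Flag) -> pending tuples
        self.block_ids = {}               # (Def_Val, Com_Flag) -> current Block_Id
        self.next_block_id = 0
        self.address_table = {}           # Block_Id -> Address
        self.files = {}                   # Address -> compressed bytes (stand-in for disk)
        self.index = []                   # rows of the index table

    def _block_id(self, key):
        # A key gets a fresh Block_Id after its previous block was compressed.
        if key not in self.block_ids:
            self.block_ids[key] = self.next_block_id
            self.next_block_id += 1
        return self.block_ids[key]

    def add(self, tp_id, t, def_val, com_flag):
        key = (def_val, com_flag)
        block_id = self._block_id(key)
        self.buffers[key].append(t)
        self.index.append({"Tp_Id": tp_id, "Name": t.get("Name"),
                           "Def_Val": def_val, "Block_Id": block_id,
                           "Delet_Flag": 0, "Com_Flag": com_flag})
        if len(self.buffers[key]) >= BLOCK_CAPACITY:
            self.flush(key)

    def flush(self, key):
        # Compress the full block and record where it went.
        block_id = self.block_ids.pop(key)
        address = f"block_{block_id}.z"
        payload = json.dumps(self.buffers.pop(key)).encode("utf-8")
        self.files[address] = zlib.compress(payload)
        self.address_table[block_id] = address


store = BlockStore()
for i in range(4):
    store.add(i, {"Name": f"n{i}", "Age": 40, "Salary": 5000}, "10", 1)
print(store.address_table)
print([row["Block_Id"] for row in store.index])
```

The fourth tuple shares its Def_Val with the first three but lands in a new block, which mirrors the remark in step (1.3) that one Def_Val value can map to several cache blocks and addresses.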
(2) Based on the compressed storage of massive incomplete data, the query operation is carried out as follows:
(2.1) First generate a query index from the query condition. The generation rule is: express the query statement in terms of Undef_Query and Def_Query.
(2.2) 1. If the query index contains only Def_Query, perform the selection directly on the compressed index file according to Def_Query.
2. If the query index contains Undef_Query, first look up the corresponding code in the encoding dictionary according to Undef_Query.
(2.3) Perform the selection and projection operations on the compressed index file.
(2.4) Decompress the query result.
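The query steps (2.1) to (2.4) can be sketched as follows. The toy data, the function name `query`, and the choice of integer codes for the dictionary are assumptions for illustration; the source does not specify how codes are assigned or how blocks are serialized.

```python
import json
import zlib


def pack(tuples):
    return zlib.compress(json.dumps(tuples).encode("utf-8"))


# Hypothetical store: two compressed blocks, address table, dictionary M, index T.
files = {
    "a0": pack([{"Tp_Id": 1, "Name": "Li", "Age": 40, "Salary": 5000}]),
    "a1": pack([{"Tp_Id": 2, "Name": "Wang", "Age": 30, "Salary": 7000}]),
}
address_table = {0: "a0", 1: "a1"}   # Block_Id -> Address
dictionary = {"Li": 0, "Wang": 1}    # attribute value -> code
index = [                            # Name is stored as its code
    {"Tp_Id": 1, "Name": 0, "Def_Val": "10", "Block_Id": 0, "Delet_Flag": 0},
    {"Tp_Id": 2, "Name": 1, "Def_Val": "01", "Block_Id": 1, "Delet_Flag": 0},
]


def query(name=None, def_bit=None):
    """name: value of the indefinite predicate Name=...;
    def_bit: 0-based position of a definite predicate in Def_Val."""
    rows = [r for r in index if r["Delet_Flag"] == 0]
    if def_bit is not None:
        # Def_Query: select on the bit code without touching the data blocks.
        rows = [r for r in rows if r["Def_Val"][def_bit] == "1"]
    if name is not None:
        # Undef_Query: translate the value to its code, then select on the code.
        code = dictionary.get(name)
        rows = [r for r in rows if code is not None and r["Name"] == code]
    wanted = {r["Tp_Id"] for r in rows}
    # Projection on Address, then decompress only the blocks that matched.
    addresses = sorted({address_table[r["Block_Id"]] for r in rows})
    return [t for a in addresses
            for t in json.loads(zlib.decompress(files[a]))
            if t["Tp_Id"] in wanted]


print(query(def_bit=0))
print(query(name="Wang", def_bit=1))
```

Only the blocks whose addresses survive the selection are decompressed. An incomplete tuple stored in several blocks could appear more than once in the result; this sketch does not deduplicate.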
(3) Based on the compressed storage of massive incomplete data, the deletion operation is carried out as follows:
(3.1) Parse the delete statement into a deletion index.
(3.2) 1. If an indefinite predicate is present, look up its code in the encoding dictionary, then select from the compressed index file T the tuples whose deletion flag equals 0 and whose completeness flag equals 1, and set their deletion flag to 1. Project the selection result onto Address, decompress the compressed packages at the corresponding addresses, delete the tuples specified by the delete statement, import the tuples that were not deleted into the compression cache, and compress them once this is done.
2. If a definite predicate is present and no indefinite predicate is present, select from T by Def_Val value the tuples whose completeness flag equals 1, set their deletion flag to 1, project the selection result onto Address, and delete the compressed packages at those addresses directly. When the number of tuples whose deletion flag equals 1 exceeds a threshold, delete them as a whole, then delete the corresponding encoding-dictionary entries to obtain the result.
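The two deletion cases of step (3.2) can be sketched as follows. This is a simplified illustration under stated assumptions: the index keeps Name unencoded so that the dictionary lookup can be omitted, the definite-predicate case matches on the whole Def_Val code, and the function names `delete_by_name` and `delete_by_def_val` are hypothetical.

```python
import json
import zlib


def pack(tuples):
    return zlib.compress(json.dumps(tuples).encode("utf-8"))


def unpack(blob):
    return json.loads(zlib.decompress(blob))


files = {
    "a0": pack([{"Tp_Id": 1, "Name": "Li"}, {"Tp_Id": 2, "Name": "Wang"}]),
    "a1": pack([{"Tp_Id": 3, "Name": "Zhao"}]),
}
address_table = {0: "a0", 1: "a1"}
index = [
    {"Tp_Id": 1, "Name": "Li", "Def_Val": "10", "Block_Id": 0, "Delet_Flag": 0, "Com_Flag": 1},
    {"Tp_Id": 2, "Name": "Wang", "Def_Val": "10", "Block_Id": 0, "Delet_Flag": 0, "Com_Flag": 1},
    {"Tp_Id": 3, "Name": "Zhao", "Def_Val": "01", "Block_Id": 1, "Delet_Flag": 0, "Com_Flag": 1},
]


def delete_by_name(name):
    """Case 1: an indefinite predicate (Name=...) is present."""
    hits = [r for r in index
            if r["Name"] == name and r["Delet_Flag"] == 0 and r["Com_Flag"] == 1]
    for r in hits:
        r["Delet_Flag"] = 1
    for block_id in {r["Block_Id"] for r in hits}:
        addr = address_table[block_id]
        doomed = {r["Tp_Id"] for r in hits if r["Block_Id"] == block_id}
        # Decompress, drop the matching tuples, recompress the survivors.
        files[addr] = pack([t for t in unpack(files[addr]) if t["Tp_Id"] not in doomed])


def delete_by_def_val(def_val):
    """Case 2: only a definite predicate is present. All tuples in a block
    share one Def_Val, so the whole compressed package can be dropped."""
    hits = [r for r in index if r["Def_Val"] == def_val and r["Com_Flag"] == 1]
    for r in hits:
        r["Delet_Flag"] = 1
    for block_id in {r["Block_Id"] for r in hits}:
        files.pop(address_table.pop(block_id), None)


delete_by_name("Li")
delete_by_def_val("01")
print(unpack(files["a0"]), sorted(files))
```

The second case never decompresses anything, which is where the block-per-Def_Val layout pays off. Physically purging flagged index rows and dictionary entries once they exceed a threshold is left out of this sketch.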
(4) Based on the compressed storage of massive incomplete data, the modification operation is carried out as follows:
(4.1) First use the query algorithm to find the set Address_Set of addresses that need modification.
(4.2) Decompress the corresponding compressed packages and import them into the database, perform the modification, and obtain the set Tp_ID_SET of modified tuples.
(4.3) Return the tuples that were not modified to their previous compressed package. Import each modified tuple into its Update_Buffer_{Def_Query,Com_Flag}, which is compressed once it holds a certain number of tuples, usually 150.
(4.4) In the index file T, modify the tuples whose Tp_ID belongs to Tp_ID_SET, including Def_Query, Com_Flag, Undef_Query and Block_ID.
(4.5) Modify the encoding dictionary.
(5) Based on the compressed storage of massive incomplete data, the insertion operation is carried out as follows:
(5.1) For each tuple to be inserted, first compute its Def_Query and Com_Flag (these two values determine the number of the cache block used for subsequent compression).
(5.2) 1. If the tuple is complete, write t to the cache block awaiting compression Block_{Def_Query,Com_Flag}.
2. If the tuple is incomplete, write t to multiple cache blocks awaiting compression Block_{Def_Query,Com_Flag}.
(5.3) When a cache block is full, compress it as a whole, obtaining Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag}.
(5.4) Insert the tuple's information into the existing encoding dictionary.
(5.5) Insert the new tuple into the compressed index file according to the encoding dictionary.
Beneficial effects of the invention: compared with the prior art, the invention provides a method for storing and operating on massive incomplete data. The method skips data cleaning and operates directly on massive incomplete data. It significantly reduces storage space, quickly locates the compressed position of incomplete data, and ensures fast queries, accurate deletions, complete modification results and efficient insertions.
Brief description of the drawings
Fig. 1 shows the index table structure.
Fig. 2 shows the address table structure.
Fig. 3 is a flow chart of incomplete-data compression.
Fig. 4 is a flow chart of querying compressed incomplete data.
Fig. 5 is a flow chart of deleting compressed incomplete data.
Fig. 6 is a flow chart of modifying compressed incomplete data.
Fig. 7 is a flow chart of inserting compressed incomplete data.
Detailed description
A method for storing and operating on massive incomplete data comprises the following steps:
(1) When storing massive incomplete data, complete data and incomplete data are compressed and stored separately. The implementation steps are as follows:
Step 1: For a massive-data database system, collect the frequently used data manipulation statements, i.e. the predicates that appear after WHERE in query statements, and divide these predicates into definite predicates Def_Val and indefinite predicates Undef_Val.
A definite predicate Def_Val is a predicate that is fully determined before the operation is issued, typically a frequently used fixed-range operation such as "Age>55". The attribute name and attribute value of a definite predicate are fixed, and they appear as a whole.
An indefinite predicate Undef_Val is a predicate that cannot be fully determined before the operation is issued, typically a frequently used equality operation whose value is not fixed, such as testing whether a given attribute value exists in the records, e.g. "Name=***". The attribute name of an indefinite predicate is fixed, but its attribute value is variable.
Step 2: After all definite and indefinite predicates have been obtained, during compressed storage the attribute values of the indefinite predicates and the definite predicates satisfied by each tuple are stored in a data index table, while the tuple itself is written to the corresponding cache block awaiting compression. When a cache block is full, it is compressed and stored in order, and the address of the compressed file holding the tuple is recorded in the database.
The index table contains the following fields: Id, Tp_Id, Undef_Val_1, ..., Undef_Val_i, Def_Val, Block_Id, Delet_Flag, Com_Flag.
Id is the index number. Tp_Id is the tuple number. Undef_Val_i is the current tuple's attribute value for the i-th indefinite predicate. Def_Val stores, as a bit code, the definite predicates that the current tuple satisfies.
Block_Id is the number of the cache block that holds the current tuple.
Delet_Flag is the deletion flag: 1 if the data is to be deleted, otherwise 0.
Com_Flag is the tuple completeness flag: 1 if the tuple is complete, otherwise 0.
The compressed-file address table contains the following fields: Block_Id, Address. Block_Id is the number of the cache block that holds the current tuple, and Address is the compressed file produced by compressing that cache block.
For n definite predicates Q1, Q2, ..., Qn, let the bit code of field Def_Val of the current tuple be B1B2...Bn; if the current tuple satisfies condition Qi then Bi=1, otherwise Bi=0. If the data is to be deleted, set Delet_Flag=1, otherwise set it to 0. For m predicates Q1, Q2, ..., Qm, if the values of all Qi exist then Com_Flag=1, otherwise Com_Flag=0.
Step 3: Joining the index table and the compressed-file address table on attribute Block_Id yields the complete index file. Different Def_Val values correspond to different cache blocks, and hence to different compressed-file addresses. After a cache block has been compressed and stored, Block_Id is assigned a new value, so the same Def_Val value may correspond to several cache blocks and compressed-file addresses.
Step 4: Let D* be the massive incomplete data, containing m tuples and n attributes. Compression produces an index file and compressed data. If the index file contains i tuples and j attributes, then i ≥ m and j ≤ n. For each tuple t of D*, first compute the Def_Val value of the definite predicates that t satisfies and the value of Com_Flag, and write t to the cache block awaiting compression assigned to that pair, Block_{Def_Val,Com_Flag}. Then insert the attribute values Undef_Vals of t's indefinite query conditions, together with Def_Val, Block_Id_{Def_Val,Com_Flag}, Delet_Flag=0 and Com_Flag, into the index table. When Block_{Def_Query,Com_Flag} reaches the specified number of tuples, compress it with a compression algorithm and write Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag} to the address table. Finally, the index file is compressed with K-of-N encoding, which yields the compressed data K, the compressed index file T and the encoding dictionary M.
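The index-table fields and the Def_Val and Com_Flag rules described in the storage steps can be sketched for a complete tuple as follows. The attribute list, the two definite predicates and the helper names are assumptions chosen for illustration; in particular, the source does not state exactly which attributes Com_Flag inspects, so this sketch checks all of them.

```python
from dataclasses import dataclass

# Hypothetical schema and definite predicates Q1, Q2.
ATTRIBUTES = ("Name", "Age", "Salary")
PREDICATES = [
    ("Age", lambda v: v > 35),
    ("Salary", lambda v: v > 6000),
]


@dataclass
class IndexRow:
    id: int
    tp_id: int
    undef_val_1: str   # value of the attribute used by the indefinite predicate
    def_val: str       # bit code B1..Bn
    block_id: int
    delet_flag: int
    com_flag: int


def com_flag(t):
    """1 if every attribute has a value, else 0 (None marks a missing value)."""
    return int(all(t.get(a) is not None for a in ATTRIBUTES))


def def_val(t):
    """Bit code for a complete tuple: Bi is '1' if the tuple satisfies Qi."""
    if not com_flag(t):
        raise ValueError("incomplete tuple: Def_Val is not unique")
    return "".join("1" if test(t[attr]) else "0" for attr, test in PREDICATES)


def index_row(id_, tp_id, t, block_id):
    return IndexRow(id_, tp_id, t["Name"], def_val(t), block_id, 0, com_flag(t))


row = index_row(1, 1, {"Name": "Li", "Age": 40, "Salary": 5000}, block_id=0)
print(row)
```

Incomplete tuples are rejected here on purpose, because they can map to more than one Def_Val value and need the repeated-compression treatment.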
A concrete application example follows.
Table 1 is a regional population information table. A field whose attribute value is missing is marked "*"; since it could take any value, it should be treated as satisfying any predicate. To keep query results trustworthy and meaningful, a tuple with attributes marked "*" must be compressed repeatedly. That is, for a tuple t containing "*"-marked fields, two or more Def_Val values may be computed, and t must then be written to each of the cache blocks awaiting compression Block_{Def_Val,Com_Flag} assigned to those Def_Val values. Subsequent processing is the same as for tuples without marked fields.
Table 1: Population information table marked with "*"
Assume the indefinite predicate is Name=*** and the definite predicates are Age>35 and Salary>6000. Calculation then gives a Def_Val value of 10 for tuple 1 and a Def_Val value of 01 or 11 for tuple 2, so tuple 2 must be compressed into both of the compressed files corresponding to 01 and 11. The resulting index file is shown in Table 2.
Table 2: Index file example
Table 3: Compressed index file example
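The Def_Val calculation in this example can be sketched as follows. The concrete attribute values are hypothetical, since the contents of Table 1 are not reproduced here; they are chosen only so that tuple 1 yields 10 and tuple 2 yields 01 or 11, as stated in the text. The function name `candidate_def_vals` is also an assumption.

```python
from itertools import product

# Definite predicates from the example: Age>35 and Salary>6000.
PREDICATES = [
    ("Age", lambda v: v > 35),
    ("Salary", lambda v: v > 6000),
]


def candidate_def_vals(t):
    """All Def_Val codes a tuple may take. A missing value (None, shown
    as "*" in the tables) may either satisfy or fail its predicate."""
    options = []
    for attr, test in PREDICATES:
        value = t.get(attr)
        if value is None:
            options.append("01")               # both outcomes are possible
        else:
            options.append("1" if test(value) else "0")
    return ["".join(bits) for bits in product(*options)]


tuple_1 = {"Name": "Li", "Age": 40, "Salary": 5000}     # hypothetical values
tuple_2 = {"Name": "Wang", "Age": None, "Salary": 7000}  # Age is missing

print(candidate_def_vals(tuple_1))
print(candidate_def_vals(tuple_2))
```

Each code returned for an incomplete tuple names one cache block that the tuple must be written to, which is why the index file can hold more rows than the original data has tuples.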
The index file is compressed with K-of-N encoding, which yields the encoding dictionary M and the compressed index file Tc. Table 3 shows the index file obtained by encoding-dictionary compression, and Table 4 shows the encoding dictionary for the Address field.
Table 4: Encoding dictionary example for the Address field
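One common reading of K-of-N encoding is that each distinct value receives an N-bit code with exactly K bits set. The sketch below builds such a dictionary for the Address field under that assumption; the source does not give the values of K and N or the rule for assigning codes, so the parameters, the sample addresses and the function name `k_of_n_dictionary` are illustrative only.

```python
from itertools import combinations


def k_of_n_dictionary(values, k, n):
    """Assign each distinct value an n-bit code with exactly k bits set."""
    distinct = sorted(set(values))
    codes = ("".join("1" if i in ones else "0" for i in range(n))
             for ones in combinations(range(n), k))
    dictionary = dict(zip(distinct, codes))
    if len(dictionary) < len(distinct):
        raise ValueError("n choose k is too small for the number of distinct values")
    return dictionary


addresses = ["block_0.z", "block_1.z", "block_0.z", "block_2.z"]
M = k_of_n_dictionary(addresses, k=2, n=4)
encoded = [M[a] for a in addresses]              # Address column of the compressed index
decode = {code: value for value, code in M.items()}

print(M)
print(encoded)
```

Because every code has the same length and weight, equality selections can compare codes directly in the compressed index, and the dictionary is consulted only to translate query constants or final results.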
The operations above complete the compressed storage of massive incomplete data. The compressed data obtained can then be operated on; the data-operation steps are as follows.
1. Querying massive incomplete data:
Step 1: First generate a query index from the query condition. The generation rule is: express the query statement in terms of Undef_Query and Def_Query.
Step 2: 1. If the query index contains only Def_Query, perform the selection directly on the compressed index file according to Def_Query.
2. If the query index contains Undef_Query, first look up the corresponding code in the encoding dictionary according to Undef_Query.
Step 3: Perform the selection and projection operations on the compressed index file.
Step 4: Decompress the query result.
2. Deleting massive incomplete data:
Step 1: Parse the delete statement into a deletion index.
Step 2: 1. If an indefinite predicate is present, look up its code in the encoding dictionary, then select from the compressed index file T the tuples whose deletion flag equals 0 and whose completeness flag equals 1, and set their deletion flag to 1. Project the selection result onto Address, decompress the compressed packages at the corresponding addresses, delete the tuples specified by the delete statement, import the tuples that were not deleted into the compression cache, and compress them once this is done.
2. If a definite predicate is present and no indefinite predicate is present, select from T by Def_Val value the tuples whose completeness flag equals 1, set their deletion flag to 1, project the selection result onto Address, and delete the compressed packages at those addresses directly. When the number of tuples whose deletion flag equals 1 exceeds a threshold, delete them as a whole, then delete the corresponding encoding-dictionary entries to obtain the result.
3. Modifying massive incomplete data:
Step 1: First use the query algorithm to find the set Address_Set of addresses that need modification.
Step 2: Decompress the corresponding compressed packages and import them into the database, perform the modification, and obtain the set Tp_ID_SET of modified tuples.
Step 3: Return the tuples that were not modified to their previous compressed package. Import each modified tuple into its Update_Buffer_{Def_Query,Com_Flag}, which is compressed once it holds a certain number of tuples.
Step 4: In the index file T, modify the tuples whose Tp_ID belongs to Tp_ID_SET, including Def_Query, Com_Flag, Undef_Query and Block_ID.
Step 5: Modify the encoding dictionary.
4. Inserting massive incomplete data:
Step 1: For each tuple to be inserted, first compute its Def_Query and Com_Flag.
Step 2: 1. If the tuple is complete, write t to the cache block awaiting compression Block_{Def_Query,Com_Flag}.
2. If the tuple is incomplete, write t to multiple cache blocks awaiting compression Block_{Def_Query,Com_Flag}.
Step 3: When a cache block is full, compress it as a whole, obtaining Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag}.
Step 4: Insert the tuple's information into the existing encoding dictionary.
Step 5: Insert the new tuple into the compressed index file according to the encoding dictionary.
Claims (1)
1. A method for storing and operating on massive incomplete data, characterized by comprising the following steps:
(1) When storing massive incomplete data, complete data and incomplete data are compressed and stored separately. The implementation steps are as follows:
(1.1) For a massive-data database system, collect the frequently used data manipulation statements, i.e. the predicates that appear after WHERE in query statements, and divide these predicates into definite predicates Def_Val and indefinite predicates Undef_Val.
(1.2) After all definite and indefinite predicates have been obtained, during compressed storage the attribute values of the indefinite predicates and the definite predicates satisfied by each tuple are stored in a data index table, while the tuple itself is written to the corresponding cache block awaiting compression. When a cache block is full, it is compressed and stored in order, and the address of the compressed file holding the tuple is recorded in the database.
The index table contains the following fields: Id, Tp_Id, Undef_Val_1, ..., Undef_Val_i, Def_Val, Block_Id, Delet_Flag, Com_Flag.
Id is the index number. Tp_Id is the tuple number. Undef_Val_i is the current tuple's attribute value for the i-th indefinite predicate. Def_Val stores, as a bit code, the definite predicates that the current tuple satisfies.
Block_Id is the number of the cache block that holds the current tuple.
Delet_Flag is the deletion flag: 1 if the data is to be deleted, otherwise 0.
Com_Flag is the tuple completeness flag: 1 if the tuple is complete, otherwise 0.
The compressed-file address table contains the following fields: Block_Id, Address. Block_Id is the number of the cache block that holds the current tuple, and Address is the compressed file produced by compressing that cache block.
For n definite predicates Q1, Q2, ..., Qn, let the bit code of field Def_Val of the current tuple be B1B2...Bn; if the current tuple satisfies condition Qi then Bi=1, otherwise Bi=0. If the data is to be deleted, set Delet_Flag=1, otherwise set it to 0. For m predicates Q1, Q2, ..., Qm, if the values of all Qi exist then Com_Flag=1, otherwise Com_Flag=0.
(1.3) Joining the index table and the compressed-file address table on attribute Block_Id yields the complete index file. Different Def_Val values correspond to different cache blocks, and hence to different compressed-file addresses. After a cache block has been compressed and stored, Block_Id is assigned a new value, so the same Def_Val value can correspond to several cache blocks and compressed-file addresses.
(1.4) Let D* be the massive incomplete data, containing m tuples and n attributes. Compression produces an index file and compressed data. If the index file contains i tuples and j attributes, then i ≥ m and j ≤ n. For each tuple t of D*, first compute the Def_Val value of the definite predicates that t satisfies and the value of Com_Flag, and write t to the cache block awaiting compression assigned to that pair, Block_{Def_Val,Com_Flag}. Then insert the attribute values Undef_Vals of t's indefinite query conditions, together with Def_Val, Block_Id_{Def_Val,Com_Flag}, Delet_Flag=0 and Com_Flag, into the index table. When Block_{Def_Query,Com_Flag} reaches the specified number of tuples, compress it with a compression algorithm and write Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag} to the address table. Finally, the index file is compressed with K-of-N encoding, which yields the compressed data K, the compressed index file T and the encoding dictionary M.
(2) Based on the compressed storage of massive incomplete data, the query operation is carried out as follows:
(2.1) First generate a query index from the query condition. The generation rule is: express the query statement in terms of Undef_Query and Def_Query.
(2.2) 1. If the query index contains only Def_Query, perform the selection directly on the compressed index file according to Def_Query.
2. If the query index contains Undef_Query, first look up the corresponding code in the encoding dictionary according to Undef_Query.
(2.3) Perform the selection and projection operations on the compressed index file.
(2.4) Decompress the query result.
(3) Based on the compressed storage of massive incomplete data, the deletion operation is carried out as follows:
(3.1) Parse the delete statement into a deletion index.
(3.2) 1. If an indefinite predicate is present, look up its code in the encoding dictionary, then select from the compressed index file T the tuples whose deletion flag equals 0 and whose completeness flag equals 1, and set their deletion flag to 1. Project the selection result onto Address, decompress the compressed packages at the corresponding addresses, delete the tuples specified by the delete statement, import the tuples that were not deleted into the compression cache, and compress them once this is done.
2. If a definite predicate is present and no indefinite predicate is present, select from T by Def_Val value the tuples whose completeness flag equals 1, set their deletion flag to 1, project the selection result onto Address, and delete the compressed packages at those addresses directly. When the number of tuples whose deletion flag equals 1 exceeds a threshold, delete them as a whole, then delete the corresponding encoding-dictionary entries to obtain the result.
(4) Based on the compressed storage of massive incomplete data, the modification operation is carried out as follows:
(4.1) First use the query algorithm to find the set Address_Set of addresses that need modification.
(4.2) Decompress the corresponding compressed packages and import them into the database, perform the modification, and obtain the set Tp_ID_SET of modified tuples.
(4.3) Return the tuples that were not modified to their previous compressed package. Import each modified tuple into its Update_Buffer_{Def_Query,Com_Flag}, which is compressed once it holds a certain number of tuples, usually 150.
(4.4) In the index file T, modify the tuples whose Tp_ID belongs to Tp_ID_SET, including Def_Query, Com_Flag, Undef_Query and Block_ID.
(4.5) Modify the encoding dictionary.
(5) based on the compressed storage of the massive incomplete data, the insertion operation on the massive incomplete data is completed as follows:
(5.1) for each tuple t to be inserted, its Def_Query and Com_Flag are computed first;
(5.2) 1. if t is complete, t is written to the to-be-compressed cache block Block_{Def_Query,Com_Flag};
2. if t is incomplete, t is written to multiple to-be-compressed cache blocks Block_{Def_Query,Com_Flag};
(5.3) once a cache block is full it is compressed as a whole, yielding Block_Id_{Def_Query,Com_Flag} and Address_{Def_Query,Com_Flag};
(5.4) the information of the tuple is inserted into the existing encoding dictionary;
(5.5) the new tuple is inserted into the compressed index file according to the encoding dictionary.
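The insertion steps (5.1) to (5.5) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the class `Store`, the constants `BLOCK_CAPACITY` and `REGIONS`, the attribute name `region`, the `pkg-N` address format, and the use of zlib over JSON are all hypothetical. In particular, the claim does not say how Def_Query is computed or why an incomplete tuple goes to several blocks; here Def_Query is assumed to be the value of one predicate attribute, and a tuple missing that attribute is written to one block per candidate value.

```python
import json
import zlib
from collections import defaultdict

BLOCK_CAPACITY = 2            # hypothetical block size; the source gives none
REGIONS = ("north", "south")  # hypothetical domain of the predicate attribute


class Store:
    """Toy model of cache blocks, compressed packages, dictionary and index T."""

    def __init__(self):
        self.blocks = defaultdict(list)  # (Def_Query, Com_Flag) -> pending tuples
        self.packages = {}               # Address -> compressed bytes
        self.dictionary = {}             # (Def_Query, Com_Flag) -> list of Block_ID
        self.index = []                  # rows of the compressed index file T
        self.next_block_id = 0

    def keys_for(self, t):
        # (5.1) compute Com_Flag (1 = complete) and the candidate Def_Query values.
        com_flag = int(all(v is not None for v in t.values()))
        region = t.get("region")
        candidates = (region,) if region is not None else REGIONS
        return [(r, com_flag) for r in candidates]

    def insert(self, tp_id, t):
        # (5.2) a complete tuple lands in one block; a tuple whose predicate
        # attribute is missing lands in one block per candidate value.
        for key in self.keys_for(t):
            self.blocks[key].append((tp_id, t))
            if len(self.blocks[key]) >= BLOCK_CAPACITY:
                self._flush(key)

    def _flush(self, key):
        # (5.3) compress the full block as a whole, obtaining Block_ID and Address.
        block = self.blocks.pop(key)
        block_id = self.next_block_id
        self.next_block_id += 1
        address = f"pkg-{block_id}"
        self.packages[address] = zlib.compress(json.dumps(block).encode("utf-8"))
        # (5.4) record the block in the encoding dictionary.
        self.dictionary.setdefault(key, []).append(block_id)
        # (5.5) add one index row per tuple in the block.
        for tp_id, _ in block:
            self.index.append({
                "Tp_ID": tp_id,
                "Def_Query": key[0],
                "Com_Flag": key[1],
                "Block_ID": block_id,
                "Address": address,
            })
```

Under these assumptions, inserting two complete tuples with the same `region` fills and compresses one block, while a tuple whose `region` is missing stays pending in one cache block per candidate value until those blocks fill.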
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611081152.4A CN106599112A (en) | 2016-11-30 | 2016-11-30 | Massive incomplete data storage and operation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611081152.4A CN106599112A (en) | 2016-11-30 | 2016-11-30 | Massive incomplete data storage and operation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106599112A (en) | 2017-04-26 |
Family
ID=58594035
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611081152.4A Pending CN106599112A (en) | Massive incomplete data storage and operation method | 2016-11-30 | 2016-11-30 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599112A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060101206A1 (en) * | 2004-11-05 | 2006-05-11 | Wood David A | Adaptive cache compression system |
CN101599072A (en) * | 2009-07-03 | 2009-12-09 | 南开大学 | Intelligent computer systems building method based on information inference |
WO2011129818A1 (en) * | 2010-04-13 | 2011-10-20 | Empire Technology Development Llc | Adaptive compression |
CN104750860A (en) * | 2015-04-16 | 2015-07-01 | 东北大学 | Data storage method of uncertain data |
Non-Patent Citations (2)
Title |
---|
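王妍: "Compression-based approximate query method for massive incomplete data", 《计算机研究与发展》 (Journal of Computer Research and Development) * |
赵锴: "Predicate-index-based compressed storage and data operation algorithms for massive data", 《计算机科学》 (Computer Science) * |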
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112199366A (en) * | 2019-04-28 | 2021-01-08 | 杭州数梦工场科技有限公司 | Data table processing method, device and equipment |
CN111459971A (en) * | 2020-04-01 | 2020-07-28 | 辽宁大学 | Skyline-join query processing method based on crowdsourcing |
CN111459971B (en) * | 2020-04-01 | 2023-11-10 | 辽宁大学 | Skyline-join query processing method based on crowdsourcing |
CN113505578A (en) * | 2021-05-26 | 2021-10-15 | 中国再保险(集团)股份有限公司 | Mass file quick checking method for typhoon and disaster great model |
CN113505578B (en) * | 2021-05-26 | 2024-07-30 | 中国再保险(集团)股份有限公司 | Rapid verification method for mass files of typhoon disaster model |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US7076486B2 (en) | Method and system for efficiently identifying differences between large files | |
US9691164B2 (en) | System and method for symbol-space based compression of patterns | |
KR102407510B1 (en) | Method, apparatus, device and medium for storing and querying data | |
US10726016B2 (en) | In-memory column-level multi-versioned global dictionary for in-memory databases | |
CN109446221B (en) | Interactive data exploration method based on semantic analysis | |
CN104040541B (en) | For more efficiently using memory to the technology of CPU bandwidth | |
CN111899089A (en) | Enterprise risk early warning method and system based on knowledge graph | |
CN106599112A (en) | Massive incomplete data storage and operation method | |
JP7426907B2 (en) | Advanced database decompression | |
CN111309930B (en) | Medical knowledge graph entity alignment method based on representation learning | |
WO2020098315A1 (en) | Information matching method and terminal | |
CN109558166A (en) | A kind of code search method of facing defects positioning | |
CN113377758A (en) | Data quality auditing engine and auditing method thereof | |
CN105488471B (en) | A kind of font recognition methods and device | |
US20100125614A1 (en) | Systems and processes for functionally interpolated increasing sequence encoding | |
CN105302915A (en) | High-performance data processing system based on memory calculation | |
CN114997181A (en) | Intelligent question-answering method and system based on user feedback correction | |
CN104731908A (en) | ETL-based data cleaning method | |
CN110866407B (en) | Analysis method, device and equipment for determining similarity between text of mutual translation | |
CN113627132A (en) | Data deduplication mark code generation method and system, electronic device and storage medium | |
CN115470355A (en) | Rail transit information query method and device, electronic equipment and storage medium | |
US20180349443A1 (en) | Edge store compression in graph databases | |
US10366067B2 (en) | Adaptive index leaf block compression | |
CN106598492A (en) | Compression optimization method applied to mass incomplete data | |
CN110572160A (en) | Compression method for decoding module code of instruction set simulator |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170426 |