CN100412863C - Huge amount of data compacting storage method and implementation apparatus therefor

Info

Publication number: CN100412863C
Application number: CNB2005100890945A
Authority: CN
Inventors: 王珊; 杜小勇; 任永杰; 周庆庆; 冯玉
Original assignee: 北京人大金仓信息技术有限公司; 王珊; 杜小勇; 任永杰; 周庆庆; 冯玉
Current assignee: Beijing National People's Congress Gold Warehouse Information Technology Ltd By Share Ltd; Du Xiaoyong; Wang Shan
Priority date: 2005-08-05
Filing date: 2005-08-05
Publication date: 2008-08-20
Anticipated expiration: 2025-08-05
Also published as: CN1908932A

Abstract

A compact storage method for mass data comprises: detecting the CPU type and selecting a matching storage policy; keeping one record with complete information in a page as the reference for the other records; recording the differences between index records as increments; removing the version-control header data from historical data and converting dynamically between historical and active data; and re-encoding data values according to their frequency so that operations can be carried out without decoding. The invention improves query efficiency and reduces storage space.

Description

A compact storage method for mass data and an apparatus for implementing it

Technical field

The present invention relates to data storage technology, and in particular to storage technology for mass data in a database management system (DBMS) that uses the multi-version concurrency control (MVCC) protocol. Specifically, it is a compact storage method for mass data and an apparatus for implementing it.

Background technology

The storage subsystem is an important component of a database system. It is responsible for keeping data on storage media and provides data access to the query processing subsystem. A storage method refers to the format and layout in which data is stored on non-volatile media (such as disk or tape) and on volatile media (such as ordinary memory).

The common storage methods are the following:

1. Conservative storage strategy: this is the traditional storage mode [1]. Because some CPUs require data in memory to be aligned according to data type, this approach accommodates those CPUs by requiring data to be stored aligned both on disk and in memory. CPU efficiency is best in this case, but the stored size of the data inevitably increases, and with it the amount of I/O.

2. Compact storage strategy: many systems adopt this strategy according to the characteristics of the CPU [2]. If the CPU has no memory alignment requirement, as with Intel's 32-bit CPU family, the DBMS can store data unaligned on the storage medium. This reduces the storage space occupied by the data, thereby reducing I/O and improving query efficiency. The method places some burden on the CPU, because the alignment work is in fact done inside the CPU.

3. Compressed storage strategy: some systems compress part of the data to save storage space [3]. Compressing data requires the LZ algorithm or another CPU-intensive compression algorithm. Because compressed data loses its data format, further CPU time must be spent decompressing it when data is supplied to the query subsystem.

Method 1 has the advantage in CPU-bound applications, but for I/O-bound applications running on CPUs that do not require alignment, method 2 has the advantage. Method 1 also wastes storage. For applications with ample spare CPU capacity, method 3 offers some benefit. None of these systems, however, optimizes according to the semantic features of the data.

Summary of the invention

The object of the present invention is to provide a compact storage method for mass data and an apparatus for implementing it, so as to overcome the problems and shortcomings of the existing storage methods described above, improve query efficiency, and reduce storage space.

The technical solution of the present invention is:

A compact storage method for mass data, comprising the following steps:

A step of selecting a storage policy: the CPU type is detected, and a storage policy matching that CPU type is selected for storing data;

A step of compacting redundant data within a page: one record with complete information is kept in the page, and the values of the other records all refer to the values of that record;

A step of compacting composite indexes: the differences between the values of successive index records are recorded as increments;

A step of distinguishing historical data from active data: the version-control header data attached to historical data by the multi-version concurrency control (MVCC) protocol is removed, and dynamic conversion between historical data and active data is implemented;

A step of re-encoding: in a database storing mass data, the data values of conventional relational data are re-encoded according to their differing frequencies, and operations are carried out without decoding.

In the step of compacting redundant data within a page, the frequency with which repeated values occur in the page is first estimated; once a predetermined frequency is reached, one record is chosen and its values are recorded in full, and the values of the other records all refer to the values of that record.

In the re-encoding step, data values are re-encoded inside the database according to the frequency with which each data value (number, character or character string) occurs: high-frequency data values receive short codes, and low-frequency data values receive long codes.

The present invention also provides an apparatus for implementing compact storage of mass data, comprising:

A storage policy selection unit, which detects the CPU type and invokes a storage policy matching that CPU type for storing data;

A redundant data compaction unit, which determines the frequency with which repeated values occur in a page and, once a predetermined frequency is reached, chooses one record, records its values in full, and makes the values of the other records all refer to the values of that record;

A historical data compaction unit, which removes from historical data the header control data attached by the multi-version concurrency control (MVCC) protocol and implements dynamic conversion between historical data and active data;

A re-encoding unit, which, in a database storing mass data, re-encodes the data values of conventional relational data according to their differing frequencies, so that operations are carried out without decoding;

A composite index compaction unit, which records the differences between the values of successive index records as increments.

The beneficial effects of the present invention are as follows. I/O efficiency is improved: because the stored data is much smaller, fewer pages need to be accessed to read the same amount of data. Less CPU is consumed than with ordinary compression: the invention exploits the redundancy of the data by re-encoding and by referencing the values of neighboring tuples, so the data volume is reduced without running a conventional compression algorithm, and most computations need not unpack the data. Storage space is saved, because redundancy is reduced.

Description of drawings

Fig. 1 is a structural diagram of the Kingbase6 execution engine;

Fig. 2 is a flowchart of detecting the CPU and selecting a storage policy;

Fig. 3 is a flowchart of compacting redundant data within a page;

Fig. 4 is a flowchart of re-encoding;

Fig. 5 is a flowchart of distinguishing historical data from active data;

Fig. 6 is a structural block diagram of the apparatus of the present invention.

Embodiment

Specific embodiments of the present invention are described below with reference to the drawings. To address the problems and shortcomings of the existing storage methods described above, the present invention proposes, for database systems that use the MVCC concurrency protocol, a data compaction storage method that improves query efficiency and reduces storage space. The method comprises:

As shown in Fig. 1, in the KingbaseES6 database execution engine 101, a database cleanup command is executed and the data semantic analysis module performs the analysis. This includes the step of selecting a storage policy, in which the CPU type is detected and a storage policy matching that CPU type is selected for storing data. Once the storage policy has been determined, the following steps are carried out:

A step of compacting redundant data within a page: one record with complete information is kept in the page, and the other records all refer to that record;

A step of distinguishing historical data from active data: the header control data attached to historical data by the multi-version concurrency control protocol is removed, and dynamic conversion between historical data and active data is implemented;

A step of re-encoding: data values in the database are re-encoded according to their differing frequencies, and operations are carried out without decoding;

A step of compacting composite indexes: the differences between successive index records are recorded as increments.

The method of the invention covers the following two aspects:

Compact storage of heap data (heap); compact storage of index data (index).

Here, heap means ordinary relational data, and index means B+tree or other index data. Each formatted item of data is called a tuple (or record). All tuples are stored in pages; a page is a fixed-size block in a file.

Because this method uses far less CPU than methods based on conventional compression algorithms, preserves the record structure, and allows operations on the data without decompaction, we call it data compaction to distinguish it from compression.

We first describe the compact storage method for heap data.

Platform-specific storage. For different platforms, including IA32, SPARC and others, we combine the conservative and compact data storage strategies. The choice can be specified by the user: for example, for a CPU-bound application the user can specify the conservative strategy, and otherwise the compact strategy.

Compaction within a page. Within a page, a large number of tuples hold repeated values, so for these values we can reduce storage volume and I/O by reducing redundancy. Because the MVCC protocol, under a non-overwriting storage scheme, does not delete historical data, we first estimate the frequency with which repeated values occur in the page; once a certain frequency is reached, we choose one record and record its values in full, and the other records all refer to it. Because records are not deleted, this scheme incurs essentially no maintenance cost.
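The reference-record idea can be illustrated with a minimal Python sketch. It is not the KingbaseES6 page format: the function names, the column-wise comparison, and the `min_frequency` threshold are assumptions made for illustration. One row of the page is kept in full, and every other row stores only the columns in which it differs from that row.

```python
def compact_page(rows, min_frequency=0.5):
    """Return (reference_row, per_row_differences) for one page of tuples.

    Assumption: all rows have the same number of columns, and redundancy is
    measured as the share of column values that equal the reference row.
    """
    if not rows:
        return None, []
    width = len(rows[0])

    def shared(candidate):
        # Count column values across the page that equal the candidate's.
        return sum(r[c] == candidate[c] for r in rows for c in range(width))

    # Choose the row that shares the most values with the rest of the page.
    ref_index = max(range(len(rows)), key=lambda i: shared(rows[i]))
    reference = rows[ref_index]
    frequency = shared(reference) / (len(rows) * width)
    if frequency < min_frequency:
        # Too little repetition: keep every row in full, with no reference.
        return None, [dict(enumerate(r)) for r in rows]
    diffs = [{c: r[c] for c in range(width) if r[c] != reference[c]}
             for r in rows]
    return reference, diffs


def expand_page(reference, diffs):
    """Rebuild the original rows from the reference row and the differences."""
    if reference is None:
        return [tuple(d[c] for c in range(len(d))) for d in diffs]
    return [tuple(d.get(c, reference[c]) for c in range(len(reference)))
            for d in diffs]
```

For a page such as `[("Beijing", "CN", 1), ("Beijing", "CN", 2)]`, the second row is stored only as its differing third column, and `expand_page` restores both rows.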

Distinguishing historical and active data. To mark the different versions of a record, MVCC must attach a record header of varying size to the head of each record in heap data. In relatively large databases this brings considerable storage overhead and reduces I/O efficiency. In fact, we observe that most data in a database is historical data, that is, data produced by transactions that have already committed, whereas active data is produced by transactions still in progress. The control information of historical data can be represented very briefly, which reduces the size of the database. Since historical data and active data are relative notions and must be converted into each other, we implemented this automatic conversion when implementing the rollback segment technique.

Overall compaction by re-encoding. Because data values (numbers, characters or character strings) occur in the database with different frequencies, we can re-encode data values inside the database: high-frequency values receive short codes and low-frequency values receive long codes, which greatly shortens records that contain many repeated patterns. Compared with applying a general-purpose compression algorithm directly to the whole tuple, this approach has three advantages: (i) it preserves the tuple structure; (ii) the encoding process is a linear transformation; (iii) it consumes less CPU. It still uses somewhat more CPU than the other methods described above, however, so we adopt three measures to mitigate this. (i) Selective processing: re-encoding is carried out only when its predicted benefit reaches a certain level. (ii) Deferred decompaction: because our encoding is a linear transformation, data can usually be operated on without being decompacted, and decompaction is performed only when it is unavoidable. (iii) Sharing decompaction with the client: the user can choose whether decompaction takes place on the client or on the server.
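A small Python sketch can make the frequency-based re-encoding concrete. The source does not specify the actual code or the exact linear transformation, so this sketch substitutes a plain frequency-ranked dictionary encoding: the names `build_codebook`, `code_length`, and `count_equal`, and the 7-bits-per-byte variable-length format, are assumptions. It shows the two properties the text relies on: frequent values get shorter codes, and an equality predicate can be evaluated on the codes without decoding.

```python
from collections import Counter


def build_codebook(values):
    """Map each distinct value to an integer code; frequent values get small codes."""
    ranked = [value for value, _ in Counter(values).most_common()]
    return {value: code for code, value in enumerate(ranked)}


def code_length(code):
    """Bytes needed for a code in an assumed 7-bits-per-byte variable-length format."""
    return max(1, (code.bit_length() + 6) // 7)


def count_equal(encoded_column, codebook, literal):
    """Evaluate `column = literal` on the encoded data, without decoding any row."""
    code = codebook.get(literal)
    if code is None:
        return 0  # the literal never occurs in the column
    return sum(1 for c in encoded_column if c == code)


# Usage: encode a column once, then filter on the codes.
status = ["paid"] * 5 + ["open"] * 2 + ["void"]
codebook = build_codebook(status)
encoded = [codebook[v] for v in status]
open_rows = count_equal(encoded, codebook, "open")
```

Only the literal in the predicate is translated into a code; the stored column is never expanded back into strings.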

We next describe the compact storage method for index data:

Optimizing composite indexes. In addition to the storage-policy selection step and the in-page redundancy compaction step used for heap data, we apply a storage optimization specifically to composite indexes. We observe that in a composite index the data in all columns except the last (called the prefix) is almost always identical from one entry to the next, and the entries are kept in order. We can therefore reduce redundancy in a way similar to the reference-record compaction used for heap data. In addition, because index data is ordered, we record the differences between successive index records as increments; this technique is not used for heap data, where the compaction ratio would not be high enough. The re-encoding step used for heap data is not applied to indexes, because cases where it would be useful in an index are rare.
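The combination of a shared prefix and increments on the last column can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that index entries are sorted tuples whose last column is an integer, and the names `delta_encode` and `delta_decode` are hypothetical, not taken from the patent.

```python
def delta_encode(entries):
    """Encode sorted composite-index entries.

    An entry whose prefix equals the previous entry's prefix is stored as
    (None, increment of the last column); otherwise as (prefix, last column).
    """
    encoded = []
    previous = None
    for entry in entries:
        prefix, last = entry[:-1], entry[-1]
        if previous is not None and previous[:-1] == prefix:
            encoded.append((None, last - previous[-1]))
        else:
            encoded.append((prefix, last))
        previous = entry
    return encoded


def delta_decode(encoded):
    """Rebuild the full index entries from the encoded form."""
    entries = []
    prefix, last = None, 0
    for stored_prefix, value in encoded:
        if stored_prefix is None:
            last += value  # same prefix as before: apply the increment
        else:
            prefix, last = stored_prefix, value
        entries.append(prefix + (last,))
    return entries
```

For the entries `("cn", "bj", 100)`, `("cn", "bj", 103)`, `("cn", "bj", 110)`, only the first stores the prefix; the other two are stored as the increments 3 and 7.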

The apparatus of the present invention has been implemented in the KingbaseES6 system. It is an apparatus for implementing compact storage of mass data (as shown in Fig. 6), comprising:

A redundant data compaction unit, which determines the frequency with which repeated values occur in a page and, once a predetermined frequency is reached, chooses one record, records its values in full, and makes the other records all refer to that record;

A historical data compaction unit, which removes from historical data the header control data attached by the multi-version concurrency control protocol and implements dynamic conversion between historical data and active data;

A re-encoding unit, which re-encodes data values in the database according to their differing frequencies and allows most operations to be carried out without decoding;

A composite index compaction unit, which records the differences between successive index records as increments.

As shown in Fig. 2, when the installation script prepares to initialize the database, the storage policy selection unit is executed to determine the CPU type, taking the user's requirements into account. It determines whether the CPU can handle unaligned data: if so, the unaligned mode is selected for storing data; if not, the aligned mode is selected. Database initialization then proceeds.
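The selection logic of Fig. 2 can be sketched in Python as follows. The set of architectures treated as tolerating unaligned access, the function names, and the sample field layout `"bihq"` are all assumptions for illustration; a real installer would use its own detection. The `struct` module's `@` prefix (native alignment) and `=` prefix (no alignment padding) show how the two policies change the size of a record.

```python
import platform
import struct

# Assumption: architectures treated here as tolerating unaligned memory access.
UNALIGNED_OK = {"i386", "i686", "x86", "x86_64", "amd64"}


def choose_policy(machine=None, user_choice=None):
    """Return "aligned" or "packed"; an explicit user choice takes precedence."""
    if user_choice in ("aligned", "packed"):
        return user_choice
    machine = (machine or platform.machine()).lower()
    return "packed" if machine in UNALIGNED_OK else "aligned"


def record_size(policy, fields="bihq"):
    """Size in bytes of a record with the given struct field codes.

    "@" applies the platform's native alignment padding;
    "=" uses standard sizes with no padding.
    """
    prefix = "@" if policy == "aligned" else "="
    return struct.calcsize(prefix + fields)


policy = choose_policy()      # detected from the running machine
size = record_size(policy)    # 15 bytes when packed; larger when padded
```

With the sample layout (1-, 4-, 2- and 8-byte fields), the packed record occupies 15 bytes, while the aligned size depends on the platform's padding rules.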

As shown in Fig. 3, when data is about to be inserted, the redundant data compaction unit is executed. For a newly inserted record, it checks whether the page is full. If the page is full, it analyzes the redundancy of the data in the page, selects a suboptimal reference record, and reorganizes the page. If this frees space, insertion continues on this page; if no space is saved, insertion moves to the next page. As shown in Fig. 4, when re-encoding is about to be performed, the re-encoding unit is executed. It samples a number of pages according to the size of the relation and computes the value distribution, selects several encoding schemes based on that distribution and tries them on the sampled pages, and then judges from the SQL log recorded while user queries were running whether re-encoding would benefit those queries. If it would, re-encoding is carried out; otherwise the procedure ends. The benefit check is needed because this method may cause user queries to use more CPU; if the SQL log indicates a CPU bottleneck, this compaction method need not be used.
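The decision procedure of Fig. 4 might look like the Python sketch below. Everything specific in it is an assumption: the `cpu_bound` flag stands in for whatever the system derives from the SQL log, the saving estimate compares frequency-ranked variable-length codes against an assumed fixed width of 4 bytes per value, and the dictionary's own storage cost is ignored.

```python
import random
from collections import Counter


def estimated_saving(sample_values, fixed_width=4):
    """Fraction of bytes saved on the sample by frequency-ranked codes.

    Assumption: each value originally takes `fixed_width` bytes, and the
    value with frequency rank r is coded in max(1, ceil(bits(r) / 7)) bytes.
    """
    original = fixed_width * len(sample_values)
    if original == 0:
        return 0.0
    encoded = 0
    for rank, (_, count) in enumerate(Counter(sample_values).most_common()):
        encoded += count * max(1, (rank.bit_length() + 6) // 7)
    return 1 - encoded / original


def should_reencode(pages, sample_size, cpu_bound, min_saving=0.2, seed=0):
    """Sample pages, estimate the benefit, and skip re-encoding if CPU-bound."""
    if cpu_bound or not pages:
        return False
    rng = random.Random(seed)
    sample = rng.sample(pages, min(sample_size, len(pages)))
    values = [value for page in sample for value in page]
    return estimated_saving(values) >= min_saving
```

A caller would pass the relation's pages, a sample size scaled to the relation size, and the CPU-bound judgment taken from the query log.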

As shown in Fig. 5, when historical data is about to be distinguished, the historical data compaction unit is executed. For a given relation A, a new relation B with the same structure is created. A tuple is read from A and checked to see whether it is known to be committed or known to be rolled back. If so, its header information is removed, a flag bit is set, and the tuple is copied to relation B; if not, the tuple is copied to relation B unchanged. The unit then checks whether relation A contains further tuples: if so, the steps above are repeated; if not, relation A is replaced with relation B and the procedure ends.
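The copy loop of Fig. 5 can be sketched in Python as follows. The `Header` and `Row` classes, the single `txn_id` header field, and the `txn_state` lookup table are hypothetical simplifications; a real MVCC header carries more information than this.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Header:
    txn_id: int  # hypothetical: the transaction that produced this version


@dataclass(frozen=True)
class Row:
    data: Tuple
    header: Optional[Header]
    historical: bool = False  # the flag bit set once the header is removed


def compact_history(relation_a, txn_state):
    """Copy relation A into a new relation B, stripping settled MVCC headers.

    `txn_state` maps a transaction id to "committed", "rolled_back" or "active".
    """
    relation_b = []
    for row in relation_a:
        settled = (row.header is not None and
                   txn_state.get(row.header.txn_id) in ("committed", "rolled_back"))
        if settled:
            # Historical data: drop the header and set the flag bit.
            relation_b.append(Row(row.data, header=None, historical=True))
        else:
            # Active (or already compacted) data is copied unchanged.
            relation_b.append(row)
    return relation_b  # the caller then replaces relation A with relation B
```

Rows produced by transactions that are still active keep their headers, so they can later be converted to historical form by running the same pass again.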

The data processed by each of the above units is stored in the data dictionary and/or the business database, according to the purpose of the data.

In the actual storage of the database, the page format is as shown in Fig. 7. It implements the step of compacting redundant data within a page and the step of compacting composite indexes.

The control format of each tuple is as shown in Fig. 8. It implements the step of selecting a storage policy, the step of distinguishing historical data from active data, and the step of re-encoding.

The "row information bits" field indicates whether the row is historical data and which compaction mode the data uses (it is used together with the dictionary information).

With this method, in typical OLTP applications (such as the TPC-C test), the measured average data space usage decreases by 30% to 40%. Database performance improves by 20% to 30% in I/O-bound applications. Fig. 1 is a schematic of this method within the core execution engine of the KingbaseES6 system.

Specifically, the present invention has the following advantages:

1. I/O efficiency is improved: because the stored data is much smaller, fewer pages need to be accessed to read the same amount of data.

2. Less CPU is consumed than with ordinary compression: our method exploits the redundancy of the data by re-encoding and by referencing the values of neighboring tuples, so the data volume is reduced without running a conventional compression algorithm, and most computations need not unpack the data.

3. Storage space is saved, because redundancy is reduced.

The above embodiments are intended only to illustrate the present invention, not to limit it.

Notes:

[1] See Database System Implementation, by H. Garcia-Molina, J. D. Ullman and J. Widom, published by Prentice Hall.

[2] For example SQL Server 2000; see Microsoft's SQL Server 2000 Books Online.

[3] For example PostgreSQL; see the ALTER TABLE SET STORAGE clause of the PostgreSQL database.

Claims

1. A compact storage method for mass data, characterized in that it comprises the following steps:

2. The method according to claim 1, characterized in that it further comprises: if any of the above steps produces processed data, storing that data in the database.

3. The method according to claim 1, characterized in that the CPU type is detected and the compact storage strategy is selected for storing the data.

4. The method according to claim 1, characterized in that, in the step of compacting redundant data within a page, the frequency with which repeated values occur in the page is first estimated; once a predetermined frequency is reached, one record is chosen and its values are recorded in full, and the values of the other records all refer to the values of that record.

5. The method according to claim 1, characterized in that, in the re-encoding step, data values are re-encoded inside the database according to the frequency with which each data value occurs in the database: high-frequency data values receive short codes, and low-frequency data values receive long codes.

6. An apparatus for implementing compact storage of mass data, characterized in that it comprises:

A composite index compaction unit, which records the differences between the values of successive index records as increments;

7. The apparatus according to claim 6, characterized in that it further comprises: storing the data processed by the above apparatus in the database.