CN102567527A - Materialized view layout in a distributed system under a column-oriented storage environment and maintenance method thereof

Info

Publication number: CN102567527A
Application number: CN2011104527265A
Authority: CN
Inventors: 周傲英; 徐辰; 夏帆; 陈�峰; 祝海通; 周敏奇; 钱卫宁
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2011-12-30
Filing date: 2011-12-30
Publication date: 2012-07-11

Abstract

The invention belongs to the technical field of databases and discloses a materialized view layout in a distributed system under a column-oriented storage environment, together with a method for maintaining that layout. The invention comprises a highly scalable data layout strategy and a method for efficiently maintaining view consistency. According to the invention, relational tables are stored using a column-oriented storage model, and the materialized view layout is maintained by introducing consistency models. The invention is suitable for large-scale distributed file systems in a column-oriented storage environment and provides a data management solution for analysis-oriented applications.

Description

Materialized view layout in a distributed system under a column-oriented storage environment and maintenance method thereof

Technical field

The invention belongs to the technical field of databases and specifically relates to a materialized view layout in a distributed system under a column-oriented storage environment and a method for maintaining it.

Background art

With the rapid growth of data volumes, data-intensive computing has attracted particular attention in current research. Many large IT companies, such as Google, Amazon, and their competitors, are building large-scale data analysis platforms to support data-intensive computing. Here, a data-intensive computing system covers acquiring, updating, sharing, and archiving data, and provides sufficient computing power over massive data sets. Clusters composed of large numbers of shared-nothing commodity computers usually serve as the infrastructure that delivers these services effectively and efficiently.

In general, data come from many sources (for example, operational databases and Web 2.0 pages), and these data are continuously integrated into the data analysis platform (that is, the data-intensive computing system). The relationship between the data sources and the data analysis platform is shown in Fig. 1. The large-scale data analysis platform collects data from the various data sources and materializes and stores them for analysis. A view is a data structure commonly used for efficient data analysis. However, when a materialized view stored on the data analysis platform does not reflect the latest updates from the data sources, it becomes stale. How to keep materialized views consistent with the data sources is therefore a pressing problem.

Unlike a traditional data warehouse, the views on the data analysis platform are built on data held in a large-scale distributed file system, such as HDFS (Hadoop Distributed File System) or GFS (Google File System). Notably, HDFS manages data with a "write once, read many" file access model: once a file has been created, written, and closed, it cannot be modified again except by appending data to its end. In other words, records in a file cannot be deleted, inserted, or updated in place. In addition, the present invention stores relational tables using a column-oriented storage model rather than the traditional row-oriented storage model (that is, the N-ary model). Compared with the row-oriented model, the column-oriented model makes updating data in files even more difficult. Under this new environment, the new file access model and the column-oriented storage model therefore pose a great challenge to materialized view maintenance.

The present invention overcomes the limitation of prior-art distributed file systems that files cannot be updated, and proposes a materialized view layout in a distributed system under a column-oriented storage environment, together with a method for maintaining it. The invention stores relational tables using a column-oriented storage model and introduces consistency models to maintain the materialized view layout.

Summary of the invention

The invention discloses a materialized view layout in a distributed system under a column-oriented storage environment, comprising:

Prime attribute set: the set of prime attributes;

Projections of the prime attributes: at the physical level, each prime attribute is projected into its own projection; each such projection is divided into a plurality of segments, and each segment contains data tuples;

Projections of the nonprime attributes: at the physical level, the nonprime attributes are projected into nonprime-attribute projections;

Join index: the mapping between the prime-attribute projections and the nonprime-attribute projections;

Tag vector: a bit vector indicating, at the logical level, whether each data tuple exists;

The prime-attribute projections are mapped to the nonprime-attribute projections through the join index; the tag vector indicates whether a data tuple in a prime-attribute projection exists in the corresponding nonprime-attribute projection.

Wherein the projection of each prime attribute is divided into a plurality of segments by a hash function.

Wherein the data tuples are organized in a column-oriented storage manner.

Wherein the prime attributes and the nonprime attributes are projected separately.

Wherein, in the projection, each prime attribute of the prime attribute set is projected into its own column.

In the present invention, a materialized view is the precomputed and stored result of time-consuming operations such as table joins or aggregations; when a query is executed, these operations can be avoided and the result obtained quickly. The materialized view layout is the physical storage scheme of the materialized view, that is, how its data are organized and placed in the file system.

The invention also discloses a method for maintaining the materialized view layout in a distributed system under a column-oriented storage environment, comprising the following steps:

Step 1: record the operations performed on the materialized view in a view log;

Step 2: batch-process the operations in the view log according to a consistency model.

Wherein the view log comprises the basic operations of inserting tuples into and deleting tuples from the materialized view.

Wherein the consistency model comprises an eventual consistency model and a timeline-based consistency model.

The beneficial effect of the invention is that consistency maintenance algorithms used between a traditional data warehouse and its data sources can be ported to a large-scale data analysis platform built on a distributed system, so that the views on the data analysis platform can be kept consistent with the data sources.

Description of drawings

Fig. 1 is a schematic diagram of the materialized view data layout of the materialized view layout in a distributed system under a column-oriented storage environment according to the present invention.

Fig. 2 is a schematic diagram of the data platform and the data sources for the materialized view layout in a distributed system under a column-oriented storage environment according to the present invention.

Fig. 3 is a schematic diagram of the join index of the materialized view layout in a distributed system under a column-oriented storage environment according to the present invention.

Fig. 4 is a schematic diagram of the consistency models of the materialized view layout in a distributed system under a column-oriented storage environment according to the present invention.

Embodiment

The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. The scope of protection is not limited to the following embodiments. Variations and advantages that a person skilled in the art can conceive without departing from the spirit and scope of the inventive concept are all included in the invention, and the scope of protection is defined by the appended claims.

The embodiment of the present invention has two aspects. The first is the storage method of the materialized view: the invention keeps the materialized view on a distributed file system and organizes the data in a column-oriented manner, and therefore proposes a novel data layout method. The second is the maintenance method of the materialized view based on this storage structure, that is, the update maintenance of the view, for which corresponding models are provided.

Fig. 2 is a schematic diagram of the data platform and the data sources in a distributed system under a column-oriented storage environment; the views on the data analysis platform are built on data held in a large-scale distributed file system.

Fig. 1 shows the data layout of the materialized view of the present invention. Above the solid line are the data sources; below it is the data platform. Relational tables R_1, R_2, ..., R_n reside at the respective data sources, while the materialized view is built on a data analysis platform based on the Hadoop Distributed File System (HDFS). Above the dashed line is the logical storage structure of the materialized view (MV); below it is its physical storage structure. At the logical level, the materialized view is a relational table. At the physical level, it is stored in HDFS in a column-oriented manner. The invention additionally defines two dedicated data structures: the join index (Join Index) and the tag vector (Tag Vector).

Storage of the materialized view: the prime attribute set {k_1, k_2, ..., k_n} is the set of prime attributes of the relational tables in the data sources. At the physical level, these n prime attributes are projected into n projections, one each, so the values of each prime attribute are stored separately. Each prime-attribute projection of the materialized view is divided into several segments. The division is performed by a suitable hash function, for example hash(x) = x mod n, and does not depend on the value distribution of the prime attribute. The hash function ensures that each segment holds roughly the same number of data tuples. Within each segment, the data are organized in a column-oriented manner and stored in HDFS blocks. If a block overflows, another block is allocated to the segment. The nonprime attributes may be projected into any number of projections, provided the data tuples appear in the same order in each; reconstructing the nonprime part of the materialized view is therefore straightforward.
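As a minimal sketch of the hash-based segmentation described above, the following Python snippet assigns integer attribute values to segments with hash(x) = x mod n. The function names and the sample values are illustrative assumptions, and n is taken here to be the number of segments.

```python
from collections import defaultdict


def segment_of(value: int, num_segments: int) -> int:
    """Example hash function from the text: hash(x) = x mod n."""
    return value % num_segments


def partition(values, num_segments):
    """Assign each attribute value to a segment, keeping arrival order."""
    segments = defaultdict(list)
    for value in values:
        segments[segment_of(value, num_segments)].append(value)
    return dict(segments)


# Hypothetical values of one prime attribute, spread over 4 segments.
segments = partition([10, 3, 7, 12, 5, 8], num_segments=4)
```

How evenly the segments fill depends on the values actually stored; a modulo hash balances well only when the values are spread across the residues.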

Join index: to build the whole materialized view from the projections efficiently, join indexes serve as the mappings between projections. Fig. 3 illustrates the join index between projection MV2 and projection MV1. The whole materialized view can be rebuilt through the join indexes. In the framework of Fig. 2, each prime-attribute projection has one join index that serves as its mapping to the nonprime-attribute projections. Hence n prime-attribute projections have n corresponding join indexes.

Tag vector: the tag vector is a bit vector that indicates, at the logical level, whether each data tuple exists. If the i-th element of the vector is 1, the materialized view contains the i-th data tuple at the logical level. If the i-th element is 0, the data tuple does not belong to the materialized view at the logical level, even though it is still physically stored. The vector is needed because HDFS does not support modifying records in a file.
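The interplay of projections, join index, and tag vector can be illustrated with a small in-memory sketch. The flat Python lists below stand in for append-only HDFS files, and all values and names are hypothetical; segments and blocks are omitted for brevity.

```python
# Append-only physical storage (lists stand in for HDFS files).
prime_projection = [104, 101, 103, 102]     # values of one prime attribute
nonprime_projection = ["a", "b", "c", "d"]  # values of one nonprime attribute
# join_index[p] is the position in nonprime_projection that belongs to
# the tuple stored at position p of prime_projection.
join_index = [3, 0, 2, 1]
# tag_vector[q] == 1 means the tuple at position q logically exists.
tag_vector = [1, 0, 1, 1]


def logical_view():
    """Rebuild the logical rows, skipping tuples whose tag bit is 0."""
    rows = []
    for pos, prime_value in enumerate(prime_projection):
        target = join_index[pos]
        if tag_vector[target] == 1:
            rows.append((prime_value, nonprime_projection[target]))
    return rows


rows = logical_view()
```

In this sketch the tuple with prime value 102 is still physically stored but is hidden from the logical view because its tag bit is 0.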

The invention also discloses a method for maintaining the materialized view layout in a distributed system under a column-oriented storage environment, that is, a solution for maintaining materialized views on the data analysis platform. A view log records the operations performed on the view, namely deletions and insertions of data tuples in the materialized view, and two consistency models are proposed for use by specific applications.

When a data source is updated, the update information is sent to the data analysis platform. After an update message reaches the platform, the Strobe algorithm is invoked to process the update and bring the materialized view to a state consistent with the data sources. For example, when the platform receives a deletion message, Strobe generates an operation that deletes the corresponding data tuples from the materialized view. When the platform receives an insertion message, Strobe sends a compensating query to the relevant data sources, and when the platform receives the result of that query, it generates an insert action. For clarity, the present invention gives the following two definitions:

Definition 1: two data tuples conflict if the values of their prime attributes are equal.

Definition 2: two data tuples are duplicates if the values of all their attributes are equal.

By these definitions, two duplicate data tuples necessarily conflict, but conflicting data tuples are not necessarily duplicates. The view log is the action list of the Strobe algorithm and contains two basic operations, deletion and insertion of data tuples. Deletion(MV, k_i, var) deletes from view MV the data tuples whose attribute k_i equals var. Insertion(MV, T) inserts data tuple T into MV, provided MV contains no data tuple that conflicts with T.
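The two definitions can be expressed as simple predicates. The sketch below represents data tuples as Python dictionaries; the attribute names and sample values are hypothetical.

```python
PRIME_ATTRS = ("k1", "k2")  # hypothetical prime attribute names


def conflicts(t1: dict, t2: dict) -> bool:
    """Definition 1: equal values on every prime attribute."""
    return all(t1[a] == t2[a] for a in PRIME_ATTRS)


def duplicates(t1: dict, t2: dict) -> bool:
    """Definition 2: equal values on all attributes."""
    return t1 == t2


a = {"k1": 1, "k2": 7, "city": "Shanghai"}
b = {"k1": 1, "k2": 7, "city": "Beijing"}
c = dict(a)
```

Here a and b conflict without being duplicates, while a and c are duplicates and therefore also conflict.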

Algorithm 1 is the procedure for deleting data tuples from the materialized view. As Algorithm 1 shows, the corresponding hash function is used to determine the segment to which var belongs. The algorithm then scans each element of that segment; if the value of an element equals var, it obtains (SID_i, key_i) from the join index. If the tag vector value at (SID_i, key_i) is 1, a data tuple exists whose attribute k_i has the value var. To delete this data tuple from the materialized view MV, a simple solution is to read all the data tuples in the segment and write back all of them except those to be deleted. However, this approach consumes a large amount of I/O and network bandwidth in a distributed file system. A better solution is to set the value at the corresponding position of the tag vector to 0.

Algorithm 1:
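The algorithm listing itself is not reproduced in this text. The following Python sketch follows the prose description of the deletion procedure under stated assumptions: the class ColumnView, its store helper, and the in-memory lists are hypothetical stand-ins for the HDFS-backed layout, and each (SID, key) pair is treated simply as an opaque location returned by the join index.

```python
class ColumnView:
    """Toy in-memory stand-in for the layout; not an HDFS implementation."""

    def __init__(self, num_prime: int, num_segments: int):
        self.num_segments = num_segments
        # segments[i][s] holds (value, location) pairs for prime attribute k_i;
        # `location` is the (SID, key) pair a join index would return.
        self.segments = [[[] for _ in range(num_segments)]
                         for _ in range(num_prime)]
        self.tag = {}  # tag vector: (SID, key) -> 0 or 1

    def store(self, i: int, value: int, location: tuple):
        """Helper for the demo: place one prime value, mark its tuple present."""
        self.segments[i][value % self.num_segments].append((value, location))
        self.tag[location] = 1

    def deletion(self, i: int, var: int) -> int:
        """Logically delete tuples whose attribute k_i equals var."""
        segment = self.segments[i][var % self.num_segments]  # hash to segment
        deleted = 0
        for value, location in segment:          # scan the segment
            if value == var and self.tag.get(location) == 1:
                self.tag[location] = 0           # flip the bit, rewrite nothing
                deleted += 1
        return deleted


# Two prime attributes are assumed; only k_0 is populated in this demo.
mv = ColumnView(num_prime=2, num_segments=2)
mv.store(0, 5, (0, 0))
mv.store(0, 7, (0, 1))
mv.store(0, 5, (0, 2))
first = mv.deletion(0, 5)   # flips two tag bits
second = mv.deletion(0, 5)  # nothing left to delete
```

The stored values are never rewritten; only tag bits change, which matches the append-only constraint of the file system.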

Algorithm 2 is the procedure for inserting a data tuple into the materialized view. Before data tuple T is inserted, the algorithm must determine whether the materialized view contains a data tuple that conflicts with T. As in deletion, hashing is used to locate the corresponding segments. Because each prime attribute value is hashed, the n hash computations yield n pairs (SID_i, key_i) (i = 1, 2, ..., n). If these n pairs have the same value, namely (SID, key), and the corresponding tag vector value is 1, a data tuple that conflicts with T exists, and T is discarded. If the n pairs have the same value (SID, key) and the corresponding tag vector value is 0, a further check is needed: if all the nonprime attribute values of T equal the nonprime attribute values stored at (SID, key), a data tuple that duplicates T exists at the physical level although it does not exist at the logical level. In that case it suffices to set the tag vector value at (SID, key) to 1.

Algorithm 2:

If none of the above cases applies, Append(MV, T, (SID_T, key_T), SL) is called, as in Algorithm 3, to append T to MV. First, the nonprime attribute values are appended. Next, the prime attribute values are appended and (SID_T, key_T) is added to the join index. Finally, the value at the corresponding position of the tag vector is set to 1.

Algorithm 3:
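The listings for insertion and append are not reproduced in this text. The sketch below combines both procedures as the prose describes them, under several assumptions: ColumnView is a hypothetical in-memory stand-in, a single storage unit with SID 0 is used, the SL parameter of Append is omitted, and matching locations are found by intersecting the per-attribute candidates, which is one possible reading of "the n pairs have the same value".

```python
class ColumnView:
    """Toy in-memory stand-in for the layout; not an HDFS implementation."""

    def __init__(self, num_prime: int, num_segments: int):
        self.num_segments = num_segments
        self.segments = [[[] for _ in range(num_segments)]
                         for _ in range(num_prime)]
        self.nonprime = []  # nonprime[key] holds one tuple's nonprime values
        self.tag = {}       # tag vector: (SID, key) -> 0 or 1

    def _locations(self, primes):
        """Locations whose stored prime values all equal `primes`."""
        common = None
        for i, value in enumerate(primes):
            segment = self.segments[i][value % self.num_segments]
            found = {loc for v, loc in segment if v == value}
            common = found if common is None else common & found
        return common or set()

    def append(self, primes, nonprimes):
        loc = (0, len(self.nonprime))            # (SID_T, key_T), SID fixed to 0
        self.nonprime.append(tuple(nonprimes))   # 1. nonprime values first
        for i, value in enumerate(primes):       # 2. prime values + join entry
            self.segments[i][value % self.num_segments].append((value, loc))
        self.tag[loc] = 1                        # 3. mark as logically present
        return loc

    def insertion(self, primes, nonprimes) -> str:
        candidates = self._locations(primes)
        if any(self.tag[loc] == 1 for loc in candidates):
            return "dropped"                     # a live conflicting tuple exists
        for loc in candidates:
            if self.nonprime[loc[1]] == tuple(nonprimes):
                self.tag[loc] = 1                # physical duplicate: revive it
                return "revived"
        self.append(primes, nonprimes)
        return "appended"


mv = ColumnView(num_prime=2, num_segments=4)
r1 = mv.insertion((1, 2), ("a",))   # empty view: appended
r2 = mv.insertion((1, 2), ("b",))   # conflicts with the live tuple
mv.tag[(0, 0)] = 0                  # simulate a logical deletion
r3 = mv.insertion((1, 2), ("a",))   # duplicate of the deleted tuple
mv.tag[(0, 0)] = 0
r4 = mv.insertion((1, 2), ("b",))   # same primes, new nonprime values
```

Reviving a physically stored duplicate avoids writing the same values again, which keeps the append-only files from growing unnecessarily.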

The consistency model is the key strategy of the materialized view maintenance of the present invention. Because the materialized view is analysis-oriented rather than transaction-oriented, it is unnecessary to execute the deletions and insertions recorded in the view log immediately; instead, these operations are processed in batches at checkpoints, as shown in Fig. 4. Different strategies for setting checkpoints lead to different consistency models. Two consistency models are proposed here: the eventual consistency model and the timeline-based consistency model.

Algorithm 4 is the procedure of the eventual consistency model. In this model, the size of the view log determines when the operations in the log are processed. As Algorithm 4 shows, when the size of the view log exceeds a threshold (th), a checkpoint is set and the operations recorded in the view log are processed one by one. When each round of processing finishes, the size of the view log is reset to 0. Under this model, if the data sources receive no updates for a long period, the materialized view gradually becomes consistent with the data sources, which is why this maintenance method is called the eventual consistency model.

Algorithm 4:
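The listing is not reproduced in this text. As a rough sketch of the size-triggered checkpoint described above, the class below buffers operations and applies them in one batch once the log grows beyond the threshold th. The class name, the apply callback, and the string operations are assumptions made for illustration.

```python
class EventualConsistencyLog:
    """View log that checkpoints when its size exceeds a threshold."""

    def __init__(self, threshold: int, apply):
        self.threshold = threshold  # th in the text
        self.apply = apply          # callback that executes one logged operation
        self.ops = []

    def record(self, op):
        """Log one operation; checkpoint once the log exceeds the threshold."""
        self.ops.append(op)
        if len(self.ops) > self.threshold:
            self.checkpoint()

    def checkpoint(self):
        """Process the logged operations one by one, then reset the log."""
        for op in self.ops:
            self.apply(op)
        self.ops.clear()            # log size back to 0


applied = []
log = EventualConsistencyLog(threshold=2, apply=applied.append)
log.record("delete k1=5")
log.record("insert T1")
pending_before = len(log.ops)      # still buffered, threshold not exceeded
log.record("insert T2")            # third entry exceeds th, batch is applied
```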

Algorithm 5 is the procedure of the timeline-based consistency model. Its framework resembles that of the eventual consistency model; the difference is that the timeline-based model sets checkpoints periodically and therefore needs timestamp information. In Algorithm 5, the current system time is checked at the start of each loop iteration. If the interval between the current time and the timestamp of the previous checkpoint reaches the configured parameter (interval), a checkpoint is set and all operations between the previous checkpoint and the current checkpoint are executed. In this model, the timeline decides when the operations in the view log are processed, hence the name timeline-based consistency model.

Algorithm 5:
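The listing is not reproduced in this text. The sketch below illustrates the time-triggered checkpoint under the assumption that the caller passes the current time explicitly, which keeps the example deterministic; a real loop would read the system clock instead. The class and method names are hypothetical.

```python
class TimelineConsistencyLog:
    """View log that checkpoints whenever a fixed interval has elapsed."""

    def __init__(self, interval: float, apply, start: float = 0.0):
        self.interval = interval        # checkpoint period
        self.apply = apply              # callback that executes one operation
        self.last_checkpoint = start    # timestamp of the previous checkpoint
        self.ops = []                   # (timestamp, operation) pairs

    def record(self, op, now: float):
        self.ops.append((now, op))

    def tick(self, now: float) -> bool:
        """One loop iteration: checkpoint if the interval has been reached."""
        if now - self.last_checkpoint < self.interval:
            return False
        for _, op in self.ops:          # everything since the last checkpoint
            self.apply(op)
        self.ops.clear()
        self.last_checkpoint = now
        return True


applied = []
log = TimelineConsistencyLog(interval=10, apply=applied.append)
log.record("delete k1=5", now=3)
early = log.tick(now=5)      # interval not reached yet
on_time = log.tick(now=10)   # interval reached, batch applied
log.record("insert T1", now=12)
too_soon = log.tick(now=15)  # only 5 time units since the last checkpoint
```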

The advantage of the eventual consistency model is its simplicity: when the operations in the view log are processed depends only on the size of the log. The eventual consistency model is clearly a weak consistency model and is suitable only for analysis-oriented applications. The timeline-based model guarantees that the materialized view is brought to a state consistent with the data sources at regular times. However, the algorithm requires more information, such as timestamps. The timeline-based consistency model is more complex than the eventual consistency model but provides stronger consistency.

Claims

1. A materialized view layout in a distributed system under a column-oriented storage environment, characterized by comprising:

Prime attribute set: the set of prime attributes;

2. The materialized view layout in a distributed system under a column-oriented storage environment according to claim 1, characterized in that the projection of each prime attribute is divided into a plurality of segments by a hash function.

3. The materialized view layout in a distributed system under a column-oriented storage environment according to claim 1, characterized in that the data tuples are organized in a column-oriented storage manner.

4. The materialized view layout in a distributed system under a column-oriented storage environment according to claim 1, characterized in that the prime attributes and the nonprime attributes are projected separately.

5. The materialized view layout in a distributed system under a column-oriented storage environment according to claim 4, characterized in that, in the projection, each prime attribute of the prime attribute set is projected into its own column.

6. A method for maintaining a materialized view layout in a distributed system under a column-oriented storage environment, characterized by comprising the following steps:

7. The method for maintaining a materialized view layout in a distributed system under a column-oriented storage environment according to claim 6, characterized in that the view log comprises the basic operations of inserting tuples into and deleting tuples from the materialized view.

8. The method for maintaining a materialized view layout in a distributed system under a column-oriented storage environment according to claim 6, characterized in that the consistency model comprises an eventual consistency model and a timeline-based consistency model.