CN102567527A - Materialized view layout in distributive system under column-orientated storage environment and maintaining method of materialized view layout - Google Patents
- Publication number
- CN102567527A (application number CN2011104527265A)
- Authority
- CN
- China
- Prior art keywords
- materialized view
- projection
- attribute
- storage environment
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention belongs to the technical field of databases and discloses a materialized view layout for a distributed system in a column-oriented storage environment, together with a method for maintaining that layout. The invention comprises a highly scalable data layout strategy and a method for efficiently maintaining view consistency. Relational tables are stored using a column-oriented storage model, and the materialized view layout is maintained by introducing consistency models. The invention is suited to large-scale distributed file systems in a column-oriented storage environment and provides a data management solution for data-analysis-oriented applications.
Description
Technical field
The invention belongs to the technical field of databases and specifically relates to a materialized view layout and maintenance method for a distributed system in a column-oriented storage environment.
Background
With the rapid growth of data volumes, data-intensive computing has attracted particular attention in current research. Many large IT companies, such as Google, Amazon, and their competitors, are building large-scale data analysis platforms to support data-intensive computing. Here, a data-intensive computing system acquires, updates, shares, and archives data and provides sufficient computing power over massive data sets. Cluster systems built from large numbers of shared-nothing commodity computers usually serve as the infrastructure that provides these services effectively and efficiently.
In general, data come from many sources (for example, operational databases and Web 2.0 web pages), and these data are continuously integrated into the data analysis platform (that is, the data-intensive computing system). The relationship between the data sources and the data analysis platform is shown in Fig. 1. A large-scale data analysis platform collects data from the various data sources and materializes and stores them for analysis. A view is a commonly used data structure well suited to efficient data analysis. However, when a materialized view stored on the data analysis platform fails to reflect the latest updates from the data sources, it becomes stale. Keeping materialized views consistent with the data sources is therefore a pressing problem.
Unlike in a traditional data warehouse, views in the data analysis platform are built on large-scale distributed file systems such as HDFS (Hadoop Distributed File System) and GFS (Google File System). Notably, HDFS manages data with a write-once, read-many file access model: once a file has been created, written, and closed, it cannot be modified again except by appending data to its end. In other words, records in a file cannot be deleted, inserted, or updated in place. In addition, the present invention stores relational tables using a column-oriented storage model rather than the traditional row-oriented storage model (that is, the N-ary model). Compared with the row-oriented model, the column-oriented model makes updating data in a file even more difficult. In this new environment, the new file access model and the column-oriented storage model therefore pose major challenges for materialized view maintenance.
The present invention overcomes the limitation of the prior art that files in a distributed file system cannot be updated, and proposes a materialized view layout and maintenance method for a distributed system in a column-oriented storage environment. It stores relational tables using a column-oriented storage model and introduces consistency models to maintain the materialized view layout.
Summary of the invention
The invention discloses a materialized view layout for a distributed system in a column-oriented storage environment, comprising:
Primary attribute set: the set of primary attributes;
Projections of primary attributes: at the physical level, each primary attribute is projected into its own projection; each such projection is divided into a plurality of segments, and each segment contains data tuples;
Projections of non-prime attributes: at the physical level, each non-prime attribute is projected into a projection of non-prime attributes;
Join index: the mapping between the projections of primary attributes and the projections of non-prime attributes;
Tag vector: a bit vector indicating, at the logical level, whether each data tuple exists;
The projections of primary attributes are mapped to the projections of non-prime attributes through the join index, and the tag vector indicates whether a data tuple in a projection of a primary attribute is present in the corresponding projection of non-prime attributes.
Wherein, each projection of a primary attribute is divided into a plurality of segments by a hash function.
Wherein, the data tuples are organized in a column-oriented manner.
Wherein, primary attributes and non-prime attributes are projected separately.
Wherein, in the projection, each primary attribute of the primary attribute set is projected into a separate column.
In the present invention, a materialized view is the precomputed and stored result of table joins, aggregations, or other time-consuming operations, so that these operations can be avoided at query time and results obtained quickly. The materialized view layout is the physical storage scheme of the materialized view, that is, how its data are organized and placed in the file system.
The invention also discloses a method for maintaining the materialized view layout in a distributed system in a column-oriented storage environment, comprising the following steps:
Step 1: record the operations performed on the materialized view in a view log;
Step 2: batch-process the operations in the view log according to a consistency model.
Wherein, the view log contains the basic operations of inserting tuples into and deleting tuples from the materialized view.
Wherein, the consistency models comprise an eventual consistency model and a timeline-based consistency model.
The beneficial effect of the invention is that consistency maintenance algorithms developed for traditional data warehouses and their data sources can be ported to large-scale data analysis platforms built on distributed systems, so that views on the data analysis platform can be kept consistent with the data sources.
Description of the drawings
Fig. 1 is a schematic diagram of the data layout of the materialized view in a distributed system in a column-oriented storage environment according to the present invention.
Fig. 2 is a schematic diagram of the data platform and the data sources in a distributed system in a column-oriented storage environment according to the present invention.
Fig. 3 is a schematic diagram of the join index of the materialized view layout according to the present invention.
Fig. 4 is a schematic diagram of the consistency models of the materialized view layout according to the present invention.
Embodiments
The present invention is described in further detail below with reference to specific embodiments and the drawings. The scope of protection is not limited to the following embodiments. Variations and advantages that a person skilled in the art can conceive without departing from the spirit and scope of the inventive concept are included in the invention, and the scope of protection is defined by the appended claims.
The embodiments cover two aspects. The first is the storage method for the materialized view: the invention stores the materialized view on a distributed file system and organizes the data in a column-oriented manner, and therefore proposes a new data layout method. The second is the maintenance method for materialized views built on this storage structure, that is, the update maintenance of the view, together with the corresponding models.
Fig. 2 is a schematic diagram of the data platform and the data sources in a distributed system in a column-oriented storage environment; the views in the data analysis platform are built on a large-scale distributed file system.
Fig. 1 shows the data layout of the materialized view of the present invention: the data sources are above the solid line and the data platform is below it. The relational tables R_1, R_2, ..., R_n reside at the data sources, and the materialized view is built on a data analysis platform based on the Hadoop Distributed File System (HDFS). Above the dashed line is the logical storage structure of the materialized view (MV); below it is its physical storage structure. At the logical level, the materialized view is a relational table. At the physical level, it is stored in HDFS in a column-oriented manner. The invention additionally defines two dedicated data structures: the join index (Join Index) and the tag vector (Tag Vector).
Storage of the materialized view: the primary attribute set {k_1, k_2, ..., k_n} is the set of primary attributes of the relational tables in the data sources. At the physical level, these n primary attributes are projected into n projections, one each; that is, the values of each primary attribute are stored separately. Each projection of the materialized view that contains a primary attribute is divided into several segments. The division is performed by a suitable hash function, for example hash(x) = x mod n, and does not depend on the value distribution of the primary attribute. The hash function is chosen so that the segments contain roughly equal numbers of data tuples. The data of each segment are organized in a column-oriented manner and stored in HDFS blocks. If a block overflows, another block is allocated to the segment. Non-prime attributes can be projected into any number of projections, but the order of the data tuples must be the same across them, so reconstructing the non-prime part of the materialized view is straightforward.
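The segmentation step can be illustrated with a minimal Python sketch. It assumes integer key values and uses the example function hash(x) = x mod n from the text; the names `segment_of` and `partition_projection` are hypothetical and do not come from the patent, and HDFS block placement is not modeled.

```python
from collections import defaultdict


def segment_of(key: int, n_segments: int) -> int:
    """Example hash function from the text: hash(x) = x mod n."""
    return key % n_segments


def partition_projection(keys, n_segments):
    """Split one primary-attribute projection into segments.

    Returns {segment id: [key values in arrival order]}.
    """
    segments = defaultdict(list)
    for key in keys:
        segments[segment_of(key, n_segments)].append(key)
    return dict(segments)


# Six consecutive keys spread evenly over three segments.
segments = partition_projection([10, 11, 12, 13, 14, 15], n_segments=3)
```

With consecutive keys each segment receives the same number of values; for skewed key values a different hash function would be needed to keep the segments balanced.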
Join index: to build the complete materialized view efficiently from its projections, a join index serves as the mapping between projections. Fig. 3 illustrates the join index between projection MV2 and projection MV1. The whole materialized view can be reconstructed through the join indexes. In the architecture of Fig. 2, each projection containing a primary attribute has one join index that maps it to the projections containing non-prime attributes. Consequently, n primary attribute projections have n corresponding join indexes.
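As a rough illustration of how a join index lets rows be reassembled, the following Python sketch assumes that the join index is a list aligned with the row order of the non-prime projections and that each entry is a (segment id, key) pair. That representation, the function name `rebuild_rows`, and the sample columns are assumptions made for this example only.

```python
def rebuild_rows(join_index, nonprime_columns):
    """Reassemble full rows from a join index and non-prime columns.

    join_index[i]     -> (segment id, key) of row i in the primary projection
    nonprime_columns  -> {column name: list of values in the same row order}
    """
    rows = []
    for position, (segment_id, key) in enumerate(join_index):
        row = {"segment": segment_id, "key": key}
        for name, values in nonprime_columns.items():
            row[name] = values[position]
        rows.append(row)
    return rows


join_index = [(1, 10), (2, 11), (0, 12)]
nonprime_columns = {"city": ["Paris", "Rome", "Oslo"], "amount": [5, 7, 9]}
rows = rebuild_rows(join_index, nonprime_columns)
```

Because every non-prime projection keeps the same tuple order, a single position is enough to collect all non-prime values of a row.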
Tag vector: the tag vector is a bit vector that indicates, at the logical level, whether each data tuple exists. If the i-th element of the vector is 1, the materialized view contains the i-th data tuple at the logical level. If the i-th element is 0, that data tuple does not belong to the materialized view at the logical level, even though it is still physically stored. The vector is needed because HDFS does not support modifying records within a file.
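The logical-deletion idea behind the tag vector can be sketched in a few lines of Python. The class `TagVector` and its methods are hypothetical names; the vector is held in memory here, and how it is persisted is not specified in the text.

```python
class TagVector:
    """Bit vector marking which physically stored tuples logically exist."""

    def __init__(self):
        self.bits = []

    def append(self, exists=True):
        """Add a bit for a newly stored tuple and return its position."""
        self.bits.append(1 if exists else 0)
        return len(self.bits) - 1

    def mark_deleted(self, i):
        """Logically delete tuple i; the stored data are left untouched."""
        self.bits[i] = 0

    def live_positions(self):
        """Positions of tuples that belong to the view at the logical level."""
        return [i for i, bit in enumerate(self.bits) if bit == 1]


tags = TagVector()
for _ in range(4):
    tags.append()
tags.mark_deleted(2)
```

Readers of the view would then skip every position whose bit is 0, so the deleted tuple disappears logically without any rewrite of the data files.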
The invention also discloses a maintenance method for the materialized view layout in a distributed system in a column-oriented storage environment, that is, a solution for maintaining materialized views on the data analysis platform. A view log records the operations performed on the view, namely deletions and insertions of data tuples in the materialized view, and two consistency models are proposed for use by specific applications.
When a data source is updated, the update information is sent to the data analysis platform. After an update message arrives at the data platform, the Strobe algorithm is invoked to process the update and bring the materialized view to a state consistent with the data sources. For example, when the data analysis platform receives a delete message, Strobe generates an operation that deletes the corresponding data tuples from the materialized view. When the platform receives an insert message, Strobe sends a compensation query to the relevant data sources, and when the platform receives the result of the compensation query it generates an insert action. To make the discussion precise, the invention gives the following two definitions:
Definition 1: two data tuples conflict if the values of their primary attributes are equal.
Definition 2: two data tuples are duplicates if the values of all their attributes are equal.
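The two definitions translate directly into predicates. In this minimal Python sketch, tuples are represented as dictionaries and the primary attributes are passed as a list of names; both choices, and the function names `conflicts` and `duplicates`, are assumptions for illustration.

```python
def conflicts(t1, t2, primary_attrs):
    """Definition 1: the primary attribute values of both tuples are equal."""
    return all(t1[a] == t2[a] for a in primary_attrs)


def duplicates(t1, t2):
    """Definition 2: the values of all attributes of both tuples are equal."""
    return t1 == t2


a = {"id": 1, "city": "Paris"}
b = {"id": 1, "city": "Rome"}   # same key as a, different non-prime value
c = {"id": 1, "city": "Paris"}  # identical to a
```

Here `a` and `b` conflict without being duplicates, while `a` and `c` are duplicates and therefore also conflict.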
By these definitions, two duplicate data tuples necessarily conflict, whereas two conflicting data tuples are not necessarily duplicates. The view log is the action list of the Strobe algorithm and contains two basic operations: deleting and inserting data tuples. Deletion(MV, k_i, var) deletes from the view MV the data tuples whose attribute k_i equals var. Insertion(MV, T) inserts the data tuple T into MV, provided that MV contains no data tuple that conflicts with T.
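The semantics of the two basic operations can be sketched at the logical level as follows. The view is modeled as an in-memory list of dictionaries, which is an assumption for illustration; in the layout described in this document a deletion would instead clear bits in the tag vector. The lowercase function names mirror Deletion and Insertion but are otherwise hypothetical.

```python
def deletion(mv, attr, value):
    """Deletion(MV, k_i, var): drop the tuples whose attribute attr equals value."""
    mv[:] = [t for t in mv if t[attr] != value]


def insertion(mv, new_tuple, primary_attrs):
    """Insertion(MV, T): add T only if no tuple in MV conflicts with it."""
    for t in mv:
        if all(t[a] == new_tuple[a] for a in primary_attrs):
            return False  # a conflicting tuple exists, so T is not inserted
    mv.append(new_tuple)
    return True


mv = [{"id": 1, "city": "Paris"}, {"id": 2, "city": "Rome"}]
rejected = insertion(mv, {"id": 1, "city": "Oslo"}, ["id"])  # conflicts with id 1
accepted = insertion(mv, {"id": 3, "city": "Oslo"}, ["id"])
deletion(mv, "id", 2)
```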
Algorithm 1:
Algorithm 2:
If none of the above cases applies, Append(MV, T, (SID_T, key_T), SL) is called, as in Algorithm 3, to append T to MV. First, the values of the non-prime attributes are appended. Next, the values of the primary attributes are appended and (SID_T, key_T) is added to the join index. Finally, the corresponding position of the tag vector is set to 1.
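The three append steps described above can be sketched as follows. This is not the patent's Algorithm 3, whose listing is not available here, but a minimal in-memory illustration; the class `ColumnarView`, its fields, and the reading of SID as a segment id are assumptions, and the SL parameter is omitted because the text does not define it.

```python
class ColumnarView:
    """In-memory stand-in for the column-oriented materialized view layout."""

    def __init__(self, nonprime_names):
        self.nonprime = {name: [] for name in nonprime_names}
        self.segments = {}    # segment id -> list of primary key values
        self.join_index = []  # row position -> (segment id, key)
        self.tag_vector = []  # row position -> 1 (exists) or 0 (deleted)

    def append(self, nonprime_values, segment_id, key):
        # Step 1: append the values of the non-prime attributes.
        for name in self.nonprime:
            self.nonprime[name].append(nonprime_values[name])
        # Step 2: append the primary attribute value and add (SID, key)
        # to the join index.
        self.segments.setdefault(segment_id, []).append(key)
        self.join_index.append((segment_id, key))
        # Step 3: set the corresponding tag vector position to 1.
        self.tag_vector.append(1)
        return len(self.tag_vector) - 1


view = ColumnarView(["city"])
position = view.append({"city": "Paris"}, segment_id=1, key=10)
```

Every step only adds data at the end of a structure, which matches the append-only access pattern of HDFS described earlier.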
Algorithm 3:
The consistency model is the key strategy in the materialized view maintenance of the present invention. Because the materialized view is analysis-oriented rather than transaction-oriented, it is unnecessary to execute the deletions and insertions recorded in the view log immediately. Instead, the operations in the view log are processed in batches at checkpoints, as shown in Fig. 4. Different policies for setting checkpoints give rise to different consistency models. This section proposes two: the eventual consistency model and the timeline-based consistency model.
Algorithm 4 is the procedure for the eventual consistency model. In this model, the size of the view log determines when the operations in the log are processed. As shown in Algorithm 4, when the size of the view log exceeds a threshold (th), a checkpoint is set and the operations recorded in the view log are processed one by one. At the end of each round of processing, the size of the view log is reset to 0. Under this model, if the data sources receive no updates for a long time, the materialized view gradually becomes consistent with the data sources, which is why this maintenance method is called the eventual consistency model.
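The threshold-driven checkpoint can be illustrated with a short Python sketch. It is not the patent's Algorithm 4 but a minimal illustration; measuring the log size as a number of operations, the class name `ThresholdLog`, and the callback `apply_op` are assumptions.

```python
class ThresholdLog:
    """View log that is flushed once its size exceeds a threshold."""

    def __init__(self, threshold, apply_op):
        self.threshold = threshold  # th in the text
        self.apply_op = apply_op    # callback that applies one operation
        self.pending = []
        self.checkpoints = 0

    def record(self, op):
        self.pending.append(op)
        if len(self.pending) > self.threshold:
            self.checkpoint()

    def checkpoint(self):
        # Process the logged operations one by one, then reset the log size to 0.
        for op in self.pending:
            self.apply_op(op)
        self.pending.clear()
        self.checkpoints += 1


applied = []
log = ThresholdLog(threshold=2, apply_op=applied.append)
for op in ["ins A", "del B", "ins C", "ins D"]:
    log.record(op)
```

After the four calls, the first three operations have been applied at one checkpoint and the fourth is still waiting in the log, which shows why the view can lag behind the data sources under this model.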
Algorithm 4:
Algorithm 5 is the procedure for the timeline-based consistency model. Its framework is similar to that of the eventual consistency model; the difference is that the timeline-based model sets checkpoints at regular intervals and therefore needs timestamp information. In Algorithm 5, the current system time is checked at the start of each loop iteration. If the time elapsed since the timestamp of the previous checkpoint has reached the configured parameter (interval), a checkpoint is set and all operations recorded between the previous checkpoint and the current one are executed. In this model, the timeline determines when the operations in the view log are processed, hence the name timeline-based consistency model.
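The interval-driven checkpoint can be sketched in the same style. This is not the patent's Algorithm 5 but a minimal illustration; the class `TimelineLog`, the `tick` method, and the use of explicit numeric timestamps instead of the system clock are assumptions that keep the example deterministic.

```python
class TimelineLog:
    """View log that is flushed when a fixed interval has elapsed."""

    def __init__(self, interval, apply_op, start_time=0.0):
        self.interval = interval
        self.apply_op = apply_op
        self.last_checkpoint = start_time
        self.pending = []  # (timestamp, operation) pairs since the last checkpoint

    def record(self, timestamp, op):
        self.pending.append((timestamp, op))

    def tick(self, now):
        """Called at the start of each loop iteration with the current time."""
        if now - self.last_checkpoint < self.interval:
            return False
        # Execute everything logged between the previous checkpoint and now.
        for _, op in self.pending:
            self.apply_op(op)
        self.pending.clear()
        self.last_checkpoint = now
        return True


applied = []
log = TimelineLog(interval=10, apply_op=applied.append)
log.record(1, "ins A")
early = log.tick(5)    # only 5 time units elapsed, no checkpoint
log.record(7, "del B")
due = log.tick(10)     # interval reached, checkpoint is set
log.record(12, "ins C")
later = log.tick(15)   # 5 units since the last checkpoint, no checkpoint
```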
Algorithm 5:
The advantage of the eventual consistency model is its simplicity: when the operations in the view log are processed depends only on the size of the log. The eventual consistency model is clearly a weak consistency model and is suitable only for analysis-oriented applications. The timeline-based model guarantees that the materialized view is brought to a state consistent with the data sources on a time basis. However, it requires additional information, such as timestamps. The timeline-based consistency model is more complex than the eventual consistency model but provides stronger consistency.
Claims (8)
1. A materialized view layout for a distributed system in a column-oriented storage environment, characterized in that it comprises:
Primary attribute set: the set of primary attributes;
Projections of primary attributes: at the physical level, each primary attribute is projected into its own projection; each such projection is divided into a plurality of segments, and each segment contains data tuples;
Projections of non-prime attributes: at the physical level, each non-prime attribute is projected into a projection of non-prime attributes;
Join index: the mapping between the projections of primary attributes and the projections of non-prime attributes;
Tag vector: a bit vector indicating, at the logical level, whether each data tuple exists;
The projections of primary attributes are mapped to the projections of non-prime attributes through the join index, and the tag vector indicates whether a data tuple in a projection of a primary attribute is present in the corresponding projection of non-prime attributes.
2. The materialized view layout for a distributed system in a column-oriented storage environment according to claim 1, characterized in that each projection of a primary attribute is divided into a plurality of segments by a hash function.
3. The materialized view layout for a distributed system in a column-oriented storage environment according to claim 1, characterized in that the data tuples are organized in a column-oriented manner.
4. The materialized view layout for a distributed system in a column-oriented storage environment according to claim 1, characterized in that primary attributes and non-prime attributes are projected separately.
5. The materialized view layout for a distributed system in a column-oriented storage environment according to claim 4, characterized in that, in the projection, each primary attribute of the primary attribute set is projected into a separate column.
6. A method for maintaining a materialized view layout in a distributed system in a column-oriented storage environment, characterized in that it comprises the following steps:
Step 1: record the operations performed on the materialized view in a view log;
Step 2: batch-process the operations in the view log according to a consistency model.
7. The method for maintaining a materialized view layout in a distributed system in a column-oriented storage environment according to claim 6, characterized in that the view log contains the basic operations of inserting tuples into and deleting tuples from the materialized view.
8. The method for maintaining a materialized view layout in a distributed system in a column-oriented storage environment according to claim 6, characterized in that the consistency models comprise an eventual consistency model and a timeline-based consistency model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011104527265A CN102567527A (en) | 2011-12-30 | 2011-12-30 | Materialized view layout in distributive system under column-orientated storage environment and maintaining method of materialized view layout |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011104527265A CN102567527A (en) | 2011-12-30 | 2011-12-30 | Materialized view layout in distributive system under column-orientated storage environment and maintaining method of materialized view layout |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102567527A (en) | 2012-07-11 |
Family
ID=46412926
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011104527265A Pending CN102567527A (en) | 2011-12-30 | 2011-12-30 | Materialized view layout in distributive system under column-orientated storage environment and maintaining method of materialized view layout |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102567527A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103761308A (en) * | 2014-01-23 | 2014-04-30 | 杭州电子科技大学 | Materialized view selection method based on self-adaption genetic algorithm |
CN103761308B (en) * | 2014-01-23 | 2017-02-08 | 杭州电子科技大学 | Materialized view selection method based on self-adaption genetic algorithm |
CN106462585A (en) * | 2014-03-21 | 2017-02-22 | 华为技术有限公司 | System and method for column-specific materialization scheduling |
CN106462585B (en) * | 2014-03-21 | 2019-10-22 | 华为技术有限公司 | System and method for particular column materialization scheduling |
CN104793926A (en) * | 2014-04-17 | 2015-07-22 | 厦门极致互动网络技术有限公司 | Resource allocation method and system in distributed system |
CN104793926B (en) * | 2014-04-17 | 2018-06-01 | 厦门极致互动网络技术股份有限公司 | Resource allocation method and system in a kind of distributed system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20120711 |