CN102567527A - Materialized view layout in distributive system under column-orientated storage environment and maintaining method of materialized view layout - Google Patents


Publication number
CN102567527A
Authority
CN
China
Prior art keywords
materialized view
projection
attribute
storage environment
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011104527265A
Other languages
Chinese (zh)
Inventor
周傲英
徐辰
夏帆
陈�峰
祝海通
周敏奇
钱卫宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN2011104527265A priority Critical patent/CN102567527A/en
Publication of CN102567527A publication Critical patent/CN102567527A/en
Pending legal-status Critical Current

Abstract

The invention belongs to the technical field of databases and discloses a materialized view layout for a distributed system in a column-oriented storage environment, together with a method for maintaining that layout. The invention comprises a highly scalable data layout strategy and a method for efficiently maintaining view consistency. Relational tables are stored using a column-oriented storage model, and the materialized view layout is maintained by introducing a consistency model. The invention is suited to large-scale distributed file systems in a column-oriented storage environment and provides a data management solution for analysis-oriented applications.

Description

Materialized view layout in a distributed system under a column-oriented storage environment and maintenance method thereof
Technical field
The invention belongs to the technical field of databases and specifically relates to a materialized view layout in a distributed system under a column-oriented storage environment and a method for maintaining it.
Background
With the rapid growth of data volumes, data-intensive computing has attracted particular attention in current research. Many large IT companies, such as Google, Amazon, and their competitors, are building large-scale data analysis platforms to support data-intensive computing. Here, a data-intensive computing system covers acquiring, updating, sharing, and archiving data, and provides sufficient computing power over massive data sets. Clusters built from large numbers of shared-nothing commodity computers usually serve as the infrastructure that delivers these services effectively and efficiently.
In general, data comes from many sources (for example, operational databases and Web 2.0 pages), and all of this data is continuously integrated into the data analysis platform (that is, the data-intensive computing system). The relationship between the data sources and the data analysis platform is shown in Fig. 1. The platform collects data from the various sources and stores it in materialized form for analysis. A view is a data structure commonly used to process data analysis efficiently. However, when a materialized view stored on the data analysis platform does not reflect the latest updates from the data sources, the view becomes stale. How to keep materialized views consistent with the data sources is therefore a problem that urgently needs solving.
Unlike a traditional data warehouse, the views in the data analysis platform are built on a large-scale distributed file system such as HDFS (Hadoop Distributed File System) or GFS (Google File System). Notably, HDFS manages data with a "write once, read many" file access model: once a file has been created, written, and closed, it cannot be modified again except by appending data to its end. In other words, records in a file cannot be deleted, inserted, or updated in place. In addition, the invention stores relational tables using a column-oriented storage model rather than the traditional row-oriented model (that is, the N-ary storage model). Column-oriented storage makes updating data in a file harder than it is under the traditional row-oriented model. In this new environment, the new file access model and the column-oriented storage model therefore pose a major challenge for materialized view maintenance.
The invention overcomes the limitation that files in prior-art distributed file systems cannot be updated, and proposes a materialized view layout for a distributed system under a column-oriented storage environment together with a maintenance method. It stores relational tables using a column-oriented storage model and introduces a consistency model to maintain the materialized view layout.
Summary of the invention
The invention discloses a materialized view layout in a distributed system under a column-oriented storage environment, comprising:
Primary attribute set: the set of primary attributes;
Projection of a primary attribute: on the physical level, each primary attribute is projected into its own projection; each such projection is divided into a plurality of segments, and each segment contains data tuples;
Projection of a nonprime attribute: on the physical level, each nonprime attribute is projected into a projection of nonprime attributes;
Join index: the mapping between the projections of the primary attributes and the projections of the nonprime attributes;
Tag vector: a bit vector indicating, on the logical level, whether each data tuple exists;
The projections of the primary attributes are mapped to the projections of the nonprime attributes through the join index; the tag vector indicates whether a data tuple in a primary attribute projection exists in the corresponding nonprime attribute projection.
Wherein the projection of each primary attribute is divided into a plurality of segments by a hash function.
Wherein the data tuples are organized in column-oriented storage.
Wherein the primary attributes and the nonprime attributes are projected separately.
Wherein, in the projections, each primary attribute of the primary attribute set is projected into a separate column.
In the invention, a materialized view is the precomputed and stored result of time-consuming operations such as table joins or aggregations, so that these operations can be avoided at query time and results obtained quickly. The materialized view layout is the physical storage scheme of the materialized view, that is, how its data is organized and placed in the file system.
The invention also discloses a method for maintaining the materialized view layout in a distributed system under a column-oriented storage environment, comprising the following steps:
Step 1: record the operations performed on the materialized view in a view log;
Step 2: process the operations in the view log in batches according to a consistency model.
Wherein the view log comprises the basic operations of inserting a tuple into and deleting a tuple from the materialized view.
Wherein the consistency model comprises an eventual consistency model and a timeline-based consistency model.
The beneficial effect of the invention is that consistency maintenance algorithms used between a traditional data warehouse and its data sources can be ported to a large-scale data analysis platform built on a distributed system, so that views on the data analysis platform can be kept consistent with the data sources.
Description of drawings
Fig. 1 is a schematic diagram of the data layout of the materialized view in a distributed system under a column-oriented storage environment according to the invention.
Fig. 2 is a schematic diagram of the data platform and data sources in a distributed system under a column-oriented storage environment according to the invention.
Fig. 3 is a schematic diagram of the join index of the materialized view layout in a distributed system under a column-oriented storage environment according to the invention.
Fig. 4 is a schematic diagram of the consistency model of the materialized view layout in a distributed system under a column-oriented storage environment according to the invention.
Embodiment
The invention is described in further detail below with reference to specific embodiments and the accompanying drawings. The scope of protection is not limited to the following embodiments. Variations and advantages that a person skilled in the art can conceive without departing from the spirit and scope of the inventive concept are included in the invention, and the scope of protection is defined by the appended claims.
The embodiment has two aspects. The first is the storage method for the materialized view: the invention stores the materialized view on a distributed file system and organizes the data in column-oriented fashion, which calls for a new data layout method. The second is the maintenance method for a materialized view with this storage structure, that is, update maintenance of the view, for which corresponding models are given.
Fig. 2 is a schematic diagram of the data platform and data sources in a distributed system under a column-oriented storage environment; the views in the data analysis platform are built on a large-scale distributed file system.
Fig. 1 shows the data layout of the materialized view of the invention: the data sources are above the solid line and the data platform is below it. The relational tables R_1, R_2, ..., R_n reside in the individual data sources, and the materialized view is built on a data analysis platform based on the Hadoop Distributed File System (HDFS). The logical storage structure of the materialized view (MV) is shown above the dashed line, and its physical storage structure below it. On the logical level, the materialized view is a relational table. On the physical level, it is stored in HDFS in column-oriented fashion. The invention additionally designs two dedicated data structures: the join index and the tag vector.
Storage of the materialized view: the primary attribute set {k_1, k_2, ..., k_n} is the set of primary attributes of the relational tables in the data sources. On the physical level, these n primary attributes are projected into n projections, that is, the values of each primary attribute are stored separately. Each projection containing a primary attribute of the materialized view is divided into several segments. The division is performed by a suitable hash function, for example hash(x) = x mod n, and does not depend on the value distribution of the primary attribute. The hash function is chosen so that each segment contains roughly the same number of data tuples. The data in each segment is organized in column-oriented fashion and stored in HDFS blocks. If a block overflows, another block is allocated to the segment. The nonprime attributes may be projected into any number of projections, but the order of the data tuples must be the same across them, so reconstructing the nonprime part of the materialized view is not difficult.
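The segmentation step can be illustrated with a minimal in-memory sketch. It uses the example hash function hash(x) = x mod n from the text; the function names, the choice to keep each value's row position, and the sample values are assumptions made for illustration and do not come from the patent.

```python
from collections import defaultdict


def segment_of(value: int, num_segments: int) -> int:
    """Example hash function from the text: hash(x) = x mod n."""
    return value % num_segments


def build_projection(values, num_segments):
    """Split one primary-attribute column into hash-selected segments.

    Each entry keeps its original row position (an assumption made here)
    so the value can later be related back to the rest of the tuple.
    """
    segments = defaultdict(list)
    for row_pos, value in enumerate(values):
        segments[segment_of(value, num_segments)].append((value, row_pos))
    return dict(segments)


# Hypothetical sample column with six primary-attribute values.
projection = build_projection([10, 11, 12, 13, 14, 15], num_segments=3)
```

With these evenly spread sample values every segment receives two entries; skewed values would not balance this way under a plain modulo.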
Join index: to build the whole materialized view efficiently from the individual projections, a join index serves as the mapping between projections. Fig. 3 illustrates the join index between projection MV2 and projection MV1. The whole materialized view can be rebuilt through the join indexes. In the framework of Fig. 2, each projection containing a primary attribute has one join index that maps it to the projection containing the nonprime attributes. The n primary attribute projections therefore have n corresponding join indexes.
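As a rough illustration of how a join index lets rows be rebuilt from separately stored projections, the sketch below pairs each primary-attribute value with the (SID, key) location of its nonprime values. The dictionary layout, the function name, and the sample data are assumptions for illustration only.

```python
# Nonprime projection: segment id (SID) -> list of rows, addressed by key.
nonprime = {0: [("alice", 30), ("bob", 25)], 1: [("carol", 41)]}

# One primary-attribute projection, stored in its own order.
primary_values = [7, 3, 9]

# Join index: entry i is the (SID, key) of the nonprime row for value i.
join_index = [(1, 0), (0, 0), (0, 1)]


def reconstruct(primary_values, join_index, nonprime):
    """Rebuild full rows by following the join index entry of each value."""
    rows = []
    for value, (sid, key) in zip(primary_values, join_index):
        rows.append((value,) + nonprime[sid][key])
    return rows


view = reconstruct(primary_values, join_index, nonprime)
```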
Tag vector: the tag vector is a bit vector that indicates, on the logical level, whether each data tuple exists. If the i-th element of the vector is 1, the materialized view contains the i-th data tuple on the logical level. If the i-th element is 0, that data tuple does not belong to the materialized view on the logical level, even though it is physically stored. This vector is needed because HDFS does not support modifying records in a file.
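The effect of the tag vector can be sketched in a few lines: physically stored rows stay untouched, and only the flag decides whether a row is visible. This is a simplified in-memory illustration; the names are hypothetical and a bytearray is used with one byte per flag rather than a packed bit vector.

```python
# Physically stored rows; nothing is ever removed from this list.
physical_rows = [("a", 1), ("b", 2), ("c", 3)]

# One flag per stored row (one byte each here, for simplicity).
tag_vector = bytearray([1, 1, 1])


def logical_delete(tags, i):
    """Hide row i on the logical level without touching stored data."""
    tags[i] = 0


def logical_rows(rows, tags):
    """Return only the rows whose flag is 1."""
    return [row for row, bit in zip(rows, tags) if bit == 1]


logical_delete(tag_vector, 1)
visible = logical_rows(physical_rows, tag_vector)
```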
The invention also discloses a method for maintaining the materialized view layout in a distributed system under a column-oriented storage environment, that is, a solution for maintaining materialized views on the data analysis platform. A view log records the operations performed on the view, namely deletions and insertions of data tuples in the materialized view, and two consistency models are proposed for use by specific applications.
When a data source is updated, the update information is sent to the data analysis platform. After the update message arrives, the Strobe algorithm is invoked to process the update and bring the materialized view to a state consistent with the data sources. For example, when the platform receives a delete message, Strobe generates an operation that deletes the corresponding data tuples from the materialized view. When the platform receives an insert message, Strobe sends compensating queries to the relevant data sources, and when the platform receives the results of those queries it generates an insert action. To make the discussion precise, the invention gives the following two definitions:
Definition 1: two data tuples conflict if the values of their primary attributes are equal.
Definition 2: two data tuples are duplicates if the values of all their attributes are equal.
It follows from the definitions that two duplicate data tuples necessarily conflict, whereas conflicting data tuples are not necessarily duplicates. The view log is the action list of the Strobe algorithm and contains two basic operations, deleting and inserting data tuples. Deletion(MV, k_i, var) deletes from view MV the data tuples whose attribute k_i equals var. Insertion(MV, T) inserts data tuple T into MV, provided MV contains no data tuple that conflicts with T.
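Definitions 1 and 2 translate directly into two small predicates. The sketch below represents tuples as dictionaries; the attribute names k1, k2, and name are hypothetical.

```python
def conflicts(t1: dict, t2: dict, primary_attrs) -> bool:
    """Definition 1: all primary attribute values are equal."""
    return all(t1[a] == t2[a] for a in primary_attrs)


def duplicates(t1: dict, t2: dict) -> bool:
    """Definition 2: all attribute values are equal."""
    return t1 == t2


keys = ["k1", "k2"]
a = {"k1": 1, "k2": 7, "name": "x"}
b = {"k1": 1, "k2": 7, "name": "y"}  # same key as a, different nonprime value
c = dict(a)                          # exact copy of a

a_b = (conflicts(a, b, keys), duplicates(a, b))
a_c = (conflicts(a, c, keys), duplicates(a, c))
```

The pair a and b conflicts without being duplicates, while a and c are duplicates and therefore also conflict, matching the observation in the text.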
Algorithm 1 is the procedure for deleting data tuples from the materialized view. As Algorithm 1 shows, the corresponding hash function is used to determine the segment to which var belongs. The algorithm then scans each element of that segment; if the value of an element equals var, it obtains (SID_i, key_i) from the join index. If the tag vector value at (SID_i, key_i) is 1, there exists a data tuple whose attribute k_i has the value var. To delete this data tuple from materialized view MV, a simple solution is to read all data tuples in the segment and then write back all of them except those to be deleted. However, this approach incurs heavy I/O overhead and network bandwidth consumption in a distributed file system. A better solution is to set the value at the corresponding position of the tag vector to 0.
Algorithm 1:
[Algorithm 1 listing: published as an image, not reproduced here]
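Since the original listing is not available in this text, the deletion procedure described above can be approximated by the following sketch. It is not the patent's listing: the dictionary-based data model, the function name delete, and the sample data are assumptions, and everything is held in memory rather than in HDFS.

```python
NUM_SEGMENTS = 2

# One primary-attribute projection: segment id -> stored values.
segments = {0: [4, 8], 1: [5, 7]}
# Join index, parallel to the segments: same position -> (SID, key).
join_index = {0: [(0, 0), (0, 1)], 1: [(0, 2), (0, 3)]}
# Tag vector addressed by (SID, key); 1 means logically present.
tags = {(0, 0): 1, (0, 1): 1, (0, 2): 1, (0, 3): 1}


def delete(segments, join_index, tags, var):
    """Logically delete every tuple whose attribute value equals var."""
    seg = var % NUM_SEGMENTS              # hash to the owning segment
    deleted = 0
    for pos, value in enumerate(segments.get(seg, [])):
        if value == var:
            loc = join_index[seg][pos]    # (SID_i, key_i)
            if tags[loc] == 1:
                tags[loc] = 0             # flip the flag, no file rewrite
                deleted += 1
    return deleted


first = delete(segments, join_index, tags, 7)
second = delete(segments, join_index, tags, 7)  # already deleted
```

The stored values in segments are never rewritten; only the flag changes, which mirrors the reason the text gives for preferring the tag vector over rewriting a segment.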
Algorithm 2 is the procedure for inserting a data tuple into the materialized view. Before data tuple T is inserted, Algorithm 2 must determine whether the materialized view contains a data tuple that conflicts with T. As in deletion, hashing is used to locate the corresponding segments. Because each primary attribute value must be hashed, n hash computations yield n pairs (SID_i, key_i) (i = 1, 2, ..., n). If these n pairs have the same value (SID, key) and the corresponding tag vector value is 1, a data tuple conflicting with T exists and T is discarded. If the n pairs have the same value (SID, key) and the corresponding tag vector value is 0, a further check is needed: if all nonprime attribute values of T equal the nonprime attribute values stored at (SID, key), then a data tuple that duplicates T exists on the physical level even though it does not exist on the logical level. In that case it suffices to set the tag vector value at (SID, key) to 1.
Algorithm 2:
[Algorithm 2 listing: published as an image, not reproduced here]
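The decision logic of the insertion procedure can be sketched as below. This is an interpretation, not the patent's listing: the view dictionary, the function names, and the return labels are hypothetical, and intersecting the per-attribute location sets is an assumption used to handle several physical entries with the same key value.

```python
def locate(projection, join_index, value, num_segments):
    """Locations (SID, key) whose stored primary value equals value."""
    seg = value % num_segments
    return {
        join_index[seg][pos]
        for pos, stored in enumerate(projection.get(seg, []))
        if stored == value
    }


def decide_insert(view, t_keys, t_nonprime):
    """Return 'drop', 'revive', or 'append' for tuple T."""
    per_attr = [
        locate(view["projections"][i], view["join"][i], value,
               view["num_segments"])
        for i, value in enumerate(t_keys)
    ]
    common = set.intersection(*per_attr)   # all n lookups agree here
    if any(view["tags"][loc] == 1 for loc in common):
        return "drop"                      # conflicting tuple exists
    for loc in sorted(common):
        if view["nonprime"][loc] == t_nonprime:
            view["tags"][loc] = 1          # physical duplicate: re-enable
            return "revive"
    return "append"                        # caller appends T


view = {
    "num_segments": 2,
    "projections": [{1: [1], 0: [2]}, {0: [10, 20]}],
    "join": [{1: [(0, 0)], 0: [(0, 1)]}, {0: [(0, 0), (0, 1)]}],
    "tags": {(0, 0): 1, (0, 1): 0},
    "nonprime": {(0, 0): ("x",), (0, 1): ("y",)},
}

results = [
    decide_insert(view, (1, 10), ("z",)),  # key (1, 10) is live
    decide_insert(view, (2, 20), ("y",)),  # deleted physical duplicate
    decide_insert(view, (1, 20), ("q",)),  # no common location
]
```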
If none of the above cases applies, Append(MV, T, (SID_T, key_T), SL) is called, as in Algorithm 3, to append T to MV. First, the nonprime attribute values are appended. Next, the primary attribute values are appended and (SID_T, key_T) is added to the join index. Finally, the value at the corresponding position of the tag vector is set to 1.
Algorithm 3:
[Algorithm 3 listing: published as an image, not reproduced here]
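The three append steps named in the text (nonprime values, then primary values with join index entries, then the tag) can be sketched as follows. The data model and names are assumptions; the SL parameter of Append is not explained in the text and is therefore not modelled, and the target segment id is simply passed in as sid.

```python
def append(view, t_keys, t_nonprime, sid=0):
    """Append tuple T and return its (SID, key) location."""
    # Step 1: append the nonprime values; their position becomes the key.
    rows = view["nonprime"].setdefault(sid, [])
    rows.append(t_nonprime)
    loc = (sid, len(rows) - 1)
    # Step 2: append each primary value to its hash-selected segment
    # and record the location in that attribute's join index.
    for i, value in enumerate(t_keys):
        seg = value % view["num_segments"]
        view["projections"][i].setdefault(seg, []).append(value)
        view["join"][i].setdefault(seg, []).append(loc)
    # Step 3: mark the tuple as logically present.
    view["tags"][loc] = 1
    return loc


# Hypothetical empty view with two primary attributes.
view = {
    "num_segments": 2,
    "projections": [{}, {}],
    "join": [{}, {}],
    "tags": {},
    "nonprime": {},
}
loc = append(view, (1, 20), ("q",))
```

Every step only adds data at the end of a list, which is compatible with the append-only file access model described earlier.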
The consistency model is the key strategy for materialized view maintenance in the invention. Because materialized views are analysis-oriented rather than transaction-oriented, it is unnecessary to execute the deletions and insertions recorded in the view log immediately; instead, the operations in the view log are processed in batches at checkpoints, as shown in Fig. 4. Different checkpoint-setting strategies lead to different consistency models. This section proposes two: an eventual consistency model and a timeline-based consistency model.
Algorithm 4 is the procedure for the eventual consistency model. In this model, the size of the view log determines when the operations in the log are processed. As Algorithm 4 shows, when the size of the view log exceeds a threshold (th), a checkpoint is set and the recorded operations are processed one by one. At the end of each processing pass, the size of the view log is reset to 0. Under this model, if the data sources receive no updates for a long time, the materialized view gradually becomes consistent with them, which is why this maintenance method is called the eventual consistency model.
Algorithm 4:
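No listing for Algorithm 4 appears in this text. As a rough illustration of the size-triggered checkpoint described above (not the patent's listing), the sketch below buffers operations and applies them in one batch once the log grows past a threshold. The class name, the apply_op callback, and the sample operations are assumptions.

```python
class EventualConsistencyLog:
    """View log that is flushed when its size exceeds a threshold."""

    def __init__(self, threshold, apply_op):
        self.threshold = threshold   # corresponds to th in the text
        self.apply_op = apply_op     # applies one operation to the view
        self.log = []
        self.checkpoints = 0

    def record(self, op):
        self.log.append(op)
        if len(self.log) > self.threshold:
            self.checkpoint()

    def checkpoint(self):
        for op in self.log:          # process recorded operations in order
            self.apply_op(op)
        self.log.clear()             # log size goes back to 0
        self.checkpoints += 1


applied = []
view_log = EventualConsistencyLog(threshold=2, apply_op=applied.append)
for op in [("insert", 1), ("insert", 2), ("delete", 1), ("insert", 3)]:
    view_log.record(op)
```

After the four sample operations, the first three have been applied in one batch and the fourth is still waiting in the log.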
Algorithm 5 is the procedure for the timeline-based consistency model. Its framework is similar to the eventual consistency model; the difference is that the timeline-based model sets checkpoints at regular intervals, so it requires timestamp information. In Algorithm 5, the current system time is checked at the start of each loop iteration. If the time elapsed since the timestamp of the previous checkpoint reaches the configured parameter (interval), a checkpoint is set and all operations between the previous checkpoint and the current one are executed. In this model the timeline decides when the operations in the view log are processed, hence the name timeline-based consistency model.
Algorithm 5:
[Algorithm 5 listing: published as an image, not reproduced here]
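The interval-triggered checkpoint can be sketched in the same style. Again this is an illustration under assumptions rather than the patent's listing: the class name, the tick method standing in for one loop iteration, and the injected clock function (used so the example does not depend on real time) are all hypothetical.

```python
class TimelineConsistencyLog:
    """View log that is flushed once a fixed interval has elapsed."""

    def __init__(self, interval, apply_op, clock):
        self.interval = interval
        self.apply_op = apply_op
        self.clock = clock                  # returns the current time
        self.last_checkpoint = clock()
        self.log = []

    def record(self, op):
        self.log.append((self.clock(), op))  # keep a timestamp per entry

    def tick(self):
        """One loop iteration: checkpoint if the interval has elapsed."""
        now = self.clock()
        if now - self.last_checkpoint >= self.interval:
            for _, op in self.log:
                self.apply_op(op)
            self.log.clear()
            self.last_checkpoint = now
            return True
        return False


now = [0.0]                                  # fake clock for the example
applied = []
view_log = TimelineConsistencyLog(10.0, applied.append, lambda: now[0])
view_log.record(("insert", 1))
now[0] = 5.0
early = view_log.tick()                      # interval not reached yet
view_log.record(("delete", 1))
now[0] = 10.0
due = view_log.tick()                        # interval reached
```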
The advantage of the eventual consistency model is its simplicity: when the operations in the view log are processed depends only on the size of the log. The eventual consistency model is clearly a weak consistency model and is suitable only for analysis-oriented applications. The timeline-based model guarantees that the materialized view is brought to a state consistent with the data sources on a time basis. However, the algorithm needs more information, such as timestamps. The timeline-based consistency model is more complex than the eventual consistency model but provides stronger consistency.

Claims (8)

1. A materialized view layout in a distributed system under a column-oriented storage environment, characterized in that it comprises:
Primary attribute set: the set of primary attributes;
Projection of a primary attribute: on the physical level, each primary attribute is projected into its own projection; each such projection is divided into a plurality of segments, and each segment contains data tuples;
Projection of a nonprime attribute: on the physical level, each nonprime attribute is projected into a projection of nonprime attributes;
Join index: the mapping between the projections of the primary attributes and the projections of the nonprime attributes;
Tag vector: a bit vector indicating, on the logical level, whether each data tuple exists;
The projections of the primary attributes are mapped to the projections of the nonprime attributes through the join index; the tag vector indicates whether a data tuple in a primary attribute projection exists in the corresponding nonprime attribute projection.
2. The materialized view layout in a distributed system under a column-oriented storage environment according to claim 1, characterized in that the projection of each primary attribute is divided into a plurality of segments by a hash function.
3. The materialized view layout in a distributed system under a column-oriented storage environment according to claim 1, characterized in that the data tuples are organized in column-oriented storage.
4. The materialized view layout in a distributed system under a column-oriented storage environment according to claim 1, characterized in that the primary attributes and the nonprime attributes are projected separately.
5. The materialized view layout in a distributed system under a column-oriented storage environment according to claim 4, characterized in that, in the projections, each primary attribute of the primary attribute set is projected into a separate column.
6. A method for maintaining a materialized view layout in a distributed system under a column-oriented storage environment, characterized in that it comprises the following steps:
Step 1: record the operations performed on the materialized view in a view log;
Step 2: process the operations in the view log in batches according to a consistency model.
7. The method for maintaining a materialized view layout in a distributed system under a column-oriented storage environment according to claim 6, characterized in that the view log comprises the basic operations of inserting a tuple into and deleting a tuple from the materialized view.
8. The method for maintaining a materialized view layout in a distributed system under a column-oriented storage environment according to claim 6, characterized in that the consistency model comprises an eventual consistency model and a timeline-based consistency model.
CN2011104527265A 2011-12-30 2011-12-30 Materialized view layout in distributive system under column-orientated storage environment and maintaining method of materialized view layout Pending CN102567527A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104527265A CN102567527A (en) 2011-12-30 2011-12-30 Materialized view layout in distributive system under column-orientated storage environment and maintaining method of materialized view layout

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104527265A CN102567527A (en) 2011-12-30 2011-12-30 Materialized view layout in distributive system under column-orientated storage environment and maintaining method of materialized view layout

Publications (1)

Publication Number Publication Date
CN102567527A true CN102567527A (en) 2012-07-11

Family

ID=46412926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104527265A Pending CN102567527A (en) 2011-12-30 2011-12-30 Materialized view layout in distributive system under column-orientated storage environment and maintaining method of materialized view layout

Country Status (1)

Country Link
CN (1) CN102567527A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761308A (en) * 2014-01-23 2014-04-30 杭州电子科技大学 Materialized view selection method based on self-adaption genetic algorithm
CN104793926A (en) * 2014-04-17 2015-07-22 厦门极致互动网络技术有限公司 Resource allocation method and system in distributed system
CN106462585A (en) * 2014-03-21 2017-02-22 华为技术有限公司 System and method for column-specific materialization scheduling

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAITONG ZHU ET AL.: "Efficient Star Jion for Column-oriented Data Store in the MapReduce Environment", 《2011 EIGHTH WEB INFORMATION SYSTEMS AND APPLICATIONS CONFERENCE》 *
孟勃荣: "含聚集物化视图的增量维护方法", 《计算机工程与设计》 *
朱文等: "数据仓库中物化视图维护算法的分析和比较", 《现代计算机》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761308A (en) * 2014-01-23 2014-04-30 杭州电子科技大学 Materialized view selection method based on self-adaption genetic algorithm
CN103761308B (en) * 2014-01-23 2017-02-08 杭州电子科技大学 Materialized view selection method based on self-adaption genetic algorithm
CN106462585A (en) * 2014-03-21 2017-02-22 华为技术有限公司 System and method for column-specific materialization scheduling
CN106462585B (en) * 2014-03-21 2019-10-22 华为技术有限公司 System and method for particular column materialization scheduling
CN104793926A (en) * 2014-04-17 2015-07-22 厦门极致互动网络技术有限公司 Resource allocation method and system in distributed system
CN104793926B (en) * 2014-04-17 2018-06-01 厦门极致互动网络技术股份有限公司 Resource allocation method and system in a kind of distributed system

Similar Documents

Publication Publication Date Title
US10073888B1 (en) Adjusting partitioning policies of a database system in view of storage reconfiguration
US10915528B2 (en) Pluggable storage system for parallel query engines
CN105630864B (en) Forced ordering of a dictionary storing row identifier values
US11314779B1 (en) Managing timestamps in a sequential update stream recording changes to a database partition
US20160328429A1 (en) Mutations in a column store
US20120203745A1 (en) System and method for range search over distributive storage systems
CN112286941B (en) Big data synchronization method and device based on Binlog + HBase + Hive
CN102546247A (en) Massive data continuous analysis system suitable for stream processing
CN104298760A (en) Data processing method and data processing device applied to data warehouse
CN103714163A (en) Pattern management method and system of NoSQL database
CN102722582A (en) System and method for integrating data on basis of reverse clearing
CN103795811A (en) Information storage and data statistical management method based on meta data storage
CN105787058A (en) User label system and data pushing system based on same
CN102779138A (en) Hard disk access method of real time data
CN106991190A (en) A kind of database automatically creates subdata base system
CN106021593A (en) Copying processing method in take-over process of first database and second database
US10831709B2 (en) Pluggable storage system for parallel query engines across non-native file systems
CN103034650A (en) System and method for processing data
CN114416868B (en) Data synchronization method, device, equipment and storage medium
CN102567527A (en) Materialized view layout in distributive system under column-orientated storage environment and maintaining method of materialized view layout
CN103365987A (en) Clustered database system and data processing method based on shared-disk framework
CN105956041A (en) Data model processing method based on Spring Data for MongoDB cluster
Goncalves et al. DottedDB: Anti-entropy without merkle trees, deletes without tombstones
Qu et al. Distributed snapshot maintenance in wide-column NoSQL databases using partitioned incremental ETL pipelines
CN114860727A (en) Zipper watch updating method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120711