CN103678550A

CN103678550A - Mass data real-time query method based on dynamic index structure

Info

Publication number: CN103678550A
Application number: CN201310648180.XA
Authority: CN
Inventors: 陈丹伟; 庄俊
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Longkon wisdom Polytron Technologies Inc
Priority date: 2013-09-09
Filing date: 2013-12-04
Publication date: 2014-03-26
Anticipated expiration: 2033-12-04
Also published as: CN103678550B

Abstract

The invention discloses a real-time query method for massive data based on a dynamic index structure (DC-Tree). The method reduces the dimensionality of massive multidimensional data sets, offers high space efficiency and low query time, and supports distributed redundant storage. It thereby improves the efficiency of data distribution over traditional distributed mechanisms and is suited to massive data processing. The method comprises: step 1, in the Master Node, a multidimensional data record DR is mapped by the Z Curve mapping function $f_z$ to generate a dimensionality-reduced result set S; step 2, the Master Node selects k hash functions and maps the result set S through a Bloom Filter to generate a node set NN; step 3, the data record DR is updated and dynamic construction is performed on each element of the node set NN; step 4, a user queries an MDS result, obtains the node set NN through steps 1 and 2, and starts the parallel query method; step 5, the user aggregates the result sets of all accessed nodes in the node set NN to obtain the final query result Rset.

Description

A real-time query method for massive data based on a dynamic index structure

Technical field

The present invention relates to big data query technology in the computer field, and in particular to a real-time query method for massive data based on a dynamic index structure.

Background art

With the rapid development of the Internet and the growing popularity of social networks and mobile applications, the volume of network information keeps increasing. Big data has emerged as a new concept, and data, as the carrier of information, plays a very important role. The explosive growth of data has brought us into an era of large-scale data analysis, characterized by high computational intensity and the need for large-scale concurrent storage and processing capability. How to process massive data quickly and extract valuable information from it in a timely and effective manner is a technical problem that urgently needs to be solved.

At present there are two mainstream technologies for large-scale data analysis. The first is the parallel database, represented by Teradata and the Gamma research project, which began in the 1980s and has gradually matured. A parallel database is composed of a series of operators, where the output stream of one operator is the input stream of the next; records pass through these operators in turn in a pipelined fashion, which gives high performance. The second, led by Google, is a "shared-nothing" parallel computing framework built from the simple functional programming model MapReduce and the distributed file system GFS, which supports hundreds of millions of searches per day. Apache Hadoop is an open-source implementation of MapReduce. However, these large-scale data processing technologies can hardly meet real-time requirements and are mostly used for offline data processing. Hadoop is more like an ETL tool, and the relationship between the two is complementary rather than competitive.

On the other hand, the dynamic index structure R-Tree proposed by Guttman, together with its variants, allows operations such as insertion and query to be carried out concurrently and supports multidimensional models, which gives it a clear advantage among spatial data indexing techniques. When it processes large-scale data, however, the overlap between query nodes grows as the height of the tree increases, and query efficiency drops quickly. The present invention solves the above problems well.

Summary of the invention

The object of the invention is to provide a real-time query method for large-scale multidimensional data based on a dynamic index structure (DC-Tree). The method solves the latency problem in processing large-scale multidimensional data and realizes a real-time query model for massive data in a distributed architecture.

The technical solution adopted by the present invention to solve this technical problem is a real-time query method for massive data based on a dynamic index structure (DC-Tree), comprising the following steps:

Step 1: in the MasterNode, the multidimensional data record DR is mapped by the Z Curve mapping function $f_z$ to generate the dimensionality-reduced result set S;

Step 2: the MasterNode selects k hash functions and maps the result set S through a Bloom Filter to generate the node set NN;

Step 3: the data record DR is updated, and dynamic construction is performed on each element of the node set NN;

Step 4: the user (User) queries an MDS result, obtains the node set NN through step 1 and step 2, and starts the parallel query method;

Step 5: the user (User) aggregates the result sets of all accessed nodes in the node set NN to obtain the final query result Rset.

By reducing the dimensionality of massive multidimensional data sets on the basis of a dynamic index structure, the present invention provides a method with high space efficiency and low query time that supports distributed redundant storage, thereby improving the efficiency of data distribution over traditional distributed mechanisms and adapting to large-scale data processing. The invention builds a multidimensional data tree with a concept hierarchy and breaks with the traditional single-attribute query method: a data set with multidimensional attributes is divided and built along different dimensions, which greatly reduces the aggregation workload of single-attribute queries.

By mapping data from a high-dimensional space to a one-dimensional space, the invention greatly reduces the workload of the data management node and supports the dynamic addition of data storage nodes. It also provides insertion and query methods for massive data that support the dynamic construction of multidimensional attribute data and real-time querying of massive data, and it adds an access lock mechanism to the query process to meet the concurrency demands of queries.

I. System architecture

Fig. 1 shows the architecture of the massive data real-time query system, which consists of four parts: the data management node (Master Node), the dynamic index tree (DC-Tree), the data storage nodes (Data Node) and the user (User). The MasterNode is responsible for locating data for queries and updates, mainly using dimensionality reduction and fast lookup techniques. The DC-Tree is mainly used to dynamically build the query tree for multidimensional attribute data and provides real-time query performance. The DataNodes are responsible for storing the actual data. The user (User) sends a query request to the MasterNode; the MasterNode processes the request, determines which DataNodes hold the queried content, and returns these DataNodes to the user. After this operation completes, the user disconnects from the MasterNode and actively accesses the returned DataNodes to run the query. The overall system framework is shown in Fig. 1.

The real-time query scheme for massive data of the present invention consists of the following four operations: MDS (minimum describing subset) decomposition, Z curve dimensionality reduction, Bloom Filter location, and DC-Tree indexing with result aggregation.

II. Method flow

1. MDS (minimum describing subset) decomposition

An MDS (minimum describing subset) has the form $(M_1, \dots, M_d)$. Without loss of generality, let $M_i = \{a_{i1}, a_{i2}, \dots, a_{ik}\}$, where $1 \le i \le d$ and $a_{ik} \in D_i$. The multidimensional data record set corresponding to this MDS is $\{(a_{11}, a_{21}, \dots, a_{d1}), \dots, (a_{1k}, a_{2k}, \dots, a_{dk})\}$, denoted MM.
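The enumeration of MM above can be illustrated with a short Python sketch. It follows the record set exactly as written in the text, where the j-th record takes the j-th value of every $M_i$. The function name `mds_to_records` and the use of integer-coded dimension values are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def mds_to_records(mds: Sequence[Sequence[int]]) -> List[Tuple[int, ...]]:
    """Expand an MDS (M_1, ..., M_d) into the record set MM.

    Assumption: as in the enumeration in the text, the j-th record is
    (a_1j, a_2j, ..., a_dj), so every M_i must hold exactly k values.
    """
    if not mds:
        return []
    k = len(mds[0])
    if any(len(m_i) != k for m_i in mds):
        raise ValueError("every M_i must contain exactly k values")
    # Record j collects the j-th value of each dimension set M_i.
    return [tuple(m_i[j] for m_i in mds) for j in range(k)]


# Three dimensions (d = 3), two values per dimension (k = 2).
mm = mds_to_records([[1, 2], [3, 4], [5, 6]])
```

Here `mm` holds two records, one per column of the MDS, which is the input expected by the dimensionality reduction step.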

2. Z curve dimensionality reduction

Using the result set MM obtained in step 1 above, the Z Curve method is applied to reduce dimensionality. Let the Z Curve mapping function be $f_z(p, m, n)$, where $p \in MM$, $m$ is the order of the Z Curve and $n$ is the number of dimensions of the multidimensional model, and let the return value of $f_z$ be $y_p$. The pseudocode of the mapping function is as follows:

(1) $y_p = 0$;
REPEAT (for $i = 1, 2, \dots, m$)
    REPEAT (for $j = 1, 2, \dots, n$)
(2)     $y_p = y_p + 2^{n(i-1)+j-1} a_{ji}$
    UNTIL $j \ge n$
UNTIL $i \ge m$
(3) RETURN $y_p$

Because the space complexity of the mapping function of an n-dimensional, m-order Z Curve is O(n), an array of length n is needed to store the results $y_p$; let this result set be S.
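The mapping function can be sketched in Python as follows. The text does not define $a_{ji}$; this sketch assumes it is the i-th binary digit (counting from the least significant) of the j-th coordinate of $p$, which makes the sum the usual bit-interleaved Z-order value. The names `z_curve` and `reduce_records` are hypothetical.

```python
def z_curve(p, m, n):
    """Map an n-dimensional point p to a one-dimensional Z-order value y_p.

    Assumption: a_ji is bit i (1 = least significant) of coordinate j,
    and every coordinate is a non-negative integer that fits in m bits.
    """
    if len(p) != n:
        raise ValueError("p must have n coordinates")
    if any(c < 0 or c >= (1 << m) for c in p):
        raise ValueError("coordinates must lie in [0, 2**m)")
    y_p = 0
    for i in range(1, m + 1):        # bit position within each coordinate
        for j in range(1, n + 1):    # dimension index
            a_ji = (p[j - 1] >> (i - 1)) & 1
            y_p += (1 << (n * (i - 1) + j - 1)) * a_ji
    return y_p


def reduce_records(mm, m, n):
    """Apply f_z to every record of MM to obtain the result set S."""
    return [z_curve(p, m, n) for p in mm]


s = reduce_records([(2, 1), (3, 3)], m=2, n=2)   # -> [6, 15]
```

For the point (2, 1) with m = 2 and n = 2, only the terms $2^1$ (bit 1 of the second coordinate) and $2^2$ (bit 2 of the first coordinate) contribute, giving 6.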

3. Bloom Filter location

Given the dimensionality-reduced result set $S = \{y_1, \dots, y_n\}$ from step 2 above, and following the description of the Bloom Filter in related work, k hash functions $HF_i$ ($1 \le i \le k$) must now be selected. Because a Bloom Filter inherently has a certain error rate, the invention reduces such false positives by using a result demonstrated by Knuth when constructing the hash functions: two hash functions $HF_1$ and $HF_2$ can generate further hash functions in the following form:

$HF_i = [HF_1 + HF_2 + f(i)] \bmod r$

where $1 \le i \le k$, $r$ is the length of the Bloom Filter array, and $HF_1$ and $HF_2$ are two mutually independent hash functions. When $f(i) = 0$ the double hashing mechanism is used; otherwise the extended hashing mechanism is used. Hash functions generated in this way keep the false-positive rate unchanged while improving the computational efficiency of the system.

After the k functions are selected, the data in the set S are mapped and a set of DataNodes is returned, denoted NN. This set NN is returned to the user.
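A minimal Python sketch of this location step is given below, under several assumptions that are not stated in the text. First, with $f(i) = 0$ the formula as printed yields the same value for every $i$, so the sketch uses the commonly known double hashing form in which $HF_2$ is multiplied by $i$. Second, the text does not say how Bloom Filter positions relate to DataNodes; the sketch assumes the MasterNode keeps one filter per DataNode and returns every node whose filter reports a hit. The names `positions`, `BloomFilter` and `locate`, and the salted SHA-256 used for $HF_1$ and $HF_2$, are hypothetical.

```python
import hashlib


def _hash(key: int, salt: bytes) -> int:
    # Stand-in for HF_1 / HF_2: a salted SHA-256 digest read as an integer.
    digest = hashlib.sha256(salt + str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def positions(key, k, r, f=lambda i: 0):
    """Derive k array positions from two base hashes (assumed form:
    HF_i = [HF_1 + i * HF_2 + f(i)] mod r)."""
    hf1, hf2 = _hash(key, b"HF1"), _hash(key, b"HF2")
    return [(hf1 + i * hf2 + f(i)) % r for i in range(1, k + 1)]


class BloomFilter:
    def __init__(self, r, k):
        self.r, self.k = r, k
        self.bits = bytearray(r)     # one byte per position, for simplicity

    def add(self, key):
        for pos in positions(key, self.k, self.r):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in positions(key, self.k, self.r))


def locate(s, filters):
    """Return the node set NN: every DataNode whose filter may hold a key of S.

    Assumption: `filters` maps a DataNode name to that node's BloomFilter.
    """
    return {node for node, bf in filters.items() if any(y in bf for y in s)}


filters = {"dn1": BloomFilter(1024, 3), "dn2": BloomFilter(1024, 3)}
filters["dn1"].add(6)
filters["dn2"].add(15)
nn = locate([6], filters)    # contains "dn1"; false positives are possible
```

A Bloom Filter never produces false negatives, so a node that stored a key is always in NN; a node that did not may occasionally appear as well, and the later DC-Tree search on that node then returns nothing.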

4. DC-Tree indexing and result aggregation

Using the set NN obtained in step 3 above, the user locates the DataNodes to be searched, and each DataNode performs the search with the DC-Tree indexing method. After the search on each DataNode finishes, the index results are sent to an index result set, denoted RSet; this result set is then aggregated to obtain the final query result.
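The access and aggregation step can be sketched as follows. The text does not specify the shape of a per-node result or the aggregate function, so this sketch assumes that each DataNode returns a dictionary from a group key to a numeric Measure value and that aggregation is a sum. The function `query_and_aggregate` and its `search` callback are hypothetical; a real client would replace `search` with a network call to the DataNode.

```python
from concurrent.futures import ThreadPoolExecutor


def query_and_aggregate(nn, search, query):
    """Run search(node, query) on every node of NN concurrently and merge
    the partial results into RSet (assumed aggregate: sum per key)."""
    nodes = list(nn)
    if not nodes:
        return {}
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda node: search(node, query), nodes))
    rset = {}
    for partial in partials:
        for key, measure in partial.items():
            rset[key] = rset.get(key, 0) + measure
    return rset


# Illustrative stand-in for the per-DataNode DC-Tree search.
fake_data = {"dn1": {"east": 10, "west": 5}, "dn2": {"east": 7}}


def fake_search(node, query):
    return fake_data[node]


rset = query_and_aggregate(["dn1", "dn2"], fake_search, query=None)
# rset == {"east": 17, "west": 5}
```

Because the user contacts the DataNodes directly, the MasterNode is not involved in this step, which matches the architecture described above.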

Beneficial effects:

1. The present invention improves the efficiency of data distribution, adapts to large-scale data processing, and reduces the aggregation workload of single-attribute queries.

2. The present invention realizes efficient concurrent processing of large-scale data and real-time querying.

Brief description of the drawings

Fig. 1 is the system architecture diagram of the present invention.

Fig. 2 is a flowchart of the dynamic insertion method of the present invention.

Fig. 3 is a flowchart of the parallel query method of the present invention.

Detailed description

The technical solution of the present invention is further described below with reference to the accompanying drawings.

Embodiment 1

As shown in Figs. 2 and 3, the present invention proposes a real-time query method for massive data based on a dynamic index structure (DC-Tree), comprising the following steps:

In the present invention, a new multidimensional data record DR is quickly located to the query node set NN by the MasterNode and dynamically appended to the corresponding DC-Tree; the user (User) queries the node set NN by MDS, and the aggregated query result is returned.

The specific implementation is as follows:

(1) In the MasterNode, the multidimensional data record DR is mapped by the Z Curve mapping function $f_z$ to generate the dimensionality-reduced result set S;

(2) The MasterNode selects k hash functions and maps the result set S through a Bloom Filter to generate the node set NN;

(3) The data record DR is updated, and dynamic construction is performed on each element of the node set NN;

Dynamic insertion:

(a) Request the lock LOCK for the root node D;
(b) Update the Measure value of the directory node;
(c) If DR is contained in the MDS of exactly one child of D, set D to that child directory node;
(d) If DR is contained in the MDS of several children of D, find the child among them that contains the fewest data nodes and set D to that child directory node;
(e) If DR is not contained in the MDS of any child of D, first make a copy of D, denoted D'; add DR to each child node of D, compute the overlap value after the addition, select the child node with the smallest overlap value and set it as D;
(f) Insert the data record DR into D and update the Measure value of D;
(g) If D has reached its maximum capacity, call the split function SPLIT with D as the argument;
(h) Update the Measure and MDS of the parent node of D;
(i) Set D to its parent node; if D has not been updated or D is not the root node, insert the data record DR into D again, update the Measure value of D and continue; otherwise finish;
(j) Release the lock (UNLOCK) on the root node D;

(4) The user (User) queries an MDS result, obtains the node set NN through step 1 and step 2, and starts the parallel query method;

Parallel query:

(a) For all nodes in the node set NN, if a node is not locked, access all nodes in NN concurrently;
(b) Request the lock LOCK for the root node D;
(c) For each child node C of D and for every dimension of C, if C and the query MDS are not at the same level of the dimension hierarchy, convert the lower of the two levels to the higher level;
(d) If the MDS of C is contained in the query MDS, add this MDS and its Measure value to the result set;
(e) If the MDS of C overlaps the query MDS but is not contained in it, set the child node C as D, recursively call the parallel query function PARALLEL QUERY, and continue to perform the same operation on the nodes in NN;
(f) If C is a leaf node, the access ends;
(g) Release the lock (UNLOCK) on the root node D;

(5) The user (User) aggregates the result sets of all accessed nodes in the node set NN to obtain the final query result Rset;

(6) The whole process ends.
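The dynamic insertion and query procedures of steps (3) and (4) can be illustrated with a simplified, single-machine Python sketch. It is not the patented implementation and rests on several assumptions: dimensions are flat, so the hierarchy-level conversion is omitted; the overlap value is taken as the product of per-dimension intersection sizes; SPLIT simply halves the entries of an overfull node; the Measure is a sum; and one `threading.Lock` per tree stands in for LOCK and UNLOCK on the root. All class and function names are hypothetical.

```python
import threading


class Node:
    def __init__(self, d, leaf=True):
        self.mds = [set() for _ in range(d)]   # one value set per dimension
        self.measure = 0                       # aggregated Measure value
        self.leaf = leaf
        self.children = []                     # used by directory nodes
        self.records = []                      # (record, measure) pairs in leaves


def contains(mds, record):
    return all(v in m for v, m in zip(record, mds))


def overlap(a, b):
    size = 1
    for x, y in zip(a, b):
        size *= len(x & y)
    return size


def recompute(node):
    """Rebuild the MDS and Measure of a node from its entries (after a split)."""
    if node.leaf:
        parts = [([{v} for v in r], m) for r, m in node.records]
    else:
        parts = [(c.mds, c.measure) for c in node.children]
    node.mds = [set().union(*(p[0][i] for p in parts)) for i in range(len(node.mds))]
    node.measure = sum(p[1] for p in parts)


class DCTree:
    def __init__(self, d, capacity=4):
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.d, self.capacity = d, capacity
        self.root = Node(d)
        self.lock = threading.Lock()           # stands in for LOCK/UNLOCK on the root

    def insert(self, record, measure):
        with self.lock:
            node, path = self.root, []
            while True:
                for m, v in zip(node.mds, record):   # extend MDS, update Measure
                    m.add(v)
                node.measure += measure
                if node.leaf:
                    break
                path.append(node)
                node = self._choose(node, record)
            node.records.append((record, measure))
            if len(node.records) > self.capacity:
                self._split(node, path)

    def _choose(self, node, record):
        inside = [c for c in node.children if contains(c.mds, record)]
        if inside:                                   # covered by one or more children
            return min(inside, key=lambda c: len(c.records) + len(c.children))

        def added_overlap(c):                        # overlap after adding DR to c
            grown = [m | {v} for m, v in zip(c.mds, record)]
            return sum(overlap(grown, o.mds) for o in node.children if o is not c)
        return min(node.children, key=added_overlap)

    def _split(self, node, path):
        sibling = Node(self.d, node.leaf)            # naive SPLIT: halve the entries
        if node.leaf:
            half = len(node.records) // 2
            node.records, sibling.records = node.records[:half], node.records[half:]
        else:
            half = len(node.children) // 2
            node.children, sibling.children = node.children[:half], node.children[half:]
        recompute(node)
        recompute(sibling)
        if path:
            parent = path.pop()
            parent.children.append(sibling)
        else:                                        # the root was split: grow the tree
            parent = Node(self.d, leaf=False)
            parent.children = [node, sibling]
            recompute(parent)
            self.root = parent
        if len(parent.children) > self.capacity:
            self._split(parent, path)

    def query(self, q):
        """Sum the Measure values of all records covered by the query MDS q."""
        with self.lock:
            return self._query(self.root, q)

    def _query(self, node, q):
        if all(m <= qi for m, qi in zip(node.mds, q)):
            return node.measure                      # node MDS contained in query MDS
        if node.leaf:
            return sum(m for r, m in node.records if contains(q, r))
        return sum(self._query(c, q) for c in node.children if overlap(c.mds, q) > 0)
```

A short usage example: with `tree = DCTree(d=2, capacity=2)`, inserting the records (1, 1), (1, 2), (2, 1), (2, 2) and (3, 3) with measures 10, 20, 30, 40 and 50 and then calling `tree.query([{1}, {1, 2}])` sums the measures of (1, 1) and (1, 2). When a node's MDS is fully contained in the query MDS, its stored Measure is used directly without visiting its descendants, which is the saving the directory-level Measure values are meant to provide.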

Embodiment 2

As shown in Fig. 1, the present invention provides the architecture of a massive data real-time query system, which consists of four parts: the data management node (Master Node), the dynamic index tree (DC-Tree), the data storage nodes (Data Node) and the user (User). The MasterNode is responsible for locating data for queries and updates, mainly using dimensionality reduction and fast lookup techniques. The DC-Tree is mainly used to dynamically build the query tree for multidimensional attribute data and provides real-time query performance. The DataNodes are responsible for storing the actual data. The user (User) sends a query request to the MasterNode; the MasterNode processes the request, determines which DataNodes hold the queried content, and returns these DataNodes to the user. After this operation completes, the user disconnects from the MasterNode and actively accesses the returned DataNodes to run the query.

The real-time query method for massive data of the present invention consists of the following four operations: MDS (minimum describing subset) decomposition, Z curve dimensionality reduction, Bloom Filter location, and DC-Tree indexing with result aggregation.

Claims

1. A real-time query method for massive data based on a dynamic index structure, characterized in that the method comprises the following steps:

2. The real-time query method for massive data based on a dynamic index structure according to claim 1, characterized in that the method establishes a real-time query model and reduces the dimensionality of massive multidimensional data sets.

3. The real-time query method for massive data based on a dynamic index structure according to claim 1, characterized in that the method builds a multidimensional data tree with a concept hierarchy.

4. The real-time query method for massive data based on a dynamic index structure according to claim 1, characterized in that the method comprises: MDS (minimum describing subset) decomposition, Z curve dimensionality reduction, Bloom Filter location, and DC-Tree indexing with result aggregation.

5. The real-time query method for massive data based on a dynamic index structure according to claim 1, characterized in that the method is based on a dynamic index structure.