CN111913987B

CN111913987B - Distributed query system and method based on dimension group-space-time-probability filtering

Info

Publication number: CN111913987B
Application number: CN202010794372.1A
Authority: CN
Inventors: 王之琼; 信俊昌; 雷盛楠; 王司亓; 李嘉欣; 汪宇; 唐俊日; 隋玲
Original assignee: 东北大学
Priority date: 2020-08-10
Filing date: 2020-08-10
Publication date: 2023-08-04
Anticipated expiration: 2040-08-10
Also published as: CN111913987A

Abstract

The invention provides a distributed query system and method based on dimension group-space-time-probability filtering, and relates to the technical field of big data query. First, a query optimizer optimizes the initial query task queue to obtain a rewritten query task queue. A dimension group filter then performs dimension group filtering on the attributes to obtain a query target dimension group candidate set, and a space-time filter further screens that candidate set to obtain a query candidate data set. Based on these two sets, the probability filter starts the distributed sampling calculation and query process, performs query calculation and confidence calculation on the samples, aggregates them into a global query result and a total confidence, and writes the result into a result buffer. Finally, the query optimizer reads the buffered results of the completed rewritten query tasks from the result buffer and calculates the query results to be returned for the initial query tasks. The invention thereby optimizes multiple query tasks, reduces query calculation cost and improves query efficiency.

Description

Distributed query system and method based on dimension group-space-time-probability filtering

Technical Field

The invention relates to the technical field of big data query, in particular to a distributed query system and method based on dimension group-space-time-probability filtering.

Background

In the big data era, distributed data storage, query and analysis techniques are widely used. Distributed queries involve multiple storage nodes and multimodal data. Unlike traditional single-node query optimization, distributed query optimization mainly schedules the query tasks of a distributed system in a big data environment so as to reduce network transmission and calculation cost, improve the accuracy of query results and improve query efficiency. Efficient distributed query and optimization is the core of big data management and an important support for intelligent big data analysis. The main approach to big data distributed query optimization is to reduce the query candidate set, avoid redundant data reading and calculation, and realize an efficient intermediate query process. An effective method for filtering the query candidate set is therefore a key point of optimization.

In recent years, the industry has developed a large number of complex query algorithms for big data environments, but each targets a single kind of query, mainly specific queries such as aggregate queries, preference queries and analysis queries, and efficiency remains low when multiple queries run at the same time. An efficient distributed query optimization method that optimizes the intermediate query process of multi-task big data and achieves efficient data management is still lacking.

Disclosure of Invention

To address the defects of the prior art, the invention provides a distributed query system and method based on dimension group-space-time-probability filtering. An optimized rewriting scheme for multiple query tasks is obtained through analysis, which reduces redundant query cost. Considering the factors that influence the query candidate set, the rewriting scheme is then subjected to dimension group filtering for high-dimensional attributes, space-time filtering for space-time attributes, and probability filtering based on sampling calculation. These three filters together narrow the candidate set effectively and reduce unnecessary query cost.

In order to solve the technical problems, the invention adopts the following technical scheme: in one aspect, the invention provides a distributed query system based on dimension group-space-time-probability filtering, which comprises a query optimizer, a dimension group filter, a space-time filter, a probability filter and a result buffer;

the query optimizer optimizes the query tasks by analyzing the internal relevance of the plurality of query tasks in the initial query task queue to obtain a rewritten query task queue; it also reads the buffered results of the completed rewritten query tasks from the result buffer, and calculates the query results returned for the initial query tasks according to the saved mapping between the query tasks before and after optimization;

the dimension group filter performs dimension group filtering on the attributes of the rewritten query task based on the grouping storage information of the metadata high-dimensional attribute table to obtain a query target dimension group candidate set;

the metadata high-dimensional attribute table stores metadata on the distributed platform according to the following policy: according to the inherent correlation of different attributes, the attribute table of a query task is vertically divided into a plurality of dimension groups, wherein each dimension group comprises a plurality of attributes; the attribute table is horizontally divided, wherein each divided part contains multiple rows of metadata information and is stored in a different data node of the distributed platform; data in the same data node is divided according to dimension groups, each dimension group is stored in the form of a data block, each data node stores a local index, the main control management node of the distributed platform stores a global index, and a global-local hierarchical index is constructed from the bottom up, wherein the indexes comprise space-time information of the data;

the space-time filter further screens the query target dimension group candidate set aiming at the space-time attribute corresponding to the rewritten query task to obtain a query candidate data set;

and the probability filter starts a distributed sampling calculation and query process according to the obtained target dimension group candidate set and the query candidate data set, performs sample collection of target data in parallel on each data node of the distributed platform, performs query calculation and confidence calculation on the samples, collects to obtain a global query result and total confidence, and writes the global query result and the total confidence into a result buffer for buffering.

On the other hand, the invention also provides a distributed query method based on dimension group-space-time-probability filtering, which specifically comprises the following steps:

step 1: performing association analysis on query tasks in the initial query task queue through a query optimizer, and optimizing to obtain a rewritten query task queue;

step 1.1: parsing the n query tasks Q₁, Q₂, …, Q_n in the initial query task queue to obtain the abstract syntax tree (AST) set of each query task;

step 1.2: performing association analysis on the abstract syntax tree set of each query task, equivalently rewriting the initial query tasks according to data association and calculation association to obtain a rewritten query task queue Q′₁, Q′₂, …, Q′_m, and storing the mapping relation between the rewritten query tasks and the initial query tasks;

step 1.3: in order to avoid repeated queries, reading the result buffer data of the rewritten query tasks to remove the query tasks that already have query results, and finally obtaining an optimized rewritten query task queue;

step 2: aiming at the rewritten query task queue obtained by optimization, performing dimension group filtering on target attributes of the query task through a dimension group filter to obtain a query target dimension group candidate set;

step 2.1: analyzing the target attributes of the query tasks in the optimized rewritten query task queue to obtain an attribute set;

step 2.2: obtaining the target dimension groups of the rewritten query task based on the grouping storage information of the metadata high-dimensional attribute table;

step 2.3: screening the global index of the metadata according to the target dimension groups of the rewritten query task to obtain a query target dimension group candidate set;

step 3: on the basis of the metadata global index, filtering the query target dimension group candidate set through a space-time filter according to the space-time attribute range of the query task target to obtain a query data candidate set;

step 3.1: mapping global index data into a three-dimensional space according to the time attribute value and the space attribute value in the metadata global index;

step 3.2: generating a target three-dimensional space region according to the target range of the query task, screening and filtering the global index data according to the mapping result of the global index data and the target three-dimensional space region range, and reserving the global index data in the target three-dimensional space region range to obtain a query data candidate set of the optimized rewritten query task;

step 4: starting to execute a distributed query task aiming at each data node in a distributed platform, extracting samples from target data in each data node through a probability filter, calculating a local query result, summarizing the local query result to obtain a query result of an optimized rewriting query task, and writing the query result into a result buffer;

step 4.1: according to the filtering results of step 2 and step 3, collecting data samples from the target data in each data node that stores data in the distributed platform;

step 4.2: carrying out local query calculation and local confidence calculation on data samples collected by each data node;

step 4.3: summarizing the local query calculation results to obtain the global query result and the total confidence of the optimized rewritten query task, and writing the global query result and the total confidence into a result buffer;

step 5: the query optimizer reads the query result of the optimized rewritten query task cached in the result buffer, and performs reconstruction processing to obtain the query result of the initial query task;

step 5.1: the query optimizer reads the query result of the rewritten query task cached in the result buffer, and performs reconstruction calculation according to the saved mapping information of the initial query task and the rewritten query task to obtain the query result of the initial query task;

step 5.2: and returning the reconstruction calculation result to finish the inquiry.
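The interaction between step 1.3 (skipping rewritten tasks that already have results) and step 4.3 (writing results and confidence into the result buffer) can be illustrated with a minimal sketch. The class `ResultBuffer`, the functions `prune_cached` and `run_pipeline`, and the in-memory dictionary storage are all hypothetical names and choices made for illustration; the patent does not specify how the result buffer is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ResultBuffer:
    """Hypothetical in-memory result buffer keyed by rewritten-task id."""
    _store: dict = field(default_factory=dict)

    def write(self, task_id, result, confidence):
        self._store[task_id] = (result, confidence)

    def read(self, task_id):
        return self._store.get(task_id)

    def has(self, task_id):
        return task_id in self._store


def prune_cached(rewritten_queue, buffer):
    """Step 1.3: drop rewritten tasks whose results are already buffered."""
    return [task_id for task_id in rewritten_queue if not buffer.has(task_id)]


def run_pipeline(rewritten_queue, buffer, execute):
    """Run the pending tasks; `execute` stands in for steps 2 to 4.

    `execute(task_id)` is assumed to return (result, confidence), which is
    written to the buffer as in step 4.3. Returns the tasks actually run.
    """
    pending = prune_cached(rewritten_queue, buffer)
    for task_id in pending:
        result, confidence = execute(task_id)
        buffer.write(task_id, result, confidence)
    return pending
```

With a buffer that already holds the result of one rewritten task, `run_pipeline` executes only the remaining tasks, and step 5 can then read every result from the buffer.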

The beneficial effects of the above technical scheme are as follows: the distributed query system and method based on dimension group-space-time-probability filtering provided by the invention offer an efficient distributed query optimization method for multiple query tasks. Dimension group filtering of high-dimensional attributes is performed on the targets of the optimized query tasks, which reduces the candidate set size and optimizes the intermediate query process for multi-dimensional data. Network transmission and calculation cost are reduced, efficient data management is achieved, multiple query tasks are optimized, query calculation cost is lowered, and query efficiency is improved.

Drawings

FIG. 1 is a block diagram of a distributed query system based on dimension group-space-time-probability filtering according to an embodiment of the present invention;

FIG. 2 is a diagram of a metadata storage model provided by an embodiment of the present invention;

FIG. 3 is a flow chart of a distributed query method based on dimension group-space-time-probability filtering according to an embodiment of the present invention;

FIG. 4 is a flowchart of sample collection for high-dimensional data using a Markov chain-based Gibbs sampling method according to an embodiment of the present invention.

Detailed Description

The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.

Internet public opinion data exists in multimodal forms, including video, audio, text and images. This embodiment takes network public opinion data as an example and queries it with the distributed query system and method based on dimension group-space-time-probability filtering.

In this embodiment, a distributed query system based on dimension group-space-time-probability filtering, as shown in fig. 1, includes a query optimizer, a dimension group filter, a space-time filter, a probability filter and a result buffer;

the metadata high-dimensional attribute table stores metadata on the distributed platform according to the following policy: according to the inherent correlation of different attributes, the attributes of the query task are vertically divided into a plurality of dimension groups, each dimension group comprises a plurality of attributes, attributes within the same group are strongly correlated, and attributes in different groups are weakly correlated; the attribute table is horizontally divided, wherein each divided part contains multiple rows of metadata information and is stored in a different data node of the distributed platform; data in the same data node is divided according to dimension groups, each dimension group is stored in the form of a data block, each data node stores a local index, the main control management node of the distributed platform stores a global index, and a global-local hierarchical index is constructed from the bottom up, wherein the indexes comprise space-time information of the data;

In this embodiment, as shown in FIG. 2, the metadata high-dimensional attribute table stores metadata on the distributed platform according to the following policy. The attribute table of the query task is vertically divided into a plurality of dimension groups according to the correlation of the different attributes, such that attributes within the same group are strongly correlated and attributes in different groups are weakly correlated. The metadata is divided into N groups, D = {D₁, D₂, D₃, …, D_N}, and each group contains a plurality of attributes. The attribute table is also horizontally divided; each part contains multiple rows of metadata information and is stored on a different data node of the distributed platform. Within a data node, the data is divided by dimension group, and each dimension group is stored in the form of an HDFS data block. Each data node stores a local index, the main control management node stores a global index, a global-local hierarchical index is built from the bottom up, and the indexes contain the space-time information of the data.
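The storage policy above can be sketched in a few lines. This is only an illustration under stated assumptions: the group contents in `DIMENSION_GROUPS`, the row fields (`id`, `lon`, `lat`, `t`), round-robin placement of rows on nodes, and the use of min/max bounds as the "space-time information" in the indexes are all hypothetical, and plain Python dictionaries stand in for HDFS data blocks.

```python
from collections import defaultdict

# Hypothetical dimension groups; the attribute names are illustrative only.
DIMENSION_GROUPS = {"D1": ["a1", "a2"], "D2": ["b1", "b2"]}
KEY_FIELDS = ("id", "lon", "lat", "t")  # row key plus space-time fields


def partition(rows, num_nodes):
    """Split rows horizontally across nodes, then vertically into group blocks."""
    nodes = defaultdict(lambda: defaultdict(list))
    for i, row in enumerate(rows):
        node_id = i % num_nodes  # round-robin placement is an assumption
        for group, attrs in DIMENSION_GROUPS.items():
            block_row = {k: row[k] for k in KEY_FIELDS + tuple(attrs)}
            nodes[node_id][group].append(block_row)
    return nodes


def bounds(block, key):
    """Return the (min, max) of one field over all rows of a block."""
    values = [r[key] for r in block]
    return (min(values), max(values))


def build_local_index(blocks):
    """Per-node index: space-time bounds of each dimension-group block."""
    return {
        group: {key: bounds(block, key) for key in ("lon", "lat", "t")}
        for group, block in blocks.items()
    }


def build_global_index(nodes):
    """Master-node index assembled bottom-up from the local indexes."""
    return {node_id: build_local_index(blocks) for node_id, blocks in nodes.items()}
```

In this sketch the global index maps each node to the bounds of its blocks, so later filters can discard whole blocks without touching the data nodes.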

A distributed query method based on dimension group-space-time-probability filtering is shown in fig. 3, and specifically comprises the following steps:

In this embodiment, the initial query task queue includes three query tasks Q₁, Q₂, Q₃, specifically:

Q₁: within time t₁ to t₂, for each city of M province, query the Top 5 of feature b₁ among the data satisfying feature a₂ > 1;

Q₂: within time t₁ to t₂, for M province, query the Top 3 of feature c₁ among the data satisfying feature b₂ > 5;

Q₃: within time t₁ to t₃, for M province, query the Top 3 of feature b₁ among the data satisfying feature a₂ > 2;

where t₁ < t₂ < t₃, and M province contains 3 cities M₁, M₂, M₃. The query optimizer parses each query task to obtain the abstract syntax tree (AST) of each query task.

In this embodiment, the abstract syntax trees of the three query tasks are traversed, and the calculation association and data association between the query tasks are analyzed. The analysis shows that there is a data association between Q₁ and Q₂, and both a calculation association and a data association between Q₂ and Q₃, so the rewritten query tasks are determined as follows:

Q′₁: among the data of M province within time t₁ to t₂, query (1) for each city, the Top 5 of feature b₁ among the data satisfying feature a₂ > 1, and (2) the Top 3 of feature c₁ among the data satisfying feature b₂ > 5.

Q′₂: among the data of area A within time t₂ to t₃, query the Top 3 of feature b₁ among the data satisfying feature a₂ > 2.

The query optimizer accesses the result buffer and sends the rewritten query tasks Q′₁ and Q′₂ to the dimension group filter, which carries out the subsequent filtering operations.
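The idea of rewriting tasks while keeping a mapping back to the initial tasks can be sketched as follows. This is a deliberately simplified illustration, not the patent's AST-based analysis: it only merges tasks that share exactly the same region and time range (one form of data association), and it does not split overlapping time ranges as the embodiment does for Q₃. The `Task` type, its fields, the integer time stamps and the `rewrite` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    region: str
    t_start: int
    t_end: int
    computation: str  # opaque description of the filter and Top-k, illustrative


def rewrite(tasks):
    """Merge tasks with the same region and time range.

    Returns (rewritten, mapping), where `rewritten` maps a rewritten-task id to
    (scope, list of computations) and `mapping` maps each initial task id to
    (rewritten-task id, index of its computation inside the rewritten task).
    """
    merged = {}   # scope -> list of distinct computations, in arrival order
    mapping = {}  # initial task id -> (rewritten id, part index)
    for task in tasks:
        scope = (task.region, task.t_start, task.t_end)
        parts = merged.setdefault(scope, [])
        if task.computation not in parts:
            parts.append(task.computation)
        rewritten_id = f"Q'{list(merged).index(scope) + 1}"
        mapping[task.task_id] = (rewritten_id, parts.index(task.computation))
    rewritten = {
        f"Q'{i + 1}": (scope, parts)
        for i, (scope, parts) in enumerate(merged.items())
    }
    return rewritten, mapping
```

The saved `mapping` is what allows results of the rewritten tasks to be traced back to the initial tasks once the distributed query has finished.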

In this embodiment, the attributes involved in query tasks Q′₁ and Q′₂ are analyzed to obtain the target attribute sets corresponding to the query tasks, d′₁ = {a₂, b₂, c₁} and d′₂ = {a₂}. According to the attribute dimension group information, all attribute dimension groups D in the global index are filtered to obtain the target dimension groups of the query tasks, D′₁ = {D₁, D₂, D₅} and D′₂ = {D₁}; the subsequent query task execution process is directed only at the target dimension groups. According to the metadata in the filtered target dimension groups, the global index is queried to obtain the target dimension group candidate sets of the rewritten query tasks Q′₁ and Q′₂.
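A minimal sketch of the dimension group filter follows. The attribute-to-group assignment passed in as `group_of` is an assumption chosen to be consistent with the sets d′₁, d′₂, D′₁ and D′₂ quoted above; the patent does not list which attribute belongs to which group. The function names and the shape of a global-index entry (a dictionary with a `group` key) are also hypothetical.

```python
def target_dimension_groups(target_attrs, group_of):
    """Return the sorted dimension groups that hold at least one target attribute."""
    return sorted({group_of[attr] for attr in target_attrs})


def filter_global_index(global_index, groups):
    """Keep only the global-index entries that belong to the target groups."""
    wanted = set(groups)
    return [entry for entry in global_index if entry["group"] in wanted]
```

For example, with `group_of = {"a2": "D1", "b2": "D2", "c1": "D5"}`, the attribute set `{"a2", "b2", "c1"}` maps to `["D1", "D2", "D5"]` and `{"a2"}` maps to `["D1"]`; index entries of every other dimension group are dropped before any data node is contacted.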

Step 3: on the basis of the metadata global index, filtering the query target dimension group candidate set through a space-time filter according to the space-time attribute range of the query task target to obtain a query candidate data set;

step 3.2: generating a target three-dimensional space region according to the target range of the query task, screening and filtering the global index data according to the mapping result of the global index data and the target three-dimensional space region range, and reserving the global index data in the target three-dimensional space region range to obtain a query data candidate set of the optimized rewriting task;

In this embodiment, filtering is performed according to the time range and space range given by query tasks Q′₁ and Q′₂. Together these ranges form a three-dimensional region V in the mapped space, represented by a pair of diagonal points; the target spatial range of the query tasks runs from (lon₁, lat₁) to (lon₂, lat₂). The space-time ranges corresponding to query tasks Q′₁ and Q′₂ are S₁ and S₂:

S₁ = [P(lon₁, lat₁, t₁), P(lon₂, lat₂, t₂)]: the spatial range is from (lon₁, lat₁) to (lon₂, lat₂), and the time range is from t₁ to t₂.

S₂ = [P(lon₁, lat₁, t₂), P(lon₂, lat₂, t₃)]: the spatial range is from (lon₁, lat₁) to (lon₂, lat₂), and the time range is from t₂ to t₃.
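The space-time filter of steps 3.1 and 3.2 amounts to a containment test against an axis-aligned three-dimensional region. The sketch below assumes that each global-index entry has already been mapped to a single point (longitude, latitude, time) and that the bounds are inclusive; the types `Point3D` and `Box3D` and the entry layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point3D:
    lon: float
    lat: float
    t: float


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned region given by two diagonal corner points."""
    low: Point3D
    high: Point3D

    def contains(self, p):
        return (
            self.low.lon <= p.lon <= self.high.lon
            and self.low.lat <= p.lat <= self.high.lat
            and self.low.t <= p.t <= self.high.t
        )


def spatiotemporal_filter(index_entries, box):
    """Step 3.2: keep the index entries whose mapped point lies inside the box."""
    return [entry for entry in index_entries if box.contains(entry["point"])]
```

A region such as S₁ would be built as `Box3D(Point3D(lon1, lat1, t1), Point3D(lon2, lat2, t2))`, and only the entries inside it are passed on to the probability filter.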

In this embodiment, for the rewritten query tasks Q′₁ and Q′₂, samples are extracted from the target data in each data node according to the query candidate set and the local index of each data node, and the query result is calculated on the samples to reduce calculation cost. This embodiment uses a Markov chain-based Gibbs sampling method to collect samples of high-dimensional data, as shown in FIG. 4: the sample set is drawn as the Markov chain converges to its stationary distribution, the local query results are then calculated, and finally the aggregated result is obtained. The specific process is as follows:

(1) The probability filter starts a Spark distributed query, and in the map phase each node samples its target data set and computes on the samples.

First, the stationary distribution π(x) of the Markov chain is obtained from the local index, and the number of samples to collect, n₂, is set. X is a variable, and the value of X in the current state t is X_t. According to formula (1), the Markov chain has the convergence property, and the next state of the Markov chain is determined only by the current state, as shown in formula (2). The Markov chain is made to converge to the stationary distribution π(x) by the following calculation:

P(X_{t+1} = x | X_t, X_{t-1}, ...) = P(X_{t+1} = x | X_t)    (1)

(a) Initialize the state of the Markov chain to X₀ = θ, where θ is a random value;

(b) Perform n₁ + n₂ − 1 rounds of calculation, where the state-transition threshold n₁ is the number of transitions needed for the Markov chain to converge. Sampling cycles through the dimensions in rotation, repeating the following process: let the state in the t-th cycle be X_t = x_t, calculate according to formula (3), and obtain the next sample from the conditional probability of the Markov chain transition, until the stationary distribution π(x) is finally reached.

(2) From the process of reaching the stationary distribution of the Markov chain, the sample set S′ = {x_{n₁}, …, x_{n₁+n₂−1}} is obtained, and the confidence corresponding to the samples is calculated. The confidence P_r of the samples is defined as

(3) The local query result of the query task is calculated from the obtained sample set S′, and the query results of the data nodes are aggregated through the reduce operation. The query results are as follows:

R′₁: ① {[M₁, (9.5, 9.8.5, 7.8, 5.9)], [M₂, (8.8, 8.4, 7.8, 5.6, 4.5)], [M₃, (10.5, 9.2, 7.9, 7.8, 7.2)]}; ② {0.96, 0.63, 0.52}; confidence P₁ = 0.95.

R′₂: {12.3, 11.5, 8.9}, confidence P₂ = 0.97.

The query results R′₁ and R′₂ of the rewritten query tasks Q′₁ and Q′₂ are thus obtained. Each result contains the total average confidence corresponding to its task, and the query results of the rewritten query tasks are written into the result buffer.

In this embodiment, the query optimizer reads the query results R′₁ and R′₂ of the rewritten query tasks cached in the result buffer. Then, according to the mapping from the rewritten query tasks to the initial query tasks, the query results R′₁ and R′₂ are reconstructed to obtain the final query results of the initial query tasks Q₁, Q₂, Q₃:
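The sampling loop of steps (a) and (b) and the reduce-side aggregation of step (3) can be illustrated with a small Gibbs sampler. Because the stationary distribution π(x) and formula (3) are not reproduced in this text, the sketch substitutes a standard bivariate normal with correlation `rho` as the target, whose full conditionals are known in closed form; that target, the function names and the Top-k merge are assumptions for illustration, and the sample confidence is not computed because its definition is not available here.

```python
import heapq
import itertools
import math
import random


def gibbs_bivariate_normal(rho, n1, n2, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Runs n1 + n2 - 1 transitions from a random initial state X_0 and keeps
    the states x_{n1}, ..., x_{n1+n2-1}, i.e. n2 samples after burn-in.
    """
    rng = random.Random(seed)
    x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)  # X_0 = theta, random
    sd = math.sqrt(1.0 - rho * rho)
    samples = []
    for t in range(1, n1 + n2):
        # Rotate through the dimensions, drawing each from its full conditional.
        x = rng.gauss(rho * y, sd)
        y = rng.gauss(rho * x, sd)
        if t >= n1:
            samples.append((x, y))
    return samples


def merge_top_k(local_results, k):
    """Reduce step: merge per-node result lists into a global Top-k."""
    return heapq.nlargest(k, itertools.chain.from_iterable(local_results))
```

Each data node would run the sampler on its own candidate data in the map phase, compute a local result on the retained samples, and hand it to a merge such as `merge_top_k` in the reduce phase.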

R₁: {[M₁, (9.5, 9.8.5, 7.8, 5.9)], [M₂, (8.8, 8.4, 7.8, 5.6, 4.5)], [M₃, (10.5, 9.2, 7.9, 7.8, 7.2)]}, confidence P₁ = 0.95.

R₂: {0.96, 0.63, 0.52}, confidence P₂ = 0.95.

R₃: {12.3, 11.5, 10.5}, confidence P₃ = 0.96.
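The reconstruction step can be replayed on the numbers of this embodiment. The rule used below is inferred only from the published values and is an assumption: Q₁ and Q₂ take the two parts of R′₁ directly, while Q₃ takes the Top 3 over the union of the per-city values of R′₁ and the values of R′₂, with the mean of the two confidences. The second M₁ value appears garbled as "9.8.5" in the text and is left out of the data; the sketch assumes it is below 10.5, so that it cannot change the Top-3 result. All names are hypothetical.

```python
import heapq

# Values copied from the embodiment. The second M1 value is garbled in the
# source text ("9.8.5") and is omitted (assumption: it is below 10.5).
R1_REWRITTEN = {
    "per_city_top5": {
        "M1": [9.5, 8.5, 7.8, 5.9],
        "M2": [8.8, 8.4, 7.8, 5.6, 4.5],
        "M3": [10.5, 9.2, 7.9, 7.8, 7.2],
    },
    "top3": [0.96, 0.63, 0.52],
    "confidence": 0.95,
}
R2_REWRITTEN = {"top3": [12.3, 11.5, 8.9], "confidence": 0.97}


def reconstruct(r1p, r2p):
    """Map the rewritten results back to the initial tasks Q1, Q2, Q3."""
    q1 = (r1p["per_city_top5"], r1p["confidence"])
    q2 = (r1p["top3"], r1p["confidence"])
    # Q3 spans both time windows: merge the candidates of both rewritten tasks.
    candidates = [v for vals in r1p["per_city_top5"].values() for v in vals]
    candidates += r2p["top3"]
    q3_values = heapq.nlargest(3, candidates)
    # Averaging the confidences is an inferred rule, not stated in the text.
    q3_conf = round((r1p["confidence"] + r2p["confidence"]) / 2, 2)
    return q1, q2, (q3_values, q3_conf)
```

Under these assumptions, the third element returned by `reconstruct(R1_REWRITTEN, R2_REWRITTEN)` is `([12.3, 11.5, 10.5], 0.96)`, which agrees with R₃ and P₃ above.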

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims

1. A distributed query system based on dimension group-space-time-probability filtering, characterized in that: the system comprises a query optimizer, a dimension group filter, a space-time filter, a probability filter and a result buffer;

2. A distributed query system based on dimension group-space-time-probability filtering according to claim 1, wherein:

the metadata high-dimensional attribute table stores metadata on the distributed platform according to the following policy: according to the inherent correlation of different attributes, the attribute table of a query task is vertically divided into a plurality of dimension groups, wherein each dimension group comprises a plurality of attributes; the attribute table is horizontally divided, wherein each divided part contains multiple rows of metadata information and is stored in a different data node of the distributed platform; data in the same data node is divided according to dimension groups, each dimension group is stored in the form of a data block, each data node stores a local index, the main control management node of the distributed platform stores a global index, and a global-local hierarchical index is constructed from the bottom up, wherein the indexes comprise space-time information of the data.

3. A distributed query method based on dimension group-space-time-probability filtering, implemented based on the distributed query system of claim 1, characterized in that: the method specifically comprises the following steps:

step 5: the query optimizer reads the query result of the optimized rewritten query task cached in the result buffer, and obtains the query result of the initial query task after reconstruction processing.

4. A distributed query method based on dimension group-space-time-probability filtering according to claim 3, wherein: the specific method of the step 1 is as follows:

step 1.3: in order to avoid repeated queries, the result buffer data of the rewritten query tasks is read to remove the query tasks that already have query results, and finally an optimized rewritten query task queue is obtained.

5. The distributed query method based on dimension group-space-time-probability filtering according to claim 4, wherein: the specific method of the step 2 is as follows:

step 2.3: screening the global index of the metadata according to the target dimension groups of the rewritten query task to obtain a query target dimension group candidate set.

6. The distributed query method based on dimension group-space-time-probability filtering according to claim 5, wherein: the specific method of the step 3 is as follows:

step 3.2: generating a target three-dimensional space region according to the target range of the query task, screening and filtering the global index data according to the mapping result of the global index data and the target three-dimensional space region range, and reserving the global index data in the target three-dimensional space region range to obtain a query data candidate set of the optimized rewritten query task.

7. The distributed query method based on dimension group-space-time-probability filtering according to claim 6, wherein: the specific method of the step 4 is as follows:

step 4.3: and summarizing the local query calculation results to obtain the global query result and the total confidence of the optimized rewritten query task, and writing the global query result and the total confidence into a result buffer.

8. The distributed query method based on dimension group-space-time-probability filtering according to claim 7, wherein: the specific method in the step 5 is as follows: