CN111913987B - Distributed query system and method based on dimension group-space-time-probability filtering - Google Patents


Info

Publication number
CN111913987B
Authority
CN
China
Prior art keywords
query
task
data
result
dimension group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010794372.1A
Other languages
Chinese (zh)
Other versions
CN111913987A (en)
Inventor
王之琼
信俊昌
雷盛楠
王司亓
李嘉欣
汪宇
唐俊日
隋玲
Original Assignee
东北大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东北大学
Priority to CN202010794372.1A
Publication of CN111913987A
Application granted
Publication of CN111913987B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24534 Query rewriting; Transformation
    • G06F16/24539 Query rewriting; Transformation using cached or materialised query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/548 Queue
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a distributed query system and method based on dimension group-space-time-probability filtering, and relates to the technical field of big data query. First, a query optimizer optimizes the initial query task queue to obtain a rewritten query task queue. A dimension group filter then filters on the query attributes to obtain a query target dimension group candidate set, and a space-time filter further screens that candidate set to obtain a query candidate data set. Based on these two sets, a probability filter starts distributed sampling and query computation: it performs query calculation and confidence calculation on the samples, aggregates them into a global query result and a total confidence, and writes these into a result buffer. Finally, the query optimizer reads the cached results of the completed rewritten query tasks from the result buffer and calculates the query results to return for the initial query tasks. The method optimizes multiple query tasks together, reduces query computation cost, and improves query efficiency.

Description

Distributed query system and method based on dimension group-space-time-probability filtering
Technical Field
The invention relates to the technical field of big data query, in particular to a distributed query system and method based on dimension group-space-time-probability filtering.
Background
In the era of big data, distributed data storage, query, and analysis techniques are widely used. Distributed queries involve multiple storage nodes and multimodal data. Unlike traditional single-node query optimization, distributed query optimization mainly schedules the query tasks of a distributed system in a big data environment so as to reduce network transmission and computation cost, improve the accuracy of query results, and improve query efficiency. Efficient distributed query and optimization is the core of big data management and an important support for intelligent big data analysis. The main approach to big data distributed query optimization is to shrink the query candidate set, avoid redundant data reads and computation, and make the intermediate query process efficient. An effective method for filtering the query candidate set is therefore a key point of optimization.
In recent years, the industry has developed many complex query algorithms for big data environments, but each targets a single kind of query, such as aggregate queries, preference queries, or analytical queries, and efficiency remains low when several queries run at the same time. An efficient distributed query optimization method that optimizes the intermediate query process for multi-task big data, and thereby achieves efficient data management, is still lacking.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a distributed query system and method based on dimension group-space-time-probability filtering. An optimized rewriting scheme for multiple query tasks is obtained through analysis, which reduces redundant query cost. Taking into account the factors that affect the query candidate set, the rewritten tasks then undergo dimension group filtering on high-dimensional attributes, space-time filtering on space-time attributes, and probability filtering based on sampling calculation. Together, the three filters shrink the candidate set effectively and reduce unnecessary query cost.
To solve the above technical problems, the invention adopts the following technical scheme. In one aspect, the invention provides a distributed query system based on dimension group-space-time-probability filtering, which comprises a query optimizer, a dimension group filter, a space-time filter, a probability filter and a result buffer;
the query optimizer optimizes the query tasks by analyzing the internal relevance of the multiple query tasks in the initial query task queue to obtain a rewritten query task queue; it also reads the cached results of completed rewritten query tasks from the result buffer and, according to the saved mapping between query tasks before and after optimization, calculates the query results to return for the initial query tasks;
the dimension group filter performs dimension group filtering on the attributes of the rewritten query tasks, based on the grouped storage information of the metadata high-dimensional attribute table, to obtain a query target dimension group candidate set;
the metadata high-dimensional attribute table stores metadata on the distributed platform according to the following policy: according to the inherent correlation between attributes, the attribute table of the query task is divided vertically into a plurality of dimension groups, each containing a plurality of attributes; the attribute table is divided horizontally, each part containing multiple rows of metadata and being stored on a different data node of the distributed platform; within a data node, data is divided by dimension group and each dimension group is stored as a data block; each data node stores a local index, the master management node of the distributed platform stores a global index, and a global-local hierarchical index is built from the bottom up, the indexes containing the space-time information of the data;
the space-time filter further screens the query target dimension group candidate set according to the space-time attributes of the rewritten query task to obtain a query candidate data set;
the probability filter starts a distributed sampling and query process based on the target dimension group candidate set and the query candidate data set: it collects samples of the target data in parallel on each data node of the distributed platform, performs query calculation and confidence calculation on the samples, aggregates them into a global query result and a total confidence, and writes these into the result buffer.
On the other hand, the invention also provides a distributed query method based on dimension group-space-time-probability filtering, which specifically comprises the following steps:
step 1: performing association analysis on query tasks in the initial query task queue through a query optimizer, and optimizing to obtain a rewritten query task queue;
step 1.1: parsing the $n$ query tasks $Q_1, Q_2, \ldots, Q_n$ in the initial query task queue to obtain the abstract syntax tree (AST) set of the query tasks;
step 1.2: performing association analysis on the AST set of the query tasks, and equivalently rewriting the initial query tasks according to their data associations and computation associations to obtain a rewritten query task queue $Q'_1, Q'_2, \ldots, Q'_m$, and saving the mapping between the rewritten query tasks and the initial query tasks;
step 1.3: to avoid repeated queries, reading the cached results of the rewritten query tasks and removing the tasks whose results already exist, finally obtaining the optimized rewritten query task queue;
step 2: aiming at the rewritten query task queue obtained by optimization, performing dimension group filtering on target attributes of the query task through a dimension group filter to obtain a query target dimension group candidate set;
step 2.1: analyzing the target attributes of the query tasks in the optimized rewritten query task queue to obtain an attribute set;
step 2.2: obtaining the target dimension groups of the rewritten query tasks based on the grouped storage information of the metadata high-dimensional attribute table;
step 2.3: screening the metadata global index according to the target dimension groups of the rewritten query tasks to obtain the query target dimension group candidate set;
step 3: on the basis of the metadata global index, filtering the query target dimension group candidate set through a space-time filter according to the space-time attribute range of the query task target to obtain a query data candidate set;
step 3.1: mapping global index data into a three-dimensional space according to the time attribute value and the space attribute value in the metadata global index;
step 3.2: generating a target three-dimensional space region according to the target range of the query task, screening and filtering the global index data according to the mapping result of the global index data and the target three-dimensional space region range, and reserving the global index data in the target three-dimensional space region range to obtain a query data candidate set of the optimized rewritten query task;
step 4: starting to execute a distributed query task aiming at each data node in a distributed platform, extracting samples from target data in each data node through a probability filter, calculating a local query result, summarizing the local query result to obtain a query result of an optimized rewriting query task, and writing the query result into a result buffer;
step 4.1: according to the filtering results of step 2 and step 3, collecting data samples from the target data on each data node that stores data in the distributed platform;
step 4.2: carrying out local query calculation and local confidence calculation on data samples collected by each data node;
step 4.3: summarizing the local query calculation results to obtain the global query result and the total confidence of the optimized rewritten query task, and writing the global query result and the total confidence into a result buffer;
step 5: the query optimizer reads the query result of the optimized rewritten query task cached in the result buffer, and performs reconstruction processing to obtain the query result of the initial query task;
step 5.1: the query optimizer reads the query result of the rewritten query task cached in the result buffer, and performs reconstruction calculation according to the saved mapping information of the initial query task and the rewritten query task to obtain the query result of the initial query task;
step 5.2: returning the reconstructed result to complete the query.
The beneficial effects of the above technical scheme are as follows. The distributed query system and method based on dimension group-space-time-probability filtering provide an efficient distributed query optimization method for multiple query tasks. Dimension group filtering of high-dimensional attributes is applied to the optimized query task targets, which reduces the candidate set size and optimizes the intermediate query process for multi-dimensional data. Network transmission and computation cost are reduced, supporting efficient data management; multiple query tasks are optimized together, query computation cost falls, and query efficiency improves.
Drawings
FIG. 1 is a block diagram of a distributed query system based on dimension group-space-time-probability filtering according to an embodiment of the present invention;
FIG. 2 is a diagram of a metadata storage model provided by an embodiment of the present invention;
FIG. 3 is a flow chart of a distributed query method based on dimension group-space-time-probability filtering according to an embodiment of the present invention;
FIG. 4 is a flowchart of sample collection for high-dimensional data using the Markov chain-based Gibbs sampling method according to an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
The internet public opinion data exists in a multi-modal form including video, audio, text, images, etc. In this embodiment, the network public opinion data is taken as an example, and the distributed query system and method based on the dimension group-space-time-probability filtering are adopted to query the network public opinion data.
In this embodiment, a distributed query system based on dimension group-space-time-probability filtering, as shown in fig. 1, includes a query optimizer, a dimension group filter, a space-time filter, a probability filter and a result buffer;
the query optimizer optimizes the query tasks by analyzing the internal relevance of the multiple query tasks in the initial query task queue to obtain a rewritten query task queue; it also reads the cached results of completed rewritten query tasks from the result buffer and, according to the saved mapping between query tasks before and after optimization, calculates the query results to return for the initial query tasks;
the dimension group filter performs dimension group filtering on the attributes of the rewritten query tasks, based on the grouped storage information of the metadata high-dimensional attribute table, to obtain a query target dimension group candidate set;
the metadata high-dimensional attribute table stores metadata at a distributed platform according to the following policies: according to the inherent relativity of different attributes, the attributes of the query task are vertically divided into a plurality of dimension groups, each dimension group comprises a plurality of attributes, the attribute association in the same group is strong, and the attribute association between different groups is weak; transversely dividing the attribute table, wherein each divided part contains multiple rows of metadata information and is stored in different data nodes of the distributed platform; for data in the same data node, dividing according to dimension groups, storing each dimension group in the form of a data block, storing local indexes by each data node, storing global indexes by a main control management node of a distributed platform, and constructing global-local hierarchical indexes from bottom to top, wherein the indexes comprise space-time information of the data;
In this embodiment, as shown in FIG. 2, the metadata high-dimensional attribute table stores metadata on the distributed platform according to the following policy. The attribute table of the query task is divided vertically into dimension groups according to the correlation between attributes, so that attributes within a group are strongly correlated and attributes in different groups are weakly correlated. The metadata is divided into $N$ groups, $D = \{D_1, D_2, D_3, \ldots, D_N\}$, each containing a plurality of attributes. The attribute table is also divided horizontally; each part contains multiple rows of metadata and is stored on a different data node of the distributed platform. Within a data node, data is divided by dimension group, and each dimension group is stored as an HDFS data block. Each data node stores a local index and the master management node stores a global index; a global-local hierarchical index is built from the bottom up, and the indexes contain the space-time information of the data.
The space-time filter further screens the query target dimension group candidate set aiming at the space-time attribute corresponding to the rewritten query task to obtain a query candidate data set;
and the probability filter starts a distributed sampling calculation and query process according to the obtained target dimension group candidate set and the query candidate data set, performs sample collection of target data in parallel on each data node of the distributed platform, performs query calculation and confidence calculation on the samples, collects to obtain a global query result and total confidence, and writes the global query result and the total confidence into a result buffer for buffering.
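The metadata storage policy described above (vertical split into dimension groups, horizontal split across data nodes, and a global index built from local indexes) can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the attribute names, the three dimension groups, the round-robin row placement, and the bounding-box form of the indexes are hypothetical, since the source does not specify them.

```python
from collections import defaultdict

# Hypothetical dimension groups (attribute names and grouping are assumptions).
DIMENSION_GROUPS = {"D1": ["a1", "a2"], "D2": ["b1", "b2"], "D3": ["c1"]}
KEYS = ("lon", "lat", "t")  # space-time columns kept with every block row


def partition(rows, num_nodes):
    """Horizontal split across nodes, then vertical split into dimension-group blocks."""
    nodes = defaultdict(lambda: defaultdict(list))
    for i, row in enumerate(rows):
        node_id = i % num_nodes  # round-robin placement is an assumption
        for group, attrs in DIMENSION_GROUPS.items():
            block_row = {k: row[k] for k in KEYS}
            block_row.update({a: row[a] for a in attrs})
            nodes[node_id][group].append(block_row)
    return nodes


def local_index(node_blocks):
    """Per-node index: space-time bounding box of each dimension-group block."""
    return {
        group: {k: (min(r[k] for r in block), max(r[k] for r in block)) for k in KEYS}
        for group, block in node_blocks.items()
    }


def global_index(nodes):
    """Built bottom-up: the master node collects every node's local index."""
    return {node_id: local_index(blocks) for node_id, blocks in nodes.items()}


rows = [
    {"lon": 123.0 + i, "lat": 41.0 + i, "t": i,
     "a1": i, "a2": 2 * i, "b1": 3 * i, "b2": 4 * i, "c1": 5 * i}
    for i in range(4)
]
nodes = partition(rows, num_nodes=2)
g_index = global_index(nodes)
```

Keeping the space-time columns in every block is a design choice of this sketch; it lets each block be indexed by time and location without joining back to other dimension groups.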
A distributed query method based on dimension group-space-time-probability filtering is shown in fig. 3, and specifically comprises the following steps:
step 1: performing association analysis on query tasks in the initial query task queue through a query optimizer, and optimizing to obtain a rewritten query task queue;
step 1.1: parsing the $n$ query tasks $Q_1, Q_2, \ldots, Q_n$ in the initial query task queue to obtain the abstract syntax tree (AST) set of the query tasks;
In this embodiment, the initial query task queue includes three query tasks $Q_1, Q_2, Q_3$:
$Q_1$: within time $t_1$ to $t_2$, for each city of province M, query the Top5 of feature $b_1$ among the data satisfying feature $a_2 > 1$;
$Q_2$: within time $t_1$ to $t_2$, in province M, query the Top3 of feature $c_1$ among the data satisfying feature $b_2 > 5$;
$Q_3$: within time $t_1$ to $t_3$, in province M, query the Top3 of feature $b_1$ among the data satisfying feature $a_2 > 2$;
where $t_1 < t_2 < t_3$ and province M contains three cities $M_1, M_2, M_3$. The query optimizer parses each query task to obtain its abstract syntax tree (AST).
Step 1.2: performing association analysis on the AST set of the query tasks, and equivalently rewriting the initial query tasks according to their data associations and computation associations to obtain a rewritten query task queue $Q'_1, Q'_2, \ldots, Q'_m$, and saving the mapping between the rewritten query tasks and the initial query tasks;
In this embodiment, the abstract syntax trees of the three query tasks are traversed and the computation associations and data associations between the query tasks are analyzed. The analysis finds a data association between $Q_1$ and $Q_2$, and both a computation association and a data association between $Q_2$ and $Q_3$. The rewritten query tasks are determined accordingly:
$Q'_1$: in the data of province M within time $t_1$ to $t_2$, query (1) for each city, the Top5 of feature $b_1$ among the data satisfying feature $a_2 > 1$, and (2) the Top3 of feature $c_1$ among the data satisfying feature $b_2 > 5$.
$Q'_2$: in the data of area A within time $t_2$ to $t_3$, query the Top3 of feature $b_1$ among the data satisfying feature $a_2 > 2$.
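One part of the rewriting in this example, splitting overlapping time ranges into disjoint intervals and remembering which intervals each initial query needs, can be sketched in Python. This is a minimal sketch under assumptions, not the AST-based association analysis itself: the `Query` class, the numeric timestamps, and the restriction to queries over the same region are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical stand-in for a parsed query: only its name and time range."""
    name: str
    start: int
    end: int


def rewrite_by_time(queries):
    """Split overlapping time ranges into disjoint intervals.

    Returns the disjoint intervals plus a mapping from each initial query
    to the intervals that together cover its time range.
    """
    bounds = sorted({q.start for q in queries} | {q.end for q in queries})
    intervals = list(zip(bounds, bounds[1:]))
    mapping = {
        q.name: [iv for iv in intervals if q.start <= iv[0] and iv[1] <= q.end]
        for q in queries
    }
    return intervals, mapping


T1, T2, T3 = 0, 10, 20  # placeholder timestamps with t1 < t2 < t3
queries = [Query("Q1", T1, T2), Query("Q2", T1, T2), Query("Q3", T1, T3)]
intervals, mapping = rewrite_by_time(queries)
# intervals -> [(0, 10), (10, 20)]; Q3 maps to both intervals.
```

The saved mapping plays the role of the relation between rewritten and initial query tasks: the range $t_1$ to $t_3$ of $Q_3$ is covered by the ranges of $Q'_1$ and $Q'_2$.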
Step 1.3: to avoid repeated queries, reading the cached results of the rewritten query tasks and removing the tasks whose results already exist, finally obtaining the optimized rewritten query task queue;
The query optimizer accesses the result buffer and sends the filtered rewritten query tasks $Q'_1, Q'_2$ to the dimension group filter, which performs the subsequent filtering operations.
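Step 1.3 amounts to a cache lookup before any work is scheduled. A minimal Python sketch, assuming the result buffer behaves like a dictionary keyed by a task identifier (the keys and the cached entry below are hypothetical):

```python
def drop_cached(task_queue, result_buffer):
    """Keep only the rewritten tasks whose results are not cached yet."""
    return [task for task in task_queue if task not in result_buffer]


# Hypothetical buffer content: one earlier task already has a result.
result_buffer = {"Q'0": {"result": [1.0], "confidence": 0.9}}
pending = drop_cached(["Q'0", "Q'1", "Q'2"], result_buffer)
# pending -> ["Q'1", "Q'2"]
```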
Step 2: aiming at the rewritten query task queue obtained by optimization, performing dimension group filtering on target attributes of the query task through a dimension group filter to obtain a query target dimension group candidate set;
step 2.1: analyzing the target attributes of the query tasks in the optimized rewritten query task queue to obtain an attribute set;
step 2.2: obtaining the target dimension groups of the rewritten query tasks based on the grouped storage information of the metadata high-dimensional attribute table;
step 2.3: screening the metadata global index according to the target dimension groups of the rewritten query tasks to obtain the query target dimension group candidate set;
In this embodiment, the attributes involved in query tasks $Q'_1$ and $Q'_2$ are analyzed to obtain the corresponding target attribute sets $d'_1 = \{a_2, b_2, c_1\}$ and $d'_2 = \{a_2\}$. According to the attribute dimension group information, all attribute dimension groups $D$ in the global index are filtered to obtain the target dimension groups of the query tasks, $D'_1 = \{D_1, D_2, D_5\}$ and $D'_2 = \{D_1\}$; subsequent query execution targets only these dimension groups. The global index is then queried with the metadata in the filtered target dimension groups to obtain the target dimension group candidate sets of the rewritten query tasks $Q'_1$ and $Q'_2$.
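The dimension group filtering of this step can be sketched in Python as a lookup from attributes to groups followed by a filter over the global index. The attribute-to-group assignment below is an assumption chosen so that the attribute sets of the embodiment map to $\{D_1, D_2, D_5\}$ and $\{D_1\}$; the index entries and block names are likewise hypothetical.

```python
# Hypothetical attribute-to-group assignment (not given in the source).
ATTR_TO_GROUP = {
    "a1": "D1", "a2": "D1",
    "b1": "D2", "b2": "D2",
    "e1": "D3", "f1": "D4",
    "c1": "D5",
}

# Hypothetical global index entries: which node holds which dimension-group block.
GLOBAL_INDEX = [
    {"node": 0, "group": "D1", "block": "blk-0"},
    {"node": 0, "group": "D3", "block": "blk-1"},
    {"node": 1, "group": "D2", "block": "blk-2"},
    {"node": 1, "group": "D5", "block": "blk-3"},
]


def target_groups(attrs):
    """Map a query's target attribute set to its target dimension groups."""
    return sorted({ATTR_TO_GROUP[a] for a in attrs})


def filter_global_index(index, groups):
    """Keep only index entries whose dimension group is a target group."""
    wanted = set(groups)
    return [entry for entry in index if entry["group"] in wanted]


groups_q1 = target_groups({"a2", "b2", "c1"})
groups_q2 = target_groups({"a2"})
candidates_q1 = filter_global_index(GLOBAL_INDEX, groups_q1)
candidates_q2 = filter_global_index(GLOBAL_INDEX, groups_q2)
```

With this assignment, blocks of groups the query never touches (here `D3`) are dropped before any data is read.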
Step 3: on the basis of the metadata global index, filtering the query target dimension group candidate set through a space-time filter according to the space-time attribute range of the query task target to obtain a query candidate data set;
step 3.1: mapping global index data into a three-dimensional space according to the time attribute value and the space attribute value in the metadata global index;
step 3.2: generating a target three-dimensional space region according to the target range of the query task, screening and filtering the global index data according to the mapping result of the global index data and the target three-dimensional space region range, and reserving the global index data in the target three-dimensional space region range to obtain a query data candidate set of the optimized rewriting task;
In this embodiment, filtering follows the time range and space range given by query tasks $Q'_1$ and $Q'_2$. Each range forms a three-dimensional region $V$ in this space, represented by a pair of diagonal points, where the target space range of the query tasks is from $(lon_1, lat_1)$ to $(lon_2, lat_2)$. The space-time ranges corresponding to $Q'_1$ and $Q'_2$ are $S_1$ and $S_2$:
$S_1 = [P(lon_1, lat_1, t_1), P(lon_2, lat_2, t_2)]$: the space range is from $(lon_1, lat_1)$ to $(lon_2, lat_2)$, and the time range is from $t_1$ to $t_2$.
$S_2 = [P(lon_1, lat_1, t_2), P(lon_2, lat_2, t_3)]$: the space range is from $(lon_1, lat_1)$ to $(lon_2, lat_2)$, and the time range is from $t_2$ to $t_3$.
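The space-time filter reduces to a containment test of each index entry's (longitude, latitude, time) point against the box spanned by the two diagonal points. A minimal Python sketch, assuming index entries carry a single representative point; the coordinates, timestamps and entry names are made up for illustration.

```python
from typing import NamedTuple


class Point(NamedTuple):
    lon: float
    lat: float
    t: float


def in_box(p, lo, hi):
    """True if p lies inside the axis-aligned box with diagonal corners lo and hi."""
    return all(l <= v <= h for v, l, h in zip(p, lo, hi))


def spacetime_filter(index_entries, lo, hi):
    """Keep the global index entries whose mapped point falls in the target region."""
    return [e for e in index_entries if in_box(e["point"], lo, hi)]


# Hypothetical region S1: diagonal points P(lon1, lat1, t1) and P(lon2, lat2, t2).
s1_lo, s1_hi = Point(120.0, 40.0, 0), Point(125.0, 43.0, 10)
entries = [
    {"id": "A", "point": Point(123.4, 41.8, 5)},   # inside S1
    {"id": "B", "point": Point(123.4, 41.8, 15)},  # outside the time range
    {"id": "C", "point": Point(130.0, 41.0, 5)},   # outside the longitude range
]
kept = [e["id"] for e in spacetime_filter(entries, s1_lo, s1_hi)]
```

The test is inclusive at both ends, so a point at exactly $t_2$ would fall in both $S_1$ and $S_2$; whether the boundary belongs to one or both regions is a choice the source leaves open.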
Step 4: starting to execute a distributed query task aiming at each data node in a distributed platform, extracting samples from target data in each data node through a probability filter, calculating a local query result, summarizing the local query result to obtain a query result of an optimized rewriting query task, and writing the query result into a result buffer;
step 4.1: according to the filtering results of step 2 and step 3, collecting data samples from the target data on each data node that stores data in the distributed platform;
step 4.2: carrying out local query calculation and local confidence calculation on data samples collected by each data node;
step 4.3: summarizing the local query calculation results to obtain the global query result and the total confidence of the optimized rewritten query task, and writing the global query result and the total confidence into a result buffer;
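Steps 4.2 and 4.3 follow a map-then-reduce shape: each node computes a local Top-k on its samples, and the local results are merged into a global Top-k with a total confidence. The sketch below is plain Python rather than Spark, the per-node sample values and confidences are invented, and averaging the local confidences is an assumption, since the confidence formula is not reproduced here.

```python
import heapq
from statistics import mean


def local_topk(samples, k):
    """Local query calculation on one node's samples: the k largest values."""
    return heapq.nlargest(k, samples)


def merge_results(local_results, k):
    """Merge (topk_values, confidence) pairs from all nodes into a global result."""
    values = [v for vals, _ in local_results for v in vals]
    total_confidence = mean(conf for _, conf in local_results)
    return heapq.nlargest(k, values), total_confidence


# Hypothetical per-node samples and local confidences.
node_samples = {0: [12.3, 4.1, 8.9, 2.0], 1: [11.5, 7.0, 6.2]}
node_confidence = {0: 0.98, 1: 0.96}

local_results = [
    (local_topk(samples, 3), node_confidence[node])
    for node, samples in node_samples.items()
]
global_top3, total_confidence = merge_results(local_results, 3)
```

Merging local Top-k lists is exact for Top-k queries, because any value in the global Top-k must also be in the Top-k of the node that holds it.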
In this embodiment, for the rewritten query tasks $Q'_1$ and $Q'_2$, samples are extracted from the target data on each data node, according to the query candidate set and the local index of each data node, and the query results are calculated on the samples to reduce computation cost. This embodiment uses a Markov chain-based Gibbs sampling method to collect samples of high-dimensional data, as shown in FIG. 4: the sample set is drawn as the Markov chain converges to its stationary distribution, the local query results are then calculated, and finally the aggregated result is obtained. The specific process is as follows:
(1) A Spark distributed query is started by the probability filter, and the map operation samples and computes on the target data set queried at each node.
First, the stationary distribution $\pi(x)$ of the Markov chain is obtained from the local index, and the number of samples to collect, $n_2$, is set. $X$ is a variable, and the data value of $X$ in the current state $t$ is $X_t$. According to formula (1), the Markov chain has the convergence property, and the next state of the Markov chain is determined only by the current state, as shown in formula (2). The Markov chain is made to converge to the stationary distribution $\pi(x)$ by the following calculation:
$$P(X_{t+1} = x \mid X_t, X_{t-1}, \ldots) = P(X_{t+1} = x \mid X_t) \tag{1}$$
(a) Initialize the state of the Markov chain to $X_0 = \theta$, where $\theta$ is a random value;
(b) Perform $n_1 + n_2 - 1$ iterations, where the state-transition threshold $n_1$ is the number of transitions after which the Markov chain has converged. Sampling cycles through the dimensions in turn: with the state at the $t$-th iteration being $X_t = x_t$, the calculation follows formula (3), and the next sample is drawn from the conditional probability of the Markov chain transition, until the stationary distribution $\pi(x)$ is reached.
(2) From the process of reaching the stationary distribution, the sample set $S' = \{x_{n_1}, \ldots, x_{n_1+n_2-1}\}$ is obtained, and the confidence $P_r$ corresponding to the samples is calculated.
(3) The local query result of the query task is calculated from the sample set $S'$, and the query results of the data nodes are aggregated by the reduce operation. The query results are as follows:
$R'_1$: ① {[$M_1$, (9.5, 9.8.5, 7.8, 5.9)], [$M_2$, (8.8, 8.4, 7.8, 5.6, 4.5)], [$M_3$, (10.5, 9.2, 7.9, 7.8, 7.2)]}; ② {0.96, 0.63, 0.52}; confidence $P_1 = 0.95$.
$R'_2$: {12.3, 11.5, 8.9}; confidence $P_2 = 0.97$.
This yields the query results $R'_1$ and $R'_2$ of the rewritten query tasks $Q'_1$ and $Q'_2$. Each result carries the total average confidence of its task, and the query results of the rewritten query tasks are written into the result buffer.
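The Gibbs sampling loop in items (a) and (b) above can be sketched in Python. The target here is a standard bivariate normal with correlation `rho`, chosen only because its conditional distributions are known in closed form; it stands in for the stationary distribution $\pi(x)$ that the embodiment derives from the local index. The function name and parameter values are hypothetical.

```python
import random


def gibbs_bivariate_normal(n1, n2, rho, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Runs n1 + n2 - 1 iterations and keeps the states of iterations
    n1 .. n1 + n2 - 1, that is n2 samples taken after burn-in.
    """
    rng = random.Random(seed)
    cond_sd = (1.0 - rho * rho) ** 0.5
    # (a) random initial state X_0 = theta
    x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    samples = []
    # (b) cycle through the dimensions, drawing each from its conditional
    for t in range(1, n1 + n2):
        x = rng.gauss(rho * y, cond_sd)  # x | y ~ N(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, cond_sd)  # y | x ~ N(rho * x, 1 - rho^2)
        if t >= n1:  # discard the first n1 - 1 transitions as burn-in
            samples.append((x, y))
    return samples


samples = gibbs_bivariate_normal(n1=500, n2=5000, rho=0.5)
mean_x = sum(x for x, _ in samples) / len(samples)
mean_xy = sum(x * y for x, y in samples) / len(samples)
```

After burn-in, `mean_x` should be close to 0 and `mean_xy` close to `rho`, which is a quick sanity check that the chain has reached the intended distribution.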
Step 5: the query optimizer reads the query result of the optimized rewritten query task cached in the result buffer, and performs reconstruction processing to obtain the query result of the initial query task;
step 5.1: the query optimizer reads the query result of the rewritten query task cached in the result buffer, and performs reconstruction calculation according to the saved mapping information of the initial query task and the rewritten query task to obtain the query result of the initial query task;
step 5.2: returning the reconstructed result to complete the query.
In this embodiment, the query optimizer reads the query results $R'_1$ and $R'_2$ of the rewritten query tasks cached in the result buffer. Then, according to the mapping from the rewritten query tasks to the initial query tasks, it performs reconstruction calculation on $R'_1$ and $R'_2$ to obtain the final query results of the initial query tasks $Q_1, Q_2, Q_3$:
$R_1$: {[$M_1$, (9.5, 9.8.5, 7.8, 5.9)], [$M_2$, (8.8, 8.4, 7.8, 5.6, 4.5)], [$M_3$, (10.5, 9.2, 7.9, 7.8, 7.2)]}; confidence $P_1 = 0.95$.
$R_2$: {0.96, 0.63, 0.52}; confidence $P_2 = 0.95$.
$R_3$: {12.3, 11.5, 10.5}; confidence $P_3 = 0.96$.
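The source does not spell out the reconstruction rule. One reading that reproduces the reported $R_3$ is to merge the per-city values of part ① of $R'_1$ with $R'_2$, take the three largest, and average the two confidences. The Python sketch below follows that reading as an assumption; it ignores the difference between the conditions $a_2 > 1$ and $a_2 > 2$, and it omits the one value of $M_1$ that is not legible in the source.

```python
import heapq

# Results of the rewritten tasks as listed above (one illegible M1 value omitted).
r1_prime_city_top5 = {
    "M1": [9.5, 8.5, 7.8, 5.9],
    "M2": [8.8, 8.4, 7.8, 5.6, 4.5],
    "M3": [10.5, 9.2, 7.9, 7.8, 7.2],
}
r1_prime_top3 = [0.96, 0.63, 0.52]
r2_prime_top3 = [12.3, 11.5, 8.9]
p1_prime, p2_prime = 0.95, 0.97


def reconstruct():
    """Rebuild the initial query results from the rewritten ones (assumed rule)."""
    merged = [v for vals in r1_prime_city_top5.values() for v in vals]
    merged += r2_prime_top3
    return {
        "Q1": (r1_prime_city_top5, p1_prime),  # part (1) of Q'1
        "Q2": (r1_prime_top3, p1_prime),       # part (2) of Q'1
        "Q3": (heapq.nlargest(3, merged),      # Q3 spans both time ranges
               round((p1_prime + p2_prime) / 2, 2)),
    }


results = reconstruct()
# results["Q3"] -> ([12.3, 11.5, 10.5], 0.96)
```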
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, and some or all of their technical features can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims (8)

1. A distributed query system based on dimension group-space-time-probability filtering, characterized in that: the system comprises a query optimizer, a dimension group filter, a space-time filter, a probability filter and a result buffer;
the query optimizer optimizes the query tasks by analyzing the internal correlations among a plurality of query tasks in the initial query task queue to obtain a rewritten query task queue; it also reads the cached results of the completed rewritten query tasks from the result buffer, and calculates and returns the query results of the initial query tasks according to the saved mapping between the query tasks before and after optimization;
the dimension group filter performs dimension group filtering on the attributes of the rewritten query task based on the grouped storage information of the metadata high-dimensional attribute table to obtain a query target dimension group candidate set;
the space-time filter further screens the query target dimension group candidate set according to the space-time attributes corresponding to the rewritten query task to obtain a query candidate data set;
and the probability filter starts a distributed sampling calculation and query process according to the obtained target dimension group candidate set and the query candidate data set, collects samples of the target data in parallel on each data node of the distributed platform, performs query calculation and confidence calculation on the samples, aggregates them to obtain a global query result and a total confidence, and writes the global query result and the total confidence into the result buffer for caching.
2. A distributed query system based on dimension group-space-time-probability filtering according to claim 1, wherein:
the metadata high-dimensional attribute table stores metadata on the distributed platform according to the following policy: according to the inherent correlation between different attributes, the attribute table of the query task is vertically divided into a plurality of dimension groups, each containing a plurality of attributes; the attribute table is also horizontally divided, each divided part containing multiple rows of metadata information and being stored on a different data node of the distributed platform; the data within one data node is divided by dimension group, with each dimension group stored as a data block; each data node stores a local index, the master management node of the distributed platform stores the global index, and a global-local hierarchical index is constructed from the bottom up, the index containing the space-time information of the data.
3. A distributed query method based on dimension group-space-time-probability filtering, implemented based on the distributed query system of claim 1, characterized in that: the method specifically comprises the following steps:
step 1: performing association analysis on the query tasks in the initial query task queue through the query optimizer, and optimizing them to obtain a rewritten query task queue;
step 2: for the rewritten query task queue obtained by optimization, performing dimension group filtering on the target attributes of the query tasks through the dimension group filter to obtain a query target dimension group candidate set;
step 3: on the basis of the metadata global index, filtering the query target dimension group candidate set through the space-time filter according to the space-time attribute range targeted by the query task to obtain a query data candidate set;
step 4: executing the distributed query task on each data node of the distributed platform, extracting samples from the target data in each data node through the probability filter, calculating local query results, summarizing the local query results to obtain the query result of the optimized rewritten query task, and writing the query result into the result buffer;
step 5: reading, by the query optimizer, the query results of the optimized rewritten query tasks cached in the result buffer, and obtaining the query results of the initial query tasks after reconstruction processing.
4. A distributed query method based on dimension group-space-time-probability filtering according to claim 3, wherein: the specific method of the step 1 is as follows:
step 1.1: analyzing the n query tasks Q_1, Q_2, …, Q_n in the initial query task queue to obtain the abstract syntax tree (AST) set of each query task;
step 1.2: performing association analysis on the AST set of each query task, equivalently rewriting the initial query tasks according to data association and calculation association to obtain a rewritten query task queue Q'_1, Q'_2, …, Q'_m, and storing the mapping relation between the rewritten query tasks and the initial query tasks;
step 1.3: in order to avoid repeated queries, reading the cached result data of the rewritten query tasks and removing the query tasks that already have query results, finally obtaining an optimized rewritten query task queue.
5. The distributed query method based on dimension group-space-time-probability filtering according to claim 4, wherein: the specific method of the step 2 is as follows:
step 2.1: analyzing the target attributes of the query tasks in the optimized rewritten query task queue to obtain an attribute set;
step 2.2: obtaining the target dimension groups of the rewritten query task based on the grouped storage information of the metadata high-dimensional attribute table;
step 2.3: screening the global index of the metadata according to the target dimension groups of the rewritten query task to obtain a query target dimension group candidate set.
6. The distributed query method based on dimension group-space-time-probability filtering according to claim 5, wherein: the specific method of the step 3 is as follows:
step 3.1: mapping global index data into a three-dimensional space according to the time attribute value and the space attribute value in the metadata global index;
step 3.2: generating a target three-dimensional space region according to the target range of the query task, screening and filtering the global index data according to the mapping result of the global index data and the target three-dimensional space region range, and reserving the global index data in the target three-dimensional space region range to obtain a query data candidate set of the optimized rewritten query task.
7. The distributed query method based on dimension group-space-time-probability filtering according to claim 6, wherein: the specific method of the step 4 is as follows:
step 4.1: according to the filtering results of step 2 and step 3, collecting data samples from the target data in each data node that stores data on the distributed platform;
step 4.2: carrying out local query calculation and local confidence calculation on the data samples collected by each data node;
step 4.3: summarizing the local query calculation results to obtain the global query result and the total confidence of the optimized rewritten query task, and writing the global query result and the total confidence into the result buffer.
8. The distributed query method based on dimension group-space-time-probability filtering according to claim 7, wherein: the specific method in the step 5 is as follows:
step 5.1: the query optimizer reads the query results of the rewritten query tasks cached in the result buffer, and performs reconstruction calculation according to the saved mapping information between the initial query tasks and the rewritten query tasks to obtain the query results of the initial query tasks;
step 5.2: returning the reconstruction calculation result to complete the query.
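The two filtering stages recited in claims 5 and 6 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the global index is modeled as a flat list of records, the three-dimensional space is taken to be two spatial coordinates plus time, and the target region is an axis-aligned box. The record layout, the attribute-to-group table, and all names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One global-index record (hypothetical layout)."""
    node: str       # data node that holds the data block
    dim_group: str  # dimension group stored in the block
    x: float        # spatial coordinate
    y: float        # spatial coordinate
    t: float        # time coordinate


# Hypothetical grouped storage information: attribute -> dimension group.
ATTRIBUTE_TO_GROUP = {"temperature": "G1", "humidity": "G1", "speed": "G2"}


def dimension_group_filter(index, target_attributes):
    """Keep index entries whose dimension group holds a target attribute."""
    groups = {ATTRIBUTE_TO_GROUP[a] for a in target_attributes}
    return [e for e in index if e.dim_group in groups]


def space_time_filter(candidates, x_range, y_range, t_range):
    """Keep entries whose (x, y, t) point lies inside the target box."""
    def inside(value, bounds):
        return bounds[0] <= value <= bounds[1]

    return [
        e for e in candidates
        if inside(e.x, x_range) and inside(e.y, y_range) and inside(e.t, t_range)
    ]


index = [
    IndexEntry("node1", "G1", 1.0, 1.0, 10.0),
    IndexEntry("node1", "G2", 1.0, 1.0, 10.0),
    IndexEntry("node2", "G1", 5.0, 5.0, 10.0),
    IndexEntry("node2", "G1", 1.5, 0.5, 99.0),
]
candidates = dimension_group_filter(index, ["temperature"])
selected = space_time_filter(candidates, (0.0, 2.0), (0.0, 2.0), (0.0, 20.0))
```

Running the cheaper dimension group test first shrinks the candidate set before the space-time test, which matches the order of steps 2 and 3.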

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010794372.1A CN111913987B (en) 2020-08-10 2020-08-10 Distributed query system and method based on dimension group-space-time-probability filtering

Publications (2)

Publication Number Publication Date
CN111913987A CN111913987A (en) 2020-11-10
CN111913987B true CN111913987B (en) 2023-08-04

Family

ID=73284755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010794372.1A Active CN111913987B (en) 2020-08-10 2020-08-10 Distributed query system and method based on dimension group-space-time-probability filtering

Country Status (1)

Country Link
CN (1) CN111913987B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678520A (en) * 2013-11-29 2014-03-26 中国科学院计算技术研究所 Multi-dimensional interval query method and system based on cloud computing
CN105229633A (en) * 2013-03-13 2016-01-06 萨勒斯福斯通讯有限公司 For realizing system, method and apparatus disclosed in data upload, process and predicted query API
CN105608077A (en) * 2014-10-27 2016-05-25 青岛金讯网络工程有限公司 Big data distributed storage method and system
CN106777279A (en) * 2016-12-29 2017-05-31 苏碧云 A kind of time-space relationship analysis system
CN109166069A (en) * 2018-07-17 2019-01-08 华中科技大学 Data correlation method, system and equipment based on Markov logical network
CN109213513A (en) * 2017-06-30 2019-01-15 腾讯科技(深圳)有限公司 The determination method, apparatus and computer readable storage medium of software share accounting
CN110147372A (en) * 2019-05-21 2019-08-20 电子科技大学 A kind of distributed data base Intelligent Hybrid storage method towards HTAP

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11144548B2 (en) * 2018-04-24 2021-10-12 Dremio Corporation Optimized data structures of a relational cache with a learning capability for accelerating query execution by a data system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
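Research on skyline query processing techniques in wireless sensor networks; 信俊昌; China Doctoral Dissertations Full-text Database, Information Science and Technology; I140-33 *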

Also Published As

Publication number Publication date
CN111913987A (en) 2020-11-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant