CN108256028B - Multi-dimensional dynamic sampling method for approximate query in cloud computing environment

Info

Publication number: CN108256028B
Application number: CN201810025016.6A
Authority: CN
Inventors: 史英杰; 刘怡; 郭飞; 刘昊
Original assignee: Beijing Institute Fashion Technology
Current assignee: Beijing Institute Fashion Technology
Priority date: 2018-01-11
Filing date: 2018-01-11
Publication date: 2021-09-28
Anticipated expiration: 2038-01-11
Also published as: CN108256028A

Abstract

A multi-dimensional dynamic sampling method for approximate queries in a cloud computing environment. The dynamic sampling system comprises an offline processing stage, which creates stratified (layered) samples, and an online processing stage, which dynamically selects samples. In the offline stage, a workload column set analysis module parses the workload query statements; a data characteristic analysis module analyzes the data characteristics; a coverage index calculation module calculates the total coverage index; a stratification column set determination module selects the column sets used to create the stratified samples; and a stratified sample data creation module creates the stratified samples. In the online stage, a query analysis module parses the user's query statement; a sample selection module selects the stratified sample data with the minimum sampling cost; and a sample size determination module determines the size of the sample drawn from each sample layer. The invention effectively addresses the inaccurate estimation of small groups caused by data skew in approximate queries, and reduces sampling cost when sample storage space is limited.

Description

Multi-dimensional dynamic sampling method for approximate query in cloud computing environment

Technical Field

The invention relates to a data sampling method for approximate queries, and in particular to a dynamic sampling method oriented toward multi-query workloads in a cloud computing environment.

Background

Cloud computing environments provide a highly scalable and cost-effective way to manage big data and have become the mainstream platform for doing so. However, even in a cloud computing environment, queries over big data cannot meet the speed requirements of real-time processing and user interaction. For ad hoc queries and exploratory data analysis, quickly obtaining an estimated result is more useful than spending large amounts of time and computing resources to obtain a fully accurate one. Approximate query processing estimates the query result from sample data, which greatly reduces query execution time and is of great significance for big data analysis.

Acharya et al. proposed an approximate query processing technique based on sample data that uses uniform random sampling, in which every tuple is drawn with equal probability. Uniform random sampling is simple and easy to implement and is suitable when the data are uniformly distributed. However, when data skew produces small groups in a group-by aggregation query, uniform random sampling severely reduces the accuracy of the estimate, so that the estimate loses its meaning. Surajit et al. proposed a weighted sampling method that counts the number of query predicates each tuple satisfies and uses that number as the tuple's sampling weight: the more query predicates a tuple satisfies, the greater its probability of being sampled. Weighted sampling alleviates, to some extent, the inaccurate estimation that data skew causes under uniform random sampling, but its effect depends entirely on the workload from which the sampling weights were calculated; when an incoming query statement differs from that workload, the sampling weights are meaningless. Swaroup et al. proposed the congress sampling method, which creates one common sample for all possible grouping columns and queries. However, the effectiveness of the sample gradually decreases as the number of queries grows, and the preprocessing time increases exponentially with the number of columns, so the method cannot handle scenarios with many query statements. In general, these techniques assume that the query statements are few and of fixed types, and they do not scale well in practical applications. In addition, they were all proposed for relational databases and cannot be applied to a cloud computing environment.
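
As a rough illustration of why uniform random sampling fails on small groups, consider the following worked arithmetic. The numbers are hypothetical and not taken from the patent: a sampling rate p = 0.01, a small group G of 50 tuples, and the assumption that each tuple is included independently with probability p.

```latex
% Hypothetical numbers: sampling rate p = 0.01, small group with |G| = 50 tuples,
% each tuple included independently with probability p.
\begin{align*}
\mathbb{E}[\text{tuples sampled from } G] &= p \cdot |G| = 0.01 \times 50 = 0.5 \\
\Pr[\text{no tuple of } G \text{ is sampled}] &= (1 - p)^{|G|} = 0.99^{50} \approx 0.605
\end{align*}
```

Under these assumptions the group is missing from the sample roughly 60% of the time, so an aggregate for that group often cannot be estimated at all.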

Disclosure of Invention

The method is used in the data preprocessing stage of approximate query processing. It preprocesses the original data set to generate several stratified (layered) sample data sets. When a query statement arrives, it dynamically selects a sample data set according to the content of the query statement and its sample size, and gives the sample size to be drawn from each sample layer. The method effectively addresses the inaccurate estimation of small groups caused by data skew in approximate queries, and reduces sampling cost when sample storage space is limited.

To achieve this purpose, the technical scheme of the invention is as follows:

a multi-dimensional dynamic sampling method for approximate queries in a cloud computing environment, comprising the following steps:

1) the dynamic sampling system comprises an offline processing stage for creating stratified (layered) samples and an online processing stage for dynamically selecting samples;

2) the offline processing stage is provided with a workload column set analysis module, a data characteristic analysis module, a coverage index calculation module, a stratification column set determination module and a stratified sample data creation module;

3) the workload column set analysis module parses the workload query statements, extracts the grouping column set of each workload query statement, counts the occurrences of each grouping column set, generates the set CS of candidate stratification column sets, analyzes the relationships between the candidate stratification column sets CS_i, and outputs the result to the data characteristic analysis module;

4) the data characteristic analysis module starts a MapReduce job to scan the original data set and outputs the data distribution of the original data set to the coverage index calculation module;

5) the coverage index calculation module, using the data distribution result, calculates the total coverage index obtained when stratified sampling is performed on each candidate stratification column set CS_i;

6) the stratification column set determination module selects the stratification column sets used to create stratified samples, based on information such as the coverage index and the sample storage space;

7) the stratified sample data creation module starts a MapReduce job to create the stratified samples: the Map function scans the original data set and routes each tuple to the corresponding Reduce function according to the tuple's values on each stratification column set selected for sample creation, and the Reduce function updates the statistics and writes the tuple data to the stratified sample data set;

8) the online processing stage is provided with a query analysis module, a sample selection module and a sample size determination module;

9) the query analysis module parses the query statements entered online by the user and extracts the grouping column set CS_q of each user query statement;

10) the sample selection module selects, from the stratified sample data sets, the stratified sample data with the minimum sampling cost according to the grouping column set CS_q of the user query statement;

11) the sample size determination module determines the size of the sample drawn from each sample layer according to the sample size of the approximate query statement.

The invention has the following advantages:

1. the method determines the stratification column sets by analyzing the workload characteristics and the data distribution characteristics, and creates several multi-dimensional stratified samples based on the stratification column sets and the sample storage space, which addresses the inaccurate estimates caused by data skew in approximate queries;

2. when determining the column sets used to create stratified samples, the coverage index expresses the extent to which, once a given stratified sample exists, different query statements can use that sample for stratified sampling, which lays a foundation for extending the query workload;

3. given the sample layers and the total sample size, the invention determines the sample size drawn from each sample layer according to the relationship between the query's grouping column set CS_q and the sample's stratification column set CS_s, with a solution for each case: (1) when CS_s = CS_q, the sample size for each sample layer is the larger of the average sample size per sample layer and the sample size proportional to the size of the sample layer, so that neither small groups nor large groups receive too small a sample; (2) when CS_q ⊂ CS_s, the invention first merges the sample layers that have the same value on the CS_q column set into one large sample layer and determines its sample size, and then performs stratified sampling within each large sample layer, thereby dynamically determining the sample size of each sample layer.

Drawings

FIG. 1 is a diagram of a multi-dimensional dynamic sampling framework for approximating queries in a cloud computing environment.

Detailed Description

The invention is further described with reference to the following examples and the accompanying drawings.

A multi-dimensional dynamic sampling method for approximate queries in a cloud computing environment, comprising the following steps:

1) the dynamic sampling system comprises an offline processing stage for creating stratified (layered) samples and an online processing stage for dynamically selecting samples;

2) the offline processing stage is provided with a workload column set analysis module, a data characteristic analysis module, a coverage index calculation module, a stratification column set determination module and a stratified sample data creation module;

3) the workload column set analysis module parses the workload query statements, extracts the grouping column set of each query statement, counts the occurrences of each column set, analyzes the relationships among the column sets, and outputs the result to the data characteristic analysis module;

4) the data characteristic analysis module starts a MapReduce job to scan the original data set and outputs the data distribution result to the coverage index calculation module;

5) the coverage index calculation module, using the data distribution information, calculates the total coverage index obtained when stratified sampling is performed on each candidate stratification column set;

6) the stratification column set determination module selects the stratification column sets used to create stratified samples, based on information such as the coverage index and the sample storage space;

7) the stratified sample creation module starts a MapReduce job: the Map function scans the original data set and routes each tuple to the corresponding Reduce function according to the tuple's values on each stratification column set, and the Reduce function updates the statistics and writes the tuple data to the stratified sample data set;

8) the online processing stage is provided with a query analysis module, a sample selection module and a sample size determination module;

9) the query analysis module parses the query statement entered by the user and extracts its grouping column set;

10) the sample selection module selects the stratified sample data with the minimum sampling cost according to the grouping column set of the query statement;

11) the sample size determination module determines the size of the sample drawn from each sample layer according to the sample size of the approximate query statement.

In step 3), the workload column set analysis module proceeds as follows: (1) parse all SQL query statements in the workload and extract the corresponding grouping column sets; (2) count the occurrences of each grouping column set and generate the set of candidate stratification column sets CS = {CS₁, CS₂, ..., CS_M}; (3) analyze the relationship between any two candidate stratification column sets CS_i and CS_j in CS: if CS_i ⊂ CS_j, store CS_j - CS_i in the set RS, and output the result to the data characteristic analysis module.
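
The three sub-steps of step 3) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `grouping_column_set` and `analyze_workload` are hypothetical, the regular expression only handles simple single-level `GROUP BY` clauses, and the subset condition CS_i ⊂ CS_j is assumed from the surrounding text.

```python
import re
from collections import Counter

# Hypothetical helper: a deliberately simple GROUP BY extractor (no subqueries).
_GROUP_BY = re.compile(
    r"group\s+by\s+(.+?)(?:\s+(?:having|order\s+by|limit)\b|;|$)",
    flags=re.IGNORECASE | re.DOTALL,
)

def grouping_column_set(sql):
    """Return the GROUP BY columns of one query as a frozenset (empty if none)."""
    match = _GROUP_BY.search(sql)
    if not match:
        return frozenset()
    return frozenset(c.strip().lower() for c in match.group(1).split(",") if c.strip())

def analyze_workload(queries):
    """Return (occurrence counts per grouping column set, the set RS)."""
    # Sub-steps (1) and (2): extract grouping column sets and count occurrences.
    counts = Counter(cs for cs in map(grouping_column_set, queries) if cs)
    candidates = list(counts)  # the candidate set CS
    # Sub-step (3): for every pair with CS_i a proper subset of CS_j (assumed
    # condition), store the difference CS_j - CS_i in RS.
    rs = {cs_j - cs_i for cs_i in candidates for cs_j in candidates if cs_i < cs_j}
    return counts, rs
```

For a workload in which two queries group by `region` and one groups by `region, product`, this sketch yields a count of 2 for {region}, a count of 1 for {region, product}, and RS = {{product}}.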

In step 4), a MapReduce job is started to scan the original data and analyze its characteristics, calculating the number of distinct values that the tuples of the original data set take on each column set in RS. The specific steps are as follows: (1) in the Map stage, the Map function parses each tuple r of the original data set and forms key-value pairs, using the name of each column set in RS as the key and the tuple's grouping attribute values on that column set as the value; (2) the Combine function of the Map stage merges the key-value pairs that belong to the same column set into a new key-value pair for output; (3) all key-value pairs that belong to the same column set are sent to the same Reduce function, which merges their values and counts the distinct attribute values on that column set, thereby producing the number of distinct values of the original data set on each column set in RS.
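
The Map, Combine and Reduce roles described above can be simulated in plain Python. The sketch below is an in-memory illustration only and does not use the Hadoop API; the function names and the representation of tuples as dictionaries and column sets as tuples of column names are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(rows, rs):
    """Map: emit (column set, the row's values on that column set) pairs."""
    for row in rows:
        for cols in rs:  # cols is a tuple of column names, e.g. ("product",)
            yield cols, tuple(row[c] for c in cols)

def combine(pairs):
    """Combine: merge the pairs of one input split into one value set per key."""
    merged = defaultdict(set)
    for key, value in pairs:
        merged[key].add(value)
    return merged

def reduce_phase(partials):
    """Reduce: union the per-split value sets and count distinct values per key."""
    total = defaultdict(set)
    for partial in partials:
        for key, values in partial.items():
            total[key] |= values
    return {key: len(values) for key, values in total.items()}

def distinct_counts(splits, rs):
    """Run the simulated job over a list of input splits (lists of row dicts)."""
    return reduce_phase(combine(map_phase(split, rs)) for split in splits)
```

The combiner matters in a real job because it shrinks each mapper's output to at most one value set per column set before the shuffle, which is the reason the patent text introduces it.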

In step 5), the coverage index calculation module calculates, for the case in which the stratified sample is created with any column set CS_i in CS as the stratification column set, the coverage index CI_i,j of each candidate stratification column set CS_j in CS, as follows: if CS_j = CS_i, then CI_i,j = 1; if CS_j ⊂ CS_i, then CI_i,j = 1/v_i,j, where v_i,j is the number of distinct values of the original data set on the column set CS_i - CS_j; otherwise, CI_i,j = 0.
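
The three cases of the coverage index can be written as a small Python function. This is a sketch under stated assumptions: the middle case is taken to be CS_j ⊂ CS_i (a proper subset), column sets are represented as `frozenset` objects, and `distinct_count` is a hypothetical dictionary holding the distinct-value counts produced in step 4).

```python
def coverage_index(cs_i, cs_j, distinct_count):
    """Coverage index CI_i,j of column set cs_j when the sample is stratified on cs_i.

    cs_i, cs_j: frozensets of column names.
    distinct_count: maps a frozenset of columns to its number of distinct values
    in the original data set (hypothetical output of step 4).
    """
    if cs_j == cs_i:
        return 1.0
    if cs_j < cs_i:  # assumed condition: cs_j is a proper subset of cs_i
        return 1.0 / distinct_count[cs_i - cs_j]
    return 0.0
```

For example, if the sample is stratified on {region, product} and `product` has 4 distinct values, a query grouping by {region} gets a coverage index of 1/4, while a sample stratified only on {region} gives a query grouping by {region, product} a coverage index of 0.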

In step 6), the stratification column sets are determined as follows: (1) for any candidate stratification column set CS_i, calculate the total coverage index f_i obtained when the stratified sample is created on CS_i, using the formula:

$$f_i = \sum_{j=1}^{M} P_j \cdot CI_{i,j}$$

where P_j denotes the probability that CS_j occurs in the workload, calculated as

$$P_j = \frac{N_j}{\sum_{k=1}^{M} N_k}$$

N_j is the number of occurrences of CS_j in the workload, and CI_i,j is the coverage index of the column set CS_j when the stratified sample is created on CS_i; (2) sort the total coverage indexes of all candidate stratification column sets in descending order and select the first X candidate stratification column sets, those with the largest total coverage indexes, as the column sets finally used to create the stratified samples, where X is determined by the amount of space the system uses to store samples.
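
A minimal Python sketch of this ranking step follows. It assumes that the total coverage index is the workload-probability-weighted sum of the coverage indexes, with P_j equal to N_j divided by the total number of occurrences; the formula images are not present in this text, so that reading is an assumption. The function names and the dictionary inputs are hypothetical.

```python
def coverage_index(cs_i, cs_j, distinct_count):
    """CI_i,j: 1 if equal, 1/v if cs_j is a proper subset of cs_i, else 0."""
    if cs_j == cs_i:
        return 1.0
    if cs_j < cs_i:
        return 1.0 / distinct_count[cs_i - cs_j]
    return 0.0

def total_coverage(cs_i, counts, distinct_count):
    """Assumed form: f_i = sum over j of P_j * CI_i,j, with P_j = N_j / sum(N)."""
    total = sum(counts.values())
    return sum(
        (n_j / total) * coverage_index(cs_i, cs_j, distinct_count)
        for cs_j, n_j in counts.items()
    )

def select_column_sets(counts, distinct_count, x):
    """Return the x candidate column sets with the largest total coverage index."""
    ranked = sorted(
        counts,
        key=lambda cs: total_coverage(cs, counts, distinct_count),
        reverse=True,
    )
    return ranked[:x]
```

With {region} occurring twice, {region, product} occurring once, and 4 distinct products, this sketch gives f = 2/3 for {region} and f = 2/3 × 1/4 + 1/3 = 1/2 for {region, product}, so a storage budget of one sample (X = 1) would keep the sample stratified on {region}.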

In step 7), a MapReduce job is started to create the stratified samples. The specific steps are as follows: (1) in the Map stage, the Map function scans the original data set, parses each tuple r and generates key-value pairs, where the key is a structure consisting of a column set name and the tuple's values on that column set, the column set names coming from the output of step 6), and the value is the whole tuple; (2) key-value pairs that belong to the same column set and have the same values on that column set are sent to the same Reduce function, which counts the tuples belonging to the same sample layer and writes the tuples to a file, forming the stratified sample file.
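
The keying scheme of this job can be illustrated with an in-memory Python simulation. It is a sketch only: it does not use the Hadoop API, it keeps the layers in a dictionary instead of writing files, and the function names and the dictionary layout of each layer are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(rows, column_sets):
    """Map: key = (column set, the row's values on it); value = the whole row."""
    for row in rows:
        for cols in column_sets:  # cols is a tuple of column names
            yield (cols, tuple(row[c] for c in cols)), row

def shuffle(pairs):
    """Group values by key, as the MapReduce framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: one sample layer per key, with its tuple count and its tuples."""
    return {
        key: {"count": len(rows), "tuples": rows}
        for key, rows in groups.items()
    }

def create_stratified_samples(rows, column_sets):
    """Run the simulated job and return all sample layers keyed by (cols, values)."""
    return reduce_phase(shuffle(map_phase(rows, column_sets)))
```

Because the column set name is part of the key, one pass over the data builds the layers for every selected stratification column set at once, and the per-layer counts recorded here are what the online stage later needs as the layer sizes.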

In step 9), the query statement entered online by the user is parsed and its grouping column set CS_q is extracted. Step 10) then selects the stratified sample data with the minimum sampling cost, as follows: if there is a sample S(CS_s) whose stratification column set satisfies CS_s = CS_q, that sample is selected; otherwise, the sample S(CS_s) is selected for which CS_s is the minimum column set satisfying the condition CS_q ⊂ CS_s.

In step 11), the sample size determination module determines the number of samples drawn from each sample layer according to the total sample size N of the user's approximate query statement. If CS_s = CS_q, the size of the sample drawn from each sample layer is

$$\max\left(\frac{N}{T},\; \frac{N \cdot |G_j|}{|R|}\right)$$

where T is the number of sample layers, |G_j| is the size of each sample layer, and |R| is the size of the original data set. If CS_q ⊂ CS_s, the size of the sample drawn from each sample layer is determined in two steps: (1) the sample layers that have the same value on the CS_q column set are merged into one large sample layer, and the size of the sample drawn from each large sample layer is (formula missing from the source text); (2) within each large sample layer G_i, the size of the sample drawn from each small sample layer G_ij is (formula missing from the source text).
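
The sample selection rule and the per-layer allocation for the case CS_s = CS_q can be sketched in Python as follows. Several points are assumptions rather than statements of the patent: the "minimum" superset is interpreted as the one with the fewest columns, the allocation is taken to be the larger of the average size N/T and the proportional size N·|G_j|/|R| as described in the advantages section, |R| is computed as the sum of the layer sizes, and the function names are hypothetical. The case CS_q ⊂ CS_s is not sketched because its formulas are not available in this text.

```python
def select_sample(cs_q, sample_column_sets):
    """Pick the stratification column set CS_s to use for a query grouping on cs_q."""
    if cs_q in sample_column_sets:
        return cs_q  # exact match: CS_s = CS_q
    # Otherwise take a proper superset of cs_q; "minimum" is assumed to mean
    # the superset with the fewest columns.
    supersets = [cs for cs in sample_column_sets if cs_q < cs]
    if not supersets:
        return None  # no stored sample can serve this query
    return min(supersets, key=len)

def layer_sample_sizes(n_total, layer_sizes):
    """Allocation for CS_s = CS_q: max(N / T, N * |G_j| / |R|) for each layer.

    layer_sizes maps a layer identifier to |G_j|; |R| is assumed to equal the
    sum of all layer sizes, since the layers partition the original data set.
    """
    t = len(layer_sizes)
    r = sum(layer_sizes.values())
    return {
        layer: max(n_total / t, n_total * size / r)
        for layer, size in layer_sizes.items()
    }
```

With N = 100 and layers of 900, 90 and 10 tuples, the large layer receives its proportional share of 90 while the two small layers each receive the average of about 33.3 instead of 9 and 1. A practical implementation would presumably also round the sizes and cap each one at the layer size, which the source text does not discuss.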

The above examples are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and equivalent substitutions without departing from the spirit of the invention, and all such modifications and equivalents fall within the scope of the invention as defined in the claims.

Claims

1. A method for multi-dimensional dynamic sampling for approximating a query in a cloud computing environment, the method comprising the steps of:

5) the coverage index calculation module, using the data distribution result, calculates the total coverage index obtained when stratified sampling is performed on each candidate stratification column set CS_i; for the case in which the stratified sample is created with any column set CS_i in CS as the stratification column set, the coverage index CI_i,j of each candidate stratification column set CS_j in CS is calculated as follows: if CS_j = CS_i, then CI_i,j = 1; if CS_j ⊂ CS_i, then CI_i,j = 1/v_i,j, where v_i,j is the number of distinct values of the original data set on CS_i - CS_j; otherwise, CI_i,j = 0;

6) the stratification column set determination module selects the stratification column sets used to create stratified samples, based on the coverage index and the sample storage space information, comprising the following steps:

(1) for any candidate stratification column set CS_i, calculate the total coverage index f_i obtained when the stratified sample is created on CS_i,

$$f_i = \sum_{j=1}^{M} P_j \cdot CI_{i,j}$$

where P_j denotes the probability that CS_j occurs in the workload query statements,

$$P_j = \frac{N_j}{\sum_{k=1}^{M} N_k}$$

N_j is the number of occurrences of CS_j in the workload query statements, and CI_i,j is the coverage index of CS_j when the stratified sample is created on CS_i;

(2) sort the total coverage indexes of all candidate stratification column sets in descending order and select the first X candidate stratification column sets, those with the largest total coverage indexes, as the grouping column sets finally used to create the stratified samples, where X is determined by the amount of space the dynamic sampling system uses to store samples;

9) the query analysis module parses the query statements entered online by the user and extracts the grouping column set CS_q of each user query statement;

2. The method as claimed in claim 1, wherein in step 3), the workload column set analysis module parses the workload query statements through the following steps:

(1) parse all SQL query statements among the workload query statements and extract the corresponding grouping column sets;

(2) count the occurrences of each grouping column set and generate the set of candidate stratification column sets CS = {CS₁, CS₂, ..., CS_M};

(3) analyze the relationship between any two candidate stratification column sets CS_i and CS_j in CS: if CS_i ⊂ CS_j, store CS_j - CS_i in the set RS, and output the result to the data characteristic analysis module.

3. The method as claimed in claim 1, wherein in step 4), a MapReduce job is started to scan the original data and analyze the data characteristics, comprising the following steps:

(1) in the Map stage, the Map function parses each tuple r of the original data set and forms key-value pairs, using the name of each column set in RS as the key and the tuple's grouping attribute values on that column set as the value;

(2) the Combine function of the Map stage merges the key-value pairs that belong to the same column set into a new key-value pair for output;

(3) all key-value pairs that belong to the same column set are sent to the same Reduce function, which merges their values and counts the distinct attribute values on that column set, thereby producing the number of distinct values of the original data set on each column set in RS.

4. The method as claimed in claim 1, wherein in step 7), a MapReduce job is started to create the stratified samples, comprising the following steps:

(1) in the Map stage, the Map function scans the original data set, parses each tuple r and generates key-value pairs, where the key is a structure consisting of a column set name and the tuple's values on that column set, the column set names coming from the output of step 6), and the value is the whole tuple;

(2) key-value pairs that belong to the same column set and have the same values on that grouping column set are sent to the same Reduce function, which counts the tuples belonging to the same sample layer and writes the tuples to a file, forming the stratified sample file.

5. The method as claimed in claim 1, wherein in step 9), the query statement entered online by the user is parsed and its grouping column set CS_q is extracted; step 10) then selects the stratified sample data with the minimum sampling cost, as follows: if there is a sample S(CS_s) whose stratification column set satisfies CS_s = CS_q, that sample is selected; otherwise, the sample S(CS_s) is selected for which CS_s is the minimum column set satisfying the condition CS_q ⊂ CS_s; in step 11), the sample size determination module determines the number of samples drawn from each sample layer according to the total sample size N of the user's approximate query statement; if CS_s = CS_q, the size of the sample drawn from each sample layer is