CN108256028A

CN108256028A - Multi-dimensional dynamic sampling method for approximate query in a cloud computing environment

Info

Publication number: CN108256028A
Application number: CN201810025016.6A
Authority: CN
Inventors: 史英杰; 刘怡; 郭飞; 刘昊
Original assignee: Beijing Institute of Fashion Technology
Current assignee: Beijing Institute of Fashion Technology
Priority date: 2018-01-11
Filing date: 2018-01-11
Publication date: 2018-07-06
Anticipated expiration: 2038-01-11
Also published as: CN108256028B

Abstract

A multi-dimensional dynamic sampling method for approximate query in a cloud computing environment. The dynamic sampling system comprises an offline processing stage for creating stratified samples and an online processing stage for dynamically selecting samples. In the offline processing stage, the workload column set analysis module parses the workload query statements; the data feature analysis module analyzes the data characteristics; the coverage index calculation module calculates the total coverage index; the stratified column set determination module selects the stratified column sets used to create stratified samples; and the stratified sample data creation module creates the stratified samples. In the online processing stage, the query parsing module parses the user query statement; the sample selection module selects the stratified sample data with the smallest sampling cost; and the sample size determination module determines the sample size to be drawn from each sample layer. The invention effectively solves the problem of inaccurate estimates for small groups caused by data skew in approximate query, and reduces sampling cost under limited sample storage space.

Description

A Multidimensional Dynamic Sampling Method for Approximate Query in Cloud Computing Environment

Technical Field

The invention relates to a data sampling method for approximate query, and in particular to a dynamic sampling method for multi-query workloads in a cloud computing environment.

Background Art

Cloud computing environments provide a highly scalable and cost-effective way to manage big data and have become the mainstream platform for doing so. However, even in a cloud computing environment, queries over big data cannot meet the speed requirements of real-time processing and user interaction. For ad hoc queries and exploratory data analysis, obtaining an estimated result quickly is more useful than spending large amounts of time and computing resources to obtain a fully exact result. Approximate query processing estimates query results from sample data, which greatly reduces query execution time and is of great significance for big data analysis.

Approximate query processing based on sample data was proposed by Acharya et al. It uses uniform random sampling, in which every tuple is drawn with equal probability. Uniform random sampling suits uniformly distributed data and is simple to apply. However, when data skew produces small groups in a group-by aggregation query, uniform random sampling severely degrades the accuracy of the estimate, so that the estimate loses its value. Surajit et al. proposed a weighted sampling method that counts how many query predicates each tuple satisfies and uses that count as the tuple's sampling weight: the more query predicates a tuple satisfies, the higher its probability of being sampled. Weighted sampling alleviates, to some extent, the inaccuracy that data skew causes under uniform random sampling, but its effect depends entirely on the workload from which the sampling weights were computed; when a query differs from that workload, the weights are meaningless. Swarup et al. proposed congressional sampling, which creates one general-purpose sample for all possible grouping columns and queries. However, the effectiveness of this sample decreases as the number of queries grows, and its preprocessing time grows exponentially with the number of columns, so it cannot handle scenarios with many query statements. In general, the above techniques assume a small, fixed set of query types and scale poorly in practice. In addition, they were all proposed for relational databases and cannot be applied to cloud computing environments.

Summary of the Invention

In view of the problems in the prior art, the object of the present invention is to provide a multi-dimensional dynamic sampling method for approximate query in a cloud computing environment. The method is used in the data preprocessing stage of approximate query processing: it preprocesses the original data set to generate multiple stratified sample data sets and, when a query statement arrives, dynamically selects a sample data set according to the content of the query statement and its requested sample size, and provides the sample size to be drawn from each sample layer. The method effectively solves the problem of inaccurate estimates for small groups caused by data skew in approximate query, and reduces sampling cost under limited sample storage space.

To achieve the above object, the technical solution of the present invention is as follows:

A multi-dimensional dynamic sampling method for approximate query in a cloud computing environment, the method comprising the following steps:

1) The dynamic sampling system comprises an offline processing stage for creating stratified samples and an online processing stage for dynamically selecting samples;

2) In the offline processing stage, a workload column set analysis module, a data feature analysis module, a coverage index calculation module, a stratified column set determination module, and a stratified sample data creation module are provided;

3) The workload column set analysis module parses the workload query statements, extracts the grouping column set of each workload query statement, counts the occurrences of each grouping column set, generates the set CS of candidate stratified column sets, analyzes the relationships among the candidate stratified column sets CS_i in CS, and outputs the result to the data feature analysis module;

4) The data feature analysis module starts a MapReduce job to scan the original data set and outputs the data distribution result of the original data set to the coverage index calculation module;

5) The coverage index calculation module uses the data distribution result to calculate the total coverage index obtained when stratified sampling is based on each candidate stratified column set CS_i;

6) The stratified column set determination module selects the stratified column sets used to create stratified samples, based on information such as the coverage index and the sample storage space;

7) The stratified sample data creation module starts a MapReduce job to create the stratified samples: the Map function scans the original data set and sends each tuple to the corresponding Reduce function according to the tuple's values on each stratified column set used to create stratified samples, and the Reduce function updates the statistics and outputs the tuple data to the stratified sample data set;

8) In the online processing stage, a query parsing module, a sample selection module, and a sample size determination module are provided;

9) The query parsing module parses the query statements entered online by the user and extracts the grouping column set CS_q of each user query statement;

10) The sample selection module selects, according to the grouping column set CS_q of the user query statement, the stratified sample data with the smallest sampling cost from the stratified sample data sets;

11) The sample size determination module determines, from the sample size of the approximate query statement, the sample size to be drawn from each sample layer.

The present invention has the following advantages:

1. The invention determines the stratified column sets used to create stratified samples by analyzing workload characteristics and data distribution characteristics, and creates multiple multi-dimensional stratified samples based on the stratified column sets and the sample storage space, thereby solving the problem of inaccurate estimates caused by data skew in approximate query;

2. When determining the column sets used to create stratified samples, the invention uses the coverage index to express the cost at which different query statements perform stratified sampling on a given stratified sample; the invention thereby lays a foundation for extending the query workload;

3. Given a stratified sample and a total sample size, the invention determines the sample size to be drawn from each sample layer according to the relationship between the query grouping column set CS_q and the sample stratified column set CS_s: (1) when CS_s = CS_q, the larger of an equal share of the sample size across all sample layers and a share proportional to the size of the sample layer is taken as the sample size of that layer, which avoids sample sizes that are too small for either small groups or large groups; (2) when CS_q ⊂ CS_s, the invention first merges the sample layers that have the same values on the CS_q column set into one large sample layer and determines its sample size, and then performs stratified sampling within each large sample layer, so that the sample size of each sample layer is determined dynamically.

Description of the Drawings

Figure 1 is a framework diagram of multi-dimensional dynamic sampling for approximate query in a cloud computing environment.

Detailed Description

The present invention is further described below with reference to an embodiment and the accompanying drawing.

A multi-dimensional dynamic sampling method for approximate query in a cloud computing environment comprises the following steps:

1) The dynamic sampling system comprises an offline processing stage for creating stratified samples and an online processing stage for dynamically selecting samples;

2) In the offline processing stage, a workload column set analysis module, a data feature analysis module, a coverage index calculation module, a stratified column set determination module, and a stratified sample data creation module are provided;

3) The workload column set analysis module parses the workload query statements, extracts the grouping column set of each query statement, counts the occurrences of each column set, analyzes the relationships among the column sets, and outputs the result to the data feature analysis module;

4) The data feature analysis module starts a MapReduce job to scan the original data set and outputs the data distribution result to the coverage index calculation module;

5) The coverage index calculation module uses the data distribution information to calculate the total coverage index obtained when stratified sampling is based on each candidate stratified column set;

6) The stratified column set determination module selects the stratified column sets used to create stratified samples, based on information such as the coverage index and the sample storage space;

7) The stratified sample creation module starts a MapReduce job: the Map function scans the original data set and sends each tuple to the corresponding Reduce function according to the tuple's values on each stratified column set, and the Reduce function updates the statistics and outputs the tuple data to the stratified sample data set;

8) In the online processing stage, a query parsing module, a sample selection module, and a sample size determination module are provided;

9) The query parsing module parses the query statement entered by the user and extracts its grouping column set;

10) The sample selection module selects the stratified sample data with the smallest sampling cost according to the grouping column set of the query statement;

11) The sample size determination module determines, from the sample size of the approximate query statement, the sample size to be drawn from each sample layer.

In step 3), the workload column set analysis module proceeds as follows: (1) parse all SQL query statements in the workload and extract the corresponding grouping column sets; (2) count the occurrences of each grouping column set and generate the set of candidate stratified column sets CS = {CS_1, CS_2, ..., CS_M}; (3) analyze the relationship between any two candidate stratified column sets CS_i and CS_j in CS: if CS_i ⊂ CS_j, store CS_j - CS_i in the set RS, and output the result to the data feature analysis module.
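The three sub-steps of step 3) can be sketched in Python as below. This is a minimal illustration, not the patented implementation: the function names `grouping_column_set` and `analyze_workload` are hypothetical, the regular expression only handles simple single-level `GROUP BY` clauses, and the condition for adding CS_j - CS_i to RS is assumed to be that CS_i is a proper subset of CS_j.

```python
import re
from collections import Counter
from itertools import permutations

# Matches the column list of a simple GROUP BY clause (assumption: no subqueries).
_GROUP_BY = re.compile(
    r"group\s+by\s+(.+?)(?:\s+having\b|\s+order\s+by\b|\s+limit\b|;|$)",
    flags=re.IGNORECASE | re.DOTALL,
)

def grouping_column_set(sql):
    """Sub-step (1): return the GROUP BY columns of one statement as a frozenset."""
    match = _GROUP_BY.search(sql)
    if not match:
        return frozenset()
    return frozenset(col.strip().lower() for col in match.group(1).split(","))

def analyze_workload(queries):
    """Sub-steps (2) and (3): count each grouping column set and build RS."""
    counts = Counter(grouping_column_set(q) for q in queries)
    counts.pop(frozenset(), None)  # ignore statements without GROUP BY
    # RS holds CS_j - CS_i for every ordered pair with CS_i a proper subset of CS_j.
    rs = {cs_j - cs_i for cs_i, cs_j in permutations(counts, 2) if cs_i < cs_j}
    return counts, rs
```

Here `counts` plays the role of the candidate set CS together with the occurrence counts N_j, and `rs` is the set of column-set differences whose distinct values are counted in the next step.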

In step 4), a MapReduce job is started to scan the original data and analyze the data characteristics, computing the number of distinct values that the original data set takes on each column set in RS. The specific steps are: (1) the Map function in the Map stage analyzes each tuple r of the original data set and forms key-value pairs, where the key is the name of a column set in RS and the value is the tuple's grouping attribute value on that column set; (2) the Combine function in the Map stage merges the key-value pairs belonging to the same column set into a new key-value pair and outputs it; (3) all key-value pairs belonging to the same column set are sent to the same Reduce function, which merges their values and counts the distinct attribute values on that column set, thereby producing the number of distinct values of the original data set on each column set in RS.
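The Map, Combine, and Reduce roles of step 4) can be imitated in plain Python to make the data flow concrete. This is a single-process sketch under stated assumptions, not a Hadoop job: rows are dictionaries, each input split stands in for one map task, and all function names are hypothetical.

```python
from collections import defaultdict

def map_phase(row, rs):
    """Emit (column-set name, value on that column set) for each column set in RS."""
    for cols in rs:
        name = tuple(sorted(cols))  # the column-set name used as the key
        yield name, tuple(row[c] for c in name)

def combine(pairs):
    """Merge the pairs of one map task into one set of values per column set."""
    merged = defaultdict(set)
    for key, value in pairs:
        merged[key].add(value)
    return list(merged.items())

def reduce_phase(combined_outputs):
    """Union the per-task value sets and count distinct values per column set."""
    distinct = defaultdict(set)
    for output in combined_outputs:
        for key, values in output:
            distinct[key] |= values
    return {key: len(values) for key, values in distinct.items()}

def distinct_value_counts(splits, rs):
    """Run map and combine per split, then a single reduce over all outputs."""
    combined = [
        combine(pair for row in split for pair in map_phase(row, rs))
        for split in splits
    ]
    return reduce_phase(combined)
```

The combiner matters because it shrinks each map task's output to at most one record per column set before the shuffle, which is the purpose the text gives for the Combine function.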

In step 5), the coverage index calculation module computes, for the case where any column set CS_i in CS is used as the stratified column set to create a stratified sample, the coverage index CI_i,j of each candidate stratified column set CS_j in CS, as follows: if CS_j = CS_i, then CI_i,j = 1; if CS_i ⊂ CS_j, then CI_i,j = 1/v_i,j, where v_i,j is the number of distinct values of the original data set on the column set CS_j - CS_i; otherwise, CI_i,j = 0.
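The three cases of the coverage index translate directly into a small function. The sketch below assumes that the middle case applies when CS_i is a proper subset of CS_j, which is how the set difference CS_j - CS_i is read here; `coverage_index` and the `distinct_counts` mapping (column-set difference to its number of distinct values v_i,j) are hypothetical names.

```python
def coverage_index(cs_i, cs_j, distinct_counts):
    """Coverage index CI_i,j of CS_j when the sample is stratified on CS_i.

    cs_i, cs_j: frozensets of column names.
    distinct_counts: maps a frozenset of columns to its number of distinct values.
    """
    if cs_j == cs_i:
        return 1.0
    if cs_i < cs_j:  # assumed condition: CS_i is a proper subset of CS_j
        return 1.0 / distinct_counts[cs_j - cs_i]
    return 0.0
```

For example, with two distinct values on `gender`, a sample stratified on `{city}` gives the column set `{city, gender}` a coverage index of 1/2.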

In step 6), the stratified column sets are determined as follows: (1) for any candidate stratified column set CS_i, calculate the total coverage index f_i obtained when the stratified sample is created based on CS_i:

f_i = Σ_{j=1}^{M} P_j × CI_i,j

where P_j is the probability that CS_j appears in the workload, P_j = N_j / Σ_{k=1}^{M} N_k; N_j is the number of occurrences of CS_j in the workload; and CI_i,j is the coverage index of the column set CS_j when the stratified sample is created based on CS_i; (2) sort the total coverage indices of all candidate stratified column sets in descending order and select the X candidate stratified column sets with the largest total coverage index as the column sets finally used to create stratified samples, where X is determined by the amount of space the system has for storing samples.
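The selection in step 6) can be sketched as follows, assuming the total coverage index is the workload-probability-weighted sum f_i = Σ_j P_j × CI_i,j with P_j = N_j / Σ_k N_k, and assuming the proper-subset reading of the coverage index. The names `total_coverage_indices` and `select_top` are hypothetical, and X is passed in directly rather than derived from a storage budget.

```python
def coverage_index(cs_i, cs_j, distinct_counts):
    """CI_i,j under the assumed proper-subset condition."""
    if cs_j == cs_i:
        return 1.0
    if cs_i < cs_j:
        return 1.0 / distinct_counts[cs_j - cs_i]
    return 0.0

def total_coverage_indices(counts, distinct_counts):
    """Return {CS_i: f_i}, where counts maps each candidate CS_j to N_j."""
    total = sum(counts.values())
    return {
        cs_i: sum(
            (n_j / total) * coverage_index(cs_i, cs_j, distinct_counts)
            for cs_j, n_j in counts.items()
        )
        for cs_i in counts
    }

def select_top(f, x):
    """Sort candidates by total coverage index, descending, and keep the first x."""
    return sorted(f, key=f.get, reverse=True)[:x]
```

With `{city}` seen twice, `{city, gender}` seen once, and two distinct `gender` values, the candidate `{city}` scores 2/3 × 1 + 1/3 × 1/2 = 5/6 and `{city, gender}` scores 1/3, so `{city}` is selected first.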

In step 7), a MapReduce job is started to create the stratified samples. The specific steps are: (1) the Map function in the Map stage scans the original data set, analyzes each tuple r, and generates key-value pairs, where the key is a structure consisting of a column set name and the tuple's values on that column set (the column set names come from the output of step 6)), and the value is the entire tuple; (2) key-value pairs that belong to the same column set and have the same values on that column set are sent to the same Reduce function, which counts the tuples belonging to the same sample layer and outputs the tuples to a file, forming the stratified sample file.
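The key structure used in step 7) (column set name plus the tuple's values on that column set) can be illustrated with an in-memory stand-in for the MapReduce job. This is only a sketch: rows are dictionaries, the shuffle is a dictionary of lists, and the "file" written by each Reduce call is replaced by a list held in memory; all names are hypothetical.

```python
from collections import defaultdict

def map_phase(row, stratified_column_sets):
    """Emit ((column-set name, values on it), whole row) for each selected column set."""
    for cols in stratified_column_sets:
        name = tuple(sorted(cols))
        yield (name, tuple(row[c] for c in name)), row

def build_stratified_samples(rows, stratified_column_sets):
    """Group rows by key (the shuffle), then 'reduce' each group into one sample layer."""
    shuffled = defaultdict(list)
    for row in rows:
        for key, value in map_phase(row, stratified_column_sets):
            shuffled[key].append(value)
    # Reduce: record the tuple count of the layer and keep its tuples.
    return {
        key: {"count": len(values), "rows": values}
        for key, values in shuffled.items()
    }
```

Each key identifies one sample layer, so the per-layer counts recorded here correspond to the layer sizes |G_j| that the online stage uses when allocating sample sizes.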

In step 9), the query statement entered online by the user is parsed and its grouping column set CS_q is extracted. Step 10) then selects the stratified sample data with the smallest sampling cost as follows: if there is a sample S(CS_s) whose stratified column set satisfies CS_s = CS_q, that sample is selected; otherwise, the sample S(CS_s) is selected for which CS_s is the smallest column set satisfying CS_q ⊂ CS_s. Given the total sample size N of the user's approximate query statement, the sample size determination module in step 11) determines the number of samples to be drawn from each sample layer. If CS_s = CS_q, the sample size drawn from each sample layer is

n_j = max(N / T, N × |G_j| / |R|)

where T is the number of sample layers, |G_j| is the size of each sample layer, and |R| is the size of the original data set. If CS_q ⊂ CS_s, the sample size drawn from each sample layer is determined as follows: (1) the sample layers that have the same values on the CS_q column set are merged and treated as one large sample layer, and the sample size drawn from each large sample layer is [formula not reproduced in the source text];

(2) within each large sample layer G_i, the sample size drawn from each small sample layer G_ij is [formula not reproduced in the source text].
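The online stage can be sketched as three small functions. The sketch assumes that a sample qualifies when its stratified column set equals CS_q or is a proper superset of it, and that in the CS_s = CS_q case each layer receives the larger of an equal share N / T and a proportional share N × |G_j| / |R|. For the superset case only the merging of layers is shown; the allocation formulas for large and small layers are not available in the text and are deliberately left out. All function names are hypothetical.

```python
from collections import defaultdict

def select_sample(cs_q, sample_column_sets):
    """Pick the sample stratified on CS_q, else on the smallest proper superset."""
    if cs_q in sample_column_sets:
        return cs_q
    supersets = [cs for cs in sample_column_sets if cs_q < cs]
    return min(supersets, key=len) if supersets else None

def layer_sample_sizes(layer_sizes, total_rows, n):
    """CS_s = CS_q case: larger of equal share and size-proportional share."""
    t = len(layer_sizes)
    return {
        layer: max(n / t, n * size / total_rows)
        for layer, size in layer_sizes.items()
    }

def merge_layers(layer_sizes, cs_s, cs_q):
    """Superset case: group small layers by their values on the CS_q columns.

    layer_sizes: maps a tuple of values (ordered as in cs_s) to the layer size.
    cs_s, cs_q: tuples of column names.
    """
    positions = [cs_s.index(col) for col in cs_q]
    merged = defaultdict(dict)
    for key, size in layer_sizes.items():
        merged[tuple(key[p] for p in positions)][key] = size
    return dict(merged)
```

Under this rule the per-layer sizes can sum to more than N (for layers of 900 and 100 rows with N = 100 the result is 90 and 50); the sketch returns unrounded values and leaves rounding or rescaling to the caller.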

The above embodiment is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and equivalent substitutions without departing from the principle of the present invention, and technical solutions obtained by applying such improvements and equivalent substitutions to the claims of the present invention all fall within the scope of protection of the present invention.

Claims

1. A multi-dimensional dynamic sampling method for approximate query in a cloud computing environment, characterized in that the method comprises the following steps:

1) The dynamic sampling system comprises an offline processing stage for creating stratified samples and an online processing stage for dynamically selecting samples;

2) In the offline processing stage, a workload column set analysis module, a data feature analysis module, a coverage index calculation module, a stratified column set determination module, and a stratified sample data creation module are provided;

3) The workload column set analysis module parses the workload query statements, extracts the grouping column set of each workload query statement, counts the occurrences of each grouping column set, generates the set CS of candidate stratified column sets, analyzes the relationships among the candidate stratified column sets CS_i in CS, and outputs the result to the data feature analysis module;

4) The data feature analysis module starts a MapReduce job to scan the original data set and outputs the data distribution result of the original data set to the coverage index calculation module;

5) The coverage index calculation module uses the data distribution result to calculate the total coverage index obtained when stratified sampling is based on each candidate stratified column set CS_i;

6) The stratified column set determination module selects the stratified column sets used to create stratified samples, based on information such as the coverage index and the sample storage space;

7) The stratified sample data creation module starts a MapReduce job to create the stratified samples: the Map function scans the original data set and sends each tuple to the corresponding Reduce function according to the tuple's values on each stratified column set used to create stratified samples, and the Reduce function updates the statistics and outputs the tuple data to the stratified sample data set;

8) In the online processing stage, a query parsing module, a sample selection module, and a sample size determination module are provided;

9) The query parsing module parses the query statements entered online by the user and extracts the grouping column set CS_q of each user query statement;

10) The sample selection module selects, according to the grouping column set CS_q of the user query statement, the stratified sample data with the smallest sampling cost from the stratified sample data sets;

11) The sample size determination module determines, from the sample size of the approximate query statement, the sample size to be drawn from each sample layer.

2. The method according to claim 1, characterized in that, in step 3), the workload column set analysis module analyzes the workload query statements by the following steps:

(1) Parse all SQL query statements in the workload and extract the corresponding grouping column sets;

(2) Count the occurrences of each grouping column set and generate the set of candidate stratified column sets CS = {CS_1, CS_2, ..., CS_M};

(3) Analyze the relationship between any two candidate stratified column sets CS_i and CS_j in CS: if CS_i ⊂ CS_j, store CS_j - CS_i in the set RS, and output the result to the data feature analysis module.

3. The method according to claim 1, characterized in that, in step 4), a MapReduce job is started to scan the original data and analyze the data characteristics by the following steps:

(1) The Map function in the Map stage analyzes each tuple r of the original data set and forms key-value pairs, where the key is the name of a column set in RS and the value is the tuple's grouping attribute value on that column set;

(2) The Combine function in the Map stage merges the key-value pairs belonging to the same column set into a new key-value pair and outputs it;

(3) All key-value pairs belonging to the same column set are sent to the same Reduce function, which merges the values of the key-value pairs and counts the distinct attribute values on that column set, thereby producing the number of distinct values of the original data set on each column set in RS.

4. The method according to claim 1, characterized in that, in step 5), the coverage index calculation module computes, for the case where any column set CS_i in CS is used to create a stratified sample, the coverage index CI_i,j of each candidate stratified column set CS_j in CS, as follows: if CS_j = CS_i, then CI_i,j = 1; if CS_i ⊂ CS_j, then CI_i,j = 1/v_i,j, where v_i,j is the number of distinct values of the original data set on CS_j - CS_i; otherwise, CI_i,j = 0.

5. The method according to claim 1, characterized in that, in step 6), the stratified column set determination module selects the stratified column sets used to create stratified samples, based on information such as the coverage index and the sample storage space, by the following steps:

(1) For any candidate stratified column set CS_i, calculate the total coverage index f_i obtained when the stratified sample is created based on CS_i, f_i = Σ_{j=1}^{M} P_j × CI_i,j, where P_j is the probability that CS_j appears in the workload query statements, P_j = N_j / Σ_{k=1}^{M} N_k; N_j is the number of occurrences of CS_j in the workload query statements; and CI_i,j is the coverage index of CS_j when the stratified sample is created based on CS_i;

(2) Sort the total coverage indices of all candidate stratified column sets in descending order and select the X candidate stratified column sets with the largest total coverage index as the column sets finally used to create stratified samples, where X is determined by the amount of space the dynamic sampling system has for storing samples.

6. The method according to claim 1, characterized in that, in step 7), a MapReduce job is started to create the stratified samples by the following steps:

(1) The Map function in the Map stage scans the original data set, analyzes each tuple r, and generates key-value pairs, where the key is a structure consisting of a column set name and the tuple's values on that column set, the column set names come from the output of step 6), and the value is the entire tuple;

(2) Key-value pairs that belong to the same column set and have the same values on that column set are sent to the same Reduce function, which counts the tuples belonging to the same sample layer and outputs the tuples to a file, forming the stratified sample file.

7. The method according to claim 1, characterized in that, in step 9), the query statement entered online by the user is parsed and the grouping column set CS_q is extracted, and step 10) then selects the stratified sample data with the smallest sampling cost as follows: if there is a sample S(CS_s) whose stratified column set satisfies CS_s = CS_q, that sample is selected; otherwise, the sample S(CS_s) is selected for which CS_s is the smallest column set satisfying CS_q ⊂ CS_s; given the total sample size N of the user's approximate query statement, the sample size determination module in step 11) determines the number of samples to be drawn from each sample layer; if CS_s = CS_q, the sample size drawn from each sample layer is

n_j = max(N / T, N × |G_j| / |R|)

where T is the number of groups, |G_j| is the size of each sample layer, and |R| is the size of the original data set; if CS_q ⊂ CS_s, the sample size drawn from each sample layer is determined as follows: (1) the sample layers that have the same values on the CS_q column set are merged and treated as one large sample layer, and the sample size drawn from each large sample layer is [formula not reproduced in the source text];

(2) within each large sample layer G_i, the sample size drawn from each small sample layer G_ij is [formula not reproduced in the source text].