CN108256028A - Multi-dimensional dynamic sampling method for approximate query in cloud computing environment - Google Patents
- Publication number
- CN108256028A (application number CN201810025016.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
Abstract
A multi-dimensional dynamic sampling method for approximate query in a cloud computing environment, comprising the following steps: the dynamic sampling system includes an offline processing stage for creating stratified samples and an online processing stage for dynamically selecting samples; in the offline processing stage, the workload column set parsing module parses the workload query statements; the data feature analysis module analyzes the data features; the coverage index calculation module calculates the total coverage index; the stratified column set determination module selects the stratified column sets used to create stratified samples; the stratified sample data creation module creates the stratified samples; in the online processing stage, the query parsing module parses the user query statement; the sample selection module selects the stratified sample data with the lowest sampling cost; the sample size determination module determines the sample size drawn from each sample stratum. The invention effectively solves the problem of inaccurate estimates for small groups caused by data skew in approximate query, and reduces the sampling cost under a limited sample storage space.
Description
Technical field
The invention relates to a data sampling method for approximate query, and in particular to a dynamic sampling method for multi-query workloads in a cloud computing environment.
Background
The cloud computing environment provides a highly scalable and cost-effective way to manage big data and has become the mainstream platform for doing so. Even in a cloud computing environment, however, queries over big data cannot meet the speed required for real-time processing and user interaction. For ad-hoc queries and exploratory data analysis, obtaining an estimated result quickly is more useful than spending large amounts of time and computing resources on a fully exact result. Approximate query processing estimates query results from sample data, which greatly reduces query execution time and is therefore of great significance for big data analysis.
Acharya et al. proposed approximate query processing based on sample data using uniform random sampling, in which every tuple is drawn with equal probability. Uniform random sampling suits uniformly distributed data and is simple to apply, but when data skew produces small groups in group-by aggregate queries, it severely degrades the accuracy of the estimates, to the point that they lose their value. Surajit et al. proposed a weighted sampling method that counts the query predicates each tuple satisfies and uses that count as the tuple's sampling weight: the more predicates a tuple satisfies, the more likely it is to be sampled. Weighted sampling alleviates, to some extent, the inaccuracy that data skew causes under uniform random sampling, but its effect depends entirely on the workload from which the weights were computed; when the actual query statements differ from that workload, the weights are meaningless. Swarup et al. proposed congressional sampling, which creates one general-purpose sample for all possible grouping columns and queries. However, the effectiveness of that sample declines as the number of queries grows, and its preprocessing time grows exponentially with the number of columns, so it cannot handle scenarios with many query statements. In general, these techniques assume a small and fixed set of query types and scale poorly in practice. Moreover, they were all proposed for relational databases and cannot be applied to cloud computing environments.
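The small-group problem described above can be made concrete with a short calculation. The sketch below assumes Bernoulli sampling, where each tuple is kept independently with the same probability; this is one common reading of uniform random sampling and is an assumption here, as are the function name and the example numbers.

```python
def miss_probability(group_size: int, sampling_rate: float) -> float:
    """Probability that a group contributes no tuple at all to the sample.

    Assumes every tuple is kept independently with probability
    `sampling_rate` (Bernoulli sampling), so all `group_size` tuples
    must be skipped for the group to vanish from the sample.
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in [0, 1]")
    return (1.0 - sampling_rate) ** group_size


if __name__ == "__main__":
    # Hypothetical group sizes under a 1% sample.
    for size in (10, 1_000, 100_000):
        print(size, miss_probability(size, 0.01))
```

Under these assumptions a group of 10 tuples is absent from a 1% sample about nine times out of ten, while a group of 1,000 tuples is almost never absent, which is why a per-group (stratified) design is attractive for skewed data.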
Summary of the invention
In view of the problems in the prior art, the object of the invention is to provide a multi-dimensional dynamic sampling method for approximate query in a cloud computing environment. The method is used in the data preprocessing stage of approximate query processing: it preprocesses the original data set and generates multiple stratified sample data sets. When a query statement arrives, a sample data set is selected dynamically according to the content of the query statement and its requested sample size, and the sample size to draw from each sample stratum is provided. The method effectively solves the problem of inaccurate estimates for small groups caused by data skew in approximate query, and reduces the sampling cost under a limited sample storage space.
To achieve the above object, the technical solution of the invention is as follows:
A multi-dimensional dynamic sampling method for approximate query in a cloud computing environment, comprising the following steps:
1) The dynamic sampling system includes an offline processing stage for creating stratified samples and an online processing stage for dynamically selecting samples;
2) In the offline processing stage, a workload column set parsing module, a data feature analysis module, a coverage index calculation module, a stratified column set determination module and a stratified sample data creation module are provided;
3) The workload column set parsing module parses the workload query statements, extracts the grouping column set of each workload query statement, counts the occurrences of each grouping column set, generates the set CS of candidate stratified column sets, analyzes the relationships among the candidate stratified column sets CS_i, and outputs the result to the data feature analysis module;
4) The data feature analysis module starts a MapReduce job to scan the original data set and outputs the data distribution of the original data set to the coverage index calculation module;
5) The coverage index calculation module uses the data distribution to calculate the total coverage index for stratified sampling based on each candidate stratified column set CS_i;
6) The stratified column set determination module selects the stratified column sets used to create stratified samples, based on information such as the coverage index and the sample storage space;
7) The stratified sample data creation module starts a MapReduce job to create the stratified samples: the Map function scans the original data set and sends each tuple to the corresponding Reduce function according to the tuple's values on each stratified column set used to create stratified samples; the Reduce function updates the statistics and outputs the tuple data to the stratified sample data set;
8) In the online processing stage, a query parsing module, a sample selection module and a sample size determination module are provided;
9) The query parsing module parses the query statement entered online by the user and extracts the grouping column set CS_q of each user query statement;
10) The sample selection module selects, from the stratified sample data sets, the stratified sample data with the lowest sampling cost according to the grouping column set CS_q of the user query statement;
11) The sample size determination module determines the sample size drawn from each sample stratum according to the sample size of the approximate query statement.
The invention has the following advantages:
1. By analyzing workload features and data distribution features, the invention determines the stratified column sets used to create stratified samples, and creates multiple multi-dimensional stratified samples based on those column sets and the sample storage space, thereby solving the problem of inaccurate estimates caused by data skew in approximate query;
2. In determining the column sets used to create stratified samples, the invention uses the coverage index to express, for a given stratified sample, the cost for different query statements of performing stratified sampling with that sample; the invention thus lays a foundation for extending the query workload;
3. Given a stratified sample and a total sample size, the invention determines the sample size drawn from each sample stratum according to the relationship between the query grouping column set CS_q and the sample stratified column set CS_s: (1) when CS_s = CS_q, the larger of the equal share of the sample size per stratum and the share proportional to the stratum size is taken as the sample size of that stratum, which solves the problem of sample sizes that are too small for both small and large groups; (2) when [condition not reproduced in the source text], the invention first merges the sample strata that take the same value on the column set CS_q into one large sample stratum and determines its sample size, and then performs stratified sampling within each large sample stratum, so that the sample size of each stratum is determined dynamically.
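Case (1) of item 3 can be sketched as follows. The source describes the rule only in words, so the exact formula used here is an assumption: each stratum receives the larger of an equal share N/T and a proportional share N·|G_j|/|R|, where N is the total sample size, T the number of strata, |G_j| the stratum size and |R| the size of the original data set. Rounding up and capping at the stratum size are further assumptions added so that the result is a usable row count; the function name is hypothetical.

```python
import math
from typing import Dict, Hashable


def allocate_equal_case(
    total_sample_size: int,
    stratum_sizes: Dict[Hashable, int],
    dataset_size: int,
) -> Dict[Hashable, int]:
    """Per-stratum sample sizes when CS_s equals CS_q (assumed formula)."""
    equal_share = total_sample_size / len(stratum_sizes)  # N / T
    allocation = {}
    for stratum, size in stratum_sizes.items():
        proportional = total_sample_size * size / dataset_size  # N * |G_j| / |R|
        wanted = math.ceil(max(equal_share, proportional))
        # ASSUMPTION: a stratum cannot supply more rows than it holds.
        allocation[stratum] = min(size, wanted)
    return allocation


if __name__ == "__main__":
    print(allocate_equal_case(100, {"g1": 900, "g2": 90, "g3": 10}, 1000))
```

Note that under this reading the allocations can add up to more than N, because small strata are lifted to the equal share while large strata keep their proportional share; whether the patent normalizes the result afterwards is not stated in the text.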
Description of drawings
Figure 1 is a framework diagram of multi-dimensional dynamic sampling for approximate query in a cloud computing environment.
Detailed description
The invention is further described below with reference to the embodiment and the accompanying drawing.
A multi-dimensional dynamic sampling method for approximate query in a cloud computing environment, comprising the following steps:
1) The dynamic sampling system includes an offline processing stage for creating stratified samples and an online processing stage for dynamically selecting samples;
2) In the offline processing stage, a workload column set parsing module, a data feature analysis module, a coverage index calculation module, a stratified column set determination module and a stratified sample data creation module are provided;
3) The workload column set parsing module parses the workload query statements, extracts the grouping column set of each query statement, counts the occurrences of each column set, analyzes the relationships among the column sets, and outputs the result to the data feature analysis module;
4) The data feature analysis module starts a MapReduce job to scan the original data set and outputs the data distribution to the coverage index calculation module;
5) The coverage index calculation module uses the data distribution to calculate the total coverage index for stratified sampling based on each candidate stratified column set;
6) The stratified column set determination module selects the stratified column sets used to create stratified samples, based on information such as the coverage index and the sample storage space;
7) The stratified sample data creation module starts a MapReduce job: the Map function scans the original data set and sends each tuple to the corresponding Reduce function according to the tuple's values on each stratified column set; the Reduce function updates the statistics and outputs the tuple data to the stratified sample data set;
8) In the online processing stage, a query parsing module, a sample selection module and a sample size determination module are provided;
9) The query parsing module parses the query statement entered by the user and extracts its grouping column set;
10) The sample selection module selects the stratified sample data with the lowest sampling cost according to the grouping column set of the query statement;
11) The sample size determination module determines the sample size drawn from each sample stratum according to the sample size of the approximate query statement.
In step 3), the workload column set parsing module works as follows: (1) parse all SQL query statements in the workload and extract the grouping column set of each; (2) count the occurrences of each grouping column set and generate the set of candidate stratified column sets CS = {CS_1, CS_2, ..., CS_M}; (3) analyze the relationship between any two stratified column sets CS_i and CS_j in CS: if [condition not reproduced in the source text], store CS_j - CS_i in the set RS, and output the result to the data feature analysis module.
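The three sub-steps of step 3) can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: grouping columns are pulled out with a simple regular expression rather than a real SQL parser, and the condition under which CS_j - CS_i is stored in RS is missing from the text, so the sketch assumes it is "CS_i is a proper subset of CS_j", which is the reading that makes the set difference non-empty. All function names are hypothetical.

```python
import re
from collections import Counter
from itertools import permutations
from typing import FrozenSet, Iterable, Set, Tuple

ColumnSet = FrozenSet[str]

# Captures the column list that follows GROUP BY, up to the next clause.
GROUP_BY = re.compile(
    r"\bGROUP\s+BY\s+(.+?)(?=\bHAVING\b|\bORDER\s+BY\b|\bLIMIT\b|;|$)",
    re.IGNORECASE | re.DOTALL,
)


def grouping_column_set(sql: str) -> ColumnSet:
    """Sub-step (1): extract the grouping column set of one statement."""
    match = GROUP_BY.search(sql)
    if match is None:
        return frozenset()
    columns = (c.strip().lower() for c in match.group(1).split(","))
    return frozenset(c for c in columns if c)


def analyse_workload(queries: Iterable[str]) -> Tuple[Counter, Set[ColumnSet]]:
    """Sub-steps (2) and (3): count column sets and build RS."""
    counts = Counter(cs for cs in map(grouping_column_set, queries) if cs)
    # ASSUMPTION: the elided condition is CS_i being a proper subset of CS_j.
    rs = {
        cs_j - cs_i
        for cs_i, cs_j in permutations(counts, 2)
        if cs_i < cs_j
    }
    return counts, rs


if __name__ == "__main__":
    workload = [
        "SELECT a, COUNT(*) FROM t GROUP BY a",
        "SELECT a, b, SUM(x) FROM t GROUP BY a, b ORDER BY a",
    ]
    print(analyse_workload(workload))
```

Column sets are represented as `frozenset` so that they can be counted and compared with Python's subset operators; expressions in GROUP BY that contain commas would need a proper parser.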
In step 4), a MapReduce job is started to scan the original data and analyze its features, counting the distinct values that the original data set takes on each column set in RS. The specific steps are as follows: (1) the Map function of the Map stage processes each tuple r of the original data set and forms key-value pairs, with the name of each column set in RS as the key and the tuple's grouping attribute value on that column set as the value; (2) the Combine function of the Map stage merges the key-value pairs that belong to the same column set into one new key-value pair for output; (3) all key-value pairs that belong to the same column set are sent to the same Reduce function, which merges their values and counts the distinct attribute values on that column set, thereby producing the number of distinct values of the original data set on each column set in RS.
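The Map, Combine and Reduce roles described above can be simulated in plain Python to show the data flow. This is a single-process sketch, not Hadoop code: input splits are ordinary lists of dict rows, the shuffle is a dictionary, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Iterator, List, Set, Tuple

Row = Dict[str, object]
Key = Tuple[str, ...]          # column set name, as a sorted tuple of columns
Value = Tuple[object, ...]     # the row's values on those columns


def map_phase(row: Row, rs: Iterable[FrozenSet[str]]) -> Iterator[Tuple[Key, Value]]:
    """Emit one (column set, value on that column set) pair per column set."""
    for column_set in rs:
        cols = tuple(sorted(column_set))
        yield cols, tuple(row[c] for c in cols)


def combine(pairs: Iterable[Tuple[Key, Value]]) -> Dict[Key, Set[Value]]:
    """Merge the pairs of one split that share a column set into one pair."""
    local: Dict[Key, Set[Value]] = defaultdict(set)
    for key, value in pairs:
        local[key].add(value)
    return local


def reduce_phase(key: Key, value_sets: List[Set[Value]]) -> Tuple[Key, int]:
    """Merge the partial sets and count distinct values for one column set."""
    return key, len(set().union(*value_sets))


def distinct_counts(splits: Iterable[List[Row]], rs: List[FrozenSet[str]]) -> Dict[Key, int]:
    shuffled: Dict[Key, List[Set[Value]]] = defaultdict(list)
    for split in splits:
        pairs = (kv for row in split for kv in map_phase(row, rs))
        for key, values in combine(pairs).items():
            shuffled[key].append(values)  # same key goes to the same reducer
    return dict(reduce_phase(k, v) for k, v in shuffled.items())


if __name__ == "__main__":
    data = [[{"a": 1, "b": "x"}, {"a": 1, "b": "y"}], [{"a": 2, "b": "x"}]]
    print(distinct_counts(data, [frozenset({"b"})]))
```

The combiner matters because it collapses repeated values inside a split before anything is shuffled, so the reducer receives at most one set per split and column set.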
In step 5), the coverage index calculation module calculates, for a stratified sample created with any column set CS_i in CS as the stratified column set, the coverage index CI_{i,j} of each candidate stratified column set CS_j in CS, as follows: if CS_j = CS_i, then CI_{i,j} = 1; if [condition not reproduced in the source text], then CI_{i,j} = 1/v_{i,j}, where v_{i,j} is the number of distinct values of the original data set on the column set CS_j - CS_i; otherwise, CI_{i,j} = 0.
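The three-way definition of the coverage index translates directly into a small function. The middle condition is missing from the text; the sketch assumes it is "CS_i is a proper subset of CS_j", the reading under which CS_j - CS_i is a non-empty column set whose distinct-value count v_{i,j} is available from step 4). The function name and the dictionary layout are hypothetical.

```python
from typing import Dict, FrozenSet

ColumnSet = FrozenSet[str]


def coverage_index(
    cs_i: ColumnSet,
    cs_j: ColumnSet,
    distinct_values: Dict[ColumnSet, int],
) -> float:
    """Coverage index CI_{i,j} of CS_j for a sample stratified on CS_i.

    `distinct_values` maps a column set from RS to the number of distinct
    values the original data set takes on it (the output of step 4).
    """
    if cs_j == cs_i:
        return 1.0
    # ASSUMPTION: the elided condition is CS_i being a proper subset of CS_j.
    if cs_i < cs_j:
        return 1.0 / distinct_values[cs_j - cs_i]
    return 0.0


if __name__ == "__main__":
    a, ab = frozenset({"a"}), frozenset({"a", "b"})
    v = {frozenset({"b"}): 4}
    print(coverage_index(a, a, v), coverage_index(a, ab, v), coverage_index(ab, a, v))
```

If the elided condition turns out to be the opposite subset relation, only the marked comparison changes; the rest of the function stays as it is.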
In step 6), the stratified column sets are determined as follows: (1) for any candidate stratified column set CS_i, calculate the total coverage index f_i obtained when a stratified sample is created based on CS_i; the formula is: [formula not reproduced in the source text], where P_j is the probability that CS_j appears in the workload, calculated as [formula not reproduced in the source text], N_j is the number of occurrences of CS_j in the workload, and CI_{i,j} is the coverage index of the column set CS_j when a stratified sample is created based on CS_i; (2) sort all candidate stratified column sets by total coverage index in descending order and select the top X as the column sets finally used to create stratified samples, where X is determined by the space the system has available for storing samples.
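The ranking in step 6) can be sketched as follows. Both formulas are missing from the text, so the sketch assumes the most direct reading of the definitions that survive: P_j = N_j / (sum of all N_k), and f_i = sum over j of P_j · CI_{i,j}. Treat both as assumptions. Function names and the example labels are hypothetical, and X is passed in directly rather than derived from a storage budget.

```python
from typing import Dict, Hashable, List, Tuple


def total_coverage(
    counts: Dict[Hashable, int],
    ci: Dict[Tuple[Hashable, Hashable], float],
) -> Dict[Hashable, float]:
    """Total coverage index f_i for every candidate column set i.

    counts[j] is N_j; ci[(i, j)] is CI_{i,j} (missing entries count as 0).
    ASSUMPTION: P_j = N_j / sum(N) and f_i = sum_j P_j * CI_{i,j}.
    """
    total = sum(counts.values())
    return {
        i: sum(counts[j] / total * ci.get((i, j), 0.0) for j in counts)
        for i in counts
    }


def choose_stratified_sets(
    counts: Dict[Hashable, int],
    ci: Dict[Tuple[Hashable, Hashable], float],
    x: int,
) -> List[Hashable]:
    """Return the X candidates with the largest total coverage index."""
    f = total_coverage(counts, ci)
    return sorted(f, key=lambda i: f[i], reverse=True)[:x]


if __name__ == "__main__":
    counts = {"A": 3, "AB": 1}
    ci = {("A", "A"): 1.0, ("AB", "AB"): 1.0, ("A", "AB"): 0.25}
    print(total_coverage(counts, ci), choose_stratified_sets(counts, ci, 1))
```

In the hypothetical example, column set A appears in three of four workload queries and also partially covers AB, so it ranks first.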
In step 7), a MapReduce job is started to create the stratified samples. The specific steps are as follows: (1) the Map function of the Map stage scans the original data set, processes each tuple r and generates key-value pairs; the key is a structure consisting of a column set name and the tuple's value on that column set, where the column set names come from the output of step 6), and the value is the entire tuple; (2) key-value pairs that belong to the same column set and take the same value on it are sent to the same Reduce function, which counts the tuples belonging to the same sample stratum and writes the tuples to a file, forming the stratified sample file.
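The stratification job of step 7) can be simulated in a single process to show how the composite key routes tuples to strata. This is a sketch, not Hadoop code: in-memory lists stand in for the stratified sample files, every tuple is kept (the text does not say whether a stratum is capped), and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Iterator, List, Tuple

Row = Dict[str, object]
# Key structure from the text: (column set name, value on that column set).
StratumKey = Tuple[Tuple[str, ...], Tuple[object, ...]]


def map_phase(row: Row, stratified_sets: Iterable[FrozenSet[str]]) -> Iterator[Tuple[StratumKey, Row]]:
    """Emit the whole tuple once per stratified column set chosen in step 6."""
    for column_set in stratified_sets:
        cols = tuple(sorted(column_set))
        yield (cols, tuple(row[c] for c in cols)), row


def build_stratified_samples(
    rows: Iterable[Row],
    stratified_sets: List[FrozenSet[str]],
) -> Tuple[Dict[StratumKey, List[Row]], Dict[StratumKey, int]]:
    strata: Dict[StratumKey, List[Row]] = defaultdict(list)
    for row in rows:
        for key, value in map_phase(row, stratified_sets):
            strata[key].append(value)  # shuffle: equal keys reach one reducer
    # Reduce: record the stratum size next to the stored tuples.
    sizes = {key: len(members) for key, members in strata.items()}
    return dict(strata), sizes


if __name__ == "__main__":
    table = [{"a": 1, "b": "x"}, {"a": 1, "b": "y"}, {"a": 2, "b": "x"}]
    _, stratum_sizes = build_stratified_samples(table, [frozenset({"a"})])
    print(stratum_sizes)
```

The per-stratum counts produced by the reducer are the stratum sizes that the online stage needs later when it decides how many tuples to draw from each stratum.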
In step 9), the query statement entered online by the user is parsed and its grouping column set CS_q is extracted. Step 10) then selects the stratified sample data with the lowest sampling cost, as follows: if there is a sample S(CS_s) whose stratified column set satisfies CS_s = CS_q, that sample is selected; otherwise, the sample S(CS_s) is selected for which CS_s is the smallest column set satisfying the condition [condition not reproduced in the source text]. Given the total sample size N of the user's approximate query statement, the sample size determination module of step 11) determines the number of samples to draw from each sample stratum. If CS_s = CS_q, the sample size drawn from each sample stratum is [formula not reproduced in the source text], where T is the number of sample strata, |G_j| is the size of each sample stratum, and |R| is the size of the original data set. If [condition not reproduced in the source text], the sample size drawn from each sample stratum is determined as follows: (1) the sample strata that take the same value on the column set CS_q are merged and treated as one large sample stratum, and the sample size drawn from each large sample stratum is [formula not reproduced in the source text]; (2) within each large sample stratum G_i, the sample size drawn from each small sample stratum G_ij is [formula not reproduced in the source text].
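The selection rule of step 10) and the stratum merging of step 11) can be sketched as follows. The condition for the fallback case is missing from the text; the sketch assumes it is "CS_q is a proper subset of CS_s", since merging strata that agree on CS_q only makes sense when the sample is stratified on a superset of the query's grouping columns. "Smallest" is read as fewest columns, which is also an assumption. The per-stratum size formulas are not reconstructed here, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

ColumnSet = FrozenSet[str]
Values = Tuple[object, ...]


def select_sample(cs_q: ColumnSet, available: Iterable[ColumnSet]) -> Optional[ColumnSet]:
    """Pick the stratified column set CS_s whose sample serves the query."""
    available = list(available)
    if cs_q in available:
        return cs_q  # exact match: CS_s = CS_q
    # ASSUMPTION: the elided condition is CS_q being a proper subset of CS_s.
    supersets = [cs_s for cs_s in available if cs_q < cs_s]
    return min(supersets, key=len) if supersets else None


def merge_strata(
    cs_q: ColumnSet,
    cs_s: ColumnSet,
    stratum_sizes: Dict[Values, int],
) -> Dict[Values, Dict[Values, int]]:
    """Group small strata into large strata that agree on the CS_q columns.

    Keys of `stratum_sizes` hold the values on sorted(cs_s); the result maps
    each large-stratum key (values on sorted(cs_q)) to its small strata.
    """
    cols_s = tuple(sorted(cs_s))
    keep = [i for i, col in enumerate(cols_s) if col in cs_q]
    merged: Dict[Values, Dict[Values, int]] = defaultdict(dict)
    for values, size in stratum_sizes.items():
        merged[tuple(values[i] for i in keep)][values] = size
    return dict(merged)


if __name__ == "__main__":
    samples = [frozenset({"a", "b"}), frozenset({"a", "b", "c"})]
    print(select_sample(frozenset({"a"}), samples))
    print(merge_strata(frozenset({"a"}), frozenset({"a", "b"}),
                       {(1, "x"): 5, (1, "y"): 3, (2, "x"): 2}))
```

Once the strata are grouped this way, a two-level allocation can first assign a size to each large stratum and then split it among the small strata inside it, which is the order the text describes.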
The above embodiment is only a preferred implementation of the invention. It should be noted that a person of ordinary skill in the art may make improvements and equivalent substitutions without departing from the principle of the invention, and technical solutions obtained by applying such improvements and equivalent substitutions to the claims of the invention all fall within the scope of protection of the invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810025016.6A CN108256028B (en) | 2018-01-11 | 2018-01-11 | Multi-dimensional dynamic sampling method for approximate query in cloud computing environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108256028A (en) | 2018-07-06 |
CN108256028B (en) | 2021-09-28 |
Family
ID=62726068
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810025016.6A Active CN108256028B (en) | 2018-01-11 | 2018-01-11 | Multi-dimensional dynamic sampling method for approximate query in cloud computing environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108256028B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1081610A2 (en) * | 1999-09-03 | 2001-03-07 | Cognos Incorporated | Methods for transforming metadata models |
CN102521386A (en) * | 2011-12-22 | 2012-06-27 | 清华大学 | Method for grouping space metadata based on cluster storage |
EP3035211A1 (en) * | 2014-12-18 | 2016-06-22 | Business Objects Software Ltd. | Visualizing large data volumes utilizing initial sampling and multi-stage calculations |
CN106095951A (en) * | 2016-06-13 | 2016-11-09 | 哈尔滨工程大学 | Data space multi-dimensional indexing method based on load balancing and inquiry log |
CN106528815A (en) * | 2016-11-14 | 2017-03-22 | 中国人民解放军理工大学 | Method and system for probabilistic aggregation query of road network moving objects |
Non-Patent Citations (1)
Title |
---|
YINGJIE SHI: "You Can Stop Early with COLA: Online Processing of Aggregate Queries in the Cloud", 《CIKM '12: PROCEEDINGS OF THE 21ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE 》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117435647A (en) * | 2023-12-20 | 2024-01-23 | 北京遥感设备研究所 | An approximate query method, device and equipment based on incremental sampling |
CN117435647B (en) * | 2023-12-20 | 2024-03-29 | 北京遥感设备研究所 | Approximate query method, device and equipment based on incremental sampling |
Also Published As
Publication number | Publication date |
---|---|
CN108256028B (en) | 2021-09-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||