CN108256028B - Multi-dimensional dynamic sampling method for approximate query in cloud computing environment - Google Patents

Publication number
CN108256028B
CN108256028B CN201810025016.6A CN201810025016A CN108256028B CN 108256028 B CN108256028 B CN 108256028B CN 201810025016 A CN201810025016 A CN 201810025016A CN 108256028 B CN108256028 B CN 108256028B
Authority
CN
China
Prior art keywords
sample
hierarchical
data
column set
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810025016.6A
Other languages
Chinese (zh)
Other versions
CN108256028A (en)
Inventor
史英杰
刘怡
郭飞
刘昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute Fashion Technology
Original Assignee
Beijing Institute Fashion Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute Fashion Technology
Priority to CN201810025016.6A
Publication of CN108256028A
Application granted
Publication of CN108256028B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A multi-dimensional dynamic sampling method for approximate queries in a cloud computing environment, in which the dynamic sampling system comprises an offline processing stage that creates layered samples and an online processing stage that dynamically selects samples. In the offline processing stage, a load column set analysis module analyzes the load query statements; a data characteristic analysis module analyzes the data characteristics; a coverage index calculation module calculates the total coverage index; a hierarchical column set determining module selects the hierarchical column sets used for creating layered samples; and a layered sample data creating module creates the layered samples. In the online processing stage, a query analysis module analyzes the user's query statement; a sample selection module selects the layered sample data with the minimum sampling cost; and a sample size determination module determines the size of the sample drawn from each sample layer. The invention effectively solves the problem of inaccurate small-group estimation caused by data skew in approximate queries, and reduces the sampling cost under a limited sample storage space.

Description

Multi-dimensional dynamic sampling method for approximate query in cloud computing environment
Technical Field
The invention relates to a data sampling method for approximate queries, and in particular to a dynamic sampling method oriented to multi-query loads in a cloud computing environment.
Background
The cloud computing environment provides a highly scalable and cost-effective way to manage big data and has become a mainstream platform for big data management. However, even in a cloud computing environment, queries over big data cannot meet the speed requirements of real-time processing and user interaction. For ad hoc queries and exploratory data analysis, quickly obtaining an estimated result is more meaningful than spending a large amount of time and computing resources to obtain a fully accurate one. Approximate query processing estimates the query result from sample data, which greatly reduces query execution time and is of great significance for big data analysis.
Acharya et al. proposed an approximate query processing technique based on sample data that uses uniform random sampling, i.e., each tuple is drawn with equal probability. Uniform random sampling suits uniformly distributed data and is simple and easy to operate, but when data skew produces small groups in a group-by aggregation query, it seriously reduces the accuracy of the estimate, so the estimate loses its meaning. Surajit et al. proposed a weighted sampling method that counts the query predicates each tuple satisfies and uses that count as the tuple's sampling weight: the more query predicates a tuple satisfies, the greater its probability of being sampled. Weighted sampling alleviates, to some extent, the inaccurate estimation that data skew causes under uniform random sampling, but its effect depends entirely on the load from which the sampling weights were calculated; when the actual query statements differ from that load, the sampling weights become meaningless. Swaroup et al. proposed a congressional sampling method that creates one common sample for all possible grouping columns and queries. However, the effectiveness of that sample gradually decreases as the number of queries grows, and the preprocessing time increases exponentially with the number of columns, so the method cannot handle application scenarios with many query statements. In general, the above techniques assume that the query statements are few and of fixed types, and they do not scale well in practical applications.
In addition, the above techniques were all proposed in the field of relational databases and cannot be applied directly to a cloud computing environment.
Disclosure of Invention
The method is used in the data preprocessing stage of approximate query processing. It preprocesses the original data set to generate a plurality of layered sample data sets; when a query statement arrives, it dynamically selects a sample data set according to the content of the query statement and its sample size, and gives the sample size to be drawn from each sample layer. The method provided by the invention effectively solves the problem of inaccurate small-group estimation caused by data skew in approximate queries, and reduces the sampling cost under a limited sample storage space.
In order to achieve the purpose, the technical scheme of the invention is as follows:
A multi-dimensional dynamic sampling method for approximate queries in a cloud computing environment, the method comprising the steps of:
1) the dynamic sampling system comprises an offline processing stage for creating layered samples and an online processing stage for dynamically selecting samples;
2) a load column set analysis module, a data characteristic analysis module, a coverage index calculation module, a hierarchical column set determination module and a layered sample data creation module are provided in the offline processing stage;
3) the load column set analysis module analyzes the load query statements, extracts the grouping column set of each load query statement, calculates the number of occurrences of each grouping column set, generates the set CS of candidate hierarchical column sets, analyzes the relationships among the candidate hierarchical column sets $CS_i$, and outputs the result to the data characteristic analysis module;
4) the data characteristic analysis module starts a MapReduce job to scan the original data set and outputs the data distribution result of the original data set to the coverage index calculation module;
5) the coverage index calculation module combines the data distribution result to calculate the total coverage index obtained when layered sampling is performed on each candidate hierarchical column set $CS_i$;
6) the hierarchical column set determination module combines information such as the coverage index and the sample storage space to select the hierarchical column sets for creating layered samples;
7) the layered sample data creation module starts a MapReduce job to create the layered samples: the Map function scans the original data set and sends each tuple to the corresponding Reduce function according to the tuple's values on each hierarchical column set used for creating layered samples, and the Reduce function updates the statistical information and outputs the tuple data to the layered sample data set;
8) a query analysis module, a sample selection module and a sample size determination module are provided in the online processing stage;
9) the query analysis module analyzes the query statements input by the user online and extracts the grouping column set $CS_q$ of each user query statement;
10) the sample selection module, according to the grouping column set $CS_q$ of the user query statement, selects the layered sample data with the minimum sampling cost from the layered sample data set;
11) the sample size determination module determines the sample size drawn from each sample layer according to the sample size of the approximate query statement.
The invention has the following advantages:
1. The method determines the hierarchical column sets by analyzing the load characteristics and the data distribution characteristics, and creates a plurality of multi-dimensional layered samples based on the hierarchical column sets and the sample storage space, which solves the problem of inaccurate estimation results caused by data skew in approximate queries;
2. In determining the column sets for creating layered samples, the coverage index characterizes how well a given layered sample supports layered sampling for different query statements, which lays a foundation for extending the query load;
3. Given the size of each sample layer and the total sample size, the invention determines the sample size drawn from each sample layer according to the relationship between the grouping column set $CS_q$ of the query and the hierarchical column set $CS_s$ of the sample, with a solution for each case: (1) when $CS_s = CS_q$, the sample size of each sample layer is the larger of the average sample size per sample layer and the sample size proportional to the size of the sample layer, which avoids both small groups and large groups receiving too small a sample; (2) when
$$CS_q \subset CS_s$$
the invention first merges the sample layers that have the same values on the $CS_q$ column set into one large sample layer and determines its sample size, and then performs layered sampling within each large sample layer, thereby dynamically determining the sample size of each sample layer.
Drawings
FIG. 1 is a diagram of a multi-dimensional dynamic sampling framework for approximating queries in a cloud computing environment.
Detailed Description
The invention is further described with reference to the following examples and the accompanying drawings.
A multi-dimensional dynamic sampling method for approximate queries in a cloud computing environment comprises the following steps:
1) the dynamic sampling system comprises an offline processing stage for creating layered samples and an online processing stage for dynamically selecting samples;
2) a load column set analysis module, a data characteristic analysis module, a coverage index calculation module, a hierarchical column set determination module and a layered sample data creation module are provided in the offline processing stage;
3) the load column set analysis module analyzes the load query statements, extracts the grouping column set of each query statement, calculates the number of occurrences of each column set, analyzes the relationships among the column sets, and outputs the result to the data characteristic analysis module;
4) the data characteristic analysis module starts a MapReduce job to scan the original data set and outputs the data distribution result to the coverage index calculation module;
5) the coverage index calculation module combines the data distribution information to calculate the total coverage index obtained when layered sampling is performed on each candidate hierarchical column set;
6) the hierarchical column set determination module combines information such as the coverage index and the sample storage space to select the hierarchical column sets for creating layered samples;
7) the layered sample creation module starts a MapReduce job: the Map function scans the original data set and sends each tuple to the corresponding Reduce function according to the tuple's values on each hierarchical column set, and the Reduce function updates the statistical information and outputs the tuple data to the layered sample data set;
8) a query analysis module, a sample selection module and a sample size determination module are provided in the online processing stage;
9) the query analysis module analyzes the query statement input by the user and extracts its grouping column set;
10) the sample selection module selects the layered sample data with the minimum sampling cost according to the grouping column set of the query statement;
11) the sample size determination module determines the sample size drawn from each sample layer according to the sample size of the approximate query statement.
In step 3), the load column set analysis module performs the following steps: (1) analyzing all SQL query statements in the load and extracting the corresponding grouping column sets; (2) calculating the number of occurrences of each grouping column set and generating the set of candidate hierarchical column sets $CS = \{CS_1, CS_2, \ldots, CS_M\}$; (3) analyzing the relationship between any two hierarchical column sets $CS_i$ and $CS_j$ in CS: if
$$CS_i \subset CS_j$$
then $CS_j - CS_i$ is stored into the set RS, which is output to the data characteristic analysis module.
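The three sub-steps of the load column set analysis can be sketched as follows. This is only an illustration under stated assumptions: the function names `grouping_column_set` and `analyze_load` are hypothetical, the GROUP BY clause is extracted with a simple regular expression rather than a real SQL parser, and the subset condition is assumed to be a proper subset.

```python
import re
from collections import Counter
from itertools import permutations

# Hypothetical helper: a regex is a simplification, not a full SQL parser.
_GROUP_BY = re.compile(
    r"group\s+by\s+(.+?)(?:\s+having\b|\s+order\s+by\b|\s+limit\b|;|$)",
    flags=re.IGNORECASE | re.DOTALL,
)

def grouping_column_set(sql):
    """Sub-step (1): return the GROUP BY columns of one statement as a frozenset."""
    match = _GROUP_BY.search(sql)
    if match is None:
        return frozenset()
    return frozenset(col.strip().lower() for col in match.group(1).split(","))

def analyze_load(statements):
    """Sub-steps (2) and (3): count each grouping column set and build RS."""
    counts = Counter()
    for sql in statements:
        cols = grouping_column_set(sql)
        if cols:
            counts[cols] += 1
    candidates = list(counts)  # the candidate set CS
    # Assumption: "CS_i is contained in CS_j" means a proper subset (cs_i < cs_j).
    rs = {cs_j - cs_i for cs_i, cs_j in permutations(candidates, 2) if cs_i < cs_j}
    return counts, rs

if __name__ == "__main__":
    load = [
        "SELECT city, COUNT(*) FROM sales GROUP BY city",
        "SELECT city, gender, AVG(price) FROM sales GROUP BY city, gender",
    ]
    counts, rs = analyze_load(load)
    print(counts)
    print(rs)
```

Representing each column set as a `frozenset` makes the subset test and the set difference $CS_j - CS_i$ one-line operations and lets the column sets serve as dictionary keys.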
In step 4), a MapReduce job is started to scan the original data and analyze its characteristics by calculating the number of distinct values that the tuples of the original data set take on each column set in RS. The specific steps are: (1) in the Map stage, the Map function parses each tuple r of the original data set and forms key-value pairs, with the name of each column set in RS as the key and the tuple's grouping attribute values on that column set as the value; (2) the Combine function in the Map stage merges the key-value pairs that belong to the same column set into a new key-value pair for output; (3) all key-value pairs belonging to the same column set are sent to the same Reduce function, which merges their values and counts the distinct attribute values on the column set, yielding the number of distinct values of the original data set on each column set in RS.
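The Map, Combine and Reduce roles of this job can be imitated in memory to make the data flow concrete. The sketch below is not Hadoop code; it assumes rows are Python dicts, column sets in RS are tuples of column names, and the names `map_tuple`, `combine`, `reduce_distinct` and `count_distinct` are hypothetical.

```python
from collections import defaultdict

def map_tuple(row, rs):
    """Map: emit (column-set name, the tuple's values on that column set)."""
    for cols in rs:
        yield ",".join(cols), tuple(row[c] for c in cols)

def combine(pairs):
    """Combine: merge pairs with the same key into one (key, set of values) pair."""
    merged = defaultdict(set)
    for key, value in pairs:
        merged[key].add(value)
    return merged.items()

def reduce_distinct(key, value_sets):
    """Reduce: union the value sets of one column set and count distinct values."""
    return key, len(set().union(*value_sets))

def count_distinct(splits, rs):
    """Run the simulated job; each split plays the role of one Map task."""
    shuffled = defaultdict(list)
    for split in splits:
        pairs = (pair for row in split for pair in map_tuple(row, rs))
        for key, value_set in combine(pairs):
            shuffled[key].append(value_set)  # shuffle: same key, same reducer
    return dict(reduce_distinct(k, v) for k, v in shuffled.items())

if __name__ == "__main__":
    splits = [
        [{"city": "A", "gender": "f"}, {"city": "B", "gender": "f"}],
        [{"city": "A", "gender": "m"}],
    ]
    print(count_distinct(splits, [("gender",), ("city", "gender")]))
```

Merging values into a set inside the combiner keeps the amount of data sent to each reducer proportional to the number of distinct values per split rather than to the number of tuples.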
In step 5), when a layered sample is created with any column set $CS_i$ in CS as the hierarchical column set, the coverage index calculation module calculates the coverage index $CI_{i,j}$ of each candidate hierarchical column set $CS_j$ in CS as follows: if $CS_j = CS_i$, then $CI_{i,j} = 1$; if
$$CS_j \subset CS_i$$
then $CI_{i,j} = 1/v_{i,j}$, where $v_{i,j}$ is the number of distinct values of the original data set on the column set $CS_i - CS_j$; otherwise, $CI_{i,j} = 0$.
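A minimal sketch of the coverage index rule follows. It assumes the middle case applies when $CS_j$ is a proper subset of $CS_i$, which is what makes the difference $CS_i - CS_j$ non-empty; the function name `coverage_index` and the `distinct_counts` mapping (column set to number of distinct values, as produced by step 4) are hypothetical.

```python
def coverage_index(cs_i, cs_j, distinct_counts):
    """Coverage index CI_{i,j} of column set cs_j for a sample layered on cs_i.

    cs_i, cs_j      -- frozensets of column names
    distinct_counts -- dict: frozenset of columns -> number of distinct values
    """
    if cs_j == cs_i:
        return 1.0
    if cs_j < cs_i:  # assumption: proper subset
        return 1.0 / distinct_counts[cs_i - cs_j]
    return 0.0

if __name__ == "__main__":
    city = frozenset({"city"})
    city_gender = frozenset({"city", "gender"})
    distinct = {frozenset({"gender"}): 2}
    print(coverage_index(city_gender, city, distinct))  # sample on (city, gender), query on city
    print(coverage_index(city, city_gender, distinct))  # sample on city cannot cover (city, gender)
```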
In step 6), the hierarchical column sets are determined as follows: (1) for any candidate hierarchical column set $CS_i$, the total coverage index $f_i$ obtained when layered samples are created based on $CS_i$ is calculated as
$$f_i = \sum_{j=1}^{M} P_j \cdot CI_{i,j}$$
where $P_j$ denotes the probability that $CS_j$ occurs in the load, calculated as
$$P_j = \frac{N_j}{\sum_{k=1}^{M} N_k}$$
$N_j$ is the number of occurrences of $CS_j$ in the load, and $CI_{i,j}$ is the coverage index of the column set $CS_j$ when layered samples are created based on $CS_i$; (2) the total coverage indexes of all candidate hierarchical column sets are sorted in descending order, and the first X candidate hierarchical column sets with the largest total coverage indexes are selected as the column sets finally used for creating the layered samples, where X is determined by the size of the space that the system uses for storing samples.
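The ranking step can be sketched as below. The sketch assumes that the total coverage index is the load-probability-weighted sum of the coverage indexes, $f_i = \sum_j P_j \cdot CI_{i,j}$ with $P_j = N_j / \sum_k N_k$; this form is inferred from the definitions of $P_j$, $N_j$ and $CI_{i,j}$ and is an assumption, as are all function names.

```python
def coverage_index(cs_i, cs_j, distinct_counts):
    """CI_{i,j}: 1 if equal, 1/v if cs_j is a proper subset of cs_i, else 0."""
    if cs_j == cs_i:
        return 1.0
    if cs_j < cs_i:
        return 1.0 / distinct_counts[cs_i - cs_j]
    return 0.0

def total_coverage(cs_i, counts, distinct_counts):
    """Assumed form of f_i: sum over j of P_j * CI_{i,j}."""
    total = sum(counts.values())
    return sum(
        (n_j / total) * coverage_index(cs_i, cs_j, distinct_counts)
        for cs_j, n_j in counts.items()
    )

def select_column_sets(counts, distinct_counts, x):
    """Sort candidates by total coverage index, descending, and keep the first x."""
    ranked = sorted(
        counts,
        key=lambda cs: total_coverage(cs, counts, distinct_counts),
        reverse=True,
    )
    return ranked[:x]

if __name__ == "__main__":
    city, gender = frozenset({"city"}), frozenset({"gender"})
    both = city | gender
    counts = {city: 2, both: 1, gender: 1}   # N_j per candidate column set
    distinct = {gender: 2, city: 4}          # distinct values per RS column set
    for cs in select_column_sets(counts, distinct, x=2):
        print(sorted(cs), total_coverage(cs, counts, distinct))
```

In this toy load the two-column set ranks first even though it appears only once, because a sample layered on it also partially covers the queries grouped on `city` alone and on `gender` alone.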
In step 7), a MapReduce job is started to create the layered samples. The specific steps are: (1) in the Map stage, the Map function scans the original data set, parses each tuple r and generates key-value pairs, where the key is a structure composed of a column set name and the tuple's values on that column set, the column set names coming from the output of step 6), and the value is the whole tuple; (2) key-value pairs that belong to the same column set and have the same values on that column set are sent to the same Reduce function, which counts the tuples belonging to the same sample layer and outputs them to a file, forming the layered sample file.
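The key structure used by this job can be illustrated with an in-memory imitation. The sketch is not Hadoop code: rows are assumed to be Python dicts, column sets are tuples of column names, the output "file" is a dictionary, and the names `map_tuple` and `build_layered_samples` are hypothetical.

```python
from collections import defaultdict

def map_tuple(row, column_sets):
    """Map: key = (column-set name, values on that column set), value = whole tuple."""
    for cols in column_sets:
        key = (",".join(cols), tuple(row[c] for c in cols))
        yield key, row

def build_layered_samples(rows, column_sets):
    """Group tuples into sample layers and record the tuple count of each layer."""
    layers = defaultdict(list)
    for row in rows:
        for key, value in map_tuple(row, column_sets):
            layers[key].append(value)  # shuffle: equal keys reach the same reducer
    # Reduce: count the tuples of each layer and emit them together with the count.
    return {key: {"count": len(tuples), "tuples": tuples} for key, tuples in layers.items()}

if __name__ == "__main__":
    rows = [
        {"city": "A", "gender": "f"},
        {"city": "A", "gender": "m"},
        {"city": "B", "gender": "f"},
    ]
    samples = build_layered_samples(rows, [("city",), ("city", "gender")])
    for key, layer in samples.items():
        print(key, layer["count"])
```

Putting the column set name inside the key lets a single pass over the data build the layers for every selected hierarchical column set at once.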
In step 9), the query statement input by the user online is analyzed and its grouping column set $CS_q$ is extracted. Step 10) then selects the layered sample data with the minimum sampling cost as follows: if there is a sample $S(CS_s)$ whose hierarchical column set satisfies $CS_s = CS_q$, that sample is selected; otherwise, the sample $S(CS_s)$ is selected for which $CS_s$ is the minimum column set satisfying the condition
$$CS_q \subset CS_s$$
In step 11), the sample size determination module determines the number of samples drawn from each sample layer according to the total sample size N of the user's approximate query statement. If $CS_s = CS_q$, the size of the sample drawn from each sample layer is
$$\max\left(\frac{N}{T},\; \frac{N \cdot |G_j|}{|R|}\right)$$
where T is the number of sample layers, $|G_j|$ is the size of each sample layer, and $|R|$ is the size of the original data set. If
$$CS_q \subset CS_s$$
the size of the sample drawn from each sample layer is determined as follows: (1) the sample layers that have the same values on the $CS_q$ column set are merged into one large sample layer, and the size of the sample drawn from each large sample layer is
[equation image GDA0003128007590000062 not reproduced]
(2) within each large sample layer $G_i$, the size of the sample drawn from each small sample layer $G_{ij}$ is
[equation image GDA0003128007590000063 not reproduced]
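Sample selection and the first sizing case can be sketched as follows. Two assumptions are made: "minimum column set" is read as the superset of $CS_q$ with the fewest columns, and the per-layer size for $CS_s = CS_q$ is taken as the larger of the average share $N/T$ and the proportional share $N \cdot |G_j| / |R|$, following the description of the first case in the advantages section. Only the case $CS_s = CS_q$ is covered, no rounding or capping at the layer size is applied, and the function names are hypothetical.

```python
def select_sample(cs_q, sample_column_sets):
    """Pick the hierarchical column set CS_s whose sample serves a query on cs_q."""
    if cs_q in sample_column_sets:
        return cs_q
    supersets = [cs for cs in sample_column_sets if cs_q < cs]
    # Assumption: "minimum" means the fewest columns among the proper supersets.
    return min(supersets, key=len) if supersets else None

def layer_sample_sizes(layer_sizes, n_total, raw_size):
    """Case CS_s = CS_q: max(N / T, N * |G_j| / |R|) for every sample layer."""
    t = len(layer_sizes)
    return {
        layer: max(n_total / t, n_total * size / raw_size)
        for layer, size in layer_sizes.items()
    }

if __name__ == "__main__":
    available = [frozenset({"city", "gender", "age"}), frozenset({"city", "gender"})]
    print(select_sample(frozenset({"city"}), available))
    print(layer_sample_sizes({"A": 900, "B": 90, "C": 10}, n_total=90, raw_size=1000))
```

In the usage example the two small layers B and C receive the average share instead of their much smaller proportional share, which is the effect the first case is meant to achieve for small groups.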
The above examples are only preferred embodiments of the present invention. It will be apparent to those skilled in the art that various modifications and equivalents can be made without departing from the spirit of the invention, and all such modifications and equivalents are intended to fall within the scope of the invention as defined in the claims.

Claims (5)

1. A method for multi-dimensional dynamic sampling for approximating a query in a cloud computing environment, the method comprising the steps of:
1) the dynamic sampling system comprises an offline processing stage for creating layered samples and an online processing stage for dynamically selecting samples;
2) a load column set analysis module, a data characteristic analysis module, a coverage index calculation module, a hierarchical column set determination module and a layered sample data creation module are provided in the offline processing stage;
3) the load column set analysis module analyzes the load query statements, extracts the grouping column set of each load query statement, calculates the number of occurrences of each grouping column set, generates the set CS of candidate hierarchical column sets, analyzes the relationships among the candidate hierarchical column sets $CS_i$, and outputs the result to the data characteristic analysis module;
4) the data characteristic analysis module starts a MapReduce operation to scan the original data set and outputs a data distribution result of the original data set to the coverage index calculation module;
5) the coverage index calculation module combines the data distribution result to calculate the total coverage index obtained when layered sampling is performed on each candidate hierarchical column set $CS_i$; when a layered sample is created with any column set $CS_i$ in CS as the hierarchical column set, the coverage index $CI_{i,j}$ of each candidate hierarchical column set $CS_j$ in CS is calculated as follows: if $CS_j = CS_i$, then $CI_{i,j} = 1$; if
$$CS_j \subset CS_i$$
then $CI_{i,j} = 1/v_{i,j}$, where $v_{i,j}$ is the number of distinct values of the original data set on $CS_i - CS_j$; otherwise, $CI_{i,j} = 0$;
6) The hierarchical column set determination module, in combination with the coverage index and the sample storage space information, selects a hierarchical column set for creating hierarchical samples, comprising the steps of:
(1) for any candidate hierarchical column set $CS_i$, calculating the total coverage index $f_i$ obtained when layered samples are created based on $CS_i$:
$$f_i = \sum_{j=1}^{M} P_j \cdot CI_{i,j}$$
wherein $P_j$ denotes the probability that $CS_j$ occurs in the load query statements,
$$P_j = \frac{N_j}{\sum_{k=1}^{M} N_k}$$
$N_j$ is the number of occurrences of $CS_j$ in the load query statements, and $CI_{i,j}$ is the coverage index of $CS_j$ when layered samples are created based on $CS_i$;
(2) sorting the total coverage indexes of all candidate hierarchical column sets in a descending order, selecting the first X candidate hierarchical column sets with the maximum total coverage indexes as the grouping column sets finally used for creating the hierarchical samples, wherein X is determined by the space size of the dynamic sampling system used for storing the samples;
7) starting a MapReduce operation at a layered sample data creating module to create layered samples, scanning an original data set by a Map function, transmitting the original data set to a corresponding Reduce function according to values of tuples on each layered column set for creating the layered samples, updating statistical information by the Reduce function and outputting tuple data to the layered sample data set;
8) a query analysis module, a sample selection module and a sample size determination module are arranged at the online processing stage;
9) the query analysis module analyzes the query statements input by the user online and extracts the grouping column set $CS_q$ of each user query statement;
10) the sample selection module, according to the grouping column set $CS_q$ of the user query statement, selects the layered sample data with the minimum sampling cost from the layered sample data set;
11) the sample size determination module determines a sample size drawn from each sample layer according to a sample size of the approximate query statement.
2. The method as claimed in claim 1, wherein in step 3), the load column set analysis module parses the load query statements through the following steps:
(1) analyzing all SQL query sentences in the load query sentences, and extracting corresponding grouping column sets;
(2) calculating the number of occurrences of each grouping column set and generating the set of candidate hierarchical column sets $CS = \{CS_1, CS_2, \ldots, CS_M\}$;
(3) analyzing the relationship between any two candidate hierarchical column sets $CS_i$ and $CS_j$ in CS: if
$$CS_i \subset CS_j$$
then $CS_j - CS_i$ is stored into the set RS, which is output to the data characteristic analysis module.
3. The method as claimed in claim 1, wherein in the step 4), a MapReduce job is started to scan raw data and analyze data characteristics, and the method comprises the following steps:
(1) analyzing each tuple r of the original data set by a Map function in a Map stage, forming a key-value pair, setting the name of each column set in the RS as a key, and setting the grouping attribute value of the tuple on the corresponding column set as a value;
(2) the combination function in the Map stage combines the key-value pairs belonging to the same column set to form a new key-value pair output;
(3) all key-value pairs belonging to the same column set are transmitted to the same Reduce function, the function combines the values of the key-value pairs, and the value number of different attribute values on the column set is calculated, so that the number of different values of the original data set on each column set of the RS is generated.
4. The method as claimed in claim 1, wherein in the step 7), a MapReduce job is started for hierarchical sample creation, which includes the following steps:
(1) scanning an original data set by a Map function in a Map stage, analyzing each tuple r and generating a key-value pair, setting a key as a structural body formed by a column set name and values on the column set, wherein the column set name is from an output result in the step 6), and setting the whole tuple as a value;
(2) key-value pairs that belong to the same column set and have the same values on that grouping column set are sent to the same Reduce function, which counts the tuples belonging to the same sample layer and outputs them to a file to form the layered sample file.
5. The method as claimed in claim 1, wherein in step 9), the query statement input by the user online is analyzed and its grouping column set $CS_q$ is extracted; step 10) then selects the layered sample data with the minimum sampling cost as follows: if there is a sample $S(CS_s)$ whose hierarchical column set satisfies $CS_s = CS_q$, that sample is selected; otherwise, the sample $S(CS_s)$ is selected for which $CS_s$ is the minimum column set satisfying the condition
$$CS_q \subset CS_s$$
in step 11), the sample size determination module determines the number of samples drawn from each sample layer according to the total sample size N of the user's approximate query statement; if $CS_s = CS_q$, the size of the sample drawn from each sample layer is
$$\max\left(\frac{N}{T},\; \frac{N \cdot |G_j|}{|R|}\right)$$
wherein T is the number of sample layers, $|G_j|$ is the size of each sample layer, and $|R|$ is the size of the original data set; if
$$CS_q \subset CS_s$$
the size of the sample drawn from each sample layer is determined as follows: (1) the sample layers that have the same values on the $CS_q$ column set are merged into one large sample layer, and the size of the sample drawn from each large sample layer is
[equation image FDA0003170432510000034 not reproduced]
(2) within each large sample layer $G_i$, the size of the sample drawn from each small sample layer $G_{ij}$ is
[equation image FDA0003170432510000035 not reproduced]
CN201810025016.6A 2018-01-11 2018-01-11 Multi-dimensional dynamic sampling method for approximate query in cloud computing environment Active CN108256028B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810025016.6A CN108256028B (en) 2018-01-11 2018-01-11 Multi-dimensional dynamic sampling method for approximate query in cloud computing environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810025016.6A CN108256028B (en) 2018-01-11 2018-01-11 Multi-dimensional dynamic sampling method for approximate query in cloud computing environment

Publications (2)

Publication Number Publication Date
CN108256028A CN108256028A (en) 2018-07-06
CN108256028B CN108256028B (en) 2021-09-28

Family

ID=62726068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810025016.6A Active CN108256028B (en) 2018-01-11 2018-01-11 Multi-dimensional dynamic sampling method for approximate query in cloud computing environment

Country Status (1)

Country Link
CN (1) CN108256028B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117435647B (en) * 2023-12-20 2024-03-29 北京遥感设备研究所 Approximate query method, device and equipment based on incremental sampling

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1081610A2 (en) * 1999-09-03 2001-03-07 Cognos Incorporated Methods for transforming metadata models
CN102521386A (en) * 2011-12-22 2012-06-27 清华大学 Method for grouping space metadata based on cluster storage
EP3035211A1 (en) * 2014-12-18 2016-06-22 Business Objects Software Ltd. Visualizing large data volumes utilizing initial sampling and multi-stage calculations
CN106095951A (en) * 2016-06-13 2016-11-09 哈尔滨工程大学 Data space multi-dimensional indexing method based on load balancing and inquiry log
CN106528815A (en) * 2016-11-14 2017-03-22 中国人民解放军理工大学 Method and system for probabilistic aggregation query of road network moving objects

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
You Can Stop Early with COLA: Online Processing of Aggregate Queries in the Cloud; Yingjie Shi; 《CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge 》; 20121031; 1223-1232 *

Also Published As

Publication number Publication date
CN108256028A (en) 2018-07-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant