CN112685444A - Data query method and device, computer equipment and storage medium - Google Patents

Data query method and device, computer equipment and storage medium Download PDF

Info

Publication number
CN112685444A
CN112685444A CN202011547152.5A CN202011547152A CN112685444A CN 112685444 A CN112685444 A CN 112685444A CN 202011547152 A CN202011547152 A CN 202011547152A CN 112685444 A CN112685444 A CN 112685444A
Authority
CN
China
Prior art keywords
result set
data
data table
query
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011547152.5A
Other languages
Chinese (zh)
Inventor
汪金忠
孙迁
范东
汤国强
张强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Suning Cloud Computing Co ltd
Original Assignee
Jiangsu Suning Cloud Computing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Suning Cloud Computing Co ltd filed Critical Jiangsu Suning Cloud Computing Co ltd
Priority to CN202011547152.5A priority Critical patent/CN112685444A/en
Publication of CN112685444A publication Critical patent/CN112685444A/en
Priority to CA3143821A priority patent/CA3143821A1/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data query method, a data query device, computer equipment and a storage medium, which are used for analyzing a query request and acquiring a connection algorithm and a corresponding scene in the query request; if the connection algorithm is one of full external connection, left external connection and right external connection, the scene is the sorting of the summarized values in the first data table, and the summarized values in the second data table are not sorted, the operations of summarizing, sorting and limit in the query request are executed in a first database where the first data table is located, so that a first result set is obtained; and taking the dimension value in the first result set as a dynamic filtering condition, and executing dynamic filtering operation corresponding to the dynamic filtering condition and summarizing operation in the query request in a second database where a second data table is located to obtain a second result set, so that the data volume of the first result set and the second result set returned to the query engine is obviously reduced, the pressure of network data transmission is reduced, the time for returning the query result is shortened, and the query performance of the query engine is improved.

Description

Data query method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data query method, apparatus, computer device, and storage medium.
Background
In the big data era, more and more attention is paid to data analysis of mass data, the related service scenes are more and more complex, and the data volume is more and more large and is often more than 10 billion and billion. It is generally difficult to solve all the data analysis problems with a single olap database, and therefore most internet enterprises use multiple sets of olap databases to solve different business analysis problems. But joint summary query may be needed to be performed on data stored in different databases at the same time, and currently, federal query engines such as spark sql, presto, impala, and openlokeng in huachen are commonly used.
When data query analysis is performed by using federated query engines such as spark sql, presto, impala and the like, a plurality of pieces of summarized data from the same or different olap databases are connected, and then sorted according to the summarized values of a certain table and the pieces of TOPN data are returned. The traditional data query method only presses the filtering condition into the olap database, all the data meeting the condition are taken out, then a plurality of data sets are connected (join) calculated in the federal query engine, and then the results of the connection calculation are sequenced and subjected to limit operation.
Disclosure of Invention
In view of the above, it is necessary to provide a data query method, apparatus, computer device and storage medium for solving the above technical problems.
In a first aspect, a method for data query includes the following steps:
analyzing the query request, and acquiring a connection algorithm and a corresponding scene in the query request;
if the connection algorithm is one of full external connection, left external connection and right external connection, the scene is the summary value sequence in the first data table, and the summary value in the second data table is not sequenced, the summary, sequence and limit operations in the query request are executed in a first database where the first data table is located, and a first result set is obtained;
and taking the dimension value in the first result set as a dynamic filtering condition, and executing a dynamic filtering operation corresponding to the dynamic filtering condition and a summarizing operation in the query request in a second database where a second data table is located to obtain a second result set.
In one embodiment, the method further comprises: and acquiring a first target result set by using the first result set and the second result set through the connection algorithm, and extracting first target data from the first target result set as a query result.
In an embodiment, if the join algorithm is an all-outsourced join and the amount of data in the first data table is smaller than a preset limit value of a limit operation, the method further includes performing a summarization, sorting, and limit operation in the query request in the second database to obtain a third result set.
In one embodiment, the method further comprises:
combining the second result set and the third result set, and removing repeated results to obtain a fourth result set;
and acquiring a second target result set by using the first result set and the fourth result set through the connection algorithm, and extracting second target data from the second target result set as a query result.
In one embodiment, before performing the summarization, sorting and limit operation in the query request in the first database where the first data table is located, the method further includes:
and judging whether the data volume in the first data table and the second data table meets a preset data volume condition, and if so, executing the operations of summarizing, sorting and limit in the query request in a first database where the first data table is located.
In one embodiment, the preset data volume condition is: the ratio of the data amount in the first data table to the data amount in the second data table exceeds a preset ratio, or the data amount in the first data table and the data amount in the second data table both exceed a preset amount.
In one embodiment, the first data table is a left table when the connection algorithm is a left outer connection and a right table when the connection algorithm is a right outer connection.
In a second aspect, the apparatus comprises:
the analysis module is used for analyzing the query request and acquiring a connection algorithm and a corresponding scene in the query request;
a first result set obtaining module, configured to, if the connection algorithm is one of full external connection, left external connection, and right external connection, and the scenario is a summary value ranking in the first data table, and a summary value in the second data table is not ranked, execute the summary, ranking, and limit operations in the query request in a first database where the first data table is located to obtain a first result set;
and the second result set acquisition module is used for taking the dimension value in the first result set as a dynamic filtering condition, and executing dynamic filtering operation corresponding to the dynamic filtering condition and summarizing operation in the query request in a second database where a second data table is located to obtain a second result set.
In a third aspect, a computer device comprises a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method when executing the computer program.
In a fourth aspect, a computer-readable storage medium, on which a computer program is stored, characterized in that the computer program realizes the steps of the method when executed by a processor.
The invention has the beneficial effects that:
if the connection algorithm is one of full external connection, left external connection and right external connection, the scene is the summary value sequence in the first data table, and the summary value in the second data table is not sequenced, the summary, sequence and limit operations in the query request are executed in a first database where the first data table is located, so that a first result set is obtained; and the dimension value in the first result set is used as the dynamic filtering condition, the dynamic filtering operation corresponding to the dynamic filtering condition and the summarizing operation in the query request are executed in the second database to obtain a second result set, so that the data volume of the first result set and the second result set returned to the query engine is obviously reduced, the pressure of network data transmission is reduced, the data volume summarized in the second database is obviously reduced by using the dimension value in the first result set as the dynamic filtering condition in the second database, the time of returning the query result is shortened, and the query performance of the query engine is improved. When the operation mode is the full external connection, when the data volume in the first data table is considered to be smaller than the preset limit value of the limit operation, the summarizing, sorting and limit operations in the query request are executed in the second database to obtain a third result set, and the third result set is obtained, so that the data returned by the second database are dynamically adjusted, and the result error condition caused by summarizing dimension information which is not in the first data table possibly exists in the second data table is avoided. According to the invention, before the summarizing, sorting and limit operations in the query request are executed in the first database where the first data table is located to obtain the first result set, the cost for executing the summarizing, sorting and limit operations in the first database is also considered, and the summarizing, sorting and limit operations are executed in the first database only after the preset data volume condition is met, so that the condition that the cost is high due to the fact that the summarizing, limit, filtering and other operations are directly pushed down to the database is avoided.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a prior art data query method;
FIG. 2 is a data query method in one embodiment;
FIG. 3 is a diagram illustrating a data query method in an application scenario according to an embodiment;
FIG. 4 is a process for dynamically adjusting the data query method in an application scenario according to an embodiment;
FIG. 5 illustrates a data query device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As described in the background of the present invention, the conventional data query method is to push the filtering condition to the olap database, take all the data meeting the condition out, then perform a join calculation on multiple data sets in the federal query SQL engine, and finally perform a sort and limit operation on the data results, for example, in the e-commerce field, it may be necessary to analyze sales conditions and traffic, analyze traffic UV (number of people visiting a certain site or clicking a different IP address of a certain news) and PV (page view volume, or click volume) of the commodity 100 before sales is sold for a period of time, so as to perform a subsequent targeted marketing activity, store sales data into a dry, store traffic data into a clikhouse, which are different olap databases, if the summary values in the sales data table are sorted in the query request, the summary values in the traffic data table are not sorted, and then, computing by adopting an all-external connection mode, then, a general data query method, as shown in fig. 1, firstly summarizes the sales data table in the drive, summarizes the volume data table in the clickhouse, and then returns the total summarized data to the SQL engine, at this time, the data volume returned by the drive and the clickhouse may be very huge, the data volume returned by the drive is 1000 ten thousand, and the data volume returned by the clickhouse is 5000 ten thousand, then, subsequently, respectively performing exchange and sort on the data volumes returned by the drive and the clickhouse in the SQL, then performing join calculation, sorting the result set according to the sales, performing limit on the sorted result, and then performing project taking record before sales 100, in this process, because the data volume returned by the drive and the clickhouse may bring huge network transmission pressure, and therefore, the result return time is long, and subsequently, when the two tables are subjected to full external connection calculation, the sort merge join algorithm is needed to complete the full external connection calculation of the large table, and the sort mrege needs to carry out a large amount of shuffle and sort operations, so that the cost is very high, the resources are very consumed, and the execution efficiency is very low.
Therefore, the invention provides a data query method, if the connection algorithm is one of full external connection, left external connection and right external connection, the scene is the sorting of the summarized values in the first data table, and the summarized values in the second data table are not sorted, the operations of summarizing, sorting and limit in the query request are executed in the first database where the first data table is located to obtain a first result set; the dimension value in the first result set is used as a dynamic filtering condition, the dynamic filtering operation corresponding to the dynamic filtering condition and the summarizing operation in the query request are executed in the second database to obtain a second result set, so that the data volume of the first result set and the second result set returned to the query engine is obviously reduced, the pressure of network data transmission is reduced, the dimension value in the first result set is used as the dynamic filtering condition in the second database, the data volume summarized in the second database is obviously reduced, the time for returning the query result is shortened, the query performance of the query engine is improved, and the data query method, the device, the computer equipment and the storage medium are further explained with specific embodiments.
Example one
As shown in fig. 2, the present embodiment provides a data query method, which is applied in a query engine, and the method at least includes the following steps:
and S1, analyzing the query request, and acquiring a connection algorithm and a corresponding scene in the query request.
In actual operation, a user inputs a corresponding connection algorithm and a corresponding scene in a federal query engine, specifically, the connection algorithms among multiple tables are various, such as join (connection) and union (union), wherein the join is divided into full external connection, left external connection, right external connection and internal connection, and the corresponding scene includes various types, such as no sorting of summary values; sorting summary values of one table, not sorting summary values of another table, sorting dimensions, and so forth.
And S2, if the connection algorithm is one of full external connection, left external connection and right external connection, the scene is the summary value sorting in the first data table, and the summary value in the second data table is not sorted, performing the summary, sorting and limit operations in the query request in the first database where the first data table is located to obtain a first result set.
In this embodiment, before querying data, a connection algorithm and a corresponding scene need to be judged first, and the operation mode applicable to the present invention includes full external connection, left external connection, and right external connection, where when the full external connection matches two tables, data between the two tables does not need to be matched, that is, a collection of data of the two tables is returned; the left external connection is mainly based on the left table, all data in the left table are required to be returned, if records in the right table can be matched with the left table, the value in the result set is set as the value in the right table, and if not, the value is set as the null value; and the right external connection is mainly based on the right table, all data in the returned right table are required, if the records in the left table can be matched with the right table, the values in the result set are set as the values in the left table, and otherwise, the values are set as null values. And then judging corresponding scenes, wherein the scenes suitable for the method are that the summarized values in the first data table are sorted, and the summarized values in the second data table are not sorted, and when the emphasis is needed, the method is convenient for understanding, a table for sorting the summarized values is used as the first data table, and a table for not sorting the summarized values is used as the second data table, wherein the first and the second are not limited to the data tables.
In this embodiment, if the connection algorithm is one of full external connection, left external connection, and right external connection, and the scenario is the summarized value sorting in the first data table, and the summarized value in the second data table is not sorted, the summarizing, sorting, and limit operations in the query request are executed in the first database where the first data table is located to obtain the first result set, that is, when both the connection algorithm and the scenario satisfy the condition, the query engine pushes the summarizing, sorting, and limit operations in the query request down to the first database for execution, so as to fully utilize the characteristics of the database itself, return the first result set with a smaller data size to the query engine, and significantly reduce the pressure on network data transmission.
And S3, taking the dimension value in the first result set as a dynamic filtering condition, and executing a dynamic filtering operation corresponding to the dynamic filtering condition and a summarizing operation in the query request in a second database where a second data table is located to obtain a second result set.
In this embodiment, after the query engine obtains the first result set, the dimension value in the first result set is used as a dynamic filtering condition, and is transmitted to the second database where the second data table is located, the dynamic filtering operation and the summarizing operation are pushed down to the second database for execution, and the second result set is returned to the query engine, so that the data volume in the first result set and the second result set returned to the query engine is significantly reduced, the pressure of network data transmission is reduced, and the dimension value in the first result set is used as the dynamic filtering condition in the second database, so that the data volume for summarizing in the second database is significantly reduced, the time for returning the query result is shortened, and the query performance of the query engine is improved.
In a preferred embodiment, the first result set and the second result set are used by the join algorithm to obtain a first target result set, and first target data is extracted from the first target result set as a query result.
In this embodiment, the first result set and the second result set are data tables with small data volumes, the query engine obtains the first target result set from the first result set and the second result set through a connection algorithm, the adopted connection algorithm may be the aforementioned all-external connection, left-external connection, and right-external connection, when the all-external connection algorithm is adopted, a broadcast hash join algorithm may be adopted, which is a most efficient join algorithm, it avoids resource-consuming shuffle and sort calculations, alleviates the delay problem of complex data analysis, shortens the time for obtaining the target result set, and further improves the query performance of the federal query engine. In addition, when the all-external connection algorithm is adopted, the sort merge join algorithm can be used, and the data query method in the invention has higher query efficiency than the data query method in the prior art even if the complicated sort merge join algorithm is adopted.
In an embodiment, before the first result set and the second result set are subjected to the join algorithm to obtain the first target result set, exchanging data in the first result set to obtain a first exchange set, and then joining the first exchange set with the second result set, where the exchange method is preferably a broadcase exchange, which is an efficient exchange method, and can significantly improve efficiency of obtaining an exchange result.
In one embodiment, the applying the first result set and the second result set to the join algorithm to obtain a first target result set further comprises: and sequencing and limit operation are carried out on the results obtained after the connection algorithm is adopted, so that a first target result set is obtained.
The above-described method of the present invention will be further described with reference to the specific example of fig. 3.
As mentioned in the foregoing, when analyzing lost sales and traffic that may be needed in the e-commerce field, analyzing the traffic UV (the number of people visiting a certain site or clicking a different IP address of a certain news) and PV (page view volume, or click volume) of the first 100 goods in sales ranking within a period of time, for this reason, storing the sales data into the druid, storing the traffic data into the clickhouse, if the query request is to rank the summarized values in the sales data table, not ranking the summarized values in the sales data table, and then calculating in an all-out connection manner, then the data query method in the present invention, as shown in fig. 3, the Spark SQL engine passes a first query statement into the druid, so that the druid performs the summarization, ranking, and limit operations in the query request, where the first query statement may be:
select goods _ id, sum (samples) as total _ samples from drive _ samples where __ time 2020-09-01and __ time 2020-09-30group by goods _ id (this step is to summarize the data in the first data sheet)
order by total _ sales desc limit 100 (this step is sorting summary results, limit)
The druid returns the first 100 records as the first result set back to the Spark SQL engine.
Then, the Spark SQL engine transmits a second query statement to the clickhouse, the second query statement takes the dimension value in the first result set as a dynamic filtering condition, and executes a dynamic filtering operation and a summarizing operation corresponding to the dynamic filtering condition in the clickhouse, where the second query statement may be:
select goods_id,count(visitor_id)as pv,count(disitinct visitor_id)as uv from ch_traffic
where__time>=2020-09-01and__time<=2020-09-30and
goods _ id in (value1, value2, … …) (this step is dynamic filtering, summarizing data in the second data table)
group by goods_id
Wherein, the goods _ id in (value1, value2, … …) in the second query statement is a dynamic filtering condition, and the Spark SQL engine obtains the result returned from the clickhouse as the second result set.
After obtaining a first result set and a second result set, the Spark SQL engine exchanges data in the first result set by using a broadcast exchange to obtain a first exchange set, then uses a broadcast hash join to perform connection calculation on the first exchange set and the second result set by using a broadcast hash join, then performs sort and limit operations to return the results of rows, then performs project to select corresponding columns to obtain a first target result set, extracts target data from the first target result set as a query result, and effectively shortens the time for obtaining the target result set by using the broadcast hash join, thereby shortening the time for obtaining the query result.
The invention can return the query result within a few seconds by the query method, but the query method in the prior art needs to return a result set of more than 1000 ten thousand from the drive, and the query result can not be obtained by directly timeout error reporting.
In a preferred embodiment, if the join algorithm is an all-outsourced join and the amount of data in the first data table is smaller than a preset limit value of a limit operation, the method further includes performing a sort and limit operation in the query request in the second database to obtain a third result set.
The data query method described above needs to satisfy a certain condition, that is, the data amount in the first data table is greater than the preset limit value of the limit operation, and when the all-external connection is adopted, if the data amount in the first data table is less than the preset limit value of the limit operation, that is, the data amount in the first data table is smaller, during the all-external connection calculation, if the dimension value in the first result set is still used as the dynamic filtering condition to perform the dynamic filtering operation and summarization in the second database, there may be a problem that the data amount in the returned second result set is very small because there may be summarized dimension information that is not in the first data table in the second data report, which is absolutely not allowed and therefore needs to dynamically adjust the data query method.
Therefore, in this embodiment, a dynamic adjustment process of the data query method is involved, at this time, the query engine transmits the third query statement to the second database to execute the summarization, sorting and limit operations in the query request, so as to obtain the third result set, that is, the query engine pushes down the summarization, sorting and limit operations to the second database to obtain the corresponding result set.
In a preferred embodiment, the method further comprises: combining the second result set and the third result set, and removing repeated results to obtain a fourth result set; and acquiring a second target result set by using the first result set and the fourth result set through the connection algorithm, and extracting second target data from the second target result set as a query result.
In this embodiment, the query engine performs joint calculation on the second result set and the third result set, removes the repeated result to obtain a fourth result set, obtains a second target result set by using a connection algorithm on the first result set and the fourth result set, and extracts second target data from the second target result set as a query result, where the connection algorithm used may be the aforementioned all-outer connection, left-outer connection, and right-outer connection, and when the all-outer connection algorithm is used, a broadcast hash join algorithm may be used, or a sort merge join algorithm may be used, which is not described herein again.
In an embodiment, before the first result set and the fourth result set are subjected to the join algorithm to obtain the second target result set, exchanging data in the first result set to obtain a second exchange set, and then joining the second exchange set with the fourth result set, where the exchange method is preferably a broadcase exchange, which is an efficient exchange method, and can significantly improve efficiency of obtaining an exchange result.
In one embodiment, the applying the first result set and the fourth result set to the join algorithm to obtain a second target result set comprises: and sequencing and limit operation are carried out on the results obtained after the connection algorithm is adopted, so that a second target result set is obtained.
In order to make the dynamic adjustment process of the query method more clear, the following description is further set forth with reference to fig. 4.
As shown in fig. 4, which is a dynamic adjustment of the query method based on the example in fig. 3, if the connection algorithm in the query request is full external connection and the data amount in the first data table is less than the preset limit value of the limit operation, the Spark SQL engine passes a third query statement to the clickhouse, and performs the summarization, sorting, and limit operations, where the third query statement may be:
select goods_id,count(visitor_id)as pv,count(disitinct visitor_id)as uv
from ch_traffic where__time>=2020-09-01and__time<=2020-09-30
group by foods _ id (this step is a summary of the second data table)
// orderby goods _ id desc (this step is to sort the summary results)
limit 100 (this step is to carry out limit on the sequencing result)
And then, the clickhouse returns the result meeting the condition to the Spark SQL engine, and the Spark SQL engine obtains a third result set.
After obtaining the third result set, the Spark SQL engine performs a joint operation on the second result set and the third result set, calculates a collection set thereof, performs a distinting repeated result removing operation, removes data repeated in an analysis dimension in the collection set to obtain a fourth result set, for example, the analysis dimension of the second result set and the third result set has a refrigerator product, so that the repeated data needs to be deleted, performs a broadcast exchange algorithm exchange on the data in the first result set to obtain a second exchange set, performs a join calculation on the second exchange set and the fourth result set by using a broadcast hash, performs sort and limit operation on the results returned to the rows, performs project selection on corresponding columns to obtain a second target result set, and extracts target data from the second target result set as a query result.
In a preferred embodiment, in order to further improve the efficiency of data query, before performing the summarization, sorting, and limit operations in the query request in the first database in which the first data table is located, the method further includes: and judging whether the data volume in the first data table and the second data table meets a preset data volume condition, and if so, executing the operations of summarizing, sorting and limit in the query request in a first database where the first data table is located.
In the invention, the data query method is to execute the operations of summarization, sorting and limit in the query request in a first database where a first data table is located to obtain a first result set, to use the dimension value in the first result set as a dynamic filtering condition, executing the dynamic filtering operation corresponding to the dynamic filtering condition and the summarizing operation in the query request in a second database in which a second data table is located to obtain a second result set, that is, summarizing calculations of the first data table and the second data table have an execution sequence, in a general data query method, the summary calculation of the first data table and the second data table is performed concurrently, so if the data amount of the first data table and the second data table is relatively small, or the returned data volume is small, and it is possible that the query time is long because the first data table is summarized and then the second data table is summarized and calculated. Therefore, the cost of the data query needs to be judged before the summarization, sorting and limit operations in the query request are executed in the first database where the first data table is located to obtain the first result set, and the summarization, sorting and limit operations in the query request are executed in the first database where the first data table is located only after whether the data volume in the first data table and the second data table meets the preset data volume condition or not.
The Cost-Based Optimizer (Cost-Based optimization) technique is used here to calculate the Cost of all possible physical plans and pick out the physical execution plan with the smallest Cost. The core of the method is that statistical information is collected in advance, and the cost is calculated according to the characteristics (such as size and distribution) of data and the characteristics (distribution and size of an intermediate result set) of an operation operator, so that a physical execution plan with the minimum execution cost is selected better. However, how to estimate the cost by using the statistics and calculate the size of the summarized return value by using the statistics, the grouping summary fields and the filtering conditions is different from the cost calculation model of various olap databases and is out of the discussion of the present invention.
According to the invention, based on a cost optimization technology, cost estimation is carried out before summarizing, sorting and limit operation in the query request is executed in a first database where a first data table is located, and the following indexes are collected in the actual data query process: and evaluating index values (such as calculated data quantity and returned data quantity) during cost calculation and actual execution results (actual query duration and actual returned value of each summary calculation), and evaluating the effect of the acquisition index values. If the difference between the evaluation value and the actual value is large, further optimization is needed, the error reasons of statistic collection and cost calculation in cost calculation are evaluated, the algorithm is improved, the evaluation error is reduced, and the calculation precision is improved.
In a preferred embodiment, the preset data volume condition is: the ratio of the data amount in the first data table to the data amount in the second data table exceeds a preset ratio, or the data amount in the first data table and the data amount in the second data table both exceed a preset amount.
In the present embodiment, for simplicity, the cost estimation is first simplified to a preset data volume condition, which specifically includes the following two aspects:
in a first aspect, a ratio of the data amount in the first data table to the data amount in the second data table exceeds a preset ratio, in which case the data amount in the first data table is relatively small, and the data amount in the second data table is relatively large, and if the total calculation takes a long time, the preset ratio may be set according to an actual situation, and a general value is set to 5, that is, the data amount in the second data table is 5 times the data amount in the first data table.
In a second aspect, the amount of data in the first data table and the amount of data in the second data table both exceed a preset amount, in which case, the preset amount may be set according to actual conditions, for example, the total amount of data is estimated to be more than 100 ten thousand.
Any one of the above two conditions may be satisfied.
In a preferred embodiment, in order to further improve the efficiency of data query, the summarization, sorting, and limit operations in the query request are executed in the first database where the first data table is located, and the size of the preset limit value of the limit operation needs to be determined, if the preset limit value is smaller than the limit threshold, for example, the limit threshold may be 500, that is, the preset limit value is smaller, the summarization, sorting, and limit operations in the query request are directly executed in the first database where the first data table is located, that is, the summarization, sorting, and limit operations are pushed down to the first database by the query engine, and the corresponding query result can be returned in a shorter time.
In a preferred embodiment, when the concatenation algorithm is a left outer concatenation, the first data table is a left table, and when the concatenation algorithm is a right outer concatenation, the first data table is a right table.
The query engines adopted in the above examples are Spark SQL engines, the most important two functions in Spark3.0 are dynamic partial pruning and active query engine functions, which can greatly improve query performance in a specific scene, but the invention further improves query performance in a specific scene, which is different from Spark. First, dynamic partition pruning of Spark3.0 is directed to the hive partition table, and the partition field must be inside the join's on condition; the method comprises the steps of inquiring a dimension table to obtain a filter dimension value of a partition field, dynamically partitioning and cutting a hive partition table, reducing data scanning amount and improving query performance. Secondly, the adaptive query engine function of spark3.0 is used for solving the dynamic optimization problem of dynamic adjustment/data inclination/execution plan of Reduce number, and the invention is further improved and mainly used for solving the problem that the data amount in the first data table is smaller than the preset limit value of limit operation, and the development of extended functions is carried out on the adaptive execution framework of spark 3.0.
Of course, the present invention is only exemplified by Spark SQL engine, the optimization method of the present invention can be applied to similar federal query engines such as presto, inpala, drill, etc., and the databases used can be various, including but not limited to clickhouse, doris, druid, kylin, elastic search, mysql, postgres, kudu, hbase, etc.
If the connection algorithm is one of full external connection, left external connection and right external connection, the scene is the summary value sequence in the first data table, and the summary value in the second data table is not sequenced, the summary, sequence and limit operations in the query request are executed in a first database where the first data table is located, so that a first result set is obtained; and the dimension value in the first result set is used as the dynamic filtering condition, the dynamic filtering operation corresponding to the dynamic filtering condition and the summarizing operation in the query request are executed in the second database to obtain a second result set, so that the data volume of the first result set and the second result set returned to the query engine is obviously reduced, the pressure of network data transmission is reduced, the data volume summarized in the second database is obviously reduced by using the dimension value in the first result set as the dynamic filtering condition in the second database, the time of returning the query result is shortened, and the query performance of the query engine is improved. When the operation mode is the full external connection, when the data volume in the first data table is considered to be smaller than the preset limit value of the limit operation, the summarizing, sorting and limit operations in the query request are executed in the second database to obtain a third result set, and the third result set is obtained, so that the data returned by the second database are dynamically adjusted, and the result error condition caused by summarizing dimension information which is not in the first data table possibly exists in the second data table is avoided. According to the invention, before the summarizing, sorting and limit operations in the query request are executed in the first database where the first data table is located to obtain the first result set, the cost for executing the summarizing, sorting and limit operations in the first database is also considered, and the summarizing, sorting and limit operations are executed in the first database only after the preset data volume condition is met, so that the condition that the cost is high due to the fact that the summarizing, limit, filtering and other operations are directly pushed down to the database is avoided.
Example two
As shown in fig. 5, there is provided a data query apparatus, the apparatus including:
the analysis module is used for analyzing the query request and acquiring a connection algorithm and a corresponding scene in the query request;
a first result set obtaining module, configured to, if the connection algorithm is one of full external connection, left external connection, and right external connection, and the scenario is a summary value ranking in the first data table, and a summary value in the second data table is not ranked, execute the summary, ranking, and limit operations in the query request in a first database where the first data table is located to obtain a first result set;
and the second result set acquisition module is used for taking the dimension value in the first result set as a dynamic filtering condition, and executing dynamic filtering operation corresponding to the dynamic filtering condition and summarizing operation in the query request in a second database where a second data table is located to obtain a second result set.
In one embodiment, the data query apparatus further includes a first target result set obtaining module, configured to obtain a first target result set by using the connection algorithm on the first result set and the second result set; and the query result acquisition module is used for extracting first target data from the first target result set as a query result.
In an embodiment, if the connection algorithm is an all-external connection and the data size in the first data table is smaller than a preset limit value of a limit operation, the data query device further includes a third result set obtaining module, configured to execute the operations of collecting, sorting, and limiting in the query request in the second database to obtain a third result set.
In an embodiment, the data query apparatus further includes a fourth result set obtaining module, configured to combine the second result set with the third result set, and obtain a fourth result set after removing duplicate results; the query result obtaining module is further configured to obtain a second target result set from the first result set and the fourth result set by using the connection algorithm, and extract second target data from the second target result set as a query result.
In an embodiment, the data query apparatus further includes a determining module, configured to determine whether a data amount in the first data table and the second data table meets a preset data amount condition, and if so, perform the operations of summarizing, sorting, and limit in the query request in a first database where the first data table is located.
In one embodiment, the preset data volume condition is: the ratio of the data amount in the first data table to the data amount in the second data table exceeds a preset ratio, or the data amount in the first data table and the data amount in the second data table both exceed a preset amount.
In one embodiment, the first data table is a left table when the connection algorithm is a left outer connection and a right table when the connection algorithm is a right outer connection.
The embodiment provides a data query device, which belongs to the same application concept as the data query method provided by the embodiment, can execute the data query method of the embodiment, and has corresponding functional modules and beneficial effects of the data query method. For details of the technology that are not described in detail in this embodiment, reference may be made to the data query method provided in this embodiment, and details are not described here again.
EXAMPLE III
In this embodiment, a computer device is provided, which may be a server. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data query method.
In an embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the data query method according to the embodiment when executing the computer program, and please refer to the description in the first embodiment for the implementation process of the method and the technical effects that can be achieved, which are not described herein again.
Example four
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, where the computer program is executed by a processor to implement the data query method according to the embodiment, and the execution process and the achievable technical effects of the method refer to the description in the first embodiment, which is not repeated herein
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for data query, the method comprising the steps of:
analyzing the query request, and acquiring a connection algorithm and a corresponding scene in the query request;
if the connection algorithm is one of full external connection, left external connection and right external connection, the scene is the summary value sequence in the first data table, and the summary value in the second data table is not sequenced, the summary, sequence and limit operations in the query request are executed in a first database where the first data table is located, and a first result set is obtained;
and taking the dimension value in the first result set as a dynamic filtering condition, and executing a dynamic filtering operation corresponding to the dynamic filtering condition and a summarizing operation in the query request in a second database where a second data table is located to obtain a second result set.
2. The method of claim 1, further comprising:
and acquiring a first target result set by using the first result set and the second result set through the connection algorithm, and extracting first target data from the first target result set as a query result.
3. The method of claim 1, wherein if the join algorithm is an all-outshot join and the amount of data in the first data table is less than a preset limit value of a limit operation, the method further comprises performing a summarization, sorting, and limit operation in the query request in the second database to obtain a third result set.
4. The method of claim 3, further comprising:
combining the second result set and the third result set, and removing repeated results to obtain a fourth result set;
and acquiring a second target result set by using the first result set and the fourth result set through the connection algorithm, and extracting second target data from the second target result set as a query result.
5. The method according to any one of claims 1 to 4, wherein before performing the summarization, sorting and limit operation in the query request in the first database where the first data table is located, the method further comprises:
and judging whether the data volume in the first data table and the second data table meets a preset data volume condition, and if so, executing the operations of summarizing, sorting and limit in the query request in a first database where the first data table is located.
6. The method according to claim 5, wherein the preset data volume condition is: the ratio of the data amount in the first data table to the data amount in the second data table exceeds a preset ratio, or the data amount in the first data table and the data amount in the second data table both exceed a preset amount.
7. The method of claim 1, wherein the first data table is a left table when the join algorithm is a left outer join and a right table when the join algorithm is a right outer join.
8. A data query apparatus, characterized in that the apparatus comprises:
the analysis module is used for analyzing the query request and acquiring a connection algorithm and a corresponding scene in the query request;
a first result set obtaining module, configured to, if the connection algorithm is one of full external connection, left external connection, and right external connection, and the scenario is a summary value ranking in the first data table, and a summary value in the second data table is not ranked, execute the summary, ranking, and limit operations in the query request in a first database where the first data table is located to obtain a first result set;
and the second result set acquisition module is used for taking the dimension value in the first result set as a dynamic filtering condition, and executing dynamic filtering operation corresponding to the dynamic filtering condition and summarizing operation in the query request in a second database where a second data table is located to obtain a second result set.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011547152.5A 2020-12-24 2020-12-24 Data query method and device, computer equipment and storage medium Pending CN112685444A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011547152.5A CN112685444A (en) 2020-12-24 2020-12-24 Data query method and device, computer equipment and storage medium
CA3143821A CA3143821A1 (en) 2020-12-24 2021-12-23 Data query method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011547152.5A CN112685444A (en) 2020-12-24 2020-12-24 Data query method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112685444A true CN112685444A (en) 2021-04-20

Family

ID=75451917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011547152.5A Pending CN112685444A (en) 2020-12-24 2020-12-24 Data query method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112685444A (en)
CA (1) CA3143821A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313268A (en) * 2021-06-11 2021-08-27 杭州煋辰数智科技有限公司 Federal learning-based prediction method and device, storage medium and remote sensing equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968420A (en) * 2011-08-31 2013-03-13 国际商业机器公司 Database query method and system
CN106156168A (en) * 2015-04-16 2016-11-23 华为技术有限公司 The method of data is being inquired about and across subregion inquiry unit in partitioned data base
CN108108434A (en) * 2017-12-19 2018-06-01 福建中金在线信息科技有限公司 A kind of method and device for managing database
CN109240901A (en) * 2018-08-28 2019-01-18 北京小度信息科技有限公司 Method for analyzing performance, performance evaluation device, storage medium and electronic equipment
CN110009514A (en) * 2019-03-07 2019-07-12 平安科技(深圳)有限公司 Extracting method, device, terminal and the computer readable storage medium of data
CN110297858A (en) * 2019-05-27 2019-10-01 苏宁云计算有限公司 Optimization method, device, computer equipment and the storage medium of executive plan
CN110837585A (en) * 2019-11-07 2020-02-25 中盈优创资讯科技有限公司 Multi-source heterogeneous data association query method and system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968420A (en) * 2011-08-31 2013-03-13 国际商业机器公司 Database query method and system
CN106156168A (en) * 2015-04-16 2016-11-23 华为技术有限公司 The method of data is being inquired about and across subregion inquiry unit in partitioned data base
CN108108434A (en) * 2017-12-19 2018-06-01 福建中金在线信息科技有限公司 A kind of method and device for managing database
CN109240901A (en) * 2018-08-28 2019-01-18 北京小度信息科技有限公司 Method for analyzing performance, performance evaluation device, storage medium and electronic equipment
CN110009514A (en) * 2019-03-07 2019-07-12 平安科技(深圳)有限公司 Extracting method, device, terminal and the computer readable storage medium of data
CN110297858A (en) * 2019-05-27 2019-10-01 苏宁云计算有限公司 Optimization method, device, computer equipment and the storage medium of executive plan
CN110837585A (en) * 2019-11-07 2020-02-25 中盈优创资讯科技有限公司 Multi-source heterogeneous data association query method and system

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113313268A (en) * 2021-06-11 2021-08-27 杭州煋辰数智科技有限公司 Federal learning-based prediction method and device, storage medium and remote sensing equipment

Also Published As

Publication number Publication date
CA3143821A1 (en) 2022-06-24

Similar Documents

Publication Publication Date Title
EP3117347B1 (en) Systems and methods for rapid data analysis
Cormode Data sketching
Cormode Sketch techniques for approximate query processing
CN108376143B (en) Novel OLAP pre-calculation system and method for generating pre-calculation result
US9235622B2 (en) System and method for an efficient query sort of a data stream with duplicate key values
US9317550B2 (en) Query expansion
Istvan et al. Histograms as a side effect of data movement for big data
CN102591948B (en) Method and system for improving search results based on user behavior analysis
US11803521B2 (en) Implementation of data access metrics for automated physical database design
CN101963965A (en) Document indexing method, data query method and server based on search engine
US20040002956A1 (en) Approximate query processing using multiple samples
Cuzzocrea et al. Approximate range–sum query answering on data cubes with probabilistic guarantees
CN112286961B (en) SQL optimization query method and device
CN112559567A (en) Query method and device suitable for OLAP query engine
CN108459965B (en) Software traceable generation method combining user feedback and code dependence
Kulessa et al. Model-based approximate query processing
US20160154805A1 (en) Estimating most frequent values for a data set
CN112685444A (en) Data query method and device, computer equipment and storage medium
CN114238360A (en) User behavior analysis system
CN106919566A (en) A kind of query statistic method and system based on mass data
US11625399B2 (en) Methods and devices for dynamic filter pushdown for massive parallel processing databases on cloud
Xie et al. OLAP over probabilistic data cubes II: parallel materialization and extended aggregates
CN105589969A (en) Data processing method and device
CN106933909B (en) Multi-dimensional data query method and device
CN112347098A (en) Database table splitting method and system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination