CN112286961A

CN112286961A - SQL optimization query method and device

Info

Publication number: CN112286961A
Application number: CN202011048287.7A
Authority: CN
Inventors: 卢勇亮; 郑尧峰; 汤国强; 桑强; 孙迁
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: Suning Cloud Computing Co Ltd
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2021-01-29
Anticipated expiration: 2040-09-29
Also published as: CA3132004A1; CN112286961B

Abstract

The invention discloses an SQL optimization query method and device, relates to the technical field of big data, and can avoid the starting of a fusing mechanism and improve the query efficiency. The method comprises the following steps: when the SQL query statement is analyzed to be a UNION set type query statement, counting the number of storage segments related to target data, wherein the UNION set type query statement comprises a plurality of sub query statements; when the number of related storage segments is larger than a threshold value, splitting the query cycle corresponding to the sub-query statement into a plurality of granularity cycles, summarizing target data dimensions corresponding to the sub-query statements, and adding the summarized target data dimensions as new conditions into the filtering conditions of the sub-query statement; and pushing the sub-query statements containing the filter conditions down to a computing engine, reading the target data corresponding to each granularity period and the dimension value in the target data, and returning a query result to the user after the target data is computed by the processing end. The device is applied with the method provided by the scheme.

Description

SQL optimization query method and device

Technical Field

The invention relates to the technical field of big data, in particular to a SQL optimization query method and a device.

Background

With the continuous increase of the business data volume, the data volume of the data source which is taken into the drive engine by the flow model is huge, and each time when SQL carries out the same-round-ratio query, all data in the SQL specified period need to be searched by the union all clause of the inner layer. In order to protect the effective query of the pipeline engine to other tasks, the pipeline engine can judge the total number of segments in the current SQL period and start a fusing mechanism when the total number of segments exceeds a threshold value, and even if the total number of segments does not exceed the threshold value, the query time can be increased by adopting a full-scale query mode, and the query efficiency is reduced.

Disclosure of Invention

The invention aims to provide a method and a device for SQL optimization query, which can avoid the starting of a fusing mechanism and improve the query efficiency.

In order to achieve the above object, a first aspect of the present invention provides a method for optimizing a query in SQL, comprising:

when the SQL query statement is analyzed to be a UNION set type query statement, counting the number of storage segments related to target data, wherein the UNION set type query statement comprises a plurality of sub query statements;

when the number of the related storage segments is larger than a threshold value, splitting the query cycle corresponding to the sub-query statement into a plurality of granularity cycles, summarizing target data dimensions corresponding to the sub-query statements, and adding the summarized target data dimensions as new conditions into the filtering conditions of the sub-query statement;

and pushing the sub-query statements containing the filter conditions down to a computing engine, reading the target data corresponding to each granularity period and the dimension value in the target data, and returning a query result to a user after the target data is computed by a processing end.

Preferably, when the number of the storage segments is greater than the threshold, the method for splitting the query cycle corresponding to the sub-query statement into a plurality of granularity cycles includes:

extracting a query cycle corresponding to each sub-query statement, and averagely dividing each query cycle into a plurality of granularity cycles according to preset granularity.

Preferably, the method for adding the collected target data dimensions corresponding to the multiple sub-query statements as the added conditions into the filter conditions of the sub-query statements comprises the following steps:

identifying a target data dimension corresponding to each sub-query statement, and when the target data dimensions of the sub-query statements are consistent, adding the target data dimension of any sub-query statement as a newly added condition into the filtering condition of each sub-query statement;

and when the target data dimensions of the sub-query statements are not consistent, summarizing and combining the target data dimensions of the sub-query statements respectively, and adding the combined target data dimensions as newly added conditions into the filtering conditions of each sub-query statement.

Preferably, the query statement of the UNION set type at least includes a first periodic sub-query statement and a second periodic sub-query statement, and the filter condition includes a query metric, a dimension rank and a target data query quantity in addition to the newly added condition.

Preferably, the method for the computing engine to concurrently read the target data and the dimension value thereof corresponding to each granularity period includes:

and respectively querying the target data meeting the target data query quantity condition from the corresponding storage segment of the corresponding data source and extracting the dimension value of the target data according to each sub-query statement and the filter condition in the sub-query statement by the computing engine.

Further, the method for returning the query result to the user after the computation of the processing end comprises the following steps:

and the processing terminal returns the query result to the user after calculating according to a preset rule according to the dimension value returned by each sub-query statement.

Preferably, the step of pushing down the sub-query statement including the filter condition to the calculation engine and reading the target data corresponding to each granularity period and the dimension value therein, and after the target data and the dimension value are calculated by the processing end and the query result is returned to the user, the method further includes:

and counting hot data sources in the historical SQL query statement and hot data dimensions in the historical SQL query statement, and pre-extracting dimensional values corresponding to the hot data dimensions for quick calling of subsequent SQL query statements.

and caching each sub-query statement, the corresponding target data and the dimension value in the sub-query statement in the database so as to be quickly called from the database when the same sub-query statement appears in the follow-up process.

Compared with the prior art, the SQL optimization query method provided by the invention has the following beneficial effects:

in the SQL optimization query method provided by the invention, the number of storage segments related to target data is required to be pre-counted before a computing engine queries the target data according to an SQL query statement, when the number of the related storage segments is larger than a threshold value, a fusing mechanism of the computing engine can be triggered if the full-scale query is directly performed, so the invention adopts the scheme for the situation that the query cycle corresponding to the sub-query statement is divided into a plurality of granularity cycles, simultaneously, in order to avoid the problem that the query result cannot be output due to different target data dimensions corresponding to different sub-query statements, the target data dimensions corresponding to a plurality of sub-query statements are required to be summarized as new adding conditions into the filtering conditions of each sub-query statement before the query, after the operation is completed, the sub-query statements containing the filtering conditions are finally pushed down to the computing engine to read the target data corresponding to each granularity cycle and the dimension values therein, and returning a query result to the user after the computation of the processing terminal.

Therefore, the method and the device not only solve the problem that a fusing mechanism of a calculation engine is easy to trigger in the process of full-scale query, but also ensure that the dimensions of the target data corresponding to each sub-query statement are uniform, and ensure that the query result can be output.

A second aspect of the present invention provides an apparatus for SQL optimization query, which is applied to the method for SQL optimization query according to the foregoing technical solution, and the apparatus includes:

the statement analyzing unit is used for counting the number of storage segments related to target data when an SQL query statement is analyzed to be a UNION set type query statement, and the UNION set type query statement comprises a plurality of sub query statements;

the identification processing unit is used for splitting the query cycle corresponding to the sub-query statement into a plurality of granularity cycles when the number of the related storage segments is larger than a threshold value, summarizing target data dimensions corresponding to the sub-query statements and adding the summarized target data dimensions as new conditions into the filter conditions of the sub-query statements;

and the query unit is used for pushing down the sub-query statements containing the filter conditions to a calculation engine, reading the target data corresponding to each granularity period and the dimension value in the target data, calculating by the processing end and returning a query result to the user.

Compared with the prior art, the beneficial effects of the SQL optimizing query device provided by the invention are the same as those of the SQL optimizing query method provided by the technical scheme, and are not described herein again.

A third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described SQL optimized query method.

Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the invention are the same as the beneficial effects of the SQL optimization query method provided by the technical scheme, and the detailed description is omitted here.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a partial schematic flow chart of an SQL optimization query method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a full flow of an SQL optimization query method in the embodiment of the present invention;

FIG. 3 is a diagram illustrating an example of an SQL query optimization method according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

Referring to fig. 1, the present embodiment provides an SQL optimizing query method, including:

when the SQL query statement is analyzed to be a UNION set type query statement, counting the number of storage segments related to target data, wherein the UNION set type query statement comprises a plurality of sub query statements; when the number of related storage segments is larger than a threshold value, splitting the query cycle corresponding to the sub-query statement into a plurality of granularity cycles, summarizing target data dimensions corresponding to the sub-query statements, and adding the summarized target data dimensions as new conditions into the filtering conditions of the sub-query statement; and pushing the sub-query statements containing the filter conditions down to a computing engine, reading the target data corresponding to each granularity period and the dimension value in the target data, and returning a query result to the user after the target data is computed by the processing end.

In the SQL optimized query method provided in this embodiment, the number of storage segments related to target data needs to be pre-counted before a computing engine queries the target data according to an SQL query statement, and when the number of storage segments related to the target data is greater than a threshold, if a full-scale query is directly performed, a fusing mechanism of the computing engine may be triggered, so the embodiment adopts a scheme for such a situation that a query cycle corresponding to a sub-query statement is split into multiple granularity cycles, and meanwhile, in order to avoid a problem that query results cannot be output due to different target data dimensions corresponding to different sub-query statements, a target data dimension summary corresponding to multiple sub-query statements needs to be added as a new condition to a filtering condition of each sub-query statement before querying, after the operation is completed, the sub-query statement including the filtering condition is finally pushed down to the computing engine to read the target data corresponding to each granularity cycle and a dimension value therein, and returning a query result to the user after the computation of the processing terminal.

Therefore, the problem that a computing engine fusing mechanism is easily triggered in the full-scale query process is solved, the dimension of the target data corresponding to each sub-query statement is uniform, and the query result can be output.

In specific implementation, the query statement of the UNION set type is usually represented as UNION ALL, and includes a plurality of sub-query statements, such as a first periodic sub-query statement and a second periodic sub-query statement, and the filter condition (limit condition) includes a query metric, a dimension order, a target data query quantity, and the like, in addition to the new addition condition. The computing engine is a pipeline, the data source is a database, the storage segment is a segment, and the processing end is a spark end, wherein the target data query quantity in the filtering condition is mainly used for limiting the data reading quantity of the computing engine target, so as to avoid triggering a fusing mechanism.

In the above embodiment, when the number of the storage segments is greater than the threshold, the method for splitting the query cycle corresponding to the sub-query statement into multiple granularity cycles includes:

Illustratively, the SQL query statement is composed of an inner layer statement and an outer layer statement, taking a circle ratio cycle query as an example, the inner layer statement includes a first cycle sub-query statement and a second cycle sub-query statement, that is, a current cycle sub-query statement and a historical cycle sub-query statement, the inner layer statement mainly functions to read target data corresponding to the first cycle sub-query statement and the second cycle sub-query statement from the druid, and the outer layer statement mainly functions to perform corresponding circle ratio calculation on target data dimension values read by the inner layer statement at the spark end.

In specific implementation, please refer to fig. 2, the logic plan tree is obtained by parsing the SQL query statement, metadata information that can be used for clipping is extracted, for example, the metadata information includes a storage segment query instruction, a target data dimension query instruction, a target data measurement query instruction, a target data cycle query instruction, and the like, which are related to the target data, and then whether the number of storage segments related to the target data in the query cycle is greater than a threshold is checked, if the number of storage segments is not greater than the threshold, the whole sub-query of the SQL query statement is pushed down to the druid for full-scale query by adopting the prior art, and if the number of storage segments is greater than the threshold, the query cycle is averagely split into multiple granularity cycles for concurrent query, so as to avoid triggering. For example, the minimum granularity is day when the database is ingested, and if the current SQL query statement needs to query data for one month, the query period for one month is split into 30 granularity periods. And then, sorting the SQL query statements corresponding to each granularity period, wherein the purpose is to enable the SQL query statements corresponding to each granularity period to only retain required target dimension data and filter out other irrelevant derived measurement data. After the SQL information corresponding to the granularity periods is sorted, the SQL information is converted into json statements which can be received by the druid, then the druid is requested to ask for target data, after the druid queries the target data corresponding to each granularity period, the dimension set corresponding to each granularity period and the corresponding dimension value are returned, then the data sets returned by all the granularity periods are merged, and a data set which is not more than the query quantity of the target data is obtained through a selection sorting algorithm.

Illustratively, referring to fig. 3, the result of increase of the same-ring ratio in this month and the previous month is summarized through SQL query statements, and the number of target data queries in the filter condition is set to 500 pieces of data. If the full data query with a period of one month is directly performed at the druid, and there are 2 billion pieces of target data in a large druid table, in this embodiment, one month is split into 30 granularity periods, each granularity period corresponds to approximately 10 million pieces of target data, the data sets corresponding to each granularity period in the current month and the previous month are concurrently read by the druid, because the query number of the target data in the filtering condition is 500, only the first 500 pieces of target data are filtered from each granularity period, then the target data in the current month and the previous month belonging to the same granularity period are aggregated into 500 pieces of target data, then the 500 pieces of target data in each granularity period are scanned by the druid, and after the filtering is performed again, the finally obtained 500 query results are returned to the user. Compared with the scheme of full query adopted in the prior art, the embodiment only sets the query quantity of the target data in the filtering condition to be 500 items according to needs, and then only calculates and reads the index data of 500 items in the front of the sequence according to the filtering condition in the drive, so that the calculation pressure of the drive is greatly reduced.

In the above embodiment, the method for adding the collected target data dimensions corresponding to the multiple sub-query statements as the new conditions into the filter conditions of the sub-query statements includes:

identifying the target data dimension corresponding to each sub-query statement, and when the target data dimensions of the sub-query statements are consistent, adding the target data dimension of any sub-query statement as a newly added condition into the filtering condition of each sub-query statement; and when the target data dimensions of the sub-query statements are not consistent, summarizing and combining the target data dimensions of the sub-query statements respectively, and adding the combined target data dimensions as new conditions into the filtering conditions of each sub-query statement.

In specific implementation, in order to prevent the problem that the spark end cannot perform corresponding operation according to the target data acquired by the pipeline based on the first cycle and the target data acquired based on the second cycle due to the fact that the target data dimensions corresponding to the first cycle and the second cycle are not consistent, if the target data dimension number corresponding to the first cycle is 100 and the target data dimension number corresponding to the second cycle is 150, the target data dimension numbers corresponding to the first cycle and the second cycle need to be merged together to be used as a newly added condition to be added to the filtering condition of each sub query statement, so that the merged target data dimension can cover the dimensions of the target data corresponding to the first cycle and the second cycle.

In the above embodiment, the method for concurrently reading, by the compute engine, the target data and the dimension value thereof corresponding to each granularity cycle includes:

In the above embodiment, the method for returning the query result to the user after computation by the processing end includes:

and the processing terminal returns the query result to the user after calculating according to the preset rule according to the dimension value returned by each sub-query statement. For example, for the calculation that the preset rule is the growth rate, the dimension value acquired in the first period may be divided by the dimension value acquired in the second period to obtain the growth rate result.

In the above embodiment, the step of pushing down the sub-query statement including the filter condition to the calculation engine and reading the target data and the dimension value in the target data corresponding to each granularity period, and after the target data and the dimension value are calculated by the processing end and the query result is returned to the user, the method further includes:

In specific implementation, statistics is continuously carried out in the process of executing the SQL query statement, a most frequently used data source and a dimension field of the data source are used, a pipeline statement is dynamically constructed in the valley period of query, the dimension and the dimension value are pre-extracted, the dimension and the dimension value are pre-stored like a materialized view, and then the fact that if the query requirements of the same data source and the dimension value exist, the data can be quickly extracted from the materialized view without requesting the pipeline to fetch corresponding dimension value set data is discovered, so that quick calling is achieved.

Specifically, if the current SQL query statement includes the query result of the history cycle, the current SQL query statement query cycle may be split, the query result of the history cycle included therein may be directly retrieved from the cache library, and the current history cycle not included therein is obtained from the druid query at the earlier stage. Such as: dimension data from No. 1 to No. 20 are cached, if the current query is the dimension from No. 1 to No. 30, only the granularity period from No. 21 to No. 30 needs to be pressed down to query the drive, and the pressure of the drive is greatly reduced.

After the target data after the current early stage and the historical stage are combined is obtained, the sub-query statements corresponding to the SQL query statements can be sorted:

1. sorting the group by dimension fields of the sub-query statements;

2. sorting the derived indexes needing to be calculated by the sub query sentences;

3. summarizing target data dimensions corresponding to the sub-query statements and adding the summarized target data dimensions as newly added conditions into the filter conditions of the sub-query statements;

after each sub-query statement is sorted, the sub-query statement is converted into a json statement of a drive computing engine, and because a filter condition is added, the dimension information to be queried by the sub-query statement is larger than the filter condition. That is, the drive does not need to scan all data in the query period, so that the problems of fusing and query timeout abnormality do not occur, and the query pressure of the drive is reduced.

And because the data volume returned by the pipeline is filtered, the spark end does not need to receive all data in the query period, the data processing volume of the spark end is reduced, and the overall efficiency is greatly improved.

Example two

The present embodiment provides an SQL optimizing query apparatus, including:

Compared with the prior art, the beneficial effects of the SQL optimization query device provided by the embodiment of the invention are the same as the beneficial effects of the SQL optimization query method provided by the first embodiment, and are not described herein again.

EXAMPLE III

The present embodiment provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the SQL optimized query method are executed.

Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the embodiment are the same as the beneficial effects of the SQL optimization query method provided by the above technical scheme, and are not described herein again.

It will be understood by those skilled in the art that all or part of the steps in the method for implementing the invention may be implemented by hardware instructions related to a program, the program may be stored in a computer-readable storage medium, and when executed, the program includes the steps of the method of the embodiment, and the storage medium may be: ROM/RAM, magnetic disks, optical disks, memory cards, and the like.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. An SQL optimization query method is characterized by comprising the following steps:

2. The method of claim 1, wherein splitting the query cycle corresponding to the sub-query statement into a plurality of granularity cycles when the number of the storage segments is greater than a threshold value comprises:

3. The method according to claim 1 or 2, wherein the method for adding the collected target data dimensions corresponding to the plurality of sub-query statements as the added conditions into the filter conditions of the sub-query statements comprises:

4. The method of claim 3, wherein the UNION set type query statements include at least a first periodic sub-query statement and a second periodic sub-query statement, and wherein the filter conditions include a query metric, a dimension ordering, and a target data query quantity in addition to the new addition condition.

5. The method of claim 4, wherein the method for the compute engine to concurrently read the target data and the dimension value thereof corresponding to each granularity period comprises:

6. The method of claim 5, wherein the method for returning the query result to the user after the processing end computing comprises:

7. The method according to claim 1, wherein the pushing down the sub-query statement containing the filter condition to the compute engine and reading the target data and the dimension value in the target data corresponding to each granularity period in step, and after the query result is returned to the user after the computation by the processing end, the method further comprises:

8. The method according to claim 1, wherein the pushing down the sub-query statement containing the filter condition to the compute engine and reading the target data and the dimension value in the target data corresponding to each granularity period in step, and after the query result is returned to the user after the computation by the processing end, the method further comprises:

9. An SQL optimized query device, comprising:

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of the claims 1 to 8.