CN114138814A - Data query method, device, platform and storage medium - Google Patents


Publication number
CN114138814A
Authority
CN
China
Prior art keywords
query
index
data
target
plan
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111441654.4A
Other languages
Chinese (zh)
Inventor
王卓
陈祥麟
艾智远
易乐天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN202111441654.4A priority Critical patent/CN114138814A/en
Publication of CN114138814A publication Critical patent/CN114138814A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a data query method, which comprises the following steps: acquiring a data query request; generating at least one reference query plan based on the data query request; if it is determined that a target index having an association relation with the data query request exists in the distributed file system, determining a target query plan which has the minimum query cost and comprises the target index from at least one reference query plan; wherein the distributed file system is at least used for storing aggregated index data; determining a query result corresponding to the data query request based on the target query plan. The embodiment of the application also discloses a data query device, a platform and a storage medium.

Description

Data query method, device, platform and storage medium
Technical Field
The present application relates to the field of communications technologies, and in particular, to a data query method, apparatus, platform, and storage medium.
Background
With the rapid development of internet technology, big data attracts more and more attention. In the big data era, analysis of massive data is an important technology for changing data into value. At present, mass data analysis is mainly implemented by querying mass data through Structured Query Language (SQL) statements.
However, in the process of querying mass data by using SQL statements at present, the data volume is huge, so that the time spent in the query process is long, the data analysis process is very slow, the timeliness of decision making based on data analysis is poor, and the value of data is seriously reduced.
Summary of the Application
In view of this, embodiments of the present application are expected to provide a data query method, apparatus, platform, and storage medium, to solve the problem that a long query time is usually required in the query process of mass data at present, and provide a method for quickly querying mass data, which shortens the time required for querying mass data, effectively ensures timeliness of data analysis, and improves the utilization value of data.
To achieve the above purpose, the technical solution of the present application is implemented as follows:
In a first aspect, a data query method is provided, the method including:
acquiring a data query request;
generating at least one reference query plan based on the data query request;
if it is determined that a target index having an association relation with the data query request exists in the distributed file system, determining a target query plan which has the minimum query cost and comprises the target index from at least one reference query plan; wherein the distributed file system is at least used for storing aggregated index data;
determining a query result corresponding to the data query request based on the target query plan.
Optionally, before determining, from the at least one reference query plan, the target query plan which has the minimum query cost and comprises the target index if it is determined that the target index having an association relation with the data query request exists in the distributed file system, the method further includes:
determining index configuration parameters for data to be analyzed;
determining an index data generation task based on the index configuration parameters;
generating index construction indication information based on the index data generation task;
sending the index construction indication information to the distributed cluster; wherein the index construction indication information is used for instructing the distributed cluster to generate aggregated index data based on the index data generation task and to store the aggregated index data in the distributed file system.
Optionally, the determining an index data generation task based on the index configuration parameter includes:
determining a common connection relation among a plurality of configuration fields in the index configuration parameter and a dependency relation among the plurality of configuration fields in the index configuration parameter;
and determining the index data generation task based on the connection relation and the dependency relation.
Optionally, before generating the index construction indication information based on the index data generation task, the method further includes:
determining a historical time period to be subjected to aggregation index processing;
correspondingly, the generating of the index construction indication information based on the index data generation task includes:
generating the index construction indication information based on the index data generation task and the historical time period; wherein the index construction indication information is used for instructing the distributed cluster to execute the index data generation task on the data in the historical time period to generate the aggregated index data.
Optionally, the sending the index construction indication information to the distributed cluster includes:
sending the index construction indication information to the distributed cluster when the current time is detected to be a preset time.
Optionally, the determining, based on the target query plan, a query result corresponding to the data query request includes:
splitting the target query plan to obtain at least one reference physical plan execution fragment;
determining the query result based on the at least one reference physical plan execution fragment.
Optionally, the determining the query result based on the at least one reference physical plan execution fragment includes:
determining, from the at least one reference physical plan execution fragment, a target physical plan execution fragment comprising the target index;
executing the target physical plan execution fragment, and acquiring a first sub-query result corresponding to the target physical plan execution fragment from the distributed file system;
executing the execution fragments other than the target physical plan execution fragment among the at least one reference physical plan execution fragment to obtain a second sub-query result;
and obtaining the query result based on the first sub-query result and the second sub-query result.
In a second aspect, a data query apparatus is provided, including an acquisition unit, a first generating unit and a first determining unit, wherein:
the acquisition unit is used for acquiring a data query request;
the first generating unit is used for generating at least one reference query plan based on the data query request;
the first determining unit is configured to determine, from at least one reference query plan, a target query plan that has a minimum query cost and includes the target index if it is determined that the target index having an association relationship with the data query request exists in the distributed file system; the distributed file system is used for storing aggregated index data;
the first determining unit is further configured to determine a query result corresponding to the data query request based on the target query plan.
In a third aspect, a data query platform, the platform comprising at least: a data query node and a distributed file system; wherein:
the distributed file system is used for storing aggregated index data;
the data query node is configured to execute a stored data query program, and implement the steps of the data query method according to any one of the above items.
In a fourth aspect, a storage medium has a data query program stored thereon, and the data query program, when executed by a processor, implements the steps of the data query method as in any one of the above.
According to the data query method, the data query device, the data query platform and the storage medium, after the data query request is obtained, at least one reference query plan is generated based on the data query request, if it is determined that a target index having an association relation with the data query request exists in the distributed file system, a target query plan with the minimum query cost and including the target index is determined from the at least one reference query plan, and finally, a query result corresponding to the data query request is obtained through the target query plan. Therefore, the received data query request is analyzed, when a target index having an association relation with the data query request exists in a distributed file system for storing aggregated index data, a corresponding query result is obtained in an aggregated index mode, the problem that long query time is usually needed in the query process of mass data at present is solved, a method for rapidly querying the mass data is provided, the time spent on querying the mass data is shortened, the timeliness of data analysis is effectively guaranteed, and the utilization value of the data is improved.
Drawings
FIG. 1 is a schematic flowchart of a data query method provided in an embodiment of the present application;
FIG. 2 is a schematic flowchart of another data query method provided in an embodiment of the present application;
FIG. 3 is a schematic flowchart of yet another data query method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a query architecture provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a process of generating aggregated index data provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of an application scenario provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of another application scenario provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a query plan provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a query data flow provided in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a data query device provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a data query platform provided in an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
For the convenience of the following description, some terms referred to hereinafter will first be briefly described as follows.
Query plan: the set of steps executed to complete a query when a data query request is carried out.
Query cost: the amount of resources consumed by a query statement during the query process.
Clustered index: also referred to as a cluster index; an index in which the logical order of the key values determines the physical order of the corresponding rows in a table. An index may contain multiple columns (a composite index).
Dimension: a perspective from which data is viewed, typically a grouping field in an SQL statement. For example, in the SQL statement select sum(a1) from A group by a2, a2 is the dimension.
Measure: the data to be analyzed and presented, i.e., the metric, typically an aggregation function in an SQL statement. For example, in the SQL statement select sum(a1) from A group by a2, sum(a1) is the measure.
Data cube: the data sets formed by all combinations of the dimensions together with the measures. For example, when the dimensions are a1, a2, a3 and the measures are sum(a4), sum(a5), count(a6), the data cube includes the following data sets: 1. (a1, sum(a4), sum(a5), count(a6)); 2. (a2, sum(a4), sum(a5), count(a6)); 3. (a3, sum(a4), sum(a5), count(a6)); 4. (a1, a2, sum(a4), sum(a5), count(a6)); 5. (a1, a3, sum(a4), sum(a5), count(a6)); 6. (a2, a3, sum(a4), sum(a5), count(a6)); 7. (a1, a2, a3, sum(a4), sum(a5), count(a6)); 8. (*, sum(a4), sum(a5), count(a6)), where "*" represents statistics over all data.
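The enumeration in the data cube example can be sketched in a few lines of Python. This is only an illustration of the definition, assuming that a cube contains one data set per subset of the dimensions (including the empty subset, shown as "*"); the function name cube_groupings is hypothetical and does not come from the application.

```python
from itertools import combinations


def cube_groupings(dimensions):
    """Return every subset of the dimensions, smallest first.

    The empty tuple stands for the "*" entry, i.e. statistics over all data.
    """
    groupings = []
    for size in range(len(dimensions) + 1):
        groupings.extend(combinations(dimensions, size))
    return groupings


# Dimensions and measures taken from the example in the text.
dimensions = ["a1", "a2", "a3"]
measures = ["sum(a4)", "sum(a5)", "count(a6)"]

for dims in cube_groupings(dimensions):
    key = ", ".join(dims) if dims else "*"
    print(f"({key}, {', '.join(measures)})")
```

With three dimensions the loop prints 2^3 = 8 data sets, which matches the eight entries listed in the example.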
Aggregation index: a subset of a data cube that can be used to accelerate statistical query statements; the corresponding data is called aggregated index data, referred to as index data for short in this application. For example, when the user queries select sum(a4), sum(a5), count(a6) from A group by a1, the data of the aggregation index (a1, sum(a4), sum(a5), count(a6)) can be returned as the query result.
Physical plan: lexical analysis and syntactic analysis are performed on an SQL statement to obtain a parse tree for the query, and a relational algebra expression is obtained from the parse tree.
Sharding: sharding may be defined simply as a partitioning scheme that distributes a large database across multiple physical nodes. Each partition contains a certain portion of the database, called a shard; the partitioning may be arbitrary and is not limited to conventional horizontal and vertical partitioning. A shard may contain the contents of multiple tables and may even contain the contents of multiple database instances. Each shard is placed on a database server, and one database server may process the data of one or more shards. The system needs a server that routes queries, which is responsible for forwarding each query to the shards, or shard aggregation nodes, that contain the data accessed by the query.
Physical plan execution fragment: a relational algebra expression used to query the corresponding data from each shard.
An embodiment of the present application provides a data query method, which is applied to a data query node, and is shown in fig. 1, where the method includes the following steps:
Step 101, obtaining a data query request.
In the embodiment of the present application, the data query request may be request information input to the data query node by a user. The data querying node may be a node with management capabilities, which may run on an electronic device with computing capabilities, which may be a computer device or a server device, for example.
Step 102, generating at least one reference query plan based on the data query request.
In the embodiment of the application, the data query node parses the data query request to generate at least one reference query plan. Each reference query plan is a logical query plan for implementing the data query request.
Step 103, if it is determined that a target index having an association relationship with the data query request exists in the distributed file system, determining a target query plan having the minimum query cost and including the target index from at least one reference query plan.
Wherein the distributed file system is at least used for storing the aggregation index data.
In the embodiment of the present application, the distributed file system stores at least the data sets formed by combining the dimensions and measures of the original analysis data, that is, the data cube, which is the aggregated index data. The data query request is analyzed to determine whether the distributed file system contains at least part of the indexes capable of responding to the data query request; if so, those indexes are determined to be the target index, and a target query plan that has the minimum query cost and includes the target index is determined from the at least one reference query plan. If no index capable of responding to the data query request exists in the distributed file system, the data query request is executed directly on the original analysis data to obtain a query result based on the original analysis data. The original analysis data is the corresponding data collected during the operation of the distributed cluster.
Step 104, determining a query result corresponding to the data query request based on the target query plan.
In the embodiment of the application, the target query plan is executed, and, according to the target query plan, the corresponding data is acquired at least from the aggregated index data stored in the distributed file system, or from both the aggregated index data and the original analysis data, so as to obtain the query result.
According to the data query method provided by the embodiment of the application, after the data query request is obtained, at least one reference query plan is generated based on the data query request; if it is determined that a target index having an association relation with the data query request exists in the distributed file system, a target query plan that has the minimum query cost and includes the target index is determined from the at least one reference query plan; finally, the query result corresponding to the data query request is obtained through the target query plan. In this way, the received data query request is analyzed, and when a target index having an association relation with the data query request exists in the distributed file system that stores the aggregated index data, the corresponding query result is obtained by means of the aggregated index. This solves the problem that querying mass data currently takes a long time, provides a method for rapidly querying mass data, shortens the time spent on querying mass data, effectively guarantees the timeliness of data analysis, and improves the utilization value of the data.
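The selection in step 103 can be illustrated with a minimal Python sketch. It assumes that each reference query plan carries a numeric cost and the set of indexes it uses; the QueryPlan class, its field names, the index name idx_a and the function choose_target_plan are all hypothetical and are not taken from the application. Returning None stands for the fallback described above, in which the request is answered from the original analysis data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class QueryPlan:
    """Hypothetical stand-in for a reference query plan."""
    name: str
    cost: float
    indexes: frozenset = field(default_factory=frozenset)


def choose_target_plan(plans, target_index: Optional[str]):
    """Pick the cheapest plan that includes the target index.

    Returns None when there is no target index or no plan uses it,
    which signals that the query should run on the original data.
    """
    if target_index is None:
        return None
    candidates = [plan for plan in plans if target_index in plan.indexes]
    if not candidates:
        return None
    return min(candidates, key=lambda plan: plan.cost)


plans = [
    QueryPlan("scan_raw", 100.0),
    QueryPlan("use_idx_a", 5.0, frozenset({"idx_a"})),
    QueryPlan("use_idx_a_join", 9.0, frozenset({"idx_a"})),
]
target = choose_target_plan(plans, "idx_a")
print(target.name)
```

How the cost itself is estimated is not specified in this part of the description, so the sketch treats it as a given number.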
Based on the foregoing embodiments, an embodiment of the present application provides a data query method, which is applied to a data query node, and is shown in fig. 2, where the method includes the following steps:
Step 201, obtaining a data query request.
In an embodiment of the present application, the data query request may be a data query statement in the form of an SQL statement. The user inputs a data query request in the form of an SQL statement to the data query node.
Step 202, generating at least one reference query plan based on the data query request.
In the embodiment of the application, the data query node analyzes the data query request to generate at least one reference query plan.
Step 203, determining index configuration parameters for the data to be analyzed.
In an embodiment of the present application, the index configuration parameter includes at least one of the following parameters: data table relationships, grouping fields, aggregation functions, index data generation methods, and the like. The index configuration parameters may be obtained by setting by a user directly according to a use requirement, or may be an index generation rule obtained by analyzing, by the data query node, a historical query statement sent by the user for the data to be analyzed.
For example, when the historical query statement is select a, b, sum(c) from A inner join B on A.a = B.b group by a, b, the index configuration parameters that can be determined include: the data table relationship is A join B; the connection condition is A.a = B.b; the grouping fields are a and b; the aggregation function is sum(c); and the index data generation method may be, for example, to generate new index data incrementally each time, or to generate all index data at once.
Step 204, determining an index data generation task based on the index configuration parameters.
In the embodiment of the application, as the number of statistical query statements which a user may need to accelerate is large, after the index configuration is determined, various different combinations can be performed according to the grouping fields to obtain an index data generation task.
For example, for grouping fields a and b, in addition to the query select a, b, sum(c) from A inner join B on A.a = B.b group by a, b, the user may also query select a, sum(c) from A inner join B on A.a = B.b group by a, or select b, sum(c) from A inner join B on A.a = B.b group by b. Therefore, the index data generation task determined from the index configuration parameters generates, from the original analysis data, the aggregated index data corresponding to each of the three grouping-field combinations: a; b; and a and b.
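As a rough sketch of how the grouping-field combinations could be turned into index generation statements, the following Python snippet builds one SQL string per non-empty combination. The function name index_generation_statements and its parameters are hypothetical; the table names, join condition and aggregate follow the example in the text.

```python
from itertools import combinations


def index_generation_statements(group_fields, aggregate, from_clause):
    """Build one aggregation statement per non-empty combination of grouping fields."""
    statements = []
    for size in range(1, len(group_fields) + 1):
        for combo in combinations(group_fields, size):
            cols = ", ".join(combo)
            statements.append(
                f"select {cols}, {aggregate} from {from_clause} group by {cols}"
            )
    return statements


tasks = index_generation_statements(
    ["a", "b"], "sum(c)", "A inner join B on A.a = B.b"
)
for sql in tasks:
    print(sql)
```

For the two grouping fields a and b this yields three statements, one each for a, for b, and for a and b together.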
Step 205, generating index construction indication information based on the index data generation task.
In the embodiment of the application, the data query node generates corresponding index construction indication information according to the determined index data generation task. The index construction indication information comprises the index data generation task.
Step 206, sending the index construction indication information to the distributed cluster.
The index construction indication information is used for instructing the distributed cluster to generate the aggregated index data based on the index data generation task and to store the aggregated index data in the distributed file system.
In the embodiment of the application, after generating the index construction indication information, the data query node determines the distributed cluster corresponding to the content requested by the data query request and sends the index construction indication information to that distributed cluster. In response to the received index construction indication information, the distributed cluster computes, from the corresponding original analysis data, the aggregated index data corresponding to the index data generation task and stores the computed aggregated index data in the distributed file system, so that the data query node can obtain the corresponding aggregated index data directly from the distributed file system after receiving a corresponding data query request.
Illustratively, after the data query node generates the index data generation task for the aggregated index data corresponding to the three grouping-field combinations (a; b; and a and b), it generates the index construction indication information and sends it to a distributed cluster, such as a cluster of the Spark compute engine. The Spark cluster computes the aggregated index data corresponding to the three grouping-field combinations and stores it in a distributed file system, such as the Hadoop Distributed File System (HDFS). Therefore, when the data query node needs to query the corresponding aggregated index data, it can obtain the data directly from HDFS.
Step 207, if it is determined that a target index having an association relationship with the data query request exists in the distributed file system, determining a target query plan having the minimum query cost and including the target index from the at least one reference query plan.
Wherein the distributed file system is at least used for storing the aggregation index data.
In the embodiment of the present application, a target index having an association relation with the data query request is an index that can be used to implement the data query request. For example, an index on the fields a and b can be used to obtain the index data corresponding to the fields a and b, and can also be used to obtain the index data corresponding to the field a or to the field b. The target query plan is one of the at least one reference query plan.
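One plausible way to test for such an association relation is a subset check, sketched below in Python. This is an assumption made for illustration, not the rule stated in the application: an index is treated as usable when it contains all grouping fields and all measures of the query. The check is only sound for measures that can be re-aggregated, such as sum; the function name index_can_serve is hypothetical.

```python
def index_can_serve(query_group_fields, query_measures,
                    index_group_fields, index_measures):
    """Return True if the index holds every grouping field and measure the query needs.

    Assumption: the measures are re-aggregatable (for example sum), so an
    index grouped by more fields can be rolled up to fewer fields.
    """
    return (set(query_group_fields) <= set(index_group_fields)
            and set(query_measures) <= set(index_measures))


# An index on (a, b) with sum(c) can answer a query grouped by a alone.
print(index_can_serve(["a"], ["sum(c)"], ["a", "b"], ["sum(c)"]))
# An index on a alone cannot answer a query grouped by a and b.
print(index_can_serve(["a", "b"], ["sum(c)"], ["a"], ["sum(c)"]))
```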
Step 208, determining a query result corresponding to the data query request based on the target query plan.
In the embodiment of the present application, in the target query plan, at least one query sub-plan may be combined, so that each query sub-plan may be executed to obtain a sub-query result corresponding to each query sub-plan, and then at least one queried sub-query result is correspondingly combined to obtain a final query result.
Based on the foregoing embodiments, in other embodiments of the present application, step 204 can be implemented by steps 204a to 204b:
Step 204a, determining a common connection relation among the plurality of configuration fields in the index configuration parameter and a dependency relation among the plurality of configuration fields in the index configuration parameter.
In the embodiment of the application, the common connection relation among the plurality of configuration fields in the index configuration parameter is determined, so that the resource overhead caused by repeatedly computing the same connection relation can be effectively reduced; the dependency relation among the plurality of configuration fields in the index configuration parameter is also determined.
Illustratively, the data actually stored for <grouping fields a and b, sum(c)> includes three fields a, b and sum(c), and the corresponding data is the result set of the query select a, b, sum(c) from A join B on A.a = B.b group by a, b. The data actually stored for <grouping field a, sum(c)> includes two fields a and sum(c), and the corresponding data is the result set of select a, sum(c) from A join B on A.a = B.b group by a. Therefore, <grouping field a, sum(c)> can be obtained from <grouping fields a and b, sum(c)> by aggregating over b, that is, by grouping the stored result by a and summing sum(c). In this case, <grouping field a, sum(c)> is considered to have a dependency relation with <grouping fields a and b, sum(c)>, so that <grouping field a, sum(c)> can be generated from the result of <grouping fields a and b, sum(c)>.
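The dependency described above can be shown with a small roll-up in Python. The rows, the column name sum_c and the function roll_up are made up for illustration; the sketch only demonstrates that the coarser index can be derived from the finer one without rescanning the original data, which holds for sum because partial sums can be added.

```python
from collections import defaultdict


def roll_up(rows, keep):
    """Re-aggregate finer index rows down to the fields in `keep` by summing sum_c."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[name] for name in keep)
        totals[key] += row["sum_c"]
    return [dict(zip(keep, key), sum_c=total)
            for key, total in sorted(totals.items())]


# Hypothetical content of the <a, b, sum(c)> index.
index_ab = [
    {"a": 1, "b": "x", "sum_c": 10},
    {"a": 1, "b": "y", "sum_c": 5},
    {"a": 2, "b": "x", "sum_c": 7},
]

# Derive the <a, sum(c)> index from the <a, b, sum(c)> index.
index_a = roll_up(index_ab, ["a"])
print(index_a)
```

A measure such as a distinct count could not be rolled up this way, so a dependency of this kind would have to be decided per aggregation function.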
Step 204b, determining an index data generation task based on the connection relation and the dependency relation.
In the embodiment of the application, the index data generation task includes each connection relation and the dependency relations among the fields corresponding to each connection relation, so that the distributed cluster can perform large-scale computation based on the index data generation task.
Based on the foregoing embodiments, in other embodiments of the present application, as shown in fig. 3, before the data querying node performs step 205, the data querying node is further configured to perform step 209:
Step 209, determining the historical time period to be subjected to the aggregation index processing.
In this embodiment of the present application, the historical time period to be subjected to the aggregation index processing may be one processing cycle of the aggregation index processing, for example, from the time of the last aggregation index processing to the time of the current aggregation index processing; alternatively, the historical time period may be a time period set by a user. In this way, the aggregation index processing can be performed on the incremental data within the historical time period.
Illustratively, the latest update time of the current index is recorded as lastTime, the acquisition time of the original analysis data acquired at the current moment is recorded as currtime, and the historical time period is determined to be from lastTime to currtime.
Correspondingly, step 205 may be implemented by step 205 a:
Step 205a, generating index construction indication information based on the index data generation task and the historical time period.
The index construction indication information is used for instructing the distributed cluster to execute the index data generation task on the data in the historical time period to generate the aggregated index data.
In this embodiment of the present application, the data query node needs to notify the distributed cluster of the historical time period, and therefore, the generated index construction indication information needs to include the historical time period corresponding to the data that needs to be subjected to the aggregated index processing, in addition to the index data generation task.
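A minimal sketch of such indication information, assuming it is a plain dictionary that bundles the task statements with the historical time period, is given below. The dictionary layout, the key names and the function build_index_instruction are hypothetical; only the lastTime and currtime concepts come from the text, and the dates are arbitrary sample values.

```python
from datetime import datetime


def build_index_instruction(task_statements, last_time, curr_time):
    """Bundle the index data generation task with the period [last_time, curr_time]."""
    if curr_time <= last_time:
        raise ValueError("curr_time must be later than last_time")
    return {
        "tasks": list(task_statements),
        "period": {
            "start": last_time.isoformat(),
            "end": curr_time.isoformat(),
        },
    }


instruction = build_index_instruction(
    ["select a, sum(c) from A inner join B on A.a = B.b group by a"],
    last_time=datetime(2021, 11, 29, 0, 0),   # sample lastTime
    curr_time=datetime(2021, 11, 30, 0, 0),   # sample currtime
)
print(instruction["period"])
```

Restricting the task to this window is what lets the cluster process only the incremental data rather than rebuilding the whole index.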
Based on the foregoing embodiments, in other embodiments of the present application, step 206 may be implemented by step 206 a:
Step 206a, when detecting that the current time is the preset time, sending the index construction indication information to the distributed cluster.
In the embodiment of the present application, the preset time is a time at which the aggregation index processing needs to be performed; for example, the aggregation index processing may be performed periodically.
Based on the foregoing embodiments, in other embodiments of the present application, step 208 can be implemented by steps 208a to 208 b:
Step 208a, splitting the target query plan to obtain at least one reference physical plan execution fragment.
In this embodiment of the present application, since the target query plan is generally composed of a series of execution plans, the target query plan needs to be split to obtain at least one reference physical plan execution fragment. For example, a plan related to the aggregation index in the target query plan and a plan unrelated to the aggregation index are separated, so that the corresponding plan related to the aggregation index can be executed for aggregation index data stored in the distributed file system, and a plan unrelated to the aggregation index can be executed for non-aggregation index data stored in the distributed file system.
Step 208b, determining the query result based on the at least one reference physical plan execution fragment.
In the embodiment of the application, each reference physical plan execution fragment is executed to obtain a sub-query result corresponding to each reference physical plan execution fragment, and at least one sub-query result is processed to obtain a query result.
Based on the foregoing embodiments, in other embodiments of the present application, step 208b may be implemented by steps a11 to a14:
Step a11, determining a target physical plan execution fragment including the target index from the at least one reference physical plan execution fragment.
In the embodiment of the application, at least one reference physical plan execution fragment is analyzed, and a target physical plan execution fragment including a target index is determined.
Step a12, executing the target physical plan execution fragment, and obtaining a first sub-query result corresponding to the target physical plan execution fragment from the distributed file system.
In the embodiment of the application, the target physical plan execution fragments are executed, so that the first sub-query results corresponding to the target physical plan execution fragments are obtained from the aggregation index data stored in the distributed file system.
Step a13, executing, among the at least one reference physical plan execution fragment, the execution fragments other than the target physical plan execution fragment to obtain a second sub-query result.
In the embodiment of the application, the execution fragments other than the target physical plan execution fragment among the at least one reference physical plan execution fragment are executed on the original analysis data to obtain the second sub-query result.
Step a14, obtaining the query result based on the first sub-query result and the second sub-query result.
In the embodiment of the present application, the first sub-query result and the second sub-query result are processed according to the relationship between them, for example a connection relationship, so as to obtain the query result.
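Steps a11 to a14 can be sketched as follows. This is a simplified illustration under stated assumptions: the `Fragment` type, the `uses_target_index` flag and the join used to combine the two sub-query results are hypothetical, and the fragment bodies return in-memory rows instead of reading a distributed file system.

```python
from typing import Callable, Dict, List, NamedTuple

Row = Dict[str, object]


class Fragment(NamedTuple):
    """Hypothetical stand-in for a reference physical plan execution fragment."""
    name: str
    uses_target_index: bool
    run: Callable[[], List[Row]]


def join_on_key(first: List[Row], second: List[Row]) -> List[Row]:
    """Combine the two sub-query results through a connection on a1 = b1."""
    by_key = {row["a1"]: row for row in first}
    return [{**by_key[row["b1"]], **row} for row in second if row["b1"] in by_key]


def execute_plan(fragments: List[Fragment]) -> List[Row]:
    # Step a11: pick the fragments that include the target index.
    target = [f for f in fragments if f.uses_target_index]
    others = [f for f in fragments if not f.uses_target_index]
    # Step a12: first sub-query result, read from aggregated index data.
    first = [row for f in target for row in f.run()]
    # Step a13: second sub-query result, computed on the original data.
    second = [row for f in others for row in f.run()]
    # Step a14: combine both sub-query results.
    return join_on_key(first, second)


fragments = [
    Fragment("index scan of A", True,
             lambda: [{"a1": 1, "a_sum": 10}, {"a1": 2, "a_sum": 5}]),
    Fragment("scan of B", False,
             lambda: [{"b1": 1, "b2": "x"}, {"b1": 3, "b2": "y"}]),
]
result = execute_plan(fragments)
```

The split keeps the index-backed part of the plan separate from the part that still has to read non-aggregated data, which mirrors the division described in step 208a.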
Based on the foregoing embodiments, an embodiment of the present application provides a query architecture for implementing the data query method, which, as shown in fig. 4, includes: a timing task module, an index building module, an HDFS (Hadoop Distributed File System), an index connector and a query module. The timing task module and the index building module belong to an index building part, and the index connector and the query module belong to a query part. Wherein:
The timing task module is used for automatically initiating an index construction task for newly added data, which effectively saves labor cost. That is, the timing task module sends an index building command to the index building module after the system time reaches the preset time, thereby triggering the index construction task. Further, the timing task module may also be configured to determine the newly added data, that is, to determine the historical time period, and to send the historical time period to the index building module when the index construction task is triggered.
The index building module is used for generating, after receiving the index building command sent by the timing task module, a corresponding index data generation task according to the index configuration parameters, and calculating the read original analysis data based on the index data generation task to obtain aggregated index data. As shown in fig. 5, the flow of generating aggregated index data may be as follows. The index configuration parameters include: grouping fields (a, b), where the corresponding aggregation function is sum(c) and the corresponding connection relation is A join B; grouping field (a), where the corresponding aggregation function is sum(c) and the corresponding connection relation is A join B; and grouping field (b), where the corresponding aggregation function is sum(c) and the corresponding connection relation is A join B. The index configuration parameters are analyzed, and it is determined that the common connection relation is A join B and that the dependency relations are as follows: <grouping field a, sum(c)> depends on <grouping fields (a, b), sum(c)>, and <grouping field b, sum(c)> depends on <grouping fields (a, b), sum(c)>; that is, both <grouping field a, sum(c)> and <grouping field b, sum(c)> can be generated from the result of <grouping fields (a, b), sum(c)>. It can therefore be determined that the index data generation task is: calculating A join B; calculating <grouping fields (a, b), sum(c)> on the result of A join B; and calculating <grouping field a, sum(c)> and <grouping field b, sum(c)> from the result of <grouping fields (a, b), sum(c)>. The obtained index data generation task is sent to the Spark cluster, the Spark cluster calculates the index data generation task in parallel to obtain the corresponding aggregated index data, and the aggregated index data is stored in the HDFS for subsequent query.
The query module is used for receiving a data query request sent by a user, analyzing the data query request, generating a query plan and returning a query result corresponding to the query plan. The query module may specifically be the open-source distributed SQL query engine Presto. When the query module analyzes the query request, two types of syntax trees can be generated for the query request, where one type of syntax tree is generated based on the original analysis data and the other type is generated based on the aggregated index data.
For the syntax tree generated based on the aggregated index data, the query module analyzes the data query request, determines the grouping query contained in the query statements of the data query request, that is, the syntax statements corresponding to the target index, and sends the grouping query to the index connector.
The index connector is used for receiving the grouping query and determining whether the grouping query can be responded to by an index; if it can, a feedback message indicating that an index response is available is fed back to the query module, and if it cannot, a feedback message indicating that an index response is unavailable is fed back to the query module.
The query module is further configured to, after receiving the feedback message sent by the index connector, generate a query plan including the grouping query, that is, the actual code segment for processing the grouping query, if the feedback message indicates that the grouping query can be responded to by the index, and not generate the query plan including the grouping query if the feedback message indicates that the grouping query cannot be responded to by the index. After generating the query plan including the grouping query, the query module sends that query plan to the index connector; the query module also generates other query plans not including the grouping query and executes those other query plans.
For example, it is determined whether the grouping query select a, sum(c) from a join b.a = b.b group by a can be processed by the generated index. Since the generated aggregated indexes include <a, sum(c)>, the query statement select a, sum(c) from a join b.a = b.b group by a has a corresponding index response, and the data result corresponding to <a, sum(c)> is directly returned.
The index connector is further used for receiving the query plan including the grouping query sent by the query module, acquiring the aggregated index data corresponding to the query plan from the HDFS to obtain the sub-query result corresponding to the grouping query, and sending the sub-query result corresponding to the grouping query to the query module.
The query module is further used for receiving the sub-query result corresponding to the grouping query sent by the index connector, combining it with the sub-query results corresponding to the other query plans to obtain the final query result, and returning the final query result.
The HDFS is used for storing the aggregated index data and the original analysis data.
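The dependency relation used by the index building module, in which the indexes grouped by a and by b are derived from the index grouped by (a, b), can be illustrated with the following sketch. It uses plain Python in place of the Spark cluster, and the joined rows are made-up sample data; only the grouping fields a and b and the aggregation sum(c) come from the description.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable


def aggregate(rows: Iterable, key_fn: Callable, value_fn: Callable) -> Dict[Hashable, int]:
    """Group rows by key_fn and sum value_fn over each group."""
    totals = defaultdict(int)
    for row in rows:
        totals[key_fn(row)] += value_fn(row)
    return dict(totals)


# Hypothetical rows standing in for the result of "A join B".
joined = [
    {"a": "x", "b": 1, "c": 10},
    {"a": "x", "b": 2, "c": 5},
    {"a": "y", "b": 1, "c": 7},
    {"a": "x", "b": 1, "c": 3},
]

# First compute <(a, b), sum(c)> once over the joined data.
by_ab = aggregate(joined, lambda r: (r["a"], r["b"]), lambda r: r["c"])

# Then derive <a, sum(c)> and <b, sum(c)> from the (a, b) result
# instead of scanning the joined data again.
by_a = aggregate(by_ab.items(), lambda kv: kv[0][0], lambda kv: kv[1])
by_b = aggregate(by_ab.items(), lambda kv: kv[0][1], lambda kv: kv[1])
```

Because sum is additive, rolling up the finer (a, b) result gives the same totals as aggregating the joined rows directly, which is what makes the dependency relation usable for reducing repeated computation.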
For example, for the query architecture shown in fig. 4, an application embodiment of the corresponding data query method assumes that there are a data table A and a data table B as shown in fig. 6. Data table A contains data collected in batches, namely two batches of data collected on 201x-01-01 and 201x-01-02 respectively. For data table A and data table B, the SQL query statement commonly used by the user is:
with a_group as (select a1, sum(a3) as a_sum from A group by a1)
select a1, b2, b3, a_sum from a_group, B where a_group.a1 = B.b1
For this SQL query statement, it is determined that, in the index configuration parameters for data table A, the dimension is a1 and the metric is sum(a3).
The timing task module is configured so that index construction for the data of the previous day starts at 2:00 a.m. each day. Thus, at 2:00 every morning, the timing task module sends an index building command to the index building module, and the index building module reads the data of the previous day in data table A and generates the corresponding aggregated index data. The generated data is shown in fig. 7: an index construction task started at 2:00 a.m. on 201x-01-02 generates the index data of 201x-01-01 in fig. 7, and an index construction task started at 2:00 a.m. on 201x-01-03 generates the index data of 201x-01-02 in fig. 7.
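The daily build described here can be sketched as follows. The year 2019 is a hypothetical stand-in for "201x", the rows of data table A are invented because the contents of fig. 6 are not reproduced in the text, and the function names are illustrative only.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from typing import Dict, List


def previous_day(run_at: datetime) -> date:
    """Return the calendar day before the day on which the build task runs."""
    return (run_at - timedelta(days=1)).date()


def build_daily_index(rows: List[dict], day: date) -> Dict[int, int]:
    """Aggregate sum(a3) per dimension a1 for the batch collected on `day`."""
    totals = defaultdict(int)
    for row in rows:
        if row["batch"] == day:
            totals[row["a1"]] += row["a3"]
    return dict(totals)


# Invented rows for data table A, two batches.
table_a = [
    {"batch": date(2019, 1, 1), "a1": 1, "a3": 10},
    {"batch": date(2019, 1, 1), "a1": 1, "a3": 2},
    {"batch": date(2019, 1, 1), "a1": 2, "a3": 4},
    {"batch": date(2019, 1, 2), "a1": 1, "a3": 5},
]

run_at = datetime(2019, 1, 2, 2, 0)      # task started at 2:00 a.m.
day = previous_day(run_at)
index_for_day = build_daily_index(table_a, day)
```

Each run produces one partition of index data per day, so the data of later days never requires earlier partitions to be rebuilt.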
Suppose that the data query request received by the query module is:
with a_group as (select a1, sum(a3) as a_sum from A group by a1)
select a1, b2, b3, a_sum from a_group, B where a_group.a1 = B.b1
The query module determines that the grouping query is select a1, sum(a3) as a_sum from A group by a1, and asks the index connector whether it can respond to this grouping query. Since the index building module has built a target index with dimension a1 and metric sum(a3), the index connector replies that it can respond to the grouping query select a1, sum(a3) as a_sum from A group by a1 and returns a query cost. Among all the logical query plans generated by the query module, the query plan including the target index data is assumed to be as shown in fig. 8. The query module generates a physical execution plan for the logical plan shown in fig. 8 and sends the query plan for data table A to the index connector, so that the index connector is responsible for the index scan of data table A.
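The check performed by the index connector, namely whether a grouping query can be responded to by a built index and at what query cost, can be sketched as below. The types, the exact-match rule on dimensions and the numeric cost are assumptions; the application does not state how the match or the cost is computed.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class GroupingQuery(NamedTuple):
    table: str
    dimensions: FrozenSet[str]
    metrics: FrozenSet[str]


class BuiltIndex(NamedTuple):
    table: str
    dimensions: FrozenSet[str]
    metrics: FrozenSet[str]
    cost: float  # hypothetical query cost reported by the connector


def can_respond(query: GroupingQuery, indexes: List[BuiltIndex]) -> Optional[float]:
    """Return the lowest cost among matching indexes, or None if none matches.

    Assumption: an index matches when it is on the same table, has exactly the
    query's dimensions and covers every requested metric.
    """
    costs = [
        index.cost
        for index in indexes
        if index.table == query.table
        and index.dimensions == query.dimensions
        and query.metrics <= index.metrics
    ]
    return min(costs) if costs else None


indexes = [BuiltIndex("A", frozenset({"a1"}), frozenset({"sum(a3)"}), 1.0)]
answerable = can_respond(
    GroupingQuery("A", frozenset({"a1"}), frozenset({"sum(a3)"})), indexes)
unanswerable = can_respond(
    GroupingQuery("A", frozenset({"a2"}), frozenset({"sum(a3)"})), indexes)
```

A returned cost corresponds to the feedback that an index response is available, and `None` corresponds to the feedback that it is unavailable.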
Finally, the query data flow in this application embodiment may be as shown in fig. 9: the query module receives the 201x-01-01 data and the 201x-01-02 data for data table A sent by the index connector, merges them to obtain the index merging result of data table A, and then joins the index merging result of data table A with the obtained data table B through the join connection relationship to obtain the final query result.
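The data flow of merging the per-day index data and then joining with data table B can be checked against the original SQL statement with the sketch below. It uses Python's built-in sqlite3 module as a stand-in for the query engine, and the table contents are invented for illustration; the batch labels "day1" and "day2" stand for 201x-01-01 and 201x-01-02.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (batch TEXT, a1 INTEGER, a3 INTEGER);
CREATE TABLE B (b1 INTEGER, b2 TEXT, b3 TEXT);
INSERT INTO A VALUES ('day1', 1, 10), ('day1', 2, 4), ('day2', 1, 5), ('day2', 3, 8);
INSERT INTO B VALUES (1, 'p', 'q'), (2, 'r', 's');
""")

# Path 1: run the user's statement directly on the original data.
raw_result = conn.execute("""
WITH a_group AS (SELECT a1, SUM(a3) AS a_sum FROM A GROUP BY a1)
SELECT a1, b2, b3, a_sum FROM a_group, B WHERE a_group.a1 = B.b1 ORDER BY a1
""").fetchall()

# Path 2: per-day aggregated index data, as daily build tasks would produce it.
daily_index = {}
for batch, a1, a_sum in conn.execute(
        "SELECT batch, a1, SUM(a3) FROM A GROUP BY batch, a1"):
    daily_index.setdefault(batch, {})[a1] = a_sum

# Merge the daily partitions into one index merging result for table A.
merged = Counter()
for partition in daily_index.values():
    merged.update(partition)  # adds the per-day sums key by key

# Join the merged index result with table B on a1 = b1.
index_result = sorted(
    (b1, b2, b3, merged[b1])
    for b1, b2, b3 in conn.execute("SELECT b1, b2, b3 FROM B")
    if b1 in merged
)
```

Both paths give the same rows here, which illustrates why a sum-type metric can be served from per-day index data: the daily partial sums add up to the overall sum.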
It should be noted that, for the explanation of the same steps or concepts in the present embodiment as in the other embodiments, reference may be made to the description in the other embodiments, and details are not described here.
According to the data query method provided by the embodiment of the application, after the data query request is obtained, at least one reference query plan is generated based on the data query request; if it is determined that a target index having an association relationship with the data query request exists in the distributed file system, the target query plan that has the minimum query cost and includes the target index is determined from the at least one reference query plan; and finally, the query result corresponding to the data query request is obtained through the target query plan. In this way, the received data query request is analyzed, and when a target index having an association relationship with the data query request exists in the distributed file system storing the aggregated index data, the corresponding query result is obtained by means of the aggregated index. This addresses the problem that querying mass data currently takes a long time, provides a method for quickly querying mass data, shortens the time spent on such queries, effectively guarantees the timeliness of data analysis, and improves the utilization value of the data.
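The selection of the target query plan summarized above can be sketched as follows. The plan representation and the cost values are hypothetical; the sketch only shows the rule of choosing, among the reference query plans that include the target index, the one with the minimum query cost.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class ReferencePlan(NamedTuple):
    name: str
    cost: float                # hypothetical query cost estimate
    indexes: FrozenSet[str]    # names of the indexes the plan uses


def choose_target_plan(plans: List[ReferencePlan],
                       target_index: str) -> Optional[ReferencePlan]:
    """Pick the cheapest plan among those that include the target index."""
    candidates = [plan for plan in plans if target_index in plan.indexes]
    if not candidates:
        return None  # no plan uses the target index
    return min(candidates, key=lambda plan: plan.cost)


plans = [
    ReferencePlan("scan original data", 100.0, frozenset()),
    ReferencePlan("index scan, variant 1", 12.0, frozenset({"idx_a1_sum_a3"})),
    ReferencePlan("index scan, variant 2", 8.0, frozenset({"idx_a1_sum_a3"})),
]
target = choose_target_plan(plans, "idx_a1_sum_a3")
```

When no reference query plan includes the target index, the sketch returns `None`, leaving the fallback behaviour to the caller.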
Based on the foregoing embodiments, the present application provides a data query apparatus 3, where the data query apparatus 3 may be applied to the embodiments corresponding to fig. 1 to 3 and the other method embodiments, and as shown in fig. 10, the data query apparatus 3 includes: an acquisition unit 31, a first generating unit 32, and a first determining unit 33; wherein:
the acquisition unit 31 is configured to acquire a data query request;
the first generating unit 32 is configured to generate at least one reference query plan based on the data query request;
the first determining unit 33 is configured to determine, if it is determined that a target index having an association relationship with a data query request exists in the distributed file system, a target query plan having a minimum query cost and including the target index from at least one reference query plan; the distributed file system is used for storing the aggregation index data;
the first determining unit 33 is further configured to determine a query result corresponding to the data query request based on the target query plan.
In other embodiments of the present application, the data query apparatus further includes the following units, which operate before the first determining unit 33 determines the target query plan: a second determining unit, a second generating unit and a sending unit; wherein:
the second determining unit is used for determining the index configuration parameters of the data to be analyzed;
the second determining unit is further used for determining an index data generation task based on the index configuration parameters;
the second generating unit is used for generating index construction indication information based on the index data generation task;
the sending unit is used for sending the index construction indication information to the distributed cluster; the index construction indication information is used for instructing the distributed cluster to generate the aggregated index data based on the index data generation task and to store the aggregated index data in the distributed file system.
In other embodiments of the present application, the second determining unit includes: a first determination module and a second determination module; wherein:
the first determining module is used for determining the common connection relation among the plurality of configuration fields in the index configuration parameter and the dependency relation among the plurality of configuration fields in the index configuration parameter;
and the second determining module is used for determining the index data generation task based on the connection relation and the dependency relation.
In other embodiments of the present application, the second determining unit is further configured to determine a historical time period to be subjected to the aggregation index processing;
correspondingly, the second generating unit is specifically configured to: generate the index construction indication information based on the index data generation task and the historical time period; the index construction indication information is used for instructing the distributed cluster to execute the index data generation task on the data in the historical time period to generate the aggregated index data.
In other embodiments of the present application, the sending unit is specifically configured to implement the following steps: and when the current time is detected to be the preset time, sending index construction indication information to the distributed cluster.
In other embodiments of the present application, the first determining unit includes: a processing module and an execution module; wherein:
the processing module is used for splitting the target query plan to obtain at least one reference physical plan execution fragment;
and the execution module is used for determining the query result based on the at least one reference physical plan execution fragment.
In other embodiments of the present application, the execution module is specifically configured to implement the following steps:
determining a target physical plan execution fragment comprising a target index from at least one reference physical plan execution fragment;
executing the target physical plan execution fragment, and acquiring a first sub-query result corresponding to the target physical plan execution fragment from the distributed file system;
executing execution fragments except the target physical plan execution fragment in the at least one reference physical plan execution fragment to obtain a second sub-query result;
and obtaining a query result based on the first sub-query result and the second sub-query result.
It should be noted that, in the embodiment, the interaction process between the steps implemented by the units and the modules may refer to the interaction processes in the embodiments corresponding to fig. 1 to 3 and the methods provided in the foregoing embodiments, and details are not described here.
According to the data query apparatus provided by the embodiment of the application, after the data query request is obtained, at least one reference query plan is generated based on the data query request; if it is determined that a target index having an association relationship with the data query request exists in the distributed file system, the target query plan that has the minimum query cost and includes the target index is determined from the at least one reference query plan; and finally, the query result corresponding to the data query request is obtained through the target query plan. In this way, the received data query request is analyzed, and when a target index having an association relationship with the data query request exists in the distributed file system storing the aggregated index data, the corresponding query result is obtained by means of the aggregated index. This addresses the problem that querying mass data currently takes a long time, provides a way of quickly querying mass data, shortens the time spent on such queries, effectively guarantees the timeliness of data analysis, and improves the utilization value of the data.
Based on the foregoing embodiments, the present application provides a data query platform 4, where the data query platform 4 may be applied to the embodiments corresponding to fig. 1 to 3, and as shown in fig. 11, the data query platform 4 at least includes: a data query node 41 and a distributed file system 42; wherein:
a distributed file system 42 for storing aggregated index data;
the data query node 41 is configured to execute a stored data query program to implement the data query method provided in any one of the embodiments of fig. 1 to 3, which is not described herein again.
Based on the foregoing embodiments, embodiments of the present application provide a computer-readable storage medium, which is referred to as a storage medium for short, where the computer-readable storage medium stores one or more data query programs, and the one or more data query programs can be executed by one or more processors to implement the data query method provided in the embodiments corresponding to fig. 1 to 3, and details are not described here again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present application may be substantially or partially embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), and including instructions for enabling a terminal (such as a mobile phone, a computer, … …, an air conditioner, or a network communication link device) to execute the method described in the embodiments of the present application.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (10)

1. A method for data query, the method comprising:
acquiring a data query request;
generating at least one reference query plan based on the data query request;
if it is determined that a target index having an association relation with the data query request exists in the distributed file system, determining a target query plan which has the minimum query cost and comprises the target index from at least one reference query plan; wherein the distributed file system is at least used for storing aggregated index data;
determining a query result corresponding to the data query request based on the target query plan.
2. The method of claim 1, wherein if it is determined that a target index having an association relationship with the data query request exists in the index database, before determining a target query plan having a minimum query cost and including the target index from at least one of the reference query plans, the method further comprises:
determining index configuration parameters for data to be analyzed;
determining an index data generation task based on the index configuration parameters;
generating index construction indication information based on the index data generation task;
sending the index construction indication information to the distributed cluster; the index construction indication information is used for indicating the distributed cluster to generate aggregated index data based on the index data task, and storing the aggregated index data to the distributed file system.
3. The method of claim 2, wherein determining an index data generation task based on the index configuration parameter comprises:
determining a common connection relation among a plurality of configuration fields in the index configuration parameter and a dependency relation among the plurality of configuration fields in the index configuration parameter;
and determining the index data generation task based on the connection relation and the dependency relation.
4. The method according to claim 2 or 3, wherein before generating the index construction indication information based on the index data generation task, the method further comprises:
determining a historical time period to be subjected to aggregation index processing;
correspondingly, the generating of the index construction indication information based on the index data generation task includes:
generating the index construction indication information based on the index data generation task and the historical time period; wherein the index construction indication information is used for indicating the distributed cluster to execute the index data task to generate the aggregated index data aiming at the data in the historical time period.
5. The method according to claim 2 or 3, wherein the sending the index building indication information to the distributed cluster comprises:
and when the current time is detected to be the preset time, sending the index construction indication information to the distributed cluster.
6. The method of any of claims 1 to 3, wherein determining query results corresponding to the data query request based on the target query plan comprises:
splitting the target query plan to obtain at least one reference physical plan execution fragment;
determining the query result based on the at least one reference physical plan execution fragment.
7. The method of claim 6, wherein said determining the query result based on the at least one reference physical plan execution shard comprises:
determining a target physical plan execution shard comprising the target index from at least one of the reference physical plan execution shards;
executing the target physical plan execution fragment, and acquiring a first sub-query result corresponding to the target physical plan execution fragment from the distributed file system;
executing execution fragments except the target physical plan execution fragment in at least one reference physical plan execution fragment to obtain a second sub-query result;
and obtaining the query result based on the first sub-query result and the second sub-query result.
8. A data query apparatus, characterized in that the apparatus comprises: an acquisition unit, a first generating unit and a first determining unit, wherein:
the acquisition unit is used for acquiring a data query request;
the first generating unit is used for generating at least one reference query plan based on the data query request;
the first determining unit is configured to determine, from at least one reference query plan, a target query plan that has a minimum query cost and includes the target index if it is determined that the target index having an association relationship with the data query request exists in the distributed file system; the distributed file system is used for storing aggregated index data;
the first determining unit is further configured to determine a query result corresponding to the data query request based on the target query plan.
9. A data query platform, the platform comprising at least: a data query node and a distributed file system; wherein:
the distributed file system is used for storing aggregated index data;
the data query node, configured to execute a stored data query program, implementing the steps of the data query method according to any one of claims 1 to 7.
10. A storage medium having stored thereon a data query program which, when executed by a processor, implements the steps of the data query method according to any one of claims 1 to 7.
CN202111441654.4A 2021-11-30 2021-11-30 Data query method, device, platform and storage medium Pending CN114138814A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111441654.4A CN114138814A (en) 2021-11-30 2021-11-30 Data query method, device, platform and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111441654.4A CN114138814A (en) 2021-11-30 2021-11-30 Data query method, device, platform and storage medium

Publications (1)

Publication Number Publication Date
CN114138814A true CN114138814A (en) 2022-03-04

Family

ID=80389965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111441654.4A Pending CN114138814A (en) 2021-11-30 2021-11-30 Data query method, device, platform and storage medium

Country Status (1)

Country Link
CN (1) CN114138814A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116303575A (en) * 2023-03-22 2023-06-23 本原数据(北京)信息技术有限公司 Database data query method and device and nonvolatile storage medium
CN116303575B (en) * 2023-03-22 2023-12-12 本原数据(北京)信息技术有限公司 Database data query method and device and nonvolatile storage medium

Similar Documents

Publication Publication Date Title
CN109656963B (en) Metadata acquisition method, apparatus, device and computer readable storage medium
WO2020238597A1 (en) Hadoop-based data updating method, device, system and medium
CN111008521B (en) Method, device and computer storage medium for generating wide table
CN108536808B (en) Spark calculation framework-based data acquisition method and device
US11301470B2 (en) Control method for performing multi-table join operation and corresponding apparatus
Tsalouchidou et al. Scalable dynamic graph summarization
CN111209309A (en) Method, device and equipment for determining processing result of data flow graph and storage medium
CN112883095A (en) Method, system, equipment and storage medium for multi-source heterogeneous data convergence
CN108733727B (en) Query processing method, data source registration method and query engine
CN107870949B (en) Data analysis job dependency relationship generation method and system
CN111367681A (en) Cloud computing cluster-oriented bridge design system under high load state
CN109299101B (en) Data retrieval method, device, server and storage medium
CN111930770A (en) Data query method and device and electronic equipment
CN113901078A (en) Business order association query method, device, equipment and storage medium
CN114138814A (en) Data query method, device, platform and storage medium
CN111159300A (en) Data processing method and device based on block chain
CN110297858B (en) Optimization method and device for execution plan, computer equipment and storage medium
CN116401277A (en) Data processing method, device, system, equipment and medium
CN116089446A (en) Optimization control method and device for structured query statement
CN115982278A (en) Self-service real-time data comparison method and system based on MPP database
CN106294721B (en) Cluster data counting and exporting methods and devices
Ovando-Leon et al. A simulation tool for a large-scale nosql database
CN112597193B (en) Data processing method and data processing system
CN117633059B (en) Data query method based on distributed database
CN111552561B (en) Task processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination