WO2022161417A1

WO2022161417A1 - Data query method and apparatus, and device and storage medium

Info

Publication number: WO2022161417A1
Application number: PCT/CN2022/074138
Authority: WO
Inventors: 李铮; 刘玉; 罗旦
Original assignee: 华为技术有限公司
Priority date: 2021-01-27
Filing date: 2022-01-27
Publication date: 2022-08-04
Also published as: CN114817310A

Abstract

Disclosed in the present application are a data query method and apparatus, and a device and a storage medium, which can be applied to a data processing system comprising a coordination node and a working node. The coordination node sends, to the working node, a task of performing a query operation for a first file and a second file, wherein the task comprises a query condition for the first file and the second file, and the amount of data of the first file is greater than the amount of data of the second file. The working node estimates, in the first file and according to the query condition in the task and data information respectively corresponding to the first file and the second file, the proportion of data, which meets the query condition, in the first file; and the working node reads the first file and the second file when the proportion is greater than a preset threshold value, and executes the query operation on the first file and the second file. In this way, when it is estimated that the proportion of data, which meets a query condition, in a first file is relatively great, the problems of the data query efficiency being reduced and query overheads increasing instead of reducing due to the generation of a dynamic filter are avoided.

Description

Data query method, apparatus, device, and storage medium

Technical field

The embodiments of the present application relate to the technical field of data processing, and in particular, to a data query method, apparatus, device, and storage medium.

Background

In the era of big data, with the expansion of data collection methods, the cost of data collection is getting lower and lower, and correspondingly, the amount of data stored in the data source is also increasing. When querying and analyzing massive data, data query efficiency and query overhead become the core issues in the field of data processing.

At present, when a data processing system performs a data query based on a query statement input by the user, it usually generates a filter condition (dynamic filter, DF) to reduce the amount of data read from the data source, so as to improve data query efficiency and reduce query overhead. However, in practical applications the filter condition generated by the data processing system may not achieve a good filtering effect; that is, the amount of data before and after filtering differs little. In that case, data query efficiency and query overhead are not significantly improved, and the generation, calculation, and transmission of the filter condition may instead reduce the data query efficiency of the data processing system and increase query overhead.

Summary of the invention

Embodiments of the present application provide a data query method, apparatus, device, storage medium, and computer program product, so that the data query efficiency of the data processing system is kept at a high level, and the query overhead is kept at a low level.

In a first aspect, an embodiment of the present application provides a data query method that can be applied to a data processing system including a coordinating node and a working node. The coordinating node may send the working node a task of performing a query operation on a first file and a second file. The task includes query conditions for the first file and the second file, and the data volume of the first file is larger than that of the second file. For example, the task sent by the coordinating node may be a logical plan tree generated by the coordinating node. According to the query conditions in the received task and the data information corresponding to the first file and the second file respectively, the working node estimates the proportion of the first file that is made up of data meeting the query conditions. When the proportion is greater than a preset threshold, the working node reads the first file and the second file and performs the query operation on them.

A relatively large proportion of data meeting the query conditions in the first file means that the filter condition to be generated would filter poorly. In this case, the data processing system directly reads the first file and the second file and performs the query, without generating a filter condition to filter the data in the first file. This avoids the reduced data query efficiency and increased query overhead that generating a filter condition would cause, so the data query overhead of the data processing system can also be kept low.

In a possible implementation, when the estimated proportion of data meeting the query conditions in the first file is less than or equal to the preset threshold, the working node can read the second file and query first data in the second file that satisfies the query conditions. The working node can then generate a filter condition according to the first data and send the filter condition to the data source where the first file is located. The filter condition instructs the data source to query the first file for second data matching the first data, so that the working node can receive the second data sent by the data source and perform the query operation on the first data and the second data.

In this implementation, a relatively small proportion of data meeting the query conditions in the first file means that the filter condition to be generated would filter well. In this case, the data processing system can generate the filter condition to reduce the amount of data of the first file that the working node reads from the data source, so that the data query efficiency of the data processing system can reach a high level. Moreover, when querying data that meets the query conditions, the working node does not need to read from the data source the large amount of data in the first file that does not meet the query conditions, which effectively reduces the data query overhead of the data processing system.
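The two branches described above can be illustrated with a minimal sketch. The names `ReadStrategy` and `choose_strategy` are hypothetical and do not appear in the application; the sketch only shows the threshold comparison, under the assumption that the proportion has already been estimated as a value between 0 and 1.

```python
from enum import Enum


class ReadStrategy(Enum):
    """Hypothetical labels for the two branches described in the text."""
    READ_BOTH_FILES = "read_both_files"        # no filter condition is generated
    USE_DYNAMIC_FILTER = "use_dynamic_filter"  # filter condition built from the second file


def choose_strategy(estimated_proportion: float, preset_threshold: float) -> ReadStrategy:
    """Pick a branch from the estimated proportion of matching data in the first file."""
    if not 0.0 <= estimated_proportion <= 1.0:
        raise ValueError("estimated_proportion must lie in [0, 1]")
    # Proportion above the threshold: a filter would remove little, so skip it.
    if estimated_proportion > preset_threshold:
        return ReadStrategy.READ_BOTH_FILES
    # Proportion at or below the threshold: a filter is expected to pay off.
    return ReadStrategy.USE_DYNAMIC_FILTER
```

The comparison is strict on the "greater than" side so that a proportion exactly equal to the threshold falls into the filter-generating branch, matching the "less than or equal to" wording above.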

In a possible implementation, the data information corresponding to the first file and the second file may be sampled data of the first file and sampled data of the second file. When estimating the proportion of data meeting the query conditions in the first file, the working node can determine the identifiers of the target sampled data that meets the query conditions in the sampled data of the second file, and then calculate the proportion of the sampled data of the first file that carries those identifiers. In this way, by sampling a small amount of data from the first file and the second file, the working node can estimate whether the filter condition to be generated would filter well. This reduces the cost of evaluating the filtering effect of the filter condition and improves the feasibility of the solution.
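The sampling-based estimate can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: rows are modeled as dictionaries, the function name `estimate_proportion_from_samples` and its parameters are hypothetical, and how the samples are drawn is left open.

```python
from typing import Callable, Iterable, Mapping

Row = Mapping[str, object]


def estimate_proportion_from_samples(
    first_sample: Iterable[Row],
    second_sample: Iterable[Row],
    key: str,
    predicate: Callable[[Row], bool],
) -> float:
    """Estimate the share of the first file whose rows match the query conditions."""
    # Identifiers of the sampled rows of the second file that meet the query condition.
    target_ids = {row[key] for row in second_sample if predicate(row)}
    first_rows = list(first_sample)
    if not first_rows:
        return 0.0
    # Share of the first file's sample that carries one of those identifiers.
    matching = sum(1 for row in first_rows if row[key] in target_ids)
    return matching / len(first_rows)


# Hypothetical samples: ten rows of the large file, three rows of the small file.
people_sample = [{"id": i} for i in range(1, 11)]
orders_sample = [{"id": 1, "sum": 20}, {"id": 2, "sum": 5}, {"id": 3, "sum": 30}]
proportion = estimate_proportion_from_samples(
    people_sample, orders_sample, "id", lambda row: row["sum"] > 10
)
```

In this example two of the ten sampled rows of the first file carry an identifier that satisfies the condition, so the estimate is 0.2.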

In a possible implementation, the data information corresponding to the first file and the second file may be data statistics corresponding to the first file and the second file respectively. When estimating the proportion of data meeting the query conditions in the first file, the working node may determine, according to the data statistics of the second file, the identifiers of the data in the second file that meets the query conditions, and then calculate, according to the data statistics of the first file, the proportion of the first file made up of data having those identifiers. In this way, the proportion used to evaluate the filtering effect of the filter condition can be obtained from the data statistics of each file, and whether to generate the filter condition can then be decided based on that proportion.

The data statistics of each file may be generated in advance by the working node and the coordinating node. For example, the working node may read the first file and the second file from the data source in advance and hand them over to the coordinating node, which performs data statistics on the two files to obtain the data statistics corresponding to the first file and the second file respectively. The statistics are then saved, so that in each subsequent data query the working node can determine, based on the locally saved data statistics, whether to generate a filter condition for that query. Alternatively, in other examples, while the first file and the second file are being stored in the data source, the data source may generate data statistics for each file and keep them locally. When determining whether to generate a filter condition, the working node can obtain the data statistics corresponding to the first file and the second file from the data source, so as to estimate the proportion used to evaluate the filtering effect of the filter condition.

In a possible implementation, the data statistics corresponding to the first file may be any one or more of the data range, the data distribution interval, and the repeated data of the first file, and may of course also be implemented in other ways. In this way, the working node can determine whether to generate a filter condition according to information such as the data range, the data distribution interval, and the repeated data.
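As one illustration of a statistics-based estimate, the sketch below uses only a data range. It rests on strong assumptions that the application does not state: identifiers are integers, they are spread uniformly across each range, and there is no repeated data. The class `RangeStats` and the function `estimate_proportion_from_ranges` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeStats:
    """Hypothetical per-file statistics: inclusive min/max of the join column."""
    min_id: int
    max_id: int

    @property
    def width(self) -> int:
        return self.max_id - self.min_id + 1


def estimate_proportion_from_ranges(first: RangeStats, matching_second: RangeStats) -> float:
    """Share of the first file's id range covered by the matching ids of the second file.

    Assumes ids are uniformly distributed and unique within each range.
    """
    low = max(first.min_id, matching_second.min_id)
    high = min(first.max_id, matching_second.max_id)
    if high < low:
        return 0.0  # the ranges do not overlap, so nothing in the first file can match
    return (high - low + 1) / first.width


first_stats = RangeStats(min_id=1, max_id=100)
matching_stats = RangeStats(min_id=81, max_id=150)
proportion = estimate_proportion_from_ranges(first_stats, matching_stats)
```

Here the overlap is the 20 identifiers from 81 to 100 out of 100, giving 0.2. A data distribution interval or repeated-data statistic would refine this estimate where the uniformity assumption does not hold.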

In a possible implementation, when the working node decides to generate a filter condition and uses it to query data from the first file, the working node may calculate the proportion of the traversed data of the first file that is made up of the data already found to meet the query conditions. When this proportion is greater than a filtering threshold, the filter condition is not filtering well in the actual data query process. In this case, the working node can stop using the filter condition to query the remaining data, and can instead query the remaining data that meets the query conditions from the untraversed data of the first file according to the query conditions. In this way, when the filter condition does not deliver the expected filtering effect, its use can be stopped in time, avoiding as far as possible the extra overhead of applying it.
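The run-time check can be sketched as a single scan that tracks the ratio of matched rows to traversed rows. This is a simplified simulation: the function name `scan_with_fallback`, the `check_every` interval, and the modeling of rows as bare identifiers are all assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple


def scan_with_fallback(
    first_file_ids: Iterable[int],
    filter_ids: Set[int],
    filtering_threshold: float,
    check_every: int = 4,
) -> Tuple[List[int], bool]:
    """Return the matching ids and whether the filter was abandoned mid-scan."""
    matched: List[int] = []
    filter_active = True
    for traversed, row_id in enumerate(first_file_ids, start=1):
        # Both paths keep the same rows; in a real system they differ in where
        # the check runs (at the data source vs. on the working node).
        if row_id in filter_ids:
            matched.append(row_id)
        # Periodically compare matched rows against traversed rows.
        if filter_active and traversed % check_every == 0:
            if len(matched) / traversed > filtering_threshold:
                filter_active = False  # stop applying the filter to the remaining data
    return matched, not filter_active


poor_filter = scan_with_fallback(range(1, 9), {1, 2, 3, 4, 5, 6, 7}, 0.5)
good_filter = scan_with_fallback(range(1, 9), {8}, 0.5)
```

With the first call, all four rows traversed at the first check match, so the ratio exceeds the threshold and the filter is abandoned; with the second call, the ratio stays low and the filter remains in use.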

In the second aspect, based on the same inventive concept as the method embodiments of the first aspect, the embodiments of the present application provide a data query apparatus. The device has functions corresponding to the implementations of the above-mentioned first aspect. This function can be implemented by hardware or by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above functions.

In a third aspect, an embodiment of the present application provides a device, including a processor and a memory. The memory is used to store instructions; when the device runs, the processor executes the instructions stored in the memory, so that the device performs the data query method in the first aspect or any implementation of the first aspect. It should be noted that the memory may be integrated in the processor or may be independent of the processor. The device may also include a bus, through which the processor is connected to the memory. The memory may include readable memory and random access memory.

In a fourth aspect, an embodiment of the present application further provides a readable storage medium in which a program or instructions are stored; when the program or instructions run on a computer, the data query method in the first aspect or any implementation of the first aspect is executed.

In a fifth aspect, the embodiments of the present application further provide a computer program product including instructions, which, when running on a computer, enables the computer to execute any data query method in the first aspect or any implementation manner of the first aspect.

In addition, for the technical effects of any implementation of the second aspect to the sixth aspect, reference may be made to the technical effects of the corresponding implementations of the first aspect, which are not repeated here.

Description of drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them.

FIG. 1 is a schematic structural diagram of an exemplary data processing system provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of the people table and the orders table included in the data source 103;

FIG. 3 is a schematic structural diagram of another exemplary data processing system provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of a data query method provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of a query statement input interface provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a data query apparatus provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a hardware structure of a device provided by an embodiment of the present application.

Detailed description

FIG. 1 is a schematic structural diagram of an exemplary data processing system. As shown in FIG. 1, the data processing system 100 may include a coordinator node 101 and a worker node 102. The worker node 102 may access data in the data source 103, and the data source 103 may include one or more data sources, such as the Hive and Oracle data sources shown in FIG. 1.

The data processing system 100 may provide a client 104 for human-computer interaction with the user, so as to execute the corresponding data query process based on the query statement input by the user. The coordinating node 101 may receive the SQL statement input by the user through the client 104 and perform syntax analysis and semantic analysis on it. In syntax analysis, the coordinating node 101 uses the syntax rules of the SQL language to check whether the SQL statement contains a syntax error; in semantic analysis, the coordinating node 101 checks whether the semantics of the SQL statement are legal. When both the syntax and the semantics of the SQL statement are legal, the coordinating node 101 can generate a logical plan tree according to the SQL statement; the logical plan tree indicates a logical execution plan for computing, analyzing, and accessing data. The coordinating node 101 can then optimize the logical plan tree through one or more optimizers and send the optimized logical plan tree to the worker node 102 for execution. Specifically, a scheduler can determine which executor in the worker node 102 the logical plan tree is sent to. The worker node 102 may include one or more executors (FIG. 1 takes executor 1 and executor 2 as an example), which execute the corresponding plan according to the received logical plan tree and return the query result obtained after executing the plan to the client 104 through the coordinating node 101, so that the client 104 presents the query result to the user.

In the era of big data, the amount of data in the data source 103 is usually large; for example, the number of rows may reach tens of millions or even hundreds of millions, forming a table A with a large amount of data, such as a fact table. If the worker node 102 directly traverses table A, the data query efficiency of the data processing system 100 may drop because there is too much data to traverse. For this reason, table A is currently usually joined with another table B that has a small amount of data (such as a dimension table). Table A and table B have at least one column in common, and the two tables share at least some column values in that column. The worker node 102 can then first query table B for the data that meets the query conditions, and use that data as a filter condition to obtain the data in table A that meets the query conditions (that is, the data required by the user). The filter condition may be, for example, some of the column values of the column that table A and table B have in common.

For example, assume that the data source 103 holds a people table with a large amount of data and an orders table with a small amount of data, as shown in FIG. 2. Both tables have an id (identity) column, and the id columns of the two tables share some column values; as shown in FIG. 2, the shared values are 120001 to 149999. Suppose the user inputs the following SQL statement on the client 104: select people.name from people join orders on people.id=orders.id where orders.sum>10, that is, a query for the data in the data source 103 whose orders.sum is greater than 10. To reduce the amount of data read from the people table, the worker node 102 may first collect from the orders table, according to the query conditions "people.id=orders.id" and "orders.sum>10" in the query statement, the id column values of the data that satisfies the query conditions, that is, id values such as 120001, 120002, ..., 149999. The worker node 102 then uses these id values as the filter condition to read the data in the people table, that is, it reads the rows with id values 120001, 120002, ..., 149999 and does not read the rows with id values 150000 and 150001. In this way, in the process of querying data, the worker node 102 does not need to read all the data in the people table, but can read only part of it through the generated filter condition, thereby reducing the amount of data read from the people table, that is, reducing the amount of data involved in the Join operation on the two tables.
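The people/orders example can be mimicked in memory as a minimal sketch. The row contents and names below are invented for illustration, each id is assumed to appear at most once in the orders table, and the helper names `build_dynamic_filter` and `read_people` are hypothetical.

```python
from typing import Dict, List, Set


def build_dynamic_filter(orders: List[Dict], min_sum: float) -> Set[int]:
    """Collect the id values of orders rows that satisfy orders.sum > min_sum."""
    return {row["id"] for row in orders if row["sum"] > min_sum}


def read_people(people: List[Dict], id_filter: Set[int]) -> List[Dict]:
    """Stand-in for the data source: return only rows whose id is in the filter."""
    return [row for row in people if row["id"] in id_filter]


# Invented sample rows; the real tables in FIG. 2 are far larger.
people = [
    {"id": 120001, "name": "A"},
    {"id": 120002, "name": "B"},
    {"id": 150000, "name": "C"},
    {"id": 150001, "name": "D"},
]
orders = [
    {"id": 120001, "sum": 25},
    {"id": 120002, "sum": 8},
]

id_filter = build_dynamic_filter(orders, 10)
rows_read = read_people(people, id_filter)
names = [row["name"] for row in rows_read]
```

Only one of the four people rows is read in this example, which is the saving the filter condition is meant to deliver when few rows match.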

When the worker node 102 in the data processing system includes multiple executors, the Join operation can be split across the executors. As shown in FIG. 1, the worker node 102 includes executor 1 and executor 2, and the Join operation may be allocated to Join node_1 and Join node_2 for execution, where a Join node is a logical node that performs the join operation on table A and table B. Each executor can send the filter condition generated according to its corresponding table B to the coordinating node 101. As shown in FIG. 1, executor 1 can send the partial filter condition (partial filter) generated according to table B_1 to the coordinating node 101, executor 2 can likewise send the partial filter generated according to table B_2, and the coordinating node 101 merges the filter conditions sent by the executors. The coordinating node 101 can then deliver the merged filter condition to executor 1 and executor 2, which perform this Join operation, so that each executor can read the corresponding data from its table A according to the merged filter condition.
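If each partial filter is modeled as a set of column values, merging at the coordinating node reduces to a set union. The sketch below makes that assumption; the function name `merge_partial_filters` is hypothetical, and a real system might use a different filter representation.

```python
from typing import FrozenSet, Iterable, Set


def merge_partial_filters(partial_filters: Iterable[Set[int]]) -> FrozenSet[int]:
    """Union the partial filters sent by the executors into one merged filter."""
    merged: Set[int] = set()
    for partial in partial_filters:
        merged |= partial
    # Frozen so that the same merged filter can be handed to every executor unchanged.
    return frozenset(merged)


partial_from_executor_1 = {120001, 120002}  # generated from table B_1
partial_from_executor_2 = {120002, 120003}  # generated from table B_2
merged_filter = merge_partial_filters([partial_from_executor_1, partial_from_executor_2])
```

Each executor then reads its part of table A against `merged_filter`, so a row is kept if its id matched in either part of table B.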

However, in practical applications, the filter condition generated by the worker node 102 according to table B may not achieve a good data filtering effect. That is, after filtering by the filter condition (or the merged filter condition), the amount of data read from table A differs little from the amount of data in the entire table A. Compared with an implementation in which the worker node 102 directly reads all the data in the entire table A, the data processing system 100 then saves little overhead from the reduced amount of data read by the worker node 102. At the same time, generating the filter condition and using it to filter the data in table A incurs overhead of its own (such as consumption of computing resources), which may exceed the overhead saved. In particular, when the data processing system 100 uses multiple worker nodes 102 to query data, each worker node 102 also sends the filter condition it generates to the coordinating node 101, and the coordinating node 101 merges the multiple filter conditions and delivers the merged filter condition to the multiple worker nodes 102, which occupies still more resources of the data processing system 100 (including network transmission, system computing, and storage resources).
In this way, when the filter condition filters poorly, using it not only fails to reduce the overall overhead of the data processing system 100 compared with having the worker node 102 directly read all the data in the entire table A, but also consumes additional network transmission, system computing, and storage resources, which increases the overall overhead of the data processing system 100. At the same time, the generation, transmission, merging, and application of the filter condition also reduce data query efficiency.

Based on this, the embodiments of the present application provide a data query method that keeps the data query efficiency of the data processing system 100 at a high level and the query overhead at a low level. Specifically, the worker node 102 can estimate in advance, according to the query conditions and the data information corresponding to the two tables participating in the Join operation, the proportion of table A (the table with the large amount of data) made up of data that meets the query conditions. When the proportion is relatively large, a filter condition is unlikely to filter well, that is, the amount of data before and after filtering would differ little. In this case, the worker node 102 can directly read table A and table B, query the data that meets the query conditions required by the user (hereinafter referred to as the data to be queried), and perform the query operation. When the proportion is relatively small, the filter condition can effectively filter out a large amount of irrelevant data in table A (that is, data that is not to be queried and need not be read to the worker node 102). In this case, the worker node 102 can use the query conditions and table B to generate the filter condition, and query the data to be queried from table A based on the filter condition. Therefore, when the filter condition filters well, the worker node 102 can use it to reduce the amount of data read from table A, improving data query efficiency and reducing query overhead; and when the filter condition would filter poorly, the data processing system 100 avoids the reduced data query efficiency and increased query overhead by not generating a filter condition.
In this way, the data query efficiency of the data processing system 100 can be kept at a high level while the query overhead is kept at a low level.

For example, as shown in FIG. 3, the Join operation is again split across executor 1 and executor 2. Both executors include a determination module (determination module 1 and determination module 2 in FIG. 3), which is used to determine whether to generate a filter condition. The determination module in each executor can estimate, according to the query conditions and the data information corresponding to the two tables to be joined, whether the filter condition to be generated would filter well, that is, determine whether the above data proportion is small. If the estimated filtering effect is good, the determination module can instruct its executor to generate the filter condition; if the estimated filtering effect is poor, the determination module can instruct its executor not to generate the filter condition.

In this way, if executor 1 and executor 2 both generate filter conditions, they can each send the partial filter condition they generate to the coordinating node 101, so that the coordinating node 101 can merge and distribute the filter conditions. If only executor 1 generates a filter condition, executor 1 can directly use it to read data from table A_1, while executor 2 can directly read the entire table A_2 and then find the data that satisfies the query conditions in table A_2. If neither executor generates a filter condition, each can read its corresponding table A and find the data satisfying the query conditions in the table it has read.
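The three cases above can be sketched as a small planning function. The representation is an assumption made for illustration: each executor reports either a set of ids (its partial filter) or `None` (no filter generated), and the hypothetical function `plan_reads` merges the filters only when every executor generated one. How a system with more than two executors handles a mixed outcome is not specified in the text.

```python
from typing import Dict, Optional, Set


def plan_reads(
    partial_filters: Dict[str, Optional[Set[int]]]
) -> Dict[str, Optional[Set[int]]]:
    """Map each executor to the filter it applies; None means read its whole table A."""
    generated = {name: f for name, f in partial_filters.items() if f is not None}
    if generated and len(generated) == len(partial_filters):
        # Every executor generated a filter: merge them and distribute the result.
        merged: Set[int] = set().union(*generated.values())
        return {name: merged for name in partial_filters}
    # Otherwise each executor keeps its own filter, or reads its whole table.
    return dict(partial_filters)


both = plan_reads({"executor_1": {1}, "executor_2": {2}})
only_first = plan_reads({"executor_1": {1}, "executor_2": None})
neither = plan_reads({"executor_1": None, "executor_2": None})
```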

It is worth noting that the system architectures shown in FIG. 1 and FIG. 3 are only examples and are not intended to limit the specific implementation. For example, in other possible system architectures, the data processing system may not include the client 104; or the number of executors included in the worker node is not limited to two; or the data processing system may integrate the data source in addition to the nodes. In practical applications, components may be added to or removed from the architectures shown in FIG. 1 and FIG. 3 as needed, which is not limited in this embodiment.

In order to make the above objects, features and advantages of the present application more clearly understood, various non-limiting implementations in the embodiments of the present application will be exemplarily described below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.

FIG. 4 is a schematic flowchart of a data query method in an embodiment of the present application. The method can be applied to the data processing system 100 shown in FIG. 1 or FIG. 3, and can be executed by the data processing system 100 or by corresponding nodes in the data processing system 100. In this embodiment, execution by the coordinating node 101 and the worker node 102 in the data processing system 100 is taken as an example. The method may specifically include:

S401: The coordination node 101 receives a data query statement, where the data query statement includes query conditions corresponding to the data to be queried and a query operation on a first file and a second file, and the data volume of the first file is larger than that of the second file.

In practical application, the user may provide a data query statement to the data processing system 100, so that the data processing system 100 can locate the data to be queried based on the data query statement. For convenience of description, in this embodiment, the data required by the user is referred to as the data to be queried.

As an implementation example for receiving a data query statement, the data processing system 100 includes a client 104, and the client 104 can present a query statement input interface as shown in FIG. 5 to the user. The user may input a query statement in a specific area of the query statement input interface presented by the client 104, so that the data processing system 100 returns the query result the user expects. The client 104 can then send the data query statement to the coordinating node 101, for example in a data query request, so that the coordinating node 101 can query the data to be queried required by the user. The data query statement includes query conditions and a query operation on two files (for example, two tables), such as a join (Join) operation, and the query conditions are used to locate the data to be queried. For example, the data query statement input by the user may be the SQL statement: select people.name from people join orders on people.id=orders.id where orders.sum>10, which is used to query the rows of the people table shown in FIG. 2 that have the corresponding id values. Here "people.id=orders.id" and "orders.sum>10" in the SQL statement are the query conditions used to locate the data to be queried, that is, the rows in the aforementioned people table with id values 120001, 120002, ..., 149999.

S402: The coordination node 101 performs syntax analysis and semantic analysis on the data query statement to determine whether the data query statement input by the user is legal.

In practical application, if the data query statement input by the user is invalid, the data processing system 100 may terminate the data query task, and may prompt the user to input a data query statement with correct syntax/semantics. If it is determined that the data query statement input by the user is valid, the coordinating node 101 may continue to perform subsequent data query steps.

S403: After determining that the data query statement is valid, the coordinating node 101 sends a task of querying the first file and the second file to the worker node, where the task includes query conditions for the first file and the second file.

As an example, the task sent by the coordinating node 101 to the worker node may be a logical plan tree, where the logical plan tree includes the query conditions for the data to be queried and the identifiers (such as file names) of the two files participating in the query operation. In a specific implementation, the coordinating node 101 may generate the logical plan tree according to the data query statement and send it to the worker node 102, so as to schedule the worker node 102 to execute the tasks in the logical plan tree.
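To make the contents of such a task concrete, the following sketch models a logical plan tree as nested nodes carrying query conditions and file identifiers. The structure is purely illustrative: the class `PlanNode`, its field names, and the operation labels are assumptions and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Hypothetical logical plan node: an operation, its conditions, and its inputs."""
    operation: str
    conditions: List[str] = field(default_factory=list)
    file_ids: List[str] = field(default_factory=list)
    children: List["PlanNode"] = field(default_factory=list)


def collect_file_ids(node: PlanNode) -> List[str]:
    """Walk the tree depth-first and list the identifiers of the files it touches."""
    found = list(node.file_ids)
    for child in node.children:
        found.extend(collect_file_ids(child))
    return found


# Plan for: select people.name from people join orders
#           on people.id=orders.id where orders.sum>10
plan = PlanNode(
    operation="project",
    children=[
        PlanNode(
            operation="join",
            conditions=["people.id=orders.id"],
            children=[
                PlanNode(operation="scan", file_ids=["people"]),
                PlanNode(operation="scan", conditions=["orders.sum>10"], file_ids=["orders"]),
            ],
        )
    ],
)
files_in_task = collect_file_ids(plan)
```

A worker node receiving such a tree has both the query conditions and the two file identifiers it needs to estimate the proportion in the next step.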

S404: The worker node 102 estimates, according to the query conditions in the received task and the data information respectively corresponding to the first file and the second file, the proportion in the first file of the data in the first file that meets the query conditions.

The data source 103 may include a first file and a second file, and the worker node 102 may obtain data satisfying the query condition by accessing the data in the first file and the second file. In practical applications, the first file and the second file may be formed from different data in the data source 103, and may specifically record data in the form of tables, such as the people table and the orders table shown in FIG. 2.

In some implementations, the data in the first file and the second file may be associated. For example, the first file and the second file may contain the same column, and the values included in this column in the first file may be at least partially the same as the values included in this column in the second file. As shown in FIG. 2, the id columns of the people table and the orders table share some id values, such as 120001, 120002, ..., 149999.

In this embodiment, the amounts of data included in the first file and the second file may differ. For example, the amount of data in the first file may be larger than that in the second file. In practical applications, the first file may specifically be a fact table, and the second file may specifically be a dimension table or the like. The data recorded in a fact table is usually rich and can include information of multiple dimensions. For example, a fact table used for work records can include the working date, working employee, working hours, overtime hours, work nature, work content, job leader, and other information, which covers three dimensions: the time dimension (working date, working hours, overtime hours), the personnel dimension (working employee, job leader), and the job attribute dimension (work nature, work content). Correspondingly, since a fact table records more data, its data volume is usually large. A dimension table can be used to record the data of some dimensions. For example, a time dimension table can be used to record only the data of the time dimension, such as the above-mentioned working date, working hours, and overtime hours; data such as the person in charge of the work may not be recorded in this time dimension table, but may be recorded in other dimension tables. A dimension table can be regarded as a window for analyzing data, which contains the data characteristics of some dimensions in the fact table. Typically, dimension tables contain less data than fact tables.

In this embodiment, the amount of data in the first file may be relatively large, and the data that meets the query conditions (that is, the data to be queried) may be only a small part of the data in the first file. If the worker node 102 accesses the data source 103 to read the entire first file and query the data to be queried, the worker node 102 needs to read a large amount of useless data in the first file other than the data to be queried. Therefore, in this embodiment, the worker node 102 can reduce the amount of data read from the first file by using a filter condition generated according to the second file. Since a generated filter condition does not necessarily have a good data filtering effect, the worker node 102 can parse the query conditions and the two files involved in the Join operation from the logical plan tree and obtain the data information respectively corresponding to the first file and the second file. The worker node 102 can then estimate the proportion of the data to be queried in the first file according to the query conditions and the data information respectively corresponding to the two files, and use this proportion to measure whether the filter condition would have a good data filtering effect.

Specifically, when the proportion is relatively large, most of the data in the first file is data to be queried. In this case, even if the data in the first file is filtered by using the filter condition, the amount of data to be queried that the worker node 102 reads differs little from the amount of data read for the entire first file, which shows that the filtering effect of the filter condition is poor. Conversely, when the proportion is small, only a small part of the data in the first file is data to be queried. In this case, most of the data in the first file can be effectively filtered out by using the filter condition, and the worker node 102 only needs to read a small part of the data in the first file to obtain the required data to be queried, which shows that the filtering effect of the filter condition is good.

As an implementation example of estimating the proportion, the data information respectively corresponding to the first file and the second file may specifically be the data statistics information respectively corresponding to the first file and the second file. Exemplarily, the data statistics information corresponding to the first file (or the second file) may be any one or more of the data range, data distribution interval, and repeated data of the first file (or the second file). The data range refers to the value range of the data in the first file (or the second file), and may specifically be the value range of each column in the file, that is, from the minimum value to the maximum value of the data in each column. The data distribution interval is used to indicate the distribution of the data in the first file (or the second file); for example, the data distribution can be represented by a bar chart. The repeated data is used to indicate identical data existing in the first file (or the second file); for example, it can be expressed by the repetition rate or the number of repetitions of each data item.
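The three kinds of data statistics information named above can be sketched for a single integer column as follows. The function name column_statistics, the dictionary keys, and the fixed bucket width are assumptions for illustration; the application does not prescribe how the statistics are stored.

```python
from collections import Counter


def column_statistics(values, bucket_width):
    """Return range, bucketed distribution and repetition counts for one column.

    Assumes `values` is a non-empty list of integers.
    """
    counts = Counter(values)
    # Bucket each value by the lower bound of its fixed-width interval.
    histogram = Counter((v // bucket_width) * bucket_width for v in values)
    return {
        "range": (min(values), max(values)),                      # data range
        "histogram": dict(sorted(histogram.items())),             # data distribution
        "repeats": {v: n for v, n in counts.items() if n > 1},    # repeated data
    }


print(column_statistics([1, 2, 2, 5, 11, 12], bucket_width=10))
```

The histogram plays the role of the bar chart mentioned in the text, and the repeats entry records the number of repetitions of each value that occurs more than once.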

The worker node 102 can determine, according to the data statistics information corresponding to the second file, the identifiers of the data in the second file that satisfies the query condition, and then calculate, according to the data statistics information corresponding to the first file, the proportion in the first file of the data having those identifiers. For example, assuming that the data statistics information is specifically a data range, the worker node 102 can find, according to the data range of the second file, the data range in the second file that satisfies the query condition, and further determine the identifiers corresponding to the data within the found data range. Then, the worker node 102 finds the data having those identifiers in the first file and calculates the proportion of that data in the first file. For example, if the range of data in the first file is 1 to 10000 and the data having the above identifiers in the first file is 1 to 1000, the proportion may be 1/10 (i.e., 1000/10000). Of course, the data processing system 100 may also calculate the proportion according to the above data distribution interval or repeated data, or according to any combination of the three kinds of information, which is not limited in this embodiment.
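The range-based estimate can be sketched as an interval overlap. The function name estimate_ratio_from_ranges is hypothetical, and the sketch assumes integer identifiers that are spread evenly over the range of the first file, which the application does not state; the figures 1000 and 10000 are the ones used in the example above.

```python
def estimate_ratio_from_ranges(first_range, matching_range):
    """Estimate the share of the first file covered by the matching identifiers.

    first_range:    (min, max) of the identifier column in the first file.
    matching_range: (min, max) of the identifiers that satisfy the query
                    condition in the second file.
    Both ranges are inclusive and hold integers.
    """
    first_lo, first_hi = first_range
    match_lo, match_hi = matching_range
    overlap_lo = max(first_lo, match_lo)
    overlap_hi = min(first_hi, match_hi)
    if overlap_hi < overlap_lo:
        return 0.0  # the ranges do not intersect
    return (overlap_hi - overlap_lo + 1) / (first_hi - first_lo + 1)


print(estimate_ratio_from_ranges((1, 10000), (1, 1000)))
```

When the identifiers are skewed rather than evenly spread, the distribution interval or repetition information would be needed to correct this estimate.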

The data statistics information corresponding to the first file (or the second file) may be collected in advance by the worker node 102 and the coordinating node 101 and stored locally. For example, after the data processing system 100 is connected to the data source 103, the worker nodes 102 can access the files in the data source 103 one by one and send the read files to the coordinating node 101, so that the coordinating node 101 can generate and save the corresponding data statistics information for each file. In this way, when querying data subsequently, the worker node 102 can obtain the data statistics information corresponding to the file to be accessed locally (or from the coordinating node 101), so as to determine, based on the data statistics information, whether to generate a filter condition during the data query process. Of course, in practical applications, the worker node 102 may also obtain the data statistics information corresponding to the first file and the second file in other ways. For example, when the first file and the second file are stored in the data source 103, a relevant device in the data source 103 generates the corresponding data statistics information for the first file and the second file in advance, so that the worker node 102 can obtain it from the data source 103. In this embodiment, the specific manner in which the worker node obtains the data statistics information is not limited.

In another implementation example of estimating the proportion, the worker node 102 may also determine the proportion by means of data sampling. In this case, the data information respectively corresponding to the first file and the second file may be sampled data in the first file and sampled data in the second file. In a specific implementation, the worker node 102 may sample the data in the first file and the data in the second file respectively, for example by random sampling, equal-interval sampling, or equal-proportion sampling, and read the sampled data of the first file and the sampled data of the second file from the data source 103. Then, the worker node 102 can determine the identifiers of the target sampled data that satisfies the query condition in the sampled data of the second file, and count the sampled data having those identifiers in the sampled data of the first file, so as to calculate the proportion of the sampled data having those identifiers in the sampled data of the first file. In this way, the proportion of the data to be queried in the first file can be estimated based on the sampled data of the first file and the second file.
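A minimal sketch of the sampling-based estimate is given below, using random sampling from Python's standard random module. The function name estimate_ratio_by_sampling, the row representation as dictionaries, and the seed parameter are assumptions for illustration.

```python
import random


def estimate_ratio_by_sampling(first_rows, second_rows, condition, key,
                               sample_size, seed=0):
    """Estimate the share of first-file rows that match qualifying second-file rows.

    condition: predicate applied to second-file rows (the query condition).
    key:       name of the column shared by both files (the identifier).
    """
    rng = random.Random(seed)
    first_sample = rng.sample(first_rows, min(sample_size, len(first_rows)))
    second_sample = rng.sample(second_rows, min(sample_size, len(second_rows)))
    # Identifiers of the target sampled data in the second file.
    matching_ids = {row[key] for row in second_sample if condition(row)}
    if not first_sample:
        return 0.0
    hits = sum(1 for row in first_sample if row[key] in matching_ids)
    return hits / len(first_sample)


people = [{"id": i} for i in range(1, 11)]
orders = [{"id": 1, "sum": 20}, {"id": 2, "sum": 30}, {"id": 3, "sum": 5}]
print(estimate_ratio_by_sampling(people, orders, lambda r: r["sum"] > 10, "id", 100))
```

Because both sides are sampled independently, a small sample of the second file can miss matching identifiers and bias the estimate downward; sampling the smaller second file more densely is one way to limit this, though the application leaves the sampling scheme open.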

It should be noted that the above two implementations of estimating the proportion are only examples. In practical applications, the proportion may also be estimated by other possible methods, which is not limited in this embodiment.

S405: When the proportion is greater than the preset threshold, the worker node 102 reads the first file and the second file to the worker node 102, and performs a query operation on the first file and the second file.

S406: When the proportion is less than the preset threshold, the worker node 102 generates a filter condition according to the query condition and the second file, and searches the first file for data that meets the query condition according to the filter condition, and executes a corresponding query operation.

In this embodiment, because the proportion calculated in step S404 can reflect the data filtering effect of the filter condition, a proportion greater than the preset threshold indicates that the filter condition to be generated would hardly filter out data in the first file. If the worker node 102 generated the filter condition, the query efficiency of the data processing system 100 might be reduced. In this case, the worker node 102 may not generate a filter condition according to the second file, but may instead read the first file and the second file from the data source 103, query the data satisfying the query conditions from the read first file and second file, and perform the corresponding operation, such as performing a Join operation on the queried data. In this way, it can be avoided that the overall overhead of the worker node 102 for generating and applying the filter condition (even including the transmission, merging, etc. of the filter condition) exceeds the overall overhead of the worker node 102 directly reading the first file.
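The choice between steps S405 and S406 reduces to a single comparison, sketched below. The function name choose_strategy and the returned labels are hypothetical. Treating a proportion exactly equal to the threshold as the filter case follows the "less than or equal to" wording used for the apparatus and in the claims, and is an assumption where the description only says "less than".

```python
def choose_strategy(proportion: float, preset_threshold: float) -> str:
    """Pick the read strategy for one query on the first and second files."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    if proportion > preset_threshold:
        # S405: a filter condition would remove little data, so skip it
        # and read both files directly.
        return "read_both_files"
    # S406: a filter condition is expected to remove most of the first file.
    return "use_filter_condition"


print(choose_strategy(0.9, 0.5), choose_strategy(0.1, 0.5))
```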

When the proportion is less than the preset threshold, the filter condition to be generated can effectively filter out a large part of the data in the first file. Accordingly, the overall overhead of the worker node 102 for generating and applying the filter condition (even including the transmission, merging, etc. of the filter condition) is smaller than the overall overhead of the worker node 102 directly reading the first file. In this case, the worker node 102 can read the second file and query from it the first data that satisfies the query condition. The worker node 102 can then generate a filter condition according to the first data and send the filter condition to the data source 103, to instruct the data source 103 to query, from the first file, the second data matching the first data and feed it back to the worker node 102. The first data and the second data may have the same identifier; for example, the first data in the second file and the second data in the first file have the same column value (such as the id value in the people table and the orders table in the foregoing example). In this way, the worker node 102 can perform the above query operation on the received second data and the first data queried from the second file, such as performing a Join operation on the first data and the second data, so as to obtain the data the user needs to query. Since the amount of data read by the worker node 102 from the data source 103 is small (relative to the amount of data read when the first file is read directly), the data query efficiency of the worker node 102 can be kept at a high level. For the specific implementation process in which the worker node 102 generates the filter condition according to the second file and the query condition, refer to the relevant descriptions above, which are not repeated here.
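The filter path described above can be sketched end to end on in-memory rows. The helpers build_filter, data_source_scan, and join are hypothetical stand-ins: the filter condition is modelled as a plain set of identifiers and the data source as a list scan, neither of which is specified by the application.

```python
def build_filter(second_rows, condition, key):
    """Query the first data from the second file and derive a filter from it."""
    first_data = [row for row in second_rows if condition(row)]
    return first_data, {row[key] for row in first_data}


def data_source_scan(first_rows, key, id_filter):
    """Stand-in for the data source: return only rows admitted by the filter."""
    return [row for row in first_rows if row[key] in id_filter]


def join(first_data, second_data, key):
    """Join the first data (second file) with the second data (first file)."""
    by_key = {}
    for row in second_data:
        by_key.setdefault(row[key], []).append(row)
    return [(p, o) for o in first_data for p in by_key.get(o[key], [])]


people = [{"id": i, "name": f"p{i}"} for i in range(1, 6)]          # first file
orders = [{"id": 2, "sum": 20}, {"id": 3, "sum": 5}, {"id": 4, "sum": 30}]

first_data, id_filter = build_filter(orders, lambda r: r["sum"] > 10, "id")
second_data = data_source_scan(people, "id", id_filter)
names = [p["name"] for p, _ in join(first_data, second_data, "id")]
print(names)
```

In this toy run only two of the five rows of the first file leave the data source, which is the saving the filter condition is meant to provide.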

The preset threshold may be calculated and determined by the worker node 102 according to the data volume of the first file; that is, when the worker node 102 queries data in different files, different preset thresholds may be generated. Alternatively, the preset threshold may be a fixed value, such as an empirical value, preset by relevant technical personnel. In this embodiment, the specific manner of setting the preset threshold is not limited.

It is worth noting that this embodiment takes the data source 103 including the first file and the second file as an example for illustration. In practical applications, the data source 103 may also include more files, such as a third file. When multiple files in the data source 103 need to participate in the Join operation, the first file and the second file can be joined first, and then the first file and the third file, and so on, which is not described in detail in this embodiment.

It should be noted that, in some possible implementations, after the maximum data amount of each file in the data source 103 is determined, the above-mentioned proportion may also be characterized by the amount of data included in the filter condition. For example, in some scenarios, a larger amount of data included in the filter condition indicates a larger proportion; conversely, a smaller amount of data included in the filter condition indicates a smaller proportion. Accordingly, where whether to generate a filter condition is determined according to the size of the proportion, it may equally be determined according to the amount of data included in the filter condition. The specific implementation principle is similar to the above process and is not repeated in this embodiment.

In a further possible implementation, in the process of querying data, the worker node 102 may determine, according to the part of the data already queried, whether to continue using the filter condition to query the remaining data. Even when the data sampling method predicts that the filtering effect of the filter condition is good, the filtering effect may turn out to be poor in actual use. In this way, the worker node 102 can stop using the filter condition to filter the data in the first file in time.

As an example, after using the filter condition to query part of the data to be queried from the first file, the worker node 102 may calculate the proportion of the queried data (the part of the data to be queried that has already been found) in the traversed data of the first file, where the traversed data of the first file includes the queried data and the data of the first file that has so far been filtered out by the filter condition. When the proportion is relatively large, specifically when it is greater than the filtering threshold, the filtering effect of the filter condition on the data in the first file is poor in actual use. Therefore, when subsequently querying the remaining data in the data to be queried (that is, the data other than the queried data), the worker node 102 can directly read the untraversed data of the first file to the worker node 102 and continue to query the remaining data from the untraversed data according to the query condition, without using the filter condition to filter the data in the first file. Of course, when the proportion is small, specifically when it is less than the filtering threshold, the filter condition has a good filtering effect on the data in the first file in actual use. Therefore, the worker node 102 can continue to use the filter condition to query the remaining data in the data to be queried from the untraversed data of the first file.
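This runtime check can be sketched as a scan over batches of the first file that tracks the share of queried rows among traversed rows. The function name adaptive_scan, the batch granularity, and the set-based filter are assumptions for illustration; the application does not say how often the proportion is recomputed.

```python
def adaptive_scan(batches, key, id_filter, filtering_threshold):
    """Scan the first file in batches and drop the filter if it stops paying off.

    Returns the rows handed on to the query operation and whether the filter
    condition was still in use at the end of the scan.
    """
    matched = 0      # queried data found so far
    traversed = 0    # queried data plus data filtered out so far
    use_filter = True
    handed_on = []
    for batch in batches:
        if use_filter:
            kept = [row for row in batch if row[key] in id_filter]
            matched += len(kept)
            traversed += len(batch)
            handed_on.extend(kept)
            if traversed and matched / traversed > filtering_threshold:
                # Poor filtering effect in practice: read the rest unfiltered.
                use_filter = False
        else:
            # Untraversed data is read directly; the query condition itself
            # selects the remaining data later.
            handed_on.extend(batch)
    return handed_on, use_filter


batches = [[{"id": 1}, {"id": 2}], [{"id": 3}, {"id": 9}]]
rows, still_filtering = adaptive_scan(batches, "id", {1, 2, 3}, 0.5)
print([r["id"] for r in rows], still_filtering)
```

In the run above every row of the first batch passes the filter, so the proportion exceeds the filtering threshold and the second batch is read without filtering.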

In other embodiments, the proportion of the queried data in the traversed data of the first file may also be determined by means of data sampling. For example, when the data volume of the queried data is relatively large, the traversed data of the first file and the queried data can be sampled respectively, and the ratio of the sampled data of the queried data to the sampled data of the traversed data is taken as the proportion. The specific manner of determining the proportion is not limited in this embodiment.

In practical applications, a separate functional module, such as the judgment module in FIG. 3 above, can be configured in the actuator. In the process of querying data, this functional module can determine the proportion through real-time monitoring or data sampling, and determine according to the proportion whether to continue to use the filter condition to filter the data in the first file in the subsequent query process. The functional module may be implemented by software or hardware, which is not limited in this embodiment.

The setting method of the filtering threshold may be similar to the setting method of the aforementioned preset threshold. For the specific implementation of how to set the filtering threshold, please refer to the description of the aforementioned setting of the preset threshold, which will not be repeated here. Moreover, the value of the filtering threshold may be the same as the preset threshold, or may be different from the preset threshold, which is not limited in this embodiment.

The data query method provided by the present application is described in detail above with reference to FIGS. 1 to 5 , and the apparatus and equipment provided according to the present application will be described below with reference to FIGS. 6 to 7 .

Based on the same inventive concept as the above method, an embodiment of the present application further provides a data query apparatus. The data query apparatus can implement the functions of the data processing system in the embodiment shown in FIG. 4, where the data processing system includes a coordination node and a working node. Referring to FIG. 6, the data query apparatus 600 may include:

A first communication module 601, configured to send, to the working node, a task of performing a query operation on the first file and the second file, where the task includes query conditions for the first file and the second file, and the data volume of the first file is greater than the data volume of the second file;

An estimation module 602, configured to estimate, according to the query condition in the received task and the data information respectively corresponding to the first file and the second file, the proportion in the first file of the data in the first file that meets the query condition;

A reading module 603, configured to read the first file and the second file to the working node when the proportion is greater than a preset threshold, and perform the query operation on the first file and the second file.

In a possible implementation manner, the reading module 603 is further configured to read the second file when the ratio is less than or equal to the preset threshold;

The apparatus 600 also includes:

A query module 604, configured to query the first data in the second file that satisfies the query condition;

a generating module 605, configured to generate filter conditions according to the first data;

A second communication module 606, configured to send the filter condition to the data source where the first file is located, where the filter condition is used to instruct the data source to query the first file for second data matching the first data; and to receive the second data sent by the data source and perform the query operation on the first data and the second data.

In a possible implementation manner, the data information corresponding to the first file and the second file respectively includes data statistics information corresponding to the first file and the second file respectively;

The estimating module 602 is specifically used for:

According to the data statistics information corresponding to the second file, determine the identifier of the data in the second file that meets the query condition;

According to the data statistics information corresponding to the first file, the proportion of the data with the identifier in the first file in the first file is calculated.

In a possible implementation manner, the data information corresponding to the first file and the second file respectively includes sampling data in the first file and sampling data in the second file;

The estimating module 602 is specifically used for:

The working node respectively samples the data in the first file and the data in the second file to obtain the sampling data in the first file and the sampling data in the second file;

The working node determines the identifier of the target sampled data that meets the query condition in the sampled data of the second file;

The working node calculates the proportion of the sampled data with the identifier in the sampled data of the first file in the sampled data of the first file.

In a possible implementation manner, the data statistics information corresponding to the first file includes any one or more of the data range, data distribution interval, and repeated data of the first file.

In a possible implementation manner, when the proportion is less than the preset threshold, the reading module 603 is further configured to:

In the process of querying the data to be queried from the first file, calculate the proportion of the queried data in the data to be queried in the traversed data of the first file;

When the proportion is greater than the filtering threshold, query the remaining data in the data to be queried from the untraversed data of the first file according to the query condition.

The data query apparatus 600 in this embodiment corresponds to the data query method shown in FIG. 4. Therefore, for the specific implementation of each functional module in the data query apparatus 600 and its technical effects, refer to the description of the relevant parts in the embodiment shown in FIG. 4, which is not repeated here.

In addition, an embodiment of the present application further provides a device. As shown in FIG. 7 , the device 700 may include a communication interface 710 and a processor 720 . Optionally, the device 700 may further include a memory 730 . The memory 730 may be disposed inside the device 700 or outside the device 700 . Exemplarily, each action in the above-mentioned embodiment shown in FIG. 4 may be implemented by the processor 720 . The processor 720 may acquire the first file and the second file in the data source 103 through the communication interface 710, and use them to implement any method executed in FIG. 4 . In the implementation process, each step of the processing flow can be implemented by the hardware integrated logic circuit in the processor 720 or the instructions in the form of software to complete the method executed in FIG. 4 . For brevity, details are not repeated here. The program codes executed by the processor 720 for implementing the above method may be stored in the memory 730 . The memory 730 is connected to the processor 720, such as a coupling connection or the like.

Some features of the embodiments of the present application may be implemented/supported by the processor 720 executing program instructions or software codes in the memory 730 . The software components loaded on the memory 730 can be summarized in terms of functions or logic, for example, the estimation module 602, the reading module 603, the query module 604, and the generation module 605 shown in FIG. 6 . The functions of the first communication module 601 and the second communication module 606 may be implemented by the communication interface 710 .

Any communication interface involved in the embodiments of this application may be a circuit, a bus, a transceiver, or any other apparatus that can be used for information exchange, such as the communication interface 710 in the device 700. Exemplarily, the other apparatus may be a device connected to the device 700, and the like.

The processors involved in the embodiments of the present application may be general-purpose processors, digital signal processors, application-specific integrated circuits, field programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the method disclosed in conjunction with the embodiments of the present application can be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.

The coupling in the embodiments of the present application is an indirect coupling or communication connection between apparatuses, units, or modules, which may be in electrical, mechanical, or other forms, and is used for information exchange between the apparatuses, units, or modules.

The processor may cooperate with the memory. The memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM). The memory may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.

The specific connection medium among the above-mentioned communication interface, processor, and memory is not limited in the embodiments of the present application. For example, the memory, the processor and the communication interface can be connected by a bus. The bus can be divided into an address bus, a data bus, a control bus, and the like.

Based on the above embodiments, the embodiments of the present application further provide a computer storage medium in which a software program is stored. When read and executed by one or more processors, the software program can implement the method performed by the data processing system 100 as provided in any one or more of the above embodiments. The computer storage medium may include a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, an optical disk, or any other medium that can store program code.

Based on the above embodiments, the embodiments of the present application further provide a chip including a processor for implementing the functions of the data processing system 100 involved in the above embodiments, for example, for implementing the method executed in FIG. 4. Optionally, the chip further includes a memory, and the memory is used to store the program instructions and data necessary for the processor. The chip may consist of chips, or may include chips and other discrete devices.

As will be appreciated by those skilled in the art, the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, where the instruction means implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

The terms "first", "second" and the like in the description and claims of the present application and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the terms used in this way can be interchanged under appropriate circumstances, and this is only a distinguishing manner adopted when describing objects with the same attributes in the embodiments of the present application.

Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from the scope of the embodiments of the present application. Thus, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include these modifications and variations.

Claims

A data query method, characterized in that the method is applied to a data processing system, the data processing system includes a coordination node and a working node, and the method includes:

The coordination node sends a task of querying a first file and a second file to the working node, the task including a query condition for the first file and the second file, wherein the data volume of the first file is greater than the data volume of the second file;

According to the query condition of the received task and the data information respectively corresponding to the first file and the second file, the working node estimates the proportion, in the first file, of data in the first file that meets the query condition;

When the proportion is greater than a preset threshold, the working node reads the first file and the second file into the working node, and executes the query operation on the first file and the second file.
The method according to claim 1, wherein the method further comprises:

When the proportion is less than or equal to the preset threshold, the working node reads the second file;

The working node queries the first data in the second file that satisfies the query condition;

generating, by the working node, a filter condition according to the first data;

The working node sends the filter condition to the data source where the first file is located, where the filter condition is used to instruct the data source to query the first file for second data matching the first data;

The working node receives the second data sent by the data source, and performs the query operation on the first data and the second data.
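
For illustration only, and without limiting the claims, the choice between reading both files in full and generating a filter condition can be sketched as follows. This is a minimal sketch under stated assumptions: the query operation is taken to be an equality join on a key, rows are `(key, payload)` tuples, and the names `execute_query`, `read_first`, `read_second`, and `fetch_filtered` are hypothetical helpers introduced here, not part of the application.

```python
from typing import Callable, Dict, List, Set, Tuple

Row = Tuple[int, str]  # hypothetical row layout: (join key, payload)


def execute_query(
    proportion: float,
    threshold: float,
    read_first: Callable[[], List[Row]],
    read_second: Callable[[], List[Row]],
    fetch_filtered: Callable[[Set[int]], List[Row]],
    condition: Callable[[Row], bool],
) -> List[Tuple[Row, Row]]:
    """Choose between a full read and a filter condition, then join on the key."""
    # First data: rows of the smaller (second) file that satisfy the condition.
    first_data = [row for row in read_second() if condition(row)]

    if proportion > threshold:
        # Most of the first file is expected to match, so a filter would
        # save little; read the first file in full.
        candidates = read_first()
    else:
        # Build a filter condition from the first data and let the data
        # source return only the matching rows (the second data).
        candidates = fetch_filtered({key for key, _ in first_data})

    # Assumed query operation: equality join between first data and candidates.
    by_key: Dict[int, List[Row]] = {}
    for row in candidates:
        by_key.setdefault(row[0], []).append(row)
    return [(small, big) for small in first_data for big in by_key.get(small[0], [])]
```

In this sketch both branches return the same join result; only the amount of data transferred from the data source differs. A proportion exactly equal to the threshold takes the filter branch, matching the "less than or equal to" wording above.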
The method according to claim 1 or 2, wherein the data information respectively corresponding to the first file and the second file includes sampled data of the first file and sampled data of the second file;

The working node estimating, according to the query condition of the received task and the data information respectively corresponding to the first file and the second file, the proportion, in the first file, of data in the first file that meets the query condition includes:

The working node respectively samples the data in the first file and the data in the second file to obtain the sampling data in the first file and the sampling data in the second file;

The working node determines the identifier of the target sampled data that meets the query condition in the sampled data of the second file;

The working node calculates the proportion, within the sampled data of the first file, of the sampled data having the identifier.
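
For illustration only, the sampling-based estimate can be sketched as follows. The sketch assumes that the "identifier" is the value of a join key stored in the first position of each row, that uniform random sampling without replacement is used, and that the function name `estimate_proportion_by_sampling` and its parameters are hypothetical; the application does not prescribe a particular sampling scheme.

```python
import random
from typing import Callable, List, Tuple

Row = Tuple[int, str]  # hypothetical row layout: (identifier, payload)


def estimate_proportion_by_sampling(
    first_file: List[Row],
    second_file: List[Row],
    condition: Callable[[Row], bool],
    sample_size: int,
    seed: int = 0,
) -> float:
    """Estimate the share of first-file rows that would survive the query."""
    rng = random.Random(seed)
    # Sample both files (uniformly, without replacement, as an assumption).
    first_sample = rng.sample(first_file, min(sample_size, len(first_file)))
    second_sample = rng.sample(second_file, min(sample_size, len(second_file)))

    # Identifiers of the target sampled data in the second file's sample.
    target_ids = {row[0] for row in second_sample if condition(row)}

    if not first_sample:
        return 0.0
    # Proportion of the first file's sample that carries one of those identifiers.
    hits = sum(1 for row in first_sample if row[0] in target_ids)
    return hits / len(first_sample)
```

The returned value would then be compared with the preset threshold. With small samples the estimate can be noisy, so the sample size is a tuning parameter in this sketch.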
The method according to claim 1 or 2, wherein the data information corresponding to the first file and the second file respectively comprises data statistics information corresponding to the first file and the second file respectively;

The working node estimating, according to the query condition of the received task and the data information respectively corresponding to the first file and the second file, the proportion, in the first file, of data in the first file that meets the query condition includes:

The working node determines, according to the data statistics information corresponding to the second file, the identifier of the data in the second file that meets the query condition;

The working node calculates, according to the data statistics information corresponding to the first file, the proportion, in the first file, of data in the first file that has the identifier.
The method according to claim 4, wherein the data statistics information corresponding to the first file includes any one or more of a data range, a data distribution interval, and repeated data of the first file.
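
For illustration only, the statistics-based estimate can be sketched as follows. The sketch makes several assumptions that are not stated in the application: the query condition is a range predicate on a numeric key, the data range of the second file is available as a `(low, high)` pair, the data distribution intervals of the first file are available as half-open buckets with row counts, and values are spread uniformly within each bucket. The names `FileStats`, `qualifying_range`, and `estimate_proportion_from_stats` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FileStats:
    """Hypothetical statistics: half-open buckets [low, high) with row counts."""
    buckets: List[Tuple[float, float, int]]

    @property
    def total(self) -> int:
        return sum(count for _, _, count in self.buckets)


def qualifying_range(
    second_range: Tuple[float, float], condition_range: Tuple[float, float]
) -> Tuple[float, float]:
    """Key range of second-file data that can meet a range-type query condition."""
    return (max(second_range[0], condition_range[0]),
            min(second_range[1], condition_range[1]))


def estimate_proportion_from_stats(
    first_stats: FileStats, key_low: float, key_high: float
) -> float:
    """Estimate the share of first-file rows with keys in [key_low, key_high)."""
    total = first_stats.total
    if total == 0 or key_high <= key_low:
        return 0.0
    matched = 0.0
    for low, high, count in first_stats.buckets:
        overlap = min(high, key_high) - max(low, key_low)
        if overlap > 0 and high > low:
            # Assume keys are spread uniformly inside each bucket.
            matched += count * overlap / (high - low)
    return matched / total
```

Because only statistics are consulted, this estimate can be obtained without reading either file; its accuracy depends on how well the uniformity assumption holds within each interval.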
A data query device, characterized in that the device is applied to a data processing system, the data processing system includes a coordination node and a working node, and the device includes:

A first communication module, configured to send a task of querying a first file and a second file to the working node, the task including a query condition for the first file and the second file, wherein the data volume of the first file is greater than the data volume of the second file;

an estimation module, configured to estimate, according to the query condition of the received task and the data information respectively corresponding to the first file and the second file, the proportion, in the first file, of data in the first file that meets the query condition;

a reading module, configured to read the first file and the second file into the working node when the proportion is greater than a preset threshold, and to execute the query operation on the first file and the second file.
The device according to claim 6, wherein the reading module is further configured to read the second file when the proportion is less than or equal to the preset threshold;

The device also includes:

a query module, configured to query the first data in the second file that satisfies the query condition;

a generating module, configured to generate filter conditions according to the first data;

a second communication module, configured to send the filter condition to the data source where the first file is located, where the filter condition is used to instruct the data source to query the first file for second data matching the first data, to receive the second data sent by the data source, and to perform the query operation on the first data and the second data.
The device according to claim 6 or 7, wherein the data information respectively corresponding to the first file and the second file comprises data statistics information respectively corresponding to the first file and the second file;

The estimation module is specifically used for:

According to the data statistics information corresponding to the second file, determine the identifier of the data in the second file that meets the query condition;

According to the data statistics information corresponding to the first file, calculate the proportion, in the first file, of data in the first file that has the identifier.
A device, characterized in that the device includes a processor and a memory;

The processor is configured to execute instructions stored in the memory to cause the device to perform the method of any one of claims 1 to 5.
A computer-readable storage medium, comprising instructions for implementing the method of any one of claims 1 to 5.