CN114817310A - Data query method, device, equipment and storage medium

Info

Publication number: CN114817310A
Application number: CN202110111965.8A
Authority: CN
Inventors: 李铮; 刘玉; 罗旦
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2021-01-27
Filing date: 2021-01-27
Publication date: 2022-07-29
Also published as: WO2022161417A1

Abstract

The application discloses a data query method, apparatus, device, and storage medium, which can be applied to a data processing system comprising a coordination node and a working node. The coordination node sends the working node a task for performing a query operation on a first file and a second file, wherein the task comprises a query condition for the first file and the second file, and the data volume of the first file is larger than that of the second file. The working node estimates the proportion of the data in the first file that meets the query condition according to the query condition in the task and the data information corresponding to the first file and the second file respectively; when the proportion is greater than a preset threshold, the working node reads the first file and the second file and executes the query operation on them. Therefore, when the proportion of data meeting the query condition in the first file is estimated to be large, no filter condition is generated, which avoids the problems of low data query efficiency and of query overhead increasing rather than decreasing as a result of generating the filter condition.

Description

Data query method, device, equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of data processing, in particular to a data query method, a data query device, data query equipment and a storage medium.

Background

In the big data era, along with the expansion of data collection methods, the data collection cost is lower and lower, and correspondingly, the data amount stored in a data source is larger and larger. When query analysis is performed on mass data, data query efficiency and query overhead become core problems in the field of data processing.

Currently, when a data processing system performs a data query based on a query statement input by a user, it generates a filtering condition (DF) to reduce the amount of data read from the data source, so as to improve data query efficiency and reduce query overhead. However, in practical applications, the generated filtering condition may not filter the data well, that is, the data amounts before and after filtering differ only slightly. In that case, data query efficiency and query overhead are not significantly improved; on the contrary, the generation, calculation, transmission, and other processing of the filtering condition may reduce the data query efficiency of the data processing system and increase its query overhead.

Disclosure of Invention

Embodiments of the present application provide a data query method, apparatus, device, storage medium, and computer program product, so that data query efficiency of a data processing system is kept at a high level, and query overhead is kept at a low level.

In a first aspect, an embodiment of the present application provides a data query method, which may be applied to a data processing system that includes a coordination node and a working node. The coordination node may send the working node a task for performing a query operation on a first file and a second file, where the task includes a query condition for the first file and the second file, and the data volume of the first file is greater than that of the second file. For example, the task sent by the coordination node may be a logical plan tree generated by the coordination node. The working node estimates, according to the query condition in the received task and the data information corresponding to the first file and the second file respectively, the proportion of the data in the first file that meets the query condition; when the proportion is greater than a preset threshold, the working node reads the first file and the second file and executes the query operation on the read files.

When the data meeting the query condition accounts for a large proportion of the first file, the filtering condition to be generated would filter poorly. In this case, the data processing system can directly read the first file and the second file and query the data meeting the query condition, without generating a filtering condition to filter the data in the first file. This avoids the problems of low data query efficiency and of query overhead increasing rather than decreasing as a result of generating the filtering condition, so that the data query overhead of the data processing system can be kept at a low level.

In a possible implementation manner, when the estimated proportion of data meeting the query condition in the first file is less than or equal to a preset threshold, the working node may read the second file and query the first data meeting the query condition in the second file; then, the working node may generate a filter condition according to the first data, and send the filter condition to the data source where the first file is located, where the filter condition may be used to instruct the data source to query, from the first file, second data that matches the first data, so that the working node may receive the second data sent by the data source, and perform a query operation on the first data and the second data.

In this embodiment, when the data meeting the query condition accounts for a small proportion of the first file, the filtering condition to be generated is expected to filter well. In this case, the data processing system can generate the filtering condition to reduce the amount of first-file data that the working node reads from the data source, so that the data query efficiency of the data processing system can reach a high level. In addition, when querying the data meeting the query condition, the working node does not need to read from the data source a large amount of first-file data that does not meet the query condition, so the data query overhead of the data processing system can be effectively reduced.
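The choice between the two read paths described above can be summarized in a minimal sketch. The function name, the returned labels, and the default threshold of 0.5 are assumptions made for illustration; the text only states that a preset threshold exists.

```python
# Hypothetical default; the source only speaks of "a preset threshold".
PRESET_THRESHOLD = 0.5


def choose_strategy(estimated_proportion: float,
                    threshold: float = PRESET_THRESHOLD) -> str:
    """Pick the read path for the first (larger) file.

    estimated_proportion is the estimated share of first-file data
    that meets the query condition, expressed as a value in [0, 1].
    """
    if not 0.0 <= estimated_proportion <= 1.0:
        raise ValueError("proportion must lie within [0, 1]")
    # Above the threshold a filter would remove little data, so the
    # working node reads both files directly.
    if estimated_proportion > threshold:
        return "read_both_files"
    # At or below the threshold the working node builds a filter
    # from the second file and pushes it to the data source.
    return "generate_filter"
```

Note that a proportion exactly equal to the threshold falls on the filter-generating side, matching the "less than or equal to" wording of the implementation manner above.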

In a possible implementation manner, the data information corresponding to the first file and the second file respectively may specifically be sampling data of the first file and sampling data of the second file. When estimating the proportion of the data meeting the query condition in the first file, the working node may sample the data in the first file and the data in the second file respectively to obtain the sampling data of each file. The working node may then determine the identifier of the target sampling data that meets the query condition in the sampling data of the second file, and calculate the proportion of the sampling data of the first file that has this identifier. Therefore, whether the filtering condition to be generated would filter well can be estimated by sampling a small amount of data in the first file and the second file, so that the data processing system can evaluate the filtering effect at low cost, which improves the feasibility of the scheme.
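A minimal sketch of the sampling-based estimate follows. It assumes that each file is available as a list of dictionaries, that the identifier is a column named `id`, and that uniform random sampling is used; none of these details is fixed by the text, and the function and parameter names are hypothetical.

```python
import random


def estimate_proportion_by_sampling(first_rows, second_rows, predicate,
                                    key="id", sample_size=1000, seed=0):
    """Estimate the share of first-file rows that meet the query condition.

    predicate is the part of the query condition that applies to
    second-file rows (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # Sample each file independently, as in the described steps.
    first_sample = rng.sample(first_rows, min(sample_size, len(first_rows)))
    second_sample = rng.sample(second_rows, min(sample_size, len(second_rows)))
    # Identifiers of the target sampling data in the second file.
    target_ids = {row[key] for row in second_sample if predicate(row)}
    if not first_sample:
        return 0.0
    # Share of first-file samples carrying one of those identifiers.
    hits = sum(1 for row in first_sample if row[key] in target_ids)
    return hits / len(first_sample)
```

Because both files are sampled independently, the result is only a rough estimate; a real system would have to choose sample sizes and a sampling scheme suited to its data.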

In a possible implementation manner, the data information corresponding to the first file and the second file respectively may specifically be data statistical information corresponding to the first file and the second file. When estimating the proportion of the data meeting the query condition in the first file, the working node may determine the identifier of the data meeting the query condition in the second file according to the data statistical information corresponding to the second file, and then calculate the proportion of the data in the first file that has this identifier according to the data statistical information corresponding to the first file. In this way, the data statistical information corresponding to each file can be used to obtain a proportion that indicates how well the filtering condition would filter, and whether to generate the filtering condition can then be determined based on this proportion.

For example, the working node may read the first file and the second file from the data source in advance and compute data statistics on them, so as to obtain and store the data statistical information corresponding to the first file and the second file respectively. In subsequent data queries, the working node can then determine, according to the locally stored data statistical information, whether to generate a filtering condition for each query. Alternatively, in other examples, while the first file and the second file are being stored in the data source, the data source may generate corresponding data statistical information for each file and maintain it locally. When determining whether to generate a filtering condition, the working node can obtain the data statistical information corresponding to the first file and the second file from the data source, and use it to estimate the data proportion that indicates how well the filtering condition would filter.

In one possible implementation, the data statistical information corresponding to the first file may specifically be any one or more of a data range, a data distribution interval, and duplicate data of the first file; other kinds of statistical information are also possible. In this way, the working node can determine whether to generate the filtering condition according to the data range, the data distribution interval, the duplicate data, and other information.
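As one illustration of a statistics-based estimate, the sketch below uses only a data range. It assumes that the statistics hold the minimum and maximum identifier, that identifiers are integers spread evenly over the range, and that the range of second-file identifiers meeting the query condition is already known; these are simplifying assumptions of the sketch, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeStats:
    """Hypothetical per-file statistics: the inclusive identifier range."""
    min_id: int
    max_id: int


def estimate_proportion_from_ranges(first: RangeStats,
                                    matching_second: RangeStats) -> float:
    """Estimate the share of first-file identifiers inside the matching range.

    matching_second is the range of second-file identifiers that meet
    the query condition. Identifiers are assumed evenly distributed.
    """
    low = max(first.min_id, matching_second.min_id)
    high = min(first.max_id, matching_second.max_id)
    if high < low:
        # The two ranges do not overlap, so nothing would match.
        return 0.0
    first_width = first.max_id - first.min_id + 1
    return (high - low + 1) / first_width
```

With skewed or duplicated identifiers a range alone can mislead, which is presumably why the text also lists distribution intervals and duplicate data as possible statistics.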

In a possible implementation manner, when the working node determines to generate a filtering condition and queries data from the first file by using the filtering condition, the working node may calculate the proportion of the queried data, that is, the data meeting the query condition, in the already-traversed data of the first file. When this proportion is greater than a filtering threshold, the filtering condition does not filter well in the actual query process. In this case, the working node may stop using the filtering condition and instead query the remaining data meeting the query condition from the not-yet-traversed data of the first file according to the query condition. Therefore, when the filtering condition does not achieve the expected filtering effect, its use can be stopped in time, so that the additional overhead caused by applying it is avoided as far as possible.
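The run-time check described above might look like the following sketch. It assumes the first file is traversed in chunks, that the filtering condition is a set of identifiers, and that the check runs after each chunk; the function name, the default threshold of 0.8, and the transferred-row counter are hypothetical additions used only to make the effect visible.

```python
def scan_first_file(chunks, first_data_ids, filtering_threshold=0.8, key="id"):
    """Traverse the first file and abandon the filter if it removes too little.

    Returns (matched_rows, filter_still_active, rows_transferred).
    """
    matched = []
    traversed = 0
    transferred = 0
    filter_active = True
    for chunk in chunks:
        hits = [row for row in chunk if row[key] in first_data_ids]
        matched.extend(hits)
        traversed += len(chunk)
        # With the filter pushed down, only hits leave the data source;
        # once it is abandoned, the whole chunk is read by the working node.
        transferred += len(hits) if filter_active else len(chunk)
        # Share of queried data within the data traversed so far.
        if filter_active and traversed and len(matched) / traversed > filtering_threshold:
            filter_active = False
    return matched, filter_active, transferred
```

For example, if the first chunk matches entirely, the measured share is 1.0, the filter is dropped, and the following chunks are read whole and matched on the working node.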

In a second aspect, based on the same inventive concept as the method embodiment of the first aspect, an embodiment of the present application provides a data query apparatus. The device has functions corresponding to the implementation of the embodiments of the first aspect. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above.

In a third aspect, an embodiment of the present application provides an apparatus, including a processor and a memory. The memory is configured to store instructions, and when the apparatus runs, the processor executes the instructions stored in the memory to cause the apparatus to perform the data query method in the first aspect or any implementation manner of the first aspect. It should be noted that the memory may be integrated into the processor or may be independent of the processor. The apparatus may also include a bus, through which the processor is connected to the memory. The memory may include a readable memory and a random access memory, among others.

In a fourth aspect, an embodiment of the present application further provides a readable storage medium storing a program or instructions which, when run on a computer, cause the data query method in the first aspect or any implementation manner of the first aspect to be executed.

In a fifth aspect, an embodiment of the present application further provides a computer program product containing instructions, which when executed on a computer, cause the computer to perform any of the data query methods in the first aspect or any implementation manner of the first aspect.

In addition, for technical effects brought by any one implementation manner of the second aspect to the fifth aspect, reference may be made to technical effects brought by different implementation manners of the first aspect, or to technical effects brought by different implementation manners of the second aspect, and details are not described herein again.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art according to the drawings.

FIG. 1 is a block diagram of an exemplary data processing system provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a scope table and orders table included in the data source 103;

FIG. 3 is a block diagram of yet another exemplary data processing system provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of a data query method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a query statement input interface provided in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a data query device according to an embodiment of the present application;

FIG. 7 is a schematic hardware structure diagram of an apparatus according to an embodiment of the present application.

Detailed Description

Referring to FIG. 1, a block diagram of an exemplary data processing system is shown. As shown in FIG. 1, the data processing system 100 may include a coordination node (coordinator node) 101 and a working node (worker node) 102. The working node 102 can access data in the data source 103, and the data source 103 may include one or more data sources, such as the Hive and Oracle data sources shown in FIG. 1.

The data processing system 100 may provide a client 104 for human-computer interaction with a user, so as to execute a corresponding data query process based on a query statement input by the user. The coordination node 101 may receive the SQL statement input by the user through the client 104, and perform syntax analysis and semantic analysis on the SQL statement. Syntax analysis means that the coordination node 101 checks the SQL statement for syntax errors using the syntax rules of the SQL language; semantic analysis means that the coordination node 101 checks whether the semantics of the SQL statement are legal. When the syntax and semantics of the SQL statement are legal, the coordination node 101 may generate a logical plan tree from the SQL statement, where the logical plan tree indicates a logical execution plan for operations such as computation, analysis, and access on data. The coordination node 101 may then optimize the logical plan tree through one or more optimizers and send the optimized logical plan tree to the working node 102 for execution; specifically, a scheduler may determine which executor in the working node 102 receives the logical plan tree. The working node 102, which may include one or more executors (workers; FIG. 1 takes executor 1 and executor 2 as an example), may execute the corresponding plan according to the received logical plan tree and return the query result to the client 104 through the coordination node 101, so that the query result is presented to the user on the client 104.

In the big data era, the amount of data in the data source 103 is usually huge; for example, the number of rows may reach tens of millions or even hundreds of millions, forming a table A with a large data volume, such as a fact table. If the working node 102 directly traverses table A, the data query efficiency of the data processing system 100 is reduced because too much data must be traversed. For this reason, a Join operation is commonly performed on table A and another table B with a smaller data volume (e.g., a dimension table), where table A and table B contain at least one common column, and at least some of the values in that column are the same in both tables. The working node 102 may then first query the data meeting the query condition in table B, and use that data as a filtering condition to filter table A and obtain the data meeting the query condition in table A (i.e., the data required by the user). The filtering condition may be, for example, some of the values of the column that table A and table B have in common.

For example, suppose that the data source 103 contains a scope table with a larger data volume and an orders table with a smaller data volume, as shown in FIG. 2. Both tables have an id (i.e., identity) column, and some of the values in that column are the same in both tables; in FIG. 2, the id columns of the two tables share the values 120001 through 149999. Assume that the user enters the following SQL statement on the client 104: select name from scope join orders on scope.id = orders.id where orders.sum > 10, that is, a query for the data in the data source 103 for which orders.sum > 10. To reduce the amount of data read from the scope table, the working node 102 may first collect, according to the query conditions "scope.id = orders.id" and "orders.sum > 10" in the query statement, the id values of the rows in the orders table that satisfy the query condition, i.e., the id values 120001, 120002, …, 149999 in the orders table shown in FIG. 2, and use these id values as the filtering condition. The working node 102 may then read the data in the scope table according to the filtering condition, that is, read the rows with id values 120001, 120002, …, 149999 without reading the rows with id values 150000, 150001. In this way, in the process of querying data, the working node 102 does not need to read all the data in the entire scope table, but can read only part of it through the generated filtering condition, so that the amount of data read from the scope table, and hence the amount of data from the two tables participating in the Join operation, is reduced.
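The example above can be mimicked with small in-memory tables. In the sketch below the row values are invented, the filtering condition is modeled as a plain set of id values, and the helper names are hypothetical; the column names id, name, and sum follow the example.

```python
# Hypothetical miniature versions of the two tables; row values are invented.
scope = [
    {"id": 120001, "name": "a"},
    {"id": 120002, "name": "b"},
    {"id": 150000, "name": "c"},
]
orders = [
    {"id": 120001, "sum": 25},
    {"id": 120002, "sum": 5},
]


def build_id_filter(small_table, min_sum=10):
    """Collect the id values of small-table rows with sum > min_sum."""
    return {row["id"] for row in small_table if row["sum"] > min_sum}


def read_filtered(large_table, id_filter):
    """Read only the large-table rows whose id appears in the filter."""
    return [row for row in large_table if row["id"] in id_filter]


id_filter = build_id_filter(orders)       # filtering condition from orders
rows_read = read_filtered(scope, id_filter)  # only matching scope rows are read
names = [row["name"] for row in rows_read]
```

Here only one of the three scope rows is read, which is the saving the filtering condition is meant to deliver when few rows qualify.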

When the working node 102 in the data processing system includes a plurality of executors, the Join operation may be distributed across them. For example, the working node 102 shown in FIG. 1 includes executor 1 and executor 2, and the Join operation may be distributed to Join node_1 and Join node_2, where a Join node is a logical node that performs the Join operation on table A and table B. Each executor may send the filtering condition generated from its corresponding table B to the coordination node 101. As shown in FIG. 1, executor 1 may send the partial filtering condition (partial filter) generated from table B_1 to the coordination node 101, and executor 2 may likewise send the partial filtering condition generated from table B_2; the coordination node 101 then merges the filtering conditions sent by the executors. The coordination node 101 may then issue the merged filtering condition to executor 1 and executor 2, which perform the Join operation, so that each executor reads the corresponding data from its own table A according to the merged filtering condition.
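The merging step at the coordination node can be sketched as a set union, assuming each partial filtering condition is a collection of column values. This representation is an assumption of the sketch; the text does not specify how filtering conditions are encoded, and the function name is hypothetical.

```python
def merge_partial_filters(partial_filters):
    """Merge the partial filtering conditions reported by the executors.

    Each partial filter is assumed to be an iterable of column values.
    """
    merged = set()
    for partial in partial_filters:
        merged |= set(partial)
    # An immutable result can be issued to every executor unchanged.
    return frozenset(merged)
```

For instance, merging the partial filters {120001, 120002} and {120002, 120003} yields a single filter containing all three values, which each executor then applies to its own portion of table A.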

However, in practical applications, the filtering condition generated by the working node 102 from table B may not filter the data well. That is, the filtering condition (or the merged filtering condition) itself contains a large amount of data, and accordingly the amount of data read from table A according to the filtering condition differs little from the amount of data in the whole of table A. Compared with an implementation in which the working node 102 directly reads all the data in table A, the data processing system 100 therefore saves little overhead from the reduction in data read by the working node 102. At the same time, generating the filtering condition and using it to filter the data in table A may cost the working node 102 more overhead (e.g., computing resources) than the data processing system 100 saves. In particular, when the data processing system 100 queries data using a plurality of working nodes 102, each working node 102 also sends its generated filtering condition to the coordination node 101, which merges the filtering conditions and sends the merged filtering condition back to the working nodes 102; this occupies still more resources of the data processing system 100 (including network transmission, system computation, and storage).

Thus, when the filtering condition filters poorly, using it does not reduce the overall overhead of the data processing system 100 compared with having the working node 102 directly read all the data in table A. Instead, the filtering condition consumes additional network transmission, computation, storage, and other resources, so that the overall overhead of the data processing system 100 increases; meanwhile, generating, transmitting, merging, and applying the filtering condition may also reduce data query efficiency.

Based on this, the embodiment of the present application provides a data query method that keeps the data query efficiency of the data processing system 100 at a high level while keeping the query overhead at a low level. Specifically, the working node 102 may estimate in advance, according to the query condition and the data information corresponding to the two tables participating in the Join operation, the proportion of the data meeting the query condition in table A, the table with the larger data volume. When the proportion is large, the generated filtering condition is unlikely to filter well, that is, the data amounts before and after filtering would differ little. In this case, the working node 102 may, without generating a filtering condition, directly query the data meeting the query condition required by the user (hereinafter referred to as the data to be queried) from table A and table B according to the query condition and perform the query operation. When the proportion is small, the filtering condition can effectively filter out a large amount of irrelevant data in table A (i.e., data that is not to be queried, which need not be read to the working node 102). In this case, the working node 102 may generate the filtering condition from the query condition and table B, and query the data to be queried from table A based on the filtering condition.

Therefore, when the filtering condition filters well, the working node 102 can use it to reduce the amount of data read from table A, improving data query efficiency and reducing query overhead; when the filtering condition filters poorly, the data processing system 100 can, by not generating it, avoid the problems of low data query efficiency and of query overhead increasing rather than decreasing. In this manner, the query overhead can be kept low while the data query efficiency of the data processing system 100 is kept high.

For example, as shown in FIG. 3, the Join operation is still performed in executor 1 and executor 2. Executor 1 and executor 2 each include a determination module (i.e., determination module 1 and determination module 2 in FIG. 3), which is configured to determine whether to generate a filtering condition. The determination module in each executor may estimate, according to the query condition and the data information corresponding to the two tables requiring the Join operation, whether the filtering condition to be generated would filter well, that is, determine whether the data proportion is large. If the estimated filtering effect is good, the determination module may instruct the executor in which it is located to generate the filtering condition. If the estimated filtering effect is poor, the determination module may instruct the executor in which it is located not to generate the filtering condition.

As described above, if both executor 1 and executor 2 generate a filtering condition, each may send its (partial) filtering condition to the coordination node 101, so that the coordination node 101 merges the conditions and issues the merged result. If only executor 1 generates a filtering condition, executor 1 may directly read data from table A_1 by using the generated filtering condition, while executor 2 may directly read all the data in table A_2 and then find the data satisfying the query condition in table A_2. If neither executor generates a filtering condition, the two executors may each read their corresponding table A and find the data meeting the query condition in the table read.

It is noted that the system architectures shown in FIG. 1 and FIG. 3 are only examples and are not intended to limit specific implementations. For example, in other possible system architectures, the data processing system may not include the client 104; alternatively, the number of executors included in the working node is not limited to 2; or the data processing system, which includes the coordination node, the working node, and so on, may be integrated with the data source. In practical applications, the data processing system may add or remove components in the architectures shown in FIG. 1 and FIG. 3 as needed, which is not limited in this embodiment.

In order to make the aforementioned objects, features, and advantages of the present application more comprehensible, various non-limiting embodiments of the present application are described below with reference to the accompanying drawings. It is to be understood that the embodiments described are only some, and not all, of the embodiments of the present application. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.

FIG. 4 is a schematic flowchart of a data query method in an embodiment of the present application. The method may be applied to the data processing system 100 shown in FIG. 1 or FIG. 3, and may be executed by the data processing system 100 or by corresponding nodes in it. In this embodiment, the method is described by taking execution by the coordination node 101 and the working node 102 in the data processing system 100 as an example, and may specifically include:

S401: the coordination node 101 receives a data query statement, where the data query statement includes a query condition corresponding to data to be queried and a query operation for a first file and a second file, and a data volume of the first file is greater than a data volume of the second file.

In practice, a user may provide a data query statement to data processing system 100 so that data processing system 100 can locate the data of a desired query based on the data query statement. For convenience of description, the data required by the user is referred to as the data to be queried in this embodiment.

As an example of an implementation of receiving a data query statement, the data processing system 100 includes a client 104, and the client 104 may present to the user a query statement input interface as shown in FIG. 5. The user may enter a query statement in a particular area of the interface so that the data processing system 100 feeds back the desired query result. The client 104 may then send the data query statement to the coordination node 101, for example by means of a data query request, so that the coordination node 101 queries the data to be queried. The data query statement includes a query condition and a query operation, such as a Join operation, for two files (such as two tables), and the query condition is used to locate the data to be queried. For example, the data query statement input by the user may be the SQL statement: select name from scope join orders on scope.id = orders.id where orders.sum > 10, for querying row data of the scope table shown in FIG. 2, where "scope.id = orders.id" and "orders.sum > 10" in the SQL statement are the query conditions for locating the data to be queried, i.e., the row data of the scope table with id values of 120001, 120002, …, 149999.

S402: the coordinating node 101 performs syntax analysis and semantic analysis on the data query statement to determine whether the data query statement input by the user is legal.

In practical applications, if the data query statement input by the user is illegal, the data processing system 100 may terminate the data query task and prompt the user to input a data query statement with correct syntax/semantics. If the data query statement input by the user is determined to be legal, the coordinating node 101 may continue to perform the subsequent data query steps.

S403: after determining that the data query statement is legal, the coordinating node 101 sends a task for performing a query operation on the first file and the second file to the working node, where the task includes a query condition for the first file and the second file.

As an example, the task sent by the coordination node 101 to the working node may be a logical plan tree that includes the query condition of the data to be queried and the identifiers (such as file names) of the two files participating in the query operation. In a specific implementation, the coordination node may generate the logical plan tree according to the data query statement and send it to the working node 102, so as to schedule the working node 102 to execute the tasks in the logical plan tree.
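To make the idea of such a task concrete, the sketch below models a logical plan tree as nested nodes that carry the query conditions and the file identifiers. The node layout, the operator labels, and the example conditions are hypothetical; the text does not define the actual tree format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Hypothetical plan node: an operator, its detail, and child nodes."""
    operator: str
    detail: str = ""
    children: List["PlanNode"] = field(default_factory=list)


def referenced_files(node: PlanNode) -> List[str]:
    """Collect the file identifiers named by scan nodes, left to right."""
    files = [node.detail] if node.operator == "scan" else []
    for child in node.children:
        files.extend(referenced_files(child))
    return files


# Illustrative tree for a join of a larger table with a smaller one.
plan = PlanNode("project", "name", [
    PlanNode("filter", "orders.sum > 10", [
        PlanNode("join", "scope.id = orders.id", [
            PlanNode("scan", "scope"),
            PlanNode("scan", "orders"),
        ]),
    ]),
])
```

A working node receiving such a tree could walk it to find both the files to access and the conditions to evaluate, which is the information the estimation step relies on.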

S404: the working node 102 estimates, according to the query condition in the received task and the data information respectively corresponding to the first file and the second file, the proportion of the first file that is made up of data meeting the query condition.

The data source 103 may include a first file and a second file, and the work node 102 may obtain data satisfying the query condition by accessing the data in the first file and the second file. In practical applications, the first file and the second file may be different data forms in the data source 103, and the first file and the second file may specifically record data in the form of a table, such as a peer table and an orders table shown in fig. 2.

In some embodiments, the data in the first file and the second file may be associated. For example, the first file and the second file may contain the same column, and the values in that column in the first file are at least partially the same as the values in that column in the second file, such as the id column in the scope table and the orders table shown in fig. 2, which share id values such as 120001, 120002, …, 149999.

In this embodiment, the data amount included in the first file and the second file may be different, for example, the data amount of the first file may be larger than the data amount of the second file. In practical applications, the first file may specifically be a fact table, and the second file may specifically be a dimension table or the like. The fact table for the work records may include information in various aspects, such as working date, working staff, working duration, overtime duration, working property, working content, working responsible person, and the like, and may include information in three dimensions, i.e., time dimension (working date, working duration, overtime duration), staff dimension (working staff, working responsible person), and job property (working property, working content). Accordingly, the amount of data of the fact table is generally large because of the large amount of data recorded in the fact table. The dimension table can be used for recording partial dimension data, for example, the time dimension table can be used for recording only the data of the time dimension, such as the data including the working date, the working time length, the overtime length and the like, and the data of the staff, the working property, the working content, the working responsible person and the like may not be recorded in the time dimension table, but can be recorded in other dimension tables. The dimension table may be viewed as a window of analytical data that contains data characteristics in a portion of the dimensions in the fact table. In general, the data amount of the dimension table is smaller than that of the fact table.

In this embodiment, the data amount of the first file may be relatively large, and the data meeting the query condition (i.e., the data to be queried) may be only a small part of the first file. If the working node 102 directly read the first file from the data source 103 and queried the data to be queried from it, the working node 102 would have to read a large amount of useless data in the first file besides the data to be queried. Therefore, in this embodiment, the working node 102 may reduce the data amount read from the first file by using a filtering condition generated according to the second file. Because the generated filtering condition does not necessarily have a good data filtering effect, the working node 102 may parse the query condition and the two files participating in the Join operation from the logical plan tree and obtain the data information corresponding to the first file and the second file, respectively. The working node 102 may then estimate the proportion of the data to be queried in the first file according to the query condition and the data information corresponding to the two files, and use the proportion to measure whether the filtering condition can have a good data filtering effect.

Specifically, when the ratio is large, most of the data in the first file is data to be queried. In this case, even if the data in the first file is filtered by using the filtering condition, the data volume that the working node 102 reads from the first file differs little from the data volume of the whole first file, which indicates that the filtering condition has a poor data filtering effect. Conversely, when the ratio is small, only a small part of the data in the first file is data to be queried. In this case, most of the data in the first file can be effectively filtered out by using the filtering condition, and the working node 102 can obtain the required data to be queried by reading only a small part of the first file, which indicates that the filtering condition has a good data filtering effect.

As one example implementation of estimating the ratio, the data information corresponding to the first file and the second file respectively may specifically be data statistical information corresponding to the first file and the second file respectively. For example, the data statistical information corresponding to the first file (or the second file) may be any one or more of a data range, a data distribution interval, and duplicate data of the first file (or the second file). The data range refers to the value range of data in the first file (or the second file), and specifically may be the value range of each column, that is, from the minimum value to the maximum value of the data in each column; the data distribution interval describes the distribution of data in the first file (or the second file), which may be represented, for example, by a histogram; the duplicate data indicates the presence of identical data in the first file (or the second file), characterized, for example, by the repetition rate or the number of duplicate entries.

The working node 102 may determine, according to the data statistical information corresponding to the second file, the identifier of the data in the second file that satisfies the query condition, and then calculate the proportion of the data having that identifier in the first file according to the data statistical information corresponding to the first file. For example, assuming that the data statistical information is specifically a data range, the working node 102 may find the data range satisfying the query condition in the second file according to the data range of the second file, and determine the identifiers corresponding to the data in that range. The working node 102 then finds the data having those identifiers in the first file and calculates the proportion of that data in the first file. For example, if the data range in the first file is 1 to 10000 and the data having the identifiers in the first file is 1 to 1000, the proportion may be 1/10 (i.e., 1000/10000). Of course, the data processing system 100 may also calculate the ratio according to the data distribution interval, the duplicate data, or any combination of the three kinds of information, which is not limited in this embodiment.
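The range-based estimate can be illustrated with a minimal sketch. It assumes integer identifiers that are uniformly distributed over an inclusive minimum-to-maximum range; the `ColumnRange` class and the `estimate_ratio` function are hypothetical names, not part of the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRange:
    """Inclusive min/max statistics for an identifier column."""
    low: int
    high: int

    def width(self) -> int:
        return self.high - self.low + 1


def estimate_ratio(first: ColumnRange, matched: ColumnRange) -> float:
    """Estimate the fraction of the first file whose identifiers fall in
    `matched`, assuming identifiers are uniformly distributed integers."""
    low = max(first.low, matched.low)
    high = min(first.high, matched.high)
    if high < low:  # the ranges do not overlap
        return 0.0
    return (high - low + 1) / first.width()


# Identifiers 1..1000 satisfy the query condition; the first file spans 1..10000.
ratio = estimate_ratio(ColumnRange(1, 10000), ColumnRange(1, 1000))
print(ratio)  # 0.1
```

A histogram or duplicate-rate statistic could refine this estimate when the uniform-distribution assumption does not hold.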

The data statistics information corresponding to the first file (or the second file) may be collected in advance by the working node 102 and the coordinating node 101 and stored locally. For example, after the data processing system 100 is connected to the data source 103, the work node 102 may access each file in the data source 103 one by one and send the read file to the coordinating node 101, so that the coordinating node 101 may generate and store corresponding data statistics for each file. In this way, when querying data, the working node 102 may obtain data statistics corresponding to a file to be accessed from the local (or from the coordinating node 101) so as to determine whether to generate a filtering condition in the data querying process based on the data statistics. Of course, in practical application, the working node 102 may also obtain the data statistics information corresponding to the first file and the second file by using other manners, for example, when the first file and the second file are stored in the data source 103, the relevant device in the data source 103 generates corresponding data statistics information for the first file and the second file in advance, so that the working node 102 may obtain the data statistics information corresponding to the first file and the second file from the data source 103. In this embodiment, a specific implementation manner of the working node obtaining the data statistical information is not limited.

As another example implementation of estimating the ratio, the working node 102 may determine the ratio by data sampling. In this case, the data information corresponding to the first file and the second file respectively may specifically be the sampled data in the first file and the second file. In a specific implementation, the working node 102 may sample the data in the first file and the data in the second file, for example by random sampling, equal-interval sampling, or equal-proportion sampling, and read the sample data of the first file and the sample data of the second file from the data source 103, respectively. The working node 102 may then determine the identifier of the target sample data that satisfies the query condition in the sample data of the second file, count the sample data having that identifier in the sample data of the first file, and calculate the ratio of the sample data having the identifier to the sample data of the first file. In this way, the proportion of the data to be queried in the first file can be estimated from the sampled data of the first file and the second file.
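The sampling steps above can be sketched as follows. This is only an illustration under stated assumptions: rows are dictionaries held in memory, the join identifier is a column named `id`, and random sampling is used; the function names and the `fraction` and `seed` parameters are hypothetical.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Randomly sample roughly `fraction` of the rows (at least one row)."""
    if not rows:
        return []
    rng = random.Random(seed)
    count = max(1, int(len(rows) * fraction))
    return rng.sample(rows, count)


def estimate_ratio_by_sampling(first_rows, second_rows, predicate,
                               fraction=0.1, seed=0):
    """Estimate the share of first-file rows that match the second file."""
    first_sample = sample_rows(first_rows, fraction, seed)
    second_sample = sample_rows(second_rows, fraction, seed)
    if not first_sample:
        return 0.0
    # Identifiers of second-file sample rows that satisfy the query condition.
    target_ids = {row["id"] for row in second_sample if predicate(row)}
    # Count first-file sample rows carrying one of those identifiers.
    hits = sum(1 for row in first_sample if row["id"] in target_ids)
    return hits / len(first_sample)


first_rows = [{"id": i} for i in range(1000)]
second_rows = [{"id": i, "order": i} for i in range(1000)]
estimate = estimate_ratio_by_sampling(
    first_rows, second_rows, lambda row: row["order"] > 899, fraction=0.5)
print(estimate)
```

Because the two files are sampled independently, identifiers that match in the full files may be missing from one of the samples, so the estimate tends to be lower than the true ratio at small sampling fractions.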

It should be noted that the two implementations of estimating the ratio described above are only examples. In practical applications, the ratio may also be estimated in other possible manners, which is not limited in this embodiment.

S405: when the ratio is greater than the preset threshold, the working node 102 reads the first file and the second file from the data source 103 to the working node 102, and performs a query operation on the first file and the second file.

S406: when the ratio is smaller than the preset threshold, the working node 102 generates a filtering condition according to the query condition and the second file, queries data meeting the query condition from the first file according to the filtering condition, and executes the corresponding query operation.

In this embodiment, the ratio calculated in step S404 reflects the data filtering effect of the filtering condition. When the ratio is greater than the preset threshold, the filtering condition to be generated can hardly filter out much of the data in the first file, and generating the filtering condition may reduce the query efficiency of the data processing system 100. In this case, the working node 102 may read the first file and the second file from the data source 103 without generating a filtering condition according to the second file, query the data satisfying the query condition from the read first file and second file, and perform the corresponding operation, for example a Join operation, on the queried data. In this way, the situation in which the total overhead of the working node 102 for generating and applying the filtering condition (even including transmission, combination, etc. of the filtering condition) exceeds the overhead of directly reading the first file can be avoided.

When the ratio is smaller than the preset threshold, the filter condition to be generated can effectively filter out much of the data in the first file, and accordingly, the total cost of the working node 102 for generating and applying the filter condition (even including transmission, combination and the like of the filter condition) is smaller than the cost of directly reading the first file. In this case, the working node 102 may read the second file and query the second file for first data satisfying the query condition. The working node 102 may then generate a filter condition according to the first data and send the filter condition to the data source 103, to instruct the data source 103 to query the first file for second data matching the first data and feed the second data back to the working node 102. The first data and the second data may have the same identifier; for example, the first data in the second file and the second data in the first file have the same column value (e.g., the id values in the scope table and the orders table in the aforementioned example). In this way, the working node 102 may perform the query operation on the received second data and the first data queried from the second file, for example a Join operation on the first data and the second data, so as to obtain the data requested by the user. Since the data amount read by the working node 102 from the data source 103 is small (relative to directly reading the first file), the data query efficiency of the working node 102 can be kept at a high level. For the specific process by which the working node 102 generates the filter condition according to the second file and the query condition, reference may be made to the foregoing description, which is not repeated herein.
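The two branches of steps S405 and S406 can be sketched together. The sketch assumes in-memory lists of dictionaries joined on a column named `id`, and it models the data source applying the filter condition as a simple list comprehension; `run_join` and its return shape are hypothetical.

```python
def run_join(first_rows, second_rows, predicate, ratio, threshold):
    """Join the two files on "id"; return (joined_rows, rows_read_from_first)."""
    # First data: rows of the second file that satisfy the query condition.
    first_data = [row for row in second_rows if predicate(row)]

    if ratio > threshold:
        # S405: the filter would remove little, so read the first file in full.
        rows_read = list(first_rows)
    else:
        # S406: build a filter condition from the first data and let the
        # data source return only the matching rows (the second data).
        filter_ids = {row["id"] for row in first_data}
        rows_read = [row for row in first_rows if row["id"] in filter_ids]

    # Hash join of the rows read from the first file with the first data.
    by_id = {}
    for row in first_data:
        by_id.setdefault(row["id"], []).append(row)
    joined = [{**left, **right}
              for left in rows_read
              for right in by_id.get(left["id"], [])]
    return joined, len(rows_read)


first_rows = [{"id": i, "name": f"n{i}"} for i in range(10)]
second_rows = [{"id": i, "order": i * 10} for i in range(10)]
condition = lambda row: row["order"] > 70

joined, rows_read = run_join(first_rows, second_rows, condition,
                             ratio=0.2, threshold=0.5)
print(rows_read, joined)
```

Both branches produce the same join result; they differ only in how many rows of the first file are transferred to the working node.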

The preset threshold may be calculated and determined by the working node 102 according to the data amount of the first file, that is, when the working node 102 queries data in different files, different preset thresholds may be generated; alternatively, the preset threshold may be a fixed value, for example, an empirical value, and may be preset by a person skilled in the art. In this embodiment, a specific implementation manner of how to set the preset threshold is not limited.

It is to be noted that, in this embodiment, the data source 103 includes a first file and a second file as an example for illustration, in practical applications, the data source 103 may further include more files, such as a third file, and when a plurality of files in the data source 103 all need to participate in Join operation, Join operation may be performed on the first file and the second file first, and then Join operation may be performed on the first file and the third file, and so on, which is not described in detail in this embodiment.

It is noted that in some possible embodiments, after the maximum amount of data of each file in the data source 103 is determined, the ratio may also be characterized by the amount of data included in the filtering condition. For example, in some scenarios, a filtering condition that includes a large amount of data indicates that the ratio may be large, whereas a filtering condition that includes a small amount of data indicates that the ratio may be small. Accordingly, whether to generate the filtering condition may also be determined according to the amount of data included in the filtering condition rather than directly according to the ratio. The implementation principle is similar to the above process and is not described in detail in this embodiment.

In a further possible embodiment, during the process of querying data, the working node 102 may determine, according to the partial data already queried, whether to continue using the filter condition to query the remaining data. For example, even when the filter condition is predicted to filter well by sampling data of the first file and the second file as described above, the filter condition may turn out to filter poorly in actual use, in which case the working node 102 may promptly stop using the filter condition to filter data in the first file.

As an example, after querying part of the data to be queried from the first file by using the filtering condition, the working node 102 may calculate the proportion of the queried data in the traversed data of the first file, where the traversed data of the first file includes the queried data and the data in the first file that has so far been filtered out by the filtering condition. When the ratio is relatively large, specifically greater than a filtering threshold, the filtering condition filters the data in the first file poorly in actual use. Because filtering the data in the first file by using the filtering condition requires extra calculation overhead, when the working node 102 subsequently queries the remaining data to be queried (i.e., the data to be queried other than the queried data), the untraversed data in the first file may be read directly to the working node 102, and the remaining data may be queried from the untraversed data according to the query condition, without filtering the data in the first file by using the filtering condition. Conversely, when the ratio is relatively small, specifically smaller than the filtering threshold, the filtering condition filters the data in the first file well in actual use, and the working node 102 may therefore continue to use the filtering condition to query the remaining data to be queried from the untraversed data of the first file.
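The runtime check described above can be sketched as follows. The sketch assumes the first file is an in-memory list of dictionaries with an `id` column and that the ratio is re-evaluated after every `check_every` traversed rows; `scan_adaptively` and `check_every` are hypothetical names, and the source does not state how often the ratio is evaluated.

```python
def scan_adaptively(first_rows, filter_ids, filter_threshold, check_every=4):
    """Scan the first file with the filter condition and abandon the filter
    when too many traversed rows pass it.

    Returns (queried_rows, untraversed_rows, filter_abandoned).
    """
    queried = []  # rows that passed the filter condition so far
    for traversed, row in enumerate(first_rows, start=1):
        if row["id"] in filter_ids:
            queried.append(row)
        # Periodically compare queried / traversed with the filtering threshold.
        if traversed % check_every == 0:
            if len(queried) / traversed > filter_threshold:
                # The filter removes too little: hand back the untraversed
                # rows so they can be read directly and queried unfiltered.
                return queried, list(first_rows[traversed:]), True
    return queried, [], False


rows = [{"id": i} for i in range(12)]
# A filter that matches every row is abandoned at the first check.
print(scan_adaptively(rows, set(range(12)), filter_threshold=0.5))
# A selective filter stays in use for the whole scan.
print(scan_adaptively(rows, {0}, filter_threshold=0.5))
```

The caller would then apply the query condition directly to the returned untraversed rows whenever the filter has been abandoned.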

In other embodiments, the proportion of the queried data in the traversed data of the first file may also be determined by data sampling. For example, when the data volume of the queried data is large, the traversed data of the first file and the queried data may each be sampled, and the proportion of the sampled queried data in the sampled traversed data may be used as the ratio. The specific manner of determining the ratio is not limited in this embodiment.

In practical applications, a separate function module, such as the determination module in fig. 3, may be configured in the executor, and the function module may determine the ratio through real-time monitoring or data sampling during the process of querying data, and determine whether to continue to filter the data in the first file using the filter condition in the subsequent querying process according to the ratio. The functional module may be implemented by software or hardware, which is not limited in this embodiment.

The setting manner of the filtering threshold may be similar to the setting manner of the preset threshold, and for the specific implementation of how to set the filtering threshold, reference may be made to the description of the relevant places of the preset threshold, which is not described herein again. In addition, the value of the filtering threshold may be the same as the preset threshold, or may be different from the preset threshold, which is not limited in this embodiment.

The data query method provided by the present application is described in detail above with reference to fig. 1 to 5, and the apparatus and device provided by the present application are described below with reference to fig. 6 to 7.

Based on the same inventive concept as the method described above, the embodiment of the present application also provides a data query apparatus, which can implement the functions of the data processing system in the embodiment shown in fig. 4 described above, where the data processing system includes a coordination node and a work node. Referring to fig. 6, the data query apparatus 600 may include:

a first communication module 601, configured to send a task for performing query operation on a first file and a second file to the work node, where the task includes a query condition for the first file and the second file, and a data volume of the first file is greater than a data volume of the second file;

an estimating module 602, configured to estimate, according to the received query condition of the task and data information corresponding to the first file and the second file, a proportion of data in the first file, which meets the query condition, in the first file;

a reading module 603, configured to read the first file and the second file to the working node and execute the query operation on the first file and the second file when the ratio is greater than a preset threshold.

In a possible implementation manner, the reading module 603 is further configured to read the second file when the proportion is less than or equal to the preset threshold;

the apparatus 600 further comprises:

a query module 604, configured to query the first data in the second file, where the first data meets the query condition;

a generating module 605, configured to generate a filtering condition according to the first data;

the second communication module 606 is configured to send the filtering condition to a data source where the first file is located, where the filtering condition is used to instruct the data source to query, from the first file, second data that is matched with the first data, receive the second data sent by the data source, and perform the query operation on the first data and the second data.

In a possible implementation manner, the data information corresponding to the first file and the second file respectively includes data statistics information corresponding to the first file and the second file respectively;

the estimation module 602 is specifically configured to:

determining the identifier of the data meeting the query condition in the second file according to the data statistical information corresponding to the second file;

and calculating the proportion of the data with the identification in the first file according to the data statistical information corresponding to the first file.

In a possible implementation manner, the data information corresponding to the first file and the second file respectively includes the sampled data in the first file and the sampled data in the second file;

the estimation module 602 is specifically configured to:

the working node respectively samples the data in the first file and the data in the second file to obtain the sampled data in the first file and the sampled data in the second file;

the working node determines the identifier of target sampling data which meets the query condition in the sampling data of the second file;

the working node calculates the proportion of the sampling data with the identification in the sampling data of the first file.

In a possible implementation manner, the data statistics information corresponding to the first file includes any one or more of a data range, a data distribution interval, and duplicate data of the first file.

In a possible implementation manner, when the ratio is smaller than the preset threshold, the reading module 603 is further configured to:

in the process of inquiring the data to be inquired from the first file, calculating the proportion of the inquired data in the data to be inquired in the traversed data of the first file;

and when the ratio is larger than a filtering threshold value, querying the remaining data in the data to be queried from the non-traversed data of the first file according to the query condition.

The data query apparatus 600 in this embodiment corresponds to the data query method shown in fig. 4, and therefore, for specific implementation of each functional module in the data query apparatus 600 in this embodiment and technical effects thereof, reference may be made to the description of relevant parts in the embodiment shown in fig. 4, which is not described herein again.

In addition, a device is also provided in the embodiments of the present application. As shown in fig. 7, the device 700 may include a communication interface 710 and a processor 720. Optionally, the device 700 may also include a memory 730. The memory 730 may be disposed inside the device 700 or outside the device 700. Illustratively, the various acts described above in connection with the embodiment illustrated in FIG. 4 may be implemented by the processor 720. The processor 720 may retrieve the first file and the second file from the data source 103 via the communication interface 710 and may be configured to implement any of the methods performed in fig. 4. In implementation, the steps of the process flow may be completed by integrated logic circuits of hardware in the processor 720 or by instructions in the form of software. For brevity, no further description is provided herein. Program code executed by the processor 720 to implement the methods described above may be stored in the memory 730. The memory 730 is coupled to the processor 720.

Some features of embodiments of the application may be performed/supported by processor 720 executing program instructions or software code in memory 730. The software components loaded on the memory 730 may be summarized functionally or logically, for example, the estimation module 602, the reading module 603, the query module 604, and the generation module 605 shown in fig. 6. And the functions of the first communication module 601 and the second communication module 606 can be implemented by the communication interface 710.

Any of the communication interfaces involved in the embodiments of the present application may be a circuit, a bus, a transceiver, or any other apparatus that can be used for information interaction, such as the communication interface 710 in the device 700. Illustratively, the other apparatus may be a device coupled to the device 700.

The processors referred to in the embodiments of the present application may be general purpose processors, digital signal processors, application specific integrated circuits, field programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like that implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.

The coupling in the embodiments of the present application is an indirect coupling or a communication connection between devices, units or modules, may be in an electrical, mechanical or other form, and is used for information interaction between the devices, units or modules.

The processor may cooperate with the memory. The memory may be a nonvolatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM). The memory may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.

The specific connection medium among the communication interface, the processor and the memory is not limited in the embodiments of the present application. For example, the memory, the processor, and the communication interface may be connected by a bus. The bus may be divided into an address bus, a data bus, a control bus, etc.

Based on the above embodiments, the present application also provides a computer storage medium in which a software program is stored. When read and executed by one or more processors, the software program can implement the method performed by the data processing system 100 according to any one or more of the above embodiments. The computer storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.

Based on the above embodiments, the present application further provides a chip, which includes a processor and is used to implement the functions of the data processing system 100 according to the above embodiments, for example, to implement the method executed in fig. 4. Optionally, the chip further comprises a memory for storing the program instructions and data necessary for the processor. The chip may be implemented as a standalone chip, or may include a chip together with other discrete devices.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished from one another.

It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the scope of the embodiments of the present application. Thus, to the extent that such modifications and variations of the embodiments of the present application fall within the scope of the claims and their equivalents, it is intended that the present application also encompass such modifications and variations.

Claims

1. A data query method, applied to a data processing system including a coordinating node and a working node, the method comprising:

the coordinating node sends, to the working node, a task for performing a query operation on a first file and a second file, wherein the task comprises query conditions for the first file and the second file, and the data volume of the first file is larger than that of the second file;

the working node estimates the proportion of data in the first file that meets the query condition, according to the query condition of the received task and the data information corresponding to the first file and the second file respectively;

and when the proportion is greater than a preset threshold, the working node reads the first file and the second file to the working node, and executes the query operation on the first file and the second file.

2. The method of claim 1, further comprising:

when the proportion is less than or equal to the preset threshold, the working node reads the second file;

the working node queries the second file for first data that meets the query condition;

the working node generates a filtering condition according to the first data;

the working node sends the filtering condition to a data source where the first file is located, wherein the filtering condition instructs the data source to query the first file for second data that matches the first data;

and the working node receives the second data sent by the data source and executes the query operation on the first data and the second data.
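
For illustration only, and not as part of the claims, the branch described in claims 1 and 2 can be sketched in Python as follows. The names `QueryPlan`, `plan_query`, `condition`, and `key` are hypothetical, and representing the filtering condition as a set of key values is an assumption; the claims do not prescribe any particular form for it.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable, Mapping

Row = Mapping[str, Hashable]


@dataclass(frozen=True)
class QueryPlan:
    strategy: str           # "read_both" or "push_filter" (hypothetical labels)
    filter_keys: frozenset  # key values sent to the data source; empty for "read_both"


def plan_query(
    estimated_proportion: float,
    threshold: float,
    second_file_rows: Iterable[Row],
    condition: Callable[[Row], bool],
    key: str,
) -> QueryPlan:
    """Choose between reading both files and pushing a filter to the data source."""
    if estimated_proportion > threshold:
        # Claim 1: the proportion exceeds the threshold, so both files are read
        # directly and no filtering condition is generated.
        return QueryPlan("read_both", frozenset())
    # Claim 2: query the smaller second file for the first data ...
    first_data = [row for row in second_file_rows if condition(row)]
    # ... and derive a filtering condition from it (assumed here: a set of key values).
    return QueryPlan("push_filter", frozenset(row[key] for row in first_data))
```

In this sketch a proportion exactly equal to the threshold takes the filter branch, matching the "less than or equal to" wording of claim 2.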

3. The method according to claim 1 or 2, wherein the data information corresponding to the first file and the second file respectively comprises sample data from the first file and sample data from the second file;

the estimating, by the working node, of the proportion of data in the first file that meets the query condition, according to the query condition of the received task and the data information corresponding to the first file and the second file respectively, comprises:

4. The method according to claim 1 or 2, wherein the data information corresponding to the first file and the second file respectively comprises data statistics information corresponding to the first file and the second file respectively;

the estimating, by the working node, of the proportion of data in the first file that meets the query condition, according to the query condition of the received task and the data information corresponding to the first file and the second file respectively, comprises:

the working node determines, according to the data statistics information corresponding to the second file, the identifier of the data in the second file that meets the query condition;

and the working node calculates, according to the data statistics information corresponding to the first file, the proportion of data in the first file that has the identifier.
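
For illustration only, and not as part of the claims, the two steps of claim 4 can be sketched as below. The sketch assumes that the data statistics information of each file is available as a mapping from identifier to row count and that the query condition can be evaluated on an identifier; both are simplifying assumptions, and the function and parameter names are hypothetical.

```python
from typing import Callable, Hashable, Mapping


def estimate_proportion(
    second_stats: Mapping[Hashable, int],
    first_stats: Mapping[Hashable, int],
    condition: Callable[[Hashable], bool],
) -> float:
    """Estimate the share of first-file rows whose identifier qualifies."""
    # Step 1: identifiers in the second file that meet the query condition,
    # determined from the second file's statistics only.
    matching_ids = {ident for ident in second_stats if condition(ident)}
    # Step 2: share of first-file rows carrying one of those identifiers,
    # computed from the first file's statistics only.
    total = sum(first_stats.values())
    if total == 0:
        return 0.0  # an empty first file has nothing to match
    matched = sum(first_stats.get(ident, 0) for ident in matching_ids)
    return matched / total
```

Neither file is read in this sketch; only the per-identifier counts are consulted, so the estimate can be produced before deciding which branch of claims 1 and 2 to take.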

5. The method according to claim 4, wherein the data statistics corresponding to the first file include any one or more of a data range, a data distribution interval, and duplicate data of the first file.

6. A data query apparatus, applied to a data processing system including a coordinating node and a working node, the apparatus comprising:

a first communication module, configured to send, to the working node, a task for performing a query operation on a first file and a second file, wherein the task comprises query conditions for the first file and the second file, and the data volume of the first file is larger than that of the second file;

an estimation module, configured to estimate the proportion of data in the first file that meets the query condition, according to the query condition of the received task and the data information corresponding to the first file and the second file respectively;

and a reading module, configured to read the first file and the second file to the working node and execute the query operation on the first file and the second file when the proportion is greater than a preset threshold.

7. The apparatus according to claim 6, wherein the reading module is further configured to read the second file when the proportion is less than or equal to the preset threshold;

the apparatus further comprises:

a query module, configured to query the second file for first data that meets the query condition;

a generating module, configured to generate a filtering condition according to the first data;

and a second communication module, configured to send the filtering condition to a data source where the first file is located, wherein the filtering condition instructs the data source to query the first file for second data that matches the first data, to receive the second data sent by the data source, and to execute the query operation on the first data and the second data.

8. The apparatus according to claim 6 or 7, wherein the data information corresponding to the first file and the second file respectively comprises data statistics information corresponding to the first file and the second file respectively;

the estimation module is specifically configured to:

9. An apparatus, comprising a processor and a memory;

the processor is configured to execute instructions stored in the memory to cause the apparatus to perform the method of any one of claims 1 to 5.

10. A computer-readable storage medium comprising instructions for implementing the method of any of claims 1 to 5.