WO2022161417A1 - Data query method and apparatus, and device and storage medium - Google Patents

Data query method and apparatus, and device and storage medium

Info

Publication number
WO2022161417A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
file
query
node
condition
Prior art date
Application number
PCT/CN2022/074138
Other languages
French (fr)
Chinese (zh)
Inventor
李铮
刘玉
罗旦
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022161417A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24534 Query rewriting; Transformation
    • G06F16/24542 Plan optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution

Definitions

  • The embodiments of the present application relate to the technical field of data processing, and in particular to a data query method, apparatus, device, and storage medium.
  • When a data processing system performs a data query based on a query statement input by the user, it usually generates a filter condition (dynamic filter, DF) to reduce the amount of data read from the data source, so as to improve data query efficiency and reduce query overhead.
  • However, the filter conditions generated by the data processing system in practical applications may not achieve a good data filtering effect; that is, the amount of data before and after filtering does not differ much.
  • In that case, the data query efficiency and data query overhead are not significantly optimized.
  • Moreover, the generation, calculation, and transmission of filter conditions may themselves reduce the data query efficiency of the data processing system and increase query overhead.
  • Embodiments of the present application provide a data query method, apparatus, device, storage medium, and computer program product, so that the data query efficiency of the data processing system is kept at a high level and the query overhead is kept at a low level.
  • In a first aspect, an embodiment of the present application provides a data query method that can be applied to a data processing system including a coordinating node and a worker node.
  • The coordinating node may send the worker node a task of querying a first file and a second file. The task includes query conditions for the first file and the second file, and the data volume of the first file is larger than that of the second file.
  • The task sent by the coordinating node may be, for example, a logical plan tree generated by the coordinating node.
  • The worker node estimates, according to the query conditions and the data information respectively corresponding to the first file and the second file, the proportion of the first file made up of data that meets the query conditions.
  • When the proportion is greater than a preset threshold, the worker node reads the first file and the second file and performs the query operation on them.
  • In this case, the data processing system reads and queries the first file and the second file directly, without generating a filter condition to filter the data in the first file. This avoids the lower data query efficiency and higher query overhead that generating filter conditions would cause.
  • The data query overhead of the data processing system can thus also be kept low.
  • In a possible implementation, when the estimated proportion is less than or equal to the preset threshold, the worker node can read the second file and query the first data in the second file that satisfies the query condition. The worker node can then generate a filter condition according to the first data and send it to the data source where the first file is located; the filter condition instructs the data source to query the first file for second data matching the first data. The worker node then receives the second data sent by the data source and performs the query operation on the first data and the second data.
  • In this way, when the filter condition is expected to filter well, the data processing system can generate it to reduce the amount of first-file data that the worker node reads from the data source, so that data query efficiency reaches a high level. In addition, the worker node does not need to read from the data source the large amount of data in the first file that does not meet the query conditions, which effectively reduces the data query overhead of the data processing system.
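  • The threshold comparison described above can be sketched as follows. This is only an illustrative sketch: the names `ReadStrategy`, `choose_strategy`, and `preset_threshold` are hypothetical, and the application does not prescribe a particular threshold value.

```python
from enum import Enum


class ReadStrategy(Enum):
    # Hypothetical labels for the two branches described in the text.
    FULL_READ = "read both files without generating a filter condition"
    DYNAMIC_FILTER = "generate a filter condition from the second file"


def choose_strategy(estimated_ratio: float, preset_threshold: float) -> ReadStrategy:
    """Pick a read strategy from the estimated proportion of matching data.

    estimated_ratio is the estimated share of the first (large) file that
    meets the query conditions, expressed as a value between 0 and 1.
    """
    if estimated_ratio > preset_threshold:
        # Most of the first file is needed anyway, so a filter would save little.
        return ReadStrategy.FULL_READ
    # Less than or equal to the threshold: a filter is expected to pay off.
    return ReadStrategy.DYNAMIC_FILTER
```

  • With a hypothetical threshold of 0.5, `choose_strategy(0.9, 0.5)` selects the full read, while `choose_strategy(0.5, 0.5)` selects the filter branch, because the text assigns the "equal to" case to filter generation.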
  • In a possible implementation, the data information respectively corresponding to the first file and the second file may be sampled data of the first file and sampled data of the second file. When estimating the proportion, the worker node can determine the identifier of the target sampled data that meets the query condition in the sampled data of the second file, and calculate the proportion of the first file's sampled data that has that identifier.
  • In this way, sampled data can be used to evaluate in advance whether the filter condition to be generated would have a good data filtering effect, which reduces the cost of evaluating the filtering effect and improves the feasibility of implementing the solution.
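  • A minimal sketch of the sample-based estimate, assuming that rows are represented as dictionaries, that the identifier is the value of a shared key column, and that the query condition is available as a Python predicate. The function name and parameters are illustrative and not taken from the application.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


def estimate_ratio_from_samples(
    first_sample: List[Row],
    second_sample: List[Row],
    predicate: Callable[[Row], bool],
    key: str,
) -> float:
    """Estimate the share of the first file that would survive the filter."""
    # Identifiers of the sampled rows of the second file that meet the condition.
    target_ids = {row[key] for row in second_sample if predicate(row)}
    if not first_sample:
        return 0.0
    # Share of the first file's sampled rows that carry one of those identifiers.
    hits = sum(1 for row in first_sample if row[key] in target_ids)
    return hits / len(first_sample)
```

  • The quality of the estimate depends on how the samples are drawn; the sketch simply takes the samples as given.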
  • In a possible implementation, the data information respectively corresponding to the first file and the second file may specifically be data statistics information corresponding to the first file and the second file respectively. When estimating the proportion, the worker node determines, according to the data statistics information of the second file, the identifier of the data in the second file that meets the query condition, and then calculates, according to the data statistics information of the first file, the proportion of the first file made up of data with that identifier.
  • In this way, the proportion used to evaluate the filtering effect of the filter condition can be obtained from the data statistics information of each file, and whether to generate the filter condition can then be decided based on that proportion.
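  • As one possible illustration of a statistics-based estimate, the sketch below uses only data ranges. It assumes integer identifiers that are spread uniformly over the range of the first file; this uniformity assumption, the function name, and the example ranges are not stated in the application.

```python
from typing import Tuple

Range = Tuple[int, int]  # inclusive (minimum, maximum) of an integer id column


def estimate_ratio_from_ranges(first_range: Range, matching_range: Range) -> float:
    """Estimate the share of the first file whose ids fall in matching_range.

    matching_range is the id range of the second-file data that meets the
    query condition, derived from the second file's statistics.
    """
    first_min, first_max = first_range
    match_min, match_max = matching_range
    # Size of the overlap between the two inclusive integer ranges.
    overlap = min(first_max, match_max) - max(first_min, match_min) + 1
    if overlap <= 0:
        return 0.0
    return overlap / (first_max - first_min + 1)
```

  • For instance, if the first file held ids 120001 to 150001 and the matching second-file data covered ids 120001 to 149999, the estimate would be 29999/30001, which suggests that a filter condition would remove almost nothing.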
  • The data statistics information of each file may be generated in advance by the worker node and the coordinating node.
  • For example, the worker node may read the first file and the second file from the data source in advance and hand them to the coordinating node, which performs data statistics on the read files and obtains and saves the data statistics information corresponding to the first file and the second file. In each subsequent data query, the worker node can then determine, based on the locally saved data statistics information, whether to generate a filter condition for that query.
  • Alternatively, the data source may generate the data statistics information for each file and keep it locally at the data source.
  • When the worker node is determining whether to generate a filter condition, it can obtain the data statistics information corresponding to the first file and the second file from the data source, so as to estimate the proportion used to assess the filtering effect of the filter condition.
  • The data statistics information corresponding to the first file may specifically be any one or more of the data range, data distribution interval, and repeated data of the first file; other implementations are of course possible.
  • A worker node can determine whether to generate a filter condition according to information such as the data range, data distribution interval, and repeated data.
  • In a possible implementation, while querying with the filter condition, the worker node may calculate the proportion of the data already queried, among the data that meets the query condition, in the traversed data of the first file. When this proportion is greater than a filtering threshold, the filter condition does not have a good data filtering effect in the actual data query process.
  • The worker node can then stop using the filter condition for the remaining data, and instead query the remaining data that meets the query condition from the untraversed data of the first file according to the query condition. In this way, when the filter condition does not have the expected data filtering effect, its use can be stopped in time, avoiding as far as possible the extra overhead caused by applying it.
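  • The runtime check described above might look like the following sketch. It assumes that the first file is traversed in batches and that the filter condition is a set of identifiers; the batch structure, the function name, and the returned flag are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, int]


def scan_first_file(
    batches: Iterable[List[Row]],
    filter_ids: Set[int],
    key: str,
    filtering_threshold: float,
) -> Tuple[List[Row], bool]:
    """Scan the first file and stop applying the filter if it filters poorly.

    Returns the matching rows and a flag telling whether the filter condition
    was still in use when the scan finished.
    """
    traversed = 0   # rows of the first file traversed while the filter was on
    matched = 0     # rows among them that passed the filter
    use_filter = True
    result: List[Row] = []
    for batch in batches:
        # In this in-memory toy the same membership test stands in for both
        # paths: filtering at the data source, and evaluating the query
        # condition on the worker after reading the batch in full.
        hits = [row for row in batch if row[key] in filter_ids]
        result.extend(hits)
        if not use_filter:
            continue
        traversed += len(batch)
        matched += len(hits)
        if traversed and matched / traversed > filtering_threshold:
            # Most traversed data passes the filter, so it is not worth using.
            use_filter = False
    return result, use_filter
```

  • The result is the same on both paths; what changes in a real system is where the work is done and how much filter-related overhead is paid.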
  • The embodiments of the present application provide a data query apparatus.
  • The apparatus has functions corresponding to the implementations of the above-mentioned first aspect. These functions can be implemented by hardware, or by hardware executing corresponding software.
  • The hardware or software includes one or more modules corresponding to the above functions.
  • An embodiment of the present application provides a device, including a processor and a memory. The memory is used to store instructions; when the device runs, the processor executes the instructions stored in the memory, so that the device performs the data query method in the first aspect or any implementation of the first aspect. The memory may be integrated in the processor, or may be independent of the processor.
  • The device may also include a bus, through which the processor is connected to the memory.
  • The memory may include readable memory and random access memory.
  • An embodiment of the present application further provides a readable storage medium storing a program or instructions which, when run on a computer, cause the data query method in the first aspect or any implementation of the first aspect to be executed.
  • The embodiments of the present application further provide a computer program product including instructions which, when run on a computer, cause the computer to execute the data query method in the first aspect or any implementation of the first aspect.
  • FIG. 1 is a schematic structural diagram of an exemplary data processing system provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the people table and the orders table included in the data source 103;
  • FIG. 3 is a schematic structural diagram of another exemplary data processing system provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a data query method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a query statement input interface provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a data query apparatus provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a hardware structure of a device provided by an embodiment of the present application.
  • The data processing system 100 may include a coordinator node 101 (also referred to below as the coordinating node) and a worker node 102.
  • The worker node 102 may access data in the data source 103, and the data source 103 may include one or more data sources, such as the Hive and Oracle data sources shown in FIG. 1.
  • The data processing system 100 may externally provide a client 104 for human-computer interaction with the user, so as to execute the corresponding data query process based on the query statement input by the user.
  • The coordinating node 101 may receive the SQL statement input by the user through the client 104, and perform syntax analysis and semantic analysis on the SQL statement.
  • Syntax analysis means that the coordinating node 101 uses the syntax rules of the SQL language to check whether the SQL statement contains a syntax error; semantic analysis means that the coordinating node 101 analyzes whether the semantics of the SQL statement are valid.
  • The coordinating node 101 can generate a logical plan tree according to the SQL statement; the logical plan tree indicates a logical execution plan for computing, analyzing, and accessing data.
  • The coordinating node 101 can optimize the logical plan tree through one or more optimizers, and send the optimized logical plan tree to the worker node 102 for execution.
  • A scheduler can determine which executor in the worker node 102 the logical plan tree is sent to for execution.
  • The worker node 102 may include one or more executors (FIG. 1 shows executor 1 and executor 2 as an example). An executor executes the corresponding plan according to the received logical plan tree and returns the query result to the client 104 through the coordinating node 101, so that the client 104 presents the query result to the user.
  • The amount of data in the data source 103 is usually relatively large; for example, the number of rows may reach tens of millions or even hundreds of millions, forming a table A with a large amount of data, such as a fact table.
  • Table A is usually joined with another table B with a small amount of data (such as a dimension table).
  • Table A and table B share at least one column, and the two tables have at least some of the same column values in that column.
  • The worker node 102 can first query table B for the data that meets the query condition; this data can then be used as a filter condition to obtain the data in table A that meets the query condition (that is, the data required by the user).
  • The filter condition may be, for example, some column values of the column shared by table A and table B.
  • For example, the data source 103 has a people table with a large amount of data and an orders table with a small amount of data, as shown in FIG. 2.
  • Both the people table and the orders table have an id (identity) column, and the two columns share some of the same column values.
  • Specifically, the id columns of the two tables share the column values 120001 to 149999.
  • The worker node 102 can use the id column values of the data in the orders table, that is, id values such as 120001, 120002, ..., 149999, as the filter condition for reading data from the people table. It then reads only the rows with id values of 120001, 120002, ..., 149999, without reading the rows with id values of 150000 and 150001.
  • In this way, the worker node 102 does not need to read all the data in the people table, but can read only part of it through the generated filter condition, thereby reducing the amount of data read from the people table, that is, reducing the amount of data from the two tables involved in the Join operation.
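  • The people/orders example can be sketched at a much smaller scale as follows. The rows, the `name` and `amount` columns, and the helper names are made up for illustration; only the shared `id` column comes from the example above.

```python
from typing import Dict, List, Set

Row = Dict[str, object]

# Scaled-down stand-ins for the two tables; only the id column is from the text.
people: List[Row] = [{"id": i, "name": f"person_{i}"} for i in range(120001, 120011)]
orders: List[Row] = [{"id": i, "amount": 10} for i in range(120001, 120009)]


def build_filter(small_table: List[Row], key: str) -> Set[object]:
    """Collect the key values of the small table as the filter condition."""
    return {row[key] for row in small_table}


def read_with_filter(large_table: List[Row], key: str, filter_ids: Set[object]) -> List[Row]:
    """Return only the rows of the large table whose key is in the filter."""
    return [row for row in large_table if row[key] in filter_ids]


filter_ids = build_filter(orders, "id")
people_read = read_with_filter(people, "id", filter_ids)
```

  • Here `people_read` contains 8 of the 10 people rows; the rows with ids 120009 and 120010 are never read, which mirrors the rows 150000 and 150001 in the example.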
  • The Join operation can be divided among multiple executors for execution.
  • For example, the worker node 102 includes an executor 1 and an executor 2, and the Join operation may be allocated to Join node_1 and Join node_2 for execution, where a Join node is a logical node for performing a join operation on table A and table B.
  • Each executor can send the filter condition generated from its corresponding table B to the coordinating node 101. As shown in the figure, executor 1 can send the partial filter condition (partial filter) it generated from table B_1 to the coordinating node 101, executor 2 can likewise send the partial filter it generated from table B_2, and the coordinating node 101 combines the filter conditions sent by the executors.
  • The coordinating node 101 can then deliver the combined filter condition to executor 1 and executor 2, which perform this Join operation, so that each executor can read the corresponding data from its table A according to the combined filter condition.
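  • A minimal sketch of the merge step, assuming that each partial filter is a set of key values and that combining them means taking their union. The application does not specify the representation of a filter condition, so both points are assumptions.

```python
from typing import Iterable, Set


def merge_partial_filters(partial_filters: Iterable[Set[int]]) -> Set[int]:
    """Combine the partial filters sent by the executors into one filter."""
    merged: Set[int] = set()
    for partial in partial_filters:
        merged |= partial  # set union keeps every key value seen by any executor
    return merged


# Hypothetical partial filters built from table B_1 and table B_2.
partial_from_executor_1 = {120001, 120002}
partial_from_executor_2 = {120002, 120003}
combined = merge_partial_filters([partial_from_executor_1, partial_from_executor_2])
```

  • Each executor would then receive `combined` and use it when reading its part of table A.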
  • However, the filter conditions generated by the worker node 102 from table B may not achieve a good data filtering effect. That is, after filtering by the filter condition (or the combined filter condition), the amount of data read from table A differs little from the amount of data in the entire table A. Compared with directly reading all the data in table A, the data processing system 100 then saves little overhead from the reduction in the amount of data read by the worker node 102.
  • Meanwhile, each worker node 102 also sends the filter condition it generated to the coordinating node 101, which combines the multiple filter conditions and delivers the combined filter condition to the worker nodes 102. This further occupies resources of the data processing system 100, including network transmission, system computing, and storage.
  • As a result, using filter conditions to filter the data not only fails to improve the overall performance of the data processing system 100; generating the filter conditions also consumes additional network transmission, system computing, and storage resources, which increases the overall overhead of the data processing system 100. At the same time, the generation, transmission, merging, and application of filter conditions also reduce data query efficiency.
  • Based on this, the embodiments of the present application provide a data query method, so that the data query efficiency of the data processing system 100 is kept at a high level and the query overhead is kept at a low level.
  • Specifically, the worker node 102 can estimate in advance, according to the query conditions and the data information of the two tables participating in the Join operation, the proportion of table A (the table with a large amount of data) made up of data that meets the query conditions. When the proportion is large, the generated filter condition may not achieve a good data filtering effect; that is, the amount of data before and after filtering does not differ much.
  • In this case, the worker node 102 can read table A and table B directly, query the data that meets the query conditions required by the user (hereinafter referred to as the data to be queried), and perform the query operation. When the proportion is small, the filter condition can effectively filter out a large amount of irrelevant data in table A, that is, data that is not to be queried, and this data does not need to be read to the worker node 102.
  • In this case, the worker node 102 can use the query condition and table B to generate the filter condition, and query the data to be queried from table A based on the filter condition.
  • Thus, when the filter condition filters well, the worker node 102 can use it to reduce the amount of data read from table A, improving data query efficiency and reducing query overhead. When the filter condition filters poorly, the data processing system 100 avoids, by not generating it, the lower data query efficiency and higher query overhead that generating it would cause. In this way, the data query efficiency of the data processing system 100 is kept at a high level while the query overhead is kept at a low level.
  • The Join operation is still distributed to executor 1 and executor 2 for execution.
  • Both executor 1 and executor 2 include a determination module (determination module 1 and determination module 2 in FIG. 3), which is used to determine whether to generate a filter condition.
  • The determination module in each executor can estimate, according to the query conditions and the data information of the two tables to be joined, whether the filter condition to be generated would have a good data filtering effect, that is, evaluate the above data proportion. If the estimated filtering effect is good, the determination module can instruct its executor to generate the filter condition. If the estimated filtering effect is poor, the determination module can instruct its executor not to generate the filter condition.
  • If both executors generate filter conditions, each can send its (partial) filter condition to the coordinating node 101, so that the coordinating node 101 can combine and distribute the conditions.
  • If only one executor generates a filter condition, for example executor 1, it can directly use the generated filter condition to read data from table A_1, while executor 2 can read all the data in table A_2 and then find the data that satisfies the query condition in table A_2.
  • If neither executor generates a filter condition, each can read its corresponding table A and find the data satisfying the query condition in the table it has read.
  • The system architecture shown in FIG. 1 and FIG. 3 is only an example, and specific implementations are not limited to it.
  • For example, the data processing system may not include the client 104; the number of executors included in the worker node is not limited to two; or, in addition to the nodes, the data sources may also be integrated.
  • The data processing system may adaptively add or remove components in the architecture shown in FIG. 1 and FIG. 3, which is not limited in this embodiment.
  • FIG. 4 is a schematic flowchart of a data query method in an embodiment of the present application.
  • The method can be applied to the data processing system 100 shown in FIG. 1 or FIG. 3, and can be executed by the data processing system 100 or by corresponding nodes in the data processing system 100.
  • In this embodiment, execution of the method by the coordinating node 101 and the worker node 102 in the data processing system 100 is used as an example.
  • The method may specifically include:
  • The coordinating node 101 receives a data query statement, which includes query conditions corresponding to the data to be queried and a query operation on the first file and the second file, where the data volume of the first file is larger than that of the second file.
  • The user may provide a data query statement to the data processing system 100, so that the data processing system 100 can locate the data required by the user based on the data query statement.
  • For ease of description, the data required by the user is referred to as the data to be queried.
  • For example, the data processing system 100 includes a client 104, and the client 104 can present to the user a query statement input interface as shown in FIG. 5.
  • The user may input a query statement in a specific area of the query statement input interface presented by the client 104, so that the data processing system 100 feeds back the query result expected by the user.
  • The client 104 can send the data query statement to the coordinating node 101, for example in a data query request, so that the coordinating node 101 can query the data to be queried.
  • The data query statement includes query conditions and a query operation on two files (for example, two tables), such as a join (Join) operation; the query conditions are used to locate the data to be queried.
  • The coordinating node 101 performs syntax analysis and semantic analysis on the data query statement to determine whether the data query statement input by the user is valid.
  • If the data query statement is determined to be invalid, the data processing system 100 may terminate the data query task and may prompt the user to input a data query statement with correct syntax and semantics. If the data query statement is determined to be valid, the coordinating node 101 may continue with the subsequent data query steps.
  • After determining that the data query statement is valid, the coordinating node 101 sends a task of querying the first file and the second file to the worker node 102, where the task includes query conditions for the first file and the second file.
  • The task sent by the coordinating node 101 to the worker node 102 may be, for example, a logical plan tree that includes the query conditions for the data to be queried and identifiers (such as file names) of the two files participating in the query operation.
  • The coordinating node 101 may generate the logical plan tree according to the data query statement and send it to the worker node 102, so as to schedule the worker node 102 to execute the tasks in the logical plan tree.
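  • To make the idea of a task that carries query conditions and file identifiers more concrete, the sketch below models a logical plan tree as a small tree of nodes and walks it to collect both. The `PlanNode` class, its fields, the operator names, and the condition strings (including the `amount` column) are hypothetical; the application does not define the structure of the tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlanNode:
    op: str                           # e.g. "join", "filter", "scan" (illustrative)
    file_name: Optional[str] = None   # identifier of the file read by a scan
    condition: Optional[str] = None   # query or join condition, kept as text here
    children: List["PlanNode"] = field(default_factory=list)


def collect(node: PlanNode, files: List[str], conditions: List[str]) -> None:
    """Walk the tree depth-first, gathering file identifiers and conditions."""
    if node.file_name is not None:
        files.append(node.file_name)
    if node.condition is not None:
        conditions.append(node.condition)
    for child in node.children:
        collect(child, files, conditions)


# A made-up plan: join people with the orders rows that pass a filter.
plan = PlanNode(
    op="join",
    condition="people.id = orders.id",
    children=[
        PlanNode(op="scan", file_name="people"),
        PlanNode(
            op="filter",
            condition="orders.amount > 100",
            children=[PlanNode(op="scan", file_name="orders")],
        ),
    ],
)

files: List[str] = []
conditions: List[str] = []
collect(plan, files, conditions)
```

  • After the walk, `files` holds the two file identifiers and `conditions` holds the join and query conditions, which is the information a worker node would need before estimating the proportion.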
  • The worker node 102 estimates, according to the query conditions in the received task and the data information respectively corresponding to the first file and the second file, the proportion of the first file made up of data that meets the query conditions.
  • The data source 103 may include a first file and a second file, and the worker node 102 may obtain data satisfying the query condition by accessing the data in the first file and the second file.
  • The first file and the second file may be formed by different data in the data source 103, and may specifically record data in the form of tables, such as the people table and the orders table shown in FIG. 2.
  • The data in the first file and the second file may be associated.
  • For example, the first file and the second file may contain the same column, and the values in that column in the first file may be at least partially the same as the values in that column in the second file.
  • For example, the id columns of the people table and the orders table share some of the same id values, such as 120001, 120002, ..., 149999.
  • The amount of data in the first file and the second file may differ; the amount of data in the first file may be larger than that in the second file.
  • For example, the first file may specifically be a fact table, and the second file may specifically be a dimension table.
  • The data recorded in a fact table is usually rich and can include information in multiple dimensions.
  • For example, a fact table used for work records can include the working date, employee, working hours, overtime hours, nature of work, work content, person in charge, and other information. This covers three dimensions: time (working date, working hours, overtime hours), personnel (employee, person in charge), and job attributes (nature of work, work content).
  • A dimension table can be used to record the data of some dimensions.
  • For example, a time dimension table can record only time-dimension data, such as the working date, working hours, and overtime hours mentioned above. Data such as the person in charge may not be recorded in this time dimension table, but in other dimension tables.
  • A dimension table can be regarded as a window for analyzing data, which contains the data characteristics of some dimensions in the fact table. Typically, dimension tables contain less data than fact tables.
  • The amount of data in the first file may be relatively large, and the data that meets the query conditions (that is, the data to be queried) may be only a small part of the data in the first file.
  • If the worker node 102 read the entire first file, it would have to read a large amount of useless data other than the data to be queried. Therefore, in this embodiment, the worker node 102 can use a filter condition generated from the second file to reduce the amount of data read from the first file.
  • Since a generated filter condition does not necessarily filter well, the worker node 102 can parse the query conditions and the two files involved in the Join operation out of the logical plan tree, and obtain the data information corresponding to the first file and the second file. The worker node 102 can then estimate the proportion of the data to be queried in the first file according to the query condition and the data information of the two files, and use that proportion to measure whether the filter condition would have a good data filtering effect.
  • When the proportion is relatively large, most of the data in the first file is data to be queried. Even if the first file is filtered with the filter condition, the amount of data the worker node 102 reads differs little from the amount it would read for the entire first file, so the filter condition filters poorly. Conversely, when the proportion is small, only a small part of the data in the first file is data to be queried; most of the data in the first file can be filtered out with the filter condition, and the worker node 102 only needs to read a small part of the first file to obtain the data to be queried, so the filter condition filters well.
  • the data information respectively corresponding to the first file and the second file may specifically be data statistics information corresponding to the first file and the second file respectively.
  • the data statistics information corresponding to the first file (or the second file) may be, for example, any one or more of the data range, data distribution interval, and repeated data of the first file (or the second file).
  • the data range refers to the value range of the data in the first file (or the second file), and may specifically be the value range of each column in the file, that is, from the minimum value to the maximum value of the data in each column.
  • the data distribution interval describes how the data in the first file (or the second file) is distributed; for example, the distribution can be represented by a histogram. The repeated data indicates which data values occur more than once in the first file (or the second file); for example, it can be expressed as the repetition rate or the number of repetitions of each data value.
  • the worker node 102 can determine, according to the data statistics information corresponding to the second file, the identifier of the data in the second file that satisfies the query condition, and then calculate, according to the data statistics information corresponding to the first file, the proportion of the data with that identifier in the first file.
  • for example, assuming that the data statistics information is a data range, the worker node 102 can use the data range of the second file to find the data range that satisfies the query condition, and then determine the data identifiers within the found data range.
  • the worker node 102 then finds the data with those data identifiers in the first file and calculates the proportion of that data in the first file.
  • for instance, if the range of the data in the first file is 1 to 10000 and the data with the above data identifiers in the first file is 1 to 1000, the proportion may be 1/10 (that is, 1000/10000).
  • the data processing system 100 may also calculate the proportion according to the above data distribution interval, repeated data, or in combination with any of the above three kinds of information, which is not limited in this embodiment.
  • the data statistics information corresponding to the first file (or the second file) may be pre-collected by the working node 102 and the coordination node 101 and stored locally.
  • for example, the worker nodes 102 can access each file in the data source 103 one by one and send the files read to the coordinating node 101, so that the coordinating node 101 can generate and save the corresponding data statistics information for each file.
  • the worker node 102 can then obtain, locally or from the coordinating node 101, the data statistics information corresponding to the file to be accessed, and determine based on that information whether to generate a filter condition during the data query.
  • the worker node 102 may also obtain the data statistics information corresponding to the first file and the second file in other ways.
  • for example, when the first file and the second file are stored in the data source 103, the relevant device in the data source 103 can generate the corresponding data statistics information for the first file and the second file in advance, so that the worker node 102 can obtain this information from the data source 103.
  • the specific implementation manner in which the working node obtains the data statistics information is not limited.
  • the working node 102 may also determine the proportion by means of data sampling.
  • the data information corresponding to the first file and the second file respectively may be the sampled data in the first file and the sampled data in the second file.
  • the worker node 102 may sample the data in the first file and the data in the second file respectively, for example by random sampling, equal-interval sampling, or equal-proportion sampling, and read the sampled data of the first file and the sampled data of the second file from the data source 103. The worker node 102 can then determine the identifier of the target sampled data that satisfies the query condition in the sampled data of the second file, and count the sampled data of the first file that has this identifier.
  • the proportion of the sampled data with the identifier in the sampled data of the first file can then be calculated. In this way, the proportion of the data to be queried in the first file can be predicted from the sampled data of the first file and the second file.
  • the worker node 102 When the proportion is less than the preset threshold, the worker node 102 generates a filter condition according to the query condition and the second file, and searches the first file for data that meets the query condition according to the filter condition, and executes a corresponding query operation.
  • because the proportion calculated in step S404 reflects how well the filter condition would filter data, a proportion greater than the preset threshold indicates that the filter condition to be generated could hardly filter out much of the data in the first file.
  • in that case, the worker node 102 may skip generating a filter condition from the second file. Instead, it reads the first file and the second file from the data source 103, queries the data satisfying the query condition from the files read, and performs the corresponding operation, such as a Join operation on the queried data.
  • when the proportion is less than the preset threshold, the filter condition to be generated can effectively filter out much of the data in the first file. Accordingly, the overall overhead of generating and applying the filter condition (even including the overhead of transmitting and merging filter conditions) is smaller than the overall overhead of the worker node 102 directly reading the first file.
  • the worker node 102 can read the second file and query from it the first data that satisfies the query condition. The worker node 102 can then generate a filter condition according to the first data and send the filter condition to the data source 103, to instruct the data source 103 to query, from the first file, the second data matching the first data and return it to the worker node 102. The first data and the second data may have the same identifier;
  • for example, the second data in the first file and the first data in the second file have the same column value (such as the id value in the people table and the orders table in the foregoing example).
  • the worker node 102 can then perform the above query operation on the received second data and the first data queried from the second file, such as a Join operation on the first data and the second data, to obtain the data the user needs. Because the amount of data the worker node 102 reads from the data source 103 is small (relative to reading the whole first file directly), the data query efficiency of the worker node 102 can be kept at a high level.
  • for the specific process by which the worker node 102 generates the filter condition according to the second file and the query condition, refer to the related descriptions above, which are not repeated here.
  • the preset threshold may be calculated by the worker node 102 according to the data volume of the first file, so that different preset thresholds may be generated when the worker node 102 queries data in different files; alternatively, the preset threshold may be a fixed value, such as an empirical value preset by technical personnel. This embodiment does not limit how the preset threshold is set.
  • the above description takes the case in which the data source 103 includes the first file and the second file as an example.
  • in practice, the data source 103 may also include more files, such as a third file, and when multiple files in the data source 103 need to participate in the Join operation, the first file and the second file can be joined first, and then the first file and the third file.
  • this embodiment does not describe this case in detail.
  • the above-mentioned proportion may also be characterized by the amount of data included in the filter condition.
  • when the filter condition includes a large amount of data, this may indicate a larger proportion.
  • when the filter condition includes a small amount of data, this may indicate a smaller proportion.
  • during the query, the worker node 102 may decide, based on the part of the data already queried, whether to keep using the filter condition to query the remaining data.
  • even if data sampling predicts that the filter condition filters well, the filter condition may still filter poorly in actual use. In that case, the worker node 102 can promptly stop using the filter condition to filter the data in the first file.
  • specifically, the worker node 102 may calculate the proportion, in the traversed data of the first file, of the queried data, that is, the part of the data to be queried that has already been found.
  • the traversed data of the first file includes the queried data and the data that has so far been filtered out of the first file by the filter condition.
  • when the proportion is relatively large, specifically greater than the filtering threshold, the filter condition is filtering the data in the first file poorly in actual use.
  • in that case, when the worker node 102 subsequently queries the remaining data in the data to be queried (that is, the data other than the queried data), it can read the untraversed data of the first file directly to the worker node 102 and continue to query the remaining data from the untraversed data according to the query condition, without using the filter condition to filter the data in the first file.
  • when the proportion is small, specifically less than the filtering threshold, the filter condition is filtering the data in the first file well in actual use, so the worker node 102 can keep using the filter condition to query the remaining data from the untraversed data of the first file.
  • the proportion of the queried data in the traversed data of the first file may also be determined by data sampling. For example, when the volume of the queried data is relatively large, the traversed data of the first file and the queried data can each be sampled, and the proportion of the sampled queried data in the sampled traversed data is taken as the proportion. This embodiment does not limit how the proportion is determined.
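The runtime check described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `should_keep_filtering` and its parameters are hypothetical, and because the text only covers the "greater than" and "less than" cases, treating a proportion equal to the filtering threshold as "keep filtering" is an assumption.

```python
def should_keep_filtering(queried_rows: int, traversed_rows: int,
                          filtering_threshold: float) -> bool:
    """Decide whether to keep applying the filter condition to the first file.

    queried_rows:   rows found so far that belong to the data to be queried
    traversed_rows: queried rows plus rows filtered out so far
    """
    if queried_rows < 0 or traversed_rows < 0:
        raise ValueError("row counts must be non-negative")
    if queried_rows > traversed_rows:
        raise ValueError("queried rows cannot exceed traversed rows")
    if traversed_rows == 0:
        # Nothing traversed yet, so there is no evidence against the filter.
        return True
    proportion = queried_rows / traversed_rows
    # Assumption: equality counts as "still good enough".
    return proportion <= filtering_threshold


# 10 useful rows out of 1000 traversed: the filter discards most data, keep it.
print(should_keep_filtering(10, 1000, 0.5))
# 900 useful rows out of 1000 traversed: the filter discards little, stop using it.
print(should_keep_filtering(900, 1000, 0.5))
```

When the function returns `False`, the caller would switch to reading the untraversed data directly and applying only the query condition.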
  • a separate functional module can be configured in the executor, such as the judgment module in Figure 3 above.
  • this functional module can determine the proportion through real-time monitoring or data sampling while data is being queried, and decide according to the proportion whether to keep using the filter condition to filter the data in the first file in the subsequent query process.
  • the functional module may be implemented by software or hardware, which is not limited in this embodiment.
  • the setting method of the filtering threshold may be similar to the setting method of the aforementioned preset threshold.
  • for how to set the filtering threshold, refer to the foregoing description of setting the preset threshold, which is not repeated here.
  • the value of the filtering threshold may be the same as the preset threshold, or may be different from the preset threshold, which is not limited in this embodiment.
  • the embodiment of the present application also provides a data query device, the data query device can realize the function of the data processing system in the embodiment shown in FIG. 4, the data processing system includes a coordination node and a working node.
  • the data query apparatus 600 may include:
  • the first communication module 601 is configured to send a task of performing a query operation on the first file and the second file to the worker node, where the task includes query conditions on the first file and the second file, wherein, The data volume of the first file is greater than the data volume of the second file;
  • an estimation module 602, configured to estimate, according to the query condition in the received task and the data information corresponding to the first file and the second file respectively, the proportion in the first file of the data in the first file that meets the query condition;
  • the reading module 603 is configured to read the first file and the second file to the worker node when the proportion is greater than a preset threshold, and to execute the query operation on the first file and the second file that were read.
  • the reading module 603 is further configured to read the second file when the ratio is less than or equal to the preset threshold;
  • the apparatus 600 also includes:
  • a query module 604 configured to query the first data in the second file that satisfies the query condition
  • a generating module 605, configured to generate filter conditions according to the first data
  • the second communication module 606 is configured to send the filter condition to the data source where the first file is located, where the filter condition is used to instruct the data source to query, from the first file, second data that matches the first data; and to receive the second data sent by the data source and perform the query operation on the first data and the second data.
  • the data information corresponding to the first file and the second file respectively includes data statistics information corresponding to the first file and the second file respectively;
  • the estimating module 602 is specifically configured to:
  • determine, according to the data statistics information corresponding to the second file, the identifier of the data in the second file that meets the query condition;
  • calculate, according to the data statistics information corresponding to the first file, the proportion in the first file of the data in the first file that has the identifier.
  • the data information corresponding to the first file and the second file respectively includes sampling data in the first file and sampling data in the second file;
  • the estimating module 602 is specifically used for:
  • the working node respectively samples the data in the first file and the data in the second file to obtain the sampling data in the first file and the sampling data in the second file;
  • the working node determines the identifier of the target sampled data that meets the query condition in the sampled data of the second file
  • the working node calculates the proportion of the sampled data with the identifier in the sampled data of the first file in the sampled data of the first file.
  • the data statistics information corresponding to the first file includes any one or more of the data range, data distribution interval, and repeated data of the first file.
  • the reading module 603 is further configured to:
  • query the remaining data in the data to be queried from the untraversed data of the first file according to the query condition.
  • the data query device 600 in this embodiment corresponds to the data query method shown in FIG. 4. Therefore, for the specific implementation of each functional module in the data query device 600 and its technical effects, refer to the description of the relevant parts of the embodiment shown in FIG. 4, which is not repeated here.
  • the device 700 may include a communication interface 710 and a processor 720 .
  • the device 700 may further include a memory 730 .
  • the memory 730 may be disposed inside the device 700 or outside the device 700 .
  • each action in the above-mentioned embodiment shown in FIG. 4 may be implemented by the processor 720 .
  • the processor 720 may acquire the first file and the second file in the data source 103 through the communication interface 710, and use them to implement any method executed in FIG. 4 .
  • each step of the processing flow can be implemented by the hardware integrated logic circuit in the processor 720 or the instructions in the form of software to complete the method executed in FIG. 4 .
  • the program codes executed by the processor 720 for implementing the above method may be stored in the memory 730 .
  • the memory 730 is connected to the processor 720, such as a coupling connection or the like.
  • Some features of the embodiments of the present application may be implemented/supported by the processor 720 executing program instructions or software codes in the memory 730 .
  • the software components loaded on the memory 730 can be summarized in terms of functions or logic, for example, the estimation module 602, the reading module 603, the query module 604, and the generation module 605 shown in FIG. 6 .
  • the functions of the first communication module 601 and the second communication module 606 may be implemented by the communication interface 710 .
  • any communication interface involved in the embodiments of this application may be a circuit, a bus, a transceiver, or any other device that can be used for information interaction, such as the communication interface 710 in the device 700.
  • for example, the other device involved in the information interaction may be a device connected to the device 700.
  • the processors involved in the embodiments of the present application may be general-purpose processors, digital signal processors, application-specific integrated circuits, field programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
  • the coupling in the embodiments of the present application is an indirect coupling or communication connection between apparatuses, units, or modules, which may be in electrical, mechanical, or other forms, and is used for information exchange between the apparatuses, units, or modules.
  • the processor may cooperate with the memory.
  • the memory can be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM).
  • Memory is, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • connection medium among the above-mentioned communication interface, processor, and memory is not limited in the embodiments of the present application.
  • the memory, the processor and the communication interface can be connected by a bus.
  • the bus can be divided into an address bus, a data bus, a control bus, and the like.
  • the embodiments of the present application further provide a computer storage medium, where a software program is stored in the storage medium, and when the software program is read and executed by one or more processors, it can implement the method performed by the data processing system 100 provided in any one or more of the above embodiments.
  • the computer storage medium may include: a U disk, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk and other mediums that can store program codes.
  • the embodiments of the present application further provide a chip including a processor for implementing the functions of the data processing system 100 involved in the above embodiments, for example, for implementing the method executed in FIG. 4 .
  • the chip further includes a memory, and the memory is used to store the program instructions and data necessary for the processor.
  • the chip may form a chip system on its own, or the chip system may include the chip and other discrete devices.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Abstract

Disclosed in the present application are a data query method and apparatus, and a device and a storage medium, which can be applied to a data processing system comprising a coordination node and a working node. The coordination node sends, to the working node, a task of performing a query operation on a first file and a second file, wherein the task comprises a query condition for the first file and the second file, and the amount of data of the first file is greater than the amount of data of the second file. The working node estimates, according to the query condition in the task and data information respectively corresponding to the first file and the second file, the proportion in the first file of the data that meets the query condition; the working node reads the first file and the second file when the proportion is greater than a preset threshold value, and executes the query operation on the first file and the second file. In this way, when the proportion of data that meets the query condition in the first file is estimated to be relatively large, the problems of reduced data query efficiency and of query overhead increasing rather than decreasing, which generating a dynamic filter would cause, are avoided.

Description

A data query method, apparatus, device, and storage medium
Technical Field
The embodiments of the present application relate to the technical field of data processing, and in particular, to a data query method, apparatus, device, and storage medium.
Background
In the era of big data, as data collection methods expand, the cost of collecting data keeps falling and the amount of data stored in data sources keeps growing. When massive data is queried and analyzed, data query efficiency and query overhead become core issues in the field of data processing.
At present, when a data processing system performs a data query based on a query statement input by a user, it usually generates a filter condition (dynamic filter, DF) to reduce the amount of data read from the data source, thereby improving data query efficiency and reducing data query overhead. In practice, however, the filter condition generated by the data processing system may fail to filter data well, that is, the amount of data before and after filtering differs little. In that case, data query efficiency and data query overhead are not noticeably improved; instead, generating, computing, and transmitting the filter condition may lower the data query efficiency of the data processing system and increase the query overhead.
Summary
Embodiments of the present application provide a data query method, apparatus, device, storage medium, and computer program product, so that the data query efficiency of a data processing system is kept at a high level and the query overhead is kept at a low level.
In a first aspect, an embodiment of the present application provides a data query method that can be applied to a data processing system including a coordination node and a working node. The coordination node may send the working node a task of performing a query operation on a first file and a second file. The task includes a query condition for the first file and the second file, and the data volume of the first file is greater than the data volume of the second file. For example, the task sent by the coordination node may be a logical plan tree generated by the coordination node. According to the query condition in the received task and the data information corresponding to the first file and the second file respectively, the working node estimates the proportion in the first file of the data in the first file that meets the query condition. When the proportion is greater than a preset threshold, the working node reads the first file and the second file to the working node and performs the query operation on the first file and the second file that were read.
A relatively large proportion of data meeting the query condition in the first file indicates that the filter condition to be generated would filter poorly. In this case, the data processing system directly reads the first file and the second file and queries the data that meets the query condition, without generating a filter condition to filter the data in the first file. This avoids the lower data query efficiency and the increased, rather than reduced, query overhead that generating the filter condition would cause, so the data query overhead of the data processing system can also be kept at a low level.
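The threshold decision described above can be sketched as a small function. This is only an illustration under stated assumptions: the names `Strategy` and `choose_strategy` are hypothetical and do not appear in the application, and the proportion is assumed to be a value between 0 and 1.

```python
from enum import Enum


class Strategy(Enum):
    READ_BOTH_FILES = "read_both_files"            # no filter condition is generated
    USE_FILTER_CONDITION = "use_filter_condition"  # build a filter from the second file


def choose_strategy(estimated_proportion: float, preset_threshold: float) -> Strategy:
    """Pick the query strategy from the estimated proportion of matching data.

    A proportion above the threshold means a filter would discard little of the
    first file, so the filter is skipped; otherwise the filter is generated.
    """
    if not 0.0 <= estimated_proportion <= 1.0:
        raise ValueError("estimated_proportion must lie in [0, 1]")
    if estimated_proportion > preset_threshold:
        return Strategy.READ_BOTH_FILES
    return Strategy.USE_FILTER_CONDITION


# With a hypothetical threshold of 0.5:
print(choose_strategy(0.8, 0.5))  # Strategy.READ_BOTH_FILES
print(choose_strategy(0.1, 0.5))  # Strategy.USE_FILTER_CONDITION
```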
In a possible implementation, when the estimated proportion in the first file of the data that meets the query condition is less than or equal to the preset threshold, the working node may read the second file and query the first data in the second file that satisfies the query condition. The working node may then generate a filter condition according to the first data and send the filter condition to the data source where the first file is located. The filter condition can be used to instruct the data source to query, from the first file, second data that matches the first data, so that the working node can receive the second data sent by the data source and perform the query operation on the first data and the second data.
In this implementation, a relatively small proportion of data meeting the query condition in the first file indicates that the filter condition to be generated would filter well. In this case, the data processing system can generate the filter condition to reduce the amount of first-file data that the working node reads from the data source, so that the data query efficiency of the data processing system can reach a high level. In addition, when querying the data that meets the query condition, the working node does not need to read from the data source the large amount of data in the first file that does not meet the query condition, which effectively reduces the data query overhead of the data processing system.
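The filter-based path can be sketched end to end with in-memory lists standing in for the two files. This is a hedged illustration only: the function names, the representation of a filter condition as a set of key values, and the `people`/`orders` rows with their columns are all assumptions made for the example, not details given in the application.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Row = Dict[str, object]


def build_filter_condition(second_file: Iterable[Row],
                           predicate: Callable[[Row], bool],
                           key: str) -> Tuple[List[Row], Set[object]]:
    """Working-node side: select the first data and collect its key values."""
    first_data = [row for row in second_file if predicate(row)]
    return first_data, {row[key] for row in first_data}


def data_source_scan(first_file: Iterable[Row], key: str,
                     filter_keys: Set[object]) -> List[Row]:
    """Data-source side: return only the rows whose key is in the filter."""
    return [row for row in first_file if row[key] in filter_keys]


def join(first_data: List[Row], second_data: List[Row], key: str) -> List[Row]:
    """Join the first data (small side) with the second data on the key."""
    by_key: Dict[object, List[Row]] = {}
    for row in first_data:
        by_key.setdefault(row[key], []).append(row)
    return [{**small, **big}
            for big in second_data
            for small in by_key.get(big[key], [])]


# Hypothetical data: people is the small second file, orders the large first file.
people = [{"id": 1, "name": "a", "age": 30}, {"id": 2, "name": "b", "age": 17}]
orders = [{"id": 1, "item": "x"}, {"id": 2, "item": "y"},
          {"id": 3, "item": "z"}, {"id": 1, "item": "w"}]

first_data, filter_keys = build_filter_condition(people, lambda r: r["age"] >= 18, "id")
second_data = data_source_scan(orders, "id", filter_keys)  # only rows with id 1 are returned
result = join(first_data, second_data, "id")
print(len(result))  # 2
```

Only two of the four order rows leave the data source in this run, which is the saving the filter condition is meant to provide.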
In a possible implementation, the data information corresponding to the first file and the second file respectively may be sampled data in the first file and sampled data in the second file. When estimating the proportion in the first file of the data that meets the query condition, the working node may sample the data in the first file and the data in the second file respectively to obtain the sampled data of the first file and the sampled data of the second file. The working node may then determine the identifier of the target sampled data that meets the query condition in the sampled data of the second file, and calculate the proportion, in the sampled data of the first file, of the sampled data that has this identifier. In this way, by sampling a small amount of data from the first file and the second file, it is possible to estimate whether the filter condition to be generated would filter data well. The data processing system can thus evaluate the filtering effect of the filter condition at low cost, which improves the feasibility of the solution.
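A minimal sketch of the sampling-based estimate follows, assuming random sampling with Python's `random.Random.sample` and rows held as dictionaries. The function name, the `seed` parameter, and the shared `key` column are hypothetical choices for the example. One caveat worth stating: if the second file is sampled too sparsely, matching identifiers can be missed and the estimate will come out low.

```python
import random
from typing import Callable, Dict, Sequence

Row = Dict[str, object]


def estimate_proportion_by_sampling(first_file: Sequence[Row],
                                    second_file: Sequence[Row],
                                    predicate: Callable[[Row], bool],
                                    key: str,
                                    sample_size: int,
                                    seed: int = 0) -> float:
    """Estimate the share of first-file rows that match the query condition."""
    rng = random.Random(seed)
    # Random sampling of each file (capped at the file length).
    first_sample = rng.sample(first_file, min(sample_size, len(first_file)))
    second_sample = rng.sample(second_file, min(sample_size, len(second_file)))
    # Identifiers of the target sampled data that satisfies the query condition.
    target_ids = {row[key] for row in second_sample if predicate(row)}
    if not first_sample:
        return 0.0
    hits = sum(1 for row in first_sample if row[key] in target_ids)
    return hits / len(first_sample)


# Hypothetical data: 100 rows in the first file, 10 rows in the second file.
first_file = [{"id": i % 10} for i in range(100)]
second_file = [{"id": i, "age": i} for i in range(10)]
estimate = estimate_proportion_by_sampling(
    first_file, second_file, lambda r: r["age"] < 2, "id", sample_size=100)
print(estimate)
```

In this run the sample size covers both files completely, so the estimate equals the true proportion of rows whose id is 0 or 1.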
In a possible implementation, the data information corresponding to the first file and the second file respectively may be data statistics information corresponding to the first file and the second file respectively. When estimating the proportion in the first file of the data that meets the query condition, the working node may determine, according to the data statistics information corresponding to the second file, the identifier of the data in the second file that meets the query condition, and then calculate, according to the data statistics information corresponding to the first file, the proportion in the first file of the data that has this identifier. In this way, the data statistics information of each file yields a proportion that can be used to evaluate how well the filter condition would filter, and that proportion can then be used to decide whether to generate the filter condition.
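When the statistics are data ranges, the estimate reduces to comparing two intervals. The sketch below assumes integer identifiers spread evenly over the first file's range, which is a simplifying assumption rather than something the application states; `DataRange` and `estimate_proportion_by_range` are hypothetical names, and the numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRange:
    """Inclusive minimum and maximum of an integer column."""
    minimum: int
    maximum: int

    def size(self) -> int:
        return self.maximum - self.minimum + 1


def estimate_proportion_by_range(first_file_range: DataRange,
                                 matching_id_range: DataRange) -> float:
    """Share of the first file's range covered by the matching identifiers.

    Assumes identifiers are evenly distributed over the first file's range.
    """
    low = max(first_file_range.minimum, matching_id_range.minimum)
    high = min(first_file_range.maximum, matching_id_range.maximum)
    if high < low:
        return 0.0  # the ranges do not overlap, so nothing in the first file matches
    return (high - low + 1) / first_file_range.size()


# Identifiers 1..1000 satisfy the query condition; the first file spans 1..10000.
print(estimate_proportion_by_range(DataRange(1, 10000), DataRange(1, 1000)))  # 0.1
```

A distribution histogram or repetition counts, where available, would refine this estimate when the even-distribution assumption does not hold.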
The data statistics of each file may be generated in advance by the worker node together with the coordinator node. For example, the worker node may read the first file and the second file from the data source in advance and hand them to the coordinator node, which computes statistics on the files and obtains and saves the data statistics corresponding to each file. In subsequent data queries, the worker node can then determine from the locally saved data statistics whether to generate a filter condition for the current query. Alternatively, in other examples, the data source may generate data statistics for each file while the first file and the second file are being stored in the data source, and keep the statistics locally at the data source. When deciding whether to generate a filter condition, the worker node may obtain the data statistics of the first file and the second file from the data source in order to estimate the data proportion used to assess the filtering effect of the filter condition.
In a possible implementation, the data statistics corresponding to the first file may be any one or more of the data range, the data distribution intervals, and the duplicate data of the first file; other implementations are also possible. The worker node can then determine whether to generate a filter condition based on information such as the data range, the data distribution intervals, and the duplicate data.
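As one illustration of a statistics-based estimate, the following Python sketch uses only the data range (minimum and maximum) of the shared column. It assumes integer identifiers that are roughly uniformly distributed in the first file; the `ColumnStats` type and the function name are hypothetical, and the application does not prescribe this particular calculation.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Hypothetical per-file statistics for the shared identifier column."""
    min_value: int
    max_value: int
    row_count: int


def estimate_ratio_from_ranges(first: ColumnStats,
                               matching_second: ColumnStats) -> float:
    """Estimate the share of first-file rows covered by the filter.

    first: range of the identifier column in the first (large) file.
    matching_second: range of identifiers of the second-file rows that
        meet the query condition.
    Assumes identifiers are spread uniformly over the first file's range.
    """
    if first.row_count == 0 or matching_second.row_count == 0:
        return 0.0
    low = max(first.min_value, matching_second.min_value)
    high = min(first.max_value, matching_second.max_value)
    if high < low:
        return 0.0  # the two ranges do not overlap
    first_width = first.max_value - first.min_value + 1
    overlap_width = high - low + 1
    return overlap_width / first_width
```

A range overlap only bounds the proportion from above when the matching identifiers are sparse inside the range, which is why distribution intervals or duplicate counts, also named in the text, could refine the estimate.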
In a possible implementation, after the worker node decides to generate a filter condition, and while it is using that filter condition to query data from the first file, the worker node may calculate the proportion of the data already found to meet the query condition among the data of the first file traversed so far. When this proportion exceeds a filter threshold, the filter condition is not filtering data effectively in the actual query. The worker node may then stop using the filter condition and instead, according to the query condition, query the remaining data that meets the query condition from the untraversed data of the first file. In this way, when the filter condition does not deliver the expected filtering effect, the worker node stops applying it in time, which avoids as far as possible the extra overhead that applying the filter condition would cause.
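The runtime check described above can be sketched in Python as follows. The sketch assumes an in-memory scan; `min_traversed` (a warm-up count before the ratio is trusted) is an added assumption, and all names are hypothetical.

```python
def scan_first_file(rows, key, filter_ids, matches_query,
                    threshold, min_traversed=100):
    """Scan the first file, dropping the filter if it stops paying off.

    rows: iterable of dicts standing in for the first file.
    filter_ids: set of identifiers that forms the filter condition.
    matches_query: callable that applies the query condition to one row.
    threshold: filter threshold on (rows found) / (rows traversed).
    Returns (matching rows, whether the filter was still active at the end).
    """
    results = []
    traversed = 0
    filter_active = True
    for row in rows:
        traversed += 1
        # While the filter is active, rows outside it are pruned cheaply.
        if filter_active and row[key] not in filter_ids:
            continue
        if matches_query(row):
            results.append(row)
        # If most traversed rows pass, the filter is not filtering much:
        # stop applying it and fall back to the query condition alone.
        if (filter_active and traversed >= min_traversed
                and len(results) / traversed > threshold):
            filter_active = False
    return results, filter_active
```

The result set is the same with or without the filter; only the work per row changes, which is the point of the fallback.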
According to a second aspect, based on the same inventive concept as the method embodiments of the first aspect, an embodiment of the present application provides a data query apparatus. The apparatus has functions corresponding to the implementations of the first aspect. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions.
According to a third aspect, an embodiment of the present application provides a device, including a processor and a memory. The memory is configured to store instructions; when the device runs, the processor executes the instructions stored in the memory, so that the device performs the data query method of the first aspect or of any implementation of the first aspect. The memory may be integrated in the processor or may be separate from the processor. The device may further include a bus, through which the processor is connected to the memory. The memory may include a readable memory and a random access memory.
According to a fourth aspect, an embodiment of the present application further provides a readable storage medium that stores a program or instructions which, when run on a computer, cause the data query method of the first aspect or of any implementation of the first aspect to be performed.
According to a fifth aspect, an embodiment of the present application further provides a computer program product containing instructions which, when run on a computer, cause the computer to perform any data query method of the first aspect or of any implementation of the first aspect.
In addition, for the technical effects of any implementation of the second to sixth aspects, refer to the technical effects of the different implementations of the first aspect, or to the technical effects of the different implementations of the second aspect; details are not repeated here.
Description of Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some of the embodiments recorded in the present application; a person of ordinary skill in the art can derive other drawings from them.
FIG. 1 is a schematic structural diagram of an exemplary data processing system provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the people table and the orders table included in the data source 103;
FIG. 3 is a schematic structural diagram of another exemplary data processing system provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a data query method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a query statement input interface provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a data query apparatus provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the hardware structure of a device provided by an embodiment of the present application.
Detailed Description
FIG. 1 is a schematic structural diagram of an exemplary data processing system. As shown in FIG. 1, the data processing system 100 may include a coordinator node 101 and a worker node 102. The worker node 102 can access data in a data source 103, and the data source 103 may include one or more data sources, such as the Hive and Oracle data sources shown in FIG. 1.
The data processing system 100 may expose a client 104 for human-computer interaction with the user, so that the corresponding data query process is executed based on the query statement the user enters. The coordinator node 101 may receive, through the client 104, the SQL statement entered by the user and perform syntax analysis and semantic analysis on it. In syntax analysis, the coordinator node 101 uses the grammar rules of the SQL language to check whether the SQL statement contains syntax errors; in semantic analysis, the coordinator node 101 checks whether the semantics of the SQL statement are valid. When both the syntax and the semantics of the SQL statement are valid, the coordinator node 101 may generate a logical plan tree from the SQL statement. The logical plan tree describes a logical execution plan for operations such as computing, analyzing, and accessing data. The coordinator node 101 may then optimize the plan tree with one or more optimizers and send the optimized logical plan tree to the worker node 102 for execution; specifically, a scheduler may determine which executor in the worker node 102 receives the logical plan tree. The worker node 102 may include one or more executors (workers; FIG. 1 shows executor 1 and executor 2 as an example). An executor executes the corresponding plan according to the received logical plan tree and returns the query result obtained from executing the plan to the client 104 through the coordinator node 101, so that the client 104 can present the query result to the user.
In the era of big data, the data source 103 usually holds a very large amount of data; the number of rows may reach tens of millions or even hundreds of millions, which can produce a large table A, such as a fact table. If the worker node 102 traverses table A directly, the amount of data to traverse lowers the query efficiency of the data processing system 100. For this reason, table A is currently usually joined with another, smaller table B (such as a dimension table). Table A and table B share at least one column, and the two tables have at least some identical values in that shared column. The worker node 102 can first query table B for the data that meets the query condition, and that data can serve as a filter condition for obtaining the data in table A that meets the query condition (that is, the data the user needs). The filter condition may be, for example, some of the values of the column shared by table A and table B.
For example, assume that the data source 103 contains a large people table and a small orders table, as shown in FIG. 2. Both tables have the same id (identity) column and share some of its values; as shown in FIG. 2, the id columns of the two tables share the values 120001 to 149999. Assume that the user enters the following SQL statement on the client 104: select people.name from people join orders on people.id=orders.id where orders.sum>10, that is, a query for the data in the data source 103 whose orders.sum is greater than 10. To reduce the amount of data read from the people table, the worker node 102 may first use the query conditions "people.id=orders.id" and "orders.sum>10" in the query statement to collect the id values of the rows in the orders table that meet the query condition, namely the id values 120001, 120002, …, 149999 in the orders table of FIG. 2, and use these id values as the filter condition. The worker node 102 can then read data from the people table according to the filter condition, that is, read the rows whose id values are 120001, 120002, …, 149999 without reading the rows whose id values are 150000 and 150001. In this way, the worker node 102 does not need to read all the data in the people table during the query. The generated filter condition lets it read only part of the people table, which reduces the amount of data read from the people table and therefore the amount of data from the two tables that takes part in the Join operation.
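The people/orders example can be sketched in Python with small in-memory tables. The rows and names below are made-up illustrative data rather than the contents of FIG. 2, the helper names are hypothetical, and the sketch assumes at most one matching order per id so that the filtered read gives the same names as the join.

```python
def build_filter(orders, min_sum):
    """Collect the id values of orders rows that meet orders.sum > min_sum."""
    return {order["id"] for order in orders if order["sum"] > min_sum}


def read_people(people, id_filter):
    """Read only the people rows whose id is in the filter condition."""
    return [person for person in people if person["id"] in id_filter]


# Illustrative data only; the real tables are far larger.
people = [
    {"id": 120001, "name": "Ann"},
    {"id": 120002, "name": "Bob"},
    {"id": 150000, "name": "Cid"},
]
orders = [
    {"id": 120001, "sum": 25},
    {"id": 120002, "sum": 5},
]

id_filter = build_filter(orders, min_sum=10)   # filter built from the small table
names = [p["name"] for p in read_people(people, id_filter)]
```

Here the row with id 150000 is never handed to the join, which is the saving the filter condition is meant to provide.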
When the worker node 102 in the data processing system includes multiple executors, the Join operation can be split across them. In FIG. 1, the worker node 102 includes executor 1 and executor 2, and the Join operation is assigned to Join node_1 and Join node_2, where a Join node is a logical node that performs the join of table A and table B. Each executor may send the filter condition it generates from its own table B to the coordinator node 101. In FIG. 1, executor 1 sends the partial filter it generates from table B_1 to the coordinator node 101, and executor 2 likewise sends the partial filter it generates from table B_2. The coordinator node 101 merges the filter conditions sent by the executors and then delivers the merged filter condition to executor 1 and executor 2, which perform the Join operation, so that each executor reads the corresponding data from its own table A according to the merged filter condition.
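Merging partial filters at the coordinator node can be sketched as a set union, assuming each partial filter is a set of shared-column values. The function name and the choice of a frozen set are assumptions of this sketch; a real system might use a more compact structure.

```python
def merge_partial_filters(partial_filters):
    """Union the partial filters received from the executors.

    partial_filters: iterable of sets of shared-column values, one per
    executor (for example, built from table B_1 and table B_2).
    Returns an immutable set that can be sent back to every executor.
    """
    merged = set()
    for partial in partial_filters:
        merged |= partial
    return frozenset(merged)
```

The union is needed because a row in one executor's table A may match a value that only another executor saw in its part of table B.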
In practice, however, the filter condition that the worker node 102 generates from table B may not filter data effectively. The filter condition (or the merged filter condition) may cover a large amount of data, so the amount of data read from table A according to the filter condition differs little from the total amount of data in table A. Compared with having the worker node 102 read all the data in table A directly, the data processing system 100 then saves little by reading less data. At the same time, generating the filter condition and applying it to the data in table A costs the worker node 102 additional overhead (such as computing resources), which may exceed the overhead that the data processing system 100 saves. In particular, when the data processing system 100 uses multiple worker nodes 102 to query data, each worker node 102 also sends its filter condition to the coordinator node 101, and the coordinator node 101 merges the filter conditions and delivers the merged filter condition to the worker nodes 102. This occupies still more resources of the data processing system 100, including network transmission, computing, and storage resources. When the filter condition filters poorly, using it therefore does not reduce the overall overhead of the data processing system 100 compared with reading the whole table directly. Instead, the additional network, computing, and storage resources that the filter condition consumes increase the overall overhead, and generating, transmitting, merging, and applying the filter condition also lowers query efficiency.
On this basis, an embodiment of the present application provides a data query method that keeps the query efficiency of the data processing system 100 high while keeping the query overhead low. Specifically, the worker node 102 may estimate in advance, from the query condition and the data information corresponding to each of the two tables in the Join operation, the proportion of the larger table A that consists of data meeting the query condition. A large proportion indicates that the generated filter condition probably cannot filter effectively, that is, the amount of data differs little before and after filtering. In that case, the worker node 102 may skip generating the filter condition and, according to the query condition, directly query table A and table B for the data the user needs (hereinafter referred to as the data to be queried) and perform the query operation. A small proportion indicates that the filter condition can remove a large amount of irrelevant data from table A (that is, data that is not to be queried and need not be read into the worker node 102). In that case, the worker node 102 may generate the filter condition from the query condition and table B, and use it to query table A for the data to be queried. Thus, when the filter condition filters well, the worker node 102 uses it to reduce the amount of data read from table A, which improves query efficiency and lowers query overhead. When the filter condition filters poorly, the data processing system 100 avoids, by not generating the filter condition, the lower query efficiency and the increased overhead that generating it would cause. In this way, the data processing system 100 keeps its query efficiency high and its query overhead low.
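The decision itself reduces to comparing the estimated proportion with a threshold, as in the minimal Python sketch below. The default threshold of 0.5 and the treatment of the boundary case (a proportion equal to the threshold does not generate a filter) are assumptions of this sketch; the text above only distinguishes "large" from "small" proportions.

```python
def should_generate_filter(estimated_ratio, threshold=0.5):
    """Decide whether a filter condition is worth generating.

    estimated_ratio: estimated share of table A that meets the query
        condition, between 0.0 and 1.0.
    threshold: hypothetical cut-off; smaller ratios mean the filter would
        remove more of table A and is therefore worth its overhead.
    """
    if not 0.0 <= estimated_ratio <= 1.0:
        raise ValueError("estimated_ratio must be between 0.0 and 1.0")
    return estimated_ratio < threshold
```

For instance, an estimate of 0.1 would lead the worker node to build the filter, while an estimate of 0.9 would lead it to read table A directly.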
For example, as shown in FIG. 3, the Join operation is again split across executor 1 and executor 2. Each executor contains a determination module (determination module 1 and determination module 2 in FIG. 3) that decides whether to generate a filter condition. Based on the query condition and the data information corresponding to each of the two tables to be joined, the determination module in each executor estimates whether the filter condition to be generated would filter data effectively, that is, whether the above data proportion is large. If the estimated filtering effect is good, the determination module instructs its executor to generate the filter condition; if the estimated filtering effect is poor, the determination module instructs its executor not to generate it.
Accordingly, if both executor 1 and executor 2 generate filter conditions, each can send its (partial) filter condition to the coordinator node 101, which merges the conditions and delivers the result. If only executor 1 generates a filter condition, executor 1 can use it directly to read data from table A_1, while executor 2 reads all the data in table A_2 and then finds the data in table A_2 that meets the query condition. If neither executor generates a filter condition, each reads its own table A and finds the data that meets the query condition in the table it has read.
It should be noted that the system architectures shown in FIG. 1 and FIG. 3 are only examples and do not limit the specific implementation. For example, in other possible architectures, the data processing system may not include the client 104; the number of executors in the worker node is not limited to two; or the data processing system may integrate the data source in addition to the coordinator node and the worker node. In practice, components may be added to or removed from the architectures shown in FIG. 1 and FIG. 3 as appropriate, which is not limited in this embodiment.
To make the above objects, features, and advantages of the present application clearer, various non-limiting implementations in the embodiments of the present application are described below by way of example with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
FIG. 4 is a schematic flowchart of a data query method in an embodiment of the present application. The method can be applied to the data processing system 100 shown in FIG. 1 or FIG. 3 and can be performed by the data processing system 100 or by the corresponding nodes in it. This embodiment is described using the example in which the coordinator node 101 and the worker node 102 in the data processing system 100 perform the method. The method may specifically include the following steps:
S401: The coordinator node 101 receives a data query statement. The data query statement includes a query condition for the data to be queried and a query operation on a first file and a second file, where the data volume of the first file is greater than that of the second file.
In practice, the user may provide a data query statement to the data processing system 100 so that the data processing system 100 can locate the required data based on that statement. For ease of description, the data the user needs is referred to in this embodiment as the data to be queried.
As an example of receiving the data query statement, the data processing system 100 includes the client 104, which can present the user with the query statement input interface shown in FIG. 5. The user enters the query statement in a designated area of the interface so that the data processing system 100 returns the expected query result. The client 104 then sends the data query statement to the coordinator node 101, for example in a data query request, so that the coordinator node 101 can query the data to be queried. The data query statement includes a query condition and a query operation on two files (such as two tables), for example a Join operation, and the query condition is used to locate the data to be queried. For example, the data query statement entered by the user may be the SQL statement: select people.name from people join orders on people.id=orders.id where orders.sum>10, which queries the rows of the people table shown in FIG. 2 that have the corresponding id values. In this SQL statement, "people.id=orders.id" and "orders.sum>10" are the query conditions. They locate the data to be queried, namely the rows of the people table whose id values are 120001, 120002, …, 149999.
S402: The coordinator node 101 performs syntax analysis and semantic analysis on the data query statement to determine whether the data query statement entered by the user is valid.
In practice, if the data query statement entered by the user is invalid, the data processing system 100 may terminate the data query task and prompt the user to enter a data query statement with correct syntax and semantics. If the data query statement is valid, the coordinator node 101 continues with the subsequent data query steps.
S403: After determining that the data query statement is valid, the coordinator node 101 sends a task of performing the query operation on the first file and the second file to the worker node. The task includes the query condition for the first file and the second file.
As an example, the task that the coordinator node 101 sends to the worker node may be a logical plan tree, which includes the query condition for the data to be queried and the identifiers (such as file names) of the two files involved in the query operation. In a specific implementation, the coordinator node may generate the logical plan tree from the data query statement and send it to the worker node 102, thereby scheduling the worker node 102 to execute the tasks in the logical plan tree.
S404: Based on the query condition in the received task and the data information corresponding to each of the first file and the second file, the worker node 102 estimates the proportion of the first file that consists of data meeting the query condition.
The data source 103 may include the first file and the second file, and the worker node 102 can obtain the data that meets the query condition by accessing the data in the two files. In practice, the first file and the second file may be formed from different data in the data source 103, and each may record data in the form of a table, such as the people table and the orders table shown in FIG. 2.
In some implementations, the data in the first file and the data in the second file may be related. For example, the two files may contain the same column, and the values of that column in the first file may be at least partly identical to its values in the second file. In FIG. 2, the id columns of the people table and the orders table share some id values, such as 120001, 120002, …, 149999.
In this embodiment, the first file and the second file may differ in data volume; for example, the data volume of the first file may be greater than that of the second file. In practice, the first file may be a fact table and the second file may be a dimension table. The data recorded in a fact table is usually rich and may cover multiple dimensions. For example, a fact table used for work records may contain the work date, employee, working hours, overtime hours, nature of work, work content, person in charge, and so on; that is, it may contain information in three dimensions: a time dimension (work date, working hours, overtime hours), a personnel dimension (employee, person in charge), and job attributes (nature of work, work content). Because a fact table records a large amount of data, its data volume is usually large. A dimension table may be used to record the data of only some dimensions. For example, a time dimension table may record only time-dimension data, such as the work date, working hours, and overtime hours above, while data such as the employee, nature of work, work content, and person in charge may not be recorded in the time dimension table but may be recorded in other dimension tables. A dimension table can be regarded as a window for analyzing data; it contains the data characteristics of the fact table in some of its dimensions. The data volume of a dimension table is usually smaller than that of the fact table.
In this embodiment, the first file may contain a large amount of data, while the data that meets the query condition (that is, the data to be queried) may be only a small part of it. If the worker node 102 read the first file directly from the data source 103 and searched it for the data to be queried, the worker node 102 would have to read a large amount of useless data other than the data to be queried. Therefore, in this embodiment the worker node 102 may use a filter condition generated from the second file to reduce the amount of data read from the first file. Because a generated filter condition does not necessarily filter data well, the worker node 102 may parse the query condition and the two files participating in the Join operation out of the logical execution plan tree and obtain the data information corresponding to the first file and the second file respectively. The worker node 102 can then estimate, from the query condition and the data information of the two files, the proportion of the first file made up by the data to be queried, and use that proportion to gauge whether the filter condition would filter data effectively.
Specifically, a large proportion indicates that most of the data in the first file is data to be queried. In that case, even if the filter condition is applied to the first file, the amount of data the worker node 102 reads is not much smaller than the whole first file, which means the filter condition filters poorly. Conversely, a small proportion indicates that only a small part of the first file is data to be queried. In that case, the filter condition can effectively filter out most of the data in the first file, and the worker node 102 needs to read only a small part of the first file to obtain the data to be queried, which means the filter condition filters well.
In one example of estimating the proportion, the data information corresponding to the first file and the second file may be the data statistics of the first file and the second file respectively. For example, the data statistics of the first file (or the second file) may be any one or more of its data range, data distribution interval, and duplicate data. The data range is the value range of the data in the first file (or the second file), and may specifically be the value range of each column, that is, from the minimum value to the maximum value of each column. The data distribution interval describes how the data in the first file (or the second file) is distributed, and may be represented, for example, by a histogram. Duplicate data indicates where the first file (or the second file) contains identical data, and may be expressed, for example, as the repetition rate or the number of repetitions of each piece of data.
The worker node 102 may determine, from the data statistics of the second file, the identifiers of the data in the second file that satisfies the query condition, and then calculate, from the data statistics of the first file, the proportion of the first file made up by data having those identifiers. For example, if the data statistics are data ranges, the worker node 102 may use the data range of the second file to find the data range in the second file that satisfies the query condition, and determine the identifiers corresponding to the data within that range. The worker node 102 then uses the determined identifiers to find the data in the first file having those identifiers and calculates the proportion of the first file that this data accounts for. For example, if the data range of the first file is 1 to 10000 and the data in the first file having those identifiers is 1 to 1000, the proportion may be 1/10 (that is, 1000/10000). The data processing system 100 may also calculate the proportion from the data distribution interval, the duplicate data, or any combination of the three kinds of information; this embodiment does not limit this.
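The range-based estimate can be illustrated with a short sketch. It assumes integer identifiers spread uniformly over the first file's range, which is an assumption of this example and not something the source requires; the function name `estimate_proportion` and the `Range` alias are hypothetical. It uses a first-file range of 1 to 10000 and a matched range of 1 to 1000, which yields the 1000/10000 ratio.

```python
from typing import Tuple

Range = Tuple[int, int]  # inclusive (min, max) of an identifier column


def estimate_proportion(first_range: Range, matched_range: Range) -> float:
    """Estimate the share of the first file whose identifier lies in matched_range.

    Assumes integer identifiers distributed uniformly over first_range
    (an assumption of this sketch, not stated in the source).
    """
    first_lo, first_hi = first_range
    lo = max(first_lo, matched_range[0])
    hi = min(first_hi, matched_range[1])
    if hi < lo:
        return 0.0  # the two ranges do not overlap
    return (hi - lo + 1) / (first_hi - first_lo + 1)


# Identifiers 1..1000 satisfy the query condition; the first file spans 1..10000.
ratio = estimate_proportion((1, 10000), (1, 1000))
```

With a histogram or duplicate-count statistics, the uniform assumption could be replaced by per-bucket counts, at the cost of keeping more statistics per file.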
The data statistics of the first file (or the second file) may be collected in advance by the worker node 102 and the coordinating node 101 and stored locally. For example, after the data processing system 100 is connected to the data source 103, the worker node 102 may access the files in the data source 103 one by one and send each file it reads to the coordinating node 101, so that the coordinating node 101 can generate and store the data statistics for each file. When data is queried later, the worker node 102 can obtain the data statistics of the file to be accessed locally (or from the coordinating node 101) and use them to decide whether to generate a filter condition during the query. In practice, the worker node 102 may also obtain the data statistics of the first file and the second file in other ways. For example, when the first file and the second file are stored in the data source 103, a device in the data source 103 may generate the data statistics for them in advance, so that the worker node 102 can obtain those statistics from the data source 103. This embodiment does not limit how the worker node obtains the data statistics.
In another example of estimating the proportion, the worker node 102 may determine the proportion by data sampling. In this case, the data information corresponding to the first file and the second file may be sampled data from the first file and the second file. In a specific implementation, the worker node 102 may sample the data in the first file and the data in the second file separately, for example by random sampling, equal-interval sampling, or proportional sampling, and read the sampled data of the first file and the sampled data of the second file from the data source 103. The worker node 102 may then determine the identifiers of the target sampled data in the second file's sample that satisfies the query condition, and count the sampled data in the first file's sample having those identifiers, thereby calculating the proportion of the first file's sample made up by sampled data having those identifiers. In this way, the proportion of the first file made up by the data to be queried can be predicted from the sampled data of the two files.
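The sampling approach can be sketched as follows. This is a minimal illustration that assumes in-memory rows represented as dictionaries and uses random sampling; the function name `estimate_by_sampling`, the `flag` column, and the parameter names are hypothetical and do not come from the source.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, int]


def estimate_by_sampling(
    first_rows: List[Row],
    second_rows: List[Row],
    predicate: Callable[[Row], bool],
    key: str = "id",
    sample_size: int = 100,
    seed: int = 0,
) -> float:
    """Estimate the proportion of the first file that matches, using samples only."""
    rng = random.Random(seed)
    # Random sampling of each file (equal-interval sampling would also fit).
    first_sample = rng.sample(first_rows, min(sample_size, len(first_rows)))
    second_sample = rng.sample(second_rows, min(sample_size, len(second_rows)))
    if not first_sample:
        return 0.0
    # Identifiers of sampled second-file rows that satisfy the query condition.
    matched_ids = {row[key] for row in second_sample if predicate(row)}
    # Share of the first file's sample that carries one of those identifiers.
    hits = sum(1 for row in first_sample if row[key] in matched_ids)
    return hits / len(first_sample)


first = [{"id": i} for i in range(1, 101)]
second = [{"id": i, "flag": i % 2} for i in range(1, 21)]
# sample_size exceeds both file sizes here, so every row is sampled.
estimate = estimate_by_sampling(first, second, lambda r: r["flag"] == 1, sample_size=1000)
```

Because a sample of the second file may miss some matching identifiers, an estimate obtained this way can be lower than the true proportion when the sample is small.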
It should be noted that the two ways of estimating the proportion described above are only examples. In practice, the proportion may also be estimated in other possible ways, which this embodiment does not limit.
S405: When the proportion is greater than a preset threshold, the worker node 102 reads the first file and the second file into the worker node 102 and performs the query operation on the first file and the second file.
S406: When the proportion is less than the preset threshold, the worker node 102 generates a filter condition from the query condition and the second file, uses the filter condition to query the first file for data that meets the query condition, and performs the corresponding query operation.
In this embodiment, the proportion calculated in step S404 reflects how well the filter condition would filter data. When the proportion is greater than the preset threshold, the filter condition to be generated would filter out little of the data in the first file, and generating it could reduce the query efficiency of the data processing system 100. In that case, the worker node 102 may skip generating a filter condition from the second file, read the first file and the second file from the data source 103, query them for data that satisfies the query condition, and perform the corresponding operation, such as a Join operation on the queried data. This avoids a situation in which the overall overhead of generating and applying the filter condition (possibly including transmitting and merging filter conditions) exceeds the overall overhead of the worker node 102 reading the first file directly.
When the proportion is less than the preset threshold, the filter condition to be generated can effectively filter out much of the data in the first file. Accordingly, the overall overhead of generating and applying the filter condition (possibly including transmitting and merging filter conditions) is less than the overall overhead of the worker node 102 reading the first file directly. In this case, the worker node 102 may read the second file and query it for first data that satisfies the query condition. The worker node 102 may then generate a filter condition from the first data and send the filter condition to the data source 103, instructing the data source 103 to query the first file for second data that matches the first data and return it to the worker node 102. The first data and the second data may have the same identifier; for example, the first data from the second file and the second data from the first file have the same column value (such as the id value in the people table and the orders table in the earlier example). The worker node 102 can then perform the query operation on the received second data and the first data queried from the second file, for example a Join operation on the first data and the second data, to obtain the data the user needs. Because the amount of data the worker node 102 reads from the data source 103 is small compared with reading the first file directly, the data query efficiency of the worker node 102 can be kept at a high level. For the specific process by which the worker node 102 generates the dynamic query condition from the second file and the query condition, refer to the related description above; it is not repeated here.
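The choice between steps S405 and S406 can be sketched as a single function. This is an illustrative sketch only: it models the filter condition as a set of identifiers, simulates the data source with an in-memory list, assumes the identifier is unique within the second file, and treats the orders table as the larger first file purely for illustration. The function name `query_with_optional_filter` and the `age` and `amount` columns are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


def query_with_optional_filter(
    first_rows: List[Row],
    second_rows: List[Row],
    predicate: Callable[[Row], bool],
    proportion: float,
    preset_threshold: float,
    key: str = "id",
) -> Tuple[List[Row], int]:
    """Join the two files; return (joined rows, number of first-file rows read)."""
    # "First data": rows of the second file that satisfy the query condition.
    # Assumes the key is unique within the second file.
    first_data = {row[key]: row for row in second_rows if predicate(row)}

    if proportion > preset_threshold:
        # S405: a filter would remove little, so read the whole first file.
        fetched = list(first_rows)
    else:
        # S406: the filter condition is modeled as the set of matching keys.
        # This comprehension stands in for the data source applying the filter.
        filter_keys = set(first_data)
        fetched = [row for row in first_rows if row[key] in filter_keys]

    # Join operation on the shared key.
    joined = [
        {**first_data[row[key]], **row}
        for row in fetched
        if row[key] in first_data
    ]
    return joined, len(fetched)


orders = [{"id": i, "amount": i * 10} for i in range(1, 101)]
people = [{"id": i, "age": 20 + i} for i in range(1, 11)]
older_than_25 = lambda row: row["age"] > 25

with_filter, read_small = query_with_optional_filter(orders, people, older_than_25, 0.05, 0.5)
no_filter, read_all = query_with_optional_filter(orders, people, older_than_25, 0.9, 0.5)
```

Both calls return the same joined rows; what differs is how many first-file rows have to be read, which is the cost the preset threshold is meant to trade off against the cost of building and shipping the filter.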
The preset threshold may be calculated by the worker node 102 from the data volume of the first file; that is, the worker node 102 may generate different preset thresholds when querying data in different files. Alternatively, the preset threshold may be a fixed value, such as an empirical value set in advance by a technician. This embodiment does not limit how the preset threshold is set.
It is worth noting that this embodiment is described using the example of a data source 103 that includes the first file and the second file. In practice, the data source 103 may include more files, such as a third file. When multiple files in the data source 103 need to participate in the Join operation, the Join operation may first be performed on the first file and the second file, then on the first file and the third file, and so on; this is not described further in this embodiment.
It is also worth noting that, in some possible implementations, once the maximum data volume of each file in the data source 103 has been determined, the proportion may also be characterized by the amount of data included in the filter condition. For example, in some scenarios, a filter condition that includes a large amount of data may indicate a large proportion, and a filter condition that includes a small amount of data may indicate a small proportion. Accordingly, instead of deciding whether to generate the filter condition from the size of the proportion, the decision may be made from the amount of data included in the filter condition. The implementation principle is similar to the process above and is not repeated here.
In a further possible implementation, while querying data, the worker node 102 may decide from the data already queried whether to keep using the filter condition for the remaining data. For example, when the sampling of the first file and the second file described above predicts that the filter condition will filter well, the filter condition may nevertheless filter poorly in actual use. In that case, the worker node 102 can promptly stop using the filter condition to filter the data in the first file.
As an example, after using the filter condition to query part of the data to be queried from the first file, the worker node 102 may calculate the proportion of the traversed data of the first file made up by the data already queried. The traversed data of the first file includes the data already queried and the data that the filter condition has so far filtered out of the first file. When this proportion is large, specifically when it is greater than a filtering threshold, the filter condition is filtering the first file poorly in actual use. Because filtering the first file with the filter condition incurs additional computing overhead, when the worker node 102 subsequently queries the remaining data to be queried (that is, the data other than the data already queried), it may read the untraversed data of the first file directly into the worker node 102 and continue querying that untraversed data for the remaining data according to the query condition, without using the filter condition any further. When the proportion is small, specifically when it is less than the filtering threshold, the filter condition is filtering the first file well in actual use, so the worker node 102 may continue to use the filter condition to query the untraversed data of the first file for the remaining data.
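The runtime check can be sketched as a batched scan that stops using the filter once it proves ineffective. This is a simplified illustration: the first file is an in-memory list, the filter condition is a set of identifiers, and the check runs once per batch. The function name `adaptive_scan` and the `batch_size` parameter are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, int]


def adaptive_scan(
    first_rows: List[Row],
    filter_keys: Set[int],
    filtering_threshold: float,
    batch_size: int = 100,
    key: str = "id",
) -> Tuple[List[Row], int, bool]:
    """Scan the first file; return (matches, rows read by the worker, filter still on)."""
    results: List[Row] = []
    traversed = 0   # rows examined while the filter was active
    matched = 0     # rows that passed the filter
    rows_read = 0   # rows actually transferred to the worker
    use_filter = True

    for start in range(0, len(first_rows), batch_size):
        batch = first_rows[start:start + batch_size]
        hits = [row for row in batch if row[key] in filter_keys]
        if use_filter:
            # Only rows that pass the filter reach the worker.
            rows_read += len(hits)
            traversed += len(batch)
            matched += len(hits)
            if matched / traversed > filtering_threshold:
                use_filter = False  # the filter removes too little; stop using it
        else:
            # The worker reads untraversed data directly and applies
            # the query condition itself.
            rows_read += len(batch)
        results.extend(hits)
    return results, rows_read, use_filter


rows = [{"id": i} for i in range(1, 301)]
dense = adaptive_scan(rows, set(range(1, 251)), filtering_threshold=0.5)
sparse = adaptive_scan(rows, {5, 150}, filtering_threshold=0.5)
```

In the dense case the filter is dropped after the first batch and the remaining rows are read directly; in the sparse case the filter stays on and only the two matching rows are read.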
In other implementations, the proportion of the traversed data of the first file made up by the data already queried may also be determined by data sampling. For example, when the amount of data already queried is large, the traversed data of the first file and the data already queried may each be sampled, and the ratio of the sampled queried data to the sampled traversed data may be used as the proportion. This embodiment does not limit how the proportion is obtained.
In practice, a separate functional module may be configured in the executor, such as the determination module in FIG. 3 above. During a query, this module may determine the proportion by real-time monitoring or data sampling, and decide from the proportion whether to keep using the filter condition to filter the data in the first file for the rest of the query. The functional module may be implemented in software or hardware, which this embodiment does not limit.
The filtering threshold may be set in a way similar to the preset threshold; for details, refer to the earlier description of setting the preset threshold, which is not repeated here. The value of the filtering threshold may be the same as or different from the preset threshold, which this embodiment does not limit.
The data query method provided by the present application is described in detail above with reference to FIG. 1 to FIG. 5. The apparatus and device provided by the present application are described below with reference to FIG. 6 and FIG. 7.
Based on the same inventive concept as the above method, an embodiment of the present application further provides a data query apparatus that can implement the functions of the data processing system in the embodiment shown in FIG. 4, where the data processing system includes a coordinating node and a worker node. Referring to FIG. 6, the data query apparatus 600 may include:
a first communication module 601, configured to send a task of performing a query operation on a first file and a second file to the worker node, where the task includes a query condition for the first file and the second file, and the data volume of the first file is greater than the data volume of the second file;
an estimation module 602, configured to estimate, based on the query condition of the received task and the data information corresponding to the first file and the second file respectively, the proportion of the first file made up by data that meets the query condition; and
a reading module 603, configured to, when the proportion is greater than a preset threshold, read the first file and the second file into the worker node and perform the query operation on the first file and the second file.
In a possible implementation, the reading module 603 is further configured to read the second file when the proportion is less than or equal to the preset threshold;
the apparatus 600 further includes:
a query module 604, configured to query the second file for first data that satisfies the query condition;
a generation module 605, configured to generate a filter condition from the first data; and
a second communication module 606, configured to send the filter condition to the data source where the first file is located, where the filter condition instructs the data source to query the first file for second data that matches the first data, to receive the second data sent by the data source, and to perform the query operation on the first data and the second data.
In a possible implementation, the data information corresponding to the first file and the second file respectively includes the data statistics corresponding to the first file and the second file respectively;
the estimation module 602 is specifically configured to:
determine, from the data statistics corresponding to the second file, the identifiers of the data in the second file that meets the query condition; and
calculate, from the data statistics corresponding to the first file, the proportion of the first file made up by data having the identifiers.
In a possible implementation, the data information corresponding to the first file and the second file respectively includes sampled data from the first file and sampled data from the second file;
the estimation module 602 is specifically configured so that:
the worker node samples the data in the first file and the data in the second file separately, to obtain the sampled data of the first file and the sampled data of the second file;
the worker node determines the identifiers of the target sampled data in the sampled data of the second file that meets the query condition; and
the worker node calculates the proportion of the sampled data of the first file made up by sampled data having the identifiers.
In a possible implementation, the data statistics corresponding to the first file include any one or more of the data range, data distribution interval, and duplicate data of the first file.
In a possible implementation, when the proportion is less than the preset threshold, the reading module 603 is further configured to:
calculate, while the data to be queried is being queried from the first file, the proportion of the traversed data of the first file made up by the data already queried; and
when that proportion is greater than a filtering threshold, query the untraversed data of the first file for the remaining data to be queried according to the query condition.
The data query apparatus 600 in this embodiment corresponds to the data query method shown in FIG. 4. For the specific implementation of each functional module of the data query apparatus 600 and its technical effects, refer to the related description of the embodiment shown in FIG. 4; it is not repeated here.
In addition, an embodiment of the present application further provides a device. As shown in FIG. 7, the device 700 may include a communication interface 710 and a processor 720. Optionally, the device 700 may further include a memory 730, which may be located inside or outside the device 700. For example, each action in the embodiment shown in FIG. 4 may be implemented by the processor 720. The processor 720 may obtain the first file and the second file in the data source 103 through the communication interface 710 and implement any of the methods performed in FIG. 4. In an implementation, each step of the processing flow may be completed by an integrated logic circuit of hardware in the processor 720 or by instructions in the form of software. For brevity, details are not repeated here. The program code that the processor 720 executes to implement the above method may be stored in the memory 730. The memory 730 is connected to the processor 720, for example by a coupling.
Some features of the embodiments of the present application may be implemented or supported by the processor 720 executing program instructions or software code in the memory 730. The software components loaded in the memory 730 may be summarized functionally or logically, for example as the estimation module 602, the reading module 603, the query module 604, and the generation module 605 shown in FIG. 6. The functions of the first communication module 601 and the second communication module 606 may be implemented by the communication interface 710.
Any communication interface in the embodiments of the present application may be a circuit, a bus, a transceiver, or any other apparatus that can be used for information exchange, such as the communication interface 710 in the device 700. For example, the other apparatus may be a device connected to the device 700.
The processor involved in the embodiments of this application may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed with reference to the embodiments of this application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
Coupling in the embodiments of this application is an indirect coupling or communication connection between apparatuses or modules. It may be electrical, mechanical, or in another form, and is used for information exchange between the apparatuses or modules.
The processor may operate in cooperation with the memory. The memory may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM). The memory may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The embodiments of this application do not limit the specific connection medium among the communication interface, the processor, and the memory. For example, the memory, the processor, and the communication interface may be connected through a bus. The bus may be classified into an address bus, a data bus, a control bus, and the like.
Based on the foregoing embodiments, an embodiment of this application further provides a computer storage medium. The storage medium stores a software program that, when read and executed by one or more processors, can implement the method performed by the data processing system 100 provided in any one or more of the foregoing embodiments. The computer storage medium may include any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random-access memory, a magnetic disk, or an optical disc.
Based on the foregoing embodiments, an embodiment of this application further provides a chip. The chip includes a processor configured to implement the functions of the data processing system 100 involved in the foregoing embodiments, for example, the method performed in FIG. 4. Optionally, the chip further includes a memory configured to store the program instructions and data that the processor needs to execute. The chip may consist of a chip alone, or may include a chip and other discrete devices.
Persons skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus. The instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The terms "first", "second", and the like in the specification, the claims, and the foregoing drawings of this application are used to distinguish between similar objects and do not necessarily describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; they are merely a way of distinguishing objects with the same attributes when describing the embodiments of this application.
Clearly, persons skilled in the art can make various changes and modifications to the embodiments of this application without departing from their scope. If these modifications and variations fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to cover them.

Claims (10)

  1. A data query method, wherein the method is applied to a data processing system, the data processing system comprises a coordinating node and a worker node, and the method comprises:
    sending, by the coordinating node, a task of performing a query operation on a first file and a second file to the worker node, wherein the task comprises a query condition for the first file and the second file, and a data volume of the first file is greater than a data volume of the second file;
    estimating, by the worker node, based on the query condition of the received task and data information respectively corresponding to the first file and the second file, a proportion that data in the first file satisfying the query condition represents in the first file; and
    when the proportion is greater than a preset threshold, reading, by the worker node, the first file and the second file to the worker node, and performing the query operation on the first file and the second file.
  2. The method according to claim 1, wherein the method further comprises:
    when the proportion is less than or equal to the preset threshold, reading, by the worker node, the second file;
    querying, by the worker node, first data in the second file that satisfies the query condition;
    generating, by the worker node, a filter condition based on the first data;
    sending, by the worker node, the filter condition to a data source where the first file is located, wherein the filter condition is used to instruct the data source to query the first file for second data that matches the first data; and
    receiving, by the worker node, the second data sent by the data source, and performing the query operation on the first data and the second data.
  3. The method according to claim 1 or 2, wherein the data information respectively corresponding to the first file and the second file comprises sampled data in the first file and sampled data in the second file; and
    the estimating, by the worker node, based on the query condition of the received task and the data information respectively corresponding to the first file and the second file, a proportion that data in the first file satisfying the query condition represents in the first file comprises:
    sampling, by the worker node, the data in the first file and the data in the second file separately, to obtain the sampled data in the first file and the sampled data in the second file;
    determining, by the worker node, an identifier of target sampled data that satisfies the query condition in the sampled data of the second file; and
    calculating, by the worker node, a proportion that the sampled data having the identifier in the sampled data of the first file represents in the sampled data of the first file.
  4. The method according to claim 1 or 2, wherein the data information respectively corresponding to the first file and the second file comprises data statistics information respectively corresponding to the first file and the second file; and
    the estimating, by the worker node, based on the query condition of the received task and the data information respectively corresponding to the first file and the second file, a proportion that data in the first file satisfying the query condition represents in the first file comprises:
    determining, by the worker node, based on the data statistics information corresponding to the second file, an identifier of data in the second file that satisfies the query condition; and
    calculating, by the worker node, based on the data statistics information corresponding to the first file, a proportion that data having the identifier in the first file represents in the first file.
  5. The method according to claim 4, wherein the data statistics information corresponding to the first file comprises any one or more of a data range, a data distribution interval, and repeated data of the first file.
  6. A data query apparatus, wherein the apparatus is applied to a data processing system, the data processing system comprises a coordinating node and a worker node, and the apparatus comprises:
    a first communication module, configured to send a task of performing a query operation on a first file and a second file to the worker node, wherein the task comprises a query condition for the first file and the second file, and a data volume of the first file is greater than a data volume of the second file;
    an estimation module, configured to estimate, based on the query condition of the received task and data information respectively corresponding to the first file and the second file, a proportion that data in the first file satisfying the query condition represents in the first file; and
    a reading module, configured to: when the proportion is greater than a preset threshold, read the first file and the second file to the worker node, and perform the query operation on the first file and the second file.
  7. The apparatus according to claim 6, wherein the reading module is further configured to read the second file when the proportion is less than or equal to the preset threshold; and
    the apparatus further comprises:
    a query module, configured to query first data in the second file that satisfies the query condition;
    a generation module, configured to generate a filter condition based on the first data; and
    a second communication module, configured to: send the filter condition to a data source where the first file is located, wherein the filter condition is used to instruct the data source to query the first file for second data that matches the first data; receive the second data sent by the data source; and perform the query operation on the first data and the second data.
  8. The apparatus according to claim 1 or 2, wherein the data information respectively corresponding to the first file and the second file comprises data statistics information respectively corresponding to the first file and the second file; and
    the estimation module is specifically configured to:
    determine, based on the data statistics information corresponding to the second file, an identifier of data in the second file that satisfies the query condition; and
    calculate, based on the data statistics information corresponding to the first file, a proportion that data having the identifier in the first file represents in the first file.
  9. A device, wherein the device comprises a processor and a memory; and
    the processor is configured to execute instructions stored in the memory, so that the device performs the method according to any one of claims 1 to 5.
  10. A computer-readable storage medium, comprising instructions, wherein the instructions are used to implement the method according to any one of claims 1 to 5.
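
The decision flow recited in claims 1 to 3 can be illustrated with a minimal, single-process sketch. It is not the applicant's implementation and makes several assumptions for illustration only: files are in-memory lists of rows, the query operation is an equi-join on a key column, the query condition is a predicate on rows of the second (smaller) file, and the data source is simulated by a local function. All names (`estimate_ratio`, `data_source_scan`, `run_query`) are hypothetical.

```python
import random
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, int]


def estimate_ratio(first_file: List[Row], second_file: List[Row], key: str,
                   condition: Callable[[Row], bool],
                   sample_size: int, seed: int = 0) -> float:
    """Sampling-based estimate of the matching proportion in the first file."""
    rng = random.Random(seed)
    first_sample = rng.sample(first_file, min(sample_size, len(first_file)))
    second_sample = rng.sample(second_file, min(sample_size, len(second_file)))
    # Identifiers of sampled second-file rows that satisfy the query condition.
    target_ids = {row[key] for row in second_sample if condition(row)}
    if not first_sample:
        return 0.0
    hits = sum(1 for row in first_sample if row[key] in target_ids)
    return hits / len(first_sample)


def data_source_scan(first_file: List[Row], key: str,
                     filter_ids: Set[int]) -> List[Row]:
    """Stand-in for the data source applying a pushed-down filter condition."""
    return [row for row in first_file if row[key] in filter_ids]


def run_query(first_file: List[Row], second_file: List[Row], key: str,
              condition: Callable[[Row], bool], threshold: float,
              sample_size: int = 100) -> Tuple[str, List[Tuple[Row, Row]]]:
    ratio = estimate_ratio(first_file, second_file, key, condition, sample_size)
    # "First data": rows of the second file that satisfy the query condition.
    first_data = [row for row in second_file if condition(row)]
    ids = {row[key] for row in first_data}
    if ratio > threshold:
        # Most of the first file matches, so filtering at the source saves
        # little: transfer the whole first file and join on the worker.
        strategy, candidates = "read_both", first_file
    else:
        # Few rows match: send the filter to the data source and receive
        # only the matching "second data".
        strategy, candidates = "pushdown", data_source_scan(first_file, key, ids)
    by_key: Dict[int, List[Row]] = {}
    for row in first_data:
        by_key.setdefault(row[key], []).append(row)
    joined = [(s, b) for b in candidates for s in by_key.get(b[key], [])]
    return strategy, joined
```

Both branches return the same join result. The threshold only decides where the rows of the first file are filtered, and therefore how much data crosses the network between the data source and the worker node.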
PCT/CN2022/074138 2021-01-27 2022-01-27 Data query method and apparatus, and device and storage medium WO2022161417A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110111965.8A CN114817310A (en) 2021-01-27 2021-01-27 Data query method, device, equipment and storage medium
CN202110111965.8 2021-01-27

Publications (1)

Publication Number Publication Date
WO2022161417A1 true WO2022161417A1 (en) 2022-08-04

Family

ID=82524076

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074138 WO2022161417A1 (en) 2021-01-27 2022-01-27 Data query method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN114817310A (en)
WO (1) WO2022161417A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706819A (en) * 2009-12-14 2010-05-12 金蝶软件(中国)有限公司 Query method and system of database, client side, server and database
US20160378829A1 (en) * 2015-06-29 2016-12-29 Oracle International Corporation One-pass join size estimation with correlated sampling
CN108090224A (en) * 2018-01-05 2018-05-29 星环信息科技(上海)有限公司 A kind of cascade Connection method and apparatus
CN109241093A (en) * 2017-06-30 2019-01-18 华为技术有限公司 A kind of method of data query, relevant apparatus and Database Systems
US20200364226A1 (en) * 2019-05-16 2020-11-19 Alibaba Group Holding Limited Methods and devices for dynamic filter pushdown for massive parallel processing databases on cloud

Also Published As

Publication number Publication date
CN114817310A (en) 2022-07-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22745290

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22745290

Country of ref document: EP

Kind code of ref document: A1