WO2016177027A1 - Batch data query method and apparatus - Google Patents

Batch data query method and apparatus

Info

Publication number
WO2016177027A1
Authority
WO
WIPO (PCT)
Prior art keywords
operand
operator
query
identifier
logical
Prior art date
Application number
PCT/CN2016/074141
Other languages
English (en)
French (fr)
Inventor
李丰
张赟
王蕾
冯晓兵
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP16789021.9A (EP3282370A4)
Publication of WO2016177027A1
Priority to US15/804,346 (US10678789B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24534 Query rewriting; Transformation
    • G06F16/24542 Plan optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the embodiments of the present invention relate to the field of computer technologies, and in particular, to a method and an apparatus for querying bulk data.
  • representative big data query systems use a single query statement as the basic unit for parsing and optimization.
  • a key feature of big data query systems is query efficiency.
  • the traditional processing mode, which parses and optimizes with a single query statement as the basic unit, offers too few optimization opportunities.
  • the rich inter-query optimization opportunities presented in the data warehouse batch query scenario; inter-query optimization opportunities are the optimization opportunities that exist between multiple query statements.
  • the specific query is dynamically obtained.
  • by monitoring data records or executing part of a query, only the dynamic data dependencies related to a particular set of inputs can be collected. Optimizations based on these dynamic data dependencies fit only that specific set of inputs; once the input changes, analysis and optimization must be re-executed.
  • the embodiment of the invention provides a batch data query method and device, which improves the efficiency of optimization between queries and reduces the system overhead of optimization between queries.
  • a first aspect of the present invention provides a method for querying bulk data, including:
  • N is a positive integer not less than 2;
  • the symbol identifier includes a version number of the operand; operands that refer to the same data have the same version number, and operands that refer to different data have different version numbers.
  • the operators include at least: create operators, destroy operators, scan operators, and file setting operators;
  • the determining, by the operator and the operand in the N query statements, the symbol identifier of the operand of the N query statements includes:
  • a symbol identifier is added to the operand of the second-type operator, wherein the first logical query plan tree is any one of the N logical query plan trees, and the second-type operators are the operators other than the first-type operators.
  • the root node of the first logical query plan tree includes a file setting operator
  • the leaf node of the first logical query plan tree includes a file scan operator
  • the internal node of the first logical query plan tree includes a second-type operator, a create operator, and a destroy operator, wherein an internal node is any node other than the leaf nodes and the root node.
  • the adding rule includes: performing, for each of the second class operators in the first logical query plan tree, the following operations:
  • if the operand of the first operator is the same as the operand of the left child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the left child node, wherein the first operator is any one of the second-type operators
  • the optimization rule includes at least one of the following rules:
  • the symbol identifier of the operand in the N query statements further includes: a hot data identifier
  • the determining, by the operator and the operand in the N query statements, the symbol identifier of the operand of the N query statements further includes:
  • the hot data identifier is used to indicate that the data pointed to by the operand carrying the hot data identifier is hot data
  • the method further includes:
  • the symbol identifier of the operand in the N query statements further includes: a start active position and a termination active position, and the determining, according to the operators and operands in the N query statements, of the symbol identifier of the operand of the N query statements further includes:
  • the first operand is any one of the operands in the N query statements;
  • the method further includes:
  • the storage space of the data indicated by the first operand is released according to the termination active position of the first operand.
  • the determining, according to the operators and operands in the N query statements, of the symbol identifier of the operand of the N query statements further includes:
  • the method further includes:
  • the storage space of the data indicated by the second operand is released according to the termination active position of the second operand.
  • a second aspect of the present invention provides a data query server, including:
  • a receiving module configured to receive N query statements to be executed, where the N is a positive integer not less than 2;
  • an identifier determining module configured to determine a symbol identifier of an operand of the N query statements according to an operator and an operand in the N query statements, where the operator is used to indicate an operation to be performed, the operand is used to indicate a storage location of data to be operated on by the operator in the N query statements, the symbol identifier includes a version number of the operand, operands referring to the same data have the same version number, operands referring to different data have different version numbers, and the operators include at least: create operators, destroy operators, scan operators, and file setting operators;
  • a relationship determining module configured to determine, according to a version number of an operand of the N query statements determined by the identifier determining module, a dependency relationship between the N query statements;
  • An optimization module configured to perform inter-query optimization on the N query statements according to the dependency relationship between the N query statements and a preset optimization rule
  • the query module is configured to execute the optimized query statement to obtain the query result of the N query statements.
  • the identifier determining module is specifically configured to:
  • a symbol identifier is added to the operand of the second-type operator, wherein the first logical query plan tree is any one of the N logical query plan trees, and the second-type operators are the operators other than the first-type operators.
  • the root node of the first logical query plan tree includes a file setting operator, a leaf node of the first logical query plan tree includes a file scan operator, and an internal node of the first logical query plan tree includes a second-type operator, a create operator, and a destroy operator, wherein an internal node is any node other than the leaf nodes and the root node;
  • the adding rule includes: performing, for each of the second class operators in the first logical query plan tree, the following operations:
  • if the operand of the first operator is the same as the operand of the left child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the left child node, wherein the first operator is any one of the second-type operators
  • the optimization rule includes at least one of the following rules:
  • the symbol identifier of the operand in the N query statements further includes a hot data identifier, and the identifier determining module is further configured to:
  • the hot data identifier being used to indicate that the data pointed to by the operand carrying the hot data identifier is hot data
  • the query module is further configured to:
  • optimized query statements that include the hot data identifier and have neither a flow dependency nor an output dependency are executed concurrently.
  • the symbol identifier of the operand in the N query statements further includes: a start active position and a termination active position, and the identifier determining module is further configured to:
  • the first operand is any one of the operands in the N query statements;
  • the query module is further configured to:
  • the storage space of the data indicated by the first operand is released according to the termination active position of the first operand.
  • the identifier determining module is further configured to:
  • the query module is further configured to:
  • the storage space of the data indicated by the second operand is released according to the termination active position of the second operand.
  • the data query server determines the symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements, where each symbol identifier includes the version number of the operand. It then determines the dependency relationships between the N query statements according to the version numbers of their operands, and performs inter-query optimization on the N query statements according to those dependency relationships and the preset optimization rule. Because the symbol identifiers of the operands of the N query statements are fixed and do not change with the input query statement, this embodiment provides a technique for statically analyzing and maintaining the data flow relationships between queries. The technique is independent of the input data, does not need to execute any part of any query statement, and does not need to monitor data access and updates during query statement execution, which improves the efficiency of inter-query optimization and reduces its overhead.
  • FIG. 1 is a flowchart of a method for querying bulk data according to Embodiment 1 of the present invention
  • FIG. 2 is a flowchart of determining the symbol identifiers of the operands of N query statements according to Embodiment 2 of the present invention
  • FIG. 3 is a schematic structural diagram of a data query server according to Embodiment 3 of the present invention.
  • FIG. 4 is a schematic structural diagram of a data query server according to Embodiment 4 of the present invention.
  • the methods of the embodiments of the present invention are mainly applied in the scenario of batch data query.
  • Bulk data queries usually use a client/server model.
  • the database usually includes multiple data query servers, storage systems, and a large number of clients.
  • the storage system can include one or more storage devices.
  • multiple clients may send query statements to the data query server.
  • when the data query server determines that the number of received query statements has reached a preset number, it batch-queries those query statements; alternatively, the data query server batch-queries all query statements received within a preset time period.
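  • The two batching triggers described above (a preset number of statements, or a preset time period) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `QueryBatcher`, its parameters, and the choice to start the time window at the first pending statement are all assumptions.

```python
import time
from typing import Callable, List, Optional


class QueryBatcher:
    """Collects query statements and releases them as one batch (hypothetical helper).

    A batch is released when max_count statements are pending, or when
    max_wait_s seconds have passed since the first pending statement arrived.
    """

    def __init__(self, max_count: int, max_wait_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_count = max_count
        self.max_wait_s = max_wait_s
        self.clock = clock  # injectable so the time trigger can be exercised in tests
        self.pending: List[str] = []
        self.first_arrival: Optional[float] = None

    def add(self, statement: str) -> Optional[List[str]]:
        """Queue a statement; return a batch if the count threshold is reached."""
        if not self.pending:
            self.first_arrival = self.clock()
        self.pending.append(statement)
        if len(self.pending) >= self.max_count:
            return self._flush()
        return None

    def poll(self) -> Optional[List[str]]:
        """Return a batch if the time window has elapsed, otherwise None."""
        if self.pending and self.clock() - self.first_arrival >= self.max_wait_s:
            return self._flush()
        return None

    def _flush(self) -> List[str]:
        batch, self.pending, self.first_arrival = self.pending, [], None
        return batch
```

  • A caller would invoke `add` for each incoming statement and `poll` periodically; either call may hand back the list of statements to be analyzed together.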
  • a common scenario for bulk data queries is the data warehouse.
  • a data warehouse is a structured data environment for decision support systems and online analytical application data sources.
  • the data warehouse mainly studies and solves the problem of obtaining information from the database.
  • Data warehouses are subject-oriented, integrated, stable, and time-variant. They offer a large number of bulk data query opportunities and hold large amounts of data, which are usually stored in distributed storage systems.
  • FIG. 1 is a flowchart of a batch data query method according to Embodiment 1 of the present invention.
  • the method provided in this embodiment may be performed by a data query server. As shown in FIG. 1 , the method provided in this embodiment may include the following steps:
  • Step 101 Receive N query statements to be executed, where N is a positive integer greater than or equal to 2.
  • Step 102 Determine, according to operators and operands in the N query statements, a symbol identifier of an operand of the N query statements.
  • Each query includes multiple operators, each operator refers to one or more operands, where operators are used to indicate the operation to be performed.
  • Common operators include: Scan Operator, File Sink Operator (the file setting operator), Create Operator, Destroy Operator, Sort Operator, Select Operator, Aggregate Operator, Product Operator, and Join Operator (connection operator), etc.
  • the operand does not refer to a specific data record, but is used to indicate the storage location of the data to be operated by the operator, that is, the operand corresponds to the storage location.
  • the operand can be a variable or an expression.
  • the storage location indicated by the operand can be a data table, a partition in the data table, or a field in the data table.
  • the symbol identifier of the operand includes the version number of the operand, and the operands that refer to the same data have the same version number, and the operands that refer to different data have different version numbers.
  • two operators having an operand with the same name (such as a) does not mean that they operate on the same data; similarly, two operators having operands with different names (such as a and b) does not mean that they operate on different data. Therefore, when determining the symbol identifiers of the operands of the N query statements, whether two operands are the same cannot be decided from their names; it should be decided by whether the data they refer to is the same, where the data an operand refers to is the data stored in the storage location indicated by the operand.
  • the data query server determines the symbol identifiers of all operands of the N query statements according to the order of the N query statements and the order in which the operators within each query statement are executed.
  • the data query server first acquires N logical query plan trees corresponding to the N query statements, wherein each query statement corresponds to one logical query plan tree, and each node of each logical query plan tree is an operator.
  • a symbol identifier is added to the operands of the first-type operators in the N logical query plan trees, wherein the first-type operators include: create operators, destroy operators, scan operators, and file setting operators.
  • a symbol identifier is also generated for the operand of each first-type operator of the N logical query plan trees, and the generated symbol identifier includes the version number of the operand.
  • a symbol identifier is added to the operand of the second type operator according to the symbol identifier of the operand of the first type operator. Specifically, the following operations are performed for each query plan tree in the N logical query plan trees:
  • a symbol identifier is added to the operands of the second-type operators in the first logical query plan tree according to the topological order of the first logical query plan tree, the symbol identifiers of the operands of the first-type operators in the first logical query plan tree, and the preset adding rule.
  • the first logical query plan tree is any logical query plan tree in the N logical query plan trees, and the second type operator is other operators than the first type operator.
  • the nodes of the first logical query plan tree include a root node, leaf nodes, and internal nodes: the node at the top of the logical query plan tree (without a parent) is the root node, the nodes at the bottom of the logical query plan tree (without children) are leaf nodes, and an internal node is a node with both a parent and children.
  • the topological order of the first logical query plan tree refers to the order from the leaf nodes to the root node.
  • the root node of the first logical query plan tree includes a file setting operator
  • the leaf node of the first logical query plan tree includes a file scan operator
  • the internal node of the first logical query plan tree includes second-type operators, create operators, and destroy operators.
  • the internal node is a node other than the leaf node and the root node. If the first logical query plan tree is a binary tree, the internal node of the first logical query plan tree has a left child node and a right child node.
  • the above adding rule includes: for each second-type operator in the first logical query plan tree, performing the following operation: if the operand of the first operator is the same as the operand of the left child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the left child node; if the operand of the first operator is the same as the operand of the right child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the right child node.
  • the first operator is any one of the operators of the second type.
  • the left child node and the right child node of the first operator may each be a first-type operator or a second-type operator.
  • when the left child node and the right child node of the first operator are leaf nodes, they are first-type operators.
  • when the left child node and the right child node of the first operator are internal nodes, they are second-type operators.
  • the first logical query plan tree has 4 layers, the first layer includes nodes as root nodes, the second layer and the third layer include nodes as internal nodes, and the fourth layer includes nodes as leaf nodes.
  • symbol identifiers are first added to the operands of all internal nodes of the third layer
  • symbol identifiers are then added to the operands of all internal nodes of the second layer according to the operands of all internal nodes of the third layer
  • when an internal node of the second layer is a second-type operator, a symbol identifier can be added to its operand according to the symbol identifier of the operand of the second-type operator of the third layer.
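  • The leaf-to-root propagation described above can be sketched as a post-order traversal. This is a simplified illustration under stated assumptions: each node carries a single operand (real operators may reference several), the names `Node`, `FIRST_TYPE`, and `propagate` are hypothetical, and first-type operators are assumed to have received their version numbers beforehand.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Operators whose operands receive identifiers directly (assumed spelling).
FIRST_TYPE = {"create", "destroy", "scan", "file_sink"}


@dataclass
class Node:
    op: str                                  # operator name, e.g. "scan", "select", "join"
    operand: str                             # storage location the operator works on
    children: List["Node"] = field(default_factory=list)  # [left, right]
    symbol_id: Optional[int] = None          # version number; None until assigned


def propagate(node: Node) -> None:
    """Assign symbol identifiers to second-type operators, children before parents."""
    for child in node.children:
        propagate(child)
    if node.op in FIRST_TYPE:
        return  # first-type operators already carry their identifier
    for child in node.children:              # the left child is checked before the right
        if child.operand == node.operand and child.symbol_id is not None:
            node.symbol_id = child.symbol_id
            break
```

  • Visiting children before their parent mirrors the topological order from the leaf nodes to the root node, so a parent always sees identifiers that its children have already received.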
  • Step 103 Determine a dependency relationship between the N query statements according to the version number of the operand of the determined N query statements.
  • the dependencies between N statements can include: flow dependencies, output dependencies, operator overlap relationships, and operand overlap relationships.
  • the flow dependency means that the version number of the operand of the file setting operator of a previously executed query statement is the same as the version number of the operand of the scan operator of another query statement executed later.
  • the output dependency means that the operand of the file setting operator of a previously executed query statement is overwritten by the file setting operator of another query statement executed later; that is, the later-executed file setting operator rewrites the operand of the previously executed file setting operator.
  • the operator overlap relationship means that the number of operators of the two query statements is the same, and the operand overlap relationship means that the version numbers of all or part of the operands of the two query statements are the same.
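  • A minimal sketch of how version numbers could drive dependency detection is shown below. The summary structure `QuerySummary`, its `overwrites` field, and the function `find_dependencies` are assumptions made for illustration; the operator overlap relationship is left out because it depends on comparing operators rather than version numbers alone.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class QuerySummary:
    name: str
    scans: FrozenSet[int]                     # versions read by scan operators
    sinks: FrozenSet[int]                     # versions written by file setting operators
    overwrites: FrozenSet[int] = frozenset()  # earlier versions that this query's sinks replace


def find_dependencies(queries: List[QuerySummary]) -> List[Tuple[str, str, str]]:
    """Return (kind, earlier query, later query) triples, in input order."""
    deps = []
    for i, earlier in enumerate(queries):
        for later in queries[i + 1:]:
            if earlier.sinks & later.scans:        # later query reads what the earlier one wrote
                deps.append(("flow", earlier.name, later.name))
            if earlier.sinks & later.overwrites:   # later query rewrites the earlier output
                deps.append(("output", earlier.name, later.name))
            if earlier.scans & later.scans:        # both queries read the same data
                deps.append(("operand_overlap", earlier.name, later.name))
    return deps
```

  • Because the check uses only version numbers collected from the plan trees, no query has to run before the dependencies are known.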
  • Step 104 Perform inter-query optimization on the N query statements according to the dependency relationship between the N query statements and the preset optimization rule.
  • Inter-query optimization treats multiple logical query plan trees as a whole; it exploits optimization opportunities that exist between logical query plan trees rather than within a single logical query plan tree.
  • the optimization rule includes at least one of the following rules: (1) deleting a query statement that has the same operand version numbers and the same operators as the first query statement, wherein the first query statement is any one of the N query statements; (2) maintaining the query order between query statements with flow dependencies, and optimizing multiple query statements with flow dependencies into a new query statement; (3) merging query statements that have the same operator and overlapping operands.
  • the optimization rule (1) is aimed at inter-query optimization of query statements with an operator overlap relationship. A query statement having the same operand version numbers and the same operators as the first query statement is referred to as the second query statement. Because the first query statement and the second query statement have the same operand version numbers and the same operators, their query results are the same, and the second query statement can be deleted.
  • the first logical query plan tree corresponding to the first query statement and the second logical query plan tree corresponding to the second query statement have a common query subtree; that is, the first query subtree of the first logical query plan tree has the same tree structure as the second query subtree of the second logical query plan tree, and the version number of the operand of each operator of the first query subtree is the same as the version number of the operand of the corresponding operator of the second query subtree.
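  • Rule (1) can be illustrated with a small sketch that treats each query as a signature of (operator, operand version) pairs and keeps only the first query with each signature. Flattening a plan tree into an ordered tuple is an assumption made for brevity; the names `Signature` and `drop_duplicate_queries` are hypothetical.

```python
from typing import Dict, List, Tuple

# A query summarised as an ordered tuple of (operator, operand version) pairs,
# for example a pre-order listing of its logical query plan tree (assumed encoding).
Signature = Tuple[Tuple[str, int], ...]


def drop_duplicate_queries(
    queries: Dict[str, Signature],
) -> Tuple[List[str], Dict[str, str]]:
    """Keep the first query with each signature; map each dropped query to the kept one."""
    first_seen: Dict[Signature, str] = {}
    kept: List[str] = []
    reuse: Dict[str, str] = {}
    for name, signature in queries.items():   # dicts preserve input order
        if signature in first_seen:
            reuse[name] = first_seen[signature]  # result can be taken from the kept query
        else:
            first_seen[signature] = name
            kept.append(name)
    return kept, reuse
```

  • The `reuse` mapping records which surviving query supplies the result for each deleted one, so all N query results can still be returned.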
  • the optimization rule (2) is aimed at inter-query optimization of query statements with a flow dependency. The predecessor of the earlier-executed file setting operator can be connected directly to the successor of the later-executed scan operator, and the later-executed scan operator deleted. After the earlier part of the plan finishes, its output is then processed directly as the input that the scan operator would have read, without first writing the output of the file setting operator to the distributed storage system and then reading it back. This reduces the read and write overhead of the distributed storage system and improves query efficiency.
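  • The splice described for rule (2) can be sketched on linear plans. This is only an illustration: plans are modeled as flat lists in execution order, the function name `fuse` is hypothetical, and dropping the producer's file setting operator assumes that no other query needs the intermediate result, which the text above does not state.

```python
from typing import List, Optional, Tuple

Op = Tuple[str, int]   # (operator name, operand version number)


def fuse(producer: List[Op], consumer: List[Op]) -> Optional[List[Op]]:
    """Fuse two linear plans if the producer's sink feeds the consumer's scan.

    Plans are listed in execution order: the scan comes first, the file sink last.
    Returns None when there is no flow dependency between the two plans.
    """
    if not producer or not consumer:
        return None
    sink, scan = producer[-1], consumer[0]
    if sink[0] != "file_sink" or scan[0] != "scan" or sink[1] != scan[1]:
        return None
    # Connect the sink's predecessor to the scan's successor; the intermediate
    # write and the later read are both skipped (assumption: result not needed elsewhere).
    return producer[:-1] + consumer[1:]
```

  • The fused plan keeps the original order of the two queries, which is what preserves the flow dependency.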
  • the optimization rule (3) is aimed at inter-query optimization of query statements with an operand overlap relationship. If the first operator of the first query statement is identical to the second operator of the second query statement, and the version numbers of some operands of the first operator are the same as the version numbers of some operands of the second operator, then in the subsequent query process, when physical query trees are generated for the first query statement and the second query statement, the first operator and the second operator are merged into the same task, which queries the overlapping operands and the non-overlapping operands separately. Because the two operators are merged into one task, the overlapping operands are queried only once, which reduces the scan overhead for overlapping data.
  • Step 105 Execute the optimized query statement to obtain a query result of the N query statements.
  • the data query server determines the symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements, where each symbol identifier includes the version number of the operand. It then determines the dependency relationships between the N query statements according to the version numbers of their operands, and performs inter-query optimization on the N query statements according to those dependency relationships and the preset optimization rule. Because the symbol identifiers of the operands of the N query statements are fixed and do not change with the input query statement, this embodiment provides a technique for statically analyzing and maintaining the data flow relationships between queries. The technique is independent of the input data, does not need to execute any part of any query statement, and does not need to monitor data access and updates during query statement execution, which improves the efficiency of inter-query optimization and reduces its overhead.
  • the symbol identifier of the operand in the N query statements may further include a hot data identifier. In that case, determining the symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements further includes: counting the number of times the operand of each scan operator in the N logical query plan trees is referenced; determining whether that number is greater than the hot data threshold; and adding the hot data identifier to each scan operator operand whose reference count is greater than the hot data threshold. The hot data identifier indicates that the data pointed to by the operand carrying it is hot data.
  • optimized query statements that include the hot data identifier and have neither a flow dependency nor an output dependency may be executed concurrently.
  • the operators corresponding to the operands containing the hot data identifier are reordered without changing the inter-query flow dependencies and output dependencies, so that they are executed consecutively to improve the access efficiency of the hot data.
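  • The hot data test described above is a reference count compared against a threshold. A minimal sketch, assuming each plan tree has already been reduced to the list of version numbers that its scan operators reference (the function name `find_hot_versions` is hypothetical):

```python
from collections import Counter
from typing import Iterable, List, Set


def find_hot_versions(scan_versions_per_tree: Iterable[List[int]],
                      threshold: int) -> Set[int]:
    """Return the operand versions referenced by scan operators more than `threshold` times."""
    counts: Counter = Counter()
    for versions in scan_versions_per_tree:
        counts.update(versions)          # one count per scan operator reference
    return {version for version, n in counts.items() if n > threshold}
```

  • The comparison is strict (greater than the threshold), matching the wording of the text above.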
  • the symbol identifier of the operand in the N query statements may further include a start active position and a termination active position. In that case, determining the symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements further includes: for the first operand, determining the start active position of the first operand according to the identifier of the scan operator that first references the first operand and the sequence number of the logical query plan tree in which that scan operator is located.
  • the first operand is any one of the operands in the N query statements. The termination active position of the first operand is determined according to the identifier of the destroy operator used to destroy the first operand and the sequence number of the logical query plan tree in which that destroy operator is located.
  • the storage space of the data indicated by the first operand may be released according to the termination active position of the first operand.
  • according to the termination active position of the first operand, the data query server may determine that the first operand is inactive after that position. For operands that are already inactive, the storage space they occupy is released as soon as possible so that it can be used for other operands, which improves the utilization of storage space.
  • the data query server may also determine the active interval of the first operand according to its start active position and termination active position. For an operand whose active interval is short (for example, one that is active only within a single query), if the operand is stored in the distributed storage system, its storage can be optimized by changing its storage location from the distributed storage system to the local disk or memory of the data query server, which reduces persistence and access overhead.
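  • The start and termination active positions can be sketched as a pass over operator events. The event tuple layout, the names `Liveness`, `compute_liveness`, and `releasable`, and the assumption that operator identifiers increase in execution order within a tree are all illustrative choices, not details confirmed by the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Position = Tuple[int, int]   # (plan tree sequence number, operator identifier)


@dataclass
class Liveness:
    start: Optional[Position] = None   # first scan operator that references the operand
    end: Optional[Position] = None     # destroy operator that destroys the operand


def compute_liveness(events: List[Tuple[int, int, str, int]]) -> Dict[int, Liveness]:
    """events: (tree sequence number, operator id, operator name, operand version),
    listed in execution order."""
    live: Dict[int, Liveness] = {}
    for tree_no, op_id, op, version in events:
        entry = live.setdefault(version, Liveness())
        if op == "scan" and entry.start is None:
            entry.start = (tree_no, op_id)
        elif op == "destroy":
            entry.end = (tree_no, op_id)
    return live


def releasable(live: Dict[int, Liveness], version: int, position: Position) -> bool:
    """True once execution has passed the operand's termination active position."""
    end = live[version].end
    return end is not None and position > end   # tuples compare tree number, then operator id
```

  • An executor could call `releasable` after each operator finishes to decide whether the storage behind a version can be freed.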
  • any one of the operands in the N query statements may be referred to as a second operand in the embodiment of the present invention.
  • for the second operand, the start active position of the second operand is determined according to the identifier of the first file setting operator that first references the second operand and the sequence number of the logical query plan tree in which the first file setting operator is located.
  • the first file setting operator is used to write data to the storage location indicated by the second operand.
  • the termination active position of the second operand is determined according to the identifier of the second file setting operator that references the second operand and the sequence number of the logical query plan tree in which the second file setting operator is located, wherein the second file setting operator is used to overwrite the data, pointed to by the second operand, that was written by the first file setting operator.
  • the storage space of the data indicated by the second operand is released according to the termination active position of the second operand.
  • the identifiers of the operators mentioned in this embodiment are used to identify the order of the operators, and may be an identity identifier (ID) of the operator.
  • FIG. 2 is a flowchart of a method for determining the symbol identifiers of the operands of N query statements according to Embodiment 2 of the present invention. As shown in FIG. 2, the method provided in this embodiment may include the following steps:
  • Step 201 Create a symbol identification table, where each data record in the symbol identification table includes the following fields: version number, reference, fixed value, active, inactive, and path.
  • a symbol identification table is first created, and the symbol identification table is used to maintain the symbol identifier of the operand of the N query statements.
  • the tabular structure is not the only form for storing symbol identifiers; the symbol identifiers can also be stored in another form such as a linked list or a hash table.
  • Each data record in the symbol identification table includes the following fields: version number, reference, fixed value, active, inactive, and path. The version number field stores the version number of the operand corresponding to the data record. The reference field and the fixed value field store, respectively, the identifiers of the scan operators that reference the operand and the identifier of the file setting operator that assigns the operand, in each case together with the sequence number of the logical query plan tree in which the operator is located. The path field stores the storage location of the operand corresponding to the data record.
  • The active field stores either the identifier of the scan operator that references the operand corresponding to the data record and the sequence number of the logical query plan tree in which that scan operator is located, or the identifier of the file setting operator that assigns the operand and the sequence number of the logical query plan tree in which that file setting operator is located.
  • The inactive field stores either the identifier of the destroy operator that destroys the operand corresponding to the data record and the sequence number of the logical query plan tree in which that destroy operator is located, or the identifier of the file setting operator that overwrites the operand and the sequence number of the logical query plan tree in which that file setting operator is located.
  • Step 202 Initialize the value of the version number field.
  • The value of the version number field is initialized to -1; thereafter it is incremented by one for each data record added to the symbol identification table.
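The record layout of Step 201 and the counter of Step 202 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `SymbolRecord`, `SymbolTable`, `OpRef`, and `new_record` are hypothetical, and an operator is assumed to be recorded as a pair of its identifier and the sequence number of its logical query plan tree.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumption: an operator is recorded as (operator identifier, tree sequence number).
OpRef = Tuple[str, int]


@dataclass
class SymbolRecord:
    """One data record of the symbol identification table (hypothetical layout)."""
    version: int                                           # version number field
    path: Optional[str] = None                             # storage location of the operand
    reference: List[OpRef] = field(default_factory=list)   # scan operators referencing it
    fixed_value: Optional[OpRef] = None                    # file setting operator assigning it
    active: Optional[OpRef] = None                         # starting active position
    inactive: Optional[OpRef] = None                       # termination active position


class SymbolTable:
    def __init__(self) -> None:
        self.current_version = -1                          # Step 202: initialized to -1
        self.records: Dict[int, SymbolRecord] = {}

    def new_record(self, path: Optional[str] = None) -> SymbolRecord:
        # Each added data record increments the version number by one.
        self.current_version += 1
        record = SymbolRecord(version=self.current_version, path=path)
        self.records[record.version] = record
        return record
```

Starting the counter at -1 means the first record added receives version number 0.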
  • Step 203 Traverse the N logical query plan trees one by one in their order, and generate a corresponding data record for the operand of each create operator, scan operator, file setting operator, and destroy operator in each logical query plan tree, where the order of the N logical query plan trees is consistent with the input order of the N query statements.
  • Corresponding data records can be generated for the operands of the create operators, scan operators, file setting operators, and destroy operators in each logical query plan tree by either of the following two methods.
  • The first method is as follows:
  • When the i-th logical query plan tree is traversed, if it includes a create operator, a data record is created for the operand of that create operator, and the correspondence between the version number of the operand and the storage location of the operand is saved in the mapping relationship table.
  • The mapping relationship table is used to store the correspondence between the version number of an operand and the storage location of the operand.
  • i is the sequence number of a logical query plan tree among the N logical query plan trees; its initial value is 1, and it is an integer greater than or equal to 1 and less than or equal to N.
  • The version number of the operand of the create operator of the i-th logical query plan tree is added to the version number field of the data record corresponding to that operand, and the other fields of that data record are empty.
  • If the i-th logical query plan tree includes a destroy operator, the version number corresponding to the storage location of the operand of that destroy operator is looked up in the mapping relationship table according to the storage location.
  • The data record corresponding to the operand of the destroy operator is then found in the symbol identification table by that version number, and the sequence number i of the i-th logical query plan tree and the identifier of the destroy operator of the i-th logical query plan tree are added to the inactive field of that data record.
  • If the i-th logical query plan tree includes a scan operator, the version number corresponding to the storage location of the operand of that scan operator is looked up in the mapping relationship table according to the storage location.
  • If the version number is found, the data record corresponding to the operand of the scan operator is found in the symbol identification table by that version number, and it is then determined whether the active field of that data record is empty.
  • If the active field is empty, the sequence number i of the i-th logical query plan tree and the identifier of the scan operator of the i-th logical query plan tree are added to both the reference field and the active field of that data record.
  • If the active field is not empty, the sequence number i of the i-th logical query plan tree and the identifier of the scan operator of the i-th logical query plan tree are added only to the reference field of that data record.
  • If the version number is not found, a data record is created for the operand of the scan operator of the i-th logical query plan tree, the correspondence between the version number of that operand and its storage location is saved in the mapping relationship table, and the sequence number i of the i-th logical query plan tree and the identifier of the scan operator of the i-th logical query plan tree are added to the reference field and the active field of the newly created data record.
  • Finding the version number indicates that the operand of a scan operator in the first i-1 logical query plan trees is the same as the operand of the scan operator of the i-th logical query plan tree, so a version number has already been generated for that operand.
  • Not finding the version number indicates that the operand of the scan operator of the i-th logical query plan tree appears for the first time, and no version number has yet been generated for it.
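The scan-operator branch of the first method can be sketched as below. The sketch assumes plain dictionaries for the symbol identification table (`records`) and the mapping relationship table (`location_to_version`); these names, the function `handle_scan`, and the example storage paths are illustrative assumptions rather than anything defined by the source.

```python
from typing import Dict, Tuple

# Assumption: an operator is recorded as (operator identifier, tree sequence number).
OpRef = Tuple[str, int]


def handle_scan(records: Dict[int, dict], location_to_version: Dict[str, int],
                location: str, op_id: str, tree_no: int) -> int:
    """Process one scan operator and return the version number of its operand."""
    ref: OpRef = (op_id, tree_no)
    version = location_to_version.get(location)
    if version is None:
        # First occurrence of this operand: create a record and remember the mapping.
        version = max(records, default=-1) + 1
        records[version] = {"version": version, "reference": [], "active": None}
        location_to_version[location] = version
    record = records[version]
    # The scan operator is always added to the reference field ...
    record["reference"].append(ref)
    # ... and to the active field only if that field is still empty.
    if record["active"] is None:
        record["active"] = ref
    return version


records: Dict[int, dict] = {}
mapping: Dict[str, int] = {}
handle_scan(records, mapping, "/warehouse/t1", "scan_1", 1)
handle_scan(records, mapping, "/warehouse/t1", "scan_7", 2)  # same operand, same version
```

Because a newly created record has an empty active field, the "not found" case and the "found with empty active field" case end in the same two updates, which is why the sketch handles them with one code path.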
  • If the i-th logical query plan tree includes a file setting operator, the version number corresponding to the storage location of the operand of that file setting operator is looked up in the mapping relationship table according to the storage location.
  • If the version number is found, the data record corresponding to the operand of the file setting operator is found in the symbol identification table by that version number, and it is then determined whether the fixed value field of that data record is empty.
  • If the fixed value field of the data record corresponding to the operand of the file setting operator of the i-th logical query plan tree is empty, the sequence number i of the i-th logical query plan tree and the identifier of the file setting operator of the i-th logical query plan tree are first added to the inactive field of that data record; a data record is then created for the operand of the file setting operator of the i-th logical query plan tree, and the correspondence between the version number of that operand and its storage location is saved in the mapping relationship table.
  • In the first method, version numbers are generated for the operands of each logical query plan tree in the following order: create operator, destroy operator, reference operator, and file setting operator.
  • This order need not be followed. Note that if version numbers are generated for the operands of each logical query plan tree in another order, the mapping relationship table must be queried before each data record is created; a new data record is generated for the operand of the current operator only if no version number corresponding to that operand is found in the mapping relationship table.
  • With the first method, the version numbers of the operands of the N logical query plan trees can be generated in a single traversal.
  • The second method is as follows: the symbol identification table is again established first, and its fields are the same as those in the first method. The difference is that, in this method, the mapping relationship table saves the mapping from the storage location of an operand to a set of two-tuples, each consisting of a version number of the operand and the sequence number of a logical query plan tree.
  • In the first traversal, for the operand of the current file setting operator, the current value of the version number field in the symbol identification table plus 1 is used as the version number of the operand, and this version number is added to the version number field of the current data record; the identifier of the current file setting operator and the sequence number of the logical query plan tree in which the current file setting operator is located are added to the fixed value field and the active field of the current data record.
  • The N logical query plan trees are then traversed a second time in their order, and each destroy operator of the N logical query plan trees is processed as follows: (1) the mapping relationship table is searched according to the storage location of the operand of the current destroy operator to obtain the set of all two-tuples, each consisting of a version number of the operand and the sequence number of a logical query plan tree, that correspond to that storage location.
  • The N logical query plan trees are traversed a third time in their order, and each scan operator of the N logical query plan trees is processed in turn as follows: (1) the mapping relationship table is searched according to the storage location of the operand of the current scan operator to obtain the set of all two-tuples, each consisting of a version number of the operand and the sequence number of a logical query plan tree, that correspond to that storage location.
  • The identifier of the current scan operator and the sequence number of the logical query plan tree in which the current scan operator is located are added to the reference field of the data record corresponding to the version number in the two-tuple that has the largest sequence number. (4) If the active field of the data record corresponding to the version number in the two-tuple that has the smallest sequence number is empty, the identifier of the current scan operator and the sequence number of the logical query plan tree in which the current scan operator is located are added to that active field.
  • If no version number that satisfies the condition is found in the mapping relationship table, a new data record is created in the symbol identification table, and the maximum value of the version number field is incremented by 1 to obtain the version number of the operand of the current scan operator.
  • The version number of the operand of the current scan operator is added to the version number field of the newly created data record, the storage location of the operand of the current scan operator is added to the path field of the newly created data record, and the identifier of the current scan operator and the sequence number of the logical query plan tree in which the current scan operator is located are added to the reference field and the active field of the newly created data record.
  • The N logical query plan trees are traversed a fourth time, and each create operator of the N logical query plan trees is processed in turn as follows: (1) the mapping relationship table is searched according to the storage location of the operand of the current create operator to obtain the set of all two-tuples, each consisting of a version number of the operand and the sequence number of a logical query plan tree, that correspond to that storage location.
  • It is then determined whether the path field of the data record corresponding to the version number in the two-tuple that has the smallest sequence number is empty. If it is empty, the storage location of the operand of the current create operator is added to the path field of that data record; if it is not empty, processing moves on to the next create operator, which is handled in the same way.
  • Version numbers are generated for the operands of each logical query plan tree in the following order: destroy operator, reference operator, file setting operator, and create operator.
  • When the N logical query plan trees are traversed for the fourth time, they are traversed in the reverse order of the N logical query plan trees.
  • When a version number is generated for a reference operator, if no version number that satisfies the condition is found in the mapping relationship table, a new data record is generated for the operand of the current reference operator.
  • After version numbers have been generated for the operands of all the operators in the logical query plan trees, the generated version numbers are added to the logical query plan trees, so that in the subsequent query optimization process the N logical query plan trees can be optimized according to the version numbers of the operands.
  • According to the values of the active field and the inactive field of each data record in the symbol identification table, the starting active position and the termination active position of every operand in the table can be determined. Specifically, it is first determined whether the values of the active field and the inactive field of the current data record are empty. If both are empty, the operand corresponding to the current data record is not active at any point in the batch query process, and the current data record can be deleted from the symbol identification table.
  • Otherwise, the starting active position of the operand corresponding to the current data record is determined according to the identifier of the operator in the active field of the current data record and the sequence number of the logical query plan tree to which that operator belongs, and the termination active position of the operand is determined according to the identifier of the operator in the inactive field of the current data record and the sequence number of the logical query plan tree to which that operator belongs.
  • There is only one operator in the active field of the current data record, which may be a scan operator or a file setting operator.
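The collection of active positions described above can be sketched as follows, assuming each data record is a dictionary whose `active` and `inactive` entries hold either `None` or a pair of operator identifier and tree sequence number. The function name `live_ranges` and the record layout are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Assumption: an operator is recorded as (operator identifier, tree sequence number).
OpRef = Tuple[str, int]


def live_ranges(records: Dict[int, dict]) -> Dict[int, Tuple[Optional[OpRef], Optional[OpRef]]]:
    """Return {version: (starting active position, termination active position)}.

    Records whose active and inactive fields are both empty are deleted,
    because their operands are never active during the batch query.
    """
    ranges = {}
    for version in list(records):          # copy the keys: records is modified below
        record = records[version]
        active, inactive = record.get("active"), record.get("inactive")
        if active is None and inactive is None:
            del records[version]
            continue
        ranges[version] = (active, inactive)
    return ranges
```

A record with a starting position but no termination position stays in the result; how such an operand is eventually released is not settled by this sketch.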
  • FIG. 3 is a schematic structural diagram of a data query server according to Embodiment 3 of the present invention.
  • The data query server provided in this embodiment includes a receiving module 11, an identifier determining module 12, a relationship determining module 13, an optimization module 14, and a query module 15.
  • the receiving module 11 is configured to receive N query statements to be executed, where the N is a positive integer not less than 2;
  • the identifier determining module 12 is configured to determine a symbol identifier of an operand of the N query statements according to an operator and an operand in the N query statements, where the operator is used to indicate an operation to be performed, where The operand is used to indicate a storage location of data to be operated by an operator in the N query statements, the symbol identifier includes a version number of the operand, and the operands that refer to the same data have the same version number, Operands that refer to different data have different version numbers, and the operators include at least: create operators, destroy operators, scan operators, and file setting operators;
  • the relationship determining module 13 is configured to determine a dependency relationship between the N query statements according to a version number of an operand of the N query statements determined by the identifier determining module 12;
  • the optimization module 14 is configured to perform inter-query optimization on the N query statements according to the dependency relationship between the N query statements and a preset optimization rule.
  • the querying module 15 is configured to execute the optimized query statement to obtain the query result of the N query statements.
  • The identifier determining module 12 is specifically configured to: acquire N logical query plan trees corresponding to the N query statements, where one query statement corresponds to one logical query plan tree; add symbol identifiers to the operands of the first-type operators in the N logical query plan trees, where the first-type operators include the create operator, destroy operator, scan operator, and file setting operator; and perform the following operation for each of the N logical query plan trees: according to the topological order of a first logical query plan tree, the symbol identifiers of the operands of the first-type operators in the first logical query plan tree, and a preset adding rule, add symbol identifiers to the operands of the second-type operators in the first logical query plan tree, where the first logical query plan tree is any one of the N logical query plan trees, and the second-type operators are the operators other than the first-type operators.
  • The root node of the first logical query plan tree includes a file setting operator, the leaf nodes of the first logical query plan tree include file scan operators, and the internal nodes of the first logical query plan tree include second-type operators, create operators, and destroy operators, where an internal node is a node other than a leaf node or the root node. The adding rule includes performing the following operations for each second-type operator in the first logical query plan tree:
  • If the operand of a first operator is the same as the operand of the left child node of the first operator, the operand of the first operator is given the same symbol identifier as the operand of the left child node of the first operator, where the first operator is any one of the second-type operators; if the operand of the first operator is the same as the operand of the right child node of the first operator, the operand of the first operator is given the same symbol identifier as the operand of the right child node of the first operator.
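The adding rule can be sketched as a bottom-up walk over one logical query plan tree. Everything named here is an assumption made for illustration: the `Node` class, the operator names in `FIRST_TYPE`, the use of a post-order walk as the topological order, and the reduction of a symbol identifier to its version number.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed names for the first-type operators, whose operands are already versioned.
FIRST_TYPE = {"create", "destroy", "scan", "file_set"}


@dataclass
class Node:
    op: str                              # operator name
    operand: str                         # storage location the operator works on
    version: Optional[int] = None        # symbol identifier, reduced to a version number
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def propagate(node: Optional[Node]) -> None:
    """Give each second-type operator the identifier of a child with the same operand."""
    if node is None:
        return
    # Children first, so their identifiers exist when the parent is visited.
    propagate(node.left)
    propagate(node.right)
    if node.op in FIRST_TYPE:
        return
    for child in (node.left, node.right):    # the left child is checked first
        if child is not None and child.operand == node.operand:
            node.version = child.version
            break
```

For example, a join node over the operand `/t2` whose right child scans `/t2` with version 5 receives version 5, and a filter node above that join on the same operand inherits 5 in turn.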
  • The optimization rules include at least one of the following rules: deleting a query statement that has the same operand version numbers and the same operators as a first query statement, where the first query statement is any one of the N query statements; maintaining the query order between query statements that have a flow dependency, and optimizing the multiple query statements that have the flow dependency into one new query statement, where a flow dependency means that the version number of the operand of the file setting operator of a query statement executed earlier is the same as the version number of an operand of another query statement executed later; and merging query statements that have the same operator and overlapping operands.
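The flow dependency defined above can be detected purely from version numbers. In this sketch each query statement is summarized by the version numbers it uses and the version numbers its file setting operator writes; the `Query` type, its field names, and the function `flow_dependencies` are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Query(NamedTuple):
    name: str
    reads: FrozenSet[int]    # version numbers of operands the statement uses
    writes: FrozenSet[int]   # version numbers of its file setting operator's operands


def flow_dependencies(queries: List[Query]) -> List[Tuple[str, str]]:
    """Return (earlier, later) pairs where the later statement uses a version
    that the earlier statement's file setting operator wrote."""
    deps = []
    for i, earlier in enumerate(queries):
        for later in queries[i + 1:]:
            if earlier.writes & later.reads:
                deps.append((earlier.name, later.name))
    return deps


batch = [
    Query("q1", frozenset({0}), frozenset({1})),
    Query("q2", frozenset({1}), frozenset({2})),   # uses version 1, written by q1
    Query("q3", frozenset({0}), frozenset({3})),
]
print(flow_dependencies(batch))
```

Here `q2` depends on `q1`, so their relative order must be kept, while `q3` shares only a read of version 0 with `q1` and carries no flow dependency.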
  • The symbol identifier of an operand in the N query statements further includes a hot data identifier, and the identifier determining module 12 is further configured to: count the number of times the operand of each scan operator in the N logical query plan trees is referenced; determine whether the number of times the operand of each scan operator in the N logical query plan trees is referenced is greater than a hot data threshold; and add the hot data identifier to each operand of a scan operator in the N logical query plan trees whose reference count is greater than the hot data threshold, where the hot data identifier indicates that the data pointed to by the operand carrying it is hot data.
  • the querying module 15 is further configured to concurrently execute the optimized query statement that includes the hot data identifier and does not have a flow dependency and an output dependency during the execution of the optimized query statement.
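The hot data check amounts to a reference count compared against a threshold. A minimal sketch, assuming the operands of all scan operators have already been reduced to version numbers; the function name `hot_versions` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def hot_versions(scan_operand_versions: Iterable[int], threshold: int) -> Set[int]:
    """Return the version numbers referenced more than `threshold` times."""
    counts = Counter(scan_operand_versions)
    return {version for version, n in counts.items() if n > threshold}
```

The comparison is strict, matching "greater than a hot data threshold" in the text: an operand referenced exactly `threshold` times is not marked as hot.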
  • The symbol identifier of an operand in the N query statements further includes a starting active position and a termination active position, and the identifier determining module 12 is further configured to: for a first operand, determine the starting active position of the first operand according to the identifier of the scan operator that first references the first operand and the sequence number of the logical query plan tree in which the scan operator is located, where the first operand is any one of the operands in the N query statements; and determine the termination active position of the first operand according to the identifier of the destroy operator that destroys the first operand and the sequence number of the logical query plan tree in which the destroy operator is located.
  • the querying module 15 is further configured to: during the execution of the optimized query statement, release a storage space of data indicated by the first operand according to a termination active position of the first operand.
  • The identifier determining module 12 is further configured to: for a second operand, determine the starting active position of the second operand according to the identifier of the first file setting operator that first references the second operand and the sequence number of the logical query plan tree in which the first file setting operator is located, where the first file setting operator is used to write data to the storage location indicated by the second operand, and the second operand is any one of the operands in the N query statements; and determine the termination active position of the second operand according to the identifier of a second file setting operator that references the second operand and the sequence number of the logical query plan tree in which the second file setting operator is located, where the second file setting operator is used to overwrite the data, indicated by the second operand, on which the first file setting operator operated.
  • the query module 15 is further configured to: during the execution of the optimized query statement, release a storage space of data indicated by the second operand according to a termination active position of the second operand.
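Releasing storage at a termination active position can be sketched as a lookup performed at each step of execution. The sketch assumes the optimized plan is a flat sequence of operators identified by a pair of operator identifier and tree sequence number, and that `release` is a caller-supplied callback; `run_with_release` and both parameter names are hypothetical, and the actual execution of each operator is left out.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Assumption: an operator is recorded as (operator identifier, tree sequence number).
OpRef = Tuple[str, int]


def run_with_release(plan: Iterable[OpRef], end_positions: Dict[int, OpRef],
                     release: Callable[[int], None]) -> None:
    """Walk the plan and release each operand's storage once the operator at
    its termination active position has been reached."""
    by_position: Dict[OpRef, List[int]] = {}
    for version, position in end_positions.items():
        by_position.setdefault(position, []).append(version)
    for step in plan:
        # The operator at `step` would be executed here.
        for version in by_position.get(step, []):
            release(version)
```

Indexing the termination positions by operator up front keeps the per-step cost to a single dictionary lookup.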
  • the data query server provided in this embodiment may be used to perform the methods in the first embodiment and the second embodiment.
  • the specific implementation manners and technical effects are similar, and details are not described herein again.
  • The data query server 200 provided in this embodiment includes a processor 21, a memory 22, a communication interface 23, and a system bus 24.
  • The memory 22 and the communication interface 23 are connected to and communicate with the processor 21 via the system bus 24; the memory 22 is configured to store computer-executable instructions; the communication interface 23 is configured to communicate with other devices; and the processor 21 is configured to run the computer-executable instructions to perform the following method:
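  • receiving N query statements to be executed, where N is a positive integer not less than 2;
  • determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements, where an operator is used to indicate an operation to be performed, an operand is used to indicate the storage location of the data to be operated on by an operator in the N query statements, a symbol identifier includes the version number of an operand, operands that refer to the same data have the same version number, operands that refer to different data have different version numbers, and the operators include at least a create operator, a destroy operator, a scan operator, and a file setting operator;
  • determining the dependency relationships between the N query statements according to the determined version numbers of the operands of the N query statements; performing inter-query optimization on the N query statements according to those dependency relationships and preset optimization rules; and executing the optimized query statements to obtain the query results of the N query statements.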
  • The processor 21 determines the symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements specifically as follows: acquiring N logical query plan trees corresponding to the N query statements, where one query statement corresponds to one logical query plan tree; adding symbol identifiers to the operands of the first-type operators in the N logical query plan trees, where the first-type operators include the create operator, destroy operator, scan operator, and file setting operator; and, for each of the N logical query plan trees, adding symbol identifiers to the operands of the second-type operators in a first logical query plan tree according to the topological order of the first logical query plan tree, the symbol identifiers of the operands of the first-type operators in the first logical query plan tree, and a preset adding rule, where the first logical query plan tree is any one of the N logical query plan trees, and the second-type operators are the operators other than the first-type operators.
  • The root node of the first logical query plan tree includes a file setting operator, the leaf nodes of the first logical query plan tree include file scan operators, and the internal nodes of the first logical query plan tree include second-type operators, create operators, and destroy operators, where an internal node is a node other than a leaf node or the root node.
  • The adding rule includes performing the following operations for each second-type operator in the first logical query plan tree:
  • If the operand of a first operator is the same as the operand of the left child node of the first operator, the operand of the first operator is given the same symbol identifier as the operand of the left child node of the first operator, where the first operator is any one of the second-type operators; if the operand of the first operator is the same as the operand of the right child node of the first operator, the operand of the first operator is given the same symbol identifier as the operand of the right child node of the first operator.
  • The optimization rules include at least one of the following rules: deleting a query statement that has the same operand version numbers and the same operators as a first query statement, where the first query statement is any one of the N query statements; maintaining the query order between query statements that have a flow dependency, and optimizing the multiple query statements that have the flow dependency into one new query statement, where a flow dependency means that the version number of the operand of the file setting operator of a query statement executed earlier is the same as the version number of an operand of another query statement executed later; and merging query statements that have the same operator and overlapping operands.
  • The symbol identifier of an operand in the N query statements further includes a hot data identifier, and the processor 21 is further configured to: count the number of times the operand of each scan operator in the N logical query plan trees is referenced; determine whether the number of times the operand of each scan operator in the N logical query plan trees is referenced is greater than a hot data threshold; and add the hot data identifier to each operand of a scan operator in the N logical query plan trees whose reference count is greater than the hot data threshold, where the hot data identifier indicates that the data pointed to by the operand carrying it is hot data. Subsequently, while the optimized query statements are executed, the optimized query statements that include the hot data identifier and have no flow dependency or output dependency are executed concurrently.
  • The symbol identifier of an operand in the N query statements further includes a starting active position and a termination active position, and the processor 21 is further configured to: for a first operand, determine the starting active position of the first operand according to the identifier of the scan operator that first references the first operand and the sequence number of the logical query plan tree in which the scan operator is located, where the first operand is any one of the operands in the N query statements; and determine the termination active position of the first operand according to the identifier of the destroy operator that destroys the first operand and the sequence number of the logical query plan tree in which the destroy operator is located.
  • Subsequently, while the optimized query statements are executed, the storage space of the data indicated by the first operand is released according to the termination active position of the first operand.
  • The processor 21 is further configured to: for a second operand, determine the starting active position of the second operand according to the identifier of the first file setting operator that first references the second operand and the sequence number of the logical query plan tree in which the first file setting operator is located, where the first file setting operator is used to write data to the storage location indicated by the second operand, and the second operand is any one of the operands in the N query statements; and determine the termination active position of the second operand according to the identifier of a second file setting operator that references the second operand and the sequence number of the logical query plan tree in which the second file setting operator is located, where the second file setting operator is used to overwrite the data, indicated by the second operand, on which the first file setting operator operated. Subsequently, while the optimized query statements are executed, the storage space of the data indicated by the second operand is released according to the termination active position of the second operand.
  • the data query server provided in this embodiment may be used to perform the methods in the first embodiment and the second embodiment.
  • the specific implementation manners and technical effects are similar, and details are not described herein again.
  • the embodiment of the invention further provides a computer program product for data processing, comprising a computer readable storage medium storing program code, the program code comprising instructions for executing the method flow described in any one of the foregoing method embodiments.
  • A person skilled in the art can understand that the foregoing storage medium includes any non-transitory machine-readable medium that can store program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a random access memory (RAM), a solid state disk (SSD), or a non-volatile memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Operations Research (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

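A batch data query method and apparatus. A data query server determines symbol identifiers for the operands of N query statements according to the operators and operands in the N query statements (102), where each symbol identifier includes a version number of the operand; determines the dependencies among the N query statements according to the version numbers of their operands (103); and then performs inter-query optimization on the N query statements according to those dependencies and preset optimization rules (104). Because the symbol identifiers of the operands of the N query statements are fixed and do not vary with the input, the operand symbol identifiers determined by the method apply to all input sets. The method does not need to execute any part of any query statement or to monitor data accesses and updates during query execution, which improves the efficiency of inter-query optimization and reduces its overhead.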

Description

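Batch data query method and apparatus
Technical Field
Embodiments of the present invention relate to the field of computer technologies, and in particular, to a batch data query method and apparatus.
Background
Representative big-data query systems today (for example, Hive, Shark, and Impala) all use a single query statement as the basic unit of parsing and optimization. Query efficiency is a key performance indicator of a big-data query system. In the batch query scenario of a data warehouse, however, the traditional processing mode that parses and optimizes one query statement at a time offers too few optimization opportunities. In sharp contrast to this shortage of intra-query optimization opportunities, batch query workloads in data warehouses exhibit abundant inter-query optimization opportunities, that is, optimization opportunities that exist among multiple query statements.
In batch query scenarios, the prior art monitors updates to data records in real time during query execution and feeds the results back, or executes part of a query statement's functionality in advance, in order to dynamically obtain the data records that each query actually needs to access. It then determines whether the data records operated on by multiple queries conflict or intersect, and performs dynamic optimizations based on that analysis. However, monitoring data records or partially executing queries can only collect dynamic data dependencies tied to one particular set of inputs, so optimizations based on them only suit that set of inputs; once the inputs change, the analysis and optimization must be performed again.
Summary
Embodiments of the present invention provide a batch data query method and apparatus that improve the efficiency of inter-query optimization and reduce its system overhead.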
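A first aspect of the present invention provides a batch data query method, including:
receiving N query statements to be executed, where N is a positive integer not less than 2;
determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements, where an operator indicates an operation to be performed, an operand indicates the storage location of the data to be operated on by an operator in the N query statements, a symbol identifier includes a version number of the operand, operands that refer to the same data have the same version number, operands that refer to different data have different version numbers, and the operators include at least a create operator, a destroy operator, a scan operator, and a file-sink operator;
determining dependencies among the N query statements according to the determined version numbers of the operands of the N query statements;
performing inter-query optimization on the N query statements according to the dependencies among the N query statements and preset optimization rules; and
executing the optimized query statements to obtain query results of the N query statements.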
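With reference to the first aspect, in a first possible implementation of the first aspect, the determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements includes:
obtaining N logical query plan trees corresponding to the N query statements, where one query statement corresponds to one logical query plan tree;
adding symbol identifiers to the operands of first-class operators in the N logical query plan trees, where the first-class operators include the create operator, the destroy operator, the scan operator, and the file-sink operator; and
performing the following operation on each of the N logical query plan trees:
adding symbol identifiers to the operands of second-class operators in a first logical query plan tree according to the topological order of the first logical query plan tree, the symbol identifiers of the operands of the first-class operators in the first logical query plan tree, and a preset adding rule, where the first logical query plan tree is any one of the N logical query plan trees, and the second-class operators are the operators other than the first-class operators.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the root node of the first logical query plan tree includes a file-sink operator, the leaf nodes of the first logical query plan tree include file scan operators, and the internal nodes of the first logical query plan tree include second-class operators, create operators, and destroy operators, where the internal nodes are the nodes other than the leaf nodes and the root node; and the adding rule includes performing the following operations on each second-class operator in the first logical query plan tree:
if an operand of a first operator is the same as an operand of the left child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the left child node, where the first operator is any one of the second-class operators; and
if an operand of the first operator is the same as an operand of the right child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the right child node.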
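With reference to the first aspect or either of the first and second possible implementations of the first aspect, in a third possible implementation of the first aspect, the optimization rules include at least one of the following rules:
deleting a query statement that has the same operand version numbers and the same operators as a first query statement, where the first query statement is any one of the N query statements;
preserving the query order among query statements that have a flow dependency, and optimizing multiple query statements that have a flow dependency into one new query statement, where a flow dependency means that the version number of the operand of the file-sink operator of an earlier-executed query statement is the same as the version number of an operand of another, later-executed query statement; and
merging query statements that have the same operator and overlapping operands.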
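With reference to the first possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the symbol identifiers of the operands in the N query statements further include a hot-data identifier;
the determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements further includes:
counting the number of times the operand of each scan operator in the N logical query plan trees is referenced;
determining whether the number of times the operand of each scan operator in the N logical query plan trees is referenced is greater than a hot-data threshold; and
adding a hot-data identifier to each scan-operator operand in the N logical query plan trees whose reference count is greater than the hot-data threshold, where the hot-data identifier indicates that the data pointed to by the operand carrying it is hot data; and
the method further includes:
during execution of the optimized query statements, concurrently executing the optimized query statements that contain the hot-data identifier and have neither a flow dependency nor an output dependency.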
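With reference to the first or fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the symbol identifiers of the operands in the N query statements further include a start active position and an end active position, and the determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements further includes:
for a first operand, determining the start active position of the first operand according to the identifier of the scan operator that first references the first operand and the sequence number of the logical query plan tree in which that scan operator is located, where the first operand is any one of the operands in the N query statements; and
determining the end active position of the first operand according to the identifier of the destroy operator used to destroy the first operand and the sequence number of the logical query plan tree in which that destroy operator is located; and
the method further includes:
during execution of the optimized query statements, releasing the storage space of the data indicated by the first operand according to the end active position of the first operand.
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, the determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements further includes:
for a second operand, determining the start active position of the second operand according to the identifier of a first file-sink operator that first references the second operand and the sequence number of the logical query plan tree in which the first file-sink operator is located, where the first file-sink operator is used to write data to the storage location referred to by the second operand, and the second operand is any one of the operands in the N query statements; and
determining the end active position of the second operand according to the identifier of a second file-sink operator that references the second operand and the sequence number of the logical query plan tree in which the second file-sink operator is located, where the second file-sink operator is used to overwrite the data that the second operand points to and that was operated on by the first file-sink operator; and
the method further includes:
during execution of the optimized query statements, releasing the storage space of the data indicated by the second operand according to the end active position of the second operand.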
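A second aspect of the present invention provides a data query server, including:
a receiving module, configured to receive N query statements to be executed, where N is a positive integer not less than 2;
an identifier determining module, configured to determine symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements, where an operator indicates an operation to be performed, an operand indicates the storage location of the data to be operated on by an operator in the N query statements, a symbol identifier includes a version number of the operand, operands that refer to the same data have the same version number, operands that refer to different data have different version numbers, and the operators include at least a create operator, a destroy operator, a scan operator, and a file-sink operator;
a relationship determining module, configured to determine dependencies among the N query statements according to the version numbers of the operands of the N query statements determined by the identifier determining module;
an optimization module, configured to perform inter-query optimization on the N query statements according to the dependencies among the N query statements and preset optimization rules; and
a query module, configured to execute the optimized query statements to obtain query results of the N query statements.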
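With reference to the second aspect, in a first possible implementation of the second aspect, the identifier determining module is specifically configured to:
obtain N logical query plan trees corresponding to the N query statements, where one query statement corresponds to one logical query plan tree;
add symbol identifiers to the operands of first-class operators in the N logical query plan trees, where the first-class operators include the create operator, the destroy operator, the scan operator, and the file-sink operator; and
perform the following operation on each of the N logical query plan trees:
add symbol identifiers to the operands of second-class operators in a first logical query plan tree according to the topological order of the first logical query plan tree, the symbol identifiers of the operands of the first-class operators in the first logical query plan tree, and a preset adding rule, where the first logical query plan tree is any one of the N logical query plan trees, and the second-class operators are the operators other than the first-class operators.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the root node of the first logical query plan tree includes a file-sink operator, the leaf nodes of the first logical query plan tree include file scan operators, and the internal nodes of the first logical query plan tree include second-class operators, create operators, and destroy operators, where the internal nodes are the nodes other than the leaf nodes and the root node; and the adding rule includes performing the following operations on each second-class operator in the first logical query plan tree:
if an operand of a first operator is the same as an operand of the left child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the left child node, where the first operator is any one of the second-class operators; and
if an operand of the first operator is the same as an operand of the right child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the right child node.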
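With reference to the second aspect or either of the first and second possible implementations of the second aspect, in a third possible implementation of the second aspect, the optimization rules include at least one of the following rules:
deleting a query statement that has the same operand version numbers and the same operators as a first query statement, where the first query statement is any one of the N query statements;
preserving the query order among query statements that have a flow dependency, and optimizing multiple query statements that have a flow dependency into one new query statement, where a flow dependency means that the version number of the operand of the file-sink operator of an earlier-executed query statement is the same as the version number of an operand of another, later-executed query statement; and
merging query statements that have the same operator and overlapping operands.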
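With reference to the first possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the symbol identifiers of the operands in the N query statements further include a hot-data identifier, and the identifier determining module is further configured to:
count the number of times the operand of each scan operator in the N logical query plan trees is referenced;
determine whether the number of times the operand of each scan operator in the N logical query plan trees is referenced is greater than a hot-data threshold; and
add a hot-data identifier to each scan-operator operand in the N logical query plan trees whose reference count is greater than the hot-data threshold, where the hot-data identifier indicates that the data pointed to by the operand carrying it is hot data; and
the query module is further configured to:
during execution of the optimized query statements, concurrently execute the optimized query statements that contain the hot-data identifier and have neither a flow dependency nor an output dependency.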
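With reference to the first or fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the symbol identifiers of the operands in the N query statements further include a start active position and an end active position, and the identifier determining module is further configured to:
for a first operand, determine the start active position of the first operand according to the identifier of the scan operator that first references the first operand and the sequence number of the logical query plan tree in which that scan operator is located, where the first operand is any one of the operands in the N query statements; and
determine the end active position of the first operand according to the identifier of the destroy operator used to destroy the first operand and the sequence number of the logical query plan tree in which that destroy operator is located; and
the query module is further configured to:
during execution of the optimized query statements, release the storage space of the data indicated by the first operand according to the end active position of the first operand.
With reference to the fifth possible implementation of the second aspect, in a sixth possible implementation of the second aspect, the identifier determining module is further configured to:
for a second operand, determine the start active position of the second operand according to the identifier of a first file-sink operator that first references the second operand and the sequence number of the logical query plan tree in which the first file-sink operator is located, where the first file-sink operator is used to write data to the storage location referred to by the second operand, and the second operand is any one of the operands in the N query statements; and
determine the end active position of the second operand according to the identifier of a second file-sink operator that references the second operand and the sequence number of the logical query plan tree in which the second file-sink operator is located, where the second file-sink operator is used to overwrite the data that the second operand points to and that was operated on by the first file-sink operator; and
the query module is further configured to:
during execution of the optimized query statements, release the storage space of the data indicated by the second operand according to the end active position of the second operand.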
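According to the batch data query method and apparatus provided in the embodiments of the present invention, a data query server determines symbol identifiers for the operands of N query statements according to the operators and operands in those statements, where each symbol identifier includes a version number of the operand; determines the dependencies among the N query statements according to the version numbers; and performs inter-query optimization according to those dependencies and preset optimization rules. Because the symbol identifiers of the operands are fixed and do not vary with the input, the embodiments provide a static technique for analyzing and maintaining inter-query data-flow relationships: it is independent of the input data, does not execute any part of any query statement, and does not monitor data accesses and updates during query execution. This improves the efficiency of inter-query optimization and reduces its overhead.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art.
FIG. 1 is a flowchart of a batch data query method according to Embodiment 1 of the present invention;
FIG. 2 is a flowchart of determining symbol identifiers of the operands of N query statements according to Embodiment 2 of the present invention;
FIG. 3 is a schematic structural diagram of a data query server according to Embodiment 3 of the present invention; and
FIG. 4 is a schematic structural diagram of a data query server according to Embodiment 4 of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the following describes the technical solutions clearly and completely with reference to the accompanying drawings. The described embodiments are some rather than all of the embodiments of the present invention.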
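The methods in the embodiments of the present invention are mainly applied to batch data query scenarios. Batch data query usually follows a client/server model: a database typically includes multiple data query servers, a storage system, and a large number of clients, and the storage system may include one or more storage devices. In a batch query, multiple clients may send query statements to a data query server. The data query server runs a batch query over the statements either when the number of received statements reaches a preset number or over all statements received within a preset time period. A common scenario for batch data query is the data warehouse. A data warehouse is a structured data environment that serves as the data source for decision support systems and online analytical applications; it mainly studies and solves the problem of obtaining information from databases. A data warehouse is subject-oriented, integrated, stable, and time-variant. It offers many opportunities for batch data query and holds large volumes of data, usually stored in a distributed storage system.
FIG. 1 is a flowchart of the batch data query method according to Embodiment 1 of the present invention. The method may be performed by a data query server. As shown in FIG. 1, the method may include the following steps:
Step 101: Receive N query statements to be executed, where N is a positive integer greater than or equal to 2.
Step 102: Determine symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements.
Each query statement includes multiple operators, and each operator references one or more operands. An operator indicates an operation to be performed. Common operators include the scan operator (Scan Operator), file-sink operator (Filesink Operator), create operator, destroy operator, sort operator (Sort Operator), select operator (Select Operator), aggregate operator (Aggregate Operator), product operator (Product Operator), and join operator (Join Operator).
In this embodiment, an operand is not a specific data record; it indicates the storage location of the data that an operator is to operate on, that is, an operand corresponds to a storage location. An operand may be a variable or an expression, and the storage location it indicates may be a data table, a partition of a data table, a field of a data table, or the like. The symbol identifier of an operand includes a version number of the operand: operands that refer to the same data have the same version number, and operands that refer to different data have different version numbers. In a database, two operators having operands with the same name (for example, a) do not necessarily operate on the same data; likewise, two operators having operands with different names (for example, a and b) do not necessarily operate on different data. Therefore, when the symbol identifiers of the operands of the N query statements are determined, whether two operands are the same must be decided not by their names alone but by whether the data they refer to is the same, where the data an operand refers to is the data stored at the storage location the operand indicates.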
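When determining the symbol identifiers of the operands of the N query statements, the data query server may determine the symbol identifiers of all operands according to the order of the N query statements and the execution order of the operators inside each statement. In one implementation, the data query server first obtains N logical query plan trees corresponding to the N query statements, where one query statement corresponds to one logical query plan tree and each node of a tree is an operator. It then adds symbol identifiers to the operands of the first-class operators in the N trees, where the first-class operators include the create operator, destroy operator, scan operator, and file-sink operator. Before adding them, the server must generate the symbol identifiers for those operands; each generated symbol identifier includes the version number of the operand.
After symbol identifiers have been added to the operands of the first-class operators of the N logical query plan trees, symbol identifiers are added to the operands of the second-class operators based on them. Specifically, the following operation is performed on each of the N logical query plan trees:
Symbol identifiers are added to the operands of the second-class operators in a first logical query plan tree according to the topological order of that tree, the symbol identifiers of the operands of its first-class operators, and a preset adding rule. The first logical query plan tree is any one of the N logical query plan trees, and the second-class operators are the operators other than the first-class operators. In this embodiment, the nodes of the first logical query plan tree include a root node, leaf nodes, and internal nodes: the node at the top of the tree (with no parent) is the root node, the nodes at the bottom (with no children) are leaf nodes, and internal nodes have both a parent and children. The topological order of the first logical query plan tree is the order from the leaf nodes to the root node.
In this embodiment, the root node of the first logical query plan tree includes a file-sink operator, the leaf nodes include file scan operators, and the internal nodes include second-class operators, create operators, and destroy operators, where the internal nodes are the nodes other than the leaf nodes and the root node. If the first logical query plan tree is a binary tree, an internal node has one left child node and one right child node. The adding rule includes performing the following operations on each second-class operator in the first logical query plan tree: if an operand of a first operator is the same as an operand of its left child node, the operand of the first operator is given the same symbol identifier as that operand of the left child node; if an operand of the first operator is the same as an operand of its right child node, it is given the same symbol identifier as that operand of the right child node. The first operator is any one of the second-class operators.
In this embodiment, the left and right child nodes of the first operator may be first-class or second-class operators. Specifically, when they are leaf nodes they are first-class operators, and when they are internal nodes they are second-class operators. For example, assume the first logical query plan tree has four layers: the first layer holds the root node, the second and third layers hold internal nodes, and the fourth layer holds leaf nodes. When symbol identifiers are added to the operands of the second-class operators of this tree, the operands of the third-layer internal nodes are labelled first, based on the operands of the fourth-layer leaf nodes; the leaf nodes are first-class operators and the third-layer internal nodes are second-class operators, so here second-class operands are labelled from first-class operands. Once all third-layer internal nodes have been labelled, the operands of the second-layer internal nodes are labelled based on the operands of the third-layer internal nodes; the second-layer internal nodes are second-class operators, so here second-class operands are labelled from the symbol identifiers of the third-layer second-class operands.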
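The leaf-to-root labelling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the `propagate` function, and the reduction of a symbol identifier to a bare version number are hypothetical simplifications and do not come from the patent text.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# First-class operators are labelled before propagation starts.
FIRST_CLASS = {"create", "destroy", "scan", "filesink"}

@dataclass
class Node:
    op: str                                   # operator kind (hypothetical names)
    operands: List[str]                       # operand names used by this operator
    children: List["Node"] = field(default_factory=list)   # [left, right]
    versions: Dict[str, int] = field(default_factory=dict)  # name -> version number

def propagate(node: Node) -> None:
    """Label second-class operands bottom-up (leaf-to-root order)."""
    for child in node.children:
        propagate(child)
    if node.op in FIRST_CLASS:
        return  # already labelled in the earlier step
    for name in node.operands:
        for child in node.children:  # the left child is checked before the right
            if name in child.versions:
                node.versions[name] = child.versions[name]
                break

# Four-layer example: filesink <- select <- join <- two scans.
scan_a = Node("scan", ["a"], versions={"a": 0})
scan_b = Node("scan", ["b"], versions={"b": 1})
join = Node("join", ["a", "b"], [scan_a, scan_b])
select = Node("select", ["a"], [join])
root = Node("filesink", ["c"], [select], versions={"c": 2})
propagate(root)
```

After the call, the join carries the version numbers of both scans and the select inherits the version of `a` from the join, which mirrors the layer-by-layer example in the text.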
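Step 103: Determine the dependencies among the N query statements according to the determined version numbers of the operands of the N query statements.
The dependencies among the N statements may include flow dependencies, output dependencies, operator overlap, and operand overlap. A flow dependency means that the version number of the operand of the file-sink operator of an earlier-executed query statement is the same as the version number of the operand of the scan operator of another, later-executed query statement. An output dependency means that the operand of the file-sink operator of an earlier-executed query statement is defined again by the file-sink operator of another, later-executed query statement, that is, the later file-sink operator overwrites the operand of the earlier file-sink operator. Operator overlap means that two query statements contain the same number of operators of each type. Operand overlap means that all or some of the operands of two query statements have the same version numbers.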
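The four relationships can be checked with plain set operations once every operand carries a version number. The sketch below is a hedged illustration: the `Query` summary class and the `dependencies` function are hypothetical, and an output dependency is approximated as "both statements write the same storage path", which is an assumption about how overwriting would be detected.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Query:
    ops: List[str]          # operator kinds appearing in the plan tree
    scans: Dict[str, int]   # scanned storage path -> version number
    sinks: Dict[str, int]   # written storage path -> version number

def dependencies(earlier: Query, later: Query) -> Set[str]:
    """Classify the relationships between an earlier and a later statement."""
    found = set()
    # Flow: the later statement scans a version the earlier one wrote.
    if set(earlier.sinks.values()) & set(later.scans.values()):
        found.add("flow")
    # Output (assumed test): the later statement overwrites the same location.
    if set(earlier.sinks) & set(later.sinks):
        found.add("output")
    # Operator overlap: same number of operators of each kind.
    if Counter(earlier.ops) == Counter(later.ops):
        found.add("operator-overlap")
    # Operand overlap: at least one version number in common.
    versions_e = set(earlier.scans.values()) | set(earlier.sinks.values())
    versions_l = set(later.scans.values()) | set(later.sinks.values())
    if versions_e & versions_l:
        found.add("operand-overlap")
    return found

q1 = Query(["scan", "select", "filesink"], {"/t/a": 0}, {"/t/tmp": 1})
q2 = Query(["scan", "aggregate", "filesink"], {"/t/tmp": 1}, {"/t/out": 2})
q3 = Query(["scan", "select", "filesink"], {"/t/a": 0}, {"/t/tmp": 3})
```

Here `q2` reads what `q1` wrote, while `q3` rewrites the location that `q1` wrote and therefore receives a new version number for it.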
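Step 104: Perform inter-query optimization on the N query statements according to the dependencies among the N query statements and preset optimization rules.
Inter-query optimization optimizes multiple logical query plan trees as a whole; the optimization opportunities lie between logical query plan trees rather than within a single tree. The optimization rules include at least one of the following: (1) delete a query statement that has the same operand version numbers and the same operators as a first query statement, where the first query statement is any one of the N query statements; (2) preserve the query order among query statements that have a flow dependency, and optimize multiple query statements that have a flow dependency into one new query statement; (3) merge query statements that have the same operator and overlapping operands.
Optimization rule (1) targets query statements with operator overlap. Call the query statement that has the same operand version numbers and the same operators as the first query statement the second query statement. Because the two statements have the same operand version numbers and the same operators, their query results are the same and the second query statement can be deleted. If both statements are parsed into logical query plan trees, the first logical query plan tree (for the first statement) and the second logical query plan tree (for the second statement) share a common query subtree: a first query subtree of the first tree has the same structure as a second query subtree of the second tree, and the operand of every operator in the first query subtree has the same version number as the operand of the corresponding operator in the second query subtree. Exploiting common query subtrees avoids repeatedly computing the same query result, which reduces the overhead of database queries and improves their efficiency.
Optimization rule (2) targets query statements with a flow dependency. The predecessor of the earlier-executed file-sink operator can be connected directly to the successor of the later-executed scan operator, and the later-executed scan operator deleted. After the earlier part finishes, the output of the file-sink operator is then processed directly as the input of the scan operator, without first being written to the distributed storage system and read back. This reduces read and write overhead on the distributed storage system and improves query efficiency.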
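Rule (2) amounts to splicing one plan tree into another. The following sketch shows that idea under assumptions: `Node` and `fuse` are hypothetical names, each operator carries at most one version number, and the intermediate file is assumed to have no other consumer, so the producer's file-sink is dropped entirely.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    op: str
    version: Optional[int] = None            # version of the operand, if labelled
    children: List["Node"] = field(default_factory=list)

def fuse(producer_root: Node, consumer_root: Node) -> Node:
    """Replace the consumer's matching scan leaf with the producer's sub-plan."""
    assert producer_root.op == "filesink"
    upstream = producer_root.children[0]     # predecessor of the file-sink

    def splice(node: Node) -> None:
        for i, child in enumerate(node.children):
            if child.op == "scan" and child.version == producer_root.version:
                node.children[i] = upstream  # scan removed, plans connected
            else:
                splice(child)

    splice(consumer_root)
    return consumer_root

# Producer writes version 1; consumer scans version 1 and aggregates it.
producer = Node("filesink", 1, [Node("select", children=[Node("scan", 0)])])
consumer = Node("filesink", 2, [Node("aggregate", children=[Node("scan", 1)])])
fused = fuse(producer, consumer)
```

In the fused tree the aggregate reads directly from the select, so the intermediate result never needs a round trip through storage.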
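Optimization rule (3) targets query statements with operand overlap. If a first operator of the first query statement is the same as a second operator of the second query statement, and the version numbers of some operands of the first operator are the same as those of some operands of the second operator, then later in the query process, when physical query trees are generated for the two statements, the first operator and the second operator are merged into one task that queries the overlapping operands and the non-overlapping operands separately. Because the operators are merged into one task, the overlapping operands are queried only once, which reduces the scan overhead on the overlapping data.
Step 105: Execute the optimized query statements to obtain query results of the N query statements.
In this embodiment, the data query server determines symbol identifiers for the operands of N query statements according to the operators and operands in those statements, where each symbol identifier includes a version number of the operand; determines the dependencies among the N query statements according to the version numbers; and performs inter-query optimization according to those dependencies and preset optimization rules. Because the symbol identifiers of the operands are fixed and do not vary with the input, this embodiment provides a static technique for analyzing and maintaining inter-query data-flow relationships: it is independent of the input data, does not execute any part of any query statement, and does not monitor data accesses and updates during query execution. This improves the efficiency of inter-query optimization and reduces its overhead.
On the basis of Embodiment 1, the symbol identifiers of the operands in the N query statements may optionally further include a hot-data identifier. In that case, determining the symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements further includes: counting the number of times the operand of each scan operator in the N logical query plan trees is referenced; determining whether that number is greater than a hot-data threshold; and adding a hot-data identifier to each scan-operator operand whose reference count is greater than the hot-data threshold, where the hot-data identifier indicates that the data pointed to by the operand is hot data. Correspondingly, during execution of the optimized query statements, the optimized query statements that contain the hot-data identifier and have neither a flow dependency nor an output dependency may be executed concurrently. Alternatively, without changing the inter-query flow and output dependencies, the operators whose operands carry the hot-data identifier may be reordered to run consecutively, which improves the efficiency of access to hot data.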
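Hot-data marking is a reference count followed by a threshold test. A minimal sketch, assuming each plan tree has already been reduced to the list of version numbers its scan operators read (the `mark_hot` name and this input shape are illustrative only):

```python
from collections import Counter
from typing import Iterable, List, Set

def mark_hot(scan_versions_per_tree: Iterable[List[int]], threshold: int) -> Set[int]:
    """Return the version numbers referenced by scans more than `threshold` times."""
    counts = Counter(v for tree in scan_versions_per_tree for v in tree)
    return {version for version, n in counts.items() if n > threshold}

# Version 0 is scanned three times, version 1 twice, version 2 once.
hot = mark_hot([[0, 1], [0], [0, 2], [1]], threshold=2)
```

The comparison is strict, matching the "greater than the hot-data threshold" wording above.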
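Optionally, the symbol identifiers of the operands in the N query statements may further include a start active position and an end active position. In that case, determining the symbol identifiers of the operands of the N query statements further includes: for a first operand, determining its start active position according to the identifier of the scan operator that first references it and the sequence number of the logical query plan tree in which that scan operator is located, where the first operand is any one of the operands in the N query statements; and determining its end active position according to the identifier of the destroy operator used to destroy it and the sequence number of the logical query plan tree in which that destroy operator is located. Correspondingly, during execution of the optimized query statements, the storage space of the data indicated by the first operand may be released according to its end active position. Specifically, from the end active position the data query server can tell that the first operand is no longer active afterwards; the storage space occupied by an inactive operand should be released as soon as possible for use by other operands, which improves storage utilization. The data query server may also determine the active interval of the first operand from its start and end active positions. For an operand with a short active interval (for example, one that is active only within a single query) that is stored in the distributed storage system, optimization can move its storage location from the distributed storage system to the local disk or memory of the data query server, which reduces persistence and access overhead.
In addition, any one of the operands in the N query statements may be referred to as a second operand in the embodiments of the present invention. For the second operand, its start active position is determined according to the identifier of a first file-sink operator that first references it and the sequence number of the logical query plan tree in which the first file-sink operator is located, where the first file-sink operator is used to write data to the storage location referred to by the second operand. Its end active position is determined according to the identifier of a second file-sink operator that references it and the sequence number of the logical query plan tree in which the second file-sink operator is located, where the second file-sink operator is used to overwrite the data that the second operand points to and that was operated on by the first file-sink operator. Correspondingly, during execution of the optimized query statements, the storage space of the data indicated by the second operand is released according to its end active position. Note that the operator identifiers mentioned in this embodiment identify the order of operators and may specifically be operator IDs (Identity, ID for short).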
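Releasing storage at the end active position can be pictured as a lookup keyed by position. The sketch below assumes a position is a (plan-tree sequence number, operator ID) pair and that execution follows a known schedule of positions; `releases_by_position`, `run`, and the `free` callback are hypothetical names, and actual operator execution is elided.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Position = Tuple[int, int]   # (plan-tree sequence number, operator ID)

def releases_by_position(end_active: Dict[int, Position]) -> Dict[Position, List[int]]:
    """Group operand versions by the position after which they become inactive."""
    plan: Dict[Position, List[int]] = defaultdict(list)
    for version, pos in sorted(end_active.items()):
        plan[pos].append(version)
    return dict(plan)

def run(schedule: List[Position], end_active: Dict[int, Position],
        free: Callable[[int], None]) -> None:
    plan = releases_by_position(end_active)
    for pos in schedule:
        # The operator at `pos` would be executed here.
        for version in plan.get(pos, []):
            free(version)    # storage of this version is no longer needed

freed: List[int] = []
run([(1, 1), (1, 2), (2, 1)], {0: (1, 2), 1: (2, 1), 2: (1, 2)}, freed.append)
```

Versions 0 and 2 are released right after operator 2 of the first tree, and version 1 after operator 1 of the second tree.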
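On the basis of Embodiment 1, Embodiment 2 of the present invention uses a specific example to describe how to determine the symbol identifiers of the operands of the N query statements. FIG. 2 is a flowchart of determining the symbol identifiers of the operands of the N query statements according to Embodiment 2. As shown in FIG. 2, the method provided in this embodiment may include the following steps:
Step 201: Create a symbol identifier table in which each data record includes the following fields: version number, reference, definition, active, inactive, and path.
In this embodiment, a symbol identifier table is created first; it maintains the symbol identifiers of the operands of the N query statements. A table is not the only possible form of storage; a linked list, a hash table, or another structure may also be used.
Each data record in the symbol identifier table includes the following fields: version number, reference, definition, active, inactive, and path. The version number field stores the version number of the operand that the record corresponds to. The reference field stores the identifier of the scan operator that references the operand and the sequence number of the logical query plan tree in which that scan operator is located. The definition field stores the identifier of the file-sink operator that defines the operand and the sequence number of the logical query plan tree in which that file-sink operator is located. The path field stores the storage location of the operand. The active field stores either the identifier and tree sequence number of the scan operator that references the operand or the identifier and tree sequence number of the file-sink operator that defines the operand. The inactive field stores either the identifier and tree sequence number of the destroy operator that destroys the operand or the identifier and tree sequence number of the file-sink operator that retires the operand.
Step 202: Initialize the value of the version number field.
For example, the value of the version number field is initialized to -1, and it is then incremented by 1 each time a data record is added to the symbol identifier table.
Step 203: Traverse the N logical query plan trees one by one in order, and generate a data record for the operand of each create operator, scan operator, file-sink operator, and destroy operator in each tree, where the order of the N logical query plan trees is the same as the input order of the N query statements.
The data records for the operands of the create, scan, file-sink, and destroy operators in each logical query plan tree can be generated in either of the following two ways. The first way is as follows: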
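When the i-th logical query plan tree is traversed, if it includes a create operator, a data record is created for the operand of that create operator, and the correspondence between the version number of that operand and its storage location is saved in a mapping table. In this embodiment, the mapping table stores the correspondence between operand version numbers and operand storage locations; i is the sequence number of a logical query plan tree among the N trees, starts at 1, and is an integer from 1 to N inclusive. Creating a data record for the operand of the create operator of the i-th tree specifically means: adding 1 to the current value of the version number field of the symbol identifier table to obtain the version number of the operand, and writing that version number to the version number field of the new record; the other fields of the record are left empty.
If the i-th tree includes a destroy operator, the mapping table is searched, by the storage location of the operand of that destroy operator, for the corresponding version number. If a version number is found, the symbol identifier table is searched by that version number for the operand's data record, and the sequence number i of the tree and the identifier of the destroy operator are added to the inactive field of that record.
If the i-th tree includes a scan operator, the mapping table is searched, by the storage location of the operand of that scan operator, for the corresponding version number.
If a version number is found, the symbol identifier table is searched by that version number for the data record of the scan operator's operand. Once the record is found, it is determined whether its active field is empty. If the active field is empty, the sequence number i of the tree and the identifier of the scan operator are added to both the reference field and the active field of the record. If the active field is not empty, they are added to the reference field only. If no version number is found, a data record is created for the operand of the scan operator, the correspondence between the operand's version number and its storage location is saved in the mapping table, and the sequence number i and the identifier of the scan operator are added to the reference field and the active field of the newly created record.
In this embodiment, if a version number corresponding to the storage location of the scan operator's operand is found in the mapping table, the operand of a scan operator in the first i-1 logical query plan trees is the same as the operand of the scan operator of the i-th tree, so a version number has already been generated for it. If none is found, the operand of the scan operator of the i-th tree appears for the first time and no version number has been generated for it yet.
If the i-th tree includes a file-sink operator, the mapping table is searched, by the storage location of the operand of that file-sink operator, for the corresponding version number. If a version number is found, the symbol identifier table is searched by that version number for the data record of the file-sink operator's operand; once the record is found, it is determined whether its definition field is empty.
If the definition field of that record is empty, the sequence number i of the tree and the identifier of the file-sink operator are first added to the inactive field of that record. A new data record is then created for the operand of the file-sink operator, the correspondence between the operand's new version number and its storage location is saved in the mapping table, and the sequence number i and the identifier of the file-sink operator are added to the definition field and the active field of the newly created record.
In the first way, while each logical query plan tree is traversed, version numbers are generated for its operands in the following operator order: create operator, destroy operator, scan operator, file-sink operator. Other possible implementations of the present invention need not follow this order. Note that if version numbers are generated in another order, the mapping table must be searched before each data record is created; a new data record is generated for the operand of the current operator only if no corresponding version number is found in the mapping table.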
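The first way can be sketched as a single pass over (tree number, operator ID, operator kind, storage path) tuples. This is a minimal illustration, not the patent's implementation: the class and method names are hypothetical, the path is stored on the record at creation time, the first version number is 0 (that is, -1 plus 1), and a file-sink that finds an existing version is always treated as an overwrite, an assumption made because the text spells out only one branch of that case.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Position = Tuple[int, int]  # (plan-tree sequence number, operator ID)

@dataclass
class Record:
    version: int
    path: str
    references: List[Position] = field(default_factory=list)
    definition: Optional[Position] = None
    active: Optional[Position] = None
    inactive: Optional[Position] = None

class SymbolTable:
    def __init__(self) -> None:
        self.records: List[Record] = []     # list index == version number
        self.current: Dict[str, int] = {}   # mapping table: path -> latest version

    def _new(self, path: str) -> Record:
        rec = Record(version=len(self.records), path=path)
        self.records.append(rec)
        self.current[path] = rec.version
        return rec

    def visit(self, tree_no: int, op_id: int, op: str, path: str) -> Optional[int]:
        """Process one first-class operator; return the operand's version."""
        pos = (tree_no, op_id)
        old = self.current.get(path)
        if op == "create":
            return self._new(path).version
        if op == "destroy":
            if old is not None:
                self.records[old].inactive = pos
            return old
        if op == "scan":
            rec = self.records[old] if old is not None else self._new(path)
            rec.references.append(pos)
            if rec.active is None:
                rec.active = pos            # first reference starts the interval
            return rec.version
        if op == "filesink":
            if old is not None:
                self.records[old].inactive = pos   # old data is overwritten here
            rec = self._new(path)
            rec.definition = rec.active = pos
            return rec.version
        raise ValueError(f"unknown operator kind: {op}")

table = SymbolTable()
v = [
    table.visit(1, 1, "scan", "/a"),
    table.visit(1, 2, "filesink", "/tmp"),
    table.visit(2, 1, "scan", "/tmp"),
    table.visit(2, 2, "filesink", "/tmp"),
    table.visit(3, 1, "scan", "/a"),
]
```

In this run the two scans of `/a` share version 0, the scan of `/tmp` sees the version written by the first tree, and the second write to `/tmp` retires that version and opens a new one.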
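In the first way, the version numbers of the operands of the N logical query plan trees are generated in a single traversal. In the second way, multiple traversals are needed, and each traversal generates version numbers only for the operands of one specific kind of operator. The second way also requires the symbol identifier table to be created first, with the same fields as described above. The difference is that here the mapping table stores the correspondence from an operand's storage location to a two-tuple consisting of the operand's version number and the sequence number of the logical query plan tree in which the operand is located. The second way is as follows:
First, the N logical query plan trees are traversed once in order, and a data record is created for the operand of each file-sink operator. When the record is created, the current value of the version number field of the symbol identifier table plus 1 is used as the version number of the operand of the current file-sink operator; that version number is written to the version number field of the current record, and the identifier of the current file-sink operator and the sequence number of the tree in which it is located are added to the definition field and the active field of the current record. Then the correspondence from the storage location of the operand to the two-tuple consisting of the operand's version number and the sequence number of the tree in which the current file-sink operator is located is saved in the mapping table.
Next, the N trees are traversed a second time in order, and each destroy operator is processed as follows: (1) The mapping table is searched by the storage location of the operand of the current destroy operator to obtain the set of all two-tuples, each consisting of an operand version number and a tree sequence number, that correspond to that storage location. (2) From that set, all candidate logical query plan trees whose sequence number is less than that of the tree containing the current destroy operator are selected; the candidate with the largest sequence number is determined, the two-tuple containing that largest sequence number is found in the set, and the version number in that two-tuple is obtained. (3) The symbol identifier table is searched by that version number for the corresponding data record, and the identifier of the current destroy operator and the sequence number of the tree in which it is located are added to the inactive field of that record. If no such version number is found, processing moves on to the next destroy operator.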
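Step (2) of the destroy pass is a "latest earlier definition" lookup. A small sketch under assumptions: the two-tuples for one storage location are given as a list of (version number, tree sequence number) pairs, and `version_before` is a hypothetical helper name.

```python
from typing import List, Optional, Tuple

def version_before(pairs: List[Tuple[int, int]], tree_no: int) -> Optional[int]:
    """Pick the version whose tree number is the largest one below `tree_no`."""
    candidates = [(n, version) for version, n in pairs if n < tree_no]
    return max(candidates)[1] if candidates else None

# Versions 0, 3 and 5 were written to the same path by trees 1, 4 and 7.
pairs = [(0, 1), (3, 4), (5, 7)]
```

A destroy operator in tree 5 therefore retires version 3, the most recent version written before it; a destroy operator in tree 1 finds no candidate and is skipped.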
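Then the N trees are traversed a third time in order, and each scan operator is processed as follows: (1) The mapping table is searched by the storage location of the operand of the current scan operator to obtain the set of all two-tuples, each consisting of an operand version number and a tree sequence number, that correspond to that storage location. (2) From that set, all candidate logical query plan trees whose sequence number is greater than or equal to that of the tree containing the current scan operator are selected; the candidate with the smallest sequence number is determined, the two-tuple containing that smallest sequence number is found in the set, and the version number in that two-tuple is obtained. (3) The symbol identifier table is searched by that version number for the corresponding data record, and the identifier of the current scan operator and the sequence number of the tree in which it is located are added to the reference field of that record. (4) If the active field of that record is empty, the identifier of the current scan operator and the sequence number of its tree are also added to the active field of that record. If no such version number is found, a new data record is created in the symbol identifier table: the largest value of the version number field plus 1 is used as the version number of the operand of the current scan operator and written to the version number field of the new record, the storage location of the operand is written to its path field, and the identifier of the current scan operator and the sequence number of its tree are added to its reference field and active field.
Finally, the N trees are traversed a fourth time in reverse order, and each create operator is processed as follows: (1) The mapping table is searched by the storage location of the operand of the current create operator to obtain the set of all two-tuples, each consisting of an operand version number and a tree sequence number, that correspond to that storage location. (2) From that set, all candidate logical query plan trees whose sequence number is greater than or equal to that of the tree containing the current create operator are selected; the candidate with the smallest sequence number is determined, the two-tuple containing that smallest sequence number is found in the set, and the version number in that two-tuple is obtained. (3) The symbol identifier table is searched by that version number for the corresponding data record, and it is determined whether the path field of that record is empty. If it is empty, the storage location of the operand of the current create operator is written to the path field of that record; if it is not empty, processing moves on to the next create operator.
In the second way, version numbers are generated for the operands of the logical query plan trees in the following operator order, one traversal per operator kind: file-sink operator, destroy operator, scan operator, create operator. In the second way, the fourth traversal visits the N logical query plan trees in reverse order. If the N trees are instead to be traversed in forward order, then when version numbers are generated for scan operators, a new data record is generated for the operand of the current scan operator whenever no qualifying version number is found in the mapping table. In this embodiment, version numbers are generated for the operands of all operators in each logical query plan tree and added to the trees, so that the N logical query plan trees can later be optimized according to the operand version numbers during query optimization.
On the basis of Embodiment 2, when counting the number of times the operand of each scan operator in the N logical query plan trees is referenced, the data query apparatus may also compute the start and end active positions of all operands in the symbol identifier table from the values of the active and inactive fields of each data record. Specifically, it first determines whether the active and inactive fields of the current record are empty. If they are empty, the operand of the current record is never active during the whole batch query, and the record can be deleted from the symbol identifier table. If they are not empty, the start active position of the operand is determined from the operator identifier and tree sequence number in the active field, and the end active position is determined from the operator identifier and tree sequence number in the inactive field. The active field of a record holds exactly one operator, which may be a scan operator or a file-sink operator; the inactive field also holds exactly one operator, which may be a destroy operator or a file-sink operator.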
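Deriving active intervals from the table is a straightforward filter. The sketch below assumes each record is a dictionary with `version`, `active`, and `inactive` keys holding (tree sequence number, operator ID) pairs or `None`; `active_intervals` and `single_query` are hypothetical helper names, and `single_query` illustrates the "active only within one query" case mentioned in Embodiment 1.

```python
from typing import Dict, List, Optional, Tuple

Position = Tuple[int, int]                    # (tree sequence number, operator ID)
Interval = Tuple[Optional[Position], Optional[Position]]

def active_intervals(table: List[dict]) -> Tuple[List[dict], Dict[int, Interval]]:
    """Drop records that are never active and collect intervals for the rest."""
    kept: List[dict] = []
    intervals: Dict[int, Interval] = {}
    for rec in table:
        if rec["active"] is None and rec["inactive"] is None:
            continue                          # never active in this batch
        kept.append(rec)
        intervals[rec["version"]] = (rec["active"], rec["inactive"])
    return kept, intervals

def single_query(interval: Interval) -> bool:
    """True if the operand starts and ends its life inside one plan tree."""
    start, end = interval
    return start is not None and end is not None and start[0] == end[0]

table = [
    {"version": 0, "active": (1, 1), "inactive": None},
    {"version": 1, "active": (1, 3), "inactive": (1, 5)},
    {"version": 2, "active": None, "inactive": None},
]
kept, intervals = active_intervals(table)
```

Version 1 lives entirely inside the first tree, so it would be a candidate for local rather than distributed storage; version 2 is dropped because both fields are empty.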
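FIG. 3 is a schematic structural diagram of a data query server according to Embodiment 3 of the present invention. As shown in FIG. 3, the data query server provided in this embodiment includes a receiving module 11, an identifier determining module 12, a relationship determining module 13, an optimization module 14, and a query module 15.
The receiving module 11 is configured to receive N query statements to be executed, where N is a positive integer not less than 2.
The identifier determining module 12 is configured to determine symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements, where an operator indicates an operation to be performed, an operand indicates the storage location of the data to be operated on by an operator in the N query statements, a symbol identifier includes a version number of the operand, operands that refer to the same data have the same version number, operands that refer to different data have different version numbers, and the operators include at least a create operator, a destroy operator, a scan operator, and a file-sink operator.
The relationship determining module 13 is configured to determine dependencies among the N query statements according to the version numbers of the operands of the N query statements determined by the identifier determining module 12.
The optimization module 14 is configured to perform inter-query optimization on the N query statements according to the dependencies among the N query statements and preset optimization rules.
The query module 15 is configured to execute the optimized query statements to obtain query results of the N query statements.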
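Optionally, the identifier determining module 12 is specifically configured to: obtain N logical query plan trees corresponding to the N query statements, where one query statement corresponds to one logical query plan tree; add symbol identifiers to the operands of first-class operators in the N logical query plan trees, where the first-class operators include the create operator, the destroy operator, the scan operator, and the file-sink operator; and perform the following operation on each of the N logical query plan trees: add symbol identifiers to the operands of second-class operators in a first logical query plan tree according to the topological order of the first logical query plan tree, the symbol identifiers of the operands of the first-class operators in the first logical query plan tree, and a preset adding rule, where the first logical query plan tree is any one of the N logical query plan trees, and the second-class operators are the operators other than the first-class operators.
In this embodiment, the root node of the first logical query plan tree includes a file-sink operator, the leaf nodes of the first logical query plan tree include file scan operators, and the internal nodes of the first logical query plan tree include second-class operators, create operators, and destroy operators, where the internal nodes are the nodes other than the leaf nodes and the root node. The adding rule includes performing the following operations on each second-class operator in the first logical query plan tree:
If an operand of a first operator is the same as an operand of the left child node of the first operator, the operand of the first operator is given the same symbol identifier as that operand of the left child node, where the first operator is any one of the second-class operators; if an operand of the first operator is the same as an operand of the right child node of the first operator, the operand of the first operator is given the same symbol identifier as that operand of the right child node.
Optionally, the optimization rules include at least one of the following rules: deleting a query statement that has the same operand version numbers and the same operators as a first query statement, where the first query statement is any one of the N query statements; preserving the query order among query statements that have a flow dependency, and optimizing multiple query statements that have a flow dependency into one new query statement, where a flow dependency means that the version number of the operand of the file-sink operator of an earlier-executed query statement is the same as the version number of an operand of another, later-executed query statement; and merging query statements that have the same operator and overlapping operands.
Optionally, the symbol identifiers of the operands in the N query statements further include a hot-data identifier, and the identifier determining module 12 is further configured to: count the number of times the operand of each scan operator in the N logical query plan trees is referenced; determine whether that number is greater than a hot-data threshold; and add a hot-data identifier to each scan-operator operand in the N logical query plan trees whose reference count is greater than the hot-data threshold, where the hot-data identifier indicates that the data pointed to by the operand carrying it is hot data. Correspondingly, the query module 15 is further configured to: during execution of the optimized query statements, concurrently execute the optimized query statements that contain the hot-data identifier and have neither a flow dependency nor an output dependency.
Optionally, the symbol identifiers of the operands in the N query statements further include a start active position and an end active position, and the identifier determining module 12 is further configured to: for a first operand, determine the start active position of the first operand according to the identifier of the scan operator that first references the first operand and the sequence number of the logical query plan tree in which that scan operator is located, where the first operand is any one of the operands in the N query statements; and determine the end active position of the first operand according to the identifier of the destroy operator used to destroy the first operand and the sequence number of the logical query plan tree in which that destroy operator is located. Correspondingly, the query module 15 is further configured to: during execution of the optimized query statements, release the storage space of the data indicated by the first operand according to the end active position of the first operand.
Optionally, the identifier determining module 12 is further configured to: for a second operand, determine the start active position of the second operand according to the identifier of a first file-sink operator that first references the second operand and the sequence number of the logical query plan tree in which the first file-sink operator is located, where the first file-sink operator is used to write data to the storage location referred to by the second operand, and the second operand is any one of the operands in the N query statements; and determine the end active position of the second operand according to the identifier of a second file-sink operator that references the second operand and the sequence number of the logical query plan tree in which the second file-sink operator is located, where the second file-sink operator is used to overwrite the data that the second operand points to and that was operated on by the first file-sink operator. The query module 15 is further configured to: during execution of the optimized query statements, release the storage space of the data indicated by the second operand according to the end active position of the second operand.
The data query server provided in this embodiment may be used to perform the methods of Embodiment 1 and Embodiment 2. The specific implementations and technical effects are similar, and details are not described herein again.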
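FIG. 4 is a schematic structural diagram of a data query server according to Embodiment 4 of the present invention. As shown in FIG. 4, the data query server 200 provided in this embodiment includes a processor 21, a memory 22, a communications interface 23, and a system bus 24. The memory 22 and the communications interface 23 are connected to and communicate with the processor 21 through the system bus 24. The memory 22 is configured to store computer-executable instructions, the communications interface 23 is configured to communicate with other devices, and the processor 21 is configured to run the computer-executable instructions to perform the following method:
receiving N query statements to be executed, where N is a positive integer not less than 2;
determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements, where an operator indicates an operation to be performed, an operand indicates the storage location of the data to be operated on by an operator in the N query statements, a symbol identifier includes a version number of the operand, operands that refer to the same data have the same version number, operands that refer to different data have different version numbers, and the operators include at least a create operator, a destroy operator, a scan operator, and a file-sink operator;
determining dependencies among the N query statements according to the determined version numbers of the operands of the N query statements;
performing inter-query optimization on the N query statements according to the dependencies among the N query statements and preset optimization rules; and
executing the optimized query statements to obtain query results of the N query statements.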
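Optionally, the processor 21 determines the symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements specifically by:
obtaining N logical query plan trees corresponding to the N query statements, where one query statement corresponds to one logical query plan tree;
adding symbol identifiers to the operands of first-class operators in the N logical query plan trees, where the first-class operators include the create operator, the destroy operator, the scan operator, and the file-sink operator; and
performing the following operation on each of the N logical query plan trees:
adding symbol identifiers to the operands of second-class operators in a first logical query plan tree according to the topological order of the first logical query plan tree, the symbol identifiers of the operands of the first-class operators in the first logical query plan tree, and a preset adding rule, where the first logical query plan tree is any one of the N logical query plan trees, and the second-class operators are the operators other than the first-class operators.
In this embodiment, the root node of the first logical query plan tree includes a file-sink operator, the leaf nodes of the first logical query plan tree include file scan operators, and the internal nodes of the first logical query plan tree include second-class operators, create operators, and destroy operators, where the internal nodes are the nodes other than the leaf nodes and the root node. The adding rule includes performing the following operations on each second-class operator in the first logical query plan tree:
If an operand of a first operator is the same as an operand of the left child node of the first operator, the operand of the first operator is given the same symbol identifier as that operand of the left child node, where the first operator is any one of the second-class operators; if an operand of the first operator is the same as an operand of the right child node of the first operator, the operand of the first operator is given the same symbol identifier as that operand of the right child node.
Optionally, the optimization rules include at least one of the following rules: deleting a query statement that has the same operand version numbers and the same operators as a first query statement, where the first query statement is any one of the N query statements; preserving the query order among query statements that have a flow dependency, and optimizing multiple query statements that have a flow dependency into one new query statement, where a flow dependency means that the version number of the operand of the file-sink operator of an earlier-executed query statement is the same as the version number of an operand of another, later-executed query statement; and merging query statements that have the same operator and overlapping operands.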
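Optionally, the symbol identifiers of the operands in the N query statements further include a hot-data identifier, and the processor 21 is further configured to: count the number of times the operand of each scan operator in the N logical query plan trees is referenced; determine whether that number is greater than a hot-data threshold; and add a hot-data identifier to each scan-operator operand in the N logical query plan trees whose reference count is greater than the hot-data threshold, where the hot-data identifier indicates that the data pointed to by the operand carrying it is hot data. Subsequently, during execution of the optimized query statements, the optimized query statements that contain the hot-data identifier and have neither a flow dependency nor an output dependency are executed concurrently.
Optionally, the symbol identifiers of the operands in the N query statements further include a start active position and an end active position, and the processor 21 is further configured to: for a first operand, determine the start active position of the first operand according to the identifier of the scan operator that first references the first operand and the sequence number of the logical query plan tree in which that scan operator is located, where the first operand is any one of the operands in the N query statements; and determine the end active position of the first operand according to the identifier of the destroy operator used to destroy the first operand and the sequence number of the logical query plan tree in which that destroy operator is located. Subsequently, during execution of the optimized query statements, the storage space of the data indicated by the first operand is released according to the end active position of the first operand.
The processor 21 is further configured to: for a second operand, determine the start active position of the second operand according to the identifier of a first file-sink operator that first references the second operand and the sequence number of the logical query plan tree in which the first file-sink operator is located, where the first file-sink operator is used to write data to the storage location referred to by the second operand, and the second operand is any one of the operands in the N query statements; and determine the end active position of the second operand according to the identifier of a second file-sink operator that references the second operand and the sequence number of the logical query plan tree in which the second file-sink operator is located, where the second file-sink operator is used to overwrite the data that the second operand points to and that was operated on by the first file-sink operator. Subsequently, during execution of the optimized query statements, the storage space of the data indicated by the second operand is released according to the end active position of the second operand.
The data query server provided in this embodiment may be used to perform the methods of Embodiment 1 and Embodiment 2. The specific implementations and technical effects are similar, and details are not described herein again.
An embodiment of the present invention further provides a computer program product for data processing, including a computer-readable storage medium that stores program code, where the program code includes instructions for executing the method procedure described in any one of the foregoing method embodiments. A person of ordinary skill in the art can understand that the foregoing storage medium includes any non-transitory machine-readable medium that can store program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a random-access memory (RAM), a solid state disk (SSD), or a non-volatile memory.
It should be noted that the embodiments provided in this application are merely illustrative. A person skilled in the art can clearly understand that, for convenience and brevity, each of the foregoing embodiments has its own emphasis; for a part not described in detail in one embodiment, refer to the related description in other embodiments. The features disclosed in the embodiments, claims, and accompanying drawings of the present invention may exist independently or in combination. Features described in hardware form in the embodiments may be implemented by software, and vice versa. This is not limited herein.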

Claims (14)

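  1. A batch data query method, comprising:
    receiving N query statements to be executed, wherein N is a positive integer not less than 2;
    determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements, wherein an operator indicates an operation to be performed, an operand indicates the storage location of the data to be operated on by an operator in the N query statements, a symbol identifier comprises a version number of the operand, operands that refer to the same data have the same version number, operands that refer to different data have different version numbers, and the operators comprise at least a create operator, a destroy operator, a scan operator, and a file-sink operator;
    determining dependencies among the N query statements according to the determined version numbers of the operands of the N query statements;
    performing inter-query optimization on the N query statements according to the dependencies among the N query statements and preset optimization rules; and
    executing the optimized query statements to obtain query results of the N query statements.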
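  2. The method according to claim 1, wherein the determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements comprises:
    obtaining N logical query plan trees corresponding to the N query statements, wherein one query statement corresponds to one logical query plan tree;
    adding symbol identifiers to the operands of first-class operators in the N logical query plan trees, wherein the first-class operators comprise the create operator, the destroy operator, the scan operator, and the file-sink operator; and
    performing the following operation on each of the N logical query plan trees:
    adding symbol identifiers to the operands of second-class operators in a first logical query plan tree according to the topological order of the first logical query plan tree, the symbol identifiers of the operands of the first-class operators in the first logical query plan tree, and a preset adding rule, wherein the first logical query plan tree is any one of the N logical query plan trees, and the second-class operators are the operators other than the first-class operators.
  3. The method according to claim 2, wherein the root node of the first logical query plan tree comprises a file-sink operator, the leaf nodes of the first logical query plan tree comprise file scan operators, and the internal nodes of the first logical query plan tree comprise second-class operators, create operators, and destroy operators, wherein the internal nodes are the nodes other than the leaf nodes and the root node; and
    the adding rule comprises performing the following operations on each second-class operator in the first logical query plan tree:
    if an operand of a first operator is the same as an operand of the left child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the left child node, wherein the first operator is any one of the second-class operators; and
    if an operand of the first operator is the same as an operand of the right child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the right child node.
  4. The method according to any one of claims 1 to 3, wherein the optimization rules comprise at least one of the following rules:
    deleting a query statement that has the same operand version numbers and the same operators as a first query statement, wherein the first query statement is any one of the N query statements;
    preserving the query order among query statements that have a flow dependency, and optimizing multiple query statements that have a flow dependency into one new query statement, wherein a flow dependency means that the version number of the operand of the file-sink operator of an earlier-executed query statement is the same as the version number of an operand of another, later-executed query statement; and
    merging query statements that have the same operator and overlapping operands.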
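  5. The method according to claim 2, wherein the symbol identifiers of the operands in the N query statements further comprise a hot-data identifier;
    the determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements further comprises:
    counting the number of times the operand of each scan operator in the N logical query plan trees is referenced;
    determining whether the number of times the operand of each scan operator in the N logical query plan trees is referenced is greater than a hot-data threshold; and
    adding a hot-data identifier to each scan-operator operand in the N logical query plan trees whose reference count is greater than the hot-data threshold, wherein the hot-data identifier indicates that the data pointed to by the operand carrying it is hot data; and
    the method further comprises:
    during execution of the optimized query statements, concurrently executing the optimized query statements that contain the hot-data identifier and have neither a flow dependency nor an output dependency.
  6. The method according to claim 2 or 5, wherein the symbol identifiers of the operands in the N query statements further comprise a start active position and an end active position;
    the determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements further comprises:
    for a first operand, determining the start active position of the first operand according to the identifier of the scan operator that first references the first operand and the sequence number of the logical query plan tree in which that scan operator is located, wherein the first operand is any one of the operands in the N query statements; and
    determining the end active position of the first operand according to the identifier of the destroy operator used to destroy the first operand and the sequence number of the logical query plan tree in which that destroy operator is located; and
    the method further comprises:
    during execution of the optimized query statements, releasing the storage space of the data indicated by the first operand according to the end active position of the first operand.
  7. The method according to claim 6, wherein the determining symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements further comprises:
    for a second operand, determining the start active position of the second operand according to the identifier of a first file-sink operator that first references the second operand and the sequence number of the logical query plan tree in which the first file-sink operator is located, wherein the first file-sink operator is used to write data to the storage location referred to by the second operand, and the second operand is any one of the operands in the N query statements; and
    determining the end active position of the second operand according to the identifier of a second file-sink operator that references the second operand and the sequence number of the logical query plan tree in which the second file-sink operator is located, wherein the second file-sink operator is used to overwrite the data that the second operand points to and that was operated on by the first file-sink operator; and
    the method further comprises:
    during execution of the optimized query statements, releasing the storage space of the data indicated by the second operand according to the end active position of the second operand.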
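  8. A data query server, comprising:
    a receiving module, configured to receive N query statements to be executed, wherein N is a positive integer not less than 2;
    an identifier determining module, configured to determine symbol identifiers of the operands of the N query statements according to the operators and operands in the N query statements, wherein an operator indicates an operation to be performed, an operand indicates the storage location of the data to be operated on by an operator in the N query statements, a symbol identifier comprises a version number of the operand, operands that refer to the same data have the same version number, operands that refer to different data have different version numbers, and the operators comprise at least a create operator, a destroy operator, a scan operator, and a file-sink operator;
    a relationship determining module, configured to determine dependencies among the N query statements according to the version numbers of the operands of the N query statements determined by the identifier determining module;
    an optimization module, configured to perform inter-query optimization on the N query statements according to the dependencies among the N query statements and preset optimization rules; and
    a query module, configured to execute the optimized query statements to obtain query results of the N query statements.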
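  9. The server according to claim 8, wherein the identifier determining module is specifically configured to:
    obtain N logical query plan trees corresponding to the N query statements, wherein one query statement corresponds to one logical query plan tree;
    add symbol identifiers to the operands of first-class operators in the N logical query plan trees, wherein the first-class operators comprise the create operator, the destroy operator, the scan operator, and the file-sink operator; and
    perform the following operation on each of the N logical query plan trees:
    add symbol identifiers to the operands of second-class operators in a first logical query plan tree according to the topological order of the first logical query plan tree, the symbol identifiers of the operands of the first-class operators in the first logical query plan tree, and a preset adding rule, wherein the first logical query plan tree is any one of the N logical query plan trees, and the second-class operators are the operators other than the first-class operators.
  10. The server according to claim 9, wherein the root node of the first logical query plan tree comprises a file-sink operator, the leaf nodes of the first logical query plan tree comprise file scan operators, and the internal nodes of the first logical query plan tree comprise second-class operators, create operators, and destroy operators, wherein the internal nodes are the nodes other than the leaf nodes and the root node; and
    the adding rule comprises performing the following operations on each second-class operator in the first logical query plan tree:
    if an operand of a first operator is the same as an operand of the left child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the left child node, wherein the first operator is any one of the second-class operators; and
    if an operand of the first operator is the same as an operand of the right child node of the first operator, adding to the operand of the first operator the same symbol identifier as that of the operand of the right child node.
  11. The server according to any one of claims 8 to 10, wherein the optimization rules comprise at least one of the following rules:
    deleting a query statement that has the same operand version numbers and the same operators as a first query statement, wherein the first query statement is any one of the N query statements;
    preserving the query order among query statements that have a flow dependency, and optimizing multiple query statements that have a flow dependency into one new query statement, wherein a flow dependency means that the version number of the operand of the file-sink operator of an earlier-executed query statement is the same as the version number of an operand of another, later-executed query statement; and
    merging query statements that have the same operator and overlapping operands.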
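  12. The server according to claim 9, wherein the symbol identifiers of the operands in the N query statements further comprise a hot-data identifier, and the identifier determining module is further configured to:
    count the number of times the operand of each scan operator in the N logical query plan trees is referenced;
    determine whether the number of times the operand of each scan operator in the N logical query plan trees is referenced is greater than a hot-data threshold; and
    add a hot-data identifier to each scan-operator operand in the N logical query plan trees whose reference count is greater than the hot-data threshold, wherein the hot-data identifier indicates that the data pointed to by the operand carrying it is hot data; and
    the query module is further configured to:
    during execution of the optimized query statements, concurrently execute the optimized query statements that contain the hot-data identifier and have neither a flow dependency nor an output dependency.
  13. The server according to claim 9 or 12, wherein the symbol identifiers of the operands in the N query statements further comprise a start active position and an end active position, and the identifier determining module is further configured to:
    for a first operand, determine the start active position of the first operand according to the identifier of the scan operator that first references the first operand and the sequence number of the logical query plan tree in which that scan operator is located, wherein the first operand is any one of the operands in the N query statements; and
    determine the end active position of the first operand according to the identifier of the destroy operator used to destroy the first operand and the sequence number of the logical query plan tree in which that destroy operator is located; and
    the query module is further configured to:
    during execution of the optimized query statements, release the storage space of the data indicated by the first operand according to the end active position of the first operand.
  14. The server according to claim 13, wherein the identifier determining module is further configured to:
    for a second operand, determine the start active position of the second operand according to the identifier of a first file-sink operator that first references the second operand and the sequence number of the logical query plan tree in which the first file-sink operator is located, wherein the first file-sink operator is used to write data to the storage location referred to by the second operand, and the second operand is any one of the operands in the N query statements; and
    determine the end active position of the second operand according to the identifier of a second file-sink operator that references the second operand and the sequence number of the logical query plan tree in which the second file-sink operator is located, wherein the second file-sink operator is used to overwrite the data that the second operand points to and that was operated on by the first file-sink operator; and
    the query module is further configured to:
    during execution of the optimized query statements, release the storage space of the data indicated by the second operand according to the end active position of the second operand.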
PCT/CN2016/074141 2015-05-06 2016-02-19 Batch data query method and apparatus WO2016177027A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP16789021.9A EP3282370A4 (en) 2015-05-06 2016-02-19 Batch data query method and device
US15/804,346 US10678789B2 (en) 2015-05-06 2017-11-06 Batch data query method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510226374.XA CN106202102B (zh) 2015-05-06 2015-05-06 Batch data query method and apparatus
CN201510226374.X 2015-05-06

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/804,346 Continuation US10678789B2 (en) 2015-05-06 2017-11-06 Batch data query method and apparatus

Publications (1)

Publication Number Publication Date
WO2016177027A1 true WO2016177027A1 (zh) 2016-11-10

Family

ID=57217338

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/074141 WO2016177027A1 (zh) 2015-05-06 2016-02-19 Batch data query method and apparatus

Country Status (4)

Country Link
US (1) US10678789B2 (zh)
EP (1) EP3282370A4 (zh)
CN (1) CN106202102B (zh)
WO (1) WO2016177027A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143406A (zh) * 2018-11-06 2020-05-12 北京京东尚科信息技术有限公司 数据库数据比对方法和比对系统
CN111159227B (zh) * 2019-12-20 2023-04-14 建信金融科技有限责任公司 数据查询方法、装置、设备及存储介质
CN111753028B (zh) * 2020-07-02 2023-08-25 上海达梦数据库有限公司 数据传输方法、装置、设备及存储介质
CN116089476B (zh) * 2023-04-07 2023-07-04 北京宝兰德软件股份有限公司 数据查询方法、装置及电子设备

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101110030A (zh) * 2007-08-23 2008-01-23 南京联创科技股份有限公司 基于java的数据库持久层的开发方法
CN101221578A (zh) * 2008-02-01 2008-07-16 中国建设银行股份有限公司 数据筛选的方法、装置以及证券化贷款的筛选方法、装置
CN102859521A (zh) * 2010-04-30 2013-01-02 国际商业机器公司 数据库应用的集中控制

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8239406B2 (en) * 2008-12-31 2012-08-07 International Business Machines Corporation Expression tree data structure for representing a database query
CN102609451B (zh) * 2012-01-11 2014-12-17 华中科技大学 面向流式数据处理的sql查询计划生成方法
CN103559300B (zh) * 2013-11-13 2017-06-13 曙光信息产业(北京)有限公司 数据的查询方法和查询装置
CN104036007B (zh) * 2014-06-23 2017-12-12 北京京东尚科信息技术有限公司 一种分布式数据库查询方法及装置
CN104063486B (zh) * 2014-07-03 2017-07-11 四川中亚联邦科技有限公司 一种大数据分布式存储方法和系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3282370A4 *

Also Published As

Publication number Publication date
CN106202102A (zh) 2016-12-07
US20180060392A1 (en) 2018-03-01
EP3282370A1 (en) 2018-02-14
EP3282370A4 (en) 2018-04-25
CN106202102B (zh) 2019-04-05
US10678789B2 (en) 2020-06-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16789021

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE