CN109947796B

CN109947796B - Caching method for query intermediate result set of distributed database system

Info

Publication number: CN109947796B
Application number: CN201910166410.6A
Authority: CN
Inventors: 杜金莲; 陈子昂; 金雪云; 苏航; 李童
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2019-04-12
Filing date: 2019-04-12
Publication date: 2021-04-30
Anticipated expiration: 2039-04-12
Also published as: CN109947796A

Abstract

The invention discloses a method for caching the intermediate result sets of queries in a distributed database system, which comprises the following steps: recording the intermediate result sets returned by sub-query tasks and establishing a storage model for the intermediate result cache data set; establishing a mechanism for identifying the query range of a query statement; and applying timeout expiration to the intermediate result sets. By means of this caching and identification mechanism, the distributed database answers sub-queries that meet the conditions without network interaction, which improves the efficiency of distributed queries.

Description

Caching method for query intermediate result set of distributed database system

Technical Field

The invention belongs to the field of computer databases, and particularly relates to a method for caching and reusing the intermediate result sets generated by query statements in a distributed database system.

Background

Over the past decade of rapid development in computer technology, distributed databases have become an important research topic in the database field. The spread of Internet and mobile applications exposes data services to ever larger data volumes and access request loads. As distributed databases have come into wide use, improving the information processing capacity of a data service by means of a distributed database has become a common solution among data service providers.

A distributed database builds on the mature technology of centralized databases. Its core idea is that a database cluster provides data services to the outside as a single whole while the data is stored internally in a distributed manner. Data reliability is achieved through techniques such as data redundancy, data sharding, and replica synchronization, and the efficiency of data operations is improved through techniques such as read-write separation. For management, a distributed database generally uses a single node or a small number of nodes as management nodes, which perform functions such as query parsing, statement rewriting, and result merging and which direct a number of sub-database nodes, so that the cluster can provide services externally.

At present, because of strict data consistency requirements, most distributed databases do not fully cache the query results submitted by child node databases, so each distributed query spends a large amount of time on network interaction. In some practical applications, however, such as the medical field, the requirement on data consistency is not high, and the data queried within a given period tends to be correlated or nearly identical. How to improve the query efficiency of a distributed database system and provide high-performance service under these conditions is therefore a problem that needs to be studied and solved.

Disclosure of Invention

To address the shortcomings of traditional distributed databases, the invention provides a method for caching and using the intermediate result sets of queries in a distributed database system. When the distributed database executes a query task, the method caches the sub-query result sets produced by the relevant child nodes. If, within a subsequent period of time, a query arrives whose range is smaller than or identical to that of a cached result, the system reuses the intermediate result set directly, which reduces the network interaction resources consumed by executing the query repeatedly.

The main idea of the method is as follows: after a query statement has been decomposed and dispatched to the child nodes, the intermediate result sets produced by the sub-queries on each child node are collected and an intermediate result cache data set is built; an identification mechanism that partitions query ranges according to the statements executed by the database is established, and when a new query task arrives it determines whether an intermediate result cache data set is available. This reduces both the number of query tasks the database sends to the back-end child nodes and the network interaction resources consumed in executing them.

The implementation of the invention comprises the following steps:

(1) Establishing a storage model for the intermediate result cache data set

The purpose of this step is to build a storage model for sub-query results so that the intermediate results generated by sub-queries can be cached. The storage model is divided into a head cache and a result cache. The head cache records the database ip, the database name, the relation schema, the column field names, and similar metadata; the result cache records all data items present in a single row of data. The data returned by each node is cached by generating these two parts.

(2) Establishing an identification mechanism for the query range

The purpose of query range identification is to determine, through an identification mechanism, how well cached intermediate results match future queries. To this end, a multi-branch tree is designed to judge the query range of a query statement (its structure is shown in FIG. 1). In this tree, the path from the root node to any non-root node yields a complete where clause, and the data stored at a node describes the query that was executed on a sub-database node using the where clause formed by joining all nodes on the path from the root to that node. The tree is built incrementally: whenever a sub-query intermediate result set is cached, the corresponding nodes are added to the tree. Each node contains the node identifier, the search condition key, and all information of the query records for the query statement that ends at this node. A query record includes the ip of the target database, the table name, the query conditions, the queried columns, and the expiration time of the record; its data structure is shown in FIG. 2. When a subsequent query statement is executed, the routing module first produces the sub-query task to be executed on each child node, and each sub-query task is split into query items, a query table, and a number of query conditions. The query conditions are then looked up in the multi-branch tree. If the tree contains a path generated by these conditions, the range of the conditions on that path is greater than or equal to the range constrained by the current sub-task, and the tail node of the path holds a cache record, the statement is considered identified: the cache set contains an intermediate result set that the statement can use.

(3) Handling expiration of the intermediate result cache data set

The intermediate result set cache built in step (1) is only valid for a limited time. If entries are kept too long, the volume of cached data inevitably becomes huge and soon slows the database down severely, and the validity of the data can no longer be guaranteed. To solve this problem, an expiration time is recorded for each cached intermediate result set, and a timed traversal task is set up that periodically scans the expiration time attribute of the cached result sets. When the expiration time of a cached result is earlier than the current time, the data set is deemed to have timed out and is deleted. The appropriate validity period generally depends on the application domain and on how the database is operated. It is set through an attribute in the configuration file and, once read, is treated as a fixed value.
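
The timed expiration sweep can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the names `CacheEntry`, `make_entry`, `sweep_expired`, and `VALIDITY_SECONDS` are hypothetical, and the validity period is hard-coded here where the text reads it from a configuration file.

```python
import time
from dataclasses import dataclass

# Hypothetical fixed validity period; the text reads it from a configuration file.
VALIDITY_SECONDS = 300.0


@dataclass
class CacheEntry:
    rows: list
    expires_at: float  # absolute expiration time, in seconds since the epoch


def make_entry(rows, now=None):
    """Stamp a new entry with expiration time = caching time + validity period."""
    now = time.time() if now is None else now
    return CacheEntry(rows=rows, expires_at=now + VALIDITY_SECONDS)


def sweep_expired(cache, now=None):
    """Delete every entry whose expiration time is earlier than `now`.

    Returns the keys that were removed. A real system would call this
    periodically from a timer thread or a scheduled task.
    """
    now = time.time() if now is None else now
    expired = [key for key, entry in cache.items() if entry.expires_at < now]
    for key in expired:
        del cache[key]
    return expired
```

Passing `now` explicitly keeps the sweep deterministic for testing; in production the default wall-clock time would be used.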

Compared with the prior art, the invention has the following obvious advantages and beneficial effects:

The invention provides a method for caching the intermediate result sets generated by distributed database query statements. It gives better database response times for query tasks, such as those on medical data, that have low network interaction requirements, low requirements on database consistency, and large data volumes.

Drawings

FIG. 1 is an illustration of the identification tree;

FIG. 2 is an illustration of the identification tree node class;

FIG. 3 is an illustration of the intermediate result cache;

FIG. 4 is a diagram of the identification tree;

FIG. 5 is a flow chart of the method of the invention.

Detailed Description of the Preferred Embodiments

The invention is further described with reference to the following figures and detailed description.

Step 1, establishing an intermediate result cache data set model.

The intermediate result set caching method takes the content returned by the relational database MySQL as its standard, so the cache adopts the structure of the MySQL result set. A result set returned by the relational database contains two parts: result header information and result column information.

The structures of the intermediate result cache head data set and the intermediate result cache content data set are shown in FIG. 3. The head data set is Fieldcache in the figure; from top to bottom, its attributes are the head record id, database ip address, database meta name, database data table name, array of original column names, complete original cache head, and array of cache content ids. The content data set is Rowcache in the figure; from top to bottom, its attributes are the row record id and the row data content.

A cache data storage mechanism is established on the basis of these two data entities, and the cached data is stored in a database or a document.

A non-relational database returns results in a format that depends on the database type. In that case the main task is to identify the format of the returned content, convert the content into intermediate result set data conforming to the above model, and cache it.
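
The two data entities can be sketched as a pair of record types. The attribute list follows the description of FIG. 3, but the Python field names, the `cache_result` helper, and the in-memory id counter are assumptions made for illustration; the source does not specify how ids are generated or how the records are persisted.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical id generator used only by this sketch


@dataclass
class RowCache:
    row_id: int
    data: tuple  # content of one result row


@dataclass
class FieldCache:
    head_id: int
    db_ip: str
    db_name: str
    table_name: str
    column_names: list  # original column names
    raw_header: bytes   # complete original header as returned by the node
    row_ids: list = field(default_factory=list)  # ids of the associated RowCache records


def cache_result(db_ip, db_name, table_name, column_names, rows, raw_header=b""):
    """Split one node's result set into a head record and its row records."""
    head = FieldCache(next(_ids), db_ip, db_name, table_name,
                      list(column_names), raw_header)
    row_records = []
    for row in rows:
        record = RowCache(next(_ids), tuple(row))
        head.row_ids.append(record.row_id)
        row_records.append(record)
    return head, row_records
```

Keeping only row ids in the head record mirrors the cache content id array in the head data set, so a head and its rows can be stored separately and deleted together.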
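
As one hedged example of such a conversion, the sketch below assumes a document store that returns each record as a dict-like object. The function name `documents_to_result_set` and the choice to fill missing fields with `None` are assumptions; the source does not name a specific non-relational database or format.

```python
def documents_to_result_set(documents):
    """Convert a list of dict-like documents into (column_names, rows).

    Column order follows first appearance across the documents; a field
    missing from a document is filled with None in that row.
    """
    columns = []
    for doc in documents:
        for key in doc:
            if key not in columns:
                columns.append(key)
    rows = [tuple(doc.get(col) for col in columns) for doc in documents]
    return columns, rows
```

The resulting column list and row tuples have the same shape as a relational result set, so they can be cached with the same head and row structures.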

Step 2, establishing an identification mechanism for the query range.

The identification mechanism uses a multi-branch tree to judge, while a query task is being executed, whether a usable intermediate result set cache exists. The contents of a single node of the multi-branch tree are shown in FIG. 2.

From top to bottom, these are the read-write lock, the where clause stored at the current node (column, operator, and value), the cache database ip, the table name, the cache attribute array, the cache expiration time, and the cache head id. After parsing, the SQL statement of a query task executed on a child node is converted into a path in the multi-branch tree.

The resulting multi-branch tree structure is shown in FIG. 4. The marked path is the one generated by the query statement "select name, sex from user where id > 1 and age > 13 and type = 4;".

When a new SQL statement is executed, it is split into selection conditions, a table name, and query attributes, and the selection conditions are sorted in text order and looked up in the multi-branch tree. The current cache is usable when all of the following hold: every condition either equals the data stored on some path in the tree or has a range smaller than the range recorded there; the tail node has the same database IP and table name as the database on which the query would run; the cached attributes are the same as, or a superset of, the attributes the current query requests; and the cache has not expired. If any query range differs from the cached range, a secondary query is run on the cached result and its output is assembled into a new intermediate result set packet; once the remaining intermediate result sets have been returned, the merge operation is executed and the merged result is returned to the client. Otherwise, an intermediate result set packet is generated directly from the cache and waits for subsequent processing.
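
Building such a path can be sketched as follows. The `TreeNode` fields and the `insert_path` helper are hypothetical and much simpler than the node in FIG. 2 (no lock, no expiration handling). Sorting the conditions by their text form before insertion is an assumption made so that the same set of conditions always maps to the same path.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    condition: tuple = None                       # (column, operator, value); None for the root
    children: dict = field(default_factory=dict)  # condition -> TreeNode
    records: list = field(default_factory=list)   # cache records whose path ends at this node


def insert_path(root, conditions, record):
    """Add one node per where-condition and attach `record` to the tail node."""
    node = root
    # Assumption: conditions are ordered by their text form so that equal
    # condition sets share one path regardless of the order in the SQL text.
    for cond in sorted(conditions, key=str):
        node = node.children.setdefault(cond, TreeNode(condition=cond))
    node.records.append(record)
    return node
```

For the statement above, the conditions `("id", ">", 1)`, `("age", ">", 13)`, and `("type", "=", 4)` produce a three-node path, and the record attached to the tail node would carry the database ip, table name, cached attributes, expiration time, and cache head id.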
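
The availability check can be sketched as a range-containment test plus the metadata checks listed above. This is an illustration under stated assumptions: the record is a plain dict with hypothetical keys, values are numeric, each column appears in at most one condition, and only the operators `=`, `>`, `>=`, `<`, and `<=` are handled.

```python
def covers(cached, query):
    """Return True if every value satisfying `query` also satisfies `cached`.

    Both arguments are (column, operator, value) tuples with numeric values.
    """
    c_col, c_op, c_val = cached
    q_col, q_op, q_val = query
    if c_col != q_col:
        return False
    if c_op == "=":
        return q_op == "=" and q_val == c_val
    if c_op == ">":
        return (q_op == ">" and q_val >= c_val) or (q_op in (">=", "=") and q_val > c_val)
    if c_op == ">=":
        return q_op in (">", ">=", "=") and q_val >= c_val
    if c_op == "<":
        return (q_op == "<" and q_val <= c_val) or (q_op in ("<=", "=") and q_val < c_val)
    if c_op == "<=":
        return q_op in ("<", "<=", "=") and q_val <= c_val
    return False


def cache_usable(record, query_conditions, query_columns, db_ip, table, now):
    """Apply the availability checks to one cache record (hypothetical dict layout)."""
    by_column = {cond[0]: cond for cond in record["conditions"]}
    if len(by_column) != len(query_conditions):
        return False
    ranges_ok = all(q[0] in by_column and covers(by_column[q[0]], q)
                    for q in query_conditions)
    return (ranges_ok
            and record["db_ip"] == db_ip
            and record["table"] == table
            and set(query_columns) <= set(record["columns"])
            and record["expires_at"] > now)
```

A query with `id > 5` is covered by a cached `id > 1`, whereas `id >= 1` is not, because the value 1 satisfies the query but was never fetched into the cache.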

Step 3, setting the expiration mode of the intermediate result cache data set.

After the intermediate result set returned by a child node is obtained, its content is parsed and cached. The expiration time is calculated as the sum of the time at which the record is cached and the validity period of the record, and is written into the cache head data set. The system traverses the cached results cyclically by means of a separately started timed task. When the data set contains cache head data whose expiration time is earlier than the current time, that head data and the cache content data associated with it are deleted. The identification mechanism likewise starts a separate timed task that traverses its data and deletes expired entries; in addition, it deletes a cache record whenever it encounters expired data, so as to release space. Deletion in the identification mechanism follows this principle: when a cache record is deleted, if it is the only cache record of the current node and the current node has no child nodes, the current node is deleted directly and the parent node is checked for cache records. If the parent node holds no cache record, the parent node is deleted as well, and the check for whether the next parent node is empty is repeated.
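
The deletion principle for the identification tree can be sketched as an upward pruning walk. The `Node` class and `delete_record` function are hypothetical. As an assumption, the sketch treats a node as empty only when it has neither cache records nor child nodes, so a parent that still leads to other cached paths is never removed.

```python
class Node:
    """One node of the identification tree, simplified for this sketch."""

    def __init__(self, key, parent=None):
        self.key = key        # condition text, e.g. "id>1"
        self.parent = parent
        self.children = {}    # key -> Node
        self.records = []     # cache records whose path ends at this node
        if parent is not None:
            parent.children[key] = self


def delete_record(node, record):
    """Remove `record` from `node`, then prune emptied nodes walking upward."""
    node.records.remove(record)
    # Assumption: a node is removed only if it holds no records and has no
    # children; the root (parent is None) is never removed.
    while node.parent is not None and not node.records and not node.children:
        parent = node.parent
        del parent.children[node.key]
        node = parent
```

Pruning stops at the first ancestor that still holds a record or another child, which keeps every remaining path in the tree backed by at least one cache record.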

Step 4, executing sub-queries without network interaction.

The flow of a sub-query task executed without network interaction is shown in FIG. 5.

As shown in FIG. 5, before a sub-query is executed, a parsing stage first obtains its target path in the identification cache.

Whether a cache exists is judged from the path, and the number of conditions for which the sub-query range is smaller than the range on the cache path is recorded at the same time. If the path exists and holds a cached intermediate result set for the database node and data table on which the sub-query would run, the cache is hit. If the recorded number of inconsistent query ranges is greater than 0, the column names and row data are extracted in turn and a secondary query is executed to filter the result set. If the number is 0, an intermediate result set packet is generated directly, and the results are merged after all sub-queries have been executed.

If the path does not exist, the cache is missed, and the sub-query is sent to the corresponding child node to wait for the result packet it returns. When the result packet is received, the result packet header, the result row data, and related information are parsed in turn and entered into the intermediate result set cache together with the expiration time. The cache path for this intermediate result set is then added to the identification cache according to the sub-query statement, and the cache expiration time is recorded. Finally, the result merging operation is executed after all sub-queries have been executed.
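
The hit and miss branches of this flow can be sketched as a single function. Everything here is an assumption for illustration: the tree search, the network round trip, and the cache write are passed in as the hypothetical callables `lookup`, `fetch_from_node`, and `store`, and the secondary query is reduced to an in-memory filter over numeric comparisons. The final merge across sub-queries is not shown.

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}


def filter_rows(columns, rows, conditions):
    """Secondary query: keep only the cached rows that satisfy every condition."""
    index = {name: i for i, name in enumerate(columns)}
    return [row for row in rows
            if all(OPS[op](row[index[col]], val) for col, op, val in conditions)]


def run_subquery(conditions, lookup, fetch_from_node, store):
    """Answer one sub-query from the cache if possible, else ask the child node.

    lookup(conditions)          -> None on a miss, or (columns, rows, narrower_count)
    fetch_from_node(conditions) -> (columns, rows); stands in for the network call
    store(conditions, columns, rows) records the result in the cache and the tree
    """
    hit = lookup(conditions)
    if hit is None:
        columns, rows = fetch_from_node(conditions)
        store(conditions, columns, rows)
        return rows
    columns, rows, narrower_count = hit
    if narrower_count > 0:
        # The query is narrower than the cached range, so filter the cached rows.
        return filter_rows(columns, rows, conditions)
    return rows
```

Only the miss branch touches `fetch_from_node`, which is where the saving in network interaction comes from.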

Claims

1. A method for caching query intermediate result sets in a distributed database system, characterized by comprising the following steps:

step 1, establishing an intermediate result cache data set model;

the intermediate result set caching method takes the content returned by the relational database MySQL as its standard, so the cache adopts the structure of the MySQL result set; a result set returned by the relational database contains two parts: result header information and result column information;

the intermediate result cache head data set structure comprises a head record id, a database ip address, a database meta name, a database data table name, an array of original column names, a complete original cache head, and an array of cache content ids; the intermediate result cache content data set comprises a row record id and row data content;

a cache data storage mechanism is established on the basis of these two data entities, and the cached data is stored in a database or a document;

a non-relational database returns results in a format that depends on the database type; the format of the returned content is therefore identified, and the content is converted into intermediate result set data conforming to the model and cached;

step 2, establishing an identification mechanism of a query range;

the identification mechanism uses a multi-branch tree to judge, while a query task is being executed, whether a usable intermediate result set cache exists; a single node of the multi-branch tree comprises a read-write lock for use under high concurrency, the where clause stored at the current node, an operator, a value, a cache database ip, a table name, a cache attribute array, a cache expiration time, and a cache head id; after parsing, the SQL statement of a query task executed on a child node is converted into a path in the multi-branch tree;

when a new SQL statement is executed, it is split into selection conditions, a table name, and query attributes, and the selection conditions are sorted in text order and looked up in the multi-branch tree; the current cache is usable when every condition either equals the data stored on some path in the tree or has a range smaller than the range recorded there, the tail node has the same database IP and table name as the database on which the query would run, the cached attributes are the same as or a superset of the attributes the current query requests, and the cache has not expired; if any query range was found to differ from the cached range, a secondary query is run on the cached result and its output is assembled into a new intermediate result set packet, and once the remaining intermediate result sets have been returned, the merge operation is executed and the merged result is returned to the client; otherwise, an intermediate result set packet is generated directly to wait for subsequent processing;

step 3, setting a failure mode of the intermediate result cache data set;

after the intermediate result set returned by a child node is obtained, its content is parsed and cached; the expiration time is calculated as the sum of the time at which the record is cached and the validity period of the record, and is written into the cache head data set; the system traverses the cached results cyclically by means of a separately started timed task; when the data set contains cache head data whose expiration time is earlier than the current time, that head data and the cache content data associated with it are deleted; the identification mechanism likewise starts a separate timed task that traverses its data and deletes expired entries, and in addition deletes a cache record whenever it encounters expired data, so as to release space; deletion in the identification mechanism follows this principle: when a cache record is deleted, if it is the only cache record of the current node and the current node has no child nodes, the current node is deleted directly and the parent node is checked for cache records; if the parent node holds no cache record, the parent node is deleted as well, and the check for whether the next parent node is empty is repeated;

step 4, executing sub-queries without network interaction;

before a sub-query is executed, a parsing stage first obtains its target path in the identification cache;

whether a cache exists is judged from the path, and the number of conditions for which the sub-query range is smaller than the range on the cache path is recorded at the same time; if the path exists and holds a cached intermediate result set for the database node and data table on which the sub-query would run, the cache is hit; if the recorded number of inconsistent query ranges is greater than 0, the column names and row data are extracted in turn and a secondary query is executed to filter the result set; if the number is 0, an intermediate result set packet is generated directly, and the results are merged after all sub-queries have been executed;

if the path does not exist, the cache is missed, and the sub-query is sent to the corresponding child node to wait for the result packet it returns; when the result packet is received, the result packet header and the result row data are parsed in turn and entered into the intermediate result set cache together with the expiration time; the cache path for this intermediate result set is then added to the identification cache according to the sub-query statement, and the cache expiration time is recorded; finally, the result merging operation is executed after all sub-queries have been executed.