CN109947796B - Caching method for query intermediate result set of distributed database system - Google Patents

Publication number
CN109947796B
Authority
CN
China
Prior art keywords
cache
query
data
intermediate result
result set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910166410.6A
Other languages
Chinese (zh)
Other versions
CN109947796A (en)
Inventor
杜金莲
陈子昂
金雪云
苏航
李童
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201910166410.6A
Publication of CN109947796A
Application granted
Publication of CN109947796B
Active legal status
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for caching the intermediate result sets of queries in a distributed database system, comprising the following steps: recording the intermediate result sets returned by sub-query tasks and establishing a storage model for the intermediate result cache data set; establishing a mechanism for identifying the query range of a query statement; and applying timeout-based expiration to the intermediate result sets. With this caching and identification mechanism, the distributed database can answer qualifying sub-queries without network interaction, which improves the efficiency of distributed queries.

Description

Caching method for query intermediate result set of distributed database system
Technical Field
The invention belongs to the field of computer databases, and in particular relates to a method for caching and reusing the intermediate result sets generated by query statements in a distributed database system.
Background
Over the last decade of rapid development in computer technology, distributed databases have emerged as an important research topic in the database field. The spread of internet and mobile applications has exposed data services to ever larger data volumes and access-request loads. As distributed databases have become widely used, improving the information-processing capacity of a data service by means of a distributed database has become a common solution among data service providers.
Distributed databases build on the mature technology of centralized databases. The core idea is that a database cluster provides data services externally as a single whole while the data is stored internally in a dispersed manner; reliability is achieved through techniques such as data redundancy, data sharding and replica synchronization, and execution efficiency is improved through techniques such as read-write separation. For management, a distributed database generally uses one or a few nodes as management nodes, which perform query parsing, statement rewriting, result merging and similar functions and coordinate a number of sub-database nodes in order to provide service externally.
At present, because of strict data-consistency requirements, most distributed databases do not fully cache the query results submitted by child-node databases, so a single distributed query spends a large amount of time on network interaction. In some practical applications, however, such as the medical field, the consistency requirement is not high, and the data queried within a given period tends to be correlated or nearly homogeneous. How to improve the query efficiency of a distributed database system and provide high-performance service in such settings is therefore a problem that needs to be studied and solved.
Disclosure of Invention
To address the shortcomings of traditional distributed databases, the invention provides a method for caching and using the intermediate result sets of queries in a distributed database system. When the distributed database executes a query task, the method caches the sub-query result sets generated by the relevant child nodes, so that for some period afterwards the system can directly reuse an intermediate result set whenever a new query's range is smaller than, or the same as, the cached result range. This reduces the network interaction resources consumed by repeatedly executing queries.
The main idea of the method is as follows: after a query statement has been decomposed and dispatched to the child nodes, the intermediate result set information generated by each child node's query clause is collected and an intermediate result cache data set is established; an identification mechanism that delimits the query range from the executed database statements is also established, and when a new query task appears it is used to judge whether an intermediate result cache data set is available. This reduces the number of query tasks the database sends to the back-end child nodes and the network interaction resources consumed in executing them.
The implementation of the invention comprises the following steps:
(1) Establishing a storage model for the intermediate result cache data set
The purpose of this step is to build a storage model for caching the intermediate results generated by sub-queries. The storage model is divided into a head cache and a result cache. The head cache records the local database IP, database name, relation schema, column field names and so on; the result cache records all data items present in a single row of data. The data returned by each node is cached by generating these two parts.
(2) Establishing an identification mechanism for the query range
The purpose of query range identification is to determine how well cached intermediate results match future queries. To this end, a multi-branch tree is designed to judge the query range of a query statement (its structure is shown in FIG. 1). In the multi-branch tree, the path from the head node to any non-root node yields a complete where clause, and the data stored at a node corresponds to the query executed on a sub-database node with the where clause formed by joining all nodes on the path from the head node to that node. The tree is built incrementally: whenever a sub-query intermediate result set is cached, the corresponding data nodes are added to the tree. Each node contains a node mark, the search condition key, and all information of the query records for the query statement that ends at that node. A query record includes the target database IP, table name, query condition, target columns and the record's expiration time; its data structure is shown in FIG. 2. When a subsequent query statement is executed, the routing module first derives the sub-query task to be executed on each child node, and the sub-query task is split into query items, a query table and several query conditions. The tree is then searched by the query conditions. If the tree contains a path generated by those conditions, the range of each condition on the path is greater than or equal to the constraint range of the current sub-task, and the tail node of the path holds a cache record, the statement is considered identified: the cache set contains an intermediate result set that the statement can use.
(3) Handling expiration of the intermediate result cache data set
The intermediate result set cache established in step (1) is only valid for a limited time: if entries are kept too long, the data volume inevitably becomes huge and seriously slows the database, and the validity of the data can no longer be guaranteed. To solve this problem, an expiration time is recorded with each cached intermediate result set and a timed traversal task is set up to scan the expiration-time attribute of the cached result sets periodically. When the current time is later than the expiration time recorded in a cached result, the data set is judged to have timed out and is deleted. The appropriate validity period generally depends on the application domain and on how the database is operated. It is set through an attribute in a configuration file and, once read, is treated as a fixed value.
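The timed traversal in step (3) can be illustrated with a minimal Python sketch. The names `CachedResult` and `sweep_expired` are hypothetical and not taken from the patent; following the detailed description, an entry is treated as expired once its expiration time is earlier than the current time. A real system would call the sweep from a periodic task.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CachedResult:
    # Hypothetical stand-in for one cached intermediate result set.
    key: str
    expires_at: float  # absolute expiration time, seconds since the epoch


def sweep_expired(cache: Dict[str, CachedResult],
                  now: Optional[float] = None) -> List[str]:
    """Delete every entry whose expiration time has passed; return removed keys."""
    now = time.time() if now is None else now
    expired = [key for key, entry in cache.items() if entry.expires_at < now]
    for key in expired:
        del cache[key]
    return expired
```

Passing `now` explicitly keeps the function easy to test; in production the default of `time.time()` would be used.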
Compared with the prior art, the invention has the following obvious advantages and beneficial effects:
The invention provides a method for caching the intermediate result sets generated by distributed database query statements. It gives better database response times for query tasks, such as those on medical data, that have low network interaction requirements, low database-consistency requirements and large data volumes.
Drawings
FIG. 1 is an illustration of the identification tree;
FIG. 2 is an illustration of the identification tree node class;
FIG. 3 is an illustration of the intermediate result cache;
FIG. 4 is a diagram of an identification tree;
FIG. 5 is a flow chart of the method of the invention.
Detailed Description of the Preferred Embodiments
The invention is further described with reference to the following figures and detailed description.
Step 1, establishing an intermediate result cache data set model.
The intermediate result set caching method takes the content returned by the relational database MySQL as its reference, so the cache adopts the structure of that result set. The result set returned by a relational database contains two parts: result header information and result column information.
The structures of the intermediate result cache head data set and the intermediate result cache content data set are shown in FIG. 3. The head data set is Fieldcache in the figure; its attributes, from top to bottom, are the head record id, database IP address, database meta name, database table name, array of original column names, complete original cache head, and array of cache content ids. The content data set is Rowcache in the figure; its attributes, from top to bottom, are the row record id and the row data content.
A cache data storage mechanism is established from these two data entities, and the cached data is stored in a database or a document.
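As a rough illustration of the two entities, the following Python sketch models Fieldcache and Rowcache as dataclasses with the attributes listed above. The field names, the `cache_result` helper and the `"<head id>-<offset>"` row id scheme are assumptions made for the example; the patent does not prescribe them.

```python
from dataclasses import dataclass, field
from typing import Any, List, Sequence, Tuple


@dataclass
class FieldCache:
    # Head data set: one entry per cached intermediate result set.
    head_id: int
    db_ip: str
    db_name: str
    table_name: str
    column_names: List[str]
    raw_header: bytes = b""                            # complete original header, if kept
    row_ids: List[str] = field(default_factory=list)   # ids of the cached rows


@dataclass
class RowCache:
    # Content data set: one entry per row of the result set.
    row_id: str
    values: List[Any]


def cache_result(head_id: int, db_ip: str, db_name: str, table_name: str,
                 columns: Sequence[str],
                 rows: Sequence[Sequence[Any]]) -> Tuple[FieldCache, List[RowCache]]:
    """Split one returned result set into a head entry and its row entries."""
    head = FieldCache(head_id, db_ip, db_name, table_name, list(columns))
    cached_rows = []
    for offset, row in enumerate(rows):
        row_id = f"{head_id}-{offset}"  # hypothetical id scheme
        head.row_ids.append(row_id)
        cached_rows.append(RowCache(row_id, list(row)))
    return head, cached_rows
```

Keeping the row ids in the head entry means that deleting a head can also locate and delete every row that belongs to it.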
Non-relational databases return results in different forms depending on the database type. For them, the main work is to identify the format of the returned content, convert it into intermediate result set data of this model according to that format, and cache it.
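For instance, if a non-relational node returned a list of key-value documents, the conversion into the column-and-row model might look like the sketch below. The document format and the function name `normalize_documents` are assumptions for illustration only; other database types would need their own converters.

```python
from typing import Any, Dict, List, Tuple


def normalize_documents(docs: List[Dict[str, Any]]) -> Tuple[List[str], List[List[Any]]]:
    """Convert key-value documents into (column names, rows) for caching."""
    columns: List[str] = []
    for doc in docs:
        for key in doc:
            if key not in columns:  # keep first-seen column order
                columns.append(key)
    # Missing keys become None so every row has one value per column.
    rows = [[doc.get(column) for column in columns] for doc in docs]
    return columns, rows
```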
Step 2, establishing an identification mechanism for the query range.
The identification mechanism uses a multi-branch tree to judge, while a query task is executing, whether a usable intermediate result set cache exists. The contents of a single node of the multi-branch tree are shown in FIG. 2.
From top to bottom, they are: a read-write lock, the where clause stored at the current node (column, operator, value), the operator, the value, the cache database IP, the table name, the cache attribute array, the cache expiration time and the cache head id. The SQL statement of a query task executed by a child node is parsed and converted into a path in the multi-branch tree.
The resulting multi-branch tree structure is shown in FIG. 4. The marked path is the one generated by the query statement "select name, sex from user where id > 1 and age > 13 and type = 4;".
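To make the path construction concrete, the sketch below splits the where clause of the example statement into (column, operator, value) triples, one per tree node. The function name `where_path` and the regular expressions are illustrative assumptions; the sketch only handles conjunctions of simple numeric comparisons and keeps the conditions in the order written.

```python
import re
from typing import List, Tuple

Condition = Tuple[str, str, float]  # (column, operator, value)

# One simple comparison such as "id > 1" or "type = 4".
_COND = re.compile(r"^\s*(\w+)\s*(>=|<=|=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def where_path(sql: str) -> List[Condition]:
    """Return the where-clause conditions of a simple query as a node path."""
    match = re.search(r"\bwhere\b(.*?);?\s*$", sql, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return []  # no where clause, so no path
    path = []
    for part in re.split(r"\band\b", match.group(1), flags=re.IGNORECASE):
        cond = _COND.match(part)
        if cond is None:
            raise ValueError(f"unsupported condition: {part.strip()!r}")
        column, op, value = cond.groups()
        path.append((column, op, float(value)))
    return path
```

For the example statement this yields the three nodes `("id", ">", 1.0)`, `("age", ">", 13.0)` and `("type", "=", 4.0)`.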
When a new SQL statement is executed, it is split into selection conditions, a table name and query attributes, and the selection conditions are sorted in text order. The conditions are then looked up in the multi-branch tree. The current cache is usable when every condition is either identical to the data stored on some path in the tree or has a range smaller than the range marked there, and the tail node has the same database IP and table name as the query being executed, its cached attributes are the same as or a superset of the attributes the current query targets, and the cache has not expired. If any of the query ranges are inconsistent (the new range is narrower than the cached one), a secondary query is run on the cached result and its output is assembled into a new intermediate result set packet; once the remaining intermediate result sets have been returned, the merge operation is executed and the result is returned to the client. Otherwise an intermediate result set packet is generated directly and waits for subsequent processing.
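The range test at the heart of this lookup can be sketched as follows. This is a simplified illustration under stated assumptions: conditions are numeric comparisons, each level is matched greedily without backtracking, and the tail-node checks on database IP, table name, attributes and expiration are omitted. `Node`, `covers`, `insert` and `find_cache` are hypothetical names.

```python
import math
from typing import List, Optional, Tuple

Condition = Tuple[str, str, float]  # (column, operator, value)


def interval(op: str, value: float):
    """Map a comparison to (low, low_inclusive, high, high_inclusive)."""
    if op == "=":
        return value, True, value, True
    if op == ">":
        return value, False, math.inf, False
    if op == ">=":
        return value, True, math.inf, False
    if op == "<":
        return -math.inf, False, value, False
    if op == "<=":
        return -math.inf, False, value, True
    raise ValueError(f"unsupported operator: {op}")


def covers(cached: Condition, new: Condition) -> bool:
    """True if the cached condition's range contains the new condition's range."""
    if cached[0] != new[0]:
        return False
    c_lo, c_lo_inc, c_hi, c_hi_inc = interval(cached[1], cached[2])
    n_lo, n_lo_inc, n_hi, n_hi_inc = interval(new[1], new[2])
    low_ok = c_lo < n_lo or (c_lo == n_lo and (c_lo_inc or not n_lo_inc))
    high_ok = c_hi > n_hi or (c_hi == n_hi and (c_hi_inc or not n_hi_inc))
    return low_ok and high_ok


class Node:
    def __init__(self, condition: Optional[Condition] = None):
        self.condition = condition
        self.children: List["Node"] = []
        self.record: Optional[str] = None  # e.g. a cache head id


def insert(root: Node, conditions: List[Condition], record: str) -> None:
    """Add the path for a cached sub-query and attach its record to the tail."""
    node = root
    for cond in conditions:
        for child in node.children:
            if child.condition == cond:
                node = child
                break
        else:
            child = Node(cond)
            node.children.append(child)
            node = child
    node.record = record


def find_cache(root: Node, conditions: List[Condition]) -> Optional[Tuple[str, int]]:
    """Return (record, number of narrower conditions), or None on a miss."""
    node, narrowed = root, 0
    for cond in conditions:
        for child in node.children:
            if covers(child.condition, cond):
                narrowed += child.condition != cond
                node = child
                break
        else:
            return None
    return None if node.record is None else (node.record, narrowed)
```

A returned count greater than 0 signals that the cached rows are a superset of what the new sub-query wants, so they still need a secondary filter before use.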
Step 3, setting the expiration mode of the intermediate result cache data set.
After an intermediate result set returned by a child node is obtained, its content is parsed and cached. The expiration time is computed as the sum of the time at which the record was cached and the record's validity period, and is recorded in the cache head data set. The system starts an additional timed task that cyclically traverses the cached results. When the data set contains cache head data whose expiration time is earlier than the current time, that head data and the cache content data associated with it are deleted. The identification mechanism likewise starts a timed task that traverses its data and deletes expired entries, and it can also delete a cache record as soon as it encounters an expired one, in order to free space. Deletion in the identification mechanism follows this principle: when a cache record is deleted, if it is the only cache record of the current node and the current node has no child nodes, the current node is deleted directly and the parent node is checked for cache records. If the parent node holds no cache record, the parent node is deleted as well, and the check for an empty parent is repeated.
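The upward pruning rule for the identification tree might be sketched like this. The `Node` class and `delete_record` function are hypothetical; the sketch additionally assumes that a parent which still has other children is kept, which the text implies but does not state outright.

```python
from typing import List, Optional, Tuple


class Node:
    def __init__(self, condition: Optional[Tuple[str, str, float]] = None,
                 parent: Optional["Node"] = None):
        self.condition = condition
        self.parent = parent
        self.children: List["Node"] = []
        self.records: List[str] = []  # cache records whose path ends here
        if parent is not None:
            parent.children.append(self)


def delete_record(node: Node, record: str) -> None:
    """Remove a cache record, then prune nodes left with no records and no children."""
    node.records.remove(record)
    # Walk upward while the node is empty; the root (no parent) is never removed.
    while node.parent is not None and not node.records and not node.children:
        parent = node.parent
        parent.children.remove(node)
        node.parent = None
        node = parent
```

Pruning stops at the first ancestor that still holds a record or another branch, so paths shared with other cached queries are left intact.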
Step 4, implementing sub-queries without network interaction.
The flow of a sub-query task without network interaction is shown in FIG. 5.
As shown in FIG. 5, before a sub-query is executed, its target path in the identification cache is first obtained through a parsing stage.
Whether a cache exists is judged from the path, and the number of conditions whose sub-query range is smaller than the range on the cached path is recorded at the same time. If the path exists and holds a cached intermediate result set for the database node and data table on which the sub-query would execute, the cache is hit. If the number of inconsistent query ranges is greater than 0, the column names and row data are extracted in turn and a secondary query is executed to filter the result set. If the number is 0, an intermediate result set packet is generated directly, and the results are merged after all sub-queries have been executed.
If the path does not exist, the cache is missed and the sub-query is sent to the corresponding child node, which is then awaited for its result packet. Once the result packet is received, the result packet header, result row data and other information are parsed in turn and entered into the intermediate result set cache together with the expiration time. The intermediate result set's cache path is then added to the identification cache according to the sub-query statement, and the cache expiration time is recorded. Finally, the result merge operation is executed after all sub-queries have been executed.
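The hit, filtered-hit and miss branches of this flow can be put together in a deliberately simplified sketch. To stay short it assumes every condition has the form `column > bound`, stores entries in a flat list rather than a tree, and ignores the check on queried columns; `SubQueryCache`, `run` and the `fetch` callback standing in for the network call to a child node are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, float]
Bounds = Dict[str, float]  # column -> exclusive lower bound ("column > bound")


class SubQueryCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        # Each entry: (db_ip, table, bounds, rows, expires_at)
        self.entries: List[Tuple[str, str, Bounds, List[Row], float]] = []

    def run(self, db_ip: str, table: str, bounds: Bounds,
            fetch: Callable[[str, str, Bounds], List[Row]],
            now: Optional[float] = None) -> Tuple[List[Row], str]:
        now = time.time() if now is None else now
        for ip, tbl, cached, rows, expires_at in self.entries:
            if (ip, tbl) != (db_ip, table) or expires_at < now:
                continue  # different node or table, or expired
            if cached.keys() != bounds.keys():
                continue  # different set of condition columns
            if all(cached[c] <= bounds[c] for c in bounds):
                narrowed = sum(cached[c] < bounds[c] for c in bounds)
                if narrowed == 0:
                    return rows, "hit"
                # Secondary query: filter the cached superset locally.
                kept = [r for r in rows if all(r[c] > b for c, b in bounds.items())]
                return kept, "hit-filtered"
        # Miss: ask the child node, then cache the answer with its expiration time.
        rows = fetch(db_ip, table, bounds)
        self.entries.append((db_ip, table, dict(bounds), rows, now + self.ttl))
        return rows, "miss"
```

Only the "miss" branch calls `fetch`, which is where the saving in network interaction comes from; merging the per-node results afterwards is outside the scope of the sketch.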

Claims (1)

1. A method for caching the intermediate result sets of queries in a distributed database system, characterized by comprising the following steps:
step 1, establishing an intermediate result cache data set model;
the intermediate result set caching method takes the content returned by the relational database MySQL as its reference, so the cache adopts the structure of that result set; the result set returned by a relational database contains two parts: result header information and result column information;
the intermediate result cache head data set structure comprises a head record id, a database IP address, a database meta name, a database table name, an array of original column names, a complete original cache head and an array of cache content ids; the intermediate result cache content data set comprises a row record id and the row data content;
a cache data storage mechanism is established from these two data entities, and the cached data is stored in a database or a document;
non-relational databases return results in different forms depending on the database type; the format of the returned content is therefore identified, and the content is converted into intermediate result set data of the model according to that format and cached;
step 2, establishing an identification mechanism for the query range;
the identification mechanism uses a multi-branch tree to judge, while a query task is executing, whether a usable intermediate result set cache exists; a single node of the multi-branch tree comprises a read-write lock for use under high concurrency, the where clause stored at the current node, an operator, a value, a cache database IP, a table name, a cache attribute array, a cache expiration time and a cache head id; the SQL statement of a query task executed by a child node is parsed into a path in the multi-branch tree;
when a new SQL statement is executed, it is split into selection conditions, a table name and query attributes, and the selection conditions are sorted in text order; the conditions are then looked up in the multi-branch tree, and the current cache is usable when every condition is either identical to the data stored on some path in the tree or has a range smaller than the range marked there, the tail node has the same database IP and table name as the query being executed, its cached attributes are the same as or a superset of the attributes the current query targets, and the cache has not expired; if any of the query ranges are inconsistent, a secondary query is run on the cached result and its output is assembled into a new intermediate result set packet, and once the remaining intermediate result sets have been returned, the merge operation is executed and the result is returned to the client; otherwise an intermediate result set packet is generated directly and waits for subsequent processing;
step 3, setting the expiration mode of the intermediate result cache data set;
after an intermediate result set returned by a child node is obtained, its content is parsed and cached; the expiration time is computed as the sum of the time at which the record was cached and the record's validity period, and is recorded in the cache head data set; the system starts an additional timed task that cyclically traverses the cached results; when the data set contains cache head data whose expiration time is earlier than the current time, that head data and the cache content data associated with it are deleted; the identification mechanism likewise starts a timed task that traverses its data and deletes expired entries, and it can also delete a cache record as soon as it encounters an expired one, in order to free space; deletion in the identification mechanism follows this principle: when a cache record is deleted, if it is the only cache record of the current node and the current node has no child nodes, the current node is deleted directly and the parent node is checked for cache records; if the parent node holds no cache record, the parent node is deleted as well, and the check for an empty parent is repeated;
step 4, implementing sub-queries without network interaction;
before a sub-query is executed, its target path in the identification cache is first obtained through a parsing stage;
whether a cache exists is judged from the path, and the number of conditions whose sub-query range is smaller than the range on the cached path is recorded at the same time; if the path exists and holds a cached intermediate result set for the database node and data table on which the sub-query would execute, the cache is hit; if the number of inconsistent query ranges is greater than 0, the column names and row data are extracted in turn and a secondary query is executed to filter the result set; if the number is 0, an intermediate result set packet is generated directly, and the results are merged after all sub-queries have been executed;
if the path does not exist, the cache is missed and the sub-query is sent to the corresponding child node, which is then awaited for its result packet; once the result packet is received, the result packet header and result row data are parsed in turn and entered into the intermediate result set cache together with the expiration time; the intermediate result set's cache path is then added to the identification cache according to the sub-query statement, and the cache expiration time is recorded; the result merge operation is then executed after all sub-queries have been executed.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910166410.6A CN109947796B (en) 2019-04-12 2019-04-12 Caching method for query intermediate result set of distributed database system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910166410.6A CN109947796B (en) 2019-04-12 2019-04-12 Caching method for query intermediate result set of distributed database system

Publications (2)

Publication Number Publication Date
CN109947796A CN109947796A (en) 2019-06-28
CN109947796B true CN109947796B (en) 2021-04-30

Family

ID=67008343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910166410.6A Active CN109947796B (en) 2019-04-12 2019-04-12 Caching method for query intermediate result set of distributed database system

Country Status (1)

Country Link
CN (1) CN109947796B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543494B (en) * 2019-08-19 2023-03-24 湖南麟淇网络科技股份有限公司 Method for constructing reachable graph based on cache table
CN112380256B (en) * 2020-11-24 2023-10-13 广东机场白云信息科技有限公司 Method for accessing data of energy system, database and computer readable storage medium
CN112905592A (en) * 2021-02-08 2021-06-04 中国工商银行股份有限公司 Data query method, system and server
CN113420033B (en) * 2021-08-17 2021-12-07 北京奥星贝斯科技有限公司 Table data query method, table data query device and system for distributed database
CN113515549B (en) * 2021-09-14 2021-12-10 江西科技学院 Financial data query method and device and readable storage medium
CN114840562B (en) * 2022-07-04 2022-11-01 深圳市茗格科技有限公司 Distributed caching method and device for business data, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163195A (en) * 2010-02-22 2011-08-24 北京东方通科技股份有限公司 Query optimization method based on unified view of distributed heterogeneous database
CN105912666A (en) * 2016-04-12 2016-08-31 中国科学院软件研究所 Method for high-performance storage and inquiry of hybrid structure data aiming at cloud platform
CN106682147A (en) * 2016-12-22 2017-05-17 北京锐安科技有限公司 Mass data based query method and device
CN108108456A (en) * 2017-12-28 2018-06-01 重庆邮电大学 A kind of information resources distributed enquiring method based on metadata

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521406B (en) * 2011-12-26 2014-06-25 中国科学院计算技术研究所 Distributed query method and system for complex task of querying massive structured data
US20160267132A1 (en) * 2013-12-17 2016-09-15 Hewlett-Packard Enterprise Development LP Abstraction layer between a database query engine and a distributed file system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An Elastic Multi-Core Allocation Mechanism for Database Systems;Simone;《IEEE》;20180430;第473-484页 *
支持高并发数据流处理的MapReduce中间结果缓存;亓开元;《计算机研究与发展》;20130131;第50卷(第1期);第111-121页 *

Also Published As

Publication number Publication date
CN109947796A (en) 2019-06-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant