CN117009302A - Database cache optimization method based on S3 cloud storage - Google Patents
- Publication number
- CN117009302A (application number CN202310769257.2A)
- Authority
- CN
- China
- Prior art keywords
- cache
- region
- data
- dbms
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/12—Replacement control
- G06F12/121—Replacement control using replacement algorithms
- G06F12/123—Replacement control using replacement algorithms with age lists, e.g. queue, most recently used [MRU] list or least recently used [LRU] list
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/176—Support for shared access to files; File sharing support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/06—Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/16—Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
- H04L69/161—Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields
- H04L69/162—Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields involving adaptations of sockets based mechanisms
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a database cache optimization method based on S3 cloud storage, which comprises the following steps: S1, a data source receives push-down filtering and column pruning requests from a DBMS through an API, and a Cache Connector is integrated into the unmodified DBMS through the data source API; S2, Caches stores data in column format on a local SSD, and a Cache MS receives queries, including pushed-down predicates, from the Cache Connector through an API; S3, the Cache Server receives the push-down predicate string sent by the Cache Connector, converts it back into an internal AST, and uniformly converts the AST into Disjunctive Normal Form (DNF); the cache granularity is the Region, and every query request is represented as a disjunction (OR) of conjunctions (AND); S4, during Region request processing, the Cache Server searches the local cache for a Region that is a superset of the requested Region: it first scans whether a Region cache matches; if the query does not match, the file download manager may obtain the file from the file cache, and if that also misses, the file is pulled from remote storage.
Description
Technical Field
The invention relates to the technical field of IT application, in particular to a database cache optimization method based on S3 cloud storage.
Background
Amazon Simple Storage Service (Amazon S3 for short) is an object storage service with industry-leading scalability, data availability, security and performance. The service was developed by Amazon Web Services (AWS) and first launched on March 14, 2006. Amazon S3 Intelligent-Tiering provides 99.999999999% (eleven nines) durability and 99.9% availability. Its management features allow users to optimize, configure, and organize access to their data to meet specific compliance, business, or organizational requirements.
Existing caching schemes for Amazon S3 have several drawbacks. Usability is poor: manually creating views requires an experienced DBA (database administrator) familiar with the workload's hot-spot queries and with which data is hot or cold; otherwise the created views do not improve query efficiency and instead increase query latency and resource consumption through cache misses. Maintainability is poor: semantic-cache or intermediate-result schemes require integrating with and modifying the original DBMS (database management system), and the cached content must be implemented specifically for each DBMS's SQL queries, so development is difficult and demands much of maintainers. Cache utilization is low: for fine-grained data-block and data-page caches, a large amount of cache space is needed to guarantee a high cache hit rate and hence high query efficiency.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a database cache optimization method based on S3 cloud storage, which decouples the cache from the database management system, improves the generality of the cache, generates an adaptive cache, improves the usability of the system, significantly improves cache utilization, saves remote-storage bandwidth and reduces query latency.
In order to solve the technical problems, the invention provides the following technical scheme: a database cache optimization method based on S3 cloud storage comprises the following steps:
S1, a big data system (such as Spark or Presto) provides a data source API supporting various data sources and formats; the data source receives push-down filtering and column pruning requests from the DBMS through this API, so it can reduce the amount of data returned to the DBMS by processing this additional information; a lightweight DBMS-specific data source connector (the Cache Connector) is integrated into the unmodified DBMS through the data source API;
S2, Caches stores column-format data (e.g., Parquet data) on a local SSD, and a cache management system (Cache MS) receives queries, which contain pushed-down predicates, from the Cache Connector through an API; the Cache Server uses predicate pushdown to cache different subsets of the data, called data Regions;
S3, the Cache Server receives the push-down predicate string sent by the Cache Connector and converts it back into an internal AST (abstract syntax tree); the Cache Server uniformly converts the AST into Disjunctive Normal Form (DNF), in which all conjunctions (AND) are pushed down in the expression tree and conjunctions (AND) and disjunctions (OR) no longer interleave; each conjunctive clause (AND) can be regarded as a single hyper-rectangle, and a data Region can be regarded as a disjunction (OR) of hyper-rectangles; the cache granularity is the Region, and all query requests are represented as disjunctions (OR) of conjunctions (AND);
S4, during Region request processing, the Cache Server searches the local cache for a Region that is a superset of the requested Region: it first scans whether a Region cache matches; if the query does not match, the file download manager may obtain the file from the file cache, and if that also misses, the file is pulled from remote storage;
Further, in step S2, the Cache MS first checks the matcher; if there is a cache hit, it returns a set of file paths from local storage; if there is no hit, two options are provided:
1) The DBMS directly uses the Cache Connector to process the data stored remotely, and the Cache MS downloads the data to the Caches through the Cache Connector;
2) The Cache MS applies predicate pushdown to download data from the remote location, stores the result in Caches, and returns a path to the connector;
The contents of the cache may thus be filled by either the DBMS or the Cache MS, but not every requested Region is cached; an LRU-2-based algorithm decides which Regions to keep;
Still further, the Cache Server serves as a storage layer for the DBMS and runs outside the DBMS; information is exchanged through socket connections and shared space in the file system (SSD, ramdisk); during a file request, the DBMS exchanges information about files and required Regions with the Cache Server, which preferentially tries to satisfy the request with cached files;
Still further, the API uses a tree-shaped string representation for push-down predicates; since predicates are typically stored as ASTs in the DBMS, the string representation is constructed from the AST; each item is expressed in a tree-like syntax operator(left, right) that supports binary operators, unary operators and literals, where literals are the leaf nodes of the tree; a binary operation either combines multiple predicates (and/or) or forms an atomic predicate (such as gt, lt or eq); atomic predicates use the same binary syntax, with the identifier on the left and the comparison value on the right;
Still further, through conjunctive and disjunctive expressions, four relations exist between Regions: full inclusion, equivalence, intersection, and partial inclusion;
Still further, in step S4, the cache matching policy is:
A. Preferentially, a single Region satisfies the request;
B. If no single Region satisfies the request, the Cache MS tries to satisfy each individual hyper-rectangle, but this may require an additional deduplication operation; for example, Regions A and B may together satisfy a query while overlapping without being identical, requiring one deduplication pass;
C. When a combination of multiple Regions satisfies the request, matching uses a greedy algorithm: from the candidate list of Regions, the Region covering the most hyper-rectangles is selected each time, followed by deduplication.
Compared with the prior art, the invention has the following beneficial effects:
The invention has good usability: only a connector implementing predicate serialization is needed, making it highly general across DBMSs. Cache utilization and cache hit rate are high: the Region cache, designed for remote storage supporting predicate pushdown such as Amazon S3, caches only the hyper-rectangles extracted from predicate partitioning, so cache granularity is coarse and resource consumption is low; on a cache hit, a greedy algorithm that satisfies as many Regions as possible is used and overlapping Regions are deduplicated, which improves the cache hit rate and prevents frequent cache updates from degrading system performance.
Drawings
FIG. 1 is a diagram of the overall framework of the present invention;
FIG. 2 is a composition diagram of the Cache MS of the present invention;
FIG. 3 is a diagram of the process of converting to DNF and extracting single hyper-rectangles according to the invention;
FIG. 4 is a diagram of an example of matching a query consisting of two hyper-rectangles to two memory regions in accordance with the present invention;
FIG. 5 is a schematic diagram of a matching process and algorithm of the present invention;
Detailed Description
In order that the manner in which the above recited features, objects and advantages of the present invention are obtained will become readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Based on the examples in the embodiments, those skilled in the art can obtain other examples without making any inventive effort, which fall within the scope of the invention. The experimental methods in the following examples are conventional methods unless otherwise specified, and materials, reagents, etc. used in the following examples are commercially available unless otherwise specified.
Example 1
Referring to fig. 1 and 2, the invention provides a database cache optimization method based on S3 cloud storage, which comprises the following steps:
S1, a big data system (such as Spark or Presto) provides a data source API supporting various data sources and formats; the data source receives push-down filtering and column pruning requests from the DBMS through this API, so it can reduce the amount of data returned to the DBMS by processing this additional information; a lightweight DBMS-specific data source connector (the Cache Connector) is integrated into the unmodified DBMS through the data source API;
S2, Caches stores column-format data (e.g., Parquet data) on a local SSD, and a cache management system (Cache MS) receives queries, which contain pushed-down predicates, from the Cache Connector through an API; the Cache Server uses predicate pushdown to cache different subsets of the data, called data Regions herein; through conjunctive and disjunctive expressions, four relations exist between Regions: full inclusion, equivalence, intersection, and partial inclusion;
S3, the Cache Server receives the push-down predicate string sent by the Cache Connector and converts it back into an internal AST (abstract syntax tree); the Cache Server uniformly converts the AST into Disjunctive Normal Form (DNF), in which all conjunctions (AND) are pushed down in the expression tree and conjunctions (AND) and disjunctions (OR) no longer interleave; each conjunctive clause (AND) can be regarded as a single hyper-rectangle, and a data Region can be regarded as a disjunction (OR) of hyper-rectangles; the cache granularity is the Region, and all query requests are represented as disjunctions (OR) of conjunctions (AND);
S4, during Region request processing, the Cache Server searches the local cache for a Region that is a superset of the requested Region: it first scans whether a Region cache matches; if the query does not match, the file download manager may obtain the file from the file cache, and if that also misses, the file is pulled from remote storage;
In step S2, the Cache MS first checks the matcher; if there is a cache hit, it returns a set of file paths from local storage; if there is no hit, two options are provided:
1) The DBMS directly uses the Cache Connector to process the data stored remotely, and the Cache MS downloads the data to the Caches through the Cache Connector;
2) The Cache MS applies predicate pushdown to download data from the remote location, stores the result in Caches, and returns a path to the connector;
The contents of the cache may thus be filled by either the DBMS or the Cache MS, but not every requested Region is cached; an LRU-2-based algorithm decides which Regions to keep;
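The LRU-2 admission decision described here can be sketched as follows. This is a minimal illustrative model, not the patent's implementation (class and method names are assumptions): the victim is the entry whose second-most-recent access is oldest, so Regions touched only once never displace Regions that were requested repeatedly.

```python
from itertools import count

class LRU2Cache:
    """Illustrative LRU-2 cache for Regions (names not from the patent).

    Eviction picks the entry whose second-most-recent access is oldest;
    entries seen only once rank lowest, filtering out one-off scans."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}        # region key -> cached value (e.g. file paths)
        self.history = {}      # region key -> logical access timestamps
        self._clock = count()  # deterministic logical clock

    def _penultimate(self, key):
        ts = self.history[key]
        # fewer than two accesses: evicted before any twice-seen entry
        return ts[-2] if len(ts) >= 2 else float("-inf")

    def access(self, key, loader):
        self.history.setdefault(key, []).append(next(self._clock))
        if key in self.store:
            return self.store[key]             # Region cache hit
        value = loader(key)                    # fetch from remote storage
        if len(self.store) >= self.capacity:
            victim = min(self.store, key=self._penultimate)
            del self.store[victim]
        self.store[key] = value
        return value
```

Here `access` both records the reference and admits the Region; a production policy would also weigh a Region's size against the space it frees.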
The Cache Server serves as a storage layer for the DBMS and runs outside the DBMS; information is exchanged through socket connections and shared space in the file system (SSD, ramdisk); during a file request, the DBMS exchanges information about files and required Regions with the Cache Server, which preferentially tries to satisfy the request with cached files;
The API uses a tree-shaped string representation for push-down predicates; since predicates are typically stored as ASTs in the DBMS, the string representation is constructed from the AST; each item is expressed in a tree-like syntax operator(left, right) that supports binary operators, unary operators and literals, where literals are the leaf nodes of the tree; a binary operation either combines multiple predicates (and/or) or forms an atomic predicate (such as gt, lt or eq); atomic predicates use the same binary syntax, with the identifier on the left and the comparison value on the right;
In step S4, the cache matching policy is:
A. Preferentially, a single Region satisfies the request;
B. If no single Region satisfies the request, the Cache MS tries to satisfy each individual hyper-rectangle, but this may require an additional deduplication operation; for example, Regions A and B may together satisfy a query while overlapping without being identical, requiring one deduplication pass;
C. When a combination of multiple Regions satisfies the request, matching uses a greedy algorithm: from the candidate list of Regions, the Region covering the most hyper-rectangles is selected each time, followed by deduplication;
In the embodiment of the invention, Amazon S3 (Amazon Simple Storage Service) is a remote object storage service provided by Amazon and dominant in the field of remote object storage; a database management system (DBMS) is a large software system for manipulating and managing databases, used to create, use and maintain databases; a view is a virtual table in a database whose contents are defined by a query; like a real table, a view contains a series of named columns and rows, but a view does not exist in the database as a stored set of data values; its row and column data come from the tables referenced by the defining query and are generated dynamically when the view is referenced; a materialized view likewise reflects the result of a query, but unlike a view, which stores only its SQL definition, a materialized view stores the data itself;
The Cache Server is divided into two parts: 1) a lightweight DBMS-specific data source connector, hereinafter the Cache Connector; 2) a cache management system, hereinafter the Cache MS. Current big data systems (e.g., Spark, Presto) provide a data source API to support various data sources and formats; through this API the data source receives push-down filtering and column pruning requests from the DBMS, so it can reduce the amount of data returned to the DBMS by processing this additional information. The Cache Connector is integrated into the unmodified DBMS through this data source API; it can be regarded as a data source by the DBMS and as a client by the Cache MS. Caches stores data in column format (e.g., Parquet) on a local SSD. The Cache MS receives queries, containing pushed-down predicates, from the Cache Connector through the API. The Cache MS first checks the matcher; if there is a cache hit, it returns a set of file paths from local storage; if there is no hit, two options are provided:
1) The DBMS directly uses the Cache Connector to process the data stored remotely, and the Cache MS downloads the data to the Caches through the Cache Connector;
2) The Cache MS applies predicate pushdown to download data from the remote location, stores the result in Caches, and returns a path to the connector;
Thus, the contents of the cache may be filled by either the DBMS or the Cache MS, but not every requested Region is cached; an LRU-2-based algorithm decides which Regions to keep;
as described above, the system architecture of the Cache Server makes it suitable for any cloud analysis system:
1) Users can customize cache replacement according to their workload;
2) The remote storage may be replaced; Amazon S3 is not strictly required, but Amazon S3 or similar cloud storage supporting predicate pushdown provides better performance;
3) A custom Cache Connector can be implemented for each DBMS using the system;
The Cache Server uses predicate pushdown to cache different subsets of the data, called data Regions (hereinafter simply Regions). A Region can be regarded as a view, or as another form of semantic cache, and has two advantages over traditional file caching: first, it typically returns a tighter view to the DBMS, reducing further processing of the data and thus saving I/O and CPU cost; second, a Region can be much smaller than the original file, achieving better space utilization and a higher cache hit rate. Turning a DBMS client request into a Region request comprises the following steps:
(1) API
The Cache Server serves as a storage layer for the DBMS and runs outside the DBMS; information is exchanged through socket connections and shared space in the file system (SSD, ramdisk); during a file request, the DBMS exchanges information about files and required Regions with the Cache Server, which preferentially tries to satisfy the request with cached files;
In the invention, the API uses a tree-shaped string representation for push-down predicates; because predicates are usually stored as an AST (abstract syntax tree) in the DBMS, we traverse the AST to construct the string expression; each item is expressed in a tree-like syntax operator(left, right) that supports binary operators, unary operators and literals, where literals are the leaf nodes of the tree; a binary operation either combines multiple predicates (and/or) or forms an atomic predicate (such as gt, lt or eq); atomic predicates use the same binary syntax, with the identifier on the left and the comparison value on the right;
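The tree-string serialization described above can be sketched as follows, under the assumption that predicates are held as simple binary AST nodes (the `Node` class and operator spellings are illustrative, not the patent's internal types):

```python
class Node:
    """Illustrative binary AST node; `right` is None for unary operators."""
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

def serialize(node):
    """Render an AST in the op(left, right) tree-string form.

    Non-Node values are literals and become the leaf text of the tree."""
    if not isinstance(node, Node):
        return str(node)
    if node.right is None:                      # unary operator, e.g. not
        return f"{node.op}({serialize(node.left)})"
    return f"{node.op}({serialize(node.left)}, {serialize(node.right)})"

# WHERE price > 100 AND price < 500
pred = Node("and", Node("gt", "price", 100), Node("lt", "price", 500))
```

`serialize(pred)` yields `and(gt(price, 100), lt(price, 500))`: atomic predicates use the same binary syntax, identifier on the left, comparison value on the right.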
(2) Cache MS reception and conversion
The Cache Server receives the push-down predicate string sent by the Cache Connector and converts it back into an internal AST; because handling arbitrarily nested logical expressions is complex, the Cache Server uniformly converts the AST into Disjunctive Normal Form (DNF), in which all conjunctions (AND) are pushed down in the expression tree and conjunctions (AND) and disjunctions (OR) no longer interleave;
Each conjunctive clause (AND) can be regarded as a single hyper-rectangle, and the data Region described above can then be regarded as a disjunction (OR) of hyper-rectangles; FIG. 3 shows the process of converting to DNF and extracting single hyper-rectangles;
The granularity of the Cache Server cache is the Region, and all query requests are represented as disjunctions (OR) of conjunctions (AND); however, individual conjunctions from different Regions may be combined to satisfy an incoming Region request; some earlier semantic-cache work considered only non-overlapping hyper-rectangles; although non-overlapping hyper-rectangles reduce the complexity of the decision process, their granularity is too fine and too many hyper-rectangles impose extra cost on cache policies, so the Region is still chosen as the minimum cache granularity;
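The AST-to-DNF conversion can be sketched as below. Expressions are modeled as nested tuples and each resulting conjunction (one hyper-rectangle) as a set of atomic predicates; this encoding is an illustrative assumption, not the patent's internal representation:

```python
def to_dnf(node):
    """Flatten a nested and/or expression into DNF.

    Returns a list of conjunctions, each a frozenset of atomic
    predicates (i.e. one hyper-rectangle).  Expressions are nested
    tuples such as ("and", child, child); any tuple whose head is not
    "and"/"or" is treated as an atom."""
    op = node[0]
    if op == "or":
        # a disjunction just concatenates its children's conjunctions
        return [conj for child in node[1:] for conj in to_dnf(child)]
    if op == "and":
        # distribute AND over each child's disjunction of conjunctions
        result = [frozenset()]
        for child in node[1:]:
            result = [a | b for a in result for b in to_dnf(child)]
        return result
    return [frozenset([node])]

# a AND (b OR c)  ->  (a AND b) OR (a AND c): two hyper-rectangles
expr = ("and", ("atom", "a"), ("or", ("atom", "b"), ("atom", "c")))
```

After this pass, AND and OR no longer interleave: the query is exactly the disjunction (OR) of hyper-rectangles that the Region cache operates on.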
(3) Region matching
Through conjunctive and disjunctive expressions, four relations exist between Regions: full inclusion, equivalence, intersection, and partial inclusion; the underlying formulas are conjunctions, i.e., hyper-rectangles, and the algorithm that determines the relationship of two Regions (r_x, r_y) is as follows:
r_x ⊇ r_y (full inclusion) means that every hyper-rectangle of r_y can find a superset among the hyper-rectangles of r_x;
r_x ∩ r_y ≠ ∅ (intersection) means that at least one hyper-rectangle of r_y intersects a hyper-rectangle of r_x;
partial inclusion (partial superset) means that at least one hyper-rectangle of r_y can find a superset among the hyper-rectangles of r_x;
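A sketch of this Region-relation test, modeling a hyper-rectangle as a mapping from column to interval and a Region as a list of hyper-rectangles. Equivalence is approximated here as mutual full inclusion; the representation and names are assumptions for illustration:

```python
def covers(rect, other):
    """True if hyper-rectangle `rect` is a superset of `other`.

    A rectangle is a dict mapping column -> (low, high); a column absent
    from a rectangle is unconstrained on that dimension."""
    return all(
        col in other
        and rect[col][0] <= other[col][0]
        and other[col][1] <= rect[col][1]
        for col in rect
    )

def intersects(a, b):
    """True if the rectangles overlap on every dimension they share."""
    return all(a[c][0] <= b[c][1] and b[c][0] <= a[c][1]
               for c in set(a) & set(b))

def relation(rx, ry):
    """Classify the relation of two Regions (lists of hyper-rectangles)."""
    fwd = [any(covers(a, b) for a in rx) for b in ry]
    if all(fwd):
        bwd = all(any(covers(b, a) for b in ry) for a in rx)
        return "equivalence" if bwd else "full inclusion"
    if any(fwd):
        return "partial inclusion"
    if any(intersects(a, b) for a in rx for b in ry):
        return "intersection"
    return "disjoint"
```

For example, a cached Region `[{"x": (0, 10)}]` fully includes a request `[{"x": (2, 5)}]` and can answer it directly from the local store.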
FIG. 4 shows an example in which a query consisting of two hyper-rectangles is matched against two storage Regions;
(4) Request matching flow and algorithm
As shown in fig. 5, first, scan whether the Region caches match, if the queries do not match, the file download manager may obtain the file from the file caches, if not, pull the file from the remote store;
cache matching policy:
A. Preferentially, a single Region satisfies the request;
B. If no single Region satisfies the request, the Cache MS tries to satisfy each individual hyper-rectangle, but this may require an additional deduplication operation; for example, Regions A and B may together satisfy a query while overlapping without being identical, requiring one deduplication pass;
C. When a combination of multiple Regions satisfies the request, matching uses a greedy algorithm: from the candidate list of Regions, the Region covering the most hyper-rectangles is selected each time, followed by deduplication;
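Strategy C can be sketched as a greedy set cover over the query's hyper-rectangles. The data layout (rectangles as column-to-interval dicts) and all function names are illustrative assumptions:

```python
def covers(rect, other):
    """True if hyper-rectangle `rect` is a superset of `other`.
    Rectangles are dicts mapping column -> (low, high) interval; a column
    absent from a rectangle is unconstrained on that dimension."""
    return all(
        col in other
        and rect[col][0] <= other[col][0]
        and other[col][1] <= rect[col][1]
        for col in rect
    )

def greedy_match(query_rects, candidate_regions):
    """Greedy cover of the query's hyper-rectangles by cached Regions.

    Repeatedly picks the Region covering the most still-unsatisfied
    rectangles; any residue falls back to the file cache or remote
    storage.  Overlap between chosen Regions is deduplicated downstream."""
    remaining = set(range(len(query_rects)))
    chosen = []
    while remaining:
        best, best_cover = None, set()
        for region in candidate_regions:
            cover = {i for i in remaining
                     if any(covers(r, query_rects[i]) for r in region)}
            if len(cover) > len(best_cover):
                best, best_cover = region, cover
        if not best_cover:
            break          # no Region helps; the rest misses the Region cache
        chosen.append(best)
        remaining -= best_cover
    return chosen, remaining
```

A single wide Region is preferred automatically, since it covers more rectangles in one step than any combination of narrower ones.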
The invention has good usability: semantic-cache or intermediate-result schemes require integrating with and modifying the original DBMS (database management system), with cached content implemented specifically for each DBMS's SQL queries, which is difficult to develop; the present invention needs only a connector implementing predicate serialization, and is therefore highly general across DBMSs. Cache utilization and cache hit rate are high: caching only semantic results or final results wastes resources through low utilization, while cutting the cache into non-overlapping blocks is computationally expensive, consuming large amounts of computing resources on every cache read and update. In the invention, the Region cache, designed for remote storage supporting predicate pushdown such as Amazon S3, caches only the disjunctions of hyper-rectangles extracted from predicate partitioning, so cache granularity is coarse and resource consumption is low; on a cache hit, a greedy algorithm that satisfies as many Regions as possible is used and overlapping Regions are deduplicated, which improves the cache hit rate and prevents frequent cache updates from degrading system performance.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (6)
1. A database cache optimization method based on S3 cloud storage, characterized by comprising the following steps:
s1, a big data system (such as Spark, prest) provides a data source API to support various data sources and formats, the data source receives push-down filtering and column pruning requests from the DBMS through the API, so the data source can reduce the data amount returned to the DBMS by processing the additional information, and a lightweight designated DBMS data source Connector (Cache Connector) is integrated into the unmodified DBMS through the data source API;
s2, storing column format data (such as part data) in a local ssd by Caches, and receiving a query from a Cache Connector through an API by a Cache management system (Cache MS), wherein the query comprises predicate pushdown, and the predicate pushdown Cache Server is used for caching different subsets of the data, which are called as data areas (regions);
s3, the Cache Server receives push-down predicate strings sent by the Cache Connector AND then converts the push-down predicate strings back into internal AST (abstract syntax tree), the Cache Server uniformly converts the AST into a Disjunctive Normal Form (DNF), all connections (AND) are pushed down into an expression tree in the disjunctive normal form, the connections (AND) AND the separations (OR) are not staggered any more, each conjunctive normal form (AND) can be regarded as a single set hyper-rectangle, the data area Region can be regarded as a disjunctive normal form (OR) of the hyper-rectangle, the granularity of Cache is Region, AND all query requests are represented by disjunctive (OR) of the conjunctive (AND);
s4, during the processing of the Region request, whether a superset is matched with the Region request in the Cache or not is judged by the Cache Server in the local search request, whether the Region Cache is matched or not is firstly scanned, if the query is not matched, the file download manager can acquire the file from the file Cache, and if the query is not matched, the file is pulled from the remote storage.
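The AND-over-OR normalization described in step S3 can be sketched as follows. This is a minimal illustration under an assumed tuple-based AST encoding; the claim does not specify data structures, so all names here are illustrative, not the patented implementation:

```python
# Sketch of the AST-to-DNF conversion: the AST is encoded as nested tuples
# ("and"/"or", left, right) with atomic predicates like ("gt", column, value).
# This encoding and the function name are assumptions for illustration.

def to_dnf(node):
    """Return the DNF as a list of conjunctions; each conjunction is a list
    of atomic predicates and corresponds to one hyper-rectangle."""
    op = node[0]
    if op == "or":                      # disjunction: union of both sides' DNFs
        return to_dnf(node[1]) + to_dnf(node[2])
    if op == "and":                     # distribute AND over OR
        return [l + r
                for l in to_dnf(node[1])
                for r in to_dnf(node[2])]
    return [[node]]                     # atomic predicate is its own conjunction

# (price > 100 OR qty < 5) AND region = 'EU'
ast = ("and",
       ("or", ("gt", "price", 100), ("lt", "qty", 5)),
       ("eq", "region", "EU"))
dnf = to_dnf(ast)
# dnf is a disjunction of two conjunctions, i.e. two hyper-rectangles:
# [[("gt","price",100), ("eq","region","EU")],
#  [("lt","qty",5), ("eq","region","EU")]]
```

After this pass, AND and OR are no longer interleaved: the result is always an OR of ANDs, matching the Region representation of step S3.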
2. The database cache optimization method based on S3 cloud storage according to claim 1, characterized in that: in step S2, the Cache MS first checks the matcher; if there is a cache hit, it returns a set of file paths from the local store; if there is a miss, two options are provided:
1) the DBMS directly uses the Cache Connector to process the remotely stored data, and the Cache MS downloads the data into the Caches through the Cache Connector;
2) the Cache MS applies predicate pushdown to download the data from the remote location, stores the result in the Caches, and returns a path to the connector;
the contents of the Cache may be filled by the DBMS or by the Cache MS, but not every requested Region is cached; an LRU-2-based algorithm determines which Regions to cache.
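The LRU-2-based admission and eviction decision mentioned above can be sketched roughly as follows. The claim does not specify data structures, so the class, fields, and tie-breaking here are assumptions; a real LRU-2 implementation would also handle concurrency and sizes:

```python
import time
from collections import defaultdict

class LRU2Cache:
    """Illustrative LRU-2 policy: the eviction victim is the entry whose
    second-most-recent access is oldest; entries seen only once rank lowest.
    Everything here is an assumption for illustration, not the patented code."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.history = defaultdict(list)   # key -> up to two latest access times
        self.store = {}                    # cached Regions: key -> local file path

    def access(self, key, value=None):
        ts = self.history[key]
        ts.append(time.monotonic())
        del ts[:-2]                        # keep only the last two accesses
        if key in self.store:
            return self.store[key]         # cache hit: return the cached path
        if value is not None:              # admit the Region, evicting if full
            if len(self.store) >= self.capacity:
                victim = min(self.store,
                             key=lambda k: self.history[k][0]
                             if len(self.history[k]) == 2 else float("-inf"))
                del self.store[victim]
            self.store[key] = value
        return None                        # cache miss

cache = LRU2Cache(capacity=2)
cache.access("region-1", "/ssd/cache/r1.parquet")   # miss: Region admitted
cache.access("region-1")                            # hit: returns the path
```

The point of LRU-2 over plain LRU is that a Region requested only once ranks below Regions requested at least twice, so one-off scans do not flush frequently reused Regions.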
3. The database cache optimization method based on S3 cloud storage according to claim 1, characterized in that: the Cache Server serves as a storage layer of the DBMS and runs outside the DBMS; information is exchanged through a socket connection and a shared space in the file system (SSD or ramdisk); during a file request, the DBMS exchanges information about the file and the required Region with the Cache Server, and the Cache Server preferentially attempts to satisfy the request with cached files.
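The socket exchange in claim 3 might look like the following minimal sketch. The claim does not define a wire format, so the one-line "file region" request, the "MISS" marker, and all paths are assumptions purely for illustration:

```python
import socket
import threading

def cache_server(srv, cached):
    """Answer one request: 'file region-string' -> cached local path or MISS.
    The message format is an assumption; the claim specifies none."""
    conn, _ = srv.accept()
    with conn:
        file, region = conn.recv(1024).decode().split(" ", 1)
        conn.sendall(cached.get((file, region), "MISS").encode())

srv = socket.socket()
srv.bind(("127.0.0.1", 0))       # ephemeral port on loopback
srv.listen(1)
cached = {("orders.parquet", "gt(price,100)"): "/ssd/cache/orders_r1.parquet"}
t = threading.Thread(target=cache_server, args=(srv, cached))
t.start()

# The DBMS side: ask the Cache Server for a file restricted to a Region.
cli = socket.create_connection(srv.getsockname())
cli.sendall(b"orders.parquet gt(price,100)")
reply = cli.recv(1024).decode()  # local path in the shared file-system space
cli.close(); t.join(); srv.close()
```

On a hit the DBMS then reads the returned path directly from the shared SSD/ramdisk space, so only metadata crosses the socket, not the data itself.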
4. The database cache optimization method based on S3 cloud storage according to claim 1, characterized in that: the API uses a tree string representation to push down predicates; predicates are typically stored as ASTs in the DBMS, so the string representation is constructed from the AST; each individual item is structured in a tree-like syntax op(left, right) that supports binary operators, unary operators, and literals, where literals are the leaf nodes of the tree; a binary operation is either a combination of predicates (and, or) or an atomic predicate (e.g., gt, lt, eq); both use the same binary syntax, with left being the identifier and right being the comparison value.
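A serializer producing the op(left, right) tree-string form of claim 4 can be sketched as follows; this handles only the binary case (unary operators are omitted), and the tuple encoding and names are assumptions for illustration:

```python
# Sketch of serializing a predicate AST into the op(left,right) tree string.
# Leaves (column identifiers and constants) are emitted as-is; binary nodes
# recurse on both children. Names and encoding are illustrative assumptions.

def serialize(node):
    if not isinstance(node, tuple):      # leaf: column identifier or constant
        return str(node)
    op, left, right = node
    return f"{op}({serialize(left)},{serialize(right)})"

# (price > 100 AND region = 'EU') OR qty < 5
ast = ("or",
       ("and", ("gt", "price", 100), ("eq", "region", "EU")),
       ("lt", "qty", 5))
s = serialize(ast)
# s == "or(and(gt(price,100),eq(region,EU)),lt(qty,5))"
```

Note how the same op(left, right) shape serves both the predicate combinators (and, or) and the atomic comparisons (gt, eq, lt), exactly as the claim describes.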
5. The database cache optimization method based on S3 cloud storage according to claim 1, characterized in that: through conjunctive and disjunctive expressions, four relations may hold between Regions: full inclusion, equivalence, intersection, and partial inclusion.
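One way to classify relations between two hyper-rectangles (here simplified to closed interval maps over the same columns) is sketched below. The mapping of the classifier's labels onto the claim's four terms is an assumption, as is the whole representation:

```python
def relation(a, b):
    """Classify the relation between two hyper-rectangles given as
    {column: (low, high)} closed-interval maps over the same columns.
    Representation and label names are illustrative assumptions."""
    def contains(x, y):          # every interval of x encloses that of y
        return all(x[c][0] <= y[c][0] and y[c][1] <= x[c][1] for c in y)
    def overlaps(x, y):          # intervals overlap on every column
        return all(x[c][0] <= y[c][1] and y[c][0] <= x[c][1] for c in x)
    if contains(a, b) and contains(b, a):
        return "equivalent"
    if contains(a, b) or contains(b, a):
        return "full inclusion"
    if overlaps(a, b):
        return "intersecting"
    return "disjoint"

a = {"price": (0, 100), "qty": (0, 50)}
b = {"price": (10, 90), "qty": (5, 40)}
r = relation(a, b)   # "full inclusion": b lies entirely inside a
```

On a "full inclusion" or "equivalent" result the cached Region can serve the request directly; "intersecting" results are what trigger the de-duplication of claim 6.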
6. The database cache optimization method based on S3 cloud storage according to claim 1, characterized in that: in step S4, the cache matching policy is:
A. preferentially, a single Region that satisfies the request is used;
B. if no single Region satisfies the request, the Cache MS tries to satisfy individual hyper-rectangles, but this may require additional de-duplication; for example, Regions A and B together may satisfy a query while their areas overlap without being identical, requiring one pass of de-duplication;
C. when a combination of multiple Regions satisfies the request, matching is performed with a greedy algorithm: from a candidate list consisting of multiple Regions, the Region that covers the most hyper-rectangles is selected each time, and de-duplication is then performed.
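The greedy selection in item C can be sketched as a small set-cover loop. The data layout (Regions as sets of hyper-rectangle identifiers) and all names are assumptions for illustration:

```python
# Sketch of the greedy matcher: given the query's hyper-rectangles and the
# candidate cached Regions (each covering some of those rectangles), repeatedly
# pick the Region covering the most still-uncovered rectangles. Overlap between
# chosen Regions is de-duplicated downstream, as items B and C describe.

def greedy_match(query_rects, candidates):
    """candidates: {region_name: set of hyper-rectangle ids it can serve}."""
    uncovered = set(query_rects)
    chosen = []
    while uncovered:
        name, covered = max(candidates.items(),
                            key=lambda kv: len(kv[1] & uncovered))
        gain = covered & uncovered
        if not gain:             # no candidate helps; the rest is fetched remotely
            break
        chosen.append(name)
        uncovered -= gain
    return chosen, uncovered

regions = {"R1": {1, 2}, "R2": {2, 3}, "R3": {4}}
greedy_match({1, 2, 3, 4}, regions)   # (['R1', 'R2', 'R3'], set())
```

Greedy set cover is not optimal in general, but it keeps each matching decision cheap, which is consistent with the invention's goal of low resource consumption per cache lookup.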
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310769257.2A CN117009302A (en) | 2023-06-28 | 2023-06-28 | Database cache optimization method based on S3 cloud storage |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117009302A true CN117009302A (en) | 2023-11-07 |
Family
ID=88571784
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310769257.2A Pending CN117009302A (en) | 2023-06-28 | 2023-06-28 | Database cache optimization method based on S3 cloud storage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117009302A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||