CN117009302A - Database cache optimization method based on S3 cloud storage - Google Patents
- Publication number
- CN117009302A (application number CN202310769257.2A)
- Authority
- CN
- China
- Prior art keywords
- cache
- region
- data
- dbms
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/12—Replacement control
- G06F12/121—Replacement control using replacement algorithms
- G06F12/123—Replacement control using replacement algorithms with age lists, e.g. queue, most recently used [MRU] list or least recently used [LRU] list
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/176—Support for shared access to files; File sharing support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/06—Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/16—Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
- H04L69/161—Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields
- H04L69/162—Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields involving adaptations of sockets based mechanisms
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a database cache optimization method based on S3 cloud storage, which comprises the following steps: S1, a data source receives push-down filtering and column pruning requests from a DBMS through an API, and a Cache Connector is integrated into the unmodified DBMS through the data source API; S2, Caches stores data in column format on a local SSD, and a Cache MS receives queries, including pushed-down predicates, from the Cache Connector through an API; S3, the Cache Server receives the push-down predicate string sent by the Cache Connector, converts it back into an internal AST, and uniformly converts the AST into Disjunctive Normal Form (DNF); the cache granularity is the Region, and every query request is represented as a disjunction (OR) of conjunctions (AND); S4, during Region request processing, the Cache Server searches the local cache for a Region that is a superset of the requested Region: it first scans whether a Region cache matches; if the query does not match, the file download manager may obtain the file from the file cache, and if that also misses, the file is pulled from remote storage.
Description
Technical Field
The invention relates to the technical field of IT application, in particular to a database cache optimization method based on S3 cloud storage.
Background
Amazon Simple Storage Service (Amazon S3 for short) is an object storage service with industry-leading scalability, data availability, security and performance. The service was developed by Amazon Web Services (AWS) and first launched on March 14, 2006. Amazon S3 Intelligent-Tiering provides 99.999999999% (eleven nines) durability and 99.9% availability. Its management features allow users to optimize, configure, and organize access to their data to meet specific compliance, business, or organizational requirements.
Existing caching schemes for Amazon S3 have several drawbacks. Usability is poor: manually creating views requires an experienced DBA (database administrator) familiar with the workload's hot-spot queries and with which data is hot or cold; otherwise the created views do not improve query efficiency and instead increase query latency and resource consumption through cache misses. Maintainability is poor: semantic-cache or intermediate-result schemes require integrating with and modifying the original DBMS (database management system), and the cached content must be implemented specifically for each DBMS's SQL queries, so development is difficult and demands much of maintainers. Cache utilization is low: for fine-grained data-block and data-page caches, a large amount of cache space is needed to guarantee a high cache hit rate and hence high query efficiency.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a database cache optimization method based on S3 cloud storage, which decouples the cache from the database management system, improves the generality of the cache, generates an adaptive cache, improves the usability of the system, significantly improves cache utilization, saves remote-storage bandwidth and reduces query latency.
In order to solve the technical problems, the invention provides the following technical scheme: a database cache optimization method based on S3 cloud storage comprises the following steps:
S1, a big data system (such as Spark or Presto) provides a data source API supporting various data sources and formats; the data source receives push-down filtering and column pruning requests from the DBMS through this API, so it can reduce the amount of data returned to the DBMS by processing this additional information; a lightweight DBMS-specific data source connector (the Cache Connector) is integrated into the unmodified DBMS through the data source API;
S2, Caches stores column-format data (e.g., Parquet data) on a local SSD, and a cache management system (Cache MS) receives queries, which contain pushed-down predicates, from the Cache Connector through an API; the Cache Server uses predicate pushdown to cache different subsets of the data, called data Regions;
S3, the Cache Server receives the push-down predicate string sent by the Cache Connector and converts it back into an internal AST (abstract syntax tree); the Cache Server uniformly converts the AST into Disjunctive Normal Form (DNF), in which all conjunctions (AND) are pushed down in the expression tree and conjunctions (AND) and disjunctions (OR) no longer interleave; each conjunctive clause (AND) can be regarded as a single hyper-rectangle, and a data Region can be regarded as a disjunction (OR) of hyper-rectangles; the cache granularity is the Region, and all query requests are represented as disjunctions (OR) of conjunctions (AND);
S4, during Region request processing, the Cache Server searches the local cache for a Region that is a superset of the requested Region: it first scans whether a Region cache matches; if the query does not match, the file download manager may obtain the file from the file cache, and if that also misses, the file is pulled from remote storage;
Further, in step S2, the Cache MS first checks the matcher; if there is a cache hit, it returns a set of file paths from local storage; if there is no hit, two options are provided:
1) The DBMS directly uses the Cache Connector to process the data stored remotely, and the Cache MS downloads the data to the Caches through the Cache Connector;
2) The Cache MS applies predicate pushdown to download data from the remote location, stores the result in Caches, and returns a path to the connector;
The contents of the cache may thus be filled by either the DBMS or the Cache MS, but not every requested Region is cached; an LRU-2-based algorithm decides which Regions to keep;
Still further, the Cache Server serves as a storage layer for the DBMS and runs outside the DBMS; information is exchanged through socket connections and shared space in the file system (SSD, ramdisk); during a file request, the DBMS exchanges information about files and required Regions with the Cache Server, which preferentially tries to satisfy the request with cached files;
Still further, the API uses a tree-shaped string representation for push-down predicates; since predicates are typically stored as ASTs in the DBMS, the string representation is constructed from the AST; each item is expressed in a tree-like syntax operator(left, right) that supports binary operators, unary operators and literals, where literals are the leaf nodes of the tree; a binary operation either combines multiple predicates (and/or) or forms an atomic predicate (such as gt, lt or eq); atomic predicates use the same binary syntax, with the identifier on the left and the comparison value on the right;
Still further, through conjunctive and disjunctive expressions, four relations exist between Regions: full inclusion, equivalence, intersection, and partial inclusion;
Still further, in step S4, the cache matching policy is:
A. Preferentially, a single Region satisfies the request;
B. If no single Region satisfies the request, the Cache MS tries to satisfy each individual hyper-rectangle, but this may require an additional deduplication operation; for example, Regions A and B may together satisfy a query while overlapping without being identical, requiring one deduplication pass;
C. When a combination of multiple Regions satisfies the request, matching uses a greedy algorithm: from the candidate list of Regions, the Region covering the most hyper-rectangles is selected each time, followed by deduplication.
Compared with the prior art, the invention has the following beneficial effects:
The invention has good usability: only a connector implementing predicate serialization is needed, making it highly general across DBMSs. Cache utilization and cache hit rate are high: the Region cache, designed for remote storage supporting predicate pushdown such as Amazon S3, caches only the hyper-rectangles extracted from predicate partitioning, so cache granularity is coarse and resource consumption is low; on a cache hit, a greedy algorithm that satisfies as many Regions as possible is used and overlapping Regions are deduplicated, which improves the cache hit rate and prevents frequent cache updates from degrading system performance.
Drawings
FIG. 1 is a diagram of the overall framework of the present invention;
FIG. 2 is a composition diagram of the Cache MS of the present invention;
FIG. 3 is a diagram of the process of converting to DNF and extracting single hyper-rectangles according to the invention;
FIG. 4 is a diagram of an example of matching a query consisting of two hyper-rectangles to two memory regions in accordance with the present invention;
FIG. 5 is a schematic diagram of a matching process and algorithm of the present invention;
Detailed Description
In order that the manner in which the above recited features, objects and advantages of the present invention are obtained will become readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Based on the examples in the embodiments, those skilled in the art can obtain other examples without making any inventive effort, which fall within the scope of the invention. The experimental methods in the following examples are conventional methods unless otherwise specified, and materials, reagents, etc. used in the following examples are commercially available unless otherwise specified.
Example 1
Referring to fig. 1 and 2, the invention provides a database cache optimization method based on S3 cloud storage, which comprises the following steps:
S1, a big data system (such as Spark or Presto) provides a data source API supporting various data sources and formats; the data source receives push-down filtering and column pruning requests from the DBMS through this API, so it can reduce the amount of data returned to the DBMS by processing this additional information; a lightweight DBMS-specific data source connector (the Cache Connector) is integrated into the unmodified DBMS through the data source API;
S2, Caches stores column-format data (e.g., Parquet data) on a local SSD, and a cache management system (Cache MS) receives queries, which contain pushed-down predicates, from the Cache Connector through an API; the Cache Server uses predicate pushdown to cache different subsets of the data, called data Regions herein; through conjunctive and disjunctive expressions, four relations exist between Regions: full inclusion, equivalence, intersection, and partial inclusion;
S3, the Cache Server receives the push-down predicate string sent by the Cache Connector and converts it back into an internal AST (abstract syntax tree); the Cache Server uniformly converts the AST into Disjunctive Normal Form (DNF), in which all conjunctions (AND) are pushed down in the expression tree and conjunctions (AND) and disjunctions (OR) no longer interleave; each conjunctive clause (AND) can be regarded as a single hyper-rectangle, and a data Region can be regarded as a disjunction (OR) of hyper-rectangles; the cache granularity is the Region, and all query requests are represented as disjunctions (OR) of conjunctions (AND);
S4, during Region request processing, the Cache Server searches the local cache for a Region that is a superset of the requested Region: it first scans whether a Region cache matches; if the query does not match, the file download manager may obtain the file from the file cache, and if that also misses, the file is pulled from remote storage;
In step S2, the Cache MS first checks the matcher; if there is a cache hit, it returns a set of file paths from local storage; if there is no hit, two options are provided:
1) The DBMS directly uses the Cache Connector to process the data stored remotely, and the Cache MS downloads the data to the Caches through the Cache Connector;
2) The Cache MS applies predicate pushdown to download data from the remote location, stores the result in Caches, and returns a path to the connector;
The contents of the cache may thus be filled by either the DBMS or the Cache MS, but not every requested Region is cached; an LRU-2-based algorithm decides which Regions to keep;
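The LRU-2 admission decision described here can be sketched as follows. This is a minimal illustrative model, not the patent's implementation (class and method names are assumptions): the victim is the entry whose second-most-recent access is oldest, so Regions touched only once never displace Regions that were requested repeatedly.

```python
from itertools import count

class LRU2Cache:
    """Illustrative LRU-2 cache for Regions (names not from the patent).

    Eviction picks the entry whose second-most-recent access is oldest;
    entries seen only once rank lowest, filtering out one-off scans."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}        # region key -> cached value (e.g. file paths)
        self.history = {}      # region key -> logical access timestamps
        self._clock = count()  # deterministic logical clock

    def _penultimate(self, key):
        ts = self.history[key]
        # fewer than two accesses: evicted before any twice-seen entry
        return ts[-2] if len(ts) >= 2 else float("-inf")

    def access(self, key, loader):
        self.history.setdefault(key, []).append(next(self._clock))
        if key in self.store:
            return self.store[key]             # Region cache hit
        value = loader(key)                    # fetch from remote storage
        if len(self.store) >= self.capacity:
            victim = min(self.store, key=self._penultimate)
            del self.store[victim]
        self.store[key] = value
        return value
```

Here `access` both records the reference and admits the Region; a production policy would also weigh a Region's size against the space it frees.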
The Cache Server serves as a storage layer for the DBMS and runs outside the DBMS; information is exchanged through socket connections and shared space in the file system (SSD, ramdisk); during a file request, the DBMS exchanges information about files and required Regions with the Cache Server, which preferentially tries to satisfy the request with cached files;
The API uses a tree-shaped string representation for push-down predicates; since predicates are typically stored as ASTs in the DBMS, the string representation is constructed from the AST; each item is expressed in a tree-like syntax operator(left, right) that supports binary operators, unary operators and literals, where literals are the leaf nodes of the tree; a binary operation either combines multiple predicates (and/or) or forms an atomic predicate (such as gt, lt or eq); atomic predicates use the same binary syntax, with the identifier on the left and the comparison value on the right;
In step S4, the cache matching policy is:
A. Preferentially, a single Region satisfies the request;
B. If no single Region satisfies the request, the Cache MS tries to satisfy each individual hyper-rectangle, but this may require an additional deduplication operation; for example, Regions A and B may together satisfy a query while overlapping without being identical, requiring one deduplication pass;
C. When a combination of multiple Regions satisfies the request, matching uses a greedy algorithm: from the candidate list of Regions, the Region covering the most hyper-rectangles is selected each time, followed by deduplication;
In the embodiment of the invention, Amazon S3 (Amazon Simple Storage Service) is a remote object storage service provided by Amazon and dominant in the field of remote object storage; a database management system (DBMS) is a large software system for manipulating and managing databases, used to create, use and maintain databases; a view is a virtual table in a database whose contents are defined by a query; like a real table, a view contains a series of named columns and rows, but a view does not exist in the database as a stored set of data values; its row and column data come from the tables referenced by the defining query and are generated dynamically when the view is referenced; a materialized view likewise reflects the result of a query, but unlike a view, which stores only its SQL definition, a materialized view stores the data itself;
The Cache Server is divided into two parts: 1) a lightweight DBMS-specific data source connector, hereinafter the Cache Connector; 2) a cache management system, hereinafter the Cache MS. Current big data systems (e.g., Spark, Presto) provide a data source API to support various data sources and formats; through this API the data source receives push-down filtering and column pruning requests from the DBMS, so it can reduce the amount of data returned to the DBMS by processing this additional information. The Cache Connector is integrated into the unmodified DBMS through this data source API; it can be regarded as a data source by the DBMS and as a client by the Cache MS. Caches stores data in column format (e.g., Parquet) on a local SSD. The Cache MS receives queries, containing pushed-down predicates, from the Cache Connector through the API. The Cache MS first checks the matcher; if there is a cache hit, it returns a set of file paths from local storage; if there is no hit, two options are provided:
1) The DBMS directly uses the Cache Connector to process the data stored remotely, and the Cache MS downloads the data to the Caches through the Cache Connector;
2) The Cache MS applies predicate pushdown to download data from the remote location, stores the result in Caches, and returns a path to the connector;
Thus, the contents of the cache may be filled by either the DBMS or the Cache MS, but not every requested Region is cached; an LRU-2-based algorithm decides which Regions to keep;
as described above, the system architecture of the Cache Server makes it suitable for any cloud analysis system:
1) Users can customize cache replacement according to their workload;
2) The remote storage may be replaced; Amazon S3 is not strictly required, but Amazon S3 or similar cloud storage supporting predicate pushdown provides better performance;
3) A custom Cache Connector can be implemented for each DBMS using the system;
The Cache Server uses predicate pushdown to cache different subsets of the data, called data Regions (hereinafter simply Regions). A Region can be regarded as a view, or as another form of semantic cache, and has two advantages over traditional file caching: first, it typically returns a tighter view to the DBMS, reducing further processing of the data and thus saving I/O and CPU cost; second, a Region can be much smaller than the original file, achieving better space utilization and a higher cache hit rate. Turning a DBMS client request into a Region request comprises the following steps:
(1) API
The Cache Server serves as a storage layer for the DBMS and runs outside the DBMS; information is exchanged through socket connections and shared space in the file system (SSD, ramdisk); during a file request, the DBMS exchanges information about files and required Regions with the Cache Server, which preferentially tries to satisfy the request with cached files;
In the invention, the API uses a tree-shaped string representation for push-down predicates; because predicates are usually stored as an AST (abstract syntax tree) in the DBMS, we traverse the AST to construct the string expression; each item is expressed in a tree-like syntax operator(left, right) that supports binary operators, unary operators and literals, where literals are the leaf nodes of the tree; a binary operation either combines multiple predicates (and/or) or forms an atomic predicate (such as gt, lt or eq); atomic predicates use the same binary syntax, with the identifier on the left and the comparison value on the right;
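The tree-string serialization described above can be sketched as follows, under the assumption that predicates are held as simple binary AST nodes (the `Node` class and operator spellings are illustrative, not the patent's internal types):

```python
class Node:
    """Illustrative binary AST node; `right` is None for unary operators."""
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

def serialize(node):
    """Render an AST in the op(left, right) tree-string form.

    Non-Node values are literals and become the leaf text of the tree."""
    if not isinstance(node, Node):
        return str(node)
    if node.right is None:                      # unary operator, e.g. not
        return f"{node.op}({serialize(node.left)})"
    return f"{node.op}({serialize(node.left)}, {serialize(node.right)})"

# WHERE price > 100 AND price < 500
pred = Node("and", Node("gt", "price", 100), Node("lt", "price", 500))
```

`serialize(pred)` yields `and(gt(price, 100), lt(price, 500))`: atomic predicates use the same binary syntax, identifier on the left, comparison value on the right.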
(2) Cache MS reception and conversion
The Cache Server receives the push-down predicate string sent by the Cache Connector and converts it back into an internal AST; because handling arbitrarily nested logical expressions is complex, the Cache Server uniformly converts the AST into Disjunctive Normal Form (DNF), in which all conjunctions (AND) are pushed down in the expression tree and conjunctions (AND) and disjunctions (OR) no longer interleave;
Each conjunctive clause (AND) can be regarded as a single hyper-rectangle, and the data Region described above can then be regarded as a disjunction (OR) of hyper-rectangles; FIG. 3 shows the process of converting to DNF and extracting single hyper-rectangles;
The granularity of the Cache Server cache is the Region, and all query requests are represented as disjunctions (OR) of conjunctions (AND); however, individual conjunctions from different Regions may be combined to satisfy an incoming Region request; some earlier semantic-cache work considered only non-overlapping hyper-rectangles; although non-overlapping hyper-rectangles reduce the complexity of the decision process, their granularity is too fine and too many hyper-rectangles impose extra cost on cache policies, so the Region is still chosen as the minimum cache granularity;
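The AST-to-DNF conversion can be sketched as below. Expressions are modeled as nested tuples and each resulting conjunction (one hyper-rectangle) as a set of atomic predicates; this encoding is an illustrative assumption, not the patent's internal representation:

```python
def to_dnf(node):
    """Flatten a nested and/or expression into DNF.

    Returns a list of conjunctions, each a frozenset of atomic
    predicates (i.e. one hyper-rectangle).  Expressions are nested
    tuples such as ("and", child, child); any tuple whose head is not
    "and"/"or" is treated as an atom."""
    op = node[0]
    if op == "or":
        # a disjunction just concatenates its children's conjunctions
        return [conj for child in node[1:] for conj in to_dnf(child)]
    if op == "and":
        # distribute AND over each child's disjunction of conjunctions
        result = [frozenset()]
        for child in node[1:]:
            result = [a | b for a in result for b in to_dnf(child)]
        return result
    return [frozenset([node])]

# a AND (b OR c)  ->  (a AND b) OR (a AND c): two hyper-rectangles
expr = ("and", ("atom", "a"), ("or", ("atom", "b"), ("atom", "c")))
```

After this pass, AND and OR no longer interleave: the query is exactly the disjunction (OR) of hyper-rectangles that the Region cache operates on.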
(3) Region matching
Through conjunctive and disjunctive expressions, four relations exist between Regions: full inclusion, equivalence, intersection, and partial inclusion; the underlying formulas are conjunctions, i.e., hyper-rectangles, and the algorithm that determines the relationship of two Regions (r_x, r_y) is as follows:
r_x ⊇ r_y (full inclusion) means that every hyper-rectangle of r_y can find a superset among the hyper-rectangles of r_x;
r_x ∩ r_y ≠ ∅ (intersection) means that at least one hyper-rectangle of r_y intersects a hyper-rectangle of r_x;
partial inclusion (partial superset) means that at least one hyper-rectangle of r_y can find a superset among the hyper-rectangles of r_x;
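A sketch of this Region-relation test, modeling a hyper-rectangle as a mapping from column to interval and a Region as a list of hyper-rectangles. Equivalence is approximated here as mutual full inclusion; the representation and names are assumptions for illustration:

```python
def covers(rect, other):
    """True if hyper-rectangle `rect` is a superset of `other`.

    A rectangle is a dict mapping column -> (low, high); a column absent
    from a rectangle is unconstrained on that dimension."""
    return all(
        col in other
        and rect[col][0] <= other[col][0]
        and other[col][1] <= rect[col][1]
        for col in rect
    )

def intersects(a, b):
    """True if the rectangles overlap on every dimension they share."""
    return all(a[c][0] <= b[c][1] and b[c][0] <= a[c][1]
               for c in set(a) & set(b))

def relation(rx, ry):
    """Classify the relation of two Regions (lists of hyper-rectangles)."""
    fwd = [any(covers(a, b) for a in rx) for b in ry]
    if all(fwd):
        bwd = all(any(covers(b, a) for b in ry) for a in rx)
        return "equivalence" if bwd else "full inclusion"
    if any(fwd):
        return "partial inclusion"
    if any(intersects(a, b) for a in rx for b in ry):
        return "intersection"
    return "disjoint"
```

For example, a cached Region `[{"x": (0, 10)}]` fully includes a request `[{"x": (2, 5)}]` and can answer it directly from the local store.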
FIG. 4 shows an example in which a query consisting of two hyper-rectangles is matched against two storage Regions;
(4) Request matching flow and algorithm
As shown in fig. 5, first, scan whether the Region caches match, if the queries do not match, the file download manager may obtain the file from the file caches, if not, pull the file from the remote store;
cache matching policy:
A. Preferentially, a single Region satisfies the request;
B. If no single Region satisfies the request, the Cache MS tries to satisfy each individual hyper-rectangle, but this may require an additional deduplication operation; for example, Regions A and B may together satisfy a query while overlapping without being identical, requiring one deduplication pass;
C. When a combination of multiple Regions satisfies the request, matching uses a greedy algorithm: from the candidate list of Regions, the Region covering the most hyper-rectangles is selected each time, followed by deduplication;
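Strategy C can be sketched as a greedy set cover over the query's hyper-rectangles. The data layout (rectangles as column-to-interval dicts) and all function names are illustrative assumptions:

```python
def covers(rect, other):
    """True if hyper-rectangle `rect` is a superset of `other`.
    Rectangles are dicts mapping column -> (low, high) interval; a column
    absent from a rectangle is unconstrained on that dimension."""
    return all(
        col in other
        and rect[col][0] <= other[col][0]
        and other[col][1] <= rect[col][1]
        for col in rect
    )

def greedy_match(query_rects, candidate_regions):
    """Greedy cover of the query's hyper-rectangles by cached Regions.

    Repeatedly picks the Region covering the most still-unsatisfied
    rectangles; any residue falls back to the file cache or remote
    storage.  Overlap between chosen Regions is deduplicated downstream."""
    remaining = set(range(len(query_rects)))
    chosen = []
    while remaining:
        best, best_cover = None, set()
        for region in candidate_regions:
            cover = {i for i in remaining
                     if any(covers(r, query_rects[i]) for r in region)}
            if len(cover) > len(best_cover):
                best, best_cover = region, cover
        if not best_cover:
            break          # no Region helps; the rest misses the Region cache
        chosen.append(best)
        remaining -= best_cover
    return chosen, remaining
```

A single wide Region is preferred automatically, since it covers more rectangles in one step than any combination of narrower ones.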
The invention has good usability: semantic-cache or intermediate-result schemes require integrating with and modifying the original DBMS (database management system), with cached content implemented specifically for each DBMS's SQL queries, which is difficult to develop; the present invention needs only a connector implementing predicate serialization, and is therefore highly general across DBMSs. Cache utilization and cache hit rate are high: caching only semantic results or final results wastes resources through low utilization, while cutting the cache into non-overlapping blocks is computationally expensive, consuming large amounts of computing resources on every cache read and update. In the invention, the Region cache, designed for remote storage supporting predicate pushdown such as Amazon S3, caches only the disjunctions of hyper-rectangles extracted from predicate partitioning, so cache granularity is coarse and resource consumption is low; on a cache hit, a greedy algorithm that satisfies as many Regions as possible is used and overlapping Regions are deduplicated, which improves the cache hit rate and prevents frequent cache updates from degrading system performance.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (6)
1. A database cache optimization method based on S3 cloud storage, characterized by comprising the following steps:
s1, a big data system (such as Spark, prest) provides a data source API to support various data sources and formats, the data source receives push-down filtering and column pruning requests from the DBMS through the API, so the data source can reduce the data amount returned to the DBMS by processing the additional information, and a lightweight designated DBMS data source Connector (Cache Connector) is integrated into the unmodified DBMS through the data source API;
s2, storing column format data (such as part data) in a local ssd by Caches, and receiving a query from a Cache Connector through an API by a Cache management system (Cache MS), wherein the query comprises predicate pushdown, and the predicate pushdown Cache Server is used for caching different subsets of the data, which are called as data areas (regions);
s3, the Cache Server receives push-down predicate strings sent by the Cache Connector AND then converts the push-down predicate strings back into internal AST (abstract syntax tree), the Cache Server uniformly converts the AST into a Disjunctive Normal Form (DNF), all connections (AND) are pushed down into an expression tree in the disjunctive normal form, the connections (AND) AND the separations (OR) are not staggered any more, each conjunctive normal form (AND) can be regarded as a single set hyper-rectangle, the data area Region can be regarded as a disjunctive normal form (OR) of the hyper-rectangle, the granularity of Cache is Region, AND all query requests are represented by disjunctive (OR) of the conjunctive (AND);
s4, during the processing of the Region request, whether a superset is matched with the Region request in the Cache or not is judged by the Cache Server in the local search request, whether the Region Cache is matched or not is firstly scanned, if the query is not matched, the file download manager can acquire the file from the file Cache, and if the query is not matched, the file is pulled from the remote storage.
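The AND-over-OR normalization described in step S3 can be sketched as follows. This is a minimal illustration under an assumed tuple-based AST encoding; the claim does not specify data structures, so all names here are illustrative, not the patented implementation:

```python
# Sketch of the AST-to-DNF conversion: the AST is encoded as nested tuples
# ("and"/"or", left, right) with atomic predicates like ("gt", column, value).
# This encoding and the function name are assumptions for illustration.

def to_dnf(node):
    """Return the DNF as a list of conjunctions; each conjunction is a list
    of atomic predicates and corresponds to one hyper-rectangle."""
    op = node[0]
    if op == "or":                      # disjunction: union of both sides' DNFs
        return to_dnf(node[1]) + to_dnf(node[2])
    if op == "and":                     # distribute AND over OR
        return [l + r
                for l in to_dnf(node[1])
                for r in to_dnf(node[2])]
    return [[node]]                     # atomic predicate is its own conjunction

# (price > 100 OR qty < 5) AND region = 'EU'
ast = ("and",
       ("or", ("gt", "price", 100), ("lt", "qty", 5)),
       ("eq", "region", "EU"))
dnf = to_dnf(ast)
# dnf is a disjunction of two conjunctions, i.e. two hyper-rectangles:
# [[("gt","price",100), ("eq","region","EU")],
#  [("lt","qty",5), ("eq","region","EU")]]
```

After this pass, AND and OR are no longer interleaved: the result is always an OR of ANDs, matching the Region representation of step S3.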
2. The database cache optimization method based on S3 cloud storage according to claim 1, characterized in that: in step S2, the Cache MS first checks the matcher; if there is a cache hit, it returns a set of file paths from the local store; if there is a miss, two options are provided:
1) the DBMS directly uses the Cache Connector to process the remotely stored data, and the Cache MS downloads the data into the Caches through the Cache Connector;
2) the Cache MS applies predicate pushdown to download the data from the remote location, stores the result in the Caches, and returns a path to the connector;
the contents of the Cache may be filled by the DBMS or by the Cache MS, but not every requested Region is cached; an LRU-2-based algorithm determines which Regions to cache.
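The LRU-2-based admission and eviction decision mentioned above can be sketched roughly as follows. The claim does not specify data structures, so the class, fields, and tie-breaking here are assumptions; a real LRU-2 implementation would also handle concurrency and sizes:

```python
import time
from collections import defaultdict

class LRU2Cache:
    """Illustrative LRU-2 policy: the eviction victim is the entry whose
    second-most-recent access is oldest; entries seen only once rank lowest.
    Everything here is an assumption for illustration, not the patented code."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.history = defaultdict(list)   # key -> up to two latest access times
        self.store = {}                    # cached Regions: key -> local file path

    def access(self, key, value=None):
        ts = self.history[key]
        ts.append(time.monotonic())
        del ts[:-2]                        # keep only the last two accesses
        if key in self.store:
            return self.store[key]         # cache hit: return the cached path
        if value is not None:              # admit the Region, evicting if full
            if len(self.store) >= self.capacity:
                victim = min(self.store,
                             key=lambda k: self.history[k][0]
                             if len(self.history[k]) == 2 else float("-inf"))
                del self.store[victim]
            self.store[key] = value
        return None                        # cache miss

cache = LRU2Cache(capacity=2)
cache.access("region-1", "/ssd/cache/r1.parquet")   # miss: Region admitted
cache.access("region-1")                            # hit: returns the path
```

The point of LRU-2 over plain LRU is that a Region requested only once ranks below Regions requested at least twice, so one-off scans do not flush frequently reused Regions.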
3. The database cache optimization method based on S3 cloud storage according to claim 1, characterized in that: the Cache Server serves as a storage layer of the DBMS and runs outside the DBMS; information is exchanged through a socket connection and a shared space in the file system (SSD or ramdisk); during a file request, the DBMS exchanges information about the file and the required Region with the Cache Server, and the Cache Server preferentially attempts to satisfy the request with cached files.
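The socket exchange in claim 3 might look like the following minimal sketch. The claim does not define a wire format, so the one-line "file region" request, the "MISS" marker, and all paths are assumptions purely for illustration:

```python
import socket
import threading

def cache_server(srv, cached):
    """Answer one request: 'file region-string' -> cached local path or MISS.
    The message format is an assumption; the claim specifies none."""
    conn, _ = srv.accept()
    with conn:
        file, region = conn.recv(1024).decode().split(" ", 1)
        conn.sendall(cached.get((file, region), "MISS").encode())

srv = socket.socket()
srv.bind(("127.0.0.1", 0))       # ephemeral port on loopback
srv.listen(1)
cached = {("orders.parquet", "gt(price,100)"): "/ssd/cache/orders_r1.parquet"}
t = threading.Thread(target=cache_server, args=(srv, cached))
t.start()

# The DBMS side: ask the Cache Server for a file restricted to a Region.
cli = socket.create_connection(srv.getsockname())
cli.sendall(b"orders.parquet gt(price,100)")
reply = cli.recv(1024).decode()  # local path in the shared file-system space
cli.close(); t.join(); srv.close()
```

On a hit the DBMS then reads the returned path directly from the shared SSD/ramdisk space, so only metadata crosses the socket, not the data itself.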
4. The database cache optimization method based on S3 cloud storage according to claim 1, characterized in that: the API uses a tree string representation to push down predicates; predicates are typically stored as ASTs in the DBMS, so the string representation is constructed from the AST; each individual item is structured in a tree-like syntax op(left, right) that supports binary operators, unary operators, and literals, where literals are the leaf nodes of the tree; a binary operation is either a combination of predicates (and, or) or an atomic predicate (e.g., gt, lt, eq); both use the same binary syntax, with left being the identifier and right being the comparison value.
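A serializer producing the op(left, right) tree-string form of claim 4 can be sketched as follows; this handles only the binary case (unary operators are omitted), and the tuple encoding and names are assumptions for illustration:

```python
# Sketch of serializing a predicate AST into the op(left,right) tree string.
# Leaves (column identifiers and constants) are emitted as-is; binary nodes
# recurse on both children. Names and encoding are illustrative assumptions.

def serialize(node):
    if not isinstance(node, tuple):      # leaf: column identifier or constant
        return str(node)
    op, left, right = node
    return f"{op}({serialize(left)},{serialize(right)})"

# (price > 100 AND region = 'EU') OR qty < 5
ast = ("or",
       ("and", ("gt", "price", 100), ("eq", "region", "EU")),
       ("lt", "qty", 5))
s = serialize(ast)
# s == "or(and(gt(price,100),eq(region,EU)),lt(qty,5))"
```

Note how the same op(left, right) shape serves both the predicate combinators (and, or) and the atomic comparisons (gt, eq, lt), exactly as the claim describes.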
5. The database cache optimization method based on S3 cloud storage according to claim 1, characterized in that: through conjunctive and disjunctive expressions, four relations may hold between Regions: full inclusion, equivalence, intersection, and partial inclusion.
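One way to classify relations between two hyper-rectangles (here simplified to closed interval maps over the same columns) is sketched below. The mapping of the classifier's labels onto the claim's four terms is an assumption, as is the whole representation:

```python
def relation(a, b):
    """Classify the relation between two hyper-rectangles given as
    {column: (low, high)} closed-interval maps over the same columns.
    Representation and label names are illustrative assumptions."""
    def contains(x, y):          # every interval of x encloses that of y
        return all(x[c][0] <= y[c][0] and y[c][1] <= x[c][1] for c in y)
    def overlaps(x, y):          # intervals overlap on every column
        return all(x[c][0] <= y[c][1] and y[c][0] <= x[c][1] for c in x)
    if contains(a, b) and contains(b, a):
        return "equivalent"
    if contains(a, b) or contains(b, a):
        return "full inclusion"
    if overlaps(a, b):
        return "intersecting"
    return "disjoint"

a = {"price": (0, 100), "qty": (0, 50)}
b = {"price": (10, 90), "qty": (5, 40)}
r = relation(a, b)   # "full inclusion": b lies entirely inside a
```

On a "full inclusion" or "equivalent" result the cached Region can serve the request directly; "intersecting" results are what trigger the de-duplication of claim 6.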
6. The database cache optimization method based on S3 cloud storage according to claim 1, characterized in that: in step S4, the cache matching policy is:
A. preferentially, a single Region that satisfies the request is used;
B. if no single Region satisfies the request, the Cache MS tries to satisfy individual hyper-rectangles, but this may require additional de-duplication; for example, Regions A and B together may satisfy a query while their areas overlap without being identical, requiring one pass of de-duplication;
C. when a combination of multiple Regions satisfies the request, matching is performed with a greedy algorithm: from a candidate list consisting of multiple Regions, the Region that covers the most hyper-rectangles is selected each time, and de-duplication is then performed.
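The greedy selection in item C can be sketched as a small set-cover loop. The data layout (Regions as sets of hyper-rectangle identifiers) and all names are assumptions for illustration:

```python
# Sketch of the greedy matcher: given the query's hyper-rectangles and the
# candidate cached Regions (each covering some of those rectangles), repeatedly
# pick the Region covering the most still-uncovered rectangles. Overlap between
# chosen Regions is de-duplicated downstream, as items B and C describe.

def greedy_match(query_rects, candidates):
    """candidates: {region_name: set of hyper-rectangle ids it can serve}."""
    uncovered = set(query_rects)
    chosen = []
    while uncovered:
        name, covered = max(candidates.items(),
                            key=lambda kv: len(kv[1] & uncovered))
        gain = covered & uncovered
        if not gain:             # no candidate helps; the rest is fetched remotely
            break
        chosen.append(name)
        uncovered -= gain
    return chosen, uncovered

regions = {"R1": {1, 2}, "R2": {2, 3}, "R3": {4}}
greedy_match({1, 2, 3, 4}, regions)   # (['R1', 'R2', 'R3'], set())
```

Greedy set cover is not optimal in general, but it keeps each matching decision cheap, which is consistent with the invention's goal of low resource consumption per cache lookup.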
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310769257.2A CN117009302A (en) | 2023-06-28 | 2023-06-28 | Database cache optimization method based on S3 cloud storage |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117009302A true CN117009302A (en) | 2023-11-07 |
Family
ID=88571784
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310769257.2A Pending CN117009302A (en) | 2023-06-28 | 2023-06-28 | Database cache optimization method based on S3 cloud storage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117009302A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||