CN117009302A - Database cache optimization method based on S3 cloud storage - Google Patents


Info

Publication number
CN117009302A
CN117009302A
Authority
CN
China
Prior art keywords
cache
region
data
dbms
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310769257.2A
Other languages
Chinese (zh)
Inventor
李勋 (Li Xun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Aowei Technology Co ltd
Original Assignee
Beijing Aowei Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aowei Technology Co ltd filed Critical Beijing Aowei Technology Co ltd
Priority to CN202310769257.2A
Publication of CN117009302A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/17 - Details of further file system functions
    • G06F 16/172 - Caching, prefetching or hoarding of files
    • G06F 16/176 - Support for shared access to files; File sharing support
    • G06F 16/18 - File system types
    • G06F 16/182 - Distributed file systems
    • G06F 12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 - Addressing or allocation; Relocation
    • G06F 12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/12 - Replacement control
    • G06F 12/121 - Replacement control using replacement algorithms
    • G06F 12/123 - Replacement control using replacement algorithms with age lists, e.g. queue, most recently used [MRU] list or least recently used [LRU] list
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 - Protocols
    • H04L 67/06 - Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H04L 67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 - Protocols for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H04L 69/00 - Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L 69/16 - Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
    • H04L 69/161 - Implementation details of TCP/IP or UDP/IP stack architecture; specification of modified or new header fields
    • H04L 69/162 - Implementation details involving adaptations of socket-based mechanisms
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a database cache optimization method based on S3 cloud storage, comprising the following steps. S1: a data source receives push-down filtering and column-pruning requests from a DBMS through an API, and a Cache Connector is integrated into the unmodified DBMS through the data source API. S2: the caches store data in column format on a local SSD, and the Cache MS receives queries, including pushed-down predicates, from the Cache Connector through the API. S3: the Cache Server receives the pushed-down predicate string sent by the Cache Connector and converts it back into an internal AST; the Cache Server uniformly converts the AST into disjunctive normal form (DNF); the cache granularity is the Region, and every query request is represented as a disjunction (OR) of conjunctions (AND). S4: when processing a Region request, the Cache Server searches the local cache for a superset matching the requested Region: it first scans the Region cache for a match; if the query is not matched there, the file download manager may obtain the file from the file cache, and if that also misses, the file is pulled from remote storage.

Description

Database cache optimization method based on S3 cloud storage
Technical Field
The invention relates to the technical field of IT application, in particular to a database cache optimization method based on S3 cloud storage.
Background
Amazon Simple Storage Service (Amazon S3 for short) is an object storage service with industry-leading scalability, data availability, security, and performance. The platform was developed by Amazon Web Services (AWS) and was first introduced on March 14, 2006. Amazon S3 Intelligent-Tiering provides 99.999999999% (eleven nines) durability and 99.9% availability. Its management features allow users to optimize, configure, and organize access to their data to meet specific compliance, business, or organizational requirements.
The existing caching schemes for Amazon S3 have the following drawbacks. Poor usability: manually creating views requires a DBA (database administrator) with some experience who is familiar with the hot-spot queries of the actual service and with which data is hot and which is cold; otherwise the created view does not improve query efficiency but instead increases query latency and resource consumption through cache misses. Poor maintainability: semantic-cache and intermediate-result schemes require integrating with and modifying the original DBMS (database management system), and the cached content must be implemented specifically for the SQL queries of each DBMS, so development is difficult and demands much of maintainers. Low cache utilization: fine-grained data-block and data-page caches require a large amount of cache space to guarantee a high cache hit rate and thus high query efficiency.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a database cache optimization method based on S3 cloud storage, which decouples the cache from the database management system, improves the generality of the cache, generates an adaptive cache, improves system usability, significantly improves cache utilization, saves remote-storage bandwidth, and reduces query latency.
In order to solve the technical problems, the invention provides the following technical scheme: a database cache optimization method based on S3 cloud storage comprises the following steps:
S1, a big data system (such as Spark or Presto) provides a data source API to support various data sources and formats; the data source receives push-down filtering and column-pruning requests from the DBMS through this API, so the data source can reduce the amount of data returned to the DBMS by processing this additional information, and a lightweight DBMS-specific data source connector (the Cache Connector) is integrated into the unmodified DBMS through the data source API;
S2, the caches store data in column format (e.g., Parquet files) on a local SSD; the cache management system (Cache MS) receives queries from the Cache Connector through the API, the queries containing pushed-down predicates, and the predicate-pushdown Cache Server is used to cache different subsets of the data, called data regions (Regions);
S3, the Cache Server receives the pushed-down predicate string sent by the Cache Connector and converts it back into an internal AST (abstract syntax tree); the Cache Server uniformly converts the AST into disjunctive normal form (DNF), in which all conjunctions (AND) are pushed down the expression tree and conjunctions (AND) and disjunctions (OR) are no longer interleaved; each conjunctive clause (AND) can be regarded as a single hyper-rectangle, and a data Region can then be regarded as a disjunction (OR) of hyper-rectangles; the cache granularity is the Region, and every query request is represented as a disjunction (OR) of conjunctions (AND);
S4, when processing a Region request, the Cache Server searches the local cache for a superset matching the requested Region: it first scans the Region cache for a match; if the query is not matched there, the file download manager may obtain the file from the file cache, and if that also misses, the file is pulled from remote storage;
Further, in step S2, the Cache MS first checks the matcher; on a cache hit it returns a set of file paths in local storage, and on a miss it provides two options:
1) The DBMS directly uses the Cache Connector to process the remotely stored data, and the Cache MS downloads the data into the caches through the Cache Connector;
2) The Cache MS applies predicate pushdown to download the data from the remote location, stores the result in the caches, and returns a path to the connector;
the contents of the cache may thus be filled by either the DBMS or the Cache MS; however, not every requested Region is cached, and an LRU-2-based algorithm decides which Regions to admit;
Still further, the Cache Server serves as a storage layer for the DBMS and runs outside it; information is exchanged through a socket connection and shared space in the file system (SSD, ramdisk). During a file request, the DBMS exchanges information about the file and the required Region with the Cache Server, which preferentially tries to satisfy the request with cached files;
Still further, the API uses a tree string representation to push down predicates; since predicates are typically stored as ASTs in the DBMS, the string representation is constructed from the AST. Each individual item is written in a tree-like syntax op(left, right) that supports binary operators, unary operators, and literals, literals being the leaf nodes of the tree. A binary operation either combines multiple predicates (and, or) or forms an atomic predicate (e.g., gt, lt, eq); atomic predicates use the same binary syntax, with a column reference on the left and a comparison value on the right;
Still further, through the conjunctive and disjunctive expressions there are four possible relations between Regions: full containment, equivalence, intersection, and partial containment;
Still further, in step S4, the cache matching policy is:
A. preferably, a single Region satisfies the request;
B. if no single Region satisfies the request, the Cache MS tries to satisfy the individual hyper-rectangles; this may require an additional deduplication pass. For example, Regions A and B together may satisfy a query while overlapping without being identical, in which case one deduplication pass is needed;
C. when several combinations of Regions satisfy the request, a greedy algorithm performs the matching: from the candidate list of Regions, the Region covering the most hyper-rectangles is selected each time, and deduplication is performed afterwards.
Compared with the prior art, the invention has the following beneficial effects:
The invention has good usability: only a connector implementing predicate serialization is needed, so the approach is highly general across DBMSs. Cache utilization and cache hit rate are high: the Region cache is designed for remote storage that supports predicate pushdown (such as Amazon S3) and caches only the disjunctions of hyper-rectangles produced by predicate partitioning, so the cache granularity is coarse and resource consumption is low; on a cache hit, a greedy algorithm satisfies as many Regions as possible and deduplicates overlapping Regions, which improves the cache hit rate and prevents frequent cache updates from degrading system performance.
Drawings
FIG. 1 is a diagram of the overall framework of the present invention;
FIG. 2 is a diagram showing the MS composition of the Cache of the present invention;
FIG. 3 is a diagram of the process of converting to DNF and extracting single hyper-rectangles according to the invention;
FIG. 4 is a diagram of an example of matching a query consisting of two hyper-rectangles to two memory regions in accordance with the present invention;
FIG. 5 is a schematic diagram of a matching process and algorithm of the present invention;
Detailed Description
In order that the manner in which the above recited features, objects and advantages of the present invention are obtained will become readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Based on the examples in the embodiments, those skilled in the art can obtain other examples without making any inventive effort, which fall within the scope of the invention. The experimental methods in the following examples are conventional methods unless otherwise specified, and materials, reagents, etc. used in the following examples are commercially available unless otherwise specified.
Example 1
Referring to fig. 1 and 2, the invention provides a database cache optimization method based on S3 cloud storage, which comprises the following steps:
S1, a big data system (such as Spark or Presto) provides a data source API to support various data sources and formats; the data source receives push-down filtering and column-pruning requests from the DBMS through this API, so the data source can reduce the amount of data returned to the DBMS by processing this additional information, and a lightweight DBMS-specific data source connector (the Cache Connector) is integrated into the unmodified DBMS through the data source API;
S2, the caches store data in column format (e.g., Parquet files) on a local SSD; the cache management system (Cache MS) receives queries from the Cache Connector through the API, the queries containing pushed-down predicates; the predicate-pushdown Cache Server caches different subsets of the data, called data regions (Regions) herein, and through the conjunctive and disjunctive expressions, Regions can stand in four relations: full containment, equivalence, intersection, and partial containment;
S3, the Cache Server receives the pushed-down predicate string sent by the Cache Connector and converts it back into an internal AST (abstract syntax tree); the Cache Server uniformly converts the AST into disjunctive normal form (DNF), in which all conjunctions (AND) are pushed down the expression tree and conjunctions (AND) and disjunctions (OR) are no longer interleaved; each conjunctive clause (AND) can be regarded as a single hyper-rectangle, and a data Region can then be regarded as a disjunction (OR) of hyper-rectangles; the cache granularity is the Region, and every query request is represented as a disjunction (OR) of conjunctions (AND);
S4, when processing a Region request, the Cache Server searches the local cache for a superset matching the requested Region: it first scans the Region cache for a match; if the query is not matched there, the file download manager may obtain the file from the file cache, and if that also misses, the file is pulled from remote storage;
In step S2, the Cache MS first checks the matcher; on a cache hit it returns a set of file paths in local storage, and on a miss it provides two options:
1) The DBMS directly uses the Cache Connector to process the remotely stored data, and the Cache MS downloads the data into the caches through the Cache Connector;
2) The Cache MS applies predicate pushdown to download the data from the remote location, stores the result in the caches, and returns a path to the connector;
the contents of the cache may thus be filled by either the DBMS or the Cache MS; however, not every requested Region is cached, and an LRU-2-based algorithm decides which Regions to admit;
The Cache Server serves as a storage layer for the DBMS and runs outside it; information is exchanged through a socket connection and shared space in the file system (SSD, ramdisk). During a file request, the DBMS exchanges information about the file and the required Region with the Cache Server, which preferentially tries to satisfy the request with cached files;
The API uses a tree string representation to push down predicates; since predicates are typically stored as ASTs in the DBMS, the string representation is constructed from the AST. Each individual item is written in a tree-like syntax op(left, right) that supports binary operators, unary operators, and literals, literals being the leaf nodes of the tree. A binary operation either combines multiple predicates (and, or) or forms an atomic predicate (e.g., gt, lt, eq); atomic predicates use the same binary syntax, with a column reference on the left and a comparison value on the right;
In step S4, the cache matching policy is:
A. preferably, a single Region satisfies the request;
B. if no single Region satisfies the request, the Cache MS tries to satisfy the individual hyper-rectangles; this may require an additional deduplication pass. For example, Regions A and B together may satisfy a query while overlapping without being identical, in which case one deduplication pass is needed;
C. when several combinations of Regions satisfy the request, a greedy algorithm performs the matching: from the candidate list of Regions, the Region covering the most hyper-rectangles is selected each time, and deduplication is performed afterwards;
In the embodiment of the invention, Amazon S3 (Amazon Simple Storage Service) is a remote object storage service provided by Amazon and is dominant in the field of remote object storage. A database management system (DBMS) is a large piece of software for manipulating and managing databases, used to create, use, and maintain them. A view here means a view in a computer database: a virtual table whose contents are defined by a query. Like a real table, a view contains a series of named columns and rows, but it does not exist in the database as a stored set of data values; its row and column data come from the tables referenced by the defining query and are generated dynamically whenever the view is referenced. A materialized view likewise reflects the result of a query, but unlike an ordinary view, which stores only its SQL definition, a materialized view stores the query result itself, hence the name;
The Cache Server is divided into two parts: 1) a lightweight DBMS-specific data source connector, hereinafter the Cache Connector, and 2) a cache management system, hereinafter the Cache MS. Modern big data systems (e.g., Spark, Presto) provide a data source API to support various data sources and formats; through this API the data source receives push-down filtering and column-pruning requests from the DBMS, so it can reduce the amount of data returned to the DBMS by processing this additional information. The Cache Connector is integrated into the unmodified DBMS through this data source API; it can be regarded as a data source by the DBMS and as a client by the Cache MS. The caches store data in column format (e.g., Parquet files) on a local SSD. The Cache MS receives queries, containing pushed-down predicates, from the Cache Connector through the API; it first checks the matcher, and on a cache hit it returns a set of file paths in local storage. On a miss, two options are provided:
1) The DBMS directly uses the Cache Connector to process the remotely stored data, and the Cache MS downloads the data into the caches through the Cache Connector;
2) The Cache MS applies predicate pushdown to download the data from the remote location, stores the result in the caches, and returns a path to the connector;
thus, the contents of the cache may be filled by either the DBMS or the Cache MS; however, not every requested Region is cached, and an LRU-2-based algorithm decides which Regions to admit;
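The LRU-2 admission just described can be sketched as follows. The patent does not specify the exact bookkeeping, so this is a minimal illustrative variant that evicts the Region whose second-most-recent access is oldest; all names are assumptions:

```python
class LRU2Cache:
    """Minimal LRU-2 eviction sketch for cached Regions.

    Priority is the second-most-recent access time: the Region whose
    penultimate access is oldest (or that was seen only once) is
    evicted first. Illustrative only, not the patent's implementation.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.times = {}          # region_id -> (t_last, t_previous)
        self.clock = 0

    def access(self, region_id):
        self.clock += 1
        t_last, _ = self.times.get(region_id, (-1, -1))
        self.times[region_id] = (self.clock, t_last)
        if len(self.times) > self.capacity:
            self._evict(keep=region_id)

    def _evict(self, keep):
        # Victim: smallest second-most-recent access time, never the
        # Region that was just touched.
        victim = min((r for r in self.times if r != keep),
                     key=lambda r: self.times[r][1])
        del self.times[victim]

    def cached(self):
        return set(self.times)
```

With capacity 2, accessing A, B, A, C evicts B: B was only touched once, while A has two recorded accesses, so one-shot Regions are not allowed to displace repeatedly requested ones.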
As described above, the system architecture of the Cache Server makes it suitable for any cloud analytics system:
1) Users can implement cache replacement customized to their own workload;
2) The remote storage may be replaced; Amazon S3 is not strictly required, but Amazon S3 or similar cloud storage that supports predicate pushdown provides better performance;
3) A custom Cache Connector can be implemented for each DBMS that uses the system;
A predicate-pushdown Cache Server is used to cache different subsets of the data, called data regions, hereinafter simply Regions. A Region may be regarded as a view table, or as another form of semantic cache. Compared with traditional file caching this has two advantages: first, it typically returns a tighter view to the DBMS, reducing further processing of the data and thereby saving I/O and CPU cost; second, a Region may be much smaller than the original file, achieving better space utilization and a higher cache hit rate. Turning a DBMS client request into a Region request comprises the following steps:
(1)API
The Cache Server serves as a storage layer for the DBMS and runs outside it; information is exchanged through a socket connection and shared space in the file system (SSD, ramdisk). During a file request, the DBMS exchanges information about the file and the required Region with the Cache Server, which preferentially tries to satisfy the request with cached files;
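The socket exchange above can be sketched with an in-process socket pair standing in for the DBMS-to-Cache-Server link. The JSON message shape, the field names, and the example paths are illustrative assumptions, not the patent's actual protocol:

```python
import json
import socket

def serve_region_request(conn, region_index):
    """Answer one {"file": ..., "region": ...} request with the cached
    local paths for that (file, region) pair, or [] on a miss."""
    request = json.loads(conn.recv(4096).decode())
    paths = region_index.get((request["file"], request["region"]), [])
    conn.sendall(json.dumps({"paths": paths}).encode())

# Hypothetical cache index mapping (file, region) to local SSD paths.
index = {("lineitem.parquet", "r1"): ["/ssd/cache/r1.parquet"]}

dbms, cache_server = socket.socketpair()
dbms.sendall(json.dumps({"file": "lineitem.parquet", "region": "r1"}).encode())
serve_region_request(cache_server, index)
reply = json.loads(dbms.recv(4096).decode())
dbms.close(); cache_server.close()
```

On a hit the DBMS receives only local paths and reads the data itself from the shared file-system space, which is what keeps bulk data off the socket.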
In the invention, the API uses a tree string to represent a pushed-down predicate. Because predicates are usually stored as an AST (abstract syntax tree) in the DBMS, the AST is traversed to construct the string expression. Each individual item is expressed in a tree-like syntax op(left, right); binary operators, unary operators, and literals are supported, literals being the leaf nodes of the tree. A binary operation is either a combination of several predicates (and, or) or an atomic predicate (e.g., gt, lt, eq); atomic predicates use the same binary syntax, with a column reference on the left and a comparison value on the right;
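The op(left, right) serialization can be sketched as a small recursive traversal. The Node layout and the exact output text are illustrative assumptions; the patent only fixes the general tree syntax:

```python
from dataclasses import dataclass

@dataclass
class Node:
    op: str              # "and"/"or", an atomic op ("gt", "lt", "eq"),
    left: object = None  # or "lit" for a leaf literal (column or value)
    right: object = None

def to_tree_string(node):
    """Serialize a predicate AST into the nested op(left,right) form."""
    if node.op == "lit":                 # literals are the leaf nodes
        return str(node.left)
    return f"{node.op}({to_tree_string(node.left)},{to_tree_string(node.right)})"

# WHERE a > 5 AND b < 10
pred = Node("and",
            Node("gt", Node("lit", "a"), Node("lit", 5)),
            Node("lt", Node("lit", "b"), Node("lit", 10)))
```

Here to_tree_string(pred) produces "and(gt(a,5),lt(b,10))": atomic predicates reuse the binary syntax with the column on the left and the comparison value on the right.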
(2) Cache MS receiving conversion
The Cache Server receives the pushed-down predicate string sent by the Cache Connector and converts it back into an internal AST. Because handling arbitrarily nested logical expressions is complex, the Cache Server uniformly converts the AST into disjunctive normal form (DNF), in which all conjunctions (AND) are pushed down the expression tree and conjunctions (AND) and disjunctions (OR) are no longer interleaved;
each conjunctive clause (AND) can be regarded as a single hyper-rectangle, and the data Region described above can then be regarded as a disjunction (OR) of hyper-rectangles; FIG. 3 shows the process of converting to DNF and extracting single hyper-rectangles;
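The DNF conversion, distributing AND over OR, can be sketched in a few lines. The (op, column, value) tuple encoding of atomic predicates is an assumption for illustration:

```python
def to_dnf(node):
    """Convert an and/or predicate tree into DNF: a list of conjunctions,
    each conjunction a list of atomic (op, column, value) predicates.
    Each resulting conjunction corresponds to one hyper-rectangle."""
    op = node[0]
    if op == "or":                 # disjunction: concatenate both sides
        return to_dnf(node[1]) + to_dnf(node[2])
    if op == "and":                # conjunction: distribute AND over OR
        return [l + r for l in to_dnf(node[1]) for r in to_dnf(node[2])]
    return [[node]]                # atomic predicate: one one-term clause

# (a > 5 OR b < 3) AND c = 1  ->  two hyper-rectangles
expr = ("and", ("or", ("gt", "a", 5), ("lt", "b", 3)), ("eq", "c", 1))
```

For this expr the result is two conjunctions, [a > 5, c = 1] and [b < 3, c = 1]: after distribution, AND and OR no longer interleave.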
The granularity of the Cache Server's cache is the Region, and every query request is represented as a disjunction (OR) of conjunctions (AND); however, individual conjunctions from different Regions may be combined to satisfy an incoming Region request. Some past work on semantic caching considers only non-overlapping hyper-rectangles; although non-overlapping hyper-rectangles reduce the complexity of the decision process, their granularity is too small, and too many hyper-rectangles impose extra cost on the cache policy, so the Region is still chosen as the minimum cache granularity;
(3) Region matching
Through the conjunctive and disjunctive expressions, four relations exist between Regions: full containment, equivalence, intersection, and partial containment. The underlying formulas treat each conjunction as a hyper-rectangle, and the algorithm that determines the relation of two Regions (rx, ry) distinguishes the following cases:
(full superset) every hyper-rectangle of rx can find a covering hyper-rectangle in ry;
(intersection) rx ∩ ry ≠ ∅, i.e., at least one hyper-rectangle of rx intersects a hyper-rectangle of ry;
(partial superset) at least one hyper-rectangle of rx can find a covering hyper-rectangle in ry;
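A minimal sketch of this relation test, assuming a hyper-rectangle is a mapping from column to (low, high) interval and a Region is a list of such rectangles (both representations are illustrative; equivalence, being mutual full containment, is omitted for brevity):

```python
def rect_contains(outer, inner):
    """True if hyper-rectangle `outer` covers `inner`. A rectangle maps
    a column to a (low, high) interval; an absent column is unbounded."""
    return all(
        col in inner
        and outer[col][0] <= inner[col][0]
        and inner[col][1] <= outer[col][1]
        for col in outer
    )

def rect_intersects(a, b):
    # Intervals must overlap on every column both rectangles constrain.
    return all(a[c][0] <= b[c][1] and b[c][0] <= a[c][1]
               for c in set(a) & set(b))

def classify(request, cached):
    """Relation of the cached Region to the requested one."""
    covered = [any(rect_contains(ry, rx) for ry in cached) for rx in request]
    if all(covered):
        return "full superset"      # every requested rectangle is covered
    if any(covered):
        return "partial superset"   # at least one is covered whole
    if any(rect_intersects(rx, ry) for rx in request for ry in cached):
        return "intersection"
    return "disjoint"
```

For example, a request for a in [0, 10] is fully covered by a cached a in [0, 20], only intersects a cached a in [5, 30], and is disjoint from a cached a in [20, 30].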
as shown in FIG. 4, an example is shown in which a query consisting of two hyper-rectangles matches two storage areas
(4) Request matching flow and algorithm
As shown in fig. 5, the Region caches are first scanned for a match; if the query is not matched there, the file download manager may obtain the file from the file cache, and if that also misses, the file is pulled from remote storage;
cache matching policy:
A. preferably, a single Region satisfies the request;
B. if no single Region satisfies the request, the Cache MS tries to satisfy the individual hyper-rectangles; this may require an additional deduplication pass. For example, Regions A and B together may satisfy a query while overlapping without being identical, in which case one deduplication pass is needed;
C. when several combinations of Regions satisfy the request, a greedy algorithm performs the matching: from the candidate list of Regions, the Region covering the most hyper-rectangles is selected each time, and deduplication is performed afterwards;
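The greedy policy in step C can be sketched as a set-cover loop; covers() repeats the superset test from the matcher, and all names and data shapes here are illustrative assumptions:

```python
def covers(region, rect):
    # A Region (list of hyper-rectangles, each a column -> (low, high)
    # map) covers a requested rectangle if one of its rectangles
    # contains it on every constrained column.
    return any(
        all(col in rect
            and r[col][0] <= rect[col][0]
            and rect[col][1] <= r[col][1]
            for col in r)
        for r in region
    )

def greedy_match(request, regions):
    """Each round, take the Region covering the most still-uncovered
    hyper-rectangles of the request. Returns the chosen Region names
    and the indices left uncovered (these fall through to the file
    cache / remote storage). Overlap between the chosen Regions is
    what the later deduplication pass removes."""
    remaining = set(range(len(request)))
    chosen = []
    while remaining:
        name, hit = max(
            ((n, {i for i in remaining if covers(r, request[i])})
             for n, r in regions.items()),
            key=lambda pair: len(pair[1]))
        if not hit:
            break
        chosen.append(name)
        remaining -= hit
    return chosen, remaining
```

For a request of three hyper-rectangles where R1 covers two and R2 covers the third, the loop picks R1 first (largest cover), then R2, leaving nothing for remote storage.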
The invention has good usability: semantic-cache and intermediate-result schemes require integrating with and modifying the original DBMS (database management system), with the cached content implemented specifically for each DBMS's SQL queries, which makes development difficult; in the invention, only a connector implementing predicate serialization is needed, so the approach is highly general across DBMSs. Cache utilization and cache hit rate are high: schemes that cache only final results waste resources through low cache utilization, and schemes that cut the cached area into non-overlapping blocks have high computational cost, consuming substantial compute on every cache read and update. In the invention, the Region cache designed for remote storage that supports predicate pushdown (Amazon S3 and the like) caches only the disjunctions of hyper-rectangles produced by predicate partitioning, so the cache granularity is coarse and resource consumption is low; on a cache hit, a greedy algorithm satisfies as many Regions as possible and deduplicates overlapping Regions, which improves the cache hit rate and prevents frequent cache updates from degrading system performance.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A database cache optimization method based on S3 cloud storage, characterized by comprising the following steps:
S1, a big data system (such as Spark or Presto) provides a data source API to support various data sources and formats; the data source receives push-down filtering and column-pruning requests from the DBMS through this API, so the data source can reduce the amount of data returned to the DBMS by processing this additional information, and a lightweight DBMS-specific data source connector (the Cache Connector) is integrated into the unmodified DBMS through the data source API;
S2, the caches store data in column format (e.g., Parquet files) on a local SSD; the cache management system (Cache MS) receives queries from the Cache Connector through the API, the queries containing pushed-down predicates, and the predicate-pushdown Cache Server is used to cache different subsets of the data, called data regions (Regions);
S3, the Cache Server receives the pushed-down predicate string sent by the Cache Connector and converts it back into an internal AST (abstract syntax tree); the Cache Server uniformly converts the AST into disjunctive normal form (DNF), in which all conjunctions (AND) are pushed down the expression tree and conjunctions (AND) and disjunctions (OR) are no longer interleaved; each conjunctive clause (AND) can be regarded as a single hyper-rectangle, and a data Region can then be regarded as a disjunction (OR) of hyper-rectangles; the cache granularity is the Region, and every query request is represented as a disjunction (OR) of conjunctions (AND);
S4, while processing a Region request, the Cache Server searches the local cache for a superset matching the request: it first scans whether a cached Region matches; if the query does not match, the file download manager obtains the file from the file cache, and if that also misses, the file is pulled from remote storage.
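The AND-over-OR normalization in step S3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: predicate nodes are modeled as tuples `("and", l, r)` / `("or", l, r)` with atomic predicates as string leaves, and the result is a list of conjuncts, each conjunct being one hyper-rectangle.

```python
# Minimal sketch of step S3: rewriting a pushed-down predicate AST into
# disjunctive normal form (DNF). Node encoding is an assumption for
# illustration; atomic predicates are opaque string leaves.

def to_dnf(node):
    """Return a list of conjuncts; each conjunct is a list of atomic predicates."""
    if isinstance(node, tuple) and node[0] == "or":
        # OR: the union of the DNF forms of both sides
        return to_dnf(node[1]) + to_dnf(node[2])
    if isinstance(node, tuple) and node[0] == "and":
        # AND: distribute over nested ORs via the cross product of conjuncts
        return [l + r for l in to_dnf(node[1]) for r in to_dnf(node[2])]
    return [[node]]  # atomic predicate (leaf)

# (a>1 OR b<2) AND c=3  →  (a>1 AND c=3) OR (b<2 AND c=3)
tree = ("and", ("or", "a>1", "b<2"), "c=3")
print(to_dnf(tree))  # [['a>1', 'c=3'], ['b<2', 'c=3']]
```

Each inner list is one conjunct (a single hyper-rectangle), and the outer list is the disjunction of hyper-rectangles against which cached Regions are matched.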
2. The database cache optimization method based on S3 cloud storage as claimed in claim 1, characterized in that: in step S2, the Cache MS first checks the matcher; if there is a cache hit, it returns a set of file paths in the local store; if there is no hit, two options are provided:
1) The DBMS processes the remotely stored data directly through the Cache Connector, and the Cache MS downloads the data into the Caches via the Cache Connector;
2) The Cache MS applies predicate pushdown to download data from the remote location, stores the result in the Caches, and returns the path to the connector;
the contents of the Cache may be filled by either the DBMS or the Cache MS, but not every requested Region is cached; an LRU-2 based algorithm determines which Regions are admitted.
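The LRU-2 admission/eviction policy referenced above can be sketched as follows. This is an illustrative model, not the patent's code: each Region tracks its last two access times, and the eviction victim is the Region whose second-most-recent access is oldest, so Regions touched only once are evicted before Regions with a genuine reuse history.

```python
import itertools

class LRU2Cache:
    """Minimal LRU-2 sketch: each Region keeps its last two access times;
    the victim is the Region whose second-most-recent access is oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.history = {}              # region -> [t_penultimate, t_last]
        self.clock = itertools.count()

    def access(self, region):
        """Record an access; return the evicted Region name, if any."""
        t = next(self.clock)
        if region in self.history:
            self.history[region] = [self.history[region][1], t]
            return None
        victim = None
        if len(self.history) >= self.capacity:
            # Regions seen only once carry -inf and are evicted first
            victim = min(self.history, key=lambda r: self.history[r][0])
            del self.history[victim]
        self.history[region] = [float("-inf"), t]
        return victim

cache = LRU2Cache(2)
cache.access("A"); cache.access("B"); cache.access("A")
print(cache.access("C"))  # B
```

"B" is evicted rather than "A" because "A" has been accessed twice, which is exactly the scan-resistance property that makes LRU-2 preferable to plain LRU for a query-driven Region cache.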
3. The database cache optimization method based on S3 cloud storage as claimed in claim 1, characterized in that: the Cache Server serves as a storage layer for the DBMS and runs outside the DBMS; information is exchanged through a socket connection and shared space in the file system (SSD, ramdisk); during a file request, the DBMS exchanges information about the file and the required Region with the Cache Server, which preferentially tries to satisfy the request with cached files.
4. The database cache optimization method based on S3 cloud storage as claimed in claim 1, characterized in that: the API uses a tree-string representation to push down predicates, which are typically stored as ASTs in the DBMS, so the string representation is constructed from the AST; each item is structured in a tree-like syntax op(left, right) that supports binary operators, unary operators, and literals, where literals are the leaf nodes of the tree; a binary operation is either a combination of predicates (and, or) or an atomic predicate (e.g., gt, lt, eq); both use the same binary syntax, with left being the column indicator and right being the comparison value.
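The op(left, right) string form of claim 4 can be illustrated with a small serializer. The `Node` class and the concrete operator spellings below are assumptions for illustration, not the patent's wire format.

```python
# Illustrative serializer for the op(left, right) tree-string form described
# in claim 4. Node class and operator names are assumptions, not a spec.

class Node:
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

def serialize(node):
    if not isinstance(node, Node):          # literal leaf: column or value
        return str(node)
    if node.right is None:                  # unary operator, e.g. not(x)
        return f"{node.op}({serialize(node.left)})"
    return f"{node.op}({serialize(node.left)},{serialize(node.right)})"

# WHERE price > 100 AND region = 'US'
ast = Node("and", Node("gt", "price", 100), Node("eq", "region", "'US'"))
print(serialize(ast))  # and(gt(price,100),eq(region,'US'))
```

Note how the combining operators (and, or) and the atomic comparisons (gt, lt, eq) share the same binary syntax, with the column indicator on the left and the comparison value on the right, exactly as the claim describes.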
5. The database cache optimization method based on S3 cloud storage as claimed in claim 1, characterized in that: through the conjunctive and disjunctive expressions, four relations can hold between Regions: full containment, equivalence, intersection, and partial containment.
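The Region relations of claim 5 can be sketched as a classifier over hyper-rectangles. This is a minimal model under assumed representations: each Region is a dict mapping a column to a closed (low, high) interval, columns absent from a Region are unbounded, and a "disjoint" outcome is added for completeness of the sketch even though the claim enumerates only four relations.

```python
# Illustrative classifier for the Region relations of claim 5.
# Each hyper-rectangle: dict column -> (low, high); missing columns unbounded.

def classify(a, b):
    """Relation of Region a to Region b: equivalent / contains /
    contained / disjoint / intersecting."""
    cols = set(a) | set(b)
    get = lambda r, c: r.get(c, (float("-inf"), float("inf")))
    if all(get(a, c) == get(b, c) for c in cols):
        return "equivalent"
    if all(get(a, c)[0] <= get(b, c)[0] and get(b, c)[1] <= get(a, c)[1]
           for c in cols):
        return "contains"        # a fully includes b: b can be served from a
    if all(get(b, c)[0] <= get(a, c)[0] and get(a, c)[1] <= get(b, c)[1]
           for c in cols):
        return "contained"       # a is partially contained in b
    if any(get(a, c)[1] < get(b, c)[0] or get(b, c)[1] < get(a, c)[0]
           for c in cols):
        return "disjoint"        # some column's intervals do not overlap
    return "intersecting"

print(classify({"x": (0, 10)}, {"x": (2, 5)}))  # contains
```

A "contains" or "equivalent" result is what lets the Cache Server in step S4 answer a request entirely from a cached Region, while "intersecting" is the case that triggers the deduplication of claim 6.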
6. The database cache optimization method based on S3 cloud storage as claimed in claim 1, characterized in that: in step S4, the cache matching policy is:
A. preferably, a single Region satisfies the request;
B. if no single Region satisfies the request, the Cache MS tries to satisfy individual hyper-rectangles, which may require an additional deduplication operation; for example, Regions A and B together may satisfy a query, but A and B overlap without being identical, so one deduplication pass is needed;
C. when a combination of several Regions satisfies the request, a greedy algorithm performs the matching: from a candidate list of Regions, the Region covering the most hyper-rectangles is selected each time, followed by deduplication.
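The greedy matching of step C is, in effect, a greedy set cover. The sketch below is an assumption-laden illustration, not the patent's algorithm: the query's needed hyper-rectangles and each candidate Region's coverage are modeled as sets of opaque ids, and removing already-covered ids from `remaining` plays the role of the deduplication pass.

```python
# Greedy set-cover sketch of claim 6 step C. query_rects: ids of the
# hyper-rectangles the query needs; candidates: region name -> ids it covers.

def greedy_cover(query_rects, candidates):
    """Return (chosen regions in selection order, ids left uncovered)."""
    remaining = set(query_rects)
    chosen = []
    while remaining:
        # pick the Region covering the most still-uncovered hyper-rectangles
        best = max(candidates, key=lambda r: len(candidates[r] & remaining))
        gain = candidates[best] & remaining
        if not gain:
            break                  # cache cannot cover the rest
        chosen.append(best)
        remaining -= gain          # drop covered ids ≈ the deduplication pass
    return chosen, remaining

regions = {"R1": {1, 2, 3}, "R2": {3, 4}, "R3": {5}}
print(greedy_cover({1, 2, 3, 4}, regions))  # (['R1', 'R2'], set())
```

R1 is chosen first because it covers three of the four needed hyper-rectangles; the overlap on id 3 between R1 and R2 is exactly the duplication that the subsequent pass removes.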
CN202310769257.2A 2023-06-28 2023-06-28 Database cache optimization method based on S3 cloud storage Pending CN117009302A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310769257.2A CN117009302A (en) 2023-06-28 2023-06-28 Database cache optimization method based on S3 cloud storage


Publications (1)

Publication Number Publication Date
CN117009302A true CN117009302A (en) 2023-11-07

Family

ID=88571784


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination