WO2019082177A1 - A system and method for data retrieval - Google Patents

A system and method for data retrieval

Info

Publication number
WO2019082177A1
WO2019082177A1 PCT/IL2018/051123 IL2018051123W WO2019082177A1 WO 2019082177 A1 WO2019082177 A1 WO 2019082177A1 IL 2018051123 W IL2018051123 W IL 2018051123W WO 2019082177 A1 WO2019082177 A1 WO 2019082177A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
partition
queries
query
cached
Prior art date
Application number
PCT/IL2018/051123
Other languages
French (fr)
Inventor
Avi ZLOOF
Original Assignee
The Evaluex Io Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Evaluex Io Ltd filed Critical The Evaluex Io Ltd
Publication of WO2019082177A1 publication Critical patent/WO2019082177A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing

Definitions

  • the invention relates to a system and method for data retrieval.
  • Database systems store vast, and ever-increasing amounts of data, whether in a tabular format (e.g. in Relational Database Management Systems (RDBMS), such as: Oracle Database, Microsoft SQL Server, etc.), or in non-tabular formats (e.g. utilizing linked lists, key-value pairs, documents or other formats, using tolls such as: MongoDB, Apache HBase, etc.).
  • RDBMS Relational Database Management Systems
  • non-tabular formats e.g. utilizing linked lists, key-value pairs, documents or other formats, using tolls such as: MongoDB, Apache HBase, etc.
  • Querying such amounts of data is, in many cases, a heavy task that requires substantial processing and network resources.
  • current data retrieval systems do not have any caching mechanisms for caching retrieved data, or the caching mechanisms are lacking. Therefore, processing and networking resources are in many cases over utilized and become expensive resources. There is thus a need in the art for a new method and system for data retrieval.
  • IaaS Infrastructure as a Service
  • Some examples include: Google Cloud SQL, Amazon Relational Database, Microsoft SQL Azure Database and more.
  • Additional IaaS services include data analysis tools (such as Google BigQuery, Amazon Athena, Azure Data Lake and more), that enable reporting and data analysis of big volumes of data.
  • cloud-based services even further emphasizes the above- mentioned need, as in such IaaS environment, many entities (e.g. companies) store their data on the same physical infrastructure, and the over utilization of the processing and networking resources is enhanced.
  • the cloud-based services charge their clients as a function of the amount of data processed or retrieved by each query. Having a caching mechanism that reduces the amount of data processed or retrieved by the queries will thus reduce the resource utilization, and as a side-effect - also the queries costs.
  • having a caching mechanism is also beneficial for enabling faster and more efficient querying of data, which is advantageous when dealing with large amounts of stored data.
  • a data retrieval system comprising a processing unit configured to: provide a cache configured to store one or more cached relational queries, corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof; obtain: (a) metadata describing structures of one or more tables within a database, a division of each of the tables to one or more partitions, and, for each partition, a last update timestamp indicative of a last update time thereof; and (b) a source relational query for retrieving required data from at least one of the tables, the source relational query defining the required data; translate, using the metadata, the source relational query into one or more partition queries, each of the partition queries designed to retrieve at least part of the required data from a single corresponding partition of the partitions; identify one or more of the partition queries (a) that do not meet a similarity criterion with any of the cached relational queries or (b) that
  • the source relational query further comprises rules for reducing the required data and wherein the processing unit is further configured to reduce the merged results in accordance with the rules, for reducing the required data.
  • the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp.
  • the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp plus a tolerance time.
  • the similarity criterion is identity between the partition query and one of the cached relational queries.
  • the processing unit is further configured to manipulate query texts of the partition query and of the cached relational queries, before the identify, so that the query texts have a common comparable structure.
  • the manipulate includes removing unnecessary spaces and brackets.
  • the similarity criterion is a given percentage of similarity between one or more attributes of the partition query and of one of the cached relational queries.
  • the attributes include column names, table names, query parameters, function names or function parameters.
  • the processing unit is further configured to provide a similarity exception list, defining at least one pair of distinct function names to be regarded as similar, and wherein upon a first attribute of the attributes corresponds to a first distinct function name of the distinct function names and a second attribute of the attributes corresponds to a second distinct function name of the distinct function names, the first attribute and the second attribute are considered similar.
  • the processing unit is further configured to calculate a relevance score for each of the cached relational queries that meet the similarity criterion with the given partition query, and wherein the retrieve includes retrieving only the cached partition results of the cached relational query having the relevance score that is highest.
  • at least a first partition query of the partition queries is executed on a first data repository and at least a second partition query of the partition queries is executed on a second data repository.
  • the first data repository is an Structured Query Language (SQL) based data repository and the second data repository is a non-SQL-based data repository.
  • SQL Structured Query Language
  • the partitions include at least one real partition, being a real partition of the corresponding table, defined in the database.
  • the partitions include at least one virtual partition, being a virtual partition of the corresponding table, that does not exist in the database.
  • the cache further comprises one or more past source relational queries and wherein the processing unit is further configured to: identify at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and manipulate, before the translate, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
  • the processing unit is further configured to calculate a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest.
  • a query manipulation system comprising a processing unit configured to: provide a cache configured to store one or more past source relational queries; obtain a source relational query for retrieving required data from at least one table, the source relational query defining the required data; identify at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and manipulate the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
  • the processing unit is further configured to calculate a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest.
  • a method of data retrieval comprising: providing a cache configured to store one or more cached relational queries, corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof; obtaining, by a processing resource: (a) metadata describing structures of one or more tables within a database, a division of each of the tables to one or more partitions, and, for each partition, a last update timestamp indicative of a last update time thereof; and (b) a source relational query for retrieving required data from at least one of the tables, the source relational query defining the required data; translating, by the processing resource, using the metadata, the source relational query into one or more partition queries, each of the partition queries designed to retrieve at least part of the required data from a single corresponding partition of the partitions; identifying, by the processing resource, one or more of the partition queries (a) that do not meet a similarity
  • the method further comprises inserting the non-fresh-cache partition results and their respective partition queries into the cache and updating their respective cache timestamp to indicative of the retrieval time thereof.
  • the source relational query further comprises rules for reducing the required data and wherein the method further comprising reducing the merged results in accordance with the rules, for reducing the required data.
  • the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp.
  • the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp plus a tolerance time.
  • the similarity criterion is identity between the partition query and one of the cached relational queries.
  • the method further comprises manipulating query texts of the partition query and of the cached relational queries, before the identifying, so that the query texts have a common comparable structure.
  • the manipulating includes removing unnecessary spaces and brackets.
  • the similarity criterion is a given percentage of similarity between one or more attributes of the partition query and of one of the cached relational queries.
  • the attributes include column names, table names, query parameters, function names or function parameters.
  • the method further comprises providing a similarity exception list, defining at least one pair of distinct function names to be regarded as similar, and wherein upon a first attribute of the attributes corresponds to a first distinct function name of the distinct function names and a second attribute of the attributes corresponds to a second distinct function name of the distinct function names, the first attribute and the second attribute are considered similar.
  • the method upon identification of a given partition query that meet the similarity criterion with more than one of the cached relational queries, the method further comprises calculating a relevance score for each of the cached relational queries that meet the similarity criterion with the given partition query, and wherein the retrieving includes retrieving only the cached partition results of the cached relational query having the relevance score that is highest.
  • At least a first partition query of the partition queries is executed on a first data repository and at least a second partition query of the partition queries is executed on a second data repository.
  • the first data repository is an Structured Query Language (SQL) based data repository and the second data repository is a non-SQL-based data repository.
  • SQL Structured Query Language
  • the partitions include at least one real partition, being a real partition of the corresponding table, defined in the database.
  • the partitions include at least one virtual partition, being a virtual partition of the corresponding table, that does not exist in the database.
  • the cache further comprises one or more past source relational queries and the method further comprises: identifying at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and manipulating, before the translating, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
  • the method further comprises calculating a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest.
  • a method of query manipulation comprising: providing, by a processing resource, a cache configured to store one or more past source relational queries; obtaining, by the processing resource, a source relational query for retrieving required data from at least one table, the source relational query defining the required data; identifying, by the processing resource, at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and manipulating, by the processing resource, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
  • the method further comprises calculating a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest.
  • a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor of a computer to perform a method of data retrieval comprising: providing a cache configured to store one or more cached relational queries, corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof; obtaining, by a processing resource: (a) metadata describing structures of one or more tables within a database, a division of each of the tables to one or more partitions, and, for each partition, a last update timestamp indicative of a last update time thereof; and (b) a source relational query for retrieving required data from at least one of the tables, the source relational query defining the required data; translating, by the processing resource, using the metadata, the source relational query into one or more partition queries, each of the partition queries designed to retrieve at least part of
  • a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor of a computer to perform a method of data retrieval comprising: providing, by a processing resource, a cache configured to store one or more past source relational queries; obtaining, by the processing resource, a source relational query for retrieving required data from at least one table, the source relational query defining the required data; identifying, by the processing resource, at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and manipulating, by the processing resource, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion
  • FIG. 1 is a schematic illustration of an operating environment of a data retrieval system in accordance with the presently disclosed subject matter
  • Fig. 2 is a block diagram schematically illustrating one example of a querying device of the data retrieval system, in accordance with the presently disclosed subject matter;
  • Fig. 3 is a flowchart illustrating one example of a sequence of operations carried out for data retrieval, in accordance with the presently disclosed subject matter
  • Fig. 4 is a flowchart illustrating one example of a sequence of operations carried out for query text manipulation, in accordance with the presently disclosed subject matter
  • Fig. 5 is a flowchart illustrating one example of a sequence of operations carried out for query content manipulation, in accordance with the presently disclosed subject matter
  • Fig. 6 is a flowchart illustrating another example of a sequence of operations carried out for query content manipulation, in accordance with the presently disclosed subject matter.
  • should be expansively construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, a personal desktop/laptop computer, a server, a computing system, a communication device, a smartphone, a tablet computer, a smart television, a processor (e.g. digital signal processor (DSP), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc.), a group of multiple physical machines sharing performance of various tasks, virtual servers co- residing on a single physical machine, any other electronic computing device, and/or any combination thereof.
  • DSP digital signal processor
  • FPGA field programmable gate array
  • ASIC application specific integrated circuit
  • non-transitory is used herein to exclude transitory, propagating signals, but to otherwise include any volatile or non- volatile computer memory technology suitable to the application.
  • the phrase “for example,” “such as”, “for instance” and variants thereof describe non-limiting embodiments of the presently disclosed subject matter.
  • Reference in the specification to “one case”, “some cases”, “other cases” or variants thereof means that a particular feature, structure or characteristic described in connection with the embodiment(s) is included in at least one embodiment of the presently disclosed subject matter.
  • the appearance of the phrase “one case”, “some cases”, “other cases” or variants thereof does not necessarily refer to the same embodiment(s).
  • Figs. 1-2 illustrate a general schematic of the system architecture in accordance with an embodiment of the presently disclosed subject matter.
  • Each module in Figs. 1-2 can be made up of any combination of software, hardware and/or firmware that performs the functions as defined and explained herein.
  • the modules in Figs. 1-2 may be centralized in one location or dispersed over more than one location.
  • the system may comprise fewer, more, and/or different modules than those shown in Figs. 1-2.
  • Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
  • Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.
  • Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
  • FIG. 1 a schematic illustration of an operating environment of a data retrieval system in accordance with the presently disclosed subject matter.
  • an infrastructure 130 including processing resources 120, such as one or more processing units (e.g. central processing units), microprocessors, microcontrollers or any other computing devices or modules, including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling resources of infrastructure 130 and for enabling operations related to infrastructure 130.
  • processing resources 120 such as one or more processing units (e.g. central processing units), microprocessors, microcontrollers or any other computing devices or modules, including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling resources of infrastructure 130 and for enabling operations related to infrastructure 130.
  • Data repositories 110 capable of storing data.
  • Data repositories 110 can include one of the following, or any combination thereof: data stores, data warehouses, data bases (whether relational or not), files, tables, lists, or any other physical and/or logical means of storing data on computerized memories (whether on proprietary hardware or on third-party hardware).
  • Such data repositories 110 may include: RDBMS data repositories 110, such as: Oracle Database, Microsoft SQL Server, etc., non-tabular data repositories 110, such as: MongoDB, Apache HBase, etc., cloud based data repositories 110, such as: Google Cloud SQL, Amazon Relational Database, Microsoft SQL Azure Database, etc., cloud based Data Warehouse (DW) data repositories 110, such as: Google BigQuery, Amazon Athena, Azure Data Lake, etc. and other data repositories 110, such as: Excel Sheets, Google Sheets, proprietary file formats, etc.
  • the stored data can optionally be retrieval, update or deletion of data stored thereon.
  • Data repositories 110, and/or each data repository 110 can optionally be distributed over a plurality of devices and locations.
  • infrastructure 130 can optionally further comprise cache 160, that can optionally be stored on one or more of the data repositories 110.
  • Cache 160 can be configured to store various information that can be used by the data retrieval system in order to improve its operation (e.g. in terms of processing resources 120 utilization and/or bandwidth consumption).
  • cache 160 can store cached relational queries, results thereof, and for each of the cached relational queries results a corresponding cache timestamp indicative of a corresponding execution time thereof (i.e. a timestamp indicating when the query was executed in order to retrieve its respective results).
  • Cached relational queries are queries used to retrieve data from a data repository using a given syntax.
  • the syntax is a Structured Query Language (SQL) syntax, known in the art.
  • the cache 160 can be further configured to store the results of the cached relational queries, being the data obtained by executing the cached relational queries on the respective data repository 110.
  • the cache 160 can be external to the infrastructure 130, and it can be stored as a whole elsewhere (i.e. outside the infrastructure 130), or it can be distributed between the infrastructure's 130 data repositories 110 and other data repositories external to the infrastructure. In some cases, the cache 160 may be stored, in part in in whole, within one or more querying devices 140 (described herein).
  • the cached relational queries are sub-queries of so called “past source relational queries”.
  • Past source relational queries are also queries used to retrieve data from a data repository using a given syntax, such as SQL.
  • the past source relational queries are translated by the data retrieval system into partition queries, each being a relational query designed to query a single partition within a respective data repository 110.
  • One or more querying devices 140 can query data stored on the data repositories 110 and/or the cache 160.
  • the querying devices 140 can be any computerized device capable of providing queries in the required syntax for retrieving data.
  • the queries can be manually provided, e.g. by a human operator operating a querying device 140, or they can be automatically generated by a querying device 140.
  • the communication between the querying devices 140 and the infrastructure 130, and more specifically the data repositories 110 and the cache 160 can be transferred over one or more networks 150, that can be any type of network enabling such transfer of data, including wired, or wireless networks.
  • groups of one or more querying devices 140 each can each represent a different entity (e.g. company/other business type) to which a certain portion of the data repositories 110 is allocated, and only such portion is available thereto for storing or querying data stored thereon.
  • a first group of one or more querying devices 140 can be endpoints of a first company
  • a second group of one or more other querying devices 140 can be other endpoints of a second company.
  • a first portion of the data repositories 110 can be allocated to the first company and a second portion, other than the first portion, can be allocated to the second company.
  • a querying device 140 of the first company will have access to the first portion of the data repositories 110 allocated to the first company, and will not have access to the second portion of the data repositories allocated to the second company, and vice versa.
  • FIG. 2 there is shown a block diagram schematically illustrating one example of a querying device of the data retrieval system, in accordance with the presently disclosed subject matter.
  • querying device 140 can comprise a network interface 120 enabling transfer of data over network 150, including sending queries and/or receiving queries results to/from the infrastructure 130.
  • Querying device 140 can further comprise, or be otherwise associated with, a data repository 230 (e.g. a database, a storage system, a memory including Read Only Memory - ROM, Random Access Memory - RAM, or any other type of memory, etc.) configured to store data, including inter alia metadata describing structures of one or more tables within a data repository 110 (e.g. a database), a division of each of the tables to one or more partitions (real partitions, or virtual partitions, as detailed herein, inter alia with reference to Fig. 3), and, for each partition, a last update timestamp indicative of a last update time thereof.
  • data repository 230 can be further configured to enable retrieval and/or update and/or deletion of the stored data.
  • Processing resource 210 can be one or more processing units (e.g. central processing units), microprocessors, microcontrollers or any other computing devices or modules, including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling relevant querying device 140 resources and for enabling operations related to querying device 140 resources.
  • processing units e.g. central processing units
  • microprocessors e.g., microcontrollers
  • any other computing devices or modules including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling relevant querying device 140 resources and for enabling operations related to querying device 140 resources.
  • the processing resource 210 can comprise one or more of the following modules: data retrieval module 240, query manipulation module 250 and query content manipulation module 260.
  • data retrieval module 240 can be configured to perform a data retrieval process, as further detailed herein, inter alia with respect to Fig. 3.
  • query manipulation module 250 can be configured to perform a query manipulation process, as further detailed herein, inter alia with respect to Figs. 4 to 5.
  • query content manipulation module 260 can be configured to perform a query content manipulation process, as further detailed herein, inter alia with respect to Fig. 6.
  • FIG. 3 a flowchart illustrating one example of a sequence of operations carried out for data retrieval, in accordance with the presently disclosed subject matter.
  • the querying device 140 can be configured to perform a data retrieval process 300, e.g. utilizing data retrieval module 240.
  • querying device 140 can be configured to provide a cache 160 configured to store one or more cached relational queries (being queries executed on the data repositories 110 in the past), corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof (block 310).
  • a cache 160 configured to store one or more cached relational queries (being queries executed on the data repositories 110 in the past), corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof (block 310).
  • querying device 140 is further configured to obtain: (a) metadata describing structures of one or more tables within the data repositories 110 (e.g. within a given database), a division of each of the tables to one or more partitions, and, for each partition, a last update timestamp indicative of a last update time thereof; and (b) a source relational query for retrieving required data from at least one of the tables, the source relational query defining the required data (block 320)
  • the division of the database to partitions can be a real division to real database partitions, defined/configured within the database (e.g. based on instructions provided by a Data Base Administrator (DBA)), or it can be a virtual division, that is not defined/configured within the database (e.g. automatically created in cache 160 by querying device 140, based on user configuration or based on past source relational queries).
  • DBA Data Base Administrator
  • part of the partitions can be real database partitions and some can be virtual.
  • the source relational query is for querying data from the data repositories 110, and it is not provided in a partition-aware manner, i.e. the source relational query refers to the tables themselves, and not to specific partitions to which the tables are divided to. In many cases, a user that provides the source relational query is unaware of the division of the table/s, from which he seeks to retrieve data, into partitions.
  • the querying device 140 translates, using the metadata, the source relational query into one or more partition queries, each of the partition queries designed to retrieve at least part of the required data from a single corresponding partition of the partitions (whether real or virtual) (block 330). For example, if table A is divided into two partitions, and the source relational query is aimed at retrieving data from table A, the source relational query is translated into two partition queries, one for querying the data from the first partition of table A, and the other for querying the data from the second partition of table A.
  • querying device 140 can be configured to identify one or more of the partition queries (a) that do not meet a similarity criterion with any of the cached relational queries or (b) that meet the similarity criterion with at least one of the cached relational queries and for which the cached results of the cached relational queries do not comprise fresh results of the corresponding partition query (block 340).
  • the identified partition queries form a non- fresh-cache partition queries list, of partition queries to be executed in order to retrieve the required relevant data from the data repositories 110. It is to be noted that fresh results are results associated with a given cache timestamp of the cache timestamps that meets a freshness criterion.
  • the freshness criterion can be that the cache timestamp is equal to, or later than, the last update timestamp, so that the results are up to date and the corresponding data in the data repositories 110 has not been changed since it was retrieved and stored in the cache 160.
  • a certain tolerance can be provided, so that the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp plus a tolerance time.
  • the tolerance can be defined by a user of the data retrieval system.
  • the cached results of the cached can still be regarded as fresh, as they meet the freshness criterion.
  • the similarity criterion can require complete identity between the partition query and one of the cached relational queries, so that only cached relational queries that are identical to corresponding partition queries are identified. However, in other cases, a certain tolerance can be provided. In such cases, the similarity criterion can require a given percentage of similarity between one or more attributes of the partition query and of one of the cached relational queries. In more specific cases, the attributes are required to include less than a maximal number of different characters at corresponding positions of the corresponding attributes of the partition query and of the one of the cached relational queries.
  • the attributes can include column names, table names, query parameters, function names, function parameters, etc.
  • the partition query and one of the cached relational queries are identical - clearly if the cache 160 comprises fresh results of such cached relational queries they can be used instead of retrieving them from the data repositories 110 again.
  • tolerance can be provided. The tolerance can be defined by a user of the data retrieval system.
  • the user can determine that if there is an overlap of X% between the columns (which are a certain type of attributes) defined by a given partition query and the columns of one or more cached relational queries, and assuming that the cached relational queries are fresh, the partition query can be replaced by the cached relational query and the results of the cached relational query can be used instead of executing the partition query for retrieving data from the data repositories 110.
  • a similarity exception list defining at least one pair of distinct function names to be regarded as similar, and wherein upon a first attribute of the attributes corresponds to a first distinct function name of the distinct function names and a second attribute of the attributes corresponds to a second distinct function name of the distinct function names, the first attribute and the second attribute are considered similar.
  • the functions 'trim' and 'rtrim' can be regarded as identical when determining whether the similarity criterion is met, if they are included in the similarity exception list.
  • Fig. 4 is a flowchart illustrating one example of a sequence of operations carried out for query text manipulation, in accordance with the presently disclosed subject matter.
  • the partition queries are a generated as a translation of the source relational query, and are then compared with the cached relational queries to identify those cached relational queries whose cached results can be used instead of executing the corresponding partition queries.
  • the querying device 140 can be configured to manipulate query texts of the partition query and/or of the cached relational queries, so that the query texts have a common comparable structure. Then, when moving to block 340, the comparisons performed by the querying device 140 will compare apples with apples and not apples with oranges (block 410).
  • the manipulation includes removing unnecessary spaces and brackets within the text of the partition query and/or of the cached relational queries.
  • block 410 can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. It should be also noted that whilst the flow diagram is described also with reference to the system elements that realizes them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
  • Querying device 140 executes each of the partition queries comprised within the non-fresh-cache partition queries list, for retrieving non-cached or non-fresh cached partition results, from the data repositories 110 (block 350).
  • querying device 140 retrieves cached partition results from the cache, the cached partition results being the fresh results of the cached relational queries of the corresponding partition queries not comprised within the non-fresh-cache partition queries list, if any, and if more than one - the one having a highest calculated relevance score (block 360). It is to be noted that in some cases, for a given partition query, there can be more than one cached relational query that meet the similarity criterion and/or the freshness criterion.
  • a relevance score can be calculated for each of the cached relational queries that meet the similarity criterion and/or the freshness criterion with the given partition query, and the cached relational query having the relevance score that is highest is the one that will be retrieved at block 360.
  • the relevance score can be calculated so that the later the cache timestamp of the cached relational queries that meet the similarity criterion is - the higher the score is.
  • the relevance score can be calculated so that the more similar the cached relational query is to the partition query - the higher the score is.
  • the non-fresh-cache partition results and their respective partition queries can be inserted into the cache and their respective cache timestamp can be updated to indicate of the retrieval time thereof (block 370), so that the retrieved data that is now cached can be used at a later time, when dealing with future source relational queries.
  • Query device 140 can be further configured to merge the cached partition results with the non-fresh-cache partition results, if any, giving rise to the required data, as defined by the source relational query (block 380).
  • the merged results can then be reduced in accordance with rules for reducing the required data (block 390).
  • Data reduction rules can include, for example, group by or having clauses (used for example by SQL queries), or any other rule which cause a certain number of retrieved rows to be reduced to a lower number of rows.
  • the cached relational queries can also comprise rules for reducing the data retrieved thereby, such as the group by or having clauses (used for example by SQL queries), or any other rule which cause a certain number of retrieved rows to be reduced to a lower number of rows, so that retrieving the results of the cached queries from the cache 160, substantially reduces the infrastructure's 130 processing and bandwidth resource utilization.
  • a user of querying device 140 may request a source query querying a 'sales' table, containing sales records and stored in data repository 110.
  • the source query may translate into multiple partition queries, each for a partition of the 'sales' table.
  • Each partition may hold, for example, the sales records of a specific calendar month (or any other time period for that matter, mutatis mutandis).
  • Each of the partition queries is then compared with the cached relational queries to identify those cached relational queries whose cached results can be used instead of executing the corresponding partition queries.
  • it may be that the source query is for all sales records of the last calendar year.
  • the query will translate into 12 partition queries (one for each calendar month).
  • Each of the 12 partition queries will be then compared with the cached relational queries. If some of the partition queries are in the cache 160, meeting the freshness and similarity criterions, their respective results will be retrieved from the cache 160, thus saving processing resources 120 and network resources 150 while returning the requested data to the user.
  • some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. Furthermore, in some cases, the blocks can be performed in a different order than described herein (for example, block 320 can be performed before block 310, block 380 can be performed before block 370, block 390 can be performed before block 370 and/or block 380, etc.). It is to be further noted that some of the blocks are optional. It should be also noted that whilst the flow diagram is described also with reference to the system elements that realizes them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
  • FIG. 5 there is shown a flowchart illustrating one example of a sequence of operations carried out for query content manipulation, in accordance with the presently disclosed subject matter.
  • the querying device 140 can be configured to perform a query content manipulation process 300, e.g. utilizing data query manipulation module 250.
  • querying device 140 can be configured to identify, within the cache 160, at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data, as defined by the source relational query obtained at block 320 (block 510).
  • the querying device 140 can be configured to manipulate, before block 330, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries (block 520).
  • the similarity criterion can require a first percentage of overlap and a second percentage of non-overlap between the columns defined by the source relational query and the structure defined by one of the past source relational queries. For example, at least one column is required to overlap and at least one column is required not to overlap, so that if a given past source relational query defines retrieval of columns A, B and C from a given table, and the source relational query defines retrieval of columns A, B only, the source relational query can be updated to retrieve also column C.
  • the source relational query can be updated to retrieve only columns A and B which have been previously retrieved by the past source relational query.
  • FIG. 6 a flowchart illustrating another example of a sequence of operations carried out for query content manipulation, in accordance with the presently disclosed subject matter, is shown.
  • the querying device 140 can be configured to perform a query content manipulation process 300, e.g. utilizing query content manipulation module 260.
  • querying device 140 can be configured to provide a cache 160 configured to store one or more past source relational queries (being source relational queries obtained in the past as detailed with respect to Fig. 3) (block 610). Querying device 140 is further configured to obtain a source relational query for retrieving required data from at least one table, the source relational query defining the required data (block 620).
  • a cache 160 configured to store one or more past source relational queries (being source relational queries obtained in the past as detailed with respect to Fig. 3) (block 610).
  • Querying device 140 is further configured to obtain a source relational query for retrieving required data from at least one table, the source relational query defining the required data (block 620).
  • querying device 140 can be configured to identify at least one of the past source relational queries (stored on the cache 160) defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data, and optionally if more than one - the one having the highest calculated relevance score (block 630).
  • the similarity criterion can require that at least a first part of the columns defined by the source relational query and the columns defined by one of the past source relational queries will overlap, and at least a second part will not overlap, so that the columns defined by the source relational query and the columns defined by one of the past source relational queries will include at least one shared column and at least one column that exists in the source relational query and does not exist in the past source relational queries or vice versa.
  • At least one column is required to overlap and at least one column is required not to overlap, so that if a given past source relational query defines retrieval of columns A, B and C from a given table, and the source relational query defines retrieval of columns A, B only, the source relational query can be updated to retrieve also column C. On the other hand, if a given past source relational query defines retrieval of columns A and B from a given table, and the source relational query defines retrieval of columns A, B and C, the source relational query can be updated to retrieve only columns A and B which have been previously retrieved by the past source relational query.
  • a relevance score can be calculated for each of the past source relational queries that meet the similarity criterion with the source relational query, and the past source relational query having the relevance score that is highest is the one that will be identified at block 630.
  • the querying device 140 is further configured to manipulate the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries (block 640).
  • system can be implemented, at least partly, as a suitably programmed computer.
  • the presently disclosed subject matter contemplates a computer program being readable by a computer for executing the disclosed method.
  • the presently disclosed subject matter further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing the disclosed method.

Abstract

A method of query manipulation, the method comprising, by a processing resource: providing a cache configured to store one or more past source relational queries; obtaining a source relational query for retrieving required data from at least one table, the source relational query defining the required data; identifying at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and manipulating the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.

Description

A SYSTEM AND METHOD FOR DATA RETRffiVAL
TECHNICAL FIELD
The invention relates to a system and method for data retrieval.
BACKGROUND
Database systems store vast, and ever-increasing amounts of data, whether in a tabular format (e.g. in Relational Database Management Systems (RDBMS), such as: Oracle Database, Microsoft SQL Server, etc.), or in non-tabular formats (e.g. utilizing linked lists, key-value pairs, documents or other formats, using tolls such as: MongoDB, Apache HBase, etc.). Querying such amounts of data is, in many cases, a heavy task that requires substantial processing and network resources. In some cases, current data retrieval systems do not have any caching mechanisms for caching retrieved data, or the caching mechanisms are lacking. Therefore, processing and networking resources are in many cases over utilized and become expensive resources. There is thus a need in the art for a new method and system for data retrieval.
In addition, maintaining an infrastructure (hardware and software) for storing data became at the very least cumbersome, if not impractical. Therefore, new cloud- based services have emerged including Infrastructure as a Service (IaaS) services, for storing the data on a cloud infrastructure, in which the hardware, and at least some parts of the software required for storing and retrieving the data are not owned and/or maintained by the owner of the data itself, but by a certain third-party. Some examples include: Google Cloud SQL, Amazon Relational Database, Microsoft SQL Azure Database and more. Additional IaaS services include data analysis tools (such as Google BigQuery, Amazon Athena, Azure Data Lake and more), that enable reporting and data analysis of big volumes of data.
The shift to such cloud-based services even further emphasizes the above- mentioned need, as in such IaaS environment, many entities (e.g. companies) store their data on the same physical infrastructure, and the over utilization of the processing and networking resources is enhanced. As a result, the cloud-based services charge their clients as a function of the amount of data processed or retrieved by each query. Having a caching mechanism that reduces the amount of data processed or retrieved by the queries will thus reduce the resource utilization, and as a side-effect - also the queries costs. In addition, having a caching mechanism is also beneficial for enabling faster and more efficient querying of data, which is advantageous when dealing with large amounts of stored data.
GENERAL DESCRIPTION
In accordance with a first aspect of the presently disclosed subject matter, there is provided a data retrieval system comprising a processing unit configured to: provide a cache configured to store one or more cached relational queries, corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof; obtain: (a) metadata describing structures of one or more tables within a database, a division of each of the tables to one or more partitions, and, for each partition, a last update timestamp indicative of a last update time thereof; and (b) a source relational query for retrieving required data from at least one of the tables, the source relational query defining the required data; translate, using the metadata, the source relational query into one or more partition queries, each of the partition queries designed to retrieve at least part of the required data from a single corresponding partition of the partitions; identify one or more of the partition queries (a) that do not meet a similarity criterion with any of the cached relational queries or (b) that meet the similarity criterion with at least one of the cached relational queries and for which the cached results of the cached relational queries do not comprise fresh results of the corresponding partition query, wherein fresh results are results associated with a given cache timestamp of the cache timestamps that meets a freshness criterion, giving rise to non-fresh-cache partition queries list; execute each of the partition queries comprised within the non-fresh-cache partition queries list, for retrieving non-cached or non-fresh cached partition results; retrieve cached partition results from the cache, the cached partition results being the fresh results of the cached relational queries of the corresponding partition queries not comprised within the non- fresh-cache partition queries list, if any; and merge the cached partition results with the non-fresh-cache partition results, if any, giving rise to the required data. In some cases, the processing unit is further configured to insert the non-fresh- cache partition results and their respective partition queries into the cache and update their respective cache timestamp to indicative of the retrieval time thereof.
In some cases, the source relational query further comprises rules for reducing the required data and wherein the processing unit is further configured to reduce the merged results in accordance with the rules, for reducing the required data.
In some cases, the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp.
In some cases, the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp plus a tolerance time.
In some cases, the similarity criterion is identity between the partition query and one of the cached relational queries.
In some cases, the processing unit is further configured to manipulate query texts of the partition query and of the cached relational queries, before the identify, so that the query texts have a common comparable structure.
In some cases, the manipulate includes removing unnecessary spaces and brackets.
In some cases, the similarity criterion is a given percentage of similarity between one or more attributes of the partition query and of one of the cached relational queries.
In some cases, the attributes include column names, table names, query parameters, function names or function parameters.
In some cases, the processing unit is further configured to provide a similarity exception list, defining at least one pair of distinct function names to be regarded as similar, and wherein upon a first attribute of the attributes corresponds to a first distinct function name of the distinct function names and a second attribute of the attributes corresponds to a second distinct function name of the distinct function names, the first attribute and the second attribute are considered similar.
In some cases, upon identification of a given partition query that meet the similarity criterion with more than one of the cached relational queries, the processing unit is further configured to calculate a relevance score for each of the cached relational queries that meet the similarity criterion with the given partition query, and wherein the retrieve includes retrieving only the cached partition results of the cached relational query having the relevance score that is highest. Γη some cases, at least a first partition query of the partition queries is executed on a first data repository and at least a second partition query of the partition queries is executed on a second data repository.
In some cases, the first data repository is an Structured Query Language (SQL) based data repository and the second data repository is a non-SQL-based data repository.
In some cases, the partitions include at least one real partition, being a real partition of the corresponding table, defined in the database.
In some cases, the partitions include at least one virtual partition, being a virtual partition of the corresponding table, that does not exist in the database.
In some cases, the cache further comprises one or more past source relational queries and wherein the processing unit is further configured to: identify at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and manipulate, before the translate, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
In some cases, upon identification of a plurality of past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion with the second structure of the required data, the processing unit is further configured to calculate a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest.
In accordance with a second aspect of the presently disclosed subject matter, there is provided a query manipulation system, comprising a processing unit configured to: provide a cache configured to store one or more past source relational queries; obtain a source relational query for retrieving required data from at least one table, the source relational query defining the required data; identify at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and manipulate the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
In some cases, wherein upon identification of a plurality of past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion with the second structure of the required data, the processing unit is further configured to calculate a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest.
In accordance with a third aspect of the presently disclosed subject matter, there is provided a method of data retrieval, the method comprising: providing a cache configured to store one or more cached relational queries, corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof; obtaining, by a processing resource: (a) metadata describing structures of one or more tables within a database, a division of each of the tables to one or more partitions, and, for each partition, a last update timestamp indicative of a last update time thereof; and (b) a source relational query for retrieving required data from at least one of the tables, the source relational query defining the required data; translating, by the processing resource, using the metadata, the source relational query into one or more partition queries, each of the partition queries designed to retrieve at least part of the required data from a single corresponding partition of the partitions; identifying, by the processing resource, one or more of the partition queries (a) that do not meet a similarity criterion with any of the cached relational queries or (b) that meet the similarity criterion with at least one of the cached relational queries and for which the cached results of the cached relational queries do not comprise fresh results of the corresponding partition query, wherein fresh results are results associated with a given cache timestamp of the cache timestamps that meets a freshness criterion, giving rise to non-fresh-cache partition queries list; executing, by the processing resource, each of the partition queries comprised within the non-fresh-cache partition queries list, for retrieving non-cached or non-fresh cached partition results; retrieving, by the processing resource, cached partition results from the cache, the cached partition results being the fresh results of the cached relational queries of the corresponding partition queries not comprised within the non-fresh-cache partition queries list, if any; and merging, by the processing resource, the cached partition results with the non-fresh-cache partition results, if any, giving rise to the required data.
In some cases, the method further comprises inserting the non-fresh-cache partition results and their respective partition queries into the cache and updating their respective cache timestamp to indicative of the retrieval time thereof.
In some cases, the source relational query further comprises rules for reducing the required data and wherein the method further comprising reducing the merged results in accordance with the rules, for reducing the required data.
In some cases, the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp.
In some cases, the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp plus a tolerance time.
In some cases, the similarity criterion is identity between the partition query and one of the cached relational queries.
In some cases, the method further comprises manipulating query texts of the partition query and of the cached relational queries, before the identifying, so that the query texts have a common comparable structure.
In some cases, the manipulating includes removing unnecessary spaces and brackets.
In some cases, the similarity criterion is a given percentage of similarity between one or more attributes of the partition query and of one of the cached relational queries. In some cases, the attributes include column names, table names, query parameters, function names or function parameters.
In some cases, the method further comprises providing a similarity exception list, defining at least one pair of distinct function names to be regarded as similar, and wherein upon a first attribute of the attributes corresponds to a first distinct function name of the distinct function names and a second attribute of the attributes corresponds to a second distinct function name of the distinct function names, the first attribute and the second attribute are considered similar.
In some cases, upon identification of a given partition query that meet the similarity criterion with more than one of the cached relational queries, the method further comprises calculating a relevance score for each of the cached relational queries that meet the similarity criterion with the given partition query, and wherein the retrieving includes retrieving only the cached partition results of the cached relational query having the relevance score that is highest.
In some cases, at least a first partition query of the partition queries is executed on a first data repository and at least a second partition query of the partition queries is executed on a second data repository.
In some cases, the first data repository is an Structured Query Language (SQL) based data repository and the second data repository is a non-SQL-based data repository.
In some cases, the partitions include at least one real partition, being a real partition of the corresponding table, defined in the database.
In some cases, the partitions include at least one virtual partition, being a virtual partition of the corresponding table, that does not exist in the database.
In some cases, the cache further comprises one or more past source relational queries and the method further comprises: identifying at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and manipulating, before the translating, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
In some cases, upon identification of a plurality of past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion with the second structure of the required data, the method further comprises calculating a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest.
In accordance with a fourth aspect of the presently disclosed subject matter, there is provided a method of query manipulation, the method comprising: providing, by a processing resource, a cache configured to store one or more past source relational queries; obtaining, by the processing resource, a source relational query for retrieving required data from at least one table, the source relational query defining the required data; identifying, by the processing resource, at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and manipulating, by the processing resource, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
In some cases, upon identification of a plurality of past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion with the second structure of the required data, the method further comprises calculating a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest. Γη accordance with a fifth aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor of a computer to perform a method of data retrieval comprising: providing a cache configured to store one or more cached relational queries, corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof; obtaining, by a processing resource: (a) metadata describing structures of one or more tables within a database, a division of each of the tables to one or more partitions, and, for each partition, a last update timestamp indicative of a last update time thereof; and (b) a source relational query for retrieving required data from at least one of the tables, the source relational query defining the required data; translating, by the processing resource, using the metadata, the source relational query into one or more partition queries, each of the partition queries designed to retrieve at least part of the required data from a single corresponding partition of the partitions; identifying, by the processing resource, one or more of the partition queries (a) that do not meet a similarity criterion with any of the cached relational queries or (b) that meet the similarity criterion with at least one of the cached relational queries and for which the cached results of the cached relational queries do not comprise fresh results of the corresponding partition query, wherein fresh results are results associated with a given cache timestamp of the cache timestamps that meets a freshness criterion, giving rise to non-fresh-cache partition queries list; executing, by the processing resource, each of the partition queries comprised within the non-fresh-cache partition queries list, for retrieving non-cached or non-fresh cached partition results; retrieving, by the processing resource, cached partition results from the cache, the cached partition results being the fresh results of the cached relational queries of the corresponding partition queries not comprised within the non-fresh-cache partition queries list, if any; and merging, by the processing resource, the cached partition results with the non-fresh-cache partition results, if any, giving rise to the required data. In accordance with a sixth aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor of a computer to perform a method of data retrieval comprising: providing, by a processing resource, a cache configured to store one or more past source relational queries; obtaining, by the processing resource, a source relational query for retrieving required data from at least one table, the source relational query defining the required data; identifying, by the processing resource, at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and manipulating, by the processing resource, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries
BRIEF DESCRIPTION OF THE DRAWINGS
In order to understand the presently disclosed subject matter and to see how it may be carried out in practice, the subject matter will now be described, by way of non- limiting examples only, with reference to the accompanying drawings, in which:
Fig. 1 is a schematic illustration of an operating environment of a data retrieval system in accordance with the presently disclosed subject matter;
Fig. 2 is a block diagram schematically illustrating one example of a querying device of the data retrieval system, in accordance with the presently disclosed subject matter;
Fig. 3 is a flowchart illustrating one example of a sequence of operations carried out for data retrieval, in accordance with the presently disclosed subject matter;
Fig. 4 is a flowchart illustrating one example of a sequence of operations carried out for query text manipulation, in accordance with the presently disclosed subject matter; Fig. 5 is a flowchart illustrating one example of a sequence of operations carried out for query content manipulation, in accordance with the presently disclosed subject matter; and
Fig. 6 is a flowchart illustrating another example of a sequence of operations carried out for query content manipulation, in accordance with the presently disclosed subject matter.
DETAILED DESCRIPTION
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the presently disclosed subject matter. However, it will be understood by those skilled in the art that the presently disclosed subject matter may be practiced without these specific details. In other instances, well- known methods, procedures, and components have not been described in detail so as not to obscure the presently disclosed subject matter.
In the drawings and descriptions set forth, identical reference numerals indicate those components that are common to different embodiments or configurations.
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as "providing", "obtaining", "translating", "identifying", "executing", "retrieving", "merging", "inserting", "updating", "reducing", "manipulating", "removing", "calculating" or the like, include action and/or processes of a computer that manipulate and/or transform data into other data, said data represented as physical quantities, e.g. such as electronic quantities, and/or said data representing the physical objects. The terms "computer", "processor", and "controller" should be expansively construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, a personal desktop/laptop computer, a server, a computing system, a communication device, a smartphone, a tablet computer, a smart television, a processor (e.g. digital signal processor (DSP), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc.), a group of multiple physical machines sharing performance of various tasks, virtual servers co- residing on a single physical machine, any other electronic computing device, and/or any combination thereof. The operations in accordance with the teachings herein may be performed by a computer specially constructed for the desired purposes or by a general-purpose computer specially configured for the desired purpose by a computer program stored in a non-transitory computer readable storage medium. The term "non-transitory" is used herein to exclude transitory, propagating signals, but to otherwise include any volatile or non- volatile computer memory technology suitable to the application.
As used herein, the phrase "for example," "such as", "for instance" and variants thereof describe non-limiting embodiments of the presently disclosed subject matter. Reference in the specification to "one case", "some cases", "other cases" or variants thereof means that a particular feature, structure or characteristic described in connection with the embodiment(s) is included in at least one embodiment of the presently disclosed subject matter. Thus, the appearance of the phrase "one case", "some cases", "other cases" or variants thereof does not necessarily refer to the same embodiment(s).
It is appreciated that, unless specifically stated otherwise, certain features of the presently disclosed subject matter, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the presently disclosed subject matter, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination.
In embodiments of the presently disclosed subject matter, fewer, more and/or different stages than those shown in Fig. 3-6 may be executed. In embodiments of the presently disclosed subject matter one or more stages illustrated in Fig. 3-6 may be executed in a different order and/or one or more groups of stages may be executed simultaneously. Figs. 1-2 illustrate a general schematic of the system architecture in accordance with an embodiment of the presently disclosed subject matter. Each module in Figs. 1-2 can be made up of any combination of software, hardware and/or firmware that performs the functions as defined and explained herein. The modules in Figs. 1-2 may be centralized in one location or dispersed over more than one location. In other embodiments of the presently disclosed subject matter, the system may comprise fewer, more, and/or different modules than those shown in Figs. 1-2.
Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.
Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
Bearing this in mind, attention is drawn to Fig. 1, a schematic illustration of an operating environment of a data retrieval system in accordance with the presently disclosed subject matter.
According to certain examples of the presently disclosed subject matter, an infrastructure 130 is provided, including processing resources 120, such as one or more processing units (e.g. central processing units), microprocessors, microcontrollers or any other computing devices or modules, including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling resources of infrastructure 130 and for enabling operations related to infrastructure 130.
Infrastructure 130 further comprises one or more data repositories 110 capable of storing data. Data repositories 110 can include one of the following, or any combination thereof: data stores, data warehouses, data bases (whether relational or not), files, tables, lists, or any other physical and/or logical means of storing data on computerized memories (whether on proprietary hardware or on third-party hardware). Such data repositories 110 may include: RDBMS data repositories 110, such as: Oracle Database, Microsoft SQL Server, etc., non-tabular data repositories 110, such as: MongoDB, Apache HBase, etc., cloud based data repositories 110, such as: Google Cloud SQL, Amazon Relational Database, Microsoft SQL Azure Database, etc., cloud based Data Warehouse (DW) data repositories 110, such as: Google BigQuery, Amazon Athena, Azure Data Lake, etc. and other data repositories 110, such as: Excel Sheets, Google Sheets, proprietary file formats, etc. The stored data can optionally be retrieval, update or deletion of data stored thereon. Data repositories 110, and/or each data repository 110, can optionally be distributed over a plurality of devices and locations.
In accordance with the presently disclosed subject matter, infrastructure 130 can optionally further comprise cache 160, that can optionally be stored on one or more of the data repositories 110. Cache 160 can be configured to store various information that can be used by the data retrieval system in order to improve its operation (e.g. in terms of processing resources 120 utilization and/or bandwidth consumption). For example, cache 160 can store cached relational queries, results thereof, and for each of the cached relational queries results a corresponding cache timestamp indicative of a corresponding execution time thereof (i.e. a timestamp indicating when the query was executed in order to retrieve its respective results). Cached relational queries are queries used to retrieve data from a data repository using a given syntax. In one specific example, the syntax is a Structured Query Language (SQL) syntax, known in the art. As indicated herein, the cache 160 can be further configured to store the results of the cached relational queries, being the data obtained by executing the cached relational queries on the respective data repository 110.
It is to be noted that in some other cases the cache 160 can be external to the infrastructure 130, and it can be stored as a whole elsewhere (i.e. outside the infrastructure 130), or it can be distributed between the infrastructure's 130 data repositories 110 and other data repositories external to the infrastructure. In some cases, the cache 160 may be stored, in part in in whole, within one or more querying devices 140 (described herein).
In some cases, the cached relational queries are sub-queries of so called "past source relational queries". Past source relational queries are also queries used to retrieve data from a data repository using a given syntax, such as SQL. However, as further detailed herein, inter alia with reference to Fig. 3, the past source relational queries are translated by the data retrieval system into partition queries, each being a relational query designed to query a single partition within a respective data repository 110.
One or more querying devices 140 can query data stored on the data repositories 110 and/or the cache 160. The querying devices 140 can be any computerized device capable of providing queries in the required syntax for retrieving data. The queries can be manually provided, e.g. by a human operator operating a querying device 140, or they can be automatically generated by a querying device 140. The communication between the querying devices 140 and the infrastructure 130, and more specifically the data repositories 110 and the cache 160, can be transferred over one or more networks 150, that can be any type of network enabling such transfer of data, including wired, or wireless networks.
It is to be noted that in some cases, groups of one or more querying devices 140 each, can each represent a different entity (e.g. company/other business type) to which a certain portion of the data repositories 110 is allocated, and only such portion is available thereto for storing or querying data stored thereon. For example, a first group of one or more querying devices 140 can be endpoints of a first company, and a second group of one or more other querying devices 140 can be other endpoints of a second company. A first portion of the data repositories 110 can be allocated to the first company and a second portion, other than the first portion, can be allocated to the second company. A querying device 140 of the first company will have access to the first portion of the data repositories 110 allocated to the first company, and will not have access to the second portion of the data repositories allocated to the second company, and vice versa.
Turning to Fig. 2, there is shown a block diagram schematically illustrating one example of a querying device of the data retrieval system, in accordance with the presently disclosed subject matter.
According to certain examples of the presently disclosed subject matter, querying device 140 can comprise a network interface 120 enabling transfer of data over network 150, including sending queries and/or receiving queries results to/from the infrastructure 130.
Querying device 140 can further comprise, or be otherwise associated with, a data repository 230 (e.g. a database, a storage system, a memory including Read Only Memory - ROM, Random Access Memory - RAM, or any other type of memory, etc.) configured to store data, including inter alia metadata describing structures of one or more tables within a data repository 110 (e.g. a database), a division of each of the tables to one or more partitions (real partitions, or virtual partitions, as detailed herein, inter alia with reference to Fig. 3), and, for each partition, a last update timestamp indicative of a last update time thereof. Such data can be obtained from the infrastructure 130, e.g. by requesting it from the processing resources 120, having access to the data repositories 110. Data repository 230 can be further configured to enable retrieval and/or update and/or deletion of the stored data.
Querying device 140 further comprises one or more processing resources 210. Processing resource 210 can be one or more processing units (e.g. central processing units), microprocessors, microcontrollers or any other computing devices or modules, including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling relevant querying device 140 resources and for enabling operations related to querying device 140 resources.
The processing resource 210 can comprise one or more of the following modules: data retrieval module 240, query manipulation module 250 and query content manipulation module 260.
According to some examples of the presently disclosed subject matter, data retrieval module 240 can be configured to perform a data retrieval process, as further detailed herein, inter alia with respect to Fig. 3.
According to some examples of the presently disclosed subject matter, query manipulation module 250 can be configured to perform a query manipulation process, as further detailed herein, inter alia with respect to Figs. 4 to 5.
According to some examples of the presently disclosed subject matter, query content manipulation module 260 can be configured to perform a query content manipulation process, as further detailed herein, inter alia with respect to Fig. 6.
Attention is drawn to Fig. 3, a flowchart illustrating one example of a sequence of operations carried out for data retrieval, in accordance with the presently disclosed subject matter.
According to certain examples of the presently disclosed subject matter, the querying device 140 can be configured to perform a data retrieval process 300, e.g. utilizing data retrieval module 240.
For that purpose, querying device 140 can be configured to provide a cache 160 configured to store one or more cached relational queries (being queries executed on the data repositories 110 in the past), corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof (block 310).
In addition to the cache 160, querying device 140 is further configured to obtain: (a) metadata describing structures of one or more tables within the data repositories 110 (e.g. within a given database), a division of each of the tables to one or more partitions, and, for each partition, a last update timestamp indicative of a last update time thereof; and (b) a source relational query for retrieving required data from at least one of the tables, the source relational query defining the required data (block 320)
The division of the database to partitions can be a real division to real database partitions, defined/configured within the database (e.g. based on instructions provided by a Data Base Administrator (DBA)), or it can be a virtual division, that is not defined/configured within the database (e.g. automatically created in cache 160 by querying device 140, based on user configuration or based on past source relational queries). In some cases, part of the partitions can be real database partitions and some can be virtual.
The source relational query is for querying data from the data repositories 110, and it is not provided in a partition-aware manner, i.e. the source relational query refers to the tables themselves, and not to specific partitions to which the tables are divided to. In many cases, a user that provides the source relational query is unaware of the division of the table/s, from which he seeks to retrieve data, into partitions.
Having the metadata and the source relational query, the querying device 140 translates, using the metadata, the source relational query into one or more partition queries, each of the partition queries designed to retrieve at least part of the required data from a single corresponding partition of the partitions (whether real or virtual) (block 330). For example, if table A is divided into two partitions, and the source relational query is aimed at retrieving data from table A, the source relational query is translated into two partition queries, one for querying the data from the first partition of table A, and the other for querying the data from the second partition of table A.
Once the source relational query is translated into partition queries, querying device 140 can be configured to identify one or more of the partition queries (a) that do not meet a similarity criterion with any of the cached relational queries or (b) that meet the similarity criterion with at least one of the cached relational queries and for which the cached results of the cached relational queries do not comprise fresh results of the corresponding partition query (block 340). The identified partition queries form a non- fresh-cache partition queries list, of partition queries to be executed in order to retrieve the required relevant data from the data repositories 110. It is to be noted that fresh results are results associated with a given cache timestamp of the cache timestamps that meets a freshness criterion. The freshness criterion can be that the cache timestamp is equal to, or later than, the last update timestamp, so that the results are up to date and the corresponding data in the data repositories 110 has not been changed since it was retrieved and stored in the cache 160. In other cases, a certain tolerance can be provided, so that the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp plus a tolerance time. The tolerance can be defined by a user of the data retrieval system. In such cases, even if the results are not up to date and the corresponding data in the data repositories 110 did change since it was retrieved and stored in the cache 160, if such change did not occur before the tolerance, the cached results of the cached can still be regarded as fresh, as they meet the freshness criterion.
The similarity criterion can require complete identity between the partition query and one of the cached relational queries, so that only cached relational queries that are identical to corresponding partition queries are identified. However, in other cases, a certain tolerance can be provided. In such cases, the similarity criterion can require a given percentage of similarity between one or more attributes of the partition query and of one of the cached relational queries. In more specific cases, the attributes are required to include less than a maximal number of different characters at corresponding positions of the corresponding attributes of the partition query and of the one of the cached relational queries. The attributes can include column names, table names, query parameters, function names, function parameters, etc.
Looking at an example, assuming that the partition query and one of the cached relational queries are identical - clearly if the cache 160 comprises fresh results of such cached relational queries they can be used instead of retrieving them from the data repositories 110 again. In many cases, tolerance can be provided. The tolerance can be defined by a user of the data retrieval system. For example, the user can determine that if there is an overlap of X% between the columns (which are a certain type of attributes) defined by a given partition query and the columns of one or more cached relational queries, and assuming that the cached relational queries are fresh, the partition query can be replaced by the cached relational query and the results of the cached relational query can be used instead of executing the partition query for retrieving data from the data repositories 110. In some cases, there can be provided a similarity exception list, defining at least one pair of distinct function names to be regarded as similar, and wherein upon a first attribute of the attributes corresponds to a first distinct function name of the distinct function names and a second attribute of the attributes corresponds to a second distinct function name of the distinct function names, the first attribute and the second attribute are considered similar. For example, the functions 'trim' and 'rtrim' can be regarded as identical when determining whether the similarity criterion is met, if they are included in the similarity exception list.
Before continuing the description of the data retrieval process 300, attention is drawn to Fig. 4. Fig. 4 is a flowchart illustrating one example of a sequence of operations carried out for query text manipulation, in accordance with the presently disclosed subject matter.
As indicated herein, the partition queries are a generated as a translation of the source relational query, and are then compared with the cached relational queries to identify those cached relational queries whose cached results can be used instead of executing the corresponding partition queries. However, before comparing the partition queries and the cached relational queries, in some cases the querying device 140 can be configured to manipulate query texts of the partition query and/or of the cached relational queries, so that the query texts have a common comparable structure. Then, when moving to block 340, the comparisons performed by the querying device 140 will compare apples with apples and not apples with oranges (block 410). In some cases, the manipulation includes removing unnecessary spaces and brackets within the text of the partition query and/or of the cached relational queries.
It is to be noted that, with reference to Fig. 4, block 410 can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. It should be also noted that whilst the flow diagram is described also with reference to the system elements that realizes them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
Attention is now drawn back to Fig. 3. Querying device 140 then executes each of the partition queries comprised within the non-fresh-cache partition queries list, for retrieving non-cached or non-fresh cached partition results, from the data repositories 110 (block 350). In addition, querying device 140 retrieves cached partition results from the cache, the cached partition results being the fresh results of the cached relational queries of the corresponding partition queries not comprised within the non-fresh-cache partition queries list, if any, and if more than one - the one having a highest calculated relevance score (block 360). It is to be noted that in some cases, for a given partition query, there can be more than one cached relational query that meet the similarity criterion and/or the freshness criterion. In such cases, a relevance score can be calculated for each of the cached relational queries that meet the similarity criterion and/or the freshness criterion with the given partition query, and the cached relational query having the relevance score that is highest is the one that will be retrieved at block 360. The relevance score can be calculated so that the later the cache timestamp of the cached relational queries that meet the similarity criterion is - the higher the score is. hi addition, the relevance score can be calculated so that the more similar the cached relational query is to the partition query - the higher the score is.
The non-fresh-cache partition results and their respective partition queries can be inserted into the cache and their respective cache timestamp can be updated to indicate of the retrieval time thereof (block 370), so that the retrieved data that is now cached can be used at a later time, when dealing with future source relational queries.
After obtaining the partition queries comprised within the non-fresh-cache partition queries list, and the cached partition results from the cache 160 (as detailed with respect to blocks 350 and 360), Query device 140 can be further configured to merge the cached partition results with the non-fresh-cache partition results, if any, giving rise to the required data, as defined by the source relational query (block 380). The merged results can then be reduced in accordance with rules for reducing the required data (block 390). Data reduction rules can include, for example, group by or having clauses (used for example by SQL queries), or any other rule which cause a certain number of retrieved rows to be reduced to a lower number of rows.
It is to be noted that the cached relational queries can also comprise rules for reducing the data retrieved thereby, such as the group by or having clauses (used for example by SQL queries), or any other rule which cause a certain number of retrieved rows to be reduced to a lower number of rows, so that retrieving the results of the cached queries from the cache 160, substantially reduces the infrastructure's 130 processing and bandwidth resource utilization.
For example, a user of querying device 140 may request a source query querying a 'sales' table, containing sales records and stored in data repository 110. The source query may translate into multiple partition queries, each for a partition of the 'sales' table. Each partition may hold, for example, the sales records of a specific calendar month (or any other time period for that matter, mutatis mutandis). Each of the partition queries is then compared with the cached relational queries to identify those cached relational queries whose cached results can be used instead of executing the corresponding partition queries. In our example, it may be that the source query is for all sales records of the last calendar year. The query will translate into 12 partition queries (one for each calendar month). Each of the 12 partition queries will be then compared with the cached relational queries. If some of the partition queries are in the cache 160, meeting the freshness and similarity criterions, their respective results will be retrieved from the cache 160, thus saving processing resources 120 and network resources 150 while returning the requested data to the user.
It is to be noted that, with reference to Fig. 3, some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. Furthermore, in some cases, the blocks can be performed in a different order than described herein (for example, block 320 can be performed before block 310, block 380 can be performed before block 370, block 390 can be performed before block 370 and/or block 380, etc.). It is to be further noted that some of the blocks are optional. It should be also noted that whilst the flow diagram is described also with reference to the system elements that realizes them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
Turning to Fig. 5, there is shown a flowchart illustrating one example of a sequence of operations carried out for query content manipulation, in accordance with the presently disclosed subject matter.
According to certain examples of the presently disclosed subject matter, the querying device 140 can be configured to perform a query content manipulation process 300, e.g. utilizing data query manipulation module 250.
For that purpose, querying device 140 can be configured to identify, within the cache 160, at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data, as defined by the source relational query obtained at block 320 (block 510). Upon identification of such past source relational queries, the querying device 140 can be configured to manipulate, before block 330, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries (block 520).
The similarity criterion can require a first percentage of overlap and a second percentage of non-overlap between the columns defined by the source relational query and the structure defined by one of the past source relational queries. For example, at least one column is required to overlap and at least one column is required not to overlap, so that if a given past source relational query defines retrieval of columns A, B and C from a given table, and the source relational query defines retrieval of columns A, B only, the source relational query can be updated to retrieve also column C. On the other hand, if a given past source relational query defines retrieval of columns A and B from a given table, and the source relational query defines retrieval of columns A, B and C, the source relational query can be updated to retrieve only columns A and B which have been previously retrieved by the past source relational query.
It is to be noted that, with reference to Fig. 5, some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. It should be also noted that whilst the flow diagram is described also with reference to the system elements that realizes them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
Turning to Fig. 6, a flowchart illustrating another example of a sequence of operations carried out for query content manipulation, in accordance with the presently disclosed subject matter, is shown.
According to certain examples of the presently disclosed subject matter, the querying device 140 can be configured to perform a query content manipulation process 300, e.g. utilizing query content manipulation module 260.
For that purpose, querying device 140 can be configured to provide a cache 160 configured to store one or more past source relational queries (being source relational queries obtained in the past as detailed with respect to Fig. 3) (block 610). Querying device 140 is further configured to obtain a source relational query for retrieving required data from at least one table, the source relational query defining the required data (block 620).
Having the source relational query and access to the cache 160 comprising information on one or more past source relational queries, querying device 140 can be configured to identify at least one of the past source relational queries (stored on the cache 160) defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data, and optionally if more than one - the one having the highest calculated relevance score (block 630).
The similarity criterion can require that at least a first part of the columns defined by the source relational query and the columns defined by one of the past source relational queries will overlap, and at least a second part will not overlap, so that the columns defined by the source relational query and the columns defined by one of the past source relational queries will include at least one shared column and at least one column that exists in the source relational query and does not exist in the past source relational queries or vice versa. For example, at least one column is required to overlap and at least one column is required not to overlap, so that if a given past source relational query defines retrieval of columns A, B and C from a given table, and the source relational query defines retrieval of columns A, B only, the source relational query can be updated to retrieve also column C. On the other hand, if a given past source relational query defines retrieval of columns A and B from a given table, and the source relational query defines retrieval of columns A, B and C, the source relational query can be updated to retrieve only columns A and B which have been previously retrieved by the past source relational query.
It is to be noted that in some cases, for a given source relational query, there can be more than one past source relational query that meet the similarity criterion. In such cases, a relevance score can be calculated for each of the past source relational queries that meet the similarity criterion with the source relational query, and the past source relational query having the relevance score that is highest is the one that will be identified at block 630.
Having identified the past source relational query at block 630, the querying device 140 is further configured to manipulate the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries (block 640).
It is to be noted that, with reference to Fig. 6, some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. It should be also noted that whilst the flow diagram is described also with reference to the system elements that realizes them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
It is to be understood that the presently disclosed subject matter is not limited in its application to the details set forth in the description contained herein or illustrated in the drawings. The presently disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Hence, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for designing other structures, methods, and systems for carrying out the several purposes of the present presently disclosed subject matter.
It will also be understood that the system according to the presently disclosed subject matter can be implemented, at least partly, as a suitably programmed computer. Likewise, the presently disclosed subject matter contemplates a computer program being readable by a computer for executing the disclosed method. The presently disclosed subject matter further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing the disclosed method.

Claims

CLAIMS:
1. A data retrieval system comprising a processing unit configured to:
provide a cache configured to store one or more cached relational queries, corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof; obtain: (a) metadata describing structures of one or more tables within a database, a division of each of the tables to one or more partitions, and, for each partition, a last update timestamp indicative of a last update time thereof; and (b) a source relational query for retrieving required data from at least one of the tables, the source relational query defining the required data;
translate, using the metadata, the source relational query into one or more partition queries, each of the partition queries designed to retrieve at least part of the required data from a single corresponding partition of the partitions;
identify one or more of the partition queries (a) that do not meet a similarity criterion with any of the cached relational queries or (b) that meet the similarity criterion with at least one of the cached relational queries and for which the cached results of the cached relational queries do not comprise fresh results of the corresponding partition query, wherein fresh results are results associated with a given cache timestamp of the cache timestamps that meets a freshness criterion, giving rise to non-fresh-cache partition queries list;
execute each of the partition queries comprised within the non-fresh-cache partition queries list, for retrieving non-cached or non-fresh cached partition results; retrieve cached partition results from the cache, the cached partition results being the fresh results of the cached relational queries of the corresponding partition queries not comprised within the non-fresh-cache partition queries list, if any; and
merge the cached partition results with the non-fresh-cache partition results, if any, giving rise to the required data.
2. The data retrieval system of claim 1, wherein the processing unit is further configured to insert the non-fresh-cache partition results and their respective partition queries into the cache and update their respective cache timestamp to indicative of the retrieval time thereof.
3. The data retrieval system of claim 1, wherein the source relational query further comprises rules for reducing the required data and wherein the processing unit is further configured to reduce the merged results in accordance with the rules, for reducing the required data.
4. The data retrieval system of claim 1, wherein the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp.
5. The data retrieval system of claim 1, wherein the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp plus a tolerance time.
6. The data retrieval system of claim 1, wherein the similarity criterion is identity between the partition query and one of the cached relational queries.
7. The data retrieval system of claim 6, wherein the processing unit is further configured to manipulate query texts of the partition query and of the cached relational queries, before the identify, so that the query texts have a common comparable structure.
8. The data retrieval system of claim 7, wherein the manipulate includes removing unnecessary spaces and brackets.
9. The data retrieval system of claim 1, wherein the similarity criterion is a given percentage of similarity between one or more attributes of the partition query and of one of the cached relational queries.
10. The data retrieval system of claim 9, wherein the attributes include column names, table names, query parameters, function names or function parameters.
11. The data retrieval system of claim 10, wherein the processing unit is further configured to provide a similarity exception list, defining at least one pair of distinct function names to be regarded as similar, and wherein upon a first attribute of the attributes corresponds to a first distinct function name of the distinct function names and a second attribute of the attributes corresponds to a second distinct function name of the distinct function names, the first attribute and the second attribute are considered similar.
12. The data retrieval system of claim 1, wherein upon identification of a given partition query that meet the similarity criterion with more than one of the cached relational queries, the processing unit is further configured to calculate a relevance score for each of the cached relational queries that meet the similarity criterion with the given partition query, and wherein the retrieve includes retrieving only the cached partition results of the cached relational query having the relevance score that is highest.
13. The data retrieval system of claim 1, wherein at least a first partition query of the partition queries is executed on a first data repository and at least a second partition query of the partition queries is executed on a second data repository.
14. The data retrieval system of claim 13, wherein the first data repository is an Structured Query Language (SQL) based data repository and the second data repository is a non-SQL-based data repository.
15. The data retrieval system of claim 1, wherein the partitions include at least one real partition, being a real partition of the corresponding table, defined in the database.
16. The data retrieval system of claim 1, wherein the partitions include at least one virtual partition, being a virtual partition of the corresponding table, that does not exist in the database.
17. The data retrieval system of claim 1, wherein the cache further comprises one or more past source relational queries and wherein the processing unit is further configured to: identify at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and
manipulate, before the translate, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
18. The data retrieval system of claim 17, wherein upon identification of a plurality of past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion with the second structure of the required data, the processing unit is further configured to calculate a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest.
19. A query manipulation system, comprising a processing unit configured to:
provide a cache configured to store one or more past source relational queries; obtain a source relational query for retrieving required data from at least one table, the source relational query defining the required data;
identify at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and
manipulate the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
20. The query manipulation system of claim 20, wherein upon identification of a plurality of past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion with the second structure of the required data, the processing unit is further configured to calculate a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest.
21. A method of data retrieval, the method comprising:
providing a cache configured to store one or more cached relational queries, corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof; obtaining, by a processing resource: (a) metadata describing structures of one or more tables within a database, a division of each of the tables to one or more partitions, and, for each partition, a last update timestamp indicative of a last update time thereof; and (b) a source relational query for retrieving required data from at least one of the tables, the source relational query defining the required data;
translating, by the processing resource, using the metadata, the source relational query into one or more partition queries, each of the partition queries designed to retrieve at least part of the required data from a single corresponding partition of the partitions;
identifying, by the processing resource, one or more of the partition queries (a) that do not meet a similarity criterion with any of the cached relational queries or (b) that meet the similarity criterion with at least one of the cached relational queries and for which the cached results of the cached relational queries do not comprise fresh results of the corresponding partition query, wherein fresh results are results associated with a given cache timestamp of the cache timestamps that meets a freshness criterion, giving rise to non-fresh-cache partition queries list; executing, by the processing resource, each of the partition queries comprised within the non-fresh-cache partition queries list, for retrieving non-cached or non-fresh cached partition results;
retrieving, by the processing resource, cached partition results from the cache, the cached partition results being the fresh results of the cached relational queries of the corresponding partition queries not comprised within the non-fresh-cache partition queries list, if any; and
merging, by the processing resource, the cached partition results with the non- fresh-cache partition results, if any, giving rise to the required data.
22. The method of data retrieval of claim 21, further comprising inserting the non-fresh-cache partition results and their respective partition queries into the cache and updating their respective cache timestamp to indicative of the retrieval time thereof.
23. The method of data retrieval of claim 21, wherein the source relational query further comprises rules for reducing the required data and wherein the method further comprising reducing the merged results in accordance with the rules, for reducing the required data.
24. The method of data retrieval of claim 21, wherein the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp.
25. The method of data retrieval of claim 21, wherein the freshness criterion is that the cache timestamp is equal to, or later than, the last update timestamp plus a tolerance time.
26. The method of data retrieval of claim 21, wherein the similarity criterion is identity between the partition query and one of the cached relational queries.
27. The method of data retrieval of claim 26, further comprising manipulating query texts of the partition query and of the cached relational queries, before the identifying, so that the query texts have a common comparable structure.
28. The method of data retrieval of claim 27, wherein the manipulating includes removing unnecessary spaces and brackets.
29. The method of data retrieval of claim 21, wherein the similarity criterion is a given percentage of similarity between one or more attributes of the partition query and of one of the cached relational queries.
30. The method of data retrieval of claim 29, wherein the attributes include column names, table names, query parameters, function names or function parameters.
31. The method of data retrieval of claim 30, further comprising providing a similarity exception list, defining at least one pair of distinct function names to be regarded as similar, and wherein upon a first attribute of the attributes corresponds to a first distinct function name of the distinct function names and a second attribute of the attributes corresponds to a second distinct function name of the distinct function names, the first attribute and the second attribute are considered similar.
32. The method of data retrieval of claim 21, wherein upon identification of a given partition query that meet the similarity criterion with more than one of the cached relational queries, the method further comprises calculating a relevance score for each of the cached relational queries that meet the similarity criterion with the given partition query, and wherein the retrieving includes retrieving only the cached partition results of the cached relational query having the relevance score that is highest.
33. The method of data retrieval of claim 21, wherein at least a first partition query of the partition queries is executed on a first data repository and at least a second partition query of the partition queries is executed on a second data repository.
34. The method of data retrieval of claim 33, wherein the first data repository is an Structured Query Language (SQL) based data repository and the second data repository is a non-SQL-based data repository.
35. The method of data retrieval of claim 21, wherein the partitions include at least one real partition, being a real partition of the corresponding table, defined in the database.
36. The method of data retrieval of claim 21, wherein the partitions include at least one virtual partition, being a virtual partition of the corresponding table, that does not exist in the database.
37. The method of data retrieval of claim 21, wherein the cache further comprises one or more past source relational queries and wherein the method further comprises:
identifying at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and
manipulating, before the translating, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
38. The method of data retrieval of claim 37, wherein upon identification of a plurality of past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion with the second structure of the required data, the method further comprises calculating a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest.
39. A method of query manipulation, the method comprising: providing, by a processing resource, a cache configured to store one or more past source relational queries;
obtaining, by the processing resource, a source relational query for retrieving required data from at least one table, the source relational query defining the required data;
identifying, by the processing resource, at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and
manipulating, by the processing resource, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries.
40. The method of query manipulation of claim 39, wherein upon identification of a plurality of past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion with the second structure of the required data, the method further comprises calculating a relevance score for each of the past source relational queries defining the first structures of the previously retrieved data that meet the similarity criterion, wherein the additional data is associated with the past source relational query having the relevance score that is highest, and wherein the portion is not defined by the first structure of the past source relational query having the relevance score that is highest.
41. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor of a computer to perform a method of data retrieval comprising:
providing a cache configured to store one or more cached relational queries, corresponding results thereof, and for each of the cached relational queries a corresponding cache timestamp indicative of a corresponding execution time thereof; obtaining, by a processing resource: (a) metadata describing structures of one or more tables within a database, a division of each of the tables to one or more partitions, and, for each partition, a last update timestamp indicative of a last update time thereof; and (b) a source relational query for retrieving required data from at least one of the tables, the source relational query defining the required data;
translating, by the processing resource, using the metadata, the source relational query into one or more partition queries, each of the partition queries designed to retrieve at least part of the required data from a single corresponding partition of the partitions;
identifying, by the processing resource, one or more of the partition queries (a) that do not meet a similarity criterion with any of the cached relational queries or (b) that meet the similarity criterion with at least one of the cached relational queries and for which the cached results of the cached relational queries do not comprise fresh results of the corresponding partition query, wherein fresh results are results associated with a given cache timestamp of the cache timestamps that meets a freshness criterion, giving rise to non-fresh-cache partition queries list;
executing, by the processing resource, each of the partition queries comprised within the non-fresh-cache partition queries list, for retrieving non-cached or non-fresh cached partition results;
retrieving, by the processing resource, cached partition results from the cache, the cached partition results being the fresh results of the cached relational queries of the corresponding partition queries not comprised within the non-fresh-cache partition queries list, if any; and
merging, by the processing resource, the cached partition results with the non- fresh-cache partition results, if any, giving rise to the required data.
42. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code, executable by at least one processor of a computer to perform a method of data retrieval comprising:
providing, by a processing resource, a cache configured to store one or more past source relational queries; obtaining, by the processing resource, a source relational query for retrieving required data from at least one table, the source relational query defining the required data;
identifying, by the processing resource, at least one of the past source relational queries defining first structures of previously retrieved data that meet a similarity criterion with a second structure of the required data; and
manipulating, by the processing resource, the source relational query to retrieve (a) additional data, other than the required data, wherein the additional data is associated with the identified past source relational queries and not associated with the source relational query, or (b) lesser data, not including at least a portion of the required data, wherein the portion is not defined by the first structure of the previously retrieved data of the identified past relational queries
PCT/IL2018/051123 2017-10-25 2018-10-21 A system and method for data retrieval WO2019082177A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762576982P 2017-10-25 2017-10-25
US62/576,982 2017-10-25

Publications (1)

Publication Number Publication Date
WO2019082177A1 true WO2019082177A1 (en) 2019-05-02

Family

ID=66247195

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2018/051123 WO2019082177A1 (en) 2017-10-25 2018-10-21 A system and method for data retrieval

Country Status (1)

Country Link
WO (1) WO2019082177A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021145960A1 (en) * 2020-01-15 2021-07-22 Sigma Computing, Inc. Dashboard loading using a filtering query from a cloud-based data warehouse cache
WO2021146045A1 (en) * 2020-01-15 2021-07-22 Sigma Computing, Inc. Dashboard loading from a cloud-based data warehouse cache

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060200438A1 (en) * 2005-03-02 2006-09-07 Red Hat, Inc. System and method for retrieving data from a relational database management system
US20070027904A1 (en) * 2005-06-24 2007-02-01 George Chow System and method for translating between relational database queries and multidimensional database queries
US7302425B1 (en) * 2003-06-09 2007-11-27 Microsoft Corporation Distributed pre-cached query results and refresh method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7302425B1 (en) * 2003-06-09 2007-11-27 Microsoft Corporation Distributed pre-cached query results and refresh method
US20060200438A1 (en) * 2005-03-02 2006-09-07 Red Hat, Inc. System and method for retrieving data from a relational database management system
US20070027904A1 (en) * 2005-06-24 2007-02-01 George Chow System and method for translating between relational database queries and multidimensional database queries

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021145960A1 (en) * 2020-01-15 2021-07-22 Sigma Computing, Inc. Dashboard loading using a filtering query from a cloud-based data warehouse cache
WO2021146045A1 (en) * 2020-01-15 2021-07-22 Sigma Computing, Inc. Dashboard loading from a cloud-based data warehouse cache
US11593375B2 (en) 2020-01-15 2023-02-28 Sigma Computing, Inc. Dashboard loading from a cloud-based data warehouse cache
US11860873B2 (en) 2020-01-15 2024-01-02 Sigma Computing, Inc. Dashboard loading using a filtering query from a cloud-based data warehouse cache
US11868351B1 (en) 2020-01-15 2024-01-09 Sigma Computing, Inc. Dashboard loading from a cloud-based data warehouse cache

Similar Documents

Publication Publication Date Title
US10915528B2 (en) Pluggable storage system for parallel query engines
US11068439B2 (en) Unsupervised method for enriching RDF data sources from denormalized data
US10346432B2 (en) Compaction policy
CN106030573B (en) Implementation of semi-structured data as first-level database element
RU2663358C2 (en) Clustering storage method and device
AU2017243870B2 (en) "Methods and systems for database optimisation"
US8924373B2 (en) Query plans with parameter markers in place of object identifiers
US10706022B2 (en) Space-efficient secondary indexing on distributed data stores
US11726959B2 (en) Data pruning based on metadata
US11907251B2 (en) Method and system for implementing distributed lobs
US10311093B2 (en) Entity resolution from documents
US10831709B2 (en) Pluggable storage system for parallel query engines across non-native file systems
WO2019082177A1 (en) A system and method for data retrieval
CN103365987A (en) Clustered database system and data processing method based on shared-disk framework
US9201888B2 (en) File management apparatus, file management method, and file management system
CN107430633B (en) System and method for data storage and computer readable medium
US11720557B2 (en) System and method for on-demand search of a large dataset
US11386111B1 (en) Systems, devices, and methods for data analytics
US20170031909A1 (en) Locality-sensitive hashing for algebraic expressions
EP3436988B1 (en) "methods and systems for database optimisation"
US11232095B2 (en) Composite metadata objects for database systems
WO2017019889A1 (en) Maintaining performance in the presence of insertions, deletions, and streaming queries
US11567972B1 (en) Tree-based format for data storage
US11550793B1 (en) Systems and methods for spilling data for hash joins
WO2020124491A1 (en) Method and device for segmenting data, computer device, and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18871674

Country of ref document: EP

Kind code of ref document: A1