WO2022226435A1

WO2022226435A1 - Automatic configuration of semantic cache

Info

Publication number: WO2022226435A1
Application number: PCT/US2022/070044
Authority: WO
Inventors: Theodoros GKOUNTOUVAS; Jit GUPTA; Hui Lei; Hongliang Tang; Ningxiao TANG; Zhihao Tang; Yong Wang; Ning Wu; Heng Xu
Original assignee: Futurewei Technologies, Inc.
Priority date: 2022-01-05
Filing date: 2022-01-05
Publication date: 2022-10-27

Abstract

An efficient architecture and methodology are provided for automatic cache configuration. A protocol and a design for operating distributed compute engines can be implemented using a candidate selector, a predictor, and a planner for decision making to automatically deduce new contents for a cache. The candidate selector and the predictor can be arranged to provide input to the planner. The candidate selector generates candidates for data insertion into the cache using a current plan and the predictor generates one or more future plans using the current plan. The planner generates new contents for the cache based on the selected candidates from the candidate selector, the one or more future plans from the predictor, and a current configuration of the cache. The planner can also use previous plans that have been archived to provide the decision on the configuration for the cache, including the contents of the cache.

Description

AUTOMATIC CONFIGURATION OF SEMANTIC CACHE

TECHNICAL FIELD

[0001] The present disclosure is related to data analysis, and in particular to methods and apparatus for automatic configuration of a semantic cache for distributed compute engines.

BACKGROUND

[0002] Data analysis jobs typically utilize large data sets with millions or billions of records. However, the results may depend on a very small fraction of the original records. For example, a dataset formed of records taken from sensors deployed in houses over a large area at a frequent rate can be extremely large. However, a particular job’s result might be impacted only by a small portion of these records. A typical example (1) of such a job can be given by the following: load("data/*.csv"). filter(x => x. temp =' hot'). map(... ). .. (1) where csv (comma separated values) refers to a delimited text file that uses a comma to separate values. This job loads data from a group of CSV files. Then, it filters out, with respect to the associated operation within the parentheses, all records that do not have a hot temperature. Hot temperature can be defined, in a non-limiting example, as being greater than or equal to 100 °F. Then, only records in areas with a warm climate during warm periods would affect the final count.

[0003] Distributed compute engines for big data workloads adopt a naive approach for tackling workloads like the above load. The user provides a job in the form of a directed acyclic graph (DAG). A simple example of a DAG is provided above by load → filter → •••. Then, the distributed compute engines create an optimized execution plan from the user-provided DAG based on the available resources and workload characteristics. Using the above example, an execution plan can be constructed. Assuming that the compute engine divides the dataset into multiple different partitions (e.g., Pl, P2, P3, P4), a portion of the data inside each partition has records with hot temperatures, and the rest of the data belongs to distinct different temperature ranges such as a warm temperature range, a cool temperature range, and a cold temperature range. [0004] For each partition, a process loads the corresponding data, filters in only the portion with hot temperatures, and continues the ensuing operations. Although only a small portion of the original data affects the result, this strategy leads to storage input/output (I/O) overhead analogous to the size of the initial dataset and network I/O overhead between a compute side and a storage side of the distributed compute engines. The compute side of the distributed compute engine system is where the data is processed, while the storage side of the distributed compute engine system is where the data is stored. The storage side can be analogous to the size of the original dataset. The data is typically stored in apparatus such as a hard disk drive (HDD) or a solid-state drive (SSD). Usage of computation and memory resources, such as dynamic random-access memory (DRAM), for the compute engine cluster analogous to the size of the initial dataset. The above strategy also leads to large execution times.

[0005] Multiple techniques have been proposed in the literature trying to solve this issue. The common theme among these techniques is that they sacrifice some additional memory (DRAM, primary) or disk (HDD/SSD, secondary) space for better overall performance. A cache can be used to store this extra data, since it does not need to be replicated or persisted for fault-tolerance. If the contents are not found in the cache, compute engines can revert to the baseline strategy, which is inefficient but correct.

SUMMARY

[0006] It is an object of various embodiments to provide an efficient architecture and methodology to automatically configure a cache. A protocol and a design for operating distributed compute engines can be implemented using a candidate selector, a predictor, and a planner for decision making to automatically deduce new contents for a cache. The candidate selector and the predictor can be arranged to provide input to the planner. The candidate selector generates candidates for data insertion into the cache using a current plan and the predictor generates one or more future plans using the current plan. The planner generates new contents for the cache based on the selected candidates from the candidate selector, the one or more future plans from the predictor, and a current configuration of the cache. The planner can be implemented with a score function and can also use previous plans that have been archived to provide the decision on the configuration for the cache, including the contents of the cache. This technique provides for big data analytics and can work with multiple types of contents, be shared between multiple user contents, and can operate with a multi-layered cache. This technique can be utilized with size estimation and run time estimation.

[0007] According to a first aspect of the present disclosure, there is provided a computer-implemented method of automatic configuration of a cache. The computer-implemented method comprises receiving a current plan of data operations and selecting candidates for data insertion into the cache using the current plan. The computer-implemented method further comprises predicting one or more future plans using the current plan and planning insertion of contents into the cache or eviction of contents from the cache using the selected candidates, the one or more future plans, and current configuration of the cache, where the current configuration include current contents of the cache. New contents for the cache are provided from the planning.

[0008] In a first implementation form of the computer-implemented method of automatic configuration of a cache according to the first aspect as such, the computer-implemented method includes selecting candidates to include: determining that a cost of the current plan with use of the candidate included into the current plan is less than or equal to a cost of the current plan; and determining that a cost of storing the candidate in the cache plus the cost of the current plan with use of the candidate included into the current plan is less than a threshold for limiting the cost of the current plan.

[0009] In a second implementation form of the computer-implemented method of automatic configuration of a cache according to the first aspect as such or any preceding implementation form of the first aspect, the computer-implemented method includes predicting one or more future plans to include predicting a sequence of plans.

[0010] In a third implementation form of the computer-implemented method of automatic configuration of a cache according to the first aspect as such or any preceding implementation form of the first aspect, the computer-implemented method includes predicting one or more future plans to include: obtaining previous executed plans from an archive of executed plans; and with the previous executed plans having an order, reversing the order of the previous executed plans up to a specific window size and outputting the reversed ordered previous executed plans as predicted one or more future plans.

[0011 J In a fourth implementation form of the computer-implemented method of automatic configuration of a cache according to the first aspect as such or any preceding implementation form of the first aspect, the computer-implemented method includes planning insertion of contents into the cache or eviction of contents from the cache to include: calculating a score for each selected candidate; eliminating selected candidates having a score less than or equal to zero; and, in response to eliminating all selected candidates based on the scores of the selected candidates, maintaining the current contents as the new contents for provision to the cache.

[0012] In a fifth implementation form of the computer-implemented method of automatic configuration of a cache according to the first aspect as such or any preceding implementation form of the first aspect, the computer-implemented method includes planning insertion of contents into the cache or eviction of contents from the cache to include: calculating a score for each selected candidate; maintaining, for further evaluation, selected candidates having a score greater than zero; picking, as a top candidate, a selected candidate having a maximum score from the scores calculated for the selected candidates; and assigning the current contents plus the top candidate as the new contents.

[0013] In a sixth implementation form of the computer-implemented method of automatic configuration of a cache according to the first aspect as such or any preceding implementation form of the first aspect, the computer-implemented method includes calculating the score for each selected candidate to include: calculating a score for the one or more future plans with the current contents plus the selected candidate; calculating a score for the one or more future plans with the current contents; and setting the score for the selected candidate as a difference between the score for the one or more future plans with the current contents plus the selected candidate and the score for the one or more future plans with the current contents.

[0014] In a seventh implementation form of the computer-implemented method of automatic configuration of a cache according to the first aspect as such or any preceding implementation form of the first aspect, the computer- implemented method includes planning insertion of contents into the cache or eviction of contents from the cache to include, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; and picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; determining first new contents by removing the evicted content from the new contents; and re-assigning the first new contents as the new contents for provision to the cache.

[0015] In an eighth implementation form of the computer-implemented method of automatic configuration of a cache according to the first aspect as such or any preceding implementation form of the first aspect, the computer- implemented method includes planning insertion of contents into the cache or eviction of contents from the cache to include, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; and, in response to the evicted content being the top candidate, maintaining the current contents as the new contents for provision to the cache.

[0016] In a ninth implementation form of the computer-implemented method of automatic configuration of a cache according to the first aspect as such or any preceding implementation form of the first aspect, the computer-implemented method includes calculating the size score for each data content to include: calculating a size score for the one or more fiiture plans with the new contents; calculating a size score for the one or more fiiture plans with the new contents minus the current contents; and setting the size score for the data content as a difference between the size score for the one or more future plans with the new contents and the size score for the one or more fiiture plans with the new contents minus the current contents.

[0017] In a tenth implementation form of the computer-implemented method of automatic configuration of a cache according to the first aspect as such or any preceding implementation form of the first aspect, the computer-implemented method includes planning insertion of contents into the cache or eviction of contents from the cache to include generating one or more scores for candidates or current content using a scoring function using one or more of a reference count of content, an inverse reference distance, a size estimation of storage input/output bytes loaded from storage, or a time estimation of execution time. [0018] In an eleventh implementation form of the computer-implemented method of automatic configuration of a cache according to the first aspect as such or any preceding implementation form of the first aspect, the computer- implemented method includes, with the cache being a multi-layer cache, planning insertion of contents into the cache or eviction of contents from the cache is performed iteratively from atop layer to a bottom layer using an unpicked top candidate from the selected candidates or evicted contents for planning in a next layer.

[0019] According to a second aspect of the present disclosure, there is provided a system having an automatic configuration of a cache, the system comprising a memory storing instructions and one or more processors in communication with the memory, where the one or more processors execute the instractions to perform operations. The operations comprise: receiving a current plan of data operations and selecting candidates for data insertion into the cache using the current plan. The operations include predicting one or more future plans using the current plan; planning insertion of contents into the cache or eviction of contents from the cache using the selected candidates, the one or more future plans, and current configuration of the cache; and providing new contents for the cache from the planning. The current configuration includes current contents of the cache.

[0020] In a first implementation form of the system having an automatic configuration of a cache according to the second aspect as such, the operations selecting candidates include: determining that a cost of the current plan with use of the candidate included into the current plan is less than or equal to a cost of the current plan; and determining that a cost of storing the candidate in the cache plus the cost of the current plan with use of the candidate included into the current plan is less than a threshold for limiting the cost of the current plan. [0021] In a second implementation form of the system having an automatic configuration of a cache according to the second aspect as such or any preceding implementation form of the second aspect, the operations predicting one or more future plans include predicting a sequence of plans.

[0022] In a third implementation form of the system having an automatic configuration of a cache according to the second aspect as such or any preceding implementation form of the second aspect, the operations predicting one or more future plans includes: obtaining previous executed plans from an archive of executed plans; and, with the previous executed plans having an order, reversing the order of the previous executed plans up to a specific window size and outputting the reversed ordered previous executed plans as predicted one or more future plans.

[0023] In a fourth implementation form of the system having an automatic configuration of a cache according to the second aspect as such or any preceding implementation form of the second aspect, the operations planning insertion of contents into the cache or eviction of contents from the cache include: calculating a score for each selected candidate; eliminating selected candidates having a score less than or equal to zero; and in response to eliminating all selected candidates based on the scores of the selected candidates, maintaining the current contents as the new contents for provision to the cache.

[0024] In a fifth implementation form of the system having an automatic configuration of a cache according to the second aspect as such or any preceding implementation form of the second aspect, the operations planning insertion of contents into the cache or eviction of contents from the cache include: calculating a score for each selected candidate; maintaining, for further evaluation, selected candidates having a score greater than zero; picking, as a top candidate, a selected candidate having a maximum score from the scores calculated for the selected candidates; and assigning the current contents plus the top candidate as the new contents.

[0025] In a sixth implementation form of the system having an automatic configuration of a cache according to the second aspect as such or any preceding implementation form of the second aspect, the operations calculating the score for each selected candidate include: calculating a score for the one or more future plans with the current contents plus the selected candidate; calculating a score for the one or more future plans with the current contents; and setting the score for the selected candidate as a difference between the score for the one or more future plans with the current contents plus the selected candidate and the score for the one or more future plans with the current contents.

[0026] In a seventh implementation form of the system having an automatic configuration of a cache according to the second aspect as such or any preceding implementation form of the second aspect, the operations planning insertion of contents into the cache or eviction of contents from the cache include, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; determining first new contents by removing the evicted content from the new contents; and re-assigning the first new contents as the new contents for provision to the cache.

[0027] In an eighth implementation form of the system having an automatic configuration of a cache according to the second aspect as such or any preceding implementation form of the second aspect, the operations planning insertion of contents into the cache or eviction of contents from the cache include, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; and, in response to the evicted content being the top candidate, maintaining the current contents as the new contents for provision to the cache.

[0028] In a ninth implementation form of the system having an automatic configuration of a cache according to the second aspect as such or any preceding implementation form of the second aspect, the operations calculating the size score for each data content include: calculating a size score for the one or more future plans with the new contents; calculating a size score for the one or more future plans with the new contents minus the current contents; and setting the size score for the data content as a difference between the size score for the one or more future plans with the new contents and the size score for the one or more future plans with the new contents minus the current contents.

[0029] In a tenth implementation form of the system having an automatic configuration of a cache according to the second aspect as such or any preceding implementation form of the second aspect, the operations planning insertion of contents into the cache or eviction of contents from the cache include generating one or more scores for candidates or current content using a scoring function using one or more of a reference count of content, an inverse reference distance, a size estimation of storage input/output bytes loaded from storage, or a time estimation of execution time.

[0030] In an eleventh implementation form of the system having an automatic configuration of a cache according to the second aspect as such or any preceding implementation form of the second aspect, the operations, with the cache being a multi-layer cache, planning insertion of contents into the cache or eviction of contents from the cache is performed iteratively from a top layer to a bottom layer using an unpicked top candidate from the selected candidates or evicted contents for planning in a next layer.

[0031] According to a third aspect of the present disclosure, there is provided a non-transitory computer-readable medium storing instructions for automatic configuration of a cache, which, when executed by one or more processors, cause the one or more processors to perform operations. The operations comprise receiving a current plan of data operations; selecting candidates for data insertion into the cache using the current plan; predicting one or more foture plans using the current plan; planning insertion of contents into the cache or eviction of contents from the cache using the selected candidates, the one or more foture plans, and current configuration of the cache; and providing new contents for the cache from the planning. The current configuration includes current contents of the cache.

[0032] In a first implementation form of the non-transitory computer-readable medium according to the third aspect as such, the operations selecting candidates include: determining that a cost of the current plan with use of the candidate included into the current plan is less than or equal to a cost of the current plan; and determining that a cost of storing the candidate in the cache plus the cost of the current plan with use of the candidate included into the current plan is less than a threshold for limiting the cost of the current plan.

[0033] In a second implementation form of the non-transitory computer- readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations predicting one or more foture plans include predicting a sequence of plans.

[0034] In a third implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations predicting one or more future plans include: obtaining previous executed plans from an archive of executed plans; and, with the previous executed plans having an order, reversing the order of the previous executed plans up to a specific window size and outputting the reversed ordered previous executed plans as predicted one or more future plans.

[0035] In a fourth implementation form of the non-transitory computer- readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations planning insertion of contents into the cache or eviction of contents from the cache include: calculating a score for each selected candidate; eliminating selected candidates having a score less than or equal to zero; and in response to eliminating all selected candidates based on the scores of the selected candidates, maintaining the current contents as the new contents for provision to the cache.

[0036] In a fifth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations planning insertion of contents into the cache or eviction of contents from the cache include: calculating a score for each selected candidate; maintaining, for further evaluation, selected candidates having a score greater than zero; picking, as atop candidate, a selected candidate having a maximum score from the scores calculated for the selected candidates; and assigning the current contents plus the top candidate as the new contents.

[0037] In a sixth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations calculating the score for each selected candidate include: calculating a score for the one or more future plans with the current contents plus the selected candidate; calculating a score for the one or more future plans with the current contents; and setting the score for the selected candidate as a difference between the score for the one or more future plans with the current contents plus the selected candidate and the score for the one or more future plans with the current contents.

[0038] In a seventh implementation form of the non-transitory computer- readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations planning insertion of contents into the cache or eviction of contents from the cache includes, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; determining first new contents by removing the evicted content from the new contents; and re-assigning the first new contents as the new contents for provision to the cache.

[0039] In an eighth implementation form of the non-transitory computer- readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations planning insertion of contents into the cache or eviction of contents from the cache include, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; and, in response to the evicted content being the top candidate, maintaining the current contents as the new contents for provision to the cache.

[0040] In a ninth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations calculating the size score for each data content include: calculating a size score for the one or more future plans with the new contents; calculating a size score for the one or more future plans with the new contents minus the current contents; and setting the size score for the data content as a difference between the size score for the one or more future plans with the new contents and the size score for the one or more future plans with the new contents minus the current contents.

[0041] In a tenth implementation form of the non-transitory computer-readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations planning insertion of contents into the cache or eviction of contents from the cache include generating one or more scores for candidates or current content using a scoring function using one or more of a reference count of content, an inverse reference distance, a size estimation of storage input/output bytes loaded from storage, or a time estimation of execution time.

[0042] In an eleventh implementation form of the non-transitory computer- readable medium according to the third aspect as such or any preceding implementation form of the third aspect, the operations include, with the cache being a multi-layer cache, planning insertion of contents into the cache or eviction of contents from the cache being performed iteratively from a top layer to a bottom layer using an unpicked top candidate from the selected candidates or evicted contents for planning in a next layer.

[0043] Any one of the foregoing examples may be combined with any one or more of the other foregoing examples to create a new embodiment in accordance with the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

[0044] The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.

[0045] Figure 1 illustrates operation of a distributed compute engine having a driver and worker compute engines, associated with various embodiments.

[0046] Figure 2 illustrates a technique of adaptive partitioning, associated with various embodiments.

[0047] Figure 3 illustrates a technique of data-skipping, associated with various embodiments.

[0048] Figure 4 illustrates a technique of intermediate data caching, associated with various embodiments.

[0049] Figure 5 illustrates an example of least recently used technique, associated with various embodiments.

[0050] Figure 6 illustrates an example of most reference distance, associated with various embodiments.

[0051] Figure 7 is a representation of an operational architecture having three different modules arranged to automatically deduce new contents for a cache, according to various embodiments.

[0052] Figure 8 is a flow diagram of features of an example computer- implemented method of automatic configuration of a cache, according to various embodiments.

[0053] Figure 9 is a block diagram illustrating components of a system that implements algorithms and performs methods structured to perform automatic configuration of a cache, according to various embodiments. DETAILED DESCRIPTION

[0054] In the following description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments which may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the embodiments, and it is to be understood that other embodiments may be utilized, and that structural, logical, mechanical, and electrical changes may be made. The following description of example embodiments is, therefore, not to be taken in a limited sense.

[0055] The functions or algorithms described herein may be implemented in software in an embodiment. The software may comprise computer executable instructions stored on computer readable media or computer readable storage device such as one or more non-transitory memories or other type of hardware- based storage devices, either local or networked. Further, such fimctions correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple fimctions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on a digital signal processor, application-specific integrated circuit (ASIC), a microprocessor, or other type of processor operating on a computer system, such as a personal computer, server, or other computer system, turning such computer system into a specifically programmed machine. [0056] Computer-readable non-transitory media includes all types of computer readable media, including magnetic storage media, optical storage media, and solid-state storage media and specifically excludes signals. The software can be installed in and sold with the devices that handle automatic configuration of a semantic cache for distributed compute engines as taught herein. Alternatively, the software can be obtained and loaded into such devices, including obtaining the software via a disc medium or from any manner of network or distribution system, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator. The software can be stored on a server for distribution over the Internet, for example.

[0057] Figure 1 illustrates operation of a distributed compute engine 100 having a driver 105 and workers 110-0, 110-1 . . . 110-N. A driver is a driver module that, among other things, accepts tasks, controls distribution of the execution of the tasks, and reports status or results to the source of the accepted tasks, and a worker is a worker compute engine that can be implemented as a worker module that computes one or more tasks assigned to it. Distributed compute engine 100 accepts a job from a user 101 in the form of a DAG 102. Driver 105 is responsible for dividing the job into multiple tasks, which are executed according to a plan. A plan is a tree of operators that determines a partial order of execution for a query/task. Examples of operators include, but are not limited to, a join operator, a filter operator, a where operator, a group by clause operator, or other operator that can be applied to data in a file or other location such as a table. The order is partial because execution of one or more operators, such as binary operators, is such that in a stage of the plan being analyzed it is not significant if a left branch or a right branch of the tree is going to be executed first or simultaneously. A partial order can include an instance of the complete order. A plan can be a logical plan, a physical plan, or combination thereof. A logical plan represents expected output after applying a given series of transformations, while a physical plan has control over decisions regarding the type of operations and sequence of execution of these operations.

[0058] Driver 105 sends the tasks to workers 110-0, 110-1 . . . 110-N for execution. At operation (1), user 101 sends a job request to driver 105 in the form of DAG 102. For example, DAG 102 provides load 150, filter 160, map 180, store 190, and count 170 for the previously mentioned temperature example (1). Filter 160 filters all loaded records that do not have a hot temperature greater than 100 °F. Only records in areas with a warm climate during warm periods would affect the final count 170. Map 180 is a map transformation that can be applied to facilitate storage of the data at store 190. At operation (2), driver 105 creates an optimized execution DAG fiom input DAG 102 and divides it into multiple tasks. Each task can be executed in a single worker of workers 110-0, 110-1 . . . 110-N. The tasks can be sent to workers 110-0, 110-1 . . . 110-N for execution in a specific order. At operation (3), workers 110-0, 110-1 . . . 110-N load initial data fiom respective storages 115-0, 115-1 . . . 115- N, process it, and store the output back to storages 115-0, 115-1 . . . 115-N, if necessary. At operation (4), workers 110-0, 110-1 . . . 110-N send acknowledgement or results related to the execution of the corresponding task back to driver 105. At operation (5), driver 105 sends acknowledgement or results related to the whole execution of the job back to user 101 , when all tasks are finished.

[0059J Figure 2 illustrates a technique of adaptive partitioning of data. Adaptive partitioning is the process in which data is dynamically restructured based on workload characteristics. For example, if there are many jobs which have a filter operation involving the temperature attribute value, like the one in the above example, it may be beneficial to sort and split the data between partitions based on the temperature attribute value. In an example corresponding to example (1), the initial dataset can be arranged in partitions Pl, P2, P3, and P4, where the partitions can include records corresponding to four temperature ranges: records with hot temperatures 221, records with warm temperatures 222, records with cool temperatures 223, and records with cold temperatures 224. The initial dataset is resorted, according to the filter, and new partitions (Pl’, P2’, P3 ’, P4’) correspond to unique, distinct temperature ranges. Of the four types of temperature related records, partition Pl’ has only records with hot temperatures 221 , partition P2’ has only records with warm temperatures 222, partition P3’ has only records with cool temperatures 223, and partition P4’ has only records with cold temperatures 224. These new^r partitions can be stored using extra memory or storage space depending on the size and hardware of the extra memory or storage space. When a job, like the above example (1), is executed, the compute engine loads only the Pl’ partition, which is the only one that contains records with hot temperature values 221. This is partition-pruning. The filter operation, from the example of Figure 1 , is still executed on the compute side since in the loaded partitions there can still be records that are eventually filtered out. As it can be seen in Figure 2, a smaller portion of the data is loaded from storage and transferred through the network (path 256), and the compute engine utilizes fewer tasks with fewer total compute and memory resources. Operations on paths 256, 257, and 258 are not performed. As a result, the execution in most cases is going to be considerably fester. However, the extra overhead from repartitioning data is considerable in terms of extra space and computation.

[0060] Adaptive partitioning provides a change to the partitioning scheme of the underlying data. It dynamically adopts new partition schemes according to the executed workload needs. New partitions are stored in a cache to re-use them for future queries. An application programming interface (API) can be used for adaptive partitioning. An API is a shared boundary for which two or more separate components of a system exchange information, where the information defines interactions between the two or more separate components. An API can include multiple software applications or mixed hardware-software intermediaries. An example API adaptive partitioning can be defined as repartition(source path, attribute, output path, tier).

[0061] Figure 3 illustrates a technique of data-skipping of a data partition. In data-skipping, metadata is maintained for secondary attributes per partition. Secondary attributes are attributes other than main partition attributes. The metadata is used to skip partitions, during the execution of specific query' or task, that can safely eliminate all the elements of skipped partitions and still obtain the same result. The metadata used can be cached to reuse for the execution of future queries. An API can be used with the API defined as dataSkippingMetadata (source path, attribute, output path, tier).

[0062] In the data-skipping technique, the storage side or the compute side maintains information about secondary attribute values per partition. A common data structure used includes a minimum (min) or maximum (max) value for each partition of a numeric attribute. Another common data structure used includes lists with all the values inside a single partition of a category- attribute. Bloom filters can be used for a data structure for values inside a single partition of an attribute. A partition can be pruned if data-skipping information ensures that eliminating the partition does not lead to a different result. In the above temperature example (1), consider partitions Pl, P2, P3, and P4 of the example of Figure 2, in which, in this non-limiting example, partition Pl has records with hot temperatures 221 and records with warm temperatures 222, partition P2 has records with hot temperatures 221, records with cool temperatures 223, and records with cold temperatures 224, partition P3 has records with warm temperatures 222 and records with cold temperatures 224, and partition P4 has records with cool temperatures 223 and records with cold temperatures 224. In this example, all records with hot temperatures 221 are in partitions Pl and P2. Using a list for data-skipping metadata, a compute engine can safely detect that there are no records with hot temperatures 221 in partitions P3 and P4 and thus, the compute engine can prune partitions P3 and P4. As it can be seen in Figure 3, data is loaded from storage and transferred through the network (paths 256 and 257), while subsequent operations on paths 257 and 258 are not performed. [0063] Figure 4 illustrates a technique of intermediate data caching of partition data. In an example corresponding to example (1), the initial dataset can be arranged in partitions Pl, P2, P3, and P4 as in the example of Figure 2, where the partitions can include records corresponding to four temperature ranges: records with hot temperatures 221, records with warm temperatures 222, records with cool temperatures 223, and records with cold temperatures 224. Filter 160 is applied to the initial dataset in partitions P1, P2, P3, and P4, resulting in records with hot temperatures 221 remaining from each partition. Records with hot temperatures 221 are cached in memory or storage using map 180 according to paths 256, 257, 258, and 259 associated with partitions Pl, P2, P3, and P4, respectively. If the job includes further processing, the rest of the job is then executed. When a subsequent job with the same filter operation, like the above example that filters in hot temperature records, is executed, the associated compute engine can directly load the cached data without executing the filter operation, that is, it can skip the filter operation. The compute engine can store the result of an operation and then if the same or similar job is executed, these results can be reused.

[0064] Intermediate data caching provides for caching intermediate results after processing the initial data, such that any time an ensuing query or task runs, if there is no loss of information by loading the intermediate results, only the cached data is loaded. In this technique, some operations may be eliminated. An API can be used with the API defined as intermediateData(input DAG, output path, tier).

[0065] There are policies for eviction from a cache developed from previous approaches that can be used in various embodiments as taught herein. These eviction policies typically consider only intermediate data caching and not the other types of content. Every result of an operation would be considered as a candidate for a cache insertion. One technique is least recently used (LRC) and another is most reference distance (MRD).

[0066] Figure 5 illustrates an example of LRC. LRC examines a DAG and counts the references of each node. The nodes with the largest reference count (RC) remain in the cache while the ones with the lowest are evicted. For example, in Figure 5, there are two different plans to be executed, one including a count 570 and one including a join 575. Count 570 counts the records after a filter operation 560. Join 575 joins data fiom two different tables, load table zero 562 and load table one 566, after some processing. Though project 564 and load table one 566 are not used in the plan having count 570, the results of the filter operation 560 are to be referenced in both plans. Thus, these results are an ideal candidate for caching according to the LRC method. The method does not cache anything with reference count equal to one since it is not going to be used again in the known horizon of executed plans. In general, LRC considers the results of every operation as potential candidates for cache insertion and evicts the cache contents with the least reference count when necessary, while not caching results with RC=1 that are not used more than once in the expected new future.

[0067] Figure 6 illustrates an example of MRD. MRD examines a DAG and counts the distance of each node. The method considers as candidates any results that are referenced by upcoming operations, after the current plan is executed. MRD processing decides to keep in the cache the ones that are going to be used sooner, which is the least distance, as opposed to those that are referenced later in the plan, which is the most distance. For example, consider the example of Figure 5 above assuming that the count is executed first as the current plan. The future plan(s), in this case the plan with join 575 as a root node, references filter 560, which is a node, and load table zero 562, which is another node. Filter node 560 has a distance of two from the root node 575 of the next operation, while load table zero 566 has a distance of three. As a result, the filter node 560 is picked for caching. In general, MRD considers the results of operations that are referenced from future plans as potential candidates for cache insertion and evicts the cache contents with the most referenced distance according to the DAG when necessary.

[0068] In various embodiments, cache configuration can be automatically deduced to effectively optimize execution of different workloads for distributed compute engines. Cache configuration can include contents of the cache, available size, and other characteristics of the cache that affect its operation. The cache can be structured to be operable with a number of characteristics. The cache can be shared between users such that content is accessible by many users. The cache can be distributed such that cache space can be utilized from multiple servers. The cache can be multi-tiered to use multiple tiers to store data. The tiers can use a variety of storage apparatus, such as but not limited to, solid state memory and disk. Operation of the cache can be semantic-aware using knowledge about the contents and automatically using the knowledge to improve performance. The cache can be multi-modal by providing multiple interfaces such as, but not limited to, to the three types of interactions: re-partitioning, data- skipping metadata, and intermediate data caching. The cache can be platform- agnostic such that the cache and its operation do not depend or work on a specific platform.

[0069] Figure 7 is a representation of an operational architecture 700 having three different modules arranged to automatically deduce new contents for a cache. The three modules include a candidate selector 706, a predictor 707, and a planner 708. Figure 7 shows the manner in which the three modules interact with each other to determine a new cache configuration with respect to cache contents. Candidate Selector 706 picks a set of candidates for insertion in the cache. Candidates for insertion into a cache can include different types of data or metadata. For example, candidates can include three different types of data or metadata. One type is repartition of a leaf node in the plan that re-arranges the data to a number of partitions (typically the same number as the partitions of the source data) according to a primary partition attribute. This is source data rearranged. A second type is file skipping indices that maintain information (metadata) about the attribute values in each partition. For example, such indices can be minimum or maximum of a numeric attribute for each partition. For example, a first partition of an employee table contains ages between 23 and 47, while a second partition contains ages between 32 and 67. This information regarding age intervals is metadata that is created for the existing employee source data. A third type is intermediate data caching that keeps results of a subquery/subtask of an initial query'. For example, intermediate data caching can be keeping the results of a filter operation. This is data that is computed using source data and part of the plan, where the plan is executed to produce intermediate data.

[0070] Candidate selector 706 can be utilized to eliminate costly and ineffective candidates from the beginning. A simple parse of the plan can be included to eliminate obvious costly and ineffective candidates, which can allow for avoidance of some overhead. Predictor 707 makes a prediction about future incoming plans to be executed. Predictor 707 can use a current plan and potentially can use a plan archive 709 of previous plans that provides historic information. Planner 708 determines potential insertions or evictions to the cache. Planner 708 can use outcomes from candidate selector 706 and predictor 707 to make such determinations. Planner 708 can also use the existing cache configuration, including contents 704, to assist in making the determinations. The design of operational architecture 700 ensures that an automatic decision is made to update the configuration (contents) of a cache for distributed compute engines.

[0071] Candidate selector 706 can be structured to meet at least two criteria. First, candidate selector 706 should propose candidates that reduce cost. These candidates, when cached, potentially optimize the current plan and do not create extra overheads when utilized, which can be given by

Cost(OptimizedPlanWithCandidate) < Cost(OriginalPlan). Second, candidate selector 706 should propose candidates that do not increase cost significantly for the current plan. These candidates when cached do not significantly increase cost for current plan, which can be in the negative as NOT: Cost(StoreCandidate) 4- Cost(OptimizedPlanWithCandidate) » Cost(OriginalPlan)

[0072] An embodiment of an example API for candidate selector 706 can be given by: select(plan: Tree[Operator]): Set [Candidate], with

Candidate → repartition( source path, attribute, output path, tier), dataSkippingMetadata(source path, attribute, output path, tier), intermediateData (input DAG, output path, tier))

Operator → load(... ), filter(... ), select(... ), aggregate(... ), repartition(... ), union(... ), intersection(... ), ...

Consider use of such an API in the following example:

[0073] The candidate (FileSkippinglndices) is picked since there are potential partitions that might be eliminated since the attribute temp is used in the filter operation, which meets the first criterion mentioned above.

Furthermore, creating the indices is not a great cost since it is a relatively cheap operation, which meets the second criterion. On the other hand, the candidate repartitioning on the attribute temp is rather costly since it requires re- arrangement of the whole data, which is a violation of the abovementioned second criterion. Repartitioning can be picked as a candidate if there was an operation in the plan that would re-arrange the data in a similar way, like repartitioning on temperature. However, this is not true for this example plan of only filtering on temperature.

[0074] The design and API of such a candidate selector can allow the cache to deduce candidates for cache insertion effectively and automatically from the current plan, which is the plan ready to be executed. The candidate, if selected for insertion, is guaranteed to offer relatively good improvements with a non- significant insertion cost, that is, it is cost-effective.

[0075] Predictor 707 can be structured to predict upcoming plans as accurately as possible. An example API for predictor 707 can be represented as predict(plan: Tree[Operator]): Set[Tree[Operator]]

A standard example, that is utilized in traditional eviction techniques as well, is to reverse the order of the previous executed plans up to a specific window size and output them as the prediction. A predictor can ensure that the cache is optimized not just for the current plan, but for a sequence of plans. By doing so, the predictor ensures that contents that are not immediately beneficial are considered. For example, repartitioning, which is a high-cost content to create, would never be considered in the traditional approach of having one or two plans taken into consideration. The predictor can ensure that the cache configuration modifications are smooth and not abrupt, since such optimization is performed on a relatively large sequence of plans that does not significantly change in a single step such as execution of a plan.

[0076] Planner 708 can be structured as a module that is responsible for determining new contents of the cache. Unlike the candidate selector and predictor modules, which are complimentary pieces, the planner module can hold the main logic for automating the cache configuration modification decisions.

[0077] An embodiment of an example protocol for a planner, such as planner 708, can be provided by

Input: (candidates, futurePlans, contents, size)

1. For each candidate e candidates calculate the following:

Score(candidate) = Score(futurePlans, contents + [candidate]) -

Score(futurePlans, contents)

2. Eliminate candidate 6 candidates if Score(candidate) < 0.

3. If no candidate remains, then return contents.

4. Pick topCandidate with max(Score(candidate))

5. Assign newContents = contents + [topCandidate]

6. While Size(newContents) > size:

For each content 6 newContents, calculate the following:

Score(content) = Score(futurePlans, newContents) - Score(futurePlans, newContents - [content]) Pick evictedContent with min( Score (content))

If evictedContent = = topCandidate then return contents Assign newContents = contents - [evictedContent]

7. Return newContents

The planner, such as planner 708, calculates a specified score of all the possible candidates (see line 1 above). Then, it eliminates all candidates that have a non- positive score (see line 2 above), since these are not candidates that improve the execution of the workload according to the score metric. If there are no remaining candidates, then the planner does not change the contents (see line 3 above). Furthermore, the planner selects the candidate with the top score and inserts it temporarily in the new cache contents (see line 4 above). If necessary, the content(s) with the least score(s), with the same metric as before, are removed from the new contents (see line 6 above). If at any point, the top candidate that was inserted before (see line 4 above) is evicted, then the planner returns the same contents as before without any modifications. Otherwise, the planner returns the new contents with the insertion and the possible eviction(s) taken into consideration.

[0078] The planner, such as planner 708, can use any score function that makes sense for optimization purposes. Examples of four specific score functions include reference count, reference distance, storage I/O bytes loaded from storage, and execution time. The reference distance can be used in inverse order, that is, lowest instead of highest. Storage I/O bytes loaded from storage can be used in opposite order. Use of storage I/O bytes loaded can be accompanied with a size estimator. Execution time can be used in opposite order. Use of execution time can be accompanied with a run time estimator. The four specific score functions may yield different benefits when utilized. [0079] For a multi-layer cache, the above protocol can be run for the top layer. Then, the unpicked top candidate or evicted contents can be treated as the new candidates for the next layer. The planner can repeat iteratively until it reaches the bottom layer. The planner can ensure that the automatic decisions that are made are sound according to the provided score function. This effectively means that in general the planner tries to optimize the cache contents to achieve the highest possible score.

[0080] The score function can be customized allowing cache administration to indirectly interfere with the cache configuration process. This customization can allow cache administration to tune the score function to match the needs of the administration more directly during operation. However, this would make the architecture 700 for automation of cache configuration of Figure 7 no longer a fully automatic solution for cache modification since the cache administration is directly involved in the decision-making procedure for determining new cache configurations. This customization may be semi-automatic.

[0081] Figure 8 is a flow diagram of features of an embodiment of an example computer-implemented method 800 of automatic configuration of a cache. Computer-implemented method 800 can be performed using one or more processors executing stored instructions. The one or more processors can be arranged to operate in the architecture of Figure 7 or similar architecture, which can be implemented for a distributed compute engine such as that of Figure 1 or other distributed compute engine. At operation 810, a current plan of data operations is received. At operation 820, candidates for data insertion into the cache are selected using the current plan. Selecting candidates can include, but is not limited to, determining that a cost of the current plan with use of the candidate included into the current plan is less than or equal to a cost of the current plan and determining that a cost of storing the candidate in the cache plus the cost of the current plan with use of the candidate included into the current plan is less than a threshold for limiting the cost of the current plan.

[0082] At operation 830, one or more future plans are predicted using the current plan. Predicting one or more fiiture plans can include, but is not limited to, predicting a sequence of plans. Predicting one or more fiiture plans can include, but is not limited to, obtaining previous executed plans from an archive of executed plans and, with the previous executed plans having an order, reversing the order of the previous executed plans up to a specific window size and outputting the reversed ordered previous executed plans as predicted one or more fiiture plans.

[0083] At operation 840, insertion of contents into the cache or eviction of contents fiom the cache is planned using the selected candidates, the one or more fiiture plans, and current configuration of the cache. The current configuration includes current contents of the cache. At 850, new contents are provided for the cache from the planning.

[0084] Variations of method 800 or methods similar to the method 800 can include a number of different embodiments that may be combined depending on the application of such methods and/or the architecture of devices or systems in which such methods are implemented. Variations of such methods can include planning insertion of contents into the cache or eviction of contents from the cache to include, but is not limited to, calculating a score for each selected candidate; eliminating selected candidates having a score less than or equal to zero; and, in response to eliminating all selected candidates based on the scores of the selected candidates, maintaining the current contents as the new contents for provision to the cache. Variations of such methods can include planning insertion of contents into the cache or eviction of contents from the cache to include, but is not limited to, calculating a score for each selected candidate; maintaining, for further evaluation, selected candidates having a score greater than zero; picking, as a top candidate, a selected candidate having a maximum score from the scores calculated for the selected candidates; and assigning the current contents plus the top candidate as the new contents.

[0085] Variations of method 800 or methods similar to method 800 can include calculating the score for each selected candidate to include, but is not limited to, calculating a score for the one or more future plans with the current contents plus the selected candidate; calculating a score for the one or more future plans with the current contents; and setting the score for the selected candidate as a difference between the score for the one or more future plans with the current contents plus the selected candidate and the score for the one or more future plans with the current contents.

[0086] Variations of method 800 or methods similar to method 800 can include planning insertion of contents into the cache or eviction of contents from the cache to include, but is not limited to, with a size of the new contents being greater than a threshold size for the cache, calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; determining first new contents by removing the evicted content from the new contents; and re-assigning the first new contents as the new contents for provision to the cache.

[0087] Variations of method 800 or methods similar to method 800 can include planning insertion of contents into the cache or eviction of contents from the cache to include, but is not limited to, with a size of the new contents being greater than a threshold size for the cache, calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; and in response to the evicted content being the top candidate, maintaining the current contents as the new contents for provision to the cache.

[0088] Variations of method 800 or methods similar to method 800 can include calculating the size score for each data content to include, but is not limited to, calculating a size score for the one or more future plans with the new contents; calculating a size score for the one or more future plans with the new contents minus the current contents; and setting the size score for the data content as a difference between the size score for the one or more future plans with the new contents and the size score for the one or more future plans with the new contents minus the current contents.

[0089] Variations of method 800 or methods similar to method 800 can include planning insertion of contents into the cache or eviction of contents from the cache to include, but is not limited to, generating one or more scores for candidates or current content using a scoring function using one or more of a reference count of content, an inverse reference distance, a size estimation of storage input/output bytes loaded from storage, or a time estimation of execution time.

[0090] Variations of method 800 or methods similar to method 800 can include, with the cache being a multi-layer cache, planning insertion of contents into the cache or eviction of contents from the cache being performed iteratively from a top layer to a bottom layer using an unpicked top candidate from the selected candidates or evicted contents for planning in a next layer.

[0091] In various embodiments, a non-transitory machine-readable storage device, such as computer-readable non-transitory medium, can comprise instractions stored thereon, which, when performed by a machine, cause the machine to perform operations, where the operations comprise one or more features similar to or identical to features of methods and techniques described with respect to method 800, variations thereof, and/or features of other methods taught herein such as associated with Figures 1-9. The physical structures of such instructions may be operated on by one or more processors. For example, executing these physical structures can cause the machine to perform operations comprising receiving a current plan of data operations; selecting candidates for data insertion into the cache using the current plan; predicting one or more future plans using the current plan; planning insertion of contents into the cache or eviction of contents from the cache using the selected candidates, the one or more future plans, and current configuration of the cache, the current configuration including current contents of the cache; and providing new contents for the cache from the planning.

[0092] From instructions executed by the one or more processors, operations selecting candidates by actions can include determining that a cost of the current plan with use of the candidate included into the current plan is less than or equal to a cost of the current plan; and determining that a cost of storing the candidate in the cache plus the cost of the current plan with use of the candidate included into the current plan is less than a threshold for limiting the cost of the current plan.

[0093] From instructions executed by the one or more processors, operations predicting one or more future plans can include predicting a sequence of plans. From instructions executed by the one or more processors, operations predicting one or more future plans can include obtaining previous executed plans from an archive of executed plans; and, with the previous executed plans having an order, reversing the order of the previous executed plans up to a specific window size and outputting the reversed ordered previous executed plans as predicted one or more future plans.

[0094] From instructions executed by the one or more processors, operations planning insertion of contents into the cache or eviction of contents from the cache can include: calculating a score for each selected candidate; eliminating selected candidates having a score less than or equal to zero; and, in response to eliminating all selected candidates based on the scores of the selected candidates, maintaining the current contents as the new contents for provision to the cache. [0095] From instructions executed by the one or more processors, operations planning insertion of contents into the cache or eviction of contents from the cache can include: calculating a score for each selected candidate; maintaining, for further evaluation, selected candidates having a score greater than zero; picking, as a top candidate, a selected candidate having a maximum score from the scores calculated for the selected candidates; and assigning the current contents plus the top candidate as the new contents.

[0096] From instructions executed by the one or more processors, operations calculating the score for each selected candidate can include: calculating a score for the one or more future plans with the current contents plus the selected candidate; calculating a score for the one or more future plans with the current contents; and setting the score for the selected candidate as a difference between the score for the one or more future plans with the current contents plus the selected candidate and the score for the one or more future plans with the current contents.

[0097] From instructions executed by the one or more processors, operations planning insertion of contents into the cache or eviction of contents from the cache can include, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; determining first new contents by removing the evicted content from the new contents; and re- assigning the first new contents as the new contents for provision to the cache. [0098] From instructions executed by the one or more processors, operations planning insertion of contents into the cache or eviction of contents from the cache can include, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; and in response to the evicted content being the top candidate, maintaining the current contents as the new contents for provision to the cache.

[0099] From instructions executed by the one or more processors, operations calculating the size score for each data content can include: calculating a size score for the one or more future plans with the new contents; calculating a size score for the one or more future plans with the new contents minus the current contents; and setting the size score for the data content as a difference between the size score for the one or more future plans with the new contents and the size score for the one or more future plans with the new contents minus the current contents.

[0100] From instractions executed by the one or more processors, operations planning insertion of contents into the cache or eviction of contents from the cache can include generating one or more scores for candidates or current content using a scoring function using one or more of a reference count of content, an inverse reference distance, a size estimation of storage input/output bytes loaded from storage, or a time estimation of execution time.

[0101] From instructions executed by the one or more processors, operations can include, with the cache being a multi-layer cache, planning insertion of contents into the cache or eviction of contents from the cache being performed iteratively from a top layer to a bottom layer using an unpicked top candidate from the selected candidates or evicted contents for planning in a next layer. [0102] In various embodiments, a system, having an automatic configuration of a cache, can comprise a memory storing instructions and one or more processors in communication with the memory, where the one or more processors execute the instructions to perform operations. The operations can comprise receiving a current plan of data operations; selecting candidates for data insertion into the cache using the current plan; predicting one or more future plans using the current plan; planning insertion of contents into the cache or eviction of contents from the cache using the selected candidates, the one or more future plans, and current configuration of the cache; and providing new contents for the cache fiom the planning. The current configuration can include current contents of the cache.

[0103] In such a system, selecting candidates can include determining that a cost of the current plan with use of the candidate included into the current plan is less than or equal to a cost of the current plan; and determining that a cost of storing the candidate in the cache plus the cost of the current plan with use of the candidate included into the current plan is less than a threshold for limiting the cost of the current plan. In the system, predicting one or more future plans can include predicting a sequence of plans. Predicting one or more future plans can include: obtaining previous executed plans from an archive of executed plans; and, with the previous executed plans having an order, reversing the order of the previous executed plans up to a specific window size and outputting the reversed ordered previous executed plans as predicted one or more future plans.

[0104] Variations of such a system or similar systems can include a number of different embodiments that may or may not be combined depending on the application of such systems and/or the architecture of systems in which methods, as taught herein, are implemented. In such a system, planning insertion of contents into the cache or eviction of contents from the cache can include calculating a score for each selected candidate; eliminating selected candidates having a score less than or equal to zero; and, in response to eliminating all selected candidates based on the scores of the selected candidates, maintaining the current contents as the new contents for provision to the cache. In the system, planning insertion of contents into the cache or eviction of contents from the cache can include calculating a score for each selected candidate; maintaining, for further evaluation, selected candidates having a score greater than zero; picking, as atop candidate, a selected candidate having a maximum score from the scores calculated for the selected candidates; and assigning the current contents plus the top candidate as the new contents.

[0105] In such a system or similar systems, calculating the score for each selected candidate can include: calculating a score for the one or more future plans with the current contents plus the selected candidate; calculating a score for the one or more foture plans with the current contents; and setting the score for the selected candidate as a difference between the score for the one or more future plans with the current contents plus the selected candidate and the score for the one or more foture plans with the current contents.

[0106] In such a system or similar systems, planning insertion of contents into the cache or eviction of contents from the cache can include, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; determining first new contents by removing the evicted content from the new contents; and re-assigning the first new contents as the new contents for provision to the cache.

[0107] In such a system or similar systems, planning insertion of contents into the cache or eviction of contents from the cache can include, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; and, in response to the evicted content being the top candidate, maintaining the current contents as the new contents for provision to the cache. [0108] In such a system or similar systems, calculating the size score for each data content can include: calculating a size score for the one or more foture plans with the new contents; calculating a size score for the one or more foture plans with the new contents minus the current contents; and setting the size score for the data content as a difference between the size score for the one or more foture plans with the new contents and the size score for the one or more future plans with the new contents minus the current contents.

[0109] In such a system or similar systems, planning insertion of contents into the cache or eviction of contents from the cache can include generating one or more scores for candidates or current content using a scoring function using one or more of a reference count of content, an inverse reference distance, a size estimation of storage input/output bytes loaded from storage, or a time estimation of execution time.

[0110] In such a system or similar systems, with the cache being a multi-layer cache, planning insertion of contents into the cache or eviction of contents from the cache is performed iteratively from a top layer to a bottom layer using an unpicked top candidate from the selected candidates or evicted contents for planning in a next layer.

[0111] Figure 9 is a block diagram illustrating components of a system 900 that implements algorithms and performs methods structured to perform automatic configuration of a cache. System 900 can include one or more processors 970 that can execute stored instructions to automate configuration of a semantic cache for distributed compute engines for data from one or more partitions 985. System 900 can perform operations to automatically deduce new contents for a cache using a candidate selector, a predictor, and a planner as taught herein.

[0112] System 900, having one or more such memory devices, can operate as a standalone system or can be connected, for example networked, to other systems. In a networked deployment, system 900 can operate in the capacity of a server machine, a client machine, or both in server-client network environments. In an example, system 900 can act as a peer machine in peer-to- peer (P2P) (or other distributed) network environment. While only a single system is illustrated, the term “system” shall also be taken to include any collection of systems that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein, such as cloud computing, software as a service (SaaS), or other computer cluster configurations. Example system 900 can be arranged to operate with one or more memory' devices in a structure to perform automatic configuration of a semantic cache, as taught herein. [0113] Examples, as described herein, can include, or can operate by, logic, components, devices, packages, or mechanisms. Circuitry is a collection (e.g., set) of circuits implemented in tangible entities that include hardware (e.g., simple circuits, gates, logic, etc.). Circuitry membership can be flexible over time and underlying hardware variability. Circuitries include members that can, alone or in combination, perform specific tasks when operating. In an example, hardware of the circuitry can be immutably designed to carry out a specific operation (e.g., hardwired). In an example, the hardware of the circuitry' can include variably connected physical components (e.g., execution units, transistors, simple circuits, etc.) including a computer-readable medium physically modified (e.g., magnetically, electrically, moveable placement of invariant massed particles, etc.) to encode instructions of the specific operation. In connecting the physical components, the underlying electrical properties of a hardware constituent are changed. The instractions enable participating hardware (e.g., the execution units or a loading mechanism) to create members of the circuitry in hardware via the variable connections to cany out portions of the specific tasks when in operation. Accordingly, the computer-readable medium is communicatively coupled to the other components of the circuitry when the device is operating. In an example, any of the physical components can be used in more than one member of more than one circuitry. For example, under operation, execution units can be used in a first circuit of a first circuitry at one point in time and reused by a second circuit in the first circuitry, or by a third circuit in a second circuitry', at a different time.

[0114] System 900 (e.g., computer system or distributed computing system) can include one or more hardware processors 970 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a hardware processor core, or any combination thereof), a main memory 973, and a static memory 975, some or all of which may communicate with each other via a interlink 979. The interlink 979 (e.g., bus) can be implemented as a bus, a local link, a network, other communication path, or combinations thereof. System 900 can further include a display device 981, an alphanumeric input device 982 (e.g., a keyboard), and a user interface (UI) navigation device 983 (e.g., a mouse). In an example, the display device 981, alphanumeric input device 982, and UI navigation device 983 can be a touch screen display. System 900 can include an output controller 984, such as a serial (e.g., Universal Serial Bus (USB), parallel, or other wired or wireless (e.g., infrared (IR), near field communication (NFC), etc.) connection to communicate or control one or more peripheral devices (e.g., a printer, card reader, etc.).

[0115] System 900 can include a machine-readable medium 977 on which is stored one or more sets of data structures or instructions 978 (e.g., software or data) embodying or utilized by system 900 to perform any one or more of the techniques or functions for which system 900 is designed, including cache automatic configuration. The instructions 978 or other data stored on the machine-readable medium 977 can be accessed by the main memory 973 for use by the one or more processors 970. The instructions 978 can also reside, completely or at least partially, within instructions 974 of the main memory 973, within instructions 976 of the static memory 975, or within instructions 972 of the one or more hardware processors 970.

[0116] While the machine-readable medium 977 is illustrated as a single medium, the term "machine-readable medium" can include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) configured to store the instructions 978 or data. The term “machine-readable medium” can include any medium that is capable of storing or encoding instructions for execution by system 900 and that cause system 900 to perform any one or more of the techniques to which system 900 is designed, or that is capable of storing or encoding data structures used by or associated with such instructions. Non-limiting machine-readable medium examples can include solid-state memories, optical memory media, and magnetic memory media. In an example, a massed machine-readable medium comprises a machine-readable medium with a plurality of particles having invariant (e.g., rest) mass. Accordingly, massed machine-readable media are not transitory propagating signals. Specific examples of massed machine-readable media can include non-volatile memory, such as semiconductor memory devices (e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory- (EEPROM)) and flash memory' devices; magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and compact disc-ROM (CD-ROM) and digital versatile disc-read only memory (DVD-ROM) disks. [0117] The data from or stored in machine-readable medium 977 or main memory 973 can be transmitted or received over a communications network using a transmission medium via a network interface device 990 utilizing any one of a number of transfer protocols (e.g., frame relay, Internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), hypertext transfer protocol (HTTP), etc.). Example communication networks can include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone (POTS) networks, and wireless data networks (e.g., Institute of Electrical and Electronics Engineers (IEEE) 802.11 family of standards known as Wi-Fi®, IEEE 802.16 family of standards known as WiMax®), IEEE 802.15.4 family of standards, peer-to-peer (P2P) networks, among others. In an example, the network interface device 990 can include one or more physical jacks (e.g., Ethernet, coaxial, or phone jacks) or one or more antennas to connect to the communications network. In an example, the network interface device 990 can include a plurality of antennas to wirelessly communicate using at least one of single-input multiple-output (SIMO), multiple-input multiple-output (MIMO), or multiple-input single-output (MISO) techniques. The term “transmission medium” shall be taken to include any tangible medium that is capable of carrying instructions to and for execution by system 900 and includes instrumentalities to propagate digital or analog communications signals to facilitate communication of such instructions, which instractions can be implemented by software.

[0118J Cache 980 provides a data depository that can be used with data intense operations such as big data analysis, in accordance with various embodiments as discussed herein. Cache 980 can be located in allocated memory of a server. Contents of cache 980 can be accessed by remote servers such as by use of instrumentalities such as the interlink 979 and the network interface device 990.

Cache 980 can be distributed as memory allocations in the machine-readable medium 977, the main memory 973, or other data storage of system 900. Likewise, the components of system 900 can be distributed.

[0119J The components of the illustrative devices, systems, and methods employed in accordance with the illustrated embodiments can be implemented, at least in part, in digital electronic circuitry, analog electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. These components can be implemented, for example, as a computer program product such as a computer program, program code or computer instractions tangibly embodied in a machine-readable storage device, for execution by, or to control the operation of, data processing apparatus such as a programmable processor, a computer, or multiple computers.

[0120] The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein can be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an ASIC, a FPGA (field-programmable gate array) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor can be a microprocessor, but in the alternative, the processor can be a conventional processor, controller, microcontroller, or state machine. A processor can also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

[0121] Processors suitable for the execution of a computer program include, by way of example, both general and special pinpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory' or a random-access memory or both. The elements of a computer include a processor for executing instructions and one or more memory' devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example, semiconductor memory devices, e.g., electrically programmable read-only memory or ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory devices, and data storage disks (e.g., magnetic disks, internal hard disks, or removable disks, magneto-optical disks, and CD-ROM and DVD-ROM disks). The processor and the memory can be supplemented by or incorporated in special purpose logic circuitry.

[0122] As used herein, “machine-readable medium” (or “computer-readable medium”) means a device able to store instructions and data temporarily or permanently and can include, but is not limited to, RAM, ROM, buffer memory, flash memory, optical media, magnetic media, cache memory, other types of storage (e.g., EEPROM), and/or any suitable combination thereof. The term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store processor instructions. The term “machine-readable medium” shall also be taken to include any medium (or a combination of multiple media) that is capable of storing instructions for execution by one or more processors, such that the instructions, when executed by the one or more processors, cause performance of any one or more of the methodologies described herein. Accordingly, a “machine-readable medium” refers to a single storage apparatus or device, as well as “cloud-based” storage systems or storage networks that include multiple storage apparatus or devices. The term “machine- readable medium” as used herein excludes signals per se.

[0123] Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement that is calculated to achieve the same purpose may be substituted for the specific embodiments shown. Various embodiments use permutations and/or combinations of embodiments described herein. It is to be understood that the above description is intended to be illustrative, and not restrictive, and that the phraseology or terminology employed herein is for the purpose of description. Combinations of the above embodiments and other embodiments will be apparent to those of skill in the art upon studying the above description.

Claims

CLAIMS What is claimed is:

1. A computer-implemented method of automatic configuration of a cache, the computer-implemented method comprising: receiving a current plan of data operations; selecting candidates for data insertion into the cache using the current plan; predicting one or more future plans using the current plan; planning insertion of contents into the cache or eviction of contents from the cache using the selected candidates, the one or more future plans, and current configuration of the cache, the current configuration including current contents of the cache; and providing new contents for the cache from the planning.

2. The computer-implemented method of claim 1, wherein selecting candidates includes: determining that a cost of the current plan with use of the candidate included into the current plan is less than or equal to a cost of the current plan; and determining that a cost of storing the candidate in the cache plus the cost of the current plan with use of the candidate included into the current plan is less than a threshold for limiting the cost of the current plan.

3. The computer-implemented method of any one of the preceding claims, wherein predicting one or more future plans includes predicting a sequence of plans.

4. The computer-implemented method of any one of the preceding claims, wherein predicting one or more future plans includes: obtaining previous executed plans from an archive of executed plans; and with the previous executed plans having an order, reversing the order of the previous executed plans up to a specific window size and outputting the reversed ordered previous executed plans as predicted one or more future plans.

5. The computer-implemented method of any one of the preceding claims, wherein planning insertion of contents into the cache or eviction of contents from the cache includes: calculating a score for each selected candidate; eliminating selected candidates having a score less than or equal to zero; and in response to eliminating all selected candidates based on the scores of the selected candidates, maintaining the current contents as the new contents for provision to the cache.

6. The computer-implemented method of any one of the preceding claims, wherein planning insertion of contents into the cache or eviction of contents from the cache includes: calculating a score for each selected candidate; maintaining, for further evaluation, selected candidates having a score greater than zero; picking, as a top candidate, a selected candidate having a maximum score from the scores calculated for the selected candidates; and assigning the current contents plus the top candidate as the new contents.

7. The computer-implemented method of claim 5 or claim 6, wherein calculating the score for each selected candidate includes: calculating a score for the one or more future plans with the current contents plus the selected candidate; calculating a score for the one or more future plans with the current contents; and setting the score for the selected candidate as a difference between the score for the one or more future plans with the current contents plus the selected candidate and the score for the one or more future plans with the current contents.

8. The computer-implemented method of claim 6, wherein planning insertion of contents into the cache or eviction of contents from the cache includes, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; and picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; determining first new contents by removing the evicted content from the new contents; and re-assigning the first new contents as the new contents for provision to the cache.

9. The computer-implemented method of claim 6, wherein planning insertion of contents into the cache or eviction of contents from the cache includes, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; and in response to the evicted content being the top candidate, maintaining the current contents as the new contents for provision to the cache.

10. The computer-implemented method of claim 8 or claim 9, wherein calculating the size score for each data content includes: calculating a size score for the one or more foture plans with the new contents; calculating a size score for the one or more foture plans with the new contents minus the current contents; and setting tiie size score for the data content as a difference between the size score for the one or more foture plans with the new contents and the size score for the one or more foture plans with the new contents minus the current contents.

11. The computer-implemented method of any one of the preceding claims, wherein planning insertion of contents into the cache or eviction of contents from the cache includes generating one or more scores for candidates or current content using a scoring function using one or more of a reference count of content, an inverse reference distance, a size estimation of storage input/output bytes loaded from storage, or a time estimation of execution time.

12. The computer-implemented method of any one of the preceding claims, wherein with the cache being a multi-layer cache, planning insertion of contents into the cache or eviction of contents from the cache is performed iteratively from a top layer to a bottom layer using an unpicked top candidate from the selected candidates or evicted contents for planning in a next layer.

13. A system having an automatic configuration of a cache, the system comprising: a memory storing instructions; and one or more processors in communication with the memory', wherein the one or more processors execute the instructions to perform operations comprising: receiving a current plan of data operations; selecting candidates for data insertion into the cache using the current plan; predicting one or more future plans using the current plan; planning insertion of contents into the cache or eviction of contents from the cache using the selected candidates, the one or more future plans, and current configuration of the cache, the current configuration including current contents of the cache; and providing new contents for the cache from the planning.

14. The system of claim 13, wherein selecting candidates includes: determining that a cost of the current plan with use of the candidate included into the current plan is less than or equal to a cost of the current plan; and determining that a cost of storing the candidate in the cache plus the cost of the current plan with use of the candidate included into the current plan is less than a threshold for limiting the cost of the current plan.

15. The system of claim 13 or claim 14, wherein predicting one or more future plans includes predicting a sequence of plans.

16. The system of any one of the preceding claims 13-15, wherein predicting one or more future plans includes: obtaining previous executed plans from an archive of executed plans; and with the previous executed plans having an order, reversing the order of the previous executed plans up to a specific window size and outputting the reversed ordered previous executed plans as predicted one or more future plans.

17. The system of any one of the preceding claims 13-16, wherein planning insertion of contents into the cache or eviction of contents from the cache includes: calculating a score for each selected candidate; eliminating selected candidates having a score less than or equal to zero; and in response to eliminating all selected candidates based on the scores of the selected candidates, maintaining the current contents as the new contents for provision to the cache.

18. The system of any one of the preceding claims 13-17, wherein planning insertion of contents into the cache or eviction of contents from the cache includes: calculating a score for each selected candidate; maintaining, for further evaluation, selected candidates having a score greater than zero; picking, as atop candidate, a selected candidate having a maximum score from the scores calculated for the selected candidates; and assigning the current contents plus the top candidate as the new contents.

19. The system of claim 17 or claim 18, wherein calculating the score for each selected candidate includes: calculating a score for the one or more future plans with the current contents plus the selected candidate; calculating a score for the one or more future plans with the current contents; and setting the score for the selected candidate as a difference between the score for the one or more future plans with the current contents plus the selected candidate and the score for the one or more foture plans with the current contents.

20. The system of claim 18, wherein planning insertion of contents into the cache or eviction of contents from the cache includes, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; determining first new contents by removing the evicted content from the new contents; and re-assigning the first new contents as the new contents for provision to the cache.

21. The system of claim 18, wherein planning insertion of contents into the cache or eviction of contents from the cache includes, with a size of the new contents being greater than a threshold size for the cache: calculating a size score for each data content in the new contents; picking, as an evicted content, a data content having a minimum size score from the scores calculated for the data contents; and in response to the evicted content being the top candidate, maintaining the current contents as the new contents for provision to the cache.

22. The system of claim 20 or claim 21, wherein calculating the size score for each data content includes: calculating a size score for the one or more foture plans with the new contents; calculating a size score for the one or more foture plans with the new contents minus the current contents; and setting the size score for the data content as a difference between the size score for the one or more foture plans with the new contents and the size score for the one or more future plans with the new contents minus the current contents.

23. The system of any one of the preceding claims 13-22, wherein planning insertion of contents into the cache or eviction of contents from the cache includes generating one or more scores for candidates or current content using a scoring function using one or more of a reference count of content, an inverse reference distance, a size estimation of storage input/output bytes loaded from storage, or a time estimation of execution time.

24. The system of any one of the preceding claims 13-23, wherein with the cache being a multi-layer cache, planning insertion of contents into the cache or eviction of contents from the cache is performed iteratively from a top layer to a bottom layer using an unpicked top candidate from the selected candidates or evicted contents for planning in a next layer.

25. A non-transitory computer-readable storage medium storing instructions, wherein the instructions, when executed by one or more processors, cause the one or more processors to perform operations comprising any one of the methods of claims 1-12.