WO2023061567A1 - Compressed cache as a cache tier - Google Patents

Compressed cache as a cache tier

Info

Publication number
WO2023061567A1
Authority
WO
WIPO (PCT)
Prior art keywords
cache
tier
entry
compression
cache entry
Prior art date
Application number
PCT/EP2021/078249
Other languages
French (fr)
Inventor
Assaf Natanzon
Zvi Schneider
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to CN202180101274.8A priority Critical patent/CN117813592A/en
Priority to PCT/EP2021/078249 priority patent/WO2023061567A1/en
Publication of WO2023061567A1 publication Critical patent/WO2023061567A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877Cache access modes
    • G06F12/0886Variable-length word access
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • G06F2212/1024Latency reduction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1041Resource optimization
    • G06F2212/1044Space efficiency improvement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/40Specific encoding of data in memory or cache
    • G06F2212/401Compressed data

Definitions

  • the present disclosure, in some embodiments thereof, relates to caches and, more specifically, but not exclusively, to systems and methods for managing caches for improving access times.
  • a cache is a hardware or software component that stores data so that future requests for that data can be served faster.
  • a typical storage system may have DRAM memory for cache, which serves IOs very fast, SCM (storage class memory), which allows persistent fast random IOs but is slower than DRAM, and a solid-state drive (SSD) tier, which allows relatively fast random access for reads and writes.
  • the SSD tier is sometimes also used for caching.
  • the storage system typically includes a hard disk drive (HDD) tier, which allows for relatively fast sequential reading and writing, but has very poor performance for random IOs as the seek times in an HDD are very high and can be up to 10 milliseconds (ms).
  • a computing device for hierarchical storage management is configured for: detecting at least one of a read miss for a cache entry on a cache of a memory, and eviction of the cache entry, wherein the memory is divided into a higher cache tier of uncompressed cache entries and at least one lower cache tier of compressed cache entries, computing at least one compression performance parameter for the cache entry, and moving the cache entry to, and/or evicting the cache entry from, the higher cache tier and/or at least one lower cache tier according to the compression parameter.
  • a method of hierarchical storage management comprises: detecting at least one of a read miss for a cache entry on a cache of a memory, and eviction of the cache entry, wherein the memory is divided into a higher cache tier of uncompressed cache entries and at least one lower cache tier of compressed cache entries, computing at least one compression performance parameter for the cache entry, and moving the cache entry to, and/or evicting the cache entry from, the higher cache tier and/or at least one lower cache tier according to the compression parameter.
  • a non-transitory medium storing program instructions for hierarchical storage management, which, when executed by a processor, cause the processor to: detect at least one of a read miss for a cache entry on a cache of a memory, and eviction of the cache entry, wherein the memory is divided into a higher cache tier of uncompressed cache entries and at least one lower cache tier of compressed cache entries, compute at least one compression performance parameter for the cache entry, and move the cache entry to, and/or evict the cache entry from, the higher cache tier and/or at least one lower cache tier according to the compression parameter.
  • Using multiple lower tiers of the compressed cache helps optimize performance, for example, to reduce overall average latency when serving data and/or servicing IO requests.
  • the at least one compression performance parameter comprises a decompression performance indicating latency incurred for accessing the cache entry in a compressed state when stored on the at least one lower cache tier or in the uncompressed state when stored on the at least one lower cache tier.
  • the at least one compression performance parameter is computed based on a predicted hit rate of the cache entry indicating a probability that the cache entry will be read again during a future time interval.
  • Cache entries with higher hit rates are better candidates for remaining in the cache, to avoid cache misses.
  • the hit rate may serve to “normalize” the added latency times, and enable comparison of the “normalized” latency times.
  • the at least one compression performance parameter further comprises a reduction in storage requirement for storing the cache entry in the compressed state at a compression ratio defined by a specific compression process.
  • Cache entries that are highly compressible are better candidates for remaining in the cache, since they take up less space than other cache entries that are less compressible.
  • the at least one compression performance parameter further comprises an impact on average latency of at least one of (i) accessing cache entries of the higher cache tier when storing the cache entry in an uncompressed state on the higher cache tier and (ii) accessing cache entries of the at least one lower cache tier when storing the cache entry in the compressed state on the at least one lower cache tier, wherein the cache entry is stored on and/or evicted from, the higher cache tier and/or the at least one lower cache tier, to improve the average latency.
  • the at least one lower cache tier comprises two cache tiers, wherein the cache entry is moved to a first lower cache tier when the at least one compression parameter indicates a compression ratio being above a first compression threshold and a decompression performance indicating latency incurred for accessing the cache entry in a compressed state being below a first latency threshold, and the cache entry is moved to a second lower cache tier when the at least one compression parameter indicates the compression ratio being above a second compression threshold and lower than the first compression threshold and the decompression performance is below a second latency threshold greater than the first latency threshold.
  • the number of cache tiers and/or thresholds and/or values for moving cache entries into each cache tier may be selected for optimal performance, for example, according to available processing resources.
  • a compression ratio of the cache entry is computed for a respective target compression process that runs on a respective lower tier cache for compressing the cache entry for storage on the respective lower tier cache.
  • the compression ratio is computed using the target compression process that runs on respective lower tier caches where the cache entry may be moved to, in order to help evaluate where the cache entry will be moved to.
  • Data with the lowest hit score is unlikely to be accessed again, and therefore represents the best candidate for eviction.
  • the respective hit score is a function of a respective compression factor of the respective lower cache tier, wherein higher hit scores are obtained for higher compression factors and lower hit scores are obtained for lower compression factors.
  • the compression factor helps maintain data that is more highly compressed, and/or helps evict data that is less compressible. Data with higher compression factors is to be kept in the compressed lower cache tiers with higher priority.
  • the respective hit score is a function of the latency of decompressing the compressed cache entry, wherein higher hit scores are obtained for lower latency and lower hit scores are obtained for higher latency.
  • Hit scores may be “normalized” by latency, enabling comparing total predicted latency times, based on predicted hit rates.
  • the respective hit score is decayed over a time interval and based on a last access time.
  • When cache entries have not been accessed frequently recently, they are predicted to also not be accessed frequently in the future. The decay enables removal of the non-frequently read cache entry even when the cache entry used to be frequently read.
  • eviction of data from the at least one lower cache tier is according to a least recently used (LRU) policy.
  • in a further implementation form of the first, second, and third aspects, further comprising comparing a hit score of the cache entry being evicted to hit scores of cache entries in other cache tiers, and moving the cache entry when the hit score of the cache entry is higher than a lowest hit score of a current cache entry located in another cache tier.
  • the read miss is detected for the cache entry stored on the at least one lower tier cache, and the moving comprises moving the cache entry to the higher cache tier storing cache entries in an uncompressed state.
  • the cache entry may be moved to the higher cache tier when doing so provides optimal performance in terms of reduced latency and/or high hit rate relative to current cache entries stored on the higher cache tier.
  • the cache entry is moved to the higher cache tier when a hit score of the cache entry is higher than a threshold computed as a function of hit scores of cache entries stored on the higher cache tier.
  • FIG. 1 is a flowchart of a method of moving a cache entry to, and/or evicting the cache entry from, a lower cache tier of compressed cache entries or a higher cache tier of uncompressed cache entries according to a compression performance parameter of the cache entry, in accordance with some embodiments;
  • FIG. 2 is a block diagram of components of a system for moving a cache entry to, and/or evicting the cache entry from, a lower cache tier of compressed cache entries or a higher cache tier of uncompressed cache entries according to a compression performance parameter of the cache entry, in accordance with some embodiments.
  • the present disclosure, in some embodiments thereof, relates to caches and, more specifically, but not exclusively, to systems and methods for managing caches for improving access times.
  • An aspect of some embodiments relates to systems, methods, a computing device and/or apparatus, and/or computer program product (storing code instructions executable by one or more processors) for management of a memory (e.g., DRAM, SCM) divided into a higher cache tier that stores cache entries in an uncompressed state and one or more lower cache tiers that store cache entries in a compressed state.
  • a cache event such as a read miss for a cache entry, and/or eviction of the cache entry, is detected.
  • One or more compression performance parameters are computed for the cache entry.
  • the compression performance parameter may indicate latency incurred for accessing the cache entry in the compressed state due to the extra time for decompressing the cache entry, or in the uncompressed state due to the extra time for accessing more data which may be stored non-sequentially.
  • the compression performance parameter may be computed based on a predicted hit rate of the cache entry.
  • the compression performance parameter may be computed based on a reduction in storage requirement for storing the cache entry in the compressed state.
  • the cache entry is moved and/or evicted according to the compression performance parameter.
  • the cache entry may be moved to the higher cache tier or one of the lower cache tiers.
  • At least some implementations described herein utilize multiple defined lower tiers of a cache that store compressed cache entries, in addition to a higher tier of the cache that stores uncompressed cache entries.
  • Using multiple lower tiers of the compressed cache helps optimize performance, for example, to reduce overall average latency when serving data and/or servicing IO requests. The more compressible the data is, the less CPU it requires to decompress, and the lower the decompression time.
  • Cache of a storage system is usually not kept compressed due to performance issues, such as the increased processing time and/or processor utilization required to compress and decompress the data.
  • Compressed caching is a method used to improve the mean access time to memory pages. It inserts a new level into the virtual memory hierarchy where a portion of main memory is allocated for the compressed cache and is used to store pages compressed by data compression algorithms. Storing a number of pages in compressed format increases effective memory size and, for most workloads, this enlargement reduces the number of accesses to backing store devices, typically slow hard disks. This method takes advantage of the gap between the CPU processing power and disk latency time. In standard compressed cache systems, the cache memory is divided between uncompressed cache and compressed cache.
  • Pages evicted from the uncompressed cache are compressed and moved to the compressed cache tier. Pages read from the compressed cache tier are decompressed and moved to the uncompressed cache tier.
  • the size of the compressed and uncompressed cache is dynamically decided. Standard cache eviction mechanisms are used to move data from the uncompressed cache to the compressed cache. Such prior approaches do not take into account the compression ratio in the eviction scheme. At best, just the total compression ratio of the cache may be considered.
  • modern compression processes may decompress at a speed of more than 4.5 gigabytes per second (GB/sec) on a single CPU. This means that decompression of 8 kilobytes (KB), for example, takes about 2 microseconds, significantly faster than NAND flash, but slower than SCM memory, which serves the data in the 1 microsecond range.
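  • As a rough sanity check on the figures above, the following is an illustrative calculation only, assuming the 4.5 GB/sec throughput and 8 KB entry size quoted in the text; the variable names are assumptions:

```python
# Back-of-the-envelope decompression latency for one cache entry.
entry_bytes = 8 * 1024                    # 8 KB cache entry
throughput_bytes_per_s = 4.5e9            # ~4.5 GB/sec single-CPU decompression
latency_us = entry_bytes / throughput_bytes_per_s * 1e6
print(f"{latency_us:.2f} microseconds")   # ~1.82 us, i.e. roughly 2 microseconds
```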
  • the present disclosure may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a flowchart of a method of moving a cache entry to, and/or evicting the cache entry from, a lower cache tier of compressed cache entries or a higher cache tier of uncompressed cache entries according to a compression performance parameter of the cache entry, in accordance with some embodiments.
  • FIG. 2 is a block diagram of components of a system 200 for moving a cache entry to, and/or evicting the cache entry from, a lower cache tier of compressed cache entries or a higher cache tier of uncompressed cache entries according to a compression performance parameter of the cache entry, in accordance with some embodiments.
  • System 200 may implement the acts of the method described with reference to FIG. 1, by processor(s) 202 of a computing device 204 executing code instructions (e.g., code 206C) stored in a memory 206.
  • Computing device 204 manages a cache portion 206A of memory 206.
  • Cache portion 206A includes one or more lower cache tiers 206A-1, and a higher cache tier 206A-2.
  • Lower cache tier(s) 206A-1 store cache entries in a compressed state. Different cache tiers may store cache entries of different compression ratios, and/or incur different latencies as a result of decompressing the compressed cache entries, as described herein.
  • Higher cache tier 206A-2 stores cache entries in an uncompressed state.
  • Memory 206 storing cache portion 206A may be implemented, for example, as dynamic random-access memory (DRAM) and/or storage class memory (SCM). Memory 206 storing cache portion 206A may be selected to have low access times. Cost of memory 206 storing cache portion 206A may be high, limiting the amount of storage available for the cache tiers.
  • Computing device 204 may further manage a hierarchical storage 208 that includes at least a lower tier data storage device 210 and a higher tier data storage device 212. Data chunks may be moved between cache portion 206A and hierarchical storage 208. Lower tier data storage device 210 may store a lower data storage tier. Higher tier data storage device 212 may store a higher data storage tier.
  • Computing device 204 may use a prefetching process 206B (e.g., stored on memory 206, executed by processor(s) 202) for prefetching data from lower tier data storage device 210, as described herein.
  • the prefetching process 206B may predict the location of the next data component before the data component is being requested, and fetch the next data component before the request.
  • the prefetching may be performed from lower tier data storage device 210, saving room on higher tier data storage device 212.
  • Some prefetching processes 206B are designed to predict locations of data components that are non-sequentially located, for example, located in a striding pattern (e.g., increase by a fixed address location relative to the previous address location) and/or in a constant address pattern that may at first appear to be random.
  • Lower tier data storage device 210 has relatively slower random-access input/output (IO) (e.g., read) times in comparison to higher tier data storage device 212.
  • Higher tier data storage device 212 has relatively faster random I/O (e.g., read and/or write) times in comparison to lower tier data storage device 210.
  • Lower tier data storage device 210 may cost less (e.g., per megabyte) in comparison to higher tier data storage device 212.
  • Lower tier data storage device 210 may be implemented, for example, as a hard disk drive (HDD). Lower tier data storage device 210 may provide fast sequential reading and/or writing, but has poor performance for random I/O as the seek times may be very high (e.g., up to 10 milliseconds).
  • Higher tier data storage device 212 may be implemented, for example, as a solid-state drive (SSD), and/or phase-change memory (PCM).
  • Higher tier data storage device 212 may serve as a cache and/or a tier (e.g., cache when data is volatile and has a copy in the lower tier, and/or tier when the data is nonvolatile and/or may be kept (e.g., only) in the higher tier) for lower tier data storage device 210.
  • Cache portion 206A may serve as the cache for hierarchical storage 208, such as for cache entries with highest hit rates.
  • Hierarchical storage 208 is in communication with a computing system 214, which stores data on hierarchical storage 208 and/or reads data stored on hierarchical storage 208.
  • Hierarchical storage 208 may be integrated within computing system 214, and/or may be implemented as an external storage device.
  • Computing system 214 may be indirectly connected to hierarchical storage 208 via computing device 204, i.e., computing system 214 may communicate with computing device 204, where computing device 204 communicates with hierarchical storage 208, rather than computing system 214 directly communicating with hierarchical storage 208.
  • Computing system 214 and/or computing device 204 may be implemented as, for example, one of more of a computing cloud, a cloud network, a computer network, a virtual machine(s) (e.g., hypervisor, virtual server), a network node (e.g., switch, a virtual network, a router, a virtual router), a single computing device (e.g., client terminal), a group of computing devices arranged in parallel, a network server, a web server, a storage server, a local server, a remote server, a client terminal, a mobile device, a stationary device, a kiosk, a smartphone, a laptop, a tablet computer, a wearable computing device, a glasses computing device, a watch computing device, and a desktop computer.
  • hierarchical storage 208 is used exclusively by a single user, such as computing system 214.
  • hierarchical storage 208 is used by multiple users such as multiple client terminals 216 accessing hierarchical storage 208 over a network 218, for example, computing system 214 provides cloud storage services and/or virtual storage services to client terminals 216.
  • Computing device 204 may be implemented as, for example, integrated within hierarchical storage 208 (e.g., as hardware and/or software installed within hierarchical storage 208), integrated within computing system 214 (e.g., as hardware and/or software installed within computing system 214, such as an accelerator chip and/or code stored on a memory of computing system 214 and executed by processor of computing system 214), and/or as an external component (e.g., implemented as hardware and/or software) in communication with hierarchical storage 208, such as a plug-in component.
  • hierarchical storage 208 and computing device 204 are implemented as one storage system that exposes storage (e.g., functions, features, capabilities) to computing system(s) 214.
  • Computing device 204 includes one or more processor(s) 202, implemented as for example, central processing unit(s) (CPU), graphics processing unit(s) (GPU), field programmable gate array(s) (FPGA), digital signal processor(s) (DSP), application specific integrated circuit(s) (ASIC), customized circuit(s), processors for interfacing with other units, and/or specialized hardware accelerators.
  • processor(s) 202 may be implemented as a single processor, a multi-core processor, and/or a cluster of processors arranged for parallel processing (which may include homogenous and/or heterogeneous processor architectures). It is noted that processor(s) 202 may be designed to implement in hardware one or more features stored as code instructions 206C and/or 206B.
  • Memory 206 stores code instructions implementable by processor(s) 202, for example, a random-access memory (RAM), read-only memory (ROM), and/or a storage device, for example, non-volatile memory, magnetic media, semiconductor memory devices, hard drive, removable storage, and optical media (e.g., DVD, CD-ROM).
  • Memory 206 may store code 206C that, when executed by processor(s) 202, implements one or more acts of the method described with reference to FIG. 1, and/or store prefetching process 206B code as described herein.
  • Computing device 204 may include a data storage device 220 for storing data.
  • Data storage device 220 may be implemented as, for example, a memory, a local hard-drive, a removable storage unit, an optical disk, a storage device, and/or as a remote server and/or computing cloud (e.g., accessed using a network connection). It is noted that code instructions executable by processor(s) 202 may be stored in data storage device 220, for example, with executing portions loaded into memory 206 for execution by processor(s) 202.
  • Computing device 204 may be in communication with a user interface 222 that presents data to a user and/or includes a mechanism for entry of data, for example, one or more of a touch-screen, a display, a keyboard, a mouse, voice activated software, and a microphone.
  • Network 218 may be implemented as, for example, the internet, a local area network, a virtual private network, a virtual public network, a wireless network, a cellular network, a local bus, a point to point link (e.g., wired), and/or combinations of the aforementioned.
  • the cache is monitored to detect a cache event for a cache entry.
  • the cache may be monitored to detect cache access patterns. Examples of cache events and/or cache access patterns that are monitored for include a read miss for the cache entry and/or an eviction of the cache entry.
  • a memory storing the cache is divided into a higher cache tier of uncompressed cache entries and one or more lower cache tiers of compressed cache entries.
  • cache events and/or cache access patterns, and/or collected data parameters used to detect the cache event and/or cache access patterns include for example, reads, sequential reads, size of reads, writes, sequential writes, and size of writes, and statistical data parameters for data chunks (e.g., for each data chunk).
  • the access patterns and/or cache events are dynamically decayed.
  • the decay may be performed by multiplying a current parameter of the access pattern and/or cache event by a decay value less than 1, every time interval to obtain an adapted parameter of the access pattern and/or cache event.
  • Other decay approaches may be used, for example, linear, logarithmic, dynamic changing values, and the like.
  • the predicted normalized access parameter and/or cache event may be computed using the adapted parameter of the access pattern and/or cache event.
  • the decay value prevents increasing the value of the parameter of the access pattern and/or cache event indefinitely, and/or maintains the value of the parameter of the access pattern and/or cache event at a reasonable state that enables processing in a reasonable time. For example, every 5 minutes the number of reads (an example of the parameter of the access pattern) is multiplied by 0.99, such that if there are currently 100 reads, after 5 minutes the number of reads is reduced to 99.
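  • The following is a minimal sketch of the periodic decay described above; the 0.99 decay value and 5-minute interval are the examples given in the text, while the function and variable names are illustrative assumptions:

```python
DECAY_VALUE = 0.99            # multiply counters by this value...
DECAY_INTERVAL_S = 5 * 60     # ...once per interval (example: every 5 minutes)

def decay_counters(counters):
    """Decay every tracked access-pattern counter (e.g., read counts per chunk)."""
    for key in counters:
        counters[key] *= DECAY_VALUE

reads = {"chunk_42": 100.0}
decay_counters(reads)         # invoked once per DECAY_INTERVAL_S
print(reads["chunk_42"])      # -> 99.0, matching the 100 -> 99 example above
```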
  • the access pattern and/or cache event may be computed per individual data chunk (e.g., each data chunk), where an individual data chunk includes multiple sequentially stored data blocks.
  • Blocks may be the smallest granularity operated on by the storage system.
  • a user may read and/or write a single block and/or multiple blocks.
  • Blocks may be of a size between about 0.5-32 kilobytes (KB), or other ranges.
  • a cache entry may be a data chunk of multiple sequentially stored data blocks, rather than per block.
  • a chunk may be a continuous address space of, for example, 4 megabytes (MB) or other values. It is noted that cache entries are much smaller, for example, about 8 KB, where the 4 MB chunks may be chunks of data storage tiering.
  • Analyzing access patterns and/or cache events for cache entries per data chunk rather than per block reduces storage requirements and/or improves computational performance (e.g., processor utilization, processor time) in comparison to using very large data structures for storing access patterns and/or cache events per block.
  • Movement and/or eviction (e.g., as described with reference to 106 and/or 108) may be performed per data chunk.
  • the access pattern and/or cache event is computed by a tier up and tier down process that dynamically moves data chunks between the cache tier(s) and hierarchical storage (e.g., higher level data storage device and the lower level data storage device) for dynamic optimization.
  • Existing tier up and/or tier down processes that evaluate the hotness of an area of storage device(s) and/or probability the area will be read may be used to determine the access pattern and/or cache event and/or the analysis therefore to determine movement and/or eviction of the cache entry.
  • analysis of the access patterns and/or cache event for the cache entry is performed by computing a prediction of future access patterns for the cache entry.
  • the predicted future access patterns enable better allocation of the cache entry.
  • the prediction of future access patterns may be obtained as an outcome of a machine learning (ML) model, for example, a regressor, a neural network, a classifier, and the like.
  • the ML model may be trained on a training dataset of records, where each record includes a respective cache entry labelled with a ground truth label of historical access patterns and/or cache events.
  • the ML model may increase accuracy of the predictions, based on learning historical access patterns for cache entries.
  • Other approaches may be used to obtain the predicted future access patterns and/or cache events, for example, a set of rules, and/or mathematical prediction models.
  • the access patterns and/or cache events include prefetching patterns by a prefetching process.
  • Prefetching approaches may be analyzed to predict the future access patterns and/or cache events more accurately.
  • the prefetching pattern may be, for example, one or combination of: sequentially, stride (i.e., increase by fixed step each time), and/or randomly.
  • the prefetching process places the prefetched data components (that are located non-sequentially on the lower tier data storage device) on a higher tier data storage device and/or higher cache tier when the data component is not already stored on the higher tier data storage device and/or higher cache tier.
  • the prefetching process computes a probability of each of multiple candidate subsequent data chunks being accessed given a current data chunk being accessed, and prefetches the subsequent data chunk having highest probability when the current data chunk is being accessed.
  • the prefetching process that computes the probability enables selecting the data chunks for which highest accuracy is obtained for storage on the higher tier data storage device and/or higher cache tier, which improves performance of the higher tier data storage device and/or higher cache tier since stored data chunks are most likely to actually be accessed in the future over other components with lower probability which are kept on the lower tier data storage device and/or lower cache tier.
  • accuracy of the prefetching patterns is computed.
  • the prefetching pattern, e.g., the data component to be prefetched, may be predicted as described herein with reference to the believe cache process discussed below. As used herein, the term believe cache relates to a prefetch cache that predicts next location(s) which are not necessarily sequential.
  • the accuracy may be computed as a percentage of when the prefetching pattern has correctly prefetched the correct component, relative to all prefetching attempts including attempts where the prefetching pattern was unable to prefetch the correct component.
  • Two or more prefetching patterns having accuracy above a threshold may be selected. The threshold may be, for example, 20%, 25%, 30%, 40%, 45%, 50%, or other values. Two or more prefetching patterns with highest accuracy are selected, since such prefetching patterns are most likely to be re-selected in the future.
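  • A minimal sketch of the accuracy computation and threshold-based selection described above; the per-pattern counts and names below are hypothetical and only illustrate the calculation:

```python
def prefetch_accuracy(correct, attempts):
    # Accuracy = correct prefetches / all prefetch attempts, including attempts
    # where the pattern failed to prefetch the correct component.
    return correct / attempts if attempts else 0.0

# Hypothetical counts of (correct prefetches, total attempts) per pattern.
patterns = {"sequential": (45, 100), "stride": (30, 100), "believe_cache": (55, 100)}
threshold = 0.40
selected = [name for name, (c, a) in patterns.items()
            if prefetch_accuracy(c, a) > threshold]
print(selected)   # -> ['sequential', 'believe_cache']
```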
  • the prefetching process is based on computing conditional probabilities of a next access (e.g., read) location based on a current access (e.g., read) location, sometimes referred to as believe cache prefetching.
  • the prefetching process (e.g., believe cache prefetching) computes probability of each of multiple candidate subsequent data components being accessed given a current data component being accessed, and prefetches the subsequent data component having highest probability when the current data component is being accessed.
  • the prefetching process computes the probability of the prefetching pattern fetching each of multiple candidate components.
  • the data may be prefetched from the next access location when the conditional probability is above a threshold.
  • cache prefetching may be used, for example, when access to data storage is non-sequential but follows a repeatable pattern, for example, striding access (i.e., each time increasing the address by a fixed amount relative to the current access), and/or another repeatable pattern which may at first appear to be random.
  • the next location to be accessed is computed based on the current and/or previous locations that were accessed, based on absolute address locations and/or relative address locations. An exemplary computation is now described:
  • when a first location (denoted A) is accessed, the following memory locations are subsequently accessed multiple times: a second location (denoted X) is accessed 10 times, a third location (denoted Y) is accessed 3 times, and a fourth location (denoted Z) is accessed 5 times.
  • when a fifth location (denoted B) is accessed, the following memory locations are subsequently accessed multiple times: the second location (denoted X) is accessed 6 times, the third location (denoted Y) is accessed 2 times, the fourth location (denoted Z) is accessed 4 times, and a sixth location (denoted K) is accessed 7 times.
  • the recommendation for which data location to prefetch from may be computed by calculating the candidate probability of each of the following locations: X, Y, Z, K.
  • the probabilities are sorted to rank the most likely next locations from where prefetching of data is obtained.
  • One or more prefetching patterns may be accessed, for example, a single prefetch, two prefetches, or more, and/or according to a threshold.
  • the first prefetch is from location X.
  • the second prefetch is from location Z.
  • the third prefetch is from location K. If a threshold of 50% is used, data is prefetched from locations X and Z.
  • in this computation, the prefetch locations (i.e., X, Y, Z, K) serve as candidates and the current access locations (i.e., A, B) serve as voters, related through a relation matrix (e.g., curHis: A B).
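  • A minimal sketch of the voting computation in the example above, using the counts observed after locations A and B; the rule of keeping candidates whose score is at least the threshold fraction of the top candidate's score is an assumption that reproduces the X-and-Z outcome described in the text:

```python
# Historical next-access counts (the "relation matrix"): voter -> candidate -> votes.
votes = {
    "A": {"X": 10, "Y": 3, "Z": 5},
    "B": {"X": 6, "Y": 2, "Z": 4, "K": 7},
}

def rank_prefetch_candidates(current_history, threshold=0.5):
    scores = {}
    for voter in current_history:                      # e.g., recently accessed A and B
        for candidate, count in votes.get(voter, {}).items():
            scores[candidate] = scores.get(candidate, 0) + count
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    top_score = ranked[0][1]
    return [loc for loc, score in ranked if score >= threshold * top_score]

print(rank_prefetch_candidates(["A", "B"]))   # -> ['X', 'Z'] (X=16, Z=9, K=7, Y=5)
```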
  • one or more compression performance parameters for the cache entry are computed.
  • the compression performance parameter includes a decompression performance parameter.
  • the decompression performance parameter indicates additional latency incurred for accessing the cache entry in the compressed state when stored on one of the lower cache tiers, or accessing the cache entry stored in the uncompressed state on the at least one lower cache tier.
  • the additional latency incurred by decompressing the cache entry stored in the lower tier cache before serving the uncompressed data may be insignificant in some cases, or significant in other cases, for example, depending on the capability and/or availability of the CPU. When the additional incurred latency is not significant, access performance may be improved. The added latency for reading the decompressed data affects the cache tier to which the cache entry is moved, since the goal is to reduce overall average latency when serving data and/or performing IO operations, for example, according to whether the incurred latency is significant or not.
  • the compression performance parameter is computed based on a predicted hit rate of the cache entry indicating a probability that the cache entry will be read again during a future time interval. Cache entries with higher hit rates are better candidates for remaining in the cache, to avoid cache misses.
  • the hit rate may serve to “normalize” the added latency times, and enable comparison of the “normalized” latency times.
  • the compression performance parameter includes a reduction in storage requirement for storing the cache entry in the compressed state at a compression ratio defined by a specific compression process. Cache entries that are highly compressible are better candidates for remaining in the compressed cache, since they take up less space than other cache entries that are less compressible.
  • the compression ratio of the cache entry may be computed for a respective target compression process that runs on a respective lower tier cache for compressing the cache entry for storage on the respective lower tier cache.
  • Different compression processes have different compression ratios.
  • the compression ratio is computed using the target compression process that runs on respective lower tier caches where the cache entry may be moved to, in order to help evaluate where the cache entry will be moved to.
  • the compression performance parameter includes an impact on average latency of one or more of (i) accessing cache entries of the higher cache tier when storing the cache entry in an uncompressed state on the higher cache tier, and (ii) accessing cache entries of the lower cache tier(s) when storing the cache entry in the compressed state on the lower cache tier(s).
  • the cache entry is stored on and/or evicted from, the higher cache tier and/or the at least one lower cache tier, to improve the average latency.
  • the improvement may be in obtaining an optimal combination of reducing increased latency from decompressing the compressed cache entry and/or in reducing processor utilization from decompressing the compressed cache entry (or when accessing the uncompressed cache entry rather than decompressing the compressed cache entry).
  • Compression ratio and decompression performance may be considered as a combination, since the decompression performance may be impacted by the compression ratio (e.g., function of).
  • the decompression process may work twice as fast as in a case where the compression ratio is 2X.
  • the cache entry may be moved to the higher cache tier and/or moved to one of the lower cache tiers according to the compression parameter.
  • the cache tiers described herein may be implemented as different cache layers.
  • the cache tiers are stored in the memory (e.g., DRAM, SCM).
  • the cache tiers are not stored on another data storage device such as hard disk and/or SSD.
  • the hard disk and/or SSD, which may be a part of a hierarchical storage system, serve as tiered data storage devices that store data storage tiers, and do not store the cache tiers (e.g., cache layers).
  • the data storage tiers (e.g., stored by the hierarchical storage system, such as hard disk and/or SSD) are distinct from the cache tiers stored in the memory (e.g., DRAM, SCM).
  • the data storage tiers are persistent. Data is usually stored in only one data storage tier (e.g., the higher tier, or the lower tier).
  • the cache is volatile, and data stored in the cache is also stored on a data storage tier.
  • the cache entry is moved to one of the tiers of the cache on a read miss, when the compression performance parameter(s) are already known.
  • the compression ratio of the data and/or the decompression performance (e.g., amount of processor utilization and/or delay incurred due to decompression of the data to be served) may be obtained once the decompressed cache entry is served.
  • the known compression performance parameter(s) are used to select which of the lower tiers of the cache the cache entry is inserted to in a compressed state, or whether to keep the cache entry in the uncompressed state and insert into the higher tier of the cache.
  • each such IO is relevant for one compressed lower cache tier and/or not relevant for other compressed lower cache tiers, or not relevant to any lower cache tier, in which case the cache entry is to be kept uncompressed in the higher-level cache tier.
  • the number of cache tiers and/or thresholds and/or values for moving cache entries into each cache tier may be selected for optimal performance, for example, according to available processing resources.
  • the number of cache tiers may be decided in advance.
  • the cache entry is moved to a first lower cache tier when the compression parameter indicates a compression ratio being above a first compression threshold and a decompression performance indicating latency incurred for accessing the cache entry in a compressed state being below a first latency threshold.
  • the cache entry is moved to the first lower cache tier, when the compression ratio is above the first compression ratio threshold of 3, and latency is below the first latency threshold of 10 microseconds.
  • the cache entry is moved to a second lower cache tier when the compression parameter indicates the compression ratio is above a second compression threshold and lower than the first compression threshold, and the decompression performance is below a second latency threshold greater than the first latency threshold.
  • the cache entry is moved to the second lower cache tier, when the compression ratio is above the second compression threshold of 2, but below the first compression threshold of 3, and latency is below the second latency threshold 30 microseconds.
  • the cache entry is moved according to a minimum value of a combination of compression ratio and maximum latency. For example, the cache entry is moved when the compression ratio is at least 2 and the latency is at least 5. Otherwise, the cache entry does not fall into any queue.
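  • A minimal sketch of the two-lower-tier placement rule described above, mirroring the example thresholds from the text (compression ratio 3 with 10 microseconds latency for the first lower tier, compression ratio 2 with 30 microseconds latency for the second); the function name and return labels are assumptions:

```python
def choose_cache_tier(compression_ratio, decompression_latency_us):
    """Decide where a cache entry is inserted on a read miss (illustrative only)."""
    if compression_ratio > 3.0 and decompression_latency_us < 10.0:
        return "lower_tier_1"    # highly compressible and cheap to decompress
    if 2.0 < compression_ratio <= 3.0 and decompression_latency_us < 30.0:
        return "lower_tier_2"    # moderately compressible, moderate latency
    return "higher_tier"         # kept uncompressed in the higher cache tier

print(choose_cache_tier(3.5, 8.0))    # -> lower_tier_1
print(choose_cache_tier(2.4, 25.0))   # -> lower_tier_2
print(choose_cache_tier(1.5, 5.0))    # -> higher_tier
```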
  • the read miss may be detected for the cache entry stored on one of the lower tier caches.
  • the cache entry may be moved to the higher cache tier storing cache entries in an uncompressed state.
  • the cache entry may be moved to the higher cache tier when a hit score of the cache entry is higher than a threshold computed as a function of hit scores of cache entries stored on the higher cache tier.
  • the cache entry may be moved to the higher cache tier when doing so provides optimal performance in terms of reduced latency and/or high hit rate relative to current cache entries stored on the higher cache tier.
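  • A minimal sketch of the promotion decision described above; the text only says the threshold is a function of the hit scores of the entries already in the higher cache tier, so the mean used here is an assumption:

```python
def should_promote(entry_hit_score, higher_tier_hit_scores):
    # Promote the entry on a read miss when its hit score beats the (assumed)
    # threshold: the average hit score of entries currently in the higher tier.
    if not higher_tier_hit_scores:
        return True
    threshold = sum(higher_tier_hit_scores) / len(higher_tier_hit_scores)
    return entry_hit_score > threshold

print(should_promote(12.0, [4.0, 9.0, 15.0]))   # -> True (threshold ~9.3)
```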
  • the cache entry is evicted from the higher cache tier and/or from one of the lower cache tiers according to the compression parameter.
  • Eviction of data from the lower cache tier may be according to a least recently used (LRU) policy.
  • the cache entry having a lowest hit score is selected for eviction.
  • Data with the lowest hit score is unlikely to be accessed again, and therefore represents the best candidate for eviction.
  • the score of a page of data in the respective lower cache tier may be based on the number of hits, optionally decayed over time.
  • the cache entry having lowest hit score from the multiple lower tier caches is evicted. For each of the lower tier caches, the cache entry having lowest value may be evicted.
  • the hit score may be a function of a compression factor of the respective lower cache tier. Higher hit scores are obtained for higher compression factors and lower hit scores are obtained for lower compression factors.
  • the compression factor helps maintain data that is more highly compressed, and/or helps evict data that is less compressible. Data with higher compression factors is to be kept in the compressed lower cache tiers with higher priority.
  • the hit score may be multiplied by the compression factor, or other functions may be used.
  • the hit score may be a function of the latency of decompressing the compressed cache entry.
  • Hit scores may be “normalized” by latency, enabling comparing total predicted latency times, based on predicted hit rates. Higher hit scores are obtained for lower latency and lower hit scores are obtained for higher latency.
  • the hit score may be a combination of the compression factor and latency.
  • the respective hit score is decayed over a time interval and based on a last access time.
  • When cache entries have not been accessed frequently recently, they are predicted to also not be accessed frequently in the future. The decay enables removal of the non-frequently read cache entry even when the cache entry used to be frequently read.
  • optionally, the hit score is multiplied by a decay factor (e.g., denoted d_l) and/or by a function (e.g., denoted g(d_l)) that adjusts the hit score based on the added latency.
  • the hit score of the cache entry being evicted may be compared to hit scores of cache entries in other cache tiers.
  • the cache entry may be moved to a specific cache tier when the hit score of the cache entry is higher than a lowest hit score of a current cache entry located in the specific cache tier, which is lower than hit scores of cache entries in the other cache tiers. If the hit score of the evicted entry being moved is lower than all cache entries in the selected lower cache tier, adding the evicted entry will simply cause it to be evicted again from the selected lower cache tier.
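  • The following is a minimal sketch combining the hit-score factors discussed above (hit count, compression factor, decompression latency, and decay based on last access time) and the eviction choice; the exact weighting functions and the 5-minute half-life are assumptions, since the text only states the direction of each effect:

```python
import time

DECAY_HALF_LIFE_S = 300.0     # assumed: score halves after 5 minutes without access

def hit_score(hits, compression_factor, decompression_latency_us, last_access_ts, now=None):
    now = time.time() if now is None else now
    decay = 0.5 ** (max(0.0, now - last_access_ts) / DECAY_HALF_LIFE_S)
    latency_weight = 1.0 / (1.0 + decompression_latency_us)    # lower latency -> higher score
    return hits * compression_factor * latency_weight * decay  # higher compression -> higher score

def select_eviction_victim(entries, now=None):
    """entries: id -> (hits, compression_factor, latency_us, last_access_ts)."""
    return min(entries, key=lambda e: hit_score(*entries[e], now=now))

now = time.time()
entries = {
    "a": (50, 3.0, 5.0, now - 60),     # hot, highly compressible, low latency
    "b": (10, 1.5, 20.0, now - 900),   # cold, poorly compressible, high latency
}
print(select_eviction_victim(entries, now=now))   # -> 'b'
```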
  • one or more features described with reference to 102-108 may be iterated, for example, for different cache entries.
  • the allocation of the memory into the multiple lower cache tiers and/or to the higher cache tier may be dynamic, for example, adapted per iteration or multiple iterations.
  • the dynamic allocation may be performed according to a set of rules, for example, the size of a certain cache tier is expanded when the set of rules indicates that more storage space is needed, or reduced when the set of rules indicates that the existing storage space is not being used.
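  • A minimal sketch of a rule-based dynamic resize of a cache tier; the utilization thresholds and step size are assumptions, since the text describes the rules only abstractly:

```python
def adjust_tier_size(current_size, used, grow_step):
    utilization = used / current_size
    if utilization > 0.9:    # rule: more storage space is needed -> expand
        return current_size + grow_step
    if utilization < 0.5:    # rule: existing space is not being used -> shrink
        return max(grow_step, current_size - grow_step)
    return current_size      # otherwise leave the allocation unchanged

print(adjust_tier_size(100, 95, 10))   # -> 110 (expand)
print(adjust_tier_size(100, 30, 10))   # -> 90  (shrink)
```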
  • composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • a compound or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • exemplary is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
  • word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment of the disclosure may include a plurality of “optional” features unless such features conflict.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the disclosure. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
  • a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range.
  • the phrases “ranging/ranges between” a first indicated number and a second indicated number and “ranging/ranges from” a first indicated number “to” a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

There is provided a computing device for management of a memory divided into a higher cache tier that stores cache entries in an uncompressed state and lower cache tier(s) that store cache entries in a compressed state. Compression performance parameter(s) are computed for a cache entry. The compression performance parameter(s) may indicate latency incurred for accessing the cache entry in the compressed state due to the extra time for decompressing the cache entry, or in the uncompressed state due to the extra time for accessing more data which may be stored non-sequentially. The compression performance parameter may be computed based on: a predicted hit rate of the cache entry, and/or a reduction in storage requirement for storing the cache entry in the compressed state. The cache entry is moved to the higher cache tier or one of the lower cache tiers and/or evicted according to the compression performance parameter.

Description

Compressed cache as a cache tier
BACKGROUND
The present disclosure, in some embodiments thereof, relates to caches and, more specifically, but not exclusively, to systems and methods for managing caches for improving access times.
A cache is a hardware or software component that stores data so that future requests for that data can be served faster. A typical storage system may have DRAM memory for cache, which serves IOs very fast, SCM (storage class memory), which allows persistent fast random IOs but is slower than DRAM, and a solid-state drive (SSD) tier, which allows relatively fast random access for reads and writes. The SSD tier is sometimes also used for caching. The storage system typically includes a hard disk drive (HDD) tier, which allows for relatively fast sequential reading and writing, but has very poor performance for random IOs as the seek times in an HDD are very high and can be up to 10 milliseconds (ms).
SUMMARY
It is an object of the present disclosure to provide a computing device, a system, a computer program product, and a method for hierarchical storage management.
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect, a computing device for hierarchical storage management is configured for: detecting at least one of a read miss for a cache entry on a cache of a memory, and eviction of the cache entry, wherein the memory is divided into a higher cache tier of uncompressed cache entries and at least one lower cache tier of compressed cache entries, computing at least one compression performance parameter for the cache entry, and moving the cache entry to, and/or evicting the cache entry from, the higher cache tier and/or at least one lower cache tier according to the compression parameter.
According to a second aspect, a method of hierarchical storage management comprises: detecting at least one of a read miss for a cache entry on a cache of a memory, and eviction of the cache entry, wherein the memory is divided into a higher cache tier of uncompressed cache entries and at least one lower cache tier of compressed cache entries, computing at least one compression performance parameter for the cache entry, and moving the cache entry to, and/or evicting the cache entry from, the higher cache tier and/or at least one lower cache tier according to the compression parameter.
According to a third aspect, a non-transitory medium storing program instructions for hierarchical storage management, which, when executed by a processor, cause the processor to: detect at least one of a read miss for a cache entry on a cache of a memory, and eviction of the cache entry, wherein the memory is divided into a higher cache tier of uncompressed cache entries and at least one lower cache tier of compressed cache entries, compute at least one compression performance parameter for the cache entry, and move the cache entry to, and/or evict the cache entry from, the higher cache tier and/or at least one lower cache tier according to the compression parameter.
Using multiple lower tiers of the compressed cache helps optimize performance, for example, to reduce overall average latency when serving data and/or servicing IO requests. The more compressible the data is, the less CPU it requires to decompress, and the lower the decompression time.
In a further implementation form of the first, second, and third aspects, the at least one compression performance parameter comprises a decompression performance indicating latency incurred for accessing the cache entry in a compressed state when stored on the at least one lower cache tier or in the uncompressed state when stored on the at least one lower cache tier.
When the additional incurred latency is not significant, access performance is improved.
In a further implementation form of the first, second, and third aspects, the at least one compression performance parameter is computed based on a predicted hit rate of the cache entry indicating a probability that the cache entry will be read again during a future time interval.
Cache entries with higher hit rates are better candidates for remaining in the cache, to avoid cache misses. The hit rate may serve to “normalize” the added latency times, and enable comparison of the “normalized” latency times.
In a further implementation form of the first, second, and third aspects, the at least one compression performance parameter further comprises a reduction in storage requirement for storing the cache entry in the compressed state at a compression ratio defined by a specific compression process.
Cache entries that are highly compressible are better candidates for remaining in the cache, since they take up less space than other cache entries that are less compressible.
In a further implementation form of the first, second, and third aspects, the at least one compression performance parameter further comprises an impact on average latency of at least one of (i) accessing cache entries of the higher cache tier when storing the cache entry in an uncompressed state on the higher cache tier and (ii) accessing cache entries of the at least one lower cache tier when storing the cache entry in the compressed state on the at least one lower cache tier, wherein the cache entry is stored on and/or evicted from, the higher cache tier and/or the at least one lower cache tier, to improve the average latency.
Considering a combination of the compression ratio and the decompression performance for moving the cache entry into one of the multiple defined lower tiers of compressed cache or into the higher tier of uncompressed cache, improves overall performance for handling cache read misses.
In a further implementation form of the first, second, and third aspects, the at least one lower cache tier comprises two cache tiers, wherein the cache entry is moved to a first lower cache tier when the at least one compression parameter indicates a compression ratio being above a first compression threshold and a decompression performance indicating latency incurred for accessing the cache entry in a compressed state being below a first latency threshold, and the cache entry is moved to a second lower cache tier when the at least one compression parameter indicates the compression ratio being above a second compression threshold and lower than the first compression threshold and the decompression performance is below a second latency threshold greater than the first latency threshold.
The number of cache tiers and/or the thresholds and/or values for moving cache entries into each cache tier may be selected for optimal performance, for example, according to available processing resources.
In a further implementation form of the first, second, and third aspects, a compression ratio of the cache entry is computed for a respective target compression process that runs on a respective lower tier cache for compressing the cache entry for storage on the respective lower tier cache.
The compression ratio is computed using the target compression process that runs on respective lower tier caches where the cache entry may be moved to, in order to help evaluate where the cache entry will be moved to.
In a further implementation form of the first, second, and third aspects, in response to the cache being full, the cache entry having the lowest respective hit score is selected for eviction.
Data with the lowest hit score is unlikely to be accessed again, which makes it the best candidate for eviction.
In a further implementation form of the first, second, and third aspects, the respective hit score is a function of a respective compression factor of the respective lower cache tier, wherein higher hit scores are obtained for higher compression factors and lower hit scores are obtained for lower compression factors.
The compression factor helps maintain data that is more highly compressed, and/or helps evict data that is less compressible. Data with higher compression factors is to be kept in the compressed lower cache tiers with higher priority.
In a further implementation form of the first, second, and third aspects, the respective hit score is a function of the latency of decompressing the compressed cache entry, wherein higher hit scores are obtained for lower latency and lower hit scores are obtained for higher latency. Hit scores may be “normalized” by latency, enabling comparing total predicted latency times, based on predicted hit rates.
In a further implementation form of the first, second, and third aspects, the respective hit score is decayed over a time interval and based on a last access time.
When cache entries have not been accessed frequently recently, they are predicted not to be accessed frequently in the future. The decay enables removal of a non-frequently read cache entry even when the cache entry used to be frequently read.
In a further implementation form of the first, second, and third aspects, eviction of data from the at least one lower cache tier is according to a least recently used (LRU) policy.
In a further implementation form of the first, second, and third aspects, further comprising comparing a hit score of the cache entry being evicted to hit scores of cache entries in other cache tiers, and moving the cache entry when the hit score of the cache entry is higher than a lowest hit score of a current cache entry located in another cache tier.
If the hit score of the evicted entry being moved is lower than all cache entries in the selected lower cache tier, adding the evicted entry will simply cause it to be evicted again from the selected lower cache tier.
In a further implementation form of the first, second, and third aspects, the read miss is detected for the cache entry stored on the at least one lower tier cache, and wherein moving comprises moving the cache entry to the higher cache tier storing cache entries in an uncompressed state.
The cache entry may be moved to the higher cache tier when doing so provides optimal performance in terms of reduced latency and/or high hit rate relative to current cache entries stored on the higher cache tier.
In a further implementation form of the first, second, and third aspects, the cache entry is moved to the higher cache tier when a hit score of the cache entry is higher than a threshold computed as a function of hit scores of cache entries stored on the higher cache tier.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the disclosure pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the disclosure, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Some embodiments of the disclosure are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the disclosure. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the disclosure may be practiced.
In the drawings:
FIG. 1 is a flowchart of a method of moving a cache entry to, and/or evicting the cache entry from, a lower cache tier of compressed cache entries or a higher cache tier of uncompressed cache entries according to a compression performance parameter of the cache entry, in accordance with some embodiments; and
FIG. 2 is a block diagram of components of a system for moving a cache entry to, and/or evicting the cache entry from, a lower cache tier of compressed cache entries or a higher cache tier of uncompressed cache entries according to a compression performance parameter of the cache entry, in accordance with some embodiments.
DETAILED DESCRIPTION
The present disclosure, in some embodiments thereof, relates to caches and, more specifically, but not exclusively, to systems and methods for managing caches for improving access times.

An aspect of some embodiments relates to systems, methods, a computing device and/or apparatus, and/or computer program product (storing code instructions executable by one or more processors) for management of a memory (e.g., DRAM, SCM) divided into a higher cache tier that stores cache entries in an uncompressed state and one or more lower cache tiers that store cache entries in a compressed state. A cache event, such as a read miss for a cache entry and/or eviction of the cache entry, is detected. One or more compression performance parameters are computed for the cache entry. The compression performance parameter may indicate latency incurred for accessing the cache entry in the compressed state, due to the extra time for decompressing the cache entry, or in the uncompressed state, due to the extra time for accessing more data, which may be stored non-sequentially. The compression performance parameter may be computed based on a predicted hit rate of the cache entry. The compression performance parameter may be computed based on a reduction in storage requirement for storing the cache entry in the compressed state. The cache entry is moved and/or evicted according to the compression performance parameter. The cache entry may be moved to the higher cache tier or to one of the lower cache tiers.
At least some implementations described herein utilize multiple defined lower tiers of a cache that store compressed cache entries, in addition to a higher tier of the cache that stores uncompressed cache entries. Using multiple lower tiers of the compressed cache helps optimize performance, for example, to reduce overall average latency when serving data and/or servicing IO requests. The more compressible the data is, the less CPU is required to decompress it, and the lower the decompression time.
Cache of a storage system is usually not kept compressed due to performance issues, such as increased processing to compress and decompress the data and/or processor utilization to compress and decompress the data. Compressed caching is a method used to improve the mean access time to memory pages. It inserts a new level into the virtual memory hierarchy, where a portion of main memory is allocated for the compressed cache and is used to store pages compressed by data compression algorithms. Storing a number of pages in compressed format increases effective memory size and, for most workloads, this enlargement reduces the number of accesses to backing store devices, typically slow hard disks. This method takes advantage of the gap between the CPU processing power and disk latency time. In standard compressed cache systems, the cache memory is divided between uncompressed cache and compressed cache. Pages evicted from the uncompressed cache are compressed and moved to the compressed cache tier. Pages read from the compressed cache tier are decompressed and moved to the uncompressed cache tier. In some prior approaches, the size of the compressed and uncompressed cache is dynamically decided. Standard cache eviction mechanisms are used to move data from the uncompressed cache to the compressed cache. Such prior approaches do not take into account the compression ratio in the eviction scheme. At best, just the total compression ratio of the cache may be considered.
Different compression processes yield different compression ratios and/or different performance. As an example, modern compression processes may decompress at a speed of more than 4.5 gigabytes per second (GB/sec) on a single CPU. This means that decompression of 8 kilobytes (KB), for example, takes about 2 microseconds, significantly faster than NAND flash, but slower than SCM memory, which serves the data in the 1 microsecond range.
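For illustration, the arithmetic above can be sketched as follows (a minimal sketch; the 4.5 GB/sec figure is the example given above, while the 8 KB entry size and the helper name are assumptions):

```python
# Rough decompression-latency estimate for a single cache entry, assuming a
# decompression throughput of about 4.5 GB/sec on a single CPU core.
DECOMPRESSION_THROUGHPUT_BPS = 4.5e9  # bytes per second (example figure above)


def decompression_latency_us(entry_size_bytes: int) -> float:
    """Estimated decompression time in microseconds."""
    return entry_size_bytes / DECOMPRESSION_THROUGHPUT_BPS * 1e6


print(decompression_latency_us(8 * 1024))  # ~1.8 microseconds for an 8 KB entry
```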
Before explaining at least one embodiment of the disclosure in detail, it is to be understood that the disclosure is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The disclosure is capable of other embodiments or of being practiced or carried out in various ways.
The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is now made to FIG. 1, which is a flowchart of a method of moving a cache entry to, and/or evicting the cache entry from, a lower cache tier of compressed cache entries or a higher cache tier of uncompressed cache entries according to a compression performance parameter of the cache entry, in accordance with some embodiments. Reference is also made to FIG. 2, which is a block diagram of components of a system 200 for moving a cache entry to, and/or evicting the cache entry from, a lower cache tier of compressed cache entries or a higher cache tier of uncompressed cache entries according to a compression performance parameter of the cache entry, in accordance with some embodiments. System 200 may implement the acts of the method described with reference to FIG. 1, by processor(s) 202 of a computing device 204 executing code instructions (e.g., code 206C) stored in a memory 206.
Computing device 204 manages a cache portion 206A of memory 206. Cache portion 206A includes one or more lower cache tiers 206A-1, and a higher cache tier 206A-2. Lower cache tier(s) 206A-1 stores cache entries in a compressed state. Different cache tiers may store cache entries of different compression ratios, and/or incurring different latencies as a result of decompressing the compressed cache entries, as described herein. Higher cache tier 206A-2 stores cache entries in an uncompressed state.
Memory 206 storing cache portion 206A may be implemented, for example, as dynamic random-access memory (DRAM) and/or storage class memory (SCM). Memory 206 storing cache portion 206A may be selected to have low access times. Cost of memory 206 storing cache portion 206A may be high, limiting the amount of storage available for the cache tiers.
Computing device 204 may further manage a hierarchical storage 208 that includes at least a lower tier data storage device 210 and a higher tier data storage device 212. Data chunks may be moved between cache portion 206A and hierarchical storage 208. Lower tier data storage device 210 may store a lower data storage tier. Higher tier data storage device 212 may store a higher data storage tier.
Computing device 204 may use a prefetching process 206B (e.g., stored on memory 206, executed by processor(s) 202) for prefetching data from lower tier data storage device 210, as described herein. The prefetching process 206B may predict the location of the next data component before the data component is being requested, and fetch the next data component before the request. The prefetching may be performed from lower tier data storage device 210, saving room on higher tier data storage device 212. Some prefetching processes 206B are designed to predict locations of data components that are non-sequentially located, for example, located in a striding pattern (e.g., increase by a fixed address location relative to the previous address location) and/or in a constant address pattern that may at first appear to be random.
Lower tier data storage device 210 has relatively slower random-access input/output (IO) (e.g., read) times in comparison to higher tier data storage device 212. Higher tier data storage device 212 has relatively faster random I/O (e.g., read and/or write) times in comparison to lower tier data storage device 210.
Lower tier data storage device 210 may cost less (e.g., per megabyte) in comparison to higher tier data storage device 212.
Lower tier data storage device 210 may be implemented, for example, as a hard disk drive (HDD). Lower tier data storage device 210 may provide fast sequential reading and/or writing, but has poor performance for random I/O as the seek times may be very high (e.g., up to 10 milliseconds).
Higher tier data storage device 212 may be implemented, for example, as a solid-state drive (SSD), and/or phase-change memory (PCM).
Higher tier data storage device 212 may serve as a cache and/or a tier (e.g., cache when data is volatile and has a copy in the lower tier, and/or tier when the data is nonvolatile and/or may be kept (e.g., only) in the higher tier) for lower tier data storage device 210.
Cache portion 206A may serve as the cache for hierarchical storage 208, such as for cache entries with highest hit rates.
Hierarchical storage 208 is in communication with a computing system 214, which stores data on hierarchical storage 208 and/or reads data stored on hierarchical storage 208. Hierarchical storage 208 may be integrated within computing system 214, and/or may be implemented as an external storage device. Computing system 214 may be indirectly connected to hierarchical storage 208 via computing device 204, i.e., computing system 214 may communicate with computing device 204, where computing device 204 communicates with hierarchical storage 208, rather than computing system 214 directly communicating with hierarchical storage 208.
Computing system 214 and/or computing device 204 may be implemented as, for example, one or more of a computing cloud, a cloud network, a computer network, a virtual machine(s) (e.g., hypervisor, virtual server), a network node (e.g., switch, a virtual network, a router, a virtual router), a single computing device (e.g., client terminal), a group of computing devices arranged in parallel, a network server, a web server, a storage server, a local server, a remote server, a client terminal, a mobile device, a stationary device, a kiosk, a smartphone, a laptop, a tablet computer, a wearable computing device, a glasses computing device, a watch computing device, and a desktop computer.
Optionally, hierarchical storage 208 is used exclusively by a single user, such as computing system 214. Alternatively, hierarchical storage 208 is used by multiple users, such as multiple client terminals 216 accessing hierarchical storage 208 over a network 218, for example, when computing system 214 provides cloud storage services and/or virtual storage services to client terminals 216.
Computing device 204 may be implemented as, for example, integrated within hierarchical storage 208 (e.g., as hardware and/or software installed within hierarchical storage 208), integrated within computing system 214 (e.g., as hardware and/or software installed within computing system 214, such as an accelerator chip and/or code stored on a memory of computing system 214 and executed by processor of computing system 214), and/or as an external component (e.g., implemented as hardware and/or software) in communication with hierarchical storage 208, such as a plug-in component. Optionally, hierarchical storage 208 and computing device 204 are implemented as one storage system that exposes storage (e.g., functions, features, capabilities) to computing system(s) 214.
Computing device 204 includes one or more processor(s) 202, implemented as for example, central processing unit(s) (CPU), graphics processing unit(s) (GPU), field programmable gate array(s) (FPGA), digital signal processor(s) (DSP), application specific integrated circuit(s) (ASIC), customized circuit(s), processors for interfacing with other units, and/or specialized hardware accelerators. Processor(s) 202 may be implemented as a single processor, a multi-core processor, and/or a cluster of processors arranged for parallel processing (which may include homogenous and/or heterogeneous processor architectures). It is noted that processor(s) 202 may be designed to implement in hardware one or more features stored as code instructions 206C and/or 206B.
Memory 206 stores code instructions implementable by processor(s) 202, for example, a random-access memory (RAM), read-only memory (ROM), and/or a storage device, for example, non-volatile memory, magnetic media, semiconductor memory devices, hard drive, removable storage, and optical media (e.g., DVD, CD-ROM). Memory 206 may store code 206C that, when executed by processor(s) 202, implements one or more acts of the method described with reference to FIG. 1, and/or store prefetching process 206B code as described herein.
Computing device 204 may include a data storage device 220 for storing data. Data storage device 220 may be implemented as, for example, a memory, a local hard-drive, a removable storage unit, an optical disk, a storage device, and/or as a remote server and/or computing cloud (e.g., accessed using a network connection). It is noted that code instructions executable by processor(s) 202 may be stored in data storage device 220, for example, with executing portions loaded into memory 206 for execution by processor(s) 202.
Computing device 204 (and/or computing system 214) may be in communication with a user interface 222 that presents data to a user and/or includes a mechanism for entry of data, for example, one or more of a touch-screen, a display, a keyboard, a mouse, voice activated software, and a microphone.
Network 218 may be implemented as, for example, the internet, a local area network, a virtual private network, a virtual public network, a wireless network, a cellular network, a local bus, a point to point link (e.g., wired), and/or combinations of the aforementioned.
At 102, the cache is monitored to detect a cache event for a cache entry. The cache may be monitored to detect cache access patterns. Examples of cache events and/or cache access patterns that are monitored for include a read miss for the cache entry and/or an eviction of the cache entry.
A memory storing the cache is divided into a higher cache tier of uncompressed cache entries and one or more lower cache tiers of compressed cache entries.
Other cache events and/or cache access patterns, and/or collected data parameters used to detect the cache event and/or cache access patterns, include for example, reads, sequential reads, size of reads, writes, sequential writes, and size of writes, and statistical data parameters for data chunks (e.g., for each data chunk).
Optionally, the access patterns and/or cache events are dynamically decayed. The decay may be performed by multiplying a current parameter of the access pattern and/or cache event by a decay value less than 1 every time interval, to obtain an adapted parameter of the access pattern and/or cache event. Other decay approaches may be used, for example, linear, logarithmic, dynamically changing values, and the like. The predicted normalized access parameter and/or cache event may be computed using the adapted parameter of the access pattern and/or cache event. The decay value prevents increasing the value of the parameter of the access pattern and/or cache event indefinitely, and/or maintains the value of the parameter of the access pattern and/or cache event at a reasonable state that enables processing in a reasonable time. For example, every 5 minutes the number of reads (an example of the parameter of the access pattern) is multiplied by 0.99, such that if there are currently 100 reads, after 5 minutes the number of reads is reduced to 99.
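A minimal sketch of this multiplicative decay, assuming a periodic maintenance pass over per-chunk counters (the class and field names are illustrative; the 5-minute interval and the 0.99 factor follow the example above):

```python
import time

DECAY_FACTOR = 0.99          # decay value < 1, per the example above
DECAY_INTERVAL_SEC = 5 * 60  # decay applied every 5 minutes


class AccessStats:
    """Per-chunk access statistics with periodic multiplicative decay."""

    def __init__(self):
        self.reads = 0.0
        self.last_decay = time.monotonic()

    def record_read(self):
        self.reads += 1

    def apply_decay(self):
        # One decay step per elapsed interval keeps the counter bounded.
        now = time.monotonic()
        while now - self.last_decay >= DECAY_INTERVAL_SEC:
            self.reads *= DECAY_FACTOR
            self.last_decay += DECAY_INTERVAL_SEC


# Example: 100 recorded reads decay to 99 after one 5-minute interval.
```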
The access pattern and/or cache event may be computed per individual data chunk (e.g., each data chunk), where an individual data chunk includes multiple sequentially stored data blocks. Blocks may be the smallest granularity operated on by the storage system. A user may read and/or write a single block and/or multiple blocks. Blocks may be of a size between about 0.5-32 kilobytes (KB), or other ranges. A cache entry may be a data chunk of multiple sequentially stored data blocks, rather than per block. A chunk may be a continuous address space of, for example, 4 megabytes (MB) or other values. It is noted that cache entries are much smaller, for example, about 8 KB, whereas the 4 MB chunks may be chunks used for data storage tiering. Analyzing access patterns and/or cache events for cache entries per data chunk rather than per block reduces storage requirements and/or improves computational performance (e.g., processor utilization, processor time) in comparison to using very large data structures for storing access patterns and/or cache events per block. Movement and/or eviction (e.g., as described with reference to 106 and/or 108) may be performed per data chunk.
Optionally, the access pattern and/or cache event is computed by a tier up and tier down process that dynamically moves data chunks between the cache tier(s) and hierarchical storage (e.g., higher level data storage device and the lower level data storage device) for dynamic optimization. Existing tier up and/or tier down processes that evaluate the hotness of an area of storage device(s) and/or the probability that the area will be read may be used to determine the access pattern and/or cache event and/or the analysis thereof, to determine movement and/or eviction of the cache entry.
Optionally, analysis of the access patterns and/or cache event for the cache entry is performed by computing a prediction of future access patterns for the cache entry. The predicted future access patterns enable better allocation of the cache entry. The prediction of future access patterns may be obtained as an outcome of a machine learning (ML) model, for example, a regressor, a neural network, a classifier, and the like. The ML model may be trained on a training dataset of records, where each record includes a respective cache entry labelled with a ground truth label of historical access patterns and/or cache events. The ML model may increase accuracy of the predictions, based on learning historical access patterns for cache entries. Other approaches may be used to obtain the predicted future access patterns and/or cache events, for example, a set of rules and/or mathematical prediction models.
Optionally, the access patterns and/or cache events include prefetching patterns by a prefetching process. Prefetching approaches may be analyzed to predict the future access patterns and/or cache events more accurately. The prefetching pattern may be, for example, one or a combination of: sequential, stride (i.e., increase by a fixed step each time), and/or random. The prefetching process places the prefetched data components (that are located non-sequentially on the lower tier data storage device) on a higher tier data storage device and/or higher cache tier when the data component is not already stored on the higher tier data storage device and/or higher cache tier. The prefetching process computes a probability of each of multiple candidate subsequent data chunks being accessed given a current data chunk being accessed, and prefetches the subsequent data chunk having the highest probability when the current data chunk is being accessed. The prefetching process that computes the probability enables selecting the data chunks for which highest accuracy is obtained for storage on the higher tier data storage device and/or higher cache tier, which improves performance of the higher tier data storage device and/or higher cache tier, since stored data chunks are most likely to actually be accessed in the future over other components with lower probability, which are kept on the lower tier data storage device and/or lower cache tier.
Optionally, accuracy of the prefetching patterns (e.g., each prefetching pattern) is computed. The prefetching pattern, e.g., the data component to be prefetched, may be predicted as described herein with reference to the believe cache process discussed below. As used herein, the term believe cache relates to a prefetch cache that predicts next location(s) which are not necessarily sequential. The accuracy may be computed as the percentage of prefetching attempts in which the prefetching pattern prefetched the correct component, relative to all prefetching attempts, including attempts where the prefetching pattern was unable to prefetch the correct component. Two or more prefetching patterns having accuracy above a threshold may be selected. The threshold may be, for example, 20%, 25%, 30%, 40%, 45%, 50%, or other values. Two or more prefetching patterns with highest accuracy are selected, since such prefetching patterns are most likely to be re-selected in the future.
Optionally, the prefetching process is based on computing conditional probabilities of a next access (e.g., read) location based on a current access (e.g., read) location, sometimes referred to as believe cache prefetching. The prefetching process (e.g., believe cache prefetching) computes probability of each of multiple candidate subsequent data components being accessed given a current data component being accessed, and prefetches the subsequent data component having highest probability when the current data component is being accessed. The prefetching process computes the probability of the prefetching pattern fetching each of multiple candidate components.
The data may be prefetched from the next access location when the conditional probability is above a threshold. The believe cache prefetching may be used, for example, when access to data storage is non-sequential but in a repeatable pattern, for example, in striding access (i.e., each time increase the address by a fixed amount relative to the current access), and/or in another repeatable pattern which may at first appear to be random. The next location to be accessed is computed based on the current and/or previous locations that were accessed, based on absolute address locations and/or relative address locations. An exemplary computation is now described:
After a first location (denoted A) is accessed, the following memory locations are accessed multiple times: a second location (denoted X) is accessed 10 times, a third location (denoted Y) is accessed 3 times, and a fourth location (denoted Z) is accessed 5 times.
After a fifth location (denoted B) is accessed, the following memory locations are accessed multiple times: the second location (denoted X) is accessed 6 times, the third location (denoted Y) is accessed 2 times, the fourth location (denoted Z) is accessed 4 times, and a sixth location (denoted K) is accessed 7 times.
Conditional probabilities are calculated as follows:
• p(X|A) = 10/18, p(Y|A) = 3/18, p(Z|A) = 5/18
• p(X|B) = 6/19, p(Y|B) = 2/19, p(Z|B) = 4/19, p(K|B) = 7/19

If there are two accesses (e.g., IOs), A and B, in sequence, the recommendation for which data location to prefetch from may be computed by calculating the candidate probability of each of the following locations: X, Y, Z, K:
CX = p(X|A) + p(X|B) = 10/18 + 6/19 = 0.87
CY = p(Y|A) + p(Y|B) = 3/18 + 2/19 = 0.27
CZ = p(Z|A) + p(Z|B) = 5/18 + 4/19 = 0.49
CK = p(K|A) + p(K|B) = 0 + 7/19 = 0.37
The probabilities are sorted to rank the most likely next locations from which prefetching of data is performed. One or more prefetching patterns may be accessed, for example, a single prefetch, two prefetches, or more, and/or according to a threshold. The first prefetch is from location X. The second prefetch is from location Z. The third prefetch is from location K. If a threshold of 50% is used, data is prefetched only from location X; with a threshold of 40%, data is prefetched from locations X and Z.
Prefetch locations (i.e., X, Y, Z, K) may be referred to as candidates. Current access locations (i.e., A, B) may be referred to as voter.
The relationship between the current and next locations may be presented in a matrix, which may be referred to as a relation matrix, with one dimension for the current history (curHis: A, B) and the other for the candidate next locations (X, Y, Z, K), for example, as below:
[Relation matrix table not reproduced; it tabulates, for each current location (A, B), the hit counts or conditional probabilities of each candidate next location (X, Y, Z, K).]
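A minimal sketch of this candidate-scoring step, assuming the per-voter hit counts of the example above have already been collected (the data structures and the function name are illustrative only):

```python
from collections import defaultdict

# Observed counts: voter (current access location) -> {candidate next location: hits}.
# The numbers reproduce the worked example above.
observations = {
    "A": {"X": 10, "Y": 3, "Z": 5},
    "B": {"X": 6, "Y": 2, "Z": 4, "K": 7},
}


def candidate_scores(voters):
    """Sum the conditional probabilities p(candidate | voter) over the
    recent access history (the voters)."""
    scores = defaultdict(float)
    for voter in voters:
        counts = observations.get(voter, {})
        total = sum(counts.values())
        for candidate, hits in counts.items():
            scores[candidate] += hits / total
    return dict(scores)


scores = candidate_scores(["A", "B"])
# {'X': 0.87, 'Y': 0.27, 'Z': 0.49, 'K': 0.37} after rounding
ranked = sorted(scores, key=scores.get, reverse=True)  # ['X', 'Z', 'K', 'Y']
prefetch = [c for c in ranked if scores[c] > 0.4]      # ['X', 'Z'] with a 40% threshold
```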
At 104, one or more compression performance parameters for the cache entry are computed.
Optionally, the compression performance parameter includes a decompression performance parameter. The decompression performance parameter indicates additional latency incurred for accessing the cache entry in the compressed state when stored on one of the lower cache tiers, or accessing the cache entry stored in the uncompressed state on the at least one lower cache tier. The additional incurred latency due to decompressing the cache entry stored in the lower tier cache before serving the uncompressed data may be insignificant in some cases, or significant in other cases, for example, according to the capability and/or availability of the CPU. When the additional incurred latency is not significant, access performance may be improved. The added latency for reading the decompressed data affects which cache tier the cache entry is moved to, since the goal is to reduce overall average latency when serving data and/or when performing IO operations, for example, according to whether the incurred latency is significant or not.
Alternatively, or additionally, the compression performance parameter is computed based on a predicted hit rate of the cache entry indicating a probability that the cache entry will be read again during a future time interval. Cache entries with higher hit rates are better candidates for remaining in the cache, to avoid cache misses. The hit rate may serve to “normalize” the added latency times, and enable comparison of the “normalized” latency times.
Alternatively, or additionally, the compression performance parameter includes a reduction in storage requirement for storing the cache entry in the compressed state at a compression ratio defined by a specific compression process. Cache entries that are highly compressible are better candidates for remaining in the compressed cache, since they take up less space than other cache entries that are less compressible.
The compression ratio of the cache entry may be computed for a respective target compression process that runs on a respective lower tier cache for compressing the cache entry for storage on the respective lower tier cache. Different compression processes have different compression ratios. The compression ratio is computed using the target compression process that runs on respective lower tier caches where the cache entry may be moved to, in order to help evaluate where the cache entry will be moved to.
Alternatively, or additionally, the compression performance parameter includes an impact on average latency of one or more of (i) accessing cache entries of the higher cache tier when storing the cache entry in an uncompressed state on the higher cache tier, and (ii) accessing cache entries of the lower cache tier(s) when storing the cache entry in the compressed state on the lower cache tier(s). The cache entry is stored on, and/or evicted from, the higher cache tier and/or the at least one lower cache tier, to improve the average latency. Considering a combination of the compression ratio and the decompression performance for moving the cache entry into one of the multiple defined lower tiers of compressed cache or into the higher tier of uncompressed cache improves overall performance for handling cache read misses. The improvement may be in obtaining an optimal combination of reducing increased latency from decompressing the compressed cache entry and/or in reducing processor utilization from decompressing the compressed cache entry (or when accessing the uncompressed cache entry rather than decompressing the compressed cache entry). Compression ratio and decompression performance may be considered as a combination, since the decompression performance may be impacted by the compression ratio (e.g., it may be a function of the compression ratio). The more compressible the data is (e.g., the higher the compression ratio), the faster the decompression process may work (e.g., the lower the incurred delay latency). For example, for a cache entry having 4X compression (i.e., a compression ratio of 4) the decompression process may work twice as fast as in a case where the compression ratio is 2X.
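As a rough illustration of treating decompression latency as a function of the compression ratio, a sketch under the simplifying assumption that latency is inversely proportional to the ratio (the baseline constant is an assumption, not a measured value):

```python
BASE_DECOMPRESSION_US = 4.0  # assumed latency of an incompressible (1X) entry


def estimated_decompression_us(compression_ratio: float) -> float:
    """Simple model: latency inversely proportional to the compression ratio,
    so a 4X-compressible entry decompresses about twice as fast as a 2X one."""
    return BASE_DECOMPRESSION_US / compression_ratio


print(estimated_decompression_us(4.0))  # 1.0 microsecond
print(estimated_decompression_us(2.0))  # 2.0 microseconds
```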
At 106, the cache entry may be moved to the higher cache tier and/or moved to one of the lower cache tiers according to the compression parameter.
For clarity, it is noted that the cache tiers described herein may be implemented as different cache layers. The cache tiers (e.g., cache layers) are stored in the memory (e.g., DRAM, SCM). The cache tiers (e.g., cache layers) are not stored on another data storage device such as hard disk and/or SSD. The hard disk and/or SSD, which may be a part of a hierarchical storage system, serve as tiered data storage devices that store data storage tiers, and do not store the cache tiers (e.g., cache layers). As such, it is clarified that the data storage tiers (e.g., stored by the hierarchical storage system, such as hard disk and/or SSD) are different than the cache tiers stored in the memory (e.g., DRAM, SCM). It is noted that the data storage tiers are persistent. Data is usually stored in only one data storage tier (e.g., the higher tier, or the lower tier). In contrast, the cache is volatile, and data stored in the cache is also stored on a data storage tier.
The cache entry is moved to one of the tiers of the cache on a read miss, when the compression performance parameter(s) are already known. The compression ratio of the data and/or the decompression performance (e.g., amount of processor utilization and/or delay incurred due to decompression of the data to be served) may be obtained once the decompressed cache entry is served. The known compression performance parameter(s) are used to select which of the lower tiers of the cache the cache entry is inserted into in a compressed state, or whether to keep the cache entry in the uncompressed state and insert it into the higher tier of the cache. When the compressed cache lower tiers are divided by different values of the compression performance parameter(s) (e.g., the latency of decompression and/or compression ratio and/or a combination thereof), each such IO is relevant for one compressed lower cache tier and/or not relevant for other compressed lower cache tiers, or not relevant to any lower cache tier, in which case the cache entry is to be kept uncompressed in the higher-level cache tier.
There may be one or more lower cache tiers, for example, 2, 3 or more. The number of cache tiers and/or the thresholds and/or values for moving cache entries into each cache tier may be selected for optimal performance, for example, according to available processing resources. The number of cache tiers may be decided in advance.
In an exemplary implementation, there are two lower cache tiers. The cache entry is moved to a first lower cache tier when the compression parameter indicates a compression ratio being above a first compression threshold and a decompression performance indicating latency incurred for accessing the cache entry in a compressed state being below a first latency threshold. For example, the cache entry is moved to the first lower cache tier when the compression ratio is above the first compression threshold of 3, and the latency is below the first latency threshold of 10 microseconds. The cache entry is moved to a second lower cache tier when the compression parameter indicates the compression ratio is above a second compression threshold and lower than the first compression threshold, and the decompression performance is below a second latency threshold greater than the first latency threshold. For example, the cache entry is moved to the second lower cache tier when the compression ratio is above the second compression threshold of 2, but below the first compression threshold of 3, and the latency is below the second latency threshold of 30 microseconds. In an exemplary implementation, the cache entry is moved according to a minimum value of the compression ratio combined with a maximum value of the latency. For example, the cache entry is moved when the compression ratio is at least 2 and the latency is at most 5. Otherwise, the cache entry does not fall into any queue.
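A minimal sketch of this selection rule for the two-lower-tier example, using the illustrative thresholds above (the constant values, tier names, and function signature are assumptions):

```python
FIRST_COMPRESSION_THRESHOLD = 3.0    # compression ratio
FIRST_LATENCY_THRESHOLD_US = 10.0    # decompression latency, microseconds
SECOND_COMPRESSION_THRESHOLD = 2.0
SECOND_LATENCY_THRESHOLD_US = 30.0


def select_cache_tier(compression_ratio: float, decompression_us: float) -> str:
    """Pick the destination cache tier for a cache entry on a read miss."""
    if (compression_ratio > FIRST_COMPRESSION_THRESHOLD
            and decompression_us < FIRST_LATENCY_THRESHOLD_US):
        return "lower_tier_1"   # highly compressible, cheap to decompress
    if (SECOND_COMPRESSION_THRESHOLD < compression_ratio <= FIRST_COMPRESSION_THRESHOLD
            and decompression_us < SECOND_LATENCY_THRESHOLD_US):
        return "lower_tier_2"   # moderately compressible, moderate latency
    return "higher_tier"        # keep the entry uncompressed in the higher cache tier


print(select_cache_tier(3.5, 8.0))   # lower_tier_1
print(select_cache_tier(2.5, 20.0))  # lower_tier_2
print(select_cache_tier(1.5, 40.0))  # higher_tier
```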
The read miss may be detected for the cache entry stored on one of the lower tier caches. The cache entry may be moved to the higher cache tier storing cache entries in an uncompressed state. The cache entry may be moved to the higher cache tier when a hit score of the cache entry is higher than a threshold computed as a function of hit scores of cache entries stored on the higher cache tier. The cache entry may be moved to the higher cache tier when doing so provides optimal performance in terms of reduced latency and/or high hit rate relative to current cache entries stored on the higher cache tier.
Alternatively, or additionally to 106, at 108, the cache entry is evicted from the higher cache tier and/or from one of the lower cache tiers according to the compression performance parameter.
Eviction of data from the lower cache tier may be according to a least recently used (LRU) policy.
Optionally, in response to the cache being full, the cache entry having the lowest hit score is selected for eviction. Data with the lowest hit score represents data that is unlikely to be accessed again, which makes it the best candidate for eviction. The hit score of a page of data in the respective lower cache tier may be based on the number of hits, optionally decaying over time. The cache entry having the lowest hit score across the multiple lower tier caches may be evicted. Alternatively, for each of the lower tier caches, the cache entry having the lowest score may be evicted.
The hit score may be a function of a compression factor of the respective lower cache tier. Higher hit scores are obtained for higher compression factors and lower hit scores are obtained for lower compression factors. The compression factor helps maintain data that is more highly compressed, and/or helps evict data that is less compressible. Data with higher compression factors is to be kept in the compressed lower cache tiers with higher priority. The hit score may be multiplied by the compression factor, or other functions may be used.
Alternatively, or additionally, the hit score may be a function of the latency of decompressing the compressed cache entry. Hit scores may be “normalized” by latency, enabling comparing total predicted latency times, based on predicted hit rates. Higher hit scores are obtained for lower latency and lower hit scores are obtained for higher latency. The hit score may be a combination of the compression factor and latency.
Alternatively, or additionally, the respective hit score is decayed over a time interval and based on a last access time. When cache entries have not been accessed frequently recently, they are predicted not to be accessed frequently in the future. The decay enables removal of a non-frequently read cache entry even when the cache entry used to be frequently read.
An exemplary approach for computing the respective hit score for each cache entry is now described, and may include one or more of the following features: 1. When the cache entry is read, a constant (e.g., denoted c1) is added to the hit score.
2. Every interval of time, the hit score is multiplied by a decay factor (e.g., denoted d1); for example, every 1 minute the hit score is multiplied by 0.99.
3. When the cache entry has a certain compression factor value (e.g., denoted p1), the hit score of the cache entry is multiplied by an adjustment function (e.g., denoted f(p1)), for example, f(p1) = p1.
4. When the cache entry has a decompression time of t1 (distinct from the decay factor d1), the hit score is multiplied by a function (e.g., denoted g(t1)) that adjusts the hit score based on the added latency. For example, g(0) = 1 and g(tSSD) = 1/2, where tSSD denotes the latency added by reading the cache entry from the higher tier data storage device (e.g., SSD), i.e., the score is halved when the added decompression latency equals the SSD read latency. An example of such a function is g(x) = 1 - x/(x + tSSD).
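A minimal sketch combining rules 1-4 above (the constant values, the assumed SSD read latency used for g(x), and the class layout are illustrative assumptions):

```python
import time

C1 = 1.0                    # score added per read (rule 1)
D1 = 0.99                   # per-minute decay factor (rule 2)
DECAY_INTERVAL_SEC = 60
SSD_READ_LATENCY_US = 80.0  # assumed higher-tier (SSD) read latency, i.e. tSSD


def g(added_latency_us: float) -> float:
    """Latency adjustment of rule 4: g(0) = 1 and g(SSD_READ_LATENCY_US) = 1/2."""
    return 1.0 - added_latency_us / (added_latency_us + SSD_READ_LATENCY_US)


class ScoredEntry:
    """Hit score of a compressed cache entry, following rules 1-4."""

    def __init__(self, compression_factor: float, decompression_us: float):
        self.compression_factor = compression_factor  # p1
        self.decompression_us = decompression_us      # t1
        self.raw_hits = 0.0
        self.last_decay = time.monotonic()

    def on_read(self):
        self.raw_hits += C1                           # rule 1

    def apply_decay(self):
        now = time.monotonic()
        while now - self.last_decay >= DECAY_INTERVAL_SEC:
            self.raw_hits *= D1                       # rule 2
            self.last_decay += DECAY_INTERVAL_SEC

    def score(self) -> float:
        # Rule 3: f(p1) = p1 favours highly compressible entries;
        # rule 4: g(t1) penalises entries that are slow to decompress.
        return self.raw_hits * self.compression_factor * g(self.decompression_us)


# Eviction candidate: the entry with the lowest score() in its tier.
```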
The hit score of the cache entry being evicted may be compared to hit scores of cache entries in other cache tiers. The cache entry may be moved to a specific cache tier when the hit score of the cache entry is higher than the lowest hit score of a current cache entry located in that specific cache tier, where that lowest hit score is lower than the hit scores of the cache entries in the other cache tiers. If the hit score of the evicted entry being moved is lower than that of all cache entries in the selected lower cache tier, adding the evicted entry will simply cause it to be evicted again from the selected lower cache tier.
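A minimal sketch of this admission check when moving an evicted entry into another tier (the Tier container and the score() interface are assumptions, matching the scoring sketch above):

```python
from dataclasses import dataclass, field


@dataclass
class Tier:
    capacity: int
    entries: list = field(default_factory=list)  # objects exposing score()


def try_move(entry, tier: Tier) -> bool:
    """Admit an evicted entry into `tier` only if there is room or its hit
    score beats the lowest score already there; otherwise drop it, since it
    would simply be evicted again from the selected tier."""
    if len(tier.entries) < tier.capacity:
        tier.entries.append(entry)
        return True
    victim = min(tier.entries, key=lambda e: e.score())
    if entry.score() > victim.score():
        tier.entries.remove(victim)
        tier.entries.append(entry)
        return True
    return False
```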
At 110, one or more features described with reference to 102-108 may be iterated, for example, for different cache entries.
The allocation of the memory into the multiple lower cache tiers and/or the higher cache tier may be dynamic, for example, adapted per iteration or over multiple iterations. The dynamic allocation may be performed according to a set of rules, for example, the size of a certain cache tier is expanded when the set of rules indicates that more storage space is needed, or reduced when the set of rules indicates that the existing storage space is not being used.
Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
The descriptions of the various embodiments of the present disclosure have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant cache devices and cache processes will be developed and the scope of the term cache is intended to include all such new technologies a priori.
As used herein the term “about” refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", “having” and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of' and "consisting essentially of'.
The phrase "consisting essentially of means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment of the disclosure may include a plurality of “optional” features unless such features conflict.
Throughout this application, various embodiments of this disclosure may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the disclosure. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the disclosure. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
It is the intent of the applicant(s) that all publications, patents and patent applications referred to in this specification are to be incorporated in their entirety by reference into the specification, as if each individual publication, patent or patent application was specifically and individually noted when referenced that it is to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present disclosure. To the extent that section headings are used, they should not be construed as necessarily limiting. In addition, any priority document(s) of this application is/are hereby incorporated herein by reference in its/their entirety.

Claims

WHAT IS CLAIMED IS:
1. A computing device (204) for hierarchical storage management (208), configured for: detecting at least one of a read miss for a cache entry on a cache (206A) of a memory (206), and eviction of the cache entry; wherein the memory is divided into a higher cache tier of uncompressed cache entries (206A-2) and at least one lower cache tier of compressed cache entries (206A-1); computing at least one compression performance parameter for the cache entry; and moving the cache entry to, and/or evicting the cache entry from, the higher cache tier and/or at least one lower cache tier according to the at least one compression performance parameter.
2. The computing device according to claim 1, wherein the at least one compression performance parameter comprises a decompression performance indicating latency incurred for accessing the cache entry in a compressed state when stored on the at least one lower cache tier or in the uncompressed state when stored on the at least one lower cache tier.
3. The computing device according to any of the preceding claims, wherein the at least one compression performance parameter is computed based on a predicted hit rate of the cache entry indicating a probability that the cache entry will be read again during a future time interval.
4. The computing device according to any of the preceding claims, wherein the at least one compression performance parameter further comprises a reduction in storage requirement for storing the cache entry in the compressed state at a compression ratio defined by a specific compression process.
5. The computing device according to any of the preceding claims, wherein the at least one compression performance parameter further comprises an impact on average latency of at least one of: (i) accessing cache entries of the higher cache tier when storing the cache entry in an uncompressed state on the higher cache tier and (ii) accessing cache entries of the at least one lower cache tier when storing the cache entry in the compressed state on the at least one lower cache tier, wherein the cache entry is stored on and/or evicted from, the higher cache tier and/or the at least one lower cache tier, to improve the average latency.
6. The computing device according to any of the preceding claims, wherein the at least one lower cache tier comprises two cache tiers, wherein the cache entry is moved to a first lower cache tier when the at least one compression parameter indicates a compression ratio being above a first compression threshold and a decompression performance indicating latency incurred for accessing the cache entry in a compressed state being below a first latency threshold, and the cache entry is moved to a second lower cache tier when the at least one compression parameter indicates the compression ratio being above a second compression threshold and lower than the first compression threshold and the decompression performance is below a second latency threshold greater than the first latency threshold.
7. The computing device according to any of the preceding claims, wherein a compression ratio of the cache entry is computed for a respective target compression process that runs on a respective lower cache tier for compressing the cache entry for storage on the respective lower cache tier.
8. The computing device according to any of the preceding claims, further configured for, in response to the cache being full, selecting for eviction the cache entry having a lower respective hit score.
9. The computing device according to claim 8, wherein the respective hit score is a function of a respective compression factor of the respective lower cache tier, wherein higher hit scores are obtained for higher compression factors and lower hit scores are obtained for lower compression factors.
10. The computing device according to any one of the preceding claims 8-9, wherein the respective hit score is a function of the latency of decompressing the compressed cache entry, wherein higher hit scores are obtained for lower latency and lower hit scores are obtained for higher latency.
11. The computing device according to any one of the preceding claims 8-10, wherein the respective hit score is decayed over a time interval and based on a last access time.
12. The computing device according to any one of the preceding claims 8-11, wherein eviction of data from the at least one lower cache tier is according to a least recently used (LRU) policy.
13. The computing device according to any of the preceding claims 8-12, further configured for comparing a hit score of the cache entry being evicted to hit scores of cache entries in other cache tiers, and moving the cache entry when the hit score of the cache entry is higher than a lowest hit score of a current cache entry located in another cache tier.
14. The computing device according to any one of the preceding claims, wherein the read miss is detected for the cache entry stored on the at least one lower cache tier, and wherein moving comprises moving the cache entry to the higher cache tier storing cache entries in an uncompressed state.
15. The computing device according to claim 14, wherein the cache entry is moved to the higher cache tier when a hit score of the cache entry is higher than a threshold computed as a function of hit scores of cache entries stored on the higher cache tier.
16. A method of hierarchical storage management, comprising: detecting at least one of a read miss for a cache entry on a cache of a memory, and eviction of the cache entry (102); wherein the memory is divided into a higher cache tier of uncompressed cache entries and at least one lower cache tier of compressed cache entries; computing at least one compression performance parameter for the cache entry (104); and moving the cache entry to, and/or evicting the cache entry from, the higher cache tier and/or the at least one lower cache tier according to the at least one compression performance parameter (106).
17. A non-transitory medium (206) storing program instructions for hierarchical storage management (206C), which, when executed by a processor (202), cause the processor to: detect at least one of a read miss for a cache entry on a cache of a memory, and eviction of the cache entry; wherein the memory is divided into a higher cache tier of uncompressed cache entries and at least one lower cache tier of compressed cache entries; compute at least one compression performance parameter for the cache entry; and move the cache entry to, and/or evict the cache entry from, the higher cache tier and/or the at least one lower cache tier according to the at least one compression performance parameter.
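The following is an illustrative, non-limiting sketch in Python of the tier-placement and scoring logic set out in claims 1, 6 and 8 to 15. The compressor is assumed to be a zlib-style object exposing compress and decompress, and all names and threshold values (for example ratio_hi, lat_fast, half_life) are hypothetical choices made for this example rather than terms taken from the claims or the description.

```python
import time
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    key: str
    data: bytes
    hit_score: float = 0.0
    last_access: float = field(default_factory=time.time)


def compression_performance(entry, compressor):
    """Compress once to measure the compression ratio and the latency
    incurred to decompress the entry again (cf. claims 2 and 4)."""
    compressed = compressor.compress(entry.data)
    ratio = len(entry.data) / max(len(compressed), 1)
    start = time.perf_counter()
    compressor.decompress(compressed)
    latency = time.perf_counter() - start
    return compressed, ratio, latency


def select_lower_tier(ratio, latency, ratio_hi=4.0, ratio_lo=2.0,
                      lat_fast=0.0005, lat_slow=0.002):
    """Pick a lower tier as in claim 6: the first lower tier requires a high
    compression ratio and fast decompression; the second lower tier accepts a
    lower ratio and slower decompression. Returns None if neither fits."""
    if ratio > ratio_hi and latency < lat_fast:
        return 1
    if ratio_lo < ratio <= ratio_hi and latency < lat_slow:
        return 2
    return None


def hit_score(entry, ratio, latency, now=None, half_life=300.0):
    """Decaying hit score (cf. claims 9 to 11): higher for better compression
    and lower decompression latency, decayed by time since last access."""
    now = time.time() if now is None else now
    decay = 0.5 ** ((now - entry.last_access) / half_life)
    return decay * (ratio / (1.0 + 1000.0 * latency))


def on_miss_or_eviction(entry, compressor, lower_tiers):
    """Claim 1 flow: on a read miss or eviction, compute the compression
    performance parameters, then store the entry compressed in a lower tier
    or evict it entirely."""
    compressed, ratio, latency = compression_performance(entry, compressor)
    tier = select_lower_tier(ratio, latency)
    if tier is None:
        return "evicted"
    entry.hit_score = hit_score(entry, ratio, latency)
    lower_tiers[tier][entry.key] = compressed
    return f"stored compressed in lower tier {tier}"
```

On a later read hit in a lower tier, a promotion step in the spirit of claims 14 and 15 would decompress the entry and move it back to the higher tier only when its hit score exceeds a threshold derived from the hit scores of the entries already resident in the higher tier.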
PCT/EP2021/078249 2021-10-13 2021-10-13 Compressed cache as a cache tier WO2023061567A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180101274.8A CN117813592A (en) 2021-10-13 2021-10-13 Compressed cache as a cache hierarchy
PCT/EP2021/078249 WO2023061567A1 (en) 2021-10-13 2021-10-13 Compressed cache as a cache tier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/078249 WO2023061567A1 (en) 2021-10-13 2021-10-13 Compressed cache as a cache tier

Publications (1)

Publication Number Publication Date
WO2023061567A1 (en) 2023-04-20

Family

ID=78134977

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/078249 WO2023061567A1 (en) 2021-10-13 2021-10-13 Compressed cache as a cache tier

Country Status (2)

Country Link
CN (1) CN117813592A (en)
WO (1) WO2023061567A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130006948A1 (en) * 2011-06-30 2013-01-03 International Business Machines Corporation Compression-aware data storage tiering
US20210216475A1 (en) * 2020-01-15 2021-07-15 EMC IP Holding Company LLC Data reduction techniques for use with caching
US20210286726A1 (en) * 2020-03-13 2021-09-16 EMC IP Holding Company LLC Techniques for determining and using caching scores for cached data

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118227446A (en) * 2024-05-21 2024-06-21 北京开源芯片研究院 Cache performance evaluation method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN117813592A (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US11797185B2 (en) Solid-state drive control device and learning-based solid-state drive data access method
US10482032B2 (en) Selective space reclamation of data storage memory employing heat and relocation metrics
EP2539821B1 (en) Caching based on spatial distribution of accesses to data storage devices
US9164676B2 (en) Storing multi-stream non-linear access patterns in a flash based file-system
US8285930B2 (en) Methods for adapting performance sensitive operations to various levels of machine loads
Laga et al. Lynx: A learning linux prefetching mechanism for ssd performance model
US8301836B2 (en) Methods for determining alias offset of a cache memory
US8285931B2 (en) Methods for reducing cache memory pollution during parity calculations of RAID data
US9367466B2 (en) Conditional prefetching
Lewis et al. An automatic prefetching and caching system
US20090276600A1 (en) Method and apparatus for determining memory usage for a computing device
US8219751B2 (en) Methods for optimizing performance of transient data calculations
WO2023061567A1 (en) Compressed cache as a cache tier
US9798665B1 (en) Cache eviction according to data hit ratio and service level agreement
WO2015072925A1 (en) Method for hot i/o selective placement and metadata replacement for non-volatile memory cache on hybrid drive or system
CN114385073A (en) Method for operating a storage system and method for partitioning a hierarchy of storage resources
WO2023083454A1 (en) Data compression and deduplication aware tiering in a storage system
WO2023088535A1 (en) Cache eviction based on current tiering status
CN117120989A (en) Method and apparatus for DRAM cache tag prefetcher
WO2023061569A1 (en) Smart defragmentation of a data storage system
WO2022248051A1 (en) Smart caching of prefetchable data
KR101975101B1 (en) Prefetching apparatus and method using learning, and medium thereof
CN111796757A (en) Solid state disk cache region management method and device
CN117242439A (en) Intelligent data placement on tiered storage
Li et al. Algorithm-Switching-Based Last-Level Cache Structure with Hybrid Main Memory Architecture

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21790883

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180101274.8

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21790883

Country of ref document: EP

Kind code of ref document: A1