WO2023083454A1 - Data compression and deduplication-aware tiering in a storage system - Google Patents
Data compression and deduplication-aware tiering in a storage system
- Publication number
- WO2023083454A1 (PCT application PCT/EP2021/081384)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- parameter
- predicted
- access
- storage device
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/185—Hierarchical storage management [HSM] systems, e.g. file migration or policies thereof
Definitions
- the present disclosure, in some embodiments thereof, relates to hierarchical storage management and, more specifically, but not exclusively, to improving read access for data chunks stored on a hierarchical storage system.
- Hierarchical storage systems are managed by automatically moving data between high-cost, fast storage media and low-cost, slow storage media. While it would be ideal to store all data on fast storage media that provide fast read access, in practice such a solution is expensive. Most of the data is stored on the lower-cost but slower storage media. Some data is moved to the higher-cost but faster storage media, with the goal of optimizing the tradeoff between access time and the cost of the storage media. The higher-cost but faster storage media may serve as a cache for the lower-cost but slower storage media.
- a computing device for hierarchical storage management is configured for: monitoring access patterns to a plurality of data chunks located on a lower tier data storage device and a higher tier data storage device, computing a predicted normalized access parameter for a data chunk of the plurality of data chunks according to a compression parameter and a predicted non-normalized access parameter computed from an analysis of the access patterns, and moving the data chunk between the lower tier data storage device and the higher tier data storage device according to the predicted normalized access parameter.
- a computer implemented method of hierarchical storage management comprises: monitoring access patterns to a plurality of data chunks located on a lower tier data storage device and a higher tier data storage device, computing a predicted normalized access parameter for a data chunk of the plurality of data chunks according to a compression parameter and a predicted non-normalized access parameter computed from an analysis of the access patterns, and moving the data chunk between the lower tier data storage device and the higher tier data storage device according to the predicted normalized access parameter.
- a non-transitory medium storing program instructions for hierarchical storage management, which, when executed by a processor, cause the processor to: monitor access patterns to a plurality of data chunks located on a lower tier data storage device and a higher tier data storage device, compute a predicted normalized access parameter for a data chunk of the plurality of data chunks according to a compression parameter and a predicted non-normalized access parameter computed from an analysis of the access patterns, and move the data chunk between the lower tier data storage device and the higher tier data storage device according to the predicted normalized access parameter.
- Moving the most access active (e.g. reads and/or writes) data chunks to the higher tier improves access performance for the access active data chunks, and/or improves overall performance of the hierarchical data storage system.
- the predicted normalized access parameter is computed by multiplying the compression parameter by the predicted non-normalized access parameter.
- the compression parameter comprises a compression ratio between a size of the data chunk when non-compressed and a size of the data chunk when compressed by a compression process when the data chunk is placed on the higher tier data storage device. Since compression approaches may vary between different tiers, using the compression parameter computed according to the compression approach used in the higher tier provides a more accurate predicted normalized access parameter.
- the compression parameter is computed based on a deduplication parameter indicating an amount of space occupied by de-duplicated data of the data chunk.
- Deduplication has a large effect on the amount of space that data takes up, such as in the higher tier data storage device.
- the data chunk comprises a plurality of data blocks, and the deduplication parameter is computed as follows: for each respective data block that is not de-duplicated, the deduplication parameter indicates the size of the respective data block; for each respective data block that is de-duplicated, the deduplication parameter is computed as the size of a compression of the respective data block divided by the number of copies of the respective block (see the sketch below).
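- the following is a minimal Python sketch of this per-block computation; the `Block` fields and chunk layout are illustrative assumptions, not structures defined by the disclosure:

```python
from dataclasses import dataclass

@dataclass
class Block:
    size: int             # uncompressed size of the block, in bytes
    compressed_size: int  # size of the block after compression, in bytes
    copies: int           # number of copies referencing the block (1 = not de-duplicated)

def deduplication_parameter(blocks: list[Block]) -> float:
    """Effective space attributed to a chunk's blocks under de-duplication."""
    total = 0.0
    for block in blocks:
        if block.copies == 1:
            # Not de-duplicated: the block is charged its full size.
            total += block.size
        else:
            # De-duplicated: the compressed size is split across all copies.
            total += block.compressed_size / block.copies
    return total
```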
- the compression ratio based on the deduplication parameter is computed as the size of the data chunk when no de-duplication is applied, divided by the size of the data chunk after de-duplication of its data blocks by a de-duplication process, when the data chunk is placed on the higher tier data storage device.
- the compression ratio considers the effects of de-duplication on the size of the data chunk when placed at the target tier. Since de-duplication approaches may vary between different tiers, using the compression parameter computed according to the de-duplication approach used in the higher tier (when moving to the higher tier) provides a more accurate predicted normalized access parameter.
- the compression parameter is computed based on the deduplication parameter prior to moving the data chunk to the higher tier or lower tier.
- the compression parameter is computed based on a delta compression parameter indicating an amount of space occupied by delta compression data of the data chunk.
- Delta compression may have a significant effect on the amount of space that data takes up, such as in the higher tier data storage device.
- the compression ratio based on the delta compression parameter is computed as the size of an original of the data chunk and delta blocks when no delta compression is applied, divided by the total size of the data blocks of the data chunk used in the delta compression of the data blocks of the data chunk.
- the compression parameter is computed according to a delta compression process when the data chunk is placed on the higher tier data storage device.
- data chunks with predicted normalized access parameter above a threshold and/or data chunks highest ranked according to predicted normalized access parameter are moved to the higher tier data storage device, and data chunks with predicted normalized access parameter below the threshold and/or data chunks lowest ranked according to predicted normalized access parameter, are moved to the lower tier data storage device.
- a preset maximal number of data chunks are moved per time interval.
- the predicted normalized access parameter for the data chunk comprises a predicted normalized number of random reads for the data chunk, wherein data chunks with predicted normalized number of random reads above a threshold and/or data chunks highest ranked according to predicted normalized number of random reads, are moved to the higher tier data storage device.
- the predicted normalized access parameter for the data chunk comprises a predicted normalized number of writes and/or normalized sequential reads for the data chunk, wherein data chunks with predicted normalized number of writes and/or normalized sequential reads above a threshold and/or data chunks highest ranked according to predicted normalized number of writes and/or normalized sequential reads, are moved to the lower tier data storage device.
- Writes and/or sequential reads have less impact on the performance when served from the lower tier. As such, moving data chunks with highest and/or highest ranked predicted normalized number of writes and/or normalized sequential reads to the lower tier data storage device, improves overall performance of the data storage system that includes the lower tier and higher tier data storage devices (e.g., by freeing up the higher tier storage for other data chunks, such as having high predicted random reads).
- the predicted non-normalized access parameter is computed from the analysis of the access patterns selected from a group consisting of: reads, sequential reads, size of reads, writes, sequential writes, and size of writes.
- the predicted non-normalized access parameter for the data chunk comprises a predicted non-normalized number of random reads for the data chunk generated as an outcome of a machine learning model trained on a training dataset of a plurality of records, each record including a respective chunk labelled with a ground truth label of number of random reads for the respective chunk.
- the trained ML model generates an outcome indicating the predicted number of random reads expected for a target chunk fed as input.
- the first, second, and third aspects may further comprise dynamically decaying the access patterns by multiplying a current parameter of the access pattern by a decay value less than 1 every time interval, to obtain an adapted parameter of the access pattern, wherein the predicted normalized access parameter is computed using the adapted parameter of the access pattern.
- the decay value prevents the value of the parameter of the access pattern from increasing indefinitely, and/or maintains the value of the parameter of the access pattern at a reasonable state that enables processing in a reasonable time.
- a file comprises a plurality of data chunks, wherein a first subset of the plurality of data chunks is located on the higher tier and a second subset of the plurality of data chunks is located on the lower tier, wherein individual data chunks of the file are moved between the lower tier data storage device and the higher tier data storage device according to the predicted normalized access parameter computed per individual data chunk of the file.
- the access patterns comprise prefetching patterns by a prefetching process.
- the prefetching process computes probability of each of a plurality of candidate subsequent data chunks being accessed given a current data chunk being accessed, and prefetches the subsequent data chunk having highest probability when the current data chunk is being accessed.
- the access pattern is computed for each data chunk comprising a plurality of sequentially stored data blocks
- the predicted normalized access parameter is computed for each data chunk
- the movement is performed per data chunk.
- Computing the predicted non-normalized access parameter per data chunk rather than per block reduces storage requirements and/or improves computational performance (e.g., processor utilization, processor time) in comparison to using very large data structures for storing non-normalized access parameter(s) per block.
- FIG. 1 is a flowchart of a method of moving data chunks between a lower tier data storage device and a higher tier data storage device according to a predicted normalized access parameter, in accordance with some embodiments.
- FIG. 2 is a block diagram of components of a system for moving data chunks between a lower tier data storage device and a higher tier data storage device according to a predicted normalized access parameter, in accordance with some embodiments.
- the present disclosure, in some embodiments thereof, relates to hierarchical storage management and, more specifically, but not exclusively, to improving read access for data chunks stored on a hierarchical storage system.
- An aspect of some embodiments relates to systems, methods, a computing device and/or apparatus, and/or computer program product (storing code instructions executable by one or more processors) for management of a hierarchical storage system that includes a lower tier storage device which has slow random access times (but fast access for sequentially stored data) and a higher tier storage device which has fast random access times. Access patterns to data chunks located on the lower tier data storage device and the higher tier data storage device are monitored. A predicted normalized access parameter is computed for one or more of the data chunks. The predicted normalized access parameter enables comparing different data chunks by considering a combination of the amount of compression of the data chunk and a prediction of accesses to the data chunk.
- the predicted normalized access parameter is computed according to a compression parameter indicating the compression of the data chunk stored, and according to a predicted non-normalized access parameter indicating predicted access to the data chunk which is computed from an analysis of the access patterns.
- the data chunk is moved between the lower tier data storage device and the higher tier data storage device according to the predicted normalized access parameter. For example, data chunks which are highly compressible and/or are expected to be accessed often are moved to the higher tier, which provides faster access times, but has less available storage space. Data chunks which are less compressible are moved to the lower tier, which has more available storage space, but has slower access times.
- the lower tier data storage device may be implemented, for example, as a hard disk drive (HDD), and the higher tier data storage device may be implemented as, for example, a solid state drive (SSD).
- At least some implementations described herein improve access times to data stored on a hierarchical storage system in comparison to other standard approaches.
- the storage system includes a lower tier storage device which has slow random access times (but fast access for sequentially stored data) and a higher tier storage device which has fast random access times.
- the higher tier storage device is more expensive in comparison to the lower tier storage device, which limits the practical size of the higher tier data storage device.
- Different approaches for moving data between the higher tier and lower tier have been proposed for improving the access times and/or for improving overall efficiency of the data storage system (i.e., combination of higher tier and lower tier data storage devices).
- standard approaches to making a tier decision are purely based on activity statistics of the data, such as number of reads, and how sequential the reads are.
- Other known approaches perform tiering of deduplicated data at a full file level. When the whole file is considered as non-active, the whole file is moved to the lower tier.
- moving the most access active (e.g. reads and/or writes) data chunks which also are the most compressible (e.g., considered as a normalized combination) to the higher tier improves access performance for the access active data chunks, and/or improves overall performance of the hierarchical data storage system.
- the performance of the hierarchical data storage system may be measured by the number of input/outputs (IOs) it can serve per second and/or the average latency of each IO.
- the higher tier data storage tier may be implemented as a SSD with latency of about 0.1 millisecond (ms) while the lower tier data storage may be implemented as a HDD with random access latency of about 5-10 ms.
- Chunks of data are moved between tiers. A data chunk may be moved a tier down by reading the data chunk from the higher tier and writing it to the lower tier.
- a data chunk may be moved a tier up by reading the data from the lower tier and writing it to the higher tier. Data chunks that were read on the previous tier may be removed and/or overwritten to make room for other data chunks.
- the present disclosure may be a system, a method, and/or a computer program product.
- the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
- the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- FIG. 1 is a flowchart of a method of moving data chunks between a lower tier data storage device and a higher tier data storage device according to a predicted normalized access parameter, in accordance with some embodiments.
- FIG. 2 is a block diagram of components of a system 200 for moving data chunks between a lower tier data storage device and a higher tier data storage device according to a predicted normalized access parameter, in accordance with some embodiments.
- System 200 may implement the acts of the method described with reference to FIG. 1, by processor(s) 202 of a computing device 204 executing code instructions (e.g., code 206A) stored in a memory 206.
- Computing device 204 manages a hierarchical storage 208 that includes at least a lower tier data storage device 210 and a higher tier data storage device 212.
- Computing device 204 may use a prefetching process 206B (e.g., stored on memory 206, executed by processor(s) 202) for prefetching data from lower tier data storage device 210, as described herein.
- the prefetching process 206B may predict the location of the next data component before the data component is requested, and fetch the next data component before the request.
- the prefetching may be performed from lower tier data storage device 210, saving room on higher tier data storage device 212.
- Some prefetching processes 206B are designed to predict locations of data components that are non-sequentially located, for example, located in a striding pattern (e.g., increase by a fixed address location relative to the previous address location) and/or in a constant address pattern that may at first appear to be random.
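- as an illustration of the striding case, a minimal sketch follows (a deliberately simplified detector with illustrative addresses; the actual logic of prefetching process 206B is not specified at this level):

```python
def detect_stride(addresses: list[int]) -> int | None:
    """Return the fixed stride if the access history strides, otherwise None."""
    if len(addresses) < 3:
        return None
    deltas = {b - a for a, b in zip(addresses, addresses[1:])}
    return deltas.pop() if len(deltas) == 1 else None

history = [4096, 8192, 12288, 16384]  # each access increases by a fixed 4096
stride = detect_stride(history)
if stride is not None:
    next_prefetch = history[-1] + stride  # 20480: fetched before it is requested
```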
- Lower tier data storage device 210 has relatively slower random access input/output (IO) (e.g., read) times in comparison to higher tier data storage device 212.
- Higher tier data storage device 212 has relatively faster random I/O (e.g., read and/or write) times in comparison to lower tier data storage device 210.
- Lower tier data storage device 210 may cost less (e.g., per megabyte) in comparison to higher tier data storage device 212.
- Lower tier data storage device 210 may be implemented, for example, as a hard disk drive (HDD). Lower tier data storage device 210 may provide fast sequential reading and/or writing, but has poor performance for random I/O as the seek times may be very high (e.g., up to 10 milliseconds).
- Higher tier data storage device 212 may be implemented, for example, as a solid state drive (SSD), and/or phase-change memory (PCM).
- Higher tier data storage device 212 may serve as a cache and/or a tier (e.g., cache when data is volatile and has a copy in the lower tier, and/or tier when the data is nonvolatile and/or may be kept (e.g., only) in the higher tier) for lower tier data storage device 210.
- Hierarchical storage 208 is in communication with a computing system 214, which stores data on hierarchical storage 208 and/or reads data stored on hierarchical storage 208.
- Hierarchical storage 208 may be integrated within computing system 214, and/or may be implemented as an external storage device.
- Computing system 214 may be indirectly connected to hierarchical storage 208 via computing device 204, i.e., computing system 214 may communicate with computing device 204, where computing device 204 communicates with hierarchical storage 208, rather than computing system 214 directly communicating with hierarchical storage 208.
- Computing system 214 and/or computing device 204 may be implemented as, for example, one or more of a computing cloud, a cloud network, a computer network, a virtual machine(s) (e.g., hypervisor, virtual server), a network node (e.g., switch, a virtual network, a router, a virtual router), a single computing device (e.g., client terminal), a group of computing devices arranged in parallel, a network server, a web server, a storage server, a local server, a remote server, a client terminal, a mobile device, a stationary device, a kiosk, a smartphone, a laptop, a tablet computer, a wearable computing device, a glasses computing device, a watch computing device, and a desktop computer.
- hierarchical storage 208 is used exclusively by a single user, such as computing system 214.
- hierarchical storage 208 is used by multiple users, such as multiple client terminals 216 accessing hierarchical storage 208 over a network 218; for example, computing system 214 provides cloud storage services and/or virtual storage services to client terminals 216.
- Computing device 204 may be implemented as, for example, integrated within hierarchical storage 208 (e.g., as hardware and/or software installed within hierarchical storage 208), integrated within computing system 214 (e.g., as hardware and/or software installed within computing system 214, such as an accelerator chip and/or code stored on a memory of computing system 214 and executed by processor of computing system 214), and/or as an external component (e.g., implemented as hardware and/or software) in communication with hierarchical storage 208, such as a plug-in component.
- hierarchical storage 208 and computing device 204 are implemented as one storage system that exposes storage (e.g., functions, features, capabilities) to computing system(s) 214.
- Computing device 204 includes one or more processor(s) 202, implemented as for example, central processing unit(s) (CPU), graphics processing unit(s) (GPU), field programmable gate array(s) (FPGA), digital signal processor(s) (DSP), application specific integrated circuit(s) (ASIC), customized circuit(s), processors for interfacing with other units, and/or specialized hardware accelerators.
- processor(s) 202 may be implemented as a single processor, a multi-core processor, and/or a cluster of processors arranged for parallel processing (which may include homogenous and/or heterogeneous processor architectures). It is noted that processor(s) 202 may be designed to implement in hardware one or more features stored as code instructions 206A and/or 206B.
- Memory 206 stores code instructions implementable by processor(s) 202, for example, a random access memory (RAM), read-only memory (ROM), and/or a storage device, for example, non-volatile memory, magnetic media, semiconductor memory devices, hard drive, removable storage, and optical media (e.g., DVD, CD-ROM).
- Memory 206 may store code 206A that, when executed by processor(s) 202, implements one or more acts of the method described with reference to FIG. 1, and/or store prefetching process 206B code as described herein.
- Computing device 204 may include a data storage device 220 for storing data, for example, monitored access patterns as described herein.
- Data storage device 220 may be implemented as, for example, a memory, a local hard-drive, a removable storage unit, an optical disk, a storage device, and/or as a remote server and/or computing cloud (e.g., accessed using a network connection).
- code instructions executable by processor(s) 202 may be stored in data storage device 220, for example, with executing portions loaded into memory 206 for execution by processor(s) 202.
- Computing device 204 may be in communication with a user interface 222 that presents data to a user and/or includes a mechanism for entry of data, for example, one or more of a touch-screen, a display, a keyboard, a mouse, voice activated software, and a microphone.
- Network 218 may be implemented as, for example, the internet, a local area network, a virtual private network, a virtual public network, a wireless network, a cellular network, a local bus, a point to point link (e.g., wired), and/or combinations of the aforementioned.
- access patterns to data chunks located on a lower tier data storage device and/or on a higher tier data storage device are monitored. Access patterns may be non-normalized, referred to herein as non-normalized access parameters. Non-normalized access parameters may be computed for each data chunk independently of other data chunks, and/or independently of where the data chunk is stored.
- Access patterns may be computed based on collected data parameters, for example, computing statistical data parameters for data chunks (e.g., for each data chunk). Examples of data parameters used to compute access patterns include: reads, sequential reads, size of reads, writes, sequential writes, and size of writes.
- An exemplary access parameter is a predicted non-normalized number of random reads.
- the predicted non-normalized number of random reads may be computed from other access patterns and/or from the collected data.
- the predicted non-normalized number of random reads for the data chunk may be generated as an outcome of a machine learning (ML) model, for example, a regressor, a neural network, a classifier, and the like.
- the trained ML model generates an outcome indicating the predicted number of random reads expected for a target chunk fed as input.
- the ML model may be trained on a training dataset of records, where each record includes a respective chunk labelled with a ground truth label of number of random reads for the respective chunk.
- the training dataset may include other access patterns, such as sequential reads, size of reads, writes, sequential writes, and size of writes.
- the other access patterns determined for the target chunk may be fed as input into the ML model to obtain the outcome of predicted number of random reads.
- Using the other access patterns may increase accuracy of the predicted number of random reads generated by the ML model.
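- as a hedged illustration only (the disclosure does not fix a model type, library, or feature set), such a regressor could be trained roughly as follows, here using scikit-learn as an assumed library and hypothetical feature values:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training matrix: one row per chunk, columns are access-pattern
# statistics (sequential reads, size of reads, writes, sequential writes, size of writes).
X_train = np.array([
    [120,  4096,  30,  10, 8192],
    [  5, 65536, 200, 180, 4096],
    [ 60,  8192,  10,   2, 4096],
])
# Ground truth labels: observed number of random reads per chunk.
y_train = np.array([350, 12, 90])

model = GradientBoostingRegressor().fit(X_train, y_train)

# At tiering time, the target chunk's current statistics are fed as input to
# obtain the predicted non-normalized number of random reads.
target_chunk = np.array([[80, 8192, 15, 5, 4096]])
predicted_random_reads = model.predict(target_chunk)[0]
```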
- Other approaches may be used to obtain the predicted non-normalized number of random reads, for example, a set of rules, and/or mathematical prediction models.
- the access patterns are dynamically decayed.
- the decay may be performed by multiplying a current parameter of the access pattern by a decay value less than 1, every time interval to obtain an adapted parameter of the access pattern.
- Other decay approaches may be used, for example, linear, logarithmic, dynamic changing values, and the like.
- the predicted normalized access parameter may be computed using the adapted parameter of the access pattern.
- the decay value prevents the value of the parameter of the access pattern from increasing indefinitely, and/or maintains the value of the parameter of the access pattern at a reasonable state that enables processing in a reasonable time. For example, every 5 minutes the number of reads (an example of the parameter of the access pattern) is multiplied by 0.99, such that if there are currently 100 reads, after 5 minutes the number of reads is reduced to 99.
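- a minimal sketch of this multiplicative decay, using the 5-minute interval and 0.99 factor from the example above:

```python
DECAY_FACTOR = 0.99          # decay value, less than 1
DECAY_INTERVAL_SEC = 5 * 60  # applied every 5 minutes in the example above

def decay_access_counters(counters: dict) -> None:
    """Scale every access-pattern counter down by the decay factor."""
    for key in counters:
        counters[key] *= DECAY_FACTOR

chunk_counters = {"reads": 100.0, "writes": 40.0}
decay_access_counters(chunk_counters)  # run once per DECAY_INTERVAL_SEC
print(chunk_counters["reads"])         # ~99: 100 reads decay to 99, as in the example
```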
- the access pattern may be computed per individual data chunk (e.g., each data chunk), where an individual data chunk includes multiple sequentially stored data blocks.
- Blocks may be the smallest granularity operated on by the storage system. A user may read and/or write a single block and/or multiple blocks. Blocks may be of a size between about 0.5-32 kilobytes (KB), or other ranges.
- the predicted non-normalized access parameter(s) may be computed per data chunk of multiple sequentially stored data blocks, rather than per block.
- a chunk may be a continuous address space of, for example, 4 megabytes (MB) or other values.
- Computing the predicted non-normalized access parameter per data chunk rather than per block reduces storage requirements and/or improves computational performance (e.g., processor utilization, processor time) in comparison to using very large data structures for storing non-normalized access parameter(s) per block.
- at the lower tier there is a desire to store data sequentially if it is sequential in the address space (it is noted that portions of the data can be read from the higher tier while the rest are read sequentially from the lower tier). Movement (e.g., as described with reference to 106) may be performed per data chunk.
- the access pattern includes prefetching patterns by a prefetching process.
- the prefetching pattern may be, for example, one or combination of sequentially, stride (i.e., increase by fixed step each time), and/or randomly.
- the prefetching process places the prefetched data components (that are located non-sequentially on the lower tier data storage device) on a higher tier data storage device when the data component is not already stored on the higher tier data storage device.
- the prefetching process computes a probability of each of multiple candidate subsequent data chunks being accessed given a current data chunk being accessed, and prefetches the subsequent data chunk having highest probability when the current data chunk is being accessed.
- the prefetching process that computes the probability enables selecting the data chunks for which highest accuracy is obtained for storage on the higher tier data storage device, which improves performance of the higher tier data storage device since stored data chunks are most likely to actually be accessed in the future over other components with lower probability which are kept on the lower tier data storage device.
- accuracy of the prefetching patterns is computed.
- the prefetching pattern, e.g., the data component to be prefetched, may be predicted as described herein with reference to the believe cache process discussed below. As used herein, the term believe cache relates to a prefetch cache that predicts next location(s) which are not necessarily sequential.
- the accuracy may be computed as a percentage of when the prefetching pattern has correctly prefetched the correct component, relative to all prefetching attempts including attempts where the prefetching pattern was unable to prefetch the correct component.
- Two or more prefetching patterns having accuracy above a threshold may be selected. The threshold may be, for example, 20%, 25%, 30%, 40%, 45%, 50%, or other values. Two or more prefetching patterns with highest accuracy are selected, since such prefetching patterns are most likely to be re-selected in the future.
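- a sketch of this accuracy bookkeeping (the counter names and example values are illustrative):

```python
def prefetch_accuracy(correct_prefetches: int, total_attempts: int) -> float:
    """Share of attempts in which the prefetched component was actually accessed next."""
    return correct_prefetches / total_attempts if total_attempts else 0.0

# e.g., 45 correct prefetches out of 100 attempts gives 0.45,
# which passes a 40% threshold but not a 50% threshold.
print(prefetch_accuracy(45, 100))
```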
- the prefetching process is based on computing conditional probabilities of a next access (e.g., read) location based on a current access (e.g., read) location, sometimes referred to as believe cache prefetching.
- the prefetching process (e.g., believe cache prefetching) computes probability of each of multiple candidate subsequent data components being accessed given a current data component being accessed, and prefetches the subsequent data component having highest probability when the current data component is being accessed.
- the prefetching process computes the probability of the prefetching pattern fetching each of multiple candidate components.
- the data may be prefetched from the next access location when the conditional probability is above a threshold.
- believe cache prefetching may be used, for example, when access to data storage is non-sequential but in a repeatable pattern, for example, in striding access (i.e., each time increasing the address by a fixed amount relative to the current access), and/or in another repeatable pattern which may at first appear to be random.
- the next location to be accessed is computed based on the current and/or previous locations that were accessed, based on absolute address locations and/or relative address locations. An exemplary computation is now described:
- when a first location (denoted A) is accessed, the following memory locations are subsequently accessed multiple times: a second location (denoted X) is accessed 10 times, a third location (denoted Y) is accessed 3 times, and a fourth location (denoted Z) is accessed 5 times.
- when a fifth location (denoted B) is accessed, the following memory locations are subsequently accessed multiple times: the second location (denoted X) is accessed 6 times, the third location (denoted Y) is accessed 2 times, the fourth location (denoted Z) is accessed 4 times, and a sixth location (denoted K) is accessed 7 times.
- Conditional probabilities are calculated from the observed counts; for example, P(X|A) = 10/18, P(Y|A) = 3/18, P(Z|A) = 5/18, and P(X|B) = 6/19, P(Y|B) = 2/19, P(Z|B) = 4/19, P(K|B) = 7/19.
- the recommendation for which data location to prefetch from may be computed by calculating the candidate probability of each of the following locations: X, Y, Z, K.
- the probabilities are sorted to rank the most likely next locations from where prefetching of data is obtained.
- One or more prefetches may be issued, for example, a single prefetch, two prefetches, or more, and/or according to a threshold.
- the first prefetch is from location X.
- the second prefetch is from location Z.
- the third prefetch is from location K. If a threshold of 50% is used, data is prefetched from locations X and Z.
- the current access locations (i.e., A, B) act as voters, and the candidate prefetch locations (i.e., X, Y, Z, K) are scored using a relation matrix that maps the current history (e.g., curHis: A B) to the candidates.
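- a sketch of this ranking follows, using the counts from the example above; the combination rule (summing the per-voter conditional probabilities) is an assumption, since the disclosure does not spell out how the voters' votes are merged:

```python
from collections import defaultdict

# Relation matrix: observed counts of which location is accessed next,
# keyed by the current (voter) location, as in the example above.
relation_matrix = {
    "A": {"X": 10, "Y": 3, "Z": 5},
    "B": {"X": 6, "Y": 2, "Z": 4, "K": 7},
}

def rank_prefetch_candidates(voters):
    """Sum P(candidate | voter) over the current access history (the voters)."""
    scores = defaultdict(float)
    for voter in voters:
        counts = relation_matrix.get(voter, {})
        total = sum(counts.values())
        for candidate, count in counts.items():
            scores[candidate] += count / total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rank_prefetch_candidates(["A", "B"]))
# Ranks X first, then Z, then K, then Y, matching the ordering in the text.
```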
- a predicted normalized access parameter is computed for one or more data chunks.
- the predicted normalized access parameter may be computed for individual data chunks, for example, for each data chunk.
- the predicted normalized access parameter is computed according to a compression parameter and according to one or more predicted non-normalized access parameters computed from an analysis of the access patterns.
- the predicted normalized access parameter is computed by multiplying the compression parameter (e.g., compression ratio) by the predicted non-normalized access parameter.
- using the compression parameter, which indicates the actual space that the data chunk takes up in the higher tier data storage device, improves the performance of the higher tier data storage device.
- for example, consider a chunk A of 4MB that is not compressible (compression parameter of 1) and is predicted to serve 6 reads, and chunks B and C of 4MB each that compress into 2MB each (compression parameter of 2) and are predicted to serve 4 reads each. When the compression parameter is not taken into consideration, and there is space for 4MB in the higher tier and 8MB in the lower tier, using standard approaches chunk A will be placed in the higher tier, and chunks B and C will be placed in the lower tier. This means that out of 14 IOs, 6 IOs will be served from the higher tier, while 8 IOs will be served from the lower tier.
- when the compression parameter is taken into consideration, the amount of data in the higher tier will be 4MB (chunk A is not compressible according to the compression parameter having a value of 1) and the amount of data on the lower tier is also 4MB (two chunks of 4MB each compressible into 2MB according to the compression parameter having a value of 2). As such, it is better to keep chunks B and C in the higher tier, and thus get 8 of the reads served from the higher tier, rather than the 6 reads served from the higher tier if chunk A is placed there.
- the normalized #read for chunk A is 6 (i.e., computed as 1*6)
- the normalized number for chunks B and C is 8 (i.e., 2*4) and thus chunks B and C, which have higher normalized predicted read values than chunk A, are moved into the higher tier.
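- a small sketch reproducing this arithmetic (chunk names and fields are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    name: str
    predicted_reads: int      # predicted non-normalized number of random reads
    compression_ratio: float  # non-compressed size / size on the target tier

    @property
    def normalized_reads(self) -> float:
        # Predicted normalized access parameter: compression ratio * predicted reads.
        return self.compression_ratio * self.predicted_reads

chunks = [Chunk("A", 6, 1.0), Chunk("B", 4, 2.0), Chunk("C", 4, 2.0)]
for chunk in sorted(chunks, key=lambda c: c.normalized_reads, reverse=True):
    print(chunk.name, chunk.normalized_reads)
# B 8.0 and C 8.0 outrank A 6.0, so B and C are moved to the higher tier.
```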
- the compression parameter may be computed as a compression ratio between a size of the data chunk when non-compressed and a size of the data chunk when compressed by a compression process when the data chunk is placed on the higher tier data storage device and/or lower tier data storage device. Since compression approaches may vary between different tiers, using the compression parameter computed according to the compression approach used in the higher tier and/or the lower tier provides a more accurate predicted normalized access parameter. For example, at the lower tier (e.g., HDD), a stronger compression algorithm may be applied and/or the compression block granularity may be higher yielding a better compression ratio.
- the compression parameter may be computed based on a deduplication parameter indicating an amount of space occupied by de-duplicated data of the data chunk.
- Deduplication has a large effect on the amount of space that data takes up, such as in the higher tier data storage device. For example, when a data chunk is completely de-duplicated on the higher tier data storage device, there is no point of moving it to the lower tier data storage device even if the chunk is never accessed, as the de-duplicated chunk does not actually take up space on the higher tier data storage device.
- the deduplication parameter may be computed as follows: for each respective data block that is not de-duplicated, the deduplication parameter indicates the size of the respective data block; for each respective data block that is de-duplicated, the deduplication parameter is computed as the size of a compression of the respective data block divided by the number of copies of the respective block.
- the compression ratio based on the deduplication parameter may be computed as the size of the data chunk when no de-duplication is applied, divided by the size of the data chunk after de-duplication of its data blocks by a de-duplication process, when the data chunk is placed on the higher tier data storage device.
- the compression ratio considers the effects of de-duplication on the size of the data chunk when placed at the target tier. Since de-duplication approaches may vary between different tiers, using the compression parameter computed according to the de-duplication approach used in the higher tier (when moving to the higher tier) provides a more accurate predicted normalized access parameter.
- the amount of de-duplication on the higher tier may be evaluated and/or approximated (e.g., predicted) based on de-duplication in the lower tier, for example, when the applied deduplication cannot be accurately determined in advance and/or cannot be computationally efficiently determined in advance, until the data chunk is actually moved to the higher tier.
- the compression parameter may be computed based on the deduplication parameter prior to moving the data chunk to the higher tier or lower tier.
- computing the deduplication parameter (e.g., compression ratio) prior to the move helps avoid mis-tiering: data may appear to be suitable for the lower tier; however, random access times to access the de-duplicated data may be extremely long.
- the compression parameter may be computed based on a delta compression parameter (e.g., compression ratio) indicating an amount of space occupied by delta compression data of the data chunk.
- Delta compression is another form of data reduction, which finds chunks that are similar and compresses them together. Delta compression may have a significant effect on the amount of space that data takes up, such as in the higher tier data storage device.
- the compression ratio based on the delta compression parameter may be computed as the size of an original of the data chunk and delta blocks when no delta compression is applied, divided by the total size of the data blocks of the data chunk used in the delta compression of the data blocks of the data chunk.
- the compression ratio may be selected to be the size of all the chunks which are compressed together, before compression, divided by the size of the chunks after compression. For example, for compression of 3 chunks of 8 kilobytes (KB) each (a total of 24KB), where the chunks' size after compression is 12KB, the compression ratio is 2; thus, for each 8KB chunk the compression ratio is defined as 2, and each 8KB chunk is assumed to occupy 4KB after compression (in practice, when each 8KB chunk is compressed alone, a factor of, for example, 1.1 may be obtained).
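- the 3-chunk example above, expressed as a short sketch (sizes in KB):

```python
def delta_compression_ratio(sizes_before_kb: list[int], size_after_kb: int) -> float:
    """Total size of the chunks compressed together, divided by the compressed size."""
    return sum(sizes_before_kb) / size_after_kb

ratio = delta_compression_ratio([8, 8, 8], 12)  # three 8KB chunks -> 12KB together
print(ratio)  # 2.0, so each 8KB chunk is treated as occupying 4KB on the target tier
```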
- the compression parameter may be computed according to a delta compression process when the data chunk is placed on the higher tier data storage device.
- the compression ratio considers the effects of delta compression on the size of the data chunk when placed at the target tier. Since delta compression approaches may vary between different tiers, using the delta compression parameter computed according to the delta compression approach used in the higher tier (when moving to the higher tier) provides a more accurate predicted normalized access parameter.
- the data chunk is moved between the lower tier data storage device and the higher tier data storage device according to the predicted normalized access parameter.
- the data chunk may be moved from the higher tier data storage device to the lower tier data storage device.
- the data chunk may be moved from the lower tier data storage device to the higher tier data storage device.
- Data chunks with a predicted normalized access parameter (e.g., predicted normalized number of random reads, predicted normalized number of writes, normalized sequential reads) above a threshold, and/or data chunks highest ranked according to the predicted normalized access parameter, are moved to the higher tier data storage device. Data chunks with a predicted normalized access parameter below the threshold, and/or data chunks lowest ranked according to the predicted normalized access parameter, are moved to the lower tier data storage device.
- moving data chunks with highest and/or highest ranked predicted normalized number of writes and/or normalized sequential reads to the lower tier data storage device improves overall performance of the data storage system that includes the lower tier and higher tier data storage devices (e.g., by freeing up the higher tier storage for other data chunks, such as having high predicted random reads).
- a file includes multiple data chunks.
- a first subset of the data chunks may be located on the higher tier and a second subset of the data chunks may be located on the lower tier.
- Individual data chunks of the file may be moved between the lower tier data storage device and the higher tier data storage device according to the predicted normalized access parameter computed per individual data chunk of the file.
- Different data chunks of the same file may be stored on different tiers and moved between tiers according to predicted normalized access parameters computed per chunk.
- Individually considering the optimal tier for each data chunk of the file improves access to the respective data chunk and/or overall access to the file. Movement of individual data chunks of the same file between tiers is in contrast, for example, to movement of an entire file.
- a preset maximal number of data chunks are moved per time interval.
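- putting the movement rules together, a hedged sketch of one tiering pass (the threshold and the per-interval cap are illustrative policy values, not values fixed by the disclosure):

```python
from dataclasses import dataclass

@dataclass
class TieredChunk:
    name: str
    normalized_reads: float  # predicted normalized access parameter
    tier: str                # "higher" or "lower"

MAX_MOVES_PER_INTERVAL = 64  # preset maximal number of moves per time interval (illustrative)

def plan_moves(chunks: list[TieredChunk], threshold: float):
    """Select chunks to promote/demote according to the predicted normalized access parameter."""
    ranked = sorted(chunks, key=lambda c: c.normalized_reads, reverse=True)
    promote = [c for c in ranked if c.tier == "lower" and c.normalized_reads > threshold]
    demote = [c for c in ranked if c.tier == "higher" and c.normalized_reads < threshold]
    # Cap the number of data chunks moved in this interval to avoid thrashing.
    return promote[:MAX_MOVES_PER_INTERVAL], demote[:MAX_MOVES_PER_INTERVAL]
```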
- one or more features described with respect to 102-106 may be iterated, for example, over multiple time intervals for dynamic movement of data chunks between the higher and lower tiers.
- the terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of".
- the phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
- the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
- description in a range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the disclosure. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
- whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range.
- the phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
Abstract
A computing device is provided for management of a lower tier storage device, which has slow access, and a higher tier storage device, which has fast access. Access patterns to data chunks located on the lower tier and the higher tier are monitored. A predicted normalized access parameter is computed for the data chunks. The predicted normalized access parameter enables comparing different data chunks by considering a combination of the amount of compression of the data chunk and a prediction of accesses to the data chunk. The predicted normalized access parameter is computed according to a compression parameter indicating the compression of the stored data chunk, and according to a predicted non-normalized access parameter, indicating predicted access to the data chunk, which is computed from an analysis of the access patterns. The data chunk is moved between the lower tier and the higher tier according to the predicted normalized access parameter.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202180100614.5A CN117677941A (zh) | 2021-11-11 | 2021-11-11 | Data compression and deduplication-aware tiering in a storage system
PCT/EP2021/081384 WO2023083454A1 (fr) | 2021-11-11 | 2021-11-11 | Data compression and deduplication-aware tiering in a storage system
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2021/081384 WO2023083454A1 (fr) | 2021-11-11 | 2021-11-11 | Data compression and deduplication-aware tiering in a storage system
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023083454A1 true WO2023083454A1 (fr) | 2023-05-19 |
Family
ID=78695712
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2021/081384 WO2023083454A1 (fr) | 2021-11-11 | 2021-11-11 | Compression de données et hiérarchisation sensible à la déduplication dans un système de stockage |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117677941A (fr) |
WO (1) | WO2023083454A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116975008A (zh) * | 2023-09-22 | 2023-10-31 | 青岛海联智信息科技有限公司 | Method for optimized storage of ship meteorological monitoring data
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130006948A1 (en) * | 2011-06-30 | 2013-01-03 | International Business Machines Corporation | Compression-aware data storage tiering |
-
2021
- 2021-11-11 WO PCT/EP2021/081384 patent/WO2023083454A1/fr active Application Filing
- 2021-11-11 CN CN202180100614.5A patent/CN117677941A/zh active Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130006948A1 (en) * | 2011-06-30 | 2013-01-03 | International Business Machines Corporation | Compression-aware data storage tiering |
Non-Patent Citations (2)
Title |
---|
DEVARAJAN HARIHARAN ET AL: "HCompress: Hierarchical Data Compression for Multi-Tiered Storage Environments", 2020 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), IEEE, 18 May 2020 (2020-05-18), pages 557 - 566, XP033791699, DOI: 10.1109/IPDPS47924.2020.00064 * |
RAMLJAK DUSAN: "DATA DRIVEN HIGH PERFORMANCE DATA ACCESS", DISSERTATION SUBMITTED TO THE TEMPLE UNIVERSITY GRADUATE BOARD, 31 December 2018 (2018-12-31), XP055945815, Retrieved from the Internet <URL:https://scholarshare.temple.edu/bitstream/handle/20.500.12613/2208/Ramljak_temple_0225E_13554.pdf?sequence=1&isAllowed=y> [retrieved on 20220725] * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116975008A (zh) * | 2023-09-22 | 2023-10-31 | 青岛海联智信息科技有限公司 | Method for optimized storage of ship meteorological monitoring data |
CN116975008B (zh) * | 2023-09-22 | 2023-12-15 | 青岛海联智信息科技有限公司 | Method for optimized storage of ship meteorological monitoring data |
Also Published As
Publication number | Publication date |
---|---|
CN117677941A (zh) | 2024-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11797185B2 (en) | Solid-state drive control device and learning-based solid-state drive data access method | |
Wang | Yang | |
US9164676B2 (en) | Storing multi-stream non-linear access patterns in a flash based file-system | |
US11055224B2 (en) | Data processing apparatus and prefetch method | |
CN112970006B (zh) | Memory access prediction method and circuit based on recursive neural network | |
US10387340B1 (en) | Managing a nonvolatile medium based on read latencies | |
HUE035390T2 (en) | Data migration process, data migration device and storage device | |
Laga et al. | Lynx: A learning linux prefetching mechanism for ssd performance model | |
CN113254362A (zh) | Storage device and operating method of memory controller | |
WO2021050109A1 (fr) | Système de stockage non volatil avec filtrage d'échantillons de données pour des statistiques de fonctionnement surveillées | |
CN115756312A (zh) | Data access system, data access method and storage medium | |
Chen et al. | A hybrid memory built by SSD and DRAM to support in-memory Big Data analytics | |
CN117235088B (zh) | Cache update method, apparatus, device, medium and platform for a storage system | |
WO2023083454A1 (fr) | Data compression and deduplication-aware tiering in a storage system | |
Wu et al. | Exploiting workload dynamics to improve SSD read latency via differentiated error correction codes | |
WO2023061567A1 (fr) | Compressed cache as a cache tier | |
Asadi et al. | DiskAccel: Accelerating disk-based experiments by representative sampling | |
WO2023088535A1 (fr) | Cache eviction based on a current tiering status | |
WO2023061569A1 (fr) | Smart defragmentation of a data storage system | |
WO2022248051A1 (fr) | Smart caching of prefetchable data | |
WO2022233391A1 (fr) | Smart data placement on a tiered storage set | |
CN111796757B (zh) | Solid-state drive cache management method and device | |
KR100974514B1 (ko) | Sequential prefetching method in a computer system | |
CN114746848A (zh) | Cache architecture for a storage device | |
Bhimani et al. | Automatic stream identification to improve flash endurance in data centers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21810601 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202180100614.5 Country of ref document: CN |
|
NENP | Non-entry into the national phase |
Ref country code: DE |