CN110955613A - Intelligent data streaming and flow tracking for storage devices - Google Patents


Info

Publication number
CN110955613A
CN110955613A (Application CN201811123110.1A)
Authority
CN
China
Prior art keywords
data
stream
streams
update
allocating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811123110.1A
Other languages
Chinese (zh)
Inventor
路向峰
孙丛
王金一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Memblaze Technology Co Ltd
Original Assignee
Beijing Memblaze Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Memblaze Technology Co Ltd filed Critical Beijing Memblaze Technology Co Ltd
Priority to CN201811123110.1A priority Critical patent/CN110955613A/en
Publication of CN110955613A publication Critical patent/CN110955613A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0877: Cache access modes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses intelligent data stream allocation and stream tracking for storage devices. The disclosed data stream allocation comprises the steps of: obtaining an IO command; obtaining an update parameter according to the IO command; and allocating a stream to the IO command according to the update parameter.

Description

Intelligent data streaming and flow tracking for storage devices
Technical Field
The present application relates to storage technology, and more particularly to identifying the data that accesses a storage device as a plurality of streams, and adaptively tracking the identified data streams.
Background
FIG. 1 illustrates a block diagram of a storage device. The storage device 102 is coupled to a host to provide storage capacity to the host. The host and the storage device 102 may be coupled in various ways, including but not limited to SATA (Serial Advanced Technology Attachment), SCSI (Small Computer System Interface), SAS (Serial Attached SCSI), IDE (Integrated Drive Electronics), USB (Universal Serial Bus), PCIe (Peripheral Component Interconnect Express), NVMe (NVM Express), Ethernet, Fibre Channel, a wireless communication network, etc. The host may be an information processing device capable of communicating with the storage device in the manner described above, such as a personal computer, tablet, server, portable computer, network switch, router, cellular telephone, or personal digital assistant. The storage device 102 includes an interface 103, a control component 104, one or more NVM chips 105, and a DRAM (Dynamic Random Access Memory) 110.
NAND flash memory, phase-change memory, FeRAM (Ferroelectric RAM), MRAM (Magnetoresistive RAM), RRAM (Resistive RAM), etc. are common NVMs.
The interface 103 may be adapted to exchange data with the host via SATA, IDE, USB, PCIe, NVMe, SAS, Ethernet, Fibre Channel, etc.
The control component 104 controls data transfer among the interface 103, the NVM chips 105, and the DRAM 110, and also handles storage management, host logical address to flash physical address mapping, wear leveling, bad block management, and the like. The control component 104 can be implemented in software, hardware, firmware, or a combination thereof; for example, it can take the form of an FPGA (Field-Programmable Gate Array), an ASIC (Application-Specific Integrated Circuit), or a combination thereof. The control component 104 may also include a processor or controller that executes software to manipulate the hardware of the control component 104 and process IO (Input/Output) commands. The control component 104 may also be coupled to the DRAM 110 and access its data. FTL tables and/or cached data of IO commands may be stored in the DRAM.
The control component 104 includes a flash interface controller (also called a media interface controller or flash channel controller). The flash interface controller is coupled to the NVM chips 105, issues commands to the NVM chips 105 in a manner conforming to their interface protocol in order to operate them, and receives the command execution results output by the NVM chips 105. Known NVM chip interface protocols include "Toggle", "ONFI", etc.
In the solid-state storage device, an FTL (Flash Translation Layer) maintains the mapping information from logical addresses to physical addresses. The logical addresses constitute the storage space of the solid-state storage device as perceived by upper-level software such as an operating system; a physical address is the address of a physical storage unit of the solid-state storage device. Address mapping may also be implemented using an intermediate address form, e.g., the logical address is mapped to an intermediate address, which in turn is further mapped to a physical address.
A table structure storing the mapping information from logical addresses to physical addresses is called an FTL table. FTL tables are important metadata in a solid-state storage device. Usually, each entry of the FTL table records the address mapping relationship at the granularity of a data page of the solid-state storage device.
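As an illustration of the mechanism described above (not part of the patent text), an FTL table can be sketched as a mapping from logical page addresses (LPAs) to physical page addresses (PPAs), where rewriting a logical page leaves the previously referenced physical page holding dirty data. All names here are assumptions for the sketch.

```python
# Minimal FTL-table sketch (illustrative names; the patent does not
# specify a data structure). Each entry maps a logical page address
# to the physical page address most recently written for it.

class FTLTable:
    def __init__(self):
        self.table = {}          # LPA -> PPA
        self.dirty = set()       # PPAs whose data is no longer referenced

    def write(self, lpa, new_ppa):
        """Record a (re)write: the old PPA, if any, becomes dirty data."""
        old_ppa = self.table.get(lpa)
        if old_ppa is not None:
            self.dirty.add(old_ppa)   # data at the old location is now "garbage"
        self.table[lpa] = new_ppa

    def lookup(self, lpa):
        return self.table.get(lpa)

ftl = FTLTable()
ftl.write(lpa=7, new_ppa=100)   # first write of logical page 7
ftl.write(lpa=7, new_ppa=205)   # update: PPA 100 becomes dirty
```

After the second write, `ftl.lookup(7)` returns the newest physical address (205), and PPA 100 is tracked as dirty, matching the valid/dirty distinction described below.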
Fig. 2 shows a schematic diagram of a large block. A large block includes physical blocks from each of a plurality of logical units (referred to as a logical unit group); optionally, each logical unit contributes one physical block to the large block. As an example, large blocks are constructed over every 16 logical units (LUNs), so that each large block includes 16 physical blocks, one from each of the 16 LUNs. In the example of FIG. 2, large block 0 includes physical block 0 from each of the 16 LUNs, and large block 1 includes physical block 1 from each LUN. Large blocks may also be constructed in many other ways.
As an alternative, page stripes are constructed within a large block: physical pages with the same physical address within each logical unit (LUN) constitute a "page stripe". In FIG. 2, physical pages 0-0, 0-1, ..., 0-x form page stripe 0, where physical pages 0-0 through 0-14 store user data and physical page 0-15 stores parity data computed from all user data within the stripe. Similarly, physical pages 2-0, 2-1, ..., 2-x constitute page stripe 2. Alternatively, the physical page used to store parity data may be located anywhere in the page stripe.
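A hedged sketch of such a 16-LUN page stripe: 15 user pages plus one parity page. XOR parity is assumed here for illustration; the patent only says the parity is computed from all user data within the stripe.

```python
# Page-stripe parity sketch: the parity page is the byte-wise XOR of the
# user pages, so any single lost page can be rebuilt from the others.

from functools import reduce

def stripe_parity(pages):
    """XOR corresponding bytes of all given pages to form one parity page."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))

user_pages = [bytes([i] * 4) for i in range(15)]   # pages 0-0 .. 0-14 (tiny 4-byte pages)
parity = stripe_parity(user_pages)                  # stored in page 0-15

# Rebuild a lost user page by XOR-ing the parity with the remaining pages.
lost = user_pages[3]
rebuilt = stripe_parity(user_pages[:3] + user_pages[4:] + [parity])
```

The rebuilt page equals the lost one, which is the property that lets a stripe tolerate the loss of any single physical page.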
When a logical page is repeatedly written with data, the FTL entry records the correspondence between the logical page address and the latest physical page address, and the data at a physical page address that was once written but is no longer referenced (e.g., has no record in the FTL table) becomes "garbage". Data that has been written and is still referenced (e.g., has a record in the FTL table) is called valid data, while "garbage" is called dirty data. A physical block containing dirty data is called a "dirty physical block", and a physical block to which no data has been written is called a "free physical block".
Disclosure of Invention
Data recorded in the storage device is updated from time to time. The time interval between two updates of a piece of data is referred to as the data lifecycle. Different data have different lifecycles: video data, for example, has a long lifecycle or is nearly read-only, while data cached from Internet web pages has a relatively short lifecycle, perhaps hours, days, or months. Thus data recorded in the storage device gradually becomes invalid as it is updated during use, and the space occupied by invalid data is reclaimed through operations such as garbage collection.
Since valid data and invalid data are recorded intermixed on the storage medium, the garbage collection process needs to move the valid data. This movement of valid data causes write amplification, which affects the performance and lifetime of the storage device. Prior-art garbage collection selects storage media with a higher proportion of invalid data for collection, in an effort to reduce write amplification.
According to embodiments of the application, data written to the storage device is divided into a plurality of streams, such that the data of each stream has the same or a similar lifecycle. The division also adapts to changes in the lifecycle of each stream's data, so that data belonging to a stream is kept in the same stream even if its lifecycle fluctuates. Streams may also be split or merged, so that they better reflect the lifecycle characteristics of the data belonging to them.
Data of the same stream is stored as much as possible in the same or adjacent physical blocks or large blocks of the storage medium, while different streams are stored as much as possible in different physical blocks or large blocks. A physical block or large block is reclaimed as a whole during garbage collection. Because data belonging to the same stream has a similar lifecycle, physical blocks or large blocks dominated by invalid data are easier to find at garbage-collection time, which reduces the write amplification caused by garbage collection.
In addition to being applied to the garbage collection process, the data stream also provides opportunities for the storage device to optimize the handling of IOs.
According to a first aspect of the present application, there is provided a first method of allocating data streams, comprising the steps of: obtaining an IO command; obtaining an update parameter according to the IO command; and allocating a stream to the IO command according to the update parameter.
According to the first method for allocating data streams in the first aspect of the present application, there is provided the second method for allocating data streams in the first aspect of the present application, wherein the IO command carries an update parameter.
According to the first method for allocating data streams of the first aspect of the present application, there is provided the third method for allocating data streams of the first aspect of the present application, wherein a data unit accessed by an IO command is obtained, and an update time interval of the data unit is obtained as an update parameter.
According to the first method for allocating data streams of the first aspect of the present application, there is provided the fourth method for allocating data streams of the first aspect of the present application, wherein the data unit accessed by the IO command is obtained, the update time intervals of the data unit are obtained, and a mean, median, or mode of one or more update time intervals of the data unit is used as the update parameter.
According to the first method of allocating data streams of the first aspect of the present application, there is provided the fifth method of allocating data streams of the first aspect of the present application, wherein the data unit accessed by the IO command is obtained, the update time interval of the data unit is obtained, and the update parameter is calculated according to g′ = (1−k)·g + k·dt, where g′ is the newly calculated update parameter, g is the old update parameter, dt is the update time interval, and 0 < k < 1.
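As an illustrative aside (not part of the claims), the recurrence g′ = (1−k)·g + k·dt is an exponential moving average of the observed update intervals: recent intervals shift the parameter, while history damps the jump. The value of k below is an assumption.

```python
# Exponential-moving-average sketch of the update-parameter recurrence
# g' = (1-k)*g + k*dt, with an illustrative smoothing factor k.

def update_parameter(g_old, dt, k=0.25):
    """Blend the old update parameter with the newest interval dt (0 < k < 1)."""
    return (1 - k) * g_old + k * dt

g = 100.0                      # old update parameter (e.g. seconds)
for dt in (100, 100, 20):      # observed update intervals
    g = update_parameter(g, dt)
# A single short interval pulls the parameter down, but only partway.
```

After the three intervals above, g has moved from 100 toward 20 but remains well above it, which is the fluctuation tolerance the surrounding text describes.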
The method of allocating data streams according to any one of the first to fifth aspects of the present application provides the sixth method of allocating data streams according to the first aspect of the present application, wherein the data units represent a logical address space or a physical address space of a specified size.
According to a sixth method for allocating data streams of the first aspect of the present application, there is provided the seventh method for allocating data streams of the first aspect of the present application, wherein a data unit accessed by an IO command is obtained, and a stream is allocated to the data unit according to an update parameter of the obtained data unit.
According to a seventh method of allocating data streams of the first aspect of the present application, there is provided the eighth method of allocating data streams of the first aspect of the present application, wherein data units having the same update parameter or having a distance of the update parameter smaller than a threshold value are allocated to the same stream.
According to a seventh method for allocating data streams of the first aspect of the present application, there is provided the ninth method for allocating data streams of the first aspect of the present application, further comprising allocating a first stream for an IO command accessing the first data unit if a distance between an update parameter of the first data unit and an update parameter of the first stream is smaller than a threshold.
According to the seventh method for allocating data streams of the first aspect of the present application, there is provided the tenth method for allocating data streams of the first aspect of the present application, wherein a first stream is allocated to data units having the same update parameter; a second stream is allocated to data units whose update-parameter distance is within the threshold; and a third stream is allocated to the other data units.
According to the third method for allocating data streams of the first aspect of the present application, there is provided the eleventh method for allocating data streams of the first aspect of the present application, wherein the obtained update parameter of the data unit is compared with the update parameters of one or more streams, and a stream whose update-parameter distance from the obtained update parameter is within a threshold is allocated to the IO command.
The method for allocating data streams according to any one of the eighth to eleventh methods of the first aspect of the present application provides the twelfth method for allocating data streams of the first aspect of the present application, wherein, in response to allocating a stream to the data unit accessed by the IO command, the update parameter of the data stream is calculated from the update parameter of the data unit.
According to the twelfth method of allocating data streams of the first aspect of the present application, there is provided the thirteenth method of allocating data streams of the first aspect of the present application, wherein, in response to allocating the first stream to the data unit accessed by the IO command, a mean, median, or mode of the update parameters of some or all of the data units allocated to the first stream is used as the update parameter of the first stream.
According to a twelfth method of allocating data streams of the first aspect of the present application, there is provided the fourteenth method of allocating data streams of the first aspect of the present application, wherein in response to allocating a first stream to a data unit accessed by an IO command, a result of weighted averaging or low-pass filtering of update parameters allocated to a plurality of or all data units of the first stream is taken as an update parameter of the first stream.
According to a twelfth method of allocating data streams of the first aspect of the present application, there is provided the fifteenth method of allocating data streams of the first aspect of the present application, wherein in response to allocating a first stream for a data unit accessed by an IO command, a new update parameter of the first stream is calculated using an update parameter of the data unit and an old update parameter of the first stream.
According to a fifteenth method of allocating data streams of the first aspect of the present application, there is provided the sixteenth method of allocating data streams of the first aspect of the present application, wherein the new update parameter is a weighted sum of an old update parameter of the stream and an update parameter of the data units.
According to the method for allocating data streams of any one of the tenth to sixteenth methods of the first aspect of the present application, there is provided the seventeenth method for allocating data streams of the first aspect of the present application, wherein if the distances between the obtained update parameter of the data unit and the update parameters of one or more streams are all greater than a threshold, a new stream is created, and the new stream or a specified stream is allocated to the IO command.
According to any one of the tenth to sixteenth methods of allocating data streams of the first aspect of the present application, there is provided the eighteenth method of allocating data streams of the first aspect of the present application, wherein one or more new update parameters are created from the update parameters of a stream, the update parameter of each stream representing that stream.
According to the seventeenth or eighteenth method for allocating data streams of the first aspect of the present application, there is provided the nineteenth method for allocating data streams of the first aspect of the present application, wherein the second update parameter and the third update parameter are generated according to the update parameter of the first stream, the second update parameter represents the second stream, and the third update parameter represents the third stream.
According to the eighteenth method of allocating data streams of the first aspect of the present application, there is provided the twentieth method of allocating data streams of the first aspect of the present application, wherein a mean or other statistic of the second update parameter and the third update parameter equals the first update parameter, and the first update parameter is larger than the second update parameter.
According to the method of allocating data streams of any one of the eleventh to twentieth methods of the first aspect of the present application, there is provided the twenty-first method of allocating data streams of the first aspect of the present application, wherein, in response to allocating the first stream to the IO command, the new update parameter of the first stream is calculated.
A twenty-second method of allocating data streams of the first aspect of the present application is provided according to the twenty-first method of allocating data streams of the first aspect of the present application, wherein IO commands are allocated to initial streams.
The twenty-third method for allocating data streams according to the first aspect of the present application is provided according to any one of the first to twenty-second methods for allocating data streams of the first aspect of the present application, wherein the update parameters are compared with the update parameters of one or more existing streams, and a stream whose update parameter is within a threshold of the obtained update parameter is allocated to the IO command.
The method of allocating data streams according to any one of the first to twenty-third aspects of the present application provides the method of allocating data streams according to the twenty-fourth aspect of the present application, wherein new update parameters of the allocated streams are calculated in response to allocating a stream for an IO command.
The method of allocating a data stream according to any one of the first to twenty-fourth aspects of the present application provides the twenty-fifth method of allocating a data stream according to the first aspect of the present application, wherein in response to allocation of a stream to an IO command, a dispersion of the stream to which the IO command is allocated is calculated.
According to the twenty-fifth method of allocating data streams of the first aspect of the present application, there is provided the twenty-sixth method of allocating data streams of the first aspect of the present application, wherein the dispersion of a stream is the dispersion of the update parameters of a plurality of data units associated with the stream.
According to a twenty-fifth or twenty-sixth method of allocating a data stream of the first aspect of the present application, there is provided the twenty-seventh method of allocating a data stream of the first aspect of the present application, wherein the dispersion of the streams is a dispersion of update parameters of the streams.
According to a twenty-fifth or twenty-sixth method of allocating a data stream of the first aspect of the present application, there is provided the twenty-eighth method of allocating a data stream of the first aspect of the present application, wherein the dispersion of the streams is a ratio of a dispersion of update parameters of the streams to an update time interval of the streams.
The method of allocating data streams according to any one of the twenty-fifth to twenty-eighth methods of the first aspect of the present application provides the twenty-ninth method of allocating data streams of the first aspect of the present application, wherein streams are split and/or merged according to their dispersion.
According to the twenty-ninth method for allocating data streams of the first aspect of the present application, there is provided the thirtieth method for allocating data streams of the first aspect of the present application, wherein two new update parameters are generated from the update parameter of a stream, so as to split the stream into two streams.
According to the thirtieth method of allocating data streams of the first aspect of the present application, there is provided the thirty-first method of allocating data streams of the first aspect of the present application, wherein a mean or other statistic of the update parameters of the two streams is the same as that of the stream before splitting, and one of the two update parameters is larger than the other.
The method of allocating data streams according to any one of the twenty-fifth to thirty-first methods of the first aspect of the present application provides the thirty-second method of allocating data streams of the first aspect of the present application, wherein the first stream is split into a first stream and a second stream in response to the dispersion of the first stream being greater than a threshold, or the increment of the dispersion of the first stream being greater than a threshold.
According to the thirty-second method for allocating data streams of the first aspect of the present application, there is provided the thirty-third method for allocating data streams of the first aspect of the present application, wherein a specified fraction of the dispersion of the first stream is taken as the dispersion of each of the split first stream and second stream.
According to the thirty-third method for allocating data streams of the first aspect of the present application, there is provided the thirty-fourth method for allocating data streams of the first aspect of the present application, wherein the dispersion of the split first stream is one half of the dispersion of the first stream before splitting.
According to the method for allocating data streams of any one of the twenty-fifth to thirty-first methods of the first aspect of the present application, there is provided the thirty-fifth method for allocating data streams of the first aspect of the present application, wherein the plurality of streams are sorted by update time interval or update parameter, the post-merge dispersion of two or more adjacent streams is calculated, and two or more streams whose merged dispersion is smaller than a specified threshold are merged.
According to the method of allocating a data stream of any one of the eleventh to thirty-fifth aspects of the present application, there is provided the method of allocating a data stream of the thirty-sixth aspect of the present application, wherein, in response to allocation of a stream to an IO command, an update time interval of the stream to which the IO command is allocated is calculated.
According to the thirty-sixth method of allocating data streams of the first aspect of the present application, there is provided the thirty-seventh method of allocating data streams of the first aspect of the present application, wherein the plurality of update time intervals of the stream are low-pass filtered.
According to the method of allocating a data stream of any one of thirty-sixth to thirty-seventh aspects of the present application, there is provided the method of allocating a data stream of thirty-eighth aspect of the present application, wherein the streams are split and/or merged by updating the time interval.
According to a thirty-eighth method of allocating data streams of the first aspect of the present application, there is provided the thirty-ninth method of allocating data streams of the first aspect of the present application, wherein two or more streams having an update time interval greater than a specified threshold are merged.
According to the method for allocating data streams of any one of the thirty-sixth to thirty-ninth methods of the first aspect of the present application, there is provided the fortieth method for allocating data streams of the first aspect of the present application, wherein the update time interval of a stream is taken as the update time interval of the streams split from it.
According to the method of allocating data streams of any one of the twenty-ninth to thirty-ninth methods of the first aspect of the present application, there is provided the forty-first method of allocating data streams of the first aspect of the present application, wherein a mean or other statistic of the update parameters of the two or more merged streams is used as the update parameter of the merged stream.
According to a second aspect of the present application, there is provided a first storage device, comprising a control component and a nonvolatile storage medium, the control component executing the method of allocating data streams according to any one of the first to forty-first methods of the first aspect.
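The splitting and merging described in the methods above can be sketched as follows. This is an illustrative reading, not the patent's implementation: the split produces two update parameters straddling the original, preserving its mean, with each child inheriting half the parent's dispersion; the merge averages the merged streams' parameters.

```python
# Stream split/merge sketch (illustrative; constants and names are assumptions).

from statistics import mean

def split_stream(param, dispersion):
    """Split one stream into two whose update parameters straddle the
    original; each child inherits half of the parent's dispersion."""
    delta = dispersion / 2
    return (param - delta, dispersion / 2), (param + delta, dispersion / 2)

def merge_streams(params):
    """The merged stream takes the mean of the merged streams' parameters."""
    return mean(params)

(low, d1), (high, d2) = split_stream(param=24.0, dispersion=8.0)
# low = 20.0, high = 28.0: one child parameter larger, one smaller,
# their mean equal to the pre-split parameter, dispersion halved.
```

Merging the two children back (`merge_streams([low, high])`) recovers the original parameter, so split and merge are consistent inverses under this sketch.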
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments described in the present application; those skilled in the art can obtain other drawings from these drawings.
FIG. 1 is a block diagram of a prior art storage device;
FIG. 2 is a schematic diagram of a large block in the prior art;
FIG. 3 is a diagram illustrating a state change of a large block of data in the prior art;
FIG. 4 is a schematic illustration of the shunting provided by embodiments of the present application;
FIG. 5 is a flow chart for allocating flows for data units according to an embodiment of the present application;
FIGS. 6A and 6B are schematic diagrams of splitting and merging of streams provided by embodiments of the present application;
fig. 7 is a flow chart of splitting and/or merging flows provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings of those embodiments. It is obvious that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
FIG. 3 is a schematic diagram of the change in state of data on a large block.
Fig. 3 shows data on large block 1 and large block 2 to which data has been written, with invalid data increasing over time. In Fig. 3, T1 denotes the earliest time, T2 a subsequent time, and T3 the latest time. A blank square of a large block indicates a storage unit (such as a physical page or physical block of flash memory) to which data has been written and whose data is valid, while a shaded square indicates a storage unit whose data has become invalid because it was updated.
As time passes from T1 through T2 to T3, portions of the valid data on large block 1 and large block 2 progressively become invalid, so at each moment each large block has some proportion of invalid data. At time T1 the invalid-data proportion of large block 1 is greater than that of large block 2, and at time T3 it is still greater.
At time T3, if garbage collection is performed, it is advantageous to collect large block 1 and defer collecting large block 2 for the moment, because less valid data is stored on large block 1; large block 2 can then be garbage-collected at a later time.
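The victim-selection rule just described is the classic greedy policy: pick the block with the highest invalid-data ratio so the least valid data has to be moved. A minimal sketch, with illustrative names and page counts:

```python
# Greedy garbage-collection victim selection (illustrative sketch).

def pick_gc_victim(blocks):
    """blocks: dict of block id -> (invalid_pages, total_pages).
    Returns the id of the block with the highest invalid-data ratio."""
    return max(blocks, key=lambda b: blocks[b][0] / blocks[b][1])

# Mirroring FIG. 3 at time T3: large block 1 holds more invalid data.
blocks = {"large_block_1": (12, 16), "large_block_2": (5, 16)}
victim = pick_gc_victim(blocks)
```

Here `victim` is large block 1, since reclaiming it relocates only 4 valid pages instead of 11.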
To have less valid data in the large blocks selected at garbage-collection time, the garbage collection of a large block may be postponed. In some cases, however, the demand for storage space on the storage device makes such a deferral policy hard to implement. Moreover, the valid data on a large block may have a very long lifecycle, which makes the benefit of deferring garbage collection insignificant.
Fig. 4 illustrates a schematic diagram of a flow splitting according to an embodiment of the application.
The storage device provides, for example, a continuous logical address space. The arrow to the right indicates the direction in which the logical address space is incremented. The logical address space is divided into a plurality of data units (410, 411, 412, 413 … … 418). Each data unit occupies a logical address space of a specified or fixed size. For example, the size of a data unit is 512 bytes, 4K, 16K, or other values.
Each data unit has a life cycle. For example, data unit 410 is updated roughly once a day and has a life cycle of 1 day; data unit 411 is updated roughly once a month and has a life cycle of 1 month; data unit 412 is updated roughly once an hour and has a life cycle of 1 hour; data unit 415 is updated roughly once a year and has a life cycle of 1 year.
The FTL table of the storage device maps addresses of the logical address space to physical addresses of the storage medium. When a data unit of the logical address space is updated, a new physical address is allocated to the data unit to hold the updated data, so that the data at the old physical address becomes dirty data.
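The remapping behaviour just described can be sketched as follows. This is a minimal hypothetical model for illustration only; the class name, the dictionary-based table, and the sequential physical-address allocator are assumptions, not the patent's implementation.

```python
# Hypothetical sketch of the FTL behaviour described above: updating a
# logical data unit allocates a fresh physical address, and the old
# physical location becomes dirty data awaiting garbage collection.

class Ftl:
    def __init__(self):
        self.l2p = {}          # logical data unit -> physical address
        self.dirty = set()     # physical addresses holding invalid data
        self._next_phys = 0    # naive sequential allocator (illustrative)

    def write(self, logical_unit):
        old = self.l2p.get(logical_unit)
        if old is not None:
            self.dirty.add(old)            # old copy becomes dirty data
        self.l2p[logical_unit] = self._next_phys
        self._next_phys += 1
        return self.l2p[logical_unit]

ftl = Ftl()
ftl.write(410)    # first write of data unit 410
ftl.write(410)    # update: new physical address, old one becomes dirty
```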
The data units of the logical address space are divided into a plurality of streams according to their life cycles. According to embodiments of the present application, data units with the same or similar life cycles are placed in the same stream, while data units with substantially different life cycles are placed in different streams. For example, referring to Fig. 4, data units 410, 414, and 418 all have a life cycle of 1 day and are placed in stream S1; data unit 411 and data unit 413 have comparable life cycles (1 month and 1 week, respectively) and are placed in stream S3; data unit 412 and data unit 416 are placed in stream S2; and data unit 415 is placed in stream Sn. Optionally, other data units not suited to stream S1, S2, or S3 are also placed in stream Sn.
In an alternative embodiment, each stream has a standard life cycle. For example, the standard life cycle of stream S1 is 1 day, and data units with life cycles of 0.5 to 5 days are placed in stream S1.
Still alternatively or additionally, the standard life cycle and/or the life cycle range of a stream is adjusted to accommodate changes in the life cycles of its data units. For example, the standard life cycle of the stream is updated with the mean, median, or another statistic of the life cycles of the data units assigned to it.
Optionally, the time interval between two past updates of a data unit, or the average of the intervals between multiple past updates, is used as the data unit's update parameter in place of the life cycle, and the data units are divided into streams according to their update parameters. Still alternatively, a weighted average of the intervals between multiple past updates of the data unit, or the result of low-pass filtering those intervals, is used as the update parameter.
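The two variants above, a plain average and a low-pass filter over the update intervals, can be sketched as below. The function names, the smoothing factor `k=0.3`, and the sample intervals are illustrative assumptions; the exponential form matches the `g' = (1 − k)·g + k·dt` recurrence given later in the Fig. 5 discussion.

```python
# Hypothetical sketch: deriving a data unit's update parameter from its
# history of update intervals, either as a plain average or as a
# low-pass (exponentially weighted) filter.

def average_interval(intervals):
    """Mean of the recorded update intervals."""
    return sum(intervals) / len(intervals)

def lowpass_interval(intervals, k=0.3):
    """Low-pass filter: g' = (1 - k) * g + k * dt, seeded with the
    first interval. Recent intervals dominate, old ones decay."""
    g = intervals[0]
    for dt in intervals[1:]:
        g = (1 - k) * g + k * dt
    return g

history = [24.0, 24.0, 48.0]   # hours between successive updates
avg = average_interval(history)
```

The low-pass form needs only one stored value per data unit, whereas the average needs the whole interval history (or a running sum and count).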
Alternatively or additionally, instead of a standard life cycle, an update parameter for the stream is calculated from the update parameters of some or all of the data units assigned to it. For example, the mean, median, or mode of those data units' update parameters is used as the stream's update parameter. Still alternatively, the stream's update parameter is a weighted average, or a low-pass filtered result, of the update parameters of some or all of the data units assigned to it.
According to embodiments of the present application, streams are dynamic. In some cases a data unit originally assigned to one stream is reassigned to another; in other cases a single stream is split into multiple streams; in still other cases two or more streams are merged into a single stream. The streams thereby reflect changes in the way data units are accessed. For example, while a document is being edited, the life cycles of its storage units are short; once the document is finished, their life cycles become very long. When the life cycle of a storage unit changes significantly, the stream to which it is assigned changes accordingly. As yet another example, when the update parameters of a stream exhibit high volatility, the stream is split into two or more streams so that the fluctuation of each resulting stream's update parameters is reduced.
Still alternatively, the life cycles of some storage units are difficult to identify, and all such storage units are assigned to a designated data stream.
In the example of Fig. 4, the logical address space provided by the storage device is divided into a plurality of data units. In an alternative embodiment, the storage device provides a physical address space, an application on a host accessing the storage device uses a logical address space, and the mapping from logical addresses to physical addresses is performed on the host. In this case the host divides the logical address space into data units, maintains the update parameters of the data units, assigns the data units to streams, and calculates the update parameters of the streams.
Still alternatively, the data units are defined over another address space, which may be a physical address space or a logical address space that is mapped to physical addresses. Data units of that address space can likewise be updated: an update parameter is calculated for each data unit to divide the data units into streams, and update parameters are calculated for the streams.
Fig. 5 is a flow chart of allocating streams to data units according to an embodiment of the application.
A stream is allocated to a data unit, and thereby also to the IO commands that write data to that data unit.
Taking the storage device as an example, an IO command is received from a host or other device (510). The IO command indicates, for example, a logical address to which data is to be written; the logical address corresponds to one or more data units of the logical address space. For simplicity, assume the IO command accesses a single data unit (denoted DU 1).
An update parameter is calculated for the data unit DU 1 accessed by the IO command (520). The update parameter indicates the data unit's life cycle, or the time at which it is expected to be updated. As one example, the update parameter of the accessed data unit is carried in the IO command. As a further example, the time of each data unit's previous (most recent) update is recorded, and the difference between the current time and that recorded time is taken as, or used to calculate, the update parameter. As yet another example, the intervals between one or more past updates are recorded for each data unit, and the mean, median, mode, or another statistic of those intervals is used as the update parameter. In yet another example, both the time of the previous update and the current update parameter g are recorded for each data unit, and a new update parameter g' is calculated from g and the difference dt between the current time and the previous update time, for example g' = (1 − k)·g + k·dt, where 0 < k < 1.
The resulting update parameter of the data unit is compared with the update parameters of one or more streams to assign a stream to the data unit (530). For example, the data unit is assigned to the stream whose update parameter is closest to its own.
As yet another example, if no stream exists, or if no stream's update parameter is sufficiently close to the data unit's update parameter, a new stream is created and the data unit is assigned to the newly created stream. As a further example, when no stream's update parameter is sufficiently close to the data unit's, the data unit is instead assigned to a designated stream. A stream is represented, for example, by a stream identifier, and a stream is assigned to a data unit by attaching the stream identifier to the data that updates the data unit.
Optionally, when a data unit is assigned to a stream, the stream's update parameter is also recalculated (540). As one example, the mean, median, mode, or another statistic of the update parameters of one, several, or all data units assigned to the stream is used as the stream's update parameter. As another example, an update parameter is recorded for each stream, and when a new data unit is assigned to the stream, a new update parameter for the stream is calculated from the new data unit's update parameter and the stream's current update parameter, for example as a weighted sum of the two.
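Steps 520 through 540 can be sketched end to end as follows. This is a hypothetical illustration under stated assumptions: the closeness threshold, the smoothing factor `K`, the stream naming scheme, and the sample parameter values are all invented for the example, and the weighted-sum update is just one of the variants the text allows.

```python
# Hypothetical sketch of Fig. 5: pick the stream whose update parameter is
# closest to the data unit's (step 530), creating a new stream when none is
# close enough, then fold the data unit's parameter back into the stream's
# parameter as a weighted sum (step 540).

K = 0.25          # smoothing factor for the stream parameter, 0 < k < 1
THRESHOLD = 10.0  # how close a stream must be to accept the data unit

def assign_stream(streams, unit_param):
    """streams: dict of stream id -> stream update parameter (mutated in
    place). Returns the id of the stream the data unit was assigned to."""
    if streams:
        best = min(streams, key=lambda s: abs(streams[s] - unit_param))
        if abs(streams[best] - unit_param) <= THRESHOLD:
            # step 540: weighted sum of stream and data unit parameters
            streams[best] = (1 - K) * streams[best] + K * unit_param
            return best
    new_id = f"S{len(streams) + 1}"   # no stream close enough: create one
    streams[new_id] = unit_param
    return new_id

streams = {"S1": 24.0, "S2": 720.0}   # e.g. a daily and a monthly stream
sid = assign_stream(streams, 26.0)    # close to S1, so assigned to S1
```

A variant that assigns far-away data units to a designated catch-all stream, as the text also permits, would return a fixed identifier instead of creating `new_id`.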
Fig. 6A and 6B are schematic diagrams of splitting and merging of streams according to an embodiment of the present application.
In Fig. 6A and Fig. 6B, the coordinate axis represents update parameters. For the purposes of this description, the axis represents both the update parameters of data units and the update parameters of streams, so the two can be compared directly. In some cases the update parameters of a stream are calculated from the update parameters of data units, but a stream's update parameter can still be compared directly with a data unit's update parameter.
Referring to Fig. 6A, two streams already exist, stream S1 and stream S2. The rectangular boxes of streams S1 and S2 represent the streams' update parameters; for example, the start point or midpoint of the axis segment covered by a rectangular box is taken as the stream's update parameter. A data unit whose update parameter falls within the range covered by a rectangular box is assigned to the stream that box represents. Circles on the axis represent data units, and the position of a circle represents the data unit's update parameter; data units whose circles lie within a rectangular box are thus assigned to the corresponding stream.
With continued reference to Fig. 6A, within the rectangular box of stream S1 a plurality of data units form three groups (G1, G2, and G3) according to their update parameters, and within the rectangular box of stream S2 a plurality of data units form two groups (G4 and G5). Data unit group G3 and data unit group G4 are adjacent to each other, while group G4 is farther from group G5.
According to an embodiment of the present application, streams S1 and S2 are split, producing new streams S1', S2', and S3. The new stream S1' accommodates data unit groups G1 and G2; the new stream S2' accommodates data unit groups G3 and G4; and stream S3 accommodates data unit group G5, as shown in Fig. 6B.
Optionally, two or more streams are also merged; the merged stream accommodates all the data unit groups of the streams participating in the merge.
By way of example, streams are split or merged by changing their update parameters. For example, referring to Fig. 6A, the position on the axis of the midpoint of a stream's rectangular box indicates the stream's update parameter, and streams are split or merged by changing existing update parameters or creating new ones.
Although Fig. 6A and Fig. 6B show a stream's rectangular box accommodating a plurality of data units, optionally, when a stream changes (e.g., is split or merged), data units already assigned to a stream need not be re-tagged with a new stream identifier; instead, for newly received IO requests, the changed stream is assigned to the data units they access.
Referring again to Fig. 6A, optionally, the parameters of a stream may change. For example, when more data units whose update parameters are adjacent to data unit group G3 are allocated to stream S1, the update parameter of stream S1 increases, so that the rectangular box corresponding to stream S1 shifts to the right along the axis.
The examples of Fig. 6A and Fig. 6B illustrate splitting or merging streams according to the dispersion of their update parameters. Still optionally, streams are also merged according to their update time intervals. An update to any data unit assigned to a stream is treated as an update to the stream, and for two or more streams with large update time intervals (e.g., greater than a specified threshold), it may be advantageous to merge them into a single stream.
FIG. 7 illustrates a flow diagram for splitting and/or merging streams in accordance with an embodiment of the present application.
According to an embodiment of the application, the identification of streams is dynamic. After initialization, an initial stream is provided, and IO commands accessing any data unit are assigned to the initial stream (710). The initial stream has an initial stream update parameter.
During operation of the storage device, an IO command is received (720).
An update parameter is computed for the data unit accessed by the IO command (730).
The resulting update parameter of the data unit is compared with the update parameters of the one or more already existing streams to assign a stream to the data unit (740). For example, the data unit is assigned to the stream whose update parameter is closest to its own.
The update parameter of the stream to which the data unit is assigned is recalculated from the data unit's update parameter (750). Alternatively or additionally, further parameters are calculated for that stream, such as the stream's update time interval and the dispersion of its update parameters. The update time interval of a stream is the interval between the stream's current update and its previous update. Optionally, the stream's update interval is low-pass filtered to reduce the large fluctuations that occasional short-interval updates would otherwise cause in the update interval parameter. The dispersion of a stream's update parameters represents how far the update parameters of the data units assigned to the stream deviate from the stream's update parameter: for example, the sum of the differences between the data units' update parameters and the stream's update parameter, a weighted sum of those differences, or a statistic of the difference computed each time a data unit is assigned to the stream.
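The per-stream bookkeeping of step 750 can be sketched as below. This is a hypothetical model: the class name, the choice of mean absolute deviation, and the shared smoothing factor `k` are assumptions; the patent text allows several alternative statistics.

```python
# Hypothetical sketch of step 750: on each assignment, update a running
# dispersion (low-pass filtered absolute deviation of the data unit's
# parameter from the stream's parameter) and a low-pass filtered update
# time interval for the stream.

class StreamStats:
    def __init__(self, param, k=0.25):
        self.param = param        # stream update parameter
        self.dispersion = 0.0     # filtered deviation of unit parameters
        self.interval = None      # filtered time between stream updates
        self.last_update = None
        self.k = k

    def on_assign(self, unit_param, now):
        dev = abs(unit_param - self.param)
        self.dispersion = (1 - self.k) * self.dispersion + self.k * dev
        if self.last_update is not None:
            dt = now - self.last_update
            if self.interval is None:
                self.interval = dt
            else:                 # low-pass filter the update interval
                self.interval = (1 - self.k) * self.interval + self.k * dt
        self.last_update = now

s = StreamStats(param=24.0)
s.on_assign(24.0, now=0.0)    # deviation 0
s.on_assign(32.0, now=4.0)    # deviation 8, first interval 4
```

The ratio `s.dispersion / s.interval` then gives the alternative dispersion measure mentioned in step 760.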
Periodically, or as needed in response to data units being assigned to a stream, the stream's dispersion is obtained (760). Optionally, the dispersion of the stream's update parameters is used directly as the stream's dispersion. Still alternatively, the stream's dispersion is taken as the ratio of the dispersion of its update parameters to its update time interval.
Streams are split and/or merged according to their dispersion (770). A stream is represented by parameters including its update parameter, its update time interval, and/or its dispersion, so a stream is created by deriving one or more new sets of stream parameters from the parameters of an existing stream. For example, a stream is split into two streams by generating two update parameters from the stream's update parameter, such that the average or another statistic of the two new update parameters equals the original update parameter and one new update parameter is larger than the other. As a further example, an assigned portion (e.g., 1/2) of the stream's dispersion is used as the dispersion of each stream after the split, and the stream's update time interval is used as the update time interval of each split stream.
Similarly, streams are merged by generating the parameters of a new single stream from the parameters of two or more streams. For example, the average or another statistic of the update parameters of the streams being merged is used as the update parameter of the merged stream.
Optionally, a single stream is split when its dispersion is too large or grows quickly. Multiple streams are merged when the merged stream would have a small dispersion. Still alternatively, the streams are sorted by dispersion, the dispersion that would result from merging adjacent pairs (or larger groups) of streams is calculated, and the pair or group whose computed dispersion is smallest, or below a specified threshold, is merged.
Thus, during operation of the storage device, the changes that IO requests make to the streams are tracked through the streams' update parameters, so that the same stream identifier is attached to data units having the same or similar life cycles.
Furthermore, the storage device places the data units belonging to the same stream in the same or adjacent physical blocks or chunks, so that during garbage collection suitable physical blocks or chunks can be selected for reclamation, thereby reducing write amplification and improving the performance of the storage device.
While preferred embodiments of the present application have been described, those skilled in the art may make additional variations and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all alterations and modifications that fall within the scope of the application. It will be apparent to those skilled in the art that various changes and modifications may be made to the present application without departing from its spirit and scope; if such modifications and variations fall within the scope of the claims of the present application and their equivalents, the present application is intended to include them as well.

Claims (10)

1. A method of allocating a data stream, comprising the steps of:
obtaining an IO command;
obtaining an update parameter according to the IO command; and
allocating a stream to the IO command according to the update parameter.
2. The method of allocating data streams according to claim 1, wherein the data unit accessed by the IO command is obtained, the update time interval of the data unit is obtained, and the update parameter is calculated as g' = (1 − k)·g + k·dt, where g' is the calculated update parameter, g is the old update parameter, dt is the update time interval, and 0 < k < 1.
3. The method of allocating data streams according to any one of claims 1 to 2, wherein the obtained update parameter of the data unit is compared with the update parameters of one or more streams, and the IO command is allocated a stream whose update parameter is within a threshold distance of the obtained update parameter.
4. The method of allocating data streams according to claim 3, wherein, in response to allocating a stream to the data unit accessed by the IO command, the update parameter of the data stream is calculated from the update parameter of the data unit.
5. The method of allocating data streams according to claim 4, wherein, if the distances between the obtained update parameter of the data unit and the update parameters of the one or more streams are all greater than a threshold, a new stream is created, and the new stream or a designated stream is allocated to the IO command.
6. The method of allocating data streams according to any one of claims 1 to 5, wherein the update parameter obtained from the IO command is compared with the update parameters of one or more existing streams, and the IO command is allocated a stream whose update parameter is within a threshold distance of the obtained update parameter.
7. The method of allocating data streams according to any one of claims 1 to 6, wherein, in response to allocating a stream to the IO command, the dispersion of the stream to which the IO command is allocated is calculated.
8. The method of allocating data streams according to claim 7, wherein streams are split and/or merged according to their dispersion.
9. The method of allocating data streams according to claim 8, wherein two new update parameters are generated from the update parameter of a stream to split the stream into two streams.
10. A storage device, comprising a control unit and a non-volatile storage medium, the control unit performing the method according to any one of claims 1 to 9.
CN201811123110.1A 2018-09-26 2018-09-26 Intelligent data streaming and flow tracking for storage devices Pending CN110955613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811123110.1A CN110955613A (en) 2018-09-26 2018-09-26 Intelligent data streaming and flow tracking for storage devices

Publications (1)

Publication Number Publication Date
CN110955613A true CN110955613A (en) 2020-04-03

Family

ID=69964360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811123110.1A Pending CN110955613A (en) 2018-09-26 2018-09-26 Intelligent data streaming and flow tracking for storage devices

Country Status (1)

Country Link
CN (1) CN110955613A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170344491A1 (en) * 2016-05-25 2017-11-30 Samsung Electronics Co., Ltd. Access parameter based multi-stream storage device access
WO2018024214A1 (en) * 2016-08-04 2018-02-08 北京忆恒创源科技有限公司 Io flow adjustment method and device
CN107688435A (en) * 2016-08-04 2018-02-13 北京忆恒创源科技有限公司 IO flows adjusting method and device
CN107766262A (en) * 2016-08-18 2018-03-06 北京忆恒创源科技有限公司 Adjust the method and apparatus of concurrent write order quantity
WO2018041258A1 (en) * 2016-09-05 2018-03-08 北京忆恒创源科技有限公司 Method for processing de-allocation command, and storage device
CN107797934A (en) * 2016-09-05 2018-03-13 北京忆恒创源科技有限公司 The method and storage device that distribution is ordered are gone in processing


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162701A (en) * 2020-09-18 2021-01-01 北京浪潮数据技术有限公司 Storage space recovery method, device, equipment and computer storage medium
CN112162701B (en) * 2020-09-18 2023-12-22 北京浪潮数据技术有限公司 Storage space recycling method, device, equipment and computer storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100192 room A302, building B-2, Dongsheng Science Park, Zhongguancun, 66 xixiaokou Road, Haidian District, Beijing

Applicant after: Beijing yihengchuangyuan Technology Co.,Ltd.

Address before: 100192 room A302, building B-2, Dongsheng Science Park, Zhongguancun, 66 xixiaokou Road, Haidian District, Beijing

Applicant before: BEIJING MEMBLAZE TECHNOLOGY Co.,Ltd.
