CN107077300A - Rate matching technique for balancing segment cleaning and I/O workload - Google Patents

Rate matching technique for balancing segment cleaning and I/O workload

Info

Publication number
CN107077300A
CN107077300A CN201580049232.9A CN201580049232A CN107077300A CN 107077300 A CN107077300 A CN 107077300A CN 201580049232 A CN201580049232 A CN 201580049232A CN 107077300 A CN107077300 A CN 107077300A
Authority
CN
China
Prior art keywords
segment
ssd
queue
extent
write
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201580049232.9A
Other languages
Chinese (zh)
Inventor
D·帕特尔
M·斯瓦米纳坦
E·D·麦克拉纳汉
J·穆斯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NetApp Inc
Original Assignee
NetApp Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NetApp Inc filed Critical NetApp Inc
Publication of CN107077300A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0619Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • G06F12/0253Garbage collection, i.e. reclamation of unreferenced memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • G06F12/0238Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • G06F12/0246Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory in block erasable memory, e.g. flash memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0688Non-volatile semiconductor memory arrays
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • G06F2212/1024Latency reduction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1041Resource optimization
    • G06F2212/1044Space efficiency improvement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/70Details relating to dynamic memory management
    • G06F2212/702Conservative garbage collection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/72Details relating to flash memory management
    • G06F2212/7201Logical to physical mapping or translation of blocks or pages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/72Details relating to flash memory management
    • G06F2212/7205Cleaning, compaction, garbage collection, erase control
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance

Abstract

A rate matching technique may be configured to adjust the rate at which one or more selected segments of a storage array are cleaned to accommodate a variable rate of an incoming workload processed by a storage input/output (I/O) stack executing on one or more nodes of a cluster. Segments may be cleaned in accordance with segment cleaning, which may illustratively be embodied as a segment cleaning process of an extent store layer of the storage I/O stack. The rate matching technique may be implemented as a feedback control mechanism configured to adjust the segment cleaning process based on the incoming workload. Components of the feedback control mechanism may include one or more weighted schedulers and various accounting data structures (e.g., counters) configured to track the progress of segment cleaning and free space usage. The counters may also be used to balance the rates of segment cleaning and the incoming I/O workload, which may vary depending on the rate of incoming I/O. As the incoming I/O rate changes, the rate of segment cleaning may be adjusted accordingly to ensure that the rates are substantially balanced.

Description

Rate matching technique for balancing segment cleaning and I/O workload
Technical field
The present disclosure relates to storage systems and, more specifically, to efficient segment cleaning in a storage system.
Background
A storage system typically includes one or more storage devices, such as solid-state drives (SSDs) embodied as flash storage devices of a storage array, into which information may be entered, and from which information may be obtained, as desired. The storage system may implement a high-level module, such as a file system, to logically organize the information stored on the devices of the array as storage containers, such as files or logical units (LUNs). A cluster of storage systems may be configured to operate according to a client/server model of information delivery, thereby allowing one or more clients (hosts) to access the storage containers, e.g., via I/O requests. Each storage container may be implemented as a set of data structures, such as data blocks that store data for the storage container and metadata blocks that describe the data of the storage container. For example, the metadata may describe, e.g., identify, storage locations of the data on the devices.
Some types of SSDs, especially those with NAND flash components, may or may not include an internal controller (i.e., inaccessible to a user of the SSD) that moves valid data from old locations to new locations among those components at the granularity of a page (e.g., 8 Kbytes) and then only to previously erased pages. Thereafter, the old locations where the pages were stored are freed and available for storage of additional data (e.g., received via I/O requests). Moving valid data from old locations to new locations, i.e., garbage collection, contributes to write amplification in the system. Although needed to provide available locations (i.e., storage space) on the SSDs of the storage array for writing the additional data, such garbage collection should be performed so as to maintain smooth latency from the perspective of the host (i.e., bounded latency on I/O requests). It is therefore desirable that the storage system continue to service data (i.e., process I/O requests) within bounded latency, while ensuring that there is sufficient storage space and bandwidth for garbage collection and for writing data to available locations of the storage array.
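For context, write amplification may be expressed as the ratio of data physically written to the flash media to data logically written by the host (a general definition provided for orientation, not a figure from this disclosure):

    WA = (host data written + data relocated by garbage collection) / host data written

so every valid page relocated during garbage collection adds to the numerator and pushes write amplification above 1.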
Summary of the Invention
The embodiments described herein are directed to a rate matching technique configured to adjust the rate of cleaning of one or more selected portions, or segments, of a storage array to accommodate a variable rate of an incoming workload processed by a storage input/output (I/O) stack executing on one or more nodes of a cluster. The incoming workload may manifest as I/O operations, e.g., read and write operations, directed to user data (e.g., from a host coupled to the cluster) and associated metadata (e.g., from a volume layer and an extent store layer of the storage I/O stack). The extent store layer provides sequential storage of the user data and metadata on solid-state drives (SSDs) of the storage array. The user data (and metadata) may be organized as an arbitrary number of variable-length extents of one or more host-visible logical units (LUNs) served by the nodes. The metadata may include mappings from host-visible logical block address ranges (i.e., offset ranges) of a LUN to extent keys, as well as mappings of the extent keys to locations of the extents stored on the SSDs. The extent store layer may clean segments in accordance with segment cleaning (i.e., garbage collection), which may illustratively be embodied as a segment cleaning process. The segment cleaning process may read all valid extents from one or more segments to be cleaned and write those valid extents (e.g., extents that are not deleted or overwritten) to one or more other segments available to be written, thereby freeing (i.e., "cleaning") the storage space of the segments being cleaned.
In one embodiment, the rate matching technique may be implemented as a feedback control mechanism (e.g., a feedback control loop) configured to adjust the segment cleaning process based on the incoming read and write workload. Components of the feedback control mechanism may include one or more weighted schedulers and various accounting data structures (e.g., counters) configured to track (i.e., determine) the progress (e.g., rate) of segment cleaning and free space usage. The counters may also be used to balance the rates of segment cleaning and the incoming I/O workload, which may vary depending on the rate of incoming I/O and the pattern of overwrites in the incoming I/O workload (which reduces cleaning). As the incoming I/O rate changes, the rate of segment cleaning may be adjusted accordingly to ensure that the rates (i.e., the incoming I/O rate and the segment cleaning rate) are substantially the same, i.e., balanced. In this manner, segment cleaning is performed only as needed (i.e., to free storage space for the incoming I/O), reducing write amplification.
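A minimal sketch of such a feedback loop is shown below in Python. The class and variable names (CleaningRateController, incoming_io_rate, free_segments) and the simple proportional policy are assumptions made for illustration; they are not the claimed implementation.

    # Illustrative sketch of the feedback control idea: adjust the segment
    # cleaning rate so that free space is produced roughly as fast as the
    # incoming workload consumes it. Names and the proportional policy are
    # assumptions for illustration only.
    class CleaningRateController:
        def __init__(self, reserve_segments=7, gain=0.5):
            self.reserve_segments = reserve_segments   # target number of free segments
            self.gain = gain                           # feedback gain
            self.clean_rate = 0.0                      # relocation work rate (e.g., MB/s)

        def update(self, incoming_io_rate, free_segments):
            """Called periodically with counters tracked by the extent store layer."""
            deficit = self.reserve_segments - free_segments
            if deficit <= 0:
                # Enough free segments: cleaning only needs to keep pace with
                # the incoming writes that consume segments.
                self.clean_rate = incoming_io_rate
            else:
                # Short on free segments: clean faster than the incoming rate,
                # in proportion to the deficit.
                self.clean_rate = incoming_io_rate * (1.0 + self.gain * deficit)
            return self.clean_rate

    ctrl = CleaningRateController()
    print(ctrl.update(incoming_io_rate=100.0, free_segments=10))  # balanced: 100.0
    print(ctrl.update(incoming_io_rate=100.0, free_segments=5))   # behind:   200.0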
The I/O operations (e.g., read and write operations) in the storage I/O stack may be classified as user, metadata, and relocation (segment cleaner) I/O operations. Read operations are serviced on a per-SSD basis and are dispatched from a reader process of the extent store layer directly to a RAID layer of the storage I/O stack. Write operations are accumulated by a writer process of the extent store layer, where the associated extents may be packed to form a full stripe before being dispatched to the RAID layer. Weighted schedulers may be employed at the extent store layer to regulate the read operations and the full stripe write operations to the RAID layer. Illustratively, a write weighted scheduler is provided at the writer process for all SSDs of a segment, and a read weighted scheduler is provided at the reader process for each SSD of the segment, so as to control bandwidth distribution among the various classes of I/O operations.
In one embodiment, each weighted scheduler may include a plurality of queues, wherein each queue is associated with a class of I/O operation (e.g., user, metadata, and relocation). In other words, the queues may include (i) a data queue for the incoming user data workload, (ii) a metadata queue for the incoming volume and extent store layer metadata workload, and (iii) a relocation queue for data and metadata relocated during segment cleaning. According to the rate matching technique, the weighted schedulers may be configured to apply fine-grained control (a secondary control loop) to the rate of segment cleaning by assigning relative weights to the queues so as to match the rate of the incoming I/O workload, thereby ensuring that segment cleaning neither introduces unnecessary write amplification by operating too quickly, nor degrades the host-perceived latency associated with the I/O workload by operating too quickly or too slowly.
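A compact sketch of a weighted scheduler over those three queues follows; the weighted round-robin policy and the particular weights are assumptions chosen only to show how relative weights throttle relocation I/O against user and metadata I/O.

    from collections import deque

    # Illustrative weighted round-robin over the three I/O classes named above.
    # The weights are assumptions: raising the relocation weight speeds up
    # segment cleaning; lowering it returns bandwidth to the host workload.
    class WeightedScheduler:
        def __init__(self, weights):
            self.queues = {cls: deque() for cls in weights}   # user, metadata, relocation
            self.weights = dict(weights)

        def submit(self, cls, io):
            self.queues[cls].append(io)

        def dispatch(self):
            """Yield I/Os, giving each class up to `weight` slots per round."""
            while any(self.queues.values()):
                for cls, weight in self.weights.items():
                    q = self.queues[cls]
                    for _ in range(weight):
                        if not q:
                            break
                        yield cls, q.popleft()

    sched = WeightedScheduler({"user": 4, "metadata": 2, "relocation": 1})
    for i in range(3):
        sched.submit("user", f"u{i}")
        sched.submit("relocation", f"r{i}")
    print(list(sched.dispatch()))   # user I/Os dominate; relocation trickles through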
Brief description of the drawings
The above and further advantages of the embodiments herein may be better understood by referring to the following description in conjunction with the accompanying drawings, in which like reference numerals indicate identical or functionally similar elements:
Fig. 1 is a block diagram of a plurality of nodes interconnected as a cluster;
Fig. 2 is a block diagram of a node;
Fig. 3 is a block diagram of a storage input/output (I/O) stack of the node;
Fig. 4 illustrates a write path of the storage I/O stack;
Fig. 5 illustrates a read path of the storage I/O stack;
Fig. 6 illustrates segment cleaning by a layered file system of the storage I/O stack;
Fig. 7 illustrates a RAID stripe formed by the layered file system;
Fig. 8A and Fig. 8B illustrate a read weighted scheduler and a write weighted scheduler, respectively, of the rate matching technique; and
Fig. 9 illustrates the rate matching technique.
Description of Embodiments
Storage Cluster
Fig. 1 is a block diagram of a plurality of nodes 200 interconnected as a cluster 100 and configured to provide storage services relating to the organization of information on storage devices. The nodes 200 may be interconnected by a cluster interconnect fabric 110 and include functional components that cooperate to provide a distributed storage architecture of the cluster 100, which may be deployed in a storage area network (SAN). As described herein, the components of each node 200 include hardware and software functionality that enable the node to connect to one or more hosts 120 over a computer network 130, as well as to one or more storage arrays 150 of storage devices over a storage interconnect 140, thereby rendering the storage services in accordance with the distributed storage architecture.
Each host 120 may be embodied as a general-purpose computer configured to interact with any node 200 in accordance with a client/server model of information delivery. That is, the client (host) may request the services of the node, and the node may return the results of the services requested by the host by exchanging packets over the network 130. The host may issue packets including file-based access protocols, such as the Network File System (NFS) protocol over the Transmission Control Protocol/Internet Protocol (TCP/IP), when accessing information on the node in the form of storage containers such as files and directories. However, in one embodiment, the host 120 illustratively issues packets including block-based access protocols, such as the Small Computer Systems Interface (SCSI) protocol encapsulated over TCP (iSCSI) and SCSI encapsulated over FC, when accessing information in the form of storage containers such as logical units (LUNs). Notably, any of the nodes 200 may service a request directed to a storage container stored on the cluster 100.
Fig. 2 is a block diagram of a node 200 that is illustratively embodied as a storage system having one or more central processing units (CPUs) 210 coupled to a memory 220 via a memory bus 215. The CPU 210 is also coupled to a network adapter 230, storage controllers 240, a cluster interconnect interface 250, and a non-volatile random access memory (NVRAM 280) via a system interconnect 270. The network adapter 230 may include one or more ports adapted to couple the node 200 to the host(s) 120 over the computer network 130, which may include point-to-point links, wide area networks, virtual private networks implemented over a public network (Internet), or a local area network. The network adapter 230 thus includes the mechanical, electrical, and signaling circuitry needed to connect the node to the network 130, which illustratively embodies an Ethernet or Fibre Channel (FC) network.
The memory 220 may include memory locations that are addressable by the CPU 210 for storing software programs and data structures associated with the embodiments described herein. The CPU 210 may, in turn, include processing elements and/or logic circuitry configured to execute the software programs, such as a storage input/output (I/O) stack 300, and manipulate the data structures 300. Illustratively, the storage I/O stack 300 may be implemented as a set of user mode processes that may be decomposed into a plurality of threads. An operating system kernel 224, portions of which are typically resident in memory 220 (in-core) and executed by the processing elements (i.e., CPU 210), functionally organizes the node by, inter alia, invoking operations in support of the storage service implemented by the node and, in particular, the storage I/O stack 300. A suitable operating system kernel 224 may include a general-purpose operating system, such as the UNIX series or Microsoft Windows series of operating systems, or an operating system with configurable functionality, such as microkernels and embedded kernels. However, in an embodiment described herein, the operating system kernel is illustratively the Linux operating system. It will be apparent to those skilled in the art that other processing and memory means, including various computer-readable media, may be used to store and execute program instructions pertaining to the embodiments herein.
Each storage controller 240 cooperates with the storage I/O stack 300 executing on the node 200 to access information requested by the host 120. The information is preferably stored on storage devices such as solid-state drives (SSDs) 260, illustratively embodied as flash storage devices of the storage array 150. In one embodiment, the flash storage devices may be NAND-based flash components (e.g., single-level cell (SLC) flash, multi-level cell (MLC) flash, or triple-level cell (TLC) flash), although it will be understood by those skilled in the art that other non-volatile, solid-state electronic devices (e.g., drives based on storage-class memory components) may be advantageously used with the embodiments described herein. Accordingly, the storage devices may or may not be block-oriented (i.e., accessed as blocks). The storage controller 240 includes one or more ports having I/O interface circuitry that couples to the SSDs 260 over the storage interconnect 140, illustratively embodied as a serial attached SCSI (SAS) topology. Alternatively, other point-to-point I/O interconnect arrangements may be used, such as a conventional serial ATA (SATA) topology or a PCI topology. The system interconnect 270 may also couple the node 200 to a local service storage device 248, such as an SSD, configured to locally store cluster-related configuration information, e.g., as a cluster database (DB) 244, which may be replicated to the other nodes 200 in the cluster 100.
Cluster interconnection interface 250 can include one or more ports, and these ports are adapted to node 200 being coupled to Other nodes of cluster 100.In one embodiment, Ethernet can be used as cluster protocol and interconnection structure medium, but It is it will be appreciated by those skilled in the art that other kinds of agreement can be used in embodiment described herein and interconnected (all Such as, wireless bandwidth).NVRAM 280 can include the standby electricity that data can be safeguarded according to the failure and cluster environment of node Pond or other built-in final state holding capacities (for example, nonvolatile semiconductor memory, such as, stores class memory). Exemplarily, a part of of NVRAM 280 can be configured as one or more non-volatile daily records (NVLog 285), these Non-volatile daily record is configured to provisionally to record (" charging to (log) ") I/O and is received from the request of main frame 120 (such as write-in please Ask).
Storage I/O Stack
Fig. 3 is a block diagram of the storage I/O stack 300 that may be advantageously used with one or more embodiments described herein. The storage I/O stack 300 includes a plurality of software modules or layers that cooperate with other functional components of the nodes 200 to provide the distributed storage architecture of the cluster 100. In one embodiment, the distributed storage architecture presents an abstraction of a single storage container, i.e., all of the storage arrays 150 of the nodes 200 for the entire cluster 100 are organized as one large pool of storage. In other words, the architecture consolidates the storage, i.e., the SSDs 260 of the arrays 150, throughout the cluster (retrievable via cluster-wide keys) to enable storage of the LUNs. Both storage capacity and performance may then be subsequently scaled by adding nodes 200 to the cluster 100.
Illustratively, the storage I/O stack 300 includes an administration layer 310, a protocol layer 320, a persistence layer 330, a volume layer 340, an extent store layer 350, a Redundant Array of Independent Disks (RAID) layer 360, a storage layer 365, and an NVRAM layer (for storing the NVLogs) interconnected with a messaging kernel 370. The messaging kernel 370 may provide a message-based (or event-based) scheduling model (e.g., asynchronous scheduling) that employs messages as fundamental units of work exchanged (i.e., passed) among the layers. Suitable message-passing mechanisms provided by the messaging kernel to transfer information between the layers of the storage I/O stack 300 may include, e.g., for intra-node communication: (i) messages that execute on a pool of threads, (ii) messages that execute on a single thread progressing as an operation through the storage I/O stack, and (iii) messages using an inter-process communication (IPC) mechanism, and, e.g., for inter-node communication: messages using a remote procedure call (RPC) mechanism in accordance with a function shipping implementation. Alternatively, the I/O stack may be implemented using a thread-based or stack-based execution model. In one or more embodiments, the messaging kernel 370 allocates processing resources from the operating system kernel 224 to execute the messages. Each storage I/O stack layer may be implemented as one or more instances (i.e., processes) executing one or more threads (e.g., in kernel or user space) that process the messages passed between the layers such that the messages provide synchronization for blocking and non-blocking operation of the layers.
In one embodiment, the protocol layer 320 may communicate with the host 120 over the network 130 by exchanging discrete frames or packets configured as I/O requests according to pre-defined protocols, such as iSCSI and FCP. An I/O request, e.g., a read or write request, may be directed to a LUN and may include I/O parameters such as, inter alia, a LUN identifier (ID), a logical block address (LBA) of the LUN, a length (i.e., amount of data), and, in the case of a write request, write data. The protocol layer 320 receives the I/O request and forwards it to the persistence layer 330, which records the request into a persistent write-back cache 380, illustratively embodied as a log whose contents may be replaced at random, e.g., under some random access replacement policy rather than only in a serial fashion, and returns an acknowledgement to the host 120 via the protocol layer 320. In one embodiment, only I/O requests that modify the LUN, e.g., write requests, are logged. Notably, the I/O request may be logged at the node receiving the I/O request or, in an alternative embodiment in accordance with the function shipping implementation, the I/O request may be logged at another node.
Illustratively, dedicated logs may be maintained by the various layers of the storage I/O stack 300. For example, a dedicated log 335 may be maintained by the persistence layer 330 to record the I/O parameters of an I/O request as equivalent internal, i.e., storage I/O stack, parameters, e.g., volume ID, offset, and length. In the case of a write request, the persistence layer 330 may also cooperate with the NVRAM 280 to implement the write-back cache 380 configured to store the write data associated with the write request. In one embodiment, the write-back cache may be structured as a log. Notably, the write data for the write request may be physically stored in the cache 380 such that the log 335 contains the reference to the associated write data. It will be understood by those skilled in the art that other variations of data structures may be used to store or maintain the write data in NVRAM, including data structures with no logs. In one embodiment, a copy of the write-back cache may also be maintained in the memory 220 to facilitate direct memory access to the storage controllers. In other embodiments, caching may be performed at the host 120 or at a receiving node in accordance with a protocol that maintains coherency between the data stored at the cache and the cluster.
In one embodiment, the administration layer 310 may apportion the LUN into multiple volumes, each of which may be partitioned into multiple regions (e.g., allotted as disjoint block address ranges), with each region having one or more segments stored as multiple stripes on the array 150. A plurality of volumes distributed among the nodes 200 may thus service a single LUN, i.e., each volume within the LUN services a different LBA range (i.e., offset range and length, hereinafter offset range) or set of ranges within the LUN. Accordingly, the protocol layer 320 may implement a volume mapping technique to identify the volume to which an I/O request is directed (i.e., the volume servicing the offset range indicated by the parameters of the I/O request). Illustratively, the cluster database 244 may be configured to maintain one or more associations (e.g., key-value pairs) for each of the multiple volumes, e.g., an association between the LUN ID and a volume, as well as an association between the volume and a node ID for a node managing the volume. The administration layer 310 may also cooperate with the database 244 to create (or delete) one or more volumes associated with the LUN (e.g., creating a volume ID/LUN key-value pair in the database 244). Using the LUN ID and LBA (or LBA range), the volume mapping technique may provide a volume ID (e.g., using appropriate associations in the cluster database 244) that identifies the volume and the node servicing the volume destined for the request, as well as translate the LBA (or LBA range) into an offset and length within the volume. Specifically, the volume ID is used to determine a volume layer instance that manages the volume metadata associated with the LBA or LBA range. As noted, the protocol layer 320 may pass the I/O request (i.e., volume ID, offset, and length) to the persistence layer 330, which may use the function shipping (e.g., inter-node) implementation to forward the I/O request to the appropriate volume layer instance executing on a node in the cluster based on the volume ID.
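A simplified sketch of such a volume mapping lookup is shown below; the fixed region size, block size, and dictionary-based stand-in for the cluster database are assumptions made only to show the LUN ID/LBA-to-(node, volume, offset) translation.

    # Illustrative volume mapping: translate (LUN ID, LBA, length) to the node
    # and volume servicing that offset range. The fixed-size region split and
    # the dictionary "cluster database" are assumptions for clarity.
    BLOCK_SIZE = 512                    # bytes per LBA (assumed)
    REGION_SIZE = 16 * 1024**3          # disjoint offset range per volume (assumed)

    cluster_db = {
        # (LUN ID, region index) -> (volume ID, node ID)
        ("lun-7", 0): ("vol-A", "node-1"),
        ("lun-7", 1): ("vol-B", "node-2"),
    }

    def map_volume(lun_id, lba, length):
        offset = lba * BLOCK_SIZE
        region = offset // REGION_SIZE
        volume_id, node_id = cluster_db[(lun_id, region)]
        # The offset is rebased into the volume's own offset range.
        return node_id, volume_id, offset - region * REGION_SIZE, length

    print(map_volume("lun-7", lba=40_000_000, length=4096))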
In one embodiment, the volume layer 340 may manage the volume metadata by, e.g., maintaining states of host-visible containers, such as ranges of LUNs, and performing data management functions, such as creation of snapshots and clones, for the LUNs in cooperation with the administration layer 310. The volume metadata is illustratively embodied as in-core mappings from LUN addresses (i.e., offsets) to durable extent keys, which are unique cluster-wide IDs associated with SSD storage locations for extents within an extent key space of the cluster-wide storage container. That is, an extent key may be used to retrieve the data of the extent located at an SSD storage location associated with the extent key. Alternatively, there may be multiple storage containers in the cluster, wherein each container has its own extent key space, e.g., where the administration layer 310 provides distribution of extents among the storage containers. An extent is a variable-length block of data that provides a unit of storage on the SSDs and need not be aligned on any specific boundary, i.e., it may be byte aligned. Accordingly, an extent may be an aggregation of write data from a plurality of write requests in order to maintain such alignment. Illustratively, the volume layer 340 may record the forwarded request (e.g., information or parameters characterizing the request), as well as changes to the volume metadata, in a dedicated log 345 maintained by the volume layer 340. Subsequently, the contents of the volume layer log 345 may be written to the storage array 150 in accordance with a checkpoint (e.g., synchronization) operation that stores in-core metadata on the array 150. That is, the checkpoint operation ensures that a consistent state of the metadata, as processed in-core, is committed to (i.e., stored on) the storage array 150, whereas retirement of log entries ensures that the entries accumulated in the volume layer log 345 synchronize with the metadata checkpoints committed to the storage array 150 by, e.g., retiring those accumulated log entries prior to the checkpoint. In one or more embodiments, the checkpoint and the retirement of log entries may be data driven, periodic, or both.
In one embodiment, the extent store layer 350 is responsible for storing extents on the SSDs 260 (i.e., on the storage array 150) and for providing the extent keys to the volume layer 340 (e.g., in response to a forwarded write request). The extent store layer 350 is also responsible for retrieving data (e.g., an existing extent) using an extent key (e.g., in response to a forwarded read request). The extent store layer 350 may be responsible for performing de-duplication and compression on the extents prior to storage. The extent store layer 350 may maintain in-core mappings (e.g., embodied as hash tables) of extent keys to SSD storage locations (i.e., offsets on an SSD 260 of the array 150). The extent store layer 350 may also maintain a dedicated log 355 of entries that accumulate requested "put" and "delete" operations (i.e., write requests and delete requests for extents issued from other layers to the extent store layer 350), where those operations change the in-core mappings (i.e., the hash table entries). Subsequently, the in-core mappings and the contents of the extent store layer log 355 may be written to the storage array 150 in accordance with a "fuzzy" checkpoint 390 (i.e., a checkpoint with incremental changes recorded in one or more log files), in which selected in-core mappings (less than the total amount of in-core mappings) are committed to the array 150 at various intervals (e.g., driven by an amount of change to the in-core mappings, by a size threshold of the log 355, or periodically). Notably, the accumulated entries in the log 355 may be retired once all of the in-core mappings have been committed so as to include the changes recorded in those entries.
In one embodiment, the RAID layer 360 may organize the SSDs 260 within the storage array 150 as one or more RAID groups (e.g., sets of SSDs) that enhance the reliability and integrity of extent storage on the array by writing data "stripes" having redundant information, i.e., appropriate parity information with respect to the striped data, across a given number of SSDs 260 of each RAID group. The RAID layer 360 may also store a plurality of stripes (e.g., stripes of sufficient depth), e.g., in accordance with a plurality of write operations of contiguous ranges, so as to reduce data relocation (i.e., internal flash block management) that may occur within the SSDs as a result of those operations. In one embodiment, the storage layer 365 implements storage I/O drivers, such as the Linux virtual function I/O (VFIO) driver, that may communicate directly with hardware (e.g., the storage controllers and cluster interface) cooperating with the operating system kernel 224.
Write Path
Fig. 4 illustrates an I/O (e.g., write) path 400 of the storage I/O stack 300 for processing an I/O request, e.g., a SCSI write request 410. The write request 410 may be issued by a host 120 and directed to a LUN stored on the storage array 150 of the cluster 100. Illustratively, the protocol layer 320 receives and processes the write request by decoding 420 (e.g., parsing and extracting) fields of the request, e.g., LUN ID, LBA, and length (shown at 413), as well as the write data 414. The protocol layer 320 may also use the results 422 of the decoding 420 for a volume mapping technique 430 (described above) that translates the LUN ID and LBA range (i.e., equivalent offset and length) of the write request to an appropriate volume layer instance in the cluster 100, i.e., volume ID (volume 445), that is responsible for managing volume metadata for the LBA range. In an alternative embodiment, the persistence layer 330 may implement the above-described volume mapping technique 430. The protocol layer then passes the results 432, e.g., volume ID, offset, and length (as well as the write data), to the persistence layer 330, which records the request in the persistence layer log 335 and returns an acknowledgement to the host 120 via the protocol layer 320. The persistence layer 330 may aggregate and organize the write data 414 from one or more write requests into a new extent 610 and perform a hash computation, i.e., a hash function, on the new extent in accordance with an extent hashing technique 450 to generate a hash value 472.
The persistence layer 330 may then pass the write request with the aggregated write data, including, e.g., the volume ID, offset, and length, as parameters 434 to the appropriate volume layer instance. In one embodiment, message passing of the parameters 434 (received by the persistence layer) may be redirected to another node via the function shipping mechanism (e.g., RPC) for inter-node communication. Alternatively, message passing of the parameters 434 may be via the IPC mechanism (e.g., message threads) for intra-node communication.
In one or more embodiments, a bucket mapping technique 476 is provided that translates the hash value 472 into an instance of an appropriate extent store layer (i.e., extent store instance 470) that is responsible for storing the new extent 610. Note that the bucket mapping technique may be performed in any layer of the storage I/O stack above the extent store layer. In one embodiment, for example, the bucket mapping technique may be performed in the persistence layer 330, the volume layer 340, or a layer that manages cluster-wide information, such as a cluster layer (not shown). Accordingly, the persistence layer 330, the volume layer 340, or the cluster layer may contain computer-executable instructions executed by the CPU 210 to perform operations that implement the bucket mapping technique 476 described herein. The persistence layer 330 may then pass the hash value 472 and the new extent 610 to the appropriate volume layer instance and onto the appropriate extent store instance via an extent store put operation. The extent hashing technique 450 may embody an approximately uniform hash function to ensure that any random extent to be written has an approximately equal chance of falling into any extent store instance 470, i.e., hash buckets are distributed across extent store instances of the cluster 100 based on available resources. As a result, the bucket mapping technique 476 provides load balancing of write operations (and, by symmetry, read operations) across the nodes 200 of the cluster, while also leveling flash wear in the SSDs 260 of the cluster.
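The following sketch illustrates one way a hash value could select a bucket and thus an extent store instance; the bucket count and the round-robin bucket-to-node assignment are assumptions, not the mapping claimed here.

    import hashlib

    # Illustrative bucket mapping: an approximately uniform hash of the extent
    # selects a bucket, and buckets are assigned to extent store instances
    # (one per node) so that load and flash wear spread across the cluster.
    NUM_BUCKETS = 65536
    NODES = ["node-1", "node-2", "node-3", "node-4"]
    bucket_to_instance = {b: NODES[b % len(NODES)] for b in range(NUM_BUCKETS)}

    def extent_hash(extent_data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(extent_data).digest()[:8], "big")

    def pick_extent_store_instance(extent_data: bytes) -> str:
        return bucket_to_instance[extent_hash(extent_data) % NUM_BUCKETS]

    print(pick_extent_store_instance(b"some aggregated write data"))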
In response to the put operation, the extent store instance may process the hash value 472 to perform an extent metadata selection technique 460 that (i) selects an appropriate hash table 480 (e.g., hash table 480a) from a set of hash tables (illustratively in-core) within the extent store instance 470, and (ii) extracts a hash table index 462 from the hash value 472 to index into the selected hash table and look up a table entry having an extent key 475 identifying a storage location 490 on SSD 260 for the extent. Accordingly, the extent store layer 350 contains computer-executable instructions executed by the CPU 210 to perform operations that implement the extent metadata selection technique 460 described herein. If a table entry with a matching extent key is found, the SSD location 490 mapped from the extent key 475 is used to retrieve an existing extent (not shown) from SSD. The existing extent is then compared with the new extent 610 to determine whether their data is identical. If the data is identical, the new extent 610 is already stored on SSD 260 and a de-duplication opportunity (denoted as de-duplication 452) exists such that there is no need to write another copy of the data. Accordingly, a reference count in the table entry for the existing extent is incremented and the extent key 475 of the existing extent is passed to the appropriate volume layer instance for storage within an entry (denoted as volume metadata entry 446) of a dense tree metadata structure 444 (e.g., dense tree 444a), such that the extent key 475 is associated with an offset range 440 (e.g., offset range 440a) of the volume 445.
However, if the data of the existing extent is not identical to the data of the new extent 610, a collision occurs and a deterministic algorithm is invoked to sequentially generate as many new candidate extent keys (not shown) mapping to the same bucket as needed to either provide de-duplication 452 or produce an extent key that is not yet stored within the extent store instance. Notably, another hash table (e.g., hash table 480n) may be selected by the new candidate extent key in accordance with the extent metadata selection technique 460. In the event that no de-duplication opportunity exists (i.e., the extent is not already stored), the new extent 610 is compressed in accordance with a compression technique 454 and passed to the RAID layer 360, which processes the new extent 610 for storage on SSD 260 within one or more stripes 710 of a RAID group 466. The extent store instance may cooperate with the RAID layer 360 to identify a storage segment 650 (i.e., a portion of the storage array 150) and a location on SSD 260 within the segment 650 in which to store the new extent 610. Illustratively, the identified storage segment is a segment with a large contiguous free space having, e.g., a location 490 on SSD 260 for storing the extent 610.
In one embodiment, the RAID layer 360 then writes the stripes 710 across the RAID group 466, illustratively as one or more full stripe writes 458. The RAID layer 360 may write a series of stripes 710 of sufficient depth to reduce data relocation that may occur within the flash-based SSDs 260 (i.e., flash block management). The extent store instance then (i) loads the SSD location 490 of the new extent 610 into the selected hash table 480n (i.e., the hash table selected according to the new candidate extent key), (ii) passes a new extent key (denoted as extent key 475) to the appropriate volume layer instance for storage within an entry (also denoted as volume metadata entry 446) of a dense tree 444 managed by that volume layer instance, and (iii) records a change to the extent metadata of the selected hash table in the extent store layer log 355. Illustratively, the volume layer instance selects dense tree 444a spanning an offset range 440a of the volume 445 that encompasses the offset range of the write request. As noted, the volume 445 (e.g., the offset space of the volume) is partitioned into multiple regions (e.g., allotted as disjoint offset ranges); in one embodiment, each region is represented by a dense tree 444. The volume layer instance then inserts the volume metadata entry 446 into the dense tree 444a and records a change corresponding to the volume metadata entry in the volume layer log 345. Accordingly, the I/O (write) request is sufficiently stored on the SSDs 260 of the cluster.
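A condensed sketch of the put-side bookkeeping described in the preceding paragraphs appears below; the table count, the index extraction, the in-entry data comparison, and the candidate-key derivation are illustrative assumptions (the actual comparison reads the existing extent back from SSD).

    # Illustrative extent store "put": select a hash table from the hash value,
    # index into it, de-duplicate if the same data is already stored, otherwise
    # store the extent and record its SSD location. Table count, index bits,
    # and the candidate-key sequence are assumptions.
    NUM_TABLES = 4

    def put_extent(hash_tables, hash_value, extent, write_to_ssd):
        candidate = hash_value
        while True:
            table = hash_tables[candidate % NUM_TABLES]     # extent metadata selection
            index = (candidate >> 2) & 0xFFFF               # hash table index (assumed bits)
            entry = table.get(index)
            if entry is None:                               # no entry: store the new extent
                ssd_location = write_to_ssd(extent)
                table[index] = {"key": candidate, "loc": ssd_location,
                                "refs": 1, "data": extent}
                return candidate                            # extent key passed to the volume layer
            if entry["data"] == extent:                     # de-duplication opportunity
                entry["refs"] += 1
                return entry["key"]
            # Collision with different data: deterministically derive the next
            # candidate key (which may select another hash table) and retry.
            candidate = (candidate * 0x9E3779B97F4A7C15 + 1) % (1 << 64)

    tables = [dict() for _ in range(NUM_TABLES)]
    key = put_extent(tables, 0xDEADBEEF, b"abc", lambda e: ("ssd-3", 4096))
    print(key == put_extent(tables, 0xDEADBEEF, b"abc", lambda e: ("ssd-3", 8192)))  # True: de-dup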
Read Path
Fig. 5 illustrates an I/O (e.g., read) path 500 of the storage I/O stack 300 for processing an I/O request, e.g., a SCSI read request 510. The read request 510 may be issued by a host 120 and received at the protocol layer 320 of a node 200 in the cluster 100. Illustratively, the protocol layer 320 processes the read request by decoding 420 (e.g., parsing and extracting) fields of the request, e.g., LUN ID, LBA, and length (shown at 513), and uses the decoded results 522, e.g., LUN ID, offset, and length, for the volume mapping technique 430. That is, the protocol layer 320 may implement the volume mapping technique 430 (described above) to translate the LUN ID and LBA range (i.e., equivalent offset and length) of the read request to an appropriate volume layer instance in the cluster 100, i.e., volume ID (volume 445), that is responsible for managing volume metadata for the LBA (i.e., offset) range. The protocol layer then passes the results 532 to the persistence layer 330, which may search the write cache 380 to determine whether some or all of the read request can be serviced from its cached data. If the entire request cannot be serviced from the cached data, the persistence layer 330 may pass the remaining portion of the request, including, e.g., the volume ID, offset, and length, as parameters 534 to the appropriate volume layer instance in accordance with the function shipping mechanism (e.g., RPC for inter-node communication) or the IPC mechanism (e.g., message threads for intra-node communication).
The volume layer instance may process the read request to access a dense tree metadata structure 444 (e.g., dense tree 444a) associated with a region (e.g., offset range 440a) of the volume 445 that encompasses the requested offset range (specified by the parameters 534). The volume layer instance may further process the read request to search for (look up) one or more volume metadata entries 446 of the dense tree 444a to obtain one or more extent keys 475 associated with one or more extents 610 (or portions of extents) within the requested offset range. In one embodiment, each dense tree 444 may be embodied as a multiple-level search structure with possibly overlapping offset range entries at each level. The multi-level dense tree may have volume metadata entries for the same offset, in which case the higher level has the newer entry and is used to service the read request. A top level of the dense tree 444 is illustratively resident in-core, and a page cache 448 may be used to access lower levels of the tree. If the requested range, or a portion thereof, is not present in the top level, a metadata page associated with an index entry at the next lower tree level (not shown) is accessed. The metadata page at the next level (e.g., in the page cache 448) is then searched for any overlapping entries. This process is then iterated until one or more volume metadata entries 446 of a level are found, so as to ensure that the extent key(s) 475 for the entire requested read range are found. If no metadata entries exist for the entire requested read range, or portions thereof, the missing portion(s) are zero filled.
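A toy sketch of that level-by-level lookup follows; the list-of-dicts levels and exact-offset matching are simplifying assumptions (the dense tree actually indexes possibly overlapping offset ranges).

    # Illustrative multi-level lookup: search the top (in-core) level first;
    # an entry at a higher level shadows an older entry for the same offset.
    dense_tree = [
        {(0, 4096): "key-new"},                              # level 0 (top, newest)
        {(0, 4096): "key-old", (4096, 4096): "key-B"},       # level 1 (older)
    ]

    def lookup(tree, offset, length):
        for level in tree:                # higher levels are searched first
            entry = level.get((offset, length))
            if entry is not None:
                return entry              # the newest mapping wins
        return None                       # caller zero-fills missing ranges

    print(lookup(dense_tree, 0, 4096))      # "key-new": level 0 shadows level 1
    print(lookup(dense_tree, 4096, 4096))   # "key-B":  found at a lower level
    print(lookup(dense_tree, 8192, 4096))   # None:     zero-filled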
Once found, each extent key 475 is processed by the volume layer 340 to, e.g., implement the bucket mapping technique 476 that translates the extent key to the appropriate extent store instance 470 responsible for storing the requested extent 610. Note that, in one embodiment, each extent key 475 may be substantially identical to the hash value 472 associated with the extent 610 (i.e., the hash value computed during the write request for the extent), such that the bucket mapping technique 476 and the extent metadata selection technique 460 may be used for both write path and read path operations. Note also that the extent key 475 may be derived from the hash value 472. The volume layer 340 may then pass the extent key 475 (i.e., the hash value from a previous write request for the extent) to the appropriate extent store instance 470 (via an extent store get operation), which performs an extent key-to-SSD mapping to determine the location of the extent on SSD 260.
In response to the get operation, the extent store instance may process the extent key 475 (i.e., hash value 472) to perform the extent metadata selection technique 460 that (i) selects an appropriate hash table 480 (e.g., hash table 480a) from the set of hash tables within the extent store instance 470, and (ii) extracts a hash table index 462 from the extent key 475 (i.e., hash value 472) to index into the selected hash table and look up a table entry having a matching extent key 475 that identifies a storage location 490 on SSD 260 for the extent 610. That is, the SSD location 490 mapped to the extent key 475 may be used to retrieve the existing extent (denoted as extent 610) from SSD 260 (e.g., SSD 260b). The extent store instance then cooperates with the RAID layer 360 to access the extent on SSD 260b and retrieve the data contents in accordance with the read request. Illustratively, the RAID layer 360 may read the extent in accordance with an extent read operation 468 and pass the extent 610 to the extent store instance. The extent store instance may then decompress the extent 610 in accordance with a decompression technique 456, although it will be understood by those skilled in the art that decompression can be performed at any layer of the storage I/O stack 300. The extent 610 may be stored in a buffer (not shown) in the memory 220, and a reference to that buffer may be passed back through the layers of the storage I/O stack. The persistence layer may then load the extent into a read cache 580 (or other staging mechanism) and may extract appropriate read data 512 from the read cache 580 for the LBA range of the read request 510. Thereafter, the protocol layer 320 may create a SCSI read response 514, including the read data 512, and return the read response to the host 120.
Layered File System
The embodiments described herein illustratively employ a layered file system of the storage I/O stack. The layered file system includes a flash-optimized, log-structured layer (i.e., extent store layer) of the file system configured to provide sequential storage of data and metadata (i.e., a log-structured layout) on the SSDs 260 of the cluster. The data may be organized as an arbitrary number of variable-length extents of one or more host-visible LUNs served by the nodes. The metadata may include mappings from host-visible logical block address ranges (i.e., offset ranges) of a LUN to extent keys, as well as mappings of the extent keys to SSD storage locations of the extents. Illustratively, the volume layer of the layered file system cooperates with the extent store layer to provide a level of indirection that facilitates an efficient log-structured layout of extents on the SSDs by the extent store layer.
In one embodiment, functions of the log-structured layer of the file system, such as write allocation and flash device (i.e., SSD) management, are performed and maintained by the extent store layer 350. Write allocation may include gathering of variable-length extents to form full stripes that may be written to free segments across the SSDs of one or more RAID groups. That is, the log-structured layer of the file system writes extents to initially free (i.e., clean) segments as full stripes rather than partial stripes. Flash device management may include segment cleaning to create such free segments that indirectly map to the SSDs via the RAID groups. Accordingly, partial RAID stripe writes are avoided, which results in reduced RAID-related write amplification.
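The sketch below shows the gist of gathering variable-length extents until a full stripe can be written; the chunk size and the 22+2 geometry are assumptions used only for illustration.

    # Illustrative write allocation: pack variable-length extents into a buffer
    # and issue only full-stripe writes to the RAID group, never partial stripes.
    CHUNK = 64 * 1024          # bytes written per data SSD per stripe (assumed)
    DATA_SSDS = 22             # assumed RAID geometry (22 data + 2 parity)
    STRIPE_DATA = CHUNK * DATA_SSDS

    pending = bytearray()

    def write_extent(extent: bytes, full_stripe_write):
        pending.extend(extent)
        while len(pending) >= STRIPE_DATA:        # only full stripes are issued
            stripe = bytes(pending[:STRIPE_DATA])
            del pending[:STRIPE_DATA]
            full_stripe_write(stripe)             # RAID layer adds parity on the parity SSDs

    write_extent(b"x" * (STRIPE_DATA + 100),
                 lambda s: print(f"full stripe write of {len(s)} bytes"))
    print(f"{len(pending)} bytes held until the next full stripe")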
Instead of relying on garbage collection in the SSDs, the storage I/O stack may implement segment cleaning (i.e., garbage collection) in the extent store layer to bypass the performance impacts of flash translation layer (FTL) functionality (including garbage collection) in the SSDs. In other words, the storage I/O stack allows the log-structured layer of the file system to operate as a data layout engine using segment cleaning, effectively replacing a substantial portion of the FTL functionality of the SSDs. The extent store layer may thus process random write requests in accordance with segment cleaning (i.e., garbage collection) so as to render flash behavior within its FTL functionality predictable. As a result, a log-structured equivalent source of write amplification for the storage I/O stack may be consolidated and managed at the extent store layer. In addition, the log-structured layer of the file system may be employed, in part, to improve write performance from the flash devices of the storage array.
Note that the log-structured layout of the SSDs is realized by sequentially writing extents to clean segments. Thus, the log-structured layout (i.e., sequential storage) employed by the extent store layer inherently supports variable-length extents, thereby allowing unrestricted compression of extents prior to storage on the SSDs and without requiring explicit block-level (i.e., SSD block) metadata support from the SSDs, such as 520-byte sectors supporting 512 bytes of data and 8 bytes of metadata (e.g., for a pointer to another block containing a tail of compressed data). Typically, consumer-grade SSDs support sectors that are powers of 2 in size (e.g., 512 bytes), whereas more expensive enterprise-grade SSDs may support enhanced-size sectors (e.g., 520 bytes). Accordingly, the extent store layer may operate with lower-cost consumer-grade SSDs while supporting variable-length extents with their associated unconstrained compression.
Segment Cleaning
Fig. 6 illustrates segment cleaning by the layered file system. In an embodiment, the extent store layer 350 of the layered file system may write extents to an empty or free region or "segment". Before rewriting that segment again, the extent store layer 350 may clean the segment in accordance with segment cleaning, which, as illustrated, may be embodied as a segment cleaning process. The segment cleaning process may read all valid extents 610 from an old segment 650a and write those valid extents (i.e., extents 612 that have not been deleted or overwritten) to one or more new segments 650b-c, thereby freeing (i.e., "cleaning") the old segment 650a. New extents may then be written sequentially to the old (now clean) segment. The layered file system may maintain a determined amount of reserve space (i.e., free segments) to support efficient performance of segment cleaning. For example, as shown, the layered file system may maintain reserve space of free segments equal to approximately 7% of storage capacity. The sequential writing of new extents may manifest as full stripe writes 458, such that a single write operation to storage spans all SSDs in a RAID group 466. Write data may be accumulated until a stripe write operation of minimum depth can be performed.
Illustratively, segment cleaning may be performed to free one or more selected segments that indirectly map to SSDs. As used herein, an SSD may comprise a plurality of segment chunks 620, where each chunk is illustratively approximately 2 GB in size. A segment may include a segment chunk 620a-c from each of a plurality of SSDs in a RAID group 466. Thus, for a RAID group with 24 SSDs, where the equivalent storage space of 22 SSDs stores data (data SSDs) and the equivalent storage space of 2 SSDs stores parity (parity SSDs), each segment may contain 44 GB of data and 4 GB of parity. The RAID layer may further configure the RAID groups according to one or more RAID implementations (e.g., RAID 1, RAID 4, RAID 5 and/or RAID 6), thereby providing protection over the SSDs in the event of, e.g., failure of one or more SSDs. Note that each segment may be associated with a different RAID group and therefore may have a different RAID configuration, i.e., each RAID group may be configured according to a different RAID implementation. To free or clean a selected segment, extents of the segment containing valid data are moved to a different clean segment and the selected segment (now clean) is freed for subsequent reuse. Segment cleaning consolidates fragmented free space to improve write efficiency, e.g., stripe-related write efficiency is improved by reducing RAID-related write amplification, and write efficiency to the underlying flash blocks is improved by reducing the performance impact of the FTL. Once a segment is cleaned and designated freed, data may be written sequentially to that segment. Accounting structures (e.g., a free segment map indicating an amount of segment free space) maintained by the extent store layer for write allocation may be employed by the segment cleaning process. Note that selection of a clean segment to receive data (i.e., writes) from a segment being cleaned may be based on the amount of free space remaining in the clean segment and/or the last time the clean segment was used. Note further that different portions of data from the segment being cleaned may be moved to different "target" segments. That is, a plurality of relatively clean segments 650b, 650c may receive differing portions of data from the segment 650a being cleaned.
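The following minimal sketch illustrates the basic flow of such a cleaning pass. The Segment and Extent classes, the free_segments list, and the valid flag are illustrative assumptions introduced only for this sketch; they are not names from the described storage I/O stack.

from dataclasses import dataclass, field

@dataclass
class Extent:
    key: int
    data: bytes
    valid: bool = True            # False once the extent is deleted or overwritten

@dataclass
class Segment:
    capacity: int
    extents: list = field(default_factory=list)

    def used(self) -> int:
        return sum(len(e.data) for e in self.extents)

def clean_segment(old: Segment, free_segments: list) -> None:
    """Relocate valid extents from 'old' into clean target segments, then free 'old'."""
    target = free_segments.pop()              # pick a clean target segment
    for extent in old.extents:
        if not extent.valid:                  # deleted/overwritten extents are skipped
            continue
        if target.used() + len(extent.data) > target.capacity:
            target = free_segments.pop()      # spill remaining extents to another target
        target.extents.append(extent)         # sequential (log-structured) append
    old.extents.clear()                       # the old segment is now clean
    free_segments.insert(0, old)              # and available for reuse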
Illustratively, segment cleaning may cause some degree of write amplification in the storage array (SSDs). However, the file system may reduce such write amplification by writing extents to the SSDs sequentially, as to a log device. For example, given an SSD with an approximately 2 MB erase block size, by sequentially writing at least 2 MB of data (extents) to a free segment, an entire erase block may be overwritten and fragmentation at the SSD level may be eliminated (i.e., reducing garbage collection in the SSD). However, SSDs typically stripe data across multiple flash components and across multiple channels (i.e., storage controllers 240) in order to realize performance. Thus, a relatively large (e.g., 2 GB) write granularity to a free (i.e., clean) segment may be necessary to avoid write amplification at the SSD level (i.e., to overwrite the internal SSD striping).
Specifically, because erase block boundaries in an SSD may be unknown, the write granularity should be large enough so that a sequence of writes for extents over a large contiguous range may overwrite previously written extents on the SSD and effectively override garbage collection in the SSD. In other words, such garbage collection may be preempted because the new data is written over the same range as the previous data, such that the new data completely overwrites the previously written data. This approach also avoids consuming reserve space capacity with newly written data. Accordingly, an advantage of the log-structured feature of the storage I/O stack (i.e., the log-structured layer of the file system) is the ability to reduce write amplification of the SSDs with only a minimum amount of reserve space in the SSDs. The log-structured feature effectively "moves" flash device management of the reserve space from the SSDs to the extent store layer, which uses that reserve space to manage write amplification. Thus, there is only a single source of write amplification (i.e., the extent store layer), rather than two sources of write amplification (i.e., the extent store layer and the SSD FTL, which multiply).
Write Allocation
In an embodiment, there may be multiple RAID stripes per segment. Each time a segment is allocated (i.e., after cleaning the segment), the chunk of each SSD within the segment may include a series of RAID stripes. The chunks may be at the same or different offsets within the SSDs. The extent store layer may read the chunks sequentially for cleaning purposes and relocate all valid data to another segment. Thereafter, the chunks 620 of the cleaned segment may be freed, and a decision may be rendered as to how to constitute the next segment that uses those chunks. For example, if an SSD is removed from a RAID group, a portion of capacity (i.e., one chunk 620) may be omitted from the next segment (i.e., a change in RAID stripe configuration), so as to constitute a RAID group from a set of chunks 620 that is one chunk narrower, i.e., the RAID width is reduced by one. Thus, by using segment cleaning, a RAID group of the chunks 620 constituting a segment may effectively be created each time a new segment is allocated, i.e., a RAID group is created dynamically from the available SSDs when a new segment is allocated. There is generally no requirement that a new segment include all of the SSDs 260 in the storage array 150. Alternatively, a chunk 620 from a newly introduced SSD may be added into the RAID group created when a new segment 650 is allocated.
Fig. 7 illustrates a RAID stripe formed by the layered file system. Note that write allocation may include gathering of variable-length extents to form one or more stripes across the SSDs of one or more RAID groups. In an embodiment, the RAID layer 360 may manage the parity computations and topology information used for placement of the extents 610 on the SSDs 260a-n of the RAID group 466. To that end, the RAID layer may cooperate with the extent store layer to organize the extents as stripes 710 within the RAID group. Illustratively, the extent store layer may gather the extents 610 to form one or more full stripes 710 that may be written to a free segment 650a, such that a single stripe write operation 458 may span all SSDs in that RAID group. The extent store layer may also cooperate with the RAID layer to pack each stripe 710 as a full stripe of variable-length extents 610. Once a stripe is complete, the RAID layer may pass the full stripe 710 of extents as a set of chunks 620d-f to the storage layer 365 of the storage I/O stack for storage on the SSDs 260. By writing the full stripe (i.e., data and parity) to the free segment, the layered file system avoids the cost of parity updates and spreads any required read operation load across the SSDs. Note that the extents 610 pending write operations on an SSD 260 may be accumulated into a chunk 620d,e, which is written to the SSD as one or more temporally proximate write operations (e.g., of 2 Gbytes), thereby reducing the performance impact of the FTL in the SSD.
In an embodiment, an extent store may be viewed as a global pool of extents stored on the storage arrays 150 of the cluster, where each extent may be maintained within a RAID group 466 of an extent store instance. Assume one or more variable-length (i.e., small and/or large) extents are written to a segment. The extent store layer may gather the variable-length extents to form one or more stripes across the SSDs of the RAID group. Although each stripe may include multiple extents 610 and an extent 610 may span more than one stripe 710a,b, each extent is stored entirely on one SSD. In an embodiment, a stripe may have a depth of 16 KB and an extent may have a size of 4 KB, but the extent may thereafter be compressed down to 1 KB or 2 KB or smaller, permitting larger extents to be packed, which may exceed the stripe depth (i.e., the chunk 620g depth). Thus, a stripe may constitute only a portion of an extent, and the depth of a stripe 710 (i.e., the set of chunks 620d-f constituting the stripe) may be independent of the extents written to any one SSD. Because the extent store layer may write the extents as full stripes across one or more free segments of the SSDs, write amplification associated with processing information of the stripes may be reduced.
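A brief sketch of the gathering step follows. It accumulates variable-length extents until at least a full stripe spanning the RAID group can be written in a single operation; the StripePacker class, the STRIPE_DEPTH value, the ssd_count parameter, and the write_full_stripe callback are illustrative assumptions, not the actual implementation.

STRIPE_DEPTH = 16 * 1024                 # bytes per SSD per stripe (example value)

class StripePacker:
    def __init__(self, ssd_count: int, write_full_stripe):
        self.ssd_count = ssd_count
        self.write_full_stripe = write_full_stripe   # callback into the RAID layer
        self.pending = []                # extents accumulated for the next stripe
        self.pending_bytes = 0

    def add_extent(self, extent: bytes) -> None:
        self.pending.append(extent)
        self.pending_bytes += len(extent)
        # A full stripe needs at least STRIPE_DEPTH bytes for every SSD in the group.
        while self.pending_bytes >= STRIPE_DEPTH * self.ssd_count:
            self._flush()

    def _flush(self) -> None:
        stripe, depth = [], STRIPE_DEPTH * self.ssd_count
        while self.pending and depth > 0:
            extent = self.pending.pop(0)
            stripe.append(extent)
            depth -= len(extent)
            self.pending_bytes -= len(extent)
        self.write_full_stripe(stripe)   # single write operation spanning all SSDs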
Rate Matching Technique
The embodiments described herein are directed to a rate matching technique configured to adjust a rate of cleaning of one or more selected segments of the storage array to accommodate a variable rate of incoming workload processed by the storage I/O stack (as well as the valid data relocated as each selected segment is cleaned). The incoming workload may manifest as I/O operations (e.g., read and write operations) directed to user data (e.g., from a host coupled to the cluster) and associated metadata (e.g., from the volume and extent store layers of the storage I/O stack). Note that the extent store layer provides sequential storage of the user data and metadata (embodied as extents) on the SSDs of the storage array. The extent store layer may clean a segment in accordance with segment cleaning (i.e., garbage collection), which may illustratively be embodied as a segment cleaning process. The segment cleaning process may read all valid extents from the one or more segments to be cleaned and write those valid extents (e.g., extents not deleted) to one or more other segments available to be written, thereby freeing (i.e., "cleaning") the storage space of the segments being cleaned.
A challenge of segment cleaning involves balancing the resources (e.g., I/O bandwidth and CPU) consumed by copying valid data to one or more other segments with the resources needed to service the variable incoming I/O workload (i.e., manifested as I/O operations). The extent store layer may defer segment cleaning until the number of free segments in the storage array falls below a certain threshold, so as to minimize relocation of data and metadata (i.e., extents) and write amplification. That is, segments loaded with data need not be cleaned until the number of free segments available for use falls below the threshold. It is therefore desirable to clean segments fast enough to accommodate the variable incoming write workload, but not so fast that segments are cleaned sooner than necessary, which produces unnecessary write amplification, i.e., copying (cleaning) data that may have been overwritten by the incoming workload anyway. An extent may be overwritten or deleted before the segment is cleaned, in which case the extent need not be relocated to another segment. For example, if the segment cleaning process cleans a segment too soon (i.e., before the segment needs to be cleaned) and relocates the valid extents of a chunk to another segment, unnecessary write amplification may result. In addition, it is desirable that segment cleaning be as smooth as possible, i.e., maintaining bounded latency in response to the incoming I/O workload, so as to avoid an uneven (imbalanced) response time to the host. Accordingly, the rate matching technique described herein involves adjusting segment cleaning to accommodate the variable workload and smooth the host-perceived latency.
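A small sketch of such threshold-deferred cleaning is shown below. It reuses the Segment objects and clean_segment() helper from the earlier sketch; the free_threshold parameter and the policy of picking the candidate with the fewest valid bytes are illustrative assumptions rather than the patented selection criteria.

def maybe_clean(segments, free_segments, free_threshold):
    """Defer cleaning until free segments are scarce, then clean the cheapest candidate."""
    if len(free_segments) >= free_threshold:
        return                        # enough free segments: defer cleaning, since
                                      # overwrites/deletes may still shrink the work
    candidates = [s for s in segments if s.extents]
    if not candidates:
        return
    # Pick the segment with the fewest valid bytes, i.e., the least relocation work.
    candidate = min(candidates, key=lambda s: s.used())
    clean_segment(candidate, free_segments)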
In an embodiment, the rate matching technique may be implemented as a feedback control mechanism (e.g., a feedback control loop) configured to adjust the segment cleaning process based on the incoming read and write workload and the rate of segment cleaning. Components of the feedback control mechanism may include one or more weighted schedulers and various accounting data structures (e.g., counters), the counters being configured to track (determine) the progress (e.g., rate) of segment cleaning and free space usage within a segment. The counters may also be used to balance the rate of segment cleaning with the rate of the incoming I/O workload, which may vary depending on the incoming I/O rate and the pattern of overwrites in the incoming I/O workload (which reduce cleaning). When the incoming I/O rate changes, the rate of segment cleaning may be adjusted accordingly to ensure that the rates (i.e., the incoming I/O rate and the segment cleaning rate) are substantially the same (i.e., balanced). In this manner, segment cleaning is performed only when needed (i.e., to free space for the incoming I/O), which reduces write amplification. Similarly, the incoming I/O rate may be throttled (i.e., constrained) to allow segment cleaning that has fallen behind to catch up.
Figs. 8A and 8B illustrate the read and write weighted schedulers, respectively, of the rate matching technique. I/O operations (e.g., read and write operations) in the storage I/O stack may be classified as user, metadata, and relocation (segment cleaner) I/O operations. Read operations are serviced on a per-SSD 260a-n basis and are dispatched directly from the read process 810 of the extent store layer to the RAID layer. Write operations are accumulated at the write process 850 of the extent store layer, where the associated extents 610 may be packed to form full stripes before being dispatched to the RAID layer. Weighted schedulers may be employed at the extent store layer to regulate the read operations and full stripe write operations with the RAID layer. Illustratively, a write weighted scheduler 860 is provided at the write process 850 for all SSDs 260 of a segment 650, and read weighted schedulers 820a-n are provided at the read process 810 for each disk (e.g., SSD) of the segment 650, so as to control the allocation of bandwidth among the various classes of I/O operations.
Illustratively, there are a number of segments in the storage array that, once full, require segment cleaning (i.e., when the number of clean segments falls below a free space threshold). The rate matching technique strives to perform segment cleaning at substantially the same rate at which the incoming workload fills (i.e., consumes the free space of) the segments. Otherwise, segment cleaning may either (1) proceed too quickly, thereby increasing write amplification, or (2) fall behind. In the latter case, not enough clean segments are "reserved", so that there is insufficient free space in the segments to receive the incoming workload; moreover, the segments being cleaned may not be cleaned in time to receive the incoming workload. In the former case (i.e., cleaning too fast), completion of the segment cleaning may yield lower latency to the host (i.e., a negative "spike"). In both cases, a "spike" (negative or positive) in host-perceived latency may occur due to a sudden cleaning (or exhaustion) of free space for the incoming workload. In an embodiment, each read weighted scheduler 820a-n and the write weighted scheduler 860 may include a plurality of queues, where each queue is associated with one of the classes of I/O operations (e.g., user, metadata, and relocation). In other words, the queues may include (i) data queues (read 830a-n, write 870) for the incoming user data (UD) workload, (ii) metadata queues (read 832a-n, write 872a-n) for the incoming volume and extent store layer metadata (MD) workload, and (iii) relocation queues (read 834a-n, write 874) for relocation (REL) of data and metadata during segment cleaning. In accordance with the rate matching technique, the weighted schedulers may be configured to apply fine-grained control over the rate of segment cleaning by assigning relative weights to the queues so as to match the rate of the incoming I/O workload, thereby ensuring that segment cleaning neither introduces unnecessary write amplification (and adds unnecessary host latency) by operating too fast, nor impacts performance by operating too slow, which may increase the host-perceived latency associated with the I/O workload. Note that a queue of each class (UD, MD and REL) is provided per SSD, to support fine-grained control of the I/O.
To that end, it is desirable that each stripe written to the storage array (e.g., as part of a full stripe write operation 458) have a constant ratio of metadata, user data, and relocation data at steady state; otherwise, spikes may appear in the host-perceived latency (i.e., latency beyond a bound). The rate matching technique strives to provide smooth write latency as well as smooth read latency (e.g., for user data read operations, metadata read operations, and relocation read operations), i.e., to maintain smooth latency (no jitter-like behavior) to the host. Accordingly, the technique balances user data, metadata, and relocation data among read and write operations (I/O rate matching), and balances (via fine-grained control) read and write operations.
In an embodiment, the rate matching technique provides separate read queues 830a-n, 832a-n, 834a-n and write queues 870, 872, 874 that are scheduled by selecting and servicing a queue (i.e., processing the operations, embodied as messages, of that queue). Note that a respective set of read queues (UD, MD and REL) associated with each read weighted scheduler 820a-n may be assigned to each SSD, whereas a single set of write queues (UD, MD and REL) may be assigned to the write weighted scheduler 860. Illustratively, each SSD is assigned the set of write queues in a round-robin fashion. A queue may be serviced based on credits (an amount of bytes serviced) accumulated by that queue over time (i.e., its assigned weight). For example, assume the weights of the three queues (user data, metadata, and relocation data) are assigned according to the ratio 1:2:3. Over a number "N" of iterations, the weighted scheduler strives to service the first queue (1/6*N), the second queue (2/6*N), and the third queue (3/6*N). Because the different classes of I/O operations may have different payload sizes (variable-sized extents), an operation "credit size" value (e.g., 64 KB) determined from the queues may be processed by the read and write weighted schedulers so as to maintain the queue weight ratios. In other words, the queues may be processed such that the number of I/O bytes (rather than the number of queue entries) from each queue maintains the queue weight ratios.
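The sketch below shows one way such byte-credit servicing could be realized (a deficit-round-robin style pass). The WeightedScheduler class, the dispatch callback, and the CREDIT_SIZE constant are illustrative assumptions under the 1:2:3 example above, not the described implementation itself.

from collections import deque

CREDIT_SIZE = 64 * 1024                        # bytes of credit granted per visit

class WeightedScheduler:
    def __init__(self, weights):               # e.g. {"UD": 1, "MD": 2, "REL": 3}
        self.queues = {name: deque() for name in weights}
        self.weights = dict(weights)
        self.credits = {name: 0 for name in weights}

    def enqueue(self, name, op_bytes):
        self.queues[name].append(op_bytes)      # each entry is one operation's byte count

    def service(self, dispatch):
        """One pass: grant weighted byte credits, then drain what the credits allow."""
        for name, q in self.queues.items():
            if not q:
                continue                        # no credit accrues for an empty queue
            self.credits[name] += self.weights[name] * CREDIT_SIZE
            while q and self.credits[name] >= q[0]:
                op_bytes = q.popleft()
                self.credits[name] -= op_bytes
                dispatch(name, op_bytes)        # e.g., hand the operation to the RAID layer

With weights {"UD": 1, "MD": 2, "REL": 3} and backlogged queues, repeated calls to service() dispatch I/O bytes in roughly a 1:2:3 ratio regardless of the individual operation sizes.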
Illustratively, there is one write weighted scheduler 860 provided for write operations (i.e., shared across all SSDs) because the RAID layer performs full stripe write operations 458 across all the SSDs. However, there may be read weighted schedulers 820a-n provided for the read operations of each SSD, because an extent is read from one SSD at a time and reads may be dispersed differently across the SSDs. Accordingly, there is one read weighted scheduler 820a-n per SSD 260a-n on the read path (based on per-SSD read operations), realized in a per-SSD feedback loop for the read queues, and one write weighted scheduler 860 for all SSDs on the write path (based on full stripe write operations), realized in a feedback loop across all SSDs for the write queues. Note that the sources of read operations (in the read path of the storage I/O stack) are shown as: (i) user data (UD) read operations, (ii) metadata (MD) read operations from the volume and extent store layers, and (iii) relocation (REL) read operations from the segment cleaning process. Similarly, these sources also exist for the write path of the storage I/O stack. With this architecture, the I/O bandwidth of the various classes may be controlled by changing the weights of the corresponding queues, so as to speed up or slow down segment relocation.
Fig. 9 illustrates the rate matching technique. In response to selection of a segment for cleaning, the segment cleaning process starts with a certain number of bytes to be relocated, "bytes_to_relocate_at_start". That is, segment cleaning begins 910 with the segment having all the bytes of its valid extents (i.e., not deleted or overwritten) to be relocated. Accounting structures 920, 940 (e.g., one or more counters) may keep track of counts, such as the number of valid extents added to the segment (to be relocated by cleaning) and/or deleted (ignored by cleaning), and/or the number of deleted extents. These statistics may be used to derive the number of bytes relocated so far by the segment cleaning process. Note that extents added to the segment add to the cleaning burden (i.e., increase the number of bytes to be relocated), whereas extents deleted from the segment reduce the cleaning burden (i.e., reduce the number of bytes to be relocated). The progress of segment cleaning may be tracked in a segment cleaning progress accounting structure 920 (e.g., counters) whose parameters, such as "bytes_to_relocate_at_start" 922 and "bytes_relocated" 924, allow computation of the percentage of bytes relocated, "reloc_pct" 926, i.e., the segment cleaning (relocation) progress:
reloc_pct = (bytes_relocated / bytes_to_relocate_at_start) * 100
Similarly, the incoming I/O rate 930 (i.e., the incoming workload) for the one or more segments selected for cleaning may be determined using free space availability. The number of bytes in the segment not subject to relocation is considered free space, i.e., the write space quota ("write_quota_total") available for the segment. The number of bytes written/consumed for each segment since the segment cleaning process started may be tracked as a running counter, "write_quota_used". Thus, free space may be tracked using a free space accounting structure 940 whose parameters "write_quota_used" 944 and "write_quota_total" 942 allow the percentage of free space used so far to be computed as:
write_quota_pct = (write_quota_used / write_quota_total) * 100
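A minimal sketch of counters mirroring structures 920 and 940 follows. The field names follow the parameters above, but the CleaningCounters class itself is an assumption introduced for illustration.

class CleaningCounters:
    def __init__(self, bytes_to_relocate_at_start, write_quota_total):
        self.bytes_to_relocate_at_start = bytes_to_relocate_at_start
        self.bytes_relocated = 0              # updated as valid extents are relocated
        self.write_quota_total = write_quota_total
        self.write_quota_used = 0             # updated as incoming I/O consumes free space

    def reloc_pct(self) -> float:
        """Segment cleaning (relocation) progress, in percent."""
        return 100.0 * self.bytes_relocated / max(1, self.bytes_to_relocate_at_start)

    def write_quota_pct(self) -> float:
        """Free space consumed by the incoming workload, in percent."""
        return 100.0 * self.write_quota_used / max(1, self.write_quota_total)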
Illustratively, as progress is recorded for the cleaning of a segment (i.e., a change in the percentage of relocated bytes), progress is also recorded for the written user data and metadata (i.e., a change in the available write space), so that the rate of segment cleaning can be adjusted. That is, the amount of extents relocated from the segment being cleaned is tracked, as well as the amount of free space consumed in that segment. From these two parameters (i.e., reloc_pct 926 and write_quota_pct 946), a decision can be rendered as to whether to decrease or increase the relocation rate. In other words, the parameters reloc_pct and write_quota_pct may be used to feed back the incoming I/O rate and the segment cleaning rate. For example, assume 50% of the extents need to be relocated at the start of segment cleaning. Assume further that the incoming I/O overwrites logical blocks during segment cleaning, so that the amount of extents (chunks) to be relocated decreases, thereby allowing an increase in the incoming I/O rate (e.g., the rate of incoming write requests from the host). The rate matching technique determines when and how the various input rates are adjusted (e.g., so as to maintain smooth and imperceptible fluctuation in the latency to the host).
With the progress information regarding cleaning of the segment (i.e., the reloc_pct parameter 926) and the available free space (i.e., the write_quota_pct parameter 946), a desired weight for the relocation queue can be computed. The desired control (i.e., substantially matching the incoming I/O rate to the segment cleaning rate) may be achieved by computing an error (e.g., write_quota_pct - reloc_pct) within an I/O rate matching controller 950 and applying that error as an actuation to the desired weight of the relocation queue (e.g., the write relocation queue 874) so as to reduce the error (i.e., a feedback loop). Note that the available free space acts as negative feedback limiting the rate of segment cleaning, i.e., the more free space (the less incoming I/O), the less cleaning is needed. Illustratively, the desired relocation weight is related to the sum of the weights of the user and metadata queues. Once the desired weight is computed, the weight of the relocation queue may be adjusted in small increments or decrements rather than set directly to the desired weight, to avoid large oscillations in performance that jitter of the weights is likely to cause (e.g., control loop overshoot ringing), i.e., rapid changes in host-perceived latency. Note that it is desirable to bound the host-perceived latency (e.g., to 1 millisecond) during control of the error (i.e., the imbalance between the incoming I/O rate and the relocation rate), so as to smooth latency changes. Those skilled in the art will understand that many control algorithms (such as proportional-integral-derivative (PID) control) may be used to realize the desired control.
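The sketch below illustrates one plausible feedback step using the CleaningCounters sketch above: the error nudges the relocation weight in small increments toward a desired value derived from the user and metadata weights. It is a simple proportional controller; the gain, bounds, WEIGHT_STEP value, and the exact form of the desired weight are assumptions for illustration, not values from the described embodiment.

WEIGHT_STEP = 0.05        # fraction of the desired change applied per update

def update_relocation_weight(counters, weights, min_w=0.10, max_w=10.0):
    """Nudge the REL queue weight toward a desired value driven by the error."""
    error = counters.write_quota_pct() - counters.reloc_pct()
    # Positive error: incoming I/O consumes free space faster than cleaning progresses,
    # so relocation should be weighted more heavily (and vice versa for negative error).
    desired = (weights["UD"] + weights["MD"]) * (1.0 + error / 100.0)
    desired = max(min_w, min(max_w, desired))
    # Move only a small step toward the desired weight to avoid oscillation (ringing).
    weights["REL"] += WEIGHT_STEP * (desired - weights["REL"])
    return weights["REL"]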
The parameters from which the segment cleaning rate is derived (i.e., bytes_relocated 924 and write_quota_used 944) may be tracked and updated in the I/O path, so that the weight computation logic may be triggered when the difference in relocation progress (reloc_pct 926) or free space usage (write_quota_pct 946) exceeds 1%. That is, when the relocation percentage (reloc_pct) or the write quota percentage (write_quota_pct) changes by 1%, the weight (i.e., rate) computation of the I/O rate matching controller 950 is triggered. Accordingly, the control loop computation may be driven by a minimum change in the error rather than by a fixed sampling rate. A separate process (not shown) may update the parameters (i.e., the parameters of the segment cleaning progress structure 920 and the free space usage structure 940) during rate control of the queues. Note that the parameters may change due to segment cleaning (i.e., changes in the number of relocated bytes and the amount of free space) and/or incoming I/O operations (i.e., changes in the amount of free space).
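A short sketch of such change-driven triggering is shown below, again reusing the CleaningCounters sketch; the ChangeTrigger class and its threshold parameter are illustrative assumptions.

class ChangeTrigger:
    """Recompute weights only when either percentage has moved by more than 1%."""
    def __init__(self, counters, threshold_pct=1.0):
        self.counters = counters
        self.threshold = threshold_pct
        self.last_reloc = counters.reloc_pct()
        self.last_quota = counters.write_quota_pct()

    def should_recompute(self) -> bool:
        reloc, quota = self.counters.reloc_pct(), self.counters.write_quota_pct()
        if (abs(reloc - self.last_reloc) > self.threshold or
                abs(quota - self.last_quota) > self.threshold):
            self.last_reloc, self.last_quota = reloc, quota
            return True
        return False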
In an embodiment, a set of default weights may be assigned to the three queues, wherein a 50% (fixed) weight may be assigned to the user data (UD) queue, a 10% weight may initially be assigned to the metadata (MD) queue, and a 50% weight may initially be assigned to the relocation (REL) queue. Note that the fixed weight assignment of the UD queue may be used to ensure at least a minimum level of service for host requests. The relocation queue weight may vary, e.g., between 10% and 1000% (more than 20 times that of the UD queue), because that weight is typically driven by the fullness (amount of valid extents) of the segments, i.e., the fuller the segment, the greater the amount of extents that must be relocated, and therefore the longer the relocation queue. For example, if the segment to be cleaned is 50% full, the amount of valid extents to be relocated (cleaned) should approximate the amount of user data (and metadata) that will be written in the time frame required to fill the segment. Thus, the weight depends on the fullness of the storage array (segments), i.e., the amount of relocation needed to clean a segment relates directly to the fullness of the storage array. Note also that the rate of incoming I/O operations may depend on a desired amount of concurrency.
As used herein, concurrency is based on the incoming I/O workload, whereas rate matching is based on the fullness of the storage array and on the progress (i.e., the degree to which it is behind or ahead) of the relocation queues during segment cleaning. Specifically, read concurrency may be determined based on RAID write requests. In an embodiment, the rate matching technique may use the weighted scheduler logic of the write process for concurrency of both read and write operations, even though a separate read weighted scheduler is maintained for each disk (SSD) by the read process, i.e., the weights of each of the read queues are the same. The processing of each queue is determined by the ratio of the user data (UD), metadata (MD), and relocation (REL) queues. If the storage array is full and there is a high rate of incoming user data I/O, segment cleaning is weighted more heavily so as to stay ahead of the I/O rate; otherwise, there will not be sufficient storage space to accommodate the incoming I/O workload. That is, if high (undesirable) read latency occurs during segment cleaning, the relocation write workload may be unable to keep up with the incoming I/O. Increasing the relocation queue weight may therefore be necessary, so that the relocation queue can eventually "catch up" to the rate of the incoming I/O workload.
Accordingly, the rate matching technique strives to keep segment cleaning ahead of the I/O rate; otherwise the storage array may prematurely run out of (exhaust) storage space. Thus, the relocation queue may be prioritized with respect to the user data I/O workload (by adjusting the weights). In addition, if user I/O operations are not received and processed by the storage I/O stack as quickly as expected, the rate of segment cleaning can be reduced (slowed down) so as to avoid relocating (i.e., copying) extents that might not require such copying if segment cleaning were deferred (until needed). Assume the relocation queue is arbitrarily prioritized to stay ahead of the incoming user data. As extents are gathered to fill a stripe for writing to disk, a disproportionately smaller amount of user data extents (e.g., fewer) may be gathered and packed into the stripe as compared with the amount of relocation extents (e.g., more). This may result in depletion of the NVRAM of the persistence layer, i.e., servicing of the data in NVRAM may not occur fast enough, resulting in "back pressure" to the host and a spike in host-perceived latency. The rate matching technique described herein attempts to avoid such a situation, so as to smooth latency and avoid such spikes, by keeping segment cleaning just ahead of the rate of the incoming I/O workload.
Note that the rate matching applied to the queues for the write workload applies equally to the read workload. Depending on the read workload, each SSD may have a different rate attributable to the weights of its queues. Illustratively, the relocation read queue rate is controlled to match the user data and metadata read (and write) queue rates. This allows bandwidth matching on a per-SSD basis, so that an SSD may be assigned fewer read operations when busy and more read operations when not busy. With sufficient relocation in flight, it is desirable to ensure that the number of read operations issued to an SSD is significant enough that segment cleaning (relocation) can be performed at the desired rate, while not affecting the smoothness of host read latency. In other words, it is unnecessary to issue more read operations for relocation than are needed to satisfy the write operations for relocation. In this manner, the rate matching technique may be configured to prevent segment cleaning from impacting user read operations and storage I/O stack read operations to data and metadata. Accordingly, the rate of the relocation read queue may be controlled to match the rate of the write relocation queue, so that the user and metadata queues are allowed to be processed at their needed rates.
The feedback-control-based weight computation described herein ensures that the relocation queues have the appropriate weights to maintain the desired segment cleaning progress. The services driving the relocation I/O operations (e.g., the read and write processes) are secondary control points that provide fine-grained rate control 960 to track the relocation progress, i.e., to process I/O operations at the desired rate, that rate being set by the (primary) I/O rate matching controller 950 that establishes the desired rate. The services relocating extents may consider the following in processing relocation I/O operations: (i) the current weights of the write process queues (i.e., the desired rate set by I/O rate matching); (ii) the current number (in bytes) of write requests of the three classes of I/O operations (user, metadata, relocation) at the write process; (iii) the current number (in bytes) of read requests across all the weighted schedulers of the read process, on a per-SSD basis; and (iv) the relocation progress and/or free space availability. There are also per-service limits to ensure that a service does not end up flooding I/O operations, wherein these limits include (v) the number of relocation I/O operations dispatched for the segment to be cleaned; (vi) a maximum number of current relocation I/O operations; and (vii) a current maximum number of relocated bytes.
While there have been shown and described illustrative embodiments directed to a rate matching technique configured to adjust the rate of cleaning of one or more selected segments of the storage array so as to accommodate a variable rate of incoming workload processed by the storage I/O stack executing on one or more nodes of the cluster, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the embodiments herein. For example, embodiments have been shown and described herein with relation to use of the rate matching technique by the extent store layer of the storage I/O stack. However, the embodiments in their broader sense are not so limited, and may in fact allow for use of the rate matching technique in other storage applications and at other layers of the storage I/O stack (such as the volume layer), where the technique may be used to pace reference count de-referencing (i.e., to throttle deletions in the storage system).
The foregoing description has been directed to specific embodiments. It will be apparent, however, that other variations and modifications may be made to the described embodiments, with the attainment of some or all of their advantages. For instance, it is expressly contemplated that the components and/or elements described herein can be implemented as software encoded on a tangible (non-transitory) computer-readable medium (e.g., disks and/or CDs) having program instructions executing on a computer, hardware, firmware, or a combination thereof. Accordingly, this description is to be taken only by way of example and not otherwise to limit the scope of the embodiments herein. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the embodiments herein.

Claims (20)

1. A method comprising:
receiving, at an incoming rate, a plurality of write requests directed to one or more logical units (LUNs), each write request having data and being processed at a node of a cluster, the node having a memory and being attached to a storage array of solid state drives (SSDs);
storing the data of each write request as one or more user data extents in a first segment spanning a group of the SSDs, the first segment having a log-structured layout; and
controlling a rate of cleaning of the first segment by substantially matching the rate of cleaning the first segment to the incoming rate.
2. The method according to claim 1, further comprising:
tracking a percentage of free space of the first segment;
tracking a percentage of bytes relocated from the first segment to a second segment spanning the group of SSDs, the second segment having the log-structured layout; and
computing an error by subtracting the percentage of free space from the percentage of relocated bytes.
3. The method according to claim 2, further comprising:
servicing a relocation queue in proportion to a first weight assigned to the relocation queue, the first weight being determined using the computed error, wherein the bytes relocated from the first segment to the second segment are enqueued on the relocation queue.
4. The method according to claim 2, wherein the error is computed in response to a change in the percentage of free space exceeding a threshold.
5. The method according to claim 3, wherein the data of each write request is enqueued on a user data queue separate from the relocation queue.
6. The method according to claim 3, wherein servicing the relocation queue in proportion to the first weight occurs by an amount of bytes processed from the relocation queue.
7. The method according to claim 5, further comprising:
servicing the user data queue in proportion to a second weight assigned to the user data queue such that a ratio of the first weight to the second weight is maintained.
8. The method according to claim 3, wherein each SSD of the group of SSDs has a separate read relocation queue, and wherein the first weight is assigned to a write relocation queue shared among the group of SSDs.
9. The method according to claim 8, wherein the first weight is changed in response to a number of read requests in the read queue exceeding a threshold.
10. A method comprising:
receiving, at an incoming rate, a plurality of write requests directed to one or more logical units (LUNs), each write request having data and being processed at a node of a cluster, the node having a memory and being attached to a storage array of solid state drives (SSDs);
storing the data of each write request as one or more user data extents in a first segment spanning a group of the SSDs, the first segment having a log-structured layout;
tracking an amount of free space of the first segment;
tracking an amount of bytes relocated from the first segment to a second segment spanning the group of SSDs;
setting a desired rate at which bytes are relocated from the first segment to the second segment by computing an error using the amount of free space and the amount of relocated bytes; and
controlling a relocation queue to maintain the desired rate of cleaning, wherein the relocated bytes are enqueued on the relocation queue.
11. A system comprising:
a storage system having a memory connected to a processor via a bus;
a storage array coupled to the storage system and having one or more solid state drives (SSDs); and
a storage I/O stack executing on the processor of the storage system, the storage I/O stack when executed being operable to:
receive, at an incoming rate, a plurality of write requests directed to one or more logical units (LUNs), each write request having data;
store the data of each write request as one or more user data extents in a first segment spanning a group of the SSDs, the first segment having a log-structured layout; and
control a rate of cleaning by substantially matching the rate of cleaning the first segment to the incoming rate.
12. The system according to claim 11, wherein the storage I/O stack is further operable to:
track a percentage of free space of the first segment;
track a percentage of bytes relocated from the first segment to a second segment spanning the group of SSDs, the second segment having the log-structured layout; and
compute an error by subtracting the percentage of free space from the percentage of relocated bytes.
13. The system according to claim 12, wherein the storage I/O stack is further operable to:
service a relocation queue in proportion to a first weight assigned to the relocation queue, the first weight being determined using the computed error, wherein the bytes relocated from the first segment to the second segment are enqueued on the relocation queue.
14. The system according to claim 12, wherein the error is computed in response to a change in the percentage of free space exceeding a threshold.
15. The system according to claim 13, wherein the data of each write request is enqueued on a user data queue separate from the relocation queue.
16. The method according to claim 3, wherein servicing the relocation queue in proportion to the first weight occurs by an amount of bytes processed from the relocation queue.
17. The system according to claim 15, wherein the storage I/O stack is further operable to:
service the user data queue in proportion to a second weight assigned to the user data queue such that a ratio of the first weight to the second weight is maintained.
18. The system according to claim 13, wherein each SSD of the group of SSDs has a separate read relocation queue, and wherein the first weight is assigned to a write queue shared among the group of SSDs.
19. The system according to claim 18, wherein the first weight is changed in response to a number of read requests in the read queue exceeding a threshold.
20. The system according to claim 18, wherein the first weight assigned to each write queue is changed in response to a number of relocation operations dispatched.
CN201580049232.9A 2014-09-12 2015-09-08 Rate matching technique for balancing segment cleaning and I/O workload Pending CN107077300A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US14/484,565 2014-09-12
US14/484,565 US9671960B2 (en) 2014-09-12 2014-09-12 Rate matching technique for balancing segment cleaning and I/O workload
PCT/US2015/048800 WO2016040233A1 (en) 2014-09-12 2015-09-08 Rate matching technique for balancing segment cleaning and i/o workload

Publications (1)

Publication Number Publication Date
CN107077300A true CN107077300A (en) 2017-08-18

Family

ID=54207719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201580049232.9A Pending CN107077300A (en) 2014-09-12 2015-09-08 For balancing segmentation removing and the rate-matched technology of I/O workloads

Country Status (4)

Country Link
US (2) US9671960B2 (en)
EP (1) EP3191932A1 (en)
CN (1) CN107077300A (en)
WO (1) WO2016040233A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832008A * 2017-10-25 2018-03-23 记忆科技(深圳)有限公司 Method for improving the consistency of SSD write performance
CN112506429A (en) * 2020-11-30 2021-03-16 杭州海康威视系统技术有限公司 Method, device and equipment for deleting processing and storage medium


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1936818A (en) * 2005-09-22 2007-03-28 株式会社日立制作所 Storage control apparatus, data management system and data management method
CN103399796A (en) * 2006-12-12 2013-11-20 Lsi公司 Balancing of clustered virtual machines using storage load information
CN103534996A (en) * 2012-11-29 2014-01-22 华为技术有限公司 Method and device for implementing load balance
US20140181370A1 (en) * 2012-12-21 2014-06-26 Lsi Corporation Method to apply fine grain wear leveling and garbage collection
US8832363B1 (en) * 2014-01-17 2014-09-09 Netapp, Inc. Clustered RAID data organization

US7191357B2 (en) 2002-03-29 2007-03-13 Panasas, Inc. Hybrid quorum/primary-backup fault-tolerance model
US6820180B2 (en) 2002-04-04 2004-11-16 International Business Machines Corporation Apparatus and method of cascading backup logical volume mirrors
US7464125B1 (en) 2002-04-15 2008-12-09 Ibrix Inc. Checking the validity of blocks and backup duplicates of blocks during block reads
US7149846B2 (en) 2002-04-17 2006-12-12 Lsi Logic Corporation RAID protected external secondary memory
US6912635B2 (en) 2002-05-08 2005-06-28 Hewlett-Packard Development Company, L.P. Distributing workload evenly across storage media in a storage array
US7330430B2 (en) 2002-06-04 2008-02-12 Lucent Technologies Inc. Packet-based traffic shaping
US7096277B2 (en) 2002-08-07 2006-08-22 Intel Corporation Distributed lookup based on packet contents
US20040133590A1 (en) 2002-08-08 2004-07-08 Henderson Alex E. Tree data structure with range-specifying keys and associated methods and apparatuses
US7107385B2 (en) 2002-08-09 2006-09-12 Network Appliance, Inc. Storage virtualization by layering virtual disk objects on a file system
US7343524B2 (en) 2002-09-16 2008-03-11 Finisar Corporation Network analysis omniscent loop state machine
WO2004025476A1 (en) 2002-09-16 2004-03-25 Tigi Corporation Storage system architectures and multiple caching arrangements
US7668885B2 (en) 2002-09-25 2010-02-23 MindAgent, LLC System for timely delivery of personalized aggregations of, including currently-generated, knowledge
US8489742B2 (en) 2002-10-10 2013-07-16 Convergys Information Management Group, Inc. System and method for work management
US7457864B2 (en) 2002-11-27 2008-11-25 International Business Machines Corporation System and method for managing the performance of a computer system based on operational characteristics of the system components
US6928526B1 (en) 2002-12-20 2005-08-09 Datadomain, Inc. Efficient data storage system
US7065619B1 (en) 2002-12-20 2006-06-20 Data Domain, Inc. Efficient data storage system
US7110913B2 (en) 2002-12-23 2006-09-19 United Services Automobile Association (Usaa) Apparatus and method for managing the performance of an electronic device
US7263582B2 (en) 2003-01-07 2007-08-28 Dell Products L.P. System and method for raid configuration
US8499086B2 (en) 2003-01-21 2013-07-30 Dell Products L.P. Client load distribution
US7254648B2 (en) 2003-01-30 2007-08-07 Utstarcom, Inc. Universal broadband server system and method
EP1595197A2 (en) 2003-02-21 2005-11-16 Caringo, Inc. Additional hash functions in content-based addressing
US6904470B1 (en) 2003-03-26 2005-06-07 Emc Corporation Device selection by a disk adapter scheduler
US7394944B2 (en) 2003-04-11 2008-07-01 Seiko Epson Corporation Method and system for finding spatial medians in a sliding window environment
US7325059B2 (en) 2003-05-15 2008-01-29 Cisco Technology, Inc. Bounded index extensible hash-based IPv6 address lookup method
US7519725B2 (en) 2003-05-23 2009-04-14 International Business Machines Corporation System and method for utilizing informed throttling to guarantee quality of service to I/O streams
US7451168B1 (en) 2003-06-30 2008-11-11 Data Domain, Inc. Incremental garbage collection of data in a secondary storage
US7401103B2 (en) 2003-07-31 2008-07-15 Microsoft Corporation Replication protocol for data stores
US7181296B2 (en) 2003-08-06 2007-02-20 Asml Netherlands B.V. Method of adaptive interactive learning control and a lithographic manufacturing process and apparatus employing such a method
US8321955B2 (en) 2003-08-26 2012-11-27 Wu-Chang Feng Systems and methods for protecting against denial of service attacks
US20050076113A1 (en) 2003-09-12 2005-04-07 Finisar Corporation Network analysis sample management process
US7487235B2 (en) 2003-09-24 2009-02-03 Dell Products L.P. Dynamically varying a raid cache policy in order to optimize throughput
US7315866B2 (en) 2003-10-02 2008-01-01 Agency For Science, Technology And Research Method for incremental authentication of documents
US7451167B2 (en) 2003-10-24 2008-11-11 Network Appliance, Inc. Verification of file system log data using per-entry checksums
WO2005041474A1 (en) 2003-10-28 2005-05-06 The Foundation For The Promotion Of Industrial Science Authentication system, and remotely distributed storage system
ES2383998T3 (en) 2003-11-17 2012-06-28 Telecom Italia S.P.A. Architecture of quality of service supervision, related procedure, network and computer program product
WO2005057365A2 (en) 2003-12-08 2005-06-23 Ebay Inc. System to automatically regenerate software code
US8140860B2 (en) 2003-12-15 2012-03-20 International Business Machines Corporation Policy-driven file system with integrated RAID functionality
WO2005086631A2 (en) 2004-01-20 2005-09-22 Bae Systems Information And Electronic Systems Integration Inc. Multifunction receiver-on-chip for electronic warfare applications
US7701948B2 (en) 2004-01-20 2010-04-20 Nortel Networks Limited Metro ethernet service enhancements
US7321982B2 (en) 2004-01-26 2008-01-22 Network Appliance, Inc. System and method for takeover of partner resources in conjunction with coredump
US7849098B1 (en) 2004-02-06 2010-12-07 Vmware, Inc. Providing multiple concurrent access to a file system
US7373473B2 (en) 2004-03-10 2008-05-13 Leica Geosystems Hds Llc System and method for efficient storage and manipulation of extremely large amounts of scan data
US7395352B1 (en) 2004-03-12 2008-07-01 Netapp, Inc. Managing data replication relationships
US7415653B1 (en) 2004-04-21 2008-08-19 Sun Microsystems, Inc. Method and apparatus for vectored block-level checksum for file system data integrity
US7334095B1 (en) 2004-04-30 2008-02-19 Network Appliance, Inc. Writable clone of read-only volume
US7334094B2 (en) 2004-04-30 2008-02-19 Network Appliance, Inc. Online clone volume splitting technique
US7251663B1 (en) 2004-04-30 2007-07-31 Network Appliance, Inc. Method and apparatus for determining if stored memory range overlaps key memory ranges where the memory address space is organized in a tree form and partition elements for storing key memory ranges
US7536424B2 (en) 2004-05-02 2009-05-19 Yoram Barzilai System and methods for efficiently managing incremental data backup revisions
US20050246362A1 (en) 2004-05-03 2005-11-03 Borland Devin P System and method for dynamci log compression in a file system
US7409582B2 (en) 2004-05-06 2008-08-05 International Business Machines Corporation Low cost raid with seamless disk failure recovery
US7814064B2 (en) 2004-05-12 2010-10-12 Oracle International Corporation Dynamic distributed consensus algorithm
US7562101B1 (en) 2004-05-28 2009-07-14 Network Appliance, Inc. Block allocation testing
US8949395B2 (en) 2004-06-01 2015-02-03 Inmage Systems, Inc. Systems and methods of event driven recovery management
US7366865B2 (en) 2004-09-08 2008-04-29 Intel Corporation Enqueueing entries in a packet queue referencing packets
US7490084B2 (en) 2004-09-24 2009-02-10 Oracle Corporation Deferred incorporation of updates for spatial indexes
US20060075281A1 (en) 2004-09-27 2006-04-06 Kimmel Jeffrey S Use of application-level context information to detect corrupted data in a storage system
US20060072554A1 (en) 2004-09-29 2006-04-06 Fardad Farahmand Hierarchically organizing logical trunk groups in a packet-based network
US7257690B1 (en) 2004-10-15 2007-08-14 Veritas Operating Corporation Log-structured temporal shadow store
US8131926B2 (en) 2004-10-20 2012-03-06 Seagate Technology, Llc Generic storage container for allocating multiple data formats
US7376866B1 (en) 2004-10-22 2008-05-20 Network Appliance, Inc. Method and an apparatus to perform fast log replay
US7310704B1 (en) 2004-11-02 2007-12-18 Symantec Operating Corporation System and method for performing online backup and restore of volume configuration information
AU2005304792B2 (en) 2004-11-05 2010-07-08 Drobo, Inc. Storage system condition indicator and method
US7873782B2 (en) 2004-11-05 2011-01-18 Data Robotics, Inc. Filesystem-aware block storage system, apparatus, and method
US7403535B2 (en) 2004-12-14 2008-07-22 Hewlett-Packard Development Company, L.P. Aggregation of network resources providing offloaded connections between applications over a network
US8984140B2 (en) 2004-12-14 2015-03-17 Hewlett-Packard Development Company, L.P. Managing connections through an aggregation of network resources providing offloaded connections between applications over a network
EP1672831A1 (en) 2004-12-16 2006-06-21 Nagravision S.A. Method for transmission of digital data in a local network
US7386758B2 (en) 2005-01-13 2008-06-10 Hitachi, Ltd. Method and apparatus for reconstructing data in object-based storage arrays
US8180855B2 (en) 2005-01-27 2012-05-15 Netapp, Inc. Coordinated shared storage architecture
WO2006090367A2 (en) 2005-02-24 2006-08-31 Xeround Systems Ltd. Method and apparatus for distributed data management in a switching network
US7757056B1 (en) 2005-03-16 2010-07-13 Netapp, Inc. System and method for efficiently calculating storage required to split a clone volume
WO2006109307A2 (en) 2005-04-13 2006-10-19 Discretix Technologies Ltd. Method, device, and system of selectively accessing data
US8849767B1 (en) 2005-04-13 2014-09-30 Netapp, Inc. Method and apparatus for identifying and eliminating duplicate data blocks and sharing data blocks in a storage system
US8200887B2 (en) 2007-03-29 2012-06-12 Violin Memory, Inc. Memory management system and method
US8452929B2 (en) 2005-04-21 2013-05-28 Violin Memory Inc. Method and system for storage of data in non-volatile media
US7370048B2 (en) 2005-05-27 2008-05-06 International Business Machines Corporation File storage method and apparatus
US8028329B2 (en) 2005-06-13 2011-09-27 Iamsecureonline, Inc. Proxy authentication network
US7447868B2 (en) 2005-06-15 2008-11-04 International Business Machines Corporation Using vector processors to accelerate cache lookups
US8504521B2 (en) 2005-07-28 2013-08-06 Gopivotal, Inc. Distributed data management system
US7451348B2 (en) 2005-08-04 2008-11-11 Dot Hill Systems Corporation Dynamic write cache size adjustment in raid controller with capacitor backup energy source
CN101356506B (en) 2005-08-25 2014-01-08 晶像股份有限公司 Smart scalable storage switch architecture
US8259566B2 (en) 2005-09-20 2012-09-04 Qualcomm Incorporated Adaptive quality of service policy for dynamic networks
US7937473B2 (en) 2005-09-20 2011-05-03 Nec Corporation Resource-amount calculation system, and method and program thereof
US7366859B2 (en) 2005-10-06 2008-04-29 Acronis Inc. Fast incremental backup method and system
US20070083482A1 (en) 2005-10-08 2007-04-12 Unmesh Rathi Multiple quality of service file system
US7386675B2 (en) 2005-10-21 2008-06-10 Isilon Systems, Inc. Systems and methods for using excitement values to predict future access to resources
CN100370732C (en) 2005-11-04 2008-02-20 华为技术有限公司 Charge metering method and system
JP4766240B2 (en) 2005-11-08 2011-09-07 日本電気株式会社 File management method, apparatus, and program
US7640231B2 (en) 2005-11-16 2009-12-29 International Business Machines Corporation Approach based on self-evolving models for performance guarantees in a shared storage system
JP4738144B2 (en) 2005-11-28 2011-08-03 株式会社日立製作所 Information monitoring method, system and program
EP1793606A1 (en) 2005-12-05 2007-06-06 Microsoft Corporation Distribution of keys for encryption/decryption
US7752173B1 (en) 2005-12-16 2010-07-06 Network Appliance, Inc. Method and apparatus for improving data processing system performance by reducing wasted disk writes
US7546321B2 (en) 2005-12-19 2009-06-09 Yahoo! Inc. System and method for recovery from failure of a storage server in a distributed column chunk data store
US7716180B2 (en) 2005-12-29 2010-05-11 Amazon Technologies, Inc. Distributed storage system with web services client interface
US7529780B1 (en) 2005-12-30 2009-05-05 Google Inc. Conflict management during data object synchronization between client and server
US7673116B2 (en) 2006-01-17 2010-03-02 Advanced Micro Devices, Inc. Input/output memory management unit that implements memory attributes based on translation data
US20140108797A1 (en) 2006-01-26 2014-04-17 Unisys Corporation Storage communities of interest using cryptographic splitting
US7421551B2 (en) 2006-02-03 2008-09-02 Emc Corporation Fast verification of computer backup data
US8015441B2 (en) 2006-02-03 2011-09-06 Emc Corporation Verification of computer backup data
US20070208918A1 (en) 2006-03-01 2007-09-06 Kenneth Harbin Method and apparatus for providing virtual machine backup
US7739422B2 (en) 2006-03-21 2010-06-15 International Business Machines Corporation Method to improve system DMA mapping while substantially reducing memory fragmentation
US7603529B1 (en) 2006-03-22 2009-10-13 Emc Corporation Methods, systems, and computer program products for mapped logical unit (MLU) replications, storage, and retrieval in a redundant array of inexpensive disks (RAID) environment
US7647525B2 (en) 2006-03-31 2010-01-12 Emc Corporation Resumption of operations following failover in connection with triangular asynchronous replication
US8832045B2 (en) 2006-04-07 2014-09-09 Data Storage Group, Inc. Data compression and storage techniques
US8214868B2 (en) 2006-04-21 2012-07-03 Agere Systems Inc. Flexible traffic management and shaping processing for multimedia distribution
JP2007310772A (en) 2006-05-22 2007-11-29 Hitachi Ltd Storage system and communication control method
GB0610335D0 (en) 2006-05-24 2006-07-05 Oxford Semiconductor Ltd Redundant storage of data on an array of storage devices
JP5048760B2 (en) 2006-05-24 2012-10-17 コンペレント・テクノロジーズ System and method for RAID management, reallocation, and restriping
US9009199B2 (en) 2006-06-06 2015-04-14 Haskolinn I Reykjavik Data mining using an index tree created by recursive projection of data points on random lines
US8037319B1 (en) 2006-06-30 2011-10-11 Symantec Operating Corporation System and method for securely storing cryptographic keys with encrypted data
US7987167B1 (en) 2006-08-04 2011-07-26 Netapp, Inc. Enabling a clustered namespace with redirection
US20080065639A1 (en) 2006-08-25 2008-03-13 Netfortis, Inc. String matching engine
JP4839164B2 (en) 2006-09-15 2011-12-21 株式会社日立製作所 Performance evaluation system using hardware monitor and reconfigurable computer system
US7562203B2 (en) 2006-09-27 2009-07-14 Network Appliance, Inc. Storage defragmentation based on modified physical address and unmodified logical address
US8447872B2 (en) 2006-11-01 2013-05-21 Intel Corporation Load balancing in a storage system
US8719844B2 (en) 2006-11-27 2014-05-06 Morgan Stanley Merging realtime data flows
US7624231B2 (en) 2006-11-29 2009-11-24 International Business Machines Corporation Map based striping of data in a distributed volatile memory environment
US7620669B1 (en) 2006-12-15 2009-11-17 Netapp, Inc. System and method for enhancing log performance
US7996609B2 (en) 2006-12-20 2011-08-09 International Business Machines Corporation System and method of dynamic allocation of non-volatile memory
EP2095231B1 (en) 2006-12-22 2016-07-20 Hewlett-Packard Enterprise Development LP Computer system and method of control thereof
US8489811B1 (en) 2006-12-29 2013-07-16 Netapp, Inc. System and method for addressing data containers using data set identifiers
US8655939B2 (en) 2007-01-05 2014-02-18 Digital Doors, Inc. Electromagnetic pulse (EMP) hardened information infrastructure with extractor, cloud dispersal, secure storage, content analysis and classification and method therefor
US7912437B2 (en) 2007-01-09 2011-03-22 Freescale Semiconductor, Inc. Radio frequency receiver having dynamic bandwidth control and method of operation
KR101338409B1 (en) 2007-01-25 2013-12-10 삼성전자주식회사 Method and node for generating distributed rivest shamir adleman signature in ad-hoc network
US8380880B2 (en) 2007-02-02 2013-02-19 The Mathworks, Inc. Scalable architecture
US20080201535A1 (en) 2007-02-21 2008-08-21 Hitachi, Ltd. Method and Apparatus for Provisioning Storage Volumes
US8266116B2 (en) 2007-03-12 2012-09-11 Broadcom Corporation Method and apparatus for dual-hashing tables
US8135900B2 (en) 2007-03-28 2012-03-13 Kabushiki Kaisha Toshiba Integrated memory management and memory management method
US9632870B2 (en) 2007-03-29 2017-04-25 Violin Memory, Inc. Memory system with multiple striping of raid groups and method for performing the same
US8510524B1 (en) 2007-03-29 2013-08-13 Netapp, Inc. File system capable of generating snapshots and providing fast sequential read access
JP4448866B2 (en) 2007-03-30 2010-04-14 日立ビアメカニクス株式会社 Drawing device
US8209587B1 (en) 2007-04-12 2012-06-26 Netapp, Inc. System and method for eliminating zeroing of disk drives in RAID arrays
US8824686B1 (en) 2007-04-27 2014-09-02 Netapp, Inc. Cluster key synchronization
US7975109B2 (en) 2007-05-30 2011-07-05 Schooner Information Technology, Inc. System including a fine-grained memory and a less-fine-grained memory
JP4316636B2 (en) 2007-06-06 2009-08-19 株式会社東芝 Content distribution / browsing system, content distribution apparatus, content browsing apparatus, and program
US8082390B1 (en) 2007-06-20 2011-12-20 Emc Corporation Techniques for representing and storing RAID group consistency information
US9298417B1 (en) 2007-07-25 2016-03-29 Emc Corporation Systems and methods for facilitating management of data
US8024525B2 (en) 2007-07-25 2011-09-20 Digi-Data Corporation Storage control unit with memory cache protection via recorded log
US9336387B2 (en) 2007-07-30 2016-05-10 Stroz Friedberg, Inc. System, method, and computer program product for detecting access to a memory device
US7856437B2 (en) 2007-07-31 2010-12-21 Hewlett-Packard Development Company, L.P. Storing nodes representing respective chunks of files in a data store
US7949693B1 (en) 2007-08-23 2011-05-24 Osr Open Systems Resources, Inc. Log-structured host data storage
CN101803269B (en) 2007-09-18 2013-01-09 兴和株式会社 Serial data communication system and serial data communication method
US8185614B2 (en) 2007-10-09 2012-05-22 Cleversafe, Inc. Systems, methods, and apparatus for identifying accessible dispersed digital storage vaults utilizing a centralized registry
US7809701B2 (en) 2007-10-15 2010-10-05 Telefonaktiebolaget Lm Ericsson (Publ) Method and system for performing exact match searches using multiple hash tables
US7996636B1 (en) 2007-11-06 2011-08-09 Netapp, Inc. Uniquely identifying block context signatures in a storage volume hierarchy
US8074019B2 (en) 2007-11-13 2011-12-06 Network Appliance, Inc. Preventing data loss in a storage system
US7814276B2 (en) 2007-11-20 2010-10-12 Solid State System Co., Ltd. Data cache architecture and cache algorithm used therein
US20090150537A1 (en) 2007-12-10 2009-06-11 John Fanson Data communication method for a set of hard-real time applications within a network
US9292567B2 (en) 2007-12-12 2016-03-22 Oracle International Corporation Bulk matching with update
US8583865B1 (en) 2007-12-21 2013-11-12 Emc Corporation Caching with flash-based memory
US7797279B1 (en) 2007-12-31 2010-09-14 Emc Corporation Merging of incremental data streams with prior backed-up data
US8099554B1 (en) 2007-12-31 2012-01-17 Emc Corporation System and method for flash-based data caching
US8805949B2 (en) 2008-01-16 2014-08-12 Netapp, Inc. System and method for populating a cache using behavioral adaptive policies
US8078918B2 (en) 2008-02-07 2011-12-13 Siliconsystems, Inc. Solid state storage subsystem that maintains and provides access to data reflective of a failure risk
WO2009102425A1 (en) 2008-02-12 2009-08-20 Netapp, Inc. Hybrid media storage system architecture
JP2009199199A (en) 2008-02-20 2009-09-03 Hitachi Ltd Storage system and its data write method
JP4489127B2 (en) 2008-02-29 2010-06-23 株式会社東芝 Semiconductor memory device
US7873619B1 (en) 2008-03-31 2011-01-18 Emc Corporation Managing metadata
US8855318B1 (en) 2008-04-02 2014-10-07 Cisco Technology, Inc. Master key generation and distribution for storage area network devices
TWI476610B (en) 2008-04-29 2015-03-11 Maxiscale Inc Peer-to-peer redundant file server system and methods
US7971013B2 (en) 2008-04-30 2011-06-28 Xiotech Corporation Compensating for write speed differences between mirroring storage devices by striping
KR20090120159A (en) 2008-05-19 2009-11-24 삼성전자주식회사 Apparatus and method for combining images
US9547589B2 (en) 2008-06-18 2017-01-17 Super Talent Technology, Corp. Endurance translation layer (ETL) and diversion of temp files for reduced flash wear of a super-endurance solid-state drive
US8762654B1 (en) 2008-07-02 2014-06-24 Marvell International Ltd. Selectively scheduling memory accesses in parallel based on access speeds of memory
US8214404B2 (en) 2008-07-11 2012-07-03 Avere Systems, Inc. Media aware distributed data layout
US7979671B2 (en) 2008-07-28 2011-07-12 CacheIQ, Inc. Dual hash indexing system and methodology
US8250310B2 (en) 2008-07-31 2012-08-21 International Business Machines Corporation Assigning data to NVRAM of shared access hybrid hard drives
US8086799B2 (en) 2008-08-12 2011-12-27 Netapp, Inc. Scalable deduplication of stored data
US7987214B2 (en) 2008-08-29 2011-07-26 Tatu Ylonen Oy Determining the address range of a subtree of a linearized tree
US8127182B2 (en) 2008-09-16 2012-02-28 Lsi Corporation Storage utilization to improve reliability using impending failure triggers
US9098519B2 (en) 2008-09-16 2015-08-04 File System Labs Llc Methods and apparatus for distributed data storage
EP2350875A1 (en) 2008-09-19 2011-08-03 Oracle International Corporation Storage-side storage request management
US8321834B2 (en) 2008-09-25 2012-11-27 International Business Machines Corporation Framework for automatically merging customizations to structured code that has been refactored
US8799571B1 (en) 2008-09-26 2014-08-05 Emc Corporation System and method for configuring a device array upon detecting addition of a storage device
US9330172B2 (en) 2008-09-29 2016-05-03 Echostar Technologies Llc Audio/video archiving system and method
US8086585B1 (en) 2008-09-30 2011-12-27 Emc Corporation Access control to block storage devices for a shared disk based file system
US20100088296A1 (en) 2008-10-03 2010-04-08 Netapp, Inc. System and method for organizing data to facilitate data deduplication
CN102272731A (en) 2008-11-10 2011-12-07 弗森-艾奥公司 Apparatus, system, and method for predicting failures in solid-state storage
US9015209B2 (en) 2008-12-16 2015-04-21 Sandisk Il Ltd. Download management of discardable files
JP4766498B2 (en) 2008-12-24 2011-09-07 株式会社ソニー・コンピュータエンタテインメント Method and apparatus for providing user level DMA and memory access management
US8250116B2 (en) 2008-12-31 2012-08-21 Unisys Corporation KStore data simulator directives and values processor process and files
US8495417B2 (en) 2009-01-09 2013-07-23 Netapp, Inc. System and method for redundancy-protected aggregates
TWI432959B (en) 2009-01-23 2014-04-01 Infortrend Technology Inc Storage subsystem and storage system architecture performing storage virtualization and method thereof
US8407436B2 (en) 2009-02-11 2013-03-26 Hitachi, Ltd. Methods and apparatus for migrating thin provisioning volumes between storage systems
JP5376983B2 (en) 2009-02-12 2013-12-25 株式会社東芝 Memory system
US8520855B1 (en) 2009-03-05 2013-08-27 University Of Washington Encapsulation and decapsulation for data disintegration
US8862066B2 (en) 2009-03-25 2014-10-14 Nec Corporation Providing suppressing degradation of throughput of communication device, recording medium for control program of communication device, communication system and communication method
US8205065B2 (en) 2009-03-30 2012-06-19 Exar Corporation System and method for data deduplication
US8271615B2 (en) 2009-03-31 2012-09-18 Cloud Connex, Llc Centrally managing and monitoring software as a service (SaaS) applications
US9940138B2 (en) 2009-04-08 2018-04-10 Intel Corporation Utilization of register checkpointing mechanism with pointer swapping to resolve multithreading mis-speculations
US8996468B1 (en) 2009-04-17 2015-03-31 Dell Software Inc. Block status mapping system for reducing virtual machine backup storage
US8560879B1 (en) 2009-04-22 2013-10-15 Netapp Inc. Data recovery for failed memory device of memory device array
US8156290B1 (en) 2009-04-23 2012-04-10 Network Appliance, Inc. Just-in-time continuous segment cleaning
US8402069B2 (en) 2009-05-04 2013-03-19 Microsoft Corporation Use of delete notifications by file systems and applications to release storage space
US8166233B2 (en) 2009-07-24 2012-04-24 Lsi Corporation Garbage collection for solid state disks
US20100293147A1 (en) 2009-05-12 2010-11-18 Harvey Snow System and method for providing automated electronic information backup, storage and recovery
JP2010282281A (en) 2009-06-02 2010-12-16 Hitachi Ltd Disk array device, control method therefor, and program
US8478799B2 (en) 2009-06-26 2013-07-02 Simplivity Corporation Namespace file system accessing an object store
US8219562B1 (en) 2009-06-29 2012-07-10 Facebook, Inc. Efficient storage and retrieval for large number of data objects
US9377960B2 (en) 2009-07-29 2016-06-28 Hgst Technologies Santa Ana, Inc. System and method of using stripes for recovering data in a flash storage system
US8255620B2 (en) 2009-08-11 2012-08-28 Texas Memory Systems, Inc. Secure Flash-based memory system with fast wipe feature
US7818525B1 (en) 2009-08-12 2010-10-19 Texas Memory Systems, Inc. Efficient reduction of read disturb errors in NAND FLASH memory
WO2011031796A2 (en) 2009-09-08 2011-03-17 Fusion-Io, Inc. Apparatus, system, and method for caching data on a solid-state storage device
US9280609B2 (en) 2009-09-08 2016-03-08 Brocade Communications Systems, Inc. Exact match lookup scheme
US20120166749A1 (en) 2009-09-08 2012-06-28 International Business Machines Corporation Data management in solid-state storage devices and tiered storage systems
US8601222B2 (en) 2010-05-13 2013-12-03 Fusion-Io, Inc. Apparatus, system, and method for conditional and atomic storage operations
US9122579B2 (en) 2010-01-06 2015-09-01 Intelligent Intellectual Property Holdings 2 Llc Apparatus, system, and method for a storage layer
US8712972B2 (en) 2009-09-22 2014-04-29 Sybase, Inc. Query optimization with awareness of limited resource usage
US8266501B2 (en) 2009-09-29 2012-09-11 Micron Technology, Inc. Stripe based memory operation
JP5170055B2 (en) 2009-10-09 2013-03-27 富士通株式会社 Processing method, storage system, information processing apparatus, and program
US8285955B2 (en) 2009-10-16 2012-10-09 Lenovo (Singapore) Pte. Ltd. Method and apparatus for automatic solid state drive performance recovery
US8285956B2 (en) 2009-10-22 2012-10-09 Symantec Corporation Efficient logging for asynchronously replicating volume groups
US7954021B2 (en) 2009-10-23 2011-05-31 International Business Machines Corporation Solid state drive with flash sparing
US8321648B2 (en) 2009-10-26 2012-11-27 Netapp, Inc Use of similarity hash to route data for improved deduplication in a storage server cluster
US8918897B2 (en) 2009-11-24 2014-12-23 Cleversafe, Inc. Dispersed storage network data slice integrity verification
US8417987B1 (en) 2009-12-01 2013-04-09 Netapp, Inc. Mechanism for correcting errors beyond the fault tolerant level of a raid array in a storage system
US8156306B1 (en) 2009-12-18 2012-04-10 Emc Corporation Systems and methods for using thin provisioning to reclaim space identified by data reduction processes
US8140821B1 (en) 2009-12-18 2012-03-20 Emc Corporation Efficient read/write algorithms and associated mapping for block-level data reduction processes
US8090977B2 (en) 2009-12-21 2012-01-03 Intel Corporation Performing redundant memory hopping
US8560598B2 (en) 2009-12-22 2013-10-15 At&T Intellectual Property I, L.P. Integrated adaptive anycast for content distribution
US8086896B2 (en) 2009-12-28 2011-12-27 International Business Machines Corporation Dynamically tracking virtual logical storage units
US8468368B2 (en) 2009-12-29 2013-06-18 Cleversafe, Inc. Data encryption parameter dispersal
US8555022B1 (en) 2010-01-06 2013-10-08 Netapp, Inc. Assimilation of foreign LUNS into a network storage system
US9058119B1 (en) 2010-01-11 2015-06-16 Netapp, Inc. Efficient data migration
WO2011090500A1 (en) 2010-01-19 2011-07-28 Rether Networks Inc. Random write optimization techniques for flash disks
US20110191522A1 (en) 2010-02-02 2011-08-04 Condict Michael N Managing Metadata and Page Replacement in a Persistent Cache in Flash Memory
US8244978B2 (en) 2010-02-17 2012-08-14 Advanced Micro Devices, Inc. IOMMU architected TLB support
US9311184B2 (en) 2010-02-27 2016-04-12 Cleversafe, Inc. Storing raid data as encoded data slices in a dispersed storage network
US8671265B2 (en) 2010-03-05 2014-03-11 Solidfire, Inc. Distributed data storage system providing de-duplication of data using block identifiers
US8341457B2 (en) 2010-03-11 2012-12-25 Lsi Corporation System and method for optimizing redundancy restoration in distributed data layout environments
US20110238857A1 (en) 2010-03-29 2011-09-29 Amazon Technologies, Inc. Committed processing rates for shared resources
US8700949B2 (en) 2010-03-30 2014-04-15 International Business Machines Corporation Reliability scheme using hybrid SSD/HDD replication with log structured management
US8856593B2 (en) 2010-04-12 2014-10-07 Sandisk Enterprise Ip Llc Failure recovery using consensus replication in a distributed flash memory system
US8463825B1 (en) 2010-04-27 2013-06-11 Tintri Inc. Hybrid file system for virtual machine storage
US20110283048A1 (en) 2010-05-11 2011-11-17 Seagate Technology Llc Structured mapping system for a memory device
US8224935B1 (en) 2010-05-12 2012-07-17 Symantec Corporation Systems and methods for efficiently synchronizing configuration data within distributed computing systems
US8621580B2 (en) 2010-05-19 2013-12-31 Cleversafe, Inc. Retrieving access information in a dispersed storage network
US9355109B2 (en) 2010-06-11 2016-05-31 The Research Foundation For The State University Of New York Multi-tier caching
US8621269B2 (en) 2010-06-22 2013-12-31 Cleversafe, Inc. Identifying a slice name information error in a dispersed storage network
US8327103B1 (en) 2010-06-28 2012-12-04 Emc Corporation Scheduling data relocation activities using configurable fairness criteria
US20120011176A1 (en) 2010-07-07 2012-01-12 Nexenta Systems, Inc. Location independent scalable file and block storage
US10162722B2 (en) 2010-07-15 2018-12-25 Veritas Technologies Llc Virtual machine aware replication method and system
WO2012025974A1 (en) 2010-08-23 2012-03-01 富士通株式会社 Data storage device and control method for data storage device
US8837281B2 (en) 2010-09-10 2014-09-16 Futurewei Technologies, Inc. Use of partitions to reduce flooding and filtering database size requirements in large layer two networks
US8589625B2 (en) 2010-09-15 2013-11-19 Pure Storage, Inc. Scheduling of reconstructive I/O read operations in a storage environment
US8732426B2 (en) 2010-09-15 2014-05-20 Pure Storage, Inc. Scheduling of reactive I/O operations in a storage environment
JP5388976B2 (en) 2010-09-22 2014-01-15 株式会社東芝 Semiconductor memory control device
US8463991B2 (en) 2010-09-28 2013-06-11 Pure Storage Inc. Intra-device data protection in a raid array
US8775868B2 (en) 2010-09-28 2014-07-08 Pure Storage, Inc. Adaptive RAID for an SSD environment
US9348696B2 (en) 2010-10-01 2016-05-24 Pure Storage, Inc. Distributed multi-level protection in a raid array based storage system
US9104326B2 (en) 2010-11-15 2015-08-11 Emc Corporation Scalable block data storage using content addressing
US8706701B1 (en) 2010-11-18 2014-04-22 Emc Corporation Scalable cloud file system with efficient integrity checks
US8583599B2 (en) 2010-11-29 2013-11-12 Ca, Inc. Reducing data duplication in cloud storage
US8880554B2 (en) 2010-12-03 2014-11-04 Futurewei Technologies, Inc. Method and apparatus for high performance, updatable, and deterministic hash table for network equipment
KR101638436B1 (en) 2010-12-10 2016-07-12 한국전자통신연구원 Cloud storage and management method thereof
US8271462B2 (en) 2010-12-10 2012-09-18 Inventec Corporation Method for creating a index of the data blocks
US9208071B2 (en) 2010-12-13 2015-12-08 SanDisk Technologies, Inc. Apparatus, system, and method for accessing memory
EP2652623B1 (en) 2010-12-13 2018-08-01 SanDisk Technologies LLC Apparatus, system, and method for auto-commit memory
US10817421B2 (en) 2010-12-13 2020-10-27 Sandisk Technologies Llc Persistent data structures
US10817502B2 (en) 2010-12-13 2020-10-27 Sandisk Technologies Llc Persistent memory management
US8595595B1 (en) 2010-12-27 2013-11-26 Netapp, Inc. Identifying lost write errors in a raid array
US8924354B2 (en) 2011-02-01 2014-12-30 Ca, Inc. Block level data replication
CN102651009B (en) 2011-02-28 2014-09-24 国际商业机器公司 Method and equipment for retrieving data in storage system
WO2012124100A1 (en) 2011-03-17 2012-09-20 富士通株式会社 Information processing device, storage system and write control method
US8966191B2 (en) 2011-03-18 2015-02-24 Fusion-Io, Inc. Logical interface for contextual storage
US9563555B2 (en) 2011-03-18 2017-02-07 Sandisk Technologies Llc Systems and methods for storage allocation
KR101717081B1 (en) 2011-03-23 2017-03-28 삼성전자주식회사 Storage device comprising a buffer memory by using a nonvolatile-ram and volatile-ram
US8538029B2 (en) 2011-03-24 2013-09-17 Hewlett-Packard Development Company, L.P. Encryption key fragment distribution
JP4996757B1 (en) 2011-03-29 2012-08-08 株式会社東芝 Secret sharing system, apparatus and program
US8539008B2 (en) 2011-04-29 2013-09-17 Netapp, Inc. Extent-based storage architecture
US8996790B1 (en) 2011-05-12 2015-03-31 Densbits Technologies Ltd. System and method for flash memory management
US8806122B2 (en) 2011-05-23 2014-08-12 International Business Machines Corporation Caching data in a storage system having multiple caches including non-volatile storage cache in a sequential access storage device
US8949568B2 (en) 2011-05-24 2015-02-03 Agency For Science, Technology And Research Memory storage device, and a related zone-based block management and mapping method
US8990536B2 (en) 2011-06-01 2015-03-24 Schneider Electric It Corporation Systems and methods for journaling and executing device control instructions
US8782439B2 (en) 2011-06-06 2014-07-15 Cleversafe, Inc. Securing a data segment for storage
US8838895B2 (en) 2011-06-09 2014-09-16 21Vianet Group, Inc. Solid-state disk caching the top-K hard-disk blocks selected as a function of access frequency and a logarithmic system time
US9383928B2 (en) 2011-06-13 2016-07-05 Emc Corporation Replication techniques with content addressable storage
US9639591B2 (en) 2011-06-13 2017-05-02 EMC IP Holding Company LLC Low latency replication techniques with content addressable storage
US20120317084A1 (en) 2011-06-13 2012-12-13 Beijing Z&W Technology Consulting Co., Ltd. Method and system for achieving data de-duplication on a block-level storage virtualization device
US9292530B2 (en) 2011-06-14 2016-03-22 Netapp, Inc. Object-level identification of duplicate data in a storage system
US8600949B2 (en) 2011-06-21 2013-12-03 Netapp, Inc. Deduplication in an extent-based architecture
US8261085B1 (en) 2011-06-22 2012-09-04 Media Patents, S.L. Methods, apparatus and systems to improve security in computer systems
US8572164B2 (en) 2011-06-30 2013-10-29 Hitachi, Ltd. Server system and method for controlling information system
US8917872B2 (en) 2011-07-06 2014-12-23 Hewlett-Packard Development Company, L.P. Encryption key storage with key fragment stores
US20130019057A1 (en) 2011-07-15 2013-01-17 Violin Memory, Inc. Flash disk array and controller
US8806160B2 (en) 2011-08-16 2014-08-12 Pure Storage, Inc. Mapping in a storage system
US8788788B2 (en) 2011-08-11 2014-07-22 Pure Storage, Inc. Logical sector mapping in a flash storage array
US8527544B1 (en) 2011-08-11 2013-09-03 Pure Storage Inc. Garbage collection in a storage system
US8930307B2 (en) 2011-09-30 2015-01-06 Pure Storage, Inc. Method for removing duplicate data from a storage array
KR20130027253A (en) 2011-09-07 2013-03-15 삼성전자주식회사 Method for compressing data
US10223375B2 (en) 2011-09-20 2019-03-05 Netapp, Inc. Handling data extent size asymmetry during logical replication in a storage system
US10311027B2 (en) 2011-09-23 2019-06-04 Open Invention Network, Llc System for live-migration and automated recovery of applications in a distributed system
US20130080679A1 (en) 2011-09-26 2013-03-28 Lsi Corporation System and method for optimizing thermal management for a storage controller cache
US8943032B1 (en) 2011-09-30 2015-01-27 Emc Corporation System and method for data migration using hybrid modes
US8751657B2 (en) 2011-10-04 2014-06-10 Hitachi, Ltd. Multi-client storage system and storage system management method
US8949197B2 (en) 2011-10-31 2015-02-03 Oracle International Corporation Virtual full backups
US9009449B2 (en) 2011-11-10 2015-04-14 Oracle International Corporation Reducing power consumption and resource utilization during miss lookahead
US8990495B2 (en) 2011-11-15 2015-03-24 Emc Corporation Method and system for storing data in raid memory devices
CN102364474B (en) 2011-11-17 2014-08-20 中国科学院计算技术研究所 Metadata storage system for cluster file system and metadata management method
US9203625B2 (en) 2011-11-28 2015-12-01 Cleversafe, Inc. Transferring encoded data slices in a distributed storage network
US20130138615A1 (en) 2011-11-29 2013-05-30 International Business Machines Corporation Synchronizing updates across cluster filesystems
US9274838B2 (en) 2011-12-22 2016-03-01 Netapp, Inc. Dynamic instantiation and management of virtual caching appliances
US9838269B2 (en) 2011-12-27 2017-12-05 Netapp, Inc. Proportional quality of service based on client usage and system metrics
US9054992B2 (en) 2011-12-27 2015-06-09 Solidfire, Inc. Quality of service policy sets
US9003021B2 (en) 2011-12-27 2015-04-07 Solidfire, Inc. Management of storage system access based on client performance and cluser health
US8799705B2 (en) 2012-01-04 2014-08-05 Emc Corporation Data protection in a random access disk array
US9223607B2 (en) 2012-01-17 2015-12-29 Microsoft Technology Licensing, Llc System for replicating or migrating virtual machine operations log by throttling guest write iOS based on destination throughput
US20150019792A1 (en) 2012-01-23 2015-01-15 The Regents Of The University Of California System and method for implementing transactions using storage device support for atomic updates and flexible interface for managing data logging
US8972568B2 (en) 2012-02-21 2015-03-03 Telefonaktiebolaget L M Ericsson (Publ) Quantifying user quality of experience by passive monitoring
US9165005B2 (en) 2012-02-24 2015-10-20 Simplivity Corporation Method and apparatus utilizing non-uniform hash functions for placing records in non-uniform access memory
EP2738679A1 (en) 2012-02-24 2014-06-04 Hitachi, Ltd. Computer program and management computer
US20130238832A1 (en) 2012-03-07 2013-09-12 Netapp, Inc. Deduplicating hybrid storage aggregate
US9417811B2 (en) 2012-03-07 2016-08-16 International Business Machines Corporation Efficient inline data de-duplication on a storage system
US8732403B1 (en) 2012-03-14 2014-05-20 Netapp, Inc. Deduplication of data blocks on storage devices
US8943282B1 (en) 2012-03-29 2015-01-27 Emc Corporation Managing snapshots in cache-based storage systems
US8688652B2 (en) 2012-04-05 2014-04-01 International Business Machines Corporation Increased in-line deduplication efficiency
US9075710B2 (en) 2012-04-17 2015-07-07 SanDisk Technologies, Inc. Non-volatile key-value store
US9141290B2 (en) 2012-05-13 2015-09-22 Emc Corporation Snapshot mechanism
WO2013171794A1 (en) 2012-05-17 2013-11-21 Hitachi, Ltd. Method of data migration and information storage system
US9003162B2 (en) 2012-06-20 2015-04-07 Microsoft Technology Licensing, Llc Structuring storage based on latch-free B-trees
US20130346700A1 (en) 2012-06-21 2013-12-26 Alexander I. Tomlinson Systems and methods for managing memory
WO2014002136A1 (en) 2012-06-26 2014-01-03 Hitachi, Ltd. Storage system and method of controlling the same
CN103514064B (en) 2012-06-28 2016-03-16 国际商业机器公司 The method and apparatus of record backup information
US8904231B2 (en) 2012-08-08 2014-12-02 Netapp, Inc. Synchronous local and cross-site failover in clustered storage systems
US8903876B2 (en) 2012-08-15 2014-12-02 Facebook, Inc. File storage system based on coordinated exhaustible and non-exhaustible storage
US8922928B2 (en) 2012-09-20 2014-12-30 Dell Products L.P. Method and system for preventing unreliable data operations at cold temperatures
US9009402B2 (en) 2012-09-20 2015-04-14 Emc Corporation Content addressable storage in legacy systems
US9318154B2 (en) 2012-09-20 2016-04-19 Dell Products L.P. Method and system for preventing unreliable data operations at cold temperatures
US8745415B2 (en) 2012-09-26 2014-06-03 Pure Storage, Inc. Multi-drive cooperation to generate an encryption key
US9146684B2 (en) 2012-09-28 2015-09-29 Netapp, Inc. Storage architecture for server flash and storage array operation
KR102007650B1 (en) 2012-10-05 2019-10-23 삼성전자 주식회사 Segment group considering segment cleaning apparatus and method thereof
US8930778B2 (en) 2012-11-15 2015-01-06 Seagate Technology Llc Read disturb effect determination
US9342243B2 (en) 2012-11-28 2016-05-17 Lenovo (Beijing) Co., Ltd. Method and electronic apparatus for implementing multi-operating system
US9251201B2 (en) 2012-12-14 2016-02-02 Microsoft Technology Licensing, Llc Compatibly extending offload token size
US9331936B2 (en) 2012-12-30 2016-05-03 Mellanox Technologies Ltd. Switch fabric support for overlay network features
US9459856B2 (en) 2013-01-02 2016-10-04 International Business Machines Corporation Effective migration and upgrade of virtual machines in cloud environments
US9495288B2 (en) 2013-01-22 2016-11-15 Seagate Technology Llc Variable-size flash translation layer
US9652376B2 (en) 2013-01-28 2017-05-16 Radian Memory Systems, Inc. Cooperative flash memory control
US20140215170A1 (en) 2013-01-31 2014-07-31 Futurewei Technologies, Inc. Block Compression in a Key/Value Store
US9852055B2 (en) 2013-02-25 2017-12-26 International Business Machines Corporation Multi-level memory compression
US9792120B2 (en) 2013-03-05 2017-10-17 International Business Machines Corporation Anticipated prefetching for a parent core in a multi-core chip
US8751763B1 (en) 2013-03-13 2014-06-10 Nimbus Data Systems, Inc. Low-overhead deduplication within a block-based data storage
US10706009B2 (en) 2013-03-14 2020-07-07 Oracle International Corporation Techniques to parallelize CPU and IO work of log writes
US9195939B1 (en) 2013-03-15 2015-11-24 Cavium, Inc. Scope in decision trees
US9672237B2 (en) 2013-03-15 2017-06-06 Amazon Technologies, Inc. System-wide checkpoint avoidance for distributed database systems
US9460024B2 (en) 2013-03-15 2016-10-04 Vmware, Inc. Latency reduction for direct memory access operations involving address translation
US20150095555A1 (en) 2013-09-27 2015-04-02 Avalanche Technology, Inc. Method of thin provisioning in a solid state disk array
CN104246722B (en) 2013-03-29 2017-02-22 株式会社东芝 Storage system for eliminating data duplication on basis of hash table, storage controller, and method
US9519695B2 (en) 2013-04-16 2016-12-13 Cognizant Technology Solutions India Pvt. Ltd. System and method for automating data warehousing processes
US9213633B2 (en) 2013-04-30 2015-12-15 Seagate Technology Llc Flash translation layer with lower write amplification
US10311028B2 (en) 2013-05-16 2019-06-04 Oracle International Corporation Method and apparatus for replication size estimation and progress monitoring
GB2528585A (en) 2013-05-17 2016-01-27 Hitachi Ltd Storage device
WO2014209984A1 (en) 2013-06-25 2014-12-31 Marvell World Trade Ltd. Adaptive cache memory controller
US9411764B2 (en) 2013-07-23 2016-08-09 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Optimized redundant high availability SAS topology
US9582198B2 (en) 2013-08-26 2017-02-28 Vmware, Inc. Compressed block map of densely-populated data structures
US9311331B2 (en) 2013-08-27 2016-04-12 Netapp, Inc. Detecting out-of-band (OOB) changes when replicating a source file system using an in-line system
US9268502B2 (en) 2013-09-16 2016-02-23 Netapp, Inc. Dense tree volume metadata organization
EP2852097B1 (en) 2013-09-20 2016-08-10 CoScale NV Efficient data center monitoring
US9430383B2 (en) 2013-09-20 2016-08-30 Oracle International Corporation Fast data initialization
US9385959B2 (en) 2013-09-26 2016-07-05 Acelio, Inc. System and method for improving TCP performance in virtualized environments
US9405783B2 (en) 2013-10-02 2016-08-02 Netapp, Inc. Extent hashing technique for distributed storage architecture
US9372757B2 (en) 2013-10-18 2016-06-21 Netapp, Inc. Incremental block level backup
US10503716B2 (en) 2013-10-31 2019-12-10 Oracle International Corporation Systems and methods for generating bit matrices for hash functions using fast filtering
US9400745B2 (en) 2013-11-06 2016-07-26 International Business Machines Corporation Physical address management in solid state memory
CN105009099B (en) 2013-11-07 2018-02-06 株式会社日立制作所 Computer system and data control method
US10073630B2 (en) 2013-11-08 2018-09-11 Sandisk Technologies Llc Systems and methods for log coordination
US9152684B2 (en) 2013-11-12 2015-10-06 Netapp, Inc. Snapshots and clones of volumes in a storage system
US9201918B2 (en) 2013-11-19 2015-12-01 Netapp, Inc. Dense tree volume metadata update logging and checkpointing
US9170746B2 (en) 2014-01-07 2015-10-27 Netapp, Inc. Clustered raid assimilation management
US9448924B2 (en) 2014-01-08 2016-09-20 Netapp, Inc. Flash optimized, log-structured layer of a file system
US9152330B2 (en) 2014-01-09 2015-10-06 Netapp, Inc. NVRAM data organization using self-describing entities for predictable recovery after power-loss
US9454434B2 (en) 2014-01-17 2016-09-27 Netapp, Inc. File system driven raid rebuild technique
US9268653B2 (en) 2014-01-17 2016-02-23 Netapp, Inc. Extent metadata update logging and checkpointing
US9256549B2 (en) 2014-01-17 2016-02-09 Netapp, Inc. Set-associative hash table organization for efficient storage and retrieval of data in a storage system
US20150244795A1 (en) 2014-02-21 2015-08-27 Solidfire, Inc. Data syncing in a distributed system
US20150261446A1 (en) 2014-03-12 2015-09-17 Futurewei Technologies, Inc. Ddr4-onfi ssd 1-to-n bus adaptation and expansion controller
CN104934066B (en) 2014-03-19 2018-03-27 安华高科技通用Ip(新加坡)公司 Reading interference processing in nand flash memory
US9274713B2 (en) 2014-04-03 2016-03-01 Avago Technologies General Ip (Singapore) Pte. Ltd. Device driver, method and computer-readable medium for dynamically configuring a storage controller based on RAID type, data alignment with a characteristic of storage elements and queue depth in a cache
US9471428B2 (en) 2014-05-06 2016-10-18 International Business Machines Corporation Using spare capacity in solid state drives
US9891993B2 (en) 2014-05-23 2018-02-13 International Business Machines Corporation Managing raid parity stripe contention
US8850108B1 (en) 2014-06-04 2014-09-30 Pure Storage, Inc. Storage cluster
US9501359B2 (en) 2014-09-10 2016-11-22 Netapp, Inc. Reconstruction of dense tree volume metadata state across crash recovery
US20160070644A1 (en) 2014-09-10 2016-03-10 Netapp, Inc. Offset range operation striping to improve concurrency of execution and reduce contention among resources
US9524103B2 (en) 2014-09-10 2016-12-20 Netapp, Inc. Technique for quantifying logical space trapped in an extent store
US20160070714A1 (en) 2014-09-10 2016-03-10 Netapp, Inc. Low-overhead restartable merge operation with efficient crash recovery
US20160077744A1 (en) 2014-09-11 2016-03-17 Netapp, Inc. Deferred reference count update technique for low overhead volume metadata
US9836229B2 (en) 2014-11-18 2017-12-05 Netapp, Inc. N-way merge technique for updating volume metadata in a storage I/O stack
US9619158B2 (en) 2014-12-17 2017-04-11 International Business Machines Corporation Two-level hierarchical log structured array architecture with minimized write amplification
US10216966B2 (en) 2015-02-25 2019-02-26 Netapp, Inc. Perturb key technique

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1936818A (en) * 2005-09-22 2007-03-28 株式会社日立制作所 Storage control apparatus, data management system and data management method
CN103399796A (en) * 2006-12-12 2013-11-20 Lsi公司 Balancing of clustered virtual machines using storage load information
CN103534996A (en) * 2012-11-29 2014-01-22 华为技术有限公司 Method and device for implementing load balance
US20140181370A1 (en) * 2012-12-21 2014-06-26 Lsi Corporation Method to apply fine grain wear leveling and garbage collection
US8832363B1 (en) * 2014-01-17 2014-09-09 Netapp, Inc. Clustered RAID data organization

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832008A (en) * 2017-10-25 2018-03-23 记忆科技(深圳)有限公司 Method for improving SSD write performance consistency
CN112506429A (en) * 2020-11-30 2021-03-16 杭州海康威视系统技术有限公司 Deletion processing method, apparatus, device, and storage medium

Also Published As

Publication number Publication date
US20170235673A1 (en) 2017-08-17
WO2016040233A1 (en) 2016-03-17
US10210082B2 (en) 2019-02-19
EP3191932A1 (en) 2017-07-19
US20160077745A1 (en) 2016-03-17
US9671960B2 (en) 2017-06-06

Similar Documents

Publication Publication Date Title
CN107077300A (en) For balancing segmentation removing and the rate-matched technology of I/O workloads
US9720601B2 (en) Load balancing technique for a storage array
TWI709073B (en) Distributed storage system, distributed storage method and distributed facility
CN106662981B (en) Storage device, program, and information processing method
US10459657B2 (en) Storage system with read cache-on-write buffer
US11372544B2 (en) Write type based crediting for block level write throttling to control impact to read input/output operations
CN106687910A (en) Optimized segment cleaning technique
US10042853B2 (en) Flash optimized, log-structured layer of a file system
CN104272244B (en) System and method for scheduling processing to achieve space savings
US9984004B1 (en) Dynamic cache balancing
US20160070644A1 (en) Offset range operation striping to improve concurrency of execution and reduce contention among resources
US10649668B2 (en) Systems, methods, and computer program products providing read access in a storage system
CN111587423A (en) Hierarchical data policy for distributed storage systems
CN105612490A (en) Extent hashing technique for distributed storage architecture
CN111587418A (en) Directory structure for distributed storage system
US10621057B2 (en) Intelligent redundant array of independent disks with resilvering beyond bandwidth of a single drive
US11188229B2 (en) Adaptive storage reclamation
Xuan et al. Accelerating big data analytics on HPC clusters using two-level storage
Zhou et al. Hierarchical consistent hashing for heterogeneous object-based storage
US10572464B2 (en) Predictable allocation latency in fragmented log structured file systems
Cha et al. Analysis of i/o performance for optimizing software defined storage in cloud integration
Liu et al. masfs: File system based on memory and ssd in compute nodes for high performance computers
Rumyantsev et al. Latency/wearout in a flash-based storage system with replication on write
US11494303B1 (en) Data storage system with adaptive, memory-efficient cache flushing structure
WO2014168603A1 (en) System for increasing utilization of storage media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (Application publication date: 20170818)