CN106155574B - Method and device for constructing expandable storage device and expanded storage device - Google Patents



Publication number
CN106155574B
Authority
CN
China
Prior art keywords
modular building
input
cluster
building blocks
output processing
Prior art date
Legal status
Active
Application number
CN201510184340.9A
Other languages
Chinese (zh)
Other versions
CN106155574A (en)
Inventor
刘辉
曹逾
高雯雯
郭小燕
狄杰明
Current Assignee
EMC Corp
Original Assignee
EMC IP Holding Co LLC
Priority date
Filing date
Publication date
Application filed by EMC IP Holding Co LLC filed Critical EMC IP Holding Co LLC
Priority to CN201510184340.9A
Publication of CN106155574A
Application granted
Publication of CN106155574B
Status: Active

Abstract

Embodiments of the disclosure relate to a method and an apparatus for constructing a scalable storage device, and to the storage device so constructed. The method includes building the scalable storage device by combining a plurality of modular building blocks, wherein each modular building block comprises one or more disk enclosures, and at least one modular building block of the plurality comprises a storage processor containing an input/output processing unit; forming a cluster from the input/output processing units in the at least one modular building block; and processing input/output (I/O) requests from hosts, as well as metadata services, with the cluster. Embodiments of the present disclosure also provide corresponding computer program products.

Description

Method and device for constructing expandable storage device and expanded storage device
Technical Field
Embodiments of the present disclosure relate to the field of storage, and more particularly, to a method, apparatus, and computer program product for building scalable storage devices, and storage devices built according to the method.
Background
Conventional storage devices, such as storage arrays, are typically constructed in a relatively limited manner with regard to expandability, and are therefore typically designed only for predefined optimal usage scenarios. Such construction methods lack flexibility. For example, different designs may be required for different usage scenarios, and thus a user may need to purchase different products for different usage scenarios. This is not conducive to the user reusing the existing storage resources, increasing the cost to the user. For manufacturers, constructing a specific storage product for a specific usage scenario also limits the usage range of the product, increasing the development cost of the product. Furthermore, most conventional storage devices are based on proprietary hardware designs, which further limits the flexibility of storage product construction.
Based on the above problems, embodiments of the present disclosure propose methods and apparatuses for building scalable storage devices.
Disclosure of Invention
To address at least some of the issues identified above, embodiments of the present disclosure introduce a method and apparatus for building a highly scalable storage system from modular building blocks, and propose a new I/O processing flow to implement a scalable, decentralized, high-performance system.
According to a first aspect of the present disclosure, there is provided a method for building a scalable storage device, comprising: building the scalable storage device by combining a plurality of modular building blocks, wherein each modular building block of the plurality of modular building blocks comprises a disk enclosure, and at least one modular building block of the plurality of modular building blocks comprises a storage processor comprising an input/output processing unit; forming a cluster with the input/output processing units in the at least one modular building block; and processing input/output (I/O) requests from hosts, as well as metadata services, with the cluster.
In one embodiment, only a first modular building block of the plurality of modular building blocks includes the storage processor; and wherein constructing the scalable storage device by combining a plurality of modular building blocks comprises: constructing the scalable storage device by connecting the first modular building block with each other modular building block of the plurality of modular building blocks.
In another embodiment, each modular building block of the plurality of modular building blocks comprises the storage processor; and wherein constructing the scalable storage device by combining a plurality of modular building blocks comprises: the scalable storage device is constructed by interconnecting each of the plurality of modular building blocks.
In yet another embodiment, the plurality of modular building blocks includes a first set of modular building blocks and a second set of modular building blocks, and only the first set of modular building blocks includes the storage processor; and wherein constructing the scalable storage device by combining a plurality of modular building blocks comprises: the scalable storage device is constructed by interconnecting each modular building block of the first set of modular building blocks and connecting each modular building block of the first set of modular building blocks with one or more modular building blocks of the second set of modular building blocks.
In one embodiment, forming a cluster with the input/output processing units in the at least one modular building block may further comprise: selecting one input/output processing unit in the cluster as the head of the cluster, wherein the head of the cluster services metadata update requests; and each input/output processing unit in the cluster has the capability to provide the metadata service and the data service.
In another embodiment, forming a cluster with the input/output processing units in the at least one modular building block may further comprise: when the head of the cluster fails, selecting another input/output processing unit in the cluster as the new head of the cluster.
In yet another embodiment, utilizing the cluster to process input/output (I/O) requests from hosts, as well as metadata services, may further comprise: when one input/output processing unit starts up, informing the other input/output processing units, through the metadata service, of the local disks attached to that unit.
In further embodiments, utilizing the cluster to process input/output (I/O) requests from hosts, as well as metadata services, may further comprise: determining the storage locations of data according to a consistent hashing algorithm so that the data can be evenly distributed across all of the storage processors.
In one embodiment, determining the storage location of the data according to a consistent hashing algorithm comprises: computing a hash value based on a volume identifier and an offset value in an input/output (I/O) request; determining a list of hard disk drives corresponding to the hash value; querying a metadata service to determine the input/output processing units directly attached to the hard disk drives in the list and to obtain the input/output load condition of each of the determined input/output processing units; and, based on a result of the query, selecting from the determined input/output processing units an input/output processing unit for processing the I/O request.
In another embodiment, the number of hard disk drives included in the list is greater than one, and the number can be defined by the end user.
In yet another embodiment, utilizing the cluster to process input/output (I/O) requests from hosts, as well as metadata services, may further comprise: sending the I/O request to the selected input/output processing unit for processing.
According to a second aspect of the present disclosure, there is provided an apparatus for building a scalable storage device, the apparatus comprising: a combining unit configured to construct the scalable storage device by combining a plurality of modular building blocks, wherein each modular building block of the plurality of modular building blocks comprises a disk enclosure, and at least one modular building block of the plurality of modular building blocks comprises a storage processor comprising an input/output processing unit; a cluster forming unit configured to form a cluster with the input/output processing units in the at least one modular building block; and a cluster processing unit configured to process input/output (I/O) requests from hosts, as well as metadata services, with the cluster.
In one embodiment, only a first modular building block of the plurality of modular building blocks includes the storage processor; and wherein the combining unit is configured to construct the scalable storage device by connecting the first modular building block with each other modular building block of the plurality of modular building blocks.
In another embodiment, each modular building block of the plurality of modular building blocks comprises the storage processor; and wherein the combining unit is configured to construct the scalable storage device by interconnecting each of the plurality of modular building blocks.
In yet another embodiment, the plurality of modular building blocks includes a first set of modular building blocks and a second set of modular building blocks, and only the first set of modular building blocks includes the storage processor; and wherein the combining unit is configured to: the scalable storage device is constructed by interconnecting each modular building block of the first set of modular building blocks and connecting each modular building block of the first set of modular building blocks with one or more modular building blocks of the second set of modular building blocks.
In one embodiment, the cluster forming unit may be further configured to: select one input/output processing unit in the cluster as the head of the cluster, wherein the head of the cluster services metadata update requests; and wherein each input/output processing unit in the cluster is capable of providing the metadata service and a data service.
In another embodiment, the cluster forming unit may be further configured to select another input-output processing unit in the cluster as a new head of the cluster when the head of the cluster fails.
In one embodiment, the cluster processing unit may be further configured to notify, through the metadata service, the other input/output processing units of the local disks attached to an input/output processing unit when that unit starts up.
In another embodiment, the cluster processing unit may be further configured to determine the storage location of the data according to a consistent hashing algorithm, such that the data can be evenly distributed across all of said storage processors.
In yet another embodiment, the cluster processing unit may be further configured to: compute a hash value based on a volume identifier and an offset value in an input/output (I/O) request; determine a list of hard disk drives corresponding to the hash value; query a metadata service to determine the input/output processing units directly attached to the hard disk drives in the list and to obtain the input/output load condition of each of the determined input/output processing units; and, based on a result of the query, select from the determined input/output processing units an input/output processing unit for processing the I/O request.
In a further embodiment, the number of hard disk drives included in the list is greater than one, and the number can be defined by the end user.
In one embodiment, the cluster processing unit may be further configured to send the I/O request to the selected input output processing unit for processing the I/O request.
According to a third aspect of the present disclosure, there is provided an apparatus for building a scalable storage device, comprising at least one processor; and at least one memory including computer program code, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform any of the methods according to the first aspect of the disclosure.
According to a fourth aspect of the present disclosure, there is provided a computer program product embodied in a computer readable medium and comprising computer readable program instructions which, when loaded into an apparatus, perform any of the methods according to the first aspect of the present disclosure.
According to a fifth aspect of the present disclosure, there is provided an expanded storage device, the device comprising:
any apparatus according to the second aspect of the present disclosure, and a plurality of modular building blocks; wherein each modular building block of the plurality of modular building blocks comprises one or more disk enclosures; and wherein at least one modular building block of the plurality of modular building blocks comprises a storage processor comprising an input output processing unit.
Drawings
Some embodiments of methods and/or apparatus according to embodiments of the present disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 illustrates a flow diagram of a method for building a scalable storage device according to an embodiment of the present disclosure;
FIGS. 2A-2B illustrate schematic diagrams of modular building blocks according to embodiments of the present disclosure;
FIGS. 3A-3C schematically illustrate storage devices constructed by combining a plurality of modular building blocks, according to embodiments of the present disclosure; and
FIG. 4 is a block diagram schematically illustrating an apparatus for building a scalable storage device, according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
It should be understood that these exemplary embodiments are given solely to enable those skilled in the art to better understand and thereby implement the present disclosure, and are not intended to limit the scope of the present disclosure in any way.
References herein to "one embodiment," "another embodiment," or "one preferred embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, these terms are not necessarily referring to the same embodiment.
It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" may include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used herein, specify the presence of stated features, elements, and/or components, but do not preclude the presence or addition of one or more other features, elements, components, and/or groups thereof. The term "plurality" as used herein means "two or more". The term "and/or" as used herein may include any and all combinations of one or more of the associated listed items. Definitions of other terms will be specifically given in the following description.
In addition, in the following description, some functions or configurations known to those skilled in the art will be omitted to avoid obscuring the embodiments of the present disclosure in unnecessary detail.
Embodiments of the present disclosure relate to methods and apparatus for building scalable storage devices to improve flexibility of storage device construction and improve performance of storage devices.
For ease of illustration, some of the embodiments below will use specific modular building blocks to construct a scalable storage system, but those skilled in the art will appreciate that the methods and apparatus of embodiments of the present disclosure are not limited to specific modular building blocks, i.e., the methods and apparatus of embodiments herein do not limit the type of modular building block hardware, and may be applied to any hardware.
Fig. 1 schematically illustrates a flow diagram of an example method 100 according to one embodiment of the present disclosure. As shown in Fig. 1, the method 100 includes: in step S101, building the scalable storage device by combining a plurality of modular building blocks, wherein each modular building block of the plurality comprises a disk enclosure, and at least one modular building block of the plurality comprises a storage processor comprising an input/output processing unit; in step S102, forming a cluster with the input/output processing units in the at least one modular building block; and, in step S103, processing input/output (I/O) requests from hosts, as well as metadata services, with the cluster.
The method 100 can construct an expandable storage device by using a variable number of modular building blocks according to actual needs, and has the advantage of high flexibility.
Herein, "modular building block" and "modular engine" are used interchangeably. In one embodiment, at the hardware level, a modular building block (modular engine) may be based on a 2U chassis, i.e., the devices in one 2U chassis form one modular building block. However, as those skilled in the art can appreciate, embodiments of the present disclosure are not so limited, and the modular building blocks may be based on any hardware architecture, for example a 1U or 4U chassis, or another architecture.
In some embodiments, modular building blocks are assumed to be based on a 2U chassis. Each 2U chassis may be divided into several separate spaces, for example 2, 3, or 4, depending on the hardware design. Some spaces are used for high-density disk enclosures (DEs) with hard disk drives and input/output (IO) expanders. One or more spaces are used for storage processors (SPs). An SP has a central processing unit (CPU), memory, and a motherboard, and is capable of running a fully functional operating system (OS) such as Linux; however, as those skilled in the art can appreciate, embodiments of the present disclosure do not limit the type of operating system, i.e., it can be any suitable operating system.
In one embodiment, the software stack is built on top of a storage processor (SP). The bottom layer of the software stack may be, for example but not limited to, a Linux OS with Linux containers. Linux containers can provide a portable, lightweight runtime environment for the core storage stack. At least three separate runtime environments may be created using Linux containers. The first Linux container may be referred to as the "controller", which provides management services for the entire storage system. The second and third Linux containers may be referred to as "input/output processing units" (IOPUs), which manage disk enclosures and provide block devices. High availability of the storage system can be provided by using two or more IOPUs.
In one embodiment, the plurality of modular building blocks combined in step S101 comprises a plurality of existing modular building blocks; and the generation of the modular building blocks need not be part of the method 100.
In one embodiment, only a first modular building block of the plurality of modular building blocks combined in step S101 may comprise the SP; and step S101 includes building the scalable storage device by connecting the first modular building block with each other modular building block of the plurality. In one embodiment, this connection may be made via, for example, an input/output expander, although embodiments of the present disclosure are not so limited and any suitable alternative connection may be employed. In this embodiment, the SP(s) in the first modular building block form a cluster at step S102 to handle I/O requests and metadata services at step S103.
A building block with an SP similar to the first modular building block may be referred to as a Fully Functional Modular Engine (FFME) that includes both a Disk Enclosure (DE) and a Storage Processor (SP). An example implementation structure of FFME is shown in fig. 2A.
In this embodiment, each other modular building block of the plurality of modular building blocks other than the first modular building block includes only a disk enclosure and no SP. Such modular engines may be referred to as disk drive only modular engines (DDME). An example implementation structure of the DDME is shown in fig. 2B.
By way of example and not limitation, FFME and DDME may be based on an Open Compute Project (OCP) design. Open compute hardware is an open hardware platform that employs proven technology; it is mature enough for commercial storage arrays to be implemented on it. For example, at the hardware level, the left and right DEs in FIG. 2A may be disk enclosures based on the OCP project Torpedo. In one specific example, inside each DE there are an internal SAS expander, 15 3.5" drives arranged in a 3x5 array, and two 80 mm fans. The central space in FIG. 2A may be used for the storage controller (i.e., the SP). The storage controller may be built, for example, on OCP AMD motherboard hardware v2.0; it has, for example, two AMD CPUs and 4-channel double data rate (DDR) memory. On top of the storage controller hardware, the open-source Linux container engine "docker" may be used to build the software stack.
FIG. 3A shows a schematic diagram of a scalable storage device constructed at step S101, in one embodiment, by connecting an FFME with DDMEs via input/output expanders. The structure of the storage device so constructed may be referred to herein as a "single FFME + multiple DDMEs" architecture. In this embodiment, the SP(s) in the FFME form the cluster, i.e., all IO requests are serviced on the FFME. Such a system suits usage scenarios that require very large disk capacity but a light IO workload.
In another embodiment, each modular building block of the plurality of modular building blocks combined in step S101 comprises the Storage Processor (SP); and step S101 includes constructing the scalable storage device by interconnecting each of the plurality of modular building blocks. In one embodiment, the interconnection may be performed via, for example, an IP network line, but embodiments of the present disclosure are not limited thereto and may be performed by any suitable connection means. In this embodiment, in step S102, all SPs in the plurality of modular building blocks constitute a cluster.
FIG. 3B shows a schematic diagram of a scalable storage device constructed at step S101, in one embodiment, by interconnecting a plurality of FFMEs via IP network lines. Since the connected building blocks have the same structure, the architecture of the storage device so constructed may be referred to as a "symmetric architecture". All FFMEs are connected by network wires, e.g., switched connections over an IP network, and form an active/active (dual-active) cluster, i.e., each side of the connection is active and available. In this example, all SPs in the plurality of FFMEs form a cluster to handle input/output (I/O) requests as well as metadata services, and each SP in the cluster is capable of handling IO requests; that is, any IO request may be processed on any FFME. This embodiment is therefore applicable to usage scenarios with high IO loads. For example, each FFME has an SP, and when an IO request arrives at some SP, for example at random, that SP can determine from its own state, such as its load condition, whether it can process the request; if not, the request may be forwarded to another SP for processing. Since each SP can process any IO, the IO processing workload can be spread out, improving the performance of the storage system.
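The load-based forwarding just described can be sketched as follows. This is a hypothetical illustration, not the patented implementation: the numeric load figures, the 0.8 threshold, and all names are assumptions made for the example.

```python
# Hypothetical sketch of load-based dispatch in the symmetric architecture:
# an SP services an I/O request locally when it is not overloaded, and
# otherwise forwards it to the least-loaded peer SP in the active/active
# cluster. The 0.8 threshold is illustrative only.

def choose_sp(local_sp, peer_sps, load_threshold=0.8):
    """Return the SP (a dict with 'name' and 'load') that should service the request."""
    if local_sp["load"] < load_threshold:
        return local_sp                                  # process locally
    least_loaded = min(peer_sps, key=lambda sp: sp["load"])
    # forward only if some peer is actually less loaded than we are
    return least_loaded if least_loaded["load"] < local_sp["load"] else local_sp

local = {"name": "FFME-1", "load": 0.95}
peers = [{"name": "FFME-2", "load": 0.30}, {"name": "FFME-3", "load": 0.60}]
target = choose_sp(local, peers)                         # forwarded to FFME-2
```

In this sketch an overloaded FFME-1 hands the request to FFME-2, the least-loaded peer; a lightly loaded SP keeps the request for itself.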
In yet another embodiment, the plurality of modular building blocks combined in step S101 includes a first set of modular building blocks and a second set of modular building blocks, and only the first set of modular building blocks (e.g., FFME) includes the storage processor; and wherein step S101 comprises building the scalable storage device by interconnecting each modular building block of the first set of modular building blocks (e.g. via IP network lines) and connecting each modular building block of the first set of modular building blocks with one or more modular building blocks of the second set of modular building blocks (e.g. DDME) (e.g. via input-output expanders).
A schematic diagram of this embodiment is shown in fig. 3C. Which can be seen as an example of building a storage device by a hybrid of the embodiments of fig. 3A and 3B. The structure of the storage device thus constructed may be referred to as a "hybrid architecture". It should be noted that although each FFME is connected to the same number of DDMEs in fig. 3C, embodiments of the present disclosure are not so limited. In some embodiments, each building block of the first set of modular building blocks may connect a different number of building blocks of the second set of building blocks.
In one embodiment, at step S102, a cluster is formed using, for example, the input/output processing units (IOPUs) in all SPs of all FFMEs in FIG. 3B or 3C. Each FFME may include a single SP for IO processing and one or more DEs providing disk capacity; more complex configurations may also be used to meet large-scale capacity and performance use cases, for example, one FFME may include two or more SPs. In one embodiment, the cluster may be a PAXOS group, i.e., all IOPUs in all SPs form a PAXOS group to handle data, metadata, and cluster state management. One example implementation can use Apache ZooKeeper as a decentralized metadata service for metadata storage, global locks, and cluster state management.
In one embodiment, step S102 further includes selecting one of the input/output processing units in the cluster as the head of the cluster, where the head of the cluster services metadata update requests; and each input/output processing unit in the cluster has the capability to provide the metadata service and the data service.
In one embodiment, the metadata service manages the block location mapping for all physical hard disks and logical volumes. The data service handles I/O for all of the physical hard drives locally attached to its unit.
In another embodiment, step S102 further includes, when the head of the cluster fails, selecting another input/output processing unit in the cluster as the new head of the cluster. In one exemplary embodiment, which IOPU in the cluster is the head may be determined according to the PAXOS algorithm, and a new head may be elected after the current head of the cluster has failed.
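The failover behaviour can be illustrated with a deliberately simplified stand-in for PAXOS election: the sketch below just promotes the lowest-numbered live IOPU. This is not the real election protocol, only an assumption-laden illustration of "a new head is elected after the current head fails".

```python
# Simplified stand-in for PAXOS head election (illustrative only): the
# lowest-numbered IOPU that is still alive becomes the cluster head, and
# a new head is elected as soon as the current head fails.

def elect_head(iopus):
    """iopus: dict mapping IOPU id -> alive (bool). Return the new head's id, or None."""
    live = sorted(i for i, alive in iopus.items() if alive)
    return live[0] if live else None

cluster = {1: True, 2: True, 3: True}
head = elect_head(cluster)          # IOPU 1 is elected head
cluster[head] = False               # the head fails...
new_head = elect_head(cluster)      # ...and IOPU 2 takes over
```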
In yet another embodiment, step S103 may further comprise: when one input/output processing unit starts up, notifying the other input/output processing units in the cluster, through the metadata service, of the local disks attached to that unit. For example, at power-up an IOPU boots and joins the storage system, reporting all locally attached drives to its metadata service; the metadata services communicate with each other through PAXOS and exchange metadata. Thus, when the metadata service of one IOPU learns the information about its hard drives, it tells the metadata services of the other IOPUs through PAXOS, so that the metadata service on each IOPU knows the disk information of all IOPUs. Any metadata change is managed by the PAXOS group (i.e., the cluster) and synchronized across the IOPUs of all SPs.
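The boot-time reporting can be sketched as below. The class and method names are hypothetical, and PAXOS replication is abstracted into a single shared map rather than replicated state.

```python
# Illustrative sketch: at power-up each IOPU registers its locally attached
# drives with the metadata service, so that every IOPU can afterwards look
# up which unit owns a given disk. PAXOS synchronization is modelled here
# as one shared dictionary.

class MetadataService:
    def __init__(self):
        self.drive_owner = {}                    # drive id -> owning IOPU id

    def register(self, iopu_id, drives):
        """Called by an IOPU at boot to report its locally attached drives."""
        for drive in drives:
            self.drive_owner[drive] = iopu_id

    def owner_of(self, drive):
        return self.drive_owner.get(drive)

meta = MetadataService()
meta.register("IOPU-1", ["hdd-0", "hdd-1"])      # IOPU-1 boots and reports
meta.register("IOPU-2", ["hdd-2", "hdd-3"])      # IOPU-2 boots and reports
```

After registration, any unit can resolve a drive to its directly attached IOPU through the shared map, which is the information the metadata services exchange via PAXOS in the text above.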
In one embodiment, step S103 may further comprise: determining the storage locations of data according to a consistent hashing algorithm so that the data can be evenly distributed across all of the plurality of modular building blocks. In one example, all hard disk drives in the storage system form a consistent hash ring; based on the hash value, each drive is responsible for a range of data. The consistent hash may, for example, split the data based on the volume's unique ID and the block offset.
In one embodiment, in step S103, determining the storage location of the data according to the consistent hashing algorithm may include the following operations:
computing a hash value based on a volume identifier and an offset value in an input/output (I/O) request;
determining a list of hard disk drives corresponding to the hash value;
querying a metadata service to determine the input/output processing units directly attached to the hard disk drives in the list and to obtain the input/output load condition of each of the determined input/output processing units; and
based on a result of the query, selecting from the determined input/output processing units an input/output processing unit for processing the I/O request.
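A minimal sketch of these four steps follows, under stated assumptions: the consistent-hash ring is modelled as a sorted list of (position, drive) pairs, the replica list is taken as N consecutive drives on the ring, and the metadata service is reduced to two plain dictionaries. All names and the choice of MD5 are illustrative, not from the patent.

```python
# Illustrative sketch of the placement steps above: hash the (volume,
# offset) key, take the N successor drives on the ring, look up the IOPUs
# directly attached to those drives, and pick the least-loaded one.
import hashlib
from bisect import bisect_left

def hash_value(volume_id, offset):
    # step 1: hash the volume identifier and offset from the I/O request
    key = f"{volume_id}:{offset}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % (2 ** 32)

def drive_list(ring, volume_id, offset, n_copies=2):
    # step 2: the n_copies drives following the hash position on the ring
    positions = [p for p, _ in ring]
    i = bisect_left(positions, hash_value(volume_id, offset)) % len(ring)
    return [ring[(i + k) % len(ring)][1] for k in range(n_copies)]

def select_iopu(drives, drive_owner, iopu_load):
    # steps 3-4: consult the (here dict-based) metadata for the IOPUs
    # directly attached to the drives, then pick the least-loaded one
    candidates = {drive_owner[d] for d in drives}
    return min(candidates, key=lambda u: iopu_load[u])

ring = [(0, "hdd-0"), (2 ** 30, "hdd-1"), (2 ** 31, "hdd-2"), (3 * 2 ** 30, "hdd-3")]
drive_owner = {"hdd-0": "IOPU-1", "hdd-1": "IOPU-1", "hdd-2": "IOPU-2", "hdd-3": "IOPU-2"}
iopu_load = {"IOPU-1": 0.7, "IOPU-2": 0.2}
drives = drive_list(ring, "vol-1", 4096)
target = select_iopu(drives, drive_owner, iopu_load)
```

Because the drives for a block are consecutive ring successors, adding or removing a drive moves only the adjacent key ranges, which is the property that lets data stay evenly distributed as the system scales.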
In one embodiment, the above operations may be performed in the SP at which the I/O request arrives.
In one embodiment, a hash table may be used to record the mapping of hash values to data locations. The hash table may be considered a type of metadata whose storage location may be determined by the PAXOS algorithm; for example, it may be stored not only at the head of the PAXOS group (the cluster head) but distributed among the IOPUs.
In another embodiment, the number N of hard disk drives included in the list of hard disk drives corresponding to the hash value determined in step S103 is greater than 1, and N can be defined by the end user. That is, each piece of data is stored on at least two hard disk drives for fault tolerance, and the user can define the number of copies of the data.
In yet another embodiment, selecting an input/output processing unit for processing the I/O request based on the result of the query includes selecting the currently least loaded IOPU. In one example, the metadata service may maintain health information for all hardware and software components in the storage device, so that by querying the metadata service, I/O requests can be prevented from being forwarded to a failed SP. In this way, failover can be accomplished efficiently.
In one embodiment, step S103 may further include sending the I/O request to the selected input-output processing unit for processing. In one example, if the I/O request is a write request, the write request is sent to the selected IOPU at step S103; the selected IOPU receives the write request, stores the data to the hard disk, and returns success after the write request is completed. In another example, if the I/O request is a read request, the read request is sent to the selected IOPU at step S103; the selected IOPU receives the read request, reads the data from the hard disk, and returns success after the read request is completed.
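The dispatch step can be illustrated with a toy IOPU that stores data in memory rather than on a hard disk. Both the request dictionary and the IOPU interface are hypothetical stand-ins for the SP-internal API, which the patent does not specify.

```python
class InMemoryIOPU:
    """Toy IOPU: keeps data in a dict instead of on a hard disk drive."""

    def __init__(self):
        self.store = {}

    def write(self, volume, offset, data):
        self.store[(volume, offset)] = data

    def read(self, volume, offset, length):
        return self.store[(volume, offset)][:length]

def dispatch(iopu, request):
    """Forward a request to the selected IOPU; return success (or the data)
    once the IOPU has completed it."""
    if request["op"] == "write":
        iopu.write(request["volume"], request["offset"], request["data"])
        return "success"
    if request["op"] == "read":
        return iopu.read(request["volume"], request["offset"], request["length"])
    raise ValueError(f"unknown op {request['op']!r}")
```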
A method for building a scalable storage device according to an embodiment of the present disclosure is described above with reference to fig. 1. The method has the advantages that:
(1) providing a modular design that enables complex storage systems to be formed using different combinations of building blocks to meet different workload requirements;
(2) good scalability: the method allows the use of consistent hashing and PAXOS groups to eliminate failures caused by a single point and to eliminate performance hotspots, so that no single node is overburdened;
(3) low cost and no hardware lock-in: the method is not limited to a particular type of hardware and can be applied to any hardware. For example, commodity hardware such as OCP hardware can be used as the modular building blocks to produce low-cost scalable storage systems according to the methods of embodiments of the present disclosure.
At the same time, the method 100 is able to handle failures of various hardware and software components efficiently, for example:
1) The metadata service maintains health information for all hardware and software components, so that failover can be accomplished efficiently.
2) There are multiple copies of each piece of data, so that a failure of one or two hard disk drives does not affect data availability.
3) The metadata service uses PAXOS to replicate its data, so there is no single point of failure.
4) Multiple SPs may be employed, so that the storage system can tolerate SP failures such as CPU, memory, or motherboard failures.
5) Each SP may have multiple IOPUs and is able to handle software failures of individual IOPUs.
It should be noted that although certain exemplary embodiments of the present disclosure describe the method of building a scalable storage device in terms of two modular building blocks, those skilled in the art will appreciate that the method is equally applicable to scaling a storage device using any number of building blocks. In an actual implementation, expansion may be achieved by combining a variable number of modular building blocks as needed, using any of the methods described with reference to fig. 1-3.
An apparatus 400 for building a scalable storage device according to an embodiment of the present disclosure will be described below with reference to fig. 4. The apparatus may implement the method of any of the embodiments described with reference to fig. 1, but is not limited to implementing the method 100; while the method 100 described with reference to fig. 1 may be performed by the apparatus 400, it is not limited to being performed by the apparatus 400. For example, in some embodiments, at least one step of method 100 may be performed by another device.
As shown in fig. 4, the apparatus 400 includes a combining unit 401 configured to construct the scalable storage device by combining the plurality of modular building blocks; wherein each modular building block of the plurality of modular building blocks comprises a disk enclosure; and at least one modular building block of the plurality of modular building blocks comprises a storage processor comprising an input output processing unit; and a cluster forming unit 402 configured to form a cluster with the input-output processing units in the at least one modular building block; and a cluster processing unit 403 configured to process input or output I/O requests from hosts and metadata services with the cluster.
According to an embodiment, the combining unit 401, the cluster forming unit 402 and the cluster processing unit 403 can be configured to implement the operations of steps S101, S102 and S103, respectively, described with reference to fig. 1. Therefore, the description about steps S101, S102, and S103 made with reference to fig. 1 is equally applicable here.
In one embodiment, only a first modular building block of the plurality of modular building blocks combined by the combining unit 401 comprises the storage processor; and wherein the combining unit 401 is configured to build the scalable storage device by connecting the first modular building block with each other modular building block of the plurality of modular building blocks. In one embodiment, this interconnection may be performed via, for example, an input-output expander, but embodiments of the present disclosure are not limited thereto, and may be performed by any suitable connection means.
In another embodiment, each modular building block of the plurality of modular building blocks combined by the combining unit 401 comprises the storage processor; and wherein the combining unit 401 is configured to build the scalable storage device by interconnecting each of the plurality of modular building blocks. In one embodiment, the interconnection may be performed via, for example, an IP network line, but embodiments of the present disclosure are not limited thereto and may be performed by any suitable connection means.
In yet another embodiment, the plurality of modular building blocks combined by combining unit 401 includes a first set of modular building blocks and a second set of modular building blocks, and only the first set of modular building blocks (e.g., FFME) includes the storage processor, that is, the second set of modular building blocks (e.g., DDME) of the plurality of modular building blocks does not include the storage processor; and wherein the combining unit 401 is configured to build the scalable storage device by connecting each modular building block of the first set of modular building blocks to each other (e.g., via IP network lines) and connecting each modular building block of the first set of modular building blocks to one or more modular building blocks of the second set of modular building blocks (e.g., via input-output expanders).
In an embodiment of the present disclosure, the cluster forming unit 402 may be further configured to select one input/output processing unit in the cluster as a header of the cluster, where the header of the cluster serves the metadata update request; and wherein each input output processing unit in the cluster has the capability to provide the metadata service and the data service.
In another embodiment, the cluster forming unit 402 may be further configured to select another input-output processing unit in the cluster as a new head of the cluster when the head of the cluster fails.
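One simple way to realize head selection and re-election is a deterministic rule over the set of live IOPUs: every member computes the same head locally, so no extra coordination round is needed for the common case. This is a deliberate simplification for illustration; the disclosure uses a PAXOS group for this role, which additionally handles the consensus aspects this sketch omits.

```python
def elect_head(members, failed=frozenset()):
    """Deterministic head election among live IOPUs: the smallest live id wins.
    When the current head appears in `failed`, re-running the function yields
    the new head automatically."""
    live = sorted(m for m in members if m not in failed)
    if not live:
        raise RuntimeError("cluster has no live IOPU")
    return live[0]
```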
In one embodiment, the cluster processing unit 403 may be further configured to, when an input-output processing unit is started, notify the other input-output processing units in the cluster, through the metadata service, of the local disks attached to that input-output processing unit.
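The startup notification amounts to registering disk ownership with the metadata service, so that any IOPU can later find which unit is directly attached to a given drive. The API below (`announce`, `owner_of`) is a hypothetical minimal interface, not the disclosure's actual one.

```python
class MetadataService:
    """Toy metadata service: records which local disks each IOPU announced
    at startup, so the cluster can route requests to a disk's owner."""

    def __init__(self):
        self.disk_owner = {}

    def announce(self, iopu_id, local_disks):
        # Called by an IOPU when it starts; makes its disks visible cluster-wide.
        for disk in local_disks:
            self.disk_owner[disk] = iopu_id

    def owner_of(self, disk):
        return self.disk_owner[disk]
```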
In another embodiment, the cluster processing unit 403 may be further configured to determine the storage location of the data according to a consistent hashing algorithm, such that the data can be evenly distributed across all of the storage processors.
In an example embodiment, the cluster processing unit 403 may be further configured to perform the following operations:
computing a hash value based on a volume identifier and an offset value in an input or output I/O request;
determining a list of hard disk drives corresponding to the hash value;
querying a metadata service to determine the input-output processing units directly attached to the hard disk drives in the list and to obtain the input-output load of each of the determined input-output processing units; and
based on a result of the query, selecting an input-output processing unit for processing the I/O request from the determined input-output processing units directly attached to the hard disk drives in the list.
In another embodiment, the list of hard disk drives determined to correspond to the hash value includes N > 1 hard disk drives, and the number N can be defined by the end user.
In one embodiment, cluster processing unit 403 may be further configured to send the I/O request to a selected Input Output Processing Unit (IOPU) for processing the I/O request. The selected IOPU may then process the request.
As described above, the method 100 and apparatus 400 according to embodiments of the present disclosure construct a scalable storage system from modular building blocks, provide good construction flexibility, and can improve the achievable performance of the storage system.
Although embodiments of the method/apparatus proposed by the present disclosure are described in some embodiments with some specific components (e.g., 2U chassis, OCP-based hardware) and specific algorithms (e.g., PAXOS algorithm) as examples, as can be appreciated by those skilled in the art, embodiments of the present disclosure are not limited thereto, but can be more broadly applied.
It will be appreciated by those skilled in the art that any block diagrams described herein represent illustrative schematic diagrams implementing the principles of the disclosure. Similarly, it will be appreciated that the flow diagrams described herein represent various processes which may be embodied in a machine-readable medium and executed by a machine or processing device, whether or not such machine or processing device is explicitly shown. In some embodiments, some of the operations in the flow diagrams may also be done manually.
It will also be understood by those of ordinary skill in the art that one or more of the method steps mentioned in the present disclosure may be implemented by a single functional block or a single device, while in some embodiments one functional block may implement multiple method steps. The steps in the flow diagrams may be performed in any suitable order and not necessarily in the order shown.
The units included in the apparatus 400 according to embodiments of the present disclosure may be implemented in various ways, including software, hardware, firmware, or any combination thereof. For example, in some embodiments, the apparatus 400 may be implemented using software and/or firmware. Alternatively or additionally, the apparatus 400 may be implemented partially or completely in hardware. For example, one or more units of the apparatus 400 may be implemented as an Integrated Circuit (IC) chip, an Application Specific Integrated Circuit (ASIC), a system on a chip (SOC), a Field Programmable Gate Array (FPGA), or the like. The scope of the present disclosure is not limited in this respect. Additionally, in some embodiments a single unit of the apparatus 400 may be implemented by multiple sub-units, while in other embodiments the functionality of multiple units of the apparatus 400 may be implemented by a single unit. In some embodiments, the functionality of certain units may also be performed manually by a user, in which case those units need not be implemented by machine, software, firmware, or the like.
The present disclosure may be a system, apparatus, device, method, and/or computer program product. According to one embodiment of the present disclosure, the present disclosure may be implemented by an apparatus for building a scalable storage device, the apparatus including at least one processor; and at least one memory including computer program code, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform any of the methods previously described with reference to fig. 1. According to another embodiment of the disclosure, the present disclosure may be implemented by a computer program product embodied in a computer readable medium and comprising computer readable program instructions that, when loaded into an apparatus, perform any of the methods according to the embodiments of the disclosure.
An embodiment of the present disclosure also provides an expanded storage device, the device being constructed according to any of the methods described with reference to fig. 1, and/or the device comprising any of the apparatus 400 described with reference to fig. 4, and a plurality of modular building blocks; wherein each modular building block of the plurality of modular building blocks comprises one or more disk enclosures; and wherein at least one modular building block of the plurality of modular building blocks comprises a storage processor comprising an input output processing unit.
The description above with reference to the drawings is given by way of illustration only for the purpose of explaining the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the disclosure and are included within its spirit and scope. Moreover, all examples mentioned herein are provided primarily for pedagogical purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventors to furthering the art, and are not to be construed as limiting the scope of the disclosure. Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass equivalents thereof.

Claims (23)

1. A method of building a scalable storage device, comprising:
building the scalable storage device by combining a plurality of modular building blocks; wherein each modular building block comprises a disk enclosure; and at least one modular building block of the plurality of modular building blocks comprises a storage processor comprising an input output processing unit;
forming a cluster with the input-output processing units in the at least one modular building block; and
processing input or output I/O requests from hosts and metadata services with the cluster;
wherein utilizing the cluster to process input or output I/O requests from hosts and metadata services comprises:
the storage locations of the data are determined in accordance with a consistent hashing algorithm such that the data can be evenly distributed across the plurality of modular building blocks.
2. The method of claim 1, wherein only a first modular building block of the plurality of modular building blocks includes the storage processor; and
wherein building the scalable storage device by combining a plurality of modular building blocks comprises:
constructing the scalable storage device by connecting the first modular building block with each other modular building block of the plurality of modular building blocks.
3. The method of claim 1, wherein each modular building block of the plurality of modular building blocks comprises the storage processor; and
wherein building the scalable storage device by combining a plurality of modular building blocks comprises:
the scalable storage device is constructed by interconnecting each of the plurality of modular building blocks.
4. The method of claim 1, wherein the plurality of modular building blocks comprises a first set of modular building blocks and a second set of modular building blocks, and only the first set of modular building blocks comprises the storage processor, and
wherein building the scalable storage device by combining a plurality of modular building blocks comprises:
the scalable storage device is constructed by interconnecting each modular building block of the first set of modular building blocks and connecting each modular building block of the first set of modular building blocks with one or more modular building blocks of the second set of modular building blocks.
5. The method of claim 1, wherein forming a cluster with the input-output processing units in the at least one modular building block further comprises:
selecting one input/output processing unit in the cluster as a head of the cluster;
wherein a header of the cluster services a metadata update request; and each input output processing unit in the cluster is capable of providing the metadata service and a data service.
6. The method of claim 5, wherein forming a cluster with the input-output processing units in the at least one modular building block further comprises:
when the head of the cluster fails, selecting another input-output processing unit in the cluster as a new head of the cluster.
7. The method of any of claims 1-6, wherein utilizing the cluster to process input or output I/O requests from hosts and metadata services comprises:
upon startup of one input output processing unit, notifying other input output processing units in the cluster of the attached local disk on the one input output processing unit through the metadata service.
8. The method of claim 1, wherein determining storage locations of data according to a consistent hashing algorithm comprises:
calculating a hash value based on a volume identifier and an offset value in the input or output I/O request;
determining a list of hard disk drives corresponding to the hash value;
querying a metadata service to determine input-output processing units directly attached to hard disk drives in the list and to obtain the input-output load of each of the determined input-output processing units; and
based on a result of the query, selecting an input-output processing unit for processing the I/O request from the determined input-output processing units directly attached to the hard disk drives in the list.
9. The method of claim 8, wherein the number of hard disk drives included in the list is greater than one, and the number is definable by an end user.
10. The method of claim 8, wherein utilizing the cluster to process input or output I/O requests from hosts and metadata services comprises:
sending the I/O request to the selected input/output processing unit for processing the I/O request.
11. An apparatus for building a scalable storage device, comprising:
a combining unit configured to construct the scalable storage device by combining a plurality of modular building blocks; wherein each modular building block comprises a disk enclosure; and at least one modular building block of the plurality of modular building blocks comprises a storage processor comprising an input output processing unit;
a cluster forming unit configured to form a cluster with the input-output processing units in the at least one modular building block; and
a cluster processing unit configured to process input or output I/O requests from hosts and metadata services with the cluster;
wherein the cluster processing unit is further configured to:
the storage locations of the data are determined according to a consistent hashing algorithm so that the data can be evenly distributed across all of the storage processors.
12. The apparatus of claim 11, wherein only a first modular building block of the plurality of modular building blocks comprises the storage processor; and
wherein the combining unit is configured to construct the scalable storage device by connecting the first modular building block with each other modular building block of the plurality of modular building blocks.
13. The apparatus of claim 11, wherein each modular building block of the plurality of modular building blocks comprises the storage processor; and
wherein the combining unit is configured to construct the scalable storage device by interconnecting each of the plurality of modular building blocks.
14. The apparatus of claim 11, wherein the plurality of modular building blocks comprises a first set of modular building blocks and a second set of modular building blocks, and only the first set of modular building blocks comprises the storage processor; and
wherein the combining unit is configured to construct the scalable storage device by interconnecting each modular building block of the first set of modular building blocks and connecting each modular building block of the first set of modular building blocks with one or more modular building blocks of the second set of modular building blocks.
15. The apparatus of claim 11, wherein the cluster forming unit is further configured to:
selecting one input/output processing unit in the cluster as a head of the cluster;
wherein a header of the cluster services a metadata update request; and each input output processing unit in the cluster is capable of providing the metadata service and a data service.
16. The apparatus of claim 15, wherein the cluster forming unit is further configured to pick another input output processing unit in the cluster as a new head of the cluster when the head of the cluster fails.
17. The apparatus of any of claims 11-16, wherein the cluster processing unit is further configured to:
upon startup of one input output processing unit, notifying other input output processing units in the cluster of the attached local disk on the one input output processing unit through the metadata service.
18. The apparatus of claim 11, wherein determining storage locations of data according to a consistent hashing algorithm comprises:
calculating a hash value based on a volume identifier and an offset value in the input or output I/O request;
determining a list of hard disk drives corresponding to the hash value;
querying a metadata service to determine input-output processing units directly attached to hard disk drives in the list and to obtain the input-output load of each of the determined input-output processing units; and
based on a result of the query, selecting an input-output processing unit for processing the I/O request from the determined input-output processing units directly attached to the hard disk drives in the list.
19. The apparatus of claim 18, wherein the number of hard disk drives included in the list is greater than one and the number is definable by an end user.
20. The apparatus of claim 18, wherein the cluster processing unit is further configured to:
sending the I/O request to the selected input/output processing unit for processing the I/O request.
21. An apparatus for building a scalable storage device, comprising:
at least one processor; and
at least one memory including computer program code,
wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of any of claims 1-10.
22. A computer readable medium having computer readable program instructions stored thereon which, when loaded into an apparatus, perform the method according to any one of claims 1-10.
23. An expanded storage device, the device comprising:
the apparatus according to any of claims 11-20, and
a plurality of modular building blocks;
wherein each modular building block of the plurality of modular building blocks comprises one or more disk enclosures; and is
Wherein at least one modular building block of the plurality of modular building blocks comprises a storage processor comprising an input output processing unit.
CN201510184340.9A 2015-04-17 2015-04-17 Method and device for constructing expandable storage device and expanded storage device Active CN106155574B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510184340.9A CN106155574B (en) 2015-04-17 2015-04-17 Method and device for constructing expandable storage device and expanded storage device

Publications (2)

Publication Number Publication Date
CN106155574A CN106155574A (en) 2016-11-23
CN106155574B true CN106155574B (en) 2021-01-15

Family

ID=58057813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510184340.9A Active CN106155574B (en) 2015-04-17 2015-04-17 Method and device for constructing expandable storage device and expanded storage device

Country Status (1)

Country Link
CN (1) CN106155574B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1955913A (en) * 2005-10-20 2007-05-02 戴尔产品有限公司 Method for persistent mapping of disk drive identifiers to server connection slots
CN102880430A (en) * 2012-09-18 2013-01-16 北京联创信安科技有限公司 System and method for managing RAIDs (redundant array of inexpensive disks)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030154340A1 (en) * 2002-02-13 2003-08-14 Thomas Bolt Use of the universal serial bus as an internal architecture within IDE disk array
CN101540685B (en) * 2008-06-06 2012-08-29 曙光信息产业(北京)有限公司 PCIe shared storage blade for blade server

Also Published As

Publication number Publication date
CN106155574A (en) 2016-11-23

Similar Documents

Publication Publication Date Title
US11070979B2 (en) Constructing a scalable storage device, and scaled storage device
CN103067433B (en) A kind of data migration method of distributed memory system, equipment and system
US7844756B2 (en) Selection of data mover for data transfer
US20150331744A1 (en) Data device grouping across multiple-data-storage-devices enclosures for data reconstruction
US20150331775A1 (en) Estimating data storage device lifespan
CN109299190B (en) Method and device for processing metadata of object in distributed storage system
KR20120018178A (en) Swarm-based synchronization over a network of object stores
JP6724252B2 (en) Data processing method, storage system and switching device
US10534541B2 (en) Asynchronous discovery of initiators and targets in a storage fabric
CN102609446B (en) Distributed Bloom filter system and application method thereof
US10826812B2 (en) Multiple quorum witness
US20150331774A1 (en) Sensing potential failure event for a data storage device
US9557938B2 (en) Data retrieval based on storage device activation schedules
CN107391033B (en) Data migration method and device, computing equipment and computer storage medium
US11405455B2 (en) Elastic scaling in a storage network environment
WO2014107901A1 (en) Data storage method, database storage node failure processing method and apparatus
CN107577564A (en) A kind of method that dual-active system is realized based on block simultaneous techniques
US10749921B2 (en) Techniques for warming up a node in a distributed data store
CN107547605B (en) message reading and writing method based on node queue and node equipment
CN106155574B (en) Method and device for constructing expandable storage device and expanded storage device
CN116389233A (en) Container cloud management platform active-standby switching system, method and device and computer equipment
US20150331465A1 (en) Cascading startup power draws of enclosures across a network
US20150331610A1 (en) Data device grouping across multiple-data-storage-devices enclosures for synchronized data maintenance
US20200333985A1 (en) Datafall: a policy-driven algorithm for decentralized placement and reorganization of replicated data
CN106155573B (en) method and device for expanding storage device and expanded storage device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200414

Address after: Massachusetts, USA

Applicant after: EMC IP Holding Company LLC

Address before: Massachusetts, USA

Applicant before: EMC Corp.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant