WO2019127021A1 - Management method and apparatus for a storage device in a storage system - Google Patents

Management method and apparatus for a storage device in a storage system

Info

Publication number
WO2019127021A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage node
storage
management server
storage device
memory address
Prior art date
Application number
PCT/CN2017/118650
Other languages
English (en)
French (fr)
Inventor
罗旦
刘玉
张巍
冒伟
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP17898343.3A (patent EP3531666B1)
Priority to EP21190258.0A (patent EP3985949A1)
Priority to PCT/CN2017/118650 (publication WO2019127021A1)
Priority to CN201780002717.1A (patent CN110199512B)
Priority to CN202011475355.8A (patent CN112615917B)
Publication of WO2019127021A1
Priority to US16/912,377 (patent US11314454B2)

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10 - Address translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604 - Improving or facilitating administration, e.g. storage management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 - Improving I/O performance
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 - Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659 - Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0668 - Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 - Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0668 - Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 - In-line storage system
    • G06F3/0683 - Plurality of storage devices
    • G06F3/0688 - Non-volatile semiconductor memory arrays
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/65 - Details of virtual memory and virtual address translation
    • G06F2212/657 - Virtual address space management

Definitions

  • the present invention relates to the field of information technology, and in particular, to a storage device management method and apparatus in a storage system.
  • a storage device supporting the NVMe interface specification (hereinafter referred to as an NVMe storage device) is applied to a distributed storage system.
  • a distributed storage system typically includes a plurality of storage nodes, each of which contains one or more storage devices that support the NVMe interface specification.
  • when a client accesses a storage device in the distributed storage system, the client first needs to determine the storage node that processes the access request and establish communication with that storage node, and the storage node receives the access request.
  • the central processing unit (CPU) of the storage node parses the access request to obtain an access command and the address carried in the access request, determines the NVMe storage device corresponding to the access request and the storage address on that NVMe storage device, and sends the storage address and the access command to the corresponding NVMe storage device.
  • this operation process is the same as the process of accessing a storage device of a traditional non-NVMe interface specification, so the performance of the NVMe storage device cannot be fully utilized.
  • the application provides a management method and device for a storage device in a storage system.
  • a first aspect of the present application provides a management method for a storage device in a storage system, where the storage system includes a management server and a first storage node, the first storage node includes a first storage device of the NVMe interface specification, and the start address of the queue of the first storage device is located at a first memory address in the first memory of the first storage node.
  • the management server obtains a first queue message from the first storage node, where the first queue message includes the identifier of the first storage node and the first memory address; the management server establishes a first mapping relationship between the identifier of the first storage node and the first memory address; the management server receives a query request from a client, where the query request includes the identifier of the first storage node; and the management server sends, according to the first mapping relationship, a query request response to the client, where the query request response contains the first memory address.
  • because the management server establishes a mapping relationship between the identifier of the storage node and the start address of the queue, the client can obtain the queue information of the NVMe storage device from the management server and thus access the NVMe storage device directly, without the storage node CPU participating in processing the access request.
  • the method further includes: the management server establishes a lock identifier, and the lock identifier is used to lock the first memory address.
  • the management server can record the allocation of the queue. Further, the management server may count the load of the NVMe storage device according to the lock information. Further, the management server can also record the identity of the client that obtained the queue.
  • the storage system further includes a second storage node, where the second storage node includes a second storage device of the NVMe interface specification, and the start address of the queue of the second storage device is located at a second memory address in the second memory of the second storage node; the management server acquires a second queue message of the second storage device from the second storage node, where the second queue message includes the identifier of the second storage node and the second memory address; and the management server establishes a second mapping relationship between the identifier of the second storage node and the second memory address.
  • in this way, the management server can manage the queue information of the NVMe storage devices included in all storage nodes in the storage system.
  • the management server obtains a third queue message from the first storage node, where the third queue message includes the identifier of the first storage node and a third memory address; the management server establishes a third mapping relationship between the identifier of the first storage node and the third memory address; the start address of the queue of a third storage device of the NVMe interface specification is located at the third memory address in the first memory, where the third storage device is a newly added storage device of the first storage node.
  • the management server receives a queue message deletion message from the second storage node, where the queue message deletion message includes the second memory address, and the management server deletes the second mapping relationship.
  • the management server detects that the communication with the first storage node is interrupted, and the management server deletes the first mapping relationship.
  • that the management server detects that the communication with the first storage node is interrupted specifically means that the management server does not receive the heartbeat of the first storage node within a predetermined time.
  • a second aspect of the present application provides a management method for a storage device in a storage system, where the storage system includes a management server and a storage node, and the storage node includes a first storage device of the NVMe interface specification; the storage node allocates, in its memory, a first memory address to the start address of the queue of the first storage device; the storage node sends a first queue message to the management server, where the first queue message includes the identifier of the storage node and the first memory address.
  • the storage node sends the queue information of the storage device to the management server, so that the client can directly access the storage device by using the queue information.
  • the storage node detects that the first storage device is installed in the storage node.
  • the storage node detects that the first storage device is removed from the storage node, and the storage node sends a queue message deletion message to the management server, where the queue message deletion message contains the first memory address.
  • the third aspect of the present application provides a management method for a storage device in a storage system, where the storage system includes a management server and a first storage node, the first storage node includes a first storage device of the NVMe interface specification, and the start address of the queue of the first storage device is located at a first memory address in the first memory of the first storage node; the management server stores a first mapping relationship between the identifier of the first storage node and the first memory address.
  • the client sends a query request to the management server, where the query request includes the identifier of the first storage node; the client receives a query request response from the management server, where the query request response includes the first memory address determined by the management server according to the first mapping relationship.
  • the client queries the management server to obtain the queue information of the NVMe storage device in the storage node, so that the NVMe storage device can be directly accessed according to the queue information, and the storage node CPU does not need to participate, and the performance of the NVMe storage device is fully utilized.
  • the client sends a first remote direct memory access request to the first storage node; the first remote direct memory access request includes the first memory address.
  • the storage system further includes a second storage node, where the second storage node includes a second storage device of the NVMe interface specification, and the start address of the queue of the second storage device is located at a second memory address in the second memory of the second storage node; the management server stores a second mapping relationship between the identifier of the second storage node and the second memory address; the query request further includes the identifier of the second storage node; and the query request response includes the second memory address determined by the management server according to the second mapping relationship, where the first storage device and the second storage device form a stripe relationship.
  • the client sends a second remote direct memory access request to the second storage node, where the second remote direct memory access request includes the second memory address.
  • the fourth aspect of the present application further provides a management server, where the management server includes a plurality of units for performing the first aspect of the present application or any one of the first to sixth possible implementations of the first aspect.
  • the fifth aspect of the present application further provides a storage node, where the storage node includes a plurality of units for performing the second aspect of the present application or any one of the first to second possible implementations of the second aspect.
  • the sixth aspect of the present application further provides a client, where the client includes a plurality of units for performing the third aspect of the present application or any one of the first to second possible implementations of the third aspect.
  • the seventh aspect of the present application further provides a management server, which is applied to the storage system in the first aspect of the present application or any one of the first to sixth possible implementations of the first aspect, where the management server includes a processor and an interface, the processor communicates with the interface, and the processor is configured to perform the first aspect of the present application or any one of the first to sixth possible implementations of the first aspect.
  • the eighth aspect of the present application further provides a storage node, which is applied to the storage system in the second aspect of the present application or any one of the first to second possible implementations of the second aspect, where the storage node includes a processor and an interface, the processor communicates with the interface, and the processor is configured to perform the second aspect of the present application or any one of the first to second possible implementations of the second aspect.
  • the ninth aspect of the present application further provides a client, which is applied to the storage system in the third aspect of the present application or any one of the first to second possible implementations of the third aspect, where the client includes a processor and an interface, the processor communicates with the interface, and the processor is configured to perform the third aspect of the present application or any one of the first to second possible implementations of the third aspect.
  • a tenth aspect of the present application further provides a computer readable storage medium and a computer program product, the computer readable storage medium and the computer program product comprising computer instructions for implementing the aspects of the first aspect of the present application.
  • the eleventh aspect of the present application further provides a computer readable storage medium and a computer program product, the computer readable storage medium and the computer program product comprising computer instructions for implementing the aspects of the second aspect of the present application.
  • the twelfth aspect of the present application further provides a computer readable storage medium and a computer program product, the computer readable storage medium and the computer program product comprising computer instructions for implementing the aspects of the third aspect of the present application.
  • FIG. 1 is a schematic diagram of a distributed block storage system according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a storage node according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a partition view according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of an NVMe storage device according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a NVMe queue in a storage node according to an embodiment of the present invention.
  • FIG. 7 is a flowchart of a storage node sending queue information according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of storing queue information of an NVMe storage device by a management server according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of communication between a management server and a storage node according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of a management server storing NVMe storage device queue information according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a management server storing NVMe storage device queue information according to an embodiment of the present invention.
  • FIG. 12 is a schematic diagram of storage node storing NVMe storage device queue information according to an embodiment of the present invention.
  • FIG. 13 is a schematic flowchart of an access request processing process according to an embodiment of the present invention.
  • FIG. 14 is a schematic diagram of a distributed block storage system according to an embodiment of the present invention.
  • FIG. 15 is a schematic diagram of a client sending an RDMA write request to a storage class memory device according to an embodiment of the present invention.
  • FIG. 16 is a schematic structural diagram of a storage node according to an embodiment of the present invention.
  • FIG. 17 is a schematic structural diagram of a client according to an embodiment of the present invention.
  • FIG. 18 is a schematic structural diagram of a management server according to an embodiment of the present invention.
  • FIG. 19 is a schematic structural diagram of a management server according to an embodiment of the present invention.
  • FIG. 20 is a schematic structural diagram of a client according to an embodiment of the present invention.
  • FIG. 21 is a schematic structural diagram of a client according to an embodiment of the present invention.
  • the storage system in the embodiment of the present invention can be applied to a storage array (such as the Huawei OceanStor 18000 series and V3 series), a distributed file storage system (such as the Huawei 9000 series), a distributed block storage system (such as a Huawei distributed block storage system), a distributed object storage system, or a distributed storage system that supports the Log-Structure interface.
  • the embodiment of the present invention is described by taking a distributed block storage system, such as a Huawei distributed block storage system, as an example.
  • the distributed block storage system includes a plurality of storage nodes, such as a storage node 1, a storage node 2, a storage node 3, a storage node 4, a storage node 5, and a storage node 6, and the storage nodes communicate with each other through an InfiniBand or an Ethernet network.
  • the number of storage nodes in the distributed block storage system may be increased according to actual requirements, which is not limited by the embodiment of the present invention.
  • Each storage node contains one or more NVMe storage devices, such as the Solid State Disk (SSD) of the NVMe interface specification.
  • the NVMe storage device in the embodiment of the present invention is not limited to an SSD.
  • the storage node contains the NVMe storage device; alternatively, the NVMe storage device may be placed outside the storage node in the form of Just a Bunch Of Disks (JBOD).
  • the storage node has the structure shown in FIG. 2. as shown in FIG. 2, the storage node includes a central processing unit (CPU) 201, a memory 202, and an interface card 203.
  • the memory 202 stores computer instructions, and the CPU 201 executes the program instructions in the memory 202 to perform corresponding operations.
  • the interface card 203 can be a network interface card (NIC), an InfiniBand protocol interface card, or the like.
  • in addition, a field programmable gate array (FPGA) or other hardware may perform the above-mentioned corresponding operations instead of the CPU 201, or the FPGA or other hardware may perform the above-mentioned corresponding operations together with the CPU 201.
  • the embodiment of the present invention collectively refers to the combination of the CPU 201 and the memory 202, the FPGA or other hardware that replaces the CPU 201, and the combination of the CPU 201 with the FPGA or other hardware that replaces part of the functions of the CPU 201, as a processor.
  • the client may be a device independent of the device shown in FIG. 2, such as a server, a mobile device, or the like, or may be a virtual machine (VM).
  • the client runs the application, where the application can be a VM or a container, or it can be a specific application, such as office software.
  • the client writes data to or reads data from the distributed block storage system.
  • the structure of the client can be referred to FIG. 2 and the related description.
  • the distributed block storage system program is loaded in the memory 202 of the storage node, and the CPU 201 executes the distributed block storage system program in the memory 202 to provide a block protocol access interface to the client and to provide a distributed block storage access point service for the client.
  • the client accesses the storage resource in the storage resource pool in the distributed block storage system.
  • the block protocol access interface is used to provide logical units to clients.
  • the hash space (such as 0 to 2^32) is divided into N equal parts, each of which is one partition, and the N partitions are equalized according to the number of hard disks.
  • N defaults to 3600, that is, the partitions are P1, P2, P3 ... P3600, respectively.
  • each NVMe storage device carries 200 partitions.
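  • as a minimal, hypothetical sketch of the partitioning described above (the hash function and the round-robin assignment policy are illustrative assumptions, not taken from this application), the partition for a key and a balanced placement of partitions over NVMe storage devices could look as follows:

```python
# A minimal sketch of the hash-space partitioning described above.
# Assumptions (not from this application): zlib.crc32 as the hash function
# and round-robin assignment of partitions to NVMe storage devices.
import zlib

HASH_SPACE = 2 ** 32          # hash space 0 ~ 2^32
N_PARTITIONS = 3600           # N defaults to 3600: partitions P1 .. P3600
PARTITIONS_PER_DEVICE = 200   # each NVMe storage device carries 200 partitions
DEVICE_COUNT = N_PARTITIONS // PARTITIONS_PER_DEVICE   # 18 devices in this example

def partition_of(key: bytes) -> int:
    """Map a key into one of the N equal parts (partitions) of the hash space."""
    h = zlib.crc32(key)                        # unsigned value in 0 .. 2^32 - 1
    return h * N_PARTITIONS // HASH_SPACE + 1  # partition number 1 .. 3600

def device_of(partition: int) -> int:
    """Round-robin placement of partitions over the NVMe storage devices (assumed policy)."""
    return partition % DEVICE_COUNT

p = partition_of(b"volume1/block42")
print(p, device_of(p))
```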
  • the partition P includes a total of M NVMe storage devices respectively distributed on the M storage nodes, and the M NVMe storage devices in the partition form a stripe relationship.
  • the stripe relationship can be multiple copies or Erasure Coding (EC).
  • the mapping between the partition and the NVMe storage devices, that is, the mapping relationship between the partition and the NVMe storage devices included in the partition, is also referred to as the partition view.
  • taking the case where the partition includes four NVMe storage devices as an example, the partition view is P2 - (storage node N1 - NVMe storage device 1) - (storage node N2 - NVMe storage device 2) - (storage node N3 - NVMe storage device 3) - (storage node N4 - NVMe storage device 4). that is, NVMe storage device 1, NVMe storage device 2, NVMe storage device 3, and NVMe storage device 4 constitute a stripe relationship. when each storage node contains only one NVMe storage device, the partition view shown in FIG. 3 can also be represented as P2 - storage node N1 - storage node N2 - storage node N3 - storage node N4.
  • the management server allocates the partition view when the distributed block storage system is initialized, and subsequently adjusts the partition view as the number of NVMe storage devices in the distributed block storage system changes.
  • the structure of the management server can refer to the structure shown in FIG. 2. to facilitate client access and reduce the access pressure on the management server, in one implementation the management server sends the partition view to the client.
  • the distributed block storage system provides the client with a volume as a storage resource. specifically, the distributed block storage system organizes the logical addresses of the NVMe storage devices into a resource pool and provides data access for the client; the storage address in the access request received by the client is mapped to a logical address of an NVMe storage device, that is, the data block in which the storage address of the client access request is located is mapped to a logical address provided by the NVMe storage device.
  • the client sends an access request to access the volume, such as a write request, where the write request carries the storage address and data.
  • the storage address is a Logical Block Address (LBA).
  • the client determines the NVMe storage device that allocates storage space for the data block by querying the partition view in the management server or the partition view saved locally by the client. for example, suppose the data block size is 1024 bytes and the data block numbering in the volume starts at 0.
  • the write request contains a storage address whose write start address is byte 1032 and whose size is 64 bytes.
  • the write request is therefore located in the data block numbered 1 (1032/1024), and the offset within that data block is 8 (1032 % 1024).
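  • the arithmetic of the example above can be sketched as follows (the 1024-byte block size and the sample request come from the example; the function name is a hypothetical illustration):

```python
BLOCK_SIZE = 1024  # data block size in bytes; block numbering in the volume starts at 0

def locate_block(storage_address: int) -> tuple[int, int]:
    """Return (data block number, offset inside that block) for a storage address."""
    return storage_address // BLOCK_SIZE, storage_address % BLOCK_SIZE

# write start address 1032, size 64 bytes -> data block 1, internal offset 8
print(locate_block(1032))   # (1, 8)
```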
  • the first data block is distributed in the partition P2, and NVMe storage device 1, NVMe storage device 2, NVMe storage device 3, and NVMe storage device 4 constitute a stripe relationship in a multi-copy mode.
  • the client queries the partition view to determine the logical addresses of the NVMe storage devices to which the storage address included in the write request is mapped; the storage address is mapped to the logical addresses L1, L2, L3, and L4 of NVMe storage device 1, NVMe storage device 2, NVMe storage device 3, and NVMe storage device 4, respectively.
  • the client queries, according to the partition view, the primary storage node in the partition view, such as storage node N1, and storage node N1 provides L1, L2, L3, and L4 to the client.
  • the client determines the NVMe storage devices that provide the logical addresses, and obtains the memory addresses, in the memory, at which the queues of NVMe storage device 1, NVMe storage device 2, NVMe storage device 3, and NVMe storage device 4 start (hereinafter referred to as the start address of the queue).
  • the specific implementation in which the client obtains the start address of the queue of an NVMe storage device is described below.
  • the client sends an RDMA write request to the NVMe storage device 1, the NVMe storage device 2, the NVMe storage device 3, and the NVMe storage device 4, respectively.
  • the RDMA write request sent by the client to NVMe storage device 1 includes the logical address L1 and the start address of the queue of NVMe storage device 1, and also includes the data to be written to L1 from the write request received by the client; the RDMA write request sent by the client to NVMe storage device 2 includes the logical address L2 and the start address of the queue of NVMe storage device 2, and also includes the data to be written to L2; the RDMA write request sent by the client to NVMe storage device 3 includes the logical address L3 and the start address of the queue of NVMe storage device 3, and also includes the data to be written to L3; and the RDMA write request sent by the client to NVMe storage device 4 includes the logical address L4 and the start address of the queue of NVMe storage device 4, and also includes the data to be written to L4.
  • specifically, the client sends the RDMA write requests to the interface card of storage node 1 where NVMe storage device 1 is located, the interface card of storage node 2 where NVMe storage device 2 is located, the interface card of storage node 3 where NVMe storage device 3 is located, and the interface card of storage node 4 where NVMe storage device 4 is located, respectively.
  • the partition view is P2 - (storage node N1 - NVMe storage device 1) - (storage node N2 - NVMe storage device 2) - (storage node N3 - NVMe storage device 3) - (storage node N4 - NVMe storage device 4) - (storage node N5 - NVMe storage device 5) - (storage node N6 - NVMe storage device 6).
  • the first data block is distributed in the partition P2; NVMe storage device 1, NVMe storage device 2, NVMe storage device 3, NVMe storage device 4, NVMe storage device 5, and NVMe storage device 6 constitute an EC relationship, where NVMe storage device 1, NVMe storage device 2, NVMe storage device 3, and NVMe storage device 4 are the storage devices that store data fragments, and NVMe storage device 5 and NVMe storage device 6 are the storage devices that store parity fragments.
  • the length of the EC stripe is 12 kilobytes (KB), and the length of each data fragment and each parity fragment is 2 KB.
  • in the first stripe, NVMe storage device 1, NVMe storage device 2, NVMe storage device 3, and NVMe storage device 4 respectively store the data fragments whose storage addresses are 0 to 2 KB-1, 2 KB to 4 KB-1, 4 KB to 6 KB-1, and 6 KB to 8 KB-1, and NVMe storage device 5 and NVMe storage device 6 respectively store the parity fragments of the first stripe.
  • in the second stripe, NVMe storage device 1, NVMe storage device 2, NVMe storage device 3, and NVMe storage device 4 respectively store the data fragments whose storage addresses are 8 KB to 10 KB-1, 10 KB to 12 KB-1, 12 KB to 14 KB-1, and 14 KB to 16 KB-1, and NVMe storage device 5 and NVMe storage device 6 respectively store the parity fragments of the second stripe.
  • the client receives a write request, and the storage address SA included in the write request is 0-8 KB-1.
  • the client queries the partition view and determines that the logical address of NVMe storage device 1 corresponding to the first data fragment (storage address 0 to 2 KB-1) is L1, the logical address of NVMe storage device 2 corresponding to the second data fragment (storage address 2 KB to 4 KB-1) is L2, the logical address of NVMe storage device 3 corresponding to the third data fragment (storage address 4 KB to 6 KB-1) is L3, the logical address of NVMe storage device 4 corresponding to the fourth data fragment (storage address 6 KB to 8 KB-1) is L4, the logical address of NVMe storage device 5 corresponding to the first parity fragment of the first stripe is L5, and the logical address of NVMe storage device 6 corresponding to the second parity fragment of the first stripe is L6.
  • that is, the storage address SA is mapped to L1, L2, L3, L4, L5, and L6, respectively.
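  • a minimal sketch of the EC layout in this example (12 KB stripes, 2 KB fragments, four data devices and two parity devices, as described above; the function and its return format are illustrative assumptions):

```python
FRAGMENT = 2 * 1024                      # each data or parity fragment is 2 KB
DATA_DEVICES = 4                         # NVMe storage devices 1..4 hold data fragments
STRIPE_DATA = FRAGMENT * DATA_DEVICES    # 8 KB of data per 12 KB stripe (plus 2 parity fragments)

def data_fragment_location(storage_address: int) -> tuple[int, int]:
    """Map a storage address to (stripe index, data device number 1..4)."""
    stripe = storage_address // STRIPE_DATA
    device = storage_address % STRIPE_DATA // FRAGMENT + 1
    return stripe, device

print(data_fragment_location(0))         # (0, 1): first stripe, NVMe storage device 1
print(data_fragment_location(6 * 1024))  # (0, 4): first stripe, NVMe storage device 4
print(data_fragment_location(9 * 1024))  # (1, 1): second stripe, NVMe storage device 1
```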
  • the client determines the NVMe storage devices that provide the logical addresses and obtains the start addresses of the queues of NVMe storage device 1, NVMe storage device 2, NVMe storage device 3, NVMe storage device 4, NVMe storage device 5, and NVMe storage device 6; the specific implementation in which the client obtains the start address of the queue of an NVMe storage device is described below.
  • the client transmits an RDMA request to the NVMe storage device 1, the NVMe storage device 2, the NVMe storage device 3, the NVMe storage device 4, the NVMe storage device 5, and the NVMe storage device 6, respectively.
  • the RDMA write request sent by the client to NVMe storage device 1 includes the logical address L1 and the start address of the queue of NVMe storage device 1, and also includes the first data fragment to be written to L1; the RDMA write request sent by the client to NVMe storage device 2 includes the logical address L2 and the start address of the queue of NVMe storage device 2, and also includes the second data fragment to be written to L2; the RDMA write request sent by the client to NVMe storage device 3 includes the logical address L3 and the start address of the queue of NVMe storage device 3, and also includes the third data fragment to be written to L3; the RDMA write request sent by the client to NVMe storage device 4 includes the logical address L4 and the start address of the queue of NVMe storage device 4, and also includes the fourth data fragment to be written to L4; the RDMA write request sent by the client to NVMe storage device 5 includes the logical address L5 and the start address of the queue of NVMe storage device 5, and also includes the first parity fragment; and the RDMA write request sent by the client to NVMe storage device 6 includes the logical address L6 and the start address of the queue of NVMe storage device 6, and also includes the second parity fragment.
  • the client uses the RDMA write request to directly write the data to the start address of the queue of the NVMe storage device in the memory of the storage node, without requiring the storage node CPU to participate in processing the write request sent by the client, fully exploiting the performance of the NVMe storage devices and improving the write performance of the storage system.
  • the distributed block storage system provides the client with a volume as a storage resource. specifically, the distributed block storage system organizes the logical addresses of the NVMe storage devices into a resource pool and provides data access for the client, that is, the data block in which the storage address in the client access request is located is mapped to the logical address provided by the NVMe storage device, without requiring a partition view.
  • the mapping of the data block to the logical address provided by the NVMe storage device can be represented as: data block -> NVMe storage device -> logical address.
  • the NVMe storage device includes an NVMe controller 501 and a storage medium 502.
  • three key components are defined in the NVMe interface specification for access request and data processing: the Submission Queue (SQ), the Completion Queue (CQ), and the Doorbell Register (DB).
  • SQ is used to store an access request sent by the client
  • CQ is used to store the result of the NVMe storage device processing the access request.
  • SQ and CQ exist in the form of queue pairs. As shown in FIG. 6, SQ and CQ are located in the memory 202 of the storage node. In the embodiment of the present invention, a pair of SQ and CQ are referred to as a queue.
  • the number of queues that NVMe storage devices use to process access requests can be up to 65,535.
  • Both SQ and CQ are ring queues.
  • the NVMe controller obtains the pending access request from the head of the SQ queue, and the SQ tail is used to store the latest access request sent by the client.
  • the client obtains the access result from the CQ header, and the tail of the CQ queue is used to store the result of the latest processing access request of the NVMe controller.
  • when the client uses an RDMA request to access the NVMe storage device, the client needs to obtain the start addresses of the SQ and the CQ of the NVMe storage device, that is, the memory addresses in the memory 202 at which the SQ and the CQ respectively start.
  • each SQ or CQ has two corresponding registers, a head register and a tail register.
  • the head register is also called the head doorbell (DoorBell, DB), and the tail register is also called the tail DB. the role of the DB is described in the following embodiments.
  • the NVMe storage device is connected to the storage node. After the storage node is started, the NVMe storage device registers with the storage node. As shown in Figure 7, the storage node performs the following process:
  • Step 701 The storage node allocates a memory address for the queue in the memory 202.
  • the storage node allocates a memory address for the queue in the memory 202, including allocating a memory address in the memory for the start address of the queue.
  • Step 702 The storage node sends a queue message to the management server; the queue message includes an identifier of the storage node and a start address of the queue.
  • the storage node locally establishes a mapping relationship between the identifier of the storage node and the start address of the queue.
  • the storage node may include one or more NVMe storage devices, and the queue message may further include an identifier of the NVMe storage device to distinguish the starting address of the queue of the different NVMe storage devices of the storage node.
  • when the storage node contains only one NVMe storage device, the queue message may contain only the identifier of the storage node and the start address of the queue.
  • when a new NVMe storage device is connected to the storage node, the storage node also performs the flow shown in FIG. 7.
  • the storage node detects that the NVMe storage device is removed from the storage node, and the storage node sends a queue information deletion message to the management server, where the queue information deletion message includes the starting address of the queue.
  • removal of the NVMe storage device from the storage node may specifically include physical removal or NVMe storage device failure.
  • the storage node can monitor the driver of the NVMe storage device to determine whether the NVMe storage device has been removed from the storage node.
  • the management server obtains queue information from the storage node. Specifically, the management server may send a request to the storage node to instruct the storage node to report the queue information, or the storage node sends the queue information to the management server.
  • the queue information includes an identifier of the storage node and a start address of the queue allocated by the storage node to the queue of the NVMe storage device.
  • the management server establishes a mapping relationship between the identifier of the storage node and the start address of the queue.
  • when the storage node includes a plurality of NVMe storage devices, in order to distinguish the start addresses of the queues of the different NVMe storage devices of the same storage node, the queue information also includes the identifier of the NVMe storage device, and the mapping relationship established by the management server further includes the identifier of the NVMe storage device.
  • the above mapping relationship is stored as an entry in a table as shown in FIG. 8.
  • the mapping relationship can be stored using an entry structure as shown in FIG. 8, or other data structure that can embody the relationship between the identifier and the address.
  • N1, N2, and N3 represent the identifiers of the storage nodes
  • D11 and D12 respectively represent the identifiers of the NVMe storage devices in the storage node 1
  • Add1 represents the start address of the queue of the NVMe storage devices identified as D11.
  • in one implementation, Add1 represents the start address of the SQ of the NVMe storage device; in another implementation, Add1 can also represent the start address of the queue of the NVMe storage device (the start address of the SQ and the start address of the CQ).
  • the meanings of other items in the table shown in FIG. 8 can be referred to the above expression.
  • the mapping relationship between the identifier of the storage node and the start address of the queue is used as an entry of the table shown in FIG.
  • each storage node has a unique identifier, which may be a number assigned by the management server to each storage node, or may be hardware information of the storage node, such as interface card hardware information, or may be address information of the storage node, such as an Internet Protocol (IP) address.
  • the identifier of the NVMe storage device may be the hardware information of the NVMe storage device or the internal number of the storage node where the NVMe storage device is located.
  • for example, the NVMe storage devices in storage node 1 may be identified as D11 and D12, or represented as N1 + NVMe device number, for example, N1+1, N1+2, and so on.
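  • a minimal sketch of the mapping relationship maintained by the management server, as described above (a plain dictionary keyed by storage node identifier and NVMe storage device identifier; the class and method names are hypothetical and the address value is only illustrative):

```python
class QueueDirectory:
    """Mapping of (storage node id, NVMe device id) -> start address of the queue,
    mirroring entries such as N1 / D11 / Add1 in the table of FIG. 8."""

    def __init__(self):
        self.entries = {}   # (node_id, device_id) -> start address of the queue
        self.locks = {}     # (node_id, device_id) -> identifier of the client holding the queue

    def register(self, node_id, device_id, start_address):
        """Handle a queue message reported by a storage node."""
        self.entries[(node_id, device_id)] = start_address

    def query(self, node_id, device_id, client_id):
        """Answer a client query and lock the queue for that client."""
        address = self.entries[(node_id, device_id)]
        self.locks[(node_id, device_id)] = client_id
        return address

    def remove_device(self, node_id, device_id):
        """Handle a queue information deletion message (device removed or failed)."""
        self.entries.pop((node_id, device_id), None)
        self.locks.pop((node_id, device_id), None)

    def remove_node(self, node_id):
        """Drop every mapping of a storage node, e.g. when its heartbeat is lost."""
        for key in [k for k in self.entries if k[0] == node_id]:
            self.entries.pop(key, None)
            self.locks.pop(key, None)

directory = QueueDirectory()
directory.register("N1", "D11", 0x1000)              # Add1 (illustrative value)
print(hex(directory.query("N1", "D11", "client1")))  # -> 0x1000
```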
  • the management server establishes a mapping relationship between the identifier of the storage node and the start address of the queue, and the client can obtain the information of the queue of the NVMe storage device from the management server, so that the NVMe storage device can be directly accessed without the participation of the storage node CPU, fully exploiting the performance of the NVMe storage device.
  • as shown in FIG. 9, the management server communicates with the storage nodes to determine whether a storage node is normal, for example by periodically receiving a heartbeat from the storage node. if the management server does not receive the heartbeat of a storage node within a predetermined time, the management server determines that the communication with the storage node is interrupted and that the storage node fails. for example, if the management server does not receive the heartbeat sent by storage node 1 shown in FIG. 9 within the predetermined time, the management server determines that the communication with storage node 1 is interrupted and that storage node 1 fails.
  • the management server deletes the entries related to storage node 1 recorded in the table, that is, the mapping information related to storage node 1, as shown in FIG. 10.
  • the flow shown in Fig. 7 is executed.
  • when an NVMe storage device in a storage node is removed, the management server receives a queue information deletion message from the storage node, where the queue information deletion message includes the start address of the queue.
  • for example, the NVMe storage device identified as D11 is removed from storage node 1, and storage node 1 sends a queue information deletion message to the management server; the management server deletes all the mapping relationships containing the identifier D11, as shown in FIG. 11.
  • when the storage node includes multiple NVMe storage devices, in order to distinguish the start addresses of the queues of the NVMe storage devices of the same storage node, the queue information also includes the identifier of the NVMe storage device, and the mapping relationship established by the management server further includes the identifier of the NVMe storage device.
  • the queue information deletion message sent by the storage node to the management server includes an identifier of the NVMe storage device, and the management server deletes the mapping relationship including the identifier of the NVMe storage device.
  • the storage node 1 allocates a memory address for the queue of the new NVMe storage device in the memory, and the process shown in FIG. 7 is executed, and details are not described herein again.
  • the storage node stores the table shown in FIG. 12 locally, and records the mapping relationship between the identifier of the NVMe storage device in the storage node and the start address of the queue of the NVMe storage device.
  • the tables of FIG. 8 to FIG. 12 are exemplary data structures.
  • the mapping relationship may be implemented in multiple manners, for example as an index, and further as a multi-level index. for example, in the table shown in FIG. 8, the first-level index is the storage node identifier, which is used to find the identifier of the corresponding NVMe storage device, and the second-level index is the identifier of the NVMe storage device, which is used to find the start address of the queue of the NVMe storage device.
  • the client accesses the storage system and determines, according to the storage address included in the access request, the logical address of the NVMe storage device corresponding to the storage address; for example, the client sends a write request.
  • in a specific implementation, the client may consult the partition view saved locally by the client or query the partition view in the management server.
  • the client determines that the storage address corresponds to the logical address L1 of NVMe storage device 1 in storage node 1, the logical address L2 of NVMe storage device 2 in storage node 2, the logical address L3 of NVMe storage device 3 in storage node 3, and the logical address L4 of NVMe storage device 4 in storage node 4; that is, NVMe storage device 1, NVMe storage device 2, NVMe storage device 3, and NVMe storage device 4 constitute a multi-copy stripe relationship.
  • the client sends a query request to the management server to obtain the start address of the queue of each NVMe storage device that provides a logical address, where the query request includes the identifier of the storage node.
  • the query request further includes an identifier of the NVMe storage device.
  • the query request includes the following sets of identifiers: the identifier N1 of storage node 1 and the identifier D11 of NVMe storage device 1, the identifier N2 of storage node 2 and the identifier D21 of NVMe storage device 2, the identifier N3 of storage node 3 and the identifier D31 of NVMe storage device 3, and the identifier N4 of storage node 4 and the identifier D41 of NVMe storage device 4.
  • the management server receives the query request from the client, queries the mapping relationship recorded in the entry, and sends a query request response to the client.
  • the query request response includes the start address Add1 of the queue of NVMe storage device 1 of storage node 1, the start address Addk of the queue of NVMe storage device 2 of storage node 2, the start address Addy of the queue of NVMe storage device 3 of storage node 3, and the start address Addz of the queue of NVMe storage device 4 of storage node 4.
  • the start address of the queue included in the query request response sent by the management server to the client includes the start address of the SQ, and may also include the start address of the CQ.
  • the client communicates with the management server, and can obtain the starting address of multiple NVMe storage device queues at one time, reducing the number of communication interactions.
  • the management server establishes a lock identifier, which is used to lock the start addresses Add1, Addk, Addy, and Addz. once the start address of a queue is locked, it indicates that the queue has been assigned to the client. therefore, it can also be said that the lock identifier is used to lock the mapping relationship between the identifier of the storage node and the start address of the queue.
  • the lock identifier may be a flag bit, for example, 0 for lock and 1 for unlock.
  • the lock identifier can be recorded in the entry shown in FIG. 8.
  • the management server can also record the identity of the client that obtained the queue.
  • the identifier may be a number assigned by the management server to each client, or may be hardware information of the client, such as interface card hardware information, or may be client address information, such as an IP address.
  • the management server may calculate the load of the NVMe storage devices according to the lock information of the start addresses of the queues of the NVMe storage devices, and dynamically determine the mapping relationship between the storage address and the logical address of the NVMe storage device according to the load of the NVMe storage devices.
  • that is, the mapping between the storage address and the logical address of the NVMe storage device is not bound when the storage system is initialized; instead, the logical address of the NVMe storage device to which the storage address in the write request is mapped is determined when the client receives the write request.
  • in one implementation, when the management server is queried for the logical address of the NVMe storage device to which the storage address in the write request is mapped, the management server determines the logical address of the NVMe storage device according to the load of the NVMe storage devices in the storage system; in another implementation, the management server determines the partition view, such as the mapping relationship between partitions and storage nodes, according to the load of the NVMe storage devices in the storage system. specifically, the load of an NVMe storage device may be counted according to the lock information of the start address of the queue of the NVMe storage device.
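  • the load-based choice described above can be sketched as follows (counting, per NVMe storage device, how many of its queue start addresses are currently locked and preferring the least-loaded device; the selection policy and names are illustrative assumptions):

```python
from collections import Counter

def device_load(locked_queues):
    """locked_queues: iterable of (node_id, device_id) pairs whose queue start address is locked."""
    return Counter(device for _node, device in locked_queues)

def pick_least_loaded(candidate_devices, locked_queues):
    """Choose, among candidate NVMe storage devices, the one with the fewest locked queues."""
    load = device_load(locked_queues)
    return min(candidate_devices, key=lambda device: load.get(device, 0))

locked = [("N1", "D11"), ("N1", "D11"), ("N2", "D21")]
print(pick_least_loaded(["D11", "D21", "D31"], locked))   # -> "D31"
```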
  • the client sends an RDMA write request to the storage node 1, the storage node 2, the storage node 3, and the storage node 4, respectively.
  • the RDMA write request sent to storage node 1 includes L1 and Add1; the RDMA write request sent to storage node 2 includes L2 and Addk; the RDMA write request sent to storage node 3 includes L3 and Addy; and the RDMA write request sent to storage node 4 includes L4 and Addz.
  • the embodiment of the present invention is described by taking an example in which a client sends an RDMA write request to the storage node 1.
  • the client sends the RDMA write request to the interface card of storage node 1; the interface card of storage node 1 receives the RDMA write request and writes the logical address L1 and the data to be written to L1 from the write request received by the client to the memory address Add1 of storage node 1.
  • as shown in FIG. 13, the SQ and the CQ are initially empty, that is, the initial head address and tail address of the SQ are the same, and the initial head address and tail address of the CQ are the same.
  • the specific operation process is shown in Figure 13:
  • 1. the client sends the write request to the tail of the SQ.
  • the client sends an RDMA write request to the storage node 1.
  • the RDMA write request includes L1 and the start address Add1 of the queue, and also includes data written to L1 in the write request received by the client.
  • the interface card of storage node 1 receives the RDMA write request and acquires L1 and Add1, for example, Add1 is 0.
  • the start address Add1 of the queue includes the start address of the SQ, and may further include the start address of the CQ.
  • the initial value of the SQ tail DB is 0, the initial value of the CQ tail DB is 0.
  • the interface card of storage node 1 writes L1 and the data to be written to L1 from the write request received by the client to the start address of the SQ; the RDMA write request is one command.
  • 2. the client updates the tail DB of the SQ.
  • the client writes one RDMA write request to SQ, and the tail of SQ becomes 1.
  • when the client writes an RDMA write request command to the SQ, it updates the SQ tail DB in the NVMe controller with a value of 1.
  • when the client updates the SQ tail DB, it also informs the NVMe controller that there is a write request to be executed.
  • 3. the NVMe controller fetches the write request from the SQ and executes it.
  • after the client updates the SQ tail DB, the NVMe controller receives the notification, acquires the write request from the SQ, and executes the write request.
  • 4. the NVMe controller updates the SQ head DB.
  • after the NVMe controller executes the write request in the SQ, the head of the SQ also becomes 1, and the NVMe controller writes this new head of the SQ, 1, to the SQ head DB.
  • 5. the NVMe controller writes the write request execution result to the CQ.
  • the NVMe controller executes a write request and writes the write request execution result to the end of the CQ.
  • 6. the client updates the tail DB of the CQ.
  • the NVMe controller executes a write request, writes a write request execution result to the CQ tail, and updates the CQ tail DB with a value of 1.
  • 7. the client obtains the result of the write request execution.
  • the client can obtain the write request execution result from the CQ by using a polling manner.
  • 8. the client updates the head DB of the CQ.
  • the client writes the new head of the CQ, whose value is 1, to the head DB of the CQ.
  • steps 2 and 6 in FIG. 13 may also be implemented by an interface card of the storage node.
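  • the eight steps of FIG. 13 can be summarized with the following self-contained sketch (the SQ, CQ, and doorbell registers are modeled with plain Python objects; the RDMA transport, polling, and queue wrap-around are deliberately abstracted away, and all names are hypothetical):

```python
# A toy model of the FIG. 13 flow for a single write request.
class ToyNvmeController:
    def __init__(self):
        self.sq, self.cq = [], []                  # SQ and CQ in the storage node's memory
        self.sq_tail_db = self.sq_head_db = 0      # doorbell registers in the controller
        self.cq_tail_db = self.cq_head_db = 0

    def process(self):
        # 3. fetch the pending request from the head of the SQ and execute it
        request = self.sq[self.sq_head_db]
        result = f"done:{request}"
        # 4. update the SQ head DB
        self.sq_head_db += 1
        # 5. write the execution result to the tail of the CQ
        self.cq.append(result)
        # 6. update the CQ tail DB
        self.cq_tail_db = len(self.cq)

def client_write(controller, write_request):
    # 1. send the write request to the tail of the SQ (an RDMA write in the real system)
    controller.sq.append(write_request)
    # 2. update the SQ tail DB, which notifies the controller of a pending request
    controller.sq_tail_db = len(controller.sq)
    controller.process()                           # controller side: steps 3 to 6
    # 7. poll the CQ and obtain the execution result
    completion = controller.cq[controller.cq_head_db]
    # 8. update the CQ head DB to acknowledge the consumed completion
    controller.cq_head_db += 1
    return completion

print(client_write(ToyNvmeController(), "write L1"))   # -> done:write L1
```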
  • the storage node records the mapping relationship between the identifier of the NVMe storage device in the storage node and the start address of the queue of the NVMe storage device.
  • The client accesses the storage system and, according to the storage address included in the access request, determines the logical address of the NVMe storage device to which the storage address corresponds. For details, see the preceding description.
  • For example, the client determines that the storage address corresponds to the logical address L1 of NVMe storage device 1 in storage node 1, the logical address L2 of NVMe storage device 2 in storage node 2, the logical address L3 of NVMe storage device 3 in storage node 3, and the logical address L4 of NVMe storage device 4 in storage node 4; that is, NVMe storage device 1, NVMe storage device 2, NVMe storage device 3, and NVMe storage device 4 form a multi-copy stripe relationship.
  • In another implementation, the client sends a query request to storage node 1, storage node 2, storage node 3, and storage node 4 respectively, where the query request includes the identifier of the NVMe storage device.
  • With reference to FIG. 12, storage node 1 receives the query request from the client, looks up the mapping relationship recorded in its entries, and sends a query request response to the client.
  • The response includes the start address Add1 of the queue of NVMe storage device 1 in storage node 1.
  • Storage node 2, storage node 3, and storage node 4 each perform the query operation according to the query request and send a query request response to the client.
  • Still taking storage node 1 as an example, storage node 1 establishes a lock identifier, where the lock identifier is used to lock the start address Add1 of the queue. Once the start address of the queue is locked, it indicates that the queue has been assigned to the client.
  • In a specific implementation, the lock identifier may be a flag bit, for example, 0 indicating locked and 1 indicating unlocked.
  • The lock identifier may be recorded in the entry shown in FIG. 12.
  • In this implementation, the client sends the query request to the storage node, which can reduce the load on the management server.
  • The client receives the query request responses from the storage nodes and performs the operations shown in FIG. 13.
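A possible shape for the per-device entry and the lock flag described above is sketched below in C. The entry layout, the 0/1 lock convention, and the function names are assumptions made for illustration only; they are not the claimed data structures.

```c
/* Sketch of a storage node's local table that maps an NVMe device
 * identifier to the start address of its queue, plus the lock flag
 * used to mark a queue as already assigned to a client. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LOCKED   0
#define UNLOCKED 1

struct queue_entry {
    char     device_id[8];   /* e.g. "D11"                            */
    uint64_t queue_start;    /* start address of the queue, e.g. Add1 */
    int      lock;           /* 0 = locked (assigned), 1 = unlocked   */
};

static struct queue_entry table[] = {
    { "D11", 0x1000, UNLOCKED },
    { "D12", 0x2000, UNLOCKED },
};

/* Handle a client query: look up the device, lock its queue, and
 * return the queue start address in the query response. */
static int handle_query(const char *device_id, uint64_t *queue_start)
{
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strcmp(table[i].device_id, device_id) == 0 &&
            table[i].lock == UNLOCKED) {
            table[i].lock = LOCKED;            /* queue now assigned */
            *queue_start = table[i].queue_start;
            return 0;
        }
    }
    return -1;   /* unknown device or queue already assigned */
}

int main(void)
{
    uint64_t addr;
    if (handle_query("D11", &addr) == 0)
        printf("query response: queue start address 0x%llx\n",
               (unsigned long long)addr);
    return 0;
}
```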
  • In another implementation of this embodiment of the present invention, the NVMe storage devices form an EC stripe relationship.
  • The client can also access the NVMe storage devices in either of the foregoing two modes, and details are not described here again.
  • In this embodiment of the present invention, after the client obtains the start address of the queue of an NVMe storage device and before it releases that queue, the client sends RDMA access requests to the queue of the NVMe storage device according to the number of access requests it has received, the start address of the queue, and the resulting changes to the position in the queue, and obtains the access request execution results from the CQ of the NVMe storage device.
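The bookkeeping described in the preceding sentence can be pictured with the small C sketch below: the client derives the address for its next RDMA write from the queue start address it obtained and the number of requests already submitted. The fixed queue depth, the entry size, and the helper name are assumptions for the sketch and not part of the embodiment.

```c
/* Sketch of how a client could compute the memory address for its next
 * RDMA write from the queue start address and the count of requests it
 * has already submitted. Depth and entry size are assumed values. */
#include <stdint.h>
#include <stdio.h>

#define SQ_DEPTH      64
#define SQ_ENTRY_SIZE 64u   /* bytes per submission queue entry (assumed) */

struct client_queue_view {
    uint64_t sq_start;        /* Add1 returned in the query response      */
    uint32_t requests_sent;   /* requests submitted since queue obtained  */
};

/* Address inside the SQ where the next RDMA write request should land. */
static uint64_t next_slot_address(const struct client_queue_view *v)
{
    return v->sq_start + (uint64_t)(v->requests_sent % SQ_DEPTH) * SQ_ENTRY_SIZE;
}

int main(void)
{
    struct client_queue_view v = { .sq_start = 0x1000, .requests_sent = 0 };
    for (int i = 0; i < 3; i++) {
        printf("request %d -> SQ slot at 0x%llx\n",
               i, (unsigned long long)next_slot_address(&v));
        v.requests_sent++;
    }
    return 0;
}
```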
  • In this embodiment of the present invention, the client can send RDMA requests directly to the NVMe storage devices of the storage nodes in the storage system without requiring the CPU of the storage node to participate, thereby fully exploiting the performance of the NVMe storage devices.
  • While NVMe storage devices are applied in this embodiment of the present invention, storage systems may, as storage devices develop, adopt several similar kinds of storage device to improve performance. For example, storage class memory (SCM) storage devices are both persistent and capable of fast byte-level access. As shown in FIG. 14, in the distributed block storage system of this embodiment of the present invention, each storage node includes one or more NVMe storage devices, such as solid state disks (SSDs) of the NVMe interface specification, and also includes one or more SCM storage devices.
  • In another implementation, some of the storage nodes include one or more NVMe storage devices, and other storage nodes include one or more SCM storage devices.
  • In the distributed block storage system shown in FIG. 14, the logical addresses of the NVMe storage devices are organized into a resource pool and the SCM storage devices are organized into a resource pool, and the storage address in the access request received by the client is mapped to the address provided by the storage device.
  • For the logical address provided by the NVMe storage device, refer to the foregoing embodiments.
  • The manner of mapping the storage address in the access request received by the client to the address provided by the storage device may be represented as: storage address — NVMe storage device — logical address, and storage address — SCM storage device — base address.
  • Based on the distributed storage system shown in FIG. 14, multiple copies of data can be stored on different types of storage devices. For example, one copy is stored on an SCM storage device and one or more copies are stored on NVMe storage devices; or one copy is stored on an SCM storage device and one copy is stored across multiple NVMe storage devices in the form of an EC stripe.
  • When the client reads data, it obtains the data from the SCM storage device, thereby improving read performance.
  • This embodiment of the present invention is described by using an example in which one copy is stored on an SCM storage device and two copies are stored on NVMe storage devices.
  • The data in the write request received by the client is mapped to the base address of one SCM storage device and the logical addresses of two NVMe storage devices.
  • In a specific implementation, the mapping from the storage address in the write request to the base address provided by the SCM storage device, and the mapping from the storage address in the write request to the logical addresses of the NVMe storage devices, may be based on the partition views shown in FIG. 3 and FIG. 4, or may be direct mappings from the storage address in the write request to the base address provided by the SCM storage device and to the logical addresses of the NVMe storage devices.
  • Further, the client receives the access request and, according to the storage address in the access request, determines the base address of the SCM storage device to which the storage address is mapped and the logical address of the NVMe storage device to which the storage address is mapped.
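The resolution just described (one storage address yielding one SCM base address and two NVMe logical addresses) can be sketched in C as follows. The identity mapping rule and the node placement used here are pure assumptions for illustration; the embodiment may instead derive these addresses from the partition views.

```c
/* Sketch of resolving the storage address in a client write request to
 * one SCM base address and two NVMe logical addresses (one copy on SCM,
 * two copies on NVMe, as in the example above). */
#include <stdint.h>
#include <stdio.h>

struct replica_layout {
    uint64_t scm_base_addr;     /* base address on the SCM storage device */
    uint64_t nvme_logical[2];   /* logical addresses on two NVMe devices  */
    int      scm_node, nvme_node[2];
};

static struct replica_layout resolve(uint64_t storage_addr)
{
    struct replica_layout r;
    r.scm_node     = 1;                   /* assumed placement of the copies   */
    r.nvme_node[0] = 2;
    r.nvme_node[1] = 3;
    r.scm_base_addr   = storage_addr;     /* storage address -> SCM base addr  */
    r.nvme_logical[0] = storage_addr;     /* storage address -> NVMe logical   */
    r.nvme_logical[1] = storage_addr;
    return r;
}

int main(void)
{
    struct replica_layout r = resolve(1032);
    printf("SCM copy: node %d base %llu; NVMe copies: node %d L=%llu, node %d L=%llu\n",
           r.scm_node, (unsigned long long)r.scm_base_addr,
           r.nvme_node[0], (unsigned long long)r.nvme_logical[0],
           r.nvme_node[1], (unsigned long long)r.nvme_logical[1]);
    return 0;
}
```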
  • For the NVMe storage devices, the subsequent access process may refer to the foregoing process in which the client accesses the NVMe storage device, and details are not described here again. As shown in FIG. 15, taking a copy length of 8 bytes as an example, the process in which the client sends an RDMA write request to the SCM storage device is as follows:
  • ① The client sends a fetch_and_add(ptr, 8) command to the storage node.
  • fetch_and_add(ptr, len value) is an RDMA atomic operation instruction, which is used to obtain the end address of the storage space allocated so far together with the length of the data to be written; len value indicates the length of the data to be written.
  • In this embodiment of the present invention, the end address of the storage space allocated so far is 10, and the len value is 8 bytes.
  • ② The storage node allocates a storage address range 8 bytes in length to the client.
  • On receiving the fetch_and_add(ptr, 8) command, the storage node reserves storage addresses 11 to 18 for the client.
  • ③ The storage node returns the end address of the storage space allocated so far to the client.
  • ④ The client sends an RDMA write request to the storage node.
  • The RDMA write request includes the 8 bytes of data and the end address (base address), 10, of the storage space allocated so far.
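The four steps above can be simulated end to end as in the C sketch below. Here an ordinary function stands in for the RDMA atomic verb, the SCM space is a plain in-memory array, and all numbers follow the 8-byte example; none of these names are the actual interfaces of the embodiment.

```c
/* Simulation of the append allocation in steps 1-4: the client obtains
 * the end address of the space allocated so far via a fetch-and-add,
 * then writes its 8-byte copy starting right after that address. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t allocated_end = 10;    /* end address currently allocated    */
static char     scm_space[64];         /* stand-in for the SCM storage space */

/* Stand-in for the RDMA atomic fetch_and_add(ptr, len): returns the old
 * end address and reserves len bytes for the caller (steps 1-3). */
static uint64_t fetch_and_add(uint64_t *ptr, uint64_t len)
{
    uint64_t old = *ptr;
    *ptr += len;         /* storage node reserves addresses old+1 .. old+len */
    return old;
}

int main(void)
{
    const char data[8] = { 'c', 'o', 'p', 'y', '0', '0', '0', '1' };
    uint64_t base = fetch_and_add(&allocated_end, sizeof(data));  /* returns 10 */

    /* Step 4: the RDMA write carries the data and the returned base address. */
    memcpy(&scm_space[base + 1], data, sizeof(data));             /* fills 11-18 */
    printf("wrote 8 bytes at SCM addresses %llu-%llu\n",
           (unsigned long long)(base + 1), (unsigned long long)(base + 8));
    return 0;
}
```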
  • With reference to the storage system shown in FIG. 1, in another implementation the storage nodes may further include mechanical hard disks: each storage node includes one or more NVMe storage devices, such as solid state disks (SSDs) of the NVMe interface specification, and also includes one or more mechanical hard disks.
  • In yet another implementation, some of the storage nodes include one or more NVMe storage devices, and other storage nodes include one or more mechanical hard disks. Multiple copies of data can be stored on different types of storage devices. For example, one copy is stored on an NVMe storage device and one or more copies are stored on mechanical hard disks; or one copy is stored on an NVMe storage device and one copy is stored across multiple mechanical hard disks in the form of an EC stripe. When the client reads data, it obtains the data from the NVMe storage device, thereby improving read performance.
  • The write request sent by the client to a mechanical hard disk includes the logical address of the mechanical hard disk to which the storage address in the write request received by the client is mapped.
  • The write request sent by the client to the mechanical hard disk may also be an RDMA request.
  • The foregoing embodiments of the present invention are described by using an example in which the client sends RDMA write requests to the NVMe storage devices and the SCM storage devices in the storage nodes.
  • The solution of the embodiments of the present invention can also be applied to sending RDMA read requests and the like to the NVMe storage devices and the SCM storage devices in the storage nodes, which is not limited in the embodiments of the present invention. That is, the embodiments of the present invention enable the client to send RDMA access requests to the NVMe storage devices and the SCM storage devices in the storage nodes.
  • In the embodiments of the present invention, the storage address included in the access request received by the client may correspond to logical addresses (or base addresses) of multiple storage devices; therefore, for any one of these storage devices, this is expressed as the storage address being mapped to the logical address (or base address) of that storage device.
  • Based on the description of the foregoing embodiments, an embodiment of the present invention further provides a storage node, as shown in FIG. 16, applied to a storage system.
  • The storage system further includes a management server, and the storage node includes a first storage device of the NVMe interface specification.
  • The storage node includes an allocating unit 161 and a sending unit 162. The allocating unit 161 is configured to allocate, in the memory, a first memory address for the start address of the queue of the first storage device; the sending unit 162 is configured to send a first queue message to the management server, where the first queue message includes the identifier of the storage node and the first memory address.
  • Further, the storage node further includes a detecting unit, configured to detect that the first storage device is installed in the storage node. Further, the detecting unit is further configured to detect that the first storage device is removed from the storage node, and the sending unit 162 is further configured to send a queue message deletion message to the management server, where the queue message deletion message includes the first memory address.
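The register-on-install and delete-on-removal behaviour of these units can be sketched as below. The message layout, the helper names, and the printf "transport" standing in for the message to the management server are assumptions for illustration only.

```c
/* Sketch of the allocating/sending/detecting behaviour described above:
 * on device installation a memory address is allocated for the queue
 * start address and reported in a queue message; on removal a queue
 * message deletion message is sent. */
#include <stdint.h>
#include <stdio.h>

struct queue_message {
    int      node_id;        /* identifier of the storage node        */
    char     device_id[8];   /* identifier of the NVMe storage device */
    uint64_t memory_addr;    /* first memory address of the queue     */
};

static uint64_t next_free = 0x1000;

static void send_to_management_server(const char *type, const struct queue_message *m)
{
    printf("%s: node %d device %s memory address 0x%llx\n",
           type, m->node_id, m->device_id, (unsigned long long)m->memory_addr);
}

/* Detected: a first storage device is installed in the storage node. */
static struct queue_message on_device_installed(int node_id, const char *dev)
{
    struct queue_message m = { .node_id = node_id, .memory_addr = next_free };
    snprintf(m.device_id, sizeof(m.device_id), "%s", dev);
    next_free += 0x1000;                      /* allocate memory for the queue */
    send_to_management_server("queue message", &m);
    return m;
}

/* Detected: the device was removed, so its queue message is deleted. */
static void on_device_removed(const struct queue_message *m)
{
    send_to_management_server("queue message deletion", m);
}

int main(void)
{
    struct queue_message m = on_device_installed(1, "D11");
    on_device_removed(&m);
    return 0;
}
```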
  • Corresponding to the storage node shown in FIG. 16, an embodiment of the present invention further provides a client applied to a storage system, as shown in FIG. 17. The storage system includes a management server and a first storage node, and the first storage node includes a first storage device of the NVMe interface specification.
  • The start address of the queue of the first storage device is located at a first memory address in the first memory of the first storage node, and the management server stores a first mapping relationship between the identifier of the first storage node and the first memory address. The client includes a sending unit 171 and a receiving unit 172. The sending unit 171 is configured to send a query request to the management server, where the query request includes the identifier of the first storage node; the receiving unit 172 is configured to receive a query request response from the management server, where the query request response includes the first memory address determined by the management server according to the first mapping relationship.
  • Further, the sending unit 171 is further configured to send a first remote direct memory access request to the first storage node, where the first remote direct memory access request includes the first memory address.
  • Further, the storage system further includes a second storage node, where the second storage node includes a second storage device of the NVMe interface specification, and the start address of the queue of the second storage device is located at a second memory address in the second memory of the second storage node.
  • The management server stores a second mapping relationship between the identifier of the second storage node and the second memory address; the query request includes the identifier of the second storage node; and the query request response includes the second memory address determined by the management server according to the second mapping relationship. The first storage device and the second storage device form a stripe relationship. The sending unit 171 is further configured to send a second remote direct memory access request to the second storage node, where the second remote direct memory access request includes the second memory address.
  • As shown in FIG. 18, an embodiment of the present invention provides a management server in a storage system, where the storage system includes the management server and a first storage node, and the first storage node includes a first storage device of the NVMe interface specification.
  • The start address of the queue of the first storage device is located at a first memory address in the first memory of the first storage node.
  • The management server includes an obtaining unit 181, an establishing unit 182, a receiving unit 183, and a sending unit 184.
  • The obtaining unit 181 is configured to obtain a first queue message from the first storage node, where the first queue message includes the identifier of the first storage node and the first memory address.
  • The establishing unit 182 is configured to establish a first mapping relationship between the identifier of the first storage node and the first memory address; the receiving unit 183 is configured to receive a query request from the client, where the query request includes the identifier of the first storage node; and the sending unit 184 is configured to send a query request response, which includes the first memory address, to the client according to the first mapping relationship.
  • Further, the establishing unit 182 is further configured to establish a lock identifier, where the lock identifier is used to lock the first memory address.
  • Further, the storage system further includes a second storage node, where the second storage node includes a second storage device of the NVMe interface specification, and the start address of the queue of the second storage device is located at a second memory address in the second memory of the second storage node.
  • The obtaining unit 181 is further configured to obtain a second queue message of the second storage device from the second storage node, where the second queue message includes the identifier of the second storage node and the second memory address; the establishing unit 182 is further configured to establish a second mapping relationship between the identifier of the second storage node and the second memory address. Further, the obtaining unit 181 is further configured to obtain a third queue message from the first storage node, where the third queue message includes the identifier of the first storage node and a third memory address; the establishing unit 182 is further configured to establish a third mapping relationship between the identifier of the first storage node and the third memory address, where the start address of the queue of a third storage device of the NVMe interface specification is located at the third memory address in the first memory, and the third storage device is a storage device newly added to the first storage node.
  • Further, the receiving unit 183 is further configured to receive a queue message deletion message from the second storage node, where the queue message deletion message includes the second memory address; the management server further includes a deleting unit, configured to delete the second mapping relationship.
  • Further, the management server further includes a detecting unit and a deleting unit, where the detecting unit is configured to detect that communication with the first storage node is interrupted, and the deleting unit is configured to delete the first mapping relationship. Further, the detecting unit is specifically configured to detect that no heartbeat of the first storage node has been received within a predetermined time.
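The mapping-table maintenance described above (establishing mappings, deleting a mapping when a queue message deletion message arrives, and dropping all mappings of a node whose heartbeat has timed out) can be pictured with the C sketch below. The table layout, the timeout value, and the function names are assumptions for the sketch.

```c
/* Sketch of the management server side: a mapping table from storage
 * node identifier to queue memory address, plus deletion on queue
 * message deletion and on heartbeat timeout. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define MAX_ENTRIES        8
#define HEARTBEAT_TIMEOUT  5    /* seconds, assumed */

struct mapping {
    int      node_id;
    uint64_t memory_addr;
    int      valid;
};

static struct mapping table[MAX_ENTRIES];
static time_t last_heartbeat[MAX_ENTRIES];     /* indexed by node_id for brevity */

static void add_mapping(int node_id, uint64_t addr)
{
    for (int i = 0; i < MAX_ENTRIES; i++)
        if (!table[i].valid) {
            table[i] = (struct mapping){ node_id, addr, 1 };
            return;
        }
}

static void delete_mapping(uint64_t addr)      /* queue message deletion message */
{
    for (int i = 0; i < MAX_ENTRIES; i++)
        if (table[i].valid && table[i].memory_addr == addr)
            table[i].valid = 0;
}

static void check_heartbeats(time_t now)       /* communication interruption     */
{
    for (int i = 0; i < MAX_ENTRIES; i++)
        if (table[i].valid &&
            now - last_heartbeat[table[i].node_id] > HEARTBEAT_TIMEOUT)
            table[i].valid = 0;                /* drop the node's mappings       */
}

int main(void)
{
    add_mapping(1, 0x1000);
    add_mapping(2, 0x2000);
    last_heartbeat[1] = time(NULL);
    last_heartbeat[2] = time(NULL) - 60;       /* node 2 stopped responding      */
    check_heartbeats(time(NULL));
    delete_mapping(0x1000);
    for (int i = 0; i < MAX_ENTRIES; i++)
        if (table[i].valid)
            printf("node %d -> 0x%llx\n", table[i].node_id,
                   (unsigned long long)table[i].memory_addr);
    return 0;
}
```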
  • As shown in FIG. 19, an embodiment of the present invention further provides a management server in a storage system, where the storage system further includes a first storage node and a second storage node, and the first storage node includes a first storage device of the NVMe interface specification.
  • The start address of the queue of the first storage device is located at a first memory address in the first memory of the first storage node.
  • The second storage node includes a second storage device of the NVMe interface specification, and the start address of the queue of the second storage device is located at a second memory address in the second memory of the second storage node.
  • The management server stores a mapping table, where the mapping table includes a first mapping relationship between the identifier of the first storage node and the first memory address and a second mapping relationship between the identifier of the second storage node and the second memory address.
  • The management server includes a receiving unit 191 and a sending unit 192. The receiving unit 191 is configured to receive a query request from the client, where the query request includes the identifier of the first storage node and the identifier of the second storage node; the sending unit 192 is configured to send a query request response to the client according to the mapping table.
  • The query request response includes the first memory address and the second memory address.
  • Further, the management server further includes an obtaining unit and an establishing unit. The obtaining unit is configured to obtain a first queue message from the first storage node and a second queue message from the second storage node, where the first queue message includes the identifier of the first storage node and the first memory address, and the second queue message includes the identifier of the second storage node and the second memory address; the establishing unit is configured to establish the first mapping relationship and the second mapping relationship. Further, the establishing unit is further configured to establish a lock identifier, where the lock identifier is used to lock the first mapping relationship and the second mapping relationship.
  • Further, the obtaining unit is further configured to obtain a third queue message from the first storage node, where the third queue message includes the identifier of the first storage node and a third memory address; the establishing unit is further configured to establish a third mapping relationship between the identifier of the first storage node and the third memory address, where the start address of the queue of a third storage device of the NVMe interface specification is located at the third memory address in the first memory, and the third storage device is a storage device newly added to the first storage node.
  • Further, the management server further includes a deleting unit; the receiving unit 191 is further configured to receive a queue message deletion message from the second storage node, where the queue message deletion message includes the second memory address; the deleting unit is configured to delete the second mapping relationship from the mapping table.
  • Further, the management server further includes a detecting unit and a deleting unit, where the detecting unit is configured to detect that communication with the first storage node is interrupted, and the deleting unit is configured to delete the first mapping relationship. Further, the detecting unit is specifically configured to detect that no heartbeat of the first storage node has been received within a predetermined time.
  • Corresponding to the management server described in FIG. 19, FIG. 20 shows a client in a storage system according to an embodiment of the present invention.
  • The storage system includes a management server, a first storage node, and a second storage node, and the first storage node includes a first storage device of the NVMe interface specification.
  • The start address of the queue of the first storage device is located at a first memory address in the first memory of the first storage node.
  • The second storage node includes a second storage device of the NVMe interface specification, and the start address of the queue of the second storage device is located at a second memory address in the second memory of the second storage node.
  • The management server stores a mapping table, where the mapping table includes a first mapping relationship between the identifier of the first storage node and the first memory address and a second mapping relationship between the identifier of the second storage node and the second memory address.
  • The client includes a sending unit 2001 and a receiving unit 2002. The sending unit 2001 is configured to send a query request to the management server.
  • The query request includes the identifier of the first storage node and the identifier of the second storage node.
  • The receiving unit 2002 is configured to receive a query request response from the management server.
  • The query request response includes the first memory address and the second memory address determined by the management server according to the mapping table.
  • Further, the sending unit 2001 is further configured to send a first remote direct memory access request to the first storage node and a second remote direct memory access request to the second storage node, where the first remote direct memory access request includes the first memory address and the second remote direct memory access request includes the second memory address.
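The client behaviour just described, namely one query to the management server followed by one RDMA request per storage node of the stripe, is sketched below in C. The message shapes and the printf stand-in for the query and RDMA transports are assumptions made for illustration.

```c
/* Sketch of the client flow in FIG. 20: query the management server for
 * the two memory addresses, then send one RDMA request to each storage
 * node of the stripe. */
#include <stdint.h>
#include <stdio.h>

struct query_response { uint64_t mem_addr_node1, mem_addr_node2; };

/* Stand-in for the management server answering from its mapping table. */
static struct query_response query_management_server(int node1, int node2)
{
    (void)node1; (void)node2;
    return (struct query_response){ .mem_addr_node1 = 0x1000,
                                    .mem_addr_node2 = 0x2000 };
}

static void send_rdma_request(int node_id, uint64_t mem_addr)
{
    printf("RDMA request to storage node %d, queue memory address 0x%llx\n",
           node_id, (unsigned long long)mem_addr);
}

int main(void)
{
    struct query_response r = query_management_server(1, 2);
    send_rdma_request(1, r.mem_addr_node1);  /* first remote direct memory access request  */
    send_rdma_request(2, r.mem_addr_node2);  /* second remote direct memory access request */
    return 0;
}
```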
  • As shown in FIG. 21, an embodiment of the present invention provides another client in a storage system, where the storage system includes a first storage node, the first storage node includes a first storage device of the NVMe interface specification, and the start address of the queue of the first storage device is located at a first memory address in the first memory of the first storage node.
  • The client includes a receiving unit 2101, a query unit 2102, an obtaining unit 2103, and a sending unit 2104.
  • The receiving unit 2101 is configured to receive a write request, where the write request includes a storage address.
  • The query unit 2102 is configured to query the mapping relationship of the storage address, where the storage address is mapped to a first logical address of the first storage device of the first storage node; the obtaining unit 2103 is configured to obtain the first memory address at which the start address of the queue of the first storage device is located; and the sending unit 2104 is configured to send a first remote direct memory access write request to the first storage node, where the first remote direct memory access write request includes the first memory address and the first logical address.
  • Further, the storage system further includes a management server, where the management server stores the mapping relationship between the identifier of the first storage node and the memory address. The obtaining unit 2103 is specifically configured to: send a first query request to the management server, where the first query request includes the identifier of the first storage node, and receive a first query request response from the management server, where the first query request response includes the first memory address. Alternatively, the obtaining unit 2103 is specifically configured to: send a second query request to the first storage node, where the second query request includes the identifier of the first storage device, and receive a second query request response from the first storage node, where the second query request response includes the first memory address.
  • Further, the storage system further includes a second storage node, where the second storage node includes a second storage device of storage class memory, and the mapping relationship of the storage address includes the storage address being mapped to a first base address of the second storage device; the sending unit 2104 is further configured to send a second remote direct memory access write request to the second storage node, where the second remote direct memory access write request includes the first base address.
  • Further, the storage system further includes a third storage node, where the third storage node includes a third storage device, the mapping relationship of the storage address includes the storage address being mapped to a third logical address of the third storage device, and the third storage device is a mechanical hard disk.
  • The sending unit 2104 is further configured to send a third write request to the third storage node, where the third write request includes the third logical address.
  • Further, the storage system includes a fourth storage node, where the fourth storage node includes a fourth storage device of the NVMe interface specification, and the start address of the queue of the fourth storage device is located at a third memory address in the third memory of the fourth storage node.
  • The mapping relationship of the storage address includes the storage address being mapped to a fourth logical address of the fourth storage device. The obtaining unit 2103 is further configured to obtain the third memory address at which the start address of the queue of the fourth storage device is located; the sending unit 2104 is further configured to send a fourth remote direct memory access write request, which includes the third memory address and the fourth logical address, to the first storage node.
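The dispatch performed by this client when the storage address of a single write request maps to several device types (RDMA writes for NVMe and SCM targets, an ordinary write request for the mechanical hard disk) can be sketched as below. The enum, struct, and routing rule are illustrative assumptions only.

```c
/* Sketch of routing one client write request to targets of several
 * device types, as described for the client of FIG. 21. */
#include <stdint.h>
#include <stdio.h>

enum device_type { DEV_NVME, DEV_SCM, DEV_HDD };

struct target {
    enum device_type type;
    int              node_id;
    uint64_t         address;  /* logical address or base address of the target */
};

static void dispatch_write(const struct target *t)
{
    switch (t->type) {
    case DEV_NVME:
        printf("RDMA write to node %d, NVMe logical address %llu\n",
               t->node_id, (unsigned long long)t->address);
        break;
    case DEV_SCM:
        printf("RDMA write to node %d, SCM base address %llu\n",
               t->node_id, (unsigned long long)t->address);
        break;
    case DEV_HDD:
        printf("write request to node %d, hard disk logical address %llu\n",
               t->node_id, (unsigned long long)t->address);
        break;
    }
}

int main(void)
{
    struct target targets[] = {
        { DEV_NVME, 1, 100 },   /* first logical address  */
        { DEV_SCM,  2,  10 },   /* first base address     */
        { DEV_HDD,  3, 300 },   /* third logical address  */
        { DEV_NVME, 4, 400 },   /* fourth logical address */
    };
    for (size_t i = 0; i < sizeof(targets) / sizeof(targets[0]); i++)
        dispatch_write(&targets[i]);
    return 0;
}
```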
  • In the embodiments of the present invention, "the memory address in the memory at which the start address of the queue is located" has the same meaning as "the start address of the queue in the memory".
  • The foregoing may also be described as the start address of the queue in the memory being located at a particular memory address in the memory.
  • For the implementations of the management server, the storage node, and the client in the embodiments of the present invention, refer to the foregoing descriptions. Specifically, an apparatus in the embodiments of the present invention may be a software module that runs on a server, so that the server completes the various implementations described in the embodiments of the present invention.
  • The apparatus may also be a hardware device; for its structure, refer to FIG. 2.
  • The units of the apparatus may be implemented by the processor of the server described in FIG. 2.
  • Correspondingly, the embodiments of the present invention further provide a computer readable storage medium and a computer program product, where the computer readable storage medium and the computer program product contain computer instructions for implementing the various solutions described in the embodiments of the present invention.
  • In the embodiments of the present invention, EC and multiple copies are used as the striping algorithms, but the striping algorithm in the embodiments of the present invention is not limited to EC or multiple copies.
  • In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners.
  • The division of the units described in the foregoing apparatus embodiments is only a division by logical function; in actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
  • The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units may be integrated into one unit.

Abstract

A method for managing storage devices in a storage system. A client obtains the start address of the queue of the NVMe storage device to which an access request is directed, together with the logical address of that NVMe storage device, and sends a remote direct memory access command to the storage node where the NVMe storage device is located, thereby fully exploiting the performance of the NVMe storage device and improving the write performance of the storage system.

Description

存储系统中存储设备的管理方法及装置 技术领域
本发明涉及信息技术领域,尤其涉及一种存储系统中存储设备的管理方法及装置。
背景技术
随着非易失性内存高速(Non-volatile Memory Express,NVMe)接口规范的发展,支持NVMe接口规范的存储设备(以下简称为NVMe存储设备)应用于分布式存储系统中。分布式存储系统中,通常包含多个存储节点,每一个存储节点包含一个或多个支持NVMe接口规范的存储设备。客户端访问分布式存储系统中的存储设备。首先客户端需要确定处理访问请求的存储节点,与存储节点建立通信,存储节点接收访问请求,存储节点的中央处理单元(Central Processing Unit,CPU)解析访问请求得到访问命令,将访问请求中的地址转化为NVMe存储设备的存储地址,确定访问请求对应的NVMe存储设备,将NVMe存储设备的存储地址和访问命令发送给对应的NVMe存储设备,上述操作过程与访问传统的非NVMe接口规范的存储设备流程相同,无法充分利用NVMe存储设备的性能。
发明内容
申请提供了一种存储系统中存储设备的管理方法及装置。
本申请的第一方面提供了存储系统中存储设备的管理方法,在存储系统包含管理服务器和第一存储节点,其中,第一存储节点包含NVMe接口规范的第一存储设备,第一存储设备的队列的起始地址位于第一存储节点的第一内存中的第一内存地址;管理服务器从第一存储节点获取第一队列消息;第一队列消息包含第一存储节点的标识和第一内存地址;管理服务器建立第一存储节点的标识和第一内存地址之间的第一映射关系;管理服务器接收来自客户端的查询请求;查询请求包含第一存储节点的标识;管理服务器根据第一映射关系向客户端发送查询请求响应;查询请求响应包含第一内存地址。管理服务器建立存储节点的标识和队列的起始地址的映射关系,客户端可以从管理服务器获得NVMe存储设备队列的信息,从而可以直接访问NVMe存储设备,不需要存储节点CPU的参与处理访问请求,充分发挥了NVMe存储设备的性能,提高了存储系统的写性能。
结合本申请第一方面,在第一种可能实现方式中,还包括,管理服务器建立锁定标识,锁定标识用于锁定第一内存地址。从而管理服务器可以记录队列的分配情况。进一步的,管理服务器可以根据锁定信息统计NVMe存储设备的负载。进一步的,管理服务器还可以记录获得队列的客户端的标识。
结合本申请第一方面,在第二种可能实现方式中,存储系统还包括第二存储节点,第二存储节点包含NVMe接口规范的第二存储设备,第二存储设备的队列的起始地址位于第二存储节点的第二内存中的第二内存地址;管理
服务器从第二存储节点获取第二存储设备的第二队列消息;第二队列消息包含第二存储节点的标识和第二内存地址;管理服务器建立第二存储节点的标识和第二内存地址的第二映射关系。管理服务器可以管理存储队列中所有存储节点包含的NVMe设备的队列信息。
结合本申请第一方面,或第一方面的第一种或第二种可能实现方式,在第三种可能的实现方式中,管理服务器从第一存储节点获得第三队列消息;第三队列消息包含第一存储节点的标识和第三内存地址;管理服务器建立第一存储节点的标识和第三内存地址的第三映射关系;其中,NVMe接口规范的第三存储设备的队列的起始地址位于第一内存中的第三内存地址,第三存储设备为第一存储节点新增加的存储设备。
结合本申请第一方面的第二或第三种可能的实现方式,在第四种可能的实现方式中,管理服务器从第二存储节点接收队列消息删除消息;队列消息删除消息包含第二内存地址;管理服务器删除第二映射关系。
结合本申请第一方面,在第五种可能的实现方式中,管理服务器检测到与第一存储节点通信中断,管理服务器删除第一映射关系。
结合本申请第一方面的第五种可能的实现方式,在第六种可能的实现方式中,管理服务器检测到与第一存储节点通信中断,具体包括:管理服务器在预定的时间内未收到第一存储节点的心跳。
本申请的第二方面提供了存储系统中存储设备的管理方法,存储系统包含管理服务器和存储节点,其中,存储节点包含NVMe接口规范的第一存储设备;存储节点在内存中为第一存储设备的队列的起始地址分配第一内存地址;存储节点向管理服务器发送第一队列消息;第一队列消息包含存储节点的标识和第一内存地址。存储节点将存储设备的队列信息发送给管理服务器,从而使客户端能够使用队列信息直接访问存储设备。
结合本申请第二方面,在第一种可能实现方式中,存储节点检测到第一存储设备安装到存储节点。
结合本申请第二方面或第二方面的第一种可能实现方式,在第二种可能的实现方式中,存储节点检测述第一存储设备从存储节点中移除;存储节点向管理服务器发送队列消息删除消息;队列消息删除消息包含第一内存地址。
本申请第三方面提供了一种存储系统中存储设备的管理方法,其中存储系统包含管理服务器和第一存储节点,第一存储节点包含NVMe接口规范的第一存储设备,第一存储设备的队列的起始地址位于第一存储节点的第一内存中的第一内存地址;管理服务器存储第一存储节点的标识和第一内存地址的第一映射关系;客户端向管理服务器发送查询请求;查询请求包含所述第一存储节点的标识;客户端接收来自管理服务器查询请求响应;查询请求响应包含管理服务器根据第一映射关系确定的第一内存地址。客户端查询管理服务器,可以获得存储节点中NVMe存储设备的队列信息,从而能够根据队列信息直接访问NVMe存储设备,不需要存储节点CPU参与,充分发挥了NVMe存储设备的性能。
结合本申请第三方面,在第一种可能实现方式中,客户端向第一存储节点发送第一远程直接内存访问请求;第一远程直接内存访问请求包含第一内存地址。
结合本申请第三方面的第一种可能实现方式,在第二种可能实现方式中,存储系统还包含第二存储节点,第二存储节点包含NVMe接口规范的第二存储设备,第二存储设备的队列的起始地址位于第二存储节点的第二内存中的第二内存地址;管理服务器存储第二存储节点的标识和所述第二内存地址的第二映射关系;查询请求包含所述第二存储节点的标识;查询请求响应包含管理服务器根据第二映射关系确定的所述第二内存地址;其中,第一存储设备和第二存储设备构成分条关系;客户端向第二存储节点发送第二远程直接内存访问请求;第二远程直接内存访问请求包含所述第二内存地址。
结合本申请第一方面或第一方面的第一至六任一种可能的实现方式中的存储系统中存储设备的管理方法,本申请第四方面还提供了一种管理服务器,管理服务器包含多个单元,用于执行本申请第一方面或第一方面的第一至六任一种可能的实现方式。
结合用于本申请第二方面或第二方面的第一至二任一种可能的实现方式中的存储系统中存储设备的管理方法,本申请第五方面还提供了一种存储节点,存储节点包含多个单元,用于执行本申请第二方面或第二方面的第一至二任一种可能的实现方式。
结合用于本申请第三方面或第三方面的第一至二任一种可能的实现方式中的存储系统中存储设备的管理方法,本申请第六方面还提供了一种客户端,客户端包含多个单元,用于执行本申请第三方面或第三方面的第一至二任一种可能的实现方式。
本申请第七方面还提供了一种管理服务器,应用于本申请第一方面或第一方面的第一至六任一种可能的实现方式中的存储系统,管理服务器包含处理器和接口,处理器与接口通信,处理器用于执行本申请第一方面或第一方面的第一至六任一种可能的实现方式。
本申请第八方面还提供了一种存储节点,应用于本申请第二方面或第二方面的第一至二任一种可能的实现方式中的存储系统,存储节点包含处理器和接口,处理器与接口通信,处理器用于执行用于执行本申请第二方面或第二方面的第一至二任一种可能的实现方式。
本申请第九方面还提供了一种客户端,应用于本申请第三方面或第三方面的第一至二任一种可能的实现方式中的存储系统,客户端包含处理器和接口,处理器与接口通信,处理器用于执行用于执行本申请第三方面或第三方面的第一至二任一种可能的实现方式。
相应地,本申请第十方面还提供了计算机可读存储介质和计算机程序产品,计算机可读存储介质和计算机程序产品中包含计算机指令用于实现本申请第一方面各方案。
相应地,本申请第十一方面还提供了计算机可读存储介质和计算机程序产品, 计算机可读存储介质和计算机程序产品中包含计算机指令用于实现本申请第二方面各方案。
相应地,本申请第十二方面还提供了计算机可读存储介质和计算机程序产品,计算机可读存储介质和计算机程序产品中包含计算机指令用于实现本申请第三方面各方案。
附图说明
图1为本发明实施例分布式块存储系统示意图;
图2为本发明实施例存储节点结构示意图;
图3为本发明实施例分区视图;
图4为本发明实施例分区视图;
图5为本发明实施例NVMe存储设备示意图;
图6为本发明实施例存储节点中NVMe队列示意图;
图7为本发明实施例存储节点发送队列信息流程图;
图8为本发明实施例管理服务器存储NVMe存储设备队列信息示意图;
图9为本发明实施例管理服务器与存储节点通信示意图;
图10为本发明实施例管理服务器存储NVMe存储设备队列信息示意图;
图11为本发明实施例管理服务器存储NVMe存储设备队列信息示意图;
图12为本发明实施例存储节点存储NVMe存储设备队列信息示意图;
图13为本发明实施例访问请求处理流程示意图;
图14为本发明实施例分布式块存储系统示意图;
图15为本发明实施例客户端向存储类内存设备发送RDMA写请求示意图;
图16为本发明实施例存储节点结构示意图;
图17为本发明实施例客户端结构示意图;
图18为本发明实施例管理服务器结构示意图;
图19为本发明实施例管理服务器结构示意图;
图20为本发明实施例客户端结构示意图;
图21为本发明实施例客户端结构示意图。
本发明实施例
本发明实施例中的存储系统可以应用到存储阵列(如华为
Figure PCTCN2017118650-appb-000001
的Oceanstor
Figure PCTCN2017118650-appb-000002
18000系列,
Figure PCTCN2017118650-appb-000003
V3系列),分布式文件存储系统(如华为
Figure PCTCN2017118650-appb-000004
Figure PCTCN2017118650-appb-000005
9000系列),分布式块存储系统(如华为
Figure PCTCN2017118650-appb-000006
Figure PCTCN2017118650-appb-000007
系列)、分布式对象存储系统或支持日志结构(Log-Structure)接口的分布式存储系统等。
如图1所示,本发明实施例中的分布式块存储系统,如华为
Figure PCTCN2017118650-appb-000008
Figure PCTCN2017118650-appb-000009
系列。分布式块存储系统包括多个存储节点,如存储节点1、存储节点2、存储节点3、存储节点4、存储节点5和存储节点6,存储节点间通过InfiniBand或以太网络等互相通信。在实际应用当中,分布式块存储系统中存储节点的数量可以根据实际需求增加,本发明实施例对此不作限定。每一个存储节点包含一个或多个NVMe存储设备,例如NVMe接口规范的固态硬盘(Solid State  Disk,SSD)。本发明实施例中NVMe存储设备不限于SSD。本发明实施例中存储节点包含NVMe存储设备具体可能为存储节点内部包含NVMe存储设备,NVMe存储设备也可以是以硬盘簇(Just a Bunch Of Disks,JBOD)的形式外置于存储节点。
存储节点包含如图2所示的结构。如图2所示,存储节点包含中央处理单元(Central Processing Unit,CPU)201、内存202、接口卡203。内存202中存储计算机指令,CPU201执行内存202中的程序指令执行相应的操作。接卡口303可以为网络接口卡(Network Interface Card,NIC)、Infiniband协议接口卡等。另外,为节省CPU201的计算资源,现场可编程门阵列(Field Programmable Gate Array,FPGA)或其他硬件也可以代替CPU301执行上述相应的操作,或者,FPGA或其他硬件与CPU201共同执行上述相应的操作。为方便描述,本发明实施例将CPU201与内存202、FPGA及其他替代CPU201的硬件或FPGA及其他替代CPU201的硬件与CPU201的组合统称为处理器。
在图2所示的结构中,客户端可以是独立于图2所示的设备,例如服务器、移动设备等,也可以是虚拟机(Virtual Machine,VM)。客户端运行应用程序,其中,应用程序可以为VM或容器,也可以为某一个特定应用,如办公软件等。客户端向分布式块存储系统写入数据或从分布式块设备存储中读取数据。客户端的结构可参考图2及相关描述。
存储节点的内存202中加载分布式块存储系统程序,CPU201执行内存202中的分布式块存储系统程序,向客户端提供块协议访问接口,为客户端提供分布式块存储接入点服务,使客户端访问分布式块存储系统中存储资源池中的存储资源。通常,该块协议访问接口用于向客户端提供逻辑单元。示例性的,分布式块存储系统初始化时,将哈希空间(如0~2^32,)划分为N等份,每1等份是1个分区(Partition),这N等份按照硬盘数量进行均分。例如,分布式块存储系统中N默认为3600,即分区分别为P1,P2,P3…P3600。假设当前分布式块存储系统有18块NVMe存储设备,则每块NVMe存储设备承载200个分区。分区P包含分别分布在M个存储节点上的共M个NVMe存储设备,分区中M个NVMe存储设备构成分条关系。分条关系可以为多副本或纠删码(Erasure Coding,EC)。分区与NVMe存储设备对应关系,即分区与分区包含的NVMe存储设备的映射关系,也称为分区视图,如图3所示,以分区包含4个NVMe存储设备为例,分区视图为“P2-(存储节点N 1-NVMe存储设备1)-(存储节点N 2-NVMe存储设备2)-(存储节点N 3-NVMe存储设备3)-(存储节点N 4-NVMe存储设备4)”。即NVMe存储设备1、NVMe存储设备2、NVMe存储设备3和NVMe存储设备4构成分条关系。当每个存储节点只包含一个NVMe存储设备,图3所示的分区视图也可以表示为P2-存储节点N 1-存储节点N 2-存储节点N 3-存储节点N 4。通常分区的划分和分区视图的分配由管理服务器实现。管理服务器会在分布式块存储系统初始化时分配好分区视图,后续会随着分布式块存储系统中NVMe存储设备数量的变化进行调整。其中,管理服务器的结构可参考图2所示的结构。为方 便客户端访问,减少管理服务器访问压力,其中一种实现,管理服务器会将分区视图发送给客户端。
分布式块存储系统为客户端提供卷作为存储资源。具体实现,分布式块存储系统将NVMe存储设备的逻辑地址划分资源池,为客户端提供数据访问,客户端接收的访问请求中的存储地址映射到NVMe存储设备的逻辑地址,即客户端访问请求中的存储地址所在的数据块映射到NVMe存储设备提供的逻辑地址。客户端访问卷的访问请求,如写请求,写请求携带存储地址和数据。在分布式块存储系统中,存储地址为逻辑块地址(Logical Block Address,LBA)。根据写请求中的存储地址确定写请求对应的数据块。客户端根据数据块查询管理服务器中的分区视图或客户端本地保存的分区视图,确定为数据块分配存储空间的NVMe存储设备。例如,假设数据块大小为1024字节,卷中数据块编号从0开始。写请求包含的存储地址为写起始地址为1032字节,大小为64字节,写请求位于编号为1的数据块中(1032/1024),数据块为内偏移为8(1032%1024)。
例如,如图3所示,第1个数据块分布在分区P2,NVMe存储设备1、NVMe存储设备2、NVMe存储设备3和NVMe存储设备4构成分条关系为多副本关系。客户端查询分区视图,确定写请求包含的存储地址映射到NVMe设备的逻辑地址,例如,存储地址映射到NVMe存储设备1、NVMe存储设备2、NVMe存储设备3和NVMe存储设备4的逻辑地址分别为L1、L2、L3和L4。具体实现,客户端根据分区视图查询分区视图中的主存储节点,如存储节点N1,存储节点N1向客户端提供L1、L2、L3和L4。客户端确定提供逻辑地址的NVMe存储设备,获得NVMe存储设备1、NVMe存储设备2、NVMe存储设备3和NVMe存储设备4的队列的起始地址在内存中的内存地址(以下简称队列的起始地址),客户端获得NVMe存储设备的队列的起始地址在内存中的内存地址的具体实现请见下面描述。客户端分别向NVMe存储设备1、NVMe存储设备2、NVMe存储设备3和NVMe存储设备4发送RDMA写请求。其中,客户端向NVMe存储设备1发送的RDMA写请求中包含逻辑地址L1和NVMe存储设备1的队列的起始地址,还包含客户端接收的写请求中写入L1的数据;客户端向NVMe存储设备2发送的RDMA写请求中包含逻辑地址L2和NVMe存储设备2的队列的起始地址,还包含客户端接收的写请求中写入L2的数据;客户端向NVMe存储设备3发送的RDMA写请求中包含逻辑地址L3和NVMe存储设备3的队列的起始地址,还包含客户端接收的写请求中写入L3的数据;客户端向NVMe存储设备4发送的RDMA写请求中包含逻辑地址L4和NVMe存储设备4的队列的起始地址,还包含客户端接收的写请求中写入L4的数据。具体实现,客户端分别向NVMe存储设备1所在的存储节点1的接口卡、NVMe存储设备2所在的存储节点2的接口卡、NVMe存储设备3所在的存储节点3的接口卡和NVMe存储设备4所在的存储节点4的接口卡发送RDMA写请求。
另一种实现,如图4所示,分区视图为P2-(存储节点N1-NVMe存储设备1)-(存储节点N2-NVMe存储设备2)-(存储节点N3-NVMe存储设备3)- (存储节点N4-NVMe存储设备4)-(存储节点N5-NVMe存储设备5)-(存储节点N6-NVMe存储设备6)。第1个数据块分布在分区P2,NVMe存储设备1、NVMe存储设备2、NVMe存储设备3、NVMe存储设备4、NVMe存储设备5和NVMe存储设备6构成EC关系,NVMe存储设备1、NVMe存储设备2、NVMe存储设备3和NVMe存储设备4为存储数据分片的存储设备,NVMe存储设备5和NVMe存储设备6为存储校验分片的存储设备。EC分条的长度为12千字节(Kilobyte,KB),则数据分片和校验分片的长度均为2KB。例如,第1个分条中,NVMe存储设备1、NVMe存储设备2、NVMe存储设备3和NVMe存储设备4分别存储存储地址为0~2KB-1,2KB~4KB-1,4KB~6KB-1,6KB~8KB-1的数据分片,NVMe存储设备5和NVMe存储设备6分别存储第1分条的校验分片。在该分区的第2个分条中,NVMe存储设备1、NVMe存储设备2、NVMe存储设备3和NVMe存储设备4分别存储存储地址为8KB~10KB-1,10KB~12KB-1,12KB~14KB-1,14KB~16KB-1的数据分片,NVMe存储设备5和NVMe存储设备6分别存储第2分条的校验分片。例如,客户端接收写请求,写请求中包含的存储地址SA为0~8KB-1。客户端查询分区视图,确定存储地址为0~2KB-1的第1数据分片对应的NVMe存储设备1的逻辑地址为L1,确定存储地址为2KB~4KB-1的第2数据分片对应的NVMe存储设备2的逻辑地址为L2,确定存储地址为4KB~6KB-1的第3数据分片对应的NVMe存储设备3的逻辑地址为L3,确定存储地址为6KB~8KB-1的第4数据分片对应的NVMe存储设备4的逻辑地址为L4,确定第1分条的第1校验分片对应的NVMe存储设备5的逻辑地址为L5,确定第1分条的第2校验分片对应的NVMe存储设备6的逻辑地址为L6。客户端查询分区视图的具体实现可以参见本发明实施例前面的描述,在此不再赘述。本发明实施例中,存储地址SA分别映射到L1、L2、L3、L4、L5和L6。客户端确定提供逻辑地址的NVMe设备,客户端获得NVMe存储设备1、NVMe存储设备2、NVMe存储设备3、NVMe存储设备4、NVMe存储设备5和NVMe存储设备6的队列的起始地址,客户端获得NVMe存储设备的队列的起始地址的具体实现请见下面描述。客户端分别向NVMe存储设备1、NVMe存储设备2、NVMe存储设备3、NVMe存储设备4、NVMe存储设备5和NVMe存储设备6发送RDMA请求。其中,客户端向NVMe存储设备1发送的RDMA写请求中包含逻辑地址L1和NVMe存储设备1的队列的起始地址,还包含客户端接收的写请求中写入L1的第1数据分片;客户端向NVMe存储设备2发送的RDMA写请求中包含逻辑地址L2和NVMe存储设备2的队列的起始地址,还包含客户端接收的写请求中写入L2的第2数据分片;客户端向NVMe存储设备3发送的RDMA写请求中包含逻辑地址L3和NVMe存储设备3的队列的起始地址,还包含客户端接收的写请求中写入L3的第3数据分片;客户端向NVMe存储设备4发送的RDMA写请求中包含逻辑地址L4和NVMe存储设备4的队列的起始地址,还包含客户端接收的写请求中写入L4的第4数据分片;客户端向NVMe存储设备5发送的RDMA写请求中包含逻辑地址L5和NVMe存储设备5的队列的起始地址,还包含第1校验分片;客户端向NVMe存储设备6发送的RDMA 写请求中包含逻辑地址L6和NVMe存储设备6的队列的起始地址,还包含第2校验分片。
在上述实现方式中,客户端使用RDMA写请求将数据直接写入NVMe存储设备在存储节点的内存中队列的起始地址,不需要存储节点CPU的参与处理客户端发送的写请求,充分发挥了NVMe存储设备的性能,提高了存储系统的写性能。
另一种实现,分布式块存储系统为客户端提供卷作为存储资源。具体实现,分布式块存储系统将NVMe存储设备的逻辑地址划分资源池,为客户端提供数据访问,即客户端访问请求中的存储地址所在的数据块映射到NVMe存储设备提供的逻辑地址,不需要分区视图。数据块与NVMe存储设备提供的逻辑地址的映射可以表示为数据块-----NVMe存储设备----逻辑地址。
为进一步描述本发明实施例,如图5所示,NVMe存储设备包含NVMe控制器501和存储介质502。NVMe规范中定义了三个关键组件用于访问请求和数据的处理:提交队列(Submission Queue,SQ)、完成队列(Completion Queue,CQ)和门铃寄存器(Doorbell Register,DB)。SQ用于存储客户端发送的访问请求,CQ用于存储NVMe存储设备处理访问请求的结果。SQ和CQ以队列对的形式存在。如图6所示,SQ和CQ位于存储节点的内存202中,本发明实施例称一对SQ和CQ为队列。NVMe存储设备用于处理访问请求的队列的数量最大可以达到65535个。SQ和CQ都是环形队列,NVMe控制器从SQ队列的头获取待处理的访问请求,SQ尾用于存储客户端最新发送的访问请求。客户端从CQ的头获取访问结果,CQ队列的尾用于存储NVMe控制器最新处理访问请求的结果。客户端使用RDMA请求访问NVMe存储设备,需要获取NVMe存储设备的SQ和CQ队列的尾,也就是SQ和CQ的起始地址分别在内存202中的内存地址。在NVM控制器中有寄存器,用于记录SQ和CQ的头尾位置,每个SQ或CQ,都有两个对应的寄存器,即头寄存器和尾寄存器。通常头寄存器也称为头门铃(DoorBell,DB),尾寄存器也称为尾DB。关于DB的作用将在后面实施例中具描述。
NVMe存储设备连接到存储节点,存储节点启动后,NVMe存储设备向存储节点进行注册,如图7所示,存储节点执行如下流程:
步骤701:存储节点在内存202中为队列分配内存地址。
存储节点在内存202中为队列分配内存地址,包含为队列的起始地址在内存中分配内存地址。
步骤702:存储节点向管理服务器发送队列消息;队列消息包含存储节点的标识和队列的起始地址。
存储节点在本地建立存储节点的标识和队列的起始地址的映射关系。存储节点可以包含一个或多个NVMe存储设备,队列消息中还可以包含NVMe存储设备的标识,以区分存储节点的不同NVMe存储设备的队列的起始地址。当存储节点只包含一个NVMe存储设备时,队列消息中可以只包含存储节点的标识和队列的起始地址。
在存储节点运行过程中,新的NVMe存储设备连接到存储节点,存储节点也执行图7所示的流程。
进一步的,存储节点检测到NVMe存储设备从存储节点移除,存储节点向管理服务器发送队列信息删除消息,队列信息删除消息包含队列的起始地址。NVMe存储设备从存储节点移除,具体可以包含物理移除或NVMe存储设备故障。存储节点可以检测NVMe存储设备的驱动以确定NVMe存储设备是否从存储节点移除。
管理服务器从存储节点获取队列信息。具体实现,可以由管理服务器向存储节点发送请求用于指示存储节点上报队列信息,或者存储节点主动向管理服务器发送队列信息。其中,队列信息包含存储节点的标识和存储节点为NVMe存储设备的队列分配的队列的起始地址。管理服务器建立存储节点的标识和队列的起始地址的映射关系。通常,存储节点包含多个NVMe存储设备,为区分同一个存储节点不同NVMe存储设备的队列的起始地址,队列信息还包含NVMe存储设备的标识,管理服务器建立的映射关系中还包含NVMe存储设备的标识。一种实现方式,上述映射关系作为表项存储在如图8所示的表中。换句话说,映射关系可以用如图8所示的表项结构,或其他可体现标识和地址之间关系的数据结构储存。其中,N1、N2和N3分别表示存储节点的标识,D11和D12分别表示存储节点1中的NVMe存储设备的标识,Add1表示标识为D11的NVMe存储设备的队列的起始地址。因为标识为D11的NVMe存储设备可以有多个队列,其中一种实现,Add1表示NVMe存储设备的SQ的起始地址,另一种实现,Add1还可以表示NVMe存储设备的队列的起始地址(SQ的起始地址和CQ的起始地址)。图8所示的表中其他项的含义可参考上述的表述。存储节点的标识和队列的起始地址的映射关系作为图8所示的表的表项。在存储系统中,每一个存储节点有一个唯一标识,该标识可以是管理服务器为每一个存储节点分配的编号,也可以是存储节点的硬件信息,如接口卡硬件信息,还可以是存储节点的地址信息,如互联网协议地址(Internet Protocol,IP)地址。NVMe存储设备的标识可以是NVMe存储设备的硬件信息,也可以是NVMe存储设备所在的存储节点中的内部编号,比如,存储节点1中的NVMe存储设备可以标识为D11和D12,或者表示为N1+NVMe设备编号,例如,N1+1,N1+2等。管理服务器建立存储节点的标识和队列的起始地址的映射关系,客户端可以从管理服务器获得NVMe存储设备队列的信息,从而可以直接访问NVMe存储设备,不需要存储节点CPU的参与,充分发挥了NVMe存储设备的性能。
在一种实现方式中,如图9所示,管理服务器与存储节点通信,确定存储节点是否正常,如通过定期接收来自存储节点的心跳确定存储节点是否正常。例如管理服务器在预定的时间内未收到存储节点的心跳,管理服务器判断与存储节点通信中断,存储节点发生故障。例如,管理服务器未在预定的时间内收到图9所示的存储节点1发送的心跳,管理服务器判断与存储节点1的通信中断,存储节点1发生故障。结合图8,管理服务器删除表中记录的存储节点1相关的表项,即映射信息,删除存储节点1相关的表项后,表如图10所示。当存储节点1恢复后,执 行图7所示的流程。
存储节点中NVMe存储设备移除,管理服务器接收来自存储节点的队列信息删除消息,队列信息删除消息包含队列的起始地址。例如,标识为D11的NVMe存储设备从存储接节点1中移除,存储节点1向管理服务器发送队列信息删除消息,如图11所示,管理服务器删除图8所示的标识为D11的所有映射关系。另一种实现,存储节点包含多个NVMe存储设备,为区分同一个存储节点不同NVMe存储设备的队列的起始地址,队列信息还包含NVMe存储设备的标识,管理服务器建立的映射关系中还包含NVMe存储设备的标识。存储节点向管理服务器发送的队列信息删除消息包含NVMe存储设备的标识,管理服务器删除包含该NVMe存储设备的标识的映射关系。另一种实现,当新的NVMe存储设备安装到存储节点1时,存储节点1在内存中为新的NVMe存储设备的队列分配内存地址,执行图7所示流程,在此不再赘述。
本发明实施例中,存储节点本地存储图12所示的表,用于记录本存储节点中NVMe存储设备的标识与NVMe存储设备的队列的起始地址的映射关系。关于图12中的NVMe存储设备的标识的描述可以参考图8到图11的描述,在此不再赘述。本发明实施例图8至图12的表作为一种示意性的数据结构,具体实现可以有多种方式,例如以索引的方式存在,进一步的,可以以多级索引方式存在,例如在图8至图11所示的表中,第一级索引为存储节点标识,用于查找对应的NVMe存储设备的标识;第二级索引为NVMe存储设备的标识,用于查找存储NVMe存储设备的队列的起始地址。
如前所述,客户端访问存储系统,根据访问请求中包含的存储地址确定访问地址对应的NVMe存储设备的逻辑地址,例如客户端发送写请求。具体实现可以访问客户端保存的分区视图或查询管理服务器中的分区视图。例如,客户端确定存储地址对应存储节点1中的NVMe存储设备1的逻辑地址L1、存储节点2中的NVMe存储设备2的逻辑地址L2、存储节点3中的NVMe存储设备3的逻辑地址L3和存储节点4中的NVMe存储设备4的逻辑地址L4,即NVMe存储设备1、NVMe存储设备2、NVMe存储设备3和NVMe存储设备4构成多副本的分条关系。客户端根据为获取提供逻辑地址的NVMe存储设备的队列的起始地址,向管理服务器发送查询请求,查询请求中包含存储节点的标识。当存储节点包含多个NVMe存储设备时,查询请求中还包括NVMe存储设备的标识。例如查询请求中分别包含以下几组标识:存储节点1的标识N1和NVMe存储设备1的标识D11,存储节点2的标识N2和NVMe存储设备2的标识D21,存储节点3的标识N3和NVMe存储设备3的标识D31,存储节点4的标识N4和NVMe存储设备4的标识D41。结合图8,管理服务器接收来自客户端的查询请求,查询表项中记录的映射关系,向客户端发送查询请求响应。响应中包含存储节点1的NVMe存储设备1的队列的起始地址Add1、存储节点2的NVMe存储设备2的队列的起始地址Addk、存储节点3的NVMe存储设备3的队列的起始地址Addy、存储节点4的NVMe存储设备4的队列的起始地址Addz。本发明实施例中,管理服务器向客户端发送的查询请求响应中包含的队列的起始地址包含SQ的起始地址,还可以包含CQ的起始地址。客户端与管 理服务器通信,可以一次获取多个NVMe存储设备队列的起始地址,减少了通信交互次数。
管理服务器建立锁定标识,锁定标识用于锁定起始地址Add1、Addk、Addy和Addz,一旦存储队列的起始地址被锁定,表示该队列已经被分配给客户端。因此,也可以表述为锁定标识用于锁定存储节点的标识和队列的起始地址的映射关系。具体实现,锁定标识可以是一个标记位,例如用0表示锁定,1表示未锁定。锁定标识可以记录在图8所示的表项中。进一步的,管理服务器还可以记录获得队列的客户端的标识。该标识可以是管理服务器为每一个客户端分配的编号,也可以是客户端的硬件信息,如接口卡硬件信息,还可以是客户端的地址信息,如IP地址。
在本发明实施例中,管理服务器可以根据NVMe存储设备的队列的起始地址的锁定信息统计NVMe存储设备的负载,根据NVMe存储设备的负载动态决定存储地址与NVMe存储设备的逻辑地址的映射关系,从而实现存储系统的负载均衡。根据NVMe存储设备的负载动态决定存储地址与NVMe存储设备的逻辑地址的映射关系是指存储地址与NVMe存储设备的逻辑地址映射关系不是在存储系统初始化时绑定的,而是在客户端接收到写请求时,确定写请求中的存储地址映射的NVMe存储设备的逻辑地址。其中一种实现,管理服务器在客户端查询写请求中的存储地址映射的NVMe存储地址时,管理服务器根据存储系统中NVMe存储设备的负载确定NVMe存储设备的逻辑地址;另一种实现,管理服务器根据存储系统中NVMe存储设备的负载确定分区视图,如分区与存储节点的映射关系。具体的,NVMe存储设备的负载可以根据NVMe存储设备的队列的起始地址的锁定信息统计。
客户端分别向存储节点1、存储节点2、存储节点3和存储节点4发送RDMA写请求。其中,发送给存储节点1的RDMA写请求中包含L1和Add1;发送给存储节点2的RDMA写请求中包含L2和Addk;发送给存储节点3的RDMA写请求中包含L3和Addy;发送给存储节点4的RDMA写请求中包含L4和Addz。关于客户端分别向存储节点1、存储节点2、存储节点3和存储节点4发送RDMA写请求,可以参考本发明实施例前面的描述。本发明实施例以客户端向存储节点1发送RDMA写请求为例进行描述。客户端向存储节点1的接口卡发送RDMA写请求,存储节点1的接口卡接收RDMA写请求,将逻辑地址L1和客户端接收的写请求中写入L1的数据发送到存储节点1的内存地址Add1。在图13所示,SQ和CQ是空的,即初始的头地址和尾地址相同,CQ的初始的头地址和尾地址相同。具体操作过程如图13所示:
①客户端向SQ的尾发送写请求。
客户端向存储节点1发送RDMA写请求,RDMA写请求中包含L1和队列的起始地址Add1,还包含客户端接收的写请求中写入L1的数据。存储节点1的接口卡接收到RDMA写请求,获取L1和Add1,例如Add1为0。如前所述,本发明实施例中,队列的起始地址Add1包含SQ的起始地址,另一种实现,队列的起始地址Add1还可以包含CQ的起始地址,SQ尾DB初始值为0,CQ尾DB初始值为0。存储节 点1的接口卡将L1和客户端接收的写请求中写入L1的数据发送到SQ的起始地址。RDMA写请求为1个命令。
②客户端更新SQ的尾DB。
客户端向SQ写入1个RDMA写请求,SQ的尾变为1。客户端向SQ写入1个RDMA写请求命令后,更新NVMe控制器中的SQ尾DB,值为1。客户端更新SQ尾DB的同时,也是在通知NVMe控制器有写请求需要执行。
③NVMe控制器从SQ队列获取写请求并执行写请求。
客户端更新SQ尾DB,NVMe控制器收到通知,从SQ获取写请求并执行写请求。
④NVMe控制器更新SQ头DB。
NVMe控制器把SQ中的写请求执行完,SQ的头也变为1,NVMe控制器把SQ1的头写入到SQ头DB。
⑤NVMe控制器向CQ写入写请求执行结果。
NVMe控制器执行写请求,将写请求执行结果写入CQ的尾。
⑥客户端更新CQ的尾DB。
NVMe控制器执行写请求,向CQ尾写入写请求执行结果,更新CQ的尾DB,值为1。
⑦客户端获取写请求执行结果。
具体实现,客户端可以采用轮询的方式,从CQ获取写请求执行结果。
⑧NVMe控制器更新CQ的头DB。
客户端向CQ的头DB中写入CQ的头的地址,值为1。
本发明实施例访问请求处理流程另一种实现方式中,图13中的步骤②和⑥可以也可以由存储节点的接口卡实现。
如图12所示,存储节点记录本存储节点中NVMe存储设备的标识与NVMe存储设备的队列的起始地址的映射关系。客户端访问存储系统,客户端根据访问请求中包含的存储地址确定存储地址对应的NVMe存储设备的逻辑地址,具体实现请参见前面描述。例如,客户端确定存储地址对应存储节点1中的NVMe存储设备1的逻辑地址L1、存储节点2中的NVMe存储设备2的逻辑地址L2、存储节点3中的NVMe存储设备3的逻辑地址L3和存储节点4中的NVMe存储设备4的逻辑地址L4,即NVMe存储设备1、NVMe存储设备2、NVMe存储设备3和NVMe存储设备4构成多副本的分条关系。在另一种实现方式,客户端分别向存储节点1、存储节点2、存储节点3和存储节点4发送查询请求,查询请求中包含NVMe存储设备的标识。结合图12,存储节点1接收来自客户端的查询请求,查询表项中记录的映射关系,向客户端发送查询请求响应。响应中包含存储节点1的NVMe存储设备1的队列的起始地址Add1。存储节点2、存储节点3和存储节点4分别根据查询请求执行查询操作,向客户端发送查询请求响应。仍以存储节点1为例,存储节点1建立锁定标识,锁定标识用于锁定队列的起始地址Add1。一旦队列的起始地址被锁定,表示该队列已经被分配给客户端。具体实现,锁定标识可以是一个标记位,例如用0表示锁定,1表示未锁定。锁定标识可以记录在图12所示的表项中。 在这种实现方式中,客户端向存储节点发送查询请求,可以减少管理服务器的负载。客户端从存储节点接收查询请求响应,执行图13所示的操作。
本发明实施例中另一种实现方式,NVMe存储设备构成EC的分条关系。客户端也可以使用上述两种方式访问NVMe存储设备,在此不再赘述。
本发明实施例中,客户端获得NVMe存储设备队列的起始地址,在释放该NVMe存储设备队列之前,客户端根据接收访问请求的数量以及队列的起始地址,根据NVMe存储设备队列的起始地址的变化,向NVMe存储设备队列发送RDMA访问请求,并从NVMe存储设备的CQ获得访问请求执行结果。
在本发明实施例中,客户端可以直接向存储系统中存储节点的NVMe存储设备发送RDMA请求,不需要存储节点的CPU参与,充分发挥和利用了NVMe存储设备的性能。
在NVMe存储设备应用到本发明实施例的同时,随着存储设备的发展,存储系统采用多种类似存储设备以提升存储系统性能。例如,存储级内存(Storage Class Memory,SCM)的存储设备同时具备持久化和快速字节级访问的特点。目前比较流行的SCM的存储设备主要包含相变存储器(Phase-change Memory,PCM)、阻抗随机存储器(Resistive Random-access Memory,ReRAM)、磁性随机存储器(Magnetic Random Access Memory)和碳纳米管随机存储器(Nantero’s CNT Random Access Memory)等。如图14所示,本发明实施例中的分布式块存储系统,每一个存储节点包含一个或多个NVMe存储设备,例如NVMe接口规范的固态硬盘(Solid State Disk,SSD),同时包含一个或多个SCM的存储设备。另一种实现,部分存储节点包含一个或多个NVMe存储设备,部分存储节点包含一个或多个SCM的存储设备。
在图14所示的分布式块存储系中,分布式块存储系统将NVMe存储设备的逻辑地址划分资源池,将SCM的存储设备划分为资源池,客户端接收的访问请求中的存储地址映射到NVMe存储设备提供的逻辑地址,具体实现可参考前面实施例描述。客户端接收的访问请求中的存储地址到存储设备提供的地址的映射方式可以表示为存储地址-----NVMe存储设备----逻辑地址,存储地址---SCM的存储设备---基地址。
基于图14所示的分布式存储系统,数据的多个副本可以存储在不同类型的存储设备上。例如,一个副本存储在SCM的存储设备上,一个或多个副本存储在NVMe存储设备上。或者一个副本存储在SCM的存储设备上,一个副本以EC分条的形式存储在多个NVMe存储设备上。客户端读取数据时,从SCM的存储设备中获取数据,从而提高读性能。
本发明实施例以一个副本存储在SCM的存储设备上,两个副本存储在NVMe存储设备上为例进行描述。客户端接收的写请求中的数据映射到一个SCM的存储设备的基地址和两个NVMe存储设备的逻辑地址。如前面所述,具体实现,写请求中的存储地址到SCM的存储设备提供的基地址映射以及写请求中的存储地址到NVMe存储设备的逻辑地址的映射可以基于图3和图4所示的分区视图,也可以是写请求中的存储地址到SCM的存储设备提供的基地 址的直接映射以及写请求中的存储地址与NVMe存储设备的逻辑地址的直接映射。
进一步的,客户端接收访问请求,根据访问请求中的存储地址,确定存储地址映射的SCM的存储设备基地址以及存储地址映射的NVMe存储设备的逻辑地址。本发明实施例中,客户端确定存储地址映射的NVMe存储设备的逻辑地址,后续的访问过程可以参考前面客户端访问NVMe存储设备的过程,在此不再赘述。如图15所示,以副本长度为8字节为例,客户端向SCM的存储设备发送RDMA写请求的过程如下:
①客户端向存储节点发送获取与增加(fetch and add(ptr,8))命令。
其中,fetch and add(ptr,len value)为RDMA原子操作指令,用于获取当前已经分配的存储空间的结束地址以及写入数据长度。len value表示写入数据长度,本发明实施例中当前已经分配的存储空间的结束地址为10,len value为8字节。
②存储节点为客户端分配长度为8字节的存储地址。
存储节点收到fetch and add(ptr,8)命令,将存储地址为11-18预留给客户端。
③存储节点向客户端返回当前已经分配的存储空间的结束地址。
④客户端向存储节点发送RDMA写请求。其中,RDMA请求中包含长度为8字节的数据和当前已经分配的存储空间的结束地址(基地址)为10。
结合图1所示的存储系统,另一种实现方式,存储节点还可以包含机械硬盘,每一个存储节点包含一个或多个NVMe存储设备,例如NVMe接口规范的固态硬盘(Solid State Disk,SSD),同时包含一个或多个机械硬盘。另一种实现,部分存储节点包含一个或多个NVMe存储设备,部分存储节点包含一个或多个机械硬盘。数据的多个副本可以存储在不同类型的存储设备上。例如,一个副本存储在NVMe存储设备上,一个或多个副本存储在机械硬盘上。或者一个副本存储在NVMe存储设备上,一个副本以EC分条的形式存储在多个机械硬盘上。客户端读取数据时,从NVMe存储设备中获取数据,从而提高读性能。具体访问过程可参考本发明上述各实施例描述,在此不再赘述。客户端向机械硬盘发送的写请求包含客户端接收的写请求中的存储地址映射到机械硬盘的逻辑地址。客户端向机械硬盘发送的写请求也可以为RDMA请求。
本发明上述实施例以客户端向存储节点中的NVMe存储设备及SCM的存储设备发送RDMA写请求为例进行说明,本发明实施例方案也可以应用到向存储节点中的NVMe存储设备及SCM的存储设备发送RDMA读请求等,本发明实施例对此不作限定。即本发明实施例可以实现客户端向存储节点中的NVMe存储设备及SCM的存储设备发送RDMA访问请求。本发明实施例中,客户端接收到的访问请求中包含的存储地址可以对应多个存储设备的逻辑地址(或基地址),因此,对其中一个存储设备而言,表示为逻辑地址映射到该存储设备的逻辑地址(或基地址)
基于本发明上述实施例的描述,本发明实施例还提供了存储节点,如图16所示,应用于存储系统中,存储系统还包含管理服务器,存储节点包含NVMe接口规范的第一存储设备;存储节点包括分配单元161和发送单元162;其中,分配单元,用于在内存中为第一存储设备的队列的起始地址分配第一内存地址,发送单元162,用于向管理服务器发送第一队列消息;第一队列消息包含存储节点的标识和第一内存地址。进一步的,存储节点还包括检测单元,用于检测到第一存储设备安装到所述存储节点。进一步的,检测单元,还用于检测到第一存储设备从存储节点中移除;发送单元162,还用于向管理服务器发送队列消息删除消息;队列消息删除消息包含第一内存地址。
与图16所示的存储节点对应,本发明实施例还提供了如图17所示的一种应用于存储系统的客户端,存储系统包含管理服务器和第一存储节点,第一存储节点包含NVMe接口规范的第一存储设备,第一存储设备的队列的起始地址存储在第一存储节点的第一内存中的第一内存地址;管理服务器存储第一存储节点的标识和第一内存地址的第一映射关系;客户端包括发送单元171和接收单元172,其中,发送单元171,用于向管理服务器发送查询请求;查询请求包含第一存储节点的标识;接收单元172,用于接收来自管理服务器查询请求响应;查询请求响应包含管理服务器根据第一映射关系确定的第一内存地址。进一步的,发送单元171,还用于向第一存储节点发送第一远程直接内存访问请求;第一远程直接内存访问请求包含第一内存地址。进一步的,存储系统还包含第二存储节点,第二存储节点包含NVMe接口规范的第二存储设备,第二存储设备的队列的起始地址位于第二存储节点的第二内存中的第二内存地址;管理服务器存储第二存储节点的标识和第二内存地址的第二映射关系;查询请求包含所述第二存储节点的标识;查询请求响应包含所述管理服务器根据第二映射关系确定的第二内存地址;其中,第一存储设备和第二存储设备构成分条关系;发送单元171,还用于向第二存储节点发送第二远程直接内存访问请求;第二远程直接内存访问请求包含第二内存地址。
与图16和17对应,在图18所示实施例中提供了一种存储系统中的管理服务器,存储系统包含管理服务器和第一存储节点,第一存储节点包含NVMe接口规范的第一存储设备,第一存储设备的队列的起始地址位于第一存储节点的第一内存中的第一内存地址;管理服务器包括获取单元181、建立单元182、接收单元183和发送单元184;其中,获取单181元,用于从第一存储节点获取第一队列消息;第一队列消息包含第一存储节点的标识和第一内存地址;建立单元182,用于建立第一存储节点的标识和第一内存地址之间的第一映射关系;接收单元183,用于接收来自客户端的查询请求;查询请求包含第一存储节点的标识;发送单元184,用于根据第一映射关系向客户端发送查询请求响应;查询请求响应包含第一内存地址。进一步的,建立单元182,还用于建立锁定标识,锁定标识用于锁定第一内存地址。进一步的,存储系统还包括第二存储节点,第二存储节点包含NVMe接口规范 的第二存储设备,第二存储设备的队列的起始地址位于第二存储节点的第二内存中的第二内存地址;获取单元181还用于从第二存储节点获取第二存储设备的第二队列消息;第二队列消息包含第二存储节点的标识和第二内存地址;建立单元182,还用于建立第二存储节点的标识和第二内存地址的第二映射关系。进一步的,获取单元181还用于从第一存储节点获得第三队列消息;第三队列消息包含第一存储节点的标识和第三内存地址;建立单元182,还用于建立第一存储节点的标识和第三内存地址的第三映射关系;其中,NVMe接口规范的第三存储设备的队列的起始地址位于第一内存中的第三内存地址,第三存储设备为所述第一存储节点新增加的存储设备。进一步的,接收单元183,还用于从第二存储节点接收队列消息删除消息;队列消息删除消息包含第二内存地址,管理服务器还包含删除单元,用于删除第二映射关系。进一步的,管理服务器,还包括检测单元和删除单元;其中,检测单元,用于检测到与第一存储节点通信中断;删除单元,用于删除第一映射关系。进一步的,检测单元,具体用于检测在预定的时间内未收到第一存储节点的心跳。
如图19所示,本发明实施例进一步提供了一种存储系统中的管理服务器,存储系统还包含第一存储节点和第二存储节点,第一存储节点包含NVMe接口规范的第一存储设备,第一存储设备的队列的起始地址位于第一存储节点的第一内存中的第一内存地址;第二存储节点包含NVMe接口规范的第二存储设备,第二存储设备的队列的起始地址位于第二存储节点的第二内存中的第二内存地址;管理服务器存储映射表,映射表包含第一存储节点的标识和第一内存地址之间的第一映射关系以及第二存储节点的标识和第二内存地址之间的第二映射关系;管理服务器包括接收单元191和发送单元192;其中,接收单元191,用于接收来自客户端的查询请求;查询请求包含第一存储节点的标识和第二存储节点的标识;发送单元192,用于根据映射表向所述客户端发送查询请求响应;查询请求响应包含第一内存地址和第二内存地址。进一步的,管理服务器还包括获取单元和建立单元,用于从第一存储节点获取第一队列消息,从第二存储节点获取第二队列消息;第一队列消息包含第一存储节点的标识和第一内存地址;第二队列消息包含第二存储节点的标识和第二内存地址;建立单元,用于建立第一映射关系和第二映射关系。进一步的,建立单元还用于建立锁定标识,锁定标识用于锁定第一映射关系和第二映射关系。进一步的,获取单元还用于从第一存储节点获得第三队列消息;第三队列消息包含第一存储节点的标识和第三内存地址;建立单元,还用于建立第一存储节点的标识和第三内存地址的第三映射关系;其中,NVMe接口规范的第三存储设备的队列的起始地址位于第一内存中的第三内存地址,第三存储设备为所述第一存储节点新增加的存储设备。进一步的,管理服务器,还包括删除单元;接收单元191,还用于从第二存储节点接收队列消息删除消息;队列消息删除消息包含第二内存地址;删除单元,用于从映射表中删除第二映射关系。进一步的,管理服务器,还包括检测单元和 删除单元;其中,检测单元,用于检测到与第一存储节点通信中断;删除单元,用于删除第一映射关系。进一步的,检测单元具体用于检测到在预定的时间内未收到第一存储节点的心跳。
与图19描述的管理服务器对应,图20提供了本发明实施例的存储系统中的客户端,存储系统包含管理服务器、第一存储节点和第二存储节点,第一存储节点包含NVMe接口规范的第一存储设备,第一存储设备的队列的起始地址位于第一存储节点的第一内存中的第一内存地址;第二存储节点包含NVMe接口规范的第二存储设备,第二存储设备的队列的起始地址位于第二存储节点的第二内存中的第二内存地址;管理服务器存储映射表,映射表包含第一存储节点的标识和所述第一内存地址之间的第一映射关系以及第二存储节点的标识和第二内存地址之间的第二映射关系;客户端包括发送单元2001和接收单元2002;其中,发送单元2001,用于向管理服务器发送查询请求;查询请求包含第一存储节点的标识和第二存储节点的标识;接收单元2002,用于接收来自管理服务器的查询请求响应;查询请求响应包含管理服务器根据映射表确定的第一内存地址和第二内存地址。进一步的,发送单元2002,还用于向第一存储节点发送第一远程直接内存访问请求,向第二存储节点发送第二远程直接内存访问请求;第一远程直接内存访问请求包含第一内存地址;第二远程直接内存访问请求包含第二内存地址。
如图21所示,本发明实施例提供了存储系统中另一种客户端,存储系统包含第一存储节点,第一存储节点包含NVMe接口规范的第一存储设备,第一存储设备的队列的起始地址位于第一存储节点的第一内存中的第一内存地址,客户端包括接收单元2101、查询单元2102、获取单元2103和发送单元2104;其中,接收单元2101,用于接收写请求,写请求包含存储地址;查询单元2102,用于查询存储地址的映射关系,存储地址的映射关系所述存储地址映射到第一存储节点的第一存储设备的第一逻辑地址;获取单元2103,用于获取第一存储设备的队列的起始地址所在的第一内存地址;发送单元2104,用于向第一存储节点发送第一远程直接内存访问写请求;第一远程直接内存访问写请求包含第一内存地址和第一逻辑地址。进一步的,存储系统还包括管理服务器,管理服务器存储有第一存储节点的标识与所述内存地址的映射关系;获取单元2103,具体用于:向管理服务器发送第一查询请求;第一队列查询请求包含第一存储节点的标识,接收来自管理服务器的第一查询请求响应,第一查询请求响应包含第一内存地址。进一步的,获取单元2103,具体用于:向第一存储节点发送第二查询请求;第二查询请求包含第一存储设备的标识;接收来自第一存储节点的第二查询请求响应;第二查询请求响应包含所述第一内存地址。进一步的,存储系统还包括第二存储节点,第二存储节点包含存储级内存的第二设备,存储地址的映射关系包含存储地址映射到第二存储设备的第一基地址;发送单元2104,还用于向所述第二存储节点发送第二远程直接内存访问写请求;第二远程直接内存访问写请求包含所述第一基地址。进一步的,存储系统还包括第三存储节点,第三存储节点包 含第三存储设备,存储地址的映射关系包含存储地址映射到第三存储设备的第三逻辑地址;第三存储设备为机械硬盘;发送单元2104,还用于向第三存储节点发送第三写请求;第三写请求包含所述第三逻辑地址。进一步的,存储系统包含第四存储节点,第四存储节点包含NVMe接口规范的第四存储设备,第四存储设备的队列的起始地址位于第四存储节点的第三内存中的第三内存地址;存储地址的映射关系包含存储地址映射到第四存储设备的第四逻辑地址;获取单元2101,还用于获取第四存储设备的队列的起始地址所在的第三内存地址;发送单元2104,还用于向第一存储节点发送第四远程直接内存访问写请求;第四远程直接内存访问写请求包含第三内存地址和第四逻辑地址。
本发明实施例中,队列的起始地址位于内存中的内存地址与队列在内存中的起始地址具有相同的含义。上述描述也称为队列在内存中的起始地址位于内存中的某一内存地址。
本发明实施例的管理服务器、存储节点和客户端的实现可以参考前面本发明实施例中的管理服务器、存储节点和客户端的描述。具体的,本发明实施例中的装置可以为软件模块,可以运行在服务器上,从而使服务器完成本发明实施例中描述的各种实现。装置也可以为硬件设备,具体可以参考图2所示的结构,装置的各单元可以由图2描述的服务器的处理器实现。
相应的,本发明实施例还提供了计算机可读存储介质和计算机程序产品,计算机可读存储介质和计算机程序产品中包含计算机指令用于实现本发明实施例中描述的各种方案。
本发明实施例中以EC和多副本作为分条算法,但本发明实施例中的分条算法并不限于EC和多副本作为分条算法。
在本发明所提供的几个实施例中,应该理解到,所公开的装置、方法,可以通过其它的方式实现。例如,以上所描述的装置实施例所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例各方案的目的。
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。

Claims (54)

  1. 一种存储系统中存储设备的管理方法,其特征在于,所述存储系统包含管理服务器和第一存储节点,所述第一存储节点包含NVMe接口规范的第一存储设备,所述第一存储设备的队列的起始地址位于所述第一存储节点的第一内存中的第一内存地址;所述方法包括:
    所述管理服务器从所述第一存储节点获取第一队列消息;所述第一队列消息包含所述第一存储节点的标识和所述第一内存地址;
    所述管理服务器建立所述第一存储节点的标识和所述第一内存地址之间的第一映射关系;
    所述管理服务器接收来自客户端的查询请求;所述查询请求包含所述第一存储节点的标识;
    所述管理服务器根据所述第一映射关系向所述客户端发送查询请求响应;所述查询请求响应包含所述第一内存地址。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    所述管理服务器建立锁定标识,所述锁定标识用于锁定所述第一内存地址。
  3. 根据权利要求1所述的方法,其特征在于,所述存储系统还包括第二存储节点,所述第二存储节点包含所述NVMe接口规范的第二存储设备,所述第二存储设备的队列的起始地址位于所述第二存储节点的第二内存中的第二内存地址;所述方法还包括:
    所述管理服务器从所述第二存储节点获取所述第二存储设备的第二队列消息;所述第二队列消息包含所述第二存储节点的标识和所述第二内存地址;
    所述管理服务器建立所述第二存储节点的标识和所述第二内存地址第二映射关系。
  4. 根据权利要求1至3任一所述的方法,其特征在于,所述方法还包括:
    所述管理服务器从所述第一存储节点获得第三队列消息;所述第三队列消息包含所述第一存储节点的标识和第三内存地址;
    所述管理服务器建立所述第一存储节点的标识和所述第三内存地址的第三映射关系;其中,所述NVMe接口规范的第三存储设备的队列的起始地址位于所述第三内存地址,所述第三存储设备为所述第一存储节点新增加的存储设备。
  5. 根据权利要求3或4所述的方法,其特征在于,所述方法还包括:
    所述管理服务器从所述第二存储节点接收队列消息删除消息;所述队列消息删除消息包含所述第二内存地址;
    所述管理服务器删除所述第二映射关系。
  6. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    所述管理服务器检测到与所述第一存储节点通信中断;
    所述管理服务器删除所述第一映射关系。
  7. 根据权利要求6所述的方法,其特征在于,所述管理服务器检测到与所述第一存储节点通信中断,具体包括:
    所述管理服务器在预定的时间内未收到所述第一存储节点的心跳。
  8. 一种存储系统中存储设备的管理方法,其特征在于,所述存储系统包含管理服务器和存储节点,所述存储节点包含NVMe接口规范的第一存储设备;所述方法包括:
    所述存储节点在内存中为所述第一存储设备的队列的起始地址分配第一内存地址;
    所述存储节点向所述管理服务器发送第一队列消息;所述第一队列消息包含所述存储节点的标识和所述第一内存地址。
  9. 根据权利要求8所述的方法,其特征在于,所述方法还包括:
    所述存储节点检测到所述第一存储设备安装到所述存储节点。
  10. 根据权利要求8或9所述的方法,其特征在于,所述方法还包括:
    所述存储节点检测到所述第一存储设备从所述存储节点中移除;
    所述存储节点向所述管理服务器发送队列消息删除消息;所述队列消息删除消息包含所述第一内存地址。
  11. 一种存储系统中存储设备的管理方法,其特征在于,所述存储系统包含管理服务器和第一存储节点,所述第一存储节点包含NVMe接口规范的第一存储设备,所述第一存储设备的队列的起始地址位于所述第一存储节点的第一内存中的第一内存地址;所述管理服务器存储所述第一存储节点的标识和所述第一内存地址的第一映射关系;所述方法包括:
    客户端向所述管理服务器发送查询请求;所述查询请求包含所述第一存储节点的标识;
    所述客户端接收来自所述管理服务器查询请求响应;所述查询请求响应包含所述管理服务器根据所述第一映射关系确定的所述第一内存地址。
  12. 根据权利要求11所述的管理方法,其特征在于,所述客户端向所述第一存储节点发送第一远程直接内存访问请求;所述第一远程直接内存访问请求包含所述第一内存地址。
  13. 根据权利要求12所述的管理方法,其特征在于,所述存储系统还包含第二存储节点,所述第二存储节点包含所述NVMe接口规范的第二存储设备,所述第二存储设备的队列的起始地址位于所述第二存储节点的第二内存中的第二内存地址;所述管理服务器存储所述第二存储节点的标识和所述第二内存地址的第二映射关系;所述查询请求包含所述第二存储节点的标识;所述查询请求响应包含所述管理服务器根据所述第二映射关系确定的所述第二内存地址;其中,所述第一存储设备和所述第二存储设备构成分条关系;所述方法还包括:
    所述客户端向所述第二存储节点发送第二远程直接内存访问请求;所述第二远程直接内存访问请求包含所述第二内存地址。
  14. 一种存储系统中的管理服务器,其特征在于,所述存储系统包含管理服务器和第一存储节点,所述第一存储节点包含NVMe接口规范的第一存储设备,所述第一存储设备的队列的起始地址位于所述第一存储节点的第一内存中的第一内存地址;所述管理服务器包括:
    获取单元,用于从所述第一存储节点获取第一队列消息;所述第一队列消息包含所述第一存储节点的标识和所述第一内存地址;
    建立单元,用于建立所述第一存储节点的标识和所述第一内存地址之间的第一映射关系;
    接收单元,用于接收来自客户端的查询请求;所述查询请求包含所述第一存储节点的标识;
    发送单元,用于根据所述第一映射关系向所述客户端发送查询请求响应;所述查询请求响应包含所述第一内存地址址。
  15. 根据权利要求14所述的管理服务器,其特征在于,所述建立单元,还用于建立锁定标识,所述锁定标识用于锁定所述第一内存地址。
  16. 根据权利要求14所述的管理服务器,其特征在于,所述存储系统还包括第二存储节点,所述第二存储节点包含所述NVMe接口规范的第二存储设备,所述第二存储设备的队列的起始地址位于所述第二存储节点的第二内存中的第二内存地址;所述获取单元还用于从所述第二存储节点获取所述第二存储设备的第二队列消息;所述第二队列消息包含所述第二存储节点的标识和所述第二内存地址;
    所述建立单元,还用于建立所述第二存储节点的标识和所述第二内存地址的第二映射关系。
  17. 根据权利要求14至16任一所述的管理服务器,其特征在于,所述获取单元还用于从所述第一存储节点获得第三队列消息;所述第三队列消息包含所述第一存储节点的标识和第三内存地址;
    所述建立单元,还用于建立所述第一存储节点的标识和第三内存地址的第三映射关系;其中,所述NVMe接口规范的第三存储设备的队列的起始地址位于所述第一内存中的所述第三内存地址,所述第三存储设备为所述第一存储节点新增加的存储设备。
  18. 根据权利要求16或17所述的管理服务器,其特征在于,所述接收单元,还用于从所述第二存储节点接收队列消息删除消息;所述队列消息删除消息包含所述第二内存地址;
    所述管理服务器还包含删除单元,用于删除所述第二映射关系。
  19. 根据权利要求14所述的管理服务器,其特征在于,还包括检测单元和删除单元;其中,所述检测单元,用于检测到与所述第一存储节点通信中断;
    所述删除单元,用于删除所述第一映射关系。
  20. 根据权利要求19所述的管理服务器,其特征在于,所述检测单元,具体用于检测在预定的时间内未收到所述第一存储节点的心跳。
  21. 一种存储系统中存储节点,其特征在于,所述存储系统还包含管理服务器,所述存储节点包含NVMe接口规范的第一存储设备;所述存储节点包括:
    分配单元,用于在内存中为所述第一存储设备的队列的起始地址分配第一内存地址;
    发送单元,用于向所述管理服务器发送第一队列消息;所述第一队列消息包含所述存储节点的标识和所述第一内存地址。
  22. 根据权利要求21所述的存储节点,其特征在于,还包括:
    检测单元,用于检测到所述第一存储设备安装到所述存储节点。
  23. 根据权利要求22所述的存储节点,其特征在于,
    所述检测单元,还用于检测到所述第一存储设备从所述存储节点中移除;
    所述发送单元,还用于向所述管理服务器发送队列消息删除消息;所述队列消息删除消息包含所述第一内存地址。
  24. 一种应用于存储系统的客户端,其特征在于,所述存储系统包含管理服务器和第一存储节点,所述第一存储节点包含NVMe接口规范的第一存储设备,所述第一存储设备的队列的起始地址位于所述第一存储节点的第一内存中的第一内存地址;所述管理服务器存储所述第一存储节点的标识和所述第一内存地址的第一映射关系;所述客户端包括:
    发送单元,用于向所述管理服务器发送查询请求;所述查询请求包含所述第一存储节点的标识;
    接收单元,用于接收来自所述管理服务器查询请求响应;所述查询请求响应包含所述管理服务器根据所述第一映射关系确定的所述第一内存地址。
  25. 根据权利要求24所述的客户端,其特征在于,
    所述发送单元,还用于向所述第一存储节点发送第一远程直接内存访问请求;所述第一远程直接内存访问请求包含所述第一内存地址。
  26. 根据权利要求25所述的客户端,其特征在于,所述存储系统还包含第二存储节点,所述第二存储节点包含所述NVMe接口规范的第二存储设备,所述第二存储设备的队列的起始地址位于所述第二存储节点的第二内存中的第二内存地址;所述管理服务器存储所述第二存储节点的标识和所述第二内存地址的第二映射关系;所述查询请求包含所述第二存储节点的标识;所述查询请求响应包含所述管理服务器根据所述第二映射关系确定的所述第二内存地址;其中,所述第一存储设备和所述第二存储设备构成分条关系;
    所述发送单元,还用于向所述第二存储节点发送第二远程直接内存访问请求;所述第二远程直接内存访问请求包含所述第二内存地址。
  27. 一种存储系统中的管理服务器,其特征在于,所述存储系统包含管理服务器和第一存储节点,所述第一存储节点包含NVMe接口规范的第一存储设备,所述第一存储设备的队列的起始地址位于所述第一存储节点的第一内存中的第一内存地址;所述所述管理服务器包括接口和处理器,所述接口和所述处理器通信;其中,所述处理器用于:
    从所述第一存储节点获取第一队列消息;所述第一队列消息包含所述第一存储节点的标识和所述第一内存地址;
    建立所述第一存储节点的标识和所述第一内存地址之间的第一映射关系;
    接收来自客户端的查询请求;所述查询请求包含所述第一存储节点的标识;
    根据所述第一映射关系向所述客户端发送查询请求响应;所述查询请求响应包含所述第一内存地址。
  28. 根据权利要求27所述的管理服务器,其特征在于,所述处理器还用于:
    建立锁定标识,所述锁定标识用于锁定所述第一内存地址。
  29. 根据权利要求27所述的管理服务器,其特征在于,所述存储系统还包括第二存储节点,所述第二存储节点包含所述NVMe接口规范的第二存储设备,所述第二存储设备的队列的起始地址位于所述第二存储节点的第二内存中的第二内存地址;所述处理单元还用于:
    从所述第二存储节点获取所述第二存储设备的第二队列消息;所述第二队列消息包含所述第二存储节点的标识和所述第二内存地址;
    建立所述第二存储节点的标识和所述第二内存地址的第二映射关系。
  30. 根据权利要求27至29任一所述的管理服务器,其特征在于,所述处理器还用于:
    从所述第一存储节点获得第三队列消息;所述第三队列消息包含所述第一存储节点的标识和第三内存地址;
    建立所述第一存储节点的标识和第三内存地址的第三映射关系;其中,所述NVMe接口规范的第三存储设备的队列的起始地址位于所述第一内存中的所述第三内存地址,所述第三存储设备为所述第一存储节点新增加的存储设备。
  31. 根据权利要求29或30所述的管理服务器,其特征在于,所述处理器还用于:
    从所述第二存储节点接收队列消息删除消息;所述队列消息删除消息包含所述第二内存地址;
    删除所述第二映射关系。
  32. 根据权利要求27所述的管理服务器,其特征在于,所述处理器还用于:
    检测到与所述第一存储节点通信中断;
    删除所述第一映射关系。
  33. 根据权利要求32所述的管理服务器,其特征在于,所述处理器具体用于检测在预定的时间内未收到所述第一存储节点的心跳。
  34. 一种存储系统中存储节点,其特征在于,所述存储系统还包含管理服务器,所述存储节点包含NVMe接口规范的第一存储设备;所述存储节点 包括接口和处理器,所述接口和所述处理器通信;所述处理器用于:
    在内存中为所述第一存储设备的队列的起始地址分配第一内存地址;
    向所述管理服务器发送第一队列消息;所述第一队列消息包含所述存储节点的标识和所述第一内存地址。
  35. 根据权利要求34所述的存储节点,其特征在于,所述处理器还用于:
    检测到所述第一存储设备安装到所述存储节点。
  36. 根据权利要求35所述的存储节点,其特征在于,所述处理器还用于:
    检测到所述第一存储设备从所述存储节点中移除;
    向所述管理服务器发送队列消息删除消息;所述队列消息删除消息包含所述第一内存地址。
  37. 一种应用于存储系统的客户端,其特征在于,所述存储系统包含管理服务器和第一存储节点,所述第一存储节点包含NVMe接口规范的第一存储设备,所述第一存储设备的队列的起始地址位于所述第一存储节点的第一内存中的第一内存地址;所述管理服务器存储所述第一存储节点的标识和所述第一内存地址的第一映射关系;所述客户端包括接口和处理器,所述接口和所述处理器通信;所述处理器用于:
    向所述管理服务器发送查询请求;所述查询请求包含所述第一存储节点的标识;
    接收来自所述管理服务器查询请求响应;所述查询请求响应包含所述管理服务器根据所述第一映射关系确定的所述第一内存地址。
  38. 根据权利要求37所述的客户端,其特征在于,所述处理器还用于:
    向所述第一存储节点发送第一远程直接内存访问请求;所述第一远程直接内存访问请求包含所述第一内存地址。
  39. 根据权利要求38所述的客户端,其特征在于,所述存储系统还包含第二存储节点,所述第二存储节点包含所述NVMe接口规范的第二存储设备,所述第二存储设备的队列的起始地址位于所述第二存储节点的第二内存中的第二内存地址;所述管理服务器存储所述第二存储节点的标识和所述第二内存地址的第二映射关系;所述查询请求包含所述第二存储节点的标识;所述查询请求响应包含所述管理服务器根据所述第二映射关系确定的所述第二内存地址;其中,所述第一存储设备和所述第二存储设备构成分条关系;所述处理器还用于:
    向所述第二存储节点发送第二远程直接内存访问请求;所述第二远程直接内存访问请求包含所述第二内存地址。
  40. 一种存储系统,其特征在于,所述存储系统包含管理服务器和第一存储节点,所述第一存储节点包含NVMe接口规范的第一存储设备;其中,
    所述第一存储节点用于在所述第一存储节点的第一内存中为所述第一存储设备的队列的起始地址分配第一内存地址,
    向所述管理服务器发送第一队列消息;所述第一队列消息包含所述第一存储节点的标识和所述第一内存地址;
    所述管理服务器,用于从获取所述第一队列消息,建立所述第一存储节点的标识和所述第一内存地址之间的第一映射关系。
  41. 根据权利要求40所述的存储系统,其特征在于,所述存储系统还包括第二存储节点,所述第二存储节点包含所述NVMe接口规范的第二存储设备;其中,
    所述第二存储节点,用于在所述第二存储节点的第二内存中为所述第二存储设备的队列的起始地址分配第二内存地址,向所述管理服务器发送第二队列消息;所述第二队列消息包含所述第二存储节点的标识和所述第二内存地址;
    所述管理服务器,还用于获取所述第二队列消息,建立所述第二存储节点的标识和所述第二内存地址之间的第二映射关系。
  42. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions to be executed by a management server in a storage system, the storage system further comprises a first storage node, the first storage node comprises a first storage device of the NVMe interface specification, and a start address of a queue of the first storage device is located at a first memory address in a first memory of the first storage node; and a processor of the management server executes the computer instructions to:
    obtain a first queue message from the first storage node, where the first queue message comprises an identifier of the first storage node and the first memory address;
    establish a first mapping relationship between the identifier of the first storage node and the first memory address;
    receive a query request from a client, where the query request comprises the identifier of the first storage node; and
    send a query request response to the client according to the first mapping relationship, where the query request response comprises the first memory address.
  43. The computer-readable storage medium according to claim 42, wherein the processor of the management server further executes the computer instructions to:
    establish a lock identifier, where the lock identifier is used to lock the first memory address.
  44. The computer-readable storage medium according to claim 42, wherein the storage system further comprises a second storage node, the second storage node comprises a second storage device of the NVMe interface specification, and a start address of a queue of the second storage device is located at a second memory address in a second memory of the second storage node; and the processor of the management server further executes the computer instructions to:
    obtain a second queue message of the second storage device from the second storage node, where the second queue message comprises an identifier of the second storage node and the second memory address; and
    establish a second mapping relationship between the identifier of the second storage node and the second memory address.
  45. The computer-readable storage medium according to any one of claims 42 to 44, wherein the processor of the management server further executes the computer instructions to:
    obtain a third queue message from the first storage node, where the third queue message comprises the identifier of the first storage node and a third memory address; and
    establish a third mapping relationship between the identifier of the first storage node and the third memory address, where a start address of a queue of a third storage device of the NVMe interface specification is located at the third memory address in the first memory, and the third storage device is a storage device newly added to the first storage node.
  46. The computer-readable storage medium according to claim 44 or 45, wherein the processor of the management server further executes the computer instructions to:
    receive a queue message deletion message from the second storage node, where the queue message deletion message comprises the second memory address; and
    delete the second mapping relationship.
  47. The computer-readable storage medium according to claim 42, wherein the processor of the management server further executes the computer instructions to:
    detect that communication with the first storage node is interrupted; and
    delete the first mapping relationship.
  48. The computer-readable storage medium according to claim 47, wherein the processor of the management server executes the computer instructions specifically to:
    detect that no heartbeat from the first storage node is received within a predetermined time.
  49. A computer-readable storage medium, wherein the computer-readable storage medium comprises computer instructions to be executed by a storage node in a storage system, the storage system further comprises a management server, and the storage node comprises a first storage device of the NVMe interface specification; and a processor of the storage node executes the computer instructions to:
    allocate a first memory address in a memory to a start address of a queue of the first storage device; and
    send a first queue message to the management server, where the first queue message comprises an identifier of the storage node and the first memory address.
  50. The computer-readable storage medium according to claim 49, wherein the processor of the storage node further executes the computer instructions to:
    detect that the first storage device is installed in the storage node.
  51. The computer-readable storage medium according to claim 50, wherein the processor of the storage node further executes the computer instructions to:
    detect that the first storage device is removed from the storage node; and
    send a queue message deletion message to the management server, where the queue message deletion message comprises the first memory address.
  52. A computer-readable storage medium, wherein the computer-readable storage medium comprises computer instructions to be executed by a client accessing a storage system, the storage system comprises a management server and a first storage node, the first storage node comprises a first storage device of the NVMe interface specification, a start address of a queue of the first storage device is located at a first memory address in a first memory of the first storage node, and the management server stores a first mapping relationship between an identifier of the first storage node and the first memory address; and a processor of the client executes the computer instructions to:
    send a query request to the management server, where the query request comprises the identifier of the first storage node; and
    receive a query request response from the management server, where the query request response comprises the first memory address determined by the management server according to the first mapping relationship.
  53. The computer-readable storage medium according to claim 52, wherein the processor of the client further executes the computer instructions to:
    send a first remote direct memory access request to the first storage node, where the first remote direct memory access request comprises the first memory address.
  54. The computer-readable storage medium according to claim 53, wherein the storage system further comprises a second storage node, the second storage node comprises a second storage device of the NVMe interface specification, a start address of a queue of the second storage device is located at a second memory address in a second memory of the second storage node, the management server stores a second mapping relationship between an identifier of the second storage node and the second memory address, the query request comprises the identifier of the second storage node, the query request response comprises the second memory address determined by the management server according to the second mapping relationship, and the first storage device and the second storage device form a striping relationship; and the processor of the client further executes the computer instructions to:
    send a second remote direct memory access request to the second storage node, where the second remote direct memory access request comprises the second memory address.
PCT/CN2017/118650 2017-12-26 2017-12-26 Method and apparatus for managing storage device in storage system WO2019127021A1 (zh)

Priority Applications (6)

Application Number Priority Date Filing Date Title
EP17898343.3A EP3531666B1 (en) 2017-12-26 2017-12-26 Method for managing storage devices in a storage system, and storage system
EP21190258.0A EP3985949A1 (en) 2017-12-26 2017-12-26 Method and apparatus for managing storage device in storage system
PCT/CN2017/118650 WO2019127021A1 (zh) 2017-12-26 2017-12-26 Method and apparatus for managing storage device in storage system
CN201780002717.1A CN110199512B (zh) 2017-12-26 2017-12-26 Method and apparatus for managing storage device in storage system
CN202011475355.8A CN112615917B (zh) 2017-12-26 2017-12-26 Method for managing storage device in storage system, and storage system
US16/912,377 US11314454B2 (en) 2017-12-26 2020-06-25 Method and apparatus for managing storage device in storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/118650 WO2019127021A1 (zh) 2017-12-26 2017-12-26 Method and apparatus for managing storage device in storage system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/912,377 Continuation US11314454B2 (en) 2017-12-26 2020-06-25 Method and apparatus for managing storage device in storage system

Publications (1)

Publication Number Publication Date
WO2019127021A1 true WO2019127021A1 (zh) 2019-07-04

Family

ID=67064268

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/118650 WO2019127021A1 (zh) 2017-12-26 2017-12-26 Method and apparatus for managing storage device in storage system

Country Status (4)

Country Link
US (1) US11314454B2 (zh)
EP (2) EP3985949A1 (zh)
CN (2) CN110199512B (zh)
WO (1) WO2019127021A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11314608B1 (en) * 2020-10-02 2022-04-26 Dell Products L.P. Creating and distributing spare capacity of a disk array
US11334245B1 (en) * 2020-10-30 2022-05-17 Dell Products L.P. Native memory semantic remote memory access system
US11875046B2 (en) 2021-02-05 2024-01-16 Samsung Electronics Co., Ltd. Systems and methods for storage device resource management

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101291347B (zh) * 2008-06-06 2010-12-22 中国科学院计算技术研究所 一种网络存储系统
CN102388357B (zh) * 2011-09-23 2013-08-28 华为技术有限公司 访问存储设备的方法及系统
CN103064797B (zh) * 2012-12-21 2016-06-29 华为技术有限公司 数据处理方法和虚拟机管理平台
US8595385B1 (en) * 2013-05-28 2013-11-26 DSSD, Inc. Method and system for submission queue acceleration
US9986028B2 (en) * 2013-07-08 2018-05-29 Intel Corporation Techniques to replicate data between storage servers
US9311110B2 (en) * 2013-07-08 2016-04-12 Intel Corporation Techniques to initialize from a remotely accessible storage device
US9304690B2 (en) * 2014-05-07 2016-04-05 HGST Netherlands B.V. System and method for peer-to-peer PCIe storage transfers
CN108062285B (zh) * 2014-06-27 2022-04-29 华为技术有限公司 一种访问NVMe存储设备的方法和NVMe存储设备
US9658782B2 (en) 2014-07-30 2017-05-23 Excelero Storage Ltd. Scalable data using RDMA and MMIO
US9112890B1 (en) * 2014-08-20 2015-08-18 E8 Storage Systems Ltd. Distributed storage over shared multi-queued storage device
US10061743B2 (en) * 2015-01-27 2018-08-28 International Business Machines Corporation Host based non-volatile memory clustering using network mapped storage
CN104850502B (zh) * 2015-05-05 2018-03-09 华为技术有限公司 一种数据的访问方法、装置及设备
CN104951252B (zh) * 2015-06-12 2018-10-16 北京联想核芯科技有限公司 一种数据访问方法及PCIe存储设备
KR102430187B1 (ko) * 2015-07-08 2022-08-05 삼성전자주식회사 RDMA NVMe 디바이스의 구현 방법
CN106775434B (zh) * 2015-11-19 2019-11-29 华为技术有限公司 一种NVMe网络化存储的实现方法、终端、服务器及系统
US10423568B2 (en) * 2015-12-21 2019-09-24 Microsemi Solutions (U.S.), Inc. Apparatus and method for transferring data and commands in a memory management environment
CN111752480A (zh) * 2016-03-24 2020-10-09 华为技术有限公司 一种数据写方法、数据读方法及相关设备、系统
US11086801B1 (en) * 2016-04-14 2021-08-10 Amazon Technologies, Inc. Dynamic resource management of network device
CN106155586B (zh) * 2016-05-31 2019-03-08 华为技术有限公司 一种存储方法、服务器及存储控制器
CN106469198B (zh) * 2016-08-31 2019-10-15 华为技术有限公司 键值存储方法、装置及系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070260714A1 (en) * 2006-03-30 2007-11-08 International Business Machines Asynchronous interconnect protocol for a clustered dbms
CN102968498A (zh) * 2012-12-05 2013-03-13 华为技术有限公司 数据处理方法及装置
CN103634379A (zh) * 2013-11-13 2014-03-12 华为技术有限公司 一种分布式存储空间的管理方法和分布式存储系统
CN105989048A (zh) * 2015-02-05 2016-10-05 浙江大华技术股份有限公司 一种数据记录处理方法、设备及系统
CN107111596A (zh) * 2015-12-14 2017-08-29 华为技术有限公司 一种集群中锁管理的方法、锁服务器及客户端

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3531666A4 *

Also Published As

Publication number Publication date
CN110199512B (zh) 2020-12-22
EP3985949A1 (en) 2022-04-20
CN112615917A (zh) 2021-04-06
US20200326891A1 (en) 2020-10-15
EP3531666A1 (en) 2019-08-28
US11314454B2 (en) 2022-04-26
EP3531666A4 (en) 2019-08-28
EP3531666B1 (en) 2021-09-01
CN112615917B (zh) 2024-04-12
CN110199512A (zh) 2019-09-03

Similar Documents

Publication Publication Date Title
WO2019127018A1 (zh) 存储系统访问方法及装置
US10459649B2 (en) Host side deduplication
CN113485636B (zh) 一种数据访问方法、装置和系统
US11262916B2 (en) Distributed storage system, data processing method, and storage node
US11321021B2 (en) Method and apparatus of managing mapping relationship between storage identifier and start address of queue of storage device corresponding to the storage identifier
US11314454B2 (en) Method and apparatus for managing storage device in storage system
US11921695B2 (en) Techniques for recording metadata changes
US20190114076A1 (en) Method and Apparatus for Storing Data in Distributed Block Storage System, and Computer Readable Storage Medium
US20210318826A1 (en) Data Storage Method and Apparatus in Distributed Storage System, and Computer Program Product
US20210124686A1 (en) Host-based read performance optimization of a content addressable storage system
US20210311654A1 (en) Distributed Storage System and Computer Program Product
US11947419B2 (en) Storage device with data deduplication, operation method of storage device, and operation method of storage server
WO2017177400A1 (zh) 一种数据处理方法及系统
US9501290B1 (en) Techniques for generating unique identifiers

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2017898343

Country of ref document: EP

Effective date: 20180907

NENP Non-entry into the national phase

Ref country code: DE