WO2023143033A1 - Method, apparatus, and computer device for accessing a storage node - Google Patents

Method, apparatus, and computer device for accessing a storage node

Info

Publication number
WO2023143033A1
WO2023143033A1 (PCT/CN2023/071481, CN2023071481W)
Authority
WO
WIPO (PCT)
Prior art keywords
storage
dpu
node
controller
computing node
Prior art date
Application number
PCT/CN2023/071481
Other languages
English (en)
French (fr)
Inventor
杨杰
王淇
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023143033A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines

Definitions

  • the present application relates to the field of computer technology, in particular to a method, device and computer equipment for accessing storage nodes.
  • As the gap between computing performance and storage performance keeps widening, the concept of storage compute offloading has been proposed: a dedicated data processing module, namely the data processing unit (DPU), provides the computing power needed for storage and networking.
  • Storage-related computation (such as encryption/decryption, deduplication, and compression) is completed in an independent DPU without occupying the general-purpose computing resources of the host's central processing unit (CPU); for the host, accessing a storage device through the DPU is as simple and fast as accessing a local hard disk.
  • The DPU can support the Single Root I/O Virtualization (SR-IOV) standard, so that a virtual machine (VM) can access the storage device directly through the DPU without occupying CPU resources of the VM's host; the resulting access performance is essentially on par with accessing the storage device through the host.
  • When the VM accesses the storage device directly through the DPU, a DPU failure means the VM can no longer access the storage device: the data channel is cut off and the VM's services are interrupted immediately.
  • the conventional means is to add a backup device.
  • two DPUs can be configured on the VM host.
  • The two DPUs usually work in either dual-active mode or active/standby mode. Because a DPU typically relies on caching and similar acceleration techniques during use, and dual-active mode lowers the cache hit rate, the two DPUs are usually configured in active/standby mode.
  • the active DPU accesses the storage device as a data channel. When the active DPU fails, the data channel is switched from the active DPU to the standby DPU, so as to continue to provide storage access services to ensure that the business is not interrupted.
  • Because a DPU contains computing units, memory, optical ports, and other components, it is usually relatively expensive; and since the added DPU in the backup-DPU solution serves only as a standby device, it does not improve system performance, that is, the considerable extra cost brings no additional performance gain.
  • In addition, the DPU is generally connected to the server through a peripheral component interconnect express (PCIE) interface, that is, the DPU occupies a PCIE slot of the server; adding a backup DPU occupies yet more PCIE slots, and in some scenarios the server may not have enough PCIE slots.
  • Embodiments of the present application provide a method, device, and computer equipment for accessing a storage node, so as to solve the problem of maintaining a smooth data channel when a DPU fails without significantly increasing costs.
  • In a first aspect, an embodiment of the present application provides a method for accessing a storage node, including: when a computing node determines that it needs to access a storage node, selecting, from a plurality of controllers, a target controller for performing the access, where the plurality of controllers include at least one DPU and a storage controller in at least one storage node, the DPU is installed on the computing node, and the storage node is connected to the computing node through a network; and sending an access request to the target controller.
  • the above method can be applied to the scenario where storage and computing are separated, that is, the scenario where the computing node can be connected to the storage node through the network; further, it is applicable to the scenario where the computing node offloads the storage computing power to the DPU.
  • Storage-related computation can be implemented through the DPU without consuming computing resources of the computing node's CPU. Because the computing node can be connected to the storage node through a network, and current storage nodes are equipped with storage controllers, when the computing node accesses the storage node it can implement storage-related computation not only through the DPU but also through the storage controller in the storage node.
  • Having the storage controller in the storage node replace the backup DPU of the traditional solution adds no extra cost; and because the storage node is reached over the network, only a network port of the computing node is occupied rather than a PCIE slot, so the computing node will not run short of PCIE slots. Even if the computing node's network resources are insufficient and a network card has to be added to implement the above solution, the cost of adding a network card is significantly lower than the cost of adding a DPU. Therefore, the above embodiments of the present application can reach the performance level of the traditional multi-DPU solution while reducing cost.
  • The computing node sending an access request to the target controller includes: the computing node sending a first access request to the DPU, so that the DPU sends a second access request to one of the storage controllers according to the first access request, the second access request including the identifier of the DPU; or, the computing node sending a third access request to the selected storage controller, the third access request including the identifier of the computing node.
  • the computing node has at least one VM including a first virtual machine VM, and the first VM includes a multipath module, and the step of determining that the computing node needs to access the storage node specifically includes : The first VM selects the target controller from multiple controllers through the multipath module.
  • the foregoing method for accessing a storage node may be executed by a physical device or by a virtual machine.
  • Optionally, a second VM is further deployed on the computing node, and the second VM includes a second multipath module; when the second VM determines that it needs to access the storage node, it uses the second multipath module to select a target controller for accessing the storage node from the plurality of controllers and sends an access request.
  • the DPU communicates with the virtual machine in the computing node through a virtualization standard SR-IOV.
  • SR-IOV allows the sharing of PCIE devices between the virtual machine and the computing node to which the virtual machine belongs, and enables the virtual machine to obtain I/O performance comparable to the local access performance.
  • By passing the DPU directly through to the VM, the I/O path and the protocol stack can be shortened, which improves the I/O performance of the entire system.
  • Before the computing node determines that it needs to access the storage node, the method further includes: the computing node scanning logical volume information through each of the plurality of controllers, where the identifiers of the logical volumes reported by different controllers are different; and the computing node determining that the differently identified logical volumes corresponding to the multiple controllers are the same logical volume, aggregating those logical volumes, and establishing the correspondence between the aggregated logical volume and the multiple controllers.
  • the controllers configured with the volume management information of the same logical volume belong to the same storage subsystem.
  • Although the DPU and the storage controller are configured with the volume management information of the same logical volume, the volume identifiers they report to the computing node may differ, so the computing node needs to determine whether multiple controllers correspond to the same logical volume and, if so, aggregate them to simplify subsequent access.
  • For example, if the computing node is connected to storage node A and storage node B through the network but only storage node A is configured with VM1's volume management information, then for the first VM the storage controller of storage node A and the storage controller of storage node B do not belong to the same storage subsystem; the computing node can determine whether controllers belong to the same storage subsystem by checking whether they scan the same logical volume.
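  • To make the aggregation step above more concrete, the following is a minimal sketch (not part of the patent text) of how a multipath layer might group differently named volumes reported by several controllers into one logical block device; the controller names, device names, and the `wwid` field are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ScannedVolume:
    controller: str   # e.g. the DPU or a storage controller (hypothetical names)
    dev_name: str     # identifier reported to the computing node, e.g. "nvme0n1"
    wwid: str         # underlying volume identity used to detect "same logical volume"

def aggregate(volumes):
    """Group volumes that share the same underlying identity (wwid) and map
    each group to one aggregated logical block device."""
    groups = defaultdict(list)
    for v in volumes:
        groups[v.wwid].append(v)

    mapping = {}  # aggregated device -> list of (controller, dev_name) paths
    for i, (wwid, paths) in enumerate(groups.items()):
        mapping[f"/dev/dm-{i}"] = [(p.controller, p.dev_name) for p in paths]
    return mapping

# Example: one logical volume seen through a DPU and a storage controller
scan = [
    ScannedVolume("dpu",          "nvme0n1", "wwid-A"),
    ScannedVolume("storage-ctrl", "nvme1n1", "wwid-A"),
]
print(aggregate(scan))  # {'/dev/dm-0': [('dpu', 'nvme0n1'), ('storage-ctrl', 'nvme1n1')]}
```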
  • The interface protocol between the DPU and the computing node is the NVMe protocol.
  • The interface protocol between the storage controller and the computing node is the NVMe over Fabrics (NOF) protocol, which uses NVMe to support storage attached over a network fabric.
  • the interface protocol between the DPU, the storage controller and the computing node may also be the Internet Small Computer System Interface (ISCSI) protocol.
  • The priority of the DPU is higher than the priority of any of the at least one storage controller; correspondingly, selecting the target controller for accessing the storage node includes: the computing node selecting the target controller according to the priorities of the plurality of controllers.
  • Accessing the storage node through the DPU does not consume computing resources of the computing node's CPU, whereas accessing the storage node through the storage controller does; that is, the access performance through the DPU is better than through the storage controller. Therefore, while the DPU works normally, the DPU can be chosen to access the storage node.
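  • Purely as an illustration, the priority-based choice described above could be sketched as follows; the controller records and the `healthy` flag are assumptions, and in a real system the health state would come from heartbeat detection.

```python
def select_target_controller(controllers):
    """Pick the highest-priority healthy controller (the DPU is configured
    with the highest priority, storage controllers with lower ones)."""
    healthy = [c for c in controllers if c["healthy"]]
    if not healthy:
        raise RuntimeError("no controller available to reach the storage node")
    return max(healthy, key=lambda c: c["priority"])

controllers = [
    {"name": "dpu",         "priority": 10, "healthy": False},  # DPU has failed
    {"name": "controller2", "priority": 1,  "healthy": True},
    {"name": "controller3", "priority": 1,  "healthy": True},
]
# Prints 'controller2': the selection falls back to a storage controller
# because the DPU is marked unhealthy.
print(select_target_controller(controllers)["name"])
```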
  • If the target controller is the DPU, the lock identifier used by the computing node to lock the data or logical volume to be accessed when accessing the storage node is generated based on the identifier of the DPU; if the target controller is a storage controller, the lock identifier is generated based on the identifier of the computing node and the identifier of the storage controller.
  • The storage controller generates volume management information based on the computing node identifier, which makes it easy to distinguish volumes created by different computing nodes. When the computing node accesses the storage node through the DPU, the access is local to the computing node, so the access request may carry only the DPU identifier rather than the computing node identifier; generating volume management information based on the DPU identifier therefore enables the storage node to identify the corresponding logical volume from the DPU identifier.
  • An embodiment of the present application provides an apparatus for accessing a storage node. The apparatus includes modules/units that perform the method of the first aspect and any possible implementation of the first aspect; these modules/units may be implemented by hardware, or by hardware executing corresponding software.
  • The apparatus includes: a multipath module, configured to select, from multiple controllers, a target controller for accessing the storage node when the apparatus needs to access the storage node, where the multiple controllers include at least one data processing unit (DPU) and a storage controller in at least one storage node, the DPU is installed on the computing node where the apparatus is located, and the storage node is connected to that computing node through a network; and a sending module, configured to send an access request to the target controller.
  • In a third aspect, an embodiment of the present application provides a computer device, which includes a memory and a processor; the memory stores a computer program, and the processor is configured to invoke the computer program stored in the memory to execute the method described in the first aspect and any implementation manner of the first aspect.
  • An embodiment of the present application provides a computing system, including the computer device described in the third aspect, at least one data processing unit (DPU) connected to the computer device through a hardware interface, and at least one storage node connected to the computer device through a network.
  • An embodiment of the present application provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the method described in the first aspect and any implementation manner of the first aspect.
  • the embodiments of the present application provide a computer program product including instructions, which, when run on a computer, cause the method described in the first aspect and any implementation manner of the first aspect to be executed.
  • FIG. 1 is a schematic diagram of a traditional master and backup scheme provided in the embodiment of the present application
  • FIG. 2 is a schematic structural diagram of an applicable scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for accessing a storage node provided in an embodiment of the present application
  • FIG. 4 is a schematic diagram of a connection between a computing node and multiple controllers provided by an embodiment of the present application
  • FIG. 5 is a schematic diagram of connection between another computing node and multiple controllers provided by the embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an apparatus for accessing a storage node provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • As shown in FIG. 1, the user layer (user) and the kernel belong to the virtual machine (VM), and the host is the computing node that hosts the VM; the host is connected to an active DPU and a standby DPU through PCIE slots.
  • The key to letting the standby DPU quickly take over and continue working after the active DPU fails is that the multipath module in the VM aggregates the devices that can access the storage node (that is, the active DPU and the standby DPU) and provides storage services to the VM through the aggregated device.
  • The multipath module queries, through control commands (such as ioctl), the information of the controllers connected to the computing node to which the VM belongs, such as the device type, manufacturer ID, and device world wide name (WWN), determines whether each controller is in the whitelist, and identifies whether multiple controllers are of the same type (for example, controllers connected through PCIE versus controllers connected through the network). If they are of the same type, it further scans whether they correspond to the same logical volume (even when different controllers correspond to the same logical volume, the logical volume identifiers are not the same); if so, it aggregates the logical volumes into one logical block device and creates the mapping relationship between the logical block device and the multiple controllers of the same type.
  • the VM accesses the logical block device, and the multipath module selects a controller corresponding to the logical device to access. For example, the VM obtains the device information of the active DPU and the standby DPU, and the multipath module recognizes that the two block devices are of the same type.
  • The VM scans the logical volumes exposed by the active DPU and the standby DPU and finds their identifiers to be /dev/nvme0n1 and /dev/nvme1n1 respectively; although the identifiers differ, they correspond to the same logical volume, so the multipath module aggregates /dev/nvme0n1 and /dev/nvme1n1 into a logical block device /dev/mapper/dm-0. When the VM needs to perform storage access through a DPU, it simply accesses /dev/mapper/dm-0; as long as at least one of the active DPU and the standby DPU works normally, the VM can access /dev/mapper/dm-0, which ensures that the service is not interrupted when one DPU fails.
  • an embodiment of the present application provides a method for accessing a storage node, which is used to solve the problem of maintaining a smooth data channel when a DPU fails without significantly increasing costs.
  • the architecture may include computing nodes 210 and storage clusters (one or more storage nodes 200 ).
  • the computing node 210 is a computing device, such as a server, a desktop computer, or a controller of a storage array. Although only one computing node 210 is shown in FIG. 2 , more computing nodes 210 may be included in an actual architecture, and various computing nodes 210 may communicate with each other. In terms of hardware, the computing node 210 includes at least a processor 212 , a memory 213 and a network card 214 .
  • the processor 212 may be one or more CPUs, and one CPU may have one or more CPU cores for processing data access requests from outside the computing node 210 or requests generated inside the computing node 210 .
  • the processor 212 when the processor 212 receives the write data request sent by the user, it will temporarily store the data in the data write request in the memory 213 .
  • the processor 212 sends the data stored in the memory 213 to the storage node 200 for persistent storage.
  • the memory 213 refers to an internal memory directly exchanging data with the processor 212. It can read and write data at any time, and the speed is very fast. It is used as a temporary data storage for the operating system or other running programs.
  • Memory includes at least two kinds of memory, for example, memory can be either random access memory or read only memory (ROM).
  • the random access memory is, for example, dynamic random access memory (DRAM), or storage class memory (SCM).
  • DRAM is a semiconductor memory that, like most random access memory (RAM), is a volatile memory device.
  • SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory: storage-class memory provides faster read and write speeds than a hard disk, but its access speed is slower than DRAM and its cost is lower than DRAM.
  • the DRAM and the SCM are only illustrative examples in this embodiment, and the memory may also include other random access memories, such as static random access memory (static random access memory, SRAM) and the like.
  • the read-only memory for example, it may be programmable read-only memory (programmable read only memory, PROM), erasable programmable read-only memory (erasable programmable read only memory, EPROM) and the like.
  • The memory 213 may also be a dual in-line memory module (DIMM), that is, a module composed of dynamic random access memory (DRAM), or a solid-state drive (SSD).
  • multiple memories 213 and different types of memories 213 may be configured in the computing node 210 .
  • This embodiment does not limit the quantity and type of the memory 213 .
  • the memory 213 can be configured to have a power saving function.
  • the power saving function means that the data stored in the internal memory 213 will not be lost when the system is powered off and then powered on again.
  • Memory with a power saving function is called non-volatile memory.
  • the network card 214 is used for communicating with the storage node 200 . For example, when the total amount of data in the memory 213 reaches a certain threshold, the computing node 210 may send a request to the storage node 200 through the network card 214 to persistently store the data.
  • the network card 214 is a network card without computing capability, that is, a non-smart network card.
  • the computing node 210 may further include a bus for communication between components inside the computing node 210 .
  • Because the main function of the computing node 210 in FIG. 2 is to provide computing services and remote storage can be used for persistent storage, it has less local storage than a conventional server, which saves cost and space. However, this does not mean that the computing node 210 cannot have local storage.
  • the computing node 210 may also have a small number of built-in hard disks, or a small number of external hard disks.
  • the computing node 210 is also connected to at least one DPU through a hardware interface (such as a PCIE slot), and connected to at least one storage node 200 through a network (three storage nodes are used as an example in FIG. 2 ).
  • a storage node 200 includes one or more controllers 201 , network cards 204 and multiple hard disks 205 .
  • the network card 204 is used to communicate with the computing node 210 .
  • the hard disk 205 is used to store data, and may be a magnetic disk or other types of storage media, such as a solid state hard disk or a shingled magnetic recording hard disk.
  • the controller 201 is configured to write data into the hard disk 205 or read data from the hard disk 205 according to the read/write data request sent by the computing node 210 . In the process of reading and writing data, the controller 201 needs to convert the address carried in the read/write data request into an address that the hard disk can recognize. It can be seen that the controller 201 also has some simple calculation functions.
  • the controller 201 may have various forms.
  • the controller 201 includes a CPU and memory.
  • the CPU is used to perform operations such as address translation and reading and writing data.
  • the memory is used to temporarily store data to be written into the hard disk 205 , or to read data from the hard disk 205 to be sent to the computing node 210 .
  • the controller 201 is a programmable electronic component, such as a DPU.
  • the number of controllers 201 may be one, or two or more. When the storage node 200 includes at least two controllers 201 , there may be an ownership relationship between the hard disk 205 and the controllers 201 .
  • Each controller can only access the hard disks that belong to it, which often involves forwarding read/write data requests between the controllers 201 and results in a longer data access path.
  • the ownership relationship between the hard disk 205 and the controller 201 needs to be re-bound when a new hard disk 205 is added to the storage node 200 .
  • the method for accessing a storage node provided in this embodiment of the application can be applied to the computing node shown in Figure 2, or to a virtual machine configured on the computing node.
  • FIG. 3 it is a schematic flowchart of a method for accessing a storage node provided in the embodiment of the present application. As shown in the figure, the method may include the following steps:
  • step 301 when the computing node determines that it needs to access the storage node, it selects a target controller for accessing the storage node from multiple controllers.
  • the computing node mentioned above may be a computing node as shown in FIG. 2 , including a CPU, a memory, a network card, and the like.
  • the computing node in the embodiment of the present application further includes a DPU connected through a hardware interface (such as a PCIE slot); the computing node is also connected to at least one storage node through a network.
  • the foregoing plurality of controllers may include a DPU, and a storage controller in at least one storage node.
  • the controller used to implement storage-related computing may include a DPU or a controller in the storage node, instead of only being implemented by the DPU.
  • the compute node selects a controller for performing the current access from among available multiple controllers. For example, if a computing node detects that a DPU fails, it may select a storage controller for current access.
  • The computing node can actively query the device information of each controller; for example, it can query the device type, supported storage network protocol, vendor ID, WWN, and so on of each controller through a control command (such as ioctl).
  • each controller may actively report its own device information to the computing node, so that the computing node can identify the controller that can be used to access the storage node according to the device information.
  • the computing node determines whether multiple controllers support the same storage network protocol according to the device information.
  • the multiple controllers can be used as candidate controllers for accessing the storage node in this embodiment of the application, that is, the multiple controllers described in step 301 .
  • For example, a DPU that communicates with the computing node based on the non-volatile memory express (NVMe) protocol and a storage controller that communicates with the computing node based on the NVMe over Fabrics (NOF) protocol are regarded as controllers that support the same storage network protocol.
  • multiple controllers that communicate with computing nodes based on the Internet Small Computer System Interface (ISCSI) protocol also belong to the controllers that support the same storage network protocol.
  • In addition, the computing node can further determine whether a controller belongs to a whitelist or a blacklist: if it belongs to the whitelist, the computing node is allowed to access the storage node through the controller; if it belongs to the blacklist, the computing node is prohibited from accessing the storage node through the controller. For example, the computing node may make this determination according to the controller's vendor ID and supported storage network protocol.
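  • The device-information checks described above (same storage network protocol, whitelist/blacklist) could look roughly like the sketch below; the field names and the whitelist contents are illustrative assumptions, not values defined by this text.

```python
WHITELIST = {("vendorX", "NVMe"), ("vendorX", "NOF")}  # hypothetical (vendor_id, protocol) pairs

def candidate_controllers(devices):
    """Keep only controllers that share one storage network protocol family
    (here NVMe/NOF are treated as the same family) and are whitelisted."""
    same_family = [d for d in devices if d["protocol"] in ("NVMe", "NOF")]
    return [d for d in same_family if (d["vendor_id"], d["protocol"]) in WHITELIST]

devices = [
    {"name": "dpu",         "vendor_id": "vendorX", "protocol": "NVMe", "wwn": "wwn-1"},
    {"name": "controller2", "vendor_id": "vendorX", "protocol": "NOF",  "wwn": "wwn-2"},
    {"name": "usb-disk",    "vendor_id": "vendorY", "protocol": "USB",  "wwn": "wwn-3"},
]
print([d["name"] for d in candidate_controllers(devices)])  # ['dpu', 'controller2']
```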
  • Step 302 the computing node sends an access request to the target controller.
  • the computing node sends an access request to the target controller, and the target controller processes the requested data, that is, the target controller performs storage-related calculations, thereby completing reading data from the storage node or writing data to the storage node.
  • the computing node may send an access request to the DPU.
  • When the DPU accesses the data of the storage node, it may send a second access request to the storage controller according to the first access request, and the second access request may include the identifier of the DPU.
  • the access request sent by the computing node may not include the identification of the computing node.
  • the second access request may include the ID of the DPU, so that the storage controller can identify the requester.
  • the second access request may also include the identifier of the computing node.
  • the computing node may send a third access request to the storage controller, where the third access request includes an identifier of the computing node, so that the storage controller can identify the requester.
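  • Purely to illustrate the three request forms above, the sketch below models them as simple records; the field names are assumptions and do not correspond to any concrete on-wire format.

```python
from dataclasses import dataclass

@dataclass
class FirstAccessRequest:        # computing node -> DPU (local, no node identifier needed)
    volume: str
    offset: int
    length: int

@dataclass
class SecondAccessRequest:       # DPU -> storage controller, carries the DPU identifier
    dpu_id: str
    volume: str
    offset: int
    length: int

@dataclass
class ThirdAccessRequest:        # computing node -> storage controller, carries the node identifier
    compute_node_id: str
    volume: str
    offset: int
    length: int

def forward_via_dpu(req: FirstAccessRequest, dpu_id: str) -> SecondAccessRequest:
    """The DPU turns a first access request into a second access request by
    attaching its own identifier before contacting a storage controller."""
    return SecondAccessRequest(dpu_id, req.volume, req.offset, req.length)
```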
  • The computing node accesses the storage node through the DPU without consuming computing resources of the computing node's CPU. If the selected target controller is the storage controller in a storage node, the computing node cannot access the storage node as directly as it does through the DPU, and in that case computing resources of the computing node's CPU may need to be occupied.
  • When the computing node accesses the storage node through the DPU, although the DPU also sends an access request to the storage controller, during the access the DPU performs the storage-related computation, such as decompression/compression of the accessed data and deduplication of the written data, thereby sharing the load of the storage controller.
  • the above method for accessing a storage node may also be executed by a virtual machine.
  • the above-mentioned method is executed by a virtual machine, at least one virtual machine including the first virtual machine is deployed in the above-mentioned computing node, at this time, the above-mentioned step 301 may be executed by the first multipath module in the first virtual machine, That is, the first multipath module selects a target controller from multiple controllers.
  • For example, a virtual machine is deployed in the computing node, the interface protocol between the DPU and the computing node is the NVMe protocol, and the interface protocol between the storage controller and the computing node is the NOF protocol; access to the storage node is implemented through the NVMe controller and the storage controller of the virtual operating system emulator (Qemu) in the computing node.
  • a second virtual machine may also be deployed in the computing node, and the second multipath module in the second virtual machine may also perform the step of selecting a target controller from multiple controllers to access the storage node.
  • the DPU may communicate with the virtual machine in the computing node through SR-IOV.
  • SR-IOV allows the sharing of PCIE devices between the virtual machine and the computing node to which the virtual machine belongs, and enables the virtual machine to obtain I/O performance comparable to the local access performance.
  • By passing the DPU directly through to the VM, the I/O path and the protocol stack can be shortened, which improves the I/O performance of the entire system.
  • the above method can be applied to the scenario where storage and computing are separated, that is, the scenario where the computing node can be connected to the storage node through the network; further, it is applicable to the scenario where the computing node offloads the storage computing power to the DPU.
  • Storage-related computation can be implemented through the DPU without consuming computing resources of the computing node. Because the computing node can be connected to the storage node through a network, and current storage nodes are equipped with storage controllers, when the computing node accesses the storage node it can implement storage-related computation not only through the DPU but also through the storage controller in the storage node.
  • Having the storage controller in the storage node replace the backup DPU of the traditional solution adds no extra cost; and because the storage node is reached over the network, only a network port of the computing node is occupied rather than a PCIE slot, so the computing node will not run short of PCIE slots. Even if the computing node's network resources are insufficient and a network card has to be added to implement the above solution, the cost of adding a network card is significantly lower than the cost of adding a DPU. Therefore, the above embodiments of the present application can reach the performance level of the traditional multi-DPU solution while reducing cost.
  • When the target controller is the DPU, the computing node does not need to occupy its own computing resources when performing the step of accessing the storage node (that is, step 302 above); when the target controller is a storage controller, the computing node may need to occupy its own computing resources when performing step 302.
  • the performance of accessing the storage node through the DPU (such as the speed of reading/writing data, etc.) is better than that when accessing through the storage controller.
  • Therefore, the priority of the DPU can be set high and the priority of the storage controller set low, so that while the DPU works normally the DPU is preferentially used to access the storage node, which keeps system performance high; when the DPU fails, the storage node is accessed through the lower-priority storage controller, which keeps the data channel available so that the service is not interrupted; and after the DPU is repaired or replaced with a new DPU, the DPU is again selected as the target controller for accessing the storage node.
  • The computing node can query the state of the DPU through heartbeat detection or the like, so as to detect a DPU failure in time: while the DPU is healthy, the computing node can select the DPU as the target controller; after determining that the DPU is faulty, it accesses the storage node through the storage controller instead.
  • When more than one storage controller is available, one of them may be chosen according to a preset selection strategy, such as a load balancing strategy.
  • As described above for the traditional solution, a multipath module can be set in the computing node to aggregate the active DPU and the standby DPU and provide storage access services for the computing node. It should be understood that the traditional solution aggregates only controllers of the same type, that is, multiple controllers connected through PCIE interfaces can be aggregated, but a controller connected through a PCIE interface and a controller connected through a network cannot. In the embodiment of the present application, however, a multipath module may be installed in the computing node to aggregate the DPU and the storage controllers, that is, controllers of different types may be aggregated to provide storage access services for the computing node.
  • The above operation of querying the device information of each controller can be performed by the multipath module. After the multipath module determines that multiple controllers support the same storage network protocol, it scans to determine whether those controllers reach the same storage device (which may be a physical storage device, such as the storage node shown in FIG. 2, or a logical storage device, such as a logical volume). As mentioned above, even when the storage devices corresponding to different controllers are the same, the identifiers reported to the computing node may differ; the multipath module therefore aggregates the storage devices it recognizes as the same into one logical block device, and subsequent storage accesses by the computing node go to that logical block device. The multipath module can also scan periodically to discover newly added devices and determine whether aggregation is required.
  • The DPU, the storage controller, and the storage node corresponding to the storage controller can be regarded as one storage subsystem, corresponding to the above-mentioned logical block device.
  • Within the storage subsystem, a master node distributes controller information, such as the name of each controller in this subsystem.
  • The master node can also configure the priority of each controller.
  • Because the DPU and the storage controller are different types of controllers, they usually belong to different storage subsystems, so the storage subsystem names they report to the computing node are different.
  • a computing node When a computing node accesses a storage node, it usually creates a volume (also called a logical volume) according to the storage resource of the storage node.
  • A logical volume is a virtual disk formed from logical disks; it is a storage management method whose purpose is to move away from managing space in terms of physical hard disks so that space can be managed and allocated in a more convenient, unified way. Therefore, before the computing node executes the foregoing method, it may first create a logical volume on each storage node. During logical volume creation, the storage controller in each storage node generates volume management information, such as the mapping relationships of the logical volume, which is used for subsequent access to or management of the logical volume on that storage node.
  • The computing node can control the storage controller to configure the generated volume management information into the DPU, that is, the DPU and the storage controller share the same volume management information, so that both the DPU and the storage controller can access the logical volume based on that information.
  • For a storage node that both computing node A and computing node B can access, either computing node can create logical volumes based on that storage node. Therefore, when the storage node generates volume management information for computing node A, it can add the identifier of computing node A, so that the storage node can serve computing node A according to computing node A's volume management information.
  • the computing node may send the identifier of the computing node to the storage controller, so that the storage controller generates corresponding volume management information according to the identifier of the computing node. Further, the computing node may also send the ID of the DPU to the storage controller, so that the storage controller generates corresponding volume management information according to the computing node ID and the DPU ID.
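  • A minimal sketch of the volume-creation exchange described above, using hypothetical identifiers; it only shows that the storage controller keys the volume management information by the computing node identifier (and optionally the DPU identifier) and that the same information is then configured into the DPU.

```python
def create_volume(storage_ctrl_state, compute_node_id, dpu_id, volume_name, size_gb):
    """Storage-controller side: record volume management information keyed by
    the computing node (and DPU) identifiers, then return it so the computing
    node can configure the same information into the DPU."""
    vol_mgmt = {
        "volume": volume_name,
        "size_gb": size_gb,
        "owner_compute_node": compute_node_id,
        "owner_dpu": dpu_id,
    }
    storage_ctrl_state.setdefault("volumes", {})[(compute_node_id, volume_name)] = vol_mgmt
    return vol_mgmt

storage_ctrl = {}
dpu_config = {}
info = create_volume(storage_ctrl, "computeA", "dpu-01", "vol1", 100)
dpu_config["vol1"] = info  # DPU and storage controller now share the same volume management info
```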
  • Resource contention occurs when multiple controllers concurrently access data in the same logical volume, or access the same data in the same logical volume concurrently.
  • One way to solve resource contention is to control access through locks, and lock and protect concurrently accessed data.
  • The DPU and the storage controller can lock data when accessing it, so as to avoid concurrent read/write operations on the same data.
  • The identifier used by the DPU or the storage controller when locking data may be called a lock identifier. Therefore, in a possible implementation, the generated volume management information may include the lock identifiers of the DPU and the storage controller.
  • For example, the computing node may believe that the DPU has failed and switch to the storage controller to re-issue the access while the DPU is in fact still able to access the data; in that case the DPU and the storage controller may access the same data, and locking the data prevents them from accessing it at the same time and causing errors.
  • the lock ID of the DPU may be generated according to the ID of the DPU; and the lock ID of the storage controller may be generated according to the ID of the computing node and the ID of the storage controller.
  • Accessing the storage node through the DPU is a form of local access, so in this case the access request does not need to carry the identifier of the computing node, and the corresponding computing node can be determined from the DPU identifier alone; the lock identifier of the DPU can therefore be generated from the DPU identifier.
  • A storage node can be connected to multiple computing nodes, and if the lock identifier of the storage controller were generated only from the storage controller's identifier, the computing node that locked the data could not be identified; therefore, the lock identifier of the storage controller can be generated from the identifier of the computing node and the identifier of the storage controller.
  • For example, a simplified version of the snowflake algorithm may be used to generate the lock identifier: the lock identifier has 8 bytes in total, the high-order 4 bytes store the identifier of the computing node (such as its WWID), and the low-order 4 bytes store the controller identifier.
  • Because accessing the storage node through the DPU is a local access and the access request does not carry the computing node identifier, the high-order 4 bytes of the DPU's lock identifier can be set to 0000, and the low-order 4 bytes are the identifier of the DPU.
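  • The 8-byte lock identifier layout described above can be illustrated as follows; how an arbitrary WWID or controller identifier is compressed into 4 bytes is an assumption of the sketch (a truncated hash), not something specified by the text.

```python
import hashlib

def _to_4_bytes(identifier: str) -> bytes:
    # Illustrative only: compress an arbitrary identifier string into 4 bytes.
    return hashlib.sha256(identifier.encode()).digest()[:4]

def lock_id_for_storage_controller(compute_node_wwid: str, controller_id: str) -> bytes:
    """High-order 4 bytes: computing node identifier; low-order 4 bytes: controller identifier."""
    return _to_4_bytes(compute_node_wwid) + _to_4_bytes(controller_id)

def lock_id_for_dpu(dpu_id: str) -> bytes:
    """DPU access is local, so the high-order 4 bytes are set to zero."""
    return b"\x00\x00\x00\x00" + _to_4_bytes(dpu_id)

print(lock_id_for_dpu("dpu-01").hex())                                # starts with '00000000'
print(lock_id_for_storage_controller("wwid-A", "controller2").hex())  # 8 bytes in total
```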
  • computing node A is configured with a virtual machine VM1.
  • Computing node A is connected to a DPU through a PCIE slot, and is connected to storage cluster A through a network.
  • Storage cluster A includes three storage nodes.
  • the DPU and the three storage nodes all support the NOF protocol, and the above method can be used to provide storage access services for the computing nodes.
  • the NVMe controller in the DPU is marked as controller1
  • the storage controller in storage node 1 is marked as controller2
  • the storage controller in storage node 2 is marked as controller3
  • the storage controller in storage node 3 is marked as controller4.
  • the user can create logical volumes for VM1 based on the storage resources of storage node 1, storage node 2, and storage node 3.
  • VM1 creates a logical volume according to the user's instruction, it sends the identity of VM1 and the identity of DPU to controller2, so that controller2 generates volume management information according to the identity of VM1 and the identity of DPU, and sends the volume management information to DPU.
  • controller2 can assign IDs in the subsystem to each controller in the subsystem.
  • Controller2 can also configure the priority of each controller: for example, the DPU can be assigned the highest priority, and the other storage controllers can be assigned the same lower priority.
  • the NVMe initiator and NOF initiator in VM1 scan separately.
  • the NVMe initiator scans the logical volume through the DPU, and its identifier is nvme0n1.
  • the identifiers of the logical volumes scanned by the NOF initiator through controller2, controller3, and controller4 are respectively nvme1n1, nvme2n1, nvme3n1.
  • the multi-path module in VM1 determines that the four controllers are all devices in the whitelist according to the obtained device information, and all support the NOF protocol, and the scanned logical volumes are actually the same logical volume although they have different identifiers.
  • Therefore, these four controllers belong to the same storage subsystem; the multipath module aggregates nvme0n1, nvme1n1, nvme2n1, and nvme3n1 into a logical block device /dev/dm-0 and establishes the mapping relationship between nvme0n1, nvme1n1, nvme2n1, nvme3n1 and /dev/dm-0.
  • the multi-path module in VM1 can regularly query the status of these 4 controllers.
  • When the DPU fails, the command used to detect the DPU fails or times out, and the status of nvme0n1 is set to abnormal (or the status of the DPU is set to abnormal directly).
  • When VM1 needs to access /dev/dm-0, if the multipath module in VM1 determines that the state of the highest-priority nvme0n1 is normal, it accesses the storage node through nvme0n1; otherwise, one of the remaining block devices is selected (arbitrarily, or according to preset rules), and the storage node is accessed through the controller corresponding to the selected block device.
  • If the multipath module in VM1 later detects that nvme0n1 has returned to normal, it sets the state of nvme0n1 back to normal, so that nvme0n1 is again preferentially selected for subsequent accesses to the storage node.
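  • Tying the example together, the following sketch shows one possible shape of the periodic health check and failover/failback logic performed by the multipath module; the `probe` callables and the path table are assumptions for illustration.

```python
class MultipathState:
    def __init__(self, paths):
        # paths: name -> {"priority": int, "state": "normal"/"abnormal", "probe": callable}
        self.paths = paths

    def refresh(self):
        """Periodic health check: a failed/timed-out probe marks the path abnormal,
        a successful probe marks it normal again (failback)."""
        for info in self.paths.values():
            info["state"] = "normal" if info["probe"]() else "abnormal"

    def pick_path(self):
        usable = {n: p for n, p in self.paths.items() if p["state"] == "normal"}
        if not usable:
            raise RuntimeError("data channel unavailable")
        # Highest priority wins; nvme0n1 (the DPU path) is configured highest.
        return max(usable, key=lambda n: usable[n]["priority"])

paths = {
    "nvme0n1": {"priority": 10, "state": "normal", "probe": lambda: False},  # DPU path failing
    "nvme1n1": {"priority": 1,  "state": "normal", "probe": lambda: True},   # controller2 path
}
mp = MultipathState(paths)
mp.refresh()
print(mp.pick_path())  # 'nvme1n1' while the DPU path is marked abnormal
```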
  • multiple virtual machines may be deployed in one computing node, and some or all of the multiple virtual machines may implement the foregoing method embodiments.
  • For example, VM1 and VM2 are deployed in a computing node, and both VM1 and VM2 can select a target controller from multiple controllers through the above method to access the storage node.
  • VM1 and VM2 can respectively set multipath modules.
  • an embodiment of the present application further provides an apparatus for accessing a storage node, which is used to implement the foregoing method embodiment.
  • the apparatus may include modules/units that execute any possible implementation manner in the foregoing method embodiments; these modules/units may be implemented by hardware, or may be implemented by executing corresponding software on hardware.
  • the apparatus includes: a multipath module 601 and a sending module 602 .
  • The multipath module 601 is configured to select, from multiple controllers, a target controller for accessing the storage node when the apparatus needs to access the storage node, where the multiple controllers include at least one data processing unit (DPU) and a storage controller in at least one storage node; the DPU is installed on the computing node where the apparatus is located, and the storage node is connected to that computing node through a network.
  • the sending module 602 is specifically configured to: send a first access request to the DPU, so that the DPU sends the first access request to one of the storage controllers according to the first access request. Two access requests, the second access request including the identifier of the DPU; or, sending a third access request to the selected storage controller, the third access request including the identifier of the device.
  • the device is a first virtual machine VM.
  • Optionally, a second VM is further deployed on the computing node where the apparatus is located; the second VM also includes a second multipath module, and when the second VM determines that it needs to access the storage node, it uses the second multipath module to select a target controller for accessing the storage node from the plurality of controllers and sends an access request.
  • The multipath module 601 is further configured to: before the apparatus determines that it needs to access the storage node, scan logical volume information through each of the multiple controllers, where the identifiers of the logical volumes corresponding to different controllers are different; determine that the differently identified logical volumes corresponding to the multiple controllers are the same logical volume; aggregate them; and establish the correspondence between the aggregated logical volume and the multiple controllers.
  • the DPU communicates with the virtual machine in the computing node through a virtualization standard SR-IOV.
  • The interface protocol between the DPU and the computing node of the apparatus is the NVMe protocol, and the interface protocol between the storage controller and the computing node of the apparatus is the NVMe over Fabrics (NOF) protocol.
  • The priority of the DPU is higher than the priority of any of the at least one storage controller; the multipath module is specifically configured to select the target controller for accessing the storage node according to the priorities of the plurality of controllers.
  • If the target controller is the DPU, the lock identifier used by the apparatus to lock the data or logical volume to be accessed when accessing the storage node is generated from the identifier of the DPU; if the target controller is a storage controller, the lock identifier is generated from the identifier of the apparatus and the identifier of the storage controller.
  • the embodiment of the present application also provides a computer device.
  • the computer device includes a processor 701 as shown in FIG. 7 , and a communication interface 702 connected to the processor 701 .
  • the processor 701 can be a general processor, a microprocessor, a specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic device, or one or more integrated circuits used to control the execution of the program of this application, etc.
  • a general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
  • the communication interface 702 is used to communicate with other devices, such as PCI bus interface, Ethernet, radio access network (radio access network, RAN), wireless local area network (wireless local area networks, WLAN), etc.
  • the processor 701 is configured to call the communication interface 702 to perform a function of receiving and/or sending, and execute the method described in any one of the preceding possible implementation manners.
  • the computer device may also include a memory 703 and a communication bus 704 .
  • the memory 703 is configured to store program instructions and/or data, so that the processor 701 calls the instructions and/or data stored in the memory 703 to implement the above functions of the processor 701 .
  • The memory 703 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory 703 may exist independently, such as an off-chip memory, and is connected to the processor 701 through the communication bus 704 .
  • the memory 703 can also be integrated with the processor 701.
  • Communication bus 704 may include a path for communicating information between the components described above.
  • The processor 701 may perform the following steps through the communication interface 702: when it is determined that access to the storage node is required, select, from a plurality of controllers, a target controller for performing the access, where the plurality of controllers include at least one data processing unit (DPU) and a storage controller in at least one storage node; the DPU is connected to the computer device through a hardware interface and can be regarded as part of the computer device, and the storage node is connected to the computer device through a network; the storage node is then accessed through the target controller.
  • the processor 701 is further configured to, through the communication interface 702, obtain the device information of each controller in the plurality of controllers before determining that access to the storage node is required; according to the device information , determining that the multiple controllers support the same storage network protocol.
  • The storage network protocol includes the NVMe over Fabrics (NOF) protocol based on the non-volatile memory express specification, or the Internet Small Computer System Interface (ISCSI) protocol.
  • The processor 701 is further configured to, through the communication interface 702, create a volume on the at least one storage node before determining that the storage node needs to be accessed, and to control the at least one storage controller to configure the generated volume management information into the DPU.
  • the processor 701 is further configured to send the identifier of the computer device and the identifier of the DPU to the at least one storage controller through the communication interface 702, so that the at least one storage controller The controller generates volume management information according to the computer device identifier and the DPU identifier.
  • the processor 701 is further configured to scan the generated volumes through each of the multiple controllers; if the volumes scanned by the multiple controllers are consistent, it is determined that the Multiple controllers belong to the same storage system, and the at least one storage node belongs to the storage system.
  • The lock identifier of the DPU is generated from the identifier of the DPU, and the lock identifier of each storage controller is generated from the identifier of the computer device and the identifier of that storage controller; the lock identifier is the identifier used to lock data when the data is accessed.
  • a virtual machine VM is configured in the computer device.
  • the DPU communicates with the virtual machine in the computing node through a virtualization standard SR-IOV.
  • The processor 701, through the communication interface 702, is specifically configured to: when the target controller is a storage controller in a storage node, access the storage node through the controller in the computer device and the target controller.
  • The priority of the DPU is higher than the priority of any of the at least one storage controller; the processor 701 is specifically configured to select the target controller for accessing the storage node according to the priorities of the plurality of controllers.
  • An embodiment of the present application also provides a computing system, including the computer device described in the above embodiment, at least one DPU connected to the computer device through a hardware interface, and at least one storage node connected to the computer device through a network.
  • the embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are run on a computer, the steps performed by the computing node in the above method embodiments are performed.
  • embodiments of the present application provide a computer program product including instructions, which, when run on a computer, cause the steps performed by the computing node in the above method embodiments to be executed.
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Abstract

本申请公开了一种访问存储节点的方法、装置及计算机设备,应用于计算机技术领域。在该方法中,计算节点确定需要访问存储节点时,从多个控制器中选择用于执行访问存储节点的目标控制器,多个控制器包括至少一个DPU和至少一个存储节点中的存储控制器,DPU包含于所述计算节点,存储节点通过网络与所述计算节点连接;向所述目标控制器发送访问请求。由存储节点中的存储控制器,替代传统方案中的备用DPU,并不需要增加额外的成本;即使计算节点的网络资源不足,导致实现上述方案需要为计算节点增加一张网卡,但增加一张网卡的成本明显低于增加一个DPU的成本。

Description

一种访问存储节点的方法、装置及计算机设备 技术领域
本申请涉及计算机技术领域,尤其涉及一种访问存储节点的方法、装置及计算机设备。
背景技术
当前计算性能和存储性能的发展差距越来越大,计算资源逐渐无法匹配存储性能的诉求,因此提出了存储算力卸载的概念,通过专有的数据处理模块为存储/网络提供算力,即数据处理单元(data processing unit,DPU)。将存储相关的计算(如加解密、重删、压缩等)在独立的DPU中完成,不占用主机中中央处理器(central processing unit,CPU)的通用算力资源,主机通过DPU直接访问存储设备就与访问本地硬盘一样简单、快捷。
DPU可以支持单根输入输出虚拟化(Single Root I/O Virtualization,SR-IOV)标准,从而使得虚拟机(virtual machine,VM)能够直接通过DPU访问存储设备,而不需要占用VM的宿主机中的CPU资源,其访问存储设备的性能与通过宿主机访问存储设备的性能基本持平。在VM直接通过DPU访问存储设备的场景中,若DPU发生故障,则VM无法继续访问存储设备,即数据通道被切断,导致VM业务直接中断。
对于硬件的单点故障来说,常规手段是增加备用设备。例如,可以在VM的宿主机上配置两个DPU。两个DPU的工作模式通常包括双激活模式和主备模式,由于DPU在使用过程中通常会采用缓存等用于加速的操作,而双激活模式会导致缓存命中率降低,因此,两个DPU通常是主备模式,主DPU作为数据通道访问存储设备,当主DPU发生故障时,数据通道从主DPU切换至备用DPU,从而继续提供存储访问服务以保障业务不被中断。
由于DPU内部设置有计算单元、内存、光口等器件,价格通常较为昂贵;而增加备用DPU的方案,由于增加的DPU作为备用设备,并不会提升系统的性能,即增加了较大成本却没有带来额外的性能收益。此外,DPU一般是通过快速外设部件互连(peripheral component interconnect express,PCIE)接口与服务器连接,即DPU需要占用服务器的PCIE插槽,增加备用DPU则需要占用服务器更多的PCIE插槽,在一些场景下可能导致服务器PCIE插槽不够用。
发明内容
本申请实施例提供一种访问存储节点的方法、装置及计算机设备,用于实现在不显著增加成本的情况下解决DPU故障时维持数据通道畅通的问题。
第一方面,本申请实施例提供一种访问存储节点的方法,包括:计算节点确定需要访问存储节点时,从多个控制器中选择用于执行访问存储节点的目标控制器,所述多个控制器包括至少一个DPU和至少一个存储节点中的存储控制器,所述DPU安装于所述计算节点,所述存储节点通过网络与所述计算节点连接;向所述目标控制器发送访问请求。
上述方法可以适用于存、算分离的场景中,即计算节点可以通过网络与存储节点连接的场景;进一步的,适用于计算节点将存储算力卸载至DPU的场景中,计算节点在访问存储节点时,可以通过DPU实现存储相关的计算,而不必消耗计算节点CPU的计算资源。由于计算节点可以通过网络与存储节点连接,而目前的存储节点中又设置有存储控制器,因此,计算节点在访问存储节点时,除了可以通过DPU实现存储相关的计算,也可以通过存储节点中的存储控制器实现存储相关的计算。由存储节点中的存储控制器,替代传统方案中的备用DPU,并不需要增加额外的成本;且通过网络连接存储节点时,仅占用计算节点的网口,不必占用PCIE插槽,不会导致计算节点PCIE插槽不足的问题;即使计算节点的网络资源不足,导致实现上述方案需要为计算节点增加一张网卡,但增加一张网卡的成本明显低于增加一个DPU的成本。因此,本申请上述实施例,既能够达到传统方案中采用多个DPU的性能水平,相比传统方案又降低了成本。
在一种可能的实现方式中,所述计算节点向所述目标控制器发送访问请求,包括:所述计算节点向所述DPU发送第一访问请求,以使所述DPU根据所述第一访问请求向一个所述存储控制器发送第二访问请求,所述第二访问请求中包括所述DPU的标识;或者,所述计算节点向选择出的存储控制器发送第三访问请求,所述第三访问请求中包括所述计算节点的标识。计算节点通过DPU访问存储节点时,对计算节点来说为本地访问,因此计算节点发送的请求中可以不包括DPU的标识,而DPU在请求访问存储节点时可以仅携带有DPU的标识。
在一种可能的实现方式中,所述计算节点拥有包括第一虚拟机VM在内的至少一个VM,所述第一VM包括多路径模块,所述计算节点确定需要访问存储节点的步骤具体包括:所述第一VM通过所述多路径模块,从多个控制器中选择所述目标控制器。在本申请实施例中,上述访问存储节点的方法既可以由物理设备执行,也可以由虚拟机执行。
在一种可能的实现方式中,所述计算节点中还部署有第二VM,所述第二VM包括第二多路径模块,所述第二VM在确定需要访问存储节点时,使用所述第二多路径模块从所述多个控制器中选择用于执行访问存储节点的目标控制器访问并发送访问请求。
在一种可能的实现方式中,所述DPU通过虚拟化标准SR-IOV与所述计算节点中的虚拟机通信。SR-IOV允许在虚拟机与虚拟机所属计算节点之间共享PCIE设备,并且使得虚拟机获得能够与本地访问性能媲美的I/O性能,通过将DPU直通给VM访问,可以减少I/O路径协议栈,提升整系统的I/O性能。
在一种可能的实现方式中,在所述计算节点确定需要访问存储节点时之前,所述方法还包括:所述计算节点通过所述多个控制器中的每个控制器扫描逻辑卷信息,每个控制器对应的逻辑卷的标识不同;所述计算节点确定通过所述多个控制器对应的不同标识的逻辑卷为同一逻辑卷,对所述不同标识的逻辑卷进行聚合,并建立聚合后的逻辑卷与所述多个控制器的对应关系。对于计算节点来说,配置有相同逻辑卷的卷管理信息的控制器属于相同的存储子系统,但是,即使DPU和存储控制器中配置有相同逻辑卷的卷管理信息,但其向计算节点上报的卷标识可能不同,因此计算节点需要确定多个控制器是否对应相同的逻辑卷,若相同,则对其进行聚合从而方便后续计算节点访问。此外,计算节点连接的存储节点可以有多个,但这些存储节点未必属于相同的存储子系统,例如,计算节点通过网络连接有存储节点A和存储节点B,但仅存储节点A中配置有VM1的卷管理信息,那么对于第一VM来说存储节点A的存储控制器和存储节点B的存储控制器不属于相同的存储子系统,计算节点可以根据是否扫描到相同的逻辑卷确定是否其是否属于相同的存储子系统。
在一种可能的实现方式中,所述DPU与所述计算节点的接口协议为NVME协议,所述存储控制器与所述计算节点的接口协议为使用NVMe通过网络结构支持连接存储NOF协议。此外,DPU、存储控制器与计算节点的接口协议也可以是互联网小型计算机系统接口ISCSI协议。
在一种可能的实现方式中,所述DPU的优先级高于所述至少一个存储控制器中任一存储控制器的优先级;所述计算节点从多个控制器中选择用于执行访问存储节点的目标控制器,包括:所述计算节点根据所述多个控制器的优先级,选择用于执行访问存储节点的目标控制器。通常情况下,计算节点通过DPU访问存储节点无需消耗计算节点CPU的计算资源,而计算节点通过存储控制器访问存储节点时需要消耗计算节点CPU的计算资源,即通过DPU访问的性能要优于通过存储控制器访问的性能,因此,在DPU能够正常工作时,可以选择通过DPU访问存储节点。
在一种可能的实现方式中,当所述目标控制器为DPU时,所述计算节点访问存储节点对待访问数据或待访问的逻辑卷上锁所使用的锁标识,是根据所述DPU的标识生成的;当所述目标控制器为存储控制器时,所述计算节点访问存储节点对待访问数据或待访问的逻辑卷上锁所使用的锁标识,是根据所述计算节点的标识和所述存储控制器的标识生成的。存储控制器根据计算节点标识生成卷管理信息,可以便于区分不同计算节点创建的卷;而计算节点通过DPU访问存储节点时,对于计算节点来说是本地访问,因此访问请求中可能不会携带计算节点标识,DPU访问存储节点时可能只包含DPU标识,而根据DPU标识生成卷管理信息能够使得存储节点能够根据DPU标识识别相应的逻辑卷。
第二方面,本申请实施例提供访问存储节点的装置,所述装置包括执行上述第一方面以及第一方面的任意一种可能的实现方式的模块/单元;这些模块/单元可以通过硬件实现,也可以通过硬件执行相应的软件实现。
示例性的,该装置包括:多路径模块,用于在所述装置需要访问存储节点时,从多个控制器中选择用于执行访问存储节点的目标控制器,所述多个控制器包括至少一个数据处理单元DPU和至少一个存储节点中的存储控制器,所述DPU安装于所述装置位于的计算节点,所述存储节点通过网络与所述装置所属的计算节点连接;发送模块,用于向所述目标控制器发送访问请求。
第三方面,本申请实施例提供一种计算机设备,所述计算机设备包括存储器和处理器;所述存储器存储有计算机程序;所述处理器用于调用所述存储器中存储的计算机程序,以执行如第一方面及第一方面任一实现方式所述的方法。
第四方面,本申请实施例提供一种计算系统,包括如第三方面所述的计算机设备,以及至少一个通过硬件接口与所述计算机设备连接的数据处理单元DPU、至少一个通过网络与所述计算机设备连接的存储节点。
第五方面,本申请实施例提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当所述指令在计算机上运行时,使得所述计算机执行如第一方面及第一方面任一实现方式所述的方法。
第六方面,本申请实施例提供一种包含指令的计算机程序产品,当其在计算机上运行时,使得如第一方面及第一方面任一实现方式所述的方法被执行。
上述第二方面至第六方面中任一方面中的任一可能实现方式可以实现的技术效果,请参照上述第一方面中相应实现方案可以达到的技术效果说明,重复之处不予论述。
附图说明
图1为本申请实施例提供的传统主、备方案示意图;
图2为本申请实施例提供的适用场景的架构示意图;
图3为本申请实施例提供的访问存储节点方法的流程示意图;
图4为本申请实施例提供的一种计算节点与多个控制器连接示意图;
图5为本申请实施例提供的另一种计算节点与多个控制器连接示意图;
图6为本申请实施例提供的访问存储节点装置的结构示意图;
图7为本申请实施例提供的计算机设备的结构示意图。
具体实施方式
采用主、备DPU的方案可以如图1所示。其中,用户层(user)和内核(kernel)为虚拟机VM,主机(host)为虚拟机的宿主(计算节点)的主机,该主机上通过PCIE插槽连接有主DPU和备用DPU。在该方案中,备用DPU在主DPU发生故障后,能够迅速接替主DPU继续工作的关键技术在于,VM中的多路径模块对能够访问存储节点的设备(即主DPU和备用DPU)进行聚合,为VM提供存储服务。
具体的,多路径模块通过控制指令(如ioctl)查询VM所属计算节点上连接的控制器信息,例如:设备类型、厂商ID、设备全球唯一标识(world wide name,WWN)等,确定各控制器是否为位于白名单中,并识别多个控制器是否为同一类型的控制器(如通过PCIE连接的控制器、通过网络连接的控制器),若为同一类型的控制器,进一步扫描其是否对应有相同的逻辑卷(不同控制器即使对应相同的逻辑卷,但逻辑卷的标识并不相同),若是,则对逻辑卷进行聚合得到一个逻辑块设备,并建立该逻辑块设备与上述多个同一类型的控制器的映射关系。VM访问该逻辑块设备,由多路径模块选择该逻辑设备对应的一个控制器访问。例如,VM获取主DPU和备用DPU的设备信息,多路径模块识别这两个块设备为同一类型的设备,假设VM扫描到主DPU和备用DPU对应的逻辑卷标识分别为/dev/nvme0n1和/dev/nvme1n1,虽然标识不同但对应的为同一逻辑卷,则多路径模块将/dev/nvme0n1和/dev/nvme1n1聚合为一个逻辑块设备/dev/mapper/dm-0,VM在需要通过DPU进行存储访问时,则访问/dev/mapper/dm-0即可;只要主DPU和备用DPU中至少有一个能够正常工作,VM就能够访问/dev/mapper/dm-0,保障了DPU故障场景时业务不被中断。
然而,上述方案在增加成本的情况下却不能提高系统性能,还可能导致计算节点PCIE插槽不够用。
有鉴于此,本申请实施例提供一种访问存储节点的方法,用于实现在不显著增加成本的情况下解决DPU故障时维持数据通道畅通的问题。
本申请实施例提供的方法,可以应用于图2所示的架构中。如图2所示,该架构可以包括计算节点210和存储集群(一个或多个存储节点200)。
计算节点210是一种计算设备,如服务器、台式计算机或者存储阵列的控制器等。虽然图2中仅示出了一个计算节点210,但实际的架构中可以包括更多数量的计算节点210,各个计算节点210之间可以相互通信。在硬件上,计算节点210中至少包括处理器212、内存213和网卡214。
其中,处理器212可以是一个或多个CPU,一个CPU又可以具有一个或多个CPU核,用于处理来自计算节点210外部的数据访问请求,或者计算节点210内部生成的请求。示例性的,处理器212接收用户发送的写数据请求时,会将这些写数据请求中的数据暂时保存在内存213中。当内存213中的数据总量达到一定阈值时,处理器212将内存213中存储的数据发送给存储节点200进行持久化存储。
内存213是指与处理器212直接交换数据的内部存储器,它可以随时读写数据,而且速度很快,作为操作系统或其他正在运行中的程序的临时数据存储器。内存包括至少两种存储器,例如内存既可以是随机存取存储器,也可以是只读存储器(read only memory,ROM)。 举例来说,随机存取存储器是动态随机存取存储器(dynamic random access memory,DRAM),或者存储级存储器(storage class memory,SCM)。DRAM是一种半导体存储器,与大部分随机存取存储器(random access memory,RAM)一样,属于一种易失性存储器(volatile memory)设备。SCM是一种同时结合传统储存装置与存储器特性的复合型储存技术,存储级存储器能够提供比硬盘更快速的读写速度,但存取速度上比DRAM慢,在成本上也比DRAM更为便宜。然而,DRAM和SCM在本实施例中只是示例性的说明,内存还可以包括其他随机存取存储器,例如静态随机存取存储器(static random access memory,SRAM)等。而对于只读存储器,举例来说,可以是可编程只读存储器(programmable read only memory,PROM)、可抹除可编程只读存储器(erasable programmable read only memory,EPROM)等。另外,内存213还可以是双列直插式存储器模块或双线存储器模块(dual In‐line memory module,DIMM),即由动态随机存取存储器(DRAM)组成的模块,还可以是固态硬盘(solid state disk,SSD)。实际应用中,计算节点210中可配置多个内存213,以及不同类型的内存213。本实施例不对内存213的数量和类型进行限定。此外,可对内存213进行配置使其具有保电功能。保电功能是指系统发生掉电又重新上电时,内存213中存储的数据也不会丢失。具有保电功能的内存被称为非易失性存储器。
网卡214用于与存储节点200通信。例如,当内存213中的数据总量达到一定阈值时,计算节点210可通过网卡214向存储节点200发送请求以对所述数据进行持久化存储。在本申请实施例中,网卡214为不具备计算能力的网卡,即,非智能网卡。
另外,计算节点210还可以包括总线,用于计算节点210内部各组件之间的通信。在功能上,由于图2中的计算节点210的主要功能是计算业务,在存储数据时可以利用远程存储器来实现持久化存储,因此它具有比常规服务器更少的本地存储器,从而实现了成本和空间的节省。但这并不代表计算节点210不能具有本地存储器,在实际实现中,计算节点210也可以内置少量的硬盘,或者外接少量硬盘。
计算节点210还通过硬件接口(如PCIE插槽)连接有至少一个DPU,并通过网络连接有至少一个存储节点200(图2中以3个存储节点进行举例)。
一个存储节点200包括一个或多个控制器201、网卡204与多个硬盘205。网卡204用于与计算节点210通信。硬盘205用于存储数据,可以是磁盘或者其他类型的存储介质,例如固态硬盘或者叠瓦式磁记录硬盘等。控制器201用于根据计算节点210发送的读/写数据请求,往硬盘205中写入数据或者从硬盘205中读取数据。在读写数据的过程中,控制器201需要将读/写数据请求中携带的地址转换为硬盘能够识别的地址。由此可见,控制器201也具有一些简单的计算功能。
在实际应用中,控制器201可具有多种形态。一种情况下,控制器201包括CPU和内存。CPU用于执行地址转换以及读写数据等操作。内存用于临时存储将要写入硬盘205的数据,或者从硬盘205读取出来将要发送给计算节点210的数据。另一种情况下,控制器201是一个可编程的电子部件,例如DPU。控制器201的数量可以是一个,也可以是两个或两个以上。当存储节点200包含至少两个控制器201时,硬盘205与控制器201之间可具有归属关系。当硬盘205与控制器201之间具有归属关系时,每个控制器只能访问归属于它的硬盘,因此这往往涉及到在控制器201之间转发读/写数据请求,导致数据访问的路径较长。另外,如果存储空间不足,在存储节点200中增加新的硬盘205时需要重新绑定硬盘205与控制器201之间的归属关系。
本申请实施例提供的访问存储节点的方法,可以应用于图2所示的计算节点中,或者应用于计算节点中配置的虚拟机上。
参见图3,为本申请实施例提供的访问存储节点方法的流程示意图,如图所示,该方法可以包括以下步骤:
步骤301、计算节点确定需要访问存储节点时,从多个控制器中选择用于执行访问存储节点的目标控制器。
其中,上述计算节点可以是如图2所示的计算节点,包括CPU、内存、网卡等。此外,本申请实施例中的计算节点,还包括通过硬件接口(如PCIE插槽)连接的DPU;计算节点还通过网络连接有至少一个存储节点。上述多个控制器可以包括DPU,以及至少一个存储节点中的存储控制器。
在本申请实施例中,在计算节点访问存储节点时,用于实现存储相关计算的控制器,既可以包括DPU,也可以包括存储节点中的控制器,而不再是仅通过DPU实现。计算节点从可用的多个控制器中选择用于执行当前访问的控制器。例如,若计算节点检测到DPU发生故障,则可以选择存储控制器用于实现当前访问。
进一步的,计算节点在步骤301之前,可以主动查询每个控制器的设备信息,例如,计算节点可以通过控制指令(如ioctl)查询每个控制器的设备类型、支持的存储网络协议、厂商ID、WWN等。或者,也可以是各个控制器主动将自身的设备信息上报给计算节点,以便于计算节点根据设备信息识别可用于访问存储节点的控制器。计算节点在获取到每个控制器的设备信息后,根据设备信息,确定多个控制器是否支持相同的存储网络协议。若计算节点确定存在支持相同存储网络协议的多个控制器,则该多个控制器可作为本申请实施例中用于访问存储节点的候选控制器,即步骤301中所述的多个控制器。例如,基于非易失性内存主机控制器接口规范(non-volatile memory express,NVMe)协议与计算节点通信的DPU,和基于使用NVMe通过网络结构支持连接存储(NVMe over fabric,NOF)协议与计算节点通信的存储控制器,属于支持相同存储网络协议的控制器。又例如,基于Internet小型计算机系统接口(internet small computer system interface,ISCSI)协议与计算节点通信的多个控制器,也属于支持相同存储网络协议的控制器。
可选的,计算节点在获取到各控制器的设备信息后,还可以进一步确定该控制器是否属于白名单或黑名单,若确定属于白名单则允许计算节点通过该控制器访问存储节点,或者,若确定属于黑名单则禁止计算节点通过该控制器访问存储节点。例如,计算节点可以根据控制器的厂商ID、支持的存储网络协议确定是否属于白/黑名单。
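As a non-authoritative illustration of this discovery step, the sketch below (Python; the `ControllerInfo` fields, the whitelist values and the function name are invented for the example and do not come from the patent) filters controllers by a vendor whitelist and keeps the largest group that reports the same storage network protocol:

```python
from dataclasses import dataclass

@dataclass
class ControllerInfo:
    ctrl_id: str     # DPU identifier or storage-controller identifier
    ctrl_type: str   # "dpu" (PCIe-attached) or "storage" (network-attached)
    protocol: str    # storage network protocol reported by the controller, e.g. "nof" or "iscsi"
    vendor_id: str   # vendor ID read from the device information

VENDOR_WHITELIST = {"0x1234", "0x5678"}   # placeholder vendor IDs, not from the patent

def candidate_controllers(infos):
    """Keep whitelisted controllers and return the largest group that shares one
    storage network protocol; that group is the candidate set of step 301."""
    allowed = [i for i in infos if i.vendor_id in VENDOR_WHITELIST]
    groups = {}
    for info in allowed:
        groups.setdefault(info.protocol, []).append(info)
    return max(groups.values(), key=len) if groups else []

ctrls = [
    ControllerInfo("dpu0",  "dpu",     "nof",   "0x1234"),
    ControllerInfo("ctrl2", "storage", "nof",   "0x1234"),
    ControllerInfo("ctrl3", "storage", "iscsi", "0x1234"),
]
print([c.ctrl_id for c in candidate_controllers(ctrls)])   # ['dpu0', 'ctrl2']
```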
步骤302、计算节点向目标控制器发送访问请求。
计算节点向目标控制器发送访问请求,由目标控制器对请求访问的数据进行处理,即,由目标控制器执行存储相关的计算,从而完成从存储节点读取数据或向存储节点写入数据。
一种可能的实现方式中,若计算节点选择出的目标控制器为DPU,则计算节点可以向DPU发送访问请求。在DPU访问存储节点的数据时,可以根据第一请求向存储控制器发送第二访问请求,该第二访问请求中可以包含有DPU的标识。计算节点在通过DPU访问存储节点时,对于计算节点来说,可以视为本地访问,因此,计算节点发送的访问请求中可以不包含有计算节点的标识。而DPU访问存储节点时,并非本地访问,因此,第二访问请求中可以包含有DPU的标识,以使存储控制器能够识别请求者。可选的,第二访问请求中还可以包含有计算节点的标识。
在另一种可能的实现方式中,若计算节点选择出的目标控制器为存储控制器,则计算节点可以向存储控制器发送第三访问请求,该第三访问请求中包括计算节点的标识,以便于存储控制器能够识别请求者。
若选择出的目标控制器为DPU,则计算节点通过DPU访问存储节点,无需消耗计算节点CPU的计算资源。若选择出的目标控制器为存储节点中的存储控制器,则计算节点可能无法直接通过存储控制器访问存储节点,此时可能需要占用计算节点中CPU的计算资源。
计算节点在通过DPU访问存储节点时,虽然DPU也向存储控制器发送访问请求,但在访问过程中,DPU需要执行存储相关的计算,如对访问数据的解、压缩,对写入数据的去重等操作等,从而分担了存储控制器的负荷。
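The difference between the three request types described above can be sketched as follows (Python; field names such as `requester` are assumptions made for this illustration, not a wire format defined by the patent):

```python
def first_access_request(op, lba, length):
    """Request from the compute node to the DPU. Access through the DPU is treated
    as local, so no compute-node identifier needs to be carried."""
    return {"op": op, "lba": lba, "len": length}

def second_access_request(first_req, dpu_id):
    """Request the DPU forwards to a storage controller; it carries the DPU
    identifier so the storage controller can recognise the requester."""
    return dict(first_req, requester=dpu_id)

def third_access_request(op, lba, length, node_id):
    """Request the compute node sends directly to a selected storage controller;
    it carries the compute-node identifier instead."""
    return {"op": op, "lba": lba, "len": length, "requester": node_id}

req1 = first_access_request("read", lba=0x1000, length=4096)
req2 = second_access_request(req1, dpu_id="dpu0")
req3 = third_access_request("read", lba=0x1000, length=4096, node_id="nodeA")
```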
如前所述,上述访问存储节点的方法也可以由虚拟机执行。在由虚拟机执行上述方法时,上述计算节点中部署有包括第一虚拟机在内的至少一个虚拟机,此时,上述步骤301,可以由第一虚拟机中的第一多路径模块执行,即第一多路径模块从多个控制器中选择目标控制器。
例如,如图4所示,计算节点中部署有虚拟机,DPU与计算节点之间的接口协议为NVMe协议,存储控制器与计算节点之间的接口协议为NOF协议,则虚拟机可以通过其所属计算节点中的虚拟操作系统模拟器(Qemu)的NVMe控制器以及存储控制器实现对存储节点的访问。
进一步的,计算节点中还可以部署有第二虚拟机,第二虚拟机中的第二多路径模块也可以执行从多个控制器中选择目标控制器访问存储节点的步骤。
可选的,当上述方法由虚拟机执行时,DPU可以通过SR-IOV与计算节点中的虚拟机通信。SR-IOV允许在虚拟机与虚拟机所属计算节点之间共享PCIE设备,并且使得虚拟机获得能够与本地访问性能媲美的I/O性能,通过将DPU直通给VM访问,可以减少I/O路径协议栈,提升整系统的I/O性能。
上述方法可以适用于存、算分离的场景中,即计算节点可以通过网络与存储节点连接的场景;进一步的,适用于计算节点将存储算力卸载至DPU的场景中,计算节点在访问存储节点时,可以通过DPU实现存储相关的计算,而不必消耗计算节点的计算资源。由于计算节点可以通过网络与存储节点连接,而目前的存储节点中又设置有存储控制器,因此,计算节点在访问存储节点时,除了可以通过DPU实现存储相关的计算,也可以通过存储节点中的存储控制器实现存储相关的计算。由存储节点中的存储控制器,替代传统方案中的备用DPU,并不需要增加额外的成本;且通过网络连接存储节点时,仅占用计算节点的网口,不必占用PCIE插槽,不会导致计算节点PCIE插槽不足的问题;即使计算节点的网络资源不足,导致实现上述方案需要为计算节点增加一张网卡,但增加一张网卡的成本明显低于增加一个DPU的成本。因此,本申请上述实施例,既能够达到传统方案中采用多个DPU的性能水平,相比传统方案又降低了成本。
如前所述,当目标控制器为DPU时,且DPU支持SR-IOV时,计算节点执行访问存储节点的步骤(即上述步骤302)时无需占用计算节点的计算资源;当目标控制器为存储节点中的存储控制器时,计算节点在执行步骤302时可能需要占用计算节点的计算资源。在这种情况下,通过DPU访问存储节点的性能(如读/写数据的速度等)优于通过存储控制器访问时的性能。有鉴于此,可以设置DPU的优先级为高优先级,存储控制器的优先级为低优先级,使得在DPU能够正常工作时,优先采用DPU执行访问存储节点的操作,从而保证系统的高性能;当DPU发生故障时,再通过低优先级的存储控制器访问存储节点,以保证数据通道的畅通,业务不被中断;当DPU被修复成功或者更换新的DPU后,则继续选择DPU作为目标控制器访问存储节点。
具体的,计算节点可以通过心跳检测等方式查询DPU的状态,以便及时发现DPU是否发生故障。当DPU正常时,计算节点可以选择DPU作为目标控制器;在确定DPU发生故障后, 则选择通过存储控制器访问存储节点。当存储控制器存在多个时,例如存储多个存储节点或者一个存储节点中存在多个存储控制器时,可以为不同的存储控制器设置不同的优先级,或者,也可以设置相同的优先级并按照预设的选择策略(如负载均衡策略)选择目标控制器。
在传统方案中,计算节点中可以设置有多路径模块,可以对主DPU和备用DPU进行聚合从而为计算节点提供存储访问服务,应当理解,在传统方案中,仅将相同类型的控制器进行聚合,即,可以对通过PCIE接口连接的多个控制器进行聚合,但不能对通过PCIE接口连接的控制器和通过网络连接的控制器进行聚合。而在本申请实施例中,计算节点中也可以设置有多路径模块,用于对DPU和存储控制器进行聚合为计算节点提供存储访问服务,即可以将不同类型的控制器进行聚合。具体的,上述查询各控制器设备信息的操作,可以由多路径模块实现;多路径模块在确定多个控制器支持相同的存储网络协议后,可以进行扫描,确定是否能够通过该多个控制器扫描到相同的存储设备(可以是物理存储设备,如图2所示的存储节点;也可以是逻辑存储设备,如逻辑卷),如前所述,即使不同控制器对应的存储设备相同,但上报至计算节点的存储设备的标识可能并不相同,因此,多路径模块可以对扫描到的相同存储设备进行聚合,聚合成为一个逻辑块设备;后续计算节点在进行存储访问时,访问该逻辑块设备即可。多路径模块还可以定时扫描以发现新增的设备并确定是否需要聚合。
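A minimal sketch of the aggregation idea (Python; how a real multipath module detects that two differently named block devices back the same volume is simplified here to matching a WWID, which is an assumption of the example):

```python
def aggregate_paths(scanned):
    """scanned maps a controller ID to (volume WWID, block-device name), e.g.
    {"dpu0": ("wwid-1", "/dev/nvme0n1"), "ctrl2": ("wwid-1", "/dev/nvme1n1")}.
    Devices that expose the same underlying volume are grouped into one
    aggregated logical block device, regardless of controller type."""
    groups = {}
    for ctrl, (wwid, dev) in scanned.items():
        groups.setdefault(wwid, []).append((ctrl, dev))
    aggregated = {}
    for idx, members in enumerate(groups.values()):
        aggregated[f"/dev/mapper/dm-{idx}"] = members
    return aggregated

paths = aggregate_paths({
    "dpu0":  ("wwid-1", "/dev/nvme0n1"),   # PCIe-attached DPU
    "ctrl2": ("wwid-1", "/dev/nvme1n1"),   # network-attached storage controller
})
print(paths)   # {'/dev/mapper/dm-0': [('dpu0', '/dev/nvme0n1'), ('ctrl2', '/dev/nvme1n1')]}
```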
在上述实施例中,DPU、存储控制器以及存储控制器对应的存储节点可以视为一个存储子系统,经过聚合后,DPU和存储控制器对计算节点上报的存储子系统(即上述逻辑块设备)相同。该子系统中可以存在一个主节点,可以是该子系统中的一个DPU,也可以是其中一个存储控制器,并由该主节点为该子系统中的各个控制器分配控制器信息(如控制器在该子系统中的名称等)等。此外,主节点还可以为各控制器配置优先级。而传统方案中,由于DPU和存储控制器属于不同类型的控制器,通常属于不同的存储子系统,对计算节点上报的存储子系统的名称并不相同。
计算节点在访问存储节点时,通常会根据存储节点的存储资源创建卷(又称逻辑卷)。逻辑卷是由逻辑磁盘形成的虚拟盘,是一种存储的管理方式,目的是把硬盘空间从物理硬盘的管理方式中跳出来,进行更方便的统一管理分配。因此,计算节点在执行上述方法之前,还可以先针对各存储节点创建逻辑卷。在创建逻辑卷的过程中,各存储节点中的存储控制器会生成逻辑卷的映射关系等卷管理信息,用于后续对基于存储节点的逻辑卷进行访问或管理。由于DPU需要访问基于存储节点创建的逻辑卷,因此,计算节点可以控制存储控制器将生成的卷管理信息配置到DPU中,即DPU和存储控制器共用相同的卷管理,从而使得DPU和存储控制器均能够根据卷管理信息实现对逻辑卷的访问。
对于存储节点来说,计算节点A可以对其进行访问,可以基于该存储节点生成逻辑卷,计算节点B也可以对其进行访问,可以基于该存储节点生成逻辑卷。因此,存储节点在为计算节点A生成卷管理信息时,可以加入计算节点A的标识,从而便于存储节点根据计算节点A的卷管理信息为计算节点A提供服务。基于此,在一种可能的设计中,计算节点可以将计算节点的标识发送给存储控制器,以使存储控制器根据计算节点标识生成相应的卷管理信息。进一步的,计算节点还可以将DPU的标识也发送给存储控制器,以使存储控制器根据计算节点标识和DPU标识生成相应的卷管理信息。
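The sketch below models that flow in Python with a toy in-memory controller; the class and method names (`StorageController`, `create_volume`, `push_to_dpu`) are invented for the illustration and are not an API defined by the patent:

```python
from dataclasses import dataclass, field

@dataclass
class StorageController:
    """Toy stand-in for a storage controller that records volume management info."""
    volumes: dict = field(default_factory=dict)
    dpu_copies: dict = field(default_factory=dict)

    def create_volume(self, name, size_gib, owner_node, dpu):
        # Volume management info keyed by the compute-node and DPU identifiers,
        # so volumes created by different compute nodes can be told apart.
        info = {"name": name, "size_gib": size_gib, "owner_node": owner_node, "dpu": dpu}
        self.volumes[name] = info
        return info

    def push_to_dpu(self, dpu_id, info):
        # Configure the generated volume management info on the DPU as well, so
        # the DPU and the storage controller share the same view of the volume.
        self.dpu_copies.setdefault(dpu_id, []).append(info)

ctrl = StorageController()
vol = ctrl.create_volume("vm1-vol0", 100, owner_node="nodeA-wwid", dpu="dpu0")
ctrl.push_to_dpu("dpu0", vol)
```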
如果多个控制器对同一逻辑卷中的数据并发访问,或者对同一逻辑卷中的同一数据并发访问,就会产生资源争用。一种解决资源争用的方式是通过锁进行访问控制,对并发访问的数据进行上锁保护,DPU和存储控制器在访问存储数据时可以对数据进行上锁,从而避免并发对同一数据进行读/写操作。而上锁过程中,DPU或存储控制器所使用的标识,可以称为锁标识。因此,在一种可能的实现方式中,生成的卷管理信息中,可以包括DPU和存储控制器的锁标识。虽然在本申请实施例中可以设置优先使用DPU访问存储节点中的数据,当DPU故障时再通过存储控制器访问存储节点中的数据,但是,在一种特殊情况下,例如,计算节点与DPU之间通信故障,计算节点因此认为DPU发生故障,切换至存储控制器重新进行访问,但DPU仍能够继续进行访问操作,此时就可能出现DPU与存储控制器对相同的数据进行访问,此时就可以通过对数据上锁以避免DPU和存储控制器同时访问该数据而导致出错。
可选的,DPU的锁标识可以根据DPU的标识生成;而存储控制器的锁标识可以根据计算节点的标识和存储控制器的标识生成。对于计算节点来说,计算节点通过DPU访问存储节点属于本地访问方式,故这种情况下访问请求中可以不携带有计算节点的标识,仅由DPU的标识即可确定其对应的计算节点,因此DPU的锁标识根据DPU标识生成即可。而一个存储节点可以与多个计算节点相连接,仅由存储控制器的标识生成存储控制器的锁标识,则无法识别到对数据上锁的计算节点,因此,存储控制器的锁标识可以根据计算节点的标识和存储控制器的标识生成。
在一个具体实施例中,可以采用简化版的雪花(snowflake)算法生成锁标识。具体的,假设锁标识共8个字节,高位的4个字节存放计算节点的标识(如WWID),低位的4个字节存放控制器标识。对于计算节点来说,计算节点通过DPU访问存储节点属于本地访问方式,故这种情况下访问请求中可以不携带有计算节点的标识,故DPU锁标识的高位4个字节可以设置为0000,低位4个字节为DPU的标识。
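Under the byte layout described in this paragraph, a lock identifier could be generated roughly as follows (Python sketch; the concrete IDs are made up for the example):

```python
def make_lock_id(controller_id: int, node_id: int = 0) -> bytes:
    """Build an 8-byte lock identifier: the high 4 bytes hold the compute-node
    identifier and the low 4 bytes hold the controller identifier.  For a DPU
    the access is local to the compute node, so the high 4 bytes stay 0000."""
    high = (node_id & 0xFFFFFFFF).to_bytes(4, "big")
    low = (controller_id & 0xFFFFFFFF).to_bytes(4, "big")
    return high + low

dpu_lock  = make_lock_id(0x00000A01)                        # DPU: zero node part + DPU ID
ctrl_lock = make_lock_id(0x00000B02, node_id=0x12345678)    # storage controller + node ID
assert dpu_lock.hex()  == "0000000000000a01"
assert ctrl_lock.hex() == "1234567800000b02"
```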
为了更加清楚理解本申请上述实施例,下面结合附图5进行举例说明。
在图5所示的场景中,计算节点A中配置有虚拟机VM1,计算节点A通过PCIE插槽连接了一个DPU,并通过网络与存储集群A连接,存储集群A包括3个存储节点。该DPU和3个存储节点均支持NOF协议,可以采用上述方法为计算节点提供存储访问服务。DPU中的NVMe控制器记为controller1,存储节点1中的存储控制器记为controller2,存储节点2中的存储控制器记为controller3,存储节点3中的存储控制器记为controller4。
用户可以为VM1基于存储节点1、存储节点2和存储节点3的存储资源创建逻辑卷。VM1在根据用户的指示创建逻辑卷时,向controller2发送了VM1的标识以及DPU的标识,以使controller2根据VM1标识、DPU标识生成卷管理信息,并将卷管理信息发送给DPU。此时,这3个存储节点和DPU可以构成一个存储子系统,假设存储节点1中的controller2为该存储子系统的主节点,则controller2可以为子系统中的各个控制器分配子系统中的ID以及各控制器的优先级,例如,可以为DPU分配最高优先级,为其他存储控制器分配相同的低优先级。VM1中的NVMe启动器和NOF启动器分别进行扫描,NVMe启动器通过DPU扫描到逻辑卷,其标识为nvme0n1,NOF启动器通过controller2、controller3、controller4扫描到的逻辑卷的标识分别nvme1n1、nvme2n1、nvme3n1。VM1中的多路径模块根据获取到的设备信息确定这4个控制器均为白名单中的设备,并都支持NOF协议,且扫描到的逻辑卷虽然标识不同但实际上是相同的逻辑卷,因此确定这4个控制器属于相同的存储子系统,并将nvme0n1、nvme1n1、nvme2n1、nvme3n1聚合为一个逻辑块设备/dev/dm-0,建立nvme0n1、nvme1n1、nvme2n1、nvme3n1与/dev/dm-0的映射关系。
VM1中的多路径模块可以定时查询这4个控制器的状态,当DPU发生故障时,探测DPU的指令失败或超时,则将nvme0n1状态设置为异常(或者直接将DPU的状态设置为异常)。VM1在需要访问/dev/dm-0时,VM1中的多路径模块若确定优先级最高的nvme0n1状态正常,则通过nvme0n1访问存储节点;若nvme0n1的状态异常,可以从nvme1n1、nvme2n1、nvme3n1中随机选择一个块设备,或者根据预设规则从中选择一个块设备,并通过选择出的块设备所对应的控制器访问存储节点。当VM1中的多路径模块探测到nvme0n1恢复正常后,则将nvme0n1的状态设置为正常,以使在后续访问存储节点时优先选择nvme0n1进行访问。
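The failover behaviour of this example can be condensed into a small state table and selection routine (Python sketch; the priority values and the random tie-break among the storage controllers are assumptions consistent with the text above, not mandated by it):

```python
import random

# Path table for the Figure 5 example: nvme0n1 goes through the DPU and has the
# highest priority; nvme1n1..nvme3n1 go through the three storage controllers
# and share a lower priority.
paths = {
    "nvme0n1": {"priority": 0, "healthy": True},   # lower number = higher priority
    "nvme1n1": {"priority": 1, "healthy": True},
    "nvme2n1": {"priority": 1, "healthy": True},
    "nvme3n1": {"priority": 1, "healthy": True},
}

def mark_path(name, healthy):
    """Called by the periodic health probe when a controller fails or recovers."""
    paths[name]["healthy"] = healthy

def pick_path():
    """Pick the healthy path with the highest priority, breaking ties randomly."""
    healthy = [(p["priority"], name) for name, p in paths.items() if p["healthy"]]
    if not healthy:
        raise RuntimeError("no usable path to /dev/dm-0")
    best = min(prio for prio, _ in healthy)
    return random.choice([name for prio, name in healthy if prio == best])

mark_path("nvme0n1", False)                              # DPU probe timed out
assert pick_path() in {"nvme1n1", "nvme2n1", "nvme3n1"}
mark_path("nvme0n1", True)                               # DPU repaired or replaced
assert pick_path() == "nvme0n1"
```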
此外,一个计算节点中可以部署有多个虚拟机,那么该多个虚拟机中的部分或全部均可以实现上述方法实施例。例如,一个计算节点中部署有VM1和VM2,VM1和VM2均可以通过上述方法从多个控制器中选择目标控制器访问存储节点。当上述选择目标控制器、聚合逻辑块设备的操作由多路径模块执行时,那么VM1和VM2中可以各自设置多路径模块。
基于相同的技术构思,本申请实施例还提供一种访问存储节点装置,用于实现上述方法实施例。该装置可以包括执行上述方法实施例中任意一种可能的实现方式的模块/单元;这些模块/单元可以通过硬件实现,也可以通过硬件执行相应的软件实现。
示例性的,该装置可以如图6所示,包括:多路径模块601和发送模块602。
多路径模块601,用于在所述装置需要访问存储节点时,从多个控制器中选择用于执行访问存储节点的目标控制器,所述多个控制器包括至少一个数据处理单元DPU和至少一个存储节点中的存储控制器,所述DPU安装于所述装置位于的计算节点,所述存储节点通过网络与所述装置所属的计算节点连接。
发送模块602,用于向所述目标控制器发送访问请求。
在一种可能的实现方式中,所述发送模块602,具体用于:向所述DPU发送第一访问请求,以使所述DPU根据所述第一访问请求向一个所述存储控制器发送第二访问请求,所述第二访问请求中包括所述DPU的标识;或者,向选择出的存储控制器发送第三访问请求,所述第三访问请求中包括所述装置的标识。
在一种可能的实现方式中,所述装置为第一虚拟机VM。
在一种可能的实现方式中,所述装置位于的计算节点中还部署有第二VM,所述第二VM也包括第二多路径模块,所述第二VM在确定需要访问存储节点时,使用所述第二多路径模块从所述多个控制器中选择用于执行访问存储节点的目标控制器访问并发送访问请求。
在一种可能的实现方式中,所述多路径模块601,还用于在所述装置确定需要访问存储节点时之前,通过所述多个控制器中的每个控制器扫描逻辑卷信息,每个控制器对应的逻辑卷的标识不同;确定通过所述多个控制器对应的不同标识的逻辑卷为同一逻辑卷,对所述不同标识的逻辑卷进行聚合,并建立聚合后的逻辑卷与所述多个控制器的对应关系。
在一种可能的实现方式中,所述DPU通过虚拟化标准SR-IOV与所述计算节点中的虚拟机通信。
在一种可能的实现方式中,所述DPU与所述装置所在的计算节点的接口协议为NVMe协议,所述存储控制器与所述装置所在的计算节点的接口协议为使用NVMe通过网络结构支持连接存储NOF协议。
在一种可能的实现方式中,所述DPU的优先级高于所述至少一个存储控制器中任一存储控制器的优先级;所述多路径模块具体用于:根据所述多个控制器的优先级,选择用于执行访问存储节点的目标控制器。
在一种可能的实现方式中,当所述目标控制器为DPU时,所述装置访问存储节点对待访问数据或待访问的逻辑卷上锁所使用的锁标识,是根据所述DPU的标识生成的;当所述目标控制器为存储控制器时,所述装置访问存储节点对待访问数据或待访问的逻辑卷上锁所使用的锁标识,是根据所述装置的标识和所述存储控制器的标识生成的。
基于相同的技术构思,本申请实施例还提供一种计算机设备。该计算机设备包括如图7所示的处理器701,以及与处理器701连接的通信接口702。
处理器701可以是通用处理器,微处理器,特定集成电路(application specific integrated circuit,ASIC),现场可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件,分立门或者晶体管逻辑器件,或一个或多个用于控制本申请方案程序执行的集成电路等。通用处理器可以是微处理器或者任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件处理器执行完成,或者用处理器中的硬件及软件模块组合执行完成。
通信接口702,用于与其他设备通信,如PCI总线接口、以太网,无线接入网(radio access network,RAN),无线局域网(wireless local area networks,WLAN)等。
在本申请实施例中,处理器701用于调用通信接口702执行接收和/或发送的功能,并执行如前任一种可能实现方式所述的方法。
进一步的,该计算机设备还可以包括存储器703以及通信总线704。
存储器703,用于存储程序指令和/或数据,以使处理器701调用存储器703中存储的指令和/或数据,实现处理器701的上述功能。存储器703可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器703可以是独立存在,例如片外存储器,通过通信总线704与处理器701相连接。存储器703也可以和处理器701集成在一起。
通信总线704可包括一通路,用于在上述组件之间传送信息。
示例性的,处理器701可以通过通信接口702执行以下步骤:确定需要访问存储节点时,从多个控制器中选择用于执行访问存储节点的目标控制器,所述多个控制器包括至少一个数据处理单元DPU和至少一个存储节点中的存储控制器,所述DPU通过硬件接口与所述计算机设备连接,可以认为该计算机设备包括该DPU;所述存储节点通过网络与所述计算机设备连接;通过所述目标控制器访问存储节点。
在一种可能的实现方式中,处理器701还用于通过通信接口702,在确定需要访问存储节点时之前,获取所述多个控制器中每个控制器的设备信息;根据所述设备信息,确定所述多个控制器支持相同的存储网络协议。
在一种可能的实现方式中,所述存储网络协议包括基于非易失性内存主机控制器接口规范的存储网络NOF协议,或者互联网小型计算机系统接口ISCSI协议。
在一种可能的实现方式中,处理器701还用于通过通信接口702,在确定需要访问存储节点时之前,对所述至少一个存储节点创建卷,并控制所述至少一个存储控制器将生成的卷管理信息配置在所述DPU中。
在一种可能的实现方式中,处理器701还用于通过通信接口702,将所述计算机设备的标识和所述DPU的标识发送给所述至少一个存储控制器,以使所述至少一个存储控制器根据所述计算机设备标识和DPU的标识生成卷管理信息。
在一种可能的实现方式中,处理器701还用于通过所述多个控制器中的每个控制器扫描生成的卷;若通过所述多个控制器扫描到的卷一致,则确定所述多个控制器属于相同的存储系统,所述至少一个存储节点属于所述存储系统。
在一种可能的实现方式中,所述DPU的锁标识根据所述DPU的标识生成,每个所述存储控制器的锁标识根据所述计算机设备的标识和所述存储控制器的标识生成;所述锁标识表示访问数据时对所述数据上锁时使用的标识。
在一种可能的实现方式中,所述计算机设备中配置有虚拟机VM。
在一种可能的实现方式中,所述DPU通过虚拟化标准SR-IOV与所述计算节点中的虚拟机通信。
在一种可能的实现方式中,处理器701通过通信接口702,具体用于:当所述目标控制器为存储节点中的存储控制器时,通过所述计算机设备中的控制器和所述目标控制器访问存储节点。
在一种可能的实现方式中,所述DPU的优先级高于所述至少一个存储控制器中任一存储控制器的优先级;处理器701具体用于:根据所述多个控制器的优先级,选择用于执行访问存储节点的目标控制器。
基于相同的技术构思,本申请实施例还提供一种计算系统,包括如上述实施例所述的计算机设备,以及至少一个通过硬件接口与所述计算机设备连接的DPU、至少一个通过网络与所述计算机设备连接的存储节点。
基于相同的技术构思,本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机可读指令,当所述计算机可读指令在计算机上运行时,使得上述方法实施例中计算节点所执行的步骤被执行。
基于相同的技术构思,本申请实施例提供还一种包含指令的计算机程序产品,当其在计算机上运行时,使得上述方法实施例中计算节点所执行的步骤被执行。
需要理解的是,在本申请的描述中,“第一”、“第二”等词汇,仅用于区分描述的目的,而不能理解为指示或暗示相对重要性,也不能理解为指示或暗示顺序。在本说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制 造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
尽管已描述了本申请的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例作出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本申请范围的所有变更和修改。
显然,本领域的技术人员可以对本申请实施例进行各种改动和变型而不脱离本申请实施例的精神和范围。这样,倘若本申请实施例的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (20)

  1. 一种访问存储节点的方法,其特征在于,包括:
    计算节点确定需要访问存储节点时,从多个控制器中选择用于执行访问存储节点的目标控制器,所述多个控制器包括至少一个数据处理单元DPU和至少一个存储节点中的存储控制器,所述DPU安装于所述计算节点,所述存储节点通过网络与所述计算节点连接;
    所述计算节点向所述目标控制器发送访问请求。
  2. 根据权利要求1所述的方法,其特征在于,所述计算节点向所述目标控制器发送访问请求,包括:
    所述计算节点向所述DPU发送第一访问请求,以使所述DPU根据所述第一访问请求向一个所述存储控制器发送第二访问请求,所述第二访问请求中包括所述DPU的标识;或者
    所述计算节点向选择出的存储控制器发送第三访问请求,所述第三访问请求中包括所述计算节点的标识。
  3. 根据权利要求1或2所述的方法,其特征在于,所述计算节点拥有包括第一虚拟机VM在内的至少一个VM,所述第一VM包括多路径模块,所述计算节点确定需要访问存储节点的步骤具体包括:
    所述第一VM通过所述多路径模块,从多个控制器中选择所述目标控制器。
  4. 根据权利要求3所述的方法,其特征在于,所述计算节点中还部署有第二VM,所述第二VM包括第二多路径模块,所述第二VM在确定需要访问存储节点时,使用所述第二多路径模块从所述多个控制器中选择用于执行访问存储节点的目标控制器访问并发送访问请求。
  5. 根据权利要求3或4所述的方法,其特征在于,所述DPU通过虚拟化标准SR-IOV与所述计算节点中的虚拟机通信。
  6. 根据权利要求1-5任一项所述的方法,其特征在于,在所述计算节点确定需要访问存储节点时之前,所述方法还包括:
    所述计算节点通过所述多个控制器中的每个控制器扫描逻辑卷信息,每个控制器对应的逻辑卷的标识不同;
    所述计算节点确定通过所述多个控制器对应的不同标识的逻辑卷为同一逻辑卷,对所述不同标识的逻辑卷进行聚合,并建立聚合后的逻辑卷与所述多个控制器的对应关系。
  7. 根据权利要求1-6任一项所述的方法,其特征在于,所述DPU与所述计算节点的接口协议为非易失性内存主机控制器接口规范NVMe协议,所述存储控制器与所述计算节点的接口协议为使用NVMe通过网络结构支持连接存储NOF协议。
  8. 根据权利要求1-7任一项所述的方法,其特征在于,所述DPU的优先级高于所述至少一个存储控制器中任一存储控制器的优先级;
    所述计算节点从多个控制器中选择用于执行访问存储节点的目标控制器,包括:
    所述计算节点根据所述多个控制器的优先级,选择用于执行访问存储节点的目标控制器。
  9. 根据权利要求1-8任一项所述的方法,其特征在于,当所述目标控制器为DPU时,所述计算节点访问存储节点对待访问数据或待访问的逻辑卷上锁所使用的锁标识,是根据所述DPU的标识生成的;
    当所述目标控制器为存储控制器时,所述计算节点访问存储节点对待访问数据或待访问的逻辑卷上锁所使用的锁标识,是根据所述计算节点的标识和所述存储控制器的标识生成的。
  10. 一种访问存储节点的装置,其特征在于,包括:
    多路径模块,用于在所述装置需要访问存储节点时,从多个控制器中选择用于执行访问存储节点的目标控制器,所述多个控制器包括至少一个数据处理单元DPU和至少一个存储节点中的存储控制器,所述DPU安装于所述装置所位于的计算节点,所述存储节点通过网络与所述装置所属的计算节点连接;
    发送模块,用于向所述目标控制器发送访问请求。
  11. 根据权利要求10所述的装置,其特征在于,所述发送模块,具体用于:
    向所述DPU发送第一访问请求,以使所述DPU根据所述第一访问请求向一个所述存储控制器发送第二访问请求,所述第二访问请求中包括所述DPU的标识;或者
    向选择出的存储控制器发送第三访问请求,所述第三访问请求中包括所述装置的标识。
  12. 根据权利要求10或11所述的装置,其特征在于,所述装置为虚拟机VM。
  13. 根据权利要求12所述的装置,其特征在于,所述装置所属的计算节点中还部署有第二VM,所述第二VM在确定需要访问存储节点时,从所述多个控制器中选择用于执行访问存储节点的目标控制器访问并发送访问请求。
  14. 根据权利要求12或13所述的装置,其特征在于,所述多路径模块,还用于在所述装置确定需要访问存储节点时之前,通过所述多个控制器中的每个控制器扫描逻辑卷信息,每个控制器对应的逻辑卷的标识不同;确定通过所述多个控制器对应的不同标识的逻辑卷为同一逻辑卷,对所述不同标识的逻辑卷进行聚合,并建立聚合后的逻辑卷与所述多个控制器的对应关系。
  15. 根据权利要求12-14任一项所述的装置,其特征在于,所述DPU通过虚拟化标准SR-IOV与所述计算节点中的虚拟机通信。
  16. 根据权利要求10-15任一项所述的装置,其特征在于,所述DPU与所述装置所在的计算节点的接口协议为NVMe协议,所述存储控制器与所述装置所在的计算节点的接口协议为使用NVMe通过网络结构支持连接存储NOF协议。
  17. 根据权利要求10-16任一项所述的装置,其特征在于,所述DPU的优先级高于所述至少一个存储控制器中任一存储控制器的优先级;
    所述多路径模块具体用于:根据所述多个控制器的优先级,选择用于执行访问存储节点的目标控制器。
  18. 根据权利要求10-17任一项所述的装置,其特征在于,当所述目标控制器为DPU时,所述装置访问存储节点对待访问数据或待访问的逻辑卷上锁所使用的锁标识,是根据所述DPU的标识生成的;
    当所述目标控制器为存储控制器时,所述装置访问存储节点对待访问数据或待访问的逻辑卷上锁所使用的锁标识,是根据所述装置的标识和所述存储控制器的标识生成的。
  19. 一种计算机设备,其特征在于,所述计算机设备包括处理器和存储器;
    所述存储器存储有计算机程序;
    所述处理器用于调用所述存储器中存储的计算机程序,以执行权利要求1-9任一项所述的方法。
  20. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有指令,当所述指令在计算机上运行时,使得所述计算机执行如权利要求1-9任一项所述的方法。
PCT/CN2023/071481 2022-01-30 2023-01-10 一种访问存储节点的方法、装置及计算机设备 WO2023143033A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210114555.3 2022-01-30
CN202210114555.3A CN116560785A (zh) 2022-01-30 2022-01-30 一种访问存储节点的方法、装置及计算机设备

Publications (1)

Publication Number Publication Date
WO2023143033A1 true WO2023143033A1 (zh) 2023-08-03

Family

ID=87470654

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/071481 WO2023143033A1 (zh) 2022-01-30 2023-01-10 一种访问存储节点的方法、装置及计算机设备

Country Status (2)

Country Link
CN (1) CN116560785A (zh)
WO (1) WO2023143033A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103765371A (zh) * 2011-08-26 2014-04-30 威睿公司 导出作为存储对象的逻辑卷的数据存储系统
US20140379836A1 (en) * 2013-06-25 2014-12-25 Mellanox Technologies Ltd. Offloading node cpu in distributed redundant storage systems
CN105549904A (zh) * 2015-12-08 2016-05-04 华为技术有限公司 一种应用于存储系统中的数据迁移方法及存储设备
CN110892380A (zh) * 2017-07-10 2020-03-17 芬基波尔有限责任公司 用于流处理的数据处理单元
CN113821311A (zh) * 2020-06-19 2021-12-21 华为技术有限公司 任务执行方法及存储设备

Also Published As

Publication number Publication date
CN116560785A (zh) 2023-08-08

Similar Documents

Publication Publication Date Title
US11748278B2 (en) Multi-protocol support for transactions
US11025544B2 (en) Network interface for data transport in heterogeneous computing environments
US10884799B2 (en) Multi-core processor in storage system executing dynamic thread for increased core availability
US9760497B2 (en) Hierarchy memory management
KR102044023B1 (ko) 키 값 기반 데이터 스토리지 시스템 및 이의 운용 방법
US8943294B2 (en) Software architecture for service of collective memory and method for providing service of collective memory using the same
US11829309B2 (en) Data forwarding chip and server
US20190026225A1 (en) Multiple chip multiprocessor cache coherence operation method and multiple chip multiprocessor
US20210326177A1 (en) Queue scaling based, at least, in part, on processing load
WO2023072048A1 (zh) 网络存储方法、存储系统、数据处理单元及计算机系统
US20240126847A1 (en) Authentication method and apparatus, and storage system
US20210191777A1 (en) Memory Allocation in a Hierarchical Memory System
US20210311767A1 (en) Storage system, storage device therefor, and operating method thereof
US10339065B2 (en) Optimizing memory mapping(s) associated with network nodes
US20230281113A1 (en) Adaptive memory metadata allocation
US11157191B2 (en) Intra-device notational data movement system
WO2023143033A1 (zh) 一种访问存储节点的方法、装置及计算机设备
CN111666579B (zh) 计算机设备及其访问控制方法和计算机可读介质
US10936219B2 (en) Controller-based inter-device notational data movement system
US10860334B2 (en) System and method for centralized boot storage in an access switch shared by multiple servers
US11281612B2 (en) Switch-based inter-device notational data movement system
US12001386B2 (en) Disabling processor cores for best latency in a multiple core processor
US20240028344A1 (en) Core mapping based on latency in a multiple core processor
US20240231653A1 (en) Memory management method and apparatus, processor, and computing device
US20240231669A1 (en) Data processing method and apparatus, processor, and hybrid memory system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23745916

Country of ref document: EP

Kind code of ref document: A1