WO2020014869A1 - Method and device for processing I/O requests - Google Patents

Method and device for processing I/O requests

Info

Publication number
WO2020014869A1
Authority
WO
WIPO (PCT)
Prior art keywords
interval
access
control node
query command
host
Prior art date
Application number
PCT/CN2018/095992
Other languages
English (en)
French (fr)
Inventor
李浪波
张明谦
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN202010260925.5A priority Critical patent/CN111767008A/zh
Priority to JP2020529744A priority patent/JP7094364B2/ja
Priority to CN201880003796.2A priority patent/CN110914796B/zh
Priority to PCT/CN2018/095992 priority patent/WO2020014869A1/zh
Priority to EP18926874.1A priority patent/EP3796149B1/en
Priority to KR1020207014420A priority patent/KR102342607B1/ko
Publication of WO2020014869A1 publication Critical patent/WO2020014869A1/zh
Priority to US17/026,280 priority patent/US11249663B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0688Non-volatile semiconductor memory arrays
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604Improving or facilitating administration, e.g. storage management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • G06F3/0611Improving I/O performance in relation to response time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629Configuration or reconfiguration of storage systems
    • G06F3/0635Configuration or reconfiguration of storage systems by changing the path, e.g. traffic rerouting, path reconfiguration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0644Management of space entities, e.g. partitions, extents, pools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659Command handling arrangements, e.g. command buffers, queues, command scheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0662Virtualisation aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0662Virtualisation aspects
    • G06F3/0665Virtualisation aspects at area level, e.g. provisioning of virtual or logical volumes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0674Disk device
    • G06F3/0676Magnetic disk device
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0679Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/56Provisioning of proxy services
    • H04L67/565Conversion or adaptation of application format or content
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/60Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • H04L67/63Routing a service request depending on the request content or context
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0653Monitoring storage devices or systems

Definitions

  • This application relates to the field of storage, and in particular, to a method and device for processing I/O requests.
  • A storage system includes multiple control nodes and logical disks (for example, namespaces), and a host accesses one logical disk through multiple control nodes. As a result, there are multiple paths between the host and the logical disk.
  • The host selects a path for sending each I/O request to the storage system by polling the paths in turn.
  • The control node that receives the I/O request in the storage system uses a preset algorithm to calculate, from the logical address carried in the I/O request, which control node is to execute the I/O request.
  • For example, the host sends an I/O request to a first control node through a first path of the multiple paths. The first control node applies the preset algorithm to the logical address in the I/O request and determines that the I/O request should be executed by a second control node, so the first control node forwards the I/O request to the second control node.
  • Forwarding I/O requests in this way increases I/O latency.
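  • The forwarding problem described above can be illustrated with a minimal sketch. The number of control nodes, the stripe size, and the striping function `owner_of` are hypothetical stand-ins for the "preset algorithm", which the text does not specify.

```python
from itertools import cycle

NUM_NODES = 4   # hypothetical number of control nodes
STRIPE = 1024   # hypothetical ownership granularity, in logical blocks


def owner_of(lba: int) -> int:
    """Hypothetical preset algorithm: stripe the address space across nodes."""
    return (lba // STRIPE) % NUM_NODES


def count_forwards(lbas) -> int:
    """Poll the paths in turn and count requests that land on the wrong node."""
    paths = cycle(range(NUM_NODES))
    forwards = 0
    for lba in lbas:
        receiver = next(paths)          # path chosen by polling
        if receiver != owner_of(lba):   # receiver must forward to the owner
            forwards += 1
    return forwards


forwarded = count_forwards([0, 100, 2048, 5000])
```

  • In this sketch, every request whose receiving node differs from the owning node costs one extra hop inside the storage system, which is the added latency the text refers to.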
  • Embodiments of the present invention provide a method, a device, and a host for processing I/O requests, which are used to set an access interval for each control node of a storage system.
  • A first aspect of the embodiments of the present invention provides a data processing method performed by a host.
  • The host and the storage system communicate through the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through a control node in the storage system.
  • The method includes: the host sends a status query command to the control node, where the status query command instructs the control node to report the path status of the path on which the control node is located, and then receives the path status reported by the control node.
  • When the path status received by the host indicates that the logical disk includes an access interval, the host sends an interval query command to the control node, receives the access interval information reported by the control node, and records the mapping relationship between the control node and the access interval information. The access interval indicated by the access interval information is allocated to the control node in advance.
  • The status query command of the NVMeoF protocol and the newly added interval query command are used to obtain the access interval that the storage system has set for each control node. When a subsequent I/O request is received, it can then be sent to the control node corresponding to the access interval in which the logical address of the I/O request falls, thereby avoiding I/O forwarding.
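  • The two-step discovery described in the first aspect (status query, then interval query, then recording the mapping) can be sketched as follows. The class `ControlNode`, the status strings, and the `(first address, length)` tuple are assumptions made for illustration; they are not the NVMeoF wire format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical status values; the text only says the path status "indicates
# that the logical disk includes an access interval".
STATUS_HAS_INTERVAL = "has-access-interval"
STATUS_NO_INTERVAL = "no-access-interval"


@dataclass
class ControlNode:
    """Simulated control node; interval is (first address, length) or None."""
    name: str
    interval: Optional[Tuple[int, int]] = None

    def status_query(self) -> str:
        if self.interval is not None:
            return STATUS_HAS_INTERVAL
        return STATUS_NO_INTERVAL

    def interval_query(self) -> Tuple[int, int]:
        if self.interval is None:
            raise ValueError("no access interval allocated to this node")
        return self.interval


def discover(nodes: Iterable[ControlNode]) -> Dict[str, Tuple[int, int]]:
    """Host side: query status first, query the interval only when indicated."""
    mapping: Dict[str, Tuple[int, int]] = {}
    for node in nodes:
        if node.status_query() == STATUS_HAS_INTERVAL:
            mapping[node.name] = node.interval_query()
    return mapping


nodes = [
    ControlNode("Node1", (0, 1000)),
    ControlNode("Node2", (1000, 1000)),
    ControlNode("Node3"),  # no access interval allocated
]
mapping = discover(nodes)
```

  • Gating the interval query on the reported path status means a node without an allocated interval is never asked for one, which mirrors the order of steps in the method.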
  • The method further includes: receiving an I/O request that carries the logical address of the data to be accessed; determining the access interval in which the logical address falls; determining, according to the mapping relationship, the control node corresponding to that access interval; and sending the I/O request to that control node.
  • By recording the mapping relationship between access intervals and control nodes, the host can send an I/O request to the control node corresponding to the access interval in which the logical address of the I/O request falls, thereby avoiding I/O forwarding.
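  • The host-side routing step can be sketched as an interval lookup. This is a minimal illustration that assumes contiguous, non-overlapping access intervals given as `(first address, length)`; the class name `IntervalRouter` and the use of binary search are choices made here, not details from the text.

```python
import bisect
from typing import Dict, Optional, Tuple


class IntervalRouter:
    """Maps a logical address to the control node owning its access interval."""

    def __init__(self, mapping: Dict[str, Tuple[int, int]]):
        # mapping: node name -> (first address, length)
        self._entries = sorted(
            (start, length, node) for node, (start, length) in mapping.items()
        )
        self._starts = [entry[0] for entry in self._entries]

    def route(self, lba: int) -> Optional[str]:
        # Find the last interval whose first address is <= lba.
        index = bisect.bisect_right(self._starts, lba) - 1
        if index >= 0:
            start, length, node = self._entries[index]
            if lba < start + length:
                return node
        return None  # address is outside every recorded access interval


router = IntervalRouter({"Node1": (0, 1000), "Node2": (1000, 1000)})
target = router.route(1500)
```

  • A request for address 1500 is sent straight to Node2 in this sketch, so no control node has to forward it.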
  • The path status indicating that the logical disk includes an access interval is reported to the host by the control node when the control node determines that the logical disk includes an access interval.
  • In this way, the host can determine from the path status whether to query the access interval.
  • The interval query command is defined based on the Get Log Page - Command Dword command in the NVMeoF protocol, and the command ID of the interval query command is carried in the Log Page Identifier field of that command.
  • Because the interval query command is defined using a command that already exists in the NVMeoF protocol, the existing NVMeoF protocol does not need to be modified.
  • In one case, the access interval indicated by the access interval information is a continuous address space, and the access interval information includes the first address of the access interval and the length of the access interval.
  • In another case, the access interval corresponding to the control node includes a first sub-access interval and a second sub-access interval that are adjacent but separated by a gap. The access interval information then includes the first address of the access interval, the length of each sub-access interval, and the gap between two adjacent sub-access intervals.
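  • For the discontinuous case, membership of a logical address can be checked from the three reported values. The sketch below assumes that all sub-access intervals of a node have the same length and the same gap, and it ignores the end of the logical disk; the class name `StridedInterval` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StridedInterval:
    """Access interval made of equally sized sub-intervals separated by gaps."""
    first_address: int  # first address of the access interval
    sub_length: int     # length of each sub-access interval
    gap: int            # gap between two adjacent sub-access intervals

    def contains(self, lba: int) -> bool:
        offset = lba - self.first_address
        if offset < 0:
            return False
        period = self.sub_length + self.gap
        # Within each period, the first sub_length addresses belong to the node.
        return offset % period < self.sub_length


# Two nodes interleaved in chunks of 100 addresses.
node1 = StridedInterval(first_address=0, sub_length=100, gap=100)
node2 = StridedInterval(first_address=100, sub_length=100, gap=100)
```

  • With two nodes interleaved this way, addresses 0 to 99 and 200 to 299 belong to the first node and addresses 100 to 199 belong to the second, so three numbers per node describe an arbitrarily long alternating layout.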
  • The access interval information is reported to the host in the response information defined for the interval query command, and is carried in an interval description field of the response information.
  • A second aspect of the present invention provides a data processing method, which is executed by a control node of a storage system that communicates with a host through the NVMeoF protocol, where the storage system includes a logical disk.
  • The method includes: receiving a status query command sent by the host, where the status query command instructs the control node to report the path status of the path on which the control node is located; when the logical disk includes an access interval, reporting to the host a path status indicating that the logical disk includes an access interval; receiving the interval query command sent by the host according to the path status; and reporting, according to the interval query command, the access interval information of the access interval allocated to the control node to the host, so that the host can record the mapping relationship between the access interval information and the control node.
  • The status query command of the NVMeoF protocol and the newly added interval query command are used to obtain the access interval that the storage system has set for each control node. When a subsequent I/O request is received, it can then be sent to the control node corresponding to the access interval in which the logical address of the I/O request falls, thereby avoiding I/O forwarding.
  • A third aspect of the embodiments of the present invention provides a data processing method performed by a host, the host being connected to a storage system, the storage system including a plurality of control nodes, and the host accessing a logical disk of the storage system through the plurality of control nodes.
  • The method includes: receiving an I/O request that carries the logical address of the data to be accessed; determining the access interval to which the logical address belongs, where the logical disk includes a plurality of access intervals and the host records the mapping relationship between each access interval and a control node; and sending the I/O request to the control node corresponding to the determined access interval.
  • By recording the mapping relationship between access intervals and control nodes, the host can send an I/O request to the control node corresponding to the access interval in which the logical address of the I/O request falls, thereby avoiding I/O forwarding.
  • The method further includes: sending a status query command to the multiple control nodes, where the status query command instructs each control node to report the path status of the path on which that control node is located;
  • The access interval set by the storage system for each control node is obtained through the status query command of the NVMeoF protocol and the newly added interval query command, so that the host can send an I/O request to the control node corresponding to the access interval in which the logical address of the I/O request falls, thereby avoiding I/O forwarding.
  • The path status indicating that the logical disk includes an access interval is reported to the host by the control node that received the status query command when that control node determines that the logical disk includes an access interval.
  • In this way, the host can determine from the path status whether to query the access interval.
  • A fourth aspect of the embodiments of the present invention provides a data processing method performed by a host, where the host and the storage system communicate through an external network.
  • The method includes: encapsulating a Non-Volatile Memory Express (NVMe) interval query command in the external network protocol to obtain an external network protocol interval query command, where the NVMe interval query command is used to query the access interval allocated to a control node of the storage system and the access interval belongs to a namespace of the storage system; sending the external network protocol interval query command to the control node; receiving an external network protocol interval response message sent by the control node, where the response message includes an NVMe interval query command response that contains the access interval information of the namespace; parsing the external network protocol interval response message to obtain the NVMe interval query command response; and obtaining and recording the access interval information of the control node from the NVMe interval query command response.
  • The interval query command added to the NVMeoF protocol is used to obtain the access interval set by the storage system for each control node, so that when a subsequent I/O request is received, it can be sent to the control node corresponding to the access interval in which the logical address of the I/O request falls, thereby avoiding I/O forwarding.
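  • The encapsulate-then-parse pattern in the fourth aspect can be sketched with a toy framing. The outer header (a magic value plus a payload length) and the 16-byte response layout are invented for illustration only; they are not the NVMeoF capsule format or any real transport header.

```python
import struct

# Hypothetical outer header: 2-byte magic + 4-byte payload length, big-endian.
OUTER_MAGIC = 0x4E46
OUTER_HEADER = struct.Struct("!HI")

# Hypothetical NVMe-level response body: first address + length, 8 bytes each.
INTERVAL_RESPONSE = struct.Struct("!QQ")


def encapsulate(nvme_payload: bytes) -> bytes:
    """Wrap an NVMe-level command or response in the outer network framing."""
    return OUTER_HEADER.pack(OUTER_MAGIC, len(nvme_payload)) + nvme_payload


def parse(message: bytes) -> bytes:
    """Strip the outer framing and return the NVMe-level payload."""
    magic, length = OUTER_HEADER.unpack_from(message)
    if magic != OUTER_MAGIC:
        raise ValueError("not an encapsulated NVMe message")
    payload = message[OUTER_HEADER.size:OUTER_HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return payload


# Control node side: build and wrap the interval response.
wire_message = encapsulate(INTERVAL_RESPONSE.pack(4096, 1024))

# Host side: unwrap and read the access interval information.
first_address, length = INTERVAL_RESPONSE.unpack(parse(wire_message))
```

  • The host and the control node only need to agree on the inner NVMe-level layout; the outer framing can change with the external network without touching the interval logic.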
  • Before the encapsulating of the NVMe interval query command in the external network protocol to obtain the external network protocol interval query command, the method further includes:
  • the external network protocol status response message includes an NVMe status query command response, the NVMe status query command response includes path status information, and the path status information indicates that the namespace includes access intervals;
  • In this way, path status information carrying this indication can be reported to the host, and the host can determine from the path status information whether the logical disk is divided into access intervals.
  • A fifth aspect of the embodiments of the present invention provides a data processing method, which is executed by a control node of a storage system, where the host communicates with the storage system through an external network.
  • The method includes: receiving an external network protocol interval query command sent by the host; parsing the external network protocol interval query command to obtain an NVMe interval query command; generating response information of the NVMe interval query command, where the response information includes the access interval information corresponding to the control node; encapsulating the response information of the NVMe interval query command in the external network protocol to obtain an external network protocol interval response message; and reporting the external network protocol interval response message to the host.
  • The interval query command added to the NVMeoF protocol is used to obtain the access interval set by the storage system for each control node, so that when a subsequent I/O request is received, it can be sent to the control node corresponding to the access interval in which the logical address of the I/O request falls, thereby avoiding I/O forwarding.
  • Before receiving the external network protocol interval query command sent by the host, the method further includes: receiving an external network protocol status query command, and parsing it to obtain an NVMe status query command, where the NVMe status query command instructs the control node to report the path status of the path on which the control node is located; generating response information of the NVMe status query command and, when the namespace includes an access interval, carrying in that response information a path status indicating that the namespace includes the access interval; and encapsulating the response information of the NVMe status query command in the external network protocol to obtain response information of the external network protocol status query command, and reporting that response information to the host.
  • In this way, path status information carrying this indication can be reported to the host, and the host can determine from the path status information whether the logical disk is divided into access intervals.
  • An embodiment of the present invention further provides a host, where the host communicates with a storage system through the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through a control node in the storage system. The host includes units or means for performing each step of the first aspect above.
  • An embodiment of the present invention further provides a control node of a storage system, where a host and the storage system communicate through the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through the control node. The control node includes units or means for performing each step of the second aspect above.
  • An embodiment of the present invention further provides a host, where the host communicates with a storage system through the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through a control node in the storage system. The host includes units or means for performing each step of the third aspect above.
  • An embodiment of the present invention further provides a host, where the host communicates with the storage system through an external network, and the host includes units or means for performing each step of the fourth aspect above.
  • An embodiment of the present invention further provides a control node of a storage system, where a host communicates with the storage system through an external network, and the control node includes units or means for performing each step of the fifth aspect above.
  • An embodiment of the present invention further provides a host, where the host communicates with a storage system through the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through a control node in the storage system. The host includes a memory and a processor, where the memory is used to store programs and data, and the processor is used to run the programs stored in the memory and, according to the data stored in the memory, implement the methods provided in the first aspect, the third aspect, or the fourth aspect above.
  • An embodiment of the present invention further provides a control node of a storage system, where a host and the storage system communicate through the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through the control node. The control node includes a memory and a processor, where the memory is used to store programs and data, and the processor is used to run the programs stored in the memory and, according to the data stored in the memory, implement the methods provided in the second aspect or the fifth aspect above.
  • FIG. 1 is a structural diagram of a system applied in an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of N paths for a host to access a logical disk of a storage system according to an embodiment of the present invention.
  • FIGS. 3a-3b are schematic diagrams of dividing the address space of a logical disk in a storage system into N continuous address spaces and N discontinuous address spaces, respectively, according to an embodiment of the present invention.
  • FIG. 4 is an architecture diagram of a host according to an embodiment of the present invention.
  • FIG. 5 is a flowchart of a method for managing multiple paths to access the logical disk according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a path state query command according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of status reporting information in an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a path state defined in an NVMeoF protocol according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of status reporting information that carries a space-dependent state in an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of an interval query command according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of interval reporting information in an embodiment of the present invention.
  • FIGS. 12 and 13 are schematic diagrams of interval reporting information that carries interval description information for access intervals that are continuous in the address space and for discontinuous access intervals, respectively, in an embodiment of the present invention.
  • FIG. 14 is a flowchart of assigning a control node that executes an I / O request according to an embodiment of the present invention.
  • FIG. 15 is a block diagram of a host provided in an embodiment of the present invention.
  • FIG. 16 is a block diagram of any control node in a storage system provided in an embodiment of the present invention.
  • FIG. 1 is an architecture diagram of a system to which an embodiment of the present invention is applied.
  • The host 100 is connected to two switches 200 through two host ports 101, such as Host Bus Adapters (HBAs), respectively.
  • The two switches 200 are each connected to the storage system 300.
  • The two switches 200 and the two host ports 101 are provided so that the path is not disconnected if one of the two switches 200 or one of the two host ports 101 fails.
  • The storage system 300 includes a plurality of control nodes 302, such as Node 1 to Node N. Each control node 302 is connected to the two switches 200 through two storage ports 301 (for example, HBA cards). In this way, there are two paths between the host 100 and each control node 302. The two paths are redundant with each other: if one path fails, the other can be used to transfer data between the host 100 and the control node 302.
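  • The redundancy between the two paths to one control node can be sketched as a simple failover choice. The path labels and the health map are hypothetical; how path failures are detected is not described in this passage.

```python
from typing import Dict, List, Optional


def pick_path(paths: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy path to a control node, or None if all failed."""
    for path in paths:
        if healthy.get(path, False):
            return path
    return None


# Two redundant paths from the host to the same control node (labels invented).
node1_paths = ["port-A/switch-1/Node1", "port-B/switch-2/Node1"]

# The first path has failed, so the second one is used.
chosen = pick_path(
    node1_paths,
    {"port-A/switch-1/Node1": False, "port-B/switch-2/Node1": True},
)
```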
  • the storage system 300 includes a storage device 304 composed of a plurality of SSDs.
  • the storage device 304 may be an independent redundant disk array (Redundant Array of Independent Disks, RAID), or a flash memory cluster (Just).
  • The storage device 304 further includes a virtual SSD. The virtual SSD is an SSD of another storage device that is mapped to the storage device 304, where the other storage device communicates with the storage device 304 through the NVMe over Fabrics (NVMeoF) protocol.
  • Mapping a remote SSD to a local storage device through the NVMeoF protocol, so that the virtual SSD serves as a local storage device, is existing technology and is not described again here.
  • the host 100 communicates with the storage system 300 through the NVMeoF protocol.
  • the local SSD and / or virtual SSD in the storage device 304 can be used to construct a logical disk 303.
  • the logical disk 303 can be a namespace, which is an expression of a logical disk defined in the NVMeoF protocol.
  • In FIG. 1, only one logical disk 303 is shown, but in practical applications multiple logical disks can be constructed in the storage device 304. One logical disk can be allocated to multiple hosts, or multiple logical disks can be allocated to one host.
  • the host 100 can access the logical disk 303 allocated to the host 100 through N control nodes 302.
  • If the physical links between the host 100 and the logical disk in FIG. 1 are omitted and the two redundant paths passing through the same control node are merged into one, N paths for the host 100 to access the logical disk 303 are obtained.
  • the construction and management of the logical disk 303 can be performed by any control node 302 designated by the user.
  • the control node designated by the user is Node1.
  • Node1 assigns a disk identifier (ID) and a disk code to the logical disk 303.
  • the disk identifier may uniquely identify the logical disk.
  • the disk identifier includes a Globally Unique Identifier (GUID), manufacturer information, and product information that are applied to the NVMeoF protocol.
  • the disk code is used to distinguish different logical disks built in the storage device 304.
  • the disk code of the logical disk 303 can be expressed as Namespace1.
  • Node1 assigns a virtual host to the logical disk 303, for example, virtual host 1, and records the mapping relationship between the logical disk 303 and the virtual host: virtual host 1: Namespace1.
  • Node1 can set a control node that accesses the logical disk 303.
  • N control nodes are set for the logical disk 303.
  • The storage system 300 assigns a unique identifier, that is, a node identifier, to each control node 302.
  • The control nodes set for the logical disk 303 may be represented by the following mapping relationship: Namespace1: Node1, Node2, ..., NodeN.
  • the Node1 may further set an accessible access interval for each control node 302.
  • FIG. 3a is a schematic diagram in which Node1 divides the address space of the logical disk 303 into N continuous address spaces. For example, the interval allocated to Node1 is 0 to 99M, the interval allocated to Node2 is 100M to 199M, the interval allocated to Node3 is 200 to 399M, and the interval allocated to NodeN is 800 to 1023M.
  • Each access interval is a continuous address space, and the size of each access interval can be the same or different.
  • Alternatively, the access interval allocated by Node1 to each control node may be discontinuous.
  • In this case, the address space of the logical disk 303 is first divided into at least two sub-address spaces of equal size, and then each sub-address space is divided into N sub-access intervals, each of which corresponds to one control node.
  • Each sub-address space is divided in the same way, that is, the sub-access interval corresponding to a given control node has the same size and the same position in every sub-address space. The set of sub-access intervals corresponding to a control node constitutes the access interval of that control node.
  • the access range of Node1 is: 0 to a0, d0 + 1 to a1, d1 + 1 to a2;
  • the access range of Node2 is: a0 + 1 to b0, a1 + 1 to b1, a2 + 1 to b2;
  • the access intervals of NodeN are: c0 + 1 to d0, c1 + 1 to d1, and c2 + 1 to d2.
  • The sub-access intervals that make up the access interval of each control node are equal in size and equally spaced.
  • When the configuration is recorded, the start address of the access interval, the address length of the sub-access intervals that make up the access interval, and the address interval between adjacent sub-access intervals may be recorded.
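The discontinuous division described above can be illustrated with a minimal sketch. The names `StripedInterval` and `build_striped_intervals` are hypothetical, and the sketch assumes that the recorded "address interval" is the gap between the end of one sub-access interval and the start of the next one belonging to the same control node. The total size of the logical disk is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripedInterval:
    """Discontinuous access interval of one control node (hypothetical type)."""
    start: int   # first address of the first sub-access interval
    length: int  # address length of every sub-access interval
    gap: int     # assumed meaning: addresses between two adjacent sub-access intervals

    def contains(self, address: int) -> bool:
        """Return True if the address falls into one of the sub-access intervals."""
        if address < self.start:
            return False
        period = self.length + self.gap
        return (address - self.start) % period < self.length


def build_striped_intervals(node_count: int, sub_length: int) -> dict[str, StripedInterval]:
    """Give each of N nodes one equally sized slot in every sub-address space."""
    gap = (node_count - 1) * sub_length
    return {
        f"Node{i + 1}": StripedInterval(start=i * sub_length, length=sub_length, gap=gap)
        for i in range(node_count)
    }
```

With four nodes and a sub-access interval length of 100, Node1 would own addresses 0 to 99, 400 to 499, and so on, so only three numbers per node need to be recorded.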
  • In response to the host's request, the N control nodes 302 report the access interval corresponding to each control node 302 to the host 100, and the host can issue I/O requests accordingly. A method for reporting the access interval is described below.
  • If logical disks other than the logical disk 303 are also constructed on the storage device 304, those logical disks may be divided into access intervals according to actual needs, or may be left undivided.
  • Each control node 302 can obtain the identifier of the host port 101 to which it is connected, and then replace the mapping relationship between the virtual host and the disk code of the logical disk 303 with a mapping relationship between the host port identifier and the disk code of the logical disk. The replaced mapping relationship can be expressed as: HBA1, HBA2 (virtual host 1): Namespace1.
  • Because the host 100 is connected to the storage system 300 through two host ports 101, the identifiers of both host ports, that is, HBA1 and HBA2, are recorded in the above mapping relationship.
  • The mapping relationships generated above, namely the mapping between the disk code of the logical disk and the node identifiers of the N control nodes 302, the mapping between the host port identifiers and the disk code of the logical disk, and the mapping between the identifier of each control node and its access interval, are stored in a storage area accessible to all N control nodes 302.
  • FIG. 4 is an architecture diagram of the host 100 in the embodiment of the present invention.
  • the host 100 also includes a processor 102 and a memory 103.
  • the memory 103 stores a multi-path program 104 and an operating system 105.
  • the processor 102 implements the management of the multiple paths to the logical disk 303 by the host 100 by executing the multi-path program 104 of the operating system 105.
  • the management of the multiple paths to the logical disk 303 by the host 100 will be described below through the flowchart shown in FIG. 5.
  • In step S501, when the operating system 105 of the host 100 detects that the storage system 300 is connected to the host 100, the operating system 105 sends a disk report command to the storage system 300 through each of the multiple paths between the host 100 and the storage system 300 (that is, the paths formed between the storage ports 301 of the N control nodes and the host ports 101).
  • When the disk report command passes through a host port 101, the host port identifier is carried in the disk report command.
  • After receiving the disk report command, each control node 302 obtains the host port identifier from the command and, according to the host port identifier and the mapping relationship between host port identifiers and disk codes of logical disks, determines the disk code corresponding to the host port identifier.
  • each control node 302 of the storage system 300 generates report information for the logical disk corresponding to the disk code, carries the disk code in the report information, and reports the report information to the host 100.
  • The following uses Node1 as an example to describe the reporting process of the report information.
  • Node1 is connected to the two HBA cards of the host through two HBA cards of its own, so there are two paths between Node1 and the host.
  • Node1 receives two disk reporting commands through the two HBA cards, it will use two disks respectively.
  • In the process of reporting the report information, the identifier of the control node that reports it, the storage port 301 through which it passes (for example, one of the two HBA ports of Node1), and the host port 101 through which it passes are added to the report information.
  • The control node identifier, storage port, host port, and disk code in the report information together indicate the path on which the disk code is reported.
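As an illustration of the disk reporting exchange described above, the following sketch shows how a control node might turn a disk report command into report information. The type `ReportInfo`, the function `handle_disk_report_command`, and the port name strings are hypothetical; only the mapping values HBA1, HBA2, and Namespace1 come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportInfo:
    """Report information; the four fields together identify one path."""
    node_id: str       # identifier of the reporting control node
    storage_port: str  # storage port the report passes through
    host_port: str     # host port the report passes through
    disk_code: str     # disk code of the logical disk, e.g. "Namespace1"


# Mapping between host port identifiers and disk codes, as recorded by the
# control nodes (example values taken from the text).
HOST_PORT_TO_DISK_CODE = {"HBA1": "Namespace1", "HBA2": "Namespace1"}


def handle_disk_report_command(node_id: str, storage_port: str, host_port: str) -> ReportInfo:
    """Look up the disk code for the host port carried in the command."""
    disk_code = HOST_PORT_TO_DISK_CODE[host_port]  # raises KeyError for an unknown port
    return ReportInfo(node_id, storage_port, host_port, disk_code)
```

A control node that receives the command on two storage ports would call this once per command and therefore produce two pieces of report information that differ only in their port fields.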
  • In step S504, after receiving the report information reported by each control node 302 of the storage system 300, the multi-path program 104 of the host 100 sends a disk identifier query command through the path indicated by the path information in each piece of report information. The query command includes the disk code in the report information.
  • each control node 302 in the storage system 300 obtains a disk identifier corresponding to the disk code according to the disk code in the query command after receiving the query command.
  • Each control node 302 in the storage system 300 reports the disk identifier corresponding to the disk code through the path on which the query command was sent. Because the storage system 300 assigns a disk ID and a disk code to each logical disk 303 when the logical disk 303 is created, the storage system 300 can obtain the disk ID from the disk code after receiving the disk code.
  • Step S507: The multi-path program 104 of the host 100 determines the paths for accessing the logical disk corresponding to the disk ID according to the reported disk ID, and manages the determined paths.
  • When the multi-path program 104 receives a disk identifier for the first time, it creates a disk object according to the disk identifier. This includes assigning a disk object name to the disk object, establishing the mapping relationship between the disk object name and the disk identifier, and recording the path that reported the disk identifier as a path of the disk object. Other related information about the disk, such as its address space and capacity, may also be recorded in the disk object. If the same disk identifier is received again later, the path that reported it may be recorded as another path of the disk object.
  • Redundant paths are set between each control node 302 and the host 100. When the multi-path program manages the paths to each logical disk 303, the redundant paths between each control node 302 and the host 100 are merged into one. The paths of the same control node can be recorded together according to the control node ID. When the host issues an I/O request, it selects one of the redundant paths to issue the I/O. If that path fails, the host replaces the failed path with the other path for the I/O.
  • the following description uses a merged path for description, that is, a case where there is only one path between each control node 302 and the host 100.
  • The multi-path software 104 can manage the multiple paths to the disk through the disk object, for example, discovering new paths, deleting broken paths, and selecting paths for issuing I/O requests according to preset policies.
  • For example, if the multipath software 104 first receives the disk identifier of the logical disk whose disk code is Namespace1 on path 1, it creates a disk object for Namespace1, assigns the disk object name sda to the disk object, establishes the mapping relationship between sda and the disk identifier, and records path 1 as the first path of the disk object. If the disk identifier of Namespace1 is subsequently received on path 2 and path 3 as well, path 2 and path 3 are also recorded as paths of the disk object.
  • The multi-path software receives 2N disk identifiers of Namespace1 on 2N paths. After the paths of the same control node are merged, the multi-path software manages N paths between the host and the storage system.
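The path bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: `DiskObject`, `MultipathTable`, and the path name strings are hypothetical, and the naming scheme (`sda`, `sdb`, ...) simply follows the `sda` example in the text and only covers the first 26 disks.

```python
from dataclasses import dataclass, field


@dataclass
class DiskObject:
    name: str     # disk object name, e.g. "sda"
    disk_id: str  # disk identifier reported by the storage system
    # Control node ID -> redundant paths through that node, managed as one merged path.
    paths_by_node: dict[str, list[str]] = field(default_factory=dict)


class MultipathTable:
    def __init__(self) -> None:
        self.disks: dict[str, DiskObject] = {}

    def on_disk_id(self, disk_id: str, node_id: str, path: str) -> DiskObject:
        """Create the disk object on first sight, then record every further path."""
        disk = self.disks.get(disk_id)
        if disk is None:
            name = f"sd{chr(ord('a') + len(self.disks))}"
            disk = DiskObject(name=name, disk_id=disk_id)
            self.disks[disk_id] = disk
        paths = disk.paths_by_node.setdefault(node_id, [])
        if path not in paths:
            paths.append(path)
        return disk

    def select_path(self, disk_id: str, node_id: str, failed: frozenset = frozenset()) -> str:
        """Pick one redundant path to the node, skipping paths known to have failed."""
        for path in self.disks[disk_id].paths_by_node[node_id]:
            if path not in failed:
                return path
        raise LookupError(f"no usable path to {node_id}")
```

Grouping paths by control node ID is what reduces 2N physical paths to N managed paths, while keeping the second path of each pair available for failover.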
  • Step S508: After the access paths are established for the logical disk, the multipath software 104 sends a path status query command to the storage system 300. The status query command instructs each control node to report the path status of the path on which the control node is located.
  • the host 100 communicates with the storage system 300 through the NVMeoF protocol.
  • The host first generates a Non-Volatile Memory Express (NVMe) path status query command.
  • The NVMe path status query command is then encapsulated into the external network protocol to obtain an external network protocol path status query command.
  • The path status query command sent in this step is the external network protocol path status query command.
  • The NVMe path status query command is a command defined in the existing NVMe protocol.
  • It is defined based on the Get Log Page command (Command Dword) in the NVMe protocol.
  • the NVMe path status query command is shown in FIG. 6.
  • The command identifier 0Ch of the path status query command is carried in the log page identifier (LID) field, which is located at bytes 07:00 of the path status query command.
  • Other fields in the command are not related to the embodiment of the present invention, so they are not described in detail here. In the following description, only the fields related to the embodiment of the present invention are described.
  • The external network protocol may be Fibre Channel (FC), InfiniBand, RoCE (RDMA over Converged Ethernet), iWARP (RDMA over TCP), or the high-speed serial computer expansion bus standard Peripheral Component Interconnect Express (PCIe).
  • In step S509, the control node that receives the path status query command obtains the path status and reports it to the multipath software 104 of the host in status report information.
  • After receiving the path status query command, the control node first parses it to obtain the NVMe path status query command and then, as instructed by the NVMe path status query command, queries the path status of the path on which the control node is located.
  • The path states indicated by the status codes 01h to 04h are path states defined in the existing protocol. Status code 01h indicates the optimized state (that is, ANA Optimized State); after receiving status report information that includes status code 01h, the host 100 sets the path on which the status report information was reported as a preferred path. Status code 02h indicates the non-optimized state (that is, ANA Non-Optimized State); after receiving status report information that includes status code 02h, the host 100 sets the path on which the status report information was reported as a non-preferred path. Status code 03h indicates the inaccessible state (that is, ANA Inaccessible State); after receiving status report information that includes status code 03h, the host 100 sets the path on which the status report information was reported as an inaccessible path. Status code 04h indicates the failure state (that is, ANA Persistent Loss State); after receiving status report information that includes status code 04h, the host 100 sets the path on which the status report information was reported as a failed path.
  • the space dependent state (ANA LBA-Depend State) indicated by the state code 05h is a state newly added in the embodiment of the present invention. This space-dependent state indicates that the logical disk 303 is divided into a plurality of access sections. After receiving the status report information including the status code 05h, the host 100 will obtain an access interval corresponding to the control node reporting the status. The specific acquisition method will be described in detail below.
  • The control node that receives the path status query command obtains the path status by first determining whether the logical disk 303 is divided into multiple access intervals. When the logical disk 303 is not divided into multiple access intervals, the control node obtains the path status of the path on which it is located. In the storage system, the status of each path is identified in advance: for example, the preferred and non-preferred paths for accessing the logical disk 303 are set in advance, and inaccessible and failed paths are also identified in advance. When the logical disk 303 is divided into multiple access intervals, the control node determines that the path status is the space-dependent state. After obtaining the path status, the control node reports the status code of the path status to the host 100 through the status report information shown in FIG. 7.
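The decision made by the control node can be sketched as below. The enum `PathState` and the function `path_state_for` are hypothetical names; the code values 01h to 05h are those given in the text, with 05h being the state added by this embodiment rather than one defined in the existing protocol.

```python
from enum import IntEnum


class PathState(IntEnum):
    OPTIMIZED = 0x01        # preferred path
    NON_OPTIMIZED = 0x02    # non-preferred path
    INACCESSIBLE = 0x03     # inaccessible path
    PERSISTENT_LOSS = 0x04  # failed path
    LBA_DEPENDENT = 0x05    # space-dependent state added by this embodiment


def path_state_for(disk_is_partitioned: bool, preset_state: PathState) -> PathState:
    """Return the state a control node reports for its path.

    preset_state is the state identified in advance for the path; it is only
    used when the logical disk is not divided into access intervals.
    """
    if disk_is_partitioned:
        return PathState.LBA_DEPENDENT
    return preset_state
```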
  • FIG. 7 shows the format of response information of an NVMe path query command in the existing NVMeoF protocol.
  • The response information includes a disk count field, which carries the number of logical disks that the host 100 can access and is located at bytes 07:04 of the report information. In the embodiment of the present invention, only one logical disk is taken as an example, so 1 is filled in here.
  • The status report information further includes a path status field, which carries the status of the path and is located at byte 16 of the report information.
  • When the logical disk is not divided into access intervals, one of the codes 01h to 04h, which indicate the state of the path, is filled into byte 16. For example, the control node may carry the status code 01h, corresponding to the preferred state, in byte 16.
  • When the logical disk is divided into access intervals, the code 05h, which indicates the space-dependent state, is carried in byte 16 of the status report information.
  • the generated response information of the NVMe path query command is encapsulated in the external network protocol to obtain an external network protocol status response message, and the external network protocol status response message is reported to the host as reporting information.
  • Step S510: After receiving the report information, the multi-path software 104 of the host determines whether the path status in the report information indicates that the logical disk 303 includes multiple access intervals.
  • When the multipath software 104 of the host receives the report information, it parses the report information to obtain the response information of the NVMe path status query command encapsulated in it, and then obtains the status code indicating the path status from the response information. If the status code is 05h, the multipath software 104 determines that the logical disk 303 includes multiple access intervals. If the status code is one of 01h to 04h, the multipath software 104 determines that the logical disk is not divided into multiple access intervals.
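The field positions described above (disk count at bytes 07:04, path status at byte 16) and the host-side check can be sketched as follows. The function names are hypothetical, the little-endian byte order is an assumption based on NVMe conventions, and the 17-byte buffer is the minimum needed for this illustration rather than the real log page size; the external network protocol encapsulation is omitted.

```python
import struct

DISK_COUNT_OFFSET = 4   # disk count field occupies bytes 07:04
PATH_STATE_OFFSET = 16  # path status field occupies byte 16
LBA_DEPENDENT = 0x05    # space-dependent state code from the text


def build_status_report(disk_count: int, state_code: int) -> bytes:
    """Control node side: fill the two fields discussed in the text."""
    buf = bytearray(PATH_STATE_OFFSET + 1)
    struct.pack_into("<I", buf, DISK_COUNT_OFFSET, disk_count)  # little-endian assumed
    buf[PATH_STATE_OFFSET] = state_code
    return bytes(buf)


def parse_status_report(report: bytes) -> tuple[int, int]:
    """Host side: return (disk_count, state_code)."""
    (disk_count,) = struct.unpack_from("<I", report, DISK_COUNT_OFFSET)
    return disk_count, report[PATH_STATE_OFFSET]


def needs_interval_query(report: bytes) -> bool:
    """True if the reported state says the logical disk has access intervals."""
    _, state = parse_status_report(report)
    if state == LBA_DEPENDENT:
        return True
    if 0x01 <= state <= 0x04:
        return False
    raise ValueError(f"unexpected path state code {state:#04x}")
```

Rejecting codes outside 01h to 05h is a design choice of this sketch; the text does not say how the host treats other values.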
  • In step S511, when the multi-path software determines that the logical disk is not divided into multiple access intervals, it records the status code as the path status of the path on which the status report information was reported, for the multi-path software 104 to reference when selecting a path for I/O.
  • In step S512, when the multi-path software 104 determines that the logical disk is divided into multiple access intervals, it sends an interval query command to the storage system 300. The interval query command instructs a control node in the storage system to report the access interval allocated to each control node.
  • Before sending the interval query command, the multi-path software first generates an NVMe interval query command, which is also defined based on the Get Log Page command (Command Dword) in the NVMe protocol.
  • The NVMe interval query command does not exist in the existing NVMe protocol.
  • It is newly defined based on the Get Log Page command; that is, the interval query command is a new subcommand added to the NVMeoF protocol.
  • The command format of the NVMe interval query command is the same as that of the NVMe path status query command, except that the LID field carries the command identifier of the NVMe interval query command, for example, 0Dh.
  • The NVMe interval query command is encapsulated into the external network protocol to obtain an external network protocol interval query command, which is the interval query command sent to the storage system.
  • the response information of the NVMe interval query command is defined for the NVMe interval query command.
  • The namespace identifier field of the response information of the NVMe interval query command carries the disk code of the logical disk, for example, Namespace1, and is located at bytes 03:00 of the response information. The number-of-access-intervals field carries the number of access intervals carried in the interval reporting information and is located at bytes 11:08 of the response information. In the embodiment of the present invention, since each control node accesses only one logical disk, the number of access intervals is one.
  • If the host 100 can access logical disks other than the logical disk 303 through the control node, the number of access intervals filled in here may differ from one.
  • The response information of the NVMe interval query command further includes an access interval description field, which carries the access interval information of the access interval. The field occupies bytes 12:27 of the response information, for example, user data segment descriptor 1.
  • Multiple access interval description fields may be added to the response information. For example, the information of each access interval is filled in sequentially in the bytes after byte 27. The description of each access interval is shown in FIGS. 12 and 13.
  • As described above, there are two ways to divide the access intervals: the continuous address space division shown in FIG. 3a and the discontinuous address space division shown in FIG. 3b.
  • When the access interval is a continuous address space, the access interval description information includes a first address field, which carries the first address of the access interval, for example, at bytes 07:00 of the access interval description information. It also includes an interval length field, which carries the length of the access interval, for example, at bytes 11:08 of the interval description information. Because the address space is continuous and there are no sub-access intervals, bytes 15:12 are 0.
  • When the access interval is a discontinuous address space, bytes 07:00 indicate the first address of the first sub-access interval, bytes 11:08 indicate the length of each sub-access interval, and bytes 15:12 indicate the interval between two adjacent sub-access intervals.
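The 16-byte access interval description can be sketched with Python's `struct` module. `IntervalDescriptor` is a hypothetical name, and the little-endian byte order and unsigned field types are assumptions; only the three byte ranges (07:00, 11:08, 15:12) and their meanings come from the text.

```python
import struct
from typing import NamedTuple

# Bytes 07:00 (8 bytes), 11:08 (4 bytes), 15:12 (4 bytes); little-endian assumed.
DESCRIPTOR = struct.Struct("<QII")


class IntervalDescriptor(NamedTuple):
    first_address: int  # first address of the interval or of the first sub-access interval
    length: int         # length of the interval or of each sub-access interval
    gap: int            # interval between adjacent sub-access intervals; 0 if continuous

    def pack(self) -> bytes:
        """Serialize to the 16-byte description layout."""
        return DESCRIPTOR.pack(self.first_address, self.length, self.gap)

    @classmethod
    def unpack(cls, raw: bytes) -> "IntervalDescriptor":
        """Parse a 16-byte description back into its three fields."""
        return cls(*DESCRIPTOR.unpack(raw))

    @property
    def is_continuous(self) -> bool:
        return self.gap == 0
```

Using a zero in bytes 15:12 to mark a continuous interval lets one layout serve both division methods, so the host needs no separate flag to tell them apart.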
  • each control node of the storage system receives the interval query command, and then generates interval reporting information, and reports the interval reporting information to the host 100.
  • The control node that receives the interval query command parses it to obtain the NVMe interval query command. Then, following the predefined format of the response information of the NVMe interval query command, the control node carries its own access interval in the response information, encapsulates the response information into the external network protocol to form an external network protocol interval response message, and reports that message to the host 100 as the interval reporting information.
  • In step S514, the multi-path software 104 of the host obtains the access interval information from the interval reporting information and records it against the control node that reported the interval reporting information. In this way, when an I/O request is received, it can be allocated based on this information.
  • To obtain the access interval information, the multi-path software 104 of the host first parses the interval reporting information to obtain the response information of the NVMe interval query command, and then obtains the access interval information from that response information.
  • FIG. 14 is a flowchart of assigning the control node that executes an I/O request when the host 100 receives the I/O request.
  • Step 1301 The host 100 receives an I / O request, and the I / O request carries a logical address of data to be accessed;
  • Step 1302 The host 100 determines an access interval where the logical address is located.
  • Step 1303 The host 100 determines a control node that processes the I / O request according to the determined access interval.
  • step 1304 the host 100 sends the I / O request to the control node, and the control node processes the I / O request.
  • In this way, an I/O request issued by the host is processed by the control node corresponding to the access interval in which the logical address of the data to be accessed is located. This avoids the situation in which I/O requests with the same logical address are issued to different control nodes and then have to be forwarded, and therefore reduces the latency of I/O requests.
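The steps above can be sketched for the continuous division of FIG. 3a. `IntervalRouter` is a hypothetical name, interval bounds are treated as inclusive, and a sorted list with binary search is just one possible way to hold the recorded mapping between control nodes and access intervals.

```python
from bisect import bisect_right


class IntervalRouter:
    """Host-side table of continuous access intervals, one per control node."""

    def __init__(self) -> None:
        self._starts: list[int] = []                     # sorted interval start addresses
        self._entries: list[tuple[int, int, str]] = []   # (start, end inclusive, node ID)

    def record(self, node_id: str, start: int, end: int) -> None:
        """Record the access interval reported by a control node."""
        pos = bisect_right(self._starts, start)
        self._starts.insert(pos, start)
        self._entries.insert(pos, (start, end, node_id))

    def node_for(self, logical_address: int) -> str:
        """Find the control node whose access interval contains the address."""
        pos = bisect_right(self._starts, logical_address) - 1
        if pos >= 0:
            _start, end, node_id = self._entries[pos]
            if logical_address <= end:
                return node_id
        raise LookupError(f"no access interval contains address {logical_address}")
```

With the intervals of FIG. 3a recorded (Node1: 0 to 99, Node2: 100 to 199, Node3: 200 to 399, in units of M), a request for address 150 would be sent directly to Node2, so no control node needs to forward it.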
  • FIGS. 15 and 16 are block diagrams of a host and of any control node in a storage system, respectively, provided in the embodiments of the present invention.
  • The host includes a path management module 1501, a path status query module 1502, an interval query module 1503, an interval recording module 1504, an interval determination module 1505, and an I/O delivery module 1506.
  • The control node includes a disk encoding reporting module 1601, a disk identification reporting module 1602, a path status reporting module 1603, and an interval reporting module 1604.
  • The path management module 1501 of the host is configured to send a disk report command to the storage system 300 through the multiple paths between the host 100 and the storage system 300 when the operating system 105 of the host 100 detects that the storage system 300 is connected to the host 100.
  • After receiving the disk report command, the disk encoding reporting module 1601 of the control node of the storage system 300 obtains the host port identifier from the disk report command, determines the disk code corresponding to the host port identifier according to the mapping relationship between host port identifiers and disk codes of logical disks, generates report information for the logical disk corresponding to the disk code, carries the disk code in the report information, and reports the report information to the host 100.
  • After receiving the report information reported by each control node 302 of the storage system 300, the path management module 1501 of the host sends a disk identifier query command through the path indicated by the path information in each piece of report information, where the query command includes the disk code in the report information.
  • After receiving the query command, the disk identification reporting module 1602 of the control node of the storage system 300 obtains the disk identifier corresponding to the disk code in the query command and reports it through the path on which the query command was sent.
  • The path management module 1501 of the host determines the paths for accessing the logical disk corresponding to the disk ID according to the reported disk ID, and manages the determined paths.
  • The functions performed by the path management module 1501 of the host are the same as those performed in steps S501, S504, and S507 in FIG. 5.
  • The functions performed by the disk encoding reporting module 1601 of the control node are the same as steps S502 and S503 in FIG. 5, and the functions performed by the disk identification reporting module 1602 of the control node are the same as steps S505 and S506 in FIG. 5. For details, refer to the description of the corresponding steps in FIG. 5.
  • After the access paths are established for the logical disk, the path status query module 1502 of the host sends a path status query command to the storage system 300. The status query command instructs each control node to report the path status of the path on which the control node is located.
  • the functions performed by the path state query module 1502 are the same as the functions performed by step S508. Therefore, for specific implementation details of the path state query module 1502, refer to the related description of step S508.
  • When receiving the path status query command, the path status reporting module 1603 of the control node reports the path status to the host by carrying it in the status report information. For specific implementation details of the path status reporting module 1603, refer to the related description of step S509.
  • After receiving the report information, the interval query module 1503 of the host determines whether the path status in the report information indicates that the logical disk 303 includes multiple access intervals. When it determines that the logical disk is not divided into multiple access intervals, the status code is recorded as the path status of the path on which the status report information was reported, for reference by the multipath software 104 when selecting a path for I/O. When it determines that the logical disk is divided into multiple access intervals, an interval query command is sent to the storage system 300; the interval query command instructs a control node in the storage system to report the access interval allocated to each control node. For specific implementation details of the interval query module 1503, refer to the related description of steps S510 to S512.
  • The interval reporting module 1604 of the control node receives the interval query command, generates interval reporting information, and reports the interval reporting information to the host 100. For specific implementation details of the interval reporting module 1604, refer to the related description of step S513.
  • After receiving the interval reporting information, the interval recording module 1504 of the host obtains the access interval information from the interval reporting information and records it against the control node that reported the interval reporting information. Thus, when an I/O request is received, it can be allocated based on this information.
  • the interval determination module 1505 of the host is configured to receive an I / O request, where the I / O request carries a logical address of data to be accessed, and determines an access interval in which the logical address is located.
  • the I / O issuing module 1506 is configured to determine a control node that processes the I / O request according to the determined access interval, and then send the I / O request to the control node, and the control The node processes the I / O request.

Abstract

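An embodiment of the present invention provides a data processing method performed by a host. The host communicates with a storage system over the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through a control node in the storage system. The method includes: sending a state query command to the control node, where the state query command instructs the control node to report the path state of the path on which the control node is located; receiving the path state reported by the control node; when the received path state indicates that the logical disk includes access intervals, sending an interval query command to the control node; receiving access interval information reported by the control node, where the access interval indicated by the access interval information has been pre-allocated to the control node; and recording the mapping between the control node and the access interval information.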

Description

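Method and Device for Processing I/O Requests
Technical Field
This application relates to the storage field, and in particular to a method and device for processing I/O requests.
Background
In existing storage architectures based on the NVMe over Fabrics (NVMeoF) protocol, a storage system includes multiple control nodes and a logical disk (for example, a namespace). A host accesses one logical disk through multiple control nodes, so multiple paths exist between the host and the logical disk. The host selects a path by polling and sends the I/O request to the storage system over that path. The control node that receives the I/O request applies a given algorithm to the logical address carried in the request to compute which control node should execute the request. For example, the host sends an I/O request to a first control node over the first of the multiple paths; the first control node applies the preset algorithm to the logical address in the request and determines that the request should be executed by a second control node, so the first control node forwards the request to the second control node. Forwarding I/O requests increases I/O latency.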
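Summary
Embodiments of the present invention provide an I/O request processing method, a device, and a host, which set an access interval for each control node of a storage system.
A first aspect of the embodiments of the present invention provides a data processing method performed by a host. The host communicates with a storage system over the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through a control node in the storage system. The method includes: the host sends a state query command to the control node, where the state query command instructs the control node to report the path state of the path on which the control node is located, and then receives the path state reported by the control node. When the received path state indicates that the logical disk includes access intervals, the host sends an interval query command to the control node, receives the access interval information reported by the control node, and records the mapping between the control node and the access interval information. The access interval indicated by the access interval information has been pre-allocated to the control node.
The access interval that the storage system has set for each control node is obtained through the NVMeoF state query command and a newly added interval query command. When an I/O request is subsequently received, it can be sent to the control node that corresponds to the access interval into which the logical address of the request falls, which avoids I/O forwarding.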
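In a possible design, the method further includes: receiving an I/O request that carries the logical address of the data to be accessed; determining the access interval into which the logical address falls; determining, according to the mapping, the control node corresponding to that access interval; and sending the I/O request to the control node corresponding to that access interval.
Because the host records the mapping between the control node and the access interval information, when an I/O request is received it can be sent to the control node that corresponds to the access interval into which the logical address of the request falls, which avoids I/O forwarding.
In a possible design, the path state indicating that the logical disk includes access intervals is reported to the host by the control node when the control node determines that the logical disk includes access intervals.
A path state indicating that the logical disk includes access intervals is added to the NVMe protocol, so the host can use this path state to decide whether to query the access intervals.
In a possible design, the access interval query command is defined based on the Get Log Page - Command Dword 10 command in the NVMeoF protocol, and the command identifier of the access interval query command is carried in the Log Page Identifier field of the command.
Because the interval query command is defined on top of a command that the existing NVMeoF protocol already defines, the existing NVMeoF protocol does not need to be changed.
In a possible design, the access interval indicated by the access interval information is a contiguous address space.
In a possible design, the access interval information includes the start address of the access interval and the length of the access interval.
In a possible design, the access interval corresponding to the control node includes a first access sub-interval and a second access sub-interval, with a gap between the first access sub-interval and the second access sub-interval.
In a possible design, the first access sub-interval and the second access sub-interval are adjacent, and the access interval information includes the start address of the access interval, the length of each access sub-interval, and the gap between two adjacent access sub-intervals.
In a possible design, the access interval information is reported to the host through response information defined for the access interval query command, and is carried in the interval description field of the response information.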
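A second aspect of the present invention provides a data processing method performed by a control node of a storage system. The storage system communicates with a host over the NVMeoF protocol, and the storage system includes a logical disk. The method includes: receiving a state query command sent by the host, where the state query command instructs the control node to report the path state of the path on which the control node is located; when the logical disk includes access intervals, reporting to the host a path state indicating that the logical disk includes access intervals; receiving an access interval query command that the host sends according to the path state; and, according to the interval query command, reporting to the host the access interval information of the access interval allocated to the control node, so that the host records the mapping between the access interval information and the control node.
The access interval that the storage system has set for each control node is obtained through the NVMeoF state query command and a newly added interval query command. When an I/O request is subsequently received, it can be sent to the control node that corresponds to the access interval into which the logical address of the request falls, which avoids I/O forwarding.
The possible designs of the second aspect are essentially the same as those of the first aspect and are not repeated here.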
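A third aspect of the embodiments of the present invention provides a data processing method performed by a host. The host is connected to a storage system that includes multiple control nodes, and the host accesses a logical disk in the storage system through the multiple control nodes. The method includes: receiving an I/O request that carries the logical address of the data to be accessed; determining the access interval to which the logical address belongs, where the logical disk includes multiple access intervals and the host records the mapping between each access interval and a control node; and sending the I/O request to the control node corresponding to the determined access interval.
Because an access interval is set for each control node of the storage system, when an I/O request is received it can be sent to the control node that corresponds to the access interval into which the logical address of the request falls, which avoids I/O forwarding.
In a possible design, the method further includes: sending a state query command to the multiple control nodes, where the state query command instructs each of the control nodes to report the path state of the path on which it is located;
receiving the path states reported by the multiple control nodes; when a received path state indicates that the logical disk includes access intervals, sending an interval query command to the control node that reported the path state; receiving the access interval information reported by that control node; and recording the mapping between that control node and the access interval information.
The access interval that the storage system has set for each control node is obtained through the NVMeoF state query command and a newly added interval query command. The host can then send each I/O request to the control node that corresponds to the access interval into which the logical address of the request falls, which avoids I/O forwarding.
In a possible design, the path state indicating that the logical disk includes access intervals is reported to the host by the control node that received the state query command, when that control node determines that the logical disk includes access intervals.
A path state indicating that the logical disk includes access intervals is added to the NVMe protocol, so the host can use this path state to decide whether to query the access intervals.
The other possible designs of this aspect are the same as those provided in the first aspect and are not repeated here.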
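A fourth aspect of the embodiments of the present invention provides a data processing method performed by a host, where the host communicates with a storage system over an external network. The method includes: encapsulating a Non-Volatile Memory Express (NVMe) interval query command in the external network protocol to obtain an external network protocol interval query command, where the NVMe interval query command is used to query the access interval allocated to a controller of the storage system and the access interval belongs to a namespace of the storage system; sending the external network protocol interval query command to the control node; receiving an external network protocol interval response message sent by the control node, where the message contains an NVMe interval query command response and the response contains the access interval information of the namespace; parsing the external network protocol interval response message to obtain the NVMe interval query command response information; and obtaining and recording the access interval information of the control node from the NVMe interval query command response.
The access interval that the storage system has set for each control node is obtained through the interval query command newly added to the NVMeoF protocol. When an I/O request is subsequently received, it can be sent to the control node that corresponds to the access interval into which the logical address of the request falls, which avoids I/O forwarding.
In a possible implementation, before the NVMe interval query command is encapsulated in the external network protocol to obtain the external network protocol interval query command, the method further includes:
encapsulating an NVMe state query command in the external network protocol to obtain an external network protocol state query command, where the NVMe state query command is used to query the path state of the path on which the control node is located;
sending the external network protocol state query command to the control node;
receiving an external network protocol state response message sent by the control node, where the message contains an NVMe state query command response, the response contains path state information, and the path state information indicates that the namespace includes access intervals;
parsing the external network protocol state response message to obtain the NVMe state query command response information.
Through the NVMe path state query command, the path state information can be reported to the host, and the host can determine from it whether the logical disk has been divided into access intervals.
The other possible implementations of this aspect are the same as those of the first aspect and are not repeated here.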
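A fifth aspect of the embodiments of the present invention provides a data processing method performed by a control node of a storage system, where a host communicates with the storage system over an external network. The method includes: receiving an external network protocol interval query command sent by the host, and parsing it to obtain an NVMe interval query command; generating response information for the NVMe interval query command, where the response information includes the access interval information corresponding to the control node; encapsulating the response information of the NVMe interval query command in the external network protocol to obtain an external network protocol interval response message; and reporting the external network protocol interval response message to the host.
The access interval that the storage system has set for each control node is obtained through the interval query command newly added to the NVMeoF protocol. When an I/O request is subsequently received, it can be sent to the control node that corresponds to the access interval into which the logical address of the request falls, which avoids I/O forwarding.
In a possible design, before the external network protocol interval query command sent by the host is received, the method further includes: receiving an external network protocol state query command and parsing it to obtain an NVMe state query command, where the NVMe state query command instructs the control node to report the path state of the path on which the control node is located; generating response information for the NVMe state query command and, when the namespace includes access intervals, carrying in that response information a path state indicating that the namespace includes access intervals; and encapsulating the response information of the NVMe state query command to obtain the response information of the external network protocol state query command, and reporting it to the host.
Through the NVMe path state query command, the path state information can be reported to the host, and the host can determine from it whether the logical disk has been divided into access intervals.
The other possible implementations of this aspect are the same as those of the first aspect and are not repeated here.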
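In a sixth aspect, an embodiment of the present invention further provides a host. The host communicates with a storage system over the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through a control node in the storage system. The host further includes units or means for performing the steps of the first aspect.
In a seventh aspect, an embodiment of the present invention further provides a control node of a storage system. A host communicates with the storage system over the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through the control node in the storage system. The control node further includes units or means for performing the steps of the second aspect.
In an eighth aspect, an embodiment of the present invention further provides a host. The host communicates with a storage system over the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through a control node in the storage system. The host includes units or means for performing the steps of the third aspect.
In a ninth aspect, an embodiment of the present invention further provides a host that communicates with a storage system over an external network. The host includes units or means for performing the steps of the fourth aspect.
In a tenth aspect, an embodiment of the present invention further provides a control node of a storage system, where a host communicates with the storage system over an external network. The control node includes units or means for performing the steps of the fifth aspect.
In an eleventh aspect, an embodiment of the present invention further provides a host. The host communicates with a storage system over the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through a control node in the storage system. The host includes a memory and a processor. The memory stores programs and data, and the processor runs the programs stored in the memory and, according to the data stored in the memory, performs the methods provided in the first, third, or fourth aspect.
In a twelfth aspect, an embodiment of the present invention further provides a control node of a storage system. A host communicates with the storage system over the NVMeoF protocol, the storage system includes a logical disk, and the host accesses the logical disk through the control node in the storage system. The control node includes a memory and a processor. The memory stores programs and data, and the processor runs the programs stored in the memory and, according to the data stored in the memory, performs the methods provided in the second or fifth aspect.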
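Brief Description of the Drawings
Figure 1 is an architecture diagram of a system to which an embodiment of the present invention is applied.
Figure 2 is a schematic diagram of the N paths over which the host accesses the logical disk of the storage system in an embodiment of the present invention.
Figures 3a and 3b are schematic diagrams in which the address space of the logical disk in the storage system is divided into N contiguous address spaces and N non-contiguous address spaces, respectively.
Figure 4 is an architecture diagram of the host in an embodiment of the present invention.
Figure 5 is a flowchart of a method for managing the multiple paths that access the logical disk in an embodiment of the present invention.
Figure 6 is a schematic diagram of the path state query command in an embodiment of the present invention.
Figure 7 is a schematic diagram of the state reporting information in an embodiment of the present invention.
Figure 8 is a schematic diagram of the path states defined in the NVMeoF protocol in an embodiment of the present invention.
Figure 9 is a schematic diagram in which the space-dependent state is carried in the state reporting information in an embodiment of the present invention.
Figure 10 is a schematic diagram of the interval query command in an embodiment of the present invention.
Figure 11 is a schematic diagram of the interval reporting information in an embodiment of the present invention.
Figures 12 and 13 are schematic diagrams in which the interval reporting information carries the interval description information of an access interval with a contiguous address space and of an access interval with a non-contiguous address space, respectively.
Figure 14 is a flowchart of allocating the control node that executes an I/O request in an embodiment of the present invention.
Figure 15 is a module diagram of the host provided in an embodiment of the present invention.
Figure 16 is a module diagram of any control node in the storage system provided in an embodiment of the present invention.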
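Detailed Description
Figure 1 is an architecture diagram of the system to which an embodiment of the present invention is applied. The host 100 is connected to two switches 200 through two host ports 101, for example host bus adapters (HBAs). The two switches 200 are each connected to the storage system 300. Two switches 200 and two host ports 101 are provided so that a failure of one of the switches or one of the host ports does not break the connection.
The storage system 300 includes multiple control nodes 302, for example Node 1 to Node N. Each control node 302 is connected to the two switches 200 through two storage ports 301 (for example, HBA cards). There are therefore two paths between the host and each control node 302, and the two paths are redundant to each other: if one path fails, the other can be used to transfer data between the host 100 and the control node 302. The storage system 300 includes a storage device 304 composed of multiple SSDs. The storage device 304 may be a Redundant Array of Independent Disks (RAID) or a flash cluster (Just a Bunch of Flash). In some embodiments, the storage device 304 further includes virtual SSDs, which are other storage devices that communicate with the storage device 304 over the NVMeoF (Non-Volatile Memory express over Fabric) protocol and are mapped to the storage device 304. Mapping a remote SSD to a local storage device over NVMeoF so that it acts as a virtual SSD of the local storage device is prior art and is not described further here. The host 100 and the storage system 300 communicate over the NVMeoF protocol.
A logical disk 303 can be built from the local SSDs and/or virtual SSDs in the storage device 304. The logical disk 303 may be a namespace, which is how the NVMeoF protocol represents a logical disk. Figure 1 shows only one logical disk 303, but in practice multiple logical disks can be built in the storage device 304. One logical disk can be allocated to multiple hosts, or multiple logical disks can be allocated to one host.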
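In this embodiment, as shown in Figure 1, the host 100 can access the logical disk 303 allocated to it through N control nodes 302. To simplify the description of the paths between the host 100 and the logical disk 303, Figure 2 omits the physical links between the host 100 and the logical disk shown in Figure 1 and merges the two redundant paths through the same control node into one, which yields the N paths over which the host 100 accesses the logical disk 303.
In the storage system 300, the logical disk 303 can be built and managed by any control node 302 that the user designates. Here, the designated control node is assumed to be Node 1. After the logical disk 303 is created, Node 1 assigns it a disk identifier (ID) and a disk code. The disk identifier uniquely identifies the logical disk; for example, it includes a Globally Unique Identifier (GUID) used in the NVMeoF protocol, vendor information, and product information. The disk code distinguishes the different logical disks built in the storage device 304; for example, the disk code of the logical disk 303 can be written as Namespace1.
After the disk identifier and disk code of the logical disk 303 are assigned, Node 1 allocates a virtual host to the logical disk 303, for example virtual host 1, and records the mapping between the logical disk 303 and the virtual host: virtual host 1: Namespace 1.
After the disk identifier and disk code of the logical disk 303 are assigned, Node 1 can also set the control nodes that access the logical disk 303, for example N control nodes. To distinguish the control nodes 302, the storage system 300 assigns each control node 302 a unique identifier, the node identifier. The control nodes set for the logical disk 303 can be expressed by the following mapping: Namespace 1: Node 1, Node 2, ..., Node N.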
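In this embodiment, Node 1 can further set, for each control node 302, the access interval that the node may access. Figure 3a shows Node 1 dividing the address space of the logical disk 303 into N contiguous address spaces. For example, if the storage space of the logical disk 303 is 1G, the interval assigned to Node 1 is 0~99M, the interval assigned to Node 2 is 100M~199M, the interval assigned to Node 3 is 200~399M, and so on, and the interval assigned to Node N is 800~1023M. Each access interval is a contiguous address space, and the access intervals may be equal or different in size.
In another embodiment, the access interval that Node 1 allocates to each control node is non-contiguous. As shown in Figure 3b, the address space of the logical disk 303 is first divided into at least two sub-address spaces of equal size, and each sub-address space is then divided into N access sub-intervals, each corresponding to one control node. Every sub-address space is divided in the same way: the access sub-intervals that correspond to the same control node have the same size and the same position in each sub-address space. The set of access sub-intervals that correspond to a control node forms that control node's access interval. For example, if the logical disk 303 is divided into 4 sub-address spaces, the access interval of Node 1 is 0~a0, d0+1~a1, d1+1~a2; the access interval of Node 2 is a0+1~b0, a1+1~b1, a2+1~b2; and the access interval of Node N is c0+1~d0, c1+1~d1, c2+1~d2. Within each node's access interval, the access sub-intervals are equal in size and equally spaced. When the access interval allocated to a control node is recorded, the record can contain the start address of the access interval, the address length of the access sub-intervals that form it, the address gap between access sub-intervals, and the number of access sub-intervals that form it. After access intervals are allocated to the N control nodes 302, the N control nodes 302 respond to requests from the host by reporting their access intervals to the host 100, and the host can issue I/O requests accordingly. The method for reporting the access intervals is described below. In this embodiment, if logical disks other than the logical disk 303 are also built on the storage device 304, those logical disks may or may not be divided into access intervals, as needed.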
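The record described above (start address, sub-interval length, gap, and sub-interval count) can be sketched as a small data structure. This is a minimal illustration and not the patent's implementation: the class name `AccessInterval`, its field names, and the reading of "gap" as the distance from the end of one sub-interval to the start of the next are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessInterval:
    """Hypothetical record of one control node's access interval."""
    start: int      # first address of the first sub-interval
    length: int     # address length of each sub-interval (assumed > 0)
    gap: int = 0    # addresses between adjacent sub-intervals (0 if contiguous)
    count: int = 1  # number of sub-intervals that form the access interval

    def contains(self, lba: int) -> bool:
        """Return True if the logical address falls inside any sub-interval."""
        if lba < self.start:
            return False
        stride = self.length + self.gap  # distance between sub-interval starts
        index, offset = divmod(lba - self.start, stride)
        return index < self.count and offset < self.length


# Three sub-intervals of 100 addresses each, 300 addresses apart:
# 0-99, 400-499, 800-899.
node1 = AccessInterval(start=0, length=100, gap=300, count=3)
```

A contiguous interval is the special case `gap=0, count=1`, so the same membership check covers both layouts of Figures 3a and 3b.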
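As shown in Figure 1, when the N control nodes are connected to the host ports 101 through the storage ports 301 and the switches 200, each control node 302 can obtain the identifiers of the host ports 101 it is connected to. It then replaces the mapping between the virtual host and the disk code of the logical disk 303 with a mapping between the host port identifiers and the disk code of the logical disk. The replaced mapping can be expressed as: HBA1, HBA2 (virtual host 1): Namespace 1. In this embodiment, the host 100 is connected to the storage system 300 through two host ports 101 for redundancy, so the mapping records the identifiers of two host ports, HBA1 and HBA2.
The mappings generated above, such as the mapping between the disk code of the logical disk and the node identifiers of the N control nodes 302, the mapping between the host port identifiers and the disk code of the logical disk, and the mapping between each control node identifier and its access interval, are stored in a storage area that all N control nodes 302 can access.
Figure 4 is an architecture diagram of the host 100 in this embodiment. In addition to the two host ports 101, the host 100 includes a processor 102 and a memory 103. The memory 103 stores a multipath program 104 and an operating system 105. By executing the operating system 105 and the multipath program 104, the processor 102 manages the multiple paths over which the host 100 accesses the logical disk 303. This management is described below with reference to the flowchart in Figure 5.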
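Step S501: When the operating system 105 of the host 100 detects that the storage system 300 is connected to the host 100, the operating system 105 sends a disk report command to the storage system 300 over each of the multiple paths between the host 100 and the storage system 300 (that is, the paths formed between the storage ports 301 of the N control nodes and the host ports 101). When a disk report command passes through a host port 101, the host port identifier is added to the command.
Step S502: After receiving the disk report command, each control node 302 of the storage system 300 obtains the host port identifier from the command and determines the corresponding disk code from the host port identifier and the mapping between host port identifiers and disk codes of logical disks.
Step S503: Each control node 302 of the storage system 300 generates reporting information for the logical disk that corresponds to the disk code, carries the disk code in the reporting information, and reports it to the host 100. Node 1 is used as an example. Node 1 is connected to the two HBA cards of the host through its own two HBA cards, so there are two paths between Node 1 and the host. After Node 1 receives one disk report command through each of its two HBA cards, it obtains the two HBA card identifiers from the two commands and uses them to obtain the same disk code. It then reports the disk code to the host 100 over each of the two paths between Node 1 and the host 100.
As specified by the NVMeoF protocol, while the reporting information is being reported, the identifier of the control node that reports it, the storage port 301 it passes through (for example, one of the two HBA ports of Node 1), and the host port 101 it passes through are added to it. The controller identifier, storage port, host port, and disk code in the reporting information therefore identify the path over which the disk code was reported.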
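Step S504: After receiving the reporting information from the control nodes 302 of the storage system 300, the multipath program 104 of the host 100 sends a disk identifier query command over the path indicated by the path information in each piece of reporting information. The query command includes the disk code from the reporting information.
Step S505: After receiving the query command, each control node 302 in the storage system 300 obtains the disk identifier that corresponds to the disk code in the query command.
Step S506: Each control node 302 in the storage system 300 reports the disk identifier that corresponds to the disk code over the path on which the query command was sent. Because the storage system 300 assigned a disk identifier and a disk code to each logical disk 303 when creating it, the storage system 300 can obtain the disk identifier from the disk code once it receives the disk code.
Step S507: The multipath program 104 of the host 100 uses the reported disk identifiers to determine the paths that access the logical disk corresponding to each disk identifier, and manages the determined paths.
Specifically, the first time the multipath program 104 receives a disk identifier, it creates a disk object for it: it assigns a disk object name, establishes the mapping between the disk object name and the disk identifier, and records the path that reported the disk identifier as a path of the disk object. Other information about the disk, such as its address space and capacity, can also be recorded in the disk object. If the same disk identifier is received again later, the path that reported it is recorded as another path of the disk object.
In this embodiment, redundant paths are set up between each control node 302 and the host 100 for reliability. When the multipath program manages the paths that access each logical disk 303, it merges the redundant paths between each control node 302 and the host 100 into one. During merging, the paths of the same control node can be recorded together according to the control node identifier. When issuing an I/O request, the host selects one of the redundant paths; if that path fails, the other path replaces it. For ease of description, the merged paths are used below, that is, the description assumes only one path between each control node 302 and the host 100.
In this way, the multipath software 104 can use the disk object to manage the multiple paths that access the disk, for example to discover new paths, delete disconnected paths, and select the path for issuing I/O requests according to a preset policy.
For example, if the multipath software 104 first receives, on path 1, the disk identifier of the logical disk whose disk code is Namespace1, it creates a disk object for Namespace1, assigns the disk object name sda, establishes the mapping between sda and the disk identifier, and records path 1 as the first path of the disk object. If the disk identifier of Namespace1 is later received on path 2 and path 3, those paths are also recorded as paths of the disk object. In this embodiment, the multipath software receives 2N disk identifiers of Namespace1 on 2N paths; after the paths of the same control node are merged, the multipath software manages N paths between the host and the storage system.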
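The disk-object bookkeeping in step S507 can be sketched as follows. This is a hedged illustration only: `DiskObject`, `Multipath`, `on_disk_id_reported`, and the `sda`, `sdb`, ... naming scheme are hypothetical names chosen for the example, and the identifier strings are placeholders rather than real NVMeoF values.

```python
from collections import defaultdict


class DiskObject:
    """Hypothetical disk object: one per reported disk identifier."""

    def __init__(self, name, disk_id):
        self.name = name
        self.disk_id = disk_id
        # node identifier -> list of redundant (host port, storage port) paths
        self.paths_by_node = defaultdict(list)

    def add_path(self, node_id, host_port, storage_port):
        self.paths_by_node[node_id].append((host_port, storage_port))


class Multipath:
    """Hypothetical stand-in for the multipath program's path table."""

    def __init__(self):
        self.disks = {}  # disk identifier -> DiskObject

    def on_disk_id_reported(self, disk_id, node_id, host_port, storage_port):
        disk = self.disks.get(disk_id)
        if disk is None:
            # First report of this identifier: create the disk object.
            # The naming scheme only covers 26 disks and is illustrative.
            name = "sd" + chr(ord("a") + len(self.disks))
            disk = self.disks[disk_id] = DiskObject(name, disk_id)
        # Paths through the same control node are grouped, which is the
        # "merge redundant paths into one" step of the description.
        disk.add_path(node_id, host_port, storage_port)
        return disk


mp = Multipath()
for node in ("Node1", "Node2"):
    for hba in ("HBA1", "HBA2"):
        mp.on_disk_id_reported("guid-ns1", node, hba, node + "-port")
```

With two control nodes and two host ports, four physical paths are recorded but only two merged paths (one per control node) remain to be managed.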
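Step S508: After the access paths for the logical disk are established, the multipath software 104 sends a path state query command to the storage system 300. The state query command instructs each control node to report the path state of the path on which that control node is located.
In this embodiment, the host 100 and the storage system 300 communicate over the NVMeoF protocol. The host first generates a Non-Volatile Memory Express (NVMe) path state query command and then encapsulates that NVMe path state query command in the external network protocol to obtain an external network protocol state query command. The path state query command sent in this step is that external network protocol state query command.
The NVMe path state query command is a command defined in the existing NVMe protocol, based on Get Log Page - Command Dword 10. As shown in Figure 6, the command identifier 0Ch of the path state query command is carried in the log page identifier (LID) field, which occupies bits 07:00 of the command dword. The other fields of the command are not relevant to this embodiment and are not described here; the description below likewise covers only the fields relevant to this embodiment.
In this embodiment, the external network protocol may be a network protocol such as Fibre Channel (FC), InfiniBand, RoCE (RDMA over Converged Ethernet), iWARP (RDMA over TCP), or Peripheral Component Interconnect Express (PCIe).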
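As an illustration of where the command identifier sits, the sketch below packs a Get Log Page Command Dword 10 value with the LID in its low 8 bits. The helper names are hypothetical. Placing a 16-bit dword count (NUMDL) in bits 31:16 reflects my reading of the NVMe base specification rather than anything stated in this document, so treat it as an assumption.

```python
LID_PATH_STATE = 0x0C  # identifier the text gives for the path state query


def build_get_log_page_cdw10(lid: int, numdl: int = 0) -> int:
    """Pack Command Dword 10: LID in bits 07:00, NUMDL (assumed) in bits 31:16."""
    if not 0 <= lid <= 0xFF:
        raise ValueError("LID must fit in 8 bits")
    if not 0 <= numdl <= 0xFFFF:
        raise ValueError("NUMDL must fit in 16 bits")
    return (numdl << 16) | lid


def lid_of(cdw10: int) -> int:
    """Extract the log page identifier from Command Dword 10."""
    return cdw10 & 0xFF


cdw10 = build_get_log_page_cdw10(LID_PATH_STATE, numdl=0x3FF)
```

A receiver only needs `lid_of` to tell which log page, and therefore which query, the host is asking for.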
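Step S509: The control node that receives the path state query command obtains the path state, carries it in state reporting information, and reports it to the multipath software 104 of the host.
Because the path state query command is an external network protocol state query command, the control node that receives it first parses it to obtain the NVMe path state query command, and then, as that command instructs, queries the path state of the path on which the control node is located.
Several path states are predefined in the NVMeoF protocol, as shown in Figure 8. The path states indicated by state codes 01h to 04h are defined in the existing protocol. State code 01h indicates the optimized state (ANA Optimized State); after receiving state reporting information that includes state code 01h, the host 100 sets the path that reported it as an optimized path. State code 02h indicates the non-optimized state (ANA Non-Optimized State); after receiving state reporting information that includes state code 02h, the host 100 sets the path that reported it as a non-optimized path. State code 03h indicates the inaccessible state (ANA Inaccessible State); after receiving state reporting information that includes state code 03h, the host 100 sets the path that reported it as an inaccessible path. State code 04h indicates the persistent loss state (ANA Persistent Loss State); after receiving state reporting information that includes state code 04h, the host 100 sets the path that reported it as a failed path.
The space-dependent state (ANA LBA-Depend State) indicated by state code 05h is a state added in this embodiment. It indicates that the logical disk 303 has been divided into multiple access intervals. After receiving state reporting information that includes state code 05h, the host 100 obtains the access interval corresponding to the control node that reported the state, as described in detail below.
The control node that receives the path state query command obtains the path state as follows. It determines whether the logical disk 303 has been divided into multiple access intervals. If not, it obtains the path state of the path on which it is located. In the storage system, the state of each path is marked in advance: for example, the optimized and non-optimized paths for accessing the logical disk 303 are preset, and inaccessible and failed paths that have been identified are also marked. If the logical disk 303 has been divided into multiple access intervals, the path state is determined to be the space-dependent state. After obtaining the path state, the control node reports the state code of the path state to the host 100 through the state reporting information shown in Figure 7.
Once the path state is determined, the response information of the NVMe path query command is generated. Figure 7 shows the format of this response information in the existing NVMeoF protocol. The response information includes a disk count field, which carries the number of logical disks that the host 100 can access and occupies bytes 07:04 of the reporting information. This embodiment uses a single logical disk as an example, so the field is set to 1. The state reporting information also includes a path state field, which carries the path state and occupies byte 16 of the reporting information. If the logical disk 303 has not been divided into multiple access intervals, one of the path state codes 01h to 04h is written to byte 16; for example, if the path on which the control node is located is in the optimized state, the control node carries state code 01h in byte 16. If the logical disk 303 has been divided into multiple access intervals, then, as shown in Figure 9, the code 05h that indicates the space-dependent state is carried in byte 16 of the state reporting information.
The generated response information of the NVMe path query command is encapsulated in the external network protocol to obtain an external network protocol state response message, which is reported to the host as the reporting information.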
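The two fields named above can be read from a raw response buffer as in the sketch below. It follows only the offsets given in the text (disk count in bytes 07:04, path state in byte 16); the function name `parse_state_report` and the little-endian byte order are assumptions, and the rest of the real log page layout is ignored.

```python
import struct


def parse_state_report(payload: bytes):
    """Return (disk_count, state_code) from a state report buffer."""
    if len(payload) < 17:
        raise ValueError("state report too short")
    # Bytes 07:04: number of logical disks the host can access
    # (little-endian byte order is assumed here).
    (disk_count,) = struct.unpack_from("<I", payload, 4)
    # Byte 16: path state code (01h-04h, or 05h for the space-dependent state).
    state_code = payload[16]
    return disk_count, state_code


# Build a sample report for one logical disk in the space-dependent state.
sample = bytearray(32)
struct.pack_into("<I", sample, 4, 1)
sample[16] = 0x05
result = parse_state_report(bytes(sample))
```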
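Step S510: After receiving the reporting information, the multipath software 104 of the host determines whether the path state in the reporting information indicates that the logical disk 303 includes multiple access intervals.
After receiving the reporting information, the multipath software 104 parses it to obtain the NVMe state query command response information encapsulated in it, and then obtains from the response information the state code that indicates the path state. If the state code is 05h, the multipath software 104 determines that the logical disk 303 includes multiple access intervals. If the state code is one of 01h to 04h, the multipath software 104 determines that the logical disk has not been divided into multiple access intervals.
Step S511: When the multipath software determines that the logical disk has not been divided into multiple access intervals, it records the state code as the path state of the path that reported the state reporting information, for reference when the multipath software 104 selects a path for I/O.
Step S512: When the multipath software 104 determines that the logical disk has been divided into multiple access intervals, it sends an interval query command to the storage system 300. The interval query command instructs the control nodes in the storage system to report the access interval allocated to each control node.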
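The branch in steps S510 to S512 can be sketched as a small dispatcher. The state codes follow the text (01h to 04h from the existing protocol, 05h added by the embodiment); the names `PathState`, `handle_state_report`, and the `send_interval_query` callback are hypothetical and stand in for whatever the multipath software actually uses.

```python
from enum import IntEnum


class PathState(IntEnum):
    OPTIMIZED = 0x01
    NON_OPTIMIZED = 0x02
    INACCESSIBLE = 0x03
    PERSISTENT_LOSS = 0x04
    LBA_DEPENDENT = 0x05  # space-dependent state added by the embodiment


def handle_state_report(path_id, state_code, path_states, send_interval_query):
    """Apply S510-S512 to one report; return True if an interval query was sent."""
    state = PathState(state_code)  # raises ValueError for unknown codes
    if state is PathState.LBA_DEPENDENT:
        # S512: the logical disk is divided, so ask for the access interval.
        send_interval_query(path_id)
        return True
    # S511: not divided, so record the state for later path selection.
    path_states[path_id] = state
    return False


path_states = {}
queries = []
first = handle_state_report("path1", 0x01, path_states, queries.append)
second = handle_state_report("path2", 0x05, path_states, queries.append)
```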
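Before sending the interval query command, the multipath software first generates an NVMe interval query command. Like the path state query command, it is based on Get Log Page - Command Dword 10 in the NVMe protocol, but it does not exist in the existing NVMe protocol: it is a new command defined on top of Get Log Page - Command Dword 10, and is therefore a sub-command added to the NVMeoF protocol. As shown in Figure 10, the NVMe interval query command has the same format as the NVMe path state query command, except that its LID field carries the command identifier of the NVMe interval query command, for example 0Dh. After the NVMe interval query command is generated, it is encapsulated in the external network protocol to obtain the external network protocol interval query command, which is the interval query command.
In the storage system 300, response information is defined for the NVMe interval query command, as shown in Figure 11. The namespace identifier field of the response information carries the disk code of the logical disk, for example Namespace1, and occupies bytes 03:00 of the response information. The access interval count field carries the number of access intervals carried in the interval reporting information and occupies bytes 11:08 of the response information. In this embodiment, each control node accesses only one logical disk, so the number of access intervals is 1. In other embodiments, the host 100 may access logical disks other than the logical disk 303 through the control node, in which case the number of access intervals may differ from 1. The response information also includes an access interval description field, which carries the access interval information of an access interval and occupies bytes 27:12 of the response information, for example user data segment descriptor 1. When the interval reporting information carries multiple access intervals, multiple access interval description fields can be added to the response information; for example, the information for each additional access interval is written in turn into the bytes after byte 27. The description of each access interval is shown in Figures 12 and 13.
As described with reference to Figures 3a and 3b, access intervals can be divided in two ways: the contiguous address space division shown in Figure 3a, and the non-contiguous address space division shown in Figure 3b.
If the access intervals use the contiguous division, then, as shown in Figure 12, the access interval description information includes an interval start address field, which carries the start address of the access interval, for example in bytes 07:00 of the description information, and an interval length field, which carries the length of the access interval, for example in bytes 11:08 of the description information. Because the address space is contiguous and there are no access sub-intervals, bytes 15:12, which carry the gap between access sub-intervals, are 0. If the access intervals use the non-contiguous division, the addresses of each access interval are not contiguous; in that case, as shown in Figure 13, bytes 07:00 give the start address of the first access sub-interval, bytes 11:08 give the length of each access sub-interval, and bytes 15:12 give the gap between two adjacent access sub-intervals.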
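The response layout described above can be decoded as in the following sketch. It uses only the offsets given in the text: namespace field in bytes 03:00, interval count in bytes 11:08, and 16-byte descriptors starting at byte 12, each holding a start address (bytes 07:00), a length (bytes 11:08), and a gap (bytes 15:12). The names `HEADER`, `DESCRIPTOR`, and `parse_interval_report`, the little-endian byte order, and skipping bytes 07:04 are assumptions for the example.

```python
import struct

# Bytes 03:00 namespace code, bytes 07:04 skipped, bytes 11:08 descriptor count.
HEADER = struct.Struct("<I4xI")
# Per descriptor: start address (8 bytes), length (4 bytes), gap (4 bytes).
DESCRIPTOR = struct.Struct("<QII")


def parse_interval_report(payload: bytes):
    """Return (namespace_code, [(start, length, gap), ...])."""
    namespace_code, count = HEADER.unpack_from(payload, 0)
    intervals = []
    for i in range(count):
        offset = HEADER.size + i * DESCRIPTOR.size
        intervals.append(DESCRIPTOR.unpack_from(payload, offset))
    return namespace_code, intervals


# One contiguous interval (gap 0) and one strided interval for illustration.
contiguous_report = HEADER.pack(1, 1) + DESCRIPTOR.pack(0, 100, 0)
strided_report = HEADER.pack(1, 1) + DESCRIPTOR.pack(0, 100, 300)
```

A gap of 0 marks the contiguous layout of Figure 3a; a non-zero gap marks the strided layout of Figure 3b.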
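Step S513: Each control node of the storage system receives the interval query command, generates interval reporting information, and reports it to the host 100.
On receiving the interval query command, the control node parses it to obtain the NVMe interval query command. It then carries its own access interval in the response information of the NVMe interval query command, following the predefined format of that response information, encapsulates the response information in the external network protocol to form the external network protocol interval response message, and reports that message to the host 100 as the interval reporting information.
Step S514: After receiving the interval reporting information, the multipath software 104 of the host obtains the access interval information from it and records the access interval information corresponding to the control node that reported it. When an I/O request is later received, it can be allocated based on this information.
To obtain the access interval information, the multipath software 104 of the host first parses the interval reporting information to obtain the NVMe interval query command response information, and then obtains the access interval information from that response information.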
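Figure 14 is a flowchart of allocating the control node that executes an I/O request when the host 100 receives the I/O request.
Step 1301: The host 100 receives an I/O request that carries the logical address of the data to be accessed.
Step 1302: The host 100 determines the access interval in which the logical address is located.
Step 1303: The host 100 determines, according to the determined access interval, the control node that processes the I/O request.
Step 1304: The host 100 sends the I/O request to that control node, which processes the I/O request.
In this way, each I/O request issued by the host is processed by the control node that corresponds to the access interval containing the logical address of the data to be accessed. I/O requests with the same logical address are not sent to different control nodes, so I/O requests are not forwarded and their latency is reduced.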
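Steps 1301 to 1304 amount to a lookup from logical address to control node. The sketch below shows one way to do that lookup for contiguous access intervals only, using a sorted list and binary search; `IntervalRouter`, `record`, and `route` are hypothetical names, and the strided layout of Figure 3b would need a different membership test.

```python
import bisect


class IntervalRouter:
    """Hypothetical host-side table of contiguous access intervals."""

    def __init__(self):
        self._starts = []   # sorted interval start addresses
        self._entries = []  # parallel list of (start, end_inclusive, node)

    def record(self, node, start, length):
        """Record the mapping between a control node and its access interval."""
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._entries.insert(i, (start, start + length - 1, node))

    def route(self, lba):
        """Return the control node whose interval contains lba, or None."""
        i = bisect.bisect_right(self._starts, lba) - 1
        if i >= 0:
            start, end, node = self._entries[i]
            if lba <= end:
                return node
        return None


router = IntervalRouter()
router.record("Node 2", start=100, length=100)
router.record("Node 1", start=0, length=100)
```

Returning `None` for an unmapped address leaves the fallback policy (for example, the ordinary path-selection policy) to the caller.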
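Figures 15 and 16 are module diagrams of the host and of any control node in the storage system, respectively, as provided in this embodiment. The host includes a path management module 1501, a path state query module 1502, an interval query module 1503, an interval recording module 1504, an interval determination module 1505, and an I/O issuing module 1506. The control node includes a disk code reporting module 1601, a disk identifier reporting module 1602, a path state reporting module 1603, and an interval reporting module 1604.
The path management module 1501 of the host is configured to send a disk report command to the storage system 300 over each of the multiple paths between the host 100 and the storage system 300 when the operating system 105 of the host 100 detects that the storage system 300 is connected to the host 100.
After receiving the disk report command, the disk code reporting module 1601 of the control node of the storage system 300 obtains the host port identifier from the command, determines the corresponding disk code from the host port identifier and the mapping between host port identifiers and disk codes of logical disks, generates reporting information for the logical disk that corresponds to the disk code, carries the disk code in the reporting information, and reports it to the host 100.
After receiving the reporting information from the control nodes 302 of the storage system 300, the path management module 1501 of the host sends a disk identifier query command over the path indicated by the path information in each piece of reporting information. The query command includes the disk code from the reporting information.
After receiving the query command, the disk identifier reporting module 1602 of the control node of the storage system 300 obtains the disk identifier that corresponds to the disk code in the query command and reports it over the path on which the query command was sent.
The path management module 1501 of the host uses the reported disk identifiers to determine the paths that access the logical disk corresponding to each disk identifier, and manages the determined paths.
The functions performed by the path management module 1501 of the host are the same as those of steps S501, S504, and S507 in Figure 5. The functions performed by the disk code reporting module 1601 of the control node are the same as those of steps S502 and S503 in Figure 5, and the functions performed by the disk identifier reporting module 1602 of the control node are the same as those of steps S505 and S506 in Figure 5. For details, refer to the description of the corresponding steps in Figure 5.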
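After the access paths for the logical disk are established, the path state query module 1502 of the host sends a path state query command to the storage system 300. The state query command instructs each control node to report the path state of the path on which that control node is located. The functions performed by the path state query module 1502 are the same as those of step S508, so for its implementation details refer to the description of step S508.
When the path state reporting module 1603 of the control node receives the path state query command, it carries the path state in state reporting information and reports it to the host. For the implementation details of the path state reporting module 1603, refer to the description of step S509.
After receiving the reporting information, the interval query module 1503 of the host determines whether the path state in the reporting information indicates that the logical disk 303 includes multiple access intervals. When it determines that the logical disk has not been divided into multiple access intervals, it records the state code as the path state of the path that reported the state reporting information, for reference when the multipath software 104 selects a path for I/O. When it determines that the logical disk has been divided into multiple access intervals, it sends an interval query command to the storage system 300; the interval query command instructs the control nodes in the storage system to report the access interval allocated to each control node. For the implementation details of the interval query module 1503, refer to the description of steps S510 to S512.
The interval reporting module 1604 of the control node receives the interval query command, generates interval reporting information, and reports it to the host 100. For the implementation details of the interval reporting module 1604, refer to the description of step S513.
After receiving the interval reporting information, the interval recording module 1504 of the host obtains the access interval information from it and records the access interval information corresponding to the control node that reported it. When an I/O request is later received, it can be allocated based on this information.
The interval determination module 1505 of the host is configured to receive an I/O request that carries the logical address of the data to be accessed, and to determine the access interval in which the logical address is located.
The I/O issuing module 1506 is configured to determine, according to the determined access interval, the control node that processes the I/O request, and then send the I/O request to that control node, which processes the I/O request.
The foregoing describes only specific implementations of the present invention, and the protection scope of the present invention is not limited to them. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be subject to the protection scope of the claims.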

Claims (53)

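  1. A data processing method, performed by a host, wherein the host communicates with a storage system over the NVMeoF (Non-Volatile Memory express over Fabric) protocol, the storage system includes a logical disk, and the host accesses the logical disk through a control node in the storage system, the method comprising:
    sending a state query command to the control node, wherein the state query command instructs the control node to report the path state of the path on which the control node is located;
    receiving the path state reported by the control node;
    when the received path state indicates that the logical disk includes access intervals, sending an interval query command to the control node;
    receiving access interval information reported by the control node, wherein the access interval indicated by the access interval information is pre-allocated to the control node;
    recording the mapping between the control node and the access interval information.
  2. The method according to claim 1, characterized in that the method further comprises:
    receiving an I/O request, wherein the I/O request carries the logical address of the data to be accessed;
    determining the access interval into which the logical address falls;
    determining, according to the mapping, the control node corresponding to the access interval;
    sending the I/O request to the control node corresponding to the access interval.
  3. The method according to claim 1, characterized in that the path state indicating that the logical disk includes access intervals is reported to the host by the control node when the control node determines that the logical disk includes access intervals.
  4. The method according to any one of claims 1 to 3, characterized in that the access interval query command is defined based on the command Get Log Page - Command Dword 10 in the NVMeoF protocol, and the command identifier of the access interval query command is carried in the Log Page Identifier field of the command.
  5. The method according to any one of claims 1 to 4, characterized in that the access interval indicated by the access interval information is a contiguous address space.
  6. The method according to claim 5, characterized in that the access interval information includes the start address of the access interval and the length of the access interval.
  7. The method according to any one of claims 1 to 4, characterized in that the access interval corresponding to the control node includes a first access sub-interval and a second access sub-interval, with a gap between the first access sub-interval and the second access sub-interval.
  8. The method according to claim 7, characterized in that the first access sub-interval and the second access sub-interval are adjacent, and the access interval information includes the start address of the access interval, the length of each access sub-interval, and the gap between two adjacent access sub-intervals.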
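  9. A data processing method, performed by a control node of a storage system, wherein the storage system communicates with a host over the NVMeoF (Non-Volatile Memory express over Fabric) protocol and the storage system includes a logical disk, characterized in that the method comprises:
    receiving a state query command sent by the host, wherein the state query command instructs the control node to report the path state of the path on which the control node is located;
    when the logical disk includes access intervals, reporting to the host a path state indicating that the logical disk includes access intervals;
    receiving an access interval query command that the host sends according to the path state;
    reporting to the host, according to the interval query command, the access interval information of the access interval allocated to the control node, so that the host records the mapping between the access interval information and the control node.
  10. The method according to claim 9, characterized in that the access interval query command is defined based on the command Get Log Page - Command Dword 10 in the NVMeoF protocol, and the command identifier of the access interval query command is carried in the Log Page Identifier field of the command.
  11. The method according to claim 9 or 10, characterized in that the access interval indicated by the access interval information is a contiguous address space.
  12. The method according to claim 11, characterized in that the access interval information includes the start address of the access interval and the length of the access interval.
  13. The method according to claim 9 or 10, characterized in that the access interval corresponding to the control node includes a first access sub-interval and a second access sub-interval, with a gap between the first access sub-interval and the second access sub-interval.
  14. The method according to claim 13, characterized in that the first access sub-interval and the second access sub-interval are adjacent, and the access interval information includes the start address of the access interval, the length of each access sub-interval, and the gap between two adjacent access sub-intervals.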
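  15. A data processing method, performed by a host, wherein the host is connected to a storage system, the storage system includes multiple control nodes, and the host accesses a logical disk in the storage system through the multiple control nodes, characterized in that the method comprises:
    receiving an I/O request, wherein the I/O request carries the logical address of the data to be accessed;
    determining the access interval to which the logical address belongs, wherein the logical disk includes multiple access intervals and the host records the mapping between each access interval and a control node;
    sending the I/O request to the control node corresponding to the determined access interval.
  16. The method according to claim 15, characterized in that the method further comprises:
    sending a state query command to the multiple control nodes, wherein the state query command instructs each of the multiple control nodes to report the path state of the path on which it is located;
    receiving the path states reported by the multiple control nodes;
    when a received path state indicates that the logical disk includes access intervals, sending an interval query command to the control node that reported the path state;
    receiving the access interval information reported by the control node that reported the path state;
    recording the mapping between the control node that reported the path state and the access interval information.
  17. The method according to claim 16, characterized in that the path state indicating that the logical disk includes access intervals is reported to the host by the control node that received the state query command, when that control node determines that the logical disk includes access intervals.
  18. The method according to any one of claims 15 to 17, characterized in that the access interval query command is defined based on the command Get Log Page - Command Dword 10 in the NVMeoF protocol, and the command identifier of the access interval query command is carried in the Log Page Identifier field of the command.
  19. The method according to any one of claims 15 to 18, characterized in that the access interval indicated by the access interval information is a contiguous address space.
  20. The method according to claim 19, characterized in that the access interval information includes the start address of the access interval and the length of the access interval.
  21. The method according to any one of claims 15 to 18, characterized in that the access interval corresponding to the control node includes a first access sub-interval and a second access sub-interval, with a gap between the first access sub-interval and the second access sub-interval.
  22. The method according to claim 21, characterized in that the first access sub-interval and the second access sub-interval are adjacent, and the access interval information includes the start address of the access interval, the length of each access sub-interval, and the gap between two adjacent access sub-intervals.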
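  23. A data processing method, performed by a host, wherein the host communicates with a storage system over an external network, the method comprising:
    encapsulating a Non-Volatile Memory Express (NVMe) interval query command in the external network protocol to obtain an external network protocol interval query command, wherein the NVMe interval query command is used to query the access interval allocated to a controller of the storage system, and the access interval belongs to a namespace of the storage system;
    sending the external network protocol interval query command to the control node;
    receiving an external network protocol interval response message sent by the control node, wherein the external network protocol interval response message contains an NVMe interval query command response, and the NVMe interval query command response contains the access interval information of the namespace;
    parsing the external network protocol interval response message to obtain the NVMe interval query command response information;
    obtaining and recording the access interval information of the control node from the NVMe interval query command response.
  24. The method according to claim 23, characterized in that, before the NVMe interval query command is encapsulated in the external network protocol to obtain the external network protocol interval query command, the method further comprises:
    encapsulating an NVMe state query command in the external network protocol to obtain an external network protocol state query command, wherein the NVMe state query command is used to query the path state of the path on which the control node is located;
    sending the external network protocol state query command to the control node;
    receiving an external network protocol state response message sent by the control node, wherein the external network protocol state response message contains an NVMe state query command response, the NVMe state query command response contains path state information, and the path state information indicates that the namespace includes access intervals;
    parsing the external network protocol state response message to obtain the NVMe state query command response information.
  25. The method according to claim 23, characterized in that
    the NVMe access interval query command is defined based on the command Get Log Page - Command Dword 10 in the NVMe protocol, and the command identifier of the access interval query command is carried in the Log Page Identifier field of the command.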
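  26. A data processing method, performed by a control node of a storage system, wherein a host communicates with the storage system over an external network, the method comprising:
    receiving an external network protocol interval query command sent by the host, and parsing the external network protocol interval query command to obtain an NVMe interval query command;
    generating response information for the NVMe interval query command, wherein the response information of the NVMe interval query command includes the access interval information corresponding to the control node, and encapsulating the NVMe interval query command in the external network protocol to obtain an external network protocol interval query command;
    reporting the external network protocol interval query command to the host.
  27. The method according to claim 26, characterized in that, before the external network protocol interval query command sent by the host is received, the method further comprises:
    receiving an external network protocol state query command, and parsing the external network protocol state query command to obtain a Non-Volatile Memory Express (NVMe) state query command, wherein the NVMe state query command instructs the control node to report the path state of the path on which the control node is located;
    generating response information for the NVMe state query command and, when the namespace includes access intervals, carrying in the response information of the NVMe state query command a path state indicating that the namespace includes access intervals;
    encapsulating the NVMe state query command to obtain the response information of the external network protocol state query command, and reporting the response information of the external network protocol state query command to the host.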
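  28. A host, wherein the host communicates with a storage system over the NVMeoF (Non-Volatile Memory express over Fabric) protocol, the storage system includes a logical disk, and the host accesses the logical disk through a control node in the storage system, the host comprising:
    a path state query module, configured to send a state query command to the control node, wherein the state query command instructs the control node to report the path state of the path on which the control node is located, and to receive the path state reported by the control node;
    an interval query module, configured to send an interval query command to the control node when the received path state indicates that the logical disk includes access intervals, and to receive access interval information reported by the control node, wherein the access interval indicated by the access interval information is pre-allocated to the control node;
    an interval recording module, configured to record the mapping between the control node and the access interval information.
  29. The host according to claim 28, characterized in that the host further comprises:
    an interval determination module, configured to receive an I/O request, wherein the I/O request carries the logical address of the data to be accessed, and to determine the access interval into which the logical address falls;
    an I/O issuing module, configured to determine, according to the mapping, the control node corresponding to the access interval, and to send the I/O request to the control node corresponding to the access interval.
  30. The host according to claim 28, characterized in that the path state indicating that the logical disk includes access intervals is reported to the host by the control node when the control node determines that the logical disk includes access intervals.
  31. The host according to any one of claims 28 to 30, characterized in that the access interval query command is defined based on the command Get Log Page - Command Dword 10 in the NVMeoF protocol, and the command identifier of the access interval query command is carried in the Log Page Identifier field of the command.
  32. The host according to any one of claims 28 to 31, characterized in that the access interval indicated by the access interval information is a contiguous address space.
  33. The host according to claim 32, characterized in that the access interval information includes the start address of the access interval and the length of the access interval.
  34. The host according to any one of claims 28 to 31, characterized in that the access interval corresponding to the control node includes a first access sub-interval and a second access sub-interval, with a gap between the first access sub-interval and the second access sub-interval.
  35. The host according to claim 34, characterized in that the first access sub-interval and the second access sub-interval are adjacent, and the access interval information includes the start address of the access interval, the length of each access sub-interval, and the gap between two adjacent access sub-intervals.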
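  36. A control node of a storage system, wherein the storage system communicates with a host over the NVMeoF (Non-Volatile Memory express over Fabric) protocol and the storage system includes a logical disk, characterized in that the control node comprises:
    a path state reporting module, configured to receive a state query command sent by the host, wherein the state query command instructs the control node to report the path state of the path on which the control node is located, and, when the logical disk includes access intervals, to report to the host a path state indicating that the logical disk includes access intervals;
    an interval reporting module, configured to receive an access interval query command that the host sends according to the path state, and to report to the host, according to the interval query command, the access interval information of the access interval allocated to the control node, so that the host records the mapping between the access interval information and the control node.
  37. The control node according to claim 36, characterized in that the access interval query command is defined based on the command Get Log Page - Command Dword 10 in the NVMeoF protocol, and the command identifier of the access interval query command is carried in the Log Page Identifier field of the command.
  38. The control node according to claim 36 or 37, characterized in that the access interval indicated by the access interval information is a contiguous address space.
  39. The control node according to claim 38, characterized in that the access interval information includes the start address of the access interval and the length of the access interval.
  40. The control node according to claim 36 or 37, characterized in that the access interval corresponding to the control node includes a first access sub-interval and a second access sub-interval, with a gap between the first access sub-interval and the second access sub-interval.
  41. The control node according to claim 40, characterized in that the first access sub-interval and the second access sub-interval are adjacent, and the access interval information includes the start address of the access interval, the length of each access sub-interval, and the gap between two adjacent access sub-intervals.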
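  42. A host, wherein the host is connected to a storage system, the storage system includes multiple control nodes, and the host accesses a logical disk in the storage system through the multiple control nodes, characterized in that the host comprises:
    an interval determination module, configured to receive an I/O request, wherein the I/O request carries the logical address of the data to be accessed, and to determine the access interval to which the logical address belongs, wherein the logical disk includes multiple access intervals and the host records the mapping between each access interval and a control node;
    an I/O issuing module, configured to send the I/O request to the control node corresponding to the determined access interval.
  43. The host according to claim 42, characterized in that the host further comprises:
    a path state query module, configured to send a state query command to the multiple control nodes, wherein the state query command instructs each of the multiple control nodes to report the path state of the path on which it is located, and to receive the path states reported by the multiple control nodes;
    an interval query module, configured to send an interval query command to the control node that reported the path state when a received path state indicates that the logical disk includes access intervals, and to receive the access interval information reported by the control node that reported the path state;
    an interval recording module, configured to record the mapping between the control node that reported the path state and the access interval information.
  44. The host according to claim 43, characterized in that the path state indicating that the logical disk includes access intervals is reported to the host by the control node that received the state query command, when that control node determines that the logical disk includes access intervals.
  45. The host according to any one of claims 42 to 44, characterized in that the access interval query command is defined based on the command Get Log Page - Command Dword 10 in the NVMeoF protocol, and the command identifier of the access interval query command is carried in the Log Page Identifier field of the command.
  46. The host according to any one of claims 43 to 45, characterized in that the access interval indicated by the access interval information is a contiguous address space.
  47. The host according to claim 46, characterized in that the access interval information includes the start address of the access interval and the length of the access interval.
  48. The host according to any one of claims 43 to 45, characterized in that the access interval corresponding to the control node includes a first access sub-interval and a second access sub-interval, with a gap between the first access sub-interval and the second access sub-interval.
  49. The host according to claim 48, characterized in that the first access sub-interval and the second access sub-interval are adjacent, and the access interval information includes the start address of the access interval, the length of each access sub-interval, and the gap between two adjacent access sub-intervals.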
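  50. A host, wherein the host communicates with a storage system over an external network, the host comprising:
    an interval query module, configured to encapsulate a Non-Volatile Memory Express (NVMe) interval query command in the external network protocol to obtain an external network protocol interval query command, wherein the NVMe interval query command is used to query the access interval allocated to a controller of the storage system and the access interval belongs to a namespace of the storage system; to send the external network protocol interval query command to the control node; and to receive an external network protocol interval response message sent by the control node, wherein the external network protocol interval response message contains an NVMe interval query command response, and the NVMe interval query command response contains the access interval information of the namespace;
    an interval recording module, configured to parse the external network protocol interval response message to obtain the NVMe interval query command response information, and to obtain and record the access interval information of the control node from the NVMe interval query command response.
  51. The host according to claim 50, characterized in that, for use before the NVMe interval query command is encapsulated in the external network protocol to obtain the external network protocol interval query command, the host further comprises:
    a path state query module, configured to:
    encapsulate an NVMe state query command in the external network protocol to obtain an external network protocol state query command, wherein the NVMe state query command is used to query the path state of the path on which the control node is located, and send the external network protocol state query command to the control node;
    receive an external network protocol state response message sent by the control node, wherein the external network protocol state response message contains an NVMe state query command response, the NVMe state query command response contains path state information, and the path state information indicates that the namespace includes access intervals;
    parse the external network protocol state response message to obtain the NVMe state query command response information.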
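  52. A control node of a storage system, wherein a host communicates with the storage system over an external network, the control node comprising:
    an interval reporting module, configured to:
    receive an external network protocol interval query command sent by the host, and parse the external network protocol interval query command to obtain an NVMe interval query command;
    generate response information for the NVMe interval query command, wherein the response information of the NVMe interval query command includes the access interval information corresponding to the control node, and encapsulate the NVMe interval query command in the external network protocol to obtain an external network protocol interval query command;
    report the external network protocol interval query command to the host.
  53. The control node according to claim 52, characterized in that the control node further comprises:
    a path state reporting module, configured to:
    receive an external network protocol state query command, and parse the external network protocol state query command to obtain a Non-Volatile Memory Express (NVMe) state query command, wherein the NVMe state query command instructs the control node to report the path state of the path on which the control node is located;
    generate response information for the NVMe state query command and, when the namespace includes access intervals, carry in the response information of the NVMe state query command a path state indicating that the namespace includes access intervals;
    encapsulate the NVMe state query command to obtain the response information of the external network protocol state query command, and report the response information of the external network protocol state query command to the host.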
PCT/CN2018/095992 2018-07-17 2018-07-17 Method and device for processing I/O requests WO2020014869A1 (zh)

Priority Applications (7)

Application Number Priority Date Filing Date Title
CN202010260925.5A CN111767008A (zh) 2018-07-17 2018-07-17 Method and device for processing I/O requests
JP2020529744A JP7094364B2 (ja) 2018-07-17 2018-07-17 I/O request processing method and device
CN201880003796.2A CN110914796B (zh) 2018-07-17 2018-07-17 Method and device for processing I/O requests
PCT/CN2018/095992 WO2020014869A1 (zh) 2018-07-17 2018-07-17 Method and device for processing I/O requests
EP18926874.1A EP3796149B1 (en) 2018-07-17 2018-07-17 Method and device for processing i/o request
KR1020207014420A KR102342607B1 (ko) 2018-07-17 2018-07-17 I/O request processing method and device
US17/026,280 US11249663B2 (en) 2018-07-17 2020-09-20 I/O request processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/095992 WO2020014869A1 (zh) 2018-07-17 2018-07-17 Method and device for processing I/O requests

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/026,280 Continuation US11249663B2 (en) 2018-07-17 2020-09-20 I/O request processing method and device

Publications (1)

Publication Number Publication Date
WO2020014869A1 true WO2020014869A1 (zh) 2020-01-23

Family

ID=69164173

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095992 WO2020014869A1 (zh) 2018-07-17 2018-07-17 Method and device for processing I/O requests

Country Status (6)

Country Link
US (1) US11249663B2 (zh)
EP (1) EP3796149B1 (zh)
JP (1) JP7094364B2 (zh)
KR (1) KR102342607B1 (zh)
CN (2) CN110914796B (zh)
WO (1) WO2020014869A1 (zh)

Also Published As

Publication number Publication date
JP7094364B2 (ja) 2022-07-01
KR20200070349A (ko) 2020-06-17
US20210004171A1 (en) 2021-01-07
EP3796149B1 (en) 2024-02-21
CN111767008A (zh) 2020-10-13
CN110914796A (zh) 2020-03-24
JP2021510215A (ja) 2021-04-15
US11249663B2 (en) 2022-02-15
EP3796149A1 (en) 2021-03-24
KR102342607B1 (ko) 2021-12-22
EP3796149A4 (en) 2021-06-16
CN110914796B (zh) 2021-11-19

Similar Documents

Publication Publication Date Title
WO2020014869A1 (zh) Method and device for processing I/O requests
US10523573B2 (en) Dynamic resource allocation based upon network flow control
JP3895677B2 (ja) System for managing a movable media library using library partitioning
JP4691251B2 (ja) Storage router and method for providing virtual local storage
US9513825B2 (en) Storage system having a channel control function using a plurality of processors
US6950914B2 (en) Storage system
US8504770B2 (en) System and method for representation of target devices in a storage router
JP2004220450A (ja) Storage device, method for introducing the same, and program for introducing the same
US20070079098A1 (en) Automatic allocation of volumes in storage area networks
US20180081558A1 (en) Asynchronous Discovery of Initiators and Targets in a Storage Fabric
US9674312B2 (en) Dynamic protocol selection
US10782889B2 (en) Fibre channel scale-out with physical path discovery and volume move
JP4484597B2 (ja) Storage device and exclusive control method for storage device
JP2006155640A (ja) Access setting method
US11201788B2 (en) Distributed computing system and resource allocation method
WO2019209625A1 (en) Methods for managing group objects with different service level objectives for an application and devices thereof
US20230350572A1 (en) Host path selection utilizing address range distribution obtained from storage nodes for distributed logical volume
US7925758B1 (en) Fibre accelerated pipe data transport
WO2023246241A1 (zh) Data processing system, data processing method, apparatus, and related device
JP6250378B2 (ja) Computer system, address management device, and edge node

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 18926874; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase. Ref document number: 20207014420; Country of ref document: KR; Kind code of ref document: A
ENP Entry into the national phase. Ref document number: 2020529744; Country of ref document: JP; Kind code of ref document: A
ENP Entry into the national phase. Ref document number: 2018926874; Country of ref document: EP; Effective date: 20201215
NENP Non-entry into the national phase. Ref country code: DE