WO2018137217A1 - System, method, and corresponding apparatus for data processing - Google Patents

System, method, and corresponding apparatus for data processing

Info

Publication number
WO2018137217A1
Authority
WO
WIPO (PCT)
Prior art keywords
host
controller
target
data
target data
Prior art date
Application number
PCT/CN2017/072701
Other languages
English (en)
French (fr)
Inventor
程宏才
郭海涛
刘洪广
陈昊
李思聪
谭春毅
胡瑜
陈灿
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP17894187.8A (EP3493046B1)
Priority to CN201780000604.8A (CN108701004A)
Priority to PCT/CN2017/072701 (WO2018137217A1)
Publication of WO2018137217A1
Priority to US16/362,210 (US11489919B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306 Intercommunication techniques
    • G06F15/17331 Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/16 Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0635 Configuration or reconfiguration of storage systems by changing the path, e.g. traffic rerouting, path reconfiguration

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a system, method, and corresponding device for data processing.
  • a host can access data in a remote storage device by managing a controller of the remote storage device.
  • data transmission between the host and the storage device takes a long time, and the rate at which the host reads and writes data is easily affected.
  • The present invention provides a data processing system, method, and corresponding device to solve the problem that data transmitted between a host and a storage device must be forwarded by the controller that receives the host's operation request, which makes data transmission between the host and the storage device slow.
  • a first aspect of the present application provides a method of data processing, the method being performed by a controller in a data processing system, the system comprising a controller and at least two storage nodes.
  • the controller establishes a connection with each storage node for managing the at least two storage nodes.
  • the storage node and the controller establish a connection with the host, and the host is used to deploy the application service.
  • the host sends an operation request to the controller, the operation request including an identification of the target data to be operated and an operation type.
  • the controller receives the operation request and determines the target storage node according to the identifier of the target data.
  • The controller sends an indication message to the target storage node, where the indication message instructs the target storage node to send the target data to the host, or acquire the target data from the host, through its connection with the host.
  • the target storage node responds to the indication message, and transmits target data to the host or acquires target data from the host through a connection with the host.
  • The host and the target storage node can transmit data directly over the connection between them without forwarding the data through the controller, which shortens the data transmission path and reduces transmission time. This also avoids slow host read and write rates caused by controller bandwidth that cannot meet large data transmission requirements, or by controller computing capability that cannot meet large data processing requirements.
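  • The control path described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `OperationRequest`, `IndicationMessage`, `handle_request`, and the `placement` lookup table are all hypothetical, and the actual data transfer between the host and the storage node is out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationRequest:
    data_id: str   # identifier of the target data
    op_type: str   # "read" or "write"


@dataclass(frozen=True)
class IndicationMessage:
    node: str      # target storage node that will talk to the host directly
    data_id: str
    op_type: str


def handle_request(request: OperationRequest, placement: dict) -> IndicationMessage:
    """Pick the target storage node and build the indication message.

    The controller only produces control messages; the payload itself is
    exchanged between the host and the chosen storage node.
    """
    if request.op_type not in ("read", "write"):
        raise ValueError(f"unsupported operation type: {request.op_type}")
    node = placement[request.data_id]  # hypothetical data-to-node lookup table
    return IndicationMessage(node=node, data_id=request.data_id,
                             op_type=request.op_type)
```

The point of the sketch is that the return value carries only identifiers and an operation type, never the data itself.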
  • The controller determines, according to the identifier of the target data, that the target storage node includes a first storage node that stores a first data block of the target data and a second storage node that stores a second data block of the target data.
  • The controller sends a first indication message to the first storage node, where the first indication message includes an identifier of the first data block and instructs the first storage node to transmit the first data block with the host.
  • The controller sends a second indication message to the second storage node, where the second indication message includes an identifier of the second data block and instructs the second storage node to transmit the second data block with the host.
  • In this way, when the target data is stored across multiple storage nodes, the controller can instruct each storage node that stores a data block of the target data to send its data block to the host, or acquire its data block from the host, through its connection with the host, which increases the data transmission rate and reduces transmission time.
  • The operation request sent by the host to the controller further includes a location parameter of a target storage area that the host specifies in its memory for the target data.
  • The controller divides the target storage area according to the location parameter of the target storage area, and determines a location parameter of the sub-storage area that is specified, within the target storage area, for the data block of the target data corresponding to each target storage node. The controller then generates an indication message for each target storage node, where the indication message includes an identifier of the data block, an operation type, and the location parameter of the sub-storage area.
  • After the target storage node receives the indication message, when the operation type in the indication message is a read operation, the target storage node writes the data block of the target data into the sub-storage area of the host memory through the RDMA connection with the host, according to the location parameter of the sub-storage area. When the operation type in the indication message is a write operation, the target storage node reads the data block of the target data from the sub-storage area of the host memory through the RDMA connection with the host, according to the location parameter of the sub-storage area, and stores the data block.
  • In this way, when the target data is stored across multiple storage nodes, the controller can determine, for each target storage node, the storage area of the host memory that corresponds to the data block stored on that node, and instruct the storage node to quickly read the data block of the target data from the host memory, or write it to the host memory, through the RDMA connection with the host, thereby increasing the data transmission rate and reducing transmission time.
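  • The division of the target storage area into sub-storage areas can be sketched as simple offset arithmetic. The sketch assumes that the location parameter is a base address plus a length and that data blocks are laid out back to back in data order; neither detail is stated in the text, and the function name `split_target_area` is hypothetical.

```python
def split_target_area(base: int, length: int, blocks: list) -> list:
    """Assign each data block a contiguous sub-area of the host buffer.

    blocks is a list of (node, block_id, size) tuples in data order.
    Returns one indication payload per target storage node.
    """
    total = sum(size for _, _, size in blocks)
    if total > length:
        raise ValueError("data blocks do not fit in the target storage area")
    indications = []
    offset = base
    for node, block_id, size in blocks:
        indications.append(
            {"node": node, "block_id": block_id, "addr": offset, "length": size}
        )
        offset += size  # the next sub-area starts where this one ends
    return indications
```

With a 12-byte area at address 0x1000 and blocks of 4 and 8 bytes, the first node is given address 0x1000 and the second node address 0x1004.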
  • When the connection between the storage node and the host is an RDMA connection, the storage node creates a first queue pair (QP), where the first QP includes a first send queue (SQ) and a first receive queue (RQ). The storage node then sends the parameters of the first QP to the controller through the first connection.
  • The controller sends the parameters of the first QP to the host through the second connection.
  • The host creates a second QP, where the second QP includes a second SQ and a second RQ, and sends the parameters of the second QP to the controller through the second connection. The controller sends the parameters of the second QP to the storage node through the first connection.
  • The host associates the second QP with the first QP of any of the storage nodes according to the received parameters of the first QP.
  • The storage node, according to the received parameters of the second QP, binds the first SQ of the first QP to the second RQ of the second QP, and binds the first RQ of the first QP to the second SQ of the second QP.
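  • The relayed QP parameter exchange can be modelled roughly as below. This is a toy model under stated assumptions, not real RDMA verbs code: `QueuePair`, `relay`, and `establish` are hypothetical names, and binding an SQ to the peer RQ is reduced to remembering the peer's QP number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueuePair:
    owner: str
    qp_num: int
    peer_qp_num: Optional[int] = None  # set once the peer's parameters arrive

    def params(self) -> dict:
        return {"owner": self.owner, "qp_num": self.qp_num}

    def associate(self, peer_params: dict) -> None:
        # Binding local SQ -> peer RQ and local RQ <- peer SQ is modelled
        # here as simply recording the peer QP number.
        self.peer_qp_num = peer_params["qp_num"]


def relay(params: dict) -> dict:
    """Controller role: forward QP parameters unchanged between the two sides."""
    return dict(params)


def establish(storage_qp: QueuePair, host_qp: QueuePair) -> None:
    host_qp.associate(relay(storage_qp.params()))   # first QP params reach the host
    storage_qp.associate(relay(host_qp.params()))   # second QP params reach the node
```

The controller appears only in `relay`, which reflects that it carries connection parameters but is not an endpoint of the resulting RDMA connection.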
  • The host establishes a Transmission Control Protocol/Internet Protocol (TCP/IP) connection with the controller and with the at least two storage nodes. After the controller sends the indication message to the target storage node, the controller sends a third indication message to the host, where the third indication message includes a communication address of the target storage node and instructs the host to transmit the target data through the TCP/IP connection with the target storage node.
  • The controller sends an indication message not only to the packet sending end but also to the packet receiving end, so that the packet receiving end can obtain the target data from the received TCP/IP packet instead of discarding the packet.
  • The host establishes a TCP/IP connection with the controller and with the at least two storage nodes, and the operation type of the operation request sent by the host to the controller is a read operation.
  • The controller adds the operation type, the communication address of the host, the communication address of the controller, and the identifier of the target data to the indication message sent to the target storage node. The indication message instructs the target storage node to send a TCP/IP packet carrying the target data, with the communication address of the controller as the source address and the communication address of the host as the destination address.
  • The target storage node modifies the source address of the packet it sends so that the packet carrying the target data appears to the host to come from the controller. The storage node can therefore send data directly to the host over the TCP/IP connection without any change to the existing host, which increases the data transmission speed between the host and the storage node and reduces the time consumed by data transmission.
  • The controller determines the amount of data the host can receive according to the TCP window size of the host, and determines, according to that amount, the data block of the target data that the target storage node carries in each TCP/IP packet, where the size of the data block is not greater than the TCP window size of the host. The controller then generates an indication message that includes the identifier of the data block and sends the indication message to the target storage node.
  • In this way, the controller can determine the data block sent to the host each time according to the TCP window size of the host, and instruct the storage node that stores the data block to send it to the host in a TCP/IP packet, so that data is sent to the host over the TCP connection between the storage node and the host.
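  • The window-based sizing rule can be illustrated with a short sketch that cuts the target data into (offset, length) blocks no larger than the host's TCP window. The function name `plan_blocks` and the byte-offset representation of a data block are assumptions made for illustration.

```python
def plan_blocks(total_len: int, window: int) -> list:
    """Split total_len bytes into (offset, length) blocks of at most window bytes."""
    if window <= 0:
        raise ValueError("TCP window size must be positive")
    return [
        (offset, min(window, total_len - offset))
        for offset in range(0, total_len, window)
    ]
```

For 10 bytes of target data and a window of 4 bytes, this yields the blocks (0, 4), (4, 4), and (8, 2).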
  • After the target storage node sends the target data to the host or acquires the target data from the host, the target storage node sends a data transmission success message to the controller. After receiving the data transmission success messages sent by all the target storage nodes, the controller sends an operation success message to the host.
  • the controller can inform the host that the data is successfully read and written after the target data transmission between the target storage node and the host is completed.
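  • The completion rule, in which the controller reports success to the host only after every target storage node has reported success, can be sketched as a small tracker. The class name `CompletionTracker` is hypothetical, and failure handling and timeouts are deliberately left out.

```python
class CompletionTracker:
    """Tracks which target storage nodes still owe a success message."""

    def __init__(self, target_nodes):
        self.pending = set(target_nodes)

    def on_success(self, node: str) -> bool:
        """Record one node's success; return True when all nodes are done."""
        self.pending.discard(node)
        return not self.pending
```

The controller would send the operation success message to the host at the moment `on_success` first returns `True`.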
  • a second aspect of the present application provides a method of data processing performed by a storage node in a system of data processing.
  • the system includes a controller and at least two storage nodes.
  • the controller establishes a connection with each storage node for managing the at least two storage nodes.
  • the storage node and the controller establish a connection with the host, and the host is used to deploy the application service.
  • The method includes: the storage node establishes a connection with the host, where the host is used to deploy an application service; the storage node receives an indication message sent by the controller, where the indication message includes an identifier of the target data to be operated on by the host and an operation type; and the storage node, according to the indication message, sends the target data to the host or acquires the target data from the host through the connection with the host.
  • the storage node establishes an RDMA connection with the host, and the process of establishing the connection is: the storage node creates a first QP, and sends the parameter of the first QP to the controller by using the first connection.
  • the controller sends the parameters of the first QP to the host through the second connection.
  • The host creates a second QP and sends the parameters of the second QP to the controller through the second connection, and the controller sends the parameters of the second QP to the storage node through the first connection.
  • the host associates the second QP with the first QP of the storage node according to the received parameters of the first QP.
  • the storage node associates its first QP with the second QP according to the received parameters of the second QP.
  • the indication message sent by the controller includes an identifier of the target data, an operation type, and a location parameter of the target storage area in the host.
  • The storage node responds to the indication message as follows. When the operation type in the indication message is a read operation, the storage node writes the target data into the storage area of the host memory through the RDMA connection with the host. When the operation type in the indication message is a write operation, the storage node reads the target data from the storage area of the host memory through the RDMA connection with the host, and stores the target data.
  • The target storage node can thus read data from and write data to the host memory through the RDMA connection with the host, completing the transmission of the target data with the host.
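  • The read and write branches above can be simulated with an in-process buffer standing in for host memory. This is only a model of the data direction, not RDMA code: the `bytearray` host memory, the `store` dictionary, and the function `handle_indication` are all assumptions for illustration.

```python
def handle_indication(op_type: str, host_mem: bytearray, addr: int,
                      length: int, store: dict, data_id: str) -> None:
    """Simulate the storage node's side of a direct transfer.

    read  -> the node writes its stored data INTO host memory.
    write -> the node reads data FROM host memory and stores it.
    """
    if addr < 0 or addr + length > len(host_mem):
        raise ValueError("storage area lies outside host memory")
    if op_type == "read":
        data = store[data_id]
        if len(data) != length:
            raise ValueError("storage area size does not match the target data")
        host_mem[addr:addr + length] = data
    elif op_type == "write":
        store[data_id] = bytes(host_mem[addr:addr + length])
    else:
        raise ValueError(f"unsupported operation type: {op_type}")
```

Note the inversion: a host read is a write into host memory from the node's point of view, and a host write is a read from host memory.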
  • the host establishes a TCP/IP connection with the storage node, and the operation type of the operation request sent by the host to the controller is a read operation.
  • the controller adds the operation type, the communication address of the host, the communication address of the controller, and the identifier of the target data in the indication message sent to the target storage node.
  • the target storage node responds to the indication message, and sends a TCP/IP packet carrying the target data with the communication address of the controller as the source address and the communication address of the host as the destination address.
  • The target storage node modifies the source address of the packet it sends so that the packet carrying the target data appears to the host to come from the controller. The storage node can therefore send data directly to the host over the TCP/IP connection without any change to the existing host, which increases the data transmission speed between the host and the storage node and reduces the time consumed by data transmission.
  • A third aspect of the present application provides a method of data processing, the method being performed by a first controller in a data processing system. The system comprises at least two controllers and at least two disks. The at least two controllers are used to manage the at least two disks, each of the at least two disks may be managed by one controller, and the at least two controllers and the at least two disks form a storage array.
  • the host can access the target data in the LUN managed by the controller through the controller.
  • At least two controllers establish a connection with the host, and the host is used to deploy an application service.
  • The method includes: the first controller receives an operation request sent by the host, where the operation request includes an identifier of the target data to be operated on and an operation type; the first controller determines, according to the preset load balancing policy or the LUN of the target data, the second controller that is to respond to the operation request; and the first controller sends an indication message to the second controller, instructing the second controller to send the target data to the host or acquire the target data from the host through its connection with the host.
  • The connection between the second controller and the host is an RDMA connection.
  • The operation request sent by the host to the first controller further includes a location parameter of the target storage area in the host.
  • The first controller adds the identifier of the target data, the operation type, and the location parameter to the indication message sent to the second controller. When the operation type is a read operation, the indication message instructs the second controller to acquire the target data from the at least two disks and write it to the storage area of the host memory through the RDMA connection with the host. When the operation type is a write operation, it instructs the second controller to read the target data from the storage area of the host memory through the RDMA connection with the host and write the target data to the at least two disks.
  • the host and the second controller responsible for providing the host with the read/write target data service can quickly transfer the target data through the RDMA connection established between the two, thereby realizing high-speed reading and writing of data.
  • The first controller determines, according to the preset load balancing policy, that the second controller is to send the target data to the host or acquire the target data from the host; or the first controller determines this according to the ownership of the logical unit number (LUN) where the target data is located.
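  • One way to picture the selection of the second controller is shown below. The text presents LUN ownership and load balancing as alternatives; combining them as "LUN owner first, otherwise least-loaded controller" is an assumption of this sketch, as are the function name `choose_second_controller` and the shape of its inputs.

```python
from typing import Optional


def choose_second_controller(lun: Optional[str], lun_owner: dict, load: dict) -> str:
    """Pick the controller that will transfer data with the host.

    lun_owner maps a LUN to its owning controller; load maps each
    controller to an outstanding-request count (hypothetical policy input).
    """
    owner = lun_owner.get(lun)
    if owner is not None:
        return owner                   # selection by LUN ownership
    return min(load, key=load.get)     # assumed policy: least-loaded controller
```

Any other load balancing policy, such as round robin, could replace the `min` call without changing the rest of the flow.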
  • The first controller receives a fourth indication message sent by a third controller that manages the at least two disks, where the fourth indication message includes the second target data that the host is to operate.
  • a fourth aspect of the present application provides an apparatus for data processing for performing the method of any of the above first aspect or any of the possible implementations of the first aspect.
  • the apparatus comprises means for performing the method of any of the above-described first aspect or any of the possible implementations of the first aspect.
  • a fifth aspect of the present application provides an apparatus for data processing for performing the method of any of the above-described second aspect or any possible implementation of the second aspect.
  • the apparatus comprises means for performing the method of any of the possible implementations of the second aspect or the second aspect described above.
  • a sixth aspect of the present application provides an apparatus for data processing for performing the method of any of the above-described third aspect or any possible implementation of the third aspect.
  • the apparatus comprises means for performing the method of any of the possible implementations of the third aspect or the third aspect described above.
  • a seventh aspect of the present application provides a data processing device, including a processor, a memory, a communication interface, and a bus.
  • The processor, the memory, and the communication interface are connected by the bus and communicate with each other through it, and the memory is used to store computer-executable instructions.
  • When the device runs, the processor executes the computer-executable instructions in the memory to perform, using hardware resources in the device, the method of the first aspect or any of the possible implementations of the first aspect.
  • An eighth aspect of the present application provides a data processing device, including a processor, a memory, a communication interface, and a bus.
  • The processor, the memory, and the communication interface are connected by the bus and communicate with each other through it, and the memory is used to store computer-executable instructions.
  • When the device runs, the processor executes the computer-executable instructions in the memory to perform, using hardware resources in the device, the method of the second aspect or any of the possible implementations of the second aspect.
  • a ninth aspect of the present application provides a data processing device, including a processor, a memory, a communication interface, and a bus.
  • The processor, the memory, and the communication interface are connected by the bus and communicate with each other through it, and the memory is used to store computer-executable instructions.
  • When the device runs, the processor executes the computer-executable instructions in the memory to perform, using hardware resources in the device, the method of the third aspect or any of the possible implementations of the third aspect.
  • A tenth aspect of the present application provides a system for data processing, the system comprising the device of the seventh aspect and at least two devices of the eighth aspect, where the devices of the eighth aspect transmit data directly with the host through the connection between the two, without forwarding the data via the device described in the seventh aspect.
  • An eleventh aspect of the present application provides a system for data processing, comprising at least two devices according to the ninth aspect and at least two disks, where the device of the ninth aspect is configured to, in response to an operation request of a host, transmit data directly with the host through the connection between the two, without forwarding the data via the device that receives the host operation request.
  • A twelfth aspect of the present application provides a computer readable medium having stored therein instructions that, when executed on a computer, cause the computer to perform the method of the first aspect or any possible implementation of the first aspect.
  • A thirteenth aspect of the present application provides a computer readable medium having stored therein instructions that, when executed on a computer, cause the computer to perform the method of the second aspect or any possible implementation of the second aspect.
  • A fourteenth aspect of the present application provides a computer readable medium having stored therein instructions that, when executed on a computer, cause the computer to perform the method of the third aspect or any possible implementation of the third aspect.
  • FIG. 1 is a schematic diagram of a SAN storage system in the prior art
  • FIG. 2 is a schematic diagram of a scatter gather list (SGL)
  • FIG. 3 is a schematic diagram of a SAN system according to an embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of a method for transmitting data in a SAN system according to an embodiment of the present invention
  • FIG. 5 is a schematic flowchart of establishing an RDMA connection between a host and a storage node according to an embodiment of the present invention
  • FIG. 6 is a schematic structural diagram of a storage node and a host according to an embodiment of the present invention.
  • FIG. 7 is a schematic flowchart of a host and a storage node transmitting target data through an RDMA connection according to an embodiment of the present invention
  • FIGS. 8a-8b are schematic diagrams showing a storage area for storing target data in a host memory according to an embodiment of the present invention.
  • FIG. 9 is a schematic flowchart of a host and a storage node transmitting target data through an iSER connection according to an embodiment of the present invention.
  • FIG. 10 is a schematic flowchart of a method for a host and a storage node to transmit data through a TCP/IP connection according to an embodiment of the present invention
  • FIG. 11 is a schematic flowchart of another method for transmitting data between a host and a storage node through a TCP/IP connection according to an embodiment of the present invention
  • FIG. 12 is a schematic structural diagram of a frame of a packet sent by a storage node to a host according to an embodiment of the present disclosure
  • FIG. 13 is a schematic structural diagram of a device 50 according to an embodiment of the present invention.
  • FIG. 14 is a schematic structural diagram of a device 60 according to an embodiment of the present invention.
  • FIG. 15 is a schematic structural diagram of a device 70 according to an embodiment of the present invention.
  • FIG. 16 is a system block diagram of a prior art storage array
  • FIG. 17 is a system block diagram of a storage array in an embodiment of the present invention.
  • FIG. 18 is a schematic flowchart diagram of a data processing method according to an embodiment of the present invention.
  • FIG. 19 is a schematic structural diagram of a device 80 according to an embodiment of the present invention.
  • FIG. 20 is a schematic structural diagram of a device 90 according to an embodiment of the present invention.
  • FIG. 1 is a schematic diagram of a distributed storage area network (SAN) system including a controller 12 and at least two storage nodes, such as a storage node 13-1, a storage node 13-2, ..., Storage node 13-n.
  • the system is used to process the request message of the host 11, and the host 11 establishes a network connection with the controller 12, such as a Fibre Channel (FC) based connection, or an Ethernet based connection.
  • the host 11 transmits an operation request to the controller 12 through a connection with the controller 12.
  • the controller 12 establishes a network connection with the at least two storage nodes, and the SAN system uses a distributed manner to store data. If data needs to be written into the SAN system, the controller splits the data into multiple data blocks according to a preset algorithm.
  • the controller 12 stores information about data stored by each storage node, which is also referred to as a partitioned view of the storage node.
  • Table 1 is an example of a partitioned view in a storage node.
  • Table 1 includes a data block identifier, an associated original data identifier, a storage medium, and a check value. The check value is check information of the data block, calculated by the controller according to a preset algorithm, and is used to determine the integrity of the data block when the data block is read or written. The storage medium identifies the target storage medium in the storage node where the data block is stored.
  • Data block identifier | Associated raw data identifier | Storage medium | Check value
    Data block 11 | Data 1 | Storage medium 1 | Check value 1
    Data block 21 | Data 2 | Storage medium 2 | Check value 2
    Data block 31 | Data 3 | Storage medium 3 | Check value 3
  • A global view of the storage nodes in the SAN system is saved in the controller; the global view records information about the data stored in each storage node.
  • Table 2 is an example of a global view.
  • Data block identifier | Associated raw data identifier | Check value | Storage node | Storage medium
    Data block 11 | Data 1 | Check value 1 | Storage node 1 | Storage medium 1
    Data block 12 | Data 1 | Check value 2 | Storage node 2 | Storage medium 1
    Data block 13 | Data 1 | Check value 3 | Storage node 3 | Storage medium 1
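  • As an illustration only, the lookup that a controller might perform against a view such as Table 2 can be sketched as follows. The `ViewEntry` type, the `find_target_nodes` function, and the in-memory list are hypothetical names chosen for this sketch; the description does not specify how the view is stored or searched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewEntry:
    """One row of a global view (hypothetical layout mirroring Table 2)."""
    block_id: str
    data_id: str
    check_value: str
    storage_node: str
    storage_medium: str


# The three rows of Table 2, held in memory for the sketch.
GLOBAL_VIEW = [
    ViewEntry("Data block 11", "Data 1", "Check value 1", "Storage node 1", "Storage medium 1"),
    ViewEntry("Data block 12", "Data 1", "Check value 2", "Storage node 2", "Storage medium 1"),
    ViewEntry("Data block 13", "Data 1", "Check value 3", "Storage node 3", "Storage medium 1"),
]


def find_target_nodes(view, data_id):
    """Return {storage node: [block ids]} for every block that belongs to data_id."""
    targets = {}
    for entry in view:
        if entry.data_id == data_id:
            targets.setdefault(entry.storage_node, []).append(entry.block_id)
    return targets


targets = find_target_nodes(GLOBAL_VIEW, "Data 1")
```

  • With the rows of Table 2, `targets` maps each of the three storage nodes to the single data block it holds; an identifier that is absent from the view yields an empty mapping.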
  • For a read request, the controller 12 determines the target storage node that stores the target data (for example, the storage node 13-1) according to the identifier of the target data in the read request and the partition view.
  • The controller 12 then sends a read request to the target storage node 13-1.
  • The target storage node 13-1 returns the target data to the controller 12; the controller 12 receives the target data and returns it to the host 11.
  • For a write request, the controller 12 determines the target storage node 13-1 that is to store the target data according to the identifier of the target data in the write request and the partition view. The controller 12 then sends a write request to the target storage node. In response, the target storage node 13-1 writes the target data to the storage medium and returns a write response message to the controller 12. When the response message received by the controller 12 indicates that the write succeeded, the controller 12 returns a write success message to the host 11.
  • In both cases the controller must forward the data to be operated on, and while forwarding it the controller must also process the data, for example by encapsulating and unpacking it.
  • The SAN system can be used to process operation requests of multiple hosts.
  • The controller may therefore process the operations of multiple hosts on the storage nodes at the same time. This makes the controller's data transmission burden and computing burden too heavy and limits the speed at which the hosts read data from and write data to the storage nodes.
  • An embodiment of the present invention provides a data processing system, which is described in detail below with reference to the drawings and specific embodiments.
  • Remote direct memory access (RDMA).
  • Central processing unit (CPU).
  • A queue pair (QP) includes a send queue (SQ) and a receive queue (RQ). Each end of an RDMA connection needs to establish its own QP, then associate the SQ in its own QP with the RQ of the peer QP, and associate the RQ in its own QP with the SQ of the peer QP, thereby associating its own QP with the peer QP.
  • Once both ends have done this, the RDMA connection between them is established.
  • RDMA-enabled network interface card (RNIC).
  • The RNIC of device A can directly read data from the memory of device A and send the read data to the RNIC of device B; the RNIC of device B writes the data received from the RNIC of device A into the memory of device B.
  • The RNIC may be a host bus adapter (HBA) that supports RDMA.
  • RDMA operation types include RDMA send and RDMA receive, which transmit commands, and RDMA read and RDMA write, which transfer data.
  • An iSER connection is a connection that uses the iSCSI extensions for RDMA (iSER) protocol, which is based on remote direct memory access.
  • iSCSI refers to the internet small computer system interface; the iSER protocol supports RDMA transmission.
  • NVMe refers to non-volatile memory express; the NVMe over Fabrics (NOF) protocol supports RDMA transmission.
  • A scatter gather list (SGL) is a form of data encapsulation.
  • A single data transfer requires both the source physical address range and the target physical address range to be contiguous.
  • In practice, however, the storage addresses of data are not necessarily contiguous in physical space.
  • In that case the non-contiguous physical storage addresses are encapsulated in the form of an SGL: after the device transfers one piece of data stored in a physically contiguous storage space, it transfers the next piece of data stored in another physically contiguous storage space.
  • The SGL includes a plurality of scatter gather entries (SGEs).
  • Each SGE includes an address field, a length field, and a flag field. The address field gives the start position of a storage area, the length field gives the length of the storage area, and the flag field indicates whether the SGE is the last one in the SGL; the flag field may also include other auxiliary information, such as a data block descriptor.
  • Each SGE represents a contiguous storage area according to its own address field and length field. SGEs whose storage areas are sequentially adjacent are grouped together, and the storage areas represented by different groups of SGEs in the SGL are not adjacent to each other. The last SGE of each group of SGEs points to the starting address of the next group of SGEs.
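  • The SGE fields described above can be sketched as a small data structure. This is a simplified illustration: it uses a flat list rather than chained groups of SGEs, and the names `SGE`, `build_sgl`, and `gather` are hypothetical, not part of any RDMA or NVMe library.

```python
from dataclasses import dataclass


@dataclass
class SGE:
    address: int        # start position of one contiguous storage area
    length: int         # length of that area in bytes
    last: bool = False  # flag field: True only on the final entry of the list


def build_sgl(regions):
    """Build an SGL from (address, length) pairs and flag the last entry."""
    sgl = [SGE(address, length) for address, length in regions]
    if sgl:
        sgl[-1].last = True
    return sgl


def gather(memory, sgl):
    """Copy each described area, in list order, into one contiguous buffer."""
    out = bytearray()
    for sge in sgl:
        out += memory[sge.address:sge.address + sge.length]
        if sge.last:
            break
    return bytes(out)


# Three non-adjacent areas of a 24-byte memory hold the pieces of one payload.
memory = bytearray(b"....AAAA....BB......CCC.")
sgl = build_sgl([(4, 4), (12, 2), (20, 3)])
payload = gather(memory, sgl)
```

  • Here `payload` is the concatenation of the three areas, which shows how a transfer engine can walk the list and move one contiguous piece at a time.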
  • Logical unit number (LUN): a LUN is just a numeric identifier and does not represent any entity attribute.
  • Multiple disks are grouped into a logical disk according to a preset algorithm (such as a redundant array of independent disks (RAID) configuration relationship), and the logical disk is then divided into different strips according to preset rules.
  • Each strip is called a LUN. A LUN can describe a contiguous storage area of a disk of a storage array, a collection of multiple non-contiguous storage areas of a disk, or a collection of storage areas in multiple disks.
  • The system includes a host 40, a controller 20, and at least two storage nodes managed by the controller 20, such as a storage node 31, a storage node 32, and a storage node 33.
  • The controller 20 establishes a connection with each of the storage node 31, the storage node 32, and the storage node 33, and the controller 20 establishes a connection with the host 40.
  • The storage node 31, the storage node 32, and the storage node 33 each establish a connection with the host 40.
  • Host 40 is used to deploy application services.
  • The connection between the controller 20 and a storage node is referred to as a first connection, the connection between the controller 20 and the host 40 is referred to as a second connection, and the connection between a storage node and the host 40 is referred to as a third connection.
  • The first connection, the second connection, and the third connection may be wired communication based connections, such as FC based connections; they may also be wireless communication based connections, such as cellular communication based connections or wireless fidelity (WIFI) connections.
  • FIG. 4 shows a method for data processing according to the system shown in FIG. 3, the method comprising:
  • Step 101: The controller establishes a first connection with the storage node.
  • The number of storage nodes is two or more. The controller establishes a first connection with each storage node and manages the storage node through that connection, for example by instructing the storage node to store, return, or delete data according to a request of the host.
  • Step 102: The controller establishes a second connection with the host.
  • Step 103: The host establishes a third connection with the storage node.
  • Step 104: The host sends an operation request to the controller through the second connection, where the operation request includes an identifier of the target data to be operated on and an operation type.
  • The read request and the write request are collectively referred to as an operation request.
  • The operation request includes an operation type indicating whether the requested operation is a read operation or a write operation.
  • The identifier of the target data uniquely identifies the target data. For example, when data is stored in a key-value manner, the key is the identifier of the data.
  • Step 105: The controller receives the operation request, and determines the target storage node according to the identifier of the target data and the global view or the partition view.
  • The controller obtains the identifier of the target data from the operation request and determines the target storage node according to that identifier and the global view or the partition view.
  • When the operation type is a write operation, the target storage node is the storage node to which the target data is to be written; when the operation type is a read operation, the target storage node is the storage node that stores the target data.
  • The controller searches for the identifier of the target data in the global view or the partition view to determine the target storage node that stores the target data.
  • There may be one or more target storage nodes.
  • For example, the target data is divided into a plurality of data blocks, and the data blocks are stored separately in a plurality of storage nodes.
  • When the operation type is a write operation, the controller divides the data to be written into a plurality of data blocks according to a preset algorithm and stores them in a plurality of storage nodes; in this case, each storage node that stores a data block is a target storage node for storing the target data.
  • Step 106: The controller sends an indication message to the target storage node, where the indication message instructs the target storage node to send the target data to the host, or to acquire the target data from the host, through its connection with the host.
  • When there are multiple target storage nodes, the controller may send an indication message to all of them, instructing each target storage node to transmit its data block of the target data through its third connection with the host. Since the target storage nodes can then exchange their data blocks with the host simultaneously, the efficiency of the target data transmission is improved and its time consumption is reduced.
  • Alternatively, the controller sends the indication message to the target storage nodes one by one, sending the indication message to the next target storage node only after the data transmission between the previous target storage node and the host has ended.
  • In this way the controller can keep the target data transmission orderly and ensure the correctness of the transmitted data.
  • Step 107: The target storage node sends the target data to the host or acquires the target data from the host according to the indication message sent by the controller.
  • In this method, the host, through the controller, instructs the target storage node to send target data to the host or acquire target data from the host, and the target storage node and the host transmit the target data directly over the connection between them instead of forwarding it via the controller. This shortens the data transmission path and therefore the transmission time. It also avoids a slow host read/write rate caused by the controller's bandwidth being unable to meet the demands of large data transfers, and a slow host read/write rate caused by the controller's computing power being unable to meet the demands of processing a large amount of data.
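  • The separation between the control path (host to controller to storage node) and the data path (storage node to host) in steps 104 to 107 can be sketched as a toy simulation of a read operation. All class and method names are hypothetical, the three connections are modelled as plain method calls, and returning the block order to the host is an assumption made so that the example can reassemble the data.

```python
class Host:
    def __init__(self):
        self.received = {}  # block id -> bytes delivered over the third connection


class StorageNode:
    def __init__(self, blocks):
        self.blocks = blocks  # block id -> bytes held by this node

    def on_indication(self, host, block_ids):
        # Step 107: deliver the requested blocks straight to the host.
        for block_id in block_ids:
            host.received[block_id] = self.blocks[block_id]


class Controller:
    def __init__(self, nodes, view):
        self.nodes = nodes  # node name -> StorageNode
        self.view = view    # data id -> ordered list of (node name, block id)

    def on_read_request(self, host, data_id):
        # Step 105: group the blocks of the target data by target storage node.
        per_node = {}
        for node_name, block_id in self.view[data_id]:
            per_node.setdefault(node_name, []).append(block_id)
        # Step 106: one indication per target node; no payload passes through here.
        for node_name, block_ids in per_node.items():
            self.nodes[node_name].on_indication(host, block_ids)
        # Assumption: the reply carries only the block order, never the data.
        return [block_id for _, block_id in self.view[data_id]]


nodes = {
    "node-1": StorageNode({"b1": b"hello "}),
    "node-2": StorageNode({"b2": b"world"}),
}
controller = Controller(nodes, {"data-1": [("node-1", "b1"), ("node-2", "b2")]})
host = Host()
order = controller.on_read_request(host, "data-1")  # step 104
data = b"".join(host.received[block_id] for block_id in order)
```

  • The controller in this sketch only handles identifiers; the bytes of each block move from a `StorageNode` to the `Host` without being copied by the `Controller`.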
  • The third connection established between the host and the storage node may be implemented in multiple manners, which are introduced below.
  • In one implementation, the third connection is an RDMA connection.
  • The process of establishing an RDMA connection between a host and a storage node is as follows:
  • Step 201: The storage node creates a first QP.
  • The QP created by the storage node is referred to as the first QP in the embodiment of the present invention.
  • The first QP includes a send queue SQ and a receive queue RQ; the SQ is used to send data and the RQ is used to receive data.
  • The first QP further includes a completion queue, which is used to detect whether the data sending task of the SQ of the first QP and the data receiving task of the RQ of the first QP are completed.
  • Step 202: The storage node sends the parameters of the first QP to the controller through the first connection.
  • The parameters of the first QP may include an identifier of the first QP, which may be a combination of numbers, letters, or other forms used to identify the QP.
  • The parameters may also include an identifier of the protection domain (PD) configured for the first QP, where the PD represents the RDMA-capable network card (such as an RNIC) allocated by the storage node for the RDMA connection.
  • The parameters of the first QP may also include an identifier of a connection manager (CM) assigned by the storage node to assist in establishing and managing the RDMA connection.
  • Step 203: The controller sends the parameters of the first QP to the host through the second connection.
  • Step 204: The host creates a second QP.
  • The QP created by the host is referred to as the second QP in the embodiment of the present invention.
  • The second QP includes a send queue SQ and a receive queue RQ.
  • The second QP further includes a completion queue, which is used to detect whether the data sending task of the SQ of the second QP and the data receiving task of the RQ of the second QP are completed.
  • Step 205: The host sends the parameters of the second QP to the controller through the second connection.
  • The parameters of the second QP may include an identifier of the second QP, and may also include an identifier of the protection domain PD configured for the second QP, an identifier of the connection manager CM configured for the second QP, and the like.
  • Step 206: The controller sends the parameters of the second QP to the storage node through the first connection.
  • Step 207: The host associates its second QP with the first QP of the storage node according to the received parameters of the first QP.
  • Step 208: The storage node associates its first QP with the second QP according to the received parameters of the second QP.
  • Associating the first QP with the second QP means binding the SQ of the first QP to the RQ of the second QP according to the identifier of the first QP and the identifier of the second QP, thereby creating a path for sending data from the storage node to the host, and binding the RQ of the first QP to the SQ of the second QP, thereby creating a path for sending data from the host to the storage node.
  • The order of step 201, in which the storage node creates the first QP, and step 204, in which the host creates the second QP, is not fixed; for example, step 204 may be performed after step 201.
  • Through the above process, the host can establish an RDMA connection with the storage node. Since the RDMA connection is established directly between the host and the storage node, the host and the storage node transmit the target data through the RDMA connection without an intermediate relay, which reduces the transmission time. Moreover, the transmission rate of the RDMA connection itself is high, which further reduces the transmission time.
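  • The parameter exchange and association in steps 201 to 208 can be sketched with a toy model. The `QueuePair` class, the `relay` function, and the `registry` dictionary are hypothetical stand-ins (for a real QP, for the controller forwarding parameters, and for the network fabric, respectively); a real implementation would use an RDMA verbs library rather than Python deques.

```python
from collections import deque


class QueuePair:
    def __init__(self, qp_id):
        self.qp_id = qp_id
        self.sq = deque()  # send queue
        self.rq = deque()  # receive queue
        self.peer = None   # set once the QP has been associated

    def parameters(self):
        # Only the identifier is modelled; PD and CM identifiers are omitted.
        return {"qp_id": self.qp_id}

    def associate(self, registry, peer_params):
        # Bind this QP's SQ to the RQ of the QP named in the received parameters.
        self.peer = registry[peer_params["qp_id"]]

    def send(self, message):
        if self.peer is None:
            raise RuntimeError("QP is not associated with a peer QP")
        self.sq.append(message)
        self.peer.rq.append(self.sq.popleft())  # SQ drains into the peer's RQ


def relay(params):
    """Stand-in for the controller forwarding QP parameters (steps 203 and 206)."""
    return dict(params)


first_qp = QueuePair("qp-storage-1")   # step 201, on the storage node
second_qp = QueuePair("qp-host-1")     # step 204, on the host
registry = {qp.qp_id: qp for qp in (first_qp, second_qp)}

second_qp.associate(registry, relay(first_qp.parameters()))  # steps 202, 203, 207
first_qp.associate(registry, relay(second_qp.parameters()))  # steps 205, 206, 208

first_qp.send(b"block from storage node")
second_qp.send(b"block from host")
```

  • After both associations, a message placed on one side's SQ appears on the other side's RQ, which corresponds to the two data paths described in the text.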
  • FIG. 6 is a schematic diagram of the host and the storage node transmitting data through the RDMA connection. The host 40 includes an RNIC 41 (such as an HBA card) and a memory 42.
  • The storage node 31 includes an RNIC 311 and a memory 312.
  • The RNIC 311 of the storage node 31 can send a request to the RNIC 41 of the host 40 to read data at a specified location of the memory 42; the RNIC 41 reads the data from the specified location of the memory 42 and sends it to the RNIC 311, and the RNIC 311 writes the received data to the memory 312.
  • This process is referred to as the storage node reading data from the host 40 in an RDMA read manner.
  • The RNIC 311 of the storage node 31 can also send a request to the RNIC 41 of the host 40 to write data to a specified location of the memory 42; the RNIC 41 buffers the data carried by the request and writes it to the location in the memory 42 specified by the request. This process is referred to as the storage node writing data to the host 40 in an RDMA write manner.
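  • The RDMA read and RDMA write semantics of FIG. 6 can be illustrated with two byte buffers standing in for the memory 42 and the memory 312. The `Rnic` class and the `rdma_read` and `rdma_write` functions are hypothetical names for this sketch; they imitate only the direction of the data movement, not a real RNIC.

```python
class Rnic:
    """Toy RNIC that can access the memory of the device it belongs to."""

    def __init__(self, memory):
        self.memory = memory

    def serve_read(self, offset, length):
        # Read from local memory on behalf of a remote requester.
        return bytes(self.memory[offset:offset + length])

    def serve_write(self, offset, data):
        # Write data carried by a remote request into local memory.
        self.memory[offset:offset + len(data)] = data


def rdma_read(local, remote, remote_offset, length, local_offset):
    """The local side pulls `length` bytes out of the remote memory."""
    data = remote.serve_read(remote_offset, length)
    local.memory[local_offset:local_offset + length] = data


def rdma_write(local, remote, local_offset, length, remote_offset):
    """The local side pushes `length` bytes into the remote memory."""
    data = bytes(local.memory[local_offset:local_offset + length])
    remote.serve_write(remote_offset, data)


host_rnic = Rnic(bytearray(b"--PAYLOAD-------"))  # stands in for RNIC 41 and memory 42
node_rnic = Rnic(bytearray(16))                   # stands in for RNIC 311 and memory 312

# Storage node reads 7 bytes from offset 2 of the host memory (RDMA read).
rdma_read(node_rnic, host_rnic, remote_offset=2, length=7, local_offset=0)
# Storage node writes the same 7 bytes to offset 9 of the host memory (RDMA write).
rdma_write(node_rnic, host_rnic, local_offset=0, length=7, remote_offset=9)
```

  • In both calls the storage node initiates the transfer and names the location in the host memory, which is why the host must tell the other side where its target storage area is.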
  • FIG. 7 is a schematic flowchart of a data processing method based on an RDMA connection, where the method includes the following steps:
  • Step 301: The host sends an operation request to the controller through the second connection, where the operation request includes an operation type, an identifier of the target data, and a location parameter of the target storage area specified by the host in the memory for the target data.
  • The location parameter of the target storage area identifies the storage location of the target data in the host memory and may be expressed as an offset in the memory.
  • The operation request may further include information such as the length of the target data and a remote key (Rkey).
  • Step 302: The controller receives the operation request, and determines a target storage node according to the identifier of the target data in the operation request.
  • When the operation request is a write request, the target storage node is the storage node to which the target data is to be written; when the operation request is a read request, the target storage node is the storage node that stores the target data.
  • Step 303: The controller sends an indication message to the target storage node, where the indication message includes an identifier of the data block of the target data corresponding to the target storage node, an operation type, and a location parameter of the target storage area.
  • Step 304: In response to the indication message, the target storage node sends target data to the host, or acquires target data from the host, through its RDMA connection with the host according to the location parameter of the target storage area.
  • When the operation type in the indication message is a write operation, the target storage node reads the target data from the target storage area in the host memory in the foregoing RDMA read manner. The target storage node then writes the read target data to the disk of the target storage node.
  • When the operation type in the indication message is a read operation, the target storage node writes the target data to the target storage area in the host memory in the foregoing RDMA write manner. The host then writes the target data stored in the target storage area of the host's memory to the host's disk.
  • Step 305: After it finishes sending the target data to the host, or acquiring the target data from the host, through the RDMA connection, the target storage node sends a data transmission success message to the controller.
  • When the operation type is a write operation, the target storage node sends a data transmission success message to the controller after reading the data of the target storage area in the host memory in the RDMA read manner.
  • When the operation type is a read operation, the target storage node sends a data transmission success message to the controller after writing the stored target data to the target storage area in the host memory in the RDMA write manner.
  • Step 306: After receiving the data transmission success message sent by the target storage node, the controller sends an operation success message to the host.
  • In this method, the data to be operated on is transmitted over the RDMA connection between the host and the storage node without passing through the controller. This not only reduces the bandwidth burden and computing burden of the controller but also achieves data transmission over a high-speed RDMA connection, so the data transmission rate is higher and the whole operation takes less time.
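  • Steps 301 to 306 can be sketched end to end for a single target storage node. The `OperationRequest` fields, the message strings, and the way the storage node holds a reference to the host memory (standing in for its RDMA connection) are all assumptions made for illustration; the Rkey and the lookup in step 302 are omitted.

```python
from dataclasses import dataclass


@dataclass
class OperationRequest:
    op_type: str  # "write" or "read"
    data_id: str  # identifier of the target data
    offset: int   # location parameter of the target storage area in host memory
    length: int   # length of the target data


class StorageNode:
    def __init__(self, host_memory):
        self.host_memory = host_memory  # stands in for the RDMA connection to the host
        self.disk = {}

    def on_indication(self, request):
        end = request.offset + request.length
        if request.op_type == "write":
            # Write operation: RDMA read from the host memory, then persist.
            self.disk[request.data_id] = bytes(self.host_memory[request.offset:end])
        else:
            # Read operation: RDMA write of the stored data into the host memory.
            self.host_memory[request.offset:end] = self.disk[request.data_id]
        return "data transmission success"  # step 305


def controller_handle(request, node):
    # Steps 302 and 303 collapsed: a single known target node gets the indication.
    reply = node.on_indication(request)
    # Step 306: report to the host only after the node confirms the transfer.
    return "operation success" if reply == "data transmission success" else "operation failed"


host_memory = bytearray(b"0123456789")
node = StorageNode(host_memory)

write_status = controller_handle(OperationRequest("write", "data-1", 2, 4), node)
host_memory[2:6] = b"xxxx"  # the host reuses the area after the write completes
read_status = controller_handle(OperationRequest("read", "data-1", 2, 4), node)
```

  • The controller function never touches `host_memory`; it only passes the location parameter along, which is the point of the scheme.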
  • In some cases, the target data is divided into a plurality of data blocks, and the storage areas in the disks of the target storage nodes corresponding to the different data blocks are not contiguous.
  • Scenario 1: the number of target storage nodes corresponding to the target data is greater than 1, and different data blocks of the target data may correspond to different target storage nodes.
  • For example, the target data is divided into a first data block and a second data block, where the first data block corresponds to the first storage node and the second data block corresponds to the second storage node.
  • Scenario 2: the number of target storage nodes corresponding to the target data is 1, but the storage areas in the disk of the target storage node corresponding to different data blocks of the target data are not contiguous.
  • For example, the target data is divided into a first data block and a second data block, where the first data block corresponds to a first storage area in the target storage node, the second data block corresponds to a second storage area in the target storage node, and the first storage area is not contiguous with the second storage area.
  • The controller further determines, for each data block of the target data, a corresponding sub-storage area within the target storage area in the host memory, and then instructs the target storage node to acquire the data block of the target data from the sub-storage area, or to write the data block of the target data to the sub-storage area in the host memory.
  • The following takes scenario 1 as an example.
  • In one manner, the controller determines that the first data block of the target data is stored by the first storage node, and divides, from the storage area of the target data in the memory, a first sub-storage area for storing the first data block; it determines that the second data block of the target data is stored by the second storage node, and divides, from the storage area of the target data in the memory, a second sub-storage area for storing the second data block. The first sub-storage area and the second sub-storage area are each contiguous storage areas.
  • In another manner, the controller determines that the first data block of the target data is stored by the first storage node, and divides, from the storage area of the target data in the memory, a first sub-storage area for storing the first data block, the first sub-storage area being composed of a plurality of non-contiguous storage areas.
  • Likewise, the controller determines that the second data block of the target data is stored by the second storage node, and divides, from the storage area of the target data in the memory, a second sub-storage area for storing the second data block, the second sub-storage area being composed of a plurality of non-contiguous storage areas.
  • In scenario 2, the manner in which the controller determines the sub-storage area in the host's memory for each data block of the target data is consistent with scenario 1 and is not repeated here.
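  • For the case in which each sub-storage area is contiguous, the division of the target storage area can be sketched as a running offset. The `divide_target_area` function and its arguments are hypothetical; the description does not state the rule by which the controller sizes or orders the sub-storage areas, so a back-to-back layout is assumed here.

```python
def divide_target_area(base_offset, block_lengths):
    """Lay the data blocks out back to back inside the target storage area.

    base_offset:   offset of the target storage area in the host memory
    block_lengths: ordered list of (block id, length in bytes)
    Returns {block id: (offset, length)} for each sub-storage area.
    """
    areas = {}
    cursor = base_offset
    for block_id, length in block_lengths:
        areas[block_id] = (cursor, length)
        cursor += length
    return areas


# A 1536-byte target storage area at offset 4096, split between two data blocks.
areas = divide_target_area(4096, [("block-1", 1024), ("block-2", 512)])
```

  • The resulting offsets and lengths are what the controller would place in the indication messages as the location parameters of the sub-storage areas.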
  • After determining the sub-storage area in the memory of the host that corresponds to each data block of the target data, the controller sends an indication message to the target storage node, where the indication message includes the identifier of the data block of the target data corresponding to the target storage node, the operation type, and the location parameter of the sub-storage area determined for the data block. The target storage node then responds to the indication message and, according to the location parameter of the sub-storage area, sends the data block of the target data to the host or acquires the data block of the target data from the host through its RDMA connection with the host.
  • After it finishes sending the data block of the target data to the host, or acquiring the data block of the target data from the host, through the RDMA connection, the target storage node sends a data transmission success message to the controller. Finally, after receiving the data transmission success messages sent by all the target storage nodes, the controller sends an operation success message to the host.
  • In this way, the controller can determine, for each data block of the target data, a corresponding sub-storage area in the memory of the host, so that the target storage node corresponding to the data block can acquire the data block from the host, or send the data block to the host, through the RDMA connection.
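  • The rule that the controller reports success to the host only after every target storage node has reported success can be sketched with a thread pool, which also imitates several storage nodes transferring their data blocks at the same time. The `run_operation` function, the message strings, and the `transfer` callback are hypothetical names for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def run_operation(indications, transfer):
    """Send every indication, wait for all results, then report one status.

    indications: {node name: indication dict}
    transfer:    callable(node name, indication) -> True on success
    """
    with ThreadPoolExecutor(max_workers=max(1, len(indications))) as pool:
        futures = {
            node: pool.submit(transfer, node, indication)
            for node, indication in indications.items()
        }
        results = {node: future.result() for node, future in futures.items()}
    # The "operation failed" status is an assumption; the text only names success.
    return "operation success" if all(results.values()) else "operation failed"


host_memory = bytearray(8)
node_blocks = {"node-1": b"ABCD", "node-2": b"EFGH"}
indications = {
    "node-1": {"offset": 0, "length": 4},
    "node-2": {"offset": 4, "length": 4},
}


def transfer(node, indication):
    # Each node writes its block into its own sub-storage area of the host memory.
    start = indication["offset"]
    host_memory[start:start + indication["length"]] = node_blocks[node]
    return True


status = run_operation(indications, transfer)
```

  • Because the sub-storage areas do not overlap, the two transfers can run concurrently without interfering with each other.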
  • The following describes the flow of the data processing method under the iSER protocol with reference to FIG. 9; the iSER protocol supports RDMA connections. The method includes the following steps:
  • Step 401: The host sends an iSER connection establishment request to the controller.
  • Step 402: The controller returns a connection establishment response to the host, and establishes an iSER connection with the host.
  • Step 403: The host creates a second QP and sends the parameters of the second QP to the controller through its iSER connection with the controller.
  • Step 404: The controller sends the parameters of the second QP to each storage node through its first connection with the storage node.
  • Step 405: The storage node creates a first QP, and sends the parameters of the first QP to the controller through the first connection.
  • Step 406: The controller sends the parameters of the first QP to the host through the iSER connection.
  • Step 407: The host associates its second QP with the first QP according to the parameters of the first QP.
  • Step 408: The storage node associates its first QP with the second QP according to the parameters of the second QP.
  • Step 409: The host sends a control request to the controller through the iSER connection.
  • The control request asks the controller to permit the host to send command requests, such as the operation request described below, to the controller in an RDMA manner.
  • Step 410: The controller returns a control response to the host through the iSER connection.
  • The control response indicates that the controller permits the host to send command requests to the controller in an RDMA manner.
  • Step 411: The host sends an operation request to the controller through the iSER connection.
  • The operation request includes an operation type, an identifier of the target data to be operated on, and a location parameter of the target storage area specified by the host in the memory for the target data.
  • Alternatively, the operation request may omit the location parameter of the target storage area, and the host sends the location parameter of the target storage area to the controller after sending the operation request.
  • Step 412: The controller determines the target storage nodes according to the operation request, and determines, according to the location parameter of the target storage area, a location parameter of the sub-storage area specified in the host memory for the data block corresponding to each target storage node.
  • The controller determines, according to the locally stored partition view or global view, which storage nodes store or should store the target data; the determined storage nodes are the target storage nodes, and each target storage node stores one or more data blocks of the target data. The controller then divides the storage area in the host memory used for storing the target data, and determines the sub-storage area specified in the host memory for the data block corresponding to each target storage node.
  • Step 413: The controller sends an indication message to each target storage node, where the indication message includes an identifier of the data block corresponding to the target storage node and a location parameter of the sub-storage area determined for the target storage node.
  • Step 414: In response to the indication message, the target storage node sends a data block of the target data to the host, or acquires a data block of the target data from the host, through its RDMA connection with the host.
  • When the operation type in the indication message is a write operation, the target storage node reads data from the sub-storage area in an RDMA read manner according to the location parameter of the sub-storage area in the indication message. The read data is written to the memory of the target storage node and then from the memory to the disk, and the storage location of the data on the disk is recorded.
  • When the operation type in the indication message is a read operation, the target storage node writes the data block of the target data that it stores to the sub-storage area in an RDMA write manner according to the location parameter of the sub-storage area in the indication message.
  • Step 415: After it finishes sending the data block of the target data to the host, or acquiring the data block of the target data from the host, through the RDMA connection, the target storage node sends a data transmission success message to the controller.
  • Step 416: After receiving the data transmission success message sent by each target storage node, the controller sends an operation success message to the host through the iSER connection.
  • In this method, data between the host and the storage nodes is transmitted over the RDMA connection instead of being forwarded via the controller. This not only reduces the load on the controller, but also avoids the prior-art problem in which controller overload caused by forwarding data slows down data transmission between the host and the storage nodes.
  • In addition, different storage nodes can transmit data with the host simultaneously through their RDMA connections, further increasing the rate of data transmission between the host and the storage nodes.
  • the NOF protocol also supports the RDMA connection, and the flow of the data processing method under the NOF protocol is consistent with the flow of the above steps 401 to 416.
  • the NOF protocol further supports a target storage area that the host configures as non-contiguous in memory. The host encapsulates the location parameter of the target storage area in the form of a scatter-gather list (SGL) to obtain an SGL packet, and writes the SGL packet to the controller's memory in an RDMA write manner.
  • the controller parses the SGL packet, determines the location parameter of the sub-storage area that is specified, within the target storage area of the host, for the data block corresponding to each target storage node, encapsulates the location parameter of the sub-storage area in the form of an SGL, and sends it to the storage node. The storage node parses the SGL packet to obtain the location parameter of the sub-storage area.
  • in this way, data can be transmitted between the storage node and the host through the RDMA connection even when the host stores the data in discrete storage areas of its memory, so that the host can make full use of its memory and improve memory utilization.
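  • The way a controller might carve a non-contiguous target storage area into per-node sub-storage areas can be sketched as follows. This is only an illustration under stated assumptions: an SGL is modelled as a list of (address, length) pairs, the data blocks are laid out in order across the target storage area, and the name split_sgl is hypothetical. The actual SGL encoding of the NOF protocol is not modelled.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start address, length in bytes)


def split_sgl(sgl: List[Segment], block_sizes: List[int]) -> List[List[Segment]]:
    """Split a host SGL into one sub-SGL per data block, in order."""
    if sum(length for _, length in sgl) != sum(block_sizes):
        raise ValueError("SGL length must equal the total size of the data blocks")
    result: List[List[Segment]] = []
    seg_index, seg_offset = 0, 0  # current position inside the host SGL
    for size in block_sizes:
        sub: List[Segment] = []
        remaining = size
        while remaining > 0:
            addr, length = sgl[seg_index]
            take = min(remaining, length - seg_offset)
            if take:
                sub.append((addr + seg_offset, take))
            remaining -= take
            seg_offset += take
            if seg_offset == length:
                # Current segment is used up; move on to the next one.
                seg_index += 1
                seg_offset = 0
        result.append(sub)
    return result
```

  • For a target storage area made of a 300-byte segment and a 700-byte segment and two 500-byte data blocks, the first sub-SGL spans both segments and the second lies entirely inside the second segment.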
  • the third connection is a TCP/IP connection.
  • the host and the storage node both include a communication port that supports the TCP/IP protocol, and the communication port of the host and the communication port of the storage node can establish a communication link, that is, a third connection.
  • the host and the storage node can transmit data through the third connection.
  • FIG. 10 is a schematic flowchart of a method for transmitting data between a host and a storage node when the third connection is a TCP/IP connection, and the method includes the following steps:
  • Step 501 The host sends an operation request to the controller by using the second connection, where the operation request includes an identifier of the target data and an operation type.
  • Step 502 The controller determines, in response to the operation request, a target storage node that stores target data, and determines a data block corresponding to each target storage node.
  • Step 503 The controller sends an indication message to each target storage node by using the first connection, where the indication message includes a communication address of the host, and an identifier of the data block corresponding to the target storage node.
  • the identification of the data block is used to cause the controller to determine the storage location of the target data at the storage node in the distributed system.
  • the identifier may further include verification information, such as identity verification and data access permission information.
  • Step 504 The controller sends, by using the second connection, a communication address of each target storage node and an identifier of a data block corresponding to each target storage node to the host.
  • the above communication address may be an IP address or a media access control (MAC) address.
  • the controller can obtain the communication address of the host from the operation request sent by the host, and the controller can locally store the communication address of each storage node to which it is connected.
  • Step 505 Each target storage node and the host transmit, according to the communication address of the other party, the data block corresponding to the target storage node in the target data in a TCP/IP packet manner.
  • step 505 can have multiple implementations, including but not limited to the following:
  • Mode 1: the controller instructs the target storage node to initiate data transmission with the host.
  • when the operation type in the operation request is a read operation, after determining each target storage node and the data block corresponding to each target storage node, the controller sends an indication message to each target storage node. The indication message includes the communication address of the host and the identifier of the data block that the storage node needs to return.
  • the storage node sends a TCP/IP packet destined for the host in response to the indication message, and the TCP/IP packet includes the data block indicated by the indication message. Because the controller has sent the communication address of the target storage node and the identifier of the data block corresponding to the target storage node to the host, the host, on receiving the TCP/IP packet sent by the target storage node, can confirm that the packet is a legal packet, determine that the data block included in the TCP/IP packet is a data block of the target data, and obtain the data block from the TCP/IP packet.
  • After receiving the TCP/IP packets sent by all target storage nodes, the host obtains the target data, which completes reading the target data from the storage nodes.
  • when the operation type in the operation request is a write operation, the storage node, in response to the indication message sent by the controller, sends a TCP/IP read request packet to the host, where the TCP/IP read request packet includes the identifier of the data block that the target storage node is responsible for storing.
  • the host, in response to the TCP/IP read request packet, sends a packet carrying the data block corresponding to the target storage node to the storage node.
  • Mode 2: the controller instructs the host to initiate data transmission with the target storage node.
  • the controller returns the communication address of each target storage node and the identifier of the data block that the target storage node is responsible for transmitting to the host. After receiving the above information, the host actively sends a data transmission request message to the target storage node.
  • the controller also sends an indication message to each target storage node, where the indication message includes the communication address of the host and the identifier of the data block to be operated on. This indication message does not instruct the target storage node to actively return data to the host; rather, it informs the target storage node of a legal data transmission requirement, so that when the target storage node receives the data transmission request message from the host, it identifies the message as legal and responds to it by performing data transmission with the host.
  • in this way, the host and the storage node can transmit the data to be operated on directly through the TCP/IP connection without forwarding it through the controller. This avoids the slow host read/write rate caused when the controller's bandwidth cannot meet the transmission demand, and likewise the slow host read/write rate caused when the controller's computing power cannot meet large data processing requirements.
  • after step 505, the following steps are further included:
  • Step 506 After completing transmission of its data block of the target data through the TCP/IP connection, the target storage node sends a data transmission success response to the controller.
  • Step 507 After receiving the data transmission success response sent by all the target storage nodes, the controller sends an operation success message to the host.
  • the controller may send the operation success message to inform the host that the data has been read or written successfully, so that the host can confirm the success of the read or write operation in time.
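  • The controller's bookkeeping in steps 503, 504, 506 and 507 can be sketched as follows. This is a simplified in-memory model, not the patented implementation: the names Indication, plan_messages and CompletionTracker, and the use of IP address strings as communication addresses, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Indication:
    node_addr: str  # storage node that receives this message over the first connection
    host_addr: str  # communication address of the host
    block_id: str   # identifier of the data block this node handles


def plan_messages(host_addr: str, placement: Dict[str, str]
                  ) -> Tuple[List[Indication], List[Tuple[str, str]]]:
    """Build the step-503 indication messages and the step-504 notice for the host.

    placement maps each target storage node address to its data block identifier.
    """
    indications = [Indication(node, host_addr, block) for node, block in placement.items()]
    host_notice = [(node, block) for node, block in placement.items()]
    return indications, host_notice


class CompletionTracker:
    """Tracks step-506 responses; step 507 fires once every target node has answered."""

    def __init__(self, target_nodes: Iterable[str]) -> None:
        self._pending = set(target_nodes)

    def on_success(self, node_addr: str) -> bool:
        """Record one success response; return True when the operation is complete."""
        self._pending.discard(node_addr)
        return not self._pending
```

  • The tracker returns True only for the last outstanding node, which is the point at which the controller would send the operation success message to the host.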
  • Step 601 The host sends an operation request to the controller by using the second connection, where the operation request includes an identifier of the target data and an operation type, and the operation type is a read operation;
  • Step 602 The controller determines a target storage node that stores the target data, and determines a data block of the target data stored by each target storage node.
  • Step 603 The controller sends an indication message to each target storage node by using the first connection, where the indication message includes a communication address of the host, a communication address of the controller, and an identifier of the data block stored by the target storage node.
  • the controller determines the data receiving amount of the host according to the TCP window size of the host, and determines, according to the data receiving amount, the data block of the target data that the target storage node carries in each TCP/IP packet, where the size of the data block is not greater than the TCP window size of the host. The controller then generates the indication message including the identifier of the data block, and sends the indication message to the target storage node.
  • the controller can obtain the TCP window of the host in multiple ways. For example, when the second connection between the host and the controller is a TCP/IP connection (such as an iSCSI connection), the host and the controller negotiate the size of the TCP window when establishing the second connection, and the controller thereby learns the TCP window of the host.
  • alternatively, the host carries the TCP window size currently available to the host in the TCP/IP packets it sends to the controller, and the controller determines the TCP window of the host from those TCP/IP packets.
  • for example, the controller determines, according to the TCP window, that a 1000-byte data block is sent to the host each time. The controller determines the target storage node where the first 1000-byte data block of the target data is located and sends an indication message to that target storage node, instructing it to send the 1000-byte data block to the host. After the 1000-byte data block is successfully sent, the controller instructs the target storage node where the second 1000-byte data block of the target data is located to send the second 1000 bytes of data to the host, and so on, until the target storage nodes have sent all the target data to the host.
  • because the available TCP window of the host changes dynamically, the length of the data block that the controller determines to send to the host can differ each time. For example, the controller first determines that the first 1000-byte data block of the target data is to be sent to the host and instructs the target storage node storing it to send that data block; the controller then determines that an 800-byte data block is to be sent next, and instructs the target storage node storing the 800 bytes that follow the first 1000-byte data block to send that 800-byte data block to the host.
  • the controller can also allocate the TCP window of the host to multiple target storage nodes at the same time. For example, suppose the TCP window size of the host is 1000 bytes, the first data block of the target data, whose size is 450 bytes, is stored in a first target storage node, and the second data block of the target data, whose size is 500 bytes, is stored in a second target storage node. The controller may send indication messages to the first target storage node and the second target storage node, instructing the first target storage node to send the first data block to the host and the second target storage node to send the second data block to the host. Because the sum of the sizes of the first data block and the second data block is not larger than the TCP window size of the host, the host can successfully receive both data blocks.
  • the TCP window of the host can be fully utilized to improve the efficiency of data transmission.
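  • The window allocation described above can be illustrated with a small scheduling sketch. It assumes, for simplicity, a fixed window size and data blocks that are already cut to sizes no larger than the window (the text notes that the real window changes dynamically); the name schedule_rounds is hypothetical.

```python
from typing import List, Tuple

Block = Tuple[str, int]  # (storage node name, data block size in bytes)


def schedule_rounds(blocks: List[Block], window: int) -> List[List[Block]]:
    """Group data blocks, in order, into rounds whose total size fits the host window."""
    rounds: List[List[Block]] = []
    current: List[Block] = []
    used = 0
    for node, size in blocks:
        if size > window:
            raise ValueError(f"block of {size} bytes exceeds the {window}-byte window")
        if used + size > window:
            # The window is full: close this round and start the next one.
            rounds.append(current)
            current, used = [], 0
        current.append((node, size))
        used += size
    if current:
        rounds.append(current)
    return rounds
```

  • With a 1000-byte window, the 450-byte and 500-byte blocks from the example share one round, and any further block waits for the next round, that is, until the host has acknowledged the first one.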
  • Step 604 The target storage node sends a TCP/IP packet carrying the data block of the target data with the communication address of the controller as the source address and the communication address of the host as the destination address in response to the indication message.
  • Figure 12 is a schematic diagram of a network layer packet header and a transport layer packet header of a TCP/IP packet sent by a target storage node.
  • the target storage node sets the source IP address in the network layer packet header to the IP address of the controller, and sets the source port number, TCP sequence number, TCP acknowledgment number, and send window size in the transport layer packet header to the parameters of the controller.
  • the controller may add the above parameters to the indication message sent to the target storage node to enable the target storage node to obtain the above parameters.
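  • How the target storage node might fill in the two headers from the parameters supplied by the controller can be sketched as follows. The headers are modelled as plain dictionaries rather than real packets, and the names ControllerTcpParams and build_headers, as well as the host port argument, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

Header = Dict[str, Union[str, int]]


@dataclass(frozen=True)
class ControllerTcpParams:
    """Controller-side connection parameters carried in the indication message."""
    ip: str
    port: int
    seq: int     # TCP sequence number the controller would use next
    ack: int     # TCP acknowledgment number the controller would send
    window: int  # send window size of the controller


def build_headers(params: ControllerTcpParams, host_ip: str, host_port: int
                  ) -> Tuple[Header, Header]:
    """Return (network header, transport header) as the storage node would fill them."""
    network: Header = {"src_ip": params.ip, "dst_ip": host_ip}
    transport: Header = {
        "src_port": params.port,
        "dst_port": host_port,
        "seq": params.seq,
        "ack": params.ack,
        "window": params.window,
    }
    return network, transport
```

  • Every source-side field comes from the controller's parameters, so from the host's point of view the packet belongs to its existing connection with the controller.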
  • Step 605 The host receives the TCP/IP packet whose source address is the communication address of the controller sent by the target storage node, and acquires the data block of the target data from the TCP/IP packet.
  • the controller instructs one or more of all target storage nodes to send data blocks of the target data to the host at a time, and the sum of the sizes of the data blocks is not greater than the TCP window size of the host.
  • After receiving the above data blocks, the host returns a received-data response to the controller.
  • after receiving the received-data response, the controller determines that the indication message sent to the target storage node has been successfully acted on, and continues to send indication messages to one or more of the target storage nodes, instructing them to each send a data block of the target data to the host, where the sum of the sizes of all these data blocks is not larger than the TCP window size of the host.
  • the controller can thus control the target storage nodes to send the data blocks of the target data to the host in sequence according to the received-data responses returned by the host, thereby ensuring the accuracy of the data transmission.
  • if the controller fails to receive the received-data response from the host within a preset time period after sending the indication message to the target storage node, the controller sends a retransmission instruction to the storage node, instructing the storage node to resend the data to the host. This retransmission mechanism ensures that the data transmission proceeds smoothly.
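  • The acknowledge-or-retransmit loop on the controller can be sketched as below. The two callbacks stand in for the real messaging and are hypothetical, as is the max_attempts limit, which the text does not specify and which is added here only so that the sketch terminates.

```python
from typing import Callable


def send_with_retransmit(
    instruct: Callable[..., None],
    wait_for_host_ack: Callable[[float], bool],
    timeout_s: float,
    max_attempts: int = 3,
) -> int:
    """Instruct a storage node to send a block and retry until the host acknowledges.

    instruct(retransmit=...) sends the indication message (first attempt) or the
    retransmission instruction (later attempts). wait_for_host_ack(timeout_s)
    returns True if the host's received-data response arrived within the preset
    time period. Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        instruct(retransmit=(attempt > 1))
        if wait_for_host_ack(timeout_s):
            return attempt
    raise TimeoutError(f"no received-data response after {max_attempts} attempts")
```

  • A caller would pass functions that send the indication message over the first connection and block on the second connection until the host responds or the timer expires.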
  • with the solution in steps 601 to 606, the storage node modifies the source address of the packets it sends, posing as the controller when sending TCP/IP packets that carry data blocks of the target data to the host, and the host identifies each such TCP/IP packet as a data block of the target data returned by the controller.
  • because the host identifies the TCP/IP packets sent by the storage node as packets sent by the controller, the storage node can send data directly to the host through TCP/IP without any change to the existing host and without forwarding the data via the controller, which reduces the time taken to send data from the target storage node to the host.
  • FIG. 13 shows a device 50 for data processing according to an embodiment of the present invention.
  • the device 50 corresponds to the controller 20 in the data processing system shown in FIG. 3, and is used to implement the functions of the controller in the data processing method shown in FIG. 4 to FIG. 12.
  • the device is configured to manage the at least two storage nodes, and the device 50 includes:
  • the first receiving module 51 is configured to receive an operation request sent by the host, where the operation request includes an identifier of the target data to be operated and an operation type;
  • a determining module 52 configured to determine at least one target storage node from the at least two storage nodes according to the identifier of the target data
  • a first sending module 53 configured to send an indication message to the at least one target storage node, where the indication message is used to indicate that the at least one target storage node sends the target data to the host by using a connection with the host Or acquiring the target data from the host.
  • the apparatus 50 of the embodiment of the present invention may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the determining module 52 is configured to determine, according to the identifier of the target data, that the target storage node includes a first storage node that stores a first data block of the target data, and a second storage node that stores a second data block of the target data;
  • the first sending module 53 is specifically configured to: send a first indication message to the first storage node, where the first indication message includes an identifier of the first data block and is used to instruct the first storage node to send the first data block to the host or acquire the first data block from the host; and send a second indication message to the second storage node, where the second indication message includes an identifier of the second data block and is used to instruct the second storage node to send the second data block to the host or acquire the second data block from the host.
  • the operation request further includes a location parameter of the target storage area specified by the host in the memory for the target data;
  • the determining module 52 is further configured to determine, according to the location parameter of the target storage area, the location parameter of the sub-storage area that is specified, within the target storage area, for the data block of the target data corresponding to each target storage node in the at least one target storage node; the indication message includes an identifier of the data block, the operation type, and the location parameter of the sub-storage area, and is used to instruct the target storage node that receives it to write the data block of the target data into the sub-storage area in the host memory through an RDMA connection with the host, according to the location parameter of the sub-storage area, when the operation type is a read operation, or to read the data block of the target data from the sub-storage area through the RDMA connection when the operation type is a write operation.
  • the device 50 further includes a second receiving module 54 and a second sending module 55.
  • the second receiving module 54 is configured to receive, from any one of the storage nodes, the parameters of a first QP created by that storage node;
  • a second sending module 55, configured to send the parameters of the first QP to the host;
  • the first receiving module 51 is further configured to receive, from the host, the parameters of a second QP created by the host;
  • the first sending module 53 is further configured to send the parameters of the second QP to that storage node.
  • the connection between the device 50 and the host is a TCP/IP connection;
  • the indication message includes the operation type, a communication address of the host, and an identifier of the target data; the device 50 further includes:
  • a second sending module 55, configured to send a third indication message to the host, where the third indication message includes a communication address of the target storage node, so that the host, according to the communication address of the target storage node, sends the target data to the target storage node or acquires the target data from the target storage node through a TCP/IP connection with the target storage node.
  • the connection between the device 50 and the host is a TCP/IP connection, and the operation type is a read operation;
  • the indication message includes the operation type, a communication address of the host, and a communication address of the device 50, and is used to instruct the target storage node to send a TCP/IP packet carrying the target data with the communication address of the device 50 as the source address and the communication address of the host as the destination address.
  • the determining module 52 is further configured to: determine a data receiving amount of the host according to the TCP window size of the host; determine, according to the data receiving amount, the data block of the target data that each target storage node of the at least one target storage node carries in each TCP/IP packet; and generate the indication message including an identifier of the data block, where the indication message is used to instruct the target storage node that receives it to send the data block to the host.
  • the implementation of each module of the device 50 is the same as the implementation of the steps performed by the controller in the data processing method corresponding to FIG. 4 to FIG. 12, and is not repeated here.
  • FIG. 14 shows a data processing apparatus 60 for implementing the functions of the storage node in the data processing method of FIG. 4 to FIG. 12 according to an embodiment of the present invention.
  • the device 60 is communicatively coupled to a controller for managing the device 60; the device 60 includes:
  • the receiving module 61 is configured to receive an indication message sent by the controller, where the indication message includes an identifier and an operation type of target data to be operated by the host;
  • the transmitting module 62 is configured to send the target data to the host or acquire the target data from the host by using a connection with the host according to the indication message.
  • the device 60 further includes a connection module 63, configured to: create a first QP, where the first QP includes a first sending queue SQ and a first receiving queue RQ; send the parameters of the first QP to the controller; receive, from the controller, the parameters of a second QP created by the host, where the second QP includes a second sending queue SQ and a second receiving queue RQ; and, according to the parameters of the first QP and the parameters of the second QP, bind the first SQ to the second RQ and bind the first RQ to the second SQ, to establish an RDMA connection with the host.
  • the indication message includes an identifier of the target data, the operation type, and a location parameter of a target storage area in the memory of the host;
  • the transmission module 62 is configured to: when the operation type is a read operation, write the target data into the target storage area in the host memory through an RDMA connection with the host; and when the operation type is a write operation, read the target data from the target storage area in the host memory through the RDMA connection with the host, and store the target data.
  • the indication message includes the operation type, a communication address of the host, a communication address of the controller, and an identifier of the target data;
  • the transmission module 62 is configured to send a TCP/IP packet carrying the target data by using a communication address of the controller as a source address and a communication address of the host as a destination address.
  • the implementation of each module of the device 60 is the same as the implementation of the steps performed by the storage node in the data processing method corresponding to FIG. 4 to FIG. 12, and details are not described herein.
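  • The queue binding performed by the connection module 63 described above can be pictured with a toy in-memory model. It is not an RDMA API: the class QueuePair, its fields, and the functions bind and is_connected are hypothetical, and the QP parameters are reduced to a single QP number that the controller is assumed to relay between the two sides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueuePair:
    owner: str
    qp_num: int
    sq_bound_to_rq_of: Optional[int] = None  # remote QP whose RQ receives what our SQ sends
    rq_bound_to_sq_of: Optional[int] = None  # remote QP whose SQ feeds our RQ


def bind(first: QueuePair, second: QueuePair) -> None:
    """Bind first.SQ to second.RQ and first.RQ to second.SQ, and mirror it on the peer."""
    first.sq_bound_to_rq_of = second.qp_num
    first.rq_bound_to_sq_of = second.qp_num
    second.sq_bound_to_rq_of = first.qp_num
    second.rq_bound_to_sq_of = first.qp_num


def is_connected(a: QueuePair, b: QueuePair) -> bool:
    """True when both directions of the pair are bound to each other."""
    return (a.sq_bound_to_rq_of == b.qp_num and a.rq_bound_to_sq_of == b.qp_num
            and b.sq_bound_to_rq_of == a.qp_num and b.rq_bound_to_sq_of == a.qp_num)
```

  • In the model, the storage node's QP and the host's QP only become a usable connection once each side has learned the other's QP number, which is the information the controller forwards.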
  • FIG. 15 shows a data processing device 70 for implementing the functions of the controller in the data processing method shown in FIG. 4 to FIG. 12 according to an embodiment of the present invention.
  • the device 70 includes a processor 71, a memory 72, a communication interface 73, and a bus 74.
  • the processor 71, the memory 72, and the communication interface 73 are connected by the bus 74 and complete communication with each other.
  • the memory 72 is configured to store computer-executable instructions. When the device 70 is running, the processor 71 executes the computer-executable instructions in the memory 72 to perform the steps performed by the controller in the data processing method corresponding to FIG. 4 to FIG. 12.
  • the processor 71 may include a processing unit or a plurality of processing units.
  • the processor 71 can be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, such as one or more digital signal processors (DSP) or one or more field programmable gate arrays (FPGA).
  • the above-mentioned memory 72 may include a storage unit, and may also include a plurality of storage units, and is used to store executable program code, parameters required for the operation of the device 70, data, and the like.
  • the memory 72 may include a random-access memory (RAM), and may also include a non-volatile memory (NVM) such as a disk memory, a flash, or the like.
  • the communication interface 73 may be an interface supporting the TCP/IP protocol in some embodiments, or an interface supporting the RDMA protocol in other embodiments.
  • the bus 74 may be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, or an extended industry standard architecture (EISA) bus.
  • the bus 74 can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one line is shown in the figure, but it does not mean that there is only one bus or one type of bus.
  • the device 70 for data processing according to the embodiment of the present invention may correspond to the device 50 in the embodiment of the present invention, and may correspond to the controller, that is, the execution body of the controller-side steps in the data processing method described in FIG. 4 to FIG. 12. The corresponding operation steps are not repeated here for brevity.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any other combination.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer program instructions When the computer program instructions are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present invention are generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in a computer readable storage medium or transferred from one computer readable storage medium to another computer readable storage medium. For example, the computer instructions can be transferred from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner or a wireless manner (e.g., infrared, wireless, microwave, etc.).
  • the computer readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that contains one or more sets of available media.
  • the usable medium may be a magnetic medium (eg, a floppy disk, a hard disk, a magnetic tape), an optical medium (eg, a DVD), or a semiconductor medium (eg, a solid state hard disk SSD), or the like.
  • the embodiment of the invention further provides a storage device for implementing the functions of the storage node in the data processing method described in FIG. 4 to FIG. 12 .
  • the structure of the storage device can continue to refer to FIG. 15, the device includes a processor, a memory, a communication interface, and a bus, and the processor, the memory, and the communication interface are connected through the bus and complete communication with each other.
  • the memory is used to store computer-executable instructions. When the device is running, the processor executes the computer-executable instructions in the memory to perform, by using hardware resources in the device, the steps performed by the storage node in the data processing method corresponding to FIG. 4 to FIG. 12.
  • the data processing method provided by the embodiment of the present invention can be applied to a non-distributed storage array in addition to the distributed storage system.
  • the following describes the technical solution of the host to read and write target data to the storage array in the prior art.
  • the system includes a host 810, a first controller 821, a second controller 822, and a plurality of disks, such as disks 830-1, 830-2, ..., 830-n.
  • the first controller 821 and the second controller 822 are configured to manage the plurality of disks and provide the host 810 with a service for accessing data on the disks.
  • the first controller 821 and the second controller 822 can work in an active/standby mode, in which only one of them acts as the primary controller at any given time.
  • the first controller 821 and the second controller 822 can also work without an active/standby division, in which case either controller can receive the operation request sent by the host and process it.
  • in the active/standby mode, the host 810 sends an operation request to the primary controller.
  • when the operation type of the operation request is a read operation, the primary controller reads the target data specified in the operation request from the disk according to a preset algorithm and returns the target data to the host; when the operation type of the operation request is a write operation, the primary controller determines, according to the preset algorithm, the target storage location on the disk for the target data carried in the operation request, and writes the target data to the determined target storage location.
  • when there is no active/standby division, the host 810 may send an operation request to any controller, or the host 810 may determine the controller that manages the logical unit number (LUN) to which the target data belongs and send the operation request to that controller.
  • the data transfer between that controller and the host is the same as the data transfer between the host and the primary controller described in the previous paragraph.
  • if the controller that receives the operation request sent by the host 810 (such as the first controller 821) cannot process the target data for the host, for example because the link between the first controller 821 and the disk has failed, the first controller 821 needs to forward the operation request of the host 810 to another controller that manages the plurality of disks, for example the second controller 822. The second controller 822 receives the operation request forwarded by the first controller 821 and processes the target data for the host 810.
  • the process in which the second controller 822 processes the target data for the host 810 includes the following. When the operation type of the operation request is a read operation, the second controller 822 reads the target data from at least one disk (such as the disk 830-1) and transmits the target data to the first controller 821, and the first controller 821 returns the target data to the host 810.
  • when the operation type of the operation request is a write operation, the first controller 821 obtains the target data from the host 810 and transmits it to the second controller 822, and the second controller 822 writes the target data to at least one disk.
  • thus, when the controller that receives the operation request of the host cannot process the target data for the host, another controller can process the target data instead, but the target data transmitted between the host and the controller responsible for processing it must still be forwarded by the controller that received the operation request, which results in a long data transmission path and a long data transmission time.
  • the present invention further provides a plurality of embodiments to solve the problem that, in a storage array, the controller that receives the host's operation request must forward the target data on behalf of the controller that processes the target data, which makes data transmission take a long time.
  • FIG. 17 is a system block diagram of a storage array according to an embodiment of the present invention, including at least two controllers and at least two disks.
  • the at least two controllers may be the first controller 911 and the second controller 912 shown in FIG. 17, and the at least two disks may be the disk 920-1, the disk 920-2, ..., the disk 920-n shown in FIG. 17.
  • Each of the at least two disks may be a mechanical disk, a solid state drive SSD, a solid state hybrid drive (SSHD), or the like.
  • Each LUN of the at least two disks may be managed by one controller, and the at least two controllers together with the at least two disks form a storage array.
  • the host 930 can access target data in the LUN managed by the controller through the controller.
  • Each of the at least two controllers establishes an RDMA connection with the host 930. The process of establishing the RDMA connection is similar to the method described in steps 201 through 208. The difference is that, in the method of steps 201 to 208, the controller relays between the host and the storage node the parameters of the QPs created for the RDMA connection, whereas in this embodiment the host 930 and the controller can send the parameters of the QPs they have created to each other over a non-RDMA connection (such as a TCP/IP connection), and then each associates its own QP with the QP created by the other party according to those parameters to establish the RDMA connection.
  • the embodiment of the present invention provides a data processing method.
  • the method includes the following steps:
  • Step 701 The host sends an operation request to the first controller, where the operation request includes an identifier of the target data to be operated and an operation type.
  • the host records the mapping relationship between data and the controller that manages the LUN to which the data belongs; the host determines, according to the mapping relationship, that the first controller manages the LUN to which the target data belongs, and therefore sends the operation request to the first controller.
  • the first controller is any controller in the storage array, and the host sends the operation request to any controller in the storage array.
  • the first controller is a primary controller of the storage array, and the host sends the operation request to the primary controller in the storage array.
  • Step 702 The first controller receives an operation request sent by the host, and determines, according to the identifier of the target data, that the second controller of the at least two controllers sends the target data to the host or acquires the target data from the host.
  • the first controller may determine in several ways that the second controller sends the target data to the host or obtains the target data from the host.
  • Mode 1: the first controller determines, according to a preset load balancing policy, the second controller that sends the target data to the host or acquires the target data from the host.
  • Mode 1 may include the following implementation manners:
  • the first controller is the primary controller. The primary controller is responsible for receiving the host's operation request and, according to a preset load balancing policy for the standby controllers (such as polling, or allocation to the standby controller with the smallest load), assigns a standby controller to respond to the operation request; the primary controller itself does not directly respond to the host's operation request.
  • the first controller, as the primary controller, responds to the host's operation request itself when its own load is not greater than a threshold, and when its own load is greater than the threshold, assigns a standby controller to respond to the operation request according to the loads of the standby controllers.
  • the controllers in the storage array are not divided into primary and standby, and each controller stores the loads of the other controllers. When the load of the first controller that receives the host's operation request exceeds the threshold, it sends the operation request to the other controller that currently has the smallest load.
  • the controllers in the storage array are not divided into primary and standby, and no controller knows the loads of the other controllers. When the load of the first controller that receives the host's operation request exceeds the threshold, it sends the operation request to any other controller. If the load of the second controller that receives the forwarded request does not exceed the threshold, the second controller responds to the operation request; if it does, the second controller forwards the operation request to another controller other than the first controller.
  • Mode 2: the first controller determines, according to the ownership of the LUN where the target data is located, the second controller that responds to the operation request.
  • After receiving the operation request, the first controller determines that the LUN where the target data is located is not managed by itself, identifies the second controller that actually manages the LUN, and determines that this second controller responds to the host's operation request. Because the second controller that manages the LUN where the target data is located can access the target data faster, having it send the target data to the host or acquire the target data from the host reduces the time consumed in transmitting the target data.
  • the first controller may also determine the controller that responds to the host's operation request according to mode 1 or mode 2 above in the following case.
  • the link between the first controller and the disks it manages fails, and the first controller cannot read or write the target data on the disks.
  • the first controller then determines, by mode 1 or mode 2 above, that another controller responds to the operation request, ensuring that the host's operation request continues to be answered correctly.
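The two selection modes of step 702 can be illustrated with a minimal sketch. The `Controller` class, the integer `load` metric, and the function names are assumptions made for illustration only; the source does not define how load is measured or how LUN ownership is stored.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    name: str
    load: int                                  # hypothetical load metric, e.g. queued requests
    luns: set = field(default_factory=set)     # LUNs this controller manages


def pick_by_load(first, others, threshold):
    """Mode 1 (one variant): the first controller answers while its load is
    within the threshold, otherwise the least-loaded other controller does."""
    if first.load <= threshold:
        return first
    return min(others, key=lambda c: c.load)


def pick_by_lun(controllers, lun):
    """Mode 2: the controller that manages the LUN holding the target data."""
    for c in controllers:
        if lun in c.luns:
            return c
    return None                                # no controller owns this LUN


c1 = Controller("c1", load=9, luns={"lun0"})
c2 = Controller("c2", load=3, luns={"lun1"})
c3 = Controller("c3", load=5, luns={"lun2"})
print(pick_by_load(c1, [c2, c3], threshold=8).name)
print(pick_by_lun([c1, c2, c3], "lun2").name)
```

In this sketch the overloaded `c1` hands the request to `c2` under mode 1, while under mode 2 a request for data in `lun2` goes to `c3`. The other mode 1 variants (polling, forwarding without knowledge of peer loads) would replace only the body of `pick_by_load`.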
  • Step 703 The first controller sends an indication message to the second controller.
  • the indication message includes an identifier of the target data, an operation type, and an identifier of the host, and is used to instruct the second controller to send target data to the host or acquire target data from the host through a connection with the host.
  • Step 704 The second controller sends the target data to the host or acquires the target data from the host by using a connection with the host according to the indication message.
  • the connection between the host and the second controller can be implemented in a variety of ways, such as an RDMA connection, a TCP/IP connection, or a Peripheral Component Interconnect Express (PCIe) connection.
  • the host and the second controller, which is responsible for providing the host with the service of reading and writing the target data, can transmit the target data over the connection established between the two, without the target data being forwarded by the first controller that received the host's operation request; this shortens the data transmission path and reduces the time consumed in transmitting the target data.
  • the first controller and the second controller respectively establish an RDMA connection with the host.
  • the operation request sent by the host to the first controller includes an identifier of the target data, an operation type, an identifier of the host, and a location parameter of the target storage area specified by the host in the memory for the target data.
  • the indication message sent by the first controller to the second controller also includes the identifier of the target data, the operation type, the identifier of the host, and the location parameter.
  • the process by which the second controller, according to the indication message, sends the target data to the host or acquires the target data from the host is as follows:
  • when the operation type is a read operation, the second controller determines the disk where the target data is located, reads the target data from the disk, determines the RDMA connection with the host according to the identifier of the host, and writes the target data to the target storage area of the host memory over that RDMA connection in RDMA write mode.
  • when the operation type is a write operation, the second controller determines the RDMA connection with the host according to the identifier of the host, reads the target data from the target storage area of the host memory over that RDMA connection in RDMA read mode, determines the storage location of the target data on the disk according to the preset algorithm, and writes the target data to the disk.
  • the host and the second controller responsible for providing the host with the read/write target data service can quickly transfer the target data through the RDMA connection established between the two, thereby realizing high-speed reading and writing of data.
  • the RDMA connection between the second controller and the host may be established before the second controller receives the indication message sent by the first controller, or may be established after receiving the indication message.
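The handling of the indication message by the second controller can be sketched as follows. This is a simulation under stated assumptions, not an RDMA implementation: host memory is modelled as a `bytearray`, the disks as a dictionary, and the `Indication` fields and `SecondController` class are hypothetical names chosen to mirror the message contents listed above.

```python
from dataclasses import dataclass


@dataclass
class Indication:
    data_id: str      # identifier of the target data
    op_type: str      # "read" or "write"
    host_id: str      # identifier of the host
    offset: int       # location parameter of the target storage area
    length: int       # size of the target storage area in bytes


class SecondController:
    def __init__(self, disk, host_memories):
        self.disk = disk                      # data_id -> bytes, stands in for the disks
        self.host_memories = host_memories    # host_id -> bytearray, stands in for RDMA-reachable memory

    def handle(self, msg):
        # Selecting the memory by host id models "determine the RDMA connection
        # with the host according to the identifier of the host".
        mem = self.host_memories[msg.host_id]
        if msg.op_type == "read":
            chunk = self.disk[msg.data_id][:msg.length]
            # Models an RDMA write into the host's target storage area.
            mem[msg.offset:msg.offset + len(chunk)] = chunk
        elif msg.op_type == "write":
            # Models an RDMA read from the host's target storage area.
            self.disk[msg.data_id] = bytes(mem[msg.offset:msg.offset + msg.length])
        else:
            raise ValueError(f"unknown operation type: {msg.op_type}")


mem = bytearray(16)
ctrl = SecondController({"d1": b"hello"}, {"h": mem})
ctrl.handle(Indication("d1", "read", "h", offset=4, length=5))
print(bytes(mem[4:9]))
```

Note the direction reversal the text describes: a host read is served by an RDMA write from the controller, and a host write by an RDMA read.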
  • the controllers of the storage array may forward the host's operation request multiple times, and the controller that finally receives the forwarded request transmits the target data with the host.
  • a third controller managing the at least two disks receives a second operation request sent by the host, the second operation request including an identifier of third target data to be operated on by the host and an operation type.
  • the third controller determines, in the manner described in step 702, that the first controller is to transmit the third target data with the host.
  • the third controller sends an indication message to the first controller, where the indication message includes an identifier and an operation type of the third target data to be operated by the host.
  • After receiving the indication message sent by the third controller, the first controller determines, in the manner described in step 702, that the second controller sends the third target data to the host or obtains the third target data from the host. The first controller sends an indication message to the second controller, where the indication message is used to instruct the second controller to transmit the third target data with the host.
  • the second controller transmits the third target data to the host through the connection with the host according to the indication message sent by the first controller.
  • even when the controllers forward the host's operation request two or more times, the host can still transmit the target data directly with the second controller responsible for providing the host with the service of reading and writing the target data, which shortens the data transmission path and reduces the time consumed in transmitting the target data.
  • an embodiment of the present invention further provides a data processing system, which, as shown in FIG. 17, includes the first controller 911, the second controller 912, and a plurality of disks, such as the disk 920-1, the disk 920-2, ..., the disk 920-n. The first controller 911 and the second controller 912 are used to perform the data processing method described in steps 701 to 704, so that the controller that responds to the host's operation request and the host transmit the target data directly over the connection between the two.
  • FIG. 19 shows an apparatus 80 for data processing, used to implement the function of the first controller in the method of FIG. 18. The apparatus 80 is connected to the second controller and at least two disks, and the apparatus 80 and the second controller are configured to manage the at least two disks.
  • the apparatus includes:
  • the receiving module 81 is configured to receive an operation request sent by the host, where the operation request includes an identifier of the target data to be operated and an operation type;
  • a determining module 82, configured to determine that the second controller of the at least two controllers sends the target data to the host or obtains the target data from the host;
  • the sending module 83 is configured to send an indication message to the second controller, where the indication message is used to instruct the second controller to send the target data to the host or acquire the target data from the host over its connection with the host.
  • the connection between the second controller and the host is an RDMA connection; the operation request further includes a location parameter of a target storage area in the host; the indication message includes the identifier of the target data, the operation type, and the location parameter, and instructs the second controller to: when the operation type is a read operation, acquire the target data from the at least two disks and write it to the target storage area in the host memory over the RDMA connection with the host; and when the operation type is a write operation, read the target data from the target storage area in the host memory over the RDMA connection with the host and write it to the at least two disks.
  • the determining module 82 is configured to determine, according to a preset load balancing policy, that the target data is sent by the second controller to the host or the target data is obtained from the host; or according to the The attribution of the logical unit number LUN where the target data is located is determined by the second controller to send the target data to the host or obtain the target data from the host.
  • the receiving module 81 is further configured to receive an indication message of a third controller that manages the at least two disks, where the indication message sent by the third controller includes second target data that is to be operated by the host. Identification and type of operation;
  • the determining module 82 is further configured to determine the second target data according to the indication message sent by the third controller;
  • the device also includes:
  • the transmission module 84 is configured to transmit the second target data to the host by using a connection with the host.
  • the receiving module 81 is further configured to receive a fifth indication message sent by a fourth controller that manages the at least two disks, where the fifth indication message includes an identifier of the third target data to be operated on by the host and an operation type;
  • the determining module 82 is further configured to determine, according to the identifier of the third target data, that the second controller sends the third target data to the host or obtains the third target data from the host;
  • the sending module 83 is further configured to send a sixth indication message to the second controller, where the sixth indication message is used to instruct the second controller to transmit the third target data with the host.
  • for the implementation of each module of the apparatus 80, reference may be made to the implementation of the steps performed by the first controller in the method shown in FIG. 18, which will not be described in detail herein.
  • FIG. 20 shows a device 90 for data processing according to an embodiment of the present invention.
  • the device 90 includes a processor 91, a memory 92, a first communication interface 93, a second communication interface 94, and a bus 95.
  • the processor 91, the memory 92, the first communication interface 93, and the second communication interface 94 are connected by the bus 95 and communicate with one another; the memory 92 is configured to store computer-executable instructions, the first communication interface is used to communicate with the host, and the second communication interface is used to communicate with at least two disks.
  • when the device 90 runs, the processor 91 executes the computer-executable instructions in the memory 92 to perform, using hardware resources in the device, the steps performed by the first controller in the data processing method shown in FIG. 18.
  • the physical implementation of the processor 91 can refer to the foregoing processor 71.
  • the physical implementation of the memory 92 can refer to the foregoing memory 72.
  • the physical implementation of the bus 95 can refer to the foregoing bus 74.
  • the foregoing first communication interface 93 may be an interface supporting the RDMA protocol, or an interface supporting the TCP/IP protocol, and in other embodiments, an interface supporting the RDMA protocol.
  • the second communication interface 94 may be an interface supporting the PCIe protocol, or may be any prior-art interface for accessing a disk.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any other combination.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • When the computer program instructions are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present invention are generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another; for example, the computer instructions can be transferred from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly (for example, infrared, radio, or microwave).
  • the computer readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that contains one or more sets of available media.
  • the usable medium may be a magnetic medium (eg, a floppy disk, a hard disk, a magnetic tape), an optical medium (eg, a DVD), or a semiconductor medium (eg, a solid state hard disk SSD), or the like.

Abstract

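The present application provides a data processing system, method, and corresponding apparatus, for solving the problem that data transmission between a host and a storage device takes a long time. The system includes a controller and at least two storage nodes. The controller is configured to: receive, over a second connection between the controller and the host, an operation request sent by the host, where the operation request includes an identifier of the target data to be operated on and an operation type; determine at least one target storage node from the at least two storage nodes according to the identifier of the target data; and send, over a first connection with the at least one target storage node, an indication message to the at least one target storage node, where the indication message instructs the target storage node to send the target data to the host or acquire the target data from the host. The at least one target storage node is configured to send the target data to the host or acquire the target data from the host over a third connection with the host according to the indication message.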

Description

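Data processing system, method, and corresponding apparatus
Technical Field
The present invention relates to the field of computer technologies, and in particular to a data processing system, method, and corresponding apparatus.
Background
In a computer storage system, a host can access data in a remote storage device through a controller that manages the remote storage device. With the access methods of the prior art, however, data transmission between the host and the storage device takes a long time, and the rate at which the host reads and writes data is easily affected.
Summary
The present application provides a data processing system, method, and corresponding apparatus, for solving the problem that data transmitted between a host and a storage device must be forwarded by the controller that receives the host's operation request, which makes data transmission between the host and the storage device take a long time.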
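A first aspect of the present application provides a data processing method. The method is performed by a controller in a data processing system, and the system includes the controller and at least two storage nodes. The controller establishes a connection with each storage node and is configured to manage the at least two storage nodes. The storage nodes and the controller establish connections with a host, and the host is used to deploy application services. The host sends an operation request to the controller, where the operation request includes an identifier of the target data to be operated on and an operation type. The controller receives the operation request and determines a target storage node according to the identifier of the target data. The controller sends an indication message to the target storage node, instructing the target storage node to send the target data to the host or acquire the target data from the host over its connection with the host. In response to the indication message, the target storage node sends the target data to the host or acquires the target data from the host over the connection with the host.
Because the host and the target storage node can transfer data directly over the connection between them, without the controller forwarding the data, the data transmission path is shortened and the transmission time is reduced. This also avoids slow host read/write rates caused by controller bandwidth that cannot meet the demand of large data transfers, and by controller computing power that cannot meet the demand of large-volume data processing.
In a possible implementation of the first aspect, the controller determines, according to the identifier of the target data, that the target storage nodes include a first storage node that stores a first data block of the target data and a second storage node that stores a second data block of the target data. The controller sends a first indication message to the first storage node, where the first indication message includes an identifier of the first data block and instructs the first storage node to transmit the first data block with the host. The controller sends a second indication message to the second storage node, where the second indication message includes an identifier of the second data block and instructs the second storage node to transmit the second data block with the host. With this solution, when the target data is stored in blocks across multiple storage nodes, the controller can instruct each storage node that stores a data block of the target data to send its data block to the host or acquire its data block from the host over its own connection with the host, which increases the data transmission rate and reduces transmission time.
In another possible implementation of the first aspect, the connection between the storage node and the host is an RDMA connection, and the operation request sent by the host to the controller further includes a location parameter of the target storage area that the host has designated in its memory for the target data. The controller divides the target storage area according to the location parameter of the target storage area, and determines the location parameter of the sub-storage area within the target storage area that is designated for the data block of the target data corresponding to each target storage node. The controller then generates an indication message for each target storage node, where the indication message includes the identifier of the data block, the operation type, and the location parameter of the sub-storage area. After the target storage node receives the indication message: when the operation type in the indication message is a read operation, the target storage node writes the data block of the target data into the sub-storage area in the host memory over the RDMA connection with the host, according to the location parameter of the sub-storage area; when the operation type in the indication message is a write operation, the target storage node reads the data block of the target data from the sub-storage area in the host memory over the RDMA connection with the host, according to the location parameter of the sub-storage area, and stores the data block. With this solution, when the target data is stored in blocks across multiple storage nodes, the controller can determine for each target storage node the region of host memory that corresponds to the data block it stores, and instruct that storage node to quickly read the data block from, or write the data block to, the host memory over the RDMA connection with the host, which increases the data transmission rate and reduces transmission time.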
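The division of the target storage area into per-node sub-storage areas can be sketched as below. The sketch assumes the target storage area is one contiguous region described by a base offset, and that data blocks are laid out in order; the source states neither, and the function name `split_target_area` is hypothetical.

```python
def split_target_area(base_offset, block_lengths):
    """Return the (offset, length) location parameter of the sub-storage
    area for each data block, assuming blocks are laid out back to back."""
    sub_areas = []
    offset = base_offset
    for length in block_lengths:
        sub_areas.append((offset, length))
        offset += length
    return sub_areas


# Three data blocks held by three target storage nodes.
areas = split_target_area(4096, [1024, 2048, 512])
print(areas)
```

Each tuple would then be placed in the indication message for the corresponding target storage node, together with the data block identifier and the operation type.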
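In another possible implementation of the first aspect, the connection between the storage node and the host is an RDMA connection. The storage node creates a first QP, which includes a first SQ and a first RQ, and then sends the parameters of the first QP to the controller over the first connection. The controller sends the parameters of the first QP to the host over the second connection. The host creates a second QP, which includes a second SQ and a second RQ, and sends the parameters of the second QP to the controller over the second connection; the controller then sends the parameters of the second QP to any one of the storage nodes over the first connection. The host associates the second QP with the first QP of that storage node according to the received parameters of the first QP. That storage node, according to the received parameters of the second QP, binds the first SQ of the first QP to the second RQ of the second QP, and binds the first RQ of the first QP to the second SQ of the second QP. With this solution, the storage node and the host can create an RDMA connection with the assistance of the controller and then transfer data over that RDMA connection without the controller relaying the data, which increases the data transmission rate and reduces data transmission time.
In another possible implementation of the first aspect, the host establishes Transmission Control Protocol/Internet Protocol (TCP/IP) connections with the controller and with the at least two storage nodes. After sending the indication message to the target storage node, the controller sends a third indication message to the host, where the third indication message includes the communication address of the target storage node and instructs the host to transmit the target data over the TCP/IP connection with the target storage node. With this solution, the controller sends an indication message not only to the sender of the packets but also to the receiver, so that the receiver obtains the target data from the received TCP/IP packets instead of discarding them.
In another possible implementation of the first aspect, the host establishes TCP/IP connections with the controller and with the at least two storage nodes, and the operation type of the operation request sent by the host to the controller is a read operation. The controller adds the operation type, the communication address of the host, the communication address of the controller, and the identifier of the target data to the indication message sent to the target storage node, to instruct the target storage node to send TCP/IP packets carrying the target data with the controller's communication address as the source address and the host's communication address as the destination address. With this solution, the target storage node modifies the source address of the packets it sends and masquerades as the controller when sending the packets carrying the target data to the host. The storage node can therefore send data directly to the host over the TCP/IP connection with the host without any change to existing hosts, which increases the speed of data transmission between the host and the storage node and reduces data transmission time.
In another possible implementation of the first aspect, the controller determines the amount of data the host can receive according to the host's TCP window size, and determines, according to that amount, the data block of the target data that the target storage node carries in each TCP/IP packet, where the size of the data block is not greater than the host's TCP window size. The controller then generates an indication message that includes the identifier of the data block and sends the indication message to the target storage node. With this solution, the controller can determine the data block sent to the host each time according to the host's TCP window size and instruct the storage node that stores the data block to send it to the host in a TCP/IP packet, so that the data is delivered to the host over the TCP connection between the storage node and the host.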
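The window-bounded chunking described above can be sketched as a simple planning function. The name `plan_segments` and the `(offset, length)` representation of a data block are assumptions for illustration; the source only requires that each data block be no larger than the host's TCP window size.

```python
def plan_segments(total_len, window_size):
    """Split target data of total_len bytes into (offset, length) data blocks,
    none larger than the host's TCP window size."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [
        (offset, min(window_size, total_len - offset))
        for offset in range(0, total_len, window_size)
    ]


segments = plan_segments(10, 4)
print(segments)
```

A controller following this sketch would issue one indication message per segment, so that no single transfer exceeds what the host has advertised it can receive.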
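In another possible implementation of the first aspect, after sending the target data to the host or acquiring the target data from the host, the target storage node sends a data-transfer-success message to the controller; after receiving the data-transfer-success messages from all target storage nodes, the controller sends an operation-success message to the host. With this solution, the controller can inform the host that the data has been read or written successfully once the transfer of the target data between the target storage nodes and the host is complete.
A second aspect of the present application provides a data processing method. The method is performed by a storage node in a data processing system. The system includes a controller and at least two storage nodes. The controller establishes a connection with each storage node and is configured to manage the at least two storage nodes. The storage nodes and the controller establish connections with a host, and the host is used to deploy application services. The method includes: the storage node establishes a connection with the host; the storage node receives an indication message sent by the controller, where the indication message includes an identifier of the target data to be operated on by the host and an operation type; and the storage node, according to the indication message, sends the target data to the host or acquires the target data from the host over the connection with the host.
In a possible implementation of the second aspect, the storage node establishes an RDMA connection with the host as follows. The storage node creates a first QP and sends the parameters of the first QP to the controller over the first connection. The controller sends the parameters of the first QP to the host over the second connection. The host creates a second QP and sends the parameters of the second QP to the controller over the second connection, and the controller sends the parameters of the second QP to the storage node over the first connection. The host associates the second QP with the first QP of the storage node according to the received parameters of the first QP. The storage node associates its own first QP with the second QP according to the received parameters of the second QP. With this solution, the storage node and the host can create an RDMA connection with the assistance of the controller and then transfer data over that RDMA connection without the controller relaying the data, which increases the data transmission rate and reduces data transmission time.
In another possible implementation of the second aspect, the indication message sent by the controller includes the identifier of the target data, the operation type, and the location parameter of the target storage area in the host. In response to the indication message: when the operation type in the indication message is a read operation, the storage node writes the target data into the storage area in the host memory over the RDMA connection with the host; when the operation type in the indication message is a write operation, the storage node reads the target data from the storage area in the host memory over the RDMA connection with the host and stores the target data. With this solution, the target storage node can read and write data in the host memory over the RDMA connection with the host, completing the transfer of the target data with the host.
In another possible implementation of the second aspect, the host establishes a TCP/IP connection with the storage node, and the operation type of the operation request sent by the host to the controller is a read operation. The controller adds the operation type, the communication address of the host, the communication address of the controller, and the identifier of the target data to the indication message sent to the target storage node. In response to the indication message, the target storage node sends TCP/IP packets carrying the target data with the controller's communication address as the source address and the host's communication address as the destination address. With this solution, the target storage node modifies the source address of the packets it sends and masquerades as the controller when sending the packets carrying the target data to the host. The storage node can therefore send data directly to the host over the TCP/IP connection with the host without any change to existing hosts, which increases the speed of data transmission between the host and the storage node and reduces data transmission time.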
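A third aspect of the present application provides a data processing method. The method is performed by a first controller in a data processing system. The system includes at least two controllers and at least two disks; the at least two controllers are used to manage the at least two disks, each LUN of the at least two disks may be managed by one controller, and the at least two controllers and the at least two disks form a storage array. A host can access the target data in a LUN managed by a controller through that controller. The at least two controllers establish connections with the host, and the host is used to deploy application services. The method includes: the first controller receives an operation request sent by the host, where the operation request includes an identifier of the target data to be operated on and an operation type; the first controller determines, according to a preset load balancing policy or the ownership of the LUN of the target data, a second controller that responds to the operation request; and the first controller sends an indication message to the second controller, instructing the second controller to send the target data to the host or acquire the target data from the host over its connection with the host. With this solution, the host and the second controller responsible for providing the host with the service of reading and writing the target data can transmit the target data over the connection established between them, without the target data being forwarded by the first controller that received the host's operation request, which shortens the data transmission path and reduces the time consumed in transmitting the target data.
In a possible implementation of the third aspect, the connection between the second controller and the host is an RDMA connection. The operation request sent by the host to the first controller further includes a location parameter of a target storage area in the host. The first controller adds the identifier of the target data, the operation type, and the location parameter to the indication message sent to the second controller, to instruct the second controller to: when the operation type is a read operation, acquire the target data from the at least two disks and write it into the storage area in the host memory over the RDMA connection with the host; and when the operation type is a write operation, read the target data from the storage area in the host memory over the RDMA connection with the host and write it to the at least two disks. With this solution, the host and the second controller responsible for providing the host with the service of reading and writing the target data can transmit the target data quickly over the RDMA connection established between them, achieving high-speed data reads and writes.
In another possible implementation of the third aspect, the first controller determines, according to a preset load balancing policy, that the second controller sends the target data to the host or acquires the target data from the host; or determines, according to the ownership of the logical unit number (LUN) where the target data is located, that the second controller sends the target data to the host or acquires the target data from the host.
In another possible implementation of the third aspect, the first controller receives a fourth indication message from a third controller that manages the at least two disks, where the fourth indication message includes an identifier of second target data to be operated on by the host and an operation type; in response to the fourth indication message, the first controller sends the second target data to the host or acquires the second target data from the host over its connection with the host.
A fourth aspect of the present application provides a data processing apparatus, configured to perform the method in the first aspect or any possible implementation of the first aspect. Specifically, the apparatus includes modules for performing the method in the first aspect or any possible implementation of the first aspect.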
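A fifth aspect of the present application provides a data processing apparatus, configured to perform the method in the second aspect or any possible implementation of the second aspect. Specifically, the apparatus includes modules for performing the method in the second aspect or any possible implementation of the second aspect.
A sixth aspect of the present application provides a data processing apparatus, configured to perform the method in the third aspect or any possible implementation of the third aspect. Specifically, the apparatus includes modules for performing the method in the third aspect or any possible implementation of the third aspect.
A seventh aspect of the present application provides a data processing device, including a processor, a memory, a communication interface, and a bus. The processor, the memory, and the communication interface are connected by the bus and communicate with one another. The memory is used to store computer-executable instructions. When the device runs, the processor executes the computer-executable instructions in the memory to perform, using hardware resources in the device, the method in the first aspect or any possible implementation of the first aspect.
An eighth aspect of the present application provides a data processing device, including a processor, a memory, a communication interface, and a bus. The processor, the memory, and the communication interface are connected by the bus and communicate with one another. The memory is used to store computer-executable instructions. When the device runs, the processor executes the computer-executable instructions in the memory to perform, using hardware resources in the device, the method in the second aspect or any possible implementation of the second aspect.
A ninth aspect of the present application provides a data processing device, including a processor, a memory, a communication interface, and a bus. The processor, the memory, and the communication interface are connected by the bus and communicate with one another. The memory is used to store computer-executable instructions. When the device runs, the processor executes the computer-executable instructions in the memory to perform, using hardware resources in the device, the method in the third aspect or any possible implementation of the third aspect.
A tenth aspect of the present application provides a data processing system. The system includes the device of the seventh aspect and at least two devices of the eighth aspect, so that the device of the eighth aspect and the host transfer data directly over the connection between them, without the data being forwarded by the device of the seventh aspect.
An eleventh aspect of the present application provides a data processing system. The system includes at least two devices of the ninth aspect and at least two disks, so that the device of the ninth aspect that responds to the host's operation request and the host transfer data directly over the connection between them, without the data being forwarded by the device that received the host's operation request.
A twelfth aspect of the present application provides a computer-readable medium storing instructions that, when run on a computer, cause the computer to perform the method in the first aspect or any possible implementation of the first aspect.
A thirteenth aspect of the present application provides a computer-readable medium storing instructions that, when run on a computer, cause the computer to perform the method in the second aspect or any possible implementation of the second aspect.
A fourteenth aspect of the present application provides a computer-readable medium storing instructions that, when run on a computer, cause the computer to perform the method in the third aspect or any possible implementation of the third aspect.
On the basis of the implementations provided in the above aspects, the present application may further combine them to provide more implementations.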
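Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below.
FIG. 1 is a schematic diagram of a SAN storage system in the prior art;
FIG. 2 is a schematic diagram of a scatter gather list (SGL);
FIG. 3 is a schematic diagram of a SAN system according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of a method for transmitting data in a SAN system according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of a host establishing an RDMA connection with a storage node according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a storage node and a host according to an embodiment of the present invention;
FIG. 7 is a schematic flowchart of a host and a storage node transmitting target data over an RDMA connection according to an embodiment of the present invention;
FIG. 8a and FIG. 8b are schematic diagrams of the storage area in host memory that stores the target data according to an embodiment of the present invention;
FIG. 9 is a schematic flowchart of a host and a storage node transmitting target data over an iSER connection according to an embodiment of the present invention;
FIG. 10 is a schematic flowchart of a method for a host and a storage node to transmit data over a TCP/IP connection according to an embodiment of the present invention;
FIG. 11 is a schematic flowchart of another method for a host and a storage node to transmit data over a TCP/IP connection according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of the frame structure of a packet sent by a storage node to a host according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of an apparatus 50 according to an embodiment of the present invention;
FIG. 14 is a schematic structural diagram of an apparatus 60 according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of a device 70 according to an embodiment of the present invention;
FIG. 16 is a system block diagram of a storage array in the prior art;
FIG. 17 is a system block diagram of a storage array according to an embodiment of the present invention;
FIG. 18 is a schematic flowchart of a data processing method according to an embodiment of the present invention;
FIG. 19 is a schematic structural diagram of an apparatus 80 according to an embodiment of the present invention;
FIG. 20 is a schematic structural diagram of a device 90 according to an embodiment of the present invention.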
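Detailed Description
For ease of understanding, the prior-art technical solution by which a host accesses data in a remote storage device is introduced first.
FIG. 1 is a schematic diagram of a distributed storage area network (SAN) system. The system includes a controller 12 and at least two storage nodes, such as storage node 13-1, storage node 13-2, ..., storage node 13-n. The system is used to process request messages from a host 11. The host 11 establishes a network connection with the controller 12, such as a fibre channel (FC) based connection or an Ethernet-based connection, and sends operation requests to the controller 12 over that connection. The controller 12 establishes network connections with the at least two storage nodes. The SAN system stores data in a distributed manner: when data needs to be written to the SAN system, the controller splits the data into multiple data blocks according to a preset algorithm and stores them in different storage nodes. The controller 12 records information about the data stored by each storage node; this information is also called the partition view of the storage node. As an example, Table 1 shows a partition view in a storage node. Table 1 includes the data block identifier, the associated original data identifier, the storage medium, and the check value. The check value is the check information for the data block, computed by the controller according to a preset algorithm, and is used to verify the integrity of the data block when it is read or written; the storage medium identifies the target storage medium in the storage node on which the data block is stored.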
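Table 1: Example of a partition view in a storage node
Data block identifier | Associated original data identifier | Storage medium | Check value
Data block 11 | Data 1 | Storage medium 1 | Check value 1
Data block 21 | Data 2 | Storage medium 2 | Check value 2
Data block 31 | Data 3 | Storage medium 3 | Check value 3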
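Optionally, when the SAN system stores data in a distributed manner, the controller holds a global view of the storage nodes in the SAN system, and the global view records information about the data stored by each storage node. As an example, Table 2 shows a global view.
Table 2: Example of a global partition view in the controller
Data block identifier | Associated original data identifier | Check value | Storage node | Storage medium
Data block 11 | Data 1 | Check value 1 | Storage node 1 | Storage medium 1
Data block 12 | Data 1 | Check value 2 | Storage node 2 | Storage medium 1
Data block 13 | Data 1 | Check value 3 | Storage node 3 | Storage medium 1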
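As an illustration of how a controller might use such a global view, the sketch below models Table 2 as a list of rows and looks up the storage nodes that hold the data blocks of a given piece of data. The dictionary keys and the function name `target_nodes` are assumptions for illustration; the source does not specify how the view is stored.

```python
# Rows mirror Table 2; the key names are hypothetical.
GLOBAL_VIEW = [
    {"block": "data block 11", "data": "data 1", "check": "check value 1",
     "node": "storage node 1", "medium": "storage medium 1"},
    {"block": "data block 12", "data": "data 1", "check": "check value 2",
     "node": "storage node 2", "medium": "storage medium 1"},
    {"block": "data block 13", "data": "data 1", "check": "check value 3",
     "node": "storage node 3", "medium": "storage medium 1"},
]


def target_nodes(view, data_id):
    """Return (storage node, data block) pairs for every block of data_id."""
    return [(row["node"], row["block"]) for row in view if row["data"] == data_id]


print(target_nodes(GLOBAL_VIEW, "data 1"))
```

An empty result would mean the view holds no block for the requested identifier.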
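When the controller 12 receives a read request sent by the host 11, it determines the target storage node that stores the target data (for example, storage node 13-1) according to the identifier of the target data to be operated on in the read request and the partition view. The controller 12 then sends the read request to the target storage node 13-1. In response to the read request sent by the controller 12, the target storage node 13-1 returns the target data to the controller 12; the controller 12 receives the target data returned by the target storage node 13-1 and returns it to the host 11.
When the controller 12 receives a write request sent by the host 11, it determines the target storage node 13-1 that is to store the target data according to the identifier of the target data to be operated on in the write request and the partition view. The controller 12 then sends the write request to the target storage node. In response to the write request sent by the controller 12, the target storage node 13-1 writes the target data to its storage medium and returns a write response message to the controller 12. When the response message received by the controller 12 indicates that the data was written successfully, the controller 12 returns a write-success message to the host 11.
As the above flow shows, whether the host is reading data from a storage node or writing data to a storage node, the controller must forward the data being operated on, and while forwarding it the controller must also perform processing such as packing and unpacking. In practice, a SAN system may serve operation requests from multiple hosts, in which case the controller may be handling operation requests from multiple hosts to the storage nodes at the same time. This places an excessive data transmission and computing burden on the controller and limits the speed at which data is read and written between the hosts and the storage nodes.
To solve the problem that the controller receiving the host's operation request limits the read/write speed between the host and the storage nodes and increases the time consumed, embodiments of the present invention provide a data processing system, which is described in detail below with reference to the accompanying drawings and specific embodiments.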
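Some concepts involved in the embodiments of the present invention are introduced first.
(1) Remote direct memory access (RDMA) is a technology for directly accessing remote memory. It can move data quickly from the memory of one device directly into the memory of another, remote device, reducing the involvement of the device's central processing unit (CPU) in the data transfer. This improves the system's service processing performance and offers high bandwidth, low latency, and low CPU usage.
(2) A queue pair (QP) includes a send queue (SQ) and a receive queue (RQ). Both ends of an RDMA connection must each create a QP, then associate the SQ in their own QP with the RQ of the peer's QP, and the RQ in their own QP with the SQ in the peer's QP, so that their own QP is associated with the peer's QP and the two ends establish an RDMA connection.
(3) An RDMA-enabled network interface card (RNIC) is used to transfer data over an RDMA connection. In an RDMA-based data transfer between device A and device B, the RNIC of device A can read data directly from the memory of device A and send it to the RNIC of device B, and the RNIC of device B writes the data received from the RNIC of device A into the memory of device B. In the embodiments of the present invention, the RNIC may be a host bus adapter (HBA) that supports RDMA.
(4) RDMA operation types include RDMA send and RDMA receive, which are used to transfer commands, and RDMA read and RDMA write, which are used to transfer data. For the specific implementation of these four operation types, refer to the prior art; they are not described in detail here.
(5) An iSER connection is a connection based on the iSCSI extensions for RDMA (iSER) protocol, where iSCSI is the internet small computer system interface. The iSER protocol supports RDMA transfers.
(6) An NVMe over Fabrics (NOF) connection, where NVMe is non-volatile memory express. The NOF protocol supports RDMA transfers.
(7) A scatter gather list (SGL) is a form of data encapsulation. RDMA data transfer requires that both the source physical address and the destination physical address be contiguous. In practice, however, the storage addresses of data are not necessarily contiguous in physical space. In that case, the non-contiguous physical storage addresses are encapsulated in the form of an SGL; after transferring one piece of data stored in a physically contiguous storage space, the device transfers the next piece of data stored in a physically contiguous storage space according to the SGL.
FIG. 2 is a schematic diagram of an SGL. An SGL includes multiple scatter gather entries (SGEs), and each SGE includes an address field, a length field, and a flag. The address field indicates the start position of a storage area, the length field indicates the length of the storage area, and the flag field indicates whether the SGE is the last one in the SGL; the flag field may also include other auxiliary information, such as a data block descriptor. Each SGE represents one contiguous storage area by means of its address field and length field. Several SGEs whose storage areas adjoin one another in sequence form a group; the storage areas represented by SGEs in different groups of the SGL are not adjacent, and the last SGE of each group points to the start address of the next group of SGEs.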
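The SGE layout described above can be sketched with a small data structure and a gather routine. This is a simplified model under stated assumptions: memory is a `bytearray`, the flag is reduced to a single `last` boolean, and the grouping and group-to-group pointers of the SGL are flattened into one Python list. The names `SGE` and `gather` are illustrative.

```python
from dataclasses import dataclass


@dataclass
class SGE:
    address: int          # start position of the contiguous storage area
    length: int           # length of the storage area in bytes
    last: bool = False    # flag: True if this is the final SGE in the SGL


def gather(memory, sgl):
    """Collect the data described by an SGL from non-contiguous areas."""
    out = bytearray()
    for sge in sgl:
        out += memory[sge.address:sge.address + sge.length]
        if sge.last:
            break
    return bytes(out)


memory = bytearray(b"AAAAxxxxBBBByyCC")
sgl = [SGE(0, 4), SGE(8, 4), SGE(14, 2, last=True)]
print(gather(memory, sgl))
```

Here three physically separate areas of `memory` are transferred as one logical piece of data, which is the purpose the text assigns to the SGL.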
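(8) A logical unit number (LUN) is only a numeric identifier and does not represent any physical attribute. In a storage array, multiple disks are combined into one logical disk according to a preset algorithm (such as a redundant array of independent disks (RAID) configuration), and the logical disk is then divided into different stripes according to preset rules; each stripe is called a LUN. A LUN may describe one contiguous storage area of one disk of the storage array, or a set of multiple non-contiguous storage areas of one disk, or a set of storage areas across multiple disks.
FIG. 3 is a schematic diagram of a distributed storage system according to an embodiment of the present invention. The system includes a host 40, a controller 20, and at least two storage nodes managed by the controller 20, such as storage node 31, storage node 32, and storage node 33. The controller 20 establishes a connection with each of storage node 31, storage node 32, and storage node 33; the controller 20 establishes a connection with the host 40; and storage node 31, storage node 32, and storage node 33 each establish a connection with the host 40. The host 40 is used to deploy application services.
For ease of distinction, in the embodiments of the present invention the connection between the controller 20 and a storage node is called the first connection, the connection between the controller 20 and the host 40 is called the second connection, and the connection between a storage node and the host 40 is called the third connection. The first, second, and third connections may be connections based on wired communication, such as FC-based connections; they may also be connections based on wireless communication, such as connections based on cellular communication or wireless fidelity (WIFI) connections.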
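FIG. 4 shows a method for data processing with the system shown in FIG. 3. The method includes:
Step 101: The controller establishes a first connection with the storage nodes.
There are two or more storage nodes. The controller establishes a first connection with each storage node and manages the storage nodes over these first connections, for example by instructing a storage node to store, return, or delete data according to a request from the host.
Step 102: The controller establishes a second connection with the host.
Step 103: The host establishes a third connection with the storage nodes.
Step 104: The host sends an operation request to the controller over the second connection, where the operation request includes an identifier of the target data to be operated on and an operation type.
In the embodiments of the present invention, read requests and write requests are collectively called operation requests. An operation request includes an operation type, which indicates whether the requested operation is a read operation or a write operation. The identifier of the target data uniquely identifies the target data; for example, when data is stored as key-value pairs, the key is the identifier of the data.
Step 105: The controller receives the operation request and determines the target storage node according to the identifier of the target data and the global view or partition view.
The controller obtains the identifier of the target data from the operation request and determines the target storage node according to that identifier and the global view or partition view described above. When the operation type of the operation request is a write operation, the target storage node is the storage node to which the target data is to be written; when the operation type of the operation request is a read operation, the target storage node is the storage node that stores the target data.
For example, when the operation type in the operation request is a read operation, the controller looks up the identifier of the target data in the global view or partition view and determines the target storage node that stores the target data. There may be one or more target storage nodes; for example, the target data may be split into multiple data blocks that are stored in multiple storage nodes.
As another example, when the operation type in the operation request is a write operation, the controller splits the data to be written into multiple data blocks according to a preset algorithm and stores them in multiple storage nodes; in this case, every storage node that stores a data block is a target storage node for storing the target data.
Step 106: The controller sends an indication message to the target storage node, where the indication message instructs the target storage node to send the target data to the host or acquire the target data from the host over its connection with the host.
As one option, when there are multiple target storage nodes, the controller may send indication messages to the multiple target storage nodes, instructing each target storage node to transfer its data block of the target data over the third connection with the host. Because multiple target storage nodes can transfer data blocks of the target data with the host at the same time according to the indication messages, the efficiency of transferring the target data is improved and the time consumed is reduced.
As another option, the controller sends the indication message to the multiple target storage nodes one by one, sending the indication message to the next target storage node only after the data transfer between the previous target storage node and the host has finished. With this solution, the controller can keep the transfer of the target data in order and ensure the correctness of the transferred data.
Step 107: The target storage node sends the target data to the host or acquires the target data from the host according to the indication message sent by the controller.
With this solution, the host, through the controller, instructs the target storage node to send the target data to the host or acquire the target data from the host, and the target storage node and the host transfer the target data directly over the connection between them without the controller forwarding the data. This shortens the data transmission path and reduces the transmission time. It also avoids slow host read/write rates caused by controller bandwidth that cannot meet the demand of large data transfers, and by controller computing power that cannot meet the demand of large-volume data processing.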
进一步地,本发明实施例中,主机与存储节点之间建立的第三连接可以有多种实现方式,下面分别进行介绍。
(一)第三连接为RDMA连接。
参照图5,主机与存储节点建立RDMA连接的过程如下:
步骤201:存储节点创建第一QP。
为了便于区分,本发明实施例中将存储节点建立的QP称为第一QP。第一QP包括发送队列SQ和接收队列RQ,SQ用于发送数据,RQ用于接收数据。可选的,第一QP还包括完成队列,该完成队列用于检测第一QP的SQ的发送数据任务以及第一QP的RQ的接收数据任务是否完成。
步骤202:存储节点通过第一连接将第一QP的参数发送给控制器。
该第一QP的参数可以包括第一QP的标识,第一QP的标识可以是用于标识该QP的数字、字母或其他形式的组合。除此之外,也可以包括为第一QP配置的保护域(protect domain,PD)的标识,PD用于表征存储节点为RDMA连接所分配的具有RDMA功能的网卡(如RNIC);第一QP的参数还可以包括存储节点分配的协助建立RDMA连接以及管理RDMA连接的连接管理器(connection manager,CM)的标识。
步骤203:控制器通过第二连接将第一QP的参数发送给主机。
步骤204:主机创建第二QP。
为了便于区分,本发明实施例中将主机创建的QP称为第二QP。第二QP包括发送队列SQ和接收队列RQ。可选的,第二QP还包括完成队列,该完成队列用于检测第二QP的SQ的发送数据任务以及第二QP的RQ的接收数据任务是否完成。
步骤205:主机通过第二连接将第二QP的参数发送给控制器。
该第二QP的参数可以包括第二QP的标识,除此之外,也可以包括为第二QP配置的保护域PD的标识、为第二QP配置的连接管理器CM的标识等。
步骤206:控制器通过第一连接将第二QP的参数发送给存储节点。
步骤207:主机根据接收的第一QP的参数将第二QP与存储节点的第一QP关联。
步骤208:存储节点根据接收的第二QP的参数将自身的第一QP与第二QP关联。
上述第一QP与第二QP的关联指的是,根据第一QP的标识以及第二QP的标识将第一QP的SQ与第二QP的RQ绑定,进而创建从存储节点向主机发送数据的通路;以及,将第一QP的RQ与第二QP的SQ绑定,进而创建从主机向存储节点发送数据的通路。
需要说明的是,上述步骤201存储节点创建第一QP与步骤204主机创建第二QP可以同时进行,也可以为先执行步骤201后执行步骤204,还可以为先执行步骤204后执行步骤201。
通过上述方案,主机能够与存储节点建立RDMA连接,由于该RDMA连接为主机与存储节点之间直接建立的连接,主机与存储节点通过该RDMA连接进行目标数据的传输无需中转,可以减少传输耗时。而且,RDMA连接本身的传输速率也很快,能够进一步减小传输耗时。
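The QP association of steps 201 to 208 can be modelled with a toy in-memory sketch. It does not use any real RDMA verbs library; the `QP` class, its fields, and `exchange_via_controller` are hypothetical names, and the controller's role is reduced to handing each side the other side's QP.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)
class QP:
    qp_id: str
    sq: deque = field(default_factory=deque)  # send queue
    rq: deque = field(default_factory=deque)  # receive queue
    peer: Optional["QP"] = None

    def associate(self, remote: "QP") -> None:
        """Bind the local SQ to the remote RQ (and, symmetrically, the reverse)."""
        self.peer = remote

    def send(self, payload: bytes) -> None:
        if self.peer is None:
            raise RuntimeError("QP is not associated with a remote QP")
        self.sq.append(payload)
        # The payload leaves the local SQ and lands in the peer's RQ.
        self.peer.rq.append(self.sq.popleft())

def exchange_via_controller(node_qp: QP, host_qp: QP) -> None:
    """The controller relays each side's QP parameters to the other side."""
    host_qp.associate(node_qp)  # step 207
    node_qp.associate(host_qp)  # step 208
```

After the exchange, data sent on the storage node's SQ arrives in the host's RQ and vice versa, without passing through the controller.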
接下来,结合图6进一步介绍存储节点与主机基于RDMA连接进行数据传输的实现方式,图6为主机与存储节点通过RDMA连接传输数据的示意图,主机40包括RNIC41(如HBA卡)以及内存42,存储节点31包括RNIC311以及内存312。
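The RNIC 311 of the storage node 31 may send, to the RNIC 41 of the host 40, a request to read data at a specified location in the memory 42. The RNIC 41 reads the data from the specified location in the memory 42 and sends it to the RNIC 311, and the RNIC 311 writes the received data into the memory 312. This process is referred to as the storage node reading data from the host 40 in RDMA read mode.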
存储节点31的RNIC311还可以向主机40的RNIC41发送向内存42指定位置写入数据的请求,RNIC41缓存该请求携带的数据,并将数据写入内存42中该请求所指定的位置,上述过程称为存储节点以RDMA write方式向主机40写入数据。
进一步地,图7为基于RDMA连接的数据处理方法的流程示意图,该方法包括如下步骤:
步骤301,主机通过第二连接向控制器发送操作请求,该操作请求包括操作类型、目标数据的标识以及主机在内存中为目标数据指定的目标存储区域的位置参数。
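The location parameter of the target storage area identifies where the target data is stored in the host memory, and may be expressed as an offset in the memory. The operation request may further include information such as the length of the target data and a remote key (Rkey).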
步骤302,控制器接收操作请求,根据该操作请求中的目标数据的标识确定目标存储节点。
当所述操作请求为写请求时,所述目标存储节点为待写入所述目标数据的存储节点,当所述操作请求为读请求时,所述目标存储节点为存储有所述目标数据的存储节点。
步骤303,控制器向目标存储节点发送指示消息,该指示消息包括该目标存储节点对应的目标数据的数据块的标识、操作类型以及目标存储区域的位置参数。
步骤304,目标存储节点响应指示消息,根据目标存储区域的位置参数通过与主机的RDMA连接向主机发送目标数据或从主机获取目标数据。
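Specifically, when the operation type in the instruction message is a write operation, the target storage node reads the target data from the target storage area in the host memory in the foregoing RDMA read mode. The target storage node then writes the read target data to its own disk.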
当指示消息中的操作类型为读操作时,目标存储节点以前述RDMA write方式,向主机内存中该目标存储区域写入目标数据。然后,主机将主机的内存中目标存储区域存储的该目标数据写入主机的磁盘。
步骤305,目标存储节点在完成通过RDMA连接向主机发送目标数据或从主机获取目标数据后,向控制器发送数据传输成功消息。
在操作类型为写操作时,目标存储节点在以RDMA read方式读取主机内存中该目标存储区域的数据后,向控制器发送数据传输成功消息。
在操作类型为读操作时,目标存储节点在以RDMA write方式将存储的目标数据写入主机内存中该目标存储区域后,向控制器发送数据传输成功消息。
步骤306,控制器在接收到目标存储节点发送的数据传输成功消息后,向主机发送操作成功消息。
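With the foregoing solution, the data to be operated on is transmitted over the RDMA connection between the host and the storage node and does not pass through the controller. This relieves the bandwidth and computing load on the controller, and because the transfer uses a high-speed RDMA connection, the data transmission rate is higher and the whole read or write process takes less time.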
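The direction of the transfer in steps 304 and 305 is easy to get backwards: a write operation from the host's point of view is an RDMA read by the storage node, and a read operation is an RDMA write. The sketch below simulates this with a `bytearray` standing in for host memory and a dictionary standing in for the node's disk; `handle_instruction` and its parameters are hypothetical names, and no real RDMA transport is involved.

```python
def handle_instruction(op: str, host_mem: bytearray, offset: int, length: int,
                       node_disk: dict, block_id: str) -> str:
    """Simulate a target storage node serving one instruction message."""
    if op == "write":
        # Host write: the node pulls the data out of host memory (RDMA read).
        node_disk[block_id] = bytes(host_mem[offset:offset + length])
    elif op == "read":
        # Host read: the node pushes its stored data into host memory (RDMA write).
        data = node_disk[block_id][:length]
        host_mem[offset:offset + len(data)] = data
    else:
        raise ValueError(f"unknown operation type: {op}")
    # Step 305: report completion to the controller.
    return "transfer-success"
```

In both directions the controller only sees the instruction message and the success message, never the payload.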
作为一种可选的方式,目标数据被划分为多个数据块,不同数据块对应的目标存储节点的磁盘中的存储区域不连续。具体又可以有两种场景:
场景1,目标数据对应的目标存储节点的个数大于1,目标数据的不同数据块可以对应不同的目标存储节点。例如,目标数据被划分为第一数据块以及第二数据块,其中,第一数据块对应第一存储节点,第二数据块对应第二存储节点。
场景2,目标数据对应的目标存储节点的个数为1,但是,目标数据的不同数据块对应的目标存储节点的磁盘中的存储区域不连续。例如,目标数据被划分为第一数据块以及第二数据块,其中,第一数据块对应目标存储节点中的第一存储区域,第二数据块对应目标存储节点中的第二存储区域,第一存储区域与第二存储区域不连续。
在上述两种场景中,控制器还要为目标数据的每个数据块确定在主机内存中目标存储区域中对应的子存储区域,进而指示目标存储节点从该子存储区域获取该目标数据的数据块,或者将该目标数据的数据块写入主机内存中该子存储区域。下面以场景1为例进行说明。
请参照图8a,主机在内存中为目标数据指定的目标存储区域为连续存储区域时,控制器确定目标数据的第一数据块由第一存储节点存储,从目标数据在内存中的存储区域中划分出用于存储该第一数据块的第一子存储区域,以及确定目标数据的第二数据块由第二存储节点存储,从目标数据在内存中的存储区域中划分出用于存储该第二数据块的第二子存储区域,该第一子存储区域以及第二子存储区域为连续存储区域。
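Referring to Figure 8b, when the target storage area is a non-contiguous storage area, the controller determines that the first data block of the target data is stored by the first storage node, and carves out of the storage area of the target data in the memory a first sub-storage area for storing the first data block, where the first sub-storage area consists of multiple non-contiguous storage areas. The controller determines that the second data block of the target data is stored by the second storage node, and carves out of the storage area of the target data in the memory a second sub-storage area for storing the second data block, where the second sub-storage area likewise consists of multiple non-contiguous storage areas.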
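For scenario 2, the controller determines the sub-storage area in the host memory for each data block of the target data in the same way as in scenario 1, and details are not repeated here.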
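For the contiguous case of Figure 8a, dividing the host's target storage area into per-block sub-storage areas amounts to a running offset. The sketch below assumes that a location parameter is an `(offset, length)` pair; the function name `split_region` is hypothetical.

```python
def split_region(base_offset: int, block_lengths: list) -> list:
    """Return one (offset, length) sub-storage area per data block, back to back."""
    sub_regions = []
    offset = base_offset
    for length in block_lengths:
        sub_regions.append((offset, length))
        offset += length  # the next block starts where this one ends
    return sub_regions
```

For example, a target storage area starting at offset 100 that holds a 30-byte first block and a 50-byte second block is divided into `(100, 30)` and `(130, 50)`.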
在确定出目标数据的数据块在主机的内存中对应的子存储区域后,控制器向目标存储节点发送指示消息,该指示消息包括该目标存储节点对应的目标数据的数据块的标识、操作类型以及为该数据块确定的子存储区域的位置参数。然后,目标存储节点响应指示消息,根据子存储区域的位置参数通过与主机的RDMA连接向主机发送目标数据的数据块或者从主机获取目标数据的数据块。再然后,目标存储节点在完成通过RDMA连接向主机发送目标数据或从主机获取目标数据的数据块后,向控制器发送数据传输成功消息。最后,控制器在接收到所有目标存储节点发送的数据传输成功消息后,向主机发送操作成功消息。
在场景2中,在确定出目标数据的数据块在主机的内存中对应的子存储区域后,控制器向目标存储节点发送指示消息,该指示消息包括该目标存储节点对应的目标数据的数据块的标识、操作类型以及为该数据块确定的子存储区域的位置参数。然后,目标存储节点响应指示消息,根据子存储区域的位置参数通过与主机的RDMA连接向主机发送目标数据的数据块或者从主机获取目标数据的数据块。然后,目标存储节点在完成通过RDMA连接向主机发送目标数据或从主机获取目标数据的数据块后,向控制器发送数据传输成功消息。然后,控制器在接收到所有目标存储节点发送的数据传输成功消息后,向主机发送操作成功消息。
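With the foregoing solution, the controller can determine, for each data block of the target data, the corresponding sub-storage area in the host memory, so that the target storage node for that data block can obtain the data block from the host or send the data block to the host over the RDMA connection.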
作为一种可能的实现方式,下面结合图9介绍iSER协议下数据处理方法的流程,其中,iSER协议支持RDMA连接,包括如下步骤:
步骤401,主机向控制器发送建立iSER连接请求。
步骤402,控制器向主机返回建立连接响应,与主机建立iSER连接。
步骤403,主机创建第二QP,并通过与控制器的iSER连接将第二QP的参数发送给控制器。
步骤404,控制器通过与存储节点间的第一连接将第二QP的参数发送给每个存储节点。
步骤405,存储节点创建第一QP,并通过第一连接将第一QP的参数发送给控制器。
步骤406,控制器通过iSER连接将第一QP的参数发送给主机。
步骤407,主机根据第一QP的参数将第一QP与第二QP关联。
步骤408,存储节点根据第二QP的参数将第二QP与第一QP关联。
步骤409,主机通过iSER连接向控制器发送控制请求。
该控制请求用于请求控制器准许主机以RDMA的方式向控制器发送命令请求,如后文中的操作请求。该控制请求的具体实现方式请参照现有技术。
步骤410,控制器通过iSER连接向主机返回控制响应。
该控制响应表征控制器准许主机以RDMA的方式向控制器发送命令请求。
步骤411,主机通过iSER连接向控制器发送操作请求。
该操作请求包括操作类型、待操作的目标数据的标识以及主机在内存中为该目标数据指定的目标存储区域的位置参数。
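In a possible implementation, the operation request may omit the location parameter of the target storage area, and the host sends that location parameter to the controller after sending the operation request.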
步骤412,控制器根据该操作请求确定目标存储节点,并根据目标存储区域的位置参数确定主机在内存中为每个目标存储节点对应的数据块所指定的子存储区域的位置参数。
控制器根据本地存放的分区视图或全局视图确定目标数据存放或者应该被存放在哪些存储节点上,确定出的存储节点即为目标存储节点,其中,每个目标存储节点存储该目标数据的一个或多个数据块。然后,控制器对主机内存中用于存放目标数据的存储区域进行划分,确定主机内存中为每个目标存储节点所对应的数据块指定的存储区域。
步骤413,控制器向每个目标存储节点发送指示消息,该指示消息包括该目标存储节点对应的数据块的标识以及为该目标存储节点确定的子存储区域的位置参数。
步骤414,目标存储节点响应该指示消息,通过与主机的RDMA连接向主机发送目标数据的数据块或从主机获取目标数据的数据块。
在指示消息中的操作类型为写操作时,目标存储节点根据指示消息中的子存储区域的位置参数,以RDMA read的方式从该子存储区域读取数据。然后将读取的数据写入目标存储节点的内存,再将该数据从内存写入磁盘,记录该数据在磁盘中的存储位置。
在指示消息中的操作类型为读操作时,目标存储节点根据指示消息中的子存储区域的位置参数,以RDMA write的方式将该目标存储节点所存储的目标数据的数据块写入该子存储区域。
步骤415,目标存储节点在完成通过RDMA连接向主机发送目标数据或从主机获取目标数据的数据块后,向控制器发送数据传输成功消息。
步骤416,控制器在接收到每个目标存储节点发送的数据传输成功响应后,通过iSER连接向主机发送操作成功消息。
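In the procedure of steps 401 to 416, data between the host and the storage node is transmitted over the RDMA connection between the two rather than forwarded by the controller. This reduces the load on the controller and avoids the prior-art problem in which the excessive load caused by the controller forwarding data slows down data transmission between the host and the storage node. In addition, when there are two or more target storage nodes, different storage nodes can transmit data to or from the host over RDMA connections at the same time, which further increases the data transmission rate between the host and the storage nodes.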
作为另一种可能的实现方式,NOF协议也支持RDMA连接,NOF协议下数据处理方法的流程与上述步骤401至步骤416的流程相一致。
继续参照图8b,与iSER协议不同的是,NOF协议还支持主机将内存中的该目标存储区域配置为非连续,主机将目标存储区域的位置参数以SGL的形式封装得到SGL包,并以RDMA write方式将该SGL包写入控制器的内存。控制器对该SGL包解析,并确定主机的目标存储区域中为每个目标存储节点对应的数据块所指定的子存储区域的位置参数,以SGL的形式封装子存储区域的位置参数,并将SGL形式封装的子存储区域的位置参数发送给存储节点,存储节点对该SGL封包进行解析,获得该子存储区域的位置参数。通过上述方式,在主机将数据存储在内存的离散存储区域的情况下实现存储节点与主机之间通过RDMA连接进行数据传输,使得主机能够充分利用内存存储区域,提高内存利用率。
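The way the controller might carve a host SGL into per-node sub-lists can be sketched as below. The representation of an SGL as a list of `(address, length)` segments and the name `split_sgl` are assumptions for illustration; the actual NOF wire format for SGL descriptors is not reproduced here.

```python
def split_sgl(sgl: list, block_lengths: list) -> list:
    """Split a scatter-gather list into one sub-list per data block.

    sgl: (address, length) segments of the host's non-contiguous target area.
    block_lengths: size of the data block handled by each target storage node.
    """
    if sum(length for _, length in sgl) < sum(block_lengths):
        raise ValueError("host target storage area is too small")
    index, used = 0, 0  # current segment and bytes already consumed from it
    result = []
    for need in block_lengths:
        sub = []
        while need > 0:
            addr, length = sgl[index]
            take = min(need, length - used)
            if take > 0:
                sub.append((addr + used, take))
            used += take
            need -= take
            if used == length:  # segment exhausted, move to the next one
                index, used = index + 1, 0
        result.append(sub)
    return result
```

Each sub-list can then be sent to the corresponding storage node, which uses it as the set of host memory locations for its RDMA read or RDMA write.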
(二)、第三连接为TCP/IP连接。
主机与存储节点均包括支持TCP/IP协议的通信端口,主机的该通信端口与存储节点的该通信端口能够建立通信链路,即为第三连接,本发明实施例中,主机与存储节点能够通过该第三连接传输数据。
图10为在第三连接为TCP/IP连接时,主机与存储节点之间传输数据方法的流程示意图,该方法包括如下步骤:
步骤501,主机通过第二连接向控制器发送操作请求,该操作请求包括目标数据的标识以及操作类型;
步骤502,控制器响应该操作请求,确定存储目标数据的目标存储节点,以及确定每个目标存储节点所对应的数据块。
步骤503,控制器通过第一连接向每个目标存储节点发送指示消息,该指示消息包括主机的通信地址,以及该目标存储节点所对应的数据块的标识。
具体的,数据块的标识用于使控制器在分布式系统中确定目标数据在存储节点的存储位置。可选的,该标识中还可以包括验证信息,如身份验证、数据访问许可信息。
步骤504,控制器通过第二连接向主机发送每个目标存储节点的通信地址以及每个目标存储节点所对应的数据块的标识。
上述通信地址可以为IP地址或者媒体访问控制(media access control,MAC)地址。控制器可以从主机发送的操作请求中获得主机的通信地址,而控制器本地可以存储其连接的每个存储节点的通信地址。
步骤505,每个目标存储节点与主机基于对方的通信地址以TCP/IP报文的方式传输目标数据中该目标存储节点所对应的数据块。
具体地,步骤505可以有多种实现方式,包括但不限于以下方式:
方式1,控制器指示目标存储节点发起与主机间的数据传输。
例如,在操作请求中的操作类型为读操作时,控制器在确定出每个目标存储节点以及每个目标存储节点所对应的数据块后,向每个目标存储节点发送指示消息,该指示消息包括主机的通信地址以及该存储节点需返回的数据块的标识。存储节点响应该指示消息,发送以主机为目的地的TCP/IP报文,该TCP/IP报文包括指示消息所指示的数据块。由于控制器已将目标存储节点的通信地址以及该目标存储节点所对应的数据块的标识发送给主机,所以,主机在接收到目标存储节点发送的TCP/IP报文时,可以确认该报文为合法报文,且确定该TCP/IP报文中包含的数据块为目标数据中的数据块,并从TCP/IP报文中获取该数据块。主机在接收每个目标存储节点发送的TCP/IP报文后,可以获得目标数据,实现从存储节点读取该目标数据。
又例如,操作请求中的操作类型为写操作时,存储节点响应控制器发送的指示消息,向主机发送TCP/IP读请求报文,该TCP/IP读请求报文包括该目标存储节点负责存储的数据块的标识。主机响应该TCP/IP读请求报文,向该存储节点发送携带有该目标存储节点所对应的数据块的报文。
方式2,控制器指示主机发起与目标存储节点间的数据传输。
一方面,控制器向主机返回每个目标存储节点的通信地址以及该目标存储节点负责传输的数据块的标识,主机接收到上述信息后,主动向目标存储节点发送数据传输请求报文。
另一方面,控制器向每个目标存储节点发送指示消息,该指示消息包括主机的通信地址以及待操作的数据块的标识,该指示消息的作用不是指示目标存储节点主动向主机返回数据,而是告知目标存储节点合法的数据传输需求,以便在目标存储节点接收到主机的数据传输请求报文时,将其识别为合法报文,响应该报文,与主机进行数据传输。
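In mode 2 the instruction message acts as a whitelist entry rather than a trigger. A minimal sketch of that check follows, assuming the storage node keys its expectations on the pair (host address, block identifier); the class and method names are hypothetical.

```python
class StorageNode:
    def __init__(self, blocks: dict):
        self.blocks = blocks   # block identifier -> stored bytes
        self.expected = set()  # (host address, block identifier) pairs

    def on_instruction(self, host_addr: str, block_id: str) -> None:
        """Controller announces a legitimate transfer; nothing is sent yet."""
        self.expected.add((host_addr, block_id))

    def on_host_request(self, src_addr: str, block_id: str):
        """Serve the host's request only if the controller announced it."""
        if (src_addr, block_id) not in self.expected:
            return None  # not a legitimate request, ignore it
        self.expected.discard((src_addr, block_id))
        return self.blocks[block_id]
```

A request from an address that the controller never announced, or for a block it never announced, is simply not answered.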
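With the foregoing solution, the host and the storage node can transmit the data to be operated on over the TCP/IP connection without the controller forwarding the data. This avoids slow host reads and writes caused by controller bandwidth that cannot meet the demand of large data transfers, and also avoids slow host reads and writes caused by controller computing capability that cannot meet the demand of large-scale data processing.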
作为一种可能的实现方式,在步骤505之后,还包括如下步骤:
步骤506,目标存储节点在完成与主机通过TCP/IP连接传输目标数据的数据块之后,向控制器发送数据传输成功响应;
步骤507:控制器在接收到所有目标存储节点发送的数据传输成功响应后,向主机发送操作成功消息。
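With the foregoing solution, the controller can send an operation success message to the host after every target storage node has finished its data transmission task, informing the host that the read or write succeeded so that the host can promptly confirm the result.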
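Steps 506 and 507 imply that the controller tracks which target storage nodes have not yet reported success. A small sketch, with the hypothetical class name `OperationTracker`:

```python
class OperationTracker:
    """Tracks one operation request until all target nodes report success."""

    def __init__(self, target_nodes):
        self.pending = set(target_nodes)

    def on_transfer_success(self, node: str) -> bool:
        """Record one node's success; return True when the host can be notified."""
        self.pending.discard(node)
        return not self.pending
```

The controller sends the operation success message to the host only when the method returns `True`.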
接下来,结合图11具体介绍当第三连接为TCP/IP连接时本发明提供的另一种数据处理的方法,该方法包括如下步骤:
步骤601,主机通过第二连接向控制器发送操作请求,该操作请求包括目标数据的标识以及操作类型,操作类型为读操作;
步骤602,控制器确定存储目标数据的目标存储节点,以及确定每个目标存储节点所存储的目标数据的数据块。
步骤603,控制器通过第一连接向每个目标存储节点发送指示消息,该指示消息包括主机的通信地址、控制器的通信地址以及该目标存储节点所存储的数据块的标识。
作为一种可能的方式,控制器根据主机的TCP窗口大小确定主机的数据接收量,根据该数据接收量确定目标存储节点通过每个TCP/IP报文携带的目标数据的数据块,该数据块的大小不大于主机的TCP窗口大小,然后,控制器生成包括所述数据块的标识的所述指示消息,向目标存储节点发送该指示消息。
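The controller may obtain the TCP window of the host in several ways. For example, when the second connection between the host and the controller is a connection that supports TCP/IP (such as an iSCSI connection), the host negotiates the TCP window size when it establishes the second connection with the controller, and the controller thereby learns the TCP window of the host. As another example, the host carries its currently available TCP window size in a TCP/IP packet sent to the controller, and the controller determines the TCP window of the host from that packet.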
例如,控制器根据该TCP窗口确定每次均向主机发送1000字节的数据块,然后,控制器确定目标数据的第一位起长度为1000字节的数据块所在的目标存储节点,向该目标存储节点发送指示消息,指示该目标存储节点向主机发送该1000字节的数据块,在该1000字节的数据块发送成功后,再指示存储目标数据的第二个长度为1000字节的数据块所在的目标存储节点向主机发送该第二个长度为1000字节的数据,以此类推,直至目标存储节点将目标数据全部发送至主机。
在一种可能的实现方式中,控制器确定每次向主机发送的数据块的长度可以不同,这是因为主机的可用TCP窗口是动态变化的。例如,控制器确定第一次向主机发送目标数据的第一个长度为1000字节的数据块,指示该目标存储节点向主机发送该1000字节的数据块,然后,再确定第二次向主机发送长度为800字节的数据块,指示存储该第一个长度为1000字节的数据块之后的长度为800字节的数据块的目标存储节点向主机发送该长度为800字节的数据块。
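In a possible implementation, the controller may allocate the TCP window of the host to multiple target storage nodes at the same time. For example, the TCP window size of the host is 1000 bytes, a 450-byte first data block of the target data is stored on a first target storage node, and a 500-byte second data block of the target data is stored on a second target storage node. The controller may send instruction messages to the first target storage node and the second target storage node at the same time, instructing the first target storage node to send the first data block to the host and the second target storage node to send the second data block to the host. Because the combined size of the first and second data blocks does not exceed the TCP window size of the host, the host can receive both data blocks successfully. In this way, the TCP window of the host is fully used and data transmission efficiency improves.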
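One way to picture the window-based dispatch is as greedy batching: the controller groups consecutive data blocks so that each group fits within the host's TCP window, and issues the instruction messages of one group together. The sketch below assumes a fixed window and a block list already in data order; `schedule_batches` is a hypothetical name, and a real controller would also react to a window that changes between batches.

```python
def schedule_batches(blocks: list, window: int) -> list:
    """Group (node, block_id, size) entries into batches that fit the TCP window."""
    batches, current, used = [], [], 0
    for node, block_id, size in blocks:
        if size > window:
            raise ValueError("a single data block must not exceed the TCP window")
        if used + size > window:
            # The next block would overflow the window: close the current batch.
            batches.append(current)
            current, used = [], 0
        current.append((node, block_id))
        used += size
    if current:
        batches.append(current)
    return batches
```

With a 1000-byte window, the 450-byte and 500-byte blocks from the example above land in the same batch, and a further 300-byte block waits for the next one.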
步骤604,目标存储节点响应该指示消息,以控制器的通信地址为源地址以及以主机的通信地址为目的地址发送携带目标数据的数据块的TCP/IP报文。
图12所示为目标存储节点发送的TCP/IP报文的网络层报文头以及传输层报文头的示意图,目标存储节点通过将网络层报文头中源IP地址设置为控制器的IP地址,将传输层报文头中源端口号、TCP序列号、TCP响应序列号、发送窗口大小等参数均设置为控制器的参数。控制器可以在给目标存储节点发送的指示消息中添加上述参数,以使目标存储节点获得上述参数。
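The header substitution of Figure 12 can be illustrated with plain data structures. The sketch only fills in header fields and sends nothing; the `SegmentHeader` fields and the layout of the `instruction` dictionary are assumptions, chosen to mirror the parameters that the text says the controller places in the instruction message.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SegmentHeader:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    seq: int
    ack: int
    window: int

def build_header(instruction: dict) -> SegmentHeader:
    """Fill the header with the controller's parameters instead of the node's own."""
    ctrl = instruction["controller"]  # values supplied by the controller
    host = instruction["host"]
    return SegmentHeader(
        src_ip=ctrl["ip"], dst_ip=host["ip"],
        src_port=ctrl["port"], dst_port=host["port"],
        seq=ctrl["seq"], ack=ctrl["ack"], window=ctrl["window"],
    )
```

From the host's point of view, a packet carrying such a header is indistinguishable from one sent by the controller on the existing second connection.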
步骤605,主机接收目标存储节点发送的源地址为控制器的通信地址的该TCP/IP报文,从该TCP/IP报文中获取目标数据的数据块。
具体的,控制器一次指示所有目标存储节点中的一个或多个向主机发送目标数据的数据块,这些数据块的大小之和不大于主机的TCP窗口大小。主机在接收上述数据块之后,向控制器返回接收数据响应。控制器在接收该接收数据响应之后,确定向目标存储节点发送的指示消息均被成功响应,继续向所有目标存储节点中的一个或多个发送指示消息,指示该一个或多个目标存储节点向主机分别发送目标数据的数据块,所有数据块的大小之和不大于主机的TCP窗口大小。通过上述方案,控制器可以根据主机返回的接收数据响应消息控制目标存储节点有序地向主机发送目标数据的数据块,保证数据传输的准确性。
结合上一段所述的实施方式,如果控制器在向目标存储节点发送指示消息后预设时长内未能从主机接收该接收数据响应消息,控制器向该存储节点发送重传指令,指示存储节点重新向主机发送数据。本实现方式中,通过上述重传机制保证数据传输的顺利进行。
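With the solution of steps 601 to 605, the storage node modifies the source address of the packets it sends and, posing as the controller, sends the host TCP/IP packets carrying data blocks of the target data. The host identifies these TCP/IP packets as data blocks of the target data returned by the controller. Because the host treats the TCP/IP packets sent by the storage node as packets sent by the controller, the storage node can send data directly to the host over the TCP/IP connection between them without any change to an existing host and without the controller forwarding the data, which reduces the time needed to send data from the target storage node to the host. This also avoids slow host reads and writes caused by controller bandwidth that cannot meet the demand of large data transfers, and slow host reads and writes caused by controller computing capability that cannot meet the demand of large-scale data processing. Furthermore, because the hardware of existing computers and their working protocols need not be changed, the cost is low.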
图13所示为本发明实施例的一种数据处理的装置50,该装置50对应图3所示数据处理系统中的控制器20,用于实现图4至图12所述数据处理方法中控制器的功能。所述装置用于管理所述至少两个存储节点,所述装置50包括:
第一接收模块51,用于接收所述主机发送的操作请求,所述操作请求包括待操作的目标数据的标识和操作类型;
确定模块52,用于根据所述目标数据的标识从所述至少两个存储节点中确定至少一个目标存储节点;
第一发送模块53,用于向所述至少一个目标存储节点发送指示消息,所述指示消息用于指示所述至少一个目标存储节点通过与所述主机的连接向所述主机发送所述目标数据或从所述主机获取所述目标数据。
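It should be understood that the apparatus 50 in this embodiment of the present invention may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the data processing methods shown in Figures 4 to 12 are implemented by software, the apparatus 50 and its modules may also be software modules.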
可选的,所述确定模块52,用于根据所述目标数据的标识确定所述目标存储节点包括存储所述目标数据的第一数据块的第一存储节点以及存储所述目标数据的第二数据块的第二存储节点;
所述第一发送模块53,具体用于向所述第一存储节点发送第一指示消息,所述第一指示消息包括所述第一数据块的标识,用于指示所述第一存储节点向所述主机发送所述第一数据块或从所述主机获取所述第一数据块;向所述第二存储节点发送第二指示消息,所述第二指示消息包括所述第二数据块的标识,用于指示所述第二存储节点向所述主机发送所述第二数据块或从所述主机获取所述第二数据块。
可选的,所述操作请求还包括所述主机在内存中为所述目标数据指定的目标存储区域的位置参数;
所述确定模块52还用于根据所述目标存储区域的位置参数确定所述目标存储区域中的为所述至少一个目标存储节点中每个目标存储节点对应的所述目标数据的数据块所指定的子存储区域的位置参数;生成包括所述数据块的标识、所述操作类型以及所述子存储区域的位置参数的所述指示消息,所述指示消息用于指示接收到所述指示消息的目标存储节点在所述操作类型为读操作时,根据所述子存储区域的位置参数通过与所述主机的RDMA连接将所述目标数据的数据块写入所述主机内存中所述子存储区域,或者,在所述操作类型为写操作时,根据所述子存储区域的位置参数,通过与所述主机的RDMA连接从所述主机内存中所述子存储区域读取所述目标数据的数据块,并存储所述数据块。
可选的,所述装置50还包括第二接收模块54和第二发送模块55,
第二接收模块54,用于从所述存储节点中任一存储节点接收所述任一存储节点创建的第一QP的参数;
第二发送模块55,用于将所述第一QP的参数发送给所述主机;
所述第一接收模块51,还用于从所述主机接收所述主机创建的第二QP的参数;
所述第一发送模块53,还用于将所述第二QP的参数发送给所述任一存储节点。
可选的,所述装置50与所述主机之间的连接为TCP/IP连接;所述指示消息包括所述操作类型、所述主机的通信地址以及所述目标数据的标识;所述装置50还包括:
第二发送模块55,用于向所述主机发送第三指示消息,所述第三指示消息包括一个所述目标存储节点的通信地址,用于指示所述主机通过与所述目标存储节点间的TCP/IP向所述目标存储节点发送所述目标数据或从所述目标存储节点获取所述目标数据。
可选的,所述装置50与所述主机之间的连接为TCP/IP连接,所述操作类型为读操作;所述指示消息包括所述操作类型、所述主机的通信地址、所述装置50的通信地址以及所述目标数据的标识,所述指示消息用于指示所述目标存储节点以所述装置50的通信地址为源地址以及以所述主机的通信地址为目的地址发送携带所述目标数据的TCP/IP报文。
可选的,所述确定模块52,还用于根据所述主机的TCP窗口大小确定所述主机的数据接收量,根据所述数据接收量确定所述至少一个目标存储节点中每个目标存储节点通过每个TCP/IP报文携带的所述目标数据的数据块;生成包括所述数据块的标识的所述指示消息,所述指示消息用于指示接收到所述指示信息的目标存储节点向所述主机发送所述数据块。
上述装置50的各模块的实现方式,与图4至图12对应的数据处理的方法中由控制器执行的步骤的实施方式相同,在此不予重复。
图14所示为本发明实施例的一种数据处理的装置60,用于实现图4至图12所述数据处理方法中存储节点的功能。所述装置60与控制器通信连接,所述控制器用于管理所述装置60;所述装置60包括:
接收模块61,用于接收所述控制器发送的指示消息,所述指示消息包括主机待操作的目标数据的标识和操作类型;
传输模块62,用于根据所述指示消息,通过与所述主机的连接向所述主机发送所述目标数据或从所述主机获取所述目标数据。
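Optionally, the apparatus 60 further includes a connection module 63, configured to: create a first QP, where the first QP includes a first send queue SQ and a first receive queue RQ; send the parameters of the first QP to the controller; receive, from the controller, the parameters of a second QP created by the host, where the second QP includes a second send queue SQ and a second receive queue RQ; and, according to the parameters of the first QP and the parameters of the second QP, bind the first SQ to the second RQ of the second QP and bind the first RQ to the second SQ, thereby establishing an RDMA connection with the host.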
可选的,所述指示消息包括所述目标数据的标识、所述操作类型以及所述主机的内存中目标存储区域的位置参数;
所述传输模块62,用于在所述操作类型为读操作时,通过与所述主机的RDMA连接将所述目标数据写入所述主机内存中所述目标存储区域;在所述操作类型为写操作时,通过与所述主机的RDMA连接从所述主机内存中所述目标存储区域读取所述目标数据,并存储所述目标数据。
可选的,所述指示消息包括所述操作类型、所述主机的通信地址、所述控制器的通信地址以及所述目标数据的标识;
所述传输模块62,用于以所述控制器的通信地址为源地址以及以所述主机的通信地址为目的地址发送携带所述目标数据的TCP/IP报文。
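The modules of the apparatus 60 are implemented in the same way as the steps performed by the storage node in the data processing methods corresponding to Figures 4 to 12, and details are not repeated here.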
图15所示为本发明实施例的一种数据处理的设备70,用于实现图4至图12所述数据处理方法中控制器的功能。设备70包括处理器71、存储器72、通信接口73以及总线74,所述处理器71、所述存储器72和所述通信接口73之间通过所述总线74连接并完成相互间的通信,所述存储器72中用于存储计算机执行指令,所述设备70运行时,所述处理器 71执行所述存储器72中的计算机执行指令以利用所述设备中的硬件资源执行图4至图12对应的数据处理的方法中由控制器执行的步骤。
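The processor 71 may include one processing unit or multiple processing units. For example, the processor 71 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, such as one or more digital signal processors (DSP) or one or more field-programmable gate arrays (FPGA).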
上述存储器72可以包括一个存储单元,也可以包括多个存储单元,且用于存储可执行程序代码、设备70运行所需要参数、数据等。且存储器72可以包括随机存储器(random-access memory,RAM),也可以包括非易失性存储器(non-volatile memory,NVM),例如磁盘存储器,闪存(flash)等。上述通信接口73,在一些实施例中可以为支持TCP/IP协议的接口,在另一些实施例中可以为支持RDMA协议的接口。
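The bus 74 may be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus 74 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the figure shows only one line, but this does not mean that there is only one bus or only one type of bus.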
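It should be understood that the data processing device 70 according to this embodiment of the present invention may correspond to the apparatus 50 in the embodiments of the present invention, and may correspond to the procedure of the operation steps performed by the controller in the data processing methods described in Figures 4 to 12. For brevity, details are not repeated here.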
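The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the procedures or functions described in the embodiments of the present invention are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that contains one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive, SSD).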
本发明实施例还提供一种存储设备,用于实现图4至图12所述数据处理方法中存储节点的功能。该存储设备的结构可以继续参照图15,该设备包括处理器、存储器、通信接口以及总线,所述处理器、所述存储器和所述通信接口之间通过所述总线连接并完成相互间的通信,所述存储器中用于存储计算机执行指令,所述设备运行时,所述处理器执行所述存储器中的计算机执行指令以利用所述设备中的硬件资源执行图4至图12对应的数据处理的方法中由存储节点执行的步骤。
本发明实施例提供的数据处理方法除了可以应用于分布式存储系统之外,还可以用于非分布式的存储阵列。下面先介绍现有技术中主机向存储阵列读写目标数据的技术方案。
图16为存储阵列的系统框图,如图所示,该系统包括:主机810、第一控制器821、第二控制器822以及多个磁盘,如磁盘830-1、830-2、...830-n。其中,所述第一控制器821 以及第二控制器822用于管理多个磁盘,且为主机810提供访问磁盘上数据的服务,第一控制器821和第二控制器822的工作方式,可以为主备模式,即同一时刻仅有一个控制器为主用状态,负责接收主机的操作请求,并对该操作请求进行处理,包括直接与主机传输目标数据;或者,将该操作请求转发给备控制器,由备控制器进行数据的处理。第一控制器821和第二控制器822也可以没有主备之分,均能够接收主机发送的操作请求,对该操作请求进行处理。
现有技术中主机810访问磁盘中数据的过程如下:
在管理多个磁盘的多个控制器有主备之分时,主机810向主控制器发送操作请求,在操作请求的操作类型为读操作时,主控制器根据预置算法从磁盘读取操作请求中请求操作的目标数据,向主机返回目标数据;在操作请求的操作类型为写操作时,主控制器根据预置算法确定操作请求中携带的目标数据在磁盘中的目标存储位置,将该目标数据写入确定出的目标存储位置。
在管理多个磁盘的多个控制器没有主备之分时,主机810可以将操作请求发送给任一控制器,或者,主机810确定管理目标数据所归属的逻辑单元号LUN的控制器,向该控制器发送操作请求。上述两种情形中,控制器与主机间的数据传输与上一段中介绍的主机与主控制器之间的数据传输方式相同。
然而,当接收主机810发送的操作请求的控制器(如第一控制器821)无法为主机进行目标数据的处理时,例如第一控制器与磁盘间的链路发生故障,第一控制器821需要将主机810的操作请求转发给管理多个磁盘的其他控制器处理,例如第二控制器822,第二控制器822接收第一控制器转发的操作请求,为主机810进行目标数据的处理。
第二控制器822为主机810进行目标数据的处理的过程包括:当操作请求的操作类型为读操作时,第二控制器822从至少一个磁盘(如磁盘830-1)中读取该目标数据,将目标数据发送给第一控制器821,由第一控制器821将目标数据返回给主机810。当该操作请求的操作类型为写操作时,第一控制器821从主机810获得目标数据,将目标数据发送给第二控制器822,第二控制器822将目标数据写入至少一个磁盘中。
可见,接收到主机的操作请求的控制器无法为主机进行目标数据的处理时,虽然可以通过其他控制器为主机进行目标数据的处理,但是,负责目标数据的处理的控制器与主机之间需要经由接收主机的操作请求的控制器进行目标数据的传输,导致数据传输的路径较长,数据传输耗时较长。
本发明还提供多个实施例用于解决上述存储阵列中接收主机操作请求的控制器需要为进行目标数据处理的控制器转发目标数据,导致数据传输耗时较长的问题。
图17所示为本发明实施例的存储阵列的系统框图,包括至少两个控制器以及至少两个磁盘。至少两个控制器可以为如图17所示的第一控制器911、第二控制器912,至少两个磁盘可以为如图17所示的磁盘920-1、磁盘920-2、…、磁盘920-n。
至少两个磁盘中的每个磁盘可以为机械磁盘、固态硬盘SSD、固态混合硬盘(solid state hybrid drive,SSHD)等。至少两个磁盘的每个LUN可以归属于一个控制器管理,该至少两个控制器与该至少两个磁盘形成存储阵列。主机930可以通过控制器访问该控制器管理的LUN中的目标数据。
至少两个控制器中每个控制器与主机930建立RDMA连接,该RDMA连接的建立过程与步骤201至步骤208中所述的方法类似。不同之处在于,步骤201至步骤208所述方 法中,由控制器为建立RDMA连接的主机与存储节点同步二者创建的QP的参数,而在本实施例中,主机930与控制器可以通过二者间的非RDMA连接(如TCP/IP连接)向对方发送各自创建的QP的参数,进而根据QP的参数关联对方创建的QP,建立RDMA连接。
结合图17所示的系统,本发明实施例提供一种数据处理的方法,参照图18,该方法包括如下步骤:
步骤701:主机向第一控制器发送操作请求,该操作请求包括待操作的目标数据的标识和操作类型。
具体的,主机中记录有数据与管理该数据所归属的LUN的控制器的映射关系,主机根据该映射关系确定第一控制器管理目标数据所归属的LUN,因此,主机向第一控制器发送所述操作请求。
可选的,第一控制器为存储阵列中任一控制器,主机向存储阵列中的任一控制器发送该操作请求。
可选的,第一控制器为存储阵列的主控制器,主机向存储阵列中的主控制器发送该操作请求。
步骤702,第一控制器接收主机发送的操作请求,根据目标数据的标识确定由至少两个控制器中的第二控制器向主机发送目标数据或从主机获取目标数据。
第一控制器确定由第二控制器向主机发送目标数据或从主机获取目标数据可以有多种实现方式,包括但不限于以下方式:
方式1,第一控制器根据预设的负载均衡策略确定向主机发送目标数据或从主机获取目标数据的第二控制器。
具体的,方式1可以包括如下实现方式:
其一,第一控制器为主控制器,主控制器负责接收主机的操作请求,并根据备控制器的预设负载均衡策略(如轮询、分配至负载最小的备控制器等)分配响应该操作请求的备控制器,该主控制器自身并不直接响应主机的操作请求。
其二,结合所述其一,与之不同之处在于,作为主控制器的第一控制器在自身的负载不大于阈值时响应主机的操作请求,在自身的负载大于该阈值时根据备控制器的负载分配响应该操作请求的备控制器。
其三,存储阵列中控制器没有主备之分,每个控制器均保存有其他控制器的负载,接收主机操作请求的第一控制器在自身的负载超过阈值后,将操作请求发送给当前负载最小的其他控制器。
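Fourth, the controllers in the storage array are not divided into active and standby, and no controller knows the load of the others. When the load of the first controller that receives the host's operation request exceeds a threshold, the first controller sends the operation request to any other controller. If the load of the receiving second controller does not exceed the threshold, the second controller responds to the operation request; otherwise, the second controller forwards the operation request to another controller other than the first controller.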
方式2,第一控制器根据目标数据所在的LUN的归属确定响应该操作请求的第二控制器。
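After receiving the operation request, the first controller determines that the LUN in which the target data resides is not actually managed by itself, identifies the second controller that actually manages that LUN, and determines that this second controller is to respond to the host's operation request. Because the second controller that manages the LUN holding the target data can access the target data faster, having the second controller send the target data to the host or obtain the target data from the host reduces the time taken to transmit the target data.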
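Mode 3: when the first controller itself cannot access the target data, it determines, according to mode 1 or mode 2 above, the second controller that is to respond to the host's operation request.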
第一控制器与管理的磁盘间的链路发生故障,第一控制器无法从磁盘读写目标数据。第一控制器通过上述方式1或方式2,确定其他控制器响应该操作请求保证主机的操作请求继续被正确响应。
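The selection modes above can be combined in many ways. The sketch below shows one assumed combination: LUN ownership takes priority (mode 2), and otherwise the first controller serves the request itself unless its load exceeds a threshold, in which case the least-loaded other controller is chosen (the third variant of mode 1). The function name, the load values, and the `lun_owner` mapping are hypothetical.

```python
def choose_controller(first: str, loads: dict, lun_owner: dict,
                      lun: str, threshold: float) -> str:
    """Pick the controller that will transfer the target data with the host."""
    owner = lun_owner.get(lun)
    if owner is not None and owner != first:
        return owner  # mode 2: the LUN is managed by another controller
    if loads[first] <= threshold:
        return first  # the receiving controller can serve the request itself
    # Mode 1: hand over to the least-loaded other controller.
    others = {name: load for name, load in loads.items() if name != first}
    return min(others, key=others.get)
```

Whichever controller is returned then transfers the target data with the host directly, as in steps 703 and 704.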
步骤703:第一控制器向第二控制器发送指示消息。
指示消息包括目标数据的标识、操作类型以及主机的标识,用于指示第二控制器通过与主机间的连接向主机发送目标数据或从主机获取目标数据。
步骤704:第二控制器根据指示消息,通过与主机间的连接向主机发送目标数据或从主机获取目标数据。
主机与第二控制器之间的连接可以有多种实现方式,例如RDMA连接,TCP/IP连接、快速外围组件互连PCIe连接等。
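With the foregoing solution, the host and the second controller that provides the read/write service for the target data can transmit the target data over the connection established between them, without the first controller that received the host's operation request forwarding the target data. This shortens the data transmission path and reduces the time taken to transmit the target data.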
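In a possible implementation, the first controller and the second controller each establish an RDMA connection with the host. The operation request that the host sends to the first controller includes the identifier of the target data, the operation type, the identifier of the host, and the location parameter of the target storage area that the host designates in memory for the target data. The instruction message that the first controller sends to the second controller likewise contains the identifier of the target data, the operation type, the identifier of the host, and the location parameter.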
第二控制器根据该指示消息向主机发送目标数据或从主机获取目标数据的过程为:
在主机的操作请求的操作类型为读操作时,第二控制器确定目标数据所在的磁盘,从该磁盘读取目标数据,根据主机的标识确定与主机间的RDMA连接,通过该RDMA连接以前述RDMA write的方式将目标数据写入主机内存的目标存储区域。
在主机的操作请求的操作类型为写操作时,第二控制器根据主机的标识确定与主机间的RDMA连接,通过该RDMA连接以前述RDMA read的方式从主机内存的目标存储区域读取数据,根据预置算法确定目标数据在磁盘中的存储位置,将目标数据写入磁盘。
通过上述方案,主机与负责为主机提供读写目标数据服务的第二控制器可以通过二者间建立的RDMA连接快速地传输目标数据,实现高速地读写数据。
需要说明的是,第二控制器与主机间的RDMA连接,可以在第二控制器接收第一控制器发送的所述指示消息之前建立,也可以在接收所述指示消息之后建立。
作为一种可能的实现方式,存储阵列的控制器之间可以对主机的操作请求进行多次转发,由最后接收操作请求的控制器与主机进行目标数据的传输。
例如,至少两个磁盘的第三控制器接收主机发送的第二操作请求,该第二操作请求包括主机待操作的第三目标数据的标识和操作类型。第三控制器根据步骤702所述方式确定由第一控制器与主机传输该第三目标数据。第三控制器向第一控制器发送指示消息,该指示消息包括主机待操作的第三目标数据的标识和操作类型。
第一控制器接收第三控制器发送的该指示消息后,根据步骤702所述方式确定由第二控制器向主机发送第三目标数据或从主机获取第三目标数据。第一控制器向第二控制器发送指示消息,该指示消息用于指示第二控制器与主机传输第三目标数据。
第二控制器根据第一控制器发送的该指示消息,通过与主机间的连接与主机传输第三目标数据。
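With the foregoing solution, even when the controllers forward the host's operation request two or more times, the host can still transmit the target data with the second controller that provides the read/write service for the target data over the connection established between them, which shortens the data transmission path and reduces the time taken to transmit the target data.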
请参照图17,本发明实施例还提供一种数据处理的系统,该系统包括图17所述的第一控制器911、第二控制器912以及多个磁盘,如磁盘920-1、磁盘920-2、…磁盘920-n,第一控制器911、第二控制器912用于执行步骤701至步骤704所述的数据处理的方法,用于实现响应主机操作请求的控制器与主机通过二者间的直接链路传输目标数据。
图19所示为本发明实施例的一种处理数据的装置80,用于实现图18所述方法中第一控制器的功能,所述装置80与第二控制器以及至少两个磁盘连接,所述装置80以及所述第二控制器用于管理所述至少两个磁盘;所述装置包括:
接收模块81,用于接收主机发送的操作请求,所述操作请求包括待操作的目标数据的标识和操作类型;
确定模块82,用于确定由所述至少两个控制器中的第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据;
发送模块83,用于向所述第二控制器发送指示消息,所述指示消息用于指示所述第二控制器通过与所述主机间的连接向所述主机发送所述目标数据或从所述主机获取所述目标数据。
可选的,所述第二控制器与所述主机之间的连接为RDMA连接;所述操作请求还包括所述主机中目标存储区域的位置参数;所述指示消息包括所述目标数据的标识、所述操作类型以及所述位置参数,所述指示消息用于指示所述第二控制器在所述操作类型为读操作时,从所述至少两个磁盘中获取所述目标数据;通过与所述主机的所述RDMA连接将所述目标数据写入所述主机内存中所述目标存储区域;以及在所述操作类型为写操作时,通过与所述主机的所述RDMA连接从所述主机内存中所述目标存储区域读取所述目标数据;将所述目标数据写入所述至少两个磁盘。
可选的,所述确定模块82,用于根据预设负载均衡策略确定由所述第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据;或根据所述目标数据所在的逻辑单元号LUN的归属确定由所述第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据。
可选的,所述接收模块81,还用于接收管理所述至少两个磁盘的第三控制器的指示消息,所述第三控制器发送的指示消息包括主机待操作的第二目标数据的标识和操作类型;
所述确定模块82,还用于根据所述第三控制器发送的指示消息确定所述第二目标数据;
所述装置还包括:
传输模块84,用于通过与所述主机的连接与所述主机传输所述第二目标数据。
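Optionally, the receiving module 81 is further configured to receive a fifth instruction message from a fourth controller that manages the at least two disks, where the fifth instruction message includes an identifier of third target data to be operated on by the host and an operation type;
the determining module 82 is further configured to determine, according to the identifier of the third target data, that the second controller is to send the third target data to the host or obtain the third target data from the host;
the sending module 83 is further configured to send a sixth instruction message to the second controller, where the sixth instruction message instructs the second controller to transmit the third target data with the host.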
上述装置80的各模块的实现方式,可以参照图18所示方法中由第一控制器执行的步骤的实现方式,在此不予详述。
图20所示为本发明实施例提供的一种数据处理的设备90,设备90包括处理器91、存储器92、第一通信接口93、第二通信接口94以及总线95,所述处理器91、所述存储器92和所述第一通信接口93、所述第二通信接口94之间通过所述总线95连接并完成相互间的通信,所述存储器92中用于存储计算机执行指令,所述第一通信接口用于与主机进行通信,所述第二通信接口用于与至少两个磁盘通信。所述设备90运行时,所述处理器91执行所述存储器92中的计算机执行指令以利用所述设备中的硬件资源执行图18所示的数据处理的方法中由第一控制器执行的步骤。
处理器91的物理实现可以参照前述处理器71,存储器92的物理实现可以参照前述存储器72,总线95的物理实现可以参照前述总线74。
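The first communication interface 93 may be an interface that supports the RDMA protocol, or may be an interface that supports the TCP/IP protocol.
The second communication interface 94 may be an interface that supports the PCIe protocol, or may be any prior-art interface used to access disks.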
显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的精神和范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (44)

  1. 一种数据处理的系统,其特征在于,包括控制器以及至少两个存储节点,所述控制器用于管理所述至少两个存储节点;
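    the controller is configured to: receive, through a second connection between the controller and a host, an operation request sent by the host, where the operation request includes an identifier of target data to be operated on and an operation type; determine at least one target storage node from the at least two storage nodes according to the identifier of the target data; and send an instruction message to the at least one target storage node through a first connection with the at least one target storage node, where the instruction message instructs the target storage node to send the target data to the host or obtain the target data from the host;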
    所述至少一个目标存储节点,用于根据所述指示消息,通过与所述主机之间的第三连接向所述主机发送所述目标数据或从所述主机获取所述目标数据。
  2. 根据权利要求1所述的系统,其特征在于,当根据所述目标数据的标识确定目标存储节点,向所述目标存储节点发送指示消息时,所述控制器用于:
    根据所述目标数据的标识确定所述至少一个目标存储节点包括存储所述目标数据的第一数据块的第一存储节点以及存储所述目标数据的第二数据块的第二存储节点;
    向所述第一存储节点发送第一指示消息,所述第一指示消息包括所述第一数据块的标识,用于指示所述第一存储节点向所述主机发送所述第一数据块或从所述主机获取所述第一数据块;
    向所述第二存储节点发送第二指示消息,所述第二指示消息包括所述第二数据块的标识,用于指示所述第二存储节点向所述主机发送所述第二数据块或从所述主机获取所述第二数据块。
  3. 根据权利要求1或2所述的系统,其特征在于,所述第三连接为远程直接数据存取RDMA连接;所述操作请求还包括所述主机在内存中为所述目标数据指定的目标存储区域的位置参数;
    所述控制器,还用于根据所述目标存储区域的位置参数确定所述目标存储区域中的为所述目标存储节点对应的所述目标数据的数据块所指定的子存储区域的位置参数;生成包括所述数据块的标识、所述操作类型以及所述子存储区域的位置参数的所述指示消息;
    当通过与所述主机之间的第三连接向所述主机发送所述目标数据或从所述主机获取所述目标数据时,所述目标存储节点用于:
    在所述操作类型为读操作时,根据所述子存储区域的位置参数,通过与所述主机的RDMA连接将所述目标数据的数据块写入所述主机的内存中的所述子存储区域;
    在所述操作类型为写操作时,根据所述子存储区域的位置参数,通过与所述主机的RDMA连接从所述主机的内存中的所述子存储区域读取所述目标数据的数据块,并存储所述数据块。
  4. 根据权利要求1至3中任一所述的系统,其特征在于,所述至少两个存储节点中的任一存储节点还用于:
    创建第一队列对QP,所述第一QP包括第一发送队列SQ以及第一接收队列RQ;将所述第一QP的参数发送给所述控制器,以及从所述控制器接收所述主机创建的第二QP的参数,所述第二QP包括第二SQ以及第二RQ;根据所述第一QP的参数以及所述第二QP的参数将所述第一SQ与所述第二QP的第二RQ绑定,以及将所述第一RQ与所述第二SQ绑定,进而与所述主机建立所述RDMA连接;
    所述控制器,还用于从所述主机接收所述第二QP的参数以及从所述任一存储节点接收所述第一QP的参数;将所述第一QP的参数发送给所述主机,将所述第二QP的参数发送给所述任一存储节点。
  5. 根据权利要求1至2中任一项所述的系统,其特征在于,所述第二连接与所述第三连接为TCP/IP连接;
    所述控制器,还用于向所述主机发送第三指示消息,所述第三指示消息包括所述目标存储节点的通信地址,用于指示所述主机通过与所述目标存储节点间的TCP/IP连接向所述目标存储节点发送所述目标数据或从所述目标存储节点获取所述目标数据。
  6. 根据权利要求1至2中任一项所述的系统,其特征在于,所述第二连接与所述第三连接为TCP/IP连接,所述操作类型为读操作;所述指示消息包括所述操作类型、所述主机的通信地址、所述控制器的通信地址以及所述目标数据的标识;
    在向所述主机发送所述目标数据或从所述主机获取所述目标数据时,所述目标存储节点用于:
    以所述控制器的通信地址为源地址、以所述主机的通信地址为目的地址发送携带所述目标数据的TCP/IP报文。
  7. 根据权利要求6所述的系统,其特征在于,所述控制器还用于:根据所述主机的TCP窗口大小确定所述主机的数据接收量,根据所述数据接收量确定所述目标存储节点通过每个TCP/IP报文携带的所述目标数据的数据块,生成包括所述数据块的标识的所述指示消息。
  8. 根据权利要求1至7任一项所述的系统,其特征在于,所述至少一个目标存储节点中的每个目标存储节点,还用于在向所述主机发送所述目标数据或从所述主机获取所述目标数据之后,向所述控制器发送数据传输成功消息;
    所述控制器,还用于在接收所述数据传输成功消息之后,向所述主机发送操作成功消息。
  9. 一种数据处理的方法,其特征在于,所述方法由控制器执行,所述控制器用于管理至少两个存储节点,所述方法包括:
    所述控制器接收主机发送的操作请求,所述操作请求包括待操作的目标数据的标识和操作类型;
    所述控制器根据所述目标数据的标识从所述至少两个存储节点中确定至少一个目标存储节点;
    所述控制器向所述至少一个目标存储节点发送指示消息,所述指示消息用于指示所述至少一个目标存储节点通过与所述主机的连接向所述主机发送所述目标数据或从所述主机获取所述目标数据。
  10. 根据权利要求9所述的方法,其特征在于,所述控制器根据所述目标数据的标识从所述至少两个存储节点中确定至少一个目标存储节点,包括:
    所述控制器根据所述目标数据的标识确定所述至少一个目标存储节点包括存储所述目标数据的第一数据块的第一存储节点以及存储所述目标数据的第二数据块的第二存储节点;
    所述控制器向所述至少一个目标存储节点发送指示消息,包括:
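    the controller sends a first instruction message to the first storage node, where the first instruction message includes an identifier of the first data block and instructs the first storage node to send the first data block to the host or obtain the first data block from the host;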
    所述控制器向所述第二存储节点发送第二指示消息,所述第二指示消息包括所述第二数据块的标识,用于指示所述第二存储节点向所述主机发送所述第二数据块或从所述主机获取所述第二数据块。
  11. 根据权利要求9至10中任一项所述的方法,其特征在于,所述操作请求还包括所述主机在内存中为所述目标数据指定的目标存储区域的位置参数;在所述控制器向所述至少一个目标存储节点发送指示消息之前,还包括:
    所述控制器根据所述目标存储区域的位置参数确定所述目标存储区域中的为所述至少一个目标存储节点中每个目标存储节点对应的所述目标数据的数据块所指定的子存储区域的位置参数;
    所述控制器生成包括所述数据块的标识、所述操作类型以及所述子存储区域的位置参数的所述指示消息,所述指示消息用于指示接收到所述指示消息的目标存储节点在所述操作类型为读操作时,根据所述子存储区域的位置参数通过与所述主机的RDMA连接将所述目标数据的数据块写入所述主机内存中所述子存储区域,或者,在所述操作类型为写操作时,根据所述子存储区域的位置参数,通过与所述主机的RDMA连接从所述主机内存中所述子存储区域读取所述目标数据的数据块,并存储所述数据块。
  12. 根据权利要求9至11中任一项所述的方法,其特征在于,所述方法还包括:
    所述控制器从所述主机接收所述主机创建的第二QP的参数以及从所述至少两个存储节点中的任一存储节点接收所述任一存储节点创建的第一QP的参数;
    所述控制器将所述第一QP的参数发送给所述主机,以及将所述第二QP的参数发送给所述任一存储节点。
  13. 根据权利要求9或10所述的方法,其特征在于,所述控制器与所述主机之间的连接为TCP/IP连接;所述指示消息包括所述操作类型、所述主机的通信地址以及所述目标数据的标识;所述方法还包括:
    所述控制器向所述主机发送第三指示消息,所述第三指示消息包括一个目标存储节点的通信地址,用于指示所述主机通过与所述目标存储节点间的TCP/IP连接向所述目标存储节点发送所述目标数据或从所述目标存储节点获取所述目标数据。
  14. 根据权利要求9或10所述的方法,其特征在于,所述控制器与所述主机之间的连接为TCP/IP连接,所述操作类型为读操作;所述指示消息包括所述操作类型、所述主机的通信地址、所述控制器的通信地址以及所述目标数据的标识,所述指示消息用于指示所述目标存储节点以所述控制器的通信地址为源地址以及以所述主机的通信地址为目的地址发送携带所述目标数据的TCP/IP报文。
  15. 根据权利要求14所述的方法,其特征在于,在所述向所述至少一个目标存储节点发送指示消息之前,所述方法还包括:
    所述控制器根据所述主机的TCP窗口大小确定所述主机的数据接收量,根据所述数据接收量确定所述至少一个目标存储节点中每个目标存储节点通过每个TCP/IP报文携带的所述目标数据的数据块;
    所述控制器生成包括所述数据块的标识的所述指示消息,所述指示消息用于指示接收到所述指示消息的目标存储节点向所述主机发送所述数据块。
  16. 一种数据处理的方法,其特征在于,所述方法由存储节点执行,所述存储节点与控制器通信连接,所述控制器用于管理所述存储节点;所述方法包括:
    所述存储节点接收所述控制器发送的指示消息,所述指示消息包括主机待操作的目标数据的标识和操作类型;
    所述存储节点根据所述指示消息,通过与所述主机的连接向所述主机发送所述目标数据或从所述主机获取所述目标数据。
  17. 根据权利要求16所述的方法,其特征在于,所述方法还包括所述存储节点与所述主机建立连接的步骤,所述步骤包括:
    the storage node creates a first QP, where the first QP includes a first send queue SQ and a first receive queue RQ;
    所述存储节点将所述第一QP的参数发送给所述控制器;
    所述存储节点从所述控制器接收所述主机创建的第二QP的参数,所述第二QP中包括第二SQ以及第二RQ;
    所述存储节点根据所述第一QP的参数以及所述第二QP的参数将所述第一SQ与所述第二QP的第二RQ绑定,以及将所述第一RQ与所述第二SQ绑定,进而与所述主机建立所述RDMA连接。
  18. 根据权利要求16或17所述的方法,其特征在于,所述指示消息包括所述目标数据的标识、所述操作类型以及所述主机的内存中的目标存储区域的位置参数;
    所述存储节点通过与所述主机的连接向所述主机发送所述目标数据或从所述主机获取所述目标数据,包括:
    在所述操作类型为读操作时,所述存储节点通过与所述主机的RDMA连接将所述目标数据写入所述主机内存中所述目标存储区域;
    在所述操作类型为写操作时,所述存储节点通过与所述主机的RDMA连接从所述主机内存中所述目标存储区域读取所述目标数据,并存储所述目标数据。
  19. 根据权利要求16所述的方法,其特征在于,所述指示消息包括所述操作类型、所述主机的通信地址、所述控制器的通信地址以及所述目标数据的标识;
    the storage node sending the target data to the host or obtaining the target data from the host through the connection with the host includes:
    所述存储节点以所述控制器的通信地址为源地址、以所述主机的通信地址为目的地址发送携带所述目标数据的TCP/IP报文。
  20. 一种处理数据的系统,其特征在于,包括至少两个控制器以及至少两个磁盘,所述至少两个控制器用于管理所述至少两个磁盘;
    所述至少两个控制器中的第一控制器,用于接收主机发送的操作请求,所述操作请求包括待操作的目标数据的标识和操作类型;确定由所述至少两个控制器中的第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据;向所述第二控制器发送指示消息,所述指示消息用于指示所述第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据;
    所述第二控制器,用于接收所述指示消息;根据所述指示消息,通过与所述主机的连接向所述主机发送所述目标数据或从所述主机获取所述目标数据。
  21. 根据权利要求20所述的系统,其特征在于,所述第二控制器与所述主机之间的连接为RDMA连接;所述操作请求还包括所述主机的内存中目标存储区域的位置参数; 所述指示消息包括所述目标数据的标识、所述操作类型以及所述位置参数;
    所述第二控制器,用于根据所述指示消息,通过与所述主机的连接向所述主机发送所述目标数据或从所述主机获取所述目标数据,包括:
    在所述操作类型为读操作时,从所述至少两个磁盘中获取所述目标数据;通过与所述主机的RDMA连接将所述目标数据写入所述主机内存中所述目标存储区域;
    在所述操作类型为写操作时,通过与所述主机的RDMA连接从所述主机内存中所述目标存储区域读取所述目标数据;将所述目标数据写入所述至少两个磁盘。
  22. 根据权利要求20或21所述的系统,其特征在于,所述第一控制器用于:
    根据预设负载均衡策略确定由所述第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据;或
    根据所述目标数据所在的逻辑单元号LUN的归属确定由所述第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据。
  23. 一种处理数据的方法,其特征在于,所述方法由第一控制器执行,所述第一控制器与第二控制器以及至少两个磁盘连接,所述第一控制器以及所述第二控制器用于管理所述至少两个磁盘;所述方法包括:
    所述第一控制器接收主机发送的操作请求,所述操作请求包括待操作的目标数据的标识和操作类型;
    所述第一控制器确定由所述至少两个控制器中的第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据;
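    the first controller sends an instruction message to the second controller, where the instruction message instructs the second controller to send the target data to the host or obtain the target data from the host through the connection between the second controller and the host.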
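  24. The method according to claim 23, wherein the connection between the second controller and the host is an RDMA connection, and the operation request further includes a location parameter of a target storage area in the memory of the host; the instruction message includes the identifier of the target data, the operation type, and the location parameter, and the instruction message instructs the second controller to: when the operation type is a read operation, obtain the target data from the at least two disks and write the target data into the target storage area in the host memory through the RDMA connection with the host; and when the operation type is a write operation, read the target data from the target storage area in the host memory through the RDMA connection with the host and write the target data to the at least two disks.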
  25. 根据权利要求23或24所述的方法,其特征在于,所述第一控制器确定由所述至少两个控制器中的第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据,包括:
    根据预设负载均衡策略确定由所述第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据;或
    根据所述目标数据所在的逻辑单元号LUN的归属确定由所述第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据。
  26. 根据权利要求23至25任一项所述的方法,其特征在于,还包括:
    所述第一控制器接收管理所述至少两个磁盘的第三控制器发送的指示消息,所述第三控制器发送的指示消息包括所述主机待操作的第二目标数据的标识和操作类型;
    所述第一控制器响应所述第三控制器发送的指示消息,通过与所述主机间的连接向所述主机发送所述第二目标数据或从所述主机获取所述第二目标数据。
  27. 一种数据处理的装置,其特征在于,所述装置用于管理至少两个存储节点,所述装置包括:
    第一接收模块,用于接收主机发送的操作请求,所述操作请求包括待操作的目标数据的标识和操作类型;
    确定模块,用于根据所述目标数据的标识从所述至少两个存储节点中确定至少一个目标存储节点;
    第一发送模块,用于向所述至少一个目标存储节点发送指示消息,所述指示消息用于指示所述至少一个目标存储节点通过与所述主机的连接向所述主机发送所述目标数据或从所述主机获取所述目标数据。
  28. 根据权利要求27所述的装置,其特征在于,所述确定模块,用于:根据所述目标数据的标识确定所述至少一个目标存储节点包括存储所述目标数据的第一数据块的第一存储节点以及存储所述目标数据的第二数据块的第二存储节点;
    所述第一发送模块,具体用于向所述第一存储节点发送第一指示消息,所述第一指示消息包括所述第一数据块的标识,用于指示所述第一存储节点向所述主机发送所述第一数据块或从所述主机获取所述第一数据块;
    向所述第二存储节点发送第二指示消息,所述第二指示消息包括所述第二数据块的标识,用于指示所述第二存储节点向所述主机发送所述第二数据块或从所述主机获取所述第二数据块。
  29. 根据权利要求27或28所述的装置,其特征在于,所述操作请求还包括所述主机在内存中为所述目标数据指定的目标存储区域的位置参数;
    所述确定模块,还用于根据所述目标存储区域的位置参数确定所述目标存储区域中的为所述至少一个目标存储节点中每个目标存储节点对应的所述目标数据的数据块所指定的子存储区域的位置参数;
    生成包括所述数据块的标识、所述操作类型以及所述子存储区域的位置参数的所述指示消息,所述指示消息用于指示接收到所述指示消息的目标存储节点在所述操作类型为读操作时,根据所述子存储区域的位置参数通过与所述主机的RDMA连接将所述目标数据的数据块写入所述主机内存中所述子存储区域,或者,在所述操作类型为写操作时,根据所述子存储区域的位置参数,通过与所述主机的RDMA连接从所述主机内存中所述子存储区域读取所述目标数据的数据块,并存储所述数据块。
  30. 根据权利要求27至29中任一项所述的装置,其特征在于,还包括:
    第二接收模块,用于从所述存储节点中的任一存储节点接收所述任一存储节点创建的第一QP的参数;
    第二发送模块,用于将所述第一QP的参数发送给所述主机;
    所述第一接收模块,还用于从所述主机接收所述主机创建的第二QP的参数;
    所述第一发送模块,还用于将所述第二QP的参数发送给所述任一存储节点。
  31. 根据权利要求27或28所述的装置,其特征在于,所述装置与所述主机之间的连接为TCP/IP连接;所述指示消息包括所述操作类型、所述主机的通信地址以及所述目标数据的标识;所述装置还包括:
    第二发送模块,用于向所述主机发送第三指示消息,所述第三指示消息包括一个目标存储节点的通信地址,用于指示所述主机通过与所述目标存储节点间的TCP/IP连接向所述目标存储节点发送所述目标数据或从所述目标存储节点获取所述目标数据。
  32. 根据权利要求27或28所述的装置,其特征在于,所述装置与所述主机之间的连接为TCP/IP连接,所述操作类型为读操作;所述指示消息包括所述操作类型、所述主机的通信地址、所述装置的通信地址以及所述目标数据的标识,所述指示消息用于指示所述目标存储节点以所述装置的通信地址为源地址以及以所述主机的通信地址为目的地址发送携带所述目标数据的TCP/IP报文。
  33. 根据权利要求32所述的装置,其特征在于,所述确定模块还用于:
    根据所述主机的TCP窗口大小确定所述主机的数据接收量,根据所述数据接收量确定所述至少一个目标存储节点中每个目标存储节点通过每个TCP/IP报文携带的所述目标数据的数据块;
    生成包括所述数据块的标识的所述指示消息,所述指示消息用于指示接收到所述指示信息的目标存储节点向所述主机发送所述数据块。
  34. 一种数据处理的装置,其特征在于,所述装置与控制器通信连接,所述控制器用于管理所述装置;所述装置包括:
    接收模块,用于接收所述控制器发送的指示消息,所述指示消息包括主机待操作的目标数据的标识和操作类型;
    传输模块,用于根据所述指示消息,通过与所述主机的连接向所述主机发送所述目标数据或从所述主机获取所述目标数据。
  35. 根据权利要求34所述的装置,其特征在于,所述装置还包括连接模块,用于:
    create a first QP, where the first QP includes a first send queue SQ and a first receive queue RQ;
    将所述第一QP的参数发送给所述控制器;
    从所述控制器接收所述主机创建的第二QP的参数,所述第二QP中包括第二SQ以及第二RQ;
    根据所述第一QP的参数以及所述第二QP的参数将所述第一SQ与所述第二QP的第二RQ绑定,以及将所述第一RQ与所述第二SQ绑定,进而与所述主机建立所述RDMA连接。
  36. 根据权利要求34或35所述的装置,其特征在于,所述指示消息包括所述目标数据的标识、所述操作类型以及所述主机的内存中目标存储区域的位置参数;
    所述传输模块,用于在所述操作类型为读操作时,通过与所述主机的RDMA连接将所述目标数据写入所述主机内存中所述目标存储区域;
    在所述操作类型为写操作时,通过与所述主机的RDMA连接从所述主机内存中所述目标存储区域读取所述目标数据,并存储所述目标数据。
  37. 根据权利要求34所述的装置,其特征在于,所述指示消息包括所述操作类型、所述主机的通信地址、所述控制器的通信地址以及所述目标数据的标识;
    所述传输模块,用于以所述控制器的通信地址为源地址、以所述主机的通信地址为目的地址发送携带所述目标数据的TCP/IP报文。
  38. 一种处理数据的装置,其特征在于,所述装置与第二控制器以及至少两个磁盘连接,所述装置以及所述第二控制器用于管理所述至少两个磁盘;所述装置包括:
    接收模块,用于接收主机发送的操作请求,所述操作请求包括待操作的目标数据的标识和操作类型;
    确定模块,用于确定由所述至少两个控制器中的第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据;
    发送模块,用于向所述第二控制器发送指示消息,所述指示消息用于指示所述第二控制器通过与所述主机间的连接向所述主机发送所述目标数据或从所述主机获取所述目标数据。
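  39. The apparatus according to claim 38, wherein the connection between the second controller and the host is an RDMA connection; the operation request further includes a location parameter of a target storage area in the memory of the host; the instruction message includes the identifier of the target data, the operation type, and the location parameter, and the instruction message instructs the second controller to: when the operation type is a read operation, obtain the target data from the at least two disks and write the target data into the target storage area in the host memory through the RDMA connection with the host; and when the operation type is a write operation, read the target data from the target storage area in the host memory through the RDMA connection with the host and write the target data to the at least two disks.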
  40. 根据权利要求38或39所述的装置,其特征在于,所述确定模块,用于:
    根据预设负载均衡策略确定由所述第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据;或
    根据所述目标数据所在的逻辑单元号LUN的归属确定由所述第二控制器向所述主机发送所述目标数据或从所述主机获取所述目标数据。
  41. 根据权利要求38至40任一项所述的装置,其特征在于,所述接收模块,还用于:接收管理所述至少两个磁盘的第三控制器的指示消息,所述第三控制器发送的指示消息包括主机待操作的第二目标数据的标识和操作类型;
    所述装置还包括:传输模块,用于响应所述第三控制器发送的指示消息,通过与所述主机间的连接向所述主机发送所述第二目标数据或从所述主机获取所述第二目标数据。
  42. 一种数据处理的设备,其特征在于,包括处理器、存储器、通信接口以及总线,所述处理器、所述存储器和所述通信接口之间通过所述总线连接并完成相互间的通信,所述存储器中用于存储计算机执行指令,所述设备运行时,所述处理器执行所述存储器中的计算机执行指令以利用所述设备中的硬件资源执行权利要求9至15中任一所述的方法。
  43. 一种数据处理的设备,其特征在于,包括处理器、存储器、通信接口以及总线,所述处理器、所述存储器和所述通信接口之间通过所述总线连接并完成相互间的通信,所述存储器中用于存储计算机执行指令,所述设备运行时,所述处理器执行所述存储器中的计算机执行指令以利用所述设备中的硬件资源执行权利要求16至19中任一所述的方法。
  44. 一种数据处理的设备,其特征在于,包括处理器、存储器、通信接口以及总线,所述处理器、所述存储器和所述通信接口之间通过所述总线连接并完成相互间的通信,所述存储器中用于存储计算机执行指令,所述设备运行时,所述处理器执行所述存储器中的计算机执行指令以利用所述设备中的硬件资源执行权利要求23至26中任一所述的方法。
PCT/CN2017/072701 2017-01-25 2017-01-25 一种数据处理的系统、方法及对应装置 WO2018137217A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP17894187.8A EP3493046B1 (en) 2017-01-25 2017-01-25 Data processing system, method, and corresponding device
CN201780000604.8A CN108701004A (zh) 2017-01-25 2017-01-25 一种数据处理的系统、方法及对应装置
PCT/CN2017/072701 WO2018137217A1 (zh) 2017-01-25 2017-01-25 一种数据处理的系统、方法及对应装置
US16/362,210 US11489919B2 (en) 2017-01-25 2019-03-22 Method, apparatus, and data processing system including controller to manage storage nodes and host operations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/072701 WO2018137217A1 (zh) 2017-01-25 2017-01-25 一种数据处理的系统、方法及对应装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/362,210 Continuation US11489919B2 (en) 2017-01-25 2019-03-22 Method, apparatus, and data processing system including controller to manage storage nodes and host operations

Publications (1)

Publication Number Publication Date
WO2018137217A1 true WO2018137217A1 (zh) 2018-08-02

Family

ID=62977905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/072701 WO2018137217A1 (zh) 2017-01-25 2017-01-25 一种数据处理的系统、方法及对应装置

Country Status (4)

Country Link
US (1) US11489919B2 (zh)
EP (1) EP3493046B1 (zh)
CN (1) CN108701004A (zh)
WO (1) WO2018137217A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
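CN111431952A (zh) * 2019-01-09 2020-07-17 阿里巴巴集团控股有限公司 Message push method, apparatus, and system, computer storage medium, and electronic device
CN113014662A (zh) * 2021-03-11 2021-06-22 联想(北京)有限公司 Data processing method and storage system based on the NVMe-oF protocol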

Families Citing this family (18)

Publication number Priority date Publication date Assignee Title
US10261690B1 (en) 2016-05-03 2019-04-16 Pure Storage, Inc. Systems and methods for operating a storage system
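CN111385327B (zh) * 2018-12-28 2022-06-14 阿里巴巴集团控股有限公司 Data processing method and system
CN111274176B (zh) * 2020-01-15 2022-04-22 联想(北京)有限公司 Information processing method, electronic device, system, and storage medium
CN113360432B (zh) * 2020-03-03 2024-03-12 瑞昱半导体股份有限公司 Data transmission system
CN113742050B (zh) * 2020-05-27 2023-03-03 华为技术有限公司 Method and apparatus for operating on data objects, computing device, and storage medium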
US11720413B2 (en) * 2020-06-08 2023-08-08 Samsung Electronics Co., Ltd. Systems and methods for virtualizing fabric-attached storage devices
US11442629B2 (en) * 2020-07-15 2022-09-13 International Business Machines Corporation I/O performance in a storage system
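CN112104731B (zh) * 2020-09-11 2022-05-20 北京奇艺世纪科技有限公司 Request processing method and apparatus, electronic device, and storage medium
JP2022048716A (ja) 2020-09-15 2022-03-28 キオクシア株式会社 Storage system
CN111931721B (zh) * 2020-09-22 2023-02-28 苏州科达科技股份有限公司 Method and apparatus for detecting the color and number of annual inspection labels, and electronic device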
US11310732B1 (en) * 2020-11-23 2022-04-19 At&T Intellectual Property I, L.P. Facilitation of fast aiding radio access network intelligent controllers for 5G or other next generation network
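WO2022183713A1 (zh) * 2021-03-02 2022-09-09 中国银联股份有限公司 Data storage method, apparatus, device, and storage medium
CN112947287A (zh) * 2021-03-29 2021-06-11 联想(北京)信息技术有限公司 Control method, controller, and electronic device
CN114489503B (zh) * 2022-01-21 2024-02-23 北京安天网络安全技术有限公司 Data packet storage method and apparatus, and computer device
CN116804908A (zh) * 2022-03-16 2023-09-26 中兴通讯股份有限公司 Data read/write method, device, storage node, and storage medium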
EP4283457A3 (en) * 2022-05-23 2024-02-07 Samsung Electronics Co., Ltd. Computing system for managing distributed storage devices, and method of operating the same
CN116737619B (zh) * 2023-08-15 2023-11-03 苏州浪潮智能科技有限公司 Data request system, method, and apparatus, computer device, and storage medium
CN117234743B (zh) * 2023-11-14 2024-02-20 苏州元脑智能科技有限公司 Data sending method, apparatus, device, and storage medium

Citations (7)

Publication number Priority date Publication date Assignee Title
US20130198311A1 (en) * 2012-01-17 2013-08-01 Eliezer Tamir Techniques for Use of Vendor Defined Messages to Execute a Command to Access a Storage Device
CN103440202A (zh) * 2013-08-07 2013-12-11 华为技术有限公司 RDMA-based communication method, system, and communication device
CN103516755A (zh) * 2012-06-27 2014-01-15 华为技术有限公司 Virtual storage method and device
CN103530066A (zh) * 2013-09-16 2014-01-22 华为技术有限公司 Data storage method, apparatus, and system
CN103905526A (zh) * 2014-03-05 2014-07-02 深圳市同洲电子股份有限公司 Scheduling method and server
CN103984662A (zh) * 2014-05-29 2014-08-13 华为技术有限公司 Method and device for reading and writing data, and storage system
CN104580346A (zh) * 2014-09-11 2015-04-29 奇点新源国际技术开发(北京)有限公司 Data transmission method and apparatus

Family Cites Families (11)

Publication number Priority date Publication date Assignee Title
US7103744B2 (en) * 2003-03-27 2006-09-05 Hewlett-Packard Development Company, L.P. Binding a memory window to a queue pair
US20070041383A1 (en) * 2005-04-05 2007-02-22 Mohmmad Banikazemi Third party node initiated remote direct memory access
US7554976B2 (en) * 2005-05-13 2009-06-30 Microsoft Corporation Method and system for transferring a packet stream to RDMA
US7907621B2 (en) * 2006-08-03 2011-03-15 Citrix Systems, Inc. Systems and methods for using a client agent to manage ICMP traffic in a virtual private network environment
US8839044B2 (en) * 2012-01-05 2014-09-16 International Business Machines Corporation Debugging of adapters with stateful offload connections
US9229901B1 (en) * 2012-06-08 2016-01-05 Google Inc. Single-sided distributed storage system
US9483431B2 (en) * 2013-04-17 2016-11-01 Apeiron Data Systems Method and apparatus for accessing multiple storage devices from multiple hosts without use of remote direct memory access (RDMA)
US9727503B2 (en) * 2014-03-17 2017-08-08 Mellanox Technologies, Ltd. Storage system and server
US20160267050A1 (en) * 2015-03-09 2016-09-15 Unisys Corporation Storage subsystem technologies
KR102430187B1 (ko) * 2015-07-08 2022-08-05 삼성전자주식회사 Method of implementing an RDMA NVMe device
CN107526535B (zh) * 2016-06-22 2020-07-10 伊姆西Ip控股有限责任公司 Method and system for managing a storage system

Non-Patent Citations (1)

Title
See also references of EP3493046A4 *

Cited By (3)

Publication number Priority date Publication date Assignee Title
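CN111431952A (zh) * 2019-01-09 2020-07-17 阿里巴巴集团控股有限公司 Message push method, apparatus, and system, computer storage medium, and electronic device
CN111431952B (zh) * 2019-01-09 2022-06-03 阿里巴巴集团控股有限公司 Message push method, apparatus, and system, computer storage medium, and electronic device
CN113014662A (zh) * 2021-03-11 2021-06-22 联想(北京)有限公司 Data processing method and storage system based on the NVMe-oF protocol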

Also Published As

Publication number Publication date
US20190222649A1 (en) 2019-07-18
EP3493046A1 (en) 2019-06-05
CN108701004A (zh) 2018-10-23
EP3493046B1 (en) 2022-04-13
EP3493046A4 (en) 2019-10-23
US11489919B2 (en) 2022-11-01

Similar Documents

Publication Publication Date Title
WO2018137217A1 (zh) Data processing system, method, and corresponding device
US11544001B2 (en) Method and apparatus for transmitting data processing request
EP3572948B1 (en) Resource management method and device
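WO2017114091A1 (zh) NAS data access method, system, and related device
WO2022007470A1 (zh) Data transmission method, chip, and device
CN105556930A (zh) NVM Express controller for remote memory access
WO2021063160A1 (zh) Method for accessing a solid-state drive, and storage device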
US10372343B2 (en) Storage system, method, and apparatus for processing operation request
EP4318251A1 (en) Data access system and method, and device and network card
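WO2021073546A1 (zh) Data access method and apparatus, and first computing device
CN107728936A (zh) Method and apparatus for transmitting data processing request
WO2020134144A1 (zh) Method, node, and system for forwarding data or packets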
US10255213B1 (en) Adapter device for large address spaces
US10348519B2 (en) Virtual target port aggregation
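WO2023143103A1 (zh) Packet processing method, gateway device, and storage system
CN114911411A (zh) Data storage method and apparatus, and network device
CN116260778A (zh) External memory device, external memory access method, and network system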

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 17894187
    Country of ref document: EP
    Kind code of ref document: A1
ENP Entry into the national phase
    Ref document number: 2017894187
    Country of ref document: EP
    Effective date: 20190301
NENP Non-entry into the national phase
    Ref country code: DE