CN107948233B - Method for processing write request or read request, switch and control node - Google Patents


Info

Publication number: CN107948233B
Application number: CN201610896118.6A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN107948233A
Inventor: 陈灿
Assignee (original and current): Huawei Technologies Co., Ltd.
Legal status: Active (granted)
Prior art keywords: switch, address, data, forwarding, storage
Legal events: application CN201610896118.6A filed by Huawei Technologies Co., Ltd.; publication of CN107948233A; application granted; publication of CN107948233B

Classifications

    • H04L67/1097 — Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H04L67/30 — Profiles

Abstract

Embodiments of the present invention provide a method for processing a write request or a read request, a switch, and a control node. After the switch receives a first write request, it directly fragments the data carried in the first write request and computes redundancy fragments based on a forwarding table, generates a corresponding number of second write requests, and then sends the second write requests directly to the corresponding storage nodes to complete data storage. The data in the first write request therefore needs to traverse the switch only once to be stored, which greatly reduces the switch's workload and the waste of network resources, lowers the latency of write request processing, and substantially improves the processing efficiency and performance of the system.

Description

Method for processing write request or read request, switch and control node
Technical Field
The present invention relates to the field of information technology, and more particularly, to a method, a switch, and a control node for processing a write request or a read request.
Background
With the explosive growth of data volume, traditional external storage systems struggle to meet the requirements of data centers because their performance and capacity cannot scale linearly, and distributed storage systems have emerged as a result. In a distributed storage system, to ensure high data availability, data redundancy is implemented across storage nodes, for example in a Redundant Array of Independent Disks (RAID) manner, so that when a single storage node fails, no data is lost and service is not interrupted. In a prior-art distributed storage system, when a host needs to write data, it sends a write request through a switch to an arbitrary storage node; that storage node fragments the data in the write request and generates a new write request for each fragment, and each new write request is then sent through the switch to other storage nodes to be stored.
Through analysis, the inventor has found the following defect in the prior art:
the switch carries an unnecessary load: data of a write request from the host must traverse the switch twice before the write is complete, which wastes network resources and increases the switch's workload.
Disclosure of Invention
Embodiments of the present invention provide a method for processing a write request or a read request, a switch, and a control node, with which data to be written in a distributed storage system needs to traverse the switch only once, thereby reducing the switch's workload, saving network resources, greatly reducing write request processing latency, and improving the processing efficiency and performance of the distributed storage system.
In a first aspect, a method for processing a write request is provided, where the method is applied to a switch, a forwarding table is preconfigured in the switch, the forwarding table includes multiple forwarding records, and each forwarding record includes a host address, a service switch address, a volume, a redundancy level number, and a storage node address, and the method includes:
receiving a first write request from a host, wherein the first write request carries first metadata and first data;
querying the forwarding table based on the first metadata to obtain K forwarding records, where K is greater than or equal to 2;
performing a fragmentation operation on the first data based on the redundancy level numbers in the K forwarding records to obtain K fragments;
generating K second write requests based on the redundancy level numbers and the storage node addresses in the K forwarding records, where each of the K second write requests carries second data, and the second data is one of the K fragments;
and sending the K second write requests to K storage nodes, respectively.
With reference to the first aspect, in a first possible implementation manner, the first metadata includes a first source address, a first destination address, a storage index, and a data payload length, where the first source address is an address of the host, the first destination address is an address of the switch, and the data payload length is used to indicate a size of the first data;
the querying the forwarding table based on the first metadata specifically includes:
matching the first source address, the first destination address, and the storage index against the host address, the service switch address, and the volume of each forwarding record in the forwarding table, respectively, to obtain the K forwarding records.
Optionally, the first metadata may further include a storage type and a sequence number.
Optionally, the fragmentation operation divides the first data into a plurality of data fragments based on the queried redundancy level numbers and computes at least one redundancy fragment from those data fragments; the data fragments and the at least one redundancy fragment together constitute the K fragments corresponding to the first data.
Optionally, each of the K second write requests further carries second metadata, where the second metadata includes a second source address, a second destination address, a storage index, a redundancy level number, and a data payload length; the second source address is the address of the switch, the second destination address is the storage node address of one of the K forwarding records, the redundancy level number is the redundancy level number of that forwarding record, and the data payload length indicates the size of the corresponding fragment.
Optionally, the second metadata may further include a storage type and a sequence number.
Optionally, the switch may further create a state table for recording state information of the first write request processing.
With reference to the first aspect, optionally, the method further includes:
each of the K storage nodes receives a second write request from the switch;
stores the second data carried in the received second write request, using the storage index and the redundancy level number in the second metadata of that request as the key; and
sends a second write success message to the switch.
Optionally, the method further includes: the switch receives the second write success messages from the K storage nodes, respectively;
and sends a first write success message to the host.
Optionally, the switch may further update the state table based on each second write success message, that is, query the state table according to the second metadata carried in the second write success message and modify the operation state of the corresponding row in the state table.
Further, optionally, the switch may also delete the state table.
It can be seen that a forwarding table is stored in the switch; after the switch receives the first write request from the host, it may directly fragment the data carried in the first write request and compute redundancy fragments based on the forwarding table, generate a corresponding number of second write requests, and then send the second write requests directly to the corresponding storage nodes, which store the data they carry, completing the storage of the data carried in the first write request. In the embodiment of the present invention, the data in the first write request needs to traverse the switch only once to be stored, avoiding the prior-art situation in which the data must be transmitted twice, thereby greatly reducing the switch's workload and the waste of network resources, lowering the latency of processing a write request, and substantially improving the processing efficiency and performance of the distributed storage system.
In a second aspect, a method for processing a read request is further provided, where the method is applied to a switch, a forwarding table is preconfigured in the switch, the forwarding table includes multiple forwarding records, and each forwarding record includes a host address, a service switch address, a volume, a redundancy level number, and a storage node address, and the method includes:
receiving a first read request from a host, wherein the first read request carries third metadata;
querying the forwarding table based on the third metadata to obtain K forwarding records, where K is greater than or equal to 2;
generating K second read requests based on the redundancy level numbers and the storage node addresses in the K forwarding records;
and sending the K second read requests to K storage nodes, respectively.
Optionally, the third metadata includes a third source address, a third destination address, a storage index, and a data payload length, where the third source address is the address of the host, the third destination address is the address of the switch, and the data payload length indicates the size of the data to be read;
the querying the forwarding table based on the third metadata specifically includes:
matching the third source address, the third destination address, and the storage index in the third metadata against the host address, the service switch address, and the volume of each forwarding record in the forwarding table, respectively, to obtain the K forwarding records.
Optionally, the switch may further create a state table for recording state information of the first read request processing.
Optionally, each of the K second read requests further carries fourth metadata, where the fourth metadata includes a fourth source address, a fourth destination address, a storage index, a redundancy level number, and a data payload length; the fourth source address is the address of the switch, the fourth destination address is the storage node address of one of the K forwarding records, the redundancy level number is the redundancy level number of that forwarding record, and the data payload length indicates the size of the data to be read. The method further includes:
each of the K storage nodes receiving a second read request from the switch;
reading data using the storage index and the redundancy level number in the fourth metadata carried by the second read request as the key;
and sending a second read completion message to the switch, where the second read completion message carries the read data.
Optionally, the method further includes:
the switch receives a second read completion message from each of the K storage nodes;
reassembles the data carried in the K second read completion messages;
and sends a first read completion message to the host, where the reassembled data is carried in the first read completion message.
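For illustration only, the following is a minimal Python sketch of such a reassembly step under the RAID M/N_X numbering used later in this document; it covers only RAID 1 (any full mirror copy is the data) and RAID 5 (concatenate the N−1 data fragments), and all names are illustrative rather than part of the embodiment:

```python
def reassemble(fragments: dict[int, bytes], m: int, n: int, length: int) -> bytes:
    """Reassemble the data carried in the K second read completion messages,
    keyed here by the member index X of RAID M/N_X (an assumed convention)."""
    if m == 1:                    # RAID 1: every member holds a full copy
        return next(iter(fragments.values()))[:length]
    if m == 5:                    # RAID 5: join the N-1 data fragments,
        return b"".join(fragments[x] for x in range(n - 1))[:length]  # drop parity
    raise NotImplementedError(f"RAID {m} is not sketched here")

copies = {x: b"payload" for x in range(3)}      # RAID 1/3: three mirror copies
assert reassemble(copies, 1, 3, 7) == b"payload"
```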
Optionally, the switch may further update the state table based on each second read completion message, that is, query the state table according to metadata information carried in the second read completion message, and modify the operation state of the corresponding row in the state table.
Further, optionally, the switch may also delete the state table.
Through this embodiment, the data to be read passes through the switch only once on its way from the plurality of storage nodes to the host, avoiding the prior-art situation in which the data must be transmitted twice before the read is complete; this greatly reduces the switch's workload and the waste of network resources, lowers read request processing latency, and substantially improves the processing efficiency and performance of the distributed storage system.
In a third aspect, there is also provided a switch comprising a receiver, a memory, a processor, a transmitter, and a RAID engine, wherein:
the receiver is configured to receive a first write request from a host, where the first write request carries first metadata and first data;
the memory is used for storing a forwarding table, the forwarding table comprises a plurality of forwarding records, and each forwarding record comprises a host address, a service switch address, a volume, a redundancy level number and a storage node address;
the processor is configured to parse the first write request to obtain the first metadata and the first data, and query the forwarding table based on the first metadata to obtain K forwarding records, where K is greater than or equal to 2;
the RAID engine is configured to perform a fragmentation operation on the first data based on the redundancy level numbers in the K forwarding records to obtain K fragments;
the processor is further configured to generate K second write requests based on the redundancy level numbers and the storage node addresses in the K forwarding records, where each of the K second write requests carries second data, and the second data is one of the K fragments;
and the transmitter is configured to send the K second write requests to the K storage nodes, respectively.
With reference to the third aspect, in a first possible implementation manner, the first metadata includes a first source address, a first destination address, a storage index, and a data payload length, where the first source address is an address of the host, the first destination address is an address of the switch, and the data payload length is used to indicate a size of the first data;
the querying, by the processor, the forwarding table based on the first metadata to obtain K forwarding records specifically includes:
matching the first source address, the first destination address, and the storage index against the host address, the service switch address, and the volume of each forwarding record in the forwarding table, respectively, to obtain the K forwarding records.
Optionally, the receiver is further configured to receive a second write success message from each of the K storage nodes;
the processor is further configured to generate a first write success message based on the second write success messages from the K storage nodes;
and the transmitter is further configured to send the first write success message to the host.
It can be seen that, after the switch receives the first write request from the host, it may directly fragment the data carried in the first write request and compute redundancy fragments based on the forwarding table, generate a corresponding number of second write requests, and then send the second write requests directly to the corresponding storage nodes, which store the data they carry, completing the storage of the data carried in the first write request. In the embodiment of the present invention, the data in the first write request needs to traverse the switch only once to be stored, avoiding the prior-art situation in which the data must be transmitted twice, thereby greatly reducing the switch's workload and the waste of network resources, lowering the latency of processing a write request, and substantially improving the processing efficiency and performance of the distributed storage system.
In a fourth aspect, there is also provided a switch comprising a receiver, a memory, a processor, and a transmitter, wherein:
the receiver is configured to receive a first read request from a host, where the first read request carries third metadata;
the memory is used for storing a forwarding table, the forwarding table comprises a plurality of forwarding records, and each forwarding record comprises a host address, a service switch address, a volume, a redundancy level number and a storage node address;
the processor is configured to parse the first read request to obtain the third metadata, query the forwarding table based on the third metadata to obtain K forwarding records, where K is greater than or equal to 2, and generate K second read requests based on redundancy level numbers and storage node addresses in the K forwarding records;
the transmitter is configured to send the K second read requests to K storage nodes, respectively.
With reference to the fourth aspect, in a first possible implementation manner, the switch further includes a RAID engine;
the receiver is further configured to receive a second read completion message from each of the K storage nodes, respectively;
the processor is configured to parse the second read completion message from each of the K storage nodes to obtain data carried by each of the second read completion messages;
the RAID engine is configured to reassemble the data carried in the K second read completion messages;
the processor is further configured to generate a first read completion message, where the first read completion message carries the reassembled data;
and the transmitter is configured to send the first read completion message to the host.
Through this embodiment, the data to be read passes through the switch only once on its way from the plurality of storage nodes to the host, avoiding the prior-art situation in which the data must be transmitted twice before the read is complete; this greatly reduces the switch's workload and the waste of network resources, lowers read request processing latency, and substantially improves the processing efficiency and performance of the distributed storage system.
In a fifth aspect, a control node is further provided, including a receiver, a processor, a memory, and a transmitter, where:
the transmitter is configured to send a switch capability information query message to each switch and a storage node capability information query message to each storage node;
the receiver is configured to receive a switch capability information query response message from a switch, where the response message carries the switch's capability information, and to receive a storage node capability information query response message from a storage node, where the response message carries the storage node's capability information;
the processor is configured to parse the switch capability information query response message to obtain the switch capability information, and to parse the storage node capability information query response message to obtain the storage node capability information;
the memory is configured to store the switch capability information and the storage node capability information;
the receiver is further configured to receive the storage requirement of the host;
the processor is further configured to allocate a service switch, allocate a plurality of storage nodes, and configure a forwarding table for the host based on the host's storage requirement, the switch capability information, and the storage node capability information;
and the transmitter is further configured to send the address of the service switch to the host and to send the forwarding table to the service switch.
Optionally, the switch capability information may include a switch ID, a switch address, whether the switch has fragmentation and redundant computing capability; the storage node capability information may include: storage node address, storage node capacity.
Optionally, the control node may send a switch capability information query message to each switch in a unicast manner, send a storage node capability information query message to each storage node in a unicast manner, or send the messages in a broadcast manner.
The service switch must have fragmentation and redundant computing capability. Optionally, the processor may randomly select one switch with fragmentation and redundant computing capability as the host's service switch, may select the capable switch closest to the host on the network route, or may, following a load balancing principle, select from among the capable switches the one with the fewest current computing tasks.
Optionally, the processor may randomly select the storage node, preferentially select a storage node with a lower current utilization rate in consideration of a load balancing principle, or select a storage node closest to the service switch in a network topology.
Optionally, the forwarding table may include the following information: host address, service switch address, volume, redundancy level number, and storage node address, etc.
With the control node provided by the embodiment of the present invention, automatic centralized configuration of the distributed storage system can be achieved, avoiding the prior-art need to configure each switch and storage node manually and separately, which greatly improves configuration efficiency in the distributed storage system. Furthermore, the forwarding table is configured by the control node and the switch forwards according to that table, achieving control/forwarding separation, which can greatly improve the performance of the distributed storage system.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below.
Fig. 1 is a schematic structural diagram of a distributed storage system according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a message according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart of processing a write request according to an embodiment of the present invention.
Fig. 4 is a schematic flow chart of processing a read request according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a switch according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a control node according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Fig. 1 illustrates an exemplary system architecture of a distributed storage system according to an embodiment of the present invention. In the distributed storage system, a plurality of hosts communicate with any of a plurality of storage nodes through the switches; each switch is connected to at least one other switch, each host is connected to at least one switch, and each storage node is likewise connected to at least one switch. Preferably, the distributed storage system may further include a control node, which may communicate with any one or more switches and, through them, with any host and any storage node. Those skilled in the art will appreciate that an actual networking may contain one or more hosts, one or more switches, and at least two storage nodes; the numbers of hosts, switches, and storage nodes are determined by actual service requirements. Fig. 1 only shows an exemplary networking structure and does not limit actual networking.
In the distributed storage system provided in the embodiment of the present invention, fragmentation and redundancy computation for the data of a write request from any host are performed on one or more switches; the switch then generates a new write request for each fragment (data fragment or redundancy fragment) of the data corresponding to the write request and sends each new write request to a different storage node, completing the storage of the data carried in the write request. Because each write request from the host is fragmented on the switch and then sent to the storage nodes, the data of each write request needs to traverse the switch only once to reach the storage nodes, solving the prior-art problems of heavy switch workload, wasted network resources, and high latency caused by the data having to traverse the switch twice.
In the embodiment of the present invention, in an initial stage the control node sends a switch capability information query message to each switch, requesting the switch to report its capability information. After receiving the query message, each switch may return a switch capability information query response message to the control node, which may include the following information: switch ID, switch address, and whether the switch has fragmentation and redundant computing capability. It should be noted that the embodiment of the present invention does not require every switch in the distributed storage system to have fragmentation and redundant computing capability; an actual networking may include one or more conventional switches, and it is sufficient to ensure that at least one switch in the deployed system has fragmentation and redundant computing capability. Preferably, from a reliability perspective, it is recommended to deploy two or more switches with fragmentation and redundant computing capability. Table 1 below shows the capability information of each switch (Table 1 takes 3 switches in the system as an example):
[Table 1 — columns: switch ID; switch address; whether the switch has fragmentation and redundant computing capability]
TABLE 1
The control node sends a storage node capability information query message to each storage node, requesting each storage node to report its capability information. After receiving the query message, each storage node may return the following information to the control node in a storage node capability information query response message: storage node address and storage node capacity. Table 2 below shows the capability information of each storage node (Table 2 takes 6 storage nodes in the system as an example):
Storage node address    Storage node capacity
200.1.1.110 20T
200.1.1.111 20T
200.1.1.112 20T
200.1.1.113 20T
200.1.1.114 20T
200.1.1.115 30T
TABLE 2
The control node may send the query messages to each switch or storage node in a unicast manner, or may broadcast them in the network. Those skilled in the art will understand that the names "switch capability information query message", "switch capability information query response message", "storage node capability information query message", and "storage node capability information query response message" are only examples; the names themselves are not specifically limited, and in different practices any message with a similar function falls within the protection scope of the embodiment of the present invention.
After the control node has collected the capability information of every switch and storage node in the distributed storage system, it can configure storage resources for a host, either proactively or according to the host's requirements. For example, according to the host's capacity requirement and redundancy level requirement, the control node allocates a switch and storage nodes to the host and configures the forwarding table entries inside the allocated switch. The capacity requirement is the size of the storage capacity the host needs, such as 10T or 20T; the redundancy level may be a RAID level, such as RAID 1, RAID 3, or RAID 5. Preferably, the host may additionally specify, within the redundancy level, the number of storage nodes forming the RAID level: RAID 1/3, for example, indicates that RAID 1 is implemented using storage resources from 3 different storage nodes. Illustratively, as shown in Table 3 below, the host sends 2 storage requirements to the control node. The first requires allocation of a volume with volume identifier LUN A, a capacity of 10T, and redundancy level RAID 1/3, that is, LUN A has redundancy level RAID 1 and is distributed over 3 storage nodes; the second requires a volume with volume identifier LUN B, a capacity of 50T, and redundancy level RAID 5/5, that is, LUN B has redundancy level RAID 5 and is distributed over 5 storage nodes.
Host address Volume label Capacity requirement Redundancy level
200.1.1.3 LUN A 10T RAID 1/3
200.1.1.3 LUN B 50T RAID 5/5
TABLE 3
After receiving the storage demand from the host, the control node configures resources for the storage demand of the host based on the collected capability information of each switch and storage node, and specifically includes:
1) a service switch is designated for the host.
The service switch must perform the fragmentation operation on the data in write requests from the host and compute redundancy fragments, so it can only be chosen from among the switches with fragmentation and redundant computing capability. If several switches in the distributed storage system have this capability, the control node may randomly select one of them as the host's service switch; or it may select the capable switch closest to the host on the network route, i.e. the switch that a message sent by the host can reach in the minimum number of network hops; or, following a load balancing principle, it may select the capable switch with the fewest current computing tasks. For example, referring to Table 1, the control node may select switch 2 as the service switch for the first storage requirement described above, and switch 3 as the service switch for the second.
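For illustration only, the following Python sketch shows these three selection policies; the field names and the sample capability data are assumptions for the example, not values from Table 1:

```python
import random

def select_service_switch(switches: list[dict], strategy: str = "load") -> dict:
    """Pick a service switch among those with fragmentation and redundant
    computing capability: randomly, by hop count to the host, or by load."""
    capable = [s for s in switches if s["capable"]]
    if strategy == "random":
        return random.choice(capable)
    if strategy == "hops":                          # closest on the network route
        return min(capable, key=lambda s: s["hops_to_host"])
    return min(capable, key=lambda s: s["tasks"])   # fewest current computing tasks

switches = [
    {"id": 1, "capable": False, "hops_to_host": 1, "tasks": 0},
    {"id": 2, "capable": True,  "hops_to_host": 1, "tasks": 3},
    {"id": 3, "capable": True,  "hops_to_host": 2, "tasks": 1},
]
print(select_service_switch(switches, "load")["id"])  # -> 3 (lightest load)
```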
2) And allocating storage nodes for the host.
The control node selects suitable storage nodes, according to the capability information of each storage node, to provide storage service for the host. For example, for the first storage requirement above, the host requests a volume with volume identifier LUN A, a capacity of 10T, and redundancy level RAID 1/3, which means the control node must select 3 storage nodes, each with at least 3.4T of available storage capacity. After receiving the host's requirement, the control node checks the capability information of each storage node and selects 3 storage nodes, from among those whose available capacity meets the requirement, to serve this storage requirement. The control node may select the 3 storage nodes randomly; or, following a load balancing principle, it may prefer storage nodes with a lower current utilization rate, so that utilization stays as balanced as possible across nodes, where the utilization rate is the ratio of a storage node's allocated capacity to its total capacity; or it may consider the network topology between each storage node and the service switch and prefer the storage nodes closest to the service switch in the topology. For example, in the distributed storage system shown in Fig. 1, assuming switch 2 is the service switch of host 1, the control node may prefer storage node 0 and storage node 1, because these 2 storage nodes are directly connected to switch 2 and are therefore closest to it in the network topology, and only then select from storage node 2 through storage node n+1. Referring to Table 2, for the first storage requirement, the 3 storage nodes with addresses 200.1.1.110, 200.1.1.111, and 200.1.1.112 may be selected; for the second storage requirement, the 5 storage nodes with addresses 200.1.1.111, 200.1.1.112, 200.1.1.113, 200.1.1.114, and 200.1.1.115 may be selected.
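As a sketch only, the capacity-and-utilization selection might look as follows in Python; Table 2 lists only total capacities, so the allocated-capacity figures below are hypothetical:

```python
def select_storage_nodes(nodes: list[dict], n: int, need_tb: float) -> list[dict]:
    """Pick N storage nodes whose available capacity meets the per-node
    requirement, preferring the lowest utilization (allocated / total)."""
    eligible = [x for x in nodes if x["total_tb"] - x["allocated_tb"] >= need_tb]
    if len(eligible) < n:
        raise RuntimeError("not enough storage nodes meet the capacity requirement")
    eligible.sort(key=lambda x: x["allocated_tb"] / x["total_tb"])
    return eligible[:n]

nodes = [{"address": f"200.1.1.{110 + i}", "total_tb": 20.0, "allocated_tb": a}
         for i, a in enumerate([2.0, 5.0, 1.0, 8.0, 3.0, 4.0])]
# First storage requirement: 3 nodes with at least 3.4T available each
print([x["address"] for x in select_storage_nodes(nodes, 3, 3.4)])
```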
3) Configuring a forwarding table based on the designated service switch and the selected storage nodes.
Based on the aforementioned 2 storage requirements and the service switch and storage nodes selected by the control node, the control node configures the forwarding table. The forwarding table may include multiple forwarding records, and each forwarding record may include the following information: host address, service switch address, volume, redundancy level number, and storage node address. The redundancy level number may be represented as RAID M/N_X, where RAID M/N is the redundancy level; M is the RAID level number with M ≥ 0, and those skilled in the art will understand that M generally takes values such as 0, 1, 3, 5, 6, 10, or 50; N is the number of storage nodes forming the RAID level, so that RAID 1/3, for example, indicates RAID 1 implemented on 3 storage nodes; and X takes a value in [0 … N−1], identifying the 0th through (N−1)th member storage node of the RAID M group formed by the N storage nodes (X may equally take values in [1 … N], identifying the 1st through Nth member storage node). The storage node address is the forwarding path of the corresponding record. Table 4 below shows an exemplary forwarding table:
Host address  Service switch address   Volume  Redundancy level number  Storage node address
200.1.1.3     200.1.1.101              LUN A   RAID 1/3_0               200.1.1.110
200.1.1.3     200.1.1.101              LUN A   RAID 1/3_1               200.1.1.111
200.1.1.3     200.1.1.101              LUN A   RAID 1/3_2               200.1.1.112
200.1.1.3     (address of switch 3)    LUN B   RAID 5/5_0               200.1.1.111
200.1.1.3     (address of switch 3)    LUN B   RAID 5/5_1               200.1.1.112
200.1.1.3     (address of switch 3)    LUN B   RAID 5/5_2               200.1.1.113
200.1.1.3     (address of switch 3)    LUN B   RAID 5/5_3               200.1.1.114
200.1.1.3     (address of switch 3)    LUN B   RAID 5/5_4               200.1.1.115
TABLE 4
As shown in Table 4 above, there are 3 forwarding records for LUN A of the host with address 200.1.1.3 and the service switch with address 200.1.1.101; their redundancy level numbers are RAID 1/3_0, RAID 1/3_1, and RAID 1/3_2, and their storage node addresses are 200.1.1.110, 200.1.1.111, and 200.1.1.112, respectively. In other words, when data from the host needs to be written to LUN A, it must be written, in a RAID 1 manner, to the 3 storage nodes with addresses 200.1.1.110, 200.1.1.111, and 200.1.1.112.
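For illustration only, a forwarding record and the RAID M/N_X numbering could be modeled in Python as below; the class and function names are illustrative, not part of the embodiment:

```python
from dataclasses import dataclass

@dataclass
class ForwardingRecord:
    host_address: str          # address of the host this record serves
    switch_address: str        # address of the service switch
    volume: str                # volume identifier, e.g. "LUN A"
    redundancy_number: str     # "RAID M/N_X" as described above
    storage_node_address: str  # forwarding path of this record

def parse_redundancy_number(number: str) -> tuple[int, int, int]:
    """Split 'RAID M/N_X' into (M, N, X), e.g. 'RAID 1/3_2' -> (1, 3, 2)."""
    level, _, member = number.partition("_")
    m, _, n = level.removeprefix("RAID ").partition("/")
    return int(m), int(n), int(member)

# The three LUN A records of Table 4:
forwarding_table = [
    ForwardingRecord("200.1.1.3", "200.1.1.101", "LUN A", f"RAID 1/3_{x}", addr)
    for x, addr in enumerate(["200.1.1.110", "200.1.1.111", "200.1.1.112"])
]
assert parse_redundancy_number("RAID 5/5_4") == (5, 5, 4)
```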
After configuring the forwarding table, the control node may actively send it to each service switch, and each service switch stores the forwarding table entries locally; alternatively, the control node may send the forwarding table to a switch after receiving a forwarding table query request from that switch. Further, the control node may notify the host of the address of the service switch assigned to it, so that the host's subsequent service-related messages can be sent directly to the service switch.
It should be understood by those skilled in the art that Tables 1, 2, 3, and 4 above are only exemplary illustrations made for clarity of description in the embodiments of the present invention and do not limit how this information is presented or stored; in specific practice, the information may be stored and recorded flexibly in various forms such as linked lists, files, or logs, and these different implementations should not be considered to exceed the scope of the present invention and are not enumerated here one by one.
In the embodiment of the present invention, the above approach achieves automatic centralized configuration of the distributed storage system, avoiding the prior-art need to configure each switch and storage node manually and separately, which greatly improves configuration efficiency in the distributed storage system. Furthermore, the forwarding table is configured by the control node and the switch forwards according to that table, achieving control/forwarding separation, which can greatly improve the performance of the distributed storage system.
Fig. 2 is a schematic structural diagram of the messages exchanged between the host and the switch, and between the switch and the storage nodes, in the distributed storage system provided by the embodiment of the present invention. Each such message may include metadata and a data payload, where the metadata is a mandatory part of every message and the data payload is optional.
The metadata may include a source address, a destination address, and a layer 4 message header, where the source address is used to indicate an address of an entity that sent the message, e.g., if the message is sent from a host, the source address is the address of the host, and if the message is sent from a switch, the source address is the address of the switch; the destination address is used to indicate the ultimate recipient of the message.
The layer 4 message header may contain a storage header, an operation command, a redundancy level number, a sequence number, and a data payload length. The storage header may consist of a Storage Type and a Storage Index, where the storage type indicates the storage access type, such as block, file, or object, and the storage index corresponds to the storage type, for example: if the storage type is block, the storage index is volume ID + logical address; if the storage type is file, the storage index is file system ID + directory + file name + offset address; and if the storage type is object, the storage index is key + version number. It should be noted that, because the structure of the storage index differs completely between storage types, the storage type field may be optional in the embodiment of the present invention. The operation command may be "write", "read", "write complete", "write failure", "read complete", "read failure", and so on, where "write" indicates that data is to be written to a storage node, "read" indicates that data is to be read from a storage node, "write complete" and "write failure" indicate the 2 possible results of a write operation, and "read complete" and "read failure" indicate the 2 possible results of a read operation. The redundancy level number is as described for the forwarding table with reference to Table 4 above and is not repeated here; in general, a message sent from the host to the switch need not carry the redundancy level number, whereas a message sent from the switch to a storage node must carry this field. The sequence number is an optional field used as a context identifier for messages sent by the host and is incremented sequentially by the host; for example, the sequence number of the host's first message may be 0x0000, that of the second 0x0001, that of the third 0x0002, and so on. The data payload length is also an optional field describing the length of the data to be written or read; in the embodiment of the present invention its value may be 128 bytes, 256 bytes, 512 bytes, 1024 bytes, 2048 bytes, or the like.
The data payload is data to be written or data to be read.
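For illustration only, the message structure of Fig. 2 might be modeled in Python as follows; the class name and field names are illustrative, and optional fields default to None:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    # Metadata: mandatory part of every message
    source_address: str
    destination_address: str
    # Layer 4 message header
    operation_command: str                    # "write", "read", "write complete", ...
    storage_index: str                        # e.g. volume ID + logical address
    storage_type: Optional[str] = None        # "block" / "file" / "object" (optional)
    redundancy_number: Optional[str] = None   # carried only switch -> storage node
    sequence_number: Optional[int] = None     # host-side context id, incremented
    data_payload_length: Optional[int] = None
    # Optional data payload: the data to be written or the data read back
    data_payload: Optional[bytes] = None

first_write = Message(
    source_address="200.1.1.3", destination_address="200.1.1.101",
    operation_command="write", storage_index="LUN A+0x0000FFFF",
    storage_type="block", sequence_number=0x1234,
    data_payload_length=512, data_payload=b"\x00" * 512,
)
```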
The message structure shown in Fig. 2 and the description above are only exemplary; the names of the fields do not specifically limit the message structure, and in specific practice the field names may vary. As long as the information they carry does not depart from the essence of the above description, these different implementations should not be considered to exceed the scope of the present invention and are not enumerated here one by one.
Fig. 3 illustrates a method for processing a write request from a host according to an embodiment of the present invention.
Step 300: receiving a first write request from a host, wherein the first write request carries first metadata and first data.
After the control node completes configuration based on the host's storage requirement, it notifies the host of the service switch selected for it; thereafter, whenever the host has data to write, it can send a first write request carrying first metadata and first data. Referring to Fig. 2 and the message structure described above, in an embodiment of the present invention the first metadata may include a first source address, a first destination address, a storage index, an operation command, and a data payload length, where: the first source address is the address of the host, i.e. 200.1.1.3; the first destination address is the address of the host's service switch, i.e. 200.1.1.101; the storage index is LUN A + 0x0000FFFF, that is, the write request is to write the carried first data starting at address 0x0000FFFF of volume LUN A; the operation command is "write"; and the data payload length may take the value 512 bytes.
Optionally, in this embodiment of the present invention, the first metadata may further include a storage type, for example "block", whose corresponding storage index is LUN A + 0x0000FFFF as described above; the first metadata may further include a sequence number, whose value is the sequence number of the message the host sent immediately before this first write request plus 1, for example 0x1234.
Step 301: a forwarding table is queried based on the first metadata.
After receiving the first write request, the service switch determines how to process it by querying the forwarding table. Specifically, the service switch queries the forwarding table based on the first metadata carried in the first write request to obtain K forwarding records, where K is greater than or equal to 2. Taking Table 4 above as an example, 3 forwarding records corresponding to host 200.1.1.3, service switch 200.1.1.101, and LUN A can be found; their redundancy level numbers are RAID 1/3_0, RAID 1/3_1, and RAID 1/3_2, that is, the redundancy level corresponding to LUN A of host 200.1.1.3 is RAID 1/3, and the forwarding paths of these records are the 3 storage nodes whose addresses are 200.1.1.110, 200.1.1.111, and 200.1.1.112.
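Continuing the ForwardingRecord and Message sketches above (illustrative names, not the embodiment itself), the lookup could be expressed as:

```python
def query_forwarding_table(table: list, metadata) -> list:
    """Match the first source address, first destination address, and storage
    index against the host address, service switch address, and volume of each
    forwarding record; returns the K matching records (K >= 2 on success)."""
    volume = metadata.storage_index.split("+")[0].strip()  # assumes "volume+offset"
    return [rec for rec in table
            if rec.host_address == metadata.source_address
            and rec.switch_address == metadata.destination_address
            and rec.volume == volume]

records = query_forwarding_table(forwarding_table, first_write)
assert len(records) == 3   # RAID 1/3 -> K = 3 for LUN A
```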
Step 302: and carrying out fragmentation operation on the first data in the first write request.
The service switch performs a fragmentation operation on the first data in the first write request based on the redundancy level numbers of the K forwarding records found. As described above, the redundancy level numbers found are RAID 1/3_0, RAID 1/3_1, and RAID 1/3_2, i.e. the redundancy level is RAID 1/3: the RAID level is RAID 1, formed by 3 storage nodes. In the case of RAID 1, the first data needs neither striping nor redundancy computation, since RAID 1 protection simply mirrors the data. Therefore, with redundancy level RAID 1/3, the fragmentation operation can directly copy the first data into 3 copies; these 3 copies together form the fragments corresponding to the first data in the first write request and are stored by the corresponding 3 storage nodes.
As those skilled in the art will understand, in the case shown in Table 4 where the redundancy level numbers are RAID 5/5_0 through RAID 5/5_4, the redundancy level is RAID 5/5, indicating that RAID 5 is formed by 5 storage nodes; the first data must then be divided into 4 data fragments and 1 redundancy fragment computed from them, the 4 data fragments and 1 redundancy fragment forming the 5 fragments corresponding to the first data, with each fragment stored by 1 storage node. How the first data should be fragmented, and whether and how many redundancy fragments must be computed, can be determined directly from the queried redundancy level according to the basic principles of RAID technology and need not be detailed here. The following description takes the redundancy level RAID 1/3 as an example.
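For illustration only, a minimal Python sketch of the fragmentation operation for these two cases follows (RAID 1 mirroring and RAID 5 striping with a single XOR parity fragment; other levels follow the same RAID principles, and the names are illustrative):

```python
def fragment(data: bytes, m: int, n: int) -> list[bytes]:
    """Fragmentation per redundancy level RAID M/N; only RAID 1 and RAID 5
    are sketched here."""
    if m == 1:
        return [data] * n                        # RAID 1/N: N full mirror copies
    if m == 5:
        size = -(-len(data) // (n - 1))          # ceil(len / (N - 1))
        parts = [data[i * size:(i + 1) * size].ljust(size, b"\x00")
                 for i in range(n - 1)]          # N - 1 data fragments, padded
        parity = bytearray(size)
        for part in parts:                       # byte-wise XOR parity fragment
            for i, b in enumerate(part):
                parity[i] ^= b
        return parts + [bytes(parity)]
    raise NotImplementedError(f"RAID {m} is not sketched here")

assert len(fragment(b"x" * 512, 1, 3)) == 3      # RAID 1/3 -> 3 copies
assert len(fragment(b"x" * 512, 5, 5)) == 5      # RAID 5/5 -> 4 data + 1 parity
```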
Step 303: a second write request is sent.
The service switch generates K second write requests based on the redundancy level numbers and the storage node addresses in the K forwarding records and sends them to the K storage nodes, respectively. As described in step 301, the queried redundancy level is RAID 1/3, corresponding to the 3 redundancy level numbers RAID 1/3_0, RAID 1/3_1, and RAID 1/3_2 and to the 3 storage nodes 200.1.1.110, 200.1.1.111, and 200.1.1.112, so the service switch constructs 3 second write requests, each containing second metadata and second data, where the second data of each request is one of the 3 fragments. The second source address in the second metadata of all 3 second write requests is 200.1.1.101, indicating that the second write requests are issued by the service switch with address 200.1.1.101; the second destination addresses are 200.1.1.110, 200.1.1.111, and 200.1.1.112, respectively, indicating that the 3 second write requests are to be sent to the storage nodes with those addresses; the operation command in each second metadata is "write"; the storage index in the second metadata is the same as in the first metadata, namely LUN A + 0x0000FFFF; and the payload data length is 512 bytes.
Further, since the second write request is sent to the storage node by the switch, the redundancy level number also needs to be included in the second metadata of the second write request, and referring to table 4 above, the redundancy level numbers in the second metadata of the 3 second write requests are RAID 1/3_0, RAID 1/3_1, and RAID 1/3_2, respectively.
Optionally, as described above, the second metadata of a second write request may further include a storage type, whose value is the same as the storage type in the first metadata, such as "block"; the second metadata may further include a sequence number, whose value is the same as the sequence number in the first metadata, such as 0x1234.
Optionally, the second write request may further include a data check, computed as described above, which is not repeated here.
The service switch sends the 3 second write requests to 3 storage nodes with addresses 200.1.1.110, 200.1.1.111, 200.1.1.112, respectively.
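Continuing the sketches above (Message, query_forwarding_table, fragment; all illustrative), the construction of the K second write requests could look like this:

```python
def build_second_write_requests(switch_addr: str, first, records, fragments):
    """One second write request per forwarding record: the i-th fragment is
    addressed to the i-th record's storage node and tagged with that record's
    RAID M/N_X redundancy level number."""
    return [Message(source_address=switch_addr,              # issued by the switch
                    destination_address=rec.storage_node_address,
                    operation_command="write",
                    storage_index=first.storage_index,       # unchanged
                    storage_type=first.storage_type,
                    redundancy_number=rec.redundancy_number, # e.g. "RAID 1/3_0"
                    sequence_number=first.sequence_number,
                    data_payload_length=len(frag),
                    data_payload=frag)
            for rec, frag in zip(records, fragments)]

second_writes = build_second_write_requests(
    "200.1.1.101", first_write, records, fragment(first_write.data_payload, 1, 3))
assert [w.destination_address for w in second_writes] == \
       ["200.1.1.110", "200.1.1.111", "200.1.1.112"]
```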
Optionally, after receiving the first write request, while querying the forwarding table, fragmenting, and sending the second write requests, the service switch may further create a state table to record the state information of the processing of this first write request, adding state information on the basis of Table 4: the operation state in the state table takes the value of the operation command in the first metadata, recorded here as "write"; the sequence number in the state table takes the value of the sequence number in the first metadata, for example 0x1234; and the payload length in the state table equals the data payload length in the first metadata, for example 512 bytes. Details are shown in Table 5 below:
Host address  Service switch address  Volume  Redundancy level number  Storage node address  Operation state  Sequence number  Payload length
200.1.1.3     200.1.1.101             LUN A   RAID 1/3_0               200.1.1.110           write            0x1234           512 bytes
200.1.1.3     200.1.1.101             LUN A   RAID 1/3_1               200.1.1.111           write            0x1234           512 bytes
200.1.1.3     200.1.1.101             LUN A   RAID 1/3_2               200.1.1.112           write            0x1234           512 bytes
TABLE 5
Further, it should be noted here that creating the state table is an optional step, and it may be performed after step 301, after step 302, or after step 303.
Step 304: and saving the data.
The K storage nodes respectively receive the second write requests and store the second data they carry. Referring to the steps above, the 3 storage nodes with addresses 200.1.1.110, 200.1.1.111, and 200.1.1.112 each receive a second write request from the service switch and then store the second data carried in the received request.
Preferably, in the embodiment of the present invention, the storage node may store the second data as an object store. As those skilled in the art will understand, an object store holds data in key-value form, where the key is used to index the stored value. Here the value is the second data, and its key may be formed from the storage index (i.e. volume ID + logical address) and the redundancy level number in the second metadata of the second write request: in one embodiment of the present invention, the key may be volume ID + logical address + redundancy level number, or it may be obtained by hashing volume ID + logical address + redundancy level number. In this way, when the storage node subsequently needs to read the data, it can directly use volume ID + logical address + redundancy level number as the key to index the corresponding data.
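A minimal sketch of this key construction and store, reusing the Message-shaped requests from the sketches above (the function names and the in-memory dict standing in for the object store are illustrative):

```python
import hashlib

def storage_key(storage_index: str, redundancy_number: str,
                hashed: bool = False) -> str:
    """Key = storage index (volume ID + logical address) + redundancy level
    number, or optionally a hash of that concatenation."""
    key = f"{storage_index}+{redundancy_number}"
    return hashlib.sha256(key.encode()).hexdigest() if hashed else key

kv_store: dict[str, bytes] = {}   # stand-in for one node's object store

def handle_second_write(req) -> None:
    """Store the second data under its key; a later read indexes the same key."""
    kv_store[storage_key(req.storage_index, req.redundancy_number)] = req.data_payload

handle_second_write(second_writes[0])
assert storage_key("LUN A+0x0000FFFF", "RAID 1/3_0") in kv_store
```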
Step 305: and the storage nodes respectively send second successful writing messages to the service switch.
After the K storage nodes complete the storage of the second data in the second write request, send a second write success message to the service switch, as described in the foregoing step, K is 3 in this embodiment. With reference to the message structure described in fig. 2, the second successful write message only needs to contain a metadata portion and does not need to contain a data payload and a data check portion. Wherein the metadata of the second successful write message is specifically:
source address: i.e., the address of the storage node, such as 200.1.1.110, or 200.1.1.111, or 200.1.1.112, respectively;
destination address: the address of the service switch, such as 200.1.1.101;
the storage type is as follows: and the value of the storage type in the second write request, such as 'block';
and (4) storing indexes: the value of the storage index in the second write request, such as "LUN a + Ox0000 FFFF";
and (3) operating commands: the writing is completed;
redundancy level numbering: the value of the redundancy level number in the second write request is equal to the value of the redundancy level number in the second write request, such as RAID 1/3_0, RAID 1/3_1 and RAID 1/3_ 2;
sequence number: and a value of the sequence number in the second write request, such as "Ox 1234".
Since the second successful write message does not need to carry any data, this field of data payload length is also not needed in the metadata. It will be appreciated by those skilled in the art that, referring to the foregoing description regarding the first write request and the second write request, the fields of the sequence number and the storage type of the metadata portion in the second write success message are also optional.
Step 306: the service switch sends a first successful write message to the host.
After receiving the second write success messages from the K storage nodes, the service switch may determine that data storage is completed for the first write request, so that the service switch sends the first write success message to the host, where the first write success message is the same as the second write success message, and the first write success message only needs to include a metadata portion and does not need to include a data payload and a data verification portion. Wherein the metadata of the first successful write message specifically is:
source address: i.e. the address of the service switch, e.g. 200.1.1.101;
destination address: the address of the host, such as 200.1.1.3;
the storage type is as follows: and the value of the storage type in the first write request, such as 'block';
and (4) storing indexes: the value of the storage index in the first write request, such as "LUN a + Ox0000 FFFF";
and (3) operating commands: the writing is completed;
sequence number: and a value of the sequence number in the first write request, such as "Ox 1234".
Since the first write success message does not carry any data, the data payload length field is not needed in its metadata either; as previously mentioned, messages between the service switch and the host need not carry a redundancy level number field. It will be appreciated by those skilled in the art that, referring to the foregoing description of the first write request and the second write request, the sequence number and storage type fields of the metadata portion in the first write success message are also optional.
Further, if the service switch created a state table after receiving the first write request, then at this stage the service switch updates that state table (for example, table 5) each time it receives a second write success message from a storage node: it queries the state table according to the second metadata carried by the second write success message and changes the operation state of the corresponding row to "write success", as shown in table 6:
[Table rendered as an image in the original publication: the state table of table 5 with the operation state of each row updated to "write success".]
TABLE 6
Further, when second write success messages have been received from all 3 storage nodes, so that every operation state in the state table has changed to "write success", the first write request has been processed successfully; the first write success message may then be sent to the host and the state table deleted.
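A minimal sketch of this bookkeeping follows. The StateTable class and its method names are assumptions made for illustration; only the behavior described above — one row per fragment, completion once every row reads "write success" — comes from the embodiment.

```python
class StateTable:
    """Tracks the progress of one first write request across its K fragments."""
    def __init__(self, redundancy_numbers, sequence_number):
        self.sequence_number = sequence_number
        # One row per forwarding record, keyed by redundancy level number.
        self.state = {rn: "write" for rn in redundancy_numbers}

    def mark_success(self, redundancy_number: str):
        # Called once per second write success message from a storage node.
        self.state[redundancy_number] = "write success"

    def complete(self) -> bool:
        return all(s == "write success" for s in self.state.values())

table = StateTable(["RAID 1/3_0", "RAID 1/3_1", "RAID 1/3_2"], 0x1234)
for rn in ("RAID 1/3_0", "RAID 1/3_1", "RAID 1/3_2"):
    table.mark_success(rn)
if table.complete():
    pass  # send the first write success message to the host, delete the table
```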
The write request processing above is described taking the redundancy level RAID 1/3 as an example. As can be understood by those skilled in the art, the redundancy level may instead be based on RAID 0, RAID 1, RAID 3, RAID 5, RAID 10, RAID 50, or other RAID levels, each implemented by a corresponding number of storage nodes. For those redundancy levels, in step 302 the first data should be fragmented and the redundant fragments computed according to the RAID level in question and the number of storage nodes constituting it. Since fragmentation and redundancy computation belong to the basic principles of RAID technology, the embodiment of the present invention does not describe each redundancy level in detail.
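As an illustration of that basic principle, the sketch below fragments the first data for two of the levels mentioned: RAID 1/3, where each fragment is a full replica sent to one of 3 nodes, and a RAID 3/RAID 5-style stripe with one XOR redundant fragment. The fragment sizing and zero-padding policy are assumptions made for the example.

```python
from functools import reduce

def fragment_replicate(data: bytes, copies: int = 3) -> list:
    # RAID 1/3: each of the 3 fragments is a full replica of the first data.
    return [data] * copies

def fragment_stripe_parity(data: bytes, n_data: int) -> list:
    # RAID 3 / RAID 5 style: split the data into n_data equal fragments and
    # append one XOR parity fragment computed across them, so any single
    # missing fragment can be rebuilt from the remaining ones.
    size = -(-len(data) // n_data)                       # ceiling division
    padded = data.ljust(size * n_data, b"\x00")          # zero-pad the tail
    frags = [padded[i * size:(i + 1) * size] for i in range(n_data)]
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), frags)
    return frags + [parity]

# RAID 1/3 yields 3 identical fragments; RAID 5/5 yields 4 data + 1 parity.
assert len(fragment_replicate(b"x" * 512)) == 3
assert len(fragment_stripe_parity(b"x" * 512, n_data=4)) == 5
```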
As can be seen from the foregoing embodiments, a forwarding table is stored in the switch. After the switch receives the first write request from the host, it can directly fragment the data carried in the first write request and compute the redundant fragments based on the forwarding table, generate the corresponding number of second write requests, and send them directly to the corresponding storage nodes to store the data they carry, thereby completing the storage of the data carried in the first write request. In the embodiment of the invention, the data in the first write request needs to be transmitted through the service switch only 1 time, instead of 2 times as in the prior art, which greatly reduces the workload of the switch and the waste of network resources, reduces the latency of processing a write request, and greatly improves the processing efficiency and performance of the distributed storage system.
Fig. 4 is a flowchart illustrating a process for processing a read request from a host in the distributed storage system according to an embodiment of the present invention.
Step 400: a first read request is received from the host, wherein the first read request carries third metadata.
When the host needs to read data, it sends a first read request carrying the third metadata. The third metadata may include a third source address, a third destination address, a storage index, an operation command, and a data payload length, wherein: the third source address is the address of the host, 200.1.1.3; the third destination address is the address of the switch, 200.1.1.101; the storage index is LUN A + 0x0000FFFF, indicating that the read request is to read data from volume A starting at the address 0x0000FFFF; the operation command is read; and the data payload length indicates the size of the data to be read, for example 512 bytes, meaning that 512 bytes of data need to be read contiguously from volume A starting at the address 0x0000FFFF indicated by the storage index.
Optionally, in this embodiment of the present invention, the third metadata may further include a storage type, for example "block", with the corresponding storage index being LUN A + 0x0000FFFF as described above; the third metadata may further include a sequence number, whose value is the sequence number of the last message the host sent before the first read request plus 1; for example, the sequence number of the first read request may be 0x5678.
Step 401: the forwarding table is queried based on the third metadata.
After receiving the first read request, the service switch determines how to process it by querying the forwarding table: the third source address, the third destination address, and the storage index in the third metadata are matched against the host address, service switch address, and volume of each forwarding record in the forwarding table to obtain K forwarding records, where the value of K is greater than or equal to 2. In table 4, 3 forwarding records corresponding to the host 200.1.1.3, the service switch 200.1.1.101, and LUN A can be found; the redundancy level numbers of these forwarding records are RAID 1/3_0, RAID 1/3_1, and RAID 1/3_2, respectively, i.e., the redundancy level of LUN A for the host 200.1.1.3 is RAID 1/3, and the forwarding paths of these records are 3 storage nodes with addresses 200.1.1.110, 200.1.1.111, and 200.1.1.112.
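The lookup itself is a straightforward filter over the forwarding records. The sketch below assumes a list-of-dicts representation of table 4, which is an illustrative choice rather than part of the embodiment.

```python
FORWARDING_TABLE = [
    # One record per (host, service switch, volume, redundancy level number).
    {"host": "200.1.1.3", "switch": "200.1.1.101", "volume": "LUN A",
     "redundancy_number": "RAID 1/3_0", "storage_node": "200.1.1.110"},
    {"host": "200.1.1.3", "switch": "200.1.1.101", "volume": "LUN A",
     "redundancy_number": "RAID 1/3_1", "storage_node": "200.1.1.111"},
    {"host": "200.1.1.3", "switch": "200.1.1.101", "volume": "LUN A",
     "redundancy_number": "RAID 1/3_2", "storage_node": "200.1.1.112"},
]

def lookup(source_address: str, destination_address: str, storage_index: str):
    # The volume is the volume ID part of the storage index,
    # e.g. "LUN A" out of "LUN A + 0x0000FFFF".
    volume = storage_index.split(" + ")[0]
    return [r for r in FORWARDING_TABLE
            if r["host"] == source_address
            and r["switch"] == destination_address
            and r["volume"] == volume]

records = lookup("200.1.1.3", "200.1.1.101", "LUN A + 0x0000FFFF")
assert len(records) == 3  # K = 3, so 3 second read requests are generated
```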
Step 402: second read requests are sent.
K second read requests are generated based on the redundancy level numbers and storage node addresses in the K forwarding records and sent to the K storage nodes respectively. Continuing the example of step 401: since the queried redundancy level is RAID 1/3 and the forwarding paths of the 3 redundancy level numbers RAID 1/3_0, RAID 1/3_1, and RAID 1/3_2 are 3 storage nodes with addresses 200.1.1.110, 200.1.1.111, and 200.1.1.112, the service switch constructs 3 second read requests, each carrying fourth metadata. The fourth source address in the fourth metadata of each of the 3 second read requests is 200.1.1.101, i.e., the second read requests are sent by the service switch with that address; the fourth destination addresses are 200.1.1.110, 200.1.1.111, and 200.1.1.112 respectively, indicating which storage node each of the 3 second read requests is to be sent to; the operation command in the fourth metadata is read in all cases; the storage index in the fourth metadata is the same as in the third metadata, still LUN A + 0x0000FFFF; and the data payload length in the fourth metadata is the same as in the third metadata, still 512 bytes.
Further, since the second read requests are sent to the storage nodes by the switch, the fourth metadata of each second read request also needs to include a redundancy level number; referring to table 4, the redundancy level numbers in the fourth metadata of the 3 second read requests are RAID 1/3_0, RAID 1/3_1, and RAID 1/3_2, respectively.
Optionally, as described above, the fourth metadata of the second read request may further include a storage type, equal to the value of the storage type in the third metadata, such as "block"; the fourth metadata of the second read request may further include a sequence number, equal to the value of the sequence number in the third metadata, such as 0x5678.
The service switch sends the 3 second read requests to the 3 storage nodes with addresses 200.1.1.110, 200.1.1.111, and 200.1.1.112, respectively.
Optionally, after receiving the first read request, while querying the forwarding table and sending the second read requests, the service switch may also create a state table recording the state of processing the first read request, by adding state information to the rows of table 4: the operation state in the state table takes the value of the operation command in the third metadata and is recorded as "read"; the sequence number in the state table takes the value of the sequence number in the third metadata, for example 0x5678; and the payload length in the state table equals the data payload length in the third metadata, for example 512 bytes. The details are shown in table 7 below:
[Table rendered as an image in the original publication: the forwarding records of table 4 extended with an operation state of "read", sequence number 0x5678, and payload length 512 bytes.]
TABLE 7
Further, it should be noted that creating the state table is an optional step, and it may be performed after step 401 or after step 402.
Step 403: the storage nodes send second read completion messages.
Each of the K storage nodes receives a second read request from the switch and reads data using the storage index and the redundancy level number in the fourth metadata carried by that second read request as the key. In this example, the 3 storage nodes with addresses 200.1.1.110, 200.1.1.111, and 200.1.1.112 each receive 1 of the 3 second read requests from the service switch and read data locally based on the fourth metadata in the second read request.
In the embodiment of the present invention, as described with reference to fig. 3 and the corresponding embodiment, if the storage node stores data in an object storage manner, then in this step the storage node indexes the data to be read using the storage index (i.e., volume ID + logical address) and the redundancy level number in the fourth metadata as the key; alternatively, it performs a hash operation on the storage index (i.e., volume ID + logical address) and the redundancy level number in the fourth metadata to obtain the key, and then indexes the data to be read with the obtained key.
The K storage nodes each generate a second read completion message based on the read data and send it to the service switch. With reference to the message structure described in fig. 2, the second read completion message contains metadata and a data payload. The metadata of the second read completion message is specifically:
source address: the address of the storage node, i.e., 200.1.1.110, 200.1.1.111, or 200.1.1.112, respectively;
destination address: the address of the service switch, i.e., 200.1.1.101;
storage type: the value of the storage type in the second read request, such as "block";
storage index: the value of the storage index in the second read request, such as "LUN A + 0x0000FFFF";
operation command: read complete;
redundancy level number: the value of the redundancy level number in the second read request, i.e., RAID 1/3_0, RAID 1/3_1, and RAID 1/3_2, respectively;
sequence number: the value of the sequence number in the second read request, such as "0x5678";
data payload length: equal to the data payload length in the second read request, such as 512 bytes.
It will be appreciated by those skilled in the art that, referring to the foregoing description of the first read request and the second read request, the sequence number and storage type fields of the metadata portion in the second read completion message are also optional.
The data payload in the second read completion message is the data read out locally by the storage node based on the second read request.
Step 404: fragment reassembly is performed.
The service switch receives a second read completion message from each of the K storage nodes, parses the carried data out of each second read completion message, and reassembles the data carried in the K second read completion messages. This embodiment continues the example in which the redundancy level is RAID 1/3: after the data payloads in the second read completion messages from the 3 storage nodes are checked and found correct, they are reassembled into 1 piece of 512-byte data, which is the data the host needs to read and is sent to the host in the subsequent step.
As can be understood by those skilled in the art, if the redundancy level were RAID 5/5, the service switch would receive second read completion messages from 5 different storage nodes and parse out 5 data payloads comprising 4 data fragments and 1 redundant fragment; after checking the 5 payloads, the service switch would reassemble the 4 data fragments and the 1 redundant fragment based on RAID 5 technology into the data the host needs to read, and send it to the host in the subsequent step. More generally, the redundancy level may be based on RAID 0, RAID 1, RAID 3, RAID 5, RAID 10, RAID 50, or other RAID levels, each implemented by a corresponding number of storage nodes; for those redundancy levels, in step 404 the data payloads in the second read completion messages from the different storage nodes should be reassembled according to the RAID level in question and the number of storage nodes constituting it. Since fragment reassembly belongs to the basic principles of RAID technology, the embodiment of the present invention does not describe each redundancy level in detail; the following description continues with the example in which the redundancy level is RAID 1/3.
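To make the reassembly direction of the same principle concrete, here is a sketch matching the fragmentation sketch given for the write path: RAID 1/3 reassembly verifies that the replicas agree and returns one of them, while the striped-with-parity case concatenates the data fragments after an XOR parity check, rebuilding a single missing fragment from the parity if necessary. The error handling shown is an assumption.

```python
from functools import reduce

def xor(frags) -> bytes:
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), frags)

def reassemble_replicas(payloads) -> bytes:
    # RAID 1/3: all replicas must agree; any one of them is the host data.
    assert all(p == payloads[0] for p in payloads), "replica mismatch"
    return payloads[0]

def reassemble_stripe_parity(payloads, length: int) -> bytes:
    # RAID 5/5 style: the last payload is the parity fragment. One missing
    # fragment (None) is rebuilt by XOR-ing the parity with the others.
    *data, parity = payloads
    if any(p is None for p in data):
        i = data.index(None)
        data[i] = xor([p for p in payloads if p is not None])
    assert xor(data) == parity, "parity check failed"
    return b"".join(data)[:length]   # strip padding back to the read length

assert reassemble_replicas([b"abc"] * 3) == b"abc"
frags = [b"ab", b"cd", b"ef", b"gh"]
assert reassemble_stripe_parity(frags + [xor(frags)], 8) == b"abcdefgh"
```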
Step 405: a first read completion message is sent to the host.
The service switch generates a first read completion message based on the reassembled data. With reference to the message structure described in fig. 2, the first read completion message contains metadata and a data payload, where the data payload is the data the service switch reassembled from the fragments. The metadata of the first read completion message is specifically:
source address: the address of the service switch, i.e., 200.1.1.101;
destination address: the address of the host, i.e., 200.1.1.3;
storage type: the value of the storage type in the first read request, such as "block";
storage index: the value of the storage index in the first read request, such as "LUN A + 0x0000FFFF";
operation command: read complete;
sequence number: the value of the sequence number in the first read request, such as "0x5678";
data payload length: equal to the data payload length in the first read request, such as 512 bytes.
As previously mentioned, messages between the service switch and the host need not carry a redundancy level number field. It will be appreciated by those skilled in the art that, referring to the foregoing description of the first read request and the second read request, the sequence number and storage type fields of the metadata portion in the first read completion message are also optional.
Further, if the service switch created the state table of table 7 after receiving the first read request, then at this stage the service switch changes the operation state in the state table to "read success" each time it receives a second read completion message from a storage node, as shown in table 8:
[Table rendered as an image in the original publication: the state table of table 7 with the operation state of each row updated to "read success".]
TABLE 8
Further, when second read completion messages have been received from all 3 storage nodes, so that every operation state in the state table has changed to "read success", the first read request has been processed successfully; the first read completion message may then be sent to the host and the state table deleted.
In the distributed storage system provided in the embodiment of the present invention, after the service switch configured for the host receives the first read request from the host, it can directly split the first read request into a plurality of second read requests based on the forwarding table and send them to the corresponding storage nodes to read data; after receiving the data carried in the second read completion messages returned by those storage nodes, it reassembles that data based on RAID technology and returns it to the host in the first read completion message, thereby implementing fast read request processing. The data the host needs to read passes through the service switch only 1 time, on its way from the storage nodes through the service switch to the host, instead of being transmitted 2 times as in the prior art; this reduces the waste of the switch's network resources, reduces the latency of read request processing, and greatly improves the processing efficiency and performance of the distributed storage system.
Based on the same or a similar inventive concept as described in figs. 1, 2, 3, and 4 and the corresponding embodiments above, as shown in fig. 5, a schematic structural diagram of a switch is provided for an embodiment of the present invention; the switch 500 includes a receiver 501, a processor 502, a memory 503, a RAID engine 504, and a transmitter 505.
In the embodiment of the present invention, the receiver 501 is configured to receive a switch capability information query message from a control node; the processor 502 is configured to parse the switch capability information query message, query the capability information of the switch based on it (such as the switch ID, the switch address, and whether the switch has fragmentation and redundancy computing capability), and generate a switch capability information query response message based on the capability information; and the transmitter 505 is configured to send the switch capability information query response message to the control node. Further, the receiver 501 is also configured to receive a forwarding table from the control node, and the processor 502 is configured to parse the forwarding table and pass it to the memory 503, which stores it.
Further, the receiver 501 is further configured to receive a first write request from a host, where the first write request carries first metadata and first data; the processor 502 is further configured to parse the first write request to obtain the first metadata and the first data, and to query the forwarding table based on the first metadata to obtain K forwarding records, where K is greater than or equal to 2; the RAID engine 504 is configured to perform a fragmentation operation on the first data based on the redundancy level numbers in the K forwarding records to obtain K fragments; the processor 502 is further configured to generate K second write requests based on the redundancy level numbers and the storage node addresses in the K forwarding records, where each of the K second write requests carries second data, namely 1 fragment of the K fragments; and the transmitter 505 is further configured to send the K second write requests to K storage nodes, respectively. The receiver 501 is further configured to receive a second write success message from each of the K storage nodes; the processor 502 is further configured to generate a first write success message based on the second write success messages from the K storage nodes; and the transmitter 505 is further configured to send the first write success message to the host. For the forwarding table and the first metadata carried by the first write request, refer to figs. 1, 2, and 3 and the descriptions of the corresponding embodiments, which are not repeated in this apparatus embodiment; likewise, for the implementation details of how the processor 502 queries the forwarding table based on the first metadata, performs the fragmentation operation, generates the K second write requests, and generates the first write success message from the second write success messages, refer to figs. 1, 2, and 3 and the corresponding embodiments. Further, as described with reference to fig. 3 and the corresponding embodiments, during the processing of the write request the processor 502 may also be configured to create a state table, to update it, and to delete it after the second write success message has been received from each of the storage nodes.
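Putting these pieces together, the switch's write path can be summarized by the sketch below, which reuses the illustrative Metadata, lookup, and fragment_replicate helpers from the earlier sketches; the msg argument with metadata and payload attributes is likewise an assumption, not the patented implementation.

```python
def handle_first_write_request(msg, switch_address: str):
    # 1. Parse the first metadata and first data (receiver 501 / processor 502).
    meta, first_data = msg.metadata, msg.payload

    # 2. Query the forwarding table for K >= 2 records (processor 502).
    records = lookup(meta.source_address, switch_address, meta.storage_index)

    # 3. Fragment the first data per the redundancy level (RAID engine 504).
    fragments = fragment_replicate(first_data, copies=len(records))

    # 4. Generate the K second write requests (processor 502); the transmitter
    #    505 then sends each one, with its fragment, to its storage node.
    return [Metadata(source_address=switch_address,
                     destination_address=record["storage_node"],
                     operation_command="write",
                     storage_index=meta.storage_index,
                     redundancy_number=record["redundancy_number"],
                     sequence_number=meta.sequence_number,
                     payload_length=len(fragment))
            for record, fragment in zip(records, fragments)]
```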
With the switch provided by the embodiment of the present invention, after receiving the first write request from the host, the switch can directly fragment the data carried in the first write request and compute the redundant fragments based on the forwarding table, generate the corresponding number of second write requests, and send them directly to the corresponding storage nodes, which each store the data carried in the second write request they receive, thereby completing the storage of the data carried in the first write request. In the embodiment of the invention, the data in the first write request needs to be transmitted through the service switch only 1 time, instead of 2 times as in the prior art, which reduces the waste of the switch's network resources, reduces the latency of processing a write request, and greatly improves the processing efficiency and performance of the distributed storage system.
Further, the receiver 501 is further configured to receive a first read request from a host, where the first read request carries third metadata; the processor 502 is further configured to parse the first read request to obtain the third metadata, to query the forwarding table based on the third metadata to obtain K forwarding records, where K is greater than or equal to 2, and to generate K second read requests based on the redundancy level numbers and storage node addresses in the K forwarding records; and the transmitter 505 is further configured to send the K second read requests to K storage nodes, respectively. The receiver 501 is further configured to receive a second read completion message from each of the K storage nodes; the processor 502 is further configured to parse the second read completion message from each of the K storage nodes to obtain the data carried by each message; the RAID engine 504 is further configured to reassemble the data carried in the K second read completion messages; the processor 502 is further configured to generate a first read completion message carrying the reassembled data; and the transmitter 505 is configured to send the first read completion message to the host. For the implementation details of each action of each module/component in the switch, refer to fig. 4 and the description of the corresponding embodiment, which are not repeated here.
After receiving the first read request from the host, the switch provided by the embodiment of the present invention can directly split the first read request into a plurality of second read requests based on the forwarding table and send them to the corresponding storage nodes to read data; after receiving the data carried in the second read completion messages returned by those storage nodes, it reassembles that data based on RAID technology and returns it to the host in the first read completion message, thereby implementing fast read request processing. The data the host needs to read passes through the service switch only 1 time, on its way from the storage nodes to the host, instead of being transmitted 2 times as in the prior art; this reduces the waste of the switch's network resources, reduces the latency of read request processing, and greatly improves the processing efficiency and performance of the distributed storage system.
Based on the same or a similar inventive concept as described in fig. 1 and the corresponding embodiments above, as shown in fig. 6, a schematic structural diagram of a control node is provided for an embodiment of the present invention; the control node 600 includes a receiver 601, a processor 602, a memory 603, and a transmitter 604. The transmitter 604 is configured to send a switch capability information query message to the switch and a storage node capability information query message to the storage node. The receiver 601 is configured to receive a switch capability information query response message from the switch, carrying switch capability information that may include the switch ID, the switch address, and whether the switch has fragmentation and redundancy computing capability; and to receive a storage node capability information query response message from the storage node, carrying storage node capability information that may include the storage node address and the storage node capacity. The processor 602 is configured to parse the switch capability information query response message to obtain the switch capability information, and to parse the storage node capability information query response message to obtain the storage node capability information; the memory 603 is configured to store the switch capability information and the storage node capability information. The receiver 601 is further configured to receive a host storage requirement; the processor 602 is further configured to allocate a service switch, allocate a plurality of storage nodes, and configure a forwarding table for the host based on the host storage requirement, the switch capability information, and the storage node capability information. The transmitter 604 is further configured to send the address of the service switch to the host and to send the forwarding table to the service switch. For the presentation and storage of the switch capability information and the storage node capability information, refer to fig. 1 and the corresponding embodiment; for the implementation details of how the processor 602 allocates a service switch, allocates a plurality of storage nodes, and configures a forwarding table for the host, refer to fig. 3 and the corresponding embodiment, which are not repeated here in this embodiment of the present invention.
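As a sketch of what configuring a forwarding table might look like on the control node, the function below picks a capable switch and K storage nodes and emits one forwarding record per redundancy level number. The selection policy (first switch reporting fragmentation capability, first K nodes with free capacity) is purely an assumption for illustration.

```python
def configure_forwarding_table(host: str, volume: str, k: int, raid: str,
                               switches: list, nodes: list) -> list:
    # Choose a service switch that reports fragmentation/redundancy capability.
    service_switch = next(s for s in switches if s["can_fragment"])
    # Choose K storage nodes with free capacity (illustrative policy).
    chosen = [n for n in nodes if n["free_capacity"] > 0][:k]
    # One forwarding record per redundancy level number, as in table 4.
    return [{"host": host, "switch": service_switch["address"],
             "volume": volume, "redundancy_number": f"{raid}_{i}",
             "storage_node": n["address"]}
            for i, n in enumerate(chosen)]

table = configure_forwarding_table(
    host="200.1.1.3", volume="LUN A", k=3, raid="RAID 1/3",
    switches=[{"address": "200.1.1.101", "can_fragment": True}],
    nodes=[{"address": a, "free_capacity": 100}
           for a in ("200.1.1.110", "200.1.1.111", "200.1.1.112")])
assert len(table) == 3  # sent to the service switch; the host gets its address
```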
With the control node provided by the embodiment of the invention, automatic centralized configuration of the distributed storage system can be realized, solving the prior-art problems of requiring manual configuration and of having to configure different switches and storage nodes separately, and greatly improving configuration efficiency in the distributed storage system. Furthermore, since the forwarding table is configured by the control node and the switch performs forwarding operations according to that forwarding table, control/forwarding separation is realized, which can greatly improve the performance of the distributed storage system.
It should be noted that, in all the embodiments of the present invention, the terms "first", "second", "third", "fourth", and the like are used only to distinguish one element from another for convenience of description, and these ordinals should not be construed as limiting the entities they qualify.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both; to illustrate the interchangeability of hardware and software clearly, the components and steps of the examples have been described above in general terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (17)

1. A method for processing a write request, the method being applied to a switch, wherein a forwarding table is preconfigured in the switch, the forwarding table includes a plurality of forwarding records, and each forwarding record includes a host address, a service switch address, a volume, a redundancy level number, and a storage node address, the method comprising:
receiving a first write request from a host, wherein the first write request carries first metadata and first data;
querying the forwarding table based on the first metadata to obtain K forwarding records, wherein K is greater than or equal to 2;
performing a fragmentation operation on the first data based on the redundancy level numbers in the K forwarding records to obtain K fragments;
generating K second write requests based on the redundancy level numbers and the storage node addresses in the K forwarding records, wherein each second write request in the K second write requests carries second data, and the second data is 1 fragment of the K fragments;
and sending the K second write requests to K storage nodes, respectively.
2. The method of claim 1, wherein the first metadata comprises a first source address, a first destination address, a storage index, and a data payload length, wherein the first source address is an address of the host, the first destination address is an address of the service switch, and the data payload length indicates a size of the first data;
the querying the forwarding table based on the first metadata specifically includes:
matching the host address, the service switch address, and the volume of each forwarding record in the forwarding table against the first source address, the first destination address, and the storage index, respectively, to obtain the K forwarding records.
3. The method of claim 1, wherein each of the K second write requests further carries second metadata, and the second metadata includes a second source address, a second destination address, a storage index, a redundancy level number, and a data payload length, wherein the second source address is an address of the service switch, the second destination address is a storage node address of 1 forwarding record of the K forwarding records, the redundancy level number is a redundancy level number of 1 forwarding record of the K forwarding records, and the data payload length indicates a size of the 1 fragment.
4. The method of claim 3, further comprising:
the K storage nodes respectively receive second write requests from the service switch;
storing the second data carried in the received second write request by taking the storage index and the redundancy level number in the second metadata carried in the received second write request as the key;
and sending a second write success message to the service switch.
5. The method of claim 4, further comprising:
the service switch receives the second write success messages from the K storage nodes, respectively;
sending a first write success message to the host.
6. A method for processing a read request, wherein the method is applied to a switch, a forwarding table is pre-configured in the switch, the forwarding table includes a plurality of forwarding records, and each forwarding record includes a host address, a service switch address, a volume, a redundancy level number, and a storage node address, the method comprising:
receiving a first read request from a host, wherein the first read request carries third metadata;
querying the forwarding table based on the third metadata to obtain K forwarding records, wherein K is greater than or equal to 2;
generating K second reading requests based on the redundancy level numbers and the storage node addresses in the K forwarding records;
and sending the K second read requests to K storage nodes, respectively.
7. The method of claim 6, wherein the third metadata comprises a third source address, a third destination address, a storage index, and a data payload length, wherein the third source address is an address of the host, the third destination address is an address of the service switch, and the data payload length indicates a size of the data to be read;
the querying the forwarding table based on the third metadata specifically includes:
matching the host address, the service switch address, and the volume of each forwarding record in the forwarding table against the third source address, the third destination address, and the storage index in the third metadata, respectively, to obtain the K forwarding records.
8. The method of claim 7, wherein each of the K second read requests further carries fourth metadata, and the fourth metadata includes a fourth source address, a fourth destination address, a storage index, a redundancy level number, and a data payload length, wherein the fourth source address is an address of the service switch, the fourth destination address is a storage node address of 1 forwarding record of the K forwarding records, the redundancy level number is a redundancy level number of 1 forwarding record of the K forwarding records, and the data payload length indicates a size of the data to be read, and the method further comprises:
each of the K storage nodes receiving a second read request from the service switch;
reading data by taking the storage index and the redundancy level number in the fourth metadata carried by the second read request as the key;
and sending a second read completion message to the service switch, wherein the second read completion message carries the read data.
9. The method of claim 8, further comprising:
the service switch receives a second read completion message from each of the K storage nodes respectively;
reassembling the data carried in the K second read completion messages;
and sending a first read completion message to the host, wherein the reassembled data is carried in the first read completion message.
10. A switch, comprising a receiver, a memory, a processor, a transmitter, and a RAID engine, wherein:
the receiver is configured to receive a first write request from a host, where the first write request carries first metadata and first data;
the memory is configured to store a forwarding table, wherein the forwarding table includes a plurality of forwarding records, and each forwarding record includes a host address, a service switch address, a volume, a redundancy level number, and a storage node address;
the processor is configured to parse the first write request to obtain the first metadata and the first data, and query the forwarding table based on the first metadata to obtain K forwarding records, where K is greater than or equal to 2;
the RAID engine is configured to perform a fragmentation operation on the first data based on the redundancy level numbers in the K forwarding records to obtain K fragments;
the processor is further configured to generate K second write requests based on the redundancy level numbers in the K forwarding records and the storage node addresses, where each of the K second write requests carries second data, and the second data is 1 fragment of the K fragments;
and the transmitter is configured to send the K second write requests to K storage nodes, respectively.
11. The switch of claim 10, wherein the first metadata comprises a first source address, a first destination address, a storage index, and a data payload length, wherein the first source address is an address of the host, the first destination address is an address of the service switch, and the data payload length indicates a size of the first data;
the querying, by the processor, the forwarding table based on the first metadata to obtain K forwarding records specifically includes:
matching the host address, the service switch address, and the volume of each forwarding record in the forwarding table against the first source address, the first destination address, and the storage index, respectively, to obtain the K forwarding records.
12. The switch of claim 10, wherein the receiver is further configured to receive a second write success message from each of the K storage nodes, respectively;
the processor is further configured to generate a first write success message based on the second write success message from each of the K storage nodes;
the transmitter is further configured to send the first write success message to the host.
13. A switch, comprising a receiver, a memory, a processor, and a transmitter, wherein:
the receiver is configured to receive a first read request from a host, where the first read request carries third metadata;
the memory is configured to store a forwarding table, wherein the forwarding table includes a plurality of forwarding records, and each forwarding record includes a host address, a service switch address, a volume, a redundancy level number, and a storage node address;
the processor is configured to parse the first read request to obtain the third metadata, query the forwarding table based on the third metadata to obtain K forwarding records, where K is greater than or equal to 2, and generate K second read requests based on redundancy level numbers and storage node addresses in the K forwarding records;
the transmitter is configured to send the K second read requests to K storage nodes, respectively.
14. The switch of claim 13, wherein the switch further comprises a RAID engine;
the receiver is further configured to receive a second read completion message from each of the K storage nodes, respectively;
the processor is configured to parse the second read completion message from each of the K storage nodes to obtain data carried by each of the second read completion messages;
the RAID engine is configured to reassemble the data carried in the K second read completion messages;
the processor is further configured to generate a first read completion message, where the first read completion message carries the reassembled data;
the transmitter is configured to send a first read complete message to the host.
15. A control node, characterized in that the control node comprises a receiver, a processor, a memory, a transmitter, wherein:
the transmitter is configured to send a switch capability information query message to the switch, and to send a storage node capability information query message to the storage node;
the receiver is configured to receive a switch capability information query response packet from the switch, where the switch capability information query response packet carries switch capability information, and receive a storage node capability information query response packet from the storage node, where the storage node capability information query response packet carries storage node capability information;
the processor is configured to parse the switch capability information query response message to obtain the switch capability information, and to parse the storage node capability information query response message to obtain the storage node capability information;
the memory is configured to store the switch capability information and the storage node capability information;
the receiver is further configured to receive a host storage requirement;
the processor is further configured to allocate a service switch, allocate a plurality of storage nodes, and configure a forwarding table for the host based on the host storage requirement, the switch capability information, and the storage node capability information, where the forwarding table includes a plurality of forwarding records, and each forwarding record includes a host address, a service switch address, a volume, a redundancy level number, and a storage node address;
the transmitter is further configured to send the address of the service switch to the host, and send the forwarding table to the service switch.
16. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a computer device, implements the method of any one of claims 1 to 5.
17. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a computer device, implements the method of any one of claims 6 to 9.
CN201610896118.6A 2016-10-13 2016-10-13 Method for processing write request or read request, switch and control node Active CN107948233B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610896118.6A CN107948233B (en) 2016-10-13 2016-10-13 Method for processing write request or read request, switch and control node

Publications (2)

Publication Number Publication Date
CN107948233A CN107948233A (en) 2018-04-20
CN107948233B true CN107948233B (en) 2021-01-08

Family

ID=61928535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610896118.6A Active CN107948233B (en) 2016-10-13 2016-10-13 Method for processing write request or read request, switch and control node

Country Status (1)

Country Link
CN (1) CN107948233B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112099728B (en) * 2019-06-18 2022-09-16 华为技术有限公司 Method and device for executing write operation and read operation
CN111580742B (en) * 2019-08-30 2021-06-15 上海忆芯实业有限公司 Method for processing read (Get)/Put request using accelerator and information processing system thereof
CN115858181B (en) * 2023-02-27 2023-06-06 中用科技有限公司 Distributed storage inclined work load balancing method based on programmable switch

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004074960A2 (en) * 2003-02-19 2004-09-02 International Business Machines Corporation Distributed fragment caching and assembly in distributed computing applications
CN101175011A (en) * 2007-11-02 2008-05-07 南京大学 Method for acquiring high available data redundancy in P2P system based on DHT
CN102063270A (en) * 2010-12-28 2011-05-18 成都市华为赛门铁克科技有限公司 Write operation method and device
CN103699494A (en) * 2013-12-06 2014-04-02 北京奇虎科技有限公司 Data storage method, data storage equipment and distributed storage system
CN103929500A (en) * 2014-05-06 2014-07-16 刘跃 Method for data fragmentation of distributed storage system
CN105242881A (en) * 2015-10-12 2016-01-13 创新科软件技术(深圳)有限公司 Distributed storage system and data read-write method for same

Also Published As

Publication number Publication date
CN107948233A (en) 2018-04-20

Similar Documents

Publication Publication Date Title
US11762745B2 (en) Encoding data based on targeted storage unit information
CN107819828B (en) Data transmission method and device, computer equipment and storage medium
US7203871B2 (en) Arrangement in a network node for secure storage and retrieval of encoded data distributed among multiple network nodes
US9411685B2 (en) Parity chunk operating method and data server apparatus for supporting the same in distributed raid system
US7685459B1 (en) Parallel backup
CN100452794C (en) Master node selection in clustered node configurations
US10728335B2 (en) Data processing method, storage system, and switching device
US20120323864A1 (en) Distributed de-duplication system and processing method thereof
US7991006B2 (en) Filtering redundant packets in computer network equipments
CN107948233B (en) Method for processing write request or read request, switch and control node
US20210004171A1 (en) I/o request processing method and device
CN109918021B (en) Data processing method and device
CN109407975B (en) Data writing method, computing node and distributed storage system
CN109981768A (en) I/o multipath planning method and equipment in distributed network storage system
US20160259836A1 (en) Parallel asynchronous data replication
JP2006507591A (en) Efficient support for implementing multiple native network protocols in a single system
CN104780201A (en) Data packet processing method and device for use in IPVS (Internet Protocol Virtual Server) cluster
US11449278B2 (en) Methods for accelerating storage operations using computational network and storage components and devices thereof
CN111083160A (en) Resource information recovery method and device
EP2391946B1 (en) Method and apparatus for processing distributed data
CN112148797A (en) Block chain-based distributed data access method and device and storage node
CN112131229A (en) Block chain-based distributed data access method and device and storage node
US10996864B1 (en) Aggregating ALUA statuses from multiple arrays
CN113132233B (en) Data processing method, software defined network controller and data processing system
CN117176796A (en) Message pushing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant