CN115858181B - Distributed storage skewed workload balancing method based on programmable switch

Distributed storage skewed workload balancing method based on programmable switch

Info

Publication number
CN115858181B
CN115858181B
Authority
CN
China
Prior art keywords
storage server
key
programmable switch
ver
request
Prior art date
Legal status
Active
Application number
CN202310170363.9A
Other languages
Chinese (zh)
Other versions
CN115858181A (en)
Inventor
胡增
江大白
汪刚
Current Assignee
China Applied Technology Co Ltd
Original Assignee
China Applied Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Applied Technology Co Ltd
Priority to CN202310170363.9A
Publication of CN115858181A
Application granted
Publication of CN115858181B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of distributed storage and discloses a distributed storage skewed workload balancing method based on a programmable switch. The method uses the programmable switch to process the on-path workload and guarantees data consistency with an in-network coherence directory; compared with existing load balancing methods, it can increase throughput by 10 times or reduce the number of storage servers required by 90 percent. The invention responds quickly when hotkeys change, and by replicating a small number of hot objects it provides good load balancing, so that hot objects are quickly identified and tracked whether the hot set remains stable or changes rapidly.

Description

Distributed storage skewed workload balancing method based on programmable switch
Technical Field
The invention relates to the field of distributed storage, and in particular to a distributed storage skewed workload balancing method based on a programmable switch.
Background
Real storage system workloads typically exhibit a highly skewed object access pattern: a small fraction of hot objects receives far more requests than the remaining objects. Many such workloads can be modeled with a Zipfian access distribution, and some real workloads exhibit very high skew (e.g., a Zipf distribution with skew parameter greater than 1). Furthermore, the set of hot objects may change dynamically; in some cases, a hot object loses its popularity within 10 minutes on average.
Distributed storage systems typically spread objects across multiple storage servers to achieve scalability and load distribution. High workload skew means the load across the storage servers is also uneven: the few servers storing the hottest objects receive disproportionately more traffic than the others. Excessive access skew can push the load on a single storage server beyond its processing capacity and overload it. To avoid the resulting performance penalty, the system must be over-provisioned, which greatly increases overall cost.
Skewed workloads come in many varieties, such as read-heavy workloads (more than 95% of requests are reads), write-heavy workloads, and mixed workloads. Object (value) sizes also vary greatly: within a single application, the system may store small values (a few bytes), large values (kilobytes to megabytes), or a combination of both. A better skewed-workload balancing approach is therefore needed to address the above.
The prior art includes the following approaches:
Caching: caching has long been the standard method for accelerating database-backed web applications, and its effectiveness has been established both in theory and in practice. However, the caching approach has two limitations. First, a cache is effective only if it can be built to handle requests orders of magnitude faster than the storage servers behind it. This is easily achieved when the storage servers themselves are not built on fast in-memory hardware; with current hardware, however, building such a cache in front of in-memory storage servers becomes a difficult challenge. Second, caching only benefits read-only workloads, because the cached copy must be invalidated before a storage server can process a write operation.
Selective replication: selective replication is another common way to achieve load balancing. By selectively replicating hot objects onto multiple storage servers, requests for these hot objects can be sent to any storage server holding a replica, effectively spreading the load. However, existing selective replication methods face two challenges. First, clients must be able to identify each hot object and the locations of its replicas, but both may change as object hotness changes. This can be achieved with a centralized directory service or by copying the directory to the clients, but both options have scalability limits: a centralized directory service easily becomes a bottleneck, and synchronizing the directory across hundreds or thousands of clients is difficult. Second, a selective replication scheme must provide consistency for hot objects, which the prior art fails to address.
Disclosure of Invention
To solve the above technical problems, the invention provides a distributed storage skewed workload balancing method based on a programmable switch.
In order to solve the technical problems, the invention adopts the following technical scheme:
a distributed storage tilting workload balancing method based on a programmable switch is used for balancing workload between a client and a distributed storage server, wherein the client sends a request containing an application layer message to the storage server, and the request comprises a write request for storing a value corresponding to a key to the storage server and a read request for reading the value corresponding to the key from the storage server; the storage server returns a response containing the application layer message to the client; the workload comprises a client request and a storage server response, and the workload passes through the programmable switch; the storage server comprises a main storage server and a plurality of copy storage servers; the application layer message comprises information of a key and an object version number of the key;
the workload balancing method specifically comprises the following steps:
P1: the programmable switch counts the hottest O(nlogn) hotkeys k while clients interact with the storage servers, where n represents the number of all distinct keys and O(nlogn) is a complexity bound. The programmable switch stores the value v corresponding to each hotkey k on a replica storage server, and records the hotkey k, the highest object version number ver_completed of hotkey k, and the list rset of storage servers holding the value v.
P2: when a client sends a write request to a storage server:
the programmable switch assigns a monotonically increasing object version number to each write request; if the key in the write request has a record in the programmable switch, the key is a hotkey, and the programmable switch selects one or more storage servers from all storage servers to forward the write request to; if the key in the write request has no record in the programmable switch, the write request is forwarded to the primary storage server;
when a client sends a read request to a storage server:
if the key in the read request has a record in the programmable switch, the key is a hotkey, and the switch finds a storage server s holding the value of hotkey k and forwards the read request to it; if the key in the read request has no record in the programmable switch, the read request is forwarded to the primary storage server;
P3: the primary storage server records the object version numbers and values of all keys, and each replica storage server records the object version numbers and values of the hotkeys; when processing a write request, a storage server compares the object version number ver of the key in the write request with the key's locally stored object version number, and only when the version in the write request is higher does the storage server update the key's value and object version number;
P4: when any storage server s returns a response to the client and the key in the response has a record in the programmable switch, then: the object version number ver of the hotkey k in the response is compared with the highest object version number ver_completed of hotkey k recorded by the programmable switch; if ver > ver_completed, the value of ver_completed is updated to ver, the storage server list rset recorded by the programmable switch for hotkey k is emptied, and storage server s is then added to the list rset; if ver = ver_completed, storage server s is added directly to the list rset.
Specifically, the application layer header includes an OP field, a KEYHASH field, a VER field, and a SERVERID field. The OP field indicates the workload type; the KEYHASH field holds a hash value of the key generated in the programmable switch; the VER field holds the object version number ver assigned by the programmable switch for the workload; the SERVERID field holds the storage server identifier filled in by the storage server when responding. The contents of the OP field include READ, WRITE, READ-REPLY and WRITE-REPLY, where READ indicates a read request, WRITE a write request, READ-REPLY a reply to a read request, and WRITE-REPLY a reply to a write request.
Specifically, the programmable switch contains an in-network coherence directory, which records through a hash table the hotkey k, the object version number of hotkey k, and the list rset of storage servers holding the value v.
Specifically, when a key k loses its heat, i.e., key k is no longer among the hottest O(nlogn) hotkeys, the programmable switch marks key k, and upon receiving a response containing key k, the programmable switch deletes its record of key k.
Specifically, when the value of a hotkey k needs to be copied from the primary storage server to a replica storage server, the programmable switch issues a virtual write command that raises the object version number of hotkey k to the highest object version number ver_completed of hotkey k recorded by the programmable switch, and sends the hotkey k, its corresponding value, and its object version number to the replica storage server for storage.
Compared with the prior art, the invention has the following beneficial technical effects:
the invention utilizes the programmable switch and can increase throughput by 10 times or reduce the number of storage servers by 90% compared with the existing load balancing method.
The invention responds quickly when the set of hotkeys changes, and by replicating a small number of hot objects it provides good load balancing, so that hot objects are quickly identified and tracked whether they remain stable or change rapidly.
The invention is applicable to a variety of workloads, such as read-heavy and write-heavy workloads, and also to workloads with objects of different sizes and different skew levels.
Drawings
FIG. 1 is a system model diagram of the present invention;
FIG. 2 is a diagram of the application layer header data format of the workload of the present invention.
Detailed Description
A preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.
The invention is implemented on a programmable switch with a programmable data plane, such as one built on a Barefoot Tofino, Cavium XPliant, or Broadcom Trident 3 series chip. These chips share the following characteristics: (1) programmable parsing specific to application layer headers; (2) a flexible packet processing pipeline, typically consisting of 10-20 pipeline stages, each capable of performing a match lookup and one or more ALU operations; (3) general purpose memory on the order of 10 MB. All of these features reside on the data plane of the programmable switch, which means they are available while processing packets at full line rate. The skewed workload balancing method of the invention provides load balancing at the rack level, i.e., 32 to 256 storage servers connected through one programmable switch, but it does not provide fault tolerance guarantees inside the rack.
The invention uses the top-of-rack (ToR) programmable switch as the central point of the system. The ToR programmable switch sits on the path of every client request and every storage server response, so the abstraction of an in-network coherence directory can be realized inside it. Through this directory, the ToR programmable switch can track the location of every hot object (value) in the system and forward each request to a storage server with available capacity, even altering the number or placement of value copies by choosing where write requests are sent. On top of the in-network coherence directory, the invention designs a version-based coherence protocol that guarantees linearizability and handles value updates very efficiently. The invention therefore provides good load balancing even for write-intensive workloads.
As shown in FIG. 1, in this embodiment an in-network coherence directory is implemented in a rack-level storage system. The storage system includes a plurality of storage servers, all located in one rack. The storage servers comprise a primary storage server and replica storage servers: the primary storage server is the server where the original of a value resides, and a replica storage server is a server that stores a copy of the value after the value becomes a hot object. The system used in the invention includes a programmable switch and a controller for the programmable switch.
Each key corresponds to a value stored on a storage server. Clients issue requests; storage servers store data and respond to client requests. For a read request, the storage server finds the stored value according to the key in the request and returns it to the client; for a write request, the storage server stores the value carried in the request.
1. Programmable switch
Taking the ToR programmable switch as an example, the in-network coherence directory maintained by the invention is as follows: it stores a set of hotkeys, the highest object version number of each hotkey, and the storage server list rset where the value corresponding to each hotkey is located. To reduce programmable switch resource overhead and to support keys of arbitrary size, the in-network coherence directory stores keys in a fixed-size hash table.
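For illustration only, the following minimal Python sketch models such a directory; the names (CoherenceDirectory, DirectoryEntry, DIRECTORY_SLOTS) and the capacity are assumptions of this sketch rather than terms fixed by the patent.

# Minimal sketch of the in-network coherence directory. A hardware switch
# would hold this state in register arrays; a dict bounded by DIRECTORY_SLOTS
# stands in for the fixed-size hash table. All names are illustrative.
from dataclasses import dataclass, field

DIRECTORY_SLOTS = 4096  # assumed capacity of the fixed-size hash table

@dataclass
class DirectoryEntry:
    ver_completed: int = 0                  # highest completed object version number
    rset: set = field(default_factory=set)  # servers holding the current value

class CoherenceDirectory:
    def __init__(self):
        self.slots = {}  # keyhash -> DirectoryEntry

    def contains(self, keyhash: int) -> bool:
        return keyhash in self.slots

    def insert_hotkey(self, keyhash: int, ver_completed: int, primary: int) -> bool:
        if keyhash not in self.slots and len(self.slots) >= DIRECTORY_SLOTS:
            return False  # table full: the key simply stays non-hot
        self.slots[keyhash] = DirectoryEntry(ver_completed, {primary})
        return True

    def remove_hotkey(self, keyhash: int) -> None:
        self.slots.pop(keyhash, None)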
The invention defines an application layer header embedded in layer-4 (L4) packets, as shown in FIG. 2, and reserves a dedicated UDP port so the programmable switch can match the system's packets. A packet consists of the ETH, IP and UDP headers followed by the application layer header with its OP, KEYHASH, VER and SERVERID fields. The OP field may be READ, WRITE, READ-REPLY or WRITE-REPLY, where READ indicates a read request, WRITE a write request, READ-REPLY a reply to a read request, and WRITE-REPLY a reply to a write request; the KEYHASH field holds a fixed-size key hash generated in the programmable switch; the VER field holds the object version number assigned by the programmable switch for the workload; the SERVERID field holds the unique identifier of the storage server, filled in when it replies.
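To make the header concrete, the sketch below packs the OP, KEYHASH, VER and SERVERID fields that follow the UDP header; the reserved port number, the OP encodings, and the field widths (1, 4, 8 and 2 bytes) are illustrative assumptions, since the patent does not fix them.

import struct

APP_UDP_PORT = 38888  # assumed reserved UDP port used to match system packets
READ, WRITE, READ_REPLY, WRITE_REPLY = 0, 1, 2, 3  # assumed OP encodings

# Assumed layout: 1-byte OP, 4-byte KEYHASH, 8-byte VER, 2-byte SERVERID,
# all in network byte order.
HEADER_FMT = "!BIQH"

def pack_header(op: int, keyhash: int, ver: int, serverid: int = 0) -> bytes:
    return struct.pack(HEADER_FMT, op, keyhash, ver, serverid)

def unpack_header(payload: bytes) -> dict:
    op, keyhash, ver, serverid = struct.unpack_from(HEADER_FMT, payload)
    return {"op": op, "keyhash": keyhash, "ver": ver, "serverid": serverid}

# Example: a read request for a key whose switch-computed hash is 0x1A2B3C4D
pkt = pack_header(READ, 0x1A2B3C4D, 0)
assert unpack_header(pkt)["keyhash"] == 0x1A2B3C4D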
Message forwarding for traffic that does not belong to the system of the invention uses standard layer-2 (L2) or layer-3 (L3) routing, keeping the programmable switch fully compatible with existing network protocols.
2. Controller
The programmable switch controller of the invention decides which keys are hotkeys and is responsible for updating the in-network coherence directory with the hottest O(nlogn) keys, where n represents the number of all distinct keys. To this end, the invention designs a request statistics engine in the programmable switch that tracks the access rate of each key using the switch's data plane and CPU. The controller may run on the programmable switch CPU or on a remote storage server; it reads each key's access rate from the request statistics engine and takes the most frequently accessed keys as hotkeys. The controller keeps only soft state and can be replaced immediately on failure. The controller copies the values corresponding to hotkeys from the primary storage server to replica storage servers.
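As a sketch of the controller's selection step, assume the request statistics engine exposes per-key access counts; heapq.nlargest then yields the most frequently accessed keys, with the cutoff k standing in for the patent's O(nlogn) bound.

import heapq
from collections import Counter

def select_hotkeys(access_counts: Counter, k: int) -> list:
    """Return the key hashes of the k most frequently accessed keys."""
    return [kh for kh, _ in heapq.nlargest(k, access_counts.items(),
                                           key=lambda item: item[1])]

# Illustrative counts as the statistics engine might report them
counts = Counter({0xA1: 9000, 0xB2: 40, 0xC3: 8500, 0xD4: 7})
assert select_hotkeys(counts, 2) == [0xA1, 0xC3]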
3. Request processing
Request processing is divided into client request processing and storage server response processing; both client requests and storage server responses belong to the workload.
Processing algorithm of client request (algorithm one):
1: if pkt1.op = WRITE then
2:     pkt1.ver ← ver_next++;
3: end if;
4: if rkeys.contains(pkt1.keyhash) then
5:     if pkt1.op = READ then
6:         pkt1.dst ← select replica from rset[pkt1.keyhash];
7:     else if pkt1.op = WRITE then
8:         pkt1.dst ← select from all servers;
9:     end if;
10: end if;
11: Forward packet;
where pkt1 denotes a client request; pkt1.op denotes the OP field content of pkt1; pkt1.ver denotes the VER field content of pkt1, i.e., the object version number of the key; ver_next denotes the next object version number; pkt1.keyhash denotes the KEYHASH field content of pkt1; pkt1.dst denotes the target storage server to which the request is forwarded; rset[pkt1.keyhash] denotes the storage server list rset corresponding to the hotkey in pkt1; Forward packet denotes forwarding the request packet; rkeys denotes the in-network coherence directory maintained by the programmable switch.
For client request processing, lines 1 to 3 of algorithm one: the programmable switch assigns an object version number to each write request (WRITE), writing ver_next into the VER field of the request and then incrementing ver_next by 1. Lines 4 to 10 of algorithm one: the switch decides how to forward the request by matching the request's key hash against the in-network coherence directory. If the key is not a hotkey, the request is forwarded to its original destination, i.e., the key's primary storage server. For a read request on a hotkey, a storage server is selected from the hotkey's storage server list rset to serve the request. For a write request on a hotkey, one or more storage servers are selected from all storage servers according to the storage server selection policy.
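A runnable Python rendering of algorithm one, reusing the CoherenceDirectory sketch above; the uniform random choices stand in for the storage server selection policy of section 5, and every name here is an assumption of the sketch.

import random

READ, WRITE = 0, 1  # assumed OP encodings, as in the header sketch

def process_client_request(pkt: dict, rkeys, ver_state: dict, all_servers: list) -> dict:
    """Switch-side client request handling (algorithm one)."""
    if pkt["op"] == WRITE:
        pkt["ver"] = ver_state["ver_next"]  # lines 1-3: assign version, then increment
        ver_state["ver_next"] += 1
    if rkeys.contains(pkt["keyhash"]):      # lines 4-10: hotkey handling
        entry = rkeys.slots[pkt["keyhash"]]
        if pkt["op"] == READ:
            pkt["dst"] = random.choice(sorted(entry.rset))  # any replica in rset
        elif pkt["op"] == WRITE:
            pkt["dst"] = random.choice(all_servers)         # policy: any server
    # a non-hot key keeps its original destination: the key's primary server
    return pkt                              # line 11: forward the packet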
Each storage server maintains an object version number and a value for every key it stores. When processing a write request, the storage server compares the object version number in the application layer header's VER field with the key's locally stored object version number; only when the version in the VER field is higher does the server update the key's value and object version number.
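A server-side sketch of this version check; the store layout (keyhash mapped to a (version, value) pair) is an assumption of the sketch.

WRITE_REPLY = 3  # assumed OP encoding, as in the header sketch

def server_handle_write(store: dict, pkt: dict, my_id: int) -> dict:
    """Apply the write only if the request's object version number is higher
    than the locally stored one, then acknowledge with a WRITE-REPLY."""
    ver_local = store.get(pkt["keyhash"], (0, None))[0]
    if pkt["ver"] > ver_local:
        store[pkt["keyhash"]] = (pkt["ver"], pkt["value"])
    return {"op": WRITE_REPLY, "keyhash": pkt["keyhash"],
            "ver": pkt["ver"], "serverid": my_id}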
Processing algorithm of storage server response (algorithm two):
1: if rkeys.contains(pkt2.keyhash) then
2:     if pkt2.ver > ver_completed[pkt2.keyhash] then
3:         ver_completed[pkt2.keyhash] ← pkt2.ver;
4:         rset[pkt2.keyhash] ← set(pkt2.serverid);
5:     else if pkt2.ver = ver_completed[pkt2.keyhash] then
6:         rset[pkt2.keyhash].add(pkt2.serverid);
7:     end if;
8: end if;
9: Forward packet;
where pkt2 denotes a storage server response; pkt2.keyhash denotes the KEYHASH field content of pkt2; ver_completed denotes the highest object version number of a key stored in the programmable switch; pkt2.serverid denotes the SERVERID field content of pkt2, i.e., the identifier of the responding storage server.
Algorithm two, lines 1 to 7: for storage server response processing, when the programmable switch receives a READ-REPLY or WRITE-REPLY, it looks up the response's key in the in-network coherence directory. If the directory contains the key, the key is a hotkey, and the switch compares the object version number in the application layer header's VER field with the hotkey's latest object version number ver_completed stored in the switch:
if the hotkey in the response carries a higher object version number, the programmable switch updates ver_completed, resets the hotkey's storage server list rset, and then records the responding server in rset; if the two version numbers are equal, the switch directly adds the responding server to the hotkey's rset.
The combined effect of algorithms one and two is as follows: after a write request for a key is sent to one or more storage servers, the programmable switch records those servers as they complete and acknowledge the write, and sends all future read requests for the key only to those servers; this is sufficient to ensure linearizability.
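A matching Python rendering of algorithm two, again reusing the CoherenceDirectory sketch above and assuming pkt is a READ-REPLY or WRITE-REPLY.

def process_server_response(pkt: dict, rkeys) -> dict:
    """Switch-side response handling (algorithm two)."""
    if rkeys.contains(pkt["keyhash"]):
        entry = rkeys.slots[pkt["keyhash"]]
        if pkt["ver"] > entry.ver_completed:    # a newer version has completed
            entry.ver_completed = pkt["ver"]
            entry.rset = {pkt["serverid"]}      # reset rset to this server only
        elif pkt["ver"] == entry.ver_completed:
            entry.rset.add(pkt["serverid"])     # another up-to-date replica
    return pkt                                  # forward the response to the client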
4. Adding and deleting hotkeys
The hotness of keys changes constantly, so the programmable switch controller continuously monitors key access frequencies and keeps the in-network coherence directory updated with the hottest O(nlogn) keys. When a new key becomes a hotkey, a directory entry is created for it: the controller initializes the entry's object version number ver_completed from the key's version on its primary storage server and then adds the hotkey to the in-network coherence directory. Note that after a key becomes a hotkey, its value is not immediately moved or copied to other storage servers; instead, a subsequent write request for the hotkey is sent to a new storage server, which creates a copy of the hotkey's value.
When a key is no longer a hot key, the controller need only mark the key in the in-network coherence directory; the next write request for the key is sent to its primary storage server and the key is deleted from the intra-network coherence directory once the programmable switch receives a reply to the write request for the key.
The above scheme moves or copies the value of a key to a new storage server only on the key's next write request. While this simplifies the design, it does not work for read-only values or values that are modified infrequently. The invention solves this by performing a write operation that does not change the value whenever a key's value needs to be moved or copied. More precisely, the controller can force a copy or move of the key's value by issuing a virtual write command to the key's primary storage server, instructing the server to raise the key's stored object version number to the highest object version number ver_completed stored in the programmable switch and to forward the key's value to other storage servers; the responses of those servers then add them to the key's storage server list rset, after which they help serve reads.
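The forced copy can be sketched as below, with the per-server store layout assumed as in the earlier sketches; in the real protocol the replicas' replies travel through the switch, which adds them to the key's rset via algorithm two.

def force_copy_value(rkeys, stores: dict, keyhash: int, primary: int, targets: list) -> None:
    """Controller sketch of the virtual write: raise the primary's stored
    version of the key to ver_completed and push (version, value) copies to
    the target servers."""
    entry = rkeys.slots[keyhash]
    _, value = stores[primary][keyhash]
    stores[primary][keyhash] = (entry.ver_completed, value)  # value unchanged
    for sid in targets:
        stores[sid][keyhash] = (entry.ver_completed, value)  # replicate the value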
5. Storage server selection policy
The invention currently supports the following two storage server selection policies. The first policy selects a storage server uniformly at random and relies on statistics for load balancing. The second policy uses weighted polling: based on collected storage server load statistics, the controller assigns each storage server a weight and instructs the programmable switch to select servers with frequency proportional to their weights.
A read request is sent to only one storage server, while a write request may be sent to one or more storage servers. A larger storage server list rset provides more options for future read requests and improves load balancing, but it also increases the cost of write operations. For write-heavy workloads, the increased write cost can easily negate any load balancing benefit.
When choosing the number of value copies (the replication factor) for a hotkey, the programmable switch tracks how many times each hotkey is read over a time window and selects a replication factor proportional to that read count, which bounds the replication cost.
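Both selection policies and the read-proportional replication factor can be sketched as follows; the constants are assumptions of the sketch, not values fixed by the patent.

import random

def pick_server(servers: list, weights: list = None):
    """First policy: uniform random choice. Second policy: weighted polling,
    choosing a server with probability proportional to its assigned weight."""
    if weights is None:
        return random.choice(servers)
    return random.choices(servers, weights=weights, k=1)[0]

def replication_factor(read_count: int, reads_per_replica: int = 10000,
                       max_replicas: int = 8) -> int:
    """Number of value copies proportional to the hotkey's read count over
    the tracking window, capped to bound the write amplification."""
    return max(1, min(max_replicas, read_count // reads_per_replica + 1))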
6. Hash collision
The in-network coherence directory of the invention holds only hotkeys, not all keys. If a non-hot key's hash collides with a hotkey's hash, requests for the non-hot key may be sent to the wrong storage server. To address this, every storage server keeps track of all current hotkeys (kept up to date by the controller). With this information, a storage server can forward a misdirected request to the correct storage server. Since the number of hotkeys is small, collisions are rare and few requests are misdirected, so this request chaining has negligible impact on performance. In the rare case that two hotkeys collide, the invention replicates the value of only one of them to ensure correctness.
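A sketch of this request chaining; it assumes the request payload carries the full key (the header only carries its hash) and that primary_of maps a full key to its primary server's id.

def route_on_arrival(pkt: dict, hot_keys: set, my_id: int, primary_of) -> tuple:
    """If a request for a non-hot key arrives here because its hash collided
    with a hotkey's hash, chain it to the key's correct primary server."""
    if pkt["key"] not in hot_keys:
        owner = primary_of(pkt["key"])
        if owner != my_id:
            return ("forward", owner)  # misdirected by a hash collision
    return ("serve", my_id)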
7. Object version number
Object version numbers must increase monotonically. The invention uses a 64-bit object version number; even at the programmable switch's full line rate it would take more than 100 years to exhaust, so version number overflow is not a practical concern.
8. Garbage collection
When processing a write request for a hotkey, the invention does not explicitly invalidate or delete old versions. The in-network coherence directory ensures that all requests are forwarded to the latest version, so correctness is unaffected, but keeping outdated value copies forever wastes space on the storage servers. The invention addresses this with garbage collection: the controller informs the storage servers which keys are hotkeys and periodically reports the highest object version number of those keys. A storage server may then safely delete a key if its local version is stale, or if the key is no longer a hotkey and the server is not the key's primary storage server.
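The deletion rule can be sketched as below, assuming the controller supplies each server with the current hot-key set and the reported highest version numbers.

def gc_pass(store: dict, my_id: int, hot_keys: set,
            ver_completed: dict, primary_of) -> None:
    """Delete a key if the local copy is stale, or if the key is no longer
    hot and this server is not its primary. Container shapes are assumptions."""
    for key in list(store):
        ver_local = store[key][0]
        stale = key in hot_keys and ver_local < ver_completed.get(key, 0)
        demoted = key not in hot_keys and primary_of(key) != my_id
        if stale or demoted:
            del store[key]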
The invention provides a distributed storage skewed workload balancing method based on a programmable switch with the following characteristics: 1) it provides good load balancing for highly skewed, dynamic workloads; 2) it works together with fast in-memory storage systems; 3) it can handle objects of arbitrary size; 4) it guarantees linearizability; 5) it is equally effective for read-heavy, write-heavy, and mixed read-write workloads.
In addition, by using the in-network coherence directory, the invention guarantees data consistency, specifically linearizability, without introducing other performance losses, and achieves higher performance through in-memory storage.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution; the specification is written this way merely for clarity, and the embodiments may be combined appropriately to form other implementations that those skilled in the art will understand.

Claims (5)

1. A distributed storage skewed workload balancing method based on a programmable switch, characterized by being used for balancing the workload between clients and distributed storage servers, wherein a client sends a request containing an application layer message to a storage server, the request comprising a write request that stores the value corresponding to a key on the storage server and a read request that reads the value corresponding to a key from the storage server; the storage server returns a response containing an application layer message to the client; the workload comprises client requests and storage server responses, and the workload passes through the programmable switch; the storage servers comprise a primary storage server and a plurality of replica storage servers; and the application layer message carries the key and the object version number of the key;
the workload balancing method specifically comprises the following steps:
P1: the programmable switch counts the hottest O(nlogn) hotkeys k while clients interact with the storage servers, where n represents the number of all distinct keys and O(nlogn) is a complexity bound; the programmable switch stores the value v corresponding to each hotkey k on a replica storage server, and records the hotkey k, the highest object version number ver_completed of hotkey k, and the list rset of storage servers holding the value v;
P2: when a client sends a write request to a storage server:
the programmable switch assigns a monotonically increasing object version number to each write request; if the key in the write request has a record in the programmable switch, the key is a hotkey, and the programmable switch selects one or more storage servers from all storage servers to forward the write request to; if the key in the write request has no record in the programmable switch, the write request is forwarded to the primary storage server;
when a client sends a read request to a storage server:
if the key in the read request has a record in the programmable switch, the key is a hotkey, and the switch finds a storage server s holding the value of hotkey k and forwards the read request to it; if the key in the read request has no record in the programmable switch, the read request is forwarded to the primary storage server;
P3: the primary storage server records the object version numbers and values of all keys, and each replica storage server records the object version numbers and values of the hotkeys; when processing a write request, a storage server compares the object version number ver of the key in the write request with the key's locally stored object version number, and only when the version in the write request is higher does the storage server update the key's value and object version number;
P4: when any storage server s returns a response to the client and the key in the response has a record in the programmable switch, then: the object version number ver of the hotkey k in the response is compared with the highest object version number ver_completed of hotkey k recorded by the programmable switch; if ver > ver_completed, the value of ver_completed is updated to ver, the storage server list rset recorded by the programmable switch for hotkey k is emptied, and storage server s is then added to the list rset; if ver = ver_completed, storage server s is added directly to the list rset.
2. The programmable switch-based distributed storage skewed workload balancing method according to claim 1, wherein: the application layer header includes an OP field, a KEYHASH field, a VER field, and a SERVERID field; the OP field indicates the workload type; the KEYHASH field holds a hash value of the key generated in the programmable switch; the VER field holds the object version number ver assigned by the programmable switch for the workload; the SERVERID field holds the storage server identifier filled in by the storage server when responding; the contents of the OP field include READ, WRITE, READ-REPLY and WRITE-REPLY, where READ indicates a read request, WRITE a write request, READ-REPLY a reply to a read request, and WRITE-REPLY a reply to a write request.
3. The method of claim 2, wherein the programmable switch contains an in-network coherence directory, and the in-network coherence directory records through a hash table the hotkey k, the object version number of hotkey k, and the list rset of storage servers holding the value v.
4. The programmable switch-based distributed storage skewed workload balancing method according to claim 1, wherein when a key k loses its heat, i.e., key k is no longer among the hottest O(nlogn) hotkeys, the programmable switch marks key k, and upon receiving a response containing key k, the programmable switch deletes its record of key k.
5. The programmable switch-based distributed storage skewed workload balancing method according to claim 1, wherein when the value of a hotkey k needs to be copied from the primary storage server to a replica storage server, the programmable switch issues a virtual write command that raises the object version number of hotkey k to the highest object version number ver_completed of hotkey k recorded by the programmable switch, and sends the hotkey k, its corresponding value, and its object version number to the replica storage server for storage.
CN202310170363.9A 2023-02-27 2023-02-27 Distributed storage skewed workload balancing method based on programmable switch Active CN115858181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310170363.9A CN115858181B (en) Distributed storage skewed workload balancing method based on programmable switch

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310170363.9A CN115858181B (en) Distributed storage skewed workload balancing method based on programmable switch

Publications (2)

Publication Number Publication Date
CN115858181A CN115858181A (en) 2023-03-28
CN115858181B true CN115858181B (en) 2023-06-06

Family

ID=85659136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310170363.9A Active CN115858181B (en) Distributed storage skewed workload balancing method based on programmable switch

Country Status (1)

Country Link
CN (1) CN115858181B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117614956B (en) * 2024-01-24 2024-03-29 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Intra-network caching method and system for distributed storage and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207841A (en) * 2013-03-06 2013-07-17 青岛海信传媒网络技术有限公司 Method and device for data reading and writing on basis of key-value buffer
CN107948233A (en) * 2016-10-13 2018-04-20 华为技术有限公司 The method of processing write requests or read request, interchanger, control node
CN113315744A (en) * 2020-07-21 2021-08-27 阿里巴巴集团控股有限公司 Programmable switch, flow statistic method, defense method and message processing method
CN114844846A (en) * 2022-04-14 2022-08-02 南京大学 Multi-level cache distributed key value storage system based on programmable switch
CN115113893A (en) * 2021-03-22 2022-09-27 腾讯科技(深圳)有限公司 Data processing method and device, storage medium and computer equipment
CN115277145A (en) * 2022-07-20 2022-11-01 北京志凌海纳科技有限公司 Distributed storage access authorization management method, system, device and readable medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050027862A1 (en) * 2003-07-18 2005-02-03 Nguyen Tien Le System and methods of cooperatively load-balancing clustered servers
US8869157B2 (en) * 2012-06-21 2014-10-21 Breakingpoint Systems, Inc. Systems and methods for distributing tasks and/or processing recources in a system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207841A (en) * 2013-03-06 2013-07-17 青岛海信传媒网络技术有限公司 Method and device for data reading and writing on basis of key-value buffer
CN107948233A (en) * 2016-10-13 2018-04-20 华为技术有限公司 The method of processing write requests or read request, interchanger, control node
CN113315744A (en) * 2020-07-21 2021-08-27 阿里巴巴集团控股有限公司 Programmable switch, flow statistic method, defense method and message processing method
CN115113893A (en) * 2021-03-22 2022-09-27 腾讯科技(深圳)有限公司 Data processing method and device, storage medium and computer equipment
CN114844846A (en) * 2022-04-14 2022-08-02 南京大学 Multi-level cache distributed key value storage system based on programmable switch
CN115277145A (en) * 2022-07-20 2022-11-01 北京志凌海纳科技有限公司 Distributed storage access authorization management method, system, device and readable medium

Also Published As

Publication number Publication date
CN115858181A (en) 2023-03-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant