CN117614956B - Intra-network caching method and system for distributed storage and storage medium

Intra-network caching method and system for distributed storage and storage medium

Publication number
CN117614956B
CN117614956B
Authority
CN
China
Prior art keywords
node · programmable switch · read request
Legal status: Active
Application number
CN202410096138.XA
Other languages
Chinese (zh)
Other versions
CN117614956A (en)
Inventor
谭小彬
李尚蔚
吕礼童
袁莘智
王伟锋
郑烇
杨坚
Current Assignee
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Original Assignee
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority date: 2024-01-24
Filing date: 2024-01-24
Application filed by Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority to CN202410096138.XA
Publication of CN117614956A: 2024-02-27
Application granted; publication of CN117614956B: 2024-03-29

Classifications

    • H04L 67/1004 Protocols in which an application is distributed across nodes in the network, for accessing one among a plurality of replicated servers: server selection for load balancing
    • H04L 49/90 Packet switching elements: buffering arrangements
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H04L 67/568 Provisioning of proxy services: storing data temporarily at an intermediate stage, e.g. caching

Abstract

The invention discloses an in-network caching method, system and storage medium for distributed storage, comprising the following steps. S1: a read request sent by an initial computing node of the front-end network is forwarded to a storage node of the back-end network through a programmable switch, and the computing node obtains the read reply fed back by the programmable switch and caches the key. S2: the current computing node sends a read request to the programmable switch, and the programmable switch reroutes the read request based on load balancing. S3: based on a leaf-spine network architecture, all computing nodes map keys to different spine switches by consistent hashing, and each spine switch, together with all leaf switches connected to it and the associated computing and storage nodes, forms a mutually disjoint subnet. The method, system and storage medium reduce load imbalance across the storage nodes and thereby improve the overall throughput of the in-network cache.

Description

Intra-network caching method and system for distributed storage and storage medium
Technical Field
The present invention relates to the field of data storage technologies, and in particular, to an in-network caching method and system for distributed storage, and a storage medium.
Background
A distributed storage system is a storage system in which data is distributed across a plurality of physical locations; it typically consists of multiple servers, computers or nodes and is divided into a front-end network and a back-end network, as shown in FIG. 2. This design is intended to ensure data persistence, reliability and scalability. Application scenarios for distributed systems keep multiplying, and these scenarios place ever higher demands on the performance and functionality of distributed systems.
The front-end network (Frontend Networking) is the network formed by the computing nodes; its design mainly takes the following into consideration:
a. client-to-storage node communication: the front-end network focuses on the communication between clients and the distributed storage system. This generally relates to how data can be accessed and transmitted efficiently, quickly, and securely.
b. Data chunking and dispersion: to make the data more secure and reliable, the data is typically cut into blocks, which may be encrypted and replicated or encoded on multiple nodes.
c. Load balancing: when multiple clients attempt to access or write data, load balancing techniques spread the load evenly across the storage nodes, thereby improving the overall performance of the system.
d. Data consistency and caching: in view of accessing data from multiple places, the front-end network must ensure consistency of the data. Furthermore, caching techniques may be used to speed up access to data.
The backend network (Backend Networking) is the network formed by the storage nodes; its design needs to consider:
a. communication between nodes: backend networks focus on communication between storage nodes. This is to ensure redundancy, backup and consistency of the data.
b. Data replication and redundancy: to increase the reliability and availability of data, data is typically replicated across multiple nodes. When a node fails, copies on other nodes may be used to recover the data.
c. Fault detection and recovery: the backend network needs to have the ability to detect failures of storage nodes and automatically restore or redistribute data to ensure continued availability of the system.
d. Data consistency and synchronization: when updating and accessing data on multiple nodes, the system must ensure that all copies of the data are consistent. This typically requires complex protocols and algorithms to guarantee.
e. Scalability: as the amount of data grows, the distributed storage system should be able to easily add more storage nodes and capacity.
Distributed storage systems are designed to distribute load, improve data throughput, and ensure low latency access. However, load imbalance may occur in practical applications due to various factors, thereby affecting achievement of these objectives.
1. Different popularity of data:
some data may be more popular than others, resulting in a large number of clients requesting the same block of data at the same time. This may not only cause overload of certain storage nodes, but may also result in increased latency in accessing these "hot spot" data, thereby reducing overall throughput.
2. Uneven data distribution:
data may accumulate too much on some nodes, resulting in their workload being heavier than others. Such uneven distribution can pose a threat to overall high throughput and low latency targets.
3. Node resource inconsistency:
resource inconsistencies between nodes may cause certain nodes to become performance bottlenecks. For example, a node with limited storage or computing capabilities may not provide the same throughput or response speed as a high-end node.
In addition, with the development of network infrastructure and the emergence of new network applications, a large number of new network scenarios, such as data center networks and high-performance computing networks, have arisen, along with corresponding new network technologies. These scenarios demand higher switching speeds and more flexible network forwarding. Traditional commercial network switches are closed black boxes and are not programmable: the protocols they support, their table space and their forwarding logic are fixed when the device ships, and lag behind the rapid development of network technology. When novel protocols, tunnel encapsulations or forwarding logic must be deployed flexibly in the network, such switches cannot support them. Software switches can define forwarding logic flexibly and deploy new protocols, but their speed is far below that of traditional hardware switches and cannot meet the requirements of the new scenarios.
Disclosure of Invention
Based on the technical problems in the background art, the invention provides an in-network caching method and system for distributed storage and a storage medium, and the overall throughput of the in-network caching is improved.
The invention provides an in-network caching method for distributed storage, which comprises the following steps:
s1: the method comprises the steps that a read request sent by an initial computing node of a front-end network is forwarded to a storage node of a back-end network through a programmable switch, the initial computing node obtains a read reply fed back by the programmable switch and stores a cache key, and the read reply is data sent to the programmable switch by the storage node according to the obtained read request;
s2: the current computing node sends a read request to a programmable switch, and the programmable switch reroutes the read request based on load balancing;
s21: the current computing node sends a read request to a programmable switch, the programmable switch reads a key in the head of a read request message, and a cache node list is found according to the key; the cache node list is a collection of computing nodes for caching the key;
s22: judging whether the cache node list is empty, if so, entering a step S23, and if not, entering a step S24;
s23: the programmable switch sends the read request to a unique storage node in the back-end network, and the selection of the unique storage node is determined based on consistent hash;
s24: the programmable switch takes the unique storage node together with the cache node list as a candidate list, sends read requests to the candidate list based on historical data, obtains the read replies fed back by the candidates, calculates the delay of each node in the candidate list for a read request, sorts the candidates by delay in ascending order, and forwards the read request sent by the current computing node to the node ranked first, i.e. the one with the smallest delay;
s25: the current computing node acquires a key and a destination node in the reading reply, and adds the destination node into a cache node list of the key, wherein the destination node is a node for feeding back the current reading reply in a candidate list;
S3: based on a leaf-spine network architecture, all computing nodes map keys to different spine switches by consistent hashing, and each spine switch, together with all leaf switches connected to it and the associated computing and storage nodes, forms a mutually disjoint subnet.
Further, in step S2, while the current computing node sends read requests to the programmable switch, other computing nodes may interleave write requests to the programmable switch; the processing after the programmable switch receives a write request is as follows:
after receiving the write request, the programmable switch reads the key in the write request message, removes all computing nodes from the cache node list corresponding to the key, and forwards the write request to the unique storage node in the back-end network;
read requests arriving at the programmable switch later than the write request are rerouted according to the new cache node list, so that sequential consistency is achieved.
Further, the computing nodes store cache keys in a hash table rather than a tree structure.
Further, in step S24, the time delay of different nodes in the candidate list for the read request is calculated, specifically:
when receiving a read request, the programmable switch adds a current time stamp into a read request message as a first time stamp;
when receiving the read reply, the programmable switch records the current timestamp corresponding to the read reply message as a second timestamp;
and the time delay of the read request is obtained as the difference between the second timestamp and the first timestamp.
The invention also provides an in-network caching system for distributed storage, which comprises a node caching module, a cache tracking module and a multi-machine expansion module; the cache tracking module comprises a request correspondence module, a judging module, a first request sending module and a second request sending module;
the node caching module is used for forwarding a read request sent by an initial computing node of the front-end network to a storage node of the back-end network through the programmable switch; the initial computing node obtains the read reply fed back by the programmable switch and caches the key, the read reply being the data sent to the programmable switch by the storage node according to the received read request;
The cache tracking module is used for sending a read request to the programmable switch by the current computing node, and the programmable switch reroutes the read request based on load balancing;
the multi-machine expansion module is used for mapping keys to different spine switches based on a leaf-spine network architecture, each spine switch, together with all leaf switches connected to it and the associated computing and storage nodes, forming mutually disjoint subnets;
the request correspondence module is used for the current computing node to send a read request to the programmable switch; the programmable switch reads the key in the read request message header and finds the cache node list according to the key, the cache node list being the set of computing nodes caching that key;
the judging module is used for judging whether the cache node list is empty, if so, entering the first request sending module, and if not, entering the second request sending module;
the first request sending module is used for sending a read request to a unique storage node in the back-end network by the programmable switch, and the selection of the unique storage node is determined based on consistent hash;
the second request sending module is used for the programmable switch to take the unique storage node together with the cache node list as a candidate list, send read requests to the candidate list based on historical data, obtain the read replies fed back by the candidates, calculate the delay of each node in the candidate list for a read request, sort the candidates by delay in ascending order, and forward the read request sent by the current computing node to the node ranked first, i.e. the one with the smallest delay.
Further, in the cache tracking module, while the current computing node sends read requests to the programmable switch, other computing nodes may interleave write requests to the programmable switch; the processing after the programmable switch receives a write request is as follows:
after receiving the write request, the programmable switch reads the key in the write request message, removes all computing nodes from the cache node list corresponding to the key, and forwards the write request to the unique storage node in the back-end network;
read requests arriving at the programmable switch later than the write request are rerouted according to the new cache node list, so that sequential consistency is achieved.
A computer readable storage medium having stored thereon a program which is invoked by a processor to perform the in-network caching method described above.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
The in-network caching method, system and storage medium for distributed storage described above have the following beneficial effects. By using the clients on the computing nodes as cache nodes, a negative-feedback regulation loop is formed, so no extra cache-communication overhead is introduced and the bandwidth demand on the back-end network is reduced. Load balancing resolves the delay increase caused by differing data popularity in existing distributed storage; load balancing together with multi-machine expansion resolves the uneven data distribution and inconsistent node resources caused by data accumulating on certain nodes; and the negative-feedback mechanism of adding cache nodes further alleviates the delay increase caused by differing data popularity. Together these reduce load imbalance across the storage nodes and improve the overall throughput of the in-network cache.
Drawings
FIG. 1 is a schematic structural flow diagram of the present invention;
FIG. 2 is a schematic diagram of the structure of the front-end network and the back-end network;
FIG. 3 is a schematic structural flow diagram of the in-network cache management architecture;
FIG. 4 is a schematic structural flow diagram of node caching;
FIG. 5 is a schematic structural flow diagram of cache tracking and write-request handling;
FIG. 6 is a schematic structural flow diagram of the in-network measurement of read-task delay in step S24;
FIG. 7 is a schematic structural flow diagram of the programmable switch forwarding a read request to the minimum-delay node in step S24;
FIG. 8 is a schematic structural flow diagram of the multi-machine expansion in step S3.
Detailed Description
In the following detailed description of the invention, numerous specific details are set forth in order to provide a thorough understanding of the invention. The invention may, however, be embodied in many forms other than those described herein, and those skilled in the art may make similar modifications without departing from its spirit; the invention is therefore not limited to the specific embodiments disclosed below.
As shown in FIGS. 1 to 8, in this embodiment the front-end network has 32 computing nodes, the back-end network has 16 storage nodes, and all the storage nodes are connected to 4 programmable switches. Read and write requests are initiated by the computing nodes and forwarded by the programmable switches, which improves the overall throughput of the in-network cache.
As shown in fig. 1 to 8, the method for caching in a distributed storage network according to the present invention includes the following steps S1 to S3:
s1: node caching: the method comprises the steps that a read request sent by an initial computing node of a front-end network is forwarded to a storage node of a back-end network through a programmable switch, the initial computing node obtains a read reply fed back by the programmable switch and stores a cache key, and the read reply is data sent to the programmable switch by the storage node according to the obtained read request;
Caching keys at the computing nodes resolves the delay increase that differing data popularity causes in conventional systems, as well as the low throughput and high delay caused by uneven data distribution.
The programmable switch implements a programmable data plane: it abstracts the switch's forwarding pipeline and supports user-defined parsing and forwarding logic, retaining forwarding flexibility while achieving hardware forwarding speed. The advent of programmable switches has reduced network complexity.
Node caching is performed only at the computing nodes of the front-end network: a computing node caches the data after sending a read request and receiving the reply. Because the cache does not require long-term storage, the computing node can use a hash table instead of a tree structure to store cache keys, reducing lookup time and memory consumption. Subsequently, when the computing node receives a read request for a cached key, it sends the reply to the requesting node.
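A minimal sketch of this node-side cache, in Python purely for illustration (the patent does not prescribe an implementation; the names below are hypothetical). A plain dict serves as the hash table:

    class NodeCache:
        """Compute-node cache: a hash table (dict) rather than a tree, since
        the cache needs no long-term storage or ordered traversal."""

        def __init__(self):
            self._table = {}  # key -> cached value, O(1) average lookup

        def store_reply(self, key, value):
            # Called after this node sent a read request and received the reply.
            self._table[key] = value

        def lookup(self, key):
            # Called when a read request for a cached key is rerouted here;
            # returns None on a miss.
            return self._table.get(key)

        def invalidate(self, key):
            # Called when a write to the key makes the cached value stale.
            self._table.pop(key, None)

A hash table gives constant-time average lookup with no rebalancing, matching the stated goal of reducing read time and memory consumption relative to a tree.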
S2: cache tracking and load balancing: the current computing node sends a read request to a programmable switch, and the programmable switch reroutes the read request based on load balancing;
tasks in the distributed storage system can be divided into a read task and a write task, and traffic can be divided into a read request, a read reply, a write request and a write reply. The read reply is sent from the back-end network or the cached computing node to the front-end network. The programmable switch maintains a mapping from keys to a list that caches the keys.
Before any computing node holds a cache, whenever a computing node sends a read request to the programmable switch, the switch simply forwards the request to a storage node and returns the read reply. Once computing nodes begin to cache, however, when the next computing node sends a read request, the programmable switch must consider both the computing-node caches and the storage node that can answer the request, and it reroutes the read request based on load balancing. For example, as shown in FIG. 3, a client program on a computing node sends a read request to the programmable switch and the request is processed by the system; the process is shown in FIG. 4. Computing node H1 sends an access for Key1 to the programmable switch; the switch determines from its table entries that only the back-end network holds Key1, so it forwards the request to storage node H4. Computing node H2 then sends an access for Key2; the switch's table entry indicates that H1 also holds Key2, so the switch forwards the request to computing node H1 according to the load-balancing algorithm, and H2 obtains the read reply fed back by that computing node.
This embodiment's load-balancing algorithm avoids the node performance bottlenecks caused by resource inconsistency between nodes. The algorithm is specifically as follows:
s21: the current computing node sends a read request to a programmable switch, the programmable switch reads a key in the head of a read request message, and a cache node list is found according to the key; the cache node list is a collection of computing nodes for caching the key;
s22: judging whether the cache node list is empty, if so, entering a step S23, and if not, entering a step S24;
s23: the programmable switch sends the read request to a unique storage node in the back-end network, and the selection of the unique storage node is determined based on consistent hash;
s24: the programmable switch takes the unique storage node together with the cache node list as a candidate list, sends read requests to the candidate list based on historical data, obtains the read replies fed back by the candidates, calculates the delay of each node in the candidate list for a read request, sorts the candidates by delay in ascending order, and forwards the read request sent by the current computing node to the node ranked first, i.e. the one with the smallest delay;
as shown in fig. 6 and 7, the programmable switch realizes the in-network measurement of the reading task time delay through an in-network telemetry method, and judges the load condition of the node according to the in-network result. The programmable switch maintains latency measurements of one read request for all nodes (including compute nodes and storage nodes), respectively. When receiving a read request, adding a current time stamp into a read request message by the switch as a first time stamp; when receiving the read reply, the programmable switch records the current time stamp corresponding to the read reply message as a second time stamp, and based on the difference value between the second time stamp and the first time stamp, the obtained time length is the time delay of the node processing the read request, and the node corresponding to the minimum time delay in all the node time delays is selected to receive the read request sent by the computing node.
The programmable switch thus performs load balancing on the delays obtained from the timestamp differences, selecting the node with the smallest processing delay for transmission. This is illustrated in FIGS. 6 and 7: after the switch receives the read request, it finds that both H1 and H2 can answer it, and selects node H1, which has the smaller processing delay, for forwarding; a code sketch of this measurement follows step S25 below.
S25: the current computing node acquires a key and a destination node in the read reply, and adds the destination node into a cache node list of the key, wherein the destination node is a node for feeding back the current read reply in the candidate list.
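As referenced above, the two-timestamp measurement can be sketched as follows. This is a simplified host-language illustration, not the patent's implementation: a real programmable data plane would keep these values in switch registers, and request_id is a hypothetical handle for matching a reply to its request.

    import time

    class LatencyTracker:
        """Per-node read-request delay: second timestamp minus first timestamp."""

        def __init__(self):
            self._pending = {}  # request_id -> first timestamp (stamped on the request)
            self.delay = {}     # node -> most recently measured delay, in seconds

        def on_read_request(self, request_id):
            # First timestamp: taken when the read request passes the switch.
            self._pending[request_id] = time.monotonic()

        def on_read_reply(self, request_id, node):
            # Second timestamp: taken when the matching read reply arrives;
            # the difference is the node's processing delay for this request.
            first = self._pending.pop(request_id, None)
            if first is not None:
                self.delay[node] = time.monotonic() - first

        def best_node(self, candidates):
            # The candidate with the smallest measured delay ranks first;
            # nodes without a measurement rank last.
            return min(candidates, key=lambda n: self.delay.get(n, float("inf")))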
Through steps S21 to S25, the programmable switch reroutes read requests from other computing nodes based on its up-to-date view of which computing nodes hold caches; that is, the switch updates its table entries from read replies and forwards read requests accordingly. As shown in FIG. 5, the in-network cache tracking process runs as follows: (1) the client on computing node H1 issues a read request to the programmable switch; (2) by parsing the key, the switch finds storage node H3 in the back-end network and forwards the request to it; (3) storage node H3 sends the read reply to the programmable switch; (4) on receiving the read reply, the switch adds its destination address H1 to the cache directory for K1, and (5) forwards the reply to computing node H1. (7) When computing node H2 later issues a read request, (8) the programmable switch redirects it to computing node H1 according to the previously added entry.
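Putting steps S21 to S25 together, the switch's forwarding decision might look like the following sketch, under the same illustrative assumptions (all names hypothetical; LatencyTracker is the sketch above, the consistent-hash ring is deliberately simplified, and cache_lists stands in for the switch's key-to-cache-node-list table):

    import hashlib

    def consistent_hash(key, nodes):
        """Simplified consistent-hash ring: a key maps to the first node at or
        after its own position on the ring."""
        ring = sorted(nodes, key=lambda n: hashlib.md5(n.encode()).hexdigest())
        key_pos = hashlib.md5(key.encode()).hexdigest()
        for node in ring:
            if hashlib.md5(node.encode()).hexdigest() >= key_pos:
                return node
        return ring[0]  # wrap around the ring

    def route_read_request(key, cache_lists, storage_nodes, tracker):
        home = consistent_hash(key, storage_nodes)    # S23: the key's unique storage node
        cached_at = cache_lists.get(key, [])          # S21: cache node list for the key
        if not cached_at:                             # S22: empty list -> back end only
            return home
        return tracker.best_node(cached_at + [home])  # S24: smallest-delay candidate

    def on_read_reply_seen(key, reply_destination, cache_lists):
        # S25: the reply's destination node now caches the key, so it is added
        # to the key's cache node list for future rerouting.
        nodes = cache_lists.setdefault(key, [])
        if reply_destination not in nodes:
            nodes.append(reply_destination)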
In addition, in step S2, while the current computing node sends read requests to the programmable switch, other computing nodes may interleave write requests; the switch processes a received write request as follows:
when a compute node initiates a write request to a key, meaning that the value corresponding to the key will be modified, the value in the cache will be different from the written value. In order to maintain the data consistency of the system, when the programmable switch receives a write request, the programmable switch reads a key in a write request message, eliminates all computing nodes in a cache node list corresponding to the key, and forwards the write request to a unique storage node in a back-end network. Because the write request to any one key is forwarded to a unique node and the read request arriving at the switch later than the request is rerouted according to the new cache node list, order consistency is achieved. Resetting the cache node list does not affect performance when the write task is less frequent than the read task.
The cache-list reset is illustrated in FIG. 5: (9) computing node H2 sends a write request to the programmable switch; (10) the switch resets the forwarding entry to contain only storage node H3; (11) the switch forwards the write request to H3.
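Under the same illustrative assumptions as the routing sketch above (reusing the hypothetical consistent_hash and cache_lists), the write path reduces to a list reset plus a forward:

    def route_write_request(key, cache_lists, storage_nodes):
        # Reset: clear the key's cache node list, so every read request that
        # arrives after this write is rerouted to the back end (sequential
        # consistency, as described above).
        cache_lists[key] = []
        # Every write for a given key is forwarded to its unique storage node.
        return consistent_hash(key, storage_nodes)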
S3: multi-machine expansion: based on a leaf-spine network architecture, all computing nodes map keys to different spine switches by consistent hashing, and each spine switch, together with all leaf switches connected to it and the associated computing and storage nodes, forms a mutually disjoint subnet;
in step S3, all keys are partitioned among the different spine switches; each spine switch, together with all of its connected leaf switches and nodes, forms a mutually disjoint subnet, and the nodes map keys to spine switches by consistent hashing. As shown in FIG. 8, when a computing node requests Key1 or Key2 it sends the request to programmable switch S1, and when it requests Key11 or Key12 it sends the request to programmable switch S2.
Step S3 builds on the spine-leaf topology (leaf-spine network architecture) of existing data center networks and achieves multi-machine expansion through subnet partitioning. In this embodiment's multi-machine expansion scheme, the cache tracking program of step S2 runs only on the spine switches; each computing node forwards read requests to the appropriate spine switch according to consistent hashing, so every spine switch sees the same situation as a single switch, and the key sets handled by different spine switches are mutually disjoint subsets of the full key set. This multi-machine expansion preserves sequential consistency in the multi-switch setting.
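On the node side, mapping keys to spine switches is then one more application of the same consistent hash; a sketch under the assumptions above, with switch names taken from the FIG. 8 example:

    def spine_for_key(key, spine_switches):
        # Every computing node runs the same deterministic mapping, so all
        # traffic for a key enters exactly one spine switch's subnet, and the
        # key sets of different spine switches remain mutually disjoint.
        return consistent_hash(key, spine_switches)

    # For example, spine_for_key("Key1", ["S1", "S2"]) always returns the same
    # switch, so all requests for Key1 stay within that switch's subnet.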
Following steps S1 to S3, in-network cache tracking is implemented on the data plane of the programmable switch; exploiting the switch's line-rate processing, it requires no extra hardware and generates no extra traffic. In addition, in-network cache tracking uses the idle bandwidth and processing capacity of the computing nodes to cache storage-node content, raising system capacity in terms of both end-system processing power and network bandwidth, while the negative-feedback mechanism of adding cache nodes reduces load imbalance among the storage nodes. Through subnet partitioning, the in-network cache also achieves sequentially consistent multi-switch expansion, fits the spine-leaf architecture widely used in data centers, and can be deployed in existing large-scale network systems. Current network scenarios such as cloud computing and distributed machine learning rely heavily on distributed storage, so deploying an in-network cache tracking system in existing networks can raise total system throughput; the system can serve broadly in such scenarios as a basic building block.
At the same time, the method resolves, through load balancing, the delay increase caused by differing data popularity in existing distributed storage, and resolves, through load balancing plus multi-machine expansion, the uneven data distribution and inconsistent node resources caused by data accumulating on certain nodes, thereby improving the overall throughput of the in-network cache.
In addition, existing distributed storage suffers from: (a1) non-optimized workload patterns: the access patterns of certain applications can work against high throughput and low latency; for example, heavy random reads and writes or large-scale bulk data operations may cause unnecessary network congestion or storage delays; (a2) the challenge of achieving high throughput with low latency: load imbalance directly affects system throughput and latency, and meeting both goals requires not only evenly distributing data and requests but also ensuring that each node can fully utilize its resources. This embodiment performs node caching and cache tracking through the in-network caching method for distributed storage and, combined with load balancing and multi-machine expansion, addresses computing-node caching and storage-node load imbalance.
The foregoing is only a preferred embodiment of the invention, but the scope of the invention is not limited thereto; any equivalent substitution or modification that a person skilled in the art could make according to the technical scheme of the invention and its inventive concept, within the scope disclosed by the invention, shall be covered by the protection scope of the invention.

Claims (7)

1. An in-network caching method for distributed storage is characterized by comprising the following steps:
s1: the method comprises the steps that a read request sent by an initial computing node of a front-end network is forwarded to a storage node of a back-end network through a programmable switch, the initial computing node obtains a read reply fed back by the programmable switch and stores a cache key, and the read reply is data sent to the programmable switch by the storage node according to the obtained read request;
s2: the current computing node sends a read request to a programmable switch, and the programmable switch reroutes the read request based on load balancing;
s21: the current computing node sends a read request to a programmable switch, the programmable switch reads a key in the head of a read request message, and a cache node list is found according to the key; the cache node list is a collection of computing nodes for caching the key;
s22: judging whether the cache node list is empty, if so, entering a step S23, and if not, entering a step S24;
s23: the programmable switch sends the read request to a unique storage node in the back-end network, and the selection of the unique storage node is determined based on consistent hash;
s24: the programmable switch takes the unique storage node together with the cache node list as a candidate list, sends read requests to the candidate list based on historical data, obtains the read replies fed back by the candidates, calculates the delay of each node in the candidate list for a read request, sorts the candidates by delay in ascending order, and forwards the read request sent by the current computing node to the node ranked first, i.e. the one with the smallest delay;
s25: the current computing node acquires a key and a destination node in the reading reply, and adds the destination node into a cache node list of the key, wherein the destination node is a node for feeding back the current reading reply in a candidate list;
s3: based on a leaf-spine network architecture, all computing nodes map keys to different spine switches by consistent hashing, and each spine switch, together with all leaf switches connected to it and the associated computing and storage nodes, forms a mutually disjoint subnet.
2. The in-network caching method for distributed storage according to claim 1, wherein in step S2, while the current computing node sends read requests to the programmable switch, other computing nodes interleave write requests to the programmable switch, and the processing after the programmable switch receives a write request is as follows:
after receiving the write request, the programmable switch reads the key in the write request message, removes all computing nodes from the cache node list corresponding to the key, and forwards the write request to the unique storage node in the back-end network;
read requests arriving at the programmable switch later than the write request are rerouted according to the new cache node list, so that sequential consistency is achieved.
3. The in-network caching method for distributed storage of claim 1, wherein the computing nodes store cache keys in hash tables rather than tree structures.
4. The method of claim 1, wherein in step S24, the time delay of the different nodes in the candidate list for the read request is calculated, specifically:
when receiving a read request, the programmable switch adds a current time stamp into a read request message as a first time stamp;
when receiving the read reply, the programmable switch records the current timestamp corresponding to the read reply message as a second timestamp;
and the time delay of the read request is obtained as the difference between the second timestamp and the first timestamp.
5. An in-network caching system for distributed storage, comprising a node caching module, a cache tracking module and a multi-machine expansion module, wherein the cache tracking module comprises a request correspondence module, a judging module, a first request sending module and a second request sending module;
the node caching module is used for forwarding a read request sent by an initial computing node of the front-end network to a storage node of the back-end network through the programmable switch; the initial computing node obtains the read reply fed back by the programmable switch and caches the key, the read reply being the data sent to the programmable switch by the storage node according to the received read request;
The cache tracking module is used for sending a read request to the programmable switch by the current computing node, and the programmable switch reroutes the read request based on load balancing;
the multi-machine expansion module is used for mapping keys to different spine switches based on a leaf-spine network architecture, each spine switch, together with all leaf switches connected to it and the associated computing and storage nodes, forming mutually disjoint subnets;
the request correspondence module is used for the current computing node to send a read request to the programmable switch; the programmable switch reads the key in the read request message header and finds the cache node list according to the key, the cache node list being the set of computing nodes caching that key;
the judging module is used for judging whether the cache node list is empty, if so, entering the first request sending module, and if not, entering the second request sending module;
the first request sending module is used for sending a read request to a unique storage node in the back-end network by the programmable switch, and the selection of the unique storage node is determined based on consistent hash;
the second request sending module is used for the programmable switch to take the unique storage node together with the cache node list as a candidate list, send read requests to the candidate list based on historical data, obtain the read replies fed back by the candidates, calculate the delay of each node in the candidate list for a read request, sort the candidates by delay in ascending order, and forward the read request sent by the current computing node to the node ranked first, i.e. the one with the smallest delay.
6. The system of claim 5, wherein in the cache tracking module, while the current computing node sends read requests to the programmable switch, other computing nodes interleave write requests to the programmable switch, and the processing after the programmable switch receives a write request is as follows:
after receiving the write request, the programmable switch reads the key in the write request message, removes all computing nodes from the cache node list corresponding to the key, and forwards the write request to the unique storage node in the back-end network;
read requests arriving at the programmable switch later than the write request are rerouted according to the new cache node list, so that sequential consistency is achieved.
7. A computer readable storage medium having stored thereon a program to be invoked by a processor to perform the in-network caching method according to any one of claims 1 to 4.
CN202410096138.XA (priority and filing date 2024-01-24) — Intra-network caching method and system for distributed storage and storage medium — Active — CN117614956B (en)

Priority Applications (1)

Application Number: CN202410096138.XA · Priority Date: 2024-01-24 · Filing Date: 2024-01-24 · Title: Intra-network caching method and system for distributed storage and storage medium · Grant: CN117614956B (en)


Publications (2)

Publication Number · Publication Date
CN117614956A (en) — 2024-02-27
CN117614956B — 2024-03-29

Family

ID=89952089

Family Applications (1)

Application Number: CN202410096138.XA · Title: Intra-network caching method and system for distributed storage and storage medium · Priority Date: 2024-01-24 · Filing Date: 2024-01-24 · Status: Active

Country Status (1)

Country: CN — CN117614956B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9769280B2 (en) * 2014-06-10 2017-09-19 International Business Machines Corporation Cooperative decentralized caching
US11762770B2 (en) * 2020-10-22 2023-09-19 EMC IP Holding Company LLC Cache memory management

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008123198A (en) * 2006-11-10 2008-05-29 Toshiba Corp Storage cluster system having cache consistency guarantee function
CN102244685A (en) * 2011-08-11 2011-11-16 中国科学院软件研究所 Distributed type dynamic cache expanding method and system supporting load balancing
US9621399B1 (en) * 2012-12-19 2017-04-11 Amazon Technologies, Inc. Distributed caching system
CN105554121A (en) * 2015-12-18 2016-05-04 深圳中兴网信科技有限公司 Method and system for realizing load equalization of distributed cache system
CN108881942A (en) * 2018-06-06 2018-11-23 西安交通大学 A kind of super fusion normality recording and broadcasting system based on distributed objects storage
CN110169040A (en) * 2018-07-10 2019-08-23 深圳花儿数据技术有限公司 Distributed data storage method and system based on multilayer consistency Hash
CN111726415A (en) * 2020-06-30 2020-09-29 国电南瑞科技股份有限公司 TCP long connection load balancing scheduling method and system based on negative feedback mechanism
CN112422651A (en) * 2020-11-06 2021-02-26 电子科技大学 Cloud resource scheduling performance bottleneck prediction method based on reinforcement learning
CN114356970A (en) * 2021-11-19 2022-04-15 苏州浪潮智能科技有限公司 Storage system resource caching method and device
CN116560562A (en) * 2022-01-30 2023-08-08 华为技术有限公司 Method and device for reading and writing data
CN114844846A (en) * 2022-04-14 2022-08-02 南京大学 Multi-level cache distributed key value storage system based on programmable switch
CN115858181A (en) * 2023-02-27 2023-03-28 中用科技有限公司 Distributed storage tilting workload balancing method based on programmable switch

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Survey of Cache Node Architecture Design for Information-Centric Networking; 丁力, 王劲林, 杨奇峰; Network New Media Technology; 2019-05-15 (No. 03); full text *
Research on Cluster Cooperative Caching Mechanisms; 魏文国, 陈潮填, 闫俊虎; Computer Science; 2008-01-25 (No. 01); full text *


Similar Documents

Publication Publication Date Title
CN112640371B (en) Method and system for performing data operations on a distributed storage environment
US9560093B2 (en) Higher efficiency storage replication using compression
JP5016063B2 (en) Consistent fault-tolerant distributed hash table (DHT) overlay network
EP1892921B1 (en) Method and system for managing distributed content and related metadata
WO2018000993A1 (en) Distributed storage method and system
CN111581284B (en) Database high availability method, device, system and storage medium
US8219707B2 (en) Storage network structure based on the Peterson graph and data read-write method thereof
US20020133491A1 (en) Method and system for managing distributed content and related metadata
Puppin et al. A grid information service based on peer-to-peer
US20050223096A1 (en) NAS load balancing system
US10708379B1 (en) Dynamic proxy for databases
US8554867B1 (en) Efficient data access in clustered storage system
CN113489784A (en) Distributed storage asymmetric logic unit access multipath implementation method and system
De la Rocha et al. Accelerating content routing with bitswap: A multi-path file transfer protocol in ipfs and filecoin
CN115499449A (en) Mirror image acceleration system, method and device
Yang et al. A reinforcement learning based data storage and traffic management in information-centric data center networks
CN114466344A (en) Edge cloud discovery and selection method suitable for wireless self-organizing network environment
CN117614956B (en) Intra-network caching method and system for distributed storage and storage medium
JP4533923B2 (en) Super-peer with load balancing function in hierarchical peer-to-peer system and method of operating the super-peer
Rahmani et al. A comparative study of replication schemes for structured P2P networks
Fesehaye et al. A Scalable Distributed File System for Cloud Computing
Jernberg et al. Doh: A content delivery peer-to-peer network
Nguyen et al. A dynamic-clustering backup scheme for high-availability distributed File sharing Systems
Nascimento et al. Evaluation of cache for bandwidth optimization in ICN through software-defined networks
CN116149576B (en) Method and system for reconstructing disk redundant array oriented to server non-perception calculation

Legal Events

Date Code Title Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant