CN117614956A - Intra-network caching method and system for distributed storage and storage medium - Google Patents
Info
- Publication number
- CN117614956A (application number CN202410096138.XA)
- Authority
- CN
- China
- Prior art keywords
- node
- programmable switch
- read request
- read
- request
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1001—Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
- H04L67/1004—Server selection for load balancing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/90—Buffering arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/568—Storing data temporarily at an intermediate stage, e.g. caching
Abstract
The invention discloses an in-network caching method, system, and storage medium for distributed storage, comprising the following steps. S1: a read request sent by an initial computing node of the front-end network is forwarded to a storage node of the back-end network through a programmable switch, and the computing node obtains the read reply fed back by the programmable switch and stores the cached key. S2: the current computing node sends a read request to the programmable switch, and the programmable switch reroutes the read request based on load balancing. S3: on a leaf-spine network architecture, all computing nodes map keys to different spine switches by consistent hashing, and each spine switch, together with all leaf switches connected to it and their computing and storage nodes, forms a mutually disjoint subnet. The method, system, and storage medium reduce load imbalance across the storage nodes and thereby improve the overall throughput of the in-network cache.
Description
Technical Field
The present invention relates to the field of data storage technologies, and in particular, to an in-network caching method and system for distributed storage, and a storage medium.
Background
A distributed storage system is a storage system in which data is distributed across multiple physical locations; it typically consists of multiple servers, computers, or nodes, and is divided into a front-end network and a back-end network, as shown in fig. 2. This design is intended to ensure data persistence, reliability, and scalability. Application scenarios for distributed systems keep multiplying, and these scenarios place growing demands on the performance and functionality of distributed systems.
The front-end network (Frontend Networking) is the network formed by the computing nodes, and its design mainly considers:
a. client-to-storage node communication: the front-end network focuses on the communication between clients and the distributed storage system. This generally relates to how data can be accessed and transmitted efficiently, quickly, and securely.
b. Data chunking and dispersion: to make the data more secure and reliable, the data is typically cut into blocks, which may be encrypted and replicated or encoded on multiple nodes.
c. Load balancing: when multiple clients attempt to access or write data, load-balancing techniques spread the load evenly across the storage nodes, thereby improving the overall performance of the system.
d. Data consistency and caching: in view of accessing data from multiple places, the front-end network must ensure consistency of the data. Furthermore, caching techniques may be used to speed up access to data.
The backend network (Backend Networking) is the network formed by the storage nodes, and its design needs to consider:
a. communication between nodes: backend networks focus on communication between storage nodes. This is to ensure redundancy, backup and consistency of the data.
b. Data replication and redundancy: to increase the reliability and availability of data, data is typically replicated across multiple nodes. When a node fails, copies on other nodes may be used to recover the data.
c. Fault detection and recovery: the backend network needs to have the ability to detect failures of storage nodes and automatically restore or redistribute data to ensure continued availability of the system.
d. Data consistency and synchronization: when updating and accessing data on multiple nodes, the system must ensure that all copies of the data are consistent. This typically requires complex protocols and algorithms to guarantee.
e. Scalability: as the amount of data grows, the distributed storage system should be able to easily add more storage nodes and capacity.
Distributed storage systems are designed to distribute load, improve data throughput, and ensure low latency access. However, load imbalance may occur in practical applications due to various factors, thereby affecting achievement of these objectives.
1. Different popularity of data:
some data may be more popular than others, resulting in a large number of clients requesting the same block of data at the same time. This may not only cause overload of certain storage nodes, but may also result in increased latency in accessing these "hot spot" data, thereby reducing overall throughput.
2. Uneven data distribution:
data may accumulate too much on some nodes, resulting in their workload being heavier than others. Such uneven distribution can pose a threat to overall high throughput and low latency targets.
3. Node resource inconsistency:
resource inconsistencies between nodes may cause certain nodes to become performance bottlenecks. For example, a node with limited storage or computing capabilities may not provide the same throughput or response speed as a high-end node.
In addition, with the development of network infrastructure and the emergence of new network applications, many new network scenarios, such as data center networks and high-performance computing networks, have arisen, together with corresponding new network technologies. These new scenarios demand higher switching speeds and more flexible network forwarding. Traditional commercial network switches are closed black boxes and not programmable: the protocols they support, their table-entry space, and their forwarding logic are fixed when the device ships, so they lag behind the rapid development of network technology. When novel protocols, tunnel encapsulations, or forwarding logic must be deployed flexibly in the network, such switches cannot provide flexible support. Software switches can flexibly define forwarding logic and deploy new protocols, but they are far slower than traditional hardware switches and cannot meet the demands of the new scenarios.
Disclosure of Invention
Based on the technical problems in the background art, the invention provides an in-network caching method and system for distributed storage, and a storage medium, which improve the overall throughput of the in-network cache.
The invention provides an in-network caching method for distributed storage, which comprises the following steps:
s1: the method comprises the steps that a read request sent by an initial computing node of a front-end network is forwarded to a storage node of a back-end network through a programmable switch, the initial computing node obtains a read reply fed back by the programmable switch and stores a cache key, and the read reply is data sent to the programmable switch by the storage node according to the obtained read request;
s2: the current computing node sends a read request to a programmable switch, and the programmable switch reroutes the read request based on load balancing;
s21: the current computing node sends a read request to a programmable switch, the programmable switch reads the key in the header of the read request message, and the cache node list is found according to the key; the cache node list is the collection of computing nodes caching that key;
s22: judging whether the cache node list is empty, if so, entering a step S23, and if not, entering a step S24;
s23: the programmable switch sends the read request to a unique storage node in the back-end network, and the selection of the unique storage node is determined based on consistent hash;
s24: the programmable switch takes the unique storage node together with the cache node list as a candidate list, sends the read request to the candidate list based on historical data, acquires the read replies fed back by the candidate list, calculates the delay of each node in the candidate list for the read request, sorts the candidate nodes' delays in ascending order, and forwards the read request sent by the current computing node to the node with the smallest delay (first in the ascending order);
s25: the current computing node acquires the key and the destination node from the read reply and adds the destination node to the key's cache node list, wherein the destination node is the node in the candidate list that fed back the current read reply;
s3: based on the leaf-spine network architecture, all computing nodes map keys to different spine switches by consistent hashing, and each spine switch, together with all leaf switches connected to it and their computing and storage nodes, forms a mutually disjoint subnet.
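To make the routing decision of steps S21 to S24 concrete, the following minimal Python sketch models the switch's control logic. It is an illustration only: a real deployment would express this in the switch's data-plane language (e.g. P4), and the node names, tables, and the modulo stand-in for a full consistent-hash ring are assumptions, not the patented implementation.

```python
import hashlib

STORAGE_NODES = ["H3", "H4", "H5"]            # hypothetical back-end nodes
cache_node_lists: dict[str, list[str]] = {}   # key -> compute nodes caching it
node_delay: dict[str, float] = {}             # node -> last measured read delay

def unique_storage_node(key: str) -> str:
    """S23: pick the single storage node responsible for a key.

    Plain modulo placement stands in for a consistent-hash ring here.
    """
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return STORAGE_NODES[digest % len(STORAGE_NODES)]

def route_read_request(key: str) -> str:
    """S21-S24: decide where a read request for `key` is forwarded."""
    cached = cache_node_lists.get(key, [])
    if not cached:                            # S22/S23: no cached copy yet
        return unique_storage_node(key)
    # S24: candidates = unique storage node + caching compute nodes; take
    # the candidate whose historical read delay is smallest.
    candidates = [unique_storage_node(key)] + cached
    return min(candidates, key=lambda n: node_delay.get(n, float("inf")))
```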
Further, in step S2, while the current computing node is sending read requests to the programmable switch, write requests from other computing nodes may be interspersed among them; the processing after the programmable switch receives a write request is as follows:
after receiving the write request, the programmable switch reads the key in the write request message, eliminates all the computing nodes in the cache node list corresponding to the key, and forwards the write request to the unique storage node in the back-end network;
and read requests arriving at the programmable switch later than the write request are rerouted according to the new cache node list, so that sequential consistency is achieved.
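Continuing the sketch above (same assumed state and names), the write-request handling that preserves sequential consistency might look as follows:

```python
def handle_write_request(key: str) -> str:
    """On a write, evict every cached copy and forward to the storage node."""
    cache_node_lists[key] = []           # empty the key's cache node list
    return unique_storage_node(key)      # all writes go to the unique node
```

Any read request processed after this write sees the emptied list and is routed back to the unique storage node, which holds the newly written value.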
Further, the compute node stores the cache key based on a hash table instead of a tree structure.
Further, in step S24, the time delay of different nodes in the candidate list for the read request is calculated, specifically:
when receiving a read request, the programmable switch adds a current time stamp into a read request message as a first time stamp;
when receiving the read reply, the programmable switch records the current timestamp corresponding to the read reply message as a second timestamp;
and obtaining the time delay of the read request based on the difference value between the second time stamp and the first time stamp.
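As a hedged illustration of the two-timestamp measurement just described (reusing the assumed node_delay table from the earlier sketch, with time.monotonic() standing in for the switch's clock; all names are illustrative):

```python
import time

pending_first_ts: dict[tuple[str, str], float] = {}   # (key, node) -> t1

def on_read_request_forwarded(key: str, node: str) -> None:
    # First timestamp: recorded when the read request passes the switch.
    pending_first_ts[(key, node)] = time.monotonic()

def on_read_reply_seen(key: str, node: str) -> None:
    # Second timestamp: recorded when the matching read reply returns.
    t1 = pending_first_ts.pop((key, node), None)
    if t1 is not None:
        node_delay[node] = time.monotonic() - t1       # delay = t2 - t1
```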
The invention further provides an in-network caching system for distributed storage, which comprises a node caching module, a cache tracking module, and a multi-machine expansion module, wherein the cache tracking module comprises a request corresponding module, a judging module, a first request sending module, and a second request sending module;
the node caching module is used for forwarding a read request sent by an initial computing node of the front-end network to a storage node of the back-end network through the programmable switch, the initial computing node obtaining the read reply fed back by the programmable switch and storing the cached key, the read reply being the data sent to the programmable switch by the storage node according to the obtained read request;
The cache tracking module is used for sending a read request to the programmable switch by the current computing node, and the programmable switch reroutes the read request based on load balancing;
the multi-machine expansion module is used for mapping keys to different spine switches based on the leaf-spine network architecture, each spine switch, together with all leaf switches connected to it and their computing and storage nodes, forming a mutually disjoint subnet;
the request corresponding module is used for sending a read request to the programmable switch by the current computing node, and the programmable switch reads a key in the read request message header and finds a cache node list according to the key; the cache node list is a collection of computing nodes for caching the key;
the judging module is used for judging whether the cache node list is empty, if so, entering the first request sending module, and if not, entering the second request sending module;
the first request sending module is used for sending a read request to a unique storage node in the back-end network by the programmable switch, and the selection of the unique storage node is determined based on consistent hash;
the second request sending module is used for the programmable switch to take the unique storage node together with the cache node list as a candidate list, send the read request to the candidate list based on historical data, obtain the read replies fed back by the candidate list, calculate the delay of each node in the candidate list for the read request, sort the candidate nodes' delays in ascending order, and forward the read request sent by the current computing node to the node with the smallest delay.
Further, in the cache tracking module, while the current computing node is sending read requests to the programmable switch, write requests from other computing nodes may be interspersed among them; the processing after the programmable switch receives a write request is as follows:
after receiving the write request, the programmable switch reads the key in the write request message, eliminates all the computing nodes in the cache node list corresponding to the key, and forwards the write request to the unique storage node in the back-end network;
and read requests arriving at the programmable switch later than the write request are rerouted according to the new cache node list, so that sequential consistency is achieved.
A computer readable storage medium having stored thereon a number of programs to be invoked by a processor to perform the in-network caching method described above.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
The in-network caching method, system, and storage medium for distributed storage described above have the following advantages: they use clients on the computing nodes as cache nodes, forming a negative-feedback loop, so no extra cache-communication overhead is introduced and the bandwidth demanded of the back-end network's nodes is reduced. Based on load balancing, the method solves the latency increase caused by differing data popularity in existing distributed storage; based on load balancing together with multi-machine expansion, it solves the uneven data distribution and inconsistent node resources caused by data accumulating on certain nodes; and the negative-feedback mechanism of adding cache nodes further counters the latency increase caused by differing data popularity. Together these reduce load imbalance across the storage nodes and improve the overall throughput of the in-network cache.
Drawings
FIG. 1 is a schematic flow diagram of the method of the present invention;
FIG. 2 is a schematic diagram of the structure of the front-end network and the back-end network;
FIG. 3 is a schematic diagram of the in-network cache management architecture;
FIG. 4 is a schematic flow diagram of node caching;
FIG. 5 is a schematic flow diagram of cache tracking and write requests;
FIG. 6 is a schematic diagram of the in-network measurement of read-task delay in step S24;
FIG. 7 is a schematic diagram of the programmable switch forwarding a read request to the minimum-delay node in step S24;
fig. 8 is a schematic diagram of the multi-machine expansion in step S3.
Detailed Description
In the following detailed description of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may, however, be embodied in many forms other than those described herein, and those skilled in the art may make similar modifications without departing from its spirit; the invention is therefore not limited to the specific embodiments disclosed below.
As shown in figs. 1 to 8, in this embodiment the front-end network has 32 computing nodes, the back-end network has 16 storage nodes, and the nodes are connected through 4 programmable switches. Read and write requests are initiated by the computing nodes and forwarded by the programmable switches, which improves the overall throughput of the in-network cache.
As shown in fig. 1 to 8, the method for caching in a distributed storage network according to the present invention includes the following steps S1 to S3:
s1: node caching: the method comprises the steps that a read request sent by an initial computing node of a front-end network is forwarded to a storage node of a back-end network through a programmable switch, the initial computing node obtains a read reply fed back by the programmable switch and stores a cache key, and the read reply is data sent to the programmable switch by the storage node according to the obtained read request;
Caching keys at the computing nodes solves the latency increase caused by differing data popularity in the traditional technology, and alleviates the low throughput and high latency caused by uneven data distribution.
The programmable switch implements a programmable data plane: it abstracts the switch's forwarding pipeline and supports user-defined parsing and forwarding logic, guaranteeing forwarding flexibility while hardware keeps forwarding at full speed. The advent of programmable switches has reduced network complexity.
Node caching is performed only at the computing nodes of the front-end network: a computing node caches the data after sending a read request and receiving its reply. Because the cache needs no long-term storage, the computing node can use a hash table instead of a tree structure to store the cached keys, reducing lookup time and memory consumption. Subsequently, when the computing node receives a redirected read request, it sends a reply to the requesting node.
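A minimal sketch of such a node-side cache, assuming a simple key-value interface (the function names and types are illustrative assumptions, not the patent's API):

```python
local_cache: dict[str, bytes] = {}       # flat hash table; no tree needed

def on_read_reply(key: str, value: bytes) -> None:
    local_cache[key] = value             # cache the key after a read reply

def on_redirected_read(key: str) -> bytes | None:
    # Serve a read request that the switch redirected to this node.
    return local_cache.get(key)
```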
S2: cache tracking and load balancing: the current computing node sends a read request to a programmable switch, and the programmable switch reroutes the read request based on load balancing;
tasks in the distributed storage system can be divided into a read task and a write task, and traffic can be divided into a read request, a read reply, a write request and a write reply. The read reply is sent from the back-end network or the cached computing node to the front-end network. The programmable switch maintains a mapping from keys to a list that caches the keys.
When no computing node holds a cache yet, the programmable switch forwards any computing node's read request to the storage node to obtain the read reply. Once computing nodes begin to cache, when the next computing node sends a read request, the programmable switch must consider both the computing nodes' caches and whether the storage node holds the read reply for the request, and it reroutes the read request based on load balancing. For example, as shown in fig. 3, a client program on a computing node sends a read request to the programmable switch and the system processes it; the process is shown in fig. 4. Computing node H1 requests Key1 from the programmable switch; the switch determines from its table entries that only the back-end network holds Key1, so it forwards the request to storage node H4. Computing node H2 then requests Key2; the switch's table entry indicates that H1 also holds Key2, so, according to the load-balancing algorithm, the switch forwards the request to computing node H1 and obtains the read reply fed back by that computing node.
This embodiment's load-balancing algorithm avoids the node performance bottlenecks caused by resource inconsistencies between nodes. The algorithm is as follows:
s21: the current computing node sends a read request to the programmable switch, the programmable switch reads the key in the header of the read request message, and the cache node list is found according to the key; the cache node list is the collection of computing nodes caching that key;
s22: judging whether the cache node list is empty, if so, entering a step S23, and if not, entering a step S24;
s23: the programmable switch sends the read request to a unique storage node in the back-end network, and the selection of the unique storage node is determined based on consistent hash;
s24: the programmable switch takes the unique storage node together with the cache node list as a candidate list, sends the read request to the candidate list based on historical data, acquires the read replies fed back by the candidate list, calculates the delay of each node in the candidate list for the read request, sorts the candidate nodes' delays in ascending order, and forwards the read request sent by the current computing node to the node with the smallest delay;
as shown in figs. 6 and 7, the programmable switch measures read-task delay in-network through in-network telemetry, and judges each node's load from the measurement. The switch maintains a read-request delay measurement for every node (computing nodes and storage nodes). When it receives a read request, the switch adds the current timestamp to the read request message as the first timestamp; when it receives the read reply, it records the current timestamp of the read reply message as the second timestamp. The difference between the second and first timestamps is the delay of the node that processed the read request, and the node with the minimum delay among all node delays is selected to receive the read request sent by the computing node.
The programmable switch thus performs load balancing on the delays obtained from the timestamp differences, selecting the node with the minimum processing delay. In the example of figs. 6 and 7: after the switch receives the read request, it finds that both H1 and H2 can answer it, and selects node H1, which has the minimum processing delay, for forwarding.
S25: the current computing node acquires a key and a destination node in the read reply, and adds the destination node into a cache node list of the key, wherein the destination node is a node for feeding back the current read reply in the candidate list.
Through steps S21 to S25, the programmable switch reroutes read requests from other computing nodes based on real-time knowledge of previous computing nodes' caches: it updates its table entries and forwards read requests according to the read replies. As shown in fig. 5, the process of in-network cache tracking is illustrated with an example: (1) the client on computing node H1 issues a read request to the programmable switch; (2) by parsing the key, the switch finds storage node H3 in the back-end network and forwards the request to it; (3) storage node H3 sends a read reply to the switch; (4) on receiving the read reply, the switch adds its destination address H1 to the cache directory of K1 and (5) forwards the reply to computing node H1. When computing node H2 later issues a read request for the same key, the programmable switch redirects it to computing node H1 according to the previously added entry.
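The directory update of step (4) above, i.e. step S25, might be sketched as follows, reusing the assumed cache_node_lists table from the earlier sketch:

```python
def on_read_reply_at_switch(key: str, dest_node: str) -> None:
    """S25: record the reply's destination as a new cache holder of the key."""
    holders = cache_node_lists.setdefault(key, [])
    if dest_node not in holders:
        holders.append(dest_node)        # future reads may be routed here
```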
In addition, in step S2, while the current computing node is sending read requests to the programmable switch, other computing nodes may intersperse write requests; the programmable switch processes a received write request as follows:
when a compute node initiates a write request to a key, meaning that the value corresponding to the key will be modified, the value in the cache will be different from the written value. In order to maintain the data consistency of the system, when the programmable switch receives a write request, the programmable switch reads a key in a write request message, eliminates all computing nodes in a cache node list corresponding to the key, and forwards the write request to a unique storage node in a back-end network. Because the write request to any one key is forwarded to a unique node and the read request arriving at the switch later than the request is rerouted according to the new cache node list, order consistency is achieved. Resetting the cache node list does not affect performance when the write task is less frequent than the read task.
The process of a cache-list reset is illustrated with an example, as shown in fig. 5: computing node H2 sends a write request to the programmable switch; the switch resets the forwarding entry to contain only storage node H3 and forwards the write request to H3.
S3: based on the leaf-spine network architecture, all computing nodes map keys to different spine switches by consistent hashing, and each spine switch, together with all leaf switches connected to it and their computing and storage nodes, forms a mutually disjoint subnet;
in step S3, all keys are partitioned across the different spine switches: each spine switch, with all leaf switches and nodes connected to it, forms a mutually disjoint subnet, and the nodes map keys onto the spine switches by consistent hashing. As shown in fig. 8, when a computing node requests Key1 or Key2 it sends the request to programmable switch S1, and when it requests Key11 or Key12 it sends the request to programmable switch S2.
Step S3 builds on the spine-leaf topology (leaf-spine network architecture) of existing data center networks and achieves multi-machine expansion through subnet division. In this embodiment's multi-machine expansion scheme, the cache tracking program of step S2 runs only on the spine switches, and the computing nodes forward read requests to the different spine switches according to consistent hashing. Each spine switch then sees the same situation as a single switch, and the key sets the different spine switches are responsible for are mutually disjoint subsets of the full key set. Through this multi-machine expansion, the method achieves sequential consistency in the multi-switch scenario.
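A hedged sketch of this key-to-spine mapping, using a small consistent-hash ring (the switch names, virtual-point count, and hash choice are assumptions for illustration):

```python
import bisect
import hashlib

SPINE_SWITCHES = ["S1", "S2"]            # hypothetical spine switches

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# A few virtual points per switch on the ring, sorted by ring position.
ring = sorted((_h(f"{sw}#{v}"), sw) for sw in SPINE_SWITCHES for v in range(8))

def spine_for_key(key: str) -> str:
    """Map a key to its responsible spine switch by consistent hashing."""
    idx = bisect.bisect(ring, (_h(key), "")) % len(ring)
    return ring[idx][1]
```

Because every key deterministically lands on exactly one spine switch, the key sets handled by different spine switches are disjoint, mirroring the fig. 8 example where Key1 and Key2 go to S1 while Key11 and Key12 go to S2 (the sketch's actual assignment depends on the hash values).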
According to steps S1 to S3, in-network cache tracking implements cache tracking on the programmable switch's data plane; by exploiting the switch's line-rate processing, it requires no extra hardware and generates no extra traffic. In addition, it uses the idle bandwidth and processing capacity of the computing nodes to cache storage-node content, which raises system capacity in both end-system processing power and network bandwidth, while the negative-feedback mechanism of adding cache nodes reduces load imbalance across the storage nodes. Through subnet division, the in-network cache also achieves multi-switch expansion with sequential consistency; it suits the spine-leaf network architecture widely used in data centers and can be deployed into existing large-scale network systems. Current network scenarios such as cloud computing and distributed machine learning rely heavily on distributed storage, so deploying an in-network cache tracking system in an existing network can raise the total throughput of the system; the in-network cache tracking system can be widely applied in such scenarios as a basic building block.
Meanwhile, based on load balancing, the method solves the latency increase caused by differing data popularity in existing distributed storage, and, based on load balancing and multi-machine expansion, it solves the uneven data distribution and inconsistent node resources caused by data accumulating on certain nodes, thereby improving the overall throughput of the in-network cache.
In addition, existing distributed storage also suffers from: (a1) non-optimized workload patterns: the access patterns of certain applications can hurt high throughput and low latency; for example, heavy random reads and writes or large-scale bulk data operations may cause unnecessary network congestion or storage delays; (a2) the challenge of combining high throughput with low latency: load imbalance directly affects the system's throughput and latency, and achieving both goals requires not only distributing data and requests evenly but also ensuring that each node can fully utilize its resources. This embodiment performs node caching and cache tracking through the in-network caching method for distributed storage and, combined with load balancing and multi-machine expansion, addresses compute-node caching and storage-node load imbalance.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto. Any equivalent substitution or modification made, within the scope disclosed by the present invention, by a person skilled in the art according to the technical scheme of the invention and its inventive concept shall be covered by the scope of the present invention.
Claims (7)
1. An in-network caching method for distributed storage is characterized by comprising the following steps:
s1: the method comprises the steps that a read request sent by an initial computing node of a front-end network is forwarded to a storage node of a back-end network through a programmable switch, the initial computing node obtains a read reply fed back by the programmable switch and stores a cache key, and the read reply is data sent to the programmable switch by the storage node according to the obtained read request;
s2: the current computing node sends a read request to a programmable switch, and the programmable switch reroutes the read request based on load balancing;
s21: the current computing node sends a read request to a programmable switch, the programmable switch reads the key in the header of the read request message, and the cache node list is found according to the key; the cache node list is the collection of computing nodes caching that key;
s22: judging whether the cache node list is empty, if so, entering a step S23, and if not, entering a step S24;
s23: the programmable switch sends the read request to a unique storage node in the back-end network, and the selection of the unique storage node is determined based on consistent hash;
s24: the programmable switch takes the unique storage node together with the cache node list as a candidate list, sends the read request to the candidate list based on historical data, acquires the read replies fed back by the candidate list, calculates the delay of each node in the candidate list for the read request, sorts the candidate nodes' delays in ascending order, and forwards the read request sent by the current computing node to the node with the smallest delay;
s25: the current computing node acquires the key and the destination node from the read reply and adds the destination node to the key's cache node list, wherein the destination node is the node in the candidate list that fed back the current read reply;
s3: based on the leaf-spine network architecture, all computing nodes map keys to different spine switches by consistent hashing, and each spine switch, together with all leaf switches connected to it and their computing and storage nodes, forms a mutually disjoint subnet.
2. The in-network caching method for distributed storage according to claim 1, wherein in step S2, while the current computing node is sending read requests to the programmable switch, write requests from other computing nodes may be interspersed among them, and the processing after the programmable switch receives a write request is as follows:
after receiving the write request, the programmable switch reads the key in the write request message, eliminates all the computing nodes in the cache node list corresponding to the key, and forwards the write request to the unique storage node in the back-end network;
and read requests arriving at the programmable switch later than the write request are rerouted according to the new cache node list, so that sequential consistency is achieved.
3. The method of in-network caching of distributed storage of claim 1, wherein the compute nodes store the cache keys based on hash tables instead of tree structures.
4. The method of claim 1, wherein in step S24, the time delay of the different nodes in the candidate list for the read request is calculated, specifically:
when receiving a read request, the programmable switch adds a current time stamp into a read request message as a first time stamp;
when receiving the read reply, the programmable switch records the current timestamp corresponding to the read reply message as a second timestamp;
and obtaining the time delay of the read request based on the difference value between the second time stamp and the first time stamp.
5. An in-network caching system for distributed storage, characterized by comprising a node caching module, a cache tracking module, and a multi-machine expansion module, wherein the cache tracking module comprises a request corresponding module, a judging module, a first request sending module, and a second request sending module;
the node caching module is used for forwarding a read request sent by an initial computing node of the front-end network to a storage node of the back-end network through the programmable switch, the initial computing node obtains a read reply fed back by the programmable switch and stores a caching key, and the read reply is data sent to the programmable switch by the storage node according to the obtained read request
The cache tracking module is used for sending a read request to the programmable switch by the current computing node, and the programmable switch reroutes the read request based on load balancing;
the multi-machine expansion module is used for mapping keys to different spine switches based on the leaf-spine network architecture, each spine switch, together with all leaf switches connected to it and their computing and storage nodes, forming a mutually disjoint subnet;
the request corresponding module is used for sending a read request to the programmable switch by the current computing node, and the programmable switch reads a key in the read request message header and finds a cache node list according to the key; the cache node list is a collection of computing nodes for caching the key;
the judging module is used for judging whether the cache node list is empty, if so, entering the first request sending module, and if not, entering the second request sending module;
the first request sending module is used for sending a read request to a unique storage node in the back-end network by the programmable switch, and the selection of the unique storage node is determined based on consistent hash;
the second request sending module is used for the programmable switch to take the unique storage node together with the cache node list as a candidate list, send the read request to the candidate list based on historical data, obtain the read replies fed back by the candidate list, calculate the delay of each node in the candidate list for the read request, sort the candidate nodes' delays in ascending order, and forward the read request sent by the current computing node to the node with the smallest delay.
6. The system of claim 5, wherein in the cache tracking module, while the current computing node is sending read requests to the programmable switch, write requests from other computing nodes may be interspersed among them, and the processing after the programmable switch receives a write request is as follows:
after receiving the write request, the programmable switch reads the key in the write request message, eliminates all the computing nodes in the cache node list corresponding to the key, and forwards the write request to the unique storage node in the back-end network;
and read requests arriving at the programmable switch later than the write request are rerouted according to the new cache node list, so that sequential consistency is achieved.
7. A computer readable storage medium having stored thereon a number of programs to be invoked by a processor to perform the in-network caching method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410096138.XA CN117614956B (en) | 2024-01-24 | 2024-01-24 | Intra-network caching method and system for distributed storage and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410096138.XA CN117614956B (en) | 2024-01-24 | 2024-01-24 | Intra-network caching method and system for distributed storage and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117614956A true CN117614956A (en) | 2024-02-27 |
CN117614956B CN117614956B (en) | 2024-03-29 |
Family
ID=89952089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410096138.XA Active CN117614956B (en) | 2024-01-24 | 2024-01-24 | Intra-network caching method and system for distributed storage and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117614956B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008123198A (en) * | 2006-11-10 | 2008-05-29 | Toshiba Corp | Storage cluster system having cache consistency guarantee function |
CN102244685A (en) * | 2011-08-11 | 2011-11-16 | 中国科学院软件研究所 | Distributed type dynamic cache expanding method and system supporting load balancing |
US20150358421A1 (en) * | 2014-06-10 | 2015-12-10 | International Business Machines Corporation | Cooperative decentralized caching |
CN105554121A (en) * | 2015-12-18 | 2016-05-04 | 深圳中兴网信科技有限公司 | Method and system for realizing load equalization of distributed cache system |
US9621399B1 (en) * | 2012-12-19 | 2017-04-11 | Amazon Technologies, Inc. | Distributed caching system |
CN108881942A (en) * | 2018-06-06 | 2018-11-23 | 西安交通大学 | A kind of super fusion normality recording and broadcasting system based on distributed objects storage |
CN110169040A (en) * | 2018-07-10 | 2019-08-23 | 深圳花儿数据技术有限公司 | Distributed data storage method and system based on multilayer consistency Hash |
CN111726415A (en) * | 2020-06-30 | 2020-09-29 | 国电南瑞科技股份有限公司 | TCP long connection load balancing scheduling method and system based on negative feedback mechanism |
CN112422651A (en) * | 2020-11-06 | 2021-02-26 | 电子科技大学 | Cloud resource scheduling performance bottleneck prediction method based on reinforcement learning |
CN114356970A (en) * | 2021-11-19 | 2022-04-15 | 苏州浪潮智能科技有限公司 | Storage system resource caching method and device |
US20220129379A1 (en) * | 2020-10-22 | 2022-04-28 | EMC IP Holding Company LLC | Cache memory management |
CN114844846A (en) * | 2022-04-14 | 2022-08-02 | 南京大学 | Multi-level cache distributed key value storage system based on programmable switch |
CN115858181A (en) * | 2023-02-27 | 2023-03-28 | 中用科技有限公司 | Distributed storage tilting workload balancing method based on programmable switch |
CN116560562A (en) * | 2022-01-30 | 2023-08-08 | 华为技术有限公司 | Method and device for reading and writing data |
-
2024
- 2024-01-24 CN CN202410096138.XA patent/CN117614956B/en active Active
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008123198A (en) * | 2006-11-10 | 2008-05-29 | Toshiba Corp | Storage cluster system having cache consistency guarantee function |
CN102244685A (en) * | 2011-08-11 | 2011-11-16 | 中国科学院软件研究所 | Distributed type dynamic cache expanding method and system supporting load balancing |
US9621399B1 (en) * | 2012-12-19 | 2017-04-11 | Amazon Technologies, Inc. | Distributed caching system |
US20150358421A1 (en) * | 2014-06-10 | 2015-12-10 | International Business Machines Corporation | Cooperative decentralized caching |
CN105554121A (en) * | 2015-12-18 | 2016-05-04 | 深圳中兴网信科技有限公司 | Method and system for realizing load equalization of distributed cache system |
CN108881942A (en) * | 2018-06-06 | 2018-11-23 | 西安交通大学 | A kind of super fusion normality recording and broadcasting system based on distributed objects storage |
CN110169040A (en) * | 2018-07-10 | 2019-08-23 | 深圳花儿数据技术有限公司 | Distributed data storage method and system based on multilayer consistency Hash |
US20210208987A1 (en) * | 2018-07-10 | 2021-07-08 | Here Data Technology | Systems and methods of distributed data storage using multi-layers consistent hashing |
CN111726415A (en) * | 2020-06-30 | 2020-09-29 | 国电南瑞科技股份有限公司 | TCP long connection load balancing scheduling method and system based on negative feedback mechanism |
US20220129379A1 (en) * | 2020-10-22 | 2022-04-28 | EMC IP Holding Company LLC | Cache memory management |
CN112422651A (en) * | 2020-11-06 | 2021-02-26 | 电子科技大学 | Cloud resource scheduling performance bottleneck prediction method based on reinforcement learning |
CN114356970A (en) * | 2021-11-19 | 2022-04-15 | 苏州浪潮智能科技有限公司 | Storage system resource caching method and device |
CN116560562A (en) * | 2022-01-30 | 2023-08-08 | 华为技术有限公司 | Method and device for reading and writing data |
CN114844846A (en) * | 2022-04-14 | 2022-08-02 | 南京大学 | Multi-level cache distributed key value storage system based on programmable switch |
CN115858181A (en) * | 2023-02-27 | 2023-03-28 | 中用科技有限公司 | Distributed storage tilting workload balancing method based on programmable switch |
Non-Patent Citations (2)
Title |
---|
Ding Li; Wang Jinlin; Yang Qifeng: "A survey of research on cache node architecture design in information-centric networking" (in Chinese), Network New Media Technology, no. 03, 15 May 2019 (2019-05-15) *
Wei Wenguo; Chen Chaotian; Yan Junhu: "Research on cluster cooperative caching mechanisms" (in Chinese), Computer Science, no. 01, 25 January 2008 (2008-01-25) *
Also Published As
Publication number | Publication date |
---|---|
CN117614956B (en) | 2024-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112640371B (en) | Method and system for performing data operations on a distributed storage environment | |
CN111581284B (en) | Database high availability method, device, system and storage medium | |
US20170208124A1 (en) | Higher efficiency storage replication using compression | |
JP5016063B2 (en) | Consistent fault-tolerant distributed hash table (DHT) overlay network | |
WO2018000993A1 (en) | Distributed storage method and system | |
US20020133491A1 (en) | Method and system for managing distributed content and related metadata | |
Puppin et al. | A grid information service based on peer-to-peer | |
US20090240705A1 (en) | File switch and switched file system | |
EP1892921A2 (en) | Method and sytem for managing distributed content and related metadata | |
US8554867B1 (en) | Efficient data access in clustered storage system | |
US20050223096A1 (en) | NAS load balancing system | |
De la Rocha et al. | Accelerating content routing with bitswap: A multi-path file transfer protocol in ipfs and filecoin | |
CN113489784A (en) | Distributed storage asymmetric logic unit access multipath implementation method and system | |
CN115499449A (en) | Mirror image acceleration system, method and device | |
CN114466344A (en) | Edge cloud discovery and selection method suitable for wireless self-organizing network environment | |
CN117614956B (en) | Intra-network caching method and system for distributed storage and storage medium | |
JP4533923B2 (en) | Super-peer with load balancing function in hierarchical peer-to-peer system and method of operating the super-peer | |
Rahmani et al. | A comparative study of replication schemes for structured P2P networks | |
CN113076298B (en) | Distributed small file storage system | |
Fesehaye et al. | A Scalable Distributed File System for Cloud Computing | |
Nguyen et al. | A dynamic-clustering backup scheme for high-availability distributed File sharing Systems | |
Jernberg et al. | Doh: A content delivery peer-to-peer network | |
CN101815022B (en) | Source switching method, device and system in peer-to-peer network | |
EP4394573A1 (en) | Data processing method and related device | |
CN116149576B (en) | Method and system for reconstructing disk redundant array oriented to server non-perception calculation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||