CN115190135B - Distributed storage system and copy selection method thereof


Info

Publication number: CN115190135B
Application number: CN202210768871.2A
Authority: CN (China)
Prior art keywords: network, edge, edge server, actor, time
Legal status: Active (granted)
Other versions: CN115190135A (Chinese, zh)
Inventors: 党曼玉, 洪旺, 施展, 廖子逸, 李一泠, 张望
Assignee: Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology
Priority to CN202210768871.2A
Publication of application CN115190135A; application granted; publication of grant CN115190135B


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 - Protocols
    • H04L 67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H04L 67/1001 - Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004 - Server selection for load balancing
    • H04L 67/1023 - Server selection for load balancing based on a hash applied to IP addresses or costs

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a distributed storage system and a copy selection method thereof, belonging to the technical field of distributed storage. An Actor network is arranged in each edge server to rapidly calculate the score of that edge server, and Critic networks are deployed in the cloud to comprehensively consider the information of all Actor networks for joint action evaluation. Each Actor network is trained based on the evaluation result output by its corresponding Critic network, and each Critic network is trained based on data randomly sampled from an experience pool; the training processes of the Actor networks and the Critic networks are independent of each other and run continuously, so that the service quality of each edge server can be accurately scored at every moment. By maintaining a server ranking among the servers and distributing it to clients, copy selection has complete server state information without any forwarding delay cost. The invention is thus better adapted to copy selection in the edge environment, reduces request processing delay in the edge environment, and achieves both performance and reliability.

Description

Distributed storage system and copy selection method thereof
Technical Field
The invention belongs to the technical field of distributed storage, and particularly relates to a distributed storage system and a copy selection method thereof.
Background
With the popularity of mobile phones, wearable devices, and various sensors, the number of Internet of Things devices is growing rapidly. According to the Ericsson Mobility Report 2021, there were 14.6 billion Internet of Things connections worldwide in 2021, and this number is expected to grow to 30.2 billion by 2027. These devices are used to support a variety of applications, including road safety services, real-time video analytics, gaming, and augmented and virtual reality applications. However, due to limitations in computation, storage, and energy, these devices can often only collect data and then transfer it to a cloud data center with powerful processing capabilities. With the support of cloud computing, users can run these applications on devices with limited capability.
In the cloud computing mode, however, data is sent from the edge to the cloud over multiple hops, which can cause significant delays in request processing. Moreover, Internet of Things devices generate large amounts of data at all times, and forwarding all of it to the cloud for processing occupies a large amount of network bandwidth. For this reason, a new computing paradigm, edge computing, has emerged. Edge computing provides computing and storage services by deploying edge servers at the edge of the network, enabling user data to be processed directly at the edge, reducing request latency and saving network bandwidth between the edge and the cloud. In addition, as the transmission path is shortened, the reliability of transmission also improves.
Deploying storage services at the edge allows terminal devices to access data at high speed and reduces the response delay of data access, which is important for the typically delay-sensitive applications. However, due to many sources of variability, performance fluctuations often occur at the nodes of a distributed storage system, affecting the quality of service of the system. In the edge environment, the quality of service also changes with user mobility and the time-varying, dynamic network. The copy selection strategy, a widely used request scheduling method for improving system quality of service, can effectively reduce the processing delay of each request by selecting the edge server with the lowest delay for that request. Compared with other approaches (e.g., redundant requests, reissued requests), copy selection does not increase the load on the system, and it is an indispensable part of a distributed storage system: when a request arrives, some server must always be selected to serve it. It is therefore necessary to study copy selection policies in the edge environment to ensure the quality of service of the system. However, conventional copy selection policies are often set at the client and cannot quickly adapt to changes in the state of the edge servers. Existing copy selection policies mainly fall into client-based policies and central-node-based policies. A client-based copy selection policy lacks complete server state information, so its estimates of server delay are inaccurate, and because multiple selecting nodes are difficult to coordinate, load oscillation easily occurs and request delay increases. A central-node-based copy selection policy performs copy selection for all clients through an additional central node; in an edge scenario the cloud data center is used as the copy selection node, every request is first sent to the cloud data center, and the cloud data center selects the edge server with the best service capability for each request. This request forwarding introduces additional response delay, and in a geographically distributed environment such as the edge, the delay generated by request forwarding is even larger.
In order to reduce the processing delay of requests in the edge environment, ensure the quality of service of the system, and achieve a compromise between performance and reliability, how to design a distributed storage system that facilitates copy selection, and how to optimize the copy selection method in such a distributed storage system, become problems to be solved.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a distributed storage system and a copy selection method thereof, which are used to solve the technical problem of high response delay in the prior art.
To achieve the above object, in a first aspect, the present invention provides a distributed storage system, including: cloud end and server end; the server side comprises: a plurality of distributed edge servers; each edge server is provided with an Actor network; the cloud is provided with a plurality of Critic networks, the number of the Critic networks is the same as that of the edge servers, and one Critic network corresponds to one Actor network;
the operation process of the distributed storage system comprises the following steps:
At each time t, each edge server performs the following operations: the edge server collects the current state data of the network environment in which it is located as its state information, and inputs this state data into the Actor network in the edge server, which scores the service quality of the edge server, to obtain the score of the edge server; the state information and scores of all edge servers are sent to the corresponding Critic network in the cloud to obtain an evaluation result, and the Actor network in the edge server is trained with the objective of maximizing the evaluation result;
At each time t, the cloud performs the following operations: collecting the information sent by all edge servers; after the information sent by all edge servers at time t has been collected, calculating the reward value r_{t-1} at time t-1 and storing the corresponding tuple information into an experience pool; when the experience pool is full of data, randomly sampling tuple information data from the experience pool to train all Critic networks simultaneously; wherein the tuple information includes: the state information of all edge servers at time t-1, the scores of all edge servers at time t-1, the reward value at time t-1, and the state information of all edge servers at time t.
Further preferably, the reward value r_{t-1} at time t-1 is:

r_{t-1} = Σ_{i=1}^{N} (L̄ - L_i)(n_i - n̄)

where N is the number of edge servers; L_i is the average delay of the i-th edge server; L̄ is the mean of the average delays of all edge servers; n_i is the number of requests processed by the i-th edge server; and n̄ is the mean of the numbers of requests processed by all edge servers.
Further preferably, in the process of executing the operations at each time t, after the experience pool is not yet full of data or the Critic network training has been completed, it is judged whether the elapsed time exceeds a preset time period; if so, the scores of the edge servers at different times are obtained from the experience pool and the average score of each edge server is calculated; taking the median of the average scores of the edge servers as the dividing point, the edge servers are divided into low-delay edge servers and high-delay edge servers, where the average score of a low-delay edge server is greater than or equal to the dividing point and the average score of a high-delay edge server is smaller than the dividing point; the edge servers are partitioned using two "root bucket" structures, denoted the Low bucket and the High bucket respectively; the ⌈N/2⌉ low-delay edge servers are placed in the Low bucket and the remaining ⌊N/2⌋ high-delay edge servers are placed in the High bucket; ⌈M/2⌉ edge servers are selected in the Low bucket and ⌊M/2⌋ edge servers are selected in the High bucket to place the copies; otherwise, the cloud operation at time t ends; where N is the number of edge servers and M is the number of copies.
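As an illustrative sketch of this optional partitioning step (all function and variable names here are hypothetical, not taken from the patent), the following Python code splits the edge servers by the median of their average scores and picks ⌈M/2⌉ low-delay and ⌊M/2⌋ high-delay servers; for simplicity it takes the top-ranked servers inside each group, whereas in the embodiments below the per-object choice is delegated to Ceph's bucket and rule mechanism:

```python
import statistics

def partition_and_place(score_history, num_copies):
    """score_history: dict mapping server id -> list of scores sampled
    from the experience pool at different times; num_copies: M."""
    # Average score per edge server over the sampled times.
    avg = {sid: sum(v) / len(v) for sid, v in score_history.items()}
    split = statistics.median(avg.values())                 # dividing point
    low = [sid for sid, s in avg.items() if s >= split]     # low-delay servers
    high = [sid for sid, s in avg.items() if s < split]     # high-delay servers
    # Rank inside each group by average score (higher score = better service).
    low.sort(key=lambda sid: avg[sid], reverse=True)
    high.sort(key=lambda sid: avg[sid], reverse=True)
    m_low = (num_copies + 1) // 2        # ceil(M/2) copies in the Low group
    m_high = num_copies // 2             # floor(M/2) copies in the High group
    return low[:m_low] + high[:m_high]   # servers chosen to hold the copies
```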
Further preferably, the Actor network includes: an Actor online network and an Actor target network; the Critic network comprises a Critic online network and a Critic target network;
the operation process of the distributed storage system comprises the following steps:
At each time t, each edge server performs the following operations: the edge server collects the current state data of the network environment where the edge server is located as the state information of the edge server, and inputs the current state data into an Actor online network and an Actor target network in the edge server respectively to obtain scores output by the Actor online network and scores output by the Actor target network; the edge server sends the state information and scores output by the Actor online networks of all the edge servers to corresponding Critic online networks in the cloud to obtain evaluation results, and training the Actor online networks in the edge server by taking the maximized evaluation results as targets; after training for a plurality of rounds, updating the Actor target network based on the parameters of the Actor online network;
At each time t, the cloud performs the following operations: collecting the information sent by all edge servers; after the information sent by all edge servers at time t has been collected, calculating the reward value at time t-1 and storing the corresponding tuple information into an experience pool; when the experience pool is full of data, randomly sampling tuple information data from the experience pool to train all Critic networks simultaneously; the tuple information includes: the state information s_{t-1} of all edge servers at time t-1, the scores a_{t-1} output by the Actor online networks of all edge servers at time t-1, the reward value r_{t-1} at time t-1, the state information s_t of all edge servers at time t, and the scores a'_t output by the Actor target networks of all edge servers at time t; where s_{t-1} = (s_{t-1}^1, ..., s_{t-1}^N), s_{t-1}^i being the state information of the i-th edge server at time t-1; a_{t-1} = (a_{t-1}^1, ..., a_{t-1}^N), a_{t-1}^i being the score output by the Actor online network of the i-th edge server at time t-1; s_t = (s_t^1, ..., s_t^N), s_t^i being the state information of the i-th edge server at time t; a'_t = (a'^1_t, ..., a'^N_t), a'^i_t being the score output by the Actor target network of the i-th edge server at time t; N is the number of edge servers.
Further preferably, the method for training each Critic network by randomly sampling tuple information data from an experience pool comprises:
Recording the j-th sampled tuple as (s_b, a_b, r_b, s_{b+1}, a'_{b+1}); where s_b^i is the state information of the i-th edge server at time b; a_b^i is the score output by the Actor online network of the i-th edge server at time b; and a'^i_{b+1} is the score output by the Actor target network of the i-th edge server at time b+1;
Acquiring an evaluation result and a corresponding evaluation label for each edge server based on the sampled tuple data; the evaluation result of the i-th edge server based on the j-th tuple is q_b^i = Q^(i)(s_b, a_b), i.e. the result obtained by inputting s_b and a_b into the i-th Critic online network; the evaluation label of the i-th edge server based on the j-th tuple is y_b^i = r_b + γ·Q'^(i)(s_{b+1}, a'_{b+1}), where r_b is the reward value at time b, γ is the reward discount rate, and Q'^(i)(s_{b+1}, a'_{b+1}) is the result obtained by inputting s_{b+1} and a'_{b+1} into the i-th Critic target network;
Training each Critic online network by minimizing the difference between the evaluation result of each edge server and the corresponding evaluation label; and after training for a plurality of rounds, updating the corresponding Critic target network based on the parameters of the Critic online network.
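A minimal PyTorch-style sketch of one such Critic training step is given below; the function and tensor names are assumptions, and the critic / critic_target objects stand in for the i-th Critic online and target networks described above:

```python
import torch
import torch.nn.functional as F

def train_critic_step(critic, critic_target, optimizer,
                      s_b, a_b, r_b, s_b1, a_b1_target, gamma=0.99):
    """One update of the i-th Critic online network from a sampled batch.
    s_b, a_b     : joint states / scores of all edge servers at time b
    r_b          : reward values at time b (shape [B, 1])
    s_b1, a_b1_target: joint states at b+1 and scores from the Actor target nets
    """
    with torch.no_grad():
        # Evaluation label y_b = r_b + gamma * Q'(s_{b+1}, a'_{b+1})
        y_b = r_b + gamma * critic_target(s_b1, a_b1_target)
    q_b = critic(s_b, a_b)               # evaluation result Q(s_b, a_b)
    loss = F.mse_loss(q_b, y_b)          # minimise (y_b - q_b)^2 over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```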
In a second aspect, the present invention provides a copy selection method based on the above distributed storage system, including: in the running process of the distributed storage system, when the server receives a copy access request, ranking the edge servers based on scores of the edge servers, and selecting the edge server with the highest ranking and the data copy as a node selected by the copy to access the data.
Further preferably, all edge servers in the distributed storage system constitute a Ceph system; and normalizing the score of each edge server with the data copy by the Ceph system, taking the score as affinity-primary parameter values corresponding to the edge servers, and selecting the edge server for data access based on the affinity-primary parameter values.
Further preferably, the Ceph system normalizes the score of each edge server having a copy of the data using a max-min normalization method.
In a third aspect, the present invention provides a copy selection system comprising: a memory storing a computer program and a processor executing the copy selection method provided by the second aspect of the present invention.
In a fourth aspect, the present invention also provides a computer readable storage medium comprising a stored computer program, wherein the computer program, when executed by a processor, controls a device in which the storage medium is located to perform the copy selection method provided in the second aspect of the present invention.
In general, through the above technical solutions conceived by the present invention, the following beneficial effects can be obtained:
1. The invention provides a distributed storage system in which different network structures are deployed at the cloud and at the edge server side. Because the delay from the edge to the cloud is large, many factors affect the quality of service of the system in the edge environment, and the score of an edge server is a continuous value, the invention sets an Actor network in each edge server to rapidly calculate the score (rank) of that edge server, instead of uniformly calculating and redistributing scores through the cloud. In addition, Critic networks are deployed in the cloud to comprehensively consider the information of all Actor networks for joint action evaluation. Each Actor network is trained based on the evaluation result output by its corresponding Critic network, and each Critic network is trained based on data randomly sampled from the experience pool; the training processes of the Actor networks and the Critic networks are independent of each other and run continuously, so that the service quality of each edge server can be accurately scored at every moment. By maintaining a server ranking among the servers and distributing it to clients, copy selection has complete server state information without forwarding delay cost, the transmission cost of cloud-edge data is greatly reduced, and the invention is better adapted to copy selection in the edge environment, reduces request processing delay in the edge environment, and achieves both performance and reliability.
2. According to the distributed storage system provided by the invention, the Actor network and the Critic network are both of a double-network structure, so that the learning stability is greatly improved, and the accuracy of copy selection is further improved.
3. In the distributed storage system provided by the invention, considering that the data access service is stateful, a data access request can only perform copy selection among servers that hold data copies, so the placement position of the copies affects the effectiveness of the copy selection strategy; the invention therefore optimizes the copy placement positions so that copy selection can be performed better.
4. Because intrusively modifying copy selection would involve many internal system flows, it is very difficult to perfectly embed a copy selection mechanism into an existing system. The copy selection method provided by the invention therefore designs an additional processing flow around the existing internal mechanisms of the Ceph system to change the selection of copies: after the score of each edge server is obtained, it is normalized, and based on the affinity-primary parameter values the Ceph system selects the primary OSD node, i.e. the highest-ranking edge server holding a data copy, as the copy-selection node for data access.
Drawings
Fig. 1 is a schematic structural diagram of a distributed storage system according to embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of an Actor network structure provided in embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of a Critic network according to embodiment 1 of the present invention;
FIG. 4 is a multi-agent reinforcement learning data flow diagram in the edge environment provided in embodiment 1 of the present invention;
FIG. 5 is a schematic diagram of a dual "root bucket" structure provided in embodiment 1 of the present invention;
FIG. 6 is a rule implementation in a dual "root bucket" structure provided in embodiment 1 of the present invention;
FIG. 7 is a graph showing the average delay results of different copy selection strategies under the three loads Read-only, Read-heavy and Update-heavy, provided in embodiment 2 of the present invention;
fig. 8 is a schematic diagram of an average response delay result at each moment of each node using 3 different strategies under Read-only load according to embodiment 2 of the present invention;
Fig. 9 is a schematic diagram of the delay effect of different client numbers on three copy selection policies under Read-only load according to embodiment 2 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Embodiment 1
A distributed storage system, as shown in fig. 1, comprising: cloud end and server end; the server side comprises: a plurality of distributed edge servers; each edge server is provided with an Actor network; the cloud is provided with a plurality of Critic networks, the number of the Critic networks is the same as that of the edge servers, and one Critic network corresponds to one Actor network;
the operation process of the distributed storage system comprises the following steps:
The operation process of the server side comprises the following steps:
At each time t, each edge server performs the following operations: the edge server collects the current state data of the network environment in which it is located as its state information, and inputs this state data into the Actor network in the edge server, which scores the service quality of the edge server, to obtain the score of the edge server; the state information and scores of all edge servers are sent to the corresponding Critic network in the cloud to obtain an evaluation result, and the Actor network in the edge server is trained with the objective of maximizing the evaluation result;
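A minimal sketch of the per-time-step work on one edge server; the helper callables collect_state, apply_score and send_to_cloud and the message format are hypothetical stand-ins for the collection, adapter and transport mechanisms described in the modules below:

```python
def edge_step(actor, collect_state, apply_score, send_to_cloud):
    """One time step on an edge server: observe the local environment,
    score the server's own service quality, and report to the cloud."""
    s_t = collect_state()      # current state of the local network environment
    a_t = actor(s_t)           # score from the local Actor network
    apply_score(a_t)           # apply the score (e.g. via the adapter module)
    send_to_cloud({"state": s_t, "score": a_t})   # hypothetical message format
    return s_t, a_t
```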
It should be noted that each edge server runs an OSD process, i.e., contains an OSD node. After the scores of the edge servers are normalized, they are used by the Ceph system as the affinity-primary parameter values of the OSD nodes of the respective edge servers, and based on the affinity-primary parameter values the Ceph system selects the primary OSD node, i.e. the highest-ranking edge server holding a data copy, as the copy-selection node for data access. Specifically, normalization methods such as tanh normalization, sigmoid normalization and max-min normalization can be adopted. Preferably, the max-min normalization method is adopted to normalize the scores of the edge servers, since compared with other normalization methods (such as tanh normalization and sigmoid normalization) it preserves the original data information more completely.
In an alternative implementation manner, the edge server mainly comprises a Ceph system module, an information acquisition module, a scoring module and an adapter module;
A scoring module: the scoring module mainly consists of the Actor network in Deep Deterministic Policy Gradient (DDPG) reinforcement learning; it outputs an action (score) for the edge server according to the information acquired by that edge server alone, and sends a group of information such as state information, action and performance to the cloud.
An adapter module: because intrusively changing the selection of copies according to scores would involve a large number of internal system flows, this module is dedicated to interfacing with a specific system, transforming the scores into a mechanism of the specific system that can change the selection of copies. Specifically, an additional processing flow is designed around the existing internal mechanisms of the Ceph system to change the selection of copies. Because the OSD nodes on which objects are placed in the Ceph system are computed directly by the CRUSH algorithm, and the read and write operations of an object are completed through its primary OSD node, the primary OSD carries a lot of the system's processing logic. Intruding into the system to change the target OSD node of a request directly from the score (rank) would involve a large number of existing mechanisms within the system. Therefore, the invention changes the copy-selection node by changing the primary OSD node corresponding to an object, starting from the selection of the primary OSD node. In Ceph, OSD nodes are selected through a "lottery (straw) algorithm": the three OSD nodes (for three copies) that draw the longest straws among all nodes are selected as the data placement nodes, and in this initial OSD sequence the primary OSD node is the one with the longest straw. To obtain more dynamics, a processing flow is designed for the selection of the primary OSD node: the Ceph system provides an Affinity-Primary parameter to control the probability that each OSD node becomes the primary OSD node. In the Ceph system, the Affinity-Primary interval is [0, 1], and the score values output by the neural network obviously exceed this range, so the values output by the neural network need to be mapped into this interval. This embodiment uses the max-min normalization method for the mapping; compared with other normalization methods (such as tanh normalization and sigmoid normalization), it preserves the original data information more completely, as shown in the formula:

affinity_i = (a_i - min_j a_j) / (max_j a_j - min_j a_j)
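A small sketch of the normalization performed by the adapter module; the function name is hypothetical, and writing the resulting values into the Ceph Affinity-Primary parameter is left to Ceph's own interfaces:

```python
def scores_to_primary_affinity(scores):
    """Map raw Actor-network scores to the [0, 1] range expected by the
    Ceph Affinity-Primary parameter, using max-min normalization."""
    lo, hi = min(scores), max(scores)
    if hi == lo:                           # all scores equal: no preference
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```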
The cloud operation process comprises the following steps:
At each time t, the cloud performs the following operations: collecting the information sent by all edge servers; after the information sent by all edge servers at time t has been collected, calculating the reward value r_{t-1} at time t-1 and storing the corresponding tuple information into an experience pool; when the experience pool is full of data, randomly sampling tuple information data from the experience pool to train all Critic networks simultaneously; wherein the tuple information includes: the state information of all edge servers at time t-1, the scores of all edge servers at time t-1, the reward value at time t-1, and the state information of all edge servers at time t.
In this embodiment, the reward value r_{t-1} at time t-1 is:

r_{t-1} = Σ_{i=1}^{N} (L̄ - L_i)(n_i - n̄)

where N is the number of edge servers; L_i is the average delay of the i-th edge server; L̄ is the mean of the average delays of all edge servers; n_i is the number of requests processed by the i-th edge server; and n̄ is the mean of the numbers of requests processed by all edge servers.
In an alternative embodiment, the cloud mainly comprises a reward calculation module, an experience pool module, an evaluation module and a copy placement optimization module.
A reward calculation module: the reward calculation module maintains the state and action information of the previous time step, receives the current state and action information, calculates the overall reward value of the system for the previous round, and stores the tuple consisting of the previous state, the previous action, the reward value, the current state and the action output by the Actor target networks in the experience pool. It should be noted that the reward calculation module needs to maintain all information at time t-1 (i.e. the state s_{t-1} and the action information a_{t-1}) and collect the information of all edge servers at time t (i.e. the state s_t and the action information a'_t). The reward value r_{t-1} at time t-1 can then be calculated from the information at time t, and the tuple (s_{t-1}, a_{t-1}, r_{t-1}, s_t, a'_t) is stored in the experience pool. Specifically, how the reward value is calculated is critical to reinforcement learning. This embodiment uses the number of requests processed by each node to measure the reward: when the average delay L_i of a node is lower than the overall average delay L̄, the requests it processes yield positive reward feedback; the number of requests processed by the i-th node at time t is defined as n_i. Considering that the lower the delay of a node, the more reward its processed requests should earn, the requests of different nodes should carry different reward weights. With weights introduced, not all requests carry reward feedback; only the difference between the number of requests a node processes and the mean carries reward feedback, representing that the node processed more or fewer requests than average (based on treating the number of requests processed per node equally). The number of requests carrying reward feedback for each node is therefore defined as n_i - n̄. Since the reward weight of each node is related to its delay L_i, the difference between the overall average delay and the node's average delay is used directly as the weight, i.e. w_i = L̄ - L_i. The final reward value is then defined as: r_{t-1} = Σ_{i=1}^{N} (L̄ - L_i)(n_i - n̄).
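The reward defined above can be computed directly from the per-node statistics collected at time t, as in the following sketch (variable names hypothetical):

```python
def compute_reward(avg_delays, request_counts):
    """r_{t-1} = sum_i (L_bar - L_i) * (n_i - n_bar)."""
    n = len(avg_delays)
    l_bar = sum(avg_delays) / n          # overall average delay
    n_bar = sum(request_counts) / n      # average number of processed requests
    return sum((l_bar - l_i) * (n_i - n_bar)
               for l_i, n_i in zip(avg_delays, request_counts))
```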
And an evaluation module: the evaluation module consists of a Critic network in DDPG reinforcement learning and evaluates the action information of the Actor network. The evaluation value output by the Critic network is used as 'supervision information' of the Actor network learning, and historical data is sampled from the experience pool to carry out training learning.
Copy placement optimization module: considering that the storage system provides a stateful data access service, the placement of copies affects which nodes the selection policy can choose from. By migrating data, the placement positions of copies are optimized so that copy selection can be performed better.
The input of the Actor network is the state information s observed by each edge server itself, and the output is the score (action), defined as a. In an alternative embodiment, the specific implementation structure of the Actor network is shown in fig. 2: the whole Actor network consists of two fully connected layers (Linear layers) and one ReLU activation layer. Considering that edge server resources are limited, MARLRS defines the output of the first fully connected layer, i.e. the middle hidden layer, as 50-dimensional for lower computational overhead. The weight matrices of the two fully connected layers of the Actor network are defined as w_a1 and w_a2, with dimensions len(s) × 50 and 50 × 1 respectively, where len(s) is the dimension of the state. The calculation represented by the Actor network is then: a = Relu(s · w_a1) · w_a2.
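A PyTorch sketch following this description (two fully connected layers with a 50-dimensional hidden layer and a ReLU between them); the class name and the bias terms of nn.Linear are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn

class ActorNetwork(nn.Module):
    """a = Relu(s * w_a1) * w_a2, with a 50-dimensional hidden layer."""
    def __init__(self, state_dim, hidden_dim=50):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)   # w_a1: len(s) x 50
        self.fc2 = nn.Linear(hidden_dim, 1)           # w_a2: 50 x 1

    def forward(self, s):
        return self.fc2(torch.relu(self.fc1(s)))      # scalar score (action)
```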
The Critic network is used to evaluate the calculation result of the Actor network, i.e. the output of the Critic network is the supervision information for the Actor network's learning. The better the result of the Actor network, the more positive and the larger the output of the Critic network; the worse the result of the Actor network, the more negative and the smaller the output of the Critic network. The Critic network takes as input both the input s of the Actor networks and their output a. In an alternative implementation, the specific structure of the Critic network is shown in fig. 3: the inputs s and a are each passed through one fully connected layer to compute intermediate results, defined as mid_s and mid_a respectively; the weight matrices of these two fully connected layers are defined as w_cs and w_ca, with matrix dimensions len(s) × 200 and N × 200, where N is the number of edge servers. The calculations are mid_s = s · w_cs and mid_a = a · w_ca. The intermediate outputs of the network are then linearly summed: mid = mid_s + mid_a + b, where b is a noise matrix. Then, as in the Actor network, the evaluation result q is calculated through an activation function and a fully connected layer; defining the weight matrix of the last fully connected layer as w_c, the calculation represented by the Critic network is: q = Relu(mid) · w_c.
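A corresponding PyTorch sketch of the Critic structure; treating the noise matrix b as a learnable bias parameter is an assumption:

```python
import torch
import torch.nn as nn

class CriticNetwork(nn.Module):
    """q = Relu(s*w_cs + a*w_ca + b) * w_c, for joint state s and joint action a."""
    def __init__(self, state_dim, num_servers, hidden_dim=200):
        super().__init__()
        self.fc_s = nn.Linear(state_dim, hidden_dim, bias=False)    # w_cs: len(s) x 200
        self.fc_a = nn.Linear(num_servers, hidden_dim, bias=False)  # w_ca: N x 200
        self.b = nn.Parameter(torch.zeros(hidden_dim))              # noise/bias term b
        self.fc_q = nn.Linear(hidden_dim, 1)                        # w_c: 200 x 1

    def forward(self, s, a):
        mid = self.fc_s(s) + self.fc_a(a) + self.b    # linear sum of projections
        return self.fc_q(torch.relu(mid))             # evaluation result q
```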
Further, because the delay from the edge to the cloud is large, the invention sets an Actor network in each edge server to quickly calculate the score (rank) of each node, instead of uniformly calculating and redistributing scores through the cloud. The Critic networks need to comprehensively consider the information of all Actor networks to perform joint action evaluation, and in order to train the networks better, a batch of data needs to be randomly sampled from the experience pool for simultaneous learning, so all Critic networks are deployed at the cloud (one Actor network corresponds to one Critic network, i.e. the i-th Actor network corresponds to the i-th Critic network, i = 1, 2, ..., N). In order to improve learning stability, in an alternative embodiment, both the Actor network and the Critic network adopt a dual-network arrangement. Specifically, the Actor network includes: an Actor online network and an Actor target network; the Critic network includes a Critic online network and a Critic target network; and the action of the Actor online network at each moment is evaluated by the Critic online network.
The operation process of the distributed storage system comprises the following steps:
At each time t, each edge server performs the following operations: the edge server collects the current state data of the network environment where the edge server is located as the state information of the edge server, and inputs the current state data into an Actor online network and an Actor target network in the edge server respectively to obtain scores output by the Actor online network and scores output by the Actor target network; the edge server sends the state information and scores output by the Actor online networks of all the edge servers to corresponding Critic online networks in the cloud to obtain evaluation results, and training the Actor online networks in the edge server by taking the maximized evaluation results as targets; after training for a plurality of rounds, updating the Actor target network based on the parameters of the Actor online network;
At each time t, the cloud performs the following operations: collecting the information sent by all edge servers; after the information sent by all edge servers at time t has been collected, calculating the reward value at time t-1 and storing the corresponding tuple information into an experience pool; when the experience pool is full of data, randomly sampling tuple information data from the experience pool to train all Critic networks simultaneously; the tuple information includes: the state information s_{t-1} of all edge servers at time t-1, the scores a_{t-1} output by the Actor online networks of all edge servers at time t-1, the reward value r_{t-1} at time t-1, the state information s_t of all edge servers at time t, and the scores a'_t output by the Actor target networks of all edge servers at time t; where s_{t-1} = (s_{t-1}^1, ..., s_{t-1}^N), s_{t-1}^i being the state information of the i-th edge server at time t-1; a_{t-1} = (a_{t-1}^1, ..., a_{t-1}^N), a_{t-1}^i being the score output by the Actor online network of the i-th edge server at time t-1; s_t = (s_t^1, ..., s_t^N), s_t^i being the state information of the i-th edge server at time t; a'_t = (a'^1_t, ..., a'^N_t), a'^i_t being the score output by the Actor target network of the i-th edge server at time t; N is the number of edge servers.
Specifically, a method for training a Critic network by randomly sampling tuple information data from an experience pool comprises the following steps:
The total number of tuples sampled from the experience pool is B; the j-th sampled tuple is recorded as (s_b, a_b, r_b, s_{b+1}, a'_{b+1}), where s_b^i is the state information of the i-th edge server at time b, a_b^i is the score output by the Actor online network of the i-th edge server at time b, and a'^i_{b+1} is the score output by the Actor target network of the i-th edge server at time b+1;
Acquiring an evaluation result and a corresponding evaluation label for each edge server based on the sampled tuple data; the evaluation result of the i-th edge server based on the j-th tuple is q_b^i = Q^(i)(s_b, a_b), i.e. the result obtained by inputting s_b and a_b into the i-th Critic online network; the evaluation label of the i-th edge server based on the j-th tuple is y_b^i = r_b + γ·Q'^(i)(s_{b+1}, a'_{b+1}), where r_b is the reward value at time b, γ is the reward discount rate, and Q'^(i)(s_{b+1}, a'_{b+1}) is the result obtained by inputting s_{b+1} and a'_{b+1} into the i-th Critic target network;
Training each Critic online network by minimizing the difference between the evaluation result of each edge server and the corresponding evaluation label; and after training for a plurality of rounds, updating the corresponding Critic target network based on the parameters of the Critic online network.
It should be noted that, in the dual-network structure of the Actor and the Critic, the online network and the target network have the same network model setting, but the weight parameters between the networks are different; specifically, the structure of the Actor online network and the Actor target network are the same as the structure of the Actor network, and are not described herein. The structure of the Critic online network and the Critic target network are the same as the structure of the Critic network, and will not be described here. The online network weight is updated in real time (single step), and the target network weight is updated according to the online network weight after the online network is updated in n steps.
Specifically, the neural network calculation processes of the Actor online network, the Actor target network, the Critic online network and the Critic target network are respectively defined as functions μ^(i), μ'^(i), Q^(i) and Q'^(i), and their overall parameters are respectively defined as θ_μ^(i), θ_{μ'}^(i), θ_Q^(i) and θ_{Q'}^(i), where i denotes the number of the edge server. To further illustrate the operation of the above-described distributed storage system, the complete data flow of the Actor networks and the Critic networks in the edge environment is described below, taking the multi-agent reinforcement learning data flow diagram in the edge environment shown in fig. 4 as an example:
1) First, there is clock synchronization processing between the edge servers. At time t, all edge servers observe and obtain their own environmental state information, which is defined as s_t^i.
2) Then, the state information s_t^i is used as the input of the Actor online network, and the action a_t^i at time t is calculated through the neural network; the formula is a_t^i = μ^(i)(s_t^i). Each edge server then directly performs the action a_t^i.
3) The tuple (s_t^i, a_t^i), together with the additional information needed for reward calculation, is sent to the reward calculation module of the cloud. Considering that the input of the Critic target network depends on the output of the Actor target network, s_t^i is also input into the Actor target network for calculation at this stage, and the output is defined as a'^i_t; the formula is a'^i_t = μ'^(i)(s_t^i). If a'^i_t were not calculated at this stage, then every time the Critic target network performs a calculation, data would have to be sent from the cloud to the edge and, after the Actor target network at the edge has computed, the corresponding data sent back to the cloud, which generates additional overhead. Performing the corresponding calculation at this stage saves this unnecessary overhead.
4) The reward calculation module aggregates the information of all edge servers and maintains the information at time t-1. The global system reward r_{t-1} at time t-1 can thus be calculated from the information at time t. The tuple (s_{t-1}, a_{t-1}, r_{t-1}, s_t, a'_t) is then stored in the experience pool for random-sample learning by the Critic networks.
5) The Critic networks randomly sample B tuples from the experience pool, of the form stored in stage 4); specifically, the j-th sampled tuple is recorded as (s_b, a_b, r_b, s_{b+1}, a'_{b+1}).
6) This stage is completely parallel to stage 5) and the two do not interfere with each other. The behavior of the Actor online networks at time t is evaluated using the corresponding Critic online network; the evaluation result is defined as q_t^i, with formula q_t^i = Q^(i)(s_t, a_t), i.e. the i-th Critic online network receives the state s_t corresponding to the Actor online networks and the joint action a_t.
7) s_{b+1} and a'_{b+1} in the j-th tuple are input into the i-th Critic target network to obtain Q'^(i)(s_{b+1}, a'_{b+1}). Using the reward value r_b and Q'^(i)(s_{b+1}, a'_{b+1}), the "supervision information" required for Critic online network learning is calculated (the label of the Critic online network differs from a label in supervised learning: it depends on the Critic target network, which is itself still being learned in the system). The evaluation label of the i-th edge server obtained from the j-th tuple is:

y_b^i = r_b + γ · Q'^(i)(s_{b+1}, a'_{b+1})

where γ is the reward discount rate.
8) The online networks perform forward propagation and calculate gradients, which is the first step of the online network training process. Both the Actor and Critic networks perform this process, but not simultaneously (they are on different machines) and on different training data. In the original DDPG design, the Actor network and the Critic network use the same batch of sampled data for training, but here the Actor and Critic networks are on different machines and the experience pool is placed in the cloud; reusing the original model would incur additional overhead (and time). Therefore, the invention makes the Actor network learn only from the data at time t, while the Critic networks randomly sample data of size B from the experience pool and perform training and learning simultaneously.
9) At this stage, the loss value of the corresponding network is calculated, and back propagation updates the online network parameters. In the forward propagation phase, the i-th Critic online network takes s_b and a_b from the j-th stored tuple as input and outputs the evaluation result q_b^i = Q^(i)(s_b, a_b). The loss value is calculated from q_b^i and the label y_b^i, as shown in the following formula:

L_Q^(i) = (1/B) Σ_{b=1}^{B} (y_b^i - q_b^i)²

where B is the size of the batch of sampled data.
Further, the Actor online network directly uses the evaluation information q_t^i of the Critic online network as the criterion for judging whether its behavior is good or bad: the larger q_t^i is, the better the decision made by the Actor network, so the Actor online network modifies its weight parameters in the direction that makes q_t^i larger. The loss function of the Actor online network is defined as:

L_μ^(i) = - q_t^i = - Q^(i)(s_t, a_t)

Back propagation is then performed to update the parameters θ_μ^(i) and θ_Q^(i) of the Actor and Critic online networks, respectively.
10) After the online networks have been updated in real time for n steps, the network weights of the target networks are updated from the weight information of the online networks. Instead of directly making a complete copy of the online network weight parameters, a learning rate τ is defined, and the target network learns a portion of the content from the online network each time, a process called Soft Update. The target network parameter update formulas are respectively:

θ_{μ'}^(i) ← τ·θ_μ^(i) + (1 - τ)·θ_{μ'}^(i)
θ_{Q'}^(i) ← τ·θ_Q^(i) + (1 - τ)·θ_{Q'}^(i)
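A minimal PyTorch sketch of this soft update, assuming the online and target networks share the same module structure (the value of τ is illustrative):

```python
import torch

@torch.no_grad()
def soft_update(online_net, target_net, tau=0.01):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```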
Further, in an optional implementation manner, in the process in which the cloud performs the operations at each time t, after the experience pool is not yet full of data or the Critic network training has been completed, it is judged whether the elapsed time exceeds a preset time period (600 s in this implementation); if so, the scores of the edge servers at different times are obtained from the experience pool and the average score of each edge server is calculated; taking the median of the average scores of the edge servers as the dividing point, the edge servers are divided into low-delay edge servers and high-delay edge servers, where the average score of a low-delay edge server is greater than or equal to the dividing point and the average score of a high-delay edge server is smaller than the dividing point; the edge servers are partitioned using two "root bucket" structures, denoted the Low bucket and the High bucket respectively; the ⌈N/2⌉ low-delay edge servers are placed in the Low bucket and the ⌊N/2⌋ high-delay edge servers are placed in the High bucket; ⌈M/2⌉ edge servers are selected in the Low bucket and ⌊M/2⌋ edge servers are selected in the High bucket to place the copies; otherwise, the cloud operation at time t ends; where N is the number of edge servers and M is the number of copies.
Specifically, in the above alternative embodiment, the overall flow of the distributed storage system includes:
Edge portion: at each time t, the edge server begins collecting current state data, after which actions are calculated using the Actor network. And then performing adaptation operation on the action, executing the adaptation action, and simultaneously sending information such as state, action and performance to the cloud. And finally, waiting for the evaluation result of the cloud to train and learn the Actor network.
Cloud portion: after the cloud collects the information of all the edge servers at the t moment, the cloud starts to calculate the rewarding value at the t-1 moment, and stores the corresponding tuple information into an experience pool for the sampling study of the Critic network. Next, the cloud uses Critic networks to evaluate the behavior of all edge servers. And then sending the evaluation result to each edge server, and judging whether the experience pool is full of data. If there is enough data, the Critic network will randomly sample the data from the experience pool and perform training learning on the Critic network. If the data is not enough, directly judging whether the time period of the copy placement adjustment is elapsed. And if yes, directly starting to acquire scoring data from the experience pool, and calculating service performance expectations of each server. And finally, partitioning the server according to the expected value, and changing the placement position of the copy. Otherwise, ending the flow.
It should be noted that the storage system provides a stateful data access service, which means that data access requests can only be selected between edge servers where copies of data exist, so that the placement of copies will affect the decision of selection. However, the placement of the data is random, so that at time t, it may happen that copies of the existing data are on servers with higher latency, requests to access this portion of the data will have higher latency overhead, and the response latency of this portion of the request cannot be optimized well by the copy selection policy alone. If it is assumed that there are now 8 edge servers, the response delay of each edge server is set at 2-9 ms, and there are now 8 files to put into the storage system. Assuming a 3-copy policy is used by the storage system, each edge server will store 3 files (considering the data is evenly distributed), and if the data is randomly placed, copies of the data that may appear are stored in servers with higher response delays. In order to solve the above problems, in one mode, the files can be directly exchanged to ensure that all data are in an edge server with lower response delay, and then the access requests of all data can obtain lower response delay through a copy selection strategy; however, data migration is overhead and requires some time to complete, and there is a transfer delay between servers in an edge scenario, requiring more time to complete the data migration task. Thus, the placement of the replica cannot update the policy in real time as the selection of the replica, the placement of the replica should have a larger policy update time period than the selection of the replica, and how to measure the performance of the server over a long period of time is a difficult task.
Aiming at the situations that can occur in data placement, the invention designs a copy placement optimization strategy based on ranking expectation (denoted RDRP): data are migrated to optimize the placement of copies, and the corresponding data are migrated to servers with lower delay. Considering that the invention ranks the edge servers once at each time t, and that the purpose of RDRP is to enable better copy selection, RDRP uses the server ranking expectation over the period to measure the performance ranking that each edge server can provide when optimizing copy placement. The score of each server at each time t is the output a_t^i of its Actor network; defining a long time period to contain m times in total, the ranking expectation of each server is:

E^(i) = (1/m) Σ_{t=1}^{m} a_t^i
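The ranking expectation can be computed as below, assuming the per-time scores a_t^i have been read back (e.g. from the experience pool); the names are illustrative:

```python
def ranking_expectation(score_series):
    """E^(i) = (1/m) * sum over the m times of a_t^(i), per edge server.
    score_series: dict mapping server id -> list of its m scores."""
    return {sid: sum(scores) / len(scores) for sid, scores in score_series.items()}
```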
Meanwhile, if all copies of the data were placed at the highest-ranking nodes, not only would the data become unbalanced, but those nodes would also suffer excessive requests, causing the response delay to increase and defeating the aim of the data placement strategy. Thus, RDRP divides the edge servers into a lower-latency portion and a higher-latency portion according to the ranking expectations, ensuring in a balanced way that every piece of data has at least one copy on a lower-latency edge server. Specifically, the implementation of RDRP is designed by combining the built-in rules of the Ceph system: to realize more flexible placement, the Ceph system provides bucket and rule structures in the cluster topology, and various flexible data placement strategies can be realized by combining buckets with rules.
According to the partition design of lower delay and higher delay, two buckets need to be defined to hold the OSD nodes with different predicted scores; however, this only solves the partitioning of OSD nodes, while the specific selection of data placement positions is controlled by rules. The invention achieves this by designing two "root buckets" and defining the corresponding rule flow; the placement of data is changed in a non-intrusive manner in the Ceph system by combining buckets with rules. The two "root buckets" are designed as shown in fig. 5: the dual "root buckets" are defined as a Low bucket and a High bucket respectively, where the Low bucket holds nodes with lower delay and the High bucket holds nodes with higher delay. ⌈N/2⌉ Hosts (i.e. edge servers) are placed in the Low bucket and the remaining ⌊N/2⌋ Hosts in the High bucket. Defining the number of copies of the data as M, the selection rule selects ⌈M/2⌉ Hosts in the Low bucket and ⌊M/2⌋ Hosts in the High bucket, which ensures that every piece of data has a copy on a node with lower delay.
The present embodiment shows the implementation of the specific bucket structure and rule definition with 3 copies and 5 OSD nodes. Table 1 shows the specific bucket construction details.
TABLE 1
As shown in Table 1, which contains the implementation of 7 buckets, the first field indicates the type of the bucket. The remaining 4 kinds of fields are the specific definition information of the bucket: id represents the unique identification number of the bucket (in Ceph, buckets are numbered downward from -1 and OSD nodes are numbered upward from 0); alg represents the placement selection algorithm for the sub-buckets or OSD nodes in the bucket (for the placement algorithm, the invention considers that the bucket structure needs to be changed to migrate data, and uses straw2, an upgraded version of the "lottery (straw) algorithm", to reduce data migration); hash represents the hash function used in the calculation process (0 represents the default function jenkins1); item represents a sub-bucket or OSD node placed in the bucket.
The rule definition is shown in fig. 6, where ruleset represents the unique identity in the rule set; type represents the way in which multiple copies are stored (replication or erasure codes); and the final step entries represent the specific selection procedure. There are three types of operations in step: take, choose and emit. take means obtaining a "root bucket"; choose means selecting sub-buckets or OSD nodes; and emit means ending the selection from one "root bucket". In the choose type of operation, the first parameter is the manner of choice (the firstn, depth-first traversal, method is used herein); the second parameter is the number to select; the third parameter is a category identifier; and the fourth parameter is the specific category (which may be a bucket or an OSD).
Changing the bucket structure at system run time changes the placement location of the copies; Algorithm 1 is the pseudocode of the bucket replacement algorithm, as shown in Table 2. The algorithm first empties the Low and High root buckets, and then adds the corresponding Host buckets into the root buckets according to the ranking expectation.
TABLE 2
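The behaviour of the bucket replacement algorithm in Table 2 can be sketched as follows; the crush_map object with empty_bucket / add_host operations is purely hypothetical, standing in for the corresponding CRUSH map edits:

```python
def rebuild_root_buckets(crush_map, ranking_expectation):
    """Re-partition Host buckets between the Low and High root buckets
    according to the ranking expectation of each edge server."""
    hosts = sorted(ranking_expectation, key=ranking_expectation.get, reverse=True)
    half = (len(hosts) + 1) // 2                 # ceil(N/2) hosts go to Low
    crush_map.empty_bucket("Low")                # 1) empty both root buckets
    crush_map.empty_bucket("High")
    for host in hosts[:half]:                    # 2) lower-delay hosts -> Low
        crush_map.add_host("Low", host)
    for host in hosts[half:]:                    # 3) higher-delay hosts -> High
        crush_map.add_host("High", host)
```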
In summary, the present invention enables copy selection with complete server state information and no forwarding delay overhead by maintaining a server ranking among the servers and distributing it to clients. Then, aiming at the many factors that influence the quality of service of the system in the edge environment, a neural network is used to establish a high-dimensional performance model, and a performance modeling method based on multi-agent reinforcement learning is designed. By adjusting the structure and data flow of the basic model, different network structures can be deployed at the cloud and the edge to accelerate the adjustment of the copy selection strategy and reduce the transmission overhead of cloud-edge data. Finally, considering that the placement position of copies influences copy selection, a copy placement optimization method based on ranking expectation is designed: the copy placement locations are adjusted according to the ranking expectation of the servers, so that requests can select lower-delay servers and request processing delay is reduced. The invention is thus better suited to copy selection in the edge environment and achieves both performance and reliability.
EXAMPLE 2,
A copy selection method based on the distributed storage system of embodiment 1, comprising:
During operation of the distributed storage system, when the server side receives a copy access request, the edge servers are ranked based on their scores, and the highest-ranked edge server holding the data copy is selected as the copy-selection node for accessing the data.
Preferably, all edge servers in the distributed storage system form a Ceph system; the Ceph system normalizes the score of each edge server that holds the data copy, uses the normalized score as the primary-affinity parameter value of the corresponding edge server, and selects the edge server for data access based on the primary-affinity parameter values.
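As a minimal sketch of this selection rule, assuming max-min normalization of the scores of the servers holding the copy and that the normalized value plays the role of the server's primary-affinity, the choice could look as follows; the server names and score values are illustrative.

```python
def normalize_scores(scores):
    """Max-min normalize raw scores into [0, 1] for use as primary-affinity values."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {srv: (s - lo) / span for srv, s in scores.items()}

def select_server(scores, replica_holders):
    """Normalize the scores of the servers holding the copy and pick the best one."""
    affinity = normalize_scores({srv: scores[srv] for srv in replica_holders})
    return max(affinity, key=affinity.get)

# Example: B and C hold copies; B has the better normalized score and is chosen.
scores = {"A": 0.31, "B": 0.87, "C": 0.55}
print(select_server(scores, replica_holders={"B", "C"}))  # -> B
```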
The related technical solution is the same as that of embodiment 1, and will not be described here in detail.
To illustrate the performance of the replica selection method provided by the invention, performance tests were run on three types of loads for three replica selection methods, with the number of clients set to 10. FIG. 7 shows the average delay results of the different replica selection strategies under the three loads Read-only, Read-heavy and Update-heavy, where the abscissa is the specific replica selection strategy and the ordinate is the corresponding performance index (average delay, in ms); MARLRS is the replica selection strategy provided by the invention, while the decentralized On-Off and the centralized DRS-RT are two existing replica selection strategies. As can be seen from fig. 7, the centralized DRS-RT method has a higher average delay than the decentralized On-Off method, because request forwarding in the edge environment carries a higher delay overhead (transmission delay between nodes). Compared with the other two methods, the MARLRS method provided by the invention achieves lower response delay under all three loads, because it establishes a high-dimensional model with multi-agent reinforcement learning and uses the server-rank-based centralized replica selection mechanism. However, a comparison across the loads shows that the average-delay reduction of MARLRS over the other two methods is smallest under the load with the higher write ratio, because write operations incur synchronization overhead and MARLRS cannot control the choice of the synchronous replication nodes.
Further, Table 3 shows the average-delay reduction ratio of MARLRS provided by the invention over the other two methods under the three loads. Specifically, compared with the On-Off method, the average delay is reduced by 8.89%, 8.55% and 2.47%, respectively; compared with the DRS-RT method, it is reduced by 11.78%, 13.72% and 10.07%, respectively.
TABLE 3
Further, the performance of a distributed storage system is unstable due to a variety of factors, and the response delay for processing requests differs from node to node. At the edge, because of user mobility, different requests served by different servers also experience different response delays. The invention observes the stability of the system service by collecting the average response delay of the system at each moment over a long period, thereby verifying the effectiveness of the proposed MARLRS; each time interval is 1 second long. Specifically, fig. 8 shows the average response delay at each moment for each node using the 3 different strategies under Read-only load, where the three sub-graphs from top to bottom correspond to MARLRS, On-Off and DRS-RT; the abscissa is the moment in time, the ordinate is the delay (in ms), and the system average delay data was collected for a total of 1000 moments. It can be seen from the figure that the On-Off method has more load-oscillation moments, because a method that uses clients as the selection nodes has only a partial view, and multiple selection nodes have difficulty coordinating their policies and are therefore prone to oscillation. From the overall trend of each sub-graph, the overall average delay of the system fluctuates greatly with the On-Off and DRS-RT methods, which means the two methods do not allocate requests well; the MARLRS method provided by the invention also fluctuates, but its overall trend is smoother than the other two. After long observation it can be seen that MARLRS is more effective than the On-Off and DRS-RT replica selection strategies: it makes the response delay of the system more stable and provides a more stable quality of service.
Further, the variation of average delay for the different replica selection strategies is observed as the number of clients (and thus the overall system load) increases. The number of clients is set to 10, 20, 30, 40 and 50, and the average latency of each replica selection policy under the Read-only workload is measured. FIG. 9 shows the effect of the different client numbers on the delay of the three replica selection strategies under Read-only load. Although the average response delay of all three strategies rises as the number of clients (the system load) increases, compared with the On-Off method the average delay of the MARLRS provided by the invention is reduced by 8.89%, 10.02%, 11.34%, 12.76% and 14.43%, respectively, as the number of clients grows; compared with the DRS-RT method, the average delay of MARLRS is reduced by 11.78%, 12.04%, 12.12%, 12.15% and 11.88%, respectively. The average-delay reduction ratios of MARLRS for the different client numbers under Read-only load are shown in Table 4:
TABLE 4
As can be seen from the data in Table 4, the delay-reduction advantage of the MARLRS provided by the invention over On-Off grows as the number of clients increases. This shows that, as the number of clients (and hence the concurrency) increases, the On-Off switching strategy becomes less efficient at selection and finds it harder to coordinate decisions to suppress load oscillation, resulting in higher delays; at 40 clients its delay even exceeds that of the DRS-RT method, which relies on a forwarding mechanism. Meanwhile, the data in the table show that the delay reduction of MARLRS relative to the DRS-RT method changes little, because both MARLRS and DRS-RT make centralized decisions. As the concurrency grows, MARLRS, which makes a decision only at fixed intervals, may see increased delay within an interval under higher concurrency, but DRS-RT likewise faces the concurrency problem of single-point centralized decision making, which also leads to higher delay. Overall, the average response delay of the MARLRS method remains better than that of the other two methods as the number of clients increases.
In summary, the present invention discloses a copy selection method for a distributed storage system, which maintains a server ranking among the servers and distributes it to the clients, so that copy selection has complete server state information and no forwarding-delay overhead. For the many factors that affect system service quality in the edge environment, a neural network is used to establish a high-dimensional performance model, and a performance modeling method based on multi-agent reinforcement learning is designed. By adjusting the structure and data flow of the basic model, different network structures can be deployed on the cloud and at the edge, which accelerates the adjustment of the copy selection strategy and reduces the transmission overhead of cloud-edge data. Finally, considering that the placement location of a copy affects its selection, a copy placement optimization method based on ranking expectation is designed: the placement locations are adjusted according to the expectation of the server ranking, so that requests can reach lower-latency servers and request processing latency is reduced. The invention is thus better adapted to copy selection in the edge environment and takes both performance and reliability into account.
EXAMPLE 3,
A copy selection system, comprising: a memory storing a computer program and a processor that executes the computer program to perform the copy selection method provided in embodiment 2 of the present invention.
The related technical solution is the same as that of embodiment 2, and will not be described here in detail.
EXAMPLE 4,
A computer readable storage medium comprising a stored computer program, wherein the computer program, when executed by a processor, controls a device in which the storage medium resides to perform a copy selection method provided in embodiment 2 of the present invention.
The related technical solution is the same as that of embodiment 2, and will not be described here in detail.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (9)

1. A distributed storage system, comprising: cloud end and server end; the server side comprises: a plurality of distributed edge servers; each edge server is provided with an Actor network; the cloud is provided with a plurality of Critic networks, the number of the Critic networks is the same as that of the edge servers, and one Critic network corresponds to one Actor network;
the operation process of the distributed storage system comprises the following steps:
At each time t, each edge server performs the following operations: the edge server collects the current state data of the network environment in which it is located as its state information, and inputs the state information into the Actor network in the edge server, which is used for scoring the service quality of the edge server, to obtain the score of the edge server; the state information and scores of all the edge servers are sent to the corresponding Critic network in the cloud to obtain an evaluation result, and the Actor network in the edge server is trained with the goal of maximizing the evaluation result;
At each time t, the cloud performs the following operations: collecting the information sent by all edge servers, calculating the reward value r_{t-1} at time t-1 after the information sent by all edge servers at time t has been collected, and storing the corresponding tuple information into an experience pool; when the experience pool is full of data, randomly sampling tuple information data from the experience pool to train the Critic networks; wherein the tuple information includes: the state information of all edge servers at time t-1, the scores of all edge servers at time t-1, the reward value at time t-1, and the state information of all edge servers at time t;
In the process of executing the operations at each time t, after it is found that the experience pool is not full of data, or after the Critic network training is completed, it is judged whether the elapsed time period at time t is longer than a preset time period; if so, the scores of the edge servers at different times are obtained from the experience pool, and the average score of each edge server is calculated; the edge servers are divided into low-delay edge servers and high-delay edge servers with the median of the average scores of the edge servers as the dividing point, wherein the average score of a low-delay edge server is greater than or equal to the dividing point and the average score of a high-delay edge server is smaller than the dividing point; the edge servers are partitioned using two root bucket structures, denoted the Low bucket and the High bucket respectively; ⌈N/2⌉ low-delay edge servers are placed in the Low bucket, and ⌊N/2⌋ high-delay edge servers are placed in the High bucket; ⌈M/2⌉ low-delay edge servers are selected in the Low bucket and ⌊M/2⌋ high-delay edge servers are selected in the High bucket to place the copies; otherwise, the cloud operation at time t ends; wherein N is the number of edge servers and M is the number of copies.
2. The distributed storage system of claim 1, wherein the reward value r_{t-1} at time t-1 is:
wherein N is the number of edge servers; d_i is the average delay of the i-th edge server; d_avg is the average value of the average delays of all edge servers; q_i is the number of requests processed by the i-th edge server; and q_avg is the average of the numbers of requests processed by the edge servers.
3. The distributed storage system of any of claims 1-2, wherein the Actor network comprises: an Actor online network and an Actor target network; the Critic network comprises a Critic online network and a Critic target network;
the operation process of the distributed storage system comprises the following steps:
At each time t, each edge server performs the following operations: the edge server collects the current state data of the network environment in which it is located as its state information, and inputs the state information into the Actor online network and the Actor target network in the edge server respectively, to obtain the score output by the Actor online network and the score output by the Actor target network; the state information and the scores output by the Actor online networks of all the edge servers are sent to the corresponding Critic online network in the cloud to obtain an evaluation result, and the Actor online network in the edge server is trained with the goal of maximizing the evaluation result; after training for a plurality of rounds, the Actor target network is updated based on the parameters of the Actor online network;
At each time t, the cloud performs the following operations: collecting the information sent by all edge servers, calculating the reward value at time t-1 after the information sent by all edge servers at time t has been collected, and storing the corresponding tuple information into an experience pool; when the experience pool is full of data, randomly sampling tuple information data from the experience pool to train each Critic network; the tuple information includes: the state information s_{t-1} of all edge servers at time t-1, the scores a_{t-1} output by the Actor online networks of all edge servers at time t-1, the reward value r_{t-1} at time t-1, the state information s_t of all edge servers at time t, and the scores a'_t output by the Actor target networks of all edge servers at time t; wherein s_{t-1}^i is the state information of the i-th edge server at time t-1; a_{t-1}^i is the score output by the Actor online network of the i-th edge server at time t-1; s_t^i is the state information of the i-th edge server at time t; a'_t^i is the score output by the Actor target network of the i-th edge server at time t; and N is the number of edge servers.
4. A distributed storage system according to claim 3, wherein the method of training the Critic network by randomly sampling tuple information data from the experience pool comprises:
The j-th tuple information data obtained by sampling is recorded as (s_b, a_b, r_b, s_{b+1}, a'_{b+1}); wherein s_b^i is the state information of the i-th edge server at time b; a_b^i is the score output by the Actor online network of the i-th edge server at time b; and a'_{b+1}^i is the score output by the Actor target network of the i-th edge server at time b+1;
Acquiring an evaluation result and a corresponding evaluation label of each edge server based on the sampled tuple information data; wherein the evaluation result of the i-th edge server based on the j-th tuple information data is Q_i(s_b, a_b), the evaluation result obtained by inputting s_b and a_b into the i-th Critic online network; the evaluation label of the i-th edge server based on the j-th tuple information data is y_b^i = r_b + γ · Q'_i(s_{b+1}, a'_{b+1}), wherein r_b is the reward value at time b, γ is the reward discount rate, and Q'_i(s_{b+1}, a'_{b+1}) is the evaluation result obtained by inputting s_{b+1} and a'_{b+1} into the i-th Critic target network;
Training each Critic online network by minimizing the difference between the evaluation result of each edge server and the corresponding evaluation label; and after training for a plurality of rounds, updating the corresponding Critic target network based on the parameters of the Critic online network.
5. A copy selection method based on the distributed storage system of any of claims 1-4, comprising: during operation of the distributed storage system, when the server side receives a copy access request, ranking the edge servers based on the scores of the edge servers, and selecting the highest-ranked edge server on which the data copy exists as the copy-selection node for accessing the data.
6. The copy selection method of claim 5, wherein all edge servers in the distributed storage system form a Ceph system; and the Ceph system normalizes the score of each edge server on which the data copy exists, uses the normalized score as the primary-affinity parameter value of the corresponding edge server, and selects the edge server for data access based on the primary-affinity parameter values.
7. The copy selection method of claim 6, wherein the Ceph system normalizes the score of each edge server on which the data copy exists using a max-min normalization method.
8. A copy selection system, comprising: a memory storing a computer program and a processor that when executing the computer program performs the copy selection method of any of claims 5-7.
9. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored computer program, wherein the computer program, when run by a processor, controls a device in which the storage medium is located to perform the copy selection method of any of claims 5-7.