CN115190135A - Distributed storage system and copy selection method thereof - Google Patents


Info

Publication number
CN115190135A
CN115190135A
Authority
CN
China
Prior art keywords
network
edge
edge server
actor
server
Prior art date
Legal status
Granted
Application number
CN202210768871.2A
Other languages
Chinese (zh)
Other versions
CN115190135B (en)
Inventor
党曼玉
洪旺
施展
廖子逸
李一泠
张望
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN202210768871.2A
Publication of CN115190135A
Application granted
Publication of CN115190135B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004 Server selection for load balancing
    • H04L 67/1023 Server selection for load balancing based on a hash applied to IP addresses or costs

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a distributed storage system and a copy selection method thereof, belonging to the technical field of distributed storage. An Actor network is arranged in each edge server to quickly calculate the score of that edge server, and Critic networks are deployed in the cloud to comprehensively consider the information of all Actor networks for joint action evaluation. Each Actor network is trained on the basis of the evaluation result output by its corresponding Critic network, and each Critic network is trained on data randomly sampled from an experience pool. The training processes of the Actor networks and the Critic networks are mutually independent and carried out continuously, so that the quality of service of each edge server can be accurately scored at every moment; a server ranking is maintained among the servers and distributed to the clients, so that copy selection has complete server state information and no forwarding delay overhead. The method can better adapt to copy selection in the edge environment, reduces request processing delay in the edge environment, and achieves a balance between performance and reliability.

Description

Distributed storage system and copy selection method thereof
Technical Field
The invention belongs to the technical field of distributed storage, and particularly relates to a distributed storage system and a copy selection method thereof.
Background
With the popularization of mobile phones, wearable devices and various sensors, the number of Internet of Things devices is increasing rapidly. According to the Ericsson Mobility Report 2021, 14.6 billion Internet of Things connections were established worldwide in 2021, and the number is expected to grow to 30.2 billion by 2027. These devices support a variety of applications, such as road safety services, real-time video analytics, gaming, augmented reality, and virtual reality. However, due to computational, storage, and energy constraints, these applications can only collect data and then move it to a cloud data center with powerful processing capability for processing. With the support of cloud computing, users can run these applications on less powerful devices.
However, in the cloud computing mode, data is sent from the edge to the cloud through multiple hops, which introduces substantial delay in request processing. Moreover, the vast number of Internet of Things devices produce large amounts of data at every moment, and forwarding all of it to the cloud for processing occupies a large amount of network bandwidth. For this reason, a new computing paradigm, edge computing, has emerged. Edge computing provides computing and storage services by deploying edge servers at the edge of the network, so that user data can be processed directly at the edge, reducing request latency and saving network bandwidth between the edge and the cloud. Furthermore, as the transmission path is shortened, the reliability of transmission is also improved.
Deploying the storage service at the edge allows terminal devices to access data at high speed and reduces the response delay of data access, which is very important for delay-sensitive applications. However, due to many sources of variability, performance fluctuations often occur on the nodes of a distributed storage system, thereby affecting the quality of service of the system. In the edge environment, the quality of service of the system also changes with the position of the user and the time-varying, dynamic network. The copy selection strategy, a widely used request scheduling method for improving the quality of service of the system, can effectively reduce the processing delay of each request by selecting the edge server with the lowest delay for that request. Compared with other methods (e.g., redundant requests, reissued requests), copy selection does not increase the load on the system. Moreover, copy selection is an indispensable part of a distributed storage system: when a request arrives, a server must always be selected to service it. Therefore, it is necessary to study the copy selection policy in the edge environment to guarantee the quality of service of the system. However, conventional copy selection policies adapt poorly to changes in edge server state; existing strategies mainly comprise client-based copy selection and central-node-based copy selection. A client-based copy selection strategy lacks complete server state information, so its estimation of server delay is inaccurate; moreover, multiple selection nodes are difficult to coordinate, so load oscillation easily occurs and request delay increases. A central-node-based copy selection strategy executes the copy selection task for all clients through an additional central node; in an edge scenario, the cloud data center serves as the copy selection node, requests are first sent to the cloud data center, and the cloud data center selects the edge server with the best service capability for each request. This introduces request forwarding and therefore additional response delay, and the delay generated by request forwarding is even larger in the geographically distributed edge environment.
In order to reduce request processing delay in the edge environment, guarantee the quality of service of the system, and achieve a balance between performance and reliability, how to design a distributed storage system convenient for copy selection, and how to optimize the copy selection method in such a distributed storage system, have become problems that urgently need to be solved.
Disclosure of Invention
In view of the above drawbacks and needs of the prior art, the present invention provides a distributed storage system and a copy selection method thereof, which are used to solve the technical problem of high response delay in the prior art.
To achieve the above object, in a first aspect, the present invention provides a distributed storage system, including: the cloud terminal and the server terminal; wherein, the server end includes: a plurality of distributed edge servers; an Actor network is deployed in each edge server; the cloud end is provided with a plurality of Critic networks, the number of the Critic networks is the same as that of the edge servers, and one Critic network corresponds to one Actor network;
the operation process of the distributed storage system comprises the following steps:
at each time t, each edge server performs the following operations: the edge server collects the current state data of the network environment in which it is located as its state information, and inputs the state information into its internal Actor network, which scores the quality of service of that edge server, to obtain the score of the edge server; after the state information and scores of all edge servers have been sent to the corresponding Critic network in the cloud to obtain an evaluation result, the Actor network inside the edge server is trained with the goal of maximizing the evaluation result;
at each time t, the cloud performs the following operations: it collects the information sent by all edge servers; after the information sent by all edge servers at time t has been collected, it calculates the reward value r_{t-1} at time t-1 and stores the corresponding tuple information into the experience pool; when the experience pool is full of data, tuple information data are randomly sampled from the experience pool to train all Critic networks simultaneously; the tuple information comprises: the state information of all edge servers at time t-1, the scores of all edge servers at time t-1, the reward value at time t-1, and the state information of all edge servers at time t.
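For illustration, the per-timestep routine on one edge server described above can be sketched as follows (a minimal sketch; the message format and the helper callables collect_state, send_to_cloud, wait_for_evaluation and update_actor are assumptions introduced here, not part of the claims):

```python
def edge_server_step(t, collect_state, actor_score, send_to_cloud,
                     wait_for_evaluation, update_actor):
    """One time step t on an edge server (all callables are assumed helpers)."""
    state = collect_state()                    # observe the local network environment
    score = actor_score(state)                 # Actor network outputs this server's QoS score
    send_to_cloud({"t": t, "state": state, "score": score})
    q = wait_for_evaluation(t)                 # evaluation from the corresponding Critic network
    update_actor(q)                            # adjust Actor weights toward a larger evaluation
    return score
```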
Further preferably, the reward value r_{t-1} at time t-1 is calculated from the number of requests processed by each edge server, weighted according to how each edge server's average delay compares with the overall average delay; wherein N is the number of edge servers; d̄_i is the average delay of the i-th edge server; d̄ is the mean of the average delays of all edge servers; c_i is the number of requests processed by the i-th edge server; and c̄ is the mean of the numbers of requests processed by the edge servers.
Further preferably, during the operations performed at each time t, after the experience pool is found not yet full of data or after the Critic network training is completed, the cloud judges whether the time elapsed since time t exceeds a preset time period; if so, it obtains the scores of each edge server at different times from the experience pool and calculates the mean score of each edge server; taking the median of the mean scores of the edge servers as the dividing point, the edge servers are divided into low-delay edge servers (whose mean score is greater than or equal to the dividing point) and high-delay edge servers (whose mean score is smaller than the dividing point); the edge servers are partitioned using two root bucket structures, denoted the Low bucket and the High bucket respectively: the ⌈N/2⌉ low-delay edge servers are placed in the Low bucket and the N/2 high-delay edge servers in the High bucket; ⌈M/2⌉ low-delay edge servers are selected from the Low bucket and M/2 high-delay edge servers are selected from the High bucket to place the copies. Otherwise, the operation of the cloud at time t ends. Here N is the number of edge servers and M is the number of copies.
Further preferably, the Actor network includes: an Actor online network and an Actor target network; the Critic network comprises a Critic online network and a Critic target network;
the operation process of the distributed storage system comprises the following steps:
at each time t, each edge server performs the following operations: the method comprises the steps that an edge server collects current state data of a network environment where the edge server is located to serve as state information of the edge server, and the current state data are respectively input into an Actor online network and an Actor target network inside the edge server to obtain a score output by the Actor online network and a score output by the Actor target network; after the edge server sends the state information and the scores output by the Actor online networks of all edge servers to a corresponding Critic online network in the cloud to obtain an evaluation result, the edge server trains the Actor online network in the edge server by taking the maximum evaluation result as a target; after each training cycle, updating an Actor target network based on the parameters of the Actor online network;
at each time t, the cloud performs the following operations: it collects the information sent by all edge servers; after the information sent by all edge servers at time t has been collected, it calculates the reward value at time t-1 and stores the corresponding tuple information into the experience pool; when the experience pool is full of data, tuple information data are randomly sampled from the experience pool to train all Critic networks simultaneously; the tuple information comprises: the state information s_{t-1} of all edge servers at time t-1, the scores a_{t-1} output by the Actor online networks of all edge servers at time t-1, the reward value r_{t-1} at time t-1, the state information s_t of all edge servers at time t, and the scores a'_t output by the Actor target networks of all edge servers at time t; wherein s_{t-1} = (s_{t-1}^(1), ..., s_{t-1}^(N)), a_{t-1} = (a_{t-1}^(1), ..., a_{t-1}^(N)), s_t = (s_t^(1), ..., s_t^(N)), a'_t = (a'_t^(1), ..., a'_t^(N)); s_{t-1}^(i) is the state information of the i-th edge server at time t-1; a_{t-1}^(i) is the score output by the Actor online network of the i-th edge server at time t-1; s_t^(i) is the state information of the i-th edge server at time t; a'_t^(i) is the score output by the Actor target network of the i-th edge server at time t; N is the number of edge servers.
Further preferably, the method for training each Critic network on tuple information data randomly sampled from the experience pool comprises:
the j-th sampled tuple information data is recorded as (s_b, a_b, r_b, s_{b+1}, a'_{b+1}); wherein s_b = (s_b^(1), ..., s_b^(N)), a_b = (a_b^(1), ..., a_b^(N)); s_b^(i) is the state information of the i-th edge server at time b; a_b^(i) is the score output by the Actor online network of the i-th edge server at time b; a'_{b+1}^(i) is the score output by the Actor target network of the i-th edge server at time b+1;
the evaluation result and the corresponding evaluation label of each edge server are obtained from the sampled tuple information data; the evaluation result of the i-th edge server obtained from the j-th tuple information data is q_b^(i), obtained by inputting s_b^(i) and a_b into the i-th Critic online network; the evaluation label of the i-th edge server obtained from the j-th tuple information data is y_b^(i) = r_b + γ·q'_{b+1}^(i); r_b is the reward value at time b; γ is the reward discount rate; q'_{b+1}^(i) is the evaluation result obtained by inputting s_{b+1}^(i) and a'_{b+1} into the i-th Critic target network;
each Critic online network is trained by minimizing the difference between the evaluation result of each edge server and the corresponding evaluation label; and, after each training cycle, the corresponding Critic target network is updated based on the parameters of the Critic online network.
In a second aspect, the present invention provides a copy selection method based on the above distributed storage system, including: during the operation of the distributed storage system, when the server end receives a copy access request, ranking the edge servers based on their scores, and selecting the highest-ranked edge server that holds a data copy as the node for copy selection to perform data access.
Further preferably, all edge servers in the distributed storage system form a Ceph system; the Ceph system normalizes the score of each edge server holding a data copy, uses the normalized score as the primary-affinity parameter value of that edge server, and selects the edge server for data access based on the primary-affinity parameter value.
Further preferably, the Ceph system normalizes the scores of each edge server for which a data copy exists using a max-min normalization method.
In a third aspect, the present invention provides a system for selecting a copy, including: a memory storing a computer program and a processor executing the computer program to perform the copy selection method provided by the second aspect of the present invention.
In a fourth aspect, the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program is executed by a processor, the computer program controls an apparatus where the storage medium is located to execute the copy selection method provided in the second aspect of the present invention.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
1. The invention provides a distributed storage system in which different network structures are deployed in the cloud and on the edge servers. Because the delay from the edge to the cloud is large, multiple factors affect the quality of service of the system in the edge environment, and the score value of an edge server is a continuous value, the invention places an Actor network inside each edge server to quickly calculate the score (ranking) of that edge server, rather than calculating the scores uniformly in the cloud and then distributing them; in addition, Critic networks are deployed in the cloud to comprehensively consider the information of all Actor networks for joint action evaluation. Each Actor network is trained based on the evaluation result output by its corresponding Critic network, and each Critic network is trained based on data randomly sampled from the experience pool; the training processes of the Actor networks and the Critic networks are mutually independent and carried out continuously, so that the quality of service of each edge server can be scored accurately at every moment. A server ranking is maintained among the servers and distributed to the clients, so that copy selection has complete server state information and no forwarding delay overhead, and the transmission overhead of cloud-edge data is greatly reduced. The system can therefore better adapt to copy selection in the edge environment, reduces request processing delay in the edge environment, and achieves a balance between performance and reliability.
2. According to the distributed storage system provided by the invention, the Actor network and the Critic network are both of a double-network structure, so that the learning stability is greatly improved, and the accuracy of copy selection is further improved.
3. The distributed storage system provided by the invention considers that the data access service is a state data access service, the data access request can only select the copy among the servers with data copies, and the placement position of the copy can influence the effectiveness of the copy selection strategy.
4. Because the selection of modifying the copy by an intrusion system relates to a large number of existing mechanisms in the system, and the copy selection mechanism is difficult to be embedded into the existing system perfectly, the copy selection method provided by the invention designs an additional processing flow aiming at the existing internal mechanisms of the Ceph system to change the selection of the copy, after the score of an edge server is obtained once, the score of the edge server is normalized to be used as an affinity-primary parameter value of an OSD node of the edge server, and the Ceph system selects a main OSD node thereof as a node selected by the copy to perform data access based on the affinity-primary parameter value, namely the edge server with the highest rank and the data copy exists.
Drawings
Fig. 1 is a schematic structural diagram of a distributed storage system according to embodiment 1 of the present invention;
fig. 2 is a schematic diagram of an Actor network structure provided in embodiment 1 of the present invention;
fig. 3 is a structural diagram of a Critic network provided in embodiment 1 of the present invention;
fig. 4 is a flow chart of multi-agent reinforcement learning data in an edge environment according to embodiment 1 of the present invention;
FIG. 5 is a schematic view of a double "root bucket" structure provided in embodiment 1 of the present invention;
FIG. 6 is a rule implementation in a double "root bucket" structure provided in embodiment 1 of the present invention;
FIG. 7 is a schematic diagram of average delay results of different replica selection strategies under three loads of Read-only, read-assist and Update-assist according to embodiment 2 of the present invention;
fig. 8 is a schematic diagram of an average response delay result of each node at each time point under Read-only load using 3 different strategies according to embodiment 2 of the present invention;
fig. 9 is a schematic diagram of the delay influence of different numbers of clients on three copy selection policies under Read-only load according to embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the respective embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Embodiment 1
A distributed storage system, as shown in fig. 1, comprising: the cloud terminal and the server terminal; wherein, the server end includes: a plurality of distributed edge servers; an Actor network is deployed in each edge server; a plurality of Critic networks are deployed at the cloud end, the number of the Critic networks is the same as that of the edge servers, and one Critic network corresponds to one Actor network;
the operation process of the distributed storage system comprises the following steps:
the operation process of the server side comprises the following steps:
at each time t, each edge server performs the following operations: the method comprises the following steps that an edge server collects current state data of a network environment where the edge server is located as state information of the edge server, and inputs the current state data into an Actor network used for carrying out service quality scoring on the edge server to obtain scoring of the edge server; after the edge server sends the state information and the scores of all the edge servers to a corresponding Critic network in the cloud to obtain the evaluation result, training an Actor network in the edge server by taking the maximum evaluation result as a target;
it should be noted that each edge server runs an OSD process, including an OSD node. All the edge servers form a Ceph system, after the scores of the edge servers are normalized, the normalized scores serve as an affinity-primary parameter value of the OSD node of each edge server, and the Ceph system can select a main OSD node of the edge servers as a node selected by a copy to perform data access based on the affinity-primary parameter value, namely the edge server with the highest rank and a data copy. Specifically, normalization methods such as tanh normalization, sigmoid normalization, max-min normalization, and the like can be employed. Preferably, the max-min normalization method is adopted to normalize the score of the edge server, so that the original data information can be more completely retained compared with other normalization methods (such as tanh normalization and sigmoid normalization).
In an optional implementation manner, the edge server mainly comprises a Ceph system module, an information acquisition module, a scoring module and an adapter module;
a scoring module: the scoring module mainly comprises an Actor network in Deep Deterministic Policy Gradient (DDPG) reinforcement learning, outputs actions (scores) for the edge server according to information acquired by the edge server independently, and sends a group of information such as state information, actions and performance to the cloud.
An adapter module: since intrusively changing copy selection according to the scores involves a large number of system-internal processes, this module is dedicated to interfacing with the specific system and converts the scores into the mechanism by which that system changes copy selection. Specifically, an additional processing flow is designed around the existing internal mechanisms of the Ceph system to change the selection of the copy. Because the OSD nodes on which objects are placed in the Ceph system are computed directly by the CRUSH algorithm, and the read and write operations of an object are completed through its primary OSD node, the primary OSD carries much of the system's processing logic. Directly intruding into the system to change the target OSD node of a request according to the scores (ranking) would involve a large number of existing mechanisms in the system. Therefore, the invention considers the selection of the primary OSD node, and changes the node selected for the copy by changing the primary OSD node corresponding to the object. The selection of OSD nodes is performed by a "straw drawing" algorithm in Ceph, and the three OSD nodes (three copies) with the longest straws are selected from all the nodes as the placement nodes of the data; in this initial OSD order, the primary OSD node is the node with the longest straw. A processing flow is then designed to select the primary OSD node more dynamically: the Ceph system provides a primary-affinity parameter to control the probability of each OSD node becoming the primary OSD node. The range of the primary-affinity value in the Ceph system is [0, 1], and the score value output by the neural network obviously exceeds this range, so the value output by the neural network needs to be mapped into this interval. This embodiment uses the max-min normalization method for the mapping; compared with other normalization methods (such as tanh normalization and sigmoid normalization), it retains the original data information more completely, as shown in the following formula:
score' = (score − score_min) / (score_max − score_min)
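As an illustration, the mapping from raw Actor scores to primary-affinity values could look like the following (a minimal sketch under the assumption that primary affinity is set per OSD via the standard `ceph osd primary-affinity` command; the helper names are introduced here for illustration only):

```python
import subprocess

def scores_to_primary_affinity(scores):
    """Map raw Actor scores (one per OSD id) into [0, 1] by max-min normalization."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0                      # avoid division by zero when all scores are equal
    return {osd: (s - lo) / span for osd, s in scores.items()}

def apply_primary_affinity(affinities):
    """Push normalized scores to Ceph as primary-affinity values (assumed deployment detail)."""
    for osd_id, value in affinities.items():
        subprocess.run(["ceph", "osd", "primary-affinity", f"osd.{osd_id}", f"{value:.3f}"],
                       check=True)

# Example: scores output by the Actor networks of OSDs 0..2
affinities = scores_to_primary_affinity({0: 1.7, 1: -0.3, 2: 0.9})
# apply_primary_affinity(affinities)   # would set osd.0 -> 1.000, osd.1 -> 0.000, osd.2 -> 0.600
```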
the operation process of the cloud:
at each time t, the cloud performs the following operations: it collects the information sent by all edge servers; after the information sent by all edge servers at time t has been collected, it calculates the reward value r_{t-1} at time t-1 and stores the corresponding tuple information into the experience pool; when the experience pool is full of data, tuple information data are randomly sampled from the experience pool to train all Critic networks simultaneously; the tuple information comprises: the state information of all edge servers at time t-1, the scores of all edge servers at time t-1, the reward value at time t-1, and the state information of all edge servers at time t.
In this embodiment, the reward value r_{t-1} at time t-1 is calculated from the number of requests processed by each edge server, weighted according to how each edge server's average delay compares with the overall average delay; wherein N is the number of edge servers; d̄_i is the average delay of the i-th edge server; d̄ is the mean of the average delays of all edge servers; c_i is the number of requests processed by the i-th edge server; and c̄ is the mean of the numbers of requests processed by the edge servers.
Under an optional implementation mode, the cloud end mainly comprises a reward calculation module, an experience pool module, an evaluation module and a copy placement optimization module.
A reward calculation module: the reward calculation module maintains the previous state and action information, receives the current state and action information, calculates the overall reward value of the system, and stores the tuple <previous state, previous action, reward value, current state, action output this time by the Actor target network> into the experience pool. Note that the reward calculation module needs to maintain all information at time t-1 (i.e., the state s_{t-1} and the action information a_{t-1}) and to collect the information of all edge servers at time t (i.e., the state s_t and the action information a'_t). The reward value r_{t-1} at time t-1 can then be calculated from the information at time t, and the tuple (s_{t-1}, a_{t-1}, r_{t-1}, s_t, a'_t) is stored in the experience pool. Specifically, how the reward value is calculated is important for reinforcement learning. This embodiment measures the reward value by the number of requests processed by each node: a node whose average delay d̄_i^t is less than the overall average delay d̄^t receives positive reward feedback for the requests it processed, and the number of requests processed by node i at time t is denoted c_i^t. Meanwhile, considering that the lower the latency, the more reward a node should receive for processing requests, the request counts of different nodes should carry different reward weights. With weight information, not all requests carry reward feedback; instead, the deviation of each node's processed request count c_i^t from the mean c̄^t carries reward feedback, representing over- or under-processing of requests (treating the per-node request counts equally), and this number of requests with reward feedback per node is defined by the corresponding formula. Since the reward weight of each node is related to its delay d̄_i^t, the per-node average delay and the overall average delay are used directly to form the weight parameter. The final reward value is then calculated from the weighted per-node request counts.
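A plausible Python sketch of the reward described above follows. The specific weighting (overall average delay divided by per-node average delay) and the averaging over nodes are assumptions made here for illustration, not the patent's exact formulas:

```python
def reward(delays, counts):
    """Per-step system reward from per-node average delays and processed-request counts.

    delays[i]: average delay of edge server i in this period
    counts[i]: number of requests processed by edge server i in this period
    """
    n = len(delays)
    mean_delay = sum(delays) / n
    mean_count = sum(counts) / n
    total = 0.0
    for d_i, c_i in zip(delays, counts):
        feedback = c_i - mean_count          # requests carrying reward feedback (over/under-processing)
        weight = mean_delay / d_i            # assumed weight: lower-delay nodes weigh more
        total += weight * feedback
    return total / n

# Example: a low-delay node that processed more requests than average yields a positive reward.
print(reward([2.0, 8.0], [120, 80]))
```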
an evaluation module: the evaluation module consists of a criticic network in DDPG reinforcement learning and evaluates the action information of the Actor network. The evaluation value output by the Critic network is used as 'supervision information' of Actor network learning, and historical data is sampled from an experience pool to train and learn the Critic network.
A copy placement optimization module: considering that the storage system provides a stateful data access service, the placement of the copy will affect the optional nodes of the selection policy. The data migration is considered, and the placement position of the copy is optimized, so that the copy selection is better carried out.
Note that the input of the Actor network is defined as the state information s observed by each edge server itself, and the output is defined as the score (action) a. In an alternative embodiment, a specific implementation structure of the Actor network is shown in fig. 2: the entire Actor network consists of two fully connected layers (Linear layers) and one ReLU activation layer. Considering the limited resources of the edge server, MARLRS defines the output (and input) dimension of the middle hidden layer as 50 to keep the computational overhead low. The weight matrices of the two fully connected layers of the Actor network are defined as w_a1 and w_a2, with dimensions len(s) × 50 and 50 × 1 respectively, where len(s) is the dimension of the state. The calculation represented by the Actor network is then: a = Relu(s · w_a1) · w_a2.
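A minimal PyTorch sketch of this Actor structure (two linear layers with a 50-dimensional hidden layer and a ReLU in between) might look as follows; the class and variable names are illustrative, and bias terms are included here although the formula above only shows the weight matrices:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor network: state -> score, as described for fig. 2."""
    def __init__(self, state_dim: int, hidden_dim: int = 50):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)   # w_a1: len(s) x 50
        self.fc2 = nn.Linear(hidden_dim, 1)           # w_a2: 50 x 1

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(s)))      # a = Relu(s * w_a1) * w_a2

# Example: score a 10-dimensional state observation
actor = Actor(state_dim=10)
score = actor(torch.randn(1, 10))
```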
The role of the Critic network is to evaluate the calculation result of the Actor network, that is, the output of the Critic network is the "supervision information" from which the Actor network learns. The better the result of the Actor network, the larger the output of the Critic network; the worse the result of the Actor network, the more negative and smaller the output of the Critic network. The Critic network takes as input both the input s of the Actor network and the output a of the Actor network. In an alternative embodiment, the specific implementation structure of the Critic network is shown in fig. 3: the inputs s and a are each first processed by a fully connected layer, and the intermediate results are defined as mid_s and mid_a respectively. The weight matrices of these two fully connected layers are defined as w_cs and w_ca, with dimensions len(s) × 200 and N × 200 respectively, where N is the number of edge servers, so that mid_s = s · w_cs and mid_a = a · w_ca. The intermediate outputs are then combined by a linear summation: mid = mid_s + mid_a + b, where b is a bias (noise) matrix. Then, as in the Actor network, the evaluation result q is calculated through an activation function and a fully connected layer. Defining the weight matrix of the last fully connected layer as w_c, the calculation represented by the Critic network is: q = Relu(mid) · w_c.
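A matching PyTorch sketch of this Critic structure (separate 200-dimensional projections of the state and the joint action, summed and passed through a ReLU and a final linear layer) is given below; as above, the names and bias handling are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Critic network: (state, joint action) -> evaluation q, as described for fig. 3."""
    def __init__(self, state_dim: int, num_servers: int, hidden_dim: int = 200):
        super().__init__()
        self.fc_s = nn.Linear(state_dim, hidden_dim)    # w_cs: len(s) x 200
        self.fc_a = nn.Linear(num_servers, hidden_dim)  # w_ca: N x 200
        self.fc_q = nn.Linear(hidden_dim, 1)            # w_c: 200 x 1

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        mid = self.fc_s(s) + self.fc_a(a)               # mid = mid_s + mid_a (+ bias terms b)
        return self.fc_q(torch.relu(mid))               # q = Relu(mid) * w_c

# Example: evaluate the joint action of 5 edge servers given one server's 10-dim state
critic = Critic(state_dim=10, num_servers=5)
q = critic(torch.randn(1, 10), torch.randn(1, 5))
```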
Further, because the delay from the edge to the cloud is large, the invention places an Actor network inside each edge server to quickly calculate the score (ranking) of that node, rather than calculating the scores uniformly in the cloud and then distributing them. The Critic networks need to comprehensively consider the information of all Actor networks to perform joint action evaluation, and, in order to train the networks better, a batch of data needs to be randomly sampled from the experience pool and learned from simultaneously, so all Critic networks are deployed in the cloud (one Actor network corresponds to one Critic network, i.e., the i-th Actor network corresponds to the i-th Critic network, i = 1, 2, ..., N). In order to improve the stability of learning, in an alternative embodiment both the Actor network and the Critic network adopt a dual-network structure. Specifically, the Actor network comprises an Actor online network and an Actor target network; the Critic network comprises a Critic online network and a Critic target network; at each moment, the action of the Actor online network is evaluated through the Critic online network.
The operation process of the distributed storage system comprises the following steps:
at each time t, each edge server performs the following operations: the method comprises the steps that an edge server collects current state data of a network environment where the edge server is located as state information of the edge server, and the current state data are respectively input into an Actor on-line network and an Actor target network inside the edge server to obtain scores output by the Actor on-line network and scores output by the Actor target network; the method comprises the steps that after an edge server sends state information of the edge server and scores output by Actor online networks of all edge servers to a corresponding criticic online network in a cloud end to obtain an evaluation result, the Actor online network in the edge server is trained by taking the maximum evaluation result as a target; after each training cycle, updating the target network of the Actor based on the parameters of the on-line network of the Actor;
at each time t, the cloud performs the following operations: it collects the information sent by all edge servers; after the information sent by all edge servers at time t has been collected, it calculates the reward value at time t-1 and stores the corresponding tuple information into the experience pool; when the experience pool is full of data, tuple information data are randomly sampled from the experience pool to train all Critic networks simultaneously; the tuple information comprises: the state information s_{t-1} of all edge servers at time t-1, the scores a_{t-1} output by the Actor online networks of all edge servers at time t-1, the reward value r_{t-1} at time t-1, the state information s_t of all edge servers at time t, and the scores a'_t output by the Actor target networks of all edge servers at time t; wherein s_{t-1} = (s_{t-1}^(1), ..., s_{t-1}^(N)), a_{t-1} = (a_{t-1}^(1), ..., a_{t-1}^(N)), s_t = (s_t^(1), ..., s_t^(N)), a'_t = (a'_t^(1), ..., a'_t^(N)); s_{t-1}^(i) is the state information of the i-th edge server at time t-1; a_{t-1}^(i) is the score output by the Actor online network of the i-th edge server at time t-1; s_t^(i) is the state information of the i-th edge server at time t; a'_t^(i) is the score output by the Actor target network of the i-th edge server at time t; N is the number of edge servers.
Specifically, the method for training the Critic networks by randomly sampling tuple information data from the experience pool comprises the following steps:
the total number of tuple information data sampled from the experience pool is B; the j-th sampled tuple information data is recorded as (s_b, a_b, r_b, s_{b+1}, a'_{b+1}); wherein s_b = (s_b^(1), ..., s_b^(N)), a_b = (a_b^(1), ..., a_b^(N)); s_b^(i) is the state information of the i-th edge server at time b; a_b^(i) is the score output by the Actor online network of the i-th edge server at time b; a'_{b+1}^(i) is the score output by the Actor target network of the i-th edge server at time b+1;
the evaluation result and the corresponding evaluation label of each edge server are obtained from the sampled tuple information data; the evaluation result of the i-th edge server obtained from the j-th tuple information data is q_b^(i), obtained by inputting s_b^(i) and a_b into the i-th Critic online network; the evaluation label of the i-th edge server obtained from the j-th tuple information data is y_b^(i) = r_b + γ·q'_{b+1}^(i); r_b is the reward value at time b; γ is the reward discount rate; q'_{b+1}^(i) is the evaluation result obtained by inputting s_{b+1}^(i) and a'_{b+1} into the i-th Critic target network;
each Critic online network is trained by minimizing the difference between the evaluation result of each edge server and the corresponding evaluation label; and, after each training cycle, the corresponding Critic target network is updated based on the parameters of the Critic online network.
It should be noted that, in the dual-network structure of Actor and Critic, the online network and the target network have the same network model setting, but the weighting parameters between the networks are different; specifically, the structure of the Actor online network and the Actor target network are the same and are the same as the structure of the Actor network, which is not described herein. The structure of the Critic online network is the same as that of the Critic target network, and is not described herein again. The online network weight is updated in real time (single step), and the target network weight is updated according to the online network weight after the online network is updated by n steps.
Specifically, the neural network calculations of the Actor online network, the Actor target network, the Critic online network and the Critic target network are defined as the functions μ^(i), μ'^(i), Q^(i) and Q'^(i) respectively, and their overall network parameters are defined as θ_μ^(i), θ_μ'^(i), θ_Q^(i) and θ_Q'^(i) respectively, where i denotes the number of the edge server. To further explain the operation process of the distributed storage system, the complete data flow of the Actor networks and the Critic networks in the edge environment is described below, taking the multi-agent reinforcement learning data flow diagram in the edge environment shown in fig. 4 as an example:
1) First, there is a clock synchronization process between the edge servers. When time t is reached, every edge server observes its own environment state information, defined as s_t^(i).
2) Then the state information s_t^(i) is used as the input of the Actor online network, and the action at time t, a_t^(i) = μ^(i)(s_t^(i)), is calculated by the neural network. Each edge server then directly executes the action a_t^(i).
3) The tuple (s_t^(i), a_t^(i)) and the additional information needed for reward calculation are sent to the reward calculation module in the cloud. Considering that the input of the Critic target network depends on the output of the Actor target network, s_t^(i) is also input into the Actor target network for calculation at this stage, and the output is defined as a'_t^(i) = μ'^(i)(s_t^(i)). If a'_t^(i) were not calculated at this stage, then every time the Critic target network performed a calculation, data would have to be sent from the cloud to the edge, and after the calculation the Actor target network at the edge would have to send the corresponding data back to the cloud; completing the corresponding calculation at this stage therefore saves unnecessary overhead.
4) The reward calculation module aggregates the information of all edge servers and maintains the information at time t-1. Therefore, the global reward r_{t-1} of the system at time t-1 can be calculated from the information at time t. The tuple (s_{t-1}, a_{t-1}, r_{t-1}, s_t, a'_t) is then stored into the experience pool for the random sampling and learning of the Critic networks.
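A simple experience-pool sketch matching stage 4) is shown below (a minimal fixed-capacity buffer with uniform random sampling; the capacity value and class name are illustrative assumptions):

```python
import random
from collections import deque

class ExperiencePool:
    """Stores (s_{t-1}, a_{t-1}, r_{t-1}, s_t, a'_t) tuples and samples mini-batches uniformly."""
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s_prev, a_prev, r_prev, s_cur, a_target_cur):
        self.buffer.append((s_prev, a_prev, r_prev, s_cur, a_target_cur))

    def is_full(self) -> bool:
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), batch_size)   # B tuples for Critic training
```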
5) The Critic networks randomly sample B tuples of data from the experience pool, with the tuple format given in stage 4); specifically, the j-th sampled tuple information data is recorded as (s_b, a_b, r_b, s_{b+1}, a'_{b+1}).
6) Stage 6) and stage 5) are completely parallel processes that do not interfere with each other. The corresponding Critic online network is used to evaluate the action of the Actor online network at time t, and the evaluation result is defined as q_t^(i) = Q^(i)(s_t^(i), a_t); the inputs of the Critic online network are the state s_t^(i) of the corresponding Actor online network and the joint action a_t = (a_t^(1), ..., a_t^(N)).
7) In the j-th tuple information data, s_{b+1}^(i) and a'_{b+1} are input into the i-th Critic target network to obtain q'_{b+1}^(i) = Q'^(i)(s_{b+1}^(i), a'_{b+1}). Using the reward value r_b and q'_{b+1}^(i), the "supervision information" required for the learning of the Critic online network is calculated (unlike the labels of supervised learning, the label of the Critic online network depends on the Critic target network, which is itself learned within the system). The evaluation label of the i-th edge server obtained from the j-th tuple information data is y_b^(i) = r_b + γ·q'_{b+1}^(i), where γ is the reward discount rate.
8) The online networks perform forward propagation and calculate gradients, which is the first step of the online network training and learning process. Both the Actor networks and the Critic networks perform this process, but not at the same time (they run on different machines) and with different training data. In the original design of the DDPG network, the Actor network and the Critic network both train on the same batch of sampled data, but here the Actor networks and the Critic networks reside on different machines and the experience pool is placed in the cloud; reusing the original scheme would generate additional overhead (and time delay). Therefore, in the invention the Actor network learns only from the data at time t, while the Critic networks randomly sample a batch of size B from the experience pool and train simultaneously.
9) At this stage the loss value of the corresponding network is calculated, and back-propagation is performed to update the online network parameters. The Critic online network performs forward propagation as in the usage stage: the evaluation result obtained by inputting s_b^(i) and a_b of the j-th tuple information data into the i-th Critic online network is recorded as q_b^(i) = Q^(i)(s_b^(i), a_b). Together with the label y_b^(i), the loss value of the i-th Critic online network is calculated as L_Q^(i) = (1/B)·Σ_b (y_b^(i) − q_b^(i))², where the sum runs over the B sampled tuples and B is the size of the batch of sampled data.
Further, the Actor online network directly uses the evaluation information q_t^(i) of the Critic online network as the criterion for judging its own performance: the larger q_t^(i) is, the better the decision made by the Actor network, so the Actor online network modifies its weight parameters toward a higher probability of obtaining a larger q_t^(i). The loss function of the Actor online network is accordingly defined as the negative of the Critic online network's evaluation, L_μ^(i) = −q_t^(i). Back-propagation is then performed to update the parameters of the Actor online network and of the Critic online network, μ^(i) and Q^(i), respectively.
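Under the assumption that the losses take these standard DDPG forms (mean squared error for the Critic, negative evaluation for the Actor), one training step could be sketched as follows; optimizer setup, tensor shapes, and the colocation of the Actor with its Critic (in the real system the evaluation is returned from the cloud) are simplifications made here:

```python
import torch
import torch.nn.functional as F

def critic_train_step(critic_online, critic_target, critic_opt, batch, gamma=0.99):
    """One update of the i-th Critic online network on B sampled tuples (stage 9)."""
    s_b, a_b, r_b, s_next, a_target_next = batch        # tensors with leading batch dim B
    with torch.no_grad():
        y = r_b + gamma * critic_target(s_next, a_target_next)   # label from Critic target network
    q = critic_online(s_b, a_b)
    loss = F.mse_loss(q, y)                              # (1/B) * sum (y - q)^2
    critic_opt.zero_grad(); loss.backward(); critic_opt.step()
    return loss.item()

def actor_train_step(actor_online, critic_online, actor_opt, s_t, other_scores):
    """One update of the Actor online network using the Critic's evaluation at time t.
    other_scores: scores of the other edge servers at time t (treated as constants here)."""
    my_score = actor_online(s_t)                              # shape (1, 1)
    joint_a_t = torch.cat([my_score, other_scores], dim=1)    # assume this server is slot 0
    loss = -critic_online(s_t, joint_a_t).mean()              # maximize the evaluation q_t
    actor_opt.zero_grad(); loss.backward(); actor_opt.step()
    return loss.item()
```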
10) The target networks update their network weights depending on the weight information of the online networks after the online networks have been updated in real time for n steps. However, instead of directly copying the weight parameters of the online networks completely, a learning rate τ is defined, and the target network learns only a portion from the online network each time; this process is called a soft update. The target network parameter update formulas are:
θ_μ'^(i) ← τ·θ_μ^(i) + (1 − τ)·θ_μ'^(i)
θ_Q'^(i) ← τ·θ_Q^(i) + (1 − τ)·θ_Q'^(i)
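A soft-update helper corresponding to these formulas (a minimal sketch of standard DDPG-style Polyak averaging with rate τ):

```python
import torch

@torch.no_grad()
def soft_update(online_net, target_net, tau: float = 0.01):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```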
further, in an optional implementation manner, in the process of executing the operation by the cloud at each time t, after the experience pool is not full of data or the Critic network training is completed, it is determined that the operation has passed from the time tWhether the elapsed time is longer than a preset time period (the value is 600s in the embodiment), if yes, obtaining scores of each edge server at different moments from an experience pool, and calculating to obtain a score average value of each edge server; dividing the edge servers into low-delay edge servers and high-delay edge servers by taking the median of the score average of each edge server as a dividing point; the average value of the scores of the low-delay edge servers is greater than or equal to the dividing point, and the average value of the scores of the high-delay edge servers is smaller than the dividing point; partitioning the edge server by adopting two root bucket structures, wherein the two root bucket structures are respectively marked as a Low bucket and a High bucket; will be provided with
Figure BDA0003723162000000144
Placing N/2 High-latency edge servers in a High bucket; select in Low bucket
Figure BDA0003723162000000145
Selecting M/2 High delay edge servers in a High bucket to place copies by the low delay edge servers; otherwise, the operation of the cloud end at the moment t is finished; wherein N is the number of edge servers; m is the number of copies.
Specifically, in the above optional implementation manner, the overall process of the distributed storage system includes:
edge portion: at each time t, the edge server starts to collect current status data, and then calculates the action using the Actor network. And then carrying out adaptation operation on the action, executing the adaptation action, and simultaneously sending information such as state, action and performance to the cloud. And finally, waiting for the evaluation result of the cloud to train and learn the Actor network.
Cloud part: and after the cloud end collects the information of all the edge servers at the time t, calculating the reward value at the time t-1, and storing the corresponding tuple information into an experience pool for Critic network sampling learning. The cloud then evaluates the behavior of all edge servers using the Critic network. And then sending the evaluation result to each edge server, and simultaneously judging whether the experience pool is full of data or not. If enough data exists, the Critic network randomly samples data from the experience pool and trains and learns the Critic network. If not, directly judging whether the time period of the copy placement adjustment is passed. If yes, the scoring data is directly obtained from the experience pool, and the service performance expectation of each server is calculated. And finally partitioning the server according to the expected value, and changing the placement position of the copy. Otherwise, ending the flow.
It should be noted that the storage system provides a stateful data access service, which means that a data access request can only be served by one of the edge servers holding a copy of the data; therefore, the placement of the copies influences the selection decision. If data are placed randomly, it may happen at time t that all copies of some data reside on servers with higher delay; requests accessing that part of the data then suffer higher delay overhead, and the response delay of those requests cannot be well optimized by a copy selection strategy alone. For example, assume there are 8 edge servers with response delays in the range of 2-9 ms and 8 files to be put into the storage system. If the storage system uses a 3-copy policy, each edge server stores 3 files (assuming the data are evenly distributed); with random placement, it may happen that all copies of some data are stored on servers with higher response delays. To solve this problem, files could simply be exchanged so that every piece of data has a copy on an edge server with low response delay, and the copy selection strategy would then ensure that access requests for all data obtain low response delays. However, data migration has overhead and requires a certain amount of time to complete, and in the edge scenario there is transmission delay between servers, so the data migration task takes even more time. Therefore, copy placement cannot be adjusted in real time the way copy selection is: the strategy update period for copy placement is longer than for copy selection, and a way of measuring server performance over a longer time period is needed.
To address the possible placements of data, the invention designs a ranking-expectation-based replica placement optimization strategy (named RDRP), which migrates data to optimize the placement of copies and places the corresponding data onto servers with lower delay. Considering that in the invention the edge servers are ranked once at every time t, and that the purpose of RDRP is to enable better copy selection, RDRP uses the expectation of the server ranking over this time period to measure the performance ranking each edge server can provide when the copy placement is optimized. The score of each server at every time t is the output a_t^(i) of its Actor network; defining a long time period containing m time instants in total, the ranking expectation of each server is E^(i) = (1/m)·Σ_{t=1}^{m} a_t^(i).
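A small sketch of this ranking expectation and the resulting Low/High partition (median split as described earlier) follows; the function names are illustrative:

```python
from statistics import median

def ranking_expectation(score_history):
    """score_history[i] is the list of Actor scores of server i over the m time instants."""
    return {i: sum(scores) / len(scores) for i, scores in score_history.items()}

def partition_servers(expectations):
    """Split servers into Low (lower delay, expectation >= median) and High buckets."""
    cut = median(expectations.values())
    low = [i for i, e in expectations.items() if e >= cut]
    high = [i for i, e in expectations.items() if e < cut]
    return low, high

# Example with 5 servers over m = 3 instants
exp = ranking_expectation({0: [0.9, 1.0, 1.1], 1: [0.2, 0.1, 0.3],
                           2: [0.7, 0.8, 0.6], 3: [0.4, 0.5, 0.3], 4: [0.5, 0.6, 0.4]})
low_bucket, high_bucket = partition_servers(exp)   # low = [0, 2, 4], high = [1, 3]
```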
Meanwhile, if all copies of the data were placed on the highest-ranked node, the goal of the data placement strategy would be defeated: this would not only cause data imbalance, but the node would also suffer so many requests that the response delay would rise. Therefore, RDRP divides the edge servers into a lower-delay part and a higher-delay part according to the ranking expectation, and ensures that all data are evenly placed with at least one copy on the lower-delay edge servers. Specifically, the concrete implementation of RDRP is designed in combination with the built-in rules of the Ceph system: to achieve flexible placement, the Ceph system provides bucket and rule structures in the cluster topology, and various flexible data placement strategies can be realized by combining buckets and rules.
According to the lower-delay/higher-delay partition, two buckets are defined to hold OSD nodes with different predicted scores; this only solves the partitioning of the OSD nodes, while the concrete choice of data placement positions is controlled by a rule. The invention therefore designs two "root buckets" and defines the corresponding rule flow, so that the placement of data in the Ceph system is changed in a non-intrusive manner by combining buckets with rules. As shown in fig. 5, the two "root buckets" are defined as a Low bucket and a High bucket, where the Low bucket holds the nodes with lower latency and the High bucket holds the nodes with higher latency. The Low bucket holds ⌈N/2⌉ hosts (i.e., edge servers) and the High bucket holds the remaining ⌊N/2⌋ hosts. With the number of data copies defined as M, the selection rule picks ⌈M/2⌉ hosts from the Low bucket and ⌊M/2⌋ hosts from the High bucket, which guarantees that every piece of data has at least one copy on a lower-delay node.
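A minimal sketch of the Low/High partition and the per-bucket copy counts described above (the exact rounding of N/2 and M/2 is an assumption, chosen to match the 3-copy, 5-node example that follows):

```python
import math

def partition_by_expectation(expectations):
    """Split server ids into a low-latency half (higher ranking expectation)
    and a high-latency half, mirroring RDRP's Low/High root buckets."""
    order = sorted(range(len(expectations)), key=lambda i: expectations[i],
                   reverse=True)               # best-scoring servers first
    n_low = math.ceil(len(order) / 2)          # ceil(N/2) hosts go to Low
    return order[:n_low], order[n_low:]        # (Low bucket, High bucket)

def copies_per_bucket(m_copies):
    """How many of the M replicas the rule picks from each bucket."""
    return math.ceil(m_copies / 2), m_copies // 2   # (from Low, from High)

low, high = partition_by_expectation([0.82, 0.40, 0.75, 0.55, 0.68])
print(low, high)              # 3 hosts in Low, 2 in High for N = 5
print(copies_per_bucket(3))   # (2, 1): each object keeps a copy on a low-delay host
```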
This embodiment shows the implementation of a concrete bucket structure and rule definition with 3 copies and 5 OSD nodes. Table 1 gives the specific bucket definitions.

TABLE 1

(Table 1 is given as an image in the original; it lists the seven bucket definitions with their type, id, alg, hash and item fields and is not reproduced here.)
As shown in table 1, seven buckets are implemented; the first field indicates the bucket type. The remaining four fields hold the bucket's definition: id is the unique identification number of the bucket (in Ceph, buckets are numbered downward from -1, while OSD nodes are numbered upward from 0); alg is the algorithm used to select a sub-bucket or OSD node inside the bucket (since changing the bucket structure triggers data movement, the invention uses the upgraded selection algorithm straw2 to reduce the amount of data transferred); hash is the hash function used during the calculation (0 denotes the default function jenkins1); item lists the sub-buckets or OSD nodes placed in the bucket.
The rule definition is shown in fig. 6. Here ruleset is the unique identifier within the rule set; type indicates how the multiple copies are kept (replication or erasure code); the steps describe the concrete selection procedure. A step is one of three operations: take, choose, and emit. take obtains a "root bucket"; choose selects a sub-bucket or OSD node; emit ends the selection within the current "root bucket". In a choose operation, the first parameter is the selection mode (here firstn, a depth-first traversal); the second parameter is the number of items to select; the third parameter is a category identifier; and the fourth parameter is the concrete category (a bucket type or OSD).
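The take/choose/emit flow over the two root buckets can be mimicked with a small model; this is only an illustration under the Low/High design, not Ceph's CRUSH syntax, and the host names and hash-seeded selection below are assumptions:

```python
import math
import random

# Toy model of the two "root buckets" and of the selection rule's flow.
buckets = {
    "low":  {"alg": "straw2", "items": ["host1", "host2", "host3"]},  # lower-delay hosts
    "high": {"alg": "straw2", "items": ["host4", "host5"]},           # higher-delay hosts
}

def select_replicas(object_key, m_copies=3):
    """Emulate: take low -> choose ceil(M/2) hosts -> emit;
                take high -> choose floor(M/2) hosts -> emit."""
    rng = random.Random(hash(object_key))   # seeded per object key, a stand-in for a placement hash
    from_low, from_high = math.ceil(m_copies / 2), m_copies // 2
    chosen = rng.sample(buckets["low"]["items"], from_low)     # hosts from the Low root bucket
    chosen += rng.sample(buckets["high"]["items"], from_high)  # hosts from the High root bucket
    return chosen

print(select_replicas("object-42"))   # e.g. ['host2', 'host1', 'host5']
```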
Changing the bucket structure while the system is running changes the placement of the copies. Algorithm 1, shown in table 2, gives the pseudo code of the bucket replacement algorithm: it first empties both the Low and High "root buckets", and then adds each Host bucket back into the appropriate "root bucket" according to its ranking expectation.
TABLE 2

(Table 2 is given as an image in the original; it contains the pseudo code of the bucket replacement algorithm and is not reproduced here.)
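Since the pseudo code of table 2 is not reproduced here, the following is only a sketch of the bucket replacement step as described above — empty the Low and High root buckets, then re-attach each Host bucket according to its ranking expectation (the function signature and data layout are assumptions, not a Ceph API):

```python
import math

def rebuild_root_buckets(expectations, hosts):
    """Re-derive the Low/High root buckets from the latest ranking expectations,
    in the spirit of the bucket replacement algorithm (Algorithm 1).
    expectations: {host_name: ranking expectation}; hosts: Host bucket names."""
    low_bucket, high_bucket = [], []                         # step 1: empty both root buckets
    ranked = sorted(hosts, key=lambda h: expectations[h], reverse=True)
    n_low = math.ceil(len(ranked) / 2)
    for i, host in enumerate(ranked):                        # step 2: re-add hosts by rank
        (low_bucket if i < n_low else high_bucket).append(host)
    return {"Low": low_bucket, "High": high_bucket}

new_map = rebuild_root_buckets(
    {"host1": 0.81, "host2": 0.44, "host3": 0.77, "host4": 0.52, "host5": 0.69},
    ["host1", "host2", "host3", "host4", "host5"])
print(new_map)   # {'Low': ['host1', 'host3', 'host5'], 'High': ['host4', 'host2']}
```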
In summary, the invention maintains a server ranking on the server side and distributes it to clients, so that copy selection has complete server state information and no forwarding delay overhead. To cope with the many factors that affect service quality in the edge environment, a performance modeling method based on multi-agent reinforcement learning is designed, using neural networks to build a high-dimensional performance model. By adjusting the structure and data flow of the basic model, different network structures are deployed in the cloud and at the edge, which speeds up the adjustment of the copy selection strategy and reduces the transmission overhead of cloud-edge data. Finally, since the placement of copies affects copy selection, a ranking-expectation-based copy placement optimization method is designed: the copy placement is adjusted according to the expectation of the server ranking, so that requests can choose servers with lower delay and request processing delay is reduced. The invention thus adapts better to copy selection in the edge environment and achieves a balance between performance and reliability.
Embodiment 2
A copy selection method for a distributed storage system according to embodiment 1 includes:
during the operation of the distributed storage system, when the server side receives a copy access request, ranking the edge servers according to their scores, and selecting the highest-ranked edge server that holds a copy of the requested data as the node for copy selection and data access.
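A minimal sketch of this server-side selection step, assuming the scores and replica locations are available as simple mappings:

```python
def select_replica_node(scores, replica_holders):
    """scores: {server_id: latest Actor score}; replica_holders: servers that
    hold a copy of the requested object. Pick the highest-ranked (i.e. highest
    scoring) server among those that actually hold the data."""
    candidates = [s for s in replica_holders if s in scores]
    if not candidates:
        raise ValueError("no scored server holds a replica of this object")
    return max(candidates, key=lambda s: scores[s])

# Example: servers 0..4 scored at time t, the object lives on servers 1, 3 and 4.
print(select_replica_node({0: 0.9, 1: 0.3, 2: 0.8, 3: 0.7, 4: 0.5}, [1, 3, 4]))  # -> 3
```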
Preferably, all edge servers in the distributed storage system form a Ceph system; the Ceph system normalizes the score of each edge server where a data copy exists, uses the normalized score as the primary-affinity parameter value of the corresponding edge server, and selects the edge server for data access based on the primary-affinity values.
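A minimal sketch of the max–min normalization of the scores into primary-affinity values (max–min normalization is named in claim 8; feeding the result back through the `ceph osd primary-affinity` command is an assumption about the integration and should be checked against the Ceph release in use):

```python
def primary_affinities(scores):
    """Max-min normalize the scores of the OSDs that hold a replica, so the
    best-scoring OSD gets affinity 1.0 and the worst gets 0.0 (all equal -> 1.0)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {osd: 1.0 for osd in scores}
    return {osd: (s - lo) / (hi - lo) for osd, s in scores.items()}

aff = primary_affinities({"osd.0": 0.42, "osd.1": 0.77, "osd.3": 0.58})
for osd, a in aff.items():
    # One assumed way to push the value into Ceph (verify the command for your release):
    #   ceph osd primary-affinity <osd-id> <affinity in [0, 1]>
    print(f"ceph osd primary-affinity {osd} {a:.2f}")
```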
The related technical scheme is the same as embodiment 1, and is not described herein.
To illustrate the performance of the copy selection method provided by the invention, performance tests were run for three copy selection methods under three workloads, with the number of clients set to 10. Fig. 7 shows the average delay of the different copy selection strategies under the Read-only, Read-heavy and Update-heavy workloads; the abscissa is the copy selection strategy and the ordinate is the corresponding performance metric (average delay, in ms). MARLRS is the copy selection strategy provided by the invention, while the decentralized On-Off method and the centralized DRS-RT method are two existing copy selection strategies. As fig. 7 shows, the centralized DRS-RT method has a higher average delay than the decentralized On-Off method, because in the edge environment (where there is transmission delay between nodes) DRS-RT pays an extra request-forwarding delay. The MARLRS method has lower response delay than the other two methods under all three workloads, because it builds a high-dimensional model with multi-agent reinforcement learning and uses a centralized, server-side-ranking copy selection mechanism. Comparing the workloads, however, MARLRS shows its smallest average-delay reduction over the other two methods on the workload with the higher write ratio, because write operations incur synchronization overhead and MARLRS has no control over which nodes the synchronization copies are written to.
Further, table 3 shows the mean delay reduction ratio of the MARLRS provided by the present invention compared to the other two methods at three loads. Specifically, the average delay is reduced by 8.89%, 8.55%, and 2.47% compared to the On-Off method, respectively. The mean delay was reduced by 11.78%, 13.72% and 10.07% compared to the DRS-RT method, respectively.
TABLE 3
Workload       Reduction vs. On-Off   Reduction vs. DRS-RT
Read-only      8.89%                  11.78%
Read-heavy     8.55%                  13.72%
Update-heavy   2.47%                  10.07%
Further, many factors make the performance of a distributed storage system unstable, and the response delay of each node varies from request to request; at the edge, user mobility also means that different requests served by different servers see different response delays. The invention observes the stability of the system service by collecting the system's average response delay at every time instant over a long period, and uses this to verify the effectiveness of MARLRS; each time interval is 1 second long. Specifically, fig. 8 shows the average response delay at each time instant under the Read-only workload for the three strategies MARLRS, On-Off and DRS-RT, from top to bottom; the abscissa is the time instant (average delay data was collected for 1000 instants in total) and the ordinate is the delay (in ms). The figure shows more load-oscillation instants for the On-Off method: with the client acting as the selection node, each client has only a partial view, and multiple selection nodes find it hard to coordinate their strategies, which easily causes oscillation. From the overall trend of each subgraph, the overall average delay of the system fluctuates strongly under the On-Off and DRS-RT methods, showing that these two methods do not allocate requests well. Over this long observation, MARLRS is more effective than the On-Off and DRS-RT copy selection strategies, keeping the system's response delay more stable and providing a more stable quality of service.
Further, the change in average delay of the different copy selection strategies is observed as the number of clients (i.e., the overall system load) increases. The number of clients is set to 10, 20, 30, 40 and 50 in turn, and the average delay of each copy selection strategy is measured under the Read-only workload. Fig. 9 shows the impact of the number of clients on the delay of the three copy selection policies under the Read-only workload. Although the average response delay of all three policies rises as the number of clients increases (the system load grows), the average delay of MARLRS is lower than that of the On-Off method by 8.89%, 10.02%, 11.34%, 12.76% and 14.43% as the number of clients increases, and lower than that of the DRS-RT method by 11.78%, 12.04%, 12.12%, 12.15% and 11.88%, respectively. The average-delay reduction of MARLRS at the different numbers of clients under the Read-only workload is shown in table 4:
TABLE 4
Number of clients   Reduction vs. On-Off   Reduction vs. DRS-RT
10                  8.89%                  11.78%
20                  10.02%                 12.04%
30                  11.34%                 12.12%
40                  12.76%                 12.15%
50                  14.43%                 11.88%
As the data in table 4 show, the delay-reduction effect of MARLRS over On-Off grows as the number of clients increases. This indicates that as the number of clients (i.e., the level of concurrency) grows, the On-Off switching strategy selects less efficiently and finds it harder to coordinate decisions to suppress load oscillation, resulting in higher delay; at 40 clients its delay even exceeds that of the DRS-RT method, which relies on a forwarding mechanism. Meanwhile, the delay reduction of MARLRS relative to DRS-RT changes little, because both MARLRS and DRS-RT make centralized decisions. As concurrency increases, MARLRS, which makes a decision once per time interval, may see its delay rise when concurrency is high within an interval, but DRS-RT likewise suffers from the concurrency of single-point centralized decisions, leading to higher delay. Overall, as the number of clients increases, the MARLRS method outperforms the other two methods in average response delay.
In summary, the invention discloses a copy selection method for a distributed storage system that maintains a server ranking on the server side and distributes it to clients, so that copy selection has complete server state information and no forwarding delay overhead. To cope with the many factors that affect service quality in the edge environment, a performance modeling method based on multi-agent reinforcement learning is designed, using neural networks to build a high-dimensional performance model. By adjusting the structure and data flow of the basic model, different network structures are deployed in the cloud and at the edge, which speeds up the adjustment of the copy selection strategy and reduces the transmission overhead of cloud-edge data. Finally, since the placement of copies affects copy selection, a ranking-expectation-based copy placement optimization method is designed: the copy placement is adjusted according to the expectation of the server ranking, so that requests can choose servers with lower delay and request processing delay is reduced. The invention thus adapts better to copy selection in the edge environment and achieves a balance between performance and reliability.
Embodiment 3
A copy selection system, comprising a memory and a processor, wherein the memory stores a computer program and the processor executes the computer program to perform the copy selection method provided in embodiment 2 of the invention.
The related technical scheme is the same as embodiment 2, and is not described herein.
Embodiment 4
A computer-readable storage medium, which includes a stored computer program, wherein when the computer program is executed by a processor, the apparatus in which the storage medium is located is controlled to execute the copy selection method provided in embodiment 2 of the present invention.
The related technical scheme is the same as embodiment 2, and is not described herein.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A distributed storage system, comprising: the cloud terminal and the server terminal; the server side includes: a plurality of distributed edge servers; an Actor network is deployed in each edge server; the cloud end is provided with a plurality of Critic networks, the number of the Critic networks is the same as that of the edge servers, and one Critic network corresponds to one Actor network;
the operation process of the distributed storage system comprises the following steps:
at each time t, each edge server performs the following operations: the method comprises the steps that an edge server collects current state data of a network environment where the edge server is located as state information of the edge server, and the current state data are input into an Actor network used for scoring the service quality of the edge server to obtain a score of the edge server; after the edge server sends the state information and the scores of all the edge servers to a corresponding Critic network in the cloud to obtain the evaluation result, training an Actor network in the edge server by taking the maximum evaluation result as a target;
at each time t, the cloud performs the following operations: collecting the information sent by all edge servers, and after collecting the information sent by all edge servers at time t, calculating the reward value r_{t-1} at time t-1 and storing the corresponding tuple information into an experience pool; when the experience pool is full of data, randomly sampling tuple information data from the experience pool to train the Critic network; wherein the tuple information comprises: the state information of all edge servers at time t-1, the scores of all edge servers at time t-1, the reward value at time t-1, and the state information of all edge servers at time t.
2. The distributed storage system of claim 1, wherein the reward value r_{t-1} at time t-1 is computed from latency and request-count statistics of the edge servers; the defining formulas appear only as images in the original claim and are not reproduced here. The quantities they use are: the number of edge servers N; the average latency of each edge server; the mean of the average latencies over all edge servers; the number of requests processed by each edge server; and the average of the number of requests processed by each edge server.
3. The distributed storage system according to claim 1, wherein, in the operations performed at each time t, when the experience pool is not full of data or when Critic network training has completed, the cloud determines whether the elapsed time duration at time t is greater than a preset time period; if so, the scores of each edge server at different times are obtained from the experience pool and the score average of each edge server is calculated; taking the median of the score averages as the partition point, the edge servers are divided into low-latency edge servers and high-latency edge servers, where the score average of a low-latency edge server is greater than or equal to the partition point and that of a high-latency edge server is less than the partition point; the two partitions are represented by two root-bucket structures, denoted the Low bucket and the High bucket respectively; ⌈N/2⌉ low-latency edge servers are placed in the Low bucket and ⌊N/2⌋ high-latency edge servers are placed in the High bucket; ⌈M/2⌉ low-latency edge servers are selected from the Low bucket and ⌊M/2⌋ high-latency edge servers are selected from the High bucket to place the copies; otherwise, the operation of the cloud at time t ends; wherein N is the number of edge servers and M is the number of copies.
4. The distributed storage system according to any one of claims 1-3, wherein the Actor network comprises: an Actor online network and an Actor target network; the Critic network comprises a Critic online network and a Critic target network;
the operation process of the distributed storage system comprises the following steps:
at each time t, each edge server performs the following operations: the method comprises the steps that an edge server collects current state data of a network environment where the edge server is located to serve as state information of the edge server, and the current state data are respectively input into an Actor online network and an Actor target network inside the edge server to obtain a score output by the Actor online network and a score output by the Actor target network; the method comprises the steps that after an edge server sends state information of the edge server and scores output by Actor online networks of all edge servers to a corresponding criticic online network in a cloud end to obtain an evaluation result, the Actor online network in the edge server is trained by taking the maximum evaluation result as a target; after each training cycle, updating the target network of the Actor based on the parameters of the on-line network of the Actor;
at each time t, the cloud performs the following operations: collecting the information sent by all edge servers; after collecting the information sent by all edge servers at time t, calculating the reward value at time t-1 and storing the corresponding tuple information into an experience pool; when the experience pool is full of data, randomly sampling tuple information data from the experience pool to train each Critic network; the tuple information comprises: the state information s_{t-1} of all edge servers at time t-1, the scores a_{t-1} output by the Actor online networks of all edge servers at time t-1, the reward value r_{t-1} at time t-1, the state information s_t of all edge servers at time t, and the scores a'_t output by the Actor target networks of all edge servers at time t; wherein s_{t-1} collects the state information s_{t-1}^i of the i-th edge server at time t-1, a_{t-1} collects the scores a_{t-1}^i output by the Actor online networks at time t-1, s_t collects the state information s_t^i of the i-th edge server at time t, a'_t collects the scores a'_t^i output by the Actor target networks at time t, and N is the number of edge servers.
5. The distributed storage system according to claim 4, wherein the method for training the Critic networks from the tuple information data randomly sampled from the experience pool comprises the following steps:

recording the j-th sampled tuple information data as (s_b, a_b, r_b, s_{b+1}, a'_{b+1}), wherein s_b and a_b collect, over all edge servers, the state information s_b^i of the i-th edge server at time b and the score a_b^i output by the Actor online network of the i-th edge server at time b, and a'_{b+1} collects the scores a'^i_{b+1} output by the Actor target networks of the edge servers at time b+1;

acquiring an evaluation result and a corresponding evaluation label for each edge server based on the sampled tuple information data; wherein the evaluation result of the i-th edge server obtained from the j-th tuple information data is the output obtained by inputting s_b and a_b into the i-th Critic online network; the evaluation label of the i-th edge server obtained from the j-th tuple information data is the reward value r_b at time b plus the reward discount rate γ multiplied by the output obtained by inputting s_{b+1} and a'_{b+1} into the i-th Critic target network;
training each Critic online network by minimizing the difference between the evaluation result of each edge server and the corresponding evaluation label; and after each training cycle, updating the corresponding Critic target network based on the parameters of the Critic online network.
6. A copy selection method for a distributed storage system according to any one of claims 1 to 5, comprising: during the operation of the distributed storage system, when the server side receives a copy access request, ranking the edge servers according to their scores, and selecting the highest-ranked edge server that holds a copy of the requested data as the node for copy selection and data access.
7. The copy selection method of claim 6, wherein all edge servers in the distributed storage system form a Ceph system; the Ceph system normalizes the score of each edge server where a data copy exists, uses the normalized score as the primary-affinity parameter value of the corresponding edge server, and selects the edge server for data access based on the primary-affinity values.
8. The replica selection method of claim 7 wherein the Ceph system normalizes the scores of each edge server where a data replica exists using a max-min normalization method.
9. A copy selection system, comprising: a memory storing a computer program and a processor executing the computer program to perform the copy selection method of any of claims 6-8.
10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed by a processor, controls an apparatus in which the storage medium is located to perform the copy selection method of any of claims 6-8.
CN202210768871.2A 2022-06-30 2022-06-30 Distributed storage system and copy selection method thereof Active CN115190135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210768871.2A CN115190135B (en) 2022-06-30 2022-06-30 Distributed storage system and copy selection method thereof

Publications (2)

Publication Number Publication Date
CN115190135A true CN115190135A (en) 2022-10-14
CN115190135B CN115190135B (en) 2024-05-14

Family

ID=83515750


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362426A (en) * 2019-06-21 2019-10-22 华中科技大学 A kind of selective copy realization method and system towards sudden load
US20200351344A1 (en) * 2019-04-30 2020-11-05 EMC IP Holding Company LLC Data tiering for edge computers, hubs and central systems
CN112511336A (en) * 2020-11-05 2021-03-16 上海大学 Online service placement method in edge computing system
CN113014968A (en) * 2021-02-24 2021-06-22 南京大学 Multi-user dynamic code rate video transmission method and system based on reinforcement learning
CN113114756A (en) * 2021-04-08 2021-07-13 广西师范大学 Video cache updating method for self-adaptive code rate selection in mobile edge calculation
US11206221B1 (en) * 2021-06-04 2021-12-21 National University Of Defense Technology Online task dispatching and scheduling system and method thereof
CN113873022A (en) * 2021-09-23 2021-12-31 中国科学院上海微系统与信息技术研究所 Mobile edge network intelligent resource allocation method capable of dividing tasks
CN114423061A (en) * 2022-01-20 2022-04-29 重庆邮电大学 Wireless route optimization method based on attention mechanism and deep reinforcement learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAI Xingyu; LIAO Fei; CHEN Jie: "Research on Edge Computing Security Protection System", Communications Technology (通信技术), no. 01, 10 January 2020 (2020-01-10) *
LU Haifeng; GU Chunhua; LUO Fei; DING Weichao; YANG Ting; ZHENG Shuai: "Research on Task Offloading for Mobile Edge Computing Based on Deep Reinforcement Learning", Journal of Computer Research and Development (计算机研究与发展), no. 07, 7 July 2020 (2020-07-07) *



Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant