CN113779302B - Semi-distributed collaborative storage method based on value decomposition network and multiple agents - Google Patents

Semi-distributed collaborative storage method based on value decomposition network and multiple agents

Info

Publication number
CN113779302B
CN113779302B (application number CN202111058748.3A)
Authority
CN
China
Prior art keywords
network
wireless service
service node
wireless
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111058748.3A
Other languages
Chinese (zh)
Other versions
CN113779302A (en)
Inventor
陈由甲
蔡粤楷
郑海峰
胡锦松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202111058748.3A priority Critical patent/CN113779302B/en
Publication of CN113779302A publication Critical patent/CN113779302A/en
Application granted granted Critical
Publication of CN113779302B publication Critical patent/CN113779302B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 24/00 Supervisory, monitoring or testing arrangements
    • H04W 24/06 Testing, supervising or monitoring using simulated traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00 Network traffic management; Network resource management
    • H04W 28/16 Central resource management; Negotiation of resources or communication parameters, e.g. negotiating bandwidth or QoS [Quality of Service]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application provides a semi-distributed collaborative storage method based on a value decomposition network and multi-agent reinforcement learning. A semi-distributed multi-agent reinforcement learning framework, together with a state space, an action space and a reward function, is designed according to a wireless intelligent storage network model to characterize user and wireless service node information in the wireless network. Combined with the efficient decision-making capability of the Dueling DQN network, a dynamic storage algorithm is further provided for the storage replacement policy of each wireless service node. A value decomposition network embedded in the sink node of the wireless network computes global policy update parameters, which are transmitted to each wireless service node to update the local policy of each agent. The neural network in each agent is iteratively updated until the global loss function converges, yielding the globally optimal storage policy. The information of each agent is transmitted to the sink node to promote cooperation among the agents and to reach the global optimum quickly.

Description

Semi-distributed collaborative storage method based on value decomposition network and multiple agents
Technical Field
The application belongs to the field of wireless communication and computer technology, relates to deep reinforcement learning, distributed systems, algorithm complexity optimization, wireless transmission and the like, and particularly relates to a semi-distributed collaborative storage method based on a value decomposition network and multi-agent reinforcement learning.
Background
With the exponential growth of mobile wireless communication and data demands and the continuous improvement of device storage and computing capabilities, real-time multimedia services have gradually become the main business in 5G communication networks, and human life and work are migrating comprehensively to the mobile internet, pushing various network functions to the network edge, such as edge computing and edge storage. By storing the popular content requested by users, edge storage aims to reduce the traffic load and duplicate transmissions in the backhaul network, thereby significantly reducing transmission delay. In addition, with the rise of online video services, how to improve the experience of video users in wireless networks has become a new challenge. To capture the dynamic nature of the content requested by users and of the wireless environment, policy control frameworks have been introduced into the wireless storage domain. Deep reinforcement learning combines deep neural networks with Q-learning and shows excellent performance on complex control problems. Moreover, with the deployment of large-scale wireless service nodes, how to improve the overall service performance of the wireless network through cooperation among multiple wireless service nodes has attracted increasing attention.
Disclosure of Invention
In order to make up for the gaps and deficiencies of the prior art, the application aims to provide a semi-distributed collaborative storage method based on a value decomposition network and multi-agent reinforcement learning. A semi-distributed multi-agent reinforcement learning framework, together with a state space, an action space and a reward function, is designed according to a wireless intelligent storage network model to characterize user and wireless service node information in the wireless network. Combined with the efficient decision-making capability of the Dueling DQN network, a dynamic storage algorithm is further provided for the storage replacement policy of each wireless service node. A value decomposition network embedded in the sink node of the wireless network computes global policy update parameters, which are transmitted to each wireless service node to update the local policy of each agent. The neural network in each agent is iteratively updated until the global loss function converges, yielding the globally optimal storage policy. The information of each agent is transmitted to the sink node to promote cooperation among the agents and to reach the global optimum quickly.
The key of the application is to achieve accurate prediction of user demand. Considering the actual environmental complexity faced by users in a network, particularly a wireless network, a dimension decomposition mechanism is introduced, and a user service strategy algorithm based on dimension decomposition is finally provided in each agent; the final strategy converges through continuous iterative updating. Simulation results show that the algorithm achieves a remarkable improvement in reducing access delay and improving user service experience under various environment parameter settings. In addition, the algorithm can handle a very large action space; the semi-distributed framework constructed with value decomposition accelerates the convergence of the whole system, has low computational complexity, and saves most of the running time compared with traditional multi-agent algorithms.
The application adopts the following technical scheme:
the semi-distributed collaborative storage method based on the value decomposition network and the multi-agent reinforcement learning is characterized by comprising the following steps of:
step S1: the method comprises the steps of constructing a wireless network model of multi-equipment cooperation semi-distributed cooperation storage based on wireless network transmission, comprising aggregation nodes and wireless service nodes, defining an agent state space and an action space based on value decomposition network and multi-agent deep reinforcement learning, combining the state space and the action space, and a reward function based on optimization target design so as to improve the wireless network service quality to the maximum extent and reduce the access delay of storage contents;
step S2: and collecting and analyzing information about each wireless service node in the sink node, coordinating the cooperation of each wireless service node by constructing a value decomposition network model, namely, outputting a global action cost function and a global strategy updating parameter of the whole system by taking the action cost function of each wireless service node as the input of the value decomposition network, and feeding back the result to the whole semi-distributed system, wherein the feedback comprises feeding back the result to each wireless service node to update the strategy of the single wireless service node so as to improve the cooperation performance and convergence speed of wireless edge storage.
Further, the step S1 specifically includes the following steps:
Step S11: defining the user set, the affiliation of users to wireless service nodes, the user request variables, the wireless service node storage variables, the file set, the quality variable, the video layer set and the wireless service node set, as well as the unit delays of a local hit, a collaborative hit and a download from the server, and the user request quality and quality-of-service variables;
Step S12: constructing the performance indicators of the storage model, including the video access delay and the user experience score, and constructing the final optimization objective, namely the reward function, based on these two objective optimization problems; defining the user request variables, the user request quality and the wireless service node storage variables as the state space, and the wireless service node storage variables and the user service quality at the next moment as the action space;
Step S13: fitting states and actions with a Dueling DQN network, which splits the branches of the neural network into a state-value branch for estimating the current wireless network state and an advantage-action branch for estimating each action; the quality of each action is evaluated by combining the state value and the advantage value.
Further, step S11 specifically includes: the user set is denoted {1, ..., i, ..., I}, and U_j denotes the set of users belonging to wireless service node j; λ_{iv} is the user request variable and δ_{jvl} the wireless service node storage variable; the file set is {1, ..., v, ..., V}, the quality variable is K, the video layer set is {1, ..., l, ..., L}, and the wireless service node set is {1, ..., j, ..., J}; the unit access delays d_0, d_{jj'} and d_j denote the unit delays of a local hit, a collaborative hit and a download from the server, respectively; the user request quality k_{iv} and the quality-of-service variable are also defined.
In step S12, the performance indicators of the storage model are constructed, including the video access delay D and the user experience score M, where c_1 = 0.16 and c_2 = 0.66 are quality evaluation coefficients. Based on these two objective optimization problems, the final optimization objective, namely the reward function, is constructed, where η is a weight coefficient used to adjust the relative weight of the access delay and the user experience score.
In step S13, since the Dueling DQN network is divided into a target network and an evaluation network, the action-value estimate q(s, a; θ) is likewise split into q_{ej} and q_{gj}.
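The dueling architecture used in step S13 can be sketched as follows. This is a minimal illustrative sketch in PyTorch rather than the patent's implementation: the hidden width, layer count and the names q_e and q_g (mirroring q_{ej} and q_{gj} for the evaluation and target copies) are assumptions.

```python
import copy
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Per-node Dueling DQN: a shared trunk feeds a state-value branch V(s)
    and an advantage branch A(s, a), recombined into Q(s, a)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state-value branch
        self.advantage = nn.Linear(hidden, n_actions)  # advantage-action branch

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        v = self.value(h)                              # (batch, 1)
        a = self.advantage(h)                          # (batch, n_actions)
        # Standard dueling combination: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)

# Evaluation network q_e and its delayed, structurally identical target copy q_g.
q_e = DuelingDQN(state_dim=32, n_actions=16)
q_g = copy.deepcopy(q_e)
```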
Further, the step S2 specifically includes the following steps:
Step S21: introducing a value decomposition network into the sink node; the sink node first collects the states and rewards of all agents to construct the joint state and joint action, and computes the reward value of the whole system from them; an experience replay buffer is also introduced to store samples containing the four elements (S^(t), A^(t), r^(t), S^(t+1)); each sample is used to compute the respective action-value function q(s, a; θ), and the value decomposition network is finally used to compute the global action-value function of the whole system;
Step S22: constructing a loss function according to the global action cost function of the whole system calculated in the step S21 and the aggregate rewarding functionTo calculate the global policy update parameters, the global strategy updating parameters obtained by training are reversely transferred back to each wireless service node in the wireless service node group, so that the wireless service node is convenient to use a gradient updating method aiming at the self neural network>And updating to obtain a better strategy.
Further, a dimension decomposition mechanism is embedded into the Dueling DQN to reduce decision complexity and improve wireless service performance:
the actions output by the Dueling DQN network are decomposed according to the dimensions of their actual physical meaning, namely into three dimensions: which type of video to store, δ_{jv}; which video layer to store, δ_{jl}; and with what quality to serve the user; the actions in each dimension are represented by a separate neural network branch, and each dimension selects its action independently without affecting the others;
further, after the dimension decomposition mechanism is embedded, the action-value function is computed per dimension in the Dueling DQN network, and the computation of the global action-value function at the sink node is updated accordingly.
Also provided is a wireless network model, comprising: wireless service nodes, a sink node, a source server and a core network; each wireless service node can download files from the source server through a backhaul link, store them locally and directly serve the users in its cell; content storage and inter-user collaboration are performed using the semi-distributed collaborative storage method based on a value decomposition network and multi-agent reinforcement learning described above.
Further, for each wireless service node, i.e. each agent, the user request set and the file storage set serve as the state space and as the input of the neural network; the output of the neural network is the set of contents to store in the next time period and the set of file qualities with which to serve users; each wireless service node can download files from the source server through the backhaul link and store them locally to directly serve the users within its coverage; to reduce file download time, different wireless service nodes are allowed to perform multi-device cooperation through the sink node, further improving the service quality of the wireless network and reducing the access delay of stored content.
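To illustrate how a single wireless service node acts on this local observation, a hedged sketch of one decision step follows. It reuses the BranchingDuelingDQN sketch above and assumes the user request set and local storage set are encoded as flat tensors; the epsilon-greedy exploration rule is an assumption, since the patent text does not specify the exploration strategy.

```python
import random
import torch

def local_decision(agent_net, user_requests, storage_state, epsilon: float = 0.1):
    """One local step at a wireless service node: build the state from the current
    user request set and the local file storage set, then pick the next
    storage/serving action per dimension with an epsilon-greedy rule."""
    state = torch.cat([user_requests, storage_state]).unsqueeze(0)
    q_per_dim = agent_net(state)                # one Q head per action dimension
    action = []
    for q in q_per_dim:
        if random.random() < epsilon:           # explore
            action.append(random.randrange(q.shape[1]))
        else:                                   # exploit
            action.append(int(q.argmax(dim=1)))
    return action  # e.g. [video to store, layer to store, quality to serve]
```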
Compared with the prior art, the method and system use the value decomposition network at the sink node to promote cooperation among multiple agents, and at the same time use the dimension decomposition mechanism to handle the decision complexity of a real wireless network environment, thereby improving the capability of mobile wireless edge storage. Good performance is achieved even when storage resources are limited, computing resources are limited, and the user and wireless environment are complex, reducing the access delay of stored content and improving the quality of service for users.
Drawings
The application is described in further detail below with reference to the attached drawings and detailed description:
fig. 1 is a schematic diagram of a semi-distributed wireless collaborative storage network model in an embodiment of the present application.
FIG. 2 is a schematic diagram of a Dueling DQN network in an embodiment of the application.
FIG. 3 is a schematic diagram of a dimension decomposition mechanism.
FIG. 4 is a graph comparing results under different file parameters in the examples of the present application.
FIG. 5 is a graph comparing the performance of different algorithms in an embodiment of the application.
FIG. 6 is a graph comparing results of different dimensions in an example of the application.
Detailed Description
In order to make the features and advantages of the present patent more comprehensible, embodiments accompanied with figures are described in detail below:
it should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The semi-distributed collaborative storage algorithm based on the value decomposition network and multi-agent reinforcement learning provided by this embodiment is implemented according to the following steps:
Step S1: providing a wireless network model of semi-distributed collaborative storage, defining the state space and action space of each agent and a reward function designed based on delay and user experience, aiming to maximize the service quality of the local wireless service node;
1) In this embodiment, {1, ..., i, ..., I} denotes the user set, U_j the set of users belonging to wireless service node j, λ_{iv} the user request variable, δ_{jvl} the wireless service node storage variable, {1, ..., v, ..., V} the file set, K the quality variable, {1, ..., l, ..., L} the video layer set and {1, ..., j, ..., J} the wireless service node set. In addition, the unit access delays d_0, d_{jj'} and d_j denote the unit delays of a local hit, a collaborative hit and a download from the server, respectively, and, for constructing the optimization problem, the user request quality k_{iv} and the quality-of-service variable are defined.
2) Build the performance indicators of the storage model, including the video access delay D and the user experience score M, and based on these two objective optimization problems construct the final optimization objective, namely the reward function. In addition, the user request variables, the user request quality and the wireless service node storage variables are defined as the state space, and the wireless service node storage variables and the user service quality at the next moment as the action space.
3) States and actions are fitted with a Dueling DQN network, which splits the branches of the neural network into state-value branches, used mainly to estimate the current wireless network state, and advantage-action branches, used to estimate each action; the quality of each action can then be accurately estimated by q(s, a; θ), which combines the state value and the advantage value. Because the Dueling DQN network is further divided into a target network and an evaluation network, q(s, a; θ) is likewise split into q_{ej} and q_{gj}.
Step S2: collecting and analyzing information about each wireless service node at the sink node, and coordinating the cooperation of the wireless service nodes by constructing a value decomposition network model; that is, taking the action-value function of each agent as the input of the value decomposition network, outputting the global action-value function of the whole system and the global policy update parameters, feeding the result back to the whole semi-distributed system, and improving the cooperation performance of wireless edge storage;
by defining an action cost function, carrying out continuous iteration of network parameters, and finally, each wireless service node can obtain an optimal storage strategy and a user service strategy in each non-communication state; in order to achieve better performance of the neural network, the Dueling DQN network of the present embodiment may additionally use a duel-bucket mechanism and a dual-network mechanism. The dual-network mechanism adopts a neural network with completely consistent structure for delay updating to improve the stability of the algorithm, so that the algorithm is easier to converge. The decision mechanism additionally adopts the scores of the estimated state value and the dominant value to judge the merits of the output action of the neural network, so that the decision is more accurate.
1) For better cooperation between the wireless service nodes, a new module called the value decomposition network is introduced into the sink node; it collects the states and rewards of all agents to construct the joint state and joint action and computes the reward value of the whole system. In addition, an experience replay buffer is introduced to store samples containing the four elements (S^(t), A^(t), r^(t), S^(t+1)); each sample is used to compute the respective action-value function q(s, a; θ), and the value decomposition network is finally used to compute the global action-value function of the whole system.
2) From the global action-value function of the whole system computed above and the joint reward function, a loss function is constructed to compute the global policy update parameters; the trained global policy update parameters can be propagated back to each wireless service node in the group, so that each node can update its own neural network by gradient descent and obtain a better strategy, which performs better in terms of collaboration and predictability.
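A minimal sketch of this sink-node update is shown below, under the standard value-decomposition assumption that the global action value is the sum of the per-agent action values. The batch layout, the discount factor GAMMA and the function name vdn_update are illustrative assumptions rather than the patent's exact formulation; for simplicity the sketch uses the flat action head of the earlier DuelingDQN sketch, while with dimension decomposition the same sum would be taken per dimension.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # assumed discount factor

def vdn_update(agents_eval, agents_target, optimizers, batch):
    """One semi-distributed update at the sink node.

    batch["agents"] holds one (s, a, s_next) tuple per wireless service node,
    and batch["reward"] the joint reward, all drawn from the shared replay of
    (S(t), A(t), r(t), S(t+1)) samples."""
    # Chosen local action values q_j(s_j, a_j; theta_j).
    q_chosen = [
        net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        for net, (s, a, _) in zip(agents_eval, batch["agents"])
    ]
    # Value decomposition: the global action value is the sum of the local ones.
    q_tot = torch.stack(q_chosen, dim=0).sum(dim=0)

    with torch.no_grad():
        q_next = [
            tgt(s_next).max(dim=1).values
            for tgt, (_, _, s_next) in zip(agents_target, batch["agents"])
        ]
        target = batch["reward"] + GAMMA * torch.stack(q_next, dim=0).sum(dim=0)

    loss = F.mse_loss(q_tot, target)   # global loss at the sink node
    for opt in optimizers:
        opt.zero_grad()
    loss.backward()                    # gradients flow back to every local network
    for opt in optimizers:
        opt.step()
    return loss.item()
```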
Step S3: considering the complexity of the actual deep reinforcement learning environment, a dimension decomposition mechanism is additionally embedded into the Dueling DQN to reduce decision complexity and improve wireless service performance.
1) The actions output by the Dueling DQN network are decomposed by dimension; in this scenario the actions are decomposed into three dimensions: which type of video to store, δ_{jv}; which video layer to store, δ_{jl}; and with what quality to serve the user. The actions in each dimension are represented by a separate neural network branch, so all actions are selected independently in their own dimension and do not affect each other.
2) After the dimension decomposition mechanism is embedded, the computation of the action-value function changes correspondingly: it is computed per dimension in the Dueling DQN network, and the computation of the global action-value function at the sink node is updated accordingly.
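Putting the pieces together, a rough sketch of the semi-distributed training loop follows. Here env and replay are placeholder interfaces that the patent does not define, the loop reuses the local_decision, vdn_update and sync_target sketches above, and it glosses over the difference between the flat and the dimension-decomposed action heads; it is meant only to show the flow of information between the wireless service nodes and the sink node.

```python
def train(agents_eval, agents_target, optimizers, env, replay,
          episodes: int = 500, batch_size: int = 64):
    """Semi-distributed loop: nodes act locally, the sink node aggregates their
    transitions, computes the global VDN loss and pushes updates back."""
    step = 0
    for _ in range(episodes):
        states = env.reset()        # list of per-node (user_requests, storage_state)
        done = False
        while not done:
            actions = [local_decision(net, req, sto)
                       for net, (req, sto) in zip(agents_eval, states)]
            next_states, reward, done = env.step(actions)  # joint system reward
            replay.add(states, actions, reward, next_states)
            states = next_states
            if len(replay) >= batch_size:
                vdn_update(agents_eval, agents_target, optimizers,
                           replay.sample(batch_size))       # sink-node update
            for q_e, q_g in zip(agents_eval, agents_target):
                sync_target(q_e, q_g, step)                  # delayed target copy
            step += 1
```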
in order to further understand the semi-distributed collaborative storage algorithm based on the value decomposition network and multi-agent reinforcement learning proposed by the present application, the following detailed description is provided with reference to specific embodiments. The embodiment is implemented on the premise of the technical scheme of the application.
As shown in fig. 1, a semi-distributed wireless smart storage network model is provided.
The model mainly comprises wireless service nodes, a sink node, a source server and a core network; it introduces a user storage model under the wireless service nodes and an inter-node cooperation model; each wireless service node can download files from the source server through a backhaul link and store them locally to directly serve the users in its cell.
As shown in fig. 2, a diagram of a Dueling DQN network and a value decomposition network is shown.
The network framework is divided into a target network and an evaluation network; the neural network branch of each is further divided into a state-value network and an advantage-value network, and this double-network, double-branch architecture evaluates the quality of actions more reliably. The network takes the state space as input and the action space as output, and continuously optimizes its parameters by receiving the global policy update parameters.
As shown in fig. 3, the decomposition mechanism is schematically represented.
The composite action space is decomposed into independent actions in multiple dimensions; each dimension selects its action in an independent neural network branch, which avoids the high complexity of the composite action.
As shown in FIG. 4, a comparison of results under different file parameters is shown in the examples.
According to the experimental results, the algorithm can cope with the heavy computation required for large numbers of files, and a global strategy can still be found under different high-complexity settings, so the reward value converges to a high value.
The analysis shows that the semi-distributed collaborative storage algorithm based on the value decomposition network and multi-agent reinforcement learning provided by the application achieves better storage performance than existing methods, clearly improves content storage for users, and has reference value and practical economic benefit.
FIG. 5 is a graph showing the comparison of algorithm performance in an embodiment of the present application.
Compared with traditional multi-agent algorithms, the introduced semi-distributed architecture and dimension decomposition method improve performance by 23.4% and 30.5% under the 5-file and 10-file settings, respectively, and converge faster, so the algorithm can quickly locate the globally optimal strategy in different scenarios.
FIG. 6 is a graph showing the comparison of dimension number results in an example of the present application.
According to the experimental results, the algorithm can be extended on the dimension decomposition mechanism: the more subdivision branches there are (provided the action decomposition criterion is met), the better the performance obtained.
The above method provided in this embodiment may be stored in a computer-readable storage medium in coded form and implemented as a computer program; the basic parameter information required for the calculation is input through computer hardware, and the calculation result is output.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present application and not for limiting the same, and although the present application has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the application without departing from the spirit and scope of the application, which is intended to be covered by the claims.
The patent is not limited to the best mode; anyone may derive various other semi-distributed collaborative storage methods based on a value decomposition network and multi-agent reinforcement learning under the teaching of this patent, and all equivalent changes and modifications made according to the scope of the claims of this patent are covered by this patent.

Claims (3)

1. The semi-distributed collaborative storage method based on the value decomposition network and multi-agent reinforcement learning is characterized by comprising the following steps:
Step S1: constructing a wireless network model of multi-device cooperative, semi-distributed collaborative storage based on wireless network transmission, comprising a sink node and wireless service nodes; defining the agent state space and action space based on the value decomposition network and multi-agent deep reinforcement learning, and, combined with the state space and action space, a reward function designed from the optimization objective, so as to maximize the wireless network service quality and reduce the access delay of stored content;
Step S2: collecting and analyzing information about each wireless service node at the sink node, and coordinating the cooperation of the wireless service nodes by constructing a value decomposition network model; that is, taking the action-value function of each wireless service node as the input of the value decomposition network, outputting the global action-value function of the whole system and the global policy update parameters, and feeding the result back to the whole semi-distributed system, wherein the update parameters are fed back to each wireless service node to update the policy of that individual node, so as to improve the cooperation performance and convergence speed of wireless edge storage;
the step S1 specifically comprises the following steps:
Step S11: defining the user set, the affiliation of users to wireless service nodes, the user request variables, the wireless service node storage variables, the file set, the quality variable, the video layer set and the wireless service node set, as well as the unit delays of a local hit, a collaborative hit and a download from the server, and the user request quality and quality-of-service variables;
Step S12: constructing the performance indicators of the storage model, including the video access delay and the user experience score, and constructing the final optimization objective, namely the reward function, based on these two objective optimization problems; defining the user request variables, the user request quality and the wireless service node storage variables as the state space, and the wireless service node storage variables and the user service quality at the next moment as the action space;
Step S13: fitting states and actions with a Dueling DQN network, which splits the branches of the neural network into a state-value branch for estimating the current wireless network state and an advantage-action branch for estimating each action; evaluating the quality of each action by combining the state value and the advantage value;
the step S11 specifically includes: a user set represented by 1, U (U) j Representing the set of users belonging to a wireless service node j, the user request variable lambda iv And wireless service node storage variable delta jvl File set { v. V, the quality variable K is a function of the quality variable, the video layer set 1. L, wireless service node set {1..j..j }; using unit access delay d 0 ,d jj' ,d j Respectively representing local hits, collaborative hits and unit delays downloaded from a server; defining user request quality k iv And quality of service variables
In step S12, the performance indicators of the storage model are constructed, including the video access delay D and the user experience score M, where c_1 = 0.16 and c_2 = 0.66 are quality evaluation coefficients; based on these two objective optimization problems, the final optimization objective, namely the reward function, is constructed, where η is a weight coefficient used to adjust the relative weight of the access delay and the user experience score;
in step S13, since the Dueling DQN network is divided into a target network and an evaluation network, the action-value estimate q(s, a; θ) is likewise split into q_{ej} and q_{gj};
The step S2 specifically comprises the following steps:
Step S21: introducing a value decomposition network into the sink node; the sink node first collects the states and rewards of all agents to construct the joint state and joint action, and computes the reward value of the whole system from them; an experience replay buffer is also introduced to store samples containing the four elements (S^(t), A^(t), r^(t), S^(t+1)); each sample is used to compute the respective action-value function q(s, a; θ), and the value decomposition network is finally used to compute the global action-value function of the whole system;
Step S22: according to the global action cost function of the whole system calculated in the step S21, the aggregate rewarding function constructs a loss function to calculate a global strategy updating parameter, the global strategy updating parameters obtained by training are reversely transmitted back to each wireless service node in the wireless service node group, so that the gradient is used for the neural network of the wireless service node groupNew method->Updating to obtain a better strategy;
embedding the dimension decomposition mechanism into the Dueling DQN to reduce decision complexity and improve wireless service performance:
the actions output by the Dueling DQN network are decomposed according to the dimensions of their actual physical meaning, namely into three dimensions: which type of video to store, δ_{jv}; which video layer to store, δ_{jl}; and with what quality to serve the user; the actions in each dimension are represented by a separate neural network branch, and each dimension selects its action independently without affecting the others;
after the dimension decomposition mechanism is embedded, the action-value function is computed per dimension in the Dueling DQN network, and the computation of the global action-value function at the sink node is updated accordingly.
2. A wireless network model, comprising: wireless service nodes, a sink node, a source server and a core network; each wireless service node can download files from the source server through a backhaul link, store them locally and directly serve the users in its cell; content storage and inter-user collaboration are performed using the semi-distributed collaborative storage method based on a value decomposition network and multi-agent reinforcement learning of claim 1.
3. The wireless network model of claim 2, wherein: for each wireless service node, i.e. each agent, the user request set and the file storage set serve as the state space and as the input of the neural network; the output of the neural network is the set of contents to store in the next time period and the set of file qualities with which to serve users; each wireless service node can download files from the source server through the backhaul link and store them locally to directly serve the users within its coverage; to reduce file download time, different wireless service nodes are allowed to perform multi-device cooperation through the sink node, further improving the service quality of the wireless network and reducing the access delay of stored content.
CN202111058748.3A 2021-09-09 2021-09-09 Semi-distributed collaborative storage method based on value decomposition network and multiple agents Active CN113779302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111058748.3A CN113779302B (en) 2021-09-09 2021-09-09 Semi-distributed collaborative storage method based on value decomposition network and multiple agents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111058748.3A CN113779302B (en) 2021-09-09 2021-09-09 Semi-distributed collaborative storage method based on value decomposition network and multiple agents

Publications (2)

Publication Number Publication Date
CN113779302A CN113779302A (en) 2021-12-10
CN113779302B true CN113779302B (en) 2023-09-22

Family

ID=78842194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111058748.3A Active CN113779302B (en) 2021-09-09 2021-09-09 Semi-distributed collaborative storage method based on value decomposition network and multiple agents

Country Status (1)

Country Link
CN (1) CN113779302B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115065728B (en) * 2022-06-13 2023-12-08 福州大学 Multi-strategy reinforcement learning-based multi-target content storage method
CN115086374A (en) * 2022-06-14 2022-09-20 河南职业技术学院 Scene complexity self-adaptive multi-agent layered cooperation method
CN114867061B (en) * 2022-07-05 2022-12-13 深圳市搜了网络科技股份有限公司 Cloud monitoring method based on wireless communication network
CN116996919B (en) * 2023-09-26 2023-12-05 中南大学 Single-node multi-domain anti-interference method based on reinforcement learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079305A (en) * 2019-12-27 2020-04-28 南京航空航天大学 Different-strategy multi-agent reinforcement learning cooperation method based on lambda-reward
CN112364984A (en) * 2020-11-13 2021-02-12 南京航空航天大学 Cooperative multi-agent reinforcement learning method
CN112396187A (en) * 2020-11-19 2021-02-23 天津大学 Multi-agent reinforcement learning method based on dynamic collaborative map
CN112465151A (en) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) Multi-agent federal cooperation method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10375585B2 (en) * 2017-07-06 2019-08-06 Futurwei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079305A (en) * 2019-12-27 2020-04-28 南京航空航天大学 Different-strategy multi-agent reinforcement learning cooperation method based on lambda-reward
CN112364984A (en) * 2020-11-13 2021-02-12 南京航空航天大学 Cooperative multi-agent reinforcement learning method
CN112396187A (en) * 2020-11-19 2021-02-23 天津大学 Multi-agent reinforcement learning method based on dynamic collaborative map
CN112465151A (en) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) Multi-agent federal cooperation method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Anti-jamming resource scheduling algorithm for directional wireless communication networks based on reinforcement learning; 谢添; 高士顺; 赵海涛; 林沂; 熊俊; Chinese Journal of Radio Science (04); full text *
Wireless network resource allocation algorithm based on deep reinforcement learning; 李孜恒; 孟超; Communications Technology (08); full text *

Also Published As

Publication number Publication date
CN113779302A (en) 2021-12-10

Similar Documents

Publication Publication Date Title
CN113779302B (en) Semi-distributed collaborative storage method based on value decomposition network and multiple agents
Yang et al. Deep reinforcement learning based wireless network optimization: A comparative study
CN114710439B (en) Network energy consumption and throughput joint optimization routing method based on deep reinforcement learning
CN116489712B (en) Mobile edge computing task unloading method based on deep reinforcement learning
CN115065728B (en) Multi-strategy reinforcement learning-based multi-target content storage method
Jeon et al. A distributed NWDAF architecture for federated learning in 5G
Yan et al. Distributed edge caching with content recommendation in fog-rans via deep reinforcement learning
CN116185523A (en) Task unloading and deployment method
Li et al. DQN-enabled content caching and quantum ant colony-based computation offloading in MEC
Hao et al. Optimal IoT service offloading with uncertainty in SDN-based mobile edge computing
CN102077526B (en) Method, apparatus and computer program product for distributed information management
Zhang et al. On-device intelligence for 5g ran: Knowledge transfer and federated learning enabled ue-centric traffic steering
CN116663644A (en) Multi-compression version Yun Bianduan DNN collaborative reasoning acceleration method
Cui et al. Resource-Efficient DNN Training and Inference for Heterogeneous Edge Intelligence in 6G
Zhao et al. MEDIA: An incremental DNN based computation offloading for collaborative cloud-edge computing
CN110366210A (en) A kind of calculating discharging method for the application of stateful data flow
Zhai et al. Collaborative computation offloading for cost minimization in hybrid computing systems
Niu et al. A pipelining task offloading strategy via delay-aware multi-agent reinforcement learning in Cybertwin-enabled 6G network
CN117640413B (en) Micro-service and database joint deployment method based on reinforcement learning in fog calculation
CN114006817B (en) VGDT construction method and device oriented to SDN and readable storage medium
CN116634388B (en) Electric power fusion network-oriented big data edge caching and resource scheduling method and system
Jia An edge computing-based evaluation and optimisation of online higher vocational education mechanism
CN115190135B (en) Distributed storage system and copy selection method thereof
Du et al. 5G Message Cooperative Content Caching Scheme for Blockchain-Enabled Mobile Edge Networks Using Reinforcement Learning
Wang et al. Research on computing offloading methods based on edge computing and reinforcement learning in the industrial Internet of Things

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant