CN113779302A - Semi-distributed cooperative storage method based on value decomposition network and multi-agent reinforcement learning - Google Patents


Info

Publication number
CN113779302A
Authority
CN
China
Prior art keywords
network
wireless
wireless service
storage
service node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111058748.3A
Other languages
Chinese (zh)
Other versions
CN113779302B (en)
Inventor
陈由甲
蔡粤楷
郑海峰
胡锦松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202111058748.3A
Publication of CN113779302A
Application granted
Publication of CN113779302B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/71 - Indexing; Data structures therefor; Storage structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 - Protocols
    • H04L 67/06 - Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 - Protocols
    • H04L 67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W 24/00 - Supervisory, monitoring or testing arrangements
    • H04W 24/06 - Testing, supervising or monitoring using simulated traffic
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00 - Network traffic management; Network resource management
    • H04W 28/16 - Central resource management; Negotiation of resources or communication parameters, e.g. negotiating bandwidth or QoS [Quality of Service]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides a semi-distributed cooperative storage method based on a value decomposition network and multi-agent reinforcement learning. A semi-distributed multi-agent reinforcement learning framework is designed according to a wireless intelligent storage network model, and a state space, an action space, and a reward function are designed to characterize the information of users and wireless service nodes in the wireless network. A dynamic storage algorithm built on the efficient decision-making capability of the Dueling DQN network supplies the storage replacement policy of each wireless service node. A value decomposition network embedded in the sink node of the wireless network computes global policy update parameters, which are delivered to each wireless service node to update each agent's local policy. The neural networks inside the agents are iteratively updated until the global loss function converges, yielding a globally optimal storage policy. Because the information of every agent is conveyed to the sink node, the agents cooperate with one another and the global optimum is reached quickly.

Description

Semi-distributed cooperative storage method based on value decomposition network and multi-agent reinforcement learning
Technical Field
The invention belongs to the field of wireless communication and computer technology, relates to deep reinforcement learning within machine learning, distributed systems, algorithm complexity optimization, wireless transmission and the like, and particularly relates to a semi-distributed cooperative storage method based on a value decomposition network and multi-agent reinforcement learning.
Background
With the exponential growth of mobile wireless communication and data demand and the continuous increase of device storage and computing power, real-time multimedia services are gradually becoming a major business in 5G communication networks. Daily life and work are migrating wholesale onto the mobile internet, pushing various network functions to the edge of the network, such as edge computing and edge storage. By storing popular content requested by users, edge storage aims to reduce the traffic load and duplicate transmissions in the backhaul network, thereby significantly reducing transmission delay. In addition, with the rise of online video services, improving the experience of video users in wireless networks has become a new challenge, which calls for a dedicated video service policy. To capture the dynamics of user-requested content and the wireless environment, a policy control framework is introduced into the field of wireless storage. Moreover, owing to the deployment of large-scale wireless service nodes, increasing attention is being paid to improving the overall service performance of the wireless network through cooperation among multiple wireless service nodes.
Disclosure of Invention
To fill the gaps and remedy the deficiencies of the prior art, the invention aims to provide a semi-distributed cooperative storage method based on a value decomposition network and multi-agent reinforcement learning. A semi-distributed multi-agent reinforcement learning framework, a state space, an action space, and a reward function are designed according to a wireless intelligent storage network model to characterize the information of users and wireless service nodes in the wireless network; a dynamic storage algorithm built on the efficient decision-making capability of the Dueling DQN network supplies the storage replacement policy of each wireless service node; a value decomposition network embedded in the sink node of the wireless network computes global policy update parameters, which are delivered to each wireless service node to update each agent's local policy; and the neural networks inside the agents are iteratively updated until the global loss function converges, yielding a globally optimal storage policy. Because the information of every agent is conveyed to the sink node, the agents cooperate with one another and the global optimum is reached quickly.
The key problem addressed by the invention is accurately predicting user demand. Considering the actual environmental complexity faced by users in networks, and in wireless networks in particular, a dimension decomposition mechanism is introduced, and a user service policy algorithm based on dimension decomposition is provided within each agent; the final policy converges through continuous updating and iteration. Simulation results show that, across various environmental parameter settings, the algorithm markedly reduces access delay and improves the user-experienced quality of service. In addition, the algorithm can handle a very large action space; the semi-distributed framework constructed with value decomposition accelerates the convergence of the whole system, the computational complexity is low, and most of the running time is saved compared with conventional multi-agent algorithms.
The invention specifically adopts the following technical scheme:
a semi-distributed cooperative storage method based on a value decomposition network and multi-agent reinforcement learning is characterized in that the implementation process comprises the following steps:
step S1: constructing a wireless network model of multi-device cooperative semi-distributed storage based on wireless network transmission, the model comprising a sink node and wireless service nodes; defining, on the basis of the value decomposition network and multi-agent deep reinforcement learning, the state space and action space of each agent, the joint state space and joint action space, and a reward function designed from the optimization objective, so as to maximize the wireless network service quality and reduce the access delay of stored content;
step S2: collecting and analyzing information about each wireless service node at the sink node, and coordinating the cooperation of the wireless service nodes through the constructed value decomposition network model: the action value function of each wireless service node serves as the input of the value decomposition network, whose outputs are the global action value function and the global policy update parameters of the whole system; the results are fed back to the whole semi-distributed system, including feeding the update parameters back to each wireless service node to update the policy of the individual node, thereby improving the cooperation performance and convergence speed of wireless edge storage.
Further, step S1 specifically includes the following steps:
step S11: defining a user set and the membership of users to wireless service nodes, user request variables, wireless service node storage variables, a file set, a quality variable, a video layer set and a wireless service node set, the unit delays of a local hit, a cooperative hit and a download from the server, and user request quality and service quality variables;
step S12: constructing the performance indicators of the storage model, comprising the video access delay and the user experience score, and constructing the final optimization objective, i.e. the reward function, based on this two-objective optimization problem; defining the user request variables, the user request quality and the wireless service node storage variables as the state space, and the wireless service node storage variables and the user service quality at the next moment as the action space;
step S13: utilizing a Dueling DQN network to perform state and action fitting, wherein the Dueling DQN network splits branches of a neural network into a state value branch and a dominant action branch, the state value branch is used for estimating the state of the current wireless network, and the dominant action branch is used for estimating each action; and evaluating the performance of each action by combining the state value and the advantage value.
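The aggregation of the two branches is rendered only as images in the original; the standard Dueling DQN combination that the text describes (state value plus centered advantage) is:

$$q(s, a; \theta) = V(s; \theta_V) + A(s, a; \theta_A) - \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} A(s, a'; \theta_A)$$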
Further, step S11 is specifically: a user set is expressed as {1, …, i, …, I}, with U_j denoting the set of users belonging to wireless service node j; a user request variable λ_iv and a wireless service node storage variable δ_jvl are defined, along with a file set {1, …, v, …, V}, a quality variable K, a video layer set {1, …, l, …, L}, and a wireless service node set {1, …, j, …, J}; unit access delays d_0, d_jj', and d_j denote the unit delay of a local hit, a cooperative hit, and a download from the server, respectively;
and a user request quality k_iv and a service quality variable (symbol rendered as an image in the original) are defined.
In step S12, the performance indicators of the storage model are constructed, comprising the video access delay D and the user experience score M; their defining formulas are rendered as images in the original, where c_1 = 0.16 and c_2 = 0.66 are quality evaluation coefficients.
The final optimization objective, i.e. the reward function, is likewise constructed (formula rendered as an image in the original), where η is a weighting factor used to adjust the relative weights of the access delay and the user experience score.
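The explicit formulas for D, M and the reward are given only as images in the original. As an illustration only, and not the patent's exact expression, a reward consistent with the stated roles of D, M and η could trade the two objectives off as:

$$r = \eta M - (1 - \eta) D$$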
In step S13, the value estimator q(s, a; θ) of the Dueling DQN network is maintained in two copies, q_ej and q_gj, for the evaluation network and the target network.
Further, step S2 specifically includes the following steps:
step S21: introducing a value decomposition network at the sink node, wherein the sink node first collects the states and rewards of all agents to construct the joint state and joint action and computes the reward value of the whole system; an experience replay buffer is introduced to store samples, each containing the four elements (S^(t), A^(t), r^(t), S^(t+1)); for each sample, the Dueling DQN network of each agent computes its own action value function q(s, a; θ), and the value decomposition network finally computes the global action value function of the whole system for the evaluation and target networks (formulas rendered as images in the original);
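The global action value computed by the value decomposition network appears only as images; in the standard VDN (Value-Decomposition Networks) formulation, which the description matches, the global value is the sum of the per-agent values, evaluated once with the evaluation networks and once with the target networks:

$$Q_{tot}(S, A) = \sum_{j=1}^{J} q_j(s_j, a_j; \theta_j)$$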
step S22: based on the global action value function of the whole system computed in step S21 and the joint reward function, a loss function is constructed to compute the global policy update parameters (the loss and gradient formulas are rendered as images in the original);
the trained global policy update parameters are propagated back to each wireless service node in the wireless service node group, so that each node can update its own neural network by gradient descent and thereby obtain a better policy.
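The loss and gradient are likewise rendered only as images; a standard one-step TD loss over the stored joint transitions, as used throughout the DQN family, together with the gradient step each node applies, would read:

$$\mathcal{L}(\theta) = \mathbb{E}\left[\left(r^{(t)} + \gamma \max_{A'} Q_{tot}(S^{(t+1)}, A'; \theta^{-}) - Q_{tot}(S^{(t)}, A^{(t)}; \theta)\right)^2\right], \qquad \theta \leftarrow \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta)$$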
Further, a dimension decomposition mechanism is embedded into the Dueling DQN to reduce decision complexity and improve wireless service performance:
the actions output by the Dueling DQN network are decomposed along dimensions with actual physical meaning, namely into three dimensions: which type of video to store (δ_jv), which video layer to store (δ_jl), and what quality of service to provide to the user (variable rendered as an image in the original); the action in each dimension is represented by a separate neural network branch, and all actions are selected independently within their own dimension without affecting one another;
further, after the dimension decomposition mechanism is embedded, the calculation method of the action cost function is as follows: calculated by dimension in a Dueling DQN network, i.e.
Figure BDA0003254934830000043
Figure BDA0003254934830000044
At the same time, the calculation of the sink node global action cost function is also updated, i.e.
Figure BDA0003254934830000045
And
Figure BDA0003254934830000046
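The per-dimension value computation is shown only as images. One plausible aggregation, following the branching dueling architecture that this mechanism resembles, averages the per-dimension values within an agent before the sink node sums across agents; the patent's exact form may differ:

$$q_j(s, a) = \frac{1}{N} \sum_{d=1}^{N} q_{j,d}(s, a_d), \qquad Q_{tot}(S, A) = \sum_{j=1}^{J} q_j(s_j, a_j)$$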
Also provided is a wireless network model, comprising: wireless service nodes, a sink node, a source server, and a core network; each wireless service node can download files from the source server through a backhaul link, store them locally, and directly serve the users in its cell; user storage and inter-user cooperation are performed using the semi-distributed cooperative storage method based on a value decomposition network and multi-agent reinforcement learning as described above.
Furthermore, for each wireless service node, namely each agent, a user request set and a file storage set serve as the state space and as the input of the neural network; the output of the neural network is the set of contents to store and the file quality set for served users in the next time period. Each wireless service node can download files from the source server through a backhaul link, store them locally, and directly serve the users within its coverage; to reduce file download time, different wireless service nodes are allowed to cooperate across devices through the sink node, further improving the wireless network service quality and reducing the access delay of stored content.
Compared with the prior art, the method and the preferred scheme thereof can promote the cooperation among the multiple intelligent agents by utilizing the value decomposition network of the sink node, and solve the problem of decision complexity in a real wireless network environment by utilizing the framework of a dimension decomposition mechanism, thereby improving the capability of mobile wireless edge storage. The method can obtain good performance even under the conditions of limited storage resources, limited computing resources and complex user and wireless environment, so as to reduce the access delay of the stored content and improve the service quality of the user.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
fig. 1 is a schematic diagram of a semi-distributed wireless cooperative storage network model in an embodiment of the present invention.
Fig. 2 is a schematic diagram of a Dueling DQN network in an embodiment of the present invention.
FIG. 3 is a schematic diagram of a dimension decomposition mechanism.
FIG. 4 is a comparison chart of the results of different file parameters in the embodiment of the present invention.
FIG. 5 is a graph comparing performance of different algorithms in an embodiment of the invention.
FIG. 6 is a comparison of results for different dimensional numbers in an example of the invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
it should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The semi-distributed cooperative storage algorithm based on the value decomposition network and multi-agent reinforcement learning provided by this embodiment is implemented according to the following steps:
step S1: a semi-distributed cooperative storage wireless network model is provided, state spaces, action spaces and reward functions based on time delay and user experience design of each agent are defined, and the purpose is to improve the service quality of local wireless service nodes to the maximum extent;
1) In this example, first define a user set {1, …, i, …, I}, with U_j denoting the set of users belonging to wireless service node j; a user request variable λ_iv and a wireless service node storage variable δ_jvl; a file set {1, …, v, …, V}; a quality variable K; a video layer set {1, …, l, …, L}; and a wireless service node set {1, …, j, …, J}. In addition, unit access delays d_0, d_jj', and d_j denote the unit delay of a local hit, a cooperative hit, and a download from the server, respectively, and, to construct the optimization problem, a user request quality k_iv and a service quality variable (symbol rendered as an image in the original) are defined.
2) The performance indicators of the storage model are constructed, comprising the video access delay D and the user experience score M, and the final optimization objective, i.e. the reward function, is constructed (these formulas are rendered as images in the original and are the same as those referenced in step S12 above).
The state space comprises the user request variables, the user request quality, and the wireless service node storage variables; the wireless service node storage variables and the user service quality at the next moment constitute the action space.
3) The Dueling DQN network is used to fit states and actions. The Dueling DQN network splits the neural network into a state value branch, which mainly estimates the state of the current wireless network, and an advantage action branch, which estimates each action (the two branch formulas are rendered as images in the original).
Finally, q(s, a; θ) combines the state value and the advantage value to evaluate the performance of each action accurately; the Dueling DQN network is further divided into a target network and an evaluation network, so q(s, a; θ) likewise exists in two copies, q_ej and q_gj.
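As a minimal sketch of the dual-branch network just described, in PyTorch; the layer sizes, names, and hidden width are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Shared trunk, then a state-value branch V(s) and an advantage branch A(s, a)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # state value V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # advantage A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        v, a = self.value(h), self.advantage(h)
        # Standard dueling aggregation: q = V + A - mean(A).
        return v + a - a.mean(dim=-1, keepdim=True)
```

Instantiated once as the evaluation network and once as the target network, this yields the two copies q_ej and q_gj referred to above.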
Step S2: information about each wireless service node is collected and analyzed in a sink node, cooperation of each wireless service node is coordinated through a value decomposition network model, namely an action value function of each intelligent agent is used as input of the value decomposition network and is output as a global action value function and a global strategy updating parameter of the whole system, and a result is fed back to the whole semi-distributed system, so that cooperation performance of wireless edge storage is improved;
by defining an action value function, continuously iterating network parameters, and finally obtaining an optimal storage strategy and a user service strategy by each wireless service node in each obstructed state; in order to achieve better performance of the neural network, the Dueling DQN network of this embodiment may additionally use a fighting mechanism and a dual-network mechanism. The dual-network mechanism adopts a neural network with completely consistent structure for delaying updating to improve the stability of the algorithm, so that the algorithm is easier to converge. The decision mechanism additionally adopts the values of the estimated state value and the dominant value to judge the quality of the output action of the neural network, so that the decision is more accurate.
1) To enable the wireless service nodes to cooperate better, a new module, the value decomposition network, is introduced at the sink node. The sink node first collects the states and rewards of all agents to construct the joint state and joint action and then computes the reward value of the whole system. In addition, an experience replay buffer is introduced to store samples, each containing the four elements (S^(t), A^(t), r^(t), S^(t+1)); from each sample the Dueling DQN network of each agent computes its own action value function q(s, a; θ), and the value decomposition network finally computes the global action value function of the whole system for the evaluation and target networks (formulas rendered as images in the original).
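A minimal sketch of the sink node's value decomposition step in PyTorch, assuming each agent has already reported the Q-value of its chosen action from its evaluation network and the maximum Q-value from its target network; the tensor shapes and discount factor are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def vdn_global_loss(q_chosen: torch.Tensor,    # [batch, n_agents]: eval-net q_j(s_j, a_j)
                    q_next_max: torch.Tensor,  # [batch, n_agents]: target-net max_a q_j(s'_j, a)
                    team_reward: torch.Tensor, # [batch]: joint reward of the whole system
                    gamma: float = 0.99) -> torch.Tensor:
    """VDN mixing: sum per-agent values into a global value, then apply a one-step TD loss."""
    q_tot = q_chosen.sum(dim=1)                  # Q_tot(S, A)
    q_tot_next = q_next_max.sum(dim=1).detach()  # target Q_tot(S', A'), no gradient
    td_target = team_reward + gamma * q_tot_next
    return F.mse_loss(q_tot, td_target)
```

Because the sum is differentiable, the gradient of this single global loss flows back into every agent's network, which corresponds to the feedback of global policy update parameters described above.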
2) Based on the global action value function of the whole system computed in the previous step, the joint reward function is used to construct a loss function and compute the global policy update parameters (the loss and gradient formulas are rendered as images in the original).
The trained global policy update parameters are then passed back to the individual wireless service nodes in the group so that each can update its own neural network, yielding an updated policy that performs better in both cooperation and prediction.
Step S3: in view of the complexity of the practical environment for deep reinforcement learning, a dimension decomposition mechanism is additionally embedded into the Dueling DQN to reduce decision complexity and improve wireless service performance.
1) The actions output by the Dueling DQN network are decomposed by dimension. In the scenario analyzed here, the action is decomposed into three dimensions: which type of video to store (δ_jv), which video layer to store (δ_jl), and what quality of service to provide to the user (variable rendered as an image in the original).
The action in each dimension is represented by a separate neural network branch, so all actions are selected independently within their own dimension without affecting one another.
2) After the dimension decomposition mechanism is embedded, the computation of the action value function changes correspondingly: it is computed per dimension within the Dueling DQN network, and the computation of the sink node's global action value function is updated in the same way (the per-dimension formulas are rendered as images in the original). A sketch of such a dimension-decomposed network follows.
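A sketch of the dimension-decomposed output in PyTorch: one advantage branch per action dimension (video type, video layer, service quality), sharing one trunk and one state-value estimate; the branch sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BranchingDuelingDQN(nn.Module):
    """Dueling DQN whose advantage head is split into one branch per action dimension."""
    def __init__(self, state_dim: int, branch_sizes=(10, 4, 4), hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)  # shared state value V(s)
        self.branches = nn.ModuleList([nn.Linear(hidden, n) for n in branch_sizes])

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        v = self.value(h)
        qs = []
        for branch in self.branches:  # one independent branch per action dimension
            adv = branch(h)
            qs.append(v + adv - adv.mean(dim=-1, keepdim=True))
        return qs  # argmax is taken independently in every dimension

# The decision space shrinks from prod(branch_sizes) joint actions to
# sum(branch_sizes) branch outputs, which is the claimed complexity reduction.
```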
in order to further understand the semi-distributed cooperative storage algorithm based on the value decomposition network and the multi-agent reinforcement learning, which is proposed by the present invention, the following detailed description is made with reference to specific embodiments. The embodiment is implemented on the premise of the technical scheme of the invention.
Fig. 1 shows the semi-distributed wireless intelligent storage network model.
The model mainly comprises wireless service nodes, a sink node, a source server, and a core network, and introduces a user storage model under the wireless service nodes and a cooperation model among users; each wireless service node can download files from the source server through a backhaul link, store them locally, and directly serve the users in its cell.
Fig. 2 is a schematic diagram of a Dueling DQN network and a value decomposition network.
The network framework is divided into a target network and an evaluation network, and the neural network branches of each are subdivided into a state value network and an advantage value network; this dual-network, dual-branch architecture evaluates the quality of actions more effectively. The network takes the state space as input and the action space as output, and continuously optimizes its parameters by receiving the global policy update parameters.
Fig. 3 is a schematic diagram of the dimension decomposition mechanism.
The composite action space is decomposed into independent actions across multiple dimensions; each dimension selects its action in an independent neural network branch, thereby avoiding the high complexity of composite actions.
FIG. 4 is a graph comparing the results of different file parameters in the examples.
The experimental results show that the algorithm not only copes with the computational complexity of a large number of files, but also finds a good global policy under different levels of high complexity, so that the reward value converges to a high value.
This analysis shows that the semi-distributed cooperative storage algorithm based on a value decomposition network and multi-agent reinforcement learning achieves better storage performance than existing methods, substantially improves content storage for users, and offers reference value and practical economic benefit.
Fig. 5 compares the performance of different algorithms in the embodiment of the present invention.
Compared with conventional multi-agent algorithms, the introduced semi-distributed architecture and dimension decomposition method improve performance by 23.4% and 30.5%, respectively, under the 5-file and 10-file conditions, while also converging faster, so the algorithm can quickly locate a globally optimal policy in different scenarios.
Fig. 6 compares the results for different dimension numbers in the embodiment of the present invention.
Analysis of the experimental results shows that the algorithm extends naturally along the dimension decomposition mechanism: the more finely the dimensions are subdivided into branches (provided the action decomposition criterion is met), the better the returns the algorithm delivers.
The method provided by this embodiment can be stored in coded form on a computer-readable storage medium and implemented as a computer program that receives the basic parameter information required for the calculation through computer hardware and outputs the calculation result.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
The present invention is not limited to the above preferred embodiments; any other semi-distributed cooperative storage method based on a value decomposition network and multi-agent reinforcement learning derived under the teaching of the present invention falls within the scope of protection of the present invention.

Claims (8)

1. A semi-distributed cooperative storage method based on a value decomposition network and multi-agent reinforcement learning is characterized in that the implementation process comprises the following steps:
step S1: constructing a wireless network model of multi-device cooperative semi-distributed storage based on wireless network transmission, the model comprising a sink node and wireless service nodes; defining, on the basis of the value decomposition network and multi-agent deep reinforcement learning, the state space and action space of each agent, the joint state space and joint action space, and a reward function designed from the optimization objective, so as to maximize the wireless network service quality and reduce the access delay of stored content;
step S2: collecting and analyzing information about each wireless service node at the sink node, and coordinating the cooperation of the wireless service nodes through the constructed value decomposition network model: the action value function of each wireless service node serves as the input of the value decomposition network, whose outputs are the global action value function and the global policy update parameters of the whole system; the results are fed back to the whole semi-distributed system, including feeding the update parameters back to each wireless service node to update the policy of the individual node, thereby improving the cooperation performance and convergence speed of wireless edge storage.
2. The semi-distributed collaborative storage method based on value decomposition network and multi-agent reinforcement learning of claim 1, wherein: step S1 specifically includes the following steps:
step S11: defining a user set and the membership of users to wireless service nodes, user request variables, wireless service node storage variables, a file set, a quality variable, a video layer set and a wireless service node set, the unit delays of a local hit, a cooperative hit and a download from the server, and user request quality and service quality variables;
step S12: constructing the performance indicators of the storage model, comprising the video access delay and the user experience score, and constructing the final optimization objective, i.e. the reward function, based on this two-objective optimization problem; defining the user request variables, the user request quality and the wireless service node storage variables as the state space, and the wireless service node storage variables and the user service quality at the next moment as the action space;
step S13: fitting states and actions with a Dueling DQN network, which splits the branches of the neural network into a state value branch, used to estimate the state of the current wireless network, and an advantage action branch, used to estimate each action; and evaluating the performance of each action by combining the state value and the advantage value.
3. The semi-distributed collaborative storage method based on value decomposition network and multi-agent reinforcement learning of claim 2, wherein:
step S11 specifically includes: a user set is expressed as {1, …, i, …, I}, with U_j denoting the set of users belonging to wireless service node j; a user request variable λ_iv and a wireless service node storage variable δ_jvl are defined, along with a file set {1, …, v, …, V}, a quality variable K, a video layer set {1, …, l, …, L}, and a wireless service node set {1, …, j, …, J}; unit access delays d_0, d_jj', and d_j denote the unit delay of a local hit, a cooperative hit, and a download from the server, respectively; and a user request quality k_iv and a service quality variable (symbol rendered as an image in the original) are defined.
in step S12: the performance indicators of the storage model are constructed, comprising the video access delay D and the user experience score M; their defining formulas are rendered as images in the original, where c_1 = 0.16 and c_2 = 0.66 are quality evaluation coefficients;
the final optimization objective, i.e. the reward function, is likewise constructed (formula rendered as an image in the original), where η is a weighting factor used to adjust the relative weights of the access delay and the user experience score;
in step S13, the value estimator q(s, a; θ) of the Dueling DQN network is maintained in two copies, q_ej and q_gj, for the evaluation network and the target network.
4. The semi-distributed collaborative storage method based on value decomposition network and multi-agent reinforcement learning of claim 3, wherein: step S2 specifically includes the following steps:
step S21: introducing a value decomposition network at the sink node, wherein the sink node first collects the states and rewards of all agents to construct the joint state and joint action and computes the reward value of the whole system; an experience replay buffer is introduced to store samples, each containing the four elements (S^(t), A^(t), r^(t), S^(t+1)); for each sample, the Dueling DQN network of each agent computes its own action value function q(s, a; θ), and the value decomposition network finally computes the global action value function of the whole system for the evaluation and target networks (formulas rendered as images in the original);
step S22: based on the global action value function of the whole system computed in step S21 and the joint reward function, a loss function is constructed to compute the global policy update parameters (the loss and gradient formulas are rendered as images in the original);
the trained global policy update parameters are propagated back to each wireless service node in the wireless service node group, so that each node can update its own neural network by gradient descent and thereby obtain a better policy.
5. The semi-distributed collaborative storage method based on value decomposition network and multi-agent reinforcement learning of claim 2, wherein: a dimension decomposition mechanism is embedded into the Dueling DQN to reduce decision complexity and improve wireless service performance:
the actions output by the Dueling DQN network are decomposed along dimensions with actual physical meaning, namely into three dimensions: which type of video to store (δ_jv), which video layer to store (δ_jl), and what quality of service to provide to the user (variable rendered as an image in the original);
the action in each dimension is represented by a separate neural network branch, and all actions are selected independently within their own dimension without affecting one another.
6. The semi-distributed collaborative storage method based on value decomposition network and multi-agent reinforcement learning of claim 5, wherein:
after the dimension decomposition mechanism is embedded, the action value function is computed per dimension within the Dueling DQN network, and the computation of the sink node's global action value function is updated accordingly (the per-dimension formulas are rendered as images in the original).
7. A wireless network model, comprising: wireless service nodes, a sink node, a source server and a core network; each wireless service node can download files from the source server through a backhaul link, store them locally and directly serve the users in its cell; content storage and inter-user collaboration are performed using the semi-distributed collaborative storage method based on a value decomposition network and multi-agent reinforcement learning as claimed in any one of claims 1-6.
8. The wireless network model of claim 7, wherein: for each wireless service node, namely each agent, a user request set and a file storage set serve as the state space and as the input of the neural network, and the output of the neural network is the set of contents to store and the file quality set for served users in the next time period; each wireless service node can download files from the source server through a backhaul link, store them locally and directly serve the users within its coverage; in order to reduce file downloading time, different wireless service nodes are allowed to perform multi-device cooperation through the sink node, further improving the wireless network service quality and reducing the access delay of stored content.
CN202111058748.3A 2021-09-09 2021-09-09 Semi-distributed collaborative storage method based on value decomposition network and multiple agents Active CN113779302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111058748.3A CN113779302B (en) 2021-09-09 2021-09-09 Semi-distributed collaborative storage method based on value decomposition network and multiple agents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111058748.3A CN113779302B (en) 2021-09-09 2021-09-09 Semi-distributed collaborative storage method based on value decomposition network and multiple agents

Publications (2)

Publication Number Publication Date
CN113779302A true CN113779302A (en) 2021-12-10
CN113779302B CN113779302B (en) 2023-09-22

Family

ID=78842194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111058748.3A Active CN113779302B (en) 2021-09-09 2021-09-09 Semi-distributed collaborative storage method based on value decomposition network and multiple agents

Country Status (1)

Country Link
CN (1) CN113779302B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114867061A (en) * 2022-07-05 2022-08-05 深圳市搜了网络科技股份有限公司 Cloud monitoring method based on wireless communication network
CN115065728A (en) * 2022-06-13 2022-09-16 福州大学 Multi-strategy reinforcement learning-based multi-target content storage method
CN115086374A (en) * 2022-06-14 2022-09-20 河南职业技术学院 Scene complexity self-adaptive multi-agent layered cooperation method
CN116996919A (en) * 2023-09-26 2023-11-03 中南大学 Single-node multi-domain anti-interference method based on reinforcement learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190014488A1 (en) * 2017-07-06 2019-01-10 Futurewei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning
CN111079305A (en) * 2019-12-27 2020-04-28 南京航空航天大学 Different-strategy multi-agent reinforcement learning cooperation method based on lambda-reward
CN112364984A (en) * 2020-11-13 2021-02-12 南京航空航天大学 Cooperative multi-agent reinforcement learning method
CN112396187A (en) * 2020-11-19 2021-02-23 天津大学 Multi-agent reinforcement learning method based on dynamic collaborative map
CN112465151A (en) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) Multi-agent federal cooperation method based on deep reinforcement learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190014488A1 (en) * 2017-07-06 2019-01-10 Futurewei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning
CN111079305A (en) * 2019-12-27 2020-04-28 南京航空航天大学 Different-strategy multi-agent reinforcement learning cooperation method based on lambda-reward
CN112364984A (en) * 2020-11-13 2021-02-12 南京航空航天大学 Cooperative multi-agent reinforcement learning method
CN112396187A (en) * 2020-11-19 2021-02-23 天津大学 Multi-agent reinforcement learning method based on dynamic collaborative map
CN112465151A (en) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) Multi-agent federal cooperation method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李孜恒; 孟超: "Wireless network resource allocation algorithm based on deep reinforcement learning", Communications Technology, no. 08
谢添; 高士顺; 赵海涛; 林沂; 熊俊: "Reinforcement-learning-based anti-jamming resource scheduling algorithm for directional wireless communication networks", Chinese Journal of Radio Science, no. 04

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115065728A (en) * 2022-06-13 2022-09-16 福州大学 Multi-strategy reinforcement learning-based multi-target content storage method
CN115065728B (en) * 2022-06-13 2023-12-08 福州大学 Multi-strategy reinforcement learning-based multi-target content storage method
CN115086374A (en) * 2022-06-14 2022-09-20 河南职业技术学院 Scene complexity self-adaptive multi-agent layered cooperation method
CN114867061A (en) * 2022-07-05 2022-08-05 深圳市搜了网络科技股份有限公司 Cloud monitoring method based on wireless communication network
CN114867061B (en) * 2022-07-05 2022-12-13 深圳市搜了网络科技股份有限公司 Cloud monitoring method based on wireless communication network
CN116996919A (en) * 2023-09-26 2023-11-03 中南大学 Single-node multi-domain anti-interference method based on reinforcement learning
CN116996919B (en) * 2023-09-26 2023-12-05 中南大学 Single-node multi-domain anti-interference method based on reinforcement learning

Also Published As

Publication number Publication date
CN113779302B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN113779302A (en) Semi-distributed cooperative storage method based on value decomposition network and multi-agent reinforcement learning
Jiang et al. Distributed resource scheduling for large-scale MEC systems: A multiagent ensemble deep reinforcement learning with imitation acceleration
CN111611062B (en) Cloud-edge collaborative hierarchical computing method and cloud-edge collaborative hierarchical computing system
CN113435472A (en) Vehicle-mounted computing power network user demand prediction method, system, device and medium
Wu et al. Multi-agent DRL for joint completion delay and energy consumption with queuing theory in MEC-based IIoT
Lin et al. AI-driven collaborative resource allocation for task execution in 6G-enabled massive IoT
CN114710439B (en) Network energy consumption and throughput joint optimization routing method based on deep reinforcement learning
Yang et al. Deep reinforcement learning based wireless network optimization: A comparative study
CN111198550A (en) Cloud intelligent production optimization scheduling on-line decision method and system based on case reasoning
CN115065728B (en) Multi-strategy reinforcement learning-based multi-target content storage method
Jeon et al. A distributed nwdaf architecture for federated learning in 5g
CN116185523A (en) Task unloading and deployment method
CN113887748B (en) Online federal learning task allocation method and device, and federal learning method and system
Zhao et al. A digital twin-assisted intelligent partial offloading approach for vehicular edge computing
Cui et al. Learning‐based deep neural network inference task offloading in multi‐device and multi‐server collaborative edge computing
CN114648223A (en) Smart city energy consumption data mining system and method based on Internet of things
He et al. Computation offloading and resource allocation based on DT-MEC-assisted federated learning framework
CN102077526A (en) Method, apparatus and computer program product for distributed information management
Zhao et al. MEDIA: An Incremental DNN Based Computation Offloading for Collaborative Cloud-Edge Computing
CN110366210A (en) A kind of calculating discharging method for the application of stateful data flow
Zhai et al. Collaborative computation offloading for cost minimization in hybrid computing systems
Cui et al. Resource-Efficient DNN Training and Inference for Heterogeneous Edge Intelligence in 6G
Sun et al. Optimizing task-specific timeliness with edge-assisted scheduling for status update
Zhang et al. On-Device Intelligence for 5G RAN: Knowledge Transfer and Federated Learning Enabled UE-Centric Traffic Steering
Niu et al. A pipelining task offloading strategy via delay-aware multi-agent reinforcement learning in Cybertwin-enabled 6G network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant