CN113191487A - Self-adaptive continuous power control method based on distributed PPO algorithm - Google Patents

Self-adaptive continuous power control method based on distributed PPO algorithm

Info

Publication number
CN113191487A
CN113191487A (application number CN202110469413.4A)
Authority
CN
China
Prior art keywords
network
ppo
theta
global
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110469413.4A
Other languages
Chinese (zh)
Other versions
CN113191487B (en)
Inventor
谢显中
范子申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110469413.4A priority Critical patent/CN113191487B/en
Publication of CN113191487A publication Critical patent/CN113191487A/en
Application granted granted Critical
Publication of CN113191487B publication Critical patent/CN113191487B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention relates to a self-adaptive continuous power control method based on a distributed PPO algorithm, belonging to the field of deep reinforcement learning and comprising the following steps of S1: firstly, representing a plurality of secondary networks by a plurality of threads, sharing a global PPO network strategy parameter by the plurality of secondary networks, and initializing all parameters; s2: a plurality of threads concurrently execute a global PPO network strategy and collect certain batches of data information in different environments in parallel; s3: the multiple threads transmit the collected sample data to the global PPO network, and the multiple threads stop collecting the sample data; s4: the global PPO network trains the network according to sample data transmitted by a plurality of threads, and strategy parameters are updated; s5: after the global PPO network updates the parameters, stopping updating the parameters, controlling the multiple threads to continue to concurrently collect sample data information, and then repeating the step S4 until the task is finished.

Description

Self-adaptive continuous power control method based on distributed PPO algorithm
Technical Field
The invention belongs to the field of deep reinforcement learning, and relates to a self-adaptive continuous power control method based on a distributed PPO algorithm.
Background
In the literature, a policy-based deep reinforcement learning algorithm, the Proximal Policy Optimization (PPO) algorithm, is used to help a secondary user in a cognitive wireless network realize adaptive continuous power control, so that the spectrum resources of the primary user are shared and successful communication between the primary user and the secondary user is achieved.
The PPO algorithm is a deep reinforcement learning algorithm based on the Actor-Critic (AC) framework. With an artificial neural network, the PPO algorithm can handle an environment with an infinite state space, and as a policy-based method it can handle an infinite action space; this fits well with the goal of realizing intelligent continuous power control for secondary users in a complex environment so as to share the primary user's spectrum resources.
The PPO algorithm mainly solves two problems of the traditional policy-based method: its sensitivity to the update step size and its low network update efficiency. On the one hand, the PPO algorithm adopts importance sampling to convert the on-policy method into an off-policy method, so that empirical data can be reused and the update efficiency of the network is improved; on the other hand, it limits the update step, which resolves the over-sensitivity of the policy method to the update step size. The traditional policy method uses a network θ to execute a parameterized policy π_θ; once the parameters of the network are updated, the data collected under the old policy can no longer be used to train the network and must be re-sampled with the new policy, which results in a low utilization rate of sample information, low parameter update efficiency and huge time consumption. The importance sampling method constructs a network θ', whose output action probability distribution is similar to that of the network θ, to interact with the environment and collect sample data; the policy π_θ' is then executed while the network θ is trained multiple times, with the parameters of the network θ' kept fixed, so that the sampled data are reused and the update efficiency of the network is improved. In the conventional policy gradient method, the gradient calculation formula is:
$$\nabla \bar{R}_\theta = \mathbb{E}_{(s_t,a_t)\sim\pi_\theta}\big[A^{\theta}(s_t,a_t)\,\nabla \log \pi_\theta(a_t\mid s_t)\big] \tag{1}$$
after the importance sampling method is adopted, the formula (1) can be changed into:
$$\nabla \bar{R}_\theta = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\,\nabla \log \pi_\theta(a_t\mid s_t)\Big] \tag{2}$$
a new objective function based on the PPO algorithm is then obtained:
$$J^{\theta'}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\Big] \tag{3}$$
where J^θ'(θ) means that the network θ' interacts with the environment to collect the data used to update the network θ. In this way the on-policy method is converted into an off-policy method, so that the sampled data can be reused. However, the sensitivity to the update step size of the conventional policy method still exists: if the output action probability distributions of the two networks are too far apart, training is difficult to converge. So that the two distributions do not drift too far apart, the PPO algorithm adds a constraint to equation (3), as shown below:
$$J^{\theta'}_{\mathrm{clip}}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\min\big(\mathrm{ratio}_t(\theta)A^{\theta'}(s_t,a_t),\ \mathrm{clip}(\mathrm{ratio}_t(\theta),\,1-\varepsilon,\,1+\varepsilon)A^{\theta'}(s_t,a_t)\big)\Big] \tag{4}$$
where
$$\mathrm{ratio}_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}$$
is the ratio of the two network policies, and clip is a clipping function that clips when the action probability distributions of the two networks drift too far apart: if the value of ratio_t(θ) is less than 1-ε, 1-ε is taken; if the value of ratio_t(θ) is greater than 1+ε, 1+ε is taken. The value of ε is typically 0.1 or 0.2. Excessively large updates are thereby prevented, which solves the sensitivity of the policy method to the update step size.
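For illustration only, the clipped surrogate objective of equation (4) can be written as a short function; the following is a minimal PyTorch sketch in which the function name, tensor arguments and the default value of ε are assumptions rather than part of the claimed method.

```python
# Minimal sketch of the clipped surrogate objective in equation (4).
# Inputs are assumed to be 1-D tensors over a batch of (s_t, a_t) samples.
import torch

def ppo_clip_objective(log_prob_new: torch.Tensor,
                       log_prob_old: torch.Tensor,
                       advantage: torch.Tensor,
                       eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective J_clip(theta), to be maximized."""
    # ratio_t(theta) = pi_theta(a_t|s_t) / pi_theta'(a_t|s_t), computed in log space
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # the element-wise min keeps the update conservative when the ratio drifts
    return torch.min(unclipped, clipped).mean()

# Usage: maximize the objective by minimizing its negative, e.g.
# loss = -ppo_clip_objective(logp_new, logp_old.detach(), adv.detach())
```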
Although the PPO algorithm can effectively help the secondary user learn the optimal continuous power control strategy, the deep neural network has a large number of parameters that must be updated over many training iterations, so the complexity is high and the training time is long.
Disclosure of Invention
In view of the above, the present invention provides a Distributed PPO (DPPO) method to increase the training speed and reduce the training time. Unlike the PPO method, DPPO has a plurality of secondary networks and one main network. The secondary networks share the policy parameters of the main network; during training they execute the main network's policy in parallel, collect sample information in their respective environments and transmit the samples to the main network. The main network trains on the sample information transmitted by the secondary networks and updates the policy parameters, after which the plurality of secondary networks continue to collect sample data concurrently under the new policy parameters updated by the main network, until training is finished.
In order to achieve the purpose, the invention provides the following technical scheme:
a self-adaptive continuous power control method based on a distributed PPO algorithm comprises the following steps:
s1: firstly, representing a plurality of secondary networks by a plurality of threads, wherein the plurality of secondary networks share a global PPO network strategy parameter and initialize all the parameters;
s2: the multiple threads concurrently execute a global PPO network strategy and collect certain batches of data information in different environments in parallel;
s3: the multiple threads transmit the collected sample data to the global PPO network, and the multiple threads stop collecting the sample data;
s4: the global PPO network trains the network according to sample data transmitted by a plurality of threads and updates strategy parameters;
s5: and after the parameters of the global PPO network are updated, stopping updating the parameters, controlling the multiple threads to continuously and concurrently collect sample data information, and then repeating the step S4 until the task is finished.
Further, in step S1, a plurality of worker threads are initialized; the global PPO network parameters are initialized; the parameter θ of the Actor network, another network θ', and the parameter φ of the Critic network are initialized; the number of updates M of the θ' network and the number of updates B of the Critic network are initialized; environmental parameters such as the number of sensors and the interference error are initialized; the number of training rounds N, the number of iteration steps T per round and the sampling batch size Batchsize are initialized; the power of the primary user and the power of the secondary user are initialized, the power of the primary user is substituted into the power control strategy to obtain the power of the next time frame, and the initial state s_0 of the environment is thus obtained.
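As a concrete, purely illustrative reading of the step-S1 initialization, the parameters listed above can be grouped into a single configuration object; every numeric value below is an assumption, since the patent does not fix concrete settings.

```python
# Hypothetical grouping of the step-S1 hyperparameters; all values are examples only.
from dataclasses import dataclass

@dataclass
class DPPOConfig:
    num_workers: int = 4           # number of worker threads (secondary networks)
    actor_updates_M: int = 10      # M: updates of the theta' network per batch
    critic_updates_B: int = 10     # B: updates of the Critic network per batch
    rounds_N: int = 500            # N: number of training rounds
    steps_per_round_T: int = 200   # T: iteration steps per round
    batch_size: int = 32           # Batchsize: sampling steps per worker per batch
    clip_eps: float = 0.2          # epsilon used by the clipping function
    num_sensors: int = 5           # environmental parameter (example)
    p_primary_init: float = 1.0    # initial primary-user power (example)
    p_secondary_init: float = 0.1  # initial secondary-user power (example)

config = DPPOConfig()
```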
Further, in step S2, the multiple workers execute the policy π_θ of the Actor network in the global PPO network and collect sample data {s_t, a_t, r_t} in their respective environments.
Further, in step S3, the plurality of workers stop collecting data after each has executed Batchsize steps, and transmit the collected sample information to the global PPO network.
Further, in steps S4-S5, the global PPO network calculates the cumulative advantage function A^θ, assigns the parameters of the network θ to the network θ', and repeatedly updates the θ' network M times using the sample data acquired in step S3, each time performing a gradient calculation according to gradient formula (4) and updating the network parameters of θ'; the Critic network φ is updated B times, each update performing a gradient descent step on the advantage function A^θ' to reduce the value of the advantage function as much as possible, so as to optimize the θ' network strategy; the updates are then stopped;
$$J^{\theta'}_{\mathrm{clip}}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\min\big(\mathrm{ratio}_t(\theta)A^{\theta'}(s_t,a_t),\ \mathrm{clip}(\mathrm{ratio}_t(\theta),\,1-\varepsilon,\,1+\varepsilon)A^{\theta'}(s_t,a_t)\big)\Big] \tag{4}$$
where J^θ'_clip(θ) is the objective function to be optimized; gradient ascent is performed on it to obtain the maximum expected reward. E_{(s_t,a_t)~π_θ'}[·] denotes the reward expectation; s_t denotes the state of the agent at time t and a_t the action taken by the agent at time t; π_θ' is the policy of the θ' network, i.e. the probability that the agent takes a particular action in a particular state; A^θ'(s_t,a_t) is the advantage function of the θ' network, which represents how much better the action taken at the current moment is than the average action: if the advantage function is greater than 0, the probability of the strategy is increased, and if it is less than 0, the probability of the strategy is reduced;
and
$$\mathrm{ratio}_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}$$
is the ratio of the two network policies; clip is a clipping function applied when the action probability distributions of the two networks drift too far apart: if the value of ratio_t(θ) is less than 1-ε, 1-ε is taken; if the value of ratio_t(θ) is greater than 1+ε, 1+ε is taken.
Further, in steps S4-S5, the network θ obtains the new parameters trained by the network θ', the multiple threads share the new parameters of the global PPO network, and when the number of steps in the round of each thread reaches T, the training of the round is finished; otherwise, the process returns to step S2.
Further, in steps S4-S5, training is terminated when the number of rounds reaches N; otherwise the next round of training is carried out: the initial power of the primary user and the secondary user is reinitialized, the initial state s_0 is obtained, and iterative training continues from step S2.
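To make the update of steps S4-S5 concrete, the following PyTorch sketch shows one possible shape of the global-network update: advantages are formed from discounted returns minus the critic's value, a snapshot of the actor plays the role of the fixed sampling policy, the actor is updated M times with the clipped objective and the critic B times with a value-regression loss. The Gaussian policy, network sizes and optimizers are illustrative assumptions, and the θ/θ' bookkeeping of the patent is mirrored only loosely.

```python
# Sketch of the steps S4-S5 update (assumptions: Gaussian actor for continuous
# power, MSE critic loss, discounted-return advantages).
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                  nn.Linear(64, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, s: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mean(s), self.log_std.exp())

class Critic(nn.Module):
    def __init__(self, state_dim: int):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                               nn.Linear(64, 1))

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.v(s).squeeze(-1)

def global_update(actor, critic, opt_actor, opt_critic,
                  states, actions, returns, M=10, B=10, eps=0.2):
    """One global PPO update from a batch of worker samples (steps S4-S5)."""
    old_actor = copy.deepcopy(actor)              # snapshot used as the fixed sampler
    with torch.no_grad():
        old_logp = old_actor.dist(states).log_prob(actions).sum(-1)
        advantage = returns - critic(states)      # A = discounted return - baseline
    for _ in range(M):                            # M updates with the clipped objective
        logp = actor.dist(states).log_prob(actions).sum(-1)
        ratio = torch.exp(logp - old_logp)
        surrogate = torch.min(ratio * advantage,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
        opt_actor.zero_grad()
        (-surrogate.mean()).backward()
        opt_actor.step()
    for _ in range(B):                            # B updates of the Critic network
        value_loss = ((returns - critic(states)) ** 2).mean()
        opt_critic.zero_grad()
        value_loss.backward()
        opt_critic.step()
```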
The invention has the beneficial effects that: the secondary user network trained with the multi-thread DPPO algorithm can achieve the same effect as the single-thread PPO algorithm, and the secondary user can learn the optimal continuous power control strategy under different parameter conditions; the sample collection time required for training is shortened and the main network can gather the sample data required for training in a short time, so the training speed of the DPPO algorithm is obviously increased and the training time is effectively reduced.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
fig. 1 is a schematic view of a spectrum sharing scenario in a cognitive wireless network;
FIG. 2 is a schematic flow chart of a distributed PPO algorithm-based adaptive continuous power control method;
fig. 3 is a comparison graph of network system capacities after training of the DPPO algorithm and the PPO algorithm.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided only to illustrate the invention and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
Referring to fig. 1 to 3, an adaptive continuous power control method based on a distributed PPO algorithm is shown. The PPO algorithm used in the cognitive wireless network scenario of fig. 1 can help a secondary user effectively learn an optimal continuous power control strategy in a complex environment, so as to share the spectrum resources of the primary user.
The PPO algorithm is a deep reinforcement learning algorithm based on the Actor-Critic (AC) framework. With an artificial neural network, the PPO algorithm can handle an environment with an infinite state space, and as a policy-based method it can handle an infinite action space; this fits well with the goal of realizing intelligent continuous power control for secondary users in a complex environment so as to share the primary user's spectrum resources.
The PPO algorithm mainly solves two problems of the traditional policy-based method: its sensitivity to the update step size and its low network update efficiency. On the one hand, the PPO algorithm adopts importance sampling to convert the on-policy method into an off-policy method, so that empirical data can be reused and the update efficiency of the network is improved; on the other hand, it limits the update step, which resolves the over-sensitivity of the policy method to the update step size. The traditional policy method uses a network θ to execute a parameterized policy π_θ; once the parameters of the network are updated, the data collected under the old policy can no longer be used to train the network and must be re-sampled with the new policy, which results in a low utilization rate of sample information, low parameter update efficiency and huge time consumption. The importance sampling method constructs a network θ', whose output action probability distribution is similar to that of the network θ, to interact with the environment and collect sample data; the policy π_θ' is then executed while the network θ is trained multiple times, with the parameters of the network θ' kept fixed, so that the sampled data are reused and the update efficiency of the network is improved. In the conventional policy gradient method, the gradient calculation formula is:
$$\nabla \bar{R}_\theta = \mathbb{E}_{(s_t,a_t)\sim\pi_\theta}\big[A^{\theta}(s_t,a_t)\,\nabla \log \pi_\theta(a_t\mid s_t)\big] \tag{1.1}$$
after the importance sampling method is adopted, the formula (1.1) can be changed into:
$$\nabla \bar{R}_\theta = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\,\nabla \log \pi_\theta(a_t\mid s_t)\Big] \tag{1.2}$$
a new objective function based on the PPO algorithm is then obtained:
$$J^{\theta'}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\Big] \tag{1.3}$$
where J^θ'(θ) means that the network θ' interacts with the environment to collect the data used to update the network θ. In this way the on-policy method is converted into an off-policy method, so that the sampled data can be reused. However, the sensitivity to the update step size of the conventional policy method still exists: if the output action probability distributions of the two networks are too far apart, training is difficult to converge. So that the two distributions do not drift too far apart, the PPO algorithm adds a constraint to equation (1.3), as shown below:
$$J^{\theta'}_{\mathrm{clip}}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\min\big(\mathrm{ratio}_t(\theta)A^{\theta'}(s_t,a_t),\ \mathrm{clip}(\mathrm{ratio}_t(\theta),\,1-\varepsilon,\,1+\varepsilon)A^{\theta'}(s_t,a_t)\big)\Big] \tag{1.4}$$
where
$$\mathrm{ratio}_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}$$
is the ratio of the two network policies, and clip is a clipping function that clips when the action probability distributions of the two networks drift too far apart: if the value of ratio_t(θ) is less than 1-ε, 1-ε is taken; if the value of ratio_t(θ) is greater than 1+ε, 1+ε is taken. The value of ε is typically 0.1 or 0.2. Excessively large updates are thereby prevented, which solves the sensitivity of the policy method to the update step size.
The PPO algorithm comprises the following specific steps:
1. initializing a parameter theta of an Actor network and a parameter phi of another network theta', Critic network; initializing the updating times M of the theta' network and the updating times B of the Critic network, and the like; initializing environmental parameters such as the number of sensors, interference errors and the like; initializing the number N of turns of network training, the number T of iteration steps of each turn, the sampling batch step number Batchsize and the like; initializing the power of a primary user and the power of a secondary user, substituting the power of the primary user into a power control strategy to obtain the power of the next time frame, and further obtaining the initial state s of the environment0
2. Training begins: the Actor network executes π_θ and collects sample data {s_t, a_t, r_t}; after Batchsize steps have been executed, the cumulative advantage function A^θ is calculated (one way of computing it is sketched after step 6 below) and the parameters of the network θ are assigned to the network θ'.
3. The θ' network is updated M times by reusing the sample data acquired in step 2; each update performs a gradient calculation according to gradient formula (1.4) and updates the network parameters of the θ' network. The Critic network φ is updated B times; each update performs a gradient descent step on the advantage function A^θ' to reduce the value of the advantage function as much as possible, so as to optimize the θ' network strategy.
4. The network theta obtains new parameters trained by the network theta'.
5. Terminating the training round when the number of steps in the round reaches T, otherwise, continuously repeating the step 2.
6. Training is terminated when the number of rounds reaches N; otherwise the next round of training is carried out: the initial power of the primary user and the secondary user is reinitialized, the initial state s_0 is obtained, and step 2 is repeated for iterative training.
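The cumulative advantage function A^θ in step 2 is not spelled out above; one common reading, used here purely as an assumption, is the discounted return of the collected Batchsize transitions minus the critic's value estimate, as in the following sketch.

```python
# Hypothetical computation of the cumulative advantage for a batch {s_t, a_t, r_t}.
# gamma, the bootstrap value and the critic baseline are illustrative assumptions.
from typing import List

def discounted_returns(rewards: List[float], last_value: float,
                       gamma: float = 0.9) -> List[float]:
    """Backward pass over the batch: G_t = r_t + gamma * G_{t+1}."""
    returns, g = [], last_value
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return returns

def advantages(rewards: List[float], values: List[float],
               last_value: float, gamma: float = 0.9) -> List[float]:
    """A_t = G_t - V(s_t) for every stored transition."""
    gs = discounted_returns(rewards, last_value, gamma)
    return [g - v for g, v in zip(gs, values)]

# Example: a 4-step batch with critic values and a bootstrap value of 0.5
# adv = advantages([1.0, 0.0, 0.5, 1.0], [0.8, 0.6, 0.7, 0.9], 0.5)
```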
However, because the number of parameters of the deep neural network is large, the PPO method requires a large number of training iterations to update them, so the complexity is high and the training time is long. Based on this, as shown in fig. 2, the present invention proposes a Distributed PPO (DPPO) algorithm to improve the training speed and reduce the training time. Unlike the PPO method, DPPO has a plurality of secondary networks and one main network. The secondary networks share the policy parameters of the main network; during training they execute the main network's policy in parallel, collect sample information in their respective environments and transmit the samples to the main network. The main network trains on the sample information transmitted by the secondary networks and updates the policy parameters, after which the plurality of secondary networks continue to collect sample data concurrently under the new policy parameters updated by the main network, until training is finished. In this way the sample collection time required for training is shortened and the main network can gather the sample data required for training in a short time; experimental results show that the training speed of the DPPO algorithm is obviously increased and the training time is effectively reduced.
The DPPO algorithm adds a plurality of data-collecting threads on top of the PPO algorithm. The threads share one global PPO network; they execute the strategy of the global PPO network to collect sample data concurrently in their respective environments, they do not compute gradients themselves, and they are only responsible for transmitting the collected sample information to the global PPO network for training, so that the time spent acquiring sample information is markedly shortened and the training time of the neural network is reduced. The specific steps are as follows.
1. A plurality of worker threads are initialized; the global PPO network parameters are initialized; the remaining parameters are initialized as in step 1 of the PPO algorithm above.
2. Training begins: the multiple workers execute the policy π_θ of the Actor network in the global PPO network, collect sample data {s_t, a_t, r_t} in their respective environments, stop data collection after each has executed Batchsize steps, and transmit the collected sample information to the global PPO network.
3. The global PPO network calculates the cumulative advantage function A^θ and assigns the parameters of the network θ to the network θ'; the θ' network is updated M times by reusing the sample data acquired in step 2, each time performing a gradient calculation according to gradient formula (1.4) and updating the network parameters of the θ' network; the Critic network φ is updated B times, each update performing a gradient descent step on the advantage function A^θ' to reduce the value of the advantage function as much as possible, so as to optimize the θ' network strategy; the updates are then stopped.
4. The network θ obtains the new parameters trained by the network θ', and the multiple threads share the new parameters of the global PPO network; when the number of steps of each thread in the round reaches T, the training of the round is finished; otherwise, step 2 is repeated.
5. Training is terminated when the number of rounds reaches N; otherwise the next round of training is carried out: the initial power of the primary user and the secondary user is reinitialized, the initial state s_0 is obtained, and iterative training continues from step 2.
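The thread coordination in steps 1-5 can be sketched with Python's standard threading primitives. The queue-based hand-off, the round control and all object names below are illustrative assumptions; the description above only requires that the threads collect samples concurrently under the global policy and that the global PPO network updates while collection is paused.

```python
# Hypothetical DPPO collection/update cycle with worker threads and a global network.
import queue
import threading

BATCH_SIZE = 32  # Batchsize steps collected by each worker per round

def worker(env, global_policy, go_queue, sample_queue):
    """One secondary network: execute the global policy, collect a batch, hand it over."""
    while True:
        go_queue.get()                                 # wait until collection is allowed
        batch, state = [], env.reset()
        for _ in range(BATCH_SIZE):
            action = global_policy.act(state)          # execute the global PPO policy
            next_state, reward, done = env.step(action)
            batch.append((state, action, reward))
            state = env.reset() if done else next_state
        sample_queue.put(batch)                        # transmit samples, then idle

def coordinator(global_ppo, worker_go_queues, sample_queue, total_rounds):
    """Global PPO network: alternate concurrent collection and parameter updates."""
    for _ in range(total_rounds):
        for go in worker_go_queues:                    # release every worker for this round
            go.put("collect")
        batches = [sample_queue.get() for _ in range(len(worker_go_queues))]
        global_ppo.update(batches)                     # M actor + B critic updates while workers wait

# Usage (objects are assumptions): one daemon thread per worker, coordinator in the main thread.
# samples, gos = queue.Queue(), [queue.Queue() for _ in envs]
# for env, go in zip(envs, gos):
#     threading.Thread(target=worker, args=(env, policy, go, samples), daemon=True).start()
# coordinator(global_ppo, gos, samples, total_rounds=500)
```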
FIG. 3 shows the simulation results of a network system capacity comparison test after training with the DPPO algorithm and the PPO algorithm. In FIG. 3, PU-DPPO represents the capacity of the primary user under the DPPO algorithm, SU-DPPO the capacity of the secondary user under the DPPO algorithm, PU-PPO the capacity of the primary user under the PPO algorithm, and SU-PPO the capacity of the secondary user under the PPO algorithm. The simulation results show that the system capacity trained with the DPPO algorithm is very close to that trained with the PPO algorithm, which demonstrates the effectiveness of the DPPO algorithm; the training time recorded in the experiment shows that the DPPO algorithm takes 261 seconds while the PPO algorithm takes 350 seconds, which demonstrates that the network training speed of the DPPO algorithm is further improved.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (7)

1. A self-adaptive continuous power control method based on a distributed PPO algorithm is characterized by comprising the following steps: the method comprises the following steps:
s1: firstly, representing a plurality of secondary networks by a plurality of threads, wherein the plurality of secondary networks share a global PPO network strategy parameter and initialize all the parameters;
s2: the multiple threads concurrently execute a global PPO network strategy and collect certain batches of data information in different environments in parallel;
s3: the multiple threads transmit the collected sample data to the global PPO network, and the multiple threads stop collecting the sample data;
s4: the global PPO network trains the network according to sample data transmitted by a plurality of threads and updates strategy parameters;
s5: and after the parameters of the global PPO network are updated, stopping updating the parameters, controlling the multiple threads to continuously and concurrently collect sample data information, and then repeating the step S4 until the task is finished.
2. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 1, wherein: in step S1, a plurality of worker threads are initialized; the global PPO network parameters are initialized; the parameter θ of the Actor network, another network θ', and the parameter φ of the Critic network are initialized; the number of updates M of the θ' network and the number of updates B of the Critic network are initialized; environmental parameters such as the number of sensors and the interference error are initialized; the number of training rounds N, the number of iteration steps T per round and the sampling batch size Batchsize are initialized; the power of the primary user and the power of the secondary user are initialized, the power of the primary user is substituted into the power control strategy to obtain the power of the next time frame, and the initial state s_0 of the environment is thus obtained.
3. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 1, wherein: in step S2, the multiple workers execute the policy π_θ of the Actor network in the global PPO network and collect sample data {s_t, a_t, r_t} in their respective environments.
4. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 3, wherein: in step S3, the plurality of workers stop collecting data after each has executed Batchsize steps, and transmit the collected sample information to the global PPO network.
5. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 1, wherein: in steps S4-S5, the global PPO network calculates the cumulative advantage function A^θ, assigns the parameters of the network θ to the network θ', and repeatedly updates the θ' network M times using the sample data acquired in step S3, each time performing a gradient calculation according to gradient formula (4) and updating the network parameters of θ'; the Critic network φ is updated B times, each update performing a gradient descent step on the advantage function A^θ' to reduce the value of the advantage function as much as possible, so as to optimize the θ' network strategy; the updates are then stopped;
$$J^{\theta'}_{\mathrm{clip}}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\min\big(\mathrm{ratio}_t(\theta)A^{\theta'}(s_t,a_t),\ \mathrm{clip}(\mathrm{ratio}_t(\theta),\,1-\varepsilon,\,1+\varepsilon)A^{\theta'}(s_t,a_t)\big)\Big]$$
where J^θ'_clip(θ) is the objective function to be optimized; gradient ascent is performed on it to obtain the maximum expected reward; E_{(s_t,a_t)~π_θ'}[·] denotes the reward expectation; s_t represents the state of the agent at time t and a_t the action taken by the agent at time t; π_θ' is the policy of the θ' network, i.e. the probability that the agent takes a particular action in a particular state; A^θ'(s_t,a_t) is the advantage function of the θ' network, representing how much better the action taken at the current moment is than the average action: if the advantage function is greater than 0, the probability of the strategy is increased, and if it is less than 0, the probability of the strategy is reduced;
$$\mathrm{ratio}_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}$$
is the ratio of the two network policies; clip is a clipping function applied when the action probability distributions of the two networks drift too far apart: if the value of ratio_t(θ) is less than 1-ε, 1-ε is taken; if the value of ratio_t(θ) is greater than 1+ε, 1+ε is taken.
6. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 5, wherein: in steps S4-S5, the network θ obtains the new parameters trained by the network θ', the multiple threads share the new parameters of the global PPO network, and when the number of steps in the round of each thread reaches T, the training of the round is finished; otherwise, the process returns to step S2.
7. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 6, wherein: in steps S4-S5, training is terminated when the number of rounds reaches N; otherwise the next round of training is carried out: the initial power of the primary user and the secondary user is reinitialized, the initial state s_0 is obtained, and iterative training continues from step S2.
CN202110469413.4A 2021-04-28 2021-04-28 Self-adaptive continuous power control method based on distributed PPO algorithm Active CN113191487B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110469413.4A CN113191487B (en) 2021-04-28 2021-04-28 Self-adaptive continuous power control method based on distributed PPO algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110469413.4A CN113191487B (en) 2021-04-28 2021-04-28 Self-adaptive continuous power control method based on distributed PPO algorithm

Publications (2)

Publication Number Publication Date
CN113191487A true CN113191487A (en) 2021-07-30
CN113191487B CN113191487B (en) 2023-04-07

Family

ID=76980163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110469413.4A Active CN113191487B (en) 2021-04-28 2021-04-28 Self-adaptive continuous power control method based on distributed PPO algorithm

Country Status (1)

Country Link
CN (1) CN113191487B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629461A (en) * 2023-07-25 2023-08-22 山东大学 Distributed optimization method, system, equipment and storage medium for active power distribution network

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105050176A (en) * 2015-05-29 2015-11-11 重庆邮电大学 Stackelberg game power control method based on interruption probability constraint in cognitive radio network
US20160267380A1 (en) * 2015-03-13 2016-09-15 Nuance Communications, Inc. Method and System for Training a Neural Network
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111526592A (en) * 2020-04-14 2020-08-11 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN112162861A (en) * 2020-09-29 2021-01-01 广州虎牙科技有限公司 Thread allocation method and device, computer equipment and storage medium
US20210103286A1 (en) * 2019-10-04 2021-04-08 Hong Kong Applied Science And Technology Research Institute Co., Ltd. Systems and methods for adaptive path planning
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
US20210119881A1 (en) * 2018-05-02 2021-04-22 Telefonaktiebolaget Lm Ericsson (Publ) First network node, third network node, and methods performed thereby, for handling a performance of a radio access network
CN112700664A (en) * 2020-12-19 2021-04-23 北京工业大学 Traffic signal timing optimization method based on deep reinforcement learning
CN112700663A (en) * 2020-12-23 2021-04-23 大连理工大学 Multi-agent intelligent signal lamp road network control method based on deep reinforcement learning strategy

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160267380A1 (en) * 2015-03-13 2016-09-15 Nuance Communications, Inc. Method and System for Training a Neural Network
CN105050176A (en) * 2015-05-29 2015-11-11 重庆邮电大学 Stackelberg game power control method based on interruption probability constraint in cognitive radio network
US20210119881A1 (en) * 2018-05-02 2021-04-22 Telefonaktiebolaget Lm Ericsson (Publ) First network node, third network node, and methods performed thereby, for handling a performance of a radio access network
US20210103286A1 (en) * 2019-10-04 2021-04-08 Hong Kong Applied Science And Technology Research Institute Co., Ltd. Systems and methods for adaptive path planning
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111526592A (en) * 2020-04-14 2020-08-11 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN112162861A (en) * 2020-09-29 2021-01-01 广州虎牙科技有限公司 Thread allocation method and device, computer equipment and storage medium
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112700664A (en) * 2020-12-19 2021-04-23 北京工业大学 Traffic signal timing optimization method based on deep reinforcement learning
CN112700663A (en) * 2020-12-23 2021-04-23 大连理工大学 Multi-agent intelligent signal lamp road network control method based on deep reinforcement learning strategy

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FAN ZISHEN等: "Proximal Policy Optimization Based Continuous Intelligent Power Control in Cognitive Radio Network", 《2020 IEEE 6TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATIONS (ICCC)》 *
NICOLAS HEESS等: "Emergence of Locomotion Behaviours in Rich Environments", 《ARXIV:ARTIFICIAL INTELLIGENCE》 *
YIJUN CHENG等: "Optimal Energy Management of Energy Internet: A Distributed Actor-Critic Reinforcement Learning Method", 《2020 AMERICAN CONTROL CONFERENCE (ACC)》 *
范子申: "Research on Spectrum Sharing Models and Algorithms Based on Deep Reinforcement Learning in Cognitive Radio Networks", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629461A (en) * 2023-07-25 2023-08-22 山东大学 Distributed optimization method, system, equipment and storage medium for active power distribution network
CN116629461B (en) * 2023-07-25 2023-10-17 山东大学 Distributed optimization method, system, equipment and storage medium for active power distribution network

Also Published As

Publication number Publication date
CN113191487B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN112668128B (en) Method and device for selecting terminal equipment nodes in federal learning system
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN113573324B (en) Cooperative task unloading and resource allocation combined optimization method in industrial Internet of things
CN111556461B (en) Vehicle-mounted edge network task distribution and unloading method based on deep Q network
CN112367353A (en) Mobile edge computing unloading method based on multi-agent reinforcement learning
CN113543176B (en) Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance
CN110691422B (en) Multi-channel intelligent access method based on deep reinforcement learning
CN113873022A (en) Mobile edge network intelligent resource allocation method capable of dividing tasks
CN112598150B (en) Method for improving fire detection effect based on federal learning in intelligent power plant
CN113225377B (en) Internet of things edge task unloading method and device
CN114285853B (en) Task unloading method based on end edge cloud cooperation in equipment-intensive industrial Internet of things
CN113469325A (en) Layered federated learning method, computer equipment and storage medium for edge aggregation interval adaptive control
CN111367657A (en) Computing resource collaborative cooperation method based on deep reinforcement learning
CN109787696B (en) Cognitive radio resource allocation method based on case reasoning and cooperative Q learning
CN113312177B (en) Wireless edge computing system and optimizing method based on federal learning
CN112287990A (en) Model optimization method of edge cloud collaborative support vector machine based on online learning
CN112492691A (en) Downlink NOMA power distribution method of deep certainty strategy gradient
CN110336620A (en) A kind of QL-UACW back-off method based on MAC layer fair exchange protocols
CN113191487B (en) Self-adaptive continuous power control method based on distributed PPO algorithm
CN112272074A (en) Information transmission rate control method and system based on neural network
CN116080407A (en) Unmanned aerial vehicle energy consumption optimization method and system based on wireless energy transmission
CN114204971B (en) Iterative aggregate beam forming design and user equipment selection method
CN117119486B (en) Deep unsupervised learning resource allocation method for guaranteeing long-term user rate of multi-cell cellular network
CN113613332A (en) Spectrum resource allocation method and system based on cooperative distributed DQN (differential Quadrature reference network) combined simulated annealing algorithm
Sharma et al. Feel-enhanced edge computing in energy constrained uav-aided iot networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant