CN113191487A - Self-adaptive continuous power control method based on distributed PPO algorithm - Google Patents

Self-adaptive continuous power control method based on distributed PPO algorithm

Info

Publication number
CN113191487A
CN113191487A (application number CN202110469413.4A)
Authority
CN
China
Prior art keywords
network
ppo
theta
global
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110469413.4A
Other languages
Chinese (zh)
Other versions
CN113191487B (en)
Inventor
谢显中
范子申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110469413.4A priority Critical patent/CN113191487B/en
Publication of CN113191487A publication Critical patent/CN113191487A/en
Application granted granted Critical
Publication of CN113191487B publication Critical patent/CN113191487B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention relates to a self-adaptive continuous power control method based on a distributed PPO algorithm, belonging to the field of deep reinforcement learning and comprising the following steps of S1: firstly, representing a plurality of secondary networks by a plurality of threads, sharing a global PPO network strategy parameter by the plurality of secondary networks, and initializing all parameters; s2: a plurality of threads concurrently execute a global PPO network strategy and collect certain batches of data information in different environments in parallel; s3: the multiple threads transmit the collected sample data to the global PPO network, and the multiple threads stop collecting the sample data; s4: the global PPO network trains the network according to sample data transmitted by a plurality of threads, and strategy parameters are updated; s5: after the global PPO network updates the parameters, stopping updating the parameters, controlling the multiple threads to continue to concurrently collect sample data information, and then repeating the step S4 until the task is finished.

Description

Self-adaptive continuous power control method based on distributed PPO algorithm
Technical Field
The invention belongs to the field of deep reinforcement learning, and relates to a self-adaptive continuous power control method based on a distributed PPO algorithm.
Background
In the literature, a policy-based deep reinforcement learning algorithm, the Proximal Policy Optimization (PPO) algorithm, is used to help a secondary user in a cognitive wireless network realize adaptive continuous power control, so that the spectrum resources of the primary user are shared and successful communication between the primary user and the secondary user is achieved.
The PPO algorithm is a deep reinforcement learning algorithm based on the Actor-Critic (AC) framework. With an artificial neural network, the PPO algorithm can handle an environment with an infinite state space, and as a policy-based method it can handle an infinite action space; this fits well with the goal of realizing intelligent continuous power control for secondary users in a complex environment so as to share the primary user's spectrum resources.
The PPO algorithm mainly solves two problems of the traditional policy-based method: its sensitivity to the update step size and its low network update efficiency. On the one hand, the PPO algorithm adopts importance sampling to convert the on-policy method into an off-policy method, so that empirical data can be reused and the update efficiency of the network is improved; on the other hand, it limits the update step, which resolves the over-sensitivity of the policy method to the update step size. The traditional policy method uses a network θ to execute a parameterized policy π_θ; once the parameters of the network are updated, the data collected under the old policy can no longer be used to train the network and must be re-sampled with the new policy, which results in a low utilization rate of sample information, low parameter update efficiency and huge time consumption. The importance sampling method constructs a network θ', whose output action probability distribution is similar to that of the network θ, to interact with the environment and collect sample data; the policy π_θ' is then executed while the network θ is trained multiple times, with the parameters of the network θ' kept fixed, so that the sampled data are reused and the update efficiency of the network is improved. In the conventional policy gradient method, the gradient calculation formula is:
$$\nabla \bar{R}_\theta = \mathbb{E}_{(s_t,a_t)\sim\pi_\theta}\big[A^{\theta}(s_t,a_t)\,\nabla \log \pi_\theta(a_t\mid s_t)\big] \tag{1}$$
after the importance sampling method is adopted, the formula (1) can be changed into:
$$\nabla \bar{R}_\theta = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\,\nabla \log \pi_\theta(a_t\mid s_t)\Big] \tag{2}$$
a new objective function based on the PPO algorithm is then obtained:
$$J^{\theta'}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\Big] \tag{3}$$
where J^θ'(θ) means that the network θ' interacts with the environment to collect the data used to update the network θ. In this way the on-policy method is converted into an off-policy method, so that the sampled data can be reused. However, the sensitivity to the update step size of the conventional policy method still exists: if the output action probability distributions of the two networks are too far apart, training is difficult to converge. So that the two distributions do not drift too far apart, the PPO algorithm adds a constraint to equation (3), as shown below:
$$J^{\theta'}_{\mathrm{clip}}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\min\big(\mathrm{ratio}_t(\theta)A^{\theta'}(s_t,a_t),\ \mathrm{clip}(\mathrm{ratio}_t(\theta),\,1-\varepsilon,\,1+\varepsilon)A^{\theta'}(s_t,a_t)\big)\Big] \tag{4}$$
where
$$\mathrm{ratio}_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}$$
is the ratio of the two network policies, and clip is a clipping function that clips when the action probability distributions of the two networks drift too far apart: if the value of ratio_t(θ) is less than 1-ε, 1-ε is taken; if the value of ratio_t(θ) is greater than 1+ε, 1+ε is taken. The value of ε is typically 0.1 or 0.2. Excessively large updates are thereby prevented, which solves the sensitivity of the policy method to the update step size.
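For illustration only, the clipped surrogate objective of equation (4) can be written as a short function; the following is a minimal PyTorch sketch in which the function name, tensor arguments and the default value of ε are assumptions rather than part of the claimed method.

```python
# Minimal sketch of the clipped surrogate objective in equation (4).
# Inputs are assumed to be 1-D tensors over a batch of (s_t, a_t) samples.
import torch

def ppo_clip_objective(log_prob_new: torch.Tensor,
                       log_prob_old: torch.Tensor,
                       advantage: torch.Tensor,
                       eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective J_clip(theta), to be maximized."""
    # ratio_t(theta) = pi_theta(a_t|s_t) / pi_theta'(a_t|s_t), computed in log space
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # the element-wise min keeps the update conservative when the ratio drifts
    return torch.min(unclipped, clipped).mean()

# Usage: maximize the objective by minimizing its negative, e.g.
# loss = -ppo_clip_objective(logp_new, logp_old.detach(), adv.detach())
```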
Although the PPO algorithm can effectively help the secondary user learn the optimal continuous power control strategy, the deep neural network has a large number of parameters that must be updated over many training iterations, so the complexity is high and the training time is long.
Disclosure of Invention
In view of the above, the present invention provides a Distributed PPO (DPPO) method to increase the training speed and reduce the training time. Unlike the PPO method, DPPO has a plurality of secondary networks and one main network. The secondary networks share the policy parameters of the main network; during training they execute the main network's policy in parallel, collect sample information in their respective environments and transmit the samples to the main network. The main network trains on the sample information transmitted by the secondary networks and updates the policy parameters, after which the plurality of secondary networks continue to collect sample data concurrently under the new policy parameters updated by the main network, until training is finished.
In order to achieve the purpose, the invention provides the following technical scheme:
a self-adaptive continuous power control method based on a distributed PPO algorithm comprises the following steps:
s1: firstly, representing a plurality of secondary networks by a plurality of threads, wherein the plurality of secondary networks share a global PPO network strategy parameter and initialize all the parameters;
s2: the multiple threads concurrently execute a global PPO network strategy and collect certain batches of data information in different environments in parallel;
s3: the multiple threads transmit the collected sample data to the global PPO network, and the multiple threads stop collecting the sample data;
s4: the global PPO network trains the network according to sample data transmitted by a plurality of threads and updates strategy parameters;
s5: and after the parameters of the global PPO network are updated, stopping updating the parameters, controlling the multiple threads to continuously and concurrently collect sample data information, and then repeating the step S4 until the task is finished.
Further, in step S1, a plurality of worker threads are initialized; the global PPO network parameters are initialized; the parameter θ of the Actor network, another network θ', and the parameter φ of the Critic network are initialized; the number of updates M of the θ' network and the number of updates B of the Critic network are initialized; environmental parameters such as the number of sensors and the interference error are initialized; the number of training rounds N, the number of iteration steps T per round and the sampling batch size Batchsize are initialized; the power of the primary user and the power of the secondary user are initialized, the power of the primary user is substituted into the power control strategy to obtain the power of the next time frame, and the initial state s_0 of the environment is thus obtained.
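As a concrete, purely illustrative reading of the step-S1 initialization, the parameters listed above can be grouped into a single configuration object; every numeric value below is an assumption, since the patent does not fix concrete settings.

```python
# Hypothetical grouping of the step-S1 hyperparameters; all values are examples only.
from dataclasses import dataclass

@dataclass
class DPPOConfig:
    num_workers: int = 4           # number of worker threads (secondary networks)
    actor_updates_M: int = 10      # M: updates of the theta' network per batch
    critic_updates_B: int = 10     # B: updates of the Critic network per batch
    rounds_N: int = 500            # N: number of training rounds
    steps_per_round_T: int = 200   # T: iteration steps per round
    batch_size: int = 32           # Batchsize: sampling steps per worker per batch
    clip_eps: float = 0.2          # epsilon used by the clipping function
    num_sensors: int = 5           # environmental parameter (example)
    p_primary_init: float = 1.0    # initial primary-user power (example)
    p_secondary_init: float = 0.1  # initial secondary-user power (example)

config = DPPOConfig()
```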
Further, in step S2, the multiple workers execute the policy π_θ of the Actor network in the global PPO network and collect sample data {s_t, a_t, r_t} in their respective environments.
Further, in step S3, the plurality of workers stop collecting data after each has executed Batchsize steps, and transmit the collected sample information to the global PPO network.
Further, in steps S4-S5, the global PPO network calculates the cumulative advantage function A^θ, assigns the parameters of the network θ to the network θ', and repeatedly updates the θ' network M times using the sample data acquired in step S3, each time performing a gradient calculation according to gradient formula (4) and updating the network parameters of θ'; the Critic network φ is updated B times, each update performing a gradient descent step on the advantage function A^θ' to reduce the value of the advantage function as much as possible, so as to optimize the θ' network strategy; the updates are then stopped;
$$J^{\theta'}_{\mathrm{clip}}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\min\big(\mathrm{ratio}_t(\theta)A^{\theta'}(s_t,a_t),\ \mathrm{clip}(\mathrm{ratio}_t(\theta),\,1-\varepsilon,\,1+\varepsilon)A^{\theta'}(s_t,a_t)\big)\Big] \tag{4}$$
where J^θ'_clip(θ) is the objective function to be optimized; gradient ascent is performed on it to obtain the maximum expected reward. E_{(s_t,a_t)~π_θ'}[·] denotes the reward expectation; s_t denotes the state of the agent at time t and a_t the action taken by the agent at time t; π_θ' is the policy of the θ' network, i.e. the probability that the agent takes a particular action in a particular state; A^θ'(s_t,a_t) is the advantage function of the θ' network, which represents how much better the action taken at the current moment is than the average action: if the advantage function is greater than 0, the probability of the strategy is increased, and if it is less than 0, the probability of the strategy is reduced;
and
$$\mathrm{ratio}_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}$$
is the ratio of the two network policies; clip is a clipping function applied when the action probability distributions of the two networks drift too far apart: if the value of ratio_t(θ) is less than 1-ε, 1-ε is taken; if the value of ratio_t(θ) is greater than 1+ε, 1+ε is taken.
Further, in steps S4-S5, the network θ obtains the new parameters trained by the network θ', the multiple threads share the new parameters of the global PPO network, and when the number of steps in the round of each thread reaches T, the training of the round is finished; otherwise, the process returns to step S2.
Further, in steps S4-S5, training is terminated when the number of rounds reaches N; otherwise the next round of training is carried out: the initial power of the primary user and the secondary user is reinitialized, the initial state s_0 is obtained, and iterative training continues from step S2.
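To make the update of steps S4-S5 concrete, the following PyTorch sketch shows one possible shape of the global-network update: advantages are formed from discounted returns minus the critic's value, a snapshot of the actor plays the role of the fixed sampling policy, the actor is updated M times with the clipped objective and the critic B times with a value-regression loss. The Gaussian policy, network sizes and optimizers are illustrative assumptions, and the θ/θ' bookkeeping of the patent is mirrored only loosely.

```python
# Sketch of the steps S4-S5 update (assumptions: Gaussian actor for continuous
# power, MSE critic loss, discounted-return advantages).
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                  nn.Linear(64, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, s: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mean(s), self.log_std.exp())

class Critic(nn.Module):
    def __init__(self, state_dim: int):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                               nn.Linear(64, 1))

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.v(s).squeeze(-1)

def global_update(actor, critic, opt_actor, opt_critic,
                  states, actions, returns, M=10, B=10, eps=0.2):
    """One global PPO update from a batch of worker samples (steps S4-S5)."""
    old_actor = copy.deepcopy(actor)              # snapshot used as the fixed sampler
    with torch.no_grad():
        old_logp = old_actor.dist(states).log_prob(actions).sum(-1)
        advantage = returns - critic(states)      # A = discounted return - baseline
    for _ in range(M):                            # M updates with the clipped objective
        logp = actor.dist(states).log_prob(actions).sum(-1)
        ratio = torch.exp(logp - old_logp)
        surrogate = torch.min(ratio * advantage,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
        opt_actor.zero_grad()
        (-surrogate.mean()).backward()
        opt_actor.step()
    for _ in range(B):                            # B updates of the Critic network
        value_loss = ((returns - critic(states)) ** 2).mean()
        opt_critic.zero_grad()
        value_loss.backward()
        opt_critic.step()
```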
The invention has the beneficial effects that: the secondary user network trained with the multi-thread DPPO algorithm can achieve the same effect as the single-thread PPO algorithm, and the secondary user can learn the optimal continuous power control strategy under different parameter conditions; the sample collection time required for training is shortened and the main network can gather the sample data required for training in a short time, so the training speed of the DPPO algorithm is obviously increased and the training time is effectively reduced.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
fig. 1 is a schematic view of a spectrum sharing scenario in a cognitive wireless network;
FIG. 2 is a schematic flow chart of a distributed PPO algorithm-based adaptive continuous power control method;
fig. 3 is a comparison graph of network system capacities after training of the DPPO algorithm and the PPO algorithm.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided only to illustrate the invention and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
Referring to fig. 1 to 3, an adaptive continuous power control method based on a distributed PPO algorithm is shown. The PPO algorithm used in the cognitive wireless network scenario of fig. 1 can help a secondary user effectively learn an optimal continuous power control strategy in a complex environment, so as to share the spectrum resources of the primary user.
The PPO algorithm is a deep reinforcement learning algorithm based on the Actor-Critic (AC) framework. With an artificial neural network, the PPO algorithm can handle an environment with an infinite state space, and as a policy-based method it can handle an infinite action space; this fits well with the goal of realizing intelligent continuous power control for secondary users in a complex environment so as to share the primary user's spectrum resources.
The PPO algorithm mainly solves two problems of the traditional policy-based method: its sensitivity to the update step size and its low network update efficiency. On the one hand, the PPO algorithm adopts importance sampling to convert the on-policy method into an off-policy method, so that empirical data can be reused and the update efficiency of the network is improved; on the other hand, it limits the update step, which resolves the over-sensitivity of the policy method to the update step size. The traditional policy method uses a network θ to execute a parameterized policy π_θ; once the parameters of the network are updated, the data collected under the old policy can no longer be used to train the network and must be re-sampled with the new policy, which results in a low utilization rate of sample information, low parameter update efficiency and huge time consumption. The importance sampling method constructs a network θ', whose output action probability distribution is similar to that of the network θ, to interact with the environment and collect sample data; the policy π_θ' is then executed while the network θ is trained multiple times, with the parameters of the network θ' kept fixed, so that the sampled data are reused and the update efficiency of the network is improved. In the conventional policy gradient method, the gradient calculation formula is:
$$\nabla \bar{R}_\theta = \mathbb{E}_{(s_t,a_t)\sim\pi_\theta}\big[A^{\theta}(s_t,a_t)\,\nabla \log \pi_\theta(a_t\mid s_t)\big] \tag{1.1}$$
after the importance sampling method is adopted, the formula (1.1) can be changed into:
$$\nabla \bar{R}_\theta = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\,\nabla \log \pi_\theta(a_t\mid s_t)\Big] \tag{1.2}$$
a new objective function based on the PPO algorithm is then obtained:
$$J^{\theta'}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\Big] \tag{1.3}$$
where J^θ'(θ) means that the network θ' interacts with the environment to collect the data used to update the network θ. In this way the on-policy method is converted into an off-policy method, so that the sampled data can be reused. However, the sensitivity to the update step size of the conventional policy method still exists: if the output action probability distributions of the two networks are too far apart, training is difficult to converge. So that the two distributions do not drift too far apart, the PPO algorithm adds a constraint to equation (1.3), as shown below:
$$J^{\theta'}_{\mathrm{clip}}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\min\big(\mathrm{ratio}_t(\theta)A^{\theta'}(s_t,a_t),\ \mathrm{clip}(\mathrm{ratio}_t(\theta),\,1-\varepsilon,\,1+\varepsilon)A^{\theta'}(s_t,a_t)\big)\Big] \tag{1.4}$$
where
$$\mathrm{ratio}_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}$$
is the ratio of the two network policies, and clip is a clipping function that clips when the action probability distributions of the two networks drift too far apart: if the value of ratio_t(θ) is less than 1-ε, 1-ε is taken; if the value of ratio_t(θ) is greater than 1+ε, 1+ε is taken. The value of ε is typically 0.1 or 0.2. Excessively large updates are thereby prevented, which solves the sensitivity of the policy method to the update step size.
The PPO algorithm comprises the following specific steps:
1. initializing a parameter theta of an Actor network and a parameter phi of another network theta', Critic network; initializing the updating times M of the theta' network and the updating times B of the Critic network, and the like; initializing environmental parameters such as the number of sensors, interference errors and the like; initializing the number N of turns of network training, the number T of iteration steps of each turn, the sampling batch step number Batchsize and the like; initializing the power of a primary user and the power of a secondary user, substituting the power of the primary user into a power control strategy to obtain the power of the next time frame, and further obtaining the initial state s of the environment0
2. Training begins: the Actor network executes π_θ and collects sample data {s_t, a_t, r_t}; after Batchsize steps have been executed, the cumulative advantage function A^θ is calculated (one way of computing it is sketched after step 6 below) and the parameters of the network θ are assigned to the network θ'.
3. The θ' network is updated M times by reusing the sample data acquired in step 2; each update performs a gradient calculation according to gradient formula (1.4) and updates the network parameters of the θ' network. The Critic network φ is updated B times; each update performs a gradient descent step on the advantage function A^θ' to reduce the value of the advantage function as much as possible, so as to optimize the θ' network strategy.
4. The network theta obtains new parameters trained by the network theta'.
5. Terminating the training round when the number of steps in the round reaches T, otherwise, continuously repeating the step 2.
6. Training is terminated when the number of rounds reaches N; otherwise the next round of training is carried out: the initial power of the primary user and the secondary user is reinitialized, the initial state s_0 is obtained, and step 2 is repeated for iterative training.
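The cumulative advantage function A^θ in step 2 is not spelled out above; one common reading, used here purely as an assumption, is the discounted return of the collected Batchsize transitions minus the critic's value estimate, as in the following sketch.

```python
# Hypothetical computation of the cumulative advantage for a batch {s_t, a_t, r_t}.
# gamma, the bootstrap value and the critic baseline are illustrative assumptions.
from typing import List

def discounted_returns(rewards: List[float], last_value: float,
                       gamma: float = 0.9) -> List[float]:
    """Backward pass over the batch: G_t = r_t + gamma * G_{t+1}."""
    returns, g = [], last_value
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return returns

def advantages(rewards: List[float], values: List[float],
               last_value: float, gamma: float = 0.9) -> List[float]:
    """A_t = G_t - V(s_t) for every stored transition."""
    gs = discounted_returns(rewards, last_value, gamma)
    return [g - v for g, v in zip(gs, values)]

# Example: a 4-step batch with critic values and a bootstrap value of 0.5
# adv = advantages([1.0, 0.0, 0.5, 1.0], [0.8, 0.6, 0.7, 0.9], 0.5)
```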
However, because the number of parameters of the deep neural network is large, the PPO method requires a large number of training iterations to update them, so the complexity is high and the training time is long. Based on this, as shown in fig. 2, the present invention proposes a Distributed PPO (DPPO) algorithm to improve the training speed and reduce the training time. Unlike the PPO method, DPPO has a plurality of secondary networks and one main network. The secondary networks share the policy parameters of the main network; during training they execute the main network's policy in parallel, collect sample information in their respective environments and transmit the samples to the main network. The main network trains on the sample information transmitted by the secondary networks and updates the policy parameters, after which the plurality of secondary networks continue to collect sample data concurrently under the new policy parameters updated by the main network, until training is finished. In this way the sample collection time required for training is shortened and the main network can gather the sample data required for training in a short time; experimental results show that the training speed of the DPPO algorithm is obviously increased and the training time is effectively reduced.
The DPPO algorithm adds a plurality of data-collecting threads on top of the PPO algorithm. The threads share one global PPO network; they execute the strategy of the global PPO network to collect sample data concurrently in their respective environments, they do not compute gradients themselves, and they are only responsible for transmitting the collected sample information to the global PPO network for training, so that the time spent acquiring sample information is markedly shortened and the training time of the neural network is reduced. The specific steps are as follows.
1. A plurality of worker threads are initialized; the global PPO network parameters are initialized; the remaining parameters are initialized as in step 1 of the PPO algorithm above.
2. Training begins: the multiple workers execute the policy π_θ of the Actor network in the global PPO network, collect sample data {s_t, a_t, r_t} in their respective environments, stop data collection after each has executed Batchsize steps, and transmit the collected sample information to the global PPO network.
3. The global PPO network calculates the cumulative advantage function A^θ and assigns the parameters of the network θ to the network θ'; the θ' network is updated M times by reusing the sample data acquired in step 2, each time performing a gradient calculation according to gradient formula (1.4) and updating the network parameters of the θ' network; the Critic network φ is updated B times, each update performing a gradient descent step on the advantage function A^θ' to reduce the value of the advantage function as much as possible, so as to optimize the θ' network strategy; the updates are then stopped.
4. The network θ obtains the new parameters trained by the network θ', and the multiple threads share the new parameters of the global PPO network; when the number of steps of each thread in the round reaches T, the training of the round is finished; otherwise, step 2 is repeated.
5. Training is terminated when the number of rounds reaches N; otherwise the next round of training is carried out: the initial power of the primary user and the secondary user is reinitialized, the initial state s_0 is obtained, and iterative training continues from step 2.
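The thread coordination in steps 1-5 can be sketched with Python's standard threading primitives. The queue-based hand-off, the round control and all object names below are illustrative assumptions; the description above only requires that the threads collect samples concurrently under the global policy and that the global PPO network updates while collection is paused.

```python
# Hypothetical DPPO collection/update cycle with worker threads and a global network.
import queue
import threading

BATCH_SIZE = 32  # Batchsize steps collected by each worker per round

def worker(env, global_policy, go_queue, sample_queue):
    """One secondary network: execute the global policy, collect a batch, hand it over."""
    while True:
        go_queue.get()                                 # wait until collection is allowed
        batch, state = [], env.reset()
        for _ in range(BATCH_SIZE):
            action = global_policy.act(state)          # execute the global PPO policy
            next_state, reward, done = env.step(action)
            batch.append((state, action, reward))
            state = env.reset() if done else next_state
        sample_queue.put(batch)                        # transmit samples, then idle

def coordinator(global_ppo, worker_go_queues, sample_queue, total_rounds):
    """Global PPO network: alternate concurrent collection and parameter updates."""
    for _ in range(total_rounds):
        for go in worker_go_queues:                    # release every worker for this round
            go.put("collect")
        batches = [sample_queue.get() for _ in range(len(worker_go_queues))]
        global_ppo.update(batches)                     # M actor + B critic updates while workers wait

# Usage (objects are assumptions): one daemon thread per worker, coordinator in the main thread.
# samples, gos = queue.Queue(), [queue.Queue() for _ in envs]
# for env, go in zip(envs, gos):
#     threading.Thread(target=worker, args=(env, policy, go, samples), daemon=True).start()
# coordinator(global_ppo, gos, samples, total_rounds=500)
```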
FIG. 3 shows the simulation results of a network system capacity comparison test after training with the DPPO algorithm and the PPO algorithm. In FIG. 3, PU-DPPO represents the capacity of the primary user under the DPPO algorithm, SU-DPPO the capacity of the secondary user under the DPPO algorithm, PU-PPO the capacity of the primary user under the PPO algorithm, and SU-PPO the capacity of the secondary user under the PPO algorithm. The simulation results show that the system capacity trained with the DPPO algorithm is very close to that trained with the PPO algorithm, which demonstrates the effectiveness of the DPPO algorithm; the training time recorded in the experiment shows that the DPPO algorithm takes 261 seconds while the PPO algorithm takes 350 seconds, which demonstrates that the network training speed of the DPPO algorithm is further improved.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (7)

1. A self-adaptive continuous power control method based on a distributed PPO algorithm is characterized by comprising the following steps: the method comprises the following steps:
s1: firstly, representing a plurality of secondary networks by a plurality of threads, wherein the plurality of secondary networks share a global PPO network strategy parameter and initialize all the parameters;
s2: the multiple threads concurrently execute a global PPO network strategy and collect certain batches of data information in different environments in parallel;
s3: the multiple threads transmit the collected sample data to the global PPO network, and the multiple threads stop collecting the sample data;
s4: the global PPO network trains the network according to sample data transmitted by a plurality of threads and updates strategy parameters;
s5: and after the parameters of the global PPO network are updated, stopping updating the parameters, controlling the multiple threads to continuously and concurrently collect sample data information, and then repeating the step S4 until the task is finished.
2. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 1, wherein: in step S1, a plurality of worker threads are initialized; the global PPO network parameters are initialized; the parameter θ of the Actor network, another network θ', and the parameter φ of the Critic network are initialized; the number of updates M of the θ' network and the number of updates B of the Critic network are initialized; environmental parameters such as the number of sensors and the interference error are initialized; the number of training rounds N, the number of iteration steps T per round and the sampling batch size Batchsize are initialized; the power of the primary user and the power of the secondary user are initialized, the power of the primary user is substituted into the power control strategy to obtain the power of the next time frame, and the initial state s_0 of the environment is thus obtained.
3. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 1, wherein: in step S2, the multiple workers execute the policy π_θ of the Actor network in the global PPO network and collect sample data {s_t, a_t, r_t} in their respective environments.
4. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 3, wherein: in step S3, the plurality of workers stop collecting data after each has executed Batchsize steps, and transmit the collected sample information to the global PPO network.
5. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 1, wherein: in steps S4-S5, the global PPO network calculates the cumulative advantage function A^θ, assigns the parameters of the network θ to the network θ', and repeatedly updates the θ' network M times using the sample data acquired in step S3, each time performing a gradient calculation according to gradient formula (4) and updating the network parameters of θ'; the Critic network φ is updated B times, each update performing a gradient descent step on the advantage function A^θ' to reduce the value of the advantage function as much as possible, so as to optimize the θ' network strategy; the updates are then stopped;
$$J^{\theta'}_{\mathrm{clip}}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\min\big(\mathrm{ratio}_t(\theta)A^{\theta'}(s_t,a_t),\ \mathrm{clip}(\mathrm{ratio}_t(\theta),\,1-\varepsilon,\,1+\varepsilon)A^{\theta'}(s_t,a_t)\big)\Big]$$
where J^θ'_clip(θ) is the objective function to be optimized; gradient ascent is performed on it to obtain the maximum expected reward; E_{(s_t,a_t)~π_θ'}[·] denotes the reward expectation; s_t represents the state of the agent at time t and a_t the action taken by the agent at time t; π_θ' is the policy of the θ' network, i.e. the probability that the agent takes a particular action in a particular state; A^θ'(s_t,a_t) is the advantage function of the θ' network, representing how much better the action taken at the current moment is than the average action: if the advantage function is greater than 0, the probability of the strategy is increased, and if it is less than 0, the probability of the strategy is reduced;
$$\mathrm{ratio}_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}$$
is the ratio of the two network policies; clip is a clipping function applied when the action probability distributions of the two networks drift too far apart: if the value of ratio_t(θ) is less than 1-ε, 1-ε is taken; if the value of ratio_t(θ) is greater than 1+ε, 1+ε is taken.
6. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 5, wherein: in steps S4-S5, the network θ obtains the new parameters trained by the network θ', the multiple threads share the new parameters of the global PPO network, and when the number of steps in the round of each thread reaches T, the training of the round is finished; otherwise, the process returns to step S2.
7. The adaptive continuous power control method based on the distributed PPO algorithm according to claim 6, wherein: in steps S4-S5, training is terminated when the number of rounds reaches N; otherwise the next round of training is carried out: the initial power of the primary user and the secondary user is reinitialized, the initial state s_0 is obtained, and iterative training continues from step S2.
CN202110469413.4A 2021-04-28 2021-04-28 Self-adaptive continuous power control method based on distributed PPO algorithm Active CN113191487B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110469413.4A CN113191487B (en) 2021-04-28 2021-04-28 Self-adaptive continuous power control method based on distributed PPO algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110469413.4A CN113191487B (en) 2021-04-28 2021-04-28 Self-adaptive continuous power control method based on distributed PPO algorithm

Publications (2)

Publication Number Publication Date
CN113191487A true CN113191487A (en) 2021-07-30
CN113191487B CN113191487B (en) 2023-04-07

Family

ID=76980163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110469413.4A Active CN113191487B (en) 2021-04-28 2021-04-28 Self-adaptive continuous power control method based on distributed PPO algorithm

Country Status (1)

Country Link
CN (1) CN113191487B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629461A (en) * 2023-07-25 2023-08-22 山东大学 Distributed optimization method, system, equipment and storage medium for active power distribution network

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105050176A (en) * 2015-05-29 2015-11-11 重庆邮电大学 Stackelberg game power control method based on interruption probability constraint in cognitive radio network
US20160267380A1 (en) * 2015-03-13 2016-09-15 Nuance Communications, Inc. Method and System for Training a Neural Network
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111526592A (en) * 2020-04-14 2020-08-11 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN112162861A (en) * 2020-09-29 2021-01-01 广州虎牙科技有限公司 Thread allocation method and device, computer equipment and storage medium
US20210103286A1 (en) * 2019-10-04 2021-04-08 Hong Kong Applied Science And Technology Research Institute Co., Ltd. Systems and methods for adaptive path planning
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
US20210119881A1 (en) * 2018-05-02 2021-04-22 Telefonaktiebolaget Lm Ericsson (Publ) First network node, third network node, and methods performed thereby, for handling a performance of a radio access network
CN112700664A (en) * 2020-12-19 2021-04-23 北京工业大学 Traffic signal timing optimization method based on deep reinforcement learning
CN112700663A (en) * 2020-12-23 2021-04-23 大连理工大学 Multi-agent intelligent signal lamp road network control method based on deep reinforcement learning strategy

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160267380A1 (en) * 2015-03-13 2016-09-15 Nuance Communications, Inc. Method and System for Training a Neural Network
CN105050176A (en) * 2015-05-29 2015-11-11 重庆邮电大学 Stackelberg game power control method based on interruption probability constraint in cognitive radio network
US20210119881A1 (en) * 2018-05-02 2021-04-22 Telefonaktiebolaget Lm Ericsson (Publ) First network node, third network node, and methods performed thereby, for handling a performance of a radio access network
US20210103286A1 (en) * 2019-10-04 2021-04-08 Hong Kong Applied Science And Technology Research Institute Co., Ltd. Systems and methods for adaptive path planning
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111526592A (en) * 2020-04-14 2020-08-11 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN112162861A (en) * 2020-09-29 2021-01-01 广州虎牙科技有限公司 Thread allocation method and device, computer equipment and storage medium
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112700664A (en) * 2020-12-19 2021-04-23 北京工业大学 Traffic signal timing optimization method based on deep reinforcement learning
CN112700663A (en) * 2020-12-23 2021-04-23 大连理工大学 Multi-agent intelligent signal lamp road network control method based on deep reinforcement learning strategy

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FAN ZISHEN等: "Proximal Policy Optimization Based Continuous Intelligent Power Control in Cognitive Radio Network", 《2020 IEEE 6TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATIONS (ICCC)》 *
NICOLAS HEESS等: "Emergence of Locomotion Behaviours in Rich Environments", 《ARXIV:ARTIFICIAL INTELLIGENCE》 *
YIJUN CHENG等: "Optimal Energy Management of Energy Internet: A Distributed Actor-Critic Reinforcement Learning Method", 《2020 AMERICAN CONTROL CONFERENCE (ACC)》 *
范子申: "Research on Spectrum Sharing Models and Algorithms Based on Deep Reinforcement Learning in Cognitive Radio Networks", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629461A (en) * 2023-07-25 2023-08-22 山东大学 Distributed optimization method, system, equipment and storage medium for active power distribution network
CN116629461B (en) * 2023-07-25 2023-10-17 山东大学 Distributed optimization method, system, equipment and storage medium for active power distribution network

Also Published As

Publication number Publication date
CN113191487B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN112668128B (en) Method and device for selecting terminal equipment nodes in federal learning system
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN113573324B (en) Cooperative task unloading and resource allocation combined optimization method in industrial Internet of things
CN111556461B (en) Vehicle-mounted edge network task distribution and unloading method based on deep Q network
CN112367353A (en) Mobile edge computing unloading method based on multi-agent reinforcement learning
CN113543176B (en) Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance
CN110691422B (en) Multi-channel intelligent access method based on deep reinforcement learning
CN113873022A (en) Mobile edge network intelligent resource allocation method capable of dividing tasks
CN112598150B (en) Method for improving fire detection effect based on federal learning in intelligent power plant
CN113225377B (en) Internet of things edge task unloading method and device
CN114285853B (en) Task unloading method based on end edge cloud cooperation in equipment-intensive industrial Internet of things
CN113469325A (en) Layered federated learning method, computer equipment and storage medium for edge aggregation interval adaptive control
CN111367657A (en) Computing resource collaborative cooperation method based on deep reinforcement learning
CN109787696B (en) Cognitive radio resource allocation method based on case reasoning and cooperative Q learning
CN113312177B (en) Wireless edge computing system and optimizing method based on federal learning
CN112287990A (en) Model optimization method of edge cloud collaborative support vector machine based on online learning
CN112492691A (en) Downlink NOMA power distribution method of deep certainty strategy gradient
CN110336620A (en) A kind of QL-UACW back-off method based on MAC layer fair exchange protocols
CN113191487B (en) Self-adaptive continuous power control method based on distributed PPO algorithm
CN112272074A (en) Information transmission rate control method and system based on neural network
CN116080407A (en) Unmanned aerial vehicle energy consumption optimization method and system based on wireless energy transmission
CN114204971B (en) Iterative aggregate beam forming design and user equipment selection method
CN117119486B (en) Deep unsupervised learning resource allocation method for guaranteeing long-term user rate of multi-cell cellular network
CN113613332A (en) Spectrum resource allocation method and system based on cooperative distributed DQN (differential Quadrature reference network) combined simulated annealing algorithm
Sharma et al. Feel-enhanced edge computing in energy constrained uav-aided iot networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant