CN114881228A - Average SAC deep reinforcement learning method and system based on Q learning - Google Patents


Info

Publication number
CN114881228A
CN114881228A (application CN202210683336.7A)
Authority
CN
China
Prior art keywords: soft, value, state, learning, function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210683336.7A
Other languages
Chinese (zh)
Inventor
陈志奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Juzhi Information Technology Co ltd
Original Assignee
Dalian Juzhi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Juzhi Information Technology Co ltd filed Critical Dalian Juzhi Information Technology Co ltd
Publication of CN114881228A publication Critical patent/CN114881228A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks

Abstract

The invention provides an average SAC (Soft Actor-Critic) deep reinforcement learning method and system based on Q learning, belonging to the technical field of deep reinforcement learning: 1) randomly initializing all parameter information; 2) updating Critic network parameters and completing strategy evaluation to perform soft strategy iteration, calculating the state value function of the value network, training a soft Q function through Q learning of the soft Q network, and then selecting the previous K previously learned soft Q values to calculate an average soft Q value; 3) updating Actor network parameters and completing strategy improvement to perform learning optimization; 4) completing one round of agent training and learning, and iterating the updates until a termination condition is met, thereby completing the deep reinforcement learning. The invention designs an Averaged Soft Actor-Critic method based on Q learning for image games, comprehensively considers the causes of poor stability in deep reinforcement learning, and can effectively improve the performance of the deep reinforcement learning algorithm Soft Actor-Critic by using the previous K soft Q values to reduce overestimation errors.

Description

Average SAC deep reinforcement learning method and system based on Q learning
Technical Field
The invention relates to the technical field of deep reinforcement learning, in particular to an average SAC deep reinforcement learning method and system based on Q learning.
Background
With the rapid development of the internet, the era of artificial intelligence has arrived. Artificial intelligence studies methods for simulating human thinking, and producing agents with fully autonomous learning ability is its main task. These agents need to interact with their current environment at all times to exchange and deliver information. Their ultimate task is to continuously train and learn from the current environment information so as to select the optimal action and thereby achieve the optimal strategy in the current environment. Artificial intelligence systems under this definition include robots that can interact with the surrounding environment, as well as purely software-based agents that can interact with multimedia devices (e.g., computers and mobile phones) and natural language. Deep reinforcement learning is particularly well suited to this interactive setting, since its underlying principle is the autonomous learning and training ability of an agent. For problems commonly encountered in machine learning, such as high computational complexity, huge memory occupation, and large sample complexity, deep reinforcement learning can solve or alleviate them well by exploiting the characteristics of neural networks in deep learning. However, most current knowledge of deep reinforcement learning is still at the stage of theoretical exploration, and its application in real-world practice is not yet widespread. Researchers have therefore recently spent much effort on applying such algorithm models to the daily life of ordinary people, so that they can provide convenience in everyday life. In practical applications, however, the security of the smart device hosting the algorithm is a problem that must be considered. Since deep reinforcement learning tends to explore unfamiliar and dangerous states, the strategy learned by a deep reinforcement learning agent is easily affected by the safe-exploration problem; if the agent frequently explores unsafe states (e.g., a mobile robot driving a car into an unsafe state), this is an extremely dangerous learning behavior. Therefore, for the security problem of deep reinforcement learning applied to actual smart devices, the algorithm must ensure that the agent in the smart device provides a strategy that is both efficient and extremely safe. Even if this strategy is not the optimal one, the algorithm must guarantee its safety during exploration. For example, in an intelligent unmanned vehicle, it would be completely impractical to select only the strategy defined as optimal for the intelligent device, such as traveling the optimal or shortest route to the destination, without regard to the safety of that strategy.
However, according to research and investigation, many intelligent devices that currently apply deep reinforcement learning algorithms do not effectively consider the security of the agent's strategy exploration, let alone use it as a performance index of the intelligent device. Such a setting is extremely unreasonable. The unsafe exploration behavior of the agent is mainly caused by overestimation in the Q learning process of the deep reinforcement learning algorithm.
In summary, in order to improve safety and controllability in practical applications, it is necessary to mitigate the overestimation produced in the Q learning process, thereby reducing the propagated error during algorithm training and improving algorithm stability. However, the stability of deep reinforcement learning algorithms remains a serious challenge that limits their further optimization and development. Although computer performance has improved, the stability of deep reinforcement learning algorithms has not improved accordingly. Therefore, how to improve the stability of the Q learning process and the security of deep reinforcement learning algorithms is one of the most challenging problems in deep reinforcement learning.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an Averaged Soft Actor-Critic algorithm which uses K previously learned state values to calculate the soft Q value, with the aim of reducing the overestimation of the soft Q value during calculation. The invention innovatively provides an Averaged-SAC algorithm model to improve the performance of the SAC (Soft Actor-Critic) algorithm and reduce the large errors caused by overestimation.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
an average SAC deep reinforcement learning method based on Q learning specifically comprises the following steps:
s1, finishing strategy evaluation to perform soft strategy iteration: calculating the state value of the intelligent agent in the environment interaction process through strategy evaluation, approximating the state value function of the intelligent agent to a soft Q value, and estimating the soft Q value from a single operation sample of the intelligent agent strategy in the current environment;
s2, calculating a state value function of the value network: obtaining a state value function of the value network according to a calculation formula of the soft Q value, and training the state value function through a Q learning process;
s3, training a soft Q function by using Q learning of the soft Q network: obtaining the state value according to the step S2, training a soft Q function through a Q learning process of the soft Q network to complete the interaction between the intelligent agent and the environment, and estimating errors generated in the Q learning process by using the soft Q value of the target value function and a Monte Carlo algorithm;
s4, calculating an average soft Q value: selecting the previous K learned soft Q values to calculate an average soft Q value in the interaction process between the current image game agent and the environment;
s5, finishing strategy improvement to carry out learning optimization: the strategy is updated by minimizing the KL difference from the boltzmann strategy.
Preferably, step S1 specifically includes the following steps:
s110, evaluating and calculating the state value of the intelligent agent in the environment interaction process through a strategy, wherein a soft state value function is defined as
V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \right]
where V represents the state value function of the current agent in the current round, s_t ∈ S is the state of the agent at time step t, a_t is the action executed by the agent in the current state, and π is the strategy adopted by the agent in the current environment;
s120, defining a soft Q function as
Q(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V(s_{t+1}) \right]
S130, approximating the state value function of the agent to a soft Q value, and estimating the soft Q value from a single operation sample of the agent strategy under the current environment.
Preferably, step S2 specifically includes the following steps:
s210, obtaining a state value function of the value network according to a calculation formula of the soft Q value;
s220, training the state value function through the Q learning process to reduce the error of the sample data, wherein the calculation formula is as follows
J_V(\psi) = \mathbb{E}_{s_t \sim D}\left[ \tfrac{1}{2}\left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \right] \right)^2 \right]
where D represents the state distribution of previous samples, which can be regarded as the sample distribution of the experience replay buffer; the gradient of the above formula is estimated with an unbiased estimator in the Averaged Soft Actor-Critic algorithm.
Preferably, step S3 specifically includes the following steps:
s3 obtaining the state value according to the step S2, training the soft Q function through the Q learning process of the soft Q network to complete the interaction between the intelligent agent and the environment, estimating the error generated in the Q learning process by using the soft Q value of the target value function and the Monte Carlo algorithm, wherein the calculation formula of the Q learning process for training the soft Q function is as follows
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \tfrac{1}{2}\left( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right], \quad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\psi}}(s_{t+1}) \right]
where D represents the state distribution of previous samples during the interaction between the agent and the environment, which can also be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; p: S × A → S is the state transition probability distribution of the current agent's environment;
V_{\bar{\psi}}
is the target value network of the Averaged Soft Actor-Critic algorithm; θ is the soft Q network parameter used to estimate the action value of the agent operating in the current environment.
Preferably, step S4 specifically includes the following steps:
S4, when the interaction process between the agent of the current image game and the environment runs to a certain step, selecting the previous K learned soft Q values to calculate the average soft Q value, where the calculation formula of the average soft Q value is as follows
\bar{Q}(s_t, a_t) = \frac{1}{K} \sum_{i=1}^{K} Q_{\theta_{t-i}}(s_t, a_t)
Preferably, step S5 specifically includes the following steps:
s5, updating the strategy by minimizing the KL difference with the Boltzmann strategy, wherein the strategy improvement formula is as follows
J_\pi(\phi) = \mathbb{E}_{s_t \sim D}\left[ D_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s_t) \;\Big\|\; \frac{\exp\left( Q_\theta(s_t, \cdot) \right)}{Z_\theta(s_t)} \right) \right]
where D represents the state distribution of previous samples, i.e., the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; θ is the soft Q network parameter used to estimate the action value of the agent in this state, and φ is the policy network parameter of the Averaged Soft Actor-Critic algorithm.
Preferably, the policy is re-parameterized by a re-parameterization technique, and the calculation formula is as follows:
a_t = f_\phi(\epsilon_t; s_t), where \epsilon_t \sim \mathcal{N}(0, I);
The strategy improvement formula is as follows:
J_\pi(\phi) = \mathbb{E}_{s_t \sim D,\, \epsilon_t \sim \mathcal{N}}\left[ \log \pi_\phi\!\left( f_\phi(\epsilon_t; s_t) \mid s_t \right) - Q_\theta\!\left( s_t, f_\phi(\epsilon_t; s_t) \right) \right]
an average SAC deep reinforcement learning system based on Q learning specifically comprises
The initialization module is used for randomly initializing the soft Q values corresponding to all states and actions, and for having the agent interact with the environment to collect data information of the current environment, initialize the first state of the current state sequence, and obtain the feature vector of the first state;
the training module is used for training the parameters of the soft Q network according to the collected characteristic vector data information, so that the trained information is used for estimating a soft Q value, and the parameters of the soft Q network are fixed after the soft Q network is trained; using the feature vector information as input in an algorithm network, outputting a corresponding selection action, and obtaining a new state and instant reward feedback based on the action;
the Critic network parameter updating module is used for obtaining soft Q value output and a corresponding state value function by using the characteristic vector data information as input in a Critic network; the Critic network part calculates TD error through the calculated state value functions and corresponding return; then, using a mean square error loss function to update the gradient of the Critic network parameters;
and the Actor network parameter updating module selects Softmax or a Gaussian score function and updates the Softmax or the Gaussian score function in combination with the TD error value of the Critic network.
The invention has the beneficial effects that: the invention designs an Averaged Soft Actor-Critic deep reinforcement learning algorithm based on Q learning for image game data; the algorithm uses K previously learned state values to calculate the soft Q value in the Averaged Soft Actor-Critic algorithm, thereby reducing the overestimation of the soft Q value during calculation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for average SAC deep reinforcement learning based on Q learning according to the present invention;
FIG. 2 is a graph of overestimation results for Averaged-SAC and SAC;
FIG. 3 is a framework diagram of the Averaged Soft Actor-Critic deep reinforcement learning algorithm based on Q learning.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. Other embodiments, which can be derived by one of ordinary skill in the art from the embodiments given herein without any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of the average SAC deep reinforcement learning method based on Q learning, and Fig. 3 is a framework diagram of the Averaged Soft Actor-Critic deep reinforcement learning algorithm based on Q learning. Firstly, the agent of the algorithm interacts with the current environment to obtain the state information of the current agent; then the state information of the agent is used as the network-layer input of the Averaged Soft Actor-Critic algorithm, and after it is processed by the convolutional neural network and the Softmax function, the network layer outputs the action currently selected by the agent; the selected action is fed back to the current interactive environment, so that the environment gives the instant reward feedback produced by the current action; the Averaged Soft Actor-Critic algorithm uniformly feeds this data information back to the Critic module, which updates the Critic parameters and calculates the TD error; the Actor module receives the environment information and the TD error transmitted by the Critic module and updates the Actor parameters accordingly; finally, by continuously repeating these steps, the agent is trained to learn an optimal strategy.
An average SAC deep reinforcement learning method based on Q learning specifically comprises the following steps:
s1, finishing strategy evaluation to perform soft strategy iteration: calculating the state value of the intelligent agent in the environment interaction process through strategy evaluation, approximating the state value function of the intelligent agent to a soft Q value, and estimating the soft Q value from a single operation sample of the intelligent agent strategy in the current environment;
the strategy evaluation is the first step of Soft strategy iteration performed by the Averaged Soft Actor-Critic algorithm, and the value of the intelligent agent in the environment interaction process needs to be calculated. To achieve this goal, the Averaged Soft Actor-Critic algorithm defines the Soft state value function as shown in equation (1):
V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \right]    (1)
where V represents the state value function of the current agent in the current round, s_t ∈ S is the state of the agent at time step t, a_t is the action executed by the agent in the current state, and π is the strategy adopted by the agent in the current environment.
The soft Q function of the Averaged Soft Actor-Critic algorithm is defined as shown in formula (2):
Q(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V(s_{t+1}) \right]    (2)
as can be seen in equation (2), the agent may obtain the next state based on the current state and the action taken.
In the Averaged Soft Actor-Critic algorithm, the state value function approximation can be regarded as a Soft Q value, and the algorithm does not need to design a separate function approximator for the state value in principle, because the relationship between the state value function and the Soft Q function is related to the policy function of the formula (1). The algorithm can estimate the soft Q value from a single operation sample of the agent policy in the current environment without introducing an additional bias. The soft Q value of the deep reinforcement learning algorithm is calculated through a single function approximator, so that the deep reinforcement learning algorithm can be stably trained and is convenient to train with other networks at the same time.
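The relationship between the state value and the soft Q value can be illustrated with a minimal sketch; the code below is a PyTorch-style illustration under assumed network interfaces (q_net(states, actions) returning soft Q values and policy.sample(states) returning sampled actions with their log-probabilities), not the patented implementation itself:

```python
import torch

def soft_state_value(q_net, policy, states):
    """Single-sample estimate of V(s) = E_a[Q(s, a) - log pi(a|s)], cf. formula (1)."""
    with torch.no_grad():
        actions, log_probs = policy.sample(states)   # a ~ pi(.|s), log pi(a|s)
        q_values = q_net(states, actions)            # soft Q(s, a)
    return q_values - log_probs                      # one action sample per state, no separate approximator needed
```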
S2, calculating a state value function of the value network: obtaining a state value function of the value network according to a calculation formula of the soft Q value, and training the state value function through a Q learning process;
after the calculation formula of the soft Q function is determined, the state value function of the value network can be obtained. The Averaged Soft Actor-criticic algorithm mainly trains a state value function through a Q learning process to reduce the error of sample data to the maximum extent, and the calculation mode is as shown in formula (3):
J_V(\psi) = \mathbb{E}_{s_t \sim D}\left[ \tfrac{1}{2}\left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \right] \right)^2 \right]    (3)
where D represents the state distribution of previous samples, which can be regarded as the sample distribution of the experience replay buffer. The gradient of equation (3) can be estimated with an unbiased estimator in the Averaged Soft Actor-Critic algorithm.
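For illustration, formula (3) can be implemented as a squared-error loss; the following sketch assumes the same hypothetical network interfaces as the previous snippet and is not the exact patented procedure:

```python
import torch
import torch.nn.functional as F

def value_loss(value_net, q_net, policy, states):
    """J_V(psi): fit V_psi(s) to a single-sample estimate of E_a[Q(s, a) - log pi(a|s)] (formula (3))."""
    with torch.no_grad():
        actions, log_probs = policy.sample(states)
        target_v = q_net(states, actions) - log_probs   # target for the value network
    v = value_net(states)
    return F.mse_loss(v, target_v)                       # the 1/2 factor is absorbed into the learning rate
```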
When the Averaged Soft Actor-Critic algorithm is applied to a continuous state space, the neural network holding the current Averaged Soft Actor-Critic parameters needs to be updated.
S3, training a soft Q function by using Q learning of the soft Q network: obtaining the state value according to the step S2, training a soft Q function through a Q learning process of the soft Q network to complete the interaction between the intelligent agent and the environment, and estimating errors generated in the Q learning process by using the soft Q value of the target value function and a Monte Carlo algorithm;
after the state values are obtained in the above steps, a Q learning process through a soft Q network is needed to train a soft Q function, so that interaction between the intelligent agent and the environment is completed. The Averaged Soft Actor-Critic algorithm trains a Soft Q function through Q learning, so that the residual error of the Beckmann equation is minimized, and the calculation mode is shown as formula (4):
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \tfrac{1}{2}\left( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right], \quad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\psi}}(s_{t+1}) \right]    (4)
where D represents the state distribution of previous samples during the interaction between the agent and the environment, which can also be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; p: S × A → S is the state transition probability distribution of the current agent's environment;
V_{\bar{\psi}}
is the target value network of the Averaged Soft Actor-Critic algorithm; its network structure is the same as that of the value network parameterized by ψ, but its parameters are updated at a different (slower) rate; θ is the soft Q network parameter used to estimate the action value of the agent operating in the current environment. The Averaged Soft Actor-Critic algorithm uses the soft Q value derived from the target value network together with a Monte Carlo estimate to evaluate the error generated during Q learning.
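As a sketch of formula (4), the soft Q network can be regressed onto a bootstrapped target computed from the target value network; the batch layout and the Polyak-averaging update rate below are common implementation assumptions rather than details fixed by the text:

```python
import torch
import torch.nn.functional as F

def soft_q_loss(q_net, target_value_net, batch, gamma=0.99):
    """J_Q(theta): regress Q_theta(s, a) onto r + gamma * V_target(s') (formula (4))."""
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        target_q = rewards + gamma * (1.0 - dones) * target_value_net(next_states)
    return F.mse_loss(q_net(states, actions), target_q)

def polyak_update(target_net, source_net, tau=0.005):
    """Slow update of the target value network, so that it lags behind the value network."""
    for tp, sp in zip(target_net.parameters(), source_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)
```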
S4, calculating an average soft Q value: selecting the previous K learned soft Q values to calculate an average soft Q value in the interaction process between the current image game agent and the environment;
during the interaction between the current image game agent and the environment, the average Soft Actor-Critic algorithm selects the first K previously learned Soft Q values to calculate the average Soft Q value when the current operation reaches a certain step. Therefore, the average Soft Q value calculation method of the Averaged Soft Actor-critical algorithm is shown in formula (5):
\bar{Q}(s_t, a_t) = \frac{1}{K} \sum_{i=1}^{K} Q_{\theta_{t-i}}(s_t, a_t)    (5)
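A minimal sketch of this averaging step is shown below; keeping the K most recent soft Q snapshots in a deque is an illustrative implementation choice, and where the averaged value enters the target computation is left to the implementation:

```python
import copy
from collections import deque

import torch

class AveragedSoftQ:
    """Maintain the K most recently learned soft Q estimates and average them (formula (5))."""
    def __init__(self, k):
        self.snapshots = deque(maxlen=k)

    def record(self, q_net):
        self.snapshots.append(copy.deepcopy(q_net).eval())   # freeze a copy of the current soft Q network

    def average_q(self, states, actions):
        with torch.no_grad():
            qs = [q(states, actions) for q in self.snapshots]
        return torch.stack(qs, dim=0).mean(dim=0)             # average soft Q over the K snapshots
```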
by comparing the game result performance of the Avaged Soft Actor-Critic algorithm and the Soft Actor-Critic algorithm in the InvertedDoublePendulum game in FIG. 2, it can be seen that the Avaged Soft Actor-Critic algorithm provided by the invention has excellent performance effect in reducing the over-high estimation of the Soft Q value in the Q learning process. Therefore, the Averaged Soft Actor-Critic algorithm provided by the invention can effectively reduce the problem caused by over-high estimation of the Soft Q value in the Q learning process.
S5, finishing strategy improvement to carry out learning optimization: the strategy is updated by minimizing the KL difference from the boltzmann strategy.
The principle of the strategy improvement step of the Averaged Soft Actor-Critic algorithm is to update the learning parameters of the agent in the current environment as far as possible in the direction that maximizes the return the agent can obtain. The Averaged Soft Actor-Critic algorithm updates the policy towards the Boltzmann distribution defined by the temperature parameter, using the soft Q function at that temperature as the energy. Thus, the Averaged Soft Actor-Critic algorithm updates the policy by minimizing its Kullback-Leibler (KL) divergence from the Boltzmann policy. With this setting, the strategy improvement of the Averaged Soft Actor-Critic algorithm is as shown in formula (6):
J_\pi(\phi) = \mathbb{E}_{s_t \sim D}\left[ D_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s_t) \;\Big\|\; \frac{\exp\left( Q_\theta(s_t, \cdot) \right)}{Z_\theta(s_t)} \right) \right]    (6)
where D represents the state distribution of previous samples, which can be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; θ is the soft Q network parameter used to estimate the action value of the agent in this state, and φ is the policy network parameter of the Averaged Soft Actor-Critic algorithm.
Because a conventional deep reinforcement learning algorithm takes an expectation over the probability distribution of the actions output by the strategy, the sampling step cannot be differentiated directly and erroneous data is produced during backpropagation. The Averaged Soft Actor-Critic algorithm solves this problem with a re-parameterization technique: it combines the corresponding neural network output with an input noise vector sampled from a spherical Gaussian distribution, establishing a channel for correctly computing the back-propagated error information. The output of the policy network of the Averaged Soft Actor-Critic algorithm can then be used to form the stochastic action distribution in that state, without directly using the raw output data of the policy network. The form of the strategy re-parameterized by the Averaged Soft Actor-Critic algorithm through the re-parameterization technique is shown in formula (7):
a_t = f_\phi(\epsilon_t; s_t)    (7)
where \epsilon_t \sim \mathcal{N}(0, I). After this formula is substituted into the original algorithm, the policy objective of the Averaged Soft Actor-Critic algorithm becomes formula (8):
J_\pi(\phi) = \mathbb{E}_{s_t \sim D,\, \epsilon_t \sim \mathcal{N}}\left[ \log \pi_\phi\!\left( f_\phi(\epsilon_t; s_t) \mid s_t \right) - Q_\theta\!\left( s_t, f_\phi(\epsilon_t; s_t) \right) \right]    (8)
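The re-parameterized objective of formula (8) can be sketched as below; the Gaussian policy head with a tanh squashing correction is a common SAC-style choice assumed here for illustration, not a detail mandated by the text:

```python
import torch
from torch.distributions import Normal

def policy_loss(policy_net, q_net, states):
    """J_pi(phi): reparameterized policy objective of formula (8)."""
    mean, log_std = policy_net(states)                   # Gaussian parameters output by the policy network
    eps = torch.randn_like(mean)                         # epsilon_t ~ N(0, I)
    pre_tanh = mean + log_std.exp() * eps                # f_phi(epsilon_t; s_t) before squashing
    actions = torch.tanh(pre_tanh)
    log_prob = Normal(mean, log_std.exp()).log_prob(pre_tanh).sum(-1)
    log_prob = log_prob - torch.log(1.0 - actions.pow(2) + 1e-6).sum(-1)   # tanh change-of-variables correction
    return (log_prob - q_net(states, actions)).mean()
```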
the invention also provides an average SAC deep reinforcement learning system based on Q learning, which specifically comprises
The initialization module is used for randomly initializing the soft Q values corresponding to all states and actions, and for having the agent interact with the environment to collect data information of the current environment, initialize the first state of the current state sequence, and obtain the feature vector of the first state;
the training module is used for training the parameters of the soft Q network according to the collected characteristic vector data information, so that the trained information is used for estimating a soft Q value, and the parameters of the soft Q network are fixed after the soft Q network is trained; using the feature vector information as input in an algorithm network, outputting a corresponding selection action, and obtaining a new state and instant reward feedback based on the action;
the Critic network parameter updating module is used for obtaining soft Q value output and a corresponding state value function by using the characteristic vector data information as input in a Critic network; the Critic network part calculates TD error through the calculated state value functions and corresponding return; then, using a mean square error loss function to update the gradient of the Critic network parameters;
and the Actor network parameter updating module selects Softmax or a Gaussian score function and updates the Softmax or the Gaussian score function in combination with the TD error value of the Critic network.
The specific procedure is shown in the following table.
TABLE 1 Overall Process of the invention
(Table 1 is presented as an image in the original publication and is not reproduced here.)
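Since Table 1 is only available as an image, the overall procedure described in the text can be summarized by the following sketch; the environment interface, replay buffer, and optimizer wiring are illustrative assumptions that tie together the loss sketches given above:

```python
def train_averaged_sac(env, value_net, target_value_net, q_net, policy_net,
                       optimizers, buffer, k=10, total_steps=1_000_000, gamma=0.99):
    """High-level Averaged-SAC loop: interact with the environment, then update the
    value network, the soft Q network, and the policy network in turn."""
    def step_opt(opt, loss):
        opt.zero_grad(); loss.backward(); opt.step()

    avg_q = AveragedSoftQ(k)                  # keep the K most recent soft Q estimates (formula (5))
    state = env.reset()
    for _ in range(total_steps):
        action = policy_net.act(state)        # sample an action from the current strategy
        next_state, reward, done, _ = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        batch = buffer.sample()
        states = batch[0]
        # Critic side: value and soft Q updates (formulas (3) and (4)), plus averaged soft Q bookkeeping.
        step_opt(optimizers["value"], value_loss(value_net, q_net, policy_net, states))
        step_opt(optimizers["q"], soft_q_loss(q_net, target_value_net, batch, gamma))
        avg_q.record(q_net)
        # Actor side: policy improvement with the reparameterized objective (formula (8)).
        step_opt(optimizers["policy"], policy_loss(policy_net, q_net, states))
        polyak_update(target_value_net, value_net)
```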
Verification results:
in the experiment of the present invention, 6 different games of MuJoCo by OpenAI Gym were used as experimental environments. In most MuJoCo games, the Soft Actor-Critic algorithm is obviously superior to other depth reinforcement learning algorithms, so in the Averaged Soft Actor-Critic algorithm experiment, 6 MuJoCo games are selected to test the performance comparison of the Soft Actor-Critic algorithm and the Averaged Soft Actor-Critic algorithm.
In order to compare the performance of the Averaged Soft Actor-Critic algorithm and the Soft Actor-Critic algorithm more intuitively, their average training scores are listed in Table 2. It can be seen that the overall performance of the Averaged Soft Actor-Critic algorithm is better than that of the Soft Actor-Critic algorithm. It can also be seen that, as the K value increases, the average training score of the Averaged Soft Actor-Critic algorithm increases; however, if the K value is increased beyond a certain range, the average score of the Averaged Soft Actor-Critic algorithm tends to decrease in certain game environments. This is mainly because the Averaged Soft Actor-Critic algorithm needs to evaluate K soft Q values to obtain the average soft Q value during training, which requires more training time and computation. Given the performance improvement of the Averaged Soft Actor-Critic algorithm, this cost is acceptable.
TABLE 2 training scores for Averaged-SAC and SAC
(Table 2 is presented as an image in the original publication and is not reproduced here.)
In summary, after 6 comparison experiments in the MuJoCo environment, the Averaged Soft Actor-Critic algorithm does perform better than the Soft Actor-Critic algorithm. In addition, appropriately increasing the K value in the Averaged Soft Actor-Critic algorithm results in better performance and stability, while increasing the agent training time cost within an acceptable range. Therefore, in the Averaged Soft Actor-Critic algorithm, an appropriate K value should be selected for training and learning according to the available computing resources.
The experimental results show that, within a certain range, the agent can obtain excellent performance by increasing the K value; however, if the K value exceeds this range, the learning performance of the agent tends to decline. Repeated experiments with different K values verify that the learning performance of the agent is best when the K value is 10. In addition, the Averaged Soft Actor-Critic algorithm can be easily integrated with the Soft Actor-Critic algorithm.
In light of the foregoing description of the preferred embodiments of the present invention, those skilled in the art can now make various alterations and modifications without departing from the scope of the invention. The technical scope of the present invention is not limited to the contents of the specification, and must be determined according to the scope of the claims.

Claims (8)

1. An average SAC deep reinforcement learning method based on Q learning is characterized by comprising the following steps:
s1, finishing strategy evaluation to perform soft strategy iteration: calculating the state value of the intelligent agent in the environment interaction process through strategy evaluation, approximating the state value function of the intelligent agent to a soft Q value, and estimating the soft Q value from a single operation sample of the intelligent agent strategy in the current environment;
s2, calculating a state value function of the value network: obtaining a state value function of the value network according to a calculation formula of the soft Q value, and training the state value function through a Q learning process;
s3, training a soft Q function by using Q learning of the soft Q network: obtaining the state value according to the step S2, training a soft Q function through a Q learning process of the soft Q network to complete the interaction between the intelligent agent and the environment, and estimating errors generated in the Q learning process by using the soft Q value of the target value function and a Monte Carlo algorithm;
s4, calculating an average soft Q value: selecting the previous K learned soft Q values to calculate an average soft Q value in the interaction process between the current image game agent and the environment;
s5, finishing strategy improvement to carry out learning optimization: the strategy is updated by minimizing the KL difference from the boltzmann strategy.
2. The method as claimed in claim 1, wherein the step S1 specifically includes the following steps:
s110, evaluating and calculating the state value of the intelligent agent in the environment interaction process through a strategy, wherein a soft state value function is defined as
V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \right]
where V represents the state value function of the current agent in the current round, s_t ∈ S is the state of the agent at time step t, a_t is the action executed by the agent in the current state, and π is the strategy adopted by the agent in the current environment;
s120, defining a soft Q function as
Q(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V(s_{t+1}) \right]
S130, approximating the state value function of the agent to a soft Q value, and estimating the soft Q value from a single operation sample of the agent strategy under the current environment.
3. The method as claimed in claim 1, wherein the step S2 specifically includes the following steps:
s210, obtaining a state value function of the value network according to a calculation formula of the soft Q value;
s220, training the state value function through the Q learning process to reduce the error of the sample data, wherein the calculation formula is as follows
J_V(\psi) = \mathbb{E}_{s_t \sim D}\left[ \tfrac{1}{2}\left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \right] \right)^2 \right]
where D represents the state distribution of previous samples, which can be regarded as the sample distribution of the experience replay buffer; the gradient of the above formula is estimated with an unbiased estimator in the Averaged Soft Actor-Critic algorithm.
4. The method as claimed in claim 1, wherein the step S3 specifically includes the following steps:
s3 obtaining the state value according to the step S2, training the soft Q function through the Q learning process of the soft Q network to complete the interaction between the intelligent agent and the environment, estimating the error generated in the Q learning process by using the soft Q value of the target value function and the Monte Carlo algorithm, wherein the calculation formula of the Q learning process for training the soft Q function is as follows
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \tfrac{1}{2}\left( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right], \quad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\psi}}(s_{t+1}) \right]
where D represents the state distribution of previous samples during the interaction between the agent and the environment, which can also be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; p: S × A → S is the state transition probability distribution of the current agent's environment;
V_{\bar{\psi}}
is the target value network of the Averaged Soft Actor-Critic algorithm; θ is the soft Q network parameter used to estimate the action value of the agent operating in the current environment.
5. The method as claimed in claim 1, wherein the step S4 specifically includes the following steps:
S4, when the interaction process between the agent of the current image game and the environment runs to a certain step, selecting the previous K learned soft Q values to calculate the average soft Q value, where the calculation formula of the average soft Q value is as follows
\bar{Q}(s_t, a_t) = \frac{1}{K} \sum_{i=1}^{K} Q_{\theta_{t-i}}(s_t, a_t)
6. The method as claimed in claim 1, wherein the step S5 specifically includes the following steps:
s5, updating the strategy by minimizing the KL difference with the Boltzmann strategy, wherein the strategy improvement formula is as follows
J_\pi(\phi) = \mathbb{E}_{s_t \sim D}\left[ D_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s_t) \;\Big\|\; \frac{\exp\left( Q_\theta(s_t, \cdot) \right)}{Z_\theta(s_t)} \right) \right]
where D represents the state distribution of previous samples, i.e., the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; θ is the soft Q network parameter used to estimate the action value of the agent in this state, and φ is the policy network parameter of the Averaged Soft Actor-Critic algorithm.
7. The average SAC deep reinforcement learning method based on Q learning as claimed in claim 6, wherein the strategy is re-parameterized by a re-parameterization technique, and the calculation formula is as follows:
a_t = f_\phi(\epsilon_t; s_t), where \epsilon_t \sim \mathcal{N}(0, I);
The strategy improvement formula is as follows:
J_\pi(\phi) = \mathbb{E}_{s_t \sim D,\, \epsilon_t \sim \mathcal{N}}\left[ \log \pi_\phi\!\left( f_\phi(\epsilon_t; s_t) \mid s_t \right) - Q_\theta\!\left( s_t, f_\phi(\epsilon_t; s_t) \right) \right]
8. an average SAC deep reinforcement learning system based on Q learning is characterized by specifically comprising
The initialization module is used for randomly initializing the soft Q values corresponding to all states and actions, and for having the agent interact with the environment to collect data information of the current environment, initialize the first state of the current state sequence, and obtain the feature vector of the first state;
the training module is used for training the parameters of the soft Q network according to the collected characteristic vector data information, so that the trained information is used for estimating a soft Q value, and the parameters of the soft Q network are fixed after the soft Q network is trained; using the feature vector information as input in an algorithm network, outputting a corresponding selection action, and obtaining a new state and instant reward feedback based on the action;
the Critic network parameter updating module is used for obtaining soft Q value output and a corresponding state value function by using the characteristic vector data information as input in a Critic network; the Critic network part calculates TD error through the calculated state value functions and corresponding return; then, using a mean square error loss function to update the gradient of the Critic network parameters;
and the Actor network parameter updating module selects Softmax or a Gaussian score function and updates the Softmax or the Gaussian score function in combination with the TD error value of the Critic network.
CN202210683336.7A 2021-09-04 2022-06-16 Average SAC deep reinforcement learning method and system based on Q learning Pending CN114881228A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021110347969 2021-09-04
CN202111034796 2021-09-04

Publications (1)

Publication Number Publication Date
CN114881228A true CN114881228A (en) 2022-08-09

Family

ID=82680896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210683336.7A Pending CN114881228A (en) 2021-09-04 2022-06-16 Average SAC deep reinforcement learning method and system based on Q learning

Country Status (1)

Country Link
CN (1) CN114881228A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439479A (en) * 2022-11-09 2022-12-06 北京航空航天大学 Academic image multiplexing detection method based on reinforcement learning
CN115691110A (en) * 2022-09-20 2023-02-03 东南大学 Intersection signal period stable timing method based on reinforcement learning and oriented to dynamic traffic flow
CN115841163A (en) * 2023-02-20 2023-03-24 浙江吉利控股集团有限公司 Training method and device for model predictive control MPC and electronic equipment
CN116596060A (en) * 2023-07-19 2023-08-15 深圳须弥云图空间科技有限公司 Deep reinforcement learning model training method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111766782A (en) * 2020-06-28 2020-10-13 浙江大学 Strategy selection method based on Actor-Critic framework in deep reinforcement learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111766782A (en) * 2020-06-28 2020-10-13 浙江大学 Strategy selection method based on Actor-Critic framework in deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FENG DING ET AL.: "Averaged Soft Actor-Critic for Deep Reinforcement Learning", COMPLEXITY 2021, 1 April 2021 (2021-04-01), pages 1 - 16 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115691110A (en) * 2022-09-20 2023-02-03 东南大学 Intersection signal period stable timing method based on reinforcement learning and oriented to dynamic traffic flow
CN115691110B (en) * 2022-09-20 2023-08-25 东南大学 Intersection signal period stable timing method based on reinforcement learning and oriented to dynamic traffic flow
CN115439479A (en) * 2022-11-09 2022-12-06 北京航空航天大学 Academic image multiplexing detection method based on reinforcement learning
CN115439479B (en) * 2022-11-09 2023-02-03 北京航空航天大学 Academic image multiplexing detection method based on reinforcement learning
CN115841163A (en) * 2023-02-20 2023-03-24 浙江吉利控股集团有限公司 Training method and device for model predictive control MPC and electronic equipment
CN116596060A (en) * 2023-07-19 2023-08-15 深圳须弥云图空间科技有限公司 Deep reinforcement learning model training method and device, electronic equipment and storage medium
CN116596060B (en) * 2023-07-19 2024-03-15 深圳须弥云图空间科技有限公司 Deep reinforcement learning model training method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114881228A (en) Average SAC deep reinforcement learning method and system based on Q learning
Kurenkov et al. Ac-teach: A bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers
WO2021159779A1 (en) Information processing method and apparatus, computer-readable storage medium and electronic device
Wulfmeier et al. Data-efficient hindsight off-policy option learning
CN109978012A (en) It is a kind of based on combine the improvement Bayes of feedback against intensified learning method
CN110766044A (en) Neural network training method based on Gaussian process prior guidance
CN113392396B (en) Strategy protection defense method for deep reinforcement learning
Huang et al. Robust reinforcement learning as a stackelberg game via adaptively-regularized adversarial training
Zhu et al. An overview of the action space for deep reinforcement learning
CN114065929A (en) Training method and device for deep reinforcement learning model and storage medium
CN113420326A (en) Deep reinforcement learning-oriented model privacy protection method and system
CN114261400A (en) Automatic driving decision-making method, device, equipment and storage medium
CN111282272B (en) Information processing method, computer readable medium and electronic device
Ding et al. Averaged soft actor-critic for deep reinforcement learning
CN113947022B (en) Near-end strategy optimization method based on model
CN113239472B (en) Missile guidance method and device based on reinforcement learning
CN114037048A (en) Belief consistency multi-agent reinforcement learning method based on variational cycle network model
Li et al. Domain adaptive state representation alignment for reinforcement learning
CN114942799B (en) Workflow scheduling method based on reinforcement learning in cloud edge environment
CN113379027A (en) Method, system, storage medium and application for generating confrontation interactive simulation learning
Jang et al. AVAST: Attentive variational state tracker in a reinforced navigator
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
US20200364555A1 (en) Machine learning system
CN113240118B (en) Dominance estimation method, dominance estimation device, electronic device, and storage medium
Liu et al. How to guide your learner: Imitation learning with active adaptive expert involvement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination