CN114881228A - Average SAC deep reinforcement learning method and system based on Q learning - Google Patents


Info

Publication number
CN114881228A
CN114881228A (application CN202210683336.7A)
Authority
CN
China
Prior art keywords: soft, value, state, learning, function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210683336.7A
Other languages
Chinese (zh)
Inventor
陈志奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Juzhi Information Technology Co ltd
Original Assignee
Dalian Juzhi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Juzhi Information Technology Co ltd filed Critical Dalian Juzhi Information Technology Co ltd
Publication of CN114881228A publication Critical patent/CN114881228A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks

Abstract

The invention provides an average SAC (Soft Actor-Critic) deep reinforcement learning method and system based on Q learning, belonging to the technical field of deep reinforcement learning: 1) randomly initializing all parameter information; 2) updating Critic network parameters and completing strategy evaluation to perform soft strategy iteration, calculating the state value function of the value network, training a soft Q function through Q learning of the soft Q network, and then selecting the previous K previously learned soft Q values to calculate an average soft Q value; 3) updating Actor network parameters and completing strategy improvement to perform learning optimization; 4) completing one round of agent training and learning, and iterating the updates until a termination condition is met, thereby completing the deep reinforcement learning. The invention designs an Averaged Soft Actor-Critic method based on Q learning for image games, comprehensively considers the causes of poor stability in deep reinforcement learning, and can effectively improve the performance of the deep reinforcement learning algorithm Soft Actor-Critic by using the previous K soft Q values to reduce overestimation errors.

Description

Average SAC deep reinforcement learning method and system based on Q learning
Technical Field
The invention relates to the technical field of deep reinforcement learning, in particular to an average SAC deep reinforcement learning method and system based on Q learning.
Background
With the rapid development of the internet, the era of artificial intelligence has arrived. Artificial intelligence studies methods for simulating human thinking, and producing agents with fully autonomous learning ability is its main task. These agents need to interact with their current environment at all times to exchange and deliver information. Their ultimate task is to continuously train and learn from the current environment information so as to select the optimal action and thereby achieve the optimal strategy in the current environment. Artificial intelligence systems under this definition include robots that can interact with the surrounding environment, as well as purely software-based agents that can interact with multimedia devices (e.g., computers and mobile phones) and natural language. Deep reinforcement learning is particularly well suited to this interactive setting, since its underlying principle is the autonomous learning and training ability of an agent. For problems commonly encountered in machine learning, such as high computational complexity, huge memory occupation, and large sample complexity, deep reinforcement learning can solve or alleviate them well by exploiting the characteristics of neural networks in deep learning. However, most current knowledge of deep reinforcement learning is still at the stage of theoretical exploration, and its application in real-world practice is not yet widespread. Researchers have therefore recently spent much effort on applying such algorithm models to the daily life of ordinary people, so that they can provide convenience in everyday life. In practical applications, however, the security of the smart device hosting the algorithm is a problem that must be considered. Since deep reinforcement learning tends to explore unfamiliar and dangerous states, the strategy learned by a deep reinforcement learning agent is easily affected by the safe-exploration problem; if the agent frequently explores unsafe states (e.g., a mobile robot driving a car into an unsafe state), this is an extremely dangerous learning behavior. Therefore, for the security problem of deep reinforcement learning applied to actual smart devices, the algorithm must ensure that the agent in the smart device provides a strategy that is both efficient and extremely safe. Even if this strategy is not the optimal one, the algorithm must guarantee its safety during exploration. For example, in an intelligent unmanned vehicle, it would be completely impractical to select only the strategy defined as optimal for the intelligent device, such as traveling the optimal or shortest route to the destination, without regard to the safety of that strategy.
However, according to research and investigation, many intelligent devices that currently apply deep reinforcement learning algorithms do not effectively consider the security of the agent's strategy exploration, let alone use it as a performance index of the intelligent device. Such a setting is extremely unreasonable. The unsafe exploration behavior of the agent is mainly caused by overestimation in the Q learning process of the deep reinforcement learning algorithm.
In summary, in order to improve safety and controllability in practical applications, it is necessary to mitigate the overestimation produced in the Q learning process, thereby reducing the propagated error during algorithm training and improving algorithm stability. However, the stability of deep reinforcement learning algorithms remains a serious challenge that limits their further optimization and development. Although computer performance has improved, the stability of deep reinforcement learning algorithms has not improved accordingly. Therefore, how to improve the stability of the Q learning process and the security of deep reinforcement learning algorithms is one of the most challenging problems in deep reinforcement learning.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an Averaged Soft Actor-Critic algorithm which uses K previously learned state values to calculate the soft Q value, with the aim of reducing the overestimation of the soft Q value during calculation. The invention innovatively provides an Averaged-SAC algorithm model to improve the performance of the SAC (Soft Actor-Critic) algorithm and reduce the large errors caused by overestimation.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
an average SAC deep reinforcement learning method based on Q learning specifically comprises the following steps:
s1, finishing strategy evaluation to perform soft strategy iteration: calculating the state value of the intelligent agent in the environment interaction process through strategy evaluation, approximating the state value function of the intelligent agent to a soft Q value, and estimating the soft Q value from a single operation sample of the intelligent agent strategy in the current environment;
s2, calculating a state value function of the value network: obtaining a state value function of the value network according to a calculation formula of the soft Q value, and training the state value function through a Q learning process;
s3, training a soft Q function by using Q learning of the soft Q network: obtaining the state value according to the step S2, training a soft Q function through a Q learning process of the soft Q network to complete the interaction between the intelligent agent and the environment, and estimating errors generated in the Q learning process by using the soft Q value of the target value function and a Monte Carlo algorithm;
s4, calculating an average soft Q value: selecting the previous K learned soft Q values to calculate an average soft Q value in the interaction process between the current image game agent and the environment;
s5, finishing strategy improvement to carry out learning optimization: the strategy is updated by minimizing the KL difference from the boltzmann strategy.
Preferably, step S1 specifically includes the following steps:
s110, evaluating and calculating the state value of the intelligent agent in the environment interaction process through a strategy, wherein a soft state value function is defined as
V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \right]
where V represents the state value function of the current agent in the current round, s_t ∈ S is the state of the agent at time step t, a_t is the action executed by the agent in the current state, and π is the strategy adopted by the agent in the current environment;
s120, defining a soft Q function as
Q(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V(s_{t+1}) \right]
S130, approximating the state value function of the agent to a soft Q value, and estimating the soft Q value from a single operation sample of the agent strategy under the current environment.
Preferably, step S2 specifically includes the following steps:
s210, obtaining a state value function of the value network according to a calculation formula of the soft Q value;
s220, training the state value function through the Q learning process to reduce the error of the sample data, wherein the calculation formula is as follows
J_V(\psi) = \mathbb{E}_{s_t \sim D}\left[ \tfrac{1}{2}\left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \right] \right)^2 \right]
where D represents the state distribution of previous samples, which can be regarded as the sample distribution of the experience replay buffer; the gradient of the above formula is estimated with an unbiased estimator in the Averaged Soft Actor-Critic algorithm.
Preferably, step S3 specifically includes the following steps:
s3 obtaining the state value according to the step S2, training the soft Q function through the Q learning process of the soft Q network to complete the interaction between the intelligent agent and the environment, estimating the error generated in the Q learning process by using the soft Q value of the target value function and the Monte Carlo algorithm, wherein the calculation formula of the Q learning process for training the soft Q function is as follows
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \tfrac{1}{2}\left( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right], \quad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\psi}}(s_{t+1}) \right]
where D represents the state distribution of previous samples during the interaction between the agent and the environment, which can also be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; p: S × A → S is the state transition probability distribution of the current agent's environment;
V_{\bar{\psi}}
is the target value network of the Averaged Soft Actor-Critic algorithm; θ is the soft Q network parameter used to estimate the action value of the agent operating in the current environment.
Preferably, step S4 specifically includes the following steps:
S4, when the interaction process between the agent of the current image game and the environment runs to a certain step, selecting the previous K learned soft Q values to calculate the average soft Q value, where the calculation formula of the average soft Q value is as follows
\bar{Q}(s_t, a_t) = \frac{1}{K} \sum_{i=1}^{K} Q_{\theta_{t-i}}(s_t, a_t)
Preferably, step S5 specifically includes the following steps:
s5, updating the strategy by minimizing the KL difference with the Boltzmann strategy, wherein the strategy improvement formula is as follows
J_\pi(\phi) = \mathbb{E}_{s_t \sim D}\left[ D_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s_t) \;\Big\|\; \frac{\exp\left( Q_\theta(s_t, \cdot) \right)}{Z_\theta(s_t)} \right) \right]
where D represents the state distribution of previous samples, i.e., the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; θ is the soft Q network parameter used to estimate the action value of the agent in this state, and φ is the policy network parameter of the Averaged Soft Actor-Critic algorithm.
Preferably, the policy is re-parameterized by a re-parameterization technique, and the calculation formula is as follows:
a_t = f_\phi(\epsilon_t; s_t), where \epsilon_t \sim \mathcal{N}(0, I);
The strategy improvement formula is as follows:
J_\pi(\phi) = \mathbb{E}_{s_t \sim D,\, \epsilon_t \sim \mathcal{N}}\left[ \log \pi_\phi\!\left( f_\phi(\epsilon_t; s_t) \mid s_t \right) - Q_\theta\!\left( s_t, f_\phi(\epsilon_t; s_t) \right) \right]
an average SAC deep reinforcement learning system based on Q learning specifically comprises
The initialization module is used for randomly initializing the soft Q values corresponding to all states and actions, and for having the agent interact with the environment to collect data information of the current environment, initialize the first state of the current state sequence, and obtain the feature vector of the first state;
the training module is used for training the parameters of the soft Q network according to the collected characteristic vector data information, so that the trained information is used for estimating a soft Q value, and the parameters of the soft Q network are fixed after the soft Q network is trained; using the feature vector information as input in an algorithm network, outputting a corresponding selection action, and obtaining a new state and instant reward feedback based on the action;
the Critic network parameter updating module is used for obtaining soft Q value output and a corresponding state value function by using the characteristic vector data information as input in a Critic network; the Critic network part calculates TD error through the calculated state value functions and corresponding return; then, using a mean square error loss function to update the gradient of the Critic network parameters;
and the Actor network parameter updating module selects Softmax or a Gaussian score function and updates the Softmax or the Gaussian score function in combination with the TD error value of the Critic network.
The invention has the beneficial effects that: the invention designs an Averaged Soft Actor-Critic deep reinforcement learning algorithm based on Q learning for image game data; the algorithm uses K previously learned state values to calculate the soft Q value in the Averaged Soft Actor-Critic algorithm, thereby reducing the overestimation of the soft Q value during calculation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for average SAC deep reinforcement learning based on Q learning according to the present invention;
FIG. 2 is a graph of overestimation results for Averaged-SAC and SAC;
FIG. 3 is a framework diagram of the Averaged Soft Actor-Critic deep reinforcement learning algorithm based on Q learning.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. Other embodiments, which can be derived by one of ordinary skill in the art from the embodiments given herein without any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of the average SAC deep reinforcement learning method based on Q learning, and Fig. 3 is a framework diagram of the Averaged Soft Actor-Critic deep reinforcement learning algorithm based on Q learning. Firstly, the agent of the algorithm interacts with the current environment to obtain the state information of the current agent; then the state information of the agent is used as the network-layer input of the Averaged Soft Actor-Critic algorithm, and after it is processed by the convolutional neural network and the Softmax function, the network layer outputs the action currently selected by the agent; the selected action is fed back to the current interactive environment, so that the environment gives the instant reward feedback produced by the current action; the Averaged Soft Actor-Critic algorithm uniformly feeds this data information back to the Critic module, which updates the Critic parameters and calculates the TD error; the Actor module receives the environment information and the TD error transmitted by the Critic module and updates the Actor parameters accordingly; finally, by continuously repeating these steps, the agent is trained to learn an optimal strategy.
An average SAC deep reinforcement learning method based on Q learning specifically comprises the following steps:
s1, finishing strategy evaluation to perform soft strategy iteration: calculating the state value of the intelligent agent in the environment interaction process through strategy evaluation, approximating the state value function of the intelligent agent to a soft Q value, and estimating the soft Q value from a single operation sample of the intelligent agent strategy in the current environment;
the strategy evaluation is the first step of Soft strategy iteration performed by the Averaged Soft Actor-Critic algorithm, and the value of the intelligent agent in the environment interaction process needs to be calculated. To achieve this goal, the Averaged Soft Actor-Critic algorithm defines the Soft state value function as shown in equation (1):
V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \right]    (1)
where V represents the state value function of the current agent in the current round, s_t ∈ S is the state of the agent at time step t, a_t is the action executed by the agent in the current state, and π is the strategy adopted by the agent in the current environment.
The soft Q function of the Averaged Soft Actor-Critic algorithm is defined as shown in formula (2):
Q(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V(s_{t+1}) \right]    (2)
as can be seen in equation (2), the agent may obtain the next state based on the current state and the action taken.
In the Averaged Soft Actor-Critic algorithm, the state value function approximation can be regarded as a Soft Q value, and the algorithm does not need to design a separate function approximator for the state value in principle, because the relationship between the state value function and the Soft Q function is related to the policy function of the formula (1). The algorithm can estimate the soft Q value from a single operation sample of the agent policy in the current environment without introducing an additional bias. The soft Q value of the deep reinforcement learning algorithm is calculated through a single function approximator, so that the deep reinforcement learning algorithm can be stably trained and is convenient to train with other networks at the same time.
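The relationship between the state value and the soft Q value can be illustrated with a minimal sketch; the code below is a PyTorch-style illustration under assumed network interfaces (q_net(states, actions) returning soft Q values and policy.sample(states) returning sampled actions with their log-probabilities), not the patented implementation itself:

```python
import torch

def soft_state_value(q_net, policy, states):
    """Single-sample estimate of V(s) = E_a[Q(s, a) - log pi(a|s)], cf. formula (1)."""
    with torch.no_grad():
        actions, log_probs = policy.sample(states)   # a ~ pi(.|s), log pi(a|s)
        q_values = q_net(states, actions)            # soft Q(s, a)
    return q_values - log_probs                      # one action sample per state, no separate approximator needed
```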
S2, calculating a state value function of the value network: obtaining a state value function of the value network according to a calculation formula of the soft Q value, and training the state value function through a Q learning process;
after the calculation formula of the soft Q function is determined, the state value function of the value network can be obtained. The Averaged Soft Actor-criticic algorithm mainly trains a state value function through a Q learning process to reduce the error of sample data to the maximum extent, and the calculation mode is as shown in formula (3):
J_V(\psi) = \mathbb{E}_{s_t \sim D}\left[ \tfrac{1}{2}\left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \right] \right)^2 \right]    (3)
where D represents the state distribution of previous samples, which can be regarded as the sample distribution of the experience replay buffer. The gradient of equation (3) can be estimated with an unbiased estimator in the Averaged Soft Actor-Critic algorithm.
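For illustration, formula (3) can be implemented as a squared-error loss; the following sketch assumes the same hypothetical network interfaces as the previous snippet and is not the exact patented procedure:

```python
import torch
import torch.nn.functional as F

def value_loss(value_net, q_net, policy, states):
    """J_V(psi): fit V_psi(s) to a single-sample estimate of E_a[Q(s, a) - log pi(a|s)] (formula (3))."""
    with torch.no_grad():
        actions, log_probs = policy.sample(states)
        target_v = q_net(states, actions) - log_probs   # target for the value network
    v = value_net(states)
    return F.mse_loss(v, target_v)                       # the 1/2 factor is absorbed into the learning rate
```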
When the Averaged Soft Actor-Critic algorithm is applied to a continuous state space, the neural network holding the current Averaged Soft Actor-Critic parameters needs to be updated.
S3, training a soft Q function by using Q learning of the soft Q network: obtaining the state value according to the step S2, training a soft Q function through a Q learning process of the soft Q network to complete the interaction between the intelligent agent and the environment, and estimating errors generated in the Q learning process by using the soft Q value of the target value function and a Monte Carlo algorithm;
after the state values are obtained in the above steps, a Q learning process through a soft Q network is needed to train a soft Q function, so that interaction between the intelligent agent and the environment is completed. The Averaged Soft Actor-Critic algorithm trains a Soft Q function through Q learning, so that the residual error of the Beckmann equation is minimized, and the calculation mode is shown as formula (4):
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \tfrac{1}{2}\left( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right], \quad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\psi}}(s_{t+1}) \right]    (4)
where D represents the state distribution of previous samples during the interaction between the agent and the environment, which can also be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; p: S × A → S is the state transition probability distribution of the current agent's environment;
V_{\bar{\psi}}
is the target value network of the Averaged Soft Actor-Critic algorithm; its network structure is the same as that of the value network parameterized by ψ, but its parameters are updated at a different (slower) rate; θ is the soft Q network parameter used to estimate the action value of the agent operating in the current environment. The Averaged Soft Actor-Critic algorithm uses the soft Q value derived from the target value network together with a Monte Carlo estimate to evaluate the error generated during Q learning.
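As a sketch of formula (4), the soft Q network can be regressed onto a bootstrapped target computed from the target value network; the batch layout and the Polyak-averaging update rate below are common implementation assumptions rather than details fixed by the text:

```python
import torch
import torch.nn.functional as F

def soft_q_loss(q_net, target_value_net, batch, gamma=0.99):
    """J_Q(theta): regress Q_theta(s, a) onto r + gamma * V_target(s') (formula (4))."""
    states, actions, rewards, next_states, dones = batch
    with torch.no_grad():
        target_q = rewards + gamma * (1.0 - dones) * target_value_net(next_states)
    return F.mse_loss(q_net(states, actions), target_q)

def polyak_update(target_net, source_net, tau=0.005):
    """Slow update of the target value network, so that it lags behind the value network."""
    for tp, sp in zip(target_net.parameters(), source_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)
```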
S4, calculating an average soft Q value: selecting the previous K learned soft Q values to calculate an average soft Q value in the interaction process between the current image game agent and the environment;
during the interaction between the current image game agent and the environment, the average Soft Actor-Critic algorithm selects the first K previously learned Soft Q values to calculate the average Soft Q value when the current operation reaches a certain step. Therefore, the average Soft Q value calculation method of the Averaged Soft Actor-critical algorithm is shown in formula (5):
\bar{Q}(s_t, a_t) = \frac{1}{K} \sum_{i=1}^{K} Q_{\theta_{t-i}}(s_t, a_t)    (5)
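A minimal sketch of this averaging step is shown below; keeping the K most recent soft Q snapshots in a deque is an illustrative implementation choice, and where the averaged value enters the target computation is left to the implementation:

```python
import copy
from collections import deque

import torch

class AveragedSoftQ:
    """Maintain the K most recently learned soft Q estimates and average them (formula (5))."""
    def __init__(self, k):
        self.snapshots = deque(maxlen=k)

    def record(self, q_net):
        self.snapshots.append(copy.deepcopy(q_net).eval())   # freeze a copy of the current soft Q network

    def average_q(self, states, actions):
        with torch.no_grad():
            qs = [q(states, actions) for q in self.snapshots]
        return torch.stack(qs, dim=0).mean(dim=0)             # average soft Q over the K snapshots
```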
by comparing the game result performance of the Avaged Soft Actor-Critic algorithm and the Soft Actor-Critic algorithm in the InvertedDoublePendulum game in FIG. 2, it can be seen that the Avaged Soft Actor-Critic algorithm provided by the invention has excellent performance effect in reducing the over-high estimation of the Soft Q value in the Q learning process. Therefore, the Averaged Soft Actor-Critic algorithm provided by the invention can effectively reduce the problem caused by over-high estimation of the Soft Q value in the Q learning process.
S5, finishing strategy improvement to carry out learning optimization: the strategy is updated by minimizing the KL difference from the boltzmann strategy.
The principle of the strategy improvement step of the Averaged Soft Actor-Critic algorithm is to update the learning parameters of the agent in the current environment as far as possible in the direction that maximizes the return the agent can obtain. The Averaged Soft Actor-Critic algorithm updates the policy towards the Boltzmann distribution defined by the temperature parameter, using the soft Q function at that temperature as the energy. Thus, the Averaged Soft Actor-Critic algorithm updates the policy by minimizing its Kullback-Leibler (KL) divergence from the Boltzmann policy. With this setting, the strategy improvement of the Averaged Soft Actor-Critic algorithm is as shown in formula (6):
J_\pi(\phi) = \mathbb{E}_{s_t \sim D}\left[ D_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s_t) \;\Big\|\; \frac{\exp\left( Q_\theta(s_t, \cdot) \right)}{Z_\theta(s_t)} \right) \right]    (6)
where D represents the state distribution of previous samples, which can be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; θ is the soft Q network parameter used to estimate the action value of the agent in this state, and φ is the policy network parameter of the Averaged Soft Actor-Critic algorithm.
Because a conventional deep reinforcement learning algorithm takes an expectation over the probability distribution of the actions output by the strategy, the sampling step cannot be differentiated directly and erroneous data is produced during backpropagation. The Averaged Soft Actor-Critic algorithm solves this problem with a re-parameterization technique: it combines the corresponding neural network output with an input noise vector sampled from a spherical Gaussian distribution, establishing a channel for correctly computing the back-propagated error information. The output of the policy network of the Averaged Soft Actor-Critic algorithm can then be used to form the stochastic action distribution in that state, without directly using the raw output data of the policy network. The form of the strategy re-parameterized by the Averaged Soft Actor-Critic algorithm through the re-parameterization technique is shown in formula (7):
a_t = f_\phi(\epsilon_t; s_t)    (7)
where \epsilon_t \sim \mathcal{N}(0, I). After this formula is substituted into the original algorithm, the policy objective of the Averaged Soft Actor-Critic algorithm becomes formula (8):
J_\pi(\phi) = \mathbb{E}_{s_t \sim D,\, \epsilon_t \sim \mathcal{N}}\left[ \log \pi_\phi\!\left( f_\phi(\epsilon_t; s_t) \mid s_t \right) - Q_\theta\!\left( s_t, f_\phi(\epsilon_t; s_t) \right) \right]    (8)
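The re-parameterized objective of formula (8) can be sketched as below; the Gaussian policy head with a tanh squashing correction is a common SAC-style choice assumed here for illustration, not a detail mandated by the text:

```python
import torch
from torch.distributions import Normal

def policy_loss(policy_net, q_net, states):
    """J_pi(phi): reparameterized policy objective of formula (8)."""
    mean, log_std = policy_net(states)                   # Gaussian parameters output by the policy network
    eps = torch.randn_like(mean)                         # epsilon_t ~ N(0, I)
    pre_tanh = mean + log_std.exp() * eps                # f_phi(epsilon_t; s_t) before squashing
    actions = torch.tanh(pre_tanh)
    log_prob = Normal(mean, log_std.exp()).log_prob(pre_tanh).sum(-1)
    log_prob = log_prob - torch.log(1.0 - actions.pow(2) + 1e-6).sum(-1)   # tanh change-of-variables correction
    return (log_prob - q_net(states, actions)).mean()
```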
the invention also provides an average SAC deep reinforcement learning system based on Q learning, which specifically comprises
The initialization module is used for randomly initializing the soft Q values corresponding to all states and actions, and for having the agent interact with the environment to collect data information of the current environment, initialize the first state of the current state sequence, and obtain the feature vector of the first state;
the training module is used for training the parameters of the soft Q network according to the collected characteristic vector data information, so that the trained information is used for estimating a soft Q value, and the parameters of the soft Q network are fixed after the soft Q network is trained; using the feature vector information as input in an algorithm network, outputting a corresponding selection action, and obtaining a new state and instant reward feedback based on the action;
the Critic network parameter updating module is used for obtaining soft Q value output and a corresponding state value function by using the characteristic vector data information as input in a Critic network; the Critic network part calculates TD error through the calculated state value functions and corresponding return; then, using a mean square error loss function to update the gradient of the Critic network parameters;
and the Actor network parameter updating module selects Softmax or a Gaussian score function and updates the Softmax or the Gaussian score function in combination with the TD error value of the Critic network.
The specific procedure is shown in the following table.
TABLE 1 Overall Process of the invention
(Table 1 is presented as an image in the original publication and is not reproduced here.)
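Since Table 1 is only available as an image, the overall procedure described in the text can be summarized by the following sketch; the environment interface, replay buffer, and optimizer wiring are illustrative assumptions that tie together the loss sketches given above:

```python
def train_averaged_sac(env, value_net, target_value_net, q_net, policy_net,
                       optimizers, buffer, k=10, total_steps=1_000_000, gamma=0.99):
    """High-level Averaged-SAC loop: interact with the environment, then update the
    value network, the soft Q network, and the policy network in turn."""
    def step_opt(opt, loss):
        opt.zero_grad(); loss.backward(); opt.step()

    avg_q = AveragedSoftQ(k)                  # keep the K most recent soft Q estimates (formula (5))
    state = env.reset()
    for _ in range(total_steps):
        action = policy_net.act(state)        # sample an action from the current strategy
        next_state, reward, done, _ = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        batch = buffer.sample()
        states = batch[0]
        # Critic side: value and soft Q updates (formulas (3) and (4)), plus averaged soft Q bookkeeping.
        step_opt(optimizers["value"], value_loss(value_net, q_net, policy_net, states))
        step_opt(optimizers["q"], soft_q_loss(q_net, target_value_net, batch, gamma))
        avg_q.record(q_net)
        # Actor side: policy improvement with the reparameterized objective (formula (8)).
        step_opt(optimizers["policy"], policy_loss(policy_net, q_net, states))
        polyak_update(target_value_net, value_net)
```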
Verification results:
in the experiment of the present invention, 6 different games of MuJoCo by OpenAI Gym were used as experimental environments. In most MuJoCo games, the Soft Actor-Critic algorithm is obviously superior to other depth reinforcement learning algorithms, so in the Averaged Soft Actor-Critic algorithm experiment, 6 MuJoCo games are selected to test the performance comparison of the Soft Actor-Critic algorithm and the Averaged Soft Actor-Critic algorithm.
In order to compare the performance of the Averaged Soft Actor-Critic algorithm and the Soft Actor-Critic algorithm more intuitively, their average training scores are listed in Table 2. It can be seen that the overall performance of the Averaged Soft Actor-Critic algorithm is better than that of the Soft Actor-Critic algorithm. It can also be seen that, as the K value increases, the average training score of the Averaged Soft Actor-Critic algorithm increases; however, if the K value is increased beyond a certain range, the average score of the Averaged Soft Actor-Critic algorithm tends to decrease in certain game environments. This is mainly because the Averaged Soft Actor-Critic algorithm needs to evaluate K soft Q values to obtain the average soft Q value during training, which requires more training time and computation. Given the performance improvement of the Averaged Soft Actor-Critic algorithm, this cost is acceptable.
TABLE 2 training scores for Averaged-SAC and SAC
(Table 2 is presented as an image in the original publication and is not reproduced here.)
In summary, after 6 comparison experiments in the MuJoCo environment, the Averaged Soft Actor-Critic algorithm does perform better than the Soft Actor-Critic algorithm. In addition, appropriately increasing the K value in the Averaged Soft Actor-Critic algorithm results in better performance and stability, while increasing the agent training time cost within an acceptable range. Therefore, in the Averaged Soft Actor-Critic algorithm, an appropriate K value should be selected for training and learning according to the available computing resources.
The experimental results show that, within a certain range, the agent can obtain excellent performance by increasing the K value; however, if the K value exceeds this range, the learning performance of the agent tends to decline. Repeated experiments with different K values verify that the learning performance of the agent is best when the K value is 10. In addition, the Averaged Soft Actor-Critic algorithm can be easily integrated with the Soft Actor-Critic algorithm.
In light of the foregoing description of the preferred embodiments of the present invention, those skilled in the art can now make various alterations and modifications without departing from the scope of the invention. The technical scope of the present invention is not limited to the contents of the specification, and must be determined according to the scope of the claims.

Claims (8)

1. An average SAC deep reinforcement learning method based on Q learning is characterized by comprising the following steps:
s1, finishing strategy evaluation to perform soft strategy iteration: calculating the state value of the intelligent agent in the environment interaction process through strategy evaluation, approximating the state value function of the intelligent agent to a soft Q value, and estimating the soft Q value from a single operation sample of the intelligent agent strategy in the current environment;
s2, calculating a state value function of the value network: obtaining a state value function of the value network according to a calculation formula of the soft Q value, and training the state value function through a Q learning process;
s3, training a soft Q function by using Q learning of the soft Q network: obtaining the state value according to the step S2, training a soft Q function through a Q learning process of the soft Q network to complete the interaction between the intelligent agent and the environment, and estimating errors generated in the Q learning process by using the soft Q value of the target value function and a Monte Carlo algorithm;
s4, calculating an average soft Q value: selecting the previous K learned soft Q values to calculate an average soft Q value in the interaction process between the current image game agent and the environment;
s5, finishing strategy improvement to carry out learning optimization: the strategy is updated by minimizing the KL difference from the boltzmann strategy.
2. The method as claimed in claim 1, wherein the step S1 specifically includes the following steps:
s110, evaluating and calculating the state value of the intelligent agent in the environment interaction process through a strategy, wherein a soft state value function is defined as
V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \right]
where V represents the state value function of the current agent in the current round, s_t ∈ S is the state of the agent at time step t, a_t is the action executed by the agent in the current state, and π is the strategy adopted by the agent in the current environment;
s120, defining a soft Q function as
Q(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V(s_{t+1}) \right]
S130, approximating the state value function of the agent to a soft Q value, and estimating the soft Q value from a single operation sample of the agent strategy under the current environment.
3. The method as claimed in claim 1, wherein the step S2 specifically includes the following steps:
s210, obtaining a state value function of the value network according to a calculation formula of the soft Q value;
s220, training the state value function through the Q learning process to reduce the error of the sample data, wherein the calculation formula is as follows
J_V(\psi) = \mathbb{E}_{s_t \sim D}\left[ \tfrac{1}{2}\left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \right] \right)^2 \right]
where D represents the state distribution of previous samples, which can be regarded as the sample distribution of the experience replay buffer; the gradient of the above formula is estimated with an unbiased estimator in the Averaged Soft Actor-Critic algorithm.
4. The method as claimed in claim 1, wherein the step S3 specifically includes the following steps:
s3 obtaining the state value according to the step S2, training the soft Q function through the Q learning process of the soft Q network to complete the interaction between the intelligent agent and the environment, estimating the error generated in the Q learning process by using the soft Q value of the target value function and the Monte Carlo algorithm, wherein the calculation formula of the Q learning process for training the soft Q function is as follows
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \tfrac{1}{2}\left( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right], \quad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\psi}}(s_{t+1}) \right]
where D represents the state distribution of previous samples during the interaction between the agent and the environment, which can also be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; p: S × A → S is the state transition probability distribution of the current agent's environment;
V_{\bar{\psi}}
is the target value network of the Averaged Soft Actor-Critic algorithm; θ is the soft Q network parameter used to estimate the action value of the agent operating in the current environment.
5. The method as claimed in claim 1, wherein the step S4 specifically includes the following steps:
S4, when the interaction process between the agent of the current image game and the environment runs to a certain step, selecting the previous K learned soft Q values to calculate the average soft Q value, where the calculation formula of the average soft Q value is as follows
\bar{Q}(s_t, a_t) = \frac{1}{K} \sum_{i=1}^{K} Q_{\theta_{t-i}}(s_t, a_t)
6. The method as claimed in claim 1, wherein the step S5 specifically includes the following steps:
s5, updating the strategy by minimizing the KL difference with the Boltzmann strategy, wherein the strategy improvement formula is as follows
J_\pi(\phi) = \mathbb{E}_{s_t \sim D}\left[ D_{\mathrm{KL}}\!\left( \pi_\phi(\cdot \mid s_t) \;\Big\|\; \frac{\exp\left( Q_\theta(s_t, \cdot) \right)}{Z_\theta(s_t)} \right) \right]
where D represents the state distribution of previous samples, i.e., the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; θ is the soft Q network parameter used to estimate the action value of the agent in this state, and φ is the policy network parameter of the Averaged Soft Actor-Critic algorithm.
7. The average SAC deep reinforcement learning method based on Q learning as claimed in claim 6, wherein the strategy is re-parameterized by a re-parameterization technique, and the calculation formula is as follows:
a_t = f_\phi(\epsilon_t; s_t), where \epsilon_t \sim \mathcal{N}(0, I);
The strategy improvement formula is as follows:
J_\pi(\phi) = \mathbb{E}_{s_t \sim D,\, \epsilon_t \sim \mathcal{N}}\left[ \log \pi_\phi\!\left( f_\phi(\epsilon_t; s_t) \mid s_t \right) - Q_\theta\!\left( s_t, f_\phi(\epsilon_t; s_t) \right) \right]
8. an average SAC deep reinforcement learning system based on Q learning is characterized by specifically comprising
The initialization module is used for randomly initializing the soft Q values corresponding to all states and actions, and for having the agent interact with the environment to collect data information of the current environment, initialize the first state of the current state sequence, and obtain the feature vector of the first state;
the training module is used for training the parameters of the soft Q network according to the collected characteristic vector data information, so that the trained information is used for estimating a soft Q value, and the parameters of the soft Q network are fixed after the soft Q network is trained; using the feature vector information as input in an algorithm network, outputting a corresponding selection action, and obtaining a new state and instant reward feedback based on the action;
the Critic network parameter updating module is used for obtaining soft Q value output and a corresponding state value function by using the characteristic vector data information as input in a Critic network; the Critic network part calculates TD error through the calculated state value functions and corresponding return; then, using a mean square error loss function to update the gradient of the Critic network parameters;
and the Actor network parameter updating module selects Softmax or a Gaussian score function and updates the Softmax or the Gaussian score function in combination with the TD error value of the Critic network.
CN202210683336.7A 2021-09-04 2022-06-16 Average SAC deep reinforcement learning method and system based on Q learning Pending CN114881228A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021110347969 2021-09-04
CN202111034796 2021-09-04

Publications (1)

Publication Number Publication Date
CN114881228A true CN114881228A (en) 2022-08-09

Family

ID=82680896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210683336.7A Pending CN114881228A (en) 2021-09-04 2022-06-16 Average SAC deep reinforcement learning method and system based on Q learning

Country Status (1)

Country Link
CN (1) CN114881228A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439479A (en) * 2022-11-09 2022-12-06 北京航空航天大学 Academic image multiplexing detection method based on reinforcement learning
CN115691110A (en) * 2022-09-20 2023-02-03 东南大学 Intersection signal period stable timing method based on reinforcement learning and oriented to dynamic traffic flow
CN115841163A (en) * 2023-02-20 2023-03-24 浙江吉利控股集团有限公司 Training method and device for model predictive control MPC and electronic equipment
CN116596060A (en) * 2023-07-19 2023-08-15 深圳须弥云图空间科技有限公司 Deep reinforcement learning model training method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111766782A (en) * 2020-06-28 2020-10-13 浙江大学 Strategy selection method based on Actor-Critic framework in deep reinforcement learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111766782A (en) * 2020-06-28 2020-10-13 浙江大学 Strategy selection method based on Actor-Critic framework in deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FENG DING ET AL.: "Averaged Soft Actor-Critic for Deep Reinforcement Learning", COMPLEXITY 2021, 1 April 2021 (2021-04-01), pages 1 - 16 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115691110A (en) * 2022-09-20 2023-02-03 东南大学 Intersection signal period stable timing method based on reinforcement learning and oriented to dynamic traffic flow
CN115691110B (en) * 2022-09-20 2023-08-25 东南大学 Intersection signal period stable timing method based on reinforcement learning and oriented to dynamic traffic flow
CN115439479A (en) * 2022-11-09 2022-12-06 北京航空航天大学 Academic image multiplexing detection method based on reinforcement learning
CN115439479B (en) * 2022-11-09 2023-02-03 北京航空航天大学 Academic image multiplexing detection method based on reinforcement learning
CN115841163A (en) * 2023-02-20 2023-03-24 浙江吉利控股集团有限公司 Training method and device for model predictive control MPC and electronic equipment
CN116596060A (en) * 2023-07-19 2023-08-15 深圳须弥云图空间科技有限公司 Deep reinforcement learning model training method and device, electronic equipment and storage medium
CN116596060B (en) * 2023-07-19 2024-03-15 深圳须弥云图空间科技有限公司 Deep reinforcement learning model training method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114881228A (en) Average SAC deep reinforcement learning method and system based on Q learning
Kurenkov et al. Ac-teach: A bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers
WO2021159779A1 (en) Information processing method and apparatus, computer-readable storage medium and electronic device
Wulfmeier et al. Data-efficient hindsight off-policy option learning
CN109978012A (en) It is a kind of based on combine the improvement Bayes of feedback against intensified learning method
CN110766044A (en) Neural network training method based on Gaussian process prior guidance
CN113392396B (en) Strategy protection defense method for deep reinforcement learning
Huang et al. Robust reinforcement learning as a stackelberg game via adaptively-regularized adversarial training
Zhu et al. An overview of the action space for deep reinforcement learning
CN114065929A (en) Training method and device for deep reinforcement learning model and storage medium
CN113420326A (en) Deep reinforcement learning-oriented model privacy protection method and system
CN114261400A (en) Automatic driving decision-making method, device, equipment and storage medium
CN111282272B (en) Information processing method, computer readable medium and electronic device
Ding et al. Averaged soft actor-critic for deep reinforcement learning
CN113947022B (en) Near-end strategy optimization method based on model
CN113239472B (en) Missile guidance method and device based on reinforcement learning
CN114037048A (en) Belief consistency multi-agent reinforcement learning method based on variational cycle network model
Li et al. Domain adaptive state representation alignment for reinforcement learning
CN114942799B (en) Workflow scheduling method based on reinforcement learning in cloud edge environment
CN113379027A (en) Method, system, storage medium and application for generating confrontation interactive simulation learning
Jang et al. AVAST: Attentive variational state tracker in a reinforced navigator
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
US20200364555A1 (en) Machine learning system
CN113240118B (en) Dominance estimation method, dominance estimation device, electronic device, and storage medium
Liu et al. How to guide your learner: Imitation learning with active adaptive expert involvement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination