CN114881228A - Average SAC deep reinforcement learning method and system based on Q learning - Google Patents
- Publication number: CN114881228A
- Application number: CN202210683336.7A
- Authority
- CN
- China
- Prior art keywords
- soft
- value
- state
- learning
- function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06N3/084 — Learning methods: backpropagation, e.g. using gradient descent
- G06N3/045 — Architecture, e.g. interconnection topology: combinations of networks
- G06N3/047 — Architecture, e.g. interconnection topology: probabilistic or stochastic networks
(All under G — Physics; G06 — Computing; G06N — Computing arrangements based on specific computational models; G06N3/00 — based on biological models; G06N3/02 — Neural networks.)
Abstract
The invention provides an average SAC (Soft Actor-Critic) deep reinforcement learning method and system based on Q learning, belonging to the technical field of deep reinforcement learning: 1) randomly initialize all parameter information; 2) update the Critic network parameters and complete policy evaluation to perform soft policy iteration: calculate the state value function of the value network, train the soft Q function through Q learning of the soft Q network, and then select the K previously learned soft Q values to calculate an average soft Q value; 3) update the Actor network parameters and complete policy improvement to perform learning optimization; 4) finish one round of agent training and learning, and iterate until a termination condition is met, thereby completing the deep reinforcement learning. The invention designs an Averaged Soft Actor-Critic method based on Q learning for image games, comprehensively considers the causes of poor stability in deep reinforcement learning, and can effectively improve the performance of the Soft Actor-Critic deep reinforcement learning algorithm by averaging the K previous soft Q values to reduce overestimation errors.
Description
Technical Field
The invention relates to the technical field of deep reinforcement learning, in particular to an average SAC deep reinforcement learning method and system based on Q learning.
Background
With the rapid development of the internet, the era of artificial intelligence has arrived. Artificial intelligence studies methods for simulating human thinking, and producing agents with fully autonomous learning ability is its main task. These agents need to interact with their environment at all times to exchange and transmit information. Their ultimate goal is to select the optimal action, through continuous training and learning on the current environment information, so as to realize the optimal policy for the current environment. Artificial intelligence systems under this definition include robots that can interact with the surrounding environment, as well as purely software-based agents that interact with multimedia devices (e.g., computers and mobile phones) and natural language. Deep reinforcement learning is particularly well suited to this interactive setting; its principle is the autonomous learning and training ability of an agent. For several problems common in machine learning, such as high computational complexity, large memory footprint, and high sample complexity, deep reinforcement learning can solve or alleviate them well by exploiting the neural networks of deep learning. However, most current deep reinforcement learning theory remains at the stage of theoretical exploration, and its application in real-world practice is not yet widespread. Researchers have therefore recently spent much effort on applying such algorithm models in the daily life of ordinary people.
In practical applications, however, the security of the smart device running the algorithm must be considered. Since deep reinforcement learning tends to explore unfamiliar and dangerous states, the policy learned by a deep reinforcement learning agent is easily affected by the safe-exploration problem; frequently exploring unsafe states (e.g., a mobile robot driving a car into an unsafe state) is extremely dangerous learning behavior. Therefore, for deep reinforcement learning deployed on real smart devices, the algorithm must enable its agent to provide a policy that is both efficient and highly secure. Even if this policy is not the optimal one, the algorithm must guarantee its safety during exploration. In an intelligent unmanned vehicle, for example, it would be completely impractical to select only the route the device defines as optimal or shortest to reach the destination while ignoring the safety of that policy.
However, research shows that many current smart devices applying deep reinforcement learning algorithms do not effectively consider the security of the agent's exploration, nor do they use it as a performance index of the device. This is extremely unreasonable. The unsafe exploration behavior of the agent is mainly caused by the overestimation of the Q learning process in the deep reinforcement learning algorithm.
In summary, to improve the safety and controllability problems in practical applications, the overestimation produced in the Q learning process must be mitigated, so as to reduce the error propagated during algorithm training and improve algorithm stability. The stability of deep reinforcement learning algorithms remains a serious challenge that limits their further optimization and development: although computer performance has improved, the stability of deep reinforcement learning algorithms has not improved accordingly. Therefore, improving the stability of the Q learning process and the security of deep reinforcement learning algorithms is among the most challenging problems in deep reinforcement learning.
Disclosure of Invention
To address the defects in the prior art, the invention aims to provide an Averaged Soft Actor-Critic algorithm that uses the K previously learned state values to calculate the soft Q value, with the goal of reducing the overestimation of the soft Q value during calculation. The invention innovatively provides an Averaged-SAC algorithm model to improve the performance of the SAC (Soft Actor-Critic) algorithm and reduce the large errors caused by overestimation.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
an average SAC deep reinforcement learning method based on Q learning specifically comprises the following steps:
s1, finishing policy evaluation to perform soft policy iteration: calculating the state value of the agent during environment interaction through policy evaluation, approximating the agent's state value function by the soft Q value, and estimating the soft Q value from a single action sample of the agent's policy in the current environment;
s2, calculating the state value function of the value network: obtaining the state value function of the value network from the soft Q value formula, and training it through the Q learning process;
s3, training the soft Q function through Q learning of the soft Q network: using the state value obtained in step S2, training the soft Q function through the Q learning process of the soft Q network to complete the interaction between the agent and the environment, and estimating the error generated during Q learning using the soft Q value of the target value network and a Monte Carlo estimate;
s4, calculating an average soft Q value: selecting the K previously learned soft Q values to calculate an average soft Q value during the interaction between the current image-game agent and the environment;
s5, finishing policy improvement for learning optimization: updating the policy by minimizing its KL divergence from the Boltzmann policy.
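Steps S1-S5 above can be sketched as a minimal training loop. The following Python sketch is purely illustrative: the one-dimensional toy environment, the scalar stand-ins for the soft Q networks, and every name in it are assumptions made for exposition, not the patent's actual implementation.

```python
import random
from collections import deque

class ToyEnv:
    """Hypothetical 1-D environment: the state drifts toward 0 and the
    reward is -|state|, so good behavior means staying near zero."""
    def reset(self):
        self.state = 1.0
        return self.state

    def step(self, action):
        self.state += action
        reward = -abs(self.state)
        done = abs(self.state) < 0.05
        return self.state, reward, done

def run_episode(env, k=3, max_steps=50, seed=0):
    rng = random.Random(seed)
    q_history = deque(maxlen=k)  # S4: the K previously learned soft Q values
    replay = []                  # experience replay buffer
    state = env.reset()
    q_avg = 0.0
    for _ in range(max_steps):
        # S5: stochastic policy (a hand-coded stand-in for the Actor network)
        action = -0.5 * state + 0.01 * rng.gauss(0.0, 1.0)
        next_state, reward, done = env.step(action)
        replay.append((state, action, reward, next_state))
        q_history.append(reward)                 # S1-S3: stand-in soft Q estimate
        q_avg = sum(q_history) / len(q_history)  # S4: averaged soft Q value
        state = next_state
        if done:
            break
    return replay, q_avg

replay, q_avg = run_episode(ToyEnv())
```

In the real algorithm the scalar stand-ins would be replaced by the value, soft Q, and policy networks, updated as described in steps S1-S5.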
Preferably, step S1 specifically includes the following steps:
s110, evaluating the state value of the agent during environment interaction through the policy, where the soft state value function is defined as

V(s_t) = E_{a_t~π}[ Q(s_t, a_t) − log π(a_t|s_t) ]

where V represents the state value function of the current agent in the current round, s_t ∈ S is the state the agent is in at time step t, a_t is the action taken by the agent in the current state, and π is the policy adopted by the agent in the current environment;
S130, approximating the agent's state value function by the soft Q value, and estimating the soft Q value from a single action sample of the agent's policy in the current environment.
Preferably, step S2 specifically includes the following steps:
s210, obtaining the state value function of the value network from the soft Q value formula;
s220, training the state value function through the Q learning process to reduce the error of the sample data, where the calculation formula is

J_V(ψ) = E_{s_t~D}[ (1/2) ( V_ψ(s_t) − E_{a_t~π_φ}[ Q_θ(s_t, a_t) − log π_φ(a_t|s_t) ] )² ]

where D represents the state distribution of previous samples, which can be seen as the sample distribution of the experience replay buffer; the gradient of the formula above is estimated with an unbiased estimator in the Averaged Soft Actor-Critic algorithm.
Preferably, step S3 specifically includes the following steps:
s3, using the state value obtained in step S2, training the soft Q function through the Q learning process of the soft Q network to complete the interaction between the agent and the environment, and estimating the error generated during Q learning using the soft Q value of the target value network and a Monte Carlo estimate; the Q learning process trains the soft Q function as

J_Q(θ) = E_{(s_t, a_t)~D}[ (1/2) ( Q_θ(s_t, a_t) − Q̂(s_t, a_t) )² ],
with the target Q̂(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}~p}[ V_ψ̄(s_{t+1}) ]

where D represents the state distribution of previous samples during the interaction between the agent and the environment, which can also be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; p: S × A → S is the probability distribution function of the environment state; V_ψ̄ is the target value network of the Averaged Soft Actor-Critic algorithm; and θ is the soft Q network parameter used to estimate the value of the agent acting in the current environment.
Preferably, step S4 specifically includes the following steps:
s4, when a given step is reached during the interaction between the current image-game agent and the environment, selecting the K previously learned soft Q values to calculate the average soft Q value as

Q_avg(s_t, a_t) = (1/K) Σ_{i=1}^{K} Q_{θ_i}(s_t, a_t)

where θ_1, …, θ_K are the K most recently learned soft Q network parameters.
Preferably, step S5 specifically includes the following steps:
s5, updating the policy by minimizing its KL divergence from the Boltzmann policy, where the policy improvement formula is

J_π(φ) = E_{s_t~D}[ D_KL( π_φ(·|s_t) ‖ exp(Q_θ(s_t, ·)) / Z_θ(s_t) ) ]

where D represents the state distribution of previous samples, i.e., the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the time step; θ is the soft Q network parameter used to estimate the value of the agent acting in this state; φ is the policy network parameter of the Averaged Soft Actor-Critic algorithm; and Z_θ(s_t) is the partition function normalizing the Boltzmann distribution.
Preferably, the policy is re-parameterized by a re-parameterization technique, with

a_t = f_φ(ε_t; s_t), where ε_t ~ N(0, I);

and the policy improvement formula becomes:

J_π(φ) = E_{s_t~D, ε_t~N}[ log π_φ( f_φ(ε_t; s_t) | s_t ) − Q_θ( s_t, f_φ(ε_t; s_t) ) ]
an average SAC deep reinforcement learning system based on Q learning, specifically comprising:
an initialization module, used for randomly initializing the soft Q values corresponding to all states and actions, and for the agent-environment interaction experiment that collects data information of the current environment, initializes the first state of the current state sequence, and obtains the feature vector of the first state;
a training module, used for training the soft Q network parameters according to the collected feature vector data so that the trained network is used to estimate the soft Q value, the soft Q network parameters being fixed after training; the feature vector information is used as input to the algorithm network, which outputs the corresponding selected action and obtains a new state and immediate reward feedback based on that action;
a Critic network parameter updating module, used for obtaining the soft Q value output and the corresponding state value function by feeding the feature vector data into the Critic network; the Critic network computes the TD error from the calculated state value functions and the corresponding returns, and then updates the Critic network parameters by gradient descent on a mean squared error loss;
and an Actor network parameter updating module, which selects a Softmax or Gaussian score function and updates it in combination with the TD error value from the Critic network.
The invention has the beneficial effects that: the invention designs an Averaged Soft Actor-Critic deep reinforcement learning algorithm based on Q learning for image game data; the algorithm uses the K previously learned state values to calculate the soft Q value, thereby reducing the overestimation of the soft Q value during calculation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for average SAC deep reinforcement learning based on Q learning according to the present invention;
FIG. 2 is a graph of overestimation results for Averaged-SAC and SAC;
FIG. 3 is a framework diagram of the Averaged Soft Actor-Critic deep reinforcement learning algorithm based on Q learning.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. Other embodiments, which can be derived by one of ordinary skill in the art from the embodiments given herein without any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of the average SAC deep reinforcement learning method based on Q learning, and fig. 3 is a framework diagram of the Averaged Soft Actor-Critic deep reinforcement learning algorithm based on Q learning. First, the agent interacts with the current environment to obtain its current state information. The state information is then fed into the network layer of the Averaged Soft Actor-Critic algorithm; after processing by the convolutional neural network and the Softmax function, the network layer outputs the action currently selected by the agent. The selected action is fed back to the interactive environment, which returns the immediate reward produced by the action. The Averaged Soft Actor-Critic algorithm passes this feedback to the Critic module, which updates the Critic parameters and computes the TD error. The Actor module receives the environment information and the TD error transmitted by the Critic module, and updates the Actor parameters accordingly. Finally, by continuously cycling through these steps, the agent is trained to learn an optimal policy.
An average SAC deep reinforcement learning method based on Q learning specifically comprises the following steps:
s1, finishing policy evaluation to perform soft policy iteration: calculating the state value of the agent during environment interaction through policy evaluation, approximating the agent's state value function by the soft Q value, and estimating the soft Q value from a single action sample of the agent's policy in the current environment;
the strategy evaluation is the first step of Soft strategy iteration performed by the Averaged Soft Actor-Critic algorithm, and the value of the intelligent agent in the environment interaction process needs to be calculated. To achieve this goal, the Averaged Soft Actor-Critic algorithm defines the Soft state value function as shown in equation (1):
where V represents the state cost function, s, of the current agent in the current round t E S is the state of the agent at time step t, a t The method is an operation of the agent when the agent is executed in the current state, and pi is a strategy adopted by the agent in the current environment.
The soft Q function of the Averaged Soft Actor-Critic algorithm is defined as shown in equation (2):

Q(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}~p}[ V(s_{t+1}) ]   (2)

As equation (2) shows, the agent obtains the next state based on the current state and the action taken.
In the Averaged Soft Actor-Critic algorithm, the state value function can be approximated via the soft Q value, and in principle the algorithm does not need a separate function approximator for the state value, because the state value function and the soft Q function are related through the policy as in equation (1). The algorithm can estimate the soft Q value from a single action sample of the agent's policy in the current environment without introducing additional bias. Computing the soft Q value through a single function approximator lets the deep reinforcement learning algorithm train stably and conveniently alongside the other networks.
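The single-sample estimate described above can be written in one line: V(s) ≈ Q(s, a) − log π(a|s) for one action a sampled from the policy. A minimal sketch follows; the function name and the toy numbers are illustrative assumptions, not the patent's implementation.

```python
import math

def soft_state_value_single_sample(q_value, log_prob):
    """Single-sample estimate of the soft state value,
    V(s) ~= Q(s, a) - log pi(a|s), for one action a ~ pi(.|s).
    No separate function approximator for V is needed."""
    return q_value - log_prob

# Toy check: a uniform policy over 4 actions with identical Q values
# yields the same estimate no matter which action is sampled.
q = 2.0
log_p = math.log(0.25)  # log pi(a|s) for the uniform policy over 4 actions
v_hat = soft_state_value_single_sample(q, log_p)
```

Note that the entropy term −log π(a|s) is positive here, so the soft value exceeds the raw Q value, rewarding stochastic policies.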
S2, calculating the state value function of the value network: obtaining the state value function of the value network from the soft Q value formula, and training it through the Q learning process;
after the soft Q function is determined, the state value function of the value network can be obtained. The Averaged Soft Actor-Critic algorithm trains the state value function through the Q learning process to minimize the error of the sample data, as shown in equation (3):

J_V(ψ) = E_{s_t~D}[ (1/2) ( V_ψ(s_t) − E_{a_t~π_φ}[ Q_θ(s_t, a_t) − log π_φ(a_t|s_t) ] )² ]   (3)

where D represents the state distribution of previous samples, which can be regarded as the sample distribution of the experience replay buffer. The gradient of equation (3) can be estimated with an unbiased estimator in the Averaged Soft Actor-Critic algorithm.
When the Averaged Soft Actor-Critic algorithm is applied to a continuous state space, the neural network parameterizing the current Averaged Soft Actor-Critic algorithm needs to be updated.
S3, training the soft Q function through Q learning of the soft Q network: using the state value obtained in step S2, training the soft Q function through the Q learning process of the soft Q network to complete the interaction between the agent and the environment, and estimating the error generated during Q learning using the soft Q value of the target value network and a Monte Carlo estimate;
after the state value is obtained in the preceding steps, the soft Q function is trained through the Q learning process of the soft Q network, completing the interaction between the agent and the environment. The Averaged Soft Actor-Critic algorithm trains the soft Q function through Q learning so as to minimize the residual of the Bellman equation, as shown in equation (4):

J_Q(θ) = E_{(s_t, a_t)~D}[ (1/2) ( Q_θ(s_t, a_t) − Q̂(s_t, a_t) )² ],
with the target Q̂(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}~p}[ V_ψ̄(s_{t+1}) ]   (4)

where D represents the state distribution of previous samples during the interaction between the agent and the environment, which can also be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; p: S × A → S is the probability distribution function of the environment state; V_ψ̄ is the target value network of the Averaged Soft Actor-Critic algorithm, whose network structure and parameters are the same as those of the value network ψ but whose update rate differs; and θ is the soft Q network parameter used to estimate the value of the agent acting in the current environment. The Averaged Soft Actor-Critic algorithm uses the soft Q values of the target value network and a Monte Carlo estimate to estimate the error generated during Q learning.
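The TD target and the slowly updated target value network can be sketched as follows. The patent only states that the target network shares the value network's structure but updates at a different rate; the Polyak-averaging form and the coefficient tau below are common-practice assumptions, and all names are illustrative.

```python
def soft_q_target(reward, next_value, gamma=0.99):
    """TD target for the soft Q function: r(s_t, a_t) + gamma * V_psi_bar(s_{t+1})."""
    return reward + gamma * next_value

def polyak_update(target_params, params, tau=0.005):
    """Let the target value network slowly track the value network.
    (The Polyak form and tau are assumptions; the patent only says the
    two networks update at different rates.)"""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, params)]

target = soft_q_target(reward=1.0, next_value=2.0, gamma=0.9)  # 1.0 + 0.9 * 2.0
tracked = polyak_update([0.0, 0.0], [1.0, 1.0], tau=0.1)       # moves 10% toward [1, 1]
```

Using the slowly moving target network in the TD target is what decouples the regression target from the rapidly changing soft Q network, stabilizing training.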
S4, calculating an average soft Q value: selecting the K previously learned soft Q values to calculate an average soft Q value during the interaction between the current image-game agent and the environment;
during the interaction between the current image game agent and the environment, the average Soft Actor-Critic algorithm selects the first K previously learned Soft Q values to calculate the average Soft Q value when the current operation reaches a certain step. Therefore, the average Soft Q value calculation method of the Averaged Soft Actor-critical algorithm is shown in formula (5):
by comparing the game result performance of the Avaged Soft Actor-Critic algorithm and the Soft Actor-Critic algorithm in the InvertedDoublePendulum game in FIG. 2, it can be seen that the Avaged Soft Actor-Critic algorithm provided by the invention has excellent performance effect in reducing the over-high estimation of the Soft Q value in the Q learning process. Therefore, the Averaged Soft Actor-Critic algorithm provided by the invention can effectively reduce the problem caused by over-high estimation of the Soft Q value in the Q learning process.
S5, finishing policy improvement for learning optimization: updating the policy by minimizing its KL divergence from the Boltzmann policy.
The policy improvement step of the Averaged Soft Actor-Critic algorithm updates the learning parameters of the agent in the current environment, as far as possible, in the direction that maximizes the return the agent can obtain. The algorithm takes the Boltzmann policy at a given temperature as the update target and uses the soft Q function at that temperature as the energy. Thus, the Averaged Soft Actor-Critic algorithm updates the policy by minimizing its Kullback-Leibler (KL) divergence from the Boltzmann policy, which gives the policy improvement objective of equation (6):

J_π(φ) = E_{s_t~D}[ D_KL( π_φ(·|s_t) ‖ exp(Q_θ(s_t, ·)) / Z_θ(s_t) ) ]   (6)
where D represents the state distribution of previous samples, which can be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the time step; θ is the soft Q network parameter used to estimate the value of the agent acting in this state; φ is the policy network parameter of the Averaged Soft Actor-Critic algorithm; and Z_θ(s_t) is the partition function that normalizes the Boltzmann distribution.
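For intuition, the Boltzmann target policy and the KL divergence being minimized can be computed explicitly in a discrete-action toy case. The patent targets continuous actions, so this discrete sketch and its numbers are illustrative assumptions only.

```python
import math

def boltzmann(q_values, temperature=1.0):
    """Boltzmann (softmax) policy with the soft Q values as the energy."""
    exps = [math.exp(q / temperature) for q in q_values]
    z = sum(exps)  # partition function Z(s)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q): the quantity the policy improvement step minimizes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

q_values = [1.0, 2.0, 0.5]
target = boltzmann(q_values)          # the Boltzmann target of the KL objective
uniform = [1.0 / 3.0] * 3             # a not-yet-improved policy
gap = kl_divergence(target, uniform)  # positive until the policy matches
```

The KL gap shrinks to zero exactly when the policy equals the Boltzmann distribution, which is why minimizing it implements policy improvement.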
Because a conventional deep reinforcement learning algorithm takes an expectation over the probability distribution of the policy's output actions, some error is introduced during backpropagation. The Averaged Soft Actor-Critic algorithm solves this problem with a re-parameterization technique: it combines the corresponding neural network output with an input noise vector sampled from a spherical Gaussian distribution, establishing a path along which the backpropagated error information can be computed correctly. The output of the policy network can then be used to form the stochastic action distribution in the given state, without using the policy network's output data directly. The re-parameterized form of the policy is shown in equation (7):
a_t = f_φ(ε_t; s_t)   (7)

where ε_t ~ N(0, I). After substituting this form into the original algorithm, the policy objective of the Averaged Soft Actor-Critic algorithm becomes equation (8):

J_π(φ) = E_{s_t~D, ε_t~N}[ log π_φ( f_φ(ε_t; s_t) | s_t ) − Q_θ( s_t, f_φ(ε_t; s_t) ) ]   (8)
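The re-parameterized sampler of equation (7) is deterministic once the noise is fixed, which is what lets gradients flow through the policy parameters. A minimal sketch with a Gaussian policy f_φ(ε; s) = μ(s) + σ(s)·ε follows; the concrete μ and σ values are illustrative assumptions.

```python
import random

def reparameterized_action(mu, sigma, eps):
    """f_phi(eps; s) = mu(s) + sigma(s) * eps, with eps ~ N(0, I).
    The action is a deterministic function of the sampled noise, so the
    backpropagated error can pass through mu and sigma."""
    return mu + sigma * eps

rng = random.Random(42)
eps = rng.gauss(0.0, 1.0)  # input noise (1-D here) from N(0, 1)
action = reparameterized_action(mu=0.5, sigma=0.2, eps=eps)
replay_action = reparameterized_action(0.5, 0.2, eps)  # same eps, same action
```

Sampling the noise outside the function and treating it as a constant is the whole trick: the randomness is moved out of the computation graph, so μ and σ receive exact gradients.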
the invention also provides an average SAC deep reinforcement learning system based on Q learning, which specifically comprises
The initialization module is used for initializing soft Q values corresponding to all states and actions randomly; the intelligent agent and environment interaction experiment is used for collecting data information of the current environment, initializing the first state of the current state sequence and obtaining a feature vector of the first state;
the training module is used for training the parameters of the soft Q network according to the collected characteristic vector data information, so that the trained information is used for estimating a soft Q value, and the parameters of the soft Q network are fixed after the soft Q network is trained; using the feature vector information as input in an algorithm network, outputting a corresponding selection action, and obtaining a new state and instant reward feedback based on the action;
the Critic network parameter updating module is used for obtaining soft Q value output and a corresponding state value function by using the characteristic vector data information as input in a Critic network; the Critic network part calculates TD error through the calculated state value functions and corresponding return; then, using a mean square error loss function to update the gradient of the Critic network parameters;
and the Actor network parameter updating module selects Softmax or a Gaussian score function and updates the Softmax or the Gaussian score function in combination with the TD error value of the Critic network.
The specific procedure is shown in the following table.
TABLE 1 Overall Process of the invention
Verification results:
in the experiments of the present invention, 6 different MuJoCo games from OpenAI Gym were used as the experimental environments. In most MuJoCo games the Soft Actor-Critic algorithm is clearly superior to other deep reinforcement learning algorithms, so in the Averaged Soft Actor-Critic experiments, 6 MuJoCo games were selected to compare the performance of the Soft Actor-Critic and Averaged Soft Actor-Critic algorithms.
In order to more intuitively compare the performance of the Averaged Soft Actor-Critic algorithm and the Soft Actor-Critic algorithm of the present invention, their average training scores are listed in Table 2. The results are summarized in table 1. It can be seen that the overall performance of the Averaged Soft Actor-Critic algorithm is better than that of the Soft Actor-Critic algorithm. As can be seen, as the K value increases, the average training score of the Averaged Soft Actor-Critic algorithm will increase; if the increased K value exceeds a certain range, the average score of the Averaged Soft Actor-Critic algorithm will tend to decrease in certain gaming environments. This is mainly because the average Soft Actor-critical algorithm needs to calculate a large number of K values to obtain its average Soft Q value during the training process, and therefore, more training time and calculation amount are needed. But this cost penalty is acceptable in view of the performance improvement of the Averaged Soft Actor-Critic algorithm.
TABLE 2 training scores for Averaged-SAC and SAC
In summary, across 6 comparison experiments in the MuJoCo environment, the Averaged Soft Actor-Critic algorithm does perform better than the Soft Actor-Critic algorithm. In addition, appropriately increasing the value of K in the Averaged Soft Actor-Critic algorithm yields better performance and stability, while increasing the agent's training time cost within an acceptable range. Therefore, in the Averaged Soft Actor-Critic algorithm, an appropriate value of K should be selected for training and learning according to the available computing resources.
The experimental results show that, within a certain range, the agent obtains better performance as K increases; however, if K exceeds that range, the learning performance of the agent tends to decline. Repeated experiments with different values of K verify that the agent's learning performance is best when K = 10. The averaging scheme can be easily integrated into the Soft Actor-Critic algorithm.
In light of the foregoing description of the preferred embodiments of the present invention, those skilled in the art can now make various alterations and modifications without departing from the scope of the invention. The technical scope of the present invention is not limited to the contents of the specification, and must be determined according to the scope of the claims.
Claims (8)
1. An average SAC deep reinforcement learning method based on Q learning is characterized by comprising the following steps:
S1, finishing strategy evaluation to perform soft strategy iteration: calculating the state value of the intelligent agent in the environment interaction process through strategy evaluation, approximating the state value function of the agent by a soft Q value, and estimating the soft Q value from a single action sample of the agent's strategy in the current environment;
S2, calculating the state value function of the value network: obtaining the state value function of the value network according to the calculation formula of the soft Q value, and training the state value function through a Q learning process;
S3, training the soft Q function by Q learning of the soft Q network: obtaining the state value according to step S2, training the soft Q function through the Q learning process of the soft Q network to complete the interaction between the agent and the environment, and estimating the errors generated in the Q learning process by using the soft Q value of the target value function and a Monte Carlo method;
S4, calculating the average soft Q value: selecting the previous K learned soft Q values to calculate the average soft Q value during the interaction between the current image game agent and the environment;
S5, finishing strategy improvement for learning optimization: updating the strategy by minimizing the KL divergence from the Boltzmann strategy.
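As a concrete illustration of step S4, the average soft Q value is the arithmetic mean of the K most recently learned soft Q estimates. A minimal sketch follows; the snapshot array, its dimensions, and the function name are illustrative assumptions, not the patent's data structures:

```python
import numpy as np

# Step S4 sketch: average the previous K learned soft Q estimates.
# The stored snapshots (25 learning steps, 4 states, 2 actions) are
# illustrative placeholders.
K = 10
rng = np.random.default_rng(0)
q_snapshots = rng.normal(size=(25, 4, 2))  # one soft Q table per learning step

def averaged_soft_q(snapshots, k):
    """Mean of the k most recent soft Q estimates (step S4)."""
    return snapshots[-k:].mean(axis=0)

q_avg = averaged_soft_q(q_snapshots, K)
```

In the deep setting the snapshots would be the outputs of the K most recent soft Q network parameter copies evaluated at the same state-action pair, rather than stored tables.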
2. The method as claimed in claim 1, wherein the step S1 specifically includes the following steps:
S110, evaluating and calculating the state value of the intelligent agent in the environment interaction process through strategy evaluation, wherein the soft state value function is defined as
V(s_t) = E_{a_t~π}[ Q(s_t, a_t) − log π(a_t|s_t) ]
where V represents the state value function of the current agent in the current round, s_t ∈ S is the state of the agent at time step t, a_t is the action taken by the agent in the current state, and π is the strategy adopted by the agent in the current environment;
S130, approximating the state value function of the agent to a soft Q value, and estimating the soft Q value from a single operation sample of the agent strategy under the current environment.
3. The method as claimed in claim 1, wherein the step S2 specifically includes the following steps:
S210, obtaining the state value function of the value network according to the calculation formula of the soft Q value;
S220, training the state value function through the Q learning process to reduce the error on the sample data, wherein the calculation formula is as follows
J_V(ψ) = E_{s_t~D}[ ½ ( V_ψ(s_t) − E_{a_t~π}[ Q_θ(s_t, a_t) − log π(a_t|s_t) ] )² ]
where D represents the state distribution of previous samples, which can be regarded as the sample distribution of the experience replay buffer; the gradient of the above formula is estimated using an unbiased estimator in the Averaged Soft Actor-Critic algorithm.
4. The method as claimed in claim 1, wherein the step S3 specifically includes the following steps:
S3, obtaining the state value according to step S2, training the soft Q function through the Q learning process of the soft Q network to complete the interaction between the intelligent agent and the environment, and estimating the error generated in the Q learning process by using the soft Q value of the target value function and the Monte Carlo method, wherein the calculation formula of the Q learning process for training the soft Q function is as follows
J_Q(θ) = E_{(s_t, a_t)~D}[ ½ ( Q_θ(s_t, a_t) − ( r(s_t, a_t) + γ E_{s_{t+1}~p}[ V_ψ̄(s_{t+1}) ] ) )² ]
where D represents the state distribution of previous samples during the interaction between the agent and the environment, which can also be regarded as the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the current time step; p: S × A → S is the probability distribution function of the current agent's environmental state; V_ψ̄ is the target value network of the Averaged Soft Actor-Critic algorithm; and θ is the soft Q network parameter used to estimate the state of the agent operating in the current environment.
5. The method as claimed in claim 1, wherein the step S4 specifically includes the following steps:
S4, when running to a certain step in the interaction process between the current image game agent and the environment, selecting the previous K learned soft Q values to calculate the average soft Q value, wherein the calculation formula of the average soft Q value is as follows
Q_avg(s_t, a_t) = (1/K) Σ_{i=1}^{K} Q_{θ_i}(s_t, a_t)
where Q_{θ_1}, …, Q_{θ_K} are the K most recently learned soft Q estimates.
6. The method as claimed in claim 1, wherein the step S5 specifically includes the following steps:
S5, updating the strategy by minimizing the KL divergence from the Boltzmann strategy, wherein the strategy improvement formula is as follows
J_π(φ) = E_{s_t~D}[ D_KL( π_φ(·|s_t) ‖ exp(Q_θ(s_t, ·)) / Z_θ(s_t) ) ]
where D represents the state distribution of previous samples, i.e. the sample distribution of the experience replay buffer; s_t ∈ S is the state of the agent at the time step; θ is the soft Q network parameter used to estimate the agent operating in this state; φ is the policy network parameter of the Averaged Soft Actor-Critic algorithm; and Z_θ(s_t) is the partition function normalizing the Boltzmann distribution.
7. The average SAC deep reinforcement learning method based on Q learning as claimed in claim 6, wherein the strategy is re-parameterized by a re-parameterization technique, and the calculation formula is as follows:
a_t = f_φ(ε_t; s_t), where ε_t ~ N(0, I);
the strategy improvement formula is then as follows:
J_π(φ) = E_{s_t~D, ε_t~N}[ log π_φ( f_φ(ε_t; s_t) | s_t ) − Q_θ( s_t, f_φ(ε_t; s_t) ) ].
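The re-parameterization in claim 7 can be sketched as a deterministic transform of Gaussian noise and the state into a bounded action. Here f_φ is assumed to be a tanh-squashed Gaussian; W_mu, W_log_std, and all dimensions are illustrative placeholders, not the patent's policy network:

```python
import numpy as np

# Sketch of the re-parameterization a_t = f_phi(eps_t; s_t), eps_t ~ N(0, I).
# W_mu and W_log_std are illustrative placeholder weights.
rng = np.random.default_rng(42)
state_dim, action_dim = 3, 2
W_mu = rng.normal(scale=0.1, size=(action_dim, state_dim))
W_log_std = rng.normal(scale=0.1, size=(action_dim, state_dim))

def f_phi(eps, s):
    """Deterministic transform of noise and state into a bounded action."""
    mu = W_mu @ s
    std = np.exp(np.clip(W_log_std @ s, -5.0, 2.0))  # keep std in a sane range
    return np.tanh(mu + std * eps)  # squashed into (-1, 1)

s_t = rng.normal(size=state_dim)
eps_t = rng.normal(size=action_dim)  # eps_t ~ N(0, I)
a_t = f_phi(eps_t, s_t)
```

Because the randomness lives entirely in ε_t, the action is a differentiable function of the policy parameters, which is what allows the strategy improvement objective above to be optimized by gradient descent.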
8. An average SAC deep reinforcement learning system based on Q learning is characterized by specifically comprising
the initialization module, used for randomly initializing the soft Q values corresponding to all states and actions; the agent-environment interaction experiment, used for collecting data information of the current environment, initializing the first state of the current state sequence, and obtaining the feature vector of the first state;
the training module, used for training the parameters of the soft Q network according to the collected feature vector data information, so that the trained network is used to estimate the soft Q value, the parameters of the soft Q network being fixed after training; the feature vector information is used as input to the algorithm network, which outputs the selected action, and a new state and an immediate reward feedback are obtained based on that action;
the Critic network parameter updating module, used for obtaining the soft Q value output and the corresponding state value function by using the feature vector data information as input to the Critic network; the Critic network part computes the TD error from the calculated state value functions and the corresponding returns, and then updates the Critic network parameters by gradient descent on a mean square error loss;
and the Actor network parameter updating module, which selects a Softmax or Gaussian score function and updates it in combination with the TD error value from the Critic network.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111034796 | 2021-09-04 | ||
CN2021110347969 | 2021-09-04 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114881228A true CN114881228A (en) | 2022-08-09 |
Family
ID=82680896
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210683336.7A Pending CN114881228A (en) | 2021-09-04 | 2022-06-16 | Average SAC deep reinforcement learning method and system based on Q learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114881228A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111766782A (en) * | 2020-06-28 | 2020-10-13 | 浙江大学 | Strategy selection method based on Actor-Critic framework in deep reinforcement learning |
Non-Patent Citations (1)
Title |
---|
FENG DING ET AL.: "Averaged Soft Actor-Critic for Deep Reinforcement Learning", COMPLEXITY 2021, 1 April 2021 (2021-04-01), pages 1 - 16 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115691110A (en) * | 2022-09-20 | 2023-02-03 | 东南大学 | Intersection signal period stable timing method based on reinforcement learning and oriented to dynamic traffic flow |
CN115691110B (en) * | 2022-09-20 | 2023-08-25 | 东南大学 | Intersection signal period stable timing method based on reinforcement learning and oriented to dynamic traffic flow |
CN115439479A (en) * | 2022-11-09 | 2022-12-06 | 北京航空航天大学 | Academic image multiplexing detection method based on reinforcement learning |
CN115439479B (en) * | 2022-11-09 | 2023-02-03 | 北京航空航天大学 | Academic image multiplexing detection method based on reinforcement learning |
CN115841163A (en) * | 2023-02-20 | 2023-03-24 | 浙江吉利控股集团有限公司 | Training method and device for model predictive control MPC and electronic equipment |
CN116596060A (en) * | 2023-07-19 | 2023-08-15 | 深圳须弥云图空间科技有限公司 | Deep reinforcement learning model training method and device, electronic equipment and storage medium |
CN116596060B (en) * | 2023-07-19 | 2024-03-15 | 深圳须弥云图空间科技有限公司 | Deep reinforcement learning model training method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114881228A (en) | Average SAC deep reinforcement learning method and system based on Q learning | |
Kurenkov et al. | Ac-teach: A bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers | |
WO2021159779A1 (en) | Information processing method and apparatus, computer-readable storage medium and electronic device | |
Wulfmeier et al. | Data-efficient hindsight off-policy option learning | |
CN112437690B (en) | Method and device for determining action selection policy of execution device | |
Zhu et al. | An overview of the action space for deep reinforcement learning | |
CN113392396A (en) | Strategy protection defense method for deep reinforcement learning | |
Huang et al. | Robust reinforcement learning as a stackelberg game via adaptively-regularized adversarial training | |
CN113255936A (en) | Deep reinforcement learning strategy protection defense method and device based on simulation learning and attention mechanism | |
CN109978012A (en) | It is a kind of based on combine the improvement Bayes of feedback against intensified learning method | |
CN114261400B (en) | Automatic driving decision method, device, equipment and storage medium | |
CN111282272B (en) | Information processing method, computer readable medium and electronic device | |
CN113947022B (en) | Near-end strategy optimization method based on model | |
CN114065929A (en) | Training method and device for deep reinforcement learning model and storage medium | |
CN113379027A (en) | Method, system, storage medium and application for generating confrontation interactive simulation learning | |
CN112533681B (en) | Determining action selection guidelines for executing devices | |
CN113239472B (en) | Missile guidance method and device based on reinforcement learning | |
Liu et al. | How to guide your learner: Imitation learning with active adaptive expert involvement | |
Li et al. | Domain adaptive state representation alignment for reinforcement learning | |
CN114154397A (en) | Implicit adversary modeling method based on deep reinforcement learning | |
CN114037048A (en) | Belief consistency multi-agent reinforcement learning method based on variational cycle network model | |
CN114942799B (en) | Workflow scheduling method based on reinforcement learning in cloud edge environment | |
CN116992928A (en) | Multi-agent reinforcement learning method for fair self-adaptive traffic signal control | |
Jang et al. | AVAST: Attentive variational state tracker in a reinforced navigator | |
US20200364555A1 (en) | Machine learning system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||