CN113487039A - Agent adaptive decision generation method and system based on deep reinforcement learning - Google Patents

Agent adaptive decision generation method and system based on deep reinforcement learning

Info

Publication number
CN113487039A
CN113487039A (application CN202110729857.7A)
Authority
CN
China
Prior art keywords
experience
reinforcement learning
intelligent
algorithm
sac
Prior art date
Legal status
Granted
Application number
CN202110729857.7A
Other languages
Chinese (zh)
Other versions
CN113487039B (en)
Inventor
宋勇
程艳
庞豹
袁宪锋
许庆阳
巩志
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University
Priority to CN202110729857.7A
Publication of CN113487039A
Application granted
Publication of CN113487039B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an agent adaptive decision generation method and system based on deep reinforcement learning. The agent adaptive decision problem is studied on the basis of the deep reinforcement learning Soft Actor-Critic (SAC) algorithm; the SAC algorithm is improved to address problems that arise in the training process, and the SAC + PER, SAC + ERE and SAC + PER + ERE algorithms are proposed. The adaptive decision problem of the agent is solved by combining the strong perception capability of deep learning with the efficient decision-making capability of reinforcement learning: the agent is trained by the deep reinforcement learning algorithm so that it accumulates experience while interacting with the environment and forms its own understanding of when to apply specific behaviors. Meanwhile, an unmanned aerial vehicle anti-interception task in a simulation environment is used as a carrier to verify the effectiveness of the algorithms.

Description

Agent adaptive decision generation method and system based on deep reinforcement learning
Technical Field
The disclosure belongs to the technical field of intelligent optimization, and particularly relates to an intelligent agent adaptive decision generation method and system based on deep reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Reinforcement learning occupies an important place in the field of machine learning; it mainly studies how to obtain the maximum expected benefit as the environment changes in real time. For a general reinforcement learning algorithm, the goal of agent learning is to learn a policy that maximizes the expected cumulative return, where the agent is the entity that performs reinforcement learning.
Because the actual environment of an agent is often dynamic and uncertain, and environmental changes are difficult to predict in advance, the adaptive behavior to adopt in response to those changes is difficult to determine; in existing agent learning and training processes, the stability of training is poor and the robustness is low.
Disclosure of Invention
The invention provides an agent adaptive decision generation method and system based on deep reinforcement learning. The agent adaptive decision problem is studied on the basis of the deep reinforcement learning Soft Actor-Critic (SAC) algorithm, and the SAC algorithm is improved to address problems that arise during training: the SAC + PER (Prioritized Experience Replay, PER), SAC + ERE (Emphasizing Recent Experience replay, ERE) and SAC + PER + ERE algorithms are proposed. The adaptive decision problem of the agent is solved by combining the strong perception capability of deep learning with the efficient decision-making capability of reinforcement learning: the agent is trained by the deep reinforcement learning algorithm so that it accumulates experience while interacting with the environment and forms its own understanding of when to apply specific behaviors. Meanwhile, an unmanned aerial vehicle anti-interception task in a simulation environment is used as a carrier to verify the effectiveness of the algorithms.
In order to achieve the purpose, the invention is realized by the following technical scheme:
in a first aspect, the present disclosure provides an agent adaptive decision generation method based on deep reinforcement learning, including:
acquiring historical and current environmental state information, environmental reward information and decision action information of the intelligent agent; acquiring environmental state information at the next moment;
storing all the acquired information as experience in a playback buffer area;
training an agent deep reinforcement learning model; experience replay in the agent deep reinforcement learning algorithm adopts a prioritized experience replay strategy, i.e., the importance of different experiences in the experience replay buffer is distinguished; and searching for the optimal solution of the model by using an ordinary gradient descent optimizer during training of the agent deep reinforcement learning model;
and verifying the agent deep reinforcement learning model by using an unmanned aerial vehicle anti-interception task in a simulation environment as a carrier.
In a second aspect, the present disclosure further provides an agent adaptive decision generating system based on deep reinforcement learning, including an experience acquisition module, a model training module and a verification module;
the experience acquisition module configured to: acquiring historical and current environmental state information, environmental reward information and decision action information of the intelligent agent; acquiring environmental state information at the next moment; storing all the acquired information as experience in a playback buffer area;
the model training module configured to: train an agent deep reinforcement learning model; experience replay in the agent deep reinforcement learning algorithm adopts a prioritized experience replay strategy, i.e., the importance of different experiences in the experience replay buffer is distinguished; and search for the optimal solution of the model by using an ordinary gradient descent optimizer during training of the agent deep reinforcement learning model;
the verification module configured to: verify the agent deep reinforcement learning model by using an unmanned aerial vehicle anti-interception task in a simulation environment as a carrier.
Compared with the prior art, the beneficial effects of the present disclosure are:
1. The method studies the agent adaptive decision problem on the basis of the deep reinforcement learning Soft Actor-Critic (SAC) algorithm and improves the SAC algorithm to address problems that arise during training, proposing the SAC + PER (Prioritized Experience Replay, PER), SAC + ERE (Emphasizing Recent Experience replay, ERE) and SAC + PER + ERE algorithms; the adaptive decision problem of the agent is solved by combining the strong perception capability of deep learning with the efficient decision-making capability of reinforcement learning, and the agent is trained by the deep reinforcement learning algorithm so that it accumulates experience while interacting with the environment and forms its own understanding of when to apply specific behaviors; the method can sample effectively in continuous state and action spaces, remains stable, and has stronger exploration capability and robustness;
2. The method adopts an ordinary gradient descent optimizer with the learning rate set to 3e-4, which improves the stability of the algorithm and the success rate of training;
3. The unmanned aerial vehicle anti-interception task in a simulation environment is used as a carrier to verify the effectiveness of the algorithms, which avoids the difficulty of training a physical system in a real environment with a deep reinforcement learning algorithm when verifying the model.
Drawings
The accompanying drawings, which form a part hereof, are included to provide a further understanding of the present embodiments, and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the present embodiments and together with the description serve to explain the present embodiments without unduly limiting the present embodiments.
Fig. 1 shows the experience replay mechanism of embodiment 1 of the present disclosure;
Fig. 2 is an effect diagram of using the Adam optimizer in the one-to-one scene verification of embodiment 1 of the present disclosure;
Fig. 3 is an effect diagram of using the gradient descent optimizer in the one-to-one scene verification of embodiment 1 of the present disclosure;
Fig. 4 is an effect diagram of using the Adam optimizer in the one-to-two scene verification of embodiment 1 of the present disclosure;
Fig. 5 is an effect diagram of using the gradient descent optimizer in the one-to-two scene verification of embodiment 1 of the present disclosure;
Fig. 6 is an effect diagram of using the Adam optimizer in the one-to-one scene verification of embodiment 2 of the present disclosure;
Fig. 7 is an effect diagram of using the gradient descent optimizer in the one-to-one scene verification of embodiment 2 of the present disclosure;
Fig. 8 is an effect diagram of using the Adam optimizer in the one-to-two scene verification of embodiment 2 of the present disclosure;
Fig. 9 is an effect diagram of using the gradient descent optimizer in the one-to-two scene verification of embodiment 2 of the present disclosure;
Fig. 10 is an effect diagram of using the Adam optimizer in the one-to-one scene verification of embodiment 3 of the present disclosure;
Fig. 11 is an effect diagram of using the gradient descent optimizer in the one-to-one scene verification of embodiment 3 of the present disclosure;
Fig. 12 is an effect diagram of using the Adam optimizer in the one-to-two scene verification of embodiment 3 of the present disclosure;
Fig. 13 is an effect diagram of using the gradient descent optimizer in the one-to-two scene verification of embodiment 3 of the present disclosure;
Fig. 14 is an effect diagram of using the Adam optimizer in the one-to-one scene verification of embodiment 4 of the present disclosure;
Fig. 15 is an effect diagram of using the gradient descent optimizer in the one-to-one scene verification of embodiment 4 of the present disclosure;
Fig. 16 is an effect diagram of using the Adam optimizer in the one-to-two scene verification of embodiment 4 of the present disclosure;
Fig. 17 is an effect diagram of using the gradient descent optimizer in the one-to-two scene verification of embodiment 4 of the present disclosure;
FIG. 18 is a one-to-one scenario schematic of the present disclosure;
fig. 19 is a schematic diagram of a one-to-two scenario of the present disclosure.
Detailed description:
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless otherwise defined, all technical and scientific terms used in the examples have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
Example 1:
This embodiment provides an agent adaptive decision generation method based on deep reinforcement learning, which includes: acquiring the historical and current environmental state information, environmental reward information and decision action information of the agent, and acquiring the environmental state information at the next moment; storing all the acquired information as experience in a replay buffer; training an agent deep reinforcement learning model, where an ordinary gradient descent optimizer is used to search for the optimal solution of the model during training; and verifying the agent deep reinforcement learning model with an unmanned aerial vehicle anti-interception task in a simulation environment as a carrier.
The Soft Actor-Critic (SAC) algorithm adopted in this embodiment is developed from the DDPG algorithm and is a relatively new deep reinforcement learning algorithm that alleviates the high sample complexity and fragile convergence of current model-free deep reinforcement learning. SAC is a model-free, off-policy deep reinforcement learning algorithm based on the maximum-entropy reinforcement learning framework; by combining off-policy updates with an actor-critic formulation, it achieves state-of-the-art performance on a series of continuous-action benchmark tasks, outperforming prior on-policy and off-policy methods. It can sample effectively in continuous state and action spaces, remains stable, and has strong exploration capability and robustness.
SAC is a maximum-entropy deep reinforcement learning algorithm in which the goal of the actor is to maximize both the expected reward and the entropy, so that the policy remains as random as possible while still completing the task. For a general reinforcement learning algorithm, the goal of agent learning is to learn a policy that maximizes the expected cumulative return, as shown in equation (1):
$J(\pi)=\sum_{t}\mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[r(s_t,a_t)\right]$  (1)
The maximum-entropy reinforcement learning objective is shown in equation (2). Compared with the standard reinforcement learning objective, maximum-entropy reinforcement learning adds an entropy term to the learning objective: the goal is to maximize not only the expected cumulative return but also the entropy of the actions output by the policy.
$J(\pi)=\sum_{t}\mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\left[r(s_t,a_t)+\alpha\,\mathcal{H}\left(\pi(\cdot|s_t)\right)\right]$  (2)
where
$\mathcal{H}\left(\pi(\cdot|s_t)\right)=-\mathbb{E}_{a_t\sim\pi}\left[\log\pi(a_t|s_t)\right]$  (3)
the temperature parameter alpha determines the relative importance of the entropy item to the reward, so that the randomness of the optimal strategy is controlled; the basic goal of this is to randomize the strategy, i.e. to make the probability of each action output as scattered as possible, rather than concentrated on one action; the maximum entropy goal is different from the standard maximum expected reward goal used in traditional reinforcement learning, but when the parameter α → 0, the maximum entropy goal can be transformed into a traditional reinforcement learning goal.
The SAC algorithm contains five neural networks: two Q networks, two state-value networks, and a policy network. The two Q networks $Q_{\theta_i}(s_t,a_t)$ ($i=1,2$) output the value of the selected action; the state-value network $V_\psi(s_t)$ outputs the value of the current state $s_t$; the target state-value network $V_{\bar\psi}(s_{t+1})$ outputs the value of the next state $s_{t+1}$; and the policy network $\pi_\phi(a_t|s_t)$ outputs the action in the current state. First, the state-value function is trained by minimizing the squared residual error, as shown in equation (4):
$J_V(\psi)=\mathbb{E}_{s_t\sim D}\left[\tfrac{1}{2}\left(V_\psi(s_t)-\mathbb{E}_{a_t\sim\pi_\phi}\left[Q_\theta(s_t,a_t)-\log\pi_\phi(a_t|s_t)\right]\right)^2\right]$  (4)
where D is the replay buffer. The gradient of equation (4) is given by equation (5):
$\hat{\nabla}_\psi J_V(\psi)=\nabla_\psi V_\psi(s_t)\left(V_\psi(s_t)-Q_\theta(s_t,a_t)+\log\pi_\phi(a_t|s_t)\right)$  (5)
where the actions are sampled according to the current policy (rather than from the replay buffer). The soft Q function is trained by minimizing the soft Bellman residual, as shown in equation (6):
$J_Q(\theta)=\mathbb{E}_{(s_t,a_t)\sim D}\left[\tfrac{1}{2}\left(Q_\theta(s_t,a_t)-\hat{Q}(s_t,a_t)\right)^2\right]$  (6)
where
$\hat{Q}(s_t,a_t)=r(s_t,a_t)+\gamma\,\mathbb{E}_{s_{t+1}\sim p}\left[V_{\bar\psi}(s_{t+1})\right]$  (7)
Applying stochastic gradient optimization to equation (6) yields equation (8):
$\hat{\nabla}_\theta J_Q(\theta)=\nabla_\theta Q_\theta(s_t,a_t)\left(Q_\theta(s_t,a_t)-r(s_t,a_t)-\gamma V_{\bar\psi}(s_{t+1})\right)$  (8)
where the gradient update introduces the target state-value network $V_{\bar\psi}$.
Finally, the parameters of the policy network are learned by directly minimizing the expected KL divergence, as shown in equation (9):
$J_\pi(\phi)=\mathbb{E}_{s_t\sim D}\left[D_{\mathrm{KL}}\!\left(\pi_\phi(\cdot|s_t)\,\Big\|\,\frac{\exp\left(Q_\theta(s_t,\cdot)\right)}{Z_\theta(s_t)}\right)\right]$  (9)
the strategic gradient method uses re-parameterization techniques, resulting in a lower variance estimate. Adding noise epsilon in the selection of actionstMake it satisfy a certain distribution (such as Gaussian distribution), will be the actual strategy piφIs defined as fφ
$a_t=f_\phi(\epsilon_t;s_t)$  (10)
where $\epsilon_t$ is an input noise vector sampled from some fixed distribution (e.g., a Gaussian). Equation (9) can then be rewritten as
$J_\pi(\phi)=\mathbb{E}_{s_t\sim D,\,\epsilon_t\sim\mathcal{N}}\left[\log\pi_\phi\!\left(f_\phi(\epsilon_t;s_t)\,|\,s_t\right)-Q_\theta\!\left(s_t,f_\phi(\epsilon_t;s_t)\right)\right]$  (11)
Taking the gradient of equation (11) gives:
$\hat{\nabla}_\phi J_\pi(\phi)=\nabla_\phi\log\pi_\phi(a_t|s_t)+\left(\nabla_{a_t}\log\pi_\phi(a_t|s_t)-\nabla_{a_t}Q(s_t,a_t)\right)\nabla_\phi f_\phi(\epsilon_t;s_t)$  (12)
The SAC algorithm reduces the bias introduced during policy improvement by using two Q networks and takes the minimum of the two Q functions as the value gradient in equation (5) and the policy gradient in equation (12); in addition, using two Q functions significantly speeds up training, especially on harder tasks.
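The sketch below illustrates how the reparameterization of equation (10) and the minimum over the two Q networks can be combined when estimating the policy objective of equation (11); the `policy`, `q1` and `q2` objects and their call signatures are assumptions made for illustration, not the exact implementation of this disclosure.

```python
import numpy as np

def policy_objective_sample(states, policy, q1, q2, rng=np.random.default_rng(0)):
    """One-sample estimate of J_pi in equation (11) using the min of two Q estimates.

    policy(s, eps) is assumed to return (action, log_prob) via a_t = f_phi(eps; s_t);
    q1(s, a) and q2(s, a) are assumed to return scalar action-value estimates.
    """
    losses = []
    for s in states:
        eps = rng.standard_normal()            # epsilon_t ~ N(0, 1), as in equation (10)
        a, log_prob = policy(s, eps)           # a_t = f_phi(eps_t; s_t)
        q_min = min(q1(s, a), q2(s, a))        # min over the two Q networks reduces bias
        losses.append(log_prob - q_min)        # integrand of equation (11)
    return float(np.mean(losses))
```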
The SAC algorithm flow chart is shown in table 1:
TABLE 1 Soft Actor-Critic Algorithm flow
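As a companion to the algorithm flow of Table 1, the following is a minimal Python-style sketch of one Soft Actor-Critic training iteration; the `env`, network and buffer objects and their method names are illustrative assumptions rather than the exact implementation used in this disclosure.

```python
def sac_training_step(env, state, policy, q_nets, v_net, v_target, buffer,
                      batch_size=256, tau=0.01):
    """One interaction step plus one gradient step, following the SAC flow of Table 1."""
    # 1. interact with the environment and store the transition as experience
    action = policy.sample(state)
    next_state, reward, done = env.step(action)
    buffer.store((state, action, reward, next_state, done))

    # 2. sample a mini-batch from the replay buffer
    batch = buffer.sample(batch_size)

    # 3. update the value network (eq. 4-5), the two Q networks (eq. 6-8)
    #    and the policy network (eq. 9-12)
    v_net.update(batch, q_nets, policy)
    for q in q_nets:
        q.update(batch, v_target)
    policy.update(batch, q_nets)

    # 4. soft-update the target value network: psi_bar <- tau*psi + (1-tau)*psi_bar
    v_target.soft_update(v_net, tau)

    return env.reset() if done else next_state
```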
In order to study the agent adaptive decision problem, this embodiment builds an unmanned aerial vehicle (UAV) anti-interception simulation experiment environment and studies the application of the deep reinforcement learning Soft Actor-Critic (SAC) algorithm to the agent adaptive decision problem. Specifically, the UAV anti-interception problem is abstracted into a game scene; the simulation environment is built on software platforms such as Pygame, TensorFlow and Python, with PyCharm as the development tool for testing; and the Soft Actor-Critic algorithm is interfaced with the simulation environment so that the model can be trained. The parameter settings of the SAC algorithm are shown in Table 2.
TABLE 2 SAC parameters
1. One-to-one scene
A one-to-one scene is set up, as shown in Fig. 18: our UAV takes off from the origin airport and, while carrying out reconnaissance along the way, must break through the interception of one enemy missile and land safely at the target airport.
During training, different optimization methods are used to find the optimal solution of the model. Gradient Descent (GD) is the most basic type of optimizer and has three common variants: Stochastic Gradient Descent (SGD), Batch Gradient Descent (BGD) and Mini-Batch Gradient Descent (MBGD). The gradient descent method is widely used in machine learning as a common way to minimize a loss function; the solution process is simple, requiring only the first derivative of the loss function, so it is suitable for many large-scale data sets and is the most typical method for solving unconstrained optimization problems. The optimizer used in the original SAC algorithm studied in this embodiment is the Adam optimizer. The Adam algorithm, derived from adaptive moment estimation, differs from the conventional gradient descent method: in ordinary gradient descent the learning rate remains fixed while the weights are updated, whereas Adam sets an independent adaptive learning rate for each parameter by computing first- and second-moment estimates of the gradient; the Adam optimizer can be said to be the most widely used and fastest-converging optimizer at present.
In this embodiment, the Adam optimizer is used first in the simulation experiments, with the initial learning rate set to 3e-4; the training effect is shown in Fig. 2. It can be seen that the performance of the model depends heavily on parameter initialization, the algorithm oscillates easily, and the probability of successful training is low. After replacing the Adam optimizer with an ordinary gradient descent optimizer, again with the learning rate set to 3e-4, the training effect shown in Fig. 3 is obtained: the algorithm performance is relatively stable and the training success rate is improved. Therefore, although the ordinary gradient descent method has drawbacks such as slow training and a tendency to fall into local optima, it performs well in this simulation experiment and converges quickly. In practical applications, the choice of optimizer should be made in combination with the specific problem and depends on the user's familiarity with the optimizer (e.g., how its parameters are tuned).
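A sketch of the optimizer swap described above, assuming a TensorFlow 2.x-style implementation (the disclosure mentions TensorFlow); the helper name `build_optimizer` and the `use_adam` flag are illustrative assumptions.

```python
import tensorflow as tf

def build_optimizer(use_adam: bool, learning_rate: float = 3e-4):
    """Return either the Adam optimizer used by the original SAC code or the
    plain gradient-descent optimizer found to train more stably in these experiments."""
    if use_adam:
        # Adam: per-parameter adaptive learning rates from first/second moment estimates
        return tf.keras.optimizers.Adam(learning_rate=learning_rate)
    # Plain gradient descent: one fixed learning rate for all weights
    return tf.keras.optimizers.SGD(learning_rate=learning_rate)

policy_optimizer = build_optimizer(use_adam=False)   # configuration corresponding to Fig. 3
```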
In Figs. 2 and 3, the target dist curve represents the distance between the unmanned reconnaissance aircraft and the target airport at the end of an episode, and target baseline represents a distance reference line (12 m); if the target dist curve falls below this line, the UAV has landed successfully at the target airport. The anti dist (red) curve represents the shortest distance between our UAV and the enemy interceptor during the flight; its value is scaled to 20 times the original for easier visualization.
2. One-to-two scene
A one-to-two scene is set up, as shown in Fig. 19: our UAV takes off from the origin airport and, while carrying out reconnaissance along the way, must break through the interception of two enemy interceptor missiles and land safely at the target airport.
During training, a test episode is run whenever total_steps reaches a multiple of the test interval; in a test episode, training is considered successful if the distance between our UAV and the target base is less than 12 m and the shortest distance between our UAV and the enemy interceptors during the flight is greater than 6 m.
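The test scheduling and success checks described above can be summarized as the small sketch below; the function names and the `test_interval` parameter are illustrative assumptions (the original does not state the interval value).

```python
def episode_successful(target_dist: float, min_anti_dist: float) -> bool:
    """An episode counts as a training success when the UAV ends within 12 m of the
    target base and never comes closer than 6 m to any interceptor during the flight."""
    return target_dist < 12.0 and min_anti_dist > 6.0

def should_run_test(total_steps: int, test_interval: int) -> bool:
    """Run a test episode whenever the step counter reaches a multiple of the interval."""
    return total_steps % test_interval == 0
```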
Because the task in the one-to-two scene is to avoid interception by two enemy missiles, it is harder than the task in the one-to-one scene: the training difficulty and the training time both increase, and the training success rate is lower. Fig. 4 shows the effect of the Adam optimizer, with the learning rate again set to 3e-4; the model performance is not stable enough, the algorithm oscillates easily, and the probability of successful training is low. Fig. 5 shows the result of training with the ordinary gradient descent optimizer, again with the learning rate set to 3e-4; the model performance is more stable and the training speed is improved, but the training time is longer than in the one-to-one scene.
In Figs. 4 and 5, target dist represents the distance between the unmanned reconnaissance aircraft and the target airport at the end of an episode, and target baseline represents a distance reference line (12 m); if the target dist curve falls below this line, the UAV has landed successfully at the target airport. The anti dist (red) curve represents the shortest distance between our UAV and the enemy interceptors during the flight; its value is scaled to 20 times the original for easier visualization.
Example 2:
This embodiment provides an agent adaptive decision-making method based on deep reinforcement learning; it differs from embodiment 1 in that experience replay in the agent deep reinforcement learning algorithm adopts a prioritized experience replay strategy.
When updating parameters, the off-policy method adopted by SAC can reuse past experience repeatedly, consistently sampling data from it, i.e., experience replay. The experience replay mechanism allows an online reinforcement learning agent to remember and reuse past experience; in earlier work, experience was sampled uniformly from the replay buffer. However, this approach simply replays experiences with the same probability as they were originally experienced, regardless of their importance. This embodiment therefore considers combining SAC with a Prioritized Experience Replay (PER) strategy: PER is a framework for priority-based experience replay that replays important experiences more frequently and therefore learns more effectively.
Experience replay is an indispensable part of deep reinforcement learning algorithms. The experience replay mechanism stores past experience in an experience replay buffer and, during learning, repeatedly draws random samples from it to optimize the policy, as shown in Fig. 1.
With an experience replay mechanism, past experience can be used to train the neural network multiple times, which optimizes the policy; experience replay not only improves the sample efficiency of the deep reinforcement learning algorithm but also improves its stability, making the training process more stable. However, the original experience replay mechanism treats all experience in the replay buffer as equally important and therefore samples it with equal probability, which is clearly not reasonable. By analogy with human learning, experiences with high gain or high return, such as particularly successful attempts or particularly painful lessons, are replayed over and over in the mind, which helps people learn quickly and effectively. Experience that keeps being replayed is therefore more valuable than ordinary experience; in other words, different experiences in the replay buffer have different importance for policy optimization. Pure uniform sampling ignores the difference in importance between samples; if important samples can be sampled more frequently, policy optimization can be accelerated and the sample efficiency of the algorithm improved. Based on this idea of improving sample utilization, this embodiment distinguishes the importance of different experiences in the replay buffer, i.e., the conventional random experience replay in the SAC algorithm is changed to prioritized experience replay.
The prioritized experience replay mechanism works as follows. First, the value of each experience is evaluated through its temporal-difference error (TD-error), which is already computed in many classical reinforcement learning algorithms such as SARSA and Q-learning. Then the experiences in the replay buffer are sorted by the absolute value of the TD-error so that experiences with high TD-error are replayed more frequently. However, doing so changes the frequency with which states are visited and thereby introduces a bias, which can be corrected with importance-sampling weights.
The main idea of prioritized experience replay is to replay important experiences more frequently, so the criterion defining the value of an experience is the core problem. In most reinforcement learning algorithms, the TD-error is used to update the estimate of the action-value function Q(s, a); its value can be used to correct the estimate and reflects how much the agent can learn from the experience. The larger the absolute value of the TD-error, the stronger the correction to the expected action value, and in that case experiences with high TD-error are more likely to be of high value and to be associated with very successful attempts. In addition, experiences with large negative TD-error correspond to the worst behavior of the agent, where the agent has not learned well; replaying these experiences more frequently helps the agent gradually recognize the true consequences of incorrect behavior in the corresponding states and avoid repeating it, thereby improving overall performance, so these poorly learned experiences are also considered to be of high value. In this embodiment, the absolute TD-error |δ| is selected as the index for evaluating the value of an experience; the degree of priority can be defined by the TD-error (the difference between the Q target value and the Q estimate), i.e., the larger the TD-error, the more the prediction accuracy still needs to be improved and the more the sample needs to be learned, hence the higher its priority. Since the SAC algorithm has two Q networks, the absolute TD-error |δ| is redefined as the average of the absolute TD-errors of the two Q networks, i.e.:
$|\delta|=\tfrac{1}{2}\sum_{l=1}^{2}\left|r+\gamma V_{\bar\psi}(s')-Q_{\theta,l}(s,a)\right|$  (13)
where the first two terms, $r+\gamma V_{\bar\psi}(s')$, form the target value of Q, and the third term, $Q_{\theta,l}(s,a)$, is the current estimate of the $l$-th Q network.
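A sketch of how the redefined absolute TD-error of equation (13) could be computed for one experience; the function name and argument layout are assumptions for illustration.

```python
def absolute_td_error(r, gamma, v_target_next, q1_est, q2_est):
    """|delta| of equation (13): the average of the absolute TD-errors of the two Q networks.

    r             : immediate reward of the transition
    gamma         : discount factor
    v_target_next : V_psi_bar(s') from the target state-value network
    q1_est, q2_est: current estimates Q_theta,1(s, a) and Q_theta,2(s, a)
    """
    target = r + gamma * v_target_next
    return 0.5 * (abs(target - q1_est) + abs(target - q2_est))
```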
In order to avoid the loss of sampling diversity caused by greedy TD-error prioritization, which would push the system into an overfitting state, a stochastic sampling method is introduced; the probability of sampling experience i is defined as:
$P(i)=\dfrac{p_i^{\alpha}}{\sum_{k}p_k^{\alpha}}$  (14)
where the priority $p_i=1/\mathrm{rank}(i)>0$ and $\mathrm{rank}(i)$ is the position of experience i in the replay buffer when sorted by absolute TD-error. The parameter α controls how strongly the priority is used: when α = 0 the sampling is uniform, and when α = 1 the sampling follows a purely greedy strategy. Defining the sampling probability in this way adds a random factor to experience selection, since even low-TD-error experiences may still be replayed, which preserves diversity in the sampled experience and helps prevent overfitting of the neural network.
However, because experiences with high TD-error tend to be replayed more frequently, the state visitation frequency changes, which makes the training of the neural network prone to oscillation or even divergence. To solve this problem, importance-sampling weights are used when computing the weight changes:
$w_i=\left(\dfrac{1}{N}\cdot\dfrac{1}{P(i)}\right)^{\beta}$  (15)
where N is the size of the replay buffer and P(i) is the probability of sampling experience i. The parameter β controls the degree of correction: when β = 0, importance sampling is not used at all; when β = 1, it is ordinary (full) importance sampling; and β should approach 1 as training nears its end.
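The rank-based sampling probability of equation (14) and the importance-sampling weights of equation (15) could be computed as in the sketch below; this is a minimal numpy illustration with assumed parameter defaults, not the sum-tree/binary-heap implementation mentioned later in the experiments.

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.7, beta=0.5, rng=np.random.default_rng(0)):
    """Rank-based prioritized sampling (eq. 14) with importance-sampling weights (eq. 15)."""
    n = len(td_errors)
    # rank(i): position of experience i when sorted by |TD-error|, largest first
    ranks = np.empty(n, dtype=int)
    ranks[np.argsort(-np.abs(td_errors))] = np.arange(1, n + 1)
    priorities = 1.0 / ranks                      # p_i = 1 / rank(i)
    probs = priorities ** alpha
    probs /= probs.sum()                          # P(i) = p_i^alpha / sum_k p_k^alpha
    idx = rng.choice(n, size=batch_size, p=probs)
    weights = (1.0 / (n * probs[idx])) ** beta    # w_i = (1 / (N * P(i)))^beta
    weights /= weights.max()                      # normalization for stability (common practice)
    return idx, weights
```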
Based on the prioritized experience replay method described above, the flow of the integrated SAC and prioritized experience replay algorithm is shown in Table 3.
TABLE 3 SAC algorithm flow based on prioritized experience replay
In order to verify the effectiveness of the algorithm, experiments are carried out with the unmanned aerial vehicle anti-interception task in the simulation environment as the carrier. The experimental environment settings are shown in Table 4.
TABLE 4 SAC + PER parameters
1. One-to-one scene
A one-to-one scene is set up, as shown in Fig. 18: starting from the origin airport, our UAV must change its trajectory during reconnaissance along the way to avoid interception by the enemy missile, and then adjust its trajectory to land safely at the target airport.
The SAC algorithm with prioritized experience replay is tested with the UAV anti-interception task on the self-built simulation platform, several groups of experimental results are given, and the results are compared with the original SAC method. During training, different optimization methods are used to find the optimal solution of the model. In the simulation experiments, the Adam optimizer is used first with the initial learning rate set to 3e-4; the training effect is shown in Fig. 6. The performance of the model depends heavily on parameter initialization, the algorithm oscillates easily, and the probability of successful training is low, but the training speed is improved compared with the original SAC algorithm.
To address the problems observed during training, some improvements are made to the optimizer. Because the model performance during training depends heavily on parameter initialization, the stability of the algorithm is poor and the probability of successful training is low; therefore, the Adam optimizer is replaced with an ordinary gradient descent optimizer, the learning rate is set to 3e-4, and the training effect is shown in Fig. 7.
2. One-to-two scene
A one-to-two scene is set up, as shown in Fig. 19: starting from the origin airport, our UAV must change its trajectory during reconnaissance along the way to avoid interception by the two enemy missiles, and then adjust its trajectory to land safely at the target airport. Under the test conditions, training is considered successful if the distance between our UAV and the target base is less than 12 m and the shortest distance between our UAV and the enemy interceptor missiles during the flight is greater than 6 m.
Because the task in the one-to-two scene is to avoid interception by two enemy missiles, it is harder than the task in the one-to-one scene, so the training difficulty and the training time both increase. During training, the Adam optimizer and the gradient descent optimizer are both used to find the optimal solution of the model. First, the Adam optimizer is used with the initial learning rate set to 3e-4; the training effect is shown in Fig. 8. The algorithm oscillates easily, the probability of successful training is low, and the model depends on parameter initialization, so the algorithm is not stable enough; however, the training speed is improved compared with the original SAC algorithm.
As can be seen from Fig. 8, the algorithm oscillates easily and the probability of successful training is low. Therefore, the Adam optimizer is replaced with an ordinary gradient descent optimizer, the learning rate is set to 3e-4, and the training effect is shown in Fig. 9.
Table 5 reports the average total number of training steps required to complete the UAV anti-interception task, obtained over repeated experiments, for the SAC algorithm based on uniform sampling and the SAC algorithm with prioritized experience replay; in both the one-to-one and one-to-two scenes, the SAC algorithm with prioritized experience replay is significantly better than the original SAC algorithm. In addition, the reward curve of the prioritized-replay variant shows fewer spikes, indicating that SAC with the prioritized experience replay mechanism behaves more stably during training. The experimental results show that, compared with the original SAC algorithm, the SAC algorithm with prioritized experience replay not only achieves comparable performance in a shorter training time, but also has a more stable training process, is less sensitive to changes in some hyper-parameters, and has stronger robustness.
TABLE 5 SAC vs SAC + PER algorithm
Example 3:
Unlike embodiments 1 and 2, in this embodiment the experience replay in the agent deep reinforcement learning algorithm adopts an emphasizing-recent-experience replay strategy.
In embodiment 2, replaying important experiences more frequently further improves the sample efficiency of the experience replay mechanism in the SAC algorithm, thereby increasing learning efficiency and convergence speed, and testing the SAC algorithm with prioritized experience replay on the UAV anti-interception task shows that the algorithm is effective. This embodiment adopts another method for improving the experience replay mechanism, the Emphasizing Recent Experience (ERE) replay strategy, and proposes the SAC + ERE algorithm.
The Emphasizing Recent Experience (ERE) replay strategy is a simple but powerful off-policy sampling method that emphasizes recently observed data while not forgetting knowledge learned in the past. When performing updates, the ERE algorithm samples more aggressively from recent experience and orders the parameter updates so that updates from old data do not overwrite updates from new data. The core idea is that, in the parameter update phase, the first mini-batch is sampled from all data in the replay buffer, and for each subsequent mini-batch the sampling range is gradually shrunk so as to increase the probability that recent experience is sampled. This sampling method has two key parts: first, recent data is sampled with higher frequency; second, the update order is arranged so that updates using old data do not override updates using new data.
Specifically, suppose K updates are to be performed in succession in the current update phase, i.e., K mini-batches of data need to be fetched in turn from the replay buffer. Let N be the maximum size of the replay buffer; for the k-th update (1 ≤ k ≤ K), sampling is performed uniformly among the most recent $c_k$ data points:
$c_k=\max\!\left(N\cdot\eta^{\,k\frac{1000}{K}},\;c_{\min}\right)$  (16)
where η ∈ (0, 1] is a hyper-parameter that determines the degree of emphasis placed on recent data. When η = 1 this is equivalent to uniform sampling; when η < 1, $c_k$ gradually decreases with each update. A lower bound $c_{\min}$ is imposed on $c_k$, which avoids sampling from only a very small amount of recent data (which could lead to overfitting). The simpler formula $c_k=N\cdot\eta^{k}$ is not used directly because the length of an episode may vary greatly from one environment to another; the constant 1000 in equation (16) may also be set to a different value, but doing so changes the optimal value of η. According to equation (16), uniform sampling is always performed at the first update, while at the last update:
$c_K=\max\!\left(N\cdot\eta^{1000},\;c_{\min}\right)$  (17)
This sampling method has a double effect. The first effect is that the first mini-batch is sampled uniformly from the whole buffer, the second mini-batch is sampled uniformly from the whole buffer excluding the oldest data points, and as k increases more old data is excluded; clearly, the more recent a data point is, the higher its probability of being sampled. The second effect is that the sampling proceeds in order: first all data in the buffer is sampled, then the sampling range is gradually narrowed until only the most recent data is sampled, which reduces the probability that parameter changes driven by new data are overwritten by parameter changes driven by old data. It is assumed that this process allows the value function to be better approximated near recently visited states while still maintaining an acceptable approximation for states visited further in the past.
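A sketch of the shrinking sampling range of equation (16); the buffer is represented as a plain list, and the helper names and the default values of η and $c_{\min}$ are illustrative assumptions.

```python
import numpy as np

def ere_sampling_range(k, K, buffer_size, eta=0.996, c_min=5000):
    """c_k of equation (16): how many of the most recent data points the k-th
    of K consecutive mini-batch updates may sample from (defaults are illustrative)."""
    c_k = int(buffer_size * eta ** (k * 1000.0 / K))
    return max(min(c_k, buffer_size), c_min)

def ere_sample(buffer, k, K, batch_size, rng=np.random.default_rng(0)):
    """Uniformly sample a mini-batch from the c_k most recent experiences."""
    c_k = ere_sampling_range(k, K, len(buffer))
    recent = buffer[-c_k:]                       # the newest c_k transitions
    idx = rng.integers(0, len(recent), size=batch_size)
    return [recent[i] for i in idx]
```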
The appropriate value of η depends on how fast the agent learns and how fast past experience becomes outdated. When the agent learns quickly, η should be decreased so as to place more emphasis on the most recent data; when the agent learns slowly, η should be increased to move closer to uniform sampling so that the agent can make use of more past experience data. A simple solution is to anneal η during training: let T be the total number of training time steps, and let $\eta_0$ and $\eta_T$ be the initial and final values of η (setting $\eta_T = 1$ recovers uniform sampling by the end of training); then the value of η at time step t is:
$\eta_t=\eta_0+(\eta_T-\eta_0)\cdot\dfrac{t}{T}$  (18)
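The annealing schedule of equation (18) is a simple linear interpolation, sketched below with assumed parameter names and defaults.

```python
def eta_at_step(t, total_steps, eta_0=0.996, eta_T=1.0):
    """Linearly anneal eta from eta_0 toward eta_T over training (equation (18));
    eta_T = 1 means sampling becomes uniform by the end of training."""
    return eta_0 + (eta_T - eta_0) * (t / float(total_steps))

# Example: eta halfway through a 1e6-step run
print(eta_at_step(500_000, 1_000_000))
```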
based on the above-described playback method of recent experience emphasis, the SAC and the recent experience emphasis integration algorithm are shown in table 6.
Table 6 SAC algorithm flow emphasizing recent experience
For comparability and repeatability of the experimental results, this embodiment uses the same SAC code implementation for each variant; the neural network structure, activation function, optimizer, replay buffer size, learning rate and other hyper-parameters are the same as in the SAC algorithm, as shown in Table 7. In the experiments, a low-dimensional neural network is first trained with the Adam optimizer and an initial learning rate of 3e-4, and an L2 regularization term is added to the loss function for the action-value network update to keep the neural network weights from growing too large. The discount rate γ is set to 0.99, the update rate of the target network is set to 0.01, and the size of the replay buffer is set to 2e5. The storage structure of the replay buffer is a circular queue, and the data structure of the priority queue is a binary heap, in order to minimize the extra time cost of prioritized sampling. During interaction with the environment, the agent receives low-dimensional vectors, including velocity, position and coordinate information, as observations (i.e., states). The SAC algorithm emphasizing recent experience is tested with the UAV anti-interception task on the self-built simulation platform, several groups of experimental results are given, and the results are compared with the original SAC algorithm and the SAC algorithm with prioritized experience replay.
TABLE 7 SAC + ERE parameters
1. One-to-one scene
A one-to-one scene is set up, as shown in Fig. 18. The main task is for our UAV, starting from the origin airport, to change its trajectory during reconnaissance along the way to avoid interception by the enemy missile, and then adjust its trajectory to land safely at the target airport. Under the test conditions, training is considered successful if the distance between our UAV and the target base is less than 12 m and the shortest distance between our UAV and the enemy interceptor missile during the flight is greater than 6 m.
In the simulation experiments, the Adam optimizer is used first with the initial learning rate set to 3e-4; the training effect is shown in Fig. 10. The performance of the model depends heavily on parameter initialization, the algorithm oscillates easily, and the probability of successful training is low, but the training speed is improved compared with the original SAC and SAC + PER algorithms.
The Adam optimizer is then replaced with an ordinary gradient descent optimizer and the learning rate is set to 3e-4; the training effect is shown in Fig. 11. After switching the optimizer to ordinary gradient descent, the stability of the algorithm improves and convergence is faster.
2. One-to-two scene
A one-to-two scene is set up, as shown in Fig. 19: starting from the origin airport, our UAV must change its trajectory during reconnaissance along the way to avoid interception by the two enemy missiles, and then adjust its trajectory to land safely at the target airport.
Under the test conditions, training is considered successful if the distance between our UAV and the target base is less than 12 m and the shortest distance between our UAV and the enemy interceptor missiles during the flight is greater than 6 m. In the simulation experiments of this embodiment, the Adam optimizer is used first with the initial learning rate set to 3e-4; the training effect is shown in Fig. 12. The Adam optimizer is then replaced with an ordinary gradient descent optimizer and the learning rate is set to 3e-4; the training effect is shown in Fig. 13. Although the algorithm oscillates easily at the beginning, the model performance stabilizes in the later stage and the training success rate is high, but the training time increases compared with the one-to-one scene.
Table 8 shows that the SAC algorithm emphasizing recent experience is significantly better than the original SAC algorithm and the SAC algorithm with prioritized experience replay: it not only achieves better performance within a shorter training time, but also has a more stable training process, is less sensitive to changes in some hyper-parameters, and has stronger robustness.
TABLE 8 SAC vs. SAC + PER and SAC + ERE algorithms
Example 4:
Unlike embodiments 1, 2 and 3, in this embodiment the experience replay in the agent deep reinforcement learning algorithm adopts a combined strategy of emphasizing recent experience replay and prioritized experience replay.
The SAC + PER and SAC + ERE algorithms proposed in embodiments 2 and 3 both perform well on the UAV anti-interception task: they outperform the SAC algorithm, improve the learning efficiency of the agent and the training success rate, accelerate convergence, make the algorithm more stable, and provide a better engineering route for the agent adaptive decision problem. Considering their respective advantages, this embodiment proposes a SAC + ERE + PER algorithm that combines the two methods on the basis of SAC + PER and SAC + ERE. The method improves two main aspects: first, the sampling range is gradually reduced during a series of mini-batch updates; second, prioritized sampling is used instead of uniform sampling, with the sampling probability proportional to the absolute TD-error of each data point. The ERE and PER replay strategies are combined with the SAC algorithm by first determining the sampling range and then selecting the mini-batch data to be used for the update according to the priorities of the different experiences within that range.
Assume that K mini-batch updates are performed after a certain amount of data has been collected. Let N be the maximum size of the replay buffer, and let $D_{c_k}$ denote the $c_k$ most recent data points in the replay buffer. The probability of sampling data point i is then calculated as:
$P(i)=\dfrac{p_i^{\alpha}}{\sum_{j\in D_{c_k}}p_j^{\alpha}},\quad i\in D_{c_k}$  (19)
where $D_{c_k}$ contains the $c_k$ most recent experience data in the experience buffer.
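A sketch combining the two mechanisms as described: first restrict attention to the $c_k$ most recent experiences, then sample within that window according to rank-based priorities, as in equation (19). The helper name and the default values of η, $c_{\min}$ and α are illustrative assumptions, reusing the logic of the earlier sketches.

```python
import numpy as np

def ere_per_sample(td_errors, k, K, batch_size, eta=0.996, c_min=5000,
                   alpha=0.7, rng=np.random.default_rng(0)):
    """Sample a mini-batch from D_ck (the c_k most recent experiences) with
    probability proportional to rank-based priority, as in equation (19)."""
    n = len(td_errors)
    c_k = max(min(int(n * eta ** (k * 1000.0 / K)), n), min(c_min, n))
    window = np.arange(n - c_k, n)                     # indices of the c_k newest experiences
    ranks = np.empty(c_k, dtype=int)
    ranks[np.argsort(-np.abs(np.asarray(td_errors)[window]))] = np.arange(1, c_k + 1)
    probs = (1.0 / ranks) ** alpha
    probs /= probs.sum()                               # P(i) restricted to D_ck
    return window[rng.choice(c_k, size=batch_size, p=probs)]
```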
Based on the replay method described above, the integrated algorithm of SAC with prioritized experience replay and emphasis on recent experience is shown in Table 9.
TABLE 9 SAC algorithm flow emphasizing recent experience with prioritized replay
For the hybrid SAC + PER + ERE algorithm, in order to ensure comparability, its hyper-parameters are not tuned separately; the same parameter values as the SAC + PER and SAC + ERE algorithms are used, as shown in Table 10. The SAC algorithm emphasizing recent experience with prioritized replay is then tested with the UAV anti-interception task on the self-built simulation platform, and several groups of experimental results are given; this embodiment takes learning speed, learning stability and convergence speed as the three indices for evaluating algorithm performance, and compares the SAC + PER + ERE algorithm with the original SAC, SAC + PER and SAC + ERE algorithms.
TABLE 10 SAC + PER + ERE parameters
1. One-to-one scene
A one-to-one scene is set up, as shown in Fig. 18: starting from the origin airport, our UAV must change its trajectory during reconnaissance along the way to avoid interception by the enemy missile, and then adjust its trajectory to land safely at the target airport.
During training, different optimization methods are used to find the optimal solution of the model. In the simulation experiments, the Adam optimizer is used first with the initial learning rate set to 3e-4; the training effect is shown in Fig. 14. The performance of the model depends heavily on parameter initialization and the algorithm oscillates easily, but the training efficiency is improved compared with the SAC, SAC + PER and SAC + ERE algorithms.
The Adam optimizer is then replaced with an ordinary gradient descent optimizer and the learning rate is set to 3e-4; the training effect is shown in Fig. 15. The algorithm performance stabilizes and the training success rate is improved.
2. One-to-two scene
A one-to-two scene is set up, as shown in Fig. 19: starting from the origin airport, our UAV must change its trajectory during reconnaissance along the way to avoid interception by the two enemy missiles, and then adjust its trajectory to land safely at the target airport.
In the simulation experiments of this embodiment, the Adam optimizer is used first with the initial learning rate set to 3e-4; the training effect is shown in Fig. 16. Compared with the original SAC, SAC + PER and SAC + ERE algorithms, the learning efficiency of the agent is clearly improved and the training task is completed in a shorter time.
It can also be seen that the algorithm oscillates easily and the probability of successful training is low. The Adam optimizer is therefore replaced with an ordinary gradient descent optimizer and the learning rate is set to 3e-4; the training effect is shown in Fig. 17. Compared with the one-to-one scene, however, the task difficulty and the training time increase.
Table 11 reports, for both the one-to-one and one-to-two scenes, the average total number of training steps required to complete the UAV anti-interception task, obtained over repeated experiments, for the SAC algorithm based on uniform sampling, the SAC algorithm with prioritized experience replay, the SAC algorithm emphasizing recent experience, and the SAC algorithm emphasizing recent experience with prioritized replay.
TABLE 11 SAC vs. SAC + PER, SAC + ERE, and SAC + ERE + PER algorithms
Example 5:
This embodiment provides an agent adaptive decision generation system based on deep reinforcement learning, which includes an experience acquisition module, a model training module and a verification module;
the experience acquisition module configured to: acquiring historical and current environmental state information, environmental reward information and decision action information of the intelligent agent; acquiring environmental state information at the next moment; storing all the acquired information as experience in a playback buffer area;
the model training module configured to: train an agent deep reinforcement learning model; experience replay in the agent deep reinforcement learning algorithm adopts a prioritized experience replay strategy, i.e., the importance of different experiences in the experience replay buffer is distinguished; and search for the optimal solution of the model by using an ordinary gradient descent optimizer during training of the agent deep reinforcement learning model;
the verification module configured to: verify the agent deep reinforcement learning model, using an unmanned aerial vehicle anti-interception task in a simulation environment as the carrier.
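As an illustration of the experience acquisition step, a minimal replay buffer that stores each transition as a (state, action, reward, next_state, done) tuple might look like the sketch below; the fixed-capacity deque and the uniform sampling method are assumptions made for the sketch, not details taken from the described system:

    import random
    from collections import deque

    class ReplayBuffer:
        # Stores transitions and supports uniform sampling of mini-batches.
        def __init__(self, capacity=1_000_000):
            self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

        def store(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

        def __len__(self):
            return len(self.buffer)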
In this embodiment, distinguishing the importance of different experiences in the experience replay buffer includes:
evaluating the value of each experience through its temporal-difference (TD) error; the absolute TD error of an experience is selected as the index of its value, where the absolute TD error is the average of the absolute TD errors given by the two Q networks of the deep reinforcement learning algorithm;
sorting the experiences in the replay buffer according to their absolute TD errors, so that experiences with large TD errors are replayed more frequently;
defining the probability with which each experience is sampled by a stochastic sampling method; and applying importance-sampling weights when computing the weight updates.
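A minimal sketch of this prioritized sampling step, assuming a proportional-priority scheme; the exponents alpha and beta and the small constant eps are illustrative hyper-parameters, not values taken from the original:

    import numpy as np

    def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
        # td_errors: per-experience absolute TD errors, e.g. the average of the
        # absolute TD errors of the two Q networks, as described above.
        priorities = (np.abs(td_errors) + eps) ** alpha      # priority of each experience
        probs = priorities / priorities.sum()                # sampling probability P(i)
        idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
        weights = (len(td_errors) * probs[idx]) ** (-beta)   # importance-sampling correction
        weights /= weights.max()                             # normalize for stability
        return idx, weights

The returned weights scale each sampled experience's contribution to the gradient update, compensating for the non-uniform sampling probabilities.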
In this embodiment, experience replay in the agent deep reinforcement learning algorithm also adopts an emphasizing recent experience replay strategy: in the parameter update stage, the first mini-batch is sampled from all data in the replay buffer, and for each subsequent mini-batch the sampling range is gradually narrowed, so that the probability of sampling recent experience increases. Specifically:
Assume that K consecutive updates are performed in the current update stage, i.e. K mini-batches need to be taken from the replay buffer in a loop. Let N be the maximum size of the replay buffer; for the k-th update, with 1 ≤ k ≤ K, samples are drawn uniformly from the most recent c_k data points, where
c_k = max(N·η^(k·1000/K), c_min)
where η ∈ (0, 1] is a hyper-parameter that determines how strongly recent data are emphasized; the value of η depends on how fast the agent learns and how quickly past experience becomes outdated, and c_min is the minimum value allowed for c_k.
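A short sketch of how the sampling range c_k can be computed and used, assuming the buffer is a list ordered from oldest to newest; the default values of eta and c_min are illustrative:

    import random

    def ere_sampling_range(N, k, K, eta=0.996, c_min=5000):
        # Size of the most-recent window for the k-th of K updates in this stage.
        return max(int(N * eta ** (k * 1000 / K)), c_min)

    def sample_recent(buffer, batch_size, k, K, eta=0.996, c_min=5000):
        c_k = min(ere_sampling_range(len(buffer), k, K, eta, c_min), len(buffer))
        recent = buffer[-c_k:]                     # keep only the most recent c_k experiences
        return random.sample(recent, min(batch_size, len(recent)))

As k grows within one update stage, c_k shrinks, so later mini-batches are drawn from progressively more recent experience.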
In this embodiment, experience replay in the agent deep reinforcement learning algorithm adopts a strategy combining emphasis on recent experience replay with prioritized experience replay: the sampling range is determined first, and the mini-batch data are then selected for the update according to the priorities of the different experiences within that range.
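A sketch combining the two mechanisms, restricting sampling to the most recent c_k experiences and then drawing a prioritized mini-batch within that window; it reuses the illustrative helpers ere_sampling_range and sample_prioritized defined above:

    import numpy as np

    def sample_ere_per(buffer, td_errors, batch_size, k, K,
                       eta=0.996, c_min=5000, alpha=0.6, beta=0.4):
        # First determine the recent window (ERE), then sample within it by priority (PER).
        c_k = min(ere_sampling_range(len(buffer), k, K, eta, c_min), len(buffer))
        start = len(buffer) - c_k                  # index where the recent window begins
        idx_in_window, weights = sample_prioritized(np.asarray(td_errors)[start:],
                                                    batch_size, alpha, beta)
        batch = [buffer[start + i] for i in idx_in_window]
        return batch, weights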
In this embodiment, the ordinary gradient descent optimizer adopts the stochastic gradient descent method, the batch gradient descent method or the mini-batch gradient descent method.
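The three variants differ only in how much data is used per gradient step. A compact sketch, assuming grad_fn(w, X, y) returns the gradient of the loss with respect to the parameters w (all names here are illustrative):

    import numpy as np

    def gradient_step(w, X, y, grad_fn, lr=3e-4, mode="mini-batch", batch_size=64):
        if mode == "stochastic":                  # one randomly chosen sample per step
            i = np.random.randint(len(X))
            g = grad_fn(w, X[i:i + 1], y[i:i + 1])
        elif mode == "batch":                     # the full data set per step
            g = grad_fn(w, X, y)
        else:                                     # a random mini-batch per step
            idx = np.random.choice(len(X), size=min(batch_size, len(X)), replace=False)
            g = grad_fn(w, X[idx], y[idx])
        return w - lr * g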
Example 6:
This embodiment provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the intelligent agent self-adaptive decision generation method based on deep reinforcement learning according to embodiment 1, 2, 3 or 4.
Example 7:
the present embodiment provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method for generating an intelligent agent adaptive decision based on deep reinforcement learning according to embodiment 1, embodiment 2, embodiment 3, or embodiment 4.
The method improves the experience replay mechanism of the SAC algorithm with the prioritized experience replay (PER) algorithm and the emphasizing recent experience (ERE) replay algorithm. PER samples according to priority and takes the importance of different experiences into account, so that important experiences are replayed with higher probability, which accelerates learning and speeds up convergence; ERE is a simple but powerful experience replay method that emphasizes recent experience. Experiments show that both PER and ERE noticeably improve the learning speed of the SAC algorithm and yield better results. Comparing SAC + ERE with SAC + PER, the ERE algorithm is easier to implement because it needs no special data structure, whereas PER relies on a sum-tree to keep the sampling complexity low; moreover, SAC + ERE performs better than SAC + PER in these experiments, although a more elaborate SAC + PER configuration might do better. Each method has its own strengths: when rewards are sparse, PER is preferable because it handles sparse rewards well, while ERE focuses on emphasizing the most recent data. The SAC + ERE + PER algorithm combines the two and can achieve better performance, but it loses the simplicity of SAC + ERE and adds some extra computational cost due to the PER part. Overall, whether SAC + PER, SAC + ERE or SAC + ERE + PER is used, good results are obtained on the unmanned aerial vehicle anti-interception task: compared with the plain SAC algorithm, the training effect is better, the learning efficiency is higher, convergence is faster and the algorithm is more stable, which provides a practical engineering route for intelligent agent self-adaptive decision problems.
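Since the comparison above notes that PER relies on a sum-tree to keep sampling efficient, a compact sketch of such a structure is given below; it is a generic illustration of the data structure, not code from this patent:

    import numpy as np

    class SumTree:
        # Leaves hold experience priorities; internal nodes hold subtree sums,
        # so drawing an index proportionally to priority costs O(log n).
        def __init__(self, capacity):
            self.capacity = capacity
            self.tree = np.zeros(2 * capacity - 1)
            self.write = 0                         # next leaf position to overwrite

        def add(self, priority):
            self.update(self.write + self.capacity - 1, priority)
            self.write = (self.write + 1) % self.capacity

        def update(self, leaf, priority):
            change = priority - self.tree[leaf]
            self.tree[leaf] = priority
            while leaf != 0:                       # propagate the change up to the root
                leaf = (leaf - 1) // 2
                self.tree[leaf] += change

        def sample(self, value):
            # Return the data index whose cumulative-priority interval contains value,
            # where value is drawn uniformly from [0, total priority).
            idx = 0
            while idx < self.capacity - 1:         # descend until a leaf is reached
                left, right = 2 * idx + 1, 2 * idx + 2
                if value <= self.tree[left]:
                    idx = left
                else:
                    value -= self.tree[left]
                    idx = right
            return idx - (self.capacity - 1)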
The above description provides only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in its protection scope.

Claims (10)

1. The intelligent agent self-adaptive decision generation method based on deep reinforcement learning is characterized by comprising the following steps:
acquiring historical and current environmental state information, environmental reward information and decision action information of the intelligent agent; acquiring environmental state information at the next moment;
storing all of the acquired information as experience in a replay buffer;
training an agent deep reinforcement learning model, wherein experience replay in the agent deep reinforcement learning algorithm adopts a prioritized experience replay strategy, namely the importance of different experiences in the experience replay buffer is distinguished; and searching for the optimal solution of the model with an ordinary gradient descent optimizer during training of the agent deep reinforcement learning model;
and verifying the agent deep reinforcement learning model, using an unmanned aerial vehicle anti-interception task in a simulation environment as the carrier.
2. The method of claim 1, wherein distinguishing the importance of different experiences in the experience replay buffer comprises:
evaluating the value of each experience through its temporal-difference (TD) error; the absolute TD error of an experience is selected as the index of its value, where the absolute TD error is the average of the absolute TD errors given by the two Q networks of the deep reinforcement learning algorithm;
sorting the experiences in the replay buffer according to their absolute TD errors, so that experiences with large TD errors are replayed more frequently;
defining the probability with which each experience is sampled by a stochastic sampling method; and applying importance-sampling weights when computing the weight updates.
3. The intelligent agent self-adaptive decision generation method based on deep reinforcement learning as claimed in claim 1, characterized in that experience replay in the agent deep reinforcement learning algorithm adopts an emphasizing recent experience replay strategy: in the parameter update stage, the first mini-batch is sampled from all data in the replay buffer, and for each subsequent mini-batch the sampling range is gradually narrowed, so that the probability of sampling recent experience increases; specifically:
assume that K consecutive updates are performed in the current update stage, i.e. K mini-batches need to be taken from the replay buffer in a loop; let N be the maximum size of the replay buffer; for the k-th update, with 1 ≤ k ≤ K, samples are drawn uniformly from the most recent c_k data points, where
c_k = max(N·η^(k·1000/K), c_min)
where η ∈ (0, 1] is a hyper-parameter that determines how strongly recent data are emphasized; the value of η depends on how fast the agent learns and how quickly past experience becomes outdated, and c_min is the minimum value allowed for c_k.
4. The intelligent agent self-adaptive decision generation method based on deep reinforcement learning as claimed in claim 1, characterized in that experience replay in the agent deep reinforcement learning algorithm adopts a strategy combining emphasis on recent experience replay with prioritized experience replay: the sampling range is determined first, and the mini-batch data are then selected for the update according to the priorities of the different experiences within that range.
5. The method of claim 1, wherein the ordinary gradient descent optimizer employs a stochastic gradient descent method, a batch gradient descent method or a mini-batch gradient descent method.
6. An intelligent agent self-adaptive decision generation system based on deep reinforcement learning, characterized by comprising: an experience acquisition module, a model training module and a verification module;
the experience acquisition module configured to: acquire historical and current environmental state information, environmental reward information and decision action information of the intelligent agent; acquire the environmental state information at the next moment; and store all of the acquired information as experience in the replay buffer;
the model training module configured to: train the agent deep reinforcement learning model, wherein experience replay in the agent deep reinforcement learning algorithm adopts a prioritized experience replay strategy, namely the importance of different experiences in the experience replay buffer is distinguished; and search for the optimal solution of the model with an ordinary gradient descent optimizer during training of the agent deep reinforcement learning model;
the verification module configured to: verify the agent deep reinforcement learning model, using an unmanned aerial vehicle anti-interception task in a simulation environment as the carrier.
7. The deep reinforcement learning-based agent adaptive decision making system according to claim 6, wherein distinguishing the importance of different experiences in the experience replay buffer comprises:
evaluating the value of each experience through its temporal-difference (TD) error; the absolute TD error of an experience is selected as the index of its value, where the absolute TD error is the average of the absolute TD errors given by the two Q networks of the deep reinforcement learning algorithm;
sorting the experiences in the replay buffer according to their absolute TD errors, so that experiences with large TD errors are replayed more frequently;
defining the probability with which each experience is sampled by a stochastic sampling method; and applying importance-sampling weights when computing the weight updates.
8. The intelligent agent self-adaptive decision generation system based on deep reinforcement learning as claimed in claim 6, characterized in that experience replay in the agent deep reinforcement learning algorithm adopts an emphasizing recent experience replay strategy: in the parameter update stage, the first mini-batch is sampled from all data in the replay buffer, and for each subsequent mini-batch the sampling range is gradually narrowed, so that the probability of sampling recent experience increases; specifically:
assume that K consecutive updates are performed in the current update stage, i.e. K mini-batches need to be taken from the replay buffer in a loop; let N be the maximum size of the replay buffer; for the k-th update, with 1 ≤ k ≤ K, samples are drawn uniformly from the most recent c_k data points, where
c_k = max(N·η^(k·1000/K), c_min)
where η ∈ (0, 1] is a hyper-parameter that determines how strongly recent data are emphasized; the value of η depends on how fast the agent learns and how quickly past experience becomes outdated, and c_min is the minimum value allowed for c_k.
9. The intelligent agent self-adaptive decision generation system based on deep reinforcement learning as claimed in claim 6, characterized in that experience replay in the agent deep reinforcement learning algorithm adopts a strategy combining emphasis on recent experience replay with prioritized experience replay: the sampling range is determined first, and the mini-batch data are then selected for the update according to the priorities of the different experiences within that range.
10. The deep reinforcement learning-based agent adaptive decision making system according to claim 6, wherein the ordinary gradient descent optimizer adopts a stochastic gradient descent method, a batch gradient descent method or a mini-batch gradient descent method.
CN202110729857.7A 2021-06-29 2021-06-29 Deep reinforcement learning-based intelligent self-adaptive decision generation method and system Active CN113487039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110729857.7A CN113487039B (en) 2021-06-29 2021-06-29 Deep reinforcement learning-based intelligent self-adaptive decision generation method and system

Publications (2)

Publication Number Publication Date
CN113487039A true CN113487039A (en) 2021-10-08
CN113487039B CN113487039B (en) 2023-08-22

Family

ID=77936697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110729857.7A Active CN113487039B (en) 2021-06-29 2021-06-29 Deep reinforcement learning-based intelligent self-adaptive decision generation method and system

Country Status (1)

Country Link
CN (1) CN113487039B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109856384A (en) * 2018-11-28 2019-06-07 长春博瑞农牧集团股份有限公司 A method of yeast culture effective substance group is evaluated using metabonomic technology
CN111950703A (en) * 2020-08-03 2020-11-17 清华大学深圳国际研究生院 Reinforced learning method and computer readable storage medium
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents
CN112734014A (en) * 2021-01-12 2021-04-30 山东大学 Experience playback sampling reinforcement learning method and system based on confidence upper bound thought
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAN CHENG et al.: "Autonomous Decision-Making Generation of UAV based on Soft Actor-Critic Algorithm", IEEE *
LIU Qingqiang et al.: "SAC Reinforcement Learning Algorithm Based on Prioritized Experience Replay", Journal of Jilin University (Information Science Edition), vol. 39, no. 2 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919475A (en) * 2021-12-16 2022-01-11 中国科学院自动化研究所 Robot skill learning method and device, electronic equipment and storage medium
CN113919475B (en) * 2021-12-16 2022-04-08 中国科学院自动化研究所 Robot skill learning method and device, electronic equipment and storage medium
CN115392438A (en) * 2022-09-14 2022-11-25 吉林建筑大学 Deep reinforcement learning algorithm, device and storage medium based on multi-Agent environment
CN115392438B (en) * 2022-09-14 2023-07-07 吉林建筑大学 Deep reinforcement learning algorithm, equipment and storage medium based on multi-Agent environment
CN115826621A (en) * 2022-12-27 2023-03-21 山西大学 Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning
CN115826621B (en) * 2022-12-27 2023-12-01 山西大学 Unmanned aerial vehicle motion planning method and system based on deep reinforcement learning
CN116149225A (en) * 2023-02-07 2023-05-23 蚌埠高灵传感系统工程有限公司 Climbing frame lifting control system based on LoRA distributed sensor
CN116149225B (en) * 2023-02-07 2024-03-19 蚌埠高灵传感系统工程有限公司 Climbing frame lifting control system based on LoRA distributed sensor

Also Published As

Publication number Publication date
CN113487039B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
Yin et al. Knowledge transfer for deep reinforcement learning with hierarchical experience replay
CN113487039A (en) Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning
Lipton et al. Bbq-networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems
Kaplanis et al. Continual reinforcement learning with complex synapses
Hare Dealing with sparse rewards in reinforcement learning
US20170032245A1 (en) Systems and Methods for Providing Reinforcement Learning in a Deep Learning System
EP3485432A1 (en) Training machine learning models on multiple machine learning tasks
WO2017004626A1 (en) Systems and methods for providing reinforcement learning in a deep learning system
Xu et al. Offline reinforcement learning with soft behavior regularization
CN109940614B (en) Mechanical arm multi-scene rapid motion planning method integrating memory mechanism
US11836590B2 (en) User intent classification using a multi-agent reinforcement learning framework
Wu et al. Multi-critic DDPG method and double experience replay
Lv et al. Stochastic double deep Q-network
Liu et al. Efficient reinforced feature selection via early stopping traverse strategy
Fu et al. Greedy when sure and conservative when uncertain about the opponents
Gao et al. Prioritized experience replay method based on experience reward
CN113341696A (en) Intelligent setting method for attitude control parameters of carrier rocket
US20200302270A1 (en) Budgeted neural network architecture search system and method
Harde et al. Design and implementation of ACO feature selection algorithm for data stream mining
KR20240034804A (en) Evaluating output sequences using an autoregressive language model neural network
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
Chen et al. Attention Loss Adjusted Prioritized Experience Replay
Jin et al. Stabilizing multi-agent deep reinforcement learning by implicitly estimating other agents’ behaviors
Mandai et al. Alternative multitask training for evaluation functions in game of Go
Chen et al. Distributed continuous control with meta learning on robotic arms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant