CN112396180A - Deep Q learning network optimization method based on dynamic teaching data and behavior cloning - Google Patents

Deep Q learning network optimization method based on dynamic teaching data and behavior cloning

Info

Publication number: CN112396180A (published 2021-02-23); granted as CN112396180B (published 2021-06-29)
Authority: CN (China)
Application number: CN202011338992.0A (filed 2020-11-25; priority date 2020-11-25)
Other languages: Chinese (zh)
Prior art keywords: network, teaching data, training, data set, behavior
Inventors: 李小双, 王晓, 王飞跃, 金峻臣, 陈薏竹
Original and current assignee / applicant: Institute of Automation of Chinese Academy of Science
Legal status: Granted; Active

Classifications

    • G06N3/08: Physics; Computing; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Learning methods
    • G06N3/045: Physics; Computing; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • A63F13/67: Human necessities; Sports, games, amusements; Video games, i.e. games using an electronically generated display having two or more dimensions; Generating or modifying game content before or while executing the game program, adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use


Abstract

The invention belongs to the field of information processing and specifically relates to a deep Q learning network optimization method based on dynamic teaching data and behavior cloning. It aims to solve two problems: the state-action space covered by historical teaching data is limited, and imperfect teaching data can bias the direction of strategy optimization. The method comprises the following steps: performing supervised training on an initial behavior clone network to obtain a first behavior clone network; pre-training a main network and a target network with the same network structure based on a second teaching data set, and then further training the main network with a mixed loss function that includes an expert loss; if a training round achieves the historically optimal reward value, updating the second teaching data set; and repeating the training with the updated second teaching data set until an end condition is reached. By continuously adding high-quality sample data during training, the method improves the performance of the strategy represented by the teaching data set and keeps contributing positively to the improvement of the model.

Description

Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
Technical Field
The invention belongs to the field of information processing, and particularly relates to a deep Q learning network optimization method based on dynamic teaching data and behavior cloning.
Background
Deep Reinforcement Learning (DRL) has advanced greatly in recent years, for example in video games and board games. With the help of the powerful feature extraction and function fitting capabilities of deep learning, a reinforcement learning agent can extract and learn feature knowledge directly from raw input data (such as game images), and then learn a decision control strategy with a conventional reinforcement learning algorithm based on the extracted features, without manually engineering features from rules and heuristics.
However, deep reinforcement learning is not yet practical for solving complex decision and control problems in real environments, such as autonomous driving. Because of the diversity and uncertainty of complex systems, existing simulation environments are difficult to keep consistent with the real world, and raising the fidelity of a simulation system is costly. How to adapt to complex real-world scenes has therefore become one of the most urgent problems in applying DRL models to complex decision tasks.
For decision problems in complex scenes, human experts have great advantages in learning efficiency and decision performance, so incorporating expert knowledge into the DRL model is a promising solution. DQfD (Deep Q-learning from Demonstrations), which combines imitation learning with deep reinforcement learning, guides the agent to learn the strategy represented by the teaching data, helping the agent acquire expert knowledge and then learn autonomously on that basis, thereby improving the model's decision-making capability. However, the DQfD model has the following problems: (1) during DQfD learning, the trajectory data in the historical teaching data set are used only in pre-training, and the teaching data provide no effective guidance for the trajectories generated autonomously by the model; (2) the teaching data set is very limited and cannot cover a sufficient state-action space, and in some practical applications it is difficult to collect enough teaching data, for example because extreme and rare cases occur infrequently while most samples come from normal cases; (3) the DQfD algorithm ignores the imperfection of historical teaching data that is ubiquitous in real applications, and this imperfection has a negative influence on the improvement of model performance.
To address these problems, the invention provides a deep Q learning method based on dynamic teaching data and behavior cloning. A Behavior Cloning (BC) model is constructed to mine the historical teaching data and generate an expert loss, and the agent's behavior is compared with the behavior generated by the BC model through a cross-entropy-based expert loss function. In addition, the invention provides an automatic update mechanism that adaptively enhances the BC model. This mechanism tries to incorporate more high-quality trajectory samples and avoids the adverse effects that imperfect teaching data may have.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, the problems that the state-action space covered by the historical teaching data is limited and that imperfect teaching data may affect the direction of strategy optimization, a first aspect of the present invention provides a deep Q learning network optimization method based on dynamic teaching data and behavior cloning, which is applied to a sequential decision task and includes:
S100, performing supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
S200, pre-training a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
S300, using the second teaching data set, training the main network optimized in S200 based on a mixed loss function with expert loss;
S400, if the reward value obtained in S300 is historically optimal, obtaining an interaction trajectory with the main network optimized in S300 under real sequential decision interaction, generating sample data and adding it to the second teaching data set;
S500, performing supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained in S400, so as to fine-tune the first behavior clone network;
S600, performing the next training round using the method of S300-S500 until a preset end-of-training condition is reached, obtaining the finally optimized main network.
In some preferred embodiments, the samples in the first teaching data set and the second teaching data set each include a state, an action, a reward, a next state, and a termination flag.
In some preferred embodiments, in S100, the first behavior clone network is obtained by:
The first teaching data set is used as training samples, the cross-entropy loss is computed between the action labels in the training samples and the output of the initial behavior clone network, and the initial behavior clone network is trained with error back-propagation and gradient descent to obtain the first behavior clone network.
In some preferred embodiments, the mixed loss function J(Q) with expert loss in S300 is
J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)
where J_{DQ}(Q), J_n(Q), J_E(Q) and J_{L2}(Q) are, in turn, the single-step TD loss, the multi-step TD loss, the expert loss and the L2 regularization term, and \lambda_1, \lambda_2, \lambda_3 are weight coefficients.
In some preferred embodiments, the loss function J_E(Q) used in S500, "performing supervised training on the first behavior clone network with the updated main network", is
J_E(Q) = \max_{a \in A}\left[ Q(s,a) + l(\pi, \pi_{bc}; s_t) \right] - Q(s, a_E)
where \pi, \pi_{bc} and a_E are, respectively, the output of the main network, the output of the first behavior clone model and the action in the second teaching data set; l(\pi, \pi_{bc}; s_t) denotes the supervised loss; a is the action of the current step; A is the action space of the current task environment; Q(s,a) is the state-action value function for the current state; and Q(s, a_E) is the state-action value function for state s and action a_E.
In some preferred embodiments, the supervised loss l(\pi, \pi_{bc}; s_t) is
l(\pi, \pi_{bc}; s_t) = \mathrm{CrossEntropy}(\pi(s_t), \pi_{bc}(s_t))
When, in state s_t, the action output by the main network, \pi(s_t), and the action output by the first behavior clone model, \pi_{bc}(s_t), are the same, the supervised loss is zero; otherwise, it is a positive number.
In some preferred embodiments, the initial behavior clone network comprises three fully connected layers, the activation function is LeakyReLU, and the number of output neurons equals the total number of possible actions in the action space.
The second aspect of the invention provides a deep Q learning network optimization system based on dynamic teaching data and behavior cloning, which is applied to a sequential decision task and comprises the following modules:
a first module, configured to perform supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
a second module, configured to pre-train a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
a third module, configured to train, using the second teaching data set, the main network optimized by the second module based on a mixed loss function with expert loss;
a fourth module, configured to, if the reward value obtained by the third module is historically optimal, obtain an interaction trajectory with the main network optimized by the third module under real sequential decision interaction, generate sample data and add it to the second teaching data set;
a fifth module, configured to perform supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained by the fourth module, so as to fine-tune the first behavior clone network;
and a sixth module, configured to repeatedly execute the third module to the fifth module for the next training round until a preset end-of-training condition is reached, obtaining the finally optimized main network.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, wherein the programs are suitable for being loaded and executed by a processor to implement the above deep Q learning network optimization method based on dynamic teaching data and behavior cloning.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; wherein the program is adapted to be loaded and executed by a processor to implement the method for deep Q learning network optimization based on dynamic teach pendant data and behavioral cloning described above.
The invention has the beneficial effects that:
the invention can effectively improve the convergence speed and the decision performance of network training.
In the method, the interaction trajectory under real sequential decision interaction is acquired with the currently optimal main network, and new sample data are constructed and added to the second teaching data set. More of the sample space is thereby covered while sample quality keeps improving, so the performance of the strategy represented by the teaching data set continues to rise and continues to contribute positively to the improvement of the model.
In the invention, the updated main network is used to perform supervised training on the first behavior clone network, so a supervised loss can be generated for all dynamically generated teaching data in the second teaching data set, which improves the efficiency with which samples are used.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is a schematic flow chart of a deep Q learning network optimization method based on dynamic teaching data and behavior cloning according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
At present, how to adapt to complex real-world scenes has become one of the most urgent problems in applying DRL models to complex decision tasks, and human experts have great advantages in learning efficiency and decision performance, so incorporating expert knowledge into the DRL model is a promising solution. For these reasons, the invention provides a deep Q learning network optimization method based on dynamic teaching data and behavior cloning: a Behavior Cloning (BC) model is constructed to generate an expert loss, and the agent's behavior is compared with the behavior generated by the BC model through a designed expert loss function. In addition, the invention provides an automatic update mechanism that adaptively enhances the BC model. This mechanism tries to incorporate more high-quality trajectory samples and avoids the adverse effects that imperfect teaching data may have. The invention achieves better results in both convergence speed and decision performance.
The deep Q learning network optimization method based on dynamic teaching data and behavior cloning of the invention is applied to a sequential decision task and comprises:
S100, performing supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
S200, pre-training a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
S300, using the second teaching data set, training the main network optimized in S200 based on a mixed loss function with expert loss;
S400, if the reward value obtained in S300 is historically optimal, obtaining an interaction trajectory with the main network optimized in S300 under real sequential decision interaction, generating sample data and adding it to the second teaching data set;
S500, performing supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained in S400, so as to fine-tune the first behavior clone network;
S600, performing the next training round using the method of S300-S500 until a preset end-of-training condition is reached, obtaining the finally optimized main network.
For a clearer explanation of the present invention, an embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
The deep Q learning network optimization method based on dynamic teaching data and behavior cloning is applied to a sequence decision task and comprises S100-S600, and the following description is given with reference to FIG. 1.
And S100, carrying out supervised training on the initial behavior clone network based on the first teaching data set to obtain a first behavior clone network.
The invention generates an initial teaching data set from the operation records of an expert in a typical sequential decision problem scene. "Expert" includes, but is not limited to, human experts and may also be other intelligent devices. Typical sequential decision problem scenarios include, but are not limited to, video games, traffic control and power grid control. A teaching data sample is mainly a quintuple <state, action, reward, next state, termination flag>, and the teaching data samples are placed in a common experience replay pool. A teaching data set can be extracted and constructed from the experience replay pool at any time.
The first teaching data set of this embodiment is D_replay, the common experience replay pool of the DRL model.
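As a concrete illustration of the quintuple samples and the common experience replay pool described above, the following is a minimal Python sketch (an assumption: the patent does not prescribe an implementation language or data structure; the names Transition, ReplayPool and d_replay are illustrative).

```python
import random
from collections import deque, namedtuple

# One teaching sample: <state, action, reward, next state, termination flag>
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayPool:
    """Common experience replay pool; teaching data sets can be drawn from it at any time."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def extend(self, transitions):
        self.buffer.extend(Transition(*t) for t in transitions)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

d_replay = ReplayPool()   # used here as the first teaching data set D_replay
```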
In this embodiment, the initial behavior clone network consists of three fully connected layers with LeakyReLU activations, and the number of output neurons equals the total number of possible actions in the action space. The network is first pre-trained on the first teaching data set to ensure its initial decision performance. The training method is as follows: using the first teaching data set as training samples, the cross-entropy loss is computed between the action labels in the training samples and the output of the initial behavior clone network, and the initial behavior clone network is trained with error back-propagation and gradient descent, establishing a mapping f: s → a from states to actions and yielding the first behavior clone network. The trained first behavior clone network can then take an action a similar to the expert's according to the state s of the environment.
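One possible realization of this behavior clone network and its supervised pre-training is sketched below in PyTorch (an assumption: the patent fixes three fully connected layers and LeakyReLU but not the framework, hidden width or optimizer; BehaviorCloneNet, pretrain_bc, the hidden size of 128 and the Adam optimizer are illustrative choices; demo_loader is assumed to yield batches of (state tensor, expert action index) pairs drawn from the first teaching data set).

```python
import torch
import torch.nn as nn

class BehaviorCloneNet(nn.Module):
    """Three fully connected layers with LeakyReLU; one output logit per possible action."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.net(state)  # action logits; the mapping f: s -> a is the argmax over these

def pretrain_bc(bc_net, demo_loader, epochs=10, lr=1e-3):
    """S100: supervised pre-training on the first teaching data set with cross-entropy loss."""
    optimizer = torch.optim.Adam(bc_net.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for states, expert_actions in demo_loader:
            loss = ce(bc_net(states), expert_actions)  # cross-entropy vs. expert action labels
            optimizer.zero_grad()
            loss.backward()      # error back-propagation
            optimizer.step()     # gradient-descent update
    return bc_net
```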
And S200, pre-training the main network and the target network with the same network structure based on the second teaching data set.
The main network is constructed based on a deep Q learning network, and the target network is obtained by copying the main network; at initialization, the two networks are each initialized randomly.
The second teaching data set is denoted D_demo and consists of historical teaching data together with dynamically generated teaching data.
And S300, using the second teaching data set, training the main network optimized in S200 based on the mixed loss function with expert loss.
The loss function used during training is the mixed loss function J(Q) with expert loss:
J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)
where J_{DQ}(Q), J_n(Q), J_E(Q) and J_{L2}(Q) are, in turn, the single-step TD loss, the multi-step TD loss, the expert loss and the L2 regularization term, and \lambda_1, \lambda_2, \lambda_3 are weight coefficients.
The single-step TD loss J_{DQ}(Q) is
J_{DQ}(Q) = \left( R(s,a) + \gamma Q(s_{t+1}, a^{\max}_{t+1}; \theta') - Q(s,a;\theta) \right)^2, \qquad a^{\max}_{t+1} = \arg\max_{a} Q(s_{t+1}, a; \theta')
where R(s,a) is the reward fed back for the state and action at the current time step, Q(s_{t+1}, a^{\max}_{t+1}; \theta') is the Q value of the target network, s_{t+1} is the state at the next time step, Q(s,a;\theta) is the Q value of the main network for the state and action of the current sample, a^{\max}_{t+1} is the action with the maximum Q value according to the target network in the next state of the current sample, \theta and \theta' are the parameters of the main network and the target network respectively, t is the current time step, and \gamma is the reward discount factor.
The multi-step TD loss J_n(Q) is
J_n(Q) = \left( r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} Q(s_{t+n}, a^{\max}_{t+n}; \theta') - Q(s_t, a_t; \theta) \right)^2, \qquad a^{\max}_{t+n} = \arg\max_{a} Q(s_{t+n}, a; \theta')
where r_t is the reward of the current step, a^{\max}_{t+n} is the action with the maximum Q value, according to the target network, in the state n steps ahead of the current sample, Q(s_{t+n}, a^{\max}_{t+n}; \theta') is the corresponding Q value when that maximizing action is used, n is the number of steps counted forward from the current step, and \gamma is the reward discount factor.
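To make the combination of these loss terms concrete, the following is a minimal PyTorch-style sketch of the mixed loss under the reconstruction above (assumptions: a discrete action space and squared TD errors averaged over a batch; batch and n_batch are illustrative pre-collated tensors sampled from the second teaching data set, with n_batch already carrying the discounted n-step return; expert_loss_fn is a function implementing J_E, a sketch of which is given with S500 below).

```python
import torch
import torch.nn.functional as F

def mixed_loss(main_net, target_net, bc_net, batch, n_batch, expert_loss_fn,
               gamma=0.99, lam1=1.0, lam2=1.0, lam3=1e-5):
    """J(Q) = J_DQ + lam1*J_n + lam2*J_E + lam3*J_L2 for one sampled batch."""
    s, a, r, s_next, done = batch                    # a: LongTensor of action indices
    # Single-step TD loss J_DQ: bootstrap with the target network's maximal Q value.
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        td_target = r + gamma * (1.0 - done.float()) * q_next
    q_sa = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    j_dq = F.mse_loss(q_sa, td_target)

    # Multi-step TD loss J_n: ret_n = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}.
    s0, a0, ret_n, s_n, done_n, n = n_batch
    with torch.no_grad():
        q_n = target_net(s_n).max(dim=1).values
        n_target = ret_n + (gamma ** n.float()) * (1.0 - done_n.float()) * q_n
    q_s0a0 = main_net(s0).gather(1, a0.unsqueeze(1)).squeeze(1)
    j_n = F.mse_loss(q_s0a0, n_target)

    # Expert loss J_E (see the sketch under S500) and L2 regularization on the main network.
    j_e = expert_loss_fn(main_net, bc_net, s, a)     # actions from the teaching set act as a_E
    j_l2 = sum((p ** 2).sum() for p in main_net.parameters())

    return j_dq + lam1 * j_n + lam2 * j_e + lam3 * j_l2
```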
During the training in this step, the parameters of the main network are updated according to the value of the loss function J(Q), and the parameters of the main network are periodically copied to the target network.
And S400, if the reward value obtained in S300 is historically optimal, obtaining an interaction trajectory with the main network optimized in S300 under real sequential decision interaction, generating sample data and adding it to the second teaching data set.
In this step, the reward returned by the system when the loss function converges during the training of S300 may be used as the reward value. If the current round obtains the historically optimal reward value, the interaction trajectory produced by running the current main network is collected and added to the second teaching data set.
As training proceeds, the performance of the main network gets better and better, more samples are added to the teaching data set, and the samples in the teaching data set are continuously updated and expanded. By continuously adding new, better trajectory samples, more of the sample space is covered while sample quality keeps improving, so the performance of the strategy represented by the second teaching data set continues to rise and continues to contribute positively to the improvement of the main network.
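A sketch of this dynamic update of the second teaching data set is given below (assumptions: a Gym-style environment whose reset() returns the initial state and whose step() returns (next_state, reward, done, info), greedy action selection when collecting, and a D_demo container with an extend() method such as the ReplayPool sketched earlier; collect_trajectory and maybe_update_demo are illustrative names).

```python
import torch

def collect_trajectory(main_net, env):
    """Roll out the current main network once and return its transitions as teaching samples."""
    samples, state, done = [], env.reset(), False
    while not done:
        with torch.no_grad():
            q = main_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        action = int(q.argmax(dim=1))                                # greedy action of the main network
        next_state, reward, done, _ = env.step(action)
        samples.append((state, action, reward, next_state, done))   # the quintuple
        state = next_state
    return samples

def maybe_update_demo(d_demo, main_net, env, episode_reward, best_reward):
    """S400: only rounds with a historically optimal reward contribute new teaching data."""
    if episode_reward > best_reward:
        d_demo.extend(collect_trajectory(main_net, env))
        best_reward = episode_reward
    return best_reward
```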
And S500, based on the second teaching data set obtained in S400, performing supervised training on the first behavior clone network with the updated main network, so as to fine-tune the first behavior clone network.
The loss function J_E(Q) adopted in the training process of this step is
J_E(Q) = \max_{a \in A}\left[ Q(s,a) + l(\pi, \pi_{bc}; s_t) \right] - Q(s, a_E)
where \pi, \pi_{bc} and a_E are, respectively, the main network's policy, the output of the first behavior clone model and the action in the second teaching data set; l(\pi, \pi_{bc}; s_t) denotes the supervised loss; a is the action of the current step; A is the action space of the current task environment; Q(s,a) is the state-action value function for the current state; and Q(s, a_E) is the state-action value function for state s and action a_E.
This loss function is also used in step S300 as the expert loss in the loss function J(Q).
The supervised loss l(\pi, \pi_{bc}; s_t) is
l(\pi, \pi_{bc}; s_t) = \mathrm{CrossEntropy}(\pi(s_t), \pi_{bc}(s_t))
When, in state s_t, the action output by the main network, \pi(s_t), and the action output by the first behavior clone model, \pi_{bc}(s_t), are the same, the supervised loss is zero; otherwise, it is a positive number. On the one hand this yields a smooth loss value, and on the other hand it generates a supervised loss for all dynamically generated teaching data in D_demo, thereby improving the efficiency with which samples are used.
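A minimal sketch of these two losses is given below, under one reading of the formulas above in which \pi(s_t) is taken as the main network's greedy action and \pi_{bc}(s_t) as the behavior clone model's action distribution (an assumption: the patent does not fix how the two policies are encoded for the cross entropy; supervised_loss and expert_loss are illustrative names, and expert_loss plays the role of expert_loss_fn in the mixed-loss sketch above).

```python
import torch
import torch.nn.functional as F

def supervised_loss(main_net, bc_net, states):
    """l(pi, pi_bc; s_t): per-sample cross-entropy between the main network's greedy
    action and the BC model's predicted action distribution (near zero when they agree)."""
    greedy_a = main_net(states).argmax(dim=1)                      # pi(s_t)
    log_p_bc = F.log_softmax(bc_net(states), dim=1)                # pi_bc(s_t)
    return -log_p_bc.gather(1, greedy_a.unsqueeze(1)).squeeze(1)

def expert_loss(main_net, bc_net, states, expert_actions):
    """J_E(Q) = max_{a in A}[Q(s,a) + l(pi, pi_bc; s_t)] - Q(s, a_E), averaged over the batch."""
    q = main_net(states)                                           # Q(s, a) for every action a
    l_sup = supervised_loss(main_net, bc_net, states)              # state-wise supervised loss
    q_max = (q + l_sup.unsqueeze(1)).max(dim=1).values             # max_a [Q(s,a) + l]
    q_e = q.gather(1, expert_actions.unsqueeze(1)).squeeze(1)      # Q(s, a_E)
    return (q_max - q_e).mean()
```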
S600, performing the next training round using the method of S300-S500 until a preset end-of-training condition is reached, obtaining the finally optimized main network.
Each execution of S300-S500 in the above method constitutes one round, and the whole training process may comprise many rounds. The preset termination condition may be a number of rounds, a range set for the convergence value of the loss function J(Q), or a combination of the two in an "or" relationship.
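Putting the rounds together, the following outline sketches the overall loop over S300-S500 with an end condition that combines a maximum round count and loss convergence (all helper names, namely train_main_network, evaluate, collect_trajectory and finetune_bc, are hypothetical stand-ins for the procedures described above, and the thresholds are illustrative).

```python
def optimize(main_net, target_net, bc_net, d_demo, env, max_rounds=500, loss_tol=1e-3):
    """One iteration of the loop is one round (S300-S500); training stops on the end condition."""
    best_reward = float("-inf")
    for _ in range(max_rounds):
        loss = train_main_network(main_net, target_net, bc_net, d_demo)   # S300: mixed loss J(Q)
        reward = evaluate(main_net, env)                                  # reward of this round
        if reward > best_reward:                                          # S400: historical optimum
            best_reward = reward
            d_demo.extend(collect_trajectory(main_net, env))              # add new teaching samples
            finetune_bc(bc_net, main_net, d_demo)                         # S500: fine-tune BC network
        if loss < loss_tol:                                               # preset end condition
            break
    return main_net
```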
A deep Q learning network optimization system based on dynamic teaching data and behavioral cloning according to a second embodiment of the present invention is applied to a sequential decision task, and includes:
a first module, configured to perform supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
a second module, configured to pre-train a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
a third module, configured to train, using the second teaching data set, the main network optimized by the second module based on a mixed loss function with expert loss;
a fourth module, configured to, if the reward value obtained by the third module is historically optimal, obtain an interaction trajectory with the main network optimized by the third module under real sequential decision interaction, generate sample data and add it to the second teaching data set;
a fifth module, configured to perform supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained by the fourth module, so as to fine-tune the first behavior clone network;
and a sixth module, configured to repeatedly execute the third module to the fifth module for the next training round until a preset end-of-training condition is reached, obtaining the finally optimized main network.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the deep Q learning network optimization system based on dynamic teaching data and behavioral cloning provided in the foregoing embodiment is only illustrated by the division of the foregoing functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiments of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs, which are suitable for being loaded and executed by a processor to implement the above-mentioned deep Q learning network optimization method based on dynamic teaching data and behavioral cloning.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to realize the deep Q learning network optimization method based on dynamic teaching data and behavior cloning.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A deep Q learning network optimization method based on dynamic teaching data and behavior cloning is applied to a sequence decision task and is characterized by comprising the following steps:
S100, performing supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
S200, pre-training a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
S300, using the second teaching data set, training the main network optimized in S200 based on a mixed loss function with expert loss;
S400, if the reward value obtained in S300 is historically optimal, obtaining an interaction trajectory with the main network optimized in S300 under real sequential decision interaction, generating sample data and adding it to the second teaching data set;
S500, performing supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained in S400, so as to fine-tune the first behavior clone network;
S600, performing the next training round using the method of S300-S500 until a preset end-of-training condition is reached, obtaining the finally optimized main network.
2. The method of claim 1, wherein the samples in the first teaching data set and the second teaching data set each comprise a state, an action, a reward, a next state, and a termination flag.
3. The deep Q learning network optimization method based on dynamic teaching data and behavioral cloning as claimed in claim 1, wherein in S100, the first behavioral cloning network is obtained by:
The first teaching data set is used as training samples, the cross-entropy loss is computed between the action labels in the training samples and the output of the initial behavior clone network, and the initial behavior clone network is trained with error back-propagation and gradient descent to obtain the first behavior clone network.
4. The deep Q learning network optimization method based on dynamic teaching data and behavior cloning as claimed in claim 1, wherein the mixed loss function J(Q) with expert loss in S300 is
J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)
where J_{DQ}(Q), J_n(Q), J_E(Q) and J_{L2}(Q) are, in turn, the single-step TD loss, the multi-step TD loss, the expert loss and the L2 regularization term, and \lambda_1, \lambda_2, \lambda_3 are weight coefficients.
5. The method of claim 4, wherein in S500 the loss function J_E(Q) used in "performing supervised training on the first behavior clone network with the updated main network" is
J_E(Q) = \max_{a \in A}\left[ Q(s,a) + l(\pi, \pi_{bc}; s_t) \right] - Q(s, a_E)
where \pi, \pi_{bc} and a_E are, respectively, the output of the main network, the output of the first behavior clone model and the action in the second teaching data set; l(\pi, \pi_{bc}; s_t) denotes the supervised loss; a is the action of the current step; A is the action space of the current task environment; Q(s,a) is the state-action value function for the current state; and Q(s, a_E) is the state-action value function for state s and action a_E.
6. The deep Q learning network optimization method based on dynamic teaching data and behavior cloning as claimed in claim 5, wherein the supervised loss l(\pi, \pi_{bc}; s_t) is
l(\pi, \pi_{bc}; s_t) = \mathrm{CrossEntropy}(\pi(s_t), \pi_{bc}(s_t))
When, in state s_t, the action output by the main network, \pi(s_t), and the action output by the first behavior clone model, \pi_{bc}(s_t), are the same, the supervised loss is zero; otherwise, it is a positive number.
7. The deep Q learning network optimization method based on dynamic teaching data and behavior cloning as claimed in claim 1, wherein the initial behavior clone network comprises three fully connected layers, the activation function is LeakyReLU, and the number of output neurons equals the total number of possible actions in the action space.
8. A deep Q learning network optimization system based on dynamic teaching data and behavior cloning, applied to a sequential decision task, characterized by comprising:
a first module, configured to perform supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
a second module, configured to pre-train a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
a third module, configured to train, using the second teaching data set, the main network optimized by the second module based on a mixed loss function with expert loss;
a fourth module, configured to, if the reward value obtained by the third module is historically optimal, obtain an interaction trajectory with the main network optimized by the third module under real sequential decision interaction, generate sample data and add it to the second teaching data set;
a fifth module, configured to perform supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained by the fourth module, so as to fine-tune the first behavior clone network;
and a sixth module, configured to repeatedly execute the third module to the fifth module for the next training round until a preset end-of-training condition is reached, obtaining the finally optimized main network.
9. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the method for deep Q-learning network optimization based on dynamic teaching data and behavioral cloning of any one of claims 1 to 7.
10. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by a processor to implement the method for deep Q-learning network optimization based on dynamic teaching data and behavioral cloning of any of claims 1-7.


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant