CN112396180A - Deep Q learning network optimization method based on dynamic teaching data and behavior cloning - Google Patents

Deep Q learning network optimization method based on dynamic teaching data and behavior cloning

Info

Publication number: CN112396180A (published 2021-02-23); granted as CN112396180B (published 2021-06-29)
Authority: CN (China)
Application number: CN202011338992.0A (filed 2020-11-25; priority date 2020-11-25)
Other languages: Chinese (zh)
Prior art keywords: network, teaching data, training, data set, behavior
Inventors: 李小双, 王晓, 王飞跃, 金峻臣, 陈薏竹
Original and current assignee / applicant: Institute of Automation of Chinese Academy of Science
Legal status: Granted; Active

Classifications

    • G06N3/08: Physics; Computing; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Learning methods
    • G06N3/045: Physics; Computing; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • A63F13/67: Human necessities; Sports, games, amusements; Video games, i.e. games using an electronically generated display having two or more dimensions; Generating or modifying game content before or while executing the game program, adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use


Abstract

The invention belongs to the field of information processing and specifically relates to a deep Q learning network optimization method based on dynamic teaching data and behavior cloning. It aims to solve two problems: the state-action space covered by historical teaching data is limited, and imperfect teaching data can bias the direction of strategy optimization. The method comprises the following steps: performing supervised training on an initial behavior clone network to obtain a first behavior clone network; pre-training a main network and a target network with the same network structure based on a second teaching data set, and then further training the main network with a mixed loss function that includes an expert loss; if a training round achieves the historically optimal reward value, updating the second teaching data set; and repeating the training with the updated second teaching data set until an end condition is reached. By continuously adding high-quality sample data during training, the method improves the performance of the strategy represented by the teaching data set and keeps contributing positively to the improvement of the model.

Description

Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
Technical Field
The invention belongs to the field of information processing, and particularly relates to a deep Q learning network optimization method based on dynamic teaching data and behavior cloning.
Background
Deep Reinforcement Learning (DRL) has advanced greatly in recent years, for example in video games and board games. With the help of the powerful feature extraction and function fitting capabilities of deep learning, a reinforcement learning agent can extract and learn feature knowledge directly from raw input data (such as game images), and then learn a decision control strategy with a conventional reinforcement learning algorithm based on the extracted features, without manually engineering features from rules and heuristics.
However, deep reinforcement learning is not yet practical for solving complex decision and control problems in real environments, such as autonomous driving. Because of the diversity and uncertainty of complex systems, existing simulation environments are difficult to keep consistent with the real world, and raising the fidelity of a simulation system is costly. How to adapt to complex real-world scenes has therefore become one of the most urgent problems in applying DRL models to complex decision tasks.
For decision problems in complex scenes, human experts have great advantages in learning efficiency and decision performance, so incorporating expert knowledge into the DRL model is a promising solution. DQfD (Deep Q-learning from Demonstrations), which combines imitation learning with deep reinforcement learning, guides the agent to learn the strategy represented by the teaching data, helping the agent acquire expert knowledge and then learn autonomously on that basis, thereby improving the model's decision-making capability. However, the DQfD model has the following problems: (1) during DQfD learning, the trajectory data in the historical teaching data set are used only in pre-training, and the teaching data provide no effective guidance for the trajectories generated autonomously by the model; (2) the teaching data set is very limited and cannot cover a sufficient state-action space, and in some practical applications it is difficult to collect enough teaching data, for example because extreme and rare cases occur infrequently while most samples come from normal cases; (3) the DQfD algorithm ignores the imperfection of historical teaching data that is ubiquitous in real applications, and this imperfection has a negative influence on the improvement of model performance.
To address these problems, the invention provides a deep Q learning method based on dynamic teaching data and behavior cloning. A Behavior Cloning (BC) model is constructed to mine the historical teaching data and generate an expert loss, and the agent's behavior is compared with the behavior generated by the BC model through a cross-entropy-based expert loss function. In addition, the invention provides an automatic update mechanism that adaptively enhances the BC model. This mechanism tries to incorporate more high-quality trajectory samples and avoids the adverse effects that imperfect teaching data may have.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, the problems that the state-action space covered by the historical teaching data is limited and that imperfect teaching data may affect the direction of strategy optimization, a first aspect of the present invention provides a deep Q learning network optimization method based on dynamic teaching data and behavior cloning, which is applied to a sequential decision task and includes:
S100, performing supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
S200, pre-training a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
S300, using the second teaching data set, training the main network optimized in S200 based on a mixed loss function with expert loss;
S400, if the reward value obtained in S300 is historically optimal, obtaining an interaction trajectory with the main network optimized in S300 under real sequential decision interaction, generating sample data and adding it to the second teaching data set;
S500, performing supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained in S400, so as to fine-tune the first behavior clone network;
S600, performing the next training round using the method of S300-S500 until a preset end-of-training condition is reached, obtaining the finally optimized main network.
In some preferred embodiments, the samples in the first teaching data set and the second teaching data set each include a state, an action, a reward, a next state, and a termination flag.
In some preferred embodiments, in S100, the first behavior clone network is obtained by:
The first teaching data set is used as training samples, the cross-entropy loss is computed between the action labels in the training samples and the output of the initial behavior clone network, and the initial behavior clone network is trained with error back-propagation and gradient descent to obtain the first behavior clone network.
In some preferred embodiments, the mixed loss function J(Q) with expert loss in S300 is
J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)
where J_{DQ}(Q), J_n(Q), J_E(Q) and J_{L2}(Q) are, in turn, the single-step TD loss, the multi-step TD loss, the expert loss and the L2 regularization term, and \lambda_1, \lambda_2, \lambda_3 are weight coefficients.
In some preferred embodiments, the loss function J_E(Q) used in S500, "performing supervised training on the first behavior clone network with the updated main network", is
J_E(Q) = \max_{a \in A}\left[ Q(s,a) + l(\pi, \pi_{bc}; s_t) \right] - Q(s, a_E)
where \pi, \pi_{bc} and a_E are, respectively, the output of the main network, the output of the first behavior clone model and the action in the second teaching data set; l(\pi, \pi_{bc}; s_t) denotes the supervised loss; a is the action of the current step; A is the action space of the current task environment; Q(s,a) is the state-action value function for the current state; and Q(s, a_E) is the state-action value function for state s and action a_E.
In some preferred embodiments, the supervised loss l(\pi, \pi_{bc}; s_t) is
l(\pi, \pi_{bc}; s_t) = \mathrm{CrossEntropy}(\pi(s_t), \pi_{bc}(s_t))
When, in state s_t, the action output by the main network, \pi(s_t), and the action output by the first behavior clone model, \pi_{bc}(s_t), are the same, the supervised loss is zero; otherwise, it is a positive number.
In some preferred embodiments, the initial behavior clone network comprises three fully connected layers, the activation function is LeakyReLU, and the number of output neurons equals the total number of possible actions in the action space.
The second aspect of the invention provides a deep Q learning network optimization system based on dynamic teaching data and behavior cloning, which is applied to a sequential decision task and comprises the following modules:
a first module, configured to perform supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
a second module, configured to pre-train a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
a third module, configured to train, using the second teaching data set, the main network optimized by the second module based on a mixed loss function with expert loss;
a fourth module, configured to, if the reward value obtained by the third module is historically optimal, obtain an interaction trajectory with the main network optimized by the third module under real sequential decision interaction, generate sample data and add it to the second teaching data set;
a fifth module, configured to perform supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained by the fourth module, so as to fine-tune the first behavior clone network;
and a sixth module, configured to repeatedly execute the third module to the fifth module for the next training round until a preset end-of-training condition is reached, obtaining the finally optimized main network.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, wherein the programs are suitable for being loaded and executed by a processor to implement the above deep Q learning network optimization method based on dynamic teaching data and behavior cloning.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; wherein the program is adapted to be loaded and executed by a processor to implement the method for deep Q learning network optimization based on dynamic teach pendant data and behavioral cloning described above.
The invention has the beneficial effects that:
the invention can effectively improve the convergence speed and the decision performance of network training.
In the method, the interaction trajectory under real sequential decision interaction is acquired with the currently optimal main network, and new sample data are constructed and added to the second teaching data set. More of the sample space is thereby covered while sample quality keeps improving, so the performance of the strategy represented by the teaching data set continues to rise and continues to contribute positively to the improvement of the model.
In the invention, the updated main network is used to perform supervised training on the first behavior clone network, so a supervised loss can be generated for all dynamically generated teaching data in the second teaching data set, which improves the efficiency with which samples are used.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is a schematic flow chart of a deep Q learning network optimization method based on dynamic teaching data and behavior cloning according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
At present, how to adapt to complex real-world scenes has become one of the most urgent problems in applying DRL models to complex decision tasks, and human experts have great advantages in learning efficiency and decision performance, so incorporating expert knowledge into the DRL model is a promising solution. For these reasons, the invention provides a deep Q learning network optimization method based on dynamic teaching data and behavior cloning: a Behavior Cloning (BC) model is constructed to generate an expert loss, and the agent's behavior is compared with the behavior generated by the BC model through a designed expert loss function. In addition, the invention provides an automatic update mechanism that adaptively enhances the BC model. This mechanism tries to incorporate more high-quality trajectory samples and avoids the adverse effects that imperfect teaching data may have. The invention achieves better results in both convergence speed and decision performance.
The deep Q learning network optimization method based on dynamic teaching data and behavior cloning of the invention is applied to a sequential decision task and comprises:
S100, performing supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
S200, pre-training a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
S300, using the second teaching data set, training the main network optimized in S200 based on a mixed loss function with expert loss;
S400, if the reward value obtained in S300 is historically optimal, obtaining an interaction trajectory with the main network optimized in S300 under real sequential decision interaction, generating sample data and adding it to the second teaching data set;
S500, performing supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained in S400, so as to fine-tune the first behavior clone network;
S600, performing the next training round using the method of S300-S500 until a preset end-of-training condition is reached, obtaining the finally optimized main network.
For a clearer explanation of the present invention, an embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
The deep Q learning network optimization method based on dynamic teaching data and behavior cloning is applied to a sequence decision task and comprises S100-S600, and the following description is given with reference to FIG. 1.
And S100, carrying out supervised training on the initial behavior clone network based on the first teaching data set to obtain a first behavior clone network.
The invention generates an initial teaching data set from the operation records of an expert in a typical sequential decision problem scene. "Expert" includes, but is not limited to, human experts and may also be other intelligent devices. Typical sequential decision problem scenarios include, but are not limited to, video games, traffic control and power grid control. A teaching data sample is mainly a quintuple <state, action, reward, next state, termination flag>, and the teaching data samples are placed in a common experience replay pool. A teaching data set can be extracted and constructed from the experience replay pool at any time.
The first teaching data set of this embodiment is D_replay, the common experience replay pool of the DRL model.
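As a concrete illustration of the quintuple samples and the common experience replay pool described above, the following is a minimal Python sketch (an assumption: the patent does not prescribe an implementation language or data structure; the names Transition, ReplayPool and d_replay are illustrative).

```python
import random
from collections import deque, namedtuple

# One teaching sample: <state, action, reward, next state, termination flag>
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayPool:
    """Common experience replay pool; teaching data sets can be drawn from it at any time."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def extend(self, transitions):
        self.buffer.extend(Transition(*t) for t in transitions)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

d_replay = ReplayPool()   # used here as the first teaching data set D_replay
```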
In this embodiment, the initial behavior clone network consists of three fully connected layers with LeakyReLU activations, and the number of output neurons equals the total number of possible actions in the action space. The network is first pre-trained on the first teaching data set to ensure its initial decision performance. The training method is as follows: using the first teaching data set as training samples, the cross-entropy loss is computed between the action labels in the training samples and the output of the initial behavior clone network, and the initial behavior clone network is trained with error back-propagation and gradient descent, establishing a mapping f: s → a from states to actions and yielding the first behavior clone network. The trained first behavior clone network can then take an action a similar to the expert's according to the state s of the environment.
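One possible realization of this behavior clone network and its supervised pre-training is sketched below in PyTorch (an assumption: the patent fixes three fully connected layers and LeakyReLU but not the framework, hidden width or optimizer; BehaviorCloneNet, pretrain_bc, the hidden size of 128 and the Adam optimizer are illustrative choices; demo_loader is assumed to yield batches of (state tensor, expert action index) pairs drawn from the first teaching data set).

```python
import torch
import torch.nn as nn

class BehaviorCloneNet(nn.Module):
    """Three fully connected layers with LeakyReLU; one output logit per possible action."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.net(state)  # action logits; the mapping f: s -> a is the argmax over these

def pretrain_bc(bc_net, demo_loader, epochs=10, lr=1e-3):
    """S100: supervised pre-training on the first teaching data set with cross-entropy loss."""
    optimizer = torch.optim.Adam(bc_net.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for states, expert_actions in demo_loader:
            loss = ce(bc_net(states), expert_actions)  # cross-entropy vs. expert action labels
            optimizer.zero_grad()
            loss.backward()      # error back-propagation
            optimizer.step()     # gradient-descent update
    return bc_net
```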
And S200, pre-training the main network and the target network with the same network structure based on the second teaching data set.
The main network is constructed based on a deep Q learning network, and the target network is obtained by copying the main network; at initialization, the two networks are each initialized randomly.
The second teaching data set is denoted D_demo and consists of historical teaching data together with dynamically generated teaching data.
And S300, using the second teaching data set, training the main network optimized in S200 based on the mixed loss function with expert loss.
The loss function used during training is the mixed loss function J(Q) with expert loss:
J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)
where J_{DQ}(Q), J_n(Q), J_E(Q) and J_{L2}(Q) are, in turn, the single-step TD loss, the multi-step TD loss, the expert loss and the L2 regularization term, and \lambda_1, \lambda_2, \lambda_3 are weight coefficients.
The single-step TD loss J_{DQ}(Q) is
J_{DQ}(Q) = \left( R(s,a) + \gamma Q(s_{t+1}, a^{\max}_{t+1}; \theta') - Q(s,a;\theta) \right)^2, \qquad a^{\max}_{t+1} = \arg\max_{a} Q(s_{t+1}, a; \theta')
where R(s,a) is the reward fed back for the state and action at the current time step, Q(s_{t+1}, a^{\max}_{t+1}; \theta') is the Q value of the target network, s_{t+1} is the state at the next time step, Q(s,a;\theta) is the Q value of the main network for the state and action of the current sample, a^{\max}_{t+1} is the action with the maximum Q value according to the target network in the next state of the current sample, \theta and \theta' are the parameters of the main network and the target network respectively, t is the current time step, and \gamma is the reward discount factor.
The multi-step TD loss J_n(Q) is
J_n(Q) = \left( r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} Q(s_{t+n}, a^{\max}_{t+n}; \theta') - Q(s_t, a_t; \theta) \right)^2, \qquad a^{\max}_{t+n} = \arg\max_{a} Q(s_{t+n}, a; \theta')
where r_t is the reward of the current step, a^{\max}_{t+n} is the action with the maximum Q value, according to the target network, in the state n steps ahead of the current sample, Q(s_{t+n}, a^{\max}_{t+n}; \theta') is the corresponding Q value when that maximizing action is used, n is the number of steps counted forward from the current step, and \gamma is the reward discount factor.
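To make the combination of these loss terms concrete, the following is a minimal PyTorch-style sketch of the mixed loss under the reconstruction above (assumptions: a discrete action space and squared TD errors averaged over a batch; batch and n_batch are illustrative pre-collated tensors sampled from the second teaching data set, with n_batch already carrying the discounted n-step return; expert_loss_fn is a function implementing J_E, a sketch of which is given with S500 below).

```python
import torch
import torch.nn.functional as F

def mixed_loss(main_net, target_net, bc_net, batch, n_batch, expert_loss_fn,
               gamma=0.99, lam1=1.0, lam2=1.0, lam3=1e-5):
    """J(Q) = J_DQ + lam1*J_n + lam2*J_E + lam3*J_L2 for one sampled batch."""
    s, a, r, s_next, done = batch                    # a: LongTensor of action indices
    # Single-step TD loss J_DQ: bootstrap with the target network's maximal Q value.
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        td_target = r + gamma * (1.0 - done.float()) * q_next
    q_sa = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    j_dq = F.mse_loss(q_sa, td_target)

    # Multi-step TD loss J_n: ret_n = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}.
    s0, a0, ret_n, s_n, done_n, n = n_batch
    with torch.no_grad():
        q_n = target_net(s_n).max(dim=1).values
        n_target = ret_n + (gamma ** n.float()) * (1.0 - done_n.float()) * q_n
    q_s0a0 = main_net(s0).gather(1, a0.unsqueeze(1)).squeeze(1)
    j_n = F.mse_loss(q_s0a0, n_target)

    # Expert loss J_E (see the sketch under S500) and L2 regularization on the main network.
    j_e = expert_loss_fn(main_net, bc_net, s, a)     # actions from the teaching set act as a_E
    j_l2 = sum((p ** 2).sum() for p in main_net.parameters())

    return j_dq + lam1 * j_n + lam2 * j_e + lam3 * j_l2
```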
During the training in this step, the parameters of the main network are updated according to the value of the loss function J(Q), and the parameters of the main network are periodically copied to the target network.
And S400, if the reward value obtained in S300 is historically optimal, obtaining an interaction trajectory with the main network optimized in S300 under real sequential decision interaction, generating sample data and adding it to the second teaching data set.
In this step, the reward returned by the system when the loss function converges during the training of S300 may be used as the reward value. If the current round obtains the historically optimal reward value, the interaction trajectory produced by running the current main network is collected and added to the second teaching data set.
As training proceeds, the performance of the main network gets better and better, more samples are added to the teaching data set, and the samples in the teaching data set are continuously updated and expanded. By continuously adding new, better trajectory samples, more of the sample space is covered while sample quality keeps improving, so the performance of the strategy represented by the second teaching data set continues to rise and continues to contribute positively to the improvement of the main network.
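A sketch of this dynamic update of the second teaching data set is given below (assumptions: a Gym-style environment whose reset() returns the initial state and whose step() returns (next_state, reward, done, info), greedy action selection when collecting, and a D_demo container with an extend() method such as the ReplayPool sketched earlier; collect_trajectory and maybe_update_demo are illustrative names).

```python
import torch

def collect_trajectory(main_net, env):
    """Roll out the current main network once and return its transitions as teaching samples."""
    samples, state, done = [], env.reset(), False
    while not done:
        with torch.no_grad():
            q = main_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        action = int(q.argmax(dim=1))                                # greedy action of the main network
        next_state, reward, done, _ = env.step(action)
        samples.append((state, action, reward, next_state, done))   # the quintuple
        state = next_state
    return samples

def maybe_update_demo(d_demo, main_net, env, episode_reward, best_reward):
    """S400: only rounds with a historically optimal reward contribute new teaching data."""
    if episode_reward > best_reward:
        d_demo.extend(collect_trajectory(main_net, env))
        best_reward = episode_reward
    return best_reward
```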
And S500, based on the second teaching data set obtained in S400, performing supervised training on the first behavior clone network with the updated main network, so as to fine-tune the first behavior clone network.
The loss function J_E(Q) adopted in the training process of this step is
J_E(Q) = \max_{a \in A}\left[ Q(s,a) + l(\pi, \pi_{bc}; s_t) \right] - Q(s, a_E)
where \pi, \pi_{bc} and a_E are, respectively, the main network's policy, the output of the first behavior clone model and the action in the second teaching data set; l(\pi, \pi_{bc}; s_t) denotes the supervised loss; a is the action of the current step; A is the action space of the current task environment; Q(s,a) is the state-action value function for the current state; and Q(s, a_E) is the state-action value function for state s and action a_E.
This loss function is also used in step S300 as the expert loss in the loss function J(Q).
The supervised loss l(\pi, \pi_{bc}; s_t) is
l(\pi, \pi_{bc}; s_t) = \mathrm{CrossEntropy}(\pi(s_t), \pi_{bc}(s_t))
When, in state s_t, the action output by the main network, \pi(s_t), and the action output by the first behavior clone model, \pi_{bc}(s_t), are the same, the supervised loss is zero; otherwise, it is a positive number. On the one hand this yields a smooth loss value, and on the other hand it generates a supervised loss for all dynamically generated teaching data in D_demo, thereby improving the efficiency with which samples are used.
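A minimal sketch of these two losses is given below, under one reading of the formulas above in which \pi(s_t) is taken as the main network's greedy action and \pi_{bc}(s_t) as the behavior clone model's action distribution (an assumption: the patent does not fix how the two policies are encoded for the cross entropy; supervised_loss and expert_loss are illustrative names, and expert_loss plays the role of expert_loss_fn in the mixed-loss sketch above).

```python
import torch
import torch.nn.functional as F

def supervised_loss(main_net, bc_net, states):
    """l(pi, pi_bc; s_t): per-sample cross-entropy between the main network's greedy
    action and the BC model's predicted action distribution (near zero when they agree)."""
    greedy_a = main_net(states).argmax(dim=1)                      # pi(s_t)
    log_p_bc = F.log_softmax(bc_net(states), dim=1)                # pi_bc(s_t)
    return -log_p_bc.gather(1, greedy_a.unsqueeze(1)).squeeze(1)

def expert_loss(main_net, bc_net, states, expert_actions):
    """J_E(Q) = max_{a in A}[Q(s,a) + l(pi, pi_bc; s_t)] - Q(s, a_E), averaged over the batch."""
    q = main_net(states)                                           # Q(s, a) for every action a
    l_sup = supervised_loss(main_net, bc_net, states)              # state-wise supervised loss
    q_max = (q + l_sup.unsqueeze(1)).max(dim=1).values             # max_a [Q(s,a) + l]
    q_e = q.gather(1, expert_actions.unsqueeze(1)).squeeze(1)      # Q(s, a_E)
    return (q_max - q_e).mean()
```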
S600, performing the next training round using the method of S300-S500 until a preset end-of-training condition is reached, obtaining the finally optimized main network.
Each execution of S300-S500 in the above method constitutes one round, and the whole training process may comprise many rounds. The preset termination condition may be a number of rounds, a range set for the convergence value of the loss function J(Q), or a combination of the two in an "or" relationship.
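Putting the rounds together, the following outline sketches the overall loop over S300-S500 with an end condition that combines a maximum round count and loss convergence (all helper names, namely train_main_network, evaluate, collect_trajectory and finetune_bc, are hypothetical stand-ins for the procedures described above, and the thresholds are illustrative).

```python
def optimize(main_net, target_net, bc_net, d_demo, env, max_rounds=500, loss_tol=1e-3):
    """One iteration of the loop is one round (S300-S500); training stops on the end condition."""
    best_reward = float("-inf")
    for _ in range(max_rounds):
        loss = train_main_network(main_net, target_net, bc_net, d_demo)   # S300: mixed loss J(Q)
        reward = evaluate(main_net, env)                                  # reward of this round
        if reward > best_reward:                                          # S400: historical optimum
            best_reward = reward
            d_demo.extend(collect_trajectory(main_net, env))              # add new teaching samples
            finetune_bc(bc_net, main_net, d_demo)                         # S500: fine-tune BC network
        if loss < loss_tol:                                               # preset end condition
            break
    return main_net
```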
A deep Q learning network optimization system based on dynamic teaching data and behavioral cloning according to a second embodiment of the present invention is applied to a sequential decision task, and includes:
a first module, configured to perform supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
a second module, configured to pre-train a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
a third module, configured to train, using the second teaching data set, the main network optimized by the second module based on a mixed loss function with expert loss;
a fourth module, configured to, if the reward value obtained by the third module is historically optimal, obtain an interaction trajectory with the main network optimized by the third module under real sequential decision interaction, generate sample data and add it to the second teaching data set;
a fifth module, configured to perform supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained by the fourth module, so as to fine-tune the first behavior clone network;
and a sixth module, configured to repeatedly execute the third module to the fifth module for the next training round until a preset end-of-training condition is reached, obtaining the finally optimized main network.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
It should be noted that, the deep Q learning network optimization system based on dynamic teaching data and behavioral cloning provided in the foregoing embodiment is only illustrated by the division of the foregoing functional modules, and in practical applications, the above functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiments of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs, which are suitable for being loaded and executed by a processor to implement the above-mentioned deep Q learning network optimization method based on dynamic teaching data and behavioral cloning.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to realize the deep Q learning network optimization method based on dynamic teaching data and behavior cloning.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A deep Q learning network optimization method based on dynamic teaching data and behavior cloning is applied to a sequence decision task and is characterized by comprising the following steps:
S100, performing supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
S200, pre-training a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
S300, using the second teaching data set, training the main network optimized in S200 based on a mixed loss function with expert loss;
S400, if the reward value obtained in S300 is historically optimal, obtaining an interaction trajectory with the main network optimized in S300 under real sequential decision interaction, generating sample data and adding it to the second teaching data set;
S500, performing supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained in S400, so as to fine-tune the first behavior clone network;
S600, performing the next training round using the method of S300-S500 until a preset end-of-training condition is reached, obtaining the finally optimized main network.
2. The method of claim 1, wherein the samples in the first teaching data set and the second teaching data set each comprise a state, an action, a reward, a next state, and a termination flag.
3. The deep Q learning network optimization method based on dynamic teaching data and behavioral cloning as claimed in claim 1, wherein in S100, the first behavioral cloning network is obtained by:
The first teaching data set is used as training samples, the cross-entropy loss is computed between the action labels in the training samples and the output of the initial behavior clone network, and the initial behavior clone network is trained with error back-propagation and gradient descent to obtain the first behavior clone network.
4. The deep Q learning network optimization method based on dynamic teaching data and behavior cloning as claimed in claim 1, wherein the mixed loss function J(Q) with expert loss in S300 is
J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q)
where J_{DQ}(Q), J_n(Q), J_E(Q) and J_{L2}(Q) are, in turn, the single-step TD loss, the multi-step TD loss, the expert loss and the L2 regularization term, and \lambda_1, \lambda_2, \lambda_3 are weight coefficients.
5. The method of claim 4, wherein in S500 the loss function J_E(Q) used in "performing supervised training on the first behavior clone network with the updated main network" is
J_E(Q) = \max_{a \in A}\left[ Q(s,a) + l(\pi, \pi_{bc}; s_t) \right] - Q(s, a_E)
where \pi, \pi_{bc} and a_E are, respectively, the output of the main network, the output of the first behavior clone model and the action in the second teaching data set; l(\pi, \pi_{bc}; s_t) denotes the supervised loss; a is the action of the current step; A is the action space of the current task environment; Q(s,a) is the state-action value function for the current state; and Q(s, a_E) is the state-action value function for state s and action a_E.
6. The deep Q learning network optimization method based on dynamic teaching data and behavior cloning as claimed in claim 5, wherein the supervised loss l(\pi, \pi_{bc}; s_t) is
l(\pi, \pi_{bc}; s_t) = \mathrm{CrossEntropy}(\pi(s_t), \pi_{bc}(s_t))
When, in state s_t, the action output by the main network, \pi(s_t), and the action output by the first behavior clone model, \pi_{bc}(s_t), are the same, the supervised loss is zero; otherwise, it is a positive number.
7. The deep Q learning network optimization method based on dynamic teaching data and behavior cloning as claimed in claim 1, wherein the initial behavior clone network comprises three fully connected layers, the activation function is LeakyReLU, and the number of output neurons equals the total number of possible actions in the action space.
8. A deep Q learning network optimization system based on dynamic teaching data and behavior cloning, applied to a sequential decision task, characterized by comprising:
a first module, configured to perform supervised training on an initial behavior clone network based on a first teaching data set to obtain a first behavior clone network;
a second module, configured to pre-train a main network and a target network with the same network structure based on a second teaching data set; the main network is constructed based on a deep Q learning network;
a third module, configured to train, using the second teaching data set, the main network optimized by the second module based on a mixed loss function with expert loss;
a fourth module, configured to, if the reward value obtained by the third module is historically optimal, obtain an interaction trajectory with the main network optimized by the third module under real sequential decision interaction, generate sample data and add it to the second teaching data set;
a fifth module, configured to perform supervised training on the first behavior clone network with the updated main network, based on the second teaching data set obtained by the fourth module, so as to fine-tune the first behavior clone network;
and a sixth module, configured to repeatedly execute the third module to the fifth module for the next training round until a preset end-of-training condition is reached, obtaining the finally optimized main network.
9. A storage device having stored therein a plurality of programs, wherein the programs are adapted to be loaded and executed by a processor to implement the method for deep Q-learning network optimization based on dynamic teaching data and behavioral cloning of any one of claims 1 to 7.
10. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by a processor to implement the method for deep Q-learning network optimization based on dynamic teaching data and behavioral cloning of any of claims 1-7.


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant