CN111401556A - Method for selecting the reward function in adversarial imitation learning - Google Patents

Method for selecting the reward function in adversarial imitation learning

Info

Publication number
CN111401556A
Authority
CN
China
Prior art keywords
function
network
reward
strategy
expert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010323155.4A
Other languages
Chinese (zh)
Other versions
CN111401556B (en)
Inventor
李秀
王亚伟
张明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202010323155.4A priority Critical patent/CN111401556B/en
Publication of CN111401556A publication Critical patent/CN111401556A/en
Application granted granted Critical
Publication of CN111401556B publication Critical patent/CN111401556B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method for selecting the reward function in adversarial imitation learning, comprising the following steps: constructing a policy network with parameter θ, a discrimination network with parameter w, and at least two reward functions; acquiring demonstration data under an expert policy and storing it in an expert data buffer containing expert trajectories; controlling the input of the policy network to be the state returned by the simulation environment and its output to be a decision action; updating the parameters of the discrimination network using state-action pairs under the expert policy and state-action pairs of the policy network; in the reward-calculation stage, taking a state-action pair of the policy network as the input of the discrimination network and outputting the reward value computed by the reward function; selecting the reward function for the current task according to the performance indices of the different reward functions; and saving the parameters of the policy network corresponding to the selected reward function. The agent learns under the guidance of different reward functions, and the optimal reward function is then selected according to the performance evaluation index in the specific task scenario.

Description

Method for selecting the reward function in adversarial imitation learning
Technical Field
The invention relates to the technical field of reward function selection, and in particular to a method for selecting the reward function in adversarial imitation learning.
Background
In recent years, with major breakthroughs of deep learning in fields such as image recognition, speech technology and natural language processing, deep reinforcement learning, which combines deep neural networks with reinforcement learning, has also achieved super-human performance on large-scale policy optimization problems such as Go and StarCraft. One bottleneck of reinforcement learning is that, when facing practical control problems such as autonomous driving and robotics, designing a reasonable reward function based on expert experience is time-consuming and laborious. Data-driven imitation learning provides an approach to this problem: a policy rivaling that of the expert can be learned using only demonstration data, without manually designing a reward function. Among the many imitation learning algorithms, behavioral cloning is the simplest; it performs imitation in a supervised-learning manner. However, this method is susceptible to compounding errors and has difficulty handling situations that do not occur in the expert data. Inverse reinforcement learning is another class of imitation learning algorithms; it first learns a reward function from the expert data and then runs a reinforcement learning process to learn a policy under the guidance of that reward function. The reward function learned in this way is more robust and can deal with situations that do not appear in the expert data. However, the algorithm must alternate between finding the optimal reward function and performing reinforcement learning training, and therefore requires a large amount of computation.
In the process of imitating the expert policy, the reward signal is derived from the output of the discrimination network: the closer the generated data is to the expert data, the larger the output value and hence the larger the reward value, and vice versa. Because the original algorithm uses a reward function in negative-logarithm form whose input always lies in the interval [0, 1], the reward value is always positive, which makes the algorithm difficult to apply across a variety of tasks.
The above background disclosure is only for the purpose of assisting understanding of the concept and technical solution of the present invention and does not necessarily belong to the prior art of the present patent application, and should not be used for evaluating the novelty and inventive step of the present application in the case that there is no clear evidence that the above content is disclosed at the filing date of the present patent application.
Disclosure of Invention
The invention provides a method for selecting the reward function in adversarial imitation learning, aiming at solving the problem that the form of reward function used in prior-art imitation learning algorithms based on generative adversarial networks is not optimal for every task and therefore cannot achieve optimal performance across multiple tasks.
In order to solve the above problem, the technical solution adopted by the present invention is as follows:
A method for selecting the reward function in adversarial imitation learning, comprising the following steps. S1: constructing a policy network π with a parameter θ, a discrimination network D with a parameter w, and at least two reward functions. S2: acquiring demonstration data under the expert policy and storing it into an expert data buffer B_E(s_t, a_t) containing expert trajectories. S3: controlling the input of the policy network to be the state s_t returned by the simulation environment Env and its output to be the decision action a_t; the discrimination network updates its parameter using state-action pairs (s_t, a_t)_E under the expert policy and state-action pairs (s_t, a_t)_π of the policy network; in the reward-calculation stage, the input of the discrimination network is a state-action pair (s_t, a_t)_π of the policy network, and the output is the reward value computed by the reward function. S4: selecting the reward function for the current task according to the performance indices of the different reward functions. S5: saving the parameters of the policy network corresponding to the selected reward function.
Preferably, 6 reward functions are designed according to the different value ranges of the reward function.
Preferably, the reward functions are:
r_1(x) = x = log(σ(x)) - log(1 - σ(x))
r_2(x) = e^x
r_3(x) = -e^(-x)
r_4(x) = σ(x)
r_5(x) = -log(1 - σ(x))
r_6(x) = log(σ(x))
where x is the output of the discrimination network and σ(x) = 1/(1 + e^(-x)) is the sigmoid function.
Preferably, the process of selecting the reward function for the current task according to the magnitude of the performance indices of the reward functions comprises the following steps:
S41: initializing a plurality of the simulation environments Env_i, the policy networks π_i and the discrimination networks D_i, and starting the training processes simultaneously, where i = 0, 1, ..., 6;
S42: in each training process, using the current policy network π to interact with the simulation environment Env, and storing the state-action pair of the current time step into a policy network buffer B_π(s_t, a_t);
S43: sampling the state-action trajectory (s_t, a_t)_π under the current policy from the policy network buffer B_π(s_t, a_t), sampling the expert state-action trajectory (s_t, a_t)_E from the expert data buffer B_E(s_t, a_t), and optimizing the parameter w by gradient descent on the loss function D_JS of the discrimination network D:
w ← w - α_d ∇_w D_JS((s, a)_π, (s, a)_E)
where w represents the discrimination network parameter, α_d represents the learning rate of the discrimination network parameter, D_JS represents the discrimination network loss function, and (s, a)_π and (s, a)_E respectively represent the state-action trajectories sampled from the policy network buffer B_π(s_t, a_t) and the expert data buffer B_E(s_t, a_t).
S44: calculating the reward value r_t of each step in the training process according to the specific form of each reward function, and storing it in the policy network buffer B_π(s_t, a_t, r_t);
S45: calculating the advantage value A_t of each time step according to the advantage function, and storing it in the policy network buffer B_π(s_t, a_t, r_t, A_t);
S46: according to the proximal policy optimization algorithm, updating the parameter θ of the policy network by gradient descent using the data in the policy network buffer B_π(s_t, a_t, r_t, A_t):
θ ← θ - α_p ∇_θ L(θ)
where θ represents the policy network parameter, α_p represents the learning rate of the policy network parameter, and L(θ) represents the policy network objective function.
S47: calculating the difference of the average returns over adjacent time periods; if the difference is less than the set threshold Thre, stopping the training process, saving the network parameters θ and w, and also saving the time step t at convergence, the average return R̄ over the most recent period and its standard deviation S; otherwise, returning to step S42 and re-executing steps S42-S46;
S48: after all training processes are finished, calculating the magnitude of the performance index according to the data (t, R̄, S) saved at final convergence, and selecting the reward function for the current task.
Preferably, the loss function D_JS is calculated as follows:
D_JS = -E_{(s_E, a_E) ~ B_E}[ log D_w(s_E, a_E) ] - E_{(s_π, a_π) ~ B_π}[ log(1 - D_w(s_π, a_π)) ]
where (s_π, a_π) and (s_E, a_E) are state-action samples from the policy network buffer B_π and the expert data buffer B_E, respectively, and D_w(s, a) = σ(x) denotes the sigmoid of the discrimination network output x.
Preferably, the advantage function A_t is used to measure how much benefit the current action brings;
the generalized advantage estimation algorithm computes it from the trajectory data collected by the policy network over the elapsed time T, with the specific formula:
A_t = δ_t + (γλ)δ_{t+1} + ... + (γλ)^{T-t+1} δ_{T-1}
where δ_t = r_t + γV(s_{t+1}) - V(s_t), γ and λ are hyperparameters, r_t is the reward value calculated from the reward function, s_t and s_{t+1} represent the states at the current and next moments, and V(s_t) and V(s_{t+1}) represent the corresponding state values.
Preferably, the proximal policy optimization algorithm improves the loss function term of the general policy gradient algorithm by replacing the original L^PG term with an L^CLIP term, whose specific formula is:
L^CLIP(θ) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1 - ε, 1 + ε) A_t ) ]
where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) represents the probability ratio, ε is a hyperparameter, A_t represents the aforementioned advantage function, clip represents the truncation function, E_t represents the expectation, and min is the minimum function.
Preferably, the final objective function is obtained by combining the improved term with a value function error term and a policy entropy term:
L_t^{CLIP+VF+S}(θ) = E_t[ L_t^{CLIP}(θ) - c_1 L_t^{VF}(θ) + c_2 S[π_θ](s_t) ]
where L_t^{CLIP}(θ) is the loss function improvement term described above, L_t^{VF}(θ) is the squared-error value loss, S is the exploration entropy, and c_1 and c_2 are the corresponding hyperparameters.
Preferably, the performance index of the reward function is calculated as a function of t_i, R̄_i, S_i and the hyperparameters T and β, where t_i represents the total number of steps required for the training process to converge, R̄_i is the average return over the most recent period, and S_i is the standard deviation of the return over the most recent period.
The invention also provides a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of the above.
The invention has the following beneficial effects: a method for selecting the reward function in adversarial imitation learning is provided; by designing multiple forms of reward function, the agent can learn under the guidance of different reward functions, and the optimal reward function is then selected according to the performance evaluation index for the specific task scenario.
Furthermore, with only a small amount of computation, the optimal reward function can be selected automatically for different tasks through the performance evaluation index and the reward-function selection algorithm, without repeatedly trying the effects of different reward functions.
Drawings
FIG. 1 is a schematic diagram of a method for selecting the reward function in adversarial imitation learning according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of selecting the reward function for the current task according to the magnitude of the performance indices of the different reward functions, according to an embodiment of the present invention.
FIG. 3 is a flow chart of the reward-function selection algorithm for adversarial imitation learning according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a training process curve according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing function or a circuit connection function.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
As shown in FIG. 1, the invention provides a method for selecting the reward function in adversarial imitation learning, comprising the following steps:
S1: constructing a policy network π with a parameter θ, a discrimination network D with a parameter w, and at least two reward functions;
S2: acquiring demonstration data under the expert policy and storing it into an expert data buffer B_E(s_t, a_t) containing expert trajectories;
The data is typically taken from the data sets published by OpenAI and others, which facilitates comparison with the effects of other algorithms. The expert data buffer typically stores 5 complete expert trajectories, from which some data is sampled in a certain manner for training during the learning process of the algorithm.
S3: controlling the input of the policy network to be the state s_t returned by the simulation environment Env and its output to be the decision action a_t; the discrimination network updates its parameter using state-action pairs (s_t, a_t)_E under the expert policy and state-action pairs (s_t, a_t)_π of the policy network; in the reward-calculation stage, the input of the discrimination network is a state-action pair (s_t, a_t)_π of the policy network, and the output is the reward value computed by the reward function;
S4: selecting the reward function for the current task according to the performance indices of the different reward functions;
S5: saving the parameters of the policy network corresponding to the selected reward function.
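Purely as an illustration of steps S1-S5, and not as an implementation supplied by the patent, the overall selection procedure can be organized as in the following Python sketch; every function and variable name here is hypothetical.

    def select_reward_function_for_task(make_env, make_policy, make_discriminator,
                                        expert_buffer, reward_fns, train, performance_index):
        # S1-S2 are assumed to be handled by the caller-supplied factories and expert_buffer.
        # `train` runs steps S42-S47 for one candidate and returns
        # (steps_to_converge, avg_return, std_return, policy_params).
        results = {}
        for name, reward_fn in reward_fns.items():
            env, policy, disc = make_env(), make_policy(), make_discriminator()
            results[name] = train(env, policy, disc, expert_buffer, reward_fn)  # S3
        best = max(results, key=lambda n: performance_index(*results[n][:3]))   # S4
        return best, results[best][3]                                           # S5: keep its policy parameters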
In an embodiment of the present invention, 6 reward functions are designed according to the different value ranges of the reward function. The reward functions are respectively:
r_1(x) = x = log(σ(x)) - log(1 - σ(x))   ......(1)
r_2(x) = e^x   ......(2)
r_3(x) = -e^(-x)   ......(3)
r_4(x) = σ(x)   ......(4)
r_5(x) = -log(1 - σ(x))   ......(5)
r_6(x) = log(σ(x))   ......(6)
where x is the output of the discrimination network and σ(x) = 1/(1 + e^(-x)) is the sigmoid function.
These functions are all conventional functions. Among them, functions (1), (5) and (6) are the reward function forms used in the original algorithms. The value range of function (1) can be positive or negative, the value range of function (5) is always positive, and the value range of function (6) is always negative. Previous studies have confirmed that functions with different value ranges are suitable for different types of tasks. Therefore, three further reward functions are designed according to different value ranges: the value ranges of functions (2) and (4) are always positive, and the value range of function (3) is always negative.
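For illustration only, the six reward forms above can be evaluated directly from the discriminator output x (the logit); the sketch below simply transcribes equations (1) to (6) and adds nothing beyond them.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # x is the raw output (logit) of the discrimination network
    REWARD_FUNCTIONS = {
        1: lambda x: x,                          # equals log(sigmoid(x)) - log(1 - sigmoid(x)); sign varies
        2: lambda x: np.exp(x),                  # always positive
        3: lambda x: -np.exp(-x),                # always negative
        4: lambda x: sigmoid(x),                 # always positive, in (0, 1)
        5: lambda x: -np.log(1.0 - sigmoid(x)),  # always positive
        6: lambda x: np.log(sigmoid(x)),         # always negative
    }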
As shown in FIG. 2, the process of selecting the reward function for the current task according to the magnitude of the performance indices of the reward functions comprises:
S41: initializing a plurality of the simulation environments Env_i, the policy networks π_i and the discrimination networks D_i, and starting the training processes simultaneously, where i = 0, 1, ..., 6;
S42: in each training process, using the current policy network π to interact with the simulation environment Env, and storing the state-action pair of the current time step into a policy network buffer B_π(s_t, a_t);
S43: sampling the state-action trajectory (s_t, a_t)_π under the current policy from the policy network buffer B_π(s_t, a_t), sampling the expert state-action trajectory (s_t, a_t)_E from the expert data buffer B_E(s_t, a_t), and optimizing the parameter w by gradient descent on the loss function D_JS of the discrimination network D:
w ← w - α_d ∇_w D_JS((s, a)_π, (s, a)_E)
where w represents the discrimination network parameter, α_d represents the learning rate of the discrimination network parameter, D_JS represents the discrimination network loss function, and (s, a)_π and (s, a)_E respectively represent the state-action trajectories sampled from the policy network buffer B_π(s_t, a_t) and the expert data buffer B_E(s_t, a_t).
S44: calculating the reward value r_t of each step in the training process according to the specific form of each reward function, and storing it in the policy network buffer B_π(s_t, a_t, r_t);
S45: calculating the advantage value A_t of each time step according to the advantage function, and storing it in the policy network buffer B_π(s_t, a_t, r_t, A_t);
S46: according to the proximal policy optimization algorithm, updating the parameter θ of the policy network by gradient descent using the data in the policy network buffer B_π(s_t, a_t, r_t, A_t):
θ ← θ - α_p ∇_θ L(θ)
where θ represents the policy network parameter, α_p represents the learning rate of the policy network parameter, and L(θ) represents the policy network objective function.
S47: calculating the difference of the average returns over adjacent time periods; if the difference is less than the set threshold Thre, stopping the training process, saving the network parameters θ and w, and also saving the time step t at convergence, the average return R̄ over the most recent period and its standard deviation S; otherwise, returning to step S42 and re-executing steps S42-S46;
S48: after all training processes are finished, calculating the magnitude of the performance index according to the data (t, R̄, S) saved at final convergence, and selecting the reward function for the current task.
The loss function D_JS is calculated as follows:
D_JS = -E_{(s_E, a_E) ~ B_E}[ log D_w(s_E, a_E) ] - E_{(s_π, a_π) ~ B_π}[ log(1 - D_w(s_π, a_π)) ]
where (s_π, a_π) and (s_E, a_E) are state-action samples from the policy network buffer B_π and the expert data buffer B_E, respectively, and D_w(s, a) = σ(x) denotes the sigmoid of the discrimination network output x.
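A minimal sketch of the discriminator update in step S43, written with PyTorch (which the patent does not prescribe) and assuming the binary cross-entropy form of D_JS given above; the exact sign convention of the original loss may differ.

    import torch
    import torch.nn.functional as F

    def discriminator_step(disc, optimizer, policy_sa, expert_sa):
        # disc maps a batch of concatenated (state, action) vectors to a raw logit x,
        # so that sigma(x) plays the role of D_w(s, a).
        logits_pi = disc(policy_sa)
        logits_e = disc(expert_sa)
        # Expert pairs are pushed towards D -> 1, policy pairs towards D -> 0.
        loss = (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e))
                + F.binary_cross_entropy_with_logits(logits_pi, torch.zeros_like(logits_pi)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()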
FIG. 3 is a schematic diagram showing the flow of the reward-function selection algorithm for adversarial imitation learning in the present invention.
In policy gradient optimization algorithms, the advantage function A_t is often used to measure how much benefit the current action brings, so making a reasonable estimate of the advantage function A_t is critical. The generalized advantage estimation (GAE) algorithm is an efficient method for estimating the advantage; it is computed from the trajectory data collected by the policy over the elapsed time T, with the specific formula:
A_t = δ_t + (γλ)δ_{t+1} + ... + (γλ)^{T-t+1} δ_{T-1}
where δ_t = r_t + γV(s_{t+1}) - V(s_t), γ and λ are hyperparameters, r_t is the reward value calculated from the reward function, s_t and s_{t+1} represent the states at the current and next moments, and V(s_t) and V(s_{t+1}) represent the corresponding state values.
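The estimate above can be computed backwards over a finished rollout, since A_t = δ_t + γλ·A_{t+1}; the following sketch assumes rewards of length T and value estimates of length T + 1 (including the bootstrap value V(s_T)).

    import numpy as np

    def generalized_advantage_estimate(rewards, values, gamma=0.99, lam=0.95):
        T = len(rewards)
        advantages = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]  # delta_t
            running = delta + gamma * lam * running                 # A_t = delta_t + (gamma*lambda) * A_{t+1}
            advantages[t] = running
        return advantages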
The proximal policy optimization algorithm improves the loss function term of the general policy gradient algorithm by replacing the original L^PG term with an L^CLIP term, whose specific formula is:
L^CLIP(θ) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1 - ε, 1 + ε) A_t ) ]
where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) represents the probability ratio, ε is a hyperparameter, A_t represents the aforementioned advantage function, clip represents the truncation function, E_t represents the expectation, and min is the minimum function.
Combining the improved term with a value function error term and a policy entropy term gives the final objective function:
L_t^{CLIP+VF+S}(θ) = E_t[ L_t^{CLIP}(θ) - c_1 L_t^{VF}(θ) + c_2 S[π_θ](s_t) ]
where L_t^{CLIP}(θ) is the loss function improvement term described above, L_t^{VF}(θ) is the squared-error value loss, S is the exploration entropy, and c_1 and c_2 are the corresponding hyperparameters.
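A sketch of the clipped surrogate objective combined with the value error and entropy terms, following the two formulas above; PyTorch is used only for illustration, and the batched tensor layout is an assumption.

    import torch

    def ppo_objective(new_log_probs, old_log_probs, advantages, values, returns,
                      entropy, clip_eps=0.2, c1=0.5, c2=0.01):
        ratio = torch.exp(new_log_probs - old_log_probs)               # r_t(theta)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        l_clip = torch.min(ratio * advantages, clipped * advantages).mean()
        l_vf = ((values - returns) ** 2).mean()                        # squared-error value loss
        # Maximize l_clip and the entropy bonus, minimize the value error;
        # the negation turns this into a loss suitable for gradient descent.
        return -(l_clip - c1 * l_vf + c2 * entropy.mean())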
The performance index of the reward function is calculated as a function of t_i, R̄_i, S_i and the hyperparameters T and β, where t_i represents the total number of steps required for the training process to converge, R̄_i is the average return over the most recent period, and S_i is the standard deviation of the return over the most recent period.
The optimal reward function for the current task can be selected according to this performance index; experiments verify the reliability of the index, and the selected reward function achieves the best performance at test time.
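Because the concrete formula of the performance index appears only as an equation image in the original publication, the selection step is sketched here with the index computation passed in as a caller-supplied function of (t_i, R̄_i, S_i) and the hyperparameters T and β; all names are illustrative.

    def pick_best_reward_function(results, index_fn, T, beta):
        # results: reward-function id -> (steps_to_converge, avg_return, std_return)
        scores = {i: index_fn(t_i, avg_i, std_i, T, beta)
                  for i, (t_i, avg_i, std_i) in results.items()}
        best = max(scores, key=scores.get)
        return best, scores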
In one specific task, the method is applied to a high-dimensional continuous control task, specifically as follows.
The objective of this task is to imitate an expert policy so as to control a robot in a simulation environment and make it learn to walk forward as fast as possible. For the agent in the simulation environment, the input is an 11-dimensional state quantity and the output control action is an 8-dimensional continuous variable. During training of the reward-function selection algorithm, the policy network and the discrimination network of the agent have similar structures: the policy network comprises two hidden layers of 128 nodes each, with tanh activation functions; the discrimination network consists of two hidden layers of 100 nodes each, also with tanh activation, and the parameters of both networks are optimized with the Adam optimizer.
The hyperparameters used in the adversarial imitation learning algorithm are shown in Table 1:
TABLE 1. Imitation learning hyperparameters
Hyperparameter                               Value
Full period (T)                              2048
Policy network learning rate (α_p)           3e-4
Discrimination network learning rate (α_d)   1e-3
Discount factor (γ)                          0.99
GAE parameter (λ)                            0.95
Value error coefficient (c_1)                0.5
Policy entropy coefficient (c_2)             0.01
The hyperparameters in the calculation formula of the reward function evaluation index are shown in Table 2:
TABLE 2. Performance evaluation hyperparameters
Hyperparameter                 Value
Normalization range (T)        1e6
Discount factor (β)            0.25
Convergence threshold (Thre)   50
In this environment, a random policy achieves a reward score of -60.21 ± 30.40, while the expert policy scores 4066.96 ± 688.97. Five demonstration trajectories are obtained from the public data set, and the agent is trained in parallel under the guidance of the 6 reward functions using the expert decision data. The scores during agent training are normalized to [0, 1] according to the range determined by the random-policy and expert-policy scores.
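The normalization described above maps a raw episode score onto [0, 1] using the random-policy and expert-policy scores as the two endpoints, for example:

    def normalize_score(score, random_score=-60.21, expert_score=4066.96):
        return (score - random_score) / (expert_score - random_score)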
As shown in FIG. 4, where 0 represents the score of the random policy and 1 represents the score of the expert policy, the final performance of the agents under the different reward functions differs, and the training processes of the agents corresponding to reward functions 2, 4 and 5 are better. According to the data saved at the end of training and the designed performance evaluation index of the reward functions, the finally selected reward function is r_2(x) = e^x. Its average score over 5 test episodes is 3796.00, slightly higher than the scores of the other two reward functions r_5(x) = -log(1 - σ(x)) and r_4(x) = σ(x), which are 3785.31 and 3789.75, respectively. This indicates that the optimal reward function in this task is indeed r_2(x) = e^x, which demonstrates the effectiveness of the proposed performance evaluation index and of selecting the optimal reward function based on this index.
In previous algorithms, any one of the reward functions (1), (5) and (6) may be used, and their effectiveness can differ significantly across tasks. It can be seen that function (1) can hardly complete the learning task here; although functions (5) and (6) perform better, they are still slightly inferior to reward function (2). Therefore, through the performance evaluation index and the reward-function selection algorithm, the optimal reward function can be selected automatically for different tasks without repeatedly trying the effects of different reward functions.
An embodiment of the present application further provides a control apparatus, including a processor and a storage medium for storing a computer program; wherein a processor is adapted to perform at least the method as described above when executing the computer program.
Embodiments of the present application also provide a storage medium for storing a computer program, which when executed performs at least the method described above.
Embodiments of the present application further provide a processor, where the processor executes a computer program to perform at least the method described above.
The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM). The volatile memory may be a Random Access Memory (RAM) used as an external cache, for example, but not limited to, a Dynamic Random Access Memory (DRAM) or a Synchronous Dynamic Random Access Memory (SDRAM). The memories described in the embodiments of the present invention are intended to include, but are not limited to, these and any other suitable types of memory.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all of them shall be considered to fall within the protection scope of the invention.

Claims (10)

1. A method for selecting the reward function in adversarial imitation learning, comprising the following steps:
S1: constructing a policy network π with a parameter θ, a discrimination network D with a parameter w, and at least two reward functions;
S2: acquiring demonstration data under the expert policy and storing it into an expert data buffer B_E(s_t, a_t) containing expert trajectories;
S3: controlling the input of the policy network to be the state s_t returned by the simulation environment Env and its output to be the decision action a_t; the discrimination network updates its parameter using state-action pairs (s_t, a_t)_E under the expert policy and state-action pairs (s_t, a_t)_π of the policy network; in the reward-calculation stage, the input of the discrimination network is a state-action pair (s_t, a_t)_π of the policy network, and the output is the reward value computed by the reward function;
S4: selecting the reward function for the current task according to the performance indices of the different reward functions;
S5: saving the parameters of the policy network corresponding to the selected reward function.
2. The method for selecting the reward function in adversarial imitation learning according to claim 1, wherein 6 reward functions are designed according to the different value ranges of the reward function.
3. The method for selecting the reward function in adversarial imitation learning according to claim 2, wherein the reward functions are:
r_1(x) = x = log(σ(x)) - log(1 - σ(x))
r_2(x) = e^x
r_3(x) = -e^(-x)
r_4(x) = σ(x)
r_5(x) = -log(1 - σ(x))
r_6(x) = log(σ(x))
where x is the output of the discrimination network and σ(x) = 1/(1 + e^(-x)) is the sigmoid function.
4. The method for selecting the reward function in adversarial imitation learning according to claim 3, wherein selecting the reward function for the current task according to the magnitude of the performance indices of the different reward functions comprises:
S41: initializing a plurality of the simulation environments Env_i, the policy networks π_i and the discrimination networks D_i, and starting the training processes simultaneously, where i = 0, 1, ..., 6;
S42: in each training process, using the current policy network π to interact with the simulation environment Env, and storing the state-action pair of the current time step into a policy network buffer B_π(s_t, a_t);
S43: sampling the state-action trajectory (s_t, a_t)_π under the current policy from the policy network buffer B_π(s_t, a_t), sampling the expert state-action trajectory (s_t, a_t)_E from the expert data buffer B_E(s_t, a_t), and optimizing the parameter w by gradient descent on the loss function D_JS of the discrimination network D:
w ← w - α_d ∇_w D_JS((s, a)_π, (s, a)_E)
where w represents the discrimination network parameter, α_d represents the learning rate of the discrimination network parameter, D_JS represents the discrimination network loss function, and (s, a)_π and (s, a)_E respectively represent the state-action trajectories sampled from the policy network buffer B_π(s_t, a_t) and the expert data buffer B_E(s_t, a_t);
S44: calculating the reward value r_t of each step in the training process according to the specific form of each reward function, and storing it in the policy network buffer B_π(s_t, a_t, r_t);
S45: calculating the advantage value A_t of each time step according to the advantage function, and storing it in the policy network buffer B_π(s_t, a_t, r_t, A_t);
S46: according to the proximal policy optimization algorithm, updating the parameter θ of the policy network by gradient descent using the data in the policy network buffer B_π(s_t, a_t, r_t, A_t):
θ ← θ - α_p ∇_θ L(θ)
where θ represents the policy network parameter, α_p represents the learning rate of the policy network parameter, and L(θ) represents the policy network objective function;
S47: calculating the difference of the average returns over adjacent time periods; if the difference is less than the set threshold Thre, stopping the training process, saving the network parameters θ and w, and also saving the time step t at convergence, the average return R̄ over the most recent period and its standard deviation S; otherwise, returning to step S42 and re-executing steps S42-S46;
S48: after all training processes are finished, calculating the magnitude of the performance index according to the data (t, R̄, S) saved at final convergence, and selecting the reward function for the current task.
5. The method for selecting the reward function in adversarial imitation learning according to claim 4, wherein the loss function D_JS is calculated as follows:
D_JS = -E_{(s_E, a_E) ~ B_E}[ log D_w(s_E, a_E) ] - E_{(s_π, a_π) ~ B_π}[ log(1 - D_w(s_π, a_π)) ]
where (s_π, a_π) and (s_E, a_E) are state-action samples from the policy network buffer B_π and the expert data buffer B_E, respectively.
6. The method for selecting the reward function in adversarial imitation learning according to claim 4, wherein the advantage function A_t is used to measure how much benefit the current action brings;
the generalized advantage estimation algorithm computes it from the trajectory data collected by the policy network over the elapsed time T, with the specific formula:
A_t = δ_t + (γλ)δ_{t+1} + ... + (γλ)^{T-t+1} δ_{T-1}
where δ_t = r_t + γV(s_{t+1}) - V(s_t), γ and λ are hyperparameters, r_t is the reward value calculated from the reward function, s_t and s_{t+1} represent the states at the current and next moments, and V(s_t) and V(s_{t+1}) represent the corresponding state values.
7. The method for selecting the reward function in adversarial imitation learning according to claim 4, wherein the proximal policy optimization algorithm improves the loss function term of the general policy gradient algorithm by replacing the original L^PG term with an L^CLIP term, whose specific formula is:
L^CLIP(θ) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1 - ε, 1 + ε) A_t ) ]
where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) represents the probability ratio, ε is a hyperparameter, A_t represents the aforementioned advantage function, clip represents the truncation function, E_t represents the expectation, and min is the minimum function.
8. The method for selecting the reward function in adversarial imitation learning according to claim 7, wherein the final objective function is obtained by combining the improved term with a value function error term and a policy entropy term:
L_t^{CLIP+VF+S}(θ) = E_t[ L_t^{CLIP}(θ) - c_1 L_t^{VF}(θ) + c_2 S[π_θ](s_t) ]
where L_t^{CLIP}(θ) is the loss function improvement term described above, L_t^{VF}(θ) is the squared-error value loss, S is the exploration entropy, and c_1 and c_2 are the corresponding hyperparameters.
9. The method for selecting the reward function in adversarial imitation learning according to claim 4, wherein the performance index of the reward function is calculated as a function of t_i, R̄_i, S_i and the hyperparameters T and β, where t_i represents the total number of steps required for the training process to converge, R̄_i is the average return over the most recent period, and S_i is the standard deviation of the return over the most recent period.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN202010323155.4A 2020-04-22 2020-04-22 Selection method of countermeasure type imitation learning winning function Active CN111401556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010323155.4A CN111401556B (en) 2020-04-22 2020-04-22 Selection method of countermeasure type imitation learning winning function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010323155.4A CN111401556B (en) 2020-04-22 2020-04-22 Selection method of countermeasure type imitation learning winning function

Publications (2)

Publication Number Publication Date
CN111401556A true CN111401556A (en) 2020-07-10
CN111401556B CN111401556B (en) 2023-06-30

Family

ID=71431701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010323155.4A Active CN111401556B (en) 2020-04-22 2020-04-22 Selection method of countermeasure type imitation learning winning function

Country Status (1)

Country Link
CN (1) CN111401556B (en)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111983922A (en) * 2020-07-13 2020-11-24 广州中国科学院先进技术研究所 Robot demonstration teaching method based on meta-simulation learning
CN112052947A (en) * 2020-08-17 2020-12-08 清华大学 Hierarchical reinforcement learning method and device based on strategy options
CN112249032A (en) * 2020-10-29 2021-01-22 浪潮(北京)电子信息产业有限公司 Automatic driving decision method, system, equipment and computer storage medium
CN112434171A (en) * 2020-11-26 2021-03-02 中山大学 Knowledge graph reasoning and complementing method and system based on reinforcement learning
CN112894809A (en) * 2021-01-18 2021-06-04 华中科技大学 Impedance controller design method and system based on reinforcement learning
CN112975967A (en) * 2021-02-26 2021-06-18 同济大学 Service robot quantitative water pouring method based on simulation learning and storage medium
CN113052253A (en) * 2021-03-31 2021-06-29 北京字节跳动网络技术有限公司 Hyper-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN113221469A (en) * 2021-06-04 2021-08-06 上海天壤智能科技有限公司 Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator
CN113222297A (en) * 2021-06-08 2021-08-06 上海交通大学 Method, system, equipment and medium suitable for cyclic updating planning of solid waste base garden
CN113240118A (en) * 2021-05-18 2021-08-10 中国科学院自动化研究所 Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium
CN113239634A (en) * 2021-06-11 2021-08-10 上海交通大学 Simulator modeling method based on robust simulation learning
CN113467515A (en) * 2021-07-22 2021-10-01 南京大学 Unmanned aerial vehicle flight control method based on virtual environment simulation reconstruction and reinforcement learning
CN113504723A (en) * 2021-07-05 2021-10-15 北京航空航天大学 Carrier rocket load shedding control method based on inverse reinforcement learning
CN113609786A (en) * 2021-08-27 2021-11-05 中国人民解放军国防科技大学 Mobile robot navigation method and device, computer equipment and storage medium
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
CN113704979A (en) * 2021-08-07 2021-11-26 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuver control method based on random neural network
CN113781190A (en) * 2021-01-13 2021-12-10 北京沃东天骏信息技术有限公司 Bill data processing method, system, computer system and medium
CN113852645A (en) * 2021-12-02 2021-12-28 北京邮电大学 Method and device for resisting client DNS cache poisoning attack and electronic equipment
CN113962012A (en) * 2021-07-23 2022-01-21 中国科学院自动化研究所 Unmanned aerial vehicle countermeasure strategy optimization method and device
WO2022044314A1 (en) * 2020-08-31 2022-03-03 日本電気株式会社 Learning device, learning method, and learning program
CN114625089A (en) * 2022-03-15 2022-06-14 大连东软信息学院 Job shop scheduling method based on improved near-end strategy optimization algorithm
CN114683280A (en) * 2022-03-17 2022-07-01 达闼机器人股份有限公司 Object control method, device, storage medium and electronic equipment
CN114986518A (en) * 2022-07-19 2022-09-02 聊城一明五金科技有限公司 Intelligent control method and system for automobile disassembly production line
CN115314399A (en) * 2022-08-05 2022-11-08 北京航空航天大学 Data center flow scheduling method based on inverse reinforcement learning
CN115470710A (en) * 2022-09-26 2022-12-13 北京鼎成智造科技有限公司 Air game simulation method and device
CN115688858A (en) * 2022-10-20 2023-02-03 哈尔滨工业大学(深圳) Fine-grained expert behavior simulation learning method, device, medium and terminal
WO2023109663A1 (en) * 2021-12-17 2023-06-22 深圳先进技术研究院 Serverless computing resource configuration method based on maximum entropy inverse reinforcement learning
CN117193008A (en) * 2023-10-07 2023-12-08 航天科工集团智能科技研究院有限公司 Small sample robust imitation learning training method oriented to high-dimensional disturbance environment, electronic equipment and storage medium
CN113704979B (en) * 2021-08-07 2024-05-10 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuvering control method based on random neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376862A (en) * 2018-10-29 2019-02-22 中国石油大学(华东) A kind of time series generation method based on generation confrontation network
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376862A (en) * 2018-10-29 2019-02-22 中国石油大学(华东) A kind of time series generation method based on generation confrontation network
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MING ZHANG: "Wasserstein Distance guided Adversarial Imitation Learning with Reward Shape Exploration", 《IEEE》 *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111983922A (en) * 2020-07-13 2020-11-24 广州中国科学院先进技术研究所 Robot demonstration teaching method based on meta-simulation learning
CN112052947A (en) * 2020-08-17 2020-12-08 清华大学 Hierarchical reinforcement learning method and device based on strategy options
WO2022044314A1 (en) * 2020-08-31 2022-03-03 日本電気株式会社 Learning device, learning method, and learning program
CN112249032A (en) * 2020-10-29 2021-01-22 浪潮(北京)电子信息产业有限公司 Automatic driving decision method, system, equipment and computer storage medium
CN112249032B (en) * 2020-10-29 2022-02-18 浪潮(北京)电子信息产业有限公司 Automatic driving decision method, system, equipment and computer storage medium
CN112434171A (en) * 2020-11-26 2021-03-02 中山大学 Knowledge graph reasoning and complementing method and system based on reinforcement learning
CN113781190A (en) * 2021-01-13 2021-12-10 北京沃东天骏信息技术有限公司 Bill data processing method, system, computer system and medium
CN112894809B (en) * 2021-01-18 2022-08-02 华中科技大学 Impedance controller design method and system based on reinforcement learning
CN112894809A (en) * 2021-01-18 2021-06-04 华中科技大学 Impedance controller design method and system based on reinforcement learning
CN112975967A (en) * 2021-02-26 2021-06-18 同济大学 Service robot quantitative water pouring method based on simulation learning and storage medium
CN112975967B (en) * 2021-02-26 2022-06-28 同济大学 Service robot quantitative water pouring method based on simulation learning and storage medium
CN113052253A (en) * 2021-03-31 2021-06-29 北京字节跳动网络技术有限公司 Hyper-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN113240118A (en) * 2021-05-18 2021-08-10 中国科学院自动化研究所 Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium
CN113221469A (en) * 2021-06-04 2021-08-06 上海天壤智能科技有限公司 Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator
CN113222297A (en) * 2021-06-08 2021-08-06 上海交通大学 Method, system, equipment and medium suitable for cyclic updating planning of solid waste base garden
CN113239634A (en) * 2021-06-11 2021-08-10 上海交通大学 Simulator modeling method based on robust simulation learning
CN113504723B (en) * 2021-07-05 2023-11-28 北京航空航天大学 Carrier rocket load shedding control method based on inverse reinforcement learning
CN113504723A (en) * 2021-07-05 2021-10-15 北京航空航天大学 Carrier rocket load shedding control method based on inverse reinforcement learning
CN113467515A (en) * 2021-07-22 2021-10-01 南京大学 Unmanned aerial vehicle flight control method based on virtual environment simulation reconstruction and reinforcement learning
CN113962012A (en) * 2021-07-23 2022-01-21 中国科学院自动化研究所 Unmanned aerial vehicle countermeasure strategy optimization method and device
CN113704979B (en) * 2021-08-07 2024-05-10 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuvering control method based on random neural network
CN113704979A (en) * 2021-08-07 2021-11-26 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuver control method based on random neural network
CN113609786A (en) * 2021-08-27 2021-11-05 中国人民解放军国防科技大学 Mobile robot navigation method and device, computer equipment and storage medium
CN113609786B (en) * 2021-08-27 2022-08-19 中国人民解放军国防科技大学 Mobile robot navigation method, device, computer equipment and storage medium
CN113688977B (en) * 2021-08-30 2023-12-05 浙江大学 Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
CN113852645A (en) * 2021-12-02 2021-12-28 北京邮电大学 Method and device for resisting client DNS cache poisoning attack and electronic equipment
WO2023109663A1 (en) * 2021-12-17 2023-06-22 深圳先进技术研究院 Serverless computing resource configuration method based on maximum entropy inverse reinforcement learning
CN114625089B (en) * 2022-03-15 2024-05-03 大连东软信息学院 Job shop scheduling method based on improved near-end strategy optimization algorithm
CN114625089A (en) * 2022-03-15 2022-06-14 大连东软信息学院 Job shop scheduling method based on improved near-end strategy optimization algorithm
CN114683280A (en) * 2022-03-17 2022-07-01 达闼机器人股份有限公司 Object control method, device, storage medium and electronic equipment
CN114683280B (en) * 2022-03-17 2023-11-17 达闼机器人股份有限公司 Object control method and device, storage medium and electronic equipment
CN114986518A (en) * 2022-07-19 2022-09-02 聊城一明五金科技有限公司 Intelligent control method and system for automobile disassembly production line
CN115314399A (en) * 2022-08-05 2022-11-08 北京航空航天大学 Data center flow scheduling method based on inverse reinforcement learning
CN115314399B (en) * 2022-08-05 2023-09-15 北京航空航天大学 Data center flow scheduling method based on inverse reinforcement learning
CN115470710B (en) * 2022-09-26 2023-06-06 北京鼎成智造科技有限公司 Air game simulation method and device
CN115470710A (en) * 2022-09-26 2022-12-13 北京鼎成智造科技有限公司 Air game simulation method and device
CN115688858A (en) * 2022-10-20 2023-02-03 哈尔滨工业大学(深圳) Fine-grained expert behavior simulation learning method, device, medium and terminal
CN115688858B (en) * 2022-10-20 2024-02-09 哈尔滨工业大学(深圳) Fine granularity expert behavior imitation learning method, device, medium and terminal
CN117193008B (en) * 2023-10-07 2024-03-01 航天科工集团智能科技研究院有限公司 Small sample robust imitation learning training method oriented to high-dimensional disturbance environment, electronic equipment and storage medium
CN117193008A (en) * 2023-10-07 2023-12-08 航天科工集团智能科技研究院有限公司 Small sample robust imitation learning training method oriented to high-dimensional disturbance environment, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111401556B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN111401556A (en) Selection method of opponent type imitation learning winning incentive function
CN110119844B (en) Robot motion decision method, system and device introducing emotion regulation and control mechanism
Camerer et al. Behavioural game theory: thinking, learning and teaching
Erev et al. On adaptation, maximization, and reinforcement learning among cognitive strategies.
Dayan et al. Decision theory, reinforcement learning, and the brain
Van Otterlo The logic of adaptive behavior: Knowledge representation and algorithms for adaptive sequential decision making under uncertainty in first-order and relational domains
CN111291890A (en) Game strategy optimization method, system and storage medium
Hotaling et al. Dynamic decision making
CN114330651A (en) Layered multi-agent reinforcement learning method oriented to multi-element joint instruction control
Intisar et al. Classification of online judge programmers based on rule extraction from self organizing feature map
Kim et al. Generalization of TORCS car racing controllers with artificial neural networks and linear regression analysis
CN115033878A (en) Rapid self-game reinforcement learning method and device, computer equipment and storage medium
CN113947246B (en) Loss processing method and device based on artificial intelligence and electronic equipment
CN117112742A (en) Dialogue model optimization method and device, computer equipment and storage medium
CN115906673A (en) Integrated modeling method and system for combat entity behavior model
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
US11478716B1 (en) Deep learning for data-driven skill estimation
CN114911157A (en) Robot navigation control method and system based on partial observable reinforcement learning
CN112884129B (en) Multi-step rule extraction method, device and storage medium based on teaching data
De Penning et al. Applying neural-symbolic cognitive agents in intelligent transport systems to reduce CO 2 emissions
Belavkin Conflict resolution by random estimated costs
CN117474077B (en) Auxiliary decision making method and device based on OAR model and reinforcement learning
US11869383B2 (en) Method, system and non-transitory computer- readable recording medium for providing information on user's conceptual understanding
O'Hanlon Using Supervised Machine Learning to Predict the Final Rankings of the 2021 Formula One Championship
Yang et al. The Cognitive Substrates of Model-Based Learning: An Integrative Declarative-Procedural Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant