CN111401556A - Method for selecting a reward function in adversarial imitation learning - Google Patents
Method for selecting a reward function in adversarial imitation learning
- Publication number
- CN111401556A (application number CN202010323155.4A)
- Authority
- CN
- China
- Prior art keywords
- function
- network
- reward
- strategy
- expert
- Prior art date
- Legal status
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a method for selecting the reward function in adversarial imitation learning, which comprises the following steps: constructing a policy network with parameter theta, a discriminator network with parameter w, and at least two reward functions; acquiring demonstration data under an expert policy and storing it in an expert data buffer containing expert trajectories; the input of the policy network is the state returned by the simulation environment and its output is a decision action; the discriminator network updates its parameters using state-action pairs under the expert policy and state-action pairs of the policy network; in the reward calculation stage, the input of the discriminator network is a state-action pair of the policy network, and the output value is the reward value computed by the reward function; the reward function of the current task is selected according to the performance indexes of the different reward functions; and the parameters of the policy network corresponding to the selected reward function are saved. The agent learns under the guidance of different reward functions, and the optimal reward function is then selected according to a performance evaluation index in the specific task scene.
Description
Technical Field
The invention relates to the technical field of reward function selection, and in particular to a method for selecting a reward function in adversarial imitation learning.
Background
In recent years, with the great breakthroughs of deep learning in image recognition, speech technology, natural language processing and other fields, deep reinforcement learning, which combines deep neural networks with reinforcement learning, has also achieved performance exceeding humans on large-scale policy optimization problems such as Go and StarCraft. One bottleneck of reinforcement learning is that, when faced with practical control problems such as autonomous driving and robotics, designing a reasonable reward function from expert experience is time consuming and laborious. Data-driven imitation learning offers a way to address this problem: a policy comparable to the expert's can be learned from demonstration data alone, without manually designing a reward function. Among the many imitation learning algorithms, behavioral cloning is the simplest; it performs imitation in a supervised learning manner. However, this method is susceptible to compounding errors and has difficulty handling situations that do not appear in the expert data. Inverse reinforcement learning is another class of imitation learning algorithm: it first learns a reward function from the expert data and then runs a reinforcement learning process to learn the policy under the guidance of that reward function. The reward function learned in this way is more robust and can cope with situations that do not appear in the expert data. However, the algorithm must alternate between finding the optimal reward function and running reinforcement learning training, and therefore requires a large amount of computation.
In the process of imitating the expert policy, the reward signal is derived from the output of the discriminator network: the closer the generated data is to the expert data, the larger the discriminator output and the larger the reward value, and conversely the smaller the reward value. Because a reward function in negative-logarithm form is used and its input value always lies in the range [0, 1], the reward value is always positive, which makes the algorithm difficult to apply to a wide variety of tasks.
The above background disclosure is provided only to assist understanding of the concept and technical solution of the present invention; it does not necessarily belong to the prior art of this patent application and, absent clear evidence that the above content was disclosed before the filing date of this application, it should not be used to evaluate the novelty and inventive step of this application.
Disclosure of Invention
The invention provides a method for selecting a reward function for adversarial imitation learning, aiming at solving the problem that the reward function form used in prior-art imitation learning algorithms based on generative adversarial networks is not optimal in all tasks and therefore cannot achieve optimal performance across multiple tasks.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
A method for selecting a reward function for adversarial imitation learning, comprising the following steps: S1: constructing a policy network π with parameter θ, a discriminator network D with parameter w, and at least two reward functions; S2: acquiring demonstration data under the expert policy and storing it in an expert data buffer B_E(s_t, a_t) containing expert trajectories; S3: the input of the policy network is the state s_t returned by the simulation environment Env, and its output is the decision action a_t; the discriminator network updates its parameters using the state-action pairs (s_t, a_t)_E under the expert policy and the state-action pairs (s_t, a_t)_π of the policy network; in the reward calculation stage, the input of the discriminator network is a state-action pair (s_t, a_t)_π of the policy network, and the output value is the reward value computed by the reward function; S4: selecting the reward function of the current task according to the performance indexes of the different reward functions; S5: saving the parameters of the policy network corresponding to the selected reward function.
Preferably, six reward functions are designed according to the different value ranges of the reward function.
Preferably, the reward functions are:
r_1(x) = x = log(σ(x)) − log(1 − σ(x))
r_2(x) = e^x
r_3(x) = −e^(−x)
r_4(x) = σ(x)
r_5(x) = −log(1 − σ(x))
r_6(x) = log(σ(x))
Preferably, the process of selecting the reward function of the current task according to the performance indexes of the reward functions comprises the following steps: S41: initializing a plurality of the simulation environments Env_i, the policy networks π_i and the discriminator networks D_i, and starting the training processes simultaneously, where i = 0, 1, …, 6; S42: in each training process, the current policy network π interacts with the simulation environment Env, and the state-action pair of the current time step is stored in a policy network buffer B_π(s_t, a_t); S43: sampling the state-action trajectory (s_t, a_t)_π under the current policy from the policy network buffer B_π(s_t, a_t), sampling the expert state-action trajectory (s_t, a_t)_E from the expert data buffer B_E(s_t, a_t), and optimizing the parameter w of the discriminator network D by gradient descent on the loss function D_JS:
w ← w − α_d·∇_w D_JS
where w represents the discriminator network parameter, α_d represents the learning rate of the discriminator network parameter, D_JS represents the discriminator network loss function, and (s, a)_π and (s, a)_E represent the state-action trajectories sampled from the policy network buffer B_π(s_t, a_t) and the expert data buffer B_E(s_t, a_t), respectively.
S44: calculating the reward value r_t of each step in the training process according to the specific form of each reward function, and storing it in the policy network buffer B_π(s_t, a_t, r_t); S45: calculating the advantage value A_t of each time step according to the advantage function, and storing it in the policy network buffer B_π(s_t, a_t, r_t, A_t); S46: according to the proximal policy optimization algorithm, updating the parameter θ of the policy network by gradient steps on the policy objective, using the data in the policy network buffer B_π(s_t, a_t, r_t, A_t):
θ ← θ + α_p·∇_θ L_π(θ)
where θ represents the policy network parameter, α_p represents the learning rate of the policy network parameter, and L_π(θ) represents the policy network objective function.
S47: calculating the difference of the average returns of adjacent time periods; if the difference is less than the set threshold Thre, stopping the training process, saving the network parameters θ and w, and at the same time saving the time step t at convergence, the average return R̄ over the most recent period and its standard deviation S; otherwise, returning to step S42 and re-executing steps S42–S46; S48: after all training processes are finished, calculating the performance index according to the data t, R̄ and S saved at final convergence, and selecting the reward function of the current task.
Preferably, the loss function D_JS is calculated as follows:
D_JS = −E_{(s_E, a_E) ~ B_E}[log D_w(s_E, a_E)] − E_{(s_π, a_π) ~ B_π}[log(1 − D_w(s_π, a_π))]
where (s_π, a_π) and (s_E, a_E) are state-action samples from the policy network buffer B_π and the expert data buffer B_E, respectively.
Preferably, the advantage function A_t is used to measure how much benefit the current action brings;
the generalized advantage estimation algorithm computes it from the trajectory data collected by the policy network over the elapsed time T, with the specific formula:
A_t = δ_t + (γλ)δ_{t+1} + … + (γλ)^{T−t+1}·δ_{T−1},
where δ_t = r_t + γV(s_{t+1}) − V(s_t), γ and λ are hyperparameters, r_t is the reward value calculated from the reward function, s_t and s_{t+1} represent the states at the current and next time steps, and V(s_t) and V(s_{t+1}) represent the corresponding state values.
Preferably, the proximal policy optimization algorithm improves the loss function term of the general policy gradient algorithm by replacing the original term L_GP with L_CLIP, whose specific formula is:
L^CLIP(θ) = Ê_t[ min( ρ_t(θ)·Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε)·Â_t ) ]
where ρ_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) represents the probability ratio, ε is a hyperparameter, Â_t represents the aforementioned advantage function, clip represents the truncation function, Ê_t represents the expectation, and min takes the minimum value.
Preferably, the final objective function is obtained by combining this improvement term with the value function error term and the policy entropy term:
L_t^{CLIP+VF+S}(θ) = Ê_t[ L_t^{CLIP}(θ) − c_1·L_t^{VF}(θ) + c_2·S[π_θ](s_t) ]
where L_t^{CLIP} is the loss function improvement term described above, L_t^{VF} is the squared-error value loss, S is the exploration entropy, and c_1 and c_2 are the corresponding hyperparameters.
Preferably, the performance index of each reward function is calculated from t_i, the total number of steps required for the training process to converge, R̄_i, the average return value over the last period of time, S_i, the standard deviation of the return over the last period of time, and the hyperparameters T and β.
The invention also provides a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of the above.
The invention has the beneficial effects that: a method for selecting the reward function in adversarial imitation learning is provided; by designing multiple forms of reward function, the agent can learn under the guidance of different reward functions, and the optimal reward function is then selected according to a performance evaluation index in the specific task scene.
Furthermore, with only a small amount of computation, the optimal reward function can be selected automatically for different tasks through the performance evaluation index and the reward function optimization algorithm, without repeatedly trying the effects of different reward functions.
Drawings
FIG. 1 is a schematic diagram of a method for selecting a reward function for adversarial imitation learning according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of selecting the reward function of the current task according to the performance indexes of different reward functions according to an embodiment of the present invention.
FIG. 3 is a flow chart of the preferred reward function selection algorithm for adversarial imitation learning according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a training process curve according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing function or a circuit connection function.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
As shown in FIG. 1, the invention provides a method for selecting a reward function for adversarial imitation learning, comprising the following steps:
S1: constructing a policy network π with parameter θ, a discriminator network D with parameter w, and at least two reward functions;
S2: acquiring demonstration data under the expert policy and storing it in an expert data buffer B_E(s_t, a_t) containing expert trajectories;
data is typically derived from the data set disclosed by OpenAI et al, which facilitates comparison with the effects of other algorithms. The expert data buffer typically stores 5 complete expert data tracks from which some data is sampled for training in a certain manner during the algorithm learning process.
S3: controlling the input of the policy network to be the state s returned by the simulation environment EnvtThe output is decision action at(ii) a The discriminating network utilizes state-action pairs under expert policy(s)t,at)EAnd the state action pairs of the policy networkt,at)πUpdating the parameters; in the stage of calculating the reward, the input of the discrimination network is the state action pair(s) of the policy networkt,at)πThe output value is the reward value obtained by calculation of the reward function;
S4: selecting the reward function of the current task according to the performance indexes of the different reward functions;
S5: saving the parameters of the policy network corresponding to the selected reward function.
In an embodiment of the present invention, six reward functions are designed according to the different value ranges of the reward function. The reward functions are respectively:
r_1(x) = x = log(σ(x)) − log(1 − σ(x)) ......(1)
r_2(x) = e^x ......(2)
r_3(x) = −e^(−x) ......(3)
r_4(x) = σ(x) ......(4)
r_5(x) = −log(1 − σ(x)) ......(5)
r_6(x) = log(σ(x)) ......(6)
First, these are all conventional functions. Functions (1), (5) and (6) are the reward function forms used in the original algorithm: the value range of function (1) can be positive or negative, the value range of function (5) is always positive, and the value range of function (6) is always negative. Previous studies have confirmed that functions with different value ranges are suited to different types of tasks. Therefore, three further reward functions are designed according to different value ranges: the value ranges of functions (2) and (4) are always positive, and the value range of function (3) is always negative.
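For concreteness, the six candidate reward shapes can be written as simple functions of the raw discriminator output x, as in the following Python sketch (the dictionary layout and NumPy implementation are illustrative assumptions):

```python
import numpy as np

def _sigma(x):
    # logistic function sigma(x)
    return 1.0 / (1.0 + np.exp(-x))

# The six candidate reward shapes (1)-(6); x is the raw discriminator output.
REWARD_FUNCTIONS = {
    1: lambda x: x,                          # = log(sigma(x)) - log(1 - sigma(x)); positive or negative
    2: lambda x: np.exp(x),                  # always positive
    3: lambda x: -np.exp(-x),                # always negative
    4: lambda x: _sigma(x),                  # always positive
    5: lambda x: -np.log(1.0 - _sigma(x)),   # always positive
    6: lambda x: np.log(_sigma(x)),          # always negative
}
```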
As shown in FIG. 2, the process of selecting the reward function of the current task according to the performance indexes of the reward functions comprises:
S41: initializing a plurality of the simulation environments Env_i, the policy networks π_i and the discriminator networks D_i, and starting the training processes simultaneously, where i = 0, 1, …, 6;
S42: in each training process, the current policy network π interacts with the simulation environment Env, and the state-action pair of the current time step is stored in a policy network buffer B_π(s_t, a_t);
S43: sampling the state-action trajectory (s_t, a_t)_π under the current policy from the policy network buffer B_π(s_t, a_t), sampling the expert state-action trajectory (s_t, a_t)_E from the expert data buffer B_E(s_t, a_t), and optimizing the parameter w of the discriminator network D by gradient descent on the loss function D_JS:
w ← w − α_d·∇_w D_JS
where w represents the discriminator network parameter, α_d represents the learning rate of the discriminator network parameter, D_JS represents the discriminator network loss function, and (s, a)_π and (s, a)_E represent the state-action trajectories sampled from the policy network buffer B_π(s_t, a_t) and the expert data buffer B_E(s_t, a_t), respectively.
S44: calculating the reward value r_t of each step in the training process according to the specific form of each reward function, and storing it in the policy network buffer B_π(s_t, a_t, r_t);
S45: calculating the advantage value A_t of each time step according to the advantage function, and storing it in the policy network buffer B_π(s_t, a_t, r_t, A_t);
S46: according to the proximal policy optimization algorithm, updating the parameter θ of the policy network by gradient steps on the policy objective, using the data in the policy network buffer B_π(s_t, a_t, r_t, A_t):
θ ← θ + α_p·∇_θ L_π(θ)
where θ represents the policy network parameter, α_p represents the learning rate of the policy network parameter, and L_π(θ) represents the policy network objective function.
S47: calculating the difference of the average returns of adjacent time periods; if the difference is less than the set threshold Thre, stopping the training process, saving the network parameters θ and w, and at the same time saving the time step t at convergence, the average return R̄ over the most recent period and its standard deviation S; otherwise, returning to step S42 and re-executing steps S42–S46;
S48: after all training processes are finished, calculating the performance index according to the data t, R̄ and S saved at final convergence, and selecting the reward function of the current task.
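A compressed sketch of one training process (steps S42–S47) is given below. It only fixes the outer loop and the stopping rule of step S47; the per-period work (environment interaction, discriminator update, reward relabelling, GAE and the PPO update) is passed in as a caller-supplied callable, so the function names and default values here are assumptions, not the patent's reference implementation.

```python
from typing import Callable, List, Tuple

def run_until_converged(train_one_period: Callable[[], float],
                        thre: float = 50.0,
                        period_steps: int = 2048,
                        max_steps: int = 1_000_000) -> Tuple[int, List[float]]:
    """Outer loop of steps S42-S47: repeat one training period until the average
    return of two adjacent periods differs by less than the threshold Thre.
    `train_one_period` is assumed to run one full period (S42-S46) and return
    that period's average return."""
    avg_returns: List[float] = []
    steps = 0
    while steps < max_steps:
        avg_returns.append(train_one_period())
        steps += period_steps
        # step S47: stop when adjacent-period average returns are close enough
        if len(avg_returns) >= 2 and abs(avg_returns[-1] - avg_returns[-2]) < thre:
            break
    return steps, avg_returns
```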
The loss function D_JS is calculated as follows:
D_JS = −E_{(s_E, a_E) ~ B_E}[log D_w(s_E, a_E)] − E_{(s_π, a_π) ~ B_π}[log(1 − D_w(s_π, a_π))]
where (s_π, a_π) and (s_E, a_E) are state-action samples from the policy network buffer B_π and the expert data buffer B_E, respectively.
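The discriminator update can be sketched with a standard binary cross-entropy loss, assuming (as above) that the discriminator is trained to output values near 1 on expert state-action pairs and near 0 on policy state-action pairs; this PyTorch fragment is an assumed implementation, not the patent's own code.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_logits_policy: torch.Tensor,
                       d_logits_expert: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy form of the discriminator loss D_JS over one batch
    of policy logits and one batch of expert logits."""
    expert_term = F.binary_cross_entropy_with_logits(
        d_logits_expert, torch.ones_like(d_logits_expert))   # push D(s,a) -> 1 on expert data
    policy_term = F.binary_cross_entropy_with_logits(
        d_logits_policy, torch.zeros_like(d_logits_policy))  # push D(s,a) -> 0 on policy data
    return expert_term + policy_term
```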
FIG. 3 is a schematic diagram of the preferred algorithm flow for reward function selection in adversarial imitation learning according to the present invention.
In policy gradient optimization algorithms, the advantage function A_t is often used to measure how much benefit the current action brings, so making a reasonable estimate of the advantage function A_t is critical. The Generalized Advantage Estimation (GAE) algorithm is an efficient method for estimating advantages; it is computed from the trajectory data collected by the policy over the elapsed time T, with the specific formula:
A_t = δ_t + (γλ)δ_{t+1} + … + (γλ)^{T−t+1}·δ_{T−1},
where δ_t = r_t + γV(s_{t+1}) − V(s_t), γ and λ are hyperparameters, r_t is the reward value calculated from the reward function, s_t and s_{t+1} represent the states at the current and next time steps, and V(s_t) and V(s_{t+1}) represent the corresponding state values.
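The GAE recursion above can be implemented in a few lines; the sketch below is a generic NumPy version under the stated definitions of δ_t, γ and λ, not code taken from the patent.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one collected period of length T.
    `values` holds V(s_t) for t = 0..T-1 and `last_value` is V(s_T)."""
    T = len(rewards)
    values = np.append(np.asarray(values, dtype=np.float64), last_value)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # delta_t
        gae = delta + gamma * lam * gae                          # running (gamma*lambda)-discounted sum
        advantages[t] = gae
    return advantages
```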
The proximal policy optimization algorithm improves the loss function term of the general policy gradient algorithm by replacing the original term L_GP with L_CLIP, whose specific formula is:
L^CLIP(θ) = Ê_t[ min( ρ_t(θ)·Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε)·Â_t ) ]
where ρ_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) represents the probability ratio, ε is a hyperparameter, Â_t represents the aforementioned advantage function, clip represents the truncation function, Ê_t represents the expectation, and min takes the minimum value.
The final objective function is obtained by combining this improvement term with the value function error term and the policy entropy term:
L_t^{CLIP+VF+S}(θ) = Ê_t[ L_t^{CLIP}(θ) − c_1·L_t^{VF}(θ) + c_2·S[π_θ](s_t) ]
where L_t^{CLIP} is the loss function improvement term described above, L_t^{VF} is the squared-error value loss, S is the exploration entropy, and c_1 and c_2 are the corresponding hyperparameters.
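A minimal PyTorch sketch of the clipped surrogate objective combined with the value-error and entropy terms is shown below; the sign convention (returning a loss to be minimized) and the default coefficients are assumptions consistent with the formulas above.

```python
import torch

def ppo_loss(log_prob_new, log_prob_old, advantages, values_pred, returns,
             entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate objective L^CLIP plus value-error and entropy terms,
    negated so that it can be minimized by gradient descent."""
    ratio = torch.exp(log_prob_new - log_prob_old)                     # probability ratio rho_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    l_clip = torch.min(unclipped, clipped).mean()                      # L^CLIP
    value_loss = ((values_pred - returns) ** 2).mean()                 # squared-error value loss L^VF
    return -(l_clip - c1 * value_loss + c2 * entropy.mean())
```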
The performance index of each reward function is calculated from t_i, the total number of steps required for that training process to converge, R̄_i, the average return value over the last period of time, S_i, the standard deviation of the return over the last period of time, and the hyperparameters T and β.
The optimal reward function for the current task can be selected according to this performance index. Experiments confirm the reliability of the index: the reward function selected in this way achieves the best performance at test time.
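Putting the pieces together, reward-function selection (steps S41 and S48) can be sketched as a small driver that trains one agent per candidate reward shape and keeps the best one; `train_with_reward` and `performance_index` are assumed caller-supplied callables, since the exact index formula is not reproduced in this text.

```python
def select_reward_function(reward_ids, train_with_reward, performance_index):
    """Train one agent per candidate reward shape, score each run and return
    the id and policy parameters of the best one (steps S4/S5)."""
    results = {}
    for rid in reward_ids:
        # assumed to run one full training process and return
        # (steps to converge, mean return, return std, policy parameters)
        t_i, mean_r, std_r, policy_params = train_with_reward(rid)
        results[rid] = (performance_index(t_i, mean_r, std_r), policy_params)
    best = max(results, key=lambda rid: results[rid][0])
    return best, results[best][1]
```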
In one specific task, the method is applied to a high-dimensional continuous control task, specifically as follows:
The objective of this task is to imitate an expert policy in order to control a robot in a simulation environment so that it learns to walk forward as quickly as possible. For the agent in the simulation environment, the input is an 11-dimensional state quantity and the output control action is an 8-dimensional continuous variable. During training with the reward function optimization algorithm, the policy network and the discriminator network of the agent have similar structures: the policy network comprises two hidden layers of 128 nodes, each with a tanh activation function; the discriminator network consists of two hidden layers of 100 nodes, also with tanh activations; and the parameters of both networks are optimized with the Adam optimizer.
The hyperparameters used in the adversarial imitation learning algorithm are shown in Table 1:
TABLE 1 Imitation learning hyperparameters

| Hyperparameter | Value |
|---|---|
| Full period (T) | 2048 |
| Policy network learning rate (α_p) | 3e-4 |
| Discriminator network learning rate (α_d) | 1e-3 |
| Discount factor (γ) | 0.99 |
| GAE parameter (λ) | 0.95 |
| Value error coefficient (c_1) | 0.5 |
| Policy entropy coefficient (c_2) | 0.01 |
The hyperparameters in the reward function evaluation index calculation formula are shown in Table 2:

TABLE 2 Performance evaluation hyperparameters

| Hyperparameter | Value |
|---|---|
| Normalization range (T) | 1e6 |
| Discount factor (β) | 0.25 |
| Convergence threshold (Thre) | 50 |
In this environment, a random policy achieves a reward score of −60.21 ± 30.40, while the expert policy scores 4066.96 ± 688.97. Trajectories of 5 sets of demonstration data are obtained from the public data set, and the agent is trained in parallel under the guidance of the 6 reward functions using this expert decision data. The scores during agent training are normalized to [0, 1] according to the range determined by the random-policy and expert-policy scores.
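The normalization mentioned above is a simple linear rescaling between the random-policy and expert-policy scores, for example:

```python
def normalize_score(score, random_score=-60.21, expert_score=4066.96):
    """Map a raw episode return to [0, 1] using the random-policy and
    expert-policy scores quoted above."""
    return (score - random_score) / (expert_score - random_score)
```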
As shown in FIG. 4, 0 represents the score of the random policy and 1 represents the score of the expert policy. It can be seen that the final performance of the agents differs under the different reward functions, and the training processes of the agents corresponding to reward functions (2), (4) and (5) are better. According to the data saved at the end of training and the designed performance evaluation index of the reward functions, the finally selected reward function is r_2(x) = e^x. Its average score over 5 test episodes is 3796.00, slightly higher than the scores of the other two reward functions r_5(x) = −log(1 − σ(x)) and r_4(x) = σ(x), which are 3785.31 and 3789.75, respectively. This indicates that the optimal reward function in this task is indeed r_2(x) = e^x and demonstrates the effectiveness of the proposed performance evaluation index for picking the optimal reward function.
In previous algorithms, any of the reward functions (1), (5) and (6) might be used, and their effectiveness can vary significantly across different tasks. It can be seen that function (1) can hardly complete the learning task here, and although functions (5) and (6) perform better, they are still slightly inferior to reward function (2). Therefore, through the performance evaluation index and the reward function optimization algorithm, the optimal reward function can be selected automatically for different tasks without repeatedly trying the effects of different reward functions.
An embodiment of the present application further provides a control apparatus, including a processor and a storage medium for storing a computer program; wherein the processor is adapted to perform at least the method described above when executing the computer program.
Embodiments of the present application also provide a storage medium for storing a computer program, which when executed performs at least the method described above.
Embodiments of the present application further provide a processor, where the processor executes a computer program to perform at least the method described above.
The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM). The volatile memory may be a Random Access Memory (RAM), which serves as an external cache; by way of example and not limitation, many forms of RAM are available, such as Dynamic Random Access Memory (DRAM) and Synchronous Dynamic Random Access Memory (SDRAM).
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not to be considered limited to these descriptions. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the concept of the invention, and all of them are considered to fall within the scope of protection of the invention.
Claims (10)
1. A method for selecting a reward function for adversarial imitation learning, comprising the following steps:
S1: constructing a policy network π with parameter θ, a discriminator network D with parameter w, and at least two reward functions;
S2: acquiring demonstration data under the expert policy and storing it in an expert data buffer B_E(s_t, a_t) containing expert trajectories;
S3: the input of the policy network is the state s_t returned by the simulation environment Env, and its output is the decision action a_t; the discriminator network updates its parameters using the state-action pairs (s_t, a_t)_E under the expert policy and the state-action pairs (s_t, a_t)_π of the policy network; in the reward calculation stage, the input of the discriminator network is a state-action pair (s_t, a_t)_π of the policy network, and the output value is the reward value computed by the reward function;
S4: selecting the reward function of the current task according to the performance indexes of the different reward functions;
S5: saving the parameters of the policy network corresponding to the selected reward function.
2. The method for selecting a reward function for adversarial imitation learning according to claim 1, wherein 6 reward functions are designed according to the different value ranges of the reward function.
3. The method for selecting a reward function for adversarial imitation learning according to claim 2, wherein the reward functions are:
r_1(x) = x = log(σ(x)) − log(1 − σ(x))
r_2(x) = e^x
r_3(x) = −e^(−x)
r_4(x) = σ(x)
r_5(x) = −log(1 − σ(x))
r_6(x) = log(σ(x))
4. The method for selecting a reward function for adversarial imitation learning according to claim 3, wherein selecting the reward function of the current task according to the performance indexes of the different reward functions comprises:
S41: initializing a plurality of the simulation environments Env_i, the policy networks π_i and the discriminator networks D_i, and starting the training processes simultaneously, where i = 0, 1, …, 6;
S42: in each training process, the current policy network π interacts with the simulation environment Env, and the state-action pair of the current time step is stored in a policy network buffer B_π(s_t, a_t);
S43: sampling the state-action trajectory (s_t, a_t)_π under the current policy from the policy network buffer B_π(s_t, a_t), sampling the expert state-action trajectory (s_t, a_t)_E from the expert data buffer B_E(s_t, a_t), and optimizing the parameter w of the discriminator network D by gradient descent on the loss function D_JS:
w ← w − α_d·∇_w D_JS
where w represents the discriminator network parameter, α_d represents the learning rate of the discriminator network parameter, D_JS represents the discriminator network loss function, and (s, a)_π and (s, a)_E represent the state-action trajectories sampled from the policy network buffer B_π(s_t, a_t) and the expert data buffer B_E(s_t, a_t), respectively.
S44: calculating the reward value r_t of each step in the training process according to the specific form of each reward function, and storing it in the policy network buffer B_π(s_t, a_t, r_t);
S45: calculating the advantage value A_t of each time step according to the advantage function, and storing it in the policy network buffer B_π(s_t, a_t, r_t, A_t);
S46: according to the proximal policy optimization algorithm, updating the parameter θ of the policy network by gradient steps on the policy objective, using the data in the policy network buffer B_π(s_t, a_t, r_t, A_t):
θ ← θ + α_p·∇_θ L_π(θ)
where θ represents the policy network parameter, α_p represents the learning rate of the policy network parameter, and L_π(θ) represents the policy network objective function.
S47: calculating the difference of the average returns of adjacent time periods; if the difference is less than the set threshold Thre, stopping the training process, saving the network parameters θ and w, and at the same time saving the time step t at convergence, the average return R̄ over the most recent period and its standard deviation S; otherwise, returning to step S42 and re-executing steps S42–S46;
5. The method for selecting a reward function for adversarial imitation learning according to claim 4, wherein the loss function D_JS is calculated as follows:
D_JS = −E_{(s_E, a_E) ~ B_E}[log D_w(s_E, a_E)] − E_{(s_π, a_π) ~ B_π}[log(1 − D_w(s_π, a_π))]
where (s_π, a_π) and (s_E, a_E) are state-action samples from the policy network buffer B_π and the expert data buffer B_E, respectively.
6. The method for selecting a reward function for adversarial imitation learning according to claim 4, wherein the advantage function A_t is used to measure how much benefit the current action brings;
the generalized advantage estimation algorithm computes it from the trajectory data collected by the policy network over the elapsed time T, with the specific formula:
A_t = δ_t + (γλ)δ_{t+1} + … + (γλ)^{T−t+1}·δ_{T−1},
where δ_t = r_t + γV(s_{t+1}) − V(s_t), γ and λ are hyperparameters, r_t is the reward value calculated from the reward function, s_t and s_{t+1} represent the states at the current and next time steps, and V(s_t) and V(s_{t+1}) represent the corresponding state values.
7. The method for selecting a reward function for adversarial imitation learning according to claim 4, wherein the proximal policy optimization algorithm improves the loss function term of the general policy gradient algorithm by replacing the original term L_GP with L_CLIP, whose specific formula is:
L^CLIP(θ) = Ê_t[ min( ρ_t(θ)·Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε)·Â_t ) ]
8. The method for selecting a reward function for adversarial imitation learning according to claim 7, wherein the final objective function is obtained by combining the improvement term with the value function error term and the policy entropy term:
L_t^{CLIP+VF+S}(θ) = Ê_t[ L_t^{CLIP}(θ) − c_1·L_t^{VF}(θ) + c_2·S[π_θ](s_t) ]
9. The method for selecting a reward function for adversarial imitation learning according to claim 4, wherein the performance index of each reward function is calculated from t_i, the total number of steps required for the training process to converge, R̄_i, the average return value over the last period, S_i, the standard deviation of the return over the last period, and the hyperparameters T and β.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010323155.4A CN111401556B (en) | 2020-04-22 | 2020-04-22 | Method for selecting a reward function in adversarial imitation learning
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010323155.4A CN111401556B (en) | 2020-04-22 | 2020-04-22 | Method for selecting a reward function in adversarial imitation learning
Publications (2)
Publication Number | Publication Date |
---|---|
CN111401556A true CN111401556A (en) | 2020-07-10 |
CN111401556B CN111401556B (en) | 2023-06-30 |
Family
ID=71431701
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010323155.4A Active CN111401556B (en) | 2020-04-22 | 2020-04-22 | Method for selecting a reward function in adversarial imitation learning
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111401556B (en) |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111983922A (en) * | 2020-07-13 | 2020-11-24 | 广州中国科学院先进技术研究所 | Robot demonstration teaching method based on meta-simulation learning |
CN112052947A (en) * | 2020-08-17 | 2020-12-08 | 清华大学 | Hierarchical reinforcement learning method and device based on strategy options |
CN112249032A (en) * | 2020-10-29 | 2021-01-22 | 浪潮(北京)电子信息产业有限公司 | Automatic driving decision method, system, equipment and computer storage medium |
CN112434171A (en) * | 2020-11-26 | 2021-03-02 | 中山大学 | Knowledge graph reasoning and complementing method and system based on reinforcement learning |
CN112894809A (en) * | 2021-01-18 | 2021-06-04 | 华中科技大学 | Impedance controller design method and system based on reinforcement learning |
CN112975967A (en) * | 2021-02-26 | 2021-06-18 | 同济大学 | Service robot quantitative water pouring method based on simulation learning and storage medium |
CN113052253A (en) * | 2021-03-31 | 2021-06-29 | 北京字节跳动网络技术有限公司 | Hyper-parameter determination method, device, deep reinforcement learning framework, medium and equipment |
CN113221469A (en) * | 2021-06-04 | 2021-08-06 | 上海天壤智能科技有限公司 | Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator |
CN113222297A (en) * | 2021-06-08 | 2021-08-06 | 上海交通大学 | Method, system, equipment and medium suitable for cyclic updating planning of solid waste base garden |
CN113240118A (en) * | 2021-05-18 | 2021-08-10 | 中国科学院自动化研究所 | Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium |
CN113239634A (en) * | 2021-06-11 | 2021-08-10 | 上海交通大学 | Simulator modeling method based on robust simulation learning |
CN113467515A (en) * | 2021-07-22 | 2021-10-01 | 南京大学 | Unmanned aerial vehicle flight control method based on virtual environment simulation reconstruction and reinforcement learning |
CN113504723A (en) * | 2021-07-05 | 2021-10-15 | 北京航空航天大学 | Carrier rocket load shedding control method based on inverse reinforcement learning |
CN113609786A (en) * | 2021-08-27 | 2021-11-05 | 中国人民解放军国防科技大学 | Mobile robot navigation method and device, computer equipment and storage medium |
CN113688977A (en) * | 2021-08-30 | 2021-11-23 | 浙江大学 | Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium |
CN113704979A (en) * | 2021-08-07 | 2021-11-26 | 中国航空工业集团公司沈阳飞机设计研究所 | Air countermeasure maneuver control method based on random neural network |
CN113781190A (en) * | 2021-01-13 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Bill data processing method, system, computer system and medium |
CN113852645A (en) * | 2021-12-02 | 2021-12-28 | 北京邮电大学 | Method and device for resisting client DNS cache poisoning attack and electronic equipment |
CN113962012A (en) * | 2021-07-23 | 2022-01-21 | 中国科学院自动化研究所 | Unmanned aerial vehicle countermeasure strategy optimization method and device |
WO2022044314A1 (en) * | 2020-08-31 | 2022-03-03 | 日本電気株式会社 | Learning device, learning method, and learning program |
CN114625089A (en) * | 2022-03-15 | 2022-06-14 | 大连东软信息学院 | Job shop scheduling method based on improved near-end strategy optimization algorithm |
CN114683280A (en) * | 2022-03-17 | 2022-07-01 | 达闼机器人股份有限公司 | Object control method, device, storage medium and electronic equipment |
CN114986518A (en) * | 2022-07-19 | 2022-09-02 | 聊城一明五金科技有限公司 | Intelligent control method and system for automobile disassembly production line |
CN115314399A (en) * | 2022-08-05 | 2022-11-08 | 北京航空航天大学 | Data center flow scheduling method based on inverse reinforcement learning |
CN115470710A (en) * | 2022-09-26 | 2022-12-13 | 北京鼎成智造科技有限公司 | Air game simulation method and device |
CN115688858A (en) * | 2022-10-20 | 2023-02-03 | 哈尔滨工业大学(深圳) | Fine-grained expert behavior simulation learning method, device, medium and terminal |
WO2023109663A1 (en) * | 2021-12-17 | 2023-06-22 | 深圳先进技术研究院 | Serverless computing resource configuration method based on maximum entropy inverse reinforcement learning |
CN117193008A (en) * | 2023-10-07 | 2023-12-08 | 航天科工集团智能科技研究院有限公司 | Small sample robust imitation learning training method oriented to high-dimensional disturbance environment, electronic equipment and storage medium |
CN113704979B (en) * | 2021-08-07 | 2024-05-10 | 中国航空工业集团公司沈阳飞机设计研究所 | Air countermeasure maneuvering control method based on random neural network |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109376862A (en) * | 2018-10-29 | 2019-02-22 | 中国石油大学(华东) | A kind of time series generation method based on generation confrontation network |
CN111026272A (en) * | 2019-12-09 | 2020-04-17 | 网易(杭州)网络有限公司 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109376862A (en) * | 2018-10-29 | 2019-02-22 | 中国石油大学(华东) | A kind of time series generation method based on generation confrontation network |
CN111026272A (en) * | 2019-12-09 | 2020-04-17 | 网易(杭州)网络有限公司 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
MING ZHANG: "Wasserstein Distance guided Adversarial Imitation Learning with Reward Shape Exploration", 《IEEE》 * |
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111983922A (en) * | 2020-07-13 | 2020-11-24 | 广州中国科学院先进技术研究所 | Robot demonstration teaching method based on meta-simulation learning |
CN112052947A (en) * | 2020-08-17 | 2020-12-08 | 清华大学 | Hierarchical reinforcement learning method and device based on strategy options |
WO2022044314A1 (en) * | 2020-08-31 | 2022-03-03 | 日本電気株式会社 | Learning device, learning method, and learning program |
CN112249032A (en) * | 2020-10-29 | 2021-01-22 | 浪潮(北京)电子信息产业有限公司 | Automatic driving decision method, system, equipment and computer storage medium |
CN112249032B (en) * | 2020-10-29 | 2022-02-18 | 浪潮(北京)电子信息产业有限公司 | Automatic driving decision method, system, equipment and computer storage medium |
CN112434171A (en) * | 2020-11-26 | 2021-03-02 | 中山大学 | Knowledge graph reasoning and complementing method and system based on reinforcement learning |
CN113781190A (en) * | 2021-01-13 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Bill data processing method, system, computer system and medium |
CN112894809B (en) * | 2021-01-18 | 2022-08-02 | 华中科技大学 | Impedance controller design method and system based on reinforcement learning |
CN112894809A (en) * | 2021-01-18 | 2021-06-04 | 华中科技大学 | Impedance controller design method and system based on reinforcement learning |
CN112975967A (en) * | 2021-02-26 | 2021-06-18 | 同济大学 | Service robot quantitative water pouring method based on simulation learning and storage medium |
CN112975967B (en) * | 2021-02-26 | 2022-06-28 | 同济大学 | Service robot quantitative water pouring method based on simulation learning and storage medium |
CN113052253A (en) * | 2021-03-31 | 2021-06-29 | 北京字节跳动网络技术有限公司 | Hyper-parameter determination method, device, deep reinforcement learning framework, medium and equipment |
CN113240118A (en) * | 2021-05-18 | 2021-08-10 | 中国科学院自动化研究所 | Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium |
CN113221469A (en) * | 2021-06-04 | 2021-08-06 | 上海天壤智能科技有限公司 | Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator |
CN113222297A (en) * | 2021-06-08 | 2021-08-06 | 上海交通大学 | Method, system, equipment and medium suitable for cyclic updating planning of solid waste base garden |
CN113239634A (en) * | 2021-06-11 | 2021-08-10 | 上海交通大学 | Simulator modeling method based on robust simulation learning |
CN113504723B (en) * | 2021-07-05 | 2023-11-28 | 北京航空航天大学 | Carrier rocket load shedding control method based on inverse reinforcement learning |
CN113504723A (en) * | 2021-07-05 | 2021-10-15 | 北京航空航天大学 | Carrier rocket load shedding control method based on inverse reinforcement learning |
CN113467515A (en) * | 2021-07-22 | 2021-10-01 | 南京大学 | Unmanned aerial vehicle flight control method based on virtual environment simulation reconstruction and reinforcement learning |
CN113962012A (en) * | 2021-07-23 | 2022-01-21 | 中国科学院自动化研究所 | Unmanned aerial vehicle countermeasure strategy optimization method and device |
CN113704979B (en) * | 2021-08-07 | 2024-05-10 | 中国航空工业集团公司沈阳飞机设计研究所 | Air countermeasure maneuvering control method based on random neural network |
CN113704979A (en) * | 2021-08-07 | 2021-11-26 | 中国航空工业集团公司沈阳飞机设计研究所 | Air countermeasure maneuver control method based on random neural network |
CN113609786A (en) * | 2021-08-27 | 2021-11-05 | 中国人民解放军国防科技大学 | Mobile robot navigation method and device, computer equipment and storage medium |
CN113609786B (en) * | 2021-08-27 | 2022-08-19 | 中国人民解放军国防科技大学 | Mobile robot navigation method, device, computer equipment and storage medium |
CN113688977B (en) * | 2021-08-30 | 2023-12-05 | 浙江大学 | Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium |
CN113688977A (en) * | 2021-08-30 | 2021-11-23 | 浙江大学 | Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium |
CN113852645A (en) * | 2021-12-02 | 2021-12-28 | 北京邮电大学 | Method and device for resisting client DNS cache poisoning attack and electronic equipment |
WO2023109663A1 (en) * | 2021-12-17 | 2023-06-22 | 深圳先进技术研究院 | Serverless computing resource configuration method based on maximum entropy inverse reinforcement learning |
CN114625089B (en) * | 2022-03-15 | 2024-05-03 | 大连东软信息学院 | Job shop scheduling method based on improved near-end strategy optimization algorithm |
CN114625089A (en) * | 2022-03-15 | 2022-06-14 | 大连东软信息学院 | Job shop scheduling method based on improved near-end strategy optimization algorithm |
CN114683280A (en) * | 2022-03-17 | 2022-07-01 | 达闼机器人股份有限公司 | Object control method, device, storage medium and electronic equipment |
CN114683280B (en) * | 2022-03-17 | 2023-11-17 | 达闼机器人股份有限公司 | Object control method and device, storage medium and electronic equipment |
CN114986518A (en) * | 2022-07-19 | 2022-09-02 | 聊城一明五金科技有限公司 | Intelligent control method and system for automobile disassembly production line |
CN115314399A (en) * | 2022-08-05 | 2022-11-08 | 北京航空航天大学 | Data center flow scheduling method based on inverse reinforcement learning |
CN115314399B (en) * | 2022-08-05 | 2023-09-15 | 北京航空航天大学 | Data center flow scheduling method based on inverse reinforcement learning |
CN115470710B (en) * | 2022-09-26 | 2023-06-06 | 北京鼎成智造科技有限公司 | Air game simulation method and device |
CN115470710A (en) * | 2022-09-26 | 2022-12-13 | 北京鼎成智造科技有限公司 | Air game simulation method and device |
CN115688858A (en) * | 2022-10-20 | 2023-02-03 | 哈尔滨工业大学(深圳) | Fine-grained expert behavior simulation learning method, device, medium and terminal |
CN115688858B (en) * | 2022-10-20 | 2024-02-09 | 哈尔滨工业大学(深圳) | Fine granularity expert behavior imitation learning method, device, medium and terminal |
CN117193008B (en) * | 2023-10-07 | 2024-03-01 | 航天科工集团智能科技研究院有限公司 | Small sample robust imitation learning training method oriented to high-dimensional disturbance environment, electronic equipment and storage medium |
CN117193008A (en) * | 2023-10-07 | 2023-12-08 | 航天科工集团智能科技研究院有限公司 | Small sample robust imitation learning training method oriented to high-dimensional disturbance environment, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111401556B (en) | 2023-06-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111401556A (en) | Method for selecting a reward function in adversarial imitation learning | |
CN110119844B (en) | Robot motion decision method, system and device introducing emotion regulation and control mechanism | |
Camerer et al. | Behavioural game theory: thinking, learning and teaching | |
Erev et al. | On adaptation, maximization, and reinforcement learning among cognitive strategies. | |
Dayan et al. | Decision theory, reinforcement learning, and the brain | |
Van Otterlo | The logic of adaptive behavior: Knowledge representation and algorithms for adaptive sequential decision making under uncertainty in first-order and relational domains | |
CN111291890A (en) | Game strategy optimization method, system and storage medium | |
Hotaling et al. | Dynamic decision making | |
CN114330651A (en) | Layered multi-agent reinforcement learning method oriented to multi-element joint instruction control | |
Intisar et al. | Classification of online judge programmers based on rule extraction from self organizing feature map | |
Kim et al. | Generalization of TORCS car racing controllers with artificial neural networks and linear regression analysis | |
CN115033878A (en) | Rapid self-game reinforcement learning method and device, computer equipment and storage medium | |
CN113947246B (en) | Loss processing method and device based on artificial intelligence and electronic equipment | |
CN117112742A (en) | Dialogue model optimization method and device, computer equipment and storage medium | |
CN115906673A (en) | Integrated modeling method and system for combat entity behavior model | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium | |
US11478716B1 (en) | Deep learning for data-driven skill estimation | |
CN114911157A (en) | Robot navigation control method and system based on partial observable reinforcement learning | |
CN112884129B (en) | Multi-step rule extraction method, device and storage medium based on teaching data | |
De Penning et al. | Applying neural-symbolic cognitive agents in intelligent transport systems to reduce CO 2 emissions | |
Belavkin | Conflict resolution by random estimated costs | |
CN117474077B (en) | Auxiliary decision making method and device based on OAR model and reinforcement learning | |
US11869383B2 (en) | Method, system and non-transitory computer- readable recording medium for providing information on user's conceptual understanding | |
O'Hanlon | Using Supervised Machine Learning to Predict the Final Rankings of the 2021 Formula One Championship | |
Yang et al. | The Cognitive Substrates of Model-Based Learning: An Integrative Declarative-Procedural Model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |