CN111401556A - Method for selecting the reward function in adversarial imitation learning - Google Patents

Method for selecting the reward function in adversarial imitation learning

Info

Publication number
CN111401556A
Authority
CN
China
Prior art keywords
function
network
reward
strategy
expert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010323155.4A
Other languages
Chinese (zh)
Other versions
CN111401556B (en)
Inventor
李秀
王亚伟
张明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202010323155.4A priority Critical patent/CN111401556B/en
Publication of CN111401556A publication Critical patent/CN111401556A/en
Application granted granted Critical
Publication of CN111401556B publication Critical patent/CN111401556B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method for selecting the reward function in adversarial imitation learning, comprising the following steps: constructing a policy network with parameter θ, a discrimination network with parameter w, and at least two reward functions; acquiring demonstration data under an expert policy and storing it in an expert data buffer containing expert trajectories; controlling the input of the policy network to be the state returned by the simulation environment and its output to be a decision action; updating the parameters of the discrimination network using state-action pairs under the expert policy and state-action pairs of the policy network; in the reward-calculation stage, taking a state-action pair of the policy network as the input of the discrimination network and outputting the reward value computed by the reward function; selecting the reward function for the current task according to the performance indices of the different reward functions; and saving the parameters of the policy network corresponding to the selected reward function. The agent learns under the guidance of different reward functions, and the optimal reward function is then selected according to the performance evaluation index in the specific task scenario.

Description

Method for selecting the reward function in adversarial imitation learning
Technical Field
The invention relates to the technical field of reward function selection, and in particular to a method for selecting the reward function in adversarial imitation learning.
Background
In recent years, with major breakthroughs of deep learning in fields such as image recognition, speech technology and natural language processing, deep reinforcement learning, which combines deep neural networks with reinforcement learning, has also achieved super-human performance on large-scale policy optimization problems such as Go and StarCraft. One bottleneck of reinforcement learning is that, when facing practical control problems such as autonomous driving and robotics, designing a reasonable reward function based on expert experience is time-consuming and laborious. Data-driven imitation learning provides an approach to this problem: a policy rivaling that of the expert can be learned using only demonstration data, without manually designing a reward function. Among the many imitation learning algorithms, behavioral cloning is the simplest; it performs imitation in a supervised-learning manner. However, this method is susceptible to compounding errors and has difficulty handling situations that do not occur in the expert data. Inverse reinforcement learning is another class of imitation learning algorithms; it first learns a reward function from the expert data and then runs a reinforcement learning process to learn a policy under the guidance of that reward function. The reward function learned in this way is more robust and can deal with situations that do not appear in the expert data. However, the algorithm must alternate between finding the optimal reward function and performing reinforcement learning training, and therefore requires a large amount of computation.
In the process of imitating the expert policy, the reward signal is derived from the output of the discrimination network: the closer the generated data is to the expert data, the larger the output value and hence the larger the reward value, and vice versa. Because the original algorithm uses a reward function in negative-logarithm form whose input always lies in the interval [0, 1], the reward value is always positive, which makes the algorithm difficult to apply across a variety of tasks.
The above background disclosure is only for the purpose of assisting understanding of the concept and technical solution of the present invention and does not necessarily belong to the prior art of the present patent application, and should not be used for evaluating the novelty and inventive step of the present application in the case that there is no clear evidence that the above content is disclosed at the filing date of the present patent application.
Disclosure of Invention
The invention provides a method for selecting the reward function in adversarial imitation learning, aiming at solving the problem that the form of reward function used in prior-art imitation learning algorithms based on generative adversarial networks is not optimal for every task and therefore cannot achieve optimal performance across multiple tasks.
In order to solve the above problem, the technical solution adopted by the present invention is as follows:
A method for selecting the reward function in adversarial imitation learning, comprising the following steps. S1: constructing a policy network π with a parameter θ, a discrimination network D with a parameter w, and at least two reward functions. S2: acquiring demonstration data under the expert policy and storing it into an expert data buffer B_E(s_t, a_t) containing expert trajectories. S3: controlling the input of the policy network to be the state s_t returned by the simulation environment Env and its output to be the decision action a_t; the discrimination network updates its parameter using state-action pairs (s_t, a_t)_E under the expert policy and state-action pairs (s_t, a_t)_π of the policy network; in the reward-calculation stage, the input of the discrimination network is a state-action pair (s_t, a_t)_π of the policy network, and the output is the reward value computed by the reward function. S4: selecting the reward function for the current task according to the performance indices of the different reward functions. S5: saving the parameters of the policy network corresponding to the selected reward function.
Preferably, 6 reward functions are designed according to the different value ranges of the reward function.
Preferably, the reward functions are:
r_1(x) = x = log(σ(x)) - log(1 - σ(x))
r_2(x) = e^x
r_3(x) = -e^(-x)
r_4(x) = σ(x)
r_5(x) = -log(1 - σ(x))
r_6(x) = log(σ(x))
where x is the output of the discrimination network and σ(x) = 1/(1 + e^(-x)) is the sigmoid function.
Preferably, the process of selecting the reward function for the current task according to the magnitude of the performance indices of the reward functions comprises the following steps:
S41: initializing a plurality of the simulation environments Env_i, the policy networks π_i and the discrimination networks D_i, and starting the training processes simultaneously, where i = 0, 1, ..., 6;
S42: in each training process, using the current policy network π to interact with the simulation environment Env, and storing the state-action pair of the current time step into a policy network buffer B_π(s_t, a_t);
S43: sampling the state-action trajectory (s_t, a_t)_π under the current policy from the policy network buffer B_π(s_t, a_t), sampling the expert state-action trajectory (s_t, a_t)_E from the expert data buffer B_E(s_t, a_t), and optimizing the parameter w by gradient descent on the loss function D_JS of the discrimination network D:
w ← w - α_d ∇_w D_JS((s, a)_π, (s, a)_E)
where w represents the discrimination network parameter, α_d represents the learning rate of the discrimination network parameter, D_JS represents the discrimination network loss function, and (s, a)_π and (s, a)_E respectively represent the state-action trajectories sampled from the policy network buffer B_π(s_t, a_t) and the expert data buffer B_E(s_t, a_t).
S44: calculating the reward value r_t of each step in the training process according to the specific form of each reward function, and storing it in the policy network buffer B_π(s_t, a_t, r_t);
S45: calculating the advantage value A_t of each time step according to the advantage function, and storing it in the policy network buffer B_π(s_t, a_t, r_t, A_t);
S46: according to the proximal policy optimization algorithm, updating the parameter θ of the policy network by gradient descent using the data in the policy network buffer B_π(s_t, a_t, r_t, A_t):
θ ← θ - α_p ∇_θ L(θ)
where θ represents the policy network parameter, α_p represents the learning rate of the policy network parameter, and L(θ) represents the policy network objective function.
S47: calculating the difference of the average returns over adjacent time periods; if the difference is less than the set threshold Thre, stopping the training process, saving the network parameters θ and w, and also saving the time step t at convergence, the average return R̄ over the most recent period and its standard deviation S; otherwise, returning to step S42 and re-executing steps S42-S46;
S48: after all training processes are finished, calculating the magnitude of the performance index according to the data (t, R̄, S) saved at final convergence, and selecting the reward function for the current task.
Preferably, the loss function D_JS is calculated as follows:
D_JS = -E_{(s_E, a_E) ~ B_E}[ log D_w(s_E, a_E) ] - E_{(s_π, a_π) ~ B_π}[ log(1 - D_w(s_π, a_π)) ]
where (s_π, a_π) and (s_E, a_E) are state-action samples from the policy network buffer B_π and the expert data buffer B_E, respectively, and D_w(s, a) = σ(x) denotes the sigmoid of the discrimination network output x.
Preferably, the advantage function A_t is used to measure how much benefit the current action brings;
the generalized advantage estimation algorithm computes it from the trajectory data collected by the policy network over the elapsed time T, with the specific formula:
A_t = δ_t + (γλ)δ_{t+1} + ... + (γλ)^{T-t+1} δ_{T-1}
where δ_t = r_t + γV(s_{t+1}) - V(s_t), γ and λ are hyperparameters, r_t is the reward value calculated from the reward function, s_t and s_{t+1} represent the states at the current and next moments, and V(s_t) and V(s_{t+1}) represent the corresponding state values.
Preferably, the proximal policy optimization algorithm improves the loss function term of the general policy gradient algorithm by replacing the original L^PG term with an L^CLIP term, whose specific formula is:
L^CLIP(θ) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1 - ε, 1 + ε) A_t ) ]
where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) represents the probability ratio, ε is a hyperparameter, A_t represents the aforementioned advantage function, clip represents the truncation function, E_t represents the expectation, and min is the minimum function.
Preferably, the final objective function is obtained by combining the improved term with a value function error term and a policy entropy term:
L_t^{CLIP+VF+S}(θ) = E_t[ L_t^{CLIP}(θ) - c_1 L_t^{VF}(θ) + c_2 S[π_θ](s_t) ]
where L_t^{CLIP}(θ) is the loss function improvement term described above, L_t^{VF}(θ) is the squared-error value loss, S is the exploration entropy, and c_1 and c_2 are the corresponding hyperparameters.
Preferably, the performance index of the reward function is calculated as a function of t_i, R̄_i, S_i and the hyperparameters T and β, where t_i represents the total number of steps required for the training process to converge, R̄_i is the average return over the most recent period, and S_i is the standard deviation of the return over the most recent period.
The invention also provides a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of the above.
The invention has the following beneficial effects: a method for selecting the reward function in adversarial imitation learning is provided; by designing multiple forms of reward function, the agent can learn under the guidance of different reward functions, and the optimal reward function is then selected according to the performance evaluation index for the specific task scenario.
Furthermore, with only a small amount of computation, the optimal reward function can be selected automatically for different tasks through the performance evaluation index and the reward-function selection algorithm, without repeatedly trying the effects of different reward functions.
Drawings
FIG. 1 is a schematic diagram of a method for selecting the reward function in adversarial imitation learning according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of selecting the reward function for the current task according to the magnitude of the performance indices of the different reward functions, according to an embodiment of the present invention.
FIG. 3 is a flow chart of the reward-function selection algorithm for adversarial imitation learning according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a training process curve according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing function or a circuit connection function.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
As shown in FIG. 1, the invention provides a method for selecting the reward function in adversarial imitation learning, comprising the following steps:
S1: constructing a policy network π with a parameter θ, a discrimination network D with a parameter w, and at least two reward functions;
S2: acquiring demonstration data under the expert policy and storing it into an expert data buffer B_E(s_t, a_t) containing expert trajectories;
The data is typically taken from the data sets published by OpenAI and others, which facilitates comparison with the effects of other algorithms. The expert data buffer typically stores 5 complete expert trajectories, from which some data is sampled in a certain manner for training during the learning process of the algorithm.
S3: controlling the input of the policy network to be the state s_t returned by the simulation environment Env and its output to be the decision action a_t; the discrimination network updates its parameter using state-action pairs (s_t, a_t)_E under the expert policy and state-action pairs (s_t, a_t)_π of the policy network; in the reward-calculation stage, the input of the discrimination network is a state-action pair (s_t, a_t)_π of the policy network, and the output is the reward value computed by the reward function;
S4: selecting the reward function for the current task according to the performance indices of the different reward functions;
S5: saving the parameters of the policy network corresponding to the selected reward function.
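Purely as an illustration of steps S1-S5, and not as an implementation supplied by the patent, the overall selection procedure can be organized as in the following Python sketch; every function and variable name here is hypothetical.

    def select_reward_function_for_task(make_env, make_policy, make_discriminator,
                                        expert_buffer, reward_fns, train, performance_index):
        # S1-S2 are assumed to be handled by the caller-supplied factories and expert_buffer.
        # `train` runs steps S42-S47 for one candidate and returns
        # (steps_to_converge, avg_return, std_return, policy_params).
        results = {}
        for name, reward_fn in reward_fns.items():
            env, policy, disc = make_env(), make_policy(), make_discriminator()
            results[name] = train(env, policy, disc, expert_buffer, reward_fn)  # S3
        best = max(results, key=lambda n: performance_index(*results[n][:3]))   # S4
        return best, results[best][3]                                           # S5: keep its policy parameters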
In an embodiment of the present invention, 6 reward functions are designed according to the different value ranges of the reward function. The reward functions are respectively:
r_1(x) = x = log(σ(x)) - log(1 - σ(x))   ......(1)
r_2(x) = e^x   ......(2)
r_3(x) = -e^(-x)   ......(3)
r_4(x) = σ(x)   ......(4)
r_5(x) = -log(1 - σ(x))   ......(5)
r_6(x) = log(σ(x))   ......(6)
where x is the output of the discrimination network and σ(x) = 1/(1 + e^(-x)) is the sigmoid function.
These functions are all conventional functions. Among them, functions (1), (5) and (6) are the reward function forms used in the original algorithms. The value range of function (1) can be positive or negative, the value range of function (5) is always positive, and the value range of function (6) is always negative. Previous studies have confirmed that functions with different value ranges are suitable for different types of tasks. Therefore, three further reward functions are designed according to different value ranges: the value ranges of functions (2) and (4) are always positive, and the value range of function (3) is always negative.
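For illustration only, the six reward forms above can be evaluated directly from the discriminator output x (the logit); the sketch below simply transcribes equations (1) to (6) and adds nothing beyond them.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # x is the raw output (logit) of the discrimination network
    REWARD_FUNCTIONS = {
        1: lambda x: x,                          # equals log(sigmoid(x)) - log(1 - sigmoid(x)); sign varies
        2: lambda x: np.exp(x),                  # always positive
        3: lambda x: -np.exp(-x),                # always negative
        4: lambda x: sigmoid(x),                 # always positive, in (0, 1)
        5: lambda x: -np.log(1.0 - sigmoid(x)),  # always positive
        6: lambda x: np.log(sigmoid(x)),         # always negative
    }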
As shown in FIG. 2, the process of selecting the reward function for the current task according to the magnitude of the performance indices of the reward functions comprises:
S41: initializing a plurality of the simulation environments Env_i, the policy networks π_i and the discrimination networks D_i, and starting the training processes simultaneously, where i = 0, 1, ..., 6;
S42: in each training process, using the current policy network π to interact with the simulation environment Env, and storing the state-action pair of the current time step into a policy network buffer B_π(s_t, a_t);
S43: sampling the state-action trajectory (s_t, a_t)_π under the current policy from the policy network buffer B_π(s_t, a_t), sampling the expert state-action trajectory (s_t, a_t)_E from the expert data buffer B_E(s_t, a_t), and optimizing the parameter w by gradient descent on the loss function D_JS of the discrimination network D:
w ← w - α_d ∇_w D_JS((s, a)_π, (s, a)_E)
where w represents the discrimination network parameter, α_d represents the learning rate of the discrimination network parameter, D_JS represents the discrimination network loss function, and (s, a)_π and (s, a)_E respectively represent the state-action trajectories sampled from the policy network buffer B_π(s_t, a_t) and the expert data buffer B_E(s_t, a_t).
S44: calculating the reward value r_t of each step in the training process according to the specific form of each reward function, and storing it in the policy network buffer B_π(s_t, a_t, r_t);
S45: calculating the advantage value A_t of each time step according to the advantage function, and storing it in the policy network buffer B_π(s_t, a_t, r_t, A_t);
S46: according to the proximal policy optimization algorithm, updating the parameter θ of the policy network by gradient descent using the data in the policy network buffer B_π(s_t, a_t, r_t, A_t):
θ ← θ - α_p ∇_θ L(θ)
where θ represents the policy network parameter, α_p represents the learning rate of the policy network parameter, and L(θ) represents the policy network objective function.
S47: calculating the difference of the average returns over adjacent time periods; if the difference is less than the set threshold Thre, stopping the training process, saving the network parameters θ and w, and also saving the time step t at convergence, the average return R̄ over the most recent period and its standard deviation S; otherwise, returning to step S42 and re-executing steps S42-S46;
S48: after all training processes are finished, calculating the magnitude of the performance index according to the data (t, R̄, S) saved at final convergence, and selecting the reward function for the current task.
The loss function D_JS is calculated as follows:
D_JS = -E_{(s_E, a_E) ~ B_E}[ log D_w(s_E, a_E) ] - E_{(s_π, a_π) ~ B_π}[ log(1 - D_w(s_π, a_π)) ]
where (s_π, a_π) and (s_E, a_E) are state-action samples from the policy network buffer B_π and the expert data buffer B_E, respectively, and D_w(s, a) = σ(x) denotes the sigmoid of the discrimination network output x.
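A minimal sketch of the discriminator update in step S43, written with PyTorch (which the patent does not prescribe) and assuming the binary cross-entropy form of D_JS given above; the exact sign convention of the original loss may differ.

    import torch
    import torch.nn.functional as F

    def discriminator_step(disc, optimizer, policy_sa, expert_sa):
        # disc maps a batch of concatenated (state, action) vectors to a raw logit x,
        # so that sigma(x) plays the role of D_w(s, a).
        logits_pi = disc(policy_sa)
        logits_e = disc(expert_sa)
        # Expert pairs are pushed towards D -> 1, policy pairs towards D -> 0.
        loss = (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e))
                + F.binary_cross_entropy_with_logits(logits_pi, torch.zeros_like(logits_pi)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()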
FIG. 3 is a schematic diagram showing the flow of the reward-function selection algorithm for adversarial imitation learning in the present invention.
In policy gradient optimization algorithms, the advantage function A_t is often used to measure how much benefit the current action brings, so making a reasonable estimate of the advantage function A_t is critical. The generalized advantage estimation (GAE) algorithm is an efficient method for estimating the advantage; it is computed from the trajectory data collected by the policy over the elapsed time T, with the specific formula:
A_t = δ_t + (γλ)δ_{t+1} + ... + (γλ)^{T-t+1} δ_{T-1}
where δ_t = r_t + γV(s_{t+1}) - V(s_t), γ and λ are hyperparameters, r_t is the reward value calculated from the reward function, s_t and s_{t+1} represent the states at the current and next moments, and V(s_t) and V(s_{t+1}) represent the corresponding state values.
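The estimate above can be computed backwards over a finished rollout, since A_t = δ_t + γλ·A_{t+1}; the following sketch assumes rewards of length T and value estimates of length T + 1 (including the bootstrap value V(s_T)).

    import numpy as np

    def generalized_advantage_estimate(rewards, values, gamma=0.99, lam=0.95):
        T = len(rewards)
        advantages = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]  # delta_t
            running = delta + gamma * lam * running                 # A_t = delta_t + (gamma*lambda) * A_{t+1}
            advantages[t] = running
        return advantages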
The proximal policy optimization algorithm improves the loss function term of the general policy gradient algorithm by replacing the original L^PG term with an L^CLIP term, whose specific formula is:
L^CLIP(θ) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1 - ε, 1 + ε) A_t ) ]
where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) represents the probability ratio, ε is a hyperparameter, A_t represents the aforementioned advantage function, clip represents the truncation function, E_t represents the expectation, and min is the minimum function.
Combining the improved term with a value function error term and a policy entropy term gives the final objective function:
L_t^{CLIP+VF+S}(θ) = E_t[ L_t^{CLIP}(θ) - c_1 L_t^{VF}(θ) + c_2 S[π_θ](s_t) ]
where L_t^{CLIP}(θ) is the loss function improvement term described above, L_t^{VF}(θ) is the squared-error value loss, S is the exploration entropy, and c_1 and c_2 are the corresponding hyperparameters.
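A sketch of the clipped surrogate objective combined with the value error and entropy terms, following the two formulas above; PyTorch is used only for illustration, and the batched tensor layout is an assumption.

    import torch

    def ppo_objective(new_log_probs, old_log_probs, advantages, values, returns,
                      entropy, clip_eps=0.2, c1=0.5, c2=0.01):
        ratio = torch.exp(new_log_probs - old_log_probs)               # r_t(theta)
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        l_clip = torch.min(ratio * advantages, clipped * advantages).mean()
        l_vf = ((values - returns) ** 2).mean()                        # squared-error value loss
        # Maximize l_clip and the entropy bonus, minimize the value error;
        # the negation turns this into a loss suitable for gradient descent.
        return -(l_clip - c1 * l_vf + c2 * entropy.mean())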
The performance index of the reward function is calculated as a function of t_i, R̄_i, S_i and the hyperparameters T and β, where t_i represents the total number of steps required for the training process to converge, R̄_i is the average return over the most recent period, and S_i is the standard deviation of the return over the most recent period.
The optimal reward function for the current task can be selected according to this performance index; experiments verify the reliability of the index, and the selected reward function achieves the best performance at test time.
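Because the concrete formula of the performance index appears only as an equation image in the original publication, the selection step is sketched here with the index computation passed in as a caller-supplied function of (t_i, R̄_i, S_i) and the hyperparameters T and β; all names are illustrative.

    def pick_best_reward_function(results, index_fn, T, beta):
        # results: reward-function id -> (steps_to_converge, avg_return, std_return)
        scores = {i: index_fn(t_i, avg_i, std_i, T, beta)
                  for i, (t_i, avg_i, std_i) in results.items()}
        best = max(scores, key=scores.get)
        return best, scores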
In one specific task, the method is applied to a high-dimensional continuous control task, specifically as follows.
The objective of this task is to imitate an expert policy so as to control a robot in a simulation environment and make it learn to walk forward as fast as possible. For the agent in the simulation environment, the input is an 11-dimensional state quantity and the output control action is an 8-dimensional continuous variable. During training of the reward-function selection algorithm, the policy network and the discrimination network of the agent have similar structures: the policy network comprises two hidden layers of 128 nodes each, with tanh activation functions; the discrimination network consists of two hidden layers of 100 nodes each, also with tanh activation, and the parameters of both networks are optimized with the Adam optimizer.
The hyperparameters used in the adversarial imitation learning algorithm are shown in Table 1:
TABLE 1. Imitation learning hyperparameters
Hyperparameter                               Value
Full period (T)                              2048
Policy network learning rate (α_p)           3e-4
Discrimination network learning rate (α_d)   1e-3
Discount factor (γ)                          0.99
GAE parameter (λ)                            0.95
Value error coefficient (c_1)                0.5
Policy entropy coefficient (c_2)             0.01
The hyperparameters in the calculation formula of the reward function evaluation index are shown in Table 2:
TABLE 2. Performance evaluation hyperparameters
Hyperparameter                 Value
Normalization range (T)        1e6
Discount factor (β)            0.25
Convergence threshold (Thre)   50
In this environment, a random policy achieves a reward score of -60.21 ± 30.40, while the expert policy scores 4066.96 ± 688.97. Five demonstration trajectories are obtained from the public data set, and the agent is trained in parallel under the guidance of the 6 reward functions using the expert decision data. The scores during agent training are normalized to [0, 1] according to the range determined by the random-policy and expert-policy scores.
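The normalization described above maps a raw episode score onto [0, 1] using the random-policy and expert-policy scores as the two endpoints, for example:

    def normalize_score(score, random_score=-60.21, expert_score=4066.96):
        return (score - random_score) / (expert_score - random_score)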
As shown in FIG. 4, where 0 represents the score of the random policy and 1 represents the score of the expert policy, the final performance of the agents under the different reward functions differs, and the training processes of the agents corresponding to reward functions 2, 4 and 5 are better. According to the data saved at the end of training and the designed performance evaluation index of the reward functions, the finally selected reward function is r_2(x) = e^x. Its average score over 5 test episodes is 3796.00, slightly higher than the scores of the other two reward functions r_5(x) = -log(1 - σ(x)) and r_4(x) = σ(x), which are 3785.31 and 3789.75, respectively. This indicates that the optimal reward function in this task is indeed r_2(x) = e^x, which demonstrates the effectiveness of the proposed performance evaluation index and of selecting the optimal reward function based on this index.
In previous algorithms, any one of the reward functions (1), (5) and (6) may be used, and their effectiveness can differ significantly across tasks. It can be seen that function (1) can hardly complete the learning task here; although functions (5) and (6) perform better, they are still slightly inferior to reward function (2). Therefore, through the performance evaluation index and the reward-function selection algorithm, the optimal reward function can be selected automatically for different tasks without repeatedly trying the effects of different reward functions.
An embodiment of the present application further provides a control apparatus, including a processor and a storage medium for storing a computer program; wherein a processor is adapted to perform at least the method as described above when executing the computer program.
Embodiments of the present application also provide a storage medium for storing a computer program, which when executed performs at least the method described above.
Embodiments of the present application further provide a processor, where the processor executes a computer program to perform at least the method described above.
The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM). The volatile memory may be a Random Access Memory (RAM) used as an external cache, for example, but not limited to, a Dynamic Random Access Memory (DRAM) or a Synchronous Dynamic Random Access Memory (SDRAM). The memories described in the embodiments of the present invention are intended to include, but are not limited to, these and any other suitable types of memory.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all of them shall be considered to fall within the protection scope of the invention.

Claims (10)

1. A method for selecting the reward function in adversarial imitation learning, comprising the following steps:
S1: constructing a policy network π with a parameter θ, a discrimination network D with a parameter w, and at least two reward functions;
S2: acquiring demonstration data under the expert policy and storing it into an expert data buffer B_E(s_t, a_t) containing expert trajectories;
S3: controlling the input of the policy network to be the state s_t returned by the simulation environment Env and its output to be the decision action a_t; the discrimination network updates its parameter using state-action pairs (s_t, a_t)_E under the expert policy and state-action pairs (s_t, a_t)_π of the policy network; in the reward-calculation stage, the input of the discrimination network is a state-action pair (s_t, a_t)_π of the policy network, and the output is the reward value computed by the reward function;
S4: selecting the reward function for the current task according to the performance indices of the different reward functions;
S5: saving the parameters of the policy network corresponding to the selected reward function.
2. The method for selecting the reward function in adversarial imitation learning according to claim 1, wherein 6 reward functions are designed according to the different value ranges of the reward function.
3. The method for selecting the reward function in adversarial imitation learning according to claim 2, wherein the reward functions are:
r_1(x) = x = log(σ(x)) - log(1 - σ(x))
r_2(x) = e^x
r_3(x) = -e^(-x)
r_4(x) = σ(x)
r_5(x) = -log(1 - σ(x))
r_6(x) = log(σ(x))
where x is the output of the discrimination network and σ(x) = 1/(1 + e^(-x)) is the sigmoid function.
4. The method for selecting the reward function in adversarial imitation learning according to claim 3, wherein selecting the reward function for the current task according to the magnitude of the performance indices of the different reward functions comprises:
S41: initializing a plurality of the simulation environments Env_i, the policy networks π_i and the discrimination networks D_i, and starting the training processes simultaneously, where i = 0, 1, ..., 6;
S42: in each training process, using the current policy network π to interact with the simulation environment Env, and storing the state-action pair of the current time step into a policy network buffer B_π(s_t, a_t);
S43: sampling the state-action trajectory (s_t, a_t)_π under the current policy from the policy network buffer B_π(s_t, a_t), sampling the expert state-action trajectory (s_t, a_t)_E from the expert data buffer B_E(s_t, a_t), and optimizing the parameter w by gradient descent on the loss function D_JS of the discrimination network D:
w ← w - α_d ∇_w D_JS((s, a)_π, (s, a)_E)
where w represents the discrimination network parameter, α_d represents the learning rate of the discrimination network parameter, D_JS represents the discrimination network loss function, and (s, a)_π and (s, a)_E respectively represent the state-action trajectories sampled from the policy network buffer B_π(s_t, a_t) and the expert data buffer B_E(s_t, a_t);
S44: calculating the reward value r_t of each step in the training process according to the specific form of each reward function, and storing it in the policy network buffer B_π(s_t, a_t, r_t);
S45: calculating the advantage value A_t of each time step according to the advantage function, and storing it in the policy network buffer B_π(s_t, a_t, r_t, A_t);
S46: according to the proximal policy optimization algorithm, updating the parameter θ of the policy network by gradient descent using the data in the policy network buffer B_π(s_t, a_t, r_t, A_t):
θ ← θ - α_p ∇_θ L(θ)
where θ represents the policy network parameter, α_p represents the learning rate of the policy network parameter, and L(θ) represents the policy network objective function;
S47: calculating the difference of the average returns over adjacent time periods; if the difference is less than the set threshold Thre, stopping the training process, saving the network parameters θ and w, and also saving the time step t at convergence, the average return R̄ over the most recent period and its standard deviation S; otherwise, returning to step S42 and re-executing steps S42-S46;
S48: after all training processes are finished, calculating the magnitude of the performance index according to the data (t, R̄, S) saved at final convergence, and selecting the reward function for the current task.
5. The method for selecting the reward function in adversarial imitation learning according to claim 4, wherein the loss function D_JS is calculated as follows:
D_JS = -E_{(s_E, a_E) ~ B_E}[ log D_w(s_E, a_E) ] - E_{(s_π, a_π) ~ B_π}[ log(1 - D_w(s_π, a_π)) ]
where (s_π, a_π) and (s_E, a_E) are state-action samples from the policy network buffer B_π and the expert data buffer B_E, respectively.
6. The method for selecting the reward function in adversarial imitation learning according to claim 4, wherein the advantage function A_t is used to measure how much benefit the current action brings;
the generalized advantage estimation algorithm computes it from the trajectory data collected by the policy network over the elapsed time T, with the specific formula:
A_t = δ_t + (γλ)δ_{t+1} + ... + (γλ)^{T-t+1} δ_{T-1}
where δ_t = r_t + γV(s_{t+1}) - V(s_t), γ and λ are hyperparameters, r_t is the reward value calculated from the reward function, s_t and s_{t+1} represent the states at the current and next moments, and V(s_t) and V(s_{t+1}) represent the corresponding state values.
7. The method for selecting the reward function in adversarial imitation learning according to claim 4, wherein the proximal policy optimization algorithm improves the loss function term of the general policy gradient algorithm by replacing the original L^PG term with an L^CLIP term, whose specific formula is:
L^CLIP(θ) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1 - ε, 1 + ε) A_t ) ]
where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) represents the probability ratio, ε is a hyperparameter, A_t represents the aforementioned advantage function, clip represents the truncation function, E_t represents the expectation, and min is the minimum function.
8. The method for selecting the reward function in adversarial imitation learning according to claim 7, wherein the final objective function is obtained by combining the improved term with a value function error term and a policy entropy term:
L_t^{CLIP+VF+S}(θ) = E_t[ L_t^{CLIP}(θ) - c_1 L_t^{VF}(θ) + c_2 S[π_θ](s_t) ]
where L_t^{CLIP}(θ) is the loss function improvement term described above, L_t^{VF}(θ) is the squared-error value loss, S is the exploration entropy, and c_1 and c_2 are the corresponding hyperparameters.
9. The method for selecting the reward function in adversarial imitation learning according to claim 4, wherein the performance index of the reward function is calculated as a function of t_i, R̄_i, S_i and the hyperparameters T and β, where t_i represents the total number of steps required for the training process to converge, R̄_i is the average return over the most recent period, and S_i is the standard deviation of the return over the most recent period.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN202010323155.4A 2020-04-22 2020-04-22 Selection method of countermeasure type imitation learning winning function Active CN111401556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010323155.4A CN111401556B (en) 2020-04-22 2020-04-22 Selection method of countermeasure type imitation learning winning function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010323155.4A CN111401556B (en) 2020-04-22 2020-04-22 Selection method of countermeasure type imitation learning winning function

Publications (2)

Publication Number Publication Date
CN111401556A true CN111401556A (en) 2020-07-10
CN111401556B CN111401556B (en) 2023-06-30

Family

ID=71431701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010323155.4A Active CN111401556B (en) 2020-04-22 2020-04-22 Selection method of countermeasure type imitation learning winning function

Country Status (1)

Country Link
CN (1) CN111401556B (en)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111983922A (en) * 2020-07-13 2020-11-24 广州中国科学院先进技术研究所 Robot demonstration teaching method based on meta-simulation learning
CN112052947A (en) * 2020-08-17 2020-12-08 清华大学 Hierarchical reinforcement learning method and device based on strategy options
CN112249032A (en) * 2020-10-29 2021-01-22 浪潮(北京)电子信息产业有限公司 Automatic driving decision method, system, equipment and computer storage medium
CN112434171A (en) * 2020-11-26 2021-03-02 中山大学 Knowledge graph reasoning and complementing method and system based on reinforcement learning
CN112894809A (en) * 2021-01-18 2021-06-04 华中科技大学 Impedance controller design method and system based on reinforcement learning
CN112975967A (en) * 2021-02-26 2021-06-18 同济大学 Service robot quantitative water pouring method based on simulation learning and storage medium
CN113052253A (en) * 2021-03-31 2021-06-29 北京字节跳动网络技术有限公司 Hyper-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN113221469A (en) * 2021-06-04 2021-08-06 上海天壤智能科技有限公司 Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator
CN113222297A (en) * 2021-06-08 2021-08-06 上海交通大学 Method, system, equipment and medium suitable for cyclic updating planning of solid waste base garden
CN113240118A (en) * 2021-05-18 2021-08-10 中国科学院自动化研究所 Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium
CN113239634A (en) * 2021-06-11 2021-08-10 上海交通大学 Simulator modeling method based on robust simulation learning
CN113467515A (en) * 2021-07-22 2021-10-01 南京大学 Unmanned aerial vehicle flight control method based on virtual environment simulation reconstruction and reinforcement learning
CN113504723A (en) * 2021-07-05 2021-10-15 北京航空航天大学 Carrier rocket load shedding control method based on inverse reinforcement learning
CN113609786A (en) * 2021-08-27 2021-11-05 中国人民解放军国防科技大学 Mobile robot navigation method and device, computer equipment and storage medium
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
CN113704979A (en) * 2021-08-07 2021-11-26 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuver control method based on random neural network
CN113781190A (en) * 2021-01-13 2021-12-10 北京沃东天骏信息技术有限公司 Bill data processing method, system, computer system and medium
CN113852645A (en) * 2021-12-02 2021-12-28 北京邮电大学 Method and device for resisting client DNS cache poisoning attack and electronic equipment
CN113962012A (en) * 2021-07-23 2022-01-21 中国科学院自动化研究所 Unmanned aerial vehicle countermeasure strategy optimization method and device
WO2022044314A1 (en) * 2020-08-31 2022-03-03 日本電気株式会社 Learning device, learning method, and learning program
CN114625089A (en) * 2022-03-15 2022-06-14 大连东软信息学院 Job shop scheduling method based on improved near-end strategy optimization algorithm
CN114683280A (en) * 2022-03-17 2022-07-01 达闼机器人股份有限公司 Object control method, device, storage medium and electronic equipment
CN114986518A (en) * 2022-07-19 2022-09-02 聊城一明五金科技有限公司 Intelligent control method and system for automobile disassembly production line
CN115314399A (en) * 2022-08-05 2022-11-08 北京航空航天大学 Data center flow scheduling method based on inverse reinforcement learning
CN115470710A (en) * 2022-09-26 2022-12-13 北京鼎成智造科技有限公司 Air game simulation method and device
CN115688858A (en) * 2022-10-20 2023-02-03 哈尔滨工业大学(深圳) Fine-grained expert behavior simulation learning method, device, medium and terminal
WO2023109663A1 (en) * 2021-12-17 2023-06-22 深圳先进技术研究院 Serverless computing resource configuration method based on maximum entropy inverse reinforcement learning
CN117193008A (en) * 2023-10-07 2023-12-08 航天科工集团智能科技研究院有限公司 Small sample robust imitation learning training method oriented to high-dimensional disturbance environment, electronic equipment and storage medium
CN113704979B (en) * 2021-08-07 2024-05-10 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuvering control method based on random neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376862A (en) * 2018-10-29 2019-02-22 中国石油大学(华东) A kind of time series generation method based on generation confrontation network
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376862A (en) * 2018-10-29 2019-02-22 中国石油大学(华东) A kind of time series generation method based on generation confrontation network
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MING ZHANG: "Wasserstein Distance guided Adversarial Imitation Learning with Reward Shape Exploration", 《IEEE》 *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111983922A (en) * 2020-07-13 2020-11-24 广州中国科学院先进技术研究所 Robot demonstration teaching method based on meta-simulation learning
CN112052947A (en) * 2020-08-17 2020-12-08 清华大学 Hierarchical reinforcement learning method and device based on strategy options
WO2022044314A1 (en) * 2020-08-31 2022-03-03 日本電気株式会社 Learning device, learning method, and learning program
CN112249032A (en) * 2020-10-29 2021-01-22 浪潮(北京)电子信息产业有限公司 Automatic driving decision method, system, equipment and computer storage medium
CN112249032B (en) * 2020-10-29 2022-02-18 浪潮(北京)电子信息产业有限公司 Automatic driving decision method, system, equipment and computer storage medium
CN112434171A (en) * 2020-11-26 2021-03-02 中山大学 Knowledge graph reasoning and complementing method and system based on reinforcement learning
CN113781190A (en) * 2021-01-13 2021-12-10 北京沃东天骏信息技术有限公司 Bill data processing method, system, computer system and medium
CN112894809B (en) * 2021-01-18 2022-08-02 华中科技大学 Impedance controller design method and system based on reinforcement learning
CN112894809A (en) * 2021-01-18 2021-06-04 华中科技大学 Impedance controller design method and system based on reinforcement learning
CN112975967A (en) * 2021-02-26 2021-06-18 同济大学 Service robot quantitative water pouring method based on simulation learning and storage medium
CN112975967B (en) * 2021-02-26 2022-06-28 同济大学 Service robot quantitative water pouring method based on simulation learning and storage medium
CN113052253A (en) * 2021-03-31 2021-06-29 北京字节跳动网络技术有限公司 Hyper-parameter determination method, device, deep reinforcement learning framework, medium and equipment
CN113240118A (en) * 2021-05-18 2021-08-10 中国科学院自动化研究所 Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium
CN113221469A (en) * 2021-06-04 2021-08-06 上海天壤智能科技有限公司 Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator
CN113222297A (en) * 2021-06-08 2021-08-06 上海交通大学 Method, system, equipment and medium suitable for cyclic updating planning of solid waste base garden
CN113239634A (en) * 2021-06-11 2021-08-10 上海交通大学 Simulator modeling method based on robust simulation learning
CN113504723B (en) * 2021-07-05 2023-11-28 北京航空航天大学 Carrier rocket load shedding control method based on inverse reinforcement learning
CN113504723A (en) * 2021-07-05 2021-10-15 北京航空航天大学 Carrier rocket load shedding control method based on inverse reinforcement learning
CN113467515A (en) * 2021-07-22 2021-10-01 南京大学 Unmanned aerial vehicle flight control method based on virtual environment simulation reconstruction and reinforcement learning
CN113962012A (en) * 2021-07-23 2022-01-21 中国科学院自动化研究所 Unmanned aerial vehicle countermeasure strategy optimization method and device
CN113704979B (en) * 2021-08-07 2024-05-10 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuvering control method based on random neural network
CN113704979A (en) * 2021-08-07 2021-11-26 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuver control method based on random neural network
CN113609786A (en) * 2021-08-27 2021-11-05 中国人民解放军国防科技大学 Mobile robot navigation method and device, computer equipment and storage medium
CN113609786B (en) * 2021-08-27 2022-08-19 中国人民解放军国防科技大学 Mobile robot navigation method, device, computer equipment and storage medium
CN113688977B (en) * 2021-08-30 2023-12-05 浙江大学 Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
CN113852645A (en) * 2021-12-02 2021-12-28 北京邮电大学 Method and device for resisting client DNS cache poisoning attack and electronic equipment
WO2023109663A1 (en) * 2021-12-17 2023-06-22 深圳先进技术研究院 Serverless computing resource configuration method based on maximum entropy inverse reinforcement learning
CN114625089B (en) * 2022-03-15 2024-05-03 大连东软信息学院 Job shop scheduling method based on improved near-end strategy optimization algorithm
CN114625089A (en) * 2022-03-15 2022-06-14 大连东软信息学院 Job shop scheduling method based on improved near-end strategy optimization algorithm
CN114683280A (en) * 2022-03-17 2022-07-01 达闼机器人股份有限公司 Object control method, device, storage medium and electronic equipment
CN114683280B (en) * 2022-03-17 2023-11-17 达闼机器人股份有限公司 Object control method and device, storage medium and electronic equipment
CN114986518A (en) * 2022-07-19 2022-09-02 聊城一明五金科技有限公司 Intelligent control method and system for automobile disassembly production line
CN115314399A (en) * 2022-08-05 2022-11-08 北京航空航天大学 Data center flow scheduling method based on inverse reinforcement learning
CN115314399B (en) * 2022-08-05 2023-09-15 北京航空航天大学 Data center flow scheduling method based on inverse reinforcement learning
CN115470710B (en) * 2022-09-26 2023-06-06 北京鼎成智造科技有限公司 Air game simulation method and device
CN115470710A (en) * 2022-09-26 2022-12-13 北京鼎成智造科技有限公司 Air game simulation method and device
CN115688858A (en) * 2022-10-20 2023-02-03 哈尔滨工业大学(深圳) Fine-grained expert behavior simulation learning method, device, medium and terminal
CN115688858B (en) * 2022-10-20 2024-02-09 哈尔滨工业大学(深圳) Fine granularity expert behavior imitation learning method, device, medium and terminal
CN117193008B (en) * 2023-10-07 2024-03-01 航天科工集团智能科技研究院有限公司 Small sample robust imitation learning training method oriented to high-dimensional disturbance environment, electronic equipment and storage medium
CN117193008A (en) * 2023-10-07 2023-12-08 航天科工集团智能科技研究院有限公司 Small sample robust imitation learning training method oriented to high-dimensional disturbance environment, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111401556B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
CN111401556A (en) Selection method of opponent type imitation learning winning incentive function
CN110119844B (en) Robot motion decision method, system and device introducing emotion regulation and control mechanism
Camerer et al. Behavioural game theory: thinking, learning and teaching
Erev et al. On adaptation, maximization, and reinforcement learning among cognitive strategies.
Dayan et al. Decision theory, reinforcement learning, and the brain
Van Otterlo The logic of adaptive behavior: Knowledge representation and algorithms for adaptive sequential decision making under uncertainty in first-order and relational domains
CN111291890A (en) Game strategy optimization method, system and storage medium
Hotaling et al. Dynamic decision making
CN114330651A (en) Layered multi-agent reinforcement learning method oriented to multi-element joint instruction control
Intisar et al. Classification of online judge programmers based on rule extraction from self organizing feature map
Kim et al. Generalization of TORCS car racing controllers with artificial neural networks and linear regression analysis
CN115033878A (en) Rapid self-game reinforcement learning method and device, computer equipment and storage medium
CN113947246B (en) Loss processing method and device based on artificial intelligence and electronic equipment
CN117112742A (en) Dialogue model optimization method and device, computer equipment and storage medium
CN115906673A (en) Integrated modeling method and system for combat entity behavior model
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
US11478716B1 (en) Deep learning for data-driven skill estimation
CN114911157A (en) Robot navigation control method and system based on partial observable reinforcement learning
CN112884129B (en) Multi-step rule extraction method, device and storage medium based on teaching data
De Penning et al. Applying neural-symbolic cognitive agents in intelligent transport systems to reduce CO 2 emissions
Belavkin Conflict resolution by random estimated costs
CN117474077B (en) Auxiliary decision making method and device based on OAR model and reinforcement learning
US11869383B2 (en) Method, system and non-transitory computer- readable recording medium for providing information on user's conceptual understanding
O'Hanlon Using Supervised Machine Learning to Predict the Final Rankings of the 2021 Formula One Championship
Yang et al. The Cognitive Substrates of Model-Based Learning: An Integrative Declarative-Procedural Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant