CN113240118A - Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium - Google Patents

Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium

Info

Publication number
CN113240118A
CN113240118A (application CN202110540754.6A; granted publication CN113240118B)
Authority
CN
China
Prior art keywords
teaching data
advantage
data set
estimation
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110540754.6A
Other languages
Chinese (zh)
Other versions
CN113240118B (en)
Inventor
李小双 (Li Xiaoshuang)
王晓 (Wang Xiao)
黄梓铭 (Huang Ziming)
王飞跃 (Wang Feiyue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110540754.6A priority Critical patent/CN113240118B/en
Publication of CN113240118A publication Critical patent/CN113240118A/en
Application granted granted Critical
Publication of CN113240118B publication Critical patent/CN113240118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S 10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S 10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an advantage estimation method, an advantage estimation apparatus, an electronic device, and a storage medium. The method comprises: acquiring a current environment state; and inputting the current environment state into an advantage estimation model to obtain an advantage action determined by the advantage estimation model through advantage estimation based on the current environment state. The advantage estimation model is obtained based on a teaching data set and a behavior clone model; the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is trained on the teaching data set. Because the advantage estimation model is trained on both the teaching data set and the behavior clone model, the adaptive behavior clone model makes full use of the teaching data and automatically mines the expert experience contained in historical teaching data, avoids the adverse effects that imperfect teaching data may bring, strengthens the advantage estimation performance of the advantage estimation model, and improves the accuracy of advantage estimation in complex scenes.

Description

Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium
Technical Field
The present invention relates to the field of reinforcement learning technologies, and in particular, to a method and an apparatus for advantage estimation, an electronic device, and a storage medium.
Background
In recent years, Deep Reinforcement Learning (DRL) has made great progress and is widely applied in decision scenarios such as video games and board and card games. With the powerful feature-extraction and function-fitting capabilities of deep learning, a reinforcement learning agent can extract and learn feature knowledge directly from raw input data (such as game images) and then learn a decision control strategy with a conventional reinforcement learning algorithm based on the extracted feature information, without manually extracting or engineering features on the basis of rules and heuristics.
However, deep reinforcement learning still cannot be practically deployed for solving complex decision control problems in real environments (such as automatic driving). Due to the diversity and uncertainty of complex systems, existing simulation environments can hardly be kept consistent with the real world, and improving the precision of a simulation system is costly. Therefore, how to adapt to complex real-world scenes has become one of the most urgent problems in applying DRL models to complex decision tasks.
For decision problems in complex scenes, human experts have great advantages in learning efficiency and decision performance, so incorporating expert knowledge into the DRL model is a potential solution. The DQfD (Deep Q-learning from Demonstrations) method learns from teaching data to obtain the strategy represented by that data, thereby guiding and helping the agent to learn expert knowledge, and then performs autonomous learning on that basis, improving the decision-making capability of the model.
However, the DQfD model has the following problems: (1) in the DQfD learning process, the trajectory data in the historical teaching data set are used only for pre-training, and the teaching data provide no effective guidance for the trajectories generated autonomously by the model; (2) the teaching data set is very limited and cannot cover a large enough state-action space; moreover, it is difficult to collect enough teaching data in some practical applications, for example, extreme cases occur rarely in reality, and the vast majority of samples come from normal situations; (3) the DQfD algorithm ignores the imperfection of historical teaching data that is ubiquitous in real applications, and this imperfection can negatively affect the improvement of model performance. In addition, although methods based on DQN (Deep Q-Network) can achieve good results, they suffer from overestimation of the Q value.
Disclosure of Invention
The invention provides an advantage estimation method, an advantage estimation device, an electronic device and a storage medium, which are used to overcome the defect of the prior art that automatic decision making performs poorly in complex scenes.
The invention provides an advantage estimation method, which comprises the following steps:
acquiring a current environment state;
inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model through advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained based on a teaching data set and a behavior clone model;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
According to the superiority estimation method provided by the invention, the superiority estimation model is trained based on the following steps:
training to obtain a behavior clone network based on the teaching data set;
pre-training an advantage estimation model based on the teaching data set;
training the advantage estimation model based on the teaching data set and expert actions determined by the behavior clone network based on the sample environment state, and dynamically updating the teaching data set and fine-tuning the behavior clone network.
According to an advantage estimation method provided by the invention, the dynamically updating the teaching data set specifically comprises:
interacting with a real application environment based on the superiority estimation model, determining new teaching data based on feedback information of the real application environment, and updating and adding the new teaching data into the teaching data set.
According to an advantage estimation method provided by the present invention, the determining new teaching data based on feedback information of a real application environment, and updating the new teaching data into the teaching data set, specifically includes:
after the current round is finished, calculating the reward value of the current round;
and if the reward value of the current round is higher than the preset reward, determining new teaching data based on the feedback information of the real application environment in the current round and on the state information input to, and the advantage actions output by, the advantage estimation model in the current round, and updating the new teaching data into the teaching data set.
According to an advantage estimation method provided by the invention, the fine tuning of the behavioral clone network specifically comprises the following steps:
and fine-tuning the behavior clone network based on the updated teaching data set every time the teaching data set is updated for a preset number of times.
According to an advantage estimation method provided by the invention, the training is performed to obtain a behavior clone network based on the teaching data set, and the method specifically comprises the following steps:
determining a plurality of candidate cloned networks of different network structures and network parameters;
based on the teaching data set, taking the environmental state of the sample as input, taking the action of the sample as a label, and training each candidate clone network according to a back propagation and gradient descent algorithm;
interacting each candidate clone network with the real environment respectively, and calculating the total round reward obtained by each candidate clone network;
and selecting the candidate clone network with the highest total round reward as the trained behavior clone network.
According to the advantage estimation method provided by the invention, the loss function of the advantage estimation model comprises a supervised loss, a single-step time difference loss and a multi-step time difference loss;
wherein the supervised loss is determined based on a difference between a dominance estimation vector output by the dominance estimation model and a corresponding expert or sample action; wherein the expert action is determined by the behavioral cloning network according to a sample environment state, and the sample action is acquired from the teaching data set.
The present invention also provides an advantage estimation apparatus, including:
the state acquisition unit is used for acquiring the current environment state;
the advantage estimation unit is used for inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model performing advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained by training based on a teaching data set and on expert actions determined by a behavior clone model based on the sample environment state;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the superiority estimation method as described in any one of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the merit estimation method as described in any of the above.
According to the advantage estimation method, the advantage estimation device, the electronic equipment and the storage medium, the advantage estimation model is trained on the basis of the teaching data set and the behavior clone model, teaching data can be fully utilized through the adaptive behavior clone model, expert experience in historical teaching data is automatically mined, adverse effects possibly brought by incomplete teaching data are avoided, the advantage estimation performance of the advantage estimation model is enhanced, the advantage estimation accuracy in a complex scene is improved, and therefore the decision performance of the model is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of the advantage estimation method provided by the present invention;
FIG. 2 is a schematic flow chart of a dominant estimation model training method provided by the present invention;
FIG. 3 is a schematic diagram of the loss function calculation provided by the present invention;
FIG. 4 is a schematic structural diagram of an advantage estimation apparatus provided in the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of an advantage estimation method provided by an embodiment of the present invention, as shown in fig. 1, the method includes:
step 110, obtaining the current environment state.
Specifically, the environment state of the current decision scene is first obtained. The environment state is a set of feature values capable of describing the current running state of the environment, including but not limited to the RGB channel matrix of an environment image, or a vector or tensor formed by the values of different feature variables. Taking an emergency regulation and control scene of a power grid as an example, suppose that the voltages of the i-th bus and of its corresponding low-voltage side at time t (denoted v_t^i for the bus) are recorded together with the load on the bus, and that these quantities form the observation O_t of the grid at the current moment. Stacking the observations of the past N time steps describes the dynamic operating state of the power grid at time t and forms the environment state s_t = [O_{t-N+1}, O_{t-N+2}, ..., O_t].
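As a minimal sketch of this state construction (assuming Python with NumPy; the per-bus quantities, the padding at the start of a round, and the value N = 4 are illustrative assumptions rather than details taken from the patent):

```python
import numpy as np
from collections import deque

N = 4  # number of stacked time steps (illustrative value)

def make_observation(bus_voltages, low_side_voltages, bus_loads):
    """Concatenate the per-bus quantities measured at one time step into the observation O_t."""
    return np.concatenate([bus_voltages, low_side_voltages, bus_loads]).astype(np.float32)

history = deque(maxlen=N)  # most recent N observations

def build_state(o_t):
    """Append the newest observation and stack the last N into s_t = [O_{t-N+1}, ..., O_t]."""
    history.append(o_t)
    while len(history) < N:               # pad at the start of a round by repeating O_t
        history.appendleft(o_t)
    return np.stack(list(history), axis=0)
```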
Step 120, inputting the current environment state into the advantage estimation model to obtain an advantage vector obtained by the advantage estimation model performing advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained by training based on a teaching data set and on expert actions determined by the behavior clone model based on the sample environment state;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set training.
Specifically, the current environment state is input into the advantage estimation model, which performs advantage estimation based on that state and selects the currently best advantage action from a number of candidate actions, so that control can be performed according to the advantage action. Here, a candidate action is a candidate for the execution action that the agent applies to the environment; the action space can be divided into a discrete action space and a continuous action space according to whether the candidate actions are discrete. Taking the emergency regulation and control scene of the power grid as an example, when the power grid has K buses and each bus can either shed 20% of its load or take no action, the dimension of the discrete action space is 2^K.
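A minimal sketch of this inference step, assuming PyTorch and a model whose forward pass returns the advantage vector A(s, ·) over the discrete candidate actions (the function name and tensor shapes are illustrative):

```python
import torch

@torch.no_grad()
def select_advantage_action(advantage_model, state):
    """Advantage estimation for the current state followed by argmax action selection."""
    s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)  # add a batch dimension
    advantage_vector = advantage_model(s)    # shape (1, num_actions): one score per candidate action
    return int(advantage_vector.argmax(dim=1).item())             # index of the advantage action
```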
The advantage estimation model is obtained by reinforcement learning based on the teaching data set and on the expert actions determined by the behavior clone model from the sample environment states. The advantage estimation model can be built on a Dueling Double Deep Q-Network (Dueling DDQN) model. The teaching data set comprises sample environment states and corresponding sample actions, and can be generated from the operation records of human experts or of other methods and systems in the complex decision problem scene. Here, the teaching data set may consist of data samples formed by <sample environment state, sample action, reward, next sample environment state, current-round-ended flag> quintuples:
e_t = (s_t, a_t, r_t, s_{t+1}, flag_t)
where s_t, a_t, r_t, s_{t+1} and flag_t respectively denote the sample environment state, the sample action, the reward, the next sample environment state, and the flag indicating whether the current round has ended.
The reward is the feedback from the system after an action is applied to the environment, and is determined by a reward function r_t = r(s_t, a_t). Taking the grid environment as an example, the reward function can be constructed from the difference between the bus voltage and its standard value and from the amount of load that is shed: the larger the deviation of the bus voltage from the standard value, the larger the penalty, and the more load is shed, the larger the penalty. The cumulative sum of all penalty terms can be used as the reward function, so that correct actions incur small penalties and large rewards, while wrong actions incur large penalties and small rewards. The next sample environment state is the new environment state returned by the environment after the action is applied, and the current-round-ended flag indicates whether the current round ends after the action is applied to the environment.
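A sketch of the teaching-data quintuple and of one possible grid reward of the penalty-sum form described above (the inputs are assumed to be NumPy arrays, and the penalty weights w_voltage and w_shed are illustrative assumptions):

```python
from collections import namedtuple

# e_t = (s_t, a_t, r_t, s_{t+1}, flag_t)
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

def grid_reward(bus_voltages, nominal_voltage, shed_load, w_voltage=1.0, w_shed=0.5):
    """Reward as the negated cumulative sum of penalty terms: voltage deviation and shed load."""
    voltage_penalty = w_voltage * float(abs(bus_voltages - nominal_voltage).sum())
    shed_penalty = w_shed * float(shed_load.sum())
    return -(voltage_penalty + shed_penalty)  # correct actions incur small penalties, i.e. larger rewards
```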
The behavioral clone model is used to predict the best expert action from the action space based on the sample environment state, and is trained on the teaching data set. Here, a Behavioral Cloning (BC) model is constructed with the teaching data set to mine the expert experience contained in historical teaching data, and the advantage estimation model is then trained using both the trained behavior clone model and the teaching data set. The expert action output by the behavior clone model is compared with the advantage action output by the advantage estimation model to produce an expert loss; reducing the difference between the two optimizes the training of the advantage estimation model and improves the accuracy of advantage estimation in complex decision scenes. It should be noted that the advantage estimation method provided by the embodiment of the invention is general and can be applied to different complex decision scenes, including but not limited to electronic games, traffic control, and power grid control.
According to the method provided by the embodiment of the invention, the advantage estimation model is trained on the basis of the teaching data set and the behavior clone model, the teaching data can be fully utilized through the self-adaptive behavior clone model, the expert experience in historical teaching data is automatically mined, the possible adverse effect brought by incomplete teaching data is avoided, the advantage estimation performance of the advantage estimation model is enhanced, the advantage estimation accuracy in a complex scene is improved, and the decision performance of the model is improved.
Based on any of the above embodiments, the superiority estimation model is trained based on the following steps:
training to obtain a behavior clone network based on the teaching data set;
pre-training an advantage estimation model based on a teaching data set;
and training the advantage estimation model based on the teaching data set and the expert actions determined by the behavior clone network based on the sample environment state, and dynamically updating the teaching data set and fine-tuning the behavior clone network.
Specifically, expert experience in the corresponding decision scene can be learned from the teaching data set, so that a behavior clone network with a certain initial decision capability is obtained through training. Meanwhile, the teaching data set can be placed into a common experience replay pool, data can be randomly sampled from it, and the advantage estimation model can be pre-trained to obtain an advantage estimation model with a certain initial decision capability.
Subsequently, the advantage estimation model performs autonomous learning. Data are randomly sampled from the experience replay pool, the action corresponding to the maximum value in the discrete action probability vector output by the behavior clone network is taken as the expert action, and the network parameters of the advantage estimation model are updated by back-propagation and gradient descent. The updated advantage estimation model has better advantage estimation performance than the pre-trained one. During training, the teaching data set is dynamically updated and the behavior clone model is periodically fine-tuned; by introducing this automatic teaching-data update mechanism, the teaching data set comes to contain more high-quality trajectory samples, which avoids the adverse effects that imperfect teaching data may bring and enhances the robustness of the behavior clone model.
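The three training stages described above can be sketched as the skeleton below; every helper used here (train_bc, ReplayPool, pretrain_dueling_ddqn, step_in_env, update_with_mixed_loss, maybe_update_teaching_data) is a placeholder for a step of the procedure, not an API defined by the patent:

```python
def train_advantage_model(teaching_data, env, pretrain_steps, train_steps):
    bc_net = train_bc(teaching_data)                           # stage 1: behavior clone network
    replay = ReplayPool(teaching_data)                         # common experience replay pool
    adv_model = pretrain_dueling_ddqn(replay, pretrain_steps)  # stage 2: pre-trained advantage model

    episode, episode_reward = [], 0.0
    for _ in range(train_steps):                               # stage 3: autonomous learning
        transition = step_in_env(adv_model, env)               # interact with the real environment
        replay.add(transition)
        episode.append(transition)
        episode_reward += transition.reward

        batch = replay.sample()
        expert_actions = bc_net(batch.states).argmax(dim=1)    # expert action from the clone network
        update_with_mixed_loss(adv_model, batch, expert_actions)  # back-propagation + gradient descent

        if transition.done:                                    # the current round has ended
            maybe_update_teaching_data(teaching_data, bc_net, episode, episode_reward)  # sketched below
            episode, episode_reward = [], 0.0
    return adv_model
```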
According to the method provided by the embodiment of the invention, the advantage estimation model is first pre-trained and then further trained based on the teaching data set and the expert actions determined by the behavior clone network from the sample environment states, which improves its advantage estimation performance; meanwhile, dynamically updating the teaching data set and fine-tuning the behavior clone network enhances the robustness of the behavior clone model and further improves the training effect of the advantage estimation model.
Based on any of the above embodiments, dynamically updating the teaching data set specifically includes:
and interacting with the real application environment based on the advantage estimation model, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into a teaching data set.
Specifically, the advantage estimation model interacts with the real application environment: given an input environment state, the advantage estimation network outputs an advantage estimation vector and the behavior clone network outputs a discrete action probability vector, each containing the score of every candidate action determined from that environment state. The candidate action corresponding to the maximum value in the advantage estimation vector is selected as the current optimal decision and applied to the real environment, and the feedback information of the real application environment is obtained, forming a new <environment state, action, reward, next environment state, current-round-ended flag> quintuple of teaching data that is placed into the experience replay pool, thereby dynamically updating the teaching data set.
Based on any of the above embodiments, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into a teaching data set, specifically including:
after the current round is finished, calculating the reward value of the current round;
and if the reward value of the current round is higher than the preset reward, determining new teaching data based on the feedback information of the real application environment in the current round and on the state information input to, and the advantage actions output by, the advantage estimation model in the current round, and updating the new teaching data into the teaching data set.
Specifically, after the current optimal decision output by the advantage estimation model is applied to the real environment and the current round ends, the reward value of the round is calculated. If this reward value is higher than the preset reward, the current round corresponds to a successful operation trajectory that can be added to the teaching data set: new teaching data are determined from the feedback information of the real application environment in the round and from the advantage actions output by the advantage estimation model in the round, and are added to the teaching data set.
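Continuing the skeleton above, the dynamic teaching-data update and the periodic fine-tuning described in this and the following subsection might look as follows; reward_threshold, finetune_period and finetune_bc are illustrative placeholders:

```python
update_counter = {"count": 0}   # counts teaching-data updates between fine-tuning passes

def maybe_update_teaching_data(teaching_data, bc_net, episode, episode_reward,
                               reward_threshold=0.0, finetune_period=10):
    """Promote a successful round into the teaching set; fine-tune the clone network periodically."""
    if episode_reward > reward_threshold:                # round reward exceeds the preset reward
        teaching_data.extend(episode)                    # new <s, a, r, s', done> quintuples from this round
        update_counter["count"] += 1
        if update_counter["count"] % finetune_period == 0:   # every K updates of the teaching set
            finetune_bc(bc_net, teaching_data)           # fine-tune on the updated teaching data set
```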
Based on any of the above embodiments, the fine tuning behavior cloning network specifically includes:
and fine-tuning the behavior clone network based on the updated teaching data set every time the teaching data set is updated for a preset number of times.
Specifically, a period of fine tuning of the behavioral cloning network may be preset, for example, K, and then, each time the teaching data set is updated K times, the behavioral cloning network may be fine-tuned based on the updated teaching data set, so as to improve robustness of the behavioral cloning network.
Based on any of the above embodiments, based on the teaching data set, training to obtain a behavioral clone network specifically includes:
determining a plurality of candidate cloned networks of different network structures and network parameters;
training each candidate clone network based on a teaching data set by taking a sample environment state as input and a sample action as a label according to a back propagation and gradient descent algorithm;
interacting each candidate clone network with the real environment respectively, and calculating the total round reward obtained by each candidate clone network;
and selecting the candidate clone network with the highest total round reward as the trained behavior clone network.
Specifically, a plurality of candidate clone networks with different network structures and network parameters are predetermined. The network structure of a candidate clone network can be chosen according to the teaching data set so that it matches the teaching data and can mine it well; for example, the structure can be a fully connected network or a long short-term memory (LSTM) network. The activation function of the candidate clone networks may be LeakyReLU, i.e. y = max(0, x) + α·min(0, x), where α is a small positive number.
Each candidate clone network is then trained with the teaching data set, taking the sample environment state as input and the sample action as the label, using a cross-entropy loss between the label and the output of the network model:
[cross-entropy loss between the candidate network's discrete action probability vector and the sample action a_E]
where a is the action corresponding to the maximum value in the discrete action probability vector output by the candidate clone network and a_E is the sample action in the teaching data. The network parameters of each candidate clone network are updated by back-propagation and gradient descent, establishing a mapping f: s → a from states to actions. The trained candidate clone network can then generate a virtual expert action from the input environment state.
Each candidate clone network then interacts with the real environment: given an input environment state, each candidate clone network outputs a discrete action probability vector, and the action corresponding to its maximum value is selected as the virtual expert action applied to the real environment. Each executed action yields a single-step reward, and summing all single-step rewards of a round gives the total reward of that round. The candidate clone network with the highest total round reward is then selected as the trained behavior clone network.
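A condensed PyTorch sketch of this candidate training and selection (the hidden sizes, epoch count, learning rate and the episode_return evaluation callback are illustrative assumptions, and only fully connected candidates are shown):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_candidates(state_dim, num_actions):
    """Candidate clone networks with different structures (fully connected here, for brevity)."""
    return [nn.Sequential(nn.Linear(state_dim, h), nn.LeakyReLU(0.01),
                          nn.Linear(h, num_actions))
            for h in (64, 128, 256)]

def train_candidate(net, states, actions, epochs=50, lr=1e-3):
    """Supervised training: sample environment state as input, sample action as label, cross-entropy loss."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(net(states), actions)  # compare network output with teaching actions a_E
        opt.zero_grad()
        loss.backward()                               # back-propagation
        opt.step()                                    # gradient descent update
    return net

def select_bc_network(candidates, states, actions, episode_return):
    """Keep the candidate that earns the highest total round reward in the real environment."""
    trained = [train_candidate(net, states, actions) for net in candidates]
    return max(trained, key=episode_return)           # episode_return runs one round and sums the rewards
```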
Based on any of the above embodiments, the loss function of the advantage estimation model comprises a supervised loss, a single-step time difference loss and a multi-step time difference loss;
wherein the supervised loss is determined based on a difference between the dominance estimation vector output by the dominance estimation model and the corresponding expert action or sample action; wherein, the expert action is determined by the behavior clone network according to the environmental state of the sample, and the sample action is acquired from the teaching data set.
In particular, a hybrid loss function is defined, comprising supervised and unsupervised losses. The supervised loss is some distance measure between the teaching action and the advantage estimation vector, including but not limited to: cross-entropy loss, MSE loss, KL-divergence loss, JS-divergence loss, Wasserstein distance, and the like. The teaching action is either an expert action output by the behavior clone network or a sample action from the teaching data set. The unsupervised losses are the single-step time difference loss TD(1) and the multi-step time difference loss TD(n). The individual losses can be calculated by the following formulas:

L_DQ(Q) = (r(s, a) + γ Q(s_{t+1}, a_{t+1}^{max}; θ') − Q(s, a; θ))²

L_n(Q) = (r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n Q(s_{t+n}, a_{t+n}^{max}; θ') − Q(s, a; θ))²

L_E(Q) = D_JS(adva ‖ π_bc(s))

L_u(Q) = L_DQ(Q) + λ_1 L_n(Q) + λ_2 L_E(Q)

where L_DQ(Q) is the single-step time difference loss, L_n(Q) is the multi-step time difference loss, and L_E(Q) is the supervised loss (here the JS divergence is taken as an example); r(s, a) is the reward function, s is the state at the current time, a is the action at the current time, γ is the discount factor, s_{t+1} is the state the system jumps to at the next time after the current action is performed, a_{t+1}^{max} is the optimal action under the Double DQN algorithm, defined as a_{t+1}^{max} = argmax_a Q(s_{t+1}, a; θ), θ and θ' are the parameters of the Q network and the target Q network respectively, r_{t+i} is the reward fed back by the system i steps after the current time t, adva is the normalized advantage estimation vector output by the advantage head of the Dueling DDQN, defined as adva = softmax(A(s, a)) where A(s, a) is the advantage estimation vector, demo denotes the teaching data, π_bc(s) is the action distribution of the behavior clone network policy in state s, and λ_1 and λ_2 are the weights of the corresponding losses.
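A sketch of how this hybrid loss could be assembled in PyTorch; the batch fields (states, actions, rewards, next_states, dones, n_step_targets), the q_net.advantage(...) head and the use of a mean-squared error for the two TD terms are assumptions made for illustration, not the patent's exact formulation:

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two batched action distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps) / (b + eps)).log()).sum(dim=1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mixed_loss(q_net, target_net, bc_net, batch, gamma, lam1, lam2):
    # L_DQ: single-step TD loss with the Double DQN target (action chosen by Q(θ), evaluated by Q(θ')).
    q = q_net(batch.states).gather(1, batch.actions.unsqueeze(1)).squeeze(1)
    next_a = q_net(batch.next_states).argmax(dim=1, keepdim=True)
    target = batch.rewards + gamma * (1.0 - batch.dones) * \
             target_net(batch.next_states).gather(1, next_a).squeeze(1)
    l_dq = F.mse_loss(q, target.detach())

    # L_n: multi-step TD loss against a precomputed n-step return plus the bootstrapped tail value.
    l_n = F.mse_loss(q, batch.n_step_targets.detach())

    # L_E: supervised loss, JS divergence between softmax(A(s, .)) and the clone network's distribution.
    adva = F.softmax(q_net.advantage(batch.states), dim=1)
    expert = F.softmax(bc_net(batch.states), dim=1)
    l_e = js_divergence(adva, expert).mean()

    return l_dq + lam1 * l_n + lam2 * l_e
```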
Based on any of the above embodiments, fig. 2 is a schematic flow chart of a superiority estimation model training method provided by an embodiment of the present invention, as shown in fig. 2, the method includes:
collecting an initial teaching data set according to the operation records of a human expert or other methods and systems in a complex decision problem scene;
pre-training and verifying a behavior clone network by using a teaching data set to obtain a behavior clone network structure with certain initial decision-making capability and corresponding parameters;
the method comprises the steps of applying a teaching data set, sampling from teaching data, pre-training a Dueling DDQN network, and obtaining an advantage estimation model with certain initial decision making capability;
the advantage estimation model performs autonomous learning, interacts with the environment, continuously provides expert actions in the current state according to the behavior clone model, generates expert losses, and trains the Dueling DDQN network by using the mixed loss function provided by the embodiment. Fig. 3 is a schematic diagram of the calculation of the loss function according to the embodiment of the present invention, and as shown in fig. 3, the hybrid loss function includes a supervisory loss supervise, a single-step time difference loss TD (1) loss, and a multi-step time difference loss TD (n) loss. In the figure, V(s), Q (s, a) and A (s, a) respectively represent a state value function, a state-action value function and an advantage function obtained in the Dueling DDQN method, argmaxa(ABC(s, a)) represents the expert action of the behavioral clone network output. And if the current round is finished and the current round obtains better rewards, adding the generated data into a teaching data set, and finely adjusting the behavior clone model. And repeating the operation until the termination condition is met, and finishing the training.
Based on any of the above embodiments, fig. 4 is a schematic structural diagram of an advantage estimation apparatus provided in an embodiment of the present invention, and as shown in fig. 4, the apparatus includes a state obtaining unit 410 and an advantage estimation unit 420.
The state obtaining unit 410 is configured to obtain a current environment state;
the advantage estimation unit 420 is configured to input the current environment state into an advantage estimation model, obtain an advantage vector obtained by performing advantage estimation on the advantage estimation model based on the current environment state, and determine an action corresponding to a maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained by training based on a teaching data set and on expert actions determined by the behavior clone model based on the sample environment state;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set training.
The device provided by the embodiment of the invention trains the advantage estimation model based on the teaching data set and the behavior clone model, can fully utilize the teaching data through the adaptive behavior clone model, automatically excavate the expert experience in historical teaching data, avoid the adverse effect possibly brought by incomplete teaching data, enhance the advantage estimation performance of the advantage estimation model, and improve the accuracy of advantage estimation in a complex scene.
Based on any of the above embodiments, the superiority estimation model is trained based on the following steps:
training to obtain a behavior clone network based on the teaching data set;
pre-training an advantage estimation model based on a teaching data set;
and training the advantage estimation model based on the teaching data set and the expert actions determined by the behavior clone network based on the sample environment state, and dynamically updating the teaching data set and fine-tuning the behavior clone network.
According to the device provided by the embodiment of the invention, the advantage estimation model is first pre-trained and then further trained based on the teaching data set and the expert actions determined by the behavior clone network from the sample environment states, which improves its advantage estimation performance; meanwhile, dynamically updating the teaching data set and fine-tuning the behavior clone network enhances the robustness of the behavior clone model and further improves the training effect of the advantage estimation model.
Based on any of the above embodiments, dynamically updating the teaching data set specifically includes:
and interacting with the real application environment based on the advantage estimation model, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into a teaching data set.
Based on any of the above embodiments, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into a teaching data set, specifically including:
after the current round is finished, calculating the reward value of the current round;
and if the reward value of the current round is higher than the preset reward, determining new teaching data based on the feedback information of the real application environment in the current round and on the state information input to, and the advantage actions output by, the advantage estimation model in the current round, and updating the new teaching data into the teaching data set.
Based on any of the above embodiments, the fine tuning behavior cloning network specifically includes:
and fine-tuning the behavior clone network based on the updated teaching data set every time the teaching data set is updated for a preset number of times.
Based on any of the above embodiments, based on the teaching data set, training to obtain a behavioral clone network specifically includes:
determining a plurality of candidate cloned networks of different network structures and network parameters;
training each candidate clone network based on a teaching data set by taking a sample environment state as input and a sample action as a label according to a back propagation and gradient descent algorithm;
interacting each candidate clone network with the real environment respectively, and calculating the total round reward obtained by each candidate clone network;
and selecting the candidate clone network with the highest total round reward as the trained behavior clone network.
Based on any of the above embodiments, the loss function of the advantage estimation model comprises a supervised loss, a single-step time difference loss and a multi-step time difference loss;
wherein the supervised loss is determined based on a difference between the dominance estimation vector output by the dominance estimation model and the corresponding expert action or sample action; wherein, the expert action is determined by the behavior clone network according to the environmental state of the sample, and the sample action is acquired from the teaching data set.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a dominance estimation method comprising: acquiring a current environment state; inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model through advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action; the advantage estimation model is obtained based on a teaching data set and a behavior clone model; the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform a method of advantage estimation provided by the above methods, the method comprising: acquiring a current environment state; inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model through advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action; the advantage estimation model is obtained based on a teaching data set and a behavior clone model; the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the method of merit estimation provided above, the method comprising: acquiring a current environment state; inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model through advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action; the advantage estimation model is obtained based on a teaching data set and a behavior clone model; the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A dominance estimation method, comprising:
acquiring a current environment state;
inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model through advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained based on a teaching data set and a behavior clone model;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
2. The dominance estimation method according to claim 1, wherein the dominance estimation model is trained based on the following steps:
training to obtain a behavior clone network based on the teaching data set;
pre-training an advantage estimation model based on the teaching data set;
training the advantage estimation model based on the teaching data set and expert actions determined by the behavior clone network based on the sample environment state, and dynamically updating the teaching data set and fine-tuning the behavior clone network.
3. The dominance estimation method according to claim 2, wherein the dynamically updating the teach data set specifically comprises:
interacting with a real application environment based on the superiority estimation model, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into the teaching data set.
4. The advantage estimation method according to claim 3, wherein the determining new teaching data based on the feedback information of the real application environment and updating the new teaching data into the teaching data set specifically includes:
after the current round is finished, calculating the reward value of the current round;
and if the reward value of the current round is higher than the preset reward, determining new teaching data based on the feedback information of the real application environment in the current round and on the state information input to, and the advantage actions output by, the advantage estimation model in the current round, and updating the new teaching data into the teaching data set.
5. The advantage estimation method according to claim 2, wherein the fine-tuning the behavioral clone network specifically includes:
and fine-tuning the behavior clone network based on the updated teaching data set every time the teaching data set is updated for a preset number of times.
6. The advantage estimation method according to claim 2, wherein training to obtain a behavioral clone network based on the teach data set specifically comprises:
determining a plurality of candidate cloned networks of different network structures and network parameters;
based on the teaching data set, taking the environmental state of the sample as input, taking the action of the sample as a label, and training each candidate clone network according to a back propagation and gradient descent algorithm;
interacting each candidate clone network with the real environment respectively, and calculating the total round reward obtained by each candidate clone network;
and selecting the candidate clone network with the highest total round reward as the trained behavior clone network.
7. The dominance estimation method according to any one of claims 1 to 6, wherein the loss function of the dominance estimation model comprises a supervised loss, a single-step time difference loss, and a multi-step time difference loss;
wherein the supervised loss is determined based on a difference between a dominance estimation vector output by the dominance estimation model and a corresponding expert or sample action; wherein the expert action is determined by the behavioral cloning network according to a sample environment state, and the sample action is acquired from the teaching data set.
8. An advantage estimation apparatus, comprising:
the state acquisition unit is used for acquiring the current environment state;
the advantage estimation unit is used for inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model performing advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained by training based on a teaching data set and on expert actions determined by a behavior clone model based on the sample environment state;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the advantage estimation method according to any of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the dominance estimation method according to any one of claims 1 to 7.
CN202110540754.6A 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium Active CN113240118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110540754.6A CN113240118B (en) 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110540754.6A CN113240118B (en) 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
CN113240118A true CN113240118A (en) 2021-08-10
CN113240118B CN113240118B (en) 2023-05-09

Family

ID=77135047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110540754.6A Active CN113240118B (en) 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN113240118B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118563A (en) * 2021-11-23 2022-03-01 中国电子科技集团公司第三十研究所 Self-iteration situation prediction method and system based on data middleboxes

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858630A (en) * 2019-02-01 2019-06-07 清华大学 Method and apparatus for intensified learning
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Learn isolated intersection traffic signal control method, system, device based on deeply
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
EP3690769A1 (en) * 2019-01-31 2020-08-05 StradVision, Inc. Learning method and learning device for supporting reinforcement learning by using human driving data as training data to thereby perform personalized path planning
CN112396180A (en) * 2020-11-25 2021-02-23 中国科学院自动化研究所 Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3690769A1 (en) * 2019-01-31 2020-08-05 StradVision, Inc. Learning method and learning device for supporting reinforcement learning by using human driving data as training data to thereby perform personalized path planning
CN109858630A (en) * 2019-02-01 2019-06-07 清华大学 Method and apparatus for intensified learning
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Learn isolated intersection traffic signal control method, system, device based on deeply
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN112396180A (en) * 2020-11-25 2021-02-23 中国科学院自动化研究所 Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOSHUANG LI et al.: "Supervised assisted deep reinforcement learning for emergency voltage control of power systems", Neurocomputing *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118563A (en) * 2021-11-23 2022-03-01 中国电子科技集团公司第三十研究所 Self-iteration situation prediction method and system based on data middleboxes

Also Published As

Publication number Publication date
CN113240118B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
US12059619B2 (en) Information processing method and apparatus, computer readable storage medium, and electronic device
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN112329948A (en) Multi-agent strategy prediction method and device
CN112396180B (en) Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
CN111282272B (en) Information processing method, computer readable medium and electronic device
CN113947022B (en) Near-end strategy optimization method based on model
CN110555517A (en) Improved chess game method based on Alphago Zero
CN116776751B (en) Intelligent decision algorithm model design development auxiliary system
JP2020166795A (en) Reinforced learning method, reinforced learning device, and reinforced learning program for efficient learning
CN112613608A (en) Reinforced learning method and related device
CN115457240A (en) Image object driving navigation method, device, equipment and storage medium
CN113240118B (en) Dominance estimation method, dominance estimation device, electronic device, and storage medium
CN114154397B (en) Implicit opponent modeling method based on deep reinforcement learning
CN115409158A (en) Robot behavior decision method and device based on layered deep reinforcement learning model
CN113887708A (en) Multi-agent learning method based on mean field, storage medium and electronic device
CN111753855B (en) Data processing method, device, equipment and medium
Xing et al. Policy distillation with selective input gradient regularization for efficient interpretability
Jang et al. AVAST: Attentive variational state tracker in a reinforced navigator
CN115009291B (en) Automatic driving assistance decision making method and system based on network evolution replay buffer area
CN112884129A (en) Multi-step rule extraction method and device based on teaching data and storage medium
Chen et al. Modified PPO-RND method for solving sparse reward problem in ViZDoom
CN112870716B (en) Game data processing method and device, storage medium and electronic equipment
Saito et al. A study on efficient transfer learning for reinforcement learning using sparse coding
CN114118400B (en) Concentration network-based cluster countermeasure method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant