CN113240118B - Dominance estimation method, dominance estimation device, electronic device, and storage medium - Google Patents

Dominance estimation method, dominance estimation device, electronic device, and storage medium

Info

Publication number
CN113240118B
Authority
CN
China
Prior art keywords
teaching data
dominant
network
data set
estimation
Prior art date
Legal status
Active
Application number
CN202110540754.6A
Other languages
Chinese (zh)
Other versions
CN113240118A (en)
Inventor
李小双
王晓
黄梓铭
王飞跃
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110540754.6A priority Critical patent/CN113240118B/en
Publication of CN113240118A publication Critical patent/CN113240118A/en
Application granted granted Critical
Publication of CN113240118B publication Critical patent/CN113240118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a dominance estimation method, a dominance estimation device, an electronic device, and a storage medium. The method comprises: acquiring a current environment state; and inputting the current environment state into a dominance estimation model to obtain a dominant action produced by the dominance estimation model through dominance estimation based on the current environment state. The dominance estimation model is obtained by training based on a teaching data set and a behavior cloning model; the teaching data set comprises sample environment states and corresponding sample actions, and the behavior cloning model is trained based on the teaching data set. By training the dominance estimation model on the teaching data set together with an adaptive behavior cloning model, the invention makes full use of the teaching data, automatically mines the expert experience contained in the historical teaching data, avoids the adverse effects that imperfect teaching data may cause, enhances the dominance estimation performance of the model, and improves the accuracy of dominance estimation in complex scenes.

Description

Dominance estimation method, dominance estimation device, electronic device, and storage medium
Technical Field
The present invention relates to the field of reinforcement learning technologies, and in particular, to a dominance estimation method and apparatus, an electronic device, and a storage medium.
Background
Deep reinforcement learning (Deep Reinforcement Learning, DRL) has made great progress in recent years and is widely used in decision scenarios such as video games and board games. Relying on the strong feature extraction and function fitting capability of deep learning, a reinforcement learning agent can extract and learn feature knowledge directly from raw input data (such as game images), and then learn a decision control strategy from the extracted features with a conventional reinforcement learning algorithm, without manually extracting features or learning them from rules and heuristics.
However, for solving complex decision control problems in real environments (e.g., autonomous driving), deep reinforcement learning techniques have not yet reached practical use. Because of the diversity and uncertainty of complex systems, existing simulation environments can hardly remain consistent with the real world, and improving the accuracy of a simulation system is costly. Therefore, how to adapt to complex real-world scenarios has become one of the most urgent problems in applying DRL models to complex decision tasks.
For decision-making problems in complex scenarios, human experts have great advantages in learning efficiency and decision performance, so incorporating expert knowledge into the DRL model is a potential solution. The DQfD (Deep Q-learning from Demonstrations) method, which performs Q-learning from demonstrations, can guide an agent, through learning the teaching data, to acquire the strategy represented by that data, thereby helping the agent learn expert knowledge and then continue autonomous learning on that basis, improving the decision capability of the model.
However, the DQfD model has the following problems: (1) in the DQfD learning process, the trajectory data in the historical teaching data set is used only for pre-training, and the teaching data provides no effective guidance for the trajectory data generated autonomously by the model; (2) the teaching data set is very limited and cannot cover a sufficient state-action space; moreover, in some practical applications it is difficult to collect enough teaching data, for example extreme cases rarely occur in reality and the vast majority of samples describe normal conditions; (3) the DQfD algorithm ignores the imperfections in historical teaching data that are common in real-world applications and that can negatively affect the improvement of model performance. In addition, although methods based on DQN (Deep Q-learning Network) can achieve good results, they suffer from overestimation of the Q value.
Disclosure of Invention
The invention provides a dominance estimation method, a dominance estimation device, an electronic device, and a storage medium, which are used to overcome the defect in the prior art that automatic decision-making performs poorly in complex scenes.
The invention provides an advantage estimation method, which comprises the following steps:
acquiring a current environment state;
inputting the current environment state into a dominant estimation model, obtaining a dominant vector obtained by performing dominant estimation on the basis of the current environment state by the dominant estimation model, and determining an action corresponding to the maximum value in the dominant vector as a dominant action;
the advantage estimation model is obtained based on a teaching data set and a behavior cloning model;
the teaching data set comprises a sample environment state and a corresponding sample action, and the behavior cloning model is trained based on the teaching data set.
According to the invention, a dominance estimation method is provided, and the dominance estimation model is trained based on the following steps:
training to obtain a behavior cloning network based on the teaching data set;
pre-training a dominance estimation model based on the teaching dataset;
training the dominance estimation model based on the teaching data set and on expert actions determined by the behavior cloning network based on the sample environment state, while dynamically updating the teaching data set and fine-tuning the behavior cloning network.
According to the advantage estimation method provided by the invention, the dynamic updating of the teaching data set specifically comprises the following steps:
and interacting with a real application environment based on the dominance estimation model, determining new teaching data based on feedback information of the real application environment, and updating and adding the new teaching data into the teaching data set.
According to the advantage estimation method provided by the invention, the feedback information based on the real application environment determines new teaching data, and updates the new teaching data into the teaching data set, which specifically comprises the following steps:
after the current round is finished, calculating a reward value of the current round;
if the current round of rewarding value is higher than the preset rewarding value, determining new teaching data based on feedback information of a real application environment in the current round and state information input and dominant actions output by the dominant estimation model in the current round, and updating the new teaching data into the teaching data set.
According to the advantage estimation method provided by the invention, the fine tuning of the behavior clone network specifically comprises the following steps:
and carrying out fine tuning on the behavior cloning network based on the updated teaching data set every time the preset number of times of updating the teaching data set.
According to the advantage estimation method provided by the invention, the behavior clone network is obtained by training based on the teaching data set, and the method specifically comprises the following steps:
determining a plurality of candidate clone networks of different network structures and network parameters;
based on the teaching data set, taking a sample environment state as input, taking a sample action as a label, and training each candidate clone network according to a back propagation and gradient descent algorithm;
respectively interacting each candidate clone network with a real environment, and calculating round rewards total score of each round corresponding to each candidate clone network;
and selecting the candidate clone network with the highest round rewards total score as a trained behavior clone network.
According to the advantage estimation method provided by the invention, the loss function of the advantage estimation model comprises a supervised loss, a single-step time difference loss and a multi-step time difference loss;
wherein the supervised penalty is determined based on differences between the dominant estimation vectors output by the dominant estimation model and corresponding expert or sample actions; the expert action is determined by the behavior cloning network according to a sample environment state, and the sample action is acquired from the teaching data set.
The present invention also provides an advantage estimation apparatus, including:
a state acquisition unit for acquiring a current environment state;
the dominant estimation unit is used for inputting the current environment state into a dominant estimation model, obtaining a dominant vector obtained by performing dominant estimation on the dominant estimation model based on the current environment state, and determining an action corresponding to the maximum value in the dominant vector as a dominant action;
the advantage estimation model is obtained by training expert actions determined based on a sample environment state and based on a teaching data set;
the teaching data set comprises a sample environment state and a corresponding sample action, and the behavior cloning model is trained based on the teaching data set.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the dominance estimation method according to any of the preceding claims when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the dominance estimation method according to any of the above.
According to the dominance estimation method and device, the electronic device, and the storage medium provided by the invention, the dominance estimation model is trained based on the teaching data set and the behavior cloning model. Through the adaptive behavior cloning model, the teaching data is fully utilized and the expert experience in the historical teaching data is automatically mined, the adverse effects that imperfect teaching data may cause are avoided, the dominance estimation performance of the model is enhanced, and the accuracy of dominance estimation in complex scenes is improved, thereby improving the decision performance of the model.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an advantage estimation method according to the present invention;
FIG. 2 is a schematic flow chart of a dominant estimation model training method provided by the present invention;
FIG. 3 is a schematic diagram of a loss function calculation provided by the present invention;
FIG. 4 is a schematic diagram of an advantage estimating apparatus according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flow chart of an advantage estimation method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 110, the current environmental state is obtained.
Specifically, the environment state of the current decision scene is first acquired. The environment state is a set of values that can describe the characteristics of the current operating state of the environment, including but not limited to the RGB channel matrix of an environment image, or vectors or tensors formed from the values of different characteristic variables. Taking the emergency regulation scene of a power grid as an example, let the voltage of the i-th bus at time t and the voltage of the corresponding low-voltage side be v_t^i and u_t^i, and let the load on the bus be p_t^i. The state of the power grid at the current moment can then be expressed as O_t = [v_t^1, u_t^1, p_t^1, ..., v_t^K, u_t^K, p_t^K]. By stacking the states of the past N time steps, which describe the dynamic operating state of the power grid at time t, the environment state is formed as s_t = [O_{t-N+1}, O_{t-N+2}, ..., O_t].
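For illustration, a minimal Python sketch of this observation stacking is given below. The function names and the per-bus feature layout are assumptions made for the example, not taken from the patent.

```python
import numpy as np
from collections import deque

N = 10  # number of stacked time steps (illustrative value)
obs_history = deque(maxlen=N)

def make_observation(bus_voltages, low_side_voltages, bus_loads):
    """Concatenate per-bus voltage, low-voltage-side voltage and load into O_t."""
    return np.concatenate([bus_voltages, low_side_voltages, bus_loads]).astype(np.float32)

def make_state(history):
    """Stack the last N observations into the environment state s_t."""
    if len(history) < N:
        raise ValueError("need N observations before a state can be formed")
    return np.concatenate(list(history))

# per control step: obs_history.append(make_observation(v_t, u_t, p_t))
# once the history is full: s_t = make_state(obs_history)
```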
Step 120, inputting the current environmental state into a dominant estimation model, obtaining a dominant vector obtained by performing dominant estimation on the basis of the current environmental state by the dominant estimation model, and determining an action corresponding to the maximum value in the dominant vector as a dominant action;
the advantage estimation model is obtained by training expert actions determined based on a sample environment state based on a teaching data set and the behavior cloning model;
the teaching data set comprises a sample environment state and a corresponding sample action, and the behavior cloning model is trained based on the teaching data set.
Specifically, the current environment state is input into the dominance estimation model, which performs dominance estimation based on the current environment state and selects the currently best dominant action from a plurality of candidate actions, so that control can be performed according to the dominant action. Here, the candidate actions are the actions the agent may apply to the environment; depending on whether they are discrete, the action space can be classified into a discrete action space or a continuous action space. Taking the emergency regulation scene of a power grid as an example, when each of K buses of the power grid may either shed 20% of its load or take no action, the dimension of the discrete action space is 2^K.
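A minimal sketch of this selection step is shown below, assuming `advantage_model` is some trained network that maps a state tensor to the advantage vector A(s, ·) over the discrete action space; the function name and the use of PyTorch are illustrative assumptions.

```python
import torch

@torch.no_grad()
def select_dominant_action(advantage_model, state):
    """Return the index of the action with the largest advantage estimate."""
    s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)  # add batch dimension
    advantage_vector = advantage_model(s).squeeze(0)              # A(s, ·)
    return int(torch.argmax(advantage_vector).item())             # dominant action
```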
The dominance estimation model is obtained by reinforcement learning based on the teaching data set and on the expert actions determined by the behavior cloning model from the sample environment state. The dominance estimation model may be built on a Dueling Double Deep Q-learning Network (Dueling DDQN) model. The teaching data set comprises sample environment states and corresponding sample actions, which may be generated from the operation records of human experts, or of other methods and systems, in the complex decision problem scenario. Here, the teaching data set may consist of data samples formed by the quintuple <sample environment state, sample action, reward, next sample state, flag of whether the current round ends>:
e_t = (s_t, a_t, r_t, s_{t+1}, flag_t)
where s_t, a_t, r_t, s_{t+1}, and flag_t respectively denote, at time t, the sample environment state, the sample action, the reward, the next sample environment state, and the flag indicating whether the current round has ended.
The reward is the value fed back by the system after the action is applied to the environment, and is determined by a reward function r_t = r(s_t, a_t). Taking the grid environment as an example, the gap between the bus voltage and its standard value and the amount of load shedding can be used to construct the reward function: the larger the deviation of the bus voltage from the standard value, the larger the penalty, and the more load that is shed, the larger the penalty. The accumulated sum of all penalty terms may be used as the reward function. That is, correct actions incur small penalties and large rewards, while wrong actions incur large penalties and small rewards. The next sample environment state is the new environment state returned by the environment after the action is applied, and the current-round end flag indicates whether the current round ends after the action is applied to the environment.
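The sketch below shows one way to represent the quintuple e_t and a simple experience replay pool, together with an illustrative grid-style reward that accumulates voltage-deviation and load-shedding penalties. The weights, the reference voltage, and all names are assumptions for illustration rather than the patent's reward design.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayPool:
    """Experience replay pool holding quintuples e_t = (s_t, a_t, r_t, s_{t+1}, flag_t)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def grid_reward(bus_voltages, shed_loads, v_ref=1.0, w_v=1.0, w_p=0.1):
    """Negative accumulated penalties: voltage deviation from v_ref plus shed load."""
    voltage_penalty = sum(abs(v - v_ref) for v in bus_voltages)
    shedding_penalty = sum(shed_loads)
    return -(w_v * voltage_penalty + w_p * shedding_penalty)
```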
The behavior cloning model is used to predict the best expert action in the action space based on the sample environment state, and is trained on the teaching data set. Here, a behavioral cloning (Behavioral Cloning, BC) model is constructed from the teaching data set to mine the expert experience contained in the historical teaching data, and the dominance estimation model is then trained using the trained behavior cloning model together with the teaching data set. An expert loss is generated by comparing the expert actions output by the behavior cloning model with the dominant actions output by the dominance estimation model, so as to reduce the gap between them, optimize the training of the dominance estimation model, and improve the accuracy of dominance estimation in complex decision scenes. It should be noted that the dominance estimation method provided by the embodiment of the invention is general and can be applied to different complex decision scenes, including but not limited to video games, traffic regulation, power grid regulation, and the like.
According to the method provided by the embodiment of the invention, the dominant estimation model is trained based on the teaching data set and the behavior cloning model, the self-adaptive behavior cloning model is used for fully utilizing the teaching data and automatically mining expert experience in the historical teaching data, so that adverse effects possibly caused by imperfect teaching data are avoided, the dominant estimation performance of the dominant estimation model is enhanced, the dominant estimation accuracy in a complex scene is improved, and the decision performance of the model is improved.
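Since the description above notes that the dominance estimation model may be built on a Dueling Double DQN, the following PyTorch sketch shows one possible dueling head with a state-value branch V(s) and an advantage branch A(s, ·). The class name, layer sizes, and activation slope are illustrative assumptions, not the patent's architecture.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling-style network: shared trunk, value head V(s), advantage head A(s, ·)."""
    def __init__(self, state_dim, num_actions, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LeakyReLU(0.01),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.01),
        )
        self.value_head = nn.Linear(hidden, 1)                 # V(s)
        self.advantage_head = nn.Linear(hidden, num_actions)   # A(s, ·)

    def forward(self, state):
        h = self.trunk(state)
        value = self.value_head(h)
        advantage = self.advantage_head(h)
        # standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        q_values = value + advantage - advantage.mean(dim=-1, keepdim=True)
        return q_values, advantage
```

With such a network, the dominant action of step 120 is the argmax of the advantage head, while the Q-values feed the temporal-difference losses described later.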
Based on any of the above embodiments, the dominance estimation model is trained based on the following steps:
training to obtain a behavior cloning network based on the teaching data set;
pre-training a dominance estimation model based on the teaching dataset;
the dominance estimation model is trained based on the teaching data set and on expert actions determined by the behavior cloning network based on the sample environment state; meanwhile, the teaching data set is dynamically updated and the behavior cloning network is fine-tuned.
Specifically, from the teaching data set, the expert experience of the corresponding decision scene can be learned, so that a behavior cloning network with a certain initial decision capability is obtained through training. Meanwhile, the teaching data set can be put into an experience replay pool, data are randomly sampled from it, and the dominance estimation model is pre-trained to obtain a dominance estimation model with a certain initial decision capability.
Subsequently, the dominance estimation model performs autonomous learning. Data are randomly sampled from the experience replay pool, the action corresponding to the maximum value in the discrete action probability vector output by the behavior cloning network is taken as the expert action, and the network parameters of the dominance estimation model are updated according to the back-propagation and gradient-descent algorithms. The updated dominance estimation model has better dominance estimation performance than the pre-trained one. During training, the teaching data set can be dynamically updated and the behavior cloning model periodically fine-tuned; by introducing this automatic update mechanism for the teaching data, the teaching data set comes to contain more high-quality trajectory samples, which avoids the adverse effects that imperfect teaching data may cause and enhances the robustness of the behavior cloning model.
According to the method provided by the embodiment of the invention, the dominance estimation model is first pre-trained and then further trained based on the teaching data set and on the expert actions determined by the behavior cloning network from the sample environment state, which improves its dominance estimation performance; meanwhile, dynamically updating the teaching data set and fine-tuning the behavior cloning network enhances the robustness of the behavior cloning model and further improves the training effect of the dominance estimation model.
Based on any of the above embodiments, dynamically updating the teaching data set specifically includes:
and interacting with the real application environment based on the dominance estimation model, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into a teaching data set.
Specifically, the dominance estimation model interacts with the real application environment: the environment state is input, the dominance estimation network outputs a dominance estimation vector, and the behavior cloning network outputs a discrete action probability vector. The dominance estimation vector and the discrete action probability vector respectively contain the scores of the candidate actions determined by the dominance estimation network and by the behavior cloning network according to the environment state. The candidate action corresponding to the maximum value in the dominance estimation vector is selected as the current optimal decision and applied to the real environment to obtain the feedback information of the real application environment, thereby forming a new <environment state, action, reward, next environment state, current-round end flag> quintuple of teaching data, which is put into the experience replay pool to realize the dynamic update of the teaching data set.
Based on any of the above embodiments, determining new teaching data based on feedback information of a real application environment, and updating the new teaching data into a teaching data set, specifically including:
after the current round is finished, calculating a reward value of the current round;
if the current round of rewarding value is higher than the preset rewarding value, determining new teaching data based on feedback information of a real application environment in the current round and state information input in the current round and output dominant actions of the dominant estimation model, and updating the new teaching data into a teaching data set.
Specifically, if the current round ends after the current optimal decision output by the dominance estimation model is applied to the real environment, the reward value of the current round is calculated. If the reward value of the current round is higher than the preset reward value, the round corresponds to a successful running trajectory, which can be added to the teaching data set. Concretely, new teaching data can be determined based on the feedback information of the real application environment in the current round and on the dominant actions of the dominance estimation model in the current round, and the new teaching data can be updated into the teaching data set.
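A compact sketch of this update rule is shown below; `ReplayPool` is the illustrative pool sketched earlier, and the threshold is an assumed hyperparameter.

```python
def maybe_add_round_to_teaching_data(round_transitions, teaching_data, replay_pool,
                                     reward_threshold):
    """If the finished round's total reward exceeds the threshold, add its
    transitions to the teaching data set and to the experience replay pool."""
    round_reward = sum(t.reward for t in round_transitions)
    if round_reward > reward_threshold:
        for t in round_transitions:
            teaching_data.append(t)
            replay_pool.add(t.state, t.action, t.reward, t.next_state, t.done)
        return True   # the teaching data set was updated
    return False
```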
Based on any of the above embodiments, the fine tuning behavior clone network specifically includes:
and carrying out fine tuning on the behavior cloning network based on the updated teaching data set every time the preset number of teaching data sets are updated.
Specifically, the period of fine tuning of the behavior clone network may be preset, for example, K, and then, every K times the teaching data set is updated, fine tuning may be performed on the behavior clone network based on the updated teaching data set, so as to improve the robustness of the behavior clone network.
Based on any of the above embodiments, training to obtain a behavioral clone network based on a teaching dataset specifically includes:
determining a plurality of candidate clone networks of different network structures and network parameters;
based on the teaching data set, taking the sample environment state as input and the sample action as the label, and training each candidate clone network according to the back-propagation and gradient-descent algorithms;
respectively interacting each candidate clone network with a real environment, and calculating round rewards total score of each round corresponding to each candidate clone network;
and selecting the candidate clone network with the highest round rewards total score as a trained behavior clone network.
Specifically, a plurality of candidate clone networks with different network structures and network parameters are predetermined. The network structure of a candidate clone network can be determined according to the teaching data set, so that the candidate clone network matches the teaching data set and can mine the teaching data well; it may be, for example, a fully-connected network or a long short-term memory network. The activation function of the candidate clone network may be LeakyReLU, i.e.:
y=max(0,x)+α*min(0,x)
where α is a small positive number.
Each candidate clone network is trained with the teaching data set, taking the sample environment state as input and the sample action as the label, and applying a cross-entropy loss to the output of the network model:
L_BC = - E_{(s, a_E) ~ demo} [ log p(a_E | s) ]
where p(· | s) is the discrete action probability vector output by the candidate clone network for state s, a denotes the action corresponding to its maximum value, and a_E is the sample action in the teaching data. The network parameters of each candidate clone network are updated according to the back-propagation and gradient-descent algorithms, establishing a mapping f: s → a from states to actions. The trained candidate clone network can then generate a virtual expert action from the input environment state.
Then, each candidate clone network is made to interact with the real environment: the environment state is input, each candidate clone network outputs a discrete action probability vector, and the action corresponding to the maximum value in that vector is selected as the virtual expert action and applied to the real environment. After each action is executed, a single-step reward is obtained, and all single-step rewards of each round are summed to obtain the total round reward. The candidate clone network with the highest total round reward is then selected as the trained behavior cloning network.
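The candidate-network training and selection described above can be sketched as follows; the MLP structure, LeakyReLU slope, optimizer settings, and the `evaluate_round_reward` callable are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

def make_candidate(state_dim, num_actions, hidden):
    """One candidate clone network: an MLP with LeakyReLU activations."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.LeakyReLU(0.01),
        nn.Linear(hidden, hidden), nn.LeakyReLU(0.01),
        nn.Linear(hidden, num_actions),   # logits over the discrete action space
    )

def train_candidate(net, states, expert_actions, epochs=50, lr=1e-3):
    """Supervised training: sample environment state as input, sample action as label."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = ce(net(states), expert_actions)   # cross-entropy against a_E
        loss.backward()                          # back-propagation
        opt.step()                               # gradient descent
    return net

def select_best_candidate(candidates, evaluate_round_reward):
    """Pick the candidate with the highest total round reward in the real environment;
    evaluate_round_reward(net) must roll the net out and return that total."""
    return max(candidates, key=evaluate_round_reward)
```

For example, candidates differing in hidden width could be created with `[make_candidate(state_dim, num_actions, h) for h in (128, 256, 512)]`, each trained on the teaching data, and the best one kept by `select_best_candidate`.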
Based on any of the above embodiments, the loss function of the dominance estimation model includes supervised loss, single step time differential loss, and multi-step time differential loss;
wherein the supervised penalty is determined based on the difference between the dominant estimation vector output by the dominant estimation model and the corresponding expert action or sample action; the expert action is determined by the behavior cloning network according to the sample environment state, and the sample action is acquired from the teaching data set.
Specifically, a hybrid loss function is defined, including supervised and unsupervised losses, where the supervised loss is some distance measure between the teaching action and the dominance estimation vector, including but not limited to: cross-entropy loss, MSE loss, KL-divergence loss, JS-divergence loss, Wasserstein distance, and the like. The teaching action is either the expert action output by the behavior cloning network or a sample action in the teaching data set. The unsupervised losses are the single-step temporal-difference loss TD(1) and the multi-step temporal-difference loss TD(n). The individual losses can be calculated by the following formulas:
L_DQ(Q) = ( r(s, a) + γ·Q(s_{t+1}, a_{t+1}^{max}; θ') − Q(s, a; θ) )^2
L_n(Q) = ( Σ_{i=0}^{n−1} γ^i·r_{t+i} + γ^n·Q(s_{t+n}, a_{t+n}^{max}; θ') − Q(s_t, a_t; θ) )^2
L_E(Q) = E_{s ∼ demo} [ D_JS( adva(s) ∥ π_bc(s) ) ]
L_u(Q) = L_DQ(Q) + λ_1·L_n(Q) + λ_2·L_E(Q)
where L_DQ(Q) is the single-step temporal-difference loss, L_n(Q) is the multi-step temporal-difference loss, and L_E(Q) is the supervised loss (exemplified here by the JS divergence); r(s, a) is the reward function, s is the state at the current time, a is the action at the current time, γ is the discount factor, s_{t+1} is the state to which the system transitions at the next moment after the current action is performed, a_{t+1}^{max} = argmax_a Q(s_{t+1}, a; θ) is the optimal action under the Double DQN algorithm, θ and θ' are the parameters of the Q network and of the target Q network respectively, r_{t+i} is the reward fed back by the system i steps after the current moment t, adva is the vector obtained by normalizing the advantage estimation vector output by the advantage branch of the Dueling DDQN, defined as adva = softmax(A(s, a)), where A(s, a) is the advantage estimation vector, demo denotes the teaching data, π_bc(s) denotes the action distribution of the behavior cloning network policy in state s, and λ_1 and λ_2 are the weights of the corresponding losses.
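A sketch of how the hybrid loss L_u(Q) can be computed from a sampled batch is given below, assuming the dominance estimation network returns both Q-values and the advantage vector (as in the dueling sketch above) and that the behavior cloning network returns action probabilities. The helper names, tensor layout, and the use of a mean-squared form for the TD terms are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two batches of probability vectors."""
    p, q = p.clamp_min(eps), q.clamp_min(eps)
    m = 0.5 * (p + q)
    kl_pm = (p * (p.log() - m.log())).sum(dim=-1)
    kl_qm = (q * (q.log() - m.log())).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)

def hybrid_loss(q_net, target_net, bc_net, batch, n_step_batch, gamma, n, lam1, lam2):
    """L_u = L_DQ + lam1 * L_n + lam2 * L_E, following the formulas above.
    q_net / target_net return (q_values, advantages); bc_net returns probabilities.
    `a` is a LongTensor of action indices; `done` flags are float 0/1 tensors."""
    s, a, r, s1, done = batch            # single-step transitions
    s_n, r_n, done_n = n_step_batch      # state after n steps, discounted n-step return

    q, adv = q_net(s)
    q_sa = q.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Double DQN targets: argmax from the online net, value from the target net
        a1_max = q_net(s1)[0].argmax(dim=1, keepdim=True)
        q1 = target_net(s1)[0].gather(1, a1_max).squeeze(1)
        td1_target = r + gamma * (1.0 - done) * q1

        an_max = q_net(s_n)[0].argmax(dim=1, keepdim=True)
        qn = target_net(s_n)[0].gather(1, an_max).squeeze(1)
        tdn_target = r_n + (gamma ** n) * (1.0 - done_n) * qn

    loss_dq = F.mse_loss(q_sa, td1_target)           # L_DQ(Q)
    loss_n = F.mse_loss(q_sa, tdn_target)            # L_n(Q)
    adva = F.softmax(adv, dim=-1)                    # adva = softmax(A(s, ·))
    loss_e = js_divergence(adva, bc_net(s)).mean()   # L_E(Q), JS-divergence form
    return loss_dq + lam1 * loss_n + lam2 * loss_e
```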
Based on any of the foregoing embodiments, fig. 2 is a flow chart of a method for training a dominant estimation model according to an embodiment of the present invention, as shown in fig. 2, where the method includes:
collecting an initial teaching data set according to operation records of human experts or other methods and systems in a complex decision problem scene;
pre-training and verifying a behavior clone network by using a teaching data set to obtain a behavior clone network structure with a certain initial decision-making capability and corresponding parameters;
applying a teaching data set, sampling from the teaching data, and pre-training the Dueling DDQN network to obtain an advantage estimation model with a certain initial decision capability;
the dominance estimation model then performs autonomous learning: it interacts with the environment, the behavior cloning model continuously provides the expert action for the current state, an expert loss is generated, and the Dueling DDQN network is trained with the hybrid loss function provided by this embodiment. Fig. 3 is a schematic diagram of the loss function calculation provided in an embodiment of the present invention; as shown in Fig. 3, the hybrid loss function includes the supervised loss, the single-step temporal-difference loss TD(1), and the multi-step temporal-difference loss TD(n). In the figure, V(s), Q(s, a), and A(s, a) respectively represent the state value function, the state-action value function, and the advantage function obtained in the Dueling DDQN method, and argmax_a(A_BC(s, a)) represents the expert action output by the behavior cloning network. If the current round ends and obtains a sufficiently good reward, the generated data are added to the teaching data set and the behavior cloning model is fine-tuned. The above operations are repeated until the termination condition is satisfied, and training then ends.
Based on any of the above embodiments, fig. 4 is a schematic structural diagram of an advantage estimation device according to an embodiment of the present invention, and as shown in fig. 4, the device includes a state obtaining unit 410 and an advantage estimation unit 420.
Wherein, the state acquisition unit 410 is configured to acquire a current environmental state;
the dominant estimation unit 420 is configured to input a current environmental state into a dominant estimation model, obtain a dominant vector obtained by performing dominant estimation on the dominant estimation model based on the current environmental state, and determine an action corresponding to a maximum value in the dominant vector as a dominant action;
the advantage estimation model is obtained by training expert actions determined based on a sample environment state based on a teaching data set and the behavior cloning model;
the teaching data set comprises a sample environment state and a corresponding sample action, and the behavior cloning model is trained based on the teaching data set.
According to the device provided by the embodiment of the invention, the dominant estimation model is trained based on the teaching data set and the behavior cloning model, the teaching data can be fully utilized through the self-adaptive behavior cloning model, expert experience in the historical teaching data is automatically mined, adverse effects possibly caused by imperfect teaching data are avoided, the dominant estimation performance of the dominant estimation model is enhanced, and the dominant estimation accuracy under a complex scene is improved.
Based on any of the above embodiments, the dominance estimation model is trained based on the following steps:
training to obtain a behavior cloning network based on the teaching data set;
pre-training a dominance estimation model based on the teaching dataset;
the dominance estimation model is trained based on the teaching data set and on expert actions determined by the behavior cloning network based on the sample environment state; meanwhile, the teaching data set is dynamically updated and the behavior cloning network is fine-tuned.
According to the device provided by the embodiment of the invention, the dominance estimation model is first pre-trained and then further trained based on the teaching data set and on the expert actions determined by the behavior cloning network from the sample environment state, which improves its dominance estimation performance; meanwhile, dynamically updating the teaching data set and fine-tuning the behavior cloning network enhances the robustness of the behavior cloning model and further improves the training effect of the dominance estimation model.
Based on any of the above embodiments, dynamically updating the teaching data set specifically includes:
and interacting with the real application environment based on the dominance estimation model, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into a teaching data set.
Based on any of the above embodiments, determining new teaching data based on feedback information of a real application environment, and updating the new teaching data into a teaching data set, specifically including:
after the current round is finished, calculating a reward value of the current round;
if the current round of rewarding value is higher than the preset rewarding value, determining new teaching data based on feedback information of a real application environment in the current round and state information input in the current round and output dominant actions of the dominant estimation model, and updating the new teaching data into a teaching data set.
Based on any of the above embodiments, the fine tuning behavior clone network specifically includes:
and carrying out fine tuning on the behavior cloning network based on the updated teaching data set every time the preset number of teaching data sets are updated.
Based on any of the above embodiments, training to obtain a behavioral clone network based on a teaching dataset specifically includes:
determining a plurality of candidate clone networks of different network structures and network parameters;
based on the teaching data set, taking the sample environment state as input and the sample action as the label, and training each candidate clone network according to the back-propagation and gradient-descent algorithms;
respectively interacting each candidate clone network with a real environment, and calculating round rewards total score of each round corresponding to each candidate clone network;
and selecting the candidate clone network with the highest round rewards total score as a trained behavior clone network.
Based on any of the above embodiments, the loss function of the dominance estimation model includes supervised loss, single step time differential loss, and multi-step time differential loss;
wherein the supervised penalty is determined based on the difference between the dominant estimation vector output by the dominant estimation model and the corresponding expert action or sample action; the expert action is determined by the behavior cloning network according to the sample environment state, and the sample action is acquired from the teaching data set.
Fig. 5 illustrates a physical schematic diagram of an electronic device, as shown in fig. 5, which may include: processor 510, communication interface (Communications Interface) 520, memory 530, and communication bus 540, wherein processor 510, communication interface 520, memory 530 complete communication with each other through communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a dominance estimation method comprising: acquiring a current environment state; inputting the current environment state into a dominant estimation model, obtaining a dominant vector obtained by performing dominant estimation on the basis of the current environment state by the dominant estimation model, and determining an action corresponding to the maximum value in the dominant vector as a dominant action; the advantage estimation model is obtained based on a teaching data set and a behavior cloning model; the teaching data set comprises a sample environment state and a corresponding sample action, and the behavior cloning model is trained based on the teaching data set.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the method of estimating the advantages provided by the methods described above, the method comprising: acquiring a current environment state; inputting the current environment state into a dominant estimation model, obtaining a dominant vector obtained by performing dominant estimation on the basis of the current environment state by the dominant estimation model, and determining an action corresponding to the maximum value in the dominant vector as a dominant action; the advantage estimation model is obtained based on a teaching data set and a behavior cloning model; the teaching data set comprises a sample environment state and a corresponding sample action, and the behavior cloning model is trained based on the teaching data set.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the above provided advantage estimation methods, the method comprising: acquiring a current environment state; inputting the current environment state into a dominant estimation model, obtaining a dominant vector obtained by performing dominant estimation on the basis of the current environment state by the dominant estimation model, and determining an action corresponding to the maximum value in the dominant vector as a dominant action; the advantage estimation model is obtained based on a teaching data set and a behavior cloning model; the teaching data set comprises a sample environment state and a corresponding sample action, and the behavior cloning model is trained based on the teaching data set.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A method of dominance estimation comprising:
acquiring a current environment state in an emergency regulation scene of a power grid; the current environment state refers to the environment state of the current decision scene;
the environment state comprises an RGB channel matrix of an environment image, and vectors or tensors formed by the values of different characteristic variables; let the voltage of the i-th bus at time t and the voltage of the corresponding low-voltage side be v_t^i and u_t^i, and let the load on the bus be p_t^i; the state of the network at the current moment is then indicated as O_t = [v_t^1, u_t^1, p_t^1, ..., v_t^K, u_t^K, p_t^K]; by stacking the states of the past N time steps, which describe the dynamic operating state of the network at time t, the environment state is formed as s_t = [O_{t-N+1}, O_{t-N+2}, ..., O_t];
Inputting the current environment state into a dominant estimation model, obtaining a dominant vector obtained by performing dominant estimation on the basis of the current environment state by the dominant estimation model, and determining an action corresponding to the maximum value in the dominant vector as a dominant action;
the advantage estimation model is obtained based on a teaching data set and a behavior cloning model;
the teaching data set comprises a sample environment state and a corresponding sample action, and the behavior cloning model is obtained by training based on the teaching data set;
the loss function of the dominance estimation model comprises supervised loss, single-step time differential loss and multi-step time differential loss;
the equation for the supervised loss is as follows:
L_E(Q) = E_{s ∼ demo} [ D_JS( adva(s) ∥ π_bc(s) ) ]
wherein adva is the vector obtained by normalizing the advantage estimation vector output by the advantage branch of the Dueling DDQN, defined as adva = softmax(A(s, a)), A(s, a) is the advantage estimation vector, demo represents the teaching data, π_bc(s) represents the action distribution of the behavior cloning network policy in state s, and a_E denotes a sample action in the teaching data;
wherein the supervised penalty is determined based on differences between the dominant estimation vectors output by the dominant estimation model and corresponding expert or sample actions; wherein the expert action is determined by the behavior cloning network according to a sample environment state, and the sample action is acquired from the teaching data set;
the dominance estimation model is trained based on the following steps:
training to obtain a behavior cloning network based on the teaching data set;
pre-training a dominance estimation model based on the teaching dataset;
training the dominance estimation model based on the teaching data set and on expert actions determined by the behavior cloning network based on the sample environment state, and dynamically updating the teaching data set and fine-tuning the behavior cloning network;
training to obtain a behavior clone network based on the teaching data set specifically comprises the following steps:
determining a plurality of candidate clone networks of different network structures and network parameters;
based on the teaching data set, taking a sample environment state as input, taking a sample action as a label, and training each candidate clone network according to a back propagation and gradient descent algorithm;
respectively interacting each candidate clone network with a real environment, and calculating round rewards total score of each round corresponding to each candidate clone network;
and selecting the candidate clone network with the highest round rewards total score as a trained behavior clone network.
2. The dominance estimation method of claim 1, wherein the dynamically updating the set of teaching data comprises:
and interacting with the real application environment based on the dominance estimation model, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into the teaching data set.
3. The dominance estimation method according to claim 2, wherein the determining new teaching data based on feedback information of the real application environment and updating the new teaching data into the teaching data set specifically comprises:
after the current round is finished, calculating a reward value of the current round;
if the current round of rewarding value is higher than the preset rewarding value, determining new teaching data based on feedback information of a real application environment in the current round and state information input and dominant actions output by the dominant estimation model in the current round, and updating the new teaching data into the teaching data set.
4. The dominance estimation method of claim 1, wherein the fine-tuning the behavioral clone network specifically comprises:
and carrying out fine tuning on the behavior cloning network based on the updated teaching data set every time the preset number of times of updating the teaching data set.
5. An advantage estimating apparatus, characterized by comprising:
the state acquisition unit is applied to the emergency regulation and control scene of the power grid and is used for acquiring the current environment state; the current environment state refers to the environment state of the current decision scene;
the environment state comprises an RGB channel matrix of an environment image, and vectors or tensors formed by the values of different characteristic variables; let the voltage of the i-th bus at time t and the voltage of the corresponding low-voltage side be v_t^i and u_t^i, and let the load on the bus be p_t^i; the state of the network at the current moment is then indicated as O_t = [v_t^1, u_t^1, p_t^1, ..., v_t^K, u_t^K, p_t^K]; by stacking the states of the past N time steps, which describe the dynamic operating state of the network at time t, the environment state is formed as s_t = [O_{t-N+1}, O_{t-N+2}, ..., O_t];
The dominant estimation unit is used for inputting the current environment state into a dominant estimation model, obtaining a dominant vector obtained by performing dominant estimation on the dominant estimation model based on the current environment state, and determining an action corresponding to the maximum value in the dominant vector as a dominant action;
the advantage estimation model is obtained by training expert actions determined based on a sample environment state and based on a teaching data set;
the teaching data set comprises a sample environment state and a corresponding sample action, and the behavior cloning model is obtained by training based on the teaching data set;
the loss function of the dominance estimation model comprises supervised loss, single-step time differential loss and multi-step time differential loss;
the equation for the supervised loss is as follows:
L_E(Q) = E_{s ∼ demo} [ D_JS( adva(s) ∥ π_bc(s) ) ]
wherein adva is the vector obtained by normalizing the advantage estimation vector output by the advantage branch of the Dueling DDQN, defined as adva = softmax(A(s, a)), A(s, a) is the advantage estimation vector, demo represents the teaching data, π_bc(s) represents the action distribution of the behavior cloning network policy in state s, and a_E denotes a sample action in the teaching data;
wherein the supervised penalty is determined based on differences between the dominant estimation vectors output by the dominant estimation model and corresponding expert or sample actions; wherein the expert action is determined by the behavior cloning network according to a sample environment state, and the sample action is acquired from the teaching data set;
the dominance estimation model is trained based on the following steps:
training to obtain a behavior cloning network based on the teaching data set;
pre-training a dominance estimation model based on the teaching dataset;
training the dominance estimation model based on the teaching data set and on expert actions determined by the behavior cloning network based on the sample environment state, and dynamically updating the teaching data set and fine-tuning the behavior cloning network;
training to obtain a behavior clone network based on the teaching data set specifically comprises the following steps:
determining a plurality of candidate clone networks of different network structures and network parameters;
based on the teaching data set, taking a sample environment state as input, taking a sample action as a label, and training each candidate clone network according to a back propagation and gradient descent algorithm;
respectively interacting each candidate clone network with a real environment, and calculating round rewards total score of each round corresponding to each candidate clone network;
and selecting the candidate clone network with the highest round rewards total score as a trained behavior clone network.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the dominance estimation method according to any of claims 1 to 4 when the program is executed by the processor.
7. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the dominance estimation method according to any of claims 1 to 4.
CN202110540754.6A 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium Active CN113240118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110540754.6A CN113240118B (en) 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110540754.6A CN113240118B (en) 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
CN113240118A CN113240118A (en) 2021-08-10
CN113240118B true CN113240118B (en) 2023-05-09

Family

ID=77135047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110540754.6A Active CN113240118B (en) 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN113240118B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118563A (en) * 2021-11-23 2022-03-01 中国电子科技集团公司第三十研究所 Self-iteration situation prediction method and system based on data middleboxes

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858630A (en) * 2019-02-01 2019-06-07 清华大学 Method and apparatus for intensified learning
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11074480B2 (en) * 2019-01-31 2021-07-27 StradVision, Inc. Learning method and learning device for supporting reinforcement learning by using human driving data as training data to thereby perform personalized path planning
CN110428615B (en) * 2019-07-12 2021-06-22 中国科学院自动化研究所 Single intersection traffic signal control method, system and device based on deep reinforcement learning
CN111401556B (en) * 2020-04-22 2023-06-30 清华大学深圳国际研究生院 Selection method of countermeasure type imitation learning winning function
CN111291890B (en) * 2020-05-13 2021-01-01 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN112396180B (en) * 2020-11-25 2021-06-29 中国科学院自动化研究所 Deep Q learning network optimization method based on dynamic teaching data and behavior cloning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858630A (en) * 2019-02-01 2019-06-07 清华大学 Method and apparatus for intensified learning
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning

Also Published As

Publication number Publication date
CN113240118A (en) 2021-08-10

Similar Documents

Publication Publication Date Title
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN113255936B (en) Deep reinforcement learning strategy protection defense method and device based on imitation learning and attention mechanism
CN110991027A (en) Robot simulation learning method based on virtual scene training
CN112884131A (en) Deep reinforcement learning strategy optimization defense method and device based on simulation learning
CN113561986B (en) Automatic driving automobile decision making method and device
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN112396180B (en) Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
CN113627596A (en) Multi-agent confrontation method and system based on dynamic graph neural network
CN114881228A (en) Average SAC deep reinforcement learning method and system based on Q learning
CN113240118B (en) Dominance estimation method, dominance estimation device, electronic device, and storage medium
JP2020166795A (en) Reinforced learning method, reinforced learning device, and reinforced learning program for efficient learning
CN113947022B (en) Near-end strategy optimization method based on model
CN112613608A (en) Reinforced learning method and related device
CN115457240A (en) Image object driving navigation method, device, equipment and storage medium
CN114154397B (en) Implicit opponent modeling method based on deep reinforcement learning
CN115409158A (en) Robot behavior decision method and device based on layered deep reinforcement learning model
Xing et al. Policy distillation with selective input gradient regularization for efficient interpretability
CN116992928A (en) Multi-agent reinforcement learning method for fair self-adaptive traffic signal control
CN117010482A (en) Strategy method based on double experience pool priority sampling and DuelingDQN implementation
CN115009291B (en) Automatic driving assistance decision making method and system based on network evolution replay buffer area
CN115660052A (en) Group intelligent learning method integrating postwitness ideas
Chen et al. Modified PPO-RND method for solving sparse reward problem in ViZDoom
CN114004282A (en) Method for extracting deep reinforcement learning emergency control strategy of power system
CN117933349B (en) Visual reinforcement learning method based on safety mutual simulation measurement
CN114118400B (en) Concentration network-based cluster countermeasure method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant