CN113240118A - Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium - Google Patents

Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium

Info

Publication number
CN113240118A
CN113240118A (application CN202110540754.6A; granted publication CN113240118B)
Authority
CN
China
Prior art keywords
teaching data
advantage
data set
estimation
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110540754.6A
Other languages
Chinese (zh)
Other versions
CN113240118B (en)
Inventor
李小双 (Li Xiaoshuang)
王晓 (Wang Xiao)
黄梓铭 (Huang Ziming)
王飞跃 (Wang Feiyue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110540754.6A priority Critical patent/CN113240118B/en
Publication of CN113240118A publication Critical patent/CN113240118A/en
Application granted granted Critical
Publication of CN113240118B publication Critical patent/CN113240118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S 10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S 10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an advantage estimation method, an advantage estimation apparatus, an electronic device, and a storage medium. The method comprises: acquiring a current environment state; and inputting the current environment state into an advantage estimation model to obtain an advantage action determined by the advantage estimation model through advantage estimation based on the current environment state. The advantage estimation model is obtained based on a teaching data set and a behavior clone model; the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is trained on the teaching data set. Because the advantage estimation model is trained on both the teaching data set and the behavior clone model, the adaptive behavior clone model makes full use of the teaching data and automatically mines the expert experience contained in historical teaching data, avoids the adverse effects that imperfect teaching data may bring, strengthens the advantage estimation performance of the advantage estimation model, and improves the accuracy of advantage estimation in complex scenes.

Description

Superiority estimation method, superiority estimation apparatus, electronic device, and storage medium
Technical Field
The present invention relates to the field of reinforcement learning technologies, and in particular, to a method and an apparatus for advantage estimation, an electronic device, and a storage medium.
Background
In recent years, Deep Reinforcement Learning (DRL) has made great progress and is widely applied in decision scenarios such as video games and board and card games. With the powerful feature-extraction and function-fitting capabilities of deep learning, a reinforcement learning agent can extract and learn feature knowledge directly from raw input data (such as game images) and then learn a decision control strategy with a conventional reinforcement learning algorithm based on the extracted feature information, without manually extracting or engineering features on the basis of rules and heuristics.
However, deep reinforcement learning still cannot be practically deployed for solving complex decision control problems in real environments (such as automatic driving). Due to the diversity and uncertainty of complex systems, existing simulation environments can hardly be kept consistent with the real world, and improving the precision of a simulation system is costly. Therefore, how to adapt to complex real-world scenes has become one of the most urgent problems in applying DRL models to complex decision tasks.
For decision problems in complex scenes, human experts have great advantages in learning efficiency and decision performance, so incorporating expert knowledge into the DRL model is a potential solution. The DQfD (Deep Q-learning from Demonstrations) method learns from teaching data to obtain the strategy represented by that data, thereby guiding and helping the agent to learn expert knowledge, and then performs autonomous learning on that basis, improving the decision-making capability of the model.
However, the DQfD model has the following problems: (1) in the DQfD learning process, the trajectory data in the historical teaching data set are used only for pre-training, and the teaching data provide no effective guidance for the trajectories generated autonomously by the model; (2) the teaching data set is very limited and cannot cover a large enough state-action space; moreover, it is difficult to collect enough teaching data in some practical applications, for example, extreme cases occur rarely in reality, and the vast majority of samples come from normal situations; (3) the DQfD algorithm ignores the imperfection of historical teaching data that is ubiquitous in real applications, and this imperfection can negatively affect the improvement of model performance. In addition, although methods based on DQN (Deep Q-Network) can achieve good results, they suffer from overestimation of the Q value.
Disclosure of Invention
The invention provides an advantage estimation method, an advantage estimation device, an electronic device and a storage medium, which are used to overcome the defect of the prior art that automatic decision making performs poorly in complex scenes.
The invention provides an advantage estimation method, which comprises the following steps:
acquiring a current environment state;
inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model through advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained based on a teaching data set and a behavior clone model;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
According to the superiority estimation method provided by the invention, the superiority estimation model is trained based on the following steps:
training to obtain a behavior clone network based on the teaching data set;
pre-training an advantage estimation model based on the teaching data set;
training the advantage estimation model based on the teaching data set and expert actions determined by the behavior clone network based on the sample environment state, and dynamically updating the teaching data set and fine-tuning the behavior clone network.
According to an advantage estimation method provided by the invention, the dynamically updating the teaching data set specifically comprises:
interacting with a real application environment based on the superiority estimation model, determining new teaching data based on feedback information of the real application environment, and updating and adding the new teaching data into the teaching data set.
According to an advantage estimation method provided by the present invention, the determining new teaching data based on feedback information of a real application environment, and updating the new teaching data into the teaching data set, specifically includes:
after the current round is finished, calculating the reward value of the current round;
and if the reward value of the current round is higher than the preset reward, determining new teaching data based on the feedback information of the real application environment in the current round and on the state information input to, and the advantage actions output by, the advantage estimation model in the current round, and updating the new teaching data into the teaching data set.
According to an advantage estimation method provided by the invention, the fine tuning of the behavioral clone network specifically comprises the following steps:
and fine-tuning the behavior clone network based on the updated teaching data set every time the teaching data set is updated for a preset number of times.
According to an advantage estimation method provided by the invention, the training is performed to obtain a behavior clone network based on the teaching data set, and the method specifically comprises the following steps:
determining a plurality of candidate cloned networks of different network structures and network parameters;
based on the teaching data set, taking the environmental state of the sample as input, taking the action of the sample as a label, and training each candidate clone network according to a back propagation and gradient descent algorithm;
interacting each candidate clone network with the real environment respectively, and calculating the total round reward obtained by each candidate clone network;
and selecting the candidate clone network with the highest total round reward as the trained behavior clone network.
According to the advantage estimation method provided by the invention, the loss function of the advantage estimation model comprises a supervised loss, a single-step time difference loss and a multi-step time difference loss;
wherein the supervised loss is determined based on a difference between a dominance estimation vector output by the dominance estimation model and a corresponding expert or sample action; wherein the expert action is determined by the behavioral cloning network according to a sample environment state, and the sample action is acquired from the teaching data set.
The present invention also provides an advantage estimation apparatus, including:
the state acquisition unit is used for acquiring the current environment state;
the advantage estimation unit is used for inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model performing advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained by training based on a teaching data set and on expert actions determined by a behavior clone model based on the sample environment state;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the superiority estimation method as described in any one of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the merit estimation method as described in any of the above.
According to the advantage estimation method, the advantage estimation device, the electronic equipment and the storage medium, the advantage estimation model is trained on the basis of the teaching data set and the behavior clone model, teaching data can be fully utilized through the adaptive behavior clone model, expert experience in historical teaching data is automatically mined, adverse effects possibly brought by incomplete teaching data are avoided, the advantage estimation performance of the advantage estimation model is enhanced, the advantage estimation accuracy in a complex scene is improved, and therefore the decision performance of the model is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of the advantage estimation method provided by the present invention;
FIG. 2 is a schematic flow chart of a dominant estimation model training method provided by the present invention;
FIG. 3 is a schematic diagram of the loss function calculation provided by the present invention;
FIG. 4 is a schematic structural diagram of an advantage estimation apparatus provided in the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of an advantage estimation method provided by an embodiment of the present invention, as shown in fig. 1, the method includes:
step 110, obtaining the current environment state.
Specifically, the environment state of the current decision scene is first obtained. The environment state is a set of feature values capable of describing the current running state of the environment, including but not limited to the RGB channel matrix of an environment image, or a vector or tensor formed by the values of different feature variables. Taking an emergency regulation and control scene of a power grid as an example, suppose that the voltages of the i-th bus and of its corresponding low-voltage side at time t (denoted v_t^i for the bus) are recorded together with the load on the bus, and that these quantities form the observation O_t of the grid at the current moment. Stacking the observations of the past N time steps describes the dynamic operating state of the power grid at time t and forms the environment state s_t = [O_{t-N+1}, O_{t-N+2}, ..., O_t].
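As a minimal sketch of this state construction (assuming Python with NumPy; the per-bus quantities, the padding at the start of a round, and the value N = 4 are illustrative assumptions rather than details taken from the patent):

```python
import numpy as np
from collections import deque

N = 4  # number of stacked time steps (illustrative value)

def make_observation(bus_voltages, low_side_voltages, bus_loads):
    """Concatenate the per-bus quantities measured at one time step into the observation O_t."""
    return np.concatenate([bus_voltages, low_side_voltages, bus_loads]).astype(np.float32)

history = deque(maxlen=N)  # most recent N observations

def build_state(o_t):
    """Append the newest observation and stack the last N into s_t = [O_{t-N+1}, ..., O_t]."""
    history.append(o_t)
    while len(history) < N:               # pad at the start of a round by repeating O_t
        history.appendleft(o_t)
    return np.stack(list(history), axis=0)
```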
Step 120, inputting the current environment state into the advantage estimation model to obtain an advantage vector obtained by the advantage estimation model performing advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained by training based on a teaching data set and on expert actions determined by the behavior clone model based on the sample environment state;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set training.
Specifically, the current environment state is input into the advantage estimation model, which performs advantage estimation based on that state and selects the currently best advantage action from a number of candidate actions, so that control can be performed according to the advantage action. Here, a candidate action is a candidate for the execution action that the agent applies to the environment; the action space can be divided into a discrete action space and a continuous action space according to whether the candidate actions are discrete. Taking the emergency regulation and control scene of the power grid as an example, when the power grid has K buses and each bus can either shed 20% of its load or take no action, the dimension of the discrete action space is 2^K.
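A minimal sketch of this inference step, assuming PyTorch and a model whose forward pass returns the advantage vector A(s, ·) over the discrete candidate actions (the function name and tensor shapes are illustrative):

```python
import torch

@torch.no_grad()
def select_advantage_action(advantage_model, state):
    """Advantage estimation for the current state followed by argmax action selection."""
    s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)  # add a batch dimension
    advantage_vector = advantage_model(s)    # shape (1, num_actions): one score per candidate action
    return int(advantage_vector.argmax(dim=1).item())             # index of the advantage action
```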
The advantage estimation model is obtained by reinforcement learning based on the teaching data set and on the expert actions determined by the behavior clone model from the sample environment states. The advantage estimation model can be built on a Dueling Double Deep Q-Network (Dueling DDQN) model. The teaching data set comprises sample environment states and corresponding sample actions, and can be generated from the operation records of human experts or of other methods and systems in the complex decision problem scene. Here, the teaching data set may consist of data samples formed by <sample environment state, sample action, reward, next sample environment state, current-round-ended flag> quintuples:
e_t = (s_t, a_t, r_t, s_{t+1}, flag_t)
where s_t, a_t, r_t, s_{t+1} and flag_t respectively denote the sample environment state, the sample action, the reward, the next sample environment state, and the flag indicating whether the current round has ended.
The reward is the feedback from the system after an action is applied to the environment, and is determined by a reward function r_t = r(s_t, a_t). Taking the grid environment as an example, the reward function can be constructed from the difference between the bus voltage and its standard value and from the amount of load that is shed: the larger the deviation of the bus voltage from the standard value, the larger the penalty, and the more load is shed, the larger the penalty. The cumulative sum of all penalty terms can be used as the reward function, so that correct actions incur small penalties and large rewards, while wrong actions incur large penalties and small rewards. The next sample environment state is the new environment state returned by the environment after the action is applied, and the current-round-ended flag indicates whether the current round ends after the action is applied to the environment.
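A sketch of the teaching-data quintuple and of one possible grid reward of the penalty-sum form described above (the inputs are assumed to be NumPy arrays, and the penalty weights w_voltage and w_shed are illustrative assumptions):

```python
from collections import namedtuple

# e_t = (s_t, a_t, r_t, s_{t+1}, flag_t)
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

def grid_reward(bus_voltages, nominal_voltage, shed_load, w_voltage=1.0, w_shed=0.5):
    """Reward as the negated cumulative sum of penalty terms: voltage deviation and shed load."""
    voltage_penalty = w_voltage * float(abs(bus_voltages - nominal_voltage).sum())
    shed_penalty = w_shed * float(shed_load.sum())
    return -(voltage_penalty + shed_penalty)  # correct actions incur small penalties, i.e. larger rewards
```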
The behavioral clone model is used to predict the best expert action from the action space based on the sample environment state, and is trained on the teaching data set. Here, a Behavioral Cloning (BC) model is constructed with the teaching data set to mine the expert experience contained in historical teaching data, and the advantage estimation model is then trained using both the trained behavior clone model and the teaching data set. The expert action output by the behavior clone model is compared with the advantage action output by the advantage estimation model to produce an expert loss; reducing the difference between the two optimizes the training of the advantage estimation model and improves the accuracy of advantage estimation in complex decision scenes. It should be noted that the advantage estimation method provided by the embodiment of the invention is general and can be applied to different complex decision scenes, including but not limited to electronic games, traffic control, and power grid control.
According to the method provided by the embodiment of the invention, the advantage estimation model is trained on the basis of the teaching data set and the behavior clone model, the teaching data can be fully utilized through the self-adaptive behavior clone model, the expert experience in historical teaching data is automatically mined, the possible adverse effect brought by incomplete teaching data is avoided, the advantage estimation performance of the advantage estimation model is enhanced, the advantage estimation accuracy in a complex scene is improved, and the decision performance of the model is improved.
Based on any of the above embodiments, the superiority estimation model is trained based on the following steps:
training to obtain a behavior clone network based on the teaching data set;
pre-training an advantage estimation model based on a teaching data set;
and training the advantage estimation model based on the teaching data set and the expert actions determined by the behavior clone network based on the sample environment state, and dynamically updating the teaching data set and fine-tuning the behavior clone network.
Specifically, expert experience in the corresponding decision scene can be learned from the teaching data set, so that a behavior clone network with a certain initial decision capability is obtained through training. Meanwhile, the teaching data set can be placed into a common experience replay pool, data can be randomly sampled from it, and the advantage estimation model can be pre-trained to obtain an advantage estimation model with a certain initial decision capability.
Subsequently, the advantage estimation model performs autonomous learning. Data are randomly sampled from the experience replay pool, the action corresponding to the maximum value in the discrete action probability vector output by the behavior clone network is taken as the expert action, and the network parameters of the advantage estimation model are updated by back-propagation and gradient descent. The updated advantage estimation model has better advantage estimation performance than the pre-trained one. During training, the teaching data set is dynamically updated and the behavior clone model is periodically fine-tuned; by introducing this automatic teaching-data update mechanism, the teaching data set comes to contain more high-quality trajectory samples, which avoids the adverse effects that imperfect teaching data may bring and enhances the robustness of the behavior clone model.
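The three training stages described above can be sketched as the skeleton below; every helper used here (train_bc, ReplayPool, pretrain_dueling_ddqn, step_in_env, update_with_mixed_loss, maybe_update_teaching_data) is a placeholder for a step of the procedure, not an API defined by the patent:

```python
def train_advantage_model(teaching_data, env, pretrain_steps, train_steps):
    bc_net = train_bc(teaching_data)                           # stage 1: behavior clone network
    replay = ReplayPool(teaching_data)                         # common experience replay pool
    adv_model = pretrain_dueling_ddqn(replay, pretrain_steps)  # stage 2: pre-trained advantage model

    episode, episode_reward = [], 0.0
    for _ in range(train_steps):                               # stage 3: autonomous learning
        transition = step_in_env(adv_model, env)               # interact with the real environment
        replay.add(transition)
        episode.append(transition)
        episode_reward += transition.reward

        batch = replay.sample()
        expert_actions = bc_net(batch.states).argmax(dim=1)    # expert action from the clone network
        update_with_mixed_loss(adv_model, batch, expert_actions)  # back-propagation + gradient descent

        if transition.done:                                    # the current round has ended
            maybe_update_teaching_data(teaching_data, bc_net, episode, episode_reward)  # sketched below
            episode, episode_reward = [], 0.0
    return adv_model
```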
According to the method provided by the embodiment of the invention, the advantage estimation model is first pre-trained and then further trained based on the teaching data set and the expert actions determined by the behavior clone network from the sample environment states, which improves its advantage estimation performance; meanwhile, dynamically updating the teaching data set and fine-tuning the behavior clone network enhances the robustness of the behavior clone model and further improves the training effect of the advantage estimation model.
Based on any of the above embodiments, dynamically updating the teaching data set specifically includes:
and interacting with the real application environment based on the advantage estimation model, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into a teaching data set.
Specifically, the advantage estimation model interacts with the real application environment: given an input environment state, the advantage estimation network outputs an advantage estimation vector and the behavior clone network outputs a discrete action probability vector, each containing the score of every candidate action determined from that environment state. The candidate action corresponding to the maximum value in the advantage estimation vector is selected as the current optimal decision and applied to the real environment, and the feedback information of the real application environment is obtained, forming a new <environment state, action, reward, next environment state, current-round-ended flag> quintuple of teaching data that is placed into the experience replay pool, thereby dynamically updating the teaching data set.
Based on any of the above embodiments, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into a teaching data set, specifically including:
after the current round is finished, calculating the reward value of the current round;
and if the reward value of the current round is higher than the preset reward, determining new teaching data based on the feedback information of the real application environment in the current round and on the state information input to, and the advantage actions output by, the advantage estimation model in the current round, and updating the new teaching data into the teaching data set.
Specifically, after the current optimal decision output by the advantage estimation model is applied to the real environment and the current round ends, the reward value of the round is calculated. If this reward value is higher than the preset reward, the current round corresponds to a successful operation trajectory that can be added to the teaching data set: new teaching data are determined from the feedback information of the real application environment in the round and from the advantage actions output by the advantage estimation model in the round, and are added to the teaching data set.
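Continuing the skeleton above, the dynamic teaching-data update and the periodic fine-tuning described in this and the following subsection might look as follows; reward_threshold, finetune_period and finetune_bc are illustrative placeholders:

```python
update_counter = {"count": 0}   # counts teaching-data updates between fine-tuning passes

def maybe_update_teaching_data(teaching_data, bc_net, episode, episode_reward,
                               reward_threshold=0.0, finetune_period=10):
    """Promote a successful round into the teaching set; fine-tune the clone network periodically."""
    if episode_reward > reward_threshold:                # round reward exceeds the preset reward
        teaching_data.extend(episode)                    # new <s, a, r, s', done> quintuples from this round
        update_counter["count"] += 1
        if update_counter["count"] % finetune_period == 0:   # every K updates of the teaching set
            finetune_bc(bc_net, teaching_data)           # fine-tune on the updated teaching data set
```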
Based on any of the above embodiments, the fine tuning behavior cloning network specifically includes:
and fine-tuning the behavior clone network based on the updated teaching data set every time the teaching data set is updated for a preset number of times.
Specifically, a period of fine tuning of the behavioral cloning network may be preset, for example, K, and then, each time the teaching data set is updated K times, the behavioral cloning network may be fine-tuned based on the updated teaching data set, so as to improve robustness of the behavioral cloning network.
Based on any of the above embodiments, based on the teaching data set, training to obtain a behavioral clone network specifically includes:
determining a plurality of candidate cloned networks of different network structures and network parameters;
training each candidate clone network based on a teaching data set by taking a sample environment state as input and a sample action as a label according to a back propagation and gradient descent algorithm;
interacting each candidate clone network with the real environment respectively, and calculating the total round reward obtained by each candidate clone network;
and selecting the candidate clone network with the highest total round reward as the trained behavior clone network.
Specifically, a plurality of candidate clone networks with different network structures and network parameters are predetermined. The network structure of a candidate clone network can be chosen according to the teaching data set so that it matches the teaching data and can mine it well; for example, the structure can be a fully connected network or a long short-term memory (LSTM) network. The activation function of the candidate clone networks may be LeakyReLU, i.e. y = max(0, x) + α·min(0, x), where α is a small positive number.
Each candidate clone network is then trained with the teaching data set, taking the sample environment state as input and the sample action as the label, using a cross-entropy loss between the label and the output of the network model:
[cross-entropy loss between the candidate network's discrete action probability vector and the sample action a_E]
where a is the action corresponding to the maximum value in the discrete action probability vector output by the candidate clone network and a_E is the sample action in the teaching data. The network parameters of each candidate clone network are updated by back-propagation and gradient descent, establishing a mapping f: s → a from states to actions. The trained candidate clone network can then generate a virtual expert action from the input environment state.
Each candidate clone network then interacts with the real environment: given an input environment state, each candidate clone network outputs a discrete action probability vector, and the action corresponding to its maximum value is selected as the virtual expert action applied to the real environment. Each executed action yields a single-step reward, and summing all single-step rewards of a round gives the total reward of that round. The candidate clone network with the highest total round reward is then selected as the trained behavior clone network.
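A condensed PyTorch sketch of this candidate training and selection (the hidden sizes, epoch count, learning rate and the episode_return evaluation callback are illustrative assumptions, and only fully connected candidates are shown):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_candidates(state_dim, num_actions):
    """Candidate clone networks with different structures (fully connected here, for brevity)."""
    return [nn.Sequential(nn.Linear(state_dim, h), nn.LeakyReLU(0.01),
                          nn.Linear(h, num_actions))
            for h in (64, 128, 256)]

def train_candidate(net, states, actions, epochs=50, lr=1e-3):
    """Supervised training: sample environment state as input, sample action as label, cross-entropy loss."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(net(states), actions)  # compare network output with teaching actions a_E
        opt.zero_grad()
        loss.backward()                               # back-propagation
        opt.step()                                    # gradient descent update
    return net

def select_bc_network(candidates, states, actions, episode_return):
    """Keep the candidate that earns the highest total round reward in the real environment."""
    trained = [train_candidate(net, states, actions) for net in candidates]
    return max(trained, key=episode_return)           # episode_return runs one round and sums the rewards
```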
Based on any of the above embodiments, the loss function of the advantage estimation model comprises a supervised loss, a single-step time difference loss and a multi-step time difference loss;
wherein the supervised loss is determined based on a difference between the dominance estimation vector output by the dominance estimation model and the corresponding expert action or sample action; wherein, the expert action is determined by the behavior clone network according to the environmental state of the sample, and the sample action is acquired from the teaching data set.
In particular, a hybrid loss function is defined, comprising supervised and unsupervised losses. The supervised loss is some distance measure between the teaching action and the advantage estimation vector, including but not limited to: cross-entropy loss, MSE loss, KL-divergence loss, JS-divergence loss, Wasserstein distance, and the like. The teaching action is either an expert action output by the behavior clone network or a sample action from the teaching data set. The unsupervised losses are the single-step time difference loss TD(1) and the multi-step time difference loss TD(n). The individual losses can be calculated by the following formulas:

L_DQ(Q) = (r(s, a) + γ Q(s_{t+1}, a_{t+1}^{max}; θ') − Q(s, a; θ))²

L_n(Q) = (r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n Q(s_{t+n}, a_{t+n}^{max}; θ') − Q(s, a; θ))²

L_E(Q) = D_JS(adva ‖ π_bc(s))

L_u(Q) = L_DQ(Q) + λ_1 L_n(Q) + λ_2 L_E(Q)

where L_DQ(Q) is the single-step time difference loss, L_n(Q) is the multi-step time difference loss, and L_E(Q) is the supervised loss (here the JS divergence is taken as an example); r(s, a) is the reward function, s is the state at the current time, a is the action at the current time, γ is the discount factor, s_{t+1} is the state the system jumps to at the next time after the current action is performed, a_{t+1}^{max} is the optimal action under the Double DQN algorithm, defined as a_{t+1}^{max} = argmax_a Q(s_{t+1}, a; θ), θ and θ' are the parameters of the Q network and the target Q network respectively, r_{t+i} is the reward fed back by the system i steps after the current time t, adva is the normalized advantage estimation vector output by the advantage head of the Dueling DDQN, defined as adva = softmax(A(s, a)) where A(s, a) is the advantage estimation vector, demo denotes the teaching data, π_bc(s) is the action distribution of the behavior clone network policy in state s, and λ_1 and λ_2 are the weights of the corresponding losses.
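A sketch of how this hybrid loss could be assembled in PyTorch; the batch fields (states, actions, rewards, next_states, dones, n_step_targets), the q_net.advantage(...) head and the use of a mean-squared error for the two TD terms are assumptions made for illustration, not the patent's exact formulation:

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between two batched action distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps) / (b + eps)).log()).sum(dim=1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mixed_loss(q_net, target_net, bc_net, batch, gamma, lam1, lam2):
    # L_DQ: single-step TD loss with the Double DQN target (action chosen by Q(θ), evaluated by Q(θ')).
    q = q_net(batch.states).gather(1, batch.actions.unsqueeze(1)).squeeze(1)
    next_a = q_net(batch.next_states).argmax(dim=1, keepdim=True)
    target = batch.rewards + gamma * (1.0 - batch.dones) * \
             target_net(batch.next_states).gather(1, next_a).squeeze(1)
    l_dq = F.mse_loss(q, target.detach())

    # L_n: multi-step TD loss against a precomputed n-step return plus the bootstrapped tail value.
    l_n = F.mse_loss(q, batch.n_step_targets.detach())

    # L_E: supervised loss, JS divergence between softmax(A(s, .)) and the clone network's distribution.
    adva = F.softmax(q_net.advantage(batch.states), dim=1)
    expert = F.softmax(bc_net(batch.states), dim=1)
    l_e = js_divergence(adva, expert).mean()

    return l_dq + lam1 * l_n + lam2 * l_e
```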
Based on any of the above embodiments, fig. 2 is a schematic flow chart of a superiority estimation model training method provided by an embodiment of the present invention, as shown in fig. 2, the method includes:
collecting an initial teaching data set according to the operation records of a human expert or other methods and systems in a complex decision problem scene;
pre-training and verifying a behavior clone network by using a teaching data set to obtain a behavior clone network structure with certain initial decision-making capability and corresponding parameters;
the method comprises the steps of applying a teaching data set, sampling from teaching data, pre-training a Dueling DDQN network, and obtaining an advantage estimation model with certain initial decision making capability;
the advantage estimation model performs autonomous learning, interacts with the environment, continuously provides expert actions in the current state according to the behavior clone model, generates expert losses, and trains the Dueling DDQN network by using the mixed loss function provided by the embodiment. Fig. 3 is a schematic diagram of the calculation of the loss function according to the embodiment of the present invention, and as shown in fig. 3, the hybrid loss function includes a supervisory loss supervise, a single-step time difference loss TD (1) loss, and a multi-step time difference loss TD (n) loss. In the figure, V(s), Q (s, a) and A (s, a) respectively represent a state value function, a state-action value function and an advantage function obtained in the Dueling DDQN method, argmaxa(ABC(s, a)) represents the expert action of the behavioral clone network output. And if the current round is finished and the current round obtains better rewards, adding the generated data into a teaching data set, and finely adjusting the behavior clone model. And repeating the operation until the termination condition is met, and finishing the training.
Based on any of the above embodiments, fig. 4 is a schematic structural diagram of an advantage estimation apparatus provided in an embodiment of the present invention, and as shown in fig. 4, the apparatus includes a state obtaining unit 410 and an advantage estimation unit 420.
The state obtaining unit 410 is configured to obtain a current environment state;
the advantage estimation unit 420 is configured to input the current environment state into an advantage estimation model, obtain an advantage vector obtained by performing advantage estimation on the advantage estimation model based on the current environment state, and determine an action corresponding to a maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained by training based on a teaching data set and on expert actions determined by the behavior clone model based on the sample environment state;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set training.
The device provided by the embodiment of the invention trains the advantage estimation model based on the teaching data set and the behavior clone model, can fully utilize the teaching data through the adaptive behavior clone model, automatically excavate the expert experience in historical teaching data, avoid the adverse effect possibly brought by incomplete teaching data, enhance the advantage estimation performance of the advantage estimation model, and improve the accuracy of advantage estimation in a complex scene.
Based on any of the above embodiments, the superiority estimation model is trained based on the following steps:
training to obtain a behavior clone network based on the teaching data set;
pre-training an advantage estimation model based on a teaching data set;
and training the advantage estimation model based on the teaching data set and the expert actions determined by the behavior clone network based on the sample environment state, and dynamically updating the teaching data set and fine-tuning the behavior clone network.
According to the device provided by the embodiment of the invention, the advantage estimation model is first pre-trained and then further trained based on the teaching data set and the expert actions determined by the behavior clone network from the sample environment states, which improves its advantage estimation performance; meanwhile, dynamically updating the teaching data set and fine-tuning the behavior clone network enhances the robustness of the behavior clone model and further improves the training effect of the advantage estimation model.
Based on any of the above embodiments, dynamically updating the teaching data set specifically includes:
and interacting with the real application environment based on the advantage estimation model, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into a teaching data set.
Based on any of the above embodiments, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into a teaching data set, specifically including:
after the current round is finished, calculating the reward value of the current round;
and if the reward value of the current round is higher than the preset reward, determining new teaching data based on the feedback information of the real application environment in the current round and on the state information input to, and the advantage actions output by, the advantage estimation model in the current round, and updating the new teaching data into the teaching data set.
Based on any of the above embodiments, the fine tuning behavior cloning network specifically includes:
and fine-tuning the behavior clone network based on the updated teaching data set every time the teaching data set is updated for a preset number of times.
Based on any of the above embodiments, based on the teaching data set, training to obtain a behavioral clone network specifically includes:
determining a plurality of candidate cloned networks of different network structures and network parameters;
training each candidate clone network based on a teaching data set by taking a sample environment state as input and a sample action as a label according to a back propagation and gradient descent algorithm;
interacting each candidate clone network with the real environment respectively, and calculating the total round reward obtained by each candidate clone network;
and selecting the candidate clone network with the highest total round reward as the trained behavior clone network.
Based on any of the above embodiments, the loss function of the advantage estimation model comprises a supervised loss, a single-step time difference loss and a multi-step time difference loss;
wherein the supervised loss is determined based on a difference between the dominance estimation vector output by the dominance estimation model and the corresponding expert action or sample action; wherein, the expert action is determined by the behavior clone network according to the environmental state of the sample, and the sample action is acquired from the teaching data set.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a dominance estimation method comprising: acquiring a current environment state; inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model through advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action; the advantage estimation model is obtained based on a teaching data set and a behavior clone model; the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform a method of advantage estimation provided by the above methods, the method comprising: acquiring a current environment state; inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model through advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action; the advantage estimation model is obtained based on a teaching data set and a behavior clone model; the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the method of merit estimation provided above, the method comprising: acquiring a current environment state; inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model through advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action; the advantage estimation model is obtained based on a teaching data set and a behavior clone model; the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A dominance estimation method, comprising:
acquiring a current environment state;
inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model through advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained based on a teaching data set and a behavior clone model;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
2. The dominance estimation method according to claim 1, wherein the dominance estimation model is trained based on the following steps:
training to obtain a behavior clone network based on the teaching data set;
pre-training an advantage estimation model based on the teaching data set;
training the advantage estimation model based on the teaching data set and expert actions determined by the behavior clone network based on the sample environment state, and dynamically updating the teaching data set and fine-tuning the behavior clone network.
3. The dominance estimation method according to claim 2, wherein the dynamically updating the teach data set specifically comprises:
interacting with a real application environment based on the superiority estimation model, determining new teaching data based on feedback information of the real application environment, and updating the new teaching data into the teaching data set.
4. The advantage estimation method according to claim 3, wherein the determining new teaching data based on the feedback information of the real application environment and updating the new teaching data into the teaching data set specifically includes:
after the current round is finished, calculating the reward value of the current round;
and if the reward value of the current round is higher than the preset reward, determining new teaching data based on the feedback information of the real application environment in the current round and on the state information input to, and the advantage actions output by, the advantage estimation model in the current round, and updating the new teaching data into the teaching data set.
5. The advantage estimation method according to claim 2, wherein the fine-tuning the behavioral clone network specifically includes:
and fine-tuning the behavior clone network based on the updated teaching data set every time the teaching data set is updated for a preset number of times.
6. The advantage estimation method according to claim 2, wherein training to obtain a behavioral clone network based on the teach data set specifically comprises:
determining a plurality of candidate cloned networks of different network structures and network parameters;
based on the teaching data set, taking the environmental state of the sample as input, taking the action of the sample as a label, and training each candidate clone network according to a back propagation and gradient descent algorithm;
interacting each candidate clone network with the real environment respectively, and calculating the total round reward obtained by each candidate clone network;
and selecting the candidate clone network with the highest total round reward as the trained behavior clone network.
7. The dominance estimation method according to any one of claims 1 to 6, wherein the loss function of the dominance estimation model comprises a supervised loss, a single-step time difference loss, and a multi-step time difference loss;
wherein the supervised loss is determined based on a difference between a dominance estimation vector output by the dominance estimation model and a corresponding expert or sample action; wherein the expert action is determined by the behavioral cloning network according to a sample environment state, and the sample action is acquired from the teaching data set.
8. An advantage estimation apparatus, comprising:
the state acquisition unit is used for acquiring the current environment state;
the advantage estimation unit is used for inputting the current environment state into an advantage estimation model to obtain an advantage vector obtained by the advantage estimation model performing advantage estimation based on the current environment state, and determining an action corresponding to the maximum value in the advantage vector as an advantage action;
the advantage estimation model is obtained by training based on a teaching data set and on expert actions determined by a behavior clone model based on the sample environment state;
the teaching data set comprises sample environment states and corresponding sample actions, and the behavior clone model is obtained based on the teaching data set through training.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the advantage estimation method according to any of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the dominance estimation method according to any one of claims 1 to 7.
CN202110540754.6A 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium Active CN113240118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110540754.6A CN113240118B (en) 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110540754.6A CN113240118B (en) 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
CN113240118A true CN113240118A (en) 2021-08-10
CN113240118B CN113240118B (en) 2023-05-09

Family

ID=77135047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110540754.6A Active CN113240118B (en) 2021-05-18 2021-05-18 Dominance estimation method, dominance estimation device, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN113240118B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118563A (en) * 2021-11-23 2022-03-01 中国电子科技集团公司第三十研究所 Self-iteration situation prediction method and system based on data middleboxes

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858630A (en) * 2019-02-01 2019-06-07 清华大学 Method and apparatus for intensified learning
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Learn isolated intersection traffic signal control method, system, device based on deeply
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
EP3690769A1 (en) * 2019-01-31 2020-08-05 StradVision, Inc. Learning method and learning device for supporting reinforcement learning by using human driving data as training data to thereby perform personalized path planning
CN112396180A (en) * 2020-11-25 2021-02-23 中国科学院自动化研究所 Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3690769A1 (en) * 2019-01-31 2020-08-05 StradVision, Inc. Learning method and learning device for supporting reinforcement learning by using human driving data as training data to thereby perform personalized path planning
CN109858630A (en) * 2019-02-01 2019-06-07 清华大学 Method and apparatus for intensified learning
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Learn isolated intersection traffic signal control method, system, device based on deeply
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN112396180A (en) * 2020-11-25 2021-02-23 中国科学院自动化研究所 Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOSHUANG LI et al.: "Supervised assisted deep reinforcement learning for emergency voltage control of power systems", Neurocomputing *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118563A (en) * 2021-11-23 2022-03-01 中国电子科技集团公司第三十研究所 Self-iteration situation prediction method and system based on data middleboxes

Also Published As

Publication number Publication date
CN113240118B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
US12059619B2 (en) Information processing method and apparatus, computer readable storage medium, and electronic device
CN113561986A (en) Decision-making method and device for automatically driving automobile
CN112329948A (en) Multi-agent strategy prediction method and device
CN112396180B (en) Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
CN111282272B (en) Information processing method, computer readable medium and electronic device
CN113947022B (en) Near-end strategy optimization method based on model
CN110555517A (en) Improved chess game method based on Alphago Zero
CN116776751B (en) Intelligent decision algorithm model design development auxiliary system
JP2020166795A (en) Reinforced learning method, reinforced learning device, and reinforced learning program for efficient learning
CN112613608A (en) Reinforced learning method and related device
CN115457240A (en) Image object driving navigation method, device, equipment and storage medium
CN113240118B (en) Dominance estimation method, dominance estimation device, electronic device, and storage medium
CN114154397B (en) Implicit opponent modeling method based on deep reinforcement learning
CN115409158A (en) Robot behavior decision method and device based on layered deep reinforcement learning model
CN113887708A (en) Multi-agent learning method based on mean field, storage medium and electronic device
CN111753855B (en) Data processing method, device, equipment and medium
Xing et al. Policy distillation with selective input gradient regularization for efficient interpretability
Jang et al. AVAST: Attentive variational state tracker in a reinforced navigator
CN115009291B (en) Automatic driving assistance decision making method and system based on network evolution replay buffer area
CN112884129A (en) Multi-step rule extraction method and device based on teaching data and storage medium
Chen et al. Modified PPO-RND method for solving sparse reward problem in ViZDoom
CN112870716B (en) Game data processing method and device, storage medium and electronic equipment
Saito et al. A study on efficient transfer learning for reinforcement learning using sparse coding
CN114118400B (en) Concentration network-based cluster countermeasure method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant