CN115648204A - Training method, device, equipment and storage medium of intelligent decision model - Google Patents
Training method, device, equipment and storage medium of intelligent decision model Download PDFInfo
- Publication number
- CN115648204A CN115648204A CN202211172621.9A CN202211172621A CN115648204A CN 115648204 A CN115648204 A CN 115648204A CN 202211172621 A CN202211172621 A CN 202211172621A CN 115648204 A CN115648204 A CN 115648204A
- Authority
- CN
- China
- Prior art keywords
- reward
- action
- external information
- model
- robot
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The application discloses a training method, apparatus, device and storage medium for an intelligent decision model, and belongs to the field of computer technologies. According to the technical solution provided by the embodiments of the application, external information collected by a robot in a target environment is obtained and input into the intelligent decision model, and the distributed executor model of the intelligent decision model outputs a plurality of action branches, the action branches being the actions the robot may execute in the target environment given the external information. Based on the external information and the action branches, the reward value distribution of each action branch is determined, that is, the action branches are evaluated. Reward aggregation is then performed based on the reward value distributions of the plurality of action branches to determine a mixed reward and an integrated reward. Training the intelligent decision model based on the mixed reward, the integrated reward and the external information achieves a stable training effect.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for training an intelligent decision model.
Background
With the development of computer technology, Multi-Agent Reinforcement Learning (MARL) has made great progress in fields such as large-scale real-time battle games, robot control, autonomous driving, and stock trading.
In a multi-agent environment, under the influence of factors such as the potential mutual interference among agents and the uncertainty of the environment, the reward values obtained by the agents are uncertain, and this uncertainty affects multi-agent learning. For example, in an environment containing multiple robots, the interaction between the robots and the uncertainty of the environment itself affect the training of each robot's intelligent decision model, i.e., its reinforcement learning model. Achieving stable learning of a multi-agent reinforcement learning model thus becomes very difficult, and a method for achieving such stable learning is urgently needed.
Disclosure of Invention
The embodiment of the application provides a training method, a training device, equipment and a storage medium of an intelligent decision model, and can improve the training effect of the decision model of a robot. The technical scheme is as follows:
in one aspect, a method for training an intelligent decision model is provided, the method including:
acquiring external information acquired by a robot in a target environment, wherein the external information comprises external environment information and interaction information, the external environment information is information obtained by observing the target environment by the robot, and the interaction information is information obtained by interacting the robot with other robots in the target environment;
inputting the external information into an intelligent decision model of the robot, performing prediction by a distributed executor model of the intelligent decision model based on the external information, and outputting a plurality of action branches of the robot, wherein the action branches are actions which can be executed by the robot in the target environment;
determining a reward value distribution of each action branch based on the external information and the plurality of action branches;
carrying out reward aggregation on the sampled reward values obtained by sampling in the reward value distribution of each action branch to obtain mixed reward and integrated reward;
training an intelligent decision model of the robot based on the hybrid reward, the integrated reward, and the external information.
In one possible embodiment, the predicting by the distributed actor model of the intelligent decision model based on the external information, outputting the plurality of action branches of the robot comprises:
performing at least one of full connection, convolution and attention coding on the external information by a distributed executor model of the intelligent decision model to obtain external information characteristics of the external information;
and fully connecting and normalizing the external information features by a distributed executor model of the intelligent decision model, and outputting a plurality of action branches of the robot.
In a possible embodiment, the determining, based on the external information and the plurality of action branches, a reward value distribution of each of the action branches includes:
inputting the external information and the action branches into a reward value estimation model, performing reward value distribution estimation through the reward value estimation model based on the external information and the action branches, and outputting reward value distribution of each action branch.
In one possible embodiment, the aggregating the rewards of the sampled reward values sampled in the reward value distribution of each action branch, and the obtaining of the mixed reward and the integrated reward comprises:
sampling the reward value distribution of each action branch to obtain the sampling reward value of each action branch;
carrying out strategy weighted fusion on the sampling reward values of all the action branches to obtain the mixed reward;
and carrying out any one of global average, local average and direct selection on the sampling reward value of each action branch to obtain the integrated reward.
In one possible embodiment, the training of the distributed actor model of the intelligent decision model of the robot based on the hybrid reward, the integrated reward, and the external information includes:
training a reward value estimation model of the intelligent decision model based on the external information;
obtaining action advantage values of the action branches based on the hybrid reward, the external information and a critic model of the intelligent decision model, wherein the critic model is used for evaluating the action branches based on the external information;
training a distributed actor model of the intelligent decision model based on the action dominance value and the integrated reward.
In one possible embodiment, the obtaining the action advantage values of the plurality of action branches based on the hybrid reward, the external information and the critic model of the intelligent decision model comprises:
training a critic model of the intelligent decision model based on the hybrid reward and the external information; inputting the external information and the plurality of action branches into the critic model, and outputting action dominance values of the plurality of action branches by the critic model.
In one possible implementation, the training the intelligent decision model based on the action advantage value and the integrated reward comprises:
constructing a loss function for the distributed actor model based on the action dominance value and the integrated reward;
and training the distributed executor model by adopting a gradient descent method based on the loss function.
In one aspect, an apparatus for training an intelligent decision model is provided, where the apparatus includes:
the external information acquisition module is used for acquiring external information acquired by the robot in a target environment, wherein the external information comprises external environment information and interaction information, the external environment information is information obtained by observing the target environment by the robot, and the interaction information is information obtained by interacting the robot with other robots in the target environment;
the action prediction module is used for inputting the external information into an intelligent decision model of the robot, performing prediction by a distributed executor model of the intelligent decision model based on the external information, and outputting a plurality of action branches of the robot, wherein the action branches are actions which the robot can possibly execute in the target environment;
the reward value prediction module is used for determining reward value distribution of each action branch based on the external information and the action branches;
the reward value aggregation module is used for carrying out reward aggregation on the sampled reward values obtained by sampling in the reward value distribution of each action branch to obtain mixed rewards and integrated rewards;
and the training module is used for training the intelligent decision model of the robot based on the mixed reward, the integrated reward and the external information.
In a possible implementation manner, the action prediction module is configured to perform at least one of full connection, convolution and attention coding on the external information by a distributed executor model of the intelligent decision model to obtain an external information feature of the external information; and fully connecting and normalizing the external information features by a distributed executor model of the intelligent decision model, and outputting a plurality of action branches of the robot.
In a possible implementation manner, the reward value prediction module is configured to input the external information and the plurality of action branches into a reward value estimation model, perform reward value distribution estimation through the reward value estimation model based on the external information and the plurality of action branches, and output reward value distribution of each action branch.
In a possible implementation manner, the reward value aggregation module is configured to sample a reward value distribution of each action branch to obtain a sampled reward value of each action branch; carrying out strategy weighted fusion on the sampling reward values of all the action branches to obtain the mixed reward; and carrying out any one of global averaging, local averaging and direct selection on the sampled reward values of the action branches to obtain the integrated reward.
In a possible implementation, the training module is configured to obtain action dominance values of the plurality of action branches based on the hybrid reward, the external information, and a critic model of the smart decision model, the critic model being configured to evaluate the plurality of action branches based on the external information; training a distributed actor model of the intelligent decision model based on the action dominance value and the integrated reward.
In a possible implementation, the training module is configured to train a reward value estimation model of the intelligent decision model based on the external information.
in one possible implementation, the training module is configured to train the critic model based on the hybrid reward and the external information; inputting the external information and the action branches into the critic model, and outputting action advantage values of the action branches by the critic model.
In one possible embodiment, the training module is configured to construct a loss function of the distributed actor model based on the action dominance value and the integrated reward; and training the distributed executor model by adopting a gradient descent method based on the loss function.
In one aspect, a computer device is provided, the computer device comprising one or more processors and one or more memories, at least one computer program being stored in the one or more memories, the computer program being loaded and executed by the one or more processors to implement the training method of the intelligent decision model.
In one aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, which is loaded and executed by a processor to implement the training method of the intelligent decision model.
In one aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising program code, the program code being stored in a computer-readable storage medium, which is read by a processor of a computer device from the computer-readable storage medium, the program code being executed by the processor such that the computer device performs the method of training an intelligent decision model as described above.
According to the technical solution provided by the embodiments of the application, the external information collected by the robot in the target environment is obtained and input into the intelligent decision model, and the distributed executor model of the intelligent decision model outputs a plurality of action branches, which are the actions the robot may execute in the target environment given the external information. Based on the external information and the action branches, the reward value distribution of each action branch is determined, that is, the action branches are evaluated. Reward aggregation is then performed based on the reward value distributions of the plurality of action branches to determine a mixed reward and an integrated reward. Training the intelligent decision model based on the mixed reward, the integrated reward and the external information achieves a stable training effect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained from these drawings by those of ordinary skill in the art without creative effort.
FIG. 1 is a schematic diagram of an implementation environment of a training method for an intelligent decision model according to an embodiment of the present application;
FIG. 2 is a flowchart of a training method of an intelligent decision model according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a training method for an intelligent decision model according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of a training method of an intelligent decision model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of experimental results provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a training apparatus for an intelligent decision model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, the following detailed description of the embodiments of the present application will be made with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution.
In order to explain the technical solutions provided in the embodiments of the present application, the following first introduces terms related to the embodiments of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
Deep Reinforcement Learning (DRL): deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning, can produce control directly from input images, and is an artificial intelligence method closer to the human way of thinking. A deep reinforcement learning problem can be formalized as a Markov Decision Process (MDP), represented by a quintuple <S, A, R, P, γ>, whose elements respectively denote the environment state, the action, the reward, the state-transition matrix, and the cumulative discount factor. The agent obtains the state and reward from the environment and generates an action that acts on the environment; the environment receives the agent's action and, according to the current state, produces the next state for the agent. The goal of the agent is to maximize its long-term return.
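To make the quintuple and the agent-environment loop above concrete, the following is a minimal Python sketch; the toy environment, its dynamics, and the random placeholder policy are illustrative assumptions, not part of the patent.

```python
import random

# Illustrative MDP quintuple <S, A, R, P, gamma>: a tiny chain of 4 states,
# 2 actions, handcrafted reward and transition rules (placeholders).
STATES = [0, 1, 2, 3]
ACTIONS = [0, 1]                      # e.g. 0 = stay, 1 = move forward
GAMMA = 0.9                           # cumulative discount factor

def transition(state, action):
    """P: next state given the current state and action."""
    return min(state + action, len(STATES) - 1)

def reward(state, action, next_state):
    """R: reward fed back by the environment for a state-action pair."""
    return 1.0 if next_state == len(STATES) - 1 else 0.0

def policy(state):
    """Placeholder agent policy: choose an action at random."""
    return random.choice(ACTIONS)

# Agent-environment interaction loop: the agent receives the state and reward,
# produces an action, and the environment produces the next state.
state, discounted_return, discount = 0, 0.0, 1.0
for t in range(10):
    action = policy(state)
    next_state = transition(state, action)
    r = reward(state, action, next_state)
    discounted_return += discount * r   # long-term (discounted) return
    discount *= GAMMA
    state = next_state
print("discounted return:", discounted_return)
```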
Normalization: arrays with different value ranges are mapped into the (0, 1) interval, which facilitates data processing. In some cases, the normalized values can be directly interpreted as probabilities.
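A small sketch of two normalizations of the kind mentioned above (a min-max mapping into a bounded interval, and a softmax variant whose outputs can be read as probabilities); the input values are arbitrary examples.

```python
import numpy as np

x = np.array([3.0, -1.0, 7.0, 5.0])                 # arbitrary example values

# Min-max normalization: map values with different ranges into [0, 1].
minmax = (x - x.min()) / (x.max() - x.min())

# Softmax normalization: positive values summing to 1, usable as probabilities.
softmax = np.exp(x - x.max()) / np.exp(x - x.max()).sum()

print(minmax, softmax)
```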
Learning Rate: in the gradient descent method, the learning rate guides how the model adjusts the network weights using the gradient of the loss function. If the learning rate is too large, the update may step directly over the global optimum, and the loss becomes too large; if the learning rate is too small, the loss function changes slowly, which greatly increases the convergence cost of the network and makes it easy to become trapped in a local minimum or saddle point.
Embedded Coding: embedded coding expresses a correspondence mathematically, that is, data in a space X is mapped to a space Y through a function F, where F is an injective function and the mapping is structure-preserving. Injectivity means that each mapped data point corresponds uniquely to a data point before mapping; structure preservation means that the order relation among the data is unchanged by the mapping. For example, suppose data X1 and X2 exist before mapping and are mapped to Y1 and Y2 respectively; if X1 > X2 before mapping, then correspondingly Y1 > Y2 after mapping. For words, the words are mapped into another space, which facilitates subsequent machine learning and processing.
Attention Weight (Attention Weight): may represent the importance of certain data in the training or prediction process, the importance representing the magnitude of the impact of the input data on the output data. The data of high importance has a high value of attention weight, and the data of low importance has a low value of attention weight. Under different scenes, the importance of the data is different, and the process of training attention weight of the model is the process of determining the importance of the data.
Having introduced the terms used in describing the embodiments of the present application, an environment in which the embodiments of the present application can be implemented is described below.
Fig. 1 is a schematic diagram of an implementation environment of a training method for an intelligent decision model according to an embodiment of the present application, and referring to fig. 1, the implementation environment may include a terminal 110 and a server 140.
The terminal 110 is connected to the server 140 through a wireless network or a wired network. Optionally, the terminal 110 is a smartphone, a tablet, a laptop, a desktop computer, etc., but is not limited thereto. The terminal 110 is installed and run with applications that support intelligent decision model training.
The server 140 is an independent physical server, or a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The server 140 provides background services for the applications running on the terminal.
Those skilled in the art will appreciate that the number of terminals described above may be greater or smaller. For example, there may be only one terminal, or several tens or hundreds of terminals, or more, in which case the implementation environment also includes other terminals. The number of terminals and the device types are not limited in the embodiments of the present application.
After the implementation environment of the embodiment of the present application is described, an application scenario of the embodiment of the present application is described below.
The technical scheme provided by the embodiment of the application can be applied to a scene of multi-robot navigation, the intelligent decision-making model of the robot is trained by adopting the technical scheme provided by the embodiment of the application, the intelligent decision-making model can output actions according to external information observed by the robot, and the actions can realize the navigation of the robot. The intelligent decision model is configured for a plurality of robots in one environment, and the purpose of navigating the plurality of robots can be achieved. Under the multi-robot navigation scenario, each robot may be considered an agent.
Alternatively, the technical solution provided in the embodiment of the present application can also be applied to other scenes including a plurality of agents, for example, the technical solution provided in the embodiment of the present application can be applied to a scene of multi-vehicle navigation, a scene of a match between multiple virtual objects in a game scene, and the like, and the embodiment of the present application is not limited thereto. In the following description process, the technical solution provided by the embodiment of the present application is applied to a multi-robot navigation scenario as an example.
After the implementation environment and the application scenario of the embodiment of the present application are introduced, the technical solutions provided in the embodiment of the present application are introduced below. It should be noted that, in the following description of the technical solutions provided in the present application, a server is taken as an example of an execution subject. In other possible embodiments, the terminal may also be used as an execution subject to execute the technical solution provided in the present application, and the embodiment of the present application is not limited to the type of the execution subject.
Fig. 2 is a flowchart of a training method of an intelligent decision model provided in an embodiment of the present application, and referring to fig. 2, taking an execution subject as an example, the method includes the following steps.
201. The method comprises the steps that a server obtains external information collected by a robot in a target environment, the external information comprises external environment information and interaction information, the external environment information is information obtained by the robot observing the target environment, and the interaction information is information obtained by the robot interacting with other robots in the target environment.
The target environment includes a plurality of robots, and the robot in step 201 is any one of the plurality of robots. The external information includes external environment information and interaction information, where the external environment information is information obtained by the robot observing the target environment, that is, information obtained by the robot observing the target environment through a plurality of sensors, for example, the external information includes an environment image obtained by an image sensor of the robot, a position obtained by a position sensor of the robot, and attitude information obtained by a gyroscope of the robot. The interaction information is information obtained by the robot interacting with other robots in the target environment, for example, the interaction information includes collision information between the robot and other robots.
202. The server inputs the external information into an intelligent decision model of the robot, a distributed executor model of the intelligent decision model predicts based on the external information, and a plurality of action branches of the robot are output, wherein the action branches are possible actions of the robot in the target environment.
The distributed executor model of the intelligent decision model is used for predicting actions based on external information, the external information is observation of the robot to the target environment, and the function of the distributed executor model is to make a decision about which action to execute based on the observation of the robot. The plurality of actions are branched into actions that the robot is likely to perform in the target environment in the case of acquiring the external information.
203. The server determines a prize value distribution for each of the action branches based on the external information and the plurality of action branches.
The reward value distribution of each action branch is the distribution of the potential rewards that can be obtained by executing that action branch given the external information. The reward distribution is related to the actions taken by the agents, the interactions among the agents, and the external information, where the agents are the robots.
204. And the server carries out reward aggregation on the sampled reward values sampled in the reward value distribution of each action branch to obtain mixed rewards and integrated rewards.
Among them, the mixed reward (obtained by mixed reward aggregation) is used to train the critic model, and the integrated reward (obtained by lumped reward aggregation) is used to train the distributed actor model.
205. The server trains an intelligent decision model of the robot based on the hybrid reward, the integrated reward, and the external information.
According to the technical solution provided by the embodiments of the application, the external information collected by the robot in the target environment is obtained and input into the intelligent decision model, and the distributed executor model of the intelligent decision model outputs a plurality of action branches, which are the actions the robot may execute in the target environment given the external information. Based on the external information and the plurality of action branches, the reward value distribution of each action branch is determined, that is, the plurality of action branches are evaluated. Reward aggregation is then performed based on the reward value distributions of the plurality of action branches to determine a mixed reward and an integrated reward. Training the intelligent decision model based on the mixed reward, the integrated reward and the external information achieves a stable training effect.
The above steps 201 to 205 are brief descriptions of the technical solutions provided by the embodiments of the present application, and the technical solutions provided by the embodiments of the present application will be more clearly described below with reference to some examples, and with reference to fig. 3 and fig. 4, taking an execution subject as an example, the method includes the following steps.
301. The method comprises the steps that a server obtains external information collected by a robot in a target environment, the external information comprises external environment information and interaction information, the external environment information is information obtained by the robot observing the target environment, and the interaction information is information obtained by the robot interacting with other robots in the target environment.
The target environment includes a plurality of robots, and the robot in step 301 is any one of the plurality of robots. The external information includes external environment information and interaction information, where the external environment information is information obtained by the robot observing the target environment, that is, information obtained by the robot observing the target environment through multiple sensors, for example, the external information includes an environment image obtained by an image sensor of the robot, a position obtained by a position sensor of the robot, and attitude information obtained by a gyroscope of the robot. The interaction information is information obtained by the robot interacting with other robots in the target environment, for example, the interaction information includes collision information between the robot and other robots. In some embodiments, the target environment is a simulated virtual environment.
In some embodiments, multiple robots in the target environment interact with the target environment in a distributed manner. Each robot collects external information at different times during the interaction, and the external information collected by the robots at different times is stored in an experience buffer pool (Replay Buffer). The server can obtain, from the experience buffer pool, the external information collected by the robots at different times. The experience buffer pool is equivalent to a database, and the external information stored in it serves as training samples when the intelligent decision model of the robot is trained; the form of the experience buffer pool is shown in FIG. 4. Since the robot is any one of the plurality of robots, the server can obtain the external information collected by that robot in the target environment from the experience buffer pool based on the robot's identification. In some embodiments, the experience buffer pool also stores the actions performed by the robots in the target environment and the reward values given by the target environment for those actions.
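A minimal sketch of such an experience buffer pool, assuming each record stores the robot identification, the external information, the executed action, the environment reward, and the next observation; the field layout and sampling interface are illustrative assumptions rather than details from the patent.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience buffer pool storing per-robot transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, robot_id, external_info, action, reward, next_external_info):
        # One record: which robot, what it observed, what it did, what it received.
        self.buffer.append((robot_id, external_info, action, reward, next_external_info))

    def sample(self, batch_size, robot_id=None):
        # Optionally filter by robot identification, then sample a training batch.
        pool = [t for t in self.buffer if robot_id is None or t[0] == robot_id]
        return random.sample(pool, min(batch_size, len(pool)))
```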
Under the deep reinforcement learning framework, the external information is the state of the environment and the robot is the agent. The reward uncertainty caused by interactions among agents grows exponentially as the number of agents increases, and a single state-action pair may correspond to multiple rewards; therefore, in the subsequent processing, reward distributions are modeled for the different action branches. One straightforward approach is to model the reward distribution directly over the joint state-action space. However, as the number of agents increases, this approach suffers from great uncertainty due to the interactions between agents. To address this challenge, the embodiments of the present application propose another method: from the perspective of one agent, the other agents are regarded as part of the environment to simplify the problem, so only the action-branch reward distribution estimation of each individual agent needs to be considered, rather than that of all agents. In the above steps, the interaction information is treated as part of the environment, and the external information used for action prediction includes both the external environment information and the interaction information, thereby simplifying the influence of the other robots on the robot. The idea of multi-action-branch reward distribution estimation is intuitively motivated by the fact that the human brain can always speculate, based on existing facts, on the possible consequences of the decision-making behaviors associated with them.
302. The server inputs the external information into an intelligent decision model of the robot, a distributed executor model of the intelligent decision model predicts based on the external information, and a plurality of action branches of the robot are output, wherein the action branches are possible actions of the robot in the target environment.
The distributed executor model is used for predicting actions based on external information, the external information is the observation of the robot to the target environment, and the function of the distributed executor model is to make a decision on which action to execute based on the observation of the robot. The plurality of actions are branched into actions that the robot is likely to perform in the target environment in the case of acquiring the external information. In some embodiments, the distributed Actor model is also referred to as a distributed Actor network (Decentralized Actor).
In one possible implementation, the server inputs external information into a distributed executor model of the robot, performs prediction by the distributed executor model based on the external information and a decision strategy, and outputs a plurality of action branches of the robot.
The decision Policy (Policy) is also a model parameter of the distributed executor model, and the distributed executor model is trained, that is, the decision Policy is optimized (the model parameter is adjusted), so that the distributed executor model can perform more accurate action prediction according to external information.
For example, the server inputs the external information into the distributed executor model of the robot, and the distributed executor model performs feature extraction on the external information to obtain external information features; for example, the external information features are obtained by performing at least one of full connection, convolution and attention coding on the external information. The distributed executor model then fully connects and normalizes the external information features and outputs the action branches. After full connection and normalization, a probability set is obtained, which includes a plurality of probabilities, each corresponding to one action. The server determines the actions corresponding to the top target number of probabilities in the probability set as the action branches. For example, if the number of actions the robot can perform in the target environment is N, the probability set output by the distributed executor model includes the probabilities of the N actions, and the server determines M of the N actions (M may also be equal to N) as the action branches. Referring to FIG. 4, the distributed executor model of the robot is the Actor in FIG. 4.
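The following PyTorch sketch illustrates the kind of distributed executor forward pass described above: a fully connected encoding of the external information, followed by a fully connected layer and softmax normalization over the N candidate actions, from which the top M actions are kept as action branches. The layer sizes and the use of PyTorch are assumptions for illustration; the patent does not prescribe a particular network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecentralizedActor(nn.Module):
    def __init__(self, obs_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(                 # full connection on external info
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs):
        feat = self.encoder(obs)                      # external information features
        probs = F.softmax(self.head(feat), dim=-1)    # normalization -> probability set
        return probs

# Example: keep the M most probable of N actions as action branches.
actor = DecentralizedActor(obs_dim=16, num_actions=8)
probs = actor(torch.randn(1, 16))
action_branches = torch.topk(probs, k=4, dim=-1).indices   # M = 4 (M may equal N)
```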
303. The server determines a reward value distribution for each of the action branches based on the external information and the plurality of action branches.
Wherein the reward value distribution of each action branch is a potential reward distribution that can be obtained for executing each action branch under the condition of the external information, and the distribution of the rewards is related to the action taken by the intelligent agents, the interaction among the intelligent agents and the external information.
In one possible implementation, the server inputs the external information and the action branches into a reward value estimation model, performs reward value distribution estimation through the reward value estimation model based on the external information and the action branches, and outputs reward value distribution of each action branch.
Referring to FIG. 4, the reward value estimation model is also called a Multi-action-branch Reward Estimator, which outputs reward value distributions based on the environment state (the external information) and the actions (the action branches).
For example, at each time step t, for each agent (robot) i, the server uses a multi-action-branch reward estimator to model the distribution of the rewards that agent i may obtain when, under its observation, it selects the action on the k-th action branch. Here, the estimator describes, within the space of reward distributions, the reward distribution of agent i on each action branch, parameterized by distribution parameters related to agent i, where the observation is the external information observed by robot i; given that observation, the estimator outputs the estimated distributions over all action branches of agent i. Furthermore, the multi-action-branch distribution estimation is realized by optimizing an objective function, so that the uncertainty of the reward function can be captured, which provides a precondition for reducing the influence caused by reward uncertainty. The objective function takes the form of formula (1) and formula (2).

Formula (1) is the optimization objective of the multi-action-branch reward estimators of all agents, and J_r is its function value. In formula (2), the superscript of the agent index is omitted; D is the experience buffer pool, -log P[·] denotes the negative log-likelihood loss, and the remaining term is a regularization term on the reward distribution. In some embodiments, a Gaussian distribution may be used, but the optimization objective is not limited to Gaussian distributions, and other distribution forms are possible. When a Gaussian distribution is used, var(μ) denotes the variance with respect to the Gaussian mean vector. α and β are hyper-parameters.
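The sketch below shows what a multi-action-branch reward estimator with a Gaussian parameterization and a negative log-likelihood objective in the spirit of formulas (1) and (2) might look like; since the formulas themselves are not reproduced in this text, the network layout, the exact form of the var(μ) regularizer, and the α/β weighting are assumptions based only on the textual description.

```python
import torch
import torch.nn as nn

class MultiActionBranchRewardEstimator(nn.Module):
    """Predicts a Gaussian reward distribution for every action branch."""

    def __init__(self, obs_dim, num_branches, hidden_dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.mean = nn.Linear(hidden_dim, num_branches)       # one mean per branch
        self.log_std = nn.Linear(hidden_dim, num_branches)    # one std per branch

    def forward(self, obs):
        h = self.body(obs)
        return self.mean(h), self.log_std(h).clamp(-5, 2).exp()

def estimator_loss(estimator, obs, branch_idx, env_reward, alpha=1.0, beta=0.01):
    # Negative log-likelihood of the environment reward on the branch actually taken,
    # plus a regularization term on the Gaussian mean vector (its variance),
    # mirroring the -log P[.] and var(mu) terms described for formula (2).
    mean, std = estimator(obs)                                 # shapes: [B, K]
    dist = torch.distributions.Normal(mean.gather(1, branch_idx),
                                      std.gather(1, branch_idx))
    nll = -dist.log_prob(env_reward).mean()
    reg = mean.var(dim=1).mean()                               # needs K >= 2 branches
    return alpha * nll + beta * reg
```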
The potential rewards of all action branches can be predicted through multi-action-branch reward estimation, just as humans consider every possible outcome and usually make decisions by weighing the outcomes of all possible actions. Therefore, in order to better evaluate historical experience and use it to achieve stable training, policy-weighted reward aggregation is introduced below to weaken the influence of reward uncertainty during training.
304. And the server carries out reward aggregation on the sampled reward values sampled in the reward value distribution of each action branch to obtain mixed rewards and integrated rewards.
At each time step, agent i can only take one action a_k and obtain the reward r_k on that action branch; the action branch referred to here is the action a_k. However, after the multi-action-branch reward estimation, r_k can be enhanced into an embedded reward vector, in which the value at the k-th position is the reward r_k obtained from the environment and the values at the other positions are reward values sampled from the estimated reward distributions. The embedded reward vectors are then reward-aggregated to obtain two types of reward values for updating the Centralized Critic network and the distributed actor network (the distributed executor model). The two types of reward values are the mixed reward, obtained by mixed reward aggregation and used to train the centralized critic V_{γ,ψ}, and the integrated reward, obtained by lumped reward aggregation and used to train the distributed executor. The centralized critic network, also referred to as the critic model, is used to evaluate the actions output by the distributed actor network.
First, a method of acquiring the sample bonus value by the server will be described.
In one possible implementation, the server samples the reward value distribution of each action branch to obtain a sampled reward value for each action branch.
The method for the server to obtain the hybrid bonus and the integrated bonus in step 304 is described below.
First, a method for acquiring a hybrid prize by a server will be described.
In one possible implementation, the server performs policy weighted fusion on the sampled reward values of the action branches to obtain the mixed reward.
The sampled reward values of the plurality of action branches constitute an embedded reward vector. A replacement function replaces the reward on the action branch actually taken with the reward obtained from the environment; the final result is a vector formed by placing r_k at the k-th position of the embedded vector and placing the corresponding sampled values at the other positions.

The server then obtains the mixed reward by policy weighting, where the function g(·) represents one of two operations: the averaging operation g_MO, which directly averages the input quantities, and the selection operation g_SS, which directly selects the m_i corresponding to agent i as the output of g(·). The weights used for the policy weighting are given by the policy, and o denotes the external information. In summary, the server performs policy-weighted fusion on the reward values of the plurality of action branches according to formula (3) to obtain the mixed reward.

In formula (3), the left-hand side denotes the mixed reward of agent i, m_1, …, m_N are the reward values of the plurality of action branches of agent i, N is the number of action branches, and N is a positive integer.

It should be noted that either the averaging operation g_MO or the selection operation g_SS may be used to determine the mixed reward, and the embodiments of the present application are not limited thereto.
The method for the server to obtain the integrated bonus is described below.
In one possible implementation, the server performs any one of global averaging, local averaging and direct selection of the sampled reward values for the respective action branches to obtain the integrated reward.
The server obtains the integrated reward, for example, by the following formula (4).

In formula (4), the left-hand side denotes the integrated reward of agent i, and the function l(·) has three common forms: the averaging operation (global average) l_MO, which directly averages all the input embedded reward vectors and uses the result as the integrated reward; the simplified averaging operation (local average) l_SMO, which averages only the embedded reward vector m_i corresponding to agent i as the integrated reward; and the selection operation (direct selection) l_SS, which directly selects the reward fed back by the environment as the output. It should be noted that any of the above manners may be used to obtain the integrated reward, and the embodiments of the present application are not limited thereto.
In some embodiments, referring to FIG. 4, the server obtains the mixed reward and the integrated reward through the reward aggregator (Reward Aggregation) in the Distributed Reward Estimation network. The multi-action-branch reward estimator in step 303 also belongs to the Distributed Reward Estimation network.
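The following numpy sketch illustrates the aggregation step described above: sampled rewards from each branch distribution form an embedded reward vector whose k-th entry is replaced by the environment reward, a policy-weighted combination gives the mixed reward, and a global average, local average, or direct selection gives the integrated reward. The concrete choice of operations and the function names are illustrative assumptions.

```python
import numpy as np

def embed_rewards(sampled, k, env_reward):
    """Embedded reward vector: sampled rewards, with position k replaced by r_k."""
    m = np.array(sampled, dtype=float)
    m[k] = env_reward
    return m

def mixed_reward(embedded, policy_probs):
    """Policy-weighted fusion of the embedded reward vector (averaging variant)."""
    return float(np.sum(policy_probs * embedded))

def integrated_reward(all_embedded, own_embedded, env_reward, mode="global"):
    if mode == "global":      # global average over all agents' embedded reward vectors
        return float(np.mean([m.mean() for m in all_embedded]))
    if mode == "local":       # local average over this agent's own embedded vector
        return float(own_embedded.mean())
    return float(env_reward)  # direct selection of the environment feedback

sampled = [0.2, -0.1, 0.5, 0.0]               # one sampled reward per action branch
policy = np.array([0.1, 0.2, 0.6, 0.1])       # actor's probabilities over branches
m = embed_rewards(sampled, k=2, env_reward=1.0)
print(mixed_reward(m, policy),
      integrated_reward([m], m, env_reward=1.0, mode="direct"))
```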
305. The server obtains action advantage values of the action branches based on the hybrid reward, the external information and a critic model of the intelligent decision model, and the critic model is used for evaluating the action branches based on the external information.
The critic model is also the centralized critic network.
In one possible implementation, the server trains the critic model based on the hybrid reward. The server inputs the external information and the action branches into the critic model, and the critic model outputs action advantage values of the action branches.
In order to more clearly explain the above embodiment, the above embodiment will be explained in two parts.
The first part, the server, trains the critic model based on the hybrid reward and the external information.
In one possible implementation, the server trains the critic model based on the hybrid reward, the plurality of action branches, and the external information of the target environment collected by the robot at the next time step.
For example, the server inputs the hybrid reward, the plurality of action branches, the external information and the external information of the target environment acquired by the robot at the next time step into the critic model, and the critic model encodes the hybrid reward, the plurality of action branches, the external information and the external information of the target environment acquired by the robot at the next time step to obtain the encoding characteristic. And the server aggregates and fully connects the coding features at least once through the critic model, and outputs a reference evaluation value and a target evaluation value for the action branches, wherein the reference evaluation value is used for evaluating the action branches and the external information, and the target evaluation value is used for evaluating the action branches and the external information of the target environment collected by the robot at the next time step. The server trains the critic model based on difference information between the target evaluation value and the reference evaluation value.
For example, the server optimizes the critic model V_{γ,ψ} by minimizing the Bellman residual, that is, the critic model is trained by the following formula (5).

In formula (5), J_c(ψ) is the function value of the optimization function, and ψ and its target copy respectively represent the parameters of the current state-value network and the parameters of the target state-value network. V_{γ,ψ} is the critic model with its parameters, γ represents the action branch, and o' represents the external information of the target environment collected by the robot at the next time step.
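A sketch, under the same illustrative PyTorch assumptions as above, of training a centralized critic by minimizing a Bellman residual whose target uses the mixed reward, in the spirit of formula (5); because formula (5) is not reproduced here, the target construction below (mixed reward plus discounted target-network value) is an assumption.

```python
import torch
import torch.nn as nn

def critic_update(critic, target_critic, optimizer, obs, next_obs, mixed_reward, gamma=0.99):
    # Target value built from the mixed reward and the target state-value network.
    with torch.no_grad():
        target = mixed_reward + gamma * target_critic(next_obs)
    # Bellman residual between the target and the current state-value estimate.
    loss = ((target - critic(obs)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

critic = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
target_critic = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
target_critic.load_state_dict(critic.state_dict())   # initialize target copy
opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
```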
And a second part, wherein the server inputs the external information and the action branches into the critic model, and the critic model outputs action advantage values of the action branches.
Referring to the above formula (5), the term inside the square is the action advantage value. However, when the action advantage value is computed for training the distributed executor model, the hybrid reward is not used; instead, the integrated reward is used, that is, the action advantage value is computed from the integrated reward.
In some embodiments, the server trains a reward value estimation model of the intelligent decision model based on the external information.
306. The server trains the intelligent decision model based on the action dominance value and the integrated reward.
In one possible implementation, the server constructs a loss function for the distributed actor model based on the action dominance value and the integrated reward. And the server trains the distributed executor model by adopting a gradient descent method based on the loss function.
In some embodiments, the loss function is a loss function similar to that of Proximal Policy Optimization (PPO).
For example, the server trains the distributed actor model by the following formula (6), in which the superscript of the agent index is omitted.

In formula (6), the ratio term denotes the importance weight, ε is a clipping hyper-parameter whose value is generally chosen empirically in practice, the entropy term denotes the policy entropy of agent i, its coefficient governs the degree of importance of the entropy loss in the final J_a(θ), and the action advantage value is correlated with the integrated reward.
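A sketch of a PPO-style clipped objective for the distributed executor in which the advantage is computed from the integrated reward, matching the description of formula (6); the clipping value and entropy coefficient below are illustrative defaults, not values specified in the patent.

```python
import torch

def actor_loss(new_log_probs, old_log_probs, advantages, entropy,
               clip_eps=0.2, entropy_coef=0.01):
    # Importance weight between the current policy and the behavior policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Clipped surrogate objective; the advantages are computed with the
    # integrated (lumped) reward rather than the mixed reward.
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -(surrogate.mean() + entropy_coef * entropy.mean())
```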
It should be noted that the above steps 301 to 306 describe one iteration of training the distributed actor model as an example; other training iterations use the same training method as steps 301 to 306 and are not repeated here.
In addition, the steps 301 to 306 are described by taking a server as an execution subject to perform training of the intelligent decision model, and in other possible embodiments, the steps 301 to 306 may also be executed by a terminal, which is not limited in this embodiment of the present application.
After the training of the intelligent decision model is completed, the distributed executor model may be deployed to a plurality of robots, and the navigation of the plurality of robots may be achieved by placing the plurality of robots in a specified environment, for example, tasks may be set for the plurality of robots, and the robots are moved from a position a to a position B in the specified environment, where the specified environment includes a plurality of terrains and obstacles. The multiple robots can observe the designated environment, acquire external information of the designated environment, and input the external information into the trained intelligent decision model, so that corresponding actions can be obtained, such as going straight, turning, lifting legs and the like, and the multiple robots can complete tasks of moving from the position A to the position B by executing the actions.
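A sketch of the deployment stage described above: each robot loads the trained distributed executor, observes the specified environment, and executes the predicted action until the task is finished. The environment and robot interfaces (env.observe, env.step) are hypothetical placeholders.

```python
import torch

def run_robot(actor, env, robot_id, max_steps=1000):
    """Hypothetical inference loop for one robot using its trained actor."""
    obs = env.observe(robot_id)                      # external information
    for _ in range(max_steps):
        with torch.no_grad():
            probs = actor(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
        action = int(probs.argmax(dim=-1))           # e.g. go straight, turn, lift leg
        obs, done = env.step(robot_id, action)
        if done:                                     # task finished (reached position B)
            break
```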
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
In the experiments, a multi-agent environment comprising three cooperative scenarios was selected for testing. The three scenarios are CN, REF, and TREA. The first two are scenarios in the particle system, and the third is a harder version of Cooperative Treasure Collection developed based on the particle system. The compared methods include MADDPG, MAAC, QMIX, IQL, and MAPPO, where MAPPO is a recent algorithm that extends PPO to multi-agent scenarios using the centralized-training decentralized-execution framework. FIG. 5 shows the experimental results: the method provided by the embodiments of the present application achieves the best performance in all three scenarios, which indicates that it can indeed capture the reward uncertainty in the environment; at the same time, as can be seen from the performance curves, the method successfully achieves a stable training process during learning.
According to the technical solution provided by the embodiments of the application, the external information collected by the robot in the target environment is obtained and input into the intelligent decision model, and the distributed executor model of the intelligent decision model outputs a plurality of action branches, which are the actions the robot may execute in the target environment given the external information. Based on the external information and the action branches, the reward value distribution of each action branch is determined, that is, the action branches are evaluated. Reward aggregation is then performed based on the reward value distributions of the plurality of action branches to determine a mixed reward and an integrated reward. Training the intelligent decision model based on the mixed reward, the integrated reward and the external information achieves a stable training effect.
The embodiment of the present application provides a multi-agent distributed reward estimation and policy-weighted aggregation framework (i.e., DRE-MARL in Fig. 5) for capturing and reducing the learning instability caused by uncertain reward signals, so as to achieve effective and robust model training. Within the framework, the possible reward signals on all action branches in a given state are considered comprehensively, providing a more stable reward signal for the critic (Critic) network to update the model parameters. First, distributed multi-action-branch reward estimation is proposed, and a separate reward distribution is constructed for each action branch. Second, reward signals are sampled from the distributions of the different action branches to form a sampled reward vector; the reward on the executed action branch is then replaced with the reward obtained from the environment to form an embedded reward vector; finally, various reward aggregation operations are performed on the embedded reward vector to obtain a hybrid reward signal, which serves as the reward signal for the critic network and the decentralized actor (Decentralized Actor) network. The experimental results show that the method provided by the embodiment of the present application delivers excellent performance, which demonstrates its effectiveness.
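As a reading of the sampling-and-embedding step above (not the exact implementation of this embodiment), the sketch below assumes one Gaussian reward distribution per action branch: a reward is sampled for every branch, and the entry of the branch actually executed is replaced by the reward returned by the environment to form the embedded reward vector.

```python
# Sketch of distributed reward estimation; the Gaussian form is an assumption.
import numpy as np


def embedded_reward_vector(mu, sigma, executed_branch, env_reward, rng=None):
    """mu, sigma: per-branch reward distribution parameters, shape (num_branches,)."""
    rng = rng or np.random.default_rng()
    sampled = rng.normal(mu, sigma)          # sampled reward vector
    embedded = sampled.copy()
    embedded[executed_branch] = env_reward   # embed the real environment reward
    return embedded
```

The aggregation operations applied to this embedded vector to produce the hybrid and integrated rewards are illustrated further below.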
Fig. 6 is a schematic structural diagram of a training apparatus for an intelligent decision model according to an embodiment of the present application, and referring to fig. 6, the apparatus includes: an external information acquisition module 601, an action prediction module 602, a reward value prediction module 603, a reward value aggregation module 604, and a training module 605.
The external information obtaining module 601 is configured to obtain external information collected by the robot in a target environment, where the external information includes external environment information and interaction information, the external environment information is information obtained by the robot observing the target environment, and the interaction information is information obtained by the robot interacting with other robots in the target environment.
The action prediction module 602 is configured to input the external information into an intelligent decision model of the robot, perform prediction based on the external information by a distributed executor model of the intelligent decision model, and output a plurality of action branches of the robot, where the action branches are actions that the robot may perform in the target environment.
A reward value prediction module 603 configured to determine a reward value distribution of each of the action branches based on the external information and the plurality of action branches.
The reward value aggregation module 604 is configured to perform reward aggregation on the reward values sampled from the reward value distribution of each action branch to obtain a hybrid reward and an integrated reward.
A training module 605 for training the intelligent decision model of the robot based on the hybrid reward, the integrated reward and the external information.
In a possible embodiment, the action prediction module 602 is configured to perform at least one of full connection, convolution, and attention coding on the external information via the distributed executor model of the intelligent decision model to obtain external information features of the external information, and to fully connect and normalize the external information features via the distributed executor model and output a plurality of action branches of the robot.
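A minimal PyTorch sketch of such an executor is given below, assuming a fully connected encoder (the embodiment equally allows convolution or attention coding) followed by a fully connected layer and softmax normalization; the layer sizes are illustrative only.

```python
# Illustrative decentralized executor; architecture details are assumptions.
import torch
import torch.nn as nn


class DecentralizedActor(nn.Module):
    def __init__(self, obs_dim: int, num_action_branches: int, hidden: int = 128):
        super().__init__()
        # Encode the external information into a feature vector.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Fully connect and normalize to get one probability per action branch.
        self.head = nn.Linear(hidden, num_action_branches)

    def forward(self, external_info: torch.Tensor) -> torch.Tensor:
        features = self.encoder(external_info)
        return torch.softmax(self.head(features), dim=-1)
```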
In a possible implementation manner, the reward value prediction module 603 is configured to input the external information and the action branches into a reward value estimation model, perform reward value distribution estimation through the reward value estimation model based on the external information and the action branches, and output a reward value distribution of each action branch.
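One way to realize such a reward value estimation model is sketched below: a shared backbone predicts a mean and a standard deviation per action branch, i.e. one Gaussian reward distribution for each branch. The Gaussian parameterization and the network shape are assumptions made for illustration.

```python
# Hypothetical reward-distribution estimator; the parameterization is assumed.
import torch
import torch.nn as nn


class RewardDistributionEstimator(nn.Module):
    def __init__(self, obs_dim: int, num_action_branches: int, hidden: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, num_action_branches)
        self.log_sigma_head = nn.Linear(hidden, num_action_branches)

    def forward(self, external_info: torch.Tensor):
        h = self.backbone(external_info)
        mu = self.mu_head(h)
        sigma = self.log_sigma_head(h).exp()   # keep the scale positive
        return mu, sigma
```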
In a possible implementation, the reward value aggregation module 604 is configured to sample the reward value distribution of each action branch to obtain a sampled reward value of each action branch, perform policy-weighted fusion on the sampled reward values of the action branches to obtain the hybrid reward, and perform any one of global averaging, local averaging, and direct selection on the sampled reward values of the action branches to obtain the integrated reward.
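The aggregation choices named above can be written out as follows for a single embedded reward vector `r` and the executor's action probabilities `pi`. The exact weighting and the meaning of "local averaging" are not fixed by this embodiment; the probability-weighted sum and the top-k reading below are assumptions.

```python
# Reward aggregation sketch; weighting details are illustrative assumptions.
import numpy as np


def hybrid_reward(r: np.ndarray, pi: np.ndarray) -> float:
    """Policy-weighted fusion of the per-branch rewards."""
    return float(np.dot(pi, r))


def integrated_reward(r: np.ndarray, executed_branch: int, mode: str = "global") -> float:
    if mode == "global":      # global averaging over all branches
        return float(r.mean())
    if mode == "local":       # local averaging, read here as averaging the top-3 branches
        k = min(3, r.size)
        return float(np.sort(r)[-k:].mean())
    if mode == "direct":      # direct selection of the executed branch's reward
        return float(r[executed_branch])
    raise ValueError(f"unknown aggregation mode: {mode}")
```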
In a possible implementation, the training module 605 is configured to obtain the action advantage values of the plurality of action branches based on the hybrid reward, the external information, and a critic model of the intelligent decision model, the critic model being configured to evaluate the plurality of action branches based on the external information; to train a reward value estimation model of the intelligent decision model based on the external information; and to train a distributed actor model of the intelligent decision model based on the action advantage values and the integrated reward.
In one possible implementation, the training module 605 is configured to train the critic model of the intelligent decision model based on the hybrid reward and the external information, input the external information and the plurality of action branches into the critic model, and output the action advantage values of the plurality of action branches via the critic model.
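One plausible critic update consistent with the description above is sketched below: the critic regresses its value estimate toward the hybrid reward plus a bootstrapped next-state value, and the resulting TD error is used as the action advantage. The TD-style target and the discount factor are assumptions; this embodiment does not fix the critic's loss.

```python
# Illustrative critic update; the TD-style target is an assumption.
import torch
import torch.nn.functional as F


def critic_step(critic, optimizer, state, next_state, hybrid_r, gamma=0.99):
    with torch.no_grad():
        target = hybrid_r + gamma * critic(next_state)
    value = critic(state)
    loss = F.mse_loss(value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Advantage passed to the executor update: how much better the outcome
    # was than the critic's current estimate.
    return (target - value).detach()
```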
In one possible implementation, the training module 605 is configured to construct a loss function of the distributed actor model based on the action advantage values and the integrated reward, and to train the distributed actor model using a gradient descent method based on the loss function.
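A minimal policy-gradient form of such a loss function is sketched below. How exactly the integrated reward enters the loss is not spelled out here; adding it to the advantage as an extra, more stable learning signal is an illustrative assumption.

```python
# Illustrative executor (actor) update: gradient descent on a policy-gradient loss.
import torch


def actor_step(actor, optimizer, external_info, executed_branch, advantage, integrated_r):
    probs = actor(external_info)                       # probabilities per action branch
    log_prob = torch.log(probs[executed_branch] + 1e-8)
    loss = (-(advantage + integrated_r) * log_prob).mean()
    optimizer.zero_grad()
    loss.backward()                                    # gradient descent step on the loss
    optimizer.step()
    return loss.item()
```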
It should be noted that when the training apparatus for an intelligent decision model provided in the foregoing embodiment trains the intelligent decision model, the division into the above functional modules is used merely as an example; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the computer device may be divided into different functional modules to complete all or part of the functions described above. In addition, the training apparatus for the intelligent decision model provided in the foregoing embodiment and the embodiment of the training method for the intelligent decision model belong to the same concept; the specific implementation process is described in detail in the method embodiment and is not repeated here.
According to the technical solution provided by the embodiment of the present application, external information collected by the robot in the target environment is obtained and input into the intelligent decision model, and the distributed executor model of the intelligent decision model outputs a plurality of action branches, which are the actions the robot can perform in the target environment given the external information. Based on the external information and the action branches, the reward value distribution of each action branch is determined, that is, the action branches are evaluated. Reward aggregation is then performed on the reward value distributions of the plurality of action branches to determine a hybrid reward and an integrated reward. Training the intelligent decision model based on the hybrid reward, the integrated reward, and the external information achieves a stable training effect.
An embodiment of the present application provides a computer device, configured to execute the method described above, where the computer device may be implemented as a terminal or a server, and a structure of the terminal is introduced below:
fig. 7 is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 700 may be: a smartphone, a tablet, a laptop, or a desktop computer. Terminal 700 may also be referred to as a user equipment, portable terminal, laptop terminal, desktop terminal, or by other names.
In general, terminal 700 includes: one or more processors 701 and one or more memories 702.
In some embodiments, the terminal 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 704, a display screen 705, a camera assembly 706, an audio circuit 707, and a power supply 708.
The peripheral interface 703 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 701 and the memory 702. In some embodiments, processor 701, memory 702, and peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 704 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 704 communicates with communication networks and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on.
The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the ability to capture touch signals on or above its surface. The touch signal may be input to the processor 701 as a control signal for processing. In this case, the display screen 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard.
The camera assembly 706 is used to capture images or video. Optionally, camera assembly 706 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal.
The audio circuitry 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 701 for processing or inputting the electric signals to the radio frequency circuit 704 to realize voice communication.
The power supply 708 is used to power the various components in the terminal 700. The power source 708 may be alternating current, direct current, disposable batteries, or rechargeable batteries.
In some embodiments, terminal 700 can also include one or more sensors 709. The one or more sensors 709 include, but are not limited to: acceleration sensor 710, gyro sensor 711, pressure sensor 712, optical sensor 713, and proximity sensor 714.
The acceleration sensor 710 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the terminal 700.
The gyro sensor 711 may cooperate with the acceleration sensor 710 to acquire the user's 3D motion on the terminal 700, and may also acquire the body direction and rotation angle of the terminal 700.
The pressure sensor 712 may be disposed on a side frame of the terminal 700 and/or at a lower layer of the display screen 705. When the pressure sensor 712 is disposed on the side frame of the terminal 700, a holding signal of the user on the terminal 700 can be detected, and the processor 701 performs left-right hand recognition or a shortcut operation according to the holding signal collected by the pressure sensor 712. When the pressure sensor 712 is disposed at the lower layer of the display screen 705, the processor 701 controls an operability control on the UI according to the user's pressure operation on the display screen 705.
The optical sensor 713 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the display screen 705 based on the ambient light intensity collected by the optical sensor 713.
The proximity sensor 714 is used to collect a distance between the user and the front surface of the terminal 700.
Those skilled in the art will appreciate that the configuration shown in fig. 7 is not intended to be limiting of terminal 700 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
The computer device may also be implemented as a server, and the following describes a structure of the server:
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application. The server 800 may vary greatly depending on configuration or performance, and may include one or more processors (CPUs) 801 and one or more memories 802, where the one or more memories 802 store at least one computer program that is loaded and executed by the one or more processors 801 to implement the methods provided by the foregoing method embodiments. Of course, the server 800 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server 800 may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory including a computer program, executable by a processor, is also provided to perform the method of training an intelligent decision model in the above embodiments. For example, the computer readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, comprising program code stored in a computer-readable storage medium. A processor of a computer device reads the program code from the computer-readable storage medium and executes it, so that the computer device performs the training method of the intelligent decision model described above.
In some embodiments, the computer program according to the embodiments of the present application may be deployed to be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed at multiple sites and interconnected by a communication network, and the multiple computer devices distributed at multiple sites and interconnected by a communication network may constitute a blockchain system.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (10)
1. A method for training an intelligent decision model, the method comprising:
acquiring external information acquired by a robot in a target environment, wherein the external information comprises external environment information and interaction information, the external environment information is information obtained by observing the target environment by the robot, and the interaction information is information obtained by interacting the robot with other robots in the target environment;
inputting the external information into an intelligent decision model of the robot, performing prediction by a distributed executor model of the intelligent decision model based on the external information, and outputting a plurality of action branches of the robot, wherein the action branches are actions which can be executed by the robot in the target environment;
determining a reward value distribution of each action branch based on the external information and the plurality of action branches;
carrying out reward aggregation on the sampled reward values obtained by sampling from the reward value distribution of each action branch to obtain a hybrid reward and an integrated reward;
training an intelligent decision model of the robot based on the hybrid reward, the integrated reward, and the external information.
2. The method of claim 1, wherein the predicting by the distributed actor model of the intelligent decision model based on the external information, outputting a plurality of action branches for the robot comprises:
performing at least one of full connection, convolution and attention coding on the external information by a distributed executor model of the intelligent decision model to obtain external information characteristics of the external information;
and fully connecting and normalizing the external information features by a distributed executor model of the intelligent decision model, and outputting a plurality of action branches of the robot.
3. The method of claim 1, wherein determining a reward value distribution for each of the action branches based on the external information and the plurality of action branches comprises:
inputting the external information and the action branches into a reward value estimation model, performing reward value distribution estimation through the reward value estimation model based on the external information and the action branches, and outputting reward value distribution of each action branch.
4. The method of claim 1, wherein the carrying out reward aggregation on the sampled reward values obtained by sampling from the reward value distribution of each action branch to obtain the hybrid reward and the integrated reward comprises:
sampling the reward value distribution of each action branch to obtain a sampled reward value of each action branch;
carrying out policy-weighted fusion on the sampled reward values of the action branches to obtain the hybrid reward; and
carrying out any one of global averaging, local averaging, and direct selection on the sampled reward values of the action branches to obtain the integrated reward.
5. The method of claim 1, wherein training a smart decision model of the robot based on the hybrid reward, the integrated reward, and the external information comprises:
training a reward value estimation model of the intelligent decision model based on the external information;
obtaining action advantage values of the action branches based on the hybrid reward, the external information and a critic model of the intelligent decision model, wherein the critic model is used for evaluating the action branches based on the external information;
training a distributed actor model of the intelligent decision model based on the action advantage value and the integrated reward.
6. The method of claim 5, wherein obtaining the action advantage values for the plurality of action branches based on the hybrid reward, the external information, and a critic model of the intelligent decision model comprises:
training a critic model of the intelligent decision model based on the hybrid reward and the external information;
inputting the external information and the plurality of action branches into the critic model, and outputting action advantage values of the plurality of action branches by the critic model.
7. The method of claim 5, wherein the training a distributed actor model of the intelligent decision model based on the action advantage value and the integrated reward comprises:
constructing a loss function for the distributed actor model based on the action advantage value and the integrated reward;
and training the distributed executor model by adopting a gradient descent method based on the loss function.
8. An apparatus for training an intelligent decision model, the apparatus comprising:
the external information acquisition module is used for acquiring external information acquired by the robot in a target environment, wherein the external information comprises external environment information and interaction information, the external environment information is information obtained by observing the target environment by the robot, and the interaction information is information obtained by interacting the robot with other robots in the target environment;
the action prediction module is used for inputting the external information into an intelligent decision model of the robot, performing prediction by a distributed executor model of the intelligent decision model based on the external information, and outputting a plurality of action branches of the robot, wherein the action branches are actions which can be executed by the robot in the target environment;
the reward value prediction module is used for determining reward value distribution of each action branch based on the external information and the action branches;
the reward value aggregation module is used for carrying out reward aggregation on the sampled reward values sampled from the reward value distribution of each action branch to obtain a hybrid reward and an integrated reward;
and the training module is used for training the intelligent decision model of the robot based on the hybrid reward, the integrated reward, and the external information.
9. A computer device, characterized in that the computer device comprises one or more processors and one or more memories, in which at least one computer program is stored, which is loaded and executed by the one or more processors to implement the method of training of an intelligent decision model according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which at least one computer program is stored, which is loaded and executed by a processor to implement a method of training an intelligent decision model as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211172621.9A CN115648204B (en) | 2022-09-26 | 2022-09-26 | Training method, device, equipment and storage medium of intelligent decision model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115648204A true CN115648204A (en) | 2023-01-31 |
CN115648204B CN115648204B (en) | 2024-08-27 |
Family
ID=84985656
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211172621.9A Active CN115648204B (en) | 2022-09-26 | 2022-09-26 | Training method, device, equipment and storage medium of intelligent decision model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115648204B (en) |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111026272A (en) * | 2019-12-09 | 2020-04-17 | 网易(杭州)网络有限公司 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
US20220036186A1 (en) * | 2020-07-30 | 2022-02-03 | Waymo Llc | Accelerated deep reinforcement learning of agent control policies |
CN112329948A (en) * | 2020-11-04 | 2021-02-05 | 腾讯科技(深圳)有限公司 | Multi-agent strategy prediction method and device |
CN112476424A (en) * | 2020-11-13 | 2021-03-12 | 腾讯科技(深圳)有限公司 | Robot control method, device, equipment and computer storage medium |
WO2022135066A1 (en) * | 2020-12-25 | 2022-06-30 | 南京理工大学 | Temporal difference-based hybrid flow-shop scheduling method |
CN112861442A (en) * | 2021-03-10 | 2021-05-28 | 中国人民解放军国防科技大学 | Multi-machine collaborative air combat planning method and system based on deep reinforcement learning |
CN112843725A (en) * | 2021-03-15 | 2021-05-28 | 网易(杭州)网络有限公司 | Intelligent agent processing method and device |
CN113221444A (en) * | 2021-04-20 | 2021-08-06 | 中国电子科技集团公司第五十二研究所 | Behavior simulation training method for air intelligent game |
CN113435606A (en) * | 2021-07-01 | 2021-09-24 | 吉林大学 | Method and device for optimizing reinforcement learning model, storage medium and electronic equipment |
CN113392935A (en) * | 2021-07-09 | 2021-09-14 | 浙江工业大学 | Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism |
CN113609786A (en) * | 2021-08-27 | 2021-11-05 | 中国人民解放军国防科技大学 | Mobile robot navigation method and device, computer equipment and storage medium |
CN114004149A (en) * | 2021-10-29 | 2022-02-01 | 深圳市商汤科技有限公司 | Intelligent agent training method and device, computer equipment and storage medium |
CN114723065A (en) * | 2022-03-22 | 2022-07-08 | 中国人民解放军国防科技大学 | Optimal strategy obtaining method and device based on double-layer deep reinforcement learning model |
CN114662639A (en) * | 2022-03-24 | 2022-06-24 | 河海大学 | Multi-agent reinforcement learning method and system based on value decomposition |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116922397A (en) * | 2023-09-13 | 2023-10-24 | 成都明途科技有限公司 | Robot intelligent level measuring method and device, robot and storage medium |
CN116922397B (en) * | 2023-09-13 | 2023-11-28 | 成都明途科技有限公司 | Robot intelligent level measuring method and device, robot and storage medium |
CN117933096A (en) * | 2024-03-21 | 2024-04-26 | 山东省科学院自动化研究所 | Unmanned countermeasure test scene generation method and system |
Also Published As
Publication number | Publication date |
---|---|
CN115648204B (en) | 2024-08-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110909630B (en) | Abnormal game video detection method and device | |
CN115648204B (en) | Training method, device, equipment and storage medium of intelligent decision model | |
CN112329826B (en) | Training method of image recognition model, image recognition method and device | |
CN111666919B (en) | Object identification method and device, computer equipment and storage medium | |
CN111813532A (en) | Image management method and device based on multitask machine learning model | |
CN113284142B (en) | Image detection method, image detection device, computer-readable storage medium and computer equipment | |
CN114297730A (en) | Countermeasure image generation method, device and storage medium | |
CN114261400B (en) | Automatic driving decision method, device, equipment and storage medium | |
CN111044045A (en) | Navigation method and device based on neural network and terminal equipment | |
CN110516113B (en) | Video classification method, video classification model training method and device | |
CN112699832B (en) | Target detection method, device, equipment and storage medium | |
CN112115900B (en) | Image processing method, device, equipment and storage medium | |
CN118018426B (en) | Training method, detecting method and device for network anomaly intrusion detection model | |
CN115018017A (en) | Multi-agent credit allocation method, system and equipment based on ensemble learning | |
CN116310318A (en) | Interactive image segmentation method, device, computer equipment and storage medium | |
CN111008622B (en) | Image object detection method and device and computer readable storage medium | |
CN112527104A (en) | Method, device and equipment for determining parameters and storage medium | |
Desai et al. | Auxiliary tasks for efficient learning of point-goal navigation | |
CN116883961A (en) | Target perception method and device | |
CN116301022A (en) | Unmanned aerial vehicle cluster task planning method and device based on deep reinforcement learning | |
CN115222769A (en) | Trajectory prediction method, device and agent | |
CN116152289A (en) | Target object tracking method, related device, equipment and storage medium | |
CN113822293A (en) | Model processing method, device and equipment for graph data and storage medium | |
Zangirolami et al. | Dealing with uncertainty: Balancing exploration and exploitation in deep recurrent reinforcement learning | |
CN118171744B (en) | Method and device for predicting space-time distribution, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||