CN112215328A - Training of intelligent agent, and action control method and device based on intelligent agent - Google Patents
Training of intelligent agent, and action control method and device based on intelligent agent
- Publication number
- CN112215328A CN112215328A CN202011176683.8A CN202011176683A CN112215328A CN 112215328 A CN112215328 A CN 112215328A CN 202011176683 A CN202011176683 A CN 202011176683A CN 112215328 A CN112215328 A CN 112215328A
- Authority
- CN
- China
- Prior art keywords
- potential energy
- data
- state information
- environment
- agent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The application relates to the technical field of artificial intelligence, and in particular to a method and apparatus for training an agent and for agent-based action control. Environment interaction data generated based on an agent is acquired, where the environment interaction data at least comprises environment state information and an influence factor, as well as the action and instant excitation data obtained by the agent in response to the environment state information. A target potential energy value is obtained through a trained potential energy function according to the environment state information and the influence factor, and positive excitation data or negative excitation data is obtained according to a standard potential energy value corresponding to the environment state information and the target potential energy value. Target excitation data is acquired according to the instant excitation data and the positive or negative excitation data, and reinforcement learning training is performed on the agent according to the environment state information, the action and the target excitation data, so that the training efficiency and accuracy of the agent can be improved.
Description
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a training method for an agent and an agent-based action control method and apparatus.
Background
Artificial Intelligence (AI) is a new technical science that studies the theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. For example, for AI learning in complex games, the Reinforcement Learning (RL) algorithm is mainly adopted.
In reinforcement learning, an agent learns in a trial-and-error manner: through a reasonably designed excitation, the reward obtained from the interaction between actions and the environment guides the agent's behavior, with the goal of maximizing the agent's benefit, i.e., the agent continuously explores to obtain the maximum expected benefit and thereby reach the final expected goal.
Disclosure of Invention
The embodiment of the application provides a training method of an intelligent agent, and an action control method and device based on the intelligent agent, so as to improve the training efficiency and accuracy of the intelligent agent.
The embodiment of the application provides the following specific technical scheme:
an embodiment of the present application provides a training method for an agent, including:
acquiring environment interaction data generated based on an agent, wherein the environment interaction data at least comprises environment state information and an influence factor, as well as the action and instant excitation data obtained by the agent in response to the environment state information;
obtaining a target potential energy value through a trained potential energy function according to the environmental state information and the influence factor, and obtaining positive excitation data or negative excitation data according to a standard potential energy value corresponding to the environmental state information and the target potential energy value;
acquiring target excitation data according to the instant excitation data and the positive excitation data or the negative excitation data;
and performing reinforcement learning training on the intelligent agent according to the environment state information, the action and the target incentive data.
Another embodiment of the present application provides an agent-based action control method, including:
acquiring to-be-processed environment state information corresponding to a set environment;
based on a trained agent, taking the to-be-processed environment state information as input, and obtaining an action responding to the to-be-processed environment state information, wherein the agent is obtained by performing reinforcement learning training on a training data sample set containing environment state information, the action and target excitation data, the target excitation data is obtained according to instant excitation data and positive excitation data or negative excitation data, and the positive excitation data or the negative excitation data is obtained according to a standard potential energy value corresponding to the environment state information and a target potential energy value obtained through a trained potential energy function;
outputting, by the agent, the action to the set environment.
Another embodiment of the present application provides a training apparatus for an agent, including:
an acquisition module, configured to acquire environment interaction data generated based on an agent, wherein the environment interaction data at least comprises environment state information and an influence factor, as well as the action and instant excitation data obtained by the agent in response to the environment state information;
the judging module is used for obtaining a target potential energy value through a trained potential energy function according to the environment state information and the influence factor, and obtaining positive excitation data or negative excitation data according to a standard potential energy value corresponding to the environment state information and the target potential energy value;
the determining module is used for obtaining target excitation data according to the instant excitation data and the positive excitation data or the negative excitation data;
and the first training module is used for carrying out reinforcement learning training on the intelligent agent according to the environment state information, the action and the target incentive data.
Another embodiment of the present application provides an agent-based action control apparatus, including:
the acquisition module is used for acquiring the state information of the environment to be processed corresponding to the set environment;
the processing module is used for obtaining an action responding to the environmental state information to be processed by taking the environmental state information to be processed as input based on a trained intelligent agent, wherein the intelligent agent is obtained by performing reinforcement learning training on a training data sample set containing the environmental state information, the action and target excitation data, the target excitation data is obtained according to instant excitation data and positive excitation data or negative excitation data, and the positive excitation data or the negative excitation data is obtained according to a standard potential energy value corresponding to the environmental state information and a target potential energy value obtained through a trained potential energy function;
and the output module is used for outputting the action to the set environment through the intelligent agent.
Another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the methods for training an agent or the method for controlling an action based on an agent when executing the program.
Another embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of any one of the above agent training methods or agent-based action control methods.
Another embodiment of the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the agent or the action control method based on the agent provided in any one of the above-mentioned various optional implementation modes.
In the embodiments of the application, environment interaction data generated based on an agent is obtained, where the data at least includes environment state information, an influence factor, and the action and instant excitation data obtained by the agent in response to the environment state information. Value evaluation is performed through a trained potential energy function, the final target excitation data is obtained by combining the instant excitation data, and reinforcement learning training is then performed on the agent according to the environment state information, the action and the target excitation data. In this way, the excitation mechanism used during reinforcement learning is constructed by combining the value evaluation of the potential energy function with the instant excitation data generated by the agent, so the agent can be stimulated, without manual design, to explore strategies step by step in the direction of higher potential energy. The value evaluation of the potential energy function avoids the problem of the agent falling into a local optimum of the strategy caused by the original dense excitation, effectively improves the agent's exploration of the optimal strategy, improves the training convergence speed and efficiency, and thereby improves the accuracy of the final agent's action control.
Drawings
FIG. 1 is a diagram illustrating the effect of comparing sparse excitation and dense excitation in the related art;
FIG. 2 is a schematic diagram of an application architecture provided in an embodiment of the present application;
FIG. 3 is a flowchart of a method for training an agent according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a feature extraction principle in an embodiment of the present application;
FIG. 5 is a schematic diagram of a self-playing principle in the embodiment of the present application;
FIG. 6 is a schematic diagram of a discrimination principle based on potential energy function in the embodiment of the present application;
FIG. 7 is a schematic diagram of a potential energy function training principle in an embodiment of the present application;
FIG. 8 is a system diagram illustrating a method for training an agent in an embodiment of the present application;
FIG. 9 is a flow chart of another method for training agents in accordance with an embodiment of the present application;
FIG. 10 is a flow chart of a method for controlling actions based on agents in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an intelligent agent training device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an agent-based action control apparatus according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For the purpose of facilitating an understanding of the embodiments of the present application, a brief introduction of several concepts is provided below:
reinforcement Learning (RL): also known as refile learning, evaluative learning, or reinforcement learning, is one of the paradigms and methodologies of machine learning to describe and solve the problem of an agent (agent) in interacting with the environment to achieve maximum return or achieve a specific goal through a learning strategy, wherein a learning strategy is a rule of behavior of an agent used in reinforcement learning, and a learning strategy is generally a neural network.
The training process of reinforcement learning in general may be: the agent interacts with the environment multiple times to obtain the action, state and excitation (reward) of each interaction; the agent is then trained once using these groups of actions, states and excitations as training data; this process is repeated for the next round of training until the convergence condition is met.
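For illustration only, the generic loop just described might be sketched as follows in Python; the agent and environment interfaces (act, update, reset, step) are assumptions for the example and not the concrete implementation of the embodiments:

    def train_by_interaction(agent, env, num_rounds=1000, steps_per_round=2048):
        """Generic reinforcement-learning loop: collect (state, action, reward)
        tuples by interacting with the environment, train the agent once on the
        collected batch, and repeat until a convergence condition is met."""
        for _ in range(num_rounds):
            batch = []
            state = env.reset()
            for _ in range(steps_per_round):
                action = agent.act(state)                    # action responding to the state
                next_state, reward, done = env.step(action)  # instant excitation data
                batch.append((state, action, reward))
                state = env.reset() if done else next_state
            agent.update(batch)                              # one round of training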
Agent: an important concept in the field of artificial intelligence; any independent entity that has its own ideas and can interact with its environment can be abstracted as an agent.
Excitation mechanism: in reinforcement learning, the excitation mechanism guides the agent, through continuous exploration, to obtain the maximum expected benefit; it is therefore very important for the training of the agent.
Potential energy function: a function constructed in the embodiments of the application, mainly used to stimulate the agent to explore strategies gradually from low potential energy towards high potential energy. The potential energy function is a neural network, obtained by having a neural network model learn from environment state information, influence factors and the corresponding standard potential energy values.
Proximal Policy Optimization (PPO) algorithm: an improved algorithm for Policy Gradient. The key point of PPO is that it converts the on-policy training process of Policy Gradient into an off-policy one through a method called importance sampling, i.e., from online learning to offline learning, which is in a certain sense similar to the experience replay used in value-based iterative algorithms. Experimentally, this improvement clearly increases training speed and training effect compared with the plain Policy Gradient algorithm.
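As an illustrative sketch only, the clipped surrogate objective commonly used with PPO can be written as below, assuming PyTorch and hypothetical inputs of new and old log-probabilities and advantage estimates (the exact objective used in the embodiments is not specified here):

    import torch

    def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
        """Clipped surrogate objective of PPO. The importance-sampling ratio
        r = pi_new(a|s) / pi_old(a|s) lets data collected under the old
        (behaviour) policy be reused to train the new policy off-policy."""
        ratio = torch.exp(log_prob_new - log_prob_old)   # importance-sampling ratio
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
        return -torch.min(unclipped, clipped).mean()     # maximize surrogate, so minimize its negative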
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
Computer Vision technology (Computer Vision, CV): computer vision is a science that studies how to make machines "see"; it uses cameras and computers, instead of human eyes, to identify, track and measure targets and to perform further image processing, so that the processed images are more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and map construction, and also common biometric technologies such as face recognition and fingerprint recognition. For example, in the embodiments of the present application, image feature extraction is mainly performed on the image information in the environment or self-play data by using the image feature extraction technology of image semantic understanding.
Machine Learning (ML): a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or realizes human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its performance. Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction. For example, the embodiments of the present application mainly use the reinforcement learning technology of machine learning, controlling the agent's reinforcement learning strategy through the value evaluation of a potential energy function, thereby improving the agent's exploration capability and diversity. As another example, the embodiments of the present application obtain the potential energy function by training a neural network model with deep learning.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to the computer vision technology, the machine learning technology and other technologies of artificial intelligence, and is specifically explained by the following embodiments:
in practice, compared with turn-based chess games with complete information, more complex electronic games such as Multiplayer Online Battle Arena (MOBA) games, Real-Time Strategy (RTS) games and First-Person Shooter (FPS) games are characterized by multiplayer cooperation, incomplete information and strategic confrontation. For example, in some MOBA games the players are divided into two opposing camps that compete on a map to destroy the enemy; this requires the AI not only to operate efficiently on the microscopic details of an agent in a specific scene, but also to have macroscopic awareness of team cooperation and strategic confrontation.
For AI learning in complex games, the RL algorithm is currently the main approach to designing AI: by designing a reasonable excitation mechanism, the game agent continuously explores to obtain the maximum expected benefit and finally win the game. Reinforcement learning algorithms have the following advantages: 1) they do not depend on existing human player data, and the final model capability can exceed human performance; 2) data can be generated for learning through self-play, which facilitates parallelization of the algorithm and greatly improves the training efficiency of the model; 3) a deep neural network can be used to model the high-dimensional, complex, continuous state space of the game and the action space of the agent.
However, in the related art, the reinforcement learning algorithm mainly relies on a hand-crafted excitation mechanism and trains the game agent through multiple self-play simulations, and the different factors of the excitation (reward) and their corresponding weights need to be continuously adjusted and designed based on domain knowledge of the game. In general, dense excitations (e.g., health, kills) accelerate the learning of the game agent better than sparse excitations (the win or loss of the game). But hand-designed dense excitations have significant drawbacks:
1) They depend on the ability of the excitation designer, who needs a deep understanding and mastery of knowledge in the field of the target game. 2) For example, referring to FIG. 1, which compares sparse excitation and dense excitation in the related art: sparse excitation is difficult to explore and converges slowly, while numerous dense excitations easily trap the agent in a locally optimal strategy in which a single dense excitation is repeatedly harvested, hindering the improvement of the agent's capability. 3) Hand-designed excitations are relatively fixed, so the strategy formed after the agent's training converges tends to be fixed and single, and is easily countered in actual combat.
Therefore, in view of the above problems, the embodiments of the present application provide a training method for an agent. Specifically, environment interaction data generated by the agent is acquired, where the data at least includes environment state information, an influence factor, and the action and instant excitation data obtained by the agent in response to the environment state information. According to the environment state information and the influence factor, a target potential energy value is obtained through a trained potential energy function, and positive excitation data or negative excitation data is obtained according to the standard potential energy value corresponding to the environment state information and the target potential energy value. Target excitation data is then obtained from the instant excitation data and the positive or negative excitation data, and reinforcement learning training is performed on the agent according to the environment state information, the action and the target excitation data. In this way, when the agent is trained with reinforcement learning, a potential energy function is constructed and its value evaluation is added to the excitation mechanism, so the agent can be stimulated to explore strategies gradually from low potential energy towards high potential energy. The value evaluation of the potential energy function avoids the problem of the agent falling into a local optimum of the strategy caused by the original dense excitation, and effectively improves the agent's exploration of the optimal strategy. Meanwhile, since specific or diverse sample data can be freely selected for training the potential energy function, the direction of the agent's strategy exploration can be controlled and adjusted, improving the diversity and interpretability of the final agent strategy.
Fig. 2 is a schematic diagram of an application architecture provided in the embodiment of the present application, including a terminal 100 and a server 200.
The terminal 100 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. Various applications can be installed on the terminal 100, and a trained agent can be loaded through the terminal 100 to implement the corresponding business processing. For example, a game application is installed in the terminal 100; the user can start the game application in the terminal 100 and select a human-machine confrontation mode, the trained agent is then obtained from the server 200 side, and the game agent produces the machine's action output to play against the user.
The server 200 is a background server of the terminal 100 and can provide various network services for the terminal 100; for different applications, the server 200 may be regarded as the corresponding background server. For example, the server 200 may train a neural network model on a potential energy function training sample set to obtain the potential energy function. As another example, the server 200 obtains environment interaction data generated based on the agent, obtains the value evaluation of the potential energy function (i.e., positive excitation data or negative excitation data), further obtains the final target excitation data, and performs reinforcement learning training on the agent according to the environment state information, the action and the target excitation data, thereby obtaining a trained agent.
The server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
The terminal 100 and the server 200 may be directly or indirectly connected through wired or wireless communication, which is not limited in this application, for example, fig. 2 illustrates that the terminal 100 and the server 200 are connected through the internet to realize communication therebetween.
Optionally, the internet described above uses standard communication techniques, protocols, or a combination of both. The internet is typically the internet, but can be any Network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), any combination of mobile, wireline or wireless networks, private or virtual private networks. In some embodiments, data exchanged over a network is represented using techniques and/or formats including Hypertext Mark-up Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
It should be noted that the training method for agents and the action control method based on agents in the embodiment of the present application are mainly executed by the server 200, but the action control method based on agents in the embodiment of the present application may also be executed by the terminal 100, and the embodiment of the present application is not limited thereto.
In addition, the training process of the agent in the embodiments of the present application, as well as the training process of the potential energy function, is generally performed on the server 200 side, given the limited performance of the terminal 100.
It should be noted that the application architecture diagram in the embodiment of the present application is used to more clearly illustrate the technical solutions in the embodiments and does not constitute a limitation on them; nor are the solutions limited to the game business scenario. For other application architectures and business scenarios, such as robot scenarios or intelligent chat and message processing scenarios in instant messaging applications, the technical solutions provided in the embodiments of the present application are equally applicable to similar problems. In the embodiments of the present application, the application architecture shown in fig. 2 is taken as an example for schematic illustration.
Based on the above embodiment, referring to fig. 3, a flowchart of a method for training an agent in the embodiment of the present application is shown, where the method includes:
step 300: the method comprises the steps of obtaining environment interaction data generated based on an intelligent agent, wherein the environment interaction data at least comprise environment state information and influence factors, and action and instant incentive data which are obtained through the intelligent agent and respond to the environment state information.
The environment interaction data is generated by the interaction between the agent and the environment. For example, in a game scene the interaction environment is the game environment, and the environment state information is the description of the current game state; in a two-side confrontation game, it can be one side's current teammate casualties, position information, health information, game level information and the like. The influence factor is a factor item, set according to requirements and the specific situation, that influences the value of the potential energy function; it can be obtained from the environment state information, or it can be extracted from the environment interaction data without being contained in the environment state information. The environment state information is input into the agent, which outputs an action in response to it, and instant excitation data, i.e., the reward benefit, is obtained after the action is output. The action includes the actions the agent applies to the game scene, for example, but not limited to: the agent's moving direction, moving distance, moving speed, moving position, and the number of the interactive entity the agent acts on.
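As a rough sketch only, one frame of such environment interaction data could be represented as follows; the field names are illustrative assumptions, not the data schema of the embodiments:

    from dataclasses import dataclass, field
    from typing import Any, Dict

    @dataclass
    class EnvInteractionRecord:
        """One frame of environment interaction data as enumerated above."""
        state: Any                                                         # environment state information s_t (positions, health, level, ...)
        influence_factors: Dict[str, float] = field(default_factory=dict)  # factor items x1..xn used by the potential energy function
        action: Any = None                                                 # action output by the agent in response to the state
        instant_reward: float = 0.0                                        # instant excitation data returned by the environment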
In the embodiment of the present application, when training an agent, self-playing data generated based on the agent may be used without relying on training data of a human player, and when step 300 is specifically executed, a possible implementation manner is provided in the embodiment of the present application:
and S1, acquiring at least two historical versions of the agent.
In the embodiments of the application, agents of different versions are generally obtained as the agent is continuously trained and updated. A diversified pool of opponent or teammate models of comparable strength can be constructed from the current version and the historical versions of the agent; then, when an agent of a certain version needs to be trained, at least two agents can be selected randomly from the pool, so that self-play data can be generated quickly.
Further, in order to improve the accuracy of the training data, the historical-version agents may be updated as iteration proceeds and selected from a preset number of the most recent versions; this is not limited in the embodiments of the present application.
Additionally, at the initial stage, the network model parameters of the agent may be randomly initialized to serve as the initial historical version of the agent.
And S2, obtaining the playing data of each frame based on the playing results of the agents with at least two historical versions under the set environment.
The method specifically comprises the following steps: and S2.1, acquiring the environment state information of the current frame corresponding to the set environment.
The setting environment may be a game environment, and is not limited in the embodiment of the present application.
And S2.2, inputting the environmental state information of the current frame into the intelligent agents of at least two historical versions, respectively obtaining the actions responding to the environmental state information of the current frame, and outputting the respectively obtained actions to the set environment through the intelligent agents of at least two historical versions.
And S2.3, acquiring instant incentive data responding to the action and environment state information of the next frame from the environment through at least two historical versions of agents.
And S2.4, inputting the environmental state information of the next frame into the intelligent agents with at least two historical versions until the playing is finished, and obtaining playing data of each frame.
For example, in a game environment, taking two agents as an example: the two agents play against each other, continuously obtaining the current frame's environment state information, outputting actions to the environment and making the next frame's action decision, and the play data generated during the match is obtained until the match ends, i.e., the game is over.
And S3, extracting image features and non-image features of each frame of playing data to obtain environment interaction data generated by the intelligent agent.
In the embodiments of the application, for some complex environments, a single kind of feature extraction may not be able to describe the complex and diverse feature information. For example, in a complex game the situation state is no longer simple board information: its large map, multiple target units and incomplete information give the situation features much higher complexity. Considering the main information a real player attends to during a game, the embodiments of the present application therefore perform image feature extraction and non-image feature extraction separately. The specific sizes used in the extraction are not limited; for example, an image may be 12 x 12 in size, and in practical applications the size can be set according to the computing resources and the required model precision.
For example, taking a game scene as an example and referring to fig. 4, which is a schematic diagram of the feature extraction principle in an embodiment of the present application: when feature extraction is performed on any frame of play data, image feature extraction can be used for information presented as images, such as the in-game global minimap and the current-view image; for information not presented as images, such as hero state and score (specifically health, level, attack power and the like), non-image feature extraction, i.e., vectorized feature extraction, can be used. By combining image and vectorized feature extraction, the various kinds of information in a complex environment can be extracted more comprehensively, which effectively adapts to the spatial complexity of the complex environment state and reduces the environment state space and the action space.
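A minimal sketch of such a two-branch extraction is given below, assuming PyTorch, a 12 x 12 image input and a 32-dimensional vector of non-image features; all sizes and layer choices are assumptions for the example:

    import torch
    import torch.nn as nn

    class MixedFeatureExtractor(nn.Module):
        """Two-branch extraction: a small CNN for image-presented information
        (minimap, current view) and an MLP for vectorized non-image information
        (health, level, attack power, ...)."""
        def __init__(self, image_channels=3, vector_dim=32, out_dim=128):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(image_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Flatten(),
            )
            self.mlp = nn.Sequential(nn.Linear(vector_dim, 64), nn.ReLU())
            self.head = nn.Linear(32 * 12 * 12 + 64, out_dim)

        def forward(self, image, vector):
            img_feat = self.cnn(image)      # image feature extraction
            vec_feat = self.mlp(vector)     # non-image (vectorized) feature extraction
            return self.head(torch.cat([img_feat, vec_feat], dim=-1))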
Further, in order to improve speed and efficiency when generating play data, the self-play process can be conveniently and quickly scaled out to multiple machines in parallel through multiple container (docker) images according to the available machine capacity. Referring to fig. 5, a schematic diagram of the self-play principle in an embodiment of the present application: self-play can be executed in parallel in multiple container images, each following the same procedure. For example, in container image 1, a historical version of the agent is obtained from the model pool and used as the strategy of agent 2, while agent 1 is the current version to be trained. For the same frame of environment state information, agent 1 and agent 2 each obtain an action in response to it based on their own strategies and output the actions to the environment; through repeated interactions until the match ends, the play data of every frame is obtained. Therefore, the container image technology in the embodiments of the present application can greatly improve the efficiency of generating self-play data; a large amount of training data can be generated through self-play without relying on the training data of human players, and the capability of the neural network model can be improved quickly and efficiently from scratch.
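A simplified sketch of one self-play match following steps S2.1 to S2.4 above; the environment and agent interfaces are assumptions, and in practice many such workers would run in parallel, for example one per container image:

    def self_play_episode(env, agent_current, agent_historical):
        """Both agents receive the current frame's state, output actions to the
        shared environment, and the per-frame play data is recorded until the
        match ends."""
        frames = []
        state = env.reset()                                # S2.1: current frame state
        done = False
        while not done:
            a1 = agent_current.act(state)                  # S2.2: actions from both versions
            a2 = agent_historical.act(state)
            next_state, rewards, done = env.step(a1, a2)   # S2.3: instant rewards + next frame state
            frames.append((state, (a1, a2), rewards))
            state = next_state                             # S2.4: feed next frame until the match ends
        return frames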
Step 310: and acquiring a target potential energy value through the trained potential energy function according to the environment state information and the influence factor, and acquiring positive excitation data or negative excitation data according to the standard potential energy value and the target potential energy value corresponding to the environment state information.
When step 310 is executed, the method specifically includes:
1) and obtaining a target potential energy value through the trained potential energy function according to the environment state information and the influence factor.
In the embodiment of the present application, the potential energy function may be obtained through neural network model learning; for example, the trained potential energy function is y = f(s_t, x), where s_t is the environment state information, x is the influence factor, and y is the potential energy value.
After the potential energy function is obtained through training, the current environment state information and influence factors in the environment interaction data generated based on the agent can be input into the potential energy function to obtain the target potential energy value.
2) And acquiring positive excitation data or negative excitation data according to the standard potential energy value corresponding to the environment state information and the target potential energy value.
Specifically, the method may include: determining a potential energy difference obtained by subtracting the standard potential energy value from the target potential energy value; if the potential energy difference is positive, determining the corresponding positive excitation data based on the potential energy difference; and if the potential energy difference is negative, determining the corresponding negative excitation data based on the potential energy difference.
The standard potential energy value may be obtained from expert environment interaction data. For example, expert environment interaction data may be obtained in advance; based on it, the benefit or excitation data obtained after each output action is determined, and a corresponding potential energy value is then calibrated and used as the standard potential energy value.
In the embodiment of the application, the trained potential energy function y = f(s_t, x) is used to evaluate the agent: given the agent's environment state information s_t and influence factors x = (x_1, ..., x_n), the target potential energy value is obtained and compared with the standard potential energy value corresponding to the same environment state information to obtain the potential energy difference. For example, referring to fig. 6, a schematic diagram of the discrimination principle based on the potential energy function in an embodiment of the present application: based on the agent's environment state information s_t and influence factors x = (x_1, ..., x_n), the target potential energy value obtained through the potential energy function is y, while for the expert data in the same environment state s_t, with its own influence factors, the corresponding standard potential energy value is y*; the discriminator then obtains the potential energy difference Δy = y - y*. When the potential energy difference is positive, a positive excitation is given; when it is negative, a negative excitation, i.e., a penalty, is given. The positive or negative excitation can then be combined with the instant excitation data to adjust, during the agent's training, the advancing direction of the strategy-based action sequence, so that the agent's exploration direction in the set environment is constrained and guided by the value judgment of the potential energy function.
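A minimal sketch of this value evaluation and its combination with the instant excitation data; the additive combination and the weight factor are assumptions for the example rather than the prescribed combination rule:

    def shaped_reward(potential_fn, state, factors, standard_potential, instant_reward, weight=1.0):
        """Compare the target potential energy value f(s_t, x) against the
        standard potential energy value for the same state, give a positive
        excitation when the difference is positive and a negative excitation
        (penalty) when it is negative, and combine it with the instant
        excitation into the target excitation data."""
        target_potential = potential_fn(state, factors)   # y = f(s_t, x)
        delta = target_potential - standard_potential     # potential energy difference
        extra = weight * delta                            # > 0: positive excitation, < 0: negative excitation
        return instant_reward + extra                     # target excitation data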
Step 320: and acquiring target excitation data according to the instant excitation data and the positive excitation data or the negative excitation data.
Step 330: and performing reinforcement learning training on the intelligent agent according to the environment state information, the action and the target incentive data.
In the process of training the agent, after the environment state information is input, the action predicted by the agent needs a specific value to evaluate how good or bad it is. The return benefit represents the return that the environment state information at a certain time t will bring, i.e., the discounted sum of all subsequent reward information; for example, the return of the environment state information at time t is G_t = R_{t+1} + λR_{t+2} + λ²R_{t+3} + ..., where λ is the attenuation coefficient.
In practice, unless the whole game reaches its end state, all the reward information cannot be explicitly acquired to calculate the return of every environment state. Therefore, in the embodiment of the application, the Bellman equation can be adopted, so that the return of the current environment state only depends on the return of the next step and the current instant reward feedback. For example, with the expected return of the current environment state denoted V_θ(s), it can be expressed as:
V_θ(s) = E[G_t | S_t = s] = E[R_{t+1} + λR_{t+2} + λ²R_{t+3} + ... | S_t = s]
= E[R_{t+1} + λV_θ(S_{t+1}) | S_t = s]
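For illustration, the discounted return of every step of a trajectory can be computed backwards in this Bellman form, so that each return only needs the next step's return and the current instant reward; a sketch under these assumptions:

    def discounted_returns(instant_rewards, bootstrap_value=0.0, lam=0.99):
        """Return G_t = R_{t+1} + lam*R_{t+2} + ... for every step of a trajectory,
        where lam is the attenuation coefficient of the formulas above."""
        returns = []
        g = bootstrap_value                 # value estimate for the state after the last step
        for r in reversed(instant_rewards):
            g = r + lam * g                 # G_t = R_{t+1} + lam * G_{t+1}
            returns.append(g)
        returns.reverse()
        return returns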
Taking a game scene as an example, the reference factors for the instant benefit feedback mainly include the hero's experience value (exp), economy (gold), health (hp), kills, deaths, the change in the main building's health, and so on. They are of course not limited to these; more factors can be added or removed according to the game content, and the weights of the factors in the calculation can be adjusted at the same time.
Therefore, based on the above way of computing the return, in the embodiment of the present application one training sample contains the environment state information, the action and the finally obtained target excitation data, and reinforcement learning training of the agent is performed accordingly. For example, the PPO reinforcement learning algorithm may be used to update the parameters of the agent's neural network model, with the objective of maximizing the expected excitation data, i.e., the return generated by the actions predicted by the agent, the objective function being determined from the target excitation data in the training samples. When the preset number of iterations is reached or the objective function converges and the capability ceiling is reached, training stops and the finally trained agent is saved, which improves the accuracy of the agent's action output for the environment state information. The updated, trained agent may further be added to the model pool for subsequent self-play matches to generate self-play data.
Of course, other reinforcement learning algorithms may also be used for training, such as the Asynchronous Advantage Actor-Critic (A3C) algorithm or the Deep Deterministic Policy Gradient (DDPG) algorithm; the embodiments of the present application are not limited in this respect.
In the embodiment of the application, the environment interaction data generated based on the agent is obtained; a target potential energy value is obtained through the trained potential energy function according to the environment state information and the influence factor; positive or negative excitation data is obtained according to the standard potential energy value corresponding to the environment state information and the target potential energy value; target excitation data is obtained from the instant excitation data and the positive or negative excitation data; and the agent is then trained with reinforcement learning according to the environment state information, the action and the target excitation data. Because the potential energy function is constructed in advance, when the agent is trained with a reinforcement learning algorithm the direction in which the agent executes its strategy can be guided by the potential energy function combined with the instant excitation data, gradually advancing from low potential energy to high potential energy towards the final winning target. This increases the convergence speed, lets the agent learn and converge quickly under dense excitation while avoiding local optima, and improves the accuracy of the finally learned strategy, thereby improving the robustness and diversity of the agent.
Further, based on the above embodiment, the following specifically describes the training method of the potential energy function in the embodiment of the present application, and for the training method of the potential energy function, the embodiment of the present application provides a possible implementation manner:
s1, acquiring a potential energy function training sample set, wherein each potential energy function training sample in the potential energy function training sample set at least comprises environment state information, a corresponding influence factor and a standard potential energy value.
The potential energy function training sample set can be obtained by extracting the corresponding environment state information, influence factors and so on from a large amount of normal match data of high-level players on the server and from a large amount of expert data, thereby forming the potential energy function training sample set.
For example, taking a game scene as an example, the main influence factors, such as health, mana, economy, experience and kill count, are determined as the main factor terms x of the potential energy function according to the specifics of the game; the standard potential energy value may be determined according to the game's winning condition or artificially defined labels. For example, if destroying 10 building targets means winning, and 2 buildings have currently been destroyed, the standard potential energy value can be defined as 0.2.
And S2, respectively inputting the potential energy function training sample set into the neural network model, and training to obtain a potential energy function, wherein the potential energy function represents the incidence relation between the standard potential energy value and the influence factor and the environmental state information.
Thus, a large amount of sample data from the experts' operation trajectories, i.e., data pairs <environment state information s_t, influence factors (x_1, ..., x_n), function target value y>, can be obtained, and the relationship between the environment state information s_t and each factor term x and the standard potential energy value y, i.e., y = f(s_t, x), can be learned through a neural network model. For example, referring to fig. 7, a schematic diagram of the potential energy function training principle in an embodiment of the present application: the environment state information and influence factors are input, the standard potential energy value is output, and the potential energy function is obtained through training. Fig. 7 is only a simple example of a neural network structure used to explain the training principle; the specific structure of the potential energy function's neural network is not limited.
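A minimal sketch of fitting such a potential energy function as a small regression network on expert samples <s_t, (x_1, ..., x_n), y>, assuming PyTorch; the network size and optimizer settings are assumptions for the example only:

    import torch
    import torch.nn as nn

    def train_potential_function(samples, state_dim, factor_dim, epochs=100, lr=1e-3):
        """Learn y = f(s_t, x) by regressing the standard potential energy value
        from concatenated state and factor features. `samples` is assumed to be
        a tuple (states, factors, targets) of array-like data."""
        net = nn.Sequential(
            nn.Linear(state_dim + factor_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        states, factors, targets = (torch.as_tensor(t, dtype=torch.float32) for t in samples)
        inputs = torch.cat([states, factors], dim=-1)
        for _ in range(epochs):
            opt.zero_grad()
            pred = net(inputs).squeeze(-1)    # predicted potential energy value
            loss = loss_fn(pred, targets)     # fit to the standard potential energy value
            loss.backward()
            opt.step()
        return net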
In addition, expert data of a specific single type or of various strategy styles can be freely selected as needed. After the potential energy function is obtained through the learning of the neural network model, the target potential energy value is evaluated according to the agent's environment state information during the agent's training, and the corresponding reward or penalty is given, so that the exploration direction of the agent's strategy is constrained and guided by the judgment of the potential energy function. This promotes fast improvement and convergence of the agent's capability, raises the agent's overall capability level, and also enables directional training of the agent's strategy style.
Based on the above embodiment, a system structure of the method for training an agent in the embodiment of the present application is described below, and specifically, fig. 8 is a schematic diagram of the system structure of the method for training an agent in the embodiment of the present application.
As shown in fig. 8, the whole scheme of the training method of the agent can be divided into two parts, namely a self-playing training part and a potential energy function part, specifically:
a first part: self-playing training.
As shown in fig. 8, in the self-playing training part, each frame of playing data is obtained from the results of matches played in the set environment by at least two historical versions of the agent, and image feature and non-image feature extraction is performed on the overall information of each frame of playing data, thereby reducing the state space and action space of the set environment and obtaining environment interaction data.
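To make the feature extraction step concrete, the following is an illustrative sketch of turning one frame of playing data into an environment interaction record. The frame fields (minimap, hp, mana, gold, exp, kills) and the record layout are hypothetical examples; the actual image and non-image features depend on the game and are not limited here.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class InteractionRecord:
    state: np.ndarray    # concatenated environment state features for this frame
    factors: np.ndarray  # influence factors (e.g. blood volume, economy, kills)
    action: int          # action output by the agent for this frame
    reward: float        # instant excitation data returned by the environment

def extract_features(frame: dict) -> InteractionRecord:
    # Image features: e.g. a downsampled minimap flattened into a vector.
    image_feat = np.asarray(frame["minimap"], dtype=np.float32).ravel()
    # Non-image features: scalar situation values, which also serve as influence factors here.
    non_image_feat = np.asarray(
        [frame["hp"], frame["mana"], frame["gold"], frame["exp"], frame["kills"]],
        dtype=np.float32,
    )
    state = np.concatenate([image_feat, non_image_feat])
    return InteractionRecord(state, non_image_feat, frame["action"], frame["reward"])
```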
Using the trained potential energy function, a discriminator compares the potential energy values and outputs positive excitation data or negative excitation data. The instant excitation data in the environment interaction data, together with the positive or negative excitation data estimated through the potential energy function, are then used to continuously train the neural network model of the agent through self-playing and a reinforcement learning algorithm, so as to improve the exploration capability and diversity of the agent. The neural network structure of the agent in fig. 8 is only a simple example; the specific structure of the neural network of the agent in the embodiment of the present application is not limited.
Therefore, through the value evaluation of the potential energy function, the action exploration direction of the agent is constrained and guided, so that under both dense-excitation and sparse-excitation learning the agent can correctly advance toward the high-potential-energy direction of the final winning target, and the strategy is prevented from being trapped in a local optimum.
A second part: potential energy function.
The potential energy function part mainly comprises potential energy function training and value evaluation based on the potential energy function, and specifically comprises the following steps:
1) training a potential energy function: the potential energy function is obtained by obtaining a potential energy function training sample set and training, wherein each potential energy function training sample in the potential energy function training sample set at least comprises environment state information, a corresponding influence factor and a standard potential energy value.
The potential energy function training sample set can freely use expert data corresponding to specific single or multiple strategy and action patterns, so that the goal of directional strategy training of the agent can be achieved, the agent can learn reasonable action strategies more quickly, and the agent has better robustness and adaptability when confronting human players.
2) Potential energy function based value assessment: the method mainly provides additional value evaluation in the training process of the intelligent agent, and further combines instant incentive data generated by the intelligent agent to obtain an incentive mechanism in the training process of the intelligent agent.
Specifically, for the current environment state information, a target potential energy value is obtained through the trained potential energy function, and the standard potential energy value corresponding to the environment state information is obtained. When the target potential energy value is lower than the standard potential energy value, additional negative excitation data is given to the corresponding action of the agent; conversely, when the target potential energy value is higher than the standard potential energy value, additional positive excitation data is given to the corresponding action of the agent. The target excitation data is obtained by combining the additional positive or negative excitation data with the instant excitation data generated by the agent, and the final target excitation data serves as the excitation mechanism in the training process of the agent, constraining the training direction of the agent and improving the convergence speed and reliability of the training.
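The comparison and combination just described can be sketched as follows. The scaling coefficient beta and the function names are assumptions introduced only for illustration; they are not prescribed by the embodiment.

```python
def value_evaluation(potential_net, state, factors, standard_potential, beta=1.0):
    """Shaping bonus from the gap between the target and standard potential energy values."""
    target_potential = float(potential_net(state, factors))
    diff = target_potential - standard_potential
    # diff > 0 -> additional positive excitation data; diff < 0 -> additional negative excitation data
    return beta * diff

def target_excitation(instant_reward, shaping_bonus):
    """Final target excitation data used by the reinforcement learning algorithm."""
    return instant_reward + shaping_bonus
```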
Based on the foregoing embodiment, the overall process of the agent training method in the embodiment of the present application is described below for a specific application scenario in which the set environment is a game environment. Specifically, referring to fig. 9, which is a flowchart of another agent training method in the embodiment of the present application, the process includes:
step 900: and acquiring a potential energy function training sample set, and training to acquire a potential energy function according to the potential energy function training sample set.
For example, taking a game scene as an example, feature extraction is performed on historical match data of high-level players or on player data with a designated playing style, and <environmental state information, influence factor, standard potential energy value> samples are collected with each game as a unit, so as to obtain the potential energy function training sample set.
Step 901: and loading the neural network model of the intelligent agent and loading the game environment.
Wherein, at the initial stage, the neural network model parameters can be initialized randomly.
Step 902: acquiring at least two historical versions of the agent from the model pool, obtaining each frame of playing data based on the results of the matches played by the at least two historical versions of the agent in the set environment, and obtaining the environment interaction data generated by the agent.
For example, container image technology can be adopted and self-playing scripts can be started in parallel on multiple machines to improve the efficiency of generating playing data; environment interaction data, which at least includes the environment state information and influence factors as well as the action and instant excitation data obtained by the agent in response to the environment state information, is then obtained through image feature and non-image feature extraction.
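A single-machine stand-in for this parallel data generation is sketched below using Python's multiprocessing module; the container-image, multi-machine deployment mentioned above would distribute the same workers across machines. The run_selfplay_episode callable is an assumed helper that plays one match and returns its per-frame interaction records.

```python
from multiprocessing import Pool

def generate_selfplay_data(run_selfplay_episode, num_workers=8, episodes=64):
    """Run self-playing episodes in parallel and flatten the per-frame interaction data."""
    with Pool(processes=num_workers) as pool:
        per_episode = pool.map(run_selfplay_episode, range(episodes))
    return [record for episode in per_episode for record in episode]
```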
Step 903: and acquiring a target potential energy value through the trained potential energy function according to the environment state information and the influence factor, acquiring positive excitation data or negative excitation data according to the standard potential energy value and the target potential energy value corresponding to the environment state information, and acquiring target excitation data according to the instant excitation data and the positive excitation data or the negative excitation data.
Step 904: and performing reinforcement learning training on the intelligent agent according to the environment state information, the action and the target incentive data.
Step 905: after a certain number of iterations, the updated agent is added into the model pool for subsequent self-playing matches.
Step 906: if the objective function is determined to have converged or the maximum number of iterations is reached, the training is stopped and the final trained agent is saved; otherwise, the process returns to step 902 to continue the training.
Further, the finally trained agent may also be added to the model pool, or may be added to the model pool after iteration for a certain number of times, which is not limited in the embodiment of the present application.
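Steps 900 to 906 can be summarized in the following skeleton. The helper callables (load_game_env, self_play, compute_target_excitation, rl_update, has_converged) and the model-pool interface are placeholders for the components described above, passed in as parameters; they are not a fixed API of the embodiment.

```python
def train_agent(model_pool, potential_net, load_game_env, self_play,
                compute_target_excitation, rl_update, has_converged,
                max_iterations=10_000, sync_every=100):
    """Skeleton of steps 900-906; all collaborators are supplied as callables."""
    agent = model_pool.latest()                    # step 901: load the agent's neural network model
    env = load_game_env()                          # step 901: load the game environment
    for it in range(max_iterations):
        opponents = model_pool.sample(k=2)         # step 902: at least two historical versions
        frames = self_play(env, agent, opponents)  # step 902: per-frame environment interaction data
        batch = [compute_target_excitation(potential_net, f) for f in frames]  # step 903
        rl_update(agent, batch)                    # step 904: reinforcement learning training
        if (it + 1) % sync_every == 0:
            model_pool.add(agent)                  # step 905: add the updated agent to the model pool
        if has_converged(agent):
            break                                  # step 906: stop when converged or at max iterations
    return agent
```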
In the embodiment of the application, image features and non-image features are extracted from the game situation state, which effectively reduces the space complexity of a complex game. Agent training is then carried out through multi-machine parallel self-playing and instant reward data, so the agent's capability can be iterated and improved automatically without supervised labeled data, greatly improving training efficiency. Meanwhile, the potential energy function obtained by training evaluates, in real time, the direction of the agent's action execution strategy during training and controls the action execution strategy, improving the robustness and adaptability of the agent.
Further, in the embodiment of the present application, after the trained agent is obtained, it may be applied to a related business process. In particular, the embodiment of the present application provides an agent-based action control method. Based on the above embodiment, referring to fig. 10, which is a flowchart of the agent-based action control method in the embodiment of the present application, the method specifically includes:
step 1000: and acquiring the state information of the environment to be processed corresponding to the set environment.
The setting environment is, for example, a game environment, and the embodiment of the present application is not limited thereto.
Step 1010: based on the trained agent, with the pending environmental status information as input, an action is obtained in response to the pending environmental status information.
The intelligent agent is obtained by performing reinforcement learning training on a training data sample set containing environment state information, actions and target excitation data, wherein the target excitation data are obtained according to instant excitation data and positive excitation data or negative excitation data, and the positive excitation data or the negative excitation data are obtained according to a standard potential energy value corresponding to the environment state information and a target potential energy value obtained through a trained potential energy function.
Step 1020: and outputting the action to the set environment through the intelligent agent.
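A minimal sketch of this control loop is given below, assuming the trained agent exposes a select_action method and the set environment exposes get_state and step; both interfaces are illustrative assumptions rather than the prescribed API.

```python
def control_step(env, trained_agent):
    pending_state = env.get_state()                       # step 1000: to-be-processed environment state
    action = trained_agent.select_action(pending_state)   # step 1010: action responding to that state
    env.step(action)                                      # step 1020: output the action to the set environment
    return action
```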
According to the embodiment of the application, combining the instant excitation data with the value evaluation of the potential energy function improves the convergence rate of agent training as well as the efficiency and accuracy of the training; when the trained agent performs service processing, the accuracy and robustness of the actions output by the agent are also improved.
Based on the same inventive concept, an embodiment of the present application further provides an agent training apparatus, which may be, for example, the server in the foregoing embodiment, and which may be a hardware structure, a software module, or a combination of a hardware structure and a software module. Based on the above embodiments, referring to fig. 11, the agent training apparatus in the embodiment of the present application specifically includes:
an obtaining module 1100, configured to obtain environment interaction data generated based on an agent, where the environment interaction data at least includes environment state information and an influence factor, and action and immediate incentive data obtained by the agent in response to the environment state information;
a judging module 1110, configured to obtain a target potential energy value through a trained potential energy function according to the environmental state information and the influence factor, and obtain positive excitation data or negative excitation data according to a standard potential energy value corresponding to the environmental state information and the target potential energy value;
a determining module 1120, configured to obtain target incentive data according to the instant incentive data and the positive incentive data or the negative incentive data;
a first training module 1130, configured to perform reinforcement learning training on the agent according to the environmental status information, the action, and the target incentive data.
Optionally, when the environment interaction data generated based on the agent is acquired, the acquiring module 1100 is specifically configured to:
acquiring intelligent agents of at least two historical versions;
obtaining the playing data of each frame based on the playing results of the agents with the at least two historical versions in the set environment;
and respectively extracting image characteristics and non-image characteristics of the playing data of each frame to obtain environment interaction data generated by the intelligent agent.
Optionally, when obtaining the playing data of each frame based on the playing results of the at least two historical versions of the agent in the set environment, the obtaining module 1100 is specifically configured to:
acquiring environment state information of a current frame corresponding to the set environment;
inputting the environmental state information of the current frame into the agents of the at least two historical versions, respectively obtaining actions responding to the environmental state information of the current frame, and outputting the respectively obtained actions to the set environment through the agents of the at least two historical versions;
obtaining, by the agent of the at least two historical versions, immediate incentive data responsive to the action and environmental status information of a next frame from the environment;
and inputting the environmental state information of the next frame into the intelligent agents of the at least two historical versions until the playing is finished, and obtaining playing data of each frame.
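The per-frame loop described above can be sketched as a rollout of the historical agent versions against the set environment; the env and agent interfaces below are assumptions for illustration only.

```python
def self_play_rollout(env, agents):
    """Collect each frame of playing data until the match ends."""
    frames = []
    state = env.reset()                 # environment state information of the first frame
    done = False
    while not done:
        actions = [agent.select_action(state) for agent in agents]  # actions of the historical versions
        next_state, rewards, done = env.step(actions)               # instant excitation data and next frame
        frames.append({"state": state, "actions": actions, "rewards": rewards})
        state = next_state              # the next frame becomes the current frame
    return frames
```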
Optionally, the second training module 1140 is further configured to:
acquiring a potential energy function training sample set, wherein each potential energy function training sample in the potential energy function training sample set at least comprises environment state information, a corresponding influence factor and a standard potential energy value;
and respectively inputting the potential energy function training sample set into a neural network model, and training to obtain a potential energy function, wherein the potential energy function represents the association relationship between a standard potential energy value and an influence factor and environmental state information.
Optionally, when obtaining positive excitation data or negative excitation data according to the standard potential energy value and the target potential energy value corresponding to the environment state information, the judging module 1110 is specifically configured to:
determining a potential energy difference value of the target potential energy value minus the standard potential energy value;
if the potential energy difference value is a positive value, determining corresponding positive excitation data based on the potential energy difference value;
and if the potential energy difference value is a negative value, determining corresponding negative excitation data based on the potential energy difference value.
Optionally, the setting environment is a game environment.
Based on the same inventive concept, the embodiment of the present application further provides an agent-based action control apparatus, which may be, for example, the server or the terminal in the foregoing embodiments, and which may be a hardware structure, a software module, or a combination of a hardware structure and a software module. Based on the above embodiments, referring to fig. 12, the agent-based action control apparatus in an embodiment of the present application specifically includes:
an obtaining module 1200, configured to obtain to-be-processed environment state information corresponding to a set environment;
a processing module 1210, configured to obtain an action in response to the to-be-processed environmental state information based on a trained agent by using the to-be-processed environmental state information as an input, where the agent is obtained by performing reinforcement learning training on a training data sample set including environmental state information, the action, and target excitation data, the target excitation data is obtained according to immediate excitation data and positive excitation data or negative excitation data, and the positive excitation data or the negative excitation data is obtained according to a standard potential energy value corresponding to the environmental state information and a target potential energy value obtained through a trained potential energy function;
an output module 1220, configured to output the action to the setting environment through the agent.
Based on the above embodiments, referring to fig. 13, a schematic structural diagram of an electronic device in an embodiment of the present application is shown.
An electronic device, which may be a server in the foregoing embodiments, may include a processor 1310 (CPU), a memory 1320, an input device 1330, an output device 1340, and the like.
The processor 1310 is configured to call the program instructions stored in the memory 1320 and execute, according to the obtained program instructions, the agent training method or the agent-based action control method of any of the above embodiments.
Based on the foregoing embodiments, in the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the method for training an agent or the method for controlling an action based on an agent in any of the above-mentioned method embodiments.
Based on the above embodiments, in the embodiments of the present application, there is also provided a computer program product or a computer program, which includes computer instructions, and the computer instructions are stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the agent or the action control method based on the agent in any of the above method embodiments.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Claims (11)
1. A method for training an agent, comprising:
acquiring environment interaction data generated based on an intelligent agent, wherein the environment interaction data at least comprises environment state information and influence factors, and action and instant incentive data which are obtained by the intelligent agent and respond to the environment state information;
obtaining a target potential energy value through a trained potential energy function according to the environmental state information and the influence factor, and obtaining positive excitation data or negative excitation data according to a standard potential energy value corresponding to the environmental state information and the target potential energy value;
acquiring target excitation data according to the instant excitation data and the positive excitation data or the negative excitation data;
and performing reinforcement learning training on the intelligent agent according to the environment state information, the action and the target incentive data.
2. The method of claim 1, wherein obtaining the context interaction data generated based on the agent comprises:
acquiring intelligent agents of at least two historical versions;
obtaining the playing data of each frame based on the playing results of the agents with the at least two historical versions in the set environment;
and respectively extracting image characteristics and non-image characteristics of the playing data of each frame to obtain environment interaction data generated by the intelligent agent.
3. The method as claimed in claim 2, wherein obtaining the frame of playing data based on the playing results of the at least two historical versions of agents in the setting environment specifically comprises:
acquiring environment state information of a current frame corresponding to the set environment;
inputting the environmental state information of the current frame into the agents of the at least two historical versions, respectively obtaining actions responding to the environmental state information of the current frame, and outputting the respectively obtained actions to the set environment through the agents of the at least two historical versions;
obtaining, by the agent of the at least two historical versions, immediate incentive data responsive to the action and environmental status information of a next frame from the environment;
and inputting the environmental state information of the next frame into the intelligent agents of the at least two historical versions until the playing is finished, and obtaining playing data of each frame.
4. The method of claim 1, further comprising:
acquiring a potential energy function training sample set, wherein each potential energy function training sample in the potential energy function training sample set at least comprises environment state information, a corresponding influence factor and a standard potential energy value;
and respectively inputting the potential energy function training sample set into a neural network model, and training to obtain a potential energy function, wherein the potential energy function represents the association relationship between a standard potential energy value and an influence factor and environmental state information.
5. The method of claim 1, wherein obtaining positive excitation data or negative excitation data according to a standard potential energy value and the target potential energy value corresponding to the environmental status information comprises:
determining a potential energy difference value of the target potential energy value minus the standard potential energy value;
if the potential energy difference value is a positive value, determining corresponding positive excitation data based on the potential energy difference value;
and if the potential energy difference value is a negative value, determining corresponding negative excitation data based on the potential energy difference value.
6. The method of any of claims 2-5, wherein the set environment is a gaming environment.
7. An agent-based action control method, comprising:
acquiring to-be-processed environment state information corresponding to a set environment;
based on a trained agent, taking the to-be-processed environment state information as input, and obtaining an action responding to the to-be-processed environment state information, wherein the agent is obtained by performing reinforcement learning training on a training data sample set containing environment state information, the action and target excitation data, the target excitation data is obtained according to instant excitation data and positive excitation data or negative excitation data, and the positive excitation data or the negative excitation data is obtained according to a standard potential energy value corresponding to the environment state information and a target potential energy value obtained through a trained potential energy function;
outputting, by the agent, the action to the set environment.
8. An intelligent agent training device, comprising:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring environment interaction data generated based on an intelligent agent, the environment interaction data at least comprises environment state information and influence factors, and action and instant incentive data which are obtained by the intelligent agent and respond to the environment state information;
the judging module is used for obtaining a target potential energy value through a trained potential energy function according to the environment state information and the influence factor, and obtaining positive excitation data or negative excitation data according to a standard potential energy value corresponding to the environment state information and the target potential energy value;
the determining module is used for obtaining target excitation data according to the instant excitation data and the positive excitation data or the negative excitation data;
and the first training module is used for carrying out reinforcement learning training on the intelligent agent according to the environment state information, the action and the target incentive data.
9. An agent-based action control apparatus, comprising:
the acquisition module is used for acquiring the state information of the environment to be processed corresponding to the set environment;
the processing module is used for obtaining an action responding to the environmental state information to be processed by taking the environmental state information to be processed as input based on a trained intelligent agent, wherein the intelligent agent is obtained by performing reinforcement learning training on a training data sample set containing the environmental state information, the action and target excitation data, the target excitation data is obtained according to instant excitation data and positive excitation data or negative excitation data, and the positive excitation data or the negative excitation data is obtained according to a standard potential energy value corresponding to the environmental state information and a target potential energy value obtained through a trained potential energy function;
and the output module is used for outputting the action to the set environment through the intelligent agent.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1-6 or 7 are performed when the program is executed by the processor.
11. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 6 or 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011176683.8A CN112215328B (en) | 2020-10-29 | 2020-10-29 | Training of intelligent agent, action control method and device based on intelligent agent |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112215328A true CN112215328A (en) | 2021-01-12 |
CN112215328B CN112215328B (en) | 2024-04-05 |
Family
ID=74057433
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011176683.8A Active CN112215328B (en) | 2020-10-29 | 2020-10-29 | Training of intelligent agent, action control method and device based on intelligent agent |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112215328B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108427985A (en) * | 2018-01-02 | 2018-08-21 | 北京理工大学 | A kind of plug-in hybrid vehicle energy management method based on deeply study |
US20200166896A1 (en) * | 2018-11-26 | 2020-05-28 | Uber Technologies, Inc. | Deep reinforcement learning based models for hard-exploration problems |
CN109974737A (en) * | 2019-04-11 | 2019-07-05 | 山东师范大学 | Route planning method and system based on combination of safety evacuation signs and reinforcement learning |
CN110245221A (en) * | 2019-05-13 | 2019-09-17 | 华为技术有限公司 | The method and computer equipment of training dialogue state tracking classifier |
CN111198966A (en) * | 2019-12-22 | 2020-05-26 | 同济大学 | Natural language video clip retrieval method based on multi-agent boundary perception network |
CN111612126A (en) * | 2020-04-18 | 2020-09-01 | 华为技术有限公司 | Method and device for reinforcement learning |
CN111515961A (en) * | 2020-06-02 | 2020-08-11 | 南京大学 | Reinforcement learning reward method suitable for mobile mechanical arm |
Non-Patent Citations (3)
Title |
---|
DEVLIN S等: "Potential-based difference rewards for multiagent reinforcement learning", 《PROCEEDINGS OF THE 2014 INTERNATIONAL CONFERENCE ON AUTONOMOUS AGENTS AND MULTI-AGENT SYSTEMS》, pages 165 - 172 * |
房霄;曾贲;宋祥祥;贾正轩;: "基于深度强化学习的舰艇空中威胁行为建模", 现代防御技术, no. 05, pages 63 - 70 * |
陈兴国;俞扬;: "强化学习及其在电脑围棋中的应用", 自动化学报, no. 05, pages 48 - 58 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112870727A (en) * | 2021-01-18 | 2021-06-01 | 浙江大学 | Training and control method for intelligent agent in game |
CN112870727B (en) * | 2021-01-18 | 2022-02-22 | 浙江大学 | Training and control method for intelligent agent in game |
CN113281999A (en) * | 2021-04-23 | 2021-08-20 | 南京大学 | Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning |
CN114154611A (en) * | 2021-11-10 | 2022-03-08 | 中国科学院自动化研究所 | Man-machine confrontation system supporting Turing test mode and intelligent agent test method |
CN114595958A (en) * | 2022-02-28 | 2022-06-07 | 哈尔滨理工大学 | Shipboard aircraft guarantee operator scheduling method for emergency |
CN114595958B (en) * | 2022-02-28 | 2022-10-04 | 哈尔滨理工大学 | Shipboard aircraft guarantee operator scheduling method aiming at emergency |
CN114706381A (en) * | 2022-03-04 | 2022-07-05 | 达闼机器人股份有限公司 | Intelligent agent training method and device, storage medium and electronic equipment |
CN115423054A (en) * | 2022-11-07 | 2022-12-02 | 北京智精灵科技有限公司 | Indefinite training and exciting method and system based on personality characteristics of cognitive disorder patient |
CN115423054B (en) * | 2022-11-07 | 2023-04-07 | 北京智精灵科技有限公司 | Uncertain training and exciting method and system based on personality characteristics of cognitive disorder patient |
Also Published As
Publication number | Publication date |
---|---|
CN112215328B (en) | 2024-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11779837B2 (en) | Method, apparatus, and device for scheduling virtual objects in virtual environment | |
CN112215328B (en) | Training of intelligent agent, action control method and device based on intelligent agent | |
CN111111220B (en) | Self-chess-playing model training method and device for multiplayer battle game and computer equipment | |
CN108888958B (en) | Virtual object control method, device, equipment and storage medium in virtual scene | |
CN111111204B (en) | Interactive model training method and device, computer equipment and storage medium | |
Vinyals et al. | Starcraft ii: A new challenge for reinforcement learning | |
CN111632379B (en) | Game role behavior control method and device, storage medium and electronic equipment | |
CN111282267B (en) | Information processing method, information processing apparatus, information processing medium, and electronic device | |
CN112791394B (en) | Game model training method and device, electronic equipment and storage medium | |
CN111450531B (en) | Virtual character control method, virtual character control device, electronic equipment and storage medium | |
CN112402986B (en) | Training method and device for reinforcement learning model in battle game | |
CN113688977A (en) | Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium | |
CN116747521B (en) | Method, device, equipment and storage medium for controlling intelligent agent to conduct office | |
CN112044076B (en) | Object control method and device and computer readable storage medium | |
CN117899483B (en) | Data processing method, device, equipment and storage medium | |
CN114404975A (en) | Method, device, equipment, storage medium and program product for training decision model | |
CN113509726A (en) | Interactive model training method and device, computer equipment and storage medium | |
CN116966573A (en) | Interaction model processing method, device, computer equipment and storage medium | |
CN116943220A (en) | Game artificial intelligence control method, device, equipment and storage medium | |
CN114344889B (en) | Game strategy model generation method and control method of intelligent agent in game | |
CN114404976A (en) | Method and device for training decision model, computer equipment and storage medium | |
Brando | A deep learning approach to the geometry friends game | |
CN114254722B (en) | Multi-intelligent-model fusion method for game confrontation | |
Gonçalves | Learning to Play DCSS with Deep Reinforcement Learning | |
Zhang | Improving collectible card game AI with heuristic search and machine learning techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40037772; Country of ref document: HK |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |