CN115238891A - Decision model training method, and target object strategy control method and device - Google Patents

Decision model training method, and target object strategy control method and device

Info

Publication number
CN115238891A
Authority
CN
China
Prior art keywords
target
time
moment
target object
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210908501.4A
Other languages
Chinese (zh)
Inventor
李舒兴
徐家卫
袁春
韩磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Tencent Technology Shenzhen Co Ltd
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd, Shenzhen International Graduate School of Tsinghua University filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210908501.4A priority Critical patent/CN115238891A/en
Publication of CN115238891A publication Critical patent/CN115238891A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The application provides a decision model training method, and a strategy control method and device for a target object. The method includes: acquiring training data of a target object in a current state in a target environment, where the training data of the current state includes training information at N moments, the training information at any moment t of the N moments includes the state information at moment t, the control strategy at moment t and the predicted reward of the control strategy at moment t, and the control strategy at moment t and its predicted reward are output by a decision model after the state information at moment t is input; constructing a target loss function according to the training data of the current state; adjusting the parameters of the decision model according to the target loss function; inputting the state information of the target object at moment N+1 into the decision model with the adjusted parameters; and continuing to train the decision model according to the training data of the next state of the target object until a training stop condition is met, to obtain the trained decision model.

Description

Decision model training method, and target object strategy control method and device
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a decision model training method, a target object strategy control method and a target object strategy control device.
Background
An agent is an entity having intelligence, that is, any independent entity capable of reasoning and interacting with its environment. A policy is the behavior by which an agent selects actions after perceiving the environment. Reinforcement learning, a sub-field of artificial intelligence, can predict action strategies for an agent via a reinforcement learning model. At present, when a decision model obtained by reinforcement learning training is used in a multi-agent game, the agent's action decisions against different opponent agents are essentially consistent or highly similar; that is, the agent's action decisions are single and fixed.
How to realize diverse strategies for an agent in a multiplayer game environment, i.e., how to enable the agent to adjust its action decisions in real time when facing different opponent agents, is a problem to be solved urgently.
Disclosure of Invention
The application provides a decision model training method, a strategy control method and a strategy control device for a target object, which can realize diversity strategies of the target object in a target environment.
In a first aspect, the present application provides a decision model training method, including:
acquiring training data of a target object in a current state in a target environment, wherein the training data of the current state comprises training information of N moments, the training information of any moment t in the N moments comprises state information of the moment t, a control strategy of the moment t and a prediction reward of the control strategy of the moment t, the state information of the moment t comprises a real-time reward of the moment t, observation information of the moment t and an area identification of the target object at the moment t, which are fed back by the target environment, the control strategy of the moment t and the prediction reward of the control strategy of the moment t are output by a decision model after the state information of the moment t is input, the control strategy of the moment t comprises a moving action strategy, a target task action strategy and a target arrival area identification of the moment t, the target arrival area identification of the moment t is the same as the area identification of the target object at the moment t +1, and both t and N are positive integers;
constructing a target loss function according to the training data of the current state;
adjusting and updating parameters of the decision model according to the target loss function;
and inputting the state information of the target object at the (N + 1) th moment into the decision model with the updated parameters, and continuing training the decision model according to the training data of the next state of the target object until the training stopping condition is met to obtain the trained decision model.
In a second aspect, the present application provides a policy control method for a target object, including:
acquiring the current state information of a target object in a target environment;
inputting the state information of the current moment into a decision model, and outputting a control strategy of the current moment, wherein the decision model is obtained by training according to the decision model training method of the first aspect;
and controlling the target object to act by adopting the control strategy at the current moment.
In a third aspect, the present application provides a decision model training apparatus, including:
the acquisition module is used for acquiring training data of a target object in a current state in a target environment, wherein the training data of the current state comprises training information of N moments, the training information of any moment t in the N moments comprises state information of the moment t, a control strategy of the moment t and a predicted reward of the control strategy of the moment t, the state information of the moment t comprises a real-time reward of the moment t fed back by the target environment, observation information of the moment t and an area identification reached by the target object at the moment t, the control strategy of the moment t and the predicted reward of the control strategy of the moment t are output by a decision model, the control strategy of the moment t comprises a moving action strategy, a target task action strategy and a target arrival area identification of the moment t, wherein the target arrival area identification of the moment t is the same as the area identification reached by the target object at the moment t+1, and t and N are both positive integers;
the building module is used for building a target loss function according to the training data of the current state;
the adjusting module is used for adjusting and updating parameters of the decision model according to the target loss function;
and the processing module is used for inputting the state information of the target object at the (N + 1) th moment into the decision model after the parameters are updated, and continuing training the decision model according to the training data of the next state of the target object until a training stopping condition is met to obtain the trained decision model.
In a fourth aspect, the present application provides a policy control apparatus for a target object, comprising:
the acquisition module is used for acquiring the current state information of the target object in the target environment;
a control strategy output module, configured to input the state information at the current time into a decision model and output a control strategy at the current time, where the decision model is obtained by training according to the decision model training method of the first aspect;
and the action control module is used for controlling the target object to act by adopting the control strategy at the current moment.
In a fifth aspect, the present application provides a computer device comprising: a processor and a memory, the memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of the first aspect or the second aspect.
In a sixth aspect, the present application provides a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of the first or second aspect.
In a seventh aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform a method according to the first or second aspect.
In summary, the present application trains a decision model for controlling the action strategy of a target object in a target environment. During training, the real-time reward at time t fed back by the target environment, the observation information at time t and the identifier of the area reached by the target object at time t are jointly used as the input of the decision model at time t. Based on these three pieces of information, the decision model decides and outputs the control strategy at time t (including the movement action strategy at time t, the target task action strategy, and the target area that the target object should reach at the next moment, i.e. the area indicated by the target arrival area identifier at time t) together with the predicted reward of the control strategy at time t. The decision model therefore learns not only the movement action strategy, but also the target task action strategy and the target area the target object should reach at the next moment; the target area is used to make the target object move and perform the target task within a specific area at the next moment, so that the decision model can learn diverse control strategies.
Further, in the embodiment of the application, a two-stage learning and training mode is adopted when the decision model is trained. In the first stage, a post-event near-end policy optimization method (hindsight proximal policy optimization, HPPO) is used to train a target decision model, which learns the movement action strategy of the target object. In the second stage, the target decision model is used as a teacher model of the decision model, and supervised learning is performed on the decision model so that it learns the target task action strategy and the target area the target object should reach at the next moment. This multi-stage training mode reduces the training difficulty and improves the training efficiency.
Drawings
Fig. 1 is a schematic view of an application scenario of a decision model training method and a strategy control method for a target object according to an embodiment of the present application;
FIG. 2 is a flowchart of a decision model training method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a decision model training method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a vertical perspective map corresponding to a target environment;
FIG. 5 is a schematic diagram of the map shown in FIG. 4 after the map with the vertical view is divided into regions;
fig. 6 is a schematic process diagram of a decision model outputting a control decision at time t according to input state information at time t according to an embodiment of the present application;
fig. 7 is a flowchart of a policy control method for a target object according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a decision model training apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a policy control apparatus for a target object according to an embodiment of the present application;
fig. 10 is a schematic block diagram of a computer device 700 provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments of the application provide a decision model training method, a strategy control method for a target object and a device, which relate to Artificial Intelligence (AI) technology, in particular to machine learning and reinforcement learning within AI. Before the technical solution of the application is introduced, the related concepts are first described:
artificial intelligence: the method is a theory, method, technology and application system for simulating, extending and expanding human intelligence by using a digital computer or a machine controlled by the digital computer, sensing the environment, acquiring knowledge and obtaining the best result by using the knowledge. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Machine Learning (ML): the method is a multi-field cross discipline and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The method specially studies how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
Deep Learning (DL): is a branch of machine learning, an algorithm that attempts to perform high-level abstraction of data using multiple processing layers that contain complex structures or consist of multiple nonlinear transformations. Deep learning is to learn the internal rules and the expression levels of training sample data, and information obtained in the learning process is very helpful to the interpretation of data such as characters, images and sounds. The final goal of deep learning is to make a machine capable of human-like analytical learning, and to recognize data such as characters, images, and sounds. Deep learning is a complex machine learning algorithm, and achieves the effect in speech and image recognition far exceeding the prior related art.
Reinforcement learning is one form of machine learning and can be used to train decision models that control the action decisions of target objects in a target environment. In the training process, the embodiment of the application uses a neural network model based on reinforcement learning to learn the action decisions of the target object and learns to control the target object so that it adapts to the target environment, thereby obtaining the decision model. The embodiment of the application can then use the trained decision model to control the action strategy of the target object in the target environment.
Neural Networks (NN): a deep learning model simulating the structure and function of a biological neural network in the field of machine learning and cognitive science.
Target environment: the target environment may be a virtual environment or a real environment, and the target environment may have various expression forms, for example, the target environment may be a chessboard, a sports field, or a pursuit field, and for example, the target environment may be various game environments such as a virtual chessboard or a virtual pursuit field set by a computer.
Target object: refers to a movable object in a target environment.
Game environment: a virtual environment displayed when a game application is run on a terminal device. The game environment can be a simulation of the real world, a semi-simulated and semi-fictional three-dimensional environment, or a purely fictional three-dimensional environment. Optionally, the game environment is used for battles between at least two game characters (e.g., multiplayer battle games such as multiplayer shooting games, multiplayer pursuit games and multiplayer hiding games), for example with game resources available for use by the at least two game characters in the game environment. Optionally, the map of the game environment is composed of a plurality of squares or rectangles.
Game role: also referred to as a virtual object, refers to a movable object in a game environment. The movable object may be at least one of a virtual character, a virtual animal, an animation character, and the like. Illustratively, when the game environment is a three-dimensional game environment, the game characters are three-dimensional solid models, and each game character has its own shape and volume in the three-dimensional game environment and occupies a part of its space. Alternatively, the game character may be a hero character, a soldier or a neutral creature in a multiplayer online tactical competitive game.
An intelligent agent: the intelligent agent in the embodiment of the application refers to a game role which can interact with a game environment in a game. For example, an agent may communicate with other agents or fight against each other according to existing instructions or through autonomous learning based on its own perception of the game environment in a particular game environment, and autonomously accomplish a set goal in the game environment in which the agent is located.
In the related art, when an agent faces different opponent agents, the action decisions of the agent are single and fixed. To solve this problem, the present application trains a decision model for controlling the action strategy of a target object in a target environment. When the decision model is trained, the state information at time t is used as the input of the decision model, which outputs the control strategy at time t and the predicted reward of the control strategy at time t. The state information at time t includes the real-time reward at time t fed back by the target environment, the observation information at time t, and the identifier of the area reached by the target object at time t. The control strategy at time t includes a movement action strategy, a target task action strategy and a target arrival area identifier at time t, where the target arrival area identifier at time t is the same as the identifier of the area reached by the target object at time t+1; in other words, the target arrival area identifier output at time t is used as the reached-area identifier that is input at the next moment. The state information at time t, the control strategy at time t and the predicted reward of the control strategy at time t form the training information at time t, and the training information of N consecutive moments forms the training data of the current state of the target object. A target loss function is constructed from the training data of the current state, the parameters of the decision model are adjusted and updated according to the target loss function, the state information of the target object at moment N+1 is input into the decision model with the updated parameters, and training continues with the training data of the next state of the target object until a training stop condition is met, yielding the trained decision model. Because the real-time reward at time t fed back by the target environment, the observation information at time t and the identifier of the area reached by the target object at time t are jointly used as the input of the decision model at time t, the decision model decides and outputs, based on these three pieces of information, the control strategy at time t (including the movement action strategy at time t, the target task action strategy, and the target area the target object should reach at the next moment, i.e. the area indicated by the target arrival area identifier at time t) together with the predicted reward of the control strategy at time t. The real-time reward at time t fed back by the target environment is the reward obtained after the target object performs the target task against an unknown opponent. The decision model therefore learns not only the movement action strategy, but also the target task action strategy and the target area the target object should reach at the next moment; the target area is used to make the target object move and perform the target task within a specific area at the next moment, so that the decision model can learn diverse control strategies.
When the target environment is a game environment, the method can be adapted to various application scenarios (such as first-person or third-person perspectives); real opponent information is not needed during training, which reduces the training cost and broadens the range of application.
Furthermore, the observation information in the embodiment of the application is first-person observation information, i.e. the target environment information that the target object itself can observe, so a decision method that yields diverse strategies is provided for the target object in a first-person multiplayer game environment. The difficulty of reasoning and of acquiring information in a first-person environment is far higher than in a third-person environment or a fully observable environment.
Further, in the embodiment of the application, a two-stage learning and training mode is adopted when the decision model is trained. In the first stage, a post-event near-end policy optimization method is used to train a target decision model, which learns the movement action strategy of the target object. In the second stage, the target decision model is used as a teacher model of the decision model, and supervised learning is performed on the decision model so that it learns the target task action strategy and the target area the target object should reach at the next moment. The multi-stage training mode reduces the training difficulty and improves the training efficiency.
For example, the decision model training method and the target object strategy control method provided in the embodiments of the present application may be applied to game scenes such as a multiplayer game, such as a multiplayer shooting game, a multiplayer pursuit game, a multiplayer hiding game, and the like, and the decision model obtained through reinforcement learning is used to control a target object in the game to compete with other target objects.
Illustratively, when the target environment is a game environment and the target object is an agent, the decision model trained by the decision model training method provided in the embodiment of the present application can be loaded into a multiplayer game scene to improve the player's game experience. Like a human player, the agent neither modifies nor obtains background parameters of the game through cheating, and it can still express diverse strategies, which increases the playability of the game.
It can be understood that, in the embodiment of the present application, a game environment is taken as an example of a target environment, and the decision model training method and the policy control method for a target object provided in the embodiment of the present application may also be applied to other scenarios, which are not limited in the embodiment of the present application.
Fig. 1 is a schematic view of an application scenario of a decision model training method and a policy control method for a target object according to an embodiment of the present disclosure, as shown in fig. 1, the application scenario of the embodiment of the present disclosure relates to a server 1 and a terminal device 2, and the terminal device 2 may perform data communication with the server 1 through a wired network or a wireless network.
In some implementation manners, the terminal device 2 is a device having a rich man-machine interaction manner, having an internet access capability, generally carrying various operating systems, and having a strong processing capability. The terminal device 2 may be a terminal device such as a smart phone, a tablet computer, a laptop computer, a desktop computer, or a telephone watch, but is not limited thereto.
The server 1 in fig. 1 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing a cloud computing service. This is not limited by the present application. In this embodiment, the server 1 may be a background server of a game application program in the terminal device 2, and may be a server of a game platform.
In some implementations, fig. 1 exemplarily shows one terminal device and one server, and may actually include other numbers of terminal devices and servers, which are not limited in this application.
Illustratively, a game-like client or other client may be installed and run on the terminal device 2. For example, taking a game client as an example, the terminal device 2 may obtain a competitor for the player to select from the server 1, and after the player selects a target opponent, the terminal device 2 controls a target object corresponding to the target opponent to compete with a game character controlled by the player by using a trained decision model. The decision model is obtained through training by the decision model training method provided by the embodiment of the application.
The decision model training method provided by the embodiment of the application can be executed by the server 1, and can also be executed by other servers or other terminal devices. Optionally, the policy control method for the target object provided in the embodiment of the present application may be executed by the server 1, or may be executed by the terminal device 2, or may be executed by the server 1 and the terminal device 2 in cooperation.
The technical scheme of the application will be explained in detail as follows:
fig. 2 is a flowchart of a decision model training method provided in an embodiment of the present application, where an execution subject of the method may be a server, and as shown in fig. 2, the method may include:
s101, obtaining training data of a target object in a current state in a target environment, wherein the training data of the current state comprises training information of N moments, the training information of any moment t in the N moments comprises state information of the moment t, a control strategy of the moment t and a prediction reward of the control strategy of the moment t, the state information of the moment t comprises a real-time reward of the moment t, observation information of the moment t and an arrival area identification of the target object of the moment t, which are fed back by the target environment, the control strategy of the moment t and the prediction reward of the control strategy of the moment t are output of a decision model, the control strategy of the moment t comprises a moving action strategy, a target task action strategy and an arrival area identification of the target object of the moment t, the arrival area identification of the target object of the moment t is the same as the arrival area identification of the target object of the moment t +1, and both t and N are positive integers.
Specifically, in the process of training the decision model, the parameters of the decision model are updated once using the training data of one state of the target object in the target environment. The training data of one state may include N pieces of training information, i.e. the decision model outputs N control decisions, and one piece of training information corresponds to one moment. N is a preset number; for example, N may be 1024. Alternatively, N may be a preset number of accumulated consecutive steps, i.e. the number of consecutive control decision outputs.
In the embodiment of the application, for any moment t of the N moments, during training of the decision model the state information at time t is used as the input of the decision model, and the control strategy at time t and the predicted reward of the control strategy at time t are output. The state information at time t includes the real-time reward at time t fed back by the target environment, the observation information at time t and the identifier of the area reached by the target object at time t. The control strategy at time t includes a movement action strategy, a target task action strategy and a target arrival area identifier at time t, where the target arrival area identifier at time t is the same as the identifier of the area reached by the target object at time t+1; that is, the target arrival area identifier output at time t is input as the reached-area identifier at the next moment. The state information at time t, the control strategy at time t and the predicted reward of the control strategy at time t form the training information at time t, and the training information of N consecutive moments forms the training data of the current state of the target object.
In the embodiment of the present application, the movement action policy may be angle adjustment, action displacement, left turn, right turn, or posture adjustment, and the like.
In this embodiment, taking the target environment as the game environment as an example, in a game scene, the target task may be a shooting task, a searching task, or an evading task. Accordingly, the target task action strategy may be shooting, seeking, or evading.
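For concreteness, the pieces of information described above can be organized as simple records. The following Python sketch is purely illustrative; the field names (realtime_reward, reached_region_id, and so on) are assumptions and are not terms prescribed by the embodiment.

```python
from dataclasses import dataclass

@dataclass
class StateInfo:
    """State information of the target object at moment t (hypothetical field names)."""
    realtime_reward: float      # real-time reward fed back by the target environment at moment t
    observation: list           # first-person observation information at moment t
    reached_region_id: int      # identifier of the area reached by the target object at moment t

@dataclass
class ControlStrategy:
    """Control strategy output by the decision model at moment t."""
    move_action: int            # movement action strategy, e.g. turn left / turn right / displacement
    task_action: int            # target task action strategy, e.g. shoot / seek / evade
    target_region_id: int       # target arrival area: region to reach at moment t + 1

@dataclass
class TrainingInfo:
    """One of the N records that make up the training data of the current state."""
    state: StateInfo
    strategy: ControlStrategy
    predicted_reward: float     # reward predicted by the decision model for this control strategy
```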
Before the state information at the time t is input into the decision model, the state information at the time t is to be obtained, and as an implementable manner, the obtaining of the state information at the time t may specifically include:
and S1011, acquiring the real-time reward of the target environment feedback at the time t.
Optionally, acquiring the real-time reward fed back by the target environment at the time t may specifically include: controlling the target object to perform the corresponding action using the control strategy at time t-1; determining that the target object has completed the target task at time t and has obtained the real-time reward fed back by the target environment; generating a characteristic information hidden variable from the observation information at time t and the real-time reward fed back by the target environment, and storing the characteristic information hidden variable into a characteristic information statistical table to obtain an updated characteristic information statistical table; and normalizing the updated characteristic information statistical table and determining the normalized characteristic information statistical table as the real-time reward fed back by the target environment at time t.
And S1012, acquiring observation information of the target object at the time t.
S1013, the target arrival area mark at the time t-1 of the target object is used as the area mark at which the target object arrives at the time t.
If the decision model performs the output of the first control decision, that is, t is 1, the area identifier at which the target object arrives at the time t may be a preset area identifier.
And S1014, forming the state information of the target object at the time t from the real-time reward at the time t fed back by the target environment, the observation information at the time t and the area identification reached by the target object at the time t.
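Steps S1011 to S1014 can be read as assembling one state record per moment. The sketch below is a minimal illustration under stated assumptions: env.step and decision_model.encode_feedback are hypothetical helpers, and reducing the normalized characteristic information statistical table to a scalar reward is only one possible reading of the step above.

```python
import numpy as np

def build_state_info(env, decision_model, prev_strategy, feature_table):
    """Sketch of S1011-S1014; helper names are assumptions."""
    # S1011: act with the control strategy of moment t-1 and read the environment feedback
    observation, raw_reward, task_done = env.step(prev_strategy)
    hidden = decision_model.encode_feedback(observation, raw_reward)  # characteristic information hidden variable
    feature_table.append(hidden)                                      # update the statistical table
    table = np.stack(feature_table)
    normalized = (table - table.mean(axis=0)) / (table.std(axis=0) + 1e-8)
    realtime_reward = float(normalized[-1].mean())   # one possible scalarisation of the normalized table
    # S1012: observation at moment t; S1013: the area reached at moment t is the
    # target arrival area identifier output at moment t-1
    reached_region_id = prev_strategy.target_region_id
    # S1014: assemble the state information at moment t
    return StateInfo(realtime_reward, list(observation), reached_region_id)
```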
And S102, constructing a target loss function according to the training data of the current state.
S103, adjusting and updating parameters of the decision model according to the target loss function.
Specifically, the parameters of the decision model may be adjusted and updated according to the target loss function through back propagation.
And S104, inputting the state information of the target object at the (N + 1) th moment into the decision model after the parameters are updated, and continuing training the decision model according to the training data of the next state of the target object until the training stopping condition is met to obtain the trained decision model.
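The overall loop of S101 to S104 can be sketched as follows, assuming a PyTorch-style model and optimizer; initial_state_info, build_state_info and target_loss are hypothetical helpers standing in for the steps described above.

```python
def train_decision_model(env, model, optimizer, N, stop_condition):
    """Sketch of S101-S104: collect N moments of training information, build the target
    loss, back-propagate, and continue from the state at moment N+1."""
    state = initial_state_info(env)
    while not stop_condition():
        batch = []                                     # training data of the current state
        for _ in range(N):
            strategy, predicted_reward = model(state)  # decision model output at moment t
            batch.append((state, strategy, predicted_reward))
            state = build_state_info(env, model, strategy, model.feature_table)
        loss = target_loss(batch)                      # S102: construct the target loss function
        optimizer.zero_grad()
        loss.backward()                                # S103: back propagation
        optimizer.step()
        # S104: `state` now holds the state information at moment N+1 and seeds the next round
    return model
```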
In an embodiment of the present application, in an implementable manner, when a decision model is trained, the decision model is trained by adopting a two-stage learning training mode. Before S101, the method of this embodiment may further include:
s105, training a target decision model by using a post-event near-end policy optimization method, wherein the target decision model is used for training a moving action strategy of a target object.
Specifically, the post-event near-end policy optimization method, i.e. hindsight proximal policy optimization (HPPO), modifies the reward obtained by the target object by replacing the sampled area that was not reached with the area that was actually reached, so that the target object obtains pseudo reward information; in this way the target object has more opportunities to train on existing data, which improves the data utilization efficiency.
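A minimal sketch of this hindsight relabelling idea is given below; reward_fn stands in for however the target environment scores reaching a given area, which is an assumption and not part of the disclosure.

```python
def hindsight_relabel(transitions, reward_fn):
    """Replace the sampled goal area g that was not reached with the actually reached
    area g' and recompute the reward under g', yielding pseudo reward information."""
    relabeled = []
    for state, action, sampled_region, reached_region in transitions:
        pseudo_reward = reward_fn(state, action, reached_region)  # reward as if g' had been the goal
        relabeled.append((state, action, pseudo_reward, reached_region))
    return relabeled
```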
Specifically, the target decision model is trained first with the post-event near-end policy optimization method so that the movement action strategy of the target object is learned, yielding a target decision model with a trained movement action strategy. The target decision model is then used as the teacher model of the decision model to be trained, with the decision model as the student model, so that supervised learning is performed on the decision model and it learns the target task action strategy and the target area the target object should reach at the next moment.
Optionally, training the target decision model with the post-event near-end policy optimization method specifically includes the following. First, first training data of the target object in the current state in the target environment is acquired. The first training data of the current state comprises training information at T moments; the training information at any moment t of the T moments comprises first state information at moment t, a movement action strategy at moment t and a predicted reward at moment t. The first state information at moment t comprises the real-time reward at moment t fed back by the target environment, the observation information at moment t and a randomly sampled area identifier. The first state information at moment t is input into the target decision model, which outputs the movement action strategy at moment t and the predicted reward at moment t; the predicted reward at moment t is predicted by the target decision model from the first state information at moment t and the area actually reached at moment t, and T is a preset positive integer. Then, a post-event near-end policy optimization loss function is constructed from the training data of the current state, the parameters of the target decision model are adjusted and updated according to the constructed loss function, the training information of the target object at the (T+1)-th moment is input into the target decision model with the updated parameters, input-output training over T moments is performed again to obtain the first training data of the next state of the target object, and training of the target decision model continues with the first training data of the next state until a training stop condition is met, yielding the trained target decision model.
The input of the target decision model is the first state information at moment t, which comprises the real-time reward at moment t fed back by the target environment, the observation information at moment t and the randomly sampled area identifier; the output of the target decision model is the movement action strategy at moment t and the predicted reward at moment t, where the predicted reward at moment t is the reward predicted by the target decision model from the first state information at moment t and the area actually reached at moment t.
Optionally, the post-event near-end policy optimization loss function (also called the constrained objective) is shown by the following formula:

$$L^{\mathrm{HPPO}}(\theta)=\frac{1}{T}\sum_{t=0}^{T-1}\min\Big(\rho_t(\theta)\,\hat A_t,\ \operatorname{clip}\big(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat A_t\Big)$$

In the formula, g and g' respectively denote the randomly sampled area identifier and the identifier of the actually reached area, θ is the parameter of the target decision model updated this time, θ_old is the target decision model parameter before this update, and π_θ denotes the policy under the currently updated parameters. The advantage function Â_t is calculated by the target decision model before this update: it equals the real-time reward r_t at moment t fed back by the target environment minus the reward V(s_t, g') predicted by the target decision model from the state information s_t at moment t and the actual arrival area identifier g', i.e. Â_t = r_t − V(s_t, g'). a_t is the movement action. ρ_t(θ) is the importance sampling weight coefficient: π_{θ_old}(a_t | s_t, g') denotes the probability that the target decision model before this update assigns to the movement action given the state information and the actual arrival area identifier g', π_{θ_old}(a_t | s_t, g) denotes the probability it assigns given the randomly sampled area identifier g, and π_θ(a_t | s_t, g') denotes the probability that the target decision model with the updated parameters assigns to the movement action given the state information and the actual arrival area identifier g'. The importance sampling weight coefficient is the ratio

$$\rho_t(\theta)=\frac{\pi_\theta(a_t\mid s_t,g')}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t,g)},$$

so that L^{HPPO}(θ) is a loss function with respect to the parameter θ. In this embodiment, when the movement control strategy is trained, the target decision model parameters are updated every T steps, where T is also the number of control decisions made. The first training data of the current state is denoted by the trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, …); taking t = 0 as an example, s_0 denotes the state information at moment 0, a_0 denotes the movement action strategy output by the target decision model at moment 0, and r_0 denotes the real-time reward at moment 0 fed back by the target environment. L^{HPPO}(θ) is the average of the post-event near-end policy optimization losses corresponding to the training information at the T moments, i.e. the per-moment losses are summed and then divided by T.

To enhance the stability of the algorithm, a clipping constraint is adopted in this embodiment, where ε is the clipping range. In the above formula, clip means that if the computed importance sampling weight coefficient ρ_t(θ) is smaller than 1 − ε, it is taken as 1 − ε; if it is larger than 1 + ε, it is taken as 1 + ε; otherwise it is kept as ρ_t(θ), so that the clipped importance sampling weight coefficient lies between 1 − ε and 1 + ε. For example, ε = 0.2 may be set so that the coefficient lies between 0.8 and 1.2, and the maximum value of the loss term is then no higher than 1.2 times the advantage function while the minimum value is no lower than 0.8 times the advantage function.
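Under the reading above, the clipped HPPO objective can be sketched in PyTorch; the log-probability tensors are assumed to be those of the taken movement actions under the relabelled identifier g' (updated policy) and the sampled identifier g (policy before the update).

```python
import torch

def hppo_loss(logp_new_gprime, logp_old_g, rewards, values_gprime, eps=0.2):
    """Sketch of the clipped surrogate; minimising the returned value maximises the objective."""
    ratio = torch.exp(logp_new_gprime - logp_old_g)       # importance sampling weight coefficient
    advantage = rewards - values_gprime                    # A_t = r_t - V(s_t, g')
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()  # average over the T moments
```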
And S106, determining the target decision model as a teacher model of the decision model so as to perform supervised learning training on the decision model.
Optionally, the decision model in this embodiment may be a neural network model.
In the multi-stage training mode of this embodiment, a post-event near-end policy optimization method is used in the first stage to train the target decision model, which learns the movement action strategy of the target object; in the second stage, the target decision model is used as the teacher model of the decision model, and supervised learning is performed on the decision model so that it learns the target task action strategy and the target area the target object should reach at the next moment. This reduces the training difficulty and improves the training efficiency.
Optionally, the constructing the target loss function according to the training data of the current state in S102 may specifically include:
and S1021, constructing a near-end strategy optimization loss function according to the training data of the current state.
As an implementable manner, the near-end strategy optimization loss function is constructed according to the training data of the current state, which may specifically be:
s1, aiming at training information at each time t in N times, calculating to obtain a near-end strategy optimization loss function corresponding to the training information at the time t according to a control strategy at the time t, a prediction reward of the control strategy at the time t, a real-time reward at the time t fed back by a target environment, observation information at the time t and an area mark where a target object at the time t arrives.
Optionally, the near-end policy optimization loss function corresponding to the training information at the time t is obtained by calculation according to the control strategy at the time t, the prediction reward of the control strategy at the time t, the real-time reward at the time t fed back by the target environment, the observation information at the time t, and the area identifier where the target object at the time t arrives, and specifically may be:
the method comprises the steps of firstly determining a predicted action probability ratio according to real-time rewards at the time t, observation information at the time t, area identification of target objects at the time t and a control strategy at the time t, which are fed back by a target environment.
Then, the difference value of the real-time reward at the t moment and the predicted reward of the control strategy at the t moment fed back by the target environment is calculated. And calculating to obtain a near-end strategy optimization loss function corresponding to the training information at the time t according to the difference value between the predicted action probability ratio and the real-time reward at the time t fed back by the target environment and the predicted reward of the control strategy at the time t.
As a practical way, the near-end strategy optimization loss function corresponding to the training information at time t may be represented by the following formula:
$$L_t^{\mathrm{PPO}}(\theta)=\min\Big(r_t(\theta)\,\hat A_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat A_t\Big)$$

where r_t(θ) is the predicted action probability ratio: the ratio of the probability that the decision model with the parameters updated this time, π_θ, assigns to the actions a_t of each type (the movement action, the target task action and the target arrival area identifier) given the state information s_t at moment t, to the probability that the decision model before this update, π_{θ_old}, assigns to the same actions given s_t. Â_t is the difference between the real-time reward at moment t fed back by the target environment and the predicted reward of the control strategy at moment t. To enhance the stability of the algorithm, a clipping constraint is adopted in this embodiment, where ε is the clipping range. In the above formula, clip means that if the computed r_t(θ) is smaller than 1 − ε, the predicted action probability ratio is taken as 1 − ε; if it is larger than 1 + ε, it is taken as 1 + ε; otherwise it is kept as r_t(θ), so that the clipped ratio lies between 1 − ε and 1 + ε. For example, ε = 0.2 may be set so that the predicted action probability ratio lies between 0.8 and 1.2, and the maximum value of the loss term is then no higher than 1.2 times Â_t while the minimum value is no lower than 0.8 times Â_t.
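A sketch of the corresponding clipped term (near-end / proximal policy optimization, PPO) in PyTorch follows; it assumes the decision model exposes one categorical distribution per output head (movement action, target task action and target arrival area), so the joint action probability is the product of the three heads. This head structure is an assumption, not something the embodiment prescribes.

```python
import torch

def ppo_term(new_heads, old_heads, actions, realtime_reward, predicted_reward, eps=0.2):
    """Clipped PPO term for one moment t; heads are torch.distributions objects (assumed)."""
    def joint_log_prob(heads):
        move, task, region = heads
        return (move.log_prob(actions["move"])
                + task.log_prob(actions["task"])
                + region.log_prob(actions["region"]))
    ratio = torch.exp(joint_log_prob(new_heads) - joint_log_prob(old_heads).detach())
    advantage = realtime_reward - predicted_reward         # reward difference used as the advantage
    return -torch.min(ratio * advantage,
                      torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage)
```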
In an implementable manner, the real-time reward fed back by the target environment at moment t is the sum of an exploration strategy reward at moment t and an exploitation strategy reward at moment t, where the exploration strategy reward at moment t is determined from the number of areas reached at moment t and the target arrival area at moment t, and the exploitation strategy reward at moment t is determined from the total number of target tasks completed at moment t and the target arrival area at moment t.

Optionally, the real-time reward r_t fed back by the target environment at moment t may be represented by the following formula:

$$r_t = r_t^{\mathrm{explore}} + r_t^{\mathrm{exploit}}$$

where r_t^explore is the exploration strategy reward at moment t, r_t^exploit is the exploitation strategy reward at moment t, F_2 denotes the number of areas reached at moment t, g_t denotes the target arrival area at moment t, and F_1 denotes the total number of target tasks completed at moment t.

Softmax is the normalized exponential function: if V_i denotes the i-th element of V, the Softmax value of that element is

$$S_i=\frac{e^{V_i}}{\sum_j e^{V_j}}.$$

Replacing i in the above formula with F_1(g_t) gives r_t^exploit; r_t^explore is obtained in the same way from F_2 and g_t.
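The explore/exploit decomposition can be illustrated as below. The exact expressions for r_t^explore and r_t^exploit appear only as formula images in the original publication, so treating both terms as a Softmax over per-area statistics indexed at the target arrival area g_t is purely an assumption.

```python
import numpy as np

def softmax(v):
    """Normalized exponential function: S_i = exp(V_i) / sum_j exp(V_j)."""
    e = np.exp(v - np.max(v))       # subtract the max for numerical stability
    return e / e.sum()

def realtime_reward(task_counts, visit_counts, target_region):
    """Illustrative reading of r_t = r_explore + r_exploit over per-area statistics."""
    r_exploit = softmax(task_counts)[target_region]   # based on F_1 and g_t
    r_explore = softmax(visit_counts)[target_region]  # based on F_2 and g_t
    return r_explore + r_exploit
```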
S2, determining the average value of the near-end policy optimization loss functions corresponding to the training information at the N moments as the near-end policy optimization loss function.
And S1022, constructing a supervised learning loss function according to the training data of the current state and the teacher model.
As an implementable manner, the supervised learning loss function is constructed according to the training data of the current state and the teacher model, which may specifically be:
and calculating a cross entropy loss function of the movement action strategy in the control strategy at the time t and the movement action strategy at the time t output by the corresponding teacher model according to the training information at each time t in the N moments. And determining the average value of the cross entropy loss functions corresponding to the training information at the N moments as a supervised learning loss function.
And S1023, determining the weighted sum of the near-end strategy optimization loss function and the supervised learning loss function as a target loss function.
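The supervised learning (distillation) term of S1022 can be sketched as a standard cross-entropy against the teacher's movement actions; tensor shapes and names are assumptions.

```python
import torch.nn.functional as F

def supervised_learning_loss(student_move_logits, teacher_move_actions):
    """Cross entropy between the student decision model's movement-action logits
    (shape [N, num_moves]) and the teacher model's movement actions (shape [N]),
    averaged over the N moments (mean reduction is the default)."""
    return F.cross_entropy(student_move_logits, teacher_move_actions)
```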
Illustratively, the target loss function L_Frag may be represented by the following formula:

$$L_{\mathrm{Frag}}=\lambda\,L_{\mathrm{PPO}}+\mu\,L_{\mathrm{SL}}$$

where L_PPO is the near-end policy optimization loss function, L_SL is the supervised learning loss function, and λ and μ are two weights whose sum is 1, e.g. λ = μ = 0.5.
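Putting the two terms together, a one-line sketch of the weighted target loss (weight values taken from the example above):

```python
def target_loss(l_ppo, l_sl, lam=0.5, mu=0.5):
    """L_Frag = lambda * L_PPO + mu * L_SL, with lambda + mu = 1."""
    return lam * l_ppo + mu * l_sl
```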
In this embodiment of the present application, optionally, before the state information of the target object at the N +1 th time is input into the decision model after the parameter is updated in S104, the method of this embodiment may further include: and acquiring the state information of the target object at the (N + 1) th moment. The obtaining of the state information at the N +1 th time may specifically be:
and S1041, acquiring real-time reward of the N +1 th moment fed back by the target environment.
Optionally, obtaining the real-time reward fed back by the target environment at the (N+1)-th moment may specifically be:
controlling the target object to perform the corresponding action using the control strategy at the N-th moment; determining that the target object has completed the target task at the (N+1)-th moment and has obtained the real-time reward fed back by the target environment; generating a characteristic information hidden variable from the observation information at the (N+1)-th moment and the real-time reward fed back by the target environment, and storing the characteristic information hidden variable into the characteristic information statistical table to obtain an updated characteristic information statistical table; and normalizing the updated characteristic information statistical table and determining the normalized characteristic information statistical table as the real-time reward fed back by the target environment at the (N+1)-th moment.
S1042, acquiring the observation information of the target object at the (N + 1) th moment.
And S1043, taking the target arrival area identifier of the target object at the Nth time as the area identifier of the target object at the (N + 1) th time.
And S1044, forming the state information of the target object at the (N + 1) th time by the real-time reward of the (N + 1) th time, the observation information of the (N + 1) th time and the area identification reached by the target object at the (N + 1) th time fed back by the target environment.
In one practical implementation manner of S104, the state information of the target object at the (N+1)-th moment is input into the decision model with the updated parameters, and the decision model with the adjusted parameters is used for: respectively performing feature extraction on the real-time reward at the (N+1)-th moment fed back by the target environment, the observation information at the (N+1)-th moment and the identifier of the area reached by the target object at the (N+1)-th moment, to obtain a first hidden variable, a second hidden variable and a third hidden variable; combining the first hidden variable, the second hidden variable and the third hidden variable to obtain a target hidden variable; performing feature extraction on the target hidden variable to obtain feature information of the target hidden variable; and generating the control strategy at the (N+1)-th moment according to the feature information of the target hidden variable.
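The forward pass described in this paragraph can be sketched as a small PyTorch module; the encoder types, hidden size and head layout are assumptions rather than an architecture prescribed by the embodiment.

```python
import torch
import torch.nn as nn

class DecisionModel(nn.Module):
    """Sketch of the forward pass: three encoders, concatenation, shared trunk, four heads."""
    def __init__(self, obs_dim, num_regions, num_moves, num_tasks, hidden=128):
        super().__init__()
        self.reward_enc = nn.Linear(1, hidden)                # first hidden variable (real-time reward)
        self.obs_enc = nn.Linear(obs_dim, hidden)             # second hidden variable (observation)
        self.region_enc = nn.Embedding(num_regions, hidden)   # third hidden variable (reached area id)
        self.trunk = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU())  # target hidden variable features
        self.move_head = nn.Linear(hidden, num_moves)          # movement action strategy
        self.task_head = nn.Linear(hidden, num_tasks)          # target task action strategy
        self.region_head = nn.Linear(hidden, num_regions)      # target arrival area identifier
        self.value_head = nn.Linear(hidden, 1)                 # predicted reward of the control strategy

    def forward(self, reward, obs, region_id):
        # reward: [B, 1] float, obs: [B, obs_dim] float, region_id: [B] long
        h = torch.cat([self.reward_enc(reward),
                       self.obs_enc(obs),
                       self.region_enc(region_id)], dim=-1)    # combine the three hidden variables
        z = self.trunk(h)                                      # feature information of the target hidden variable
        return (self.move_head(z), self.task_head(z),
                self.region_head(z), self.value_head(z))
```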
In this embodiment of the application, the state information at time t includes an identifier of an area where the target object at time t arrives, so before S101, it is further necessary to perform area division on the map corresponding to the target environment, obtain a plurality of areas, assign an identifier to each area, and store the assigned identifier. Correspondingly, before S101, the method of this embodiment may further include:
and S107, acquiring vertical visual angle information of the map corresponding to the target environment.
And S108, carrying out area division on the map according to the vertical visual angle information of the map to obtain a plurality of areas.
And S109, allocating identification to each area.
As an implementable manner, in S108, the map is divided into regions according to the vertical perspective information of the map, so as to obtain a plurality of regions, which may specifically include:
s1081, determining a central point of each area in the plurality of areas according to the vertical visual angle information of the map and a central point selection mode, wherein the central point selection mode is as follows: the distance between any two adjacent central points is the same, and central points are placed at corners and in open areas with regular shapes.
S1082, for each target point in the map other than the determined central points, determining the area to which the target point belongs according to the principle that the Euclidean distance between the target point and a central point is the shortest, so as to obtain the plurality of areas.
Correspondingly, in S109, allocating an identifier to each area may specifically be: allocating an identifier to the central point of each area. Optionally, the central points of the areas may be numbered, and the one-hot encoding result of each number is used as the identifier of the corresponding central point.
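As an illustration only (NumPy, with hypothetical array shapes), the nearest-center assignment of step S1082 and the one-hot identifiers could look like the sketch below; the selection of the central points themselves is left out, since the text only constrains their spacing and placement.

```python
import numpy as np

def assign_regions(points, centers):
    """Assign each map point to the region whose central point is nearest
    in Euclidean distance.

    points:  [P, 2] map coordinates (excluding the central points themselves)
    centers: [R, 2] coordinates of the region central points
    returns: [P] index of the region each point belongs to
    """
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Identifiers: number the central points 0..R-1 and one-hot encode the numbers.
num_regions = 20                                     # e.g., 20 regions
region_ids = np.eye(num_regions, dtype=np.float32)   # region_ids[k] identifies region k
```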
Optionally, the observation information in the embodiment of the present application may be first-person observation information, that is, the target environment information that the target object itself can observe. This provides the target object with a decision method that yields diverse strategies in a first-person multi-player game environment, where reasoning is difficult and only a limited amount of information can be acquired.
In the decision model training method provided by this embodiment, when the decision model is trained, the real-time reward at time t fed back by the target environment, the observation information at time t, and the identifier of the area reached by the target object at time t are jointly used as the input of the decision model at time t. Based on these three parts of information, the decision model makes a decision and outputs the control strategy at time t (including the movement action strategy at time t, the target task action strategy, and the target area the target object should reach at the next time, i.e., the area indicated by the target arrival area identifier at time t) as well as the prediction reward of the control strategy at time t. The real-time reward at time t fed back by the target environment is the reward obtained after the target object performs the target task against an unknown opponent. The decision model therefore learns not only the movement action strategy, but also the target task action strategy and the target area the target object should reach at the next time, the target area being used to control the target object to move and perform the target task within a specific area at the next time. In this way, the decision model can learn a variety of control strategies, so that the action strategy of the target object can be adjusted when it faces different opponent target objects in a game environment, and diverse strategies can be applied without acquiring opponent information, which reduces the implementation cost.
The decision model training method provided by the embodiment of the present application is described in detail below with reference to fig. 3.
Fig. 3 is a flowchart of a decision model training method provided in an embodiment of the present application, where an execution subject of the method may be a server, and as shown in fig. 3, the method may include:
s201, acquiring vertical visual angle information of a map corresponding to the target environment.
In this embodiment, a game environment is taken as an example of the target environment, and accordingly, the target object may be an agent.
S202, carrying out area division on the map according to the vertical visual angle information of the map to obtain a plurality of areas.
S203, allocating an identifier for each area, and storing the corresponding relation between the areas and the identifiers.
In one embodiment, the dividing the map into regions according to the vertical perspective information of the map to obtain a plurality of regions may specifically include:
s2011, determining a central point of each area in the plurality of areas according to the vertical visual angle information of the map and a central point selection mode, wherein the central point selection mode is as follows: the distance between any two adjacent central points is the same, and central points are placed at corners and in open areas with regular shapes.
S2012, aiming at each target point in the map except the determined central point, determining the area to which the target point belongs according to the principle that the Euclidean distance between the target point and the central point is shortest, and obtaining a plurality of areas.
Correspondingly, in S203, allocating an identifier to each region may specifically be: allocating an identifier to the central point of each region. Optionally, the central point of each region may be numbered, and the one-hot encoding result of the number is used as the identifier of the central point.
Exemplarily, fig. 4 is a schematic diagram of a vertical perspective map corresponding to a target environment, and fig. 5 is a schematic diagram of a map obtained by dividing the vertical perspective map shown in fig. 4 into regions. As shown in fig. 5, in this embodiment, the vertical perspective map shown in fig. 4 is divided into 20 regions (denoted by numbers 0 to 19 in fig. 5), and the one-hot encoding result of each number is used as the identifier of the corresponding region.
S204, inputting the state information of the target object in the target environment at time t into the decision model, and outputting the control strategy at time t and the prediction reward of the control strategy at time t, wherein the state information at time t comprises the real-time reward at time t fed back by the target environment, the observation information at time t, and the identifier of the area reached by the target object at time t, and the control strategy at time t comprises a movement action strategy, a target task action strategy and a target arrival area identifier at time t.
Specifically, before inputting the state information of the target object at time t in the target environment into the decision model, obtaining the state information at time t, and obtaining the state information at time t may specifically include:
and S2041, acquiring real-time rewards at the t moment of target environment feedback.
Optionally, obtaining the real-time reward at time t fed back by the target environment may specifically include: controlling the target object to perform the corresponding action by using the control strategy at time t-1; determining that the target object completes the target task at time t and obtains the real-time reward fed back by the target environment; generating a characteristic information hidden variable according to the observation information at time t and the real-time reward fed back by the target environment, and storing the characteristic information hidden variable into a characteristic information statistical table to obtain an updated characteristic information statistical table; and performing normalization processing on the updated characteristic information statistical table, and determining the normalized characteristic information statistical table as the real-time reward at time t fed back by the target environment. A code sketch of this statistical-table bookkeeping is given after step S2044 below.
S2042, acquiring observation information of the target object at the time t.
S2043, taking the target arrival area identification of the target object at the time t-1 as the area identification of the target object at the time t.
If the decision model is outputting its first control decision, that is, t = 1, the identifier of the area reached by the target object at time t may be a preset area identifier.
And S2044, forming the real-time reward at the time t, the observation information at the time t and the area identification reached by the target object at the time t, which are fed back by the target environment, into the state information at the time t of the target object.
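The text does not specify the internal form of the characteristic information statistical table, so the following is only a bookkeeping sketch under assumed choices (how the hidden variable is formed and how the normalized table is read back as a scalar reward); it illustrates the update-then-normalize pattern of step S2041 rather than the actual implementation.

```python
import numpy as np

class FeatureStatsTable:
    """Illustrative characteristic information statistical table (all choices assumed)."""

    def __init__(self):
        self.entries = []

    def update(self, observation, env_reward):
        # Hypothetical characteristic information hidden variable built from the
        # observation at time t and the real-time reward fed back by the environment.
        hidden = np.array([float(env_reward), float(np.mean(observation))])
        self.entries.append(hidden)

    def normalized_reward(self):
        table = np.stack(self.entries)
        # Normalize the updated table and use the latest normalized reward entry as
        # the real-time reward signal passed to the decision model.
        norm = (table - table.mean(axis=0)) / (table.std(axis=0) + 1e-8)
        return float(norm[-1, 0])
```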
Fig. 6 is a schematic diagram illustrating a process of outputting a control decision at a time t by a decision model according to input state information at the time t according to the present application, where as shown in fig. 6, the process of outputting the control decision at the time t by the decision model according to the input state information at the time t may include:
s11, the decision model respectively performs feature extraction on the real-time reward at time t fed back by the target environment, the observation information at time t and the identifier of the region reached by the target object at time t, to obtain a first hidden variable, a second hidden variable and a third hidden variable.
In this embodiment, optionally, the feature extraction may also introduce a self-attention mechanism to different parts of the input to further extract the features, or a long-short term memory network (LSTM) may be used to enable the network to process the time-series features to enhance performance.
S12, combining the first hidden variable, the second hidden variable and the third hidden variable by the decision model to obtain a target hidden variable.
And S13, carrying out feature extraction on the target hidden variable by the decision model to obtain target hidden variable feature information.
And S14, generating a control strategy at the t moment by the decision model according to the target hidden variable characteristic information.
Specifically, the target hidden variable feature information may be fed into three independent neural network strategy modules, namely a movement control strategy module, a target task (such as shooting/searching/avoiding) control strategy module, and a target arrival area output strategy module. The three modules respectively output the movement action strategy, the target task action strategy and the target arrival area identifier at time t. The target arrival area identifier at time t is used as the identifier of the area reached by the target object at the next time (i.e., at time t + 1).
When the decision model generates and outputs the control strategy at the time t, the predicted reward of the control strategy at the time t is output at the same time.
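A minimal PyTorch sketch of the forward pass described in S11–S14 is given below, assuming the observation is already a flat feature vector and the region identifier is a one-hot vector; the layer sizes, MLP encoders, and head names are illustrative assumptions, not the patented architecture. A self-attention layer or LSTM, as mentioned above, could optionally be inserted before or after the shared extractor.

```python
import torch
import torch.nn as nn

class DecisionModelSketch(nn.Module):
    def __init__(self, obs_dim, num_regions, num_move_actions, num_task_actions, hidden=128):
        super().__init__()
        # Separate encoders for the three parts of the state at time t.
        self.reward_enc = nn.Sequential(nn.Linear(1, hidden), nn.ReLU())
        self.obs_enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.region_enc = nn.Sequential(nn.Linear(num_regions, hidden), nn.ReLU())
        # Shared extractor over the combined (concatenated) hidden variables.
        self.shared = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU())
        # Three independent policy heads plus a predicted-reward head.
        self.move_head = nn.Linear(hidden, num_move_actions)
        self.task_head = nn.Linear(hidden, num_task_actions)
        self.region_head = nn.Linear(hidden, num_regions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, real_time_reward, observation, region_one_hot):
        # real_time_reward: [B, 1], observation: [B, obs_dim], region_one_hot: [B, num_regions]
        h1 = self.reward_enc(real_time_reward)    # first hidden variable
        h2 = self.obs_enc(observation)            # second hidden variable
        h3 = self.region_enc(region_one_hot)      # third hidden variable
        target_hidden = torch.cat([h1, h2, h3], dim=-1)   # combined target hidden variable
        feat = self.shared(target_hidden)         # target hidden variable feature information
        return (self.move_head(feat),             # movement action strategy (logits)
                self.task_head(feat),             # target task action strategy (logits)
                self.region_head(feat),           # target arrival area identifier (logits)
                self.value_head(feat))            # prediction reward of the control strategy
```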
S205, controlling the target object to act by using the control strategy at the time t, obtaining the real-time reward at the time t +1 fed back by the target environment, and taking the target arrival area identifier at the time t as the area identifier at which the target object arrives at the time t + 1.
And S206, inputting the state information of the target object at the t +1 moment in the target environment into the decision model, and outputting the control strategy at the t +1 moment and the prediction reward of the control strategy at the t +1 moment.
S207, when the Nth control decision is completed, acquiring training data in the current state, wherein the training data in the current state comprise training information at N moments, and the training information at any moment t in the N moments comprises state information at the moment t, a control strategy at the moment t and prediction rewards of the control strategy at the moment t.
Specifically, for example, the training information at the N consecutive times from time t to time t + N - 1 is acquired.
And S208, constructing a target loss function according to the training data of the current state.
S209, adjusting and updating parameters of the decision model according to the target loss function.
S210, inputting the state information of the target object at the (N + 1) th moment into the decision model after the parameters are updated, and continuing to train the decision model according to the training data of the next state of the target object until the training stopping condition is met, so as to obtain the trained decision model.
Specifically, the detailed processes of S208-S210 can be referred to the description in the embodiment shown in fig. 2, and are not repeated here.
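For orientation, steps S204–S210 can be arranged into an outer loop like the sketch below; `env`, `model`, `optimizer`, and `build_target_loss` are hypothetical stand-ins for the environment interface, the decision model, the optimizer, and the loss construction of S208, and the details (what `env.step` returns, the stopping condition) are assumptions.

```python
def train_decision_model(env, model, optimizer, build_target_loss, n_steps, stop_condition):
    state = env.reset()                            # state information at the first time
    while not stop_condition():
        rollout = []                               # training data of the current state
        for _ in range(n_steps):                   # N control decisions (S204-S207)
            control_strategy, predicted_reward = model(state)
            next_state = env.step(control_strategy)    # act, collect reward/obs/region id
            rollout.append((state, control_strategy, predicted_reward))
            state = next_state                     # state at the next time
        loss = build_target_loss(rollout)          # S208: PPO loss + supervised loss, weighted
        optimizer.zero_grad()
        loss.backward()                            # S209: adjust and update parameters
        optimizer.step()
    return model                                   # trained decision model (S210)
```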
With the above method, when the target object plays a multi-player game in any environment, after interacting with other unknown opponents (for example, shooting, searching or avoiding and obtaining reward scores), the current reward is recorded in the form of the characteristic information statistical table. After this record is input into the decision model for feature extraction, the decision model decides the target area the target object should reach at the next time, and the target area is used to control the target object to cruise and perform the target task within a specific area, so that diverse strategies can be realized against different game opponents. The method of this embodiment enables the target object to show an effective diversity strategy against unknown opponents in a multi-target-object game environment under any view-angle scene, and to maximize its own task reward. The method reduces the technical implementation cost, does not need to acquire opponent information for training or application, and expands the practical application range and application scenarios.
Fig. 7 is a flowchart of a policy control method for a target object according to an embodiment of the present application, where an execution subject of the method may be a server, and as shown in fig. 7, the method may include:
s301, acquiring the current state information of the target object in the target environment.
S302, inputting the state information of the current moment into the decision model, and outputting the control strategy of the current moment.
Wherein, the decision model is obtained by training according to the decision model training method shown in fig. 2 or fig. 3.
And S303, controlling the target object to act by adopting the control strategy at the current moment.
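A minimal sketch of this inference-time control loop (S301–S303) is given below; `env` and `trained_model` are hypothetical interfaces mirroring the training sketch above.

```python
def run_policy(env, trained_model, num_steps):
    state = env.current_state()                        # S301: state at the current time
    for _ in range(num_steps):
        control_strategy, _ = trained_model(state)     # S302: output the control strategy
        state = env.step(control_strategy)             # S303: act, then read the next state
```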
In the policy control method for the target object provided in this embodiment, the state information of the target object at the current time in the target environment is obtained, the state information of the target object at the current time is input into the decision model, the control policy at the current time is output, and the target object is controlled to perform an action by using the control policy at the current time. When the target object faces different opponent target objects, the action decision of the target object can be adjusted in real time, and the diversity strategy of the target object in the multi-person game environment is realized.
When an agent controlled by the policy control method provided in this embodiment plays 1-versus-1 against other opponent agents (such as F1, Axon, Marvin, Yanshi, or other agents) in the same game environment, it shows better performance in most game environments.
Fig. 8 is a schematic structural diagram of a decision model training apparatus according to an embodiment of the present application. As shown in fig. 8, the apparatus may include: an acquisition module 11, a construction module 12, an adjustment module 13 and a processing module 14, wherein,
the obtaining module 11 is configured to obtain training data of a target object in a current state in a target environment, where the training data of the current state includes training information at N times, the training information at any time t of the N times includes state information at time t, a control strategy at time t and a prediction reward of the control strategy at time t, the state information at time t includes a real-time reward at time t fed back by the target environment, observation information at time t, and an identifier of the area reached by the target object at time t, the control strategy at time t and the prediction reward of the control strategy at time t are output by the decision model after the state information at time t is input, the control strategy at time t includes a movement action strategy, a target task action strategy and a target arrival area identifier at time t, the target arrival area identifier at time t is the same as the identifier of the area reached by the target object at time t + 1, and both t and N are positive integers;
the construction module 12 is configured to construct a target loss function according to the training data of the current state;
the adjusting module 13 is configured to adjust and update parameters of the decision model according to the target loss function;
the processing module 14 is configured to input the state information of the target object at the (N + 1) th time into the decision model after updating the parameters, and continue training the decision model according to the training data of the next state of the target object until a training stopping condition is met, so as to obtain a trained decision model.
Optionally, the obtaining module 11 is configured to:
training a target decision model by using a post-event near-end policy optimization method, wherein the target decision model is used for training a moving action strategy of a target object;
and determining the target decision model as a teacher model of the decision model so as to perform supervised learning training on the decision model.
Optionally, the building block 12 is configured to:
constructing a near-end strategy optimization loss function according to the training data of the current state;
constructing a supervised learning loss function according to the training data of the current state and the teacher model;
and determining the weighted sum of the near-end strategy optimization loss function and the supervised learning loss function as a target loss function.
Optionally, the building block 12 is specifically configured to:
aiming at the training information of each t moment in N moments, calculating to obtain a near-end strategy optimization loss function corresponding to the training information of the t moment according to a control strategy of the t moment, a prediction reward of the control strategy of the t moment, a real-time reward of the t moment fed back by a target environment, observation information of the t moment and an area mark of the t moment where a target object arrives;
and determining the average value of the near-end strategy optimization loss functions corresponding to the training information at the N moments as the near-end strategy optimization loss function.
Optionally, the building block 12 is specifically configured to:
determining a predicted action probability ratio according to real-time rewards at the time t, observation information at the time t, an area identifier where a target object arrives at the time t and a control strategy at the time t, which are fed back by a target environment;
calculating the difference value of the real-time reward at the t moment fed back by the target environment and the predicted reward of the control strategy at the t moment;
and calculating to obtain a near-end strategy optimization loss function corresponding to the training information at the moment t according to the predicted action probability ratio and the difference.
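A minimal sketch of a clipped PPO-style term built from the probability ratio and the (real-time reward − predicted reward) difference is shown below, assuming PyTorch; the clipping constant and the use of that difference directly as the advantage signal are illustrative assumptions.

```python
import torch

def ppo_loss_per_step(new_log_prob, old_log_prob, real_reward, predicted_reward, clip_eps=0.2):
    """Clipped surrogate loss for the training information at one time t."""
    ratio = torch.exp(new_log_prob - old_log_prob)        # predicted action probability ratio
    advantage = real_reward - predicted_reward            # difference used as the advantage
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped)                 # negative because we minimize

def ppo_loss(per_step_losses):
    """Average of the per-timestep losses over the N training times."""
    return torch.stack(list(per_step_losses)).mean()
```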
Optionally, the real-time reward at the time t fed back by the target environment is the sum of the reward of the exploration strategy at the time t and the reward of the utilization strategy at the time t;
the reward of the exploration strategy at the time t is determined according to the number of the reached areas at the time t and the target reached area at the time t;
and the reward of the strategy used at the time t is determined according to the total number of finished target tasks at the time t and the target arrival area at the time t.
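Only to make this decomposition concrete, a toy version with hypothetical shaping terms could look like the sketch below; the actual dependence on the number of reached areas, the completed-task count and the target arrival area is not specified in the text, so both terms here are placeholders.

```python
def exploration_reward(num_areas_reached, reached_target_area, bonus=1.0):
    # Placeholder shaping: reward covering more areas, with a bonus for the target area.
    return num_areas_reached + (bonus if reached_target_area else 0.0)

def exploitation_reward(num_tasks_completed, reached_target_area, bonus=1.0):
    # Placeholder shaping: reward completed target tasks, with a bonus for the target area.
    return num_tasks_completed + (bonus if reached_target_area else 0.0)

def real_time_reward(num_areas_reached, num_tasks_completed, reached_target_area):
    """Real-time reward at time t = exploration-strategy reward + utilization-strategy reward."""
    return (exploration_reward(num_areas_reached, reached_target_area)
            + exploitation_reward(num_tasks_completed, reached_target_area))
```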
Optionally, the building block 12 is specifically configured to:
aiming at training information of each t moment in N moments, calculating a cross entropy loss function of a moving action strategy in a control strategy of the t moment and a moving action strategy of the t moment output by a corresponding teacher model;
and determining the average value of the cross entropy loss functions corresponding to the training information at the N moments as a supervised learning loss function.
Optionally, the obtaining module 11 is further configured to: acquiring real-time reward of the N +1 th moment fed back by the target environment;
acquiring observation information of a target object at the (N + 1) th moment;
taking the target arrival area identifier of the target object at the Nth moment as the area identifier of the target object at the (N + 1) th moment;
and forming the state information of the target object at the N +1 th time by the real-time reward of the N +1 th time, the observation information of the N +1 th time and the area identification of the target object at the N +1 th time fed back by the target environment.
Optionally, the obtaining module 11 is specifically configured to:
adopting the control strategy at the Nth moment to control the target object to perform corresponding action;
determining that the target object completes the target task at the (N + 1) th moment and obtaining real-time reward of target environment feedback;
generating a characteristic information hidden variable according to the observation information at the (N + 1) th moment and the real-time reward fed back by the target environment, and storing the characteristic information hidden variable into a characteristic information statistical table to obtain an updated characteristic information statistical table;
and carrying out normalization processing on the updated characteristic information statistical table, and determining the characteristic information statistical table after the normalization processing as the real-time reward of the N +1 th moment fed back by the target environment.
Optionally, the decision model after adjusting the parameters is used to:
respectively extracting the characteristics of the real-time reward at the N +1 th moment, the observation information at the N +1 th moment and the region identification of the target object at the N +1 th moment fed back by the target environment to obtain a first hidden variable, a second hidden variable and a third hidden variable;
combining the first hidden variable, the second hidden variable and the third hidden variable to obtain a target hidden variable;
extracting the characteristics of the target hidden variables to obtain characteristic information of the target hidden variables;
and generating a control strategy at the (N + 1) th moment according to the target hidden variable characteristic information.
Optionally, the obtaining module 11 is further configured to:
acquiring vertical visual angle information of a map corresponding to a target environment;
dividing the map into a plurality of areas according to the vertical visual angle information of the map;
an identification is assigned to each region.
Fig. 9 is a schematic structural diagram of a policy control apparatus for a target object according to an embodiment of the present application, and as shown in fig. 9, the apparatus may include: an acquisition module 21, a control strategy output module 22 and an action control module 23, wherein,
the acquisition module is used for acquiring the current state information of the target object in the target environment;
a control strategy output module, configured to input state information at the current time into the decision model and output a control strategy at the current time, where the decision model is obtained by training according to the decision model training method shown in fig. 2;
and the action control module is used for controlling the target object to act by adopting the control strategy at the current moment.
It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the decision model training apparatus shown in fig. 8 or the policy control apparatus of the target object shown in fig. 9 may execute a method embodiment corresponding to the computer device, and the foregoing and other operations and/or functions of each module in the apparatus are respectively for implementing the method embodiment corresponding to the computer device, and are not described herein again for brevity.
The decision model training device and the strategy control device of the target object according to the embodiment of the present application are described above from the perspective of functional modules in conjunction with the drawings. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, and the like, as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.
Fig. 10 is a schematic block diagram of a computer device 700 provided by an embodiment of the present application.
As shown in fig. 10, the computer device 700 may include:
a memory 710 and a processor 720, the memory 710 for storing a computer program and transferring the program code to the processor 720. In other words, the processor 720 may call and run the computer program from the memory 710 to implement the method in the embodiment of the present application.
For example, the processor 720 may be configured to perform the above-described method embodiments according to instructions in the computer program.
In some embodiments of the present application, the processor 720 may include, but is not limited to:
general purpose processors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.
In some embodiments of the present application, the memory 710 includes, but is not limited to:
volatile memory and/or non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program may be partitioned into one or more modules, which are stored in the memory 710 and executed by the processor 720 to perform the methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing certain functions, the instruction segments describing the execution of the computer program in the electronic device.
As shown in fig. 10, the computer device may further include:
a transceiver 730, the transceiver 730 being connectable to the processor 720 or the memory 710.
The processor 720 may control the transceiver 730 to communicate with other devices, and specifically, may transmit information or data to the other devices or receive information or data transmitted by the other devices. The transceiver 730 may include a transmitter and a receiver. The transceiver 730 may further include an antenna, and the number of antennas may be one or more.
It should be understood that the various components in the electronic device are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiment.
When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that includes one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disc (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the module is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some interfaces, indirect coupling or communication connection between devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and all the changes or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

1. A method for training a decision model, comprising:
acquiring training data of a target object in a current state in a target environment, wherein the training data of the current state comprises training information of N moments, the training information of any moment t in the N moments comprises state information of the moment t, a control strategy of the moment t and a prediction reward of the control strategy of the moment t, the state information of the moment t comprises a real-time reward of the moment t, observation information of the moment t and an area identification of the target object at the moment t, which are fed back by the target environment, the control strategy of the moment t and the prediction reward of the control strategy of the moment t are output by a decision model after the state information of the moment t is input, the control strategy of the moment t comprises a moving action strategy, a target task action strategy and a target arrival area identification of the moment t, the target arrival area identification of the moment t is the same as the area identification of the target object at the moment t +1, and both t and N are positive integers;
constructing a target loss function according to the training data of the current state;
adjusting and updating parameters of the decision model according to the target loss function;
and inputting the state information of the target object at the (N + 1) th moment into the decision model after updating the parameters, and continuing to train the decision model according to the training data of the next state of the target object until the training stopping condition is met to obtain the trained decision model.
2. The method of claim 1, wherein prior to obtaining training data for a current state of a target object in a target environment, the method further comprises:
training a target decision model by using a post-event near-end policy optimization method, wherein the target decision model is used for training a movement action strategy of the target object;
and determining the target decision model as a teacher model of the decision model so as to perform supervised learning training on the decision model.
3. The method of claim 2, wherein constructing an objective loss function from the training data for the current state comprises:
constructing a near-end strategy optimization loss function according to the training data of the current state;
constructing a supervised learning loss function according to the training data of the current state and the teacher model;
determining a weighted sum of the near-end policy optimization loss function and a supervised learning loss function as the target loss function.
4. The method of claim 3, wherein constructing a near-end strategy optimization loss function from the training data for the current state comprises:
aiming at the training information of each t moment in the N moments, calculating to obtain a near-end strategy optimization loss function corresponding to the training information of the t moment according to the control strategy of the t moment, the prediction reward of the control strategy of the t moment, the real-time reward of the t moment fed back by the target environment, the observation information of the t moment and the region identification of the target object at the t moment;
and determining the average value of the near-end strategy optimization loss functions corresponding to the training information at the N moments as the near-end strategy optimization loss function.
5. The method according to claim 3, wherein the calculating a near-end strategy optimization loss function corresponding to the training information at the time t according to the control strategy at the time t, the prediction reward of the control strategy at the time t, the real-time reward of the target environment feedback at the time t, the observation information at the time t and the region identifier at which the target object at the time t arrives comprises:
determining a predicted action probability ratio according to a real-time reward at the time t fed back by the target environment, observation information at the time t, an area identifier where the target object arrives at the time t and a control strategy at the time t;
calculating the difference value of the real-time reward of the target environment at the time t and the predicted reward of the control strategy at the time t;
and calculating to obtain a near-end strategy optimization loss function corresponding to the training information at the time t according to the predicted action probability ratio and the difference.
6. The method according to claim 5, wherein the real-time reward at the time t of the target environment feedback is the sum of the reward of a time t exploration strategy and the reward of a time t utilization strategy;
the reward of the time t exploration strategy is determined according to the number of reached areas at the time t and the target reached area at the time t;
and the reward of the strategy used at the time t is determined according to the total number of completed target tasks at the time t and the target arrival area at the time t.
7. The method of claim 3, wherein constructing a supervised learning loss function from the training data for the current state and the teacher model comprises:
calculating a cross entropy loss function of a moving action strategy in the control strategy at the time t and a moving action strategy at the time t output by the corresponding teacher model according to the training information at each time t in the N moments;
and determining the average value of cross entropy loss functions corresponding to the training information of the N moments as the supervised learning loss function.
8. The method of claim 1, wherein before inputting the state information of the target object at the time point N +1 into the decision model after updating the parameters, the method further comprises:
acquiring real-time reward of the N +1 th moment fed back by the target environment;
acquiring observation information of the target object at the (N + 1) th moment;
taking the target arrival area identifier of the target object at the Nth moment as the area identifier of the target object at the (N + 1) th moment;
and forming the state information of the target object at the (N + 1) th time by the real-time reward of the (N + 1) th time, the observation information of the (N + 1) th time and the area identification of the target object at the (N + 1) th time fed back by the target environment.
9. The method of claim 8, wherein obtaining the real-time reward of time N +1 of the target environment feedback comprises:
controlling the target object to perform corresponding action by adopting a control strategy at the Nth moment;
determining that the target object completes the target task at the (N + 1) th moment and obtains a real-time reward of the target environment feedback;
generating a characteristic information hidden variable according to the observation information at the (N + 1) th moment and the real-time reward fed back by the target environment, and storing the characteristic information hidden variable into a characteristic information statistical table to obtain an updated characteristic information statistical table;
and normalizing the updated characteristic information statistical table, and determining the normalized characteristic information statistical table as the real-time reward of the N +1 th moment fed back by the target environment.
10. The method of claim 8, wherein the parameterized decision model is configured to:
respectively performing feature extraction on the real-time reward at the N +1 th moment, the observation information at the N +1 th moment and the region identification of the target object at the N +1 th moment fed back by the target environment to obtain a first hidden variable, a second hidden variable and a third hidden variable;
combining the first hidden variable, the second hidden variable and the third hidden variable to obtain a target hidden variable;
extracting the characteristics of the target hidden variables to obtain target hidden variable characteristic information;
and generating the control strategy at the (N + 1) th moment according to the target hidden variable characteristic information.
11. The method of claim 1, further comprising:
acquiring vertical visual angle information of a map corresponding to the target environment;
dividing the map into a plurality of areas according to the vertical visual angle information of the map;
an identification is assigned to each region.
12. The method of claim 10, wherein the dividing the map into the plurality of regions according to the vertical view information of the map comprises:
determining a central point of each area in the plurality of areas according to the vertical visual angle information of the map and a central point selection mode, wherein the central point selection mode is as follows: the distance between two adjacent central points is the same, and the open areas with regular corners and shapes have the central points;
aiming at each target point in the map except the determined central point, determining the area to which the target point belongs according to the principle that the Euclidean distance between the target point and the central point is shortest to obtain a plurality of areas;
the allocating an identifier for each region includes:
and allocating an identifier to the central point of each area.
13. A method for policy control of a target object, comprising:
acquiring the current state information of a target object in a target environment;
inputting the state information of the current moment into a decision model and outputting a control strategy of the current moment, wherein the decision model is obtained by training according to the decision model training method of any one of claims 1-12;
and controlling the target object to act by adopting the control strategy at the current moment.
14. A decision model training device, comprising:
an acquisition module, configured to acquire training data of a current state of a target object in a target environment, wherein the training data of the current state comprise training information of N moments, the training information of any moment t in the N moments comprises state information of the moment t, a control strategy of the moment t and a prediction reward of the control strategy of the moment t, the state information of the moment t comprises a real-time reward of the moment t, observation information of the moment t and an area identification of the target object arriving at the moment t, which are fed back by the target environment, the control strategy of the moment t and the prediction reward of the control strategy of the moment t are output by a decision model, the control strategy of the moment t comprises a mobile action strategy, a target task action strategy and a target arriving area identification of the moment t, wherein the target arriving area identification of the moment t is the same as the area identification of the target object arriving at the moment t +1, and t and N are both positive integers;
the building module is used for building a target loss function according to the training data of the current state;
the adjusting module is used for adjusting and updating parameters of the decision model according to the target loss function;
and the processing module is used for inputting the state information of the target object at the (N + 1) th moment into the decision model after the parameters are updated, and continuing training the decision model according to the training data of the next state of the target object until a training stopping condition is met to obtain the trained decision model.
15. A policy control apparatus for a target object, comprising:
the acquisition module is used for acquiring the current state information of the target object in the target environment;
a control strategy output module, configured to input the state information at the current time into a decision model, and output a control strategy at the current time, where the decision model is obtained by training according to the decision model training method according to any one of claims 1 to 12;
and the action control module is used for controlling the target object to act by adopting the control strategy at the current moment.
16. A computer device, comprising:
a processor and a memory, the memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of any one of claims 1 to 12 or 13.
17. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 12 or 13.
18. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 12 or 13.