CN117312815A - Agent strategy network training method and device, electronic equipment and storage medium

Agent strategy network training method and device, electronic equipment and storage medium

Info

Publication number
CN117312815A
CN117312815A (Application CN202311230353.6A)
Authority
CN
China
Prior art keywords
state
network
value
historical
moment
Prior art date
Legal status
Pending
Application number
CN202311230353.6A
Other languages
Chinese (zh)
Inventor
刘天硕
刘旭辉
陈瑞峰
张智龙
韩璐瑶
Current Assignee
Nanqi Xiance Nanjing High Tech Co ltd
Original Assignee
Nanqi Xiance Nanjing High Tech Co ltd
Priority date
Filing date
Publication date
Application filed by Nanqi Xiance Nanjing High Tech Co ltd filed Critical Nanqi Xiance Nanjing High Tech Co ltd

Classifications

    • G06F18/217 Pattern recognition: Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/214 Pattern recognition: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/04 Neural networks: Architecture, e.g. interconnection topology
    • G06N3/08 Neural networks: Learning methods


Abstract

The invention discloses an agent strategy network training method and device, electronic equipment and a storage medium. The method comprises the following steps: determining a linear score for each first training sample; inputting the current first training sample into a state action value network, a state value network and a strategy network respectively, and determining a first actual output, a second actual output and a third actual output; determining a first loss value according to the second actual output, the accumulated reward at the first historical moment, the linear score and the first loss function, and correcting the parameters of the state action value network; determining a second loss value corresponding to the state value network, and correcting the parameters of the state value network; and determining a third loss value corresponding to the strategy network, and correcting the parameters of the strategy network according to the third loss value to obtain the target strategy network. This scheme improves the training effect and training speed of the strategy network, so that the target strategy network can be trained more accurately and conveniently.

Description

Agent strategy network training method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of intelligent agent reinforcement learning, in particular to an intelligent agent strategy network training method, an intelligent agent strategy network training device, electronic equipment and a storage medium.
Background
Reinforcement learning has achieved remarkable success in a variety of sequential decision problems, such as sequential recommendation systems, automation, and robotic motor-skill learning. Off-policy reinforcement learning algorithms reuse the diverse experience data collected by earlier strategies along the strategy improvement path, and can therefore achieve higher data efficiency than on-policy approaches.
In the related art, a conventional off-policy reinforcement learning algorithm sequentially stores historical (state, action, reward, next state) transition tuples in an experience pool, and during updating it samples data from the experience pool to update the neural networks. However, during network updating, the sampling strategy usually prefers samples generated by the current strategy, so the strategy network to be trained cannot be trained on samples from historical strategies; as a result, the training effect of the strategy network to be trained is poor, which affects the performance of the finally applied target strategy network.
Disclosure of Invention
The invention provides an agent strategy network training method and device, electronic equipment and a storage medium, which improve the training effect and training speed of the strategy network and allow a target strategy network to be trained more accurately and conveniently, thereby improving the decision performance of the target strategy network.
According to an aspect of the present invention, there is provided an agent policy network training method, including:
processing a current first training sample based on a linear discriminator which is trained in advance, for each first training sample obtained from a first experience pool, to obtain a linear score corresponding to the current first training sample; wherein the first training sample is a tuple comprising an initial state, a state at a first historical time, a decision action at the first historical time, a state at a next historical time, a reward at the first historical time, and an accumulated reward at the first historical time; the next historical time is the time immediately following the first historical time when the first historical time is taken as the current time;
respectively inputting the current first training sample into a state action value network, a state value network and a strategy network to obtain a first actual output corresponding to the state action value network, a second actual output corresponding to the state value network and a third actual output corresponding to the strategy network;
processing the first actual output and the reward at the first historical time according to the second actual output, the accumulated reward at the first historical time, the linear score and a first loss function corresponding to the state action value network to obtain a first loss value, and correcting parameters in the state action value network according to the first loss value; and,
processing the second actual output and the reward at the first historical time according to a second loss function corresponding to the state value network to obtain a second loss value, and correcting parameters in the state value network according to the second loss value; and,
processing the third actual output and the decision action at the first historical moment according to a third loss function corresponding to the strategy network to obtain a third loss value, and correcting parameters in the strategy network according to the third loss value until training is finished when a preset convergence condition corresponding to the strategy network is reached to obtain a target strategy network;
wherein the third loss function is associated with an actual output corresponding to the state action value network and an actual output corresponding to the state value network; the target policy network is used for judging the state at the current moment corresponding to the agent so as to obtain a target decision action at the current moment.
According to another aspect of the present invention, there is provided an agent policy network training device, the device comprising:
the training sample acquisition module is used for processing the current first training sample based on a linear discriminator which is trained in advance, for each first training sample acquired from the first experience pool, to obtain a linear score corresponding to the current first training sample; wherein the first training sample is a tuple comprising an initial state, a state at a first historical time, a decision action at the first historical time, a state at a next historical time, a reward at the first historical time, and an accumulated reward at the first historical time; the next historical time is the time immediately following the first historical time when the first historical time is taken as the current time;
the training sample processing module is used for respectively inputting the current first training sample into a state action value network to be trained, a state value network and a strategy network to obtain a first actual output corresponding to the state action value network, a second actual output corresponding to the state value network and a third actual output corresponding to the strategy network;
the first loss value determining module is used for processing the first actual output and the reward at the first historical time according to the second actual output, the accumulated reward at the first historical time, the linear score and a first loss function corresponding to the state action value network to obtain a first loss value, and correcting parameters in the state action value network according to the first loss value; and,
the second loss value determining module is used for processing the second actual output and the reward at the first historical time according to a second loss function corresponding to the state value network to obtain a second loss value, and correcting parameters in the state value network according to the second loss value; and,
the third loss value determining module is used for processing the third actual output and the decision action at the first historical moment according to a third loss function corresponding to the strategy network to obtain a third loss value, and correcting parameters in the strategy network according to the third loss value until training is finished when a preset convergence condition corresponding to the strategy network is reached to obtain a target strategy network;
wherein the third loss function is associated with an actual output corresponding to the state action value network and an actual output corresponding to the state value network; the target policy network is used for judging the state at the current moment corresponding to the agent so as to obtain a target decision action at the current moment.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the agent policy network training method of any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the agent policy network training method according to any of the embodiments of the present invention when executed.
According to the technical scheme, aiming at each first training sample obtained from the first experience pool, a current first training sample is processed based on a linear discriminator which is trained in advance, so that a linear score corresponding to the current first training sample is obtained; then, the current first training sample is respectively input into a state action value network, a state value network and a strategy network to obtain a first actual output corresponding to the state action value network, a second actual output corresponding to the state value network and a third actual output corresponding to the strategy network, and then, the first actual output and the rewards at the first historical moment are processed according to the second actual output, the accumulated rewards at the first historical moment, the linear score and the first loss function corresponding to the state action value network to obtain a first loss value, and parameters in the state action value network are corrected according to the first loss value; processing the second actual output and rewards at the first historical moment according to a second loss function corresponding to the state value network to obtain a second loss value, and correcting parameters in the state value network according to the second loss value; and processing the third actual output and the decision action at the first historical moment according to the third loss function corresponding to the strategy network to obtain a third loss value, correcting parameters in the strategy network according to the third loss value until the training is finished when the preset convergence condition corresponding to the strategy network is reached, and obtaining a target strategy network.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an agent policy network training method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of an agent policy network training method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an agent policy network training device according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device implementing an agent policy network training method according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of an intelligent agent policy network training method according to an embodiment of the present invention, where the method may be implemented by an intelligent agent policy network training device, and the intelligent agent policy network training device may be implemented in hardware and/or software, and the intelligent agent policy network training device may be configured in a terminal and/or a server. As shown in fig. 1, the method includes:
S110, processing a current first training sample based on a linear discriminator which is trained in advance, for each first training sample obtained from a first experience pool, to obtain a linear score corresponding to the current first training sample; the first training sample is a tuple comprising an initial state, a state at a first historical moment, a decision action at the first historical moment, a state at a next historical moment, a reward at the first historical moment, and an accumulated reward at the first historical moment.
In this embodiment, the first experience pool may be a pre-built playback buffer for storing samples of the agent's interactions with the environment. The samples stored in the first experience pool may be used to train the networks. Correspondingly, a first training sample is historical interaction data generated after the agent interacts with the environment. Each first training sample is a tuple comprising an initial state, a state at a first historical time, a decision action at the first historical time, a state at a next historical time, a reward at the first historical time, and an accumulated reward at the first historical time. The service environment may be the environment that interacts with the agent. It should be noted that, under different application scenarios, the agent and the service environment change accordingly. For example, if the application scenario is a robot control scenario, the agent may be a controlled robot, and the service environment may be the environment in which the robot is located; if the application scenario is a game playing scenario, the agent may be a game object controlled in the game, and the service environment may be the game scene.
The initial state may be state information of the service environment in the initial situation, and may also be understood as an environmental state observation value of a starting point time in the service environment. The state at the first historical time may be an environmental state observation at the first historical time in the business environment. The decision action at the first historical moment may be an agent-performed action decided for the state at the first historical moment. The state at the next history instant may be understood as an environmental state observation at the next history instant in the business environment determined after performing the decision action at the first history instant for the state at the first history instant. The next history time is the next time corresponding to the current time when the first history time is set as the current time. The process of determining the state of the next history time from the state of the first history time and the decision action of the first history time in the service environment may be referred to as an environment state transition. The reward at the first history time may be understood as an evaluation or a reward made for the state at the first history time and the decision action at the first history time, or the state at the first history time and the decision action at the first history time are input into a preset reward function, and the obtained output is the reward at the first history time. The jackpot at the first historical time may be understood as the entire jackpot from the start time to the first historical time.
For example, in the case where the agent is a game object that is controlled in a game and the service environment is a game scene, the initial state of the service environment may be the state of the game at the starting point of any game round; the state at the first historical time may be the state of the game at the first historical time in that round, where the first historical time may be any time during the game object's execution of the current round; the decision action at the first historical time may be the action the game object decides to execute according to the state at the first historical time; the state at the next historical time may be the state of the game environment after the decision action at the first historical time is performed; the reward at the first historical time may be the game score corresponding to the first historical time; and the accumulated reward at the first historical time may be the accumulated game score up to the first historical time.
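For illustration only, the tuple described above can be represented by a small data structure. The sketch below is a minimal example; the field names (initial_state, state, action, next_state, reward, cumulative_reward) and the toy dimensions are assumptions chosen for readability and do not appear in the original disclosure.

```python
from typing import NamedTuple, List

class Transition(NamedTuple):
    """One first-training-sample tuple as described above."""
    initial_state: List[float]      # environment state observation at the starting point
    state: List[float]              # state at the first historical time s_t
    action: List[float]             # decision action taken at the first historical time a_t
    next_state: List[float]         # state at the next historical time s_{t+1}
    reward: float                   # reward r_t for (s_t, a_t)
    cumulative_reward: float        # accumulated reward G_t from the start up to t

# Example: a toy transition with a 2-dimensional state and a 1-dimensional action.
sample = Transition(
    initial_state=[0.0, 0.0],
    state=[0.3, -0.1],
    action=[0.5],
    next_state=[0.35, -0.05],
    reward=1.0,
    cumulative_reward=4.2,
)
```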
The linear discriminator may be a neural network model trained in advance to judge the degree of matching between an input sample and the current strategy. It may also be understood as a neural network model that judges the time difference between the input sample and the current strategy in the time dimension: the smaller the time difference between the input sample and the current strategy, the higher the degree of matching between them and the higher the linearity of the input sample; the greater the time difference, the lower the degree of matching and the lower the linearity of the input sample. In this embodiment, the input samples of the linear discriminator may be historical samples in an experience pool. The linear discriminator may be a neural network model for judging the degree of matching between a historical sample and the strategy currently being executed. Illustratively, the linear discriminator may be a deep convolutional neural network model. Accordingly, the linear score may be a score characterizing the degree of matching between the input sample and the current strategy. In this embodiment, the linear score may represent the degree of matching between the first training sample and the current strategy; the higher the linear score, the higher the degree of matching between the first training sample and the current strategy; the lower the linear score, the lower that degree of matching. The linear score may be any value, optionally a value between 0 and 1.
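A minimal sketch of what such a linear discriminator could look like is given below, assuming a PyTorch multilayer perceptron that maps a flattened training sample to a score in [0, 1] through a sigmoid; the layer sizes, the class name, and the choice of an MLP rather than a deep convolutional network are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LinearityDiscriminator(nn.Module):
    """Scores how well a training sample matches the current strategy (0 = old, 1 = recent)."""

    def __init__(self, sample_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sample_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),   # squash the output into a linear score in [0, 1]
        )

    def forward(self, sample: torch.Tensor) -> torch.Tensor:
        return self.net(sample)

# Usage: score a batch of flattened (state, action, next_state, reward, ...) samples.
disc = LinearityDiscriminator(sample_dim=8)
scores = disc(torch.randn(32, 8))   # shape (32, 1); each entry is a linear score
```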
Before the linear discriminator of the present embodiment is applied, the discriminator to be trained needs to be trained first. Optionally, the training process of the linear discriminator may be as follows: obtaining a plurality of first training samples from the first experience pool and setting a first theoretical score for each first training sample, and obtaining a plurality of second training samples from the second experience pool and setting a second theoretical score for each second training sample; for each first training sample and each second training sample, inputting the current first training sample and the current second training sample into the discriminator to be trained to obtain a first actual score and a second actual score; processing the first theoretical score and the first actual score according to a preset loss function corresponding to the discriminator to obtain a first to-be-processed loss value, and processing the second theoretical score and the second actual score according to the same loss function to obtain a second to-be-processed loss value; and correcting the parameters in the discriminator based on the first to-be-processed loss value and the second to-be-processed loss value to obtain the linear discriminator.
The first experience pool and the second experience pool may be two experience playback buffers of different capacities. The capacity of the first experience pool is greater than the capacity of the second experience pool. The ratio between the two capacities may be any value, optionally 10, i.e., the capacity of the first experience pool may be 10 times that of the second experience pool. In practical application, while the agent interacts with the environment and determines the corresponding decision actions according to the policy network, the environment state transition tuple at each moment may be stored in the first experience pool and the second experience pool respectively, to be used as training samples. The training samples in the experience pools can be used to train the policy network corresponding to the agent. Optionally, a training sample may be constructed as follows: acquiring the state of the agent at the first historical moment corresponding to the target application scene; inputting the state at the first historical moment into the strategy network to be trained to obtain the decision action at the first historical moment, the reward at the first historical moment and the accumulated reward at the first historical moment; determining the state at the next historical moment according to the state at the first historical moment and the decision action at the first historical moment; and constructing a training sample from the state at the first historical moment, the decision action at the first historical moment, the reward at the first historical moment, the accumulated reward at the first historical moment and the state at the next historical moment, and storing the training sample into the first experience pool and the second experience pool respectively.
The target application scene can be at least one scene of robot control, game play, computer vision and unmanned driving. In this embodiment, the training samples stored into the first experience pool may be first training samples and the training samples stored into the second experience pool may be second training samples.
In practical application, when the intelligent agent interacts with the environment in the target application scene, the state of the intelligent agent at the first historical moment in the target application scene can be obtained, the state of the first historical moment can be input into the strategy network to be trained, the decision action corresponding to the first historical moment can be obtained, the rewards corresponding to the state of the first historical moment and the decision action of the first historical moment are determined according to a preset rewarding function, and meanwhile, the accumulated rewards corresponding to the first historical moment can be determined according to a preset accumulating function. Further, the agent may be controlled to execute a decision action corresponding to the first history time with respect to the state of the first history time, so as to obtain the state of the next history time. Further, training samples may be constructed according to the state of the first historical time, the decision action of the first historical time, the reward of the first historical time, the accumulated reward of the first historical time, and the state of the next historical time, and the training samples are stored in the first experience pool and the second experience pool respectively to be used as the first training sample in the first experience pool and the second training sample in the second experience pool.
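The two experience pools and the sample-construction step can be sketched as follows; a bounded deque stands in for each replay buffer, and the 10:1 capacity ratio mirrors the optional ratio mentioned above. The concrete capacities and the uniform sampling are assumptions; the policy call and the reward computation that would produce the stored values are left to whatever networks the target application scene actually uses.

```python
from collections import deque
import random

FIRST_POOL_CAPACITY = 100_000                        # larger, long-horizon pool
SECOND_POOL_CAPACITY = FIRST_POOL_CAPACITY // 10     # optionally 10x smaller, recent pool

first_pool = deque(maxlen=FIRST_POOL_CAPACITY)
second_pool = deque(maxlen=SECOND_POOL_CAPACITY)

def store_transition(initial_state, state, action, next_state, reward, cumulative_reward):
    """Build one training-sample tuple and store it in both experience pools."""
    transition = (initial_state, state, action, next_state, reward, cumulative_reward)
    first_pool.append(transition)    # becomes a first training sample
    second_pool.append(transition)   # becomes a second training sample (only recent ones survive)

def sample_batch(pool, batch_size=64):
    """Uniformly sample a mini-batch of transitions from an experience pool."""
    return random.sample(pool, min(batch_size, len(pool)))
```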
Further, a plurality of first training samples and a plurality of second training samples can be obtained from the first experience pool, and then, aiming at each first training sample and each second training sample, the current first training sample and the current second training sample are input into a discriminator to be trained to obtain a first actual score and a second actual score; respectively processing the first theoretical score and the first actual score according to a preset loss function corresponding to the discriminator to obtain a first loss value to be processed, and processing the second theoretical score and the second actual score according to the loss function to obtain a second loss value to be processed; and correcting parameters in the discriminator based on the first to-be-processed loss value and the second to-be-processed loss value to obtain the linear discriminator.
The first theoretical score may be any value, optionally 0. The second theoretical score may be any value, optionally 1. The parameters in the discriminator to be trained may take default values, and they are corrected through the training samples to obtain the linear discriminator. The first actual score may be the linear score output after the first training sample is input into the discriminator to be trained. The second actual score may be the linear score output after the second training sample is input into the discriminator to be trained. The preset loss function may be a predetermined function characterizing the degree of difference between the actual output and the theoretical output. In this embodiment, the preset loss function may be a cross-entropy loss function.
In practical applications, a plurality of first training samples may be obtained from a first experience pool, a first theoretical score may be set for each first training sample, a plurality of second training samples may be obtained from a second experience pool, and a second theoretical score may be set for each second training sample. Furthermore, for each first training sample and each second training sample, the current first training sample and the current second training sample are simultaneously input into a to-be-trained discriminator to obtain a first actual score corresponding to the current first training sample and a second actual score corresponding to the current second training sample. Further, the first actual score and the first theoretical score may be subjected to loss processing according to a preset loss function to obtain a first to-be-processed loss value, and the second actual score and the second theoretical score may be subjected to loss processing according to the preset loss function to obtain a second to-be-processed loss value. Further, the parameters in the discriminators to be trained may be corrected according to the first to-be-processed loss value and the second to-be-processed loss value. Specifically, when the parameters in the discriminators to be trained are corrected by using the loss values, the loss function can be converged as a training target, for example, whether the training error is smaller than a preset error, whether the error change tends to be stable, or whether the current iteration number is equal to the preset number. If the detection reaches the convergence condition, for example, the training error of the loss function is smaller than the preset error, or the error change trend tends to be stable, the training of the to-be-trained discriminator is completed, and at the moment, the iterative training can be stopped. If the current condition is detected not to be met, other first training samples and other second training samples can be further obtained to train the to-be-trained discriminator continuously until the training error of the loss function is within a preset range. When the training error of the loss function reaches convergence, the trained discriminator can be used as a linear discriminator, namely, after the training sample for setting the theoretical score is input into the linear discriminator, the linear score corresponding to the training sample can be accurately obtained.
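Under the assumptions above (first-pool samples labelled with a theoretical score of 0, second-pool samples with 1, and a cross-entropy loss), one parameter-correction step of the discriminator could look like the following sketch. The function name, the optimizer choice and the learning rate are illustrative; the usage comments reuse the LinearityDiscriminator sketched earlier.

```python
import torch
import torch.nn.functional as F

def discriminator_training_step(disc, optimizer, first_batch, second_batch):
    """One parameter-correction step of the discriminator to be trained.

    first_batch / second_batch: tensors of flattened samples drawn from the
    first and second experience pools, shape (batch, sample_dim).
    """
    first_scores = disc(first_batch)                      # first actual scores
    second_scores = disc(second_batch)                    # second actual scores

    # Theoretical scores: 0 for first-pool samples, 1 for second-pool samples.
    first_targets = torch.zeros_like(first_scores)
    second_targets = torch.ones_like(second_scores)

    # First and second to-be-processed loss values (binary cross-entropy).
    loss_first = F.binary_cross_entropy(first_scores, first_targets)
    loss_second = F.binary_cross_entropy(second_scores, second_targets)
    loss = loss_first + loss_second

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                      # correct the discriminator parameters
    return loss.item()

# Usage (with the LinearityDiscriminator sketched earlier):
#   disc = LinearityDiscriminator(sample_dim=8)
#   optimizer = torch.optim.Adam(disc.parameters(), lr=3e-4)
#   discriminator_training_step(disc, optimizer, torch.randn(64, 8), torch.randn(64, 8))
```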
In practical applications, after the plurality of first training samples are obtained from the first experience pool, for each first training sample, the current first training sample may be input into the linear discriminator to be processed, so that the linear score corresponding to the current first training sample can be obtained.
S120, the current first training sample is respectively input into a state action value network, a state value network and a strategy network, and a first actual output corresponding to the state action value network, a second actual output corresponding to the state value network and a third actual output corresponding to the strategy network are obtained.
The state action value network may be understood as a deep neural network that takes the state at the first historical time and the decision action at the first historical time as inputs to evaluate the decision action taken in that state. The state action value network may be a neural network that includes a state action value function. Its input may be the state at the first historical time and the decision action at the first historical time, and its output may be the value of taking that decision action in the state at the first historical time. The state value network may be understood as a deep neural network that takes the state at the first historical time as input to evaluate that state. The state value network may include a state value function. Its input may be the state at the first historical time, and its output may be the value obtained by evaluating that state. The policy network may be understood as a deep neural network model that takes the state at the first historical time as input and, after processing it, determines the decision action at the first historical time. Its input may be the state at the first historical time and its output may be the decision action at the first historical time. It should be noted that the state action value network and the state value network are the evaluation networks in the reinforcement learning algorithm (such as the soft actor-critic algorithm); they are the "critic", which does not directly take actions but evaluates the quality of actions. The policy network may be the policy network in the reinforcement learning algorithm (e.g., the soft actor-critic algorithm); it is the "actor", which determines decision actions based on the input states.
In practical application, the state at the first historical moment and the decision action at the first historical moment in the current first training sample can be input into the state action value network to obtain the first actual output; the initial state of the service environment, the state at the first historical moment and the state at the next historical moment in the current first training sample can be input into the state value network to obtain the second actual output; and the state at the first historical moment in the current first training sample can be input into the strategy network to obtain the third actual output.
The first actual output may be an evaluation value corresponding to the state at the first history time when the decision action at the first history time is taken. The second actual output may include the value of the initial state, the value of the state at the first historical time, and the value of the state at the next historical time. The third actual output may be an action to be performed by the agent determined for the state of the first historical moment.
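The three forward passes described above can be sketched as follows, assuming simple PyTorch MLPs for the state action value network (Q), the state value network (V) and the strategy network; the dimensions, the helper name and the architectures are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

state_dim, action_dim = 4, 2
q_net = mlp(state_dim + action_dim, 1)   # state action value network: (s_t, a_t) -> Q value
v_net = mlp(state_dim, 1)                # state value network: s -> V value
policy_net = mlp(state_dim, action_dim)  # strategy network: s_t -> decision action

# Toy batch standing in for a current first training sample.
s0 = torch.randn(64, state_dim)          # initial state
s_t = torch.randn(64, state_dim)         # state at the first historical moment
a_t = torch.randn(64, action_dim)        # decision action at the first historical moment
s_next = torch.randn(64, state_dim)      # state at the next historical moment

first_actual_output = q_net(torch.cat([s_t, a_t], dim=-1))      # evaluation of (s_t, a_t)
second_actual_output = (v_net(s0), v_net(s_t), v_net(s_next))   # values of s_0, s_t, s_{t+1}
third_actual_output = policy_net(s_t)                            # action proposed for s_t
```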
S130, processing the first actual output and the rewards at the first historical moment according to the second actual output, the accumulated rewards at the first historical moment, the linear score and the first loss function corresponding to the state action value network to obtain a first loss value, and correcting parameters in the state action value network according to the first loss value.
The first loss function may be a preset original loss function corresponding to the state action value network. In this embodiment, the first loss function may be a function characterizing the degree of difference between the actual output of the state action value network and the reward at the first historical moment.
It should be noted that off-policy reinforcement learning may be understood as follows: during the interaction between the agent and the environment, the strategy used for execution (i.e., the strategy network) and the strategy used for evaluation (i.e., the evaluation network) are different, and the strategy network may be trained by using the evaluation network to evaluate the quality of the decision actions output by the strategy network. Thus, in this embodiment, the loss function of the evaluation network used for training the policy network may be processed during training to weight the proportion of each sample in the loss function. This helps the strategy network learn faster and better and improves its convergence rate.
Optionally, processing the first actual output and the reward at the first historical moment according to the second actual output, the accumulated reward at the first historical moment, the linear score and the first loss function corresponding to the state action value network to obtain a first loss value includes: processing the first actual output and the reward at the first historical moment according to the first loss function to obtain an initial loss value; and processing the initial loss value according to the second actual output, the accumulated reward at the first historical moment, the linear score and a predetermined temperature coefficient to obtain the first loss value.
The initial loss value may be the value output by the first loss function after the first actual output and the reward at the first historical moment are input into it. The temperature coefficient may be a predetermined coefficient for weighting the proportion of the sample in the loss function.
In practical application, the first actual output and the reward at the first historical time may be subjected to loss processing according to the first loss function to obtain an initial loss value, and further, the initial loss value may be processed according to the second actual output, the accumulated reward at the first historical time, the linear score and the temperature coefficient, and the loss value obtained after the processing may be used as the first loss value.
Optionally, the second actual output includes an actual value corresponding to the state at the next historical moment and an actual value corresponding to the initial state; correspondingly, processing the initial loss value according to the second actual output, the accumulated reward at the first historical moment, the linear score and the predetermined temperature coefficient to obtain the first loss value includes the following steps: determining a look-ahead distribution weight according to the actual value corresponding to the state at the next historical moment, the actual value corresponding to the initial state and the accumulated reward at the first historical moment; determining the product of the look-ahead distribution weight, the temperature coefficient and the linear score to obtain a target weight; and multiplying the target weight by the initial loss value to obtain the first loss value.
The actual value corresponding to the state at the next history time may be a state value evaluation result output after the state at the next history time is input into the state value network. The actual value corresponding to the initial state may be a state value evaluation result output after the initial state of the service environment is input into the state value network. In this embodiment, after the actual value corresponding to the state at the next historical time and the actual value corresponding to the initial state are obtained, the actual value corresponding to the state at the next historical time, the actual value corresponding to the initial state, and the cumulative reward at the first historical time in the current first training sample may be processed, and the value obtained after the processing may be used as the look-ahead distribution weight.
Optionally, determining the look-ahead distribution weight according to the actual value corresponding to the state at the next historical moment, the actual value corresponding to the initial state, and the accumulated reward at the first historical moment includes: adding the accumulated reward at the first historical moment to the actual value corresponding to the state at the next historical moment to obtain a first value; and determining the difference between the first value and the actual value corresponding to the initial state to obtain the look-ahead distribution weight.
In practical application, after the actual value corresponding to the state of the next historical moment and the actual value corresponding to the initial state are obtained, the actual value corresponding to the state of the next historical moment can be added to the accumulated rewards of the current first training sample at the first historical moment, and the value obtained after the addition can be used as the first value. Further, a difference between the first value and the actual value corresponding to the initial state may be determined, and the obtained difference may be used as a look-ahead distribution weight.
By way of example, the look-ahead distribution weight may be determined based on the following formula:
look-ahead distribution weight = accumulated reward at the first historical moment + actual value corresponding to the state at the next historical moment − actual value corresponding to the initial state
Further, after obtaining the look-ahead distribution weight, a product among the look-ahead distribution weight, the temperature coefficient, and the linear score may be determined, and the product may be taken as the target weight.
For example, the target weight may be determined based on the following formula:
target weight = look-ahead distribution weight × temperature coefficient × linear score
Further, the target weight may be multiplied by the initial loss value to obtain a first loss value, so that parameters in the state action value network may be corrected according to the first loss value.
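Putting the preceding formulas together, the weighting of the state-action-value loss can be sketched as below. The function and argument names are illustrative, and the form of the initial loss (a mean-squared error between Q(s_t, a_t) and the reward) is only a placeholder assumption, since the disclosure does not fix the exact first loss function; the look-ahead weight, target weight and final multiplication follow the formulas above.

```python
import torch
import torch.nn.functional as F

def weighted_q_loss(q_values, rewards, cumulative_rewards,
                    v_initial, v_next, linear_scores, temperature=0.1):
    """Compute the first loss value for the state action value network.

    q_values           : first actual output, Q(s_t, a_t), shape (batch, 1)
    rewards            : reward at the first historical moment, shape (batch, 1)
    cumulative_rewards : accumulated reward up to the first historical moment, shape (batch, 1)
    v_initial, v_next  : actual values of the initial state and of the next-moment state
    linear_scores      : linear scores from the discriminator, shape (batch, 1)
    temperature        : predetermined temperature coefficient (value is an assumption)
    """
    # Initial loss from the first loss function; an element-wise MSE against the
    # reward is used here purely as a stand-in for the unspecified first loss function.
    initial_loss = F.mse_loss(q_values, rewards, reduction="none")

    # Look-ahead distribution weight = G_t + V(s_{t+1}) - V(s_0).
    lookahead_weight = cumulative_rewards + v_next.detach() - v_initial.detach()

    # Target weight = look-ahead distribution weight x temperature coefficient x linear score.
    target_weight = lookahead_weight * temperature * linear_scores.detach()

    # First loss value = target weight x initial loss, averaged over the batch.
    return (target_weight * initial_loss).mean()
```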
And S140, processing the second actual output and rewards at the first historical moment according to a second loss function corresponding to the state value network to obtain a second loss value, and correcting parameters in the state value network according to the second loss value.
The second loss function may be a preset function characterizing the degree of difference between the actual output of the state value network and the reward at the first historical moment. In this embodiment, the second loss function may be a temporal-difference function.
In actual application, after the second actual output corresponding to the state value network is obtained, the actual value corresponding to the state at the first historical moment included in the second actual output, the actual value corresponding to the state at the next historical moment included in the second actual output, and the reward at the first historical moment included in the current first training sample may be input into the second loss function to obtain a loss value, and that loss value may be used as the second loss value.
By way of example, the second loss function may be expressed based on the following formula:
V(S_t) ← V(S_t) + α[R_t + γV(S_{t+1}) − V(S_t)]
where V(S_t) represents the actual value corresponding to the state at the first historical moment; α represents the step size; R_t represents the reward at the first historical moment; γ represents the discount rate; V(S_{t+1}) represents the actual value corresponding to the state at the next historical moment; and R_t + γV(S_{t+1}) is the training target.
Further, after the second loss value is obtained, the parameters in the state value network may be modified according to the second loss value.
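A minimal sketch of the corresponding temporal-difference update for the state value network is shown below; turning the tabular update rule above into a gradient step on a squared TD error is a common translation and is assumed here, as are the discount rate and the optimizer settings.

```python
import torch
import torch.nn.functional as F

def value_network_step(v_net, optimizer, states, rewards, next_states, gamma=0.99):
    """One parameter-correction step of the state value network using the TD(0) target."""
    v_t = v_net(states)                                   # V(S_t)
    with torch.no_grad():
        td_target = rewards + gamma * v_net(next_states)  # R_t + gamma * V(S_{t+1})
    second_loss = F.mse_loss(v_t, td_target)              # squared TD error as the second loss
    optimizer.zero_grad()
    second_loss.backward()
    optimizer.step()
    return second_loss.item()
```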
And S150, processing the third actual output and the decision action at the first historical moment according to a third loss function corresponding to the strategy network to obtain a third loss value, correcting parameters in the strategy network according to the third loss value, and obtaining the target strategy network after training is finished until a preset convergence condition corresponding to the strategy network is reached.
It should be noted that, S130, S140, and S150 do not have a time sequence of execution, and these three steps may be executed in parallel.
The third loss function may be a predetermined function characterizing the degree of difference between the actual output of the policy network and the decision action at the first historical moment. In this embodiment, the third loss function may be the policy network loss function in the soft actor-critic algorithm. It should be noted that the training of the policy network depends on the evaluation made by the evaluation networks with respect to the actual output of the policy network (i.e., the actual output of the state action value network and the actual output of the state value network). Thus, the third loss function is associated with the actual output corresponding to the state action value network and the actual output corresponding to the state value network. Illustratively, the strategy network is like a gymnast: after the gymnast performs an action, a referee scores it, and the referee corresponds to the evaluation networks (the state action value network and the state value network). The gymnast keeps improving their technique to earn a higher score from the referee; the referee's score is the supervisory signal (i.e., the actual output of the state action value network and the actual output of the state value network), and the gymnast improves by relying on that score. The parameters in the evaluation networks are updated so that their scoring becomes more accurate and the sum of future rewards is estimated better. By training the evaluation networks and the strategy network together, the gymnast's score becomes higher and higher and the referee's scoring becomes more and more accurate.
The preset convergence condition may be a preset policy network training process ending condition. Optionally, the preset convergence condition may include that the training error is smaller than the preset error, the error variation trend tends to be stable, or the current training iteration number reaches the preset number, etc. The target policy network is used for judging the state of the current moment corresponding to the intelligent agent so as to obtain a target decision action at the current moment.
In practical application, after the third actual output is obtained, the third actual output and the decision action at the first historical moment may be subjected to loss processing according to a third loss function, so as to obtain a third loss value. Further, the parameters in the policy network may be modified according to the third loss value. Specifically, the training error of the third loss function in the policy network, that is, the loss parameter, may be used as a condition for detecting whether the current loss function reaches convergence, for example, whether the training error is smaller than a preset error or whether the error variation trend tends to be stable, or whether the current number of iterations of the model is equal to a preset number of iterations, or the like. If the detection reaches the convergence condition, for example, the training error of the loss function is smaller than the preset error or the error change tends to be stable, which indicates that the network training of the current strategy is completed, and at the moment, the iterative training can be stopped. If the current convergence condition is not detected, the current first training sample can be further obtained to train the strategy network until the training error of the loss function is within the preset range. When the training error of the loss function reaches convergence, the strategy network obtained by current training can be used as a target strategy network.
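Assuming the soft-actor-critic form of the third loss function (an actor loss of the form α·log π(a|s) − Q(s, a)), a sketch of the policy update and of the convergence check could look like the following; the entropy coefficient, the Gaussian policy head, the convergence threshold and the function names are illustrative assumptions rather than the disclosed loss itself.

```python
import torch

def policy_network_step(policy_net, q_net, optimizer, states, entropy_alpha=0.2):
    """One parameter-correction step of the strategy network (SAC-style actor loss, assumed)."""
    mean = policy_net(states)                          # policy head outputs an action mean
    dist = torch.distributions.Normal(mean, torch.ones_like(mean))
    actions = dist.rsample()                           # reparameterised sample keeps gradients
    log_prob = dist.log_prob(actions).sum(-1, keepdim=True)
    q_values = q_net(torch.cat([states, actions], dim=-1))
    third_loss = (entropy_alpha * log_prob - q_values).mean()
    optimizer.zero_grad()
    third_loss.backward()
    optimizer.step()
    return third_loss.item()

def train_until_converged(step_fn, max_iters=100_000, eps=1e-4):
    """Stop when the loss change stabilises or the preset iteration count is reached."""
    prev_loss = float("inf")
    for iteration in range(max_iters):
        loss = step_fn()                   # one training step over a sampled mini-batch
        if abs(prev_loss - loss) < eps:    # error change has stabilised: preset convergence condition
            return iteration
        prev_loss = loss
    return max_iters
```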
According to the technical scheme, aiming at each first training sample obtained from the first experience pool, a current first training sample is processed based on a linear discriminator which is trained in advance, so that a linear score corresponding to the current first training sample is obtained; then, the current first training sample is respectively input into a state action value network, a state value network and a strategy network to obtain a first actual output corresponding to the state action value network, a second actual output corresponding to the state value network and a third actual output corresponding to the strategy network, and then, the first actual output and the rewards at the first historical moment are processed according to the second actual output, the accumulated rewards at the first historical moment, the linear score and the first loss function corresponding to the state action value network to obtain a first loss value, and parameters in the state action value network are corrected according to the first loss value; processing the second actual output and rewards at the first historical moment according to a second loss function corresponding to the state value network to obtain a second loss value, and correcting parameters in the state value network according to the second loss value; and processing the third actual output and the decision action at the first historical moment according to the third loss function corresponding to the strategy network to obtain a third loss value, correcting parameters in the strategy network according to the third loss value until the training is finished when the preset convergence condition corresponding to the strategy network is reached, and obtaining a target strategy network.
Example two
Fig. 2 is a flowchart of an agent policy network training method according to a second embodiment of the present invention, where after obtaining a target policy network, the method can process a state at any moment according to the target policy network to obtain a decision action corresponding to any moment. Wherein, the technical terms identical to or corresponding to the above embodiments are not repeated herein.
As shown in fig. 2, the method includes:
S210, processing a current first training sample based on a linear discriminator which is trained in advance, for each first training sample obtained from the first experience pool, to obtain a linear score corresponding to the current first training sample; the first training sample is a tuple comprising an initial state, a state at a first historical moment, a decision action at the first historical moment, a state at a next historical moment, a reward at the first historical moment, and an accumulated reward at the first historical moment.
S220, the current first training samples are respectively input into a state action value network, a state value network and a strategy network, and a first actual output corresponding to the state action value network, a second actual output corresponding to the state value network and a third actual output corresponding to the strategy network are obtained.
S230, processing the first actual output and the rewards at the first historical moment according to the second actual output, the accumulated rewards at the first historical moment, the linear scores and the first loss functions corresponding to the state action value network to obtain a first loss value, and correcting parameters in the state action value network according to the first loss value.
S240, processing the second actual output and rewards at the first historical moment according to a second loss function corresponding to the state value network to obtain a second loss value, and correcting parameters in the state value network according to the second loss value.
S250, processing the third actual output and the decision action at the first historical moment according to a third loss function corresponding to the strategy network to obtain a third loss value, correcting parameters in the strategy network according to the third loss value, and obtaining the target strategy network after training is finished until a preset convergence condition corresponding to the strategy network is reached.
S260, acquiring the state of the agent at the corresponding current moment under the target application scene.
The state at the current time may be an environmental state observation at the current time in the service environment. It may be the state of the agent itself at the current time, or the state of the environment the agent interacts with at the current time; its form may be determined according to the specific target application scenario, which is not specifically limited in this embodiment. For example, when the target application scenario is a game play scenario, the agent is a game object controlled in the game and the service environment is the game scene, and the state at the current time may be the state of the game at the current time in the current game round; when the target application scenario is a robot control scenario, the agent may be a controlled robot, and the state at the current time may be the spatial position coordinates of each joint of the robot.
In practical application, in the process of interaction between the agent and the environment, the state of the agent at the current moment corresponding to the target application scene can be obtained.
S270, processing the state at the current moment based on the target policy network to obtain a target decision action at the current moment, and controlling the intelligent agent to execute the target decision action to obtain the state at the next moment.
The target decision action at the current moment can be understood as an action to be executed by the agent decided by the target policy network according to the state at the current moment.
In practical application, after the state at the current moment is obtained, it can be input into the target policy network, which processes it to produce the target decision action at the current moment. The agent can then be controlled to execute the target decision action, and the state at the next moment is determined from the state at the current moment and the target decision action at the current moment.
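As a concrete illustration of this inference step, the following is a minimal sketch assuming a PyTorch-style deterministic policy network and a Gym-style environment whose step call returns the next state, the reward and a done flag; the function and variable names are hypothetical and are not part of the disclosed method.

```python
# Minimal inference sketch (assumptions: PyTorch tensors, a deterministic
# policy network, and a Gym-style env.step interface; names are hypothetical).
import torch

def decide_and_step(target_policy_net, env, state):
    """Obtain the target decision action for the current state and control
    the agent to execute it, returning the state at the next moment."""
    with torch.no_grad():
        state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        target_action = target_policy_net(state_t).squeeze(0)      # target decision action
    next_state, reward, done, _ = env.step(target_action.numpy())  # agent executes the action
    return target_action, next_state, reward, done
```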
According to the technical scheme of this embodiment, for each first training sample obtained from the first experience pool, the current first training sample is processed by the pre-trained linear discriminator to obtain the linear score corresponding to that sample. The current first training sample is then input into the state action value network, the state value network and the policy network respectively, yielding the first actual output corresponding to the state action value network, the second actual output corresponding to the state value network and the third actual output corresponding to the policy network. The first actual output and the reward at the first historical moment are processed according to the second actual output, the accumulated reward at the first historical moment, the linear score and the first loss function corresponding to the state action value network to obtain a first loss value, and the parameters of the state action value network are corrected accordingly. The second actual output and the reward at the first historical moment are processed according to the second loss function corresponding to the state value network to obtain a second loss value, and the parameters of the state value network are corrected accordingly. The third actual output and the decision action at the first historical moment are processed according to the third loss function corresponding to the policy network to obtain a third loss value, and the parameters of the policy network are corrected accordingly until the preset convergence condition corresponding to the policy network is reached and training ends, yielding the target policy network. The state of the agent at the current moment under the target application scenario is then obtained and processed by the target policy network to obtain the target decision action at the current moment, and the agent is controlled to execute the target decision action to obtain the state at the next moment. This improves both the training effect and the training speed of the policy network, so that the target policy network is trained more accurately and conveniently, which in turn improves its decision performance.
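To make the interaction of the three losses easier to follow, the sketch below writes out one training update over a batch from the first experience pool. It is only a sketch under stated assumptions: PyTorch, squared-error forms for the first and second loss functions, an advantage-weighted regression form for the third loss, and a discount factor; the clamping of the exponentiated advantage and all function and variable names are illustrative assumptions, not the patent's reference implementation.

```python
# Sketch of one training update on a first-pool batch (assumptions: PyTorch,
# squared-error first/second losses, an advantage-weighted third loss, and a
# discount factor gamma; all names are hypothetical).
import torch
import torch.nn.functional as F

def train_step(batch, q_net, v_net, policy_net, linear_disc,
               q_opt, v_opt, pi_opt, temperature=1.0, gamma=0.99):
    s0, s, a, s_next, r, r_cum = batch          # tuple fields of the first training sample

    linear_score = linear_disc(torch.cat([s, a], dim=-1)).detach()

    # First loss: state action value network, weighted by the product of the
    # look-ahead distribution weight, the temperature coefficient and the linear score.
    q_pred = q_net(s, a)                                     # first actual output
    with torch.no_grad():
        v_next, v_init = v_net(s_next), v_net(s0)            # second actual output
        td_target = r + gamma * v_next
        lookahead_w = (r_cum + v_next) - v_init              # look-ahead distribution weight
        target_w = lookahead_w * temperature * linear_score  # target weight
    q_loss = (target_w * (q_pred - td_target) ** 2).mean()   # first loss value
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Second loss: state value network against a reward-based target.
    v_loss = F.mse_loss(v_net(s), td_target)                 # second loss value
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # Third loss: policy network against the stored decision action, tied to the
    # outputs of the state action value network and the state value network.
    with torch.no_grad():
        weight = ((q_net(s, a) - v_net(s)) / temperature).exp().clamp(max=20.0)
    pi_loss = (weight * ((policy_net(s) - a) ** 2).sum(-1, keepdim=True)).mean()  # third loss value
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    return q_loss.item(), v_loss.item(), pi_loss.item()
```

Calling such a function once per sampled batch until the convergence condition of the policy network is met corresponds to steps S230 to S250 above.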
Example III
Fig. 3 is a schematic structural diagram of an agent policy network training device according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes: a training sample acquisition module 310, a training sample processing module 320, a first loss value determination module 330, a second loss value determination module 340, and a third loss value determination module 350.
The training sample obtaining module 310 is configured to process, for each first training sample obtained from the first experience pool, the current first training sample based on a pre-trained linear discriminator to obtain a linear score corresponding to the current first training sample. The first training sample is a tuple comprising an initial state, a state at a first historical moment, a decision action at the first historical moment, a state at the next historical moment, a reward at the first historical moment and an accumulated reward at the first historical moment; the next historical moment is the moment that follows the current moment when the first historical moment is taken as the current moment. The training sample processing module 320 is configured to input the current first training sample into the state action value network to be trained, the state value network and the policy network respectively, to obtain a first actual output corresponding to the state action value network, a second actual output corresponding to the state value network and a third actual output corresponding to the policy network. The first loss value determining module 330 is configured to process the first actual output and the reward at the first historical moment according to the second actual output, the accumulated reward at the first historical moment, the linear score and a first loss function corresponding to the state action value network to obtain a first loss value, and to correct parameters in the state action value network according to the first loss value. The second loss value determining module 340 is configured to process the second actual output and the reward at the first historical moment according to a second loss function corresponding to the state value network to obtain a second loss value, and to correct parameters in the state value network according to the second loss value. The third loss value determining module 350 is configured to process the third actual output and the decision action at the first historical moment according to a third loss function corresponding to the policy network to obtain a third loss value, and to correct parameters in the policy network according to the third loss value until the preset convergence condition corresponding to the policy network is reached and training ends, so as to obtain the target policy network. The third loss function is associated with the actual output corresponding to the state action value network and the actual output corresponding to the state value network; the target policy network is used to process the state of the agent at the current moment so as to obtain the target decision action at the current moment.
According to the technical scheme of this embodiment, for each first training sample obtained from the first experience pool, the current first training sample is processed by the pre-trained linear discriminator to obtain the corresponding linear score. The current first training sample is then input into the state action value network, the state value network and the policy network respectively to obtain the first, second and third actual outputs. The first actual output and the reward at the first historical moment are processed according to the second actual output, the accumulated reward at the first historical moment, the linear score and the first loss function corresponding to the state action value network to obtain a first loss value, and the parameters of the state action value network are corrected accordingly; the second actual output and the reward at the first historical moment are processed according to the second loss function corresponding to the state value network to obtain a second loss value, and the parameters of the state value network are corrected accordingly; and the third actual output and the decision action at the first historical moment are processed according to the third loss function corresponding to the policy network to obtain a third loss value, and the parameters of the policy network are corrected accordingly until the preset convergence condition corresponding to the policy network is reached and training ends, so as to obtain the target policy network.
Optionally, the first loss value determining module 330 includes: an initial loss value determination sub-module and a first loss value determination sub-module.
The initial loss value determining submodule is used for processing the first actual output and the rewards at the first historical moment according to the first loss function to obtain an initial loss value;
and the first loss value determining submodule is used for processing the initial loss value according to the second actual output, the accumulated rewards at the first historical moment, the linear score and a predetermined temperature coefficient to obtain the first loss value.
Optionally, the second actual output includes an actual value corresponding to a state at a next historical moment and an actual value corresponding to the initial state; correspondingly, the first loss value determining submodule includes: a look-ahead distribution weight determination unit, a target weight determination unit, and a first loss value determination unit.
A look-ahead distribution weight determining unit, configured to determine a look-ahead distribution weight according to the actual value corresponding to the state at the next historical moment, the actual value corresponding to the initial state, and the accumulated reward at the first historical moment;
The target weight determining unit is used for determining the product of the look-ahead distribution weight, the temperature coefficient and the linear score to obtain a target weight;
and the first loss value determining unit is used for multiplying the target weight and the initial loss value to obtain the first loss value.
Optionally, the look-ahead distribution weight determining unit includes: a first value determination subunit and a look-ahead distribution weight determination subunit.
A first value determining subunit, configured to add the accumulated reward at the first historical moment to the actual value corresponding to the state at the next historical moment to obtain a first numerical value;
and the look-ahead distribution weight determining subunit is used for determining the difference value between the first numerical value and the actual value corresponding to the initial state to obtain the look-ahead distribution weight.
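Taken together, the arithmetic performed by these sub-modules reduces to a few lines. The sketch below is a literal transcription under the assumption that all quantities are scalars or broadcastable tensors; the function name is hypothetical.

```python
# Literal transcription of the weighting arithmetic above (a sketch; the
# function name and tensor shapes are assumptions).
def first_loss_value(initial_loss, v_next, v_init, accumulated_reward,
                     linear_score, temperature):
    first_value = accumulated_reward + v_next   # accumulated reward + value of next historical state
    lookahead_weight = first_value - v_init     # minus value of the initial state
    target_weight = lookahead_weight * temperature * linear_score
    return target_weight * initial_loss         # first loss value
```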
Optionally, the apparatus further includes: the system comprises a training sample acquisition module, an actual score determination module, a loss processing module and a parameter correction module.
The training sample acquisition module is used for acquiring a plurality of first training samples from the first experience pool, setting a first theoretical score for each first training sample, acquiring a plurality of second training samples from the second experience pool, and setting a second theoretical score for each second training sample; wherein the capacity of the first experience pool is greater than the capacity of the second experience pool;
The actual score determining module is used for inputting, for each first training sample and each second training sample, the current first training sample and the current second training sample into a discriminator to be trained, to obtain a first actual score and a second actual score;
the loss processing module is used for respectively processing the first theoretical score and the first actual score according to a loss function corresponding to the discriminator to obtain a first loss value to be processed, and processing the second theoretical score and the second actual score according to the loss function to obtain a second loss value to be processed;
and the parameter correction module is used for correcting the parameters in the discriminator based on the first to-be-processed loss value and the second to-be-processed loss value to obtain a linear discriminator.
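As a hedged illustration of how such a linear discriminator might be fitted, the sketch below assumes a single linear layer, theoretical scores of 0 for first-pool samples and 1 for second-pool samples, and a squared-error loss; the pool sample() interface and all names are assumptions for illustration only.

```python
# Sketch of linear discriminator training over the two experience pools
# (assumptions: one linear layer, 0/1 theoretical scores, squared-error loss,
# and a hypothetical pool.sample() interface).
import torch
import torch.nn as nn

def train_linear_discriminator(first_pool, second_pool, in_dim, steps=1000, lr=1e-3):
    disc = nn.Linear(in_dim, 1)                             # the "linear" discriminator
    opt = torch.optim.Adam(disc.parameters(), lr=lr)
    for _ in range(steps):
        x1, x2 = first_pool.sample(), second_pool.sample()
        s1, s2 = disc(x1), disc(x2)                         # first / second actual scores
        y1, y2 = torch.zeros_like(s1), torch.ones_like(s2)  # first / second theoretical scores
        loss1 = ((s1 - y1) ** 2).mean()                     # first to-be-processed loss value
        loss2 = ((s2 - y2) ** 2).mean()                     # second to-be-processed loss value
        opt.zero_grad(); (loss1 + loss2).backward(); opt.step()
    return disc
```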
Optionally, the apparatus further includes: the system comprises a state acquisition module, a state processing module, a state determination module and a sample construction module.
The state acquisition module is used for acquiring the state of the first historical moment corresponding to the intelligent agent in the target application scene;
the state processing module is used for inputting the state of the first historical moment into a strategy network to be trained to obtain a decision action of the first historical moment, rewards of the first historical moment and accumulated rewards of the first historical moment;
The state determining module is used for determining the state of the next historical moment according to the state of the first historical moment and the decision action of the first historical moment;
the sample construction module is used for constructing training samples according to the state of the first historical moment, the decision action of the first historical moment, the rewards of the first historical moment, the accumulated rewards of the first historical moment and the state of the next historical moment, and storing the training samples into a first experience pool and a second experience pool respectively.
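The collection of training samples into the two pools can be sketched as follows. This sketch assumes a Gym-style environment that supplies the reward (rather than the policy network itself), deque-based pools with hypothetical capacities, and hypothetical names; only the tuple layout follows the description above.

```python
# Sketch of sample construction and storage in the two experience pools
# (assumptions: Gym-style env supplying the reward, deque-based pools with
# hypothetical capacities; only the tuple layout follows the description).
from collections import deque

first_pool = deque(maxlen=1_000_000)    # larger capacity
second_pool = deque(maxlen=10_000)      # smaller capacity, i.e. more recent data

def collect_samples(env, policy_net, episodes=10):
    for _ in range(episodes):
        state = initial_state = env.reset()
        accumulated_reward, done = 0.0, False
        while not done:
            action = policy_net(state)                      # decision action at this moment
            next_state, reward, done, _ = env.step(action)  # state at the next historical moment
            accumulated_reward += reward
            sample = (initial_state, state, action, next_state, reward, accumulated_reward)
            first_pool.append(sample)                       # stored in both pools
            second_pool.append(sample)
            state = next_state
```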
Optionally, the apparatus further includes: a state acquisition module and a target decision action determination module at the current moment.
The state acquisition module at the current moment is used for acquiring the state of the intelligent agent at the current moment corresponding to the target application scene;
and the target decision action determining module is used for processing the state at the current moment based on the target policy network to obtain a target decision action at the current moment and controlling the intelligent agent to execute the target decision action so as to obtain the state at the next moment.
Optionally, the method is applied to at least one of robot control, game play, computer vision and unmanned driving.
The intelligent agent strategy network training device provided by the embodiment of the invention can execute the intelligent agent strategy network training method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 4 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the agent policy network training method.
In some embodiments, the agent policy network training method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the agent policy network training method described above may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform the agent policy network training method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. An agent policy network training method, comprising:
processing a current first training sample based on a linear discriminator which is trained in advance, for each first training sample obtained from a first experience pool, to obtain a linear score corresponding to the current first training sample; wherein the first training sample is a tuple comprising an initial state, a state at a first historical moment, a decision action at the first historical moment, a state at a next historical moment, a reward at the first historical moment, and an accumulated reward at the first historical moment; the next historical moment is the moment that follows the current moment when the first historical moment is taken as the current moment;
Respectively inputting the current first training sample into a state action value network, a state value network and a strategy network to obtain a first actual output corresponding to the state action value network, a second actual output corresponding to the state value network and a third actual output corresponding to the strategy network;
processing the first actual output and the reward at the first historical moment according to the second actual output, the accumulated reward at the first historical moment, the linear score and a first loss function corresponding to the state action value network to obtain a first loss value, and correcting parameters in the state action value network according to the first loss value; and
processing the second actual output and the reward at the first historical moment according to a second loss function corresponding to the state value network to obtain a second loss value, and correcting parameters in the state value network according to the second loss value; and
processing the third actual output and the decision action at the first historical moment according to a third loss function corresponding to the strategy network to obtain a third loss value, and correcting parameters in the strategy network according to the third loss value until training is finished when a preset convergence condition corresponding to the strategy network is reached to obtain a target strategy network;
Wherein the third loss function is associated with an actual output corresponding to the state action value network and an actual output corresponding to the state value network; the target policy network is used for processing the state at the current moment corresponding to the agent so as to obtain a target decision action at the current moment.
2. The method of claim 1, wherein the processing the first actual output and the reward at the first historical moment according to the second actual output, the accumulated reward at the first historical moment, the linear score, and the first loss function corresponding to the state action value network to obtain a first loss value comprises:
processing the first actual output and the rewards at the first historical moment according to the first loss function to obtain an initial loss value;
and processing the initial loss value according to the second actual output, the accumulated rewards at the first historical moment, the linear score and a predetermined temperature coefficient to obtain the first loss value.
3. The method of claim 2, wherein the second actual output includes an actual value corresponding to a state at a next historical time and an actual value corresponding to the initial state; correspondingly, the processing the initial loss value according to the second actual output, the accumulated rewards at the first historical moment, the linear score and the predetermined temperature coefficient to obtain the first loss value includes:
Determining a look-ahead distribution weight according to the actual value corresponding to the state of the next historical moment, the actual value corresponding to the initial state and the accumulated rewards of the first historical moment;
determining the product of the look-ahead distribution weight, the temperature coefficient and the linear score to obtain a target weight;
multiplying the target weight and the initial loss value to obtain the first loss value.
4. The method of claim 3, wherein the determining the look-ahead distribution weight according to the actual value corresponding to the state at the next historical moment, the actual value corresponding to the initial state, and the accumulated reward at the first historical moment comprises:
adding the accumulated reward at the first historical moment to the actual value corresponding to the state at the next historical moment to obtain a first numerical value;
and determining a difference value between the first numerical value and the actual value corresponding to the initial state to obtain the look-ahead distribution weight.
5. The method as recited in claim 1, further comprising:
obtaining a plurality of first training samples from a first experience pool, setting a first theoretical score for each first training sample, and obtaining a plurality of second training samples from a second experience pool, setting a second theoretical score for each second training sample; wherein the capacity of the first experience pool is greater than the capacity of the second experience pool;
For each first training sample and each second training sample, inputting the current first training sample and the current second training sample into a discriminator to be trained to obtain a first actual score and a second actual score;
the first theoretical score and the first actual score are processed according to a loss function corresponding to the discriminator to obtain a first loss value to be processed, and the second theoretical score and the second actual score are processed according to the loss function to obtain a second loss value to be processed;
and correcting parameters in the discriminator based on the first to-be-processed loss value and the second to-be-processed loss value to obtain a linear discriminator.
6. The method according to claim 1 or 5, further comprising:
acquiring a state of an intelligent agent at a first historical moment corresponding to a target application scene;
inputting the state of the first historical moment into a strategy network to be trained, and obtaining a decision action of the first historical moment, rewards of the first historical moment and accumulated rewards of the first historical moment;
determining the state of the next historical moment according to the state of the first historical moment and the decision action of the first historical moment;
And constructing training samples according to the state of the first historical moment, the decision action of the first historical moment, the rewards of the first historical moment, the accumulated rewards of the first historical moment and the state of the next historical moment, and storing the training samples into a first experience pool and a second experience pool respectively.
7. The method as recited in claim 1, further comprising:
acquiring a state of an intelligent agent at a corresponding current moment under a target application scene;
and processing the state at the current moment based on the target policy network to obtain a target decision action at the current moment, and controlling the intelligent agent to execute the target decision action so as to obtain the state at the next moment.
8. The method as recited in claim 1, wherein the method is applied to at least one of a robot control scenario, a game play scenario, a computer vision scenario and an unmanned driving scenario.
9. An agent policy network training device, the device comprising:
the training sample acquisition module is used for processing, for each first training sample acquired from the first experience pool, the current first training sample based on a linear discriminator which is trained in advance, to obtain a linear score corresponding to the current first training sample; wherein the first training sample is a tuple comprising an initial state, a state at a first historical moment, a decision action at the first historical moment, a state at a next historical moment, a reward at the first historical moment, and an accumulated reward at the first historical moment; the next historical moment is the moment that follows the current moment when the first historical moment is taken as the current moment;
The training sample processing module is used for respectively inputting the current first training sample into a state action value network to be trained, a state value network and a strategy network to obtain a first actual output corresponding to the state action value network, a second actual output corresponding to the state value network and a third actual output corresponding to the strategy network;
the first loss value determining module is used for processing the first actual output and the reward at the first historical moment according to the second actual output, the accumulated reward at the first historical moment, the linear score and a first loss function corresponding to the state action value network to obtain a first loss value, and correcting parameters in the state action value network according to the first loss value; and
the second loss value determining module is used for processing the second actual output and the reward at the first historical moment according to a second loss function corresponding to the state value network to obtain a second loss value, and correcting parameters in the state value network according to the second loss value; and
The third loss value determining module is used for processing the third actual output and the decision action at the first historical moment according to a third loss function corresponding to the strategy network to obtain a third loss value, and correcting parameters in the strategy network according to the third loss value until training is finished when a preset convergence condition corresponding to the strategy network is reached to obtain a target strategy network;
wherein the third loss function is associated with an actual output corresponding to the state action value network and an actual output corresponding to the state value network; the target policy network is used for processing the state at the current moment corresponding to the agent so as to obtain a target decision action at the current moment.
10. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the agent policy network training method of any of claims 1-8.
CN202311230353.6A 2023-09-22 2023-09-22 Intelligent body strategy network training method and device, electronic equipment and storage medium Pending CN117312815A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311230353.6A CN117312815A (en) 2023-09-22 2023-09-22 Intelligent body strategy network training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311230353.6A CN117312815A (en) 2023-09-22 2023-09-22 Intelligent body strategy network training method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117312815A true CN117312815A (en) 2023-12-29

Family

ID=89280454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311230353.6A Pending CN117312815A (en) 2023-09-22 2023-09-22 Intelligent body strategy network training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117312815A (en)

Similar Documents

Publication Publication Date Title
WO2015103964A1 (en) Method, apparatus, and device for determining target user
CN114298322B (en) Federal learning method and apparatus, system, electronic device, and computer readable medium
CN115147687A (en) Student model training method, device, equipment and storage medium
CN114065863B (en) Federal learning method, apparatus, system, electronic device and storage medium
CN112734014A (en) Experience playback sampling reinforcement learning method and system based on confidence upper bound thought
CN112528160A (en) Intelligent recommendation method, intelligent recommendation device, model training device, electronic equipment and storage medium
CN117312815A (en) Intelligent body strategy network training method and device, electronic equipment and storage medium
CN112836381B (en) Multi-source information-based ship residual life prediction method and system
CN113657466B (en) Pre-training model generation method and device, electronic equipment and storage medium
CN116311893A (en) Traffic jam time prediction method and device, electronic equipment and storage medium
CN112423031B (en) KPI monitoring method, device and system based on IPTV
CN112162404A (en) Design method of free-form surface imaging system
CN114494818B (en) Image processing method, model training method, related device and electronic equipment
CN115598967A (en) Parameter setting model training method, parameter determining method, device, equipment and medium
CN115576205B (en) Feedback control method, universal feedback controller, training method, readable storage medium, computer program product and system
CN115598985B (en) Training method and device of feedback controller, electronic equipment and medium
CN113641905B (en) Model training method, information pushing method, device, equipment and storage medium
CN113591398B (en) Intelligent operation batch method and device based on deep reinforcement learning and electronic equipment
CN116562156B (en) Training method, device, equipment and storage medium for control decision model
CN116485119A (en) Scheduling method, scheduling device, scheduling equipment and storage medium
CN115616900B (en) Training method, device, equipment and medium for feedback control system identifier
CN117668655A (en) Transformer fault identification method, device, equipment and medium
CN115099163A (en) Charge state determining method and device, electronic equipment and storage medium
CN117271882A (en) Multi-target fusion processing method and device, electronic equipment and storage medium
CN117787416A (en) Method and device for constructing target game decision model, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination