CN111309880B - Multi-agent action strategy learning method, device, medium and computing equipment - Google Patents

Multi-agent action strategy learning method, device, medium and computing equipment

Info

Publication number
CN111309880B
CN111309880B (application number CN202010072011.6A)
Authority
CN
China
Prior art keywords
agent
action
rewards
user
agents
Prior art date
Legal status
Active
Application number
CN202010072011.6A
Other languages
Chinese (zh)
Other versions
CN111309880A (en)
Inventor
黄民烈
高信龙一
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010072011.6A
Publication of CN111309880A
Application granted
Publication of CN111309880B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The embodiment of the invention provides a multi-agent action strategy learning method, which comprises the following steps: the multiple agents sample corresponding actions according to their respective initial action strategies; the advantages obtained after the multiple agents execute the corresponding actions are estimated respectively; and the action strategy of each agent is updated based on the advantages obtained after the agents execute the corresponding actions, so that each updated action strategy enables the corresponding agent to obtain a higher return. In a task-oriented machine learning scenario, the method trains multiple mutually cooperating agents at the same time (i.e., trains multiple action strategies) instead of having a pre-built simulator interact with a single agent, and no manual supervision is needed, which greatly saves time and resources.

Description

Multi-agent action strategy learning method, device, medium and computing equipment
Technical Field
Embodiments of the present invention relate to the field of reinforcement learning, and more particularly, to a multi-agent action policy learning method, apparatus, medium, and computing device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Action strategies determine the next action an agent should take and play a critical role in task-oriented systems. In recent years, policy learning has been widely treated as a reinforcement learning (RL) problem. However, RL requires a large amount of interaction for policy training, and interacting directly with real users is time-consuming and laborious. The most common approach is therefore to develop a user simulator to assist training, so that the target agent can learn its action strategy.
However, designing a reliable user simulator is not easy and is often challenging, because it is essentially equivalent to building a good agent. As agents are increasingly required to handle more complex tasks, building a completely rule-based user simulator becomes a laborious and arduous undertaking that requires a great deal of domain expertise.
Disclosure of Invention
In this context, embodiments of the present invention desire to provide a multi-agent action policy learning method, apparatus, medium, and computing device.
In a first aspect of the embodiments of the present invention, there is provided a multi-agent action policy learning method, including:
the multi-agent samples corresponding actions according to respective initial action strategies;
respectively estimating advantages obtained after the multiple agents execute corresponding actions;
and updating the action strategies of each intelligent agent based on the advantages obtained after the intelligent agents execute the corresponding actions, so that each updated action strategy can enable the corresponding intelligent agent to obtain higher return.
In an example of this embodiment, before the multiple agents perform the corresponding actions according to the respective initial action policies, the method further includes:
pre-training is performed using the real action data to obtain respective initial action strategies for the multiple agents, respectively.
In one example of this embodiment, a weighted logistic regression method is used to pre-train based on the actual action data to obtain the initial action strategy for each of the multiple agents, respectively.
In an example of this embodiment, estimating the advantages obtained after the multiple agents perform the corresponding actions respectively includes:
respectively obtaining new states reached by each intelligent agent after corresponding actions are executed;
respectively calculating the current state of each intelligent agent and the return after the new state is reached;
and calculating the advantages obtained after each agent executes corresponding actions according to the current state of each agent and the return after the new state is reached.
In one example of this implementation, the multi-agent includes a user agent configured to target completion of a task and a system agent configured to assist the user agent in completing the task.
In one example of this embodiment, the rewards of each agent after reaching a certain state include at least its own rewards and global rewards.
In one example of this embodiment, the user agent's own rewards include at least a penalty for no action, a penalty for executing a new subtask when there are incomplete subtasks, and a user action reward.
In one example of this implementation, the user action rewards are based on a determination of whether the user agent is performing an action that is favorable to completing the task.
In one example of this embodiment, the return of the system agent itself is accumulated from at least one of: no punishment of action, punishment of corresponding assistance not provided immediately when the user agent requests assistance, and subtask completion rewards.
In one example of this embodiment, the global rewards are accumulated with at least one of an efficiency loss penalty, a subtask completion reward, and a full task completion reward.
In one example of this embodiment, the user action rewards or all task completion rewards are calculated after all actions are terminated.
In one example of this embodiment, the rewards for a certain state of each agent are calculated by the corresponding value network.
In one example of this embodiment, the rewards of different categories are calculated by different value networks, respectively.
In one example of this embodiment, during the multi-agent policy learning process, the value network is configured to optimally update with the objective of minimizing the variance between the rewards predicted using the preset method and the rewards calculated by the value network.
In one example of this embodiment, the preset method is configured to predict rewards based on a state of an agent.
In an example of this embodiment, the preset method is a temporal-difference algorithm.
In one example of this embodiment, a target network is introduced to optimize the update of the value network to make the training process more stable.
In a second aspect of the embodiments of the present invention, a multi-agent action policy learning device is provided, including a plurality of agents and a plurality of value networks, wherein the plurality of agents sample corresponding actions according to respective initial action policies;
the value networks respectively estimate advantages obtained after the multi-agent executes corresponding actions;
the plurality of agents update their respective action policies based on the advantages obtained after executing the respective actions, such that each of the updated action policies enables the respective agents to obtain a higher return.
In an example of this embodiment, the apparatus further includes:
and the pre-training module is configured to perform pre-training by using the real action data to obtain respective initial action strategies of the multiple agents.
In one example of this embodiment, the pre-training module is configured to perform pre-training based on the real action data by using a weighted logistic regression method to obtain respective initial action strategies of the multiple agents.
In an example of this embodiment, the apparatus further includes a plurality of state encoding modules configured to obtain the new states reached after each agent performs a corresponding action, respectively;
the value networks are further configured to calculate a current state of each agent and a return after reaching a new state, respectively; and calculating the advantages obtained after each agent executes the corresponding actions according to the current state of each agent and the return after the new state is reached.
In one example of this implementation, the multi-agent includes a user agent configured to target completion of a task and a system agent configured to assist the user agent in completing the task.
In one example of this embodiment, the rewards of each agent after reaching a certain state include at least its own rewards and global rewards.
In one example of this embodiment, the user agent's own rewards include at least a penalty for no action, a penalty for executing a new subtask when there are incomplete subtasks, and a user action reward.
In one example of this implementation, the user action rewards are based on a determination of whether the user agent is performing an action that is favorable to completing the task.
In one example of this embodiment, the return of the system agent itself is accumulated from at least one of: no punishment of action, punishment of corresponding assistance not provided immediately when the user agent requests assistance, and subtask completion rewards.
In one example of this embodiment, the global rewards are accumulated with at least one of an efficiency loss penalty, a subtask completion reward, and a full task completion reward.
In one example of this embodiment, the user action rewards or all task completion rewards are calculated after the action is terminated.
In one example of this embodiment, the rewards for a certain state of each agent are calculated by the corresponding value network.
In one example of this embodiment, the rewards of different categories are calculated by different value networks, respectively.
In one example of this embodiment, during the multi-agent policy learning process, the value network is configured to optimally update with the objective of minimizing the variance between the rewards predicted using the preset method and the rewards calculated by the value network.
In one example of this embodiment, the preset method is configured to predict rewards based on a state of an agent.
In an example of this embodiment, the preset method is a temporal-difference algorithm.
In one example of this embodiment, a target network is introduced to optimize the update of the value network to make the training process more stable.
In a third aspect of the embodiments of the present invention, there is provided a computer readable storage medium storing a computer program for performing the method of any one of the first aspects of the embodiments.
In a fourth aspect of the embodiments of the present invention, there is provided a computing device comprising a processor for implementing a method as in any of the first aspects of the embodiments when executing a computer program stored in a memory.
With the method of the invention, in a task-oriented machine learning scenario, multiple mutually cooperating agents are trained at the same time (i.e., each agent learns its own action strategy), instead of having a pre-built simulator interact with a single agent; no manual supervision is needed, which greatly saves time and resources. In addition, in order to enable every agent to learn an excellent action strategy, the method assigns different rewards to each agent, and this differentiated reward assignment enables each agent to learn an action strategy with a higher cumulative return.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 schematically shows an application scenario of a multi-agent action strategy learning method according to an embodiment of the present invention;
FIG. 2 schematically illustrates a flow diagram of user interaction with the system according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for learning multi-agent action strategy according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a multi-agent action strategy learning method according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a multi-agent action strategy learning device according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computing device provided by an embodiment of the present invention;
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and practice the invention and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the invention may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a multi-agent action strategy learning method, a multi-agent action strategy learning device, a multi-agent action strategy learning medium and a multi-agent action strategy learning computing device are provided.
Furthermore, any number of elements in the figures is for illustration and not limitation, and any naming is used for distinction only and not for any limiting sense.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments thereof.
Scene overview
In a task-oriented scenario, users usually come with an explicit goal and hope to obtain information or services that satisfy certain constraints, for example: ordering food, booking tickets, shopping online, hailing a taxi, reserving a hotel, or searching for music, films or particular products. In such scenarios, the system (agent) is required to provide the corresponding information according to the user's goal. The current practice is usually to have a pre-built user simulator interact with the system (agent) to train it, so that the system (agent) acquires the capability of serving users.
The inventors found that dialogue agents with different roles can be trained simultaneously based on the Actor-Critic framework, so that they learn cooperative interaction strategies to achieve the goal. For example, under this framework an agent is regarded as an actor (Actor) that selects an action based on its current state, while a critic (Critic) module scores the actor's performance (the agent performing a certain action); the actor adjusts its own actions according to the critic's score in order to obtain a higher score.
However, the roles of different agents are essentially different; that is, the agents are divided into user agents and system agents. It is therefore inappropriate to evaluate their performance with the same set of criteria. Different agents need to be scored with corresponding criteria, i.e., critics (Critic) with different evaluation criteria score the performance of the respective actors, as shown in fig. 1, so that each agent can adjust its own action strategy and complete the task more efficiently.
Task summary
Embodiments of the present invention provide a multi-agent action policy learning method in which multiple agents interact to complete a task goal. The task goal may be common to the multiple agents, or may belong to one of them. The agents may be implemented as various virtual or physical entities capable of performing actions, such as robots that can converse, grasp objects or move. A dialogue scenario is described below. The dialogue consists of multiple rounds between a user agent (hereinafter abbreviated as the user) and a system agent (hereinafter abbreviated as the system). The user agent has a task goal G = (C, R), where C is a set of constraints (e.g., a Japanese restaurant in the city centre) and R is a set of requested information types (e.g., the address or phone number of the queried hotel). The information requested by the user agent is stored in an external database DB which the system agent can access, and the user agent and the system agent interact during the dialogue to achieve the user goal. There may be multiple domains in G, and both agents must complete all the subtasks in each domain. Both agents perceive only part of the environment information, as shown in fig. 2: only the user agent knows the task goal G, and only the system agent can access the DB; the only way they learn about each other is through dialogue interaction. In the present invention, the two agents communicate asynchronously during the dialogue, i.e. the user agent and the system agent take turns communicating.
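By way of non-limiting illustration, such a multi-domain task goal G = (C, R) could be held in a simple data structure like the Python sketch below; the domain, slot and value names are hypothetical examples and are not prescribed by this embodiment.

```python
# Minimal sketch of a multi-domain task goal G = (C, R).
# Domain, slot and value names are illustrative only.
task_goal = {
    "restaurant": {
        "constraints": {"food": "japanese", "area": "centre"},  # C: constraint slots
        "requests": ["address", "phone"],                       # R: requested info types
    },
    "hotel": {
        "constraints": {"stars": "4", "parking": "yes"},
        "requests": ["postcode"],
    },
}
```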
Exemplary method
A multi-agent action policy learning method according to an exemplary embodiment of the present application is described below with reference to fig. 3 in conjunction with the application scenario of fig. 1. It should be noted that the above application scenario is only shown for the convenience of understanding the spirit and principle of the present application, and the embodiments of the present application are not limited in any way. Rather, embodiments of the application may be applied to any scenario where applicable.
Fig. 3 schematically illustrates an exemplary process flow 300 of a multi-agent action policy learning method according to an embodiment of the application. After the process flow 300 starts, step S310 is performed.
Step S310, the multiple agents sample corresponding actions according to the initial action strategies respectively;
Each agent i has its own state s_i and action a_i. The joint state s = (s_1, ..., s_N) transitions to s′ = (s′_1, ..., s′_N), where the transition depends on the actions (a_1, ..., a_N) taken by all the agents according to their respective action policies π_i(a_i | s_i). In this embodiment, the task is started by the user acting first, i.e. the user selects an action to perform according to its action policy μ(a^U | s^U) based on its current state, and the system in turn selects an action to perform according to its action policy π(a^S | s^S) based on its current state. On this basis, each round of dialogue can be represented by the state-action tuple (s_t^U, a_t^U, s_t^S, a_t^S), where the superscript U denotes the user, the superscript S denotes the system, and the subscript t denotes the round.
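By way of non-limiting illustration, step S310 could be realised by alternately sampling actions from the two policies, as in the Python sketch below; the policy objects, their sample method and the environment interface are assumptions made for illustration and are not an API defined by this embodiment.

```python
# Minimal sketch: the user policy (mu) acts first in each round,
# then the system policy (pi) responds. Interfaces are illustrative.
def roll_out_episode(user_policy, system_policy, env, max_rounds=20):
    trajectory = []
    s_user = env.reset()                                  # initial user state
    for t in range(max_rounds):
        a_user, terminate = user_policy.sample(s_user)    # user acts first
        s_system = env.step_user(a_user)                  # system observes the user action
        a_system = system_policy.sample(s_system)         # system responds
        trajectory.append((s_user, a_user, s_system, a_system))
        s_user = env.step_system(a_system)                # next user state
        if terminate:                                     # user has no remaining tasks
            break
    return trajectory
```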
In one example of this embodiment, the initial action policy of each agent may be a random policy, i.e., each agent randomly samples actions to be performed from the optional set of actions based on the current state. Considering that in handling multi-domain, multi-objective, complex (conversational) tasks, the action space of the action strategy for acting on the individual agents may be very large, in which case training from scratch takes significant resources and time with a random strategy. Thus, in one example of this embodiment, the training process may be divided into two phases: the action strategy of each agent is first pre-trained using real action data (e.g., a dialog corpus), and then step S310 is performed to improve the pre-trained strategy using RL. Since each agent generates only a few conversational actions in a round of conversations, in one example of this implementation, a β -weighted logistic regression is used for policy pre-training to reduce data sample bias:
L(X, Y; β) = -[β · Yᵀ log σ(X) + (I - Y)ᵀ log(I - σ(X))],
where X is a state and Y is the corresponding action from the task corpus.
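As a non-limiting sketch, the β-weighted objective above could be implemented as a weighted binary cross-entropy over multi-label action vectors; the use of PyTorch and the tensor shapes below are assumptions made for illustration.

```python
import torch

def weighted_bce_pretrain_loss(logits, actions, beta):
    """Beta-weighted logistic-regression loss for policy pre-training (sketch).

    logits  : (batch, action_dim) raw scores X produced by the policy network
    actions : (batch, action_dim) multi-hot ground-truth action labels Y
    beta    : scalar weight on the positive class, compensating for the
              sparsity of actions taken in each dialogue round
    """
    probs = torch.sigmoid(logits)
    eps = 1e-8
    loss = -(beta * actions * torch.log(probs + eps)
             + (1.0 - actions) * torch.log(1.0 - probs + eps))
    return loss.mean()
```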
In addition, the user and the system may engage in multiple rounds of dialogue, where one round consists of the user's dialogue content and the system's dialogue content for that round.
For example, the system dialogue content of the t-th round is denoted by S(t) and the user dialogue content of the t-th round is denoted by U(t), where t denotes the round number, t = 1, 2, 3, …. S(1) denotes the system dialogue content of the first round, U(1) denotes the user dialogue content of the first round, and so on.
It should be noted that in each round of dialogue, the system speaks after the user speaks; that is, within a single round, the user first issues a query and the system then gives the corresponding response.
In the present embodiment, the system's action policy π determines, based on the system state s^S, the action a^S to be performed by the system, so as to give the user an appropriate response. Each system action a^S is a subset of the action set A. In this embodiment, the system state s_t^S of the t-th round of dialogue consists of the following components: (I) the user action of the current round; (II) the system action of the previous round; (III) the belief state b_t that tracks the constraint slots and request slots provided by the user; and (IV) an embedding vector q_t of the number of query results returned from the DB.
The user policy μ determines, based on the user state s^U, the action a^U to be performed by the user, so as to communicate the user's constraints and requests to the system. In the present embodiment, the user state s_t^U consists of the following components: (I) the system action of the previous round of dialogue; (II) the user action of the previous round of dialogue; (III) the goal state g_t, representing the constraints and request types that remain to be conveyed; and (IV) a difference vector c_t indicating the inconsistency between the system response and the constraints C.
A user action is an abstract representation of intent (constraints and request types), which can be expressed as a quadruple of domain, intent type, slot type and slot value (e.g. [restaurant, inform, food, Italian]). The example above expresses that, in the restaurant domain, the user informs the system that the desired food type is Italian. It should be noted that there may be multiple intents in a single round of dialogue.
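By way of non-limiting illustration, such quadruple dialogue actions could be encoded as in the Python sketch below; the field names and example values are hypothetical and merely illustrate the (domain, intent, slot, value) representation described above.

```python
from typing import List, NamedTuple

class DialogAct(NamedTuple):
    domain: str   # e.g. "restaurant"
    intent: str   # e.g. "inform" or "request"
    slot: str     # e.g. "food"
    value: str    # e.g. "italian" (empty for a pure request)

# A single user turn may carry several intents at once.
user_turn: List[DialogAct] = [
    DialogAct("restaurant", "inform", "food", "italian"),
    DialogAct("restaurant", "request", "address", ""),
]
```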
In addition to predicting user actions, the user policy is also configured to output a termination signal T, i.e. μ = μ(a^U, T | s^U); when the user has no remaining tasks to perform, the termination signal T is output.
After the current state and the action to be executed of each agent are obtained, the new state after the action is executed can be estimated, and then the advantages obtained after the action is executed are calculated, so that the action strategy of each agent is updated.
As an example, suppose a conversation between the user and the system has started; in order to update the action policy of each agent, step S320 is performed next.
In step S320, the advantages obtained after the multiple agents execute the corresponding actions are estimated respectively;
specifically, first, respectively acquiring new states reached by each intelligent agent after corresponding actions are executed;
then, calculating the current state of each intelligent agent and the return after the new state is reached;
and finally, calculating the advantages obtained after each agent executes corresponding actions according to the current state of each agent and the return after the new state is reached.
The return of an agent's current state is the accumulation of the rewards it obtains. Specifically, during reinforcement learning the environment feeds back a reward value to the agent after each of its actions. As an example, the expected return of an agent at the t-th round of dialogue is R_t = Σ_{k≥0} γ^k · r_{t+k}, where r is the reward the environment feeds back after the agent selects an action and γ is the discount factor.
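As a non-limiting sketch, the discounted return above can be computed backwards over the rewards of one episode, as shown below; the function interface is an assumption made for illustration.

```python
# Sketch of the discounted return R_t = sum_k gamma**k * r_{t+k},
# accumulated backwards over the per-round rewards of one episode.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# e.g. discounted_returns([0.0, 0.0, 1.0], gamma=0.9) -> [0.81, 0.9, 1.0]
```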
In conventional reinforcement learning there is often only one agent that needs to learn a strategy, i.e. the environment only needs to feed back rewards to that single agent. Here, on the one hand, the roles of the user and the system differ: the user actively initiates the task and may change it during the conversation, while the system only passively responds to the user and returns the appropriate information. Their performance therefore cannot be fully judged by a single unified criterion, so rewards should be calculated separately for each agent. In one example of this implementation, the system reward r^S comprises at least the following parts:
(I) a penalty for taking no action; for example, a penalty is given if the system does not give any response during a round of dialogue;
(II) a penalty for not immediately providing the corresponding assistance when the user agent requests assistance; for example, a penalty is given if the user sends a request in a round of dialogue but the system does not reply with the corresponding information in time;
(III) a subtask completion reward, i.e. a reward for assisting the user to complete a subtask according to the user's request; for example, if in a round of dialogue the user requests a certain piece of information and the system gives an appropriate response, the corresponding reward is fed back to the system.
The user reward r^U comprises at least the following parts:
(I) a penalty for taking no action, similar to that in the system reward;
(II) a penalty for executing a new subtask while there are unfinished subtasks, i.e. a penalty is given if the user requests new information while there are still constraints to inform the system of;
(III) a user action reward, based on whether the user agent performs actions that are beneficial to completing the task; in this implementation it is based, for example, on whether the user has informed the system of all constraints C and request types R.
In this way, the system reward and the user reward can each be determined explicitly from the state of the corresponding agent during a round of dialogue.
However, considering on the other hand that two agents communicate and cooperate to accomplish the same task, the reward should also relate to a (common) global goal of both agents.
Based on this, in one example of this implementation, the global reward r^G comprises at least the following parts:
(I) Efficiency loss (penalty); for example, each round of dialog gives a penalty (e.g., a small negative value) within a preset range;
(II) subtask completion rewards; for example, after completing subtasks in a certain field in the overall user goal G, awards are given;
(III) a total task completion reward; for example, rewards for all tasks are completed based on the user's overall goal G.
The above detailed decomposition of rewards applies only to the present embodiment; when the method of the present application is used for policy learning in other domains, the rewards need not be set and calculated exactly as above. In general, the above embodiment merely explains that rewards are calculated individually for each agent (i.e., each agent's rewards are calculated and accumulated independently), and that for multi-agent policy learning requiring cooperation a global reward is additionally calculated (and accumulated globally) on top of the per-agent rewards and distributed to the corresponding agents in a preset manner; for example, the global reward may be distributed to each agent with preset proportions or weights.
In this embodiment, the reward of an agent is the sum of its own reward and the global reward. For example, the total reward of the system is the accumulated system reward plus the global reward, r^S + r^G (the superscript G denotes global). Note that the full-task success reward and the user action reward are calculated only after the full task is completed, and that the subtask completion reward counted in the system reward is different from the subtask completion reward in the global reward.
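By way of non-limiting illustration, the per-round reward components described above could be composed as in the Python sketch below; the numeric reward magnitudes and the boolean flags passed in are illustrative assumptions, not values prescribed by this embodiment.

```python
# Sketch of per-round reward composition; magnitudes are illustrative.
def system_reward(no_action, ignored_request, subtask_done):
    r = 0.0
    r += -1.0 if no_action else 0.0          # penalty for giving no response
    r += -1.0 if ignored_request else 0.0    # penalty for not assisting a request
    r += 5.0 if subtask_done else 0.0        # reward for helping finish a subtask
    return r

def user_reward(no_action, premature_new_subtask, goal_progress):
    r = 0.0
    r += -1.0 if no_action else 0.0          # penalty for taking no action
    r += -1.0 if premature_new_subtask else 0.0
    r += goal_progress                       # user-action reward component
    return r

def global_reward(subtask_done, all_done):
    r = -0.1                                 # small per-round efficiency penalty
    r += 5.0 if subtask_done else 0.0        # subtask completion reward
    r += 20.0 if all_done else 0.0           # full-task completion reward
    return r

# Each agent is trained on its own reward plus the global reward, e.g.
# r_system_total = system_reward(...) + global_reward(...)
```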
The action policy of each agent aims to maximize the return obtained by its corresponding agent. In order for the action policy learned by each agent to be the optimal action policy, step S330 is executed next. In step S330, the action policy of each agent is updated based on the advantages obtained after the agents execute their corresponding actions, so that each updated action policy enables the corresponding agent to obtain a higher return. Specifically, in one example of the present embodiment, the action policy is updated based on the advantage of an agent after performing an action, where the advantage is calculated from the reward as A(s) = r + γV(s′) - V(s). After the advantages of the different parts are obtained, the action policy of an agent can be updated based on the parts of the advantage that relate to it. As an example, the system policy and the user policy can be updated as follows:
The policy gradients of the system and the user can be obtained in the advantage-weighted form, for example ∇J(θ) = E[∇_θ log π_θ(a^S | s^S) · A(s)] for the system and ∇J(w) = E[∇_w log μ_w(a^U, T | s^U) · A(s)] for the user. Based on these policy gradients, the action policies can be updated accordingly to obtain the optimal action policy of each agent. In this embodiment, the system policy π_θ is parameterized by θ and the user policy μ_w is parameterized by w; that is, the parameters θ and w are updated based on the above policy gradients, respectively.
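A minimal sketch of such an advantage-weighted policy-gradient step is given below, assuming PyTorch policy networks and an externally constructed optimizer; these interface choices are assumptions made for illustration.

```python
import torch

def policy_gradient_step(optimizer, log_probs, advantages):
    """One advantage-weighted policy-gradient update (sketch).

    log_probs  : tensor of log pi(a|s) for the actions actually taken
    advantages : tensor of A(s) = r + gamma * V(s') - V(s), detached so that
                 only the policy parameters are updated here
    """
    loss = -(log_probs * advantages.detach()).mean()  # ascend the expected advantage
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```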
The above embodiments describe how the action policy of each agent is updated, namely according to the advantage of an action the agent performs in a certain state. It follows that the calculation of the advantage is equally important: a suitable advantage calculation method helps the agents learn better action policies. Specifically, in one embodiment of the present invention, a value network (i.e. a Critic) is set up for each agent, and, in addition, in order to fully account for the rewards of all agents, a separate value network (Critic) is set up to estimate the global return. In the above embodiment, a hybrid value network HVN (Hybrid Value Network) containing three independent value networks (Critics) is therefore constructed to calculate the advantages from the corresponding reward components, as shown in fig. 4. On this basis, the value networks (Critics) of this embodiment accumulate returns based on the states of the respective agents. Specifically, in this embodiment, the dialogue state of each agent is first encoded into a learned state representation:
h^S = tanh(f(s^S)), h^U = tanh(f(s^U)),
where f(·) may be any neural network (e.g. a multi-layer perceptron or a convolutional neural network, and the networks may be the same or different), and tanh is the hyperbolic tangent activation function. The returns can then be calculated from the encoded dialogue states of the agents:
V^S(s) = f_S(h^S), V^U(s) = f_U(h^U), V^G(s) = f_G([h^U; h^S]),
where f_S, f_U and f_G are three arbitrary neural networks and [h^U; h^S] denotes, for example, a combination of the two encoded states as input to the global value network.
The neural networks described in the above embodiments may be the same or different, and the present embodiment is not limited thereto.
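A hybrid value network of this shape could be sketched as follows; the hidden sizes and the choice of multi-layer perceptrons are illustrative assumptions rather than requirements of this embodiment.

```python
import torch
import torch.nn as nn

class HybridValueNetwork(nn.Module):
    """Sketch of an HVN with separate critics for the system reward,
    the user reward and the global reward (sizes are illustrative)."""

    def __init__(self, sys_state_dim, usr_state_dim, hidden=128):
        super().__init__()
        self.enc_sys = nn.Linear(sys_state_dim, hidden)   # h^S = tanh(f(s^S))
        self.enc_usr = nn.Linear(usr_state_dim, hidden)   # h^U = tanh(f(s^U))
        self.v_sys = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.v_usr = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.v_glb = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s_sys, s_usr):
        h_sys = torch.tanh(self.enc_sys(s_sys))
        h_usr = torch.tanh(self.enc_usr(s_usr))
        v_sys = self.v_sys(h_sys)                          # V^S(s)
        v_usr = self.v_usr(h_usr)                          # V^U(s)
        v_glb = self.v_glb(torch.cat([h_usr, h_sys], -1))  # V^G(s)
        return v_sys, v_usr, v_glb
```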
In one embodiment of the present application, the advantage calculation is itself continuously optimized during training. Specifically, during the multi-agent policy learning process, the value network is configured to be optimized with the objective of minimizing the difference between the return predicted by a preset method and the return calculated by the value network. The preset method may be any known method that predicts returns based on an agent's state, such as a temporal-difference algorithm. Considering that the value network is updated frequently, which may cause its estimates to fluctuate excessively, especially in the multi-agent policy learning of this embodiment, in one example of this implementation a target network is introduced into the optimization of the value network to make the training process more stable. Specifically, the following loss functions may be used to update the value network:
L_V^X(θ) = ( r^X + γ · V_{θ^-}^X(s′) - V_θ^X(s) )²  for X ∈ {S, U, G},
where the HVN V_θ is parameterized by θ, θ^- denotes the weights of the target network, and the total loss L_V = L_V^S + L_V^U + L_V^G is the sum of the estimated-return losses for the individual reward components.
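As a non-limiting sketch, this TD loss with a target network could be implemented as follows; the batch layout, field names and synchronization scheme are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def hvn_td_loss(hvn, target_hvn, batch, gamma=0.99):
    """TD loss for the hybrid value network with a target network (sketch).
    Rewards in `batch` are assumed to have shape (batch_size, 1)."""
    v_s, v_u, v_g = hvn(batch["s_sys"], batch["s_usr"])
    with torch.no_grad():                                    # target network is frozen
        nv_s, nv_u, nv_g = target_hvn(batch["next_s_sys"], batch["next_s_usr"])
        tgt_s = batch["r_sys"] + gamma * nv_s
        tgt_u = batch["r_usr"] + gamma * nv_u
        tgt_g = batch["r_glb"] + gamma * nv_g
    # Total loss is the sum of the per-component squared TD errors.
    return F.mse_loss(v_s, tgt_s) + F.mse_loss(v_u, tgt_u) + F.mse_loss(v_g, tgt_g)

# The target network is synchronized periodically, e.g.:
#   target_hvn.load_state_dict(hvn.state_dict())
```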
In addition, in this embodiment, in order to enable every agent to learn an excellent action strategy, the hybrid value network is constructed so that the total reward is attributed both to each agent and to the task; the performance of each agent is thus evaluated by a different value network, and this differentiated reward assignment enables each agent to learn an action strategy with a higher cumulative return.
Exemplary apparatus
Having described the method of an exemplary embodiment of the present invention, next, with reference to fig. 5, an exemplary embodiment of the present invention provides a multi-agent action policy learning device, including a plurality of agents and a plurality of value networks, wherein the plurality of agents sample corresponding actions according to their respective initial action policies;
The value networks respectively estimate advantages obtained after the multi-agent executes corresponding actions;
the plurality of agents update their respective action policies based on the advantages obtained after executing the respective actions, such that each of the updated action policies enables the respective agents to obtain a higher return.
It should be noted that the number of agents and the number of value networks are not fixed and may be set according to the needs of the actual scenario. For example, if there are several agents with identical roles (that is, the same value network may be used for them), it is not necessary to set up a separate value network for each agent; and besides setting up different value networks for agents with different roles, an additional value network for calculating the global reward may be set up. The above value networks may be combined together to form a hybrid value network.
In an example of this embodiment, the apparatus further includes:
and the pre-training module is configured to perform pre-training by using the real action data to obtain respective initial action strategies of the multiple agents.
In one example of this embodiment, the pre-training module is configured to perform pre-training based on the real action data by using a weighted logistic regression method to obtain respective initial action strategies of the multiple agents.
In an embodiment of this embodiment, the apparatus further includes a plurality of state encoding modules configured to obtain new states reached after each agent performs a corresponding action, respectively;
the value networks are further configured to calculate a current state of each agent and a return after reaching a new state, respectively; and calculating the advantages obtained after each agent executes the corresponding actions according to the current state of each agent and the return after the new state is reached.
In one example of this implementation, the multi-agent includes a user agent configured to target completion of a task and a system agent configured to assist the user agent in completing the task.
In one example of this embodiment, the rewards of each agent after reaching a certain state include at least its own rewards and global rewards.
In one example of this embodiment, the user agent's own rewards include at least a penalty for no action, a penalty for executing a new subtask when there are incomplete subtasks, and a user action reward.
In one example of this implementation, the user action rewards are based on a determination of whether the user agent is performing an action that is favorable to completing the task.
In one example of this embodiment, the return of the system agent itself is accumulated from at least one of: no punishment of action, punishment of corresponding assistance not provided immediately when the user agent requests assistance, and subtask completion rewards.
In one example of this embodiment, the global rewards are accumulated with at least one of an efficiency loss penalty, a subtask completion reward, and a full task completion reward.
In one example of this embodiment, the user action rewards or all task completion rewards are calculated after the action is terminated.
In one example of this embodiment, the rewards for a certain state of each agent are calculated by the corresponding value network.
In one example of this embodiment, the rewards of different categories are calculated by different value networks, respectively.
In one example of this embodiment, during the multi-agent policy learning process, the value network is configured to optimally update with the objective of minimizing the variance between the rewards predicted using the preset method and the rewards calculated by the value network.
In one example of this embodiment, the preset method is configured to predict rewards based on a state of an agent.
In an example of this embodiment, the preset method is a temporal-difference algorithm.
In one example of this embodiment, a target network is introduced to optimize the update of the value network to make the training process more stable.
Exemplary Medium
Having described the methods and apparatus of exemplary embodiments of the present invention, a computer-readable storage medium of exemplary embodiments of the present invention is described next with reference to fig. 6.
Referring to fig. 6, a computer readable storage medium is shown as an optical disc 60, on which a computer program (i.e., a program product) is stored, which when executed by a processor, implements the steps described in the above method embodiments, for example: the multi-agent samples corresponding actions according to respective initial action strategies; respectively estimating advantages obtained after the multiple agents execute corresponding actions; and updating the action strategies of each intelligent agent based on the advantages obtained after the intelligent agents execute the corresponding actions, so that each updated action strategy can enable the corresponding intelligent agent to obtain higher return. The specific implementation of each step is not repeated here.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
Exemplary computing device
Having described the method, apparatus, and medium of the exemplary embodiments of the present invention, a computing device of the exemplary embodiments of the present invention is next described with reference to fig. 7. Fig. 7 shows a block diagram of an exemplary computing device 70 suitable for use in implementing embodiments of the present invention; the computing device 70 may be a computer system or a server. The computing device 70 shown in fig. 7 is only one example and should not be taken as limiting the functionality and scope of use of embodiments of the invention.
As shown in fig. 7, components of computing device 70 may include, but are not limited to: one or more processors or processing units 701, a system memory 702, and a bus 703 that connects the various system components (including the system memory 702 and the processing units 701).
Computing device 70 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computing device 70 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 702 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 7021 and/or cache memory 7022. Computing device 70 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, ROM7023 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 7, commonly referred to as a "hard disk drive"). Although not shown in fig. 7, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media), may be provided. In such cases, each drive may be coupled to bus 703 through one or more data medium interfaces. The system memory 702 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 7025 having a set (at least one) of program modules 7024 may be stored, for example, in system memory 702, and such program modules 7024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 7024 generally perform the functions and/or methods of the embodiments described herein.
Computing device 70 may also communicate with one or more external devices 704 (e.g., keyboard, pointing device, display, etc.). Such communication may occur through an input/output (I/O) interface 705. Moreover, the computing device 70 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 706. As shown in fig. 7, the network adapter 706 communicates with other modules of the computing device 70 (e.g., processing unit 701, etc.) over bus 703. It should be appreciated that although not shown in fig. 7, other hardware and/or software modules may be used in connection with computing device 70.
The processing unit 701 executes various functional applications and data processing by running programs stored in the system memory 702, for example the multi-agent action strategy learning method described above, in which the multiple agents sample corresponding actions according to their respective initial action policies; the advantages obtained after the multiple agents execute the corresponding actions are estimated respectively; and the action strategy of each agent is updated based on the advantages obtained after the agents execute the corresponding actions, so that each updated action strategy enables the corresponding agent to obtain a higher return.
It should be noted that although several units/modules or sub-units/modules of the multi-agent action strategy learning apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present invention. Conversely, the features and functions of one unit/module described above may be further divided into multiple units/modules.
Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this is not required to either imply that the operations must be performed in that particular order or that all of the illustrated operations be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
While the spirit and principles of the present invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and that the division into aspects is merely for convenience of description and does not imply that features in those aspects cannot be advantageously combined. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Through the above description, the embodiments of the present invention provide the following technical solutions, but are not limited thereto:
1. a multi-agent action strategy learning method, comprising:
the multi-agent samples corresponding actions according to respective initial action strategies;
respectively estimating advantages obtained after the multiple agents execute corresponding actions;
and updating the action strategies of each intelligent agent based on the advantages obtained after the intelligent agents execute the corresponding actions, so that each updated action strategy can enable the corresponding intelligent agent to obtain higher return.
2. The method of claim 1, wherein before the multi-agent performs the respective actions according to the respective initial action policies, the method further comprises:
pre-training is performed using the real action data to obtain respective initial action strategies for the multiple agents, respectively.
3. The method of claim 2, wherein the pre-training is performed based on real action data using a weighted logistic regression method to obtain respective initial action strategies for the multiple agents.
4. The method of claim 1, wherein estimating the advantages obtained after the multi-agent performs the corresponding actions, respectively, comprises:
Respectively obtaining new states reached by each intelligent agent after corresponding actions are executed;
respectively calculating the current state of each intelligent agent and the return after the new state is reached;
and calculating the advantages obtained after each agent executes corresponding actions according to the current state of each agent and the return after the new state is reached.
5. The method of claim 1, wherein the multi-agent comprises a user agent configured to target completion of a task and a system agent configured to assist the user agent in completing the task.
6. The method of claim 5, wherein the rewards of each agent after reaching a state include at least its own rewards and global rewards.
7. The method of claim 6, wherein the user agent's own rewards include at least no action penalty, a penalty for executing a new subtask when there are incomplete subtasks, and a user action reward.
8. The method of claim 7, wherein the user action rewards are determined based on whether the user agent performs actions that are favorable to completing the task.
9. The method of claim 6, wherein the return of the system agent itself is accumulated from at least one of: no punishment of action, punishment of corresponding assistance not provided immediately when the user agent requests assistance, and subtask completion rewards.
10. The method of claim 6, wherein the global rewards are accumulated from at least one of efficiency loss penalties, subtask completion rewards, and all task completion rewards.
11. A method as claimed in claim 8 or 10, wherein the user action rewards or all task completion rewards are calculated after all actions have been terminated.
12. The method of claim 3, wherein the rewards for a state of each agent are calculated by the corresponding value network.
13. The method of claim 12, wherein the rewards of different categories are calculated by different value networks, respectively.
14. The method of claim 13, wherein, during policy learning of the multi-agent, the value network is configured to optimally update with the objective of minimizing variance between rewards predicted using a preset method and rewards calculated by the value network.
15. The method of claim 14, wherein the predetermined method is configured to predict a return based on a state of an agent.
16. The method of claim 15, wherein the predetermined method is a temporal-difference algorithm.
17. The method of claim 14, wherein introducing a target network optimizes the value network to make the training process more stable.
18. A multi-agent action strategy learning device, comprising a plurality of agents and a plurality of value networks, wherein the plurality of agents sample corresponding actions according to respective initial action strategies;
the value networks respectively estimate advantages obtained after the multi-agent executes corresponding actions;
the plurality of agents update their respective action policies based on the advantages obtained after executing the respective actions, such that each of the updated action policies enables the respective agents to obtain a higher return.
19. The apparatus of claim 18, wherein the apparatus further comprises:
and the pre-training module is configured to perform pre-training by using the real action data to obtain respective initial action strategies of the multiple agents.
20. The apparatus of claim 19, wherein the pre-training module is configured to pre-train based on real action data using a weighted logistic regression method to obtain respective initial action strategies for the multiple agents.
21. The apparatus of claim 18, wherein the apparatus further comprises a plurality of status encoding modules configured to obtain new statuses reached by the respective agents after performing the respective actions, respectively;
The value networks are further configured to calculate a current state of each agent and a return after reaching a new state, respectively; and calculating the advantages obtained after each agent executes the corresponding actions according to the current state of each agent and the return after the new state is reached.
22. The apparatus of claim 18, wherein the multi-agent comprises a user agent configured to target completion of a task and a system agent configured to assist the user agent in completing the task.
23. The apparatus of claim 22, wherein the rewards of each agent after reaching a state include at least its own rewards and global rewards.
24. The apparatus of claim 23, wherein the user agent's own rewards include at least no-action penalties, penalties to perform new subtasks when there are incomplete subtasks, and user action rewards.
25. The apparatus of claim 24, wherein the user action rewards are based on a determination of whether the user agent performs an action that is favorable to completing the task.
26. The apparatus of claim 23, wherein the return of the system agent itself is accumulated from at least one of: no punishment of action, punishment of corresponding assistance not provided immediately when the user agent requests assistance, and subtask completion rewards.
27. The apparatus of claim 23, wherein the global rewards are accumulated from at least one of an efficiency loss penalty, a subtask completion reward, and a full task completion reward.
28. Apparatus according to claim 25 or 27 wherein the user action rewards or all task completion rewards are calculated after termination of an action.
29. The apparatus of claim 20, wherein the rewards for a state of each agent are calculated by a corresponding value network.
30. The apparatus of claim 29, wherein the rewards of different categories are calculated by different value networks, respectively.
31. The apparatus of claim 30, wherein, during policy learning of the multi-agent, the value network is configured to optimally update with the objective of minimizing variance between rewards predicted using a preset method and rewards calculated by the value network.
32. The apparatus of claim 31, wherein the predetermined method is configured to predict a return based on a state of an agent.
33. The apparatus of claim 32, wherein the predetermined method is a temporal-difference algorithm.
34. The apparatus of claim 31, wherein introducing a target network optimizes the value network to make the training process more stable.
35. A medium having a computer program stored thereon, characterized by: the computer program, when executed by a processor, implements the method according to any one of the schemes 1-17.
36. A computing device, characterized by: the computer device comprises a processor for implementing the method according to any of the claims 1-17 when executing a computer program stored in a memory.

Claims (36)

1. A multi-agent action strategy learning method, comprising:
the multi-agent samples corresponding actions according to respective initial action strategies;
respectively estimating advantages obtained after the multiple agents execute corresponding actions;
updating action strategies of each agent based on advantages obtained after the multi-agent executes corresponding actions, so that each updated action strategy can enable the corresponding agent to obtain higher return;
wherein the action strategies of the agents comprise a system strategy and a user strategy; and updating the action strategies of each agent based on the advantages obtained after the multiple agents execute the corresponding actions comprises: updating the system policy and the user policy of each agent in the following manner:
the policy gradients of the user policy and the system policy of each agent are obtained based on the following formulas; and
updating the parameter θ and the parameter w of each agent based on the policy gradients to obtain an optimal action strategy for each agent, wherein the system policy π_θ is parameterized by θ and the user policy μ_w is parameterized by w.
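The policy-gradient formulas referenced in the claim above are rendered as images in the source publication and do not survive text extraction. As a minimal sketch only, assuming a standard advantage-based policy gradient for the two cooperating policies (the advantage terms A^S and A^U and the state/action superscripts are illustrative assumptions, not taken from the claims):

\nabla_{\theta} J \approx \mathbb{E}\left[\nabla_{\theta} \log \pi_{\theta}(a^{S} \mid s^{S}) \, A^{S}(s^{S}, a^{S})\right]
\nabla_{w} J \approx \mathbb{E}\left[\nabla_{w} \log \mu_{w}(a^{U} \mid s^{U}) \, A^{U}(s^{U}, a^{U})\right]

Here A^S and A^U would be the advantages estimated by the value networks for the system agent and the user agent, consistent with the parameterization π_θ and μ_w stated in the claim.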
2. The method of claim 1, wherein, before the multiple agents sample corresponding actions according to their respective initial action strategies, the method further comprises:
pre-training using real action data to obtain the respective initial action strategies of the multiple agents.
3. The method of claim 2, wherein the pre-training is performed based on real action data using a weighted logistic regression method to obtain respective initial action strategies for the multiple agents.
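As a minimal sketch of the weighted logistic-regression pre-training named in claims 2-3, assuming each initial action strategy is a multi-label classifier over a discrete action space trained on real (state, action) pairs; the PyTorch framework, the network shape, and the weight value are illustrative assumptions, not specified by the claims:

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps an encoded state vector to logits over a multi-label action space."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state):
        return self.net(state)

def pretrain(policy, states, actions, pos_weight=4.0, epochs=10, lr=1e-3):
    """Weighted logistic-regression pre-training on real (state, action) data.
    `actions` is a float multi-hot tensor; pos_weight up-weights the sparse positive labels."""
    criterion = nn.BCEWithLogitsLoss(
        pos_weight=torch.full((actions.size(1),), pos_weight))
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(policy(states), actions)
        loss.backward()
        optimizer.step()
    return policy

A policy pre-trained this way would serve as one agent's initial action strategy before any policy-gradient updates are applied.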
4. The method of claim 1, wherein estimating the advantages obtained after the multi-agent performs the respective actions, respectively, comprises:
respectively obtaining new states reached by each intelligent agent after corresponding actions are executed;
respectively calculating the return of the current state of each agent and the return after the new state is reached;
and calculating the advantage obtained after each agent executes the corresponding action according to the return of the current state of each agent and the return after the new state is reached.
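As a minimal sketch of the advantage estimation described in claim 4, assuming the common one-step form that bootstraps the return of the new state through a learned value network (the discount factor and the terminal-state handling are illustrative assumptions, not stated in the claim):

import torch

def advantage(value_net, reward, state, next_state, gamma=0.99, done=False):
    """Advantage = (reward + discounted return of the new state) - return of the current state."""
    with torch.no_grad():
        v_current = value_net(state)      # return estimated for the current state
        v_next = value_net(next_state)    # return estimated for the new state reached
    target = reward + gamma * v_next * (0.0 if done else 1.0)
    return target - v_current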
5. The method of claim 1, wherein the multi-agent comprises a user agent configured to target completion of a task and a system agent configured to assist the user agent in completing the task.
6. The method of claim 5, wherein the rewards of each agent after reaching a state include at least its own rewards and global rewards.
7. The method of claim 6, wherein the user agent's own rewards include at least no-action penalties, penalties for executing new subtasks when there are incomplete subtasks, and user action rewards.
8. The method of claim 7, wherein the user action rewards are determined based on whether the user agent performs an action that is favorable to completing the task.
9. The method of claim 6, wherein the return of the system agent itself is accumulated from at least one of: no punishment of action, punishment of corresponding assistance not provided immediately when the user agent requests assistance, and subtask completion rewards.
10. The method of claim 6, wherein the global rewards are accumulated from at least one of: an efficiency loss penalty, a subtask completion reward, and a full task completion reward.
11. The method of claim 8 or 10, wherein the user action reward or the full task completion reward is calculated after all actions have terminated.
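As a minimal sketch of the reward decomposition laid out in claims 6-11, with one role-specific reward per agent plus a shared global reward; every numeric value and flag name below is an illustrative assumption, not taken from the claims:

def user_reward(no_action, new_subtask_while_pending, task_done, terminal):
    r = 0.0
    if no_action:
        r -= 1.0      # no-action penalty
    if new_subtask_while_pending:
        r -= 1.0      # penalty for executing a new subtask while one is unfinished
    if terminal and task_done:
        r += 5.0      # user action reward, granted only after all actions terminate
    return r

def system_reward(no_action, ignored_user_request, subtask_done):
    r = 0.0
    if no_action:
        r -= 1.0      # no-action penalty
    if ignored_user_request:
        r -= 1.0      # assistance not provided immediately when requested
    if subtask_done:
        r += 2.0      # subtask completion reward
    return r

def global_reward(turn_count, subtask_done, all_tasks_done, terminal):
    r = -0.1 * turn_count   # efficiency loss penalty
    if subtask_done:
        r += 2.0            # subtask completion reward
    if terminal and all_tasks_done:
        r += 10.0           # full task completion reward, after all actions terminate
    return r

Each agent's return would then accumulate its own reward plus the global reward over the interaction, with the terminal components added only after all actions have terminated, as claim 11 states.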
12. The method of claim 3, wherein the return of a certain state of each agent is calculated by the corresponding value network.
13. The method of claim 12, wherein the rewards of different categories are calculated by different value networks, respectively.
14. The method of claim 13, wherein, during policy learning of the multi-agent, the value network is optimized and updated with the objective of minimizing the variance between the return predicted using a preset method and the return calculated by the value network.
15. The method of claim 14, wherein the preset method predicts a return based on a state of an agent.
16. The method of claim 15, wherein the preset method is a temporal-difference algorithm.
17. The method of claim 14, wherein a target network is introduced to optimize the value network so as to make the training process more stable.
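As a minimal sketch of the value-network update in claims 14-17, assuming a mean-squared temporal-difference error and a periodically synchronized target network; the synchronization interval, optimizer, and batch layout are illustrative assumptions, not specified by the claims:

import torch
import torch.nn as nn

def update_value_net(value_net, target_net, optimizer, batch, gamma=0.99,
                     sync_every=100, step=0):
    """Minimize the squared gap between the TD-predicted return and the value network's estimate."""
    states, rewards, next_states, dones = batch
    with torch.no_grad():
        # temporal-difference target computed with the frozen target network (claims 16-17)
        td_target = rewards + gamma * target_net(next_states).squeeze(-1) * (1.0 - dones)
    loss = nn.functional.mse_loss(value_net(states).squeeze(-1), td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % sync_every == 0:
        # periodic hard copy keeps the regression target stable during training
        target_net.load_state_dict(value_net.state_dict())
    return loss.item()

In practice the target network can be created once as copy.deepcopy(value_net) and refreshed on the schedule above, which is what makes the training process more stable as claim 17 notes.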
18. A multi-agent action strategy learning apparatus, comprising a plurality of agents and a plurality of value networks, wherein the plurality of agents sample corresponding actions according to respective initial action strategies;
the value networks respectively estimate advantages obtained after the multi-agent executes corresponding actions;
the plurality of agents update respective action strategies based on advantages obtained after executing corresponding actions, so that each updated action strategy can enable the corresponding agent to obtain higher returns;
wherein the action strategies of each of the plurality of agents comprise a system strategy and a user strategy; and the plurality of agents updating their respective action strategies based on the advantages obtained after performing the corresponding actions comprises: updating the system policy and the user policy of each agent in the following manner:
the policy gradients of the user policy and the system policy of each agent are obtained based on the following formulas; and
updating the parameter θ and the parameter w of each agent based on the policy gradients to obtain an optimal action strategy for each agent, wherein the system policy π_θ is parameterized by θ and the user policy μ_w is parameterized by w.
19. The apparatus of claim 18, wherein the apparatus further comprises:
a pre-training module configured to perform pre-training using real action data to obtain the respective initial action strategies of the multiple agents.
20. The apparatus of claim 19, wherein the pre-training module is configured to pre-train based on real action data using a weighted logistic regression method to obtain the respective initial action policies of the multiple agents.
21. The apparatus of claim 18, wherein the apparatus further comprises a plurality of state encoding modules configured to respectively obtain the new states reached by the agents after performing the corresponding actions;
The value networks are further configured to respectively calculate the return of the current state of each agent and the return after the new state is reached, and to calculate the advantage obtained after each agent executes the corresponding action according to the return of the current state of each agent and the return after the new state is reached.
22. The apparatus of claim 18, wherein the multi-agent comprises a user agent configured to target completion of a task and a system agent configured to assist the user agent in completing the task.
23. The apparatus of claim 22, wherein the rewards of each agent after reaching a state include at least its own rewards and global rewards.
24. The apparatus of claim 23, wherein the rewards of the user agent itself include at least a penalty for no action, a penalty for executing a new subtask when there are unfinished subtasks, and a user action reward.
25. The apparatus of claim 24, wherein the user action rewards are determined based on whether the user agent performs an action that is favorable to completing the task.
26. The apparatus of claim 23, wherein the return of the system agent itself is accumulated from at least one of: no punishment of action, punishment of corresponding assistance not provided immediately when the user agent requests assistance, and subtask completion rewards.
27. The apparatus of claim 23, wherein the global rewards are accumulated from at least one of: an efficiency loss penalty, a subtask completion reward, and a full task completion reward.
28. The apparatus of claim 25 or 27, wherein the user action reward or the full task completion reward is calculated after all actions have terminated.
29. The apparatus of claim 20, wherein the return for a state of each agent is calculated by a corresponding value network.
30. The apparatus of claim 29, wherein the rewards of different categories are calculated by different value networks, respectively.
31. The apparatus of claim 30, wherein, during policy learning of the multi-agent, the value network is optimized and updated with the objective of minimizing the variance between the return predicted using a preset method and the return calculated by the value network.
32. The apparatus of claim 31, wherein the preset method predicts a return based on a state of an agent.
33. The apparatus of claim 32, wherein the preset method is a temporal-difference algorithm.
34. The apparatus of claim 31, wherein a target network is introduced to optimize the value network so as to make the training process more stable.
35. A medium having a computer program stored thereon, characterized by: the computer program, when executed by a processor, implements the method of any one of claims 1-17.
36. A computing device, characterized by: the computing device comprises a processor configured to implement the method of any one of claims 1-17 when executing a computer program stored in a memory.
CN202010072011.6A 2020-01-21 2020-01-21 Multi-agent action strategy learning method, device, medium and computing equipment Active CN111309880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010072011.6A CN111309880B (en) 2020-01-21 2020-01-21 Multi-agent action strategy learning method, device, medium and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010072011.6A CN111309880B (en) 2020-01-21 2020-01-21 Multi-agent action strategy learning method, device, medium and computing equipment

Publications (2)

Publication Number Publication Date
CN111309880A CN111309880A (en) 2020-06-19
CN111309880B true CN111309880B (en) 2023-11-10

Family

ID=71158208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010072011.6A Active CN111309880B (en) 2020-01-21 2020-01-21 Multi-agent action strategy learning method, device, medium and computing equipment

Country Status (1)

Country Link
CN (1) CN111309880B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723189B (en) * 2020-06-23 2021-11-16 贝壳找房(北京)科技有限公司 Interactive question and answer prompting method and device, storage medium and electronic equipment
CN112036578B (en) * 2020-09-01 2023-06-27 成都数字天空科技有限公司 Intelligent body training method and device, storage medium and electronic equipment
CN112507104B (en) * 2020-12-18 2022-07-22 北京百度网讯科技有限公司 Dialog system acquisition method, apparatus, storage medium and computer program product
CN113268352B (en) * 2021-06-11 2024-03-08 中科院软件研究所南京软件技术研究院 Multi-instruction responsive task collaborative management method for universal service robot
CN117709806A (en) * 2024-02-05 2024-03-15 慧新全智工业互联科技(青岛)有限公司 Cooperative multi-equipment abnormality automatic detection method and detection system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108600379A (en) * 2018-04-28 2018-09-28 中国科学院软件研究所 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient
CN109670270A (en) * 2019-01-11 2019-04-23 山东师范大学 Crowd evacuation emulation method and system based on the study of multiple agent deeply
CN110141867A (en) * 2019-04-23 2019-08-20 广州多益网络股份有限公司 A kind of game intelligence body training method and device
WO2019207826A1 (en) * 2018-04-26 2019-10-31 株式会社Daisy Device, method, and program for information processing performed by multi-agent
CN110427006A (en) * 2019-08-22 2019-11-08 齐鲁工业大学 A kind of multi-agent cooperative control system and method for process industry

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6875513B2 (en) * 2016-10-10 2021-05-26 ディープマインド テクノロジーズ リミテッド Neural network for selecting actions to be performed by robot agents

Also Published As

Publication number Publication date
CN111309880A (en) 2020-06-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant