CN118051780A - Training method, interaction method and corresponding system of an agent - Google Patents

Training method, interaction method and corresponding system of an agent

Info

Publication number
CN118051780A
Authority
CN
China
Prior art keywords
agent
test
policy
strategy
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410444184.4A
Other languages
Chinese (zh)
Inventor
倪晚成
赵晓楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202410444184.4A priority Critical patent/CN118051780A/en
Publication of CN118051780A publication Critical patent/CN118051780A/en
Pending legal-status Critical Current


Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure provides a training method, an interaction method and a corresponding system of an agent. The training method of an agent suitable for a human-computer interaction scenario comprises the following steps: sampling a plurality of strategies from the strategy space of the agent as test strategies of the current round of training; testing each test strategy in a human-computer interaction task environment to obtain test results of each test strategy on m test tasks; calculating objective evaluation data for evaluating the performance of each test strategy in each test task based on the test results; outputting the test results to a user, and receiving subjective evaluation data of the user on the performance of each test strategy in each test task; and updating the agent based on the objective evaluation data and the subjective evaluation data. According to the exemplary embodiments of the disclosure, human subjective experience and the objective ability of the agent are jointly considered in evaluating and training the agent, so that the trained agent is both highly capable and well received by humans.

Description

Training method, interaction method and corresponding system of an agent
Technical Field
The present disclosure relates generally to the field of artificial intelligence, and more particularly, to a training method, an interaction method, and a corresponding system for an agent.
Background
Evaluation is a key driving force in advancing machine intelligence. Existing evaluation methods for agents (game strategy models) mostly rely on objective scores obtained by the agents. However, in a human-computer interaction scenario there is information interaction between the agent and the human, and the agent with the highest objective evaluation score may not be the agent that gives humans the best experience. Therefore, an evaluation method for agents suited to human-computer interaction scenarios is needed in order to train agents that are both capable and approved by humans.
Disclosure of Invention
The exemplary embodiments of the present disclosure provide an agent training method, an interaction method, and a corresponding system, which are oriented to a human-computer interaction application scenario, and provide an agent training method that combines subjective feeling of human beings and objective ability of agents, so as to train out an agent that has strong ability and is approved by human beings.
According to a first aspect of embodiments of the present disclosure, there is provided a training method of an agent suitable for a human-computer interaction scenario, the training method including: sampling a plurality of strategies from a strategy space of the intelligent agent as test strategies for the training of the round, wherein the strategy space comprises n strategies, and n is an integer greater than 1; testing each test strategy in a human-computer interaction task environment to obtain test results of each test strategy on m test tasks, wherein m is an integer greater than 0; calculating objective evaluation data for evaluating the performance of each test strategy in each test task based on the test result; outputting the test result to a user, and receiving subjective evaluation data of the user on the performance of each test strategy in each test task; updating the agent based on the objective evaluation data and the subjective evaluation data, and determining whether to continue updating the agent; under the condition that the update of the intelligent agent is stopped, taking the intelligent agent which is completed to update as a target intelligent agent which is finally trained, wherein the target intelligent agent is used for carrying out information interaction with human beings; wherein, in case it is determined to continue updating the agent, the step of sampling a plurality of policies from the agent's policy space as test policies for the present round of training is returned to be performed to start a new round of training.
Optionally, the step of updating the agent based on the objective assessment data and the subjective assessment data comprises: generating an antisymmetric matrix based on the objective evaluation data and the subjective evaluation data; calculating a loss function of the agent based on the antisymmetric matrix; updating a vector r based on the loss function, wherein r is an n-dimensional vector, and the i-th element r_i of r represents the comprehensive ability evaluation result of the i-th strategy; updating the ranking of the n strategies based on the updated vector r, wherein the better the comprehensive ability evaluation result of a strategy is, the higher the ranking of the strategy is; wherein, in the target agent, the higher the ranking of a strategy, the higher the probability that the strategy is selected.
Optionally, the step of generating an antisymmetric matrix based on the objective evaluation data and the subjective evaluation data comprises: converting the objective evaluation data into a first probability evaluation matrix, and generating a first antisymmetric matrix based on the first probability evaluation matrix; converting the subjective evaluation data into a second probability evaluation matrix, and generating a second antisymmetric matrix based on the second probability evaluation matrix; carrying out weighted summation on the first antisymmetric matrix and the second antisymmetric matrix to obtain the antisymmetric matrix;
wherein the element p1_ij located in the i-th row and the j-th column of the first probability evaluation matrix indicates the probability that the objective evaluation data obtained by the i-th policy is not weaker than that of the j-th policy, and p1_ij + p1_ji = 1; the element p2_ij located in the i-th row and the j-th column of the second probability evaluation matrix indicates the probability that the subjective evaluation data obtained by the i-th policy is not weaker than that of the j-th policy, and p2_ij + p2_ji = 1.
Optionally, the element y1_ij located in the i-th row and the j-th column of the first antisymmetric matrix is determined from p1_ij; the element y2_ij located in the i-th row and the j-th column of the second antisymmetric matrix is determined from p2_ij.
Optionally, the objective evaluation data is an objective evaluation score, and the subjective evaluation data is a subjective evaluation score;
wherein p^z_ij is obtained by comparing, over the test tasks, the evaluation scores of the i-th policy and the j-th policy;
wherein the superscript z takes the values 1 and 2, s^1_{i,l} indicates the objective evaluation score obtained by the i-th policy on the l-th test task, s^1_{j,l} indicates the objective evaluation score obtained by the j-th policy on the l-th test task, s^2_{i,l} indicates the subjective evaluation score obtained by the i-th policy on the l-th test task, s^2_{j,l} indicates the subjective evaluation score obtained by the j-th policy on the l-th test task, and l is an integer greater than 0 and less than or equal to m.
Optionally, the step of calculating the loss function of the agent based on the antisymmetric matrix comprises: the loss function is calculated based on a combined gradient on the vector r, the antisymmetric matrix, and a low rank approximation of the combined rotation of the antisymmetric matrix.
Optionally, based on the loss function, the step of updating the vector r includes: updating a vector r and a matrix C based on the loss function;
Wherein the loss function L(r, C) is: L(r, C) = || grad(r) + rot_k(C) - Y ||^2, wherein grad(r) represents the combined gradient on the vector r, rot_k(C) represents a low-rank approximation of the combined rotation of the antisymmetric matrix Y, parameterized by the matrix C, and k is a hyperparameter.
According to a second aspect of embodiments of the present disclosure, there is provided an interaction method of an agent suitable for a human-computer interaction scenario, the interaction method including: acquiring information transmitted by a user from the current human-computer interaction task environment; selecting a target strategy from a strategy space of the target intelligent agent; making a decision based on the information using the target policy; executing corresponding actions in the man-machine interaction task environment according to the decision so as to change feedback information of the man-machine interaction task environment to the user; wherein the target agent is trained by performing the training method as described above.
According to a third aspect of embodiments of the present disclosure, there is provided a training system for an agent adapted for use in a human-machine interaction scenario, the training system comprising: a policy sampling unit configured to sample a plurality of policies from a policy space of the agent as test policies of the present training, wherein the policy space includes n policies, n being an integer greater than 1; the testing unit is configured to test each testing strategy in a human-computer interaction task environment to obtain testing results of each testing strategy on m testing tasks, wherein m is an integer greater than 0; an objective evaluation unit configured to calculate objective evaluation data for evaluating the performance of each test strategy in each test task based on the test result; the subjective evaluation unit is configured to output the test result to a user and receive subjective evaluation data of the user on the performance of each test strategy in each test task; an updating unit configured to update the agent based on the objective evaluation data and the subjective evaluation data, and determine whether to continue updating the agent, wherein in a case where it is determined to stop continuing updating the agent, the agent for which updating has been completed is taken as a final trained target agent for information interaction with a human being; wherein the training system starts a new round of training if it is determined to continue updating the agent.
According to a fourth aspect of embodiments of the present disclosure, there is provided an interaction system for an agent adapted for use in a human-machine interaction scenario, the interaction system comprising: the information acquisition unit is configured to acquire information transmitted by a user from the current human-computer interaction task environment; a policy selection unit configured to select a target policy from a policy space of the target agent; a decision unit configured to make a decision based on the information using the target policy; the action execution unit is configured to execute corresponding actions in the man-machine interaction task environment according to the decision so as to change feedback information of the man-machine interaction task environment to the user; wherein the target agent is trained by performing the training method as described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions that, when executed by a processor of an electronic device, enable the electronic device to perform a training method and/or an interaction method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided an electronic device including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the training method and/or the interaction method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer executable instructions which, when executed by at least one processor, implement the training method and/or the interaction method as described above.
According to the training method, the interaction method and the corresponding system of the intelligent agent, subjective feeling of human and objective ability of the intelligent agent are comprehensively considered to complete intelligent agent assessment and training, so that the intelligent agent obtained through assessment training is strong in ability and approved by human, and user experience is improved.
In the following description, some aspects and/or advantages of the present general inventive concept will be set forth, and still others will be apparent from the following description or the practice of the present general inventive concept.
Drawings
These and/or other aspects and advantages of the present application will become more apparent and more readily appreciated from the following detailed description of the embodiments of the application, taken in conjunction with the accompanying drawings, wherein:
FIG. 1 illustrates a flow chart of a training method of an agent according to an exemplary embodiment of the present disclosure;
FIG. 2 illustrates a flowchart of a method of updating an agent based on objective assessment data and subjective assessment data according to an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of a training method of an agent according to another exemplary embodiment of the present disclosure;
FIG. 4 illustrates a flow chart of a method of interaction of an agent in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a block diagram of a training system of an agent according to an exemplary embodiment of the present disclosure;
FIG. 6 illustrates a block diagram of a training system of an agent according to another exemplary embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of an interactive system of an agent according to an exemplary embodiment of the present disclosure;
fig. 8 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments will be described below in order to explain the present disclosure by referring to the figures.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in this disclosure, "at least one of the items" covers three parallel cases: "any one of the items", "any combination of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "at least one of step one and step two is executed" covers the following three parallel cases: (1) executing step one; (2) executing step two; (3) executing step one and step two.
The present disclosure contemplates the following. In a human-computer interaction scenario, there is information interaction between the agent and the human; the agent ranked highest by objective evaluation score may not be the agent ranked highest in terms of human experience. That is, an agent trained with only objective evaluation scores in mind may not be welcomed in a human-computer interaction scenario; in other words, the result of an objective evaluation method may be inconsistent with the subjective experience of humans, and an agent with a high objective evaluation score is not necessarily the agent humans hope for in a human-computer interaction scenario. On the other hand, purely subjective evaluation methods are overly subjective and may lead the agent to merely cater to humans. Existing subjective evaluation methods mainly obtain evaluation data through human scoring and determine rankings from that data, so the accuracy of the raw data is difficult to control: if the collected scoring data is wrong, the evaluation result is wrong; and even if the raw data is correct, training an agent with such an evaluation method may make the agent care only about human perception, i.e., the agent merely caters to humans.
Therefore, the present disclosure proposes to jointly consider human subjective experience and the objective ability of the agent in human-computer interaction scenarios to complete agent evaluation and training, so that the agent obtained through evaluation and training is both highly capable and accepted by humans. This is described in detail below with reference to fig. 1 to 8.
Fig. 1 illustrates a flowchart of a training method of an agent according to an exemplary embodiment of the present disclosure.
The intelligent agent is suitable for a human-computer interaction scene, wherein the human-computer interaction scene refers to information interaction between the intelligent agent and a human being, and the intelligent agent can adjust own actions according to information transmitted by the human being so as to feed back the information to the human being. It should be appreciated that the manner in which the human-machine interaction may include, but is not limited to, language interaction, and may include other types of interaction, such as interaction with a human by way of a labeled game map, or the like.
The agent is used for information interaction with humans, and types of agents may include, but are not limited to: dialog-type agents, game AI agents (e.g., NPC game AI agents), it should be understood that other types of agents capable of interacting with humans may also be included.
Referring to fig. 1, in step S101, a plurality of policies are sampled from the agent' S policy space as test policies for the present round of training.
The policy space of the agent includes n policies, n being an integer greater than 1.
As an exemplary embodiment, random sampling or prioritized sampling may be performed. With regard to prioritized sampling, for example, multiple policies may be sampled from the policy space as test policies for the present round of training based on current policy ranking results and/or information that a user may want to communicate in a human-machine interaction task environment.
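A minimal sketch of this sampling step, assuming policies are indexed 0..n-1 and that a best-first ranking list is available; the rank-weighted exponential weighting and the temperature parameter below are illustrative assumptions, not prescribed by the disclosure:

```python
import math
import random

def sample_test_policies(n, num_samples, ranking=None, temperature=1.0):
    """Sample policy indices for one round of training.

    ranking: optional list of policy indices ordered best-first; if given,
    better-ranked policies are sampled with higher probability."""
    if ranking is None:
        return random.sample(range(n), num_samples)           # uniform random sampling
    position = {p: idx for idx, p in enumerate(ranking)}      # rank position, 0 = best
    weights = [math.exp(-position[i] / temperature) for i in range(n)]
    chosen, candidates = [], list(range(n))
    for _ in range(num_samples):                              # weighted, without replacement
        total = sum(weights[c] for c in candidates)
        threshold, acc = random.uniform(0, total), 0.0
        for c in candidates:
            acc += weights[c]
            if acc >= threshold:
                chosen.append(c)
                candidates.remove(c)
                break
    return chosen
```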
In step S102, each test policy is tested in a human-computer interaction task environment, so as to obtain test results of each test policy on m test tasks. m is an integer greater than 0.
As an exemplary embodiment, the test results of the test strategy on the test tasks may include: the test strategy is directed to actions performed in a human-computer interaction task environment for the test task decision.
In step S103, objective evaluation data for evaluating the performance of each test strategy in each test task is calculated based on the test results.
As an exemplary embodiment, various suitable objective assessment algorithms may be employed to calculate objective assessment data for each test strategy, which is not limiting of the present disclosure. For example, objective evaluation data may be obtained by computing the cumulative sum of rewards for the actions performed by a test strategy in the task environment for a test task.
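As an illustration only (the per-step reward list and the function name below are assumptions, not part of the patent text), such a cumulative-reward score could be computed as:

```python
def objective_score(rewards):
    """Objective evaluation score for one test task, taken here as the
    cumulative sum of per-step rewards collected while the test strategy
    acted in the task environment (an assumed, simple reward aggregation)."""
    return sum(rewards)

# e.g. per-step rewards recorded during one test episode
print(objective_score([0.0, 1.0, -0.5, 2.0]))  # 2.5
```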
As an exemplary embodiment, the form of objective assessment data may include, but is not limited to, at least one of the following: the data in the three forms of score, ordering and probability can be represented by a matrix.
Regarding the score form, a score matrix S may be used, where m is the total number of test tasks, n is the total number of policies of the agent, and the element s_{i,l} of the score matrix indicates the score obtained by the i-th test strategy on the l-th test task.
Regarding the ranking form, a ranking matrix O may be used, where the element o_{i,l} indicates the ranking obtained by the i-th test strategy on the l-th test task. The ranking may be global or local; a local ranking is obtained by comparing only two strategies at a time and needs to be converted into a ranking over all strategies (i.e., a global ranking).
Regarding the probability form, a probability matrix P may be used, where the element p_ij indicates the probability that the i-th policy is not weaker than the j-th policy.
In step S104, a test result is output to the user, and subjective evaluation data of the user' S performance of each test policy in each test task is received.
Specifically, subjective evaluation data is received that is obtained by the user subjectively evaluating, based on the test results, the performance of each test strategy in each test task.
As an exemplary embodiment, some subjective evaluation indexes (i.e., evaluation indexes related to subjective feelings of the user) may be preset according to a human-computer interaction scene to which the agent is specifically applicable and actual requirements of the agent, and the user may score, rank, etc. the performance of each test policy in each test task according to these subjective evaluation indexes.
As an exemplary embodiment, the form of subjective assessment data may include, but is not limited to, at least one of the following: the data in the three forms of score, ordering and probability can be represented by a matrix.
In step S105, the agent is updated based on the objective evaluation data and the subjective evaluation data.
After step S105, step S106 is performed to determine whether to continue updating the agent.
An exemplary embodiment of updating an agent based on objective evaluation data and subjective evaluation data will be described below in conjunction with fig. 2, which is not yet developed.
In the case where it is determined in step S106 that the update of the agent is stopped, step S107 is executed, and the agent for which the update has been completed is taken as the final trained target agent.
In the case where it is determined in step S106 that the update of the agent is continued, step S101 is returned to be executed to start a new round of training, in other words, steps S101 to S106 are repeatedly executed until the update of the agent is stopped.
As an exemplary embodiment, it may be determined to stop updating the agent when the effect of the agent meets a preset requirement or the total number of training rounds reaches an upper limit.
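A minimal sketch of the overall loop S101-S107; each argument is a callable standing in for one of the steps described above (these callables and their signatures are assumptions, not interfaces defined by the disclosure):

```python
def train_agent(sample_policies, run_tests, objective_eval, collect_subjective_eval,
                update_agent, should_stop, max_rounds=100):
    """Round-based training: sample -> test -> evaluate -> update, repeated
    until the stopping condition holds (effect good enough or round limit)."""
    for round_idx in range(max_rounds):
        test_policies = sample_policies()                  # S101
        results = run_tests(test_policies)                 # S102
        objective = objective_eval(results)                # S103
        subjective = collect_subjective_eval(results)      # S104 (user feedback)
        update_agent(objective, subjective)                # S105
        if should_stop(round_idx):                         # S106: stop or start a new round
            break                                          # S107: current agent is the target agent
```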
Fig. 2 illustrates a flowchart of a method of updating an agent based on objective evaluation data and subjective evaluation data according to an exemplary embodiment of the present disclosure.
Referring to fig. 2, in step S201, an antisymmetric matrix is generated based on objective evaluation data and subjective evaluation data.
As an exemplary embodiment, step S201 may include: converting the objective evaluation data into a first probability evaluation matrix, and generating a first antisymmetric matrix based on the first probability evaluation matrix; converting the subjective evaluation data into a second probability evaluation matrix, and generating a second antisymmetric matrix based on the second probability evaluation matrix; and then, carrying out weighted summation on the first anti-symmetric matrix and the second anti-symmetric matrix to obtain an anti-symmetric matrix.
The element p1_ij in the i-th row and j-th column of the first probability evaluation matrix P1 indicates the probability that the objective evaluation data obtained by the i-th policy is not weaker than (i.e., its score or ranking is not lower than) that of the j-th policy, and p1_ij + p1_ji = 1.
The element p2_ij in the i-th row and j-th column of the second probability evaluation matrix P2 indicates the probability that the subjective evaluation data obtained by the i-th policy is not weaker than that of the j-th policy, and p2_ij + p2_ji = 1.
It should be understood that if the objective evaluation data and the subjective evaluation data are already in probability form, the objective evaluation data is directly used as the first probability evaluation matrix and the subjective evaluation data is directly used as the second probability evaluation matrix, without conversion.
The score or ranking form of the evaluation data may be converted into the probability form. As an exemplary embodiment, when the objective evaluation data is an objective evaluation score and the subjective evaluation data is a subjective evaluation score, the conversion is performed as follows:
p^z_ij is obtained by comparing, over the test tasks, the scores of the i-th policy and the j-th policy, where the superscript z takes the values 1 and 2, s^1_{i,l} indicates the objective evaluation score obtained by the i-th policy on the l-th test task, s^1_{j,l} indicates the objective evaluation score obtained by the j-th policy on the l-th test task, s^2_{i,l} indicates the subjective evaluation score obtained by the i-th policy on the l-th test task, s^2_{j,l} indicates the subjective evaluation score obtained by the j-th policy on the l-th test task, and l is an integer greater than 0 and less than or equal to m. Here, strategies that were not tested on the l-th test task contribute neither objective nor subjective evaluation scores for that task.
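One simple way to realize such a score-to-probability conversion (the tie-splitting rule below is an assumption for illustration) is to take p_ij as the fraction of test tasks on which the i-th policy scores at least as high as the j-th policy:

```python
import numpy as np

def scores_to_probability(score_matrix):
    """Convert an n x m score matrix into an n x n probability matrix P,
    where P[i, j] is the fraction of test tasks on which policy i scored at
    least as high as policy j (ties split evenly so P[i, j] + P[j, i] = 1)."""
    n, m = score_matrix.shape
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            wins = np.sum(score_matrix[i] > score_matrix[j])
            ties = np.sum(score_matrix[i] == score_matrix[j])
            P[i, j] = (wins + 0.5 * ties) / m
    return P

S = np.array([[3.0, 5.0, 2.0],
              [4.0, 1.0, 2.0]])   # 2 policies, 3 test tasks
print(scores_to_probability(S))
```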
As an exemplary embodiment, the element y1_ij in the i-th row and j-th column of the first antisymmetric matrix Y1 is obtained from p1_ij, and the element y2_ij in the i-th row and j-th column of the second antisymmetric matrix Y2 is obtained from p2_ij.
As an exemplary embodiment, the first and second antisymmetric matrices may be weighted and summed to obtain the antisymmetric matrix Y as Y = λ1·Y1 + λ2·Y2,
where λ1 represents the importance of the objective evaluation data, λ2 represents the importance of the subjective evaluation data, and λ1 + λ2 = 1. The antisymmetric matrix Y unifies the different types of evaluation data and contains the relations among the strategies. As an exemplary embodiment, λ1 and λ2 can be adjusted according to the requirements of the specific application and the experimental effect; for example, λ1 = λ2 = 0.5 may be used.
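A minimal sketch of this combination step, assuming for illustration that each antisymmetric matrix is built as y_ij = p_ij - p_ji (one common construction, not necessarily the one used here):

```python
import numpy as np

def build_antisymmetric(P):
    """Turn a probability matrix with P[i, j] + P[j, i] = 1 into an
    antisymmetric matrix Y with Y[i, j] = P[i, j] - P[j, i] = -Y[j, i]."""
    return P - P.T

def combine(P_obj, P_subj, lam_obj=0.5, lam_subj=0.5):
    """Weighted sum of the objective and subjective antisymmetric matrices."""
    assert abs(lam_obj + lam_subj - 1.0) < 1e-9
    return lam_obj * build_antisymmetric(P_obj) + lam_subj * build_antisymmetric(P_subj)
```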
In step S202, a loss function of the agent is calculated based on the antisymmetric matrix.
In step S203, the vector r is updated based on the loss function.
r is an n-dimensional vector, and the i-th element r_i of r represents the comprehensive ability evaluation result of the i-th strategy, i.e., r_i is used to represent the comprehensive ability of the i-th strategy.
In step S204, the ranking of the n policies is updated based on the updated vector r.
The better the comprehensive ability evaluation result of a strategy, the higher the ranking of the strategy; that is, the ranking of the i-th strategy is updated according to r_i, and the larger r_i is (the higher the comprehensive ability of the i-th strategy), the higher the ranking of the i-th strategy.
In the target agent, the higher the ranking of the policies, the higher the probability that the policy is selected.
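As an illustration, the ranking can be derived by sorting r, and selection can favor higher-ranked policies; the softmax-over-r selection below is one possible choice and is not prescribed by the disclosure:

```python
import numpy as np

def rank_policies(r):
    """Return policy indices sorted from best to worst comprehensive ability."""
    return list(np.argsort(-np.asarray(r)))

def select_policy(r, temperature=1.0, rng=None):
    """Softmax selection over the ability vector r: larger r_i -> higher probability."""
    if rng is None:
        rng = np.random.default_rng()
    r = np.asarray(r, dtype=float)
    logits = r / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(r), p=probs))
```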
As an example, step S202 and step S203 may be implemented using the combinatorial Hodge theorem (Combinatorial Hodge theory).
Combinatorial Hodge theory evolved from differential geometry. Given an undirected graph G = (V, E) with node set V and edge set E, each node represents an agent strategy, for a total of n nodes. A real-valued function is defined on the node set V; the function value corresponding to each node is collected in r, an n-dimensional vector. An edge flow is assigned to each edge in the graph: when (i, j) is not an edge of the graph, let Y_ij = 0; then Y, with Y_ij = -Y_ji, is an antisymmetric matrix that may represent pairwise ranking data between vertices.
Four operators are first defined. The combined gradient on r is defined as grad(r)_ij = r_i - r_j; a flow having this form is referred to as a gradient flow. The combined divergence operator on the edge flow Y is defined as div(Y)_i = Σ_j Y_ij; div(Y) is an n-dimensional vector representing the contribution of each vertex (source point) to Y. The combined curl (rotation) operator on Y is defined as curl(Y)_ijk = Y_ij + Y_jk + Y_ki; the curl operator maps the edge flow to a triangle flow.
Based on the four operators, the combinatorial Hodge theorem is as follows: under the standard inner product <A, B> = Σ_{i,j} A_ij·B_ij, the vector space of antisymmetric matrices admits an orthogonal decomposition.
That is, any antisymmetric matrix Y has the following decomposition: Y = grad(r) + rot(Y), where grad(r) is the gradient component and rot(Y) is the rotational (cyclic) component;
wherein the divergence is related to the transfer relationship between strategies, and the curl is related to the circulation relationship between strategies.
The relationship between agents can be divided into a transfer relationship and a cyclic relationship. For agents A, B and C, the transfer relationship means that if A is stronger than B and B is stronger than C, then A is stronger than C; the circulation relationship means that A is stronger than B and B is stronger than C, but C is stronger than A.
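A small numerical sketch of these operators, using the sign conventions given above (one standard choice) and a purely cyclic three-policy example:

```python
import numpy as np

def grad(r):
    """Combined gradient: grad(r)[i, j] = r_i - r_j (a gradient flow)."""
    r = np.asarray(r, dtype=float)
    return np.subtract.outer(r, r)

def div(Y):
    """Combined divergence: div(Y)[i] = sum_j Y[i, j], contribution of each vertex."""
    return Y.sum(axis=1)

def curl(Y, i, j, k):
    """Combined curl on the triangle (i, j, k): Y_ij + Y_jk + Y_ki."""
    return Y[i, j] + Y[j, k] + Y[k, i]

# A purely cyclic relation among three policies: every divergence is 0,
# but the curl on the triangle is non-zero.
Y = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])
print(div(Y))            # [0. 0. 0.]
print(curl(Y, 0, 1, 2))  # 3.0
```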
As an exemplary embodiment, step S202 may include: the loss function is calculated based on the combined gradient on vector r, the antisymmetric matrix, and a low-rank approximation of the combined rotation of the antisymmetric matrix.
As an exemplary embodiment, step S203 may include: based on the loss function, the vector r and the matrix C are updated.
As an exemplary embodiment, the loss function L(r, C) is:
L(r, C) = || grad(r) + rot_k(C) - Y ||^2,
where grad(r) represents the combined gradient on the vector r, and rot_k(C) represents a low-rank approximation of the combined rotation of the antisymmetric matrix Y, parameterized by the matrix C. Here k is a hyperparameter referring to the rank of the approximation; rot_k(C) represents the cyclic relationship between strategies, and the low-rank approximation of rot(Y) is added to the loss function for more accurate computation: the closer 2k is to n, the more accurate the computation result. For example, C may be randomly initialized.
As an example of an implementation of the method of the present invention, the combined gradient on r is defined in matrix form as grad(r) = r·1^T - 1·r^T,
where 1 denotes the n-dimensional all-ones vector, so that grad(r)_ij = r_i - r_j represents the gap between the i-th policy and the j-th policy, i.e., the transfer relationship.
As an exemplary embodiment, r and C can be updated by gradient descent. In the training of the t-th round, the i-th strategy and the j-th strategy are sampled for testing, and the corresponding element Y_ij is obtained; then
r_i^t = r_i^{t-1} - η·∂L/∂r_i, r_j^t = r_j^{t-1} - η·∂L/∂r_j, C_i^t = C_i^{t-1} - η·∂L/∂C_i, C_j^t = C_j^{t-1} - η·∂L/∂C_j,
where the above formulas are gradient descent formulas, ∂L/∂(·) represents the gradient of the loss function with respect to the corresponding variable, C_i represents the i-th row of C, C_j represents the j-th row of C, η is the learning rate, r_i^{t-1} and r_j^{t-1} are the elements of the vector r obtained from the previous round of training update, and C_i^{t-1} and C_j^{t-1} are the rows of the matrix C obtained from the previous round of training update.
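A sketch of this optimization under explicit assumptions: the low-rank curl term is parameterized as C J C^T with C of shape (n, 2k) and J a block-diagonal rotation matrix, and r and C are updated by full-batch gradient descent on the Frobenius loss rather than by the per-pair update above; none of these parameterization details are fixed by the text.

```python
import numpy as np

def make_J(k):
    """Block-diagonal matrix of k rotation blocks [[0, 1], [-1, 0]] (assumed form)."""
    J = np.zeros((2 * k, 2 * k))
    for b in range(k):
        J[2 * b, 2 * b + 1] = 1.0
        J[2 * b + 1, 2 * b] = -1.0
    return J

def loss_and_grads(r, C, Y, J):
    """L(r, C) = || grad(r) + C J C^T - Y ||_F^2 with grad(r)[i, j] = r_i - r_j.

    The residual E is antisymmetric, which gives the closed-form gradients
    dL/dr = 4 * row-sums of E  and  dL/dC = -4 * E @ C @ J."""
    G = np.subtract.outer(r, r)          # combined gradient of r
    E = G + C @ J @ C.T - Y              # reconstruction residual
    loss = np.sum(E ** 2)
    grad_r = 4.0 * E.sum(axis=1)
    grad_C = -4.0 * E @ C @ J
    return loss, grad_r, grad_C

def fit_ranking(Y, k=1, lr=0.01, steps=500, seed=0):
    """Gradient descent on r and C; returns the ability vector r.
    The learning rate may need tuning for larger matrices."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    r = np.zeros(n)
    C = 0.1 * rng.standard_normal((n, 2 * k))   # C is randomly initialized
    J = make_J(k)
    for _ in range(steps):
        _, gr, gC = loss_and_grads(r, C, Y, J)
        r -= lr * gr
        C -= lr * gC
    return r
```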
Fig. 3 illustrates a flowchart of a training method of an agent according to another exemplary embodiment of the present disclosure.
Referring to fig. 3, in step S301, an agent and a task environment are initialized. The agent has several strategies, which constitute the strategy space of the agent. The task environment refers to a human-computer interaction task environment, and can receive the operation of human users and intelligent agents on the environment and make corresponding changes.
In step S302, several test strategies are sampled in the strategy space, and objective scores (i.e., rewards) of each test strategy on different test tasks are collected.
In step S303, the test result of the test policy is presented to the human user through the visualization module, and subjective scores of the human user on each test policy are collected.
In step S304, the objective score and the subjective score are sent to the score data preprocessing module, and an antisymmetric matrix is obtained by summarizing and sent to the policy ranking module.
In step S305, the ranking result of each policy is determined according to the obtained antisymmetric matrix, and is fed back to the agent decision module for updating.
In step S306, S302-S305 are repeated until the update of the agent is stopped.
Compared with the prior art, the present disclosure integrates objective data and subjective data for evaluation, so that the objective ability score of the agent and the subjective feeling of humans are considered at the same time, which alleviates both the problem that objective evaluation deviates from human feeling and the problem that subjective evaluation is overly subjective. Meanwhile, the ranking results among strategies are determined based on the combinatorial Hodge theorem, so that the results have a strong mathematical basis and a clear theoretical explanation.
Fig. 4 shows a flowchart of an interaction method of an agent according to an exemplary embodiment of the present disclosure. The agent is a target agent trained using the training method as described in the above exemplary embodiments.
Referring to fig. 4, in step S401, information transferred by a user is acquired from a current human-computer interaction task environment.
In step S402, a target policy is selected from a policy space of the target agent.
As an exemplary embodiment, the target policy may be selected based on a ranking of policies in the policy space and/or information communicated by the acquired user.
In step S403, a decision is made based on the acquired information delivered by the user using the target policy.
In step S404, corresponding actions are performed in the human-machine interaction task environment (i.e., the environment is operated on) according to the decision to change feedback information from the human-machine interaction task environment to the user. That is, according to the decision, the target agent performs a corresponding action (e.g., returning an answer, or an operation completed by a game AI) on the human-computer interaction task environment, thereby changing the environment and realizing interaction with the human.
For example, operations such as marking may be performed on a game map (task environment), thereby changing the task environment according to the results of these operations.
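An illustrative interaction loop covering steps S401 to S404; the environment methods (get_user_message, apply_action) and the agent/policy methods (select_policy, decide) are assumed placeholders rather than interfaces defined by the disclosure:

```python
def interact(target_agent, env, max_turns=10):
    """S401-S404: read user info, pick a policy, decide, act on the environment."""
    for _ in range(max_turns):
        info = env.get_user_message()                 # S401: information from the user
        if info is None:
            break
        policy = target_agent.select_policy(info)     # S402: rank-aware policy selection
        action = policy.decide(info)                  # S403: decision based on the info
        env.apply_action(action)                      # S404: changes feedback to the user
```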
Fig. 5 shows a block diagram of a training system of an agent according to an exemplary embodiment of the present disclosure.
The intelligent agent is suitable for a human-computer interaction scene, wherein the human-computer interaction scene refers to information interaction between the intelligent agent and a human being, and the intelligent agent can adjust own actions according to information transmitted by the human being so as to feed back the information to the human being. It should be appreciated that the manner in which the human-machine interaction may include, but is not limited to, language interaction, and may include other types of interaction, such as interaction with a human by way of a labeled game map, or the like.
The agent is used for information interaction with humans, and types of agents may include, but are not limited to: dialog-type agents, game AI agents (e.g., NPC game AI agents), it should be understood that other types of agents capable of interacting with humans may also be included.
Referring to fig. 5, a training system of an agent according to an exemplary embodiment of the present disclosure includes: policy sampling unit 101, test unit 102, objective evaluation unit 103, subjective evaluation unit 104, update unit 105.
Specifically, the policy sampling unit 101 is configured to sample a plurality of policies from the policy space of the agent as test policies for the present round of training.
The policy space includes n policies, n being an integer greater than 1.
The test unit 102 is configured to test each test strategy in the human-computer interaction task environment, so as to obtain test results of each test strategy on m test tasks. m is an integer greater than 0.
The objective evaluation unit 103 is configured to calculate objective evaluation data for evaluating the performance of the test strategies in the test tasks based on the test results.
The subjective assessment unit 104 is configured to output the test result to a user, and receive subjective assessment data of the user's performance of the test strategies in the test tasks.
The updating unit 105 is configured to update the agent based on the objective evaluation data and the subjective evaluation data, and determine whether to continue updating the agent, wherein in a case where it is determined to stop continuing updating the agent, the agent for which updating has been completed is taken as a final trained target agent for information interaction with a human.
In the event that it is determined to continue updating the agent, the training system begins a new round of training.
As an exemplary embodiment, the updating unit 105 may be configured to: generating an antisymmetric matrix based on the objective evaluation data and the subjective evaluation data; calculating a loss function of the agent based on the antisymmetric matrix; updating a vector r based on the loss function, wherein r is an n-dimensional vector, and the i-th element r_i of r represents the comprehensive ability evaluation result of the i-th strategy; updating the ranking of the n strategies based on the updated vector r, wherein the better the comprehensive ability evaluation result of a strategy is, the higher the ranking of the strategy is; wherein, in the target agent, the higher the ranking of a strategy, the higher the probability that the strategy is selected.
As an exemplary embodiment, the updating unit 105 may be configured to: converting the objective evaluation data into a first probability evaluation matrix, and generating a first antisymmetric matrix based on the first probability evaluation matrix; converting the subjective evaluation data into a second probability evaluation matrix, and generating a second antisymmetric matrix based on the second probability evaluation matrix; carrying out weighted summation on the first antisymmetric matrix and the second antisymmetric matrix to obtain the antisymmetric matrix;
wherein the element p1_ij located in the i-th row and the j-th column of the first probability evaluation matrix indicates the probability that the objective evaluation data obtained by the i-th policy is not weaker than that of the j-th policy, and p1_ij + p1_ji = 1; the element p2_ij located in the i-th row and the j-th column of the second probability evaluation matrix indicates the probability that the subjective evaluation data obtained by the i-th policy is not weaker than that of the j-th policy, and p2_ij + p2_ji = 1.
As an exemplary embodiment, the element y1_ij located in the i-th row and the j-th column of the first antisymmetric matrix is determined from p1_ij; the element y2_ij located in the i-th row and the j-th column of the second antisymmetric matrix is determined from p2_ij.
As an exemplary embodiment, the objective evaluation data is an objective evaluation score, and the subjective evaluation data is a subjective evaluation score;
wherein p^z_ij is obtained by comparing, over the test tasks, the evaluation scores of the i-th policy and the j-th policy;
wherein the superscript z takes the values 1 and 2, s^1_{i,l} indicates the objective evaluation score obtained by the i-th policy on the l-th test task, s^1_{j,l} indicates the objective evaluation score obtained by the j-th policy on the l-th test task, s^2_{i,l} indicates the subjective evaluation score obtained by the i-th policy on the l-th test task, s^2_{j,l} indicates the subjective evaluation score obtained by the j-th policy on the l-th test task, and l is an integer greater than 0 and less than or equal to m.
As an exemplary embodiment, the updating unit 105 may be configured to: the loss function is calculated based on a combined gradient on the vector r, the antisymmetric matrix, and a low rank approximation of the combined rotation of the antisymmetric matrix.
As an exemplary embodiment, the updating unit 105 may be configured to: updating a vector r and a matrix C based on the loss function;
Wherein the loss function L(r, C) is: L(r, C) = || grad(r) + rot_k(C) - Y ||^2, wherein grad(r) represents the combined gradient on the vector r, rot_k(C) represents a low-rank approximation of the combined rotation of the antisymmetric matrix Y, parameterized by the matrix C, and k is a hyperparameter.
Fig. 6 illustrates a block diagram of a training system of an agent according to another exemplary embodiment of the present disclosure.
Referring to fig. 6, a training system of an agent according to another exemplary embodiment of the present disclosure includes: an agent initialization module 201, a task environment generation module 202, an agent decision module 203, a policy score acquisition module 204, a policy visualization module 205, a human score acquisition module 206, a score data preprocessing module 207, and a policy ranking module 208.
Specifically, agent initialization module 201 is used to initialize agents that have several policies.
The task environment generation module 202 is configured to generate a task environment required for human-computer interaction, and may receive operations of human users and agents on the environment and make corresponding changes.
The agent decision module 203 is configured to sample policies in the agent policy space according to the ranking results of the policy ranking module 208 and any information provided by the human user, perform tests in the task environment, and send the test results to the policy visualization module 205 and the policy score collection module 204.
The policy score collection module 204 is configured to obtain objective scores obtained by the test policies on the test tasks, and send the objective scores to the score data preprocessing module 207.
The policy visualization module 205 is configured to receive the policy test results sent by the agent decision module 203, visually present them to the human user, and receive information that the human user wants to convey to the agent (e.g., what the user wants the agent to decide next).
The human score acquisition module 206 is configured to acquire scores of the strategies after the human user views the test results, and send the scores to the score data preprocessing module 207.
The scoring data preprocessing module 207 is configured to receive objective scoring data and subjective scoring data, and aggregate the two data into an antisymmetric matrix, and send the antisymmetric matrix to the policy ranking module 208.
The policy ranking module 208 is configured to rank each policy according to the received antisymmetric matrix, and feed back the ranking result to the agent decision module 203.
Fig. 7 shows a block diagram of an interactive system of an agent according to an exemplary embodiment of the present disclosure. The agent is trained using the training method as described in the exemplary embodiments above.
Referring to fig. 7, an interactive system of an agent according to an exemplary embodiment of the present disclosure includes: an information acquisition unit 301, a policy selection unit 302, a decision unit 303, and an action execution unit 304.
Specifically, the information acquisition unit 301 is configured to acquire information delivered by a user from the current human-computer interaction task environment.
The policy selection unit 302 is configured to select a target policy from a policy space of the target agent.
The decision unit 303 is configured to make a decision based on the information using the target policy.
The action execution unit 304 is configured to execute a corresponding action in the human-computer interaction task environment according to the decision, so as to change feedback information of the human-computer interaction task environment to the user.
It should be appreciated that the specific processes performed by the training system and the interactive system of the agent according to the exemplary embodiments of the present disclosure have been described in detail with reference to fig. 1 to 4, and related details will not be repeated here.
It should be appreciated that the various units and modules in the training system and interactive system of an agent according to exemplary embodiments of the present disclosure may be implemented as hardware components and/or software components.
Fig. 8 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Referring to fig. 8, the electronic device includes: at least one memory 401 and at least one processor 402, said at least one memory 401 having stored therein a set of computer executable instructions which, when executed by the at least one processor 402, perform the training method and/or the interaction method of the agent as described in the above exemplary embodiments.
By way of example, the electronic device may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the above-described set of instructions. Here, the electronic device is not necessarily a single electronic device, but may be any device or an aggregate of circuits capable of executing the above-described instructions (or instruction set) singly or in combination. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with either locally or remotely (e.g., via wireless transmission).
In an electronic device, processor 402 may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 402 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor 402 may execute instructions or code stored in the memory 401, wherein the memory 401 may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory 401 may be integrated with the processor 402, for example, RAM or flash memory is arranged within an integrated circuit microprocessor or the like. In addition, the memory 401 may include a separate device, such as an external disk drive, a storage array, or other storage device that may be used by any database system. The memory 401 and the processor 402 may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., so that the processor 402 can read files stored in the memory.
In addition, the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may also be provided, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the training method and/or the interaction method of the agent as described in the above exemplary embodiments. Examples of the computer readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, nonvolatile memory, CD-ROM, CD-R, CD + R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD + R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, blu-ray or optical disk storage, hard Disk Drives (HDD), solid State Disks (SSD), card-type memories (such as multimedia cards, secure Digital (SD) cards or ultra-fast digital (XD) cards), magnetic tapes, floppy disks, magneto-optical data storage devices, hard disks, solid state disks, and any other devices configured to store computer programs and any associated data, data files and data structures in a non-transitory manner and to provide the computer programs and any associated data, data files and data structures to a processor or computer to enable the processor or computer to execute the programs. The computer programs in the computer readable storage media described above can be run in an environment deployed in a computer device, such as a client, host, proxy device, server, etc., and further, in one example, the computer programs and any associated data, data files, and data structures are distributed across networked computer systems such that the computer programs and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, instructions in which are executable by at least one processor to perform the training method and/or the interaction method of an agent as described in the above exemplary embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. The training method of the intelligent agent suitable for the human-computer interaction scene is characterized by comprising the following steps of:
sampling a plurality of strategies from a strategy space of the intelligent agent as test strategies for the training of the round, wherein the strategy space comprises n strategies, and n is an integer greater than 1;
Testing each test strategy in a human-computer interaction task environment to obtain test results of each test strategy on m test tasks, wherein m is an integer greater than 0;
Calculating objective evaluation data for evaluating the performance of each test strategy in each test task based on the test result;
outputting the test result to a user, and receiving subjective evaluation data of the user on the performance of each test strategy in each test task;
updating the agent based on the objective evaluation data and the subjective evaluation data, and determining whether to continue updating the agent;
Under the condition that the update of the intelligent agent is stopped, taking the intelligent agent which is completed to update as a target intelligent agent which is finally trained, wherein the target intelligent agent is used for carrying out information interaction with human beings;
wherein, in case it is determined to continue updating the agent, the step of sampling a plurality of policies from the agent's policy space as test policies for the present round of training is returned to be performed to start a new round of training.
2. The training method of claim 1, wherein updating the agent based on the objective assessment data and the subjective assessment data comprises:
Generating an antisymmetric matrix based on the objective evaluation data and the subjective evaluation data;
calculating a loss function of the agent based on the antisymmetric matrix;
updating a vector r based on the loss function, wherein r is an n-dimensional vector, and the i-th element r_i of r represents the comprehensive ability evaluation result of the i-th strategy;
updating the ranking of the n strategies based on the updated vector r, wherein the better the comprehensive capability evaluation result of the strategy is, the higher the ranking of the strategy is;
Wherein, in the target agent, the higher the ranking of the strategies, the higher the probability that the strategies are selected.
3. The training method of claim 2, wherein generating an antisymmetric matrix based on the objective assessment data and the subjective assessment data comprises:
converting the objective evaluation data into a first probability evaluation matrix, and generating a first antisymmetric matrix based on the first probability evaluation matrix;
Converting the subjective evaluation data into a second probability evaluation matrix, and generating a second antisymmetric matrix based on the second probability evaluation matrix;
Carrying out weighted summation on the first antisymmetric matrix and the second antisymmetric matrix to obtain the antisymmetric matrix;
wherein the element p1_ij located in the i-th row and the j-th column of the first probability evaluation matrix indicates the probability that the objective evaluation data obtained by the i-th policy is not weaker than that of the j-th policy, and p1_ij + p1_ji = 1;
the element p2_ij located in the i-th row and the j-th column of the second probability evaluation matrix indicates the probability that the subjective evaluation data obtained by the i-th policy is not weaker than that of the j-th policy, and p2_ij + p2_ji = 1.
4. The training method of claim 3, wherein,
the element y1_ij located in the i-th row and the j-th column of the first antisymmetric matrix is determined from p1_ij;
the element y2_ij located in the i-th row and the j-th column of the second antisymmetric matrix is determined from p2_ij.
5. The training method of claim 3, wherein the objective assessment data is an objective assessment score and the subjective assessment data is a subjective assessment score;
wherein p^z_ij is obtained by comparing, over the test tasks, the evaluation scores of the i-th policy and the j-th policy;
wherein the superscript z takes the values 1 and 2, s^1_{i,l} indicates the objective evaluation score obtained by the i-th policy on the l-th test task, s^1_{j,l} indicates the objective evaluation score obtained by the j-th policy on the l-th test task, s^2_{i,l} indicates the subjective evaluation score obtained by the i-th policy on the l-th test task, s^2_{j,l} indicates the subjective evaluation score obtained by the j-th policy on the l-th test task, and l is an integer greater than 0 and less than or equal to m.
6. The training method of claim 2, wherein the step of calculating a loss function for the agent based on the antisymmetric matrix comprises:
the loss function is calculated based on a combined gradient on the vector r, the antisymmetric matrix, and a low rank approximation of the combined rotation of the antisymmetric matrix.
7. The training method of claim 6 wherein updating the vector r based on the loss function comprises: updating a vector r and a matrix C based on the loss function;
wherein the loss function is defined in terms of the combinatorial gradient over the vector r, the antisymmetric matrix, the matrix C, and the low-rank approximation of the combinatorial rotation of the antisymmetric matrix, where k is a hyperparameter.
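Claims 6 and 7 describe the loss in terms of the combinatorial gradient of r, the antisymmetric matrix, and a low-rank approximation of its combinatorial rotation parameterized by the matrix C. Because the exact formula is a figure in the source, the sketch below assumes the commonly used form L(r, C) = || grad(r) + C Omega C^T - A ||_F^2, with grad(r)_ij = r_i - r_j, C of shape (n, 2k), and Omega a block-diagonal 90-degree rotation; this matches the claim wording but is not guaranteed to be the patented formula.

```python
import numpy as np

def make_omega(k):
    """Block-diagonal 2k x 2k matrix of 2x2 rotations [[0, 1], [-1, 0]]."""
    omega = np.zeros((2 * k, 2 * k))
    for b in range(k):
        omega[2 * b, 2 * b + 1] = 1.0
        omega[2 * b + 1, 2 * b] = -1.0
    return omega

def loss_and_grads(r, C, A):
    """Assumed loss L(r, C) = ||grad(r) + C @ Omega @ C.T - A||_F^2 and its
    gradients with respect to r and C (A is the antisymmetric matrix)."""
    omega = make_omega(C.shape[1] // 2)
    grad_r = r[:, None] - r[None, :]            # combinatorial gradient of r
    rot = C @ omega @ C.T                       # low-rank rotation approximation
    E = grad_r + rot - A                        # residual, itself antisymmetric
    loss = np.sum(E ** 2)
    d_r = 4.0 * E.sum(axis=1)                   # dL/dr
    d_C = 2.0 * (E @ C @ omega.T + E.T @ C @ omega)  # dL/dC
    return loss, d_r, d_C

def fit_rating(A, k=1, lr=1e-3, steps=2000, seed=0):
    """Minimise the assumed loss by plain gradient descent on r and C (claim 7)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    r = np.zeros(n)
    C = 0.01 * rng.standard_normal((n, 2 * k))
    for _ in range(steps):
        _, d_r, d_C = loss_and_grads(r, C, A)
        r -= lr * d_r
        C -= lr * d_C
    return r, C
```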
8. An interaction method of an agent suitable for a human-computer interaction scenario, characterized by comprising the following steps:
acquiring information transmitted by a user from the current human-computer interaction task environment;
selecting a target policy from the policy space of the target agent;
making a decision based on the information using the target policy;
executing a corresponding action in the human-computer interaction task environment according to the decision, so as to change the feedback information of the human-computer interaction task environment to the user;
wherein the target agent is trained by performing the training method according to any one of claims 1 to 7.
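A schematic of one interaction step of claim 8; the environment and agent interfaces used here (get_user_info, select_policy, decide, step) are invented for illustration and are not part of the patent.

```python
def interact(env, target_agent):
    """Claim 8: read the information the user passed into the human-computer
    interaction task environment, pick a target policy from the target agent's
    policy space, decide, and act so the environment's feedback to the user changes."""
    info = env.get_user_info()              # information transmitted by the user
    policy = target_agent.select_policy()   # e.g. sampled with ranking-based probability
    decision = policy.decide(info)          # decision made using the target policy
    env.step(decision)                      # corresponding action updates the feedback
```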
9. A training system for an agent adapted for use in a human-computer interaction scenario, the training system comprising:
a policy sampling unit configured to sample a plurality of policies from the policy space of the agent as the test policies for the current round of training, wherein the policy space includes n policies, n being an integer greater than 1;
a testing unit configured to test each test policy in a human-computer interaction task environment to obtain test results of each test policy on m test tasks, wherein m is an integer greater than 0;
an objective evaluation unit configured to calculate, based on the test results, objective evaluation data for evaluating the performance of each test policy in each test task;
a subjective evaluation unit configured to output the test results to a user and to receive subjective evaluation data of the user on the performance of each test policy in each test task;
an updating unit configured to update the agent based on the objective evaluation data and the subjective evaluation data and to determine whether to continue updating the agent, wherein, in a case where it is determined to stop updating the agent, the updated agent is taken as the final trained target agent for information interaction with a human;
wherein the training system starts a new round of training if it is determined to continue updating the agent.
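For reference, a compact sketch of how the units of the training system in claim 9 could be wired into the training loop; every class and method name here is hypothetical.

```python
class TrainingSystem:
    """Illustrative wiring of the units of claim 9 (all interfaces assumed)."""

    def __init__(self, sampler, tester, objective_eval, subjective_eval, updater):
        self.sampler = sampler                  # policy sampling unit
        self.tester = tester                    # testing unit
        self.objective_eval = objective_eval    # objective evaluation unit
        self.subjective_eval = subjective_eval  # subjective evaluation unit
        self.updater = updater                  # updating unit

    def train(self, agent):
        while True:                                            # one round per iteration
            policies = self.sampler.sample(agent)              # test policies of this round
            results = self.tester.run(policies)                # results on the m test tasks
            obj = self.objective_eval.score(results)           # objective evaluation data
            subj = self.subjective_eval.collect(results)       # subjective evaluation data
            agent, keep_going = self.updater.update(agent, obj, subj)
            if not keep_going:                                 # stop: return the target agent
                return agent
```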
10. An interactive system of an agent adapted for use in a human-computer interaction scenario, the interactive system comprising:
an information acquisition unit configured to acquire information transmitted by a user from the current human-computer interaction task environment;
a policy selection unit configured to select a target policy from the policy space of the target agent;
a decision unit configured to make a decision based on the information using the target policy;
an action execution unit configured to execute a corresponding action in the human-computer interaction task environment according to the decision, so as to change the feedback information of the human-computer interaction task environment to the user;
wherein the target agent is trained by performing the training method according to any one of claims 1 to 7.
11. A computer-readable storage medium storing instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the training method of any one of claims 1 to 7 and/or the interaction method of claim 8.
12. An electronic device, the electronic device comprising:
At least one processor;
At least one memory storing computer-executable instructions,
Wherein the computer executable instructions, when executed by the at least one processor, cause the at least one processor to perform the training method of any one of claims 1 to 7 and/or the interaction method of claim 8.
13. A computer program product comprising computer-executable instructions which, when executed by at least one processor, implement the training method of any one of claims 1 to 7 and/or the interaction method of claim 8.
CN202410444184.4A 2024-04-12 2024-04-12 Training method, interaction method and corresponding system of intelligent body Pending CN118051780A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410444184.4A CN118051780A (en) 2024-04-12 2024-04-12 Training method, interaction method and corresponding system of intelligent body

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410444184.4A CN118051780A (en) 2024-04-12 2024-04-12 Training method, interaction method and corresponding system of intelligent body

Publications (1)

Publication Number Publication Date
CN118051780A (en) 2024-05-17

Family

ID=91052203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410444184.4A Pending CN118051780A (en) 2024-04-12 2024-04-12 Training method, interaction method and corresponding system of intelligent body

Country Status (1)

Country Link
CN (1) CN118051780A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination