CN112836036A - Interactive training method, device, terminal and storage medium for intelligent agent - Google Patents

Info

Publication number
CN112836036A
CN112836036A (application CN202110288790.8A)
Authority
CN
China
Prior art keywords
interaction
agent
simulator
target
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110288790.8A
Other languages
Chinese (zh)
Other versions
CN112836036B (English)
Inventor
毋杰
周凯捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202110288790.8A
Publication of CN112836036A
Application granted
Publication of CN112836036B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 5/00 Computing arrangements using knowledge-based models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses an interactive training method, device, terminal and storage medium for an agent, belonging to the technical field of intelligent decision-making. The method comprises: calling a rule simulator and a rule agent to interact based on a first interaction target subset to obtain a first interaction data set; training an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent; calling a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator; and calling a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain the target agent. By implementing this method, the simulator and the agent can be trained alternately based on interaction between the simulator or agent combination and its counterpart, which improves the training efficiency for the agent.

Description

Interactive training method, device, terminal and storage medium for intelligent agent
Technical Field
The invention relates to the technical field of computers, in particular to an interactive training method, an interactive training device, a terminal and a storage medium for an agent.
Background
The task-oriented dialog system is an important research field with high application value. Such a dialog system can assist the user in efficiently completing a specified task through natural language dialog. Task-oriented dialog systems have already been applied in many scenarios, such as movie ticket purchase, airline ticket booking, and hotel reservation.
At present, during the operation of a task-oriented dialog system, an agent interacts with users. Training such an agent requires a large amount of real-time interaction with users, that is, the agent is trained through interaction with a large number of users. However, this training mode requires continuous user participation, which leads to high training cost and an overly long training time.
Disclosure of Invention
The embodiment of the invention provides an interactive training method, device, terminal and storage medium for an agent, which can alternately train the simulator and the agent based on interaction between a simulator combination and the agent, thereby improving the training efficiency for the agent.
In one aspect, an embodiment of the present invention provides an interactive training method for an agent, where the method includes:
acquiring an interactive target set, and screening a first interactive target subset from the interactive target set;
calling a rule simulator and a rule agent to interact on the basis of the first interaction target subset to obtain a first interaction data set, wherein the rule simulator is a simulator constructed on the basis of a first preset rule, and the rule agent is an agent constructed on the basis of a second preset rule;
training an initial simulator and an initial agent based on the first interactive data set to obtain a basic simulator and a basic agent, wherein the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
screening a second interaction target subset from the interaction target set, and calling an agent combination based on the second interaction target subset to interactively train the basic simulator so as to update parameters in the basic simulator and obtain a target simulator, wherein the agent combination comprises the rule agent and the basic agent;
and screening a third interaction target subset from the interaction target set, calling a simulator combination based on the third interaction target subset to carry out interactive training on the basic agent so as to update parameters in the basic agent to obtain a target agent, wherein the simulator combination comprises the rule simulator and the target simulator.
In one aspect, an embodiment of the present invention provides an interactive training apparatus for an agent, where the apparatus includes:
the acquisition module is used for acquiring an interaction target set;
the screening module is used for screening out a first interaction target subset from the interaction target set;
the calling module is used for calling a rule simulator based on the first interaction target subset to interact with a rule agent to obtain a first interaction data set, the rule simulator is a simulator constructed based on a first preset rule, and the rule agent is an agent constructed based on a second preset rule;
the training module is used for training an initial simulator and an initial agent based on the first interactive data set to obtain a basic simulator and a basic agent, the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
the screening module is further configured to screen a second subset of interaction targets from the set of interaction targets,
the training module is further configured to invoke an agent combination to perform interactive training on the basic simulator based on the second interactive target subset, so as to update parameters in the basic simulator and obtain a target simulator, where the agent combination includes the rule agent and the basic agent;
the screening module is further configured to screen a third interaction target subset from the interaction target set;
the training module is further configured to invoke a simulator combination based on the third interactive target subset to perform interactive training on the basic agent, so as to update parameters in the basic agent, and obtain a target agent, where the simulator combination includes the rule simulator and the target simulator.
In one aspect, an embodiment of the present invention provides a terminal, including a processor and a memory, where the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the above interactive training method for an agent.
In one aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the interactive training method for an agent.
In the embodiment of the invention, the terminal calls a rule simulator and a rule agent to interact based on a first interaction target subset to obtain a first interaction data set; trains an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent; calls a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator; and calls a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain the target agent. By implementing this method, the simulator and the agent can be trained alternately based on interaction between the simulator or agent combination and its counterpart, which improves the training efficiency for the agent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
FIG. 1 is a schematic flow chart of an interactive training method for an agent according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of another interactive training method for an agent according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an interactive training apparatus for an agent according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The interactive training method for the intelligent agent is realized on a terminal, and the terminal comprises electronic equipment such as a smart phone, a tablet computer, a digital audio and video player, an electronic reader, a handheld game machine or vehicle-mounted electronic equipment.
Fig. 1 is a schematic flowchart of an interactive training method for an agent in an embodiment of the present invention, and as shown in fig. 1, a flowchart of the interactive training method for an agent in the embodiment may include:
s101, acquiring an interaction target set, and screening a first interaction target subset from the interaction target set.
In the embodiment of the present invention, the interaction target set includes multiple interaction targets, where an interaction target is a goal that needs to be reached during one interaction with the agent. For example, in an airline ticket booking scenario, one interaction target may be: Saturday morning, Shanghai to Chongqing, departing from Pudong Airport, a China Eastern Airlines ticket. The agent may specifically be the executing device of an air ticket booking system, an insurance purchasing system, an intelligent customer service system, or the like, and in practice interacts with users to meet their requirements. The agent can be obtained through deep learning training; specifically, reinforcement learning can be adopted, in which the agent is trained through interaction between a simulator and the agent, so that the agent acquires good response capability. During the training of the agent, multiple interaction targets are screened from the interaction target set, and the simulator is called to interact with the agent based on these interaction targets, so as to achieve the purpose of training the agent.
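Purely for illustration, an interaction target of the above airline-ticket example could be represented as a slot-value structure such as the following Python sketch; the patent does not prescribe any particular data format, so the field names are assumptions.

```python
# One possible representation of an interaction target for the airline-ticket
# booking scene (format and field names are assumed, not part of the patent).
interaction_target = {
    "departure_time": "Saturday morning",
    "origin": "Shanghai",
    "destination": "Chongqing",
    "departure_airport": "Pudong Airport",
    "airline": "China Eastern Airlines",
}
# The interaction target set simply holds many such targets.
interaction_target_set = [interaction_target]
```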
In specific implementation, the terminal may obtain a pre-constructed interaction target set, and screen a first interaction target subset from the interaction target set, where the first interaction target subset may be multiple interaction targets in the interaction target set, and each interaction target may be used for interaction between a rule simulator and a rule agent to be constructed subsequently.
In one embodiment, the terminal may screen out the first interaction target subset from the interaction target set as follows: the terminal acquires a target application scene corresponding to the initial agent to be trained; acquires, from historical records, target interaction records under the target application scene, where the target interaction records include interaction records between users and an agent in the target application scene; acquires at least one historical interaction target from the target interaction records; and screens out, from the interaction target set, K interaction targets that match the historical interaction targets as the first interaction target subset, where K is a positive integer. The application scene may be an airline ticket booking scene, an insurance purchasing scene, an online shopping scene, or the like; different application scenes contain interaction records between different users and agents; the agent involved may be an agent that has already been trained or an agent constructed based on rules for interacting with users; and the target application scene may be any application scene.
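As a rough illustration of this screening step, the following sketch scores each candidate interaction target by keyword overlap with the historical interaction targets and keeps the top K; the function name, the overlap measure and the whitespace tokenization are assumptions made only for the example.

```python
def screen_first_target_subset(interaction_targets, historical_targets, k):
    """Pick the K interaction targets that best match the historical
    interaction targets of the target application scene (illustrative only)."""
    def overlap(candidate, history):
        # Crude keyword overlap: shared words / words in the candidate target.
        cand_words, hist_words = set(candidate.split()), set(history.split())
        return len(cand_words & hist_words) / max(len(cand_words), 1)

    scored = [(max(overlap(t, h) for h in historical_targets), t)
              for t in interaction_targets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [target for _, target in scored[:k]]

# Example usage for an airline-ticket booking scene.
targets = ["Saturday morning Shanghai to Chongqing ticket",
           "Sunday evening Beijing to Shenzhen ticket",
           "Saturday morning Shanghai to Chengdu ticket"]
history = ["Saturday Shanghai Chongqing ticket"]
first_subset = screen_first_target_subset(targets, history, k=2)
```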
S102, calling a rule simulator and a rule agent to interact based on the first interaction target subset to obtain a first interaction data set.
In the embodiment of the invention, after the terminal determines the first interaction target subset, the terminal can call the rule simulator and the rule agent to interact based on the first interaction target subset, wherein the rule simulator is constructed based on a first preset rule, and the rule agent is constructed based on a second preset rule.
In one implementation, the rule simulator is configured to simulate the content that a user would output to the agent during an interaction, and may specifically ask the agent questions according to predefined rules. For example, when the acquired content contains a location slot, the rule simulator outputs the location information in the interaction target (e.g., from Shanghai to Chongqing), and when the acquired content contains a time slot, it outputs the time information in the interaction target (e.g., Saturday morning). The rule agent does not know in advance the target information required by the user; it collects the user's required slots by asking questions, queries the database accordingly, and replies to the user's questions. Specifically, it may perform a corresponding operation based on a preset rule when input dialog content is detected. For example, if the second preset rule for constructing the rule agent is to query the database for information corresponding to the acquired keywords and feed the result back, and the keywords in the acquired content are "Saturday", "morning" and "ticket", the agent queries the database for all Saturday-morning tickets; when the number of tickets is less than a threshold, the tickets are displayed, and when the number of tickets is greater than the threshold, feedback information such as "What is the departure location?" is returned. The specific way of invoking the rule simulator and the rule agent to interact based on the first interaction target subset may be as follows: the rule simulator outputs all or part of the content of any target in the first interaction target subset to the rule agent, acquires the content returned by the rule agent, and continues interacting based on the returned content; the interaction stops when the preset target is reached, or when the number of interaction rounds reaches a preset number. For each interaction target, the rule simulator and the rule agent may be invoked to interact in the above manner, so as to obtain N rounds of interaction data, where one round of interaction data includes the content output by the rule simulator, the content fed back by the rule agent, and a return value. The return value may be determined by the degree of association between the content output by the rule simulator and the content fed back by the rule agent; for example, entities may be extracted from the content, and the degree of association determined by the distance between the entities in a knowledge graph, where the knowledge graph includes multiple entities connected based on the relationships between them, and two entities are determined to be related when they appear in the same text. Alternatively, the return value may be determined by human labeling. Further, the N rounds of interaction data are stored in an experience storage area (Replay Buffer) as the first interaction data set.
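The experience storage area (Replay Buffer) mentioned above could be sketched as a fixed-capacity buffer of per-round tuples, as shown below; the class layout and capacity are assumptions and do not reflect any specific implementation in the patent.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience storage area holding rounds of interaction data (sketch)."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest rounds are dropped first

    def add(self, consultation, feedback, return_value):
        # One round of interaction data: simulator output, agent output, return value.
        self.buffer.append((consultation, feedback, return_value))

    def sample(self, n):
        # Draw up to N rounds of interaction data for training.
        return random.sample(list(self.buffer), min(n, len(self.buffer)))

first_interaction_data_set = ReplayBuffer()
first_interaction_data_set.add("Saturday morning, Shanghai to Chongqing",
                               "Which airport do you depart from?", 0.3)
```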
In one implementation, the first preset rule is to output corresponding consultation information based on keywords in the acquired feedback information, and the second preset rule is to output corresponding feedback information based on keywords in the acquired consultation information. The terminal invokes the rule simulator and the rule agent to interact in the same way for each interaction target in the first interaction target subset. Specifically, the terminal determines first consultation information based on any reference interaction target in the first interaction target subset and calls the rule simulator to send it to the rule agent; calls the rule agent to output corresponding first feedback information to the rule simulator based on the keywords in the first consultation information; determines the first consultation information and the first feedback information as the first round of interaction data, and determines a first return value corresponding to the first round of interaction data based on the matching degree between the first feedback information and the reference interaction target. If the matching degree between the first feedback information and the reference interaction target is smaller than a preset matching degree, the terminal calls the rule simulator to output corresponding second consultation information to the rule agent based on the keywords in the first feedback information; calls the rule agent to output corresponding second feedback information to the rule simulator based on the keywords in the second consultation information; determines the second consultation information and the second feedback information as the second round of interaction data, and determines a second return value corresponding to the second round of interaction data based on the matching degree between the second feedback information and the reference interaction target. If the matching degree between the second feedback information and the reference interaction target is greater than the preset matching degree, the terminal stops calling the rule simulator and the rule agent for interaction, and adds the first round of interaction data, the first return value, the second round of interaction data and the second return value to the first interaction data set. Similarly, if the matching degree between the second feedback information and the reference interaction target is smaller than the preset matching degree, the rule simulator may continue to interact with the rule agent until, after multiple rounds of interaction, the matching degree between the feedback information output by the rule agent and the reference interaction target is greater than the preset matching degree, or until the number of interaction rounds between the rule agent and the rule simulator reaches the preset number of rounds. During the interaction, each round of consultation information, feedback information and return value is added to the first interaction data set. In this manner, the rule simulator and the rule agent can be called to interact based on each interaction target in the first interaction target subset, and the data generated by the interactions is added to the first interaction data set to obtain the complete first interaction data set.
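A condensed sketch of one such interaction episode is given below. The callables passed in stand for the rule simulator, the rule agent and the matching-degree calculation described in the next paragraph; their names, the threshold and the round limit are assumptions made only for illustration.

```python
def run_episode(first_consult, agent_reply, simulator_reply, matching_degree,
                reference_target, preset_matching_degree=0.8, max_rounds=10):
    """One rule-simulator / rule-agent interaction episode for a single
    reference interaction target (illustrative sketch)."""
    rounds = []
    consultation = first_consult(reference_target)
    for _ in range(max_rounds):
        feedback = agent_reply(consultation)               # rule agent: keyword-based reply
        return_value = matching_degree(feedback, reference_target)
        rounds.append((consultation, feedback, return_value))
        if return_value >= preset_matching_degree:         # target reached, stop interacting
            break
        consultation = simulator_reply(feedback)           # rule simulator: next consultation
    return rounds  # added to the first interaction data set
```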
The matching degree between the first feedback information and the reference interaction target may be calculated as follows: obtain the number of identical characters shared by the first feedback information and the reference interaction target, and determine the ratio of this number to the total number of characters in the reference interaction target as the matching degree. Alternatively, extract at least one first entity contained in the first feedback information and at least one second entity in the reference interaction target, obtain the number of identical entities among them and the total number of entities in the first and second entities, and determine the ratio between the number of identical entities and the total number as the matching degree. The entities may be extracted based on preset rules; the specific extraction manner may be a rule- and dictionary-based method, such as manually written rules that extract features such as keywords, indicator words and position words as entities, a conventional statistics-based machine learning method, a deep learning method, or the like. Alternatively, the matching degree may be pre-labeled by the developer.
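The two matching-degree variants described above might be written as in the following sketch. The character-level variant follows the ratio described in the text, and the entity-level variant treats the union of the extracted entities as the total number, which is one possible reading; the entity extractor is deliberately left as a placeholder.

```python
def char_matching_degree(feedback, reference_target):
    """Characters of the reference target that also appear in the feedback,
    divided by the total number of characters in the reference target."""
    shared = sum(1 for ch in reference_target if ch in feedback)
    return shared / max(len(reference_target), 1)

def entity_matching_degree(feedback, reference_target, extract_entities):
    """Identical entities divided by all entities found in either text;
    extract_entities stands for a rule-, dictionary- or model-based extractor."""
    first_entities = set(extract_entities(feedback))
    second_entities = set(extract_entities(reference_target))
    all_entities = first_entities | second_entities
    return len(first_entities & second_entities) / max(len(all_entities), 1)
```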
S103, training the initial simulator and the initial agent based on the first interactive data set to obtain a basic simulator and a basic agent.
In the embodiment of the invention, after the terminal acquires the first interaction data set, it trains the initial simulator and the initial agent based on the first interaction data set to obtain the basic simulator and the basic agent. The initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm. The deep learning algorithm may be a Convolutional Neural Network (CNN) algorithm, a Long Short-Term Memory (LSTM) network algorithm, or the like; for example, a single-layer LSTM + multi-layer perceptron + softmax classification layer may be used as the policy network to construct the initial simulator and the initial agent.
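As one concrete reading of the "single-layer LSTM + multi-layer perceptron + softmax classification layer" policy mentioned above, a minimal PyTorch sketch is given below; the vocabulary size, hidden dimensions and number of dialogue actions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Single-layer LSTM + multi-layer perceptron + softmax policy (sketch)."""

    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_actions=30):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # single layer
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded dialogue state or utterance
        embedded = self.embed(token_ids)
        _, (h_n, _) = self.lstm(embedded)        # h_n: (1, batch, hidden_dim)
        logits = self.mlp(h_n.squeeze(0))
        return torch.softmax(logits, dim=-1)     # distribution over dialogue actions
```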
In one implementation, the first interaction data set includes at least one round of interaction data, and the terminal may train the initial simulator based on the first interaction data set as follows: N rounds of interaction data are screened from the first interaction data set, where each of the N rounds of interaction data includes consultation information output by the rule simulator, feedback information output by the rule agent, and a return value, and N is a positive integer; the N rounds of interaction data are called to iteratively train the initial simulator in a reinforcement-learning manner so as to update the parameters in the initial simulator; and if the initial simulator after the parameter update meets a preset condition, the initial simulator after the parameter update is determined as the basic simulator, where the preset condition includes that the return value obtained by interacting with the rule agent is higher than a preset return value. Alternatively, the preset condition may be that the success rate obtained by interacting with the rule agent is higher than a preset success rate, where an interaction is determined to be successful when the inquiry of the interaction target is completed during the interaction. The preset condition may be set in advance by the developer.
In one implementation, the first interaction data set includes at least one round of interaction data, and the terminal may train the initial agent based on the first interaction data set as follows: N rounds of interaction data are screened from the first interaction data set, where each of the N rounds of interaction data includes consultation information output by the rule simulator, feedback information output by the rule agent, and a return value, and N is a positive integer; the N rounds of interaction data are called to iteratively train the initial agent in a reinforcement-learning manner so as to update the parameters in the initial agent; and if the initial agent after the parameter update meets a preset condition, the initial agent after the parameter update is determined as the basic agent, where the preset condition includes that the return value obtained by interacting with the rule simulator is higher than a preset return value. Alternatively, the preset condition may be that the success rate obtained by interacting with the rule simulator is higher than a preset success rate, where an interaction is determined to be successful when the inquiry of the interaction target is completed during the interaction. The preset condition may be set in advance by the developer.
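The patent does not fix a particular reinforcement-learning algorithm, so the sketch below uses a simple REINFORCE-style policy-gradient update over the sampled rounds as a stand-in, stopping once the average return exceeds the preset return value; the tuple layout (including an action identifier) and the encode helper are assumptions.

```python
import torch

def train_on_rounds(policy, optimizer, rounds, encode, preset_return=0.8):
    """One pass of reinforcement-learning training over N rounds of interaction
    data (illustrative). Each round is (consultation, feedback, action_id,
    return_value); encode turns text into a (1, seq_len) tensor of token ids."""
    returns = []
    for consultation, feedback, action_id, return_value in rounds:
        action_probs = policy(encode(consultation))        # distribution over actions
        log_prob = torch.log(action_probs[0, action_id] + 1e-8)
        loss = -log_prob * return_value                    # policy-gradient loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        returns.append(return_value)
    average_return = sum(returns) / max(len(returns), 1)
    return average_return >= preset_return                 # preset condition met?
```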
By the method, the basic simulator and the basic intelligent agent obtained by training have basic interaction capacity.
S104, screening a second interaction target subset from the interaction target set, calling an agent combination to carry out interaction training on the basic simulator based on the second interaction target subset, and updating parameters in the basic simulator to obtain the target simulator.
In the embodiment of the invention, after the terminal has trained and obtained the basic simulator and the basic agent, it can screen a second interaction target subset from the interaction target set and call an agent combination to interactively train the basic simulator based on the second interaction target subset, so as to update the parameters in the basic simulator and obtain the target simulator. The agent combination includes the rule agent and the basic agent. The second interaction target subset may be a plurality of interaction targets in the interaction target set, and each interaction target may be used for the subsequent interaction between the basic simulator and the agent combination.
In one implementation, the terminal may invoke the agent combination to interactively train the basic simulator based on the second interaction target subset as follows. The terminal calls the agent combination to interact with the basic simulator at least once based on the second interaction target subset. During the I-th interaction of the at least one interaction, the terminal acquires a first interaction round number U corresponding to the rule agent in the agent combination and a second interaction round number V corresponding to the basic agent, where I, U and V are positive integers; calls the rule agent to interact with the basic simulator based on U interaction targets in the second interaction target subset to obtain a first interaction data subset; calls the basic agent to interact with the basic simulator based on V interaction targets in the second interaction target subset to obtain a second interaction data subset; and updates the parameters in the basic simulator based on the first interaction data subset and the second interaction data subset. If the basic simulator after the parameter update does not meet a first preset condition, the first interaction round number corresponding to the rule agent and the second interaction round number corresponding to the basic agent in the agent combination are adjusted to obtain an agent combination with updated interaction round numbers, and the basic simulator is interactively trained based on this updated agent combination during the (I+1)-th interaction. If the basic simulator after the parameter update meets the first preset condition, it is determined as the target simulator. The terminal may determine whether the basic simulator after the parameter update meets the first preset condition as follows: the terminal screens out a test interaction target from the interaction target set and calls the basic simulator after the parameter update to interact with the agent combination based on the test interaction target to obtain test interaction data; if the test interaction data indicate that the completion degree of the test interaction target is higher than a preset completion degree, it is determined that the basic simulator after the parameter update meets the first preset condition. The completion degree may be judged by developers, or determined based on the matching degree between the feedback information in the interaction data and the test interaction target. In the above manner, by combining the agents and continuously adjusting the proportion in which the rule agent and the basic agent in the agent combination are used in each training round, the simulator can be trained better, the performance of the trained target simulator is improved, and the target simulator acquires a good ability to simulate real user dialogs.
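At a high level, the I-th interaction described above could be organized as in the following sketch, where U rounds are played by the rule agent and V rounds by the basic agent, and the round numbers are shifted toward the basic agent whenever the first preset condition is not met; all callables and the adjustment step are placeholders for illustration only.

```python
def train_simulator_with_agent_combination(base_simulator, rule_agent, base_agent,
                                            second_subset, run_rounds,
                                            update_simulator, meets_condition,
                                            u=90, v=10, step=10, max_interactions=20):
    """Interactive training of the basic simulator by an agent combination
    (sketch). run_rounds(agent, simulator, targets) returns interaction data,
    update_simulator applies a parameter update, and meets_condition checks
    the first preset condition on the updated simulator."""
    for _ in range(max_interactions):
        first_data = run_rounds(rule_agent, base_simulator, second_subset[:u])
        second_data = run_rounds(base_agent, base_simulator, second_subset[u:u + v])
        update_simulator(base_simulator, first_data + second_data)
        if meets_condition(base_simulator):
            return base_simulator                    # becomes the target simulator
        u, v = max(u - step, 0), v + step            # shift rounds toward the basic agent
    return base_simulator
```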
In one implementation, the terminal may also invoke the agent combination to interactively train the basic simulator based on the second interaction target subset as follows: the terminal calls the agent combination to interact with the basic simulator based on the second interaction target subset to obtain a second interaction data set, and trains the basic simulator based on the second interaction data set to obtain the target simulator. Here, the agent combination is a combination of the rule agent and the basic agent; in each training round, the proportion in which the rule agent and the basic agent in the agent combination are used can be adjusted, the agent combination continuously interacts with the basic simulator, the parameters in the basic simulator are continuously updated based on the interaction data generated by the interaction, and when the basic simulator after the parameter update meets the condition, it is determined as the target simulator. In this way, the trained target simulator acquires a good ability to simulate real user dialogs. The proportion in which the rule agent and the basic agent are used may be adjusted based on a rule. For example, if the adjustment rule is to increase the usage proportion of the basic agent by 10% each time until the success rate no longer improves, and the second interaction target subset includes 100 interaction targets, the initial proportion is 90% rule agent and 10% basic agent: in one training round, the basic agent is called to complete 10 question-and-answer interactions while the rule agent completes 90; in the next training round, the basic agent completes 20 and the rule agent completes 80. The parameters in the basic simulator are updated based on each interactive question-and-answer session to obtain the target simulator.
S105, screening a third interaction target subset from the interaction target set, calling a simulator combination based on the third interaction target subset to carry out interaction training on the basic intelligent agent so as to update parameters in the basic intelligent agent and obtain the target intelligent agent.
In the embodiment of the invention, after the terminal trains and obtains the target simulator, the target simulator and the rule simulator can be combined to obtain the simulator combination, a third interaction target subset is screened from the interaction target set, the simulator combination is called based on the third interaction target subset to carry out interaction training on the basic intelligent agent so as to update parameters in the basic intelligent agent and obtain the target intelligent agent, and the simulator combination comprises the rule simulator and the target simulator. Wherein the third subset of interaction targets may be a plurality of interaction targets in the set of interaction targets, and each interaction target may be used for interaction of the subsequently constructed simulator combination and the basic agent.
In one implementation, the terminal may invoke the simulator combination to interactively train the basic agent based on the third interaction target subset as follows. The terminal calls the simulator combination to interact with the basic agent multiple times based on the third interaction target subset. During the J-th interaction of the multiple interactions, the terminal acquires a third interaction round number X corresponding to the rule simulator in the simulator combination and a fourth interaction round number Y corresponding to the target simulator, where J, X and Y are positive integers; calls the rule simulator to interact with the basic agent based on X interaction targets in the third interaction target subset to obtain a third interaction data subset; calls the target simulator to interact with the basic agent based on Y interaction targets in the third interaction target subset to obtain a fourth interaction data subset; and updates the parameters in the basic agent based on the third interaction data subset and the fourth interaction data subset. If the basic agent after the parameter update does not meet a second preset condition, the third interaction round number corresponding to the rule simulator and the fourth interaction round number corresponding to the target simulator in the simulator combination are adjusted to obtain a simulator combination with updated interaction round numbers, and the basic agent is interactively trained based on this updated simulator combination during the (J+1)-th interaction. If the basic agent after the parameter update meets the second preset condition, it is determined as the target agent. The terminal may determine whether the basic agent after the parameter update meets the second preset condition as follows: the terminal screens out a test interaction target from the interaction target set and calls the basic agent after the parameter update to interact with the simulator combination based on the test interaction target to obtain test interaction data; if the test interaction data indicate that the completion degree of the test interaction target is higher than a preset completion degree, it is determined that the basic agent after the parameter update meets the second preset condition. In the above manner, by combining the rule simulator and the trained target simulator and continuously adjusting the proportion in which they are used in each training round, the basic agent can be trained better, the performance of the trained target agent is improved, and the target agent acquires a good ability to converse with real users.
In one implementation, the terminal may also invoke the simulator combination to interactively train the basic agent based on the third interaction target subset as follows: the terminal calls the simulator combination to interact with the basic agent based on the third interaction target subset to obtain a third interaction data set, and trains the basic agent based on the third interaction data set to obtain the target agent. Here, the simulator combination is a combination of the rule simulator and the target simulator; in each training round, the proportion in which the rule simulator and the target simulator in the simulator combination are used can be adjusted, the simulator combination continuously interacts with the basic agent, the parameters in the basic agent are continuously updated based on the interaction data generated by the interaction, and when the basic agent after the parameter update meets the condition, it is determined as the target agent. The target agent obtained in this way has a good ability to respond to content input by users.
In the embodiment of the invention, the terminal calls a rule simulator and a rule agent to interact based on a first interaction target subset to obtain a first interaction data set; trains an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent; calls a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator; and calls a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain the target agent. By implementing this method, the simulator and the agent can be trained alternately based on interaction between the simulator or agent combination and its counterpart, which improves the training efficiency for the agent.
Fig. 2 is a schematic flowchart of another interactive training method for an agent in the embodiment of the present invention, and as shown in fig. 2, the flowchart of the interactive training method for an agent in the embodiment may include:
s201, obtaining an interaction target set, and screening a first interaction target subset from the interaction target set.
In the embodiment of the invention, the interaction target set comprises multiple interaction targets, the interaction targets are specifically targets which need to be reached in the process of interacting with the intelligent agent once, and the interaction target set can be preset by research personnel.
S202, calling a rule simulator and a rule agent to interact based on the first interaction target subset to obtain a first interaction data set.
In the embodiment of the invention, after the terminal determines the first interaction target subset, the terminal can call the rule simulator and the rule intelligent agent to interact based on the first interaction target subset, so as to obtain the first interaction data set. The rule simulator is constructed based on a first preset rule, and the rule agent is constructed based on a second preset rule.
S203, training the initial simulator and the initial agent based on the first interactive data set to obtain a basic simulator and a basic agent.
In the embodiment of the invention, after the terminal acquires the first interactive data set, the initial simulator and the initial agent are trained based on the first interactive data set to obtain the basic simulator and the basic agent. The initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm.
S204, screening a second interaction target subset from the interaction target set, calling an agent combination to carry out interaction training on the basic simulator based on the second interaction target subset, and updating parameters in the basic simulator to obtain the target simulator.
In the embodiment of the invention, the agent combination includes the rule agent and the basic agent. The second interaction target subset may be a plurality of interaction targets in the interaction target set, and each interaction target may be used for the subsequent interaction between the basic simulator and the agent combination. The interactive training process comprises N rounds, and the proportion in which the rule agent and the basic agent in the agent combination are used differs across training rounds. For example, in the first training round, the rule agent and the basic simulator complete the interaction for 90 interaction targets in the second interaction target subset, and the basic agent and the basic simulator complete the interaction for 10 interaction targets in the second interaction target subset. After each training round, the target completion rate of that round is obtained; the number of interaction targets to be completed by the rule agent and the basic simulator is then reduced by t, and the number of interaction targets to be completed by the basic agent and the basic simulator is increased by t, until the target completion rate of a training round no longer increases, where t is a positive integer.
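The round schedule described above could be sketched as follows, where train_round runs one round of interactive training with the given split of interaction targets and returns its target completion rate; the function names and initial split are assumptions. The same schedule applies symmetrically in step S205, with the rule simulator and the target simulator taking the roles of the two components.

```python
def run_ratio_schedule(train_round, total_targets=100, start_learned=10, t=10):
    """Shift t interaction targets per training round from the rule component
    to the learned component until the target completion rate stops improving
    (illustrative sketch)."""
    learned = start_learned
    best_rate = float("-inf")
    while learned <= total_targets:
        completion_rate = train_round(total_targets - learned, learned)
        if completion_rate <= best_rate:   # completion rate no longer increases: stop
            break
        best_rate = completion_rate
        learned += t                       # give t more targets to the learned component
    return best_rate
```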
S205, a third interaction target subset is screened from the interaction target set, and a simulator combination is called to carry out interaction training on the basic intelligent agent based on the third interaction target subset so as to update parameters in the basic intelligent agent and obtain the target intelligent agent.
In the embodiment of the invention, the simulator combination includes the rule simulator and the target simulator. The third interaction target subset may be a plurality of interaction targets in the interaction target set, and each interaction target may be used for the subsequent interaction between the simulator combination and the basic agent. The interactive training process comprises N rounds, and the proportion in which the rule simulator and the target simulator in the simulator combination are used differs across training rounds. For example, in the first training round, the rule simulator and the basic agent complete the interaction for 90 interaction targets in the third interaction target subset, and the target simulator and the basic agent complete the interaction for 10 interaction targets in the third interaction target subset. After each training round, the target completion rate of that round is obtained; the number of interaction targets to be completed by the rule simulator and the basic agent is then reduced by t, and the number of interaction targets to be completed by the target simulator and the basic agent is increased by t, until the target completion rate of a training round no longer increases, where t is a positive integer.
S206, acquiring a second interactive data set obtained by interaction between at least one test user and the target intelligent agent.
In the embodiment of the invention, the second interaction data set includes at least one round of interaction data, and each round of interaction data includes a test score, consultation information output by a test user, and feedback information output by the target agent. Each test user can input information to the target agent to interact with it: the test user outputs consultation information during the interaction, and the target agent outputs corresponding feedback information. After the interaction is completed, the test user scores the feedback information output by the agent in that round of interaction, thereby producing the test score.
And S207, training the target agent based on the second interactive data set so as to update the parameters in the target agent and obtain the target agent with updated parameters.
In the embodiment of the invention, after the terminal acquires the second interactive data set, N rounds of interactive data in the second interactive data set can be called to carry out iterative training on the target agent based on a training mode of reinforcement learning so as to update parameters in the target agent; and calling K rounds of interactive data in the second interactive data set to test the target agent with the updated parameters, and if the test result indicates that the target agent with the updated parameters meets preset conditions, executing step S208.
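The fine-tuning stage might, for example, use the test scores as the reward signal for the N training rounds and hold out K rounds to check the preset condition, as in the sketch below; the helper functions, the score threshold and the data layout are assumptions made only for illustration.

```python
def finetune_with_test_users(target_agent, interaction_rounds, update_agent,
                             evaluate, n_train, k_test, score_threshold=4.0):
    """Fine-tune the target agent on test-user interaction data (sketch).
    Each round is (consultation, feedback, test_score); update_agent applies
    a reinforcement-learning update with the test score as the reward, and
    evaluate returns the average score on the held-out K rounds."""
    train_rounds = interaction_rounds[:n_train]
    test_rounds = interaction_rounds[n_train:n_train + k_test]
    for consultation, feedback, test_score in train_rounds:
        update_agent(target_agent, consultation, feedback, reward=test_score)
    # Preset condition: the held-out evaluation reaches the score threshold.
    return evaluate(target_agent, test_rounds) >= score_threshold
```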
And S208, calling the target agent with the updated parameters to interact with the target user.
In the embodiment of the invention, after the terminal obtains the target agent with updated parameters, the agent can be applied to actual interaction with users, and it can be continuously updated using the interaction data generated in subsequent real interactions with users, thereby improving its performance. In the above scheme, different stages of training the agent are completed based on different types of data: the first stage trains the agent based on constructed interaction data so that it acquires a preliminary response capability, and the agent is subsequently trained and tuned using interaction data from real users, so that its performance keeps improving.
In the embodiment of the invention, the terminal calls a rule simulator and a rule agent to interact based on a first interaction target subset to obtain a first interaction data set; trains an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent; calls a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator; calls a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain the target agent; and then trains and tunes the target agent using real interaction with users, so that the agent achieves better performance. By implementing this method, the simulator and the agent can be trained alternately based on interaction between the simulator or agent combination and its counterpart, followed by training and tuning based on a small number of interaction samples with real users, which improves the training efficiency for the agent.
The interactive training device for an agent according to the embodiment of the present invention will be described in detail with reference to FIG. 3. It should be noted that the interactive training device for an agent shown in FIG. 3 is used to execute the method of the embodiments of the present invention shown in FIGS. 1-2. For convenience of description, only the portions related to the embodiment of the present invention are shown; for technical details that are not disclosed here, reference is made to the embodiments shown in FIGS. 1-2.
Referring to FIG. 3, a schematic structural diagram of an interactive training device for an agent according to the present invention is shown. The interactive training device 30 for an agent may include: an acquisition module 301, a screening module 302, a calling module 303, and a training module 304.
An obtaining module 301, configured to obtain an interaction target set;
a screening module 302, configured to screen out a first subset of interaction targets from the set of interaction targets;
a calling module 303, configured to call a rule simulator based on the first interaction target subset to interact with a rule agent to obtain a first interaction data set, where the rule simulator is a simulator constructed based on a first preset rule, and the rule agent is an agent constructed based on a second preset rule;
a training module 304, configured to train an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent, where the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
the screening module 302 is further configured to screen out a second subset of interaction targets from the set of interaction targets,
the training module 304 is further configured to invoke an agent combination to perform interactive training on the basic simulator based on the second interactive target subset, so as to update parameters in the basic simulator and obtain a target simulator, where the agent combination includes the rule agent and the basic agent;
the screening module 302 is further configured to screen a third subset of interaction targets from the set of interaction targets;
the training module 304 is further configured to invoke a simulator combination based on the third interactive target subset to perform interactive training on the basic agent, so as to update parameters in the basic agent, and obtain a target agent, where the simulator combination includes the rule simulator and the target simulator.
In one implementation, the screening module 302 is specifically configured to:
acquiring a target application scene corresponding to an initial agent to be trained;
acquiring a target interaction record under the target application scene from a historical record, wherein the target interaction record comprises interaction records of a user and an intelligent agent under the target application scene;
and acquiring at least one historical interaction target from the target interaction record, and screening K interaction targets matched with the historical interaction targets from the interaction target set to serve as a first interaction target subset, wherein K is a positive integer.
In one implementation manner, the first preset rule is to output corresponding advisory information based on a keyword in the obtained feedback information, the second preset rule is to output corresponding feedback information based on a keyword in the obtained advisory information, and the calling module 303 is specifically configured to:
determining first consultation information based on a reference interaction target, and calling the rule simulator to send the first consultation information to the rule agent;
calling the rule agent to output corresponding first feedback information to the rule simulator based on the keywords in the first consultation information;
determining the first consultation information and the first feedback information as first round of interaction data, and determining a first return value corresponding to the first round of interaction data based on the matching degree between the first feedback information and the reference interaction target;
if the matching degree between the first feedback information and the reference interaction target is smaller than a preset matching degree, calling the rule simulator to output corresponding second consultation information to the rule agent based on the keywords in the first feedback information;
calling the rule agent to output corresponding second feedback information to the rule simulator based on the keywords in the second consultation information;
determining the second consultation information and the second feedback information as second round of interaction data, and determining a second return value corresponding to the second round of interaction data based on the matching degree between the second feedback information and the reference interaction target;
if the matching degree between the second feedback information and the reference interaction target is greater than a preset matching degree, stopping calling the rule simulator and the rule agent for interaction, and adding the first round of interaction data, the first return value, the second round of interaction data and the second return value to a first interaction data set.
In an implementation manner, the first interaction data set includes at least one round of interaction data, and the training module 304 is specifically configured to:
screening N rounds of interaction data from the first interaction data set, wherein each round of interaction data in the N rounds of interaction data comprises consultation information output by the rule simulator, feedback information output by the rule agent and a return value, and N is a positive integer;
calling the N rounds of interactive data to carry out iterative training on the initial agent based on a training mode of reinforcement learning so as to update parameters in the initial agent;
and if the initial agent with the updated parameters meets preset conditions, determining the initial agent with the updated parameters as a basic agent, wherein the preset conditions comprise that the average return value obtained by carrying out multiple rounds of interaction with the rule simulator is higher than a preset return value.
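A sketch of this reinforcement-learning update is shown below. A tabular softmax policy stands in for the network built with the second deep learning algorithm, the update is a plain REINFORCE step, and the preset condition is checked against the average of the stored return values; in the embodiment that condition is evaluated by interacting with the rule simulator, so all of these concrete choices are assumptions.

```python
# Sketch of the reinforcement-learning update applied to the initial agent
# using N screened rounds of (consultation, feedback, return value).

import math

class TinyPolicyAgent:
    def __init__(self, candidate_replies):
        self.candidates = candidate_replies
        self.theta = {a: 0.0 for a in candidate_replies}  # one logit per candidate reply

    def probabilities(self):
        z = {a: math.exp(t) for a, t in self.theta.items()}
        s = sum(z.values())
        return {a: v / s for a, v in z.items()}

    def reinforce_update(self, chosen_reply, return_value, lr=0.1):
        # Raise the log-probability of replies that earned high return values.
        p = self.probabilities()
        for a in self.candidates:
            indicator = 1.0 if a == chosen_reply else 0.0
            self.theta[a] += lr * return_value * (indicator - p[a])

def train_initial_agent(agent, n_rounds, preset_return=0.7):
    returns = []
    for _consultation, feedback, return_value in n_rounds:  # N rounds of interaction data
        agent.reinforce_update(feedback, return_value)
        returns.append(return_value)
    average_return = sum(returns) / max(len(returns), 1)
    # Preset condition (assumed form): average return above the preset return value.
    return agent if average_return > preset_return else None

agent = TinyPolicyAgent(["premium is 1200", "policy cancelled"])
basic_agent = train_initial_agent(agent, [("query premium", "premium is 1200", 0.9)])
print(basic_agent is not None)  # True: the updated agent qualifies as the basic agent
```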
In one implementation, the training module 304 is specifically configured to:
calling an agent combination to interact with the basic simulator at least once based on the second interaction target subset;
in the I-th interaction of the at least one interaction, acquiring a first interaction round number U corresponding to the rule agent in the agent combination and a second interaction round number V corresponding to the basic agent, wherein I, U and V are positive integers;
calling the rule agent to interact with the basic simulator based on U interaction targets in the second interaction target subset to obtain a first interaction data subset;
calling the basic agent to interact with the basic simulator based on the V interactive targets in the second interactive target subset to obtain a second interactive data subset;
updating parameters in the base simulator based on the first subset of interaction data and the second subset of interaction data;
if the basic simulator after the parameter updating does not meet the first preset condition, adjusting the first interaction round number corresponding to the rule agent in the agent combination and the second interaction round number corresponding to the basic agent to obtain an agent combination after the interaction round number updating, and performing interactive training on the basic simulator based on the agent combination after the interaction round number updating in the (I+1)-th interaction;
and if the basic simulator after the parameter updating meets a first preset condition, determining the basic simulator after the parameter updating as a target simulator.
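A hedged Python sketch of this agent-combination training of the basic simulator follows. The interaction, parameter-update and condition checks are passed in as callables because the embodiment leaves the concrete networks and the first preset condition open, and the round-adjustment policy used here (shifting rounds from the rule agent toward the basic agent) is only an assumed example.

```python
# Sketch of training the basic simulator against the agent combination:
# U targets handled by the rule agent and V targets by the basic agent per
# interaction I, followed by a simulator parameter update.

def train_basic_simulator(second_subset, interact_rule_agent, interact_basic_agent,
                          update_simulator, meets_first_condition,
                          u=3, v=3, max_interactions=10):
    for _ in range(max_interactions):  # interactions I = 1, 2, ...
        # U interaction targets for the rule agent, V for the basic agent.
        first_data = [interact_rule_agent(t) for t in second_subset[:u]]
        second_data = [interact_basic_agent(t) for t in second_subset[u:u + v]]
        update_simulator(first_data + second_data)  # update simulator parameters
        if meets_first_condition():
            return "target simulator"  # first preset condition met
        u, v = max(u - 1, 1), v + 1  # adjusted round numbers for interaction I + 1
    return None

# Stub usage: the (dummy) first preset condition flips after two updates.
state = {"updates": 0}
result = train_basic_simulator(
    ["t1", "t2", "t3", "t4", "t5", "t6"],
    interact_rule_agent=lambda t: (t, "rule"),
    interact_basic_agent=lambda t: (t, "learned"),
    update_simulator=lambda data: state.update(updates=state["updates"] + 1),
    meets_first_condition=lambda: state["updates"] >= 2,
)
print(result)  # "target simulator"
```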
In one implementation, the training module 304 is specifically configured to:
calling a simulator combination to interact with the basic agent for a plurality of times based on the third interaction target subset;
in the J-th interaction of the multiple interactions, acquiring a third interaction round number X corresponding to the rule simulator in the simulator combination and a fourth interaction round number Y corresponding to the target simulator, wherein J, X and Y are positive integers;
calling the rule simulator to interact with the basic agent based on X interactive targets in the third interactive target subset to obtain a third interactive data subset;
calling the target simulator to interact with the basic agent based on Y interactive targets in the third interactive target subset to obtain a fourth interactive data subset;
updating parameters in the base agent based on the third subset of interaction data and the fourth subset of interaction data;
if the basic agent after the parameter updating does not meet a second preset condition, adjusting the third interaction round number corresponding to the rule simulator in the simulator combination and the fourth interaction round number corresponding to the target simulator to obtain a simulator combination after the interaction round number updating, and performing interactive training on the basic agent based on the simulator combination after the interaction round number updating in the (J+1)-th interaction;
and if the basic agent after the parameter updating meets the second preset condition, determining the basic agent after the parameter updating as a target agent.
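The simulator-combination phase mirrors the previous sketch with the roles swapped; the fragment below only highlights the X/Y round adjustment and the stopping test. The form of the second preset condition (average return value above a threshold) and the adjustment rule are assumptions, and usage is analogous to the stub shown for the simulator-training sketch.

```python
# Symmetric sketch: X rounds against the rule simulator and Y rounds against
# the target simulator per interaction J, followed by an agent update.

def train_basic_agent(third_subset, interact_rule_sim, interact_target_sim,
                      update_agent, average_return, preset_return=0.8,
                      x=3, y=3, max_interactions=10):
    for _ in range(max_interactions):  # interactions J = 1, 2, ...
        third_data = [interact_rule_sim(t) for t in third_subset[:x]]
        fourth_data = [interact_target_sim(t) for t in third_subset[x:x + y]]
        update_agent(third_data + fourth_data)  # update agent parameters
        if average_return() > preset_return:  # second preset condition (assumed form)
            return "target agent"
        # Condition not met: rely more on the learned target simulator in J + 1.
        x, y = max(x - 1, 1), y + 1
    return None
```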
In one implementation, the training module 304 is further configured to:
acquiring a second interactive data set obtained by at least one test user interacting with the target intelligent agent, wherein the second interactive data set comprises at least one round of interactive data, and each round of interactive data comprises a test score, consultation information output by the test user and feedback information output by the target intelligent agent;
training the target agent based on the second interaction data set so as to update parameters in the target agent and obtain a target agent with updated parameters;
and calling the target agent with the updated parameters to interact with the target user.
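The optional fine-tuning step with test users can be sketched as below; the data layout of the second interaction data set and the reuse of the test score as the reward signal are assumptions for illustration only.

```python
# Sketch of fine-tuning the target agent on rounds collected from test
# users: each round carries a test score, treated here as the reward for
# one more parameter update of the target agent.

def fine_tune_target_agent(apply_update, second_interaction_data_set):
    """second_interaction_data_set: rounds with keys 'score',
    'consultation' (from the test user) and 'feedback' (from the agent)."""
    for round_data in second_interaction_data_set:
        apply_update(
            consultation=round_data["consultation"],
            feedback=round_data["feedback"],
            reward=round_data["score"],  # test score reused as the return value
        )

updates = []
fine_tune_target_agent(
    lambda **kw: updates.append(kw),
    [{"score": 0.9, "consultation": "query premium", "feedback": "the premium is 1200"}],
)
print(len(updates), "parameter update(s) applied before serving the target user")
```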
In the embodiment of the present invention, an obtaining module 301 obtains an interaction target set, a screening module 302 screens out a first interaction target subset from the interaction target set, a calling module 303 calls a rule simulator and a rule agent to interact based on the first interaction target subset to obtain a first interaction data set, a training module 304 trains an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent, the training module 304 calls a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator, and the training module 304 calls a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain a target agent. By implementing this method, the simulator and the intelligent agent can be trained alternately in the form of interaction between a simulator combination and the intelligent agent, which improves the training efficiency of the intelligent agent.
Fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention. As shown in fig. 4, the terminal includes: at least one processor 401, an input device 403, an output device 404, a memory 405, and at least one communication bus 402. The communication bus 402 is used to implement connection and communication between these components. The input device 403 may be a control panel or a microphone, and the output device 404 may be a display screen. The memory 405 may be a high-speed RAM memory or a non-volatile memory, such as at least one magnetic disk memory. The memory 405 may optionally be at least one storage device located remotely from the aforementioned processor 401. The processor 401 may be combined with the apparatus described in fig. 3; the memory 405 stores a set of program codes, and the processor 401, the input device 403, and the output device 404 call the program codes stored in the memory 405 to perform the following operations:
the processor 401 is configured to obtain an interaction target set, and screen out a first interaction target subset from the interaction target set;
a processor 401, configured to invoke a rule simulator based on the first interaction target subset to interact with a rule agent to obtain a first interaction data set, where the rule simulator is a simulator constructed based on a first preset rule, and the rule agent is an agent constructed based on a second preset rule;
a processor 401, configured to train an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent, where the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
a processor 401, configured to screen a second interaction target subset from the interaction target set, and invoke an agent combination to perform interaction training on the basic simulator based on the second interaction target subset, so as to update parameters in the basic simulator and obtain a target simulator, where the agent combination includes the rule agent and the basic agent;
a processor 401, configured to screen a third interaction target subset from the interaction target set, and invoke a simulator combination to perform interaction training on the basic agent based on the third interaction target subset, so as to update parameters in the basic agent, so as to obtain a target agent, where the simulator combination includes the rule simulator and the target simulator.
In one implementation, the processor 401 is specifically configured to:
acquiring a target application scene corresponding to an initial agent to be trained;
acquiring a target interaction record under the target application scene from a historical record, wherein the target interaction record comprises interaction records of a user and an intelligent agent under the target application scene;
and acquiring at least one historical interaction target from the target interaction record, and screening K interaction targets matched with the historical interaction targets from the interaction target set to serve as a first interaction target subset, wherein K is a positive integer.
In one implementation, the processor 401 is specifically configured to:
determining first consulting information based on a reference interaction target, and calling the rule simulator to send the first consulting information to the rule agent;
calling the rule agent to output corresponding first feedback information to the rule simulator based on the keywords in the first consultation information;
determining the first consultation information and the first feedback information as first round of interaction data, and determining a first return value corresponding to the first round of interaction data based on the matching degree between the first feedback information and the reference interaction target;
if the matching degree between the first feedback information and the reference interaction target is smaller than a preset matching degree, calling the rule simulator to output corresponding second consultation information to the rule agent based on the keywords in the first feedback information;
calling the rule agent to output corresponding second feedback information to the rule simulator based on the keywords in the second consultation information;
determining the second consultation information and the second feedback information as second round of interaction data, and determining a second return value corresponding to the second round of interaction data based on the matching degree between the second feedback information and the reference interaction target;
if the matching degree between the second feedback information and the reference interaction target is greater than a preset matching degree, stopping calling the rule simulator and the rule agent for interaction, and adding the first round of interaction data, the first return value, the second round of interaction data and the second return value to a first interaction data set.
In one implementation, the processor 401 is specifically configured to:
screening N rounds of interaction data from the first interaction data set, wherein each round of interaction data in the N rounds of interaction data comprises consultation information output by the rule simulator, feedback information output by the rule agent and a return value, and N is a positive integer;
calling the N rounds of interactive data to carry out iterative training on the initial agent based on a training mode of reinforcement learning so as to update parameters in the initial agent;
and if the initial agent with the updated parameters meets preset conditions, determining the initial agent with the updated parameters as a basic agent, wherein the preset conditions comprise that the average return value obtained by carrying out multiple rounds of interaction with the rule simulator is higher than a preset return value.
In one implementation, the processor 401 is specifically configured to:
calling an agent combination to interact with the basic simulator at least once based on the second interaction target subset;
in the I-th interaction of the at least one interaction, acquiring a first interaction round number U corresponding to the rule agent in the agent combination and a second interaction round number V corresponding to the basic agent, wherein I, U and V are positive integers;
calling the rule agent to interact with the basic simulator based on U interaction targets in the second interaction target subset to obtain a first interaction data subset;
calling the basic agent to interact with the basic simulator based on the V interactive targets in the second interactive target subset to obtain a second interactive data subset;
updating parameters in the base simulator based on the first subset of interaction data and the second subset of interaction data;
if the basic simulator after the parameter updating does not meet the first preset condition, adjusting the first interaction round number corresponding to the rule agent in the agent combination and the second interaction round number corresponding to the basic agent to obtain an agent combination after the interaction round number updating, and performing interactive training on the basic simulator based on the agent combination after the interaction round number updating in the (I+1)-th interaction;
and if the basic simulator after the parameter updating meets a first preset condition, determining the basic simulator after the parameter updating as a target simulator.
In one implementation, the processor 401 is specifically configured to:
calling a simulator combination to interact with the basic agent for a plurality of times based on the third interaction target subset;
in the J-th interaction of the multiple interactions, acquiring a third interaction round number X corresponding to the rule simulator in the simulator combination and a fourth interaction round number Y corresponding to the target simulator, wherein J, X and Y are positive integers;
calling the rule simulator to interact with the basic agent based on X interactive targets in the third interactive target subset to obtain a third interactive data subset;
calling the target simulator to interact with the basic agent based on Y interactive targets in the third interactive target subset to obtain a fourth interactive data subset;
updating parameters in the base agent based on the third subset of interaction data and the fourth subset of interaction data;
if the basic agent after the parameter updating does not meet a second preset condition, adjusting the third interaction round number corresponding to the rule simulator in the simulator combination and the fourth interaction round number corresponding to the target simulator to obtain a simulator combination after the interaction round number updating, and performing interactive training on the basic agent based on the simulator combination after the interaction round number updating in the (J+1)-th interaction;
and if the basic agent after the parameter updating meets the second preset condition, determining the basic agent after the parameter updating as a target agent.
In one implementation, the processor 401 is specifically configured to:
acquiring a second interactive data set obtained by at least one test user interacting with the target intelligent agent, wherein the second interactive data set comprises at least one round of interactive data, and each round of interactive data comprises a test score, consultation information output by the test user and feedback information output by the target intelligent agent;
training the target agent based on the second interaction data set so as to update parameters in the target agent and obtain a target agent with updated parameters;
and calling the target agent with the updated parameters to interact with the target user.
In the embodiment of the invention, the processor 401 obtains an interaction target set, screens out a first interaction target subset from the interaction target set, calls a rule simulator and a rule agent to interact based on the first interaction target subset to obtain a first interaction data set, trains an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent, calls a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator, and calls a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain a target agent. By implementing this method, the simulator and the intelligent agent can be trained alternately in the form of interaction between a simulator combination and the intelligent agent, which improves the training efficiency of the intelligent agent.
The modules in the embodiments of the present invention may be implemented by a general-purpose integrated circuit, such as a CPU (Central Processing Unit), or by an ASIC (Application Specific Integrated Circuit).
It should be understood that, in the embodiments of the present invention, the processor 401 may be a Central Processing Unit (CPU), and may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The bus 402 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 4, but this does not mean that there is only one bus or only one type of bus.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer storage medium and which, when executed, may include the processes of the embodiments of the methods described above. The computer storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present invention and is, of course, not intended to limit the scope of the present invention, which is defined by the appended claims.

Claims (10)

1. A method of interactive training for an agent, the method comprising:
acquiring an interactive target set, and screening a first interactive target subset from the interactive target set;
calling a rule simulator and a rule agent to interact on the basis of the first interaction target subset to obtain a first interaction data set, wherein the rule simulator is a simulator constructed on the basis of a first preset rule, and the rule agent is an agent constructed on the basis of a second preset rule;
training an initial simulator and an initial agent based on the first interactive data set to obtain a basic simulator and a basic agent, wherein the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
screening a second interaction target subset from the interaction target set, calling an agent combination based on the second interaction target subset to carry out interaction training on the basic simulator so as to update parameters in the basic simulator, and obtaining a target simulator, wherein the agent combination comprises the rule agent and the basic agent;
and screening a third interaction target subset from the interaction target set, calling a simulator combination based on the third interaction target subset to carry out interactive training on the basic agent so as to update parameters in the basic agent to obtain a target agent, wherein the simulator combination comprises the rule simulator and the target simulator.
2. The method of claim 1, wherein the screening out a first subset of interaction targets from the set of interaction targets comprises:
acquiring a target application scene corresponding to an initial agent to be trained;
acquiring a target interaction record under the target application scene from a historical record, wherein the target interaction record comprises interaction records of a user and an intelligent agent under the target application scene;
and acquiring at least one historical interaction target from the target interaction record, and screening K interaction targets matched with the historical interaction targets from the interaction target set to serve as a first interaction target subset, wherein K is a positive integer.
3. The method according to claim 1, wherein the first preset rule is to output corresponding consultation information based on the keywords in the obtained feedback information, the second preset rule is to output corresponding feedback information based on the keywords in the obtained consultation information, and a manner of invoking a rule simulator and a rule agent to interact based on any reference interaction target in the first interaction target subset comprises:
determining first consulting information based on a reference interaction target, and calling the rule simulator to send the first consulting information to the rule agent;
calling the rule agent to output corresponding first feedback information to the rule simulator based on the keywords in the first consultation information;
determining the first consultation information and the first feedback information as first round of interaction data, and determining a first return value corresponding to the first round of interaction data based on the matching degree between the first feedback information and the reference interaction target;
if the matching degree between the first feedback information and the reference interaction target is smaller than a preset matching degree, calling the rule simulator to output corresponding second consultation information to the rule agent based on the keywords in the first feedback information;
calling the rule agent to output corresponding second feedback information to the rule simulator based on the keywords in the second consultation information;
determining the second consultation information and the second feedback information as second round of interaction data, and determining a second return value corresponding to the second round of interaction data based on the matching degree between the second feedback information and the reference interaction target;
if the matching degree between the second feedback information and the reference interaction target is greater than a preset matching degree, stopping calling the rule simulator and the rule agent for interaction, and adding the first round of interaction data, the first return value, the second round of interaction data and the second return value to a first interaction data set.
4. The method of claim 3, wherein the first interaction data set comprises at least one round of interaction data, and wherein training an initial agent based on the first interaction data set to obtain a basic agent comprises:
screening N rounds of interaction data from the first interaction data set, wherein each round of interaction data in the N rounds of interaction data comprises consultation information output by the rule simulator, feedback information output by the rule agent and a return value, and N is a positive integer;
calling the N rounds of interactive data to carry out iterative training on the initial agent based on a training mode of reinforcement learning so as to update parameters in the initial agent;
and if the initial agent with the updated parameters meets preset conditions, determining the initial agent with the updated parameters as a basic agent, wherein the preset conditions comprise that the average return value obtained by carrying out multiple rounds of interaction with the rule simulator is higher than a preset return value.
5. The method of claim 1, wherein invoking an agent combination to interactively train the basic simulator based on the second interaction target subset comprises:
calling an agent combination to interact with the basic simulator at least once based on the second interaction target subset;
in the I-th interaction of the at least one interaction, acquiring a first interaction round number U corresponding to the rule agent in the agent combination and a second interaction round number V corresponding to the basic agent, wherein I, U and V are positive integers;
calling the rule agent to interact with the basic simulator based on U interaction targets in the second interaction target subset to obtain a first interaction data subset;
calling the basic agent to interact with the basic simulator based on the V interactive targets in the second interactive target subset to obtain a second interactive data subset;
updating parameters in the base simulator based on the first subset of interaction data and the second subset of interaction data;
if the basic simulator after the parameter updating does not meet the first preset condition, adjusting the first interaction round number corresponding to the rule agent in the agent combination and the second interaction round number corresponding to the basic agent to obtain an agent combination after the interaction round number updating, and performing interactive training on the basic simulator based on the agent combination after the interaction round number updating in the (I+1)-th interaction;
and if the basic simulator after the parameter updating meets a first preset condition, determining the basic simulator after the parameter updating as a target simulator.
6. The method of claim 1, wherein invoking a simulator combination to interactively train the basic agent based on the third interaction target subset comprises:
calling a simulator combination to interact with the basic agent for a plurality of times based on the third interaction target subset;
in the J-th interaction of the multiple interactions, acquiring a third interaction round number X corresponding to the rule simulator in the simulator combination and a fourth interaction round number Y corresponding to the target simulator, wherein J, X and Y are positive integers;
calling the rule simulator to interact with the basic agent based on X interactive targets in the third interactive target subset to obtain a third interactive data subset;
calling the target simulator to interact with the basic agent based on Y interactive targets in the third interactive target subset to obtain a fourth interactive data subset;
updating parameters in the base agent based on the third subset of interaction data and the fourth subset of interaction data;
if the basic agent after the parameter updating does not meet a second preset condition, adjusting the third interaction round number corresponding to the rule simulator in the simulator combination and the fourth interaction round number corresponding to the target simulator to obtain a simulator combination after the interaction round number updating, and performing interactive training on the basic agent based on the simulator combination after the interaction round number updating in the (J+1)-th interaction;
and if the basic agent after the parameter updating meets the second preset condition, determining the basic agent after the parameter updating as a target agent.
7. The method of claim 1, wherein after invoking the simulator combination to interactively train the basic agent based on the third interaction target subset to update parameters in the basic agent, the method further comprises:
acquiring a second interactive data set obtained by at least one test user interacting with the target intelligent agent, wherein the second interactive data set comprises at least one round of interactive data, and each round of interactive data comprises a test score, consultation information output by the test user and feedback information output by the target intelligent agent;
training the target agent based on the second interaction data set so as to update parameters in the target agent and obtain a target agent with updated parameters;
and calling the target agent with the updated parameters to interact with the target user.
8. An interactive training apparatus for an intelligent agent, the apparatus comprising:
the acquisition module is used for acquiring an interaction target set;
the screening module is used for screening out a first interaction target subset from the interaction target set;
the calling module is used for calling a rule simulator based on the first interaction target subset to interact with a rule agent to obtain a first interaction data set, the rule simulator is a simulator constructed based on a first preset rule, and the rule agent is an agent constructed based on a second preset rule;
the training module is used for training an initial simulator and an initial agent based on the first interactive data set to obtain a basic simulator and a basic agent, the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
the screening module is further configured to screen a second subset of interaction targets from the set of interaction targets,
the training module is further configured to invoke an agent combination to perform interactive training on the basic simulator based on the second interaction target subset, so as to update parameters in the basic simulator, and obtain a target simulator, where the agent combination includes the rule agent and the basic agent;
the screening module is further configured to screen a third interaction target subset from the interaction target set;
the training module is further configured to invoke a simulator combination based on the third interactive target subset to perform interactive training on the basic agent, so as to update parameters in the basic agent, and obtain a target agent, where the simulator combination includes the rule simulator and the target simulator.
9. A terminal, comprising a processor and a memory, wherein the memory is configured to store a computer program comprising program instructions, wherein the processor is configured to invoke the program instructions to perform the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.
CN202110288790.8A 2021-03-18 2021-03-18 Interactive training method and device for intelligent agent, terminal and storage medium Active CN112836036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110288790.8A CN112836036B (en) 2021-03-18 2021-03-18 Interactive training method and device for intelligent agent, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN112836036A (en) 2021-05-25
CN112836036B CN112836036B (en) 2023-09-08

Family

ID=75930225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110288790.8A Active CN112836036B (en) 2021-03-18 2021-03-18 Interactive training method and device for intelligent agent, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN112836036B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806512A (en) * 2021-09-30 2021-12-17 中国平安人寿保险股份有限公司 Robot dialogue model training method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102789732A (en) * 2012-08-08 2012-11-21 四川大学华西医院 Transesophageal ultrasonic visual simulation system and method used for teaching and clinical skill training
CN110882542A (en) * 2019-11-13 2020-03-17 广州多益网络股份有限公司 Training method, device, equipment and storage medium for game agent
CN111488992A (en) * 2020-03-03 2020-08-04 中国电子科技集团公司第五十二研究所 Simulator adversary reinforcing device based on artificial intelligence
CN112420125A (en) * 2020-11-30 2021-02-26 腾讯科技(深圳)有限公司 Molecular attribute prediction method and device, intelligent equipment and terminal

Also Published As

Publication number Publication date
CN112836036B (en) 2023-09-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant