CN112836036A - Interactive training method, device, terminal and storage medium for intelligent agent - Google Patents

Info

Publication number
CN112836036A
CN112836036A (application CN202110288790.8A)
Authority
CN
China
Prior art keywords
interaction
agent
simulator
target
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110288790.8A
Other languages
Chinese (zh)
Other versions
CN112836036B (English)
Inventor
毋杰
周凯捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202110288790.8A
Publication of CN112836036A
Application granted
Publication of CN112836036B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 5/00 Computing arrangements using knowledge-based models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses an interactive training method, device, terminal and storage medium for an agent, belonging to the technical field of intelligent decision-making. The method comprises: calling a rule simulator and a rule agent to interact based on a first interaction target subset to obtain a first interaction data set; training an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent; calling a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator; and calling a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain the target agent. By implementing this method, the simulator and the agent can be trained alternately based on interaction between the simulator or agent combination and its counterpart, which improves the training efficiency for the agent.

Description

Interactive training method, device, terminal and storage medium for intelligent agent
Technical Field
The invention relates to the technical field of computers, in particular to an interactive training method, an interactive training device, a terminal and a storage medium for an agent.
Background
The task-oriented dialog system is an important research field with high application value. Such a dialog system can assist the user in efficiently completing a specified task through natural language dialog. Task-oriented dialog systems have already been applied in many scenarios, such as movie ticket purchase, airline ticket booking, and hotel reservation.
At present, during the operation of a task-oriented dialog system, an agent interacts with users. Training such an agent requires a large amount of real-time interaction with users, that is, the agent is trained through interaction with a large number of users. However, this training mode requires continuous user participation, which leads to high training cost and an overly long training time.
Disclosure of Invention
The embodiment of the invention provides an interactive training method, device, terminal and storage medium for an agent, which can alternately train the simulator and the agent based on interaction between a simulator combination and the agent, thereby improving the training efficiency for the agent.
In one aspect, an embodiment of the present invention provides an interactive training method for an agent, where the method includes:
acquiring an interactive target set, and screening a first interactive target subset from the interactive target set;
calling a rule simulator and a rule agent to interact on the basis of the first interaction target subset to obtain a first interaction data set, wherein the rule simulator is a simulator constructed on the basis of a first preset rule, and the rule agent is an agent constructed on the basis of a second preset rule;
training an initial simulator and an initial agent based on the first interactive data set to obtain a basic simulator and a basic agent, wherein the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
screening a second interaction target subset from the interaction target set, and calling an agent combination based on the second interaction target subset to interactively train the basic simulator so as to update parameters in the basic simulator and obtain a target simulator, wherein the agent combination comprises the rule agent and the basic agent;
and screening a third interaction target subset from the interaction target set, calling a simulator combination based on the third interaction target subset to carry out interactive training on the basic agent so as to update parameters in the basic agent to obtain a target agent, wherein the simulator combination comprises the rule simulator and the target simulator.
In one aspect, an embodiment of the present invention provides an interactive training apparatus for an agent, where the apparatus includes:
the acquisition module is used for acquiring an interaction target set;
the screening module is used for screening out a first interaction target subset from the interaction target set;
the calling module is used for calling a rule simulator based on the first interaction target subset to interact with a rule agent to obtain a first interaction data set, the rule simulator is a simulator constructed based on a first preset rule, and the rule agent is an agent constructed based on a second preset rule;
the training module is used for training an initial simulator and an initial agent based on the first interactive data set to obtain a basic simulator and a basic agent, the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
the screening module is further configured to screen a second subset of interaction targets from the set of interaction targets,
the training module is further configured to invoke an agent combination to perform interactive training on the basic simulator based on the second interactive target subset, so as to update parameters in the basic simulator and obtain a target simulator, where the agent combination includes the rule agent and the basic agent;
the screening module is further configured to screen a third interaction target subset from the interaction target set;
the training module is further configured to invoke a simulator combination based on the third interactive target subset to perform interactive training on the basic agent, so as to update parameters in the basic agent, and obtain a target agent, where the simulator combination includes the rule simulator and the target simulator.
In one aspect, an embodiment of the present invention provides a terminal, including a processor and a memory, where the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the above interactive training method for an agent.
In one aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the interactive training method for an agent.
In the embodiment of the invention, the terminal calls a rule simulator and a rule agent to interact based on a first interaction target subset to obtain a first interaction data set; trains an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent; calls a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator; and calls a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain the target agent. By implementing this method, the simulator and the agent can be trained alternately based on interaction between the simulator or agent combination and its counterpart, which improves the training efficiency for the agent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
FIG. 1 is a schematic flow chart of an interactive training method for an agent according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of another interactive training method for an agent according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an interactive training apparatus for an agent according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The interactive training method for the intelligent agent is realized on a terminal, and the terminal comprises electronic equipment such as a smart phone, a tablet computer, a digital audio and video player, an electronic reader, a handheld game machine or vehicle-mounted electronic equipment.
Fig. 1 is a schematic flowchart of an interactive training method for an agent in an embodiment of the present invention, and as shown in fig. 1, a flowchart of the interactive training method for an agent in the embodiment may include:
s101, acquiring an interaction target set, and screening a first interaction target subset from the interaction target set.
In the embodiment of the present invention, the interaction target set includes multiple interaction targets, where an interaction target is a goal that needs to be reached during one interaction with the agent. For example, in an airline ticket booking scenario, one interaction target may be: Saturday morning, Shanghai to Chongqing, departing from Pudong Airport, a China Eastern Airlines ticket. The agent may specifically be the executing device of an air ticket booking system, an insurance purchasing system, an intelligent customer service system, or the like, and in practice interacts with users to meet their requirements. The agent can be obtained through deep learning training; specifically, reinforcement learning can be adopted, in which the agent is trained through interaction between a simulator and the agent, so that the agent acquires good response capability. During the training of the agent, multiple interaction targets are screened from the interaction target set, and the simulator is called to interact with the agent based on these interaction targets, so as to achieve the purpose of training the agent.
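Purely for illustration, an interaction target of the above airline-ticket example could be represented as a slot-value structure such as the following Python sketch; the patent does not prescribe any particular data format, so the field names are assumptions.

```python
# One possible representation of an interaction target for the airline-ticket
# booking scene (format and field names are assumed, not part of the patent).
interaction_target = {
    "departure_time": "Saturday morning",
    "origin": "Shanghai",
    "destination": "Chongqing",
    "departure_airport": "Pudong Airport",
    "airline": "China Eastern Airlines",
}
# The interaction target set simply holds many such targets.
interaction_target_set = [interaction_target]
```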
In specific implementation, the terminal may obtain a pre-constructed interaction target set, and screen a first interaction target subset from the interaction target set, where the first interaction target subset may be multiple interaction targets in the interaction target set, and each interaction target may be used for interaction between a rule simulator and a rule agent to be constructed subsequently.
In one embodiment, the terminal may screen out the first interaction target subset from the interaction target set as follows: the terminal acquires a target application scene corresponding to the initial agent to be trained; acquires, from historical records, target interaction records under the target application scene, where the target interaction records include interaction records between users and an agent in the target application scene; acquires at least one historical interaction target from the target interaction records; and screens out, from the interaction target set, K interaction targets that match the historical interaction targets as the first interaction target subset, where K is a positive integer. The application scene may be an airline ticket booking scene, an insurance purchasing scene, an online shopping scene, or the like; different application scenes contain interaction records between different users and agents; the agent involved may be an agent that has already been trained or an agent constructed based on rules for interacting with users; and the target application scene may be any application scene.
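As a rough illustration of this screening step, the following sketch scores each candidate interaction target by keyword overlap with the historical interaction targets and keeps the top K; the function name, the overlap measure and the whitespace tokenization are assumptions made only for the example.

```python
def screen_first_target_subset(interaction_targets, historical_targets, k):
    """Pick the K interaction targets that best match the historical
    interaction targets of the target application scene (illustrative only)."""
    def overlap(candidate, history):
        # Crude keyword overlap: shared words / words in the candidate target.
        cand_words, hist_words = set(candidate.split()), set(history.split())
        return len(cand_words & hist_words) / max(len(cand_words), 1)

    scored = [(max(overlap(t, h) for h in historical_targets), t)
              for t in interaction_targets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [target for _, target in scored[:k]]

# Example usage for an airline-ticket booking scene.
targets = ["Saturday morning Shanghai to Chongqing ticket",
           "Sunday evening Beijing to Shenzhen ticket",
           "Saturday morning Shanghai to Chengdu ticket"]
history = ["Saturday Shanghai Chongqing ticket"]
first_subset = screen_first_target_subset(targets, history, k=2)
```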
S102, calling a rule simulator and a rule agent to interact based on the first interaction target subset to obtain a first interaction data set.
In the embodiment of the invention, after the terminal determines the first interaction target subset, the terminal can call the rule simulator and the rule agent to interact based on the first interaction target subset, wherein the rule simulator is constructed based on a first preset rule, and the rule agent is constructed based on a second preset rule.
In one implementation, the rule simulator is configured to simulate the content that a user would output to the agent during an interaction, and may specifically ask the agent questions according to predefined rules. For example, when the acquired content contains a location slot, the rule simulator outputs the location information in the interaction target (e.g., from Shanghai to Chongqing), and when the acquired content contains a time slot, it outputs the time information in the interaction target (e.g., Saturday morning). The rule agent does not know in advance the target information required by the user; it collects the user's required slots by asking questions, queries the database accordingly, and replies to the user's questions. Specifically, it may perform a corresponding operation based on a preset rule when input dialog content is detected. For example, if the second preset rule for constructing the rule agent is to query the database for information corresponding to the acquired keywords and feed the result back, and the keywords in the acquired content are "Saturday", "morning" and "ticket", the agent queries the database for all Saturday-morning tickets; when the number of tickets is less than a threshold, the tickets are displayed, and when the number of tickets is greater than the threshold, feedback information such as "What is the departure location?" is returned. The specific way of invoking the rule simulator and the rule agent to interact based on the first interaction target subset may be as follows: the rule simulator outputs all or part of the content of any target in the first interaction target subset to the rule agent, acquires the content returned by the rule agent, and continues interacting based on the returned content; the interaction stops when the preset target is reached, or when the number of interaction rounds reaches a preset number. For each interaction target, the rule simulator and the rule agent may be invoked to interact in the above manner, so as to obtain N rounds of interaction data, where one round of interaction data includes the content output by the rule simulator, the content fed back by the rule agent, and a return value. The return value may be determined by the degree of association between the content output by the rule simulator and the content fed back by the rule agent; for example, entities may be extracted from the content, and the degree of association determined by the distance between the entities in a knowledge graph, where the knowledge graph includes multiple entities connected based on the relationships between them, and two entities are determined to be related when they appear in the same text. Alternatively, the return value may be determined by human labeling. Further, the N rounds of interaction data are stored in an experience storage area (Replay Buffer) as the first interaction data set.
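The experience storage area (Replay Buffer) mentioned above could be sketched as a fixed-capacity buffer of per-round tuples, as shown below; the class layout and capacity are assumptions and do not reflect any specific implementation in the patent.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience storage area holding rounds of interaction data (sketch)."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest rounds are dropped first

    def add(self, consultation, feedback, return_value):
        # One round of interaction data: simulator output, agent output, return value.
        self.buffer.append((consultation, feedback, return_value))

    def sample(self, n):
        # Draw up to N rounds of interaction data for training.
        return random.sample(list(self.buffer), min(n, len(self.buffer)))

first_interaction_data_set = ReplayBuffer()
first_interaction_data_set.add("Saturday morning, Shanghai to Chongqing",
                               "Which airport do you depart from?", 0.3)
```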
In one implementation, the first preset rule is to output corresponding consultation information based on keywords in the acquired feedback information, and the second preset rule is to output corresponding feedback information based on keywords in the acquired consultation information. The terminal invokes the rule simulator and the rule agent to interact in the same way for each interaction target in the first interaction target subset. Specifically, the terminal determines first consultation information based on any reference interaction target in the first interaction target subset and calls the rule simulator to send it to the rule agent; calls the rule agent to output corresponding first feedback information to the rule simulator based on the keywords in the first consultation information; determines the first consultation information and the first feedback information as the first round of interaction data, and determines a first return value corresponding to the first round of interaction data based on the matching degree between the first feedback information and the reference interaction target. If the matching degree between the first feedback information and the reference interaction target is smaller than a preset matching degree, the terminal calls the rule simulator to output corresponding second consultation information to the rule agent based on the keywords in the first feedback information; calls the rule agent to output corresponding second feedback information to the rule simulator based on the keywords in the second consultation information; determines the second consultation information and the second feedback information as the second round of interaction data, and determines a second return value corresponding to the second round of interaction data based on the matching degree between the second feedback information and the reference interaction target. If the matching degree between the second feedback information and the reference interaction target is greater than the preset matching degree, the terminal stops calling the rule simulator and the rule agent for interaction, and adds the first round of interaction data, the first return value, the second round of interaction data and the second return value to the first interaction data set. Similarly, if the matching degree between the second feedback information and the reference interaction target is smaller than the preset matching degree, the rule simulator may continue to interact with the rule agent until, after multiple rounds of interaction, the matching degree between the feedback information output by the rule agent and the reference interaction target is greater than the preset matching degree, or until the number of interaction rounds between the rule agent and the rule simulator reaches the preset number of rounds. During the interaction, each round of consultation information, feedback information and return value is added to the first interaction data set. In this manner, the rule simulator and the rule agent can be called to interact based on each interaction target in the first interaction target subset, and the data generated by the interactions is added to the first interaction data set to obtain the complete first interaction data set.
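A condensed sketch of one such interaction episode is given below. The callables passed in stand for the rule simulator, the rule agent and the matching-degree calculation described in the next paragraph; their names, the threshold and the round limit are assumptions made only for illustration.

```python
def run_episode(first_consult, agent_reply, simulator_reply, matching_degree,
                reference_target, preset_matching_degree=0.8, max_rounds=10):
    """One rule-simulator / rule-agent interaction episode for a single
    reference interaction target (illustrative sketch)."""
    rounds = []
    consultation = first_consult(reference_target)
    for _ in range(max_rounds):
        feedback = agent_reply(consultation)               # rule agent: keyword-based reply
        return_value = matching_degree(feedback, reference_target)
        rounds.append((consultation, feedback, return_value))
        if return_value >= preset_matching_degree:         # target reached, stop interacting
            break
        consultation = simulator_reply(feedback)           # rule simulator: next consultation
    return rounds  # added to the first interaction data set
```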
The matching degree between the first feedback information and the reference interaction target may be calculated as follows: obtain the number of identical characters shared by the first feedback information and the reference interaction target, and determine the ratio of this number to the total number of characters in the reference interaction target as the matching degree. Alternatively, extract at least one first entity contained in the first feedback information and at least one second entity in the reference interaction target, obtain the number of identical entities among them and the total number of entities in the first and second entities, and determine the ratio between the number of identical entities and the total number as the matching degree. The entities may be extracted based on preset rules; the specific extraction manner may be a rule- and dictionary-based method, such as manually written rules that extract features such as keywords, indicator words and position words as entities, a conventional statistics-based machine learning method, a deep learning method, or the like. Alternatively, the matching degree may be pre-labeled by the developer.
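The two matching-degree variants described above might be written as in the following sketch. The character-level variant follows the ratio described in the text, and the entity-level variant treats the union of the extracted entities as the total number, which is one possible reading; the entity extractor is deliberately left as a placeholder.

```python
def char_matching_degree(feedback, reference_target):
    """Characters of the reference target that also appear in the feedback,
    divided by the total number of characters in the reference target."""
    shared = sum(1 for ch in reference_target if ch in feedback)
    return shared / max(len(reference_target), 1)

def entity_matching_degree(feedback, reference_target, extract_entities):
    """Identical entities divided by all entities found in either text;
    extract_entities stands for a rule-, dictionary- or model-based extractor."""
    first_entities = set(extract_entities(feedback))
    second_entities = set(extract_entities(reference_target))
    all_entities = first_entities | second_entities
    return len(first_entities & second_entities) / max(len(all_entities), 1)
```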
S103, training the initial simulator and the initial agent based on the first interactive data set to obtain a basic simulator and a basic agent.
In the embodiment of the invention, after the terminal acquires the first interaction data set, it trains the initial simulator and the initial agent based on the first interaction data set to obtain the basic simulator and the basic agent. The initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm. The deep learning algorithm may be a Convolutional Neural Network (CNN) algorithm, a Long Short-Term Memory (LSTM) network algorithm, or the like; for example, a single-layer LSTM + multi-layer perceptron + softmax classification layer may be used as the policy network to construct the initial simulator and the initial agent.
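As one concrete reading of the "single-layer LSTM + multi-layer perceptron + softmax classification layer" policy mentioned above, a minimal PyTorch sketch is given below; the vocabulary size, hidden dimensions and number of dialogue actions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Single-layer LSTM + multi-layer perceptron + softmax policy (sketch)."""

    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_actions=30):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # single layer
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded dialogue state or utterance
        embedded = self.embed(token_ids)
        _, (h_n, _) = self.lstm(embedded)        # h_n: (1, batch, hidden_dim)
        logits = self.mlp(h_n.squeeze(0))
        return torch.softmax(logits, dim=-1)     # distribution over dialogue actions
```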
In one implementation, the first interaction data set includes at least one round of interaction data, and the terminal may train the initial simulator based on the first interaction data set as follows: N rounds of interaction data are screened from the first interaction data set, where each of the N rounds of interaction data includes consultation information output by the rule simulator, feedback information output by the rule agent, and a return value, and N is a positive integer; the N rounds of interaction data are called to iteratively train the initial simulator in a reinforcement-learning manner so as to update the parameters in the initial simulator; and if the initial simulator after the parameter update meets a preset condition, the initial simulator after the parameter update is determined as the basic simulator, where the preset condition includes that the return value obtained by interacting with the rule agent is higher than a preset return value. Alternatively, the preset condition may be that the success rate obtained by interacting with the rule agent is higher than a preset success rate, where an interaction is determined to be successful when the inquiry of the interaction target is completed during the interaction. The preset condition may be set in advance by the developer.
In one implementation, the first interaction data set includes at least one round of interaction data, and the terminal may train the initial agent based on the first interaction data set as follows: N rounds of interaction data are screened from the first interaction data set, where each of the N rounds of interaction data includes consultation information output by the rule simulator, feedback information output by the rule agent, and a return value, and N is a positive integer; the N rounds of interaction data are called to iteratively train the initial agent in a reinforcement-learning manner so as to update the parameters in the initial agent; and if the initial agent after the parameter update meets a preset condition, the initial agent after the parameter update is determined as the basic agent, where the preset condition includes that the return value obtained by interacting with the rule simulator is higher than a preset return value. Alternatively, the preset condition may be that the success rate obtained by interacting with the rule simulator is higher than a preset success rate, where an interaction is determined to be successful when the inquiry of the interaction target is completed during the interaction. The preset condition may be set in advance by the developer.
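The patent does not fix a particular reinforcement-learning algorithm, so the sketch below uses a simple REINFORCE-style policy-gradient update over the sampled rounds as a stand-in, stopping once the average return exceeds the preset return value; the tuple layout (including an action identifier) and the encode helper are assumptions.

```python
import torch

def train_on_rounds(policy, optimizer, rounds, encode, preset_return=0.8):
    """One pass of reinforcement-learning training over N rounds of interaction
    data (illustrative). Each round is (consultation, feedback, action_id,
    return_value); encode turns text into a (1, seq_len) tensor of token ids."""
    returns = []
    for consultation, feedback, action_id, return_value in rounds:
        action_probs = policy(encode(consultation))        # distribution over actions
        log_prob = torch.log(action_probs[0, action_id] + 1e-8)
        loss = -log_prob * return_value                    # policy-gradient loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        returns.append(return_value)
    average_return = sum(returns) / max(len(returns), 1)
    return average_return >= preset_return                 # preset condition met?
```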
By the method, the basic simulator and the basic intelligent agent obtained by training have basic interaction capacity.
S104, screening a second interaction target subset from the interaction target set, calling an agent combination to carry out interaction training on the basic simulator based on the second interaction target subset, and updating parameters in the basic simulator to obtain the target simulator.
In the embodiment of the invention, after the terminal has trained and obtained the basic simulator and the basic agent, it can screen a second interaction target subset from the interaction target set and call an agent combination to interactively train the basic simulator based on the second interaction target subset, so as to update the parameters in the basic simulator and obtain the target simulator. The agent combination includes the rule agent and the basic agent. The second interaction target subset may be a plurality of interaction targets in the interaction target set, and each interaction target may be used for the subsequent interaction between the basic simulator and the agent combination.
In one implementation, the terminal may invoke the agent combination to interactively train the basic simulator based on the second interaction target subset as follows. The terminal calls the agent combination to interact with the basic simulator at least once based on the second interaction target subset. During the I-th interaction of the at least one interaction, the terminal acquires a first interaction round number U corresponding to the rule agent in the agent combination and a second interaction round number V corresponding to the basic agent, where I, U and V are positive integers; calls the rule agent to interact with the basic simulator based on U interaction targets in the second interaction target subset to obtain a first interaction data subset; calls the basic agent to interact with the basic simulator based on V interaction targets in the second interaction target subset to obtain a second interaction data subset; and updates the parameters in the basic simulator based on the first interaction data subset and the second interaction data subset. If the basic simulator after the parameter update does not meet a first preset condition, the first interaction round number corresponding to the rule agent and the second interaction round number corresponding to the basic agent in the agent combination are adjusted to obtain an agent combination with updated interaction round numbers, and the basic simulator is interactively trained based on this updated agent combination during the (I+1)-th interaction. If the basic simulator after the parameter update meets the first preset condition, it is determined as the target simulator. The terminal may determine whether the basic simulator after the parameter update meets the first preset condition as follows: the terminal screens out a test interaction target from the interaction target set and calls the basic simulator after the parameter update to interact with the agent combination based on the test interaction target to obtain test interaction data; if the test interaction data indicate that the completion degree of the test interaction target is higher than a preset completion degree, it is determined that the basic simulator after the parameter update meets the first preset condition. The completion degree may be judged by developers, or determined based on the matching degree between the feedback information in the interaction data and the test interaction target. In the above manner, by combining the agents and continuously adjusting the proportion in which the rule agent and the basic agent in the agent combination are used in each training round, the simulator can be trained better, the performance of the trained target simulator is improved, and the target simulator acquires a good ability to simulate real user dialogs.
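At a high level, the I-th interaction described above could be organized as in the following sketch, where U rounds are played by the rule agent and V rounds by the basic agent, and the round numbers are shifted toward the basic agent whenever the first preset condition is not met; all callables and the adjustment step are placeholders for illustration only.

```python
def train_simulator_with_agent_combination(base_simulator, rule_agent, base_agent,
                                            second_subset, run_rounds,
                                            update_simulator, meets_condition,
                                            u=90, v=10, step=10, max_interactions=20):
    """Interactive training of the basic simulator by an agent combination
    (sketch). run_rounds(agent, simulator, targets) returns interaction data,
    update_simulator applies a parameter update, and meets_condition checks
    the first preset condition on the updated simulator."""
    for _ in range(max_interactions):
        first_data = run_rounds(rule_agent, base_simulator, second_subset[:u])
        second_data = run_rounds(base_agent, base_simulator, second_subset[u:u + v])
        update_simulator(base_simulator, first_data + second_data)
        if meets_condition(base_simulator):
            return base_simulator                    # becomes the target simulator
        u, v = max(u - step, 0), v + step            # shift rounds toward the basic agent
    return base_simulator
```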
In one implementation, the terminal may also invoke the agent combination to interactively train the basic simulator based on the second interaction target subset as follows: the terminal calls the agent combination to interact with the basic simulator based on the second interaction target subset to obtain a second interaction data set, and trains the basic simulator based on the second interaction data set to obtain the target simulator. Here, the agent combination is a combination of the rule agent and the basic agent; in each training round, the proportion in which the rule agent and the basic agent in the agent combination are used can be adjusted, the agent combination continuously interacts with the basic simulator, the parameters in the basic simulator are continuously updated based on the interaction data generated by the interaction, and when the basic simulator after the parameter update meets the condition, it is determined as the target simulator. In this way, the trained target simulator acquires a good ability to simulate real user dialogs. The proportion in which the rule agent and the basic agent are used may be adjusted based on a rule. For example, if the adjustment rule is to increase the usage proportion of the basic agent by 10% each time until the success rate no longer improves, and the second interaction target subset includes 100 interaction targets, the initial proportion is 90% rule agent and 10% basic agent: in one training round, the basic agent is called to complete 10 question-and-answer interactions while the rule agent completes 90; in the next training round, the basic agent completes 20 and the rule agent completes 80. The parameters in the basic simulator are updated based on each interactive question-and-answer session to obtain the target simulator.
S105, screening a third interaction target subset from the interaction target set, calling a simulator combination based on the third interaction target subset to carry out interaction training on the basic intelligent agent so as to update parameters in the basic intelligent agent and obtain the target intelligent agent.
In the embodiment of the invention, after the terminal trains and obtains the target simulator, the target simulator and the rule simulator can be combined to obtain the simulator combination, a third interaction target subset is screened from the interaction target set, the simulator combination is called based on the third interaction target subset to carry out interaction training on the basic intelligent agent so as to update parameters in the basic intelligent agent and obtain the target intelligent agent, and the simulator combination comprises the rule simulator and the target simulator. Wherein the third subset of interaction targets may be a plurality of interaction targets in the set of interaction targets, and each interaction target may be used for interaction of the subsequently constructed simulator combination and the basic agent.
In one implementation, the terminal may invoke the simulator combination to interactively train the basic agent based on the third interaction target subset as follows. The terminal calls the simulator combination to interact with the basic agent multiple times based on the third interaction target subset. During the J-th interaction of the multiple interactions, the terminal acquires a third interaction round number X corresponding to the rule simulator in the simulator combination and a fourth interaction round number Y corresponding to the target simulator, where J, X and Y are positive integers; calls the rule simulator to interact with the basic agent based on X interaction targets in the third interaction target subset to obtain a third interaction data subset; calls the target simulator to interact with the basic agent based on Y interaction targets in the third interaction target subset to obtain a fourth interaction data subset; and updates the parameters in the basic agent based on the third interaction data subset and the fourth interaction data subset. If the basic agent after the parameter update does not meet a second preset condition, the third interaction round number corresponding to the rule simulator and the fourth interaction round number corresponding to the target simulator in the simulator combination are adjusted to obtain a simulator combination with updated interaction round numbers, and the basic agent is interactively trained based on this updated simulator combination during the (J+1)-th interaction. If the basic agent after the parameter update meets the second preset condition, it is determined as the target agent. The terminal may determine whether the basic agent after the parameter update meets the second preset condition as follows: the terminal screens out a test interaction target from the interaction target set and calls the basic agent after the parameter update to interact with the simulator combination based on the test interaction target to obtain test interaction data; if the test interaction data indicate that the completion degree of the test interaction target is higher than a preset completion degree, it is determined that the basic agent after the parameter update meets the second preset condition. In the above manner, by combining the rule simulator and the trained target simulator and continuously adjusting the proportion in which they are used in each training round, the basic agent can be trained better, the performance of the trained target agent is improved, and the target agent acquires a good ability to converse with real users.
In one implementation, the terminal may also invoke the simulator combination to interactively train the basic agent based on the third interaction target subset as follows: the terminal calls the simulator combination to interact with the basic agent based on the third interaction target subset to obtain a third interaction data set, and trains the basic agent based on the third interaction data set to obtain the target agent. Here, the simulator combination is a combination of the rule simulator and the target simulator; in each training round, the proportion in which the rule simulator and the target simulator in the simulator combination are used can be adjusted, the simulator combination continuously interacts with the basic agent, the parameters in the basic agent are continuously updated based on the interaction data generated by the interaction, and when the basic agent after the parameter update meets the condition, it is determined as the target agent. The target agent obtained in this way has a good ability to respond to content input by users.
In the embodiment of the invention, the terminal calls a rule simulator and a rule agent to interact based on a first interaction target subset to obtain a first interaction data set; trains an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent; calls a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator; and calls a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain the target agent. By implementing this method, the simulator and the agent can be trained alternately based on interaction between the simulator or agent combination and its counterpart, which improves the training efficiency for the agent.
Fig. 2 is a schematic flowchart of another interactive training method for an agent in the embodiment of the present invention, and as shown in fig. 2, the flowchart of the interactive training method for an agent in the embodiment may include:
s201, obtaining an interaction target set, and screening a first interaction target subset from the interaction target set.
In the embodiment of the invention, the interaction target set comprises multiple interaction targets, the interaction targets are specifically targets which need to be reached in the process of interacting with the intelligent agent once, and the interaction target set can be preset by research personnel.
S202, calling a rule simulator and a rule agent to interact based on the first interaction target subset to obtain a first interaction data set.
In the embodiment of the invention, after the terminal determines the first interaction target subset, the terminal can call the rule simulator and the rule intelligent agent to interact based on the first interaction target subset, so as to obtain the first interaction data set. The rule simulator is constructed based on a first preset rule, and the rule agent is constructed based on a second preset rule.
S203, training the initial simulator and the initial agent based on the first interactive data set to obtain a basic simulator and a basic agent.
In the embodiment of the invention, after the terminal acquires the first interactive data set, the initial simulator and the initial agent are trained based on the first interactive data set to obtain the basic simulator and the basic agent. The initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm.
S204, screening a second interaction target subset from the interaction target set, calling an agent combination to carry out interaction training on the basic simulator based on the second interaction target subset, and updating parameters in the basic simulator to obtain the target simulator.
In the embodiment of the invention, the agent combination includes the rule agent and the basic agent. The second interaction target subset may be a plurality of interaction targets in the interaction target set, and each interaction target may be used for the subsequent interaction between the basic simulator and the agent combination. The interactive training process comprises N rounds, and the proportion in which the rule agent and the basic agent in the agent combination are used differs across training rounds. For example, in the first training round, the rule agent and the basic simulator complete the interaction for 90 interaction targets in the second interaction target subset, and the basic agent and the basic simulator complete the interaction for 10 interaction targets in the second interaction target subset. After each training round, the target completion rate of that round is obtained; the number of interaction targets to be completed by the rule agent and the basic simulator is then reduced by t, and the number of interaction targets to be completed by the basic agent and the basic simulator is increased by t, until the target completion rate of a training round no longer increases, where t is a positive integer.
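The round schedule described above could be sketched as follows, where train_round runs one round of interactive training with the given split of interaction targets and returns its target completion rate; the function names and initial split are assumptions. The same schedule applies symmetrically in step S205, with the rule simulator and the target simulator taking the roles of the two components.

```python
def run_ratio_schedule(train_round, total_targets=100, start_learned=10, t=10):
    """Shift t interaction targets per training round from the rule component
    to the learned component until the target completion rate stops improving
    (illustrative sketch)."""
    learned = start_learned
    best_rate = float("-inf")
    while learned <= total_targets:
        completion_rate = train_round(total_targets - learned, learned)
        if completion_rate <= best_rate:   # completion rate no longer increases: stop
            break
        best_rate = completion_rate
        learned += t                       # give t more targets to the learned component
    return best_rate
```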
S205, a third interaction target subset is screened from the interaction target set, and a simulator combination is called to carry out interaction training on the basic intelligent agent based on the third interaction target subset so as to update parameters in the basic intelligent agent and obtain the target intelligent agent.
In the embodiment of the invention, the simulator combination includes the rule simulator and the target simulator. The third interaction target subset may be a plurality of interaction targets in the interaction target set, and each interaction target may be used for the subsequent interaction between the simulator combination and the basic agent. The interactive training process comprises N rounds, and the proportion in which the rule simulator and the target simulator in the simulator combination are used differs across training rounds. For example, in the first training round, the rule simulator and the basic agent complete the interaction for 90 interaction targets in the third interaction target subset, and the target simulator and the basic agent complete the interaction for 10 interaction targets in the third interaction target subset. After each training round, the target completion rate of that round is obtained; the number of interaction targets to be completed by the rule simulator and the basic agent is then reduced by t, and the number of interaction targets to be completed by the target simulator and the basic agent is increased by t, until the target completion rate of a training round no longer increases, where t is a positive integer.
S206, acquiring a second interactive data set obtained by interaction between at least one test user and the target intelligent agent.
In the embodiment of the invention, the second interaction data set includes at least one round of interaction data, and each round of interaction data includes a test score, consultation information output by a test user, and feedback information output by the target agent. Each test user can input information to the target agent to interact with it: the test user outputs consultation information during the interaction, and the target agent outputs corresponding feedback information. After the interaction is completed, the test user scores the feedback information output by the agent in that round of interaction, thereby producing the test score.
And S207, training the target agent based on the second interactive data set so as to update the parameters in the target agent and obtain the target agent with updated parameters.
In the embodiment of the invention, after the terminal acquires the second interactive data set, N rounds of interactive data in the second interactive data set can be called to carry out iterative training on the target agent based on a training mode of reinforcement learning so as to update parameters in the target agent; and calling K rounds of interactive data in the second interactive data set to test the target agent with the updated parameters, and if the test result indicates that the target agent with the updated parameters meets preset conditions, executing step S208.
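The fine-tuning stage might, for example, use the test scores as the reward signal for the N training rounds and hold out K rounds to check the preset condition, as in the sketch below; the helper functions, the score threshold and the data layout are assumptions made only for illustration.

```python
def finetune_with_test_users(target_agent, interaction_rounds, update_agent,
                             evaluate, n_train, k_test, score_threshold=4.0):
    """Fine-tune the target agent on test-user interaction data (sketch).
    Each round is (consultation, feedback, test_score); update_agent applies
    a reinforcement-learning update with the test score as the reward, and
    evaluate returns the average score on the held-out K rounds."""
    train_rounds = interaction_rounds[:n_train]
    test_rounds = interaction_rounds[n_train:n_train + k_test]
    for consultation, feedback, test_score in train_rounds:
        update_agent(target_agent, consultation, feedback, reward=test_score)
    # Preset condition: the held-out evaluation reaches the score threshold.
    return evaluate(target_agent, test_rounds) >= score_threshold
```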
And S208, calling the target agent with the updated parameters to interact with the target user.
In the embodiment of the invention, after the terminal obtains the target agent with updated parameters, the agent can be applied to actual interaction with users, and it can be continuously updated using the interaction data generated in subsequent real interactions with users, thereby improving its performance. In the above scheme, different stages of training the agent are completed based on different types of data: the first stage trains the agent based on constructed interaction data so that it acquires a preliminary response capability, and the agent is subsequently trained and tuned using interaction data from real users, so that its performance keeps improving.
In the embodiment of the invention, the terminal calls a rule simulator and a rule agent to interact based on a first interaction target subset to obtain a first interaction data set; trains an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent; calls a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator; calls a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain the target agent; and then trains and tunes the target agent using real interaction with users, so that the agent achieves better performance. By implementing this method, the simulator and the agent can be trained alternately based on interaction between the simulator or agent combination and its counterpart, followed by training and tuning based on a small number of interaction samples with real users, which improves the training efficiency for the agent.
The interactive training device for an agent according to the embodiment of the present invention will be described in detail with reference to FIG. 3. It should be noted that the interactive training device for an agent shown in FIG. 3 is used to execute the method of the embodiments of the present invention shown in FIGS. 1-2. For convenience of description, only the portions related to the embodiment of the present invention are shown; for technical details that are not disclosed here, reference is made to the embodiments shown in FIGS. 1-2.
Referring to FIG. 3, a schematic structural diagram of an interactive training device for an agent according to the present invention is shown. The interactive training device 30 for an agent may include: an acquisition module 301, a screening module 302, a calling module 303, and a training module 304.
An obtaining module 301, configured to obtain an interaction target set;
a screening module 302, configured to screen out a first subset of interaction targets from the set of interaction targets;
a calling module 303, configured to call a rule simulator based on the first interaction target subset to interact with a rule agent to obtain a first interaction data set, where the rule simulator is a simulator constructed based on a first preset rule, and the rule agent is an agent constructed based on a second preset rule;
a training module 304, configured to train an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent, where the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
the screening module 302 is further configured to screen out a second subset of interaction targets from the set of interaction targets,
the training module 304 is further configured to invoke an agent combination to perform interactive training on the basic simulator based on the second interactive target subset, so as to update parameters in the basic simulator and obtain a target simulator, where the agent combination includes the rule agent and the basic agent;
the screening module 302 is further configured to screen a third subset of interaction targets from the set of interaction targets;
the training module 304 is further configured to invoke a simulator combination based on the third interactive target subset to perform interactive training on the basic agent, so as to update parameters in the basic agent, and obtain a target agent, where the simulator combination includes the rule simulator and the target simulator.
In one implementation, the screening module 302 is specifically configured to:
acquiring a target application scene corresponding to an initial agent to be trained;
acquiring a target interaction record under the target application scene from a historical record, wherein the target interaction record comprises interaction records of a user and an intelligent agent under the target application scene;
and acquiring at least one historical interaction target from the target interaction record, and screening K interaction targets matched with the historical interaction targets from the interaction target set to serve as a first interaction target subset, wherein K is a positive integer.
In one implementation manner, the first preset rule is to output corresponding advisory information based on a keyword in the obtained feedback information, the second preset rule is to output corresponding feedback information based on a keyword in the obtained advisory information, and the calling module 303 is specifically configured to:
determining first consultation information based on a reference interaction target, and calling the rule simulator to send the first consultation information to the rule agent;
calling the rule agent to output corresponding first feedback information to the rule simulator based on the keywords in the first consultation information;
determining the first consultation information and the first feedback information as first round of interaction data, and determining a first return value corresponding to the first round of interaction data based on the matching degree between the first feedback information and the reference interaction target;
if the matching degree between the first feedback information and the reference interaction target is smaller than a preset matching degree, calling the rule simulator to output corresponding second consultation information to the rule agent based on the keywords in the first feedback information;
calling the rule agent to output corresponding second feedback information to the rule simulator based on the keywords in the second consultation information;
determining the second consultation information and the second feedback information as second round of interaction data, and determining a second return value corresponding to the second round of interaction data based on the matching degree between the second feedback information and the reference interaction target;
if the matching degree between the second feedback information and the reference interaction target is greater than a preset matching degree, stopping calling the rule simulator and the rule agent for interaction, and adding the first round of interaction data, the first return value, the second round of interaction data and the second return value to a first interaction data set.
In an implementation manner, the first interaction data set includes at least one round of interaction data, and the training module 304 is specifically configured to:
screening N rounds of interaction data from the first interaction data set, wherein each round of interaction data in the N rounds of interaction data comprises consultation information output by the rule simulator, feedback information output by the rule agent and a return value, and N is a positive integer;
calling the N rounds of interactive data to carry out iterative training on the initial agent based on a training mode of reinforcement learning so as to update parameters in the initial agent;
and if the initial agent with the updated parameters meets preset conditions, determining the initial agent with the updated parameters as a basic agent, wherein the preset conditions comprise that the average return value obtained by carrying out multiple rounds of interaction with the rule simulator is higher than a preset return value.
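A sketch of this reinforcement-learning update is shown below. A tabular softmax policy stands in for the network built with the second deep learning algorithm, the update is a plain REINFORCE step, and the preset condition is checked against the average of the stored return values; in the embodiment that condition is evaluated by interacting with the rule simulator, so all of these concrete choices are assumptions.

```python
# Sketch of the reinforcement-learning update applied to the initial agent
# using N screened rounds of (consultation, feedback, return value).

import math

class TinyPolicyAgent:
    def __init__(self, candidate_replies):
        self.candidates = candidate_replies
        self.theta = {a: 0.0 for a in candidate_replies}  # one logit per candidate reply

    def probabilities(self):
        z = {a: math.exp(t) for a, t in self.theta.items()}
        s = sum(z.values())
        return {a: v / s for a, v in z.items()}

    def reinforce_update(self, chosen_reply, return_value, lr=0.1):
        # Raise the log-probability of replies that earned high return values.
        p = self.probabilities()
        for a in self.candidates:
            indicator = 1.0 if a == chosen_reply else 0.0
            self.theta[a] += lr * return_value * (indicator - p[a])

def train_initial_agent(agent, n_rounds, preset_return=0.7):
    returns = []
    for _consultation, feedback, return_value in n_rounds:  # N rounds of interaction data
        agent.reinforce_update(feedback, return_value)
        returns.append(return_value)
    average_return = sum(returns) / max(len(returns), 1)
    # Preset condition (assumed form): average return above the preset return value.
    return agent if average_return > preset_return else None

agent = TinyPolicyAgent(["premium is 1200", "policy cancelled"])
basic_agent = train_initial_agent(agent, [("query premium", "premium is 1200", 0.9)])
print(basic_agent is not None)  # True: the updated agent qualifies as the basic agent
```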
In one implementation, the training module 304 is specifically configured to:
calling an agent combination to interact with the basic simulator at least once based on the second interaction target subset;
in the I-th interaction of the at least one interaction, acquiring a first interaction round number U corresponding to the rule agent in the agent combination and a second interaction round number V corresponding to the basic agent, wherein I, U and V are positive integers;
calling the rule agent to interact with the basic simulator based on U interaction targets in the second interaction target subset to obtain a first interaction data subset;
calling the basic agent to interact with the basic simulator based on the V interactive targets in the second interactive target subset to obtain a second interactive data subset;
updating parameters in the base simulator based on the first subset of interaction data and the second subset of interaction data;
if the basic simulator after the parameter updating does not meet the first preset condition, adjusting the first interaction round number corresponding to the rule agent in the agent combination and the second interaction round number corresponding to the basic agent to obtain an agent combination after the interaction round number updating, and performing interactive training on the basic simulator based on the agent combination after the interaction round number updating in the (I+1)-th interaction;
and if the basic simulator after the parameter updating meets a first preset condition, determining the basic simulator after the parameter updating as a target simulator.
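A hedged Python sketch of this agent-combination training of the basic simulator follows. The interaction, parameter-update and condition checks are passed in as callables because the embodiment leaves the concrete networks and the first preset condition open, and the round-adjustment policy used here (shifting rounds from the rule agent toward the basic agent) is only an assumed example.

```python
# Sketch of training the basic simulator against the agent combination:
# U targets handled by the rule agent and V targets by the basic agent per
# interaction I, followed by a simulator parameter update.

def train_basic_simulator(second_subset, interact_rule_agent, interact_basic_agent,
                          update_simulator, meets_first_condition,
                          u=3, v=3, max_interactions=10):
    for _ in range(max_interactions):  # interactions I = 1, 2, ...
        # U interaction targets for the rule agent, V for the basic agent.
        first_data = [interact_rule_agent(t) for t in second_subset[:u]]
        second_data = [interact_basic_agent(t) for t in second_subset[u:u + v]]
        update_simulator(first_data + second_data)  # update simulator parameters
        if meets_first_condition():
            return "target simulator"  # first preset condition met
        u, v = max(u - 1, 1), v + 1  # adjusted round numbers for interaction I + 1
    return None

# Stub usage: the (dummy) first preset condition flips after two updates.
state = {"updates": 0}
result = train_basic_simulator(
    ["t1", "t2", "t3", "t4", "t5", "t6"],
    interact_rule_agent=lambda t: (t, "rule"),
    interact_basic_agent=lambda t: (t, "learned"),
    update_simulator=lambda data: state.update(updates=state["updates"] + 1),
    meets_first_condition=lambda: state["updates"] >= 2,
)
print(result)  # "target simulator"
```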
In one implementation, the training module 304 is specifically configured to:
calling a simulator combination to interact with the basic agent for a plurality of times based on the third interaction target subset;
in the J-th interaction of the multiple interactions, acquiring a third interaction round number X corresponding to the rule simulator in the simulator combination and a fourth interaction round number Y corresponding to the target simulator, wherein J, X and Y are positive integers;
calling the rule simulator to interact with the basic agent based on X interactive targets in the third interactive target subset to obtain a third interactive data subset;
calling the target simulator to interact with the basic agent based on Y interactive targets in the third interactive target subset to obtain a fourth interactive data subset;
updating parameters in the base agent based on the third subset of interaction data and the fourth subset of interaction data;
if the basic agent after the parameter updating does not meet a second preset condition, adjusting the third interaction round number corresponding to the rule simulator in the simulator combination and the fourth interaction round number corresponding to the target simulator to obtain a simulator combination after the interaction round number updating, and performing interactive training on the basic agent based on the simulator combination after the interaction round number updating in the (J+1)-th interaction;
and if the basic agent after the parameter updating meets the second preset condition, determining the basic agent after the parameter updating as a target agent.
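The simulator-combination phase mirrors the previous sketch with the roles swapped; the fragment below only highlights the X/Y round adjustment and the stopping test. The form of the second preset condition (average return value above a threshold) and the adjustment rule are assumptions, and usage is analogous to the stub shown for the simulator-training sketch.

```python
# Symmetric sketch: X rounds against the rule simulator and Y rounds against
# the target simulator per interaction J, followed by an agent update.

def train_basic_agent(third_subset, interact_rule_sim, interact_target_sim,
                      update_agent, average_return, preset_return=0.8,
                      x=3, y=3, max_interactions=10):
    for _ in range(max_interactions):  # interactions J = 1, 2, ...
        third_data = [interact_rule_sim(t) for t in third_subset[:x]]
        fourth_data = [interact_target_sim(t) for t in third_subset[x:x + y]]
        update_agent(third_data + fourth_data)  # update agent parameters
        if average_return() > preset_return:  # second preset condition (assumed form)
            return "target agent"
        # Condition not met: rely more on the learned target simulator in J + 1.
        x, y = max(x - 1, 1), y + 1
    return None
```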
In one implementation, the training module 304 is further configured to:
acquiring a second interactive data set obtained by at least one test user interacting with the target intelligent agent, wherein the second interactive data set comprises at least one round of interactive data, and each round of interactive data comprises a test score, consultation information output by the test user and feedback information output by the target intelligent agent;
training the target agent based on the second interaction data set so as to update parameters in the target agent and obtain a target agent with updated parameters;
and calling the target agent with the updated parameters to interact with the target user.
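The optional fine-tuning step with test users can be sketched as below; the data layout of the second interaction data set and the reuse of the test score as the reward signal are assumptions for illustration only.

```python
# Sketch of fine-tuning the target agent on rounds collected from test
# users: each round carries a test score, treated here as the reward for
# one more parameter update of the target agent.

def fine_tune_target_agent(apply_update, second_interaction_data_set):
    """second_interaction_data_set: rounds with keys 'score',
    'consultation' (from the test user) and 'feedback' (from the agent)."""
    for round_data in second_interaction_data_set:
        apply_update(
            consultation=round_data["consultation"],
            feedback=round_data["feedback"],
            reward=round_data["score"],  # test score reused as the return value
        )

updates = []
fine_tune_target_agent(
    lambda **kw: updates.append(kw),
    [{"score": 0.9, "consultation": "query premium", "feedback": "the premium is 1200"}],
)
print(len(updates), "parameter update(s) applied before serving the target user")
```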
In the embodiment of the present invention, an obtaining module 301 obtains an interaction target set, a screening module 302 screens out a first interaction target subset from the interaction target set, a calling module 303 calls a rule simulator and a rule agent to interact based on the first interaction target subset to obtain a first interaction data set, a training module 304 trains an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent, the training module 304 calls a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator, and the training module 304 calls a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain a target agent. By implementing this method, the simulator and the intelligent agent can be trained alternately in the form of interaction between a simulator combination and the intelligent agent, which improves the training efficiency of the intelligent agent.
Fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention. As shown in fig. 4, the terminal includes: at least one processor 401, an input device 403, an output device 404, a memory 405, and at least one communication bus 402. The communication bus 402 is used to implement connection and communication between these components. The input device 403 may be a control panel or a microphone, and the output device 404 may be a display screen. The memory 405 may be a high-speed RAM memory or a non-volatile memory, such as at least one magnetic disk memory. The memory 405 may optionally be at least one storage device located remotely from the aforementioned processor 401. The processor 401 may be combined with the apparatus described in fig. 3; the memory 405 stores a set of program codes, and the processor 401, the input device 403, and the output device 404 call the program codes stored in the memory 405 to perform the following operations:
the processor 401 is configured to obtain an interaction target set, and screen out a first interaction target subset from the interaction target set;
a processor 401, configured to invoke a rule simulator based on the first interaction target subset to interact with a rule agent to obtain a first interaction data set, where the rule simulator is a simulator constructed based on a first preset rule, and the rule agent is an agent constructed based on a second preset rule;
a processor 401, configured to train an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent, where the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
a processor 401, configured to screen a second interaction target subset from the interaction target set, and invoke an agent combination to perform interaction training on the basic simulator based on the second interaction target subset, so as to update parameters in the basic simulator and obtain a target simulator, where the agent combination includes the rule agent and the basic agent;
a processor 401, configured to screen a third interaction target subset from the interaction target set, and invoke a simulator combination to perform interaction training on the basic agent based on the third interaction target subset, so as to update parameters in the basic agent, so as to obtain a target agent, where the simulator combination includes the rule simulator and the target simulator.
In one implementation, the processor 401 is specifically configured to:
acquiring a target application scene corresponding to an initial agent to be trained;
acquiring a target interaction record under the target application scene from a historical record, wherein the target interaction record comprises interaction records of a user and an intelligent agent under the target application scene;
and acquiring at least one historical interaction target from the target interaction record, and screening K interaction targets matched with the historical interaction targets from the interaction target set to serve as a first interaction target subset, wherein K is a positive integer.
In one implementation, the processor 401 is specifically configured to:
determining first consulting information based on a reference interaction target, and calling the rule simulator to send the first consulting information to the rule agent;
calling the rule agent to output corresponding first feedback information to the rule simulator based on the keywords in the first consultation information;
determining the first consultation information and the first feedback information as first round of interaction data, and determining a first return value corresponding to the first round of interaction data based on the matching degree between the first feedback information and the reference interaction target;
if the matching degree between the first feedback information and the reference interaction target is smaller than a preset matching degree, calling the rule simulator to output corresponding second consultation information to the rule agent based on the keywords in the first feedback information;
calling the rule agent to output corresponding second feedback information to the rule simulator based on the keywords in the second consultation information;
determining the second consultation information and the second feedback information as second round of interaction data, and determining a second return value corresponding to the second round of interaction data based on the matching degree between the second feedback information and the reference interaction target;
if the matching degree between the second feedback information and the reference interaction target is greater than a preset matching degree, stopping calling the rule simulator and the rule agent for interaction, and adding the first round of interaction data, the first return value, the second round of interaction data and the second return value to a first interaction data set.
In one implementation, the processor 401 is specifically configured to:
screening N rounds of interaction data from the first interaction data set, wherein each round of interaction data in the N rounds of interaction data comprises consultation information output by the rule simulator, feedback information output by the rule agent and a return value, and N is a positive integer;
calling the N rounds of interactive data to carry out iterative training on the initial agent based on a training mode of reinforcement learning so as to update parameters in the initial agent;
and if the initial agent with the updated parameters meets preset conditions, determining the initial agent with the updated parameters as a basic agent, wherein the preset conditions comprise that the average return value obtained by carrying out multiple rounds of interaction with the rule simulator is higher than a preset return value.
In one implementation, the processor 401 is specifically configured to:
calling an agent combination to interact with the basic simulator at least once based on the second interaction target subset;
in the I-th interaction of the at least one interaction, acquiring a first interaction round number U corresponding to the rule agent in the agent combination and a second interaction round number V corresponding to the basic agent, wherein I, U and V are positive integers;
calling the rule agent to interact with the basic simulator based on U interaction targets in the second interaction target subset to obtain a first interaction data subset;
calling the basic agent to interact with the basic simulator based on the V interactive targets in the second interactive target subset to obtain a second interactive data subset;
updating parameters in the base simulator based on the first subset of interaction data and the second subset of interaction data;
if the basic simulator after the parameter updating does not meet the first preset condition, adjusting the first interaction round number corresponding to the rule agent in the agent combination and the second interaction round number corresponding to the basic agent to obtain an agent combination after the interaction round number updating, and performing interactive training on the basic simulator based on the agent combination after the interaction round number updating in the (I+1)-th interaction;
and if the basic simulator after the parameter updating meets a first preset condition, determining the basic simulator after the parameter updating as a target simulator.
In one implementation, the processor 401 is specifically configured to:
calling a simulator combination to interact with the basic agent for a plurality of times based on the third interaction target subset;
in the J-th interaction of the multiple interactions, acquiring a third interaction round number X corresponding to the rule simulator in the simulator combination and a fourth interaction round number Y corresponding to the target simulator, wherein J, X and Y are positive integers;
calling the rule simulator to interact with the basic agent based on X interactive targets in the third interactive target subset to obtain a third interactive data subset;
calling the target simulator to interact with the basic agent based on Y interactive targets in the third interactive target subset to obtain a fourth interactive data subset;
updating parameters in the base agent based on the third subset of interaction data and the fourth subset of interaction data;
if the basic agent after the parameter updating does not meet a second preset condition, adjusting the third interaction round number corresponding to the rule simulator in the simulator combination and the fourth interaction round number corresponding to the target simulator to obtain a simulator combination after the interaction round number updating, and performing interactive training on the basic agent based on the simulator combination after the interaction round number updating in the (J+1)-th interaction;
and if the basic agent after the parameter updating meets the second preset condition, determining the basic agent after the parameter updating as a target agent.
In one implementation, the processor 401 is specifically configured to:
acquiring a second interactive data set obtained by at least one test user interacting with the target intelligent agent, wherein the second interactive data set comprises at least one round of interactive data, and each round of interactive data comprises a test score, consultation information output by the test user and feedback information output by the target intelligent agent;
training the target agent based on the second interaction data set so as to update parameters in the target agent and obtain a target agent with updated parameters;
and calling the target agent with the updated parameters to interact with the target user.
In the embodiment of the invention, the processor 401 obtains an interaction target set, screens out a first interaction target subset from the interaction target set, calls a rule simulator and a rule agent to interact based on the first interaction target subset to obtain a first interaction data set, trains an initial simulator and an initial agent based on the first interaction data set to obtain a basic simulator and a basic agent, calls a combination of the rule agent and the basic agent to interactively train the basic simulator based on a second interaction target subset to obtain a target simulator, and calls a combination of the rule simulator and the target simulator to interactively train the basic agent based on a third interaction target subset to obtain a target agent. By implementing this method, the simulator and the intelligent agent can be trained alternately in the form of interaction between a simulator combination and the intelligent agent, which improves the training efficiency of the intelligent agent.
The modules in the embodiments of the present invention may be implemented by a general-purpose integrated circuit, such as a CPU (Central Processing Unit), or by an ASIC (Application Specific Integrated Circuit).
It should be understood that, in the embodiments of the present invention, the processor 401 may be a Central Processing Unit (CPU), and may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The bus 402 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 4, but this does not mean that there is only one bus or only one type of bus.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer storage medium and which, when executed, may include the processes of the embodiments of the methods described above. The computer storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present invention and is, of course, not intended to limit the scope of the present invention, which is defined by the appended claims.

Claims (10)

1. A method of interactive training for an agent, the method comprising:
acquiring an interactive target set, and screening a first interactive target subset from the interactive target set;
calling a rule simulator and a rule agent to interact on the basis of the first interaction target subset to obtain a first interaction data set, wherein the rule simulator is a simulator constructed on the basis of a first preset rule, and the rule agent is an agent constructed on the basis of a second preset rule;
training an initial simulator and an initial agent based on the first interactive data set to obtain a basic simulator and a basic agent, wherein the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
screening a second interaction target subset from the interaction target set, calling an agent combination based on the second interaction target subset to carry out interaction training on the basic simulator so as to update parameters in the basic simulator, and obtaining a target simulator, wherein the agent combination comprises the rule agent and the basic agent;
and screening a third interaction target subset from the interaction target set, calling a simulator combination based on the third interaction target subset to carry out interactive training on the basic agent so as to update parameters in the basic agent to obtain a target agent, wherein the simulator combination comprises the rule simulator and the target simulator.
2. The method of claim 1, wherein the screening out a first subset of interaction targets from the set of interaction targets comprises:
acquiring a target application scene corresponding to an initial agent to be trained;
acquiring a target interaction record under the target application scene from a historical record, wherein the target interaction record comprises interaction records of a user and an intelligent agent under the target application scene;
and acquiring at least one historical interaction target from the target interaction record, and screening K interaction targets matched with the historical interaction targets from the interaction target set to serve as a first interaction target subset, wherein K is a positive integer.
3. The method according to claim 1, wherein the first preset rule is to output corresponding consultation information based on the keywords in the obtained feedback information, the second preset rule is to output corresponding feedback information based on the keywords in the obtained consultation information, and a manner of invoking a rule simulator and a rule agent to interact based on any reference interaction target in the first interaction target subset comprises:
determining first consulting information based on a reference interaction target, and calling the rule simulator to send the first consulting information to the rule agent;
calling the rule agent to output corresponding first feedback information to the rule simulator based on the keywords in the first consultation information;
determining the first consultation information and the first feedback information as first round of interaction data, and determining a first return value corresponding to the first round of interaction data based on the matching degree between the first feedback information and the reference interaction target;
if the matching degree between the first feedback information and the reference interaction target is smaller than a preset matching degree, calling the rule simulator to output corresponding second consultation information to the rule agent based on the keywords in the first feedback information;
calling the rule agent to output corresponding second feedback information to the rule simulator based on the keywords in the second consultation information;
determining the second consultation information and the second feedback information as second round of interaction data, and determining a second return value corresponding to the second round of interaction data based on the matching degree between the second feedback information and the reference interaction target;
if the matching degree between the second feedback information and the reference interaction target is greater than a preset matching degree, stopping calling the rule simulator and the rule agent for interaction, and adding the first round of interaction data, the first return value, the second round of interaction data and the second return value to a first interaction data set.
4. The method of claim 3, wherein the first interaction data set comprises at least one round of interaction data, and wherein training an initial agent based on the first interaction data set to obtain a basic agent comprises:
screening N rounds of interaction data from the first interaction data set, wherein each round of interaction data in the N rounds of interaction data comprises consultation information output by the rule simulator, feedback information output by the rule agent and a return value, and N is a positive integer;
calling the N rounds of interactive data to carry out iterative training on the initial agent based on a training mode of reinforcement learning so as to update parameters in the initial agent;
and if the initial agent with the updated parameters meets preset conditions, determining the initial agent with the updated parameters as a basic agent, wherein the preset conditions comprise that the average return value obtained by carrying out multiple rounds of interaction with the rule simulator is higher than a preset return value.
5. The method of claim 1, wherein invoking an agent combination to interactively train the basic simulator based on the second interaction target subset comprises:
calling an agent combination to interact with the basic simulator at least once based on the second interaction target subset;
in the I-th interaction of the at least one interaction, acquiring a first interaction round number U corresponding to the rule agent in the agent combination and a second interaction round number V corresponding to the basic agent, wherein I, U and V are positive integers;
calling the rule agent to interact with the basic simulator based on U interaction targets in the second interaction target subset to obtain a first interaction data subset;
calling the basic agent to interact with the basic simulator based on the V interactive targets in the second interactive target subset to obtain a second interactive data subset;
updating parameters in the base simulator based on the first subset of interaction data and the second subset of interaction data;
if the basic simulator after the parameter updating does not meet the first preset condition, adjusting the first interaction round number corresponding to the rule agent in the agent combination and the second interaction round number corresponding to the basic agent to obtain an agent combination after the interaction round number updating, and performing interactive training on the basic simulator based on the agent combination after the interaction round number updating in the (I+1)-th interaction;
and if the basic simulator after the parameter updating meets a first preset condition, determining the basic simulator after the parameter updating as a target simulator.
6. The method of claim 1, wherein invoking a simulator combination to interactively train the basic agent based on the third interaction target subset comprises:
calling a simulator combination to interact with the basic agent for a plurality of times based on the third interaction target subset;
in the J-th interaction of the multiple interactions, acquiring a third interaction round number X corresponding to the rule simulator in the simulator combination and a fourth interaction round number Y corresponding to the target simulator, wherein J, X and Y are positive integers;
calling the rule simulator to interact with the basic agent based on X interactive targets in the third interactive target subset to obtain a third interactive data subset;
calling the target simulator to interact with the basic agent based on Y interactive targets in the third interactive target subset to obtain a fourth interactive data subset;
updating parameters in the base agent based on the third subset of interaction data and the fourth subset of interaction data;
if the basic agent after the parameter updating does not meet a second preset condition, adjusting the third interaction round number corresponding to the rule simulator in the simulator combination and the fourth interaction round number corresponding to the target simulator to obtain a simulator combination after the interaction round number updating, and performing interactive training on the basic agent based on the simulator combination after the interaction round number updating in the (J+1)-th interaction;
and if the basic agent after the parameter updating meets the second preset condition, determining the basic agent after the parameter updating as a target agent.
7. The method of claim 1, wherein after invoking the simulator combination to interactively train the basic agent based on the third interaction target subset to update parameters in the basic agent, the method further comprises:
acquiring a second interactive data set obtained by at least one test user interacting with the target intelligent agent, wherein the second interactive data set comprises at least one round of interactive data, and each round of interactive data comprises a test score, consultation information output by the test user and feedback information output by the target intelligent agent;
training the target agent based on the second interaction data set so as to update parameters in the target agent and obtain a target agent with updated parameters;
and calling the target agent with the updated parameters to interact with the target user.
8. An interactive training apparatus for an intelligent agent, the apparatus comprising:
the acquisition module is used for acquiring an interaction target set;
the screening module is used for screening out a first interaction target subset from the interaction target set;
the calling module is used for calling a rule simulator based on the first interaction target subset to interact with a rule agent to obtain a first interaction data set, the rule simulator is a simulator constructed based on a first preset rule, and the rule agent is an agent constructed based on a second preset rule;
the training module is used for training an initial simulator and an initial agent based on the first interactive data set to obtain a basic simulator and a basic agent, the initial simulator is a simulator constructed based on a first deep learning algorithm, and the initial agent is an agent constructed based on a second deep learning algorithm;
the screening module is further configured to screen a second subset of interaction targets from the set of interaction targets,
the training module is further configured to invoke an agent combination to perform interactive training on the basic simulator based on the second interaction target subset, so as to update parameters in the basic simulator, and obtain a target simulator, where the agent combination includes the rule agent and the basic agent;
the screening module is further configured to screen a third interaction target subset from the interaction target set;
the training module is further configured to invoke a simulator combination based on the third interactive target subset to perform interactive training on the basic agent, so as to update parameters in the basic agent, and obtain a target agent, where the simulator combination includes the rule simulator and the target simulator.
9. A terminal, comprising a processor and a memory, wherein the memory is configured to store a computer program comprising program instructions, wherein the processor is configured to invoke the program instructions to perform the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.
CN202110288790.8A 2021-03-18 2021-03-18 Interactive training method and device for intelligent agent, terminal and storage medium Active CN112836036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110288790.8A CN112836036B (en) 2021-03-18 2021-03-18 Interactive training method and device for intelligent agent, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN112836036A (en) 2021-05-25
CN112836036B CN112836036B (en) 2023-09-08

Family

ID=75930225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110288790.8A Active CN112836036B (en) 2021-03-18 2021-03-18 Interactive training method and device for intelligent agent, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN112836036B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806512A (en) * 2021-09-30 2021-12-17 中国平安人寿保险股份有限公司 Robot dialogue model training method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102789732A (en) * 2012-08-08 2012-11-21 四川大学华西医院 Transesophageal ultrasonic visual simulation system and method used for teaching and clinical skill training
CN110882542A (en) * 2019-11-13 2020-03-17 广州多益网络股份有限公司 Training method, device, equipment and storage medium for game agent
CN111488992A (en) * 2020-03-03 2020-08-04 中国电子科技集团公司第五十二研究所 Simulator adversary reinforcing device based on artificial intelligence
CN112420125A (en) * 2020-11-30 2021-02-26 腾讯科技(深圳)有限公司 Molecular attribute prediction method and device, intelligent equipment and terminal

Also Published As

Publication number Publication date
CN112836036B (en) 2023-09-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant