CN116521850A - Interaction method and device based on reinforcement learning - Google Patents

Interaction method and device based on reinforcement learning

Info

Publication number
CN116521850A
CN116521850A (application CN202310809050.3A; granted publication CN116521850B)
Authority
CN
China
Prior art keywords
reply
information
dialogue
state
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310809050.3A
Other languages
Chinese (zh)
Other versions
CN116521850B (en)
Inventor
李宇舰
曾敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hongmian Xiaoice Technology Co Ltd
Original Assignee
Beijing Hongmian Xiaoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Hongmian Xiaoice Technology Co Ltd
Priority to CN202310809050.3A
Publication of CN116521850A
Application granted
Publication of CN116521850B
Legal status: Active (granted)

Classifications

    • G06F16/3329 Natural language query formulation
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06N3/092 Reinforcement learning
    • Y02T10/40 Engine management systems


Abstract

The invention provides an interaction method and device based on reinforcement learning, and relates to the technical field of artificial intelligence. The method comprises the following steps: acquiring currently input statement information, and determining a global dialogue state of the statement information; inputting the global dialogue state into a policy network model to determine a reply strategy corresponding to the statement information under the global dialogue state, the policy network model being obtained by training based on a sample dialogue state, a sample reply strategy corresponding to the sample dialogue state, and a sample reward value; and acquiring a context sentence of the statement information, and generating a reply sentence based on the reply strategy, the global dialogue state, and the context sentence. The interaction method based on reinforcement learning provided by the invention can respond in time to the state change information of the user based on the global dialogue state, so that the interaction intention of the user is rapidly understood, the quality and naturalness of the interactive dialogue are effectively improved, and the use experience of the user is improved.

Description

Interaction method and device based on reinforcement learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an interaction method and device based on reinforcement learning. In addition, the invention also relates to an electronic device and a processor readable storage medium.
Background
In recent years, with the rapid development of artificial intelligence technology, dialogue systems capable of intelligent interaction have become more and more widespread. However, dialogue systems in the prior art have a number of defects and deficiencies: their understanding capability is limited, and they cannot deeply understand the intention of the user, which leads to answers that are not natural enough; nor can they respond in time to changes in the user's state. For example, a user's hobbies may change over a period of time, and a traditional dialogue system cannot capture such state information. Therefore, how to design an interactive dialogue scheme based on reinforcement learning to improve understanding capability and naturalness of interaction is a problem that urgently needs to be solved.
Disclosure of Invention
Therefore, the invention provides an interaction method and device based on reinforcement learning, which are used to overcome the defects in the prior art that interaction schemes in the social chat field are highly limited, their understanding capability is limited, they cannot deeply understand the intention of the user, and their answers are not natural enough.
In a first aspect, the present invention provides an interaction method based on reinforcement learning, including:
acquiring statement information input currently, and determining a global dialogue state of the statement information;
inputting the global dialogue state into a policy network model to determine a reply strategy corresponding to the statement information under the global dialogue state;
the policy network model is obtained by training based on a sample dialogue state, a sample reply strategy corresponding to the sample dialogue state, and a sample reward value;
acquiring a context sentence of the statement information, and generating a reply sentence based on the reply strategy, the global dialogue state, and the context sentence.
Further, after each round of reply sentence generation, the method further comprises:
carrying out multidimensional evaluation analysis on the current dialogue state based on the currently input statement information and the reply sentence, and determining a corresponding reward value; determining update parameters of the policy network model based on the reward value, with maximizing the reward value of the current dialogue state as the optimization target, and obtaining an updated policy network model; wherein the current dialogue state comprises a plurality of pieces of currently input statement information and a plurality of reply sentences corresponding to the statement information; and analyzing the global dialogue state based on the updated policy network model to determine a reply strategy corresponding to the statement information input next time under the global dialogue state.
Further, the determining the global dialogue state of the sentence information specifically includes:
Carrying out emotion analysis on the statement information based on a preset first classification evaluation model to obtain an emotion state corresponding to the statement information; carrying out semantic analysis on the statement information based on a preset second classification evaluation model to obtain a dialogue intention corresponding to the statement information; determining current dialog state information based on the emotional state and the dialog intention; the first classification evaluation model is obtained by training based on first sample sentence information and emotion labels corresponding to the first sample sentence information; the second classification evaluation model is obtained by training based on second sample sentence information and dialogue intention labels corresponding to the second sample sentence information;
determining a target user corresponding to the statement information, and acquiring historical dialogue information of the target user in a preset time range; the history dialogue information comprises history statement information input by the target user in a preset time range and corresponding history reply statements; analyzing the historical dialogue information, extracting key feature information in the historical dialogue information, and determining historical dialogue state information of the target user based on the key feature information; the history dialogue state information comprises history preference information and history interaction information corresponding to the target user;
The current dialogue state information and the historical dialogue state information are determined as the global dialogue state.
Further, the generating a reply sentence based on the reply strategy, the global dialogue state and the context sentence specifically comprises: inputting the reply strategy, the global dialogue state and the context sentence into a preset language model to obtain a reply sentence output by the language model; the language model is a pre-trained language model based on a Transformer architecture, obtained by training in advance based on third sample sentence information and reply sentences corresponding to the third sample sentence information.
Further, the step of performing multidimensional evaluation analysis on the current dialogue state based on the currently input statement information and the reply sentence to determine a corresponding reward value specifically comprises:
inputting the currently input statement information and the reply sentence into a preset reward model, and carrying out multidimensional evaluation analysis on the current dialogue state based on a dialogue naturalness classification evaluation model, an information accuracy classification evaluation model, a user satisfaction classification evaluation model and an emotion expression classification evaluation model in the reward model, to obtain a dialogue naturalness evaluation value, an information accuracy evaluation value, a user satisfaction evaluation value and an emotion expression evaluation value;
determining a reward value of the current dialogue state information based on the dialogue naturalness evaluation value, the information accuracy evaluation value, the user satisfaction evaluation value, and the emotion expression evaluation value.
Further, the inputting the global dialogue state into a policy network model to determine a reply strategy corresponding to the statement information under the global dialogue state specifically includes:
carrying out conditional probability analysis on the global dialogue state based on the policy network model, to obtain probability values of the various reply strategies corresponding to the statement information in the current dialogue state under the global dialogue state scene; and determining the reply strategy corresponding to the statement information based on the probability values of the various reply strategies.
Further, the various reply strategies comprise a questioning reply strategy, a replying reply strategy, and a replying-and-questioning reply strategy;
the determining the reply strategy corresponding to the statement information based on the probability values of the various reply strategies specifically comprises the following steps:
comparing the probability values respectively corresponding to the questioning reply strategy, the replying reply strategy and the replying-and-questioning reply strategy, and determining, among them, the reply strategy with the largest corresponding probability value as the reply strategy corresponding to the statement information.
In a second aspect, the present invention further provides an interaction device based on reinforcement learning, including:
the language understanding module is used for acquiring the statement information input currently and determining the global dialogue state of the statement information;
a dialogue management module, configured to input the global dialogue state into a policy network model, so as to determine a reply strategy corresponding to the statement information under the global dialogue state;
the policy network model is obtained by training based on a sample dialogue state, a sample reply strategy corresponding to the sample dialogue state, and a sample reward value;
and a reply generation module, configured to acquire the context sentence of the statement information, and generate a reply sentence based on the reply strategy, the global dialogue state, and the context sentence.
Further, after each round of reply sentence generation, the device further comprises:
a reward model processing and updating module, configured to carry out multidimensional evaluation analysis on the current dialogue state based on the currently input statement information and the reply sentence, and determine a corresponding reward value; determine update parameters of the policy network model based on the reward value, with maximizing the reward value of the current dialogue state as the optimization target, and obtain an updated policy network model; wherein the current dialogue state comprises a plurality of pieces of currently input statement information and a plurality of reply sentences corresponding to the statement information; and the reply generation module is further configured to analyze the global dialogue state based on the updated policy network model, to determine a reply strategy corresponding to the statement information input next time under the global dialogue state.
Further, the language understanding module is specifically configured to:
carrying out emotion analysis on the statement information based on a preset first classification evaluation model to obtain an emotion state corresponding to the statement information; carrying out semantic analysis on the statement information based on a preset second classification evaluation model to obtain a dialogue intention corresponding to the statement information; determining current dialog state information based on the emotional state and the dialog intention; the first classification evaluation model is obtained by training based on first sample sentence information and emotion labels corresponding to the first sample sentence information; the second classification evaluation model is obtained by training based on second sample sentence information and dialogue intention labels corresponding to the second sample sentence information;
determining a target user corresponding to the statement information, and acquiring historical dialogue information of the target user in a preset time range; the history dialogue information comprises history statement information input by the target user in a preset time range and corresponding history reply statements; analyzing the historical dialogue information, extracting key feature information in the historical dialogue information, and determining historical dialogue state information of the target user based on the key feature information; the history dialogue state information comprises history preference information and history interaction information corresponding to the target user;
The current dialogue state information and the historical dialogue state information are determined as the global dialogue state.
Further, the reply generation module is specifically configured to: input the reply strategy, the global dialogue state and the context sentence into a preset language model to obtain a reply sentence output by the language model; the language model is a pre-trained language model based on a Transformer architecture, obtained by training in advance based on third sample sentence information and reply sentences corresponding to the third sample sentence information.
Further, the reward model processing and updating module is specifically configured to:
inputting the currently input statement information and the reply sentence into a preset reward model, and carrying out multidimensional evaluation analysis on the current dialogue state based on a dialogue naturalness classification evaluation model, an information accuracy classification evaluation model, a user satisfaction classification evaluation model and an emotion expression classification evaluation model in the reward model, to obtain a dialogue naturalness evaluation value, an information accuracy evaluation value, a user satisfaction evaluation value and an emotion expression evaluation value;
determining a reward value of the current dialogue state information based on the dialogue naturalness evaluation value, the information accuracy evaluation value, the user satisfaction evaluation value, and the emotion expression evaluation value.
Further, the dialogue management module is specifically configured to:
carry out conditional probability analysis on the global dialogue state based on the policy network model, to obtain probability values of the various reply strategies corresponding to the statement information in the current dialogue state under the global dialogue state scene; and determine the reply strategy corresponding to the statement information based on the probability values of the various reply strategies.
Further, the various reply strategies comprise a questioning reply strategy, a replying reply strategy, and a replying-and-questioning reply strategy;
the determining the reply strategy corresponding to the statement information based on the probability values of the various reply strategies specifically comprises the following steps:
comparing the probability values respectively corresponding to the questioning reply strategy, the replying reply strategy and the replying-and-questioning reply strategy, and determining, among them, the reply strategy with the largest corresponding probability value as the reply strategy corresponding to the statement information.
In a third aspect, the present invention also provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the reinforcement learning based interaction method as described in any of the above.
In a fourth aspect, the present invention also provides a processor-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the reinforcement learning based interaction method as described in any of the above.
According to the interaction method based on reinforcement learning provided by the invention, the currently input statement information is acquired and its global dialogue state is determined; the global dialogue state is input into a policy network model to determine the reply strategy corresponding to the statement information under the global dialogue state; then a context sentence of the statement information is acquired, and a reply sentence is generated based on the reply strategy, the global dialogue state and the context sentence. The interactive system can respond in time to the state change information of the user based on the global dialogue state, so that the interaction intention of the user is quickly understood, the quality and naturalness of the interactive dialogue are effectively improved, and the use experience of the user is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will briefly describe the drawings that are required to be used in the embodiments or the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without any inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an interactive method based on reinforcement learning according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of an interaction method based on reinforcement learning according to an embodiment of the present invention;
FIG. 3 is a table schematic diagram of reward values calculated by a reward model provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an interactive device based on reinforcement learning according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an entity structure of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which are derived by a person skilled in the art from the embodiments according to the invention without creative efforts, fall within the protection scope of the invention.
It should be noted that the description of the present invention and the above terms "first," "second," and the like are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.
The reinforcement learning-based interaction method described in the present invention is described in detail below through its embodiments. As shown in fig. 1, which is a schematic flow chart of an interaction method based on reinforcement learning according to an embodiment of the present invention, the specific process includes the following steps:
step 101: and acquiring the statement information input currently, and determining the global dialogue state of the statement information.
In the embodiment of the invention, the statement information to be replied to, currently input by the user, is acquired; the statement information may be text sentence information or voice sentence information, which is not specifically limited herein. Determining the global dialogue state of the statement information may specifically include: carrying out emotion analysis on the statement information based on a preset first classification evaluation model to obtain an emotion state corresponding to the statement information; carrying out semantic analysis on the statement information based on a preset second classification evaluation model to obtain a dialogue intention corresponding to the statement information; and determining current dialogue state information based on the emotion state and the dialogue intention. The first classification evaluation model is a Transformer-based classification model obtained by training based on first sample sentence information and emotion labels corresponding to the first sample sentence information; the second classification evaluation model is a Transformer-based classification model obtained by training based on second sample sentence information and dialogue intention labels corresponding to the second sample sentence information.

A target user corresponding to the statement information is determined, and historical dialogue information of the target user within a preset time range is acquired; the historical dialogue information comprises historical statement information input by the target user within the preset time range and corresponding historical reply sentences. The historical dialogue information is analyzed, key feature information in the historical dialogue information is extracted, and historical dialogue state information of the target user is determined based on the key feature information; the historical dialogue state information comprises historical preference information and historical interaction information corresponding to the target user. The current dialogue state information and the historical dialogue state information are determined as the global dialogue state. That is, the global dialogue state of the statement information in the embodiment of the present invention includes both the current dialogue state information and the historical dialogue state information.

For example, the target user has performed 7 rounds of interactive dialogue today, and the statement information is the sentence input in the 7th round; the current dialogue state information may then refer to the emotion state and dialogue intention of the statement information input in the current round of interactive dialogue and its reply sentence (i.e., the statement information and reply sentence of the 7th round). The emotion state comprises a negative state, a positive state and a neutral state corresponding to the reply information. The dialogue intention refers to the purpose of the language or sentence, for example expressing that the user is currently annoyed, for which a dialogue intention label of "user annoyed" can be set.
The historical dialogue state information may include the emotion states and dialogue intentions of the statement information and reply sentences input in the previous 1-6 rounds of interactive dialogue, and also includes state information of statement information and reply sentences input in an earlier time range (such as the previous day or days). This state information includes dynamic user preference information, and the interaction information related to questions and answers may include the historical preference information corresponding to the target user (i.e., hobby information, such as liking basketball) and the historical interaction information (for example, the user participated in a basketball game last week and won a trophy).

In addition to user emotion detection, utterance intention analysis, and recording and understanding of historical dialogue content, this step may also rewrite the acquired input statement information. Specifically, the input can be rewritten by using natural language processing technology to correct typos and complete omissions in the sentences input by the user; for example, a malformed input such as "Is you aware of?" may be corrected to "Are you aware of it?". User emotion detection and user utterance intention analysis may be performed using a machine learning model, for example: User: "I'm so annoyed"; corresponding identification: emotion state: negative; dialogue intention: user annoyed.

The user's historical dialogue content may also be stored and managed using a database (i.e., the database that records the global state in fig. 2) or other recording means; the global state is the global dialogue state. For example: Machine AI: "Do you usually like playing basketball?" User: "Dislike"; the system will then record <dislike, play basketball> into the database.
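For illustration only, the step described above can be pictured with the following minimal Python sketch of assembling the global dialogue state; the classifier callables, the history store and all field names are assumptions introduced for this example rather than the actual implementation of the invention.

```python
from dataclasses import dataclass, field

@dataclass
class GlobalDialogueState:
    emotion: str                 # e.g. "negative" / "neutral" / "positive"
    intent: str                  # e.g. "user annoyed"
    history_preferences: list = field(default_factory=list)   # e.g. [("dislike", "play basketball")]
    history_interactions: list = field(default_factory=list)  # e.g. ["won a basketball trophy last week"]

def build_global_state(utterance, emotion_clf, intent_clf, history_db, user_id):
    # Current dialogue state: emotion state + dialogue intention of the input sentence.
    emotion = emotion_clf(utterance)   # first classification evaluation model
    intent = intent_clf(utterance)     # second classification evaluation model
    # Historical dialogue state: key features extracted from the recorded history.
    record = history_db.get(user_id, {"preferences": [], "interactions": []})
    return GlobalDialogueState(emotion, intent,
                               record["preferences"], record["interactions"])
```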
Step 102: the global dialog state is input to a policy network model to determine a reply policy corresponding to the statement information in the global dialog state. The policy network model is obtained by training based on a sample dialogue state, a sample reply policy corresponding to the sample dialogue state and a sample rewarding value.
In the implementation of this step, conditional probability analysis can be carried out on the global dialogue state based on the policy network model, to obtain the probability values of the various reply strategies corresponding to the statement information in the current dialogue state under the global dialogue state scene; the reply strategy corresponding to the statement information is then determined based on the probability values of the various reply strategies.
The policy network model (policy network) is a Transformer-based multi-classification model, and its training process is updated and trained in real time in combination with online interactive dialogue state data. In a complete interactive dialogue process, the <action, state, reward> of each moment (i.e., each time step, which may correspond to one round of interactive dialogue) is collected, where the action may be the dialogue intention (i.e., the sample reply strategy), the state may be the sample dialogue state, and the reward may be the sample reward value. Specifically, the parameters of the policy network model can be updated using the PPO algorithm with the goal of maximizing the total reward value over the whole interactive dialogue, obtaining the updated policy network model.
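The per-turn collection described above can be pictured as follows; the container and helper are hypothetical names used only for this sketch.

```python
trajectory = []  # one (state, action, reward) triple per time step / dialogue round

def record_turn(state, action, reward):
    # state:  global dialogue state at this turn
    # action: the chosen reply strategy (dialogue intention)
    # reward: scalar produced by the reward model for this turn
    trajectory.append((state, action, reward))
```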
First, the return value corresponding to the action at each moment is calculated; the total return value is the sum of all (discounted) reward values from the current moment to the end moment. The specific dialogue management algorithm comprises:

$$G_t = \sum_{i=t}^{N} \gamma^{\,i-t}\, r_i$$

where $\tau$ denotes an entire trajectory (i.e., a complete session of interactive dialogue, such as 5 rounds of interactive dialogue performed in one day); $\gamma$ takes a value in $(0,1)$ and attenuates the reward of each time step to some extent so that the sum of the overall reward can converge; $R(\tau)$ is the total reward value of the entire interactive dialogue; $t$ denotes the moment (each moment or time step may correspond to one interactive dialogue round); $\gamma^{\,i-t}$ is the discount used at moment $i$ relative to $t$; $r_t$ is the reward value at moment $t$; $i$ is an intermediate variable; $N$ is the number of all reply sentences in a section of interactive dialogue. The optimization objective is to maximize the sum of the rewards of the entire trajectory (i.e., the total return value). The specific dialogue management algorithm comprises:
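As a concrete illustration of the return formula above, a minimal Python sketch that computes $G_t$ for every turn of one trajectory; the discount value 0.9 is an assumed example, not a value fixed by the invention.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t = sum_{i=t}^{N} gamma^(i-t) * r_i for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: discounted_returns([1, 2, 2, 1]) -> [5.149, 4.61, 2.9, 1.0]
```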
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big], \qquad p_\theta(\tau) = p(s_0)\,\prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

where $\pi_\theta$ denotes the policy network model; $J(\theta)$ denotes the objective function trained by the policy network model; $\mathbb{E}$ denotes the expectation; $\tau \sim p_\theta(\tau)$ denotes a trajectory produced under the policy network model $\pi_\theta$, with actions $a_t$ (i.e., reply strategies); $p(s_0)$ denotes the probability of the initial state corresponding to the initial turn in a section of interactive dialogue; $T$ denotes the number of interaction turns in a session of dialogue, e.g., 7 turns; $t=0$ denotes starting from the first turn in a session; $\pi_\theta(a_t \mid s_t)$ denotes the probability that the policy network model $\pi_\theta$ generates the reply strategy $a_t$ in the current global dialogue state (i.e., the current global state information) $s_t$; $s_{t+1}$ is the global dialogue state of the next time step; $p(s_{t+1} \mid s_t, a_t)$ denotes the probability that, given this time step's global dialogue state and behavior, the dialogue transitions to the next global dialogue state; this term does not need to be calculated when the policy gradient method is finally applied for updating.
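Since the transition term $p(s_{t+1} \mid s_t, a_t)$ drops out of the gradient, a plain policy-gradient surrogate of this objective only needs $\log \pi_\theta(a_t \mid s_t)$ weighted by the return. Below is a minimal PyTorch sketch under assumed tensor shapes (one logit row per dialogue turn); this is the generic REINFORCE form, shown before the PPO variant the invention actually names.

```python
import torch

def policy_gradient_loss(logits, actions, returns):
    # logits:  (T, num_strategies) policy network outputs, one row per turn
    # actions: (T,) indices of the reply strategies actually taken
    # returns: (T,) discounted returns G_t computed as above
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Maximizing E[sum_t G_t * log pi(a_t | s_t)] == minimizing its negation.
    return -(chosen * returns).mean()
```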
In a specific training process, take the following complete dialogue as an example:
t1: AI: What have you been playing recently? User: Nothing;
t2: AI: Why nothing? (action: inquiry; reward: 1; state: {what have you been playing recently, nothing}); User: Don't feel like moving;
t3: AI: Then what do you usually like to do? (action: transfer topic; reward: 2; state: {why nothing, don't feel like moving}); User: Playing basketball;
t4: AI: Do you like ball star A? (action: inquiry; reward: 2; state: {what do you usually like to do, playing basketball}); User: I like him, his fadeaway jumper is very beautiful;
t5: AI: I also think his shooting is very accurate. (action: agree; reward: 1; state: {do you like ball star A; I like him, his fadeaway jumper is very beautiful}); User: Indeed.
Here, the state at each moment is expressed as the preceding statements of the dynamic chat information in the device. Throughout the interactive dialogue: at each time step, the reward value in this state is calculated by the reward model, $r_t = \mathrm{RewardModel}(s_t)$; further, the total return value of the session is calculated as $R(\tau) = \sum_{t=0}^{T} \gamma^{\,t} r_t$ (for simplicity, temporarily let $\gamma = 1$); the outputs of the policy network at all time steps in the current state (i.e., the current global dialogue state) are calculated, i.e., $\pi_\theta(a_t \mid s_t)$; and iterative optimization is then performed using the PPO algorithm. The invention can determine the next operation of the dialogue by using the policy network model, and can guide the development direction of the dialogue by using the dialogue management algorithm; dialogue quality can be assessed using evaluation metrics including, but not limited to, dimensions such as the safety, interestingness and rationality of the replies.
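The invention names PPO but does not spell out the update; the standard clipped surrogate it refers to has the following shape (clip_eps = 0.2 is the conventional default, an assumption here, and the advantage estimates are taken as given).

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new / pi_old per turn
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two surrogates; negate for a loss.
    return -torch.min(unclipped, clipped).mean()
```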
Step 103: a context sentence of the statement information is acquired, and a reply sentence is generated based on the reply strategy, the global dialogue state, and the context sentence.
In the implementation of this step, the reply strategy, the global dialogue state and the context sentence can be input into a preset language model to obtain the reply sentence output by the language model; the language model is a pre-trained language model based on a Transformer architecture, obtained by training in advance based on third sample sentence information and reply sentences corresponding to the third sample sentence information. The reply strategies include, but are not limited to, a questioning reply strategy, a replying reply strategy, a replying-and-questioning reply strategy, and the like. Determining the reply strategy corresponding to the statement information based on the probability values of the various reply strategies may include: comparing the probability values respectively corresponding to the questioning reply strategy, the replying reply strategy and the replying-and-questioning reply strategy, and determining the reply strategy with the largest corresponding probability value as the reply strategy corresponding to the statement information. It should be noted that the context sentence refers to the preceding sentence information in the current interaction scene; for example, if one interaction scene includes 7 rounds of interaction sentences and the currently input statement information is the 7th round, the context sentence may refer to the 6th round of interaction sentence information in that scene, and may of course also be rounds 3-6. Specifically, GPT natural language generation technology is used to generate the reply sentence; a reply sentence may be generated according to the reply strategy obtained in step 102, together with the preceding sentence of the current dialogue and the global dialogue state. GPT is a Transformer-based language model. For example, User: "I'm out playing right now, so happy"; at this point, the "questioning" reply strategy may be selected, i.e., <question> User: I'm out playing right now, so happy; Machine AI: "Where are you playing?".
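A minimal sketch of this generation step using the Hugging Face transformers API; the "gpt2" checkpoint and the prompt format are stand-ins chosen for the example, since the invention specifies neither.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_reply(strategy, global_state, context, utterance):
    # Condition the language model on the reply strategy, the global dialogue
    # state and the context sentence, as described in step 103.
    prompt = (f"<{strategy}> state: {global_state} | context: {context} | "
              f"user: {utterance} | AI:")
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# e.g. generate_reply("question", "emotion=positive", "", "I'm out playing right now, so happy")
```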
In the implementation process of the invention, after each round of generating the reply sentence, the method further comprises the following steps:
carrying out multidimensional evaluation analysis on the current dialogue state based on the currently input statement information and the reply sentence, and determining a corresponding reward value; and determining update parameters of the policy network model based on the reward value, with maximizing the reward value of the current dialogue state as the optimization target, to obtain the updated policy network model. The current dialogue state comprises a plurality of pieces of currently input statement information and a plurality of reply sentences corresponding to the statement information. The global dialogue state is then analyzed based on the updated policy network model, to determine the reply strategy corresponding to the statement information input next time under the global dialogue state.
The step of carrying out multidimensional evaluation analysis on the current dialogue state based on the currently input statement information and the reply sentence to determine a corresponding reward value specifically comprises: inputting the currently input statement information and the reply sentence into a preset reward model (i.e., a Reward Model), and carrying out multidimensional evaluation analysis on the current dialogue state based on a dialogue naturalness classification evaluation model, an information accuracy classification evaluation model, a user satisfaction classification evaluation model and an emotion expression classification evaluation model in the reward model, to obtain a dialogue naturalness evaluation value, an information accuracy evaluation value, a user satisfaction evaluation value and an emotion expression evaluation value; and determining the reward value of the current dialogue state information based on the dialogue naturalness evaluation value, the information accuracy evaluation value, the user satisfaction evaluation value, and the emotion expression evaluation value. It should be noted that the dialogue naturalness classification evaluation model, the information accuracy classification evaluation model, the user satisfaction classification evaluation model and the emotion expression classification evaluation model are all trained classification evaluation models based on a Transformer architecture. In the method, a separate classification evaluation model is built for each dimension (for example, dialogue naturalness has its own dedicated classification evaluation model), which can effectively improve the quality and naturalness of the interactive dialogue.
The main difference between this reward model and the reward function in the traditional technology is that the traditional technology establishes a single model or a hand-crafted rule system that considers only a single index, with limited extensibility and flexibility. For multidimensional evaluation factors, the corresponding reward score (i.e., the total reward value) is output by combining multiple classification evaluation models. The dimensions considered include, but are not limited to, the naturalness of the dialogue, the accuracy of the information, user satisfaction, the expression of emotion, and the like.
Classification-based reward model setting: the reward model is based on a Transformer; it takes as input the context sentence of the current dialogue, the reply strategy, and the global dialogue state, and outputs the corresponding reward score, i.e., the reward value. For example, in data collection, the offline data collection stage scores comprehensively according to the following dimensions, and the scores are finally added to obtain the reward score:

• dialogue naturalness labels: <natural, unnatural>, corresponding to reward scores <1, 0> respectively;
• information accuracy labels: <accurate, inaccurate>, corresponding to reward scores <1, 0> respectively;
• user satisfaction labels: <satisfied, dissatisfied>, corresponding to reward scores <1, 0> respectively;
• emotion expression labels: <negative→positive, neutral, positive→negative>, corresponding to reward scores <2, 1, 0> respectively (used to judge the change of the user's current emotion signal).

Reference is made in particular to fig. 3. Specifically, the training format may be: input: [CLS] the weather is so good today [SEP] it is very good, label: 4; input: [CLS] the weather is so good today [SEP] it is very good [SEP] are you going out to play [SEP] going out, label: 4.
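A sketch of how the four label-to-score maps listed above could be combined into one reward score; the classifier callables stand in for the four trained Transformer classification evaluation models and are assumptions of this example.

```python
NATURALNESS  = {"natural": 1, "unnatural": 0}
ACCURACY     = {"accurate": 1, "inaccurate": 0}
SATISFACTION = {"satisfied": 1, "dissatisfied": 0}
EMOTION      = {"negative->positive": 2, "neutral": 1, "positive->negative": 0}

def reward_score(utterance, reply, nat_clf, acc_clf, sat_clf, emo_clf):
    pair = (utterance, reply)
    return (NATURALNESS[nat_clf(pair)] + ACCURACY[acc_clf(pair)]
            + SATISFACTION[sat_clf(pair)] + EMOTION[emo_clf(pair)])

# The summed score ranges from 0 to 5, consistent with the "label: 4" examples above.
```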
It should be noted that, the loss functions used by the dialogue naturalness classification evaluation model, the information accuracy classification evaluation model, the user satisfaction classification evaluation model and the emotion expression classification evaluation model in the reward model may be as follows:
Loss: during training, the cross-entropy loss function is adopted:

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log p_i$$

where $N$ is the number of all training sample sentences; $i$ is an intermediate variable (the sample index); $y_i$ is the correct category corresponding to sample $i$, such as the satisfied or dissatisfied category in user satisfaction; and $p_i$ is the predicted probability under the correct category $y_i$. Feedback learning is carried out on the system so that it is continuously optimized. The reward model requires offline training on data. In the dialogue system, the reward model generates a reward score (i.e., a reward value) by combining the user reply and the machine reply; during the interaction between the user and the machine, the system collects all the reward scores (i.e., the total reward value) in a dialogue, the reply sentences are updated accordingly, and the robot can reach an ideal level of dialogue interaction. By comprehensively utilizing technologies such as language understanding, dialogue management, reply generation and the reward model, reinforcement learning interaction design in the chit-chat dialogue scene is realized. Through analysis and understanding of the user's language, guidance and optimization by dialogue management, and generation and quality evaluation of reply sentences, continuous learning and optimization of the system is achieved and dialogue quality is improved.
According to the interaction method based on reinforcement learning provided by the invention, the currently input statement information is acquired and its global dialogue state is determined; the global dialogue state is input into the policy network model to determine the reply strategy corresponding to the statement information under the global dialogue state; then a context sentence of the statement information is acquired, and a reply sentence is generated based on the reply strategy, the global dialogue state and the context sentence. The interactive system can respond in time to the state change information of the user based on the global dialogue state, so that the interaction intention of the user is quickly understood, the quality and naturalness of the interactive dialogue are effectively improved, and the use experience of the user is improved.
Corresponding to the interaction method based on reinforcement learning, the invention also provides an interaction device based on reinforcement learning. Since the embodiments of the device are similar to the method embodiments described above, the description is relatively simple, and reference should be made to the description of the method embodiments section above, and the embodiments of the reinforcement learning-based interaction device described below are merely illustrative. Fig. 4 is a schematic structural diagram of an interactive device based on reinforcement learning according to an embodiment of the present invention.
The invention relates to an interaction device based on reinforcement learning, which specifically comprises the following parts:
the language understanding module 401 is configured to obtain currently input sentence information, and determine a global dialogue state of the sentence information;
a dialogue management module 402, configured to input the global dialogue state into a policy network model, so as to determine a reply strategy corresponding to the statement information under the global dialogue state;
the policy network model is obtained by training based on a sample dialogue state, a sample reply strategy corresponding to the sample dialogue state, and a sample reward value;
a reply generation module 403, configured to acquire the context sentence of the statement information, and generate a reply sentence based on the reply strategy, the global dialogue state, and the context sentence.
Further, after each round of reply sentence generation, the device further comprises:
a reward model processing and updating module, configured to carry out multidimensional evaluation analysis on the current dialogue state based on the currently input statement information and the reply sentence, and determine a corresponding reward value; determine update parameters of the policy network model based on the reward value, with maximizing the reward value of the current dialogue state as the optimization target, and obtain an updated policy network model; wherein the current dialogue state comprises a plurality of pieces of currently input statement information and a plurality of reply sentences corresponding to the statement information; and the reply generation module is further configured to analyze the global dialogue state based on the updated policy network model, to determine a reply strategy corresponding to the statement information input next time under the global dialogue state.
Further, the language understanding module is specifically configured to:
carrying out emotion analysis on the statement information based on a preset first classification evaluation model to obtain an emotion state corresponding to the statement information; carrying out semantic analysis on the statement information based on a preset second classification evaluation model to obtain a dialogue intention corresponding to the statement information; determining current dialog state information based on the emotional state and the dialog intention; the first classification evaluation model is obtained by training based on first sample sentence information and emotion labels corresponding to the first sample sentence information; the second classification evaluation model is obtained by training based on second sample sentence information and dialogue intention labels corresponding to the second sample sentence information;
determining a target user corresponding to the statement information, and acquiring historical dialogue information of the target user in a preset time range; the history dialogue information comprises history statement information input by the target user in a preset time range and corresponding history reply statements; analyzing the historical dialogue information, extracting key feature information in the historical dialogue information, and determining historical dialogue state information of the target user based on the key feature information; the history dialogue state information comprises history preference information and history interaction information corresponding to the target user;
The current dialogue state information and the historical dialogue state information are determined as the global dialogue state.
Further, the reply generation module is specifically configured to: input the reply strategy, the global dialogue state and the context sentence into a preset language model to obtain a reply sentence output by the language model; the language model is a pre-trained language model based on a Transformer architecture, obtained by training in advance based on third sample sentence information and reply sentences corresponding to the third sample sentence information.
Further, the reward model processing and updating module is specifically configured to:
inputting the currently input statement information and the reply sentence into a preset reward model, and carrying out multidimensional evaluation analysis on the current dialogue state based on a dialogue naturalness classification evaluation model, an information accuracy classification evaluation model, a user satisfaction classification evaluation model and an emotion expression classification evaluation model in the reward model, to obtain a dialogue naturalness evaluation value, an information accuracy evaluation value, a user satisfaction evaluation value and an emotion expression evaluation value;
determining a reward value of the current dialogue state information based on the dialogue naturalness evaluation value, the information accuracy evaluation value, the user satisfaction evaluation value, and the emotion expression evaluation value.
Further, the dialogue management module is specifically configured to:
carry out conditional probability analysis on the global dialogue state based on the policy network model, to obtain probability values of the various reply strategies corresponding to the statement information in the current dialogue state under the global dialogue state scene; and determine the reply strategy corresponding to the statement information based on the probability values of the various reply strategies.
Further, the various reply strategies comprise a questioning reply strategy, a replying reply strategy, and a replying-and-questioning reply strategy;
the determining the reply strategy corresponding to the statement information based on the probability values of the various reply strategies specifically comprises the following steps:
comparing the probability values respectively corresponding to the questioning reply strategy, the replying reply strategy and the replying-and-questioning reply strategy, and determining, among them, the reply strategy with the largest corresponding probability value as the reply strategy corresponding to the statement information.
In the embodiment of the invention, the user inputs statement information, and the language understanding module performs semantic analysis and understanding on the sentence. The dialogue management module determines the next operation of the dialogue according to the analysis result of the language understanding module. The reply generation module generates the corresponding reply sentence according to the operation result of the dialogue management module. The sentences input by the user and the reply sentences generated by the system are transmitted to the reward model (i.e., the reward model processing and updating module), which, combining the current and global dialogue states with the evaluation results of dialogue quality, gives the system a reward value; these steps are carried out cyclically until the dialogue ends. The dialogue management module learns and optimizes according to the accumulated reward value of the round, thereby improving the dialogue quality.
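The cycle just described can be summarized in the following hedged sketch; every callable is a placeholder for the corresponding module rather than the device's actual interface.

```python
def interaction_loop(understand, policy, generate, reward_model, update_policy):
    trajectory = []
    while True:
        utterance = input("user> ")
        if not utterance:
            break                                       # dialogue ends
        state = understand(utterance)                   # language understanding module
        strategy = policy(state)                        # dialogue management module
        reply = generate(strategy, state, utterance)    # reply generation module
        print("AI>", reply)
        trajectory.append((state, strategy, reward_model(utterance, reply)))
    update_policy(trajectory)  # learn from the accumulated reward values of the round
```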
It should be noted that, compared with other systems, the language understanding module needs to perform not only grammatical and semantic analysis but also emotion analysis, so as to understand the emotional state of the user and the purpose of the language used. This facilitates the decisions of the dialogue management module and the language generation of the reply generation module. It is also necessary to analyze the content of the user's dialogue during the process and record the key information, so that the machine has memory and understanding of the user's historical dialogue content, better supporting the operation of the dialogue management module. In this way, repeated or contradictory content in the dialogue can be avoided, and the consistency and naturalness of the dialogue are improved. In addition, unlike task-driven dialogue, the dialogue management module of the device does not need an overly complicated design for maintaining the context and state of the dialogue, but mainly focuses on how to handle interaction with the user, how to guide the development direction of the dialogue, how to improve the quality of the dialogue, and so on. These are further implemented by means of the policy network model.
Other dialogue systems based on reinforcement learning are mainly task-driven: task-driven dialogue systems mainly set the reward value according to whether the task is completed, whereas in chit-chat dialogue the goal is more complex and setting the reward value is more difficult. The advantage of the device is therefore that, in designing the reward model, evaluation factors of multiple dimensions are considered, such as the naturalness of the dialogue, the accuracy of the information and the expression of emotion, combined with the user's global historical information. The device applies GPT natural language generation technology to realize reply generation, so as to ensure the accuracy of the information and make the reply sentences more natural and fluent.
According to the interaction device based on reinforcement learning provided by the invention, the currently input statement information is acquired and its global dialogue state is determined; the global dialogue state is input into the policy network model to determine the reply strategy corresponding to the statement information under the global dialogue state; then a context sentence of the statement information is acquired, and a reply sentence is generated based on the reply strategy, the global dialogue state and the context sentence. The interactive system can respond in time to the state change information of the user based on the global dialogue state, so that the interaction intention of the user is quickly understood, the quality and naturalness of the interactive dialogue are effectively improved, and the use experience of the user is improved.
Corresponding to the interaction method based on reinforcement learning, the invention further provides an electronic device. Since the embodiments of the electronic device are similar to the method embodiments described above, the description is relatively simple, and reference should be made to the description of the method embodiments above; the electronic device described below is merely illustrative. Fig. 5 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. The electronic device may include: a processor (processor) 501, a memory (memory) 502 and a communication bus 503, wherein the processor 501 and the memory 502 communicate with each other through the communication bus 503 and communicate with the outside through a communication interface 504. The processor 501 may invoke logic instructions in the memory 502 to perform the reinforcement learning based interaction method, which comprises: acquiring currently input statement information, and determining a global dialogue state of the statement information; inputting the global dialogue state into a policy network model to determine a reply strategy corresponding to the statement information under the global dialogue state, the policy network model being obtained by training based on a sample dialogue state, a sample reply strategy corresponding to the sample dialogue state and a sample reward value; and acquiring a context sentence of the statement information, and generating a reply sentence based on the reply strategy, the global dialogue state, and the context sentence.
Further, the logic instructions in the memory 502 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a memory chip, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, embodiments of the present invention also provide a computer program product, including a computer program stored on a processor-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform the reinforcement learning-based interaction method provided by the above method embodiments. The method comprises: acquiring currently input sentence information and determining a global dialogue state of the sentence information; inputting the global dialogue state into a policy network model to determine a reply strategy corresponding to the sentence information under the global dialogue state, the policy network model being trained based on a sample dialogue state, a sample reply strategy corresponding to the sample dialogue state, and a sample reward value; and acquiring a context sentence of the sentence information and generating a reply sentence based on the reply strategy, the global dialogue state, and the context sentence.
In yet another aspect, embodiments of the present invention further provide a processor-readable storage medium having a computer program stored thereon which, when executed by a processor, performs the reinforcement learning-based interaction method provided in the foregoing embodiments. The method comprises: acquiring currently input sentence information and determining a global dialogue state of the sentence information; inputting the global dialogue state into a policy network model to determine a reply strategy corresponding to the sentence information under the global dialogue state, the policy network model being trained based on a sample dialogue state, a sample reply strategy corresponding to the sample dialogue state, and a sample reward value; and acquiring a context sentence of the sentence information and generating a reply sentence based on the reply strategy, the global dialogue state, and the context sentence.
The processor-readable storage medium may be any available medium or data storage device that can be accessed by a processor, including, but not limited to, magnetic storage (e.g., floppy disks, hard disks, magnetic tape, magneto-optical (MO) disks), optical storage (e.g., CD, DVD, BD, HVD), and semiconductor storage (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND flash), solid-state disks (SSD)).
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or, of course, by hardware. Based on this understanding, the above technical solution, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, and comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in each embodiment or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An interaction method based on reinforcement learning, comprising:
acquiring currently input sentence information, and determining a global dialogue state of the sentence information;
inputting the global dialogue state into a policy network model to determine a reply strategy corresponding to the sentence information under the global dialogue state;
wherein the policy network model is trained based on a sample dialogue state, a sample reply strategy corresponding to the sample dialogue state, and a sample reward value; and
acquiring a context sentence of the sentence information, and generating a reply sentence based on the reply strategy, the global dialogue state, and the context sentence.
2. The reinforcement learning-based interaction method of claim 1, further comprising, after each generation of the reply sentence:
performing multidimensional evaluation analysis on the current dialogue state based on the currently input sentence information and the reply sentence, and determining a corresponding reward value;
determining update parameters of the policy network model based on the reward value, with maximizing the reward value of the current dialogue state as the optimization target, to obtain an updated policy network model; wherein the current dialogue state comprises a plurality of pieces of currently input sentence information and the plurality of reply sentences corresponding to the sentence information; and
analyzing the global dialogue state based on the updated policy network model to determine the reply strategy corresponding to the next input sentence information under the global dialogue state.
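Claim 2 does not prescribe a particular update algorithm; one common realization is a REINFORCE-style policy gradient, sketched below under the assumption that the policy network is a PyTorch module and that log_prob is the log-probability of the selected reply strategy, still attached to the computation graph.

    import torch

    def update_policy(optimizer, log_prob, reward):
        # REINFORCE-style objective: raise the log-probability of the chosen
        # reply strategy in proportion to the observed reward; the optimizer
        # minimizes, hence the negative sign (gradient ascent on the reward).
        loss = -log_prob * reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()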
3. The reinforcement learning-based interaction method of claim 1, wherein the determining a global dialogue state of the sentence information specifically comprises:
performing emotion analysis on the sentence information based on a preset first classification evaluation model to obtain an emotional state corresponding to the sentence information; performing semantic analysis on the sentence information based on a preset second classification evaluation model to obtain a dialogue intention corresponding to the sentence information; and determining current dialogue state information based on the emotional state and the dialogue intention; wherein the first classification evaluation model is trained based on first sample sentence information and emotion labels corresponding to the first sample sentence information, and the second classification evaluation model is trained based on second sample sentence information and dialogue intention labels corresponding to the second sample sentence information;
determining a target user corresponding to the sentence information, and acquiring historical dialogue information of the target user within a preset time range; wherein the historical dialogue information comprises historical sentence information input by the target user within the preset time range and the corresponding historical reply sentences;
analyzing the historical dialogue information, extracting key feature information from the historical dialogue information, and determining historical dialogue state information of the target user based on the key feature information; wherein the historical dialogue state information comprises historical preference information and historical interaction information corresponding to the target user; and
determining the current dialogue state information and the historical dialogue state information as the global dialogue state.
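How the pieces recited in claim 3 fit together can be sketched as follows; the two classification models, the history store and the two feature extractors are hypothetical placeholders with assumed interfaces.

    def extract_preferences(history):
        # Hypothetical stand-in: a real system would mine topics and likes.
        return [turn for turn in history if "like" in turn.lower()]

    def extract_interactions(history):
        # Hypothetical stand-in: summary statistics over past interactions.
        return {"turns": len(history)}

    def build_global_dialogue_state(sentence, emotion_model, intent_model,
                                    history_store, user_id, hours=24):
        # Current dialogue state from the two preset classification models.
        current_state = {
            "emotion": emotion_model.predict(sentence),
            "intent": intent_model.predict(sentence),
        }
        # Historical dialogue state from the user's recent dialogues.
        history = history_store.fetch(user_id, hours=hours)
        historical_state = {
            "preferences": extract_preferences(history),
            "interactions": extract_interactions(history),
        }
        # The global dialogue state pairs the current and historical states.
        return {"current": current_state, "history": historical_state}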
4. The reinforcement learning-based interaction method of claim 1, wherein the generating a reply sentence based on the reply strategy, the global dialogue state and the context sentence specifically comprises: inputting the reply strategy, the global dialogue state and the context sentence into a preset language model to obtain the reply sentence output by the language model; wherein the language model is a pre-trained language model based on a Transformer architecture, trained in advance based on third sample sentence information and reply sentences corresponding to the third sample sentence information.
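A minimal sketch of the generation step in claim 4, assuming a Hugging Face causal language model ("gpt2" is only a stand-in for the pre-trained Transformer model actually used) and assuming that the reply strategy, global dialogue state and context sentence are injected by simple prompt concatenation, which the claim does not prescribe:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def generate_reply(reply_strategy, global_state, context):
        # How strategy, state and context are encoded as model input is not
        # specified by the claim; tagged concatenation is one plausible choice.
        prompt = f"[strategy]{reply_strategy}[state]{global_state}[context]{context}[reply]"
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=50,
                                do_sample=True, top_p=0.9)
        new_tokens = output[0][inputs["input_ids"].shape[1]:]  # keep only the reply
        return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()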
5. The reinforcement learning-based interaction method of claim 2, wherein the performing multidimensional evaluation analysis on the current dialogue state based on the currently input sentence information and the reply sentence, and determining a corresponding reward value specifically comprises:
inputting the currently input sentence information and the reply sentence into a preset reward model, and performing multidimensional evaluation analysis on the current dialogue state based on a dialogue naturalness classification evaluation model, an information accuracy classification evaluation model, a user satisfaction classification evaluation model and an emotion expression classification evaluation model in the reward model, to obtain a dialogue naturalness evaluation value, an information accuracy evaluation value, a user satisfaction evaluation value and an emotion expression evaluation value; and
determining the reward value of the current dialogue state information based on the dialogue naturalness evaluation value, the information accuracy evaluation value, the user satisfaction evaluation value, and the emotion expression evaluation value.
6. The reinforcement learning-based interaction method of claim 1, wherein the inputting the global dialogue state into a policy network model to determine a reply strategy corresponding to the sentence information under the global dialogue state specifically comprises:
performing conditional probability analysis on the global dialogue state based on the policy network model to obtain probability values of various reply strategies corresponding to the sentence information of the current dialogue state under the global dialogue state scenario; and determining the reply strategy corresponding to the sentence information based on the probability values of the various reply strategies.
7. The reinforcement learning-based interaction method of claim 6, wherein the various reply strategies comprise a questioning reply strategy, an answering reply strategy and a counter-questioning reply strategy;
the determining the reply strategy corresponding to the sentence information based on the probability values of the various reply strategies specifically comprises:
comparing the probability values corresponding to the questioning reply strategy, the answering reply strategy and the counter-questioning reply strategy, and determining the reply strategy with the largest probability value among them as the reply strategy corresponding to the sentence information.
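The selection recited in claims 6 and 7 amounts to a softmax over per-strategy scores followed by an argmax; the English strategy labels and the PyTorch realization below are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    STRATEGIES = ["questioning", "answering", "counter_questioning"]

    def select_reply_strategy(policy_net, global_state_vector):
        logits = policy_net(global_state_vector)  # one score per reply strategy
        probs = F.softmax(logits, dim=-1)         # conditional probability values
        best = int(torch.argmax(probs))           # the largest probability wins
        return STRATEGIES[best], probs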
8. An interaction apparatus based on reinforcement learning, comprising:
a language understanding module, configured to acquire currently input sentence information and determine a global dialogue state of the sentence information;
a dialogue management module, configured to input the global dialogue state into a policy network model to determine a reply strategy corresponding to the sentence information under the global dialogue state;
wherein the policy network model is trained based on a sample dialogue state, a sample reply strategy corresponding to the sample dialogue state, and a sample reward value; and
a reply generation module, configured to acquire a context sentence of the sentence information, and generate a reply sentence based on the reply strategy, the global dialogue state, and the context sentence.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the reinforcement learning-based interaction method of any one of claims 1 to 7.
10. A processor-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the reinforcement learning-based interaction method of any one of claims 1 to 7.
CN202310809050.3A 2023-07-04 2023-07-04 Interaction method and device based on reinforcement learning Active CN116521850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310809050.3A CN116521850B (en) 2023-07-04 2023-07-04 Interaction method and device based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310809050.3A CN116521850B (en) 2023-07-04 2023-07-04 Interaction method and device based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN116521850A true CN116521850A (en) 2023-08-01
CN116521850B CN116521850B (en) 2023-12-01

Family

ID=87390731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310809050.3A Active CN116521850B (en) 2023-07-04 2023-07-04 Interaction method and device based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN116521850B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190385051A1 (en) * 2018-06-14 2019-12-19 Accenture Global Solutions Limited Virtual agent with a dialogue management system and method of training a dialogue management system
CN110046221A (en) * 2019-03-01 2019-07-23 平安科技(深圳)有限公司 A kind of machine dialogue method, device, computer equipment and storage medium
CN111368051A (en) * 2020-02-28 2020-07-03 平安科技(深圳)有限公司 Dialog generation method and device and computer equipment
CN111428483A (en) * 2020-03-31 2020-07-17 华为技术有限公司 Voice interaction method and device and terminal equipment
US20220108080A1 (en) * 2020-10-02 2022-04-07 Paypal, Inc. Reinforcement Learning Techniques for Dialogue Management
CN112256856A (en) * 2020-11-16 2021-01-22 北京京东尚科信息技术有限公司 Robot dialogue method, device, electronic device and storage medium
CN112507094A (en) * 2020-12-11 2021-03-16 润联软件系统(深圳)有限公司 Customer service robot dialogue method based on reinforcement learning and related components thereof
CN113704419A (en) * 2021-02-26 2021-11-26 腾讯科技(深圳)有限公司 Conversation processing method and device
CN113239167A (en) * 2021-05-31 2021-08-10 百融云创科技股份有限公司 Task type conversation management method and system capable of automatically generating conversation strategy
CN114444510A (en) * 2021-12-24 2022-05-06 科大讯飞股份有限公司 Emotion interaction method and device and emotion interaction model training method and device
CN115203393A (en) * 2022-07-21 2022-10-18 中国平安人寿保险股份有限公司 Dialogue response method and system, electronic equipment and storage medium
CN115757749A (en) * 2023-01-05 2023-03-07 北京红棉小冰科技有限公司 Conversation processing method and device, electronic equipment and storage medium
CN116303944A (en) * 2023-02-23 2023-06-23 重庆特斯联启智科技有限公司 Task type dialogue method and system based on offline reinforcement learning

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116760942A (en) * 2023-08-22 2023-09-15 云视图研智能数字技术(深圳)有限公司 Holographic interaction teleconferencing method and system
CN116760942B (en) * 2023-08-22 2023-11-03 云视图研智能数字技术(深圳)有限公司 Holographic interaction teleconferencing method and system
CN117093696A (en) * 2023-10-16 2023-11-21 浙江同花顺智能科技有限公司 Question text generation method, device, equipment and medium of large language model
CN117093696B (en) * 2023-10-16 2024-02-02 浙江同花顺智能科技有限公司 Question text generation method, device, equipment and medium of large language model
CN117689007A (en) * 2023-12-14 2024-03-12 北京计算机技术及应用研究所 Co-emotion dialogue training method and system based on scene adaptation

Also Published As

Publication number Publication date
CN116521850B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
CN116521850B (en) Interaction method and device based on reinforcement learning
CN108153780B (en) Man-machine conversation device and method for realizing man-machine conversation
CN107342078B (en) Conversation strategy optimized cold start system and method
CN109299237B (en) Cyclic network man-machine conversation method based on actor critic reinforcement learning algorithm
CN108763495B (en) Interactive method, system, electronic equipment and storage medium
CN111160514B (en) Conversation method and system
CN109460463A (en) Model training method, device, terminal and storage medium based on data processing
Cuayáhuitl et al. Deep reinforcement learning for multi-domain dialogue systems
CN107704482A (en) Method, apparatus and program
US9361589B2 (en) System and a method for providing a dialog with a user
CN111046178B (en) Text sequence generation method and system
CN112417894A (en) Conversation intention identification method and system based on multi-task learning
CN117252260B (en) Interview skill training method, equipment and medium based on large language model
EP2879062A2 (en) A system and a method for providing a dialog with a user
CN117556802B (en) User portrait method, device, equipment and medium based on large language model
CN116975214A (en) Text generation method, device, storage medium and computer equipment
CN113360618A (en) Intelligent robot dialogue method and system based on offline reinforcement learning
CN113297365A (en) User intention determination method, device, equipment and storage medium
CN112927692A (en) Automatic language interaction method, device, equipment and medium
CN113486166B (en) Construction method, device and equipment of intelligent customer service robot and storage medium
CN113139079B (en) Music recommendation method and system
CN115757749B (en) Dialogue processing method and device, electronic equipment and storage medium
CN117370512A (en) Method, device, equipment and storage medium for replying to dialogue
Nishimoto et al. Dialogue management with deep reinforcement learning: Balancing exploration and exploitation
CN113569032A (en) Conversational recommendation method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant