CN111160514B - Conversation method and system - Google Patents

Conversation method and system

Info

Publication number
CN111160514B
CN111160514B (application CN202010251489.5A)
Authority
CN
China
Prior art keywords
conversation
utterance
dialog
model
responsive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010251489.5A
Other languages
Chinese (zh)
Other versions
CN111160514A (en)
Inventor
王子豪
崔恒斌
刘佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010251489.5A
Publication of CN111160514A
Application granted
Publication of CN111160514B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/008Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour

Abstract

The embodiments of this specification disclose a conversation method and system. The conversation method comprises: acquiring a dialog context, the dialog context comprising at least one user utterance; determining a current dialog state based on the dialog context; obtaining, via a dialogue model, revenue scores of one or more candidate utterances given the current dialog state, wherein the dialogue model is a reinforcement learning model; and determining, based on the revenue scores, a target utterance from the one or more candidate utterances that responds to the dialog context.

Description

Conversation method and system
Technical Field
The embodiment of the specification relates to the technical field of artificial intelligence, in particular to a conversation method and system.
Background
An intelligent dialogue robot is an intelligent dialogue system that interacts with users in natural language. Its dialogues with a user may include task-oriented dialogues, FAQ dialogues, chit-chat, persuasion dialogues, and the like. The intelligent dialogue robot generates guiding dialogue sentences based on the ongoing dialogue with the user, so that the user completes a specific operation after the dialogue. Such guided dialogue can be widely applied in scenarios such as charitable donation, commodity recommendation, and loan collection.
The specification provides a guided dialogue method and a guided dialogue system suitable for an intelligent dialogue robot.
Disclosure of Invention
One aspect of the embodiments of this specification provides a conversation method, the method comprising: acquiring a dialog context, the dialog context comprising at least one user utterance; determining a current dialog state based on the dialog context; obtaining, via a dialogue model, revenue scores of one or more candidate utterances given the current dialog state, wherein the dialogue model is a reinforcement learning model; and determining, based on the revenue scores, a target utterance from the one or more candidate utterances that responds to the dialog context.
One aspect of the embodiments of this specification provides a conversation system, the system comprising: a dialogue data acquisition module for acquiring a dialog context, the dialog context comprising at least one user utterance; a dialog current state determination module for determining the current dialog state based on the dialog context; a revenue score determination module for obtaining, via a dialogue model, revenue scores of one or more candidate utterances given the current dialog state, wherein the dialogue model is a reinforcement learning model; and a target utterance determination module for determining, based on the revenue scores, a target utterance from the one or more candidate utterances that responds to the dialog context.
One aspect of embodiments of the present specification provides an apparatus for dialog, comprising at least one storage medium and at least one processor, the at least one storage medium for storing computer instructions; the at least one processor is configured to execute the computer instructions to implement the method of any of the above.
One aspect of the embodiments of this specification provides a method of training a dialogue model, the method comprising: acquiring multiple rounds of historical dialogue; extracting multiple sets of training data from the multiple rounds of historical dialogue, wherein each set of training data comprises at least a sample dialog current state, a response utterance, a sample dialog next state, and a reward value corresponding to the response utterance; and iteratively updating parameters of a reinforcement learning model based on the multiple sets of training data, so that the trained dialogue model can determine revenue scores of candidate utterances given any current dialog state.
One aspect of the embodiments of this specification provides a system for training a dialogue model, the system comprising: a dialogue data acquisition unit for acquiring multiple rounds of historical dialogue; a training data extraction unit for extracting multiple sets of training data from the multiple rounds of historical dialogue, wherein each set of training data comprises at least a sample dialog current state, a response utterance, a sample dialog next state, and a reward value corresponding to the response utterance; and a model parameter updating unit for iteratively updating parameters of a reinforcement learning model based on the multiple sets of training data, so that the trained dialogue model can determine revenue scores of candidate utterances given any current dialog state.
One aspect of the embodiments of this specification provides an apparatus for training a dialogue model, comprising at least one storage medium and at least one processor, the at least one storage medium for storing computer instructions; the at least one processor is configured to execute the computer instructions to implement the method of any of the above.
Drawings
The present description will be further described by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:
FIG. 1 is a schematic diagram of an application scenario of a dialog system shown in accordance with some embodiments of the present description;
FIG. 2 is an exemplary flow diagram of a dialog method shown in accordance with some embodiments of the present description;
FIG. 3 is an exemplary flow diagram of a method of training a dialogue model, shown in accordance with some embodiments of the present description;
FIG. 4 is a schematic illustration of obtaining training data according to some embodiments of the present description;
fig. 5 is a block diagram of a dialog system shown in accordance with some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.
It should be understood that "system", "device", "unit" and/or "module" as used in this specification is a method for distinguishing different components, elements, parts or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.
As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; the steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.
Flow charts are used in this specification to illustrate operations performed by a system according to embodiments of this specification. It should be understood that the operations are not necessarily performed in the exact order shown. Rather, the various steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more steps may be removed from them.
The intelligent dialogue robot can be applied to a range of persuasion-type dialogue scenarios with users, such as debt collection, commodity promotion, and activity recommendation. In these persuasion scenarios, the intelligent dialogue robot generates guiding dialogue sentences (also called utterances or robot utterances), so that the user completes a specific operation after the dialogue.
Taking the application of the intelligent dialogue robot to a debt collection scenario as an example, in some embodiments the intelligent dialogue robot (also called a "collection robot") may converse, based on a pre-configured fixed dialogue flow, with a user who has not fulfilled the repayment obligation within the agreed time, the utterance of each dialogue turn being configured manually. This approach has the limitation that the dialogue flow is fixed, so the dialogue scenarios it covers are narrow.
In still other embodiments, the fixed dialogue flow may be dispensed with, so that the robot intelligently selects, based on the preceding dialogue content, the best utterance from the candidate utterances with which to respond to the user.
Fig. 1 is a schematic diagram of an exemplary dialog system shown in accordance with some embodiments of the present description. As shown in fig. 1, the dialog system 100 may include a processing device 110, a network 120, a user terminal 130, and a storage device 140.
The processing device 110 may be used to process information and/or data associated with dialogue generation to perform one or more of the functions disclosed in this specification. In some embodiments, the processing device 110 may be used to obtain the dialog context. In some embodiments, the processing device 110 may determine the current dialog state based on the dialog context. In some embodiments, the processing device 110 may obtain, via the dialogue model, revenue scores of candidate utterances given the current dialog state. In some embodiments, the processing device 110 may determine, based on the revenue scores, a target utterance from the candidate utterances that responds to the dialog context. It is understood that in some embodiments the processing device 110 may implement the functionality of an intelligent dialogue robot or act as a cloud service. In still other embodiments, the processing device 110 may obtain multiple rounds of historical dialogue and train the dialogue model. In some embodiments, the processing device 110 may include one or more processing engines (e.g., single-core processing engines or multi-core processors). By way of example only, the processing device 110 may include one or more combinations of central processing units (CPUs), application-specific integrated circuits (ASICs), application-specific instruction-set processors (ASIPs), graphics processing units (GPUs), physics processing units (PPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs), programmable logic devices (PLDs), controllers, microcontroller units, reduced instruction set computers (RISCs), microprocessors, and the like.
Network 120 may facilitate the exchange of information and/or data. In some embodiments, one or more components of the dialog system 100 (e.g., processing device 110, storage device 140, and user terminal 130) may communicate information to other components of the dialog system 100 over network 120. For example, the processing device 110 may obtain a user utterance from the user terminal 130 over network 120 and combine it with the prior dialogue content to yield the dialog context. As another example, the processing device 110 may retrieve historical dialogue information and/or data from the storage device 140 via network 120 to perform dialogue model training. As another example, the processing device 110 may push the target utterance to the user terminal 130 over network 120 to complete one dialogue. In some embodiments, network 120 may be any form of wired or wireless network, or any combination thereof. By way of example only, network 120 may be one or a combination of a wired network, a fiber-optic network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, and so forth.
User terminal 130 may be a device with data acquisition, storage, and/or transmission capabilities. In some embodiments, the user of the user terminal 130 may be a service user, a consumer, or a debtor, among others. In some embodiments, the user terminal 130 may include, but is not limited to, a mobile device 130-1, a tablet 130-2, a laptop 130-3, a desktop 130-4, and the like, or any combination thereof. Exemplary mobile devices 130-1 may include, but are not limited to, smartphones, personal digital assistants (PDAs), handheld game consoles, smart watches, wearable devices, virtual reality devices, augmented reality devices, and the like, or any combination thereof. In some embodiments, the user terminal 130 may send the acquired data to one or more devices in the dialog system 100. For example, the user terminal 130 may transmit the acquired data to the processing device 110 or the storage device 140. In some embodiments, the data acquired by the user terminal 130 may be data related to the service used by the user, or dialogue content such as consultation questions or replies to the intelligent dialogue robot. For example only, when the user of the user terminal 130 is a debtor, the acquired data may be the utterances (i.e., user utterances) the debtor speaks in reply to the collection robot.
Storage device 140 may store data and/or instructions. In some embodiments, storage device 140 may store data collected from user terminal 130. The data may be data relating to service usage as described above. In some embodiments, the data may also be conversation data between the user and the service provider, such as chat records, call records, etc., of the user and customer service (e.g., human customer service or intelligent conversation robot). In some embodiments, storage device 140 may store data and/or instructions for execution or use by processing device 110, which processing device 110 may execute or use to implement the example methods of this specification. In some embodiments, a storage device 140 may be connected to network 120 to enable communication with one or more components (e.g., processing device 110, user terminal 130, etc.) in dialog system 100. One or more components of the dialog system 100 may access data or instructions stored in the storage device 140 via the network 120. In some embodiments, the storage device 140 may be part of the processing device 110. In some embodiments, storage device 140 may include mass storage, removable storage, volatile read-write memory, read-only memory (ROM), and the like, or any combination thereof. In some embodiments, the storage device 140 may be implemented on a cloud platform. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-tiered cloud, and the like, or any combination thereof.
Fig. 2 is an exemplary flow diagram of a dialog method, shown in accordance with some embodiments of the present description.
In some embodiments, one or more steps in flow 200 may be implemented in system 100 shown in FIG. 1. For example, one or more steps in flow 200 may be stored as instructions in storage device 140 and invoked and/or executed by processing device 110.
Step 202: acquire a dialog context; the dialog context includes at least one user utterance. Specifically, step 202 may be performed by the dialogue data acquisition module 510.
A conversation may refer to a natural-language interaction between a user and a service party.
A user utterance is what the user says in a conversation. In some embodiments, the user may be a service user, or an individual or group with a potential service need, such as a borrower or a user who requires a loan service, or a user ordering or inquiring about goods. In some embodiments, the user may enter the user utterance on the user terminal 130 by voice input, text input, or the like.
The dialog context refers to the preceding dialogue content in natural-language form. In some embodiments, the dialog context comprises at least one user utterance. In some embodiments, the dialog context may be the completed dialogue content between the user and the service party. In some embodiments, the dialog context may include the current user utterance together with the previously completed dialogue content between the user and the service party. The service party may be an intelligent dialogue robot working on behalf of the service provider.
In some embodiments, the dialogue data acquisition module may acquire the current user utterance from the user terminal 130. In some embodiments, the dialogue data acquisition module may acquire the dialog context from a cache device (e.g., storage device 140) or a customer service log.
Step 204: determine the current dialog state based on the dialog context. Specifically, step 204 may be performed by the dialog current state determination module 520.
In some embodiments, the current dialog state may reflect information about the dialog context, such as textual information, semantic information, and contextual information.
In some embodiments, the dialog current state determination module may determine the current dialog state by encoding the acquired dialog context, i.e., using the vector encoding of the dialog context as the current dialog state.
The encoding may be performed in a variety of ways, e.g., one-hot encoding, TF-IDF, a Word2Vec model, or a BERT model. For example only, multiple utterances in a dialogue may be concatenated and input into a BERT model to obtain the encoded vector.
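As a concrete illustration of the turn-concatenation idea, the sketch below uses a toy bag-of-words encoder in place of the encoders named above (one-hot, TF-IDF, Word2Vec, BERT); the vocabulary is a hypothetical stand-in, chosen only to show the shape of "dialog context → fixed-length state vector":

```python
# Toy sketch of "dialog context -> fixed-length state vector".
# A production system would use one of the encoders named in the text;
# this bag-of-words stand-in with a small hypothetical vocabulary only
# illustrates the shape of the step.

VOCAB = ["repay", "loan", "when", "cannot", "tomorrow", "thanks"]  # hypothetical

def encode_dialog_context(turns):
    """Concatenate the dialogue turns and count vocabulary hits."""
    text = " ".join(turns).lower()
    tokens = text.replace("?", " ").replace(".", " ").split()
    return [tokens.count(word) for word in VOCAB]

state = encode_dialog_context(
    ["When do you plan to repay the loan?", "I cannot repay tomorrow."]
)
print(state)  # [2, 1, 1, 1, 1, 0]
```

The resulting vector plays the role of the current dialog state s that is fed to the dialogue model in step 206.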
Step 206: obtain, via a dialogue model, revenue scores of one or more candidate utterances given the current dialog state, where the dialogue model is a reinforcement learning model. Specifically, step 206 may be performed by the revenue score determination module 530.
In some embodiments, a candidate utterance is a service utterance determined to be related to the target task toward which the service party (e.g., the intelligent dialogue robot) guides the user, where a service utterance is an utterance spoken by the service party. For example, if the intelligent dialogue robot aims to guide the user to repay, the candidate utterances are service utterances related to repayment.
In some embodiments, the candidate utterances may be derived from historical service logs, historical call records, and the like, e.g., by manually summarizing such records, or by extracting candidate utterances from them with a model or algorithm.
In some embodiments, the number of candidate utterances is one or more. The candidate utterances are pre-configured and may be stored in a memory (e.g., storage device 140). When computing the revenue scores, the dialogue model may obtain the candidate utterances from the memory by direct reading, through an interface, or in any other manner.
The dialogue model is a reinforcement learning model; the trained reinforcement learning model can select among different actions in the current state and obtain the revenue value of each action. For the dialogue model, the current state is the current dialog state, and the set of possible actions corresponds to the set of candidate utterances. For training of the dialogue model, refer to FIG. 3 and its related description, which are not repeated here.
It can be appreciated that, having determined the current dialog state, the revenue score determination module can obtain the revenue scores of one or more candidate utterances via the dialogue model. Specifically, the current dialog state and the candidate utterances may be input into the dialogue model, which derives a revenue score for each candidate utterance.
In some embodiments, the revenue score is positively correlated with the probability that the corresponding candidate utterance achieves the business objective: the greater the probability of achieving the business objective, the greater the revenue score. The business objective is the purpose the service party wishes to achieve through the conversation. In some embodiments, the business objective is the target task that the service party wishes the intelligent dialogue robot to guide the user to complete through the conversation. The business objective depends on the specific application scenario: for a debt collection scenario, the business objective is that the user repays; for a commodity promotion scenario, that the user purchases the commodity.
Step 208: determine, based on the revenue scores, a target utterance from the one or more candidate utterances that responds to the dialog context. Specifically, step 208 may be performed by the target utterance determination module 540.
In some embodiments, the target utterance determination module may determine the candidate utterance corresponding to the highest revenue score as the target utterance.
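Steps 206 and 208 together amount to scoring each candidate utterance and taking the argmax; a minimal sketch follows, in which both the candidate utterances and their scores are hypothetical placeholders for what the trained dialogue model would produce:

```python
# Pick the candidate utterance with the highest revenue score (step 208).
# In the real system the scores come from the trained dialogue model;
# here they are made-up numbers attached to made-up candidate utterances.

def select_target_utterance(scored_candidates):
    """scored_candidates: list of (utterance, revenue_score) pairs."""
    return max(scored_candidates, key=lambda pair: pair[1])[0]

candidates = [
    ("Would an installment plan make repayment easier?", 0.71),
    ("Please repay immediately.", 0.33),
    ("Repaying on time protects your credit record.", 0.58),
]
print(select_target_utterance(candidates))
```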
The embodiments of this specification use a dialogue model to determine, from the candidate utterances, the target utterance that responds to the dialog context; no fixed dialogue flow needs to be configured in advance, so the method can be applied flexibly to different dialogue scenarios. Moreover, because the dialogue model is a reinforcement learning model, the optimal candidate utterance can be selected intelligently based on the current dialog state, improving the probability of achieving the business objective.
It should be noted that the above description related to the flow 200 is only for illustration and description, and does not limit the applicable scope of the present specification. Various modifications and alterations to flow 200 will be apparent to those skilled in the art in light of this description. However, such modifications and variations are intended to be within the scope of the present description.
FIG. 3 is an exemplary flow diagram of a method of training a dialogue model, shown in accordance with some embodiments of the present description. In some embodiments, one or more of the steps in flow 300 may be implemented in system 100 shown in FIG. 1. For example, one or more steps in flow 300 may be stored as instructions in storage device 140 and invoked and/or executed by processing device 110. In some embodiments, the process 300 may be implemented by a training module 550.
Step 302: acquire multiple rounds of historical dialogue. Specifically, step 302 may be performed by the dialogue data acquisition unit 551.
In some embodiments, a round of historical dialogue refers to the full dialogue between two parties that occurred within some past period (e.g., a week, a month, half a year, or a year). A round of historical dialogue comprises multiple dialogues divided in chronological order, and one dialogue consists of a user utterance and the response utterance of the service party (e.g., the intelligent dialogue robot) to that user utterance.
For example, let a denote an utterance of user A and b denote an utterance of intelligent robot B (which may equally be a human agent). If the entire dialogue between user A and intelligent robot B is a1b1a2b2a3b3a4b4, then a1b1a2b2a3b3a4b4 is one round of full dialogue; each of a1b1, a2b2, a3b3, and a4b4 is one dialogue, referred to respectively as the first, second, third, and fourth dialogue.
In some embodiments, the historical dialogues are derived from historical service logs, historical call records, and the like; they may be obtained from an online platform (e.g., a website or application), from manually consolidated dialogue records, or read directly from a storage device that stores a large number of rounds of historical dialogue.
Step 304: extract multiple sets of training data from the multiple rounds of historical dialogue, wherein each set of training data comprises at least a sample dialog current state, a response utterance, a sample dialog next state, and a reward value corresponding to the response utterance. Specifically, step 304 may be performed by the training data extraction unit 552.
Training data is the data input into the reinforcement learning model to train it. In some embodiments, a set of training data comprises at least a sample dialog current state, a response utterance, a sample dialog next state, and a reward value corresponding to the response utterance. In some embodiments, the training data extraction unit may extract multiple sets of training data from multiple rounds of historical dialogue.
A response utterance is a service utterance in a historical dialogue that responds to a user utterance, where a service utterance is an utterance spoken by the service party (e.g., the intelligent dialogue robot). In some embodiments, the response utterance is the service utterance within one dialogue. It will be appreciated that if a round of full historical dialogue contains multiple dialogues, multiple response utterances can be determined. In some embodiments, the response utterance may be encoded as a vector, denoted below by a.
Continuing the example of step 302, the response utterance may be b1 in the first dialogue a1b1, b2 in the second dialogue a2b2, b3 in the third dialogue a3b3, or b4 in the fourth dialogue a4b4.
The sample dialog current state is the vector of the user utterance responded to by the response utterance, together with the dialogue content preceding that user utterance, in a round of historical dialogue. The sample dialog current state may be denoted by s. For obtaining the sample dialog current state vector from the user utterance and its preceding dialogue content, refer to the processing of step 204.
Continuing the above example, a round of full historical dialogue is a1b1a2b2a3b3a4b4. If the vector of response utterance b1 is b1', the corresponding sample dialog current state is the vector a1' of a1; if the vector of response utterance b2 is b2', the corresponding sample dialog current state is the vector (a1b1a2)' of a1b1a2; and so on.
The sample dialog next state is the vector of the user utterance following the response utterance, together with the dialogue content preceding that user utterance. The sample dialog next state may be denoted by s'.
Continuing the above example, a round of full historical dialogue is a1b1a2b2a3b3a4b4. If the response utterance vector is b1', the corresponding sample dialog next state is the vector (a1b1a2)' of a1b1a2; if the response utterance vector is b2', the corresponding sample dialog next state is the vector (a1b1a2b2a3)' of a1b1a2b2a3; and so on.
In some embodiments, the reward value corresponding to the response utterance may reflect one or more of: the relevance of the response utterance to the sample dialog current state, the achievement probability of the business objective, the emotion score of the historical user utterance replying to the response utterance, the intent category of that historical user utterance, and the dialogue engagement related to the response utterance. In some embodiments, the training data extraction unit may determine the reward value, denoted by r, from the above relevance, achievement probability of the business objective, emotion score, intent category, and dialogue engagement.
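The text lists the factors that feed the reward value but not how they are combined; one plausible combination, shown purely as an assumption, is a weighted sum. Neither the weights nor the intent-bonus table below come from the patent:

```python
# Hedged sketch: combine the listed reward factors into a scalar r by a
# weighted sum. Both the weights and the intent-bonus table are
# illustrative assumptions, not the patent's method.

INTENT_BONUS = {"agree_to_pay": 1.0, "neutral": 0.0, "refuse": -1.0}  # hypothetical

def reward(relevance, goal_prob, emotion, intent, engagement,
           weights=(0.2, 0.4, 0.1, 0.2, 0.1)):
    w1, w2, w3, w4, w5 = weights
    return (w1 * relevance + w2 * goal_prob + w3 * emotion
            + w4 * INTENT_BONUS[intent] + w5 * engagement)

r = reward(relevance=0.8, goal_prob=0.5, emotion=0.2,
           intent="agree_to_pay", engagement=0.6)
print(round(r, 3))  # 0.64
```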
In some embodiments, the sample dialog next state can reflect whether the dialogue has ended: if there is no sample dialog next state, the dialogue has ended. Continuing the above example, no next state can be extracted for the fourth dialogue, which indicates that the dialogue ends after the fourth dialogue.
In some embodiments, a dialogue end identifier, denoted by t, may also be added to the training data to indicate whether the dialogue has ended. The dialogue end identifier marks whether the corresponding training data comes from the last dialogue of the round of historical dialogue. In some embodiments, the dialogue end identifier may be a preset character, such as a number, a letter, or another symbol. Illustratively, 0 indicates that the dialogue has not ended and 1 indicates that it has ended. If the dialogue end identifier indicates an end, the training data has no sample dialog next state s'.
In some embodiments, a set of training samples may be represented as (s, a, s', r, t). As previously described, if a round of full historical dialog contains multiple dialogs, multiple sets of training samples may be determined. It can be understood that one set of training data is extracted per single dialog.
Continuing with the above example, for the full historical dialog a1b1a2b2a3b3a4b4, the training sample (s, a, s', r, t) extracted from the first dialog a1b1 is (a1', b1', (a1b1a2)', r1, 1); from the second dialog a2b2, ((a1b1a2)', b2', (a1b1a2b2a3)', r2, 1); from the third dialog a3b3, ((a1b1a2b2a3)', b3', (a1b1a2b2a3b3a4)', r3, 1); and from the fourth dialog a4b4, ((a1b1a2b2a3b3a4)', b4', none, r4, 0), since the dialog ends there and has no next state.
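The per-dialog extraction described above can be sketched in Python; the `encode` helper and the list-based turn representation are hypothetical stand-ins for the document's encoding step.

```python
# Hypothetical sketch of extracting (s, a, s', r, t) tuples from one round of
# historical dialog whose utterances alternate user/service: a1 b1 a2 b2 ...
# `encode` is a stand-in for the document's encoding step (text span -> vector).

def extract_training_data(turns, encode, rewards):
    """turns: ["a1", "b1", "a2", "b2", ...]; rewards: one value per service utterance."""
    samples = []
    n_pairs = len(turns) // 2
    for k in range(n_pairs):
        i = 2 * k                            # index of the k-th user utterance
        s = encode(turns[:i + 1])            # dialog up to and including that user utterance
        a = encode([turns[i + 1]])           # the service (response) utterance
        if k < n_pairs - 1:
            s_next = encode(turns[:i + 3])   # up to and including the next user utterance
            t = 1                            # dialog has not ended
        else:
            s_next = None                    # no next state: dialog ended
            t = 0
        samples.append((s, a, s_next, rewards[k], t))
    return samples
```

For the dialog a1b1a2b2a3b3a4b4 this yields four tuples; the first pairs the state a1' with the response b1' and the next state (a1b1a2)'.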
Step 306: iteratively update the parameters of the reinforcement learning model based on multiple sets of the training data, so that the trained dialog model can determine the revenue score of a candidate utterance based on the current state of any dialog. In particular, step 306 may be performed by the model parameter update unit 553.
During model training, the model parameter update unit continuously updates the parameters of the reinforcement learning model based on the training data. Specifically, the model parameter update unit may keep adjusting the parameters of the reinforcement learning model until the loss function of the model satisfies a preset condition, for example, the loss function converges or its value is smaller than a preset value. When the loss function satisfies the preset condition, model training is finished and the trained dialog model is obtained. The trained dialog model may be a mapping from the dialog current state and the candidate utterances to the revenue scores; it may be understood as a Q(s, a) function that determines the revenue score of each candidate utterance based on the input dialog current state.
In some embodiments, the model parameter update unit may train the dialog model through an off-policy reinforcement learning method. Specifically, the model parameter update unit may put multiple sets of training data into an experience replay data set, randomly extract training data from it, and update the parameters of the reinforcement learning model with the extracted training data based on an off-policy reinforcement learning method. In some embodiments, off-policy reinforcement learning methods may include, but are not limited to: Q-learning, DQN, DDPG, and the like.
The experience replay data set may also be referred to as an experience pool. Extracting training data from the experience replay data set by uniform random sampling breaks the correlation between training data; such correlation can prevent the reinforcement learning model from learning effectively and make it difficult to converge. Meanwhile, uniformly sampling multiple training data smooths the distribution of the training data and mitigates the problem of shifting sample distributions.
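A minimal sketch of such an experience pool with uniform random sampling (capacity and batch size here are illustrative choices, not values from this document):

```python
import random
from collections import deque

# Minimal illustrative experience pool; the capacity is a hypothetical choice.
class ExperiencePool:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest samples are evicted first

    def add(self, sample):
        # sample is one (s, a, s', r, t) training tuple
        self.buffer.append(sample)

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive dialog turns and smooths the training-data distribution.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```
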
Taking the DQN reinforcement learning algorithm as an example, in some embodiments, when each set of training data does not include a dialog end identifier, the training loss function may be formula (1):

L_i(θ_i) = f( r + γ · max_{a'} Q(s', a'; θ_{i-1}) − Q(s, a; θ_i) )    (1)

where L_i(θ_i) represents the loss function of the model after the i-th training iteration; f represents some positive-correlation mapping; θ_i represents the parameters of the model in the i-th training iteration; θ_{i-1} represents the parameters of the model in the (i-1)-th training iteration; γ represents the discount factor; r represents the reward value corresponding to the response utterance; s represents the sample dialog current state in the training data; a represents the response utterance in the training data; Q(s, a; θ_i) represents the revenue score of the response utterance a determined, based on the sample dialog current state s, by the model after the i-th training iteration; s' represents the sample dialog next state in the training data, and a' represents a candidate utterance; max_{a'} Q(s', a'; θ_{i-1}) represents the maximum of the revenue scores of all candidate utterances determined for the sample dialog next state s' by the model of the (i-1)-th training iteration. r reflects the current benefit, and γ · max_{a'} Q(s', a'; θ_{i-1}) reflects the future benefit.
As described above, when the dialog has ended and there is no dialog next state, the loss function becomes L_i(θ_i) = f( r − Q(s, a; θ_i) ).
Taking the DQN reinforcement learning algorithm as an example, in some implementations, if each set of training data further includes a dialog end identifier, the loss function can be formula (2):

L_i(θ_i) = f( r + t · γ · max_{a'} Q(s', a'; θ_{i-1}) − Q(s, a; θ_i) )    (2)

Except for t, all parameters in formula (2) have the same meaning as in formula (1) and are not described again here. t is the dialog end identifier: when the dialog has ended, t = 0 and formula (2) reduces to L_i(θ_i) = f( r − Q(s, a; θ_i) ); when the dialog has not ended, t = 1 and formula (2) is identical to formula (1).
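Assuming the positive-correlation mapping f is the square function (one common choice; the document leaves f unspecified), the formula (2) loss for a single training tuple might be computed as:

```python
import numpy as np

# Sketch of the formula (2) loss for one training tuple, assuming f(x) = x**2.
# q_current is Q(s, a; theta_i); q_next_all holds Q(s', a'; theta_{i-1}) for
# every candidate utterance a'. t is the dialog end identifier (0 = ended).

def dqn_loss(r, t, gamma, q_current, q_next_all):
    # The future-revenue term is gated by t, so a terminal tuple's target is just r.
    future = t * gamma * (np.max(q_next_all) if t == 1 else 0.0)
    td_error = r + future - q_current
    return td_error ** 2

# Mid-dialog tuple (t = 1): loss = (1.0 + 0.9 * 2.0 - 2.0)**2 = 0.64
loss_mid = dqn_loss(r=1.0, t=1, gamma=0.9, q_current=2.0,
                    q_next_all=np.array([0.5, 2.0, 1.0]))
# Terminal tuple (t = 0): loss = (1.0 - 2.0)**2 = 1.0
loss_end = dqn_loss(r=1.0, t=0, gamma=0.9, q_current=2.0,
                    q_next_all=np.array([]))
```
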
in some embodiments, the model parameter updating unit may also train the dialogue model through an on-policy reinforcement learning method. In some embodiments, the on-policy reinforcement learning method may include, but is not limited to: sarsa or Sarsalambda, etc.
In some embodiments, the trained dialog model may be optimized with real feedback from online users. Specifically, after the trained dialog model is put online to converse with real users, new training data can be extracted from those real dialogs; the user utterances in the new training data are the online users' real feedback to the service utterances. Training then continues on the new training data, and repeating this optimization process can continuously improve the performance of the model.
The reward value of the response utterance obtained in the embodiments of this description combines the semantic relevance, the achievement probability of the business objective, the emotion score, the intent category, and the conversation engagement; a dialog model trained on training data built from such reward values can effectively weigh the current revenue score of a response utterance and improve the achievement probability of the business objective. Meanwhile, training data are drawn from the experience replay data set and the reinforcement learning model is trained offline with an off-policy reinforcement learning method; the training data in the experience replay data set are updated based on the model's online interaction with users, achieving a closed data loop and effective, continuous iteration of the model.
To illustrate more clearly and completely the method of training a dialog model shown in some embodiments of this description, fig. 4 is taken as an example to schematically illustrate the training data acquisition process in that method.
Illustratively, as shown in fig. 4, the training data may be obtained specifically by:
and acquiring manual conversation logs, wherein each manual conversation log can correspond to one round of historical conversation, so that multiple rounds of historical conversation are obtained based on the manual conversation logs. Obtaining a primary dialog composed of a user utterance and a service dialog replying to the user utterance from a plurality of rounds of historical dialogs, and further, executing the following operations aiming at a certain primary dialog to extract a corresponding set of training data:
and extracting the service dialogues in the dialog, coding the service dialogues, and taking the expression vectors of the service dialogues obtained after coding as response dialogues a.
And extracting the user words before the service dialogues and the dialogues before the user words as first sample dialog texts, coding the first sample dialog texts, and taking the coded representation vectors as sample dialog current states s.
And extracting the user words after the service dialogues and dialogues before the user words as second sample dialog texts, coding the second sample dialog texts, and taking the coded expression vectors as sample dialog next states s'.
A reward value r corresponding to the response utterance is determined based on the relevance of the response utterance to the dialog current state, the achievement probability of the business objective, the emotion score of the historical user utterance responding to the response utterance, the intent category of that historical user utterance, and the conversation engagement related to the response utterance. For example only, 5 sub-reward values are determined from these five quantities, and r is obtained as their weighted sum, where each weight is positively correlated with the importance of the corresponding sub-reward value. In some embodiments, the sub-reward value based on the relevance of the response utterance to the dialog current state and the sub-reward value based on the achievement probability of the business objective are weighted more heavily. It can be understood that r measures, from these 5 angles respectively, the contribution of the response utterance to completing the target task.
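The weighted combination of the 5 sub-reward values might be sketched as follows; the specific weights are hypothetical, chosen only to reflect the stated preference for the relevance and business-objective terms:

```python
# Hypothetical weights for the 5 sub-reward values; relevance (r1) and the
# business objective (r2) are weighted more heavily, as the text suggests.
WEIGHTS = (0.3, 0.3, 0.15, 0.15, 0.1)

def combine_rewards(r1, r2, r3, r4, r5, weights=WEIGHTS):
    # r is the importance-weighted sum of the 5 sub-reward values
    return sum(w * sub for w, sub in zip(weights, (r1, r2, r3, r4, r5)))
```

With all sub-rewards equal to 1 and weights summing to 1, r = 1.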
In some embodiments, the relevance of the responsive dialogs to the current state of the dialog may be a semantic relevance of the two. In some embodiments, the trained semantic matching model may be used to process the current state of the sample dialog and the response dialog to determine the semantic relatedness between the two. Specifically, the current state of the dialog and the response dialog are input into a semantic matching model, and the semantic correlation between the current state of the dialog and the response dialog is output, wherein the semantic correlation is a number from 0 to 1.
In some embodiments, a first sub-reward value (denoted r1) may be determined based on the semantic relevance score, e.g., taking the relevance score itself as r1, or taking the relevance score minus a base value (e.g., 0.5) as the reward value.
In some embodiments, the semantic matching model may be trained based on sets of labeled training samples. Training samples may be obtained through manual dialog logs. A set of training samples may include sample dialog current state and dialog techniques. The label may be whether the sample dialog current state is related to the dialog (for example, related is represented by 1, unrelated is represented by 0), and if the dialog is a sentence in the history dialog text corresponding to the sample dialog current state, the label is related; otherwise, it is not relevant.
In some embodiments, the trained business objective prediction model can be used to process the current state of the sample dialog and the response dialog, and determine the probability of achieving the business objective. Specifically, the current state of the dialog and the response dialog are input into a business target prediction model, and the realization probability of the business target is output, wherein the probability is a number between 0 and 1.
In some embodiments, a second sub-reward value (denoted r2) may be determined based on the achievement probability of the business objective, in the same manner as r1.
In some embodiments, a business objective prediction model may be trained based on sets of labeled training samples. In some embodiments, a set of training samples may include sample dialog current states and response dialogs. The tag may be a time interval, e.g., 20min, 1h, etc., between the user completing the target task and the server outputting the response utterance.
In some embodiments, historical user utterances responsive to the responsive utterance may be processed using a trained emotion recognition model, emotion categories of the historical user utterances responsive to the responsive utterance may be determined, and corresponding categories may be converted to scores based on preset mapping rules. The score is a number from 0 to 1.
In some embodiments, a third sub-reward value (denoted r3) may be determined based on the emotion score, in the same manner as r1.
In some embodiments, the emotion recognition model may be trained based on a plurality of labeled training samples. The training sample may be a user utterance. In some embodiments, the tags may be positive and negative sentiment category labels. For example, 0 characterizes negative emotions and 1 characterizes positive emotions. In some embodiments, the tags may also contain other emotions, e.g., neutral, etc.
In some embodiments, historical user utterances responding to the response utterance may be processed using a trained intent recognition model to determine the intent category of the historical user utterance responding to the response utterance. In some embodiments, the intent category may be accept, decline, interest, or the like. Different intent categories may correspond to different scores; a more positive intent scores higher, e.g., accept scores higher than decline.
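A minimal sketch of mapping intent categories to scores; the category names come from the text above, while the numeric scores and the neutral default are assumptions:

```python
# Category names follow the text; the numeric scores and the neutral default
# are assumptions, chosen so that more positive intents score higher.
INTENT_SCORES = {"accept": 1.0, "interest": 0.6, "decline": 0.1}

def intent_score(category):
    # unseen categories fall back to a neutral mid-range score
    return INTENT_SCORES.get(category, 0.5)
```
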
In some embodiments, a fourth sub-reward value (denoted r4) may be determined based on the intent score, in the same manner as r1.
In some embodiments, the intent recognition model may be trained based on sets of labeled training samples. In some embodiments, the set of training samples may include user utterances. In some embodiments, the tags may be intent categories.
In some embodiments, the historical dialogue data at the time of training the semantic matching model, the business objective prediction model, the emotion recognition model, and the intention recognition model may be different from the data of the training dialogue model.
In some embodiments, the conversation engagement related to the response utterance may be determined based on the number of dialogs in the round of historical dialog in which the response utterance occurs. In some embodiments, the conversation engagement may be obtained through a preset mapping between the number of dialogs and the engagement score, for example the mapping shown in formula (3), which computes the conversation engagement score from: a constant in (0, 1); N, a preset maximum number of dialogs, or the total number of dialogs of the round of full historical dialog in which the response utterance occurs; and n, the number of dialogs of the historical round in which the response utterance occurs.
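The document's formula (3) is given only as an image; as one plausible mapping consistent with the stated positive correlation between engagement and the number of dialogs, a sketch might assume E = β^(N − n):

```python
# Hypothetical engagement mapping (the document's formula (3) is an image).
# Assumption: E = beta ** (N - n), with beta in (0, 1), N the preset maximum
# (or total) number of dialogs, and n the number of dialogs so far; E then
# grows toward 1 as n approaches N.

def conversation_engagement(n, N, beta=0.8):
    n = min(n, N)          # clamp so the score never exceeds 1
    return beta ** (N - n)

scores = [conversation_engagement(n, N=4) for n in range(1, 5)]
# scores increase monotonically with n; the last equals 1.0
```
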
In some embodiments, a fifth sub-reward value (denoted r5) may be determined based on the conversation engagement, in the same manner as r1.
In some embodiments, whether the dialog is complete, i.e., the dialog end identifier, may be determined based on whether there are further user utterances after the response utterance.
Finally, the current state of the sample dialog, the response dialog, the corresponding reward value of the response dialog, the next state of the sample dialog and the end-of-dialog identifier are determined as a set of training data.
FIG. 5 is a block diagram of a dialog system shown in accordance with some embodiments of the present description. In some embodiments, the dialog system may be implemented by the processing device 110. In some embodiments, the dialog system may generate a target dialog based on the dialog context and the candidate dialogs. As shown in FIG. 5, dialog system 500 may include a dialog data acquisition module 510, a dialog current state determination module 520, a revenue score determination module 530, a target dialogs determination module 540, and a training module 550.
The dialogue data acquisition module 510 may be configured to acquire a dialogue context; the dialog text includes at least one user utterance.
The dialog current state determination module 520 may be configured to determine a dialog current state based on the dialog context.
The revenue score determination module 530 may be configured to obtain, based on the dialog model, revenue scores of one or more candidate utterances given the dialog current state; wherein the dialog model is a reinforcement learning model. In some embodiments, the revenue score is positively correlated with the probability that the corresponding candidate utterance leads to achievement of the business objective.
The target utterance determination module 540 may be configured to determine, based on the revenue scores, the target utterance responding to the dialog context from the one or more candidate utterances.
In some embodiments, the dialog system further comprises a training module 550, the training module 550 comprising: a dialogue data acquisition unit 551, a training data extraction unit 552, and a model parameter update unit 553.
In some embodiments, the dialogue data acquisition unit 551 may be used to acquire multiple rounds of historical dialogue.
In some embodiments, the training data extraction unit 552 may be configured to extract a plurality of sets of training data from the plurality of historical conversations, one of the plurality of sets of training data including at least: a sample dialog current state, a response dialog, a sample dialog next state, and a reward value corresponding to the response dialog. In some embodiments, one of the sets of training data further comprises: and the conversation ending identifier is used for identifying whether the corresponding training data is the last conversation in the round of historical conversations.
In some embodiments, the reward value corresponding to the response utterance reflects one or more of the following information: a relevance of a responsive utterance to a current state of a sample conversation, a probability of achievement of a business objective, an emotion score of a historical user utterance responsive to the responsive utterance, an intent category of the historical user utterance responsive to the responsive utterance, and a conversation engagement related to the responsive utterance. In some embodiments, the training data extraction unit may be to: and processing the current state of the sample dialogue and the response dialogue by utilizing a trained semantic matching model to determine the semantic correlation between the current state of the sample dialogue and the response dialogue.
In some embodiments, the training data extraction unit 552 may be configured to: and processing the current state of the sample dialogue and the response dialogue by using a trained business target prediction model to determine the realization probability of the business target.
In some embodiments, the training data extraction unit 552 may be configured to: process historical user utterances responsive to the response utterance with a trained emotion recognition model to determine the emotion score.
In some embodiments, the training data extraction unit 552 may be used to process historical user utterances responsive to the response utterance with a trained intent recognition model to determine the intent categories. In some embodiments, the conversation engagement related to the response utterance is positively correlated with the number of dialogs corresponding to the response utterance.
In some embodiments, the model parameter update unit 553 may be configured to iteratively update parameters of the reinforcement learning model based on a plurality of sets of the training data, such that the trained conversation model is capable of determining a revenue score for a candidate conversation based on a current state of any conversation.
In some embodiments, the model parameter update unit 553 is configured to place sets of the training data into empirical playback data sets; training data is randomly extracted from the empirical playback data set, and parameters of the reinforcement learning model are updated with the training data based on an off-policy reinforcement learning method.
It should be appreciated that the system and its modules (e.g., a dialog system and its modules and/or a system for training a dialog model and its modules) may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. Wherein the hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD-or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules in this specification may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., but also by software executed by various types of processors, for example, or by a combination of the above hardware circuits and software (e.g., firmware).
It should be noted that the above descriptions of the dialog system and its modules, and of the system for training the dialog model and its modules, are only for convenience of description and should not limit the present disclosure to the illustrated embodiments. It will be appreciated by those skilled in the art that, given the teachings of the present system, any combination of modules or sub-system configurations may be used to connect to other modules without departing from such teachings. For example, the dialog data acquisition module 510, the dialog current state determination module 520, the revenue score determination module 530, the target utterance determination module 540, and the training module 550 disclosed in the dialog system may be different modules in one system, or two or more of them may be combined into a single module implementing their functions. For another example, the modules in the dialog system may share one storage module, or each module may have its own storage module. Such variations are within the scope of the present disclosure.
Embodiments of the present specification also provide an apparatus for dialog, comprising at least one storage medium and at least one processor, the at least one storage medium configured to store computer instructions; the at least one processor is configured to execute the computer instructions to implement the method of dialog described in any of the embodiments of this specification.
Embodiments of the present specification further provide an apparatus for training a dialog, including at least one storage medium and at least one processor, where the at least one storage medium is used to store computer instructions; the at least one processor is configured to execute the computer instructions to implement the method for training a dialogue model according to any embodiment of the present specification.
The beneficial effects that may be brought by the embodiments of the present description include, but are not limited to: (1) the embodiments in this specification use a dialog model to determine the target utterance responding to the dialog context from the candidate utterances, require no fixed dialog flow to be configured in advance, and can be flexibly applied to different dialog scenarios; (2) the reward value of the response utterance combines the semantic relevance, the achievement probability of the business objective, the emotion score, the intent category, and the conversation engagement; the dialog model trained on training data built from such reward values can measure the revenue value of candidate utterances from these 5 angles and output to the user the candidate utterance best able to guide the user to complete the target task, thereby improving the achievement probability of the business objective; (3) the trained dialog model is iteratively optimized based on real feedback data from online users, which can improve the performance of the model. It is to be noted that different embodiments may produce different advantages; in different embodiments, any one or a combination of the above advantages, or any other advantages, may be obtained.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.
Also, the description uses specific words to describe embodiments of the description. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as "data block," module, "" engine, "" unit, "" component, "or" system. Furthermore, aspects of the present description may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.
Computer program code required for the operation of various portions of this specification may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or processing device. In the latter scenario, the remote computer may be connected to the user's computer through any network format, such as a local area network (LAN) or a wide area network (WAN), or to an external computer (for example, through the Internet), or in a cloud computing environment, or as a service such as software as a service (SaaS).
Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing processing device or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that more features than are expressly recited in a claim. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.
Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range are approximations, in the specific examples, such numerical values are set forth as precisely as possible within the scope of the application.
For each patent, patent application, patent application publication, and other material, such as articles, books, specifications, publications, and documents, cited in this specification, the entire contents thereof are hereby incorporated by reference into this specification, excluding any application history document that is inconsistent with or in conflict with the contents of this specification, and excluding any document (whether now or later appended to this specification) that would limit the broadest scope of the claims of this specification. It should be noted that if a description, definition, and/or use of a term in material accompanying this specification is inconsistent or in conflict with the contents of this specification, the description, definition, and/or use of the term in this specification shall control.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims (22)

1. A method of training a dialog model for guiding a user toward achieving a business objective, comprising:
acquiring multiple rounds of historical dialogs;
extracting a plurality of groups of training data from the multiple rounds of historical dialogs, wherein each group of the plurality of groups of training data comprises at least a current sample dialog state, a responsive utterance, a next sample dialog state, and a reward value corresponding to the responsive utterance;
iteratively updating parameters of a reinforcement learning model based on the plurality of groups of training data, so that the trained dialog model can determine a revenue score of a candidate utterance based on a current state of any dialog;
wherein the reward value corresponding to the responsive utterance is obtained by a weighted combination of sub-item reward values corresponding to the following information:
a relevance of the responsive utterance to the current sample dialog state, a probability of achievement of the business objective, an emotion score of a historical user utterance responsive to the responsive utterance, an intent category of the historical user utterance responsive to the responsive utterance, and a conversation engagement related to the responsive utterance.
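For illustration only, the weighted reward computation described above can be sketched as follows; the weight values and the sub-score ranges are hypothetical examples and are not fixed by the claims:

```python
def compute_reward(relevance: float,
                   goal_probability: float,
                   emotion_score: float,
                   intent_reward: float,
                   engagement: float,
                   weights=(0.2, 0.3, 0.2, 0.15, 0.15)) -> float:
    """Combine the five sub-item reward values into a single reward value
    by a weighted sum; `weights` are hypothetical example coefficients."""
    sub_rewards = (relevance, goal_probability, emotion_score,
                   intent_reward, engagement)
    return sum(w * r for w, r in zip(weights, sub_rewards))
```

With weights summing to one, the combined reward stays in the same range as the sub-item rewards.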
2. The method of claim 1, wherein one group of the plurality of groups of training data further comprises:
a dialog-end identifier for identifying whether the corresponding training data is the last dialog turn in the round of historical dialogs.
3. The method of claim 1, wherein the relevance of the responsive utterance to the current sample dialog state is obtained based on:
processing the current sample dialog state and the responsive utterance with a trained semantic matching model to determine the semantic relevance between the current sample dialog state and the responsive utterance.
4. The method of claim 1, wherein the probability of achievement of the business objective is obtained based on:
processing the current sample dialog state and the responsive utterance with a trained business-objective prediction model to determine the probability of achievement of the business objective; the business-objective prediction model being trained based on historical dialog data and the time interval between the historical dialog data and achievement of the business objective.
5. The method of claim 1, wherein the emotion score of the historical user utterance responsive to the responsive utterance is obtained based on:
processing the historical user utterance responsive to the responsive utterance with a trained emotion recognition model to determine the emotion score.
6. The method of claim 1, wherein the intent category of the historical user utterance responsive to the responsive utterance is obtained based on:
processing the historical user utterance responsive to the responsive utterance with a trained intent recognition model to determine the intent category.
7. The method of claim 1, wherein the conversation engagement related to the responsive utterance is positively correlated with the number of dialog turns corresponding to the responsive utterance.
8. The method of claim 1, wherein iteratively updating the parameters of the reinforcement learning model based on the plurality of groups of training data to obtain the trained dialog model comprises:
placing the plurality of groups of training data into an experience replay dataset;
randomly sampling training data from the experience replay dataset, and updating the parameters of the reinforcement learning model with the sampled training data based on an off-policy reinforcement learning method.
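A minimal sketch of the experience-replay, off-policy update of claim 8, using a tabular Q-function over hashable state/utterance keys as a stand-in for the reinforcement learning model; the encodings, hyperparameters, and candidate-utterance lookup are hypothetical:

```python
import random
from collections import defaultdict, deque

class ReplayTrainer:
    """Experience replay plus a Q-learning-style off-policy update."""

    def __init__(self, gamma=0.9, lr=0.1, capacity=10000):
        self.q = defaultdict(float)           # Q(state, utterance) values
        self.buffer = deque(maxlen=capacity)  # experience replay dataset
        self.gamma, self.lr = gamma, lr

    def store(self, state, utterance, reward, next_state, done):
        """Place one group of training data into the replay dataset."""
        self.buffer.append((state, utterance, reward, next_state, done))

    def candidates(self, state):
        # Hypothetical lookup: all utterances already seen from this state.
        return [u for (s, u) in self.q if s == state]

    def update(self, batch_size=32):
        """Randomly sample transitions and apply the off-policy update."""
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        for state, utterance, reward, next_state, done in batch:
            # Off-policy target: max Q over candidate utterances in next state,
            # regardless of which utterance the logged dialog actually used.
            nxt = max((self.q[(next_state, u)]
                       for u in self.candidates(next_state)), default=0.0)
            target = reward if done else reward + self.gamma * nxt
            key = (state, utterance)
            self.q[key] += self.lr * (target - self.q[key])
```

A production system would replace the table with a neural Q-network, but the replay-then-sample structure is the same.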
9. A dialog method, comprising:
acquiring a dialog text, the dialog text comprising at least one user utterance;
determining a current dialog state based on the dialog text;
obtaining, based on a dialog model, revenue scores of one or more candidate utterances according to the current dialog state, wherein the dialog model is a reinforcement learning model obtained by the method of any one of claims 1 to 8;
determining, based on the revenue scores, a target utterance responsive to the dialog text from the one or more candidate utterances.
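As a minimal sketch of the selection step above, where the `score` callable stands in for the trained dialog model (its form is hypothetical and not specified at this level of detail):

```python
def select_target_utterance(dialog_state, candidate_utterances, score):
    """Return the candidate utterance with the highest revenue score,
    where score(state, utterance) queries the trained dialog model."""
    return max(candidate_utterances, key=lambda u: score(dialog_state, u))
```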
10. The method of claim 9, wherein the revenue score is positively correlated with a probability that the corresponding candidate utterance causes the business objective to be achieved.
11. A system for training a dialog model for guiding a user toward achieving a business objective, comprising:
a dialog data acquisition unit configured to acquire multiple rounds of historical dialogs;
a training data extraction unit configured to extract a plurality of groups of training data from the multiple rounds of historical dialogs, wherein each group of the plurality of groups of training data comprises at least a current sample dialog state, a responsive utterance, a next sample dialog state, and a reward value corresponding to the responsive utterance; and
a model parameter updating unit configured to iteratively update parameters of a reinforcement learning model based on the plurality of groups of training data, so that the trained dialog model can determine a revenue score of a candidate utterance based on a current state of any dialog;
wherein the reward value corresponding to the responsive utterance is obtained by a weighted combination of sub-item reward values corresponding to the following information:
a relevance of the responsive utterance to the current sample dialog state, a probability of achievement of the business objective, an emotion score of a historical user utterance responsive to the responsive utterance, an intent category of the historical user utterance responsive to the responsive utterance, and a conversation engagement related to the responsive utterance.
12. The system of claim 11, wherein one group of the plurality of groups of training data further comprises:
a dialog-end identifier for identifying whether the corresponding training data is the last dialog turn in the round of historical dialogs.
13. The system of claim 11, wherein the training data extraction unit is further configured to:
process the current sample dialog state and the responsive utterance with a trained semantic matching model to determine the semantic relevance between the current sample dialog state and the responsive utterance.
14. The system of claim 11, wherein the training data extraction unit is further configured to:
process the current sample dialog state and the responsive utterance with a trained business-objective prediction model to determine the probability of achievement of the business objective; the business-objective prediction model being trained based on historical dialog data and the time interval between the historical dialog data and achievement of the business objective.
15. The system of claim 11, wherein the training data extraction unit is further configured to:
process the historical user utterance responsive to the responsive utterance with a trained emotion recognition model to determine the emotion score.
16. The system of claim 11, wherein the training data extraction unit is further configured to:
process the historical user utterance responsive to the responsive utterance with a trained intent recognition model to determine the intent category.
17. The system of claim 11, wherein the conversation engagement related to the responsive utterance is positively correlated with the number of dialog turns corresponding to the responsive utterance.
18. The system of claim 11, wherein the model parameter updating unit is further configured to:
place the plurality of groups of training data into an experience replay dataset;
randomly sample training data from the experience replay dataset, and update the parameters of the reinforcement learning model with the sampled training data based on an off-policy reinforcement learning method.
19. A dialog system, comprising:
a dialog data acquisition module configured to acquire a dialog text, the dialog text comprising at least one user utterance;
a current dialog state determination module configured to determine a current dialog state based on the dialog text;
a revenue score determination module configured to obtain, based on a dialog model, revenue scores of one or more candidate utterances according to the current dialog state, wherein the dialog model is a reinforcement learning model obtained by the method of any one of claims 1 to 8; and
a target utterance determination module configured to determine, based on the revenue scores, a target utterance responsive to the dialog text from the one or more candidate utterances.
20. The system of claim 19, wherein the revenue score is positively correlated with a probability that the corresponding candidate utterance causes a business objective to be achieved.
21. An apparatus for training a dialog model, comprising at least one storage medium and at least one processor, the at least one storage medium storing computer instructions; the at least one processor being configured to execute the computer instructions to implement the method of any one of claims 1-8.
22. A dialog apparatus, comprising at least one storage medium and at least one processor, the at least one storage medium storing computer instructions; the at least one processor being configured to execute the computer instructions to implement the method of claim 9 or 10.
CN202010251489.5A 2020-04-01 2020-04-01 Conversation method and system Active CN111160514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010251489.5A CN111160514B (en) 2020-04-01 2020-04-01 Conversation method and system

Publications (2)

Publication Number Publication Date
CN111160514A CN111160514A (en) 2020-05-15
CN111160514B (en) 2020-08-28

Family

ID=70567783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010251489.5A Active CN111160514B (en) 2020-04-01 2020-04-01 Conversation method and system

Country Status (1)

Country Link
CN (1) CN111160514B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761136A (en) * 2020-06-02 2021-12-07 阿里巴巴集团控股有限公司 Dialogue processing method, information processing method, model training method, information processing apparatus, model training apparatus, and storage medium
CN113761111A (en) * 2020-07-31 2021-12-07 北京汇钧科技有限公司 Intelligent conversation method and device
CN111966805B (en) * 2020-08-13 2021-11-09 贝壳找房(北京)科技有限公司 Method, device, medium and electronic equipment for assisting in realizing session
CN112328769A (en) * 2020-11-16 2021-02-05 北京沃东天骏信息技术有限公司 Automatic customer service response method, device and computer readable storage medium
CN112131372B (en) * 2020-11-25 2021-02-02 中国科学院自动化研究所 Knowledge-driven conversation strategy network optimization method, system and device
US11049510B1 (en) * 2020-12-02 2021-06-29 Lucas GC Limited Method and apparatus for artificial intelligence (AI)-based computer-aided persuasion system (CAPS)
CN112507094B (en) * 2020-12-11 2021-07-13 润联软件系统(深圳)有限公司 Customer service robot dialogue method based on reinforcement learning and related components thereof
CN112307188B (en) * 2020-12-30 2021-06-11 北京百度网讯科技有限公司 Dialog generation method, system, electronic device and readable storage medium
CN112988991B (en) * 2021-02-04 2023-04-18 支付宝(杭州)信息技术有限公司 Method and system for performing anti-fraud intervention through man-machine conversation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108227932A (en) * 2018-01-26 2018-06-29 上海智臻智能网络科技股份有限公司 Interaction intention determination method and device, computer equipment and storage medium
CN109086329A (en) * 2018-06-29 2018-12-25 出门问问信息科技有限公司 Multi-turn dialogue method and device based on topic keyword guidance

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10424302B2 (en) * 2017-10-12 2019-09-24 Google Llc Turn-based reinforcement learning for dialog management
CN108763495B (en) * 2018-05-30 2019-09-20 苏州思必驰信息科技有限公司 Interactive method, system, electronic equipment and storage medium
CN110837548B (en) * 2019-11-05 2022-11-11 泰康保险集团股份有限公司 Answer matching method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111160514A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
CN111160514B (en) Conversation method and system
CN110990547B Dialog script generation method and system
US11425064B2 (en) Customized message suggestion with user embedding vectors
CN111259132A Method and device for recommending dialog scripts, computer equipment and storage medium
CN109829044A (en) Dialogue method, device and equipment
CN111339309B (en) Corpus expansion method and system for user intention
US10929781B1 (en) Systems and methods for determining training parameters for dialog generation
CN110399472B (en) Interview question prompting method and device, computer equipment and storage medium
CN113010653B (en) Method and system for training and conversing conversation strategy model
CN116226334A (en) Method for training generated large language model and searching method based on model
US11514894B2 (en) Adaptively modifying dialog output by an artificial intelligence engine during a conversation with a customer based on changing the customer's negative emotional state to a positive one
CN117370512A (en) Method, device, equipment and storage medium for replying to dialogue
CN111651582B (en) Method and system for simulating user speaking
CN114386426B (en) Gold medal speaking skill recommendation method and device based on multivariate semantic fusion
CN116246632A (en) Method and device for guiding external call operation
CN114418320A (en) Customer service quality evaluation method, apparatus, device, medium, and program product
US11651439B2 (en) System and method for pre-qualifying a consumer for life and health insurance products or services, benefits products or services based on eligibility and referring a qualified customer to a licensed insurance agent, producer or broker to facilitate the enrollment process
CN113886539A Method and device for recommending dialog scripts, customer service equipment and storage medium
CN114519094A (en) Method and device for conversational recommendation based on random state and electronic equipment
CN114722164A (en) Intelligent comment replying method and device
CN111444308B (en) Method and system for simulating user to speak
US20240126991A1 (en) Automated interaction processing systems
CN116991982B (en) Interactive dialogue method, device, equipment and storage medium based on artificial intelligence
CN116822532A (en) Transaction dialogue generation method and device, storage medium and electronic equipment
CN116561284A (en) Intelligent response method, device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant