WO2021086589A1

WO2021086589A1 - Providing a response in automated chatting

Info

Publication number: WO2021086589A1
Application number: PCT/US2020/055296
Authority: WO
Inventors: Pingping LIN; Yue Liu; Lisong QIU; Ruihua Song
Original assignee: Microsoft Technology Licensing, Llc
Priority date: 2019-10-29
Filing date: 2020-10-13
Publication date: 2021-05-06
Also published as: CN112750430A

Abstract

The present disclosure provides a method and apparatus for providing a response in automated chatting. A message may be obtained in a chat flow. A context associated with the message may be determined, the context comprising a set of utterances, the set of utterances comprising the message. For each candidate response of a set of candidate responses, the candidate response may be scored based at least on information change between adjacent utterances among the set of utterances and the candidate response. A highest-scored candidate response among the set of candidate responses may be provided in the chat flow.

Description

PROVIDING A RESPONSE IN AUTOMATED CHATTING

BACKGROUND

[0001] Artificial intelligence (AI) chatbots are becoming more and more popular and are being used in more and more scenarios. Chatbots are designed to simulate human utterances and may chat with users through text, voice, images, etc. In general, a chatbot may identify language content within a message entered by a user or apply natural language processing to a message, and then provide the user with a response to the message.

SUMMARY

[0002] This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

[0003] Embodiments of the present disclosure provides a method and apparatus for providing a response in automated chatting. A message may be obtained in a chat flow. A context associated with the message may be determined, the context comprising a set of utterances, the set of utterances comprising the message. For each candidate response of a set of candidate responses, the candidate response may be scored based at least on information change between adjacent utterances among the set of utterances and the candidate response. A highest-scored candidate response among the set of candidate responses may be provided in the chat flow.

[0004] It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.

BRIEF DESCRIPTION OF THE DRAWINGS [0005] The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects. [0006] FIG. 1 illustrates an exemplary application scenario of a chatbot according to an embodiment of the present disclosure.

[0007] FIG. 2 illustrates an exemplary chat window according to an embodiment of the present disclosure.

[0008] FIG. 3 illustrates an exemplary process for obtaining a comprehensive relevance score according to an embodiment of the present disclosure.

[0009] FIG. 4 illustrates an exemplary Valence-Arousal model according to an embodiment of the present disclosure.

[0010] FIG. 5 illustrates an exemplary process for generating initial representations according to an embodiment of the present disclosure.

[0011] FIG. 6 illustrates an exemplary process for generating interaction representations according to an embodiment of the present disclosure.

[0012] FIG. 7 illustrates an exemplary process for semantic matching according to an embodiment of the present disclosure.

[0013] FIG. 8 illustrates an exemplary process for emotional matching according to an embodiment of the present disclosure.

[0014] FIG. 9 illustrates an exemplary process for performing aggregation according to an embodiment of the present disclosure.

[0015] FIG. 10 illustrates an exemplary chat flow and an associated emotional flow according to an embodiment of the present disclosure.

[0016] FIG. 11 illustrates an exemplary process for training a transitional memory- based matching model according to an embodiment of the present disclosure.

[0017] FIG. 12 illustrates an exemplary process for optimizing emotional representation with a conversation corpus according to an embodiment of the present disclosure.

[0018] FIG. 13 illustrates an exemplary process for optimizing emotional representations with a sentence corpus according to an embodiment of the present disclosure.

[0019] FIG. 14 illustrates an exemplary process for generating an additional emotional representation according to an embodiment of the present disclosure.

[0020] FIG. 15 illustrates an exemplary process for inserting an additional emotional representation according to an embodiment of the present disclosure.

[0021] FIG. 16 illustrates an exemplary process for combining multi- modality inputs through an early-fusion strategy according to an embodiment of the present disclosure. [0022] FIG. 17 illustrates an exemplary process for combining multi-modality inputs through a late-fusion strategy according to an embodiment of the present disclosure.

[0023] FIG. 18 illustrates an exemplary scenario for expressing emotional states of responses through light according to an embodiment of the present disclosure.

[0024] FIG. 19 is a flowchart of an exemplary method for providing a response in automated chatting according to an embodiment of the present disclosure.

[0025] FIG. 20 illustrates an exemplary apparatus for providing a response in automated chatting according to an embodiment of the present disclosure.

[0026] FIG. 21 illustrates an exemplary apparatus for providing a response in automated chatting according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

[0027] The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.

[0028] In general, a chatbot may chat automatically in a session with a user. Herein, the "session" may refer to a time-continuous conversation between two chat participants. When the chatbot is conducting automated chatting, it may receive messages from the user and reply by selecting a candidate response from a set of candidate responses stored in its associated database. Currently, when the chatbot selects a candidate response, it usually scores relevance between each candidate response and the message in the chat flow, and provides the user with a highest-scored candidate response. Since emotional change in the chat flow is not considered during the scoring process, the candidate response that is finally selected may significantly fluctuate in terms of emotion.

[0029] Embodiments of the present disclosure propose a method and apparatus for providing a response in automated chatting. According to an embodiment of the present disclosure, after a message in a chat flow being obtained, a context associated with the message may be determined, and a response being smooth and relevant to the context in both semantic and emotional terms may be provided. Herein, the context refers to all received messages and sent responses in a current session, i.e., a session in which the most recently received message is located, and may include the most recently received message itself.

[0030] In an aspect, an embodiment of the present disclosure proposes a transitional memory-based matching model that may model semantic change and emotional change in a chat flow and consider such change when selecting a candidate response, thereby may provide a response that is smoother and more natural in terms of semantic and emotion. [0031] In another aspect, an embodiment of the present disclosure proposes to use a multi-task framework to optimize emotional representations of a context and a candidate response by an additional emotion classification task. A training corpus with emotional labels may be used to perform the additional emotion classification task.

[0032] In another aspect, an embodiment of the present disclosure proposes to train a transitional memory-based matching model for a predetermined personality, thereby obtaining a chatbot with the predetermined personality. The personality of a speaker may be associated with his or her emotional change range in the speech. The transitional memory-based matching model may be trained based on the emotional change range constraint associated with the predetermined personality.

[0033] In another aspect, an embodiment of the present disclosure proposes to consider external factors that affect emotional states, such as weather, health condition, whether a good thing happened, whether a bad thing happened, etc., when making candidate response selections. A basic emotional state may be determined based on the external factors, thereby an emotional state of a selected response is consistent with the basic emotional state determined based on the external factors, and is smooth and relevant to previous utterances in the current session.

[0034] In another aspect, a transitional memory-based matching model proposed by an embodiment of the present disclosure may support multi-modality inputs. Inputs for different modalities of a particular utterance may be converted into corresponding representations. These representations may be combined through multiple fusion strategies.

[0035] In another aspect, an embodiment of the present disclosure proposes that a selected candidate response may be presented based on an emotional state of the response, and the emotional state of the selected candidate response may also be expressed by additionally providing other multi-modality signals.

[0036] In another aspect, an embodiment of the present disclosure proposes to achieve empathy between a chatbot and a user, and guide the user to obtain a positive emotional state.

[0037] FIG. 1 illustrates an exemplary application scenario 100 of a chatbot according to an embodiment of the present disclosure. In the scenario 100, a network 110 is applied to interconnect between a terminal device 120 and a chatbot server 130. The network 110 may be any type of network capable of interconnecting network entities. The network 110 may be a single network or a combination of various types of networks. In terms of coverage, the network 110 may be a Local Area Network (LAN), a Wide Area Network (WAN), etc. In terms of carrying medium, the network 110 may be a wireline network, a wireless network, etc. In terms of data switching techniques, the network 110 may be a circuit switching network, a packet switching network, etc.

[0038] The terminal device 120 may be any type of electronic computing device capable of connecting to the network 110, accessing a server or website on the network 110, processing data or signals, etc. For example, the terminal device 120 may be a desktop computer, a notebook computer, a tablet computer, a smart phone, etc. Although only one terminal device 120 is shown in FIG. 1, it is to be understood that a different number of terminal devices may be connected to the network 110.

[0039] The terminal device 120 may include a chatbot client 122 that may provide an automated chatting service to a user. In some implementations, the chatbot client 122 may interact with the chatbot server 130 and present to the user information and responses that the chatbot server 130 provides. For example, the chatbot client 122 may send a message entered by the user to the chatbot server 130 and receive a response relevant to the message from the chatbot server 130. However, it is to be understood that in other implementations, the chatbot client 122 may also generate locally a response to the message entered by the user, rather than interacting with the chatbot server 130.

[0040] The chatbot server 130 may conduct automated chatting with a user of the terminal device 120. A corpus for automated chatting may be stored in a chatbot database 132 that the chatbot server 130 connects with or the chatbot server 130 contains.

[0041] It is to be understood that all the network entities in FIG. 1 are exemplary, and any other network entity may be involved in the application scenario 100 according to specific application requirements.

[0042] FIG. 2 illustrates an exemplary chat window 200 according to an embodiment of the present disclosure. The chat window 200 may include a presenting area 210, a control area 220, and an input area 230. The presenting area 210 displays messages and responses in a chat flow. The control area 220 includes a plurality of virtual buttons for use by a user to perform message input settings. For example, the user may choose to perform voice input, attach an image file, select an emoji, take a screenshot of a current screen, etc. through the control area 220. The input area 230 is used for the user to enter a message. For example, the user may type a text through the input area 230. The chat window 200 may further include a virtual button 240 for confirming transmission of the entered message. If the user touches the virtual button 240, a message entered in the input area 230 may be transmitted to the presenting area 210.

[0043] It should be noted that all the units in FIG. 2 and their layouts are exemplary. According to specific application requirements, the chat window in FIG. 2 may omit or add any unit, and the layouts of the units in the chat window in FIG. 2 may also be changed in various ways.

[0044] According to an embodiment of the present disclosure, when conducting automated chatting, a chatbot may obtain a message in a chat flow, such as a message most recently received from a user, and determine a context associated with the message. The context may include all received messages and sent responses in a current session, and may include the most recently received message itself. Herein, the messages received and responses sent by the chatbot are collectively referred to as utterances. Thus, the context may include a set of utterances. The chatbot may also obtain a set of candidate responses from a database that it connects with or it contains, and for each candidate response of the set of candidate responses, score relevance between the candidate response and the context to obtain a comprehensive relevance score corresponding to the candidate response. The chatbot may then provide, in the chat flow, a candidate response with the highest comprehensive relevance score among the set of candidate responses.

[0045] FIG. 3 illustrates an exemplary process 300 for obtaining a comprehensive relevance score according to an embodiment of the present disclosure. The process 300 may be perfomied by, for example, the chatbot server 130 in FIG. 1. [0046] After obtaining a message in a chat flow, a chatbot server may determine a context associated with the message, such as a context 302 in FIG. 3, which may include all received utterances and sent messages in a session in which the message is located, such as utterances 302-1, 302-2, 302-3, ..., 302-n, wherein utterance 302-n may be a message currently obtained from the chat flow. In addition, the chatbot server may also obtain a set of candidate responses 304 from a database that it connects with or it contains, such as a chatbot database 132 in FIG. 1, which may include a plurality of candidate responses, such as a candidate response 306. The candidate response 306 is taken as an example to illustrate an exemplary process for obtaining a comprehensive relevance score of the candidate response 306. The context 302 and the candidate response 306 may be provided to a transitional memory-based matching model 308. The transitional memory- based matching model 308 may include, for example, an initial representation generating part 310, an interaction representation generation part 312, a matching part 314, and an aggregation part 316. [0047] At the initial representation generating part 310, an initial representation of the context 302 and an initial representation of the candidate response 306 may be generated. Herein, the initial representation refers to a representation generated based on a representation of each utterance in the context or the candidate response.

[0048] The representation of each utterance may include a semantic representation and/or an emotional representation. The emotional representation may be generated based on a variety of approaches for characterizing emotional states. In an implementation, the emotional states may be characterized through a Valence- Arousal (V-A) model. FIG. 4 illustrates an exemplary V-A model 400 according to an embodiment of the present disclosure. The V-A model 400 maps emotional features to a two-dimensional space, which is defined by two orthogonal dimensions such as valence and arousal. The valence may represent the polarity of emotion, such as negative emotion and positive emotion, and indicate the degree by continuous values in the range of, for example, [-1, 0] and [0, 1], respectively. The arousal may indicate the energy of emotion, and indicate the degree by a continuous value in the range of, for example, [0, 1], Almost all human emotional states may be mapped to points defined in this two-dimensional space based on valence value- arousal value pairs (V-A pairs). Four exemplary emotional states are shown in FIG. 4, such as "happy", "satisfied", "nervous", and "sad". The emotional state "happy" may be mapped, for example, to point 402 in the V-A model 400, whose V-A pair is (0.8, 0.6). The emotional state "satisfied" may be mapped, for example, to point 404 in the V-A model 400, whose V-A pair is (0.7, 0.4). The emotional state "nervous" may be mapped, for example, to point 406 in the V-A model 400, whose V-A pair is (-0.3, 0.9). The emotional state "sad" may be mapped, for example, to point 406 in the V-A model 408, whose V-A pair is (-0.8, 0.3).

[0049] It is to be understood that characterizing the emotional states by the V-A model described in conjunction with FIG. 4 is only an example, and the emotional states may also be characterized in other ways. For example, the emotional states may be characterized by a six-category method, that is, the emotional states are characterized by a probability distribution for six basic emotion types. These six basic types of emotion include, for example, anger, happiness, surprise, disgust, sadness, and fear. The emotion representation according to an embodiment of the present disclosure may be based on any one of approaches for characterizing emotional states.

[0050] Based on a semantic representation of each utterance in the context 302, a context semantic initial representation may be generated. Based on an emotional representation of each utterance in the context 302, a context emotional initial representation may be generated. Based on a semantic representation of the candidate response 306, a candidate response semantic initial representation may be generated. Based on an emotional representation of the candidate response 306, a candidate response emotional initial representation may be generated. The specific process for generating the above initial representations will be explained later in conjunction with FIG. 5.

[0051] After the initial representations of the context 302 and the initial representations of the candidate response 306 are obtained, at the interaction representation generation part 312, interaction representations of the context 302 and interaction representations of the candidate response 306 may be further generated. Herein, an interaction representation refers to a representation generated based on information change between every two adjacent utterances among the context and/or the candidate response. Such information change may include semantic change and/or emotional change. Based on the semantic change between every two adjacent utterances among the context 302, a context semantic interaction representation may be generated. Based on the emotional change between every two adjacent utterances among the context 302, a context emotional interaction representation may be generated. Based on the semantic change between every two adjacent utterances among the context 302 and the candidate response 306, a candidate response semantic interaction representation may be generated. Based on the emotional change between every two adjacent utterances among the context 302 and the candidate response 306, a candidate response emotional interaction representation may be generated. The specific process for generating the above interaction representations will be explained later in conjunction with FIG. 6.

[0052] At the matching part 314, a matching process may be performed based on the generated initial representations and interaction representations. Each of the initial representations and the interaction representations may include a semantic representation and an emotional representation. Accordingly, the matching may include semantic matching and emotional matching. The semantic matching may be performed between two semantic representations to obtain a semantic relevance representation and a semantic interaction relevance representation. The specific process for the semantic matching will be explained later in conjunction with FIG. 7. The emotional matching may be performed between two emotional representations to obtain an emotional initial relevance representation and an emotional interaction relevance representation. The specific process for the emotional matching will be explained later in conjunction with FIG. 8. [0053] After obtaining the semantic relevance representation, the semantic interaction relevance representation, the emotional initial relevance representation, and the emotional interaction relevance representation, these relevance representations may be aggregated at the aggregation part 316 to obtain a comprehensive relevance score 318. The specific process for performing the aggregation will be explained later in conjunction with FIG. 9.

[0054] FIG. 5 illustrates an exemplary process 500 for generating initial representations according to an embodiment of the present disclosure. The initial representations may include a semantic initial representation and an emotional initial representation, for example, context initial representations may include a context semantic initial representation and a context emotional initial representation, and candidate response initial representations may include a candidate response semantic initial representation and a candidate response emotional initial representation. The processes for generating the semantic initial representations and the emotional initial representations are similar.

[0055] The process 500 may be performed on a context 502 and a candidate response 512. The context 502 may correspond to the context 302 in FIG. 3. The context 502 may include, for example, utterances 502-1, 502-2, 502-3, ..., 502-n, which may correspond to the utterances 302-1, 302-2, 302-3, ..., 302-n in FIG. 3, respectively. The candidate response 512 may correspond to the candidate response 306 in FIG. 3.

[0056] Word vector sequences corresponding to the utterances 502-1, 502-2, 502-3, ..., 502-n, respectively, may be generated through embedding layers 504-1, 504-2, ..., 504-n.

Assume that the context 502 may be represented as {u¹, u², u³, ..., uⁿ}, wherein u represents an utterance, and u^k represents the k -th utterance in the context 502, that is, utterance 502-k. After being processed by an embedding layer, u^k may be represented as , wherein e represents a word vector of the j-th word in utterance 502-k, and m represents the number of words in utterance 502-k.

[0057] Similarly, a word vector sequence corresponding to the candidate response 512 may be generated through an embedding layer 514. This word vector may be represented as R = [e_r1, e_r2 ,..., e_rm], wherein e_rj represents a word vector of the j-th word in the candidate response 512, that is, the candidate response r, and m represents the number of words in the candidate response 512.

[0058] Subsequently, word-level representations 508-1, 508-2, 508-3, ..., 508-n corresponding to utterances 502-1, 502-2, 502-3, ..., 502-n may be generated through attention mechanisms and feed-forward neural networks 506-1, 506-2, ..., 506-n, respectively. Similarly, a word-level representation 518 corresponding to the candidate response 512 may be generated through an attention mechanism and a feed-forward neural network 516. A word-level representation 508-k: corresponding to the utterance 502-k may be represented as U^k _self , and the word-level representation 518 corresponding to the candidate response 512 may be represented as R_self . U^k _self and R_self may be represented, for example, by the following formula:

wherein f_ATT( ) represents output of an attention mechanism and a feed-forward neural network.

[0059] A context initial representation 510, that is may be generated

through combining, such as cascading, the word-level representations 508-1, 508-2, 508-3, ..., 508-n. The word-level representation 518 may be adopted as a candidate response initial representation 520. Both a semantic initial representation and an emotional initial representation may be generated through the process 500 in FIG. 5. Through the process 500, a context semantic initial representation , a context emotional initial

representation

a candidate response semantic initial representation R^s _self, and a candidate response emotional initial representation R^e _self may be generated.

[0060] After the word-level representations corresponding to the respective utterances in the context and the candidate response being generated, context interaction representations and candidate response interaction representations may be further generated based on these word-level representations. FIG. 6 illustrates an exemplary process 600 for generating interaction representations according to an embodiment of the present disclosure. The interaction representations may include semantic interaction representations and emotional interaction representations, for example, context interaction representations may include a context semantic interaction representation and a context emotional interaction representation, and candidate response interaction representations may include a candidate response semantic interaction representation and a candidate response interaction initial representation. The processes for generating a semantic interaction representation and an emotional interaction representation are similar.

[0061] Firstly, word-level representations 602-1, 602-2, 602-3, ..., 602 -n corresponding to respective utterance in a context 602 and a word-level representation 618 corresponding to a candidate response 616 may be obtained, wherein a word-level representation 602 -k corresponds to utterance k in the context 602, i.e., u^k. The context 602 may correspond to the context 502 in FIG. 5, and the word-level representations 602- 1, 602-2, 602-3, ..., 602 -n may correspond to the word-level representations 508-1, 508-2, 508-3, ..., 508-n in FIG. 5, respectively. The candidate response 616 may correspond to the candidate response 512 in FIG. 5, and the word-level representation 618 may correspond to the word-level representation 518 in FIG. 5.

[0062] Sentence-level representations 606-1, 606-2, 606-3, 606-3, ..., 606 -n corresponding to the word-level representations 602-1, 602-2, 602-3, ..., 602 -n, respectively, may be generated through recurrent neural networks and attention mechanisms 604-1, 604-2, ..., 604 -n. Similarly, a sentence-level representation 622 corresponding to the word-level representation 618 may be generated through a recurrent neural network and an attention mechanism 620.

[0063] A sentence-level representation 606-k: corresponding to utterance k in the context 602 may be represented as U^k _utter , and the sentence-level representation 622 corresponding to the candidate response 616 may be represented as R_utter. The process for generating sentence-level representations U^k _utter and R_utter through recurrent neural networks and attention mechanisms may be represented, for example, by the following formulas. Firstly, a hidden state H^{u,r}[i] corresponding to the z^'-th word in respective utterance of utterance u in the context or the candidate response r may be calculated, as shown in the following formula: H^{u,r}[i] = GRU(W_self[i], H^{u,r}[i - 1]) (3) wherein GRU represents a Gated Recurrent Unit, W_self ∈ { U^k _self, R_self} ,H^{u _,r} ∈ R^mxd represents a hidden state corresponding to respective utterance in the context or candidate response, wherein m represents the number of words in the corresponding utterance, and d represents a dimension. Subsequently, an attention mechanism and average pooling may be performed on the hidden state H^{u,r} to obtain a sentence-level representation U^k _utter corresponding to respective utterance u^k in the context and a sentence-level representation R_utter corresponding to the candidate response r, as shown in the following formula: U^k _utter = mean(f_ATT(H^uk, H^uk)) (4)

R_utter = mean(f_ATT(H^r, H^r)) (5) wherein mean ( ) represents average pooling.

[0064] A difference between sentence-level representations of adjacent utterances among the context and the candidate response may be calculated based on M_utter and

R_utter. Such a difference may reflect information change between adjacent utterances among the context and the candidate response, such as semantic change and/or emotional change. As shown in FIG. 6, a difference 608-1 may be calculated based on a sentence- level representation 606-1 and required preceding information, wherein the difference 608-1 may reflect information change between utterance 1 in the context 602 and the required preceding information, and wherein the required preceding information may be initialized to zero; a difference 608-2 may be calculated based on a sentence-level representation 606-2 and the sentence-level representation 606-1, wherein the difference 608-2 may reflect information change between utterance 2 and utterance 1 in the context 602; a difference 608-3 may be calculated based on a sentence-level representation 606-3 and the sentence-level representation 606-2, wherein the difference 608-3 may reflect information change between utterance 3 and utterance 2 in context 602, ..., by analogy, a difference 608-n may be calculated, which may reflect information change between utterance n and utterance n- 1 in the context 602. An utterance adjacent to the candidate response 616 is utterance n in the context 602. Accordingly, a difference 624 may be calculated based on a sentence-level representation 622 of the candidate response 616 and a sentence-level representation 606-n of utterance n. wherein the difference 624 may reflect information change between the candidate response 616 and utterance n.

[0065] A difference 608-k: between the sentence-level representation

and a sentence-level representation

may be represented, for example, as

A difference 624 between the sentence-level representation R_utter and a sentence-level representation U _utter may be represented, for example, as T^r _local. In an implementation,

T^k _local and T^r _local may be calculated, for example, by the following formulas:

wherein ReLU represents a Rectified Linear Unit, ʘ represents element-wise multiplication, W_t and b_t are trainable parameters, and U⁰ _utter may be filled with zeros. [0066] After obtaining the differences 608-1, 608-2, 608-3, ..., 608 -n and 624, at 610, utterance interaction representations 612-1, 612-2, 612-3, ..., 612 -n corresponding to respective utterances in the context and a candidate response interaction representation 626 corresponding to the candidate response may be generated based on these differences. In an implementation, an utterance interaction representation 612 -k corresponding to utterance k in the context 602 may be generated based on the difference between sentence- level representations of every two adjacent uterances of uterance k in the context 602 and preceding uterances of uterance k, wherein the preceding uterances of utterance k may include uterances before uterance k in the context 602. For example, an uterance interaction representation 612-3 corresponding to uterance 3 in the context 602 may be generated based on the differences 608-2 and 608-3, and an uterance interaction representation 612 -n corresponding to uterance n may be generated based on the differences 608-2, 608-3, ..., 608-n. and a candidate response interaction representation 626 corresponding to the candidate response 616 may be generated based on the differences 608-2, 608-3, ..., 608 -n and 624.

[0067] In an implementation, an uterance interaction representation generating 610 may integrating respective differences through a Transitional Memory Network and by copying historical memories. Herein, the memory is implemented by using a recurrent attention mechanism, wherein a feed-forward neural network may be used to transform uterance k into memory representation and transform the candidate response into

memory representation , as shown in the following formulas:

(8) (9)

wherein

and

represent input memory representations, and

represent output memory representations, and W^{in,out} and b^{in,out} are trainable parameters.

[0068] A global representation , for uterance k in the

context and the candidate response may be obtained, wherein when k' ∈ {1,2, ... , n} represents a global representation for uterance k, and when

represents a global representation for the candidate response. may be calculated,

for example, by the following formulas:

[0069] The above process may be performed iteratively, wherein the representation results between adjacent hops may be integrated by residuals. An uterance interaction representation

for utterance k and the candidate response may be obtained, for example, by concatenating and

, as shown in the following formula:

(12) wherein when may correspond to

in formula (7). When k' ∈

{1,2, ..., n

represents an utterance interaction representation for an utterance in the context, and when k' = n + 1 ,

represents an interaction representation for the candidate response, that is, the candidate response interaction representation 626, i.e., T^r. The utterance interaction representation

may reflect a difference in representation between utterance k' and all previous utterances before utterance k' in the current session, i.e., utterance 1 to utterance k'-1. Subsequently, a context interaction representation 614, i.e., may be obtained by concatenating the utterance interaction representations

612-2, 612-3, ..., 612-n corresponding to the respective utterances in the context 602.

[0070] Both a semantic initial representation and an emotional initial representation may be generated through the process 600 in FIG. 6. Through the process 600, a context semantic interaction representation a context emotional interaction

representation a candidate response semantic interaction representation T^s,r,

and a candidate response emotional interaction representation T^e,r may be generated. [0071] As explained above in connection with FIG. 6, the generation of the context interaction representation and the candidate response interaction representation according to embodiments of the present disclosure considers the difference in representation between adjacent utterances among the context and the candidate response, and further considers the difference in representation between respective utterance in the context and the candidate response and preceding utterances of this utterance in the current session. Such differences may reflect information change during the session, such as semantic change and emotional change. In other words, embodiments of the present disclosure propose to model a semantic flow and an emotional flow in the session, so that the semantic change and the emotional change in the session may be effectively tracked. Referring back to FIG. 3, the context interaction representation and the candidate response interaction representation may then be used in subsequent matching and aggregation processes, and finally generate a comprehensive relevance score indicating relevance between the candidate response and the context. Since the generation of the context interaction representation and the candidate response interaction representation considers the semantic change and the emotional change between adjacent utterances among the context and the candidate response, such change will also be taken into account when generating the comprehensive relevance score, thereby, a calculated relevance score of a candidate response that is smoother and more natural relative to the context in terms of semantic and emotion will be higher.

[0072] FIG. 7 illustrates an exemplary process 700 for semantic matching according to an embodiment of the present disclosure. A context semantic initial representation 704 and a context semantic interaction representation 706 corresponding to a context 702 may be obtained. The context 702 may correspond to the context 302 in FIG. 3. The context semantic initial representation 704 and the context semantic interaction representation 706 may be represented as

, respectively. A candidate response

semantic initial representation 710 and a candidate response semantic interaction representation 712 corresponding to a candidate response 708 may be obtained. The candidate response 708 may correspond to the candidate response 306 in FIG. 3. The candidate response semantic initial representation 710 and the candidate response semantic interaction representation 712 may be represented as R ^s _self and T^s,r, respectively. The context semantic initial representation 704 and the candidate response semantic initial representation 710 may be generated, for example, through the process 500 in FIG. 5, and the context semantic interaction representation 706 and the candidate response semantic interaction representation 712 may be generated, for example, through the process 600 in

FIG. 6.

[0073] The context semantic initial representation 704 and the candidate response semantic initial representation 710 may be matched 714 to generate a semantic initial relevance representation 716. The semantic initial relevance representation 716 may be represented, for example, as . The semantic initial relevance representation 716 may

indicate relevance between the context semantic initial representation 704 and the candidate response semantic initial representation 710. The generation of the semantic initial relevance representation 716

may be represented, for example, by the following formulas:

wherein and are trainable parameters. [0074] The context semantic interaction representation 706 and the candidate response semantic interaction representation 712 may be matched 718 to generate a semantic interaction relevance representation 720. The semantic interaction relevance representation 720 may be represented, for example, as

The semantic interaction relevance representation 720 may indicate relevance between the context semantic interaction representation 706 and the candidate response semantic interaction representation 712. The generation of the semantic interaction relevance representation 720

may be represented, for example, by the following formulas:

wherein W

are trainable parameters.

[0075] FIG. 8 illustrates an exemplary process 800 for emotional matching according to an embodiment of the present disclosure. A context emotional initial representation 804 and a context emotional interaction representation 806 corresponding to a context 802 may be obtained. The context 802 may correspond to the context 302 in FIG. 3. The context emotional initial representation 804 and the context emotional interaction representation 806 may be represented as respectively. A candidate response

emotional initial representation 810 and a candidate response emotional interaction representation 812 corresponding to a candidate response 808 may be obtained. The candidate response 808 may correspond to the candidate response 306 in FIG. 3. The candidate response emotional initial representation 810 and the candidate response emotional interaction representation 812 may be represented as

respectively. The context emotional initial representation 804 and the candidate response emotional initial representation 810 may be generated, for example, through the process 500 in FIG. 5, and the context emotional interaction representation 806 and the candidate response emotional interaction representation 812 may be generated, for example, through the process 600 in FIG. 6.

[0076] The context emotional initial representation 804 and the candidate response emotional initial representation 810 may be matched 814 to generate an emotional initial relevance representation 816. The emotional initial relevance representation 816 may be represented, for example, as The emotional initial relevance representation 816 may

indicate relevance between the context emotional initial representation 804 and the candidate response emotional initial representation 810. The generation of emotional initial relevance representation 816 may be represented, for example, by the following formulas:

wherein are trainable parameters.

[0077] The context emotional interaction representation 806 and the candidate response emotional interaction representation 812 may be matched 818 to generate an emotional interaction relevance representation 820. The emotional interaction relevance representation 820 may be represented, for example, as The emotional interaction

relevance representation 820 may indicate relevance between the context emotional interaction representation 806 and the candidate response emotional interaction representation 812. The generation of the emotional interaction relevance representation 820 may be represented, for example, by the following formula:

wherein are trainable parameters.

[0078] FIG. 9 illustrates an exemplary process 900 for performing aggregation according to an embodiment of the present disclosure. The process 900 may be performed by the aggregation part 316 in the transitional memory-based matching model 308 shown in FIG. 3. A semantic initial relevance representation 902 and a semantic interaction relevance representation 904 in FIG. 9 may correspond to the semantic initial relevance representation 716 and the semantic interaction relevance representation 720 in FIG. 7, respectively, and an emotional initial relevance representation 920 and an emotional interaction relevance representation 922 in FIG. 9 may correspond to the emotional initial relevance representation 816 and the emotional interaction relevance representation 820 in FIG. 8, respectively.

[0079] The semantic initial relevance representation 902 may be processed by, for example, two layers of recurrent neural networks 906 and 908, as shown in the following formulas:

wherein represents the number of words in the corresponding

utterance; k ∈ (1,2, ...,n) n represents the number of utterances in the context; may

be initialized to zero; and

may be used for the subsequent relevance score calculating process.

[0080] The semantic interaction relevance representation 904 may be processed by a recurrent neural network 910, as shown in the following formula:

wherein k ∈ (1,2, ... , n}, n represents the number of utterances in the context; and

may be used for the subsequent relevance score calculating process.

[0081] At 912, the processed semantic initial relevance representation 902 and the processed semantic interaction relevance representation 904 may be combined, such as cascaded, to obtain a semantic relevance representation 914. Subsequently, through a forward neural network 916, a semantic relevance score 918 may be generated based on the semantic relevance representation 914, as shown in the following formula:

wherein are trainable parameters.

[0082] The emotional initial relevance representation 920 may be processed by, for example, two layers of recurrent neural networks 924 and 926, as shown in the following formulas:

wherein

represents the number of words in the corresponding utterance; k ∈ (1,2, ..., n], n represents the number of utterances in the context; may

be initialized to zero; and

may be used for the subsequent relevance score calculating process.

[0083] The emotional interaction relevance representation 922 may be processed by a recurrent neural network 928, as shown in the following formula:

wherein k ∈ (1,2, ... , n], n represents the number of utterances in the context; and 4 may be used for the subsequent relevance score calculating process.

[0084] At 930, the processed emotional initial relevance representation 920 and the processed emotional interaction relevance representation 922 may be combined, such as cascaded, to obtain an emotional relevance representation 932. Subsequently, through a forward neural network 934, an emotional relevance score 936 may be generated based on the emotional relevance representation 932, as shown in the following formula:

wherein are trainable parameters.

[0085] At 938, the semantic relevance score 918 and the emotional relevance score 936 may be combined to obtain a comprehensive relevance score 940. The comprehensive relevance score 940 may be represented, for example, as g. The comprehensive relevance score 940 may correspond to the comprehensive relevance score 318 in FIG. 3. In an implementation, the comprehensive relevance score 940 may be obtained by summing the semantic relevance score 918 and the emotional relevance score 936, as shown in the following formula:

[0086] The specific operation process for each part in the transitional memory-based matching model is described above in conjunction with FIGs. 5-9. It is to be understood that these processes are merely exemplary. Each process may employ any other unit, may include any other step, and may include more or fewer steps, depending on the actual application requirements.

[0087] FIG. 10 illustrates an exemplary chat flow 1000a and associated emotional flow 1000b according to an embodiment of the present disclosure. The chat flow 1000a may occur between a chatbot and a user.

[0088] At 1002, the chatbot may output an utterance U1 "I like Taurus girls so much! ". In the case where emotional states are characterized by a V-A model, an emotional state Eui of the utterance U1 may be, for example, (0.804, 0.673).

[0089] At 1004, the user may enter an utterance U2 "Well, Scorpio boys always like Taurus girls. This is a fact." An emotional state Era of the utterance U2 may be, for example, (0.392, 0.616).

[0090] At 1006, the chatbot may output an utterance U3 ""But why can't I meet a Taurus girl who likes me?". An emotional state Era of the utterance U3 may be, for example, (-0.348, 0.647).

[0091] At 1008, the user may enter an utterance U4 "Because your circle of friends is too narrow". An emotional state Era of the utterance U4 may be, for example, (-0.339, 0.599). [0092] The position of each emotional state of the utterances U1 to U4 in the V-A model is shown in the emotion flow 1000b.

[0093] After receiving the utterance U4 at 1008, the chatbot may firstly determine a context associated with the utterance U4, which includes, for example, the utterances U1 to U4. The chatbot may then determine a response to be provided to the user from a set of candidate responses in a database that it connects with or contains. For example, the chatbot may calculate a comprehensive relevance score between each candidate response of the set of candidate responses and the context. A block 1010 shows two exemplary candidate responses, that is, candidate response R1 "I will meet one" and candidate response R2: "Forget it, I'm kidding. Hahahaha". An emotional state ERI of the candidate response R1 may be, for example, (-0.837, 0.882). The emotional state ER2 of the candidate response R2 may be, for example, (0.225, 0.670).

[0094] The comprehensive relevance score may be calculated, for example, through the process 300 in FIG. 3 in combination with the processes 500-900 in FIGs. 5-9. Since the calculation of the comprehensive relevance score considers semantic change and emotional change between adjacent utterances among the context and the candidate response, as well as between each utterance among the context and the candidate response and preceding utterances of this utterance in the current session, a calculated relevance score of a candidate response that is more smooth and natural relative to the context in terms of semantic and emotion will be higher. For example, a relevance score SI corresponding to the candidate response R1 with the emotional state of (-0.837, 0.882) may be 0.562, and a relevance score S2 corresponding to the candidate response R2 with the emotional state of (0.225, 0.670) may be 0.114. The relevance score SI is higher than the relevance score S2, so the chatbot finally outputs the candidate response R1 "I will meet one" at 912. It can also be seen from the emotion flow 1000b that compared with the candidate response R2, the emotional state of the candidate response R1 is smoother and more natural relative to the utterances U1 to U4.

[0095] FIG. 11 illustrates an exemplary process 1100 for training a transitional memory-based matching model according to an embodiment of the present disclosure. A transitional memory-based matching model 1106 in FIG. 11 may correspond to the transitional memory-based matching model 308 in FIG. 3. The transitional memory-based matching model 1106 may include an initial representation generating part 1108, an interaction representation generation part 1110, a matching part 1112, and an aggregation part 1114, which may correspond to the initial representation generating part 310, the interaction representation generation part 312, the matching part 314 and the aggregation part 316 in FIG. 3, respectively.

[0096] Training of the transitional memory-based matching model 1106 may be based on a corpus 1150. The corpus 1150 may include a plurality of conversation-based training samples, such as [context c\, candidate response r₁, relevance label yi |. [context c₂, candidate response r₂, relevance label y₂], [context C₃, candidate response r₃, relevance label y₃], etc., wherein context c_i may include a set of conversation-based utterances, candidate response r_i may be a candidate response for context c_i, and the relevance label y_i ∈ {0,1} may indicate relevance between context c_i and candidate response r_i. wherein "0" may indicate that candidate response r_i is irrelevant to context c_i and "1" may indicate that candidate response r_i is relevant to context C_i.

[0097] Take a training sample i [context c_i, candidate response r_i. relevance label y_i] in the corpus 1150 as an example. The context c_i 1102 and the candidate response r_i 1104 may be used as input to the transitional memory-based matching model 1106. The transitional memory-based matching model 1106 may perform a scoring task on the relevance between context c_i and candidate response r_i , and output a comprehensive relevance score g(c_i,r_i) 1116. The comprehensive relevance score may be calculated, for example, through the process 300 in FIG. 3 in combination with the processes 500-900 in FIGS. 5-9. In an implementation, a prediction loss of the training sample i may be calculated as a binary cross-entropy loss, and a prediction loss corresponding to the

scoring task is calculated by summing the prediction losses of all the training samples, as shown by the following formula:

[0098] Embodiments of the present disclosure propose to use a multi-task framework to utilize an additional emotion classification task to optimize emotional representations of a context and a candidate response, such as, the context emotional initial representation and the candidate response emotional initial representation generated through the initial representation generating part 310 in FIG. 3, and the context emotional interaction representation and the candidate response emotional interaction representation generated through the interaction representation generation part 312. During the training process, the additional emotion classification task may be performed in conjunction with the scoring task described with reference to FIG. 11. A corpus that includes training data with emotional labels may be utilized to perform the additional emotion classification task. In an implementation, the corpus may be a conversation corpus including a plurality of conversation-based training samples. FIG. 12 illustrates an exemplary process 1200 for optimizing emotional representations with a conversation corpus according to an embodiment of the present disclosure.

[0099] In FIG. 12, a corpus 1250 for performing the additional emotion classification task to optimize emotional representations may include a plurality of conversation-based training samples, such as [context c₁. candidate response r₁. emotional label {z_{1 ,j}}], [context c₂, candidate response r₂, emotional label {z_{2, j} }], [context c₃, candidate response r₃, emotional label {z_3,j}|etc.. wherein context c_i may include a set of conversation-based uterances, and candidate response r_i may be a candidate response for context c,. Different forms of the emotional label may be provided for different approaches for characterizing emotional states. For example, when using a six-category method to characterize emotional states, the emotional label for the emotional category j in the training sample i may be represented as z_i,j ∈ {0,1}.

[00100] Take using training sample i [context c_i. candidate response r_i. emotional label {z_i,j}] to perform the additional emotion classification task as an example. Firstly, a candidate response emotional initial representation 1206 corresponding to a candidate response r_i 1204 may be generated. The candidate response emotional initial representation 1206 may be generated, for example, through the initial representation generating part 310 in FIG. 3, and more specifically, through the process 500 in FIG. 5. The candidate response emotional initial representation 1206 may be expressed as R _eself, which may correspond to, for example, R_self in the above formula (2). Subsequently, a candidate response emotional interaction representation 1210 corresponding to the candidate response r_i may be generated based on the context c_i 1202 and the candidate response r_i 1204. The candidate response emotional interaction representation 1210 may be generated, for example, through the interaction representation generation part 312 in FIG. 3, and more specifically, through the process 600 in FIG. 6. The candidate response emotional interaction representation 1210 may be represented as T^e, which may, for example, correspond to T^e,r that may be calculated by the above formula (12).

[00101] At 1212, the candidate response emotional initial representation 1206 processed by a pooling layer 1208 may be combined with the candidate response emotional interaction representation 1210 to obtain a candidate response emotional comprehensive representation. A forward neural network 1214 may generate an emotional prediction result h(x_i) 1216 based on the candidate response emotional comprehensive representation, as shown in the following formula:

wherein is a trainable parameter for linear transformation; mean

represents an average pooling function; and K is the number of emotion types, for example, K may be 6 when the six-category method is used to characterize emotional states.

[00102] In an implementation, a prediction loss of the training sample i may be calculated as a multi-class cross-entropy loss, and a prediction loss L_emo corresponding to the additional emotion classification task is calculated by summing the prediction losses of all the training samples, as shown by the following formula:

wherein K is the number of emotion types, and M is the number of training samples. [00103] In addition to the conversation corpus, a sentence corpus based on sentences may also be used to perform an additional emotion classification task to optimize emotional representations. FIG. 13 illustrates an exemplary process 1300 for optimizing emotional representations with a sentence corpus according to an embodiment of the present disclosure.

[00104] A corpus 1350 in FIG. 13 may include a plurality of training samples, such as [utterance x_1; emotional label {z_{1 ;·}}], [utterance x₂, emotion labeling (z_2,j}|. [utterance x₃, emotion labeling {z_{3 ;j}}], etc., wherein emotional label {z_{i, j}} is used to indicate the emotional state of utterance x_i. Different forms of an emotional label may be provided for different approaches for characterizing emotional states. For example, when using a six- category method to characterize emotional states, the emotional label for the emotional category j in the training sample i may be represented as z_i,j ∈ {0,1}.

[00105] Take using training sample i [utterance x_i, emotional label {z_{i, j}}| to perform the additional emotion classification task as an example. Firstly, a word-level representation 1304 corresponding to an utterance x_i.1302 may be generated. The word- level representation 1304 may be generated, for example, through the initial representation generating part 310 in FIG. 3, and more specifically, through the process 500 in FIG. 5. A pooling layer 1306 and a forward neural network 1308 may process the word-level representation 1304 to obtain an emotion prediction result h(x_i) 1310. Subsequently, a prediction loss

corresponding to the additional emotion classification task may be calculated based on the emotional prediction result

1310 and emotional label

In an implementation, the prediction loss of the training sample i may be calculated as a multi-class cross-entropy loss, and the prediction loss corresponding to the

additional emotion classification task is calculated by summing the prediction losses of all the training samples, as shown by the above formula (32).

[00106] It is to be understood that performing the additional emotion classification task by using the conversation corpus described with reference to FIG. 12 and performing the additional emotion classification task by using the sentence corpus described with reference to FIG. 13 may be performed separately or together. In the case of being performed together, the prediction loss corresponding to the additional emotion classification task may be calculated based on both the prediction loss obtained by performing the additional emotion classification task by using the conversation corpus and the prediction loss obtained by performing the additional emotion classification task by using the sentence corpus.

[00107] The scoring task in FIG. 11 and the additional emotion classification task in FIG. 12 and / or FIG. 13 may be performed jointly. A total prediction loss may be

calculated by weighted summing the prediction loss

corresponding to the scoring task and the prediction loss corresponding to the additional emotion classification

task, as shown in the following formula:

wherein α is a hyper-parameter set by the system.

[00108] People with different personalities may have different emotional change ranges. For example, the emotions of an emotional person may change easily, while the emotions of a quiet person may be difficult to change. For example, an emotional person may easily become very depressed even if he was very happy for the last second. An embodiment of the present disclosure proposes that a transitional memory-based matching model, such as the transitional memory-based matching model 308 in FIG. 3, may be trained for a predetermined personality to obtain a chatbot with a predetermined personality.

[00109] In an implementation, a transitional memory-based matching model may be trained based on an emotional change range constraint between two adjacent utterances that is associated with a predetermined personality. For example, during the training process of the transitional memory-based matching model, a prediction loss

associated with an emotional change range, such as an emotional change range between two adjacent utterances, may be added to the prediction loss function shown in the above formula (33) , and a weight β associated with the prediction loss may be set, as

shown by the following formula:

wherein β is a hyper-parameter set by the system, which may affect the proportion of the predicted loss associated with the emotional change range to the total predicted

loss If it is desired to train a chatbot with a large emotional change range, such as a chatbot with an emotional personality,β may be set to be small, so that the proportion of prediction loss

to the total prediction loss may be small. On the contrary, if it is desired to train a chatbot with a small emotional change range, such as a chatbot with a quiet personalityβ, may be set to be large, so that the proportion of prediction loss

^lo the total prediction loss may be large.

[00110] Emotional states may also be affected by external factors such as weather, health condition, whether a good thing happened, whether a bad thing happened, etc. For example, if a speaker is sick or the weather is bad, he may be down even if he hears good news; while if a speaker is healthy or the weather is good, he may be calm even if he hears bad news. An embodiment of the present disclosure proposes that when providing a response, not only a context in a chat flow, but also external factors that affect an emotional state of a chatbot may be considered.

[00111] In an implementation, an additional emotional representation corresponding to an external factor may be generated and inserted among a set of word-level representations corresponding to a set of utterances in a context of a chat flow, thereby affecting subsequent relevance score generating and further affecting the selection of a candidate response.

[00112] FIG. 14 illustrates an exemplary process 1400 for generating an additional emotional representation according to an embodiment of the present disclosure.

[00113] Firstly, an external factor 1402 that affects an emotional state of a chatbot may be identified, such as weather, health condition, whether a good thing happened, whether a bad thing happened, etc. External factor such as weather may be related to actual conditions, such as the actual weather conditions of the day, and may be obtained through other applications. External factors such as health condition, whether a good thing happened, whether a bad thing happened may be artificially defined or automatically defined by the system.

[00114] At 1404, the external factor 1402 may be mapped to an emotional state 1406 corresponding to the external factor 1402 through a predefined function. In the case that a V-A model is used to characterize emotional states, the emotional state 1406 may be, for example, a V-A pair.

[00115] Subsequently, through a forward neural network 1408, an additional emotional representation 1410 may be generated based on the emotional state 1406. Herein, a generated emotional representation corresponding to an external factor is referred to as an additional emotional representation. In the case that the emotional state 1406 is a V-A pair, the forward neural network 1408 may generate an additional emotional representation 1410 by converting the emotional state 1406 into a valence vector and an arousal vector, and combining the valence vector and the arousal vector.

[00116] After the additional emotional representation is generated, it may be inserted among a set of word-level representations corresponding to a set of utterances in a context of a chat flow. FIG. 15 illustrates an exemplary process 1500 for inserting an additional emotional representation according to an embodiment of the present disclosure.

[00117] Firstly, a set of word-level representations 1504-1, 1504-2, 1502-3, ..., 1504-« corresponding to utterances 1502-1, 1502-2, 1502-3, ..., 1502-«, respectively, in a context 1502 may be obtained. For example, the word-level representations 1504-1, 1504-2, 1502- 3, ..., 1504-n may be generated through the process 500 in FIG. 5. In an implementation, an additional emotional representation 1506 generated, for example, through the process 1400 of FIG. 14 may be inserted before a representation of a first utterance of a current session, that is, before the word-level representation 1504-1. In another implementation, the additional emotional representation 1506 may be inserted before a word-level representation of the current utterance, that is, before the word-level representation 1504- n.

[00118] An updated context initial representation 1508 may be generated based on the word-level representations 1504-1, 1504-2, 1502-3, ..., 1504-n and the additional emotional representation 1506. For example, the updated context initial representation 1508 may be generated through cascading the word-level representations 1504-1, 1504-2, 1502-3, ..., 1504-n and the additional emotional representation 1506. An updated context interaction representation 1510 may also be generated based on the word-level representations 1504-1, 1504-2, 1502-3, ..., 1504-n and the additional emotional representation 1506. In addition, the word-level representations 1504-1, 1504-2, 1502-3, ..., 1504-n and the additional emotion representations 1506 along with a word-level representation 1514 of a candidate response 1512 may also generate an updated response interaction representation 1516. For example, the updated context interaction representation 1510 and the updated response interaction representation 1516 may be generated through the process 600 in FIG. 6.

[00119] The generation of the updated context initial representation 1508, the updated context interaction representation 1510, and the updated response interaction representation 1516 considers an additional emotional representation corresponding to an external factor. These updated representations may then be used in a subsequent matching process, such as the process 800 in FIG. 8, and a subsequent aggregation process, such as the process 900 in FIG. 9, and ultimately obtain a comprehensive relevance score. Since the generation of the updated context initial representation 1508, the updated context interaction representation 1510, and the updated response interaction representation 1516 considers the additional emotional representation corresponding to the external factor, the additional emotional representations are also taken into account when generating the comprehensive relevance score, so that a calculated relevance score for a candidate response that is consistent with an emotional state of the additional emotional representation will be higher.

[00120] In another implementation, a basic emotional state of a chatbot may also be determined based on external factors. For example, when an external factor is "good weather", the basic emotional state of the chatbot may be determined as "high mood"; while when the external factor is "bad weather", the basic emotional state of the chatbot may be determined as "low mood". Then, a threshold corresponding to the basic emotional state may be set for each candidate response. In some embodiments, only a valence threshold may be set. Taking a candidate response "ha-ha" as an example, the valence threshold corresponding to "high mood" may be "0.1", while a valence threshold corresponding to "low mood" may be "0.8", for example. In this case, when the basic emotional state determined based on external factors is "high mood", as long as a valence value of the emotional state of the chatbot predicted according to the context in the session is greater than "0.1", the candidate response "ha-ha" may be provided; while when the basic emotional state determined based on external factors is "low mood", only when the valence value of the emotional state of the chatbot predicted is greater than "0.8", the candidate response "ha-ha" may be provided. [00121] In addition, after the basic emotional state of the chatbot is determined based on external factors, the emotional state of the chatbot may also be adapted according to the determined basic emotional state. For example, when the basic emotional state is "high mood", the valence value of the emotional state of the chatbot predicted according to the context in the session may be increased, for example, multiplied by a coefficient greater than 1; when the basic emotional state is "low mood", the valence value of the emotional state of the chatbot predicted according to the context in the session may be reduced, for example, multiplied by a coefficient less than 1.

[00122] The foregoing describes different ways in which the chatbot considers external factors that affect emotional states when providing responses. These ways may make emotional states of responses provided throughout the session consistent with the basic emotional state determined by the external factors. It is to be understood that the foregoing ways are merely exemplary, and the embodiments of the present disclosure are not limited thereto, but emotional states of responses provided by the chatbot can be caused in any other way to be consistent with the basic emotional state determined by the external factors.

[00123] A transitional memory-based matching model according to an embodiment of the present disclosure may support multi -modality inputs. Each utterance that is an input of a transitional memory-based matching model may employ at least one of the following modalities: text, voice, facial expressions, and gestures. For example, when a user uses a terminal device to chat with a chatbot, a microphone on the terminal device may capture voice, a speech recognition software may convert the voice into text, or the user may directly enter text. In addition, a camera on the terminal device may capture the user's facial expressions, body gestures, and hand gestures. Inputs of different modalities for a particular utterance may be converted into corresponding representations. These representations may be combined together through an early-fusion strategy or a late-fusion strategy to generate a context initial representation and a context interaction representation. Herein, the early-fusion strategy refers to combining representations of various modality inputs for each utterance into a comprehensive representation of the utterance, and then generating an context initial representation and a context interaction representation based on the comprehensive representation of the utterance and comprehensive representations of other utterances. The late-fusion strategy refers to using representations of various modality inputs of each utterance to generate intermediate initial representations and intermediate interaction representations in respective modalities, and then generating a context initial representation and a context interaction representation by combining the generated intermediate initial representations and intermediate interaction representations, respectively.

[00124] FIG. 16 illustrates an exemplary process 1600 for combining multi-modality inputs through an early-fusion strategy according to an embodiment of the present disclosure.

[00125] Assume that a transitional memory-based matching model according to an embodiment of the present disclosure may support m modality inputs. In FIG. 16, an utterance 1 1602 may have, for example, a modality 1 input 1602-1, a modality 2 input 1602-2, ..., a modality m input 1602 -m. These inputs may be converted into corresponding representations, such as, a representation 1 of utterance 1 1604-1, a representation 2 of utterance 1 1604-2, ..., a representation m of utterance 1 1604 -m. Similarly, an utterance 2 1606 may, for example, have a modality 1 input 1606-1, a modality 2 input 1606-2, ..., a modality m input 1606 -m. These inputs may be converted into corresponding representations, such as a representation 1 of utterance 2 1608-1, a representation 1 of utterance 2 1608-2, ..., a representation m of utterance 2 1608 -m. It is to be understood that although it is shown in FIG. 16 that both utterance 1 and utterance 2 have m modality inputs, the number of modality inputs that utterance 1 and utterance 2 have may be less than m. Without a certain modality input, the modality input and the corresponding representation may be initialized to zero.

[00126] The representation 1 of utterance 1 1604-1, the representation 2 of utterance 1 1604-2, ..., the representation m of utterance 1 1604 -m may be combined together to generate a comprehensive representation of utterance 1 1610. Similarly, the representation 1 of utterance 2 1608-1, the representation 2 of utterance 2 1608-2, ..., the representation m of utterance 2 1608 -m may be combined together to generate a comprehensive representation of utterance 2 1612. A context initial representation 1614 and a context interaction representation 1616 may be generated based on the comprehensive representation of utterance 1 1610, the comprehensive representation of utterance 2 1612, and possible comprehensive representations (not shown) of other utterances. The context initial representation 1614 and the context interaction representation 1616 may be generated, for example, through the process 500 in FIG. 5 and the process 600 in FIG. 6 respectively. The context initial representation 1614 and the context interaction representation 1616 may be used in subsequent matching and aggregation processes, and finally engage in generating a comprehensive relevance score indicating relevance between a candidate response and a context.

[00127] FIG. 17 illustrates an exemplary process 1700 for combining multi-modality inputs through a late-fusion strategy according to an embodiment of the present disclosure. [00128] Assume that a transitional memory-based matching model according to an embodiment of the present disclosure may support m modality inputs. In FIG. 17, an utterance 1 may have, for example, a modality 1 input of utterance 11702-1, a modality 2 input of utterance 11702-2, ..., a modality m input of utterance 11702-m. These inputs may be converted into corresponding representations, respectively, such as a representation 1 of utterance 11704-1, a representation 2 of utterance 11704-2 , ..., a representation m of utterance 11704-m. Similarly, an utterance 21706 may have, for example, a modality 1 input of utterance 21706-1, a modality 2 input of utterance 21706- 1, ..., a modality m input of utterance 21706-m. These inputs may be converted into corresponding representations, respectively, such as a representation 1 of utterance 2 1708-1, a representation 2 of utterance 21708-2, ..., a representation m of utterance 2 1708-m. It is to be understood that although it is shown in FIG. 17 that both utterance 1 and utterance 2 have m modality inputs, the number of modality inputs that utterance 1 and utterance 2 have may be less than m. Without a certain modality input, the modality input and the corresponding representation may be initialized to zero.

[00129] A representation of each modality input of each utterance may be used to generate an intermediate initial representation and an intermediate interaction representation in respective modality input. For example, an intermediate initial representation corresponding to modality 11710-1 and an intermediate interaction representation corresponding to modality 11712-1 may be generated based on the representation 1 of utterance 11704-1, the representation 1 of utterance 21708-1, and representations of possible other utterances corresponding to mode 1 (not shown); an intermediate initial representation corresponding to modality 21710-2 and intermediate interaction representation corresponding to modality 21712-2 may be generated based on the representation 2 of utterance 11704-2, the representation 2 of utterance 21708-2, and representations of possible other utterances corresponding to mode 2 (not shown); ...; an intermediate initial representation corresponding to modality m 1710-m and an intermediate interaction representation corresponding to modality m 1712-m may be generated based on the representation m of utterance 11704-m, the representation m of utterance 21708-/W, and representations of possible other utterances corresponding to mode m (not shown). The intermediate initial representations 1710-1,1710-2,...,1710-m may be generated, for example, through a process similar to the process 500 in FIG. 5 that used to generate the context initial representation, and the intermediate interaction representations 1712-1, 1712-2,..., 1712- in may be generated, for example, through a process similar to the process 600 in FIG. 6 that used to generate the context interaction representation.

[00130] Then, a context initial representation 1714 may be generated through combining the intermediate initial representation 1710-1, the intermediate initial representation 1710-2, ..., the intermediate initial representation 1710-m, and a context interaction representation 1716 may be generated through combining the intermediate interaction representation 1712-1, intermediate interaction representation 1712-2, ..., intermediate interaction representation 1712 -m. The context initial representation 1714 and the context interaction representation 1716 may be used in subsequent matching and aggregation processes, and finally engage in generating a comprehensive relevance score indicating relevance between a candidate response and a context.

[00131] It is to be understood that although only two utterances are shown in FIGs. 16 and 17, the processes for combining the multi-modality inputs through the early -fusion strategy and the late-fusion strategy according to the embodiments of the present disclosure are not limited to a specific number of utterances, but rather may be applied to any number of utterances in a similar manner. In addition, the processes for combining the multi-modality inputs through the early-fusion strategy and the late-fusion strategy shown in FIGs. 16 and 17, respectively, are only exemplary, and the embodiments of the present disclosure are not limited thereto. For example, for the late-fusion strategy, a context initial relevance representation and a context interaction relevance representation may be obtained by firstly using a representation of each modality input of each utterance to generate an intermediate initial relevance representation and an intermediate interaction relevance representation in respective modality, and then combining the generated intermediate initial relevance representations and intermediate interaction relevance representations, respectively. The context initial relevance representation and the context interaction relevance representation may engage in generating a comprehensive relevance score indicating relevance between the candidate response and the context.

[00132] According to an embodiment of the present disclosure, after a candidate response to be provided to a user is selected, a chatbot may present the response based on an emotional state of the selected candidate response. In some embodiments, the chatbot may express, in a corresponding manner, the emotional state of the selected candidate response based on a modality of the response. For example, in the case that the response is in voice, when its emotional state is "happy", the chatbot may present the response with a fast speech rate or a high tone. In addition, the emotional state of the response may be expressed by additionally providing other multi-modality signals, for example, by facial expressions, body gestures, or hand gestures, etc. of the chatbot. In an implementation, when presenting a response, a corresponding light may be provided at the same time to express the emotional state of the response.

[00133] FIG. 18 illustrates an exemplary scenario 1800 for expressing emotional states of response through light according to an embodiment of the present disclosure. This scenario may happen between a user and a smart speaker. The smart speaker may be equipped with a chatbot implemented according to the embodiments of the present disclosure. The smart speaker may respond to the user's voice input by providing a voice response and corresponding light.

[00134] At 1802, the user may say "So annoying!". At 1804, the smart speaker may reply by providing a voice response: "Cheer up! I still like to see you laugh." The emotional state of the voice response at 1804 may have a relatively positive valence, for example, a valence value of "0.9", so the light provided in association with it may have a strong brightness.

[00135] At 1806, the user may then say "But I don't want to laugh now." At 1808, the smart speaker could reply by providing a voice response "You should learn to laugh. Everyone can do it." The emotional state of the voice response at 1808 may have a generally positive valence, for example, a valence value of "0.6", so the light provided in association with it may have a weak brightness.

[00136] At 1810, the user may continue to say "I can't do it." At 1812, the smart speaker may reply by providing a voice response: "Let me make you happy!" The emotional state of the voice response at 1812 may have a relatively positive valence, for example, a valence value of "0.9", so the light provided in association with it may have a strong brightness.

[00137] FIG. 18 shows an example for expressing different emotional states of a response through different light brightness. It is to be understood that the embodiments of the present disclosure are not limited thereto, for example, in the case of expressing emotional states through light, emotional states of responses may also be expressed through the color, duration, etc. of the light. In addition, the emotional states of the responses may be expressed by any other multi-modality signals. [00138] According to an embodiment of the present disclosure, a selection of a candidate response may be based on semantic relevance and emotional relevance between a candidate response and a context. When determining the semantic relevance and the emotional relevance, messages received and responses sent by a chatbot are collectively considered as utterances in the context, and no distinction is made between the received messages and the sent responses. This may share emotional states between the chatbot and a user, and achieve empathy between the chatbot and the user. Further, the chatbot may drive the user's emotional state to the direction of positive valence by providing a more positive response, such as a response with a higher valence value, thereby guiding the user to obtain an emotional state with a positive valence before the end of the session.

[00139] FIG. 19 is a flowchart of an exemplary method 1900 for providing a response in automated chatting according to an embodiment of the present disclosure.

[00140] At step 1910, a message may be obtained in a chat flow.

[00141] At step 1920, a context associated with the message may be determined, the context comprising a set of utterances, the set of utterances comprising the message. [00142] At step 1930, for each candidate response of a set of candidate responses, the candidate response may be scored based at least on information change between adjacent utterances among the set of utterances and the candidate response.

[00143] At step 1940, a highest-scored candidate response among the set of candidate responses may be provided in the chat flow.

[00144] In an implementation, the information change may comprise at least one of semantic change and emotional change.

[00145] The scoring may comprise at least one of: generating a semantic relevance score for the candidate response based at least on the semantic change between adjacent utterances among the set of utterances and the candidate response; and generating an emotional relevance score for the candidate response based at least on the emotional change between adjacent utterances among the set of utterances and the candidate response.

[00146] The scoring may comprise: generating a comprehensive relevance score for the candidate response based on the semantic relevance score and the emotional relevance score.

[00147] In an implementation, the scoring may comprise: generating a context interaction representation corresponding to the context based on information change between every two adjacent utterances of the set of utterances; generating a candidate response interaction representation corresponding to the candidate response based on information change between every two adjacent utterances among the set of utterances and the candidate response; obtaining an interaction relevance representation through matching the context interaction representation with the candidate response interaction representation; and generating a relevance score for the candidate response based at least on the interaction relevance representation.

[00148] The scoring may further comprise: generating a context initial representation corresponding to the context based on a representation of each utterance of the set of utterances; generating a candidate response initial representation corresponding to the candidate response; obtaining an initial relevance representation through matching the context initial representation with the candidate response initial representation; and generating a relevance score for the candidate response based on a combination of the initial relevance representation and the interaction relevance representation.

[00149] The information change may comprise semantic change, and the context interaction representation may include a context semantic interaction representation, the candidate response interaction representation may include a candidate response semantic interaction representation, the interaction relevance representation may include a semantic interaction relevance representation, the context initial representation may include an context semantic initial representation, the candidate response initial representation may include a candidate response semantic initial representation, the initial relevance representation may include a semantic initial relevance representation, and the relevance score may be a semantic relevance score.

[00150] The information change may comprise emotional change, and the context interaction representation may include a context emotional interaction representation, the candidate response interaction representation may include a candidate response emotional interaction representation, the interaction relevance representation may include an emotional interaction relevance representation, the context initial representation may include an context emotional initial representation, the candidate response initial representation may include a candidate response emotional initial representation, the initial relevance representation may include an emotional initial relevance representation, and the relevance score may be an emotional relevance score.

[00151] In an implementation, the method 1900 may further comprise: identifying external factors that affect emotional states; and adding the external factors into the context. [00152] In an implementation, at least one utterance of the set of utterances may employ at least one of the following modalities: text, voice, facial expressions, and gestures.

[00153] In an implementation, the method 1900 may further comprise: presenting the highest-scored candidate response based on an emotional state of the candidate response. [00154] In an implementation, the scoring may be performed through a transitional memory-based matching model, the transitional memory-based matching model being optimized through an additional emotion classification task during a training process. [00155] In an implementation, the scoring may be performed through a transitional memory-based matching model, the transitional memory-based matching model being trained based on an emotional change range constraint between two adjacent utterances that is associated with a predetermined personality.

[00156] It is to be understood that the method 1900 may further comprise any steps/processes for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above.

[00157] FIG. 20 illustrates an exemplary apparatus 2000 for providing a response in automated chatting according to an embodiment of the present disclosure. The apparatus 2000 may comprise: a message obtaining module 2010, for obtaining a message in a chat flow; a context determining module 2020, for determining a context associated with the message, the context comprising a set of utterances, the set of utterances comprising the message; a scoring module 2030, for scoring, for each candidate response of a set of candidate responses, the candidate response based at least on information change between adjacent utterances among the set of utterances and the candidate response; and a response providing module 2040, for providing a highest-scored candidate response among the set of candidate responses in the chat flow.

[00158] In an implementation, the information change may comprise at least one of semantic change and emotional change.

[00159] The scoring module 2030 may be further configured for performing at least one of: generating a semantic relevance score for the candidate response based at least on the semantic change between adjacent utterances among the set of utterances and the candidate response; and generating an emotional relevance score for the candidate response based at least on the emotional change between adjacent utterances among the set of utterances and the candidate response.

[00160] In an implementation, the apparatus 2000 may further comprise: an external factor identifying module, for identifying external factors that affect emotional states; and an external factor adding module, for adding the external factors into the context.

[00161] In an implementation, the scoring module 2030 may comprise a transitional memory-based matching model, the transitional memory-based matching model being optimized through an additional emotion classification task during a training process. [00162] In an implementation, the scoring module 2030 may comprise a transitional memory-based matching model, the transitional memory-based matching model being trained based on an emotional change range constraint between two adjacent utterances that is associated with a predetermined personality.

[00163] It is to be understood that the apparatus 2000 may further comprise any other modules configured for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above.

[00164] FIG. 21 illustrates an exemplary apparatus 2100 for providing a response in automated chatting according to an embodiment of the present disclosure.

[00165] The apparatus 2100 may comprise at least one processor 2110. The apparatus 2100 may further comprise a memory 2120 coupled with the processor 2110. The memory 2120 may store computer-executable instructions that, when executed, cause the processor 2110 to perform any operations of the method for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above. [00166] The embodiments of the present disclosure may be embodied in a non- transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above.

[00167] It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.

[00168] It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.

[00169] Processors are described in connection with various apparatus and methods. These processors can be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), state machine, gate logic, discrete hardware circuitry, and other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a microcontroller, a DSP, or other suitable platforms.

[00170] Software should be considered broadly to represent instructions, instruction sets, code, code segments, program code, programs, subroutines, software modules, applications, software applications, software packages, routines, subroutines, objects, running threads, processes, functions, and the like. Software can reside on computer readable medium. Computer readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a random access memory (RAM), a read only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).

[00171] The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skilled in the art are intended to be encompassed by the claims.

Claims

1. A method for providing a response in automated chatting, comprising: obtaining a message in a chat flow; determining a context associated with the message, the context comprising a set of utterances, the set of utterances comprising the message; for each candidate response of a set of candidate responses, scoring the candidate response based at least on information change between adjacent utterances among the set of utterances and the candidate response; and providing a highest-scored candidate response among the set of candidate responses in the chat flow.

2. The method of claim 1, wherein the information change comprises at least one of semantic change and emotional change.

3. The method of claim 2, wherein the scoring comprises at least one of: generating a semantic relevance score for the candidate response based at least on the semantic change between adjacent utterances among the set of utterances and the candidate response; and generating an emotional relevance score for the candidate response based at least on the emotional change between adjacent utterances among the set of utterances and the candidate response.

4. The method of claim 3, wherein the scoring comprises: generating a comprehensive relevance score for the candidate response based on the semantic relevance score and the emotional relevance score.

5. The method of claim 1, wherein the scoring comprises: generating a context interaction representation corresponding to the context based on information change between every two adjacent utterances of the set of utterances; generating a candidate response interaction representation corresponding to the candidate response based on information change between every two adjacent utterances among the set of utterances and the candidate response; obtaining an interaction relevance representation through matching the context interaction representation with the candidate response interaction representation; and generating a relevance score for the candidate response based at least on the interaction relevance representation.

6. The method of claim 5, wherein the scoring further comprises: generating a context initial representation corresponding to the context based on a representation of each utterance of the set of utterances; generating a candidate response initial representation corresponding to the candidate response; obtaining an initial relevance representation through matching the context initial representation with the candidate response initial representation; and generating a relevance score for the candidate response based on a combination of the initial relevance representation and the interaction relevance representation.

7. The method of claim 6, wherein the information change comprises semantic change, and the context interaction representation includes a context semantic interaction representation, the candidate response interaction representation includes a candidate response semantic interaction representation, the interaction relevance representation includes a semantic interaction relevance representation, the context initial representation includes an context semantic initial representation, the candidate response initial representation includes a candidate response semantic initial representation, the initial relevance representation includes a semantic initial relevance representation, and the relevance score is a semantic relevance score.

8. The method of claim 6, wherein the information change comprises emotional change, and the context interaction representation includes a context emotional interaction representation, the candidate response interaction representation includes a candidate response emotional interaction representation, the interaction relevance representation includes an emotional interaction relevance representation, the context initial representation includes a context emotional initial representation, the candidate response initial representation includes a candidate response emotional initial representation, the initial relevance representation includes an emotional initial relevance representation, and the relevance score is an emotional relevance score.

9. The method of claim 1, further comprising: identifying external factors that affect emotional states; and adding the external factors into the context.

10. The method of claim 1, wherein at least one utterance of the set of utterances employs at least one of the following modalities: text, voice, facial expressions, and gestures.

11. The method of claim 1, further comprising: presenting the highest-scored candidate response based on an emotional state of the candidate response.

12. The method of claim 1, wherein the scoring is performed through a transitional memory-based matching model, the transitional memory-based matching model being optimized through an additional emotion classification task during a training process.

13. The method of claim 1, wherein the scoring is performed through a transitional memory-based matching model, the transitional memory-based matching model being trained based on an emotional change range constraint between two adjacent utterances that is associated with a predetermined personality.

14. An apparatus for providing a response in automated chatting, comprising: a message obtaining module, for obtaining a message in a chat flow; a context determining module, for determining a context associated with the message, the context comprising a set of utterances, the set of utterances comprising the message; a scoring module, for scoring, for each candidate response of a set of candidate responses, the candidate response based at least on information change between adjacent utterances among the set of utterances and the candidate response; and a response providing module, for providing a highest-scored candidate response among the set of candidate responses in the chat flow.

15. An apparatus for providing a response in automated chatting, comprising: at least one processor; and a memory storing computer executable instructions that, when executed, cause the at least one processor to: obtain a message in a chat flow, determine a context associated with the message, the context comprising a set of utterances, the set of utterances comprising the message, for each candidate response of a set of candidate responses, score the candidate response based at least on information change between adjacent utterances among the set of utterances and the candidate response, and provide a highest-scored candidate response among the set of candidate responses in the chat flow.