WO2021086589A1 - Providing a response in automated chatting

Providing a response in automated chatting

Info

Publication number
WO2021086589A1
WO2021086589A1, PCT/US2020/055296, US2020055296W
Authority
WO
WIPO (PCT)
Prior art keywords
representation
context
candidate response
emotional
interaction
Application number
PCT/US2020/055296
Other languages
French (fr)
Inventor
Pingping LIN
Yue Liu
Lisong QIU
Ruihua Song
Original Assignee
Microsoft Technology Licensing, Llc
Application filed by Microsoft Technology Licensing, Llc
Publication of WO2021086589A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G06F40/35 - Discourse or dialogue representation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00 - Manipulators not otherwise provided for
    • B25J11/0005 - Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/55 - Rule-based translation
    • G06F40/56 - Natural language generation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • Chatbots are becoming increasingly popular and are being used in more and more scenarios. Chatbots are designed to simulate human utterances and may chat with users through text, voice, images, etc. In general, a chatbot may identify language content within a message entered by a user or apply natural language processing to the message, and then provide the user with a response to the message.
  • Embodiments of the present disclosure provide a method and apparatus for providing a response in automated chatting.
  • a message may be obtained in a chat flow.
  • a context associated with the message may be determined, the context comprising a set of utterances, the set of utterances comprising the message.
  • For each candidate response of a set of candidate responses, the candidate response may be scored based at least on information change between adjacent utterances among the set of utterances and the candidate response.
  • a highest-scored candidate response among the set of candidate responses may be provided in the chat flow, as sketched in the example below.
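  • The following Python sketch illustrates this claimed flow at a high level, only for orientation. The function and parameter names (e.g., score_candidate) are hypothetical and do not appear in the disclosure; score_candidate stands in for the transitional memory-based matching model described below.

```python
# Minimal sketch of the claimed method: obtain a message, determine its
# context (the set of utterances in the current session, including the
# message), score each candidate response based at least on information
# change between adjacent utterances, and provide the highest-scored one.
# 'score_candidate' is a hypothetical stand-in for the matching model.

def provide_response(message, session_utterances, candidate_responses, score_candidate):
    context = session_utterances + [message]          # set of utterances incl. the message
    scored = [(score_candidate(context, r), r) for r in candidate_responses]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response

# Usage (hypothetical): provide_response("Why not?", prior_utterances, candidates, model_score)
```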
  • FIG. 1 illustrates an exemplary application scenario of a chatbot according to an embodiment of the present disclosure.
  • FIG. 2 illustrates an exemplary chat window according to an embodiment of the present disclosure.
  • FIG. 3 illustrates an exemplary process for obtaining a comprehensive relevance score according to an embodiment of the present disclosure.
  • FIG. 4 illustrates an exemplary Valence-Arousal model according to an embodiment of the present disclosure.
  • FIG. 5 illustrates an exemplary process for generating initial representations according to an embodiment of the present disclosure.
  • FIG. 6 illustrates an exemplary process for generating interaction representations according to an embodiment of the present disclosure.
  • FIG. 7 illustrates an exemplary process for semantic matching according to an embodiment of the present disclosure.
  • FIG. 8 illustrates an exemplary process for emotional matching according to an embodiment of the present disclosure.
  • FIG. 9 illustrates an exemplary process for performing aggregation according to an embodiment of the present disclosure.
  • FIG. 10 illustrates an exemplary chat flow and an associated emotional flow according to an embodiment of the present disclosure.
  • FIG. 11 illustrates an exemplary process for training a transitional memory-based matching model according to an embodiment of the present disclosure.
  • FIG. 12 illustrates an exemplary process for optimizing emotional representation with a conversation corpus according to an embodiment of the present disclosure.
  • FIG. 13 illustrates an exemplary process for optimizing emotional representations with a sentence corpus according to an embodiment of the present disclosure.
  • FIG. 14 illustrates an exemplary process for generating an additional emotional representation according to an embodiment of the present disclosure.
  • FIG. 15 illustrates an exemplary process for inserting an additional emotional representation according to an embodiment of the present disclosure.
  • FIG. 16 illustrates an exemplary process for combining multi-modality inputs through an early-fusion strategy according to an embodiment of the present disclosure.
  • FIG. 17 illustrates an exemplary process for combining multi-modality inputs through a late-fusion strategy according to an embodiment of the present disclosure.
  • FIG. 18 illustrates an exemplary scenario for expressing emotional states of responses through light according to an embodiment of the present disclosure.
  • FIG. 19 is a flowchart of an exemplary method for providing a response in automated chatting according to an embodiment of the present disclosure.
  • FIG. 20 illustrates an exemplary apparatus for providing a response in automated chatting according to an embodiment of the present disclosure.
  • FIG. 21 illustrates an exemplary apparatus for providing a response in automated chatting according to an embodiment of the present disclosure.
  • a chatbot may chat automatically in a session with a user.
  • the "session” may refer to a time-continuous conversation between two chat participants.
  • When the chatbot is conducting automated chatting, it may receive messages from the user and reply by selecting a candidate response from a set of candidate responses stored in its associated database.
  • When the chatbot selects a candidate response, it usually scores relevance between each candidate response and the message in the chat flow, and provides the user with the highest-scored candidate response. Since emotional change in the chat flow is not considered during the scoring process, the candidate response that is finally selected may fluctuate significantly in terms of emotion.
  • Embodiments of the present disclosure propose a method and apparatus for providing a response in automated chatting.
  • a context associated with the message may be determined, and a response that is smooth and relevant to the context in both semantic and emotional terms may be provided.
  • the context refers to all received messages and sent responses in a current session, i.e., a session in which the most recently received message is located, and may include the most recently received message itself.
  • an embodiment of the present disclosure proposes a transitional memory-based matching model that may model semantic change and emotional change in a chat flow and consider such change when selecting a candidate response, thereby providing a response that is smoother and more natural in terms of semantics and emotion.
  • an embodiment of the present disclosure proposes to use a multi-task framework to optimize emotional representations of a context and a candidate response by an additional emotion classification task. A training corpus with emotional labels may be used to perform the additional emotion classification task.
  • an embodiment of the present disclosure proposes to train a transitional memory-based matching model for a predetermined personality, thereby obtaining a chatbot with the predetermined personality.
  • the personality of a speaker may be associated with his or her emotional change range in the speech.
  • the transitional memory-based matching model may be trained based on the emotional change range constraint associated with the predetermined personality.
  • an embodiment of the present disclosure proposes to consider external factors that affect emotional states, such as weather, health condition, whether a good thing happened, whether a bad thing happened, etc., when making candidate response selections.
  • a basic emotional state may be determined based on the external factors, so that an emotional state of a selected response is consistent with the basic emotional state determined based on the external factors, and is smooth and relevant to previous utterances in the current session.
  • a transitional memory-based matching model proposed by an embodiment of the present disclosure may support multi-modality inputs. Inputs for different modalities of a particular utterance may be converted into corresponding representations. These representations may be combined through multiple fusion strategies.
  • an embodiment of the present disclosure proposes that a selected candidate response may be presented based on an emotional state of the response, and the emotional state of the selected candidate response may also be expressed by additionally providing other multi-modality signals.
  • an embodiment of the present disclosure proposes to achieve empathy between a chatbot and a user, and guide the user to obtain a positive emotional state.
  • FIG. 1 illustrates an exemplary application scenario 100 of a chatbot according to an embodiment of the present disclosure.
  • a network 110 is applied to interconnect between a terminal device 120 and a chatbot server 130.
  • the network 110 may be any type of network capable of interconnecting network entities.
  • the network 110 may be a single network or a combination of various types of networks.
  • the network 110 may be a Local Area Network (LAN), a Wide Area Network (WAN), etc.
  • the network 110 may be a wireline network, a wireless network, etc.
  • the network 110 may be a circuit switching network, a packet switching network, etc.
  • the terminal device 120 may be any type of electronic computing device capable of connecting to the network 110, accessing a server or website on the network 110, processing data or signals, etc.
  • the terminal device 120 may be a desktop computer, a notebook computer, a tablet computer, a smart phone, etc. Although only one terminal device 120 is shown in FIG. 1, it is to be understood that a different number of terminal devices may be connected to the network 110.
  • the terminal device 120 may include a chatbot client 122 that may provide an automated chatting service to a user.
  • the chatbot client 122 may interact with the chatbot server 130 and present to the user information and responses that the chatbot server 130 provides.
  • the chatbot client 122 may send a message entered by the user to the chatbot server 130 and receive a response relevant to the message from the chatbot server 130.
  • the chatbot client 122 may also generate locally a response to the message entered by the user, rather than interacting with the chatbot server 130.
  • the chatbot server 130 may conduct automated chatting with a user of the terminal device 120.
  • a corpus for automated chatting may be stored in a chatbot database 132 that the chatbot server 130 connects with or the chatbot server 130 contains.
  • FIG. 2 illustrates an exemplary chat window 200 according to an embodiment of the present disclosure.
  • the chat window 200 may include a presenting area 210, a control area 220, and an input area 230.
  • the presenting area 210 displays messages and responses in a chat flow.
  • the control area 220 includes a plurality of virtual buttons for use by a user to perform message input settings. For example, the user may choose to perform voice input, attach an image file, select an emoji, take a screenshot of a current screen, etc. through the control area 220.
  • the input area 230 is used for the user to enter a message. For example, the user may type a text through the input area 230.
  • the chat window 200 may further include a virtual button 240 for confirming transmission of the entered message. If the user touches the virtual button 240, a message entered in the input area 230 may be transmitted to the presenting area 210.
  • The chat window in FIG. 2 may omit or add any unit, and the layouts of the units in the chat window in FIG. 2 may also be changed in various ways.
  • a chatbot when conducting automated chatting, may obtain a message in a chat flow, such as a message most recently received from a user, and determine a context associated with the message.
  • the context may include all received messages and sent responses in a current session, and may include the most recently received message itself.
  • the messages received and responses sent by the chatbot are collectively referred to as utterances.
  • the context may include a set of utterances.
  • the chatbot may also obtain a set of candidate responses from a database that it connects with or it contains, and for each candidate response of the set of candidate responses, score relevance between the candidate response and the context to obtain a comprehensive relevance score corresponding to the candidate response.
  • the chatbot may then provide, in the chat flow, a candidate response with the highest comprehensive relevance score among the set of candidate responses.
  • FIG. 3 illustrates an exemplary process 300 for obtaining a comprehensive relevance score according to an embodiment of the present disclosure.
  • the process 300 may be performed by, for example, the chatbot server 130 in FIG. 1.
  • a chatbot server may determine a context associated with the message, such as a context 302 in FIG. 3, which may include all received messages and sent responses in a session in which the message is located, such as utterances 302-1, 302-2, 302-3, ..., 302-n, wherein utterance 302-n may be the message currently obtained from the chat flow.
  • the chatbot server may also obtain a set of candidate responses 304 from a database that it connects with or it contains, such as a chatbot database 132 in FIG. 1, which may include a plurality of candidate responses, such as a candidate response 306.
  • the candidate response 306 is taken as an example to illustrate an exemplary process for obtaining a comprehensive relevance score of the candidate response 306.
  • the context 302 and the candidate response 306 may be provided to a transitional memory-based matching model 308.
  • the transitional memory-based matching model 308 may include, for example, an initial representation generating part 310, an interaction representation generation part 312, a matching part 314, and an aggregation part 316.
  • an initial representation of the context 302 and an initial representation of the candidate response 306 may be generated.
  • the initial representation refers to a representation generated based on a representation of each utterance in the context or the candidate response.
  • the representation of each utterance may include a semantic representation and/or an emotional representation.
  • the emotional representation may be generated based on a variety of approaches for characterizing emotional states.
  • the emotional states may be characterized through a Valence-Arousal (V-A) model.
  • FIG. 4 illustrates an exemplary V-A model 400 according to an embodiment of the present disclosure.
  • the V-A model 400 maps emotional features to a two-dimensional space, which is defined by two orthogonal dimensions such as valence and arousal.
  • the valence may represent the polarity of emotion, such as negative emotion and positive emotion, and indicate the degree by continuous values in the range of, for example, [-1, 0] and [0, 1], respectively.
  • the arousal may indicate the energy of emotion, and indicate the degree by a continuous value in the range of, for example, [0, 1]. Almost all human emotional states may be mapped to points defined in this two-dimensional space based on valence value-arousal value pairs (V-A pairs).
  • Four exemplary emotional states are shown in FIG. 4, such as "happy", "satisfied", "nervous", and "sad".
  • the emotional state "happy” may be mapped, for example, to point 402 in the V-A model 400, whose V-A pair is (0.8, 0.6).
  • the emotional state "satisfied” may be mapped, for example, to point 404 in the V-A model 400, whose V-A pair is (0.7, 0.4).
  • the emotional state "nervous” may be mapped, for example, to point 406 in the V-A model 400, whose V-A pair is (-0.3, 0.9).
  • the emotional state "sad” may be mapped, for example, to point 406 in the V-A model 408, whose V-A pair is (-0.8, 0.3).
  • the emotional states may also be characterized in other ways.
  • the emotional states may be characterized by a six-category method, that is, the emotional states are characterized by a probability distribution for six basic emotion types. These six basic types of emotion include, for example, anger, happiness, surprise, disgust, sadness, and fear.
  • the emotion representation according to an embodiment of the present disclosure may be based on any one of the approaches for characterizing emotional states, as illustrated by the sketch below.
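  • For illustration only, the sketch below shows how an emotional state might be stored under the two characterizations mentioned above: as a V-A pair or as a probability distribution over six basic emotion types. The class and function names are hypothetical; the numeric values follow the examples given for FIG. 4.

```python
# Hypothetical sketch of the two ways of characterizing an emotional state
# described above: a Valence-Arousal (V-A) pair, or a probability
# distribution over the six basic emotion types.

from dataclasses import dataclass
from typing import Dict

@dataclass
class VAState:
    valence: float  # polarity of emotion, roughly in [-1, 1]
    arousal: float  # energy of emotion, roughly in [0, 1]

# Example points from the V-A model of FIG. 4
EXAMPLES = {
    "happy":     VAState(0.8, 0.6),
    "satisfied": VAState(0.7, 0.4),
    "nervous":   VAState(-0.3, 0.9),
    "sad":       VAState(-0.8, 0.3),
}

def six_category(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw scores over anger, happiness, surprise, disgust, sadness, fear."""
    total = sum(scores.values())
    return {emotion: value / total for emotion, value in scores.items()}

print(EXAMPLES["happy"])
print(six_category({"anger": 0.05, "happiness": 0.70, "surprise": 0.10,
                    "disgust": 0.02, "sadness": 0.03, "fear": 0.10}))
```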
  • a context semantic initial representation may be generated. Based on an emotional representation of each utterance in the context 302, a context emotional initial representation may be generated. Based on a semantic representation of the candidate response 306, a candidate response semantic initial representation may be generated. Based on an emotional representation of the candidate response 306, a candidate response emotional initial representation may be generated. The specific process for generating the above initial representations will be explained later in conjunction with FIG. 5.
  • interaction representations of the context 302 and interaction representations of the candidate response 306 may be further generated.
  • an interaction representation refers to a representation generated based on information change between every two adjacent utterances among the context and/or the candidate response. Such information change may include semantic change and/or emotional change. Based on the semantic change between every two adjacent utterances among the context 302, a context semantic interaction representation may be generated. Based on the emotional change between every two adjacent utterances among the context 302, a context emotional interaction representation may be generated.
  • a candidate response semantic interaction representation may be generated.
  • a candidate response emotional interaction representation may be generated. The specific process for generating the above interaction representations will be explained later in conjunction with FIG. 6.
  • a matching process may be performed based on the generated initial representations and interaction representations.
  • Each of the initial representations and the interaction representations may include a semantic representation and an emotional representation.
  • the matching may include semantic matching and emotional matching.
  • the semantic matching may be performed between two semantic representations to obtain a semantic initial relevance representation and a semantic interaction relevance representation.
  • the specific process for the semantic matching will be explained later in conjunction with FIG. 7.
  • the emotional matching may be performed between two emotional representations to obtain an emotional initial relevance representation and an emotional interaction relevance representation.
  • the specific process for the emotional matching will be explained later in conjunction with FIG. 8.
  • After obtaining the semantic initial relevance representation, the semantic interaction relevance representation, the emotional initial relevance representation, and the emotional interaction relevance representation, these relevance representations may be aggregated at the aggregation part 316 to obtain a comprehensive relevance score 318.
  • the specific process for performing the aggregation will be explained later in conjunction with FIG. 9.
  • FIG. 5 illustrates an exemplary process 500 for generating initial representations according to an embodiment of the present disclosure.
  • the initial representations may include a semantic initial representation and an emotional initial representation, for example, context initial representations may include a context semantic initial representation and a context emotional initial representation, and candidate response initial representations may include a candidate response semantic initial representation and a candidate response emotional initial representation.
  • the processes for generating the semantic initial representations and the emotional initial representations are similar.
  • the process 500 may be performed on a context 502 and a candidate response 512.
  • the context 502 may correspond to the context 302 in FIG. 3.
  • the context 502 may include, for example, utterances 502-1, 502-2, 502-3, ..., 502-n, which may correspond to the utterances 302-1, 302-2, 302-3, ..., 302-n in FIG. 3, respectively.
  • the candidate response 512 may correspond to the candidate response 306 in FIG. 3.
  • Word vector sequences corresponding to the utterances 502-1, 502-2, 502-3, ..., 502-n, respectively, may be generated through embedding layers 504-1, 504-2, ..., 504-n.
  • the context 502 may be represented as {u_1, u_2, u_3, ..., u_n}, wherein u represents an utterance, and u_k represents the k-th utterance in the context 502, that is, utterance 502-k.
  • u_k may be represented as {e_{k,1}, e_{k,2}, ..., e_{k,m}}, wherein e_{k,j} represents a word vector of the j-th word in utterance 502-k, and m represents the number of words in utterance 502-k.
  • a word vector sequence corresponding to the candidate response 512 may be generated through an embedding layer 514.
  • word-level representations 508-1, 508-2, 508-3, ..., 508-n corresponding to utterances 502-1, 502-2, 502-3, ..., 502-n may be generated through attention mechanisms and feed-forward neural networks 506-1, 506-2, ..., 506-n, respectively.
  • a word-level representation 518 corresponding to the candidate response 512 may be generated through an attention mechanism and a feed-forward neural network 516.
  • a word-level representation 508-k corresponding to the utterance 502-k may be represented as U_k^self.
  • the word-level representation 518 corresponding to the candidate response 512 may be represented as R^self.
  • U_k^self and R^self may be obtained, for example, by formulas (1) and (2), i.e., by applying f_ATT to the word vector sequence of utterance 502-k and to the word vector sequence of the candidate response 512, respectively, wherein f_ATT(·) represents the output of an attention mechanism and a feed-forward neural network.
  • a context initial representation 510 may be generated through combining, such as cascading, the word-level representations 508-1, 508-2, 508-3, ..., 508-n.
  • the word-level representation 518 may be adopted as a candidate response initial representation 520.
  • Both a semantic initial representation and an emotional initial representation may be generated through the process 500 in FIG. 5.
  • For example, a context semantic initial representation, a context emotional initial representation, a candidate response semantic initial representation R_s^self, and a candidate response emotional initial representation R_e^self may be generated.
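  • A minimal PyTorch sketch of this initial-representation step is given below: each utterance is embedded, passed through a self-attention plus feed-forward block standing in for f_ATT, and the per-utterance word-level representations are concatenated into a context initial representation. The module names and hyper-parameters are assumptions for illustration, not the disclosure's exact architecture.

```python
import torch
import torch.nn as nn

class AttFFN(nn.Module):
    """Stand-in for f_ATT: self-attention followed by a feed-forward layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, query, key):
        attended, _ = self.attn(query, key, key)   # attention over the key sequence
        return self.ffn(attended)                  # word-level representation

dim, vocab = 64, 10000
embed = nn.Embedding(vocab, dim)     # embedding layer (504-x / 514)
f_att = AttFFN(dim)                  # attention + feed-forward (506-x / 516)

# Toy context of three utterances (word-id tensors) and one candidate response
utterances = [torch.randint(0, vocab, (1, 7)) for _ in range(3)]
response = torch.randint(0, vocab, (1, 5))

u_self = [f_att(embed(u), embed(u)) for u in utterances]   # U_k^self, each (1, m_k, dim)
r_self = f_att(embed(response), embed(response))           # R^self

context_init = torch.cat(u_self, dim=1)   # context initial representation (cascading)
print(context_init.shape, r_self.shape)
```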
  • FIG. 6 illustrates an exemplary process 600 for generating interaction representations according to an embodiment of the present disclosure.
  • the interaction representations may include semantic interaction representations and emotional interaction representations, for example, context interaction representations may include a context semantic interaction representation and a context emotional interaction representation, and candidate response interaction representations may include a candidate response semantic interaction representation and a candidate response emotional interaction representation.
  • the processes for generating a semantic interaction representation and an emotional interaction representation are similar.
  • word-level representations 602-1, 602-2, 602-3, ..., 602-n corresponding to respective utterances in a context 602 and a word-level representation 618 corresponding to a candidate response 616 may be obtained, wherein a word-level representation 602-k corresponds to utterance k in the context 602, i.e., u_k.
  • the context 602 may correspond to the context 502 in FIG. 5, and the word-level representations 602-1, 602-2, 602-3, ..., 602-n may correspond to the word-level representations 508-1, 508-2, 508-3, ..., 508-n in FIG. 5, respectively.
  • the candidate response 616 may correspond to the candidate response 512 in FIG. 5, and the word-level representation 618 may correspond to the word-level representation 518 in FIG. 5.
  • Sentence-level representations 606-1, 606-2, 606-3, ..., 606-n corresponding to the word-level representations 602-1, 602-2, 602-3, ..., 602-n, respectively, may be generated through recurrent neural networks and attention mechanisms 604-1, 604-2, ..., 604-n.
  • a sentence-level representation 622 corresponding to the word-level representation 618 may be generated through a recurrent neural network and an attention mechanism 620.
  • a sentence-level representation 606-k corresponding to utterance k in the context 602 may be represented as U_k^utter.
  • the sentence-level representation 622 corresponding to the candidate response 616 may be represented as R^utter.
  • the process for generating sentence-level representations U_k^utter and R^utter through recurrent neural networks and attention mechanisms may be represented, for example, by the following formulas:
  • H_{u,r}[i] = GRU(W^self[i], H_{u,r}[i - 1])   (3)
  • wherein GRU represents a Gated Recurrent Unit, W^self ∈ {U_k^self, R^self}, and H_{u,r} ∈ R^(m×d) represents a hidden state corresponding to the respective utterance in the context or the candidate response, wherein m represents the number of words in the corresponding utterance, and d represents a dimension.
  • an attention mechanism and average pooling may be performed on the hidden state H_{u,r} to obtain a sentence-level representation U_k^utter corresponding to the respective utterance u_k in the context and a sentence-level representation R^utter corresponding to the candidate response r, as shown in the following formulas:
  • U_k^utter = mean(f_ATT(H_{u_k}, H_{u_k}))   (4)
  • R^utter = mean(f_ATT(H_r, H_r))   (5)
  • wherein mean(·) represents average pooling.
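  • The sketch below assumes a plausible reading of formulas (3)-(5): a GRU runs over each word-level representation, and an attention layer followed by average pooling turns the hidden states into a single sentence-level vector (U_k^utter or R^utter). Dimensions and module choices are illustrative only.

```python
import torch
import torch.nn as nn

dim = 64
gru = nn.GRU(dim, dim, batch_first=True)                          # formula (3)
attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

def sentence_level(word_repr: torch.Tensor) -> torch.Tensor:
    """word_repr: (1, m, dim) word-level representation of one utterance or response."""
    hidden, _ = gru(word_repr)                  # H_{u,r}: (1, m, dim)
    attended, _ = attn(hidden, hidden, hidden)  # f_ATT(H, H)
    return attended.mean(dim=1)                 # mean pooling: (1, dim), formulas (4)/(5)

u_utter = sentence_level(torch.randn(1, 7, dim))   # U_k^utter
r_utter = sentence_level(torch.randn(1, 5, dim))   # R^utter
print(u_utter.shape, r_utter.shape)
```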
  • a difference between sentence-level representations of adjacent utterances among the context and the candidate response may be calculated based on U_k^utter and R^utter.
  • a difference 608-1 may be calculated based on a sentence- level representation 606-1 and required preceding information, wherein the difference 608-1 may reflect information change between utterance 1 in the context 602 and the required preceding information, and wherein the required preceding information may be initialized to zero;
  • a difference 608-2 may be calculated based on a sentence-level representation 606-2 and the sentence-level representation 606-1, wherein the difference 608-2 may reflect information change between utterance 2 and utterance 1 in the context 602;
  • a difference 608-3 may be calculated based on a sentence-level representation 606-3 and the sentence-level representation 606-2, wherein the difference 608-3 may reflect information change between utterance 3 and utterance 2 in the context 602; ...; by analogy, a difference 608-n may be calculated, which may reflect information change between utterance n and utterance n-1 in the context 602.
  • a difference 624 may be calculated based on the sentence-level representation 622 of the candidate response 616 and the sentence-level representation 606-n of utterance n, wherein the difference 624 may reflect information change between the candidate response 616 and utterance n.
  • a difference 608-k between the sentence-level representation U_k^utter and the sentence-level representation U_{k-1}^utter may be represented, for example, as T_k^local.
  • a difference 624 between the sentence-level representation R^utter of the candidate response and the sentence-level representation U_n^utter may be represented, for example, as T_r^local.
  • T_k^local and T_r^local may be calculated, for example, by formulas (6) and (7), wherein ReLU represents a Rectified Linear Unit, ⊙ represents element-wise multiplication, W_t and b_t are trainable parameters, and U_0^utter may be filled with zeros.
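  • The bodies of formulas (6) and (7) are not reproduced above, so the sketch below only assumes one plausible instantiation: the local transition T_k^local computed as a ReLU-activated linear transformation of the element-wise interaction between adjacent sentence-level representations, with U_0^utter filled with zeros.

```python
import torch
import torch.nn as nn

dim = 64
W_t = nn.Linear(dim, dim)   # trainable parameters W_t and b_t

def local_transition(curr_utter: torch.Tensor, prev_utter: torch.Tensor) -> torch.Tensor:
    """curr_utter, prev_utter: (1, dim) sentence-level representations; assumed form of (6)/(7)."""
    return torch.relu(W_t(curr_utter * prev_utter))   # '*' plays the role of element-wise ⊙

u_prev = torch.zeros(1, dim)     # U_0^utter filled with zeros
u_curr = torch.randn(1, dim)
t_local = local_transition(u_curr, u_prev)
print(t_local.shape)
```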
  • utterance interaction representations 612-1, 612-2, 612-3, ..., 612-n corresponding to respective utterances in the context and a candidate response interaction representation 626 corresponding to the candidate response may be generated based on these differences.
  • an utterance interaction representation 612-k corresponding to utterance k in the context 602 may be generated based on the differences between sentence-level representations of every two adjacent utterances among utterance k and the preceding utterances of utterance k, wherein the preceding utterances of utterance k may include the utterances before utterance k in the context 602.
  • an utterance interaction representation 612-3 corresponding to utterance 3 in the context 602 may be generated based on the differences 608-2 and 608-3;
  • an utterance interaction representation 612-n corresponding to utterance n may be generated based on the differences 608-2, 608-3, ..., 608-n.
  • a candidate response interaction representation 626 corresponding to the candidate response 616 may be generated based on the differences 608-2, 608-3, ..., 608-n and 624.
  • an utterance interaction representation generating operation 610 may integrate the respective differences through a Transitional Memory Network and by copying historical memories.
  • the memory is implemented by using a recurrent attention mechanism, wherein a feed-forward neural network may be used to transform utterance k into an input memory representation and an output memory representation, and to transform the candidate response into corresponding memory representations, as shown in formulas (8) and (9), wherein W_{in,out} and b_{in,out} are trainable parameters.
  • a global representation for utterance k' in the context and for the candidate response may be obtained, wherein when k' ∈ {1, 2, ..., n} it represents a global representation for utterance k', and otherwise it represents a global representation for the candidate response.
  • An utterance interaction representation for utterance k' and for the candidate response may be obtained, for example, by concatenating the global representation and the corresponding local difference, as shown in formula (12), wherein the local difference may correspond to T_k^local or T_r^local in formula (7).
  • when k' ∈ {1, 2, ..., n}, the result represents an utterance interaction representation for an utterance in the context.
  • the utterance interaction representation may reflect a difference in representation between utterance k' and all previous utterances before utterance k' in the current session, i.e., utterance 1 to utterance k'-1.
  • a context interaction representation 614 may be obtained by concatenating the utterance interaction representations 612-2, 612-3, ..., 612-n corresponding to the respective utterances in the context 602.
  • Both a semantic interaction representation and an emotional interaction representation may be generated through the process 600 in FIG. 6.
  • For example, a context semantic interaction representation, a context emotional interaction representation, a candidate response semantic interaction representation T_{s,r}, and a candidate response emotional interaction representation T_{e,r} may be generated.
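  • The transitional memory step (formulas (8)-(12)) is only partially legible above, so the sketch below assumes one plausible reading: local transitions are projected into input and output memory slots by feed-forward layers, a recurrent attention read over the historical memories yields a global representation, and the interaction representation concatenates the global and local parts.

```python
import torch
import torch.nn as nn

dim = 64
to_mem_in = nn.Linear(dim, dim)    # W_in, b_in: input memory representation, cf. formula (8)
to_mem_out = nn.Linear(dim, dim)   # W_out, b_out: output memory representation, cf. formula (9)

def interaction_repr(t_locals):
    """t_locals: list of (1, dim) local transitions, ending with the current turn's T^local."""
    mem_in = torch.cat([to_mem_in(t) for t in t_locals], dim=0)    # (k, dim)
    mem_out = torch.cat([to_mem_out(t) for t in t_locals], dim=0)  # (k, dim)
    query = t_locals[-1]                                           # current turn
    weights = torch.softmax(query @ mem_in.T, dim=-1)              # attention over history
    t_global = weights @ mem_out                                   # global representation
    return torch.cat([t_global, query], dim=-1)                    # concat global and local

t_locals = [torch.randn(1, dim) for _ in range(4)]
print(interaction_repr(t_locals).shape)   # (1, 2 * dim)
```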
  • the generation of the context interaction representation and the candidate response interaction representation considers the difference in representation between adjacent utterances among the context and the candidate response, and further considers the difference in representation between respective utterance in the context and the candidate response and preceding utterances of this utterance in the current session. Such differences may reflect information change during the session, such as semantic change and emotional change.
  • embodiments of the present disclosure propose to model a semantic flow and an emotional flow in the session, so that the semantic change and the emotional change in the session may be effectively tracked.
  • the context interaction representation and the candidate response interaction representation may then be used in subsequent matching and aggregation processes, and finally generate a comprehensive relevance score indicating relevance between the candidate response and the context. Since the generation of the context interaction representation and the candidate response interaction representation considers the semantic change and the emotional change between adjacent utterances among the context and the candidate response, such change will also be taken into account when generating the comprehensive relevance score; thereby, a calculated relevance score of a candidate response that is smoother and more natural relative to the context in terms of semantics and emotion will be higher.
  • FIG. 7 illustrates an exemplary process 700 for semantic matching according to an embodiment of the present disclosure.
  • a context semantic initial representation 704 and a context semantic interaction representation 706 corresponding to a context 702 may be obtained.
  • the context 702 may correspond to the context 302 in FIG. 3.
  • a candidate response semantic initial representation 710 and a candidate response semantic interaction representation 712 corresponding to a candidate response 708 may be obtained.
  • the candidate response 708 may correspond to the candidate response 306 in FIG. 3.
  • the candidate response semantic initial representation 710 and the candidate response semantic interaction representation 712 may be represented as R_s^self and T_{s,r}, respectively.
  • the context semantic initial representation 704 and the candidate response semantic initial representation 710 may be generated, for example, through the process 500 in FIG. 5, and the context semantic interaction representation 706 and the candidate response semantic interaction representation 712 may be generated, for example, through the process 600 in FIG. 6.
  • the context semantic initial representation 704 and the candidate response semantic initial representation 710 may be matched 714 to generate a semantic initial relevance representation 716.
  • the semantic initial relevance representation 716 may indicate relevance between the context semantic initial representation 704 and the candidate response semantic initial representation 710.
  • the generation of the semantic initial relevance representation 716 may be represented, for example, by matching formulas whose weight matrices and bias terms are trainable parameters.
  • the context semantic interaction representation 706 and the candidate response semantic interaction representation 712 may be matched 718 to generate a semantic interaction relevance representation 720.
  • the semantic interaction relevance representation 720 may indicate relevance between the context semantic interaction representation 706 and the candidate response semantic interaction representation 712.
  • the generation of the semantic interaction relevance representation 720 may be represented, for example, by matching formulas whose weight matrices W and bias terms are trainable parameters.
  • FIG. 8 illustrates an exemplary process 800 for emotional matching according to an embodiment of the present disclosure.
  • a context emotional initial representation 804 and a context emotional interaction representation 806 corresponding to a context 802 may be obtained.
  • the context 802 may correspond to the context 302 in FIG. 3.
  • a candidate response emotional initial representation 810 and a candidate response emotional interaction representation 812 corresponding to a candidate response 808 may be obtained.
  • the candidate response 808 may correspond to the candidate response 306 in FIG. 3.
  • the candidate response emotional initial representation 810 and the candidate response emotional interaction representation 812 may be represented as R_e^self and T_{e,r}, respectively.
  • the context emotional initial representation 804 and the candidate response emotional initial representation 810 may be generated, for example, through the process 500 in FIG. 5, and the context emotional interaction representation 806 and the candidate response emotional interaction representation 812 may be generated, for example, through the process 600 in FIG. 6.
  • the context emotional initial representation 804 and the candidate response emotional initial representation 810 may be matched 814 to generate an emotional initial relevance representation 816.
  • the emotional initial relevance representation 816 may indicate relevance between the context emotional initial representation 804 and the candidate response emotional initial representation 810.
  • the generation of the emotional initial relevance representation 816 may be represented, for example, by matching formulas whose weight matrices and bias terms are trainable parameters.
  • the context emotional interaction representation 806 and the candidate response emotional interaction representation 812 may be matched 818 to generate an emotional interaction relevance representation 820.
  • the emotional interaction relevance representation 820 may indicate relevance between the context emotional interaction representation 806 and the candidate response emotional interaction representation 812.
  • the generation of the emotional interaction relevance representation 820 may be represented, for example, by matching formulas whose weight matrices and bias terms are trainable parameters.
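  • The exact matching formulas for FIGs. 7 and 8 are not reproduced above. The sketch below therefore shows only a common instantiation, assumed for illustration: a context representation is matched against a candidate-response representation with cross-attention and a trainable projection. The same routine could serve both the semantic matching and the emotional matching, applied to initial or interaction representations.

```python
import torch
import torch.nn as nn

dim = 64
cross_attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
proj = nn.Linear(2 * dim, dim)   # trainable matching parameters

def match(context_repr: torch.Tensor, response_repr: torch.Tensor) -> torch.Tensor:
    """context_repr: (1, Lc, dim); response_repr: (1, Lr, dim) -> relevance repr (1, Lc, dim)."""
    ctx_aware, _ = cross_attn(context_repr, response_repr, response_repr)
    return torch.relu(proj(torch.cat([context_repr, ctx_aware], dim=-1)))

ctx = torch.randn(1, 21, dim)    # e.g. a context initial or interaction representation
resp = torch.randn(1, 5, dim)    # candidate response representation
print(match(ctx, resp).shape)
```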
  • FIG. 9 illustrates an exemplary process 900 for performing aggregation according to an embodiment of the present disclosure.
  • the process 900 may be performed by the aggregation part 316 in the transitional memory-based matching model 308 shown in FIG. 3.
  • a semantic initial relevance representation 902 and a semantic interaction relevance representation 904 in FIG. 9 may correspond to the semantic initial relevance representation 716 and the semantic interaction relevance representation 720 in FIG. 7, respectively, and an emotional initial relevance representation 920 and an emotional interaction relevance representation 922 in FIG. 9 may correspond to the emotional initial relevance representation 816 and the emotional interaction relevance representation 820 in FIG. 8, respectively.
  • the semantic initial relevance representation 902 may be processed by, for example, two layers of recurrent neural networks 906 and 908, wherein m represents the number of words in the corresponding utterance, k ∈ {1, 2, ..., n}, n represents the number of utterances in the context, the initial hidden states may be initialized to zero, and the resulting hidden states may be used for the subsequent relevance score calculating process.
  • the semantic interaction relevance representation 904 may be processed by a recurrent neural network 910, wherein k ∈ {1, 2, ..., n}, n represents the number of utterances in the context, and the resulting hidden states may be used for the subsequent relevance score calculating process.
  • the processed semantic initial relevance representation 902 and the processed semantic interaction relevance representation 904 may be combined, such as cascaded, to obtain a semantic relevance representation 914.
  • a semantic relevance score 918 may be generated based on the semantic relevance representation 914 through a trainable scoring function, wherein the involved weights and biases are trainable parameters.
  • the emotional initial relevance representation 920 may be processed by, for example, two layers of recurrent neural networks 924 and 926, wherein m represents the number of words in the corresponding utterance, k ∈ {1, 2, ..., n}, n represents the number of utterances in the context, the initial hidden states may be initialized to zero, and the resulting hidden states may be used for the subsequent relevance score calculating process.
  • the emotional interaction relevance representation 922 may be processed by a recurrent neural network 928, wherein k ∈ {1, 2, ..., n}, n represents the number of utterances in the context, and the resulting hidden states may be used for the subsequent relevance score calculating process.
  • the processed emotional initial relevance representation 920 and the processed emotional interaction relevance representation 922 may be combined, such as cascaded, to obtain an emotional relevance representation 932.
  • an emotional relevance score 936 may be generated based on the emotional relevance representation 932 through a trainable scoring function, wherein the involved weights and biases are trainable parameters.
  • the semantic relevance score 918 and the emotional relevance score 936 may be combined to obtain a comprehensive relevance score 940.
  • the comprehensive relevance score 940 may be represented, for example, as g.
  • the comprehensive relevance score 940 may correspond to the comprehensive relevance score 318 in FIG. 3.
  • the comprehensive relevance score 940 may be obtained by summing the semantic relevance score 918 and the emotional relevance score 936.
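  • A sketch of this aggregation step, under assumed layer shapes: the relevance representations are run through recurrent layers, reduced to scalar scores with a linear layer, and the semantic and emotional scores are summed into the comprehensive relevance score. In the described model the semantic and emotional channels would use separate parameters; here one set is reused for brevity.

```python
import torch
import torch.nn as nn

dim = 64
word_gru = nn.GRU(dim, dim, batch_first=True)    # first recurrent layer (word level)
utter_gru = nn.GRU(dim, dim, batch_first=True)   # second recurrent layer (utterance level)
inter_gru = nn.GRU(dim, dim, batch_first=True)   # recurrent layer for the interaction part
score_layer = nn.Linear(2 * dim, 1)              # trainable scoring parameters

def channel_score(initial_rel, interaction_rel):
    """initial_rel: (n, m, dim) per-utterance matching; interaction_rel: (1, n, dim)."""
    _, h_words = word_gru(initial_rel)                     # (1, n, dim): last state per utterance
    utter_seq = h_words.permute(1, 0, 2).reshape(1, -1, h_words.size(-1))
    _, h_init = utter_gru(utter_seq)                       # (1, 1, dim)
    _, h_inter = inter_gru(interaction_rel)                # (1, 1, dim)
    combined = torch.cat([h_init[0, -1], h_inter[0, -1]], dim=-1)   # cascade the two parts
    return score_layer(combined)                           # scalar relevance score

semantic_score = channel_score(torch.randn(4, 7, dim), torch.randn(1, 4, dim))
emotional_score = channel_score(torch.randn(4, 7, dim), torch.randn(1, 4, dim))
g = semantic_score + emotional_score                       # comprehensive relevance score
print(g)
```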
  • FIG. 10 illustrates an exemplary chat flow 1000a and associated emotional flow 1000b according to an embodiment of the present disclosure.
  • the chat flow 1000a may occur between a chatbot and a user.
  • the chatbot may output an utterance U1 "I like Taurus girls so much!".
  • an emotional state E_U1 of the utterance U1 may be, for example, (0.804, 0.673).
  • the user may enter an utterance U2 "Well, Scorpio boys always like Taurus girls. This is a fact."
  • An emotional state E_U2 of the utterance U2 may be, for example, (0.392, 0.616).
  • the chatbot may output an utterance U3 "But why can't I meet a Taurus girl who likes me?".
  • An emotional state E_U3 of the utterance U3 may be, for example, (-0.348, 0.647).
  • the user may enter an utterance U4 "Because your circle of friends is too narrow".
  • An emotional state E_U4 of the utterance U4 may be, for example, (-0.339, 0.599).
  • the position of each emotional state of the utterances U1 to U4 in the V-A model is shown in the emotion flow 1000b.
  • the chatbot may firstly determine a context associated with the utterance U4, which includes, for example, the utterances U1 to U4. The chatbot may then determine a response to be provided to the user from a set of candidate responses in a database that it connects with or contains. For example, the chatbot may calculate a comprehensive relevance score between each candidate response of the set of candidate responses and the context.
  • a block 1010 shows two exemplary candidate responses, that is, candidate response R1 "I will meet one" and candidate response R2 "Forget it, I'm Reason. Hahahaha".
  • An emotional state E_R1 of the candidate response R1 may be, for example, (-0.837, 0.882).
  • the emotional state E_R2 of the candidate response R2 may be, for example, (0.225, 0.670).
  • the comprehensive relevance score may be calculated, for example, through the process 300 in FIG. 3 in combination with the processes 500-900 in FIGs. 5-9. Since the calculation of the comprehensive relevance score considers semantic change and emotional change between adjacent utterances among the context and the candidate response, as well as between each utterance among the context and the candidate response and the preceding utterances of this utterance in the current session, a calculated relevance score of a candidate response that is smoother and more natural relative to the context in terms of semantics and emotion will be higher.
  • a relevance score S1 corresponding to the candidate response R1 with the emotional state of (-0.837, 0.882) may be 0.562
  • a relevance score S2 corresponding to the candidate response R2 with the emotional state of (0.225, 0.670) may be 0.114.
  • the relevance score S1 is higher than the relevance score S2, so the chatbot finally outputs the candidate response R1 "I will meet one" in the chat flow. It can also be seen from the emotional flow 1000b that compared with the candidate response R2, the emotional state of the candidate response R1 is smoother and more natural relative to the utterances U1 to U4.
  • FIG. 11 illustrates an exemplary process 1100 for training a transitional memory-based matching model according to an embodiment of the present disclosure.
  • a transitional memory-based matching model 1106 in FIG. 11 may correspond to the transitional memory-based matching model 308 in FIG. 3.
  • the transitional memory-based matching model 1106 may include an initial representation generating part 1108, an interaction representation generation part 1110, a matching part 1112, and an aggregation part 1114, which may correspond to the initial representation generating part 310, the interaction representation generation part 312, the matching part 314 and the aggregation part 316 in FIG. 3, respectively.
  • Training of the transitional memory-based matching model 1106 may be based on a corpus 1150.
  • the corpus 1150 may include a plurality of conversation-based training samples, such as [context c_1, candidate response r_1, relevance label y_1], etc.
  • Take a training sample i [context c_i, candidate response r_i, relevance label y_i] in the corpus 1150 as an example.
  • the context c_i 1102 and the candidate response r_i 1104 may be used as input to the transitional memory-based matching model 1106.
  • the transitional memory-based matching model 1106 may perform a scoring task on the relevance between the context c_i and the candidate response r_i, and output a comprehensive relevance score g(c_i, r_i) 1116.
  • the comprehensive relevance score may be calculated, for example, through the process 300 in FIG. 3 in combination with the processes 500-900 in FIGS. 5-9.
  • a prediction loss of the training sample i may be calculated as a binary cross-entropy loss, and a prediction loss corresponding to the scoring task is calculated by summing the prediction losses over all the training samples.
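  • As a sketch of this scoring-task loss, assuming the scores g(c_i, r_i) are raw (unnormalized) outputs and y_i ∈ {0, 1}, a summed binary cross-entropy could be computed as follows; the function names are from PyTorch, not from the disclosure.

```python
import torch
import torch.nn.functional as F

def scoring_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """scores: comprehensive relevance scores g(c_i, r_i); labels: relevance labels y_i."""
    return F.binary_cross_entropy_with_logits(scores, labels, reduction="sum")

scores = torch.tensor([2.3, -0.7, 1.1])   # g(c_i, r_i) for three training samples
labels = torch.tensor([1.0, 0.0, 1.0])    # relevance labels y_i
print(scoring_loss(scores, labels))
```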
  • Embodiments of the present disclosure propose to use a multi-task framework to utilize an additional emotion classification task to optimize emotional representations of a context and a candidate response, such as, the context emotional initial representation and the candidate response emotional initial representation generated through the initial representation generating part 310 in FIG. 3, and the context emotional interaction representation and the candidate response emotional interaction representation generated through the interaction representation generation part 312.
  • the additional emotion classification task may be performed in conjunction with the scoring task described with reference to FIG. 11.
  • a corpus that includes training data with emotional labels may be utilized to perform the additional emotion classification task.
  • the corpus may be a conversation corpus including a plurality of conversation-based training samples.
  • FIG. 12 illustrates an exemplary process 1200 for optimizing emotional representations with a conversation corpus according to an embodiment of the present disclosure.
  • a corpus 1250 for performing the additional emotion classification task to optimize emotional representations may include a plurality of conversation-based training samples, such as [context c_1, candidate response r_1, emotional label {z_{1,j}}], [context c_2, candidate response r_2, emotional label {z_{2,j}}], [context c_3, candidate response r_3, emotional label {z_{3,j}}], etc.
  • context c_i may include a set of conversation-based utterances.
  • candidate response r_i may be a candidate response for context c_i.
  • Different forms of the emotional label may be provided for different approaches for characterizing emotional states. For example, when using a six-category method to characterize emotional states, the emotional label for the emotional category j in the training sample i may be represented as z i,j ⁇ ⁇ 0,1 ⁇ .
  • a candidate response emotional initial representation 1206 corresponding to a candidate response r i 1204 may be generated.
  • the candidate response emotional initial representation 1206 may be generated, for example, through the initial representation generating part 310 in FIG. 3, and more specifically, through the process 500 in FIG. 5.
  • the candidate response emotional initial representation 1206 may be expressed as R_e^self, which may correspond to, for example, R^self in the above formula (2).
  • a candidate response emotional interaction representation 1210 corresponding to the candidate response r i may be generated based on the context c i 1202 and the candidate response r i 1204.
  • the candidate response emotional interaction representation 1210 may be generated, for example, through the interaction representation generation part 312 in FIG. 3, and more specifically, through the process 600 in FIG. 6.
  • the candidate response emotional interaction representation 1210 may be represented as T_e, which may, for example, correspond to T_{e,r} that may be calculated by the above formula (12).
  • the candidate response emotional initial representation 1206 processed by a pooling layer 1208 may be combined with the candidate response emotional interaction representation 1210 to obtain a candidate response emotional comprehensive representation.
  • a forward neural network 1214 may generate an emotional prediction result h(x_i) 1216 based on the candidate response emotional comprehensive representation, wherein a trainable parameter is used for linear transformation, mean(·) represents an average pooling function, and K is the number of emotion types; for example, K may be 6 when the six-category method is used to characterize emotional states.
  • a prediction loss of the training sample i may be calculated as a multi-class cross-entropy loss, and a prediction loss L_emo corresponding to the additional emotion classification task is calculated by summing the prediction losses over all the training samples, wherein K is the number of emotion types, and M is the number of training samples.
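  • The sketch below assumes a straightforward reading of this emotion-classification head and its loss: the pooled candidate response emotional initial representation is concatenated with the emotional interaction representation, projected to K emotion types to give h(x_i), and compared with the emotional labels via multi-class cross-entropy. Layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, K = 64, 6                       # K = 6 for the six-category method
emo_head = nn.Linear(2 * dim, K)     # trainable parameter for the linear transformation

def emotion_logits(resp_emo_init: torch.Tensor, resp_emo_inter: torch.Tensor) -> torch.Tensor:
    """resp_emo_init: (1, m, dim) word-level emotional repr; resp_emo_inter: (1, dim)."""
    pooled = resp_emo_init.mean(dim=1)                              # average pooling
    return emo_head(torch.cat([pooled, resp_emo_inter], dim=-1))    # h(x_i): (1, K)

logits = emotion_logits(torch.randn(1, 5, dim), torch.randn(1, dim))
target = torch.tensor([1])                                          # index of the labeled type
loss_emo = F.cross_entropy(logits, target, reduction="sum")         # summed over samples
print(loss_emo)
```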
  • a sentence corpus based on sentences may also be used to perform an additional emotion classification task to optimize emotional representations.
  • FIG. 13 illustrates an exemplary process 1300 for optimizing emotional representations with a sentence corpus according to an embodiment of the present disclosure.
  • a corpus 1350 in FIG. 13 may include a plurality of training samples, such as [utterance x_1, emotional label {z_{1,j}}], [utterance x_2, emotional label {z_{2,j}}], etc.
  • Different forms of an emotional label may be provided for different approaches for characterizing emotional states. For example, when using a six-category method to characterize emotional states, the emotional label for the emotional category j in the training sample i may be represented as z_{i,j} ∈ {0, 1}.
  • a word-level representation 1304 corresponding to an utterance x_i 1302 may be generated.
  • the word- level representation 1304 may be generated, for example, through the initial representation generating part 310 in FIG. 3, and more specifically, through the process 500 in FIG. 5.
  • a pooling layer 1306 and a forward neural network 1308 may process the word-level representation 1304 to obtain an emotion prediction result h(x i ) 1310.
  • a prediction loss corresponding to the additional emotion classification task may be calculated based on the emotional prediction result 1310 and the emotional label z_{i,j}.
  • the prediction loss of the training sample i may be calculated as a multi-class cross-entropy loss
  • the prediction loss corresponding to the additional emotion classification task is calculated by summing the prediction losses of all the training samples, as shown by the above formula (32).
  • performing the additional emotion classification task by using the conversation corpus described with reference to FIG. 12 and performing the additional emotion classification task by using the sentence corpus described with reference to FIG. 13 may be performed separately or together.
  • the prediction loss corresponding to the additional emotion classification task may be calculated based on both the prediction loss obtained by performing the additional emotion classification task by using the conversation corpus and the prediction loss obtained by performing the additional emotion classification task by using the sentence corpus.
  • the scoring task in FIG. 11 and the additional emotion classification task in FIG. 12 and / or FIG. 13 may be performed jointly.
  • a total prediction loss may be calculated through a weighted sum of the prediction loss corresponding to the scoring task and the prediction loss corresponding to the additional emotion classification task, wherein the weight is a hyper-parameter set by the system.
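  • The weighted-sum formula is likewise not reproduced; a minimal sketch of one common form, assuming L_r denotes the prediction loss of the scoring task and λ the system-set hyper-parameter (both symbols are assumptions), is:

```latex
L = L_{r} + \lambda \, L_{emo}
```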
  • a transitional memory-based matching model such as the transitional memory-based matching model 308 in FIG. 3, may be trained for a predetermined personality to obtain a chatbot with a predetermined personality.
  • a transitional memory-based matching model may be trained based on an emotional change range constraint between two adjacent utterances that is associated with a predetermined personality. For example, during the training process of the transitional memory-based matching model, a prediction loss associated with an emotional change range, such as an emotional change range between two adjacent utterances, may be added to the prediction loss function shown in the above formula (33), and a weight associated with this prediction loss may be set. This weight is a hyper-parameter set by the system, which may affect the proportion of the prediction loss associated with the emotional change range to the total prediction loss. If it is desired to train a chatbot with a large emotional change range, such as a chatbot with an emotional personality, the weight may be set to be small, so that the proportion of this prediction loss to the total prediction loss is small. On the contrary, if it is desired to train a chatbot with a small emotional change range, such as a chatbot with a quiet personality, the weight may be set to be large.
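  • As a minimal sketch of how the emotional change range between adjacent utterances might be measured from predicted Valence-Arousal pairs (the function name, the use of Euclidean distance in V-A space, and the example values are assumptions, not taken from the original):

```python
from typing import List, Tuple

def emotional_change_range(va_pairs: List[Tuple[float, float]]) -> float:
    """Average Valence-Arousal distance between adjacent utterances in a session.

    A larger value indicates a session whose emotional state fluctuates more.
    """
    if len(va_pairs) < 2:
        return 0.0
    total = 0.0
    for (v1, a1), (v2, a2) in zip(va_pairs, va_pairs[1:]):
        total += ((v2 - v1) ** 2 + (a2 - a1) ** 2) ** 0.5
    return total / (len(va_pairs) - 1)

# Example: a calm session vs. an emotional one.
calm = [(0.6, 0.3), (0.55, 0.35), (0.6, 0.4)]
emotional = [(0.8, 0.6), (-0.8, 0.3), (0.7, 0.4)]
print(emotional_change_range(calm))       # small range
print(emotional_change_range(emotional))  # large range
```

  During training, such a value, weighted by the system-set hyper-parameter, could be added to the total prediction loss described above.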
  • Emotional states may also be affected by external factors such as weather, health condition, whether a good thing happened, whether a bad thing happened, etc. For example, if a speaker is sick or the weather is bad, he may be down even if he hears good news; while if a speaker is healthy or the weather is good, he may be calm even if he hears bad news.
  • An embodiment of the present disclosure proposes that when providing a response, not only a context in a chat flow, but also external factors that affect an emotional state of a chatbot may be considered.
  • an additional emotional representation corresponding to an external factor may be generated and inserted among a set of word-level representations corresponding to a set of utterances in a context of a chat flow, thereby affecting subsequent relevance score generating and further affecting the selection of a candidate response.
  • FIG. 14 illustrates an exemplary process 1400 for generating an additional emotional representation according to an embodiment of the present disclosure.
  • an external factor 1402 that affects an emotional state of a chatbot may be identified, such as weather, health condition, whether a good thing happened, whether a bad thing happened, etc.
  • External factors such as weather may be related to actual conditions, such as the actual weather conditions of the day, and may be obtained through other applications.
  • External factors such as health condition, whether a good thing happened, and whether a bad thing happened may be manually defined or automatically defined by the system.
  • the external factor 1402 may be mapped to an emotional state 1406 corresponding to the external factor 1402 through a predefined function.
  • the emotional state 1406 may be, for example, a V-A pair.
  • an additional emotional representation 1410 may be generated based on the emotional state 1406.
  • a generated emotional representation corresponding to an external factor is referred to as an additional emotional representation.
  • a forward neural network 1408 may generate an additional emotional representation 1410 by converting the emotional state 1406 into a valence vector and an arousal vector, and combining the valence vector and the arousal vector.
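  • A minimal sketch of this mapping, assuming a hand-crafted lookup from external factors to V-A pairs and randomly initialized projection weights standing in for the forward neural network 1408 (the lookup values, dimensions, and names are hypothetical):

```python
import numpy as np

# Hypothetical predefined mapping from external factors to V-A pairs.
FACTOR_TO_VA = {
    "good weather": (0.7, 0.4),
    "bad weather": (-0.5, 0.3),
    "sick": (-0.6, 0.2),
    "good thing happened": (0.8, 0.6),
}

rng = np.random.default_rng(0)
dim = 8  # dimension of the additional emotional representation (assumed)
W_v = rng.normal(size=(dim, 1))  # projects the valence value to a valence vector
W_a = rng.normal(size=(dim, 1))  # projects the arousal value to an arousal vector

def additional_emotional_representation(factor: str) -> np.ndarray:
    """Map an external factor to an additional emotional representation."""
    valence, arousal = FACTOR_TO_VA[factor]
    valence_vec = np.tanh(W_v @ np.array([[valence]]))
    arousal_vec = np.tanh(W_a @ np.array([[arousal]]))
    # Combine the valence vector and the arousal vector, e.g. by concatenation.
    return np.concatenate([valence_vec, arousal_vec]).ravel()

print(additional_emotional_representation("good weather").shape)  # (16,)
```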
  • FIG. 15 illustrates an exemplary process 1500 for inserting an additional emotional representation according to an embodiment of the present disclosure.
  • a set of word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n corresponding to utterances 1502-1, 1502-2, 1502-3, ..., 1502-n, respectively, in a context 1502 may be obtained.
  • the word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n may be generated through the process 500 in FIG. 5.
  • an additional emotional representation 1506 generated, for example, through the process 1400 of FIG. 14 may be inserted before a representation of a first utterance of a current session, that is, before the word-level representation 1504-1.
  • the additional emotional representation 1506 may be inserted before a word-level representation of the current utterance, that is, before the word-level representation 1504-n.
  • An updated context initial representation 1508 may be generated based on the word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n and the additional emotional representation 1506.
  • the updated context initial representation 1508 may be generated through cascading the word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n and the additional emotional representation 1506.
  • An updated context interaction representation 1510 may also be generated based on the word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n and the additional emotional representation 1506.
  • the word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n and the additional emotional representation 1506, along with a word-level representation 1514 of a candidate response 1512, may also be used to generate an updated response interaction representation 1516.
  • the updated context interaction representation 1510 and the updated response interaction representation 1516 may be generated through the process 600 in FIG. 6.
  • the generation of the updated context initial representation 1508, the updated context interaction representation 1510, and the updated response interaction representation 1516 considers an additional emotional representation corresponding to an external factor. These updated representations may then be used in a subsequent matching process, such as the process 800 in FIG. 8, and a subsequent aggregation process, such as the process 900 in FIG. 9, to ultimately obtain a comprehensive relevance score. Because the additional emotional representation is taken into account when generating the comprehensive relevance score, a calculated relevance score for a candidate response that is consistent with the emotional state of the additional emotional representation will be higher.
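  • A minimal sketch of the insertion and cascading steps described above, assuming each word-level representation is a matrix with one row per word and that the additional emotional representation is a single row (all names and shapes are hypothetical):

```python
import numpy as np

def updated_context_initial_representation(
    word_level_reps: list,          # one (num_words, dim) np.ndarray per utterance
    additional_rep: np.ndarray,     # (1, dim) additional emotional representation
    insert_before_current: bool = False,
) -> np.ndarray:
    """Insert the additional emotional representation and cascade into one matrix.

    By default the additional representation is placed before the first utterance
    of the current session; alternatively it may be placed before the current
    utterance, i.e. last in the list.
    """
    reps = list(word_level_reps)
    position = len(reps) - 1 if insert_before_current else 0
    reps.insert(position, additional_rep)
    return np.concatenate(reps, axis=0)
```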
  • a basic emotional state of a chatbot may also be determined based on external factors. For example, when an external factor is "good weather”, the basic emotional state of the chatbot may be determined as "high mood”; while when the external factor is "bad weather”, the basic emotional state of the chatbot may be determined as "low mood”. Then, a threshold corresponding to the basic emotional state may be set for each candidate response. In some embodiments, only a valence threshold may be set. Taking a candidate response "ha-ha” as an example, the valence threshold corresponding to "high mood” may be "0.1", while a valence threshold corresponding to "low mood” may be "0.8", for example.
  • When the basic emotional state determined based on the external factors is "high mood", the candidate response "ha-ha" may be provided as long as the predicted valence value of the chatbot's emotional state is greater than "0.1"; while when the basic emotional state determined based on the external factors is "low mood", the candidate response "ha-ha" may be provided only when the predicted valence value is greater than "0.8".
  • the emotional state of the chatbot may also be adapted according to the determined basic emotional state.
  • when the basic emotional state is "high mood", the valence value of the emotional state of the chatbot predicted according to the context in the session may be increased, for example, multiplied by a coefficient greater than 1; when the basic emotional state is "low mood", the valence value of the emotional state of the chatbot predicted according to the context in the session may be reduced, for example, multiplied by a coefficient less than 1.
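  • A minimal sketch combining the threshold check and the valence adaptation described above (the coefficients 1.2 and 0.8 and the function names are assumptions; the thresholds follow the "ha-ha" example):

```python
def adapt_valence(predicted_valence: float, basic_emotional_state: str) -> float:
    """Scale the valence predicted from the session context by the basic emotional state."""
    coefficient = 1.2 if basic_emotional_state == "high mood" else 0.8
    return max(-1.0, min(1.0, predicted_valence * coefficient))

def may_provide(candidate: str, predicted_valence: float, basic_emotional_state: str) -> bool:
    """Check a candidate-specific valence threshold for the basic emotional state."""
    thresholds = {"ha-ha": {"high mood": 0.1, "low mood": 0.8}}
    threshold = thresholds.get(candidate, {}).get(basic_emotional_state, 0.0)
    return adapt_valence(predicted_valence, basic_emotional_state) > threshold

print(may_provide("ha-ha", 0.5, "high mood"))  # True
print(may_provide("ha-ha", 0.5, "low mood"))   # False
```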
  • the foregoing describes different ways in which the chatbot may consider external factors that affect emotional states when providing responses. These ways may make the emotional states of responses provided throughout the session consistent with the basic emotional state determined by the external factors. It is to be understood that the foregoing ways are merely exemplary, and the embodiments of the present disclosure are not limited thereto; the emotional states of responses provided by the chatbot may be made consistent with the basic emotional state determined by the external factors in any other way.
  • a transitional memory-based matching model may support multi-modality inputs.
  • Each utterance that is an input of a transitional memory-based matching model may employ at least one of the following modalities: text, voice, facial expressions, and gestures.
  • a microphone on the terminal device may capture voice
  • speech recognition software may convert the voice into text
  • the user may directly enter text.
  • a camera on the terminal device may capture the user's facial expressions, body gestures, and hand gestures. Inputs of different modalities for a particular utterance may be converted into corresponding representations.
  • the early-fusion strategy refers to combining representations of various modality inputs for each utterance into a comprehensive representation of the utterance, and then generating a context initial representation and a context interaction representation based on the comprehensive representation of the utterance and the comprehensive representations of other utterances.
  • the late-fusion strategy refers to using representations of various modality inputs of each utterance to generate intermediate initial representations and intermediate interaction representations in respective modalities, and then generating a context initial representation and a context interaction representation by combining the generated intermediate initial representations and intermediate interaction representations, respectively.
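  • A minimal sketch contrasting the two strategies, assuming each utterance is given as a list of per-modality representation vectors and that combination is done by concatenation (the helper names and the build_representation placeholder are hypothetical, not taken from the original):

```python
import numpy as np

def early_fusion(utterances: list) -> list:
    """Early fusion: combine the modality representations of each utterance first.

    The fused per-utterance vectors are then fed to a single initial/interaction
    representation generator.
    """
    return [np.concatenate(modality_reps) for modality_reps in utterances]

def late_fusion(utterances: list, build_representation) -> np.ndarray:
    """Late fusion: build an intermediate representation per modality first, then combine.

    build_representation stands in for the per-modality process that generates an
    intermediate initial (or interaction) representation from one representation
    per utterance; missing modality inputs are assumed to be zero-initialized upstream.
    """
    num_modalities = len(utterances[0])
    per_modality = []
    for m in range(num_modalities):
        per_modality.append(build_representation([u[m] for u in utterances]))
    return np.concatenate(per_modality)
```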
  • FIG. 16 illustrates an exemplary process 1600 for combining multi-modality inputs through an early-fusion strategy according to an embodiment of the present disclosure.
  • an utterance 1 1602 may have, for example, a modality 1 input 1602-1, a modality 2 input 1602-2, ..., a modality m input 1602-m. These inputs may be converted into corresponding representations, such as a representation 1 of utterance 1 1604-1, a representation 2 of utterance 1 1604-2, ..., a representation m of utterance 1 1604-m.
  • an utterance 2 1606 may, for example, have a modality 1 input 1606-1, a modality 2 input 1606-2, ..., a modality m input 1606-m. These inputs may be converted into corresponding representations, such as a representation 1 of utterance 2 1608-1, a representation 2 of utterance 2 1608-2, ..., a representation m of utterance 2 1608-m. It is to be understood that although it is shown in FIG. 16 that both utterance 1 and utterance 2 have m modality inputs, the number of modality inputs that utterance 1 and utterance 2 have may be less than m. When a certain modality input is absent, the modality input and the corresponding representation may be initialized to zero.
  • the representation 1 of utterance 1 1604-1, the representation 2 of utterance 1 1604-2, ..., the representation m of utterance 1 1604-m may be combined together to generate a comprehensive representation of utterance 1 1610.
  • the representation 1 of utterance 2 1608-1, the representation 2 of utterance 2 1608-2, ..., the representation m of utterance 2 1608-m may be combined together to generate a comprehensive representation of utterance 2 1612.
  • a context initial representation 1614 and a context interaction representation 1616 may be generated based on the comprehensive representation of utterance 1 1610, the comprehensive representation of utterance 2 1612, and possible comprehensive representations (not shown) of other utterances.
  • the context initial representation 1614 and the context interaction representation 1616 may be generated, for example, through the process 500 in FIG. 5 and the process 600 in FIG. 6 respectively.
  • the context initial representation 1614 and the context interaction representation 1616 may be used in subsequent matching and aggregation processes, and finally engage in generating a comprehensive relevance score indicating relevance between a candidate response and a context.
  • FIG. 17 illustrates an exemplary process 1700 for combining multi-modality inputs through a late-fusion strategy according to an embodiment of the present disclosure.
  • a transitional memory-based matching model may support m modality inputs.
  • an utterance 1 may have, for example, a modality 1 input of utterance 1 1702-1, a modality 2 input of utterance 1 1702-2, ..., a modality m input of utterance 1 1702-m.
  • These inputs may be converted into corresponding representations, respectively, such as a representation 1 of utterance 1 1704-1, a representation 2 of utterance 1 1704-2, ..., a representation m of utterance 1 1704-m.
  • an utterance 2 1706 may have, for example, a modality 1 input of utterance 2 1706-1, a modality 2 input of utterance 2 1706-2, ..., a modality m input of utterance 2 1706-m.
  • These inputs may be converted into corresponding representations, respectively, such as a representation 1 of utterance 2 1708-1, a representation 2 of utterance 2 1708-2, ..., a representation m of utterance 2 1708-m.
  • although it is shown in FIG. 17 that both utterance 1 and utterance 2 have m modality inputs, the number of modality inputs that utterance 1 and utterance 2 have may be less than m. When a certain modality input is absent, the modality input and the corresponding representation may be initialized to zero.
  • a representation of each modality input of each utterance may be used to generate an intermediate initial representation and an intermediate interaction representation in the respective modality.
  • an intermediate initial representation corresponding to modality 1 1710-1 and an intermediate interaction representation corresponding to modality 1 1712-1 may be generated based on the representation 1 of utterance 1 1704-1, the representation 1 of utterance 2 1708-1, and representations of possible other utterances corresponding to modality 1 (not shown);
  • an intermediate initial representation corresponding to modality 2 1710-2 and an intermediate interaction representation corresponding to modality 2 1712-2 may be generated based on the representation 2 of utterance 1 1704-2, the representation 2 of utterance 2 1708-2, and representations of possible other utterances corresponding to modality 2 (not shown);
  • an intermediate initial representation corresponding to modality m 1710-m and an intermediate interaction representation corresponding to modality m 1712-m may be generated based on the representation m of utterance 1 1704-m, the representation m of utterance 2 1708-m, and representations of possible other utterances corresponding to modality m (not shown).
  • the intermediate initial representations 1710-1, 1710-2, ..., 1710-m may be generated, for example, through a process similar to the process 500 in FIG. 5 that is used to generate the context initial representation, and the intermediate interaction representations 1712-1, 1712-2, ..., 1712-m may be generated, for example, through a process similar to the process 600 in FIG. 6 that is used to generate the context interaction representation.
  • a context initial representation 1714 may be generated through combining the intermediate initial representation 1710-1, the intermediate initial representation 1710-2, ..., the intermediate initial representation 1710-m.
  • a context interaction representation 1716 may be generated through combining the intermediate interaction representation 1712-1, the intermediate interaction representation 1712-2, ..., the intermediate interaction representation 1712-m.
  • the context initial representation 1714 and the context interaction representation 1716 may be used in subsequent matching and aggregation processes, and finally engage in generating a comprehensive relevance score indicating relevance between a candidate response and a context.
  • a context initial relevance representation and a context interaction relevance representation may be obtained by first using a representation of each modality input of each utterance to generate an intermediate initial relevance representation and an intermediate interaction relevance representation in the respective modality, and then combining the generated intermediate initial relevance representations and intermediate interaction relevance representations, respectively.
  • the context initial relevance representation and the context interaction relevance representation may engage in generating a comprehensive relevance score indicating relevance between the candidate response and the context.
  • a chatbot may present the response based on an emotional state of the selected candidate response.
  • the chatbot may express, in a corresponding manner, the emotional state of the selected candidate response based on a modality of the response. For example, in the case that the response is in voice, when its emotional state is "happy", the chatbot may present the response with a fast speech rate or a high tone.
  • the emotional state of the response may be expressed by additionally providing other multi-modality signals, for example, by facial expressions, body gestures, or hand gestures, etc. of the chatbot.
  • a corresponding light may be provided at the same time to express the emotional state of the response.
  • FIG. 18 illustrates an exemplary scenario 1800 for expressing emotional states of response through light according to an embodiment of the present disclosure.
  • This scenario may happen between a user and a smart speaker.
  • the smart speaker may be equipped with a chatbot implemented according to the embodiments of the present disclosure.
  • the smart speaker may respond to the user's voice input by providing a voice response and corresponding light.
  • the user may say "So annoying!”.
  • the smart speaker may reply by providing a voice response: "Cheer up! I still like to see you laugh.”
  • the emotional state of the voice response at 1804 may have a relatively positive valence, for example, a valence value of "0.9", so the light provided in association with it may have a strong brightness.
  • the user may then say “But I don't want to laugh now.”
  • the smart speaker may reply by providing a voice response: "You should learn to laugh. People can do it."
  • the emotional state of the voice response at 1808 may have a generally positive valence, for example, a valence value of "0.6", so the light provided in association with it may have a weak brightness.
  • the user may continue to say “I can't do it.”
  • the smart speaker may reply by providing a voice response: "Let me make you happy!”
  • the emotional state of the voice response at 1812 may have a relatively positive valence, for example, a valence value of "0.9", so the light provided in association with it may have a strong brightness.
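  • A minimal sketch of one possible valence-to-brightness mapping consistent with the example above, in which a valence of "0.9" yields stronger light than a valence of "0.6" (the linear mapping itself is an assumption, not taken from the original):

```python
def brightness_from_valence(valence: float) -> float:
    """Map a valence in [-1, 1] to a light brightness in [0, 1]."""
    return max(0.0, min(1.0, (valence + 1.0) / 2.0))

print(brightness_from_valence(0.9))  # 0.95 -> strong brightness
print(brightness_from_valence(0.6))  # 0.80 -> weaker brightness
```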
  • FIG. 18 shows an example for expressing different emotional states of a response through different light brightness. It is to be understood that the embodiments of the present disclosure are not limited thereto; for example, in the case of expressing emotional states through light, emotional states of responses may also be expressed through the color, duration, etc. of the light. In addition, the emotional states of the responses may be expressed by any other multi-modality signals.
  • According to an embodiment of the present disclosure, a selection of a candidate response may be based on semantic relevance and emotional relevance between a candidate response and a context. When determining the semantic relevance and the emotional relevance, messages received and responses sent by a chatbot are collectively considered as utterances in the context, and no distinction is made between the received messages and the sent responses.
  • In this way, the chatbot may share emotional states with a user and achieve empathy between the chatbot and the user. Further, the chatbot may drive the user's emotional state in the direction of positive valence by providing a more positive response, such as a response with a higher valence value, thereby guiding the user to obtain an emotional state with a positive valence before the end of the session.
  • FIG. 19 is a flowchart of an exemplary method 1900 for providing a response in automated chatting according to an embodiment of the present disclosure.
  • a message may be obtained in a chat flow.
  • a context associated with the message may be determined, the context comprising a set of utterances, the set of utterances comprising the message.
  • for each candidate response of a set of candidate responses, the candidate response may be scored based at least on information change between adjacent utterances among the set of utterances and the candidate response.
  • a highest-scored candidate response among the set of candidate responses may be provided in the chat flow.
  • the information change may comprise at least one of semantic change and emotional change.
  • the scoring may comprise at least one of: generating a semantic relevance score for the candidate response based at least on the semantic change between adjacent utterances among the set of utterances and the candidate response; and generating an emotional relevance score for the candidate response based at least on the emotional change between adjacent utterances among the set of utterances and the candidate response.
  • the scoring may comprise: generating a comprehensive relevance score for the candidate response based on the semantic relevance score and the emotional relevance score.
  • the scoring may comprise: generating a context interaction representation corresponding to the context based on information change between every two adjacent utterances of the set of utterances; generating a candidate response interaction representation corresponding to the candidate response based on information change between every two adjacent utterances among the set of utterances and the candidate response; obtaining an interaction relevance representation through matching the context interaction representation with the candidate response interaction representation; and generating a relevance score for the candidate response based at least on the interaction relevance representation.
  • the scoring may further comprise: generating a context initial representation corresponding to the context based on a representation of each utterance of the set of utterances; generating a candidate response initial representation corresponding to the candidate response; obtaining an initial relevance representation through matching the context initial representation with the candidate response initial representation; and generating a relevance score for the candidate response based on a combination of the initial relevance representation and the interaction relevance representation.
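  • A minimal sketch of the overall control flow implied by the two items above, with every representation, matching, and aggregation step left as an injected placeholder (all function names here are hypothetical, not taken from the original):

```python
def score_candidate(context, candidate, initial_rep, interaction_rep, match, aggregate) -> float:
    """Score one candidate response against the context.

    initial_rep, interaction_rep, match and aggregate are placeholders for the
    initial-representation, interaction-representation, matching and aggregation
    steps described above.
    """
    context_initial = initial_rep(context)
    response_initial = initial_rep([candidate])
    context_interaction = interaction_rep(context)
    response_interaction = interaction_rep(context + [candidate])
    initial_relevance = match(context_initial, response_initial)
    interaction_relevance = match(context_interaction, response_interaction)
    return aggregate(initial_relevance, interaction_relevance)

def provide_response(context, candidates, scorer) -> str:
    """Provide the highest-scored candidate response in the chat flow."""
    return max(candidates, key=lambda candidate: scorer(context, candidate))
```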
  • the information change may comprise semantic change
  • the context interaction representation may include a context semantic interaction representation
  • the candidate response interaction representation may include a candidate response semantic interaction representation
  • the interaction relevance representation may include a semantic interaction relevance representation
  • the context initial representation may include a context semantic initial representation
  • the candidate response initial representation may include a candidate response semantic initial representation
  • the initial relevance representation may include a semantic initial relevance representation
  • the relevance score may be a semantic relevance score.
  • the information change may comprise emotional change
  • the context interaction representation may include a context emotional interaction representation
  • the candidate response interaction representation may include a candidate response emotional interaction representation
  • the interaction relevance representation may include an emotional interaction relevance representation
  • the context initial representation may include a context emotional initial representation
  • the candidate response initial representation may include a candidate response emotional initial representation
  • the initial relevance representation may include an emotional initial relevance representation
  • the relevance score may be an emotional relevance score.
  • the method 1900 may further comprise: identifying external factors that affect emotional states; and adding the external factors into the context.
  • at least one utterance of the set of utterances may employ at least one of the following modalities: text, voice, facial expressions, and gestures.
  • the method 1900 may further comprise: presenting the highest-scored candidate response based on an emotional state of the candidate response.
  • the scoring may be performed through a transitional memory-based matching model, the transitional memory-based matching model being optimized through an additional emotion classification task during a training process.
  • the scoring may be performed through a transitional memory-based matching model, the transitional memory-based matching model being trained based on an emotional change range constraint between two adjacent utterances that is associated with a predetermined personality.
  • the method 1900 may further comprise any steps/processes for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above.
  • FIG. 20 illustrates an exemplary apparatus 2000 for providing a response in automated chatting according to an embodiment of the present disclosure.
  • the apparatus 2000 may comprise: a message obtaining module 2010, for obtaining a message in a chat flow; a context determining module 2020, for determining a context associated with the message, the context comprising a set of utterances, the set of utterances comprising the message; a scoring module 2030, for scoring, for each candidate response of a set of candidate responses, the candidate response based at least on information change between adjacent utterances among the set of utterances and the candidate response; and a response providing module 2040, for providing a highest-scored candidate response among the set of candidate responses in the chat flow.
  • the information change may comprise at least one of semantic change and emotional change.
  • the scoring module 2030 may be further configured for performing at least one of: generating a semantic relevance score for the candidate response based at least on the semantic change between adjacent utterances among the set of utterances and the candidate response; and generating an emotional relevance score for the candidate response based at least on the emotional change between adjacent utterances among the set of utterances and the candidate response.
  • the apparatus 2000 may further comprise: an external factor identifying module, for identifying external factors that affect emotional states; and an external factor adding module, for adding the external factors into the context.
  • the scoring module 2030 may comprise a transitional memory-based matching model, the transitional memory-based matching model being optimized through an additional emotion classification task during a training process.
  • the scoring module 2030 may comprise a transitional memory-based matching model, the transitional memory-based matching model being trained based on an emotional change range constraint between two adjacent utterances that is associated with a predetermined personality.
  • the apparatus 2000 may further comprise any other modules configured for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above.
  • FIG. 21 illustrates an exemplary apparatus 2100 for providing a response in automated chatting according to an embodiment of the present disclosure.
  • the apparatus 2100 may comprise at least one processor 2110.
  • the apparatus 2100 may further comprise a memory 2120 coupled with the processor 2110.
  • the memory 2120 may store computer-executable instructions that, when executed, cause the processor 2110 to perform any operations of the method for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above.
  • the embodiments of the present disclosure may be embodied in a non- transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
  • processors are described in connection with various apparatus and methods. These processors can be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), state machine, gate logic, discrete hardware circuitry, and other suitable processing components configured to perform the various functions described in this disclosure.
  • the functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a microcontroller, or other suitable processing components.
  • Software should be considered broadly to represent instructions, instruction sets, code, code segments, program code, programs, subroutines, software modules, applications, software applications, software packages, routines, subroutines, objects, running threads, processes, functions, and the like. Software can reside on computer readable medium.
  • Computer readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a random access memory (RAM), a read only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk.
  • Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, the memory may also be internal to the processor (e.g., a cache or a register).

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Information Transfer Between Computers (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a method and apparatus for providing a response in automated chatting. A message may be obtained in a chat flow. A context associated with the message may be determined, the context comprising a set of utterances, the set of utterances comprising the message. For each candidate response of a set of candidate responses, the candidate response may be scored based at least on information change between adjacent utterances among the set of utterances and the candidate response. A highest-scored candidate response among the set of candidate responses may be provided in the chat flow.

Description

PROVIDING A RESPONSE IN AUTOMATED CHATTING
BACKGROUND
[0001] Artificial intelligence (AI) chatbots are becoming more and more popular and are being used in more and more scenarios. Chatbots are designed to simulate human utterances and may chat with users through text, voice, images, etc. In general, a chatbot may identify language content within a message entered by a user or apply natural language processing to a message, and then provide the user with a response to the message.
SUMMARY
[0002] This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
[0003] Embodiments of the present disclosure provides a method and apparatus for providing a response in automated chatting. A message may be obtained in a chat flow. A context associated with the message may be determined, the context comprising a set of utterances, the set of utterances comprising the message. For each candidate response of a set of candidate responses, the candidate response may be scored based at least on information change between adjacent utterances among the set of utterances and the candidate response. A highest-scored candidate response among the set of candidate responses may be provided in the chat flow.
[0004] It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.
BRIEF DESCRIPTION OF THE DRAWINGS [0005] The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects. [0006] FIG. 1 illustrates an exemplary application scenario of a chatbot according to an embodiment of the present disclosure.
[0007] FIG. 2 illustrates an exemplary chat window according to an embodiment of the present disclosure.
[0008] FIG. 3 illustrates an exemplary process for obtaining a comprehensive relevance score according to an embodiment of the present disclosure.
[0009] FIG. 4 illustrates an exemplary Valence-Arousal model according to an embodiment of the present disclosure.
[0010] FIG. 5 illustrates an exemplary process for generating initial representations according to an embodiment of the present disclosure.
[0011] FIG. 6 illustrates an exemplary process for generating interaction representations according to an embodiment of the present disclosure.
[0012] FIG. 7 illustrates an exemplary process for semantic matching according to an embodiment of the present disclosure.
[0013] FIG. 8 illustrates an exemplary process for emotional matching according to an embodiment of the present disclosure.
[0014] FIG. 9 illustrates an exemplary process for performing aggregation according to an embodiment of the present disclosure.
[0015] FIG. 10 illustrates an exemplary chat flow and an associated emotional flow according to an embodiment of the present disclosure.
[0016] FIG. 11 illustrates an exemplary process for training a transitional memory- based matching model according to an embodiment of the present disclosure.
[0017] FIG. 12 illustrates an exemplary process for optimizing emotional representation with a conversation corpus according to an embodiment of the present disclosure.
[0018] FIG. 13 illustrates an exemplary process for optimizing emotional representations with a sentence corpus according to an embodiment of the present disclosure.
[0019] FIG. 14 illustrates an exemplary process for generating an additional emotional representation according to an embodiment of the present disclosure.
[0020] FIG. 15 illustrates an exemplary process for inserting an additional emotional representation according to an embodiment of the present disclosure.
[0021] FIG. 16 illustrates an exemplary process for combining multi- modality inputs through an early-fusion strategy according to an embodiment of the present disclosure. [0022] FIG. 17 illustrates an exemplary process for combining multi-modality inputs through a late-fusion strategy according to an embodiment of the present disclosure.
[0023] FIG. 18 illustrates an exemplary scenario for expressing emotional states of responses through light according to an embodiment of the present disclosure.
[0024] FIG. 19 is a flowchart of an exemplary method for providing a response in automated chatting according to an embodiment of the present disclosure.
[0025] FIG. 20 illustrates an exemplary apparatus for providing a response in automated chatting according to an embodiment of the present disclosure.
[0026] FIG. 21 illustrates an exemplary apparatus for providing a response in automated chatting according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
[0027] The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
[0028] In general, a chatbot may chat automatically in a session with a user. Herein, the "session" may refer to a time-continuous conversation between two chat participants. When the chatbot is conducting automated chatting, it may receive messages from the user and reply by selecting a candidate response from a set of candidate responses stored in its associated database. Currently, when the chatbot selects a candidate response, it usually scores relevance between each candidate response and the message in the chat flow, and provides the user with a highest-scored candidate response. Since emotional change in the chat flow is not considered during the scoring process, the candidate response that is finally selected may significantly fluctuate in terms of emotion.
[0029] Embodiments of the present disclosure propose a method and apparatus for providing a response in automated chatting. According to an embodiment of the present disclosure, after a message in a chat flow being obtained, a context associated with the message may be determined, and a response being smooth and relevant to the context in both semantic and emotional terms may be provided. Herein, the context refers to all received messages and sent responses in a current session, i.e., a session in which the most recently received message is located, and may include the most recently received message itself.
[0030] In an aspect, an embodiment of the present disclosure proposes a transitional memory-based matching model that may model semantic change and emotional change in a chat flow and consider such change when selecting a candidate response, and may thereby provide a response that is smoother and more natural in terms of semantics and emotion.
[0031] In another aspect, an embodiment of the present disclosure proposes to use a multi-task framework to optimize emotional representations of a context and a candidate response by an additional emotion classification task. A training corpus with emotional labels may be used to perform the additional emotion classification task.
[0032] In another aspect, an embodiment of the present disclosure proposes to train a transitional memory-based matching model for a predetermined personality, thereby obtaining a chatbot with the predetermined personality. The personality of a speaker may be associated with his or her emotional change range in the speech. The transitional memory-based matching model may be trained based on the emotional change range constraint associated with the predetermined personality.
[0033] In another aspect, an embodiment of the present disclosure proposes to consider external factors that affect emotional states, such as weather, health condition, whether a good thing happened, whether a bad thing happened, etc., when making candidate response selections. A basic emotional state may be determined based on the external factors, so that an emotional state of a selected response is consistent with the basic emotional state determined based on the external factors, and is smooth and relevant to previous utterances in the current session.
[0034] In another aspect, a transitional memory-based matching model proposed by an embodiment of the present disclosure may support multi-modality inputs. Inputs for different modalities of a particular utterance may be converted into corresponding representations. These representations may be combined through multiple fusion strategies.
[0035] In another aspect, an embodiment of the present disclosure proposes that a selected candidate response may be presented based on an emotional state of the response, and the emotional state of the selected candidate response may also be expressed by additionally providing other multi-modality signals.
[0036] In another aspect, an embodiment of the present disclosure proposes to achieve empathy between a chatbot and a user, and guide the user to obtain a positive emotional state.
[0037] FIG. 1 illustrates an exemplary application scenario 100 of a chatbot according to an embodiment of the present disclosure. In the scenario 100, a network 110 is applied to interconnect between a terminal device 120 and a chatbot server 130. The network 110 may be any type of network capable of interconnecting network entities. The network 110 may be a single network or a combination of various types of networks. In terms of coverage, the network 110 may be a Local Area Network (LAN), a Wide Area Network (WAN), etc. In terms of carrying medium, the network 110 may be a wireline network, a wireless network, etc. In terms of data switching techniques, the network 110 may be a circuit switching network, a packet switching network, etc.
[0038] The terminal device 120 may be any type of electronic computing device capable of connecting to the network 110, accessing a server or website on the network 110, processing data or signals, etc. For example, the terminal device 120 may be a desktop computer, a notebook computer, a tablet computer, a smart phone, etc. Although only one terminal device 120 is shown in FIG. 1, it is to be understood that a different number of terminal devices may be connected to the network 110.
[0039] The terminal device 120 may include a chatbot client 122 that may provide an automated chatting service to a user. In some implementations, the chatbot client 122 may interact with the chatbot server 130 and present to the user information and responses that the chatbot server 130 provides. For example, the chatbot client 122 may send a message entered by the user to the chatbot server 130 and receive a response relevant to the message from the chatbot server 130. However, it is to be understood that in other implementations, the chatbot client 122 may also generate locally a response to the message entered by the user, rather than interacting with the chatbot server 130.
[0040] The chatbot server 130 may conduct automated chatting with a user of the terminal device 120. A corpus for automated chatting may be stored in a chatbot database 132 that the chatbot server 130 connects with or the chatbot server 130 contains.
[0041] It is to be understood that all the network entities in FIG. 1 are exemplary, and any other network entity may be involved in the application scenario 100 according to specific application requirements.
[0042] FIG. 2 illustrates an exemplary chat window 200 according to an embodiment of the present disclosure. The chat window 200 may include a presenting area 210, a control area 220, and an input area 230. The presenting area 210 displays messages and responses in a chat flow. The control area 220 includes a plurality of virtual buttons for use by a user to perform message input settings. For example, the user may choose to perform voice input, attach an image file, select an emoji, take a screenshot of a current screen, etc. through the control area 220. The input area 230 is used for the user to enter a message. For example, the user may type a text through the input area 230. The chat window 200 may further include a virtual button 240 for confirming transmission of the entered message. If the user touches the virtual button 240, a message entered in the input area 230 may be transmitted to the presenting area 210.
[0043] It should be noted that all the units in FIG. 2 and their layouts are exemplary. According to specific application requirements, the chat window in FIG. 2 may omit or add any unit, and the layouts of the units in the chat window in FIG. 2 may also be changed in various ways.
[0044] According to an embodiment of the present disclosure, when conducting automated chatting, a chatbot may obtain a message in a chat flow, such as a message most recently received from a user, and determine a context associated with the message. The context may include all received messages and sent responses in a current session, and may include the most recently received message itself. Herein, the messages received and responses sent by the chatbot are collectively referred to as utterances. Thus, the context may include a set of utterances. The chatbot may also obtain a set of candidate responses from a database that it connects with or it contains, and for each candidate response of the set of candidate responses, score relevance between the candidate response and the context to obtain a comprehensive relevance score corresponding to the candidate response. The chatbot may then provide, in the chat flow, a candidate response with the highest comprehensive relevance score among the set of candidate responses.
[0045] FIG. 3 illustrates an exemplary process 300 for obtaining a comprehensive relevance score according to an embodiment of the present disclosure. The process 300 may be performed by, for example, the chatbot server 130 in FIG. 1.
[0046] After obtaining a message in a chat flow, a chatbot server may determine a context associated with the message, such as a context 302 in FIG. 3, which may include all received messages and sent responses in a session in which the message is located, such as utterances 302-1, 302-2, 302-3, ..., 302-n, wherein utterance 302-n may be a message currently obtained from the chat flow. In addition, the chatbot server may also obtain a set of candidate responses 304 from a database that it connects with or it contains, such as a chatbot database 132 in FIG. 1, which may include a plurality of candidate responses, such as a candidate response 306. The candidate response 306 is taken as an example to illustrate an exemplary process for obtaining a comprehensive relevance score of the candidate response 306. The context 302 and the candidate response 306 may be provided to a transitional memory-based matching model 308. The transitional memory-based matching model 308 may include, for example, an initial representation generating part 310, an interaction representation generation part 312, a matching part 314, and an aggregation part 316.
[0047] At the initial representation generating part 310, an initial representation of the context 302 and an initial representation of the candidate response 306 may be generated. Herein, the initial representation refers to a representation generated based on a representation of each utterance in the context or the candidate response.
[0048] The representation of each utterance may include a semantic representation and/or an emotional representation. The emotional representation may be generated based on a variety of approaches for characterizing emotional states. In an implementation, the emotional states may be characterized through a Valence-Arousal (V-A) model. FIG. 4 illustrates an exemplary V-A model 400 according to an embodiment of the present disclosure. The V-A model 400 maps emotional features to a two-dimensional space, which is defined by two orthogonal dimensions such as valence and arousal. The valence may represent the polarity of emotion, such as negative emotion and positive emotion, and indicate the degree by continuous values in the range of, for example, [-1, 0] and [0, 1], respectively. The arousal may indicate the energy of emotion, and indicate the degree by a continuous value in the range of, for example, [0, 1]. Almost all human emotional states may be mapped to points defined in this two-dimensional space based on valence value-arousal value pairs (V-A pairs). Four exemplary emotional states are shown in FIG. 4, such as "happy", "satisfied", "nervous", and "sad". The emotional state "happy" may be mapped, for example, to point 402 in the V-A model 400, whose V-A pair is (0.8, 0.6). The emotional state "satisfied" may be mapped, for example, to point 404 in the V-A model 400, whose V-A pair is (0.7, 0.4). The emotional state "nervous" may be mapped, for example, to point 406 in the V-A model 400, whose V-A pair is (-0.3, 0.9). The emotional state "sad" may be mapped, for example, to point 408 in the V-A model 400, whose V-A pair is (-0.8, 0.3).
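As an illustration only, the four V-A pairs quoted above could be collected in a small lookup (a sketch; the dictionary and helper are not part of the disclosure):

```python
# V-A pairs quoted for the four exemplary emotional states in FIG. 4.
VA_EXAMPLES = {
    "happy": (0.8, 0.6),
    "satisfied": (0.7, 0.4),
    "nervous": (-0.3, 0.9),
    "sad": (-0.8, 0.3),
}

def polarity(state: str) -> str:
    """A negative valence indicates negative emotion; a positive valence, positive emotion."""
    valence, _arousal = VA_EXAMPLES[state]
    return "positive" if valence >= 0 else "negative"

print(polarity("happy"))    # positive
print(polarity("nervous"))  # negative
```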
[0049] It is to be understood that characterizing the emotional states by the V-A model described in conjunction with FIG. 4 is only an example, and the emotional states may also be characterized in other ways. For example, the emotional states may be characterized by a six-category method, that is, the emotional states are characterized by a probability distribution for six basic emotion types. These six basic types of emotion include, for example, anger, happiness, surprise, disgust, sadness, and fear. The emotion representation according to an embodiment of the present disclosure may be based on any one of approaches for characterizing emotional states.
[0050] Based on a semantic representation of each utterance in the context 302, a context semantic initial representation may be generated. Based on an emotional representation of each utterance in the context 302, a context emotional initial representation may be generated. Based on a semantic representation of the candidate response 306, a candidate response semantic initial representation may be generated. Based on an emotional representation of the candidate response 306, a candidate response emotional initial representation may be generated. The specific process for generating the above initial representations will be explained later in conjunction with FIG. 5.
[0051] After the initial representations of the context 302 and the initial representations of the candidate response 306 are obtained, at the interaction representation generation part 312, interaction representations of the context 302 and interaction representations of the candidate response 306 may be further generated. Herein, an interaction representation refers to a representation generated based on information change between every two adjacent utterances among the context and/or the candidate response. Such information change may include semantic change and/or emotional change. Based on the semantic change between every two adjacent utterances among the context 302, a context semantic interaction representation may be generated. Based on the emotional change between every two adjacent utterances among the context 302, a context emotional interaction representation may be generated. Based on the semantic change between every two adjacent utterances among the context 302 and the candidate response 306, a candidate response semantic interaction representation may be generated. Based on the emotional change between every two adjacent utterances among the context 302 and the candidate response 306, a candidate response emotional interaction representation may be generated. The specific process for generating the above interaction representations will be explained later in conjunction with FIG. 6.
[0052] At the matching part 314, a matching process may be performed based on the generated initial representations and interaction representations. Each of the initial representations and the interaction representations may include a semantic representation and an emotional representation. Accordingly, the matching may include semantic matching and emotional matching. The semantic matching may be performed between two semantic representations to obtain a semantic initial relevance representation and a semantic interaction relevance representation. The specific process for the semantic matching will be explained later in conjunction with FIG. 7. The emotional matching may be performed between two emotional representations to obtain an emotional initial relevance representation and an emotional interaction relevance representation. The specific process for the emotional matching will be explained later in conjunction with FIG. 8.
[0053] After obtaining the semantic initial relevance representation, the semantic interaction relevance representation, the emotional initial relevance representation, and the emotional interaction relevance representation, these relevance representations may be aggregated at the aggregation part 316 to obtain a comprehensive relevance score 318. The specific process for performing the aggregation will be explained later in conjunction with FIG. 9.
[0054] FIG. 5 illustrates an exemplary process 500 for generating initial representations according to an embodiment of the present disclosure. The initial representations may include a semantic initial representation and an emotional initial representation, for example, context initial representations may include a context semantic initial representation and a context emotional initial representation, and candidate response initial representations may include a candidate response semantic initial representation and a candidate response emotional initial representation. The processes for generating the semantic initial representations and the emotional initial representations are similar.
[0055] The process 500 may be performed on a context 502 and a candidate response 512. The context 502 may correspond to the context 302 in FIG. 3. The context 502 may include, for example, utterances 502-1, 502-2, 502-3, ..., 502-n, which may correspond to the utterances 302-1, 302-2, 302-3, ..., 302-n in FIG. 3, respectively. The candidate response 512 may correspond to the candidate response 306 in FIG. 3.
[0056] Word vector sequences corresponding to the utterances 502-1, 502-2, 502-3, ..., 502-n, respectively, may be generated through embedding layers 504-1, 504-2, ..., 504-n.
Assume that the context 502 may be represented as {u1, u2, u3, ..., un}, wherein u represents an utterance, and uk represents the k-th utterance in the context 502, that is, utterance 502-k. After being processed by an embedding layer, uk may be represented as Uk = [ek1, ek2, ..., ekm], wherein ekj represents a word vector of the j-th word in utterance 502-k, and m represents the number of words in utterance 502-k.
[0057] Similarly, a word vector sequence corresponding to the candidate response 512 may be generated through an embedding layer 514. This word vector sequence may be represented as R = [er1, er2, ..., erm], wherein erj represents a word vector of the j-th word in the candidate response 512, that is, the candidate response r, and m represents the number of words in the candidate response 512.
[0058] Subsequently, word-level representations 508-1, 508-2, 508-3, ..., 508-n corresponding to utterances 502-1, 502-2, 502-3, ..., 502-n may be generated through attention mechanisms and feed-forward neural networks 506-1, 506-2, ..., 506-n, respectively. Similarly, a word-level representation 518 corresponding to the candidate response 512 may be generated through an attention mechanism and a feed-forward neural network 516. A word-level representation 508-k corresponding to the utterance 502-k may be represented as Uk self, and the word-level representation 518 corresponding to the candidate response 512 may be represented as Rself. Uk self and Rself may be represented, for example, by the following formulas:
Uk self = fATT(Uk, Uk) (1)
Rself = fATT(R, R) (2)
wherein fATT( ) represents output of an attention mechanism and a feed-forward neural network.
[0059] A context initial representation 510, that is, Uself = [U1 self, U2 self, ..., Un self], may be generated through combining, such as cascading, the word-level representations 508-1, 508-2, 508-3, ..., 508-n. The word-level representation 518 may be adopted as a candidate response initial representation 520. Both a semantic initial representation and an emotional initial representation may be generated through the process 500 in FIG. 5. Through the process 500, a context semantic initial representation Us self, a context emotional initial representation Ue self, a candidate response semantic initial representation Rs self, and a candidate response emotional initial representation Re self may be generated.
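For illustration only, a minimal PyTorch-style sketch of this word-level (initial) representation step of formulas (1)-(2) is given below. The class name, dimensions, and toy inputs are illustrative assumptions and are not part of the disclosure; the attention and feed-forward layers stand in for fATT.

```python
import torch
import torch.nn as nn

class WordLevelEncoder(nn.Module):
    """Sketch of formulas (1)-(2): embedding, self-attention and a feed-forward
    network producing a word-level representation (fATT)."""
    def __init__(self, vocab_size, dim=128, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)                  # embedding layer 504-x / 514
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, token_ids):                                   # token_ids: (batch, m)
        e = self.embed(token_ids)                                   # word vector sequence Uk or R
        a, _ = self.attn(e, e, e)                                   # attend within the utterance
        return self.ffn(a)                                          # Uk self or Rself, (batch, m, dim)

# Toy usage: encode each utterance and the candidate response, then cascade the
# utterance representations into the context initial representation 510.
enc = WordLevelEncoder(vocab_size=10000)
utterances = [torch.randint(0, 10000, (1, 12)) for _ in range(3)]   # context with 3 utterances
response = torch.randint(0, 10000, (1, 10))
U_self = [enc(u) for u in utterances]                               # word-level representations 508-1..508-n
R_self = enc(response)                                              # candidate response initial representation 520
context_initial = torch.cat(U_self, dim=1)                          # context initial representation 510
```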
[0060] After the word-level representations corresponding to the respective utterances in the context and the candidate response have been generated, context interaction representations and candidate response interaction representations may be further generated based on these word-level representations. FIG. 6 illustrates an exemplary process 600 for generating interaction representations according to an embodiment of the present disclosure. The interaction representations may include semantic interaction representations and emotional interaction representations, for example, context interaction representations may include a context semantic interaction representation and a context emotional interaction representation, and candidate response interaction representations may include a candidate response semantic interaction representation and a candidate response emotional interaction representation. The processes for generating a semantic interaction representation and an emotional interaction representation are similar.
[0061] Firstly, word-level representations 602-1, 602-2, 602-3, ..., 602-n corresponding to respective utterances in a context 602 and a word-level representation 618 corresponding to a candidate response 616 may be obtained, wherein a word-level representation 602-k corresponds to utterance k in the context 602, i.e., uk. The context 602 may correspond to the context 502 in FIG. 5, and the word-level representations 602-1, 602-2, 602-3, ..., 602-n may correspond to the word-level representations 508-1, 508-2, 508-3, ..., 508-n in FIG. 5, respectively. The candidate response 616 may correspond to the candidate response 512 in FIG. 5, and the word-level representation 618 may correspond to the word-level representation 518 in FIG. 5.
[0062] Sentence-level representations 606-1, 606-2, 606-3, ..., 606-n corresponding to the word-level representations 602-1, 602-2, 602-3, ..., 602-n, respectively, may be generated through recurrent neural networks and attention mechanisms 604-1, 604-2, ..., 604-n. Similarly, a sentence-level representation 622 corresponding to the word-level representation 618 may be generated through a recurrent neural network and an attention mechanism 620.
[0063] A sentence-level representation 606-k corresponding to utterance k in the context 602 may be represented as Uk utter, and the sentence-level representation 622 corresponding to the candidate response 616 may be represented as Rutter. The process for generating the sentence-level representations Uk utter and Rutter through recurrent neural networks and attention mechanisms may be represented, for example, by the following formulas. Firstly, a hidden state H{u,r}[i] corresponding to the i-th word in a respective utterance u in the context or in the candidate response r may be calculated, as shown in the following formula: H{u,r}[i] = GRU(Wself[i], H{u,r}[i - 1]) (3) wherein GRU represents a Gated Recurrent Unit, Wself ∈ {Uk self, Rself}, and H{u,r} ∈ R^(m×d) represents a hidden state corresponding to a respective utterance in the context or the candidate response, wherein m represents the number of words in the corresponding utterance, and d represents a dimension. Subsequently, an attention mechanism and average pooling may be performed on the hidden state H{u,r} to obtain a sentence-level representation Uk utter corresponding to a respective utterance uk in the context and a sentence-level representation Rutter corresponding to the candidate response r, as shown in the following formulas: Uk utter = mean(fATT(Huk, Huk)) (4)
Rutter = mean(fATT(Hr, Hr)) (5) wherein mean( ) represents average pooling.
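A minimal sketch of formulas (3)-(5) follows, assuming the same illustrative PyTorch modules as above; the module names and dimensions are not part of the disclosure.

```python
import torch
import torch.nn as nn

class SentenceLevelEncoder(nn.Module):
    """Sketch of formulas (3)-(5): GRU hidden states, self-attention, average pooling."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)               # formula (3)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, w_self):                                      # w_self: (batch, m, dim), Uk self or Rself
        h, _ = self.gru(w_self)                                     # hidden states H in R^(m x d)
        a, _ = self.attn(h, h, h)                                   # fATT(H, H)
        return a.mean(dim=1)                                        # average pooling -> Uk utter / Rutter
```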
[0064] A difference between sentence-level representations of adjacent utterances among the context and the candidate response may be calculated based on Uk utter and Rutter. Such a difference may reflect information change between adjacent utterances among the context and the candidate response, such as semantic change and/or emotional change. As shown in FIG. 6, a difference 608-1 may be calculated based on a sentence-level representation 606-1 and required preceding information, wherein the difference 608-1 may reflect information change between utterance 1 in the context 602 and the required preceding information, and wherein the required preceding information may be initialized to zero; a difference 608-2 may be calculated based on a sentence-level representation 606-2 and the sentence-level representation 606-1, wherein the difference 608-2 may reflect information change between utterance 2 and utterance 1 in the context 602; a difference 608-3 may be calculated based on a sentence-level representation 606-3 and the sentence-level representation 606-2, wherein the difference 608-3 may reflect information change between utterance 3 and utterance 2 in the context 602; ...; by analogy, a difference 608-n may be calculated, which may reflect information change between utterance n and utterance n-1 in the context 602. An utterance adjacent to the candidate response 616 is utterance n in the context 602. Accordingly, a difference 624 may be calculated based on a sentence-level representation 622 of the candidate response 616 and a sentence-level representation 606-n of utterance n, wherein the difference 624 may reflect information change between the candidate response 616 and utterance n.
[0065] A difference 608-k between the sentence-level representation Uk utter and a sentence-level representation Uk-1 utter may be represented, for example, as Tk local. A difference 624 between the sentence-level representation Rutter and the sentence-level representation Un utter may be represented, for example, as Tr local. In an implementation, Tk local and Tr local may be calculated, for example, by the following formulas:
Tk local = ReLU(Wt(Uk utter ʘ Uk-1 utter) + bt) (6)
Tr local = ReLU(Wt(Rutter ʘ Un utter) + bt) (7)
wherein ReLU represents a Rectified Linear Unit, ʘ represents element-wise multiplication, Wt and bt are trainable parameters, and U0 utter may be filled with zeros. [0066] After obtaining the differences 608-1, 608-2, 608-3, ..., 608-n and 624, at 610, utterance interaction representations 612-1, 612-2, 612-3, ..., 612-n corresponding to respective utterances in the context and a candidate response interaction representation 626 corresponding to the candidate response may be generated based on these differences. In an implementation, an utterance interaction representation 612-k corresponding to utterance k in the context 602 may be generated based on the differences between sentence-level representations of every two adjacent utterances among utterance k in the context 602 and the preceding utterances of utterance k, wherein the preceding utterances of utterance k may include utterances before utterance k in the context 602. For example, an utterance interaction representation 612-3 corresponding to utterance 3 in the context 602 may be generated based on the differences 608-2 and 608-3, an utterance interaction representation 612-n corresponding to utterance n may be generated based on the differences 608-2, 608-3, ..., 608-n, and a candidate response interaction representation 626 corresponding to the candidate response 616 may be generated based on the differences 608-2, 608-3, ..., 608-n and 624.
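The sketch below illustrates the local-difference step of formulas (6)-(7) under the same illustrative assumptions as the earlier sketches; in particular, the exact combination inside the ReLU (here an element-wise product with a single linear projection) is an assumption based on the description above.

```python
import torch
import torch.nn as nn

class LocalTransition(nn.Module):
    """Sketch of the local differences (formulas (6)-(7)) between sentence-level
    representations of adjacent utterances and of the candidate response."""
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                              # trainable Wt and bt

    def forward(self, utter_reps, resp_rep):
        # utter_reps: list of (batch, dim) tensors U1 utter .. Un utter; resp_rep: (batch, dim) Rutter
        prev = torch.zeros_like(utter_reps[0])                       # U0 utter filled with zeros
        t_locals = []
        for u in utter_reps:
            t_locals.append(torch.relu(self.proj(u * prev)))         # element-wise product, then ReLU
            prev = u
        t_r = torch.relu(self.proj(resp_rep * prev))                 # Tr local against the last utterance
        return t_locals, t_r
```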
[0067] In an implementation, the utterance interaction representation generating 610 may integrate the respective differences through a Transitional Memory Network and by copying historical memories. Herein, the memory is implemented by using a recurrent attention mechanism, wherein a feed-forward neural network may be used to transform utterance k into a memory representation Mk {in,out} and transform the candidate response into a memory representation Mr {in,out}, as shown in the following formulas:
Mk {in,out} = W{in,out} Tk local + b{in,out} (8)
Mr {in,out} = W{in,out} Tr local + b{in,out} (9)
wherein Mk in and Mr in represent input memory representations, Mk out and Mr out represent output memory representations, and W{in,out} and b{in,out} are trainable parameters.
[0068] A global representation Tk' global for utterance k in the context and the candidate response may be obtained, wherein when k' ∈ {1, 2, ..., n}, Tk' global represents a global representation for utterance k, and when k' = n + 1, Tk' global represents a global representation for the candidate response. Tk' global may be calculated, for example, by the following formulas:
pk',i = softmax(Mk' in · Mi in), i ∈ {1, 2, ..., k'-1} (10)
Tk' global = Σi pk',i Mi out (11)
wherein pk',i represents an attention weight of the k'-th memory over the i-th preceding memory.
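The following sketch assumes a standard memory-network read for the global representation of paragraph [0068]: each local transition is projected into input and output memories (formulas (8)-(9)), and position k' attends over the memories of the preceding positions. This is an assumed, single-hop illustration; module names and the attention form are not taken from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionalMemoryRead(nn.Module):
    """One memory hop: project each local transition into input/output memories and
    read a global representation for position k' by attending over preceding memories."""
    def __init__(self, dim=128):
        super().__init__()
        self.to_in = nn.Linear(dim, dim)                             # W_in, b_in (formulas (8)-(9))
        self.to_out = nn.Linear(dim, dim)                            # W_out, b_out

    def forward(self, t_locals):
        # t_locals: (batch, n+1, dim), local transitions for utterances 1..n and the response
        m_in, m_out = self.to_in(t_locals), self.to_out(t_locals)
        reads = []
        for k in range(t_locals.size(1)):
            if k == 0:
                reads.append(torch.zeros_like(t_locals[:, 0]))       # no preceding memory for the first position
                continue
            scores = torch.einsum('bd,bid->bi', m_in[:, k], m_in[:, :k])   # attention over memories before k'
            p = F.softmax(scores, dim=-1)
            reads.append(torch.einsum('bi,bid->bd', p, m_out[:, :k]))      # weighted sum of output memories
        return torch.stack(reads, dim=1)                             # global representations T^global
```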
[0069] The above process may be performed iteratively, wherein the representation results between adjacent hops may be integrated by residuals. An utterance interaction representation Tk' for a respective utterance in the context and for the candidate response may be obtained, for example, by concatenating Tk' local and Tk' global, as shown in the following formula:
Tk' = [Tk' local; Tk' global] (12)
wherein when k' = n + 1, Tk' local may correspond to Tr local in formula (7). When k' ∈ {1, 2, ..., n}, Tk' represents an utterance interaction representation for an utterance in the context, and when k' = n + 1, Tk' represents an interaction representation for the candidate response, that is, the candidate response interaction representation 626, i.e., Tr. The utterance interaction representation Tk' may reflect a difference in representation between utterance k' and all previous utterances before utterance k' in the current session, i.e., utterance 1 to utterance k'-1. Subsequently, a context interaction representation 614, i.e., Tc, may be obtained by concatenating the utterance interaction representations 612-2, 612-3, ..., 612-n corresponding to the respective utterances in the context 602.
[0070] Both a semantic interaction representation and an emotional interaction representation may be generated through the process 600 in FIG. 6. Through the process 600, a context semantic interaction representation Ts,c, a context emotional interaction representation Te,c, a candidate response semantic interaction representation Ts,r, and a candidate response emotional interaction representation Te,r may be generated. [0071] As explained above in connection with FIG. 6, the generation of the context interaction representation and the candidate response interaction representation according to embodiments of the present disclosure considers the difference in representation between adjacent utterances among the context and the candidate response, and further considers the difference in representation between a respective utterance in the context or the candidate response and the preceding utterances of this utterance in the current session. Such differences may reflect information change during the session, such as semantic change and emotional change. In other words, embodiments of the present disclosure propose to model a semantic flow and an emotional flow in the session, so that the semantic change and the emotional change in the session may be effectively tracked. Referring back to FIG. 3, the context interaction representation and the candidate response interaction representation may then be used in subsequent matching and aggregation processes, and finally generate a comprehensive relevance score indicating relevance between the candidate response and the context. Since the generation of the context interaction representation and the candidate response interaction representation considers the semantic change and the emotional change between adjacent utterances among the context and the candidate response, such change will also be taken into account when generating the comprehensive relevance score; thereby, a calculated relevance score of a candidate response that is smoother and more natural relative to the context in terms of semantics and emotion will be higher.
[0072] FIG. 7 illustrates an exemplary process 700 for semantic matching according to an embodiment of the present disclosure. A context semantic initial representation 704 and a context semantic interaction representation 706 corresponding to a context 702 may be obtained. The context 702 may correspond to the context 302 in FIG. 3. The context semantic initial representation 704 and the context semantic interaction representation 706 may be represented as Us self and Ts,c, respectively. A candidate response semantic initial representation 710 and a candidate response semantic interaction representation 712 corresponding to a candidate response 708 may be obtained. The candidate response 708 may correspond to the candidate response 306 in FIG. 3. The candidate response semantic initial representation 710 and the candidate response semantic interaction representation 712 may be represented as Rs self and Ts,r, respectively. The context semantic initial representation 704 and the candidate response semantic initial representation 710 may be generated, for example, through the process 500 in FIG. 5, and the context semantic interaction representation 706 and the candidate response semantic interaction representation 712 may be generated, for example, through the process 600 in FIG. 6.
[0073] The context semantic initial representation 704 and the candidate response semantic initial representation 710 may be matched 714 to generate a semantic initial relevance representation 716. The semantic initial relevance representation 716 may indicate relevance between the context semantic initial representation 704 and the candidate response semantic initial representation 710, and may be generated, for example, through a matching operation with trainable parameters between the two representations.
[0074] The context semantic interaction representation 706 and the candidate response semantic interaction representation 712 may be matched 718 to generate a semantic interaction relevance representation 720. The semantic interaction relevance representation 720 may indicate relevance between the context semantic interaction representation 706 and the candidate response semantic interaction representation 712, and may be generated, for example, through a matching operation with trainable parameters between the two representations.
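Because the matching formulas themselves are not reproduced above, the sketch below assumes a simple bilinear matching between a context-side representation and a response-side representation; the disclosed model may parameterize this step differently.

```python
import torch
import torch.nn as nn

class BilinearMatcher(nn.Module):
    """Hypothetical matching step: a trainable bilinear map between a context-side
    representation and a response-side representation, yielding a relevance map."""
    def __init__(self, dim=128):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)          # trainable matching parameter

    def forward(self, context_rep, response_rep):
        # context_rep: (batch, Lc, dim); response_rep: (batch, Lr, dim)
        return torch.einsum('bid,de,bje->bij', context_rep, self.W, response_rep)

# The same module can be applied at 714 (initial representations) and at 718
# (interaction representations) to obtain the two semantic relevance representations,
# and analogously for the emotional matching of FIG. 8.
```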
[0075] FIG. 8 illustrates an exemplary process 800 for emotional matching according to an embodiment of the present disclosure. A context emotional initial representation 804 and a context emotional interaction representation 806 corresponding to a context 802 may be obtained. The context 802 may correspond to the context 302 in FIG. 3. The context emotional initial representation 804 and the context emotional interaction representation 806 may be represented as Ue self and Te,c, respectively. A candidate response emotional initial representation 810 and a candidate response emotional interaction representation 812 corresponding to a candidate response 808 may be obtained. The candidate response 808 may correspond to the candidate response 306 in FIG. 3. The candidate response emotional initial representation 810 and the candidate response emotional interaction representation 812 may be represented as Re self and Te,r, respectively. The context emotional initial representation 804 and the candidate response emotional initial representation 810 may be generated, for example, through the process 500 in FIG. 5, and the context emotional interaction representation 806 and the candidate response emotional interaction representation 812 may be generated, for example, through the process 600 in FIG. 6.
[0076] The context emotional initial representation 804 and the candidate response emotional initial representation 810 may be matched 814 to generate an emotional initial relevance representation 816. The emotional initial relevance representation 816 may indicate relevance between the context emotional initial representation 804 and the candidate response emotional initial representation 810, and may be generated, for example, through a matching operation with trainable parameters between the two representations.
[0077] The context emotional interaction representation 806 and the candidate response emotional interaction representation 812 may be matched 818 to generate an emotional interaction relevance representation 820. The emotional interaction relevance representation 820 may indicate relevance between the context emotional interaction representation 806 and the candidate response emotional interaction representation 812, and may be generated, for example, through a matching operation with trainable parameters between the two representations.
[0078] FIG. 9 illustrates an exemplary process 900 for performing aggregation according to an embodiment of the present disclosure. The process 900 may be performed by the aggregation part 316 in the transitional memory-based matching model 308 shown in FIG. 3. A semantic initial relevance representation 902 and a semantic interaction relevance representation 904 in FIG. 9 may correspond to the semantic initial relevance representation 716 and the semantic interaction relevance representation 720 in FIG. 7, respectively, and an emotional initial relevance representation 920 and an emotional interaction relevance representation 922 in FIG. 9 may correspond to the emotional initial relevance representation 816 and the emotional interaction relevance representation 820 in FIG. 8, respectively.
[0079] The semantic initial relevance representation 902 may be processed by, for example, two layers of recurrent neural networks 906 and 908: a first recurrent neural network may process the word-level relevance results within each utterance, wherein the number of steps corresponds to the number of words in the corresponding utterance, and a second recurrent neural network may process the resulting utterance-level states across the utterances k ∈ {1, 2, ..., n}, wherein n represents the number of utterances in the context, the initial hidden state may be initialized to zero, and the final hidden state may be used for the subsequent relevance score calculating process.
[0080] The semantic interaction relevance representation 904 may be processed by a recurrent neural network 910 across the utterances k ∈ {1, 2, ..., n}, wherein n represents the number of utterances in the context, and the final hidden state may be used for the subsequent relevance score calculating process.
[0081] At 912, the processed semantic initial relevance representation 902 and the processed semantic interaction relevance representation 904 may be combined, such as cascaded, to obtain a semantic relevance representation 914. Subsequently, through a forward neural network 916 with trainable parameters, a semantic relevance score 918 may be generated based on the semantic relevance representation 914.
[0082] The emotional initial relevance representation 920 may be processed by, for example, two layers of recurrent neural networks 924 and 926 in the same manner as described for the semantic initial relevance representation 902: a first recurrent neural network may process the word-level relevance results within each utterance, and a second recurrent neural network may process the resulting utterance-level states across the utterances k ∈ {1, 2, ..., n}, wherein n represents the number of utterances in the context, the initial hidden state may be initialized to zero, and the final hidden state may be used for the subsequent relevance score calculating process.
[0083] The emotional interaction relevance representation 922 may be processed by a recurrent neural network 928 across the utterances k ∈ {1, 2, ..., n}, wherein n represents the number of utterances in the context, and the final hidden state may be used for the subsequent relevance score calculating process.
[0084] At 930, the processed emotional initial relevance representation 920 and the processed emotional interaction relevance representation 922 may be combined, such as cascaded, to obtain an emotional relevance representation 932. Subsequently, through a forward neural network 934 with trainable parameters, an emotional relevance score 936 may be generated based on the emotional relevance representation 932.
[0085] At 938, the semantic relevance score 918 and the emotional relevance score 936 may be combined to obtain a comprehensive relevance score 940. The comprehensive relevance score 940 may be represented, for example, as g. The comprehensive relevance score 940 may correspond to the comprehensive relevance score 318 in FIG. 3. In an implementation, the comprehensive relevance score 940 may be obtained by summing the semantic relevance score 918 and the emotional relevance score 936.
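A hedged sketch of the aggregation flow of FIG. 9 follows. The tensor shapes, the way per-utterance relevance representations are reduced to vectors, and the module names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class Aggregator(nn.Module):
    """Sketch of the aggregation in FIG. 9: recurrent layers over per-utterance
    relevance representations, cascading, and a feed-forward scorer."""
    def __init__(self, dim=128):
        super().__init__()
        self.word_rnn = nn.GRU(dim, dim, batch_first=True)           # first layer (906 / 924)
        self.utter_rnn = nn.GRUCell(dim, dim)                         # second layer (908 / 926)
        self.inter_rnn = nn.GRUCell(dim, dim)                         # RNN over interaction relevance (910 / 928)
        self.scorer = nn.Linear(2 * dim, 1)                           # forward neural network (916 / 934)

    def forward(self, init_rel, inter_rel):
        # init_rel: list of n tensors (batch, m, dim); inter_rel: (batch, n, dim)
        batch = inter_rel.size(0)
        h_init = torch.zeros(batch, self.utter_rnn.hidden_size)       # initialized to zero
        for r in init_rel:
            _, last = self.word_rnn(r)                                # within-utterance GRU
            h_init = self.utter_rnn(last.squeeze(0), h_init)          # across-utterance GRU
        h_inter = torch.zeros(batch, self.inter_rnn.hidden_size)
        for k in range(inter_rel.size(1)):
            h_inter = self.inter_rnn(inter_rel[:, k], h_inter)
        combined = torch.cat([h_init, h_inter], dim=-1)               # cascaded relevance representation (914 / 932)
        return self.scorer(combined).squeeze(-1)                      # relevance score (918 / 936)

# A semantic Aggregator and an emotional Aggregator each produce a score; the
# comprehensive relevance score g of paragraph [0085] is their sum.
```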
[0086] The specific operation process for each part in the transitional memory-based matching model is described above in conjunction with FIGs. 5-9. It is to be understood that these processes are merely exemplary. Each process may employ any other unit, may include any other step, and may include more or fewer steps, depending on the actual application requirements.
[0087] FIG. 10 illustrates an exemplary chat flow 1000a and associated emotional flow 1000b according to an embodiment of the present disclosure. The chat flow 1000a may occur between a chatbot and a user.
[0088] At 1002, the chatbot may output an utterance U1 "I like Taurus girls so much!". In the case where emotional states are characterized by a V-A model, an emotional state EU1 of the utterance U1 may be, for example, (0.804, 0.673).
[0089] At 1004, the user may enter an utterance U2 "Well, Scorpio boys always like Taurus girls. This is a fact." An emotional state EU2 of the utterance U2 may be, for example, (0.392, 0.616).
[0090] At 1006, the chatbot may output an utterance U3 "But why can't I meet a Taurus girl who likes me?". An emotional state EU3 of the utterance U3 may be, for example, (-0.348, 0.647).
[0091] At 1008, the user may enter an utterance U4 "Because your circle of friends is too narrow". An emotional state EU4 of the utterance U4 may be, for example, (-0.339, 0.599). [0092] The position of each emotional state of the utterances U1 to U4 in the V-A model is shown in the emotion flow 1000b.
[0093] After receiving the utterance U4 at 1008, the chatbot may firstly determine a context associated with the utterance U4, which includes, for example, the utterances U1 to U4. The chatbot may then determine a response to be provided to the user from a set of candidate responses in a database that it connects with or contains. For example, the chatbot may calculate a comprehensive relevance score between each candidate response of the set of candidate responses and the context. A block 1010 shows two exemplary candidate responses, that is, a candidate response R1 "I will meet one" and a candidate response R2 "Forget it, I'm kidding. Hahahaha". An emotional state ER1 of the candidate response R1 may be, for example, (-0.837, 0.882). An emotional state ER2 of the candidate response R2 may be, for example, (0.225, 0.670).
[0094] The comprehensive relevance score may be calculated, for example, through the process 300 in FIG. 3 in combination with the processes 500-900 in FIGs. 5-9. Since the calculation of the comprehensive relevance score considers semantic change and emotional change between adjacent utterances among the context and the candidate response, as well as between each utterance among the context and the candidate response and the preceding utterances of this utterance in the current session, a calculated relevance score of a candidate response that is smoother and more natural relative to the context in terms of semantics and emotion will be higher. For example, a relevance score S1 corresponding to the candidate response R1 with the emotional state of (-0.837, 0.882) may be 0.562, and a relevance score S2 corresponding to the candidate response R2 with the emotional state of (0.225, 0.670) may be 0.114. The relevance score S1 is higher than the relevance score S2, so the chatbot finally outputs the candidate response R1 "I will meet one" at 1012. It can also be seen from the emotion flow 1000b that, compared with the candidate response R2, the emotional state of the candidate response R1 is smoother and more natural relative to the utterances U1 to U4.
[0095] FIG. 11 illustrates an exemplary process 1100 for training a transitional memory-based matching model according to an embodiment of the present disclosure. A transitional memory-based matching model 1106 in FIG. 11 may correspond to the transitional memory-based matching model 308 in FIG. 3. The transitional memory-based matching model 1106 may include an initial representation generating part 1108, an interaction representation generation part 1110, a matching part 1112, and an aggregation part 1114, which may correspond to the initial representation generating part 310, the interaction representation generation part 312, the matching part 314 and the aggregation part 316 in FIG. 3, respectively.
[0096] Training of the transitional memory-based matching model 1106 may be based on a corpus 1150. The corpus 1150 may include a plurality of conversation-based training samples, such as [context c1, candidate response r1, relevance label y1], [context c2, candidate response r2, relevance label y2], [context c3, candidate response r3, relevance label y3], etc., wherein context ci may include a set of conversation-based utterances, candidate response ri may be a candidate response for context ci, and the relevance label yi ∈ {0,1} may indicate relevance between context ci and candidate response ri, wherein "0" may indicate that candidate response ri is irrelevant to context ci and "1" may indicate that candidate response ri is relevant to context ci.
[0097] Take a training sample i [context ci, candidate response ri, relevance label yi] in the corpus 1150 as an example. The context ci 1102 and the candidate response ri 1104 may be used as input to the transitional memory-based matching model 1106. The transitional memory-based matching model 1106 may perform a scoring task on the relevance between context ci and candidate response ri, and output a comprehensive relevance score g(ci, ri) 1116. The comprehensive relevance score may be calculated, for example, through the process 300 in FIG. 3 in combination with the processes 500-900 in FIGs. 5-9. In an implementation, a prediction loss of the training sample i may be calculated as a binary cross-entropy loss, and a prediction loss Lscore corresponding to the scoring task is calculated by summing the prediction losses of all the training samples, as shown by the following formula:
Lscore = -Σi [yi log g(ci, ri) + (1 - yi) log(1 - g(ci, ri))]
wherein the sum is taken over all training samples.
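A minimal sketch of this binary cross-entropy scoring loss follows; it assumes the comprehensive relevance score is produced as a raw logit and passed through a sigmoid, which is an implementation assumption rather than a statement from the disclosure.

```python
import torch
import torch.nn.functional as F

def scoring_loss(g_scores, labels):
    """Binary cross-entropy over comprehensive relevance scores, summed over samples.
    g_scores: raw scores g(ci, ri); labels: relevance labels yi in {0, 1}."""
    return F.binary_cross_entropy_with_logits(g_scores, labels.float(), reduction='sum')

# Example with three (context, candidate response) training samples:
g = torch.tensor([2.1, -0.7, 0.3])
y = torch.tensor([1, 0, 1])
loss = scoring_loss(g, y)
```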
[0098] Embodiments of the present disclosure propose to use a multi-task framework to utilize an additional emotion classification task to optimize emotional representations of a context and a candidate response, such as, the context emotional initial representation and the candidate response emotional initial representation generated through the initial representation generating part 310 in FIG. 3, and the context emotional interaction representation and the candidate response emotional interaction representation generated through the interaction representation generation part 312. During the training process, the additional emotion classification task may be performed in conjunction with the scoring task described with reference to FIG. 11. A corpus that includes training data with emotional labels may be utilized to perform the additional emotion classification task. In an implementation, the corpus may be a conversation corpus including a plurality of conversation-based training samples. FIG. 12 illustrates an exemplary process 1200 for optimizing emotional representations with a conversation corpus according to an embodiment of the present disclosure.
[0099] In FIG. 12, a corpus 1250 for performing the additional emotion classification task to optimize emotional representations may include a plurality of conversation-based training samples, such as [context c1, candidate response r1, emotional label {z1,j}], [context c2, candidate response r2, emotional label {z2,j}], [context c3, candidate response r3, emotional label {z3,j}], etc., wherein context ci may include a set of conversation-based utterances, and candidate response ri may be a candidate response for context ci. Different forms of the emotional label may be provided for different approaches for characterizing emotional states. For example, when using a six-category method to characterize emotional states, the emotional label for the emotional category j in the training sample i may be represented as zi,j ∈ {0,1}.
[00100] Take using training sample i [context ci, candidate response ri, emotional label {zi,j}] to perform the additional emotion classification task as an example. Firstly, a candidate response emotional initial representation 1206 corresponding to a candidate response ri 1204 may be generated. The candidate response emotional initial representation 1206 may be generated, for example, through the initial representation generating part 310 in FIG. 3, and more specifically, through the process 500 in FIG. 5. The candidate response emotional initial representation 1206 may be expressed as Re self, which may correspond to, for example, Rself in the above formula (2). Subsequently, a candidate response emotional interaction representation 1210 corresponding to the candidate response ri may be generated based on the context ci 1202 and the candidate response ri 1204. The candidate response emotional interaction representation 1210 may be generated, for example, through the interaction representation generation part 312 in FIG. 3, and more specifically, through the process 600 in FIG. 6. The candidate response emotional interaction representation 1210 may be represented as Te, which may, for example, correspond to Te,r that may be calculated by the above formula (12).
[00101] At 1212, the candidate response emotional initial representation 1206 processed by a pooling layer 1208 may be combined with the candidate response emotional interaction representation 1210 to obtain a candidate response emotional comprehensive representation. A forward neural network 1214 may generate an emotional prediction result h(xi) 1216 based on the candidate response emotional comprehensive representation, as shown, for example, in the following formula:
h(xi) = softmax(W [mean(Re self); Te])
wherein W is a trainable parameter for linear transformation; [;] denotes concatenation; mean( ) represents an average pooling function; and K is the number of emotion types, for example, K may be 6 when the six-category method is used to characterize emotional states.
[00102] In an implementation, a prediction loss of the training sample i may be calculated as a multi-class cross-entropy loss, and a prediction loss Lemo corresponding to the additional emotion classification task is calculated by summing the prediction losses of all the training samples, as shown by the following formula:
Lemo = -Σi=1..M Σj=1..K zi,j log hj(xi) (32)
wherein zi,j is the emotional label for emotional category j of training sample i, hj(xi) is the predicted probability for category j, K is the number of emotion types, and M is the number of training samples.
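The sketch below illustrates the emotion classification head and the multi-class cross-entropy of formula (32). It assumes the pooled emotional initial representation is concatenated with the emotional interaction representation, and that each sample's emotion label is given as a single class index; both are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionHead(nn.Module):
    """Sketch of the emotion classification head: pool the candidate response emotional
    initial representation, concatenate the emotional interaction representation, and
    predict one of K emotion categories."""
    def __init__(self, dim=128, num_emotions=6):
        super().__init__()
        self.proj = nn.Linear(2 * dim, num_emotions)                  # trainable linear transformation

    def forward(self, r_e_self, t_e):
        # r_e_self: (batch, m, dim) emotional initial representation Re self
        # t_e:      (batch, dim) emotional interaction representation Te
        pooled = r_e_self.mean(dim=1)                                  # pooling layer 1208
        return self.proj(torch.cat([pooled, t_e], dim=-1))             # logits h(xi) over K emotions

def emotion_loss(logits, labels):
    """Multi-class cross-entropy of formula (32), summed over training samples."""
    return F.cross_entropy(logits, labels, reduction='sum')
```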
[00103] In addition to the conversation corpus, a sentence corpus based on sentences may also be used to perform an additional emotion classification task to optimize emotional representations. FIG. 13 illustrates an exemplary process 1300 for optimizing emotional representations with a sentence corpus according to an embodiment of the present disclosure.
[00104] A corpus 1350 in FIG. 13 may include a plurality of training samples, such as [utterance x1, emotional label {z1,j}], [utterance x2, emotional label {z2,j}], [utterance x3, emotional label {z3,j}], etc., wherein an emotional label {zi,j} is used to indicate the emotional state of utterance xi. Different forms of an emotional label may be provided for different approaches for characterizing emotional states. For example, when using a six-category method to characterize emotional states, the emotional label for the emotional category j in the training sample i may be represented as zi,j ∈ {0,1}.
[00105] Take using training sample i [utterance xi, emotional label {zi,j}] to perform the additional emotion classification task as an example. Firstly, a word-level representation 1304 corresponding to an utterance xi 1302 may be generated. The word-level representation 1304 may be generated, for example, through the initial representation generating part 310 in FIG. 3, and more specifically, through the process 500 in FIG. 5. A pooling layer 1306 and a forward neural network 1308 may process the word-level representation 1304 to obtain an emotional prediction result h(xi) 1310. Subsequently, a prediction loss Lemo corresponding to the additional emotion classification task may be calculated based on the emotional prediction result h(xi) 1310 and the emotional label {zi,j}. In an implementation, the prediction loss of the training sample i may be calculated as a multi-class cross-entropy loss, and the prediction loss Lemo corresponding to the additional emotion classification task is calculated by summing the prediction losses of all the training samples, as shown by the above formula (32).
[00106] It is to be understood that performing the additional emotion classification task by using the conversation corpus described with reference to FIG. 12 and performing the additional emotion classification task by using the sentence corpus described with reference to FIG. 13 may be performed separately or together. In the case of being performed together, the prediction loss corresponding to the additional emotion classification task may be calculated based on both the prediction loss obtained by performing the additional emotion classification task by using the conversation corpus and the prediction loss obtained by performing the additional emotion classification task by using the sentence corpus.
[00107] The scoring task in FIG. 11 and the additional emotion classification task in FIG. 12 and/or FIG. 13 may be performed jointly. A total prediction loss L may be calculated by weighted summing the prediction loss Lscore corresponding to the scoring task and the prediction loss Lemo corresponding to the additional emotion classification task, as shown in the following formula:
L = Lscore + α·Lemo (33)
wherein α is a hyper-parameter set by the system.
[00108] People with different personalities may have different emotional change ranges. For example, the emotions of an emotional person may change easily, while the emotions of a quiet person may be difficult to change. For example, an emotional person may easily become very depressed even if he was very happy just a moment before. An embodiment of the present disclosure proposes that a transitional memory-based matching model, such as the transitional memory-based matching model 308 in FIG. 3, may be trained for a predetermined personality to obtain a chatbot with a predetermined personality.
[00109] In an implementation, a transitional memory-based matching model may be trained based on an emotional change range constraint between two adjacent utterances that is associated with a predetermined personality. For example, during the training process of the transitional memory-based matching model, a prediction loss Lrange associated with an emotional change range, such as an emotional change range between two adjacent utterances, may be added to the prediction loss function shown in the above formula (33), and a weight β associated with the prediction loss Lrange may be set, as shown by the following formula:
L = Lscore + α·Lemo + β·Lrange
wherein β is a hyper-parameter set by the system, which may affect the proportion of the prediction loss Lrange associated with the emotional change range to the total prediction loss L. If it is desired to train a chatbot with a large emotional change range, such as a chatbot with an emotional personality, β may be set to be small, so that the proportion of the prediction loss Lrange to the total prediction loss may be small. On the contrary, if it is desired to train a chatbot with a small emotional change range, such as a chatbot with a quiet personality, β may be set to be large, so that the proportion of the prediction loss Lrange to the total prediction loss may be large.
[00110] Emotional states may also be affected by external factors such as weather, health condition, whether a good thing happened, whether a bad thing happened, etc. For example, if a speaker is sick or the weather is bad, he may be down even if he hears good news; while if a speaker is healthy or the weather is good, he may be calm even if he hears bad news. An embodiment of the present disclosure proposes that when providing a response, not only a context in a chat flow, but also external factors that affect an emotional state of a chatbot may be considered.
[00111] In an implementation, an additional emotional representation corresponding to an external factor may be generated and inserted among a set of word-level representations corresponding to a set of utterances in a context of a chat flow, thereby affecting subsequent relevance score generating and further affecting the selection of a candidate response.
[00112] FIG. 14 illustrates an exemplary process 1400 for generating an additional emotional representation according to an embodiment of the present disclosure.
[00113] Firstly, an external factor 1402 that affects an emotional state of a chatbot may be identified, such as weather, health condition, whether a good thing happened, whether a bad thing happened, etc. An external factor such as weather may be related to actual conditions, such as the actual weather conditions of the day, and may be obtained through other applications. External factors such as health condition, whether a good thing happened, and whether a bad thing happened may be manually defined or automatically defined by the system.
[00114] At 1404, the external factor 1402 may be mapped to an emotional state 1406 corresponding to the external factor 1402 through a predefined function. In the case that a V-A model is used to characterize emotional states, the emotional state 1406 may be, for example, a V-A pair.
[00115] Subsequently, through a forward neural network 1408, an additional emotional representation 1410 may be generated based on the emotional state 1406. Herein, a generated emotional representation corresponding to an external factor is referred to as an additional emotional representation. In the case that the emotional state 1406 is a V-A pair, the forward neural network 1408 may generate an additional emotional representation 1410 by converting the emotional state 1406 into a valence vector and an arousal vector, and combining the valence vector and the arousal vector.
[00116] After the additional emotional representation is generated, it may be inserted among a set of word-level representations corresponding to a set of utterances in a context of a chat flow. FIG. 15 illustrates an exemplary process 1500 for inserting an additional emotional representation according to an embodiment of the present disclosure.
[00117] Firstly, a set of word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n corresponding to utterances 1502-1, 1502-2, 1502-3, ..., 1502-n, respectively, in a context 1502 may be obtained. For example, the word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n may be generated through the process 500 in FIG. 5. In an implementation, an additional emotional representation 1506 generated, for example, through the process 1400 of FIG. 14 may be inserted before a representation of a first utterance of a current session, that is, before the word-level representation 1504-1. In another implementation, the additional emotional representation 1506 may be inserted before a word-level representation of the current utterance, that is, before the word-level representation 1504-n.
[00118] An updated context initial representation 1508 may be generated based on the word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n and the additional emotional representation 1506. For example, the updated context initial representation 1508 may be generated through cascading the word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n and the additional emotional representation 1506. An updated context interaction representation 1510 may also be generated based on the word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n and the additional emotional representation 1506. In addition, the word-level representations 1504-1, 1504-2, 1504-3, ..., 1504-n and the additional emotional representation 1506, along with a word-level representation 1514 of a candidate response 1512, may also be used to generate an updated response interaction representation 1516. For example, the updated context interaction representation 1510 and the updated response interaction representation 1516 may be generated through the process 600 in FIG. 6.
[00119] The generation of the updated context initial representation 1508, the updated context interaction representation 1510, and the updated response interaction representation 1516 considers an additional emotional representation corresponding to an external factor. These updated representations may then be used in a subsequent matching process, such as the process 800 in FIG. 8, and a subsequent aggregation process, such as the process 900 in FIG. 9, and ultimately obtain a comprehensive relevance score. Since the generation of the updated context initial representation 1508, the updated context interaction representation 1510, and the updated response interaction representation 1516 considers the additional emotional representation corresponding to the external factor, the additional emotional representations are also taken into account when generating the comprehensive relevance score, so that a calculated relevance score for a candidate response that is consistent with an emotional state of the additional emotional representation will be higher.
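A hedged sketch of the external-factor handling of FIGs. 14-15 follows: the external factor is mapped to a V-A pair, converted through a feed-forward network into an additional emotional representation, and inserted before the first utterance representation. The lookup table values, module names, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ExternalFactorEncoder(nn.Module):
    """Sketch of FIGs. 14-15: map an external factor to a V-A pair, convert it to an
    additional emotional representation, and insert it before the first utterance."""
    def __init__(self, dim=128):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.va_table = {'good_weather': (0.7, 0.4), 'bad_weather': (-0.5, 0.3)}   # assumed mapping

    def forward(self, factor, word_level_reps):
        # word_level_reps: list of (batch, m, dim) tensors for utterances 1..n
        va = torch.tensor([self.va_table[factor]])                    # emotional state as a V-A pair
        extra = self.ffn(va).unsqueeze(1)                             # additional emotional representation
        extra = extra.expand(word_level_reps[0].size(0), -1, -1)
        return [extra] + list(word_level_reps)                        # inserted before the first utterance
```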
[00120] In another implementation, a basic emotional state of a chatbot may also be determined based on external factors. For example, when an external factor is "good weather", the basic emotional state of the chatbot may be determined as "high mood"; while when the external factor is "bad weather", the basic emotional state of the chatbot may be determined as "low mood". Then, a threshold corresponding to the basic emotional state may be set for each candidate response. In some embodiments, only a valence threshold may be set. Taking a candidate response "ha-ha" as an example, the valence threshold corresponding to "high mood" may be "0.1", while the valence threshold corresponding to "low mood" may be "0.8", for example. In this case, when the basic emotional state determined based on external factors is "high mood", the candidate response "ha-ha" may be provided as long as a valence value of the emotional state of the chatbot predicted according to the context in the session is greater than "0.1"; while when the basic emotional state determined based on external factors is "low mood", the candidate response "ha-ha" may be provided only when the predicted valence value of the emotional state of the chatbot is greater than "0.8". [00121] In addition, after the basic emotional state of the chatbot is determined based on external factors, the emotional state of the chatbot may also be adapted according to the determined basic emotional state. For example, when the basic emotional state is "high mood", the valence value of the emotional state of the chatbot predicted according to the context in the session may be increased, for example, multiplied by a coefficient greater than 1; when the basic emotional state is "low mood", the valence value of the emotional state of the chatbot predicted according to the context in the session may be reduced, for example, multiplied by a coefficient less than 1.
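A minimal sketch of the threshold-gating and valence-adaptation logic described in paragraphs [00120]-[00121] is shown below. The threshold values follow the "ha-ha" example above, and the adaptation coefficients are assumptions chosen only for illustration.

```python
def may_provide(candidate, predicted_valence, basic_mood):
    """Gate a candidate response by a valence threshold tied to the basic emotional state.
    Thresholds follow the 'ha-ha' example above; real values would be configured per response."""
    thresholds = {'high mood': 0.1, 'low mood': 0.8}
    return predicted_valence > thresholds[basic_mood]

def adapt_valence(predicted_valence, basic_mood, up=1.2, down=0.8):
    """Adapt the predicted valence to the basic emotional state (coefficients are assumptions)."""
    return predicted_valence * (up if basic_mood == 'high mood' else down)

print(may_provide('ha-ha', 0.3, 'high mood'))   # True: 0.3 clears the 0.1 threshold
print(may_provide('ha-ha', 0.3, 'low mood'))    # False: 0.3 does not reach 0.8
```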
[00122] The foregoing describes different ways in which the chatbot considers external factors that affect emotional states when providing responses. These ways may make the emotional states of responses provided throughout the session consistent with the basic emotional state determined by the external factors. It is to be understood that the foregoing ways are merely exemplary, and the embodiments of the present disclosure are not limited thereto; the emotional states of responses provided by the chatbot may be made consistent with the basic emotional state determined by the external factors in any other way.
[00123] A transitional memory-based matching model according to an embodiment of the present disclosure may support multi-modality inputs. Each utterance that is an input of a transitional memory-based matching model may employ at least one of the following modalities: text, voice, facial expressions, and gestures. For example, when a user uses a terminal device to chat with a chatbot, a microphone on the terminal device may capture voice, speech recognition software may convert the voice into text, or the user may directly enter text. In addition, a camera on the terminal device may capture the user's facial expressions, body gestures, and hand gestures. Inputs of different modalities for a particular utterance may be converted into corresponding representations. These representations may be combined together through an early-fusion strategy or a late-fusion strategy to generate a context initial representation and a context interaction representation. Herein, the early-fusion strategy refers to combining representations of various modality inputs for each utterance into a comprehensive representation of the utterance, and then generating a context initial representation and a context interaction representation based on the comprehensive representation of the utterance and comprehensive representations of other utterances. The late-fusion strategy refers to using representations of various modality inputs of each utterance to generate intermediate initial representations and intermediate interaction representations in respective modalities, and then generating a context initial representation and a context interaction representation by combining the generated intermediate initial representations and intermediate interaction representations, respectively.
[00124] FIG. 16 illustrates an exemplary process 1600 for combining multi-modality inputs through an early-fusion strategy according to an embodiment of the present disclosure.
[00125] Assume that a transitional memory-based matching model according to an embodiment of the present disclosure may support m modality inputs. In FIG. 16, an utterance 1 1602 may have, for example, a modality 1 input 1602-1, a modality 2 input 1602-2, ..., a modality m input 1602-m. These inputs may be converted into corresponding representations, such as a representation 1 of utterance 1 1604-1, a representation 2 of utterance 1 1604-2, ..., a representation m of utterance 1 1604-m. Similarly, an utterance 2 1606 may, for example, have a modality 1 input 1606-1, a modality 2 input 1606-2, ..., a modality m input 1606-m. These inputs may be converted into corresponding representations, such as a representation 1 of utterance 2 1608-1, a representation 2 of utterance 2 1608-2, ..., a representation m of utterance 2 1608-m. It is to be understood that although it is shown in FIG. 16 that both utterance 1 and utterance 2 have m modality inputs, the number of modality inputs that utterance 1 and utterance 2 have may be less than m. In the absence of a certain modality input, the modality input and the corresponding representation may be initialized to zero.
[00126] The representation 1 of utterance 1 1604-1, the representation 2 of utterance 1 1604-2, ..., the representation m of utterance 1 1604-m may be combined together to generate a comprehensive representation of utterance 1 1610. Similarly, the representation 1 of utterance 2 1608-1, the representation 2 of utterance 2 1608-2, ..., the representation m of utterance 2 1608-m may be combined together to generate a comprehensive representation of utterance 2 1612. A context initial representation 1614 and a context interaction representation 1616 may be generated based on the comprehensive representation of utterance 1 1610, the comprehensive representation of utterance 2 1612, and possible comprehensive representations (not shown) of other utterances. The context initial representation 1614 and the context interaction representation 1616 may be generated, for example, through the process 500 in FIG. 5 and the process 600 in FIG. 6, respectively. The context initial representation 1614 and the context interaction representation 1616 may be used in subsequent matching and aggregation processes, and finally engage in generating a comprehensive relevance score indicating relevance between a candidate response and a context.
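A short sketch of the early-fusion strategy follows; the modality dimensions and the concatenation-plus-projection combination are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Sketch of the early-fusion strategy: combine per-modality representations of one
    utterance into a single comprehensive representation before context-level encoding."""
    def __init__(self, modality_dims=(128, 64, 32), dim=128):
        super().__init__()
        self.proj = nn.Linear(sum(modality_dims), dim)

    def forward(self, modality_reps):
        # modality_reps: list of (batch, d_i) tensors; a missing modality may be a zero tensor
        return self.proj(torch.cat(modality_reps, dim=-1))            # comprehensive utterance representation

# The comprehensive representations of utterance 1, utterance 2, ... then feed the
# processes of FIGs. 5 and 6 to produce the context initial and interaction representations.
```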
[00127] FIG. 17 illustrates an exemplary process 1700 for combining multi-modality inputs through a late-fusion strategy according to an embodiment of the present disclosure. [00128] Assume that a transitional memory-based matching model according to an embodiment of the present disclosure may support m modality inputs. In FIG. 17, an utterance 1 may have, for example, a modality 1 input of utterance 1 1702-1, a modality 2 input of utterance 1 1702-2, ..., a modality m input of utterance 1 1702-m. These inputs may be converted into corresponding representations, respectively, such as a representation 1 of utterance 1 1704-1, a representation 2 of utterance 1 1704-2, ..., a representation m of utterance 1 1704-m. Similarly, an utterance 2 1706 may have, for example, a modality 1 input of utterance 2 1706-1, a modality 2 input of utterance 2 1706-2, ..., a modality m input of utterance 2 1706-m. These inputs may be converted into corresponding representations, respectively, such as a representation 1 of utterance 2 1708-1, a representation 2 of utterance 2 1708-2, ..., a representation m of utterance 2 1708-m. It is to be understood that although it is shown in FIG. 17 that both utterance 1 and utterance 2 have m modality inputs, the number of modality inputs that utterance 1 and utterance 2 have may be less than m. In the absence of a certain modality input, the modality input and the corresponding representation may be initialized to zero.
[00129] A representation of each modality input of each utterance may be used to generate an intermediate initial representation and an intermediate interaction representation in the respective modality. For example, an intermediate initial representation corresponding to modality 1 1710-1 and an intermediate interaction representation corresponding to modality 1 1712-1 may be generated based on the representation 1 of utterance 1 1704-1, the representation 1 of utterance 2 1708-1, and representations of possible other utterances corresponding to modality 1 (not shown); an intermediate initial representation corresponding to modality 2 1710-2 and an intermediate interaction representation corresponding to modality 2 1712-2 may be generated based on the representation 2 of utterance 1 1704-2, the representation 2 of utterance 2 1708-2, and representations of possible other utterances corresponding to modality 2 (not shown); ...; an intermediate initial representation corresponding to modality m 1710-m and an intermediate interaction representation corresponding to modality m 1712-m may be generated based on the representation m of utterance 1 1704-m, the representation m of utterance 2 1708-m, and representations of possible other utterances corresponding to modality m (not shown). The intermediate initial representations 1710-1, 1710-2, ..., 1710-m may be generated, for example, through a process similar to the process 500 in FIG. 5 that is used to generate the context initial representation, and the intermediate interaction representations 1712-1, 1712-2, ..., 1712-m may be generated, for example, through a process similar to the process 600 in FIG. 6 that is used to generate the context interaction representation.
[00130] Then, a context initial representation 1714 may be generated through combining the intermediate initial representation 1710-1, the intermediate initial representation 1710-2, ..., the intermediate initial representation 1710-m, and a context interaction representation 1716 may be generated through combining the intermediate interaction representation 1712-1, the intermediate interaction representation 1712-2, ..., the intermediate interaction representation 1712-m. The context initial representation 1714 and the context interaction representation 1716 may be used in subsequent matching and aggregation processes, and finally engage in generating a comprehensive relevance score indicating relevance between a candidate response and a context.
[00131] It is to be understood that although only two utterances are shown in FIGs. 16 and 17, the processes for combining the multi-modality inputs through the early-fusion strategy and the late-fusion strategy according to the embodiments of the present disclosure are not limited to a specific number of utterances, but rather may be applied to any number of utterances in a similar manner. In addition, the processes for combining the multi-modality inputs through the early-fusion strategy and the late-fusion strategy shown in FIGs. 16 and 17, respectively, are only exemplary, and the embodiments of the present disclosure are not limited thereto. For example, for the late-fusion strategy, a context initial relevance representation and a context interaction relevance representation may be obtained by firstly using a representation of each modality input of each utterance to generate an intermediate initial relevance representation and an intermediate interaction relevance representation in the respective modality, and then combining the generated intermediate initial relevance representations and intermediate interaction relevance representations, respectively. The context initial relevance representation and the context interaction relevance representation may engage in generating a comprehensive relevance score indicating relevance between the candidate response and the context.
[00132] According to an embodiment of the present disclosure, after a candidate response to be provided to a user is selected, a chatbot may present the response based on an emotional state of the selected candidate response. In some embodiments, the chatbot may express, in a corresponding manner, the emotional state of the selected candidate response based on a modality of the response. For example, in the case where the response is a voice response, when its emotional state is "happy", the chatbot may present the response with a fast speech rate or a high tone. In addition, the emotional state of the response may be expressed by additionally providing other multi-modality signals, for example, facial expressions, body gestures, or hand gestures of the chatbot. In an implementation, when presenting a response, a corresponding light may be provided at the same time to express the emotional state of the response.
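As an illustration only, a sketch of how an emotional state could be turned into presentation parameters for a voice response is given below; the state names, numeric values, and the parameter set are assumptions of this sketch rather than the disclosed implementation.

```python
# Hypothetical presentation parameters per emotional state.
PRESENTATION = {
    "happy":   {"speech_rate": 1.3, "pitch_shift": +2, "gesture": "smile"},
    "neutral": {"speech_rate": 1.0, "pitch_shift": 0, "gesture": "none"},
    "sad":     {"speech_rate": 0.8, "pitch_shift": -2, "gesture": "head_tilt"},
}

def render_voice_response(text: str, emotional_state: str) -> dict:
    """Attach presentation parameters to a selected response before handing
    it to a (hypothetical) text-to-speech and animation front end."""
    params = PRESENTATION.get(emotional_state, PRESENTATION["neutral"])
    return {"text": text, **params}

print(render_voice_response("Cheer up! I still like to see you laugh.", "happy"))
```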
[00133] FIG. 18 illustrates an exemplary scenario 1800 for expressing emotional states of responses through light according to an embodiment of the present disclosure. This scenario may occur between a user and a smart speaker. The smart speaker may be equipped with a chatbot implemented according to the embodiments of the present disclosure. The smart speaker may respond to the user's voice input by providing a voice response and corresponding light.
[00134] At 1802, the user may say "So annoying!". At 1804, the smart speaker may reply by providing a voice response: "Cheer up! I still like to see you laugh." The emotional state of the voice response at 1804 may have a relatively positive valence, for example, a valence value of "0.9", so the light provided in association with it may have a strong brightness.
[00135] At 1806, the user may then say "But I don't want to laugh now." At 1808, the smart speaker may reply by providing a voice response: "You should learn to laugh. Everyone can do it." The emotional state of the voice response at 1808 may have a generally positive valence, for example, a valence value of "0.6", so the light provided in association with it may have a weak brightness.
[00136] At 1810, the user may continue to say "I can't do it." At 1812, the smart speaker may reply by providing a voice response: "Let me make you happy!" The emotional state of the voice response at 1812 may have a relatively positive valence, for example, a valence value of "0.9", so the light provided in association with it may have a strong brightness.
[00137] FIG. 18 shows an example of expressing different emotional states of a response through different light brightness. It is to be understood that the embodiments of the present disclosure are not limited thereto; for example, in the case of expressing emotional states through light, emotional states of responses may also be expressed through the color, duration, etc. of the light. In addition, the emotional states of the responses may be expressed by any other multi-modality signals.

[00138] According to an embodiment of the present disclosure, a selection of a candidate response may be based on semantic relevance and emotional relevance between a candidate response and a context. When determining the semantic relevance and the emotional relevance, messages received and responses sent by a chatbot are collectively considered as utterances in the context, and no distinction is made between the received messages and the sent responses. This may enable emotional states to be shared between the chatbot and a user, and empathy to be achieved between the chatbot and the user. Further, the chatbot may drive the user's emotional state toward positive valence by providing a more positive response, such as a response with a higher valence value, thereby guiding the user to an emotional state with a positive valence before the end of the session.
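By way of illustration, a possible mapping from a valence value to a light signal, consistent with the brightness levels in the scenario of FIG. 18, is sketched below; the thresholds, brightness scale, color, and duration are illustrative assumptions only.

```python
def light_for_valence(valence: float) -> dict:
    """Map an emotional valence in [0, 1] to an illustrative light signal.

    A valence around 0.9 yields a strong brightness and a valence around 0.6
    a weaker brightness, matching the scenario above; color and duration are
    additional, purely hypothetical, expressive dimensions.
    """
    brightness = int(round(max(0.0, min(1.0, valence)) * 100))  # percent
    color = "warm_yellow" if valence >= 0.5 else "cool_blue"
    return {"brightness": brightness, "color": color, "duration_s": 2.0}

for v in (0.9, 0.6):
    print(v, light_for_valence(v))
```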
[00139] FIG. 19 is a flowchart of an exemplary method 1900 for providing a response in automated chatting according to an embodiment of the present disclosure.
[00140] At step 1910, a message may be obtained in a chat flow.
[00141] At step 1920, a context associated with the message may be determined, the context comprising a set of utterances, the set of utterances comprising the message.

[00142] At step 1930, for each candidate response of a set of candidate responses, the candidate response may be scored based at least on information change between adjacent utterances among the set of utterances and the candidate response.
[00143] At step 1940, a highest-scored candidate response among the set of candidate responses may be provided in the chat flow.
[00144] In an implementation, the information change may comprise at least one of semantic change and emotional change.
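The sketch below illustrates one plausible way to compute such per-pair information change from precomputed utterance representations; the use of element-wise differences, the vector sizes, and the function names are assumptions, not the claimed implementation.

```python
import numpy as np

def information_change(semantic_reprs, emotional_reprs):
    """Compute semantic and emotional change between every two adjacent
    utterances, given one semantic vector and one emotional vector per
    utterance (lists of arrays of shape (d,)).

    Element-wise differences are used here only as an illustrative notion of
    "change"; a learned transition function could be substituted.
    """
    sem_change = [b - a for a, b in zip(semantic_reprs, semantic_reprs[1:])]
    emo_change = [b - a for a, b in zip(emotional_reprs, emotional_reprs[1:])]
    return sem_change, emo_change

# Hypothetical usage for a context of 4 utterances
sem = [np.random.rand(64) for _ in range(4)]
emo = [np.random.rand(8) for _ in range(4)]
sem_delta, emo_delta = information_change(sem, emo)
print(len(sem_delta), len(emo_delta))  # 3 change vectors each
```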
[00145] The scoring may comprise at least one of: generating a semantic relevance score for the candidate response based at least on the semantic change between adjacent utterances among the set of utterances and the candidate response; and generating an emotional relevance score for the candidate response based at least on the emotional change between adjacent utterances among the set of utterances and the candidate response.
[00146] The scoring may comprise: generating a comprehensive relevance score for the candidate response based on the semantic relevance score and the emotional relevance score.
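As one illustrative possibility, the comprehensive relevance score could be a weighted combination of the semantic and emotional relevance scores; the linear blend and the default weight below are assumptions of this sketch, and a learned combination layer would be another option.

```python
def comprehensive_score(semantic_score: float,
                        emotional_score: float,
                        semantic_weight: float = 0.7) -> float:
    """Blend a semantic relevance score and an emotional relevance score
    into a single comprehensive relevance score (illustrative only)."""
    return semantic_weight * semantic_score + (1.0 - semantic_weight) * emotional_score

# Hypothetical candidates with (semantic, emotional) scores
candidates = {"resp_a": (0.82, 0.40), "resp_b": (0.75, 0.90)}
best = max(candidates, key=lambda r: comprehensive_score(*candidates[r]))
print(best)  # the highest-scored candidate would be provided in the chat flow
```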
[00147] In an implementation, the scoring may comprise: generating a context interaction representation corresponding to the context based on information change between every two adjacent utterances of the set of utterances; generating a candidate response interaction representation corresponding to the candidate response based on information change between every two adjacent utterances among the set of utterances and the candidate response; obtaining an interaction relevance representation through matching the context interaction representation with the candidate response interaction representation; and generating a relevance score for the candidate response based at least on the interaction relevance representation.
[00148] The scoring may further comprise: generating a context initial representation corresponding to the context based on a representation of each utterance of the set of utterances; generating a candidate response initial representation corresponding to the candidate response; obtaining an initial relevance representation through matching the context initial representation with the candidate response initial representation; and generating a relevance score for the candidate response based on a combination of the initial relevance representation and the interaction relevance representation.
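A minimal sketch of this two-branch scoring, in which an initial relevance representation and an interaction relevance representation are combined into one score, is shown below; the element-wise matching operator, the concatenation, and the final linear scorer are assumptions made for illustration.

```python
import numpy as np

def match(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Toy matching operator producing a relevance representation from two
    vectors; a real model could use attention or an interaction matrix."""
    return a * b  # element-wise product as an illustrative match feature

def score_candidate(context_initial, context_interaction,
                    response_initial, response_interaction, w):
    """Combine an initial relevance representation and an interaction
    relevance representation into a single relevance score.  Concatenation
    followed by a dot product with weights `w` is an assumption of this
    sketch, not the disclosed aggregation."""
    initial_rel = match(context_initial, response_initial)
    interaction_rel = match(context_interaction, response_interaction)
    features = np.concatenate([initial_rel, interaction_rel])
    return float(features @ w)

# Hypothetical usage with 64-dimensional representations
d = 64
w = np.random.rand(2 * d)
score = score_candidate(np.random.rand(d), np.random.rand(d),
                        np.random.rand(d), np.random.rand(d), w)
print(score)
```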
[00149] The information change may comprise semantic change, and the context interaction representation may include a context semantic interaction representation, the candidate response interaction representation may include a candidate response semantic interaction representation, the interaction relevance representation may include a semantic interaction relevance representation, the context initial representation may include a context semantic initial representation, the candidate response initial representation may include a candidate response semantic initial representation, the initial relevance representation may include a semantic initial relevance representation, and the relevance score may be a semantic relevance score.
[00150] The information change may comprise emotional change, and the context interaction representation may include a context emotional interaction representation, the candidate response interaction representation may include a candidate response emotional interaction representation, the interaction relevance representation may include an emotional interaction relevance representation, the context initial representation may include a context emotional initial representation, the candidate response initial representation may include a candidate response emotional initial representation, the initial relevance representation may include an emotional initial relevance representation, and the relevance score may be an emotional relevance score.
[00151] In an implementation, the method 1900 may further comprise: identifying external factors that affect emotional states; and adding the external factors into the context.

[00152] In an implementation, at least one utterance of the set of utterances may employ at least one of the following modalities: text, voice, facial expressions, and gestures.
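Purely as an illustration of identifying external factors that affect emotional states and adding them into the context (paragraph [00151] above), a sketch follows; the chosen factors (time of day, a weather placeholder) and their encoding as an extra pseudo-utterance are assumptions of this sketch.

```python
import datetime

def add_external_factors(context_utterances):
    """Append hypothetical external factors that may affect emotional states
    (here: time of day and a placeholder weather flag) to the context as an
    extra pseudo-utterance."""
    hour = datetime.datetime.now().hour
    factors = {"time_of_day": "evening" if hour >= 18 else "daytime",
               "weather": "unknown"}
    return context_utterances + [f"[external] {factors}"]

print(add_external_factors(["So annoying!", "Cheer up! I still like to see you laugh."]))
```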
[00153] In an implementation, the method 1900 may further comprise: presenting the highest-scored candidate response based on an emotional state of the candidate response.

[00154] In an implementation, the scoring may be performed through a transitional memory-based matching model, the transitional memory-based matching model being optimized through an additional emotion classification task during a training process.

[00155] In an implementation, the scoring may be performed through a transitional memory-based matching model, the transitional memory-based matching model being trained based on an emotional change range constraint between two adjacent utterances that is associated with a predetermined personality.
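As an illustration of how the transitional memory-based matching model could be optimized with an additional emotion classification task and an emotional change range constraint, a sketch of a multi-task training objective is given below; the loss weights, the form of the constraint, and all tensor shapes are assumptions made for this sketch, not the disclosed training procedure.

```python
import torch
import torch.nn.functional as F

def training_loss(match_logits, match_labels,
                  emotion_logits, emotion_labels,
                  utterance_valences, max_change=0.5,
                  aux_weight=0.3, constraint_weight=0.1):
    """Illustrative multi-task objective for a matching model.

    - match_logits / match_labels: the response-selection (matching) task.
    - emotion_logits / emotion_labels: an additional emotion classification
      task used to regularize the model during training.
    - utterance_valences: (batch, turns) predicted valence per utterance;
      adjacent turns are penalized when their difference exceeds max_change,
      standing in for an emotional change range constraint associated with a
      predetermined personality.
    """
    match_loss = F.binary_cross_entropy_with_logits(match_logits, match_labels)
    emotion_loss = F.cross_entropy(emotion_logits, emotion_labels)
    change = (utterance_valences[:, 1:] - utterance_valences[:, :-1]).abs()
    constraint_loss = F.relu(change - max_change).mean()
    return match_loss + aux_weight * emotion_loss + constraint_weight * constraint_loss

# Hypothetical usage: batch of 8, 5 turns, 7 emotion classes
B, T, C = 8, 5, 7
loss = training_loss(torch.randn(B), torch.rand(B).round(),
                     torch.randn(B, C), torch.randint(0, C, (B,)),
                     torch.rand(B, T))
print(float(loss))
```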
[00156] It is to be understood that the method 1900 may further comprise any steps/processes for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above.
[00157] FIG. 20 illustrates an exemplary apparatus 2000 for providing a response in automated chatting according to an embodiment of the present disclosure. The apparatus 2000 may comprise: a message obtaining module 2010, for obtaining a message in a chat flow; a context determining module 2020, for determining a context associated with the message, the context comprising a set of utterances, the set of utterances comprising the message; a scoring module 2030, for scoring, for each candidate response of a set of candidate responses, the candidate response based at least on information change between adjacent utterances among the set of utterances and the candidate response; and a response providing module 2040, for providing a highest-scored candidate response among the set of candidate responses in the chat flow.
[00158] In an implementation, the information change may comprise at least one of semantic change and emotional change.
[00159] The scoring module 2030 may be further configured for performing at least one of: generating a semantic relevance score for the candidate response based at least on the semantic change between adjacent utterances among the set of utterances and the candidate response; and generating an emotional relevance score for the candidate response based at least on the emotional change between adjacent utterances among the set of utterances and the candidate response.
[00160] In an implementation, the apparatus 2000 may further comprise: an external factor identifying module, for identifying external factors that affect emotional states; and an external factor adding module, for adding the external factors into the context.
[00161] In an implementation, the scoring module 2030 may comprise a transitional memory-based matching model, the transitional memory-based matching model being optimized through an additional emotion classification task during a training process.

[00162] In an implementation, the scoring module 2030 may comprise a transitional memory-based matching model, the transitional memory-based matching model being trained based on an emotional change range constraint between two adjacent utterances that is associated with a predetermined personality.
[00163] It is to be understood that the apparatus 2000 may further comprise any other modules configured for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above.
[00164] FIG. 21 illustrates an exemplary apparatus 2100 for providing a response in automated chatting according to an embodiment of the present disclosure.
[00165] The apparatus 2100 may comprise at least one processor 2110. The apparatus 2100 may further comprise a memory 2120 coupled with the processor 2110. The memory 2120 may store computer-executable instructions that, when executed, cause the processor 2110 to perform any operations of the method for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above.

[00166] The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for providing a response in automated chatting according to the embodiments of the present disclosure as mentioned above.
[00167] It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
[00168] It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
[00169] Processors are described in connection with various apparatus and methods. These processors can be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gate logic, discrete hardware circuitry, or other suitable processing components configured to perform the various functions described in this disclosure. The functions of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a microcontroller, a DSP, or other suitable platforms.
[00170] Software should be considered broadly to represent instructions, instruction sets, code, code segments, program code, programs, subroutines, software modules, applications, software applications, software packages, routines, subroutines, objects, running threads, processes, functions, and the like. Software can reside on computer readable medium. Computer readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disk, a smart card, a flash memory device, a random access memory (RAM), a read only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).
[00171] The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.

Claims

1. A method for providing a response in automated chatting, comprising: obtaining a message in a chat flow; determining a context associated with the message, the context comprising a set of utterances, the set of utterances comprising the message; for each candidate response of a set of candidate responses, scoring the candidate response based at least on information change between adjacent utterances among the set of utterances and the candidate response; and providing a highest-scored candidate response among the set of candidate responses in the chat flow.
2. The method of claim 1, wherein the information change comprises at least one of semantic change and emotional change.
3. The method of claim 2, wherein the scoring comprises at least one of: generating a semantic relevance score for the candidate response based at least on the semantic change between adjacent utterances among the set of utterances and the candidate response; and generating an emotional relevance score for the candidate response based at least on the emotional change between adjacent utterances among the set of utterances and the candidate response.
4. The method of claim 3, wherein the scoring comprises: generating a comprehensive relevance score for the candidate response based on the semantic relevance score and the emotional relevance score.
5. The method of claim 1, wherein the scoring comprises: generating a context interaction representation corresponding to the context based on information change between every two adjacent utterances of the set of utterances; generating a candidate response interaction representation corresponding to the candidate response based on information change between every two adjacent utterances among the set of utterances and the candidate response; obtaining an interaction relevance representation through matching the context interaction representation with the candidate response interaction representation; and generating a relevance score for the candidate response based at least on the interaction relevance representation.
6. The method of claim 5, wherein the scoring further comprises: generating a context initial representation corresponding to the context based on a representation of each utterance of the set of utterances; generating a candidate response initial representation corresponding to the candidate response; obtaining an initial relevance representation through matching the context initial representation with the candidate response initial representation; and generating a relevance score for the candidate response based on a combination of the initial relevance representation and the interaction relevance representation.
7. The method of claim 6, wherein the information change comprises semantic change, and the context interaction representation includes a context semantic interaction representation, the candidate response interaction representation includes a candidate response semantic interaction representation, the interaction relevance representation includes a semantic interaction relevance representation, the context initial representation includes a context semantic initial representation, the candidate response initial representation includes a candidate response semantic initial representation, the initial relevance representation includes a semantic initial relevance representation, and the relevance score is a semantic relevance score.
8. The method of claim 6, wherein the information change comprises emotional change, and the context interaction representation includes a context emotional interaction representation, the candidate response interaction representation includes a candidate response emotional interaction representation, the interaction relevance representation includes an emotional interaction relevance representation, the context initial representation includes a context emotional initial representation, the candidate response initial representation includes a candidate response emotional initial representation, the initial relevance representation includes an emotional initial relevance representation, and the relevance score is an emotional relevance score.
9. The method of claim 1, further comprising: identifying external factors that affect emotional states; and adding the external factors into the context.
10. The method of claim 1, wherein at least one utterance of the set of utterances employs at least one of the following modalities: text, voice, facial expressions, and gestures.
11. The method of claim 1, further comprising: presenting the highest-scored candidate response based on an emotional state of the candidate response.
12. The method of claim 1, wherein the scoring is performed through a transitional memory-based matching model, the transitional memory-based matching model being optimized through an additional emotion classification task during a training process.
13. The method of claim 1, wherein the scoring is performed through a transitional memory-based matching model, the transitional memory-based matching model being trained based on an emotional change range constraint between two adjacent utterances that is associated with a predetermined personality.
14. An apparatus for providing a response in automated chatting, comprising: a message obtaining module, for obtaining a message in a chat flow; a context determining module, for determining a context associated with the message, the context comprising a set of utterances, the set of utterances comprising the message; a scoring module, for scoring, for each candidate response of a set of candidate responses, the candidate response based at least on information change between adjacent utterances among the set of utterances and the candidate response; and a response providing module, for providing a highest-scored candidate response among the set of candidate responses in the chat flow.
15. An apparatus for providing a response in automated chatting, comprising: at least one processor; and a memory storing computer executable instructions that, when executed, cause the at least one processor to: obtain a message in a chat flow, determine a context associated with the message, the context comprising a set of utterances, the set of utterances comprising the message, for each candidate response of a set of candidate responses, score the candidate response based at least on information change between adjacent utterances among the set of utterances and the candidate response, and provide a highest-scored candidate response among the set of candidate responses in the chat flow.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911036507.1 2019-10-29
CN201911036507.1A CN112750430A (en) 2019-10-29 2019-10-29 Providing responses in automatic chat

Publications (1)

Publication Number Publication Date
WO2021086589A1 (en)

Family

ID=73040331

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/055296 WO2021086589A1 (en) 2019-10-29 2020-10-13 Providing a response in automated chatting

Country Status (2)

Country Link
CN (1) CN112750430A (en)
WO (1) WO2021086589A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9947319B1 (en) * 2016-09-27 2018-04-17 Google Llc Forming chatbot output based on user state
US11729120B2 (en) * 2017-03-16 2023-08-15 Microsoft Technology Licensing, Llc Generating responses in automated chatting
CN109690602A (en) * 2017-05-26 2019-04-26 微软技术许可有限责任公司 Products Show is provided in automatic chatting
US20200137001A1 (en) * 2017-06-29 2020-04-30 Microsoft Technology Licensing, Llc Generating responses in automated chatting
CN108960402A (en) * 2018-06-11 2018-12-07 上海乐言信息科技有限公司 A kind of mixed strategy formula emotion towards chat robots pacifies system
CN109977201B (en) * 2019-01-28 2023-09-22 平安科技(深圳)有限公司 Machine chat method and device with emotion, computer equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018118546A1 (en) * 2016-12-21 2018-06-28 Microsoft Technology Licensing, Llc Systems and methods for an emotionally intelligent chat bot
US20180196796A1 (en) * 2017-01-12 2018-07-12 Microsoft Technology Licensing, Llc Systems and methods for a multiple topic chat bot
WO2019000170A1 (en) * 2017-06-26 2019-01-03 Microsoft Technology Licensing, Llc Generating responses in automated chatting

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"12th European Conference on Computer Vision, ECCV 2012", vol. 11108, 1 January 2018, SPRINGER BERLIN HEIDELBERG, Berlin Germany, ISBN: 978-3-642-38170-6, ISSN: 0302-9743, article XINGWU LU ET AL: "Memory-Based Matching Models for Multi-turn Response Selection in Retrieval-Based Chatbots : 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26-30, 2018, Proceedings, Part I", pages: 269 - 278, XP055766266, 031559, DOI: 10.1007/978-3-319-99495-6_23 *
"Genetic and Evolutionary Computing : Proceedings of the Twelfth International Conference on Genetic and Evolutionary Computing 2019; Changzhou, Jiangsu, China", vol. 927, March 2019, SPRINGER, Berlin, ISSN: 2194-5357, article SHAFQUAT HUSSAIN ET AL: "A Survey on Conversational Agents/Chatbots Classification and Design Techniques : Proceedings of the Workshops of the 33rd International Conference on Advanced Information Networking and Applications (WAINA-2019)", pages: 946 - 956, XP055766707, DOI: 10.1007/978-3-030-15035-8_93 *
QIU LISONG ET AL: "What If Bots Feel Moods?", PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, ACM, NEW YORK, NY, USA, 25 July 2020 (2020-07-25), pages 1161 - 1170, XP058465148, ISBN: 978-1-4503-8016-4, DOI: 10.1145/3397271.3401108 *
SHUM HEUNG-YEUNG ET AL: "From Eliza to XiaoIce: challenges and opportunities with social chatbots", FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, ZHEJIANG UNIVERSITY PRESS, HEIDELBERG, vol. 19, no. 1, 8 January 2018 (2018-01-08), pages 10 - 26, XP036506112, ISSN: 2095-9184, [retrieved on 20180108], DOI: 10.1631/FITEE.1700826 *
XIANGYANG ZHOU ET AL: "Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network", PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (VOLUME 1: LONG PAPERS), 1 January 2018 (2018-01-01), Stroudsburg, PA, USA, pages 1118 - 1127, XP055766636, DOI: 10.18653/v1/P18-1103 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870902A (en) * 2021-10-27 2021-12-31 安康汇智趣玩具科技技术有限公司 Emotion recognition system, device and method for voice interaction plush toy
CN113870902B (en) * 2021-10-27 2023-03-14 安康汇智趣玩具科技技术有限公司 Emotion recognition system, device and method for voice interaction plush toy

Also Published As

Publication number Publication date
CN112750430A (en) 2021-05-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20800480

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20800480

Country of ref document: EP

Kind code of ref document: A1