WO2020087534A1 - Generating response in conversation

Info

Publication number: WO2020087534A1
Application number: PCT/CN2018/113815
Authority: WO
Inventors: Yongfang MA; Yasuhiro TAKASHITA; Can XU; Huang Hu; Kazuna TSUBOI; Mina MIYOSHI
Original assignee: Microsoft Technology Licensing, Llc
Priority date: 2018-11-02
Filing date: 2018-11-02
Publication date: 2020-05-07
Also published as: CN111971670A

Abstract

A method and apparatus for generating a response in a conversation are provided. At least one signal may be received from at least one signal source. Text information may be generated based on the at least one received signal. A response mode may be determined based at least on the text information. In some implementations, the response mode may indicate an expression style of a response to be generated. The response may be generated based at least on the text information and the response mode.

Description

[Title established by the ISA under Rule 37.2] GENERATING RESPONSE IN CONVERSATION

BACKGROUND

Artificial Intelligence (AI) chatbots are becoming more and more popular, and are being applied in an increasing number of scenarios. The chatbot is designed to simulate conversation with a human, and may chat with users by text, speech, image, etc. Generally, the chatbot may scan for keywords within a message input by a user or apply natural language processing on the message, and provide a response with the most matching keywords or the most similar wording pattern to the user.

SUMMARY

This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Embodiments of the present disclosure propose a method and apparatus for generating a response in a conversation. At least one signal may be received from at least one signal source. Text information may be generated based on the at least one received signal. A response mode may be determined based at least on the text information. The response mode may indicate an expression style of a response to be generated. The response may be generated based at least on the text information and the response mode.

It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosed aspects will hereinafter be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.

FIG. 1 illustrates an exemplary implementation architecture of a conversation according to an embodiment.

FIG. 2 illustrates an exemplary general process for generating a response based on received signals according to an embodiment.

FIG. 3 is a block diagram of an exemplary response generation system according to an embodiment.

FIG. 4 illustrates an exemplary response mode determination model according to an embodiment.

FIG. 5 illustrates an exemplary response generation model with a text attention model according to an embodiment.

FIG. 6 illustrates an exemplary process for generating a response based on speech signals or text signals according to an embodiment.

FIG. 7 illustrates an exemplary process for generating a response based on image signals according to an embodiment.

FIG. 8 illustrates an exemplary spatial attention model according to an embodiment.

FIG. 9 illustrates an exemplary adaptive attention model according to an embodiment.

FIG. 10 illustrates an exemplary process for generating a response based on audio signals according to an embodiment.

FIG. 11 illustrates an exemplary process for generating a response based on an image signal and an audio signal according to an embodiment.

FIG. 12 illustrates an exemplary conversation window for a conversation between a user and a chatbot according to an embodiment.

FIG. 13 illustrates a flowchart of an exemplary method for generating a response in a conversation according to an embodiment.

FIG. 14 illustrates an exemplary apparatus for generating a response in a conversation according to an embodiment.

FIG. 15 illustrates an exemplary apparatus for generating a response in a conversation according to an embodiment.

DETAILED DESCRIPTION

The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.

A chatbot may conduct various conversations with users, such as making chitchat with the users, performing tasks requested by the users, and so on. Conventionally, the chatbot may start a topic randomly or maintain a topic depending only on conversation history, without considering environment signals and conversational patterns or expression styles. Therefore, the chatbot may be less friendly to another participant, e.g., a human being, in a conversation.

In many cases, e.g., in a chitchat situation, it would be desirable for the chatbot to accompany users when they feel bored or lonely. Accordingly, a chatbot may be designed to put forward attractive topics to talk about, so as to be more friendly to human users. Moreover, in a human-to-human chitchat, a topic is often triggered by environment signals, e.g., when a person sees or hears something interesting, and conversational patterns or expression styles may vary during the conversation. It would therefore also be desirable for the chatbot to behave in a way similar to human beings.

Embodiments of the present disclosure propose methods and apparatus for generating a response by considering both user signals and environment signals in a conversation and considering a response mode which indicates an expression style of a response to be generated.

Examples disclosed herein are directed to methods and apparatuses implementing an interactive chatbot on client devices. Through the disclosed examples, a client device may be equipped with a chatbot that can understand and interpret signals received from a user and/or the environment and can determine a response mode indicating an expression style of a response to be generated, similar to what happens in a human-to-human conversation, in order to generate a response based at least on the received signals and the response mode.

To create an intelligent chatbot, the examples disclosed herein may capture various relevant user and environment signals on the client device, and communicate the captured user and environment signals to a chat server for determining a response mode, and generating a response based at least on the response mode and the received signals.

Examples of the signals may include, without limitation, speech signals from a user, image signals from the environment, and any audio signals from the environment, e.g., background sound signals which include speech signals from other users and/or noises from the environment. Herein, “environment signals” refer to signals relating to a surrounding environment, location, or other activity being performed, as captured by one or more sensors or electrical components of a computing device. For example, environment signals may include audio signals detected by a microphone of a client device such as, but without limitation, sound of wind, sound of rain, sound from other speakers, and whistle of a car or any other noises.

For example, sound of rain may be received through the microphone and used to generate the text information “it is raining”. In some examples, text information may be generated from the environment signals by the client device and then sent to the chat server. In alternative examples, environment signals may be processed by a chat server receiving the signals from a client device over a network.

In some examples, user input signals and environment signals are analyzed and/or converted into text information, either by a client device or by a chat server to determine a response mode through a response mode determining module. Herein, the user input signals and environment signals may be in any form of text signals, image signals, audio signals, video signals or any other detected signals. Responses for interacting with a participant in a conversation, such as a user, may be generated through a response generation module based on integrated text information generated from user input signals and/or environment signals.

A response output module may be used to select one of the generated responses to be outputted in a form of text, speech, image, or video, taking into account relevance between the received signals and the generated responses and/or any other factors, for example, semantic information extracted from the user’s speech signals, text information converted from the environment signals, conversation log, user profile, and so on. For example, the response output module may take a generated response with the highest relevance score as a response to be outputted.

The generated responses are not limited to simple descriptions of the captured image signals, audio signals, video signals, etc., but may also contain the chatbot’s emotions and/or opinions, which may be referred to as “empathy responses”. A chatbot capable of generating such empathy responses may provide a more communicative and more intelligent chat experience than conventional chatbots. Such a chatbot may be applied in various scenarios, e.g., a driving companion, a travel companion, a jogging companion, etc.

In this disclosure, “conversation” or “chat conversation” refers to electronic interactions between a chatbot and a user, or between a chatbot and a virtual user, such as, sequences of exchanged text, video, image, audio, etc. The virtual user may refer to an electronic chatting participant.

Herein, a “user profile” refers to an electronically stored collection of information related to the user. Such information may include the user’s name, age, gender, height, weight, demographics, current location, residency, citizenship, family, friends, schooling, occupation, hobbies, skills, interests, Web searches, health information, birthday, anniversary, celebrated holidays, moods, and any other personalized information associated with the user.

Having generally provided an overview of some of the disclosed examples, attention is drawn to the accompanying drawings to further illustrate some additional details. The illustrated configurations and operational sequences are provided to aid the reader in understanding some aspects of the disclosed examples. The accompanying figures are not meant to limit all examples, and thus some examples may include different components, devices, or sequences of operations while not departing from the scope of the disclosed examples discussed herein. In other words, some examples may be embodied or may function in different ways than those shown.

FIG. 1 illustrates an exemplary implementation architecture of a conversation according to an embodiment. There may be a client device 100, a user 101, environment 102 in which the conversation is conducted, a network 104, a chat server 132 and a database 134 involved in the exemplary implementation architecture of the conversation.

In some examples, the client device 100 has at least one processor 106, a transceiver 108, one or more presentation components 110, one or more input/output (I/O) ports 112, one or more I/O components 114, and a memory 124.

The client device 100 may take the form of a mobile computing device or any other portable device, such as a mobile telephone, laptop, tablet, computing pad, notebook, gaming device, portable media player, etc. The client device 100 may also include less portable devices such as desktop personal computers, kiosks, tabletop devices, industrial control devices, wireless charging stations, electric automobile charging stations, on-board devices, etc. Further still, the client device 100 may alternatively take the form of an electronic component of a vehicle, e.g., a vehicle computer equipped with microphones or other sensors; or any other computing device.

The processor 106 may include a variable number of processing units, and is programmed to execute computer-executable instructions for implementing aspects of the disclosure. The instructions may be performed by the processor within the client device, or performed by a processor external to the client device. In some examples, the processor 106 is programmed to execute methods according to the embodiments of the disclosure. Additionally or alternatively, the processor 106 may be programmed to present a chat in a user interface (“UI”), e.g., the UI shown in FIG. 12.

The transceiver 108 is an antenna capable of transmitting and receiving signals. One skilled in the art will appreciate and understand that various antennas and corresponding chipsets may be used to provide communicative capabilities between the client device 100 and other remote devices.

The presentation components 110 visibly or audibly present information on the client device 100. Examples of presentation components 110 include, without limitation, computer monitors, televisions, projectors, touch screens, phone displays, tablet displays, wearable device screens, loudspeakers, vibrating devices, and any other devices configured to display, verbally communicate, or otherwise indicate chat responses to a user.

The I/O ports 112 allow the client device 100 to be logically coupled to other devices and I/O components 114, some of which may be built into the client device 100 while others may be external. Specific to the examples discussed herein, the I/O components 114 include a microphone 116, one or more sensors 118, a camera 120, and a touch device 122. The microphone 116 captures speech signals from a user 101 and background sound signals from the environment 102, as audio signals. The sensors 118 may include any number of sensors in the client device 100. Additionally, the sensors 118 may include an accelerometer, magnetometer, pressure sensor, photometer, thermometer, global positioning system (“GPS”) chip or circuitry, bar scanner, biometric scanner for scanning fingerprint, palm print, blood, eye, or the like, gyroscope, near-field communication (“NFC”) receiver, smell sensor, or any other sensor configured to capture signals from the user 101 or the environment 102. The camera 120 may capture images or videos from the environment 102. The touch device 122 may include a touchpad, track pad, touch screen, or other touch-capturing device. Although the I/O components 114 are illustrated as being included in the client device 100, any of the I/O components may also be external to the client device 100.

The memory 124 includes a variable number of storage devices associated with or accessible by the client device 100. The memory 124 may be internal to the client device 100, as shown in FIG. 1, external to the client device 100, not shown in FIG. 1, or both. Examples of the memory 124 may include, without limitation, random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technologies, CDROM, digital versatile disks (DVDs) or other optical or holographic media, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, memory wired into an analog computing device, or any other medium for encoding desired information and for access by the client device 100. The memory 124 stores, among other data, various device applications that, when executed by the processor 106, operate to perform functionality on the client device 100.

Specifically, instructions stored in the memory 124 comprise a communications interface application 126, a user interface application 128, and a chat application 130. In some examples, the communications interface application 126 includes computer-executable instructions for operating a network interface card and/or a driver for operating the network interface card. Communication between the client device 100 and other devices may occur using any protocols or mechanisms over a wired or wireless connection, or across the network 104. In some examples, the communications interface application 126 is operable with RF and short-range communication technologies using electronic tags, such as NFC tags, brand tags, or the like.

In some examples, the user interface application 128 includes a graphics application for displaying information to the user and receiving information from the user. The user interface application 128 may also include computer-executable instructions for operating the graphics card to display chat responses and corresponding images or speech on or through the presentation components 110. The user interface application 128 may also interact with the various sensors 118 to both capture and present information through the presentation components 110.

In some examples, the chat application 130, when executed, may retrieve user signals and/or environment signals captured through the I/O components 114, and communicate the retrieved user and environment signals over a network 104 to a remote server, such as the chat server 132. The chat application 130 may include instructions for determining a response mode on the client device 100.

Instead of making such determinations on the client device 100, in other examples, the chat server 132 may operate a server application configured to determine a response mode from the communicated user signals and environment signals, generate chat responses based at least on the response mode, and communicate the chat responses back to the client device 100 for displaying or outputting through the presentation components 110. The chat server 132 represents a server or a collection of servers configured to execute different web-service computer-executable instructions. Determination of the response mode may be performed either by the chat application 130 in the client device 100 or by the chat server 132.

The response mode may comprise various types of modes, for example, a positive response mode and a negative response mode. Alternatively, the response mode may comprise at least one of: a topic initiating mode, a topic maintaining mode, a topic switching mode, and so on. As a further alternative, the response mode may specifically comprise at least one of: a topic initiating statement mode, a topic initiating question mode, a topic initiating answer mode, a topic maintaining statement mode, a topic maintaining question mode, a topic maintaining answer mode, a topic switching statement mode, a topic switching question mode, a topic switching answer mode, and so on. In some implementations, the topic initiating modes may be incorporated into the corresponding topic switching modes, as a particular initialization case of the topic switching mode. For example, the topic initiating statement mode may be incorporated into the topic switching statement mode, the topic initiating question mode may be incorporated into the topic switching question mode, the topic initiating answer mode may be incorporated into the topic switching answer mode, and so on.
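The fine-grained modes listed above can be read as pairs of a topic action and a sentence type. The following is a minimal sketch under that reading; the names `TopicAction`, `SentenceType`, `all_response_modes`, `fold_initiating` and `mode_name` are hypothetical and do not come from the disclosure.

```python
from enum import Enum
from itertools import product


class TopicAction(Enum):
    INITIATING = "topic initiating"
    MAINTAINING = "topic maintaining"
    SWITCHING = "topic switching"


class SentenceType(Enum):
    STATEMENT = "statement"
    QUESTION = "question"
    ANSWER = "answer"


def all_response_modes():
    """Return the nine fine-grained modes as (action, sentence type) pairs."""
    return list(product(TopicAction, SentenceType))


def fold_initiating(mode):
    """Treat an initiating mode as the matching switching mode.

    This mirrors the optional incorporation of topic initiating modes
    into the corresponding topic switching modes.
    """
    action, sentence_type = mode
    if action is TopicAction.INITIATING:
        action = TopicAction.SWITCHING
    return (action, sentence_type)


def mode_name(mode):
    """Render a mode pair as a readable name."""
    action, sentence_type = mode
    return f"{action.value} {sentence_type.value} mode"
```

With the folding applied, the nine modes reduce to six distinct ones, which may simplify a downstream classifier at the cost of no longer distinguishing the first turn of a conversation.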

The response mode may be determined, in some examples, through the interpretation, recognition or analysis of text signals, video signals, image signals, audio signals, touch signals, or any other detected signals, for example, speed signals, smell signals, temperature signals, and so on, that originate from the user and/or the environment and are captured or detected on the client device. In some examples, audio signals may be further classified into speech signals from a user and background sound signals from the environment.

For example, suppose a response mode indicates that the expression style of a response to be generated is a topic maintaining question mode, and the text information is “flower, red”. Responses may then be generated based on the topic maintaining question mode and the text information, such as “Do you think this red flower is beautiful?”, “Is this red flower a rose?” and “Do you like this red flower?”. The most appropriate response, such as “Do you like this red flower?”, may be selected from the generated responses to be outputted to the user.

The network 104 may include any computer network, for example the Internet, a private network, local area network (LAN), wide area network (WAN), or the like. The network 104 may include various network interfaces, adapters, modems, and other networking devices for communicatively connecting the client device 100, the chat server 132, and a database 134.

The database 134 provides backend storage of Web, user, and environment data that may be accessed over the network 104 by the chat server 132 or the client device 100. The data stored in the database includes, for example but without limitation, user profiles 136, conversation log 138 and so on. Additionally or alternatively, some or all of the captured user and environment data may be transmitted to the database 134 for storage. For example, information that is related to a user’s profile or conversation gathered by the chat application 130 on the client device 100 may be stored on the database 134.

The user profiles 136 may include any of the previously mentioned data for individual users. The conversation log 138 may refer to conversation history or record of the conversation.

It shall be appreciated that although an exemplary client device comprising several components is described above, any other components may be added into the client device 100, and/or any shown components in the client device 100 may be omitted or replaced with other components.

FIG. 2 illustrates an exemplary general process 200 for generating a response based on received signals according to an embodiment.

At 210, one or more signals may be received from at least one signal source. For example, signals may be received from a participant of a conversation, e.g., a user 101, and/or from the environment 102 in which the conversation is conducted. The received signals may comprise text signals and/or non-text signals, for example, text signals from the user 101, speech signals from the user 101, image signals from the environment 102, background sound signals from the environment 102, and any other signals from the environment 102. Herein, the non-text signals may comprise at least one of an image signal, an audio signal, and a video signal, and the audio signal comprises at least one of a speech signal and a background sound signal.

At 220, text information may be generated from the received signals. The text information may refer to at least one of: semantic content of a text represented by text signals, semantic content of a speech represented by speech signals, image caption of an image represented by image signals, attribute of background sound signals or any other detected signals, and so on.

In some examples, when the received signals are text signals, the text information may be generated directly from semantic content of the text signals.

In some examples, when the received signals are speech signals, the text information may be generated by recognizing semantic content of the speech signals through speech recognition. Here the semantic content of the speech signals may represent content of what the user is saying.

In some other examples, when the received signals are image signals, the text information may be generated by performing an image captioning process on the received image signals. For example, when a received image signal shows yellow flowers by the roadside, an image caption “there are yellow flowers by the roadside” of this image may be used as text information for the image.

In still other examples, when the received signals are background sound signals, the text information may be generated by performing an audio analysis on the background sound signals to obtain attributes of the signals as text information. For example, when the background sound signal indicates that sound of wind is loud, the attribute of the background sound signal may be analyzed as “sound of wind, loud”, which may be considered as the text information generated from the received background sound signal. In some other examples, when the background sound signal is sound from other speakers, the attribute of the background sound signal may be analyzed as “people are speaking”, “here is human voice” or “someone is speaking”, which may be considered as the text information.
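The per-signal-type handling described above can be sketched as a simple dispatch. This is an illustration only: the `Signal` class, the `kind` labels and `make_text_generator` are hypothetical names, and the speech recognizer, image captioner and audio analyzer are passed in as callables because the disclosure does not name specific implementations.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    kind: str        # "text", "speech", "image" or "background_sound" (assumed labels)
    payload: object  # raw content; its real type depends on the sensor


def make_text_generator(recognize_speech, caption_image, analyze_sound):
    """Build a function that turns one received signal into text information.

    The three arguments are caller-supplied callables standing in for
    speech recognition, image captioning and audio analysis.
    """
    handlers = {
        "text": str,                         # text is used directly
        "speech": recognize_speech,          # semantic content of the speech
        "image": caption_image,              # image caption
        "background_sound": analyze_sound,   # attribute of the sound
    }

    def generate(signal):
        try:
            handler = handlers[signal.kind]
        except KeyError:
            raise ValueError(f"unsupported signal kind: {signal.kind!r}")
        return handler(signal.payload)

    return generate
```

Keeping the recognizers as parameters leaves open whether they run on the client device or on the chat server, both of which the description allows.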

Additionally or alternatively, one or more signals, such as some particular signals, may be selected from the received signals, and the text information may be generated from the one or more selected signals. Processing only the selected signals may reduce the processing burden compared to processing all received signals. The selecting operation may be performed based on a predefined condition. In some implementations, such a condition may comprise at least one of: a signal difference between a previous received signal and a current received signal being above a threshold, the signal difference being below a threshold, a predefined period, and a conversation log.

Herein, the signal difference between a previous received signal and a current received signal may be represented as a signal vector difference between the two signals. The threshold may be preset by the user, for example based on his/her preference, or determined by the chatbot automatically based at least on a user profile and/or conversation log. For example, when a camera of a chatbot captures images continuously, the chatbot may not need to process every captured image. When the camera captures an image with flowers that differs from the previously captured image, the signal vector difference between the current image signal and the previous image signal may increase significantly, and the chatbot may select this image with flowers from among the many captured images and generate text information from it.
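The threshold-based selection can be sketched as follows. The disclosure does not say how the signal vector difference is measured; Euclidean distance is used here purely as an assumption, and the function names are hypothetical.

```python
import math


def vector_difference(previous, current):
    """Euclidean distance between two equal-length signal vectors (assumed metric)."""
    if len(previous) != len(current):
        raise ValueError("signal vectors must have the same length")
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(previous, current)))


def select_changed_signals(vectors, threshold):
    """Return indices of signals that differ from their predecessor by more than threshold.

    The first signal has no predecessor and is never selected here.
    """
    selected = []
    for index in range(1, len(vectors)):
        if vector_difference(vectors[index - 1], vectors[index]) > threshold:
            selected.append(index)
    return selected
```

For a stream of image vectors in which only one frame departs sharply from the frame before it, only that frame's index is returned. The complementary condition in the text, a difference below a threshold, would use the same comparison with the inequality reversed.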

In some implementations, one or more signals may be selected from the received signals to be used to generate text information based on a predefined period. The predefined period may be preset by a user or determined by the chatbot randomly or automatically based at least on user profile and/or conversation log. For example, a signal may be selected from the received signals every 10 seconds, every 5 minutes, or based on any other period.
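Period-based selection can be sketched in the same style. The sketch assumes each signal carries a capture time in seconds and that timestamps arrive in ascending order; `select_by_period` is a hypothetical name.

```python
def select_by_period(timestamps, period_seconds):
    """Return indices of signals, keeping at most one signal per period.

    The first signal is always kept; each later signal is kept only if at
    least period_seconds have passed since the last kept signal.
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    selected = []
    next_due = None
    for index, timestamp in enumerate(timestamps):
        if next_due is None or timestamp >= next_due:
            selected.append(index)
            next_due = timestamp + period_seconds
    return selected
```

Measuring the period from the last kept signal, and not from a fixed clock, is a design choice of this sketch; the description leaves the exact scheme open.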

In some other implementations, one or more signals may be selected from the received signals based on a conversation log for a conversation between the user and the chatbot. For example, when one or more captured signals, such as images or sounds, are related to content in the conversation log, such one or more captured signals may be selected to be used to generate text information.

It should be appreciated that all the above examples are merely for illustration and without limitations on the scope of the present disclosure.

At 230, a response mode may be determined based on the text information generated at 220. The response mode may indicate an expression style of the response to be generated.

At 240, a response may be generated based at least on the text information, the expression style indicated by the response mode, and, optionally, certain types of environment signals, such as image signals.

FIG. 3 is a block diagram of an exemplary response generation system 300 according to an embodiment.

In general, the response generation system 300 may comprise a response mode determining module 310, a response generation module 320 and a response output module 330.

The generated text information 302 may be provided to the response mode determining module 310, to determine a response mode for a response 304 to be generated.

When the response mode is determined in the response mode determining module 310, it may be fed to the response generation module 320 along with the text information 302 to generate responses. Herein, the response mode may also be in a text form and combined with the text information to generate a text sequence as an output of the response mode determining module 310, to be provided to the response generation module 320.

Although the response generation module 320 is illustrated as a single module, one skilled in the art will appreciate that the response generation module 320 may, in fact, be scalable. In some examples, the response generation module 320 may comprise a text encoder 322, a text attention model 324 and a decoder 326. Herein, the text encoder 322 may receive the text sequence, which includes the text information and the response mode, and perform encoding on the text sequence to generate text vectors. The text vectors may be provided to the text attention model 324, to generate text attention features through a text attention processing. The decoder 326 may receive such text attention features and perform a decoding process to generate responses.

The generated responses may be inputted to the response output module 330. The response output module 330 selects an appropriate response from the generated responses to output. The appropriate response may be selected based on a predefined condition, or by any other available techniques, such as any existing sorting or ranking techniques. For example, a response with the highest relevance score may be selected as the appropriate response to be outputted.
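Selection by highest relevance score can be sketched in a few lines. How the relevance score is computed is not specified here, so the sketch takes the scores as given; `select_response` is a hypothetical name.

```python
def select_response(candidates):
    """Pick the response with the highest relevance score.

    candidates is a list of (response_text, relevance_score) pairs.
    On ties, the earliest candidate wins, which is how max() behaves.
    """
    if not candidates:
        raise ValueError("no candidate responses to select from")
    best_response, _ = max(candidates, key=lambda pair: pair[1])
    return best_response
```

Applied to the red flower example, with made-up scores of 0.61, 0.58 and 0.74 for the three candidate questions, the third candidate would be selected.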

It should be appreciated that although the response output module 330 is illustrated as separated from the response generation module 320, it may also be incorporated into the response generation module 320. That is, the response generation module 320 may generate and output an appropriate response 304.

FIG. 4 illustrates an exemplary response mode determination model 400 according to an embodiment.

The response mode determination model 400 may be used in the response mode determining module 310 in FIG. 3 and may be implemented through a neural network classifier.

As shown in FIG. 4, text information 402 may be inputted into the response mode determination model 400. Herein, the text information 402 may be separated into a sequence of sentences s_1 to s_v, each sentence s_i being represented by a sequence of words w_{i,1} to w_{i,t}. In some examples, a sentence may be a short sentence including just one or several words and/or one or several phrases, or a long sentence including a plurality of words and/or a plurality of phrases. For example, sentence s_1 may be represented by w_{1,1} to w_{1,t}; sentence s_2 may be represented by w_{2,1} to w_{2,t}; and sentence s_v may be represented by w_{v,1} to w_{v,t}, as shown in FIG. 4. The sequence of sentences s_1 to s_v may be encoded with an encoder, such as a hierarchical encoder, to generate a sequence of hidden vectors h^s_1 to h^s_v for the text information, which may be concatenated into a hidden vector h^s. Several response modes m_1 to m_n may be encoded with an encoder, such as a Gated Recurrent Unit (GRU), to generate a sequence of hidden vectors h^m_1 to h^m_n for the response modes, which may be concatenated into a hidden vector h^m. The hidden vectors h^s and h^m may be fed into a multi-layer perceptron (MLP) to calculate a probability distribution of response modes for a next turn in the conversation, which may be represented as p_{m_1}, p_{m_2}, ..., p_{m_n}, as shown in FIG. 4, and may be calculated as follows:
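The equation itself does not survive in this text. One form consistent with the surrounding definitions, offered as a reconstruction under that assumption and not as the original formula, is:

```latex
P(m_i \mid d_i) = f_{\mathrm{MLP}}\left(h^{s}, h^{m}\right)
```

Here f_MLP is assumed to end in a normalization such as softmax, so that its outputs over the n candidate response modes sum to one.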

p(m_i | d_i) = f_MLP(h^s, h^m)    Equation (1)

where m_i represents a response mode for the i-th turn dialogue in the conversation, d_i = {(s_1, m_1), (s_2, m_2), ..., (s_{i-1}, m_{i-1})} represents a dialogue set in the conversation, and f_MLP represents an MLP function.

According to the calculated probability p_{mi} for each response mode m_i, an appropriate response mode m_i may be determined for the i-th turn dialogue in the conversation.
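The classification step can be sketched in a few lines of Python. This is an illustrative sketch only: the single tanh hidden layer, the weight values, the two-entry mode list and the argmax selection rule are assumptions, since the disclosure states only that the concatenated hidden vectors are fed to an MLP to obtain a probability distribution over response modes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_mode_distribution(h_s, h_m, W1, b1, W2, b2):
    """Map the text vector h_s and the mode vector h_m to mode probabilities."""
    x = h_s + h_m  # list concatenation stands in for [h^s, h^m]
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    logits = [sum(w * hi for w, hi in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return softmax(logits)


# Hypothetical mode inventory and toy parameters, for illustration only.
MODES = ["positive", "negative"]
h_s = [0.5, -0.2]
h_m = [0.1, 0.3]
W1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
b1 = [0.0, 0.0]
W2 = [[2.0, 0.0], [0.0, 2.0]]
b2 = [0.0, 0.0]

probs = mlp_mode_distribution(h_s, h_m, W1, b1, W2, b2)
# Pick the most probable mode (one possible selection rule).
chosen = MODES[max(range(len(probs)), key=probs.__getitem__)]
```

In a trained model the weights would be learned from annotated dialogue sets rather than set by hand.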

FIG. 5 illustrates an exemplary response generation model 500 with a text attention model according to an embodiment. The response generation model 500 may be used in the response generation module 320 in FIG. 3 and implemented through a neural network.

As shown in FIG. 5, each sentence s_i may be generated based on a determined response mode m_i and a received sentence s_{i-1}. Herein, the sentences s_i and s_{i-1} may be represented as sequences of words, [w_{i,1} ... w_{i,t}] and [w_{i-1,1} ... w_{i-1,t}] respectively. The determined response mode m_i may be attached to the sentence s_{i-1}, as a special word, to form a word sequence which is encoded with an encoder to generate a vector set [v_0, v_1, ..., v_t]. Herein, the encoder may be implemented through a neural network, such as a bidirectional recurrent neural network with gated recurrent units (biGRUs). It should be appreciated that although m_i is attached to the front of the sequence of words [w_{i-1,1} ... w_{i-1,t-1}] in FIG. 5, it may be attached to the end of the sequence of words, or may be embedded at any location in the sequence of words, if applicable.

The generated vector set [v_0, v_1, ..., v_t] from the encoder may be inputted to a text attention model to generate an attention vector set [v'_1, v'_2, ..., v'_{t-1}]. The decoder takes the attention vector set [v'_1, v'_2, ..., v'_{t-1}] as input and generates a response by a language model with an attention mechanism. The decoding process may produce a sequence of words [w_{i,1} ... w_{i,t-1}], which may in turn go through a softmax layer to output a word, e.g., the exemplary w_{i,3} shown in FIG. 5. It should be appreciated that although only w_{i,3} is shown as outputted, one or more words may be outputted from the decoder to generate a response. It should also be appreciated that although the inputs of the encoder shown in FIG. 5 are m_i and the word sequence [w_{i-1,1} ... w_{i-1,t-1}] representing sentence s_{i-1}, earlier sentences s_{i-2}, s_{i-3}, ..., s_1 in a conversation log may also be inputted to the encoder. From the examples illustrated in FIG. 4 and FIG. 5, a response may be generated based at least on a determined response mode and one or more sentences comprised in the text information.
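The way a response mode may be attached to the received sentence as a special word can be illustrated with a short Python sketch. The `<mode:...>` token format, the whitespace tokenization and the function name are hypothetical; the encoder, attention model and decoder themselves are not modeled here.

```python
def build_encoder_input(mode, previous_sentence, position="front"):
    """Attach the response mode to the word sequence as a special word."""
    words = previous_sentence.split()  # naive whitespace tokenization (assumption)
    mode_token = f"<mode:{mode}>"      # hypothetical special-word format
    if position == "front":
        return [mode_token] + words
    if position == "end":
        return words + [mode_token]
    raise ValueError("position must be 'front' or 'end'")


encoder_input = build_encoder_input(
    "topic_maintaining_question", "They are so beautiful"
)
```

The resulting token list is what an encoder such as a biGRU would consume in place of the bare sentence.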

FIG. 6 illustrates an exemplary process 600 for generating a response based on speech signals or text signals according to an embodiment.

As shown in FIG. 6, the process for generating a response based on an audio signal 602 is similar to that for a text signal 602' except that text information 610 may be generated from the received text signal 602' directly without any additional recognition or conversion processing. The detailed description of the process for the text signal 602' is therefore omitted here for simplicity, and the process for the audio signal 602 is described below as an example.

When an audio signal 602 is received, it may be fed to a user ID identifying module 604 to identify whether this audio signal is a speech signal 606 from a user who is having the conversation with the chatbot. For example, the user ID identifying module 604 may extract audio features of the audio signal 602 and match them against a pre-stored user ID. If they match, the audio signal 602 may be considered a speech signal from the user and fed to a speech recognition module 608. The speech recognition module 608 may translate or convert this speech signal to text information 610 through various speech-to-text techniques. The text information 610 may be inputted to a response mode determining module 620 to determine a response mode.

A response generation module 630 may receive the determined response mode and the text information 610 and generate one or more responses based at least on the response mode and the text information. Herein, the response generation module 630 may comprise a text encoder 632, a text attention model 634 and a decoder 636, which is similar to the response generation module 320 comprising the text encoder 322, the text attention model 324 and the decoder 326, as shown in FIG. 3. In particular, the text information 610 and the response mode may be fed to the text encoder 632 included in the response generation module 630. For simplicity, the detailed description for the text encoder 632, the text attention model 634 and the decoder 636 is omitted herein.

The generated one or more responses may be fed to a response output module 640, to select an appropriate response to be outputted. As the operation of the response output module 640 is similar to the response output module 330 shown in FIG. 3, the detailed description for the response output module 640 is omitted herein for simplicity.

FIG. 7 illustrates an exemplary process 700 for generating a response based on image signals according to an embodiment.

An image signal 702 may be received and fed to an image caption module 704. The image caption module 704 performs image captioning on the image signal 702 to translate or convert the image signal 702 to text information 706. A response mode determining module 708 may receive the text information for determining a response mode. A response generation module 710 may receive the determined response mode from the response mode determining module 708 and the text information 706, to generate a response based at least on the received response mode and text information. As shown in FIG. 7, the response generation module 710 in this implementation comprises a text encoder 711, a text attention model 712, an image encoder 713, a spatial attention model 714, an adaptive attention model 715 and a decoder 716. In particular, the text information 706 and the response mode may be fed to the text encoder 711 in the response generation module 710. Herein, the operations of the text encoder 711 and the text attention model 712 are similar to those of the text encoder 322 and the text attention model 324 in FIG. 3, and the detailed description of them is omitted for simplicity.

Additionally or alternatively, the image signal 702 may be fed into the image encoder 713. The image encoder 713 may perform encoding on the image signal 702 to generate image vectors. The spatial attention model 714 may receive the image vectors and extract spatial image features for indicating a spatial map highlighting image regions relevant to each generated word. An exemplary structure of the spatial attention model 714 is described below with reference to FIG. 8.

The adaptive attention model 715 may receive the spatial image features from the spatial attention model 714 and the text attention features from the text attention model 712 to generate adaptive attention features. The adaptive attention model 715 may be configured to determine when to rely on the image signal and when to rely on a language model to generate a next word. When relying on the image signal, the adaptive attention model 715 may also determine where, that is, which image region, it should attend to. An exemplary structure of the adaptive attention model 715 is described below with reference to FIG. 9.

The decoder 716 may receive adaptive attention features from the adaptive attention model 715 and generate responses based at least on the adaptive attention features.

The generated responses from the decoder 716 may be conveyed to a response output module 720 for selecting an appropriate response to output. The operation for selecting an appropriate response in the response output module 720 may be similar to that in the response output module 330 and thus is omitted for simplicity.

Additionally or alternatively, the response output module 720 may comprise a convolutional feature extraction module 721 and a dual attention module 722. The convolutional feature extraction module 721 may receive the image signal 702 and extract convolutional features of the image signal. The extracted features of the image signal may be fed to the dual attention module 722 along with the generated responses from the decoder 716 in a text form. The dual attention module 722 may incorporate visual and textual attention models and perform dual attention mechanism on the extracted features of the image signal 702 and the generated responses, for example, comparing these two inputs, to output an appropriate response. The visual attention model may pay attention to specific regions in an image to extract image attention features and the textual attention model may pay attention to specific words or sentences in text content to extract text attention features from the text. In some examples, the dual attention module 722 may perform image-text matching by comparing the extracted features of the image signal and the text contents of the generated responses, and may estimate similarity between the features of the image signal and the text contents of the responses by focusing on their common semantics.
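The image-text matching idea can be illustrated with a deliberately simplified Python sketch. It uses plain cosine similarity between one image feature vector and one embedding per candidate response as a stand-in for the dual attention mechanism, which it does not implement; the vectors, their dimensionality and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_best_match(image_vec, response_vecs):
    """Return the response text whose embedding is most similar to the image."""
    return max(response_vecs, key=lambda text: cosine(image_vec, response_vecs[text]))


# Toy embeddings, assumed to live in a shared image-text space.
image_vec = [0.9, 0.1, 0.0]
response_vecs = {
    "The yellow flowers are blooming.": [0.8, 0.2, 0.1],
    "It's so noisy.": [0.0, 0.1, 0.9],
}
best_text = pick_best_match(image_vec, response_vecs)
```

A dual attention module would additionally weight image regions and words before comparing them, rather than comparing one pooled vector per side.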

It should be appreciated that although the convolutional feature extraction module 721 and the dual attention module 722 are shown as being included in the response output module 720, they may also be separated from the response output module 720 and/or may be omitted or replaced by any other suitable modules.

FIG. 8 illustrates an exemplary spatial attention model 800 according to an embodiment, which is corresponding to the spatial attention model 714 in FIG. 7.

Herein, the spatial attention model 800 may be implemented by a neural network for generating a spatial attention vector c_t for an image. As shown in FIG. 8, x_t and h_{t-1} are inputted to a Long Short-Term Memory (LSTM) to generate a hidden state h_t of the LSTM. Here, x_t represents an input vector at time t, h_{t-1} represents the hidden state of the LSTM at time t-1, and h_t represents the hidden state of the LSTM at time t. The generated vector h_t may be fed to an attention model along with a spatial image feature set V, which may be represented as V = [v_1, ..., v_k], where each v_i is a multi-dimensional representation corresponding to a region of the image. Through the attention model, a spatial attention vector c_t may be generated as follows:

c_t = g(V, h_t)    Equation (2)

where g is an attention function.

The generated spatial attention vector c_t may be fed to an MLP along with h_t to generate an output vector y_t, corresponding to a word, through an MLP function f_MLP:

y_t = f_MLP([c_t, h_t])    Equation (3)

It should be appreciated that although not shown in FIG. 8, there may be an attention weight α over each spatial image feature v in the spatial image feature set V.
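To make Equation (2) concrete, the following Python sketch computes a spatial attention vector as an attention-weighted sum of region features. The attention function g is not specified in the text, so dot-product scoring followed by a softmax is assumed here purely for illustration, and the feature values are toy numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(V, h_t):
    """Return (c_t, alpha): the context vector and the per-region weights."""
    # Assumed scoring: dot product between each region feature and h_t.
    scores = [sum(v_j * h_j for v_j, h_j in zip(v, h_t)) for v in V]
    alpha = softmax(scores)
    dim = len(V[0])
    # c_t is the alpha-weighted sum of the region features.
    c_t = [sum(a * v[d] for a, v in zip(alpha, V)) for d in range(dim)]
    return c_t, alpha


V = [[1.0, 0.0], [0.0, 1.0]]  # two toy image regions
h_t = [2.0, 0.0]              # toy LSTM hidden state
c_t, alpha = spatial_attention(V, h_t)
```

With these toy inputs the first region scores higher than the second, so it receives the larger weight and dominates c_t.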

FIG. 9 illustrates an exemplary adaptive attention model 900 according to an embodiment, which is corresponding to the adaptive attention model 715 in FIG. 7.

The adaptive attention model 900 may be implemented by a neural network for generating an adaptive attention vector c'_t for both image and text.

Similar to FIG. 8, x_t and h_{t-1} are inputted to a Long Short-Term Memory (LSTM) to generate a hidden state h_t of the LSTM. Here, an indication vector i_t is extracted from the input vector x_t, to indicate whether to pay attention to the text. The indication vector i_t may be calculated through the following equations:

i_t = g_t ⊙ tanh(m_t)    Equation (4)

g_t = σ(W_x x_t + W_h h_{t-1})    Equation (5)

where g_t represents a gate applied on a memory cell m_t of the LSTM, ⊙ represents an element-wise product, W_x and W_h represent weight parameters for the input vector x_t and the hidden state h_{t-1} respectively, and σ represents a logistic sigmoid activation.

Based on the generated indication vector i_t and a spatial image feature set V = [v_1, ..., v_k], an adaptive attention vector c'_t may be calculated through the following equation:

c'_t = β_t i_t + (1 - β_t) c_t = β_t i_t + (1 - β_t) g(V, h_t)    Equation (6)

where β_t represents a probability for paying attention to text at time t, which is in the range [0, 1], in which a value of 1 means that only text features are used and a value of 0 means that only spatial image features are used when generating the next word; and c_t represents the spatial attention vector, as calculated in Equation (2) by g(V, h_t).

Additionally or alternatively, as shown in FIG. 9, α_i for each spatial image feature v_i represents a respective attention weight over each spatial image feature.

Although not shown in FIG. 9, an output y_t may be generated through an MLP based on the adaptive attention vector c'_t, instead of the spatial attention vector c_t in FIG. 8.
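Equations (4) to (6) can be traced numerically with a small Python sketch. The weight matrices, vectors and the value of β_t below are toy values chosen for illustration; how β_t is obtained is not described in this passage, so it is passed in as an input.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]


def indication_vector(x_t, h_prev, m_t, W_x, W_h):
    """Equations (4)-(5): g_t = sigmoid(W_x x_t + W_h h_prev); i_t = g_t * tanh(m_t)."""
    pre = [a + b for a, b in zip(matvec(W_x, x_t), matvec(W_h, h_prev))]
    g_t = [sigmoid(p) for p in pre]
    return [g * math.tanh(m) for g, m in zip(g_t, m_t)]


def adaptive_attention(beta_t, i_t, c_t):
    """Equation (6): blend the text indication vector with the spatial vector."""
    if not 0.0 <= beta_t <= 1.0:
        raise ValueError("beta_t must lie in [0, 1]")
    return [beta_t * i + (1.0 - beta_t) * c for i, c in zip(i_t, c_t)]


# Toy inputs, for illustration only.
identity = [[1.0, 0.0], [0.0, 1.0]]
i_t = indication_vector([1.0, 0.0], [0.0, 1.0], [0.5, -0.5], identity, identity)
c_t = [0.2, 0.8]
blended = adaptive_attention(0.25, i_t, c_t)
```

Setting `beta_t` to 1 returns the indication vector unchanged and setting it to 0 returns the spatial attention vector, matching the two extremes described for Equation (6).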

FIG. 10 illustrates an exemplary process 1000 for generating a response based on audio signals according to an embodiment.

When an audio signal 1002 is received, it may be fed to a user ID identifying module 1004 to identify whether this audio signal is a speech signal 1006 from a user. If the audio signal is determined not to be a speech signal from the user, the audio signal 1002 may be considered a background sound signal, such as the sound of wind, the sound of rain, sound from other speakers and so on, and may be fed to an audio analysis module 1008. The audio analysis module 1008 may analyze the audio signal to extract text information 1010 from it. The text information 1010 may be inputted to a response mode determining module 1020 for determining a response mode.

As the operations of the user ID identifying module 1004 are similar to the user ID identifying module 604 in FIG. 6, and the operations of the response mode determining module 1020 are similar to the response mode determining module 620 in FIG. 6, the detailed descriptions for the user ID identifying module 1004 and the response mode determining module 1020 may be omitted herein.

A response generation module 1030 may receive the determined response mode and the text information 1010 and generate one or more responses based at least on the response mode and the text information. Herein, the response generation module 1030 may comprise a text encoder 1032, a text attention model 1034 and a decoder 1036, whose operations are similar to that of the response generation module 320 in FIG. 3 and the response generation module 630 in FIG. 6. For simplicity, the detailed description for the text encoder 1032, the text attention model 1034 and the decoder 1036 is omitted herein.

The generated one or more responses may be fed to a response output module 1040, to select an appropriate response to be outputted. As the operation of the response output module 1040 is similar to that of the response output module 330 shown in FIG. 3 and the response output module 640 shown in FIG. 6, the detailed description of the response output module 1040 is omitted herein for simplicity.

Additionally or alternatively, the response output module 1040 may comprise a text-to-speech (TTS) module 1042, for converting text signal to speech signal and generating a speech output. It should be appreciated that although the TTS module 1042 is shown as being included in the response output module 1040, it may also be separated from the response output module 1040 and/or may be omitted or replaced by any other suitable modules.

FIG. 11 illustrates an exemplary process 1100 for generating a response based on an image signal and an audio signal according to an embodiment.

As the process 1100 for generating a response based on an image signal and an audio signal may be deemed as a combination of the processes shown in FIG. 6, FIG. 7 and FIG. 10, detailed descriptions of modules in FIG. 11 may be omitted or simplified.

When an image signal 1102 is received, it may be fed to an image caption module 1104. The image caption module 1104 performs image captioning on the image signal 1102 to translate or convert the image signal 1102 to text information, as a part of text information 1116.

When an audio signal 1106 is received, it may be fed to a user ID identifying module 1108 to identify whether the audio signal is a speech signal 1110 from a user. If the audio signal 1106 is considered a speech signal from the user, it may be fed to a speech recognition module 1114. The speech recognition module 1114 may translate or convert the speech signal to text information, as a part of text information 1116. If it is determined that the audio signal is not a speech signal from the user, the audio signal 1106 may be considered a background sound signal, such as the sound of wind, the sound of rain, sound from other speakers and so on, and may be fed to an audio analysis module 1112. The audio analysis module 1112 may analyze the audio signal to extract text information therefrom, as a part of text information 1116.

Text information 1116 may be generated by combining respective text information of the received two or more signals, such as the image signal 1102 and the audio signal 1106. For example, text information converted from the image signal 1102, and text information converted or extracted from the audio signal 1106 may be combined to generate the text information 1116.
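The routing of an audio signal and the combination of per-signal text can be sketched as follows. The dictionary-based signal representation, the stub functions standing in for the user ID identifying, speech recognition and audio analysis modules, and the "; " separator are all hypothetical; the disclosure does not state how the pieces of text information are joined.

```python
def text_from_audio(audio, matches_user, recognize, analyze):
    """Route audio: the user's speech goes to recognition, anything else to analysis."""
    if matches_user(audio):
        return recognize(audio)
    return analyze(audio)


def combine_text_information(parts):
    """Join the non-empty pieces of text information into one string."""
    return "; ".join(p.strip() for p in parts if p and p.strip())


# Stubs standing in for the real modules (assumptions for illustration).
def matches_user(audio):
    return audio["speaker"] == "user"


def recognize(audio):
    return audio["transcript"]


def analyze(audio):
    return audio["label"]


background = {"speaker": "unknown", "label": "loud noise"}
caption = "many people"  # would come from an image caption module
audio_text = text_from_audio(background, matches_user, recognize, analyze)
text_information = combine_text_information([caption, audio_text])
```

The combined string is what would then be passed to the response mode determining module.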

The text information 1116 may be inputted to a response mode determining module 1118 for determining a response mode.

A response generation module 1120 may receive the determined response mode from the response mode determining module 1118 and the text information 1116, to generate a response based at least on the received response mode and the text information. As shown in FIG. 11, the response generation module 1120 in this implementation comprises a text encoder 1121, a text attention model 1122, an image encoder 1123, a spatial attention model 1124, an adaptive attention model 1125 and a decoder 1126. In particular, the text information 1116 may be fed to the text encoder 1121 in the response generation module 1120 along with the determined response mode.

Herein, since the operations of the text encoder 1121 and the text attention model 1122 are similar to the text encoder 322 and the text attention model 324 in FIG. 3, the text encoder 632 and the text attention model 634 in FIG. 6, and the text encoder 711 and the text attention model 712 in FIG. 7, respectively, the detailed description for them is omitted here for simplicity. Moreover, since the operations of the image encoder 1123, the spatial attention model 1124, the adaptive attention model 1125, and the decoder 1126 are similar to the image encoder 713, the spatial attention model 714, the adaptive attention model 715 and the decoder 716 in FIG. 7, respectively, the detailed description for them is omitted here for simplicity.

The generated responses from the decoder 1126 may be conveyed to a response output module 1130 to select an appropriate response to output. The operation for selecting an appropriate response in the response output module 1130 may be similar to that in the response output module 330 in FIG. 3 and thus is omitted for simplicity.

Additionally or alternatively, the response output module 1130 may comprise a convolutional feature extraction module 1131, a dual attention module 1132 and optionally a TTS module 1133. Since the operations of the convolutional feature extraction module 1131 and the dual attention module 1132 are similar to the convolutional feature extraction module 721 and the dual attention module 722 in FIG. 7, the detailed description for them is omitted here for simplicity. Moreover, since the operations of the TTS module 1133 are similar to the TTS module 1042 in FIG. 10, the detailed description for it is omitted here for simplicity.

According to the exemplary processes for generating a response based at least on a response mode and text information from an audio signal and/or an image signal as discussed above, FIG. 12 illustrates an exemplary conversation window 1200 for a conversation between a user and a chatbot according to an embodiment.

In the example of FIG. 12, semantic information or content said by the user and/or the chatbot, which may be not visible in the conversation window, is shown in a text form in dashed blocks outside the conversation window, for the convenience of description. Also for the convenience of understanding, a description for capturing environment signals is shown in solid blocks outside the conversation window in the example in FIG. 12.

As shown by 1201 in FIG. 12, when the chatbot detects there is something different from a previous scene, for example, there are some yellow flowers by the roadside, it may capture an image with yellow flowers and a topic may be initiated or switched based on the captured image. An initial response mode may be determined by a response mode determining model based on the information extracted from the image. For example, the initial response mode may be determined as a positive response mode and/or a topic initiating statement mode. A response may be generated based at least on the initial response mode, and text information from the captured image, such as attention features of “yellow, flowers” , together with any other possible information in the user profile and/or the conversation log. The exemplary response may be outputted as “Look! The yellow flowers are blooming. My mother grew the same flowers in the garden when I was young” as shown by 1211.

When the user provides a speech message shown by 1221, the chatbot may generate text information “Oh, yes. They are so beautiful” from a speech signal of the speech message and determine a response mode for a response to be generated based on the text information, for example, a positive response mode based on a positive word “beautiful” and/or a topic maintaining question mode based on the sentence “They are so beautiful” . Based on the determined response mode and the generated text information, the chatbot may generate and output a response “Would you like to grow some in your garden? ” as shown by 1212 in the topic maintaining question mode.

When a speech message shown by 1222 is received by the chatbot, the chatbot may generate text information “Actually, not. Because I am allergic to pollen” from the received signal and determine a response mode as a positive response mode and/or a topic maintaining statement mode based on the generated text information. Further, based at least on attention features “not” and “allergic to pollen” and the determined topic maintaining statement mode, the chatbot may generate and output a response “It is also a good way to have a look far away” as shown by 1213 to maintain the current topic in the conversation.

In addition to receiving speech/audio signals, the chatbot may receive signals in other forms. For example, the chatbot may receive a message in a text form from the user, such as a word “Yes” as shown by 1223.

Meanwhile or some minutes later, the chatbot may detect an audio signal and may identify text information “Michael Jackson’s music” from the audio signal through an audio analysis module, as shown by 1202. Based on the identified or generated text information, the chatbot may determine a response mode as a positive response mode and/or a topic switching statement mode. A response may be generated based on the determined response mode and the identified text information, such as “Oh, I like Michael Jackson but I prefer his slow songs compared to this one” as shown by 1214. When receiving a speech message from the user, the chatbot may generate text information from the speech signal of the speech message, which is “Could you recommend one of his slow songs? ” as shown by 1224. Based on the generated text information, the chatbot may determine a response mode for a next response, such as a positive response mode and/or a topic maintaining answer mode. The next response “Sure. Let me play it for you” as shown by 1215 may be generated based on the determined response mode and the text information. The next response may be outputted in a speech form through a TTS module. As an alternative way, the response may be outputted in a text form.

After a few minutes, the chatbot may detect a background sound signal through a microphone and capture an image signal through a camera. The background sound signal may be analyzed to generate text information “loud noise” and the image signal may be processed through image captioning to generate text information “many people”, as shown by 1203. The chatbot may determine a response mode based at least on the generated text information, such as a negative response mode and/or a topic switching question mode. A response, e.g., “It’s so noisy. What happened?” as shown by 1216, may be generated based on the determined response mode, together with the generated text information, e.g., text attention features “loud noise” from the text information.

The user may provide a speech message as shown by 1226 to the chatbot to answer its question. The chatbot receives this speech message and recognizes it as text information “There is a rock festival” . The chatbot may determine a response mode, such as a negative response mode and/or a topic switching statement mode, based on the text information. Therefore, a response “Oh, I don’t like rock music. There is so crowded. Let’s leave here” as shown by 1217 may be generated based on the response mode and the text information and may be outputted in a speech form through a TTS module.

It should be appreciated that the conversation between the user and the chatbot may be made in any form of text, speech, image, video, etc. or any combination thereof.

FIG. 13 illustrates a flowchart of an exemplary method 1300 for generating a response in a conversation according to an embodiment.

At 1310, at least one signal may be received from at least one signal source.

At 1320, text information may be generated based on the at least one received signal.

At 1330, a response mode may be determined based at least on the text information. In some implementations, the response mode may indicate an expression style of a response to be generated.

At 1340, the response may be generated based at least on the text information and the response mode.
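The four steps of method 1300 can be laid out as a small Python skeleton. The function names and the toy stand-ins are hypothetical; in particular, the keyword rule used to pick a mode is only a placeholder for the neural network classifier described in the disclosure.

```python
def generate_response(signals, to_text, determine_mode, respond):
    """Chain steps 1320-1340 over signals already received at step 1310."""
    text_information = to_text(signals)                # step 1320
    response_mode = determine_mode(text_information)   # step 1330
    return respond(text_information, response_mode)    # step 1340


# Toy stand-ins for the three processing stages (assumptions for illustration).
def toy_to_text(signals):
    return " ".join(s["text"] for s in signals)


def toy_determine_mode(text):
    # Placeholder rule; the disclosure uses a classifier instead.
    return "positive" if "beautiful" in text.lower() else "negative"


def toy_respond(text, mode):
    return f"[{mode}] reply to: {text}"


reply = generate_response(
    [{"source": "user", "text": "They are so beautiful"}],
    toy_to_text,
    toy_determine_mode,
    toy_respond,
)
```

Passing the stages in as callables keeps the skeleton independent of how each stage is implemented.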

In an implementation, the at least one signal source may comprise a participant of the conversation or environment in which the conversation is conducted.

In an implementation, the at least one received signal may comprise a text signal and/or a non-text signal. In some examples, the non-text signal may comprise at least one of an image signal, an audio signal, and a video signal, and the audio signal may comprise at least one of a speech signal and a background sound signal.

In an implementation, the at least one received signal may comprise two or more signals. In some examples, generating the text information may comprise generating the text information by combining respective text information of the two or more signals.

In an implementation, the response mode may comprise at least one of a positive response mode and a negative response mode.

In an implementation, the response mode may comprise at least one of a topic maintaining statement mode, a topic maintaining question mode, a topic maintaining answer mode, a topic switching statement mode, a topic switching question mode and a topic switching answer mode.
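The response modes listed above can be represented as simple enumerations. Splitting the six topic-related modes into a topic action and a sentence type is a modeling choice made for this sketch, not something the disclosure prescribes, and the class names are hypothetical.

```python
from enum import Enum
from itertools import product


class Sentiment(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


class TopicAction(Enum):
    MAINTAINING = "maintaining"
    SWITCHING = "switching"


class SentenceType(Enum):
    STATEMENT = "statement"
    QUESTION = "question"
    ANSWER = "answer"


# Enumerate the six topic-related modes named in the text.
TOPIC_MODES = [
    f"topic {action.value} {kind.value} mode"
    for action, kind in product(TopicAction, SentenceType)
]
```

A determined response mode could then pair one `Sentiment` member with one entry of `TOPIC_MODES`, matching the "and/or" combinations used in the FIG. 12 example.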

In an implementation, determining the response mode may comprise determining the response mode based at least on the text information through a neural network classifier.

In an implementation, generating the response may comprise: generating at least one text attention feature based on the text information and the response mode through a text attention model; and generating the response based at least on the at least one text attention feature.

In an implementation, the at least one received signal may comprise a non-text signal. In some examples, generating the text information may comprise generating the text information through performing signal analysis to the non-text signal.

In an implementation, the non-text signal is an image signal, and generating the response may comprise: generating at least one image attention feature based on the image signal through a spatial attention model; generating at least one text attention feature based on the text information and the response mode through a text attention model; and generating the response based at least on the at least one image attention feature and the at least one text attention feature.

In an implementation, generating the response may comprise: generating at least one adaptive attention feature based on the at least one image attention feature and the at least one text attention feature through an adaptive attention model; and generating the response based at least on the at least one adaptive attention feature.

It should be appreciated that the method 1300 may further comprise any steps/processes for generating a response in a conversation according to the embodiments of the present disclosure as mentioned above.

FIG. 14 illustrates an exemplary apparatus 1400 for generating a response in a conversation according to an embodiment.

The apparatus 1400 may comprise: a signal receiving module 1410, for receiving at least one signal from at least one signal source; a text information generating module 1420, for generating text information based on the at least one received signal; a response mode determining module 1430, for determining a response mode based at least on the text information, the response mode indicating an expression style of a response to be generated; and a response generating module 1440, for generating the response based at least on the text information and the response mode.

In an implementation, the at least one signal source may comprise a participant of the conversation or environment in which the conversation is conducted, and wherein the at least one received signal may comprise a text signal and/or a non-text signal, the non-text signal may comprise at least one of an image signal, an audio signal, and a video signal, and the audio signal comprises at least one of a speech signal and a background sound signal.

In an implementation, the at least one received signal comprises two or more signals, and the text information generating module 1420 is further for generating the text information by combining respective text information of the two or more signals.

In an implementation, the response generating module 1440 is further for: generating at least one text attention feature based on the text information and the response mode through a text attention model; and generating the response based at least on the at least one text attention feature.

In an implementation, the at least one received signal may comprise a non-text signal. In some examples, the text information generating module is further for generating the text information through performing signal analysis to the non-text signal.

In an implementation, the non-text signal is an image signal. In some examples, the response generating module 1440 is further for: generating at least one image attention feature based on the image signal through a spatial attention model; generating at least one text attention feature based on the text information and the response mode through a text attention model; and generating the response based at least on the at least one image attention feature and the at least one text attention feature.

In an implementation, the response generating module 1440 is further for: generating at least one adaptive attention feature based on the at least one image attention feature and the at least one text attention feature through an adaptive attention model; and generating the response based at least on the at least one adaptive attention feature.

Moreover, the apparatus 1400 may comprise any other modules configured for generating a response in a conversation according to the embodiments of the present disclosure as mentioned above.

FIG. 15 illustrates an exemplary apparatus 1500 for generating a response in a conversation according to an embodiment. The apparatus 1500 may comprise one or more processors 1510 and a memory 1520 storing computer-executable instructions. When executing the computer-executable instructions, the one or more processors 1510 may: receive at least one signal from at least one signal source; generate text information based on the at least one received signal; determine a response mode based at least on the text information, the response mode indicating an expression style of a response to be generated; and generate the response based at least on the text information and the response mode.
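By way of a non-limiting illustration, the four operations performed by the one or more processors may be sketched end to end as follows. Every function here is a simplified stand-in: the keyword rule replaces a trained classifier, the templates replace a response generation model, and non-text signals are assumed to have already been analyzed into text. The function names are hypothetical; only the response mode names are taken from the present disclosure.

```python
def generate_text_information(signals):
    """Combine the text carried by each (kind, text) pair.

    Assumption: non-text signals have already been analyzed into text
    (for example an image caption); that analysis is not shown here.
    """
    return " ".join(text.strip() for _kind, text in signals if text.strip())


def determine_response_mode(text_information):
    """Toy rule standing in for a trained classifier (hypothetical)."""
    if text_information.endswith("?"):
        return "topic maintaining answer mode"
    return "topic maintaining question mode"


def generate_response(text_information, response_mode):
    """Template-based stand-in for a response generation model."""
    if response_mode == "topic maintaining answer mode":
        return "Here is what I think about: " + text_information
    return "Could you tell me more about: " + text_information + "?"


def respond(signals):
    """Receive signals, derive text, pick a response mode, then respond."""
    text_information = generate_text_information(signals)
    response_mode = determine_response_mode(text_information)
    return response_mode, generate_response(text_information, response_mode)


mode, response = respond([("speech", "I went hiking"), ("image", "mountain trail")])
print(mode)
print(response)
```

The sketch keeps the response mode as an explicit intermediate value, so that the same text information can lead to differently styled responses depending on the mode that is determined.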

The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for generating a response in a conversation according to the embodiments of the present disclosure as mentioned above.

It should be appreciated that all the operations in the methods described above are merely exemplary, and that the present disclosure is not limited to any particular operations in the methods or to any particular order of these operations, and should cover all other equivalents under the same or similar concepts.

It should also be appreciated that all the modules in the apparatuses described above may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.

Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.

Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although memory is shown separate from the processors in the various aspects presented throughout the present disclosure, the memory may be internal to the processors (e.g., cache or register).

The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims

1. A method for generating a response in a conversation, comprising:

receiving at least one signal from at least one signal source;

generating text information based on the at least one received signal;

determining a response mode based at least on the text information, the response mode indicating an expression style of a response to be generated; and

generating the response based at least on the text information and the response mode.

2. The method of claim 1, wherein the at least one signal source comprises a participant of the conversation or environment in which the conversation is conducted.

3. The method of claim 1, wherein the at least one received signal comprises a text signal and/or a non-text signal, the non-text signal comprises at least one of an image signal, an audio signal, and a video signal, and the audio signal comprises at least one of a speech signal and a background sound signal.

4. The method of claim 1, wherein the at least one received signal comprises two or more signals, and generating the text information comprises:

generating the text information by combining respective text information of the two or more signals.

5. The method of claim 1, wherein the response mode comprises at least one of a positive response mode and a negative response mode.

6. The method of claim 1, wherein the response mode comprises at least one of a topic maintaining statement mode, a topic maintaining question mode, a topic maintaining answer mode, a topic switching statement mode, a topic switching question mode and a topic switching answer mode.

7. The method of claim 1, wherein determining the response mode comprises:

determining the response mode based at least on the text information through a neural network classifier.

8. The method of claim 1, wherein generating the response comprises:

generating at least one text attention feature based on the text information and the response mode through a text attention model; and

generating the response based at least on the at least one text attention feature.

9. The method of claim 1, wherein the at least one received signal comprises a non-text signal, and generating the text information comprises:

generating the text information through performing signal analysis to the non-text signal.

10. The method of claim 9, wherein the non-text signal is an image signal, and generating the response comprises:

generating at least one image attention feature based on the image signal through a spatial attention model;

generating at least one text attention feature based on the text information and the response mode through a text attention model; and

generating the response based at least on the at least one image attention feature and the at least one text attention feature.

11. The method of claim 10, wherein generating the response comprises:

generating at least one adaptive attention feature based on the at least one image attention feature and the at least one text attention feature through an adaptive attention model; and

generating the response based at least on the at least one adaptive attention feature.

12. An apparatus for generating a response in a conversation, comprising:

a signal receiving module, for receiving at least one signal from at least one signal source;

a text information generating module, for generating text information based on the at least one received signal;

a response mode determining module, for determining a response mode based at least on the text information, the response mode indicating an expression style of a response to be generated; and

a response generating module, for generating the response based at least on the text information and the response mode.

13. The apparatus of claim 12, wherein:

the at least one signal source comprises a participant of the conversation or environment in which the conversation is conducted, and

the at least one received signal comprises a text signal and/or a non-text signal, the non-text signal comprises at least one of an image signal, an audio signal, and a video signal, and the audio signal comprises at least one of a speech signal and a background sound signal.

14. The apparatus of claim 12, wherein the at least one received signal comprises two or more signals, and the text information generating module is further for generating the text information by combining respective text information of the two or more signals.

15. The apparatus of claim 12, wherein the response generating module is further for:

generating at least one text attention feature based on the text information and the response mode through a text attention model; and

generating the response based at least on the at least one text attention feature.

16. The apparatus of claim 12, wherein the response mode comprises at least one of a topic maintaining statement mode, a topic maintaining question mode, a topic maintaining answer mode, a topic switching statement mode, a topic switching question mode and a topic switching answer mode.

17. The apparatus of claim 12, wherein the at least one received signal comprises a non-text signal, and the text information generating module is further for:

generating the text information through performing signal analysis to the non-text signal.

18. The apparatus of claim 17, wherein the non-text signal is an image signal, and the response generating module is further for:

generating at least one image attention feature based on the image signal through a spatial attention model;

generating at least one text attention feature based on the text information and the response mode through a text attention model; and

generating the response based at least on the at least one image attention feature and the at least one text attention feature.

19. The apparatus of claim 18, wherein the response generating module is further for:

generating at least one adaptive attention feature based on the at least one image attention feature and the at least one text attention feature through an adaptive attention model; and

generating the response based at least on the at least one adaptive attention feature.

20. An apparatus for generating a response in a conversation, comprising:

one or more processors; and

a memory storing computer-executable instructions that, when executed, cause the one or more processors to:

receive at least one signal from at least one signal source;

generate text information based on the at least one received signal;

determine a response mode based at least on the text information, the response mode indicating an expression style of a response to be generated; and

generate the response based at least on the text information and the response mode.