CN113704419A - Conversation processing method and device - Google Patents

Conversation processing method and device

Info

Publication number
CN113704419A
Authority
CN
China
Prior art keywords
input
sentence
image
emotion
reply
Prior art date
Legal status
Pending
Application number
CN202110219068.9A
Other languages
Chinese (zh)
Inventor
田植良
闭玮
史树明
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110219068.9A
Publication of CN113704419A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides a conversation processing method, a conversation processing apparatus, an electronic device and a computer-readable storage medium, relating to the field of artificial intelligence. The method comprises the following steps: acquiring an input sentence and an input image in a conversation; encoding the input sentence and the input image to obtain a context variable that fuses the sequence information of the input sentence with the emotion information expressed by the input image; and decoding the context variable to obtain a reply sentence which responds to the input sentence and is adapted to the emotion information. With the method and apparatus, reply sentences adapted to the conversation scene can be generated according to the emotion of the user, improving the accuracy of the reply sentences.

Description

Conversation processing method and device
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for processing a dialog, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Among them, the man-machine dialogue system is an important branch of the field of artificial intelligence. Its main objective is to enable machines to understand and use natural language so as to interact with users as a human would. Man-machine dialogue systems have broad application prospects, such as the dialogue interfaces of various robots, intelligent customer service systems, personal assistants and the like.
However, the man-machine dialogue systems provided by the related art usually use a corpus and templates to interpret the information input by the user, and then select a corresponding reply sentence for the response. For example, only the literal meaning of the sentence input by the user is used to retrieve a related reply sentence, and human emotion is not taken into account, so that the communication in the man-machine conversation is not smooth, the accuracy of the man-machine conversation is low, and the user experience is poor.
Disclosure of Invention
The embodiments of the application provide a conversation processing method and device, an electronic device and a computer-readable storage medium, which can generate a reply sentence adapted to the conversation scene according to the emotion of the user and improve the accuracy of the reply sentence.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a conversation processing method, which comprises the following steps:
acquiring an input sentence and an input image in a conversation;
encoding the input sentence and the input image to obtain a context variable, wherein the context variable fuses the sequence information of the input sentence and the emotion information expressed by the input image;
and decoding the context variable to obtain a reply sentence which is used for responding to the input sentence and is adaptive to the emotion information.
An embodiment of the present application provides a dialog processing apparatus, including:
the acquisition module is used for acquiring input sentences and input images in the conversation;
the encoding module is used for encoding the input sentence and the input image to obtain a context variable, wherein the context variable fuses the sequence information of the input sentence and the emotion information expressed by the input image;
and the decoding module is used for decoding the context variable to obtain a reply sentence which is used for responding to the input sentence and is adaptive to the emotion information.
In the above scheme, the obtaining module is further configured to obtain the sequence information of the input sentence; the apparatus further comprises an image representation module for extracting corresponding image features from the input image; the apparatus further comprises an in-image object representation module for extracting, from the input image, object features of the objects included in the input image; the encoding module is further configured to perform fusion processing on the sequence information of the input sentence, the image features of the input image, and the object features of the objects in the input image to obtain multi-modal features, and to call an encoder to encode the multi-modal features to obtain the context variable fused with the sequence information of the input sentence and the emotion information expressed by the input image.
In the above scheme, the obtaining module is further configured to perform word segmentation on the input sentence, splice the word embedding vectors corresponding to the words obtained by the word segmentation, and determine the splicing result as the sequence information of the input sentence; the image representation module is further configured to scale the input image to a fixed size to obtain a standard image, perform convolution processing on the standard image, perform pooling processing on the obtained convolution features to obtain pooled features, and perform full-connection processing on the pooled features to obtain the image features corresponding to the input image; the in-image object representation module is configured to perform target detection processing on the input image to obtain a bounding box of each object in the input image, and to extract the object features of the corresponding object from the bounding box of each object.
In the above scheme, the context variable is also fused with the fluency of the input sentence; the decoding module is further configured to decode the context variable to obtain a reply sentence which is used for responding to the input sentence and is adapted to the emotion information and the fluency.
In the above scheme, the decoding module is further configured to decode the context variable to obtain a reply word corresponding to the context variable and a selected probability of the reply word; and selecting at least one reply word to form a reply sentence which is used for responding the input sentence and is adaptive to the emotional information according to the selection probability of the reply word.
In the above scheme, the encoding process and the decoding process are implemented by a dialogue model; the acquisition module is further configured to acquire a training image and a training corpus, where the training corpus includes a sample input sentence and a target reply sentence corresponding to the sample input sentence; the encoding module and the decoding module are further configured to jointly perform encoding processing and decoding processing on the training image and the sample input sentence through the dialogue model to obtain a predicted reply sentence for responding to the sample input sentence; the device further comprises a text emotion classification module, configured to perform emotion prediction processing on the predicted reply sentence through an emotion classification model to obtain an emotion score representing positive emotion; the device further comprises an updating module, configured to substitute the emotion score and the loss function of the dialogue model into a first objective function, so as to determine the parameters of the dialogue model when the first objective function reaches its minimum value, and to update the dialogue model based on the parameters; wherein the first objective function performs weighted summation on the emotion score and the loss function of the dialogue model.
In the above scheme, the encoding process and the decoding process are implemented by a dialogue model; the acquisition module is further configured to acquire a training image and a training corpus, where the training corpus includes a sample input sentence and a target reply sentence corresponding to the sample input sentence; the encoding module and the decoding module are further configured to jointly perform encoding processing and decoding processing on the training image and the sample input sentence through the dialogue model to obtain a predicted reply sentence for responding to the sample input sentence; the text emotion classification module is further configured to perform emotion prediction processing on the predicted reply sentence through an emotion classification model to obtain an emotion score representing positive emotion; the device further comprises a language fluency module, configured to perform fluency prediction processing on the predicted reply sentence through a language fluency model to obtain a fluency score representing the fluency of the sentence; the updating module is further configured to substitute the emotion score, the fluency score and the loss function of the dialogue model into a second objective function, so as to determine the parameters of the dialogue model when the second objective function reaches its minimum value, and to update the dialogue model based on the parameters; wherein the second objective function performs weighted summation on the emotion score, the fluency score and the loss function of the dialogue model.
In the foregoing solution, the language fluency module is further configured to train the language fluency model in the following manner: obtaining normally expressed sentences as positive samples for training the language fluency model; obtaining abnormally expressed sentences as negative samples for training the language fluency model; and constructing a training set from the positive samples and the negative samples, and training the language fluency model on the training set.
In the foregoing solution, the language fluency module is further configured to perform, for a normally expressed sentence, at least one of the following processes (see the sketch after this paragraph): deleting part of the words in the normally expressed sentence, and using the sentence obtained after the deletion as an abnormally expressed sentence; and replacing part of the words in the normally expressed sentence with irrelevant words, and using the sentence obtained after the replacement as an abnormally expressed sentence.
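An illustrative sketch of this positive/negative sample construction follows (the function name, the whitespace tokenization and the filler vocabulary are assumptions made for illustration, not details prescribed by the application):

```python
import random

def make_abnormal_sentence(sentence: str, mode: str = "delete",
                           ratio: float = 0.3, filler_vocab=None) -> str:
    """Turn a normally expressed sentence into an abnormally expressed one,
    either by deleting part of its words or by replacing part of its words
    with irrelevant words."""
    words = sentence.split()  # whitespace tokenization is an assumption
    if mode == "delete":
        kept = [w for w in words if random.random() > ratio]
        return " ".join(kept if kept else words[:1])
    if mode == "replace":
        filler_vocab = filler_vocab or ["banana", "quartz", "seventeen"]  # irrelevant filler words
        return " ".join(random.choice(filler_vocab) if random.random() < ratio else w
                        for w in words)
    raise ValueError(f"unknown mode: {mode}")

# Positive sample: the original fluent sentence; negative samples: its corrupted forms.
positive = "the weather is very nice today"
negatives = [make_abnormal_sentence(positive, "delete"),
             make_abnormal_sentence(positive, "replace")]
training_set = [(positive, 1)] + [(s, 0) for s in negatives]
print(training_set)
```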
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the conversation processing method provided by the embodiment of the application when executing the executable instructions stored in the memory.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions for causing a processor to implement a dialog processing method provided by the embodiment of the present application when the processor executes the executable instructions.
The embodiment of the application has the following beneficial effects:
the input sentence and the input image are jointly coded to obtain the context variable fused with the sequence information of the input sentence and the emotion information expressed by the input image, and then the obtained context variable is decoded to obtain the reply sentence which is used for responding to the input sentence and is adaptive to the emotion information.
Drawings
Fig. 1 is a schematic architecture diagram of a dialog processing system 100 according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a server 200 provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a dialog processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a dialog processing method provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of Fast R-CNN provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of an encoder provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a decoder provided in an embodiment of the present application;
fig. 8 is a schematic diagram of a CNN provided in an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating an embodiment of the present application for obtaining object features of an object in an input image;
FIG. 10 is a schematic diagram of a basic dialogue model provided by the related art;
fig. 11 is a schematic diagram of an improved dialogue model provided by an embodiment of the application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second", and the like are only used to distinguish similar objects and do not denote a particular order or importance. Where permissible, the specific order or sequence may be interchanged, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Human-Machine Conversation System: refers to a system capable of performing a human-machine conversation, and includes a Task-Oriented (Task-Oriented) conversation system and a Non-Task-Oriented (Non-Task-Oriented) conversation system (also referred to as a chat robot). Taking the non-task oriented dialog system as an example, the non-task oriented dialog system can perform chatting interaction with the user, provide reasonable reply and entertainment functions, and generally mainly focus on open fields to talk with the user.
2) Encoder (Encoder): the function is to convert an input sequence of indefinite length into a vector of definite length, which encodes the sequence information of the input sequence. A common encoder is a Recurrent Neural Network (RNN).
3) Decoder (Decoder): the effect of this is to map the fixed-length vectors generated by the encoder into an output sequence of indefinite length. A common decoder is also a recurrent neural network, such as a Long Short-Term memory neural network (LSTM).
In the related art, the man-machine conversation is usually implemented by determining input information (e.g., a sentence or an image input by a user) of the user using a corpus and a template, and selecting a corresponding reply sentence for response.
Because emotion is a very important factor in conversations between people, a person adjusts his or her own conversation strategy according to the current emotion of the other party, thereby achieving smooth communication. However, the man-machine dialogue systems provided by the related art usually retrieve related reply sentences only according to the literal meaning of the sentence input by the user, or extract only non-emotional information from the image input by the user, and cannot select an appropriate reply sentence according to the emotion of the user, so that the man-machine conversation is not smooth, the accuracy of the man-machine conversation is low, and the user experience is poor.
In view of the foregoing technical problems, embodiments of the present application provide a dialogue processing method, apparatus, electronic device, and computer-readable storage medium, which can generate a reply sentence adapted to the conversation scene according to the emotion of the user, so as to improve the accuracy of the reply sentence.
The following describes an exemplary application of the electronic device provided in the embodiments of the present application, and the electronic device provided in the embodiments of the present application may be implemented as various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, and a portable game device), and may also be implemented as a server, or implemented by implementing the server and the user terminal cooperatively. In the following, an exemplary application will be explained when the electronic device is implemented as a server.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of a dialogue processing system 100 according to an embodiment of the present application, which supports an application that generates a reply sentence in response according to the emotion of the user, so as to improve the accuracy of the reply sentence. As shown in fig. 1, the dialogue processing system 100 includes the server 200, the network 300, and the terminal device 400, which are explained below separately.
The server 200 runs a trained dialog model, and is configured to perform encoding processing on an input sentence and an input image received by the terminal device 400 to obtain a context variable, where the context variable is fused with sequence information of the input sentence and emotion information expressed by the input image; then, the server 200 continues to call the trained dialogue model to decode the context variable, so as to obtain a reply sentence which is used for responding to the sentence input by the user and is adaptive to the emotion information; finally, the server 200 returns the reply sentence output by the conversation model to the terminal device 400, so that the terminal device 400 calls the chat page provided by the client 410 for presentation.
The network 300, which is used as a medium for communication between the server 200 and the terminal device 400, may be a wide area network or a local area network, or a combination of both.
The terminal device 400 runs a client 410, the client 410 may be a chat application, such as a chat assistant, an intelligent customer service, and the like, and the terminal device 400 acquires an image and a sentence input by a user in a chat page through the client 410 and transmits the acquired input image and input sentence to the server 200 through the network 300. The terminal device 400 is further configured to invoke a chat page of the client 410 for presentation after receiving the reply sentence returned by the server 200.
It should be noted that, the dialog processing method provided in the embodiment of the present application may be implemented independently by the server, or may be implemented independently by the terminal device, or may be implemented cooperatively by the server and the terminal device, for example, the server trains the dialog model and issues the trained dialog model to the terminal device, so that the terminal device may perform a human-computer dialog directly based on the received dialog model.
Next, an exemplary application when the electronic device implementing the dialog processing method provided in the embodiment of the present application is a terminal device is described.
In some embodiments, taking the terminal device 400 in fig. 1 as an example, the client 410 runs on the terminal device 400, where the implementation form of the client 410 may be a functional module integrated in an operating system, an independent Application (e.g., a chat assistant, an intelligent customer service, etc.), or an Application Programming Interface (API) integrated in an Application, and is used for being called by other applications. Taking the implementation form of the client 410 as an example of a chat assistant, in this embodiment, the chat assistant implements the dialog processing method in the terminal device 400 in an offline manner (i.e., without depending on the server 200), that is, the processes of encoding the input sentence and the input image and decoding the output reply sentence are all performed by the chat assistant in the terminal device 400 alone.
Illustratively, after receiving an image and a sentence input by a user in a chat page, a chat assistant invokes a trained dialog model stored in the terminal device 400 to encode the received input image and input sentence, so as to obtain a context variable fused with sequence information of the input sentence and emotion information expressed by the input image; and then, the chat assistant continuously calls the dialogue model to decode the context variable to obtain a reply sentence which is used for responding the input sentence and is adaptive to the emotion information, and the reply sentence output by the dialogue model is displayed in the chat page.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal device 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal device 400 and the server 200 may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited thereto.
The following describes the configuration of the server 200 in fig. 1. Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 according to an embodiment of the present application, where the server 200 shown in fig. 2 includes: at least one processor 210, memory 240, at least one network interface 220. The various components in server 200 are coupled together by a bus system 230. It is understood that the bus system 230 is used to enable connected communication between these components. The bus system 230 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 230 in fig. 2.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 240 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 240 optionally includes one or more storage devices physically located remote from processor 210.
The memory 240 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 240 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 240 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, to support various operations, as exemplified below.
An operating system 241, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 242 for communicating to other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;
in some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows a dialog processing apparatus 243 stored in the memory 240, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: an acquisition module 2431, an encoding module 2432, a decoding module 2433, an image representation module 2434, an object-in-image representation module 2435, a text emotion classification module 2436, an update module 2437, and a language fluency module 2438, which are logical and thus can be arbitrarily combined or further separated depending on the functionality implemented. It is noted that in fig. 2, for convenience of expression, all the modules described above are shown at once, but should not be construed as excluding implementations that may include only the acquisition module 2431, the encoding module 2432 and the decoding module 2433 at the dialogue processing device 243, and the functions of the respective modules will be described below.
Different software implementations of the dialog processing device 243 are illustrated below.
Example one, the dialogue processing device can be an application program and a module running on the terminal equipment
The embodiments of the application can provide a software module designed in a programming language such as C/C++ or Java and embedded into various terminal apps (such as game applications) based on systems such as Android and iOS (stored in a storage medium of the terminal as executable instructions and executed by a processor of the terminal), so that related tasks such as model training and application are completed directly with the computing resources of the terminal, and the results of model training, application and the like are transmitted to a remote server periodically or aperiodically through various network communication modes, or are stored locally on the mobile terminal.
Example two, the conversation processing device may be a server application and platform
The embodiments of the application can provide application software designed in a programming language such as C/C++ or Java, or a dedicated software module in a large-scale software system, running on the server side (stored in a storage medium of the server in the form of executable instructions and run by a processor of the server). The server combines at least one of the received raw data, intermediate data of various levels and final results from other devices with data or results already existing on the server to train a model and identify transactions using the trained model, and then outputs the model or the result of the transaction identification to other applications or modules in real time or non-real time for use; the model or the result of the transaction identification can also be written into a database or a file on the server side for storage.
The embodiment of the application can also provide a User Interface (UI) design platform and the like for individuals, groups or enterprises to use by carrying a customized and easily interactive network (Web) Interface or other User interfaces on a distributed and parallel computing platform formed by a plurality of servers. The user can upload the existing data packets to the platform in batch to obtain various calculation results, and can also transmit the real-time data stream to the platform to calculate and refresh each stage of results in real time.
Third, the dialog processing device can be a server side Application Program Interface (API) and a plug-in
The embodiment of the application can provide an API for realizing model training function and abnormal transaction identification based on model generation, a Software Development Kit (SDK) or a plug-in for server side application program developers to call and embed into various application programs.
Example four, the dialog processing device may be a terminal device client API and a plug-in
The embodiment of the application can also provide an API, an SDK or a plug-in for realizing the model training function of the terminal equipment end and generating abnormal transaction identification based on a machine learning or deep learning model, so that other terminal application developers can call the API, the SDK or the plug-in, and the API, the SDK or the plug-in is embedded into various application programs.
Example five, the conversation processing device may be a cloud open service
The embodiments of the application can provide a cloud service for designing the UI interface of abnormal transaction processing based on artificial intelligence, and can also provide an Android Application Package (APK), a Software Development Kit (SDK), a plug-in and the like for the UI-interface design of the cloud service. These can be packaged into a cloud service that is usable by personnel inside and outside an enterprise, or the various results can be displayed in a suitable form on various terminal display devices for individuals, groups or enterprises.
In other embodiments, the dialog processing Device provided in the embodiments of the present Application may be implemented in hardware, and for example, the dialog processing Device provided in the embodiments of the present Application may be a processor in the form of a hardware decoding processor, which is programmed to execute the dialog processing method provided in the embodiments of the present Application, for example, the processor in the form of the hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The following describes a dialog processing method provided in an embodiment of the present application in detail with reference to the drawings. Referring to fig. 3, fig. 3 is a schematic flowchart of a dialog processing method provided in an embodiment of the present application; in some embodiments, the session processing method provided in the embodiments of the present application may be implemented by a server or a terminal device alone, or may be implemented by the server and the terminal device in cooperation. In the following steps, the server is taken as a single embodiment to describe the session processing method provided in the embodiment of the present application.
In step S101, an input sentence and an input image in a dialogue are acquired.
In some embodiments, a trained dialogue model is run in the server, so that the input sentence and the input image in the dialogue are acquired through the trained dialogue model. The input image may be an image uploaded to the server by the user through the terminal device (for example, an image pre-stored on the terminal device or an image captured by the user in real time through a camera carried by the terminal device), or an image pre-stored in the server (for example, an image stored in the server and adapted to an existing dialog scene), or an image in an existing image set.
For example, taking the implementation form of the client as an independent application as an example, a chat application such as a chat assistant runs on the terminal device of the user. When the user clicks to enter the chat page provided by the chat assistant, an opening sentence may be input in the chat page (for example, the user may input an opening sentence such as "how is the weather today?"), together with an image. After receiving the sentence and the image input by the user in the chat page, the chat assistant uploads the input sentence and the input image to the server, so that the server calls the trained dialogue model to perform subsequent processing on the received input sentence and input image.
In step S102, an input sentence and an input image are encoded to obtain a context variable, where the context variable is fused with sequence information of the input sentence and emotion information expressed by the input image.
In some embodiments, step S102 shown in fig. 3 may be implemented by steps S1021 to S1024 shown in fig. 4, which will be described in conjunction with the steps shown in fig. 4.
In step S1021, sequence information of the input sentence is acquired.
In some embodiments, the server may obtain the sequence information of the input sentence by: performing word segmentation on the input sentence, performing splicing processing on word embedding vectors corresponding to each word obtained by the word segmentation, and determining the obtained splicing result as sequence information of the input sentence.
For example, in order to analyze the input sentence with the dialogue model, the words in the text corresponding to the input sentence first need to be converted into vectors, that is, the words are fed to the dialogue model in numerical form. Word embedding is a method for converting the words in a text into numerical vectors: a high-dimensional space whose dimension equals the vocabulary size is embedded into a continuous vector space of much lower dimension, each word or phrase is mapped to a vector over the real numbers, and the word vector is the result of the word embedding. Common word embedding methods include one-hot encoding, distributed representation, the skip-gram model, the Continuous Bag-of-Words (CBOW) model, and the like.
Among them, one-hot encoding is the most basic vector representation method, which represents a word in the text by a vector of vocabulary size, in which only the entry corresponding to the word is 1 and all other entries are 0. The purpose of the distributed representation is to find a transformation function that converts each word into an associated vector, so that the similarity between vectors is related to the semantic similarity between the words. The basic idea of the skip-gram model is to predict the surrounding context words from each center word and to correct the vector of the center word according to the prediction result. The basic idea of the continuous bag-of-words model is to predict the vector of the center word from the vectors of the surrounding context words.
For example, for the input sentence "how much weather is today", the server first performs word segmentation processing on the input sentence to obtain a corresponding word sequence: the method comprises the steps of obtaining feature vectors corresponding to each word respectively by using a word embedding method for each word in a word sequence, namely obtaining the feature vectors corresponding to the word respectively, then, splicing the feature vectors corresponding to the word respectively by a server, and using the spliced vector as sequence information of an input sentence of 'weather so today'.
It should be noted that, in the embodiment of the present application, a manner of obtaining sequence information of an input sentence (i.e., converting the input sentence into a corresponding vector sequence) is not particularly limited, and any word embedding manner may be used to implement the method.
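The following is a minimal sketch of this step in PyTorch (the toy vocabulary, the embedding dimension and the whitespace word segmentation are assumptions made for illustration; a real Chinese input sentence would require an actual word segmenter):

```python
import torch
import torch.nn as nn

# Toy vocabulary and embedding table; in practice both come from the trained dialogue model.
vocab = {"<unk>": 0, "how": 1, "is": 2, "the": 3, "weather": 4, "today": 5}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

def sentence_to_sequence_info(sentence: str) -> torch.Tensor:
    """Segment the sentence into words, look up the word-embedding vector of
    each word, and splice the vectors into the sequence information."""
    words = sentence.lower().split()                      # word segmentation (whitespace is assumed)
    ids = torch.tensor([vocab.get(w, 0) for w in words])  # map each word to its id
    return embedding(ids)                                 # spliced word vectors a1, a2, ..., shape (num_words, 8)

sequence_info = sentence_to_sequence_info("how is the weather today")
print(sequence_info.shape)  # torch.Size([5, 8])
```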
In step S1022, a corresponding image feature is extracted from the input image.
In some embodiments, the server may extract corresponding image features from the input image by: zooming an input image to a fixed size to obtain a standard image; performing convolution processing on the standard image, and performing pooling processing on the obtained convolution characteristics to obtain pooling characteristics; and carrying out full connection processing on the pooled features to obtain image features corresponding to the input image.
For example, for an input image, the server may extract corresponding image features (i.e., a vectorized representation of the input image) from the input image through a trained Convolutional Neural Network (CNN). The basic structure of a CNN includes a convolutional layer, a pooling layer and a fully-connected layer. The convolutional layer is mainly used for feature extraction: the input of each neuron is connected to a local receptive field of the previous layer, and features of that local receptive field are extracted. The pooling layer is mainly used to compress the convolution features output by the convolutional layer, which reduces the feature size and simplifies the computational complexity of the network; on the other hand, the pooling operation extracts the main features from the convolution features as pooled features. There are two common pooling operations: average pooling (Avg Pooling) and maximum pooling (Max Pooling). The fully-connected layer is mainly used for feature mapping; each computation layer of the network is composed of a plurality of feature maps, each feature map is a plane, and the weights of all neurons on the plane are equal. The user can customize the structure of the CNN (e.g., the number of layers and the connectivity between layers), determine the parameters of each layer through training, and then perform feature extraction on the input image through the trained CNN. The extracted features can be represented in a vector space to obtain the image feature vector of the image, thereby mapping the input image into a low-dimensional vector space.
For example, for input image A, the server may first scale the received input image A to a fixed size using a spatial transformation matrix, resulting in a standard image (assume that the size of the standard image is 32 × 32 × 3, where 32 × 32 is width × height and 3 is the depth of the standard image (i.e., R, G, B)). Then, the server inputs the standard image into the trained CNN, so that the convolutional layer of the CNN performs feature extraction on the standard image. The convolutional layer may use a 5 × 5 × 3 receptive field (filter), where the depth of the receptive field must be the same as that of the standard image; the convolution of the receptive field with the standard image yields a 28 × 28 × 1 feature map. Next, the feature map obtained by the convolution operation is input into the pooling layer, and the input feature map is compressed, for example to a size of 13 × 13 using maximum pooling. Finally, the compressed feature map is input into the fully-connected layer for feature mapping, thereby mapping input image A into a low-dimensional vector space and obtaining the image features corresponding to input image A.
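A minimal sketch of this pipeline (scale to a fixed size, convolve, pool, fully connect) is given below; the layer sizes are assumptions chosen only to reproduce the 32 × 32 × 3 / 28 × 28 / 13 × 13 numbers of the example, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Map a standard image to a low-dimensional image feature vector."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)  # 5x5x3 receptive field: 32x32 -> 28x28
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)                    # maximum pooling: 28x28 -> 13x13
        self.fc = nn.Linear(13 * 13, feature_dim)                            # full connection maps to a low-dimensional space

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv(image))        # convolution extracts local features
        x = self.pool(x)                        # pooling compresses the feature map
        return self.fc(x.flatten(start_dim=1))  # image feature vector B

standard_image = torch.rand(1, 3, 32, 32)       # input image A scaled to the fixed size 32x32x3
image_feature = ImageEncoder()(standard_image)
print(image_feature.shape)                      # torch.Size([1, 128])
```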
In step S1023, object features of an object included in the input image are extracted from the input image.
In some embodiments, the server may extract object features of objects included in the input image from the input image by: carrying out target detection processing on the input image to obtain a bounding box of each object in the input image; and extracting object features of the corresponding object from the surrounding frame of each object.
For example, the server may first perform target detection processing on the input image using a regional convolutional neural network (R-CNN, Regions with CNN features) to obtain a bounding box for each object in the input image. The process by which R-CNN performs target detection on the input image is as follows: first, a picture (such as the input image) is input into the model; then a preset number (e.g., 200) of regions to be detected are extracted from the picture; feature extraction is performed on the preset number of regions to be detected one by one (i.e., in a serial manner) through a convolutional neural network; the extracted features are classified through a Support Vector Machine (SVM) to determine the class of each object; and the size of the target bounding box is adjusted through bounding box regression. After the bounding box of each object in the input image is obtained, for each bounding box, the object features of the corresponding object (i.e., a vectorized representation of the object) may be extracted from the bounding box in a manner similar to that in step S1022.
For example, the server may also use Fast R-CNN for target detection on the input image. Referring to FIG. 5, FIG. 5 is a schematic diagram of Fast R-CNN provided by an embodiment of the present application. As shown in fig. 5, a predetermined number (e.g., 100) of regions to be detected are determined on the picture (e.g., the input image); feature extraction is then performed through a convolutional neural network; the feature corresponding to each region of interest (ROI) is extracted from the full-picture features through a Region of Interest Pooling Layer (ROI Pooling Layer); and classification and bounding box correction are performed through a Fully Connected Layer (FC Layer). After the bounding box of each object in the input image is obtained, for each bounding box, the object features of the corresponding object may be extracted from the bounding box in a manner similar to that in step S1022, which is not described here again in this embodiment of the present application.
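The sketch below uses the pretrained Faster R-CNN detector shipped with torchvision as a stand-in for the R-CNN / Fast R-CNN detectors described above; the score threshold, the crop size and the choice of this particular detector are assumptions, and the cropped boxes would then be fed to an image-feature extractor such as the CNN sketched for step S1022:

```python
import torch
import torch.nn.functional as F
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained detector (the keyword is `weights` in recent torchvision releases;
# older releases use `pretrained=True` instead).
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_bounding_boxes(image: torch.Tensor, score_threshold: float = 0.7) -> torch.Tensor:
    """Perform target detection on the input image and return the bounding
    box of each detected object as (x1, y1, x2, y2)."""
    with torch.no_grad():
        output = detector([image])[0]            # dict with "boxes", "labels", "scores"
    return output["boxes"][output["scores"] > score_threshold]

def crop_objects(image: torch.Tensor, boxes: torch.Tensor, size: int = 32) -> torch.Tensor:
    """Crop every bounding box and scale it to a fixed size, ready for the
    object-feature extraction of step S1023."""
    crops = []
    for x1, y1, x2, y2 in boxes.round().long():
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)
        crops.append(F.interpolate(crop, size=(size, size)))
    return torch.cat(crops) if crops else torch.empty(0, 3, size, size)

input_image = torch.rand(3, 480, 640)            # an RGB image with values in [0, 1]
boxes = detect_bounding_boxes(input_image)
object_crops = crop_objects(input_image, boxes)  # one fixed-size crop per detected object
```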
It should be noted that steps S1021 to S1023 may be executed in parallel or sequentially in any order, for example, step S1021 first, then step S1022, and finally step S1023, or step S1022 first and step S1023 afterwards; the embodiment of the present application is not limited in this respect.
In step S1024, fusion processing is performed on the sequence information of the input sentence, the image features of the input image, and the object features of the object in the input image, so as to obtain multi-modal features; and calling an encoder to encode the multi-modal characteristics to obtain the context variables fused with the sequence information of the input sentences and the emotion information expressed by the input images.
In some embodiments, after acquiring the sequence information of the input sentence, the image feature of the input image, and the object feature of the object in the input image, the server performs a fusion process (also referred to as a stitching process) on the sequence information of the input sentence, the image feature of the input image, and the object feature of the object in the input image to obtain a multi-modal feature, for example, the image feature of the input image and the object feature of the object in the input image may be directly stitched behind the sequence information of the input sentence; next, the encoder (for example, RNN) is called to perform encoding processing on the multi-modal feature obtained by the fusion, and a context variable in which sequence information of the input sentence and emotion information expressed in the input image are fused is obtained.
For example, taking the input sentence "how is the weather today" and the input image A as an example, the server first maps the text of the input sentence into word vectors through word embedding to obtain the sequence information of the input sentence, assumed to be {a1, a2, a3}, where a1 is the word vector corresponding to "today", a2 is the word vector corresponding to "weather", and a3 is the word vector corresponding to "how". Then, the server obtains the image feature of image A through the CNN, assumed to be the vector B. Subsequently, the server obtains the object features of the objects in image A through Fast R-CNN, assumed to be {b1, b2}, where b1 is the feature vector corresponding to object c in image A, and b2 is the feature vector corresponding to object d in image A. After obtaining the sequence information of the input sentence, the image feature of image A, and the object features of the objects in image A, the server splices these features to obtain the multi-modal features {a1, a2, a3, B, b1, b2}.
Referring to fig. 6, fig. 6 is a schematic diagram of an encoder according to an embodiment of the present application. As shown in fig. 6, after obtaining the multi-modal features through the fusion process, the server may input the multi-modal features into the encoder (e.g., an RNN) for encoding, so as to convert the input multi-modal features into a fixed-length vector c (i.e., the context variable fused with the sequence information of the input sentence and the emotion information expressed by image A). There are various ways to obtain the vector c: the last hidden state of the encoder may be used directly as the vector c, i.e., c = h6; the last hidden state may be transformed to obtain the vector c, i.e., c = q(h6); or all hidden states may be transformed to obtain the vector c, i.e., c = q(h1, h2, h3, h4, h5, h6), where h0 is the initial hidden state, h1 is the hidden state obtained by encoding the vector a1, h2 is the hidden state obtained by encoding the vector a2, and so on, until h6 is the hidden state obtained by encoding the vector b2.
Of course, the encoder may also be constructed using a bi-directional recurrent neural network. In this case, the hidden state of the encoder at each time step depends on both the sub-sequences before and after the time step (including the input of the current time step) and encodes the information of the entire sequence. And then, splicing the hidden states of each time step by using a splicing mode to obtain the context variable fused with the sequence information of the input sentence and the emotion information expressed by the input image.
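A minimal sketch of the fusion and encoding step follows, in which the word vectors {a1, a2, a3}, the image feature B and the object features {b1, b2} are spliced into one sequence and fed to a GRU encoder whose last hidden state is taken directly as the context variable c; the shared feature dimension and the hidden size are assumptions:

```python
import torch
import torch.nn as nn

dim = 8                                            # shared feature dimension (assumed)
encoder = nn.GRU(input_size=dim, hidden_size=16, batch_first=True)

sentence_vectors = torch.rand(3, dim)              # sequence information {a1, a2, a3}
image_feature = torch.rand(1, dim)                 # image feature B of the whole input image
object_features = torch.rand(2, dim)               # object features {b1, b2} of the detected objects

# Fusion (splicing): image and object features are appended after the sequence
# information, giving the multi-modal features {a1, a2, a3, B, b1, b2}.
multimodal = torch.cat([sentence_vectors, image_feature, object_features], dim=0)

# Encoding: the last hidden state h6 is used directly as the context variable c.
_, h_n = encoder(multimodal.unsqueeze(0))          # h_n has shape (1, 1, 16)
context_c = h_n.squeeze(0)                         # c = h6
print(context_c.shape)                             # torch.Size([1, 16])
```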
In step S103, the context variable is decoded to obtain a reply sentence adapted to the emotion information in response to the input sentence.
In some embodiments, the server may decode the context variable to obtain a reply sentence adapted to the emotion information in response to the input sentence by: decoding the context variable to obtain a reply word corresponding to the context variable and the selected probability of the reply word; and selecting at least one reply word to form a reply sentence which is used for responding the input sentence and is adaptive to the emotional information according to the selected probability of the reply word.
For example, after the server calls the encoder to encode the spliced multi-modal features into a context variable fused with the sequence information of the input sentence and the emotion information expressed by the input image, a decoder (for example, another RNN) is called to decode the context variable to obtain the reply words corresponding to the context variable and the selected probability of each reply word; then, according to the selected probability of each reply word, at least one reply word is selected to form a reply sentence which responds to the input sentence and is adapted to the emotion information expressed by the input image. The RNN may adopt a Long Short-Term Memory network (LSTM) or a Gated Recurrent Unit (GRU).
For example, referring to fig. 7, fig. 7 is a schematic diagram of a decoder according to an embodiment of the present application, and as shown in fig. 7, after calling an encoder to perform encoding processing on a multi-modal feature to obtain a vector c with a fixed length (i.e., a context variable fused with sequence information of an input sentence and emotion information expressed by an input image), the server inputs the vector c into the decoder to perform decoding to obtain an output sequence. Wherein, the output of the previous time is used as the input of the current time, and the vector c is only used as the initial state to participate in the operation, and the latter operation is not related to the vector c. Of course, the vector c may also participate in the operation at all time instants of the sequence, that is, the output at the previous time instant is still used as the input of the current time instant, but the vector c may participate in the operation at all time instants.
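A sketch of greedy decoding from the context variable c: at each step the previous output word is fed back as the current input, a softmax over the vocabulary gives the selected probability of each reply word, and the most probable word is appended to the reply. The toy vocabulary, the special tokens and the greedy selection are assumptions; other selection strategies over the word probabilities are equally possible:

```python
import torch
import torch.nn as nn

vocab = ["<bos>", "<eos>", "the", "weather", "is", "sunny", "today"]  # toy reply vocabulary
embed_size, hidden_size = 8, 16

embedding = nn.Embedding(len(vocab), embed_size)
decoder_cell = nn.GRUCell(input_size=embed_size, hidden_size=hidden_size)
output_layer = nn.Linear(hidden_size, len(vocab))

def decode(context_c: torch.Tensor, max_len: int = 10) -> str:
    """Use the context variable c as the initial decoder state and generate
    reply words one by one according to their selected probability."""
    hidden = context_c                                # c only initializes the decoder state here
    word_id = torch.tensor([0])                       # start from <bos>
    reply = []
    for _ in range(max_len):
        hidden = decoder_cell(embedding(word_id), hidden)
        probs = torch.softmax(output_layer(hidden), dim=-1)  # selected probability of each reply word
        next_id = probs.argmax(dim=-1)                # greedy: pick the most probable reply word
        word = vocab[next_id.item()]
        if word == "<eos>":
            break
        reply.append(word)
        word_id = next_id
    return " ".join(reply)

print(decode(torch.zeros(1, hidden_size)))            # with untrained weights the reply is arbitrary
```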
In other embodiments, the context variable may further be fused with fluency of the input sentence, and the server may perform the decoding processing on the context variable in the following manner to obtain a reply sentence adapted to the emotion information and used for responding to the input sentence: and decoding the context variable to obtain a reply sentence which is used for responding to the input sentence and is adaptive to the emotional information and the fluency.
In some embodiments, the encoding process and the decoding process may be implemented by a dialogue model, and the server may further perform the following operations before encoding and decoding the input sentence and the input image through the dialogue model: acquiring a training image and a training corpus, wherein the training corpus comprises a sample input sentence and a target reply sentence corresponding to the sample input sentence; encoding and decoding the training image and the sample input sentence through the dialogue model to obtain a predicted reply sentence for responding to the sample input sentence; performing emotion prediction processing on the predicted reply sentence through the emotion classification model to obtain an emotion score representing positive emotion; substituting the emotion score and the loss function of the dialogue model (for example, a loss function obtained from the difference between the predicted reply sentence output by the dialogue model and the target reply sentence) into the first objective function, so as to determine the parameters of the dialogue model when the first objective function reaches its minimum value, and updating the dialogue model based on the parameters; the first objective function performs weighted summation on the emotion score and the loss function of the dialogue model.
For example, the server may train the dialogue model as follows: after the dialogue model outputs a predicted reply sentence for responding to the sample input sentence, the server calls an emotion classification model (for example, a CNN-based text emotion classification model; emotion classification is essentially a text classification task, so a classic CNN architecture for text classification can be used) to perform emotion prediction on the predicted reply sentence and obtain an emotion score representing positive emotion (the emotion score may lie in the interval [0, 1], where 1 indicates positive emotion and 0 indicates negative emotion). The server then substitutes the emotion score and the loss function of the dialogue model itself into the first objective function, determines the parameters of the dialogue model at which the first objective function attains its minimum value, and updates the dialogue model with the obtained parameters. The loss function of the dialogue model itself may take the error between the predicted reply sentence and the target reply sentence as the difference factor; possible types include the Mean Squared Error (MSE) loss, the hinge loss, the cross-entropy loss, and the like. Taking the squared loss as an example, when the number of samples is n, the loss function can be expressed as:
L(Y, f(X)) = Σ_{i=1}^{n} (Y_i − f(X_i))^2
where Y represents the target reply sentence, f(X) represents the predicted reply sentence, and Y − f(X) is the error between them, serving as the difference factor; the expression as a whole is the sum of squared errors over the n samples. The first objective function can then be expressed as:
F1=(1-λ)S1+λL
where S1 represents the emotion score, L represents the loss function of the dialogue model itself, λ is the weight of the loss function L, and (1 − λ) is the weight of the emotion score S1; the final goal of training is to minimize the value of the first objective function F1.
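As a concrete, simplified illustration of F1 = (1 − λ)S1 + λL, the sketch below assumes a PyTorch setting in which the dialogue model's own loss L is the cross entropy between the predicted reply and the target reply, and the emotion score S1 is a scalar produced by the emotion classification model on the decoded reply (treated as a constant with respect to the model parameters); the function name and the default value of λ are illustrative.

```python
import torch
import torch.nn.functional as F

def first_objective(dialog_logits, target_ids, emotion_score, lam=0.5):
    """F1 = (1 - lam) * S1 + lam * L, following the weighted-sum form above.

    dialog_logits : (batch, seq_len, vocab) scores of the predicted reply sentence
    target_ids    : (batch, seq_len) token ids of the target reply sentence
    emotion_score : scalar S1 in [0, 1] returned by the emotion classification model
    """
    # L: the dialogue model's own loss (cross entropy between prediction and target)
    L = F.cross_entropy(dialog_logits.reshape(-1, dialog_logits.size(-1)),
                        target_ids.reshape(-1))
    S1 = torch.as_tensor(emotion_score, dtype=L.dtype)
    return (1 - lam) * S1 + lam * L
```

During training, the returned value would be minimized in the usual way (loss.backward() followed by optimizer.step()).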
In other embodiments, the encoding process and the decoding process may be implemented by a dialogue model, and before performing the encoding and decoding on the input sentence and the input image through the dialogue model, the server may further perform the following operations: acquiring a training image and a training corpus, where the training corpus includes a sample input sentence and a target reply sentence corresponding to the sample input sentence; encoding and decoding the training image and the sample input sentence through the dialogue model to obtain a predicted reply sentence for responding to the sample input sentence; performing emotion prediction on the predicted reply sentence through the emotion classification model to obtain an emotion score representing positive emotion; performing fluency prediction on the predicted reply sentence through the language fluency model to obtain a fluency score representing the fluency of the sentence; substituting the emotion score, the fluency score, and the loss function of the dialogue model into a second objective function, determining the parameters of the dialogue model at which the second objective function attains its minimum value, and updating the dialogue model based on these parameters; the second objective function performs weighted summation of the emotion score, the fluency score, and the loss function of the dialogue model.
Illustratively, when training the dialogue model, the server may also incorporate sentence fluency information. That is, when the dialogue model outputs a predicted reply sentence for responding to an input sentence, the server not only calls the emotion classification model to perform emotion prediction on the predicted reply sentence and obtain an emotion score representing positive emotion, but also calls a trained language fluency model to perform fluency prediction on the predicted reply sentence and obtain a fluency score representing the fluency of the sentence. The server then substitutes the emotion score, the fluency score, and the loss function of the dialogue model into the second objective function simultaneously, determines the parameters of the dialogue model at which the second objective function attains its minimum value, and updates the dialogue model with the obtained parameters. The second objective function may be expressed as:
F2=λ1S1+λ2S2+λ3L
where S1 represents the emotion score, S2 the fluency score, and L the loss function of the dialogue model itself; λ1 is the weight of the emotion score S1, λ2 the weight of the fluency score S2, and λ3 the weight of the loss function L. The final goal of training is to minimize the value of the second objective function F2.
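The second objective only adds the fluency term; a minimal sketch under the same assumptions, with illustrative weight values:

```python
def second_objective(dialog_loss, emotion_score, fluency_score,
                     lam1=0.3, lam2=0.3, lam3=0.4):
    """F2 = lam1 * S1 + lam2 * S2 + lam3 * L; the weights lam1..lam3 are illustrative."""
    return lam1 * emotion_score + lam2 * fluency_score + lam3 * dialog_loss
```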
In some embodiments, before performing the fluency prediction on the predicted reply sentence through the language fluency model, the server may further perform the following operations: obtaining normally expressed sentences as positive samples for training the language fluency model; obtaining abnormally expressed sentences as negative samples for training the language fluency model; constructing a training set from the positive samples and the negative samples, and training the language fluency model on the constructed training set.
In other embodiments, building on the above, the server may obtain abnormally expressed sentences in the following ways: deleting some of the words in a normally expressed sentence and using the resulting sentence as an abnormally expressed sentence; or replacing some of the words in a normally expressed sentence with irrelevant words and using the resulting sentence as an abnormally expressed sentence.
For example, the server may train the language fluency model as follows: obtain normally expressed sentences and use them as positive examples for training the language fluency model (these sentences read fluently, and their scores should be high); then add "noise" to some of the words in the normally expressed sentences to turn them into abnormal sentences, which serve as negative examples for training the language fluency model (these sentences are not fluent, and their scores should be low). Adding "noise" to some of the words in a normally expressed sentence includes randomly masking some words, or randomly replacing some words in the sentence with irrelevant words. A training sample set is then constructed from the positive examples and negative examples, and the language fluency model is trained on this set to obtain the trained language fluency model, which can subsequently perform fluency prediction on predicted reply sentences and produce the corresponding fluency scores.
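One possible way to build such a training set is sketched below; the deletion and replacement probabilities, the whitespace tokenization (a Chinese corpus would first be word-segmented) and the list of irrelevant words are all illustrative placeholders.

```python
import random

def make_abnormal(sentence, drop_prob=0.15, replace_prob=0.15,
                  irrelevant_words=("apple", "window", "seven")):
    """Turn a normally expressed sentence into a negative example by randomly
    deleting some words or replacing them with irrelevant words."""
    noisy = []
    for word in sentence.split():
        r = random.random()
        if r < drop_prob:
            continue                                       # randomly drop the word
        if r < drop_prob + replace_prob:
            noisy.append(random.choice(irrelevant_words))  # replace with an irrelevant word
        else:
            noisy.append(word)
    return " ".join(noisy)

def build_fluency_training_set(normal_sentences):
    """Label 1 = fluent positive example, label 0 = noisy negative example."""
    data = [(s, 1) for s in normal_sentences]
    data += [(make_abnormal(s), 0) for s in normal_sentences]
    return data
```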
According to the dialogue processing method provided by the embodiments of the present application, the emotion information expressed by the input image is fused into the context variable obtained by encoding, so that the user's emotion can be recognized and perceived from the input image and a reply sentence adapted to that emotion can be obtained by decoding (and, because the dialogue model is trained with an emotion score favoring positive emotion, the replies it outputs tend to be emotionally positive, which encourages the user, makes the user happy, and improves the user experience). At the same time, the fluency of the input sentence is also fused into the context variable, so that the decoded reply sentence is expressed more smoothly, further improving the accuracy of the reply.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
In the related art, man-machine dialogue is usually implemented by interpreting the user's input information (e.g., a sentence or an image input by the user) with the help of a corpus and templates, and selecting a corresponding reply sentence as the response. That is, the solutions provided by the related art do not consider the user's emotion: they analyze only the text data input by the user, or extract only non-emotional information from the image input by the user, in order to generate the corresponding reply sentence.
Emotion, however, is a very important factor in communication between people: a person adjusts his or her conversation strategy according to the other party's current emotion, which keeps the exchange smooth. The man-machine dialogue systems provided by the related art usually retrieve related reply sentences only according to the literal meaning of the sentence input by the user, or extract only non-emotional information from the image input by the user; they cannot select an appropriate reply sentence according to the user's emotion, so the man-machine exchange is not smooth, the accuracy of the man-machine dialogue is low, and the user experience is poor.
In view of this, embodiments of the present application provide a dialogue processing method that can recognize and perceive the user's emotion from an image input by the user (e.g., a selfie of the user) and generate a reply sentence adapted to that emotion, so that the dialogue encourages the user, makes the user happy, and fosters positive emotion.
The dialogue processing method provided by the embodiments of the present application can be applied to multi-modal dialogue scenarios (for example, involving the two modalities of image and text), such as intelligent customer service and chat assistants. The user says a sentence to the intelligent customer service (for example, by typing the corresponding text on the chat page of the intelligent customer service or by entering it through voice) and attaches a picture (for example, the user's facial expression); the intelligent customer service can then recognize and perceive the user's emotion from the attached picture and return a reply sentence adapted to that emotion.
The following specifically describes a dialog processing method provided in the embodiment of the present application.
The dialogue processing method provided by the embodiments of the present application may be executed by a server or by a terminal device. Taking the terminal device as an example, the following five functional modules run on the terminal device: an image representation module, an in-image object representation module, a text emotion classification module, a language fluency module, and a basic dialogue model. Each functional module is described in detail below.
First, image representation module
The image representation module is mainly used to represent an image as a vector that carries the information of the image. For example, a trained Convolutional Neural Network (CNN) may be used to extract features from the image, and the extracted features are represented in a vector space to obtain the image feature vector of the image, thereby mapping the image into a low-dimensional vector space.
For example, referring to fig. 8, fig. 8 is a schematic diagram of the CNN principle provided in the embodiment of the present application. As shown in fig. 8, the image passes sequentially through a convolutional layer, a pooling layer, and a fully connected layer, and finally a vector representation of the image is obtained; this representation is fed into the subsequent dialogue model (for example, a dialogue model built on a Transformer).
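A minimal sketch of such an image representation module, assuming a PyTorch implementation; the layer sizes and the output dimension are illustrative, and in practice a pre-trained CNN (e.g., a ResNet) would typically be used instead of the tiny network below.

```python
import torch
import torch.nn as nn

class ImageRepresentation(nn.Module):
    """Image -> vector, following Fig. 8: convolution, pooling, fully connected layer."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Linear(64 * 4 * 4, out_dim)

    def forward(self, image):            # image: (batch, 3, H, W), scaled to a fixed size
        x = self.features(image)
        return self.fc(x.flatten(1))     # (batch, out_dim) image-level vector

# Example: image_vector = ImageRepresentation()(torch.randn(1, 3, 224, 224))
```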
Second, in-image object representation module
The in-image object representation module is mainly used to identify the objects included in an image and to represent each identified object as a corresponding vector, where each vector carries the information of the corresponding object. As shown in fig. 9, Faster R-CNN may be used to perform target detection on the image input by the user to obtain a bounding box for each object in the input image; then, for each bounding box, a trained CNN maps the object inside the bounding box into a corresponding vector representation, and these vectors are also fed into the subsequent Transformer model. The target detection process with Faster R-CNN is as follows: first, shared convolutional layers extract features from the whole image; the resulting feature map is fed into a Region Proposal Network (RPN), which generates candidate boxes (specifying ROI positions) and performs a first correction of the ROI bounding boxes; next, the Fast R-CNN detection head is applied: an ROI Pooling layer selects, according to the RPN output, the features corresponding to each ROI on the feature map and resizes them to a fixed dimension; finally, fully connected layers classify the bounding boxes and perform a second bounding-box correction.
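A rough sketch of this pipeline using the pre-trained Faster R-CNN detector shipped with torchvision (the weights argument assumes torchvision 0.13 or later; older versions use pretrained=True). The per-object encoder below is a toy stand-in for the trained CNN described above, and the score threshold, minimum box size and vector dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
object_encoder = nn.Sequential(              # toy stand-in for a trained CNN over each crop
    nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(), nn.Linear(3 * 8 * 8, 256))

@torch.no_grad()
def object_vectors(image, score_thresh=0.7):
    """image: (3, H, W) float tensor in [0, 1]; returns one 256-d vector per detected object."""
    detections = detector([image])[0]        # dict with 'boxes', 'labels', 'scores'
    vectors = []
    for box, score in zip(detections["boxes"], detections["scores"]):
        if score < score_thresh:
            continue
        x1, y1, x2, y2 = box.int().tolist()
        if x2 - x1 < 2 or y2 - y1 < 2:       # skip degenerate bounding boxes
            continue
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)       # crop the bounding-box region
        vectors.append(object_encoder(crop).squeeze(0))
    return torch.stack(vectors) if vectors else torch.empty(0, 256)
```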
Third, text emotion classification module
The function of the text emotion classification module is: given an input sentence, output an emotion score for it, where the score lies in the interval [0, 1], with 1 indicating positive emotion and 0 indicating negative emotion. Text emotion classification is essentially a text classification task, so a classic CNN architecture for text classification, i.e., a CNN-based text emotion classification model, can be used. After the dialogue model outputs a reply sentence, the text emotion classification module scores the emotion of the reply sentence, and the score is used for training the dialogue model.
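A minimal sketch of such a CNN-based text emotion classifier (a TextCNN-style architecture); the vocabulary size, filter counts and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextCNNEmotion(nn.Module):
    """Classic CNN text classifier: parallel convolutions over word embeddings,
    max pooling over time, then a sigmoid score in [0, 1] (1 = positive, 0 = negative)."""
    def __init__(self, vocab_size, embed_dim=128, num_filters=64, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), 1)

    def forward(self, token_ids):                     # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)     # (batch, embed_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.sigmoid(self.fc(torch.cat(pooled, dim=1))).squeeze(1)  # emotion score S1
```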
Fourth, language fluency module
The function of the language fluency module is: given an input sentence, output a fluency score for the sentence.
For example, the language fluency model may be trained as follows: take normal sentences as positive examples (these sentences read fluently, and their scores should be high); add some "noise" to some of the words in a normal sentence to turn it into an abnormal sentence, and take the resulting sentence as a negative example (these sentences are not fluent, and their scores should be low). Ways of adding "noise" to a normal sentence include randomly masking (Mask) words in the sentence, or randomly replacing (Replace) them with irrelevant words.
The language fluency module may be trained in advance, before training of the entire dialogue model begins. After this pre-training, the language fluency module is no longer updated; instead, it serves throughout the training of the dialogue model as a tool that provides a "language fluency" score.
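A minimal sketch of this "pre-train, then freeze" usage, with an illustrative GRU-based scorer standing in for the language fluency model (the architecture is an assumption; the patent only specifies the input/output behaviour):

```python
import torch
import torch.nn as nn

class FluencyScorer(nn.Module):
    """Illustrative fluency model: GRU over word embeddings, sigmoid score in [0, 1]."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):                 # (batch, seq_len)
        _, h = self.gru(self.embed(token_ids))    # h: (1, batch, hidden_dim)
        return torch.sigmoid(self.fc(h[-1])).squeeze(1)

scorer = FluencyScorer(vocab_size=30000)
# ... pre-train on the positive/negative sentence set, then freeze:
scorer.requires_grad_(False)                      # no further updates during dialogue-model training
scorer.eval()                                     # used only as a tool that provides the fluency score S2
```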
Fifth, basic dialogue model
For example, referring to fig. 10, fig. 10 is a schematic diagram of a basic dialogue model provided in the related art. As shown in fig. 10, the basic dialogue model provided by the related art is a dialogue model based on plain text: for example, when the user inputs the text "The weather is really nice today!", the basic dialogue model generates the reply sentence "Yes! Let's go out and have fun." The basic dialogue model may be of different types as needed, for example a sequence-to-sequence (seq2seq) model (suitable when the amount of data is small, e.g., no more than about 100,000 samples) or a Transformer model (suitable when the amount of data is large, e.g., 100,000 samples or more).
That is, the basic dialogue model takes a sentence as input and generates a corresponding reply sentence (i.e., another sentence). The specific process of generating the reply sentence is: a sentence is input, an encoder (Encoder) is called to encode it into a corresponding vector, and a decoder (Decoder) is then called to decode that vector into another sentence. Here, the input sentence is the sentence entered by the user, and the generated sentence is the reply given by the basic dialogue model.
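Paired with the decoder sketch shown earlier, a minimal encoder could look like the following (again an illustrative PyTorch sketch, not the patent's implementation; for larger corpora a Transformer encoder would be substituted, as noted above):

```python
import torch
import torch.nn as nn

class GRUSentenceEncoder(nn.Module):
    """Minimal encoder: reads the input sentence and produces the fixed-length vector c."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                 # (batch, seq_len)
        _, h = self.gru(self.embed(token_ids))    # h: (1, batch, hidden_dim)
        return h[-1]                              # (batch, hidden_dim): the context vector c
```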
Based on the basic dialogue model shown in fig. 10, the following describes how the four modules introduced above are incorporated into the basic dialogue model.
For example, referring to fig. 11, fig. 11 is a schematic diagram of an improved dialogue model provided in an embodiment of the present application. As shown in fig. 11, the image representation module and the in-image object representation module process the image input by the user: the image representation module generates an Image-Level Vector for the entire image, and the in-image object representation module generates an Object-Level Vector for each object in the image; when the image contains multiple objects, a series of object vectors is produced. These vectors (the image vector generated by the image representation module and the series of object vectors generated by the in-image object representation module) are spliced with the input text and fed into the basic dialogue model.
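The splicing step can be pictured as a simple concatenation along the sequence dimension; the sketch below assumes that the image vector, the object vectors and the word embeddings have already been projected to a common dimension d (in practice this would require small projection layers, which the figure does not spell out).

```python
import torch

def multimodal_features(text_embeddings, image_vector, object_vectors):
    """Splice the image-level vector, the object-level vectors and the input-text
    embeddings into one sequence for the basic dialogue model.

    text_embeddings : (text_len, d) word embeddings of the input sentence
    image_vector    : (d,)          vector for the whole image
    object_vectors  : (num_obj, d)  one vector per detected object
    """
    return torch.cat([image_vector.unsqueeze(0), object_vectors, text_embeddings], dim=0)
```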
After the basic dialogue model outputs a reply sentence, the embodiment of the present application may further, borrowing the idea of reinforcement learning, provide two reward (evaluation) functions for the generated reply sentence and perform feedback learning using the resulting evaluations. The text emotion classification module scores whether the emotion of the reply sentence output by the basic dialogue model is positive; its score serves as one evaluation, and the higher the better. The language fluency module scores the fluency of the reply sentence output by the basic dialogue model; its score serves as the other evaluation, and again the higher the better. The reinforcement learning step uses these two scores to guide the training of the basic dialogue model, so that the model learns to extract positive emotional information from the image and reflect it in the output reply sentence.
In addition, training can be combined with the loss function of the basic dialogue model itself (whose optimization goal is maximum likelihood estimation, i.e., making the generated reply sentence as close as possible to the reference reply sentence in the data set), so the final training objective can be expressed as:
final objective function = weighted combination of the emotion score produced by the text emotion classification module, the fluency score produced by the language fluency module, and the loss function (maximum likelihood estimation) of the basic dialogue model itself.
The dialogue processing method provided by the embodiment of the application can improve the effect of the multi-modal dialogue system, so that the emotion of the reply sentence generated by the dialogue system is more positive, the user feels more comfortable, and the user experience is improved.
Continuing with the exemplary structure of the dialog processing device 243 provided by the embodiment of the present application implemented as a software module, in some embodiments, as shown in fig. 2, the software module stored in the dialog processing device 243 of the memory 240 may include: an acquisition module 2431, an encoding module 2432, a decoding module 2433, an image representation module 2434, an object-in-image representation module 2435, a text emotion classification module 2436, an update module 2437, and a language fluency module 2438.
An obtaining module 2431, configured to obtain an input sentence and an input image in a dialog; the encoding module 2432 is configured to perform encoding processing on the input sentence and the input image to obtain a context variable, where the context variable is fused with sequence information of the input sentence and emotion information expressed by the input image; and a decoding module 2433, configured to decode the context variable to obtain a reply sentence adapted to the emotion information and used for responding to the input sentence.
In some embodiments, the obtaining module 2431 is further configured to obtain sequence information of the input sentence; the dialog processing device 243 further includes an image representation module 2434 for extracting corresponding image features from the input image; the dialog processing device 243 further includes an in-image object representation module 2435 for extracting, from the input image, object features of an object included in the input image; the encoding module 2432 is further configured to perform fusion processing on the sequence information of the input sentence, the image features of the input image, and the object features of the object in the input image to obtain multi-modal features; and calling an encoder to encode the multi-modal characteristics to obtain the context variables fused with the sequence information of the input sentences and the emotion information expressed by the input images.
In some embodiments, the obtaining module 2431 is further configured to perform word segmentation on the input sentence, perform splicing on word embedding vectors corresponding to each word obtained through the word segmentation, and determine an obtained splicing result as sequence information of the input sentence; an image representation module 2434, further configured to scale the input image to a fixed size, resulting in a standard image; performing convolution processing on the standard image, and performing pooling processing on the obtained convolution characteristics to obtain pooling characteristics; performing full connection processing on the pooled features to obtain image features corresponding to the input image; an in-image object representation module 2435, configured to perform target detection processing on the input image to obtain a bounding box of each object in the input image; and extracting object features of the corresponding object from the surrounding frame of each object.
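As an illustration of the word-segmentation and splicing step performed by the obtaining module, a minimal sketch is given below; the vocabulary size and embedding dimension are illustrative, and the mapping from segmented words to token ids is assumed to exist.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(30000, 128)        # vocabulary size and embedding dimension are illustrative

def sequence_information(token_ids):
    """token_ids: (seq_len,) ids of the words obtained by word segmentation.
    The word-embedding vectors are spliced (concatenated) in order, and the
    splicing result is used as the sequence information of the input sentence."""
    word_vectors = embedding(token_ids)      # (seq_len, 128), one vector per word
    return torch.cat(list(word_vectors), dim=0)   # spliced result: (seq_len * 128,)
```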
In some embodiments, the context variables are also fused with the fluency of the input sentence; the decoding module 2433 is further configured to perform decoding processing on the context variable to obtain a reply sentence adapted to the emotion information and fluency and used for responding to the input sentence.
In some embodiments, the decoding module 2433 is further configured to perform decoding processing on the context variable to obtain a reply word corresponding to the context variable and a selected probability of the reply word; and selecting at least one reply word to form a reply sentence which is used for responding the input sentence and is adaptive to the emotional information according to the selected probability of the reply word.
In some embodiments, the encoding process and the decoding process are implemented by a dialog model; the obtaining module 2431 is further configured to obtain a training image and a training corpus, where the training corpus includes a sample input sentence and a target reply sentence corresponding to the sample input sentence; the encoding module 2432 and the decoding module 2433 are further configured to perform encoding processing and decoding processing on the training image and the sample input sentence together through the dialogue model to obtain a prediction reply sentence for responding to the sample input sentence; the dialogue processing device 243 further includes a text sentiment classification module 2436, configured to perform sentiment prediction processing on the predicted reply sentence through the sentiment classification model to obtain a sentiment score representing a positive sentiment; the dialogue processing device 243 further includes an updating module 2437, configured to substitute the emotion scoring and the loss function of the dialogue model into the first objective function to determine a parameter of the dialogue model when the first objective function obtains a minimum value, and update the dialogue model based on the parameter; the first objective function is used for carrying out weighted summation processing on the emotion scoring and the loss function of the dialogue model.
In some embodiments, the encoding process and the decoding process are implemented by a dialog model; the obtaining module 2431 is further configured to obtain a training image and a training corpus, where the training corpus includes a sample input sentence and a target reply sentence corresponding to the sample input sentence; the encoding module 2432 and the decoding module 2433 are further configured to perform encoding processing and decoding processing on the training image and the sample input sentence together through the dialogue model to obtain a prediction reply sentence for responding to the sample input sentence; the text emotion classification module 2436 is further configured to perform emotion prediction processing on the prediction reply sentence through the emotion classification model to obtain an emotion score representing the positive emotion; the dialogue processing device 243 further includes a language fluency module 2438, configured to perform fluency prediction processing on the prediction reply sentences through the language fluency model, so as to obtain fluency scores representing fluency of the sentences; the updating module 2437 is further configured to substitute the emotion scoring, the fluency scoring, and the loss function of the dialogue model into the second objective function to determine a parameter of the dialogue model when the second objective function obtains a minimum value, and update the dialogue model based on the parameter; and the second objective function is used for carrying out weighted summation processing on the emotion scoring, the fluency scoring and the loss function of the conversation model.
In some embodiments, the language fluency module 2438 is further configured to train the language fluency model by: obtaining a sentence expressing the normal, and taking the sentence as a normal sample of a training language fluency model; obtaining a sentence expressing an abnormality as a negative example sample of a training language fluency model; and constructing a training set according to the positive sample and the negative sample, and training the language fluency model according to the training set.
In some embodiments, the language fluency module 2438 is further configured to perform at least one of the following for a normally expressed statement: deleting partial words in the sentences which express the normal expression, and taking the sentences which are obtained after deletion processing as the sentences which express the abnormal expression; and replacing part of words in the normal expression sentences with irrelevant words, and taking the sentences obtained after replacement processing as abnormal expression sentences.
It should be noted that the description of the apparatus embodiments of the present application is similar to that of the method embodiments above and has similar beneficial effects, so it is not repeated here. Technical details of the dialogue processing device provided by the embodiments of the present application that are not exhaustively described here can be understood from the description of any one of fig. 3 to fig. 5.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the conversation processing method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, cause the processor to perform a method provided by embodiments of the present application, for example, a dialog processing method as shown in fig. 3 or fig. 4.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
To sum up, in the embodiments of the present application the input sentence and the input image are jointly encoded to obtain a context variable fused with the sequence information of the input sentence and the emotion information expressed by the input image, and the context variable is then decoded to obtain a reply sentence that responds to the input sentence and is adapted to that emotion information, so that the user's emotion can be recognized from the input image and reflected in the reply.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. A method of dialog processing, the method comprising:
acquiring an input sentence and an input image in a conversation;
coding the input statement and the input image to obtain a context variable, wherein the context variable is fused with sequence information of the input statement and emotion information expressed by the input image;
and decoding the context variable to obtain a reply sentence which is used for responding to the input sentence and is adaptive to the emotion information.
2. The method according to claim 1, wherein the encoding the input sentence and the input image to obtain the context variable comprises:
acquiring sequence information of the input statement;
extracting corresponding image features from the input image, and extracting object features of an object included in the input image from the input image;
performing fusion processing on the sequence information of the input sentence, the image characteristics of the input image and the object characteristics of the object in the input image to obtain multi-modal characteristics;
and calling an encoder to encode the multi-modal characteristics to obtain the context variable fused with the sequence information of the input statement and the emotion information expressed by the input image.
3. The method of claim 2,
the acquiring sequence information of the input sentence comprises:
performing word segmentation on the input sentence, performing splicing processing on word embedding vectors corresponding to each word obtained by the word segmentation, and determining an obtained splicing result as sequence information of the input sentence;
the extracting of the corresponding image features from the input image comprises:
zooming the input image to a fixed size to obtain a standard image;
performing convolution processing on the standard image, and performing pooling processing on the obtained convolution characteristics to obtain pooling characteristics;
performing full connection processing on the pooled features to obtain image features corresponding to the input image;
the extracting, from the input image, object features of an object included in the input image includes:
carrying out target detection processing on the input image to obtain a surrounding frame of each object in the input image;
and extracting object features of the corresponding objects from the surrounding frame of each object.
4. The method of claim 1,
the context variable is also fused with the fluency of the input statement;
the decoding the context variable to obtain a reply sentence which is used for responding to the input sentence and is adaptive to the emotion information, and the method comprises the following steps:
and decoding the context variable to obtain a reply sentence which is used for responding to the input sentence and is adaptive to the emotion information and the fluency.
5. The method of claim 1, wherein the decoding the context variable to obtain a reply sentence adapted to the emotion information in response to the input sentence, comprises:
decoding the context variable to obtain a reply word corresponding to the context variable and the selected probability of the reply word;
and selecting at least one reply word to form a reply sentence which is used for responding to the input sentence and is adaptive to the emotion information according to the selected probability of the reply word.
6. The method of claim 1,
the encoding process and the decoding process are implemented by a dialog model;
before the encoding processing and the decoding processing are performed on the input sentence and the input image by the dialogue model, the method further includes:
training the dialogue model by:
acquiring a training image and a training corpus, wherein the training corpus comprises a sample input sentence and a target reply sentence corresponding to the sample input sentence;
coding and decoding the training image and the sample input sentence together through the dialogue model to obtain a prediction reply sentence for responding to the sample input sentence;
carrying out emotion prediction processing on the prediction reply sentence through an emotion classification model to obtain an emotion score representing positive emotion;
substituting the emotion score and a loss function of the dialogue model into a first objective function to determine parameters of the dialogue model when the first objective function obtains a minimum value, and updating the dialogue model based on the parameters;
wherein the first objective function is used for carrying out weighted summation processing on the emotion scoring and the loss function of the dialogue model.
7. The method of claim 1,
the encoding process and the decoding process are implemented by a dialog model;
before the encoding processing and the decoding processing are performed on the input sentence and the input image by the dialogue model, the method further includes:
training the dialogue model by:
acquiring a training image and a training corpus, wherein the training corpus comprises a sample input sentence and a target reply sentence corresponding to the sample input sentence;
coding and decoding the training image and the sample input sentence together through the dialogue model to obtain a prediction reply sentence for responding to the sample input sentence;
carrying out emotion prediction processing on the prediction reply sentence through an emotion classification model to obtain an emotion score representing positive emotion;
carrying out fluency prediction processing on the prediction reply sentences through a language fluency model to obtain fluency scores representing the fluency of the sentences;
substituting the emotion score, the fluency score and the loss function of the dialogue model into a second objective function to determine the parameters of the dialogue model when the second objective function obtains the minimum value, and updating the dialogue model based on the parameters;
and the second objective function is used for carrying out weighted summation processing on the emotion scoring, the fluency scoring and the loss function of the dialogue model.
8. The method of claim 7,
before the fluency prediction processing is performed on the prediction reply sentence through the language fluency model, the method further comprises:
training the language fluency model by:
obtaining a sentence expressing normally as a normal sample for training the language fluency model;
obtaining a sentence expressing an abnormality as a negative example sample for training the language fluency model;
and constructing a training set according to the positive example sample and the negative example sample, and training the language fluency model according to the training set.
9. The method of claim 8, wherein obtaining statements that express exceptions comprises:
executing at least one of the following processes for the statement with normal expression:
deleting part of words in the normal expression sentences, and taking the sentences obtained after deletion as abnormal expression sentences;
and replacing part of words in the normal expression sentences with irrelevant words, and taking the sentences obtained after replacement processing as the abnormal expression sentences.
10. A conversation processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring input sentences and input images in the conversation;
the encoding module is used for encoding the input statement and the input image to obtain a context variable, and the context variable is fused with sequence information of the input statement and emotion information expressed by the input image;
and the decoding module is used for decoding the context variable to obtain a reply sentence which is used for responding to the input sentence and is adaptive to the emotion information.
CN202110219068.9A 2021-02-26 2021-02-26 Conversation processing method and device Pending CN113704419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110219068.9A CN113704419A (en) 2021-02-26 2021-02-26 Conversation processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110219068.9A CN113704419A (en) 2021-02-26 2021-02-26 Conversation processing method and device

Publications (1)

Publication Number Publication Date
CN113704419A true CN113704419A (en) 2021-11-26

Family

ID=78647727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110219068.9A Pending CN113704419A (en) 2021-02-26 2021-02-26 Conversation processing method and device

Country Status (1)

Country Link
CN (1) CN113704419A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416934A (en) * 2021-12-24 2022-04-29 北京百度网讯科技有限公司 Multi-modal dialog generation model training method and device and electronic equipment
CN114416934B (en) * 2021-12-24 2023-02-07 北京百度网讯科技有限公司 Multi-modal dialog generation model training method and device and electronic equipment
WO2023137922A1 (en) * 2022-01-18 2023-07-27 平安科技(深圳)有限公司 Voice message generation method and apparatus, computer device and storage medium
CN115422950A (en) * 2022-09-01 2022-12-02 美的集团(上海)有限公司 Method and device for evaluating dialog system, electronic equipment and storage medium
CN115169367A (en) * 2022-09-06 2022-10-11 杭州远传新业科技股份有限公司 Dialogue generating method and device, and storage medium
CN116521850A (en) * 2023-07-04 2023-08-01 北京红棉小冰科技有限公司 Interaction method and device based on reinforcement learning
CN116521850B (en) * 2023-07-04 2023-12-01 北京红棉小冰科技有限公司 Interaction method and device based on reinforcement learning

Similar Documents

Publication Publication Date Title
KR102071582B1 (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN113704419A (en) Conversation processing method and device
CN111897933B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN111966800A (en) Emotional dialogue generation method and device and emotional dialogue model training method and device
CN109857865B (en) Text classification method and system
CN114021524B (en) Emotion recognition method, device, equipment and readable storage medium
CN113064968B (en) Social media emotion analysis method and system based on tensor fusion network
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
CN112214585A (en) Reply message generation method, system, computer equipment and storage medium
CN111858898A (en) Text processing method and device based on artificial intelligence and electronic equipment
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN111597341A (en) Document level relation extraction method, device, equipment and storage medium
CN111666400A (en) Message acquisition method and device, computer equipment and storage medium
CN114048301B (en) Satisfaction-based user simulation method and system
CN113657272B (en) Micro video classification method and system based on missing data completion
CN113887836B (en) Descriptive event prediction method integrating event environment information
CN117315070A (en) Image generation method, apparatus, electronic device, storage medium, and program product
CN111027215A (en) Character training system and method for virtual human
CN115759262A (en) Visual common sense reasoning method and system based on knowledge perception attention network
CN116959417A (en) Method, apparatus, device, medium, and program product for detecting dialog rounds
CN117521674B (en) Method, device, computer equipment and storage medium for generating countermeasure information
CN117150320B (en) Dialog digital human emotion style similarity evaluation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination