CN112905754A - Visual conversation method and device based on artificial intelligence and electronic equipment - Google Patents

Visual conversation method and device based on artificial intelligence and electronic equipment

Info

Publication number
CN112905754A
Authority
CN
China
Prior art keywords: channel, feature, dialogue, conversation, processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911294260.3A
Other languages
Chinese (zh)
Inventor
陈飞龙 (Chen Feilong)
孟凡东 (Meng Fandong)
许家铭 (Xu Jiaming)
李鹏 (Li Peng)
徐波 (Xu Bo)
周杰 (Zhou Jie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Institute of Automation of Chinese Academy of Science
Original Assignee
Tencent Technology Shenzhen Co Ltd
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Tencent Technology Shenzhen Co Ltd and Institute of Automation of Chinese Academy of Science
Priority to CN201911294260.3A
Publication of CN112905754A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention provides an artificial-intelligence-based visual dialogue method, apparatus, electronic device and storage medium. The method comprises the following steps: acquiring a dialogue question and a dialogue history corresponding to a picture; determining at least one of the picture and the dialogue history as channel information; performing at least one of tracking processing and positioning processing according to the dialogue question and the channel information to obtain corresponding channel features; fusing the picture, the dialogue question and the dialogue history according to the channel features to obtain a fusion feature; and performing prediction processing according to the fusion feature to obtain a dialogue answer corresponding to the dialogue question. The method and device deepen the multi-modal representation of the dialogue question, improve the accuracy of the obtained dialogue answer, and thereby improve the precision of the visual dialogue.

Description

Visual conversation method and device based on artificial intelligence and electronic equipment
Technical Field
The present invention relates to artificial intelligence technology, and in particular, to a visual dialogue method and apparatus, an electronic device, and a storage medium based on artificial intelligence.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence.
Visual Dialogue (VD) is an important branch of artificial intelligence that integrates machine vision, natural language processing and dialogue systems; its main goal is to teach machines to communicate with humans about visual data in natural language. However, in the visual dialogue schemes provided by the related art, the inference process may drift away from the representation of the original dialogue question, so that the inferred dialogue answer does not correspond to the question and the accuracy of the visual dialogue is poor.
Disclosure of Invention
The embodiments of the invention provide an artificial-intelligence-based visual dialogue method, apparatus, electronic device and storage medium, which can improve the accuracy of the obtained dialogue answers and the user experience of visual dialogue.
The technical scheme of the embodiment of the invention is realized as follows:
An embodiment of the invention provides an artificial-intelligence-based visual dialogue method, which comprises the following steps:
acquiring a dialogue question and a dialogue history corresponding to a picture;
determining at least one of the picture and the dialogue history as channel information;
performing at least one of tracking processing and positioning processing according to the dialogue question and the channel information to obtain corresponding channel features;
fusing the picture, the dialogue question and the dialogue history according to the channel features to obtain a fusion feature;
and performing prediction processing according to the fusion feature to obtain a dialogue answer corresponding to the dialogue question.
An embodiment of the invention provides an artificial-intelligence-based visual dialogue apparatus, which comprises:
an acquisition module for acquiring a dialogue question and a dialogue history corresponding to a picture;
a determining module for determining at least one of the picture and the dialogue history as channel information;
a channel processing module for performing at least one of tracking processing and positioning processing according to the dialogue question and the channel information to obtain corresponding channel features;
a fusion module for fusing the picture, the dialogue question and the dialogue history according to the channel features to obtain a fusion feature;
and a prediction module for performing prediction processing according to the fusion feature to obtain a dialogue answer corresponding to the dialogue question.
An embodiment of the present invention provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the visual dialogue method based on artificial intelligence provided by the embodiment of the invention when executing the executable instructions stored in the memory.
The embodiment of the invention provides a storage medium, which stores executable instructions and is used for causing a processor to execute the executable instructions so as to realize the visual conversation method based on artificial intelligence provided by the embodiment of the invention.
The embodiment of the invention has the following beneficial effects:
according to the embodiment of the invention, the channel characteristics are obtained through at least one of tracking processing and positioning processing, the multi-mode representation of the original dialogue problem is deepened, and the dialogue answer is determined by combining the channel characteristics, the picture, the dialogue problem and the dialogue history, so that the matching degree of the obtained dialogue answer and the dialogue problem is improved, namely the accuracy of the dialogue answer is improved, and the user experience is enhanced.
Drawings
FIG. 1 is an alternative architecture diagram of an artificial intelligence based visual dialog system provided by an embodiment of the present invention;
FIG. 2 is an alternative architecture diagram of a server provided by an embodiment of the invention;
FIG. 3 is an alternative architecture diagram of an artificial intelligence based visual dialog apparatus provided by an embodiment of the present invention;
FIG. 4A is a schematic flow chart diagram of an alternative method for artificial intelligence based visual dialog according to an embodiment of the present invention;
FIG. 4B is a schematic flow chart diagram illustrating an alternative method for artificial intelligence based visual dialog, according to an embodiment of the present invention;
FIG. 4C is an alternative flow diagram of processing in dual channel mode according to an embodiment of the present invention;
FIG. 4D is a schematic flow chart diagram illustrating an alternative method for artificial intelligence based visual dialog, in accordance with an embodiment of the present invention;
FIG. 5 is an alternative architecture diagram of an artificial intelligence based visual dialog system provided by an embodiment of the present invention;
FIG. 6 is an alternative schematic diagram of a dual channel multi-step inference provided by embodiments of the present invention;
fig. 7 is an alternative architecture diagram of a decoder according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the description that follows, the terms "first", "second", and the like are intended only to distinguish similar objects and do not denote a particular ordering of the objects; it is understood that "first", "second", and the like may be interchanged in specific orders or sequences where permitted, so that the embodiments of the invention described herein can be practiced in orders other than those illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.
1) Modality: the medium through which information is conveyed; for example, voice, video and text are each a modality. It is worth noting that "multi-modal" herein means at least two modalities.
2) Dialogue question: a question posed about the picture in the visual dialogue, such as "What color is the boy's hair in the picture?"; it may be posed by the user or generated automatically.
3) Dialogue answer: in the visual dialogue, the answer obtained by replying to the dialogue question according to the content shown in the picture.
4) Dialogue History: comprises a picture description, i.e., an annotated caption of the picture such as "a young boy is playing tennis at a court", together with historical dialogue questions and historical dialogue answers.
5) Channel: the channel through which answers are reasoned in the visual dialogue; herein, the channels comprise a visual information channel and a dialogue history channel. In single-channel mode only tracking processing or only positioning processing is performed, while in dual-channel mode tracking processing and positioning processing are performed simultaneously.
6) Multilayer Perceptron (MLP) model: a feedforward artificial neural network model used to map multiple input features into a single feature.
7) Convolutional Neural Network (CNN) model: a feedforward neural network model with a deep structure that performs convolution computations; it has feature-learning ability and can carry out translation-invariant classification of input information according to its hierarchical structure.
8) Recurrent Neural Network (RNN) model: a neural network model that takes sequence data as input and recurses along the evolution direction of the sequence, with all nodes (recurrent units) connected in a chain.
9) Hidden state: the RNN model allows information to persist by retaining a memory of the current state; the retained state exists in the form of a hidden variable, i.e., the hidden state, usually written as h_t.
The embodiment of the invention provides a visual conversation method, a visual conversation device, electronic equipment and a storage medium based on artificial intelligence, which can improve the accuracy of conversation answers and the user experience of visual conversation.
Referring to fig. 1, fig. 1 is an alternative architecture diagram of an artificial intelligence based visual dialogue system 100 provided by an embodiment of the present invention. To support a visual dialogue application based on artificial intelligence, a terminal device 400 (terminal device 400-1 and terminal device 400-2 are shown as examples) is connected to a server 200 through a network 300, and the server 200 is connected to a database 500, where the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal device 400 is used for displaying pictures on a graphical interface 410 (the graphical interface 410-1 and the graphical interface 410-2 are exemplarily shown); the terminal device 400 is further configured to obtain a dialog question input by the user according to the picture, and send the dialog question to the server 200; the server 200 is configured to obtain a conversation history corresponding to the picture from the database 500, and determine at least one of the picture and the conversation history as channel information; according to the dialogue problem and the channel information, at least one of tracking processing and positioning processing is carried out to obtain corresponding channel characteristics; according to the channel characteristics, fusing the pictures, the conversation problems and the conversation history to obtain fused characteristics; performing prediction processing according to the fusion characteristics to obtain a dialogue answer corresponding to the dialogue question, and sending the dialogue answer to the terminal device 400; the terminal device 400 is also configured to display the dialog answer on the graphical interface 410.
It should be noted that the picture displayed by the terminal device 400 may be a picture obtained by requesting the server 200 in real time, or may be a picture requested by the server 200 in advance and cached locally in the terminal device 400, and the server 200 may obtain the corresponding conversation history from the database 500 according to information such as an address or an identifier of the picture, where the database 500 stores an index relationship between the picture and the conversation history.
The following continues to illustrate exemplary applications of the electronic device provided by embodiments of the present invention. The electronic device may be implemented as various types of terminal devices such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device), and the like, and may also be implemented as a server. Next, an electronic device will be described as an example of a server.
Referring to fig. 2, fig. 2 is a schematic diagram of an architecture of a server 200 (for example, the server 200 shown in fig. 1) provided by an embodiment of the present invention, where the server 200 shown in fig. 2 includes: at least one processor 210, memory 240, and at least one network interface 220. The various components in server 200 are coupled together by a bus system 230. It is understood that the bus system 230 is used to enable connected communication between these components. The bus system 230 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 230 in fig. 2.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 240 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 240 optionally includes one or more storage devices physically located remote from processor 210.
The memory 240 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 240 described in connection with embodiments of the present invention is intended to comprise any suitable type of memory.
In some embodiments, memory 240 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, to support various operations, as exemplified below.
An operating system 241, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 242 for communicating to other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), among others.
In some embodiments, the artificial intelligence based visual dialog apparatus provided by the embodiments of the present invention can be implemented in software, and fig. 2 shows an artificial intelligence based visual dialog apparatus 243 stored in the memory 240, which can be software in the form of programs and plug-ins, etc., and includes the following software modules: the obtaining module 2431, the determining module 2432, the channel processing module 2433, the fusing module 2434, and the predicting module 2435 are logical and thus may be arbitrarily combined or further divided according to the functions implemented. The functions of the respective modules will be explained below.
In other embodiments, the artificial intelligence based visual dialog Device provided by the embodiments of the present invention may be implemented in hardware, for example, the artificial intelligence based visual dialog Device provided by the embodiments of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the artificial intelligence based visual dialog method provided by the embodiments of the present invention, for example, the processor in the form of the hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The visual conversation method based on artificial intelligence provided by the embodiment of the present invention may be executed by the server, or may be executed by a terminal device (for example, the terminal device 400-1 and the terminal device 400-2 shown in fig. 1), or may be executed by both the server and the terminal device.
The process of implementing an artificial intelligence based visual dialog method in an electronic device by means of an embedded artificial intelligence based visual dialog apparatus will be described below in connection with the above-noted exemplary applications and structures of the electronic device.
Referring to fig. 3 and fig. 4A, fig. 3 is an alternative architecture diagram of the artificial intelligence based visual conversation apparatus 243 provided by the embodiment of the present invention, which shows a flow of visual conversation through a series of modules, and fig. 4A is a flow diagram of an artificial intelligence based visual conversation method provided by the embodiment of the present invention, and the steps shown in fig. 4A will be described with reference to fig. 3.
In step 101, a dialog question and a dialog history corresponding to a picture are acquired.
As an example, referring to fig. 3, in the obtaining module 2431, a dialog question input by a user according to a picture is obtained, and a dialog history corresponding to the picture is obtained from a database, which, of course, does not constitute a limitation on the embodiment of the present invention, for example, the dialog question may also be automatically generated. It is worth noting that the conversation history includes the annotated picture description, the historical conversation questions and the historical conversation answers.
In step 102, at least one of a picture and a dialog history is determined as channel information.
As an example, referring to fig. 3, in the determination module 2432, at least one of the picture and the dialogue history is determined as channel information, and different channel modes are used for answer inference according to the content included in the channel information.
In step 103, at least one of tracking processing and positioning processing is performed according to the dialogue problem and the channel information, and a corresponding channel feature is obtained.
The channels of the embodiment of the invention comprise a visual information channel and a dialogue history channel. In single-channel mode, only the visual information channel or only the dialogue history channel is applied: only tracking processing is performed when the visual information channel is applied, and only positioning processing is performed when the dialogue history channel is applied. In dual-channel mode, the visual information channel and the dialogue history channel are applied simultaneously: at least tracking processing is performed in the visual information channel, and at least positioning processing is performed in the dialogue history channel. Through the tracking processing and the positioning processing, the relation between the dialogue question and the channel information can be effectively captured; the specific processing is elaborated in detail later. After the corresponding processing is finished, the output of the single channel or of the two channels is determined as the channel feature corresponding to the picture, and the channel feature may be in vector form.
In step 104, the pictures, the dialogue questions and the dialogue history are fused according to the channel characteristics to obtain fusion characteristics.
Here, the RNN model may be initialized according to the obtained channel characteristics, and the pictures, the dialogue questions, and the dialogue history may be subjected to a fusion process according to the RNN model to obtain fusion characteristics, which will be described in detail later.
In step 105, a prediction process is performed based on the fusion features to obtain a dialogue answer corresponding to the dialogue question.
Here, prediction processing is performed according to the fusion features to obtain an accurate dialog answer corresponding to the dialog question, and the dialog answer is output.
As can be seen from the above exemplary implementation of fig. 4A, the embodiment of the present invention captures the relation between the dialogue question and the channel information through at least one of tracking processing and positioning processing, and the subsequent fusion processing keeps the answer inference process anchored to the original dialogue question, so that the accuracy of the obtained dialogue answer is improved and the user experience is enhanced; the method is suitable for picture-based application scenarios such as dialogue and chat.
In some embodiments, referring to fig. 4B, fig. 4B is an optional flowchart of the artificial intelligence based visual dialog method provided by the embodiment of the present invention, and step 103 shown in fig. 4A may be implemented by any one of step 201 to step 203, which will be described in conjunction with each step.
In step 201, a single channel tracking process is performed according to the dialog problem and the channel information to obtain a visual channel feature.
As an example, referring to fig. 3, in the channel processing module 2433, answer inference can be performed using a single channel mode including only visual information channels. Specifically, the channel information includes pictures, and the visual channel characteristics are obtained by performing tracking processing according to the dialogue problem and the pictures in a single-channel mode.
In some embodiments, the above single-channel tracking processing according to the dialogue question and the channel information may be implemented as follows to obtain the visual channel feature: enhancement processing is performed on the question feature corresponding to the dialogue question and the picture feature corresponding to the picture to obtain an attention weight; the attention weight is normalized; the normalized attention weight and the picture feature are multiplied to obtain the j-th updated question feature; the value of j is iterated until the J-th updated question feature is obtained, and the J-th updated question feature is determined as the visual channel feature, where j takes the values 1, ..., J in sequence and J is an integer greater than 0.
For convenience of processing, feature extraction is first performed on the dialogue question to obtain the question feature and on the picture to obtain the picture feature. In the tracking process, feature enhancement is realized through an attention mechanism: specifically, the question feature is weighted by a first MLP model to obtain a first weighting result, the picture feature is weighted by a second MLP model to obtain a second weighting result, a dot-product operation is performed on the two weighting results, and the result of the dot-product operation is determined as the attention weight. The first MLP model corresponds to the question feature and the second MLP model to the picture feature; the MLP models here may be two-layer MLP models, though MLP models with other numbers of layers are also possible.
Then, the attention weight is normalized, for example by a softmax function. The normalized attention weight is multiplied with the picture feature to obtain the j-th updated question feature, which embodies the relation between the original dialogue question and the picture. The value of j is iterated until the J-th updated question feature is obtained; for ease of distinction, the J-th updated question feature is named the visual channel feature and serves as the output of the visual information channel. Here j takes integer values from 1 to J, starting from 1, and J is an integer greater than 0 that may be preset.
It should be noted that, when only the visual information channel is applied, the tracking processing may be performed once or at least twice; in the latter case, the question feature input to the next tracking pass is the updated question feature output by the previous pass. For example, the first tracking pass outputs updated question feature 1, and then updated question feature 1 and the picture feature are determined as the inputs of the second tracking pass. In addition, the product processing here may use the Hadamard product, although other product operations may also be applied. Through this tracking processing, more accurate updated question features can be generated by exploiting the relation between the dialogue question and the picture, improving the accuracy of answer reasoning.
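To make the tracking pass concrete, the following is a minimal PyTorch sketch of a single tracking step under assumed shapes; the names (TrackStep, d, d_h) and the two-layer MLP widths are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackStep(nn.Module):
    """One tracking pass: enhance the question and picture features with two
    different two-layer MLPs, normalize the resulting attention weight, and
    combine it with the picture features to update the question feature."""
    def __init__(self, d: int, d_h: int):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(d, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
        self.f_v = nn.Sequential(nn.Linear(d, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
        self.w = nn.Linear(d_h, 1)

    def forward(self, q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q: (d,) question feature; v: (K, d) region features of the picture
        gamma = self.f_q(q) * self.f_v(v)                # element-wise product, (K, d_h)
        alpha = F.softmax(self.w(gamma).squeeze(-1), 0)  # normalized attention weight, (K,)
        return alpha @ v                                 # updated question feature, (d,)
```

Iterating this step J times, feeding each output back in as the next question feature, yields the visual channel feature described above.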
In step 202, a single channel positioning process is performed according to the dialogue problem and the channel information to obtain the dialogue history channel characteristics.
As an example, referring to fig. 3, in the channel processing module 2433, answer inference can also be performed using a single channel mode including only a dialog history channel. Specifically, the channel information includes a dialogue history, and positioning processing is performed according to the dialogue problem and the dialogue history in a single-channel mode to obtain dialogue history channel characteristics.
In some embodiments, the above single-channel positioning processing according to the dialogue question and the channel information may be implemented as follows to obtain the dialogue history channel feature: enhancement processing is performed on the question feature of the dialogue question and the dialogue history feature corresponding to the dialogue history to obtain an attention weight; the attention weight is normalized; the normalized attention weight and the dialogue history feature are multiplied to obtain an intermediate dialogue history feature; the intermediate dialogue history feature is activated, and the activation result and the picture description feature in the dialogue history feature are regularized together to obtain the r-th updated question feature; the value of r is iterated until the R-th updated question feature is obtained, and the R-th updated question feature is determined as the dialogue history channel feature, where r takes the values 1, ..., R in sequence and R is an integer greater than 0.
For convenience of processing, feature extraction is first performed on the dialogue question to obtain the question feature and on the dialogue history to obtain the dialogue history feature. In the positioning process, enhancement processing is first performed on the question feature and the dialogue history feature to obtain the attention weight, in the same manner as above, which is not repeated here. The attention weight is normalized, for example by a softmax function, and the normalized attention weight is then multiplied with the dialogue history feature to obtain the intermediate dialogue history feature.
After the intermediate dialogue history feature is activated, the activation result is added to the picture description feature in the dialogue history feature, and the sum is regularized to obtain the r-th updated question feature. The activation here may be implemented by a Rectified Linear Unit (ReLU) function, or by other activation functions such as tanh or Leaky ReLU, and the regularization may be implemented by Layer Normalization.
The value of r is iterated until the R-th updated question feature is obtained; for ease of distinction, the R-th updated question feature is named the dialogue history channel feature and serves as the output of the dialogue history channel. Here r takes integer values from 1 to R, starting from 1, and R is an integer greater than 0 that may be preset according to the actual application scenario. When only the dialogue history channel is applied, the positioning processing may be performed once or at least twice; in the latter case, the question feature input to the next positioning pass is the updated question feature output by the previous pass. For example, the first positioning pass outputs updated question feature 1, and then updated question feature 1 and the dialogue history feature are determined as the inputs of the second positioning pass. In this way, the question feature is updated in combination with the dialogue history feature, deepening the relation between the dialogue history channel feature and the dialogue history feature.
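Under the same assumptions, a sketch of a single positioning pass; the residual addition of the picture-description feature and the Layer Normalization follow the description above, while the name LocateStep and the shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocateStep(nn.Module):
    """One positioning pass over the dialogue history."""
    def __init__(self, d: int, d_h: int):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(d, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
        self.f_u = nn.Sequential(nn.Linear(d, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
        self.w = nn.Linear(d_h, 1)
        self.norm = nn.LayerNorm(d)

    def forward(self, q: torch.Tensor, u: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # q: (d,) question feature; u: (T, d) per-round history features;
        # c: (d,) picture-description feature taken from the dialogue history
        gamma = self.f_q(q) * self.f_u(u)                # enhancement, (T, d_h)
        alpha = F.softmax(self.w(gamma).squeeze(-1), 0)  # normalized attention weight, (T,)
        mid = alpha @ u                                  # intermediate dialogue history feature
        return self.norm(F.relu(mid) + c)                # activate, add caption, regularize
```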
In step 203, performing dual-channel tracking processing according to the dialogue problem and the channel information to obtain a visual channel feature, performing dual-channel positioning processing according to the dialogue problem and the channel information to obtain a dialogue history channel feature, and performing multi-mode fusion processing on the visual channel feature and the dialogue history channel feature to obtain a multi-mode channel feature.
As an example, referring to fig. 3, in the channel processing module 2433, a two-channel mode can also be applied for answer reasoning, i.e., a visual information channel and a dialogue history channel are applied. Specifically, the channel information comprises pictures and conversation history, and under a double-channel mode, tracking processing is carried out in a visual information channel according to conversation problems and channel information to obtain visual channel characteristics; and carrying out positioning processing in the conversation history channel according to the conversation problems and the channel information to obtain the conversation history channel characteristics. And then, performing multi-mode fusion processing on the visual channel characteristics and the dialogue historical channel characteristics to obtain multi-mode channel characteristics.
In some embodiments, the above-mentioned multi-modal fusion processing on the visual channel features and the dialogue history channel features can be implemented in such a way that multi-modal channel features are obtained: enhancing the problem feature and the visual channel feature corresponding to the dialogue problem to obtain a first attention weight; enhancing the problem features corresponding to the conversation problems and the conversation historical channel features to obtain a second attention weight; and splicing the first attention weight and the second attention weight, and activating a splicing result to obtain the multi-modal channel characteristics.
The method includes the steps of performing enhancement processing on problem features and visual channel features corresponding to a conversational problem to obtain a first attention weight, specifically, performing weighting processing on the problem features through a first MLP model corresponding to the problem features to obtain a first weighting result, performing weighting processing on the visual channel features through a second MLP model corresponding to the visual channel features to obtain a second weighting result, and then performing dot product operation on the first weighting result and the second weighting result to obtain the first attention weight. The same enhancement processing is performed on the question feature and the dialogue history channel feature to obtain a second attention weight.
And splicing the first attention weight and the second attention weight, and activating a splicing result to complete multi-mode fusion to obtain multi-mode channel characteristics, wherein the activation can be realized through a tanh function. By means of the method, the information output by the two channels is effectively fused, and the accuracy of the obtained multi-mode channel characteristics is improved.
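A sketch of the multi-modal fusion of the two channel outputs, assuming both channels emit features of the same width d; the splice-then-tanh structure follows the description above, and the class name is an assumption.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, d: int, d_h: int):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(d, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
        self.f_v = nn.Sequential(nn.Linear(d, d_h), nn.ReLU(), nn.Linear(d_h, d_h))
        self.f_u = nn.Sequential(nn.Linear(d, d_h), nn.ReLU(), nn.Linear(d_h, d_h))

    def forward(self, q, v_chan, u_chan):
        w1 = self.f_q(q) * self.f_v(v_chan)     # first attention weight
        w2 = self.f_q(q) * self.f_u(u_chan)     # second attention weight
        return torch.tanh(torch.cat([w1, w2]))  # multi-modal channel feature, (2*d_h,)
```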
As can be seen from the above exemplary implementation of fig. 4B, the embodiment of the present invention provides three ways of determining channel characteristics, so as to effectively perform answer reasoning, improve processing flexibility, and determine a specific processing way according to specific situations of pictures and dialog histories.
In some embodiments, referring to fig. 4C, fig. 4C is an optional flowchart of the processing performed in the dual channel mode according to the embodiment of the present invention, and step 203 shown in fig. 4B may be implemented by steps 301 to 305, which will be described in conjunction with the steps.
In step 301, a tracking process is performed according to the dialog question and the channel information to obtain a visual channel feature.
In the dual channel mode, the number of times of tracking processing performed in the visual information channel and the number of times of positioning processing performed in the dialogue history channel may be the same or different. In one case, a tracking process is performed in the visual information channel according to the dialog question and the picture in the channel information, and the obtained updated question feature is determined as the visual channel feature. When the tracking processing is performed once, the consumed computing resources are less, and faster feedback can be obtained. The specific way of the tracking process is the same as above, and is not described herein again.
In step 302, at least two tracking processes are performed according to the dialogue problem and the channel information, and a positioning process is included between two adjacent tracking processes to obtain the visual channel feature.
In the visual information channel, at least two times of tracking processing can be carried out according to the dialogue problem and the channel information, and positioning processing is included between two adjacent times of tracking processing to obtain the visual channel characteristic.
In some embodiments, the above at least two tracking passes according to the dialogue question and the channel information, with a positioning pass between two adjacent tracking passes, may be implemented as follows to obtain the visual channel feature: tracking processing is performed according to the question feature corresponding to the dialogue question and the picture feature corresponding to the picture to obtain the n-th updated question feature; positioning processing is performed according to the n-th updated question feature and the dialogue history feature corresponding to the dialogue history to obtain the (n+1)-th updated question feature; the value of n is iterated until the N-th updated question feature is obtained, and the N-th updated question feature is determined as the visual channel feature, where n takes the values 1, ..., N in sequence and N is an integer greater than 1.
For the visual information channel, tracking is performed according to the question feature corresponding to the dialogue question and the picture feature corresponding to the picture to obtain the n-th updated question feature. Then, positioning is performed according to the n-th updated question feature obtained by the tracking and the dialogue history feature to obtain the (n+1)-th updated question feature. Here n takes integer values from 1 to N, starting from 1, and N is an integer greater than 1 whose specific value may be set according to the actual application scenario. The value of n is iterated until the N-th updated question feature is obtained, which is named the visual channel feature, i.e., it serves as the output of the visual information channel.
It should be noted that, in the iteration process, the problem feature input by the next positioning process is the updated problem feature output by the previous tracking process, and the problem feature input by the next tracking process is the updated problem feature output by the previous positioning process. For example, when the first tracking processing is performed, the update question feature 1 is output, and then the update question feature 1 and the dialogue history feature are determined as inputs to the first positioning processing, and the update question feature 2 and the picture feature output by the first positioning processing are determined as inputs to the second tracking processing. By the method, more refined answer reasoning can be realized, namely fine-grained analysis is realized, and the accuracy of the obtained visual channel characteristics is improved.
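Reusing TrackStep and LocateStep from the sketches above, the interleaved multi-step reasoning of the visual information channel can be written as the following loop; the step count n is an assumed hyperparameter.

```python
def visual_channel(q, v, u, c, track, locate, n: int = 3):
    """Alternate tracking and positioning, starting and ending with tracking,
    so the output is the N-th tracked question feature (the visual channel feature)."""
    for i in range(n):
        q = track(q, v)          # update the question against the picture
        if i < n - 1:
            q = locate(q, u, c)  # positioning between two adjacent tracking passes
    return q
```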
In step 303, a positioning process is performed according to the dialogue problem and the channel information to obtain the dialogue history channel characteristics.
In the conversation history channel, positioning processing can be carried out once according to conversation problems and conversation history in the channel information, and the obtained updated problem characteristic is determined as the conversation history channel characteristic. The specific manner of the positioning process is the same as that described above, and is not described herein again.
In step 304, at least two positioning processes are performed according to the dialogue problem and the channel information, and a tracking process is included between two adjacent positioning processes to obtain the dialogue historical channel characteristics.
Besides one positioning process, at least two positioning processes can be carried out in the conversation history channel according to the conversation question and the channel information, and a tracking process is included between two adjacent positioning processes, so as to obtain more accurate conversation history channel characteristics.
In some embodiments, the above at least two positioning passes according to the dialogue question and the channel information, with a tracking pass between two adjacent positioning passes, may be implemented as follows to obtain the dialogue history channel feature: positioning processing is performed according to the question feature corresponding to the dialogue question and the dialogue history feature corresponding to the dialogue history to obtain the m-th updated question feature; tracking processing is performed according to the m-th updated question feature and the picture feature corresponding to the picture to obtain the (m+1)-th updated question feature; the value of m is iterated until the M-th updated question feature is obtained, and the M-th updated question feature is determined as the dialogue history channel feature, where m takes the values 1, ..., M in sequence and M is an integer greater than 1.
For the dialogue history channel, positioning is performed according to the question feature and the dialogue history feature to obtain the m-th updated question feature, and tracking is performed according to the m-th updated question feature obtained by the positioning and the picture feature to obtain the (m+1)-th updated question feature. Here m takes integer values from 1 to M, starting from 1, and M is an integer greater than 1; M is generally the same as N above, but different values may be set for M and N depending on the actual application. The value of m is iterated until the M-th updated question feature is obtained, which is named the dialogue history channel feature, i.e., it serves as the output of the dialogue history channel. Note that during the iteration, the question feature input to the next tracking pass is the updated question feature output by the previous positioning pass, and the question feature input to the next positioning pass is the updated question feature output by the previous tracking pass. In this way, finer-grained question-feature updating is realized in the dialogue history channel, improving the accuracy of the obtained dialogue history channel feature.
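The dialogue history channel mirrors the loop above, beginning and ending with positioning; again a sketch reusing the step modules, with the step count m assumed.

```python
def history_channel(q, v, u, c, track, locate, m: int = 3):
    """Alternate positioning and tracking, starting and ending with positioning, so
    the output is the M-th positioned question feature (the dialogue history channel feature)."""
    for i in range(m):
        q = locate(q, u, c)      # update the question against the dialogue history
        if i < m - 1:
            q = track(q, v)      # tracking between two adjacent positioning passes
    return q
```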
In step 305, a multi-modal fusion process is performed on the visual channel features and the dialogue history channel features to obtain multi-modal channel features.
As can be seen from the above exemplary implementation of fig. 4C, the embodiment of the present invention provides two processing modes in the dual-channel mode, improving the flexibility of the dual-channel mode. In an actual application scenario, the specific processing mode can be determined according to the response-speed requirement of the visual dialogue: for example, when quick feedback of the dialogue answer is required, a single tracking pass is used in the visual information channel and a single positioning pass in the dialogue history channel, which speeds up obtaining the visual channel feature and the dialogue history channel feature.
In some embodiments, referring to fig. 4D, fig. 4D is an optional flowchart of the artificial intelligence based visual dialog method provided by the embodiment of the present invention, and step 104 shown in fig. 4A can be implemented by steps 401 to 404, which will be described in conjunction with the steps.
In step 401, picture features corresponding to pictures, question features corresponding to conversation questions, and conversation history features corresponding to conversation histories are determined.
The picture, the dialogue question and the dialogue history are respectively subjected to feature extraction processing in advance, and picture features, question features and dialogue history features are determined in the step, and the operation of the feature extraction processing can be executed before the step 103.
In step 402, a hidden layer state of the decoding recurrent neural network model is initialized according to the channel characteristics.
The multi-modal information fusion is performed based on the RNN model, and the RNN model is named as a decoding RNN model for the convenience of distinction. By way of example, referring to FIG. 3, in a fusion module 2434, a first hidden state of the decoded RNN model is initialized based on the obtained channel features, wherein the channel features may be visual channel features, dialog history channel features, or multi-modal channel features, depending on the particular channel mode.
In some embodiments, the above-mentioned determining of the picture feature corresponding to the picture, the question feature corresponding to the conversation question, and the conversation history feature corresponding to the conversation history may be implemented in such a manner that: performing feature extraction processing on the picture through a convolutional neural network model to obtain picture features; performing feature extraction processing on the dialogue problem through a first cyclic neural network model to obtain problem features; and performing feature extraction processing on the conversation history through the second recurrent neural network model to obtain conversation history features.
Here, the picture features may be obtained by performing feature extraction on the picture through a CNN model, such as a Faster R-CNN model, a Visual Geometry Group (VGG) model, or a Residual Neural Network (ResNet) model. Meanwhile, the dialogue question is processed by the first RNN model to obtain the question feature, and the dialogue history is processed by the second RNN model to obtain the dialogue history feature, where the first RNN model and the second RNN model may each be a Long Short-Term Memory (LSTM) model, a bidirectional Long Short-Term Memory (Bi-LSTM) model, a Gated Recurrent Unit (GRU) model, or another variant of the RNN model. In this way, picture features can be extracted effectively, and extracting features of the dialogue question and the dialogue history through RNN models effectively captures the sequence data.
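A sketch of the text-side encoder, assuming the picture features are precomputed by a Faster R-CNN or ResNet backbone; the vocabulary size and dimensions are placeholders, and the same module can encode both the dialogue question and each round of the dialogue history.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Bi-LSTM encoder; hidden size d // 2 per direction gives d-dimensional
    outputs (d is assumed even)."""
    def __init__(self, vocab_size: int, d_emb: int, d: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.lstm = nn.LSTM(d_emb, d // 2, bidirectional=True, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (1, L) word ids; returns a d-dimensional sentence feature
        _, (h, _) = self.lstm(self.embed(tokens))
        return torch.cat([h[0, 0], h[1, 0]])  # concat forward/backward final states
```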
In some embodiments, initializing the hidden state of the decoding recurrent neural network model according to the channel characteristics as described above may be implemented in such a way that: and initializing the hidden layer state of the decoding circular neural network model according to the channel characteristics and the last hidden layer state of the first circular neural network model.
On the basis of performing feature extraction on the dialogue question through the first RNN model, the first hidden layer state of the decoding RNN model is initialized according to the channel feature and the last hidden layer state of the first RNN model, which improves the decoding capability of the decoding RNN model.
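A minimal sketch of this initialization, under the assumption that the first hidden state is simply a learned projection of the channel feature spliced with the question encoder's last hidden state.

```python
import torch
import torch.nn as nn

class DecoderInit(nn.Module):
    def __init__(self, d_chan: int, d_enc: int, d_dec: int):
        super().__init__()
        self.proj = nn.Linear(d_chan + d_enc, d_dec)

    def forward(self, chan_feat: torch.Tensor, enc_last: torch.Tensor) -> torch.Tensor:
        # First hidden state s_0 of the decoding RNN model
        return torch.tanh(self.proj(torch.cat([chan_feat, enc_last])))
```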
In step 403, attention processing is performed on the picture feature, the question feature and the dialogue history feature according to the hidden layer state of the decoding recurrent neural network model.
For example, referring to fig. 3, in the fusion module 2434, an attention mechanism is applied to perform attention processing on the picture feature, the question feature and the dialogue history feature respectively according to the current hidden layer state of the decoded RNN model.
In some embodiments, the above-mentioned attention processing on the picture feature, the question feature and the dialogue history feature according to the hidden layer state of the decoding recurrent neural network model can be realized by the following steps: activating the hidden layer state and the initial characteristic of the decoding recurrent neural network model together, and normalizing the activation result to obtain an intermediate attention characteristic; performing product processing on the intermediate attention feature and the initial feature to obtain an attention initial feature; wherein the initial feature is a picture feature, a question feature, or a conversation history feature.
Taking the initial feature as an example of the problem feature, when performing attention processing, the current hidden layer state of the decoded RNN model and the problem feature are jointly activated, for example, by a tanh function. Then, the result of the activation process is normalized to obtain an intermediate attention feature, where the normalization process can be implemented using a softmax function. And finally, performing product processing on the intermediate attention feature and the problem feature to obtain the attention problem feature, wherein the attention problem feature is the embodiment of more important information in the original problem feature.
In step 404, the noticed picture feature, the noticed question feature and the noticed dialogue history feature are spliced, and the splicing result is activated to obtain a fusion feature.
As an example, referring to fig. 3, in a fusion module 2434, the noticed picture feature, the noticed question feature, and the noticed conversation history feature are subjected to a splicing process, and a result of the splicing process is subjected to an activation process according to an activation function such as a tanh function, so as to obtain a fusion feature.
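A sketch of the attend-and-fuse stage driven by the decoder's current hidden state: it attends separately to the picture, question, and dialogue history features, splices the three attended features, and applies tanh. The projection shapes and class name are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttendAndFuse(nn.Module):
    def __init__(self, d_dec: int, d: int, d_h: int):
        super().__init__()
        self.w_s = nn.Linear(d_dec, d_h)  # projects the decoder hidden state
        self.w_f = nn.Linear(d, d_h)      # projects an initial feature set
        self.w = nn.Linear(d_h, 1)
        self.out = nn.Linear(3 * d, d_h)

    def attend(self, s: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # s: (d_dec,) hidden state; feats: (K, d) picture, question, or history features
        e = torch.tanh(self.w_s(s) + self.w_f(feats))  # joint activation
        a = F.softmax(self.w(e).squeeze(-1), 0)        # intermediate attention feature
        return a @ feats                               # attended feature, (d,)

    def forward(self, s, v, q_words, u):
        spliced = torch.cat([self.attend(s, v),
                             self.attend(s, q_words),
                             self.attend(s, u)])
        return torch.tanh(self.out(spliced))           # fusion feature, (d_h,)
```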
In fig. 4D, step 105 shown in fig. 4A can be implemented by steps 405 to 407, and will be described with reference to each step.
In step 405, weighting the fusion features and the hidden layer state of the decoding recurrent neural network model through a multilayer perceptron model to obtain answer features; each numerical value in the answer features corresponds to the selection probability of one word.
For example, referring to fig. 3, in the prediction module 2435, the fusion feature and the current hidden layer state of the decoding RNN model are weighted by the MLP model to obtain the answer feature, where the answer feature is in vector form and each value corresponds to the selection probability of a word (e.g., a Chinese word or an English word).
In step 406, when the answer features satisfy the selection condition, the word corresponding to the maximum value in the answer features is added to the answer sequence.
Here, the selection condition determines whether a word corresponding to the answer feature is selected, and it may be set according to the actual application scenario, for example, requiring that some value in the answer feature exceed a probability threshold such as 40%. When the answer feature determined from the current hidden layer state of the decoding RNN model satisfies the selection condition, the word corresponding to the maximum value in the answer feature is added to the answer sequence.
In step 407, when the answer features do not satisfy the selection condition, the words in the answer sequence are combined into the dialog answer.
Conversely, when the answer feature does not satisfy the selection condition, the words already in the answer sequence are combined into the dialogue answer in the order in which they were added, from earliest to latest, thereby producing the reply to the dialogue question.
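Steps 405 to 407 can be sketched as a greedy decoding loop; the GRU cell, the 40% threshold, and the id-to-word table itos are illustrative assumptions, and AttendAndFuse is reused from the sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def decode_answer(cell: nn.GRUCell, fuse, mlp: nn.Linear, embed: nn.Embedding,
                  s, v, q_words, u, itos, start_id: int = 1,
                  max_len: int = 20, threshold: float = 0.4) -> str:
    answer, token = [], torch.tensor(start_id)
    for _ in range(max_len):
        fused = fuse(s, v, q_words, u)                        # fusion feature
        s = cell(torch.cat([embed(token), fused]), s)         # step the decoding RNN
        probs = F.softmax(mlp(torch.cat([fused, s])), dim=0)  # answer feature: per-word probabilities
        p, idx = probs.max(0)
        if p.item() < threshold:         # selection condition no longer satisfied
            break
        answer.append(itos[idx.item()])  # add the maximum-probability word to the answer sequence
        token = idx
    return " ".join(answer)              # combine the words into the dialogue answer
```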
As can be seen from the above exemplary implementation of fig. 4D, in the embodiment of the present invention, the multi-modal information of the picture, the dialog question, and the dialog history is combined, and the dialog answer is obtained by initializing the hidden layer state, the attention processing, the splicing processing, and the like, so that the accuracy of the dialog answer is improved, the dialog question can be effectively replied through the dialog answer, and the accuracy of the visual dialog is improved.
In the following, an exemplary application of the embodiments of the present invention in a practical application scenario will be described.
Referring to fig. 5, fig. 5 is an alternative architecture diagram of the artificial-intelligence-based visual dialogue system provided by an embodiment of the present invention; for ease of understanding, the modules shown in fig. 5 are described. In the feature representation module shown in fig. 5, the picture, the dialogue question and the dialogue history are read in, wherein the dialogue history comprises a picture description (not shown in fig. 5), historical dialogue questions and historical dialogue answers; the historical dialogue questions are denoted by Q in fig. 5, such as "age of the girls?", and the historical dialogue answers are denoted by A in fig. 5, such as "late teens". In fig. 5, the picture is subjected to feature extraction processing to obtain picture features, and the feature extraction processing can specifically be performed through a Faster R-CNN model, a VGG model, a ResNet model, or other CNN models. For example, representing the picture by I, the picture features can be extracted through the Faster R-CNN model as follows:
v=Faster R-CNN(I)
wherein v is the picture feature.
The dialogue question Q_t is subjected to feature extraction processing through a Bi-LSTM model, with the following formulas:

x_j^f = LSTM_f(w_j, x_(j-1)^f)

x_j^b = LSTM_b(w_j, x_(j+1)^b)

q_t = [x_l^f, x_1^b]

where LSTM_f is the forward part of the Bi-LSTM model; LSTM_b is the backward part of the Bi-LSTM model; w is the word vector obtained after word embedding processing of a word in the dialogue question; x is the feature vector of a word obtained after w is processed by the Bi-LSTM model; l is the length of the dialogue question, namely the number of words in the dialogue question; and q_t is the extracted question feature. It should be noted that a word here may be a Chinese word, an English word, or a word in another language, and the index j in these formulas is unrelated to the j in the single-channel tracking process above.
Similarly, after feature extraction processing is carried out on the dialogue history through the Bi-LSTM model, the dialogue history feature u is obtained. The blank rectangles in fig. 5 represent feature vectors.
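A minimal PyTorch sketch of this Bi-LSTM feature extraction is given below; the dimensions, batch handling and variable names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10000, 300, 256

embedding = nn.Embedding(vocab_size, embed_dim)         # word embedding: ids -> w
bilstm = nn.LSTM(embed_dim, hidden_dim,
                 batch_first=True, bidirectional=True)  # LSTM_f and LSTM_b

question_ids = torch.randint(0, vocab_size, (1, 7))     # l = 7 words
w = embedding(question_ids)                             # (1, l, embed_dim)
x, (h_n, _) = bilstm(w)                                 # x: per-word features

# q_t concatenates the final hidden states of the forward and
# backward directions, matching the last formula above.
q_t = torch.cat([h_n[0], h_n[1]], dim=-1)               # (1, 2 * hidden_dim)

# The dialogue history is encoded the same way, one feature row per round,
# yielding the dialogue history feature u.
```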
After the feature representation of the picture, the dialogue question and the dialogue history is completed, the picture features, the question features and the dialogue history features are input into the two-channel multi-step reasoning module shown in fig. 5. This module consists of a tracking module and a positioning module, through which multi-step reasoning is realized in two channels.
In the tracking module, the question features and the picture features are respectively processed through two different MLP models to obtain new features, and an attention mechanism is then applied to obtain the updated question features. The MLP models here may be two-layer MLP models; MLP models with other numbers of layers are also possible. For ease of distinction, the question feature input to the tracking module is named q^track; the tracking processing in the tracking module then uses the following formulas:
γ = f_q^track(q^track) ∘ f_v^track(v)

α = softmax(W_γ·γ + b_γ)

q^track ← Σ_(k=1…K) α_k·v_k

where f^track denotes a two-layer MLP model: f_q^track is the MLP model corresponding to the question feature in the tracking module, and f_v^track is the MLP model corresponding to the picture feature in the tracking module; W and b are trainable parameter matrices; K is the total number of rows of the picture feature v; and "∘" is the dot product operation symbol. In addition, γ is the attention weight in the above tracking process, and α is the normalized attention weight in the above tracking process.
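The tracking step can be sketched in PyTorch as below; the hidden size and the exact two-layer MLP shape are assumptions consistent with, but not dictated by, the formulas above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Track(nn.Module):
    """One tracking step: attend over the K rows of the picture feature v."""
    def __init__(self, dim, hid):
        super().__init__()
        # Two-layer MLPs playing the roles of f_q^track and f_v^track.
        self.f_q = nn.Sequential(nn.Linear(dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        self.f_v = nn.Sequential(nn.Linear(dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        self.w_gamma = nn.Linear(hid, 1)   # W_gamma and b_gamma

    def forward(self, q_track, v):
        gamma = self.f_q(q_track).unsqueeze(0) * self.f_v(v)  # (K, hid) interaction
        alpha = F.softmax(self.w_gamma(gamma), dim=0)         # normalized weights
        return (alpha * v).sum(dim=0)                         # updated question feature

track = Track(dim=512, hid=512)
q_new = track(torch.randn(512), torch.randn(36, 512))  # e.g. K = 36 region features
```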
In the positioning module, the question features and the dialogue history features are respectively processed through two different two-layer MLP models to obtain new features, and an intermediate dialogue history feature is then obtained through an attention mechanism. The intermediate dialogue history feature and the picture description feature u_0 in the dialogue history features are further processed by a two-layer MLP model and regularized (corresponding to the ReLU activation processing and LayerNorm processing in the formulas below), finally yielding the updated question feature. For ease of distinction, the question feature input to the positioning module is named q^locate; the positioning processing in the positioning module then uses the following formulas:
β = f_q^locate(q^locate) ∘ f_u^locate(u)

α = softmax(W_β·β + b_β)

ū = Σ_(i=1…T) α_i·u_i

g = ReLU(W_g·ū + b_g)

q^locate ← LayerNorm(g + u_0)

where T is the total number of rows of the dialogue history feature u; f_q^locate and f_u^locate are the MLP models in the positioning module corresponding to the question feature and the dialogue history feature, respectively; the subscripts of W and b only distinguish different trainable parameters; and u_0 is the picture description feature, the subscript 0 indicating its position in the dialogue history features. The ReLU function in the formulas may be replaced by other activation functions, such as the tanh function or the LeakyReLU function. In addition, β is the attention weight in the above positioning process, α is the normalized attention weight in the above positioning process, and ū is the intermediate dialogue history feature.
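A matching sketch of the positioning step follows; as before, the layer sizes are assumptions, and u is assumed to store one feature row per dialogue round, with u[0] being the picture description feature u_0.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Locate(nn.Module):
    """One positioning step: attend over the T rows of the history feature u."""
    def __init__(self, dim, hid):
        super().__init__()
        self.f_q = nn.Sequential(nn.Linear(dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        self.f_u = nn.Sequential(nn.Linear(dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        self.w_beta = nn.Linear(hid, 1)    # W_beta and b_beta
        self.w_g = nn.Linear(dim, dim)     # produces g before the residual
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_locate, u):
        beta = self.f_q(q_locate).unsqueeze(0) * self.f_u(u)  # (T, hid)
        alpha = F.softmax(self.w_beta(beta), dim=0)
        u_mid = (alpha * u).sum(dim=0)        # intermediate dialogue history feature
        g = F.relu(self.w_g(u_mid))           # ReLU activation processing
        return self.norm(g + u[0])            # LayerNorm with residual to u_0

locate = Locate(dim=512, hid=512)
q_new = locate(torch.randn(512), torch.randn(5, 512))  # e.g. T = 5 rounds
```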
The embodiment of the present invention provides a schematic diagram of the two-channel multi-step reasoning as shown in fig. 6, which illustrates the process of two-channel multi-step reasoning with n+2 as the final step number. For convenience of illustration, taking n as the final step number, and denoting the tracking processing performed in the tracking module by Track() and the positioning processing performed in the positioning module by Locate(), the multi-step reasoning process of the visual information channel can be described as follows:
step 1: q_v^1 = Track(q_t, v)

step 2: q_v^2 = Locate(q_v^1, u)

step 3: q_v^3 = Track(q_v^2, v)

……

step n: q_v^n = Track(q_v^(n-1), v)

wherein each step is a one-step reasoning, and the tracking processing and the positioning processing alternate.
The multi-step reasoning process for the dialog history channel can be described as:
step 1: q_u^1 = Locate(q_t, u)

step 2: q_u^2 = Track(q_u^1, v)

step 3: q_u^3 = Locate(q_u^2, u)

……

step n: q_u^n = Locate(q_u^(n-1), u)
When performing the multi-step reasoning, n may be an integer greater than 2; further, n may be an odd number greater than 2. Of course, the above is only an example of multi-step reasoning, and any number of reasoning steps can be set according to the actual application scenario.
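The alternation of the two channels can be sketched as the loop below, reusing the Track and Locate modules from the earlier sketches; n = 3 matches the ablation results reported later, but the loop accepts any step number.

```python
import torch

def visual_channel(q_t, v, u, track, locate, n=3):
    """Multi-step reasoning of the visual information channel."""
    q = q_t
    for step in range(1, n + 1):
        # Odd steps track in the picture; even steps locate in the history.
        q = track(q, v) if step % 2 == 1 else locate(q, u)
    return q   # visual channel feature q_v^n

def history_channel(q_t, v, u, track, locate, n=3):
    """Multi-step reasoning of the dialogue history channel."""
    q = q_t
    for step in range(1, n + 1):
        # Odd steps locate in the history; even steps track in the picture.
        q = locate(q, u) if step % 2 == 1 else track(q, v)
    return q   # dialogue history channel feature q_u^n

# Usage with the Track and Locate instances defined in the previous sketches:
q_t = torch.randn(512)
q_v = visual_channel(q_t, torch.randn(36, 512), torch.randn(5, 512), track, locate)
q_u = history_channel(q_t, torch.randn(36, 512), torch.randn(5, 512), track, locate)
```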
The visual channel feature q_v^n and the dialogue history channel feature q_u^n obtained after the two-channel multi-step reasoning are input to the multi-modal fusion module shown in fig. 5, where q_v^n and q_u^n are subjected to enhancement processing through the attention enhancement modules in the multi-modal fusion module. Specifically, q_v^n and q_u^n correspond to different attention enhancement modules. In the attention enhancement module corresponding to q_v^n, the formula of the enhancement processing is:

λ_v = f_1(q_t) ∘ f_2(q_v^n)

where λ_v is the first attention weight above, f_1 is the first multilayer perceptron model applied to the question feature, and f_2 is the second multilayer perceptron model applied to the visual channel feature. In the attention enhancement module corresponding to q_u^n, the formula of the enhancement processing is:

λ_u = f_3(q_t) ∘ f_4(q_u^n)

where λ_u is the second attention weight above, and f_3 and f_4 are the corresponding multilayer perceptron models for the dialogue history channel.
Then, the first attention weight λ_v and the second attention weight λ_u are subjected to splicing processing, and the result of the splicing processing is activated through the linear transformation module to obtain the multi-modal channel feature q^m, with the following formula:

q^m = tanh(W_m·[λ_v, λ_u] + b_m)

where the linear transformation module (W_m and b_m) may be a fully connected layer.
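A sketch of the multi-modal fusion module under the same assumptions: four multilayer perceptron models for the two enhancement branches and a fully connected layer as the linear transformation module.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.f1, self.f2 = mlp(), mlp()    # enhancement branch for q_v^n
        self.f3, self.f4 = mlp(), mlp()    # enhancement branch for q_u^n
        self.fc = nn.Linear(2 * dim, dim)  # linear transformation module

    def forward(self, q_t, q_v, q_u):
        lam_v = self.f1(q_t) * self.f2(q_v)            # first attention weight
        lam_u = self.f3(q_t) * self.f4(q_u)            # second attention weight
        spliced = torch.cat([lam_v, lam_u], dim=-1)    # splicing processing
        return torch.tanh(self.fc(spliced))            # multi-modal channel feature q^m

fusion = MultiModalFusion(dim=512)
q_m = fusion(torch.randn(512), torch.randn(512), torch.randn(512))
```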
q^m is then input to the decoder for decoding. It should be noted that the decoder of the embodiment of the present invention may be constructed based on a multi-modal attention mechanism, and may also be constructed based on other attention mechanisms. The embodiment of the present invention provides a schematic diagram of the decoder as shown in fig. 7; the decoder comprises an LSTM model. In the decoder, the hidden layer state of the LSTM model is first initialized according to q^m:

h_0 = [q^m, s_q]

where s_q is the last hidden layer state of the Bi-LSTM model that performs feature extraction processing on the dialogue question.
In the attention mechanism module, according to the current hidden layer state h_t of the LSTM model, attention processing is respectively performed on the picture feature, the question feature and the dialogue history feature. Taking the question feature q as an example, the process of attention processing is:

z_q = tanh(W_q·q + (W_h·h_t)·Aᵀ)

α_q = softmax(W_z·z_q + b_z)

m_q = Σ_(i=1…l) α_i·q_i

where A is a matrix whose elements are all 1 (broadcasting the hidden layer state over the l rows of q), ᵀ denotes matrix transposition, l is the total number of rows of the question feature q, and m_q is the noticed question feature. Similarly, the noticed dialogue history feature m_u and the noticed picture feature m_v can be obtained.
Through the fusion module, m_q, m_u and m_v are subjected to splicing processing, and the result of the splicing processing is activated to obtain the fusion feature c_t, with the following formula:

c_t = tanh(W_C·[m_q, m_u, m_v])

Then, through a two-layer MLP model f in the decoder, the answer feature f(h_t, c_t) is generated. The answer feature f(h_t, c_t) embodies the probability that the word y_t occurs given that y_1, …, y_(t-1) and q, v, u occur simultaneously, i.e.:

log p(y_t | y_1, …, y_(t-1), q, v, u) = f(h_t, c_t)
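The decoding step can be sketched as below: attention over the three feature sets given the current hidden state h_t, fusion into c_t, and a two-layer MLP producing the log-probabilities. The parameter shapes and the broadcasting of h_t are assumptions consistent with the formulas above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, vocab_size = 512, 10000
W1, W2 = nn.Linear(dim, dim), nn.Linear(dim, dim)   # attention projections
w_z = nn.Linear(dim, 1)                             # scoring weights W_z, b_z
W_C = nn.Linear(3 * dim, dim)                       # fusion weights W_C
f = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                  nn.Linear(dim, vocab_size))       # two-layer MLP f

def attend(h_t, feats):
    # Broadcast h_t over every row of feats (the role of the all-ones matrix A).
    z = torch.tanh(W1(feats) + W2(h_t).unsqueeze(0))
    alpha = F.softmax(w_z(z), dim=0)
    return (alpha * feats).sum(dim=0)               # noticed feature m

h_t = torch.randn(dim)                              # current LSTM hidden state
m_q = attend(h_t, torch.randn(7, dim))              # over the question feature q
m_u = attend(h_t, torch.randn(5, dim))              # over the history feature u
m_v = attend(h_t, torch.randn(36, dim))             # over the picture feature v

c_t = torch.tanh(W_C(torch.cat([m_q, m_u, m_v])))         # fusion feature c_t
log_p = F.log_softmax(f(torch.cat([h_t, c_t])), dim=-1)   # log p(y_t | ...)
```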
When the answer feature satisfies the selection condition, the word y_t corresponding to the maximum value in the answer feature is added to the answer sequence; when the answer feature does not satisfy the selection condition, the words already in the answer sequence are combined into the dialogue answer in order of addition time from earliest to latest. Fig. 5 shows that, for the dialogue question "what color is it", the dialogue answer composed of y_1, …, y_t obtained through the two-channel multi-step reasoning is "black and white". That is, in the process of visual dialogue, the referent of "it" in the dialogue question (namely the clothes) is analyzed more finely by combining the picture, the dialogue question and the dialogue history, and a corresponding reply is made according to the picture, thereby improving the reply effect.
Experiments by the inventors show that, compared with visual dialogue models provided in the related art, the artificial-intelligence-based visual dialogue method provided by the embodiment of the present invention improves the accuracy of the obtained dialogue answers and achieves better precision. The specific indicators are as follows:
(Table of comparison results on the MRR, R@k and Mean indices, rendered as images in the original publication.)
Here, Mean Reciprocal Rank (MRR), R@k and Mean (the mean rank of the human response) are all evaluation indices of visual dialogue accuracy. The higher the values of MRR and R@k, the higher the accuracy of the visual dialogue; the lower the value of Mean, the higher the accuracy of the visual dialogue. Generally speaking, an increase or decrease of one point in the value of an evaluation index represents a significant change in visual dialogue accuracy.
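The three indices can be computed as in the following sketch, where each entry of ranks is the (1-based) rank assigned to the ground-truth answer among the candidate answers for one question; the values below are illustrative only.

```python
import numpy as np

ranks = np.array([1, 3, 2, 8, 1, 15])      # hypothetical ground-truth ranks

mrr = np.mean(1.0 / ranks)                 # Mean Reciprocal Rank: higher is better
r_at_5 = np.mean(ranks <= 5)               # R@5: share of answers ranked in the top 5
mean_rank = np.mean(ranks)                 # Mean: average rank, lower is better

print(f"MRR={mrr:.4f}  R@5={r_at_5:.4f}  Mean={mean_rank:.2f}")
```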
The embodiment of the invention also provides the following ablation experimental results:
(Table of ablation experiment results, rendered as images in the original publication.)
here, "√" indicates that the corresponding component has been deployed or has performed the corresponding process, and "×" indicates that the corresponding component has not been deployed or has not performed the corresponding process. Based on the ablation experiment results, it was found that the best visual dialogue accuracy could be achieved when a localization module, a tracking module, and a decoder (decoder in the multi-modal attention module shown in fig. 5) were deployed and two-channel 3-step inference was performed.
Continuing with the exemplary structure in which the artificial intelligence based visual dialog device 243 provided by embodiments of the present invention is implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the artificial intelligence based visual dialog device 243 of the memory 240 may include: an obtaining module 2431, configured to obtain a dialog question and a dialog history corresponding to the picture; a determining module 2432 for determining at least one of the picture and the dialog history as channel information; the channel processing module 2433 is configured to perform at least one of tracking processing and positioning processing according to the dialog question and the channel information to obtain a corresponding channel feature; a fusion module 2434, configured to perform fusion processing on the picture, the dialog question, and the dialog history according to the channel feature to obtain a fusion feature; and the prediction module 2435 is configured to perform prediction processing according to the fusion feature to obtain a dialog answer corresponding to the dialog question.
In some embodiments, the channel processing module 2433 is further configured to: any one of the following processes is performed: tracking a single channel according to the dialogue problem and the channel information to obtain visual channel characteristics; performing single-channel positioning processing according to the dialogue problem and the channel information to obtain dialogue historical channel characteristics; and tracking two channels according to the dialogue problems and the channel information to obtain visual channel characteristics, positioning two channels according to the dialogue problems and the channel information to obtain dialogue historical channel characteristics, and performing multi-mode fusion processing on the visual channel characteristics and the dialogue historical channel characteristics to obtain multi-mode channel characteristics.
In some embodiments, the channel processing module 2433 is further configured to: any one of the following processes is performed: performing one-time tracking processing according to the conversation problem and the channel information to obtain visual channel characteristics; and performing at least two tracking treatments according to the dialogue problem and the channel information, wherein a positioning treatment is included between two adjacent tracking treatments to obtain a visual channel characteristic.
A channel processing module 2433, further configured to: any one of the following processes is performed: performing primary positioning processing according to the conversation problem and the channel information to obtain conversation historical channel characteristics; and performing at least two times of positioning processing according to the conversation question and the channel information, wherein tracking processing is included between two adjacent times of positioning processing, so as to obtain the historical channel characteristics of the conversation.
In some embodiments, the channel processing module 2433 is further configured to: perform tracking processing according to the problem feature corresponding to the dialogue problem and the picture feature corresponding to the picture to obtain an nth update problem feature; perform positioning processing according to the nth update problem feature and the dialogue history feature corresponding to the dialogue history to obtain an (n+1)th update problem feature; and iterate the value of n until the Nth update problem feature is obtained, and determine the Nth update problem feature as the visual channel feature; wherein n takes the values 1, …, N in sequence, and N is an integer greater than 1.
In some embodiments, the channel processing module 2433 is further configured to: perform positioning processing according to the problem feature corresponding to the dialogue problem and the dialogue history feature corresponding to the dialogue history to obtain an mth update problem feature; perform tracking processing according to the mth update problem feature and the picture feature corresponding to the picture to obtain an (m+1)th update problem feature; and iterate the value of m until the Mth update problem feature is obtained, and determine the Mth update problem feature as the dialogue history channel feature; wherein m takes the values 1, …, M in sequence, and M is an integer greater than 1.
In some embodiments, the channel processing module 2433 is further configured to: performing enhancement processing on the problem feature corresponding to the dialogue problem and the visual channel feature to obtain a first attention weight; enhancing the problem features corresponding to the dialogue problems and the dialogue historical channel features to obtain a second attention weight; and splicing the first attention weight and the second attention weight, and activating a splicing result to obtain a multi-modal channel characteristic.
In some embodiments, the channel processing module 2433 is further configured to: weighting problem features corresponding to the dialogue problems through a first multilayer perceptron model to obtain a first weighting result; weighting the visual channel characteristics through a second multilayer perceptron model to obtain a second weighting result; and performing dot product processing on the first weighting result and the second weighting result to obtain a first attention weight.
In some embodiments, the channel processing module 2433 is further configured to: perform enhancement processing on the problem feature corresponding to the dialogue problem and the picture feature corresponding to the picture to obtain an attention weight; normalize the attention weight; perform product processing on the normalized attention weight and the picture feature to obtain a jth update problem feature; and iterate the value of j until the Jth update problem feature is obtained, and determine the Jth update problem feature as the visual channel feature; wherein j takes the values 1, …, J in sequence, and J is an integer greater than 0.
In some embodiments, the channel processing module 2433 is further configured to: perform enhancement processing on the problem feature of the dialogue problem and the dialogue history feature corresponding to the dialogue history to obtain an attention weight; normalize the attention weight; perform product processing on the normalized attention weight and the dialogue history feature to obtain an intermediate dialogue history feature; activate the intermediate dialogue history feature, and perform regularization processing on the activation result together with the picture description feature in the dialogue history features to obtain an rth update problem feature; and iterate the value of r until the Rth update problem feature is obtained, and determine the Rth update problem feature as the dialogue history channel feature; wherein r takes the values 1, …, R in sequence, and R is an integer greater than 0.
In some embodiments, the fusion module 2434 is further configured to: determining picture characteristics corresponding to the pictures, question characteristics corresponding to the conversation questions and conversation history characteristics corresponding to the conversation histories; initializing a hidden layer state of a decoding circular neural network model according to the channel characteristics; according to the hidden layer state of the decoding recurrent neural network model, attention processing is respectively carried out on the picture feature, the problem feature and the dialogue historical feature; and splicing the noticed picture features, the noticed problem features and the noticed conversation history features, and activating the splicing result to obtain a fusion feature.
In some embodiments, prediction module 2435 is further configured to: weighting the fusion characteristics and the hidden layer state of the decoding recurrent neural network model through a multilayer perceptron model to obtain answer characteristics; each numerical value in the answer features corresponds to the selection probability of one word; when the answer features meet the selection conditions, adding the word corresponding to the maximum numerical value in the answer features to an answer sequence; and when the answer characteristics do not meet the selection conditions, combining the words in the answer sequence into a conversation answer.
In some embodiments, the fusion module 2434 is further configured to: performing feature extraction processing on the picture through a convolutional neural network model to obtain picture features; performing feature extraction processing on the dialogue problem through a first cyclic neural network model to obtain problem features; carrying out feature extraction processing on the conversation history through a second recurrent neural network model to obtain conversation history features;
a fusion module 2434, further configured to: initializing the hidden layer state of the decoding circular neural network model according to the channel characteristics and the last hidden layer state of the first circular neural network model.
In some embodiments, the fusion module 2434 is further configured to: activating the hidden layer state and the initial feature of the decoding recurrent neural network model together, and normalizing the activation result to obtain an intermediate attention feature; performing product processing on the intermediate attention feature and the initial feature to obtain the noted initial feature; wherein the initial feature is the picture feature, the question feature, or the conversation history feature.
Embodiments of the present invention provide a storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform an artificial intelligence based visual dialog method provided by embodiments of the present invention, for example, the artificial intelligence based visual dialog method as illustrated in fig. 4A, 4B or 4D.
In some embodiments, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the present invention deepens the multi-modal representation of the dialog problem, and strengthens the decoder through the multi-modal attention mechanism, thereby effectively improving the accuracy of the obtained dialog answer and improving the precision of the visual dialog.
The above description is only an example of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (15)

1. A visual dialog method based on artificial intelligence, comprising:
acquiring a conversation question and a conversation history corresponding to the picture;
determining at least one of the picture and the dialog history as channel information;
according to the dialogue problem and the channel information, at least one of tracking processing and positioning processing is carried out to obtain corresponding channel characteristics;
according to the channel characteristics, carrying out fusion processing on the pictures, the conversation questions and the conversation history to obtain fusion characteristics;
and performing prediction processing according to the fusion characteristics to obtain a conversation answer corresponding to the conversation question.
2. The visual dialogue method of claim 1, wherein the performing at least one of a tracking process and a positioning process based on the dialogue problem and the channel information to obtain a corresponding channel feature comprises:
any one of the following processes is performed:
tracking a single channel according to the dialogue problem and the channel information to obtain visual channel characteristics;
performing single-channel positioning processing according to the dialogue problem and the channel information to obtain dialogue historical channel characteristics;
tracking processing of double channels is carried out according to the conversation question and the channel information to obtain visual channel characteristics, positioning processing of double channels is carried out according to the conversation question and the channel information to obtain conversation historical channel characteristics, and
and performing multi-mode fusion processing on the visual channel characteristics and the dialogue historical channel characteristics to obtain multi-mode channel characteristics.
3. The visual dialogue method of claim 2, wherein the performing a two-channel tracking process according to the dialogue problem and the channel information to obtain a visual channel feature comprises:
any one of the following processes is performed:
performing one-time tracking processing according to the conversation problem and the channel information to obtain visual channel characteristics;
performing at least two times of tracking processing according to the conversation problem and the channel information, wherein positioning processing is included between two adjacent times of tracking processing to obtain visual channel characteristics;
the positioning processing of two channels is carried out according to the dialogue problem and the channel information to obtain the dialogue historical channel characteristics, and the method comprises the following steps:
any one of the following processes is performed:
performing primary positioning processing according to the conversation problem and the channel information to obtain conversation historical channel characteristics;
and performing at least two times of positioning processing according to the conversation question and the channel information, wherein tracking processing is included between two adjacent times of positioning processing, so as to obtain the historical channel characteristics of the conversation.
4. The visual dialogue method of claim 3, wherein the performing at least two tracking processes according to the dialogue problem and the channel information, and including a positioning process between two adjacent tracking processes to obtain the visual channel feature comprises:
according to the problem feature corresponding to the conversation problem and the picture feature corresponding to the picture, tracking processing is carried out to obtain an nth update problem feature;
according to the nth update problem feature and the dialogue history feature corresponding to the dialogue history, positioning processing is carried out to obtain an (n + 1) th update problem feature;
iterating the value of n until the Nth update problem feature is obtained, and determining the Nth update problem feature as a visual channel feature;
wherein n takes the values 1, …, N in sequence, and N is an integer greater than 1.
5. The visual dialogue method of claim 3, wherein the performing at least two positioning processes according to the dialogue problem and the channel information, and including a tracking process between two adjacent positioning processes to obtain a dialogue history channel feature comprises:
according to the question features corresponding to the dialogue questions and the dialogue history features corresponding to the dialogue history, positioning processing is carried out to obtain the mth updated question features;
tracking according to the m & ltth & gt updating problem feature and the picture feature corresponding to the picture to obtain an m +1 & ltth & gt updating problem feature;
iterating the value of M until an Mth updating problem feature is obtained, and determining the Mth updating problem feature as a conversation history channel feature;
wherein the values of M are 1 and … … M in sequence, and M is an integer greater than 1.
6. The visual dialog method of claim 2 wherein said multimodal fusion processing of the visual channel features and the dialog history channel features to obtain multimodal channel features comprises:
performing enhancement processing on the problem feature corresponding to the dialogue problem and the visual channel feature to obtain a first attention weight;
enhancing the problem features corresponding to the dialogue problems and the dialogue historical channel features to obtain a second attention weight;
and splicing the first attention weight and the second attention weight, and activating a splicing result to obtain a multi-modal channel characteristic.
7. The visual dialogue method of claim 6, wherein the enhancing the question feature and the visual channel feature corresponding to the dialogue question to obtain a first attention weight comprises:
weighting problem features corresponding to the dialogue problems through a first multilayer perceptron model to obtain a first weighting result;
weighting the visual channel characteristics through a second multilayer perceptron model to obtain a second weighting result;
and performing dot product processing on the first weighting result and the second weighting result to obtain a first attention weight.
8. The visual dialogue method of claim 2, wherein the performing a single-channel tracking process according to the dialogue problem and the channel information to obtain a visual channel feature comprises:
enhancing the problem features corresponding to the conversation problems and the picture features corresponding to the pictures to obtain attention weights;
normalizing the attention weight;
performing product processing on the attention weight after normalization processing and the picture characteristic to obtain a jth update problem characteristic;
iterating the value of j until the Jth update problem feature is obtained, and determining the Jth update problem feature as a visual channel feature;
wherein j takes the values 1, …, J in sequence, and J is an integer greater than 0.
9. The visual dialogue method of claim 2, wherein the performing a single-channel positioning process according to the dialogue problem and the channel information to obtain a dialogue history channel feature comprises:
enhancing the problem features of the dialogue problems and the dialogue history features corresponding to the dialogue history to obtain attention weight;
normalizing the attention weight;
carrying out product processing on the attention weight after normalization processing and the conversation historical characteristics to obtain intermediate conversation historical characteristics;
activating the intermediate conversation history feature, and performing regularization processing on the activation processing result and the picture description feature in the conversation history feature together to obtain an r-th update problem feature;
iterating the value of r until the Rth update problem feature is obtained, and determining the Rth update problem feature as a conversation history channel feature;
wherein r takes the values 1, …, R in sequence, and R is an integer greater than 0.
10. The visual conversation method according to any one of claims 1 to 9, wherein said merging the picture, the conversation question and the conversation history according to the channel feature to obtain a merged feature comprises:
determining picture characteristics corresponding to the pictures, question characteristics corresponding to the conversation questions and conversation history characteristics corresponding to the conversation histories;
initializing a hidden layer state of a decoding circular neural network model according to the channel characteristics;
according to the hidden layer state of the decoding recurrent neural network model, attention processing is respectively carried out on the picture feature, the problem feature and the dialogue historical feature;
and splicing the noticed picture features, the noticed problem features and the noticed conversation history features, and activating the splicing result to obtain a fusion feature.
11. The visual dialogue method of claim 10, wherein the performing prediction processing based on the fusion features to obtain a dialogue answer corresponding to the dialogue question comprises:
weighting the fusion characteristics and the hidden layer state of the decoding recurrent neural network model through a multilayer perceptron model to obtain answer characteristics; each numerical value in the answer features corresponds to the selection probability of one word;
when the answer features meet the selection conditions, adding the word corresponding to the maximum numerical value in the answer features to an answer sequence;
and when the answer characteristics do not meet the selection conditions, combining the words in the answer sequence into a conversation answer.
12. The visual dialogue method of claim 10, wherein the attention processing the picture feature, the question feature, and the dialogue history feature according to the hidden layer state of the decoding recurrent neural network model comprises:
activating the hidden layer state and the initial feature of the decoding recurrent neural network model together, and normalizing the activation result to obtain an intermediate attention feature;
performing product processing on the intermediate attention feature and the initial feature to obtain the noted initial feature;
wherein the initial feature is the picture feature, the question feature, or the conversation history feature.
13. An artificial intelligence based visual dialog device, comprising:
the acquisition module is used for acquiring the conversation question and the conversation history corresponding to the picture;
a determining module for determining at least one of the picture and the dialog history as channel information;
the channel processing module is used for performing at least one of tracking processing and positioning processing according to the dialogue problem and the channel information to obtain corresponding channel characteristics;
the fusion module is used for fusing the pictures, the conversation questions and the conversation history according to the channel characteristics to obtain fusion characteristics;
and the prediction module is used for performing prediction processing according to the fusion characteristics to obtain a conversation answer corresponding to the conversation question.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based visual dialog method of any one of claims 1 to 12 when executing executable instructions stored in the memory.
15. A storage medium having stored thereon executable instructions for causing a processor to, when executed, implement the artificial intelligence based visual dialog method of any one of claims 1 to 12.
CN201911294260.3A 2019-12-16 2019-12-16 Visual conversation method and device based on artificial intelligence and electronic equipment Pending CN112905754A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911294260.3A CN112905754A (en) 2019-12-16 2019-12-16 Visual conversation method and device based on artificial intelligence and electronic equipment


Publications (1)

Publication Number Publication Date
CN112905754A true CN112905754A (en) 2021-06-04

Family

ID=76111259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911294260.3A Pending CN112905754A (en) 2019-12-16 2019-12-16 Visual conversation method and device based on artificial intelligence and electronic equipment

Country Status (1)

Country Link
CN (1) CN112905754A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160203827A1 (en) * 2013-08-23 2016-07-14 Ucl Business Plc Audio-Visual Dialogue System and Method
CN108681610A (en) * 2018-05-28 2018-10-19 山东大学 Production takes turns more and chats dialogue method, system and computer readable storage medium
CN108763284A (en) * 2018-04-13 2018-11-06 华南理工大学 A kind of question answering system implementation method based on deep learning and topic model
CN109800294A (en) * 2019-01-08 2019-05-24 中国科学院自动化研究所 Autonomous evolution Intelligent dialogue method, system, device based on physical environment game
CN110162613A (en) * 2019-05-27 2019-08-23 腾讯科技(深圳)有限公司 A kind of problem generation method, device, equipment and storage medium
CN110471531A (en) * 2019-08-14 2019-11-19 上海乂学教育科技有限公司 Multi-modal interactive system and method in virtual reality



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination