CN110647612A

CN110647612A - Visual conversation generation method based on double-visual attention network

Info

Publication number: CN110647612A
Application number: CN201910881305.0A
Authority: CN
Inventors: 郭丹; 王辉; 汪萌
Original assignee: Hefei Polytechnic University
Current assignee: Hefei University of Technology; Hefei Polytechnic University
Priority date: 2019-09-18
Filing date: 2019-09-18
Publication date: 2020-01-03

Abstract

The invention discloses a visual dialog generating method based on a double-visual attention network, which comprises the following steps: 1. preprocessing text input in a visual dialogue and constructing a word list; 2. extracting the characteristics of the conversation images and the characteristics of the conversation texts; 3. performing attention processing on the historical dialogue information based on the current question information; 4. attention processing with independent double visual features; 5. attention processing with two visual features intersecting each other; 6. optimizing the visual characteristics; 7. generating an answer characteristic sequence by multi-mode semantic fusion and decoding; 8. optimizing parameters of a visual dialog generation network model based on the double-visual attention network; 9. and generating a predicted answer. The invention can provide more complete and reasonable visual semantic information and finer-grained text semantic information for the intelligent agent, thereby improving the reasonability and accuracy of the answer generated by the intelligent agent for predicting the problem.

Description

Visual conversation generation method based on double-visual attention network

Technical Field

The invention belongs to the technical field of computer vision, relates to technologies such as pattern recognition, natural language processing, artificial intelligence and the like, and particularly relates to a visual dialogue generating method based on a double-visual attention network.

Background

The visual conversation is a man-machine interaction method, and aims to enable a machine intelligent agent and a human to carry out reasonable and correct natural conversation on a given daily scene graph in a question-answer mode. Therefore, how to let the intelligent agent correctly understand the multi-modal semantic information composed of images and texts so as to give reasonable answers to questions posed by human beings is a key in visual conversation. Visual interaction is also one of the subjects of intense research in the field of computer vision, and the application scenarios thereof are very wide, including: the method helps people with visual impairment to know social media content or daily environment, artificial intelligence assistance, robot application and the like.

With the development of modern image processing technology and deep learning, the visual dialogue technology has been greatly developed, but still faces the following problems:

firstly, the intelligent agent lacks of learning the text features with finer granularity when processing the text information.

For example, in 2017, the image attention method based on historical dialogue, which is proposed in the article "Best of people's hands from transformed Knowledge Learning from textual visual Dialog Model" published by the author of Jiasen Lu on the top international Conference Processing Systems (NIPS 2017), first performs attention Processing on historical dialogue at sentence level, and then performs attention Learning on image features based on the processed text features, but the method only considers the semantics at sentence level and does not consider the semantics at word level when Processing the text Information of the current question, while only part of the keywords in the sentence actually asked are the most relevant answers to the prediction. Therefore, the method has certain limitations in practical application.

Secondly, the existing method carries out feature extraction based on a global image, so that visual semantic information is not accurate enough.

For example, in 2018, an author of Qi Wu et al published in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018) of the top international Conference, "Are You Talking to Me? real Visual Dialog Generation through semantic analysis and fusion" in the article, global Visual features, problems and historical Dialog text features Are used for a series of mutual attention processing and fusion to obtain multi-modal semantic features.

Disclosure of Invention

The invention provides a visual dialogue generating method based on a double-vision attention network in order to overcome the defects of the prior art, so that more complete and reasonable visual semantic information and more fine-grained text semantic information can be provided for an intelligent agent, and the reasonability and the accuracy of the intelligent agent in the answer reasoning generation of problems are improved.

The invention adopts the following technical scheme for solving the technical problems:

the invention relates to a visual dialog generating method based on a double-visual attention network, which is characterized by comprising the following steps:

step 1, preprocessing text input in visual dialogue and constructing a word list:

step 1.1, acquiring a visual dialogue data set, wherein the visual dialogue data set comprises sentence texts and images;

performing word segmentation processing on all sentence texts in the visual dialogue data set to obtain segmented words;

step 1.2, screening all words with the word frequency larger than a threshold value from the segmented words, and constructing a word index table Voc; then, one-hot coding is carried out on each word in the index table Voc to obtain an one-hot vector table O ═ O₁,o₂,...,o_n,...,o_N]Wherein o is_nRepresents the nth of the index table VocA one-hot coding vector corresponding to the word, wherein N is the number of the words in the index table Voc;

step 1.3, randomly initializing a word embedding matrix W_e，

Wherein d is_wA dimension representing a word vector; embedding matrices W with words_eMapping the coding vector of each word in the one-hot vector table to a corresponding word vector so as to obtain a word vector table;

step 2, extracting the characteristics of the conversation images and the characteristics of the conversation texts;

step 2.1, acquiring any one image I and corresponding historical dialogue U ═ U from the visual dialogue dataset₁,u₂,...,u_t,...,u_T]Current problem of

And a true answer label A_GTThe composed visual dialogue information D; where T is the total number of dialog segments in the historical dialog U, U_tRepresenting the t-th session in the session, L₁Sentence length, w, representing the current question Q_Q,iRepresenting a word vector corresponding to the ith word in the current question Q in the word vector table;

step 2.2, extracting the characteristics of the image I in the visual dialogue information D by using a convolutional neural network to obtain global visual characteristics

Wherein

Representing a global visual feature V⁽⁰⁾M denotes a global visual feature V⁽⁰⁾Total number of spatial regions of (1), d_gFor global visual features V⁽⁰⁾The channel dimension of (a);

step 2.3, extracting the characteristics of the image I in the visual dialogue information D by using a target detection characteristic extractor to obtain local visual characteristicsWhereinRepresenting local visual features R⁽⁰⁾K denotes the local visual feature R⁽⁰⁾Total number of detected local target objects in d_rFor local visual features R⁽⁰) The channel dimension of (a);

step 2.4, mapping the global visual feature and the local visual feature to a space with the same dimensionality by using full connection operation to obtain the converted global visual feature V ═ V₁,v₂,...,v_m,...,v_M]，And local visual feature R ═ R₁,r₂,...,r_k,...,r_K]，

Wherein v is_mRepresenting the m-th region feature, r, in the global visual feature V_kRepresenting the kth target object feature in the local visual feature R, wherein d is the converted channel dimension;

step 2.5, using long-short term memory network LSTM to extract features of current problem Q to obtain hidden state feature sequence

And the hidden state feature of the last step length output of the long-short term memory network LSTM is taken

The sentence-level question feature vector Q as the current question Q,

wherein h is_Q,iA hidden state characteristic representing the ith step length output of the long-short term memory network LSTM;

step 2.6, using long-short term memory network LSTM to process the t section dialogue in history dialogue U

Carrying out feature extraction to obtain the t-th hidden state sequence

The long-short term memory network takes the hidden state characteristic of the last step length output of the LSTM

As a t-th dialog u_tSentence level features of h_t，

The total historical dialog feature is H ═ H₁,h₂,...,h_t,...,h_T]，

Wherein w_t,iRepresenting the t-th dialog u_tThe word vector, L, corresponding to the ith word in the word vector table₂For the t-th dialogue u_tLength of sentence, h_t,iA hidden state characteristic representing the ith step length output of the long-short term memory network LSTM;

step 3, performing attention processing on historical dialogue information based on current question information;

using equation (1) to obtain the total historical dialogue feature H ═ H₁,h₂,...,h_t,...,h_T]Performing attention processing to obtain attention concerned history feature vector h_a，

h_a＝α^hH^T (1)

In the formula (1), the reaction mixture is,

represents the attention distribution weight for the historical dialog feature H and has:

α^h＝softmax(PTz^h) (2)

in the formula (2), the reaction mixture is,representing a similarity matrix between the sentence-level question feature vector q and the historical dialog features H,representing a similarity matrix z^hAnd the parameters to be trained include:

z^h＝tanh(W_qq+W_hH) (3)

in the formula (3), the reaction mixture is,

representing the parameters to be trained corresponding to the sentence-level question feature vector q,

representing parameters to be trained corresponding to the historical dialogue features H;

step 4, the attention of the double visual characteristics is processed independently;

step 4.1, using formula (4) to determine global visual feature V ═ V₁,v₂,...,v_m,...,v_M]Performing attention processing to obtain an attention-focused global visual feature vector V',

V′＝α^V1V^T (4)

in the formula (4), the reaction mixture is,

represents the attention distribution weight for the global visual feature V and has:

in the formula (5), the reaction mixture is,

representing sentence-level question feature vector q, attention-focused historical feature vector h_aAnd a similarity matrix between global visual features V

Representing a similarity matrix z^V1And the parameters to be trained include:

z^V1＝tanh(W_q1q+W_h1h_a+W_V1V) (6)

in the formula (6), the reaction mixture is,

representing the parameters to be trained corresponding to the sentence-level question feature vector q,historical feature vector h representing attention_aThe corresponding parameters to be trained are set to be,

representing a parameter to be trained corresponding to the global visual feature V;

step 4.2, using formula (7) to determine the local visual feature R ═ R₁,r₂,...,r_k,...,r_K]Performing attention processing to obtain a local visual feature vector R' concerned by attention,

R′＝α^R1R^T (7)

in the formula (7), the reaction mixture is,

represents the attention distribution weight for the local visual feature R and has:

in the formula (8), the reaction mixture is,

representing sentence-level question feature vector q, attention-focused historical feature vector h_aAnd a similarity matrix between the local visual features R,

representing a similarity matrix z^V1And the parameters to be trained include:

z^R1＝tanh(W′_q1q+W′_h1h_a+W_R1R) (9)

in the formula (9), the reaction mixture is,

historical feature vector h representing attention_aThe corresponding parameters to be trained are set to be,

representing a parameter to be trained corresponding to the local visual feature R;

step 5, attention processing of mutual intersection of the double visual characteristics;

step 5.1, using formula (10) to determine global visual feature V ═ V₁,v₂,...,v_m,...,v_M]Performing double-vision cross attention processing to obtain a global visual feature vector V' of further attention,

V″＝α^V2V^T (10)

in the formula (10), the compound represented by the formula (10),

represents a further attention distribution weight for the global visual feature V and has:

in the formula (11), the reaction mixture is,

representing sentence-level question feature vector q, attention-focused historical feature vector h_aA local visual feature vector R' of attention and a similarity matrix between the global visual features V,representing a similarity matrix z^V2And the parameters to be trained include:

z^V2＝tanh(W_q2q+W_h2h_a+W_R2R′+W_V2V) (12)

in the formula (12), the reaction mixture is,

the local visual feature vector R' representing attention corresponds to a parameter to be trained,

representing global visual feature VcorrespondencesThe parameters to be trained;

step 5.2, using formula (13) to compare the local visual characteristics R ═ R₁,r₂,...,r_k,...,r_K]Performing double vision cross attention processing to obtain a local vision characteristic vector R' of further attention,

R″＝α^R2R^T (13)

in the formula (13), the reaction mixture is,

represents further attention distribution weights for the local visual features R and has:

in the formula (14), the compound represented by the formula (I),

representing sentence-level question feature vector q, attention-focused historical feature vector h_aA global visual feature vector V' of attention and a similarity matrix between local visual features R,representing a similarity matrix z^R2And the parameters to be trained include:

z^R2＝tanh(W′_q2q+W′_h2h_a+W′_V2V′+W′_R2R) (15)

in the formula (15), the reaction mixture is,

to representHistorical feature vector h of attention_aThe corresponding parameters to be trained are set to be,the parameters to be trained corresponding to all visual feature vectors V' of attention,

step 6, optimizing visual characteristics;

step 6.1, performing word-level attention processing on the current question Q by using the formula (16) to obtain a word-level question feature vector Q concerned by attention^s，

q^s＝α^qQ^T (16)

In the formula (16), the compound represented by the formula,

represents the attention distribution weight for the current question Q and has:

in the formula (14), the compound represented by the formula (I),

a self-attention semantic matrix representing the current question Q,

representing a self-attention semantic matrix z^QAnd the parameters to be trained include:

z^Q＝tanh(W_QQ) (18)

in the formula (18), the reaction mixture,

representing a parameter to be trained corresponding to the current question Q during word level attention processing;

step 6.2, respectively carrying out optimization processing on the global visual feature vector V 'and the local visual feature vector R' which are focused by further attention by using the formulas (19) and (20), and obtaining a final global visual feature vector

And local visual feature vector

In the formulae (19) and (20),

word-level problem feature vector q in visual feature optimization^sThe corresponding parameter to be trained, an, indicates a dot product operation;

step 7, generating an answer characteristic sequence by multi-mode semantic fusion and decoding;

step 7.1, focusing attention on the word-level question feature vector q^sAttention is focused on the historical feature vector h_aOptimized global visual feature vector

And local visual feature vector

Obtaining a multi-modal feature vector e after splicing_M，

Wherein d is_M＝3d+d_wDimensions representing multi-modal feature vectors; and then the multi-modal feature vector e is subjected to full-connection operation_MMapping is carried out to obtain a fusion semantic feature vector e,

step 7.2, inputting the fusion semantic feature vector e into a long-short term memory network (LSTM) to obtain a hidden state feature sequence of a predicted answer

Wherein h is_A,iFor the output of the ith step of the long short term memory network LSTM, L₃Label a for the true answer_GTSentence length of (1);

step 7.3, utilizing full-connection operation to enable the hidden state feature sequence of the predicted answer

Mapping the predicted answer to the space with the same dimension as the one-hot vector table O to obtain a word vector set of the predicted answer

Wherein y is_iThe mapping vector of the ith word in the predicted answer is represented, and the length of the vector is the same as the number of the words;

step 8, optimizing parameters of a visual dialogue generation network model based on the double-visual attention network;

step 8.1, according to the word one-hot vector table O, real answer label A_GTThe words in (1) construct a set of vectors

WhereinLabel A representing true answer_GTMapping vector of the ith word, wherein the length of the vector is the same as the number of the words;

step 8.2 calculate the predicted answer and the true answer A using equation (21)_GTCost of loss between E:

step 8.3, optimizing and solving the loss cost E by using a random gradient descent method to minimize the loss cost E, so as to obtain a visual dialogue network model based on the double-visual attention network with optimal parameters;

step 9, generating a predicted answer;

set of word vectors to the predicted answer

Obtaining a mapping vector y of the ith word by using a greedy decoding algorithm_iThe position corresponding to the maximum value in the word index table Voc, and the word at the corresponding position is searched in the word index table Voc according to the position corresponding to the maximum value as the mapping vector y of the ith word_iAnd finally predicting words to obtain a predicted answer corresponding to the word vector set Y, and taking the current question Q and the predicted answer corresponding to the word vector set Y as a finally generated visual dialogue.

Compared with the prior art, the invention has the beneficial effects that:

1. compared with the visual dialogue technology researched in the past, the visual dialogue method and the visual dialogue system have the advantages that the visual characteristics of the global image are extracted, the visual characteristics of the local image object are extracted, the visual semantic information contained in the global visual characteristics is more comprehensive, the visual semantics contained in the local visual characteristics is more specific, so that the characteristics of the two visual characteristics are fully considered, the internal relation and the mutual relation of the two visual characteristics are respectively learned through two-stage attention processing, visual semantic complementation is formed, and the intelligent body can obtain more complete and accurate visual semantic information.

2. The method respectively processes text characteristics from a sentence level and a word level, firstly carries out sentence level characteristic extraction on problems and historical dialogue and carries out attention processing on historical dialogue characteristics during processing; secondly, learning the relation of the two visual characteristics based on the text characteristics of the sentence level; then, the invention carries out attention processing on the question features from a word level so as to capture the keyword features which are helpful for answer conjecture in the question, and the finer-grained text processing method enables the invention to generate more accurate and reasonable answers in the visual conversation.

3. The invention provides a multi-mode semantic fusion structure, which firstly optimizes two visual characteristics by using problem text characteristics on a word level so as to further highlight visual information related to problem keywords in the visual characteristics. Then, by splicing the problem features, the historical dialogue features, the global visual features and the local visual features, learning and fusion are carried out, furthermore, the visual features and the text features can mutually influence through a multi-mode semantic fusion network, parameters of the network are assisted to be optimized, after the visual semantics and the text semantics are simultaneously obtained by the fusion network, the answer prediction generation effect of the intelligent agent is greatly improved, and the prediction result is more accurate.

Drawings

FIG. 1 is a schematic diagram of a network model of the present invention;

FIG. 2 is a schematic diagram of dual visual attention processing in accordance with the present invention;

FIG. 3 is a schematic diagram of network model training according to the present invention.

Detailed Description

In this embodiment, as shown in fig. 1, a method for generating a visual dialog based on a dual-vision attention network is performed as follows:

step 1.1, acquiring a visual conversation data set from the Internet, wherein the currently disclosed data set mainly comprises a VisDial Dataset, the data set is collected by related researchers of the George Asian worker school, and the visual conversation data set comprises sentence texts and images;

step 1.2, screening all words with the word frequency larger than a threshold value from the segmented words, setting the threshold value to be 4, and constructing a word index table Voc; the method for creating the word index table Voc comprises the following steps: the word list may contain words, punctuation; counting the number of words and ordering the words, wherein a blank character is added to satisfy the optimized training process. Constructing a corresponding table of words and sequence numbers for all the words according to the sequence; then, one-hot coding is carried out on each word in the index table Voc to obtain a one-hot vector table O ═ O₁,o₂,...,o_n,...,o_N]Wherein o is_nRepresenting a one-hot coding vector corresponding to the nth word in the index table Voc, wherein N is the number of words in the index table Voc;

step 1.3, randomly initializing a word embedding matrix W_e，

Wherein d is_wA dimension representing a word vector; embedding matrices W with words_eEncoding vector o of nth word in one-hot vector table_nMapping to the nth word vector w_n，

Thereby obtaining a word vector table;

And a true answer label A_GTThe composed visual dialogue information D; where T is the total number of dialog segments in the historical dialog U, U_tRepresenting the t-th session in the session, L₁Sentence length, L, representing the current question Q₁Is set to 16, and for sentences with sentence length less than 16, the zero vector is used for filling until the length is L₁，w_Q,iA word vector representing an ith word in the sentence;

Wherein

Representing a global visual feature V⁽⁰⁾M denotes a global visual feature V⁽⁰⁾Total number of spatial regions of (1), d_gFor global visual features V⁽⁰⁾The channel dimension of (a); in this embodiment, a pre-trained VGG convolutional neural network may be used to perform feature extraction on the global visual feature of the image I; the VGG is a two-dimensional convolutional neural network which is proved to have strong visual information expression capacity, so that the VGG which is pre-trained on a COCO2014 data set is used as an experimental global visual feature extractor, and the network of the VGG does not participate in the parameter updating part of the subsequent step 8;

step 2.3, extracting the characteristics of the image I in the visual dialogue information D by using a target detection characteristic extractor to obtain local visual characteristics

Wherein

Representing local visual features R⁽⁰⁾K denotes the local visual feature R⁽⁰⁾Total number of detected local target objects in d_rFor local visual features R⁽⁰⁾The channel dimension of (a); in this exampleThe local visual features of the image I can be extracted by adopting a pre-trained Faster-RCNN target detection feature extractor; the local Visual features extracted by the fast-RCNN achieve excellent effects on many Visual tasks, so we use the fast-RCNN pre-trained on the Visual Genome dataset as the local Visual feature extractor of the experiment, and the network of the part does not participate in the parameter updating part of the subsequent step 8;

step 2.4, mapping the global visual feature and the local visual feature to a space with the same dimensionality by using full connection operation to obtain the converted global visual feature V ═ V₁,v₂,...,v_m,...,v_M]，

And local visual feature R ═ R₁,r₂,...,r_k,...,r_K]，

And the hidden state feature of the last step length output of the LSTM is takenThe sentence-level question feature vector Q as the current question Q,

wherein h is_Q,iA hidden state feature representing the output of the ith step length of the LSTM;

step 2.6, using long-short term memory network LSTM to each session in history session U

Carrying out feature extraction to obtain a hidden state sequenceTaking the hidden state feature of the last step output of the LSTM

As a dialog u_tSentence level features of h_t，

The total historical dialog feature is H ═ H₁,h₂,...,h_t,...,h_T]，

Wherein w_u,iPresentation dialog u_tWord vector of the ith word, L₂For a conversation u_tLength of sentence, L₂Is set to 25, and for sentences with sentence length less than 25, the zero vector is used for filling until the length of the sentence is L₂，h_u,iA hidden state feature representing the output of the ith step length of the LSTM;

using equation (1) to set total historical dialogue feature H ═ H₁,h₂,...,h_t,...,h_T]Performing attention processing to obtain attention concerned history feature vector h_a,

h_a＝α^hH (1)

In the formula (1), the reaction mixture is,

α^h＝softmax(PTz^h) (2)

in the formula (2), the reaction mixture is,

representing a similarity matrix between the sentence-level question feature vector q and the historical dialog features H,

representing a similarity matrix z^hAnd the parameters to be trained include:

z^h＝tanh(W_qq+W_hH) (3)

in the formula (3), the reaction mixture is,

step 4, as shown in fig. 2, performing respective independent attention processing on the dual-vision characteristics;

V′＝α^V1V^T (4)

in the formula (4), the reaction mixture is,

in the formula (5), the reaction mixture is,

representing sentence-level question feature vector q, attention-focused historical feature vector h_aAnd a similarity matrix between the global visual features V,

representing a similarity matrix z^V1And the parameters to be trained include:

z^V1＝tanh(W_q1q+W_h1h_a+W_V1V) (6)

in the formula (6), the reaction mixture is,

R′＝α^R1R^T (7)

in the formula (7), the reaction mixture is,

in the formula (8), the reaction mixture is,representing sentence-level question feature vector q, attention-focused historical feature vector h_aAnd a similarity matrix between the local visual features R,representing a similarity matrix z^V1And the parameters to be trained include:

z^R1＝tanh(W′_q1q+W′_h1h_a+W_R1R) (9)

in the formula (9), the reaction mixture is,representing the parameters to be trained corresponding to the sentence-level question feature vector q,

historical feature vector h representing attention_aThe corresponding parameters to be trained are set to be,representing a parameter to be trained corresponding to the local visual feature R;

step 5, as shown in fig. 2, performing intercrossed attention processing on the double visual characteristics;

V″＝α^V2V^T (10)

in the formula (10), the compound represented by the formula (10),

in the formula (11), the reaction mixture is,

z^V2＝tanh(W_q2q+W_h2h_a+W_R2R′+W_V2V) (12)

in the formula (12), the reaction mixture is,

R″＝α^R2R^T (13)

in the formula (13), the reaction mixture is,represents further attention distribution weights for the local visual features R and has:

in the formula (14), the compound represented by the formula (I),

representing sentence-level question feature vector q, attention-focused historical feature vector h_aA global visual feature vector V' of attention and a similarity matrix between local visual features R,

representing a similarity matrix z^R2And the parameters to be trained include:

z^R2＝tanh(W′_q2q+W′_h2h_a+W′_V2V′+W′_R2R) (15)

in the formula (15), the reaction mixture is,

historical feature vector h representing attention_aCorresponding to be treatedThe parameters of the training are set to be,

the parameters to be trained corresponding to all visual feature vectors V' of attention,

step 6, optimizing visual characteristics;

q^s＝α^qQ (16)

In the formula (16), the compound represented by the formula,

in the formula (14), the compound represented by the formula (I),

a self-attention semantic matrix representing the current question Q,

z^Q＝tanh(W_QQ) (18)

in the formula (18), the reaction mixture,

current when indicating word level attention processingA parameter to be trained corresponding to the question Q;

And local visual feature vector

In the formulae (19) and (20),

step 7.1, attention focused word-level problem feature vector q^sAttention is focused on the historical feature vector h_aOptimized global visual feature vector

And local visual feature vector

Obtaining a multi-modal feature vector e after splicing_M，

Wherein d is_M＝3d+d_wDimensions representing multi-modal feature vectors; and then a layer of full-connection operation is utilized to carry out multi-mode feature vector e_MMapping is carried out to obtain a fusion semantic feature vector e,

step 7.2, inputting the fusion semantic feature vector e into the long-short term memory network LSTM to obtain the hidden state feature sequence of the predicted answer

Wherein h is_A,iFor the output of the ith time step of LSTM, L₃Label a for the true answer_GTLength of sentence, L₃May be set to a size of 9;

step 7.3, predicting the hidden state characteristic sequence of the answer by utilizing full-connection operation

step 8, as shown in fig. 3, optimizing parameters of a network model generated by a visual dialog based on a dual-vision attention network;

step 8.1, according to the word one-hot vector table O, the real answer label A is identified_GTThe words in (1) construct a set of vectors

Wherein

Label A representing true answer_GTMapping vector of the ith word, wherein the length of the vector is the same as the number of the words;

step 9, generating a predicted answer;

word vector set for predictive answers

Obtaining a mapping vector y of the ith word by using a greedy decoding algorithm_iThe position corresponding to the maximum value in the word index table Voc, and the word at the corresponding position is searched in the word index table Voc according to the position corresponding to the maximum value as the mapping vector y of the ith word_iAnd finally predicting the words to obtain a predicted answer corresponding to the word vector set Y, and taking the current question Q and the predicted answer corresponding to the word vector set Y as a finally generated visual dialogue.

Claims

1. A visual dialog generating method based on a double-visual attention network is characterized by comprising the following steps:

step 1.2, screening out word frequency from the segmented wordsAll words with the rate larger than the threshold value are constructed, and a word index table Voc is constructed; then, one-hot coding is carried out on each word in the index table Voc to obtain an one-hot vector table O ═ O₁,o₂,...,o_n,...,o_N]Wherein o is_nRepresenting a one-hot coding vector corresponding to the nth word in the index table Voc, wherein N is the number of words in the index table Voc;

step 1.3, randomly initializing a word embedding matrix W_e，

Wherein

Representing a global visual feature V⁽⁰⁾M of (1)Regional characteristics, M denotes the global visual characteristics V⁽⁰⁾Total number of spatial regions of (1), d_gFor global visual features V⁽⁰⁾The channel dimension of (a);

Wherein

Representing local visual features R⁽⁰⁾K denotes the local visual feature R⁽⁰⁾Total number of detected local target objects in d_rFor local visual features R⁽⁰⁾The channel dimension of (a);

And local visual feature R ═ R₁,r₂,...,r_k,...,r_K]，Wherein v is_mRepresenting the m-th region feature, r, in the global visual feature V_kRepresenting the kth target object feature in the local visual feature R, wherein d is the converted channel dimension;

step 2.5, using long-short term memory network LSTM to extract features of current problem Q to obtain hidden state feature sequenceAnd the hidden state feature of the last step length output of the long-short term memory network LSTM is taken

The sentence-level question feature vector Q as the current question Q,

Carrying out feature extraction to obtain the t-th hidden state sequence

The long-short term memory network takes the hidden state characteristic of the last step length output of the LSTMAs a t-th dialog u_tSentence level features of h_t，

The total historical dialog feature is H ═ H₁,h₂,...,h_t,...,h_T]，

using equation (1) to obtain the total historical dialogue feature H ═ H₁,h₂,...,h_t,...,h_T]The attention-seeking treatment is carried out,obtaining a history feature vector h concerned by attention_a，

h_a＝α^hH^T (1)

In the formula (1), the reaction mixture is,

α^h＝softmax(P^Tz^h) (2)

in the formula (2), the reaction mixture is,

representing a similarity matrix z^hAnd the parameters to be trained include:

z^h＝tanh(W_qq+W_hH) (3)

in the formula (3), the reaction mixture is,representing the parameters to be trained corresponding to the sentence-level question feature vector q,

V′＝α^V1V^T (4)

in the formula (4), the reaction mixture is,

in the formula (5), the reaction mixture is,

Representing a similarity matrix z^V1And the parameters to be trained include:

z^V1＝tanh(W_q1q+W_h1h_a+W_V1V) (6)

in the formula (6), the reaction mixture is,

historical feature vector h representing attention_aThe corresponding parameters to be trained are set to be,representing a parameter to be trained corresponding to the global visual feature V;

step 4.2, benefitUsing formula (7) to determine the local visual characteristic R ═ R₁,r₂,...,r_k,...,r_K]Performing attention processing to obtain a local visual feature vector R' concerned by attention,

R′＝α^R1R^T (7)

in the formula (7), the reaction mixture is,

in the formula (8), the reaction mixture is,

representing a similarity matrix z^V1And the parameters to be trained include:

z^R1＝tanh(W′_q1q+W′_h1h_a+W_R1R) (9)

in the formula (9), the reaction mixture is,

V″＝α^V2V^T (10)

in the formula (10), the compound represented by the formula (10),

in the formula (11), the reaction mixture is,

representing sentence-level question feature vector q, attention-focused historical feature vector h_aA local visual feature vector R' of attention and a similarity matrix between the global visual features V,

representing a similarity matrix z^V2And the parameters to be trained include:

z^V2＝tanh(W_q2q+W_h2h_a+W_R2R′+W_V2V) (12)

in the formula (12), the reaction mixture is,

R″＝α^R2R^T (13)

in the formula (13), the reaction mixture is,

in the formula (14), the compound represented by the formula (I),

representing sentence-level question feature vector q, attention-focused historical feature vector h_aAll visual feature vectors of attentionV' and the similarity matrix between the local visual features R,representing a similarity matrix z^R2And the parameters to be trained include:

z^R2＝tanh(W′_q2q+W′_h2h_a+W′_V2V′+W′_R2R) (15)

in the formula (15), the reaction mixture is,

historical feature vector h representing attention_aThe corresponding parameters to be trained are set to be,the parameters to be trained corresponding to all visual feature vectors V' of attention,

step 6, optimizing visual characteristics;

q^s＝α^qQ^T (16)

In the formula (16), the compound represented by the formula,

representing the attention distribution to the current question QWeights, and have:

in the formula (14), the compound represented by the formula (I),a self-attention semantic matrix representing the current question Q,

z^Q＝tanh(W_QQ) (18)

in the formula (18), the reaction mixture,representing a parameter to be trained corresponding to the current question Q during word level attention processing;

And local visual feature vector

In the formulae (19) and (20),

And local visual feature vector

Obtaining a multi-modal feature vector e after splicing_M，

step 7.2, inputting the fusion semantic feature vector e into a long-short term memory network (LSTM) to obtain a hidden state feature sequence of a predicted answerWherein h is_A,iFor the output of the ith step of the long short term memory network LSTM, L₃For true answer marksLabel A_GTSentence length of (1);

step 7.3, utilizing full-connection operation to enable the hidden state feature sequence of the predicted answerMapping the predicted answer to the space with the same dimension as the one-hot vector table O to obtain a word vector set of the predicted answer

step 9, generating predicted answer

Set of word vectors to the predicted answer