CN110647612A - Visual dialogue generation method based on a dual-visual attention network - Google Patents

Visual dialogue generation method based on a dual-visual attention network

Info

Publication number
CN110647612A
CN110647612A (application CN201910881305.0A)
Authority
CN
China
Prior art keywords
visual
representing
attention
vector
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910881305.0A
Other languages
Chinese (zh)
Inventor
郭丹
王辉
汪萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Hefei Polytechnic University
Original Assignee
Hefei Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Polytechnic University
Priority to CN201910881305.0A
Publication of CN110647612A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 - Indexing; Data structures therefor; Storage structures
    • G06F 16/316 - Indexing structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 - Retrieval characterised by using metadata automatically derived from the content

Abstract

The invention discloses a visual dialogue generation method based on a dual-visual attention network, which comprises the following steps: 1. preprocessing the text input of the visual dialogue and constructing a vocabulary; 2. extracting the features of the dialogue image and of the dialogue text; 3. attending to the dialogue history based on the current question; 4. applying independent attention to each of the two visual features; 5. applying cross attention between the two visual features; 6. optimizing the visual features; 7. generating the answer feature sequence by multi-modal semantic fusion and decoding; 8. optimizing the parameters of the visual dialogue generation network model based on the dual-visual attention network; 9. generating the predicted answer. The invention provides the agent with more complete and reasonable visual semantic information and finer-grained textual semantic information, thereby improving the reasonableness and accuracy of the answers the agent predicts for the question.

Description

Visual dialogue generation method based on a dual-visual attention network
Technical Field
The invention belongs to the technical field of computer vision, relates to technologies such as pattern recognition, natural language processing and artificial intelligence, and particularly relates to a visual dialogue generation method based on a dual-visual attention network.
Background
Visual dialogue is a form of human-machine interaction whose goal is to let a machine agent hold a reasonable and correct natural conversation with a human, in question-and-answer form, about a given everyday scene image. The key problem in visual dialogue is therefore how to let the agent correctly understand the multi-modal semantic information composed of images and text so that it can give reasonable answers to the questions posed by the human. Visual dialogue is also one of the most actively researched topics in computer vision, and its application scenarios are very broad, including helping visually impaired people understand social media content or their daily environment, artificial intelligence assistants, and robotics.
With the development of modern image processing technology and deep learning, visual dialogue technology has advanced considerably, but it still faces the following problems:
First, the agent lacks finer-grained learning of text features when processing textual information.
For example, the history-conditioned image attention method proposed by Jiasen Lu et al. in "Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model", published at the international conference on Neural Information Processing Systems (NIPS 2017), first applies sentence-level attention to the dialogue history and then performs attention learning over the image features based on the processed text features. When handling the text of the current question, however, the method considers only sentence-level semantics and ignores word-level semantics, whereas in practice only some of the keywords in the question are most relevant to predicting the answer. The method therefore has certain limitations in practical applications.
Second, existing methods extract features from the global image only, so the visual semantic information is not accurate enough.
For example, in the paper "Are You Talking to Me? Reasoned Visual Dialog Generation Through Adversarial Learning", published by Qi Wu et al. at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), global visual features, question features and dialogue-history text features go through a series of mutual attention and fusion operations to obtain multi-modal semantic features. Because the visual information comes only from global image features, object-level visual semantics relevant to the question are easily lost.
Disclosure of Invention
The invention provides a visual dialogue generation method based on a dual-visual attention network in order to overcome the shortcomings of the prior art, so that more complete and reasonable visual semantic information and finer-grained textual semantic information can be provided to the agent, thereby improving the reasonableness and accuracy with which the agent reasons out answers to questions.
The invention adopts the following technical scheme to solve the technical problem:
The visual dialogue generation method based on a dual-visual attention network according to the invention is characterized by comprising the following steps:
Step 1, preprocess the text input of the visual dialogue and construct a vocabulary:
Step 1.1, obtain a visual dialogue dataset, which contains sentence texts and images;
perform word segmentation on all sentence texts in the visual dialogue dataset to obtain the segmented words;
Step 1.2, from the segmented words, select all words whose word frequency is greater than a threshold and construct the word index table Voc; then one-hot encode every word in the index table Voc to obtain the one-hot vector table O = [o_1, o_2, ..., o_n, ..., o_N], where o_n denotes the one-hot encoding vector of the n-th word in the index table Voc and N is the number of words in the index table Voc;
Step 1.3, randomly initialize a word embedding matrix W_e ∈ R^(N×d_w), where d_w denotes the dimension of a word vector; use the word embedding matrix W_e to map the encoding vector of every word in the one-hot vector table to the corresponding word vector, thereby obtaining the word vector table;
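As an illustration of step 1, the following is a minimal Python sketch of vocabulary construction, one-hot encoding and word-embedding lookup; the tokenizer, the frequency threshold and the embedding dimension d_w = 300 are illustrative assumptions rather than values fixed by the patent.

import re
from collections import Counter
import numpy as np

def build_vocab(sentences, min_freq=4):
    # Steps 1.1-1.2: segment sentences into words and keep the frequent ones.
    counter = Counter(w for s in sentences for w in re.findall(r"[a-z']+", s.lower()))
    words = [w for w, c in counter.items() if c > min_freq]
    voc = {w: i for i, w in enumerate(words)}        # word index table Voc
    one_hot = np.eye(len(voc), dtype=np.float32)     # one-hot vector table O
    return voc, one_hot

def embed(words, voc, one_hot, W_e):
    # Step 1.3: map each one-hot vector o_n to its word vector via W_e (N x d_w).
    return np.stack([one_hot[voc[w]] @ W_e for w in words if w in voc])

sentences = ["is there a dog in the picture ?", "yes , a brown dog on the grass"]
voc, one_hot = build_vocab(sentences, min_freq=0)
W_e = np.random.randn(len(voc), 300).astype(np.float32)   # random initialization, d_w = 300
vectors = embed("a brown dog".split(), voc, one_hot, W_e)
print(vectors.shape)   # (3, 300)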
Step 2, extract the features of the dialogue image and of the dialogue text:
Step 2.1, obtain from the visual dialogue dataset the visual dialogue information D composed of any image I, the corresponding dialogue history U = [u_1, u_2, ..., u_t, ..., u_T], the current question Q = [w_{Q,1}, w_{Q,2}, ..., w_{Q,i}, ..., w_{Q,L_1}] and the ground-truth answer label A_GT; where T is the total number of dialogue rounds in the history U, u_t denotes the t-th round of the dialogue, L_1 denotes the sentence length of the current question Q, and w_{Q,i} denotes the word vector in the word vector table corresponding to the i-th word of the current question Q;
Step 2.2, extract features of the image I in the visual dialogue information D with a convolutional neural network to obtain the global visual feature V^(0) = [v^(0)_1, v^(0)_2, ..., v^(0)_m, ..., v^(0)_M], where v^(0)_m denotes the m-th region feature of the global visual feature V^(0), M denotes the total number of spatial regions of V^(0), and d_g is the channel dimension of V^(0);
Step 2.3, extract features of the image I in the visual dialogue information D with an object detection feature extractor to obtain the local visual feature R^(0) = [r^(0)_1, r^(0)_2, ..., r^(0)_k, ..., r^(0)_K], where r^(0)_k denotes the feature of the k-th detected object in R^(0), K denotes the total number of detected local target objects in R^(0), and d_r is the channel dimension of R^(0);
Step 2.4, map the global visual feature and the local visual feature to a space of the same dimension with fully connected layers, obtaining the transformed global visual feature V = [v_1, v_2, ..., v_m, ..., v_M] and local visual feature R = [r_1, r_2, ..., r_k, ..., r_K], where v_m denotes the m-th region feature of V, r_k denotes the k-th target object feature of R, and d is the transformed channel dimension;
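A hedged sketch of steps 2.2-2.4: torchvision's VGG16 backbone stands in for the global extractor, and the local features are assumed to arrive as a precomputed (K, d_r) tensor from a detector; the choices d = 512, d_r = 2048 and the 224x224 input size (giving M = 49 spatial regions) are illustrative.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class VisualFeatures(nn.Module):
    def __init__(self, d=512, d_g=512, d_r=2048):
        super().__init__()
        # Convolutional part only; in practice load weights pretrained as described in step 2.2.
        self.cnn = vgg16(weights=None).features
        self.fc_v = nn.Linear(d_g, d)      # step 2.4: project global region features to dimension d
        self.fc_r = nn.Linear(d_r, d)      # step 2.4: project local object features to dimension d

    def forward(self, image, region_feats):
        # image: (B, 3, 224, 224); region_feats: (B, K, d_r) from a pretrained detector.
        with torch.no_grad():              # the extractor itself is not fine-tuned
            fmap = self.cnn(image)         # (B, d_g, 7, 7)
        V0 = fmap.flatten(2).transpose(1, 2)   # (B, M, d_g), M = 49 spatial regions
        V = self.fc_v(V0)                  # transformed global feature V: (B, M, d)
        R = self.fc_r(region_feats)        # transformed local feature R: (B, K, d)
        return V, R

model = VisualFeatures()
V, R = model(torch.randn(2, 3, 224, 224), torch.randn(2, 36, 2048))
print(V.shape, R.shape)   # torch.Size([2, 49, 512]) torch.Size([2, 36, 512])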
Step 2.5, extract features of the current question Q with a long short-term memory network (LSTM) to obtain the hidden state sequence [h_{Q,1}, h_{Q,2}, ..., h_{Q,L_1}], and take the hidden state output at the last step of the LSTM, h_{Q,L_1}, as the sentence-level question feature vector q of the current question Q, where h_{Q,i} denotes the hidden state output at the i-th step of the LSTM;
Step 2.6, extract features of the t-th round of the dialogue history U, u_t = [w_{t,1}, w_{t,2}, ..., w_{t,L_2}], with the LSTM to obtain the t-th hidden state sequence [h_{t,1}, h_{t,2}, ..., h_{t,L_2}], and take the hidden state output at the last step of the LSTM, h_{t,L_2}, as the sentence-level feature h_t of the t-th round u_t; the overall history feature is then H = [h_1, h_2, ..., h_t, ..., h_T], where w_{t,i} denotes the word vector in the word vector table corresponding to the i-th word of u_t, L_2 is the sentence length of u_t, and h_{t,i} denotes the hidden state output at the i-th step of the LSTM;
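A minimal sketch of steps 2.5-2.6: the current question and each history round are encoded by an LSTM and the last hidden state is kept as the sentence-level feature; batch size, T, L_1, L_2 and the dimensions are illustrative.

import torch
import torch.nn as nn

d_w, d = 300, 512
lstm_q = nn.LSTM(d_w, d, batch_first=True)   # encoder for the current question Q
lstm_h = nn.LSTM(d_w, d, batch_first=True)   # encoder for each history round u_t

Q = torch.randn(1, 16, d_w)                  # (B, L1, d_w) word vectors of the question
U = torch.randn(1, 10, 25, d_w)              # (B, T, L2, d_w) word vectors of T history rounds

out_q, _ = lstm_q(Q)
q = out_q[:, -1]                             # sentence-level question feature q: (B, d)

B, T, L2, _ = U.shape
out_h, _ = lstm_h(U.view(B * T, L2, d_w))
H = out_h[:, -1].view(B, T, d)               # history features H = [h_1, ..., h_T]: (B, T, d)
print(q.shape, H.shape)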
Step 3, attend to the dialogue history based on the current question:
apply attention to the overall history feature H = [h_1, h_2, ..., h_t, ..., h_T] with equation (1) to obtain the attended history feature vector h_a:
h_a = α_h H^T   (1)
In equation (1), α_h denotes the attention distribution weights over the history feature H, and:
α_h = softmax(P^T z_h)   (2)
In equation (2), z_h denotes the similarity matrix between the sentence-level question feature vector q and the history feature H, P denotes the parameter to be trained corresponding to the similarity matrix z_h, and:
z_h = tanh(W_q q + W_h H)   (3)
In equation (3), W_q denotes the parameter to be trained corresponding to the sentence-level question feature vector q, and W_h denotes the parameter to be trained corresponding to the history feature H;
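A sketch of the history attention of step 3 (equations (1)-(3)); the hidden size k of the similarity matrix z_h is an illustrative hyper-parameter.

import torch
import torch.nn as nn

class HistoryAttention(nn.Module):
    # z_h = tanh(W_q q + W_h H); alpha_h = softmax(P^T z_h); h_a = weighted sum of the h_t
    def __init__(self, d=512, k=512):
        super().__init__()
        self.W_q = nn.Linear(d, k, bias=False)
        self.W_h = nn.Linear(d, k, bias=False)
        self.P = nn.Linear(k, 1, bias=False)

    def forward(self, q, H):
        # q: (B, d) sentence-level question; H: (B, T, d) history features.
        z_h = torch.tanh(self.W_q(q).unsqueeze(1) + self.W_h(H))   # (B, T, k)
        alpha_h = torch.softmax(self.P(z_h), dim=1)                # (B, T, 1)
        return (alpha_h * H).sum(dim=1)                            # attended history h_a: (B, d)

h_a = HistoryAttention()(torch.randn(2, 512), torch.randn(2, 10, 512))
print(h_a.shape)   # torch.Size([2, 512])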
Step 4, apply independent attention to each of the two visual features:
Step 4.1, apply attention to the global visual feature V = [v_1, v_2, ..., v_m, ..., v_M] with equation (4) to obtain the attended global visual feature vector V':
V' = α_V1 V^T   (4)
In equation (4), α_V1 denotes the attention distribution weights over the global visual feature V, and:
α_V1 = softmax(P_V1^T z_V1)   (5)
In equation (5), z_V1 denotes the similarity matrix among the sentence-level question feature vector q, the attended history feature vector h_a and the global visual feature V, P_V1 denotes the parameter to be trained corresponding to the similarity matrix z_V1, and:
z_V1 = tanh(W_q1 q + W_h1 h_a + W_V1 V)   (6)
In equation (6), W_q1 denotes the parameter to be trained corresponding to the sentence-level question feature vector q, W_h1 denotes the parameter to be trained corresponding to the attended history feature vector h_a, and W_V1 denotes the parameter to be trained corresponding to the global visual feature V;
Step 4.2, apply attention to the local visual feature R = [r_1, r_2, ..., r_k, ..., r_K] with equation (7) to obtain the attended local visual feature vector R':
R' = α_R1 R^T   (7)
In equation (7), α_R1 denotes the attention distribution weights over the local visual feature R, and:
α_R1 = softmax(P_R1^T z_R1)   (8)
In equation (8), z_R1 denotes the similarity matrix among the sentence-level question feature vector q, the attended history feature vector h_a and the local visual feature R, P_R1 denotes the parameter to be trained corresponding to the similarity matrix z_R1, and:
z_R1 = tanh(W'_q1 q + W'_h1 h_a + W_R1 R)   (9)
In equation (9), W'_q1 denotes the parameter to be trained corresponding to the sentence-level question feature vector q, W'_h1 denotes the parameter to be trained corresponding to the attended history feature vector h_a, and W_R1 denotes the parameter to be trained corresponding to the local visual feature R;
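A sketch of step 4: the same question-and-history guided attention is applied independently, with separate weights, to the global features V and the local features R (equations (4)-(9)); dimensions and names are illustrative.

import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    # z = tanh(W_q q + W_h h_a + W_x X); alpha = softmax(P^T z); attended vector = weighted sum of X
    def __init__(self, d=512, k=512):
        super().__init__()
        self.W_q = nn.Linear(d, k, bias=False)
        self.W_h = nn.Linear(d, k, bias=False)
        self.W_x = nn.Linear(d, k, bias=False)
        self.P = nn.Linear(k, 1, bias=False)

    def forward(self, q, h_a, X):
        # q, h_a: (B, d); X: (B, M_or_K, d) -> attended feature vector: (B, d)
        z = torch.tanh(self.W_q(q).unsqueeze(1) + self.W_h(h_a).unsqueeze(1) + self.W_x(X))
        alpha = torch.softmax(self.P(z), dim=1)
        return (alpha * X).sum(dim=1)

att_V = GuidedAttention()   # attention over the global features V (equations (4)-(6))
att_R = GuidedAttention()   # attention over the local features R (equations (7)-(9))
q, h_a = torch.randn(2, 512), torch.randn(2, 512)
V, R = torch.randn(2, 49, 512), torch.randn(2, 36, 512)
V1, R1 = att_V(q, h_a, V), att_R(q, h_a, R)   # V' and R'
print(V1.shape, R1.shape)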
Step 5, apply cross attention between the two visual features:
Step 5.1, apply dual-visual cross attention to the global visual feature V = [v_1, v_2, ..., v_m, ..., v_M] with equation (10) to obtain the further-attended global visual feature vector V'':
V'' = α_V2 V^T   (10)
In equation (10), α_V2 denotes the further attention distribution weights over the global visual feature V, and:
α_V2 = softmax(P_V2^T z_V2)   (11)
In equation (11), z_V2 denotes the similarity matrix among the sentence-level question feature vector q, the attended history feature vector h_a, the attended local visual feature vector R' and the global visual feature V, P_V2 denotes the parameter to be trained corresponding to the similarity matrix z_V2, and:
z_V2 = tanh(W_q2 q + W_h2 h_a + W_R2 R' + W_V2 V)   (12)
In equation (12), W_q2 denotes the parameter to be trained corresponding to the sentence-level question feature vector q, W_h2 denotes the parameter to be trained corresponding to the attended history feature vector h_a, W_R2 denotes the parameter to be trained corresponding to the attended local visual feature vector R', and W_V2 denotes the parameter to be trained corresponding to the global visual feature V;
Step 5.2, apply dual-visual cross attention to the local visual feature R = [r_1, r_2, ..., r_k, ..., r_K] with equation (13) to obtain the further-attended local visual feature vector R'':
R'' = α_R2 R^T   (13)
In equation (13), α_R2 denotes the further attention distribution weights over the local visual feature R, and:
α_R2 = softmax(P_R2^T z_R2)   (14)
In equation (14), z_R2 denotes the similarity matrix among the sentence-level question feature vector q, the attended history feature vector h_a, the attended global visual feature vector V' and the local visual feature R, P_R2 denotes the parameter to be trained corresponding to the similarity matrix z_R2, and:
z_R2 = tanh(W'_q2 q + W'_h2 h_a + W'_V2 V' + W'_R2 R)   (15)
In equation (15), W'_q2 denotes the parameter to be trained corresponding to the sentence-level question feature vector q, W'_h2 denotes the parameter to be trained corresponding to the attended history feature vector h_a, W'_V2 denotes the parameter to be trained corresponding to the attended global visual feature vector V', and W'_R2 denotes the parameter to be trained corresponding to the local visual feature R;
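A sketch of the cross attention of step 5 (equations (10)-(15)), where the attended vector of one visual stream additionally guides the attention over the other stream; names and dimensions are illustrative.

import torch
import torch.nn as nn

class CrossGuidedAttention(nn.Module):
    # z = tanh(W_q q + W_h h_a + W_c c + W_x X); alpha = softmax(P^T z); attended vector = weighted sum of X
    # c is the attended vector of the other visual stream (R' when attending V, V' when attending R).
    def __init__(self, d=512, k=512):
        super().__init__()
        self.W_q = nn.Linear(d, k, bias=False)
        self.W_h = nn.Linear(d, k, bias=False)
        self.W_c = nn.Linear(d, k, bias=False)
        self.W_x = nn.Linear(d, k, bias=False)
        self.P = nn.Linear(k, 1, bias=False)

    def forward(self, q, h_a, cross, X):
        z = torch.tanh(self.W_q(q).unsqueeze(1) + self.W_h(h_a).unsqueeze(1)
                       + self.W_c(cross).unsqueeze(1) + self.W_x(X))
        alpha = torch.softmax(self.P(z), dim=1)
        return (alpha * X).sum(dim=1)

cross_V, cross_R = CrossGuidedAttention(), CrossGuidedAttention()
q, h_a = torch.randn(2, 512), torch.randn(2, 512)
V, R = torch.randn(2, 49, 512), torch.randn(2, 36, 512)
V1, R1 = torch.randn(2, 512), torch.randn(2, 512)   # V', R' from step 4
V2 = cross_V(q, h_a, R1, V)                         # V'' of equation (10), guided by R'
R2 = cross_R(q, h_a, V1, R)                         # R'' of equation (13), guided by V'
print(V2.shape, R2.shape)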
Step 6, optimize the visual features:
Step 6.1, apply word-level attention to the current question Q with equation (16) to obtain the attended word-level question feature vector q_s:
q_s = α_q Q^T   (16)
In equation (16), α_q denotes the attention distribution weights over the current question Q, and:
α_q = softmax(P_Q^T z_Q)   (17)
In equation (17), z_Q denotes the self-attention semantic matrix of the current question Q, P_Q denotes the parameter to be trained corresponding to the self-attention semantic matrix z_Q, and:
z_Q = tanh(W_Q Q)   (18)
In equation (18), W_Q denotes the parameter to be trained corresponding to the current question Q in the word-level attention processing;
Step 6.2, refine the further-attended global visual feature vector V'' and local visual feature vector R'' with equations (19) and (20), respectively, to obtain the final global visual feature vector V_f and the final local visual feature vector R_f:
V_f = (W_s q_s) ⊙ V''   (19)
R_f = (W_s q_s) ⊙ R''   (20)
In equations (19) and (20), W_s denotes the parameter to be trained corresponding to the word-level question feature vector q_s in the visual feature optimization, and ⊙ denotes the element-wise (dot) product;
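A sketch of step 6: self-attention over the question words (equations (16)-(18)) and question-gated refinement of the two attended visual vectors; since the original images of equations (19)-(20) are not legible, the element-wise product below is an assumed reconstruction, and all dimensions are illustrative.

import torch
import torch.nn as nn

class QuestionWordAttention(nn.Module):
    # z_Q = tanh(W_Q Q); alpha_q = softmax(P_Q^T z_Q); q_s = weighted sum of the question word vectors
    def __init__(self, d_w=300, k=512):
        super().__init__()
        self.W_Q = nn.Linear(d_w, k, bias=False)
        self.P_Q = nn.Linear(k, 1, bias=False)

    def forward(self, Q):
        alpha_q = torch.softmax(self.P_Q(torch.tanh(self.W_Q(Q))), dim=1)   # (B, L1, 1)
        return (alpha_q * Q).sum(dim=1)                                     # q_s: (B, d_w)

d_w, d = 300, 512
word_att = QuestionWordAttention(d_w)
W_s = nn.Linear(d_w, d, bias=False)             # maps q_s into the visual feature space

Q = torch.randn(2, 16, d_w)
V2, R2 = torch.randn(2, d), torch.randn(2, d)   # V'', R'' from step 5
q_s = word_att(Q)
gate = W_s(q_s)
V_f, R_f = gate * V2, gate * R2                 # equations (19)-(20), assumed Hadamard gating
print(q_s.shape, V_f.shape, R_f.shape)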
Step 7, generate the answer feature sequence by multi-modal semantic fusion and decoding:
Step 7.1, concatenate the attended word-level question feature vector q_s, the attended history feature vector h_a, the optimized global visual feature vector V_f and the optimized local visual feature vector R_f to obtain the multi-modal feature vector e_M, where d_M = 3d + d_w denotes the dimension of the multi-modal feature vector; then map the multi-modal feature vector e_M with a fully connected layer to obtain the fused semantic feature vector e;
Step 7.2, input the fused semantic feature vector e into a long short-term memory network (LSTM) to obtain the hidden state feature sequence of the predicted answer [h_{A,1}, h_{A,2}, ..., h_{A,L_3}], where h_{A,i} is the output of the LSTM at the i-th step and L_3 is the sentence length of the ground-truth answer label A_GT;
Step 7.3, map the hidden state feature sequence of the predicted answer with a fully connected layer into a space of the same dimension as the one-hot vector table O, obtaining the word vector set of the predicted answer Y = [y_1, y_2, ..., y_{L_3}], where y_i denotes the mapping vector of the i-th word of the predicted answer and the length of that vector equals the number of words;
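A sketch of step 7: concatenating the four feature vectors, fusing them with a fully connected layer, and decoding per-word scores with an LSTM. The vocabulary size N = 10000 and answer length L_3 = 9 are illustrative, and feeding the fused vector e at every decoding step is an assumption, since the patent only states that e is input to the LSTM.

import torch
import torch.nn as nn

d_w, d, N, L3 = 300, 512, 10000, 9
fuse = nn.Linear(3 * d + d_w, d)               # maps e_M to the fused semantic vector e
decoder = nn.LSTM(d, d, batch_first=True)
to_vocab = nn.Linear(d, N)                     # maps hidden states h_A,i to word vectors y_i

q_s, h_a = torch.randn(2, d_w), torch.randn(2, d)
V_f, R_f = torch.randn(2, d), torch.randn(2, d)

e_M = torch.cat([q_s, h_a, V_f, R_f], dim=-1)          # multi-modal feature e_M: (B, 3d + d_w)
e = fuse(e_M)                                          # fused semantic feature e: (B, d)
H_A, _ = decoder(e.unsqueeze(1).repeat(1, L3, 1))      # hidden states h_A,1 ... h_A,L3
Y = to_vocab(H_A)                                      # predicted word vectors: (B, L3, N)
print(Y.shape)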
Step 8, optimize the parameters of the visual dialogue generation network model based on the dual-visual attention network:
Step 8.1, according to the one-hot vector table O, construct the vector set Y_GT = [y'_1, y'_2, ..., y'_{L_3}] from the words of the ground-truth answer label A_GT, where y'_i denotes the mapping vector of the i-th word of the ground-truth answer label A_GT and the length of that vector equals the number of words;
Step 8.2, calculate with equation (21) the loss cost E between the predicted answer and the ground-truth answer A_GT;
Step 8.3, minimize the loss cost E by solving with stochastic gradient descent, thereby obtaining the visual dialogue network model based on the dual-visual attention network with optimal parameters;
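A sketch of step 8, assuming the loss E of equation (21) is a per-word cross-entropy between the predicted word scores and the ground-truth answer words (the original formula is only given as an image), optimized with stochastic gradient descent.

import torch
import torch.nn as nn

N, L3 = 10000, 9
criterion = nn.CrossEntropyLoss()                # assumed form of the loss E in equation (21)

Y = torch.randn(2, L3, N, requires_grad=True)    # predicted word scores from step 7
target = torch.randint(0, N, (2, L3))            # indices of the true answer words A_GT

E = criterion(Y.view(-1, N), target.view(-1))
optimizer = torch.optim.SGD([Y], lr=0.1)         # in practice, all trainable parameters of the model
optimizer.zero_grad()
E.backward()
optimizer.step()
print(float(E))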
Step 9, generate the predicted answer:
for each mapping vector y_i in the word vector set of the predicted answer Y, use a greedy decoding algorithm to take the position of the maximum value of y_i and look up, in the word index table Voc, the word at that position as the word finally predicted for y_i; the predicted words form the predicted answer corresponding to the word vector set Y, and the current question Q together with this predicted answer is taken as the finally generated visual dialogue.
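A sketch of the greedy decoding of step 9: each predicted word vector y_i is mapped back to the word whose index holds the maximum score; the tiny vocabulary is illustrative.

import torch

voc = {"yes": 0, "no": 1, "a": 2, "brown": 3, "dog": 4}       # illustrative word index table Voc
idx2word = {i: w for w, i in voc.items()}

Y = torch.randn(1, 9, len(voc))                                # predicted word vectors of one answer
answer = [idx2word[int(i)] for i in Y[0].argmax(dim=-1)]       # greedy: take the argmax position per step
print(" ".join(answer))                                        # predicted answer, paired with the question Q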
Compared with the prior art, the invention has the following beneficial effects:
1. Unlike previously studied visual dialogue techniques, the invention extracts not only the visual features of the global image but also the visual features of local image objects. The visual semantic information contained in the global visual features is more comprehensive, while the visual semantics contained in the local visual features are more specific. The characteristics of both kinds of visual features are therefore fully exploited, and their internal relations and mutual relations are learned through two stages of attention processing, forming a visual semantic complement that gives the agent more complete and accurate visual semantic information.
2. The invention processes text features at both the sentence level and the word level. It first extracts sentence-level features of the question and of the dialogue history and applies attention to the history features; it then learns the relations of the two visual features based on the sentence-level text features; finally, it applies word-level attention to the question features so as to capture the keyword features that help answer inference. This finer-grained text processing enables the invention to generate more accurate and reasonable answers in the visual dialogue.
3. The invention proposes a multi-modal semantic fusion structure. It first optimizes the two visual features with the word-level question text features so as to further highlight the visual information related to the question keywords. It then concatenates the question features, the dialogue history features, the global visual features and the local visual features for learning and fusion. Through the multi-modal semantic fusion network, the visual features and the text features can influence each other and help optimize the parameters of the network. Once the fusion network obtains the visual semantics and the text semantics simultaneously, the answer prediction of the agent improves greatly and the prediction result is more accurate.
Drawings
FIG. 1 is a schematic diagram of a network model of the present invention;
FIG. 2 is a schematic diagram of dual visual attention processing in accordance with the present invention;
FIG. 3 is a schematic diagram of network model training according to the present invention.
Detailed Description
In this embodiment, as shown in FIG. 1, a visual dialogue generation method based on a dual-visual attention network proceeds as follows:
Step 1, preprocess the text input of the visual dialogue and construct a vocabulary:
Step 1.1, obtain a visual dialogue dataset from the Internet; the main publicly available dataset is the VisDial dataset, collected by researchers at the Georgia Institute of Technology; the visual dialogue dataset contains sentence texts and images;
perform word segmentation on all sentence texts in the visual dialogue dataset to obtain the segmented words;
Step 1.2, from the segmented words, select all words whose word frequency is greater than a threshold, the threshold being set to 4, and construct the word index table Voc; the word index table Voc is built as follows: the vocabulary may contain words and punctuation marks; the words are counted and sorted, and a blank (padding) token is added to support the optimized training process; a word-to-index correspondence is then built for all words in this order; afterwards, every word in the index table Voc is one-hot encoded to obtain the one-hot vector table O = [o_1, o_2, ..., o_n, ..., o_N], where o_n denotes the one-hot encoding vector of the n-th word in the index table Voc and N is the number of words in the index table Voc;
Step 1.3, randomly initialize a word embedding matrix W_e ∈ R^(N×d_w), where d_w denotes the dimension of a word vector; use the word embedding matrix W_e to map the encoding vector o_n of the n-th word in the one-hot vector table to the n-th word vector w_n, thereby obtaining the word vector table;
Step 2, extract the features of the dialogue image and of the dialogue text:
Step 2.1, obtain from the visual dialogue dataset the visual dialogue information D composed of any image I, the corresponding dialogue history U = [u_1, u_2, ..., u_t, ..., u_T], the current question Q = [w_{Q,1}, w_{Q,2}, ..., w_{Q,i}, ..., w_{Q,L_1}] and the ground-truth answer label A_GT; where T is the total number of dialogue rounds in the history U, u_t denotes the t-th round of the dialogue, L_1 denotes the sentence length of the current question Q and is set to 16, sentences shorter than 16 words being padded with zero vectors up to length L_1, and w_{Q,i} denotes the word vector of the i-th word of the sentence;
Step 2.2, extract features of the image I in the visual dialogue information D with a convolutional neural network to obtain the global visual feature V^(0) = [v^(0)_1, v^(0)_2, ..., v^(0)_m, ..., v^(0)_M], where v^(0)_m denotes the m-th region feature of V^(0), M denotes the total number of spatial regions of V^(0), and d_g is the channel dimension of V^(0); in this embodiment, a pre-trained VGG convolutional neural network may be used to extract the global visual features of image I; VGG is a two-dimensional convolutional neural network that has been shown to have strong visual representation ability, so a VGG pre-trained on the COCO2014 dataset is used as the global visual feature extractor of the experiment, and this network does not take part in the parameter update of the subsequent step 8;
Step 2.3, extract features of the image I in the visual dialogue information D with an object detection feature extractor to obtain the local visual feature R^(0) = [r^(0)_1, r^(0)_2, ..., r^(0)_k, ..., r^(0)_K], where r^(0)_k denotes the feature of the k-th detected object in R^(0), K denotes the total number of detected local target objects in R^(0), and d_r is the channel dimension of R^(0); in this embodiment, a pre-trained Faster R-CNN object detection feature extractor may be used to extract the local visual features of image I; the local visual features extracted by Faster R-CNN have achieved excellent results on many visual tasks, so a Faster R-CNN pre-trained on the Visual Genome dataset is used as the local visual feature extractor of the experiment, and this network does not take part in the parameter update of the subsequent step 8;
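A brief sketch of how the pretrained extractors of steps 2.2-2.3 can be kept out of the parameter update of step 8, as the embodiment describes; the torchvision models below are substitutes, since the VGG pretrained on COCO2014 and the Faster R-CNN pretrained on Visual Genome named in the text are not bundled with torchvision.

import torch
from torchvision.models import vgg16
from torchvision.models.detection import fasterrcnn_resnet50_fpn

global_cnn = vgg16(weights=None).features.eval()          # stand-in for the VGG global extractor
detector = fasterrcnn_resnet50_fpn(weights=None).eval()   # stand-in for the Faster R-CNN local extractor

for p in list(global_cnn.parameters()) + list(detector.parameters()):
    p.requires_grad = False          # exclude both extractors from the gradient updates of step 8

with torch.no_grad():                # feature extraction only
    fmap = global_cnn(torch.randn(1, 3, 224, 224))
print(fmap.shape)                    # torch.Size([1, 512, 7, 7])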
Step 2.4, map the global visual feature and the local visual feature to a space of the same dimension with fully connected layers, obtaining the transformed global visual feature V = [v_1, v_2, ..., v_m, ..., v_M] and local visual feature R = [r_1, r_2, ..., r_k, ..., r_K], where v_m denotes the m-th region feature of V, r_k denotes the k-th target object feature of R, and d is the transformed channel dimension;
Step 2.5, extract features of the current question Q with a long short-term memory network (LSTM) to obtain the hidden state sequence [h_{Q,1}, h_{Q,2}, ..., h_{Q,L_1}], and take the hidden state output at the last step of the LSTM, h_{Q,L_1}, as the sentence-level question feature vector q of the current question Q, where h_{Q,i} denotes the hidden state output at the i-th step of the LSTM;
Step 2.6, extract features of each round u_t = [w_{t,1}, w_{t,2}, ..., w_{t,L_2}] of the dialogue history U with the LSTM to obtain the hidden state sequence [h_{t,1}, h_{t,2}, ..., h_{t,L_2}], and take the hidden state output at the last step of the LSTM, h_{t,L_2}, as the sentence-level feature h_t of the round u_t; the overall history feature is then H = [h_1, h_2, ..., h_t, ..., h_T], where w_{t,i} denotes the word vector of the i-th word of u_t, L_2 is the sentence length of u_t and is set to 25, sentences shorter than 25 words being padded with zero vectors up to length L_2, and h_{t,i} denotes the hidden state output at the i-th step of the LSTM;
Step 3, attend to the dialogue history based on the current question:
apply attention to the overall history feature H = [h_1, h_2, ..., h_t, ..., h_T] with equation (1) to obtain the attended history feature vector h_a:
h_a = α_h H^T   (1)
In equation (1), α_h denotes the attention distribution weights over the history feature H, and:
α_h = softmax(P^T z_h)   (2)
In equation (2), z_h denotes the similarity matrix between the sentence-level question feature vector q and the history feature H, P denotes the parameter to be trained corresponding to the similarity matrix z_h, and:
z_h = tanh(W_q q + W_h H)   (3)
In equation (3), W_q denotes the parameter to be trained corresponding to the sentence-level question feature vector q, and W_h denotes the parameter to be trained corresponding to the history feature H;
Step 4, as shown in FIG. 2, apply independent attention to each of the two visual features:
Step 4.1, apply attention to the global visual feature V = [v_1, v_2, ..., v_m, ..., v_M] with equation (4) to obtain the attended global visual feature vector V':
V' = α_V1 V^T   (4)
In equation (4), α_V1 denotes the attention distribution weights over the global visual feature V, and:
α_V1 = softmax(P_V1^T z_V1)   (5)
In equation (5), z_V1 denotes the similarity matrix among the sentence-level question feature vector q, the attended history feature vector h_a and the global visual feature V, P_V1 denotes the parameter to be trained corresponding to the similarity matrix z_V1, and:
z_V1 = tanh(W_q1 q + W_h1 h_a + W_V1 V)   (6)
In equation (6), W_q1 denotes the parameter to be trained corresponding to the sentence-level question feature vector q, W_h1 denotes the parameter to be trained corresponding to the attended history feature vector h_a, and W_V1 denotes the parameter to be trained corresponding to the global visual feature V;
Step 4.2, apply attention to the local visual feature R = [r_1, r_2, ..., r_k, ..., r_K] with equation (7) to obtain the attended local visual feature vector R':
R' = α_R1 R^T   (7)
In equation (7), α_R1 denotes the attention distribution weights over the local visual feature R, and:
α_R1 = softmax(P_R1^T z_R1)   (8)
In equation (8), z_R1 denotes the similarity matrix among the sentence-level question feature vector q, the attended history feature vector h_a and the local visual feature R, P_R1 denotes the parameter to be trained corresponding to the similarity matrix z_R1, and:
z_R1 = tanh(W'_q1 q + W'_h1 h_a + W_R1 R)   (9)
In equation (9), W'_q1 denotes the parameter to be trained corresponding to the sentence-level question feature vector q, W'_h1 denotes the parameter to be trained corresponding to the attended history feature vector h_a, and W_R1 denotes the parameter to be trained corresponding to the local visual feature R;
Step 5, as shown in FIG. 2, apply cross attention between the two visual features:
Step 5.1, apply dual-visual cross attention to the global visual feature V = [v_1, v_2, ..., v_m, ..., v_M] with equation (10) to obtain the further-attended global visual feature vector V'':
V'' = α_V2 V^T   (10)
In equation (10), α_V2 denotes the further attention distribution weights over the global visual feature V, and:
α_V2 = softmax(P_V2^T z_V2)   (11)
In equation (11), z_V2 denotes the similarity matrix among the sentence-level question feature vector q, the attended history feature vector h_a, the attended local visual feature vector R' and the global visual feature V, P_V2 denotes the parameter to be trained corresponding to the similarity matrix z_V2, and:
z_V2 = tanh(W_q2 q + W_h2 h_a + W_R2 R' + W_V2 V)   (12)
In equation (12), W_q2 denotes the parameter to be trained corresponding to the sentence-level question feature vector q, W_h2 denotes the parameter to be trained corresponding to the attended history feature vector h_a, W_R2 denotes the parameter to be trained corresponding to the attended local visual feature vector R', and W_V2 denotes the parameter to be trained corresponding to the global visual feature V;
Step 5.2, apply dual-visual cross attention to the local visual feature R = [r_1, r_2, ..., r_k, ..., r_K] with equation (13) to obtain the further-attended local visual feature vector R'':
R'' = α_R2 R^T   (13)
In equation (13), α_R2 denotes the further attention distribution weights over the local visual feature R, and:
α_R2 = softmax(P_R2^T z_R2)   (14)
In equation (14), z_R2 denotes the similarity matrix among the sentence-level question feature vector q, the attended history feature vector h_a, the attended global visual feature vector V' and the local visual feature R, P_R2 denotes the parameter to be trained corresponding to the similarity matrix z_R2, and:
z_R2 = tanh(W'_q2 q + W'_h2 h_a + W'_V2 V' + W'_R2 R)   (15)
In equation (15), W'_q2 denotes the parameter to be trained corresponding to the sentence-level question feature vector q, W'_h2 denotes the parameter to be trained corresponding to the attended history feature vector h_a, W'_V2 denotes the parameter to be trained corresponding to the attended global visual feature vector V', and W'_R2 denotes the parameter to be trained corresponding to the local visual feature R;
Step 6, optimize the visual features:
Step 6.1, apply word-level attention to the current question Q with equation (16) to obtain the attended word-level question feature vector q_s:
q_s = α_q Q^T   (16)
In equation (16), α_q denotes the attention distribution weights over the current question Q, and:
α_q = softmax(P_Q^T z_Q)   (17)
In equation (17), z_Q denotes the self-attention semantic matrix of the current question Q, P_Q denotes the parameter to be trained corresponding to the self-attention semantic matrix z_Q, and:
z_Q = tanh(W_Q Q)   (18)
In equation (18), W_Q denotes the parameter to be trained corresponding to the current question Q in the word-level attention processing;
Step 6.2, refine the further-attended global visual feature vector V'' and local visual feature vector R'' with equations (19) and (20), respectively, to obtain the final global visual feature vector V_f and the final local visual feature vector R_f:
V_f = (W_s q_s) ⊙ V''   (19)
R_f = (W_s q_s) ⊙ R''   (20)
In equations (19) and (20), W_s denotes the parameter to be trained corresponding to the word-level question feature vector q_s in the visual feature optimization, and ⊙ denotes the element-wise (dot) product;
Step 7, generate the answer feature sequence by multi-modal semantic fusion and decoding:
Step 7.1, concatenate the attended word-level question feature vector q_s, the attended history feature vector h_a, the optimized global visual feature vector V_f and the optimized local visual feature vector R_f to obtain the multi-modal feature vector e_M, where d_M = 3d + d_w denotes the dimension of the multi-modal feature vector; then map the multi-modal feature vector e_M with a layer of fully connected operation to obtain the fused semantic feature vector e;
Step 7.2, input the fused semantic feature vector e into the long short-term memory network (LSTM) to obtain the hidden state feature sequence of the predicted answer [h_{A,1}, h_{A,2}, ..., h_{A,L_3}], where h_{A,i} is the output of the LSTM at the i-th time step and L_3 is the sentence length of the ground-truth answer label A_GT, which may be set to 9;
Step 7.3, map the hidden state feature sequence of the predicted answer with a fully connected layer into a space of the same dimension as the one-hot vector table O, obtaining the word vector set of the predicted answer Y = [y_1, y_2, ..., y_{L_3}], where y_i denotes the mapping vector of the i-th word of the predicted answer and the length of that vector equals the number of words;
Step 8, as shown in FIG. 3, optimize the parameters of the visual dialogue generation network model based on the dual-visual attention network:
Step 8.1, according to the one-hot vector table O, construct the vector set Y_GT = [y'_1, y'_2, ..., y'_{L_3}] from the words of the ground-truth answer label A_GT, where y'_i denotes the mapping vector of the i-th word of the ground-truth answer label A_GT and the length of that vector equals the number of words;
Step 8.2, calculate with equation (21) the loss cost E between the predicted answer and the ground-truth answer A_GT;
Step 8.3, minimize the loss cost E by solving with stochastic gradient descent, thereby obtaining the visual dialogue network model based on the dual-visual attention network with optimal parameters;
Step 9, generate the predicted answer:
for each mapping vector y_i in the word vector set of the predicted answer Y, use a greedy decoding algorithm to take the position of the maximum value of y_i and look up, in the word index table Voc, the word at that position as the word finally predicted for y_i; the predicted words form the predicted answer corresponding to the word vector set Y, and the current question Q together with this predicted answer is taken as the finally generated visual dialogue.

Claims (1)

1. A visual dialog generating method based on a double-visual attention network is characterized by comprising the following steps:
step 1, preprocessing text input in visual dialogue and constructing a word list:
step 1.1, acquiring a visual dialogue data set, wherein the visual dialogue data set comprises sentence texts and images;
performing word segmentation processing on all sentence texts in the visual dialogue data set to obtain segmented words;
step 1.2, screening out word frequency from the segmented wordsAll words with the rate larger than the threshold value are constructed, and a word index table Voc is constructed; then, one-hot coding is carried out on each word in the index table Voc to obtain an one-hot vector table O ═ O1,o2,...,on,...,oN]Wherein o isnRepresenting a one-hot coding vector corresponding to the nth word in the index table Voc, wherein N is the number of words in the index table Voc;
step 1.3, randomly initializing a word embedding matrix We
Figure FDA0002205996990000011
Wherein d iswA dimension representing a word vector; embedding matrices W with wordseMapping the coding vector of each word in the one-hot vector table to a corresponding word vector so as to obtain a word vector table;
step 2, extracting the characteristics of the conversation images and the characteristics of the conversation texts;
step 2.1, acquiring any one image I and corresponding historical dialogue U ═ U from the visual dialogue dataset1,u2,...,ut,...,uT]Current problem of
Figure FDA0002205996990000012
And a true answer label AGTThe composed visual dialogue information D; where T is the total number of dialog segments in the historical dialog U, UtRepresenting the t-th session in the session, L1Sentence length, w, representing the current question QQ,iRepresenting a word vector corresponding to the ith word in the current question Q in the word vector table;
step 2.2, extracting the characteristics of the image I in the visual dialogue information D by using a convolutional neural network to obtain global visual characteristics
Figure FDA0002205996990000013
Wherein
Figure FDA0002205996990000014
Representing a global visual feature V(0)M of (1)Regional characteristics, M denotes the global visual characteristics V(0)Total number of spatial regions of (1), dgFor global visual features V(0)The channel dimension of (a);
step 2.3, extracting the characteristics of the image I in the visual dialogue information D by using a target detection characteristic extractor to obtain local visual characteristics
Figure FDA0002205996990000015
Wherein
Figure FDA0002205996990000016
Representing local visual features R(0)K denotes the local visual feature R(0)Total number of detected local target objects in drFor local visual features R(0)The channel dimension of (a);
step 2.4, mapping the global visual feature and the local visual feature to a space with the same dimensionality by using full connection operation to obtain the converted global visual feature V ═ V1,v2,...,vm,...,vM],
Figure FDA0002205996990000017
And local visual feature R ═ R1,r2,...,rk,...,rK],Wherein v ismRepresenting the m-th region feature, r, in the global visual feature VkRepresenting the kth target object feature in the local visual feature R, wherein d is the converted channel dimension;
step 2.5, using long-short term memory network LSTM to extract features of current problem Q to obtain hidden state feature sequenceAnd the hidden state feature of the last step length output of the long-short term memory network LSTM is taken
Figure FDA0002205996990000022
The sentence-level question feature vector Q as the current question Q,
Figure FDA0002205996990000023
wherein h isQ,iA hidden state characteristic representing the ith step length output of the long-short term memory network LSTM;
step 2.6, using long-short term memory network LSTM to process the t section dialogue in history dialogue U
Figure FDA0002205996990000024
Carrying out feature extraction to obtain the t-th hidden state sequence
Figure FDA0002205996990000025
The long-short term memory network takes the hidden state characteristic of the last step length output of the LSTMAs a t-th dialog utSentence level features of ht
Figure FDA0002205996990000027
The total historical dialog feature is H ═ H1,h2,...,ht,...,hT],
Figure FDA0002205996990000028
Wherein wt,iRepresenting the t-th dialog utThe word vector, L, corresponding to the ith word in the word vector table2For the t-th dialogue utLength of sentence, ht,iA hidden state characteristic representing the ith step length output of the long-short term memory network LSTM;
step 3, performing attention processing on historical dialogue information based on current question information;
using equation (1) to obtain the total historical dialogue feature H ═ H1,h2,...,ht,...,hT]The attention-seeking treatment is carried out,obtaining a history feature vector h concerned by attentiona
Figure FDA0002205996990000029
ha=αhHT (1)
In the formula (1), the reaction mixture is,
Figure FDA00022059969900000210
represents the attention distribution weight for the historical dialog feature H and has:
αh=softmax(PTzh) (2)
in the formula (2), the reaction mixture is,
Figure FDA00022059969900000211
representing a similarity matrix between the sentence-level question feature vector q and the historical dialog features H,
Figure FDA00022059969900000212
representing a similarity matrix zhAnd the parameters to be trained include:
zh=tanh(Wqq+WhH) (3)
in the formula (3), the reaction mixture is,representing the parameters to be trained corresponding to the sentence-level question feature vector q,
Figure FDA00022059969900000214
representing parameters to be trained corresponding to the historical dialogue features H;
step 4, the attention of the double visual characteristics is processed independently;
step 4.1, using formula (4) to determine global visual feature V ═ V1,v2,...,vm,...,vM]Performing attention processing to obtain an attention-focused global visual feature vector V',
Figure FDA00022059969900000215
V′=αV1VT (4)
in the formula (4), the reaction mixture is,
Figure FDA0002205996990000031
represents the attention distribution weight for the global visual feature V and has:
Figure FDA0002205996990000032
in the formula (5), the reaction mixture is,
Figure FDA0002205996990000033
representing sentence-level question feature vector q, attention-focused historical feature vector haAnd a similarity matrix between global visual features V
Figure FDA0002205996990000034
Representing a similarity matrix zV1And the parameters to be trained include:
zV1=tanh(Wq1q+Wh1ha+WV1V) (6)
in the formula (6), the reaction mixture is,
Figure FDA0002205996990000035
representing the parameters to be trained corresponding to the sentence-level question feature vector q,
Figure FDA0002205996990000036
historical feature vector h representing attentionaThe corresponding parameters to be trained are set to be,representing a parameter to be trained corresponding to the global visual feature V;
step 4.2, benefitUsing formula (7) to determine the local visual characteristic R ═ R1,r2,...,rk,...,rK]Performing attention processing to obtain a local visual feature vector R' concerned by attention,
Figure FDA0002205996990000038
R′=αR1RT (7)
in the formula (7), the reaction mixture is,
Figure FDA0002205996990000039
represents the attention distribution weight for the local visual feature R and has:
Figure FDA00022059969900000310
in the formula (8), the reaction mixture is,
Figure FDA00022059969900000311
representing sentence-level question feature vector q, attention-focused historical feature vector haAnd a similarity matrix between the local visual features R,
Figure FDA00022059969900000312
representing a similarity matrix zV1And the parameters to be trained include:
zR1=tanh(W′q1q+W′h1ha+WR1R) (9)
in the formula (9), the reaction mixture is,
Figure FDA00022059969900000313
representing the parameters to be trained corresponding to the sentence-level question feature vector q,
Figure FDA00022059969900000314
historical feature vector h representing attentionaThe corresponding parameters to be trained are set to be,representing a parameter to be trained corresponding to the local visual feature R;
step 5, attention processing of mutual intersection of the double visual characteristics;
step 5.1, using formula (10) to determine global visual feature V ═ V1,v2,...,vm,...,vM]Performing double-vision cross attention processing to obtain a global visual feature vector V' of further attention,
Figure FDA00022059969900000316
V″=αV2VT (10)
in the formula (10), the compound represented by the formula (10),
Figure FDA0002205996990000041
represents a further attention distribution weight for the global visual feature V and has:
Figure FDA0002205996990000042
in the formula (11), the reaction mixture is,
Figure FDA0002205996990000043
representing sentence-level question feature vector q, attention-focused historical feature vector haA local visual feature vector R' of attention and a similarity matrix between the global visual features V,
Figure FDA0002205996990000044
representing a similarity matrix zV2And the parameters to be trained include:
zV2=tanh(Wq2q+Wh2ha+WR2R′+WV2V) (12)
in the formula (12), the reaction mixture is,
Figure FDA0002205996990000045
representing the parameters to be trained corresponding to the sentence-level question feature vector q,
Figure FDA0002205996990000046
historical feature vector h representing attentionaThe corresponding parameters to be trained are set to be,
Figure FDA0002205996990000047
the local visual feature vector R' representing attention corresponds to a parameter to be trained,
Figure FDA0002205996990000048
representing a parameter to be trained corresponding to the global visual feature V;
step 5.2, using formula (13) to compare the local visual characteristics R ═ R1,r2,...,rk,...,rK]Performing double vision cross attention processing to obtain a local vision characteristic vector R' of further attention,
Figure FDA0002205996990000049
R″=αR2RT (13)
in the formula (13), the reaction mixture is,
Figure FDA00022059969900000410
represents further attention distribution weights for the local visual features R and has:
Figure FDA00022059969900000411
in the formula (14), the compound represented by the formula (I),
Figure FDA00022059969900000412
representing sentence-level question feature vector q, attention-focused historical feature vector haAll visual feature vectors of attentionV' and the similarity matrix between the local visual features R,representing a similarity matrix zR2And the parameters to be trained include:
zR2=tanh(W′q2q+W′h2ha+W′V2V′+W′R2R) (15)
in the formula (15), the reaction mixture is,
Figure FDA00022059969900000414
representing the parameters to be trained corresponding to the sentence-level question feature vector q,
Figure FDA00022059969900000415
historical feature vector h representing attentionaThe corresponding parameters to be trained are set to be,the parameters to be trained corresponding to all visual feature vectors V' of attention,
Figure FDA00022059969900000417
representing a parameter to be trained corresponding to the local visual feature R;
step 6, optimizing visual characteristics;
step 6.1, performing word-level attention processing on the current question Q by using the formula (16) to obtain a word-level question feature vector Q concerned by attentions
qs=αqQT (16)
In the formula (16), the compound represented by the formula,
Figure FDA0002205996990000051
representing the attention distribution to the current question QWeights, and have:
Figure FDA0002205996990000052
in the formula (14), the compound represented by the formula (I),a self-attention semantic matrix representing the current question Q,
Figure FDA0002205996990000054
representing a self-attention semantic matrix zQAnd the parameters to be trained include:
zQ=tanh(WQQ) (18)
in the formula (18), the reaction mixture,representing a parameter to be trained corresponding to the current question Q during word level attention processing;
step 6.2, respectively carrying out optimization processing on the global visual feature vector V 'and the local visual feature vector R' which are focused by further attention by using the formulas (19) and (20), and obtaining a final global visual feature vector
Figure FDA0002205996990000056
And local visual feature vector
Figure FDA0002205996990000058
Figure FDA0002205996990000059
Figure FDA00022059969900000510
Figure FDA00022059969900000511
In the formulae (19) and (20),
Figure FDA00022059969900000512
word-level problem feature vector q in visual feature optimizationsThe corresponding parameter to be trained, an, indicates a dot product operation;
step 7, generating the answer feature sequence by multi-modal semantic fusion and decoding;
step 7.1, concatenating the attended word-level question feature vector q_s, the attended history feature vector h_a, the optimized global visual feature vector \hat{V} and the optimized local visual feature vector \hat{R} to obtain the multi-modal feature vector e_M, where d_M = 3d + d_w denotes the dimension of the multi-modal feature vector; the multi-modal feature vector e_M is then mapped by a fully connected operation to obtain the fused semantic feature vector e;
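A minimal PyTorch sketch of the multi-modal fusion in step 7.1; the class name and the absence of an activation after the fully connected layer are assumptions:

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Illustrative sketch of step 7.1: concatenate q_s, h_a and the optimized
    global/local visual vectors (d_M = 3*d + d_w), then map to the fused vector e."""

    def __init__(self, d: int, d_w: int):
        super().__init__()
        self.fc = nn.Linear(3 * d + d_w, d)  # fully connected mapping e_M -> e

    def forward(self, q_s, h_a, v_hat, r_hat):
        e_M = torch.cat([q_s, h_a, v_hat, r_hat], dim=-1)  # (B, 3d + d_w)
        e = self.fc(e_M)                                   # fused semantic feature, (B, d)
        return e
```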
step 7.2, inputting the fused semantic feature vector e into a long short-term memory network (LSTM) to obtain the hidden-state feature sequence of the predicted answer H_A = [h_{A,1}, h_{A,2}, ..., h_{A,L_3}], where h_{A,i} is the output of the LSTM at the i-th step and L_3 is the sentence length of the ground-truth answer label A_{GT};
step 7.3, mapping the hidden-state feature sequence of the predicted answer, by a fully connected operation, into a space of the same dimension as the word one-hot vector table O, so as to obtain the word vector set of the predicted answer Y = [y_1, y_2, ..., y_{L_3}], where y_i denotes the mapping vector of the i-th word in the predicted answer and the length of the vector equals the number of words in the vocabulary;
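A PyTorch sketch of the decoding in steps 7.2–7.3; feeding the fused vector e at every LSTM step (rather than only as the initial state) and the class name are assumptions:

```python
import torch
import torch.nn as nn

class AnswerDecoder(nn.Module):
    """Illustrative sketch of steps 7.2-7.3: run an LSTM for L3 steps on the fused
    vector e and project each hidden state h_{A,i} to vocabulary-sized logits y_i."""

    def __init__(self, d: int, vocab_size: int):
        super().__init__()
        self.lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)  # maps h_{A,i} into the one-hot vocabulary space

    def forward(self, e, answer_len):
        # e: (B, d) fused semantic feature; answer_len: L3
        inputs = e.unsqueeze(1).repeat(1, answer_len, 1)  # (B, L3, d)
        h_A, _ = self.lstm(inputs)                        # hidden-state sequence, (B, L3, d)
        Y = self.out(h_A)                                 # word vectors y_1..y_L3, (B, L3, vocab)
        return Y
```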
step 8, optimizing parameters of a visual dialogue generation network model based on the double-visual attention network;
step 8.1, constructing, according to the word one-hot vector table O, a vector set Y_{GT} = [y_{GT,1}, y_{GT,2}, ..., y_{GT,L_3}] from the words of the ground-truth answer label A_{GT}, where y_{GT,i} denotes the mapping vector of the i-th word of the ground-truth answer label A_{GT} and the length of the vector equals the number of words in the vocabulary;
step 8.2, calculating the loss cost E between the predicted answer and the ground-truth answer A_{GT} by using formula (21):

E = -∑_{i=1}^{L_3} y_{GT,i}^T log(softmax(y_i))   (21)
step 8.3, solving the loss cost E by using a stochastic gradient descent method so as to minimize the loss cost E, thereby obtaining the visual dialogue network model based on the double-visual attention network with optimal parameters;
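A sketch of one training step for steps 8.2–8.3, assuming PyTorch; formula (21) is expressed here through nn.CrossEntropyLoss over ground-truth word indices, which is equivalent to the one-hot cross-entropy reconstruction above, and the function and argument names are illustrative:

```python
import torch
import torch.nn as nn

def training_step(model: nn.Module, optimizer: torch.optim.Optimizer,
                  Y: torch.Tensor, gt_indices: torch.Tensor) -> float:
    """Illustrative steps 8.2-8.3: cross-entropy loss E between the predicted word
    logits Y (B, L3, vocab) and the ground-truth word indices of A_GT (B, L3),
    minimized by stochastic gradient descent."""
    loss = nn.CrossEntropyLoss()(Y.reshape(-1, Y.size(-1)), gt_indices.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage sketch, with `model` standing for the full dual-visual attention network:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```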
step 9, generating the predicted answer:
for each mapping vector y_i in the word vector set Y of the predicted answer, a greedy decoding algorithm is used to obtain the position corresponding to the maximum value of y_i, and the word at that position is looked up in the word index table Voc as the finally predicted i-th word, thereby obtaining the predicted answer corresponding to the word vector set Y; the current question Q and the predicted answer corresponding to the word vector set Y are taken as the finally generated visual dialogue.
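A short sketch of the greedy decoding in step 9, assuming Python; the dict `index_to_word` stands in for the word index table Voc:

```python
import torch

def greedy_decode(Y: torch.Tensor, index_to_word: dict) -> list:
    """Illustrative sketch of step 9: for each predicted word vector y_i, take the
    position of its maximum value and look up that position in the word index table."""
    # Y: (L3, vocab) predicted word vectors for one answer
    positions = torch.argmax(Y, dim=-1)                 # position of the maximum value of each y_i
    return [index_to_word[int(p)] for p in positions]   # the finally predicted words

# usage sketch:
# answer_words = greedy_decode(Y[0], index_to_word)
# visual_dialogue = (current_question, " ".join(answer_words))
```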
CN201910881305.0A 2019-09-18 2019-09-18 Visual conversation generation method based on double-visual attention network Pending CN110647612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910881305.0A CN110647612A (en) 2019-09-18 2019-09-18 Visual conversation generation method based on double-visual attention network

Publications (1)

Publication Number Publication Date
CN110647612A true CN110647612A (en) 2020-01-03

Family

ID=68992004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910881305.0A Pending CN110647612A (en) 2019-09-18 2019-09-18 Visual conversation generation method based on double-visual attention network

Country Status (1)

Country Link
CN (1) CN110647612A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8160883B2 (en) * 2004-01-10 2012-04-17 Microsoft Corporation Focus tracking in dialogs
CN104077419A (en) * 2014-07-18 2014-10-01 合肥工业大学 Long inquiring image searching reordering algorithm based on semantic and visual information
US20170024645A1 (en) * 2015-06-01 2017-01-26 Salesforce.Com, Inc. Dynamic Memory Network
CN108877801A (en) * 2018-06-14 2018-11-23 南京云思创智信息科技有限公司 More wheel dialog semantics based on multi-modal Emotion identification system understand subsystem
CN110209789A (en) * 2019-05-29 2019-09-06 山东大学 A kind of multi-modal dialog system and method for user's attention guidance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DAN GUO et al.: "Dual Visual Attention Network for Visual Dialog", Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967487A (en) * 2020-03-23 2020-11-20 同济大学 Incremental data enhancement method for visual question-answer model training and application
CN111967487B (en) * 2020-03-23 2022-09-20 同济大学 Incremental data enhancement method for visual question-answer model training and application
CN111967272B (en) * 2020-06-23 2023-10-31 合肥工业大学 Visual dialogue generating system based on semantic alignment
CN111967272A (en) * 2020-06-23 2020-11-20 合肥工业大学 Visual dialog generation system based on semantic alignment
CN111783475A (en) * 2020-07-28 2020-10-16 北京深睿博联科技有限责任公司 Semantic visual positioning method and device based on phrase relation propagation
CN111897939A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model
CN111897939B (en) * 2020-08-12 2024-02-02 腾讯科技(深圳)有限公司 Visual dialogue method, training method, device and equipment for visual dialogue model
CN113010712A (en) * 2021-03-04 2021-06-22 天津大学 Visual question answering method based on multi-graph fusion
CN113010712B (en) * 2021-03-04 2022-12-02 天津大学 Visual question answering method based on multi-graph fusion
CN113177112A (en) * 2021-04-25 2021-07-27 天津大学 KR product fusion multi-mode information-based neural network visual dialogue model and method
CN113761153A (en) * 2021-05-19 2021-12-07 腾讯科技(深圳)有限公司 Question and answer processing method and device based on picture, readable medium and electronic equipment
CN113761153B (en) * 2021-05-19 2023-10-24 腾讯科技(深圳)有限公司 Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113420606B (en) * 2021-05-31 2022-06-14 华南理工大学 Method for realizing autonomous navigation of robot based on natural language and machine vision
CN113420606A (en) * 2021-05-31 2021-09-21 华南理工大学 Method for realizing autonomous navigation of robot based on natural language and machine vision
CN113220859B (en) * 2021-06-01 2024-05-10 平安科技(深圳)有限公司 Question answering method and device based on image, computer equipment and storage medium
CN113220859A (en) * 2021-06-01 2021-08-06 平安科技(深圳)有限公司 Image-based question and answer method and device, computer equipment and storage medium
CN113435399A (en) * 2021-07-14 2021-09-24 电子科技大学 Multi-round visual dialogue method based on multi-level sequencing learning
CN113435399B (en) * 2021-07-14 2022-04-15 电子科技大学 Multi-round visual dialogue method based on multi-level sequencing learning
CN113515951A (en) * 2021-07-19 2021-10-19 同济大学 Story description generation method based on knowledge enhanced attention network and group-level semantics
CN113515951B (en) * 2021-07-19 2022-07-05 同济大学 Story description generation method based on knowledge enhanced attention network and group-level semantics
CN113553418A (en) * 2021-07-27 2021-10-26 天津大学 Visual dialog generation method and device based on multi-modal learning
CN114661874B (en) * 2022-03-07 2024-04-30 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels
CN114661874A (en) * 2022-03-07 2022-06-24 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels
CN114299510A (en) * 2022-03-08 2022-04-08 山东山大鸥玛软件股份有限公司 Handwritten English line recognition system
CN115277248A (en) * 2022-09-19 2022-11-01 南京聚铭网络科技有限公司 Network security alarm merging method, device and storage medium
CN116342332B (en) * 2023-05-31 2023-08-01 合肥工业大学 Auxiliary judging method, device, equipment and storage medium based on Internet
CN116342332A (en) * 2023-05-31 2023-06-27 合肥工业大学 Auxiliary judging method, device, equipment and storage medium based on Internet

Similar Documents

Publication Publication Date Title
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN111488739B (en) Implicit chapter relation identification method for generating image enhancement representation based on multiple granularities
CN112990296B (en) Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation
CN114511906A (en) Cross-modal dynamic convolution-based video multi-modal emotion recognition method and device and computer equipment
CN111597341B (en) Document-level relation extraction method, device, equipment and storage medium
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
Liu et al. Attribute-guided attention for referring expression generation and comprehension
CN114385802A (en) Common-emotion conversation generation method integrating theme prediction and emotion inference
CN113065344A (en) Cross-corpus emotion recognition method based on transfer learning and attention mechanism
CN114492441A (en) BilSTM-BiDAF named entity identification method based on machine reading understanding
CN111311364B (en) Commodity recommendation method and system based on multi-mode commodity comment analysis
CN114254645A (en) Artificial intelligence auxiliary writing system
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
CN112651225B (en) Multi-item selection machine reading understanding method based on multi-stage maximum attention
Rastgoo et al. Word separation in continuous sign language using isolated signs and post-processing
Boukdir et al. Character-level arabic text generation from sign language video using encoder–decoder model
CN112560440A (en) Deep learning-based syntax dependence method for aspect-level emotion analysis
Yigzaw et al. A Generic Approach towards Amharic Sign Language Recognition
Hossain et al. A novel approach to classify bangla sign digits using capsule network
CN112949284A (en) Text semantic similarity prediction method based on Transformer model
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
Shurid et al. Bangla sign language recognition and sentence building using deep learning
CN113779244B (en) Document emotion classification method and device, storage medium and electronic equipment
CN115391534A (en) Text emotion reason identification method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200103)