CN113177112A - KR product fusion multi-mode information-based neural network visual dialogue model and method - Google Patents
- Publication number
- CN113177112A (application number CN202110448254.XA)
- Authority
- CN
- China
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06F16/3329 — Natural language query formulation or dialogue systems
- G06F16/35 — Information retrieval of unstructured textual data: clustering; classification
- G06F18/2411 — Classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06F18/253 — Fusion techniques of extracted features
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses a neural network visual dialogue model and method based on KR-product fusion of multi-modal information. The modal feature extraction module extracts the features of the question text and of the dialogue history through an LSTM network, extracts the entity features of the picture with a Faster R-CNN network, and obtains the question-related visual features with an attention mechanism. The different-modality information fusion module captures the feature information within each modality with a late-fusion method, captures the association information between modalities with a KR-product-based feature fusion method, and fuses the intra-modality and inter-modality information. The candidate answer prediction module predicts answers from the fusion vector that combines the intra-modality and inter-modality information, so the relevant answer can be found more accurately. The method overcomes the shortcoming that the late fusion used by traditional visual dialogue models captures the association information between modalities insufficiently.
Description
Technical Field
The invention relates to the technical field of visual dialogue and multi-modal fusion, and in particular to a model and method for selecting the real answer from candidate answers given a picture, the dialogue history, and the corresponding question.
Background
Visual dialogue is a challenging task at the intersection of language and vision: it requires considering the history information of multiple dialogue rounds and the related information in the image to find the best candidate answer to the current question. Visual dialogue arises in many application scenarios, such as helping the blind understand their surroundings, interactive search, and indoor navigation. To capture answer-related information in a visual dialogue task, a model needs to understand the question, capture the visual and history information related to it, and capture the latent association information among question, vision, and history. With the rise of neural networks, early visual dialogue work encoded the question, picture, and history with three encoder methods — LF [1], HRE [1], and MN [1]; the visual and history features extracted by these methods correlate only weakly with the question, so the answers predicted after fusing the question, visual, and history vectors are not accurate enough. To predict answers more accurately, some methods use an attention mechanism to obtain the question-related visual and history features, such as Bottom-Up [2], Attention Memory [3], RVA [4], and sequential co-attention [5].
Bottom-Up [2] proposes a combined bottom-up and top-down attention mechanism: the entity features of the image are obtained through a Faster R-CNN network, and attention then selects the question-related visual features. Attention Memory [3] proposes an attention memory mechanism that uses the visual attention distributions of past dialogue rounds to compute the visual attention distribution of the current question, better selecting the question-related visual content. RVA [4] proposes a recursive visual attention mechanism that gradually refines the visual attention distribution of the current question using related visual information from the dialogue history. Sequential co-attention [5] proposes a sequential co-attention mechanism that computes the attention scores of one modality from the information of the other two, selectively focusing on the image and the history content. These methods fuse the obtained visual, history, and question features with late fusion and use the fused vector to predict the candidate answers. However, after the features of different modalities are extracted separately, fusing them with late fusion cannot capture the association information between modalities well, so the model's candidate-answer prediction is not accurate enough. To obtain more useful information, the method for fusing multi-modal information needs to be improved.
To build a better-performing visual dialogue model, the main challenge at present is: the visual dialogue task must model the image content and the dialogue history, and extract information useful for answer prediction from the question, visual, and history vectors. Answer prediction needs not only the feature information within each modality but also the association information between modalities; capturing both allows the answer to be predicted better. Therefore, how to better capture the association information between different modalities is the problem a visual dialogue model must solve.
Reference documents:
[1] Das, A., Kottur, S., et al. "Visual Dialog." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[2] Anderson, Peter, et al. "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[3] Sigal, Leonid, et al. "Visual Reference Resolution Using Attention Memory for Visual Dialog." Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 3719–3729.
[4] Niu, Yulei, et al. "Recursive Visual Attention in Visual Dialog." IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6679–6688.
[5] Wu, Qi, et al. "Are You Talking to Me? Reasoned Visual Dialog Generation Through Adversarial Learning." IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6106–6115.
Disclosure of Invention
The invention aims to overcome the above shortcomings of the prior art and provide a neural network visual dialogue model and method based on KR-product fusion of multi-modal information.
The purpose of the invention is realized by the following technical scheme:
A neural network visual dialogue model based on KR-product fusion of multi-modal information comprises a modal feature extraction module, a different-modality information fusion module, and a candidate answer prediction module.
The modal feature extraction module extracts the semantic features of the question, the visual features of the image, and the history features of the dialogue history, finally obtaining the question, visual, and history features.
The different-modality information fusion module fuses the question, visual, and history features into a final fusion vector, comprehensively considering the information of the different modalities so that the information related to the candidate answers can be captured conveniently. It comprises three steps: obtaining the feature information within each modality, extracting the association information between modalities, and fusing the intra-modality and inter-modality information.
The candidate answer prediction module predicts the candidate answers: its input is the fusion vector produced by the different-modality information fusion module, and the inner product of the fusion vector with the feature vector of each candidate answer gives that candidate answer's score.
Furthermore, the modal feature extraction module tokenizes the question text and uses the GloVe tool to learn the embedded representation of each word, each word being represented by a dense vector; an LSTM model then obtains the semantic features of the question, i.e. the question features.
A pretrained Faster R-CNN network extracts the entity features of the image, each entity represented by a dense vector; an attention mechanism obtains the question-related attention distribution, and the weighted sum of the entity features gives the visual features of the image.
The dialogue history is concatenated and tokenized, the GloVe tool provides the word embeddings, and an LSTM model obtains the history features of the dialogue history.
Further, in the method for obtaining the feature information within each modality, the question, visual, and history features are first concatenated and then fused through a fully connected layer, finally yielding the intra-modality feature information.
The method for extracting the association information between modalities obtains two interaction matrices through two KR-product calculations. In the first KR-product calculation, the question, visual, and history features are stacked along dimension 0 into a feature matrix, and the column-wise KR product of the feature matrix with itself generates a third-order tensor; a Mask calculation is applied to the third-order tensor, and a convolution operation captures its information to obtain the interacted feature matrix. The second KR-product calculation takes the column-wise KR product of the interacted feature matrix and the original feature matrix and applies a convolution operation to the resulting third-order tensor, generating a second interacted feature matrix. The two interaction matrices are then concatenated and compressed to obtain the final features, which contain the association information between modalities.
Finally, the intra-modality feature information obtained by late fusion and the inter-modality association information obtained by KR-product fusion are concatenated into the final fusion vector, which therefore contains both the intra-modality feature information and the inter-modality association information.
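The column-wise KR product at the heart of this fusion can be illustrated with a minimal plain-Python sketch; the 3×4 feature matrix and its values are hypothetical stand-ins for the real question, visual, and history feature vectors:

```python
# Column-wise KR (Khatri-Rao) product: for each column j, take the outer
# product of the j-th columns of A and B; stacking the outer products over
# all columns yields a third-order tensor of shape (rows_A, rows_B, D_s).

def kr_product(A, B):
    rows_a, rows_b, d = len(A), len(B), len(A[0])
    assert len(B[0]) == d, "A and B need the same number of columns"
    # tensor[i][k][j] = A[i][j] * B[k][j]
    return [[[A[i][j] * B[k][j] for j in range(d)]
             for k in range(rows_b)]
            for i in range(rows_a)]

# Stack question, visual and history features (toy 4-dimensional space).
X0 = [[1.0, 2.0, 0.5, 1.5],   # question feature q_t
      [0.5, 1.0, 2.0, 0.5],   # visual feature v_t
      [2.0, 0.5, 1.0, 1.0]]   # history feature h_t

T1 = kr_product(X0, X0)       # shape (3, 3, 4): pairwise modality combinations
```

Because the same matrix appears on both sides, each slice of `T1` is symmetric — which is exactly the redundancy the Mask mechanism described above removes.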
Furthermore, the candidate answer prediction module scores the multiple candidate answers, computes the loss against the real answer, and continuously updates the parameters of the neural network visual dialogue model with the back-propagation algorithm, finally obtaining the score of each candidate answer.
The invention also provides a neural network visual dialogue method based on KR-product fusion of multi-modal information, which, based on the above neural network visual dialogue model, comprises the following steps: capture the feature information within each modality with the late-fusion method; capture the association information between modalities with the KR-product calculation; and finally concatenate the intra-modality feature information and the inter-modality association information into the final fusion vector. The late-fusion method concatenates the captured question, visual, and history features and fuses them through a linear layer.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. A different-modality information fusion module based on the KR product is provided to capture the association information between modalities. Most previous related work focuses on separately capturing the visual and history features related to the current question and then fusing the visual, question, and history features with a late-fusion method; late fusion, however, does not deeply consider the association between modalities. The KR-product-based method combines the information of the different modalities by tensor-product calculation and captures the association information between them.
2. A redundant-feature reduction method based on a Mask mechanism is provided on top of the KR product. In the first layer of KR-product feature combination, because the input feature matrices are identical and the tensor calculation is column-wise, the resulting matrices are symmetric and contain repeated combination information. To reduce the redundant information generated when combining the information of different modalities, the repeated combinations are deleted with a Mask mechanism. With the redundant features removed from the first layer of combined features, the second layer of feature combination can capture the association information between modalities more accurately.
3. The neural network visual dialogue model captures the intra-modality feature information with a late-fusion method and the inter-modality association information with the combination of KR product, Mask mechanism, and convolutional neural network, then fuses the two parts by concatenation. The KR-product calculation requires mapping the visual, question, and history features into the same vector space before capturing the association information between modalities; this captures the latent association information between modalities but loses the feature information within each modality. To retain the intra-modality feature information, it is captured by late fusion and then concatenated with the inter-modality association information captured by the KR-based method, so that the final fusion vector contains both. Compared with related work, combining the inter-modality association information in this way improves the answer-prediction results.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a diagram of a visual dialog model based on KR product fusion multimodal information.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A neural network visual dialogue model based on KR-product fusion of multi-modal information comprises a modal feature extraction module, a different-modality information fusion module, and a candidate answer prediction module.
the modal feature extraction module is used for extracting semantic features of the problem, visual features of the image and historical features of historical conversations. Firstly, obtaining vector representation of problems through an LSTM network, obtaining a group of entity characteristic vectors of an image by using an Faster R-CNN network, regarding historical conversation information as a whole or regarding each piece of conversation content as a whole, and obtaining vector representation of conversation history through the LSTM network; then, the visual attention distribution and the historical attention distribution related to the problem are calculated through an attention mechanism, and a visual vector and a historical vector are obtained through a weighted summation method.
The different-modality information fusion module fuses the question, visual, and history features. The question, visual, and history vectors come from the modal feature extraction module; a late-fusion method fuses their information to capture the feature information within each modality; a KR-product-based multi-modal information fusion method combines the question, visual, and history vectors and captures the association information between modalities; finally, the intra-modality and inter-modality information is concatenated into the final fusion vector.
The candidate answer prediction module obtains the answer vectors of the 100 candidate answers through an LSTM network and computes their inner products with the fusion vector produced by the different-modality information fusion module, obtaining the scores of the 100 candidate answers.
Fig. 1 shows the flow of the proposed visual dialogue method; Fig. 2 shows the neural network visual dialogue model designed in this embodiment. The specific operation steps are as follows:
(1) Traverse all questions in the VisDial v1.0 dataset (downloaded from the official visual dialogue website, https://visualdialog.org/data), determine the fixed length of the input questions (generally the maximum sentence length in the dataset), and pad sentences shorter than the fixed length with 0s, so that all sentences have the same fixed length, which facilitates the calculation of the neural network and the attention mechanism.
(2) Traverse all answers in the dataset, determine the fixed length of the answer text, and pad sentences shorter than the fixed length with 0s, so that all sentences have the same fixed length, which facilitates the calculation of the neural network.
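The fixed-length padding described in these two steps can be sketched in plain Python; the token ids and the fixed length are hypothetical:

```python
# Pad tokenized sentences with 0s up to a fixed length (the maximum sentence
# length in the dataset), so every input to the network has the same shape.

def pad_sequences(seqs, max_len, pad_id=0):
    padded = []
    for s in seqs:
        s = s[:max_len]                               # truncate if longer
        padded.append(s + [pad_id] * (max_len - len(s)))
    return padded

batch = pad_sequences([[4, 8, 15], [16, 23]], max_len=5)
```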
(3) Use the GloVe tool to obtain a 300-dimensional word vector $w_i$ for each word and construct the word vector matrix $E = [w_1, \dots, w_n]$, each row of which represents a word. This matrix is used directly as the word vector matrix in the model initialization stage and is set to be trainable, so the word vectors are optimized during model training.
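A minimal sketch of the word-vector matrix lookup; the vocabulary, dimensions, and values are hypothetical (the real matrix holds 300-dimensional GloVe vectors and is optimized during training):

```python
# Build a word-vector matrix E = [w_1, ..., w_n] and look tokens up by
# row index; 300 dimensions are shrunk to 4 here for readability.

vocab = {"<pad>": 0, "dog": 1, "ball": 2}
E = [[0.0, 0.0, 0.0, 0.0],          # one row per word; row 0 is the pad vector
     [0.1, -0.2, 0.3, 0.0],
     [0.5, 0.4, -0.1, 0.2]]

def embed(token_ids):
    return [E[t] for t in token_ids]

sentence = embed([vocab["dog"], vocab["ball"], vocab["<pad>"]])
```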
(4) Extract the feature vectors of each picture with the pretrained Faster R-CNN network, generating a 36 × 2048 matrix per picture, where 36 is the number of entities identified in the picture and 2048 is the length of each entity feature vector.
(5) The question vector $q_t$ of round t is obtained through an LSTM network. The history information consists of the picture caption and the question–answer pairs of each previous round; the caption is concatenated with all the question–answer pairs, and the history vector $h_t$ is obtained through an LSTM network.
(6) The question-related visual vector $v_t$ is obtained with an attention mechanism. The calculation is:

$$z_i = \mathrm{dropout}(W_1 v_i^e) \odot q_t,\qquad \alpha = \mathrm{softmax}(Z W_2),\qquad v_t = \sum_{i=1}^{36} \alpha_i\, v_i^e$$

where $v_i^e \in R^{2048}$ denotes an entity vector of the picture, the matrix $W_1 \in R^{512\times2048}$, the vector $q_t \in R^{512}$ is the question feature of round t, $\mathrm{dropout}(\cdot)$ discards entries of a vector with a certain probability, $\odot$ denotes element-wise multiplication, the matrix $W_2 \in R^{512\times1}$, $\mathrm{softmax}(\cdot)$ computes the correlation between each entity feature and the question feature, and the vector $v_t \in R^{2048}$ is the question-related visual feature of round t, obtained as the weighted sum of the entity features of the image.
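The attention step can be sketched in plain Python. For brevity the projection and dropout are collapsed into pre-computed scalar logits, so this shows only the softmax weighting and the weighted sum; all values are hypothetical:

```python
import math

# Question-guided attention over entity features: normalize per-entity
# relevance logits with softmax, then take the weighted sum of the entity
# vectors to get the question-related visual feature v_t.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(entities, logits):
    # logits[i] stands in for the scalar (dropout(W1 v_i) ⊙ q_t) W2
    alpha = softmax(logits)
    dim = len(entities[0])
    return [sum(alpha[i] * entities[i][d] for i in range(len(entities)))
            for d in range(dim)]

entities = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 toy entity vectors
v_t = attend(entities, [2.0, 0.5, 1.0])           # entity 0 gets most weight
```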
(7) The obtained question, visual, and history vectors are fused with the late-fusion method to obtain the vector $f_t$ containing the intra-modality feature information. The calculation is:

$$f_t = W_3^{\top}\,[q_t; v_t; h_t]$$

where $[\cdot\,;\cdot]$ denotes the concatenation operation, the vector $[q_t; v_t; h_t] \in R^{3072}$, the matrix $W_3 \in R^{3072\times512}$, and the vector $f_t \in R^{512}$.
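A minimal sketch of the late-fusion step, with toy dimensions standing in for $R^{3072} \to R^{512}$ and hypothetical weights:

```python
# Late fusion: concatenate q_t, v_t, h_t and pass the result through a
# linear layer (f_t = W3^T [q; v; h]).

def concat(*vecs):
    out = []
    for v in vecs:
        out.extend(v)
    return out

def linear(x, W):   # W has len(x) rows and output-dim columns
    cols = len(W[0])
    return [sum(x[i] * W[i][c] for i in range(len(x))) for c in range(cols)]

q_t, v_t, h_t = [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]
x = concat(q_t, v_t, h_t)                                # R^6 stand-in for R^3072
W3 = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]    # 6x2 stand-in for 3072x512
f_t = linear(x, W3)
```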
(8) The KR-product-based multi-modal information fusion method obtains the association information among the question, visual, and history vectors, yielding the vector $g_t$ containing the inter-modality association information. First, the question, visual, and history vectors are mapped into the same vector space and stacked along dimension 0 into the feature matrix $X_0 \in R^{3\times D_s}$. The column-wise KR product of $X_0$ with itself gives the third-order tensor $T^{(1)} \in R^{3\times3\times D_s}$; a Mask mechanism sets the lower-triangular entries of each matrix slice of the tensor to 0; a convolutional neural network then captures the combination information among the features, producing the feature matrix $X_1 \in R^{M\times D_s}$. Taking the columns of $X_1$ and $X_0$, their column-wise KR product gives the third-order tensor $T^{(2)}$, from which a convolutional neural network produces the feature matrix $X_2 \in R^{N\times D_s}$. Finally, $X_1$ and $X_2$ are each compressed along dimension 1 by average pooling into the vectors $g_t^{(1)} \in R^M$ and $g_t^{(2)} \in R^N$, which are concatenated into the final vector $g_t$. The calculation is:

$$T^{(1)}_j = W_{\mathrm{Mask}} \odot \big(x^0_j \otimes x^0_j\big), \qquad X_1[i][j] = \mathrm{sum}\big(W^4_i \odot T^{(1)}_j\big)$$
$$T^{(2)}_j = x^1_j \otimes x^0_j, \qquad X_2[i][j] = \mathrm{sum}\big(W^5_i \odot T^{(2)}_j\big)$$
$$g_t^{(1)}[i] = \frac{1}{D_s}\sum_{j=1}^{D_s} X_1[i][j], \qquad g_t^{(2)}[i] = \frac{1}{D_s}\sum_{j=1}^{D_s} X_2[i][j], \qquad g_t = \big[g_t^{(1)}; g_t^{(2)}\big]$$

where $W_{\mathrm{Mask}} \in R^{3\times3}$ is a mask matrix whose upper-triangular values are 1 and remaining values are 0, $\odot$ denotes element-wise multiplication, $\otimes$ denotes the tensor operation (mathematically, the outer product), $D_s$ is the feature dimension, $x^0_j$ and $x^1_j$ are the j-th column vectors of $X_0$ and $X_1$, $T^{(1)}_j$ and $T^{(2)}_j$ are the j-th matrix slices of the third-order tensors, $W^4_i \in R^{3\times3}$ and $W^5_i \in R^{M\times3}$ are the i-th convolution kernels of the first and second layers, $\mathrm{sum}(\cdot)$ adds all values of a matrix, $X_1[i][j]$ and $X_2[i][j]$ are scalars (the j-th value of the i-th row), the averages over $j$ implement the average pooling, and M and N are the numbers of convolution kernels in the first and second layers, respectively.
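The symmetry that the Mask removes can be seen in a small sketch: the self-KR product of $X_0$ satisfies slice[i][k] == slice[k][i], so zeroing the lower triangle drops only repeated combinations (values hypothetical):

```python
# Self-KR product of X0 followed by the Mask step that zeroes the
# lower-triangular entries of every slice, removing duplicate pairwise
# modality combinations.

def kr_self(X):
    n, d = len(X), len(X[0])
    return [[[X[i][j] * X[k][j] for j in range(d)]
             for k in range(n)] for i in range(n)]

def mask_lower(T):
    n, d = len(T), len(T[0][0])
    # keep entries with k >= i (upper triangle incl. diagonal), zero the rest
    return [[[T[i][k][j] if k >= i else 0.0 for j in range(d)]
             for k in range(n)] for i in range(n)]

X0 = [[1.0, 2.0],   # question (toy 2-dim features)
      [3.0, 4.0],   # visual
      [0.5, 1.0]]   # history

T = mask_lower(kr_self(X0))
```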
(9) The outputs of steps (7) and (8) are concatenated into the fusion vector $e_t = [f_t; g_t]$. The vector $a_i$ of each candidate answer is obtained through an LSTM network, and the candidate answer module computes the scores of the 100 candidate answers as follows:

$$\mathrm{score}_i = \mathrm{sum}\big(e_t \odot a_i\big)$$

where the vector $e_t$ is obtained by concatenation, the vector $a_i$ denotes the feature of one of the 100 candidate answers, $\odot$ denotes element-wise multiplication, $\mathrm{sum}(\cdot)$ adds all values of the vector, and $\mathrm{score}_i$ is a scalar, the relevance score of the i-th candidate answer.
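The scoring step reduces to an inner product per candidate; a sketch with 3 hypothetical candidates standing in for the dataset's 100:

```python
# Score each candidate answer by the inner product of the fusion vector
# with the answer's feature vector, then rank the candidates.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

fusion = [0.5, 1.0, -0.5]          # toy stand-in for e_t
candidates = [[1.0, 0.0, 0.0],     # answer 0
              [0.0, 1.0, 0.0],     # answer 1
              [0.0, 0.0, 1.0]]     # answer 2

scores = [dot(fusion, a) for a in candidates]
best = max(range(len(scores)), key=lambda i: scores[i])
```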
The cross entropy between the candidate answer scores and the real answer is computed as the loss, and the model parameters are then continuously updated by back propagation.
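A sketch of the cross-entropy loss over candidate scores (the real model backpropagates through this loss; the scores and ground-truth index here are hypothetical):

```python
import math

# Cross entropy between the softmax of the candidate scores and the index
# of the ground-truth answer.

def cross_entropy(scores, true_idx):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[true_idx])

loss = cross_entropy([2.0, 0.5, 1.0], true_idx=0)
```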
(10) Finally, the method is applied to the RVA and LF models and trained and tested on the VisDial v1.0 dataset; the evaluation metrics are R@1, R@5, R@10, Mean (mean rank), MRR (mean reciprocal rank), and NDCG (normalized discounted cumulative gain). The experimental results in Table 1 show that obtaining multi-modal information by late fusion alone captures the association information between modalities insufficiently. The invention's KR-product-based multi-modal information fusion method captures the association information between modalities and concatenates it with the intra-modality feature information into the final fusion vector; the experiments show that capturing the inter-modality association information improves the overall performance of the visual dialogue model.
TABLE 1 comparison of results for models on VisDial v1.0 dataset
The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (5)
1. A neural network visual dialogue model based on KR-product fusion of multi-modal information, characterized by comprising a modal feature extraction module, a different-modality information fusion module, and a candidate answer prediction module;
the modal feature extraction module extracts the semantic features of the question, the visual features of the image, and the history features of the dialogue history, finally obtaining the question, visual, and history features;
the different-modality information fusion module fuses the question, visual, and history features into a final fusion vector, comprehensively considering the information of the different modalities so that the information related to the candidate answers can be captured conveniently; it comprises three steps: obtaining the feature information within each modality, extracting the association information between modalities, and fusing the intra-modality and inter-modality information;
the candidate answer prediction module predicts the candidate answers: its input is the fusion vector obtained by the different-modality information fusion module, and the inner product of the fusion vector with the feature vector of each candidate answer gives that candidate answer's score.
2. The neural network visual dialogue model based on KR-product fusion of multi-modal information of claim 1, wherein the modal feature extraction module segments the question text into words, learns an embedded representation of each word with the GloVe tool under a word2vec model so that each word is represented by a dense vector, and then obtains the semantic features of the question, i.e. the question features, with an LSTM model;
it extracts entity features from the image with a pre-trained Faster R-CNN network, each entity being represented by a dense vector, obtains a question-related attention distribution through an attention mechanism, and takes the attention-weighted sum of the entity features as the visual features of the image;
it concatenates the historical dialogue information, segments it into words, learns an embedded representation of each word with the GloVe tool under a word2vec model, and obtains the history features of the dialogue history with an LSTM model.
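The question-guided attention over entity features in claim 2 can be sketched as follows. The bilinear scoring form (entity^T W question) is an assumption for illustration; the claim only specifies that an attention mechanism produces a question-related distribution used for a weighted sum:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def question_guided_visual_feature(entity_feats, question_feat, W):
    """Question-guided attention over detected-entity features:
    score each entity against the question, normalise the scores into
    an attention distribution, and return the attention-weighted sum
    of the entity features as the visual feature of the image."""
    scores = entity_feats @ W @ question_feat   # (num_entities,)
    alpha = softmax(scores)                     # attention weights
    return alpha @ entity_feats                 # weighted sum -> (d_v,)

entities = np.eye(3)                  # 3 toy entity features in R^3
question = np.array([1.0, 0.0, 0.0])  # toy question feature
visual = question_guided_visual_feature(entities, question, np.eye(3))
```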
3. The neural network visual dialogue model based on KR-product fusion of multi-modal information of claim 1, wherein, to obtain the intra-modal feature information, the question features, visual features and history features are concatenated and then fused through a fully connected layer;
the inter-modal correlation information is extracted through two KR-product computations that yield two interaction matrices: in the first, the question features, visual features and history features are concatenated along dimension 0 into a feature matrix, and the column-wise KR product of this feature matrix with itself generates a third-order tensor; a mask operation is applied to the tensor, and a convolution operation captures its information to produce an interacted feature matrix; in the second, the column-wise KR product of the interacted feature matrix and the original feature matrix generates another third-order tensor, on which a convolution operation again produces an interacted feature matrix; the two interaction matrices are then concatenated and compressed into final features that contain the inter-modal correlation information;
finally, the intra-modal feature information obtained by late fusion and the inter-modal correlation information obtained by KR-product fusion are concatenated into the final fusion vector, which thus contains both the intra-modal feature information and the inter-modal correlation information.
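The KR (Khatri-Rao, i.e. column-wise Kronecker) product at the heart of claim 3 can be sketched as follows; keeping the per-column outer products as tensor slices, instead of flattening them into a matrix, gives the third-order interaction tensor the claim describes. The mask and convolution steps are omitted, and the toy matrices are assumptions:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product: for A (m, k) and
    B (n, k), column j of the result is kron(A[:, j], B[:, j])."""
    m, k = A.shape
    n, k2 = B.shape
    assert k == k2, "operands must have the same number of columns"
    return np.einsum("ik,jk->ijk", A, B).reshape(m * n, k)

def interaction_tensor(F):
    """Stack the column-wise outer products of a feature matrix F
    (d, k) into the third-order tensor (k, d, d) of claim 3, on which
    the mask and convolution operations would then act."""
    return np.einsum("ik,jk->kij", F, F)

A = np.array([[1.0], [2.0]])
B = np.array([[3.0], [4.0]])
KR = khatri_rao(A, B)   # single column: kron([1, 2], [3, 4])
T = interaction_tensor(np.array([[1.0, 0.0], [2.0, 1.0]]))
```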
4. The neural network visual dialogue model based on KR-product fusion of multi-modal information of claim 1, wherein the candidate answer prediction module scores each of a plurality of candidate answers, computes the loss against the ground-truth answer, and iteratively updates the parameters of the neural network visual dialogue model through a back-propagation algorithm, finally obtaining the score corresponding to each candidate answer.
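Claim 4 only says the loss is computed against the ground-truth answer; a softmax cross-entropy over the candidate scores, as is common for discriminative visual-dialogue decoders, is one assumed instantiation:

```python
import numpy as np

def candidate_loss(scores, gt_index):
    """Softmax cross-entropy between the candidate-answer scores and
    the index of the ground-truth answer. The cross-entropy form is
    an assumption; claim 4 only specifies that a loss is computed."""
    shifted = scores - scores.max()            # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[gt_index]

# Toy scores for three candidates, ground truth at index 0.
loss = candidate_loss(np.array([2.0, 0.5, -1.0]), gt_index=0)
```

The gradient of this loss with respect to the fusion vector is what back-propagation uses to update the model parameters.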
5. A neural network visual dialogue method based on KR-product fusion of multi-modal information, using the above neural network visual dialogue model, characterized by comprising the following steps: capturing the intra-modal feature information through a late-fusion method; capturing the inter-modal correlation information through KR-product computation; and concatenating the intra-modal feature information with the inter-modal correlation information to obtain the final fusion vector; the late-fusion method concatenates the captured question features, visual features and history features and fuses them through a linear layer.
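The late-fusion step of the method (concatenate, then fuse through one linear layer) can be sketched as follows; the tanh nonlinearity and all dimensions are illustrative assumptions:

```python
import numpy as np

def late_fusion(q, v, h, W, b):
    """Late fusion as in claim 5: concatenate the question, visual
    and history features, then fuse them through a single linear
    layer to obtain the intra-modal feature information."""
    x = np.concatenate([q, v, h])
    return np.tanh(W @ x + b)   # tanh is an assumed nonlinearity

rng = np.random.default_rng(1)
q, v, h = (rng.standard_normal(4) for _ in range(3))
W, b = rng.standard_normal((6, 12)), rng.standard_normal(6)
intra = late_fusion(q, v, h, W, b)   # intra-modal feature information
```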
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110448254.XA CN113177112B (en) | 2021-04-25 | 2021-04-25 | Neural network visual conversation device and method based on KR product fusion multi-mode information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113177112A true CN113177112A (en) | 2021-07-27 |
CN113177112B CN113177112B (en) | 2022-07-01 |
Family
ID=76925652
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110448254.XA Expired - Fee Related CN113177112B (en) | 2021-04-25 | 2021-04-25 | Neural network visual conversation device and method based on KR product fusion multi-mode information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113177112B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113868451A (en) * | 2021-09-02 | 2021-12-31 | 天津大学 | Cross-modal social network conversation method and device based on context cascade perception |
CN114357968A (en) * | 2021-11-29 | 2022-04-15 | 阿里巴巴达摩院(杭州)科技有限公司 | Conversation method and device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408624A (en) * | 2018-11-06 | 2019-03-01 | 江西师范大学 | A kind of vision chat robots session generation method |
CN110377710A (en) * | 2019-06-17 | 2019-10-25 | 杭州电子科技大学 | A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion |
CN110647612A (en) * | 2019-09-18 | 2020-01-03 | 合肥工业大学 | Visual conversation generation method based on double-visual attention network |
US10769374B1 (en) * | 2019-04-24 | 2020-09-08 | Honghui CHEN | Answer selection method for question answering system and the system |
CN111859005A (en) * | 2020-07-01 | 2020-10-30 | 江西理工大学 | Cross-layer multi-model feature fusion and image description method based on convolutional decoding |
CN111897940A (en) * | 2020-08-12 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Visual dialogue method, training device and training equipment of visual dialogue model |
CN111897939A (en) * | 2020-08-12 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Visual dialogue method, training device and training equipment of visual dialogue model |
Non-Patent Citations (3)
Title |
---|
QI WU: "Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition * |
LIAO Wenxiong: "Research on a generative question answering system based on multi-modal information fusion", Wanfang Data Knowledge Service Platform * |
WANG Qiang: "Research on visual dialogue methods based on visual semantic representation", Wanfang Data Knowledge Service Platform * |
Also Published As
Publication number | Publication date |
---|---|
CN113177112B (en) | 2022-07-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220180202A1 (en) | Text processing model training method, and text processing method and apparatus | |
CN110737801B (en) | Content classification method, apparatus, computer device, and storage medium | |
JP6351689B2 (en) | Attention based configurable convolutional neural network (ABC-CNN) system and method for visual question answering | |
US11776269B2 (en) | Action classification in video clips using attention-based neural networks | |
CN111460121B (en) | Visual semantic conversation method and system | |
CN115329779B (en) | Multi-person dialogue emotion recognition method | |
CN109766557B (en) | Emotion analysis method and device, storage medium and terminal equipment | |
CN110704601A (en) | Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network | |
CN113297370B (en) | End-to-end multi-modal question-answering method and system based on multi-interaction attention | |
CN113792177B (en) | Scene character visual question-answering method based on knowledge-guided deep attention network | |
CN113177112B (en) | Neural network visual conversation device and method based on KR product fusion multi-mode information | |
CN115223020B (en) | Image processing method, apparatus, device, storage medium, and computer program product | |
CN114339450B (en) | Video comment generation method, system, device and storage medium | |
CN111597341A (en) | Document level relation extraction method, device, equipment and storage medium | |
CN110705490A (en) | Visual emotion recognition method | |
CN116385937A (en) | Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework | |
Huang et al. | Learning long-term temporal contexts using skip RNN for continuous emotion recognition | |
CN112560440A (en) | Deep learning-based syntax dependence method for aspect-level emotion analysis | |
CN114168769B (en) | Visual question-answering method based on GAT relation reasoning | |
CN115906861A (en) | Statement emotion analysis method and device based on interaction aspect information fusion | |
CN112765955B (en) | Cross-modal instance segmentation method under Chinese finger representation | |
Fenghour et al. | Contour mapping for speaker-independent lip reading system | |
CN116824461B (en) | Question understanding guiding video question answering method and system | |
CN117521674B (en) | Method, device, computer equipment and storage medium for generating countermeasure information | |
US11526781B2 (en) | Automatic sentence inferencing network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20220701 |