CN113177112A - Neural network visual dialogue model and method based on KR-product fusion of multi-modal information - Google Patents

Neural network visual dialogue model and method based on KR-product fusion of multi-modal information

Info

Publication number
CN113177112A
Authority
CN
China
Prior art keywords
information
fusion
visual
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110448254.XA
Other languages
Chinese (zh)
Other versions
CN113177112B (en)
Inventor
骆克
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202110448254.XA
Publication of CN113177112A
Application granted
Publication of CN113177112B
Expired - Fee Related

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a neural network visual dialogue model and method based on KR-product fusion of multi-modal information. The modal feature extraction module extracts the features of the question text and of the history information through an LSTM network, extracts the entity features of the picture with a Faster R-CNN network, and uses an attention mechanism to extract the visual features related to the question; the different-modality information fusion module captures the feature information within each modality with a late fusion method, captures the association information between different modalities with a KR-product-based feature fusion method, and fuses the intra-modality information with the inter-modality information; the candidate answer prediction module predicts the answer with a fusion vector that combines the intra-modality and inter-modality information, so that the relevant answer can be found more accurately. The method overcomes the shortcoming of traditional visual dialogue models, in which late fusion does not adequately capture the association information between different modalities.

Description

Neural network visual dialogue model and method based on KR-product fusion of multi-modal information
Technical Field
The invention relates to the technical field of visual dialogue and multi-modal fusion, and in particular to a model and a method for identifying the true answer among candidate answers given a picture, historical dialogue information and a corresponding question.
Background
Visual dialogue is a challenging task at the intersection of language and vision: it requires considering the history of multiple dialogue rounds together with the relevant information in the image to find the best candidate answer to the current question. Visual dialogue arises in many application scenarios, such as helping the blind understand their surroundings, interactive search and indoor navigation. In the visual dialogue task, in order to capture the information relevant to the answer, a model needs to understand the question, capture the visual and history information related to the question, and capture the latent association information among the question, the vision and the history. With the rise of neural networks, early visual dialogue work encoded the question, the picture and the history information with three encoder methods, LF [1], HRE [1] and MN [1]; the visual and history features extracted in this way are not strongly related to the question, so the answers predicted after fusing the question vector, the visual vector and the history vector are not accurate enough. To predict answers more accurately, some methods use an attention mechanism to obtain the visual and history features related to the question, such as Bottom-Up [2], Attention Memory [3], RvA [4] and sequential co-attention [5]. Bottom-Up proposes a combined bottom-up and top-down attention mechanism; for the visual dialogue task, the entity features in the image are obtained through a Faster R-CNN network and the attention mechanism is used to obtain the visual features related to the question. Attention Memory proposes an attention memory mechanism that uses the visual attention distributions of past dialogue rounds to compute the visual attention distribution related to the current question, so as to better select the visual content related to the question. RvA proposes a recursive visual attention mechanism that gradually refines the visual attention distribution related to the current question using the relevant visual information in the dialogue history. Sequential co-attention proposes a sequential co-attention mechanism in which the information of two modalities is used to compute the attention scores of the third, selectively focusing on the image and the content of the dialogue history. These methods perform late fusion on the acquired visual features, history features and question features and predict the candidate answers with the fused vector. However, after the features of the different modalities are extracted separately, fusing them by late fusion cannot capture the association information between the different modalities well, so the model is not accurate enough when predicting candidate answers. To obtain more useful information, the way in which the information of the different modalities is fused needs to be improved.
To achieve a better-performing visual dialogue model, the main challenge at present is as follows: the visual dialogue task needs to model the image content and the dialogue history and to obtain the information useful for answer prediction from the question vector, the visual vector and the history vector. Answer prediction needs to attend not only to the feature information within each modality but also to the association information between the modalities; capturing both the intra-modality feature information and the inter-modality association information allows the answer to be predicted better. Therefore, how to better capture the association information between the different modalities is the problem that the visual dialogue model needs to solve.
References:
[1] Das, A., Kottur, S., et al. "Visual Dialog." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[2] Anderson, Peter, et al. "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[3] Sigal, Leonid, et al. "Visual Reference Resolution Using Attention Memory for Visual Dialog." Advances in Neural Information Processing Systems, vol. 30, 2018, pp. 3719–3729.
[4] Niu, Yulei, et al. "Recursive Visual Attention in Visual Dialog." IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6679–6688.
[5] Wu, Qi, et al. "Are You Talking to Me? Reasoned Visual Dialog Generation Through Adversarial Learning." IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6106–6115.
Disclosure of the Invention
The invention aims to overcome the above deficiencies of the prior art and to provide a neural network visual dialogue model and method based on KR-product fusion of multi-modal information.
The purpose of the invention is achieved through the following technical scheme:
a neural network visual dialogue model based on KR product fusion multi-modal information comprises a modal feature extraction module, a different modal information fusion module and a candidate answer prediction module;
the modal feature extraction module extracts semantic features of the problem, visual features of the image and historical features of historical conversation; finally, obtaining problem characteristics, visual characteristics and historical characteristics;
the different-modality information fusion module is used for fusing the problem features, the visual features and the historical features to obtain a final fusion vector, and comprehensively considering the information of different modalities, so that the information related to the candidate answers can be conveniently captured; the method specifically comprises three steps of obtaining characteristic information in different modes, extracting correlation information among the different modes and fusing information between the modes;
the candidate answer prediction module is used for predicting candidate answers, the input of the candidate answer prediction module is fusion vectors obtained by the different-mode information fusion module, and then inner product calculation is carried out on the fusion vectors and the feature vectors of the candidate answers to obtain scores corresponding to the candidate answers.
Furthermore, the modal feature extraction module segments the question text into words, uses the GloVe tool to learn the word2vec-style embedded representation of the words, where each word is represented by a dense vector, and then uses an LSTM model to obtain the semantic features of the question, namely the question features;
the entity features in the image are extracted with a pre-trained Faster R-CNN network, where each entity is represented by a dense vector; the attention distribution related to the question is obtained through an attention mechanism, and the entity features are weighted and summed to obtain the visual features of the image;
the historical dialogue information is concatenated and segmented into words, the GloVe tool is used to learn the embedded representation of the words, and an LSTM model is used to obtain the history features of the dialogue history.
Further, in the method for obtaining the feature information within the different modalities, the question features, visual features and history features are first concatenated, then fused through a fully connected layer, and finally the feature information within the different modalities is obtained;
the method for extracting the association information between the different modalities obtains two interaction matrices through two KR-product calculations: in the first KR-product calculation, the question features, visual features and history features are concatenated along dimension 0 to generate a feature matrix, and the KR product of this feature matrix is computed column by column to generate a third-order tensor; a Mask calculation is applied to the third-order tensor, and its information is captured through a convolution operation to obtain an interacted feature matrix; in the second KR-product calculation, the KR product of the interacted feature matrix and the original feature matrix is computed over the corresponding columns, and a convolution operation is applied to the resulting third-order tensor to generate a second interacted feature matrix; the two interaction matrices are then compressed and concatenated to obtain the final features, which contain the association information between the modalities;
finally, the intra-modality feature information obtained by late fusion and the inter-modality association information obtained by KR-product fusion are concatenated to obtain the final fusion vector, which contains both the intra-modality feature information and the inter-modality association information.
Furthermore, the candidate answer prediction module predicts a number of candidate answers to obtain the score of each candidate answer, computes the loss by comparison with the true answer, and then continuously updates the parameters of the neural network visual dialogue model with a back-propagation algorithm, finally obtaining the score corresponding to each candidate answer.
The invention also provides a neural network visual dialogue method based on KR-product fusion of multi-modal information, which is based on the above neural network visual dialogue model and comprises the following steps: capturing the feature information within the different modalities through a late fusion method, capturing the association information between the different modalities through a KR-product calculation, and finally concatenating the intra-modality feature information and the inter-modality association information to obtain the final fusion vector; the late fusion method concatenates the captured question features, visual features and history features and fuses them through a linear layer.
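For illustration, a minimal sketch of how the three modules could be wired together in a single forward pass is given below. It assumes a PyTorch implementation; the class and argument names (VisualDialogModel, feature_extractor, late_fusion, kr_fusion, answer_encoder) are illustrative assumptions rather than part of the disclosure, while the concatenation of the two fusion outputs and the inner-product scoring over the 100 candidate answers follow the description above.

```python
# Illustrative sketch only; module names and signatures are assumptions, not the patented implementation.
import torch
import torch.nn as nn

class VisualDialogModel(nn.Module):
    def __init__(self, feature_extractor, late_fusion, kr_fusion, answer_encoder):
        super().__init__()
        self.feature_extractor = feature_extractor  # question / visual / history features
        self.late_fusion = late_fusion              # intra-modality feature information
        self.kr_fusion = kr_fusion                  # inter-modality association information (KR product)
        self.answer_encoder = answer_encoder        # LSTM encoding of the 100 candidate answers

    def forward(self, question, image_regions, history, candidates):
        q_t, v_t, h_t = self.feature_extractor(question, image_regions, history)
        f_late = self.late_fusion(q_t, v_t, h_t)        # feature information within the modalities
        f_kr = self.kr_fusion(q_t, v_t, h_t)            # association information between the modalities
        f_t = torch.cat([f_late, f_kr], dim=-1)         # final fusion vector
        a = self.answer_encoder(candidates)             # (batch, 100, dim of f_t)
        scores = (a * f_t.unsqueeze(1)).sum(dim=-1)     # inner product with every candidate answer
        return scores
```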
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. A different-modality information fusion module based on the KR product is proposed to capture the association information between different modalities. Most previous related work focuses on capturing the visual features and history features related to the current question separately and then fusing the visual, question and history features through a late fusion method, but late fusion does not consider the association between the different modalities in depth. The KR-product-based method combines the information of the different modalities through a tensor-product calculation and captures the association information between them.
2. A method for reducing redundant features based on a Mask mechanism is proposed on top of the KR product. In the first layer of KR-product feature combination, because the two input feature matrices are the same and the tensor calculation is performed column by column, the resulting matrix is symmetric and contains repeated combination information. To reduce the redundant information generated when combining information across modalities, the repeated combinations are removed with a Mask mechanism. After the redundant features are removed from the first-layer combined features, the association information between the modalities can be captured more accurately when these features are used to combine the second-layer features.
3. The neural network visual dialogue model captures the intra-modality feature information through a late fusion method and captures the inter-modality association information through the combination of KR product, Mask mechanism and convolutional neural network, and the two parts of information are fused by concatenation. The KR-product calculation requires mapping the visual, question and history features into the same vector space, after which the inter-modality association information is captured by the KR product, the Mask mechanism and the convolutional neural network; such an approach captures the latent association information between the modalities but loses the feature information within each modality. To retain the feature information within the different modalities, the intra-modality information is captured with a late fusion method and then concatenated with the inter-modality association information captured by the KR-based method, so that the final fusion vector contains both the intra-modality feature information and the inter-modality association information. Compared with related work, this way of incorporating the inter-modality association information improves the answer prediction results.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the visual dialogue model based on KR-product fusion of multi-modal information.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments. It should be understood that the specific embodiments described here are intended only to illustrate the invention and not to limit it.
A neural network visual dialogue model based on KR-product fusion of multi-modal information comprises a modal feature extraction module, a different-modality information fusion module and a candidate answer prediction module;
the modal feature extraction module extracts the semantic features of the question, the visual features of the image and the history features of the dialogue history. First, the vector representation of the question is obtained through an LSTM network, a set of entity feature vectors of the image is obtained with a Faster R-CNN network, and the dialogue history (taken either as a whole or with each dialogue round as a unit) is encoded into a vector representation through an LSTM network; then the visual attention distribution and the history attention distribution related to the question are computed through an attention mechanism, and the visual vector and the history vector are obtained by weighted summation.
The different-modality information fusion module fuses the question features, visual features and history features. The question vector, visual vector and history vector are obtained from the modal feature extraction module; their information is fused with a late fusion method to capture the feature information within the different modalities, their features are combined with the KR-product-based multi-modal information fusion method to capture the association information between the different modalities, and finally the intra-modality information and the inter-modality information are concatenated to obtain the final fusion vector.
The candidate answer prediction module obtains the answer vectors of the 100 candidate answers through an LSTM network and computes their inner products with the fusion vector obtained by the different-modality information fusion module, yielding the scores of the 100 candidate answers.
Fig. 1 shows the flow of the visual dialogue method proposed here; Fig. 2 shows the neural network visual dialogue model designed in this embodiment. The specific operation steps are as follows:
(1) All the questions in the VisDial v1.0 data set are traversed (the VisDial v1.0 data set is downloaded from the visual dialogue official website, https://visualdialog.org/data) to determine a fixed length for the input questions (generally the maximum sentence length in the data set); sentences shorter than the fixed length are padded with 0 for the remaining positions, so that all sentences have the same fixed length, which facilitates the computation of the neural network and the attention mechanism.
(2) All the answers in the data set are traversed to determine a fixed length for the answer text; sentences shorter than the fixed length are padded with 0 for the remaining positions, so that all sentences have the same fixed length, which facilitates the computation of the neural network.
(3) A 300-dimensional word vector w_i is obtained for each word with the GloVe tool and used to construct a word vector matrix E = [w_1, ..., w_n], in which each row represents one word. This matrix is used directly as the word vector matrix at model initialization and is set to be trainable, so that the word vectors are further optimized during model training.
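As a concrete illustration of steps (1)–(3), the sketch below (assumed PyTorch code; the reservation of index 0 for padding and the dictionary format of the GloVe vectors are assumptions) builds a trainable embedding matrix initialized from 300-dimensional GloVe vectors and pads sentences to the fixed length:

```python
import torch
import torch.nn as nn

def build_embedding(vocab, glove_vectors, dim=300):
    """vocab: list of words; glove_vectors: assumed dict mapping word -> 300-d vector."""
    weight = torch.zeros(len(vocab) + 1, dim)               # row 0 is reserved for the padding token
    for i, word in enumerate(vocab, start=1):
        if word in glove_vectors:
            weight[i] = torch.as_tensor(glove_vectors[word])
    emb = nn.Embedding(len(vocab) + 1, dim, padding_idx=0)  # word vector matrix E
    emb.weight.data.copy_(weight)
    emb.weight.requires_grad = True                         # optimized further during training
    return emb

def pad_to_fixed_length(token_ids, fixed_len):
    """Pad a token-id list with 0 up to the fixed sentence length used in steps (1)-(2)."""
    return (token_ids + [0] * fixed_len)[:fixed_len]
```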
(4) The feature vectors of each picture are extracted with a pre-trained Faster R-CNN network, generating for each picture a 36 × 2048 entity feature matrix, where 36 is the number of entities identified in the picture and 2048 is the length of each entity feature vector.
(5) The question vector q_t of round t is obtained through an LSTM network. The history information consists of the picture caption and the question-answer pairs of every round in the history; the caption information is concatenated with all the question-answer information, and the history vector h_t is obtained through an LSTM network.
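A minimal sketch of step (5), assuming a standard single-layer PyTorch LSTM whose final hidden state is taken as the sentence feature; the hidden size of 512 follows the dimensions given in step (6), the remaining choices are assumptions:

```python
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Encodes a padded word-id sequence into one feature vector with an LSTM."""
    def __init__(self, embedding, hidden_size=512):
        super().__init__()
        self.embedding = embedding
        self.lstm = nn.LSTM(embedding.embedding_dim, hidden_size, batch_first=True)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)        # (batch, seq_len, 300)
        _, (h_n, _) = self.lstm(x)           # h_n: (1, batch, 512)
        return h_n.squeeze(0)                # (batch, 512)

# q_t = question_encoder(question_ids)       # question feature of round t
# h_t = history_encoder(history_ids)         # history feature (caption + all previous QA pairs)
```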
(6) The question-related visual vector v_t is obtained with an attention mechanism, calculated as follows:
z_{t,i} = W_2^T ( dropout( tanh( W_1 v_i^I ) ) ∘ q_t )
α_t = softmax( [z_{t,1}, ..., z_{t,36}] )
v_t = Σ_{i=1}^{36} α_{t,i} v_i^I
where v_i^I ∈ R^2048 denotes an entity vector of the picture, the matrix W_1 ∈ R^{512×2048}, the vector q_t ∈ R^512 denotes the question feature of round t, dropout(·) means that values in the vector are discarded with a certain probability, ∘ denotes element-wise multiplication, the matrix W_2 ∈ R^{512×1}, softmax(·) is used to compute the correlation between the different entity features of the picture and the question feature, the vector α_t ∈ R^36 is the attention distribution, and the vector v_t ∈ R^2048 is the visual feature related to the question of round t, obtained as the weighted sum of the different entity features in the image.
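The following is a sketch of the question-guided attention of step (6), with W_1 ∈ R^{512×2048}, W_2 ∈ R^{512×1} and 36 entity features per picture; the exact position of dropout and the dropout rate are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    def __init__(self, region_dim=2048, q_dim=512, dropout=0.5):
        super().__init__()
        self.w1 = nn.Linear(region_dim, q_dim, bias=False)  # W_1
        self.w2 = nn.Linear(q_dim, 1, bias=False)           # W_2
        self.dropout = nn.Dropout(dropout)

    def forward(self, regions, q_t):
        # regions: (batch, 36, 2048) entity features; q_t: (batch, 512) question feature
        joint = self.dropout(torch.tanh(self.w1(regions))) * q_t.unsqueeze(1)  # (batch, 36, 512)
        z = self.w2(joint).squeeze(-1)                       # (batch, 36) attention logits
        alpha = F.softmax(z, dim=-1)                         # attention distribution over entities
        v_t = (alpha.unsqueeze(-1) * regions).sum(dim=1)     # (batch, 2048) question-related visual feature
        return v_t, alpha
```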
(7) The obtained question vector, visual vector and history vector are fused with a late fusion method to obtain a vector f_t^late containing the intra-modality feature information. The calculation is as follows:
f_t^late = W_3^T [ q_t ; v_t ; h_t ]
where [ · ; · ] denotes the concatenation operation, the vector [q_t; v_t; h_t] ∈ R^3072, the matrix W_3 ∈ R^{3072×512}, and the vector f_t^late ∈ R^512.
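A sketch of the late-fusion step (7): the three modality vectors are concatenated and passed through a single fully connected layer corresponding to W_3 ∈ R^{3072×512}; whether an activation function follows the layer is not specified, so none is added here:

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, q_dim=512, v_dim=2048, h_dim=512, out_dim=512):
        super().__init__()
        self.fc = nn.Linear(q_dim + v_dim + h_dim, out_dim, bias=False)  # W_3: 3072 -> 512

    def forward(self, q_t, v_t, h_t):
        # intra-modality fusion vector f_t^late
        return self.fc(torch.cat([q_t, v_t, h_t], dim=-1))
```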
(8) The association information among the question vector, visual vector and history vector is acquired with the KR-product-based multi-modal information fusion method, giving a vector f_t^KR that contains the inter-modality association information. First, the question vector, visual vector and history vector are mapped into the same vector space and concatenated along dimension 0 to obtain a feature matrix X^0 ∈ R^{3×D_s}. The KR product of this feature matrix is then computed column by column to obtain a third-order tensor T^1; the lower-triangular entries of every slice of T^1 are set to 0 through a Mask mechanism; finally, the combination information among the features is captured through a convolutional neural network to obtain a feature matrix X^1. Next, using the feature matrix X^0 and the feature matrix X^1, the KR product of the corresponding columns of X^1 and X^0 is computed to obtain a third-order tensor T^2, and a feature matrix X^2 is obtained through a convolutional neural network. The feature matrices X^1 and X^2 are each compressed along dimension 1 to obtain vectors x^1 and x^2, which are then concatenated to obtain the final vector f_t^KR. The calculation formulas are as follows:
T^1_{:,:,j} = ( X^0_{:,j} ⊗ X^0_{:,j} ) ∘ W_Mask
X^1_{i,j} = sum( W_i^4 ∘ T^1_{:,:,j} )
T^2_{:,:,j} = X^1_{:,j} ⊗ X^0_{:,j}
X^2_{i,j} = sum( W_i^5 ∘ T^2_{:,:,j} )
x^1_i = mean( X^1_{i,:} ),  x^2_i = mean( X^2_{i,:} )
f_t^KR = [ x^1 ; x^2 ]
where W_Mask ∈ R^{3×3} is a mask matrix whose upper-triangular values are 1 and whose remaining values are 0, ∘ denotes element-wise multiplication, ⊗ denotes the tensor operation (mathematically, the outer product), D_s denotes the feature dimension, X^0_{:,j} is a column vector of the matrix X^0, the matrix T^1_{:,:,j} is a slice of the third-order tensor T^1, the matrix W_i^4 ∈ R^{3×3} is the i-th convolution kernel of the first layer, sum(·) is an addition function that sums all values, X^1_{i,j} is a scalar, the j-th value of the i-th row vector of the matrix X^1, and x^1_i is a scalar obtained by average pooling over the vector X^1_{i,:}; X^1_{:,j} is a column vector of the matrix X^1, X^0_{:,j} is a column vector of the matrix X^0, the matrix T^2_{:,:,j} is a slice of the third-order tensor T^2, the matrix W_i^5 ∈ R^{M×3} is the i-th convolution kernel of the second layer, X^2_{i,j} is a scalar, the j-th value of the i-th row vector of the matrix X^2, and x^2_i is a scalar obtained by average pooling over the vector X^2_{i,:}; M and N denote the numbers of convolution kernels in the first and second layers, respectively.
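The sketch below follows the formulas of step (8): the three modality vectors are first projected into a common D_s-dimensional space (the linear projections are an assumption implied by "mapping into the same vector space"), column-wise outer (KR) products build the third-order tensors, an upper-triangular mask removes the redundant symmetric combinations, learnable 3×3 and M×3 kernels applied per slice play the role of the convolution layers, and average pooling plus concatenation yields f_t^KR. The kernel counts M and N and the value of D_s are free hyper-parameters:

```python
import torch
import torch.nn as nn

class KRFusion(nn.Module):
    def __init__(self, q_dim=512, v_dim=2048, h_dim=512, d_s=512, m=16, n=16):
        super().__init__()
        # map the question / visual / history features into the same D_s-dimensional space (assumed linear maps)
        self.proj_q = nn.Linear(q_dim, d_s)
        self.proj_v = nn.Linear(v_dim, d_s)
        self.proj_h = nn.Linear(h_dim, d_s)
        self.w4 = nn.Parameter(torch.randn(m, 3, 3) * 0.01)         # first-layer kernels W^4_i in R^{3x3}
        self.w5 = nn.Parameter(torch.randn(n, m, 3) * 0.01)         # second-layer kernels W^5_i in R^{Mx3}
        self.register_buffer("mask", torch.triu(torch.ones(3, 3)))  # W_Mask: upper triangle 1, rest 0

    def forward(self, q_t, v_t, h_t):
        # X^0: (batch, 3, D_s), the three projected modality vectors stacked along dimension 0
        x0 = torch.stack([self.proj_q(q_t), self.proj_v(v_t), self.proj_h(h_t)], dim=1)
        # T^1_{:,:,j} = (X^0_{:,j} (x) X^0_{:,j}) o W_Mask  ->  (batch, D_s, 3, 3)
        t1 = torch.einsum("bid,bjd->bdij", x0, x0) * self.mask
        # X^1_{i,j} = sum(W^4_i o T^1_{:,:,j})              ->  (batch, M, D_s)
        x1 = torch.einsum("bdij,mij->bmd", t1, self.w4)
        # T^2_{:,:,j} = X^1_{:,j} (x) X^0_{:,j}             ->  (batch, D_s, M, 3)
        t2 = torch.einsum("bmd,bjd->bdmj", x1, x0)
        # X^2_{i,j} = sum(W^5_i o T^2_{:,:,j})              ->  (batch, N, D_s)
        x2 = torch.einsum("bdmj,nmj->bnd", t2, self.w5)
        # average pooling over D_s, then concatenation: f_t^KR in R^{M+N}
        return torch.cat([x1.mean(dim=-1), x2.mean(dim=-1)], dim=-1)
```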
(9) The outputs of step (7) and step (8) are concatenated into the fusion vector f_t = [ f_t^late ; f_t^KR ]. The vector a_i of each candidate answer is obtained through an LSTM network, and the scores of the 100 candidate answers are computed by the candidate answer module as follows:
score_i = sum( f_t ∘ a_i )
where the vector f_t is obtained by concatenation, the vector a_i represents the features of one of the 100 candidate answers, ∘ denotes element-wise multiplication, sum(·) denotes the addition of all values of the vector, and score_i is a scalar, the relevance score of the i-th candidate answer.
The cross entropy between the candidate answer scores and the true answer is computed, and the parameters of the model are then continuously updated through back-propagation.
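A sketch of step (9) and the training objective, assuming the 100 candidate answers of each example are encoded with an LSTM into features of the same dimension as the fusion vector and that the index of the true answer is available (as in the discriminative setting of VisDial v1.0); the cross entropy over the 100 scores is one interpretation of "the cross entropy between the candidate answer scores and the true answer":

```python
import torch.nn.functional as F

def answer_scores(f_t, answer_feats):
    """f_t: (batch, D) fusion vector; answer_feats: (batch, 100, D) candidate answer features.
    score_i = sum(f_t o a_i): element-wise multiplication followed by summation."""
    return (answer_feats * f_t.unsqueeze(1)).sum(dim=-1)   # (batch, 100)

def training_step(f_t, answer_feats, gt_index, optimizer):
    scores = answer_scores(f_t, answer_feats)
    loss = F.cross_entropy(scores, gt_index)               # compare with the true answer
    optimizer.zero_grad()
    loss.backward()                                        # back-propagation updates the model parameters
    optimizer.step()
    return loss.item()
```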
(10) Finally, the method is applied to the RvA and LF models and trained and tested on the VisDial v1.0 data set; the evaluation metrics are R@1, R@5, R@10, Mean (mean rank), MRR (mean reciprocal rank) and NDCG (normalized discounted cumulative gain). As can be seen from the experimental results in Table 1, the method of obtaining multi-modal information based on late fusion is insufficient for capturing the association information between different modalities; the invention uses the KR-product-based multi-modal information fusion method to capture the association information between the different modalities and concatenates it with the feature information within the different modalities to obtain the final fusion vector, and the experiments show that capturing the association information between the different modalities improves the overall performance of the visual dialogue model.
TABLE 1 Comparison of model results on the VisDial v1.0 data set
The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (5)

1. A neural network visual dialogue model based on KR-product fusion of multi-modal information, characterized by comprising a modal feature extraction module, a different-modality information fusion module and a candidate answer prediction module;
the modal feature extraction module extracts the semantic features of the question, the visual features of the image and the history features of the dialogue history, finally obtaining the question features, visual features and history features;
the different-modality information fusion module fuses the question features, visual features and history features to obtain a final fusion vector, considering the information of the different modalities comprehensively so that the information related to the candidate answers can be captured conveniently; it specifically comprises three steps: obtaining the feature information within the different modalities, extracting the association information between the different modalities, and fusing the intra-modality and inter-modality information;
the candidate answer prediction module predicts the candidate answers; its input is the fusion vector obtained by the different-modality information fusion module, and the inner product of the fusion vector with the feature vector of each candidate answer is computed to obtain the score corresponding to each candidate answer.
2. The neural network visual dialogue model based on KR-product fusion of multi-modal information according to claim 1, characterized in that the modal feature extraction module segments the question text into words, uses the GloVe tool to learn the word2vec-style embedded representation of the words, where each word is represented by a dense vector, and then uses an LSTM model to obtain the semantic features of the question, namely the question features;
the entity features in the image are extracted with a pre-trained Faster R-CNN network, where each entity is represented by a dense vector; the attention distribution related to the question is obtained through an attention mechanism, and the entity features are weighted and summed to obtain the visual features of the image;
the historical dialogue information is concatenated and segmented into words, the GloVe tool is used to learn the embedded representation of the words, and an LSTM model is used to obtain the history features of the dialogue history.
3. The neural network visual dialogue model based on KR-product fusion of multi-modal information according to claim 1, characterized in that, in the method for obtaining the feature information within the different modalities, the question features, visual features and history features are concatenated and then fused through a fully connected layer, so as to obtain the feature information within the different modalities;
the method for extracting the association information between the different modalities obtains two interaction matrices through two KR-product calculations: in the first KR-product calculation, the question features, visual features and history features are concatenated along dimension 0 to generate a feature matrix, and the KR product of this feature matrix is computed column by column to generate a third-order tensor; a Mask calculation is applied to the third-order tensor, and its information is captured through a convolution operation to obtain an interacted feature matrix; in the second KR-product calculation, the KR product of the interacted feature matrix and the original feature matrix is computed over the corresponding columns, and a convolution operation is applied to the resulting third-order tensor to generate a second interacted feature matrix; the two interaction matrices are then compressed and concatenated to obtain the final features, which contain the association information between the modalities;
finally, the intra-modality feature information obtained by late fusion and the inter-modality association information obtained by KR-product fusion are concatenated to obtain the final fusion vector, which contains both the intra-modality feature information and the inter-modality association information.
4. The neural network visual dialogue model based on KR-product fusion of multi-modal information according to claim 1, characterized in that the candidate answer prediction module predicts a number of candidate answers to obtain the score of each candidate answer, computes the loss by comparison with the true answer, and then continuously updates the parameters of the neural network visual dialogue model with a back-propagation algorithm, finally obtaining the score corresponding to each candidate answer.
5. A neural network visual dialogue method based on KR-product fusion of multi-modal information, based on the neural network visual dialogue model, characterized by comprising the following steps: capturing the feature information within the different modalities through a late fusion method, capturing the association information between the different modalities through a KR-product calculation, and finally concatenating the intra-modality feature information and the inter-modality association information to obtain the final fusion vector; the late fusion method concatenates the captured question features, visual features and history features and fuses them through a linear layer.
CN202110448254.XA 2021-04-25 2021-04-25 Neural network visual conversation device and method based on KR product fusion multi-mode information Expired - Fee Related CN113177112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110448254.XA CN113177112B (en) 2021-04-25 2021-04-25 Neural network visual conversation device and method based on KR product fusion multi-mode information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110448254.XA CN113177112B (en) 2021-04-25 2021-04-25 Neural network visual conversation device and method based on KR product fusion multi-mode information

Publications (2)

Publication Number Publication Date
CN113177112A true CN113177112A (en) 2021-07-27
CN113177112B CN113177112B (en) 2022-07-01

Family

ID=76925652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110448254.XA Expired - Fee Related CN113177112B (en) 2021-04-25 2021-04-25 Neural network visual conversation device and method based on KR product fusion multi-mode information

Country Status (1)

Country Link
CN (1) CN113177112B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868451A (en) * 2021-09-02 2021-12-31 天津大学 Cross-modal social network conversation method and device based on context cascade perception
CN114357968A (en) * 2021-11-29 2022-04-15 阿里巴巴达摩院(杭州)科技有限公司 Conversation method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408624A (en) * 2018-11-06 2019-03-01 江西师范大学 A kind of vision chat robots session generation method
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
US10769374B1 (en) * 2019-04-24 2020-09-08 Honghui CHEN Answer selection method for question answering system and the system
CN111859005A (en) * 2020-07-01 2020-10-30 江西理工大学 Cross-layer multi-model feature fusion and image description method based on convolutional decoding
CN111897940A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model
CN111897939A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408624A (en) * 2018-11-06 2019-03-01 江西师范大学 A kind of vision chat robots session generation method
US10769374B1 (en) * 2019-04-24 2020-09-08 Honghui CHEN Answer selection method for question answering system and the system
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN111859005A (en) * 2020-07-01 2020-10-30 江西理工大学 Cross-layer multi-model feature fusion and image description method based on convolutional decoding
CN111897940A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model
CN111897939A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QI WU: "Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
廖文雄: "Research on a generative question answering system based on multi-modal information fusion", Wanfang Data Knowledge Service Platform *
王嫱: "Research on visual dialogue methods based on visual semantic representation", Wanfang Data Knowledge Service Platform *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868451A (en) * 2021-09-02 2021-12-31 天津大学 Cross-modal social network conversation method and device based on context cascade perception
CN113868451B (en) * 2021-09-02 2024-06-11 天津大学 Cross-modal social network conversation method and device based on context cascade perception
CN114357968A (en) * 2021-11-29 2022-04-15 阿里巴巴达摩院(杭州)科技有限公司 Conversation method and device

Also Published As

Publication number Publication date
CN113177112B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
US20220180202A1 (en) Text processing model training method, and text processing method and apparatus
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
JP6351689B2 (en) Attention based configurable convolutional neural network (ABC-CNN) system and method for visual question answering
US11776269B2 (en) Action classification in video clips using attention-based neural networks
CN111460121B (en) Visual semantic conversation method and system
CN115329779B (en) Multi-person dialogue emotion recognition method
CN109766557B (en) Emotion analysis method and device, storage medium and terminal equipment
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN113792177B (en) Scene character visual question-answering method based on knowledge-guided deep attention network
CN113177112B (en) Neural network visual conversation device and method based on KR product fusion multi-mode information
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN114339450B (en) Video comment generation method, system, device and storage medium
CN111597341A (en) Document level relation extraction method, device, equipment and storage medium
CN110705490A (en) Visual emotion recognition method
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
Huang et al. Learning long-term temporal contexts using skip RNN for continuous emotion recognition
CN112560440A (en) Deep learning-based syntax dependence method for aspect-level emotion analysis
CN114168769B (en) Visual question-answering method based on GAT relation reasoning
CN115906861A (en) Statement emotion analysis method and device based on interaction aspect information fusion
CN112765955B (en) Cross-modal instance segmentation method under Chinese finger representation
Fenghour et al. Contour mapping for speaker-independent lip reading system
CN116824461B (en) Question understanding guiding video question answering method and system
CN117521674B (en) Method, device, computer equipment and storage medium for generating countermeasure information
US11526781B2 (en) Automatic sentence inferencing network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220701