CN111008293A - Visual question-answering method based on structured semantic representation - Google Patents
- Publication number
- CN111008293A (application CN201811164612.9A)
- Authority
- CN
- China
- Prior art keywords
- image
- vector
- input
- word
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
- G06V10/422—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation for representing the structure of the pattern or shape of an object therefor
- G06V10/424—Syntactic representation, e.g. by using alphabets or grammars
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a visual question-answering method based on structured semantic representation. The method extracts image features of an input image through a convolutional neural network; extracts a word vector for each word of the input question related to the input image through a pre-trained word embedding model; weights the image features and the word vectors to obtain a weighted image feature vector and a weighted text feature vector; converts the word vectors into a structured semantic representation vector through a Tree-LSTM network; fuses the image feature vector, the text feature vector and the structured semantic representation vector to obtain a corresponding fused feature vector; and takes the fused feature vector as the input of a prediction model, which outputs the answer corresponding to the input question. The method extracts richer semantic information from the question and improves the performance of the prediction model through multi-layer training optimization, thereby improving the accuracy of the answers.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to a visual question-answering method based on structured semantic representation.
Background
In the field of computer vision, visual question answering is a cutting-edge and challenging problem. Given a natural image, any question related to the image content may be asked. In order to predict answers accurately, the visual question-answering modelling process must fully acquire the data and represent it in a robust manner. The training scheme of the visual question-answering model is also critical, since the classifier boundary must be located accurately. Because language itself is compositional, different questions often share similar substructures, which means that the reasoning process in visual question answering must likewise be compositional in order to extract useful information.
A search of the prior art shows that Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng and Alex Smola, in "Stacked attention networks for image question answering" (IEEE Conference on Computer Vision and Pattern Recognition, 2016), proposed a multi-layer attention mechanism in which the image feature weights are computed repeatedly from the text features, so that more accurate weighting information is obtained through multiple rounds of weighting. Jiasen Lu, Jianwei Yang, Dhruv Batra and Devi Parikh, in "Hierarchical question-image co-attention for visual question answering" (Conference on Neural Information Processing Systems, 2016), proposed a hierarchical joint attention scheme that performs multi-layer feature extraction on the text to obtain three levels of feature vectors, and weights the image and text feature vectors jointly.
Both methods use a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) network to extract text features: each word is treated as relatively independent, and only the relation between the current word and the preceding words is considered. This chained structure, however, ignores the semantic structure information in the text, and this overlooked structural information is important in the visual question-answering process.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a visual question-answering method based on structured semantic representation.
The embodiment of the invention provides a visual question-answering method based on structured semantic representation, which comprises the following steps:
extracting image features of an input image through a convolutional neural network;
extracting a word vector of each word of the input question related to the input image through a pre-trained word embedding model;
weighting the image features and the word vectors to obtain weighted image feature vectors and weighted text feature vectors;
converting the word vector into a structured semantic representation vector through a Tree-LSTM network;
fusing the image feature vector, the text feature vector and the structured semantic representation vector to obtain a corresponding fused feature vector;
and taking the fused feature vector as the input of a prediction model, and outputting the answer corresponding to the input question by the prediction model.
Optionally, extracting the image feature of the input image by a convolutional neural network includes:
cropping the input image into image blocks of a preset size;
and inputting the image block into a convolutional neural network, and taking the output of the last pooling layer of the convolutional neural network as the image feature of the input image.
Optionally, performing weighting processing on the image feature and the word vector to obtain a weighted image feature vector and a weighted text feature vector, including:
weighting the image features and the word vectors through a preset attention function to obtain a weighted image feature vector and a weighted text feature vector; the attention operation is defined as x̂ = A(X; g), where X is the unweighted image feature or word vector matrix, g is the corresponding image or text attention guidance vector, representing the degree of matching between the image and the text, and x̂ is the weighted feature vector: if X is the image feature, the weighted image feature vector is obtained; if X is the word vector, the weighted text feature vector is obtained; the attention function is calculated as:

H = tanh(W_x X + (W_g g) 1^T)
a^x = softmax(w_hx^T H)
x̂ = Σ_{i=1}^{N} a_i^x x_i

wherein: H represents the intermediate state of the attention calculation; tanh is the hyperbolic tangent nonlinear activation function; W_x, W_g and w_hx are the three parameters to be learned; 1 is the unit vector with all elements equal to 1; a^x is the vector of importance probability values calculated from the text for each image region; softmax is the multi-class normalized exponential function; a_i^x is the weighted probability value of the i-th image region; x_i is the i-th image region; x̂ is the weighted overall image feature vector; and i ∈ [1, N], where N is the total number of image regions.
Optionally, converting the word vector into a structured semantic representation vector through a Tree-LSTM network, comprising:
obtaining a tree structure of each input question;
calculating a state vector corresponding to each node position on the tree structure, wherein each node position corresponds to one LSTM unit; the LSTM cell includes: an input gate, an output gate, and a memory cell;
and taking the state vector and the word vector corresponding to each node position as the input of the LSTM unit, and outputting the corresponding structured semantic representation vector by the LSTM unit.
Optionally, the fusing the image feature vector, the text feature vector, and the structured semantic representation vector includes:
and fusing the image feature vector, the text feature vector and the structured semantic representation vector by element-wise addition of the vectors.
Optionally, the method further comprises:
establishing an initial visual question-answering model comprising the convolutional neural network, a pre-trained word embedding model, a Tree-LSTM network and a prediction model;
taking a training image and an answer corresponding to the training image as the input of the initial visual question-answering model, training a preliminary model according to the existing label information, and outputting a predicted candidate answer by the initial visual question-answering model;
taking the training image, the predicted candidate answer and a supplementary image as the input of the initial visual question-answering model, and outputting a judgment result of whether the predicted candidate answer is correct or not by the initial visual question-answering model;
if the judgment result is correct, the training image, the predicted candidate answer and the supplementary image form a positive sample; if the judgment result is wrong, the training image, the predicted candidate answer and the supplementary image form a negative sample;
and retraining the initial visual question-answering model with the positive samples and the negative samples, setting a binary classification loss function on the discrimination result to adjust the parameters of the model during training, to obtain the optimized visual question-answering model.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, the structural information in the language is applied to the process of visual question answering, richer semantic information can be obtained by means of the Tree-LSTM network, the feature dimension is enriched, and the performance of model answer prediction is improved. In the training process, a dual-channel network structure is provided, and more information in new data can be obtained. The performance of model prediction can be further improved by a multi-layer training optimization mode. Therefore, the accuracy of the answers of the visual question answering method is finally improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a visual question-answering model of the present invention;
FIG. 2 is a schematic diagram of a two-channel network architecture input;
FIG. 3 is a flowchart of a visual question-answering method based on structured semantic representation according to an embodiment of the present invention;
fig. 4 is a flowchart of model training of a visual question-answering method based on structured semantic representation according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific embodiments. The following embodiments will assist those skilled in the art in further understanding the invention, but do not limit it in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention; all of these fall within the scope of the present invention.
Fig. 1 is a schematic diagram of the visual question-answering model of the present invention. As shown in Fig. 1, image features are first extracted using a convolutional neural network; to preserve the position information of the image, the output of a pooling layer is usually selected as the image features. A word vector for each word is obtained using the trained word embedding model. The structural information of the text is then extracted with the Tree-LSTM network to obtain a structured semantic representation feature vector. A joint attention mechanism weights the word vectors and image feature vectors by importance, computing the importance of different image regions and of different words in the sentence. The weighted image feature vector, the text feature vector and the structured semantic representation vector are then fused in a multi-modal manner, and the fused feature vector is used to train the classifier. Fig. 2 is a schematic diagram of the input of the dual-channel network structure; as shown in Fig. 2, the dual-channel network structure is used to further optimize the model parameters.
Optionally, in the feature extraction process, multi-layer convolution may be adopted to obtain multi-layer text feature vectors, so that more information is available to the question-answering model.
Optionally, in the feature alignment process, a multi-layer bidirectional attention mechanism may be adopted to locate important image regions and words more accurately and to filter out irrelevant information, improving the performance of the visual question-answering model.
Alternatively, the feature fusion scheme directly affects the final feature vector; besides basic element-wise addition and multiplication, further processing may be adopted, which has a more direct effect on classifier training.
Fig. 3 is a flowchart of a visual question-answering method based on structured semantic representation according to an embodiment of the present invention, and as shown in fig. 3, the method in this embodiment may include:
and S101, extracting image features of the input image through a convolutional neural network.
In this embodiment, the input image is cropped into image blocks of a preset size; the image blocks are input into a convolutional neural network, and the output of its last pooling layer is taken as the image features of the input image. Specifically, the input image is cropped to 224 × 224 and input into a pre-trained VGG-19 network; the output of the last pooling layer is taken as the image features, with dimensions 512 × 14 × 14.
S102, extracting a word vector of each word of the input question related to the input image through a pre-trained word embedding model.
In this embodiment, the word vector of each word is obtained using the trained word embedding model, with dimension 512.
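A toy sketch of this lookup follows; the vocabulary and the random embedding matrix are illustrative stand-ins for the pre-trained word embedding model, not part of the patent.

```python
import numpy as np

# Toy embedding lookup for S102: one 512-dimensional vector per question word.
rng = np.random.default_rng(0)
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "cat": 4}
embedding = rng.standard_normal((len(vocab), 512))   # pre-trained in practice

question = "what color is the cat"
word_vectors = np.stack([embedding[vocab[w]] for w in question.split()])
print(word_vectors.shape)   # one row per word
```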
S103, weighting the image features and the word vectors to obtain weighted image feature vectors and weighted text feature vectors.
In this embodiment, the image features and the word vectors are weighted through a preset attention function to obtain the weighted image feature vector and text feature vector. The attention operation is defined as x̂ = A(X; g), where X is the unweighted image feature or word vector matrix, g is the corresponding image or text attention guidance vector, representing the degree of matching between the image and the text, and x̂ is the weighted feature vector: if X is the image feature, the weighted image feature vector is obtained; if X is the word vector, the weighted text feature vector is obtained. The attention function is calculated as:

H = tanh(W_x X + (W_g g) 1^T)
a^x = softmax(w_hx^T H)
x̂ = Σ_{i=1}^{N} a_i^x x_i

wherein: H represents the intermediate state of the attention calculation; tanh is the hyperbolic tangent nonlinear activation function; W_x, W_g and w_hx are the three parameters to be learned; 1 is the unit vector with all elements equal to 1; a^x is the vector of importance probability values calculated from the text for each image region; softmax is the multi-class normalized exponential function; a_i^x is the weighted probability value of the i-th image region; x_i is the i-th image region; x̂ is the weighted overall image feature vector; and i ∈ [1, N], where N is the total number of image regions.
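The attention weighting described above can be sketched in NumPy. The dimensions, random parameters and helper names (`attend`, `softmax`) are assumptions for illustration; in the model the three parameter matrices are learned by back-propagation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(X, g, Wx, Wg, w_hx):
    """Sketch of the attention function A(X; g).

    X    : (d, N) unweighted features, one column per image region (or word)
    g    : (d,)   guidance vector from the other modality
    Wx,Wg: (k, d) learned projection matrices (random here)
    w_hx : (k,)   learned weight vector
    """
    H = np.tanh(Wx @ X + np.outer(Wg @ g, np.ones(X.shape[1])))  # intermediate state
    a = softmax(w_hx @ H)          # importance probability per region/word
    return X @ a, a                # weighted feature vector and the weights

# Toy dimensions: 196 = 14 x 14 image regions, 512-d features.
rng = np.random.default_rng(0)
d, k, N = 512, 256, 196
X = rng.standard_normal((d, N))
g = rng.standard_normal(d)
Wx, Wg, w_hx = (rng.standard_normal(s) for s in [(k, d), (k, d), (k,)])
x_hat, a = attend(X, g, Wx, Wg, w_hx)
```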
S104, converting the word vectors into a structured semantic representation vector through a Tree-LSTM network.
In this embodiment, the tree structure of each input question is obtained; the state vector corresponding to each node position on the tree structure is calculated, each node position corresponding to one LSTM unit, which includes an input gate, an output gate and a memory cell. The state vector and the word vector corresponding to each node position are taken as the input of the LSTM unit, which outputs the corresponding structured semantic representation vector.
Specifically, the syntax tree structure of the question sentence is first obtained. Each word corresponds to a tree node in the syntax tree, and each node corresponds to an LSTM unit comprising an input gate i_j, an output gate o_j, a hidden state h_j and a memory cell c_j; the state update of each node is determined by its child nodes:

h̃_j = Σ_{k∈C(j)} h_k
i_j = σ(W_i x_j + U_i h̃_j + b_i)
f_jk = σ(W_f x_j + U_f h_k + b_f)
o_j = σ(W_o x_j + U_o h̃_j + b_o)
u_j = tanh(W_u x_j + U_u h̃_j + b_u)
c_j = i_j ⊙ u_j + Σ_{k∈C(j)} f_jk ⊙ c_k
h_j = o_j ⊙ tanh(c_j)

wherein C(j) denotes the set of all child nodes of the j-th node; W_i, U_i, b_i, W_o, U_o, b_o, W_f, U_f, b_f are the parameters to be learned by the input, output and forget units respectively; i_j, f_jk and o_j are the input control quantity, forget quantity and output quantity respectively; σ and tanh are the sigmoid and hyperbolic tangent activation functions; c_j is the memory cell state at the j-th node position; and ⊙ is the element-wise product. The input of each node includes the states of its child nodes and the word vector of the current node. The state value at the root node position is used as the representation of the entire sequence in the visual question-answering modelling process.
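A single node update of this Child-Sum style Tree-LSTM can be sketched as follows. The dimensions, the random parameter dictionary `P` and the helper names are illustrative assumptions; in the model all parameters are learned jointly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_lstm_node(x_j, child_h, child_c, P):
    """One Tree-LSTM node update following the equations above.

    x_j     : (d,)  word vector of the current node
    child_h : list of (m,) hidden states of the child nodes
    child_c : list of (m,) memory cells of the child nodes
    P       : dict of the learned parameters W*, U*, b*
    """
    h_tilde = np.sum(child_h, axis=0) if child_h else np.zeros(P["Ui"].shape[0])
    i = sigmoid(P["Wi"] @ x_j + P["Ui"] @ h_tilde + P["bi"])    # input gate
    o = sigmoid(P["Wo"] @ x_j + P["Uo"] @ h_tilde + P["bo"])    # output gate
    u = np.tanh(P["Wu"] @ x_j + P["Uu"] @ h_tilde + P["bu"])    # candidate
    c = i * u                                                    # memory cell
    for h_k, c_k in zip(child_h, child_c):
        f_k = sigmoid(P["Wf"] @ x_j + P["Uf"] @ h_k + P["bf"])  # per-child forget gate
        c = c + f_k * c_k
    h = o * np.tanh(c)                                          # hidden state
    return h, c

# Toy dimensions and random parameters for illustration only.
rng = np.random.default_rng(0)
d, m = 512, 512
P = {}
for gate in "iofu":
    P["W" + gate] = rng.standard_normal((m, d)) * 0.01
    P["U" + gate] = rng.standard_normal((m, m)) * 0.01
    P["b" + gate] = np.zeros(m)

leaf_h, leaf_c = tree_lstm_node(rng.standard_normal(d), [], [], P)      # leaf node
root_h, root_c = tree_lstm_node(rng.standard_normal(d), [leaf_h], [leaf_c], P)
```

`root_h` plays the role of the root-node state used as the representation of the whole question.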
S105, fusing the image feature vector, the text feature vector and the structured semantic representation vector to obtain a corresponding fused feature vector.
In this embodiment, the image feature vector, the text feature vector and the structured semantic representation vector are fused by basic element-wise addition.
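This fusion step reduces to a sum of equally sized vectors; a minimal sketch, assuming all three vectors share the same 512-dimensional space:

```python
import numpy as np

# Sketch of S105: element-wise addition of the three feature vectors.
rng = np.random.default_rng(0)
v_image = rng.standard_normal(512)   # weighted image feature vector
v_text  = rng.standard_normal(512)   # weighted text feature vector
v_tree  = rng.standard_normal(512)   # structured semantic representation vector

v_fused = v_image + v_text + v_tree  # fused feature vector fed to the prediction model
```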
S106, taking the fused feature vector as the input of the prediction model, which outputs the answer corresponding to the input question.
In this embodiment, the fused feature vector is used as an input of the prediction model, and the prediction model outputs an answer corresponding to the input question.
Fig. 4 is a flowchart of model training of a visual question-answering method based on a structured semantic representation according to an embodiment of the present invention, and as shown in fig. 4, before executing S101, the method further includes:
s201, establishing an initial visual question-answer model comprising a convolutional neural network, a pre-trained word embedding model, a Tree-LSTM network and a prediction model.
S202, taking the training images and the answers corresponding to the training images as the input of the initial visual question-answer model, and outputting predicted candidate answers by the initial visual question-answer model.
S203, the training image, the predicted candidate answer and the supplementary image are used as the input of the initial visual question-answering model, and the initial visual question-answering model outputs the judgment result of whether the predicted candidate answer is correct or not.
S204, if the judgment result is correct, the training image, the predicted candidate answer and the supplementary image form a positive sample; if the judgment result is wrong, the training image, the predicted candidate answer and the supplementary image form a negative sample.
S205, training the initial visual question-answer model by using the positive sample and the negative sample, and adjusting parameters of the initial visual question-answer model by setting a loss function of a judgment result in the training process to obtain the optimized visual question-answer model.
In this embodiment, the training image and its corresponding answer are taken as the input of the initial visual question-answering model to predict candidate answers; this is a multi-class classification problem whose loss function is the cross-entropy function, and minimizing it completes the first layer of training optimization. The training image, the predicted candidate answer and the supplementary image are then taken as the input of the initial visual question-answering model to judge whether the predicted candidate answer is correct; this is a binary classification problem whose loss function is likewise the cross-entropy function, and the second layer of training optimization is completed by minimizing this loss.
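The two training layers can be illustrated with their loss functions; the probabilities below are toy values standing in for the outputs of the two model heads.

```python
import numpy as np

def cross_entropy(p, y):
    """Multi-class cross-entropy for the first training layer (answer prediction).
    p : (K,) predicted answer probabilities; y : index of the true answer."""
    return -np.log(p[y] + 1e-12)

def binary_cross_entropy(p, y):
    """Binary cross-entropy for the second training layer (is the candidate
    answer correct?). p : predicted probability of 'correct'; y : 0 or 1."""
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# Toy values for illustration.
p_answers = np.array([0.1, 0.7, 0.2])   # softmax over candidate answers
loss1 = cross_entropy(p_answers, 1)     # first-layer optimisation target
loss2 = binary_cross_entropy(0.9, 1)    # second-layer optimisation target
```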
Following the above steps, an experiment was carried out using the method of the summary of the invention. The experimental data come from the public VQA 2.0 dataset, comprising 82783 training images, 40504 validation images and 81434 test images, with a total of 443757 training questions and 447793 test questions. The visual question-answering model based on structured semantic representation is compared with the hierarchical joint attention model proposed by Jiasen Lu et al. in "Hierarchical question-image co-attention for visual question answering" and with a deep LSTM model.
The results: by applying the structured semantic representation, the answer accuracy is improved by 1.4% and 4% over the hierarchical joint attention model and the deep LSTM model respectively. Training with the dual-channel network improves the answer accuracy of the hierarchical joint attention model by 1.01%, and improves the proposed visual question-answering model based on structured semantic representation by a further 0.7%. The method therefore effectively improves the performance of the visual question-answering model and has a certain generality.
The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the specific embodiments described above; various changes or modifications may be made by those skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and the features of the embodiments of the present application may be combined with each other arbitrarily provided there is no conflict.
Claims (6)
1. A visual question-answering method based on structured semantic representation is characterized by comprising the following steps:
extracting image features of an input image through a convolutional neural network;
extracting a word vector of each word of the input question related to the input image through a pre-trained word embedding model;
weighting the image features and the word vectors to obtain weighted image feature vectors and weighted text feature vectors;
converting the word vector into a structured semantic representation vector through a Tree-LSTM network;
fusing the image feature vector, the text feature vector and the structured semantic representation vector to obtain a corresponding fused feature vector;
and taking the fused feature vector as the input of a prediction model, and outputting the answer corresponding to the input question by the prediction model.
2. The visual question-answering method based on the structured semantic representation according to claim 1, wherein the extracting of the image features of the input image through a convolutional neural network comprises:
cropping the input image into image blocks of a preset size;
and inputting the image block into a convolutional neural network, and taking the output of the last pooling layer of the convolutional neural network as the image feature of the input image.
3. The visual question-answering method based on the structured semantic representation according to claim 1, wherein weighting the image features and the word vectors to obtain weighted image feature vectors and text feature vectors comprises:
weighting the image features and the word vectors through a preset attention function to obtain a weighted image feature vector and a weighted text feature vector; the attention operation is defined as x̂ = A(X; g), where X is the unweighted image feature or word vector matrix, g is the corresponding image or text attention guidance vector, representing the degree of matching between the image and the text, and x̂ is the weighted feature vector: if X is the image feature, the weighted image feature vector is obtained; if X is the word vector, the weighted text feature vector is obtained; the attention function is calculated as:

H = tanh(W_x X + (W_g g) 1^T)
a^x = softmax(w_hx^T H)
x̂ = Σ_{i=1}^{N} a_i^x x_i

wherein: H represents the intermediate state of the attention calculation; tanh is the hyperbolic tangent nonlinear activation function; W_x, W_g and w_hx are the three parameters to be learned; 1 is the unit vector with all elements equal to 1; a^x is the vector of importance probability values calculated from the text for each image region; softmax is the multi-class normalized exponential function; a_i^x is the weighted probability value of the i-th image region; x_i is the i-th image region; x̂ is the weighted overall image feature vector; and i ∈ [1, N], where N is the total number of image regions.
4. The visual question-answering method based on structured semantic representation according to claim 1, wherein the word vector is converted into a structured semantic representation vector through a Tree-LSTM network, comprising:
obtaining a tree structure of each input question;
calculating a state vector corresponding to each node position on the tree structure, wherein each node position corresponds to one LSTM unit; the LSTM cell includes: an input gate, an output gate, and a memory cell;
and taking the state vector and the word vector corresponding to each node position as the input of the LSTM unit, and outputting the corresponding structured semantic representation vector by the LSTM unit.
5. The visual question-answering method based on the structured semantic representation according to claim 1, wherein the fusion processing of the image feature vector, the text feature vector and the structured semantic representation vector comprises:
and fusing the image feature vector, the text feature vector and the structured semantic representation vector by element-wise addition of the vectors.
6. The visual question-answering method based on structured semantic representation according to any one of claims 1 to 5, characterized by further comprising:
establishing an initial visual question-answering model comprising the convolutional neural network, a pre-trained word embedding model, a Tree-LSTM network and a prediction model;
taking a training image and an answer corresponding to the training image as the input of the initial visual question-answering model, training a preliminary model according to the existing label information, and outputting a predicted candidate answer by the initial visual question-answering model;
taking the training image, the predicted candidate answer and a supplementary image as the input of the initial visual question-answering model, and outputting a judgment result of whether the predicted candidate answer is correct or not by the initial visual question-answering model;
if the judgment result is correct, the training image, the predicted candidate answer and the supplementary image form a positive sample; if the judgment result is wrong, the training image, the predicted candidate answer and the supplementary image form a negative sample;
and retraining the initial visual question-answering model with the positive samples and the negative samples, setting a binary classification loss function on the discrimination result to adjust the parameters of the model during training, to obtain the optimized visual question-answering model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811164612.9A CN111008293A (en) | 2018-10-06 | 2018-10-06 | Visual question-answering method based on structured semantic representation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811164612.9A CN111008293A (en) | 2018-10-06 | 2018-10-06 | Visual question-answering method based on structured semantic representation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111008293A true CN111008293A (en) | 2020-04-14 |
Family
ID=70110598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811164612.9A Pending CN111008293A (en) | 2018-10-06 | 2018-10-06 | Visual question-answering method based on structured semantic representation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111008293A (en) |
2018-10-06: Application CN201811164612.9A filed; published as CN111008293A, legal status: active, pending.
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106951473A (en) * | 2017-03-06 | 2017-07-14 | 浙江大学 | Construction method of a deep visual question-answering system for visually impaired people |
CN108170816A (en) * | 2017-12-31 | 2018-06-15 | 厦门大学 | Intelligent visual question-answering model based on a deep neural network |
CN108446404A (en) * | 2018-03-30 | 2018-08-24 | 中国科学院自动化研究所 | Retrieval method and system for pointing questions in unconstrained visual question answering |
Non-Patent Citations (1)
Title |
---|
余东晨: "Structured semantic representation for visual question answering", IEEE International Conference on Image Processing * |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111539292A (en) * | 2020-04-17 | 2020-08-14 | 中山大学 | Action decision model and method for embodied scene question-answering tasks |
CN111539292B (en) * | 2020-04-17 | 2023-07-07 | 中山大学 | Action decision model and method for embodied scene question-answering tasks |
CN111581335A (en) * | 2020-05-14 | 2020-08-25 | 腾讯科技(深圳)有限公司 | Text representation method and device |
CN111581335B (en) * | 2020-05-14 | 2023-11-24 | 腾讯科技(深圳)有限公司 | Text representation method and device |
CN111680484A (en) * | 2020-05-29 | 2020-09-18 | 北京理工大学 | Answer model generation method and system for visual commonsense reasoning question answering |
CN111680484B (en) * | 2020-05-29 | 2023-04-07 | 北京理工大学 | Answer model generation method and system for visual commonsense reasoning question answering |
CN111832403A (en) * | 2020-06-04 | 2020-10-27 | 北京百度网讯科技有限公司 | Document structure recognition method, and model training method and device for document structure recognition |
CN111797811A (en) * | 2020-07-20 | 2020-10-20 | 广东工业大学 | Blind person navigation system based on image understanding |
CN111797811B (en) * | 2020-07-20 | 2023-09-12 | 广东工业大学 | Blind person navigation system based on image understanding |
CN112036276A (en) * | 2020-08-19 | 2020-12-04 | 北京航空航天大学 | Artificial intelligent video question-answering method |
CN112651403B (en) * | 2020-12-02 | 2022-09-06 | 浙江大学 | Zero-sample visual question-answering method based on semantic embedding |
CN112651403A (en) * | 2020-12-02 | 2021-04-13 | 浙江大学 | Zero-sample visual question-answering method based on semantic embedding |
CN112818889A (en) * | 2021-02-09 | 2021-05-18 | 北京工业大学 | Method for fusing visual question-answer accuracy with a hypernetwork based on dynamic attention |
CN112818889B (en) * | 2021-02-09 | 2024-05-28 | 北京工业大学 | Method for fusing visual question-answer accuracy with a hypernetwork based on dynamic attention |
CN113010712A (en) * | 2021-03-04 | 2021-06-22 | 天津大学 | Visual question answering method based on multi-graph fusion |
CN113157889A (en) * | 2021-04-21 | 2021-07-23 | 韶鼎人工智能科技有限公司 | Visual question-answering model construction method based on topic loss |
CN113128415B (en) * | 2021-04-22 | 2023-09-29 | 合肥工业大学 | Environment recognition method, system, device and storage medium |
CN113128415A (en) * | 2021-04-22 | 2021-07-16 | 合肥工业大学 | Environment recognition method, system, device and storage medium |
CN113065347B (en) * | 2021-04-26 | 2022-06-28 | 上海交通大学 | Criminal case judgment prediction method, system and medium based on multitask learning |
CN113065347A (en) * | 2021-04-26 | 2021-07-02 | 上海交通大学 | Criminal case judgment prediction method, system and medium based on multitask learning |
CN113869349A (en) * | 2021-08-04 | 2021-12-31 | 西南交通大学 | Diagram question-answering method based on hierarchical multi-task learning |
CN113792177B (en) * | 2021-08-05 | 2023-02-03 | 杭州电子科技大学 | Scene character visual question-answering method based on knowledge-guided deep attention network |
CN113792177A (en) * | 2021-08-05 | 2021-12-14 | 杭州电子科技大学 | Scene character visual question-answering method based on knowledge-guided deep attention network |
CN113505246A (en) * | 2021-09-11 | 2021-10-15 | 腾讯科技(深圳)有限公司 | Data processing method, device, terminal equipment and storage medium |
CN113849623B (en) * | 2021-10-12 | 2024-04-09 | 中国传媒大学 | Text visual question-answering method and device |
CN113849623A (en) * | 2021-10-12 | 2021-12-28 | 中国传媒大学 | Text visual question answering method and device |
CN115062208A (en) * | 2022-05-30 | 2022-09-16 | 苏州浪潮智能科技有限公司 | Data processing method and system and computer equipment |
CN115062208B (en) * | 2022-05-30 | 2024-01-23 | 苏州浪潮智能科技有限公司 | Data processing method, system and computer equipment |
CN114880527A (en) * | 2022-06-09 | 2022-08-09 | 哈尔滨工业大学(威海) | Multi-modal knowledge graph representation method based on multi-prediction task |
CN116168796A (en) * | 2023-03-03 | 2023-05-26 | 杭州电子科技大学 | Medical image report structured generation method based on visual question and answer |
CN116168796B (en) * | 2023-03-03 | 2023-11-10 | 杭州电子科技大学 | Medical image report structured generation method based on visual question and answer |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111008293A (en) | Visual question-answering method based on structured semantic representation | |
CN107239446B (en) | Intelligent relation extraction method based on neural networks and attention mechanism | |
CN109992780B (en) | Target-specific sentiment classification method based on deep neural networks | |
CN110609891A (en) | Visual dialog generation method based on context awareness graph neural network | |
CN109783666B (en) | Image scene graph generation method based on iterative refinement | |
CN110390397B (en) | Text entailment recognition method and device | |
CN107562792A (en) | Question-answer matching method based on deep learning | |
CN110033008B (en) | Image description generation method based on modal transformation and text induction | |
CN111966812B (en) | Automatic question answering method based on dynamic word vector and storage medium | |
CN112765312B (en) | Knowledge graph question-answering method and system based on graph neural network embedded matching | |
CN114565104A (en) | Language model pre-training method, result recommendation method and related device | |
CN113535904B (en) | Aspect level emotion analysis method based on graph neural network | |
CN110909736A (en) | Image description method based on long-short term memory model and target detection algorithm | |
CN109492223A (en) | Chinese missing-pronoun completion method based on neural network reasoning | |
CN113779220A (en) | Mongolian multi-hop question-answering method based on three-channel cognitive map and graph attention network | |
CN114417851A (en) | Emotion analysis method based on keyword weighted information | |
CN116662500A (en) | Method for constructing question-answering system based on BERT model and external knowledge graph | |
CN112988970A (en) | Text matching algorithm for intelligent question-answering systems | |
CN111914553A (en) | Method for judging negative subjects of financial information based on machine learning | |
CN115964459B (en) | Multi-hop reasoning question-answering method and system based on a food safety cognitive graph | |
CN116385937A (en) | Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework | |
CN110929013A (en) | Image question-answer implementation method based on bottom-up entry and positioning information fusion | |
CN114357166B (en) | Text classification method based on deep learning | |
CN115906846A (en) | Document-level named entity identification method based on double-graph hierarchical feature fusion | |
CN113239678B (en) | Multi-angle attention feature matching method and system for answer selection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||