CN111008293A - Visual question-answering method based on structured semantic representation - Google Patents

Visual question-answering method based on structured semantic representation

Info

Publication number
CN111008293A
CN111008293A (application CN201811164612.9A)
Authority
CN
China
Prior art keywords: image, vector, input, word, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811164612.9A
Other languages
Chinese (zh)
Inventor
Hongkai Xiong
Dongchen Yu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201811164612.9A
Publication of CN111008293A
Legal status: Pending

Classifications

    • G06F 18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06N 3/045: Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/424: Image or video recognition or understanding; extraction of features; syntactic representation, e.g. by using alphabets or grammars


Abstract

The invention provides a visual question-answering method based on structured semantic representation, which extracts image features of an input image through a convolutional neural network; extracts a word vector for each word of the input question related to the input image through a pre-trained word embedding model; weights the image features and the word vectors to obtain a weighted image feature vector and a weighted text feature vector; converts the word vectors into a structured semantic representation vector through a Tree-LSTM network; fuses the image feature vector, the text feature vector and the structured semantic representation vector to obtain a corresponding fused feature vector; and takes the fused feature vector as the input of a prediction model, which outputs the answer corresponding to the input question. The method extracts richer semantic information from the question and improves the performance of the prediction model through multi-layer training optimization, thereby improving the accuracy of the answers.

Description

Visual question-answering method based on structured semantic representation
Technical Field
The invention relates to the technical field of computer vision, in particular to a visual question answering method based on structured semantic representation.
Background
In the field of computer vision, visual question answering is a frontier and challenging problem: given a natural image, any question related to the image content may be asked. To predict the answer accurately, the visual question-answering model must fully capture the data and represent it robustly. The training scheme of the model is also critical, as the classifier boundary must be located accurately. Since language itself is compositional, different questions often share similar substructures, which means the reasoning process in visual question answering must likewise be compositional in order to extract useful information.
A search of the prior art shows that Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng and Alex Smola, in "Stacked attention networks for image question answering" (IEEE Conference on Computer Vision and Pattern Recognition, 2016), proposed a multi-layer attention mechanism in which image feature weights are computed from the text features multiple times, so that more accurate weighting information is obtained through repeated weighting. Jiasen Lu, Jianwei Yang, Dhruv Batra and Devi Parikh, in "Hierarchical question-image co-attention for visual question answering" (Advances in Neural Information Processing Systems, 2016), proposed a hierarchical co-attention approach that performs multi-layer feature extraction on the text to obtain three levels of feature vectors and weights the image and text feature vectors jointly.
Both methods use a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) network to extract text features; each word is treated relatively independently, and only the relation between the current word and the preceding words is considered. This chained structure, however, ignores the semantic structure information in the text, and this overlooked structural information is important in the visual question-answering process.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a visual question-answering method based on structured semantic representation.
The embodiment of the invention provides a visual question-answering method based on structured semantic representation, which comprises the following steps:
extracting image features of an input image through a convolutional neural network;
extracting a word vector for each word of the input question related to the input image through a pre-trained word embedding model;
weighting the image features and the word vectors to obtain weighted image feature vectors and weighted text feature vectors;
converting the word vector into a structured semantic representation vector through a Tree-LSTM network;
fusing the image feature vector, the text feature vector and the structured semantic representation vector to obtain a corresponding fused feature vector;
and taking the fused feature vector as the input of a prediction model, wherein the prediction model outputs the answer corresponding to the input question.
Optionally, extracting the image feature of the input image by a convolutional neural network includes:
cropping the input image into image blocks of a preset size;
and inputting the image block into a convolutional neural network, and taking the output of the last pooling layer of the convolutional neural network as the image feature of the input image.
Optionally, performing weighting processing on the image feature and the word vector to obtain a weighted image feature vector and a weighted text feature vector, including:
weighting the image features and the word vectors through a preset attention function to obtain the weighted image feature vector and the weighted text feature vector. The attention operation is defined as

$$\hat{x} = \mathcal{A}(X; g)$$

where $X$ is the unweighted image feature or word vector matrix, $g$ is the corresponding image or text attention guidance, representing the degree of matching between the image and the text, and $\hat{x}$ is the weighted feature vector: if $X$ is the image features, the weighted image feature vector is obtained; if $X$ is the word vectors, the weighted text feature vector is obtained. The attention function is calculated as:

$$H = \tanh\left(W_x X + (W_g g)\mathbf{1}^T\right)$$

$$a^x = \operatorname{softmax}\left(w_{hx}^T H\right)$$

$$\hat{x} = \sum_{i=1}^{N} a_i^x x_i$$

wherein $H$ represents an intermediate state of the attention computation; $\tanh$ is the hyperbolic tangent nonlinear activation function; $W_x$, $W_g$ and $w_{hx}$ are the three parameters to be learned; $\mathbf{1}$ is a vector with all elements equal to 1; $a^x$ represents the importance probability of each image region calculated from the text; $\operatorname{softmax}$ is the multi-class normalized exponential function; $a_i^x$ represents the weighted probability of the $i$-th image region; $x_i$ represents the $i$-th image region; $\hat{x}$ represents the weighted overall image feature vector; $i \in [1, N]$; and $N$ denotes the total number of image regions.
Optionally, converting the word vector into a structured semantic representation vector through a Tree-LSTM network, comprising:
obtaining a tree structure of each input question;
calculating a state vector corresponding to each node position on the tree structure, wherein each node position corresponds to one LSTM unit; the LSTM cell includes: an input gate, an output gate, and a memory cell;
and taking the state vector and the word vector corresponding to each node position as the input of the LSTM unit, and outputting the corresponding structured semantic representation vector by the LSTM unit.
Optionally, the fusing the image feature vector, the text feature vector, and the structured semantic representation vector includes:
and fusing the image feature vector, the text feature vector and the structured semantic representation vector by element-wise addition.
Optionally, the method further comprises:
establishing an initial visual question-answering model comprising the convolutional neural network, a pre-trained word embedding model, a Tree-LSTM network and a prediction model;
taking a training image and an answer corresponding to the training image as the input of the initial visual question-answering model, training a preliminary model according to the existing label information, and outputting a predicted candidate answer by the initial visual question-answering model;
taking the training image, the predicted candidate answer and a supplementary image as the input of the initial visual question-answering model, and outputting a judgment result of whether the predicted candidate answer is correct or not by the initial visual question-answering model;
if the judgment result is correct, the training image, the predicted candidate answer and the supplementary image form a positive sample; if the judgment result is wrong, the training image, the predicted candidate answer and the supplementary image form a negative sample;
and retraining the initial visual question-answering model with the positive and negative samples, setting a binary classification loss function on the discrimination result to adjust the parameters of the model during training, thereby obtaining the optimized visual question-answering model.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, the structural information in the language is applied to the process of visual question answering, richer semantic information can be obtained by means of the Tree-LSTM network, the feature dimension is enriched, and the performance of model answer prediction is improved. In the training process, a dual-channel network structure is provided, and more information in new data can be obtained. The performance of model prediction can be further improved by a multi-layer training optimization mode. Therefore, the accuracy of the answers of the visual question answering method is finally improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a visual question-answering model of the present invention;
FIG. 2 is a schematic diagram of a two-channel network architecture input;
FIG. 3 is a flowchart of a visual question-answering method based on structured semantic representation according to an embodiment of the present invention;
fig. 4 is a flowchart of model training of a visual question-answering method based on structured semantic representation according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Fig. 1 is a schematic diagram of the visual question-answering model of the present invention. As shown in Fig. 1, image features are first extracted with a convolutional neural network; to preserve the position information of the image, the output of a pooling layer is usually selected as the image features. A word vector for each word is obtained with the trained word embedding model. The Tree-LSTM network then extracts the structural information of the text to obtain the structured semantic representation feature vector. A joint attention mechanism weights the word vectors and the image feature vectors by importance, computing the importance of the different image regions and of the different words in the sentence. The weighted image and text feature vectors and the structured semantic representation vector are fused as multi-modal information, and the fused feature vector is used to train the classifier. Fig. 2 is a schematic input diagram of the two-channel network architecture, which, as shown in Fig. 2, is used to further optimize the model parameters.
Optionally, in the feature extraction process, multi-layer convolution may be adopted to obtain multi-layer text feature vectors, so that more information is available to the question-answering model.
Optionally, in the feature alignment process, a multi-layer bidirectional attention mechanism may be adopted, so that important image regions and words are found more accurately and irrelevant information is filtered out, which can improve the performance of the visual question-answering model.
Optionally, the feature fusion method directly affects the final feature vector. Besides basic element-wise addition and multiplication, more elaborate fusion methods can be adopted, which have a more direct effect on classifier training.
Fig. 3 is a flowchart of a visual question-answering method based on structured semantic representation according to an embodiment of the present invention, and as shown in fig. 3, the method in this embodiment may include:
and S101, extracting image features of the input image through a convolutional neural network.
In this embodiment, the input image is cropped into image blocks of a preset size; the image blocks are input into a convolutional neural network, and the output of the last pooling layer of the convolutional neural network is taken as the image features of the input image. Specifically, the input image is cropped to 224 × 224 and input into the pre-trained VGG-19 network, and the pooling-layer output is taken as the image features, with dimensions 512 × 14 × 14.
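The following is a minimal PyTorch/torchvision sketch of this step (the file name and the use of a recent torchvision weights API are assumptions; truncating VGG-19 before its final max-pool is one way to obtain the 512 × 14 × 14 map described above from a 224 × 224 input):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# VGG-19 pretrained on ImageNet; dropping the final max-pool leaves a
# 512 x 14 x 14 feature map for a 224 x 224 input, as stated in the text.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(vgg.features.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    feat = feature_extractor(img)       # (1, 512, 14, 14)
feat = feat.flatten(2).transpose(1, 2)  # (1, 196, 512): N = 196 image regions
```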
S102, extracting a word vector for each word of the input question related to the input image through a pre-trained word embedding model.
In this embodiment, a word vector of each word is obtained by using a trained word embedding model, and the dimension is 512.
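As an illustration only, the sketch below uses a randomly initialized embedding table as a stand-in for the pre-trained word embedding model; the toy vocabulary and lookup interface are hypothetical, and only the 512-dimensional output matches the text:

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary; a real system would load the pre-trained
# embedding weights and their accompanying vocabulary.
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "cat": 4}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)

question = "what color is the cat"
ids = torch.tensor([vocab[w] for w in question.lower().split()])
word_vectors = embedding(ids)  # (5, 512): one 512-d word vector per word
```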
S103, weighting the image features and the word vectors to obtain weighted image feature vectors and weighted text feature vectors.
In this embodiment, the image features and the word vectors are weighted through a preset attention function to obtain the weighted image feature vector and the weighted text feature vector. The attention operation is defined as

$$\hat{x} = \mathcal{A}(X; g)$$

where $X$ is the unweighted image feature or word vector matrix, $g$ is the corresponding image or text attention guidance, representing the degree of matching between the image and the text, and $\hat{x}$ is the weighted feature vector: if $X$ is the image features, the weighted image feature vector is obtained; if $X$ is the word vectors, the weighted text feature vector is obtained. The attention function is calculated as:

$$H = \tanh\left(W_x X + (W_g g)\mathbf{1}^T\right)$$

$$a^x = \operatorname{softmax}\left(w_{hx}^T H\right)$$

$$\hat{x} = \sum_{i=1}^{N} a_i^x x_i$$

wherein $H$ represents an intermediate state of the attention computation; $\tanh$ is the hyperbolic tangent nonlinear activation function; $W_x$, $W_g$ and $w_{hx}$ are the three parameters to be learned; $\mathbf{1}$ is a vector with all elements equal to 1; $a^x$ represents the importance probability of each image region calculated from the text; $\operatorname{softmax}$ is the multi-class normalized exponential function; $a_i^x$ represents the weighted probability of the $i$-th image region; $x_i$ represents the $i$-th image region; $\hat{x}$ represents the weighted overall image feature vector; $i \in [1, N]$; and $N$ denotes the total number of image regions.
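The attention function above can be rendered as the following sketch (a non-authoritative PyTorch version; the hidden size k, the batch layout and the module names are assumptions not fixed by the text):

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, d=512, k=256):
        super().__init__()
        self.W_x = nn.Linear(d, k, bias=False)   # W_x
        self.W_g = nn.Linear(d, k, bias=False)   # W_g
        self.w_hx = nn.Linear(k, 1, bias=False)  # w_hx

    def forward(self, X, g):
        # X: (B, N, d) unweighted regions or words; g: (B, d) guidance vector.
        H = torch.tanh(self.W_x(X) + self.W_g(g).unsqueeze(1))  # broadcast = (W_g g) 1^T
        a = torch.softmax(self.w_hx(H).squeeze(-1), dim=1)      # (B, N) importance probs
        return (a.unsqueeze(-1) * X).sum(dim=1)                 # x_hat = sum_i a_i x_i

att = Attention()
x_hat = att(torch.randn(2, 196, 512), torch.randn(2, 512))     # (2, 512)
```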
And S104, converting the word vector into a structural semantic expression vector through a Tree-LSTM network.
In this embodiment, the tree structure of each input question is obtained; a state vector is calculated for each node position on the tree structure, each node position corresponding to one LSTM unit comprising an input gate, an output gate and a memory cell; the state vector and the word vector corresponding to each node position are taken as the input of the LSTM unit, which outputs the corresponding structured semantic representation vector.
Specifically, the syntax tree structure of the question sentence is obtained first; each word corresponds to a tree node in the syntax tree, and each node corresponds to an LSTM unit comprising an input gate $i_j$, an output gate $o_j$, a hidden-layer state value $h_j$ and a memory cell $c_j$. The state update of each node is determined by its child nodes:

$$\tilde{h}_j = \sum_{k \in C(j)} h_k$$

$$i_j = \sigma\left(W_i x_j + U_i \tilde{h}_j + b_i\right)$$

$$f_{jk} = \sigma\left(W_f x_j + U_f h_k + b_f\right)$$

$$o_j = \sigma\left(W_o x_j + U_o \tilde{h}_j + b_o\right)$$

$$u_j = \tanh\left(W_u x_j + U_u \tilde{h}_j + b_u\right)$$

$$c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k$$

$$h_j = o_j \odot \tanh(c_j)$$

wherein $C(j)$ denotes the set of all child nodes of the $j$-th node; $W_i$, $U_i$, $b_i$, $W_o$, $U_o$, $b_o$, $W_f$, $U_f$, $b_f$ are the parameters to be learned by the input, output and forget units respectively (with $W_u$, $U_u$, $b_u$ for the candidate update); $i_j$, $f_{jk}$, $o_j$ respectively represent the input control quantity, forget quantity and output quantity; $\sigma$ and $\tanh$ respectively denote the sigmoid and hyperbolic tangent activation functions; $c_j$ represents the state value of the memory cell at the $j$-th node position; $\odot$ is the element-wise product; and $x_j$ is the word vector at the $j$-th node. The input to each node consists of the states of its child nodes and the word vector of the current node. The state value at the final root node position is used in the visual question-answer modeling process as the representation of the entire sequence.
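One possible node update implementing the equations above is sketched below (parsing the question into its syntax tree and traversing it bottom-up are assumed to happen elsewhere; module and variable names are illustrative):

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, x_dim, h_dim):
        super().__init__()
        self.W_iou = nn.Linear(x_dim, 3 * h_dim)          # W_i, W_o, W_u stacked
        self.U_iou = nn.Linear(h_dim, 3 * h_dim, bias=False)
        self.W_f = nn.Linear(x_dim, h_dim)                # forget gate, one per child
        self.U_f = nn.Linear(h_dim, h_dim, bias=False)

    def forward(self, x_j, child_h, child_c):
        # x_j: (x_dim,) word vector at node j;
        # child_h, child_c: (K, h_dim) states of the K children (K may be 0).
        h_tilde = child_h.sum(dim=0)                      # sum of child hidden states
        i, o, u = (self.W_iou(x_j) + self.U_iou(h_tilde)).chunk(3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x_j) + self.U_f(child_h))   # f_jk for each child k
        c_j = i * u + (f * child_c).sum(dim=0)            # c_j = i (.) u + sum f_jk (.) c_k
        h_j = o * torch.tanh(c_j)                         # h_j = o (.) tanh(c_j)
        return h_j, c_j

cell = ChildSumTreeLSTMCell(512, 512)
# Leaf node: no children, so empty (0, 512) child-state tensors.
h_leaf, c_leaf = cell(torch.randn(512), torch.zeros(0, 512), torch.zeros(0, 512))
```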
And S105, carrying out fusion processing on the image feature vector, the text feature vector and the structured semantic expression vector to obtain a corresponding fusion feature vector.
In this embodiment, the image feature vector, the text feature vector and the structured semantic representation vector are fused by basic element-wise addition.
And S106, taking the fusion feature vector as the input of the prediction model, and outputting the answer corresponding to the input question by the prediction model.
In this embodiment, the fused feature vector is used as an input of the prediction model, and the prediction model outputs an answer corresponding to the input question.
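Steps S105 and S106 might look as follows in code (a sketch; the classifier architecture and the size of the candidate-answer set are assumptions, as the text specifies only element-wise addition followed by a prediction model):

```python
import torch
import torch.nn as nn

num_answers = 1000   # assumed size of the candidate-answer set

# Stand-ins for the three 512-d vectors produced by S103 and S104.
v_img  = torch.randn(1, 512)   # weighted image feature vector
v_txt  = torch.randn(1, 512)   # weighted text feature vector
v_tree = torch.randn(1, 512)   # structured semantic representation vector

fused = v_img + v_txt + v_tree           # S105: element-wise addition

classifier = nn.Sequential(              # S106: prediction model (assumed MLP)
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, num_answers),
)
answer_id = classifier(fused).argmax(dim=-1)  # index of the predicted answer
```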
Fig. 4 is a flowchart of model training of a visual question-answering method based on a structured semantic representation according to an embodiment of the present invention, and as shown in fig. 4, before executing S101, the method further includes:
s201, establishing an initial visual question-answer model comprising a convolutional neural network, a pre-trained word embedding model, a Tree-LSTM network and a prediction model.
S202, taking the training images and the answers corresponding to the training images as the input of the initial visual question-answer model, and outputting predicted candidate answers by the initial visual question-answer model.
S203, the training image, the predicted candidate answer and the supplementary image are used as the input of the initial visual question-answering model, and the initial visual question-answering model outputs the judgment result of whether the predicted candidate answer is correct or not.
S204, if the judgment result is correct, the training image, the predicted candidate answer and the supplementary image form a positive sample; if the judgment result is wrong, the training image, the predicted candidate answer and the supplementary image form a negative sample.
S205, retraining the initial visual question-answering model with the positive and negative samples, and adjusting the parameters of the model during training by setting a loss function on the judgment result, to obtain the optimized visual question-answering model.
In this embodiment, the training images and the answers corresponding to the training images are taken as the input of the initial visual question-answering model to predict candidate answers; this is a multi-class classification problem whose loss function is the cross-entropy (mutual information entropy) function, and minimizing it completes the first layer of training optimization. The training image, the predicted candidate answer and the supplementary image are then taken as the input of the initial visual question-answering model to judge whether the predicted candidate answer is correct; this is a binary classification problem whose loss function is also the cross-entropy function, and minimizing it completes the second layer of training optimization.
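The two layers of training optimization can be sketched as two cross-entropy objectives (all module names, shapes and the random stand-in inputs below are illustrative, not the patent's interfaces):

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()   # cross-entropy ("mutual information entropy") loss
num_answers, batch = 1000, 8

# Stand-ins for the fused features of the two input channels of Fig. 2.
fused_qa  = torch.randn(batch, 512)   # channel 1 features (ground-truth answers supervise loss1)
fused_jdg = torch.randn(batch, 512)   # channel 2 features (image, candidate answer, supplementary image)

answer_head = nn.Linear(512, num_answers)   # layer 1: multi-class answer prediction
judge_head  = nn.Linear(512, 2)             # layer 2: binary correct/incorrect judgment

# First-layer training optimization: minimize cross-entropy over candidate answers.
loss1 = ce(answer_head(fused_qa), torch.randint(num_answers, (batch,)))
loss1.backward()

# Second-layer training optimization: minimize the binary cross-entropy of the
# judgment over the positive/negative samples constructed in S204.
loss2 = ce(judge_head(fused_jdg), torch.randint(2, (batch,)))
loss2.backward()
```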
Following the above steps, experiments were carried out using the method of the Summary of the Invention. The data come from the public VQA 2.0 dataset, comprising 82783 training images, 40504 validation images and 81434 test images, with 443757 questions for training and 447793 test questions. The visual question-answering model based on structured semantic representation was compared with the hierarchical co-attention model of Jiasen Lu et al. ("Hierarchical question-image co-attention for visual question answering") and with a deep LSTM model.
Results: with the structured semantic representation, the answer accuracy improves by 1.4% and 4% over the hierarchical co-attention model and the deep LSTM model, respectively. Training with the two-channel network improves the answer accuracy of the hierarchical co-attention model by 1.01% and that of the proposed structured-semantic-representation model by 0.7%. The method therefore effectively improves the performance of visual question-answering models and has a certain generality.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (6)

1. A visual question-answering method based on structured semantic representation is characterized by comprising the following steps:
extracting image features of an input image through a convolutional neural network;
extracting a word vector for each word of the input question related to the input image through a pre-trained word embedding model;
weighting the image features and the word vectors to obtain weighted image feature vectors and weighted text feature vectors;
converting the word vector into a structured semantic representation vector through a Tree-LSTM network;
fusing the image feature vector, the text feature vector and the structured semantic representation vector to obtain a corresponding fused feature vector;
and taking the fused feature vector as the input of a prediction model, wherein the prediction model outputs the answer corresponding to the input question.
2. The visual question-answering method based on the structured semantic representation according to claim 1, wherein the extracting of the image features of the input image through a convolutional neural network comprises:
cropping the input image into image blocks of a preset size;
and inputting the image block into a convolutional neural network, and taking the output of the last pooling layer of the convolutional neural network as the image feature of the input image.
3. The visual question-answering method based on the structured semantic representation according to claim 1, wherein weighting the image features and the word vectors to obtain weighted image feature vectors and text feature vectors comprises:
weighting the image features and the word vectors through a preset attention function to obtain the weighted image feature vector and the weighted text feature vector. The attention operation is defined as

$$\hat{x} = \mathcal{A}(X; g)$$

where $X$ is the unweighted image feature or word vector matrix, $g$ is the corresponding image or text attention guidance, representing the degree of matching between the image and the text, and $\hat{x}$ is the weighted feature vector: if $X$ is the image features, the weighted image feature vector is obtained; if $X$ is the word vectors, the weighted text feature vector is obtained. The attention function is calculated as:

$$H = \tanh\left(W_x X + (W_g g)\mathbf{1}^T\right)$$

$$a^x = \operatorname{softmax}\left(w_{hx}^T H\right)$$

$$\hat{x} = \sum_{i=1}^{N} a_i^x x_i$$

wherein $H$ represents an intermediate state of the attention computation; $\tanh$ is the hyperbolic tangent nonlinear activation function; $W_x$, $W_g$ and $w_{hx}$ are the three parameters to be learned; $\mathbf{1}$ is a vector with all elements equal to 1; $a^x$ represents the importance probability of each image region calculated from the text; $\operatorname{softmax}$ is the multi-class normalized exponential function; $a_i^x$ represents the weighted probability of the $i$-th image region; $x_i$ represents the $i$-th image region; $\hat{x}$ represents the weighted overall image feature vector; $i \in [1, N]$; and $N$ denotes the total number of image regions.
4. The visual question-answering method based on structured semantic representation according to claim 1, wherein the word vector is converted into a structured semantic representation vector through a Tree-LSTM network, comprising:
obtaining a tree structure of each input question;
calculating a state vector corresponding to each node position on the tree structure, wherein each node position corresponds to one LSTM unit; the LSTM cell includes: an input gate, an output gate, and a memory cell;
and taking the state vector and the word vector corresponding to each node position as the input of the LSTM unit, and outputting the corresponding structured semantic representation vector by the LSTM unit.
5. The visual question-answering method based on the structured semantic representation according to claim 1, wherein the fusion processing of the image feature vector, the text feature vector and the structured semantic representation vector comprises:
and fusing the image feature vector, the text feature vector and the structured semantic representation vector by element-wise addition.
6. The visual question-answering method based on structured semantic representation according to any one of claims 1 to 5, characterized by further comprising:
establishing an initial visual question-answering model comprising the convolutional neural network, a pre-trained word embedding model, a Tree-LSTM network and a prediction model;
taking a training image and an answer corresponding to the training image as the input of the initial visual question-answering model, training a preliminary model according to the existing label information, and outputting a predicted candidate answer by the initial visual question-answering model;
taking the training image, the predicted candidate answer and a supplementary image as the input of the initial visual question-answering model, and outputting a judgment result of whether the predicted candidate answer is correct or not by the initial visual question-answering model;
if the judgment result is correct, the training image, the predicted candidate answer and the supplementary image form a positive sample; if the judgment result is wrong, the training image, the predicted candidate answer and the supplementary image form a negative sample;
and retraining the initial visual question-answering model with the positive and negative samples, setting a binary classification loss function on the discrimination result to adjust the parameters of the model during training, thereby obtaining the optimized visual question-answering model.
CN201811164612.9A 2018-10-06 2018-10-06 Visual question-answering method based on structured semantic representation Pending CN111008293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811164612.9A CN111008293A (en) 2018-10-06 2018-10-06 Visual question-answering method based on structured semantic representation


Publications (1)

Publication Number Publication Date
CN111008293A true CN111008293A (en) 2020-04-14

Family

ID=70110598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811164612.9A Pending CN111008293A (en) 2018-10-06 2018-10-06 Visual question-answering method based on structured semantic representation

Country Status (1)

Country Link
CN (1) CN111008293A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination