CN115512191A - Question and answer combined image natural language description method - Google Patents


Info

Publication number
CN115512191A
CN115512191A
Authority
CN
China
Prior art keywords: image, question, model, segmentation, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211150406.9A
Other languages
Chinese (zh)
Inventor
卫志华
刘官明
张恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN202211150406.9A
Publication of CN115512191A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/778 Active pattern-learning, e.g. online learning of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A question-and-answer combined image natural language description method comprises three steps. In step one, an image segmentation model extracts features of the image targets and the image background to obtain distinct pixel-level classifications and produce segmentation feature maps of the targets and the background. In step two, a question generation module constructs an implicit scene-type representation to produce a relation feature map containing attention-target information, and generates several semantically related guide questions at multiple granularities. In step three, a joint question-answering module introduces a contrastive learning loss function and jointly embeds the relation feature map and the guide questions into a multi-modal representation; after training, it can generate long-text answers to the questions that serve as a refined semantic description of the image content.

Description

Question and answer combined image natural language description method
Technical Field
The present invention is in the field of computer vision and natural language processing.
Background
Image description generation is a multi-modal task spanning text and images whose goal is to produce a natural language description of a given picture. The task is easy for humans but remains very challenging for computers. With the rise of deep learning, more and more work attempts to solve the machine image description problem with neural networks.
However, natural language descriptions are diverse and follow no single standard form, so training on a specific dataset only yields image content descriptions that match that dataset's distribution, which limits the range of application. Moreover, most image description methods passively generate monotonous sentences without considering the relations between the scenes in the image and the attention targets, so distinct, multi-scale targets are hard to describe at the same time and latent relations are easily overlooked. Such mechanical image descriptions deviate greatly from human semantic perception of images and often cannot be understood in interaction with real humans.
Image description generation is usually divided into two sub-modules: image feature extraction and text generation. A common image feature extraction method uses an image recognition neural network to extract targets, but this can cause loss of image information and bias in feature extraction. Meanwhile, text generated from such features usually focuses on only one specific target and cannot produce multi-granularity descriptions from the rich semantic information of the image, which easily causes a large loss of cross-modal information. The image descriptions produced by machines therefore still fall far short of humans' natural perception of images.
Disclosure of Invention
To address the shortcomings of the background art, the invention provides a question-and-answer combined image natural language description method that generates image content descriptions on the basis of a visual question-answering model. By designing an image segmentation module, a question generation module and a joint question-answering module, it generates several semantically related questions from regions of different scales in a relation feature map, uses guide questions with multi-granularity features and the question-answer correspondence to produce a fine description of the image scene, and can capture the possibility that implicit facts occur in the image, yielding natural language descriptions that better match human semantic cognition of the image.
The invention adopts the following technical scheme:
a question-answer-combined image natural language description method is characterized in that a visual question-answer model is used as a basis to generate a fine description of image content. Firstly, an image segmentation module obtains a segmentation feature map for classifying an image target and a background category; secondly, the problem generation module constructs an implicit scene type representation and generates a multi-granularity guide problem by taking an attention target as a center; and finally, introducing a loss function of comparative learning by the joint question-answering module, and performing joint multi-mode embedded characterization on the relation characteristic diagram and the guide question. The method is based on visual question answering, and natural language description of image content is generated according to the corresponding relation of question-answer.
The question-and-answer combined image natural language description method comprises three steps:
step one, an image segmentation model extracts features of the image targets and the image background to obtain distinct pixel-level classifications and produce segmentation feature maps of the targets and the background;
step two, the question generation module constructs an implicit scene-type representation to produce a relation feature map containing attention-target information, and generates several semantically related guide questions at multiple granularities;
step three, the joint question-answering module introduces a contrastive learning loss function and jointly embeds the relation feature map and the guide questions into a multi-modal representation; through training, it can generate long-text answers related to the questions as a refined semantic description of the image content.
For step one, the invention provides a preferred scheme for extracting image features.
For step two, the invention discloses a question generation model based on the LSTM model. It processes the segmentation feature map: it first generates a relation feature map containing attention-target information by constructing an implicit scene-type representation; then, centered on each attention target, it establishes multi-scale relations between the attention target and the other image attention targets and between the attention target and the background, and the generated multi-granularity guide questions serve as one link in the subsequent joint question answering.
For step three, the invention discloses a joint question-answering model based on the BUTD (bottom-up and top-down attention) model. It introduces a contrastive learning loss function and jointly embeds the relation feature map and the guide questions, which improves the cross-modal learning ability of the model, enhances its understanding of the semantic relation between the image and the question answers, and generates the fine description of the image content.
In particular, the method comprises the following steps.
Step one: Image segmentation
1.1 Use a previously published image semantic segmentation dataset in which every image has pixel-level class labels.
1.2 Train an image segmentation neural network model on the image segmentation dataset with a deep learning method. The task of image segmentation is dense prediction over the image: by labeling different targets with specific colors, every pixel is assigned the target it belongs to or the category of its enclosing region.
1.3 Save the model weights of the trained image segmentation neural network. The network can process an original image, distinguish the different targets and the background in the image, and finally output segmentation feature maps between targets and between targets and the background.
Step two: problem generation
2.1 Process a visual question generation dataset and classify the question categories in it. Different question categories view the targets and the relations between them from multiple angles, and the several question categories of one image attend not only to different targets but also to image regions of the same target at different scales. Meanwhile, the answers and questions in the dataset are merged to produce complete natural language descriptions.
2.2 Train a question generation neural network model on the processed visual question generation dataset with a deep learning method. The question generation model constructs an implicit scene-type representation and first produces a relation feature map containing attention-target information; then, centered on the attention targets, it learns the association between question categories and image regions of different granularity, and generates, at multiple scales, different questions related to the attention-target context.
Step three: joint question answering
3.1 Integrate the image segmentation module, the question generation module and the joint question-answering module; take the guide questions as context, learn with a top-down attention mechanism, introduce a contrastive learning loss function, and jointly embed the guide questions and the relation feature map into a multi-modal representation. The trained network outputs candidate answers with their confidences, from which the natural language description of the image content is generated.
The invention has the beneficial effects that:
1. image segmentation (using segmentation models that others have) includes not only explicit target and background pixel-level information, but also hidden inter-object dependencies, juxtapositions and logical relationships, providing comprehensive and fine image feature information.
2. Compared with the direct image description generation, the multi-granularity characteristic question generated by the image in the step two can guide more detailed multi-scale answers, and simultaneously can capture the possibility of the occurrence of implicit facts in the image, and provides the capability of describing simple events.
3. The design scheme provides a new image description method, the multi-granularity characteristic problem is used as a guide core for image content description generation, the visual question-answer model is used for building the corresponding relation of the problem and the answer (the visual question-answer model is used for generating the image description for the first time), the loss function of contrast learning is introduced, the cross-modal learning capability of the model is improved, the natural language description which is more in line with the human semantic cognition on the image can be provided, and the processing efficiency of man-machine interaction is improved.
Drawings
FIG. 1 is a diagram of the image segmentation model;
FIG. 2 is a diagram of the question generation model designed by the invention;
FIG. 3 is a diagram of the joint question-answering model;
FIG. 4 is an example of an image description produced by joint question answering.
Detailed Description
Examples
An example of an image description of a joint question and answer is shown in FIG. 4.
An image content description method based on segmentation and visual question answering comprises the following steps.
Step one: Image segmentation
1.1 In this example, the dataset used is the Cityscapes dataset, which focuses on visual understanding of complex urban street scenes. It contains a large set of stereoscopic video sequences recorded in street scenes of 50 different cities, with 20000 weakly annotated frames and 5000 frames with high-quality pixel-level annotations, and provides annotations for 30 categories in total, including pedestrians, vehicles, roads, buildings, traffic lights and signs.
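A minimal sketch of loading the Cityscapes semantic segmentation data follows; PyTorch/torchvision is an assumption (the patent does not name a framework), and the local path "data/cityscapes" is hypothetical.

```python
# Load Cityscapes with pixel-level semantic labels via torchvision (assumed framework).
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import Cityscapes

transform = T.Compose([T.ToTensor()])

train_set = Cityscapes(
    root="data/cityscapes",      # expects the leftImg8bit/ and gtFine/ folders
    split="train",
    mode="fine",                 # the 5000 finely annotated frames
    target_type="semantic",      # pixel-level class labels
    transform=transform,
    target_transform=T.PILToTensor(),
)
train_loader = DataLoader(train_set, batch_size=8, shuffle=True, num_workers=4)
```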
1.2 In this embodiment, a deep learning method is used to train on the image segmentation dataset, and a DeepLab image segmentation network model with an encoder-decoder structure is constructed. The DeepLab series are dilated-convolution semantic segmentation models based on fully convolutional networks; their atrous (dilated) convolution recovers, as far as possible, the feature-map resolution that is repeatedly reduced by convolution operations, without increasing the number of parameters. The DeepLabv3+ model adopted in this embodiment is a segmentation network with an encoder-decoder structure and offers high computation speed and prediction accuracy. The encoder module extracts high-level semantic information from the image; the decoder module recovers low-level spatial information to obtain clear, complete class segmentation boundaries and the relations between the targets and the background. The network model is shown in FIG. 1.
In the encoder stage, an Xception network with depthwise separable convolutions is first adopted as the backbone to perform initial feature extraction on the input image; the Xception network downsamples the feature map using residual connections and depthwise separable convolutions. A subsequent Atrous Spatial Pyramid Pooling (ASPP) module further extracts multi-scale feature information: a 1×1 convolution, global average pooling, and atrous convolutions with dilation rates (denoted Rate) of 6, 12 and 18 are combined in parallel, the multi-scale feature maps are then compressed to 256 channels by a 1×1 convolution, and the feature map, whose resolution has been reduced to 1/16 of the original image, is output as the encoder feature map.
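The ASPP module described above can be sketched as follows in PyTorch; the layer sizes and normalization choices are illustrative assumptions rather than the patent's exact configuration.

```python
# ASPP: 1x1 conv, atrous convs with rates 6/12/18, and global average pooling,
# fused in parallel and projected to 256 channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int = 256):
        super().__init__()
        def branch(k, rate):
            pad = 0 if k == 1 else rate
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=rate, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([
            branch(1, 1),          # 1x1 convolution
            branch(3, 6),          # atrous conv, rate 6
            branch(3, 12),         # atrous conv, rate 12
            branch(3, 18),         # atrous conv, rate 18
        ])
        self.image_pool = nn.Sequential(        # global average pooling branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(           # compress all branches to 256 channels
            nn.Conv2d(5 * out_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```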
In the decoder stage, the encoder output feature map is upsampled by a factor of 4 with bilinear interpolation and concatenated with the channel-adjusted shallow feature map of the corresponding level from the backbone; the result is refined by two 3×3 convolution layers, and a final 4× bilinear upsampling yields a segmentation prediction of the same size as the original image that is rich in both detail and global information.
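A matching sketch of the decoder stage under the same assumptions; the 48-channel reduction of the shallow features is a common DeepLabv3+ convention, not a value stated in the patent.

```python
# Decoder: 4x upsample, concatenate with channel-reduced shallow features,
# two 3x3 convolutions, then a final 4x upsample back to input resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, low_ch: int, aspp_ch: int = 256, num_classes: int = 30):
        super().__init__()
        self.reduce = nn.Sequential(             # channel adjustment of shallow features
            nn.Conv2d(low_ch, 48, 1, bias=False),
            nn.BatchNorm2d(48), nn.ReLU(inplace=True))
        self.refine = nn.Sequential(             # two 3x3 convolutions
            nn.Conv2d(aspp_ch + 48, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(256, num_classes, 1)

    def forward(self, aspp_out, low_level_feat):
        x = F.interpolate(aspp_out, size=low_level_feat.shape[-2:],
                          mode="bilinear", align_corners=False)     # 4x upsample
        x = torch.cat([x, self.reduce(low_level_feat)], dim=1)
        logits = self.classifier(self.refine(x))
        return F.interpolate(logits, scale_factor=4,
                             mode="bilinear", align_corners=False)  # back to input size
```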
1.3 Save the model weights of the trained DeepLabv3+ image segmentation neural network. The network can process an original image, distinguish the different targets and the background in it, and finally output segmentation feature maps between targets and between targets and the background, which capture the pixel-level features of the image in detail.
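The weight-saving and inference of step 1.3 might look as follows; torchvision's DeepLabV3 with a ResNet-50 backbone stands in for the DeepLabv3+/Xception model of this embodiment, and all file names are hypothetical.

```python
# Save trained weights, then run segmentation inference to get a pixel-level class map.
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(num_classes=30)
# ... training loop omitted ...
torch.save(model.state_dict(), "deeplab_cityscapes.pth")   # 1.3: store model weights

model.load_state_dict(torch.load("deeplab_cityscapes.pth"))
model.eval()
with torch.no_grad():
    image = torch.rand(1, 3, 512, 1024)          # a normalized input image tensor
    logits = model(image)["out"]                 # shape (1, 30, 512, 1024)
    seg_map = logits.argmax(dim=1)               # per-pixel class map (segmentation feature map)
```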
Step two: problem generation
2.1 In this embodiment, currently published visual question generation datasets (including the SQuAD dataset) are processed and the question categories in them are classified. Different question categories view the targets and the relations between them from multiple angles, and the several question categories of one image attend not only to different targets but also to image regions of the same target at different scales. Meanwhile, the answers and questions in the dataset are merged to produce complete natural language descriptions.
The question categories include Object, Attribute, Relationship, Counting, Behavior, and so on.
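As a rough illustration of the preprocessing in 2.1, the sketch below labels question categories and naively merges each question-answer pair into one declarative sentence; the template-based merge is a simplifying assumption, not the patent's procedure.

```python
# Label QA pairs with a category and merge question + answer into one description.
from dataclasses import dataclass

CATEGORIES = ["Object", "Attribute", "Relationship", "Counting", "Behavior"]

@dataclass
class QAPair:
    question: str
    answer: str
    category: str          # one of CATEGORIES

def merge_qa(pair: QAPair) -> str:
    """Naively merge a question and its answer into one descriptive sentence."""
    q = pair.question.rstrip("?").strip()
    return f"{q}: {pair.answer}."

sample = QAPair("How many pedestrians are crossing the road?", "three", "Counting")
print(merge_qa(sample))    # "How many pedestrians are crossing the road: three."
```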
2.2 In this embodiment, a question generation neural network model is constructed and trained on the processed visual question generation dataset with a deep learning method.
To generate visually grounded questions directly from the feature image, a residually connected MLP (multi-layer perceptron) is constructed; it takes the segmentation feature map containing category information as input and outputs a relation feature map containing the main attention-target information, which guides the generation of the subsequent questions.
A typical MLP comprises three layers: an input layer, a hidden layer and an output layer, with full connections between successive layers. Through training, the hidden layer constructs the scene-type representation weights of the image. Each layer of the network is computed as follows:
H = X W_h + b_h
O = H W_o + b_o
Y = σ(O)
where X is the feature input, Y is the feature output, W_h and W_o are the weights of the hidden layer and the output layer, b_h and b_o are the biases of the hidden layer and the output layer, and σ denotes the activation function, here the sigmoid function.
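A compact PyTorch sketch of the residually connected MLP defined by these formulas could look like this; the dimensions are illustrative assumptions.

```python
# Residual MLP: H = X W_h + b_h, O = H W_o + b_o, Y = sigma(O), plus a residual
# connection from the input to form the relation features.
import torch
import torch.nn as nn

class ResidualMLP(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.hidden = nn.Linear(dim, hidden)   # H = X W_h + b_h
        self.out = nn.Linear(hidden, dim)      # O = H W_o + b_o

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.hidden(x)
        y = torch.sigmoid(self.out(h))         # Y = sigma(O)
        return x + y                           # residual connection

seg_features = torch.rand(4, 512)              # flattened segmentation feature vectors
relation_features = ResidualMLP(512, 1024)(seg_features)
```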
The segmentation feature maps are combined through the residual connection to produce the relation feature map; the channel representing the relation weights indicates the salience of attention targets, each closed target region of a segmentation feature map has a corresponding attention coefficient, and u denotes the relation feature layer. Then, on the basis of an LSTM (long short-term memory recurrent neural network), the question generation model establishes, in a multi-granularity and hierarchical manner, the association between each attention target and the categories of its surrounding closed regions in the relation feature map, and generates related guide questions at different scales according to the predicted most probable question category.
The LSTM learns the relation between the attention targets and the questions, so that the model can be trained to lock onto the region relevant to an attention target according to the question. The LSTM propagates a short-term memory h and a long-term memory c through input, output and forget gates; its (prior-art) formulas are:
z_t = tanh(W_z x_t + R_z y_{t-1} + b_z)
i_t = σ(W_i x_t + R_i y_{t-1} + p_i ⊙ c_{t-1} + b_i)
f_t = σ(W_f x_t + R_f y_{t-1} + p_f ⊙ c_{t-1} + b_f)
c_t = z_t ⊙ i_t + c_{t-1} ⊙ f_t
o_t = σ(W_o x_t + R_o y_{t-1} + p_o ⊙ c_t + b_o)
y_t = tanh(c_t) ⊙ o_t
where x_t is the input at the current time, y_{t-1} is the short-term memory of the previous time step, c_{t-1} is the long-term memory of the previous time step, W_f, W_i, W_z and W_o are the corresponding input weights, R_f, R_i, R_z and R_o are the corresponding recurrent weights, p_i, p_f and p_o are the peephole weight matrices, b_f, b_i, b_z and b_o are the corresponding biases, σ denotes the sigmoid activation function, and ⊙ denotes element-wise multiplication.
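For illustration, a single step of a peephole LSTM cell matching these prior-art formulas can be sketched as below; this is not the patent's code, and in practice a library LSTM (without peepholes) is often substituted.

```python
# One step of a peephole LSTM cell (unbatched vectors, for clarity).
import torch
import torch.nn as nn

class PeepholeLSTMCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        H, X = hidden_size, input_size
        self.W = nn.Parameter(torch.randn(4 * H, X) * 0.01)   # input weights W_z, W_i, W_f, W_o
        self.R = nn.Parameter(torch.randn(4 * H, H) * 0.01)   # recurrent weights R_z, R_i, R_f, R_o
        self.b = nn.Parameter(torch.zeros(4 * H))              # biases b_z, b_i, b_f, b_o
        self.p_i = nn.Parameter(torch.zeros(H))                 # peephole weights
        self.p_f = nn.Parameter(torch.zeros(H))
        self.p_o = nn.Parameter(torch.zeros(H))
        self.H = H

    def forward(self, x_t, y_prev, c_prev):
        gates = self.W @ x_t + self.R @ y_prev + self.b
        z_in, i_in, f_in, o_in = gates.split(self.H)
        z_t = torch.tanh(z_in)                                  # block input
        i_t = torch.sigmoid(i_in + self.p_i * c_prev)           # input gate
        f_t = torch.sigmoid(f_in + self.p_f * c_prev)           # forget gate
        c_t = z_t * i_t + c_prev * f_t                          # long-term memory
        o_t = torch.sigmoid(o_in + self.p_o * c_t)              # output gate
        y_t = torch.tanh(c_t) * o_t                             # short-term memory / output
        return y_t, c_t
```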
The loss function of the question generation neural network model (its formula appears only as an image in the original publication) is defined over the question vector q̂_i generated by the model, the ground-truth question vector q_i from the dataset, and the predicted relation weight value u, which is a positive number.
The question generation network model is shown in FIG. 2. The relation feature map provides context information for the whole image; by constructing an implicit scene-type representation, the attention targets efficiently concentrate question generation on a few targets, better delimit the attention targets of the image scene, and provide a more abstract representation of the target focus in the image. At different granularity levels, several questions of different scales are generated around the attention targets according to the predicted most probable question category; these questions cover the important information of the core regions of the image, so the question generation neural network model obtains more comprehensive guide questions.
Step three: joint question answering
3.1 In this embodiment, the image segmentation module of step one, the question generation module of step two and the joint question-answering module of step three are integrated; the guide questions are taken as context, learning is performed with a top-down attention mechanism, and the guide questions and the relation feature map are jointly embedded into a multi-modal representation.
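A hedged sketch of BUTD-style top-down attention of this kind: the guide-question embedding scores each region of the relation feature map, and the attended visual feature is returned. The dimensions and the exact scoring function are assumptions for illustration.

```python
# Top-down attention: question embedding weights the regions of the relation feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    def __init__(self, v_dim: int, q_dim: int, hid: int = 512):
        super().__init__()
        self.proj = nn.Linear(v_dim + q_dim, hid)
        self.score = nn.Linear(hid, 1)

    def forward(self, v, q):
        # v: (batch, num_regions, v_dim) relation feature map regions
        # q: (batch, q_dim) guide-question embedding
        q_rep = q.unsqueeze(1).expand(-1, v.size(1), -1)
        logits = self.score(torch.tanh(self.proj(torch.cat([v, q_rep], dim=-1))))
        alpha = F.softmax(logits, dim=1)          # attention weights over regions
        return (alpha * v).sum(dim=1)             # attended visual feature
```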
First, the question feature vector q produced by question generation and the segmentation feature map v obtained from image segmentation are connected as context, guiding the model to train the weights W_q and W_v applied to q and v:
f_q(q) = W_q q
f_v(v) = W_v v
f_q(q) and f_v(v) are then embedded multi-modally, computed with a Hadamard product:
h = f_q(q) ⊙ f_v(v)
p(y) = σ(h)
where f_q denotes the output of the question stream, f_v denotes the output of the visual stream, and p(y) is the final output; question feature vectors q and segmentation feature maps v that match more closely receive higher scores. W_q and W_v are weight matrices and σ denotes the linear activation function. The loss function of the joint embedding is the contrastive loss, whose formula appears only as an image in the original publication; in it, y^T denotes the output result for a correctly matched question feature vector q and segmentation feature map v.
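Since the patent's contrastive loss formula is published only as an image, the sketch below pairs the joint embedding defined above with a generic InfoNCE-style contrastive loss as a stand-in; the loss form and the temperature value are assumptions.

```python
# Joint embedding (f_q, f_v, Hadamard fusion) plus a generic contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, q_dim: int, v_dim: int, dim: int = 512):
        super().__init__()
        self.W_q = nn.Linear(q_dim, dim, bias=False)   # f_q(q) = W_q q
        self.W_v = nn.Linear(v_dim, dim, bias=False)   # f_v(v) = W_v v

    def forward(self, q, v):
        h = self.W_q(q) * self.W_v(v)                  # Hadamard product
        return torch.sigmoid(h)                        # p(y) = sigma(h)

def contrastive_loss(q_emb, v_emb, temperature: float = 0.07):
    """InfoNCE-style loss: matched (q, v) pairs on the diagonal are positives."""
    q_emb = F.normalize(q_emb, dim=-1)
    v_emb = F.normalize(v_emb, dim=-1)
    logits = q_emb @ v_emb.t() / temperature
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)

model = JointEmbedding(q_dim=1024, v_dim=2048)
q_vec, v_vec = torch.rand(8, 1024), torch.rand(8, 2048)
loss = contrastive_loss(model.W_q(q_vec), model.W_v(v_vec))
```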
The joint question-answering network model is shown in FIG. 3. The integrated image segmentation module, question generation module and joint question-answering module give the answer prediction with the highest confidence according to the trained network. Grounded in visual question answering, the method can thus describe the image content in fine detail.

Claims (8)

1. A question-and-answer combined image natural language description method, characterized by comprising the following three steps:
step one, an image segmentation model extracts features of the image targets and the image background to obtain distinct pixel-level classifications and produce segmentation feature maps of the targets and the background;
step two, a question generation module constructs an implicit scene-type representation to produce a relation feature map containing attention-target information, and generates several semantically related guide questions at multiple granularities;
step three, a joint question-answering module introduces a contrastive learning loss function and jointly embeds the relation feature map and the guide questions into a multi-modal representation; through training of the model, long-text answers related to the questions are generated as the refined semantic description of the image content.
2. A description method as claimed in claim 1, characterized in that:
and secondly, generating a model based on the LSTM model, processing the segmentation feature map, generating a relation feature map containing the attention target information by constructing an implicit scene type representation, establishing multi-scale relations between the attention target and the image attention target and between the attention target and the background by taking the attention target as a center, and taking the generated multi-granularity guide question as a ring in subsequent joint question-answering.
3. A description method as claimed in claim 1, characterized in that:
aiming at the third step, a combined question-answer model based on the BUTD model introduces a loss function of comparative learning, a combined relation characteristic diagram and a guide question, improves the cross-modal learning capability of the model, enhances the understanding of the model on semantic relation between the image and the question answer, and generates the fine description of the image content.
4. A description method as claimed in claim 1, characterized in that:
the method comprises the following steps: image segmentation
1.1 using a previously published image semantic segmentation dataset in which every image has pixel-level class labels;
1.2 training an image segmentation neural network model on the image segmentation dataset with a deep learning method; the task of image segmentation is dense prediction over the image: by labeling different targets with specific colors, every pixel is assigned the target it belongs to or the category of its enclosing region;
1.3 saving the model weights of the trained image segmentation neural network; the network can process an original image, distinguish the different targets and the background in the image, and finally output segmentation feature maps between targets and between targets and the background.
5. A description method as claimed in claim 1 or 2, characterized in that:
step two: problem generation
2.1 processing previously published visual question generation datasets and classifying the question categories in them; different question categories view the targets and the relations between them from multiple angles, and the several question categories of one image attend not only to different targets but also to image regions of the same target at different scales; meanwhile, the answers and questions in the dataset are merged to produce complete natural language descriptions;
2.2 training a question generation neural network model on the processed visual question generation dataset with a deep learning method; the question generation model constructs an implicit scene-type representation and first produces a relation feature map containing attention-target information; then, centered on the attention targets, it learns the association between question categories and image regions of different granularity, and generates, at multiple scales, different questions related to the attention-target context.
6. A description method as claimed in claim 1 or 3, characterized in that:
step three: joint question answering
3.1 integrating the image segmentation module, the question generation module and the joint question-answering module; taking the guide questions as context, learning with a top-down attention mechanism, introducing a contrastive learning loss function, and jointly embedding the guide questions and the relation feature map into a multi-modal representation; the trained network outputs candidate answers with their confidences, from which the natural language description of the image content is generated.
7. The description method of claim 5, characterized in that:
the loss function of the problem-generating neural network model is as follows:
Figure FDA0003856159110000021
wherein,
Figure FDA0003856159110000022
is the problem vector of the model generation, q i Is the true problem vector in the dataset, u represents the predicted relationship weight value, which is a positive number.
8. A description method as claimed in claim 5, characterized in that:
the loss function of joint embedding is the contrast loss:
Figure FDA0003856159110000023
wherein, y T And (5) an output result of the problem feature vector q and the segmentation feature map v which represent correct matching.
CN202211150406.9A 2022-09-21 2022-09-21 Question and answer combined image natural language description method Pending CN115512191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211150406.9A CN115512191A (en) 2022-09-21 2022-09-21 Question and answer combined image natural language description method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211150406.9A CN115512191A (en) 2022-09-21 2022-09-21 Question and answer combined image natural language description method

Publications (1)

Publication Number Publication Date
CN115512191A true CN115512191A (en) 2022-12-23

Family

ID=84503366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211150406.9A Pending CN115512191A (en) 2022-09-21 2022-09-21 Question and answer combined image natural language description method

Country Status (1)

Country Link
CN (1) CN115512191A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116663530A (en) * 2023-08-01 2023-08-29 北京高德云信科技有限公司 Data generation method, device, electronic equipment and storage medium
CN116663530B (en) * 2023-08-01 2023-10-20 北京高德云信科技有限公司 Data generation method, device, electronic equipment and storage medium
CN117312512A (en) * 2023-09-25 2023-12-29 星环信息科技(上海)股份有限公司 Question and answer method and device based on large model, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination