CN115512191A - Question and answer combined image natural language description method - Google Patents
- Publication number
- CN115512191A (application CN202211150406.9A)
- Authority
- CN
- China
- Prior art keywords
- image
- question
- model
- segmentation
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/778—Active pattern-learning, e.g. online learning of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
A question-answer combined image natural language description method comprises three steps. Step one, an image segmentation model extracts features of the image targets and the image background to obtain pixel-level classifications and a segmentation feature map of the targets and the background. Step two, a question generation module constructs an implicit scene-type representation to generate a relation feature map containing attention-target information, and generates several semantically related guiding questions at multiple granularities. Step three, a joint question-answering module introduces a contrastive-learning loss function and performs a joint multimodal embedded representation of the relation feature map and the guiding questions; after training, it can generate long-text answers related to the questions as a refined semantic description of the image content.
Description
Technical Field
The present invention belongs to the fields of computer vision and natural language processing.
Background
Image description generation is a multi-modal task across text and images, with the goal of producing a corresponding natural language description from a picture. This task is very easy for humans, but very challenging for computers. With the prevalence of deep learning, more and more people are trying to solve the image description problem of machines using neural networks.
However, since natural language descriptions are diverse and have no single standard form, training on a specific data set yields only image content descriptions that conform to that data set's distribution, which limits the range of application. Meanwhile, most image description methods only passively generate monotonous sentences and do not consider the relation between the scene in the image and the attention targets, so it is difficult to describe different, multi-scale targets at the same time, and potential relations are easily overlooked. Such mechanical image descriptions deviate greatly from human semantic perception of images and often cannot be understood when interacting with real people.
Image description generation is generally divided into two sub-modules: image feature extraction and text generation. A common image feature extraction method is to extract objects with an image recognition neural network model, but this can lead to missing image information and biased feature extraction. Meanwhile, phrase generation based on such feature extraction usually focuses only on a specific target and cannot generate multi-granularity descriptions from the rich semantic information of the image, which easily causes a large loss of cross-modal information. Therefore, machine image description still has a large gap from humans' natural perception of images.
Disclosure of Invention
Aiming at the defects of the background art, the invention provides a question-answer combined image natural language description method that generates image content descriptions on the basis of a visual question-answering model. By designing an image segmentation module, a question generation module and a joint question-answering module, the method generates several semantically related questions from regions of different scales in a relation feature map, uses guiding questions with multi-granularity features and the question-answer correspondence to generate a fine description of the image scene, and can capture the possibility of implicit facts occurring in the image, producing natural language descriptions that better conform to human semantic cognition of images.
The invention adopts the following technical scheme:
a question-answer-combined image natural language description method is characterized in that a visual question-answer model is used as a basis to generate a fine description of image content. Firstly, an image segmentation module obtains a segmentation feature map for classifying an image target and a background category; secondly, the problem generation module constructs an implicit scene type representation and generates a multi-granularity guide problem by taking an attention target as a center; and finally, introducing a loss function of comparative learning by the joint question-answering module, and performing joint multi-mode embedded characterization on the relation characteristic diagram and the guide question. The method is based on visual question answering, and natural language description of image content is generated according to the corresponding relation of question-answer.
A question-answer combined image natural language description method comprises three steps:
step one, extract features of the image targets and the image background with an image segmentation model to obtain pixel-level classifications and a segmentation feature map of the targets and the background;
step two, the question generation module constructs an implicit scene-type representation to generate a relation feature map containing attention-target information and generates several semantically related guiding questions at multiple granularities;
step three, the joint question-answering module introduces a contrastive-learning loss function and performs a joint multimodal embedded representation of the relation feature map and the guiding questions; after training it can generate long-text answers related to the questions as a refined semantic description of the image content.
For step one, the invention provides a preferred scheme for extracting image features.
For step two, the invention discloses a question generation model based on the LSTM model. It processes the segmentation feature map: it first generates a relation feature map containing attention-target information by constructing an implicit scene-type representation, then, centered on the attention targets, establishes multi-scale relations between the attention targets and between the attention targets and the background; the generated multi-granularity guiding questions serve as one link in the subsequent joint question answering.
For step three, the invention discloses a joint question-answering model based on the BUTD (bottom-up, top-down) model. It introduces a contrastive-learning loss function and jointly embeds the relation feature map and the guiding questions, which improves the cross-modal learning ability of the model, enhances the model's understanding of the semantic relation between the image and the question answers, and generates a fine description of the image content.
Specifically, the method is as follows.
Step one: Image segmentation
1.1 Use a publicly available image semantic segmentation data set in which every image carries pixel-level class labels.
1.2 Train on the image segmentation data set with a deep learning method to construct an image segmentation neural network model. The task of image segmentation is dense prediction over the image: by labeling different targets with specific colors, every pixel is assigned the target or enclosed-region category to which it belongs.
1.3 Save the model weights of the trained image segmentation neural network. The network model can process an original image, distinguish the different targets and the background in the image, and finally output segmentation feature maps between targets and between targets and the background.
Step two: problem generation
2.1 Process a visual question generation data set and classify the question categories in it. Different question categories view the targets and the relations between them from multiple angles, and the several question categories of the same image attend not only to different targets but also to image regions of the same target at different scales. Meanwhile, the answers and questions of the data set are merged to generate complete natural language descriptions.
2.2 Train on the processed visual question generation data set with a deep learning method to construct a question generation neural network model. The question generation model constructs an implicit scene-type representation and first generates a relation feature map containing attention-target information; then, centered on the attention targets, it learns the association between question categories and image regions of different granularities and generates, at multiple scales, different questions related to the attention-target context.
Step three: joint question answering
3.1 Integrate the image segmentation module, the question generation module and the joint question-answering module. With the guiding questions as context, learn with a top-down attention mechanism, introduce a contrastive-learning loss function, and perform a joint multimodal embedded representation of the guiding questions and the relation feature map. The trained network gives candidate answers and their confidences to generate the natural language description of the image content.
The invention has the beneficial effects that:
1. image segmentation (using segmentation models that others have) includes not only explicit target and background pixel-level information, but also hidden inter-object dependencies, juxtapositions and logical relationships, providing comprehensive and fine image feature information.
2. Compared with direct image description generation, the multi-granularity questions generated from the image in step two can guide more detailed, multi-scale answers; at the same time they can capture the possibility of implicit facts occurring in the image and provide the ability to describe simple events.
3. The design provides a new image description method: the multi-granularity questions serve as the guiding core for generating the image content description, a visual question-answering model is used to build the question-answer correspondence (the first use of a visual question-answering model to generate image descriptions), and a contrastive-learning loss function is introduced to improve the cross-modal learning ability of the model. The method can provide natural language descriptions that better conform to human semantic cognition of images and improves the efficiency of human-computer interaction.
Drawings
FIG. 1 is a diagram of an image segmentation model
FIG. 2 is a diagram of the question generation model designed by the present invention
FIG. 3 is a diagram of a joint question-answer model
FIG. 4 is an exemplary diagram of an image description of joint question answering
Detailed Description
Examples
An example of an image description of a joint question and answer is shown in FIG. 4.
A segmentation-based image content description method built on visual question answering comprises the following steps:
the method comprises the following steps: image segmentation
1.1 In this embodiment, the data set used is the Cityscapes data set, which focuses on visual understanding of complex urban street scenes. It contains diverse stereoscopic video sequences recorded in street scenes from 50 different cities, with 20,000 weakly annotated frames and 5,000 frames of high-quality pixel-level annotations, and provides annotations for a total of 30 categories including pedestrians, vehicles, roads, buildings, traffic lights and signs.
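For orientation only, the fine pixel-level split of Cityscapes can be loaded with torchvision's built-in dataset class. The minimal sketch below assumes the official leftImg8bit/gtFine directory layout has already been downloaded to a local path; the path and transform are illustrative choices, not part of the embodiment.

```python
# Minimal sketch (assumed local path): loading the Cityscapes fine pixel-level annotations.
from torchvision import datasets, transforms

train_set = datasets.Cityscapes(
    root="./data/cityscapes",    # assumed download location containing leftImg8bit/ and gtFine/
    split="train",
    mode="fine",                 # the 5,000 frames with high-quality pixel-level annotations
    target_type="semantic",      # per-pixel class-id masks covering the labelled categories
    transform=transforms.ToTensor(),
)

image, mask = train_set[0]       # image: 3xHxW tensor in [0, 1]; mask: PIL image of class ids
print(image.shape, mask.size)
```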
1.2 In this embodiment, a deep learning method is used to train on the image segmentation data set, and a DeepLab image segmentation network model with an encoder-decoder structure is constructed. The DeepLab series are dilated-convolution semantic segmentation models built on fully convolutional networks; the atrous (dilated) convolution they propose can recover, as far as possible and without increasing the number of parameters, the feature map resolution that is repeatedly reduced by convolution operations. The DeepLab v3+ model adopted in this embodiment is a segmentation network model with an encoder-decoder structure and has high computation speed and prediction accuracy. The encoder module extracts high-level semantic information from the image; the decoder module recovers low-level spatial information and obtains clear, complete class segmentation boundaries and the relations between the targets and the background. The network model is shown in FIG. 1.
In the encoder stage, an Xception network combined with depthwise separable convolution is first adopted as the backbone network to perform initial feature extraction on the input image; the Xception network downsamples the feature map using residual connections and depthwise separable convolutions. A subsequent Atrous Spatial Pyramid Pooling (ASPP) module further extracts multi-scale feature information: a 1×1 convolution, global average pooling, and atrous convolutions with dilation rates (denoted Rate) of 6, 12 and 18 are combined into a parallel structure, the multi-scale feature maps are then compressed to 256 channels by a 1×1 convolution, and the result, whose resolution has been reduced to 1/16 of the original image, is output as the encoder feature map.
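As a hedged illustration of the ASPP structure just described (a 1×1 convolution, three atrous convolutions with rates 6, 12 and 18, and an image-level pooling branch, all compressed to 256 channels), a simplified PyTorch sketch follows; the input channel count is an assumption, and the batch normalization and ReLU layers of the full model are omitted for brevity.

```python
# Simplified ASPP sketch: parallel 1x1 conv, atrous convs (rates 6, 12, 18) and global
# average pooling, concatenated and projected to 256 channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.atrous = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False) for r in rates
        )
        self.pool_proj = nn.Conv2d(in_ch, out_ch, 1, bias=False)      # image-level branch
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        branches = [self.conv1x1(x)] + [conv(x) for conv in self.atrous]
        pooled = self.pool_proj(F.adaptive_avg_pool2d(x, 1))          # global average pooling
        branches.append(F.interpolate(pooled, size=(h, w), mode="bilinear", align_corners=False))
        return self.project(torch.cat(branches, dim=1))               # compress to 256 channels

feat = torch.randn(1, 2048, 32, 32)   # e.g. backbone output at 1/16 of the input resolution
print(ASPP()(feat).shape)             # torch.Size([1, 256, 32, 32])
```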
In the decoder stage, the feature map output by the encoder is upsampled by a factor of 4 with bilinear interpolation, concatenated with the channel-adjusted shallow feature map of the corresponding level from the backbone network, refined by two 3×3 convolution layers, and finally upsampled by another factor of 4 with bilinear interpolation to obtain a segmentation prediction of the same size as the original image that is rich in both detail and global information.
1.3 Save the model weights of the trained DeepLab v3+ image segmentation neural network. The network model can process an original image, distinguish the different targets and the background in the image, and finally output segmentation feature maps between targets and between targets and the background, which extract the pixel-level features of the image in detail.
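As a stand-in for running the trained segmentation network of 1.3, the sketch below uses a pretrained DeepLabV3 from torchvision to produce a per-pixel class map; note that these released weights use a ResNet-50 backbone rather than the Xception backbone described above, so this is only an approximation of the embodiment.

```python
# Sketch: per-pixel class prediction with a pretrained DeepLabV3 (ResNet-50 backbone),
# used as an approximate stand-in for the trained DeepLab v3+ model of this embodiment.
import torch
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

image = torch.rand(3, 512, 1024)          # stand-in for an RGB street-scene image in [0, 1]
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)["out"]          # 1 x num_classes x H x W
class_map = logits.argmax(dim=1)          # per-pixel category ids, i.e. a segmentation map
print(class_map.shape, class_map.unique())
```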
Step two: problem generation
2.1 In this embodiment, publicly available visual question generation data sets (including the SQuAD data set) are processed and the question categories in them are classified. Different question categories view the targets and the relations between them from multiple angles, and the several question categories of the same image attend not only to different targets but also to image regions of the same target at different scales. Meanwhile, the answers and questions of the data sets are merged to generate complete natural language descriptions.
The question categories include Object, Attribute, Relationship, Counting, Behavior, and so on.
2.2 In this embodiment, a deep learning method is used to train on the processed visual question generation data set, and a question generation neural network model is constructed.
To generate visually grounded questions directly from the feature image, a residual-connected MLP (multilayer perceptron) is constructed. It takes the segmentation feature map containing category information as input and outputs a relation feature map containing the main attention-target information, which guides the generation of the subsequent questions.
A typical MLP comprises three layers: an input layer, a hidden layer and an output layer. Adjacent layers of the MLP neural network are fully connected, and the hidden layer constructs the scene-type representation weights of the image through training. The computation of each layer of the neural network is:
H = X W_h + b_h
O = H W_o + b_o
Y = σ(O)
where X is the feature input, Y is the feature output, W_h and W_o are the weights of the hidden layer and the output layer respectively, b_h and b_o are the biases of the hidden layer and the output layer respectively, and σ denotes the activation function, here the sigmoid function.
The segmentation feature maps are combined through residual connections to generate the relation feature map: the channel representing the relation weights expresses the saliency of the attention targets, each enclosed target region of a segmentation feature map has a corresponding attention coefficient, and u denotes the relation feature layer. Then, on the basis of an LSTM (long short-term memory recurrent neural network), the question generation model establishes, at multiple granularities and hierarchically, the association between each attention target in the relation feature map and the surrounding enclosed-region categories, and generates related guiding questions at different scales according to the predicted most probable question category.
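A minimal sketch of the residual-connected MLP described by the formulas above follows; the flattening of the segmentation feature map into one feature vector per enclosed region, the layer width and all tensor shapes are illustrative assumptions rather than the patented model.

```python
# Sketch (assumed shapes): residual MLP mapping per-region segmentation features to a
# relation feature map, following H = X W_h + b_h, O = H W_o + b_o, Y = sigma(O).
import torch
import torch.nn as nn

class RelationMLP(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.hidden = nn.Linear(dim, dim)   # W_h, b_h (learns the scene-type representation)
        self.out = nn.Linear(dim, dim)      # W_o, b_o
        self.act = nn.Sigmoid()             # sigma

    def forward(self, x):                   # x: (num_regions, dim) segmentation features
        y = self.act(self.out(self.hidden(x)))
        return x + y                        # residual connection yields the relation feature map

seg_feats = torch.randn(16, 256)            # 16 hypothetical enclosed target regions
print(RelationMLP()(seg_feats).shape)       # torch.Size([16, 256])
```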
The LSTM is used to learn the relation between the attention targets and the questions, so the model can be trained to lock onto the region relevant to an attention target according to the question. The LSTM involves the transfer of a short-term memory h and a long-term memory c and contains input, output and forget gates; its formulas (prior art) include:
c_t = z_t ⊙ i_t + c_{t-1} ⊙ f_t
y_t = tanh(c_t) ⊙ o_t
where x_t is the input at the current time, y_{t-1} is the short-term memory at the previous time, c_{t-1} is the long-term memory at the previous time, W_f, W_i, W_z and W_o are the corresponding input weights, R_f, R_i, R_z and R_o are the corresponding recurrent weights, p_i denotes the peephole weight matrix used to update the granularity, b_f, b_i, b_z and b_o are the corresponding biases, and σ denotes the sigmoid activation function.
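Written out as code, one step of this prior-art LSTM looks roughly as follows; the gate activations and the restriction of the peephole term to the input gate are assumptions made only to match the symbols named in the text, not a reproduction of the authors' exact model.

```python
# Sketch of one LSTM step with c_t = z_t*i_t + c_{t-1}*f_t and y_t = tanh(c_t)*o_t.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, y_prev, c_prev, W, R, b, p_i):
    """W, R, b are dicts keyed by gate name: 'f', 'i', 'z', 'o'."""
    f_t = sigmoid(W["f"] @ x_t + R["f"] @ y_prev + b["f"])                 # forget gate
    i_t = sigmoid(W["i"] @ x_t + R["i"] @ y_prev + p_i * c_prev + b["i"])  # input gate, peephole p_i
    z_t = np.tanh(W["z"] @ x_t + R["z"] @ y_prev + b["z"])                 # block input
    o_t = sigmoid(W["o"] @ x_t + R["o"] @ y_prev + b["o"])                 # output gate
    c_t = z_t * i_t + c_prev * f_t                                         # long-term memory
    y_t = np.tanh(c_t) * o_t                                               # short-term memory / output
    return y_t, c_t

d, h = 8, 4                                                                # assumed sizes
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((h, d)) for k in "fizo"}
R = {k: rng.standard_normal((h, h)) for k in "fizo"}
b = {k: np.zeros(h) for k in "fizo"}
y, c = lstm_step(rng.standard_normal(d), np.zeros(h), np.zeros(h), W, R, b, p_i=np.zeros(h))
print(y.shape, c.shape)
```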
The loss function of the question generation neural network model is as follows:
where q̂ is the question vector generated by the model, q_i is the true question vector in the data set, and u denotes the predicted relation weight value, which is a positive number.
The network model for question generation is shown in FIG. 2. The relation feature map provides the context information of the whole image. By constructing an implicit scene-type representation, the attention targets efficiently focus question generation on several targets, better constrain the attention targets of the image scene, and provide a more abstract representation of the target focus in the image; at different granularity levels, centered on the attention targets, several questions of different scales are generated according to the predicted most probable question category. These questions cover the important information of the core regions of the image, so the question generation neural network model can process the image more comprehensively and obtain the guiding questions.
Step three: joint question answering
3.1 In this embodiment, the image segmentation module of step one, the question generation module of step two and the joint question-answering module of step three are integrated. With the guiding questions as context, learning is performed with a top-down attention mechanism, and the guiding questions and the relation feature map are given a joint multimodal embedded representation.
First, the question feature vector q produced by question generation and the segmentation feature map v obtained from image segmentation are connected as the context, guiding the training of the model's weights W_q and W_v for q and v:
f_q(q) = W_q q
f_v(v) = W_v v
f_q(q) and f_v(v) are then multimodally embedded, computed with the Hadamard product:
h = f_q(q) ⊙ f_v(v)
p(y) = σ(h)
where f_q denotes the output of the question stream, f_v denotes the output of the visual stream, and p(y) is the final output; a question feature vector q and a segmentation feature map v that match more closely receive a higher score. W_q and W_v denote the weight matrices, and σ denotes the activation function. The loss function for the joint embedding is the contrastive loss:
where y^T denotes the output for a correctly matched question feature vector q and segmentation feature map v.
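The joint embedding of the two streams can be sketched as below; the feature dimensions, the answer-vocabulary size and the binary cross-entropy used as a placeholder for the contrastive loss (whose exact formula is not reproduced in the text) are all assumptions made for illustration.

```python
# Sketch: joint multimodal embedding of question vector q and visual features v via learned
# projections W_q, W_v and a Hadamard (element-wise) product, with p(y) = sigma(h).
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    def __init__(self, q_dim=512, v_dim=256, joint_dim=256, num_answers=1000):
        super().__init__()
        self.W_q = nn.Linear(q_dim, joint_dim, bias=False)   # f_q(q) = W_q q
        self.W_v = nn.Linear(v_dim, joint_dim, bias=False)   # f_v(v) = W_v v
        self.classifier = nn.Linear(joint_dim, num_answers)

    def forward(self, q, v):
        h = self.W_q(q) * self.W_v(v)                        # Hadamard-product fusion
        return torch.sigmoid(self.classifier(h))             # p(y): answer scores / confidences

model = JointEmbedding()
q = torch.randn(4, 512)                                      # guiding-question features
v = torch.randn(4, 256)                                      # pooled relation / segmentation features
scores = model(q, v)

# Placeholder objective: binary cross-entropy over matched / unmatched pairs stands in for
# the contrastive loss, which is not spelled out in the text.
targets = torch.zeros(4, 1000)
loss = nn.functional.binary_cross_entropy(scores, targets)
print(scores.shape, float(loss))
```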
The network model of the joint question answering is shown in FIG. 3. The integrated image segmentation module, question generation module and joint question-answering module can provide the answer prediction with the highest confidence according to the trained network. The method is based on visual question answering and can describe the image content in fine detail.
Claims (8)
1. A question-answer combined image natural language description method, characterized by comprising the following three steps:
step one, extract features of the image targets and the image background with an image segmentation model to obtain pixel-level classifications and segmentation feature maps of the targets and the background;
step two, the question generation module constructs an implicit scene-type representation to generate a relation feature map containing attention-target information and generates several semantically related guiding questions at multiple granularities;
step three, the joint question-answering module introduces a contrastive-learning loss function and performs a joint multimodal embedded representation of the relation feature map and the guiding questions; through training, the model generates long-text answers related to the questions as a refined semantic description of the image content.
2. The description method according to claim 1, characterized in that:
for step two, a question generation model based on the LSTM model processes the segmentation feature map, generates a relation feature map containing attention-target information by constructing an implicit scene-type representation, establishes, centered on the attention targets, multi-scale relations between the attention targets and between the attention targets and the background, and the generated multi-granularity guiding questions serve as one link in the subsequent joint question answering.
3. The description method according to claim 1, characterized in that:
for step three, a joint question-answering model based on the BUTD model introduces a contrastive-learning loss function, jointly embeds the relation feature map and the guiding questions, improves the cross-modal learning ability of the model, enhances the model's understanding of the semantic relation between the image and the question answers, and generates a fine description of the image content.
4. The description method according to claim 1, characterized in that:
Step one: Image segmentation
1.1 use a publicly available image semantic segmentation data set in which every image carries pixel-level class labels;
1.2 train on the image segmentation data set with a deep learning method to construct an image segmentation neural network model; the task of image segmentation is dense prediction over the image: by labeling different targets with specific colors, every pixel is assigned the target or enclosed-region category to which it belongs;
1.3 save the model weights of the trained image segmentation neural network; the network model can process an original image, distinguish the different targets and the background in the image, and finally output segmentation feature maps between targets and between targets and the background.
5. The description method according to claim 1 or 2, characterized in that:
Step two: Question generation
2.1 process an existing visual question generation data set and classify the question categories in it; different question categories view the targets and the relations between them from multiple angles, and the several question categories of the same image attend not only to different targets but also to image regions of the same target at different scales; meanwhile, the answers and questions of the data set are merged to generate complete natural language descriptions;
2.2 train on the processed visual question generation data set with a deep learning method to construct a question generation neural network model; the question generation model constructs an implicit scene-type representation and first generates a relation feature map containing attention-target information; then, centered on the attention targets, it learns the association between question categories and image regions of different granularities and generates, at multiple scales, different questions related to the attention-target context.
6. The description method according to claim 1 or 3, characterized in that:
Step three: Joint question answering
3.1 integrate the image segmentation module, the question generation module and the joint question-answering module; with the guiding questions as context, learn with a top-down attention mechanism, introduce a contrastive-learning loss function, and perform a joint multimodal embedded representation of the guiding questions and the relation feature map; the trained network gives candidate answers and their confidences to generate the natural language description of the image content.
7. The description method according to claim 5, characterized in that:
the loss function of the question generation neural network model is as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211150406.9A CN115512191A (en) | 2022-09-21 | 2022-09-21 | Question and answer combined image natural language description method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211150406.9A CN115512191A (en) | 2022-09-21 | 2022-09-21 | Question and answer combined image natural language description method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115512191A true CN115512191A (en) | 2022-12-23 |
Family
ID=84503366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211150406.9A Pending CN115512191A (en) | 2022-09-21 | 2022-09-21 | Question and answer combined image natural language description method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115512191A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116663530A (en) * | 2023-08-01 | 2023-08-29 | 北京高德云信科技有限公司 | Data generation method, device, electronic equipment and storage medium |
CN116663530B (en) * | 2023-08-01 | 2023-10-20 | 北京高德云信科技有限公司 | Data generation method, device, electronic equipment and storage medium |
CN117312512A (en) * | 2023-09-25 | 2023-12-29 | 星环信息科技(上海)股份有限公司 | Question and answer method and device based on large model, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |