CN116452688A - Image description generation method based on common attention mechanism - Google Patents

Image description generation method based on common attention mechanism

Info

Publication number
CN116452688A
CN116452688A (application CN202310334196.7A)
Authority
CN
China
Prior art keywords
image
description
attention
network
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310334196.7A
Other languages
Chinese (zh)
Inventor
贾海涛
李玉琳
李彧
张洋
张钰琪
贾宇明
任利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202310334196.7A
Publication of CN116452688A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/094 Adversarial learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image description generation method based on a common attention (co-attention) mechanism, which is effective for semantic alignment in image description. To address the problem that the generated description is not aligned with regions in the image, a prophet (look-ahead) attention mechanism is added to the encoder-decoder framework; it dynamically attends to image regions using information from future time steps. To address semantic consistency in image description, a co-attention mechanism is introduced into the discriminator and the idea of adversarial learning is adopted: a generator and a discriminator are trained so that the discriminator classifies generated image descriptions as human-written or machine-generated, thereby improving the semantic consistency of the generated descriptions. Built on a generative adversarial network, the image description model based on the co-attention mechanism can accurately generate descriptions that match the image content and produce linguistically diverse descriptions.

Description

Image description generation method based on common attention mechanism
Technical Field
The invention relates to the field of image description generation in deep learning and aims to solve the problem that the generated description is not semantically aligned with the image in image description generation.
Background
Image description algorithms integrate computer vision and natural language processing so that a machine can generate a natural language description from a given image. Applications include image search, automatic image annotation, intelligent robots, and other fields.
In practical scenarios, image description algorithms are already widely used. In social media, they help platforms automatically generate image descriptions so that users can better understand photo content, enhancing the user experience. In search engines, they help the engine better understand picture content, improving retrieval accuracy and providing better search results. In autonomous driving, the vehicle must perceive its environment through image recognition, and image description algorithms help it better understand and predict road conditions. Image description algorithms can also be applied to medical imaging, unmanned aerial vehicle monitoring, and many other fields, providing strong support for intelligent and automated systems.
Image description algorithms mainly use an attention-enhanced encoder-decoder framework, in which the attention mechanism guides the decoding process by attending to image regions at each time step. This technique has been very successful in advancing image description. However, the conventional attention mechanism attends to image regions based on the previous hidden state, which contains only information about words generated in the past. The attention model must therefore predict attention weights without knowing the word they should correspond to, so the attended image region matches the current input word more closely than the word being generated.
For the task of generating image descriptions with convolutional neural networks, reinforcement learning techniques based on policy gradient methods have been introduced to directly optimize N-gram matching metrics such as CIDEr, BLEU-4, or SPICE; for example, image description models are trained with CIDEr as the optimization objective. However, these metrics do not enforce semantic alignment between the image and the description, nor do they provide a way to promote the naturalness of language so that machine-generated text becomes indistinguishable from human-written text.
With the continuing progress of deep learning, its application to image description has broadened, and this invention applies it to the problem that images and generated descriptions are semantically mismatched. Building on a generative adversarial network, the invention uses an improved prophet attention mechanism in the network design and trains a co-attention discriminator to detect a misalignment signal between an image and a generated sentence. The generator can then use this signal to improve its text generation mechanism and better align the description with the given image.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an image description generation method based on a common attention (co-attention) mechanism, built on a generative adversarial network. The technique incorporates an improved prophet attention mechanism and trains a co-attention discriminator to detect a misalignment signal between the image and the generated sentence. Aiming at the mismatch between images and description semantics in image description algorithms (as shown in Fig. 1), the GAN-based image description algorithm is further improved.
The technical scheme adopted by the invention is as follows:
step 1: based on an image description algorithm for generating an countermeasure network, the network model is divided into a generator and a discriminator, the former is used for generating a description of the corresponding image; the latter is to evaluate the description accuracy of the text description to the image, and the whole framework is shown in fig. 2;
step 2: the encoder-decoder framework is employed by the generator in step 1. The structure is as follows: the encoder adopts a convolutional neural network, a first-known attention mechanism, the decoder adopts a cyclic neural network, an image I is given, and the generator G outputs an image description
Step 3: the encoder in the step 2 adopts the fast R-CNN to accept the image I and extract the image characteristic V= { V 1 ,...,v k }∈R d×N
Step 4: the generator decoder in step 2 consists of an initial layer and a layer of attention-precedent. The initial layer is of an LSTM structure, and the generation of the image description can be controlled through certain modification. The precedent Attention layer calculates the Attention weight using bi-directional LSTM and improves Self-Attention. The attention weight is divided into present and future two parts, wherein the attention weight of the future part is calculated by predicting the generation probability of the next word;
step 5: in step 1, the discriminator network is designed by adopting a common attention mechanism so as to discriminate the generated image description into manual generation or machine generation. This arbiter is composed of two parts: an image attention module and a text attention module. The two modules are respectively used for extracting the characteristics of the image and the description and generating a corresponding attention matrix. The two attention matrices are then combined by a dot product operation to generate a matrix that represents the degree of semantic matching between the image and the description. Finally, this matrix is used as an output of the arbiter to force semantic alignment between the image and the description;
step 6: training a network model by adopting reinforcement learning SCST, using rewards under a decoding algorithm as a base line, and normalizing by using an image description evaluation index CIDEr so as to enable the generated description to be close to a provided sample reference of an N-gram level;
step 7: the arbiter will alternate with the generator during training. The two modules are trained together, the description generated by the network can reach a balance, and finally a generator network for generating the description and realizing semantic alignment between the image and the description is obtained.
Compared with the prior art, the invention has the beneficial effects that:
(1) Higher accuracy is achieved on the problem of semantic alignment between the image and the description;
(2) Addressing the limited diversity of existing image description algorithms, richer and more varied descriptions can be generated.
Drawings
Fig. 1 is: an example of the sequence of image regions attended to for each word of the generated image description.
Fig. 2 is: the overall framework of the co-attention-based image description model.
Fig. 3 is: the Faster R-CNN framework diagram.
Fig. 4 is: the visual attention architecture diagram.
Fig. 5 is: the prophet attention mechanism architecture diagram.
Fig. 6 is: the co-attention discriminator architecture diagram.
Fig. 7 is: a schematic of training the generator with SCST.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
First, the encoder network in the generator uses Faster R-CNN; its structure is shown in Fig. 3. Faster R-CNN is an object detection model that identifies object instances belonging to certain classes and localizes them with bounding boxes.
The Faster R-CNN model consists mainly of two modules, the RPN candidate-box extraction module and the Fast R-CNN detection module, and can be subdivided into three parts: the convolutional layers, the Region Proposal Network (RPN), and RoI pooling. The convolutional layers comprise a series of convolution (Conv + ReLU) and pooling operations that extract image features; the classical VGG16 network is used, and the convolutional-layer weights are shared by the RPN and the Fast R-CNN detection head, which is key to speeding up training and improving the real-time performance of the model. The RPN generates region candidate boxes: based on multi-scale anchor boxes introduced by the network, it classifies each anchor box as target or background with a Softmax and applies bounding-box regression to predict the precise position of each candidate box for subsequent object recognition and detection. RoI pooling combines the convolutional features with the candidate boxes: it maps the candidate-box coordinates in the input image onto the last convolutional feature map (conv5-3), pools the corresponding regions of the feature map into a fixed-size (7 × 7) output, and connects to the following fully connected layers. The fully connected layers feed two sub-layers, a classification layer that determines the category of each candidate box and a regression layer that predicts its precise position through bounding-box regression. The output of Faster R-CNN is the set of feature vectors $V=\{v_1,\dots,v_k\}\in\mathbb{R}^{d\times N}$ for the k image regions.
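As a concrete illustration of this feature extraction stage, the sketch below pulls region features from a pretrained Faster R-CNN using torchvision. It is a minimal sketch, not the patent's implementation: the torchvision ResNet-50-FPN detector stands in for the VGG16 backbone described above, and keeping the top k = 36 RPN proposals is an assumption.

```python
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

@torch.no_grad()
def region_features(image, k=36):
    """Return V = {v_1, ..., v_k}: one pooled descriptor per proposed region."""
    images, _ = detector.transform([image])        # resize + normalise the input
    feats = detector.backbone(images.tensors)      # FPN feature maps
    proposals, _ = detector.rpn(images, feats)     # RPN candidate boxes
    boxes = [proposals[0][:k]]                     # keep the top-k proposals
    pooled = detector.roi_heads.box_roi_pool(feats, boxes, images.image_sizes)
    return detector.roi_heads.box_head(pooled)     # (k, 1024) region features V
```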
The attention-enhanced image description decoder is shown in Fig. 4. At each decoding step t, the decoder concatenates the word embedding of the current input word $y_{t-1}$ with the averaged visual feature $\bar{v}$ as the input to the LSTM, as in equation (1):
$h_t = \mathrm{LSTM}([W_e y_{t-1};\, \bar{v}],\, h_{t-1})$ (1)
where $[\,;\,]$ denotes concatenation and $W_e$ is a learnable word embedding matrix. Next, the output $h_t$ of the LSTM is used as a query to attend to the relevant image regions in the visual feature set V and to produce an attended visual feature $c_t$, as in equations (2), (3):
$\alpha_t = \mathrm{softmax}\big(w_\alpha^{\top}\tanh(W_V V \oplus W_h h_t)\big)$ (2)
$c_t = V\alpha_t$ (3)
where $w_\alpha$, $W_h$ and $W_V$ are learnable parameters and $\oplus$ denotes matrix-vector addition, computed by adding the vector to each column of the matrix. Finally, $h_t$ and $c_t$ are passed to a linear layer to predict the next word, as in equation (4):
$y_t \sim p_t = \mathrm{softmax}(W_p[h_t; c_t] + b_p)$ (4)
where $W_p$ and $b_p$ are learnable parameters. Given a target reference sequence $y^*_{1:T}$ and a description model with parameters $\theta$, the training objective is to minimize the following cross-entropy loss, as in equation (5):
$L_{CE}(\theta) = -\sum_{t=1}^{T}\log p_\theta(y^*_t \mid y^*_{1:t-1})$ (5)
As these formulas show, at each time step t the attention model depends on $h_t$, which contains only the description words $y_{1:t-1}$ generated in the past, to compute the attention weight $\alpha_t$. This reliance on past information means that the attended visual features are not grounded in the word generated at the current time step, which compromises the accuracy of the description.
To let the attention model relate image regions to the words being generated without this bias, a prophet attention model is employed, as shown in Fig. 5. It uses information about future words to guide the conventional, widely used attention model, solving its semantic misalignment problem and selecting the correct image region for generating the corresponding word.
Specifically, the whole sentence $y_{1:T}$ is first generated with the conventional encoder-decoder framework. Then, for each time step t, the prophet attention takes the future information $y_{i:j}$ ($j \ge t$) as input and computes the attention weight $\hat\alpha_t$, which is naturally grounded in the generated words. In the implementation, as shown in Fig. 5, a bidirectional LSTM (BiLSTM) encodes $y_{1:T}$, so the information of $y_{i:j}$ is first converted into $h'_{i:j}$; the attention weight is then computed by equation (6):
$\hat\alpha_t = \mathrm{softmax}\big(w_\alpha^{\top}\tanh(W_V V \oplus W_h h'_{i:j})\big)$ (6)
where the attention models in equations (2), (3) and (6) share the same set of parameters. During training, an $\ell_1$ penalty between $\alpha_t$ and $\hat\alpha_t$ is used as a regularization loss, defined as equation (7):
$L_{Att}(\theta) = \sum_{t=1}^{T}\big\|\alpha_t - \hat\alpha_t\big\|_1$ (7)
where $\|\cdot\|_1$ denotes the $\ell_1$ norm. By minimizing the loss in equation (7), the "biased" attention weight $\alpha_t$, computed on the previously generated words $y_{1:t-1}$, is pulled toward the "ideal" attention weight $\hat\alpha_t$ computed on the future words $y_{i:j}$ ($j \ge t$).
Then, to train the prophet attention, $\hat\alpha_t$ is incorporated into the conventional encoder-decoder framework to regenerate the target reference $y^*_{1:T}$; mirroring equations (3), (4), (5) with $\hat\alpha_t$ in place of $\alpha_t$, this is defined as equations (8), (9), (10):
$\hat c_t = V\hat\alpha_t$ (8)
$\hat p_t = \mathrm{softmax}(W_p[h_t; \hat c_t] + b_p)$ (9)
$\hat L_{CE}(\theta) = -\sum_{t=1}^{T}\log \hat p_t(y^*_t)$ (10)
Combining the loss $L_{CE}(\theta)$ in equation (5), the loss $\hat L_{CE}(\theta)$ in equation (10) and the loss $L_{Att}(\theta)$ in equation (7), the complete training objective is defined as equation (11):
$L(\theta) = L_{CE}(\theta) + \hat L_{CE}(\theta) + \lambda L_{Att}(\theta)$ (11)
where λ is a hyper-parameter that controls the regularization. During training, the description model is first pre-trained with equation (5) for 25 epochs and then the complete model is trained with equation (11); in this way, suitable parameter weights are initialized for the prophet attention. In the test phase, the description decoder follows the same procedure as a conventional attention model, since future words are not visible at the current time step in the language generation task.
So that image regions can be attended dynamically on the basis of information from future time steps, a dynamic variant is used. In particular, for a noun phrase such as "a black shirt", all of its words should be treated as a complete phrase rather than as single words. Thus, for Dynamic Prophet Attention (DPA), if the currently output word $y_t$ belongs to a noun phrase (NP), the DPA uses all the words in that noun phrase to compute the attention weight $\hat\alpha_t$. When the word is a non-visual (NV) word, the prophet attention model is masked, i.e. the corresponding terms in the loss of equation (10) and the loss of equation (7) are dropped. For the remaining words, i = j = t is set directly; in image description these remaining words are typically verbs, which act as relational words connecting different noun phrases. In short, dynamic prophet attention is defined by equation (12):
$(i, j) = \begin{cases} (\mathrm{start}(\mathrm{NP}),\ \mathrm{end}(\mathrm{NP})) & y_t \in \mathrm{NP} \\ \text{masked} & y_t \in \{y_{NV}\} \\ (t,\ t) & \text{otherwise} \end{cases}$ (12)
where $\{y_{NV}\}$ denotes the set of all non-visual words. The attention model can thus learn to locate the image region for each output word $y_t$ without needing the reference samples of the training descriptions.
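A minimal sketch of the prophet attention computation in equations (6), (7) follows. The module names and the detaching of the ideal weights are assumptions; in the described method the attention parameters are shared with the decoder's attention of equations (2), (3), whereas this sketch keeps a separate copy for brevity, and the non-visual-word masking of equation (12) is supplied by the caller through visual_mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProphetAttentionRegularizer(nn.Module):
    """Compute 'ideal' attention weights from future words (BiLSTM-encoded) and
    pull the decoder's weights toward them with an L1 penalty (eqs. 6-7)."""
    def __init__(self, d=1024, m=512):
        super().__init__()
        self.bilstm = nn.LSTM(m, m // 2, bidirectional=True, batch_first=True)
        self.w_h = nn.Linear(m, m)
        self.w_v = nn.Linear(d, m)
        self.w_a = nn.Linear(m, 1)

    def ideal_weights(self, word_embs, V):
        # Encode the whole generated sentence so each step can see future words.
        h_prime, _ = self.bilstm(word_embs)              # (B, T, m)
        scores = self.w_a(torch.tanh(
            self.w_h(h_prime).unsqueeze(2) + self.w_v(V).unsqueeze(1)))
        return F.softmax(scores.squeeze(-1), dim=-1)     # (B, T, k)

    @staticmethod
    def l1_loss(alpha, alpha_hat, visual_mask):
        # Eq. (7): L1 distance, masked where the target word is non-visual.
        diff = (alpha - alpha_hat.detach()).abs().sum(dim=-1)   # (B, T)
        return (diff * visual_mask).sum() / visual_mask.sum().clamp(min=1)
```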
The task of the discriminator is to score the similarity between the image and the description. The image and the description are co-embedded at an early stage using a co-attention model, and the similarity is computed over the full set representations. The co-attention discriminator is shown in Fig. 6; the details of its construction are given below.
Given a sentence w composed of a word sequence $(w_1,\dots,w_T)$, the discriminator embeds each word with an LSTM (state dimension m = 512), giving $H=[h_1,\dots,h_T]^{\top}\in\mathbb{R}^{T\times m}$, where $h_t, c_t = \mathrm{LSTM}(h_{t-1}, c_{t-1}, w_t)$. For the image I, features $(I_1,\dots,I_C)$ are extracted, where $C = 14\times 14 = 196$, and embedded as $I=[W I_1,\dots,W I_C]^{\top}\in\mathbb{R}^{C\times m}$, where $d_I = 2048$ is the image feature size used here. A bilinear projection $Q\in\mathbb{R}^{m\times m}$ is used to compute the correlation $Y = \tanh(IQH^{\top})\in\mathbb{R}^{C\times T}$. The matrix Y is used to compute the co-attention weight of one modality on the other, as in equations (13), (14):
$\alpha = \mathrm{Softmax}\big(\mathrm{Linear}(\tanh(IW_I + YHW_{Ih}))\big) \in \mathbb{R}^{C}$ (13)
$\beta = \mathrm{Softmax}\big(\mathrm{Linear}(\tanh(HW_h + Y^{\top}IW_{hI}))\big) \in \mathbb{R}^{T}$ (14)
where all the new matrices are in $\mathbb{R}^{m\times m}$. The weights are then used to combine the word and image features into attended representations, with $U_I, V_S \in \mathbb{R}^{m\times m}$. Finally, the image-description score is computed from the attended image and sentence features (projected through $U_I$ and $V_S$) together with $E_I$ and $E_S$, where $E_I$ is the spatially averaged CNN feature and $E_S$ is the last state of the LSTM.
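The sketch below mirrors the co-attention computation of equations (13), (14) and a plausible scoring head. The final score (a bilinear match between the attended image and sentence features) is an assumption, and the $E_I$ and $E_S$ terms are omitted for brevity.

```python
import torch
import torch.nn as nn

class CoAttentionDiscriminator(nn.Module):
    """Correlate region features (C x m) with word states (T x m) through a
    bilinear map Q, attend over each modality, and score the pair (eqs. 13-14)."""
    def __init__(self, m=512, d_img=2048):
        super().__init__()
        self.embed_img = nn.Linear(d_img, m, bias=False)   # W in I = [W I_1, ...]
        self.Q = nn.Parameter(torch.randn(m, m) * 0.02)    # bilinear correlation
        self.w_I = nn.Linear(m, m)
        self.w_Ih = nn.Linear(m, m)
        self.w_h = nn.Linear(m, m)
        self.w_hI = nn.Linear(m, m)
        self.lin_img = nn.Linear(m, 1)
        self.lin_txt = nn.Linear(m, 1)
        self.U_I = nn.Linear(m, m)
        self.V_S = nn.Linear(m, m)

    def forward(self, img_feats, word_states):
        I = self.embed_img(img_feats)                      # (B, C, m)
        H = word_states                                    # (B, T, m)
        Y = torch.tanh(I @ self.Q @ H.transpose(1, 2))     # (B, C, T)
        # Eqs. (13)-(14): attention over each modality, guided by the other.
        alpha = torch.softmax(self.lin_img(torch.tanh(
            self.w_I(I) + Y @ self.w_Ih(H))).squeeze(-1), dim=-1)          # (B, C)
        beta = torch.softmax(self.lin_txt(torch.tanh(
            self.w_h(H) + Y.transpose(1, 2) @ self.w_hI(I))).squeeze(-1), dim=-1)
        i_att = (alpha.unsqueeze(-1) * I).sum(dim=1)       # attended image vector
        s_att = (beta.unsqueeze(-1) * H).sum(dim=1)        # attended sentence vector
        # Assumed scoring head: bilinear match between the two attended vectors.
        score = (self.U_I(i_att) * self.V_S(s_att)).sum(dim=-1)
        return torch.sigmoid(score)
```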
At training time the generator is optimized to solve $\max_\theta L_G(\theta)$, where $L_G(\theta) = \mathbb{E}_I\big[\log D_\eta(I, G_\theta(I))\big]$. The generator $G_\theta$ is trained with SCST, a variant of reinforcement learning that uses the reward obtained under the decoding algorithm as a baseline. In this work the decoding algorithm is greedy: at each step the most likely word is selected by $\arg\max p_\theta(\cdot \mid h_t)$. For a given image, a single sample $w^s$ from the generator is used to estimate the full-sequence reward, where $w^s \sim p_\theta(\cdot \mid I)$. Using SCST, the gradient is estimated as in equation (15):
$\nabla_\theta L_G(\theta) \approx \big(r(w^s) - r(\hat w)\big)\,\nabla_\theta \log p_\theta(w^s \mid I)$ (15)
where $r(w) = \log D_\eta(I, w)$ and $\hat w$ is obtained by greedy decoding, as shown in Fig. 7. Note that the baseline does not change the expected value of the gradient, but it reduces the variance of the estimate.
In addition, the GAN training can be regularized with an image description evaluation metric $r_{NLP}$ (CIDEr in this work), bringing the generated description close to the provided sample references at the N-gram level. The gradient is then as in equation (16), which is equation (15) with the reward augmented to $r(w) = \log D_\eta(I, w) + r_{NLP}(w)$:
$\nabla_\theta L_G(\theta) \approx \Big(\big(\log D_\eta(I, w^s) + r_{NLP}(w^s)\big) - \big(\log D_\eta(I, \hat w) + r_{NLP}(\hat w)\big)\Big)\,\nabla_\theta \log p_\theta(w^s \mid I)$ (16)
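A compact sketch of the self-critical gradient estimate in equations (15), (16) follows: the sampled caption's reward is baselined by the greedy caption's reward, and the reward mixes the discriminator score with CIDEr. The weighting lam is an assumption (1.0 matches the direct sum above).

```python
import torch

def mixed_reward(disc_score, cider_score, lam=1.0):
    """r(w) = log D(I, w) + lam * CIDEr(w); the weighting lam is assumed."""
    return torch.log(disc_score.clamp(min=1e-8)) + lam * cider_score

def scst_loss(log_probs_sampled, reward_sampled, reward_greedy):
    """Minimising this loss follows eq. (15)/(16): the advantage
    (r(w^s) - r(w_hat)) scales the log-probability of the sampled caption."""
    advantage = (reward_sampled - reward_greedy).detach()          # (B,)
    return -(advantage * log_probs_sampled.sum(dim=-1)).mean()
```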
distinguishing device D η Not only is training to distinguish between a real description and a fake description, but it is also possible to detect when an image is combined with a random uncorrelated real sentence, forcing it to check not only the composition of the sentence, but also the semantic relationship between the image and the description. To achieve this goal, this section solves the following optimization problem: max (max) η L D (eta) wherein L is lost D (eta) is formula (17):
wherein w is a true sentence, w s Is a slave generator G θ The resulting spurious description is sampled and w' is a true but randomly chosen description.
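Equation (17) translates directly into the following loss, written as a quantity to minimise, i.e. the negative of $L_D(\eta)$; the small eps for numerical stability is an implementation assumption.

```python
import torch

def discriminator_loss(d_real, d_fake, d_mismatch, eps=1e-8):
    """Eq. (17): reward the discriminator for accepting the true (image,
    description) pair and for rejecting both the generated description w^s
    and a real but randomly paired description w'."""
    return -(torch.log(d_real + eps)
             + torch.log(1.0 - d_fake + eps)
             + torch.log(1.0 - d_mismatch + eps)).mean()
```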

Claims (4)

1. An image description generation method based on a common attention mechanism, characterized by comprising the following steps:
Step 1: the image description method is based on a generative adversarial network; the network model is divided into a generator and a discriminator, the former generating a description for the corresponding image and the latter evaluating how accurately the text description matches the image;
Step 2: the generator in step 1 adopts an encoder-decoder framework: the encoder is a convolutional neural network, the decoder is a recurrent neural network, and a prophet attention mechanism is incorporated; given an image I, the generator G outputs an image description;
Step 3: the encoder in step 2 adopts Faster R-CNN, which takes the image I and extracts the image features $V=\{v_1,\dots,v_k\}\in\mathbb{R}^{d\times N}$;
Step 4: the generator decoder in step 2 consists of an initial layer and a prophet attention layer; the initial layer is an LSTM, modified so that generation of the image description can be controlled; the prophet attention layer computes attention weights with a bidirectional LSTM and improves on self-attention; the attention weights are split into a present part and a future part, where the future part is computed from the predicted generation probability of the next word;
Step 5: in step 1, the discriminator network adopts a co-attention mechanism to classify a generated image description as human-written or machine-generated; the discriminator consists of two parts, an image attention module and a text attention module, which extract features of the image and of the description, respectively, and produce corresponding attention matrices; the two attention matrices are then combined by a dot product to produce a matrix representing the degree of semantic match between the image and the description; finally, this matrix is used as the output of the discriminator to enforce semantic alignment between the image and the description;
Step 6: the network model is trained with self-critical sequence training (SCST), using the reward obtained under the decoding algorithm as a baseline and regularizing with the image description evaluation metric CIDEr so that the generated descriptions stay close to the provided reference samples at the N-gram level;
Step 7: the discriminator and the generator are trained alternately; trained together, the two modules reach an equilibrium, and the result is a generator network that generates descriptions and achieves semantic alignment between the image and the description.
2. The method of claim 1, wherein the attention mechanism in step 2 is a prophet attention mechanism.
3. The method of claim 1, wherein the discriminator network in step 5 is a co-attention discriminator.
4. The method of claim 1, wherein the model in step 6 is trained using the SCST reinforcement learning method.
CN202310334196.7A 2023-03-31 2023-03-31 Image description generation method based on common attention mechanism Pending CN116452688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310334196.7A CN116452688A (en) 2023-03-31 2023-03-31 Image description generation method based on common attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310334196.7A CN116452688A (en) 2023-03-31 2023-03-31 Image description generation method based on common attention mechanism

Publications (1)

Publication Number Publication Date
CN116452688A true CN116452688A (en) 2023-07-18

Family

ID=87129529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310334196.7A Pending CN116452688A (en) 2023-03-31 2023-03-31 Image description generation method based on common attention mechanism

Country Status (1)

Country Link
CN (1) CN116452688A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332048A (en) * 2023-11-30 2024-01-02 运易通科技有限公司 Logistics information query method, device and system based on machine learning
CN117332048B (en) * 2023-11-30 2024-03-22 运易通科技有限公司 Logistics information query method, device and system based on machine learning
CN118094447A (en) * 2024-04-24 2024-05-28 贵州大学 Unmanned aerial vehicle flight data self-adaptive anomaly detection method based on encoding-decoding
CN118094447B (en) * 2024-04-24 2024-07-02 贵州大学 Unmanned aerial vehicle flight data self-adaptive anomaly detection method based on encoding-decoding

Similar Documents

Publication Publication Date Title
CN113158875B (en) Image-text emotion analysis method and system based on multi-mode interaction fusion network
CN109992686A (en) Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN108765383B (en) Video description method based on deep migration learning
CN108563624A (en) A kind of spatial term method based on deep learning
CN116452688A (en) Image description generation method based on common attention mechanism
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN116955699B (en) Video cross-mode search model training method, searching method and device
CN113035311A (en) Medical image report automatic generation method based on multi-mode attention mechanism
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN114239585A (en) Biomedical nested named entity recognition method
CN116579345B (en) Named entity recognition model training method, named entity recognition method and named entity recognition device
CN117611576A (en) Image-text fusion-based contrast learning prediction method
CN114722798A (en) Ironic recognition model based on convolutional neural network and attention system
CN114626454A (en) Visual emotion recognition method integrating self-supervision learning and attention mechanism
CN117829243A (en) Model training method, target detection device, electronic equipment and medium
CN115758159B (en) Zero sample text position detection method based on mixed contrast learning and generation type data enhancement
CN116151226B (en) Machine learning-based deaf-mute sign language error correction method, equipment and medium
CN117115474A (en) End-to-end single target tracking method based on multi-stage feature extraction
CN116958740A (en) Zero sample target detection method based on semantic perception and self-adaptive contrast learning
Wu et al. Question-driven multiple attention (dqma) model for visual question answer
Ren et al. Improved image description via embedded object structure graph and semantic feature matching
CN114692615B (en) Small sample intention recognition method for small languages
Vakada et al. Descriptive and Coherent Paragraph Generation for Image Paragraph Captioning Using Vision Transformer and Post-processing
Sheng et al. Revolutionizing Image Captioning: Integrating Attention Mechanisms with Adaptive Fusion Gates.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination