CN116452688A - Image description generation method based on common attention mechanism
- Publication number: CN116452688A
- Application number: CN202310334196.7A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T11/00: 2D [Two Dimensional] image generation
- G06N3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/0455: Auto-encoder networks; Encoder-decoder networks
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/0475: Generative networks
- G06N3/092: Reinforcement learning
- G06N3/094: Adversarial learning
- Y02T10/40: Engine management systems
Abstract
The invention discloses an image description generation method based on a common attention (co-attention) mechanism, which improves the semantic alignment of image description algorithms. To address the problem that the generated description is not aligned with regions of the image, a prophet attention mechanism is added to the encoder-decoder framework; it dynamically attends to image regions using information from future time steps. To address the problem of semantic consistency in image description, a common attention mechanism is introduced into the discriminator and the idea of adversarial learning is adopted: the generator and the discriminator are trained so that the discriminator classifies the generated image descriptions, thereby improving their semantic consistency. Built on a generative adversarial network, the image description model based on the common attention mechanism can accurately generate descriptions that conform to the image content and produce linguistically diverse image descriptions.
Description
Technical Field
The invention relates to the field of image description generation in deep learning, and aims to solve the problem that the image is not semantically aligned with the generated description in image description generation.
Background
Image description algorithms combine computer vision and natural language processing techniques so that a machine can generate a natural language description from a given image. Applications include image search, automatic image annotation, intelligent robots and other fields.
In practical application scenarios, image description algorithms are already widely used. In social media, they can help platforms automatically generate image descriptions, so that users better understand photo content and the user experience is enhanced. In search engines, they can help the engine better understand picture content, improving retrieval accuracy and providing better search results for users. In autonomous driving, the vehicle must perceive the environment through image recognition, and image description algorithms help it better understand and predict road conditions. Image description algorithms can also be applied in many other fields, such as medical imaging and unmanned aerial vehicle monitoring, providing strong support for intelligent and automated systems.
Image description algorithms mainly use an attention-enhanced encoder-decoder framework, in which the attention mechanism guides the decoding process by attending to image regions according to the hidden state at each time step. This technique has greatly promoted the development of image description. However, the current attention mechanism attends to image regions based on the previous hidden state, which only contains information about the words generated in the past. The attention model must therefore predict attention weights without knowing the word it should be grounded to, so the attended image regions correspond more closely to the current input word than to the word being generated.
For the task of generating image descriptions with convolutional neural networks, reinforcement learning based on policy gradient methods has been introduced to directly optimize n-gram matching metrics such as CIDEr, BLEU-4 or SPICE; for example, image description models are trained with CIDEr as the optimization objective. However, these metrics do not enforce semantic alignment between the image and the description, and they provide no way to promote the naturalness of language so that machine-generated text becomes indistinguishable from human-written text.
With the continuous progress of deep learning, its application in image description algorithms has become ever broader, and it is applied here to the problem that the image does not match the semantics of the generated description. Building on a generative adversarial network, the invention uses an improved prophet attention mechanism in the network design and trains a common attention discriminator to detect a misalignment signal between the image and the generated sentence. The generator can then use this signal to improve its text generation mechanism and better align the description with the given image.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an image description generation method based on a common attention mechanism, built on a generative adversarial network. The method incorporates an improved prophet attention mechanism and trains a common attention discriminator to detect a misalignment signal between the image and the generated sentence. Aiming at the mismatch between the image and the description semantics in image description algorithms (as shown in Fig. 1), the image description algorithm based on a generative adversarial network is further improved.
The technical scheme adopted by the invention is as follows:
step 1: based on an image description algorithm built on a generative adversarial network, the network model is divided into a generator and a discriminator; the former generates a description of the given image, and the latter evaluates how accurately the textual description describes the image. The overall framework is shown in Fig. 2;
step 2: the generator in step 1 adopts an encoder-decoder framework, in which the encoder adopts a convolutional neural network with a prophet attention mechanism and the decoder adopts a recurrent neural network; given an image I, the generator G outputs an image description;
step 3: the encoder in step 2 adopts Faster R-CNN to accept the image I and extract the image features $V=\{v_1,\dots,v_k\}\in\mathbb{R}^{d\times N}$;
step 4: the generator decoder in step 2 consists of an initial layer and a prophet attention layer. The initial layer is an LSTM, modified so as to control the generation of the image description. The prophet attention layer computes attention weights using a bidirectional LSTM and an improved self-attention; the attention weights are divided into a present part and a future part, where the future part is computed from the predicted generation probability of the next word;
step 5: in step 1, the discriminator network adopts a common attention mechanism to discriminate whether a generated image description is human-written or machine-generated. The discriminator consists of two parts: an image attention module and a text attention module, which extract features of the image and of the description respectively and produce the corresponding attention matrices. The two attention matrices are then combined by a dot-product operation into a matrix representing the degree of semantic matching between the image and the description. Finally, this matrix is used as the output of the discriminator to enforce semantic alignment between the image and the description;
step 6: the network model is trained with the reinforcement learning method SCST, using the reward under the decoding algorithm as a baseline and regularizing with the image description evaluation metric CIDEr, so that the generated descriptions stay close to the provided reference samples at the n-gram level;
step 7: during training the discriminator alternates with the generator; the two modules are trained together until the descriptions generated by the network reach an equilibrium, finally yielding a generator network that generates descriptions and achieves semantic alignment between the image and the description (a minimal sketch of this alternating schedule is given after this list).
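The alternating generator/discriminator schedule of step 7 can be summarised as a short training loop. The following is a minimal sketch only; the function names, the 1:1 update ratio and the epoch count are illustrative assumptions, with the actual update rules detailed later in the description.

```python
# Minimal sketch of the alternating adversarial training in step 7.
# generator_update (SCST, step 6) and discriminator_update (common attention
# discriminator, step 5) are assumed callables returning a scalar loss.
def adversarial_training(generator_update, discriminator_update, data_loader, epochs=30):
    for epoch in range(epochs):
        for images, references in data_loader:
            d_loss = discriminator_update(images, references)  # train D on real/fake/mismatched pairs
            g_loss = generator_update(images, references)      # train G with SCST against D's reward
        print(f"epoch {epoch}: D loss {d_loss:.3f}, G loss {g_loss:.3f}")
```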
Compared with the prior art, the invention has the following beneficial effects:
(1) higher description accuracy on the problem of aligning the image with the description semantics;
(2) addressing the limited diversity of existing image description algorithms, more varied and linguistically richer descriptions can be generated.
Drawings
Fig. 1 is an example of the sequence of image regions attended to for each word of the generated image description.
Fig. 2 is the overall framework of the image description model based on the common attention mechanism.
Fig. 3 is the Faster R-CNN framework diagram.
Fig. 4 is the visual attention architecture diagram.
Fig. 5 is the prophet attention mechanism architecture diagram.
Fig. 6 is the common attention discriminator architecture diagram.
Fig. 7 is a schematic of training the generator with SCST.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
First, the encoder network in the generator uses Faster R-CNN, whose structure is shown in Fig. 3. Faster R-CNN is an object detection model that identifies object instances belonging to certain classes and localizes them with bounding boxes.
The Faster R-CNN model consists of two main modules, the RPN candidate-box extraction module and the Fast R-CNN detection module, and can be further subdivided into three parts: the convolutional layers, the Region Proposal Network (RPN) and RoI pooling. The convolutional layers comprise a series of convolution (Conv+ReLU) and pooling operations that extract image features; the classical VGG16 network is used, and the convolutional weights are shared between the RPN and Fast R-CNN, which is key to accelerating training and improving the real-time performance of the model. The RPN generates region candidate boxes: based on the multi-scale anchor boxes introduced by the network, it uses Softmax to classify whether an anchor box belongs to a target or to the background, and applies bounding-box regression to refine the anchor boxes and obtain accurate candidate-box positions for subsequent object recognition and detection. The RoI pooling network combines the convolutional features with the candidate-box information, maps the candidate-box coordinates in the input image onto the final convolutional layer (conv5-3), pools the corresponding regions of the feature map to a fixed output size (7×7), and connects to the following fully connected layers. The fully connected layers feed two sub-layers: a classification layer that determines the category of each candidate box, and a regression layer that predicts its precise position through bounding-box regression. The output of Faster R-CNN is the region feature matrix $V=\{v_1,\dots,v_k\}\in\mathbb{R}^{d\times N}$ for k image regions.
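For illustration, region features V can be obtained from an off-the-shelf detector as sketched below. This is a minimal sketch, not the patent's implementation: the torchvision ResNet50-FPN model, the top-k box selection and the 7×7 average pooling are assumptions (the patent's encoder uses a VGG16-based Faster R-CNN).

```python
import torch
import torchvision
from torchvision.ops import roi_align

# Pretrained detector used purely for illustration.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_region_features(image: torch.Tensor, k: int = 36) -> torch.Tensor:
    """Return a (k, d) matrix of region features V for one image tensor of shape (C, H, W)."""
    with torch.no_grad():
        detections = detector([image])[0]                     # boxes, labels, scores
        boxes = detections["boxes"][:k]                       # keep the top-k detected regions
        feature_maps = detector.backbone(image.unsqueeze(0))  # FPN feature maps
        feats = feature_maps["0"]                             # highest-resolution level, (1, d, H', W')
        scale = feats.shape[-1] / image.shape[-1]             # spatial scale of that level
        pooled = roi_align(feats, [boxes], output_size=(7, 7),
                           spatial_scale=scale, aligned=True)  # (k, d, 7, 7) per-region maps
        V = pooled.mean(dim=(2, 3))                            # (k, d) region feature vectors
    return V
```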
The attention-enhanced image description decoder is shown in Fig. 4. At each decoding step t, the decoder concatenates the word embedding of the current input word $y_{t-1}$ with the averaged visual feature $\bar{v}$ as the input of the LSTM, as in equation (1):

$$h_t = \mathrm{LSTM}\left(\left[W_e y_{t-1};\ \bar{v}\right],\ h_{t-1}\right) \quad (1)$$

where $[\,;\,]$ denotes the concatenation operation and $W_e$ is a learnable word embedding matrix. Next, the LSTM output $h_t$ is used as a query to attend to the relevant image regions in the visual feature set V and to generate the attended visual feature $c_t$, as in equations (2) and (3):

$$\alpha_t = \mathrm{softmax}\left(w_\alpha^{\top}\tanh\left(W_h h_t \oplus W_V V\right)\right) \quad (2)$$

$$c_t = V\alpha_t \quad (3)$$

where $w_\alpha$, $W_h$ and $W_V$ are learnable parameters, and $\oplus$ denotes matrix-vector addition, computed by adding the vector to each column of the matrix. Finally, $h_t$ and $c_t$ are passed to a linear layer to predict the next word, as in equation (4):

$$y_t \sim p_t = \mathrm{softmax}\left(W_p\left[h_t; c_t\right] + b_p\right) \quad (4)$$

where $W_p$ and $b_p$ are learnable parameters. Given a target reference sequence $y^*_{1:T}$ and a description model with parameters θ, the training objective is to minimize the following cross-entropy loss, as in equation (5):

$$L_{CE}(\theta) = -\sum_{t=1}^{T}\log p_\theta\left(y^*_t \mid y^*_{1:t-1}\right) \quad (5)$$
as can be seen from the formula, at each time period t, the attention model depends on h t It contains the descriptive word y generated in the past 1:t-1 To calculate the attention weight alpha t . This reliance on past information makes the visual features of interest less basic on words generated during the current time period, which compromises the accuracy of the description.
In order to let the attention model relate image regions to the words to be generated without bias, a prophet attention model is adopted, as shown in Fig. 5. It uses information about the future words to guide the conventional, widely used attention model, solving its semantic misalignment problem and selecting the correct image regions for generating the corresponding words.

Specifically, the whole sentence $y_{1:T}$ is first generated with a conventional encoder-decoder framework. Then, for each time step t, the prophet attention takes the future information $y_{i:j}$ ($j \geq t$) as input and computes the attention weight $\hat{\alpha}_t$, which is therefore naturally grounded in the generated words. In the implementation, as shown in Fig. 5, a bidirectional LSTM (BiLSTM) encodes $y_{1:T}$, so that the information of $y_{i:j}$ is first converted into $h'_{i:j}$; the attention weight is then computed by equation (6):

$$\hat{\alpha}_t = \mathrm{softmax}\left(w_\alpha^{\top}\tanh\left(W_h \sum_{k=i}^{j} h'_k \oplus W_V V\right)\right) \quad (6)$$

where the attention models in equations (2), (3) and (6) share the same set of parameters. During training, the L1 distance between $\alpha_t$ and $\hat{\alpha}_t$ is used as a regularization loss, defined as equation (7):

$$L_{Att}(\theta) = \sum_{t=1}^{T}\left\lVert \alpha_t - \hat{\alpha}_t \right\rVert_1 \quad (7)$$

where $\lVert\cdot\rVert_1$ denotes the L1 norm. By minimizing the loss in equation (7), the "biased" attention weight $\alpha_t$, computed from the previously generated words $y_{1:t-1}$, is pulled towards the "ideal" attention weight $\hat{\alpha}_t$ computed from the future words $y_{i:j}$ ($j \geq t$).
Then, to train the prophet attention, $\hat{\alpha}_t$ is incorporated into the conventional encoder-decoder framework to regenerate the target reference $y^*_{1:T}$, as defined in equations (8), (9) and (10):

$$\hat{c}_t = V\hat{\alpha}_t \quad (8)$$

$$\hat{y}_t \sim \hat{p}_t = \mathrm{softmax}\left(W_p\left[h_t; \hat{c}_t\right] + b_p\right) \quad (9)$$

$$\hat{L}_{CE}(\theta) = -\sum_{t=1}^{T}\log \hat{p}_\theta\left(y^*_t \mid y^*_{1:T}\right) \quad (10)$$

Combining the loss $L_{CE}(\theta)$ in equation (5), the loss $\hat{L}_{CE}(\theta)$ in equation (10) and the loss $L_{Att}(\theta)$ in equation (7), the complete training objective is defined as equation (11):

$$L(\theta) = L_{CE}(\theta) + \hat{L}_{CE}(\theta) + \lambda L_{Att}(\theta) \quad (11)$$

where λ is a hyper-parameter controlling the regularization strength. In the training process, the description model is first pre-trained for 25 epochs with equation (5), and then the complete model is trained with equation (11); in this way, suitable parameter weights are initialized for the prophet attention. In the test phase, the description decoder follows the same procedure as a conventional attention model, since in the language generation task future words are not visible at the current time step.
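The prophet branch of equations (6) and (7) can be sketched as follows: a BiLSTM encodes the full target sentence, the pooled encoding of the future span $y_{i:j}$ queries the same additive attention as equation (2), and an L1 loss pulls the biased weights towards the resulting ideal weights. This is a minimal sketch; the sum pooling of the span encoding and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProphetAttention(nn.Module):
    """Computes the 'ideal' weights alpha_hat_t of eq. (6) from future words."""
    def __init__(self, emb_dim=512, hid_dim=512):
        super().__init__()
        # BiLSTM over the whole target sentence y_{1:T}, producing h'_{1:T}
        self.bilstm = nn.LSTM(emb_dim, hid_dim // 2, bidirectional=True, batch_first=True)

    def ideal_weights(self, word_embs, spans, attend):
        """word_embs: (B, T, E); spans: one (i, j) per step t; attend: query -> weights (eq. 2)."""
        h_prime, _ = self.bilstm(word_embs)                    # (B, T, hid_dim)
        alpha_hat = [attend(h_prime[:, i:j + 1].sum(dim=1))    # pool the future span y_{i:j}
                     for (i, j) in spans]
        return torch.stack(alpha_hat, dim=1)                   # (B, T, k)

def attention_regularization(alpha, alpha_hat):
    """Equation (7): L1 distance between biased and ideal attention weights."""
    return (alpha - alpha_hat).abs().sum(dim=-1).sum()
```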
The image regions are attended to dynamically according to the information available at future time steps. In particular, for a noun phrase such as "a black shirt", all of its words should be treated as one complete phrase rather than as single words. Therefore, in Dynamic Prophet Attention (DPA), if the currently output word $y_t$ belongs to a noun phrase (NP), the DPA uses all words of that noun phrase to compute the attention weight $\hat{\alpha}_t$. When the word is a non-visual (NV) word, the prophet attention model is masked, i.e. the losses in equations (7) and (10) are not applied to it. For the remaining words, i = j = t is set directly; in image descriptions these remaining words are typically verbs, which act as relational words connecting different noun phrases. In short, dynamic prophet attention is defined as equation (12):

$$(i, j) = \begin{cases} \left(\mathrm{start}(\mathrm{NP}_t),\ \mathrm{end}(\mathrm{NP}_t)\right), & y_t \in \mathrm{NP} \\ \text{masked}, & y_t \in \{y^{NV}\} \\ (t,\ t), & \text{otherwise} \end{cases} \quad (12)$$

where $\{y^{NV}\}$ denotes the set of all NV words. With this rule, the attention model learns to ground each output word $y_t$ in an image region without requiring additional grounding annotations for the training descriptions.
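The span selection rule of equation (12) is simple enough to sketch directly. The noun-phrase chunks and the non-visual word list below are illustrative assumptions; in practice they would come from a chunker and a predefined vocabulary list.

```python
from typing import List, Optional, Tuple

NON_VISUAL = {"a", "the", "of", "with", "is", "are", "on", "in", "and"}  # assumed NV list

def dpa_spans(words: List[str],
              np_chunks: List[Tuple[int, int]]) -> List[Optional[Tuple[int, int]]]:
    """Return one (i, j) span per step t, or None when the prophet losses are masked."""
    spans: List[Optional[Tuple[int, int]]] = []
    for t, w in enumerate(words):
        chunk = next(((i, j) for (i, j) in np_chunks if i <= t <= j), None)
        if chunk is not None:
            spans.append(chunk)            # word inside a noun phrase: use the whole NP
        elif w.lower() in NON_VISUAL:
            spans.append(None)             # non-visual word: mask the prophet losses
        else:
            spans.append((t, t))           # remaining words (mostly verbs): i = j = t
    return spans

# Example: "a man wearing a black shirt" with noun phrases (0, 1) and (3, 5)
print(dpa_spans("a man wearing a black shirt".split(), [(0, 1), (3, 5)]))
```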
The task of the discriminator is to score the similarity between the image and the description. The image and the description are jointly embedded at an early stage using a common attention model, and the similarity is computed on the resulting joint representation. The common attention discriminator is shown in Fig. 6 and its construction is detailed below.

Given a sentence w composed of the word sequence $(w_1, \dots, w_T)$, the discriminator embeds each word with an LSTM (state dimension m = 512), producing $H = [h_1, \dots, h_T]^{\top} \in \mathbb{R}^{T\times m}$, where $h_t, c_t = \mathrm{LSTM}(h_{t-1}, c_{t-1}, w_t)$. For the image I, features $(I_1, \dots, I_C)$ are extracted, where C = 14 × 14 = 196, and embedded as $\mathbf{I} = [W I_1, \dots, W I_C]^{\top} \in \mathbb{R}^{C\times m}$, where $d_I = 2048$ is the image feature size used here. A bilinear projection $Q \in \mathbb{R}^{m\times m}$ is used to compute the correlation $Y = \tanh(\mathbf{I} Q H^{\top}) \in \mathbb{R}^{C\times T}$. The matrix Y is then used to compute the common attention weight of one modality on the other, as in equations (13) and (14):

$$\alpha = \mathrm{Softmax}\left(\mathrm{Linear}\left(\tanh\left(\mathbf{I} W_I + Y H W_{Ih}\right)\right)\right) \in \mathbb{R}^{C} \quad (13)$$

$$\beta = \mathrm{Softmax}\left(\mathrm{Linear}\left(\tanh\left(H W_h + Y^{\top} \mathbf{I} W_{hI}\right)\right)\right) \in \mathbb{R}^{T} \quad (14)$$

where all the new matrices are in $\mathbb{R}^{m\times m}$. The weights α and β are then used to pool the word and image features into attended representations, with projection matrices $U_I, V_S \in \mathbb{R}^{m\times m}$. Finally, the image-description score is computed from these attended representations together with $E_I$, the spatial average of the CNN features, and $E_S$, the last state of the LSTM.
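The co-attention scoring of equations (13) and (14) can be sketched as follows. Because the patent text does not spell out the final scoring formula, the last line (projecting the attended features with $U_I$ and $V_S$ and taking a sigmoid of their inner product) is an assumption; the rest follows the equations above.

```python
import torch
import torch.nn as nn

class CoAttentionDiscriminator(nn.Module):
    def __init__(self, vocab_size, m=512, d_img=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, m)
        self.lstm = nn.LSTM(m, m, batch_first=True)           # word encoder, state dim m
        self.W = nn.Linear(d_img, m)                           # image embedding W
        self.Q = nn.Parameter(torch.randn(m, m) * 0.01)        # bilinear correlation Q
        self.W_I, self.W_Ih = nn.Linear(m, m, bias=False), nn.Linear(m, m, bias=False)
        self.W_h, self.W_hI = nn.Linear(m, m, bias=False), nn.Linear(m, m, bias=False)
        self.lin_img, self.lin_txt = nn.Linear(m, 1), nn.Linear(m, 1)
        self.U_I, self.V_S = nn.Linear(m, m), nn.Linear(m, m)

    def forward(self, img_feats, words):
        """img_feats: (B, C, d_img) grid features; words: (B, T) word ids."""
        I = self.W(img_feats)                                  # (B, C, m)
        H, _ = self.lstm(self.embed(words))                    # (B, T, m)
        Y = torch.tanh(I @ self.Q @ H.transpose(1, 2))         # (B, C, T) correlation
        alpha = torch.softmax(self.lin_img(torch.tanh(
            self.W_I(I) + self.W_Ih(Y @ H))).squeeze(-1), dim=-1)                  # eq. (13)
        beta = torch.softmax(self.lin_txt(torch.tanh(
            self.W_h(H) + self.W_hI(Y.transpose(1, 2) @ I))).squeeze(-1), dim=-1)  # eq. (14)
        i_att = (alpha.unsqueeze(-1) * I).sum(dim=1)           # attended image feature
        h_att = (beta.unsqueeze(-1) * H).sum(dim=1)            # attended sentence feature
        # Assumed scoring head: inner product of the projected attended features.
        return torch.sigmoid((self.U_I(i_att) * self.V_S(h_att)).sum(dim=-1))
```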
At model training time, the generator is optimized to solve $\max_\theta L_G(\theta)$, where $L_G(\theta)=\mathbb{E}_{I}\left[\log D_\eta(I, G_\theta(I))\right]$. The generator $G_\theta$ is trained with SCST, a variant of reinforcement learning that uses the reward obtained under the decoding algorithm as a baseline. In this work, the decoding algorithm is greedy decoding, which at each step selects the most likely word $\arg\max p_\theta(\cdot \mid h_t)$. For a given image, a single sample $w^s \sim p_\theta(\cdot \mid I)$ from the generator is used to estimate the full-sequence reward. With SCST, the gradient is estimated as in equation (15):

$$\nabla_\theta L_G(\theta) \approx \left(r(w^s) - r(\hat{w})\right)\nabla_\theta \log p_\theta(w^s \mid I) \quad (15)$$

where the reward $r(w) = \log D_\eta(I, w)$ is given by the discriminator and the baseline $r(\hat{w})$ is obtained with the greedily decoded description $\hat{w}$, as shown in Fig. 7. Note that the baseline does not change the expected value of the gradient, but it reduces the variance of the estimate.
In addition, the GAN training can be regularized with an image description evaluation metric reward $r_{NLP}$ (here CIDEr) to bring the generated description close to the provided reference samples at the n-gram level. The gradient then becomes equation (16), obtained by adding the metric reward to the discriminator reward in equation (15):

$$\nabla_\theta L_G(\theta) \approx \left(\tilde{r}(w^s) - \tilde{r}(\hat{w})\right)\nabla_\theta \log p_\theta(w^s \mid I), \qquad \tilde{r}(w) = \log D_\eta(I, w) + \lambda_{NLP}\, r_{NLP}(w) \quad (16)$$

where $\lambda_{NLP}$ weights the metric reward.
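One SCST generator update following equations (15) and (16) can be sketched as below. The generator.sample and cider_reward interfaces, the reward mix and the weight lambda_nlp are illustrative assumptions rather than the patent's exact implementation.

```python
import torch

def scst_step(generator, discriminator, cider_reward, image_feats, refs,
              optimizer, lambda_nlp=1.0):
    # Sampled caption with its per-word log-probabilities, plus a greedy baseline.
    sample_ids, logprobs = generator.sample(image_feats)            # stochastic decoding
    with torch.no_grad():
        greedy_ids, _ = generator.sample(image_feats, greedy=True)  # argmax decoding
        reward_s = torch.log(discriminator(image_feats, sample_ids)) \
                   + lambda_nlp * cider_reward(sample_ids, refs)
        reward_g = torch.log(discriminator(image_feats, greedy_ids)) \
                   + lambda_nlp * cider_reward(greedy_ids, refs)
    advantage = reward_s - reward_g                                  # baseline reduces variance
    loss = -(advantage.detach() * logprobs.sum(dim=-1)).mean()       # policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```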
distinguishing device D η Not only is training to distinguish between a real description and a fake description, but it is also possible to detect when an image is combined with a random uncorrelated real sentence, forcing it to check not only the composition of the sentence, but also the semantic relationship between the image and the description. To achieve this goal, this section solves the following optimization problem: max (max) η L D (eta) wherein L is lost D (eta) is formula (17):
wherein w is a true sentence, w s Is a slave generator G θ The resulting spurious description is sampled and w' is a true but randomly chosen description.
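A sketch of one discriminator update for equation (17) is given below. The tensor layout and the CoAttentionDiscriminator interface from the earlier sketch are illustrative assumptions.

```python
import torch

def discriminator_step(discriminator, optimizer, image_feats,
                       real_ids, fake_ids, mismatched_ids, eps=1e-8):
    d_real = discriminator(image_feats, real_ids)            # true pair (I, w)
    d_fake = discriminator(image_feats, fake_ids)             # sampled pair (I, w^s)
    d_mism = discriminator(image_feats, mismatched_ids)       # unrelated real pair (I, w')
    loss = -(torch.log(d_real + eps)
             + torch.log(1 - d_fake + eps)
             + torch.log(1 - d_mism + eps)).mean()            # maximize L_D = minimize -L_D
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```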
Claims (4)
1. An image description generation method based on a common attention mechanism, which is characterized by comprising the following steps:
step 1: based on an image description method built on a generative adversarial network, the network model is divided into a generator and a discriminator; the former generates a description of the given image, and the latter evaluates how accurately the textual description describes the image;
step 2: the generator in step 1 adopts an encoder-decoder framework, in which the encoder adopts a convolutional neural network with a prophet attention mechanism and the decoder adopts a recurrent neural network; given an image I, the generator G outputs an image description;
step 3: the encoder in step 2 adopts Faster R-CNN to accept the image I and extract the image features $V=\{v_1,\dots,v_k\}\in\mathbb{R}^{d\times N}$;
step 4: the generator decoder in step 2 consists of an initial layer and a prophet attention layer; the initial layer is an LSTM, modified so as to control the generation of the image description; the prophet attention layer computes attention weights using a bidirectional LSTM and an improved self-attention, and the attention weights are divided into a present part and a future part, where the future part is computed from the predicted generation probability of the next word;
step 5: in step 1, the discriminator network adopts a common attention mechanism to discriminate whether a generated image description is human-written or machine-generated; the discriminator consists of two parts, an image attention module and a text attention module, which extract features of the image and of the description respectively and produce the corresponding attention matrices; the two attention matrices are then combined by a dot-product operation into a matrix representing the degree of semantic matching between the image and the description, and this matrix is used as the output of the discriminator to enforce semantic alignment between the image and the description;
step 6: the network model is trained with the reinforcement learning method SCST, using the reward under the decoding algorithm as a baseline and regularizing with the image description evaluation metric CIDEr, so that the generated descriptions stay close to the provided reference samples at the n-gram level;
step 7: during training the discriminator alternates with the generator; the two modules are trained together until the descriptions generated by the network reach an equilibrium, finally yielding a generator network that generates descriptions and achieves semantic alignment between the image and the description.
2. The method of claim 1, wherein the attention mechanism in step 2 is a prophet attention mechanism.
3. The method of claim 1, wherein the discriminator network in step 5 is a common attention discriminator.
4. The method of claim 1, wherein in step 6 the network model is trained using the SCST reinforcement learning method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310334196.7A CN116452688A (en) | 2023-03-31 | 2023-03-31 | Image description generation method based on common attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116452688A (en) | 2023-07-18
Family
ID=87129529
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310334196.7A Pending CN116452688A (en) | 2023-03-31 | 2023-03-31 | Image description generation method based on common attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116452688A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117332048A (en) * | 2023-11-30 | 2024-01-02 | 运易通科技有限公司 | Logistics information query method, device and system based on machine learning |
CN117332048B (en) * | 2023-11-30 | 2024-03-22 | 运易通科技有限公司 | Logistics information query method, device and system based on machine learning |
CN118094447A (en) * | 2024-04-24 | 2024-05-28 | 贵州大学 | Unmanned aerial vehicle flight data self-adaptive anomaly detection method based on encoding-decoding |
CN118094447B (en) * | 2024-04-24 | 2024-07-02 | 贵州大学 | Unmanned aerial vehicle flight data self-adaptive anomaly detection method based on encoding-decoding |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113158875B (en) | Image-text emotion analysis method and system based on multi-mode interaction fusion network | |
CN109992686A (en) | Based on multi-angle from the image-text retrieval system and method for attention mechanism | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN108765383B (en) | Video description method based on deep migration learning | |
CN108563624A (en) | A kind of spatial term method based on deep learning | |
CN116452688A (en) | Image description generation method based on common attention mechanism | |
CN110991290A (en) | Video description method based on semantic guidance and memory mechanism | |
CN116955699B (en) | Video cross-mode search model training method, searching method and device | |
CN113035311A (en) | Medical image report automatic generation method based on multi-mode attention mechanism | |
CN113204675B (en) | Cross-modal video time retrieval method based on cross-modal object inference network | |
CN114239585A (en) | Biomedical nested named entity recognition method | |
CN116579345B (en) | Named entity recognition model training method, named entity recognition method and named entity recognition device | |
CN117611576A (en) | Image-text fusion-based contrast learning prediction method | |
CN114722798A (en) | Ironic recognition model based on convolutional neural network and attention system | |
CN114626454A (en) | Visual emotion recognition method integrating self-supervision learning and attention mechanism | |
CN117829243A (en) | Model training method, target detection device, electronic equipment and medium | |
CN115758159B (en) | Zero sample text position detection method based on mixed contrast learning and generation type data enhancement | |
CN116151226B (en) | Machine learning-based deaf-mute sign language error correction method, equipment and medium | |
CN117115474A (en) | End-to-end single target tracking method based on multi-stage feature extraction | |
CN116958740A (en) | Zero sample target detection method based on semantic perception and self-adaptive contrast learning | |
Wu et al. | Question-driven multiple attention (dqma) model for visual question answer | |
Ren et al. | Improved image description via embedded object structure graph and semantic feature matching | |
CN114692615B (en) | Small sample intention recognition method for small languages | |
Vakada et al. | Descriptive and Coherent Paragraph Generation for Image Paragraph Captioning Using Vision Transformer and Post-processing | |
Sheng et al. | Revolutionizing Image Captioning: Integrating Attention Mechanisms with Adaptive Fusion Gates. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |