CN108875807B - Image description method based on multiple attention and multiple scales - Google Patents
- Publication number: CN108875807B (application CN201810551875.9A)
- Authority
- CN
- China
- Prior art keywords
- layer
- neural network
- attention
- model
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
Abstract
An image description method based on multiple attentions and multiple scales comprises the steps of selecting an image detection model for extracting image features, dividing the data into a network training set, a validation set, and a test set, extracting image features, constructing an attention recurrent neural network model, training the attention recurrent neural network model, and describing images. Because the invention constructs an image description generation network model consisting of original image feature extraction, multi-attention multi-scale feature mapping, recurrent neural network residual connections, and recurrent neural network language decoding, the quality of image description is improved and the details of the description are enriched. Given only an image, the invention can use the neural network model to generate a high-quality description.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a multi-attention, multi-scale image description method.
Background Art
In fields such as robot question answering, guiding blind pedestrians, and assisted education for children, problems are often encountered that require understanding the meaning of an image and communicating it to people in text. Image description combines the two fields of natural language processing and computer vision, generating language corresponding to image content from an input natural image.
Since an image contains not only basic information such as object type and position but also high-level information such as relations and emotions, detecting and identifying image objects alone loses a large amount of context information, including interrelations and emotions. How to effectively use image features to generate a corresponding text description is therefore a difficult research point.
In recent years, deep learning has made great progress in image processing and speech analysis. The weight sharing and sparse connectivity of convolutional neural networks greatly reduce model complexity; residual networks make it possible to build deeper network models; and long short-term memory (LSTM) networks allow recurrent neural network models to process longer sequences, with significant effect on text sequence decoding.
At present, mainstream deep learning algorithms for image description generation mainly use a convolutional neural network to extract image features as the input of a language decoding model; the features are fed into an LSTM network whose structure is adjusted to output the corresponding description. The commonly used description generation model extracts features from the input image with a convolutional neural network and combines them with the vector features of the language sequence as LSTM input. Although this approach uses the context information in the input image, the language decoding model applies only a single attention model to the extracted features, and only high-level semantic features of the input image are used: the features extracted by shallow convolutional layers are not used in the network model, and their contribution to image description is ignored.
Attention mechanisms draw on the selective attention of human vision. By quickly scanning an image, human vision focuses on a target area (the attention focus), obtains more detail about the target, and suppresses other useless information; this mechanism greatly improves the efficiency and accuracy of visual information processing. Essentially, an attention mechanism selects the information most critical to the current task from a large amount of input, i.e., it highlights the image spatial features corresponding to a generated word. By introducing multiple attention models, a model can use features from different levels of the image.
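As a minimal illustration of this selection process (a toy example with invented features and scores, not the patent's implementation), attention can be computed as a softmax over per-region relevance scores, which then weight the region features into a single context vector:

```python
import math

def softmax(scores):
    """Turn per-region relevance scores into attention weights summing to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, scores):
    """Weighted sum of region feature vectors: the attention 'focus'."""
    weights = softmax(scores)
    dim = len(region_features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, region_features))
            for d in range(dim)]

# Three toy regions with 2-D features; the second region is most relevant.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = attend(regions, scores=[0.1, 2.0, 0.1])   # dominated by region 2
```

The high score on the second region pulls the context vector toward that region's feature, which is exactly the "highlight the relevant spatial features" behavior described above.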
Disclosure of Invention
The technical problem solved by the present invention is to overcome the above drawbacks of the prior art and to provide a multi-attention, multi-scale image description method with a better description effect.
The technical scheme adopted for solving the technical problems comprises the following steps:
(1) selecting image detection model for extracting image characteristics
Selecting a convolutional neural network region target detection method to construct a target detection model, pre-training the target detection model on the PASCAL VOC 2007 dataset or the PASCAL VOC 2012 dataset, and selecting the model with the best target detection performance during training as the target detection model for extracting image features.
(2) Dividing network training set, verification set and test set
Dividing the Microsoft COCO 2014 (Common Objects in Context) dataset into a network training set, a validation set, and a test set as follows: randomly drawing 90% of the total samples as the network training set, taking 5% of the total samples as the validation set, and the remaining 5% as the test set.
(3) Extracting image features
Extracting the image convolution numerical features of the pre-trained target detection model using a region target detection model with a 101-layer residual structure, and converting them into 14 × 14 numerical feature maps using average pooling.
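The pooling step above can be sketched as exact block average pooling (an illustrative sketch with assumed input sizes; the real model's feature dimensions come from the 101-layer residual network):

```python
import numpy as np

def average_pool_to(feature_map, out_h=14, out_w=14):
    """Average-pool an (H, W, C) convolutional feature map down to
    (out_h, out_w, C). Assumes H and W are integer multiples of the
    output size (e.g. 28x28 -> 14x14), i.e. exact block pooling."""
    h, w, c = feature_map.shape
    fh, fw = h // out_h, w // out_w
    blocks = feature_map.reshape(out_h, fh, out_w, fw, c)
    return blocks.mean(axis=(1, 3))   # average within each spatial block

fmap = np.random.rand(28, 28, 256)    # hypothetical conv-layer output
pooled = average_pool_to(fmap)        # shape (14, 14, 256)
```

Because every block has the same size, the global mean of the map is preserved; in a framework like PyTorch the same operation would typically be an adaptive average pooling layer.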
(4) Constructing the attention recurrent neural network model
The attention recurrent neural network comprises an attention feature mapping module and a recurrent neural network language decoding module, and the attention feature mapping module is connected with the recurrent neural network language decoding module to construct the attention recurrent neural network model.
The recurrent neural network language decoding module of the invention comprises six long short-term memory (LSTM) layers and one Softmax layer. The input of the first LSTM layer consists of three parts: x_t, v̄, and h_{t-1}^n, where h_{t-1}^n denotes the output state of the n-th (final) LSTM layer at the previous time step, t denotes the current time step and t-1 the previous one, x_t is the one-hot encoded word vector, and v̄ is the high-level average-pooled image feature:

v̄ = (1/k) Σ_{i=1}^{k} v_i

where v_i is the feature of the i-th region. Inputting the three parts x_t, v̄, and h_{t-1}^n into the first-layer LSTM structure of the language model yields the recurrent neural network language decoding module.
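The assembly of this three-part input can be sketched as follows (dimensions are illustrative, and concatenation is one common way to combine the parts, assumed here rather than stated by the patent):

```python
import numpy as np

def first_layer_input(x_t, region_features, h_prev_top):
    """Assemble the three-part input of the first LSTM layer: the previous
    top-layer state h_{t-1}^n, the mean-pooled image feature
    v_bar = (1/k) * sum_i v_i, and the word vector x_t."""
    v_bar = region_features.mean(axis=0)   # average over the k regions
    return np.concatenate([h_prev_top, v_bar, x_t])

k, feat_dim, emb_dim, hid_dim = 196, 2048, 512, 1000   # illustrative sizes
v = np.random.rand(k, feat_dim)                        # k regional features
inp = first_layer_input(np.zeros(emb_dim), v, np.zeros(hid_dim))
# inp has dimension hid_dim + feat_dim + emb_dim
```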
(5) Training the attention recurrent neural network model
Inputting the network training set into the target detection model of step (1), extracting numerical feature maps of the images from convolution layers of different depths via step (3), and inputting them into the attention recurrent neural network model constructed in step (4); extracting all descriptions in the dataset to form a word list and word vectors; and training the attention recurrent neural network model with the adaptive moment estimation optimization method, dynamically adjusting the learning rate, using the cross-entropy loss L_XE(θ) as the loss function:

L_XE(θ) = −Σ_{t=1}^{T} log p_θ(y*_t | y*_{1:t−1})

where y*_{1:T} is the ground-truth target-language word sequence and θ are the parameters of the image description generation model decoder; p_θ(y*_t | y*_{1:t−1}) is the probability the LSTM decoder assigns to the output word y*_t.
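A toy numeric check of this loss (the vocabulary and probabilities below are invented for illustration):

```python
import math

def sequence_cross_entropy(step_probs, target_ids):
    """L_XE(theta) = -sum_t log p(y*_t | y*_{1:t-1}): the negative
    log-probability assigned to the ground-truth word at each time step,
    summed over the sequence."""
    return -sum(math.log(p[y]) for p, y in zip(step_probs, target_ids))

# Toy 3-word vocabulary; the target sequence is word 0 then word 2.
step_probs = [[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8]]
loss = sequence_cross_entropy(step_probs, [0, 2])   # -(log 0.7 + log 0.8)
```

The loss shrinks toward zero as the decoder assigns probability closer to 1 to each ground-truth word.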
When training the attention recurrent neural network model, the model is first trained using a beam search method, and then trained using the self-critical sequence training reinforcement learning method.
After training, the effect of the trained attention recurrent neural network model is tested on the image validation set, and the model parameters are adjusted to obtain the attention recurrent neural network model.
(6) Image description
Inputting the test set obtained in step (2) into the attention recurrent neural network model trained in step (5), selecting at each time step the word with the maximum probability as the result of that step, and connecting the words in generation order as the final network output to complete the image description.
In step (3), extracting image features, the method for extracting image convolution numerical features using the region target detection model with the 101-layer residual structure is as follows: extracting convolution numerical features from the first max-pooling layer of the residual network of the model, and extracting convolution numerical features from the last convolution layer in each group of residual structures after the max-pooling layer.
The extracted convolution numerical features are:

V′ = {v_1, …, v_k}

where V′ denotes the set of k features of the k regions, each feature representing a salient region of the image; v_k denotes the average-pooled convolution feature of the k-th region segmented from the image convolution layer, and k is a finite positive integer.
In step (4), constructing the attention recurrent neural network, the attention feature mapping module of the invention is as follows:
The attention feature mapping module takes two inputs: the network state h_t of the corresponding recurrent layer and each numerical feature v_i extracted from the convolution layer. The module is given by:

a_{i,t} = w_a^T tanh(W_va v_i + W_ha h_t)
α_t = softmax(a_t)

where w_a, W_va, and W_ha are all parameters to be learned and α_t is the attention weight. With these inputs, the attention feature mapping module outputs the attended image feature:

c_t = Σ_i α_{i,t} v_i

where v_i denotes the average-pooled convolution feature of the i-th region segmented from the image convolution layer, c_t is the final output result, and i, t are finite positive integers.
The numerical features of different levels are input into different attention models as follows: low-layer convolution numerical features are connected to attention models at the lower layers of the recurrent neural network model, and high-layer convolution numerical features are connected to attention models at the higher layers.
In step (4), constructing the multi-attention multi-scale recurrent neural network, the attention feature mapping module and the recurrent neural network language decoding module are connected as follows: for each layer i from one to five, the output of the i-th recurrent layer is connected to the input of the i-th attention network, and the output of the i-th attention network is connected to the input of the (i+1)-th recurrent layer, so that five attention networks are interleaved between the six recurrent layers.
The method for connecting the residual errors with each layer of the recurrent neural network comprises the following steps: the output of the first layer of the recurrent neural network is connected with the input of the third layer of the recurrent neural network, the output of the second layer of the recurrent neural network is connected with the input of the fourth layer of the recurrent neural network, the output of the third layer of the recurrent neural network is connected with the input of the fifth layer of the recurrent neural network, and the output of the fourth layer of the recurrent neural network is connected with the input of the sixth layer of the recurrent neural network.
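The interleaved attention/LSTM stacking and the skip connections above can be sketched structurally as follows (identity stand-ins replace the real LSTM and attention layers; this shows only the wiring, not the trained model):

```python
import numpy as np

def run_decoder_step(x, lstm_layers, attn_layers):
    """One time step through six stand-in LSTM layers and five stand-in
    attention layers. Layer i's output feeds attention i, whose output
    feeds layer i+1; a residual connection adds the output of layer i
    into the input of layer i+2 (layer 1 -> layer 3, layer 2 -> layer 4, ...)."""
    outputs = []
    inp = x
    for i in range(6):
        if i >= 2:
            inp = inp + outputs[i - 2]      # residual skip over one layer
        out = lstm_layers[i](inp)
        outputs.append(out)
        if i < 5:
            inp = attn_layers[i](out)       # attention output feeds next layer
    return outputs[-1]

# Identity stand-ins make the wiring visible: with an all-ones input the
# layer outputs grow as 1, 1, 2, 3, 5, 8 (each input sums two earlier outputs).
lstms = [lambda v: v for _ in range(6)]
attns = [lambda v: v for _ in range(5)]
y = run_decoder_step(np.ones(8), lstms, attns)
```

The additive skips are what lets gradients reach the lower layers directly, the motivation given in the advantages section.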
Compared with the prior art, the invention has the following advantages:
Because the invention constructs an image description generation network model consisting of original image feature extraction, multi-attention multi-scale feature mapping, recurrent neural network residual connections, and recurrent neural network language decoding, the quality of image description is improved and the details of the description are enriched. Given only an image, the invention can use the neural network model to generate a high-quality image description result.
Drawings
FIG. 1 is a flowchart of example 1 of the present invention.
FIG. 2 is a flow diagram of the language generation module in FIG. 1 for constructing a multi-attention multi-scale neural network.
FIG. 3 is a graph comparing the results of image description using the top-down network model processing method with the method of example 1.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, but the present invention is not limited to the examples described below.
Example 1
Taking 100000 images selected from the Microsoft COCO 2014 (Common Objects in Context) dataset as an example, the multi-attention, multi-scale image description generation method comprises the following steps:
(1) selecting image detection model for extracting image characteristics
Selecting the convolutional neural network region target detection method to construct a target detection model; this is a known method, disclosed in Advances in Neural Information Processing Systems, 2015. The target detection model is pre-trained on the PASCAL VOC 2007 dataset, and the model with the best target detection performance during training is selected as the target detection model for extracting image features.
(2) Dividing network training set, verification set and test set
Dividing the Microsoft COCO 2014 dataset into a network training set, a validation set, and a test set as follows: from the 100000-image dataset, 90000 images (90%) are randomly drawn as the network training set, 5000 images (5%) as the validation set, and 5000 images (5%) as the test set.
(3) Extracting image features
Extracting the image convolution numerical features of the pre-trained target detection model using a region target detection model with a 101-layer residual structure; the 101-layer residual structure is a known structure from Deep Residual Learning for Image Recognition. The image convolution numerical features are converted into 14 × 14 numerical feature maps by average pooling, which is likewise a known method.
The above region target detection model using 101-layer residual structure extracts image convolution numerical features as follows: and extracting convolution numerical features from the first maximum pooling layer of the residual network of the regional target detection model with 101 layers of residual structures, and extracting convolution numerical features from the last convolution layer in each group of residual structures after the maximum pooling layer.
The extracted convolution numerical features are:

V′ = {v_1, …, v_k}

where V′ denotes the set of k features of the k regions, each feature representing a region of the image; v_k denotes the average-pooled convolution feature of the k-th region segmented from the image convolution layer, and k is 14.
(4) Constructing the attention recurrent neural network model
The attention recurrent neural network comprises an attention feature mapping module and a recurrent neural network language decoding module, wherein the attention feature mapping module is as follows:
The attention feature mapping module takes two inputs: the network state h_t of the corresponding recurrent layer and each numerical feature v_i extracted from the convolution layer. The module is given by:

a_{i,t} = w_a^T tanh(W_va v_i + W_ha h_t)
α_t = softmax(a_t)

where w_a, W_va, and W_ha are all parameters to be learned and α_t is the attention weight. With these inputs, the attention feature mapping module outputs the attended image feature:

c_t = Σ_i α_{i,t} v_i

where v_i denotes the average-pooled convolution feature of the i-th region segmented from the image convolution layer and c_t is the final output result.
The recurrent neural network language decoding module comprises six long short-term memory (LSTM) layers and one Softmax layer. The input of the first LSTM layer consists of three parts: x_t, v̄, and h_{t-1}^n, where h_{t-1}^n denotes the output state of the n-th (final) LSTM layer at the previous time step, t denotes the current time step and t-1 the previous one, x_t is the one-hot encoded word vector, and v̄ is the high-level average-pooled image feature:

v̄ = (1/k) Σ_{i=1}^{k} v_i

where v_i is the feature of the i-th region. Inputting the three parts x_t, v̄, and h_{t-1}^n into the first-layer LSTM structure of the language model yields the recurrent neural network language decoding module.
The attention feature mapping module is connected with the recurrent neural network language decoding module to construct the attention recurrent neural network model.
The numerical features of different levels are input into different attention models as follows: low-layer convolution numerical features are connected to attention models at the lower layers of the recurrent neural network model, and high-layer convolution numerical features are connected to attention models at the higher layers.
The attention feature mapping module and the recurrent neural network language decoding module in this step are connected as follows: the recurrent layers of the decoding module are connected in sequence, with residual connections between them, and for each layer i from one to five the output of the i-th recurrent layer is connected to the input of the i-th attention network and the output of the i-th attention network is connected to the input of the (i+1)-th recurrent layer.
The method for connecting the residual errors in the step with each layer of the recurrent neural network comprises the following steps: the output of the first layer of the recurrent neural network is connected with the input of the third layer of the recurrent neural network, the output of the second layer of the recurrent neural network is connected with the input of the fourth layer of the recurrent neural network, the output of the third layer of the recurrent neural network is connected with the input of the fifth layer of the recurrent neural network, and the output of the fourth layer of the recurrent neural network is connected with the input of the sixth layer of the recurrent neural network.
(5) Training the attention recurrent neural network model
The 90000 images of the network training set are input into the target detection model of step (1); numerical feature maps of the images are extracted from convolution layers of different depths via step (3) and input into the attention recurrent neural network model constructed in step (4).
All descriptions in the dataset are extracted to form a word list and word vectors as follows: over all descriptions in the Microsoft COCO 2014 dataset, the words occurring five or more times in the sentences are collected into a word list; each word in the word list is one-hot encoded, and the one-hot code of each word of a description sentence in the dataset is mapped to an embedding vector. The attention recurrent neural network model is trained by dynamically adjusting the learning rate with the adaptive moment estimation optimization method of Adam: A Method for Stochastic Optimization, using the cross-entropy loss L_XE(θ) as the loss function:

L_XE(θ) = −Σ_{t=1}^{T} log p_θ(y*_t | y*_{1:t−1})

where y*_{1:T} is the ground-truth target-language word sequence and θ are the parameters of the image description generation model decoder; p_θ(y*_t | y*_{1:t−1}) is the probability the LSTM decoder assigns to the output word y*_t.
When training the attention recurrent neural network model, a beam search method is adopted, the number of hidden nodes of the LSTM layers and of the attention layers is set to 1000, and the model is first trained with a learning rate of 1 × 10⁻⁴; it is then trained in turn with learning rates of 1 × 10⁻⁵ and 1 × 10⁻⁶ using the self-critical sequence training reinforcement learning method of Self-Critical Sequence Training for Image Captioning. After training, the effect of the trained attention recurrent neural network model is tested on the 5000-image validation set, and the model parameters are adjusted to obtain the attention recurrent neural network model.
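The word-list construction described in step (5) can be sketched as follows (toy captions and whitespace tokenization are assumed for illustration):

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Collect the word list: every word occurring min_count (here five)
    or more times across all dataset captions."""
    counts = Counter(w for cap in captions for w in cap.lower().split())
    return sorted(w for w, c in counts.items() if c >= min_count)

def one_hot(word, vocab):
    """One-hot encode a word against the word list."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

captions = ["a dog runs"] * 5 + ["a cat sits"] * 3   # toy caption set
vocab = build_vocab(captions)   # 'cat' and 'sits' fall below the cutoff
```

In the patent, each one-hot code is then mapped to a learned embedding vector before entering the first LSTM layer.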
(6) Image description
The 5000 test-set images obtained in step (2) are input into the attention recurrent neural network model trained in step (5); at each time step the word with the maximum probability is selected as the result of that step, and the words are connected in generation order as the final network output to complete the image description.
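This greedy selection can be sketched as follows (the vocabulary and per-step distributions are invented; `<end>` is an assumed stop token):

```python
def greedy_decode(step_distributions, vocab, end_token="<end>"):
    """Pick the highest-probability word at each time step and join the
    words in generation order, stopping at the end token."""
    words = []
    for dist in step_distributions:
        word = vocab[max(range(len(dist)), key=dist.__getitem__)]
        if word == end_token:
            break
        words.append(word)
    return " ".join(words)

vocab = ["a", "dog", "runs", "<end>"]         # toy vocabulary
dists = [[0.90, 0.05, 0.03, 0.02],
         [0.10, 0.70, 0.10, 0.10],
         [0.05, 0.05, 0.80, 0.10],
         [0.05, 0.05, 0.10, 0.80]]
sentence = greedy_decode(dists, vocab)        # "a dog runs"
```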
After the training of the attention recurrent neural network model is completed, the image descriptions are evaluated with the Consensus-based Image Description Evaluation (CIDEr) metric, with a score of 1.167.
Example 2
Taking 100000 images selected from the Microsoft COCO 2014 (Common Objects in Context) dataset as an example, the multi-attention, multi-scale image description generation method comprises the following steps:
In step (1), selecting an image detection model for extracting image features, the convolutional neural network region target detection method is selected to construct a target detection model; this is a known method, disclosed in Advances in Neural Information Processing Systems, 2015. The target detection model is pre-trained on the PASCAL VOC 2012 dataset, and the model with the best target detection performance during training is selected as the target detection model for extracting image features.
The other steps were the same as in example 1. The image description is completed.
In order to verify the beneficial effects of the present invention, the inventor carried out a simulation experiment by using the method of embodiment 1 of the present invention, and the experimental conditions were as follows:
1. simulation conditions
The hardware conditions are as follows: one Nvidia TITAN Xp graphics card and 128 GB of memory.
The software platform is as follows: the PyTorch framework.
2. Simulation content and results
The results of the experiment carried out under the above simulation conditions with the method of the present invention are shown in FIG. 3, where the first row is the description generated by the top-down network model and the second row is the description generated by the present method. Compared with the prior art, the method of the present invention has the following advantages:
the invention provides a method for constructing multiple levels of attention, which can respectively extract the features of different levels of an image at the same time and improve the expression capability of generating sentences. A residual error learning mechanism is introduced into the multi-layer long and short term memory network, and the input and the output of the long and short term memory networks of different layers are connected together through an addition principle, so that the problem that the low-layer parameters of the model are difficult to update effectively due to gradient dispersion is solved. A plurality of attention structures are hierarchically fused into the network, and a model is trained by introducing a reinforcement learning method, so that output word sentences are more accurate, and the system performance is further improved. After the attention circulation neural network model is trained, the Image Description is evaluated by adopting a consistency-based Image Description Evaluation standard (CIDER) with a score of 1.167, so that a better effect is achieved.
Claims (3)
1. A multi-attention and multi-scale based image description method is characterized by comprising the following steps:
(1) selecting image detection model for extracting image characteristics
Selecting a convolutional neural network region target detection method to construct a target detection model, pre-training the target detection model on the PASCAL VOC 2007 dataset or the PASCAL VOC 2012 dataset, and selecting the model with the best target detection performance during training as the target detection model for extracting image features;
(2) dividing network training set, verification set and test set
Dividing the Microsoft COCO 2014 dataset into a network training set, a validation set, and a test set as follows: randomly drawing 90% of the total samples in the dataset as the network training set, taking 5% of the total samples as the validation set, and the remaining 5% as the test set;
(3) extracting image features
Extracting the image convolution numerical features of the pre-trained target detection model using a region target detection model with a 101-layer residual structure, and converting them into 14 × 14 numerical feature maps by average pooling;
(4) constructing the attention recurrent neural network model
The attention recurrent neural network comprises an attention feature mapping module and a recurrent neural network language decoding module, wherein the attention feature mapping module is connected with the recurrent neural network language decoding module to construct the attention recurrent neural network model;
the attention feature mapping module is as follows:
the attention feature mapping module takes two inputs: the network hidden state h_t and each numerical feature v_i in the extracted convolutional layer, and computes attention scores and weights as:
a_{t,i} = w_a^T tanh(W_va · v_i + W_ha · h_t)
α_t = softmax(a_t)
in the formulas, w_a, W_va and W_ha are all parameters to be learned, and α_t is the attention weight; given these inputs, the attention feature mapping module outputs the weighted image feature:
c_t = Σ_i α_{t,i} · v_i
in the formula, v_i represents the average-pooled convolution feature of the i-th region segmented from the image convolution layer, c_t is the final output result, and i and t are finite positive integers;
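The attention feature mapping module (scores from the hidden state and region features, softmax weights, weighted sum of regions) can be sketched numerically as follows; the weight shapes and random values below are illustrative assumptions, not the patent's trained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(V, h_t, W_va, W_ha, w_a):
    """Soft attention over k region features V (k x d), conditioned on
    the recurrent hidden state h_t, following the module in the claim:
    a_{t,i} = w_a^T tanh(W_va v_i + W_ha h_t), alpha_t = softmax(a_t),
    c_t = sum_i alpha_{t,i} v_i."""
    scores = np.tanh(V @ W_va.T + h_t @ W_ha.T) @ w_a   # (k,) attention scores
    alpha = softmax(scores)                              # attention weights
    c_t = alpha @ V                                      # weighted sum of regions
    return c_t, alpha

rng = np.random.default_rng(0)
k, d, m, a = 5, 8, 6, 4          # regions, feature dim, hidden dim, attn dim
V = rng.standard_normal((k, d))
h_t = rng.standard_normal(m)
c_t, alpha = attention(V, h_t,
                       rng.standard_normal((a, d)),   # W_va
                       rng.standard_normal((a, m)),   # W_ha
                       rng.standard_normal(a))        # w_a
```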
the method for inputting numerical features of different levels into different attention models is as follows: low-layer convolutional numerical features are fed into the attention models located at the lower layers of the recurrent neural network model, and high-layer convolutional numerical features are fed into the attention models located at the higher layers of the recurrent neural network model;
the recurrent neural network language decoding module is as follows: the module comprises six long short-term memory (LSTM) network layers and one Softmax layer, wherein the input of the first LSTM layer comprises three parts: x_t, h^n_{t-1} and v̄; h^n_{t-1} represents the output state of the n-th layer, namely the final layer, LSTM network at the previous moment, t represents the current moment, t-1 represents the previous moment, x_t represents the one-hot encoded word vector, and v̄ is the high-level average-pooled feature of the image:
v̄ = (1/k) Σ_{i=1}^{k} v_i
wherein v_i is the feature of the i-th region; inputting the three parts x_t, h^n_{t-1} and v̄ into the first-layer LSTM network structure of the language model yields the recurrent neural network language decoding module;
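The three-part input of the first LSTM layer, including the average-pooled global image feature, can be illustrated as a simple concatenation; the function name and dimensions below are assumptions chosen for illustration:

```python
import numpy as np

def first_layer_input(x_t, h_top_prev, V):
    """Builds the three-part input of the first LSTM layer in the claim:
    the word vector x_t, the previous output state of the top (n-th)
    LSTM layer, and the global feature v_bar = (1/k) * sum_i v_i
    averaged over the k region features."""
    v_bar = V.mean(axis=0)                 # average-pooled image feature
    return np.concatenate([x_t, h_top_prev, v_bar])

V = np.ones((5, 4)) * 2.0                  # 5 regions, 4-dim features
x = first_layer_input(np.zeros(3), np.zeros(6), V)
```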
(5) training the attention recurrent neural network model
Inputting the network training set into the target detection model in step (1); extracting, as in step (3), the numerical feature maps of the image from convolutional layers at different depths and inputting them into the attention recurrent neural network model constructed in step (4); extracting all descriptions in the dataset to form a word list and word vectors; and training the attention recurrent neural network model with the adaptive moment estimation (Adam) optimization method, dynamically adjusting the learning rate, using the cross-entropy loss function L_XE(θ):
L_XE(θ) = -Σ_{t=1}^{T} log p_θ(y*_t | y*_1, …, y*_{t-1})
wherein y*_{1:T} and θ are respectively the real word sequence of the target language and the parameters of the image description generative model's decoder, and p_θ(y*_t | y*_{1:t-1}) is the probability of the LSTM network decoder outputting the word y*_t;
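The cross-entropy loss over the ground-truth word sequence can be sketched as follows; `step_probs` stands in for the decoder's per-step softmax outputs and is an illustrative assumption:

```python
import numpy as np

def cross_entropy_loss(step_probs, target_ids):
    """Sequence cross-entropy L_XE = -sum_t log p(y*_t | y*_{1:t-1})
    over the ground-truth word ids, as in the claim's loss function.
    `step_probs` is a (T, vocab) array of per-step softmax outputs."""
    t = np.arange(len(target_ids))
    # pick the probability assigned to each ground-truth word
    return -np.log(step_probs[t, target_ids]).sum()

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
loss = cross_entropy_loss(probs, [0, 1])
```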
when training the attention recurrent neural network model, a beam search method is first adopted, and the attention recurrent neural network model is then further trained with the self-critical sequence training (SCST) reinforcement learning method;
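A minimal beam search decoder of the kind adopted in step (5) can be sketched as follows; the step function interface and the toy probability tables are illustrative assumptions, not the patent's decoder:

```python
import numpy as np

def beam_search(step_fn, start_id, end_id, beam_size=3, max_len=10):
    """Minimal beam search: keep the beam_size highest-scoring partial
    sequences, expanding each with its top next-token candidates until
    every surviving beam has emitted the end token."""
    beams = [([start_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:            # finished beams carry over
                candidates.append((seq, score))
                continue
            logp = step_fn(seq)              # next-token log-probabilities
            for tok in np.argsort(logp)[-beam_size:]:
                candidates.append((seq + [int(tok)], score + float(logp[tok])))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]

# toy model: prefers token 2, then the end token 3 once 2 has been emitted
def toy_step(seq):
    if seq[-1] == 2:
        return np.log(np.array([0.05, 0.05, 0.1, 0.8]))
    return np.log(np.array([0.05, 0.05, 0.6, 0.3]))

best = beam_search(toy_step, start_id=0, end_id=3)
```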
after the training is finished, testing the effect of the trained attention recurrent neural network model on the image validation set, and adjusting the model parameters to obtain the final attention recurrent neural network model;
(6) image description
Inputting the test set obtained in step (2) into the attention recurrent neural network model trained in step (5); at each time step, selecting the word with the maximum probability as the result of the current time step; and concatenating the words in generation order and outputting them as the final output of the network, completing the image description.
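Step (6)'s greedy decoding (pick the maximum-probability word at each time step, stop at the end token, join the words in generation order) can be sketched as follows; the toy vocabulary and scripted model are illustrative assumptions:

```python
import numpy as np

def greedy_decode(step_fn, start_id, end_id, max_len=20):
    """At each time step pick the highest-probability word and feed it
    back, stopping at the end token; the words in generation order form
    the description. `step_fn` stands in for the trained model."""
    seq = [start_id]
    for _ in range(max_len):
        probs = step_fn(seq)
        tok = int(np.argmax(probs))
        seq.append(tok)
        if tok == end_id:
            break
    return seq

vocab = ["<s>", "a", "dog", "runs", "</s>"]
script = iter([1, 2, 3, 4])                  # toy model's output schedule
sentence = greedy_decode(lambda s: np.eye(5)[next(script)], 0, 4)
caption = " ".join(vocab[i] for i in sentence[1:-1])
```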
2. The image description method based on multiple attention and multiple scales according to claim 1, wherein in step (3) the region object detection model with the 101-layer residual structure extracts the convolutional numerical features of the images as follows: convolutional numerical features are extracted from the first max-pooling layer of the residual network, and from the last convolutional layer in each group of residual blocks after the max-pooling layer;
the extracted convolutional numerical features form the set:
V′ = {v_1, …, v_k},
where V′ denotes the set of k features of the k regions above, each feature representing a salient region of the image, v_k denotes the average-pooled convolution feature of the k-th region segmented from the image convolution layer, and k is a finite positive integer.
3. The image description method based on multiple attention and multiple scales according to claim 1, characterized in that: in the step (4) of constructing the multi-attention multi-scale recurrent neural network, the attention feature mapping modules and the recurrent neural network language decoding module are connected in the following way: the output of the first layer of the recurrent neural network is connected with the input of the first layer of the attention network, the output of the first layer of the attention network is connected with the input of the second layer of the recurrent neural network, the output of the second layer of the recurrent neural network is connected with the input of the second layer of the attention network, the output of the second layer of the attention network is connected with the input of the third layer of the recurrent neural network, the output of the third layer of the recurrent neural network is connected with the input of the third layer of the attention network, the output of the third layer of the attention network is connected with the input of the fourth layer of the recurrent neural network, the output of the fourth layer of the recurrent neural network is connected with the input of the fourth layer of the attention network, the output of the fourth layer of the attention network is connected with the input of the fifth layer of the recurrent neural network, the output of the fifth layer of the recurrent neural network is connected with the input of the fifth layer of the attention network, and the output of the fifth layer of the attention network is connected with the input of the sixth layer of the recurrent neural network;
the residual connections to the layers of the recurrent neural network are as follows: the output of the first layer of the recurrent neural network is connected with the input of the third layer of the recurrent neural network, the output of the second layer of the recurrent neural network is connected with the input of the fourth layer of the recurrent neural network, the output of the third layer of the recurrent neural network is connected with the input of the fifth layer of the recurrent neural network, and the output of the fourth layer of the recurrent neural network is connected with the input of the sixth layer of the recurrent neural network.
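The wiring of claim 3 (six recurrent layers interleaved with five attention modules, plus additive residual skips from layer l to layer l+2) can be sketched structurally; the identity stand-ins below replace real LSTM cells and attention modules purely for illustration:

```python
import numpy as np

def stacked_decoder(x, layers, attentions):
    """Wiring sketch of claim 3: six recurrent layers interleaved with
    five attention modules, plus residual (additive) skips from layer
    l's output to layer l+2's input. `layers` and `attentions` are
    stand-in callables; the real model uses LSTM cells and the attention
    module of claim 1."""
    outs = []
    h = x
    for l in range(6):
        inp = h
        if l >= 2:
            inp = inp + outs[l - 2]                   # residual connection (addition)
        out = layers[l](inp)
        outs.append(out)
        h = attentions[l](out) if l < 5 else out      # attention between layers
    return h

identity = [lambda v: v] * 6     # stand-in recurrent layers
atts = [lambda v: v] * 5         # stand-in attention modules
y = stacked_decoder(np.ones(4), identity, atts)
```

With identity stand-ins the residual additions accumulate Fibonacci-style (1, 1, 2, 3, 5, 8), which makes the skip wiring easy to verify.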
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810551875.9A CN108875807B (en) | 2018-05-31 | 2018-05-31 | Image description method based on multiple attention and multiple scales |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108875807A CN108875807A (en) | 2018-11-23 |
CN108875807B true CN108875807B (en) | 2022-05-27 |
Family
ID=64336183
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810551875.9A Active CN108875807B (en) | 2018-05-31 | 2018-05-31 | Image description method based on multiple attention and multiple scales |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108875807B (en) |
Families Citing this family (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111310518B (en) * | 2018-12-11 | 2023-12-08 | 北京嘀嘀无限科技发展有限公司 | Picture feature extraction method, target re-identification method, device and electronic equipment |
CN111325068B (en) * | 2018-12-14 | 2023-11-07 | 北京京东尚科信息技术有限公司 | Video description method and device based on convolutional neural network |
CN109344920B (en) * | 2018-12-14 | 2021-02-02 | 汇纳科技股份有限公司 | Customer attribute prediction method, storage medium, system and device |
CN111339340A (en) * | 2018-12-18 | 2020-06-26 | 顺丰科技有限公司 | Training method of image description model, image searching method and device |
CN109376804B (en) * | 2018-12-19 | 2020-10-30 | 中国地质大学(武汉) | Hyperspectral remote sensing image classification method based on attention mechanism and convolutional neural network |
CN109784197B (en) * | 2018-12-21 | 2022-06-07 | 西北工业大学 | Pedestrian re-identification method based on hole convolution and attention mechanics learning mechanism |
CN109620205B (en) * | 2018-12-26 | 2022-10-28 | 上海联影智能医疗科技有限公司 | Electrocardiogram data classification method and device, computer equipment and storage medium |
CN109726696B (en) * | 2019-01-03 | 2023-04-07 | 电子科技大学 | Image description generation system and method based on attention-pushing mechanism |
US11087175B2 (en) * | 2019-01-30 | 2021-08-10 | StradVision, Inc. | Learning method and learning device of recurrent neural network for autonomous driving safety check for changing driving mode between autonomous driving mode and manual driving mode, and testing method and testing device using them |
CN109919221B (en) * | 2019-03-04 | 2022-07-19 | 山西大学 | Image description method based on bidirectional double-attention machine |
CN109948691B (en) * | 2019-03-14 | 2022-02-18 | 齐鲁工业大学 | Image description generation method and device based on depth residual error network and attention |
CN110084128B (en) * | 2019-03-29 | 2021-12-14 | 安徽艾睿思智能科技有限公司 | Scene graph generation method based on semantic space constraint and attention mechanism |
CN110084250B (en) * | 2019-04-26 | 2024-03-12 | 北京金山数字娱乐科技有限公司 | Image description method and system |
CN110097136A (en) * | 2019-05-09 | 2019-08-06 | 杭州筑象数字科技有限公司 | Image classification method neural network based |
CN110633610B (en) * | 2019-05-17 | 2022-03-25 | 西南交通大学 | Student state detection method based on YOLO |
CN110188775B (en) * | 2019-05-28 | 2020-06-26 | 创意信息技术股份有限公司 | Image content description automatic generation method based on joint neural network model |
CN110188765B (en) * | 2019-06-05 | 2021-04-06 | 京东方科技集团股份有限公司 | Image semantic segmentation model generation method, device, equipment and storage medium |
CN112101395A (en) * | 2019-06-18 | 2020-12-18 | 上海高德威智能交通系统有限公司 | Image identification method and device |
CN110288029B (en) * | 2019-06-27 | 2022-12-06 | 西安电子科技大学 | Tri-LSTMs model-based image description method |
CN110321962B (en) * | 2019-07-09 | 2021-10-08 | 北京金山数字娱乐科技有限公司 | Data processing method and device |
CN110427836B (en) * | 2019-07-11 | 2020-12-01 | 重庆市地理信息和遥感应用中心(重庆市测绘产品质量检验测试中心) | High-resolution remote sensing image water body extraction method based on multi-scale optimization |
CN110503079A (en) * | 2019-08-30 | 2019-11-26 | 山东浪潮人工智能研究院有限公司 | A kind of monitor video based on deep neural network describes method |
CN111013149A (en) * | 2019-10-23 | 2020-04-17 | 浙江工商大学 | Card design generation method and system based on neural network deep learning |
CN110929013A (en) * | 2019-12-04 | 2020-03-27 | 成都中科云集信息技术有限公司 | Image question-answer implementation method based on bottom-up entry and positioning information fusion |
CN111126282B (en) * | 2019-12-25 | 2023-05-12 | 中国矿业大学 | Remote sensing image content description method based on variational self-attention reinforcement learning |
CN111240486B (en) * | 2020-02-17 | 2021-07-02 | 河北冀联人力资源服务集团有限公司 | Data processing method and system based on edge calculation |
CN111444968A (en) * | 2020-03-30 | 2020-07-24 | 哈尔滨工程大学 | Image description generation method based on attention fusion |
CN111611373B (en) * | 2020-04-13 | 2021-09-10 | 清华大学 | Robot-oriented specific active scene description method |
CN111522986B (en) * | 2020-04-23 | 2023-10-10 | 北京百度网讯科技有限公司 | Image retrieval method, device, equipment and medium |
CN112529857B (en) * | 2020-12-03 | 2022-08-23 | 重庆邮电大学 | Ultrasonic image diagnosis report generation method based on target detection and strategy gradient |
CN112668608B (en) * | 2020-12-04 | 2024-03-15 | 北京达佳互联信息技术有限公司 | Image recognition method and device, electronic equipment and storage medium |
CN112699915B (en) * | 2020-12-07 | 2024-02-02 | 杭州电子科技大学 | Method for identifying CAD model assembly interface based on improved graph annotation force network |
CN112784848B (en) * | 2021-02-04 | 2024-02-27 | 东北大学 | Image description generation method based on multiple attention mechanisms and external knowledge |
CN113591874B (en) * | 2021-06-01 | 2024-04-26 | 清华大学 | Paragraph level image description generation method with long-time memory enhancement |
CN113707112B (en) * | 2021-08-13 | 2024-05-28 | 陕西师范大学 | Automatic generation method of recursion jump connection deep learning music based on layer standardization |
CN114049501A (en) * | 2021-11-22 | 2022-02-15 | 江苏科技大学 | Image description generation method, system, medium and device fusing cluster search |
CN113822383B (en) * | 2021-11-23 | 2022-03-15 | 北京中超伟业信息安全技术股份有限公司 | Unmanned aerial vehicle detection method and system based on multi-domain attention mechanism |
CN115936073B (en) * | 2023-02-16 | 2023-05-16 | 江西省科学院能源研究所 | Language-oriented convolutional neural network and visual question-answering method |
CN115984296B (en) * | 2023-03-21 | 2023-06-13 | 译企科技(成都)有限公司 | Medical image segmentation method and system applying multi-attention mechanism |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107330362A (en) * | 2017-05-25 | 2017-11-07 | 北京大学 | A kind of video classification methods based on space-time notice |
CN107358948A (en) * | 2017-06-27 | 2017-11-17 | 上海交通大学 | Language in-put relevance detection method based on attention model |
CN107451552A (en) * | 2017-07-25 | 2017-12-08 | 北京联合大学 | A kind of gesture identification method based on 3D CNN and convolution LSTM |
CN107844743A (en) * | 2017-09-28 | 2018-03-27 | 浙江工商大学 | A kind of image multi-subtitle automatic generation method based on multiple dimensioned layering residual error network |
CN107918782A (en) * | 2016-12-29 | 2018-04-17 | 中国科学院计算技术研究所 | A kind of method and system for the natural language for generating description picture material |
CN108052512A (en) * | 2017-11-03 | 2018-05-18 | 同济大学 | A kind of iamge description generation method based on depth attention mechanism |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2545661A (en) * | 2015-12-21 | 2017-06-28 | Nokia Technologies Oy | A method for analysing media content |
US10565305B2 (en) * | 2016-11-18 | 2020-02-18 | Salesforce.Com, Inc. | Adaptive attention model for image captioning |
Non-Patent Citations (3)
Title |
---|
Image description via layer-wise multi-objective optimization of LSTM and multi-layer probability fusion (in Chinese); Tang Pengjie et al.; Acta Automatica Sinica (自动化学报); 20171211; vol. 43; pp. 1-13 *
Show and Tell: A Neural Image Caption Generator;Oriol Vinyals 等;《arXiv:1411.4555》;20150420;第1-9页 * |
Show,attend and tell: Neural image caption generation with visual attention;Kelvin Xu 等;《arXiv:1502.03044》;20160419;第1-22页 * |
Also Published As
Publication number | Publication date |
---|---|
CN108875807A (en) | 2018-11-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108875807B (en) | Image description method based on multiple attention and multiple scales | |
CN108319686B (en) | Antagonism cross-media retrieval method based on limited text space | |
CN111260740B (en) | Text-to-image generation method based on generation countermeasure network | |
CN109948691B (en) | Image description generation method and device based on depth residual error network and attention | |
CN110969020B (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN106845411B (en) | Video description generation method based on deep learning and probability map model | |
CN108549658B (en) | Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree | |
CN109783666B (en) | Image scene graph generation method based on iterative refinement | |
CN108830287A (en) | The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method | |
CN109242090B (en) | Video description and description consistency judgment method based on GAN network | |
CN109887484A (en) | A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device | |
CN111368142B (en) | Video intensive event description method based on generation countermeasure network | |
CN112784929B (en) | Small sample image classification method and device based on double-element group expansion | |
CN112800292B (en) | Cross-modal retrieval method based on modal specific and shared feature learning | |
CN111475622A (en) | Text classification method, device, terminal and storage medium | |
CN110069611B (en) | Topic-enhanced chat robot reply generation method and device | |
CN107679225A (en) | A kind of reply generation method based on keyword | |
CN109740012B (en) | Method for understanding and asking and answering image semantics based on deep neural network | |
CN114282059A (en) | Video retrieval method, device, equipment and storage medium | |
CN115393933A (en) | Video face emotion recognition method based on frame attention mechanism | |
CN113806564B (en) | Multi-mode informative text detection method and system | |
CN110347853A (en) | A kind of image hash code generation method based on Recognition with Recurrent Neural Network | |
CN111445545B (en) | Text transfer mapping method and device, storage medium and electronic equipment | |
CN111783688B (en) | Remote sensing image scene classification method based on convolutional neural network | |
CN112560440A (en) | Deep learning-based syntax dependence method for aspect-level emotion analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||