CN108875807B - Image description method based on multiple attention and multiple scales - Google Patents


Info

Publication number
CN108875807B
CN108875807B (application CN201810551875.9A)
Authority
CN
China
Prior art keywords
layer
neural network
attention
model
network
Prior art date
Legal status
Active
Application number
CN201810551875.9A
Other languages
Chinese (zh)
Other versions
CN108875807A
Inventor
吴晓军
张钰
陈龙杰
张玉梅
Current Assignee
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN201810551875.9A
Publication of CN108875807A
Application granted
Publication of CN108875807B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An image description method based on multiple attention and multiple scales comprises the steps of selecting an image detection model for extracting image features, dividing the data into a network training set, a validation set and a test set, extracting image features, constructing an attention recurrent neural network model, training the attention recurrent neural network model, and describing images. Because the invention constructs an image description generation network model consisting of original image feature extraction, multi-attention multi-scale feature mapping, residual connections in the recurrent neural network, and recurrent neural network language decoding, the quality of image descriptions is improved and their details are enriched. Given only an image, the invention can generate a high-quality description with the neural network model.

Description

Image description method based on multiple attention and multiple scales
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image description method based on multiple attention and multiple scales.
Background Art
In fields such as robot question answering, guiding blind pedestrians, and assisted education for children, one often needs to understand the meaning of an image and communicate it to people in natural language. Image description combines the fields of natural language processing and computer vision: given a natural image, it generates text corresponding to the image content.
Since an image contains not only basic information such as object categories and positions but also high-level information such as relations and emotions, detecting and recognizing objects alone loses a large amount of context, including interrelations and emotions. How to effectively exploit image features and generate the corresponding text description is therefore a difficult research problem.
In recent years, deep learning has made great progress in image processing and speech analysis. The weight sharing and sparse connectivity of convolutional neural networks greatly reduce the complexity of network models; residual networks make much deeper models feasible; and long short-term memory (LSTM) networks allow recurrent models to process longer sequences, with significant benefits for decoding text sequences.
At present, mainstream deep learning algorithms for image description generation mainly use a convolutional neural network to extract image features as the input of a language decoding model; the features are fed into an LSTM network, and the corresponding description text is produced by adjusting the structure of the language model. The commonly used description generation models extract features from the input image with a convolutional neural network and combine them with the vector features of the language sequence as LSTM input. Although these methods use the context information of the input image, the language decoding model applies only a single attention model to the extracted features, and only high-level semantic features of the input image are used; the features extracted by shallow convolutional layers are not exploited, and their contribution to image description is ignored.
The attention mechanism draws on the selective attention of human vision. By quickly scanning an image, human vision focuses on a target region, the focus of attention, obtains more detail about the target, and suppresses useless information; this mechanism greatly improves the efficiency and accuracy of visual information processing. Essentially, the attention mechanism likewise selects the information most relevant to the current task from a large pool, i.e., it highlights the image-space features corresponding to a generated word. By introducing multiple attention models, a network can use features from different levels of the image.
Disclosure of Invention
The technical problem to be solved by the present invention is to overcome the above drawbacks of the prior art and to provide a multi-attention and multi-scale image description method with a better description effect.
The technical scheme adopted for solving the technical problems comprises the following steps:
(1) Selecting an image detection model for extracting image features
A convolutional neural network region-based target detection method is selected to construct the target detection model. The model is pre-trained with the PASCAL VOC 2007 or PASCAL VOC 2012 dataset, and the model with the best target detection effect during training is selected as the target detection model for extracting image features.
(2) Dividing the network training set, validation set and test set
The Microsoft COCO 2014 (Common Objects in Context) dataset is divided into a network training set, a validation set and a test set as follows: 90% of the total samples are randomly drawn as the training set, 5% as the validation set, and the remaining 5% as the test set.
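The split can be illustrated with a short sketch; the id list, seed, and helper name are illustrative assumptions, not part of the patent:

```python
import random

def split_dataset(image_ids, seed=42):
    """Randomly split image ids into 90% training, 5% validation, 5% test."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.90 * n), int(0.05 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# For the 100,000-image subset of Microsoft COCO 2014 used in example 1:
# train, val, test = split_dataset(range(100000))  # 90,000 / 5,000 / 5,000
```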
(3) Extracting image features
Convolutional feature maps are extracted from the pre-trained target detection model, a region-based target detection model with a 101-layer residual (ResNet-101) backbone, and each map is converted into a 14 × 14 feature map by average pooling.
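As a rough illustration of this step, the sketch below pools feature maps taken at several depths of a ResNet-101 to a common 14 × 14 size. It is a minimal sketch under stated assumptions: a plain torchvision ResNet-101 stands in for the region-based detector's backbone, and the tapped layers follow the description below (the first max-pooling layer and the last convolution of each residual group).

```python
import torch
import torchvision

# Assumption: a torchvision ResNet-101 stands in for the detector's ResNet-101 backbone.
backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1").eval()

@torch.no_grad()
def multi_scale_features(image):  # image: (1, 3, H, W) float tensor
    """Tap feature maps at several depths and average-pool each to 14 x 14."""
    feats = []
    x = backbone.conv1(image)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)                  # the first max-pooling layer
    feats.append(x)
    for stage in (backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4):
        x = stage(x)                         # last conv output of each residual group
        feats.append(x)
    return [torch.nn.functional.adaptive_avg_pool2d(f, (14, 14)) for f in feats]
```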
(4) Constructing the attention recurrent neural network model
The attention recurrent neural network comprises an attention feature mapping module and a recurrent neural network language decoding module; connecting the attention feature mapping module with the language decoding module yields the attention recurrent neural network model.
The recurrent neural network language decoding module of the invention is as follows: the module comprises six long short-term memory (LSTM) layers and one Softmax layer. The input of the first LSTM layer consists of three parts, x_t, h_{t-1}^n and \bar{v}, where h_{t-1}^n denotes the output state of the n-th, namely final, LSTM layer at the previous time step, t denotes the current time step and t-1 the previous one, x_t is the one-hot encoded word vector, and \bar{v} is the high-level average-pooled image feature:

\bar{v} = \frac{1}{k}\sum_{i=1}^{k} v_i

where v_i is the feature of the i-th region. Feeding the three parts x_t, h_{t-1}^n and \bar{v} into the first LSTM layer of the language model yields the recurrent neural network language decoding module.
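A minimal sketch of the first LSTM layer's input as just described: the embedded word vector x_t, the previous top-layer state h_{t-1}^n, and the mean-pooled image feature \bar{v} are concatenated and fed to an LSTM cell. All dimensions and the nn.LSTMCell realization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FirstDecoderLayer(nn.Module):
    def __init__(self, vocab_size, embed_dim=1000, feat_dim=2048, hidden_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # one-hot index -> word vector x_t
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim + feat_dim, hidden_dim)

    def forward(self, word_idx, h_top_prev, v_regions, state):
        # v_bar: mean of the k region features v_i, shape (B, feat_dim)
        v_bar = v_regions.mean(dim=1)
        x_t = self.embed(word_idx)                        # (B, embed_dim)
        inp = torch.cat([x_t, h_top_prev, v_bar], dim=1)  # the three input parts
        h1, c1 = self.lstm(inp, state)
        return h1, (h1, c1)
```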
(5) Training the attention recurrent neural network model
The network training set is input into the target detection model of step (1); feature maps of the images are extracted from convolutional layers of different depths as in step (3) and input into the attention recurrent neural network model constructed in step (4). All descriptions in the dataset are extracted to form the word list and word vectors. The attention recurrent neural network model is trained with the adaptive moment estimation (Adam) optimization method while dynamically adjusting the learning rate, using the cross-entropy loss L_{XE}(\theta) as the loss function:

L_{XE}(\theta) = -\sum_{t} \log p_\theta\!\left(y_t^* \mid y_{1:t-1}^*\right)

where y_{1:T}^* is the ground-truth word sequence of the target language, \theta denotes the parameters of the image description generation model's decoder, and p_\theta(y_t^* \mid y_{1:t-1}^*) is the probability that the LSTM decoder outputs the word y_t^*.
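A sketch of the cross-entropy objective L_XE(θ) defined above, summing the negative log-probability of each ground-truth word; the logits shape, padding id, and batch averaging are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def caption_xe_loss(logits, targets, pad_idx=0):
    """logits: (B, T, vocab) decoder outputs; targets: (B, T) ground-truth word ids.
    Returns -sum_t log p_theta(y*_t | y*_{1:t-1}), averaged over the batch (assumption)."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)
    mask = (targets != pad_idx).float()  # ignore padded positions (assumption)
    return -(token_ll * mask).sum(dim=1).mean()
```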
When training the attention recurrent neural network model, beam search is used first, and the model is then trained further with the self-critical sequence training reinforcement learning method.
After training, the effect of the trained model is tested on the image validation set and the model parameters are adjusted, yielding the attention recurrent neural network model.
(6) Image description
The test set obtained in step (2) is input into the attention recurrent neural network model trained in step (5); at each time step the word with the maximum probability is selected as the result of that step, and the words are concatenated in generation order as the final output of the network, completing the image description.
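The decoding rule of step (6) can be sketched as a greedy loop: at each time step the highest-probability word is taken and fed back until an end token appears. The model.step interface, the initial-state helper, and the special token ids are illustrative assumptions.

```python
import torch

@torch.no_grad()
def greedy_decode(model, feats, bos_idx=1, eos_idx=2, max_len=20):
    """Select the maximum-probability word at each time step and concatenate in order."""
    word = torch.tensor([bos_idx])
    state = model.init_state(feats)                     # assumed helper for the initial state
    caption = []
    for _ in range(max_len):
        logits, state = model.step(word, feats, state)  # assumed one-step interface
        word = logits.argmax(dim=-1)                    # word with maximum probability
        if word.item() == eos_idx:
            break
        caption.append(word.item())
    return caption
```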
In step (3), the method for extracting image convolutional features with the region target detection model with the 101-layer residual structure is as follows: convolutional features are extracted from the first max-pooling layer of the residual network, and from the last convolutional layer in each group of residual blocks after the max-pooling layer.
The extracted convolutional features are:

V' = \{v_1, \ldots, v_k\},

where V' denotes the set of k features of the k regions, each feature representing a salient region of the image, v_k denotes the average-pooled convolutional feature of the k-th region segmented from the image convolutional layer, and k is a finite positive integer.
In step (4) of constructing the attention recurrent neural network, the attention feature mapping module of the invention is as follows:

The attention feature mapping module takes two inputs: the network state h_t and each feature v_i extracted from the convolutional layer. The module is given by:

a_{i,t} = w_a^T \tanh\!\left(W_{va} v_i + W_{ha} h_t\right)
\alpha_t = \mathrm{softmax}(a_t)

where the parameters w_a, W_{va} and W_{ha} are all to be learned and \alpha_t is the attention weight. With these inputs, the attention feature mapping module outputs the attended image feature:

c_t = \sum_i \alpha_{i,t} v_i

where v_i denotes the average-pooled convolutional feature of the i-th region segmented from the image convolutional layer, c_t is the final output, and i and t are finite positive integers.
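A minimal sketch of the attention feature mapping module defined by the equations above; the feature, hidden, and attention dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionMap(nn.Module):
    """a_{i,t} = w_a^T tanh(W_va v_i + W_ha h_t); alpha_t = softmax(a_t); c_t = sum_i alpha_{i,t} v_i."""
    def __init__(self, feat_dim=2048, hidden_dim=1000, attn_dim=1000):
        super().__init__()
        self.W_va = nn.Linear(feat_dim, attn_dim, bias=False)
        self.W_ha = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, v, h):
        # v: (B, k, feat_dim) region features; h: (B, hidden_dim) LSTM state
        a = self.w_a(torch.tanh(self.W_va(v) + self.W_ha(h).unsqueeze(1)))  # (B, k, 1)
        alpha = torch.softmax(a, dim=1)   # attention weights alpha_t over the k regions
        c = (alpha * v).sum(dim=1)        # attended image feature c_t
        return c, alpha.squeeze(-1)
```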
The features of different levels are input into different attention models as follows: low-layer convolutional features are fed into the attention models at the lower layers of the recurrent neural network model, and high-layer convolutional features are fed into the attention models at the higher layers.
In step (4) of constructing the multi-attention multi-scale recurrent neural network, the attention feature mapping module and the recurrent neural network language decoding module are connected as follows: the output of the first recurrent layer is connected to the input of the first attention layer, whose output feeds the second recurrent layer; the output of the second recurrent layer is connected to the input of the second attention layer, whose output feeds the third recurrent layer; the output of the third recurrent layer is connected to the input of the third attention layer, whose output feeds the fourth recurrent layer; the output of the fourth recurrent layer is connected to the input of the fourth attention layer, whose output feeds the fifth recurrent layer; and the output of the fifth recurrent layer is connected to the input of the fifth attention layer, whose output feeds the sixth recurrent layer.
The residual connections between the recurrent layers are made as follows: the output of the first recurrent layer is connected to the input of the third, the output of the second to the input of the fourth, the output of the third to the input of the fifth, and the output of the fourth to the input of the sixth.
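Under the connection pattern just described, one decoding step of the six-layer decoder can be sketched as follows, reusing the AttentionMap module sketched above: each LSTM layer feeds an attention layer over features of the matching scale, the attended context enters the next LSTM layer, and the output of layer l is added into the input of layer l+2 as a residual connection. Hidden sizes and interfaces are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiAttnDecoderStep(nn.Module):
    """One time step of the six-layer LSTM decoder with per-layer attention and
    residual connections (output of layer l added into the input of layer l+2)."""
    def __init__(self, in_dim, hidden_dim=1000, feat_dim=2048, n_layers=6):
        super().__init__()
        dims = [in_dim] + [hidden_dim + feat_dim] * (n_layers - 1)
        self.lstms = nn.ModuleList(nn.LSTMCell(d, hidden_dim) for d in dims)
        self.attns = nn.ModuleList(AttentionMap(feat_dim, hidden_dim)
                                   for _ in range(n_layers - 1))

    def forward(self, x, multi_scale_feats, states):
        # multi_scale_feats[l]: (B, k, feat_dim); low-level features feed the lower
        # attention layers, high-level features the higher ones.
        hs, new_states, inp = [], [], x
        for l, lstm in enumerate(self.lstms):
            h, c = lstm(inp, states[l])
            hs.append(h)
            new_states.append((h, c))
            if l + 1 < len(self.lstms):
                ctx, _ = self.attns[l](multi_scale_feats[l], h)
                # Residual: add the output from two layers below into the next input.
                res = h + hs[l - 1] if l >= 1 else h
                inp = torch.cat([res, ctx], dim=1)
        return hs[-1], new_states
```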
Compared with the prior art, the invention has the following advantages:
Because the invention constructs an image description generation network model consisting of original image feature extraction, multi-attention multi-scale feature mapping, residual connections in the recurrent neural network, and recurrent neural network language decoding, the quality of image descriptions is improved and their details are enriched. Given only an image, the invention can generate a high-quality description result with the neural network model.
Drawings
FIG. 1 is a flowchart of example 1 of the present invention.
FIG. 2 is a flow diagram of constructing the multi-attention multi-scale neural network language generation module in FIG. 1.
FIG. 3 compares image description results of the top-down network model with those of the method of example 1.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, but the present invention is not limited to the examples described below.
Example 1
Taking 100,000 images selected from the Microsoft COCO 2014 dataset as an example, the multi-attention and multi-scale image description generation method comprises the following steps:
(1) Selecting an image detection model for extracting image features
A convolutional neural network region-based target detection method is selected to construct the target detection model; this is a known method, disclosed in Advances in Neural Information Processing Systems, 2015. The target detection model is pre-trained with the PASCAL VOC 2007 dataset, and the model with the best target detection effect during training is selected as the target detection model for extracting image features.
(2) Dividing the network training set, validation set and test set
The Microsoft COCO 2014 dataset is divided into a network training set, a validation set and a test set as follows: from the 100,000-image dataset, 90,000 images (90%) are randomly drawn as the training set, 5,000 images (5%) as the validation set, and 5,000 images (5%) as the test set.
(3) Extracting image features
Convolutional feature maps are extracted from the pre-trained target detection model, a region-based target detection model with a 101-layer residual backbone; the 101-layer residual structure is a known structure, disclosed in Deep Residual Learning for Image Recognition. The feature maps are converted into 14 × 14 feature maps by average pooling, a well-known method.
The region target detection model with the 101-layer residual structure extracts image convolutional features as follows: convolutional features are extracted from the first max-pooling layer of the residual network, and from the last convolutional layer in each group of residual blocks after the max-pooling layer.
The extracted convolutional features are:

V' = \{v_1, \ldots, v_k\}

where V' denotes the set of k features of the k regions, each feature representing a region of the image, v_k denotes the average-pooled convolutional feature of the k-th region segmented from the image convolutional layer, and k = 14.
(4) Constructing the attention recurrent neural network model
The attention recurrent neural network comprises an attention feature mapping module and a recurrent neural network language decoding module. The attention feature mapping module is as follows:

The attention feature mapping module takes two inputs: the network state h_t and each feature v_i extracted from the convolutional layer. The module is given by:

a_{i,t} = w_a^T \tanh\!\left(W_{va} v_i + W_{ha} h_t\right)
\alpha_t = \mathrm{softmax}(a_t)

where the parameters w_a, W_{va} and W_{ha} are all to be learned and \alpha_t is the attention weight. With these inputs, the module outputs the attended image feature:

c_t = \sum_i \alpha_{i,t} v_i

where v_i denotes the average-pooled convolutional feature of the i-th region segmented from the image convolutional layer and c_t is the final output.
The recurrent neural network language decoding module comprises six long short-term memory (LSTM) layers and one Softmax layer. The input of the first LSTM layer consists of three parts, x_t, h_{t-1}^n and \bar{v}, where h_{t-1}^n denotes the output state of the n-th (final) LSTM layer at the previous time step, t denotes the current time step and t-1 the previous one, x_t is the one-hot encoded word vector, and \bar{v} is the high-level average-pooled image feature:

\bar{v} = \frac{1}{k}\sum_{i=1}^{k} v_i

where v_i is the feature of the i-th region. Feeding the three parts x_t, h_{t-1}^n and \bar{v} into the first LSTM layer of the language model yields the recurrent neural network language decoding module.
Connecting the attention feature mapping module with the recurrent neural network language decoding module constructs the attention recurrent neural network model.
The features of different levels are input into different attention models as follows: low-layer convolutional features are fed into the attention models at the lower layers of the recurrent neural network model, and high-layer convolutional features are fed into the attention models at the higher layers.
The attention feature mapping module and the recurrent neural network language decoding module are connected in this step as follows: the recurrent layers of the decoding module are connected in sequence and linked by residual connections, with the output of the first recurrent layer connected to the input of the first attention layer, whose output feeds the second recurrent layer; the output of the second recurrent layer connected to the input of the second attention layer, whose output feeds the third recurrent layer; the output of the third recurrent layer connected to the input of the third attention layer, whose output feeds the fourth recurrent layer; the output of the fourth recurrent layer connected to the input of the fourth attention layer, whose output feeds the fifth recurrent layer; and the output of the fifth recurrent layer connected to the input of the fifth attention layer, whose output feeds the sixth recurrent layer.
The residual connections in this step link the recurrent layers as follows: the output of the first recurrent layer is connected to the input of the third, the output of the second to the input of the fourth, the output of the third to the input of the fifth, and the output of the fourth to the input of the sixth.
(5) Training the attention recurrent neural network model
The 90,000 images of the network training set are input into the target detection model of step (1); feature maps of the images are extracted from convolutional layers of different depths as in step (3) and input into the attention recurrent neural network model constructed in step (4).
All descriptions in the dataset are extracted to form the word list and word vectors as follows: over all descriptions in the Microsoft COCO 2014 dataset, words that occur five or more times in the sentences are collected into a word list; each word in the list is encoded by one-hot encoding, and the one-hot code of each word of a description sentence in the dataset is mapped to an embedding vector. The attention recurrent neural network model is trained by dynamically adjusting the learning rate with the adaptive moment estimation optimization method (Adam: A Method for Stochastic Optimization), using the cross-entropy loss L_{XE}(\theta) as the loss function:
L_{XE}(\theta) = -\sum_{t} \log p_\theta\!\left(y_t^* \mid y_{1:t-1}^*\right)

where y_{1:T}^* is the ground-truth word sequence of the target language, \theta denotes the parameters of the image description generation model's decoder, and p_\theta(y_t^* \mid y_{1:t-1}^*) is the probability that the LSTM decoder outputs the word y_t^*.
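The word-list construction described above can be sketched as follows; the minimum count of five follows the text, while the special tokens and sorting are illustrative assumptions.

```python
from collections import Counter
import torch.nn as nn

def build_vocab(captions, min_count=5):
    """captions: iterable of tokenized sentences; keep words occurring >= 5 times."""
    counts = Counter(w for sent in captions for w in sent)
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2}   # reserved ids are an assumption
    for w in sorted(w for w, c in counts.items() if c >= min_count):
        vocab[w] = len(vocab)
    return vocab

# The one-hot index of each word is then mapped to an embedding vector:
# embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=1000)
```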
When training the attention recurrent neural network model, beam search is used, the number of hidden nodes of the LSTM layers and of the attention layers is set to 1000, and the model is trained with a learning rate of 1×10^-4; the model is then trained further with the self-critical sequence training reinforcement learning method (Self-critical Sequence Training for Image Captioning), using learning rates of 1×10^-5 and 1×10^-6 in turn. After training, the effect of the trained model is tested on the 5,000-image validation set and the model parameters are adjusted, yielding the attention recurrent neural network model.
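The self-critical fine-tuning stage can be sketched schematically: the reward (e.g., CIDEr) of a sampled caption is baselined by the reward of the greedily decoded caption, giving a policy-gradient loss. The sample/greedy/reward interfaces below are illustrative assumptions, not the patent's implementation.

```python
import torch

def scst_loss(model, feats, refs, reward_fn):
    """Self-critical sequence training step (sketch).
    reward_fn(caption, refs) -> float, e.g. a CIDEr scorer (assumed interface)."""
    sample, log_probs = model.sample(feats)   # sampled caption and its log-probs (assumed)
    with torch.no_grad():
        baseline = model.greedy(feats)        # greedy caption as the baseline (assumed)
    advantage = reward_fn(sample, refs) - reward_fn(baseline, refs)
    # Push up the sampled words exactly when they beat the greedy baseline.
    return -(advantage * log_probs.sum())
```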
(6) Image description
The 5,000 test-set images obtained in step (2) are input into the attention recurrent neural network model trained in step (5); at each time step the word with the maximum probability is selected as the result of that step, and the words are concatenated in generation order as the final output of the network, completing the image description.
After training of the attention recurrent neural network model is completed, the image descriptions are evaluated with the Consensus-based Image Description Evaluation metric (CIDEr), giving a score of 1.167.
Example 2
Taking 100,000 images selected from the Microsoft COCO 2014 dataset as an example, the multi-attention and multi-scale image description generation method comprises the following steps:
in the step (1) of selecting an image detection model for extracting image features, a target detection method In a convolutional neural network region is selected to construct a target detection model, and the target detection method In the convolutional neural network region is a known method and is disclosed In advanced In neural information processing systems 2015. And pre-training a target detection model by using a Pascal visual target classification 2012 data set, and selecting the model with the best target detection effect in the training as the target detection model for extracting the image characteristics.
The other steps are the same as in example 1, completing the image description.
In order to verify the beneficial effects of the present invention, the inventors carried out a simulation experiment using the method of embodiment 1, under the following conditions:
1. simulation conditions
Hardware: one Nvidia TITAN Xp graphics card, 128 GB of memory.
Software platform: the PyTorch framework.
2. Simulation content and results
The results of the experiment carried out under the above simulation conditions with the method of the present invention are shown in FIG. 3: the first row is the description from the top-down network model, and the second row is the description from the present method. Compared with the prior art, the method of the present invention has the following advantages:
the invention provides a method for constructing multiple levels of attention, which can respectively extract the features of different levels of an image at the same time and improve the expression capability of generating sentences. A residual error learning mechanism is introduced into the multi-layer long and short term memory network, and the input and the output of the long and short term memory networks of different layers are connected together through an addition principle, so that the problem that the low-layer parameters of the model are difficult to update effectively due to gradient dispersion is solved. A plurality of attention structures are hierarchically fused into the network, and a model is trained by introducing a reinforcement learning method, so that output word sentences are more accurate, and the system performance is further improved. After the attention circulation neural network model is trained, the Image Description is evaluated by adopting a consistency-based Image Description Evaluation standard (CIDER) with a score of 1.167, so that a better effect is achieved.

Claims (3)

1. A multi-attention and multi-scale based image description method is characterized by comprising the following steps:
(1) selecting an image detection model for extracting image features
selecting a convolutional neural network region-based target detection method to construct a target detection model, pre-training the target detection model with the PASCAL VOC 2007 or PASCAL VOC 2012 dataset, and selecting the model with the best target detection effect in training as the target detection model for extracting image features;
(2) dividing the network training set, validation set and test set
dividing the Microsoft COCO 2014 dataset into a network training set, a validation set and a test set as follows: randomly extracting 90% of the total samples as the training set, 5% as the validation set, and the remaining 5% as the test set;
(3) extracting image features
extracting convolutional feature maps from the pre-trained target detection model, a region target detection model with a 101-layer residual structure, and converting them into 14 × 14 feature maps by average pooling;
(4) constructing the attention recurrent neural network model
the attention recurrent neural network comprises an attention feature mapping module and a recurrent neural network language decoding module, and the attention feature mapping module is connected with the recurrent neural network language decoding module to construct the attention recurrent neural network model;
the attention feature mapping module is as follows: it takes two inputs, the network state h_t and each feature v_i extracted from the convolutional layer, and is given by:

a_{i,t} = w_a^T \tanh\!\left(W_{va} v_i + W_{ha} h_t\right)
\alpha_t = \mathrm{softmax}(a_t)

where the parameters w_a, W_{va} and W_{ha} are all to be learned and \alpha_t is the attention weight; with these inputs, the attention feature mapping module outputs the attended image feature:

c_t = \sum_i \alpha_{i,t} v_i

where v_i denotes the average-pooled convolutional feature of the i-th region segmented from the image convolutional layer, c_t is the final output, and i and t are finite positive integers;
the method for inputting the numerical characteristics of different levels into different attention models comprises the following steps: the low-layer convolution numerical characteristics are connected into an attention model positioned at the low layer of the cyclic neural network model, and the high-layer convolution numerical characteristics are connected into an attention model positioned at the high layer of the cyclic neural network model;
the recurrent neural network language decoding module is as follows: the module comprises six layers of long-term and short-term memory networks and one layer of Softmax network, wherein the input of the first layer of long-term and short-term memory network comprises xt
Figure FDA0003502208650000021
In the three parts, the first part and the second part,
Figure FDA0003502208650000022
representing the output state of the nth layer, namely the final layer, long-short term memory network at the last moment, wherein t represents the current moment, t-1 represents the previous moment, and xtRepresents the thermally encoded word vector,
Figure FDA0003502208650000023
is a high-level average pooling feature of the image,
Figure FDA0003502208650000024
comprises the following steps:
Figure FDA0003502208650000025
wherein v isiFor the feature of the ith region, xt
Figure FDA0003502208650000026
Inputting the three parts into a first layer long-short term memory network structure of a language model to obtain a recurrent neural network language decoding module;
(5) training the attention recurrent neural network model
inputting the network training set into the target detection model of step (1), extracting feature maps of the images from convolutional layers of different depths as in step (3), inputting the feature maps into the attention recurrent neural network model constructed in step (4), extracting all descriptions in the dataset to form a word list and word vectors, training the attention recurrent neural network model with the adaptive moment estimation optimization method while dynamically adjusting the learning rate, and using the cross-entropy loss L_{XE}(\theta) as the loss function:

L_{XE}(\theta) = -\sum_{t} \log p_\theta\!\left(y_t^* \mid y_{1:t-1}^*\right)

where y_{1:T}^* is the ground-truth word sequence of the target language, \theta denotes the parameters of the image description generation model's decoder, and p_\theta(y_t^* \mid y_{1:t-1}^*) is the probability that the LSTM decoder outputs the word y_t^*;
when the attention circulation neural network model is trained, the attention circulation neural network model is trained by adopting a cluster searching method, and then the attention circulation neural network model is trained by using a self-identification sequence training reinforcement learning method;
after the training is finished, testing the effect of the trained attention circulation neural network model by using an image verification set, and adjusting model parameters to obtain an attention circulation neural network model;
(6) image description
inputting the test set obtained in step (2) into the attention recurrent neural network model trained in step (5), sequentially selecting the word with the maximum probability at each time step as the result of that step, and concatenating the words in generation order as the final output of the network to complete the image description.
2. The multi-attention and multi-scale based image description method according to claim 1, wherein in step (3) the region target detection model with the 101-layer residual structure extracts image convolutional features as follows: convolutional features are extracted from the first max-pooling layer of the residual network of the region target detection model with the 101-layer residual structure, and from the last convolutional layer in each group of residual blocks after the max-pooling layer;
the method for extracting the convolution numerical characteristics comprises the following steps:
V′={v1,…,vk},
in the formula V*A set of k features representing the above k regions, each feature representing a salient region of the image, vkThe method represents the average pooling convolution characteristics of the k region segmented from the image convolution layer, wherein k is a finite positive integer.
3. The multi-attention and multi-scale based image description method according to claim 1, wherein in step (4) of constructing the multi-attention multi-scale recurrent neural network, the attention feature mapping module and the recurrent neural network language decoding module are connected as follows: the output of the first recurrent layer is connected to the input of the first attention layer, whose output feeds the second recurrent layer; the output of the second recurrent layer is connected to the input of the second attention layer, whose output feeds the third recurrent layer; the output of the third recurrent layer is connected to the input of the third attention layer, whose output feeds the fourth recurrent layer; the output of the fourth recurrent layer is connected to the input of the fourth attention layer, whose output feeds the fifth recurrent layer; and the output of the fifth recurrent layer is connected to the input of the fifth attention layer, whose output feeds the sixth recurrent layer;
the method for connecting the residual errors with each layer of the recurrent neural network comprises the following steps: the output of the first layer of the recurrent neural network is connected with the input of the third layer of the recurrent neural network, the output of the second layer of the recurrent neural network is connected with the input of the fourth layer of the recurrent neural network, the output of the third layer of the recurrent neural network is connected with the input of the fifth layer of the recurrent neural network, and the output of the fourth layer of the recurrent neural network is connected with the input of the sixth layer of the recurrent neural network.
Application CN201810551875.9A, priority date 2018-05-31, filed 2018-05-31: Image description method based on multiple attention and multiple scales. Active; granted as CN108875807B.

Priority Applications (1)

CN201810551875.9A: Image description method based on multiple attention and multiple scales; priority date 2018-05-31; filed 2018-05-31

Applications Claiming Priority (1)

CN201810551875.9A: Image description method based on multiple attention and multiple scales; priority date 2018-05-31; filed 2018-05-31

Publications (2)

CN108875807A, published 2018-11-23
CN108875807B, published 2022-05-27

Family

Family ID: 64336183

Family Applications (1)

CN201810551875.9A (Active): Image description method based on multiple attention and multiple scales; priority date 2018-05-31; filed 2018-05-31

Country Status (1)

CN: CN108875807B

Families Citing this family (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310518B (en) * 2018-12-11 2023-12-08 北京嘀嘀无限科技发展有限公司 Picture feature extraction method, target re-identification method, device and electronic equipment
CN111325068B (en) * 2018-12-14 2023-11-07 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN109344920B (en) * 2018-12-14 2021-02-02 汇纳科技股份有限公司 Customer attribute prediction method, storage medium, system and device
CN111339340A (en) * 2018-12-18 2020-06-26 顺丰科技有限公司 Training method of image description model, image searching method and device
CN109376804B (en) * 2018-12-19 2020-10-30 中国地质大学(武汉) Hyperspectral remote sensing image classification method based on attention mechanism and convolutional neural network
CN109784197B (en) * 2018-12-21 2022-06-07 西北工业大学 Pedestrian re-identification method based on hole convolution and attention mechanics learning mechanism
CN109620205B (en) * 2018-12-26 2022-10-28 上海联影智能医疗科技有限公司 Electrocardiogram data classification method and device, computer equipment and storage medium
CN109726696B (en) * 2019-01-03 2023-04-07 电子科技大学 Image description generation system and method based on attention-pushing mechanism
US11087175B2 (en) * 2019-01-30 2021-08-10 StradVision, Inc. Learning method and learning device of recurrent neural network for autonomous driving safety check for changing driving mode between autonomous driving mode and manual driving mode, and testing method and testing device using them
CN109919221B (en) * 2019-03-04 2022-07-19 山西大学 Image description method based on bidirectional double-attention machine
CN109948691B (en) * 2019-03-14 2022-02-18 齐鲁工业大学 Image description generation method and device based on depth residual error network and attention
CN110084128B (en) * 2019-03-29 2021-12-14 安徽艾睿思智能科技有限公司 Scene graph generation method based on semantic space constraint and attention mechanism
CN110084250B (en) * 2019-04-26 2024-03-12 北京金山数字娱乐科技有限公司 Image description method and system
CN110097136A (en) * 2019-05-09 2019-08-06 杭州筑象数字科技有限公司 Image classification method neural network based
CN110633610B (en) * 2019-05-17 2022-03-25 西南交通大学 Student state detection method based on YOLO
CN110188775B (en) * 2019-05-28 2020-06-26 创意信息技术股份有限公司 Image content description automatic generation method based on joint neural network model
CN110188765B (en) * 2019-06-05 2021-04-06 京东方科技集团股份有限公司 Image semantic segmentation model generation method, device, equipment and storage medium
CN112101395A (en) * 2019-06-18 2020-12-18 上海高德威智能交通系统有限公司 Image identification method and device
CN110288029B (en) * 2019-06-27 2022-12-06 西安电子科技大学 Tri-LSTMs model-based image description method
CN110321962B (en) * 2019-07-09 2021-10-08 北京金山数字娱乐科技有限公司 Data processing method and device
CN110427836B (en) * 2019-07-11 2020-12-01 重庆市地理信息和遥感应用中心(重庆市测绘产品质量检验测试中心) High-resolution remote sensing image water body extraction method based on multi-scale optimization
CN110503079A (en) * 2019-08-30 2019-11-26 山东浪潮人工智能研究院有限公司 A kind of monitor video based on deep neural network describes method
CN111013149A (en) * 2019-10-23 2020-04-17 浙江工商大学 Card design generation method and system based on neural network deep learning
CN110929013A (en) * 2019-12-04 2020-03-27 成都中科云集信息技术有限公司 Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN111126282B (en) * 2019-12-25 2023-05-12 中国矿业大学 Remote sensing image content description method based on variational self-attention reinforcement learning
CN111240486B (en) * 2020-02-17 2021-07-02 河北冀联人力资源服务集团有限公司 Data processing method and system based on edge calculation
CN111444968A (en) * 2020-03-30 2020-07-24 哈尔滨工程大学 Image description generation method based on attention fusion
CN111611373B (en) * 2020-04-13 2021-09-10 清华大学 Robot-oriented specific active scene description method
CN111522986B (en) * 2020-04-23 2023-10-10 北京百度网讯科技有限公司 Image retrieval method, device, equipment and medium
CN112529857B (en) * 2020-12-03 2022-08-23 重庆邮电大学 Ultrasonic image diagnosis report generation method based on target detection and strategy gradient
CN112668608B (en) * 2020-12-04 2024-03-15 北京达佳互联信息技术有限公司 Image recognition method and device, electronic equipment and storage medium
CN112699915B (en) * 2020-12-07 2024-02-02 杭州电子科技大学 Method for identifying CAD model assembly interface based on improved graph annotation force network
CN112784848B (en) * 2021-02-04 2024-02-27 东北大学 Image description generation method based on multiple attention mechanisms and external knowledge
CN113591874B (en) * 2021-06-01 2024-04-26 清华大学 Paragraph level image description generation method with long-time memory enhancement
CN113707112B (en) * 2021-08-13 2024-05-28 陕西师范大学 Automatic generation method of recursion jump connection deep learning music based on layer standardization
CN114049501A (en) * 2021-11-22 2022-02-15 江苏科技大学 Image description generation method, system, medium and device fusing cluster search
CN113822383B (en) * 2021-11-23 2022-03-15 北京中超伟业信息安全技术股份有限公司 Unmanned aerial vehicle detection method and system based on multi-domain attention mechanism
CN115936073B (en) * 2023-02-16 2023-05-16 江西省科学院能源研究所 Language-oriented convolutional neural network and visual question-answering method
CN115984296B (en) * 2023-03-21 2023-06-13 译企科技(成都)有限公司 Medical image segmentation method and system applying multi-attention mechanism

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330362A (en) * 2017-05-25 2017-11-07 北京大学 A kind of video classification methods based on space-time notice
CN107358948A (en) * 2017-06-27 2017-11-17 上海交通大学 Language in-put relevance detection method based on attention model
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN107844743A (en) * 2017-09-28 2018-03-27 浙江工商大学 A kind of image multi-subtitle automatic generation method based on multiple dimensioned layering residual error network
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN108052512A (en) * 2017-11-03 2018-05-18 同济大学 A kind of iamge description generation method based on depth attention mechanism

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2545661A (en) * 2015-12-21 2017-06-28 Nokia Technologies Oy A method for analysing media content
US10565305B2 (en) * 2016-11-18 2020-02-18 Salesforce.Com, Inc. Adaptive attention model for image captioning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN107330362A (en) * 2017-05-25 2017-11-07 北京大学 A kind of video classification methods based on space-time notice
CN107358948A (en) * 2017-06-27 2017-11-17 上海交通大学 Language in-put relevance detection method based on attention model
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN107844743A (en) * 2017-09-28 2018-03-27 浙江工商大学 A kind of image multi-subtitle automatic generation method based on multiple dimensioned layering residual error network
CN108052512A (en) * 2017-11-03 2018-05-18 同济大学 A kind of iamge description generation method based on depth attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Image description via layer-wise multi-objective optimization and multi-layer probability fusion of LSTMs; Tang Pengjie et al.; Acta Automatica Sinica; 2017-12-11; vol. 43; pp. 1-13 *
Show and Tell: A Neural Image Caption Generator; Oriol Vinyals et al.; arXiv:1411.4555; 2015-04-20; pp. 1-9 *
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention; Kelvin Xu et al.; arXiv:1502.03044; 2016-04-19; pp. 1-22 *

Also Published As

CN108875807A, published 2018-11-23

Similar Documents

Publication Publication Date Title
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN108319686B (en) Antagonism cross-media retrieval method based on limited text space
CN111260740B (en) Text-to-image generation method based on generation countermeasure network
CN109948691B (en) Image description generation method and device based on depth residual error network and attention
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN106845411B (en) Video description generation method based on deep learning and probability map model
CN108549658B (en) Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
CN109783666B (en) Image scene graph generation method based on iterative refinement
CN108830287A (en) The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN109242090B (en) Video description and description consistency judgment method based on GAN network
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN111368142B (en) Video intensive event description method based on generation countermeasure network
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN111475622A (en) Text classification method, device, terminal and storage medium
CN110069611B (en) Topic-enhanced chat robot reply generation method and device
CN107679225A (en) A kind of reply generation method based on keyword
CN109740012B (en) Method for understanding and asking and answering image semantics based on deep neural network
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN113806564B (en) Multi-mode informative text detection method and system
CN110347853A (en) A kind of image hash code generation method based on Recognition with Recurrent Neural Network
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN111783688B (en) Remote sensing image scene classification method based on convolutional neural network
CN112560440A (en) Deep learning-based syntax dependence method for aspect-level emotion analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant