CN108875807B - Image description method based on multiple attention and multiple scales - Google Patents


Info

Publication number
CN108875807B
CN108875807B (application CN201810551875.9A)
Authority
CN
China
Prior art keywords
layer
neural network
attention
model
network
Prior art date
Legal status
Active
Application number
CN201810551875.9A
Other languages
Chinese (zh)
Other versions
CN108875807A
Inventor
吴晓军
张钰
陈龙杰
张玉梅
Current Assignee
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN201810551875.9A
Publication of CN108875807A
Application granted
Publication of CN108875807B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An image description method based on multiple attention and multiple scales comprises the steps of selecting an image detection model for extracting image features, dividing the data into a network training set, a validation set and a test set, extracting image features, constructing an attention recurrent neural network model, training the attention recurrent neural network model, and describing images. Because the invention constructs an image description generation network model consisting of original image feature extraction, multi-attention multi-scale feature mapping, residual connections in the recurrent neural network, and recurrent neural network language decoding, the quality of image descriptions is improved and their details are enriched. Given only an image, the invention can generate a high-quality description with the neural network model.

Description

Image description method based on multiple attention and multiple scales
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image description method based on multiple attention and multiple scales.
Background Art
In fields such as robot question answering, guiding blind pedestrians, and assisted education for children, one often needs to understand the meaning of an image and communicate it to people in natural language. Image description combines the fields of natural language processing and computer vision: given a natural image, it generates text corresponding to the image content.
Since an image contains not only basic information such as object categories and positions but also high-level information such as relations and emotions, detecting and recognizing objects alone loses a large amount of context, including interrelations and emotions. How to effectively exploit image features and generate the corresponding text description is therefore a difficult research problem.
In recent years, deep learning has made great progress in image processing and speech analysis. The weight sharing and sparse connectivity of convolutional neural networks greatly reduce the complexity of network models; residual networks make much deeper models feasible; and long short-term memory (LSTM) networks allow recurrent models to process longer sequences, with significant benefits for decoding text sequences.
At present, mainstream deep learning algorithms for image description generation mainly use a convolutional neural network to extract image features as the input of a language decoding model; the features are fed into an LSTM network, and the corresponding description text is produced by adjusting the structure of the language model. The commonly used description generation models extract features from the input image with a convolutional neural network and combine them with the vector features of the language sequence as LSTM input. Although these methods use the context information of the input image, the language decoding model applies only a single attention model to the extracted features, and only high-level semantic features of the input image are used; the features extracted by shallow convolutional layers are not exploited, and their contribution to image description is ignored.
The attention mechanism draws on the selective attention of human vision. By quickly scanning an image, human vision focuses on a target region, the focus of attention, obtains more detail about the target, and suppresses useless information; this mechanism greatly improves the efficiency and accuracy of visual information processing. Essentially, the attention mechanism likewise selects the information most relevant to the current task from a large pool, i.e., it highlights the image-space features corresponding to a generated word. By introducing multiple attention models, a network can use features from different levels of the image.
Disclosure of Invention
The technical problem to be solved by the present invention is to overcome the above drawbacks of the prior art and to provide a multi-attention and multi-scale image description method with a better description effect.
The technical scheme adopted for solving the technical problems comprises the following steps:
(1) Selecting an image detection model for extracting image features
A convolutional neural network region-based target detection method is selected to construct the target detection model. The model is pre-trained with the PASCAL VOC 2007 or PASCAL VOC 2012 dataset, and the model with the best target detection effect during training is selected as the target detection model for extracting image features.
(2) Dividing the network training set, validation set and test set
The Microsoft COCO 2014 (Common Objects in Context) dataset is divided into a network training set, a validation set and a test set as follows: 90% of the total samples are randomly drawn as the training set, 5% as the validation set, and the remaining 5% as the test set.
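The split can be illustrated with a short sketch; the id list, seed, and helper name are illustrative assumptions, not part of the patent:

```python
import random

def split_dataset(image_ids, seed=42):
    """Randomly split image ids into 90% training, 5% validation, 5% test."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.90 * n), int(0.05 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# For the 100,000-image subset of Microsoft COCO 2014 used in example 1:
# train, val, test = split_dataset(range(100000))  # 90,000 / 5,000 / 5,000
```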
(3) Extracting image features
Convolutional feature maps are extracted from the pre-trained target detection model, a region-based target detection model with a 101-layer residual (ResNet-101) backbone, and each map is converted into a 14 × 14 feature map by average pooling.
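As a rough illustration of this step, the sketch below pools feature maps taken at several depths of a ResNet-101 to a common 14 × 14 size. It is a minimal sketch under stated assumptions: a plain torchvision ResNet-101 stands in for the region-based detector's backbone, and the tapped layers follow the description below (the first max-pooling layer and the last convolution of each residual group).

```python
import torch
import torchvision

# Assumption: a torchvision ResNet-101 stands in for the detector's ResNet-101 backbone.
backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1").eval()

@torch.no_grad()
def multi_scale_features(image):  # image: (1, 3, H, W) float tensor
    """Tap feature maps at several depths and average-pool each to 14 x 14."""
    feats = []
    x = backbone.conv1(image)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)                  # the first max-pooling layer
    feats.append(x)
    for stage in (backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4):
        x = stage(x)                         # last conv output of each residual group
        feats.append(x)
    return [torch.nn.functional.adaptive_avg_pool2d(f, (14, 14)) for f in feats]
```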
(4) Constructing the attention recurrent neural network model
The attention recurrent neural network comprises an attention feature mapping module and a recurrent neural network language decoding module; connecting the attention feature mapping module with the language decoding module yields the attention recurrent neural network model.
The recurrent neural network language decoding module of the invention is as follows: the module comprises six long short-term memory (LSTM) layers and one Softmax layer. The input of the first LSTM layer consists of three parts, x_t, h_{t-1}^n and \bar{v}, where h_{t-1}^n denotes the output state of the n-th, namely final, LSTM layer at the previous time step, t denotes the current time step and t-1 the previous one, x_t is the one-hot encoded word vector, and \bar{v} is the high-level average-pooled image feature:

\bar{v} = \frac{1}{k}\sum_{i=1}^{k} v_i

where v_i is the feature of the i-th region. Feeding the three parts x_t, h_{t-1}^n and \bar{v} into the first LSTM layer of the language model yields the recurrent neural network language decoding module.
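A minimal sketch of the first LSTM layer's input as just described: the embedded word vector x_t, the previous top-layer state h_{t-1}^n, and the mean-pooled image feature \bar{v} are concatenated and fed to an LSTM cell. All dimensions and the nn.LSTMCell realization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FirstDecoderLayer(nn.Module):
    def __init__(self, vocab_size, embed_dim=1000, feat_dim=2048, hidden_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # one-hot index -> word vector x_t
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim + feat_dim, hidden_dim)

    def forward(self, word_idx, h_top_prev, v_regions, state):
        # v_bar: mean of the k region features v_i, shape (B, feat_dim)
        v_bar = v_regions.mean(dim=1)
        x_t = self.embed(word_idx)                        # (B, embed_dim)
        inp = torch.cat([x_t, h_top_prev, v_bar], dim=1)  # the three input parts
        h1, c1 = self.lstm(inp, state)
        return h1, (h1, c1)
```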
(5) Training the attention recurrent neural network model
The network training set is input into the target detection model of step (1); feature maps of the images are extracted from convolutional layers of different depths as in step (3) and input into the attention recurrent neural network model constructed in step (4). All descriptions in the dataset are extracted to form the word list and word vectors. The attention recurrent neural network model is trained with the adaptive moment estimation (Adam) optimization method while dynamically adjusting the learning rate, using the cross-entropy loss L_{XE}(\theta) as the loss function:

L_{XE}(\theta) = -\sum_{t} \log p_\theta\!\left(y_t^* \mid y_{1:t-1}^*\right)

where y_{1:T}^* is the ground-truth word sequence of the target language, \theta denotes the parameters of the image description generation model's decoder, and p_\theta(y_t^* \mid y_{1:t-1}^*) is the probability that the LSTM decoder outputs the word y_t^*.
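A sketch of the cross-entropy objective L_XE(θ) defined above, summing the negative log-probability of each ground-truth word; the logits shape, padding id, and batch averaging are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def caption_xe_loss(logits, targets, pad_idx=0):
    """logits: (B, T, vocab) decoder outputs; targets: (B, T) ground-truth word ids.
    Returns -sum_t log p_theta(y*_t | y*_{1:t-1}), averaged over the batch (assumption)."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)
    mask = (targets != pad_idx).float()  # ignore padded positions (assumption)
    return -(token_ll * mask).sum(dim=1).mean()
```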
When training the attention recurrent neural network model, beam search is used first, and the model is then trained further with the self-critical sequence training reinforcement learning method.
After training, the effect of the trained model is tested on the image validation set and the model parameters are adjusted, yielding the attention recurrent neural network model.
(6) Image description
The test set obtained in step (2) is input into the attention recurrent neural network model trained in step (5); at each time step the word with the maximum probability is selected as the result of that step, and the words are concatenated in generation order as the final output of the network, completing the image description.
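The decoding rule of step (6) can be sketched as a greedy loop: at each time step the highest-probability word is taken and fed back until an end token appears. The model.step interface, the initial-state helper, and the special token ids are illustrative assumptions.

```python
import torch

@torch.no_grad()
def greedy_decode(model, feats, bos_idx=1, eos_idx=2, max_len=20):
    """Select the maximum-probability word at each time step and concatenate in order."""
    word = torch.tensor([bos_idx])
    state = model.init_state(feats)                     # assumed helper for the initial state
    caption = []
    for _ in range(max_len):
        logits, state = model.step(word, feats, state)  # assumed one-step interface
        word = logits.argmax(dim=-1)                    # word with maximum probability
        if word.item() == eos_idx:
            break
        caption.append(word.item())
    return caption
```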
In step (3), the method for extracting image convolutional features with the region target detection model with the 101-layer residual structure is as follows: convolutional features are extracted from the first max-pooling layer of the residual network, and from the last convolutional layer in each group of residual blocks after the max-pooling layer.
The extracted convolutional features are:

V' = \{v_1, \ldots, v_k\},

where V' denotes the set of k features of the k regions, each feature representing a salient region of the image, v_k denotes the average-pooled convolutional feature of the k-th region segmented from the image convolutional layer, and k is a finite positive integer.
In step (4) of constructing the attention recurrent neural network, the attention feature mapping module of the invention is as follows:

The attention feature mapping module takes two inputs: the network state h_t and each feature v_i extracted from the convolutional layer. The module is given by:

a_{i,t} = w_a^T \tanh\!\left(W_{va} v_i + W_{ha} h_t\right)
\alpha_t = \mathrm{softmax}(a_t)

where the parameters w_a, W_{va} and W_{ha} are all to be learned and \alpha_t is the attention weight. With these inputs, the attention feature mapping module outputs the attended image feature:

c_t = \sum_i \alpha_{i,t} v_i

where v_i denotes the average-pooled convolutional feature of the i-th region segmented from the image convolutional layer, c_t is the final output, and i and t are finite positive integers.
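A minimal sketch of the attention feature mapping module defined by the equations above; the feature, hidden, and attention dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionMap(nn.Module):
    """a_{i,t} = w_a^T tanh(W_va v_i + W_ha h_t); alpha_t = softmax(a_t); c_t = sum_i alpha_{i,t} v_i."""
    def __init__(self, feat_dim=2048, hidden_dim=1000, attn_dim=1000):
        super().__init__()
        self.W_va = nn.Linear(feat_dim, attn_dim, bias=False)
        self.W_ha = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, v, h):
        # v: (B, k, feat_dim) region features; h: (B, hidden_dim) LSTM state
        a = self.w_a(torch.tanh(self.W_va(v) + self.W_ha(h).unsqueeze(1)))  # (B, k, 1)
        alpha = torch.softmax(a, dim=1)   # attention weights alpha_t over the k regions
        c = (alpha * v).sum(dim=1)        # attended image feature c_t
        return c, alpha.squeeze(-1)
```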
The features of different levels are input into different attention models as follows: low-layer convolutional features are fed into the attention models at the lower layers of the recurrent neural network model, and high-layer convolutional features are fed into the attention models at the higher layers.
In step (4) of constructing the multi-attention multi-scale recurrent neural network, the attention feature mapping module and the recurrent neural network language decoding module are connected as follows: the output of the first recurrent layer is connected to the input of the first attention layer, whose output feeds the second recurrent layer; the output of the second recurrent layer is connected to the input of the second attention layer, whose output feeds the third recurrent layer; the output of the third recurrent layer is connected to the input of the third attention layer, whose output feeds the fourth recurrent layer; the output of the fourth recurrent layer is connected to the input of the fourth attention layer, whose output feeds the fifth recurrent layer; and the output of the fifth recurrent layer is connected to the input of the fifth attention layer, whose output feeds the sixth recurrent layer.
The residual connections between the recurrent layers are made as follows: the output of the first recurrent layer is connected to the input of the third, the output of the second to the input of the fourth, the output of the third to the input of the fifth, and the output of the fourth to the input of the sixth.
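Under the connection pattern just described, one decoding step of the six-layer decoder can be sketched as follows, reusing the AttentionMap module sketched above: each LSTM layer feeds an attention layer over features of the matching scale, the attended context enters the next LSTM layer, and the output of layer l is added into the input of layer l+2 as a residual connection. Hidden sizes and interfaces are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiAttnDecoderStep(nn.Module):
    """One time step of the six-layer LSTM decoder with per-layer attention and
    residual connections (output of layer l added into the input of layer l+2)."""
    def __init__(self, in_dim, hidden_dim=1000, feat_dim=2048, n_layers=6):
        super().__init__()
        dims = [in_dim] + [hidden_dim + feat_dim] * (n_layers - 1)
        self.lstms = nn.ModuleList(nn.LSTMCell(d, hidden_dim) for d in dims)
        self.attns = nn.ModuleList(AttentionMap(feat_dim, hidden_dim)
                                   for _ in range(n_layers - 1))

    def forward(self, x, multi_scale_feats, states):
        # multi_scale_feats[l]: (B, k, feat_dim); low-level features feed the lower
        # attention layers, high-level features the higher ones.
        hs, new_states, inp = [], [], x
        for l, lstm in enumerate(self.lstms):
            h, c = lstm(inp, states[l])
            hs.append(h)
            new_states.append((h, c))
            if l + 1 < len(self.lstms):
                ctx, _ = self.attns[l](multi_scale_feats[l], h)
                # Residual: add the output from two layers below into the next input.
                res = h + hs[l - 1] if l >= 1 else h
                inp = torch.cat([res, ctx], dim=1)
        return hs[-1], new_states
```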
Compared with the prior art, the invention has the following advantages:
Because the invention constructs an image description generation network model consisting of original image feature extraction, multi-attention multi-scale feature mapping, residual connections in the recurrent neural network, and recurrent neural network language decoding, the quality of image descriptions is improved and their details are enriched. Given only an image, the invention can generate a high-quality description result with the neural network model.
Drawings
FIG. 1 is a flowchart of example 1 of the present invention.
FIG. 2 is a flow diagram of constructing the multi-attention multi-scale neural network language generation module in FIG. 1.
FIG. 3 compares image description results of the top-down network model with those of the method of example 1.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, but the present invention is not limited to the examples described below.
Example 1
Taking 100,000 images selected from the Microsoft COCO 2014 dataset as an example, the multi-attention and multi-scale image description generation method comprises the following steps:
(1) Selecting an image detection model for extracting image features
A convolutional neural network region-based target detection method is selected to construct the target detection model; this is a known method, disclosed in Advances in Neural Information Processing Systems, 2015. The target detection model is pre-trained with the PASCAL VOC 2007 dataset, and the model with the best target detection effect during training is selected as the target detection model for extracting image features.
(2) Dividing the network training set, validation set and test set
The Microsoft COCO 2014 dataset is divided into a network training set, a validation set and a test set as follows: from the 100,000-image dataset, 90,000 images (90%) are randomly drawn as the training set, 5,000 images (5%) as the validation set, and 5,000 images (5%) as the test set.
(3) Extracting image features
Convolutional feature maps are extracted from the pre-trained target detection model, a region-based target detection model with a 101-layer residual backbone; the 101-layer residual structure is a known structure, disclosed in Deep Residual Learning for Image Recognition. The feature maps are converted into 14 × 14 feature maps by average pooling, a well-known method.
The region target detection model with the 101-layer residual structure extracts image convolutional features as follows: convolutional features are extracted from the first max-pooling layer of the residual network, and from the last convolutional layer in each group of residual blocks after the max-pooling layer.
The extracted convolutional features are:

V' = \{v_1, \ldots, v_k\}

where V' denotes the set of k features of the k regions, each feature representing a region of the image, v_k denotes the average-pooled convolutional feature of the k-th region segmented from the image convolutional layer, and k = 14.
(4) Constructing the attention recurrent neural network model
The attention recurrent neural network comprises an attention feature mapping module and a recurrent neural network language decoding module. The attention feature mapping module is as follows:

The attention feature mapping module takes two inputs: the network state h_t and each feature v_i extracted from the convolutional layer. The module is given by:

a_{i,t} = w_a^T \tanh\!\left(W_{va} v_i + W_{ha} h_t\right)
\alpha_t = \mathrm{softmax}(a_t)

where the parameters w_a, W_{va} and W_{ha} are all to be learned and \alpha_t is the attention weight. With these inputs, the module outputs the attended image feature:

c_t = \sum_i \alpha_{i,t} v_i

where v_i denotes the average-pooled convolutional feature of the i-th region segmented from the image convolutional layer and c_t is the final output.
The recurrent neural network language decoding module comprises six long short-term memory (LSTM) layers and one Softmax layer. The input of the first LSTM layer consists of three parts, x_t, h_{t-1}^n and \bar{v}, where h_{t-1}^n denotes the output state of the n-th (final) LSTM layer at the previous time step, t denotes the current time step and t-1 the previous one, x_t is the one-hot encoded word vector, and \bar{v} is the high-level average-pooled image feature:

\bar{v} = \frac{1}{k}\sum_{i=1}^{k} v_i

where v_i is the feature of the i-th region. Feeding the three parts x_t, h_{t-1}^n and \bar{v} into the first LSTM layer of the language model yields the recurrent neural network language decoding module.
Connecting the attention feature mapping module with the recurrent neural network language decoding module constructs the attention recurrent neural network model.
The features of different levels are input into different attention models as follows: low-layer convolutional features are fed into the attention models at the lower layers of the recurrent neural network model, and high-layer convolutional features are fed into the attention models at the higher layers.
The attention feature mapping module and the recurrent neural network language decoding module are connected in this step as follows: the recurrent layers of the decoding module are connected in sequence and linked by residual connections, with the output of the first recurrent layer connected to the input of the first attention layer, whose output feeds the second recurrent layer; the output of the second recurrent layer connected to the input of the second attention layer, whose output feeds the third recurrent layer; the output of the third recurrent layer connected to the input of the third attention layer, whose output feeds the fourth recurrent layer; the output of the fourth recurrent layer connected to the input of the fourth attention layer, whose output feeds the fifth recurrent layer; and the output of the fifth recurrent layer connected to the input of the fifth attention layer, whose output feeds the sixth recurrent layer.
The residual connections in this step link the recurrent layers as follows: the output of the first recurrent layer is connected to the input of the third, the output of the second to the input of the fourth, the output of the third to the input of the fifth, and the output of the fourth to the input of the sixth.
(5) Training the attention recurrent neural network model
The 90,000 images of the network training set are input into the target detection model of step (1); feature maps of the images are extracted from convolutional layers of different depths as in step (3) and input into the attention recurrent neural network model constructed in step (4).
All descriptions in the dataset are extracted to form the word list and word vectors as follows: over all descriptions in the Microsoft COCO 2014 dataset, words that occur five or more times in the sentences are collected into a word list; each word in the list is encoded by one-hot encoding, and the one-hot code of each word of a description sentence in the dataset is mapped to an embedding vector. The attention recurrent neural network model is trained by dynamically adjusting the learning rate with the adaptive moment estimation optimization method (Adam: A Method for Stochastic Optimization), using the cross-entropy loss L_{XE}(\theta) as the loss function:
L_{XE}(\theta) = -\sum_{t} \log p_\theta\!\left(y_t^* \mid y_{1:t-1}^*\right)

where y_{1:T}^* is the ground-truth word sequence of the target language, \theta denotes the parameters of the image description generation model's decoder, and p_\theta(y_t^* \mid y_{1:t-1}^*) is the probability that the LSTM decoder outputs the word y_t^*.
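The word-list construction described above can be sketched as follows; the minimum count of five follows the text, while the special tokens and sorting are illustrative assumptions.

```python
from collections import Counter
import torch.nn as nn

def build_vocab(captions, min_count=5):
    """captions: iterable of tokenized sentences; keep words occurring >= 5 times."""
    counts = Counter(w for sent in captions for w in sent)
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2}   # reserved ids are an assumption
    for w in sorted(w for w, c in counts.items() if c >= min_count):
        vocab[w] = len(vocab)
    return vocab

# The one-hot index of each word is then mapped to an embedding vector:
# embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=1000)
```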
When training the attention recurrent neural network model, beam search is used, the number of hidden nodes of the LSTM layers and of the attention layers is set to 1000, and the model is trained with a learning rate of 1×10^-4; the model is then trained further with the self-critical sequence training reinforcement learning method (Self-critical Sequence Training for Image Captioning), using learning rates of 1×10^-5 and 1×10^-6 in turn. After training, the effect of the trained model is tested on the 5,000-image validation set and the model parameters are adjusted, yielding the attention recurrent neural network model.
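The self-critical fine-tuning stage can be sketched schematically: the reward (e.g., CIDEr) of a sampled caption is baselined by the reward of the greedily decoded caption, giving a policy-gradient loss. The sample/greedy/reward interfaces below are illustrative assumptions, not the patent's implementation.

```python
import torch

def scst_loss(model, feats, refs, reward_fn):
    """Self-critical sequence training step (sketch).
    reward_fn(caption, refs) -> float, e.g. a CIDEr scorer (assumed interface)."""
    sample, log_probs = model.sample(feats)   # sampled caption and its log-probs (assumed)
    with torch.no_grad():
        baseline = model.greedy(feats)        # greedy caption as the baseline (assumed)
    advantage = reward_fn(sample, refs) - reward_fn(baseline, refs)
    # Push up the sampled words exactly when they beat the greedy baseline.
    return -(advantage * log_probs.sum())
```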
(6) Image description
The 5,000 test-set images obtained in step (2) are input into the attention recurrent neural network model trained in step (5); at each time step the word with the maximum probability is selected as the result of that step, and the words are concatenated in generation order as the final output of the network, completing the image description.
After training of the attention recurrent neural network model is completed, the image descriptions are evaluated with the Consensus-based Image Description Evaluation metric (CIDEr), giving a score of 1.167.
Example 2
Taking 100,000 images selected from the Microsoft COCO 2014 dataset as an example, the multi-attention and multi-scale image description generation method comprises the following steps:
in the step (1) of selecting an image detection model for extracting image features, a target detection method In a convolutional neural network region is selected to construct a target detection model, and the target detection method In the convolutional neural network region is a known method and is disclosed In advanced In neural information processing systems 2015. And pre-training a target detection model by using a Pascal visual target classification 2012 data set, and selecting the model with the best target detection effect in the training as the target detection model for extracting the image characteristics.
The other steps are the same as in example 1, completing the image description.
In order to verify the beneficial effects of the present invention, the inventors carried out a simulation experiment using the method of embodiment 1, under the following conditions:
1. simulation conditions
Hardware: one Nvidia TITAN Xp graphics card, 128 GB of memory.
Software platform: the PyTorch framework.
2. Simulation content and results
The results of the experiment carried out under the above simulation conditions with the method of the present invention are shown in FIG. 3: the first row is the description from the top-down network model, and the second row is the description from the present method. Compared with the prior art, the method of the present invention has the following advantages:
the invention provides a method for constructing multiple levels of attention, which can respectively extract the features of different levels of an image at the same time and improve the expression capability of generating sentences. A residual error learning mechanism is introduced into the multi-layer long and short term memory network, and the input and the output of the long and short term memory networks of different layers are connected together through an addition principle, so that the problem that the low-layer parameters of the model are difficult to update effectively due to gradient dispersion is solved. A plurality of attention structures are hierarchically fused into the network, and a model is trained by introducing a reinforcement learning method, so that output word sentences are more accurate, and the system performance is further improved. After the attention circulation neural network model is trained, the Image Description is evaluated by adopting a consistency-based Image Description Evaluation standard (CIDER) with a score of 1.167, so that a better effect is achieved.

Claims (3)

1. A multi-attention and multi-scale based image description method is characterized by comprising the following steps:
(1) selecting an image detection model for extracting image features
selecting a convolutional neural network region-based target detection method to construct a target detection model, pre-training the target detection model with the PASCAL VOC 2007 or PASCAL VOC 2012 dataset, and selecting the model with the best target detection effect in training as the target detection model for extracting image features;
(2) dividing the network training set, validation set and test set
dividing the Microsoft COCO 2014 dataset into a network training set, a validation set and a test set as follows: randomly extracting 90% of the total samples as the training set, 5% as the validation set, and the remaining 5% as the test set;
(3) extracting image features
extracting convolutional feature maps from the pre-trained target detection model, a region target detection model with a 101-layer residual structure, and converting them into 14 × 14 feature maps by average pooling;
(4) constructing the attention recurrent neural network model
the attention recurrent neural network comprises an attention feature mapping module and a recurrent neural network language decoding module, and the attention feature mapping module is connected with the recurrent neural network language decoding module to construct the attention recurrent neural network model;
the attention feature mapping module is as follows: it takes two inputs, the network state h_t and each feature v_i extracted from the convolutional layer, and is given by:

a_{i,t} = w_a^T \tanh\!\left(W_{va} v_i + W_{ha} h_t\right)
\alpha_t = \mathrm{softmax}(a_t)

where the parameters w_a, W_{va} and W_{ha} are all to be learned and \alpha_t is the attention weight; with these inputs, the attention feature mapping module outputs the attended image feature:

c_t = \sum_i \alpha_{i,t} v_i

where v_i denotes the average-pooled convolutional feature of the i-th region segmented from the image convolutional layer, c_t is the final output, and i and t are finite positive integers;
the method for inputting the numerical characteristics of different levels into different attention models comprises the following steps: the low-layer convolution numerical characteristics are connected into an attention model positioned at the low layer of the cyclic neural network model, and the high-layer convolution numerical characteristics are connected into an attention model positioned at the high layer of the cyclic neural network model;
the recurrent neural network language decoding module is as follows: the module comprises six layers of long-term and short-term memory networks and one layer of Softmax network, wherein the input of the first layer of long-term and short-term memory network comprises xt
Figure FDA0003502208650000021
In the three parts, the first part and the second part,
Figure FDA0003502208650000022
representing the output state of the nth layer, namely the final layer, long-short term memory network at the last moment, wherein t represents the current moment, t-1 represents the previous moment, and xtRepresents the thermally encoded word vector,
Figure FDA0003502208650000023
is a high-level average pooling feature of the image,
Figure FDA0003502208650000024
comprises the following steps:
Figure FDA0003502208650000025
wherein v isiFor the feature of the ith region, xt
Figure FDA0003502208650000026
Inputting the three parts into a first layer long-short term memory network structure of a language model to obtain a recurrent neural network language decoding module;
(5) training the attention recurrent neural network model
inputting the network training set into the target detection model of step (1), extracting feature maps of the images from convolutional layers of different depths as in step (3), inputting the feature maps into the attention recurrent neural network model constructed in step (4), extracting all descriptions in the dataset to form a word list and word vectors, training the attention recurrent neural network model with the adaptive moment estimation optimization method while dynamically adjusting the learning rate, and using the cross-entropy loss L_{XE}(\theta) as the loss function:

L_{XE}(\theta) = -\sum_{t} \log p_\theta\!\left(y_t^* \mid y_{1:t-1}^*\right)

where y_{1:T}^* is the ground-truth word sequence of the target language, \theta denotes the parameters of the image description generation model's decoder, and p_\theta(y_t^* \mid y_{1:t-1}^*) is the probability that the LSTM decoder outputs the word y_t^*;
when the attention circulation neural network model is trained, the attention circulation neural network model is trained by adopting a cluster searching method, and then the attention circulation neural network model is trained by using a self-identification sequence training reinforcement learning method;
after the training is finished, testing the effect of the trained attention circulation neural network model by using an image verification set, and adjusting model parameters to obtain an attention circulation neural network model;
(6) image description
inputting the test set obtained in step (2) into the attention recurrent neural network model trained in step (5), sequentially selecting the word with the maximum probability at each time step as the result of that step, and concatenating the words in generation order as the final output of the network to complete the image description.
2. The multi-attention and multi-scale based image description method according to claim 1, wherein in step (3) the region target detection model with the 101-layer residual structure extracts image convolutional features as follows: convolutional features are extracted from the first max-pooling layer of the residual network of the region target detection model with the 101-layer residual structure, and from the last convolutional layer in each group of residual blocks after the max-pooling layer;
the method for extracting the convolution numerical characteristics comprises the following steps:
V′={v1,…,vk},
in the formula V*A set of k features representing the above k regions, each feature representing a salient region of the image, vkThe method represents the average pooling convolution characteristics of the k region segmented from the image convolution layer, wherein k is a finite positive integer.
3. The multi-attention and multi-scale based image description method according to claim 1, wherein in step (4) of constructing the multi-attention multi-scale recurrent neural network, the attention feature mapping module and the recurrent neural network language decoding module are connected as follows: the output of the first recurrent layer is connected to the input of the first attention layer, whose output feeds the second recurrent layer; the output of the second recurrent layer is connected to the input of the second attention layer, whose output feeds the third recurrent layer; the output of the third recurrent layer is connected to the input of the third attention layer, whose output feeds the fourth recurrent layer; the output of the fourth recurrent layer is connected to the input of the fourth attention layer, whose output feeds the fifth recurrent layer; and the output of the fifth recurrent layer is connected to the input of the fifth attention layer, whose output feeds the sixth recurrent layer;
the method for connecting the residual errors with each layer of the recurrent neural network comprises the following steps: the output of the first layer of the recurrent neural network is connected with the input of the third layer of the recurrent neural network, the output of the second layer of the recurrent neural network is connected with the input of the fourth layer of the recurrent neural network, the output of the third layer of the recurrent neural network is connected with the input of the fifth layer of the recurrent neural network, and the output of the fourth layer of the recurrent neural network is connected with the input of the sixth layer of the recurrent neural network.
Application CN201810551875.9A, priority date 2018-05-31, filed 2018-05-31: Image description method based on multiple attention and multiple scales. Active; granted as CN108875807B.

Priority Applications (1)

CN201810551875.9A: Image description method based on multiple attention and multiple scales; priority date 2018-05-31; filed 2018-05-31

Applications Claiming Priority (1)

CN201810551875.9A: Image description method based on multiple attention and multiple scales; priority date 2018-05-31; filed 2018-05-31

Publications (2)

CN108875807A, published 2018-11-23
CN108875807B, published 2022-05-27

Family

Family ID: 64336183

Family Applications (1)

CN201810551875.9A (Active): Image description method based on multiple attention and multiple scales; priority date 2018-05-31; filed 2018-05-31

Country Status (1)

CN: CN108875807B

Families Citing this family (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310518B (en) * 2018-12-11 2023-12-08 北京嘀嘀无限科技发展有限公司 Picture feature extraction method, target re-identification method, device and electronic equipment
CN111325068B (en) * 2018-12-14 2023-11-07 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN109344920B (en) * 2018-12-14 2021-02-02 汇纳科技股份有限公司 Customer attribute prediction method, storage medium, system and device
CN111339340A (en) * 2018-12-18 2020-06-26 顺丰科技有限公司 Training method of image description model, image searching method and device
CN109376804B (en) * 2018-12-19 2020-10-30 中国地质大学(武汉) Hyperspectral remote sensing image classification method based on attention mechanism and convolutional neural network
CN109784197B (en) * 2018-12-21 2022-06-07 西北工业大学 Pedestrian re-identification method based on hole convolution and attention mechanics learning mechanism
CN109620205B (en) * 2018-12-26 2022-10-28 上海联影智能医疗科技有限公司 Electrocardiogram data classification method and device, computer equipment and storage medium
CN109726696B (en) * 2019-01-03 2023-04-07 电子科技大学 Image description generation system and method based on attention-pushing mechanism
US11087175B2 (en) * 2019-01-30 2021-08-10 StradVision, Inc. Learning method and learning device of recurrent neural network for autonomous driving safety check for changing driving mode between autonomous driving mode and manual driving mode, and testing method and testing device using them
CN109919221B (en) * 2019-03-04 2022-07-19 山西大学 Image description method based on bidirectional double-attention machine
CN109948691B (en) * 2019-03-14 2022-02-18 齐鲁工业大学 Image description generation method and device based on depth residual error network and attention
CN110084128B (en) * 2019-03-29 2021-12-14 安徽艾睿思智能科技有限公司 Scene graph generation method based on semantic space constraint and attention mechanism
CN110084250B (en) * 2019-04-26 2024-03-12 北京金山数字娱乐科技有限公司 Image description method and system
CN110097136A (en) * 2019-05-09 2019-08-06 杭州筑象数字科技有限公司 Image classification method neural network based
CN110633610B (en) * 2019-05-17 2022-03-25 西南交通大学 Student state detection method based on YOLO
CN110188775B (en) * 2019-05-28 2020-06-26 创意信息技术股份有限公司 Image content description automatic generation method based on joint neural network model
CN110188765B (en) * 2019-06-05 2021-04-06 京东方科技集团股份有限公司 Image semantic segmentation model generation method, device, equipment and storage medium
CN112101395A (en) * 2019-06-18 2020-12-18 上海高德威智能交通系统有限公司 Image identification method and device
CN110288029B (en) * 2019-06-27 2022-12-06 西安电子科技大学 Tri-LSTMs model-based image description method
CN110321962B (en) * 2019-07-09 2021-10-08 北京金山数字娱乐科技有限公司 Data processing method and device
CN110427836B (en) * 2019-07-11 2020-12-01 重庆市地理信息和遥感应用中心(重庆市测绘产品质量检验测试中心) High-resolution remote sensing image water body extraction method based on multi-scale optimization
CN110503079A (en) * 2019-08-30 2019-11-26 山东浪潮人工智能研究院有限公司 A kind of monitor video based on deep neural network describes method
CN111013149A (en) * 2019-10-23 2020-04-17 浙江工商大学 Card design generation method and system based on neural network deep learning
CN110929013A (en) * 2019-12-04 2020-03-27 成都中科云集信息技术有限公司 Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN111126282B (en) * 2019-12-25 2023-05-12 中国矿业大学 Remote sensing image content description method based on variational self-attention reinforcement learning
CN111240486B (en) * 2020-02-17 2021-07-02 河北冀联人力资源服务集团有限公司 Data processing method and system based on edge calculation
CN111444968A (en) * 2020-03-30 2020-07-24 哈尔滨工程大学 Image description generation method based on attention fusion
CN111611373B (en) * 2020-04-13 2021-09-10 清华大学 Robot-oriented specific active scene description method
CN111522986B (en) * 2020-04-23 2023-10-10 北京百度网讯科技有限公司 Image retrieval method, device, equipment and medium
CN112529857B (en) * 2020-12-03 2022-08-23 重庆邮电大学 Ultrasonic image diagnosis report generation method based on target detection and strategy gradient
CN112668608B (en) * 2020-12-04 2024-03-15 北京达佳互联信息技术有限公司 Image recognition method and device, electronic equipment and storage medium
CN112699915B (en) * 2020-12-07 2024-02-02 杭州电子科技大学 Method for identifying CAD model assembly interface based on improved graph annotation force network
CN112784848B (en) * 2021-02-04 2024-02-27 东北大学 Image description generation method based on multiple attention mechanisms and external knowledge
CN113591874B (en) * 2021-06-01 2024-04-26 清华大学 Paragraph level image description generation method with long-time memory enhancement
CN113707112B (en) * 2021-08-13 2024-05-28 陕西师范大学 Automatic generation method of recursion jump connection deep learning music based on layer standardization
CN114049501A (en) * 2021-11-22 2022-02-15 江苏科技大学 Image description generation method, system, medium and device fusing cluster search
CN113822383B (en) * 2021-11-23 2022-03-15 北京中超伟业信息安全技术股份有限公司 Unmanned aerial vehicle detection method and system based on multi-domain attention mechanism
CN115936073B (en) * 2023-02-16 2023-05-16 江西省科学院能源研究所 Language-oriented convolutional neural network and visual question-answering method
CN115984296B (en) * 2023-03-21 2023-06-13 译企科技(成都)有限公司 Medical image segmentation method and system applying multi-attention mechanism

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330362A (en) * 2017-05-25 2017-11-07 北京大学 A kind of video classification methods based on space-time notice
CN107358948A (en) * 2017-06-27 2017-11-17 上海交通大学 Language in-put relevance detection method based on attention model
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN107844743A (en) * 2017-09-28 2018-03-27 浙江工商大学 A kind of image multi-subtitle automatic generation method based on multiple dimensioned layering residual error network
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN108052512A (en) * 2017-11-03 2018-05-18 同济大学 A kind of iamge description generation method based on depth attention mechanism

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2545661A (en) * 2015-12-21 2017-06-28 Nokia Technologies Oy A method for analysing media content
US10565305B2 (en) * 2016-11-18 2020-02-18 Salesforce.Com, Inc. Adaptive attention model for image captioning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN107330362A (en) * 2017-05-25 2017-11-07 北京大学 A kind of video classification methods based on space-time notice
CN107358948A (en) * 2017-06-27 2017-11-17 上海交通大学 Language in-put relevance detection method based on attention model
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 A kind of gesture identification method based on 3D CNN and convolution LSTM
CN107844743A (en) * 2017-09-28 2018-03-27 浙江工商大学 A kind of image multi-subtitle automatic generation method based on multiple dimensioned layering residual error network
CN108052512A (en) * 2017-11-03 2018-05-18 同济大学 A kind of iamge description generation method based on depth attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Image description via layer-wise multi-objective optimization and multi-layer probability fusion of LSTMs; Tang Pengjie et al.; Acta Automatica Sinica; 2017-12-11; vol. 43; pp. 1-13 *
Show and Tell: A Neural Image Caption Generator; Oriol Vinyals et al.; arXiv:1411.4555; 2015-04-20; pp. 1-9 *
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention; Kelvin Xu et al.; arXiv:1502.03044; 2016-04-19; pp. 1-22 *

Also Published As

CN108875807A, published 2018-11-23

Similar Documents

Publication Publication Date Title
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN108319686B (en) Antagonism cross-media retrieval method based on limited text space
CN111260740B (en) Text-to-image generation method based on generation countermeasure network
CN109948691B (en) Image description generation method and device based on depth residual error network and attention
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN106845411B (en) Video description generation method based on deep learning and probability map model
CN108549658B (en) Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
CN109783666B (en) Image scene graph generation method based on iterative refinement
CN108830287A (en) The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN109242090B (en) Video description and description consistency judgment method based on GAN network
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN111368142B (en) Video intensive event description method based on generation countermeasure network
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
CN111475622A (en) Text classification method, device, terminal and storage medium
CN110069611B (en) Topic-enhanced chat robot reply generation method and device
CN107679225A (en) A kind of reply generation method based on keyword
CN109740012B (en) Method for understanding and asking and answering image semantics based on deep neural network
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN113806564B (en) Multi-mode informative text detection method and system
CN110347853A (en) A kind of image hash code generation method based on Recognition with Recurrent Neural Network
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN111783688B (en) Remote sensing image scene classification method based on convolutional neural network
CN112560440A (en) Deep learning-based syntax dependence method for aspect-level emotion analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant