CN110633713A - Image feature extraction method based on improved LSTM - Google Patents


Info

Publication number: CN110633713A
Application number: CN201910889843.4A
Authority: CN (China)
Prior art keywords: lstm, output, gate, current, vector
Legal status: Withdrawn (the status listed is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 李建平, 顾小丰, 胡健, 赖志龙, 苌浩阳, 蒋胜, 冯文婷
Current and Original Assignee: University of Electronic Science and Technology of China
Application filed by University of Electronic Science and Technology of China
Priority to CN201910889843.4A
Publication of CN110633713A

Classifications

    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]


Abstract

The invention discloses an image feature extraction method based on improved LSTM, which comprises the following steps. S1, extracting feature vectors: inputting an original image into a convolutional neural network and extracting a corresponding feature vector. S2, acquiring image features: inputting the extracted feature vector into a trained LSTM model to obtain image features. After the method obtains the feature vector of the original image, it considers the problem that the existing LSTM model fails to respect context when generating a new word: an LSTM model that carries the above-text information is constructed in the decoding stage to perform feature extraction, which improves the accuracy of image feature extraction.

Description

Image feature extraction method based on improved LSTM
Technical Field
The invention belongs to the technical field of image feature extraction, and particularly relates to an image feature extraction method based on improved LSTM.
Background
The image features are used to describe image information; in a physical sense they generally include shape, color, texture, and spatial relationships. The shape of an image generally refers to the outline shape and the region shape: the outline shape describes the edges and represents the external form of the whole image, while the region feature describes the shape inside the image. The color feature is a global feature; it is the most obvious and most noticeable surface characteristic of an image and is represented at the level of pixels. Like the color feature, the texture feature is also a global feature and also represents the surface characteristics of an object, but it is computed over regions of many pixels. The spatial-relationship features concern the multiple entities in an image and are divided into relative spatial position and absolute spatial position: the former emphasizes relative relations, the latter emphasizes distance and coordinate orientation.
At present, extracting image features with convolutional neural networks is very common and achieves good results. The convolutional neural network belongs to the encoding stage of the automatic image description task; the models most commonly used in the decoding stage are the recurrent neural network and its variants, such as the standard RNN, LSTM, and GRU, among which LSTM, with its long-distance memory, is the most widely used. LSTM is one of the most successful variants of the RNN: it inherits most of the properties of the RNN while solving the gradient-vanishing problem the RNN suffers during back-propagation. In natural language processing, LSTM is particularly good at sequence-related tasks such as dialog systems, machine translation, and image description. Although feedforward networks represented by the convolutional neural network still hold an absolute advantage in performance and effect on classification tasks, on sequence-processing tasks the convolutional neural network cannot match the recurrent neural network, and LSTM more vividly expresses and simulates human behavioral characteristics, logical thinking, and cognition.
Analysis of the existing LSTM reveals that, when a new word is generated, the new word is affected at the sentence level almost only by the word immediately preceding it, while the other words have little influence on it. This does not reflect the full context, so the existing LSTM has difficulty extracting accurate image features.
Disclosure of Invention
Aiming at the above defects in the prior art, the image feature extraction method based on the improved LSTM solves the problem that, in the existing image feature extraction process, context information is difficult to take into account when the LSTM is used for decoding, which makes the extracted features inaccurate.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that: an improved LSTM-based image feature extraction method comprises the following steps:
s1, extracting feature vectors: inputting an original image into a convolutional neural network, and extracting a corresponding feature vector;
s2, acquiring image characteristics: and inputting the extracted feature vector into a trained LSTM model to obtain image features.
Further, the size of the original image in the step S1 is 128 × 128;
the convolutional neural network has a 5-layer network structure;
the number of feature vectors extracted by the convolutional neural network is 1.
Further, the convolutional neural network in step S1 includes a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, and a fully-connected layer, which are connected in sequence;
the first convolution layer inputs 128 x 128 original images;
the first convolution layer comprises 8 convolution kernels with the size of 5 x 5 and outputs 8 feature maps with the size of 64 x 64;
the second convolution layer comprises 16 convolution kernels with the size of 4 x 4 and outputs 16 characteristic maps of 32 x 32;
the third convolutional layer comprises 32 convolutional kernels with the size of 3 × 3, and 32 16 × 16 feature maps are output;
the fourth convolution layer comprises 64 convolution kernels with the size of 2 x 2 and outputs 64 characteristic maps of 16 x 16;
and the full connection layer connects all the characteristic graphs output by the fourth convolution layer and outputs 1 characteristic vector.
Further, the LSTM model in step S2 includes a plurality of sequentially connected LSTM units;
in the data flow of the LSTM model:
the output end of the previous LSTM unit outputs the word generated by the LSTM unit and inputs the word into the current LSTM unit; the word generated by the previous LSTM unit and the word generated by the current LSTM unit are subjected to vector dot multiplication to be used as the above information, and the above information is input into the next LSTM unit.
Furthermore, each LSTM unit comprises a forgetting gate, an input gate and an output gate which are connected in sequence;
the forgetting gate is used for determining the information which needs to be discarded by the LSTM unit;
the input gate is used for determining the quantity of information input into the current LSTM unit and updating the state and the information of the current LSTM unit;
the output gate is used for determining the state and the hidden layer state which need to be output by the current LSTM unit.
Further, the forgetting gate comprises a sigmoid function and a vector dot product;
In the data flow direction of the forgetting gate:
The hidden-layer state h_{t-1} output by the previous LSTM unit and the input x_t of the current LSTM unit are processed by a sigmoid function to obtain the information f_t in the current LSTM unit; f_t is then vector-dot-multiplied with the state C_{t-1} output by the previous LSTM unit, and the resulting information that needs to be discarded is input into the input gate.
Further, the input gate comprises a sigmoid function, a tanh function, a vector point multiplication and a vector accumulation;
In the data flow direction of the input gate:
The hidden-layer state h_{t-1} output by the previous LSTM unit and the input x_t of the current LSTM unit are processed by a sigmoid function to obtain the information i_t that needs to be updated in the current LSTM unit; h_{t-1} and x_t also produce the candidate state C̃_t of the current LSTM unit. i_t and C̃_t are vector-dot-multiplied, the result is vector-accumulated with the output of the forgetting gate, and the accumulated result is input into the output gate.
Further, the output gate comprises a sigmoid function, a tanh function and a vector dot product;
In the data flow direction of the output gate:
The hidden-layer state h_{t-1} output by the previous LSTM unit and the input x_t of the current LSTM unit are processed by a sigmoid function to obtain the output-gate activation; the accumulated result passed in from the input gate gives the state C_t that the current LSTM unit needs to output, and vector-dot-multiplying the sigmoid result with tanh(C_t) yields the hidden-layer state h_t that the current LSTM unit needs to output.
Further, in step S2 the optimizer used during LSTM model training is SGD;
during the LSTM model training process:
the initial learning rate is set to 5e-4 and is reduced to 0.8 times its previous value every 4 iterations; meanwhile, the batch size is set to 48 and the maximum number of iterations to 50.
The invention has the beneficial effects that:
according to the image feature method based on the improved LSTM, after the original image feature vector is obtained, the problem that the existing LSTM model does not conform to the context when a new word is generated is considered, and the LSTM model with the above information is constructed in the decoding stage to perform feature extraction, so that the accuracy of image feature extraction is improved.
Drawings
Fig. 1 is a flow chart of the image feature extraction method based on the improved LSTM provided by the present invention.
FIG. 2 is a diagram of the LSTM model structure in the present invention.
Fig. 3 is a diagram showing the structure of an LSTM unit in the present invention.
Fig. 4 is a model test picture in an embodiment provided by the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. For those skilled in the art, various changes that remain within the spirit and scope of the invention as defined by the appended claims are all protected, as is everything produced using the inventive concept.
As shown in fig. 1, an improved LSTM-based image feature extraction method includes the following steps:
s1, extracting feature vectors: inputting an original image into a convolutional neural network, and extracting a corresponding feature vector;
s2, acquiring image characteristics: and inputting the extracted feature vector into a trained LSTM model to obtain image features.
The size of the original image in the above-described step S1 is 128 × 128.
In biology, the receptive field is a property of the nervous system for vision, hearing, and other senses: the range of stimulation each unit receives is fixed. The convolutional neural network is modeled on this idea and has two characteristics, local connectivity and weight sharing. In general, a convolutional neural network performs two main tasks, feature extraction and feature mapping. The feature-extraction layer is responsible for extracting the features of the neurons connected to the previous layer; once the features are extracted, the mapping relation between neurons is determined, i.e., the local connection. The feature-mapping layer is a geometric plane on which all weights are equal, i.e., the weights are shared; a typical feature-mapping layer adopts the sigmoid activation function, which gives the mapping between features displacement invariance. Weight sharing gives the convolutional neural network fewer parameters and faster training.
Therefore, the convolutional neural network for extracting the original image features comprises a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer and a full-connection layer which are sequentially connected;
inputting a 128 x 128 original image into a first convolution layer;
the first convolution layer comprises 8 convolution kernels with the size of 5 multiplied by 5, and 8 feature maps with the size of 64 multiplied by 64 are output;
the second convolution layer comprises 16 convolution kernels with the size of 4 x 4 and outputs 16 characteristic maps of 32 x 32;
the third convolution layer comprises 32 convolution kernels with the size of 3 x 3 and outputs 32 characteristic maps with the size of 16 x 16;
the fourth convolution layer comprises 64 convolution kernels with the size of 2 x 2 and outputs 64 characteristic maps of 16 x 16;
the full-connection layer connects all the feature maps output by the fourth convolution layer and outputs 1 feature vector.
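As a quick sanity check on the layer sizes listed above, the following sketch tracks only the output shapes stated in the text (strides and padding are not specified in the patent, so they are deliberately left out) and computes the length of the flattened input to the fully connected layer:

```python
# Layer spec as stated in the patent: (name, out_channels, kernel_size, out_spatial).
# Only the declared output shapes are tracked; how each shape is realized
# (stride, padding, pooling) is not given in the text.
spec = [
    ("conv1", 8, 5, (64, 64)),
    ("conv2", 16, 4, (32, 32)),
    ("conv3", 32, 3, (16, 16)),
    ("conv4", 64, 2, (16, 16)),
]

def feature_vector_length(spec):
    """Length of the single vector the fully connected layer flattens to."""
    out_channels, (h, w) = spec[-1][1], spec[-1][3]
    return out_channels * h * w

print(feature_vector_length(spec))  # 64 * 16 * 16 = 16384
```

Since conv1 through conv3 each halve the spatial size while conv4 keeps 16 × 16, one plausible realization (an assumption, not stated in the patent) is stride 2 with suitable padding for the first three layers and stride 1 for the fourth.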
From the perspective of natural language, regardless of the language, including English and Chinese, each word in a sentence is affected to some degree by the words around it. Analysis of the LSTM reveals that, when the LSTM model produces a new word, the new word is affected at the sentence level almost only by the word immediately preceding it, while the other words have little effect on it, which does not reflect the full context.
Based on the contextual relations among the words of a sentence and the above defect of the LSTM in generating words, the invention improves the LSTM model structure: in the process of generating a new word, the influence of almost only the single preceding word is changed into the direct influence of all the words in the context. It should be noted that, since a sentence is generated from left to right, a new word has only preceding text and no following text, so the "context" in the improved LSTM model structure of the invention refers to the above-text information. The LSTM model provided by the invention, as shown in fig. 2, therefore includes a plurality of LSTM units connected in sequence;
in the data flow of the LSTM model:
the output end of the previous LSTM unit outputs the word generated by the LSTM unit and inputs the word into the current LSTM unit; the word generated by the previous LSTM unit and the word generated by the current LSTM unit are subjected to vector dot multiplication to be used as the above information, and the above information is input into the next LSTM unit.
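A minimal sketch of this data flow follows. The patent's "vector dot multiplication" of two word vectors is read here as an elementwise (Hadamard) product — an assumption, since the operation must again yield a vector that can be passed to the next LSTM unit as the above-text information:

```python
import numpy as np

def context_info(prev_word_vec, curr_word_vec):
    """Combine the previous unit's word with the current unit's word.

    Assumption: "vector dot multiplication" in the patent denotes an
    elementwise (Hadamard) product, so the result is itself a vector
    that the next LSTM unit can consume as the above-text information.
    """
    return prev_word_vec * curr_word_vec

# Toy word vectors (hypothetical values, purely illustrative):
w_prev = np.array([0.2, 0.5, 1.0])
w_curr = np.array([1.0, 0.4, 0.5])
print(context_info(w_prev, w_curr))  # [0.2 0.2 0.5]
```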
Specifically, as shown in fig. 3, each LSTM unit includes a forgetting gate, an input gate, and an output gate, which are connected in sequence;
the forgetting gate is used for determining the information which needs to be discarded by the LSTM unit;
the input gate is used for determining the quantity of information input into the current LSTM unit and updating the state and the information of the current LSTM unit;
the output gate is used for determining the state and hidden state required to be output by the current LSTM unit.
Wherein the forgetting gate comprises a sigmoid function and a vector dot product;
In the data flow direction of the forgetting gate:
The hidden-layer state h_{t-1} output by the previous LSTM unit and the input x_t of the current LSTM unit are processed by a sigmoid function to obtain the information f_t in the current LSTM unit; f_t is then vector-dot-multiplied with the state C_{t-1} output by the previous LSTM unit, and the resulting information that needs to be discarded is input into the input gate;
wherein f_t is:
f_t = σ(W_f * [h_{t-1}, x_t] + b_f)
where σ(·) is the sigmoid function;
W_f is the weight for the information f_t in the current LSTM unit;
b_f is the bias for the information f_t in the current LSTM unit.
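The forgetting-gate computation just described can be sketched as follows; the weight and bias values here are random illustrative placeholders, not parameters from the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f, C_prev):
    """f_t = sigmoid(W_f * [h_{t-1}, x_t] + b_f); returns f_t ⊙ C_{t-1}."""
    concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ concat + b_f)       # each entry in (0, 1)
    return f_t * C_prev                     # portion of the old state kept

# Illustrative dimensions and random parameters:
rng = np.random.default_rng(0)
hidden, inp = 4, 3
h_prev, x_t = rng.standard_normal(hidden), rng.standard_normal(inp)
W_f = rng.standard_normal((hidden, hidden + inp))
b_f = np.zeros(hidden)
C_prev = rng.standard_normal(hidden)
kept = forget_gate(h_prev, x_t, W_f, b_f, C_prev)
```

Because every entry of f_t lies strictly between 0 and 1, the gate can only shrink (never amplify) each component of the previous cell state.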
the input gate comprises a sigmoid function, a tanh function, a vector point multiplication and a vector accumulation;
In the data flow direction of the input gate:
The hidden-layer state h_{t-1} output by the previous LSTM unit and the input x_t of the current LSTM unit are processed by a sigmoid function to obtain the information i_t that needs to be updated in the current LSTM unit; h_{t-1} and x_t also produce the candidate state C̃_t of the current LSTM unit. i_t and C̃_t are vector-dot-multiplied, the result is vector-accumulated with the output of the forgetting gate, and the accumulated result is input into the output gate;
wherein the information i_t to be updated is:
i_t = σ(W_i * [h_{t-1}, x_t] + b_i)
where W_i is the weight of the information i_t to be updated;
b_i is the bias of the information i_t to be updated;
the candidate state C̃_t is:
C̃_t = tanh(W_C * [h_{t-1}, x_t] + b_C)
where W_C is the weight of C̃_t;
b_C is the bias of C̃_t.
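The input-gate computation can be sketched in the same style; again all parameter values are random placeholders, not values from the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate(h_prev, x_t, W_i, b_i, W_c, b_c):
    """i_t = sigmoid(W_i*[h,x]+b_i); C~_t = tanh(W_C*[h,x]+b_C); returns i_t ⊙ C~_t."""
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)      # how much of the candidate to admit
    c_tilde = np.tanh(W_c @ concat + b_c)  # candidate state, each entry in (-1, 1)
    return i_t * c_tilde                   # update to accumulate with the forget-gate output

# Illustrative dimensions and random parameters:
rng = np.random.default_rng(1)
hidden, inp = 4, 3
h_prev, x_t = rng.standard_normal(hidden), rng.standard_normal(inp)
W_i = rng.standard_normal((hidden, hidden + inp))
W_c = rng.standard_normal((hidden, hidden + inp))
b_i, b_c = np.zeros(hidden), np.zeros(hidden)
update = input_gate(h_prev, x_t, W_i, b_i, W_c, b_c)
```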
the output gate comprises a sigmoid function, a tanh function and a vector dot product;
In the data flow direction of the output gate:
The hidden-layer state h_{t-1} output by the previous LSTM unit and the input x_t of the current LSTM unit are processed by a sigmoid function to obtain the output-gate activation; the accumulated result passed in from the input gate gives the state C_t that the current LSTM unit needs to output, and vector-dot-multiplying the sigmoid result with tanh(C_t) yields the hidden-layer state h_t that the current LSTM unit needs to output;
wherein C_t is:
C_t = f_t * C_{t-1} + i_t * C̃_t
where C_{t-1} is the state of the LSTM unit before updating.
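Putting the three gates together, one full LSTM step under the textbook equations that the description above follows can be sketched as below (all parameter values are illustrative, not from the patent):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, C_prev, x_t, W, b):
    """One LSTM step: forgetting gate, input gate, then output gate.

    W and b are dicts keyed by gate name ("f", "i", "C", "o"); the
    parameter values used below are random placeholders.
    """
    concat = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ concat + b["f"])      # forgetting gate
    i_t = sigmoid(W["i"] @ concat + b["i"])      # input gate
    c_tilde = np.tanh(W["C"] @ concat + b["C"])  # candidate state
    C_t = f_t * C_prev + i_t * c_tilde           # new cell state
    o_t = sigmoid(W["o"] @ concat + b["o"])      # output gate
    h_t = o_t * np.tanh(C_t)                     # new hidden state
    return h_t, C_t

# Illustrative dimensions and random parameters:
rng = np.random.default_rng(2)
hidden, inp = 4, 3
W = {k: rng.standard_normal((hidden, hidden + inp)) for k in "fiCo"}
b = {k: np.zeros(hidden) for k in "fiCo"}
h0, C0 = np.zeros(hidden), np.zeros(hidden)
h1, C1 = lstm_step(h0, C0, rng.standard_normal(inp), W, b)
```

Since h_t = o_t ⊙ tanh(C_t) with both factors bounded by 1 in magnitude, every component of the hidden state stays inside (-1, 1).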
In step S2, the optimizer used during LSTM model training is SGD;
during the LSTM model training process:
setting the initial learning rate to be 5e-4, gradually reducing the learning rate in order to relieve the oscillation phenomenon and fall into the local minimum value in the training process, wherein the learning rate is reduced to 0.8 time of the original learning rate every 4 times of iteration; and meanwhile, setting the batch size to be 48, setting the maximum iteration number to be 50, and after the training is finished, taking the LSTM model with the highest BLEU score as the final model for outputting the image characteristics. And, in order to shorten the training time and accelerate the convergence, a batch normalization operation is added.
In one embodiment of the present invention, an experimental procedure for image feature extraction by the method of the present invention is provided:
(1) selecting an image data set;
at present, the commonly used classic data sets for image English description are MSCOCO, Flickr8k, Flickr30k and the like, and the data sets for image Chinese description are AI-Changler, Flickr8k-CN and the like. Because Chinese is more complex than English in the aspects of grammar, semantics and the like, the difficulty of image description based on Chinese is higher, and therefore the invention adopts image English description. In the experiment, an MSCOCO-2015 data set is selected as experimental data, a training set comprises about 16 ten thousand pictures, a test set and a verification set respectively comprise about 8 ten thousand pictures, and each picture is provided with 5 different manually marked English description sentences. In the experiment, a training set, a test set and a verification set are constructed according to a ratio of 8:1:1, wherein 80000 pictures are in the training set, 10000 pictures are in the test set, and 10000 pictures are in the verification set.
(2) Image data pre-processing
The data need to be preprocessed before training with the MSCOCO-2015 data set. First, the manually annotated description sentences are case-folded, converting capital letters to lowercase, to ease uniform processing of the data. Second, punctuation marks in the description sentences contribute little to model training and can even have a negative influence, so all punctuation marks are removed. Since the lengths of the description sentences vary, the maximum length of the word sequence is set to 15 after a statistical analysis of sentence lengths. In constructing the vocabulary, the threshold on word frequency is set to 8: words occurring more often than the threshold are added to the vocabulary, and words occurring less often are replaced with the placeholder token <UNK> used in natural language processing. After the vocabulary is built, words are represented as vectors using the common one-hot encoding.
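A sketch of this preprocessing pipeline under the stated settings follows. The patent does not say how words occurring exactly as often as the threshold are handled, so they are mapped to <UNK> here (an assumption):

```python
import re
from collections import Counter

UNK = "<UNK>"

def preprocess(captions, min_count=8, max_len=15):
    """Lowercase, strip punctuation, truncate to max_len, build the vocabulary.

    Words whose frequency exceeds min_count enter the vocabulary; all
    other words are replaced with the <UNK> token. Words occurring
    exactly min_count times become <UNK> (an assumption; the patent
    does not specify this boundary case).
    """
    tokenized = []
    for cap in captions:
        cap = re.sub(r"[^\w\s]", "", cap.lower())  # case-fold, drop punctuation
        tokenized.append(cap.split()[:max_len])    # cap the sequence length at 15
    counts = Counter(w for toks in tokenized for w in toks)
    vocab = {w for w, c in counts.items() if c > min_count}
    cleaned = [[w if w in vocab else UNK for w in toks] for toks in tokenized]
    return vocab, cleaned
```

One-hot encoding would then assign each vocabulary word (plus <UNK>) a distinct basis vector.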
(3) Constructing a convolution neural network with a five-layer structure, and extracting a characteristic vector in an image data set through the convolution neural network;
(4) and inputting the extracted feature vector into an LSTM model, and extracting image features.
Experimental results and analysis:
In the experiment, on the Microsoft MSCOCO-2015 data set, the proposed model and method are compared with several current image description models, such as NIC, ATT-FCN, Soft-Attention, and MSM; Table 1 shows the comparison.
Table 1: evaluation results of different image description models on MSCOCO data set
[Table 1 is provided as an image in the original document.]
In Table 1, "--" indicates that the corresponding test was not performed. B-1, B-2, B-3, B-4, and METEOR are the evaluation metrics; B-1 through B-4 denote the unigram, bigram, trigram, and 4-gram matching modes of the BLEU metric, and bold indicates the best score under that metric. As Table 1 shows, compared with the other 4 models, the model of the invention performs best under B-3 and performs well under the other 4 metrics. The MSM model outperforms NIC, Soft-Attention, and ATT-FCN under all 5 metrics, so MSM handles image description very well. The model of the invention outperforms the MSM model under B-3; under METEOR it outperforms Soft-Attention and ATT-FCN and is slightly below MSM. Moreover, under B-1, B-2, B-3, and B-4 the model of the invention outperforms the earliest proposed NIC model.
In the experiment of the invention, different iteration cycle epoch training models are tried in order to obtain better experiment effect, and table 2 shows the effect of different iteration cycles under corresponding evaluation methods.
Table 2: scoring of different iteration cycles under corresponding evaluation method
[Table 2 is provided as an image in the original document.]
As Table 2 shows, the iteration period is closely related to model performance: Bleu-1, Bleu-3, and METEOR all achieve their best results with an iteration period of 45, while Bleu-2 and Bleu-4 reach their optima at 40 and 50, respectively. Evidently the iteration period should be neither as large nor as small as possible but set to an appropriate value; the experiments in the invention show that the model performs best when the iteration period is set to 45.
Finally, in order to show the accuracy and descriptive quality of the model, after training is completed, 4 test pictures are selected for a comparison test between the model and the Muti-Head model, which is likewise an improvement based on Soft-Attention applied to the image description task; the test pictures are shown in FIG. 4.
For the test picture shown in fig. 4, the description sentences obtained by the model of the present invention are shown in table 4:
table 3: model description effects of the invention
[Table 3 is provided as an image in the original document.]
For the test pictures shown in FIG. 4, the description sentences obtained by the Muti-Head model are shown in Table 4:
table 4: Muti-Head model describes effects
[Table 4 is provided as an image in the original document.]
As Table 3 shows, the descriptions produced by the model of the invention express the image information substantially accurately, largely conform to English grammar, use mostly accurate phrase structure, and connect phrases organically and naturally. Comparison with the Muti-Head model shows that the grammar of the model of the invention is more reasonable; for example, in the Muti-Head description sentences for FIG. 4(b) and FIG. 4(c), the predicate "are" is absent. In recognizing picture boundaries the model of the invention also performs better: the Muti-Head model identifies the ground in figures (a), (b), and (c) as "field", while the model of the invention accurately identifies "ground", "court", and "grass" respectively. In recognizing target entities, both models perform well and accurately identify the target entities in the pictures.
In general, the model and method provided by the invention perform well on the image description task, in some respects even better than other image description models; they can describe the information contained in an image substantially accurately, and although some problems remain, the generated natural language generally expresses the meaning of the image.
The invention has the beneficial effects that:
according to the image feature method based on the improved LSTM, after the original image feature vector is obtained, the problem that the existing LSTM model does not conform to the context when a new word is generated is considered, and the LSTM model with the above information is constructed in the decoding stage to perform feature extraction, so that the accuracy of image feature extraction is improved.

Claims (9)

1. An image feature extraction method based on improved LSTM is characterized by comprising the following steps:
s1, extracting feature vectors: inputting an original image into a convolutional neural network, and extracting a corresponding feature vector;
s2, acquiring image characteristics: and inputting the extracted feature vector into a trained LSTM model to obtain image features.
2. The improved LSTM-based image feature extraction method of claim 1, wherein the size of the original image in step S1 is 128 x 128;
the convolutional neural network has a 5-layer network structure;
the number of feature vectors extracted by the convolutional neural network is 1.
3. The improved LSTM-based image feature extraction method of claim 1, wherein the convolutional neural network in step S1 includes a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer and a fully-connected layer which are connected in sequence;
the first convolution layer inputs 128 x 128 original images;
the first convolution layer comprises 8 convolution kernels with the size of 5 x 5 and outputs 8 feature maps with the size of 64 x 64;
the second convolution layer comprises 16 convolution kernels with the size of 4 x 4 and outputs 16 characteristic maps of 32 x 32;
the third convolutional layer comprises 32 convolutional kernels with the size of 3 × 3, and 32 16 × 16 feature maps are output;
the fourth convolution layer comprises 64 convolution kernels with the size of 2 x 2 and outputs 64 characteristic maps of 16 x 16;
and the full connection layer connects all the characteristic graphs output by the fourth convolution layer and outputs 1 characteristic vector.
4. The improved LSTM-based image feature extraction method of claim 1, wherein the LSTM model in step S2 comprises a plurality of sequentially connected LSTM units;
in the data flow of the LSTM model:
the output end of the previous LSTM unit outputs the word generated by the LSTM unit and inputs the word into the current LSTM unit; the word generated by the previous LSTM unit and the word generated by the current LSTM unit are subjected to vector dot multiplication to be used as the above information, and the above information is input into the next LSTM unit.
5. The improved LSTM-based image feature extraction method of claim 4, wherein each LSTM unit comprises a forgetting gate, an input gate and an output gate connected in sequence;
the forgetting gate is used for determining the information that the current LSTM unit needs to discard;
the input gate is used for determining the amount of information input into the current LSTM unit and for updating the state and information of the current LSTM unit;
the output gate is used for determining the state and the hidden-layer state that the current LSTM unit needs to output.
6. The improved LSTM-based image feature extraction method of claim 5, wherein said forgetting gate comprises a sigmoid function and a vector point multiplication;
in the data flow direction of the forgetting gate:
the hidden-layer state h_{t-1} output by the previous LSTM unit and the input x_t of the current LSTM unit are processed by the sigmoid function to obtain the forgetting-gate information f_t of the current LSTM unit; f_t and the state C_{t-1} output by the previous LSTM unit are subjected to vector dot multiplication to obtain the state f_t ⊙ C_{t-1} remaining after unneeded information is discarded, which is input into the input gate.
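A pure-Python sketch of this forgetting-gate data flow; the weight matrix W_f and bias b_f are hypothetical parameters not given in the claim, and vectors are plain Python lists:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forgetting_gate(h_prev, x_t, W_f, b_f, C_prev):
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    # f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f), one activation per state unit
    f_t = [sigmoid(sum(w * v for w, v in zip(row, z)) + b)
           for row, b in zip(W_f, b_f)]
    # f_t (dot) C_{t-1}: entries of f_t near 0 discard the corresponding state
    return [f * c for f, c in zip(f_t, C_prev)]
```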
7. The improved LSTM-based image feature extraction method of claim 5, wherein said input gate comprises a sigmoid function, tanh function, vector point multiplication and vector accumulation;
in the data flow direction of the input gate:
the hidden-layer state h_{t-1} output by the previous LSTM unit and the input x_t of the current LSTM unit are processed by the sigmoid function to obtain the information i_t that needs to be updated in the current LSTM unit; h_{t-1} and x_t are also processed by the tanh function to obtain the candidate state C̃_t of the current LSTM unit;
i_t and C̃_t are subjected to vector dot multiplication, the result is accumulated with the output of the forgetting gate to give the updated state C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t, and the accumulated result is input into the output gate.
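A corresponding sketch of the input-gate data flow; the parameters W_i, b_i (update gate) and W_c, b_c (candidate state) are hypothetical, and `forgotten` stands for the forgetting gate's output f_t ⊙ C_{t-1}:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(W, z, b):
    # Affine map W . z + b, with W as a list of rows
    return [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]

def input_gate(h_prev, x_t, W_i, b_i, W_c, b_c, forgotten):
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    i_t = [sigmoid(a) for a in linear(W_i, z, b_i)]        # how much to update
    c_tilde = [math.tanh(a) for a in linear(W_c, z, b_c)]  # candidate state
    # C_t = f_t (dot) C_{t-1} + i_t (dot) candidate state
    return [fc + i * c for fc, i, c in zip(forgotten, i_t, c_tilde)]
```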
8. The improved LSTM-based image feature extraction method of claim 5, wherein said output gate comprises a sigmoid function, tanh function and vector dot product;
in the data flow direction of the output gate:
the hidden-layer state h_{t-1} output by the previous LSTM unit and the input x_t of the current LSTM unit are processed by the sigmoid function to obtain the output-gate information o_t; the state C_t produced by the input gate is the state that the current LSTM unit needs to output, and C_t processed by the tanh function is subjected to vector dot multiplication with o_t to obtain the hidden-layer state h_t = o_t ⊙ tanh(C_t) that the current LSTM unit needs to output.
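A sketch of the output-gate data flow under the same assumptions (hypothetical parameters W_o, b_o; C_t is the state produced by the input gate):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def output_gate(h_prev, x_t, W_o, b_o, C_t):
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    # o_t = sigmoid(W_o . [h_{t-1}, x_t] + b_o)
    o_t = [sigmoid(sum(w * v for w, v in zip(row, z)) + b)
           for row, b in zip(W_o, b_o)]
    # h_t = o_t (dot) tanh(C_t); both h_t and C_t flow on to the next LSTM unit
    h_t = [o * math.tanh(c) for o, c in zip(o_t, C_t)]
    return h_t, C_t
```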
9. The improved LSTM-based image feature extraction method according to claim 1, wherein in step S2 the optimizer used for training the LSTM model is SGD (stochastic gradient descent);
during the LSTM model training process:
the initial learning rate is set to 5e-4 and is reduced to 0.8 times its current value every 4 iterations; the batch size is set to 48 and the maximum number of iterations is set to 50.
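The stated step-decay schedule can be expressed in closed form; the function below is illustrative only, not part of the claimed method:

```python
def learning_rate(iteration, base_lr=5e-4, decay=0.8, step=4):
    # Every `step` iterations the rate shrinks to `decay` times its current
    # value, so after n iterations it equals base_lr * decay ** (n // step).
    return base_lr * decay ** (iteration // step)
```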
CN201910889843.4A 2019-09-20 2019-09-20 Image feature extraction method based on improved LSTM Withdrawn CN110633713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910889843.4A CN110633713A (en) 2019-09-20 2019-09-20 Image feature extraction method based on improved LSTM

Publications (1)

Publication Number Publication Date
CN110633713A true CN110633713A (en) 2019-12-31

Family

ID=68971760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910889843.4A Withdrawn CN110633713A (en) 2019-09-20 2019-09-20 Image feature extraction method based on improved LSTM

Country Status (1)

Country Link
CN (1) CN110633713A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844743A * 2017-09-28 2018-03-27 Zhejiang Gongshang University Automatic image multi-caption generation method based on a multi-scale hierarchical residual network
CN108009493A * 2017-11-30 2018-05-08 University of Electronic Science and Technology of China Face anti-spoofing recognition method based on motion enhancement
CN108805080A * 2018-06-12 2018-11-13 Shanghai Jiao Tong University Context-based multi-level deep recurrent network method for group behavior recognition
CN108921796A * 2018-06-07 2018-11-30 Xidian University Infrared image non-uniformity correction method based on deep learning
CN109190472A * 2018-07-28 2019-01-11 Tianjin University Pedestrian attribute recognition method based on joint image and attribute guidance
CN109670164A * 2018-04-11 2019-04-23 东莞迪赛软件技术有限公司 Health-related public opinion analysis method based on deep multi-word-embedding Bi-LSTM residual networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO Yinong et al., "Research on recognition algorithm for workers' abnormal behavior based on two-stream convolutional networks", Journal of University of Science and Technology Liaoning *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20191231