CN108960330B - Remote sensing image semantic generation method based on fast regional convolutional neural network

Remote sensing image semantic generation method based on fast regional convolutional neural network

Info

Publication number
CN108960330B
CN108960330B (application CN201810744473.0A)
Authority
CN
China
Prior art keywords
image
neural network
text
network
remote sensing
Prior art date
Legal status
Active
Application number
CN201810744473.0A
Other languages
Chinese (zh)
Other versions
CN108960330A (en)
Inventor
张向荣
李翔
朱鹏
焦李成
唐旭
侯彪
马晶晶
马文萍
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN201810744473.0A
Publication of CN108960330A
Application granted
Publication of CN108960330B

Classifications

    • G06V 10/757: Image or video recognition or understanding; matching configurations of points or features
    • G06N 3/04: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods

Abstract

The invention provides a remote sensing image semantic generation method based on a fast regional convolutional neural network, which mainly solves the problem that the prior art can obtain neither the relations between targets in an image nor the relation between a target and the whole image. The implementation scheme is as follows: constructing a training sample set and a test sample set; extracting image features of the high-resolution remote sensing images with a fast regional convolutional neural network; extracting text features of the corresponding sentences with a bidirectional recurrent neural network; matching the image features with the text features using a probability-based image-text matching model; and training a long short-term memory (LSTM) network with the matched image-text features, thereby realizing semantic generation for high-resolution remote sensing images. The method fully considers the complex backgrounds and diverse targets of remote sensing images, improves the semantic generation results for remote sensing images, and can be used for image retrieval or scene classification.

Description

Remote sensing image semantic generation method based on fast regional convolutional neural network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image semantic generation method which can be used for automatically describing the content of a remote sensing image.
Background
Content understanding and description of remote sensing images can provide decision support for remote sensing applications and has wide practical value. For example, in the field of military reconnaissance, existing algorithms can rapidly identify important military targets such as ports, airports and ships from remote sensing images. Content understanding and description can accurately and comprehensively interpret wide-swath battlefield images, enabling real-time interpretation of the battlefield geographic environment, dynamic information generation, and so on. On the civil side, understanding and describing remote sensing image content can provide easily understood information such as disaster assessment, vegetation coverage, crop growth conditions and urban change.
Most studies on understanding and describing image content address natural images with the tools of natural language processing. Much image description work can be viewed as a retrieval process: image features and text features are first unified into a common vector space, and then text features are retrieved from image features, or image features are retrieved from text features. Another approach builds a training database of images and their textual descriptions, learns the correspondence between images and text, and uses it to generate text for new images. All of these methods operate in a supervised framework, and the generated text is a description of the image content. For example, the Stanford group led by Fei-Fei Li used a deep neural network model to infer the correspondence between sentence fragments and image regions, and then used it to build a generative language description model. In computer vision, the description of images and videos has attracted intense interest; Socher et al. and Wang et al. studied the associations between images and the words that describe the objects in them.
Owing to adverse factors such as complicated and diverse targets and the difficulty of sample labeling, content understanding and description of remote sensing images has developed more slowly than that of natural images. Existing results mostly focus on semantic extraction and retrieval of remote sensing images. For example, Professor Zhang et al. proposed a remote sensing image retrieval model based on semantic mining that retrieves remote sensing images with natural language, and Yang Junli et al. of Beihang University (Beijing University of Aeronautics and Astronautics) proposed modeling the low-level features and context information of remote sensing images through a CRF framework. Content understanding and description of remote sensing images is usually obtained by statistical learning: semantic information implicit in the image is extracted first, and the image is then analyzed further according to the correspondence between low-level features and semantic features. Such methods can obtain shallow semantic information that assists recognition, but they do not go deep enough; they stop at the stage of target localization and recognition, cannot obtain the relations between targets in the image or the overall relation between targets and the image, and thus limit the accuracy of subsequent tasks such as image detection and scene classification.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a remote sensing image semantic generation method based on a fast regional convolutional neural network, so that the visual information in high-resolution remote sensing images is fully exploited and, combined with the text information of the description sentences, the relations between targets in the image and between each target and the whole image are obtained, improving the accuracy of tasks such as image retrieval and scene classification.
To achieve the above purpose, the invention comprises the following implementation steps:
(1) taking 60% of image-text pairs in the remote sensing data set as training samples, and taking the rest 40% of image-text pairs as test samples;
(2) extracting image features of remote sensing images in training samples by using a fast regional convolutional neural network;
(3) extracting text features of the texts corresponding to the remote sensing images in the training samples by using a bidirectional recurrent neural network;
(4) matching the image characteristics obtained in the step (2) with the text characteristics obtained in the step (3) by using a probability model-based image-text matching method to obtain matched image-text characteristics;
(5) training a long short-term memory (LSTM) network with the matched image-text features from step (4);
(6) extracting the image features of the remote sensing images in the test samples with the fast regional convolutional neural network, and inputting them into the LSTM network trained in step (5) for semantic generation, obtaining sentences that describe the image content.
Compared with the prior art, the invention has the following advantages:
Firstly, by adopting the fast regional convolutional neural network, the invention can more accurately obtain visual features suitable for describing high-resolution remote sensing images.
Secondly, by adopting probability-model-based matching between image features and text features, the invention better constructs the correspondence from image features to text features.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a block diagram of a fast regional convolutional neural network used in the present invention to extract image features;
FIG. 3 is a block diagram of a regional candidate network in the fast regional convolutional neural network of the present invention;
FIG. 4 is a diagram of a feature transformation module in a regional candidate network in accordance with the present invention;
FIG. 5 is a block diagram of a bi-directional recurrent neural network used in the present invention to extract text features;
FIG. 6 is a diagram of the long short-term memory network used for text prediction in the present invention;
FIG. 7 is a graph comparing evaluation results of the present invention under different evaluation indexes.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
Referring to FIG. 1, the implementation steps of the present invention are as follows.
Step 1, constructing a training sample set and a testing sample set.
Three remote sensing image semantic generation data sets, the UCM-Captions Data Set, the Sydney-Captions Data Set and RSICD, are downloaded from the website of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing of Wuhan University; 60% of the image-text pairs in each data set are taken as training samples and the remaining 40% as test samples.
Step 2, extracting the image features of the remote sensing images in the training samples with the fast regional convolutional neural network:
The structure of the fast regional convolutional neural network is shown in FIG. 2; it comprises a regional candidate network and a three-layer convolutional neural network, where:
the structure of the regional candidate network is shown in FIG. 3; it comprises a feature transformation module and a region suggestion module, the feature transformation module adopts VGG-16 and its structure is shown in FIG. 4, and the region suggestion module has two branches: a classification branch and a regression branch;
the three-layer convolutional neural network consists of two convolutional layers and a fully connected layer, the convolutional layers have 256 nodes and the fully connected layer has 4096 nodes.
The specific implementation of this step is as follows:
(2a) screening out candidate regions containing targets with the regional candidate network in the fast regional convolutional neural network:
converting the original picture into a feature map through the feature transformation module; sliding a 3x3 window over the feature map with stride 1 through the region suggestion module, and generating k candidate regions of different sizes centred at each window position, where k is 9 in this example;
converting the k candidate regions into 256-dimensional vectors and inputting them into the region suggestion module; the classification branch of the module outputs 2k indicating variables for the k candidate regions, where 1 represents a target and 0 a non-target; the regression branch outputs 4k coordinate values, i.e., four bounding-box coordinates per candidate region; the candidate regions containing targets are then generated from the indicating variables and the coordinates;
(2b) extracting the image features of the candidate regions obtained in (2a) with the three-layer convolutional neural network of the fast regional convolutional neural network:
according to the pixels I_i in the i-th candidate region and using the three-layer convolutional neural network, the region image feature is represented as:
v_i = W_m CNN_θc(I_i) + b_m,
where θ_c denotes the network parameters of the regional candidate network, W_m and b_m denote the weights and biases of the three-layer convolutional neural network respectively, the dimension of W_m is h × 4096, and h denotes the dimension of the embedding space; CNN_θc(I_i) is the 4096-dimensional fully-connected-layer vector output by the regional candidate network when I_i is passed through it with parameters θ_c; v_i denotes the image feature extracted from the i-th candidate region, covering the whole image and the first 19 detected locations;
(2c) merging the image features v_i extracted from all candidate regions to obtain the image feature v of the picture, whose dimensionality equals the number of screened candidate regions.
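For illustration only, the logic of steps (2a) to (2c) can be sketched in Python/NumPy as follows: anchors are enumerated at each sliding-window position and kept when the classification branch marks them as targets, and each kept region's 4096-dimensional fully-connected-layer vector is projected into the h-dimensional embedding space by v_i = W_m CNN_θc(I_i) + b_m and stacked into the picture feature v. The anchor scales, the score threshold, and the random arrays standing in for the network outputs are assumptions of this sketch, not part of the claimed method.

```python
import numpy as np

def generate_anchors(center, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return k = 9 candidate boxes (x1, y1, x2, y2) of different sizes
    centred on one sliding-window position (step 2a)."""
    cx, cy = center
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

def propose_regions(cls_scores, box_deltas, stride=16, thresh=0.5):
    """Keep anchors whose classification score (the 'indicating variable')
    exceeds thresh and refine them with the regression branch outputs.
    cls_scores: (H, W, 9) and box_deltas: (H, W, 9, 4) are stand-ins for
    the two branches of the region suggestion module."""
    H, W, k = cls_scores.shape
    proposals = []
    for y in range(H):
        for x in range(W):
            anchors = generate_anchors(((x + 0.5) * stride, (y + 0.5) * stride))
            for a in range(k):
                if cls_scores[y, x, a] > thresh:
                    proposals.append(anchors[a] + box_deltas[y, x, a])
    return np.array(proposals)

def picture_feature(cnn_outputs, W_m, b_m):
    """Steps (2b)-(2c): v_i = W_m . CNN(I_i) + b_m for each kept region,
    stacked into the picture-level feature v of shape (num_regions, h)."""
    return np.stack([W_m @ c + b_m for c in cnn_outputs])

# toy usage with random stand-ins for the network outputs
rng = np.random.default_rng(0)
regions = propose_regions(rng.random((4, 4, 9)), rng.standard_normal((4, 4, 9, 4)))
h = 512
W_m, b_m = rng.standard_normal((h, 4096)) * 0.01, np.zeros(h)
cnn_outputs = [rng.standard_normal(4096) for _ in range(len(regions))]
v = picture_feature(cnn_outputs, W_m, b_m)
print(regions.shape, v.shape)   # (num_kept, 4) (num_kept, 512)
```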
Step 3, extracting the text features of the sentences in the training samples with the bidirectional recurrent neural network.
The structure of the bidirectional recurrent neural network is shown in FIG. 5; the numbers of input and output nodes are both set to N, and the numbers of forward and backward hidden units are both set to h.
The specific implementation of this step is as follows:
(3a) inputting a word sequence of length N and converting the t-th word into an h-dimensional vector:
x_t = W_n Φ_t,
where Φ_t is an indicator column vector giving the position index of the word in the vocabulary, t denotes the position of each word in the sentence and takes values from 1 to N, and W_n is the embedding matrix of the given word, which is randomly initialized;
(3b) randomly initializing the forward weight W_r, forward bias b_r, backward weight W_l and backward bias b_l of the bidirectional recurrent neural network, and iteratively computing the forward hidden unit output h_t^f and the backward hidden unit output h_t^b of the network:
h_t^f = f(x_t + W_r h_{t-1}^f + b_r),
h_t^b = f(x_t + W_l h_{t+1}^b + b_l),
where h_{t-1}^f is the (t-1)-th forward hidden unit output of the bidirectional recurrent neural network, x_t is the h-dimensional vector of the t-th word from (3a), t increases from 1 to N, and h_0^f = 0; h_{t+1}^b is the (t+1)-th backward hidden unit output, t decreases from N to 1, and h_{N+1}^b = 0; f is the rectified linear unit activation function ReLU, expressed as f(x) = max(0, x);
(3c) randomly initializing the output layer weight W_d and bias b_d of the network, and iteratively computing the network output s_t:
s_t = f(W_d (h_t^f + h_t^b) + b_d),
where h_t^f and h_t^b are the forward and backward hidden unit outputs of the bidirectional recurrent neural network from (3b);
(3d) calculating the cross-entropy loss function L(θ):
L(θ) = - Σ_{t=1}^{N} Φ_t^T log(softmax(s_t)),
where Φ_t is the indicator column vector of the desired output, i.e., the t-th word, giving the index of the word's position in the vocabulary;
(3e) optimizing L(θ) with the stochastic gradient descent algorithm to obtain the trained bidirectional recurrent neural network, whose output s_t is the text feature of the t-th word, containing the word's position and its context information.
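A minimal NumPy sketch of the recurrences of (3a) to (3c) is given below, assuming the equation forms reconstructed above; the cross-entropy training of (3d) and (3e) is omitted and all parameters are simply randomly initialized, so the sketch illustrates the data flow rather than a trained model.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def birnn_text_features(word_ids, vocab_size, h, seed=0):
    """Compute s_t = f(W_d (h_t^f + h_t^b) + b_d) for a sentence given as a
    list of vocabulary indices; parameters follow steps (3a)-(3c)."""
    rng = np.random.default_rng(seed)
    N = len(word_ids)
    W_n = rng.standard_normal((h, vocab_size)) * 0.01   # word embedding matrix
    W_r = rng.standard_normal((h, h)) * 0.01            # forward weight
    W_l = rng.standard_normal((h, h)) * 0.01            # backward weight
    W_d = rng.standard_normal((h, h)) * 0.01            # output weight
    b_r, b_l, b_d = np.zeros(h), np.zeros(h), np.zeros(h)

    # (3a) x_t = W_n Phi_t, i.e. select the embedding column of the t-th word
    x = [W_n[:, w] for w in word_ids]

    # (3b) forward pass t = 1..N and backward pass t = N..1
    hf = [np.zeros(h)] * (N + 1)          # hf[0] plays the role of h_0^f = 0
    for t in range(1, N + 1):
        hf[t] = relu(x[t - 1] + W_r @ hf[t - 1] + b_r)
    hb = [np.zeros(h)] * (N + 2)          # hb[N+1] plays the role of h_{N+1}^b = 0
    for t in range(N, 0, -1):
        hb[t] = relu(x[t - 1] + W_l @ hb[t + 1] + b_l)

    # (3c) the text feature of each word combines both directions
    return [relu(W_d @ (hf[t] + hb[t]) + b_d) for t in range(1, N + 1)]

# toy usage: a 5-word sentence over a 1000-word vocabulary
s = birnn_text_features([3, 17, 254, 9, 88], vocab_size=1000, h=512)
print(len(s), s[0].shape)   # 5 (512,)
```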
Step 4, matching the image features and text features extracted from the training samples with the probability-based image-text matching model.
(4a) representing the image features of each remote sensing image and the text features of its corresponding description sentence as vectors, and uniformly transforming both into the h-dimensional space; using the dot product v_i^T s_t to represent the similarity between the i-th region image feature v_i and the text feature s_t of the t-th word, and calculating the matching degree S_pq between the p-th image and the q-th sentence:
S_pq = Σ_{t∈g_q} max_{i∈g_p} (v_i^T s_t),
where g_p denotes the set of image block regions in the p-th image and g_q denotes the set of words in the q-th sentence;
(4b) for one picture of the training set, traversing all sentences, calculating all matching degrees of the picture in sequence, and selecting the q-th sentence corresponding to the maximum matching degree S_pq as the sentence index matched with the image;
(4c) traversing all the images of the training set and repeating step (4b) to obtain the sentence index matched with each image;
(4d) according to the matched sentence indexes and their corresponding image indexes, retrieving for each picture the corresponding image feature v from (2) and the corresponding text features s_1, …, s_t, …, s_N from (3) to finish the matching process, where s_t is the text feature of the t-th word in the sentence, t ranges from 1 to N, and N is the number of words in the sentence.
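The matching degree of (4a) and the traversal of (4b) and (4c) reduce to a dot-product table, a per-word maximum over regions, a sum, and an arg-max over sentences. The following NumPy sketch illustrates this with random feature matrices; the feature dimensions and the data are placeholders for illustration only.

```python
import numpy as np

def matching_degree(v, s):
    """S_pq = sum over words t of max over regions i of (v_i . s_t),
    where v is (num_regions, h) and s is (num_words, h)."""
    return float(np.sum(np.max(s @ v.T, axis=1)))

def match_images_to_sentences(image_feats, sentence_feats):
    """For each image p, traverse all sentences q and keep the index of the
    sentence with the largest matching degree S_pq (steps (4b)-(4c))."""
    matches = []
    for v in image_feats:
        scores = [matching_degree(v, s) for s in sentence_feats]
        matches.append(int(np.argmax(scores)))
    return matches

# toy usage: 3 images with 20 region features each, 3 sentences of varying length
rng = np.random.default_rng(1)
imgs = [rng.standard_normal((20, 512)) for _ in range(3)]
sents = [rng.standard_normal((n, 512)) for n in (7, 9, 5)]
print(match_images_to_sentences(imgs, sents))   # e.g. [2, 0, 1]
```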
Step 5, training a long short-term memory network for text prediction:
The structure of the long short-term memory network is shown in FIG. 6; the numbers of network input nodes and output nodes are both set to N, and the number of hidden units is set to h.
The specific implementation of this step is as follows:
(5a) randomly initializing the input layer weight W_hx, hidden layer weight W_hh, output layer weight W_oh, hidden layer bias b_h and output layer bias b_o of the long short-term memory network; inputting the image feature v, the corresponding text features s_1, …, s_t, …, s_N and the original sentence, and iteratively computing the t-th hidden unit output h_t and the output layer output y_t of the network:
h_t = f(W_hx s_t + W_hh h_{t-1} + b_h + Θ(t=1)⊙v),
y_t = softmax(W_oh h_t + b_o),
where the image feature vector v from (2) is input only when t = 1, h_{t-1} is the (t-1)-th hidden unit output of the network with h_0 = 0, the hidden unit size is set to 512, and the output vector y_t represents the probability output of a word;
(5b) calculating the cross-entropy loss L(θ) of y_t:
L(θ) = - Σ_{t=1}^{N} Φ_t^T log(y_t),
where Φ_t is the indicator column vector of the desired output, i.e., the t-th word of the original sentence corresponding to the image, giving the position index of the word in the vocabulary;
(5c) optimizing the cross-entropy loss L(θ) with the stochastic gradient descent algorithm to obtain the trained long short-term memory network.
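The following sketch follows the recurrence exactly as written in (5a) and (5b), that is, a single recurrent layer with the image feature injected only at t = 1 and a softmax output; the gating equations of a full LSTM cell and the stochastic gradient descent update of (5c) are omitted, and all parameters and inputs are random stand-ins assumed for the sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def caption_loss(v, s_seq, target_ids, params):
    """Forward pass of h_t = f(W_hx s_t + W_hh h_{t-1} + b_h + [t=1] v) and
    y_t = softmax(W_oh h_t + b_o), followed by the cross-entropy loss against
    the word indices of the original sentence (steps (5a)-(5b))."""
    W_hx, W_hh, W_oh, b_h, b_o = params
    h_prev = np.zeros(b_h.shape)          # h_0 = 0
    loss = 0.0
    for t, (s_t, target) in enumerate(zip(s_seq, target_ids), start=1):
        inject = v if t == 1 else 0.0     # image feature enters only at t = 1
        h_prev = np.maximum(0.0, W_hx @ s_t + W_hh @ h_prev + b_h + inject)
        y_t = softmax(W_oh @ h_prev + b_o)
        loss -= np.log(y_t[target] + 1e-12)   # -Phi_t^T log(y_t)
    return loss

# toy usage: hidden size 512, vocabulary of 1000 words, a 5-word sentence
rng = np.random.default_rng(2)
h, V = 512, 1000
params = (rng.standard_normal((h, h)) * 0.01, rng.standard_normal((h, h)) * 0.01,
          rng.standard_normal((V, h)) * 0.01, np.zeros(h), np.zeros(V))
v = rng.standard_normal(h)
s_seq = [rng.standard_normal(h) for _ in range(5)]
print(caption_loss(v, s_seq, [3, 17, 254, 9, 88], params))
```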
Step 6, performing semantic generation on the test samples with the trained long short-term memory network to obtain sentences describing the image content.
(6a) extracting the image feature v′ of the remote sensing image in a test sample with the fast regional convolutional neural network of step (2);
(6b) iteratively computing the t-th hidden unit output h_t of the long short-term memory network, the network output y_t and the text feature s_{t+1} of the (t+1)-th word:
h_t = f(W_hx s_t + W_hh h_{t-1} + b_h + Θ(t=1)⊙v′),
y_t = softmax(W_oh h_t + b_o),
s_{t+1} = y_t,
where h_0 = 0 and the image feature vector v′ of the test sample is input only when t = 1;
(6c) according to the index corresponding to the maximum value in y_t, looking up the word in the vocabulary; all words output in sequence form a sentence, which is the generated sentence describing the image content.
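A greedy decoding loop corresponding to step (6) is sketched below. One assumption differs from the literal statement s_{t+1} = y_t above: the sketch feeds back the embedding of the arg-max word rather than the raw probability vector, which is a common implementation choice; the embed function, the <end> token and the maximum length are likewise assumptions of the sketch, and with random parameters the output is meaningless.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate_caption(v_test, params, embed, vocab, max_len=20, end_token="<end>"):
    """Greedy decoding of step (6): at each step the word with the largest
    probability in y_t is looked up in the vocabulary and its embedding is
    fed back as the next input."""
    W_hx, W_hh, W_oh, b_h, b_o = params
    h_prev = np.zeros(b_h.shape)                 # h_0 = 0
    s_t = np.zeros(b_h.shape)                    # assumed start input
    words = []
    for t in range(1, max_len + 1):
        inject = v_test if t == 1 else 0.0       # test image feature only at t = 1
        h_prev = np.maximum(0.0, W_hx @ s_t + W_hh @ h_prev + b_h + inject)
        y_t = softmax(W_oh @ h_prev + b_o)
        idx = int(np.argmax(y_t))                # index of the maximum in y_t
        word = vocab[idx]
        if word == end_token:
            break
        words.append(word)
        s_t = embed(idx)                         # next input from the chosen word
    return " ".join(words)

# toy usage with random parameters and a tiny vocabulary
rng = np.random.default_rng(3)
h, vocab = 512, ["a", "plane", "is", "parked", "<end>"]
params = (rng.standard_normal((h, h)) * 0.01, rng.standard_normal((h, h)) * 0.01,
          rng.standard_normal((len(vocab), h)) * 0.01, np.zeros(h), np.zeros(len(vocab)))
emb_table = rng.standard_normal((len(vocab), h))
print(generate_caption(rng.standard_normal(h), params, lambda i: emb_table[i], vocab))
```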
The technical effects of the invention are further described below in combination with simulation experiments.
1. Simulation conditions
The simulation platform of this embodiment is: a CPU with a main frequency of 4.00 GHz, 32.0 GB of memory, a GTX-1070 graphics card, the Ubuntu (64-bit) operating system, and the MXNet and Python development platforms.
The data sets used for the simulation are the UCM-Captions Data Set, the Sydney-Captions Data Set and RSICD.
The experiments use the Bleu1, Bleu2, Bleu3, Bleu4, METEOR, ROUGE_L and CIDEr metrics to evaluate the method.
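As an aside, the BLEU-n scores can be computed with NLTK as sketched below; this assumes NLTK is installed and is not the evaluation code of the invention, and METEOR, ROUGE_L and CIDEr require separate tooling (e.g. the COCO caption evaluation scripts). The example sentences are illustrative only.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_scores(reference_captions, generated_caption):
    """Compute BLEU-1 to BLEU-4 for one generated sentence against its
    reference descriptions (all given as plain strings)."""
    refs = [r.lower().split() for r in reference_captions]
    hyp = generated_caption.lower().split()
    smooth = SmoothingFunction().method1
    return [sentence_bleu(refs, hyp,
                          weights=tuple(1.0 / n for _ in range(n)),
                          smoothing_function=smooth)
            for n in range(1, 5)]

refs = ["many planes are parked next to a long building in an airport"]
hyp = "several planes are parked near a terminal building"
print(bleu_scores(refs, hyp))   # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
```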
2. Simulation content and results
Simulation 1. To compare the results of the invention under different evaluation indexes, semantic generation is performed with the method of the invention and tested on each of the three data sets; the test results are shown in Table 1.
Table 1 experimental results of different data sets
Table 1 shows the results of the method of the invention on the different data sets under the different evaluation indexes.
As can be seen from Table 1, the results of the various indexes on the RSICD data set are slightly lower than those on the UCM and Sydney data sets. Compared with the best current semantic generation method, however, the results on all three data sets improve by about 0.05 for every index except CIDEr.
Simulation 2. To verify the influence of the training data proportion on the semantic generation results for high-resolution remote sensing images, 10% of the images are used as a validation set and the proportion of the training set is increased from 10% to 80%; semantic generation is performed with the method of the invention and tested on each of the three data sets, with the test results shown in fig. 7.
Fig. 7 shows the results of the method of the present invention at different training set ratios for different evaluation indices, where the abscissa represents the training set ratio and the ordinate represents the corresponding result, and different metrics are represented by different shapes. Wherein fig. 7(a) shows the results on the UCM data set, fig. 7(b) shows the results on the Sydney data set, and fig. 7(c) shows the results on the RSICD data set.
As can be seen from fig. 7(a), as the proportion of the training set increases, the indexes in the result of the UCM data set increase and then gradually become stable;
As can be seen from fig. 7(b), as the training set proportion increases, the indexes on the Sydney data set increase in the initial stage; after the training proportion reaches 40% the results become almost stable and change little, which analysis of the data set attributes mainly to its unbalanced distribution;
as can be seen from fig. 7(c), as the proportion of the training set increases, the indexes in the results of the RSICD dataset increase, but the overall index is slightly lower because the sentence expression in the dataset is more complex.
In conclusion, the invention improves the semantic generation results for remote sensing images, although its performance on complex data sets and on individual indexes still leaves room for improvement.

Claims (5)

1. A remote sensing image semantic generation method based on a fast regional convolutional neural network is characterized by comprising the following steps:
(1) taking 60% of image-text pairs in the remote sensing data set as training samples, and taking the rest 40% of image-text pairs as test samples;
(2) extracting image features of remote sensing images in training samples by using a fast regional convolutional neural network;
(3) extracting text features of the texts corresponding to the remote sensing images in the training samples by using a bidirectional recurrent neural network;
(4) matching the image features obtained in step (2) with the text features obtained in step (3) by using a probability-model-based image-text matching method to obtain matched image-text features, comprising the following steps:
(4a) representing the image features of each remote sensing image and the text features of its corresponding description sentence as vectors, and unifying both into the same dimensional space; using the dot product v_i^T s_t to represent the similarity between the image feature v_i of the i-th candidate region and the text feature s_t of the t-th word, and calculating the matching degree S_pq between the p-th image and the q-th sentence:
S_pq = Σ_{t∈g_q} max_{i∈g_p} (v_i^T s_t),
where g_p denotes the set of image block regions in the p-th image and g_q denotes the set of words in the q-th sentence;
(4b) for one picture of the training set, traversing all sentences, calculating all matching degrees of the picture in sequence, and selecting the q-th sentence corresponding to the maximum matching degree S_pq as the sentence index matched with the image;
(4c) traversing all the images of the training set and repeating step (4b) to obtain the sentence index matched with each image;
(4d) according to the matched sentence indexes and their corresponding image indexes, retrieving for each picture the corresponding image feature v from (2) and the corresponding text features s_1, …, s_t, …, s_N from (3) to finish the matching process, where s_t is the text feature of the t-th word in the sentence, t ranges from 1 to N, and N is the number of words in the sentence;
(5) training a long short-term memory (LSTM) network with the matched image-text features from step (4);
(6) extracting the image features of the remote sensing images in the test samples with the fast regional convolutional neural network, and inputting them into the LSTM network trained in step (5) for semantic generation, obtaining a sentence that describes the image content.
2. The method of claim 1, wherein extracting the image features of the remote sensing images with the fast regional convolutional neural network in (2) comprises:
(2a) generating candidate regions with the regional candidate network in the fast regional convolutional neural network to obtain candidate regions containing the different targets in each high-resolution remote sensing image;
(2b) extracting the image features of the candidate regions obtained in (2a) with the three-layer convolutional neural network in the fast regional convolutional neural network; according to the pixels I_i in the i-th candidate region and using the three-layer convolutional neural network, the region image feature is represented as:
v_i = W_m CNN_θc(I_i) + b_m,
where θ_c denotes the network parameters of the regional candidate network, W_m and b_m denote the parameters of the three-layer convolutional neural network, the dimension of W_m is h × 4096, and h denotes the dimension of the embedding space; CNN_θc(I_i) is the 4096-dimensional fully-connected-layer vector output by the regional candidate network when I_i is passed through it with parameters θ_c; v_i denotes the image feature extracted from the i-th candidate region, covering the whole image and the first 19 detected locations;
(2c) merging the image features v_i extracted from all candidate regions to obtain the image feature v of the picture, whose dimensionality equals the number of candidate regions.
3. The method of claim 1, wherein extracting the text features of the texts corresponding to the remote sensing images in the training samples with the bidirectional recurrent neural network in (3) comprises the following steps:
(3a) inputting a word sequence of length N and converting the t-th word into an h-dimensional vector x_t by x_t = W_n Φ_t, where Φ_t is an indicator column vector giving the position index of the t-th word in the vocabulary, t takes values from 1 to N and denotes the position of each word in the sentence, and the weight W_n is the embedding matrix of the given word, which is randomly initialized;
(3b) randomly initializing the forward weight W_r, forward bias b_r, backward weight W_l and backward bias b_l of the bidirectional recurrent neural network, and iteratively computing the forward hidden unit output h_t^f and the backward hidden unit output h_t^b of the network:
h_t^f = f(x_t + W_r h_{t-1}^f + b_r),
h_t^b = f(x_t + W_l h_{t+1}^b + b_l),
where h_{t-1}^f is the (t-1)-th forward hidden unit output of the bidirectional recurrent neural network, x_t is the h-dimensional vector of the t-th word from (3a), t increases from 1 to N, and h_0^f = 0; h_{t+1}^b is the (t+1)-th backward hidden unit output, t decreases from N to 1, and h_{N+1}^b = 0; f is the rectified linear unit activation function ReLU, expressed as f(x) = max(0, x);
(3c) randomly initializing the output layer weight W_d and bias b_d of the network, and iteratively computing the representation s_t of the t-th word:
s_t = f(W_d (h_t^f + h_t^b) + b_d),
where h_t^f and h_t^b are the forward and backward hidden unit outputs of the bidirectional recurrent neural network from (3b);
(3d) optimizing all network parameters W_n, W_r, b_r, W_l, b_l, W_d and b_d with a cross-entropy loss function to obtain the trained bidirectional recurrent neural network, whose output s_t is the text feature of the t-th word, containing the position of the t-th word and its context information.
4. The method of claim 1, wherein training the long short-term memory network with the matched image-text features of (4) in step (5) comprises the following steps:
(5a) randomly initializing the input layer weight W_hx, hidden layer weight W_hh, output layer weight W_oh, hidden layer bias b_h and output layer bias b_o of the long short-term memory network; inputting the image feature v, the corresponding text features s_1, …, s_t, …, s_N and the original sentence, and iteratively computing the t-th hidden unit output h_t and the output layer output y_t of the network:
h_t = f(W_hx s_t + W_hh h_{t-1} + b_h + Θ(t=1)⊙v),
y_t = softmax(W_oh h_t + b_o),
where the image feature vector v from (2) is input only when t = 1, and h_{t-1} is the (t-1)-th hidden unit output of the network with h_0 = 0;
(5b) calculating the cross-entropy loss L(θ) of y_t:
L(θ) = - Σ_{t=1}^{N} Φ_t^T log(y_t),
where Φ_t is the indicator column vector of the desired output, i.e., the t-th word of the original sentence corresponding to the image, giving the position index of the word in the vocabulary, and N is the number of words in the sentence;
(5c) optimizing all the network parameters W_hx, W_hh, W_oh, b_h and b_o with the cross-entropy loss to obtain the trained long short-term memory network.
5. The method according to claim 1, wherein in step (6), semantic generation is performed with the long short-term memory network trained in (5) to obtain a sentence describing the image content, comprising the following steps:
(6a) extracting the image feature v′ of the remote sensing image in a test sample with the fast regional convolutional neural network of step (2);
(6b) iteratively computing the t-th hidden unit output h_t of the long short-term memory network, the network output y_t and the text feature s_{t+1} of the (t+1)-th word:
h_t = f(W_hx s_t + W_hh h_{t-1} + b_h + Θ(t=1)⊙v′),
y_t = softmax(W_oh h_t + b_o),
s_{t+1} = y_t,
where h_0 = 0, the image feature vector v′ of the test sample is input only when t = 1, W_hx denotes the input layer weight, W_hh the hidden layer weight, W_oh the output layer weight, b_h the hidden layer bias and b_o the output layer bias;
(6c) according to the index corresponding to the maximum value in y_t, looking up the word in the vocabulary; all words output in sequence form a sentence, which is the generated sentence describing the image content.
CN201810744473.0A 2018-07-09 2018-07-09 Remote sensing image semantic generation method based on fast regional convolutional neural network Active CN108960330B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810744473.0A CN108960330B (en) 2018-07-09 2018-07-09 Remote sensing image semantic generation method based on fast regional convolutional neural network


Publications (2)

Publication Number Publication Date
CN108960330A CN108960330A (en) 2018-12-07
CN108960330B true CN108960330B (en) 2021-09-10

Family

ID=64483489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810744473.0A Active CN108960330B (en) 2018-07-09 2018-07-09 Remote sensing image semantic generation method based on fast regional convolutional neural network

Country Status (1)

Country Link
CN (1) CN108960330B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740471B (en) * 2018-12-24 2021-06-22 中国科学院西安光学精密机械研究所 Remote sensing image description method based on joint latent semantic embedding
CN109784223B (en) * 2018-12-28 2020-09-01 珠海大横琴科技发展有限公司 Multi-temporal remote sensing image matching method and system based on convolutional neural network
CN111476838A (en) * 2019-01-23 2020-07-31 华为技术有限公司 Image analysis method and system
CN111753822A (en) * 2019-03-29 2020-10-09 北京市商汤科技开发有限公司 Text recognition method and device, electronic equipment and storage medium
CN110287355B (en) * 2019-05-16 2021-06-22 中国科学院西安光学精密机械研究所 Remote sensing image description method based on retrieval topic memory network
CN110232413A (en) * 2019-05-31 2019-09-13 华北电力大学(保定) Insulator image, semantic based on GRU network describes method, system, device
CN110363303B (en) * 2019-06-14 2023-07-07 平安科技(深圳)有限公司 Memory training method and device for intelligent distribution model and computer readable storage medium
CN110378335B (en) * 2019-06-17 2021-11-19 杭州电子科技大学 Information analysis method and model based on neural network
CN110418210B (en) * 2019-07-12 2021-09-10 东南大学 Video description generation method based on bidirectional cyclic neural network and depth output
US20210073317A1 (en) * 2019-09-05 2021-03-11 International Business Machines Corporation Performing dot product operations using a memristive crossbar array
CN110991284B (en) * 2019-11-22 2022-10-18 北京航空航天大学 Optical remote sensing image statement description generation method based on scene pre-classification
CN111126479A (en) * 2019-12-20 2020-05-08 山东浪潮人工智能研究院有限公司 Image description generation method and system based on unsupervised uniqueness optimization
CN112070069A (en) * 2020-11-10 2020-12-11 支付宝(杭州)信息技术有限公司 Method and device for identifying remote sensing image
CN112861882B (en) * 2021-03-10 2023-05-09 齐鲁工业大学 Image-text matching method and system based on frequency self-adaption
CN113298151A (en) * 2021-05-26 2021-08-24 中国电子科技集团公司第五十四研究所 Remote sensing image semantic description method based on multi-level feature fusion


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9940577B2 (en) * 2015-07-07 2018-04-10 Adobe Systems Incorporated Finding semantic parts in images

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868691A (en) * 2016-03-08 2016-08-17 中国石油大学(华东) Urban vehicle tracking method based on rapid region convolutional neural network
CN108073941A (en) * 2016-11-17 2018-05-25 江南大学 A kind of image, semantic generation method based on deep learning
CN107729987A (en) * 2017-09-19 2018-02-23 东华大学 The automatic describing method of night vision image based on depth convolution loop neutral net
CN107657008A (en) * 2017-09-25 2018-02-02 中国科学院计算技术研究所 Across media training and search method based on depth discrimination sequence study

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Identity-Aware Textual-Visual Matching with Latent Co-attention; Shuang Li et al.; 2017 IEEE International Conference on Computer Vision (ICCV); 2017-10-29; full text *
Convolutional Neural Networks in Image Understanding (图像理解中的卷积神经网络); Chang Liang et al.; Acta Automatica Sinica (自动化学报); 2016-06-30; full text *

Also Published As

Publication number Publication date
CN108960330A (en) 2018-12-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant