CN110287354A - Semantic understanding method for high-resolution remote sensing images based on a multi-modal neural network - Google Patents
Semantic understanding method for high-resolution remote sensing images based on a multi-modal neural network Download PDF Info
- Publication number
- CN110287354A CN110287354A CN201910406998.8A CN201910406998A CN110287354A CN 110287354 A CN110287354 A CN 110287354A CN 201910406998 A CN201910406998 A CN 201910406998A CN 110287354 A CN110287354 A CN 110287354A
- Authority
- CN
- China
- Prior art keywords
- remote sensing
- high score
- neural network
- sensing images
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
Abstract
The invention discloses a semantic understanding method for high-resolution remote sensing images based on a multi-modal neural network, which mainly addresses the problem that current analyses of high-resolution remote sensing images do not interpret the images at the level of high-level semantics. The implementation steps are: 1) construct a high-resolution remote sensing image-text description database; 2) extract the visual features of all images in the database with a pre-trained convolutional neural network; 3) create a vocabulary from the words in all text description sentences of the database; 4) train the deep multi-modal neural network; 5) input a high-resolution remote sensing image and generate its corresponding text description with the trained deep multi-modal neural network.
Description
Technical field
The invention belongs to the field of information processing, and in particular relates to an image understanding technique that can be used for disaster monitoring, military reconnaissance, geographical-conditions surveying, and similar applications.
Background technique
With the development of China's aerospace technology, more and more high-resolution satellites are being launched into space: remote sensing images are becoming easier to acquire, their resolution is continuously improving, and the effective information they contain is increasingly rich. High-spatial-resolution remote sensing images will therefore play a major role in road extraction, military reconnaissance, and related applications. To make rational use of these data and fully exploit both the visual and the semantic information in the images, the semantic understanding of high-resolution remote sensing images is a very important research direction.
At present, however, research on high-spatial-resolution (hereinafter "high-resolution") remote sensing images concentrates mainly on the following four tasks:
(1) Target detection: automatically detecting targets of interest in a high-resolution image, such as aircraft or oil storage tanks;
(2) Image classification: assigning each pixel of an image to a category by analyzing the texture and spatial information of the various ground-object types in the image;
(3) Image segmentation: dividing a high-resolution image into semantically continuous regions (regions with identical properties that indicate different object classes);
(4) Scene classification: identifying the scene contained in each high-resolution image, such as an airport or a harbour, and assigning each image a scene-category label.
These lines of work can usually only detect whether an image contains a certain target, or obtain a class label for each pixel or for the whole image; they cannot describe in detail the attributes of the targets in the image, their features, or the relationships between targets, and thus do not understand the targets at the semantic level. Moreover, the visual information of an image only roughly reflects its main content, whereas accompanying text information can describe the image in much finer detail.
Current work that combines visual and textual information for image semantic understanding focuses mainly on natural images and falls roughly into the following classes:
The first is based on image-text embedding. This approach first extracts an image feature vector with a pre-trained convolutional neural network model, then maps the image's corresponding text description into the same feature space as the image features with a pre-trained language model, finds the relationship between image and text by computing the similarity of the two feature vectors, and finally uses the learned mapping to generate a verbal description for a new test image. For details see the reference R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models," arXiv preprint arXiv:1411.2539, 2014.
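A minimal sketch of the ranking step such embedding methods rely on (the matrices, dimensions, and names below are illustrative stand-ins for the learned encoders, not the cited model):

```python
import numpy as np

# Toy projection matrices standing in for the learned image and text encoders;
# all dimensions here are illustrative assumptions.
rng = np.random.default_rng(0)
W_img = rng.normal(size=(8, 4096))   # maps a CNN image feature to the shared space
W_txt = rng.normal(size=(8, 300))    # maps a sentence feature to the shared space

def cosine_similarity(a, b):
    # Similarity of two vectors in the shared embedding space.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

img_feat = rng.normal(size=4096)     # e.g. the 4096-d output of a CNN's last fc layer
txt_feat = rng.normal(size=300)      # e.g. an averaged word embedding of a caption

score = cosine_similarity(W_img @ img_feat, W_txt @ txt_feat)
assert -1.0 <= score <= 1.0          # candidate captions would be ranked by this score
```

In a trained system, candidate descriptions for a new image are ranked by this similarity in the shared space.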
The second is based on target detection. This approach first trains a detector for each target and for each word related to the target descriptions, then runs these trained detectors on the image to obtain a set of words, composes those words into candidate sentences with a language model, and finally ranks the candidate sentences by their similarity score with the image, selecting the top-ranked sentence as the final result. For details see the reference H. Fang, S. Gupta, F. Iandola, et al., "From Captions to Visual Concepts and Back," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1473-1482, 2015.
The third is based on deep learning. In the training stage, a deep convolutional neural network (Convolutional Neural Network, CNN) first extracts image features, and the image features and the text information are then fed together into a recurrent neural network (Recurrent Neural Network, RNN) to train the network. Finally, a test image is fed into the trained deep neural network, which generates a verbal description of the image. For details see the reference O. Vinyals, A. Toshev, S. Bengio, et al., "Show and Tell: A Neural Image Caption Generator," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 3156-3164, 2015.
Although these methods have achieved good results on natural image understanding, in the high-resolution remote sensing field the semantic understanding of such images remains a blank: current methods do not exploit the high-level semantic information of the images and cannot truly understand them. How to combine the visual information and the text information of high-resolution remote sensing images, so as to understand the images at the semantic level, is therefore a significant problem.
Summary of the invention
The object of the invention is to address the deficiency of the existing methods described above by proposing a semantic understanding method for high-resolution remote sensing images based on a multi-modal neural network. Built on a multi-modal neural network model, the method considers both the visual and the semantic information of remote sensing images and attends to the correlations between target attributes and targets in an image, so as to understand remote sensing images at the level of high-level semantics.
The technical principle for realizing the object of the invention is as follows:
(1) First, annotate a high-resolution remote sensing image database with text: label every image with 5 sentences describing its content, forming a completely new high-resolution remote sensing image-text annotation (image-captions) database;
(2) Then extract the remote sensing image features with a convolutional neural network (Convolutional Neural Network, CNN);
(3) Feed the extracted image features, together with the text information corresponding to each image, into a recurrent neural network (Recurrent Neural Network, RNN) and train the network parameters to obtain our deep multi-modal neural network model;
(4) Use the trained deep multi-modal neural network to generate the text description corresponding to a high-resolution remote sensing image.
The specific technical solution for realizing the object of the invention is as follows:
The invention provides a semantic understanding method for high-resolution remote sensing images based on a multi-modal neural network, comprising the following steps:
1) Construct a high-resolution remote sensing image-text description database;
The database contains a number of high-resolution remote sensing images and, for each image, several text description sentences;
2) Extract the visual features of all images in the database with a pre-trained convolutional neural network, computed as:
b0 = CNN(I);
where I is an image in the database and b0 is its visual feature;
3) Create a vocabulary from the words in all text description sentences of the database;
Each word in the vocabulary is represented by a vector, and START and END vectors are added to mark the beginning and end of a sentence;
4) Train the deep multi-modal neural network; at every time step the network has one input layer, one hidden layer, and one output layer;
4.1) At time t=1, feed the image visual feature b0 from step 2) and the START vector from step 3) into the hidden layer of the network, obtaining the hidden-layer output h1 at time t=1:
h1 = g(λ1 w1 + b0);
where w1 is the input START vector, g is a nonlinear function, and λ1, λ2 are network weight parameters to be trained;
4.2) Feed the hidden-layer output h1 at time t=1 into the output layer, then compute the probability distribution over all words at time t=1 with the Softmax function:
p(w2) = softmax(λ3 h1);
where λ3 is a network weight parameter to be trained;
4.3) Choose the word w2 with the highest probability at time t=1 as the predicted word at t=1;
4.4) At each time t>1, feed the word vector of the word predicted at the previous time step, together with the previous hidden-layer output, into the hidden layer, obtaining the current hidden-layer output:
ht = g(λ1 wt + λ2 ht-1);
where ht is the current hidden-layer output and wt is the input word vector at the current time;
4.5) Feed the current hidden-layer output into the output layer and compute the probability distribution over all words at the current time with the Softmax function:
p(wt+1) = softmax(λ3 ht);
4.6) Choose the word with the highest probability at the current time as the predicted word at the current time;
4.7) Repeat steps 4.4) to 4.6) until the predicted word vector is the END vector;
4.8) Sum over all training image-text pairs to obtain the overall loss function of the deep multi-modal neural network and train to the optimum;
5) Input a high-resolution remote sensing image and generate its corresponding text description with the trained deep multi-modal neural network.
Further, in step 4.1) above, the nonlinear function g is an RNN or LSTM, and the image feature b0 is extracted by AlexNet, VGGNet, or GoogLeNet.
Further, the nonlinear function g is an LSTM and the image feature b0 is extracted by VGGNet.
Further, the content of the text description sentences in step 1) above includes the targets, the class label of each pixel or of the whole image, the attributes of the targets, their features, and the relationships between targets.
Further, the image visual feature in step 2) above is the 4096-dimensional vector output by the last fully connected layer of the convolutional neural network.
The beneficial effects of the invention are:
Compared with conventional methods, the invention fully considers the correlations between target attributes and targets in high-resolution remote sensing images and exploits both their visual and their semantic information, thereby understanding remote sensing images at the level of high-level semantics. It can be used in disaster monitoring, military reconnaissance, geographical-conditions surveying, and similar applications.
Brief description of the drawings
Fig. 1 is the overall flow chart of the invention;
Fig. 2 illustrates the construction of the high-resolution remote sensing image-text description database;
Fig. 3 shows the structure of the deep multi-modal neural network;
Fig. 4 shows text generation results for high-resolution remote sensing images obtained with the invention.
Specific embodiment
With reference to the accompanying drawings, the implementation steps of the semantic understanding method provided by the invention and the verification of the comparative tests are further described.
Referring to Fig. 1, the steps of the invention are as follows:
Step 1): construct the high-resolution remote sensing image-text description (image-captions) database;
The database contains a number of high-resolution remote sensing images and, for each image, several text description sentences;
In this embodiment, the UCM-captions database and the Sydney-captions database are constructed on the basis of the existing high-resolution remote sensing databases, the Sydney database and the UCM database, and serve as the high-resolution remote sensing image-text description (image-captions) database;
The Sydney-captions database contains 613 high-resolution remote sensing images at 0.3 m/pixel, each with 5 corresponding text descriptions, for a total of 3065 texts; the UCM-captions database contains 2100 high-resolution remote sensing images at 0.5 m/pixel, each with 5 corresponding text descriptions, for a total of 10500 texts (see Fig. 2). The establishment of these two image-text databases lays the foundation for training the subsequent semantic understanding model and for solving the semantic understanding problem.
It should be understood that although in this embodiment every image in the image-captions database has 5 text descriptions, the number is not limited to 5; it suffices that the content of the descriptions covers the targets, the class label of each pixel or of the whole image, the attributes of the targets, their features, and the relationships between targets.
Step 2): extract the visual features of all images in the image-captions database with a pre-trained convolutional neural network; the visual feature of an image is the 4096-dimensional vector output by the last fully connected layer of the network, computed as:
b0 = CNN(I);
where I is an image in the database and b0 is its visual feature;
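As a minimal sketch of this step, the call b0 = CNN(I) can be mimicked with a random projection; the function name and toy image size are assumptions, and a real implementation would instead forward the image through a pre-trained AlexNet/VGGNet/GoogLeNet and read out the last fully connected layer:

```python
import numpy as np

def cnn_features(image: np.ndarray, seed: int = 0) -> np.ndarray:
    """Placeholder for b0 = CNN(I): flattens the image and projects it to a
    4096-d vector, the size of the last fully connected layer used in the
    patent. A real implementation would forward the image through a
    pre-trained CNN instead of this random projection."""
    x = image.reshape(-1)
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(4096, x.size)) / np.sqrt(x.size)
    return W @ x

image = np.ones((16, 16, 3))   # a (heavily downscaled) remote sensing tile
b0 = cnn_features(image)
assert b0.shape == (4096,)     # one fixed-length visual feature per image
```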
Step 3): create a vocabulary from the words in all texts of the image-captions database;
Each word in the vocabulary is represented by a vector, and START and END vectors are added to mark the beginning and end of a sentence;
Step 4): referring to Fig. 3, train the deep multi-modal neural network; at every time step the network has one input layer, one hidden layer, and one output layer;
Step 4.1): at time t=1, feed the image visual feature b0 from step 2) and the START vector from step 3) into the hidden layer of the network, obtaining the hidden-layer output h1 at time t=1:
h1 = g(λ1 w1 + b0);
where w1 is the input START vector, g is a nonlinear function, and λ1, λ2 are network weight parameters to be trained;
The visual feature b0 of the image is extracted by AlexNet, VGGNet, or GoogLeNet;
Step 4.2): feed the hidden-layer output h1 at time t=1 into the output layer, then compute the probability distribution over all words at time t=1 with the Softmax function:
p(w2) = softmax(λ3 h1);
where λ3 is a network weight parameter to be trained;
Step 4.3): choose the word w2 with the highest probability at time t=1 as the predicted word at t=1;
Step 4.4): at each time t>1, feed the word vector of the word predicted at the previous time step, together with the previous hidden-layer output, into the hidden layer, obtaining the current hidden-layer output:
ht = g(λ1 wt + λ2 ht-1);
where ht is the current hidden-layer output and wt is the input word vector at the current time;
Step 4.5): feed the current hidden-layer output into the output layer and compute the probability distribution over all words at the current time with the Softmax function:
p(wt+1) = softmax(λ3 ht);
Step 4.6): choose the word with the highest probability at the current time as the predicted word at the current time;
Step 4.7): repeat steps 4.4) to 4.6) until the predicted word vector is the END vector;
Step 4.8): sum over all training image-text pairs to obtain the overall loss function of the deep multi-modal neural network and train to the optimum;
Step 5): input a high-resolution remote sensing image and generate its corresponding text description with the trained deep multi-modal neural network.
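The recurrence of steps 4.1)-4.7) and the generation of step 5) can be sketched as a greedy decoding loop. The parameters below are untrained toy values, tanh stands in for the nonlinearity g (the simple-RNN case; the patent's preferred variant is an LSTM cell), and the image feature b0 is assumed to be already projected to the hidden size:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate_caption(b0, word_vecs, vocab, l1, l2, l3, max_len=20):
    """Greedy decoding following steps 4.1)-4.7):
       h1 = g(l1 w1 + b0),   ht = g(l1 wt + l2 h_{t-1}),
       p(w_{t+1}) = softmax(l3 ht),
    taking the most probable word at each step until END (or max_len)."""
    g = np.tanh                              # nonlinearity g
    h = g(l1 @ word_vecs["<START>"] + b0)    # t = 1: the image feature enters here
    sentence = []
    for _ in range(max_len):
        word = vocab[int(np.argmax(softmax(l3 @ h)))]
        if word == "<END>":
            break
        sentence.append(word)
        h = g(l1 @ word_vecs[word] + l2 @ h)  # t > 1 recurrence
    return sentence

# Toy, untrained parameters just to exercise the recurrence.
rng = np.random.default_rng(1)
vocab = ["<START>", "<END>", "an", "airport", "with", "several", "planes"]
dim, hid = 8, 16
word_vecs = {w: rng.normal(size=dim) for w in vocab}
l1 = rng.normal(size=(hid, dim))
l2 = rng.normal(size=(hid, hid))
l3 = rng.normal(size=(len(vocab), hid))
b0 = rng.normal(size=hid)                    # image feature in hidden size

caption = generate_caption(b0, word_vecs, vocab, l1, l2, l3)
assert all(w in vocab for w in caption) and len(caption) <= 20
```

With trained weights λ1, λ2, λ3 and a real CNN feature, the same loop produces the final text description of step 5).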
The effect of the method is further described with the following simulation tests.
1. Simulation conditions
The simulation tests in this embodiment were carried out with MATLAB and PyCharm on a Linux operating system with an Intel(R) Xeon E5-2697 CPU at 2.60 GHz and 128 GB of memory.
The experimental data were obtained by manually annotating five text descriptions for each remote sensing image, on the basis of the UCM database provided by the U.S. Geological Survey (USGS) and the Sydney database released by the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, yielding the final UCM-captions database and Sydney-captions database as the experimental databases.
2. Simulation content
The semantic understanding of remote sensing images with the method of the invention proceeds as follows.
First, the text generation results are evaluated with three scores: BLEU-n, METEOR, and CIDEr.
The BLEU-n score counts how many of the n-grams (sequences of n consecutive words) of the generated text also occur in the reference text, divided by the total number of n-grams in the generated text; it is an index similar to precision (B-1, B-2, B-3, B-4 abbreviate BLEU-n for n = 1, ..., 4);
The METEOR score computes both the precision and the recall of the n-grams in the generated text and takes their harmonic mean as the final score;
The CIDEr score additionally weights each word of the generated text by its importance, on top of precision and recall. The three scores emphasize different aspects and, taken together, reflect the quality of the generated text well.
Then, the final generated texts are obtained on the UCM-captions database with the steps of this embodiment and assessed with the BLEU-n, METEOR, and CIDEr indices.
Table 1: text generation results on the UCM-captions database
Next, the same experimental procedure is run on the Sydney-captions database; the experimental results are shown in Table 2:
Table 2: text generation results on the Sydney-captions database
The results in Tables 1 and 2 show that the method of the invention achieves relatively good performance and that the quality of the generated text is comparatively high. From the obtained indices we can observe that: the stronger the feature-extraction ability of the CNN, the better the generated text; LSTM generally outperforms RNN; and the combination of VGG-19 and LSTM achieves the best results.
Part of the visualized results are shown in Fig. 4. The texts generated by our method are largely reasonable, and the method can even tell the number of targets in an image (e.g. storage tanks).
Claims (5)
1. A semantic understanding method for high-resolution remote sensing images based on a multi-modal neural network, characterized by comprising the following steps:
1) constructing a high-resolution remote sensing image-text description database;
the database containing a number of high-resolution remote sensing images and, for each image, several text description sentences;
2) extracting the visual features of all images in the database with a pre-trained convolutional neural network, computed as:
b0 = CNN(I);
where I is an image in the database and b0 is its visual feature;
3) creating a vocabulary from the words in all text description sentences of the database;
each word in the vocabulary being represented by a vector, with START and END vectors added to mark the beginning and end of a sentence;
4) training the deep multi-modal neural network; at every time step the network having one input layer, one hidden layer, and one output layer;
4.1) at time t=1, feeding the image visual feature b0 from step 2) and the START vector from step 3) into the hidden layer of the network, obtaining the hidden-layer output h1 at time t=1:
h1 = g(λ1 w1 + b0);
where w1 is the input START vector, g is a nonlinear function, and λ1, λ2 are network weight parameters to be trained;
4.2) feeding the hidden-layer output h1 at time t=1 into the output layer, then computing the probability distribution over all words at time t=1 with the Softmax function:
p(w2) = softmax(λ3 h1);
where λ3 is a network weight parameter to be trained;
4.3) choosing the word w2 with the highest probability at time t=1 as the predicted word at t=1;
4.4) at each time t>1, feeding the word vector of the word predicted at the previous time step, together with the previous hidden-layer output, into the hidden layer, obtaining the current hidden-layer output:
ht = g(λ1 wt + λ2 ht-1);
where ht is the current hidden-layer output and wt is the input word vector at the current time;
4.5) feeding the current hidden-layer output into the output layer and computing the probability distribution over all words at the current time with the Softmax function:
p(wt+1) = softmax(λ3 ht);
4.6) choosing the word with the highest probability at the current time as the predicted word at the current time;
4.7) repeating steps 4.4) to 4.6) until the predicted word vector is the END vector;
4.8) summing over all training image-text pairs to obtain the overall loss function of the deep multi-modal neural network;
5) inputting a high-resolution remote sensing image and generating its corresponding text description with the trained deep multi-modal neural network.
2. The semantic understanding method for high-resolution remote sensing images based on a multi-modal neural network according to claim 1, characterized in that: in step 4.1), the nonlinear function g is an RNN or LSTM, and the visual feature b0 of the image is extracted by AlexNet, VGGNet, or GoogLeNet.
3. The semantic understanding method for high-resolution remote sensing images based on a multi-modal neural network according to claim 1, characterized in that: the nonlinear function g is an LSTM, and the visual feature b0 of the image is extracted by VGGNet.
4. The semantic understanding method for high-resolution remote sensing images based on a multi-modal neural network according to claim 1, characterized in that: the content of the text description sentences in step 1) includes the targets, the class label of each pixel or of the whole image, the attributes of the targets, their features, and the relationships between targets.
5. The semantic understanding method for high-resolution remote sensing images based on a multi-modal neural network according to claim 1, characterized in that: the visual feature of an image in step 2) is the 4096-dimensional vector output by the last fully connected layer of the convolutional neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910406998.8A CN110287354A (en) | 2019-05-16 | 2019-05-16 | A kind of high score remote sensing images semantic understanding method based on multi-modal neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110287354A true CN110287354A (en) | 2019-09-27 |
Family
ID=68002116
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910406998.8A Pending CN110287354A (en) | 2019-05-16 | 2019-05-16 | A kind of high score remote sensing images semantic understanding method based on multi-modal neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110287354A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180089541A1 (en) * | 2016-09-27 | 2018-03-29 | Facebook, Inc. | Training Image-Recognition Systems Using a Joint Embedding Model on Online Social Networks |
CN106650756A (en) * | 2016-12-28 | 2017-05-10 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Image text description method based on knowledge transfer multi-modal recurrent neural network |
CN107391609A (en) * | 2017-07-01 | 2017-11-24 | 南京理工大学 | Image description method based on a bidirectional multi-modal recursive network |
Non-Patent Citations (1)
Title |
---|
BO QU et al.: "Deep semantic understanding of high resolution remote sensing image", 2016 International Conference on Computer, Information and Telecommunication Systems (CITS) * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929640A (en) * | 2019-11-20 | 2020-03-27 | 西安电子科技大学 | Wide remote sensing description generation method based on target detection |
CN110929640B (en) * | 2019-11-20 | 2023-04-07 | 西安电子科技大学 | Wide remote sensing description generation method based on target detection |
CN110991284A (en) * | 2019-11-22 | 2020-04-10 | 北京航空航天大学 | Optical remote sensing image statement description generation method based on scene pre-classification |
CN110991284B (en) * | 2019-11-22 | 2022-10-18 | 北京航空航天大学 | Optical remote sensing image statement description generation method based on scene pre-classification |
CN111445018A (en) * | 2020-03-27 | 2020-07-24 | 国网甘肃省电力公司电力科学研究院 | Ultraviolet imaging real-time information processing method based on accelerated convolutional neural network algorithm |
CN111445018B (en) * | 2020-03-27 | 2023-11-14 | 国网甘肃省电力公司电力科学研究院 | Ultraviolet imaging real-time information processing method based on accelerating convolutional neural network algorithm |
CN111582241A (en) * | 2020-06-01 | 2020-08-25 | 腾讯科技(深圳)有限公司 | Video subtitle recognition method, device, equipment and storage medium |
CN112949732A (en) * | 2021-03-12 | 2021-06-11 | 中国人民解放军海军航空大学 | Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion |
CN112949732B (en) * | 2021-03-12 | 2022-04-22 | 中国人民解放军海军航空大学 | Semantic annotation method and system based on self-adaptive multi-mode remote sensing image fusion |
CN113298151A (en) * | 2021-05-26 | 2021-08-24 | 中国电子科技集团公司第五十四研究所 | Remote sensing image semantic description method based on multi-level feature fusion |
CN113989297A (en) * | 2021-09-23 | 2022-01-28 | 杭州电子科技大学 | Method for segmenting tumor region by multi-modal eyelid tumor data fusion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110287354A (en) | High-resolution remote sensing image semantic understanding method based on a multi-modal neural network | |
CN111476294B (en) | Zero sample image identification method and system based on generation countermeasure network | |
Cheng et al. | Perturbation-seeking generative adversarial networks: A defense framework for remote sensing image scene classification | |
Yuan et al. | Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval | |
Qiu et al. | Geometric back-projection network for point cloud classification | |
Li et al. | Superpixel-based reweighted low-rank and total variation sparse unmixing for hyperspectral remote sensing imagery | |
CN106909924B (en) | Remote sensing image rapid retrieval method based on depth significance | |
Zhang et al. | A GANs-based deep learning framework for automatic subsurface object recognition from ground penetrating radar data | |
CN108960330A (en) | Remote sensing image semantic generation method based on fast region convolutional neural networks | |
Wu et al. | Scene attention mechanism for remote sensing image caption generation | |
CN112579816B (en) | Remote sensing image retrieval method and device, electronic equipment and storage medium | |
CN105989336A (en) | Scene recognition method based on deconvolution deep network learning with weight | |
Xu et al. | Txt2Img-MHN: Remote sensing image generation from text using modern Hopfield networks | |
Bragilevsky et al. | Deep learning for Amazon satellite image analysis | |
Wang et al. | Boosting lightweight CNNs through network pruning and knowledge distillation for SAR target recognition | |
Xiu et al. | 3D semantic segmentation for high-resolution aerial survey derived point clouds using deep learning | |
CN112182275A (en) | Trademark approximate retrieval system and method based on multi-dimensional feature fusion | |
CN109766752A (en) | Object matching and localization method and system based on deep learning, and computer | |
CN105046286B (en) | Supervised multi-view feature selection method based on automatic view generation and combined L1,2-norm minimization | |
Tu et al. | Detection of damaged rooftop areas from high-resolution aerial images based on visual bag-of-words model | |
CN114332288A (en) | Method for generating text generation image of confrontation network based on phrase driving and network | |
Chen et al. | Class-aware domain adaptation for coastal land cover mapping using optical remote sensing imagery | |
CN112766381B (en) | Attribute-guided SAR image generation method under limited sample | |
CN108985385A (en) | Based on the quick Weakly supervised object detection method for generating confrontation study | |
Tan et al. | Review of Zero-Shot Remote Sensing Image Scene Classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190927 |