CN108960330B - Remote sensing image semantic generation method based on fast regional convolutional neural network

Remote sensing image semantic generation method based on fast regional convolutional neural network

Info

Publication number
CN108960330B
CN108960330B (application CN201810744473.0A)
Authority
CN
China
Prior art keywords
image
neural network
text
network
remote sensing
Prior art date
Legal status
Active
Application number
CN201810744473.0A
Other languages
Chinese (zh)
Other versions
CN108960330A (en)
Inventor
张向荣
李翔
朱鹏
焦李成
唐旭
侯彪
马晶晶
马文萍
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN201810744473.0A
Publication of CN108960330A
Application granted
Publication of CN108960330B

Classifications

    • G06V 10/757: Image or video recognition or understanding; matching configurations of points or features
    • G06N 3/04: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods

Abstract

The invention provides a remote sensing image semantic generation method based on a fast regional convolutional neural network, which mainly solves the problem that the prior art can obtain neither the relations between targets in an image nor the relation between a target and the whole image. The implementation scheme is as follows: constructing a training sample set and a test sample set; extracting image features of the high-resolution remote sensing images with a fast regional convolutional neural network; extracting text features of the corresponding sentences with a bidirectional recurrent neural network; matching the image features with the text features using a probability-based image-text matching model; and training a long short-term memory (LSTM) network with the matched image-text features, thereby realizing semantic generation for high-resolution remote sensing images. The method fully considers the complex backgrounds and diverse targets of remote sensing images, improves the semantic generation results for remote sensing images, and can be used for image retrieval or scene classification.

Description

Remote sensing image semantic generation method based on fast regional convolutional neural network
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image semantic generation method which can be used for automatically describing the content of a remote sensing image.
Background
Content understanding and description of remote sensing images can provide decision support for remote sensing applications and has wide practical value. For example, in the field of military reconnaissance, existing algorithms can rapidly identify important military targets such as ports, airports and ships from remote sensing images. Content understanding and description can accurately and comprehensively interpret wide-swath battlefield images, enabling real-time interpretation of the battlefield geographic environment, dynamic information generation, and so on. On the civil side, understanding and describing remote sensing image content can provide easily understood information such as disaster assessment, vegetation coverage, crop growth conditions and urban change.
Most studies on understanding and describing image content address natural images with the tools of natural language processing. Much image description work can be viewed as a retrieval process: image features and text features are first unified into a common vector space, and then text features are retrieved from image features, or image features are retrieved from text features. Another approach builds a training database of images and their textual descriptions, learns the correspondence between images and text, and uses it to generate text for new images. All of these methods operate in a supervised framework, and the generated text is a description of the image content. For example, the Stanford group led by Fei-Fei Li used a deep neural network model to infer the correspondence between sentence fragments and image regions, and then used it to build a generative language description model. In computer vision, the description of images and videos has attracted intense interest; Socher et al. and Wang et al. studied the associations between images and the words that describe the objects in them.
Owing to adverse factors such as complicated and diverse targets and the difficulty of sample labeling, content understanding and description of remote sensing images has developed more slowly than that of natural images. Existing results mostly focus on semantic extraction and retrieval of remote sensing images. For example, Professor Zhang et al. proposed a remote sensing image retrieval model based on semantic mining that retrieves remote sensing images with natural language, and Yang Junli et al. of Beihang University (Beijing University of Aeronautics and Astronautics) proposed modeling the low-level features and context information of remote sensing images through a CRF framework. Content understanding and description of remote sensing images is usually obtained by statistical learning: semantic information implicit in the image is extracted first, and the image is then analyzed further according to the correspondence between low-level features and semantic features. Such methods can obtain shallow semantic information that assists recognition, but they do not go deep enough; they stop at the stage of target localization and recognition, cannot obtain the relations between targets in the image or the overall relation between targets and the image, and thus limit the accuracy of subsequent tasks such as image detection and scene classification.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a remote sensing image semantic generation method based on a fast regional convolutional neural network, so that the visual information in high-resolution remote sensing images is fully exploited and, combined with the text information of the description sentences, the relations between targets in the image and between each target and the whole image are obtained, improving the accuracy of tasks such as image retrieval and scene classification.
To achieve the above purpose, the invention comprises the following implementation steps:
(1) taking 60% of image-text pairs in the remote sensing data set as training samples, and taking the rest 40% of image-text pairs as test samples;
(2) extracting image features of remote sensing images in training samples by using a fast regional convolutional neural network;
(3) extracting text features of the texts corresponding to the remote sensing images in the training samples by using a bidirectional recurrent neural network;
(4) matching the image characteristics obtained in the step (2) with the text characteristics obtained in the step (3) by using a probability model-based image-text matching method to obtain matched image-text characteristics;
(5) training a long short-term memory (LSTM) network with the matched image-text features from step (4);
(6) extracting the image features of the remote sensing images in the test samples with the fast regional convolutional neural network, and inputting them into the LSTM network trained in step (5) for semantic generation, obtaining sentences that describe the image content.
Compared with the prior art, the invention has the following advantages:
Firstly, by adopting the fast regional convolutional neural network, the invention can more accurately obtain visual features suitable for describing high-resolution remote sensing images.
Secondly, by adopting probability-model-based matching between image features and text features, the invention better constructs the correspondence from image features to text features.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a block diagram of a fast regional convolutional neural network used in the present invention to extract image features;
FIG. 3 is a block diagram of a regional candidate network in the fast regional convolutional neural network of the present invention;
FIG. 4 is a diagram of a feature transformation module in a regional candidate network in accordance with the present invention;
FIG. 5 is a block diagram of a bi-directional recurrent neural network used in the present invention to extract text features;
FIG. 6 is a diagram of the long short-term memory network used for text prediction in the present invention;
FIG. 7 is a graph comparing evaluation results of the present invention under different evaluation indexes.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific examples:
Referring to FIG. 1, the implementation steps of the present invention are as follows.
Step 1, constructing a training sample set and a testing sample set.
Three remote sensing image semantic generation data sets, the UCM-Captions Data Set, the Sydney-Captions Data Set and RSICD, are downloaded from the website of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing of Wuhan University; 60% of the image-text pairs in each data set are taken as training samples and the remaining 40% as test samples.
Step 2, extracting the image features of the remote sensing images in the training samples with the fast regional convolutional neural network:
The structure of the fast regional convolutional neural network is shown in FIG. 2; it comprises a regional candidate network and a three-layer convolutional neural network, where:
the structure of the regional candidate network is shown in FIG. 3; it comprises a feature transformation module and a region suggestion module, the feature transformation module adopts VGG-16 and its structure is shown in FIG. 4, and the region suggestion module has two branches: a classification branch and a regression branch;
the three-layer convolutional neural network consists of two convolutional layers and a fully connected layer, the convolutional layers have 256 nodes and the fully connected layer has 4096 nodes.
The specific implementation of this step is as follows:
(2a) screening out candidate regions containing targets with the regional candidate network in the fast regional convolutional neural network:
converting the original picture into a feature map through the feature transformation module; sliding a 3x3 window over the feature map with stride 1 through the region suggestion module, and generating k candidate regions of different sizes centred at each window position, where k is 9 in this example;
converting the k candidate regions into 256-dimensional vectors and inputting them into the region suggestion module; the classification branch of the module outputs 2k indicating variables for the k candidate regions, where 1 represents a target and 0 a non-target; the regression branch outputs 4k coordinate values, i.e., four bounding-box coordinates per candidate region; the candidate regions containing targets are then generated from the indicating variables and the coordinates;
(2b) extracting the image features of the candidate regions obtained in (2a) with the three-layer convolutional neural network of the fast regional convolutional neural network:
according to the pixels I_i in the i-th candidate region and using the three-layer convolutional neural network, the region image feature is represented as:
v_i = W_m CNN_θc(I_i) + b_m,
where θ_c denotes the network parameters of the regional candidate network, W_m and b_m denote the weights and biases of the three-layer convolutional neural network respectively, the dimension of W_m is h × 4096, and h denotes the dimension of the embedding space; CNN_θc(I_i) is the 4096-dimensional fully-connected-layer vector output by the regional candidate network when I_i is passed through it with parameters θ_c; v_i denotes the image feature extracted from the i-th candidate region, covering the whole image and the first 19 detected locations;
(2c) merging the image features v_i extracted from all candidate regions to obtain the image feature v of the picture, whose dimensionality equals the number of screened candidate regions.
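For illustration only, the logic of steps (2a) to (2c) can be sketched in Python/NumPy as follows: anchors are enumerated at each sliding-window position and kept when the classification branch marks them as targets, and each kept region's 4096-dimensional fully-connected-layer vector is projected into the h-dimensional embedding space by v_i = W_m CNN_θc(I_i) + b_m and stacked into the picture feature v. The anchor scales, the score threshold, and the random arrays standing in for the network outputs are assumptions of this sketch, not part of the claimed method.

```python
import numpy as np

def generate_anchors(center, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return k = 9 candidate boxes (x1, y1, x2, y2) of different sizes
    centred on one sliding-window position (step 2a)."""
    cx, cy = center
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

def propose_regions(cls_scores, box_deltas, stride=16, thresh=0.5):
    """Keep anchors whose classification score (the 'indicating variable')
    exceeds thresh and refine them with the regression branch outputs.
    cls_scores: (H, W, 9) and box_deltas: (H, W, 9, 4) are stand-ins for
    the two branches of the region suggestion module."""
    H, W, k = cls_scores.shape
    proposals = []
    for y in range(H):
        for x in range(W):
            anchors = generate_anchors(((x + 0.5) * stride, (y + 0.5) * stride))
            for a in range(k):
                if cls_scores[y, x, a] > thresh:
                    proposals.append(anchors[a] + box_deltas[y, x, a])
    return np.array(proposals)

def picture_feature(cnn_outputs, W_m, b_m):
    """Steps (2b)-(2c): v_i = W_m . CNN(I_i) + b_m for each kept region,
    stacked into the picture-level feature v of shape (num_regions, h)."""
    return np.stack([W_m @ c + b_m for c in cnn_outputs])

# toy usage with random stand-ins for the network outputs
rng = np.random.default_rng(0)
regions = propose_regions(rng.random((4, 4, 9)), rng.standard_normal((4, 4, 9, 4)))
h = 512
W_m, b_m = rng.standard_normal((h, 4096)) * 0.01, np.zeros(h)
cnn_outputs = [rng.standard_normal(4096) for _ in range(len(regions))]
v = picture_feature(cnn_outputs, W_m, b_m)
print(regions.shape, v.shape)   # (num_kept, 4) (num_kept, 512)
```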
Step 3, extracting the text features of the sentences in the training samples with the bidirectional recurrent neural network.
The structure of the bidirectional recurrent neural network is shown in FIG. 5; the numbers of input and output nodes are both set to N, and the numbers of forward and backward hidden units are both set to h.
The specific implementation of this step is as follows:
(3a) inputting a word sequence of length N and converting the t-th word into an h-dimensional vector:
x_t = W_n Φ_t,
where Φ_t is an indicator column vector giving the position index of the word in the vocabulary, t denotes the position of each word in the sentence and takes values from 1 to N, and W_n is the embedding matrix of the given word, which is randomly initialized;
(3b) randomly initializing the forward weight W_r, forward bias b_r, backward weight W_l and backward bias b_l of the bidirectional recurrent neural network, and iteratively computing the forward hidden unit output h_t^f and the backward hidden unit output h_t^b of the network:
h_t^f = f(x_t + W_r h_{t-1}^f + b_r),
h_t^b = f(x_t + W_l h_{t+1}^b + b_l),
where h_{t-1}^f is the (t-1)-th forward hidden unit output of the bidirectional recurrent neural network, x_t is the h-dimensional vector of the t-th word from (3a), t increases from 1 to N, and h_0^f = 0; h_{t+1}^b is the (t+1)-th backward hidden unit output, t decreases from N to 1, and h_{N+1}^b = 0; f is the rectified linear unit activation function ReLU, expressed as f(x) = max(0, x);
(3c) randomly initializing the output layer weight W_d and bias b_d of the network, and iteratively computing the network output s_t:
s_t = f(W_d (h_t^f + h_t^b) + b_d),
where h_t^f and h_t^b are the forward and backward hidden unit outputs of the bidirectional recurrent neural network from (3b);
(3d) calculating the cross-entropy loss function L(θ):
L(θ) = - Σ_{t=1}^{N} Φ_t^T log(softmax(s_t)),
where Φ_t is the indicator column vector of the desired output, i.e., the t-th word, giving the index of the word's position in the vocabulary;
(3e) optimizing L(θ) with the stochastic gradient descent algorithm to obtain the trained bidirectional recurrent neural network, whose output s_t is the text feature of the t-th word, containing the word's position and its context information.
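A minimal NumPy sketch of the recurrences of (3a) to (3c) is given below, assuming the equation forms reconstructed above; the cross-entropy training of (3d) and (3e) is omitted and all parameters are simply randomly initialized, so the sketch illustrates the data flow rather than a trained model.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def birnn_text_features(word_ids, vocab_size, h, seed=0):
    """Compute s_t = f(W_d (h_t^f + h_t^b) + b_d) for a sentence given as a
    list of vocabulary indices; parameters follow steps (3a)-(3c)."""
    rng = np.random.default_rng(seed)
    N = len(word_ids)
    W_n = rng.standard_normal((h, vocab_size)) * 0.01   # word embedding matrix
    W_r = rng.standard_normal((h, h)) * 0.01            # forward weight
    W_l = rng.standard_normal((h, h)) * 0.01            # backward weight
    W_d = rng.standard_normal((h, h)) * 0.01            # output weight
    b_r, b_l, b_d = np.zeros(h), np.zeros(h), np.zeros(h)

    # (3a) x_t = W_n Phi_t, i.e. select the embedding column of the t-th word
    x = [W_n[:, w] for w in word_ids]

    # (3b) forward pass t = 1..N and backward pass t = N..1
    hf = [np.zeros(h)] * (N + 1)          # hf[0] plays the role of h_0^f = 0
    for t in range(1, N + 1):
        hf[t] = relu(x[t - 1] + W_r @ hf[t - 1] + b_r)
    hb = [np.zeros(h)] * (N + 2)          # hb[N+1] plays the role of h_{N+1}^b = 0
    for t in range(N, 0, -1):
        hb[t] = relu(x[t - 1] + W_l @ hb[t + 1] + b_l)

    # (3c) the text feature of each word combines both directions
    return [relu(W_d @ (hf[t] + hb[t]) + b_d) for t in range(1, N + 1)]

# toy usage: a 5-word sentence over a 1000-word vocabulary
s = birnn_text_features([3, 17, 254, 9, 88], vocab_size=1000, h=512)
print(len(s), s[0].shape)   # 5 (512,)
```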
Step 4, matching the image features and text features extracted from the training samples with the probability-based image-text matching model.
(4a) representing the image features of each remote sensing image and the text features of its corresponding description sentence as vectors, and uniformly transforming both into the h-dimensional space; using the dot product v_i^T s_t to represent the similarity between the i-th region image feature v_i and the text feature s_t of the t-th word, and calculating the matching degree S_pq between the p-th image and the q-th sentence:
S_pq = Σ_{t∈g_q} max_{i∈g_p} (v_i^T s_t),
where g_p denotes the set of image block regions in the p-th image and g_q denotes the set of words in the q-th sentence;
(4b) for one picture of the training set, traversing all sentences, calculating all matching degrees of the picture in sequence, and selecting the q-th sentence corresponding to the maximum matching degree S_pq as the sentence index matched with the image;
(4c) traversing all the images of the training set and repeating step (4b) to obtain the sentence index matched with each image;
(4d) according to the matched sentence indexes and their corresponding image indexes, retrieving for each picture the corresponding image feature v from (2) and the corresponding text features s_1, …, s_t, …, s_N from (3) to finish the matching process, where s_t is the text feature of the t-th word in the sentence, t ranges from 1 to N, and N is the number of words in the sentence.
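The matching degree of (4a) and the traversal of (4b) and (4c) reduce to a dot-product table, a per-word maximum over regions, a sum, and an arg-max over sentences. The following NumPy sketch illustrates this with random feature matrices; the feature dimensions and the data are placeholders for illustration only.

```python
import numpy as np

def matching_degree(v, s):
    """S_pq = sum over words t of max over regions i of (v_i . s_t),
    where v is (num_regions, h) and s is (num_words, h)."""
    return float(np.sum(np.max(s @ v.T, axis=1)))

def match_images_to_sentences(image_feats, sentence_feats):
    """For each image p, traverse all sentences q and keep the index of the
    sentence with the largest matching degree S_pq (steps (4b)-(4c))."""
    matches = []
    for v in image_feats:
        scores = [matching_degree(v, s) for s in sentence_feats]
        matches.append(int(np.argmax(scores)))
    return matches

# toy usage: 3 images with 20 region features each, 3 sentences of varying length
rng = np.random.default_rng(1)
imgs = [rng.standard_normal((20, 512)) for _ in range(3)]
sents = [rng.standard_normal((n, 512)) for n in (7, 9, 5)]
print(match_images_to_sentences(imgs, sents))   # e.g. [2, 0, 1]
```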
Step 5, training a long short-term memory network for text prediction:
The structure of the long short-term memory network is shown in FIG. 6; the numbers of network input nodes and output nodes are both set to N, and the number of hidden units is set to h.
The specific implementation of this step is as follows:
(5a) randomly initializing the input layer weight W_hx, hidden layer weight W_hh, output layer weight W_oh, hidden layer bias b_h and output layer bias b_o of the long short-term memory network; inputting the image feature v, the corresponding text features s_1, …, s_t, …, s_N and the original sentence, and iteratively computing the t-th hidden unit output h_t and the output layer output y_t of the network:
h_t = f(W_hx s_t + W_hh h_{t-1} + b_h + Θ(t=1)⊙v),
y_t = softmax(W_oh h_t + b_o),
where the image feature vector v from (2) is input only when t = 1, h_{t-1} is the (t-1)-th hidden unit output of the network with h_0 = 0, the hidden unit size is set to 512, and the output vector y_t represents the probability output of a word;
(5b) calculating the cross-entropy loss L(θ) of y_t:
L(θ) = - Σ_{t=1}^{N} Φ_t^T log(y_t),
where Φ_t is the indicator column vector of the desired output, i.e., the t-th word of the original sentence corresponding to the image, giving the position index of the word in the vocabulary;
(5c) optimizing the cross-entropy loss L(θ) with the stochastic gradient descent algorithm to obtain the trained long short-term memory network.
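The following sketch follows the recurrence exactly as written in (5a) and (5b), that is, a single recurrent layer with the image feature injected only at t = 1 and a softmax output; the gating equations of a full LSTM cell and the stochastic gradient descent update of (5c) are omitted, and all parameters and inputs are random stand-ins assumed for the sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def caption_loss(v, s_seq, target_ids, params):
    """Forward pass of h_t = f(W_hx s_t + W_hh h_{t-1} + b_h + [t=1] v) and
    y_t = softmax(W_oh h_t + b_o), followed by the cross-entropy loss against
    the word indices of the original sentence (steps (5a)-(5b))."""
    W_hx, W_hh, W_oh, b_h, b_o = params
    h_prev = np.zeros(b_h.shape)          # h_0 = 0
    loss = 0.0
    for t, (s_t, target) in enumerate(zip(s_seq, target_ids), start=1):
        inject = v if t == 1 else 0.0     # image feature enters only at t = 1
        h_prev = np.maximum(0.0, W_hx @ s_t + W_hh @ h_prev + b_h + inject)
        y_t = softmax(W_oh @ h_prev + b_o)
        loss -= np.log(y_t[target] + 1e-12)   # -Phi_t^T log(y_t)
    return loss

# toy usage: hidden size 512, vocabulary of 1000 words, a 5-word sentence
rng = np.random.default_rng(2)
h, V = 512, 1000
params = (rng.standard_normal((h, h)) * 0.01, rng.standard_normal((h, h)) * 0.01,
          rng.standard_normal((V, h)) * 0.01, np.zeros(h), np.zeros(V))
v = rng.standard_normal(h)
s_seq = [rng.standard_normal(h) for _ in range(5)]
print(caption_loss(v, s_seq, [3, 17, 254, 9, 88], params))
```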
Step 6, performing semantic generation on the test samples with the trained long short-term memory network to obtain sentences describing the image content.
(6a) extracting the image feature v′ of the remote sensing image in a test sample with the fast regional convolutional neural network of step (2);
(6b) iteratively computing the t-th hidden unit output h_t of the long short-term memory network, the network output y_t and the text feature s_{t+1} of the (t+1)-th word:
h_t = f(W_hx s_t + W_hh h_{t-1} + b_h + Θ(t=1)⊙v′),
y_t = softmax(W_oh h_t + b_o),
s_{t+1} = y_t,
where h_0 = 0 and the image feature vector v′ of the test sample is input only when t = 1;
(6c) according to the index corresponding to the maximum value in y_t, looking up the word in the vocabulary; all words output in sequence form a sentence, which is the generated sentence describing the image content.
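A greedy decoding loop corresponding to step (6) is sketched below. One assumption differs from the literal statement s_{t+1} = y_t above: the sketch feeds back the embedding of the arg-max word rather than the raw probability vector, which is a common implementation choice; the embed function, the <end> token and the maximum length are likewise assumptions of the sketch, and with random parameters the output is meaningless.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate_caption(v_test, params, embed, vocab, max_len=20, end_token="<end>"):
    """Greedy decoding of step (6): at each step the word with the largest
    probability in y_t is looked up in the vocabulary and its embedding is
    fed back as the next input."""
    W_hx, W_hh, W_oh, b_h, b_o = params
    h_prev = np.zeros(b_h.shape)                 # h_0 = 0
    s_t = np.zeros(b_h.shape)                    # assumed start input
    words = []
    for t in range(1, max_len + 1):
        inject = v_test if t == 1 else 0.0       # test image feature only at t = 1
        h_prev = np.maximum(0.0, W_hx @ s_t + W_hh @ h_prev + b_h + inject)
        y_t = softmax(W_oh @ h_prev + b_o)
        idx = int(np.argmax(y_t))                # index of the maximum in y_t
        word = vocab[idx]
        if word == end_token:
            break
        words.append(word)
        s_t = embed(idx)                         # next input from the chosen word
    return " ".join(words)

# toy usage with random parameters and a tiny vocabulary
rng = np.random.default_rng(3)
h, vocab = 512, ["a", "plane", "is", "parked", "<end>"]
params = (rng.standard_normal((h, h)) * 0.01, rng.standard_normal((h, h)) * 0.01,
          rng.standard_normal((len(vocab), h)) * 0.01, np.zeros(h), np.zeros(len(vocab)))
emb_table = rng.standard_normal((len(vocab), h))
print(generate_caption(rng.standard_normal(h), params, lambda i: emb_table[i], vocab))
```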
The technical effects of the invention are further described below in combination with simulation experiments.
1. Simulation conditions
The simulation platform of this embodiment is: a CPU with a main frequency of 4.00 GHz, 32.0 GB of memory, a GTX-1070 graphics card, the Ubuntu (64-bit) operating system, and the MXNet and Python development platforms.
The data sets used for the simulation are the UCM-Captions Data Set, the Sydney-Captions Data Set and RSICD.
The experiments use the Bleu1, Bleu2, Bleu3, Bleu4, METEOR, ROUGE_L and CIDEr metrics to evaluate the method.
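As an aside, the BLEU-n scores can be computed with NLTK as sketched below; this assumes NLTK is installed and is not the evaluation code of the invention, and METEOR, ROUGE_L and CIDEr require separate tooling (e.g. the COCO caption evaluation scripts). The example sentences are illustrative only.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_scores(reference_captions, generated_caption):
    """Compute BLEU-1 to BLEU-4 for one generated sentence against its
    reference descriptions (all given as plain strings)."""
    refs = [r.lower().split() for r in reference_captions]
    hyp = generated_caption.lower().split()
    smooth = SmoothingFunction().method1
    return [sentence_bleu(refs, hyp,
                          weights=tuple(1.0 / n for _ in range(n)),
                          smoothing_function=smooth)
            for n in range(1, 5)]

refs = ["many planes are parked next to a long building in an airport"]
hyp = "several planes are parked near a terminal building"
print(bleu_scores(refs, hyp))   # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
```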
2. Simulation content and results
Simulation 1. To compare the results of the invention under different evaluation indexes, semantic generation is performed with the method of the invention and tested on each of the three data sets; the test results are shown in Table 1.
Table 1 experimental results of different data sets
Table 1 shows the results of the method of the invention on the different data sets under the different evaluation indexes.
As can be seen from Table 1, the results of the various indexes on the RSICD data set are slightly lower than those on the UCM and Sydney data sets. Compared with the best current semantic generation method, however, the results on all three data sets improve by about 0.05 for every index except CIDEr.
Simulation 2. To verify the influence of the training data proportion on the semantic generation results for high-resolution remote sensing images, 10% of the images are used as a validation set and the proportion of the training set is increased from 10% to 80%; semantic generation is performed with the method of the invention and tested on each of the three data sets, with the test results shown in fig. 7.
Fig. 7 shows the results of the method of the present invention at different training set ratios for different evaluation indices, where the abscissa represents the training set ratio and the ordinate represents the corresponding result, and different metrics are represented by different shapes. Wherein fig. 7(a) shows the results on the UCM data set, fig. 7(b) shows the results on the Sydney data set, and fig. 7(c) shows the results on the RSICD data set.
As can be seen from fig. 7(a), as the proportion of the training set increases, the indexes in the result of the UCM data set increase and then gradually become stable;
As can be seen from fig. 7(b), as the training set proportion increases, the indexes on the Sydney data set increase in the initial stage; after the training proportion reaches 40% the results become almost stable and change little, which analysis of the data set attributes mainly to its unbalanced distribution;
as can be seen from fig. 7(c), as the proportion of the training set increases, the indexes in the results of the RSICD dataset increase, but the overall index is slightly lower because the sentence expression in the dataset is more complex.
In conclusion, the invention improves the semantic generation results for remote sensing images, although its performance on complex data sets and on individual indexes still leaves room for improvement.

Claims (5)

1. A remote sensing image semantic generation method based on a fast regional convolutional neural network is characterized by comprising the following steps:
(1) taking 60% of image-text pairs in the remote sensing data set as training samples, and taking the rest 40% of image-text pairs as test samples;
(2) extracting image features of remote sensing images in training samples by using a fast regional convolutional neural network;
(3) extracting text features of the texts corresponding to the remote sensing images in the training samples by using a bidirectional recurrent neural network;
(4) matching the image features obtained in step (2) with the text features obtained in step (3) by using a probability-model-based image-text matching method to obtain matched image-text features, comprising the following steps:
(4a) representing the image features of each remote sensing image and the text features of its corresponding description sentence as vectors, and unifying both into the same dimensional space; using the dot product v_i^T s_t to represent the similarity between the image feature v_i of the i-th candidate region and the text feature s_t of the t-th word, and calculating the matching degree S_pq between the p-th image and the q-th sentence:
S_pq = Σ_{t∈g_q} max_{i∈g_p} (v_i^T s_t),
where g_p denotes the set of image block regions in the p-th image and g_q denotes the set of words in the q-th sentence;
(4b) for one picture of the training set, traversing all sentences, calculating all matching degrees of the picture in sequence, and selecting the q-th sentence corresponding to the maximum matching degree S_pq as the sentence index matched with the image;
(4c) traversing all the images of the training set and repeating step (4b) to obtain the sentence index matched with each image;
(4d) according to the matched sentence indexes and their corresponding image indexes, retrieving for each picture the corresponding image feature v from (2) and the corresponding text features s_1, …, s_t, …, s_N from (3) to finish the matching process, where s_t is the text feature of the t-th word in the sentence, t ranges from 1 to N, and N is the number of words in the sentence;
(5) training a long short-term memory (LSTM) network with the matched image-text features from step (4);
(6) extracting the image features of the remote sensing images in the test samples with the fast regional convolutional neural network, and inputting them into the LSTM network trained in step (5) for semantic generation, obtaining a sentence that describes the image content.
2. The method of claim 1, wherein extracting the image features of the remote sensing images with the fast regional convolutional neural network in (2) comprises:
(2a) generating candidate regions with the regional candidate network in the fast regional convolutional neural network to obtain candidate regions containing the different targets in each high-resolution remote sensing image;
(2b) extracting the image features of the candidate regions obtained in (2a) with the three-layer convolutional neural network in the fast regional convolutional neural network; according to the pixels I_i in the i-th candidate region and using the three-layer convolutional neural network, the region image feature is represented as:
v_i = W_m CNN_θc(I_i) + b_m,
where θ_c denotes the network parameters of the regional candidate network, W_m and b_m denote the parameters of the three-layer convolutional neural network, the dimension of W_m is h × 4096, and h denotes the dimension of the embedding space; CNN_θc(I_i) is the 4096-dimensional fully-connected-layer vector output by the regional candidate network when I_i is passed through it with parameters θ_c; v_i denotes the image feature extracted from the i-th candidate region, covering the whole image and the first 19 detected locations;
(2c) merging the image features v_i extracted from all candidate regions to obtain the image feature v of the picture, whose dimensionality equals the number of candidate regions.
3. The method of claim 1, wherein extracting the text features of the texts corresponding to the remote sensing images in the training samples with the bidirectional recurrent neural network in (3) comprises the following steps:
(3a) inputting a word sequence of length N and converting the t-th word into an h-dimensional vector x_t by x_t = W_n Φ_t, where Φ_t is an indicator column vector giving the position index of the t-th word in the vocabulary, t takes values from 1 to N and denotes the position of each word in the sentence, and the weight W_n is the embedding matrix of the given word, which is randomly initialized;
(3b) randomly initializing the forward weight W_r, forward bias b_r, backward weight W_l and backward bias b_l of the bidirectional recurrent neural network, and iteratively computing the forward hidden unit output h_t^f and the backward hidden unit output h_t^b of the network:
h_t^f = f(x_t + W_r h_{t-1}^f + b_r),
h_t^b = f(x_t + W_l h_{t+1}^b + b_l),
where h_{t-1}^f is the (t-1)-th forward hidden unit output of the bidirectional recurrent neural network, x_t is the h-dimensional vector of the t-th word from (3a), t increases from 1 to N, and h_0^f = 0; h_{t+1}^b is the (t+1)-th backward hidden unit output, t decreases from N to 1, and h_{N+1}^b = 0; f is the rectified linear unit activation function ReLU, expressed as f(x) = max(0, x);
(3c) randomly initializing the output layer weight W_d and bias b_d of the network, and iteratively computing the representation s_t of the t-th word:
s_t = f(W_d (h_t^f + h_t^b) + b_d),
where h_t^f and h_t^b are the forward and backward hidden unit outputs of the bidirectional recurrent neural network from (3b);
(3d) optimizing all network parameters W_n, W_r, b_r, W_l, b_l, W_d and b_d with a cross-entropy loss function to obtain the trained bidirectional recurrent neural network, whose output s_t is the text feature of the t-th word, containing the position of the t-th word and its context information.
4. The method of claim 1, wherein training the long short-term memory network with the matched image-text features of (4) in step (5) comprises the following steps:
(5a) randomly initializing the input layer weight W_hx, hidden layer weight W_hh, output layer weight W_oh, hidden layer bias b_h and output layer bias b_o of the long short-term memory network; inputting the image feature v, the corresponding text features s_1, …, s_t, …, s_N and the original sentence, and iteratively computing the t-th hidden unit output h_t and the output layer output y_t of the network:
h_t = f(W_hx s_t + W_hh h_{t-1} + b_h + Θ(t=1)⊙v),
y_t = softmax(W_oh h_t + b_o),
where the image feature vector v from (2) is input only when t = 1, and h_{t-1} is the (t-1)-th hidden unit output of the network with h_0 = 0;
(5b) calculating the cross-entropy loss L(θ) of y_t:
L(θ) = - Σ_{t=1}^{N} Φ_t^T log(y_t),
where Φ_t is the indicator column vector of the desired output, i.e., the t-th word of the original sentence corresponding to the image, giving the position index of the word in the vocabulary, and N is the number of words in the sentence;
(5c) optimizing all the network parameters W_hx, W_hh, W_oh, b_h and b_o with the cross-entropy loss to obtain the trained long short-term memory network.
5. The method according to claim 1, wherein in step (6), semantic generation is performed with the long short-term memory network trained in (5) to obtain a sentence describing the image content, comprising the following steps:
(6a) extracting the image feature v′ of the remote sensing image in a test sample with the fast regional convolutional neural network of step (2);
(6b) iteratively computing the t-th hidden unit output h_t of the long short-term memory network, the network output y_t and the text feature s_{t+1} of the (t+1)-th word:
h_t = f(W_hx s_t + W_hh h_{t-1} + b_h + Θ(t=1)⊙v′),
y_t = softmax(W_oh h_t + b_o),
s_{t+1} = y_t,
where h_0 = 0, the image feature vector v′ of the test sample is input only when t = 1, W_hx denotes the input layer weight, W_hh the hidden layer weight, W_oh the output layer weight, b_h the hidden layer bias and b_o the output layer bias;
(6c) according to the index corresponding to the maximum value in y_t, looking up the word in the vocabulary; all words output in sequence form a sentence, which is the generated sentence describing the image content.
CN201810744473.0A 2018-07-09 2018-07-09 Remote sensing image semantic generation method based on fast regional convolutional neural network Active CN108960330B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810744473.0A CN108960330B (en) 2018-07-09 2018-07-09 Remote sensing image semantic generation method based on fast regional convolutional neural network


Publications (2)

Publication Number Publication Date
CN108960330A CN108960330A (en) 2018-12-07
CN108960330B true CN108960330B (en) 2021-09-10

Family

ID=64483489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810744473.0A Active CN108960330B (en) 2018-07-09 2018-07-09 Remote sensing image semantic generation method based on fast regional convolutional neural network

Country Status (1)

Country Link
CN (1) CN108960330B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740471B (en) * 2018-12-24 2021-06-22 中国科学院西安光学精密机械研究所 Remote sensing image description method based on joint latent semantic embedding
CN109784223B (en) * 2018-12-28 2020-09-01 珠海大横琴科技发展有限公司 Multi-temporal remote sensing image matching method and system based on convolutional neural network
CN111476838A (en) * 2019-01-23 2020-07-31 华为技术有限公司 Image analysis method and system
CN111753822A (en) * 2019-03-29 2020-10-09 北京市商汤科技开发有限公司 Text recognition method and device, electronic equipment and storage medium
CN110287355B (en) * 2019-05-16 2021-06-22 中国科学院西安光学精密机械研究所 Remote sensing image description method based on retrieval topic memory network
CN110232413A (en) * 2019-05-31 2019-09-13 华北电力大学(保定) Insulator image, semantic based on GRU network describes method, system, device
CN110363303B (en) * 2019-06-14 2023-07-07 平安科技(深圳)有限公司 Memory training method and device for intelligent distribution model and computer readable storage medium
CN110378335B (en) * 2019-06-17 2021-11-19 杭州电子科技大学 Information analysis method and model based on neural network
CN110418210B (en) * 2019-07-12 2021-09-10 东南大学 Video description generation method based on bidirectional cyclic neural network and depth output
US20210073317A1 (en) * 2019-09-05 2021-03-11 International Business Machines Corporation Performing dot product operations using a memristive crossbar array
CN110991284B (en) * 2019-11-22 2022-10-18 北京航空航天大学 Optical remote sensing image statement description generation method based on scene pre-classification
CN111126479A (en) * 2019-12-20 2020-05-08 山东浪潮人工智能研究院有限公司 Image description generation method and system based on unsupervised uniqueness optimization
CN112070069A (en) * 2020-11-10 2020-12-11 支付宝(杭州)信息技术有限公司 Method and device for identifying remote sensing image
CN112861882B (en) * 2021-03-10 2023-05-09 齐鲁工业大学 Image-text matching method and system based on frequency self-adaption
CN113298151A (en) * 2021-05-26 2021-08-24 中国电子科技集团公司第五十四研究所 Remote sensing image semantic description method based on multi-level feature fusion


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9940577B2 (en) * 2015-07-07 2018-04-10 Adobe Systems Incorporated Finding semantic parts in images

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868691A (en) * 2016-03-08 2016-08-17 中国石油大学(华东) Urban vehicle tracking method based on rapid region convolutional neural network
CN108073941A (en) * 2016-11-17 2018-05-25 江南大学 A kind of image, semantic generation method based on deep learning
CN107729987A (en) * 2017-09-19 2018-02-23 东华大学 The automatic describing method of night vision image based on depth convolution loop neutral net
CN107657008A (en) * 2017-09-25 2018-02-02 中国科学院计算技术研究所 Across media training and search method based on depth discrimination sequence study

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Identity-Aware Textual-Visual Matching with Latent Co-attention; Shuang Li et al.; 2017 IEEE International Conference on Computer Vision (ICCV); 2017-10-29; full text *
Convolutional Neural Networks in Image Understanding (图像理解中的卷积神经网络); Chang Liang et al.; Acta Automatica Sinica (自动化学报); 2016-06-30; full text *

Also Published As

Publication number Publication date
CN108960330A (en) 2018-12-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant