CN116052018A - Remote sensing image interpretation method based on lifelong learning - Google Patents

Remote sensing image interpretation method based on lifelong learning

Info

Publication number
CN116052018A
CN116052018A (application CN202310331512.5A)
Authority
CN
China
Prior art keywords
interpretation
layer
remote sensing
model
scene
Prior art date
Legal status
Granted
Application number
CN202310331512.5A
Other languages
Chinese (zh)
Other versions
CN116052018B (en)
Inventor
张广益
陈宇
鲁锦涛
吴皓
张玥珺
李洁
邹圣兵
Current Assignee
Beijing Shuhui Spatiotemporal Information Technology Co ltd
Original Assignee
Beijing Shuhui Spatiotemporal Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Shuhui Spatiotemporal Information Technology Co ltd filed Critical Beijing Shuhui Spatiotemporal Information Technology Co ltd
Priority to CN202310331512.5A priority Critical patent/CN116052018B/en
Publication of CN116052018A publication Critical patent/CN116052018A/en
Application granted granted Critical
Publication of CN116052018B publication Critical patent/CN116052018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a remote sensing image interpretation method based on lifelong learning, which relates to the field of remote sensing image processing and comprises the following steps: S1, constructing a combined model; S2, acquiring training samples and pre-training the combined model to obtain a first scene classification result; S3, obtaining a remote sensing image to be interpreted and uniformly cropping it; S4, inputting the cropped remote sensing images to be interpreted into the combined model in sequence to obtain a second scene classification result and interpretation information; S5, calculating a scene difference value; S6, calculating an interpretation loss value; S7, setting a selection strategy based on the scene difference value and the interpretation loss value, and retraining or expanding the dynamically expandable interpretation sub-model according to the selection strategy to obtain a final combined model; S8, interpreting new remote sensing images with the final combined model. The invention realizes lifelong learning oriented to remote sensing interpretation based on a dynamically expandable network, and avoids the catastrophic forgetting problem common in lifelong learning.

Description

Remote sensing image interpretation method based on lifelong learning
Technical Field
The invention relates to the field of remote sensing image processing, and in particular to a remote sensing image interpretation method based on lifelong learning.
Background
In the 21st century, high-resolution remote sensing images are collected from multiple angles by satellites, unmanned aerial vehicles, digital cameras, imaging spectrometers, space planes, and other devices, and applied in many different fields. How to process large volumes of remote sensing image data rapidly and effectively is an urgent problem in the remote sensing field. Manual processing of remote sensing images, while highly accurate, is inefficient and costly, and therefore undesirable. Traditional remote sensing image methods extract object features from information such as geometric shape and spatial position, and can also combine color, shadow, and texture information with LiDAR or SAR to extract effective features from three-dimensional data. A single feature-extraction method has inherent shortcomings, such as insufficient classification performance and frequent classification errors, and cannot maintain a good balance between discriminability and robustness. Increasingly mature machine learning techniques, especially deep learning, can be applied in many areas: a network is trained so that the resulting model can make accurate predictions on unknown samples. Remote sensing technology provides large amounts of reliable data, laying a foundation for the development of deep learning models. Deep learning can be applied to classification, semantic segmentation, detection, and other remote sensing image tasks, and plays a role in promoting the further development of remote sensing technology.
Deep learning methods currently applied to remote sensing image interpretation face a common problem: to achieve high interpretation accuracy on different interpretation tasks, a brand-new deep learning model must be built and trained from scratch for each task. This leads to enormous engineering effort and low model-training efficiency, extremely low effective utilization and reuse of existing remote sensing image data and already-built models, and limits large-scale engineering deployment. To solve this problem and drive the automation of remote sensing image interpretation, researchers have attempted to reuse existing models and already-learned knowledge in new remote sensing interpretation tasks using online learning and continual learning methods. Among existing continual learning methods, the simplest is to fine-tune the original network on the new training data provided by a new task. However, this simple retraining approach degrades the original network's interpretation of both new and old tasks. If the correlation between the new and old tasks is low, for example when the two tasks classify two different kinds of ground features such as wheat and buildings, the features the network learned from the old task may contribute nothing to the new task. Another difficulty is catastrophic forgetting, where the original network forgets what it learned before after learning new knowledge. This is caused by two factors: (1) the structure of a deep network is difficult to adjust once training begins, and the network structure directly determines the capacity of the learning model. A neural network with a fixed structure has limited capacity, and under limited capacity it must erase old knowledge in order to learn a new task; (2) the neurons of a deep network's hidden layers are global, so small changes to individual neurons simultaneously affect the output of the entire network. In addition, all parameters of a feed-forward network are connected to every dimension of the input, and new data is highly likely to change all parameters in the network. For a network with a fixed structure, the parameters are the only carriers of knowledge; if the changed parameters include ones highly relevant to historical knowledge, the net effect is that new knowledge overwrites old knowledge.
For the remote sensing field, ensuring that a model's ability on old interpretation tasks does not degrade while it achieves good results on new interpretation tasks, and overcoming catastrophic forgetting, are the key problems to be solved in the development of remote sensing lifelong learning technology.
Disclosure of Invention
The invention provides a remote sensing image interpretation method based on lifelong learning, which realizes lifelong learning suitable for remote sensing image interpretation through a combined model of a remote sensing image scene classification model and a dynamically expandable remote sensing image interpretation model. Known and unknown tasks are identified through remote sensing image scene classification; for new unknown tasks, model capacity is expanded and the unknown tasks are learned through expansion and retraining of the interpretation network, so that knowledge is continuously updated. Learned knowledge is fully applied to new remote sensing interpretation tasks, so that catastrophic forgetting is effectively avoided without reducing interpretation accuracy, and the utilization of existing models and data is improved.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
A remote sensing image interpretation method based on lifelong learning comprises the following steps:
s1, constructing a combined model, wherein the combined model comprises a dynamic extensible interpretation sub-model and a scene classification sub-model, and the scene classification sub-model comprises a scene classifier and a memory;
s2, obtaining training samples in a sample library, pre-training the combination model by the aid of the cut training samples, and taking the obtained pre-training results as first scene classification results and storing the first scene classification results in a memory;
s3, obtaining a plurality of remote sensing images to be interpreted, and uniformly cutting the remote sensing images to be interpreted, wherein each remote sensing image to be interpreted comprises marked ground object samples and unlabeled target interpretation samples, and the marked ground object samples comprise real labels;
s4, sequentially inputting the cut remote sensing images to be interpreted into a combined model to obtain a second scene classification result and interpretation information, wherein the interpretation information comprises interpretation information of marked ground object samples and interpretation information of unlabeled target interpretation samples;
s5, calculating a second scene classification result and a first scene classification result to obtain a scene difference value;
s6, calculating the interpretation information of the marked ground object sample and the real label of the marked ground object sample to obtain an interpretation loss value;
s7, setting a selection strategy based on the scene difference value and the interpretation loss value, and retraining and expanding the dynamic expandable interpretation sub-model according to the selection strategy to obtain a final combined model;
s8, interpreting the new remote sensing image through the final combined model.
In one embodiment of the present invention, in step S7, the selection strategy is:
First case: when the scene difference value is smaller than a first preset threshold and the interpretation loss value is smaller than a second preset threshold, the current structure of the dynamically expandable interpretation sub-model is maintained, and the final combined model is obtained;
Second case: when the scene difference value is smaller than the first preset threshold and the interpretation loss value is larger than the second preset threshold, the dynamically expandable interpretation sub-model is retrained, the combined model is updated, and the process returns to step S4;
Third case: when the scene difference value is larger than the first preset threshold and the interpretation loss value is smaller than the second preset threshold, the dynamically expandable interpretation sub-model is retrained, the combined model is updated, and the process returns to step S4;
Fourth case: when the scene difference value is larger than the first preset threshold and the interpretation loss value is larger than the second preset threshold, the dynamically expandable interpretation sub-model is expanded, the combined model is updated, and the process returns to step S4.
In one embodiment of the present invention, the dynamically expandable interpretation sub-model includes a convolutional neural network for performing interpretation tasks and an expander for expanding the convolutional neural network.
In one embodiment of the invention, expanding the dynamically expandable interpretation sub-model includes adding neurons to the convolutional neural network and training the added neurons;
retraining the dynamically extensible interpretation sub-model includes selectively adjusting portions of the network parameters.
In one embodiment of the present invention, expanding the dynamically expandable interpretation sub-model comprises:
adding a preset number of neurons to each layer of neural network;
removing newly added ineffective neurons by using group sparse regularization;
training the final augmented neurons:
$$\min_{W_l^{\mathcal{N}}} \; \mathcal{L}\big(W_l^{\mathcal{N}};\, W_l^{t-1},\, \mathcal{D}_t\big) \;+\; \mu \big\|W_l^{\mathcal{N}}\big\|_1 \;+\; \gamma \sum_{g} \big\|W_{l,g}^{\mathcal{N}}\big\|_2$$
wherein $l$ denotes the $l$-th layer of the neural network, $\mathcal{D}_t$ is the interpretation data, $W_l^{\mathcal{N}}$ are the weights of the newly added neurons at layer $l$, $\mathcal{L}$ is the loss function, $\mu$ and $\gamma$ are regularization-term parameters, $t$ is the current task, $t-1$ is the previous task, and $g$ is a group defined by the input weights of each neuron.
In one embodiment of the present invention, retraining a dynamic extensible interpretation sub-model includes:
when a new task t is received, a sparse linear classifier is installed in the last layer of the dynamically expandable interpretation sub-model:
$$\min_{W_{N,t}^{t}} \; \mathcal{L}\big(W_{N,t}^{t};\, W_{1:N-1}^{t-1},\, \mathcal{D}_t\big) \;+\; \mu \big\|W_{N,t}^{t}\big\|_1$$
where $l$ denotes the $l$-th layer of the convolutional neural network, $W_l$ are the network parameters of layer $l$, $\mu$ is the regularization strength, $N$ is the total number of layers of the network, and $W_{1:N-1}^{t-1}$ denotes the network parameters other than $W_{N,t}^{t}$;
identifying a sub-network $S$ related to the current new task $t$ according to the established sparse connections, and retraining the sub-network $S$:
$$\min_{W_{S}^{t}} \; \mathcal{L}\big(W_{S}^{t};\, W_{S^c}^{t-1},\, \mathcal{D}_t\big) \;+\; \mu \big\|W_{S}^{t}\big\|_2$$
In an embodiment of the present invention, in step S5, the scene difference value is obtained from the second scene classification result and the first scene classification result by distance calculation.
In an embodiment of the present invention, the distance calculation process is as follows:
$$c = \left[\, \frac{1}{N} \sum_{j=1}^{N} p(y=i \mid x=j) \,\right]_{i=1,\dots,M}$$
where $c$ denotes the first scene classification result, $p(y=i \mid x=j)$ is the predicted probability that the input cropped training sample $j$ belongs to category $i$, $M$ is the total number of scene categories, and $N$ is the number of cropped training samples;
the remote sensing image to be interpreted is uniformly cropped into $r$ blocks, and the second scene classification result of block $t$ is
$$c_t = \big[\, p(y=1 \mid x=t),\; \dots,\; p(y=M \mid x=t) \,\big], \quad t = 1, \dots, r;$$
$D = [d_1, d_2, \dots, d_r]$ collects the nearest distance between each second scene classification result and the first scene classification results, where
$$d_t = \min_{c \in \mathcal{M}} \big\| c_t - c \big\|_2$$
and $\mathcal{M}$ is the set of first scene classification results stored in the memory;
$D$ is sorted in descending order, the first $K$ values are selected, and the median of these $K$ values is taken as the scene difference value.
In one embodiment of the present invention, the structure of the convolutional neural network in the initial dynamically expandable interpretation sub-model comprises:
First layer: convolutional layer 1; input: a cropped image of 229×229×3; number of convolution kernels: 96; convolution kernel size: 13×13×3; stride: 4;
Second layer: pooling layer 1; pooling size: 3×3; stride: 2;
Third layer: convolutional layer 2; input: the output of the second layer; number of convolution kernels: 256; convolution kernel size: 5×5; stride: 1;
Fourth layer: pooling layer 2; pooling size: 3×3; stride: 2;
Fifth layer: convolutional layer 3; input: the output of the fourth layer; number of convolution kernels: 384; convolution kernel size: 3×3;
Sixth layer: convolutional layer 4; input: the output of the fifth layer; number of convolution kernels: 384; convolution kernel size: 3×3;
Seventh layer: convolutional layer 5; input: the output of the sixth layer; number of convolution kernels: 256; convolution kernel size: 3×3;
Eighth layer: pooling layer 3; pooling size: 3×3; stride: 2;
Ninth to eleventh layers: fully connected layers with 384, 192, and 100 neurons, respectively.
In one embodiment of the present invention, the scene classifier is a residual network ResNet-50.
The beneficial effects of the invention are as follows: a lifelong learning method suitable for remote sensing image interpretation is realized by combining a scene classification sub-model and a dynamically expandable interpretation sub-model. Known and unknown tasks are identified through remote sensing image scene classification; for new unknown tasks, model capacity is expanded and the unknown tasks are learned through expansion and retraining of the interpretation network, so that knowledge is continuously updated. Learned knowledge is fully applied to new remote sensing interpretation tasks, catastrophic forgetting is effectively avoided without reducing interpretation accuracy, and the utilization of existing models and remote sensing image data is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a remote sensing image interpretation method based on life learning;
FIG. 2 is a schematic diagram illustrating interpretation of a plurality of remote sensing images to be interpreted by the combined model;
FIG. 3 is a schematic diagram of retraining a dynamically extensible interpretation sub-model;
FIG. 4 is a schematic diagram of an extension of a dynamically extensible interpretation sub-model.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the invention, fall within the scope of protection of the invention.
Referring to fig. 1, fig. 1 is a flowchart of an embodiment of a remote sensing image interpretation method based on life learning according to the present invention, which includes the following steps:
s1, constructing a combined model, wherein the combined model comprises a dynamic extensible interpretation sub-model and a scene classification sub-model, and the scene classification sub-model comprises a scene classifier and a memory;
s2, obtaining training samples in a sample library, pre-training the combination model by the aid of the cut training samples, and taking the obtained pre-training results as first scene classification results and storing the first scene classification results in a memory;
s3, obtaining a plurality of remote sensing images to be interpreted, and uniformly cutting the remote sensing images to be interpreted, wherein each remote sensing image to be interpreted comprises marked ground object samples and unlabeled target interpretation samples, and the marked ground object samples comprise real labels;
s4, sequentially inputting the cut remote sensing images to be interpreted into a combined model to obtain a second scene classification result and interpretation information, wherein the interpretation information comprises interpretation information of marked ground object samples and interpretation information of unlabeled target interpretation samples;
s5, calculating a second scene classification result and a first scene classification result to obtain a scene difference value;
s6, calculating the interpretation information of the marked ground object sample and the real label of the marked ground object sample to obtain an interpretation loss value;
s7, setting a selection strategy based on the scene difference value and the interpretation loss value, and retraining and expanding the dynamic expandable interpretation sub-model according to the selection strategy to obtain a final combined model;
s8, interpreting the new remote sensing image through the final combined model.
The technical idea of the invention is as follows:
1) The structure of the convolutional neural network in the dynamically expandable interpretation sub-model directly determines the capacity of the learning model. A convolutional neural network with a fixed structure has limited capacity, and under limited capacity it must erase old knowledge in order to learn a new task. The invention realizes dynamic change of the convolutional neural network structure by constructing a dynamically expandable interpretation sub-model for lifelong learning, so that the capacity of the network can be expanded on demand and old knowledge can be retained while new knowledge is learned.
2) To avoid catastrophic forgetting, the invention employs selective retraining rather than conventional retraining when a new task is received. Selective retraining selects and retrains only the part of the network structure that is directly related to the current remote sensing image to be interpreted, thereby avoiding influence on neural nodes unrelated to that image;
3) In lifelong learning, the correlation between a new remote sensing image to be interpreted and the training samples is uncertain. When the correlation is low, retraining is needed to learn new knowledge; but when a highly correlated task is encountered, the existing model can handle it directly, making retraining unnecessary. To realize automatic lifelong learning, the invention obtains a second scene classification result of the remote sensing image to be interpreted through the scene classifier before expanding or retraining the convolutional neural network, and determines whether the scene is new by comparing the second scene classification result with the first scene classification results stored in the memory. The model is retrained only when facing a new scene, and not when facing a known scene, so that remote sensing interpretation scenes are effectively exploited to avoid unnecessary training and structural adjustment.
The training samples and the remote sensing images to be interpreted used in this embodiment are remote sensing images with a spatial resolution of 4 meters acquired by the Gaofen-2 (GF-2) satellite. The images are uniformly cropped, and the image size after cropping is 229×229×3.
In particular, the dynamically expandable interpretation sub-model of the present invention includes a convolutional neural network for performing interpretation tasks and an expander for expanding the convolutional neural network.
The schematic diagram of the whole combined model of the embodiment is shown in fig. 2, and the remote sensing interpretation lifelong learning function of the invention is realized by combining a scene classification sub-model based on ResNet-50 and a dynamic extensible interpretation sub-model based on AlexNet.
Specifically, this embodiment uses an improved AlexNet as the convolutional neural network in the dynamically expandable interpretation sub-model. The AlexNet is divided into an upper part and a lower part that run on two GPUs to improve computational efficiency. The improved AlexNet is an 11-layer deep neural network comprising five convolutional layers, three pooling layers, and three fully connected layers, not counting activation layers.
Specifically, in the initial combined model, the modified AlexNet network structure is as follows:
First layer: convolutional layer 1; input: a cropped image of 229×229×3; number of convolution kernels: 96; convolution kernel size: 13×13×3; stride: 4;
Second layer: pooling layer 1; pooling size: 3×3; stride: 2;
Third layer: convolutional layer 2; input: the output of the second layer; number of convolution kernels: 256; convolution kernel size: 5×5; stride: 1;
Fourth layer: pooling layer 2; pooling size: 3×3; stride: 2;
Fifth layer: convolutional layer 3; input: the output of the fourth layer; number of convolution kernels: 384; convolution kernel size: 3×3;
Sixth layer: convolutional layer 4; input: the output of the fifth layer; number of convolution kernels: 384; convolution kernel size: 3×3;
Seventh layer: convolutional layer 5; input: the output of the sixth layer; number of convolution kernels: 256; convolution kernel size: 3×3;
Eighth layer: pooling layer 3; pooling size: 3×3; stride: 2;
Ninth to eleventh layers: fully connected layers with 384, 192, and 100 neurons, respectively.
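For concreteness, the following is a minimal PyTorch sketch of the modified 11-layer AlexNet described above. The ReLU activations and the padding of convolutional layers 2-5 are assumptions (the embodiment specifies only kernel counts, kernel sizes, and strides); under those assumptions a 229×229×3 crop flows to the 256×6×6 feature map implied by the fully connected layers.

```python
# A minimal sketch of the modified 11-layer AlexNet described above (PyTorch).
# The padding on conv2-conv5 and the ReLU activations are assumptions: the
# embodiment fixes only kernel counts, kernel sizes, and strides.
import torch
import torch.nn as nn

class ModifiedAlexNet(nn.Module):
    def __init__(self, num_classes: int = 100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=13, stride=4),              # layer 1: conv1 -> 96x55x55
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # layer 2: pool1 -> 96x27x27
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),  # layer 3: conv2 -> 256x27x27
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # layer 4: pool2 -> 256x13x13
            nn.Conv2d(256, 384, kernel_size=3, padding=1),           # layer 5: conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),           # layer 6: conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),           # layer 7: conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # layer 8: pool3 -> 256x6x6
        )
        self.classifier = nn.Sequential(                             # layers 9-11: fully connected
            nn.Linear(256 * 6 * 6, 384),
            nn.ReLU(inplace=True),
            nn.Linear(384, 192),
            nn.ReLU(inplace=True),
            nn.Linear(192, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# Sanity check: a 229x229x3 crop yields a 100-way output.
# out = ModifiedAlexNet()(torch.randn(1, 3, 229, 229))  # shape (1, 100)
```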
After the remote sensing image to be interpreted is interpreted with the improved AlexNet, the obtained interpretation information comprises interpretation information of the marked ground object samples and interpretation information of the unlabeled target interpretation samples. The interpretation loss value is obtained from the interpretation information of the marked ground object samples and their real labels. In this embodiment, the interpretation information of a marked ground object sample is its interpretation label, and the interpretation loss value is derived from the similarity between the interpretation label and the real label; the label similarity may be computed as cosine similarity.
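As an illustration of this computation, the sketch below assumes the labels are represented as vectors (for example, one-hot vectors or flattened per-pixel class maps) and takes the loss as one minus the cosine similarity, so that identical labels give zero loss; the embodiment fixes only that the label similarity may be cosine similarity.

```python
# A sketch of the interpretation-loss computation, assuming vector-valued
# labels; "loss = 1 - cosine similarity" is an assumption consistent with
# a larger loss indicating a worse interpretation.
import torch
import torch.nn.functional as F

def interpretation_loss(pred_label: torch.Tensor, true_label: torch.Tensor) -> torch.Tensor:
    """Interpretation loss between predicted and real labels of marked samples."""
    cos = F.cosine_similarity(pred_label.flatten(1), true_label.flatten(1), dim=1)
    return (1.0 - cos).mean()
```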
The invention introduces the scene classification sub-model to judge whether the scene of the remote sensing image to be interpreted is related to already-learned task scenes. If it is highly related, the dynamically expandable interpretation sub-model is considered competent for this type of task, and the current task can be interpreted directly without retraining the convolutional neural network in the model; otherwise, the convolutional neural network needs to be retrained or expanded.
In this embodiment, a residual network, ResNet-50, is used to construct the scene classifier in the scene classification sub-model. In a residual network, skip connections are established between lower and higher layers to ensure information flow from lower to higher layers and to avoid the difficulty of training deep networks caused by vanishing gradients; a ResNet block containing such skip connections forms the basic logical unit of the residual network, and a deep network is built by stacking multiple ResNet blocks. ResNet-50 extracts features layer by layer to obtain the corresponding feature vectors, which are input to a SoftMax classifier for deep feature classification, yielding a scene category probability distribution. The scene probability distributions of the training samples are stored in the memory.
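A minimal sketch of this classifier, assuming the torchvision ResNet-50 backbone; the number of scene categories M and the weight initialization are left open by the embodiment.

```python
# A sketch of the ResNet-50 scene classifier; the torchvision backbone and
# the untrained initialization (weights=None) are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50

def build_scene_classifier(num_scene_classes: int) -> nn.Module:
    backbone = resnet50(weights=None)  # stacked ResNet blocks with skip connections
    backbone.fc = nn.Linear(backbone.fc.in_features, num_scene_classes)
    return backbone

def scene_probabilities(model: nn.Module, crops: torch.Tensor) -> torch.Tensor:
    """Return the scene-category probability distribution for a batch of crops."""
    with torch.no_grad():
        return torch.softmax(model(crops), dim=1)  # one M-way distribution per crop
```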
The scene difference value is obtained by calculating the distance between the second scene classification result of the remote sensing image to be interpreted and the first scene classification result in the memory:
$$c = \left[\, \frac{1}{N} \sum_{j=1}^{N} p(y=i \mid x=j) \,\right]_{i=1,\dots,M}$$
where $c$ denotes the first scene classification result, $p(y=i \mid x=j)$ is the predicted probability that the input cropped training sample $j$ belongs to category $i$, $M$ is the total number of scene categories, and $N$ is the number of cropped training samples;
the remote sensing image to be interpreted is uniformly cropped into $r$ blocks, and the second scene classification result of block $t$ is
$$c_t = \big[\, p(y=1 \mid x=t),\; \dots,\; p(y=M \mid x=t) \,\big], \quad t = 1, \dots, r;$$
$D = [d_1, d_2, \dots, d_r]$ collects the nearest distance between each second scene classification result and the first scene classification results, where
$$d_t = \min_{c \in \mathcal{M}} \big\| c_t - c \big\|_2$$
and $\mathcal{M}$ is the set of first scene classification results stored in the memory;
$D$ is sorted in descending order, the first $K$ values are selected, and the median of these $K$ values is taken as the scene difference value.
In this embodiment, K may be set to the first 30% of the values in D.
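The scene-difference computation can be sketched as follows, assuming the memory holds n first scene classification results as an (n, M) array and the r crops yield an (r, M) array of second results; the nearest-distance form of d_t is a reconstruction from the description above.

```python
# A sketch of the scene-difference value, assuming probability-vector inputs;
# the euclidean nearest-distance form of d_t is reconstructed from the text.
import numpy as np

def scene_difference(memory: np.ndarray, second: np.ndarray, k_ratio: float = 0.3) -> float:
    # d_t: distance from each crop's distribution to its nearest stored result
    d = np.linalg.norm(second[:, None, :] - memory[None, :, :], axis=2).min(axis=1)
    d_sorted = np.sort(d)[::-1]                 # descending order
    k = max(1, int(np.ceil(k_ratio * len(d_sorted))))
    return float(np.median(d_sorted[:k]))       # median of the K largest distances
```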
A selection strategy is set based on the scene difference value and the interpretation loss value, and the dynamically expandable interpretation sub-model is retrained or expanded according to the selection strategy to obtain the final combined model.
Specifically, the selection strategy is:
First case: when the scene difference value is smaller than a first preset threshold and the interpretation loss value is smaller than a second preset threshold, the current structure of the dynamically expandable interpretation sub-model is maintained, and the final combined model is obtained;
Second case: when the scene difference value is smaller than the first preset threshold and the interpretation loss value is larger than the second preset threshold, the dynamically expandable interpretation sub-model is retrained, the combined model is updated, and the process returns to step S4;
Third case: when the scene difference value is larger than the first preset threshold and the interpretation loss value is smaller than the second preset threshold, the dynamically expandable interpretation sub-model is retrained, the combined model is updated, and the process returns to step S4;
Fourth case: when the scene difference value is larger than the first preset threshold and the interpretation loss value is larger than the second preset threshold, the dynamically expandable interpretation sub-model is expanded, the combined model is updated, and the process returns to step S4.
The first preset threshold and the second preset threshold can be set according to actual conditions.
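A sketch of the four-way strategy as a simple policy function follows; tau1 and tau2 are hypothetical names for the first and second preset thresholds, and the boundary convention (>= versus >) is an assumption.

```python
# A sketch of the selection strategy; tau1/tau2 are hypothetical threshold
# names standing in for the first and second preset thresholds.
def select_action(scene_diff: float, interp_loss: float,
                  tau1: float, tau2: float) -> str:
    if scene_diff < tau1 and interp_loss < tau2:
        return "keep"      # case 1: keep the current model structure
    if scene_diff >= tau1 and interp_loss >= tau2:
        return "expand"    # case 4: expand the interpretation sub-model
    return "retrain"       # cases 2 and 3: selectively retrain
```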
Specifically, expanding the dynamically expandable interpretation sub-model comprises adding neurons to the convolutional neural network and training the added neurons; retraining the dynamically expandable interpretation sub-model comprises selectively adjusting part of the network parameters.
Referring to FIG. 3, the process of retraining AlexNet in this embodiment is as follows:
(1) For the initial training task, the convolutional neural network is trained with $\ell_1$ regularization to increase the sparsity of the network, so that each neuron is connected to only a portion of the other neurons:
$$\min_{W^{t=1}} \; \mathcal{L}\big(W^{t=1};\, \mathcal{D}_1\big) \;+\; \mu \sum_{l=1}^{N} \big\|W_l^{t=1}\big\|_1$$
where $l$ denotes the $l$-th layer of the neural network, $W_l$ are the network parameters of layer $l$, $\mu$ is the regularization strength, and $N$ is the total number of layers of the network;
(2) By maintaining $\ell_1$ sparsity throughout lifelong learning and focusing on the sub-network related to the new task, the computational load of the network can be greatly reduced. When the model receives a new task $t$, a sparse linear classifier is installed in the last layer of the model:
$$\min_{W_{N,t}^{t}} \; \mathcal{L}\big(W_{N,t}^{t};\, W_{1:N-1}^{t-1},\, \mathcal{D}_t\big) \;+\; \mu \big\|W_{N,t}^{t}\big\|_1$$
where $W_{1:N-1}^{t-1}$ denotes the network parameters other than $W_{N,t}^{t}$. Solving this optimization problem yields the connections between the output units and the hidden units of layer $N-1$. Once the sparse connections of this layer are established, all units and weights affected during training can be identified without affecting the rest of the network structure;
(3) A sub-network $S$ related to the current new task $t$ is identified from the established sparse connections and retrained:
$$\min_{W_{S}^{t}} \; \mathcal{L}\big(W_{S}^{t};\, W_{S^c}^{t-1},\, \mathcal{D}_t\big) \;+\; \mu \big\|W_{S}^{t}\big\|_2$$
Partial retraining of the network is achieved through $\ell_2$ regularization. This selective retraining of part of the network reduces computation and avoids negative transfer. In the selective-retraining diagram of FIG. 3, solid nodes are the selectively trained network nodes, $t-1$ is the previous task, and $t$ is the current task.
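Steps (1)-(3) can be condensed into the following sketch, assuming a PyTorch model whose output layer is model.classifier[-1] (as in the AlexNet sketch earlier); identifying the sub-network via the established sparse connections is simplified here to a weight-magnitude threshold.

```python
# A condensed sketch of selective retraining; the magnitude-threshold
# sub-network identification is a simplification of tracing nonzero
# connections layer by layer.
import torch

def l1_penalty(params, mu: float) -> torch.Tensor:
    return mu * sum(p.abs().sum() for p in params)

def fit_sparse_output_layer(model, loader, mu=1e-4, lr=1e-3, epochs=1):
    """Step 2: train only the last layer with l1 sparsity; lower layers stay fixed."""
    last = model.classifier[-1]
    for p in model.parameters():
        p.requires_grad_(False)
    for p in last.parameters():
        p.requires_grad_(True)
    opt = torch.optim.SGD(last.parameters(), lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            loss = ce(model(x), y) + l1_penalty(last.parameters(), mu)
            opt.zero_grad()
            loss.backward()
            opt.step()

def subnetwork_mask(model, eps: float = 1e-3):
    """Step 3 (simplified): mark parameters reachable through non-negligible connections."""
    return {name: (p.abs() > eps) for name, p in model.named_parameters()}
```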
Referring to FIG. 4, the method for expanding AlexNet is as follows: k neurons are added to each layer of the network, and newly added ineffective neurons are then removed using group sparse regularization. In the network-expansion diagram of FIG. 4, solid nodes are the neurons that are finally added and trained, crossed-out nodes are removed invalid neurons, $t-1$ is the previous task, and $t$ is the current task:
$$\min_{W_l^{\mathcal{N}}} \; \mathcal{L}\big(W_l^{\mathcal{N}};\, W_l^{t-1},\, \mathcal{D}_t\big) \;+\; \mu \big\|W_l^{\mathcal{N}}\big\|_1 \;+\; \gamma \sum_{g} \big\|W_{l,g}^{\mathcal{N}}\big\|_2$$
where $l$ denotes the $l$-th layer of the neural network, $\mathcal{D}_t$ is the interpretation data, $W_l^{\mathcal{N}}$ are the weights of the newly added neurons, $\mathcal{L}$ is the loss function, $\mu$ and $\gamma$ are regularization-term parameters, and $g$ is a group defined by the input weights of each neuron.
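A sketch of this expansion for a single fully connected layer: k neurons are appended, a group-sparsity penalty is defined over each new neuron's input weights, and neurons whose weight group collapses toward zero are pruned. Wiring the expanded layer back into a full model is omitted for brevity.

```python
# A sketch of layer expansion with group-sparse pruning; the pruning
# threshold eps is a hypothetical value.
import torch
import torch.nn as nn

def expand_linear(layer: nn.Linear, k: int) -> nn.Linear:
    """Append k output neurons to a Linear layer, keeping the old weights."""
    new = nn.Linear(layer.in_features, layer.out_features + k)
    with torch.no_grad():
        new.weight[: layer.out_features] = layer.weight
        new.bias[: layer.out_features] = layer.bias
    return new

def group_sparsity(new_weight: torch.Tensor, gamma: float) -> torch.Tensor:
    """gamma * sum_g ||w_g||_2, one group per new neuron's input weights (one row)."""
    return gamma * new_weight.norm(dim=1).sum()

def keep_mask(new_weight: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Boolean mask of newly added neurons to keep (group norm above eps)."""
    return new_weight.norm(dim=1) > eps
```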
Finally, to overcome semantic drift and catastrophic forgetting, $\ell_2$ regularization
$$\min_{W^{t}} \; \mathcal{L}\big(W^{t};\, \mathcal{D}_t\big) \;+\; \lambda \big\|W^{t} - W^{t-1}\big\|_2^2$$
keeps $W^{t}$ close to $W^{t-1}$. When $\lambda$ is small, the network learns the new task as much as possible; when $\lambda$ is large, it preserves previously learned knowledge as much as possible. The $\ell_2$ distance of each neuron between tasks $t$ and $t-1$ is computed; if the distance exceeds a threshold, the semantics of that neuron are considered to have changed significantly during training, and the corresponding neuron is duplicated and split.
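The duplicate-and-split test can be sketched per layer as follows; sigma is a hypothetical name for the drift threshold.

```python
# A sketch of the drift test behind duplicate-and-split; row i of each weight
# matrix is treated as neuron i's incoming weights.
import torch

def neurons_to_split(w_prev: torch.Tensor, w_curr: torch.Tensor, sigma: float) -> torch.Tensor:
    """Indices of neurons whose semantics changed significantly between tasks."""
    drift = (w_curr - w_prev).norm(dim=1)   # per-neuron l2 drift between t-1 and t
    return torch.nonzero(drift > sigma).flatten()
```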
With lifelong learning for remote sensing image interpretation, when facing a new interpretation task the model can effectively learn knowledge of new ground-feature types and of known types from different sources in the new task, without degrading interpretation of old tasks, and retains learned knowledge to the greatest extent.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (10)

1. A remote sensing image interpretation method based on lifelong learning, characterized by comprising the following steps:
s1, constructing a combined model, wherein the combined model comprises a dynamic extensible interpretation sub-model and a scene classification sub-model, and the scene classification sub-model comprises a scene classifier and a memory;
s2, obtaining training samples in a sample library, pre-training the combination model by the aid of the cut training samples, and taking the obtained pre-training results as first scene classification results and storing the first scene classification results in a memory;
s3, obtaining a plurality of remote sensing images to be interpreted, and uniformly cutting the remote sensing images to be interpreted, wherein each remote sensing image to be interpreted comprises marked ground object samples and unlabeled target interpretation samples, and the marked ground object samples comprise real labels;
s4, sequentially inputting the cut remote sensing images to be interpreted into a combined model to obtain a second scene classification result and interpretation information, wherein the interpretation information comprises interpretation information of marked ground object samples and interpretation information of unlabeled target interpretation samples;
s5, calculating a second scene classification result and a first scene classification result to obtain a scene difference value;
s6, calculating the interpretation information of the marked ground object sample and the real label of the marked ground object sample to obtain an interpretation loss value;
s7, setting a selection strategy based on the scene difference value and the interpretation loss value, and retraining and expanding the dynamic expandable interpretation sub-model according to the selection strategy to obtain a final combined model;
s8, interpreting the new remote sensing image through the final combined model.
2. The method for interpreting a remote sensing image according to claim 1, wherein in step S7, the selection strategy is:
First case: when the scene difference value is smaller than a first preset threshold and the interpretation loss value is smaller than a second preset threshold, the current structure of the dynamically expandable interpretation sub-model is maintained, and the final combined model is obtained;
Second case: when the scene difference value is smaller than the first preset threshold and the interpretation loss value is larger than the second preset threshold, the dynamically expandable interpretation sub-model is retrained, the combined model is updated, and the process returns to step S4;
Third case: when the scene difference value is larger than the first preset threshold and the interpretation loss value is smaller than the second preset threshold, the dynamically expandable interpretation sub-model is retrained, the combined model is updated, and the process returns to step S4;
Fourth case: when the scene difference value is larger than the first preset threshold and the interpretation loss value is larger than the second preset threshold, the dynamically expandable interpretation sub-model is expanded, the combined model is updated, and the process returns to step S4.
3. The lifelong-learning-based remote sensing image interpretation method of claim 2, wherein the dynamically expandable interpretation sub-model includes a convolutional neural network and an expander, wherein the convolutional neural network is used to perform the interpretation task, and the expander is used to expand the convolutional neural network.
4. The method for interpreting a remote sensing image based on lifelong learning as claimed in claim 3, wherein:
extending the dynamic extensible interpretation sub-model includes adding neurons of a convolutional neural network and training the added neurons;
retraining the dynamically extensible interpretation sub-model includes selectively adjusting portions of the network parameters.
5. The lifelong-learning-based remote sensing image interpretation method of claim 4, wherein expanding the dynamically expandable interpretation sub-model comprises:
adding a preset number of neurons to each layer of neural network;
removing newly added ineffective neurons by using group sparse regularization;
training the final augmented neurons:
$$\min_{W_l^{\mathcal{N}}} \; \mathcal{L}\big(W_l^{\mathcal{N}};\, W_l^{t-1},\, \mathcal{D}_t\big) \;+\; \mu \big\|W_l^{\mathcal{N}}\big\|_1 \;+\; \gamma \sum_{g} \big\|W_{l,g}^{\mathcal{N}}\big\|_2$$
wherein $l$ denotes the $l$-th layer of the neural network, $\mathcal{D}_t$ is the interpretation data, $W_l^{\mathcal{N}}$ are the weights of the newly added neurons, $\mathcal{L}$ is the loss function, $\mu$ and $\gamma$ are regularization-term parameters, $t$ is the current task, $t-1$ is the previous task, $g$ is a group defined by the input weights of each neuron, and $N$ is the total number of layers of the network.
6. The lifelong-learning-based remote sensing image interpretation method of claim 4, wherein retraining the dynamically expandable interpretation sub-model comprises:
when a new task t is received, a sparse linear classifier is installed into the last layer of the dynamically extensible interpretation sub-model:
$$\min_{W_{N,t}^{t}} \; \mathcal{L}\big(W_{N,t}^{t};\, W_{1:N-1}^{t-1},\, \mathcal{D}_t\big) \;+\; \mu \big\|W_{N,t}^{t}\big\|_1$$
where $l$ denotes the $l$-th layer of the convolutional neural network, $W_l$ are the network parameters of layer $l$, $\mu$ is the regularization strength, $N$ is the total number of layers of the network, and $W_{1:N-1}^{t-1}$ denotes the network parameters other than $W_{N,t}^{t}$;
identifying a sub-network $S$ related to the current new task $t$ according to the established sparse connections, and retraining the sub-network $S$:
$$\min_{W_{S}^{t}} \; \mathcal{L}\big(W_{S}^{t};\, W_{S^c}^{t-1},\, \mathcal{D}_t\big) \;+\; \mu \big\|W_{S}^{t}\big\|_2$$
7. The method for interpreting a remote sensing image according to claim 1, wherein in step S5, the scene difference value is obtained from the second scene classification result and the first scene classification result by distance calculation.
8. The method for interpreting a remote sensing image based on lifelong learning as recited in claim 7, wherein the distance calculation process is as follows:
$$c = \left[\, \frac{1}{N} \sum_{j=1}^{N} p(y=i \mid x=j) \,\right]_{i=1,\dots,M}$$
where $c$ denotes the first scene classification result, $p(y=i \mid x=j)$ is the predicted probability that the input cropped training sample $j$ belongs to category $i$, $M$ is the total number of scene categories, and $N$ is the number of cropped training samples;
the remote sensing image to be interpreted is uniformly cropped into $r$ blocks, and the second scene classification result of block $t$ is
$$c_t = \big[\, p(y=1 \mid x=t),\; \dots,\; p(y=M \mid x=t) \,\big], \quad t = 1, \dots, r;$$
$D = [d_1, d_2, \dots, d_r]$ collects the nearest distance between each second scene classification result and the first scene classification results, where
$$d_t = \min_{c \in \mathcal{M}} \big\| c_t - c \big\|_2$$
and $\mathcal{M}$ is the set of first scene classification results stored in the memory;
$D$ is sorted in descending order, the first $K$ values are selected, and the median of these $K$ values is taken as the scene difference value.
9. The method of claim 3, wherein the structure of the convolutional neural network in the initial dynamically expandable interpretation sub-model comprises:
First layer: convolutional layer 1; input: a cropped image of 229×229×3; number of convolution kernels: 96; convolution kernel size: 13×13×3; stride: 4;
Second layer: pooling layer 1; pooling size: 3×3; stride: 2;
Third layer: convolutional layer 2; input: the output of the second layer; number of convolution kernels: 256; convolution kernel size: 5×5; stride: 1;
Fourth layer: pooling layer 2; pooling size: 3×3; stride: 2;
Fifth layer: convolutional layer 3; input: the output of the fourth layer; number of convolution kernels: 384; convolution kernel size: 3×3;
Sixth layer: convolutional layer 4; input: the output of the fifth layer; number of convolution kernels: 384; convolution kernel size: 3×3;
Seventh layer: convolutional layer 5; input: the output of the sixth layer; number of convolution kernels: 256; convolution kernel size: 3×3;
Eighth layer: pooling layer 3; pooling size: 3×3; stride: 2;
Ninth to eleventh layers: fully connected layers with 384, 192, and 100 neurons, respectively.
10. The method for interpreting a remote sensing image based on lifelong learning as claimed in claim 1, wherein the scene classifier is a residual network ResNet-50.
CN202310331512.5A 2023-03-31 2023-03-31 Remote sensing image interpretation method based on lifelong learning Active CN116052018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310331512.5A CN116052018B (en) 2023-03-31 2023-03-31 Remote sensing image interpretation method based on lifelong learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310331512.5A CN116052018B (en) 2023-03-31 2023-03-31 Remote sensing image interpretation method based on lifelong learning

Publications (2)

Publication Number Publication Date
CN116052018A (en) 2023-05-02
CN116052018B (en) 2023-10-27

Family

ID=86133627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310331512.5A Active CN116052018B (en) 2023-03-31 2023-03-31 Remote sensing image interpretation method based on life learning

Country Status (1)

Country Link
CN (1) CN116052018B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2007135845A (en) * 2007-09-28 2009-04-10 Федеральное государственное унитарное предприятие "Научно-производственное предприятие-Всероссийский научно-исследовательский инст METHOD FOR GEOLOGICAL DECODING OF REMOTE EARTH SURFACES
CN108229516A (en) * 2016-12-30 2018-06-29 北京市商汤科技开发有限公司 For interpreting convolutional neural networks training method, device and the equipment of remote sensing images
CN110703244A (en) * 2019-09-05 2020-01-17 中国科学院遥感与数字地球研究所 Method and device for identifying urban water body based on remote sensing data
CN112347930A (en) * 2020-11-06 2021-02-09 天津市勘察设计院集团有限公司 High-resolution image scene classification method based on self-learning semi-supervised deep neural network
CN113392748A (en) * 2021-06-07 2021-09-14 中国煤炭地质总局勘查研究总院 Remote sensing image farmland information extraction method based on convolutional neural network
CN113449640A (en) * 2021-06-29 2021-09-28 中国地质大学(武汉) Remote sensing image building semantic segmentation edge optimization method based on multitask CNN + GCN
LU501790B1 (en) * 2022-04-04 2022-10-04 Univ Inner Mongolia Agri A multi-source, multi-temporal and large-scale automatic remote sensing interpretation model based on surface ecological features

Also Published As

Publication number Publication date
CN116052018B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN107169956B (en) Color woven fabric defect detection method based on convolutional neural network
Othman et al. Domain adaptation network for cross-scene classification
CN108304795B (en) Human skeleton behavior identification method and device based on deep reinforcement learning
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN108021947B (en) A kind of layering extreme learning machine target identification method of view-based access control model
CN105787501B (en) Power transmission line corridor region automatically selects the vegetation classification method of feature
WO2018208939A1 (en) Systems and methods to enable continual, memory-bounded learning in artificial intelligence and deep learning continuously operating applications across networked compute edges
CN114092832B (en) High-resolution remote sensing image classification method based on parallel hybrid convolutional network
CN112347970B (en) Remote sensing image ground object identification method based on graph convolution neural network
CN111382686B (en) Lane line detection method based on semi-supervised generation confrontation network
CN109033107A (en) Image search method and device, computer equipment and storage medium
CN112837315B (en) Deep learning-based transmission line insulator defect detection method
CN115017418B (en) Remote sensing image recommendation system and method based on reinforcement learning
CN112232151B (en) Iterative polymerization neural network high-resolution remote sensing scene classification method embedded with attention mechanism
CN111160389A (en) Lithology identification method based on fusion of VGG
CN113298129A (en) Polarized SAR image classification method based on superpixel and graph convolution network
CN111640087A (en) Image change detection method based on SAR (synthetic aperture radar) deep full convolution neural network
CN113792631B (en) Aircraft detection and tracking method based on multi-scale self-adaption and side-domain attention
CN112801204B (en) Hyperspectral classification method with lifelong learning ability based on automatic neural network
CN114066899A (en) Image segmentation model training method, image segmentation device, image segmentation equipment and image segmentation medium
CN114187506A (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
CN110837787B (en) Multispectral remote sensing image detection method and system for three-party generated countermeasure network
CN113627481A (en) Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens
CN117154256A (en) Electrochemical repair method for lithium battery
CN116052018B (en) Remote sensing image interpretation method based on lifelong learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant