CN112949388B - Image processing method, device, electronic equipment and storage medium - Google Patents

Image processing method, device, electronic equipment and storage medium

Info

Publication number
CN112949388B
Authority
CN
China
Prior art keywords
feature
image
feature point
change detection
image feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110112824.8A
Other languages
Chinese (zh)
Other versions
CN112949388A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202110112824.8A priority Critical patent/CN112949388B/en
Publication of CN112949388A publication Critical patent/CN112949388A/en
Priority to PCT/CN2021/121164 priority patent/WO2022160753A1/en
Application granted granted Critical
Publication of CN112949388B publication Critical patent/CN112949388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/176 Urban or other man-made structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image processing method, an image processing device, an electronic device and a storage medium. The method may include: performing feature extraction on two-phase images input into a target change detection network to obtain a first image feature and a second image feature; and obtaining, based on the first image feature and the second image feature, an image processing result for at least one of the two-phase images and a change detection result for the target in the two-phase images, both output by the target change detection network.

Description

Image processing method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to computer technology, and in particular, to an image processing method, an image processing apparatus, an electronic device, and a storage medium.
Background
Change detection for a given target is currently of great importance. Target change detection determines whether a target at the same location has changed between different times.
For example, in building change detection, remote sensing images taken of the same location at different times are used to determine whether a building in the images has changed, for example has disappeared, been enlarged, or been newly built, so as to monitor unauthorized construction.
It can be seen that a high-performance target change detection method is needed.
Disclosure of Invention
In view of this, the present application discloses at least one image processing method, which may include:
performing feature extraction on two-phase images input into a target change detection network to obtain a first image feature and a second image feature;
and obtaining, based on the first image feature and the second image feature, an image processing result for at least one of the two-phase images and a change detection result for the target in the two-phase images, both output by the target change detection network.
In some embodiments, the performing feature extraction on the two-phase images input into the target change detection network to obtain a first image feature and a second image feature includes:
performing feature extraction on the two-phase images respectively to obtain a first image feature corresponding to a first image of the two-phase images and a second image feature corresponding to a second image of the two-phase images;
the obtaining, based on the first image feature and the second image feature, an image processing result for at least one of the two-phase images and a target change detection result for the two-phase images, both output by the target change detection network, includes:
performing image processing on the first image feature and/or the second image feature to obtain the image processing result;
and performing feature fusion on the first image feature and the second image feature to obtain a fusion feature, and performing change detection on the target contained in the two-phase images based on the fusion feature to obtain the target change detection result.
In some embodiments, the performing feature fusion on the first image feature and the second image feature to obtain a fusion feature includes:
performing adaptive feature fusion on the first image feature and the second image feature to obtain the fusion feature.
In some embodiments, the performing adaptive feature fusion on the first image feature and the second image feature to obtain a fusion feature includes:
predicting a first offset corresponding to each feature point in the second image feature and a second offset corresponding to each feature point in the first image feature by using the first image feature and the second image feature; the first offset is used for adjusting the receptive field of each feature point in the second image feature; the second offset is used for adjusting the receptive field of each feature point in the first image feature;
Adjusting the second image feature based on the first offset to obtain an adjusted second image feature;
adjusting the first image feature based on the second offset to obtain an adjusted first image feature;
and carrying out feature fusion on the adjusted first image feature and the adjusted second image feature to obtain the fusion feature.
In some embodiments, the predicting, by using the first image feature and the second image feature, a first offset corresponding to each feature point in the second image feature and a second offset corresponding to each feature point in the first image feature includes:
superposing the first image feature and the second image feature to obtain a first superposition feature, and performing offset prediction on the first superposition feature to obtain the first offset; and
superposing the first image feature and the second image feature to obtain a second superposition feature, and convolving the second superposition feature to obtain the second offset.
In some embodiments, the performing feature fusion on the adjusted first image feature and the adjusted second image feature to obtain the fusion feature includes:
performing optical flow prediction on the adjusted first image feature and the adjusted second image feature to determine a corresponding optical flow field; wherein the optical flow field characterizes the position error of the same pixel point in the two-phase images;
warping the first image feature by using the optical flow field to obtain a warped first image feature;
and performing feature fusion on the warped first image feature and the adjusted second image feature to obtain the fusion feature.
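By way of illustration only, a minimal sketch of how such a flow-based warping step could be realized is given below (assumptions: PyTorch, a flow tensor of shape (N, 2, H, W) in pixel units with channel 0 holding the horizontal component, and the function name warp_with_flow; none of these are mandated by the method):

```python
# A minimal sketch, not the patented implementation: warp the first image feature
# with a predicted optical flow field via bilinear sampling.
import torch
import torch.nn.functional as F

def warp_with_flow(feat_a: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp feat_a (N, C, H, W) by flow (N, 2, H, W); channel 0 = x, channel 1 = y (assumption)."""
    n, _, h, w = feat_a.shape
    # Base sampling grid of pixel coordinates (requires a recent PyTorch for indexing="ij").
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat_a.device, dtype=feat_a.dtype),
        torch.arange(w, device=feat_a.device, dtype=feat_a.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(n, -1, -1, -1)  # (N, 2, H, W)
    coords = base + flow
    # Normalize to [-1, 1] as required by grid_sample, then reorder to (N, H, W, 2).
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)
    return F.grid_sample(feat_a, grid, mode="bilinear", align_corners=True)
```

The warped first image feature can then be fused, for example by concatenation, with the adjusted second image feature to obtain the fusion feature.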
In some embodiments shown, the performing optical flow prediction on the adjusted first image feature and the adjusted second image feature to determine a corresponding optical flow field includes:
encoding the adjusted first image feature and the adjusted second image feature by using a plurality of encoding layers included in an optical flow estimation network to obtain an encoding result;
decoding the encoding result by using a plurality of decoding layers included in the optical flow estimation network to obtain the optical flow field; wherein, in the decoding process, starting from the second decoding layer, the input of each decoding layer includes: the output features of the previous decoding layer, the optical flow predicted based on those output features, and the output features of the encoding layer corresponding to that decoding layer.
In some embodiments, the performing adaptive feature fusion on the first image feature and the second image feature to obtain a fusion feature includes:
performing feature fusion on the first image feature and the second image feature to obtain an initial fusion feature;
performing parallax feature extraction on the first image feature and the second image feature to obtain parallax features; wherein the parallax feature characterizes a degree of matching between feature points of the first image feature and feature points in the second image feature;
and determining weight information based on the parallax feature, and performing feature selection on the initial fusion feature by using the weight information to obtain the fusion feature.
In some embodiments shown, the determining the weight information based on the parallax characteristic includes:
and superposing the parallax feature and the initial fusion feature to obtain a superposition feature, and determining the weight information based on the superposition feature.
In some embodiments shown, the above method further comprises:
and determining the sum of the fusion feature and the parallax feature as a final fusion feature.
In some embodiments shown, the degree of matching is characterized by a cost value;
The parallax feature extraction for the first image feature and the second image feature to obtain parallax features includes:
determining a cost space based on the first image feature and the second image feature; wherein the cost space is used for representing cost values between the feature points of the first image feature and the feature points in the second image feature;
and performing cost aggregation on the cost space to obtain the parallax feature.
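By way of illustration only, the cost space and cost aggregation described above might be sketched as follows (assumptions: PyTorch, a correlation-style cost value over a small horizontal disparity range, wrap-around shifting instead of padding, and a simple convolutional aggregation head; the class name ParallaxFeature is illustrative):

```python
# A minimal sketch under the stated assumptions, not the patented implementation.
import torch
import torch.nn as nn

class ParallaxFeature(nn.Module):
    def __init__(self, channels: int, max_disp: int = 4):
        super().__init__()
        self.max_disp = max_disp
        # Cost aggregation: turn the (2*max_disp+1)-channel cost space into a parallax feature.
        self.aggregate = nn.Sequential(
            nn.Conv2d(2 * max_disp + 1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        costs = []
        for d in range(-self.max_disp, self.max_disp + 1):
            # Shift the second feature along the width; wrap-around is a simplification.
            shifted = torch.roll(feat_b, shifts=d, dims=-1)
            # Cost value: per-pixel correlation between feat_a and the shifted feat_b.
            costs.append((feat_a * shifted).mean(dim=1, keepdim=True))
        cost_space = torch.cat(costs, dim=1)   # (N, 2*max_disp+1, H, W)
        return self.aggregate(cost_space)      # parallax feature
```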
In some embodiments, the performing adaptive feature fusion on the first image feature and the second image feature to obtain a fusion feature includes:
sequentially determining each feature point in the first image feature as a current feature point, and executing: determining the similarity between each feature point included in the second image feature and the current feature point, and forming first weight information corresponding to the current feature point; wherein, the first weight information characterizes the similarity between each feature point in the second image feature and the current feature point;
performing feature aggregation on the second image features by using the first weight information to obtain aggregated second image features;
Sequentially determining each feature point in the second image feature as a current feature point, and executing: determining similarity between each feature point included in the first image feature and the current feature point, and forming second weight information corresponding to the current feature point; wherein the second weight information characterizes a similarity between each feature point in the first image feature and the current feature point;
performing feature aggregation on the first image features by using the second weight information to obtain aggregated first image features;
and superposing the aggregated first image feature and the aggregated second image feature to obtain the fusion feature.
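By way of illustration only, the similarity-weighted cross aggregation described above might be sketched as follows (assumptions: PyTorch, dot-product similarity between individual feature points rather than between neighborhoods, and softmax-normalized weight information; the function name cross_aggregate is illustrative):

```python
# A minimal sketch under the stated assumptions, not the patented implementation.
import torch
import torch.nn.functional as F

def cross_aggregate(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    n, c, h, w = feat_a.shape
    a = feat_a.flatten(2)                      # (N, C, H*W)
    b = feat_b.flatten(2)                      # (N, C, H*W)
    # Similarity between every feature point of one feature and every point of the other.
    sim = torch.einsum("nci,ncj->nij", a, b)   # (N, H*W, H*W)
    w_ab = F.softmax(sim, dim=-1)              # first weight information (points of A vs. points of B)
    w_ba = F.softmax(sim.transpose(1, 2), dim=-1)  # second weight information
    # Feature aggregation: weighted sums over the other feature's points.
    agg_b = torch.einsum("nij,ncj->nci", w_ab, b).view(n, c, h, w)  # aggregated second image feature
    agg_a = torch.einsum("nij,ncj->nci", w_ba, a).view(n, c, h, w)  # aggregated first image feature
    # Superpose (concatenate) the aggregated features to obtain the fusion feature.
    return torch.cat((agg_a, agg_b), dim=1)
```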
In some embodiments shown, the determining the similarity between each feature point included in the second image feature and the current feature point includes:
determining, for each feature point included in the second image feature, a corresponding second neighborhood, and the similarity between that second neighborhood and a first neighborhood corresponding to the current feature point; wherein the first neighborhood is a neighborhood of a first preset size formed by taking the current feature point as the center together with its surrounding feature points, and the second neighborhood is a neighborhood of a second preset size formed by taking the feature point in the second image feature as the center together with its surrounding feature points.
In some embodiments, the performing adaptive feature fusion on the first image feature and the second image feature to obtain a fusion feature includes:
sequentially determining each feature point in the first image feature as a current feature point, and executing: determining the similarity between the current feature point and each feature point included in a third neighborhood corresponding to the feature point in the second image feature at the same position as the current feature point, to form third weight information corresponding to the current feature point; wherein the third weight information characterizes the similarity between each feature point in the third neighborhood and the current feature point, and the third neighborhood is a neighborhood of a third preset size formed by taking the feature point in the second image feature at the same position as the current feature point as the center together with its surrounding feature points;
performing feature aggregation on the second image feature by using the third weight information to obtain an aggregated second image feature;
sequentially determining each feature point in the second image feature as a current feature point, and executing: determining the similarity between the current feature point and each feature point included in a fourth neighborhood corresponding to the feature point in the first image feature at the same position as the current feature point, to form fourth weight information corresponding to the current feature point; wherein the fourth weight information characterizes the similarity between each feature point in the fourth neighborhood and the current feature point, and the fourth neighborhood is a neighborhood of a fourth preset size formed by taking the feature point in the first image feature at the same position as the current feature point as the center together with its surrounding feature points;
performing feature aggregation on the first image feature by using the fourth weight information to obtain an aggregated first image feature;
and superposing the aggregated first image feature and the aggregated second image feature to obtain the fusion feature.
In some embodiments shown, the above image processing includes at least one of:
image classification; semantic segmentation; instance segmentation; panoramic segmentation; target detection.
In some embodiments shown, the above method further comprises:
performing data enhancement on at least one of the two-phase images in at least one of the following manners:
cropping the image; rotating the image; flipping the image; adjusting the brightness of the image; adjusting the contrast of the image; adding Gaussian noise to the image; adding registration errors to the two-phase images.
In some embodiments shown, the target change detection network includes a first input and a second input;
the obtaining, based on the first image feature and the second image feature, an image processing result for at least one of the two-phase images and a change detection result for the target in the two-phase images, both output by the target change detection network, includes:
taking the first image feature as the first input and the second image feature as the second input, to obtain a first image processing result for at least one of the two-phase images and a first change detection result for the target in the two-phase images, both output by the target change detection network;
taking the first image feature as the second input and the second image feature as the first input, to obtain a second image processing result for at least one of the two-phase images and a second change detection result for the target in the two-phase images, both output by the target change detection network;
and performing a weighted average of the first image processing result and the second image processing result to obtain a final image processing result for at least one of the two-phase images, and performing a weighted average of the first change detection result and the second change detection result to obtain a final change detection result for the target in the two-phase images.
In some embodiments shown, in the case where the target change detection network is a network to be trained, the method further includes:
obtaining an image processing loss according to the image processing result of the at least one of the two-phase images and the corresponding image processing label value, and obtaining a change detection loss according to the change detection result of the target in the two-phase images and the corresponding change detection label value;
And adjusting network parameters of the target change detection network based on the image processing loss and the change detection loss.
In some embodiments shown, the adjusting the network parameters of the target change detection network based on the image processing loss and the change detection loss includes:
obtaining a total loss according to the image processing loss and the change detection loss;
and adjusting network parameters of the target change detection network based on the total loss.
The application also proposes an image processing apparatus, which may include:
a feature extraction module, which is used for performing feature extraction on two-phase images input into the target change detection network to obtain a first image feature and a second image feature;
and an image processing module, which is used for obtaining, based on the first image feature and the second image feature, an image processing result for at least one of the two-phase images and a change detection result for the target in the two-phase images, both output by the target change detection network.
The application also proposes an electronic device, the device comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to invoke the executable instructions stored in the memory to implement the image processing method shown in any of the foregoing embodiments.
The present application also proposes a computer-readable storage medium storing a computer program for executing the image processing method shown in any one of the foregoing embodiments.
In the scheme, the target change detection network performs image processing on the input two-phase images to obtain an image processing result for at least one of the two-phase images and a change detection result for the target in the two-phase images. On the one hand, the image processing result and the target change detection result can be output simultaneously, which improves the utilization of the network;
on the other hand, when the target change detection network is trained, the image processing loss can be obtained based on the image processing result and the corresponding image processing label value, the change detection loss can be obtained based on the change detection result and the corresponding change detection label value, and the network parameters of the target change detection network can be adjusted accordingly. The training of the image processing task thus assists the training of the target change detection task, realizing joint training of the image processing task and the change detection task, so that a large number of training samples with change detection label values need not be constructed, the problem of positive/negative sample imbalance is alleviated, the overall training efficiency is improved, and the target change detection performance of the network is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions of one or more embodiments of the present application or of the related art, the following description will briefly describe the drawings that are required to be used in the embodiments or the related art descriptions, and it is apparent that the drawings in the following description are only some embodiments described in one or more embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of a building change detection flow shown in the present application;
FIG. 2 is a method flow chart of an image processing method shown in the present application;
FIG. 3 is a schematic diagram of a building change detection network structure shown in the present application;
FIG. 4 is a method flow diagram of an image processing method shown in the present application;
FIG. 5a is a schematic diagram of an adaptive feature fusion method shown in the present application;
FIG. 5b is a schematic diagram of an adaptive feature fusion method shown in the present application;
FIG. 6 is a schematic diagram of a warp procedure shown in the present application;
FIG. 7 is a schematic flow chart of an adaptive feature fusion method shown in the present application;
FIG. 8 is a schematic diagram of a feature aggregation flow shown in the present application;
FIG. 9 is a method flow diagram of a building change detection method shown in the present application;
fig. 10 is a schematic structural view of an image processing apparatus shown in the present application;
fig. 11 is a hardware configuration diagram of an electronic device shown in the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items. It will also be appreciated that the term "if," as used herein, may be interpreted as "when" or "upon" or "in response to a determination," depending on the context.
Building change detection is commonly performed in the related art using neural network technology.
Referring to fig. 1, fig. 1 is a schematic diagram of a building change detection flow shown in the present application.
As shown in fig. 1, the building change detection network for performing the building change detection may include a feature extraction sub-network and a building change detection sub-network (hereinafter referred to as a detection sub-network). Wherein the output of the feature extraction sub-network is the input of the detection sub-network.
The feature extraction sub-network is used for extracting features of the first image and the second image to obtain the first image feature and the second image feature. At least the following two ways may be adopted: first, extracting features of the first image and the second image respectively through the same feature extraction sub-network; second, extracting features of the first image and the second image respectively through two weight-sharing feature extraction sub-networks.
The first image and the second image may be remote sensing images including a building taken at different times for the same place.
The feature extraction sub-network may be a network determined based on a convolutional neural network. For example, the feature extraction sub-network may be a network constructed based on the VGGNet (Visual Geometry Group Network) series, the ResNet (Residual Network) series, the Inception series, the DenseNet (Densely Connected Network) series, or the like. The structure of the feature extraction sub-network is not particularly limited in this application.
The first image feature and the second image feature are image features corresponding to the first image and the second image, respectively. In some examples, the first image feature and the second image feature may include a multi-channel feature map.
The detection sub-network is used for obtaining a building change detection result based on the first image feature and the second image feature. In some examples, the building change detection may be a pixel level detection, that is, the change detection result may indicate whether each pixel in the first image and the second image changes, and a confidence level of the change.
In some examples, the detection sub-network may include a semantic segmentation network such as an FCN (Fully Convolutional Network) or SegNet (Segmentation Network). The semantic segmentation network can perform operations such as convolution, downsampling and upsampling on the image features and map the features of the input image onto a feature map of the same size, so that a prediction can be generated for each pixel in the feature map while the spatial information of the input image is retained. Finally, the pixels on the feature map are classified, thereby obtaining pixel-level change detection confidence values. The specific structure of the detection sub-network is not particularly limited in this application.
When building change detection is performed based on the first image and the second image, the first image and the second image may be input into a feature extraction sub-network to obtain a first image feature and a second image feature.
And then, inputting the first image characteristic and the second image characteristic into a detection sub-network to obtain a detection result.
Currently, training is typically performed in a supervised manner when training the building change detection network. That is, it is necessary to construct training samples with varying detection tag values. And then performing network training in a supervised manner.
However, in practice, building areas that change (changed samples, which can be regarded as positive samples) are usually few, while building areas that do not change (unchanged samples, which can be regarded as negative samples) are many. Therefore, when training samples are constructed from remote sensing images, the number of negative samples far exceeds that of positive samples, which causes a serious sample imbalance, hinders network training and convergence, and thus degrades the change detection performance of the network.
In view of this, the present application proposes an image processing method. According to the method, a target change detection network performs image processing on the input two-phase images to obtain an image processing result for at least one of the two-phase images and a change detection result for the target in the two-phase images. On the one hand, the image processing result and the target change detection result can be output simultaneously, which improves the utilization of the network;
on the other hand, when the target change detection network is trained, the image processing loss can be obtained based on the image processing result and the corresponding image processing label value, the change detection loss can be obtained based on the change detection result and the corresponding change detection label value, and the network parameters of the target change detection network can be adjusted accordingly. The training of the image processing task thus assists the training of the target change detection task, realizing joint training of the image processing task and the change detection task, so that a large number of training samples with change detection label values need not be constructed, the problem of positive/negative sample imbalance is alleviated, the overall training efficiency is improved, and the target change detection performance of the network is improved.
The image processing method can be applied to training tasks of the target change detection network and testing or application tasks of the target change detection network.
Referring to fig. 2, fig. 2 is a flowchart of an image processing method shown in the present application. The image processing and target change detection processes mentioned in fig. 2 may refer to the image processing task and the target change detection task involved in a training task, or to those involved in a test or application task of the target change detection network.
The method may include:
s202, extracting features of two-stage images input into a target change detection network to obtain a first image feature and a second image feature;
and S204, obtaining an image processing result for at least one period of images in the two periods of images and a change detection result for the target in the two periods of images, which are output by the target change detection network, based on the first image feature and the second image feature.
The target change detection network can be applied to a network training task, an image processing task and a target change detection task. The target may be any object preset according to the service requirement. For example, the target may be a building, vegetation, a vehicle, a person, or the like. The target change detection task may be to perform change detection for buildings, vehicles, people, vegetation, and the like.
In this application, the building change detection task will be described as an example of the target change detection task. It will be appreciated that the change detection task for any other target may be implemented with reference to the building change detection task, and is not described in detail in this application.
Referring to fig. 3, fig. 3 is a schematic diagram of a building change network structure shown in the present application.
As shown in fig. 3, the above-described building change detection network may include a twin feature extraction sub-network, an image processing sub-network, and a building change detection sub-network; wherein the two branches of the twin feature extraction sub-network share parameter weights, and the image processing sub-network and the building change detection sub-network, arranged in parallel, are each connected to the twin feature extraction sub-network.
The image processing sub-network may perform image processing on the first image feature and/or the second image feature to obtain the image processing result.
It will be appreciated that different image processing operations may correspond to different configurations of the image processing subnetwork.
For example, when the image processing is semantic segmentation, the image processing sub-network may be a semantic segmentation sub-network for detecting a semantic segmentation result, such as an FCN (Fully Convolutional Network) or SegNet (Segmentation Network).
For another example, when the image processing is image classification, the image processing sub-network may be a classification sub-network for detecting the image classification result, such as a binary or multi-class classifier constructed based on a deep convolutional neural network.
For another example, when the image processing is instance segmentation or panoramic segmentation, the image processing sub-network may be a segmentation sub-network for detecting the image segmentation result, such as Mask-RCNN (Mask Region Convolutional Neural Network).
For another example, when the image processing is target detection, the image processing sub-network may be a target detection sub-network for detecting the region where the target is located, such as one constructed based on RCNN (Region Convolutional Neural Network), Fast-RCNN, or Faster-RCNN.
In some examples, the image processing described above includes, but is not limited to, any of the following: image classification; semantic segmentation; instance segmentation; panoramic segmentation; target detection.
Because any of these image processing tasks uses the image features extracted by the feature extraction sub-network, training the image processing task also trains the feature extraction sub-network, thereby assisting the training of the building change detection task.
In some examples, the image processing described above may be semantic segmentation, for the purpose of facilitating annotation of training samples. Since pixel-level labeling is generally required when constructing samples for training the building change detection task, when the image processing is semantic segmentation, the semantic segmentation label value of each pixel can be annotated at the same time as the change detection samples are constructed, so that samples for training the semantic segmentation task are obtained simultaneously, thereby simplifying the construction of training samples.
In the building change detection task, the image processing method can be applied to a training task of a building change detection network, and can also be applied to a test or application task of the building change detection network, and the image processing task of a building in a remote sensing image and the change detection task of the building can be realized by adopting the same or similar methods in the two types of tasks. For ease of understanding, the training task of the building change detection network is described herein by way of example, and of course, the image processing and change detection processes involved therein may also be applicable to testing or application tasks of the building change detection network.
The training task is an end-to-end training method for the building change detection network, namely training is carried out on the building change detection network, and training of all sub-networks included in the network is completed. Of course, in some examples, the sub-networks may be pre-trained first, and the present application is not limited thereto.
The training method can be applied to the electronic equipment. The electronic device may execute the training method by carrying a software system corresponding to the training method. Note that, the types of the electronic devices may be a notebook computer, a server, a mobile phone, a PAD terminal, and the like, and are not particularly limited in this application. The electronic device may be a client device or a server device, and is not particularly limited herein.
Referring to fig. 4, fig. 4 is a flowchart of an image processing method shown in the present application.
As shown in fig. 4, the above-described method may include,
S402, performing feature extraction on two-phase image samples input into a building change detection network to obtain a first image feature and a second image feature.
It will be appreciated that if the building change detection network is the network to be trained, then a preparation operation may be performed prior to performing the network training. The preparation operations include, but are not limited to, constructing a training sample set, determining a network structure, initializing network parameters, determining super parameters such as training times, and the like. The above preparation operation is not described in detail here.
The two-phase image may be two-phase remote sensing images of a building taken at different times for the same location. For example, a training sample may be constructed using a remote sensing image that includes a building and a remote sensing image that does not include a building.
In some examples, the first image and the second image may be sequentially extracted by using the same feature extraction sub-network, so as to obtain the first image feature and the second image feature.
In some examples, to improve network performance, the two-phase image may be extracted by using a twin feature extraction sub-network as shown in fig. 3, to obtain a first image feature corresponding to a first image of the two-phase image and a second image feature corresponding to a second image of the two-phase image.
For example, the twin feature extraction subnetwork described above may be a convolutional network built based on the VGGNet architecture. The two-phase image may be subjected to operations such as convolution, pooling, etc. using the convolution network to obtain the first image feature and the second image feature, respectively. The structure of the above-described twin feature extraction subnetwork is not particularly limited in the present application. It will be appreciated that in some examples modules such as attention mechanisms, pyramid pooling, etc. may also be added to the sub-network in order to enhance the feature extraction effect.
The twin feature extraction sub-network is used for extracting features of the two-stage images, so that feature extraction efficiency can be improved, a building change detection network structure is simplified, network operation amount is reduced, and network performance is improved.
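By way of illustration only, a weight-sharing (twin) feature extractor might be sketched as follows (assumptions: PyTorch and a small VGG-style backbone; the class name TwinExtractor and the layer sizes are illustrative):

```python
# A minimal sketch under the stated assumptions, not the patented implementation.
import torch
import torch.nn as nn

class TwinExtractor(nn.Module):
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        # A single backbone instance is reused for both images, so the weights are shared.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, image_a: torch.Tensor, image_b: torch.Tensor):
        # The same weights produce the first image feature and the second image feature.
        return self.backbone(image_a), self.backbone(image_b)
```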
After obtaining the first image feature and the second image feature, the following steps may be performed:
and S404, obtaining, based on the first image feature and the second image feature, an image processing result for at least one of the two-phase images and a building change detection result for the two-phase images, both output by the building change detection network.
In some examples, the image processing sub-network shown in fig. 3 may be used to perform image processing on the first image feature and/or the second image feature to obtain the image processing result.
It will be appreciated that three schemes are included in the above step. The following description takes as an example the image processing result obtained by the image processing sub-network performing image processing on the first image feature.
For example, when the image processing is semantic segmentation, the semantic segmentation sub-network (such as an FCN) may convolve the first image feature several times and downsample it, and then use deconvolution layers to upsample the intermediate features to a feature map of the same size as the first image, so that a prediction can be generated for each pixel in the feature map while the spatial information of the first image is retained. Finally, each pixel on the upsampled feature map is classified as foreground or not, thereby obtaining the semantic segmentation result corresponding to the first image.
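By way of illustration only, such an FCN-style segmentation head might be sketched as follows (assumptions: PyTorch, a single-scale head, bilinear upsampling instead of learned deconvolution, and two classes, foreground and background; the class name SegmentationHead is illustrative):

```python
# A minimal sketch under the stated assumptions, not the patented implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    def __init__(self, in_ch: int, num_classes: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, num_classes, 1),
        )

    def forward(self, feat: torch.Tensor, out_size):
        logits = self.conv(feat)
        # Upsample the per-pixel predictions back to the size of the first image.
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)
```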
In the above scheme, if the building change detection network is a network to be trained, the semantic segmentation sub-network may be used to obtain a semantic segmentation result of the first image, and then the semantic segmentation label value corresponding to the first image that is labeled in advance may be used to obtain a semantic segmentation loss, so as to realize training of the image processing task, and further realize the joint training.
In some examples, the first image feature and the second image feature may be used as input to a building change detection sub-network to perform change detection, so as to obtain a building change detection result.
In some examples, the building change detection sub-network shown in fig. 3 may be used to perform feature fusion on the first image feature and the second image feature to obtain a fusion feature, and then perform change detection on the building included in the two-stage image based on the fusion feature to obtain the building change detection result.
For example, the detection sub-network may be a semantic segmentation network constructed based on an FCN. When performing change detection, the fusion feature may be convolved several times and downsampled, and then deconvolution layers may be used to upsample the intermediate features to a feature map of the same size as the two-phase images, so that a prediction can be generated for each pixel in the feature map while the spatial information of the input images is retained. Finally, whether a change has occurred is classified pixel by pixel on the upsampled feature map, thereby obtaining the building change detection result for the two-phase images.
In this example, feature fusion is performed first, so that information brought by fusion of the two image features can be combined besides information carried by the first image feature and the second image feature, and further performance of building change detection by the network is improved.
In this embodiment, if the building change detection network is a trained network, the output image processing result and the building change detection result may be directly used as the final result. If the building change detection network is a network to be trained, the method may further include, after obtaining the image processing result and the building change detection result:
s406, obtaining image processing loss according to the image processing result and the corresponding image processing label value, and obtaining change detection loss according to the change detection result and the corresponding change detection label value.
The image processing label value is specifically a true value marked when an image processing task training sample is constructed. It will be appreciated that different image processing operations correspond to different image processing tag values.
For example, when the image processing is semantic segmentation, the above-described image processing tag value may be a semantic segmentation tag value indicating whether each pixel of the input sample is foreground. For example, a label value of 1 indicates that a pixel is foreground, and a label value of 0 indicates that a pixel is background.
For another example, when the image processing is image classification, the above-described image processing tag value may be an image classification tag value indicating whether the input sample is a building image. For example, a label value of 1 may indicate that the image is a building image and 0 indicates that the image is not a building image.
For another example, when the image processing is an instance segmentation or a panorama segmentation, the above-described image processing tag value may be an instance segmentation or panorama segmentation tag value indicating a building bounding box contained in the input sample and the number of the above-described bounding box. For example, the instance-segmentation tag value may include a tag value indicating a type of object within the bounding box and location information of the bounding box. For another example, when the image processing is the target detection, the above-described image processing tag value may be a target detection tag value indicating a building bounding box included in the input sample. For example, the object detection tag value may include building bounding box coordinate information included in the image, and a tag value indicating whether an object within the bounding box is a building.
The change detection label value is specifically the ground-truth value annotated when constructing training samples for the building change detection task. For example, the change detection label value may be a label value indicating whether or not each pixel of the input sample has changed.
In some examples, the image processing loss between the image processing result and the corresponding image processing label value may be determined based on a preset first loss function. The change detection loss between the change detection result and the corresponding change detection label value may then be determined based on a preset second loss function.
The first loss function and the second loss function may be cross entropy loss function, exponential loss function, mean square error loss function, etc. The specific types of the first loss function and the second loss function are not limited in this application.
After the image processing loss and the change detection loss are obtained, S408 may be executed to adjust the network parameters of the building change detection network based on the image processing loss and the change detection loss.
In some examples, the total loss may be obtained as the sum of the above image processing loss and change detection loss. The network parameters of the building change detection network are then adjusted based on the total loss. In some examples, the two losses may also be combined by subtraction, multiplication, or the like to obtain the total loss, which is not described in detail herein.
In some examples, after the total loss is obtained, the magnitude of the current gradient descent step may be determined, for example by stochastic gradient descent, and the network parameters of the building change detection network are then adjusted by back-propagation according to that magnitude. It will be appreciated that other ways of adjusting the network parameters may be used in practice, which are not described in detail herein.
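By way of illustration only, one joint training step might be sketched as follows (assumptions: PyTorch, cross-entropy for both losses, the total loss taken as their sum, and an SGD-style optimizer; the names network, optimizer and train_step are illustrative):

```python
# A minimal sketch under the stated assumptions, not the patented implementation.
import torch
import torch.nn.functional as F

def train_step(network, optimizer, image_a, image_b, seg_label, change_label):
    # The network is assumed to output per-pixel logits for both tasks.
    seg_logits, change_logits = network(image_a, image_b)
    image_processing_loss = F.cross_entropy(seg_logits, seg_label)        # vs. image processing label values
    change_detection_loss = F.cross_entropy(change_logits, change_label)  # vs. change detection label values
    total_loss = image_processing_loss + change_detection_loss            # sum of the two losses
    optimizer.zero_grad()
    total_loss.backward()   # back-propagate and adjust the network parameters
    optimizer.step()
    return total_loss.item()
```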
The operations of S402-S408 may then be repeated using the constructed training sample set until the building change detection network converges, completing the joint training. It should be noted that the convergence condition may be, for example, reaching a preset number of training iterations, or the variation of the joint training loss over M consecutive forward passes (M being a positive integer greater than 1) falling below a certain threshold. The condition for model convergence is not particularly limited in the present application.
In the above scheme, the method enables the building change detection network to perform three tasks. The first is a feature extraction task, namely performing feature extraction on the input two-phase images to obtain extracted features; the second is an image processing task, namely obtaining an image processing result for at least one of the two-phase images by using the extracted features output by the feature extraction task; the third is a building change detection task, namely obtaining a change detection result of the building in the two-phase images by using the extracted features output by the feature extraction task.
Therefore, during training, the image processing loss can be obtained based on the image processing result and the corresponding image processing label value, the change detection loss can be obtained based on the change detection result and the corresponding change detection label value, and the network parameters of the building change detection network can be adjusted accordingly. Training of the image processing task thus assists training of the feature extraction task, realizing joint training of the image processing task and the building change detection task, so that a large number of training samples with change detection label values need not be constructed, the problem of positive/negative sample imbalance is alleviated, the overall training efficiency is improved, and the building change detection performance of the network is improved.
In some examples, when constructing the training sample set, at least one of the following data enhancement approaches may be employed: cropping the sample image; rotating the sample image; flipping the sample image; adjusting the brightness of the sample image; adjusting the contrast of the sample image; adding Gaussian noise to the sample image; adding registration errors to the two-phase image samples.
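By way of illustration only, some of the listed data enhancement modes might be sketched as follows (assumptions: PyTorch tensors and a recent torchvision whose functional transforms accept tensors; simulating a registration error as a small random translation of one phase, and applying the photometric changes to one phase only, are illustrative choices):

```python
# A minimal sketch under the stated assumptions, not the patented implementation.
import random
import torch
import torchvision.transforms.functional as TF

def augment_pair(img_a: torch.Tensor, img_b: torch.Tensor):
    if random.random() < 0.5:                        # flip both phases consistently
        img_a, img_b = TF.hflip(img_a), TF.hflip(img_b)
    angle = random.choice([0, 90, 180, 270])         # rotate both phases consistently
    img_a, img_b = TF.rotate(img_a, angle), TF.rotate(img_b, angle)
    img_a = TF.adjust_brightness(img_a, random.uniform(0.8, 1.2))   # brightness
    img_a = TF.adjust_contrast(img_a, random.uniform(0.8, 1.2))     # contrast
    img_a = img_a + 0.01 * torch.randn_like(img_a)   # Gaussian noise
    # Simulated registration error: shift one phase by a few pixels.
    dx, dy = random.randint(-3, 3), random.randint(-3, 3)
    img_b = torch.roll(img_b, shifts=(dy, dx), dims=(-2, -1))
    return img_a, img_b
```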
Relatively abundant training samples can be obtained through data augmentation, which alleviates, to a certain extent, the over-fitting problem commonly encountered when training deep learning models and enhances the robustness of the model.
In some examples, to further improve network robustness, after performing target change detection to obtain an image processing result (hereinafter referred to as the first image processing result) and a change detection result (hereinafter referred to as the first change detection result), the input positions of the first image feature and the second image feature may be exchanged, and image processing and change detection may be performed again to obtain a second image processing result and a second change detection result. The first image processing result and the second image processing result are then weighted-averaged to obtain the final image processing result, and the first change detection result and the second change detection result are weighted-averaged to obtain the final change detection result. The weights are not particularly limited in this application; for example, the weight may be 0.5.
In the above example, the change detection may be made again after the input positions of the first image feature and the second image feature are exchanged, and the final image processing result is a weighted average result of the first image processing result and the second image processing result, and the final change detection result is a weighted average result of the first change detection result and the second change detection result, so that an error caused by the input sequence of the first image and the second image may be avoided, and further, the accuracy and the robustness of the building change detection of the network may be improved.
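By way of illustration only, the swap-and-average step might be sketched as follows (assumptions: a callable detection_head that accepts its two feature inputs in either order and returns an image processing result and a change detection result, and equal weights of 0.5):

```python
# A minimal sketch under the stated assumptions, not the patented implementation.
def swap_averaged_results(detection_head, feat_a, feat_b, w: float = 0.5):
    proc_1, change_1 = detection_head(feat_a, feat_b)    # first input order
    proc_2, change_2 = detection_head(feat_b, feat_a)    # swapped input order
    final_proc = w * proc_1 + (1.0 - w) * proc_2         # final image processing result
    final_change = w * change_1 + (1.0 - w) * change_2   # final change detection result
    return final_proc, final_change
```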
With continued reference to fig. 3, a building change detection sub-network may include a fusion unit and a detection unit.
The fusion unit is used for fusing the first image feature and the second image feature to obtain a fusion feature. The detection unit may be a semantic segmentation network, and may perform pixel-level semantic segmentation based on the fusion feature to determine whether a building in the input two-phase image changes.
In some examples, the above fusion may be performed in the above fusion unit using operations such as superposition, addition, multiplication, and the like.
In practical applications, although the two input remote sensing images are taken of the same location, on the one hand, due to the shooting angle of the remote sensing images, the roof and the base of a building appearing in the images are often offset from each other (hereinafter referred to as the offset), so that the building appears deformed; on the other hand, because the two-phase remote sensing images are registered with each other, registration deviations often exist, that is, the same building may appear at different positions in the two-phase images. Both of these factors may affect the building change detection result and cause false alarms.
In some examples, in order to reduce the false alarm rate and improve the accuracy of detecting the building change, the first image feature and the second image feature may be adaptively fused to obtain a fused feature.
Because adaptive feature fusion is adopted, the fusion can be performed according to the respective image features of the two-phase images. Compared with fusion by superposition, addition, multiplication, or the like, the errors caused by the offset and the registration deviation can be corrected during fusion, thereby improving the accuracy of building change detection and reducing the false alarm rate.
In some examples, the receptive fields of the feature points in the first image feature and the second image feature may be enlarged based on deformable convolution, and the first image feature and the second image feature may then be adjusted based on the enlarged receptive fields, so as to obtain a first image feature and a second image feature in which the error caused by the offset is eliminated.
The receptive field may refer to the size of the region of the input image to which a pixel on the feature map output by each layer of the convolutional neural network corresponds. For example, taking a 3×3 convolution kernel as an example, in a conventional convolution the receptive field of a feature point covers only the 9 neighboring points in the vertical and horizontal directions, which is small. The receptive field of the feature point on the input image can be enlarged through deformable convolution.
Specifically, S502 may be executed first, and a first offset amount corresponding to each feature point in the second image feature and a second offset amount corresponding to each feature point in the first image feature are predicted using the first image feature and the second image feature; the first offset is used for adjusting the receptive field of each feature point in the second image feature; the second offset is used for adjusting the receptive field of each feature point in the first image feature.
Then, the second image feature can be adjusted by deformable convolution based on the first offset to obtain an adjusted second image feature, and the first image feature can be adjusted by deformable convolution based on the second offset to obtain an adjusted first image feature.
And finally, carrying out feature fusion on the adjusted first image feature and the adjusted second image feature to obtain the fusion feature.
The first offset and the second offset are position offsets of the feature points in the corresponding image feature; for example, an offset may be the displacement of a feature point along the X-axis and Y-axis directions. Each offset is used to adjust the receptive field of the feature points in the corresponding image feature. In a conventional convolution with a 3×3 kernel, the kernel is convolved with the 3×3 region of feature points surrounding the current feature point, so the receptive field of that feature point is limited to this region. With the offset, a deformable convolution convolves the kernel with the region formed by the feature points determined after applying the offset. The receptive field of the feature point is thereby adjusted and becomes larger than in a conventional convolution, which helps eliminate the error caused by the offset between the roof and the base.
For example, in a remote sensing image containing a building, the roof and the base of the building are typically offset from each other (i.e., the base appears at a different position in the image than the roof). By determining the offset and performing a deformable convolution based on it, the receptive field of a pixel point can be enlarged so that the base and roof features of the same building fall within that receptive field. The change detection error caused by the roof and base features not being perceived within the same receptive field is thereby eliminated.
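The receptive-field adjustment can be sketched with the deformable convolution provided by torchvision; predicting the offset from a single feature (rather than from both features, as in fig. 5) and the channel sizes are simplifications for illustration only.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

channels, k = 64, 3
# A small conv predicts one (dx, dy) pair per kernel position and per pixel,
# letting the 3x3 kernel sample outside its regular neighbourhood.
offset_pred = nn.Conv2d(channels, 2 * k * k, kernel_size=3, padding=1)
deform = DeformConv2d(channels, channels, kernel_size=k, padding=1)

feat = torch.randn(1, channels, 32, 32)   # e.g. the second image feature
offset = offset_pred(feat)                # offset, shape (1, 18, 32, 32)
adjusted = deform(feat, offset)           # feature with an enlarged receptive field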
Referring to fig. 5a and 5b, fig. 5a and 5b illustrate an adaptive feature fusion method according to the present application. Schematically, in fig. 5a and 5b, XA represents the first image feature, XB the second image feature, XA_D the first image feature adjusted with the deformable convolution operation, and XB_D the second image feature adjusted with the deformable convolution operation.
In some examples, to enhance the feature fusion effect, the positions of the first image feature and the second image feature may be swapped in determining the first offset and the second offset to eliminate the influence due to the input position of the input image.
As shown in fig. 5a, the first image feature and the second image feature may be superimposed, with the first image feature placed before the second image feature, to obtain a first superimposed feature. Being placed before means that the feature layers of the first image feature precede the feature layers of the second image feature. Offset prediction is then performed on the first superimposed feature to obtain the first offset.
As shown in fig. 5b, the first image feature and the second image feature may be superimposed, with the second image feature placed before the first image feature, to obtain a second superimposed feature. The second superimposed feature is then convolved to obtain the second offset.
The second image feature is then deformably convolved using the first offset, and the first image feature is deformably convolved using the second offset, giving the adjusted first and second image features. Finally, feature fusion is performed on the adjusted first image feature and the adjusted second image feature to obtain the fusion feature.
In this example, when the first offset and the second offset are determined, the positions of the first image feature and the second image feature may be exchanged, so that the influence caused by the input position of the input image is eliminated, and the accuracy and the robustness of building change detection are improved.
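A hedged sketch of the swapped-order offset prediction of figs. 5a and 5b is given below; sharing one offset predictor and one deformable convolution between the two orders, and fusing by concatenation, are assumptions rather than requirements of this application.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SwappedOffsetFusion(nn.Module):
    # Predicts the first offset from [XA, XB] and the second offset from
    # [XB, XA], applies a deformable convolution to each feature with the
    # corresponding offset, and concatenates the adjusted features.
    def __init__(self, c=64, k=3):
        super().__init__()
        self.offset_pred = nn.Conv2d(2 * c, 2 * k * k, 3, padding=1)
        self.deform = DeformConv2d(c, c, k, padding=1)

    def forward(self, xa, xb):
        off1 = self.offset_pred(torch.cat([xa, xb], dim=1))  # first offset (for XB)
        off2 = self.offset_pred(torch.cat([xb, xa], dim=1))  # second offset (for XA)
        xb_d = self.deform(xb, off1)           # adjusted second image feature
        xa_d = self.deform(xa, off2)           # adjusted first image feature
        return torch.cat([xa_d, xb_d], dim=1)  # fused feature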
In some examples, to further improve building change detection accuracy and reduce the false alarm rate, the optical flow field between the two images can be used to warp (translate and rotate) the image features corresponding to either image, so that the same building is located at the same position in the two images and the error caused by the registration deviation is eliminated.
It will be appreciated that three warping schemes are possible: warping the first image feature, warping the second image feature, or warping both. The following description takes warping the first image feature as an example.
Referring to fig. 6, fig. 6 is a schematic diagram of the warping process shown in the present application. Illustratively, XA_D is used in fig. 6 to represent the first image feature adjusted with the deformable convolution operation, XB_D to represent the second image feature adjusted with the deformable convolution operation, and XAB to represent the fusion feature.
As shown in fig. 6, S602 may be executed first, and optical flow prediction is performed on the adjusted first image feature and the adjusted second image feature by using an optical flow estimation network (optical flow estimation unit in fig. 6), so as to determine a corresponding optical flow field; wherein the optical flow field characterizes the position error of the same pixel point in the two-phase images.
The optical flow estimation network may be a neural network built on a structure such as FlowNetSimple (the simple FlowNet variant) or FlowNetCorr (the correlation-based FlowNet variant).
In some examples, a FlowNetCorr structure may be employed, and the optical flow estimation network may use a conventional self-supervised optical flow loss function. For example, the loss may compare the predicted image It', obtained by warping the image Is with the estimated optical flow, against the real image It, where SSIM characterizes the structural similarity between the predicted image It' and the real image It.
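Since the exact loss formula is not reproduced here, the following is only a rough sketch of an SSIM-based self-supervised term, using a simplified global (non-windowed) SSIM; the function names and the 1 − SSIM form are assumptions.

import torch

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified global (non-windowed) SSIM between two images scaled to [0, 1].
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def flow_selfsup_loss(pred_it, real_it):
    # pred_it: the image It' predicted by warping Is with the estimated flow.
    return 1.0 - global_ssim(pred_it, real_it)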
As shown in fig. 6, the optical flow estimation unit may include an encoding unit and a decoding unit.
Wherein the coding unit may comprise a plurality of coding layers constructed of a convolutional layer and a pooling layer. In the above coding unit, the adjusted first image feature and the adjusted second image feature may be coded to obtain a coding result.
The decoding unit may include a plurality of decoding layers in one-to-one correspondence with the encoding layers, and decodes the encoding result to obtain the optical flow field. During decoding, starting from the second decoding layer, the input of each decoding layer includes: the output features of the previous decoding layer, the optical flow predicted from those output features, and the output features of the encoding layer corresponding to the current decoding layer.
With this optical flow prediction method, both the high-level information carried by the coarse feature maps and the fine local information provided by the low-level feature maps are retained, so the predicted optical flow field is more accurate, which improves the feature fusion effect and the change detection accuracy.
For example, when predicting the optical flow field between the first image feature and the second image feature, using a FlowNetCorr-style optical flow estimation network retains both the overall high-level information of the two image features and the fine local information produced as they are downsampled, so the predicted optical flow field is more accurate and the change detection accuracy is further improved.
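The encoder-decoder with skip connections and per-level flow prediction can be sketched as a toy two-level network; the layer counts, channel widths and bilinear upsampling are illustrative and do not reproduce the actual FlowNetCorr architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFlowDecoder(nn.Module):
    # Two encoding layers (conv + pooling) and two matching decoding layers.
    # From the second decoding layer on, the input concatenates the previous
    # decoder output, a flow predicted from that output, and the skip feature
    # from the corresponding encoding layer.
    def __init__(self, c=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.enc2 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.dec2 = nn.Conv2d(c, c, 3, padding=1)
        self.flow2 = nn.Conv2d(c, 2, 3, padding=1)
        self.dec1 = nn.Conv2d(c + 2 + c, 2, 3, padding=1)  # prev output + flow + skip

    def forward(self, xa_d, xb_d):
        e1 = self.enc1(torch.cat([xa_d, xb_d], dim=1))   # 1/2 resolution
        e2 = self.enc2(e1)                               # 1/4 resolution
        d2 = F.relu(self.dec2(e2))
        flow2 = self.flow2(d2)                           # coarse flow
        up = lambda t: F.interpolate(t, scale_factor=2, mode='bilinear', align_corners=False)
        # Flow magnitudes are not rescaled after upsampling, for brevity.
        flow1 = self.dec1(torch.cat([up(d2), up(flow2), e1], dim=1))
        return up(flow1)                                 # full-resolution flow field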
After the optical flow field is obtained, S604 may be executed, and the warping unit may warp the first image feature using the optical flow field to obtain a warped first image feature.
The warping operation may include translation and rotation.
Through the warping operation, feature alignment of the first image feature and the second image feature can be achieved, and therefore change detection errors of the first image and the second image due to registration deviation are eliminated.
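A possible implementation of the warping unit, assuming the optical flow field is a per-pixel displacement measured in pixels, is the bilinear sampling sketch below; the sign convention and the normalization are assumptions.

import torch
import torch.nn.functional as F

def warp_with_flow(feat, flow):
    # feat: (N, C, H, W) feature to warp; flow: (N, 2, H, W) displacement (dx, dy).
    # A dense flow field can express both translation and rotation of the feature.
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                            # sampling positions
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                # normalise to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)             # (N, H, W, 2)
    return F.grid_sample(feat, grid, mode='bilinear', align_corners=True)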
Finally, S606 may be executed, where the warped first image feature and the adjusted second image feature are feature-fused in the superimposing unit, to obtain the fused feature.
The fusion characteristics obtained by the method can eliminate the change detection error of the first image and the second image caused by registration deviation, further improve the change detection accuracy of the network and reduce the false alarm rate.
In some examples, adaptive feature fusion may be implemented based on a disparity estimation network.
Referring to fig. 7, fig. 7 is a schematic flow chart of an adaptive feature fusion method shown in the present application. Schematically, in fig. 7, XA represents the first image feature, XB the second image feature, Xd the predicted parallax feature, XAB_C the initial fusion feature obtained by preliminarily fusing XA and XB, and XAB the final fusion feature.
It should be noted that the vertical dashed line in fig. 7 divides the adaptive feature fusion process into two sub-processes: parallax feature extraction, and feature fusion using the parallax feature.
In some examples, referring to the left half of the dashed line in fig. 7, the first image feature XA and the second image feature XB may be feature-fused to obtain an initial fused feature xab_c.
The above feature fusion may be operations such as superposition, addition, multiplication, etc. The following is an example of superposition. At this time, xab_c is an overlay feature.
Then, parallax feature extraction can be carried out on the first image feature and the second image feature to obtain parallax feature Xd; wherein the parallax feature Xd characterizes a degree of matching between feature points of the first image feature and feature points in the second image feature.
The degree of matching may be characterized in different dimensions. For example, the degree of matching may be characterized by a cost value in some examples. For another example, the degree of matching may be characterized by a degree of similarity.
When the matching degree is represented by a Cost value, a Cost space (Cost Volume) may be determined based on the first image feature and the second image feature; the cost space is used for representing cost values between the feature points of the first image feature and the feature points in the second image feature. And then, carrying out cost aggregation on the cost space to obtain the parallax characteristic.
In some examples, the first image feature may be first horizontally aligned with the second image feature when determining the cost space. And then, based on a preset parallax range, starting from the minimum parallax, determining two feature points corresponding to the first image feature and the second image feature. The cost between the two feature points may then be calculated based on techniques such as hamming distance, normal distance, gray level calculation, etc. The cost value set for the current minimum disparity may then be determined based on cost values between all corresponding pairs of feature points in the first image feature and the second image feature.
The parallax is then gradually increased and the above steps are repeated until cost value sets for all parallaxes in the parallax range are obtained. The cost space can then be composed from these cost value sets.
In some examples, when performing cost aggregation on the above cost space, a 3D convolution kernel (e.g., a 3×3×3 kernel) may be used to perform a 3D convolution on the cost space, obtaining the parallax feature.
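The cost-space construction and 3D-convolution aggregation might look like the following sketch; the channel-wise L1 cost, the disparity range and the single-channel 3D convolution are illustrative choices, not values taken from this application.

import torch
import torch.nn as nn

def build_cost_volume(feat_a, feat_b, max_disp=4):
    # feat_*: (N, C, H, W), assumed horizontally aligned. For each disparity d,
    # compare feat_a with feat_b shifted by d pixels and store a per-pixel cost.
    n, c, h, w = feat_a.shape
    cost = feat_a.new_zeros(n, 1, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            diff = (feat_a - feat_b).abs().mean(dim=1)
        else:
            diff = feat_a.new_zeros(n, h, w)
            diff[:, :, d:] = (feat_a[:, :, :, d:] - feat_b[:, :, :, :-d]).abs().mean(dim=1)
        cost[:, 0, d] = diff
    return cost  # (N, 1, D, H, W)

aggregate = nn.Conv3d(1, 1, kernel_size=3, padding=1)   # cost aggregation, 3x3x3 kernel
feat_a, feat_b = torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32)
parallax_feature = aggregate(build_cost_volume(feat_a, feat_b))  # parallax feature Xd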
It will be appreciated that the parallax feature may characterize the degree of matching between feature points of the first image feature and feature points in the second image feature.
In some examples, after obtaining the parallax feature, weight information may be determined based on the parallax feature, and the initial fusion feature may be selected by using the weight information to obtain the fusion feature.
In some examples, the weighting information may be dot multiplied with the initial fusion feature to obtain the fusion feature when the feature selection is performed.
The parallax feature characterizes the degree of matching between the feature points of the first image feature and those of the second image feature. Determining weight information from the parallax feature and using it to select features from the initial fusion feature therefore strengthens the strongly correlated features (those with a high degree of matching) in the initial fusion feature and weakens the weakly correlated ones. In the resulting fusion feature, the information provided by matched pixel points in the two-phase images is emphasized and the information provided by unmatched pixel points is suppressed. When building change detection is performed based on this final fusion feature, the change detection errors caused by the offset and the registration deviation are eliminated, which improves the building change detection accuracy for the two-phase images and reduces the false alarm rate.
In some examples, to further improve the detection accuracy, the parallax feature and the initial fusion feature may be superimposed to obtain a superimposed feature when determining the weight information, and the weight information may be determined based on the superimposed feature.
Referring to the right half of the dashed line in fig. 7, the initial fusion feature xab_c and the parallax feature Xd may be superimposed, and then a convolution operation is performed on the superimposed feature, and the size of the superimposed feature is adjusted to be consistent with the size of xab_c, so as to obtain the weight information.
In some examples, the weight information may be normalized to obtain values in the range of 0-1 for ease of calculation.
In the above example, the information carried by the initial fusion feature is introduced when the weight information is determined, so that more accurate weight information can be determined, and further the detection accuracy is improved.
In some examples, to further improve the detection accuracy, the sum of the fusion feature and the parallax feature may be determined as a final fusion feature after obtaining the fusion feature.
Referring to the right half of the dashed line in fig. 7, the parallax feature Xd may be obtained by a further convolution, and Xd and the fusion feature are then added to obtain the final fusion feature XAB.
When Xd is added to the fusion feature, the values of Xd and the fusion feature at the same positions may be added element-wise to obtain XAB.
In the above example, because the parallax feature can be added into the final fusion feature, when the building change detection is performed, the feature point matching degree information carried by the parallax feature can be introduced, so that the detection accuracy is further improved.
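Putting the steps in the right half of fig. 7 together, a hedged sketch follows; the channel sizes, the sigmoid normalization and the 1×1 projection used to make Xd match the fusion feature are assumptions.

import torch
import torch.nn as nn

class DisparityGuidedFusion(nn.Module):
    # Superimpose the initial fusion feature XAB_C with the parallax feature Xd,
    # derive weight information normalised to 0-1 with a sigmoid, gate XAB_C
    # with it (dot product), then add Xd to obtain the final fusion feature XAB.
    def __init__(self, c_fuse=128, c_disp=32):
        super().__init__()
        self.weight_conv = nn.Conv2d(c_fuse + c_disp, c_fuse, 3, padding=1)
        self.disp_proj = nn.Conv2d(c_disp, c_fuse, 1)  # match Xd to XAB_C's size

    def forward(self, xab_c, xd):
        w = torch.sigmoid(self.weight_conv(torch.cat([xab_c, xd], dim=1)))
        fused = w * xab_c                    # feature selection
        return fused + self.disp_proj(xd)    # final fusion feature XAB

xab_c, xd = torch.randn(1, 128, 32, 32), torch.randn(1, 32, 32, 32)
xab = DisparityGuidedFusion()(xab_c, xd)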
In some examples, adaptive feature fusion may be implemented based on a non-local structure.
In some examples, each feature point in the first image feature may be sequentially determined as the current feature point, and the steps of: determining the similarity between each feature point included in the second image feature and the current feature point, and forming first weight information corresponding to the current feature point; and the similarity between each feature point in the second image feature and the current feature point is represented in the first weight information. And then, carrying out feature aggregation on the second image features by using the first weight information to obtain aggregated second image features.
Referring to fig. 8, fig. 8 is a schematic diagram of a feature aggregation flow shown in the present application. Illustratively, XA in fig. 8 is used to represent a first image feature, XB is used to represent a second image feature, and XBF is used to represent an aggregated image feature obtained by feature-aggregating XB.
As shown in fig. 8, each feature point in XA may in turn be determined as the current feature point (the first feature point in fig. 8); the feature points of the second image feature (the second feature points in fig. 8) are then traversed, starting from the upper-left corner, and the similarity between each of them and the current feature point is determined. The similarity can be calculated by cosine distance, Mahalanobis distance, and the like; the similarity calculation method is not particularly limited in this application.
Then, first weight information corresponding to the current feature point may be constructed based on the similarity between the current feature point and each of the second feature points. The weight information may be a weight matrix or a weight vector in some examples.
And then, feature aggregation is performed on the second image feature using the first weight information to obtain the aggregated second image feature. In some examples, each feature point of the second image feature may in turn be determined as the current feature point, and the following performed:
determine the weight information corresponding to the feature point in the first image feature at the same position as the current feature point, compute a weighted sum of the feature points in the second image feature using that weight information, and fuse the result with the value of the current feature point to obtain the fused value. The fusing operation may include addition, multiplication, replacement, and the like, and is not particularly limited in this application. In this way, the feature aggregation XBF of the second image feature can be completed.
In the above example, the first weight information characterizes the similarity between each feature point in the second image feature and the current feature point. Aggregating the second image feature with this weight information highlights the features in the second image feature that are strongly correlated (highly similar) with the first image feature and weakens the weakly correlated ones. The change detection errors caused by the offset and the registration deviation are thereby eliminated, which improves the building change detection accuracy for the two-phase images and reduces the false alarm rate.
Similarly, XAF can be obtained by performing feature aggregation on XA in the manner described above.
Then, the aggregated first image feature XAF and the aggregated second image feature XBF are superimposed to obtain the fusion feature.
Because the aggregated first image feature and the aggregated second image feature strengthen the strongly associated features between the two and weaken the weakly associated ones, change detection using the superimposed feature emphasizes the information provided by matched pixel points in the two-phase images and suppresses the information provided by unmatched ones. The change detection errors caused by the offset and the registration deviation are thus eliminated, further improving the building change detection accuracy for the two-phase images and reducing the false alarm rate.
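The cross-feature non-local aggregation can be sketched as follows; cosine similarity and a softmax normalization of the weights are assumptions (the text only requires similarity-based weights), and the residual-style fusion with the current feature point is omitted for brevity.

import torch
import torch.nn.functional as F

def cross_nonlocal_aggregate(xa, xb):
    # For every feature point in the first argument, compute its similarity to
    # every feature point in the second argument, turn the similarities into
    # weights, and aggregate the second argument with them.
    n, c, h, w = xa.shape
    qa = F.normalize(xa.flatten(2), dim=1)   # (N, C, HW)
    kb = F.normalize(xb.flatten(2), dim=1)   # (N, C, HW)
    weights = torch.softmax(qa.transpose(1, 2) @ kb, dim=-1)        # (N, HW, HW)
    xbf = (weights @ xb.flatten(2).transpose(1, 2)).transpose(1, 2) # (N, C, HW)
    return xbf.reshape(n, c, h, w)

xa, xb = torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16)
fused = torch.cat([cross_nonlocal_aggregate(xb, xa),   # XAF
                   cross_nonlocal_aggregate(xa, xb)],  # XBF
                  dim=1)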
In some examples, in order to improve the accuracy and robustness of change detection, when calculating the similarity, a second neighborhood corresponding to each feature point of the second image feature may be determined, and the similarity between that second neighborhood and the first neighborhood corresponding to the current feature point may be computed. The first neighborhood is a neighborhood of a first preset size formed by the current feature point as its center and the surrounding feature points; the second neighborhood is a neighborhood of a second preset size formed by the feature point of the second image feature as its center and the surrounding feature points.
The first preset size and the second preset size may be set empirically, and may be the same or different. In some examples, the first preset size may be 1 and the second preset size may be 3×3.
Because the neighborhood can comprise a larger receptive field, the similarity calculated based on the neighborhood can be more accurate, so that more accurate weight information is determined, and further, the accuracy and the robustness of change detection are improved.
In some examples, to reduce the amount of model computation and improve change detection efficiency, when adaptively fusing the first image feature and the second image feature, each feature point in the first image feature may in turn be taken as the current feature point, and the following performed: determine the similarity between the current feature point and each feature point in a third neighborhood, and form third weight information corresponding to the current feature point. The third weight information characterizes the similarity between each feature point in the third neighborhood and the current feature point; the third neighborhood is a neighborhood of a third preset size formed by the feature point in the second image feature at the same position as the current feature point as its center and the surrounding feature points.
And then, performing feature aggregation on the second image features by using the third weight information to obtain aggregated second image features.
In some examples, each feature point of the second image feature may in turn be determined as the current feature point, and the following performed:
determine the weight information corresponding to the feature point in the first image feature at the same position as the current feature point, compute a weighted sum over the neighborhood of the third preset size formed by the current feature point as its center and the surrounding feature points, and fuse the result with the value of the current feature point to obtain the fused value. The fusing operation may include addition, multiplication, replacement, and the like, and is not particularly limited in this application. In this way, the feature aggregation XBF of the second image feature can be completed.
Similarly, XAF can be obtained by performing feature aggregation on XA in the manner described above.
Then, the aggregated first image feature XAF and the aggregated second image feature XBF are superimposed to obtain the fusion feature.
In the above example, the foregoing non-local structure is modified so that, during adaptive fusion, the receptive field is reduced to a neighborhood around the current pixel point, which reduces the computation of the adaptive fusion and improves change detection efficiency.
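A sketch of this reduced-receptive-field variant, which restricts each feature point to a k × k window at the same position in the other feature via unfold, is shown below; the dot-product similarity and softmax weights are again assumptions.

import torch
import torch.nn.functional as F

def local_window_aggregate(xa, xb, k=3):
    # For every point in XA, only the k x k neighbourhood around the same
    # position in XB is attended to, shrinking the HW x HW weight matrix of
    # the non-local version to HW x (k*k).
    n, c, h, w = xa.shape
    xb_win = F.unfold(xb, kernel_size=k, padding=k // 2)   # (N, C*k*k, HW)
    xb_win = xb_win.view(n, c, k * k, h * w)               # neighbourhood features
    sim = (xa.flatten(2).unsqueeze(2) * xb_win).sum(dim=1) # (N, k*k, HW)
    weights = torch.softmax(sim, dim=1)                    # third weight information
    xbf = (weights.unsqueeze(1) * xb_win).sum(dim=2)       # (N, C, HW)
    return xbf.view(n, c, h, w)                            # aggregated second image feature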
A method of building change detection is also presented in this application. The method can be applied to any electronic device. The type of the electronic device may be a notebook computer, a server, a mobile phone, a PAD terminal, etc., which is not particularly limited in the present application.
It is to be understood that the above-mentioned change detection method may be performed solely by the client device or the server device, or may be performed by the client device and the server device in cooperation.
For example, the above-described change detection method may be integrated with the client device. The client device may perform the image processing method by providing computing power through its own hardware environment.
For another example, the above-described change detection method may be integrated into a system platform. The server device carrying the system platform can provide computing power to execute the image processing method through the hardware environment of the server device.
Also for example, the above-described change detection method may be divided into two tasks of acquiring an image and change detection. Wherein the acquisition task may be integrated with the client device. The change detection task may be integrated with the server device. The client device may initiate a change detection request to the server device after acquiring the two-phase image. After receiving the change detection request, the server device may respond to the request to perform building change detection on the target two-phase image.
Referring to fig. 9, fig. 9 is a flowchart of a method for detecting a building change according to the present application.
As shown in fig. 9, the method may include:
s902, a two-phase image including a building is acquired.
And S904, respectively extracting the characteristics of the acquired two-stage images through a building change detection network to obtain a first image characteristic and a second image characteristic.
S906, detecting the change of the building contained in the two-phase image based on the first image feature and the second image feature to obtain a change detection result of the building.
When the building change detection task further comprises an image processing task for the building, the building change detection network can also be used for performing image processing on at least one of the two-stage images to obtain a corresponding image processing result.
In the above-mentioned image processing task and/or building change detection task, a specific implementation method may refer to the methods mentioned in fig. 1 to 8.
Wherein the building change detection network is trained by the training method shown in any example.
In some examples, in executing S906, the fused feature may be obtained based on feature fusion of the first image feature and the second image feature. And then, based on the fusion characteristics, detecting the change of the building contained in the two-stage images to obtain a change detection result of the building.
The following is an example of a building change detection scenario.
In this scenario, the detection device may periodically acquire remote sensing images captured by satellites for the same location. Then, the detection device can detect building change aiming at the acquired remote sensing image and output a detection result.
The detection device is provided with the building change detection network (hereinafter referred to as a detection network) for the building trained by the training method described in any one of the examples. The network training process is not described in detail herein.
The above-described detection network structure may refer to fig. 3. The network may include a parameter-sharing twin feature extraction sub-network and a detection sub-network connected to the twin feature extraction sub-network. The detection sub-network may include a fusion unit and a detection unit.
The above fusion unit may employ an adaptive feature fusion method as shown in any of the previous examples. The detection unit may be a semantic segmentation network at the pixel level, where it may be determined whether a building has changed.
Assume that the detection device acquires a remote sensing image a and a remote sensing image B.
At this time, the detection device may input the two-phase image into the detection network, and perform feature extraction for a and B by using the trained twin feature extraction sub-network, so as to obtain the first image feature XA and the second image feature XB.
In the training process, the twin feature extraction sub-network is obtained based on joint training, so that features more beneficial to change detection can be extracted, and the accuracy of the change detection is improved.
The adaptive feature fusion of XA and XB may then be performed in the fusion unit. For example, the trained deformable convolution network may be used to perform deformable convolution on XA and XB, obtaining the adjusted first and second image features XA_D and XB_D and correcting the offset between the building roof and base. The trained FlowNetCorr can then predict the optical flow field between XA_D and XB_D, and the optical flow field can be used to warp XA_D for feature alignment, eliminating the registration deviation introduced by registering A and B. Finally, the warped first image feature is superimposed with XB_D to obtain the fusion feature XAB, completing the adaptive feature fusion.
The self-adaptive feature fusion can eliminate the offset and the image registration deviation, so that the accuracy of change detection can be improved, and the false alarm rate can be reduced.
It will be appreciated that the above fusion unit may also fuse adaptively using the disparity estimation approach, the non-local approach, or the modified non-local approach described above to eliminate the offset and the image registration deviation, which is not described in detail herein.
Finally, XAB can be input into a detection unit to complete the semantic segmentation of the pixel level, and a detection result is obtained.
In some examples, the detection device may display the detection result on a display page. When displaying the detection result, a bounding box corresponding to each changed building may be displayed.
In some examples, if any building is detected to change, the detection device may also construct early warning information and send the early warning information to the supervisor.
The application also provides an image processing device.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an image processing apparatus shown in the present application.
An image processing apparatus 100, said apparatus 100 comprising:
the feature extraction module 101 is configured to perform feature extraction on two-phase images input into the target change detection network to obtain a first image feature and a second image feature;
the image processing module 102 is configured to obtain, based on the first image feature and the second image feature, an image processing result for at least one of the two-stage images and a change detection result for the object in the two-stage images, which are output by the object change detection network.
In some embodiments shown, the feature extraction module 101 is specifically configured to:
respectively extracting the features of the two-stage images to obtain a first image feature corresponding to a first image in the two-stage images and a second image feature corresponding to a second image in the two-stage images;
the image processing module 102 includes:
the first image processing sub-module is used for carrying out image processing on the first image characteristics and/or the second image characteristics to obtain the image processing result;
and the second image processing sub-module is used for carrying out feature fusion on the first image features and the second image features to obtain fusion features, and carrying out change detection on targets contained in the two-stage images based on the fusion features to obtain the target change detection result.
In some embodiments shown, the second image processing sub-module includes:
and the fusion module is used for carrying out self-adaptive feature fusion on the first image features and the second image features to obtain fusion features.
In some embodiments shown, the above fusion module is specifically configured to:
predicting a first offset corresponding to each feature point in the second image feature and a second offset corresponding to each feature point in the first image feature by using the first image feature and the second image feature; the first offset is used for adjusting the receptive field of each feature point in the second image feature; the second offset is used for adjusting the receptive field of each feature point in the first image feature;
Adjusting the second image feature based on the first offset to obtain an adjusted second image feature;
adjusting the first image feature based on the second offset to obtain an adjusted first image feature;
and carrying out feature fusion on the adjusted first image feature and the adjusted second image feature to obtain the fusion feature.
In some embodiments shown, the above fusion module is specifically configured to:
superposing the first image feature and the second image feature to obtain a first superposition feature; the first superposition characteristics are subjected to offset prediction to obtain the first offset; the method comprises the steps of,
superposing the first image feature and the second image feature to obtain a second superposition feature; and convolving the second superposition feature to obtain the second offset.
In some embodiments shown, the above fusion module is specifically configured to:
performing optical flow prediction on the adjusted first image features and the adjusted second image features to determine corresponding optical flow fields; wherein the optical flow field characterizes the position error of the same pixel point in the two-phase images;
Warping the first image feature by using the optical flow field to obtain a warped first image feature;
and performing feature fusion on the warped first image feature and the adjusted second image feature to obtain the fusion feature.
In some embodiments shown, the above fusion module is specifically configured to:
encoding the adjusted first image features and the adjusted second image features by utilizing a plurality of encoding layers included in the optical flow estimation network to obtain an encoding result;
decoding the coding result by utilizing a plurality of decoding layers included in the optical flow estimation network to obtain the optical flow field; wherein, in the decoding process, starting from the second decoding layer, the input of each decoding layer includes: the output features of the previous decoding layer, the optical flow predicted from those output features, and the output features of the encoding layer corresponding to the current decoding layer.
In some embodiments shown, the above fusion module is specifically configured to:
performing feature fusion on the first image feature and the second image feature to obtain an initial fusion feature;
performing parallax feature extraction on the first image feature and the second image feature to obtain parallax features; wherein the parallax feature characterizes a degree of matching between feature points of the first image feature and feature points in the second image feature;
And determining weight information based on the parallax characteristics, and performing characteristic selection on the initial fusion characteristics by using the weight information to obtain the fusion characteristics.
In some embodiments shown, the above fusion module is specifically configured to:
and superposing the parallax characteristic and the initial fusion characteristic to obtain a superposition characteristic, and determining weight information based on the superposition characteristic.
In some embodiments shown, the above fusion module is specifically configured to:
and determining the sum of the fusion characteristic and the parallax characteristic as a final fusion characteristic.
In some embodiments shown, the degree of matching is characterized by a cost value; the fusion module is specifically used for:
determining a cost space based on the first image feature and the second image feature; the cost space is used for representing cost values between the characteristic points of the first image characteristic and the characteristic points in the second image characteristic;
and carrying out cost aggregation on the cost space to obtain the parallax characteristic.
In some embodiments shown, the above fusion module is specifically configured to:
sequentially determining each feature point in the first image feature as a current feature point, and executing: determining the similarity between each feature point included in the second image feature and the current feature point, and forming first weight information corresponding to the current feature point; wherein, the first weight information characterizes the similarity between each feature point in the second image feature and the current feature point;
Performing feature aggregation on the second image features by using the first weight information to obtain aggregated second image features;
sequentially determining each feature point in the second image feature as a current feature point, and executing: determining similarity between each feature point included in the first image feature and the current feature point, and forming second weight information corresponding to the current feature point; wherein the second weight information characterizes a similarity between each feature point in the first image feature and the current feature point;
performing feature aggregation on the first image features by using the second weight information to obtain aggregated first image features;
and superposing the aggregated first image features and the aggregated second image features to obtain the fusion features.
In some embodiments shown, the above fusion module is specifically configured to:
determining a second neighborhood corresponding to each feature point included in the second image feature, and similarity between the second neighborhood and the first neighborhood corresponding to the current feature point; the first neighborhood comprises a neighborhood with a first preset size, wherein the neighborhood is formed by taking the current characteristic point as a center and other surrounding characteristic points; the second neighborhood comprises a neighborhood with a second preset size, wherein the neighborhood is formed by taking the second image characteristic point as a center and other surrounding characteristic points.
In some embodiments shown, the above fusion module is specifically configured to:
sequentially determining each feature point in the first image feature as a current feature point, and executing: determining the similarity between each feature point included in a third neighborhood corresponding to the feature point with the same position as the current feature point in the second image feature and the current feature point, and forming third weight information corresponding to the current feature point; wherein, the third weight information characterizes the similarity between each feature point in the third adjacent area and the current feature point; the third neighborhood comprises a neighborhood with a third preset size, wherein the neighborhood is formed by taking a feature point, which is the same as the current feature point in the second image feature, as a center and other surrounding feature points;
performing feature aggregation on the second image features by using the third weight information to obtain aggregated second image features;
sequentially determining each feature point in the second image feature as a current feature point, and executing: determining similarity between each feature point included in a fourth neighboring area corresponding to the feature point with the same position as the current feature point in the first image feature and the current feature point, and forming fourth weight information corresponding to the current feature point; wherein, the similarity between each feature point in the fourth neighboring area and the current feature point is represented in the fourth weight information; the fourth neighborhood comprises a neighborhood with a fourth preset size, wherein the neighborhood is formed by taking a feature point, which is the same as the current feature point in the first image feature, as a center and other surrounding feature points;
Performing feature aggregation on the first image features by using the fourth weight information to obtain aggregated first image features;
and superposing the aggregated first image features and the aggregated second image features to obtain the fusion features.
In some embodiments shown, the above image processing includes at least one of:
image classification; semantic segmentation; instance segmentation; panoptic segmentation; target detection.
In some embodiments shown, the apparatus 100 further comprises:
the data enhancement module is used for data enhancement of at least one of the two-stage images in at least one of the following modes:
cropping the image; rotating the image; flipping the image; adjusting the brightness of the image; adjusting the contrast of the image; adding Gaussian noise to the image; adding a registration error to the two-phase images.
In some embodiments shown, the target change detection network includes a first input and a second input; the image processing module 102 is specifically configured to:
taking the first image feature as a first input, and taking the second image feature as a second input to obtain a first image processing result aiming at least one period of images in the two periods of images and a first change detection result aiming at a target in the two periods of images, which are output by the target change detection network;
Taking the first image feature as a second input, and taking the second image feature as a first input to obtain a second image processing result aiming at least one period of images in the two periods of images and a second change detection result aiming at the target in the two periods of images, which are output by the target change detection network;
and carrying out weighted average on the first image processing result and the second image processing result to obtain a final image processing result aiming at least one period of image in the two periods of images, and carrying out weighted average on the first change detection result and the second change detection result to obtain a final change detection result aiming at the target in the two periods of images.
In some embodiments shown, in the case where the target change detection network is a network to be trained, the apparatus 100 further includes:
the loss determination module is used for obtaining image processing loss according to the image processing result of the at least one period of image and the corresponding image processing label value, and obtaining change detection loss according to the change detection result of the target in the two periods of image and the corresponding change detection label value;
and the adjusting module is used for adjusting the network parameters of the target change detection network based on the image processing loss and the change detection loss.
In some embodiments shown, the adjustment module is specifically configured to:
obtaining a total loss according to the image processing loss and the change detection loss;
and adjusting network parameters of the target change detection network based on the total loss.
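As a simple illustration only, the total loss may be a weighted combination of the two task losses; the weights alpha and beta are hypothetical hyper-parameters, since the text does not specify how the two losses are combined.

def total_loss(image_processing_loss, change_detection_loss, alpha=1.0, beta=1.0):
    # Weighted sum of the two task losses used to update the network parameters.
    return alpha * image_processing_loss + beta * change_detection_loss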
The embodiment of the image processing apparatus shown in the present application can be applied to an electronic device. Accordingly, the present application discloses an electronic device, which may include: a processor.
A memory for storing processor-executable instructions.
Wherein the processor is configured to invoke the executable instructions stored in the memory to implement the image processing method as shown in any of the embodiments above.
Referring to fig. 11, fig. 11 is a hardware configuration diagram of an electronic device shown in the present application.
As shown in fig. 11, the electronic device may include a processor for executing instructions, a network interface for making a network connection, a memory for storing operating data for the processor, and a nonvolatile memory for storing instructions corresponding to the image processing apparatus.
The image processing device may be implemented in software, or may be implemented in hardware or a combination of hardware and software. Taking software implementation as an example, the device in a logic sense is formed by reading corresponding computer program instructions in a nonvolatile memory into a memory by a processor of an electronic device where the device is located for operation. In terms of hardware, in addition to the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 11, the electronic device in which the apparatus is located in the embodiment generally includes other hardware according to the actual function of the electronic device, which will not be described herein.
It should be understood that, in order to increase the processing speed, the corresponding instructions of the image processing apparatus may also be directly stored in the memory, which is not limited herein.
The present application proposes a computer-readable storage medium storing a computer program for executing the image processing method shown in any one of the above embodiments.
One skilled in the relevant art will recognize that one or more embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present application may take the form of a computer program product on one or more computer-usable storage media (which may include, but are not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
"and/or" in this application means having at least one of the two, e.g., "a and/or B" may include three schemes: A. b, and "a and B".
All embodiments in the application are described in a progressive manner, and identical and similar parts of all embodiments are mutually referred, so that each embodiment mainly describes differences from other embodiments. In particular, for data processing apparatus embodiments, the description is relatively simple, as it is substantially similar to method embodiments, with reference to the description of method embodiments in part.
The foregoing has described certain embodiments of this application. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this application may be implemented in the following: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware which may include the structures disclosed in this application and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this application can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described herein can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows described above may also be performed by, and apparatus may also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
A computer suitable for executing a computer program may comprise, for example, a general-purpose and/or special-purpose microprocessor, or any other type of central processing unit. Typically, the central processing unit will receive instructions and data from a read only memory and/or a random access memory. The essential components of a computer may include a central processing unit for executing or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks, etc. However, a computer does not have to have such a device. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data may include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disk or removable disks), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Although this application contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or the scope of what is claimed, but rather as primarily describing features of certain disclosed embodiments. Certain features that are described in this application in the context of separate embodiments can also be implemented in combination in a single embodiment. On the other hand, the various features described in the individual embodiments may also be implemented separately in the various embodiments or in any suitable subcombination. Furthermore, although features may be acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description of the preferred embodiment(s) of the present application is merely intended to illustrate the embodiment(s) of the present application and is not intended to limit the embodiment(s) of the present application, since any and all modifications, equivalents, improvements, etc. that fall within the spirit and principles of the embodiment(s) of the present application are intended to be included within the scope of the present application.

Claims (16)

1. An image processing method, the method comprising:
respectively extracting features of two-stage images input into a target change detection network to obtain first image features corresponding to a first image in the two-stage images and second image features corresponding to a second image in the two-stage images;
based on the first image feature and the second image feature, obtaining an image processing result aiming at least one period of images in the two periods of images and a change detection result aiming at the target in the two periods of images, which are output by the target change detection network;
the obtaining, based on the first image feature and the second image feature, an image processing result for at least one of the two-phase images and a change detection result for the object in the two-phase image, which are output by the object change detection network, includes:
performing image processing on the first image feature and/or the second image feature to obtain the image processing result;
performing adaptive feature fusion on the first image feature and the second image feature to obtain a fused feature, and performing change detection on the target contained in the two-phase images based on the fused feature to obtain the change detection result of the target;
wherein performing adaptive feature fusion on the first image feature and the second image feature to obtain the fused feature comprises one of the following:
predicting, by using the first image feature and the second image feature, a first offset corresponding to each feature point in the second image feature and a second offset corresponding to each feature point in the first image feature, wherein the first offset is used for adjusting the receptive field of each feature point in the second image feature and the second offset is used for adjusting the receptive field of each feature point in the first image feature; adjusting the second image feature based on the first offset to obtain an adjusted second image feature; adjusting the first image feature based on the second offset to obtain an adjusted first image feature; and performing feature fusion on the adjusted first image feature and the adjusted second image feature to obtain the fused feature; or
performing feature fusion on the first image feature and the second image feature to obtain an initial fused feature; performing disparity feature extraction on the first image feature and the second image feature to obtain a disparity feature, wherein the disparity feature characterizes a degree of matching between feature points of the first image feature and feature points of the second image feature; and determining weight information based on the disparity feature, and performing feature selection on the initial fused feature by using the weight information to obtain the fused feature; or
sequentially taking each feature point in the first image feature as a current feature point and performing: determining the similarity between each feature point included in the second image feature and the current feature point, to form first weight information corresponding to the current feature point, wherein the first weight information characterizes the similarity between each feature point in the second image feature and the current feature point; performing feature aggregation on the second image feature by using the first weight information to obtain an aggregated second image feature; sequentially taking each feature point in the second image feature as a current feature point and performing: determining the similarity between each feature point included in the first image feature and the current feature point, to form second weight information corresponding to the current feature point, wherein the second weight information characterizes the similarity between each feature point in the first image feature and the current feature point; performing feature aggregation on the first image feature by using the second weight information to obtain an aggregated first image feature; and superposing the aggregated first image feature and the aggregated second image feature to obtain the fused feature; or
sequentially taking each feature point in the first image feature as a current feature point and performing: determining the similarity between the current feature point and each feature point included in a third neighborhood corresponding to the feature point in the second image feature whose position is the same as that of the current feature point, to form third weight information corresponding to the current feature point, wherein the third weight information characterizes the similarity between each feature point in the third neighborhood and the current feature point, and the third neighborhood comprises a neighborhood of a third preset size formed by the feature point in the second image feature at the same position as the current feature point as a center together with its surrounding feature points; performing feature aggregation on the second image feature by using the third weight information to obtain an aggregated second image feature; sequentially taking each feature point in the second image feature as a current feature point and performing: determining the similarity between the current feature point and each feature point included in a fourth neighborhood corresponding to the feature point in the first image feature whose position is the same as that of the current feature point, to form fourth weight information corresponding to the current feature point, wherein the fourth weight information characterizes the similarity between each feature point in the fourth neighborhood and the current feature point, and the fourth neighborhood comprises a neighborhood of a fourth preset size formed by the feature point in the first image feature at the same position as the current feature point as a center together with its surrounding feature points; performing feature aggregation on the first image feature by using the fourth weight information to obtain an aggregated first image feature; and superposing the aggregated first image feature and the aggregated second image feature to obtain the fused feature.
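For orientation, the following is a minimal PyTorch sketch of the two-branch structure recited in claim 1: a shared feature extractor for the two phases, an image-processing head applied per phase, and a change-detection head on an adaptively fused feature. The module name TargetChangeDetectionNet, the layer sizes, and the simple gated fusion used here are illustrative assumptions, not the patented implementation (claim 1 itself admits four alternative fusion schemes).

```python
# Minimal sketch, assuming a PyTorch-style siamese network; names and layer sizes are illustrative.
import torch
import torch.nn as nn

class TargetChangeDetectionNet(nn.Module):
    def __init__(self, in_ch=3, feat_ch=64, num_classes=2):
        super().__init__()
        # Shared feature extractor applied to both temporal phases.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Image-processing head (e.g. per-phase semantic segmentation).
        self.seg_head = nn.Conv2d(feat_ch, num_classes, 1)
        # Simple adaptive-fusion stand-in: weights predicted from both features.
        self.fuse_weight = nn.Sequential(nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1), nn.Sigmoid())
        self.fuse_conv = nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1)
        # Change-detection head operating on the fused feature.
        self.change_head = nn.Conv2d(feat_ch, 1, 1)

    def forward(self, img_t1, img_t2):
        f1 = self.backbone(img_t1)          # first image feature
        f2 = self.backbone(img_t2)          # second image feature
        seg1 = self.seg_head(f1)            # image processing result for phase 1
        seg2 = self.seg_head(f2)            # image processing result for phase 2
        cat = torch.cat([f1, f2], dim=1)
        fused = self.fuse_conv(cat) * self.fuse_weight(cat)   # toy adaptive fusion
        change = self.change_head(fused)    # change detection result
        return seg1, seg2, change

net = TargetChangeDetectionNet()
a, b = torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128)
seg1, seg2, change = net(a, b)
print(seg1.shape, change.shape)   # torch.Size([1, 2, 128, 128]) torch.Size([1, 1, 128, 128])
```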
2. The method of claim 1, wherein predicting a first offset for each feature point in the second image feature and a second offset for each feature point in the first image feature using the first image feature and the second image feature comprises:
superposing the first image feature and the second image feature to obtain a first superposed feature, and performing offset prediction on the first superposed feature to obtain the first offset; and
superposing the first image feature and the second image feature to obtain a second superposed feature, and convolving the second superposed feature to obtain the second offset.
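One way to picture the offset-based branch of claim 1 together with the prediction step of claim 2 is to treat the predicted offsets as deformable-convolution offsets: the two features are concatenated ("superposed"), a convolution predicts per-point offsets, and the offsets steer the sampling locations, i.e. the receptive field, of the feature being adjusted. The sketch below assumes torchvision's deform_conv2d; the class name OffsetAdjust and the final additive fusion are illustrative choices, not the patented scheme.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class OffsetAdjust(nn.Module):
    """Predicts offsets from the concatenated features and uses them to adjust
    the receptive field of the feature being sampled (sketch of claims 1/2)."""
    def __init__(self, ch=64, k=3):
        super().__init__()
        self.k = k
        # Offset prediction: superpose (concatenate) both features, then convolve.
        self.offset_conv = nn.Conv2d(2 * ch, 2 * k * k, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(ch, ch, k, k) * 0.01)

    def forward(self, feat_to_adjust, feat_other):
        stacked = torch.cat([feat_to_adjust, feat_other], dim=1)   # superposed feature
        offset = self.offset_conv(stacked)                         # predicted offset
        return deform_conv2d(feat_to_adjust, offset, self.weight, padding=self.k // 2)

f1, f2 = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
adjust_f1 = OffsetAdjust()   # adjusts the first image feature with the second offset
adjust_f2 = OffsetAdjust()   # adjusts the second image feature with the first offset
fused = adjust_f1(f1, f2) + adjust_f2(f2, f1)   # one simple way to fuse the adjusted features
print(fused.shape)           # torch.Size([1, 64, 32, 32])
```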
3. The method according to claim 1 or 2, wherein feature fusing the adjusted first image feature and the adjusted second image feature to obtain the fused feature comprises:
performing optical flow prediction on the adjusted first image feature and the adjusted second image feature to determine a corresponding optical flow field, wherein the optical flow field characterizes the position error of the same pixel point between the two-phase images;
warping the first image feature by using the optical flow field to obtain a warped first image feature;
and performing feature fusion on the warped first image feature and the adjusted second image feature to obtain the fused feature.
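A sketch of the warping step in claim 3, assuming the optical flow field stores per-pixel (dx, dy) displacements and the warp is implemented with bilinear sampling; the random flow and the concatenation-based fusion at the end are placeholders, not the patented choices.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(feat, flow):
    """feat: (N, C, H, W); flow: (N, 2, H, W) holding per-pixel (dx, dy) in pixels."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0).expand(n, -1, -1, -1)
    new_pos = grid + flow
    # Normalise sampling positions to [-1, 1] for grid_sample.
    new_x = 2.0 * new_pos[:, 0] / max(w - 1, 1) - 1.0
    new_y = 2.0 * new_pos[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack([new_x, new_y], dim=-1)   # (N, H, W, 2)
    return F.grid_sample(feat, sample_grid, align_corners=True)

f1_adj = torch.randn(1, 64, 32, 32)
f2_adj = torch.randn(1, 64, 32, 32)
flow = torch.randn(1, 2, 32, 32)               # stand-in for a predicted optical flow field
f1_warped = warp_with_flow(f1_adj, flow)       # warped first image feature
fused = torch.cat([f1_warped, f2_adj], dim=1)  # feature fusion by concatenation (one option)
print(fused.shape)
```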
4. The method of claim 3, wherein the performing optical flow prediction on the adjusted first image feature and the adjusted second image feature to determine a corresponding optical flow field comprises:
encoding the adjusted first image feature and the adjusted second image feature by using a plurality of encoding layers included in an optical flow estimation network to obtain an encoding result; and
decoding the encoding result by using a plurality of decoding layers included in the optical flow estimation network to obtain the optical flow field, wherein, during decoding, starting from the second decoding layer, the input of each decoding layer comprises: the output feature of the previous decoding layer, an optical flow predicted based on that output feature, and the output feature of the encoding layer corresponding to the current decoding layer.
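A toy two-level encoder–decoder illustrating claim 4: from the second decoding layer onward, the decoder input is the previous decoder output, an optical flow predicted from that output, and the skip feature from the matching encoding layer. Layer counts, channel widths, and the bilinear upsampling are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFlowNet(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = nn.Conv2d(2 * ch, ch, 3, stride=2, padding=1)
        self.enc2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.dec2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.flow2 = nn.Conv2d(ch, 2, 3, padding=1)
        # Second decoding layer input: previous decoder output + predicted flow + encoder skip.
        self.dec1 = nn.Conv2d(ch + 2 + ch, ch, 3, padding=1)
        self.flow1 = nn.Conv2d(ch, 2, 3, padding=1)

    def forward(self, f1, f2):
        x = torch.cat([f1, f2], dim=1)
        e1 = F.relu(self.enc1(x))            # encoding layer 1
        e2 = F.relu(self.enc2(e1))           # encoding layer 2
        d2 = F.relu(self.dec2(e2))           # decoding layer 1
        flow_coarse = self.flow2(d2)         # flow predicted from its output
        d2_up = F.interpolate(d2, scale_factor=2, mode="bilinear", align_corners=False)
        flow_up = F.interpolate(flow_coarse, scale_factor=2, mode="bilinear", align_corners=False) * 2
        d1 = F.relu(self.dec1(torch.cat([d2_up, flow_up, e1], dim=1)))   # decoding layer 2
        flow = self.flow1(d1)
        return F.interpolate(flow, scale_factor=2, mode="bilinear", align_corners=False) * 2

f1, f2 = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(TinyFlowNet()(f1, f2).shape)   # torch.Size([1, 2, 32, 32])
```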
5. The method of claim 1, wherein the determining weight information based on the disparity feature comprises:
superposing the disparity feature and the initial fused feature to obtain a superposed feature, and determining the weight information based on the superposed feature.
6. The method according to claim 1 or 5, characterized in that the method further comprises:
determining the sum of the fused feature and the disparity feature as a final fused feature.
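Claims 5 and 6 can be pictured as a gating step followed by a residual-style addition: the disparity feature and the initial fused feature are superposed to predict weight information, the weights select from the initial fused feature, and the disparity feature is added back to give the final fused feature. The gating layer below and the random placeholder tensors are illustrative assumptions.

```python
import torch
import torch.nn as nn

ch = 64
initial_fused = torch.randn(1, ch, 32, 32)       # initial fused feature (claim 1, second branch)
disparity_feature = torch.randn(1, ch, 32, 32)   # disparity feature (claim 7)

weight_layer = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.Sigmoid())
superposed = torch.cat([disparity_feature, initial_fused], dim=1)   # claim 5: superpose first
weights = weight_layer(superposed)                                  # weight information
fused = initial_fused * weights                                     # feature selection
final_fused = fused + disparity_feature                             # claim 6: final fused feature
print(final_fused.shape)
```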
7. The method according to claim 1 or 5, wherein the degree of matching is characterized by a cost value;
wherein performing disparity feature extraction on the first image feature and the second image feature to obtain the disparity feature comprises:
determining a cost space based on the first image feature and the second image feature, wherein the cost space is used for representing cost values between the feature points of the first image feature and the feature points of the second image feature; and
performing cost aggregation on the cost space to obtain the disparity feature.
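A sketch of the cost space and cost aggregation in claim 7, assuming a correlation-style cost value computed over a small set of horizontal shifts and a single convolution standing in for the aggregation step; both choices are assumptions, not the patented construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_cost_space(f1, f2, max_disp=4):
    """Returns (N, 2*max_disp+1, H, W): one matching-cost map per candidate shift."""
    costs = []
    for d in range(-max_disp, max_disp + 1):
        f2_shifted = torch.roll(f2, shifts=d, dims=3)
        costs.append((f1 * f2_shifted).mean(dim=1))   # correlation used as the cost value
    return torch.stack(costs, dim=1)

f1, f2 = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
cost = build_cost_space(f1, f2)                            # cost space
aggregate = nn.Conv2d(cost.shape[1], 64, 3, padding=1)     # cost aggregation (single conv here)
disparity_feature = F.relu(aggregate(cost))
print(disparity_feature.shape)                             # torch.Size([1, 64, 32, 32])
```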
8. The method according to claim 1, wherein determining the similarity between each feature point included in the second image feature and the current feature point includes:
determining, for each feature point included in the second image feature, the similarity between a second neighborhood corresponding to that feature point and a first neighborhood corresponding to the current feature point, wherein the first neighborhood comprises a neighborhood of a first preset size formed by the current feature point as a center together with its surrounding feature points, and the second neighborhood comprises a neighborhood of a second preset size formed by the feature point in the second image feature as a center together with its surrounding feature points.
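The similarity-driven aggregation in claim 1 (third branch) and claim 8 resembles cross-attention between the two feature maps. The sketch below implements the point-wise (global) variant with dot-product similarity and softmax weights, which are assumptions; claim 8's variant would compare small neighborhoods (patches) rather than single feature points.

```python
import torch
import torch.nn.functional as F

def cross_aggregate(query_feat, other_feat):
    """query_feat, other_feat: (N, C, H, W). Aggregates other_feat for each query point."""
    n, c, h, w = query_feat.shape
    q = query_feat.flatten(2).transpose(1, 2)        # (N, HW, C): current feature points
    k = other_feat.flatten(2)                        # (N, C, HW)
    weights = F.softmax(q @ k / c ** 0.5, dim=-1)    # first/second weight information
    aggregated = weights @ k.transpose(1, 2)         # (N, HW, C)
    return aggregated.transpose(1, 2).reshape(n, c, h, w)

f1, f2 = torch.randn(1, 64, 16, 16), torch.randn(1, 64, 16, 16)
f2_aggregated = cross_aggregate(f1, f2)   # aggregated second image feature
f1_aggregated = cross_aggregate(f2, f1)   # aggregated first image feature
fused = f1_aggregated + f2_aggregated     # superposition of the aggregated features
print(fused.shape)
```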
9. The method of claim 1, wherein the image processing comprises at least one of:
image classification; semantic segmentation; instance segmentation; panoptic segmentation; and object detection.
10. The method according to claim 1, wherein the method further comprises:
performing data augmentation on at least one of the two-phase images in at least one of the following ways:
cropping the image; rotating the image; flipping the image; adjusting the brightness of the image; adjusting the contrast of the image; adding Gaussian noise to the image; and adding a registration error to the two-phase images.
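A sketch of paired data augmentation along the lines of claim 10; the parameter ranges are illustrative, and the registration error is simulated by shifting one phase a few pixels relative to the other.

```python
import torch
import torchvision.transforms.functional as TF

def augment_pair(img1, img2):
    # Random horizontal flip applied consistently to both phases.
    if torch.rand(1) < 0.5:
        img1, img2 = TF.hflip(img1), TF.hflip(img2)
    # Brightness / contrast jitter on one phase.
    img1 = TF.adjust_brightness(img1, 0.8 + 0.4 * torch.rand(1).item())
    img1 = TF.adjust_contrast(img1, 0.8 + 0.4 * torch.rand(1).item())
    # Gaussian noise on the other phase.
    img2 = img2 + 0.01 * torch.randn_like(img2)
    # Simulated registration error: shift the second phase by up to 2 pixels.
    dx, dy = torch.randint(-2, 3, (2,)).tolist()
    img2 = torch.roll(img2, shifts=(dy, dx), dims=(-2, -1))
    return img1, img2

a, b = torch.rand(3, 128, 128), torch.rand(3, 128, 128)
a_aug, b_aug = augment_pair(a, b)
print(a_aug.shape, b_aug.shape)
```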
11. The method of claim 1, wherein the target change detection network comprises a first input and a second input;
wherein obtaining, based on the first image feature and the second image feature, the image processing result for at least one of the two-phase images and the change detection result for the target in the two-phase images output by the target change detection network comprises:
taking the first image feature as the first input and the second image feature as the second input, to obtain a first image processing result for at least one of the two-phase images and a first change detection result for the target in the two-phase images output by the target change detection network;
taking the first image feature as the second input and the second image feature as the first input, to obtain a second image processing result for at least one of the two-phase images and a second change detection result for the target in the two-phase images output by the target change detection network; and
performing a weighted average of the first image processing result and the second image processing result to obtain a final image processing result for at least one of the two-phase images, and performing a weighted average of the first change detection result and the second change detection result to obtain a final change detection result for the target in the two-phase images.
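Claim 11 amounts to test-time input swapping: run the network with the two phases in both orders and average the corresponding outputs. The equal weights below are an assumption (the claim only requires a weighted average), and the dummy network exists only to make the snippet runnable.

```python
import torch

def swapped_inference(net, img_t1, img_t2):
    """Feeds the two phases in both orders and averages the paired outputs (claim 11)."""
    seg1_a, seg2_a, change_a = net(img_t1, img_t2)    # phase 1 as the first input
    seg2_b, seg1_b, change_b = net(img_t2, img_t1)    # swapped order
    return (0.5 * (seg1_a + seg1_b),
            0.5 * (seg2_a + seg2_b),
            0.5 * (change_a + change_b))

# Dummy network with the same (seg_first, seg_second, change) output convention.
dummy = lambda a, b: (a.mean(1, keepdim=True),
                      b.mean(1, keepdim=True),
                      (a - b).abs().mean(1, keepdim=True))
x, y = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
seg1, seg2, change = swapped_inference(dummy, x, y)
print(seg1.shape, change.shape)
```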
12. The method according to claim 1, wherein, in a case where the target change detection network is a network to be trained, the method further comprises:
obtaining an image processing loss according to the image processing result and a corresponding image processing label value of at least one of the two-phase images, and obtaining a change detection loss according to the change detection result and a corresponding change detection label value of the target in the two-phase images;
and adjusting network parameters of the target change detection network based on the image processing loss and the change detection loss.
13. The method of claim 12, wherein the adjusting network parameters of the target change detection network based on the image processing loss and the change detection loss comprises:
obtaining a total loss according to the image processing loss and the change detection loss; and
adjusting the network parameters of the target change detection network based on the total loss.
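A sketch of one training step for claims 12–13: an image-processing loss and a change-detection loss are computed against their label values and summed into a total loss that drives the parameter update. The particular loss functions and the equal weighting are assumptions; `net` is assumed to follow the output convention of the sketch network shown after claim 1.

```python
import torch.nn as nn

seg_loss_fn = nn.CrossEntropyLoss()        # image processing loss (segmentation assumed)
change_loss_fn = nn.BCEWithLogitsLoss()    # change detection loss

def train_step(net, optimizer, img_t1, img_t2, seg_label, change_label):
    """seg_label: (N, H, W) long; change_label: (N, 1, H, W) float in {0, 1}."""
    seg1, _, change = net(img_t1, img_t2)
    image_processing_loss = seg_loss_fn(seg1, seg_label)
    change_detection_loss = change_loss_fn(change, change_label)
    total_loss = image_processing_loss + change_detection_loss   # claim 13: total loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```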
14. An image processing apparatus, characterized in that the apparatus comprises:
a feature extraction module, configured to perform feature extraction on each of two-phase images input into a target change detection network, to obtain a first image feature corresponding to a first image in the two-phase images and a second image feature corresponding to a second image in the two-phase images; and
an image processing module, configured to obtain, based on the first image feature and the second image feature, an image processing result for at least one of the two-phase images and a change detection result for a target in the two-phase images, both output by the target change detection network;
wherein obtaining, based on the first image feature and the second image feature, the image processing result for at least one of the two-phase images and the change detection result for the target in the two-phase images output by the target change detection network comprises:
performing image processing on the first image feature and/or the second image feature to obtain the image processing result;
performing adaptive feature fusion on the first image feature and the second image feature to obtain a fused feature, and performing change detection on the target contained in the two-phase images based on the fused feature to obtain the change detection result of the target;
wherein performing adaptive feature fusion on the first image feature and the second image feature to obtain the fused feature comprises one of the following:
predicting, by using the first image feature and the second image feature, a first offset corresponding to each feature point in the second image feature and a second offset corresponding to each feature point in the first image feature, wherein the first offset is used for adjusting the receptive field of each feature point in the second image feature and the second offset is used for adjusting the receptive field of each feature point in the first image feature; adjusting the second image feature based on the first offset to obtain an adjusted second image feature; adjusting the first image feature based on the second offset to obtain an adjusted first image feature; and performing feature fusion on the adjusted first image feature and the adjusted second image feature to obtain the fused feature; or
performing feature fusion on the first image feature and the second image feature to obtain an initial fused feature; performing disparity feature extraction on the first image feature and the second image feature to obtain a disparity feature, wherein the disparity feature characterizes a degree of matching between feature points of the first image feature and feature points of the second image feature; and determining weight information based on the disparity feature, and performing feature selection on the initial fused feature by using the weight information to obtain the fused feature; or
sequentially taking each feature point in the first image feature as a current feature point and performing: determining the similarity between each feature point included in the second image feature and the current feature point, to form first weight information corresponding to the current feature point, wherein the first weight information characterizes the similarity between each feature point in the second image feature and the current feature point; performing feature aggregation on the second image feature by using the first weight information to obtain an aggregated second image feature; sequentially taking each feature point in the second image feature as a current feature point and performing: determining the similarity between each feature point included in the first image feature and the current feature point, to form second weight information corresponding to the current feature point, wherein the second weight information characterizes the similarity between each feature point in the first image feature and the current feature point; performing feature aggregation on the first image feature by using the second weight information to obtain an aggregated first image feature; and superposing the aggregated first image feature and the aggregated second image feature to obtain the fused feature; or
sequentially taking each feature point in the first image feature as a current feature point and performing: determining the similarity between the current feature point and each feature point included in a third neighborhood corresponding to the feature point in the second image feature whose position is the same as that of the current feature point, to form third weight information corresponding to the current feature point, wherein the third weight information characterizes the similarity between each feature point in the third neighborhood and the current feature point, and the third neighborhood comprises a neighborhood of a third preset size formed by the feature point in the second image feature at the same position as the current feature point as a center together with its surrounding feature points; performing feature aggregation on the second image feature by using the third weight information to obtain an aggregated second image feature; sequentially taking each feature point in the second image feature as a current feature point and performing: determining the similarity between the current feature point and each feature point included in a fourth neighborhood corresponding to the feature point in the first image feature whose position is the same as that of the current feature point, to form fourth weight information corresponding to the current feature point, wherein the fourth weight information characterizes the similarity between each feature point in the fourth neighborhood and the current feature point, and the fourth neighborhood comprises a neighborhood of a fourth preset size formed by the feature point in the first image feature at the same position as the current feature point as a center together with its surrounding feature points; performing feature aggregation on the first image feature by using the fourth weight information to obtain an aggregated first image feature; and superposing the aggregated first image feature and the aggregated second image feature to obtain the fused feature.
15. An electronic device, the device comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to invoke executable instructions stored in the memory to implement the image processing method of any of claims 1-13.
16. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the image processing method according to any one of claims 1 to 13.
CN202110112824.8A 2021-01-27 2021-01-27 Image processing method, device, electronic equipment and storage medium Active CN112949388B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110112824.8A CN112949388B (en) 2021-01-27 2021-01-27 Image processing method, device, electronic equipment and storage medium
PCT/CN2021/121164 WO2022160753A1 (en) 2021-01-27 2021-09-28 Image processing method and apparatus, and electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110112824.8A CN112949388B (en) 2021-01-27 2021-01-27 Image processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112949388A (en) 2021-06-11
CN112949388B (en) 2024-04-16

Family

ID=76238035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110112824.8A Active CN112949388B (en) 2021-01-27 2021-01-27 Image processing method, device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112949388B (en)
WO (1) WO2022160753A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949388B (en) * 2021-01-27 2024-04-16 上海商汤智能科技有限公司 Image processing method, device, electronic equipment and storage medium
CN114004859A (en) * 2021-11-26 2022-02-01 山东大学 Method and system for segmenting echocardiography left atrium map based on multi-view fusion network
CN114154645B (en) * 2021-12-03 2022-05-17 中国科学院空间应用工程与技术中心 Cross-center image joint learning method and system, storage medium and electronic equipment
CN115457390A (en) * 2022-09-13 2022-12-09 中国人民解放军国防科技大学 Remote sensing image change detection method and device, computer equipment and storage medium
CN115908894A (en) * 2022-10-27 2023-04-04 中国科学院空天信息创新研究院 Optical remote sensing image ocean raft type culture area classification method based on panoramic segmentation
CN116091492B (en) * 2023-04-06 2023-07-14 中国科学技术大学 Image change pixel level detection method and system
CN116385881B (en) * 2023-04-10 2023-11-14 北京卫星信息工程研究所 Remote sensing image ground feature change detection method and device
CN116486085B (en) * 2023-04-27 2023-12-19 北京卫星信息工程研究所 Scene description method of remote sensing image
CN117671437A (en) * 2023-10-19 2024-03-08 中国矿业大学(北京) Open stope identification and change detection method based on multitasking convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263705A (en) * 2019-06-19 2019-09-20 上海交通大学 Towards two phase of remote sensing technology field high-resolution remote sensing image change detecting method
WO2020202505A1 (en) * 2019-04-03 2020-10-08 Nec Corporation Image processing apparatus, image processing method and non-transitory computer readable medium
CN111815579A (en) * 2020-06-24 2020-10-23 浙江大华技术股份有限公司 Image change detection method and device and computer readable storage medium
CN112149585A (en) * 2020-09-27 2020-12-29 上海商汤智能科技有限公司 Image processing method, device, equipment and storage medium
WO2021009141A1 (en) * 2019-07-12 2021-01-21 Neo, Netherlands Geomatics & Earth Observation B.V. Object-based change detection using a neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582043B (en) * 2020-04-15 2022-03-15 电子科技大学 High-resolution remote sensing image ground object change detection method based on multitask learning
CN112949388B (en) * 2021-01-27 2024-04-16 上海商汤智能科技有限公司 Image processing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112949388A (en) 2021-06-11
WO2022160753A1 (en) 2022-08-04

Similar Documents

Publication Publication Date Title
CN112949388B (en) Image processing method, device, electronic equipment and storage medium
Xu et al. Behind the curtain: Learning occluded shapes for 3d object detection
Chen et al. AI-empowered speed extraction via port-like videos for vehicular trajectory analysis
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN110633661A (en) Semantic segmentation fused remote sensing image target detection method
CN114119751A (en) Method and system for large scale determination of RGBD camera poses
CN112330664B (en) Pavement disease detection method and device, electronic equipment and storage medium
Bu et al. Pedestrian planar LiDAR pose (PPLP) network for oriented pedestrian detection based on planar LiDAR and monocular images
CN108010065A (en) Low target quick determination method and device, storage medium and electric terminal
US20230004797A1 (en) Physics-guided deep multimodal embeddings for task-specific data exploitation
CN114170438A (en) Neural network training method, electronic device and computer storage medium
Kim et al. Rotational multipyramid network with bounding‐box transformation for object detection
CN113240023B (en) Change detection method and device based on change image classification and feature difference value prior
CN117475145A (en) Multi-scale remote sensing image semantic segmentation method and system integrating multiple attention mechanisms
CN114862952B (en) Unmanned aerial vehicle detection and defense method and system
CN115861595A (en) Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning
CN115690665A (en) Video anomaly detection method and device based on cross U-Net network
CN114897939A (en) Multi-target tracking method and system based on deep path aggregation network
CN114648757A (en) Three-dimensional target detection method and device
CN111047571B (en) Image salient target detection method with self-adaptive selection training process
CN116711295A (en) Image processing method and apparatus
CN115457120A (en) Absolute position sensing method and system under GPS rejection condition
CN117710931A (en) Environment information sensing method, device, system, computer equipment and storage medium
CN109887007A (en) The detection method and device of space base moving target over the ground
Cao et al. VSL-Net: Voxel structure learning for 3D object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant