CN115375715A - Target extraction method and device, electronic equipment and storage medium - Google Patents

Target extraction method and device, electronic equipment and storage medium

Info

Publication number
CN115375715A
Authority
CN
China
Prior art keywords
image
characteristic information
result
unit
edge
Legal status
Pending
Application number
CN202210826414.4A
Other languages
Chinese (zh)
Inventor
王福涛
周艺
王世新
王振庆
王丽涛
刘文亮
朱金峰
赵清
侯艳芳
Current Assignee
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Application filed by Aerospace Information Research Institute of CAS
Priority to CN202210826414.4A
Publication of CN115375715A


Classifications

    • G06T 7/12: Image analysis; Segmentation; Edge-based segmentation
    • G06T 7/13: Image analysis; Segmentation; Edge detection
    • G06V 10/806: Image or video recognition or understanding; Fusion of extracted features at the feature extraction level
    • G06T 2207/10004: Indexing scheme for image analysis; Image acquisition modality; Still image; Photographic image


Abstract

The invention provides a target extraction method, a target extraction device, an electronic device and a storage medium, and relates to the technical field of image processing. The method comprises the following steps: acquiring an image of a target to be extracted; determining edge features of the image; and inputting the image and the edge features into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model. The multi-scale edge constraint model is obtained by training on sample images; the target extraction result is used for representing the buildings in the image; and the sample images are obtained based on initial sample images. In the method provided by the invention, the image of the target to be extracted and the edge features corresponding to the image are deeply fused by the multi-scale edge constraint model, which refines the boundaries of buildings in the image and improves the building extraction accuracy.

Description

Target extraction method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for extracting a target, an electronic device, and a storage medium.
Background
In recent years, with the rapid development of sensor technology, the spatial resolution of remote sensing images has continuously improved. Remote sensing images with higher spatial resolution carry richer spatial detail, which makes accurate extraction of buildings possible. However, the external environment of buildings is very complicated (for example, ancillary facilities and shadow occlusion), which makes building feature extraction difficult; the contrast between buildings and some non-building surfaces (parking lots, bare land, roads) is low, so building recognition results are easily disturbed; and buildings in different areas differ greatly in outline, structure and material. Fine extraction of buildings from remote sensing images therefore remains a challenging task.
In the related art, a convolutional neural network is used to extract buildings from a remote sensing image. Although the overall extraction accuracy for buildings is already high, there are still large errors in building boundary areas. Therefore, how to improve the extraction accuracy of building boundary areas is a problem that needs to be solved urgently.
Disclosure of Invention
The invention provides a target extraction method, a target extraction device, an electronic device and a storage medium, which are used for addressing the low extraction accuracy of building boundary areas in the prior art, realizing complete building extraction and improving building extraction accuracy.
The invention provides a target extraction method, which comprises the following steps:
acquiring an image of a target to be extracted;
determining edge features of the image;
inputting the image and the edge features into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model; the multi-scale edge constraint model is obtained after training based on a sample image; the target extraction result is used for representing a building in the image; the sample image is obtained based on the initial sample image.
According to an object extraction method provided by the invention, the multi-scale edge constraint model comprises an encoder, a decoder and an edge constraint block;
the inputting the image and the edge features into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model includes:
inputting the image to the encoder to obtain at least one piece of first characteristic information of the image output by the encoder; each piece of first feature information is used for representing features of different dimensions corresponding to the image;
respectively performing upsampling on each first characteristic information to obtain at least one first upsampling result after the upsampling;
inputting each piece of first feature information and each piece of first up-sampling result into the decoder to obtain at least one decoding result output by the decoder; each decoding result is used for representing the feature information of different resolutions corresponding to the image;
and inputting second up-sampling results corresponding to the edge features and the decoding results to the edge constraint block to obtain a target extraction result of the image output by the edge constraint block.
According to the target extraction method provided by the invention, the encoder comprises at least one encoding unit which is sequentially connected in series;
the inputting the image into the encoder to obtain at least one first feature information of the image output by the encoder includes:
inputting the image into a first coding unit to obtain first dimension characteristic information output by the first coding unit;
performing down-sampling on the first-dimension characteristic information to obtain a first down-sampling result after sampling;
inputting the first down-sampling result to a second coding unit to obtain second dimension characteristic information output by the second coding unit;
performing down-sampling on the second dimension characteristic information to obtain a second down-sampling result after sampling;
inputting the second downsampling result to a third encoding unit to obtain third dimensional feature information output by the third encoding unit;
performing down-sampling on the third dimension characteristic information to obtain a third down-sampling result after sampling;
inputting the third down-sampling result to a fourth encoding unit to obtain fourth dimension characteristic information output by the fourth encoding unit;
performing down-sampling on the fourth dimension characteristic information to obtain a fourth down-sampling result after sampling;
and inputting the fourth down-sampling result to a fifth encoding unit to obtain fifth dimension characteristic information output by the fifth encoding unit.
According to an object extraction method provided by the invention, the decoder comprises at least one decoding unit;
the inputting each of the first feature information and each of the first upsampling results into the decoder to obtain at least one decoding result output by the decoder includes:
inputting first up-sampling results corresponding to the second dimension characteristic information and the third dimension characteristic information to a first decoding unit to obtain a first decoding result output by the first decoding unit;
inputting first up-sampling results corresponding to the third dimension characteristic information and the fourth dimension characteristic information to a second decoding unit to obtain a second decoding result output by the second decoding unit;
inputting first up-sampling results corresponding to the fourth dimension characteristic information and the fifth dimension characteristic information to a third decoding unit to obtain a third decoding result output by the third decoding unit;
determining at least one decoding result output by the decoder based on the first decoding result, the second decoding result, and the third decoding result.
According to the target extraction method provided by the invention, the edge constraint block comprises at least one edge constraint unit which is connected in series in sequence;
the inputting the second upsampling results respectively corresponding to the edge features and the decoding results to the edge constraint block to obtain the target extraction result of the image output by the edge constraint block includes:
under the condition that the edge constraint unit is a first edge constraint unit, inputting the first dimension characteristic information, a first up-sampling result corresponding to the second dimension characteristic information and the edge characteristic into the first edge constraint unit to obtain first fusion characteristic information output by the first edge constraint unit; the first fusion characteristic information is used for fusing the first dimension characteristic information, a first up-sampling result corresponding to the second dimension characteristic information and the edge characteristic;
under the condition that the edge constraint unit is a non-first edge constraint unit, inputting second fusion feature information respectively corresponding to the edge feature, all edge constraint units before the non-first edge constraint unit and a second up-sampling result corresponding to a decoding result output by the decoder into the non-first edge constraint unit to obtain third fusion feature information output by the non-first edge constraint unit; and taking the result output by the last edge constraint unit as the target extraction result of the image.
According to the target extraction method provided by the present invention, the inputting the first dimension feature information, the first upsampling result corresponding to the second dimension feature information, and the edge feature into the first edge constraint unit to obtain the first fusion feature information output by the first edge constraint unit includes:
splicing the first dimension characteristic information, a first up-sampling result corresponding to the second dimension characteristic information and the edge characteristic, and determining spliced second characteristic information;
performing convolution twice on the spliced second characteristic information, and determining the convolved third characteristic information;
and inputting the third feature information into a self-attention mechanism unit to obtain first fusion feature information output by the self-attention mechanism unit.
According to a target extraction method provided by the present invention, the inputting second fused feature information corresponding to the edge feature and all edge constraint units before the non-first edge constraint unit respectively and a second upsampling result corresponding to the decoding result output by the decoder into the non-first edge constraint unit to obtain third fused feature information output by the non-first edge constraint unit includes:
splicing second fusion characteristic information respectively corresponding to the edge characteristic and all edge constraint units before the non-first edge constraint unit and a second up-sampling result corresponding to a decoding result output by the decoder to determine spliced fourth characteristic information;
performing convolution twice on the spliced fourth feature information, and determining fifth feature information after convolution;
and inputting the fifth feature information into the self-attention mechanism unit to obtain third fused feature information output by the self-attention mechanism unit.
According to an object extraction method provided by the invention, the decoding unit comprises a self-attention mechanism unit;
the inputting a first upsampling result corresponding to the second dimension characteristic information and the third dimension characteristic information into a first decoding unit to obtain a first decoding result output by the first decoding unit includes:
splicing the first up-sampling results corresponding to the second dimension characteristic information and the third dimension characteristic information, and determining spliced sixth characteristic information;
performing convolution twice on the sixth characteristic information, and determining seventh characteristic information after the convolution;
and inputting the seventh characteristic information into the self-attention mechanism unit to obtain a first decoding result output by the self-attention mechanism unit.
According to a target extraction method provided by the present invention, the sample image is obtained based on an initial sample image, and the method includes:
selecting at least one set of initial sample image pairs; the initial sample image pair comprises a background image and a copied image;
respectively carrying out Fourier transform on each pixel point in the background image and the copied image to obtain a frequency domain image after the Fourier transform;
respectively comparing the frequency domain images corresponding to the background image and the copied image;
under the condition that low-frequency pixel points in the frequency domain image corresponding to the background image are smaller than low-frequency pixel points in the frequency domain image corresponding to the copied image, replacing the low-frequency pixel points in the frequency domain image corresponding to the copied image with the low-frequency pixel points in the frequency domain image corresponding to the background image, and determining that the replaced background image is a sample image;
and under the condition that low-frequency pixel points in the frequency domain image corresponding to the background image are not smaller than low-frequency pixel points in the frequency domain image corresponding to the copied image, taking the copied image as a sample image.
According to a target extraction method provided by the present invention, the method further comprises:
calculating the copied image and a corresponding mask, and performing a dilation (expansion) operation on the calculation result;
deleting a building in an overlapping area in the background image when the background image and the copied image have the overlapping area;
and copying low-frequency pixel points in the frequency domain image corresponding to the copied image into the background image to obtain the sample image.
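As a purely illustrative reading of the frequency-domain sample generation described in the preceding paragraphs (the exact comparison rule and the size of the low-frequency window are not specified there), the NumPy sketch below transforms the background image and the copied image with the FFT, compares their low-frequency content, and either transplants the background's low frequencies into the copied image or keeps the copied image unchanged as the sample. The function name, the window fraction `beta` and the mean-magnitude comparison are assumptions, not details taken from the patent.

```python
import numpy as np

def low_freq_swap(background: np.ndarray, copied: np.ndarray, beta: float = 0.1) -> np.ndarray:
    """background, copied: (H, W) float arrays of the same size; returns a candidate sample image."""
    fb = np.fft.fftshift(np.fft.fft2(background))   # frequency-domain image of the background
    fc = np.fft.fftshift(np.fft.fft2(copied))       # frequency-domain image of the copied image

    h, w = background.shape
    cy, cx = h // 2, w // 2
    ry, rx = max(1, int(beta * h)), max(1, int(beta * w))      # low-frequency window around the centre
    sl = (slice(cy - ry, cy + ry), slice(cx - rx, cx + rx))

    # Compare the low-frequency components of the two frequency-domain images.
    if np.abs(fb[sl]).mean() < np.abs(fc[sl]).mean():
        fc[sl] = fb[sl]                                        # replace the copied image's low frequencies
        return np.fft.ifft2(np.fft.ifftshift(fc)).real         # image after replacement becomes the sample
    return copied                                              # otherwise keep the copied image as the sample
```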
The present invention also provides a target extraction apparatus, the apparatus comprising:
the acquisition module is used for acquiring an image of a target to be extracted;
the determining module is used for determining the edge characteristics of the image;
the extraction module is used for inputting the image and the edge characteristics into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model; the multi-scale edge constraint model is obtained after training based on a sample image; the target extraction result is used for representing a building in the image; the sample image is obtained based on the initial sample image.
The present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement any of the above object extraction methods.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a target extraction method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the method of object extraction as defined in any one of the above.
The invention provides a target extraction method, a target extraction device, electronic equipment and a storage medium, wherein an image of a target to be extracted is acquired; determining the edge characteristics of the image; inputting the image and the edge characteristics into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model; the multi-scale edge constraint model is obtained after training based on the sample image; the target extraction result is used for representing the buildings in the image; the sample image is obtained based on the initial sample image. According to the target extraction method provided by the invention, the image of the target to be extracted and the edge characteristics corresponding to the image are subjected to deep fusion through the multi-scale edge constraint model, the boundary of a building in the image is refined, and the building extraction precision is improved.
Drawings
In order to more clearly illustrate the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is one of the flow diagrams of the target extraction method provided by the present invention;
FIG. 2 is a second schematic flow chart of the target extraction method provided by the present invention;
FIG. 3 is a flow chart of the operation of the self-attention mechanism unit provided by the present invention;
FIG. 4 is a schematic structural diagram of an edge constraint unit provided by the present invention;
FIG. 5 is a schematic structural diagram of a multi-scale edge constraint model provided by the present invention;
FIG. 6 is a diagram illustrating the results of a sample image acquisition method according to the present invention;
FIG. 7 is a schematic diagram of the results of the target extraction method provided by the present invention;
FIG. 8 is a schematic structural diagram of a target extracting apparatus provided in the present invention;
fig. 9 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The following describes the target extraction method provided by the present invention in detail through some embodiments and application scenarios thereof with reference to the accompanying drawings.
The invention provides a target extraction method, which is suitable for a target extraction scene in a remote sensing image and comprises the following steps: acquiring an image of a target to be extracted; determining edge features of the image; inputting the image and the edge features into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model; the multi-scale edge constraint model is obtained after training based on a sample image; the target extraction result is used for representing a building in the image; the sample image is obtained based on the initial sample image. According to the method provided by the invention, the image of the target to be extracted and the edge characteristics corresponding to the image are subjected to deep fusion through the multi-scale edge constraint model, the boundary of a building in the image is refined, and the building extraction precision is improved.
The object extraction method of the present invention is described below with reference to fig. 1 to 7.
Fig. 1 is a schematic flow diagram of a target extraction method provided in the present invention, and as shown in fig. 1, the method includes steps 101 to 103, where:
step 101, obtaining an image of a target to be extracted.
It should be noted that the target extraction method provided by the invention can be applied to a target extraction scene in a remote sensing image. The execution subject of the method may be a target extraction apparatus, such as an electronic device, or a control module in the target extraction apparatus for executing the target extraction method.
Specifically, the image of the target to be extracted can be acquired by capturing a remote sensing image of a target area from an aircraft or a satellite; for example, the target to be extracted may be a building, but it may also be another type of object.
Step 102, determining edge features of the image.
Specifically, the Sobel operator is convolved with the image, the resulting gradient values are computed, and the result obtained is taken as the edge feature of the image.
The Sobel operator is a common first-derivative edge detection operator. It convolves the image with two 3 × 3 matrices to obtain the horizontal gradient E_x and the vertical gradient E_y, as shown in formula (1) and formula (2) respectively; the edge feature E is then obtained from the horizontal gradient E_x and the vertical gradient E_y, as expressed by formula (3):
E_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]] * I    (1)
E_y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]] * I    (2)
E = sqrt(E_x^2 + E_y^2)    (3)
where I denotes the image, * the convolution operation, E_x the horizontal gradient, E_y the vertical gradient, and E the edge feature.
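For reference, the following is a minimal sketch of the Sobel edge-feature computation described above, written in PyTorch. The function name, tensor layout and the assumption of a single-channel (grayscale) input are illustrative and not taken from the patent; a multi-band remote sensing image would first be converted to a single channel or processed band by band.

```python
import torch
import torch.nn.functional as F

def sobel_edge_features(image: torch.Tensor) -> torch.Tensor:
    """image: (N, 1, H, W) grayscale float tensor; returns the edge magnitude E with the same shape."""
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]], dtype=image.dtype).view(1, 1, 3, 3)
    ky = torch.tensor([[-1., -2., -1.],
                       [ 0.,  0.,  0.],
                       [ 1.,  2.,  1.]], dtype=image.dtype).view(1, 1, 3, 3)
    ex = F.conv2d(image, kx, padding=1)            # horizontal gradient E_x, formula (1)
    ey = F.conv2d(image, ky, padding=1)            # vertical gradient E_y, formula (2)
    return torch.sqrt(ex ** 2 + ey ** 2 + 1e-12)   # edge feature E, formula (3)
```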
103, inputting the image and the edge features into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model; the multi-scale edge constraint model is obtained after training based on a sample image; the target extraction result is used for representing a building in the image; the sample image is obtained based on the initial sample image.
Specifically, the multi-scale edge constraint model (MEC-Net) uses U-Net++ as its base network and adopts a residual network (ResNet-50) as the backbone for extracting buildings from remote sensing images.
In practice, the acquired image of the target to be extracted and the edge features of the image are input into the multi-scale edge constraint model to obtain the target extraction result of the image output by the model. The target extraction result is used for representing the buildings in the image; that is, the buildings in the image are extracted by the multi-scale edge constraint model. The multi-scale edge constraint model is obtained by training on sample images, and the sample images are obtained based on initial sample images.
The target extraction method provided by the invention comprises the steps of obtaining an image of a target to be extracted; determining the edge characteristics of the image; and inputting the image and the edge characteristics into the multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model. According to the method provided by the invention, the image of the target to be extracted and the edge characteristics corresponding to the image are subjected to deep fusion through the multi-scale edge constraint model, the boundary of a building in the image is refined, and the building extraction precision is improved.
Fig. 2 is a second schematic flowchart of the object extracting method provided by the present invention, as shown in fig. 2, the method includes steps 201 to 208, wherein:
step 201, acquiring an image of a target to be extracted;
step 202, determining edge features of the image;
step 203, inputting the image to the encoder to obtain at least one first feature information of the image output by the encoder; each piece of first feature information is used for representing features of different dimensions corresponding to the image.
Specifically, the multi-scale edge constraint model comprises an encoder, a decoder and an edge constraint block; inputting the image into an encoder to obtain at least one first characteristic information of the image output by the encoder; the first feature information is used for representing features of different dimensions corresponding to the image.
Optionally, the encoder comprises at least one coding unit connected in series in sequence; the specific implementation manner of step 203 includes the following steps:
step 1) inputting the image into a first coding unit to obtain first dimension characteristic information output by the first coding unit.
Specifically, the image is input into the first coding unit; the first coding unit performs convolution and ReLU activation on the image, and then performs convolution and ReLU activation again on the output, to obtain the first dimension characteristic information output by the first coding unit. The size and number of convolution kernels can be set according to actual conditions; for example, the convolution kernel size is 3 × 3 and the number of convolution kernels is 64.
And 2) performing down-sampling on the first-dimension characteristic information to obtain a first down-sampling result after sampling.
Specifically, the first dimension characteristic information is downsampled by a pooling kernel to obtain the first downsampling result; for example, the pooling kernel may be 2 × 2 in size, so that the first dimension characteristic information is downsampled to half its original resolution.
And 3) inputting the first down-sampling result into a second coding unit to obtain second dimension characteristic information output by the second coding unit.
Specifically, the first downsampling result is input into a second coding unit, the second coding unit performs convolution and ReLU function activation on the first downsampling result to obtain a result output by the ReLU function, and then performs convolution and ReLU function activation on the output result to obtain second dimension characteristic information output by the second coding unit; wherein the number of convolution kernels is twice the number of convolution kernels in step 1), and the dimensionality of the second-dimension characteristic information is higher than that of the first-dimension characteristic information.
And 4) down-sampling the second dimension characteristic information to obtain a second down-sampling result after sampling.
Specifically, the second dimension feature information is subjected to down-sampling through a pooling kernel to obtain a second down-sampling result.
And 5) inputting the second down-sampling result into a third coding unit to obtain third dimensional characteristic information output by the third coding unit.
Specifically, a second downsampling result is input into a third coding unit, the third coding unit performs convolution and ReLU function activation on the second downsampling result to obtain a result output by the ReLU function, and then performs convolution and ReLU function activation on the output result to obtain third dimension characteristic information output by the third coding unit; wherein the number of convolution kernels is twice the number of convolution kernels in step 3), and the dimension of the third-dimension feature information is higher than that of the second-dimension feature information.
And 6) carrying out down-sampling on the third dimension characteristic information to obtain a third down-sampling result after sampling.
Specifically, the third dimension feature information is subjected to down-sampling through a pooling kernel to obtain a third down-sampling result.
And 7) inputting the third down-sampling result to a fourth encoding unit to obtain fourth dimension characteristic information output by the fourth encoding unit.
Specifically, a third downsampling result is input into a fourth coding unit, the fourth coding unit performs convolution and ReLU function activation on the third downsampling result to obtain a result output by the ReLU function, and then performs convolution and ReLU function activation on the output result to obtain fourth-dimension characteristic information output by the fourth coding unit; wherein the number of convolution kernels is twice the number of convolution kernels in step 5), and the dimensionality of the fourth-dimension characteristic information is higher than that of the third-dimension characteristic information.
And 8) performing down-sampling on the fourth dimension characteristic information to obtain a fourth down-sampling result after sampling.
Specifically, the fourth-dimensional feature information is subjected to down-sampling through a pooling kernel, so as to obtain a fourth down-sampling result.
And 9) inputting the fourth down-sampling result to a fifth coding unit to obtain fifth dimension characteristic information output by the fifth coding unit.
Specifically, a fourth downsampling result is input into a fifth coding unit, the fifth coding unit performs convolution and ReLU function activation on the fourth downsampling result to obtain a result output by the ReLU function, and then performs convolution and ReLU function activation on the output result to obtain fifth dimension characteristic information output by the fifth coding unit; wherein the number of convolution kernels is twice the number of convolution kernels in step 7), and the dimensionality of the fifth-dimension feature information is higher than that of the fourth-dimension feature information.
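As a reference only, the following PyTorch sketch illustrates the five-stage encoder just described (two 3 × 3 convolution + ReLU layers per coding unit, a 2 × 2 pooling kernel between units, and the number of kernels doubling from 64). The class names, the 3-channel input and the use of max pooling are assumptions for illustration; the patent's implementation uses a ResNet-50 backbone rather than this plain encoder.

```python
import torch
import torch.nn as nn

class CodingUnit(nn.Module):
    """Two 3x3 convolutions, each followed by ReLU activation."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Encoder(nn.Module):
    """Five coding units with 2x2 down-sampling between them."""
    def __init__(self, in_ch: int = 3, base: int = 64):
        super().__init__()
        chs = [base * 2 ** i for i in range(5)]      # 64, 128, 256, 512, 1024
        self.units = nn.ModuleList(
            [CodingUnit(in_ch, chs[0])] +
            [CodingUnit(chs[i - 1], chs[i]) for i in range(1, 5)]
        )
        self.pool = nn.MaxPool2d(2)                  # 2x2 pooling kernel

    def forward(self, x):
        feats = []                                   # first- to fifth-dimension feature information
        for i, unit in enumerate(self.units):
            x = unit(x if i == 0 else self.pool(x))
            feats.append(x)
        return feats
```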
Step 204, performing upsampling on each first feature information respectively to obtain at least one first upsampling result after the upsampling.
Specifically, the at least one piece of first feature information output by the encoder is deconvolved, that is, each piece of first feature information is upsampled, so as to obtain at least one first upsampling result after upsampling.
Step 205, inputting each of the first feature information and each of the first up-sampling results into the decoder to obtain at least one decoding result output by the decoder; each decoding result is used for representing the feature information of different resolutions corresponding to the image.
Optionally, the decoder comprises at least one decoding unit; the specific implementation manner of the step 205 includes the following steps:
step 1) inputting first up-sampling results corresponding to the second dimension characteristic information and the third dimension characteristic information into a first decoding unit to obtain a first decoding result output by the first decoding unit.
Specifically, the second dimension characteristic information output by the second encoding unit and a first up-sampling result corresponding to the third dimension characteristic information are input to the first decoding unit, and the first decoding unit fuses the second dimension characteristic information and the first up-sampling result to obtain a first decoding result output by the first decoding unit.
Optionally, the decoding unit comprises a self-attention mechanism unit;
the inputting the first upsampling result corresponding to the second dimension characteristic information and the third dimension characteristic information into a first decoding unit to obtain a first decoding result output by the first decoding unit includes:
splicing the first up-sampling results corresponding to the second dimension characteristic information and the third dimension characteristic information, and determining spliced sixth characteristic information;
performing convolution twice on the sixth characteristic information, and determining seventh characteristic information after the convolution;
and inputting the seventh characteristic information into the self-attention mechanism unit to obtain a first decoding result output by the self-attention mechanism unit.
Specifically, the decoder comprises at least one decoding unit, and each decoding unit comprises a self-attention mechanism unit, namely scSE. scSE is a combined attention mechanism that combines spatial attention and channel attention; a self-attention mechanism enables the model to learn to use attention, i.e., to focus on information or features of interest, so that unimportant information can be suppressed and more useful information can be exploited. The self-attention mechanism unit scSE is a parallel combination of a channel attention mechanism unit (cSE) and a spatial attention mechanism unit (sSE): the input feature map passes through cSE and sSE respectively, and the output results of sSE and cSE are added to obtain a more accurately corrected feature map.
It should be noted that the cSE module belongs to the channel attention mechanism: it reassigns weights to the feature map along the channel dimension to obtain a channel-weighted feature map. First, the shape of the feature map is converted from (C, H, W) to (C, 1, 1) by global average pooling; then two 1 × 1 convolution blocks and an activation function (ReLU) are used to obtain a channel importance vector of shape (C, 1, 1), which is multiplied channel-wise with the original feature map to obtain a feature map corrected by attention on the channels. The sSE module belongs to the spatial attention mechanism: it reassigns weights to the spatial locations of the same feature map to obtain a feature map containing different spatial weight information. First, a 1 × 1 convolution block with C input channels and one output channel compresses the channels of the feature map, converting its shape from (C, H, W) to (1, H, W); then a sigmoid function is applied to obtain a spatial importance map, which is multiplied with the original feature map to obtain a feature map corrected by attention in space.
Fig. 3 is an operation flow diagram of the self-attention mechanism unit provided by the present invention. As shown in Fig. 3, the channel attention mechanism unit (cSE) pools the original feature map by global averaging, obtains a channel importance vector of shape (C, 1, 1) through two 1 × 1 convolution blocks and an activation function (ReLU), and multiplies it with the original feature map to obtain a feature map corrected by attention on the channels; the spatial attention mechanism unit (sSE) compresses the channels of the original feature map with a 1 × 1 convolution block so that its shape is converted from (C, H, W) to (1, H, W), applies a sigmoid function to obtain a spatial importance map, and multiplies it with the original feature map to obtain a feature map corrected by attention in space; the self-attention mechanism unit (scSE) feeds the original feature map into cSE and sSE respectively to obtain the channel-corrected and spatially corrected feature maps, and adds the two feature maps to obtain a more accurately corrected feature map.
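For illustration only, the following PyTorch sketch shows a standard cSE / sSE / scSE implementation matching the description above. The class names and the channel reduction ratio are assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn

class ChannelSE(nn.Module):
    """cSE: channel attention; reweights the feature map along the channel dimension."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                       # (C, H, W) -> (C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.se(x)                                              # channel-wise reweighting

class SpatialSE(nn.Module):
    """sSE: spatial attention; reweights the feature map at each spatial location."""
    def __init__(self, channels: int):
        super().__init__()
        self.se = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())   # (C, H, W) -> (1, H, W)

    def forward(self, x):
        return x * self.se(x)                                              # pixel-wise reweighting

class SCSE(nn.Module):
    """scSE: cSE and sSE applied in parallel, with their outputs added."""
    def __init__(self, channels: int):
        super().__init__()
        self.cse = ChannelSE(channels)
        self.sse = SpatialSE(channels)

    def forward(self, x):
        return self.cse(x) + self.sse(x)
```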
In practice, the first decoding unit convolves the second dimension characteristic information output by the second encoding unit and the first upsampling result corresponding to the third dimension characteristic information output by the third encoding unit, and then splices the two by concatenation to obtain the spliced sixth characteristic information; the sixth characteristic information passes through a convolution kernel and a ReLU function, and then through a convolution kernel and a ReLU function once more, to obtain the seventh characteristic information; the seventh characteristic information is input into the self-attention mechanism unit (scSE) to obtain the first decoding result output by the self-attention mechanism unit.
And 2) inputting the first up-sampling results corresponding to the third dimension characteristic information and the fourth dimension characteristic information into a second decoding unit to obtain a second decoding result output by the second decoding unit.
Specifically, the second decoding unit convolves the third dimension characteristic information output by the third encoding unit and the first upsampling result corresponding to the fourth dimension characteristic information output by the fourth encoding unit, and then splices the convolved results by concatenation to obtain the spliced eighth characteristic information; the eighth characteristic information passes through a convolution kernel and a ReLU function, and then through a convolution kernel and a ReLU function once more, to obtain the ninth characteristic information; the ninth characteristic information is input into the self-attention mechanism unit (scSE) to obtain the second decoding result output by the self-attention mechanism unit.
And step 3) inputting the first up-sampling results corresponding to the fourth dimension characteristic information and the fifth dimension characteristic information into a third decoding unit to obtain a third decoding result output by the third decoding unit.
Specifically, the third decoding unit convolves the fourth dimension characteristic information output by the fourth encoding unit and the first upsampling result corresponding to the fifth dimension characteristic information output by the fifth encoding unit, and then splices the convolved results by concatenation to obtain the spliced tenth characteristic information; the tenth characteristic information passes through a convolution kernel and a ReLU function, and then through a convolution kernel and a ReLU function once more, to obtain the eleventh characteristic information; the eleventh characteristic information is input into the self-attention mechanism unit (scSE) to obtain the third decoding result output by the self-attention mechanism unit.
Step 4) determining at least one decoding result output by the decoder based on the first decoding result, the second decoding result and the third decoding result.
Specifically, by following the decoding steps performed by the first decoding unit, the second decoding unit and the third decoding unit, and using the first decoding result output by the first decoding unit, the second decoding result output by the second decoding unit and the third decoding result output by the third decoding unit, the decoding results output by the other decoding units of the decoder can be obtained.
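For reference, a minimal sketch of one decoding unit as described above: the skip features and the upsampled deeper feature are concatenated, passed through two 3 × 3 convolution + ReLU layers, and then through the scSE unit. Class names and channel handling are assumptions; the `attention` argument is meant to receive an scSE block such as the SCSE class sketched earlier.

```python
import torch
import torch.nn as nn

class DecodingUnit(nn.Module):
    """Concatenate skip features with the upsampled deeper feature, apply two conv+ReLU, then attention."""
    def __init__(self, in_ch: int, out_ch: int, attention=None):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Pass an scSE block here (e.g. the SCSE class sketched above); identity otherwise.
        self.att = attention if attention is not None else nn.Identity()

    def forward(self, skips, deeper):
        # "Splicing" = concatenation along channels; in_ch must equal the total channel count of the inputs.
        x = torch.cat(list(skips) + [self.up(deeper)], dim=1)
        return self.att(self.conv(x))   # decoding result
```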
Step 206, inputting the second upsampling results corresponding to the edge features and the decoding results to the edge constraint block, so as to obtain a target extraction result of the image output by the edge constraint block.
Specifically, the first dimension feature information output by the first encoding unit, the first upsampling result corresponding to the second dimension feature information output by the second encoding unit, and the second upsampling results corresponding to the decoding results are input into the edge constraint block, so that the target extraction result of the image output by the edge constraint block can be obtained.
The target extraction method provided by the invention comprises the steps of inputting an image into an encoder to obtain at least one piece of first characteristic information of the image output by the encoder; respectively performing upsampling on each first characteristic information to obtain at least one first upsampling result after the upsampling; inputting each first feature information and each first up-sampling result into a decoder to obtain at least one decoding result output by the decoder; and the image of the target to be extracted, the edge feature corresponding to the image and the feature information of different scales are fused through the encoder, the decoder, the edge constraint block and a self-attention mechanism, the boundary of a building in the image is refined, and the building extraction precision is improved.
Optionally, the edge constraint block includes at least one edge constraint unit connected in series in sequence;
the inputting the second upsampling results respectively corresponding to the edge features and the decoding results to the edge constraint block to obtain the target extraction result of the image output by the edge constraint block includes:
under the condition that the edge constraint unit is a first edge constraint unit, inputting the first dimension characteristic information, a first up-sampling result corresponding to the second dimension characteristic information and the edge characteristic into the first edge constraint unit to obtain first fusion characteristic information output by the first edge constraint unit; the first fusion characteristic information is used for fusing the first dimension characteristic information, a first up-sampling result corresponding to the second dimension characteristic information and the edge characteristic;
under the condition that the edge constraint unit is a non-first edge constraint unit, inputting second fusion feature information respectively corresponding to the edge feature and all edge constraint units before the non-first edge constraint unit and a second upsampling result corresponding to a decoding result output by the decoder into the non-first edge constraint unit to obtain third fusion feature information output by the non-first edge constraint unit; and taking the result output by the last edge constraint unit as the target extraction result of the image.
Specifically, the following describes edge constraint units in the edge constraint block:
1) When an edge constraint unit in an edge constraint block is a first edge constraint unit, the inputting the first dimension feature information, a first upsampling result corresponding to the second dimension feature information, and the edge feature into the first edge constraint unit to obtain first fusion feature information output by the first edge constraint unit includes:
splicing the first dimension characteristic information, a first up-sampling result corresponding to the second dimension characteristic information and the edge characteristic to determine spliced second characteristic information; performing convolution twice on the spliced second characteristic information, and determining the convolved third characteristic information; and inputting the third feature information into a self-attention mechanism unit to obtain first fusion feature information output by the self-attention mechanism unit.
Specifically, the first edge constraint unit splices the result of convolution of the first dimension feature information and the first up-sampling result corresponding to the second dimension feature information with the edge feature by adopting the following formula (4), and determines spliced second feature information; performing convolution twice and ReLU function activation on the spliced second characteristic information to obtain third characteristic information; inputting the third characteristic information into the self-attention mechanism unit, namely obtaining first fusion characteristic information output by the self-attention mechanism unit by adopting a formula (5); wherein:
F_j = Concat(E, Conv(Low_feature(0)), ..., Conv(Low_feature(j-1)), Conv(Up(High_feature(j-1))))    (4)
output_j = scSE(ConvReLU(ConvReLU(F_j)))    (5)
where E denotes the edge feature; for the first edge constraint unit, Low_feature(0) (k = 0) denotes the first dimension feature information and Up(High_feature(0)) denotes the first upsampling result corresponding to the second dimension feature information; F_j denotes the spliced second feature information; and output_j denotes the first fusion feature information, where j = 1 indicates the first edge constraint unit.
2) When the edge constraint unit is a non-first edge constraint unit, the inputting, to the non-first edge constraint unit, second fused feature information respectively corresponding to the edge feature and all edge constraint units before the non-first edge constraint unit, and a second upsampling result corresponding to a decoding result output by the decoder, to obtain third fused feature information output by the non-first edge constraint unit includes:
splicing second fusion feature information corresponding to the edge feature and all edge constraint units before the non-first edge constraint unit and a second up-sampling result corresponding to the decoding result output by the decoder to determine spliced fourth feature information; performing convolution twice on the spliced fourth feature information, and determining fifth feature information after convolution; and inputting the fifth feature information into the self-attention mechanism unit to obtain third fused feature information output by the self-attention mechanism unit.
Specifically, a non-first edge constraint unit may also use formula (4) to splice the edge feature, the second fusion feature information corresponding to all edge constraint units before the non-first edge constraint unit, and the convolved second upsampling result corresponding to the decoding result output by the decoder, and then perform convolution twice and ReLU activation on the spliced fourth feature information to obtain the fifth feature information; the fifth feature information is input into the self-attention mechanism unit, i.e., formula (5) is used to obtain the third fusion feature information output by the self-attention mechanism unit, and the result output by the last edge constraint unit is used as the target extraction result of the image. It should be noted that, when the result output by the last edge constraint unit is used as the target extraction result of the image, the result is compared with a target threshold: pixels whose values are larger than the threshold are set to white, and pixels whose values are smaller than the threshold are set to black. In this case, in formula (4) and formula (5), Low_feature(k) with k > 0 denotes the second fusion feature information output by the preceding edge constraint units, Up(High_feature(j-1)) denotes the second upsampling result corresponding to the decoding result output by the decoder, F_j denotes the fourth feature information, and output_j with j > 1 denotes the third fusion feature information output by a non-first edge constraint unit.
Fig. 4 is a schematic structural diagram of an edge constraint unit provided by the present invention. As shown in Fig. 4, the edge feature (Edge_feature), the low-dimensional features (Low_feature(0), Low_feature(1), ..., Low_feature(j-1)) and the second upsampling result corresponding to the high-dimensional feature (High_feature(j-1)) are convolved, the results are spliced by concatenation (Concat), and the spliced result is passed through a convolution (Conv2d) and an activation function (ReLU), then through another convolution (Conv2d) and activation function (ReLU), and finally through the self-attention mechanism unit (scSE) to obtain the fusion feature information output by the self-attention mechanism unit.
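The following is an illustrative PyTorch sketch of one edge constraint unit following Fig. 4 and formulas (4) and (5): the edge feature, the low-level features and the upsampled high-level feature are concatenated, passed through two 3 × 3 convolution + ReLU layers, and then through the scSE unit. Class names, channel sizes and the `attention` argument (intended for an scSE block such as the SCSE class sketched earlier) are assumptions; the per-input convolutions shown in Fig. 4 are omitted for brevity.

```python
import torch
import torch.nn as nn

class EdgeConstraintUnit(nn.Module):
    """Fuse the edge feature with low- and high-level features, as in Fig. 4."""
    def __init__(self, in_ch: int, out_ch: int, attention=None):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.att = attention if attention is not None else nn.Identity()  # plug in an scSE block here

    def forward(self, edge, low_feats, high_feat):
        # F_j: concatenate the edge feature, low-level features and upsampled high-level feature (formula (4))
        x = torch.cat([edge] + list(low_feats) + [self.up(high_feat)], dim=1)
        # output_j: two conv+ReLU layers followed by the attention unit (formula (5))
        return self.att(self.fuse(x))
```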
FIG. 5 is a schematic structural diagram of the multi-scale edge constraint model provided by the present invention. As shown in FIG. 5, X0,0, X1,0, X2,0, X3,0 and X4,0 form the encoder, where X0,0, X1,0, X2,0, X3,0 and X4,0 denote the first, second, third, fourth and fifth coding units respectively; X1,1, X1,2, X1,3, X2,1, X2,2 and X3,1 form the decoder, where X1,1 denotes the first decoding unit, X2,1 the second decoding unit, X3,1 the third decoding unit, and X1,2, X1,3 and X2,2 the other decoding units; X0,1, X0,2, X0,3 and X0,4 form the edge constraint block (MEC block), where X0,1 denotes the first edge constraint unit and X0,2, X0,3 and X0,4 denote non-first edge constraint units; the result output by X0,4 is compared with a target threshold to obtain the target extraction result of the image.
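To make the wiring in FIG. 5 concrete, the small helper below enumerates, for every decoder and edge constraint node X(i,j), the inputs it receives, assuming the U-Net++-style dense skip connections described above (all earlier nodes in the same row, the upsampled node from the row below, and additionally the edge feature E for the edge constraint row). This is an illustrative reading of FIG. 5, not code from the patent.

```python
def mecnet_inputs(depth: int = 4):
    """Return, for every non-encoder node X(i,j) of FIG. 5, the list of features feeding it."""
    wiring = {}
    for j in range(1, depth + 1):               # column index (decoder / edge-constraint stage)
        for i in range(0, depth + 1 - j):       # row index (resolution level; row 0 is the edge row)
            inputs = [f"X{i},{k}" for k in range(j)]      # all earlier nodes in the same row
            inputs.append(f"Up(X{i + 1},{j - 1})")        # upsampled node from the row below
            if i == 0:
                inputs.append("E")                        # the edge feature enters every X0,j
            wiring[f"X{i},{j}"] = inputs
    return wiring

# Example: X0,1 <- [X0,0, Up(X1,0), E]; the output of X0,4 is thresholded to give the extraction result.
print(mecnet_inputs())
```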
Next, the target extraction method provided by the present invention is described in detail according to the structural schematic diagram of the multi-scale edge constraint model shown in fig. 5.
1) Acquiring an image of a target to be extracted;
2) Determining the edge characteristics of the image by using a Sobel operator;
3) The image is input to a first coding unit of an encoder, the first coding unit performs convolution and ReLU function activation on the image, and then the image is subjected to convolution and ReLU function activation, namely, the image is subjected to convolution and ReLU function activation twice, so that first dimension characteristic information output by the first coding unit is obtained; performing down-sampling on the first-dimension characteristic information, namely passing the first-dimension characteristic information through pooling kernel to obtain a first down-sampling result after sampling; inputting the first downsampling result into a second coding unit, and activating the first downsampling result by the second coding unit through two times of convolution and ReLU function to obtain second dimension characteristic information output by the second coding unit; performing down-sampling on the second dimension characteristic information, and performing pooling kernel on the second dimension characteristic information to obtain a second down-sampling result after sampling; inputting the second down-sampling result to a third coding unit, and performing convolution twice and ReLU function activation on the second down-sampling result by the third coding unit to obtain third dimension characteristic information output by the third coding unit; performing down-sampling on the third-dimension characteristic information, and performing pooling kernel on the third-dimension characteristic information to obtain a third down-sampling result after sampling; inputting the third down-sampling result into a fourth coding unit, and performing convolution twice and ReLU function activation on the third down-sampling result by the fourth coding unit to obtain fourth dimension characteristic information output by the fourth coding unit; performing down-sampling on the fourth-dimensional feature information, and performing pooling kernel on the fourth-dimensional feature information to obtain a fourth down-sampling result after sampling; inputting the fourth down-sampling result to a fifth coding unit, and performing convolution twice and ReLU function activation on the fourth down-sampling result by the fifth coding unit to obtain fifth dimension characteristic information output by the fifth coding unit;
4) Inputting first up-sampling results corresponding to the second dimension characteristic information and the third dimension characteristic information into a first decoding unit, and splicing results obtained after convolution of the first up-sampling results corresponding to the second dimension characteristic information and the third dimension characteristic information by the first decoding unit to determine spliced sixth characteristic information; performing convolution and ReLU function activation on the sixth feature information twice, and determining seventh feature information after convolution; inputting the seventh feature information into the attention control unit to obtain a first decoding result output by the attention control unit;
inputting a first up-sampling result corresponding to the third dimension characteristic information and the fourth dimension characteristic information into a second decoding unit, and splicing the first up-sampling result corresponding to the third dimension characteristic information and the fourth dimension characteristic information by the second decoding unit to determine seventh characteristic information after splicing; performing convolution twice on the seventh characteristic information, and determining eighth characteristic information after the convolution; inputting the eighth characteristic information into the attention mechanism unit to obtain a second decoding result output by the attention mechanism unit;
inputting first up-sampling results corresponding to the fourth dimension characteristic information and the fifth dimension characteristic information into a third decoding unit, and splicing the first up-sampling results corresponding to the fourth dimension characteristic information and the fifth dimension characteristic information by the third decoding unit to determine spliced ninth characteristic information; performing convolution twice on the ninth characteristic information, and determining tenth characteristic information after the convolution; inputting the tenth characteristic information into the attention mechanism unit to obtain a third decoding result output by the attention mechanism unit;
the second dimension characteristic information, the first decoding result and a second up-sampling result corresponding to the second decoding result are input to a fourth decoding unit; the fourth decoding unit splices the results obtained after convolution of these inputs to determine spliced eleventh characteristic information; convolution and ReLU function activation are performed on the eleventh characteristic information twice to determine twelfth characteristic information after convolution; the twelfth characteristic information is input to the self-attention mechanism unit to obtain a fourth decoding result output by the self-attention mechanism unit;
the third dimension characteristic information, the second decoding result and a second up-sampling result corresponding to the third decoding result are input to a fifth decoding unit; the fifth decoding unit splices the results obtained after convolution of these inputs to determine spliced thirteenth characteristic information; convolution and ReLU function activation are performed on the thirteenth characteristic information twice to determine fourteenth characteristic information after convolution; the fourteenth characteristic information is input to the self-attention mechanism unit to obtain a fifth decoding result output by the self-attention mechanism unit;
the second dimension characteristic information, the first decoding result, the fourth decoding result and a second up-sampling result corresponding to the fifth decoding result are input to a sixth decoding unit; the sixth decoding unit splices the results obtained after convolution of these inputs to determine spliced fifteenth characteristic information; convolution and ReLU function activation are performed on the fifteenth characteristic information twice to determine sixteenth characteristic information after convolution; the sixteenth characteristic information is input to the self-attention mechanism unit to obtain a sixth decoding result output by the self-attention mechanism unit; thereby at least one decoding result output by the decoder is obtained;
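Each decoding unit in step 4) follows the same pattern: convolve the inputs, splice them, apply two convolution + ReLU layers, then a self-attention unit. The sketch below illustrates this pattern; the SelfAttention module is a generic spatial-attention placeholder, and the 1x1 channel-alignment convolutions and channel counts are assumptions, since the patent does not specify the exact attention formulation.

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Simple spatial self-attention placeholder (assumed form, not fixed by the patent)."""
    def __init__(self, ch):
        super().__init__()
        self.query = nn.Conv2d(ch, ch // 8, 1)
        self.key = nn.Conv2d(ch, ch // 8, 1)
        self.value = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))
    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)      # B x HW x C'
        k = self.key(x).flatten(2)                        # B x C' x HW
        attn = torch.softmax(q @ k, dim=-1)               # B x HW x HW
        v = self.value(x).flatten(2)                      # B x C x HW
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                       # residual attention output

class DecodingUnit(nn.Module):
    """Convolve each input, splice them, apply two conv+ReLU layers, then self-attention."""
    def __init__(self, in_chs, out_ch):
        super().__init__()
        self.reducers = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_chs])
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch * len(in_chs), out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.attn = SelfAttention(out_ch)
    def forward(self, *inputs):
        # inputs: same-resolution feature maps, e.g. encoder features and up-sampled results
        reduced = [r(x) for r, x in zip(self.reducers, inputs)]
        spliced = torch.cat(reduced, dim=1)               # spliced characteristic information
        fused = self.fuse(spliced)                        # two convolution + ReLU layers
        return self.attn(fused)                           # decoding result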
5) The edge features, the first dimension characteristic information and the first up-sampling result corresponding to the second dimension characteristic information are input to a first edge constraint unit; the first edge constraint unit splices the results obtained after convolution of these inputs to obtain spliced second characteristic information; convolution and ReLU function activation are performed on the spliced second characteristic information twice to determine third characteristic information after convolution; the third characteristic information is input to a self-attention mechanism unit to obtain first fusion characteristic information output by the self-attention mechanism unit;
for each edge constraint unit other than the first one, the edge features, the second fusion characteristic information respectively corresponding to all preceding edge constraint units, and a second up-sampling result corresponding to the decoding result output by the decoder are input to that edge constraint unit; the edge constraint unit splices the results obtained after convolution of these inputs to determine spliced fourth characteristic information; convolution and ReLU function activation are performed on the spliced fourth characteristic information twice to determine fifth characteristic information after convolution; the fifth characteristic information is input to the self-attention mechanism unit to obtain third fusion characteristic information output by the self-attention mechanism unit;
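Each edge constraint unit in step 5) follows the same splice, two-convolution (+ReLU) and self-attention pattern, but always receives the edge features as an additional input. A minimal sketch is given below; it reuses the torch/nn imports and the SelfAttention placeholder from the decoding-unit sketch above, and the 1x1 channel-alignment convolutions are an assumption.

class EdgeConstraintUnit(nn.Module):
    """Fuse edge features with encoder/decoder features (splice, two conv+ReLU, self-attention)."""
    def __init__(self, in_chs, out_ch):
        super().__init__()
        self.reducers = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_chs])
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch * len(in_chs), out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.attn = SelfAttention(out_ch)   # same placeholder attention as in the decoding unit
    def forward(self, edge_feat, *other_feats):
        # other_feats: first-dimension features and an up-sampled feature (first unit), or
        # earlier fusion results and an up-sampled decoding result (non-first units)
        inputs = (edge_feat,) + other_feats
        reduced = [r(x) for r, x in zip(self.reducers, inputs)]
        fused = self.fuse(torch.cat(reduced, dim=1))
        return self.attn(fused)             # fusion characteristic information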
6) The result output by the last edge constraint unit is taken as the target extraction result of the image. It should be noted that the result output by the last edge constraint unit is compared with a target threshold: pixel points whose fusion characteristic information is greater than or equal to the target threshold are set to white, and pixel points whose fusion characteristic information is smaller than the target threshold are set to black, so as to obtain the target extraction result.
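The binarization in step 6) can be written, for example, as follows. The threshold value 0.5 is only an assumption; the patent speaks of a target threshold without fixing its value.

import torch

def binarize(fusion_map: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Pixels with fusion values >= threshold become white (1), the rest black (0)."""
    return (fusion_map >= threshold).float()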
Because the multi-scale edge constraint model is obtained by training on sample images, and the sample images are obtained based on initial sample images, the way in which the sample images are obtained, namely the building-oriented data enhancement method (build-building) provided by the invention, is described below.
Optionally, the sample image is obtained based on an initial sample image, including:
selecting at least one group of initial sample image pairs; the initial sample image pair comprises a background image and a copied image;
respectively carrying out Fourier transform on each pixel point in the background image and the copied image to obtain a frequency domain image after the transform;
comparing the frequency domain images corresponding to the background image and the copied image respectively;
under the condition that low-frequency pixel points in the frequency domain image corresponding to the background image are smaller than low-frequency pixel points in the frequency domain image corresponding to the copied image, replacing the low-frequency pixel points in the frequency domain image corresponding to the copied image with the low-frequency pixel points in the frequency domain image corresponding to the background image, and determining that the replaced background image is a sample image;
and under the condition that low-frequency pixel points in the frequency domain image corresponding to the background image are not smaller than low-frequency pixel points in the frequency domain image corresponding to the copied image, taking the copied image as a sample image.
Specifically, in order to enable a data enhancement method to play a greater role in building extraction, the invention provides a building-oriented data enhancement method (build-building), which selects at least one group of initial sample image pairs from the initial sample images, wherein each initial sample image pair comprises a background image and a copied image; the background image serves as the image to be pasted onto, and the copied image serves as the image from which content is copied.
In practice, a Fast Fourier Transform (FFT) is performed on each pixel point in the background image and the copied image, respectively, to obtain the transformed frequency domain images; the FFT is computed using equation (6) and equation (7), respectively, wherein:
F_{copy}(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f_{copy}(x,y)\, e^{-j2\pi\left(\frac{ux}{M}+\frac{vy}{N}\right)}    (6)

F_{paste}(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f_{paste}(x,y)\, e^{-j2\pi\left(\frac{ux}{M}+\frac{vy}{N}\right)}    (7)
wherein F_{copy}(u,v) denotes the frequency domain image corresponding to the copied image, f_{copy}(x,y) denotes the copied image, F_{paste}(u,v) denotes the frequency domain image corresponding to the background image, f_{paste}(x,y) denotes the background image, M denotes the width of the copied image or the background image, N denotes the height of the copied image or the background image, and u and v each denote a frequency.
After the copied image and the background image are converted from the spatial domain to the frequency domain, regions with sharp gray-value changes correspond to high frequencies, while smoothly varying regions correspond to low frequencies. That is, the high-frequency components mainly measure the image edges and contours, and the low-frequency components are a comprehensive measure of the overall image intensity.
After the FFT is applied to the background image and the copied image, the frequency domain images corresponding to the background image and the copied image are compared. When the low-frequency pixel points in the frequency domain image corresponding to the background image are smaller than the low-frequency pixel points in the frequency domain image corresponding to the copied image, the low-frequency pixel points in the frequency domain image corresponding to the copied image are replaced with the low-frequency pixel points in the frequency domain image corresponding to the background image, as shown in equation (8) and equation (9): the low-frequency part of F_{copy}(u,v) is replaced with the low-frequency part of F_{paste}(u,v) to obtain F_{copy-paste}(u,v), which is then converted back to the spatial domain f_{copy-paste}(x,y) by the inverse FFT (iFFT), and the replaced background image is determined as the sample image. When the low-frequency pixel points in the frequency domain image corresponding to the background image are not smaller than the low-frequency pixel points in the frequency domain image corresponding to the copied image, the copied image is taken as the sample image.
F_{copy\text{-}paste}(u,v) = \begin{cases} F_{paste}(u,v), & (u,v)\ \text{in the low-frequency region} \\ F_{copy}(u,v), & \text{otherwise} \end{cases}    (8)

f_{copy\text{-}paste}(x,y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F_{copy\text{-}paste}(u,v)\, e^{j2\pi\left(\frac{ux}{M}+\frac{vy}{N}\right)}    (9)
wherein F_{copy-paste}(u,v) denotes the frequency domain image after the replacement, F_{copy}(u,v) denotes the frequency domain image corresponding to the copied image, F_{paste}(u,v) denotes the frequency domain image corresponding to the background image, f_{copy-paste}(x,y) denotes the spatial domain image obtained by applying the iFFT to F_{copy-paste}(u,v), M denotes the width of the copied image or the background image, N denotes the height of the copied image or the background image, and u and v denote frequencies.
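A hedged NumPy sketch of the frequency-domain swap described by equations (6)-(9) is given below. The size of the low-frequency window around the spectrum centre (the radius parameter) is an assumption, since the patent does not specify how the low-frequency region is delimited; the function operates on a single-channel image and would be applied per band for multi-band imagery.

import numpy as np

def build_building_swap(copy_img: np.ndarray, paste_img: np.ndarray, radius: int = 16) -> np.ndarray:
    """Replace the low-frequency part of the copied image's spectrum with that of the background image."""
    F_copy = np.fft.fftshift(np.fft.fft2(copy_img))    # equation (6), centred spectrum
    F_paste = np.fft.fftshift(np.fft.fft2(paste_img))  # equation (7)
    h, w = copy_img.shape
    cy, cx = h // 2, w // 2
    F_mix = F_copy.copy()
    # low-frequency region: a (2*radius) x (2*radius) window around the spectrum centre (assumption)
    F_mix[cy - radius:cy + radius, cx - radius:cx + radius] = \
        F_paste[cy - radius:cy + radius, cx - radius:cx + radius]   # equation (8)
    f_mix = np.fft.ifft2(np.fft.ifftshift(F_mix))      # equation (9), inverse FFT
    return np.real(f_mix)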
Optionally, the method further comprises:
calculating the copied image and a corresponding mask, and performing expansion operation on the calculation result;
deleting the buildings in the overlapping area in the background image when the overlapping area exists between the background image and the copied image;
and copying low-frequency pixel points in the frequency domain image corresponding to the copied image into the background image to obtain the sample image.
It should be noted that, when constructing a data set of building images, the roof of a building is generally used as the corresponding ground truth. If only the pixels corresponding to the roof are copied directly, very important neighborhood features of the building, such as inclined walls or shadows, are lost. Therefore, the mask corresponding to the copied image is dilated before the copy-and-paste operation.
Specifically, when the copied image is operated on with its corresponding mask and the result is dilated, the copy-and-paste operation may cause one building to cover other buildings wherever two or more buildings overlap; such a covered result does not match the actual situation and may mislead the model during learning. Therefore, when the background image and the copied image have an overlapping area, the buildings in the overlapping area of the background image are deleted, and the low-frequency pixel points in the frequency domain image corresponding to the copied image are copied into the background image to obtain the sample image.
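The mask handling described above could look roughly as follows, using OpenCV for the dilation and connected components. The 5x5 kernel size, single-channel binary masks and the per-building overlap removal are assumptions for illustration.

import cv2
import numpy as np

def prepare_copy_mask(copy_mask: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Dilate the building mask so that walls and shadows around the roof are copied as well."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(copy_mask, kernel, iterations=1)

def remove_overlapped_buildings(paste_mask: np.ndarray, dilated_copy_mask: np.ndarray) -> np.ndarray:
    """Delete every background building whose pixels overlap the area about to be pasted over."""
    num, labels = cv2.connectedComponents((paste_mask > 0).astype(np.uint8))
    cleaned = paste_mask.copy()
    for label in range(1, num):
        component = labels == label
        if np.any(component & (dilated_copy_mask > 0)):
            cleaned[component] = 0      # remove the whole overlapped building
    return cleaned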
Fig. 6 is a schematic diagram of the results of the sample image obtaining method provided by the present invention. As shown in Fig. 6, diagram (a) shows a copied image, including a positive-sample copied image and its corresponding ground truth; diagram (b) shows a background image, including a positive-sample background image and its corresponding ground truth; diagram (c) shows the result of directly copying and pasting diagram (a) onto diagram (b); diagram (d) shows the result of separately applying the FFT to diagram (a) and diagram (b) and then copying and pasting; diagram (e) shows the result of separately applying the FFT to diagram (a) and diagram (b), operating on the copied image with its corresponding mask, dilating the result, and then copying and pasting; and diagram (f) shows the build-building result obtained by, when the background image and the copied image have an overlapping area, deleting the buildings in the overlapping area of the background image before copying and pasting. As can be seen from the results shown in diagrams (c) to (f), the copy-paste effect of diagram (f) is the best.
The method provided by the invention is trained on an Nvidia GeForce RTX 3090 GPU, and the model is built using the PyTorch (Paszke et al., 2019) framework. The batch size for model training is set to 8, and an Adam optimizer with weight decay is used as the optimizer, where the decay coefficient is 0.001. The learning rate is adjusted by cosine annealing, with the initial learning rate set to 0.001 and the minimum learning rate set to 0.0001. The training set is augmented with a random probability of 0.5, and the loss function is the sum of the soft cross-entropy loss and the Dice loss. The maximum number of training epochs is 125; during training, the model parameters are evaluated on the validation set, and the parameters of the epoch with the best accuracy are used as the final parameters of the model.
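As a reference, the training configuration described above corresponds roughly to the following PyTorch setup. The model, data loader and the exact soft cross-entropy and Dice loss implementations are placeholders, and reading "Adam with weight decay" as weight_decay=0.001 is an assumption.

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def configure_training(model: torch.nn.Module, epochs: int = 125):
    # batch size 8 would be set on the DataLoader; initial lr 0.001, weight decay 0.001
    optimizer = Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
    # cosine annealing from 0.001 down to the minimum learning rate 0.0001
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-4)
    return optimizer, scheduler

def total_loss(soft_ce: torch.Tensor, dice: torch.Tensor) -> torch.Tensor:
    """Sum of soft cross-entropy loss and Dice loss, as used for training."""
    return soft_ce + dice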
In order to further illustrate the building extraction performance of the model provided by the invention, four common accuracy metrics are adopted for evaluation: precision (Precision), recall (Recall), the F1 score and the intersection over union (IoU). Precision refers to the proportion of pixels classified as building that actually belong to buildings, Recall indicates how many building pixels are correctly classified as building, F1 is the combined measure of precision and recall, and IoU is the ratio of the intersection to the union of the predicted building area and the building area in the ground truth.
To better evaluate the extraction effect in the building boundary region, an additional IoU is calculated for a buffer within a radius of 2 pixels around the building boundary. The calculation of each metric is shown in equations (10)-(13):
\text{Precision} = \frac{TP}{TP+FP}    (10)

\text{Recall} = \frac{TP}{TP+FN}    (11)

F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision}+\text{Recall}}    (12)

IoU = \frac{TP}{TP+FP+FN}    (13)
wherein TP denotes true positives (pixels correctly classified as building), FP denotes false positives (non-building pixels misclassified as building), and FN denotes false negatives (building pixels misclassified as non-building).
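Equations (10)-(13) can be computed from a binary prediction and ground truth as in the following minimal sketch; the 2-pixel boundary buffer used for the boundary IoU is not included here.

import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Precision, Recall, F1 and IoU from binary prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)       # correctly classified positives
    fp = np.sum(pred & ~gt)      # negatives misclassified as positives
    fn = np.sum(~pred & gt)      # positives misclassified as negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou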
TABLE 1 Accuracy metric comparison table

Method      Precision(%)  Recall(%)  F1(%)   IoU(%)  IoU(boundary)(%)
PSP-Net     92.60         93.83      93.21   87.28   58.88
Res-U-Net   94.00         94.86      94.43   89.45   66.00
DeeplabV3+  94.30         94.50      94.40   89.39   94.93
HRNet       94.67         94.64      94.66   89.86   67.52
MEC-Net     94.70         96.03      95.36   91.13   68.52
Table 1 shows the results obtained on the WHU building data set by the multi-scale edge constraint model (MEC-Net) and the existing Pyramid Scene Parsing Network (PSP-Net), residual U-Network (Res-U-Net), semantic segmentation network DeeplabV3+ and HRNet. The MEC-Net provided by the invention achieves the highest IoU and F1 on the WHU building data set, with IoU reaching 91.13% and F1 reaching 95.36%.
FIG. 7 is a schematic diagram of the results of the target extraction method provided by the present invention. As shown in FIG. 7, for the small buildings in the first row, DeeplabV3+ and HRNet miss some buildings, while MEC-Net extracts them well; the parking lot in the second row has characteristics similar to a building, and the other models produce false extractions, while MEC-Net effectively distinguishes the parking lot from the buildings; for the very large building in the third row, the result extracted by DeeplabV3+ is rather disordered, while MEC-Net successfully extracts the building. Therefore, the MEC-Net used by the target extraction method provided by the invention obtains the most complete building extraction results and the best boundary details.
The object extracting device provided by the present invention is described below, and the object extracting device described below and the object extracting method described above may be referred to in correspondence with each other.
Fig. 8 is a schematic structural diagram of an object extraction device provided in the present invention, and as shown in fig. 8, the object extraction device 800 includes: an obtaining module 801, a determining module 802 and an extracting module 803, wherein:
an obtaining module 801, configured to obtain an image of a target to be extracted;
a determining module 802, configured to determine an edge feature of the image;
an extracting module 803, configured to input the image and the edge feature into a multi-scale edge constraint model, so as to obtain a target extraction result of the image output by the multi-scale edge constraint model; the multi-scale edge constraint model is obtained after training based on a sample image; the target extraction result is used for representing a building in the image; the sample image is obtained based on the initial sample image.
The target extraction device provided by the invention obtains an image of the target to be extracted, determines the edge features of the image, and inputs the image and the edge features into the multi-scale edge constraint model to obtain the target extraction result of the image output by the multi-scale edge constraint model. In the device provided by the invention, the image of the target to be extracted and the corresponding edge features are deeply fused by the multi-scale edge constraint model, the boundaries of buildings in the image are refined, and the building extraction accuracy is improved.
Optionally, the multi-scale edge constraint model comprises an encoder, a decoder and an edge constraint block; the extracting module 803 is specifically configured to:
inputting the image into the encoder to obtain at least one first characteristic information of the image output by the encoder; each piece of first feature information is used for representing features of different dimensions corresponding to the image;
respectively performing up-sampling on each first feature information to obtain at least one first up-sampling result after up-sampling;
inputting each first feature information and each first up-sampling result into the decoder to obtain at least one decoding result output by the decoder; each decoding result is used for representing the feature information of different resolutions corresponding to the image;
and inputting second up-sampling results corresponding to the edge features and the decoding results to the edge constraint block to obtain a target extraction result of the image output by the edge constraint block.
Optionally, the encoder comprises at least one encoding unit connected in series in sequence; the extracting module 803 is specifically configured to:
inputting the image into a first coding unit to obtain first dimension characteristic information output by the first coding unit;
performing down-sampling on the first-dimension characteristic information to obtain a first down-sampling result after sampling;
inputting the first down-sampling result to a second coding unit to obtain second dimension characteristic information output by the second coding unit;
performing down-sampling on the second dimension characteristic information to obtain a second down-sampling result after sampling;
inputting the second downsampling result to a third encoding unit to obtain third dimension characteristic information output by the third encoding unit;
performing down-sampling on the third dimension characteristic information to obtain a third down-sampling result after sampling;
inputting the third down-sampling result to a fourth encoding unit to obtain fourth dimension characteristic information output by the fourth encoding unit;
performing down-sampling on the fourth dimension characteristic information to obtain a fourth down-sampling result after sampling;
and inputting the fourth down-sampling result to a fifth encoding unit to obtain fifth dimension characteristic information output by the fifth encoding unit.
Optionally, the decoder comprises at least one decoding unit; the extracting module 803 is specifically configured to:
inputting first up-sampling results corresponding to the second dimension characteristic information and the third dimension characteristic information to a first decoding unit to obtain a first decoding result output by the first decoding unit;
inputting first up-sampling results corresponding to the third dimension characteristic information and the fourth dimension characteristic information to a second decoding unit to obtain a second decoding result output by the second decoding unit;
inputting first up-sampling results corresponding to the fourth dimension characteristic information and the fifth dimension characteristic information to a third decoding unit to obtain a third decoding result output by the third decoding unit;
determining at least one decoding result output by the decoder based on the first decoding result, the second decoding result, and the third decoding result.
Optionally, the edge constraint block includes at least one edge constraint unit connected in series in sequence; the extracting module 803 is specifically configured to:
under the condition that the edge constraint unit is a first edge constraint unit, inputting the first dimension characteristic information, a first up-sampling result corresponding to the second dimension characteristic information and the edge characteristic into the first edge constraint unit to obtain first fusion characteristic information output by the first edge constraint unit; the first fusion feature information is used for representing fusion of the first dimension feature information, a first up-sampling result corresponding to the second dimension feature information and the edge feature;
under the condition that the edge constraint unit is a non-first edge constraint unit, inputting second fusion feature information respectively corresponding to the edge feature and all edge constraint units before the non-first edge constraint unit and a second upsampling result corresponding to a decoding result output by the decoder into the non-first edge constraint unit to obtain third fusion feature information output by the non-first edge constraint unit; and taking the result output by the last edge constraint unit as the target extraction result of the image.
Optionally, the extracting module 803 is specifically configured to:
splicing the first dimension characteristic information, a first up-sampling result corresponding to the second dimension characteristic information and the edge characteristic to determine spliced second characteristic information;
performing convolution twice on the spliced second characteristic information, and determining the convoluted third characteristic information;
and inputting the third feature information into a self-attention mechanism unit to obtain first fusion feature information output by the self-attention mechanism unit.
Optionally, the extracting module 803 is specifically configured to:
splicing second fusion characteristic information respectively corresponding to the edge characteristic and all edge constraint units before the non-first edge constraint unit and a second up-sampling result corresponding to a decoding result output by the decoder to determine spliced fourth characteristic information;
performing convolution twice on the spliced fourth feature information, and determining fifth feature information after convolution;
and inputting the fifth feature information into the self-attention mechanism unit to obtain third fused feature information output by the self-attention mechanism unit.
Optionally, the decoding unit comprises a self-attention mechanism unit; the extracting module 803 is specifically configured to:
splicing the first up-sampling results corresponding to the second dimension characteristic information and the third dimension characteristic information, and determining spliced sixth characteristic information;
performing convolution twice on the sixth characteristic information, and determining seventh characteristic information after the convolution;
and inputting the seventh characteristic information into the self-attention mechanism unit to obtain a first decoding result output by the self-attention mechanism unit.
Optionally, the extracting module 803 is further configured to:
selecting at least one group of initial sample image pairs; the initial sample image pair comprises a background image and a copied image;
respectively carrying out Fourier transform on each pixel point in the background image and the copied image to obtain a frequency domain image after the Fourier transform;
respectively comparing the frequency domain images corresponding to the background image and the copied image;
under the condition that low-frequency pixel points in the frequency domain image corresponding to the background image are smaller than low-frequency pixel points in the frequency domain image corresponding to the copied image, replacing the low-frequency pixel points in the frequency domain image corresponding to the copied image with the low-frequency pixel points in the frequency domain image corresponding to the background image, and determining the replaced background image as a sample image;
and under the condition that low-frequency pixel points in the frequency domain image corresponding to the background image are not smaller than low-frequency pixel points in the frequency domain image corresponding to the copied image, taking the copied image as a sample image.
Optionally, the object extraction device 800 is further configured to perform:
calculating the copied image and a corresponding mask, and performing expansion operation on the calculation result;
deleting a building in an overlapping area in the background image when the background image and the copied image have the overlapping area;
and copying low-frequency pixel points in the frequency domain image corresponding to the copied image into the background image to obtain the sample image.
Fig. 9 is a schematic physical structure diagram of an electronic device provided in the present invention, and as shown in Fig. 9, the electronic device 900 may include: a processor (processor) 910, a communication Interface (Communications Interface) 920, a memory (memory) 930, and a communication bus 940, wherein the processor 910, the communication Interface 920, and the memory 930 communicate with each other via the communication bus 940. Processor 910 may invoke logic instructions in memory 930 to perform a target extraction method comprising: acquiring an image of a target to be extracted; determining edge features of the image; inputting the image and the edge features into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model; the multi-scale edge constraint model is obtained after training based on a sample image; the target extraction result is used for representing a building in the image; the sample image is obtained based on the initial sample image.
Furthermore, the logic instructions in the memory 930 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing the object extraction method provided by the above methods, the method comprising: acquiring an image of a target to be extracted; determining edge features of the image; inputting the image and the edge features into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model; the multi-scale edge constraint model is obtained after training based on a sample image; the target extraction result is used for representing a building in the image; the sample image is obtained based on the initial sample image.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for extracting an object provided by the above methods, the method comprising: acquiring an image of a target to be extracted; determining edge features of the image; inputting the image and the edge features into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model; the multi-scale edge constraint model is obtained after training based on a sample image; the target extraction result is used for representing a building in the image; the sample image is obtained based on the initial sample image.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (14)

1. A method of target extraction, the method comprising:
acquiring an image of a target to be extracted;
determining edge features of the image;
inputting the image and the edge features into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model; the multi-scale edge constraint model is obtained after training based on a sample image; the target extraction result is used for representing a building in the image; the sample image is obtained based on the initial sample image.
2. The method of object extraction according to claim 1, wherein the multi-scale edge constraint model comprises an encoder, a decoder, and an edge constraint block;
the inputting the image and the edge feature into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model includes:
inputting the image into the encoder to obtain at least one first characteristic information of the image output by the encoder; each piece of first feature information is used for representing features of different dimensions corresponding to the image;
respectively performing up-sampling on each first feature information to obtain at least one first up-sampling result after up-sampling;
inputting each first feature information and each first up-sampling result into the decoder to obtain at least one decoding result output by the decoder; each decoding result is used for representing the feature information of different resolutions corresponding to the image;
and inputting second up-sampling results corresponding to the edge features and the decoding results to the edge constraint block to obtain a target extraction result of the image output by the edge constraint block.
3. The target extraction method of claim 2, wherein the encoder comprises at least one encoding unit connected in series in sequence;
the inputting the image into the encoder to obtain at least one first feature information of the image output by the encoder includes:
inputting the image into a first coding unit to obtain first dimension characteristic information output by the first coding unit;
performing down-sampling on the first-dimension characteristic information to obtain a first down-sampling result after sampling;
inputting the first down-sampling result to a second coding unit to obtain second dimension characteristic information output by the second coding unit;
performing down-sampling on the second dimension characteristic information to obtain a second down-sampling result after sampling;
inputting the second downsampling result to a third encoding unit to obtain third dimension characteristic information output by the third encoding unit;
performing down-sampling on the third dimension characteristic information to obtain a third down-sampling result after sampling;
inputting the third down-sampling result to a fourth encoding unit to obtain fourth dimension characteristic information output by the fourth encoding unit;
performing down-sampling on the fourth dimension characteristic information to obtain a fourth down-sampling result after sampling;
and inputting the fourth down-sampling result to a fifth coding unit to obtain fifth dimension characteristic information output by the fifth coding unit.
4. The object extraction method of claim 3, wherein the decoder comprises at least one decoding unit;
the inputting each of the first feature information and each of the first upsampling results into the decoder to obtain at least one decoding result output by the decoder includes:
inputting first up-sampling results corresponding to the second dimension characteristic information and the third dimension characteristic information to a first decoding unit to obtain a first decoding result output by the first decoding unit;
inputting first up-sampling results corresponding to the third dimension characteristic information and the fourth dimension characteristic information to a second decoding unit to obtain a second decoding result output by the second decoding unit;
inputting first up-sampling results corresponding to the fourth dimension characteristic information and the fifth dimension characteristic information to a third decoding unit to obtain a third decoding result output by the third decoding unit;
determining at least one decoding result output by the decoder based on the first decoding result, the second decoding result, and the third decoding result.
5. The method of claim 3, wherein the edge constraint block comprises at least one edge constraint unit connected in series in sequence;
the inputting the second upsampling results respectively corresponding to the edge features and the decoding results to the edge constraint block to obtain the target extraction result of the image output by the edge constraint block includes:
under the condition that the edge constraint unit is a first edge constraint unit, inputting the first dimension characteristic information, a first up-sampling result corresponding to the second dimension characteristic information and the edge characteristic into the first edge constraint unit to obtain first fusion characteristic information output by the first edge constraint unit; the first fusion characteristic information is used for fusing the first dimension characteristic information, a first up-sampling result corresponding to the second dimension characteristic information and the edge characteristic;
under the condition that the edge constraint unit is a non-first edge constraint unit, inputting second fusion feature information respectively corresponding to the edge feature, all edge constraint units before the non-first edge constraint unit and a second up-sampling result corresponding to a decoding result output by the decoder into the non-first edge constraint unit to obtain third fusion feature information output by the non-first edge constraint unit; and taking the result output by the last edge constraint unit as the target extraction result of the image.
6. The target extraction method according to claim 5, wherein the inputting the first dimension feature information, the first upsampling result corresponding to the second dimension feature information, and the edge feature into the first edge constraint unit to obtain first fused feature information output by the first edge constraint unit includes:
splicing the first dimension characteristic information, a first up-sampling result corresponding to the second dimension characteristic information and the edge characteristic to determine spliced second characteristic information;
performing convolution twice on the spliced second characteristic information, and determining the convoluted third characteristic information;
and inputting the third feature information into a self-attention mechanism unit to obtain first fusion feature information output by the self-attention mechanism unit.
7. The method of claim 5, wherein the inputting the second fused feature information corresponding to the edge feature and all edge constraint units before the non-first edge constraint unit and the second upsampling result corresponding to the decoding result output by the decoder into the non-first edge constraint unit to obtain the third fused feature information output by the non-first edge constraint unit comprises:
splicing second fusion characteristic information respectively corresponding to the edge characteristic and all edge constraint units before the non-first edge constraint unit and a second up-sampling result corresponding to a decoding result output by the decoder to determine spliced fourth characteristic information;
performing convolution twice on the spliced fourth feature information, and determining fifth feature information after convolution;
and inputting the fifth feature information into the self-attention mechanism unit to obtain third fused feature information output by the self-attention mechanism unit.
8. The object extraction method of claim 4, wherein the decoding unit comprises a self-attention mechanism unit;
the inputting the first upsampling result corresponding to the second dimension characteristic information and the third dimension characteristic information into a first decoding unit to obtain a first decoding result output by the first decoding unit includes:
splicing the first up-sampling results corresponding to the second dimension characteristic information and the third dimension characteristic information, and determining spliced sixth characteristic information;
performing convolution on the sixth characteristic information for two times, and determining seventh characteristic information after the convolution;
and inputting the seventh characteristic information into the self-attention mechanism unit to obtain a first decoding result output by the self-attention mechanism unit.
9. The method of claim 1, wherein the sample image is obtained based on an initial sample image, and the method comprises:
selecting at least one group of initial sample image pairs; the initial sample image pair comprises a background image and a copied image;
respectively carrying out Fourier transform on each pixel point in the background image and the copied image to obtain a frequency domain image after the transform;
respectively comparing the frequency domain images corresponding to the background image and the copied image;
under the condition that low-frequency pixel points in the frequency domain image corresponding to the background image are smaller than low-frequency pixel points in the frequency domain image corresponding to the copied image, replacing the low-frequency pixel points in the frequency domain image corresponding to the copied image with the low-frequency pixel points in the frequency domain image corresponding to the background image, and determining that the replaced background image is a sample image;
and under the condition that low-frequency pixel points in the frequency domain image corresponding to the background image are not smaller than low-frequency pixel points in the frequency domain image corresponding to the copied image, taking the copied image as a sample image.
10. The method of claim 9, further comprising:
calculating the copied image and a corresponding mask, and performing expansion operation on the calculation result;
deleting a building in an overlapping area in the background image when the background image and the copied image have the overlapping area;
and copying low-frequency pixel points in the frequency domain image corresponding to the copied image into the background image to obtain the sample image.
11. An object extraction apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring an image of a target to be extracted;
the determining module is used for determining the edge characteristics of the image;
the extraction module is used for inputting the image and the edge characteristics into a multi-scale edge constraint model to obtain a target extraction result of the image output by the multi-scale edge constraint model; the multi-scale edge constraint model is obtained after training based on a sample image; the target extraction result is used for representing a building in the image; the sample image is obtained based on the initial sample image.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the object extraction method according to any one of claims 1 to 10 when executing the program.
13. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the object extraction method according to any one of claims 1 to 10.
14. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the object extraction method according to any one of claims 1 to 10.
CN202210826414.4A 2022-07-13 2022-07-13 Target extraction method and device, electronic equipment and storage medium Pending CN115375715A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210826414.4A CN115375715A (en) 2022-07-13 2022-07-13 Target extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210826414.4A CN115375715A (en) 2022-07-13 2022-07-13 Target extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115375715A true CN115375715A (en) 2022-11-22

Family

ID=84062673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210826414.4A Pending CN115375715A (en) 2022-07-13 2022-07-13 Target extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115375715A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115641A (en) * 2023-07-20 2023-11-24 中国科学院空天信息创新研究院 Building information extraction method and device, electronic equipment and storage medium
CN117115641B (en) * 2023-07-20 2024-03-22 中国科学院空天信息创新研究院 Building information extraction method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113240580B (en) Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN111369440B (en) Model training and image super-resolution processing method, device, terminal and storage medium
CN108961180B (en) Infrared image enhancement method and system
Chen et al. Remote sensing image quality evaluation based on deep support value learning networks
CN111476719A (en) Image processing method, image processing device, computer equipment and storage medium
CN116645592B (en) Crack detection method based on image processing and storage medium
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
CN112597918A (en) Text detection method and device, electronic equipment and storage medium
CN109697442B (en) Training method and device of character recognition model
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN115995042A (en) Video SAR moving target detection method and device
CN115375715A (en) Target extraction method and device, electronic equipment and storage medium
CN116805387B (en) Model training method, quality inspection method and related equipment based on knowledge distillation
CN116309612B (en) Semiconductor silicon wafer detection method, device and medium based on frequency decoupling supervision
CN117237623A (en) Semantic segmentation method and system for remote sensing image of unmanned aerial vehicle
CN114078149A (en) Image estimation method, electronic equipment and storage medium
WO2023019682A1 (en) Watermark removal method and apparatus, terminal device and readable storage medium
CN112418098A (en) Training method of video structured model and related equipment
CN112651926A (en) Method and device for detecting cracks based on recursive attention mechanism
CN116452600B (en) Instance segmentation method, system, model training method, medium and electronic equipment
CN116704376B (en) nDSM extraction method and device based on single satellite image and electronic equipment
CN116523888B (en) Pavement crack detection method, device, equipment and medium
CN115311542B (en) Target detection method, device, equipment and medium
CN110110658B (en) Image segmentation processing method and device containing lane lines
US20230401670A1 (en) Multi-scale autoencoder generation method, electronic device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination