CN112084868B - Target counting method in remote sensing image based on attention mechanism - Google Patents


Info

Publication number
CN112084868B
CN112084868B · CN202010794525.2A
Authority
CN
China
Prior art keywords
feature map
convolution
channel
size
map
Prior art date
Legal status
Active
Application number
CN202010794525.2A
Other languages
Chinese (zh)
Other versions
CN112084868A (en)
Inventor
Qingjie Liu (刘庆杰)
Guangshuai Gao (高广帅)
Yunhong Wang (王蕴红)
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010794525.2A
Publication of CN112084868A
Application granted
Publication of CN112084868B
Legal status: Active


Classifications

    • G06V 20/176 — Image or video recognition or understanding: scenes; terrestrial scenes; urban or other man-made structures
    • G06N 3/045 — Computing arrangements based on biological models: neural networks; architecture; combinations of networks
    • G06T 7/60 — Image analysis: analysis of geometric attributes
    • G06T 2207/20016 — Indexing scheme for image analysis: hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform


Abstract

The invention discloses an attention-based method for counting targets in remote sensing images. On the basis of VGG16, the method integrates an attention mechanism, a scale pyramid and deformable convolution, and consists of three cascaded stages: feature extraction in the front-end network, multi-scale fusion in the mid-end network, and density-map generation in the back-end network. This technical scheme effectively addresses the target scale diversity, the complex and cluttered background interference, and the arbitrary target orientation encountered in dense-target counting tasks in remote sensing images.

Description

Target counting method in remote sensing image based on attention mechanism
Technical Field
The invention belongs to the technical field of remote sensing images, and particularly relates to a target counting method in a remote sensing image based on an attention mechanism.
Background
In recent decades, driven by the needs of national security and urban planning, estimating the number of targets in complex scenes has received increasing attention. Much work on object counting has therefore been carried out in various fields, such as crowd counting in surveillance video, cell counting under the microscope, animal counting in ecological studies, vehicle counting, and object counting in environmental studies.
Although target counting has advanced greatly in these fields, it has rarely been applied to remote sensing, apart from a few scenarios such as counting vehicles in images taken by drones or counting palm and olive trees. The main ground objects in remote sensing images, such as buildings and ships, have received little attention, yet counting these targets has many practical uses, such as city planning, environmental control, digital city model construction, and disaster response planning.
Compared with target counting in other fields, target counting in remote sensing images presents several challenges: 1) scale diversity: target scales vary widely, and within a single image target sizes range from a few pixels to thousands of pixels; 2) complex and diverse backgrounds: multiple kinds of ground objects usually coexist in a remote sensing image, and especially when targets are small, their detection and counting are greatly limited by complex and cluttered background interference; 3) arbitrary orientation: unlike objects in natural-scene pictures, such as upright pedestrians, objects in remote sensing images appear in arbitrary orientations because of the top-down perspective of satellite-borne or airborne sensors.
The name VGG comes from the Visual Geometry Group in the Department of Engineering Science at the University of Oxford, which published a series of convolutional network models beginning with VGG, from VGG11 to VGG19, applicable to face recognition, image classification and the like. VGG's original purpose in studying convolutional network depth was to find out how depth affects the precision and accuracy of large-scale image classification and recognition. To deepen the network while avoiding excessive parameters, VGG uses small 3×3 convolution kernels in all layers, with the convolution stride set to 1. The input to VGG is a 224×224 RGB image; the RGB mean computed over all training images is subtracted, and the image is then passed through the convolutional network, which uses 3×3 or 1×1 filters with the stride fixed at 1. VGG has 3 fully connected layers, and variants VGG11 to VGG19 differ in the total number of convolutional and fully connected layers: the smallest, VGG11, has 8 convolutional layers and 3 fully connected layers; the largest, VGG19, has 16 convolutional layers and 3 fully connected layers. Moreover, a pooling layer does not follow every convolutional layer; there are 5 pooling layers in total, distributed after different convolutional layers. VGG16 was originally applied to the image classification task and, owing to its simplicity and practicality, quickly became the most popular convolutional neural network model of its time; it is now often applied to a variety of computer vision tasks.
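For orientation only, here is a minimal PyTorch sketch (my own illustration, not code from the patent) of how the first 10 convolutional layers of VGG16 are commonly taken as a counting backbone; the slicing index assumes the layer layout of torchvision (≥ 0.13) `models.vgg16().features`.

```python
import torch
import torch.nn as nn
from torchvision import models

def vgg16_first10(pretrained: bool = False) -> nn.Sequential:
    """First 10 convolutional layers of VGG16 (conv1_1 .. conv4_3).

    Slicing features[:23] keeps 10 Conv2d layers, their ReLUs and
    3 MaxPool2d layers, so an H x W input yields a 512-channel
    feature map of size H/8 x W/8.
    """
    weights = models.VGG16_Weights.IMAGENET1K_V1 if pretrained else None
    features = models.vgg16(weights=weights).features
    return nn.Sequential(*list(features.children())[:23])

if __name__ == "__main__":
    x = torch.randn(1, 3, 224, 224)
    print(vgg16_first10()(x).shape)  # torch.Size([1, 512, 28, 28])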
Disclosure of Invention
To address the scarcity of data sets, the diversity of target scales, complex and cluttered background interference, and the arbitrary orientation of targets in dense-target counting tasks in remote sensing images, the invention provides an attention-based method for counting targets in remote sensing images. On the basis of the VGG16 network structure, it integrates an attention mechanism, a scale pyramid and deformable convolution into a network named ASPDNet, which consists of three cascaded stages: feature extraction in the front-end network, multi-scale fusion in the mid-end network, and density-map generation in the back-end network. The specific technical scheme of the invention is as follows:
A method for counting targets in remote sensing images based on an attention mechanism, characterized in that the input image is processed in the following three cascaded stages on the basis of the VGG16 network structure:
S1: feature extraction in the front-end network;
for an input image, the first 10 layers of the VGG16 network structure are applied, and then a convolutional block attention module is fused in, i.e., a channel attention module connected in series with a spatial attention module, to encode the correlations between feature-map channels and between pixel positions;
S2: multi-scale fusion in the mid-end network;
a scale pyramid module is introduced that combines dilated convolutions with dilation rates of 2, 4, 8 and 12 to capture richer multi-scale and detail information;
S3: density-map generation in the back-end network;
three layers of deformable convolution with 3×3 kernels are adopted, each followed by a rectified linear unit (ReLU) activation, and finally a 1×1 convolutional layer is added to generate the density map;
S4: all pixels of the density map from step S3 are summed to obtain the final target count.
The invention has the beneficial effects that:
1. The front-end network takes the first 10 layers of the VGG16 network as its backbone and adds an attention module, which highlights target regions of interest and suppresses complex background regions, effectively handling the complex and cluttered background interference in remote sensing images.
2. A scale pyramid module is introduced in the middle of the network to capture multi-scale information corresponding to different receptive fields without increasing the number of parameters, effectively handling scale diversity.
3. Three layers of deformable convolution are adopted in the back-end network; the offsets learned in the convolution allow the sampling grid to cover the target well, effectively handling the arbitrary orientation of targets in remote sensing images.
Drawings
To illustrate the embodiments of the invention or the solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings are schematic and should not be understood as limiting the invention in any way; those skilled in the art can derive other drawings from them without inventive effort. In the drawings:
FIG. 1 is a flow diagram of a network architecture of the present invention;
FIG. 2 (a) is a structural diagram of the channel attention module;
FIG. 2 (b) is a structural diagram of the spatial attention module;
FIG. 3 is a schematic diagram of a scale pyramid module;
FIG. 4 is a schematic diagram of a deformable convolution;
FIG. 5 (a) is a visualization of position sampling in standard convolution;
FIG. 5 (b) is a visualization of position sampling in deformable convolution;
FIG. 6 (a) is a picture of buildings;
FIG. 6 (b) is the ground-truth density map and count result for the buildings;
FIG. 6 (c) is the density map and count result for the buildings obtained by the method of the present invention;
FIG. 7 (a) is a picture of small vehicles;
FIG. 7 (b) is the ground-truth density map and count result for the small vehicles;
FIG. 7 (c) is the density map and count result for the small vehicles obtained by the method of the present invention;
FIG. 8 (a) is a picture of large vehicles;
FIG. 8 (b) is the ground-truth density map and count result for the large vehicles;
FIG. 8 (c) is the density map and count result for the large vehicles obtained by the method of the present invention;
FIG. 9 (a) is a picture of ships;
FIG. 9 (b) is the ground-truth density map and count result for the ships;
FIG. 9 (c) is the density map and count result for the ships obtained by the method of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
The invention aims to accurately estimate the number of dense targets in a remote sensing image, such as contiguous dense housing, ships docked at a port, or small and large vehicles parked in a parking lot.
To address these technical problems, the invention constructs a convolutional neural network framework for density-estimation-based target counting. On the basis of the VGG16 network structure, it integrates an attention mechanism (a channel attention module connected in series with a spatial attention module), a scale pyramid module, and a deformable convolution module. The network consists of three cascaded stages: feature extraction in the front-end network, multi-scale fusion in the mid-end network, and density-map generation in the back-end network; finally, all pixels of the density map are summed to obtain the number of targets in the remote sensing image. The overall network flow is shown in FIG. 1.
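As a rough sketch of this three-stage cascade (an illustrative skeleton of my own, not the patent's code), the stages can be composed as follows; the concrete front-end, scale-pyramid and back-end modules are sketched in the sections below.

```python
import torch
import torch.nn as nn

class ASPDNetSkeleton(nn.Module):
    """Three cascaded stages: front-end features -> mid-end multi-scale
    fusion -> back-end density map; the count is the sum over the map."""

    def __init__(self, front_end: nn.Module, mid_end: nn.Module, back_end: nn.Module):
        super().__init__()
        self.front_end = front_end  # VGG16 first 10 layers + attention
        self.mid_end = mid_end      # scale pyramid module
        self.back_end = back_end    # deformable convs + 1x1 conv

    def forward(self, x: torch.Tensor):
        density = self.back_end(self.mid_end(self.front_end(x)))
        counts = density.sum(dim=(1, 2, 3))  # one count per input image
        return density, counts
```

Summing the density map, as in step S4 above, turns the per-pixel estimates into an image-level count.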
The front-end network takes the first 10 layers of the VGG16 network as its backbone and adds an attention module. By fully exploiting the correlations between feature-map channels and between pixel positions, it extracts rich semantic and contextual information, thereby highlighting target regions of interest and suppressing complex background regions; this effectively handles the complex and cluttered background interference in remote sensing images.
Because the front-end network contains three max-pooling layers, the feature map is reduced to 1/8 of the original image resolution along each spatial dimension (1/64 of the original area). To enlarge the receptive field of the feature map, a Scale Pyramid Module (SPM) is introduced in the middle of the network, combining four parallel dilated convolutions with different dilation rates. The SPM captures multi-scale information corresponding to different receptive fields without increasing the number of parameters, effectively handling scale diversity.
A three-layer deformable convolution is adopted in the back-end network. Deformable convolution adds a learnable offset to the standard convolution; thanks to this adaptive position-sampling technique, the learned offsets allow the sampling grid to cover the target well, effectively handling the arbitrary orientation of targets in remote sensing images. A 1×1 convolutional layer in the last layer of the network generates the density map, and finally all pixels of the density map are summed to obtain the target count. Specifically:
1. feature extraction for front-end networks
Given a remote sensing image of arbitrary size, the VGG16 network serves as the backbone and its first 10 layers are applied. A convolutional block attention module, i.e., a Channel Attention Module (CAM) connected in series with a Spatial Attention Module (SAM), is then added to encode the correlations between feature-map channels and between pixel positions, gathering more salient feature information so as to highlight targets and suppress the complex, cluttered background.
Channel attention module: in dense scenes the texture of foreground targets is very similar to that of the background, which makes counting difficult; incorporating the channel attention module, whose architecture is shown in fig. 2 (a), alleviates this problem. Specifically, let the feature map of any intermediate layer be $F \in \mathbb{R}^{C \times H \times W}$, where $\mathbb{R}$ denotes real space and $C$, $H$ and $W$ denote the channels, height and width of the feature map, respectively. First, a 1×1 convolution is applied to the feature map, and two feature maps $C_1$ and $C_2$ are obtained by reshaping and transposing; next, $C_1$ and $C_2$ are multiplied and a softmax is applied to obtain the channel attention map $C_a$ of size $C \times C$. This process is expressed as:

$$C_a^{ij} = \frac{\exp\left(C_1^i \cdot C_2^j\right)}{\sum_{i=1}^{C} \exp\left(C_1^i \cdot C_2^j\right)}$$

where $C_a^{ij}$ indicates the effect of the $i$-th channel on the $j$-th channel in the channel attention map, $C_1^i$ is the $i$-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $C_2^j$ is the $j$-th channel of the feature map obtained by applying a 1×1 convolution, reshaping and transposing to size $HW \times C$. Finally, the feature map of size $C \times H \times W$ weighted by the channel attention module is computed as:

$$E_j = \lambda \sum_{i=1}^{C} \left(C_a^{ij} C_3^i\right) + F_j$$

where $\lambda$ is a learnable parameter obtained through a 1×1 convolution operation, $E_j$ is the $j$-th channel of the feature map finally weighted by the channel attention module, $C_3^i$ is the $i$-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $F_j$ is the $j$-th channel of the original feature map.
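A PyTorch sketch of the channel attention computation above. It assumes a single shared 1×1 convolution whose output is reshaped into $C_1$, $C_2$ and $C_3$, which is one plausible reading of the equations rather than the patent's reference implementation:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.lam = nn.Parameter(torch.zeros(1))  # learnable lambda, starts at 0

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        g = self.conv(f).view(b, c, h * w)                # C1 / C3: B x C x HW
        g_t = g.permute(0, 2, 1)                          # C2:      B x HW x C
        attn = torch.softmax(torch.bmm(g, g_t), dim=-1)   # C_a:     B x C x C
        out = torch.bmm(attn, g).view(b, c, h, w)         # attention-weighted channels
        return self.lam * out + f                         # E = lambda * (C_a C3) + F
```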
Spatial attention module: considering that the feature map has different densities at different pixel positions, long-range dependencies in the spatial dimension are further encoded so that feature information at spatial positions is captured effectively. The spatial attention module is similar to the channel attention module above; its network architecture is shown in fig. 2 (b). The differences are: 1) whereas the channel attention module has only one 1×1 convolutional layer, the spatial attention module needs three; 2) whereas the channel attention map $C_a$ has size $C \times C$, the spatial attention map $S_a$ has size $HW \times HW$. Specifically,

$$S_a^{kl} = \frac{\exp\left(S_1^k \cdot S_2^l\right)}{\sum_{k=1}^{HW} \exp\left(S_1^k \cdot S_2^l\right)}$$

where $S_a^{kl}$ indicates the effect of the $k$-th position on the $l$-th position, $S_1^k$ is the $k$-th position of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $S_2^l$ is the $l$-th position of the feature map obtained by applying a 1×1 convolution, reshaping and transposing to size $HW \times C$. The feature map of size $C \times H \times W$ weighted by the spatial attention module is computed as:

$$E_l = \mu \sum_{k=1}^{HW} \left(S_a^{kl} S_3^k\right) + F_l$$

where $\mu$ is a learnable parameter obtained through a 1×1 convolution operation, $E_l$ is the $l$-th position of the feature map finally weighted by the spatial attention module, $S_3^k$ is the $k$-th position of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $F_l$ is the $l$-th position of the original feature map.
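A corresponding PyTorch sketch of the spatial attention module, with the three separate 1×1 convolutions the text specifies; the channel-reduction factor in the query/key branches is a common implementation choice and an assumption of mine, not given in the patent:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # three separate 1x1 convolutions, as the text specifies
        self.q = nn.Conv2d(channels, channels // reduction, 1)
        self.k = nn.Conv2d(channels, channels // reduction, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.mu = nn.Parameter(torch.zeros(1))  # learnable mu, starts at 0

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        s1 = self.q(f).view(b, -1, h * w)                   # B x C' x HW
        s2 = self.k(f).view(b, -1, h * w).permute(0, 2, 1)  # B x HW x C'
        attn = torch.softmax(torch.bmm(s2, s1), dim=-1)     # S_a: B x HW x HW
        s3 = self.v(f).view(b, c, h * w)                    # B x C x HW
        out = torch.bmm(s3, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.mu * out + f                            # E = mu * (S_a S3) + F
```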
2. Multi-scale fusion of mid-end networks
Since the front-end network contains three pooling operations, the output feature map is 1/8 of the original size along each spatial dimension (1/64 of the original area). To enlarge the receptive field of the feature map while keeping its resolution unchanged, a scale pyramid module is introduced, as shown in fig. 3, combining several dilated convolutions with different dilation rates. Dilated convolution enlarges the receptive field of the feature map without increasing the number of parameters or the complexity, and different dilation rates correspond to receptive fields of different sizes. In this method, four dilated convolutions with dilation rates of 2, 4, 8 and 12 are used; through the scale pyramid module, richer multi-scale and detail information is captured, improving the robustness of the model to scale changes.
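A PyTorch sketch of such a scale pyramid module. Because the text describes the four dilated convolutions both as "cascaded" and as "parallel", this parallel-and-concatenate variant is one plausible reading, not the patent's definitive design:

```python
import torch
import torch.nn as nn

class ScalePyramidModule(nn.Module):
    """Four 3x3 dilated convolutions with dilation rates 2, 4, 8, 12.

    padding == dilation keeps the spatial size of the feature map, so the
    receptive field grows without losing resolution; the branch outputs
    are concatenated back to the input channel count.
    """

    def __init__(self, channels: int = 512):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels // 4, 3, padding=r, dilation=r)
            for r in (2, 4, 8, 12)
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return torch.cat([torch.relu(b(f)) for b in self.branches], dim=1)
```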
3. Generation of a density map of a back-end network
In the back-end network, three layers of deformable convolution are adopted to address the arbitrary orientation of targets in remote sensing images, and finally a 1×1 convolutional layer is added to generate the density map.
Compared with standard convolution, deformable convolution adds a learnable offset at each sampling point of the receptive field of the feature map. The effect of learning these offsets is that the convolutional layer can cover the entire target regardless of its shape. A schematic of deformable convolution and visualizations of the position sampling are shown in figs. 4, 5 (a) and 5 (b).
For a standard convolution with a 3×3 kernel and dilation rate 1, the set of regular sampling points is

$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$

and the output feature map at position $p$ is

$$y(p) = \sum_{m=1}^{M} w(p_m) \cdot x(p + p_m)$$

where $w$ denotes the weight parameters, $x$ denotes the input feature map, $p_m$ denotes the $m$-th sampling point, and $M$ is the total number of sampling points. Compared with standard convolution, deformable convolution adds an adaptively learnable offset $\Delta p_m$, obtained through training optimization; for deformable convolution the output feature map is

$$y(p) = \sum_{m=1}^{M} w(p_m) \cdot x(p + p_m + \Delta p_m)$$

Specifically, three layers of deformable convolution with 3×3 kernels are adopted, each followed by a rectified linear unit (ReLU) activation. The dynamic sampling strategy of deformable convolution handles well the arbitrary target orientations caused by the top-down viewing angle of remote sensing images. At the end of the network, a 1×1 convolutional layer is added to generate the density map, and the final target count is obtained by summing all pixels of the density map.
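A sketch of the back end using torchvision's `DeformConv2d`, where a plain convolution predicts the offsets $\Delta p_m$ (two values per sampling point of the 3×3 kernel); the intermediate channel widths are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLayer(nn.Module):
    """3x3 deformable convolution + ReLU; a plain conv predicts the
    offsets delta-p_m (2 values for each of the 9 sampling points)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.relu(self.deform(f, self.offset(f)))

class BackEnd(nn.Module):
    """Three deformable layers, then a 1x1 conv producing the density map;
    the channel widths 256/128/64 are assumptions, not from the patent."""

    def __init__(self, in_ch: int = 512):
        super().__init__()
        self.body = nn.Sequential(
            DeformableLayer(in_ch, 256),
            DeformableLayer(256, 128),
            DeformableLayer(128, 64),
        )
        self.head = nn.Conv2d(64, 1, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(f))
```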
The method counts targets by estimating a density map from the input image, so remote sensing images with manually annotated target-center positions must first be converted into ground-truth density maps before training. When training the whole network, an objective (loss) function is optimized to measure the difference between the density map estimated by the network and the ground-truth density map. Finally, in the testing stage, classical evaluation metrics are adopted to assess the effectiveness of the method. Specifically:

Generation of the ground-truth density map: a target whose center is at pixel position $x_n$ is represented by an impulse function $\delta(x - x_n)$, so an image containing $N$ targets can be expressed as:

$$H(x) = \sum_{n=1}^{N} \delta(x - x_n)$$

To generate the density map $F$, $H(x)$ is convolved with a Gaussian kernel:

$$F(x) = H(x) * G_{\sigma_n}(x) = \sum_{n=1}^{N} \delta(x - x_n) * G_{\sigma_n}(x)$$

where $H(x)$ represents the image containing $N$ targets, $F(x)$ is the ground-truth density map, $G_{\sigma_n}$ is a Gaussian kernel with standard deviation $\sigma_n$, $n$ indexes the $n$-th target, and a fixed kernel $\sigma_n = 15$ is used.
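A minimal sketch of this conversion using SciPy's Gaussian filter, assuming annotations are given as (x, y) center coordinates inside the image:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def truth_density_map(centers, height, width, sigma=15.0):
    """Ground-truth density map: delta impulses at the annotated target
    centers, blurred by a Gaussian kernel with the fixed sigma = 15."""
    h_map = np.zeros((height, width), dtype=np.float32)
    for x, y in centers:                    # (x, y) target-center pixels
        h_map[int(y), int(x)] += 1.0        # H(x) = sum_n delta(x - x_n)
    return gaussian_filter(h_map, sigma)    # F = H * G_sigma; sums to ~N
```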
With respect to the loss function: the Euclidean distance is used as the loss function to measure the difference between the predicted and ground-truth density maps:

$$L(\Theta) = \frac{1}{2B} \sum_{b=1}^{B} \left\| F(X_b; \Theta) - F_b \right\|_2^2$$

where $B$ is the batch size, $X_b$ is the $b$-th input image, $\Theta$ denotes the trainable parameters, and $F(X_b; \Theta)$ and $F_b$ denote the estimated density map and the corresponding ground-truth density map, respectively.
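A one-function sketch of this loss, assuming `pred` and `gt` are batched density-map tensors of identical shape:

```python
import torch

def density_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """L(Theta) = 1/(2B) * sum_b || F(X_b; Theta) - F_b ||_2^2."""
    batch = pred.shape[0]
    return (pred - gt).pow(2).sum() / (2 * batch)
```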
To evaluate the effectiveness of the method of the invention, two metrics are used: the mean absolute error (MAE), which evaluates the accuracy of the model, and the mean squared error (MSE), which evaluates its robustness. The two criteria are defined as:

$$\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| C_t - \hat{C}_t \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( C_t - \hat{C}_t \right)^2}$$

where $T$ is the number of test samples, $t$ indexes the $t$-th image, and $\hat{C}_t$ and $C_t$ denote the estimated and true target counts, respectively. To aid understanding, the technical solution of the invention is described in detail below through a specific example.
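A minimal sketch of both metrics over per-image counts; note that the "MSE" here is the rooted form defined above:

```python
import numpy as np

def mae_mse(est_counts, true_counts):
    """MAE and MSE over T test images; MSE is the rooted form used
    in the counting literature."""
    err = np.asarray(est_counts, float) - np.asarray(true_counts, float)
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())
```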
Example 1
The method provided by the invention is validated on a data set of 3057 images covering four target categories: buildings, small vehicles, large vehicles and ships. Detailed statistics of the data set are given in Table 1.
Table 1. Statistics of the data set used to validate the invention (the table is reproduced as an image in the original publication).
Referring to figs. 6 (a)-6 (c), 7 (a)-7 (c), 8 (a)-8 (c) and 9 (a)-9 (c), the model of the invention is trained end-to-end. The first 10 layers of the network are fine-tuned from the VGG16 network structure, and the parameters of the other convolutional layers are initialized from a Gaussian distribution with standard deviation 0.01. During training, stochastic gradient descent (SGD) is used with the learning rate set to 1e-5. For the building data set, the batch size is 32 and training converges within 400 epochs; for the other three categories, i.e. the ship, small-vehicle and large-vehicle data sets, the batch size is 1 and training likewise runs for 400 epochs.
To augment the training set, 9 image patches are cropped at different positions of each image, each patch being 1/4 the size of the original image: the first 4 patches are non-overlapping quarters and the remaining 5 are cropped at random positions; the patches are then flipped horizontally. Because the images in the ship, small-vehicle and large-vehicle data sets have higher resolution than common data sets, GPU memory is easily exhausted, so these images are resized to 1024 × 768 before data augmentation. The model is implemented in PyTorch and evaluated on an NVIDIA RTX 2080 Ti GPU.
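A sketch of the nine-patch cropping described above; whether the horizontal flips replace or double the patches is not specified, so this version appends flipped copies:

```python
import random
import numpy as np

def nine_patches(img: np.ndarray):
    """Nine quarter-size patches: 4 non-overlapping quarters plus 5 at
    random positions; flipped copies are appended as augmentation."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    patches = [img[:h, :w], img[:h, w:2 * w],
               img[h:2 * h, :w], img[h:2 * h, w:2 * w]]
    for _ in range(5):
        top = random.randint(0, img.shape[0] - h)
        left = random.randint(0, img.shape[1] - w)
        patches.append(img[top:top + h, left:left + w])
    return patches + [p[:, ::-1] for p in patches]  # add horizontal flips
```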
To verify the effectiveness of each module of the model, ablation experiments were conducted on the building data set. The experiments comprise a baseline and three successive additions to it:
● Baseline: CSRNet is adopted as the baseline method (its front-end network takes the VGG16 network structure as the backbone, and its back-end network uses 6 dilated convolutional layers with dilation rate 2);
● Baseline + attention module: the serially connected channel and spatial attention modules are added to the baseline;
● Baseline + attention module + scale pyramid module: the scale pyramid module is added on top of the above;
● Baseline + attention module + scale pyramid module + deformable convolution module: the full method provided by the invention.
The results of the ablation experiments are shown in Table 2; every module of the network contributes to the performance. Specifically, the original baseline performs poorly on the data set; adding the attention module gathers global and local dependency information of the feature map and improves performance to some extent; adding the scale pyramid module improves performance further; finally, with deformable convolution fused in, the proposed model achieves the best performance on the data set.
Table 2. Ablation experiments on the building data set (the table is reproduced as an image in the original publication).
Table 3 compares the method of the invention with other methods, including MCNN, CMTL, CSRNet, SFCN, SANet, SPN and SCAR. The method of the invention achieves the best results on the constructed remote sensing target-counting data set, demonstrating its good generalization ability.
Table 3. Comparison of the method of the invention with other methods (the table is reproduced as images in the original publication).
In the present invention, the terms "first", "second", "third", and "fourth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" means two or more unless explicitly defined otherwise.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A target counting method in remote sensing images based on an attention mechanism, characterized in that the input image is processed in the following three cascaded stages on the basis of the VGG16 network structure:
S1: feature extraction in the front-end network;
for an input image, the first 10 layers of the VGG16 network structure are applied, and then a convolutional block attention module is fused in, i.e., a channel attention module connected in series with a spatial attention module, to encode the correlations between feature-map channels and between pixel positions;
the structural architecture of the channel attention module is as follows: the feature map of any intermediate layer is denoted $F \in \mathbb{R}^{C \times H \times W}$, where $\mathbb{R}$ denotes real space and $C$, $H$ and $W$ denote the channels, height and width of the feature map, respectively; first, a 1×1 convolution is applied to the feature map, and two feature maps $C_1$ and $C_2$ are obtained by reshaping and transposing; next, $C_1$ and $C_2$ are multiplied and a softmax operation is applied to obtain the channel attention map $C_a$ of size $C \times C$, i.e.

$$C_a^{ij} = \frac{\exp\left(C_1^i \cdot C_2^j\right)}{\sum_{i=1}^{C} \exp\left(C_1^i \cdot C_2^j\right)}$$

where $C_a^{ij}$ indicates the effect of the $i$-th channel on the $j$-th channel in the channel attention map, $C_1^i$ is the $i$-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $C_2^j$ is the $j$-th channel of the feature map obtained by applying a 1×1 convolution, reshaping and transposing to size $HW \times C$; finally, the feature map of size $C \times H \times W$ weighted by the channel attention module is computed as:

$$E_j = \lambda \sum_{i=1}^{C} \left(C_a^{ij} C_3^i\right) + F_j$$

where $\lambda$ is a learnable parameter obtained through a 1×1 convolution operation, $E_j$ is the $j$-th channel of the feature map finally weighted by the channel attention module, $C_3^i$ is the $i$-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $F_j$ is the $j$-th channel of the original feature map;
the network architecture of the spatial attention module is as follows: first, three 1×1 convolution operations are performed on the feature map, and two feature maps $S_1$ and $S_2$ are obtained by reshaping and transposing; next, $S_1$ and $S_2$ are multiplied and a softmax operation is applied to obtain the spatial attention map $S_a$ of size $HW \times HW$, i.e.

$$S_a^{kl} = \frac{\exp\left(S_1^k \cdot S_2^l\right)}{\sum_{k=1}^{HW} \exp\left(S_1^k \cdot S_2^l\right)}$$

where $S_a^{kl}$ indicates the effect of the $k$-th position on the $l$-th position, $S_1^k$ is the $k$-th position of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $S_2^l$ is the $l$-th position of the feature map obtained by applying a 1×1 convolution, reshaping and transposing to size $HW \times C$; the feature map of size $C \times H \times W$ weighted by the spatial attention module is computed as:

$$E_l = \mu \sum_{k=1}^{HW} \left(S_a^{kl} S_3^k\right) + F_l$$

where $\mu$ is a learnable parameter obtained through a 1×1 convolution operation, $E_l$ is the $l$-th position of the feature map finally weighted by the spatial attention module, $S_3^k$ is the $k$-th position of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $F_l$ is the $l$-th position of the original feature map;
S2: multi-scale fusion in the mid-end network;
a scale pyramid module is introduced that combines dilated convolutions with dilation rates of 2, 4, 8 and 12 to capture richer multi-scale and detail information;
S3: density-map generation in the back-end network;
three layers of deformable convolution with 3×3 kernels are adopted, each followed by a rectified linear unit (ReLU) activation, and finally a 1×1 convolutional layer is added to generate the density map;
wherein, for a sampling position $p_m$ in a convolution with a 3×3 kernel and dilation rate 1, the set of regular sampling points is

$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$

and the output feature map at position $p$ is

$$y(p) = \sum_{m=1}^{M} w(p_m) \cdot x(p + p_m)$$

where $w$ denotes the weight parameters, $x$ denotes the input feature map, $m = 1, \ldots, M$, $p_m$ denotes the $m$-th sampling point and $M$ is the total number of sampling points; the deformable convolution adds an adaptively learnable offset $\Delta p_m$, obtained through training optimization, so that the output feature map becomes

$$y(p) = \sum_{m=1}^{M} w(p_m) \cdot x(p + p_m + \Delta p_m);$$

S4: all pixels of the density map from step S3 are summed to obtain the final target count.
CN202010794525.2A 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism Active CN112084868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010794525.2A CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010794525.2A CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Publications (2)

Publication Number Publication Date
CN112084868A (en) 2020-12-15
CN112084868B (en) 2022-12-23

Family

ID=73736164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010794525.2A Active CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Country Status (1)

Country Link
CN (1) CN112084868B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541459A (en) * 2020-12-21 2021-03-23 山东师范大学 Crowd counting method and system based on multi-scale perception attention network
CN112598059A (en) * 2020-12-22 2021-04-02 深圳集智数字科技有限公司 Worker dressing detection method and device, storage medium and electronic equipment
CN112766123B (en) * 2021-01-11 2022-07-22 山东师范大学 Crowd counting method and system based on criss-cross attention network
CN112926480B (en) * 2021-03-05 2023-01-31 山东大学 Multi-scale and multi-orientation-oriented aerial photography object detection method and system
CN113011329B (en) * 2021-03-19 2024-03-12 陕西科技大学 Multi-scale feature pyramid network-based and dense crowd counting method
CN112906662B (en) * 2021-04-02 2022-07-19 海南长光卫星信息技术有限公司 Method, device and equipment for detecting change of remote sensing image and storage medium
CN113283529B (en) * 2021-06-08 2022-09-06 南通大学 Neural network construction method for multi-modal image visibility detection
CN113554156B (en) * 2021-09-22 2022-01-11 中国海洋大学 Multitask image processing method based on attention mechanism and deformable convolution
CN114170188A (en) * 2021-12-09 2022-03-11 同济大学 Target counting method and system for overlook image and storage medium
CN114399728B (en) * 2021-12-17 2023-12-05 燕山大学 Foggy scene crowd counting method
CN115620284B (en) * 2022-12-19 2023-04-18 广东工业大学 Cell apoptosis counting method, system and platform based on convolution attention mechanism
CN116433675B (en) * 2023-06-15 2023-08-15 武汉理工大学三亚科教创新园 Vehicle counting method based on residual information enhancement, electronic device and readable medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232394B (en) * 2018-03-06 2021-08-10 华南理工大学 Multi-scale image semantic segmentation method
CN109241895B (en) * 2018-08-28 2021-06-04 北京航空航天大学 Dense crowd counting method and device
CN110084210B (en) * 2019-04-30 2022-03-29 电子科技大学 SAR image multi-scale ship detection method based on attention pyramid network
CN110188685B (en) * 2019-05-30 2021-01-05 燕山大学 Target counting method and system based on double-attention multi-scale cascade network
CN110674704A (en) * 2019-09-05 2020-01-10 同济大学 Crowd density estimation method and device based on multi-scale expansion convolutional network
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method

Also Published As

Publication number Publication date
CN112084868A (en) 2020-12-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant