CN112084868A - Target counting method in remote sensing image based on attention mechanism - Google Patents


Info

Publication number
CN112084868A
Authority
CN
China
Prior art keywords
convolution
remote sensing
target
network
scale
Prior art date
Legal status
Granted
Application number
CN202010794525.2A
Other languages
Chinese (zh)
Other versions
CN112084868B (en)
Inventor
刘庆杰 (Liu Qingjie)
高广帅 (Gao Guangshuai)
王蕴红 (Wang Yunhong)
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010794525.2A
Publication of CN112084868A
Application granted
Publication of CN112084868B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/176 Urban or other man-made structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/60 Analysis of geometric attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform


Abstract

The invention discloses an attention-based method for counting targets in remote sensing images. On the basis of VGG16, the method integrates an attention mechanism, a scale pyramid, and deformable convolution, and consists of three cascaded stages: feature extraction by a front-end network, scale fusion by a mid-end network, and density map generation by a back-end network. This technical scheme effectively addresses the target scale diversity, the complex cluttered background interference, and the arbitrary target orientation encountered in dense target counting tasks in remote sensing images.

Description

Target counting method in remote sensing image based on attention mechanism
Technical Field
The invention belongs to the technical field of remote sensing images, and particularly relates to a target counting method in a remote sensing image based on an attention mechanism.
Background
In recent decades, driven by the needs of national security and urban planning, estimating the number of targets in complex scenes has received increasing attention. Accordingly, much work on object counting has been done in various fields, such as crowd counting in surveillance video, cell counting under the microscope, animal counting in ecological studies, vehicle counting, and object counting in environmental studies.
Although object counting has advanced greatly in these fields, it has rarely been addressed in remote sensing, apart from a few scenarios such as counting palm or olive trees or counting vehicles in images taken by drones. The main ground objects in remote sensing images, such as buildings and ships, have received little attention, yet counting them has many practical applications, such as urban planning, environmental control, digital city modeling, and disaster response planning.
Compared with object counting in other fields, object counting in remote sensing images presents several challenges: 1) scale diversity: target scales in remote sensing images vary widely; within a single image, sizes can range from a few pixels to thousands of pixels; 2) complex and diverse backgrounds: multiple kinds of ground objects usually coexist in a remote sensing image, and especially when targets are small, their detection and counting are severely hampered by complex, cluttered background interference; 3) arbitrary orientation: unlike objects in natural scene images, such as pedestrians, which are upright, objects in remote sensing images appear in arbitrary orientations because of the downward-looking perspective of satellite-borne or airborne sensors.
The name VGG comes from the Visual Geometry Group of the Department of Engineering Science at Oxford University, which published a series of convolutional network models, from VGG11 to VGG19, applicable to face recognition, image classification, and other tasks. VGG's original purpose in studying convolutional network depth was to find out how depth affects the accuracy of large-scale image classification and recognition; to deepen the network without introducing excessive parameters, VGG uses small 3×3 convolution kernels in all layers, with the convolution stride set to 1. The input to VGG is a 224×224 RGB image; the RGB mean computed over the training set is subtracted from each image, which is then passed through the convolutional network using 3×3 or 1×1 filters with the stride fixed at 1. All VGG variants have 3 fully connected layers; depending on the total number of convolutional and fully connected layers, the variants range from VGG11 (8 convolutional layers plus 3 fully connected layers) to VGG19 (16 convolutional layers plus 3 fully connected layers). Moreover, a pooling layer does not follow every convolutional layer; there are 5 pooling layers in total, distributed after different convolutional stages. VGG16 was originally applied to image classification tasks and, owing to its simplicity and practicality, quickly became the most popular convolutional neural network model of its time; it is now commonly applied to a variety of computer vision tasks.
Disclosure of Invention
To solve the problems of dataset scarcity, target scale diversity, complex cluttered background interference, and arbitrary target orientation in dense target counting tasks in remote sensing images, the invention provides an attention-based method for counting targets in remote sensing images. On the basis of the VGG16 network structure, the method integrates an attention mechanism, a scale pyramid, and deformable convolution, and is accordingly named ASPDNet; it consists of three cascaded stages: feature extraction by the front-end network, scale fusion by the mid-end network, and density map generation by the back-end network. The specific technical scheme of the invention is as follows:
a method for counting targets in remote sensing images based on an attention mechanism, characterized by processing an input image in the following three cascaded stages on the basis of the VGG16 network structure (a code sketch follows these steps):
S1: feature extraction by the front-end network;
for an input image, the first 10 layers of the VGG16 network structure are applied, followed by a convolutional block attention module, i.e., a channel attention module connected in series with a spatial attention module, to encode the correlations between feature map channels and between pixel positions;
S2: multi-scale fusion by the mid-end network;
a scale pyramid module is introduced that combines dilated convolutions with dilation rates of 2, 4, 8, and 12, capturing more multi-scale and detail information;
S3: density map generation by the back-end network;
three deformable convolution layers with 3×3 kernels are applied, each followed by a rectified linear unit (ReLU) activation function, and finally a 1×1 convolution layer is added to generate the density map;
S4: summing all pixels of the density map from step S3 yields the final target count.
The invention has the beneficial effects that:
1. The front-end network takes the first 10 layers of the VGG16 network structure as its backbone and then adds an attention module, which highlights target regions of interest and suppresses complex background regions, thereby effectively handling the complex, cluttered background interference in remote sensing images.
2. A scale pyramid module is introduced at the mid-end of the network to capture multi-scale information corresponding to different receptive fields without increasing the number of parameters, thereby effectively handling scale diversity.
3. Three deformable convolution layers are adopted in the back-end network; the offsets learned in these convolutions allow the sampling grid to cover the target well, thereby effectively handling the arbitrary orientation of targets in remote sensing images.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below, so that the features and advantages of the present invention can be understood more clearly. The drawings are schematic and should not be construed as limiting the present invention in any way; for a person skilled in the art, other drawings can be obtained from them without inventive effort. In the drawings:
FIG. 1 is a flow chart of the network architecture of the present invention;
FIG. 2(a) is a structural diagram of the channel attention module;
FIG. 2(b) is a structural diagram of the spatial attention module;
FIG. 3 is a schematic diagram of a scale pyramid module;
FIG. 4 is a schematic diagram of a deformable convolution;
FIG. 5(a) is a visualization of position sampling in standard convolution;
FIG. 5(b) is a visualization of position sampling in deformable convolution;
FIG. 6(a) is a building picture;
FIG. 6(b) is a true density map and count results for a building;
FIG. 6(c) is a graph of building density and count results obtained by the method of the present invention;
FIG. 7(a) is a picture of small vehicles;
FIG. 7(b) is the true density map and count result for the small vehicles;
FIG. 7(c) is the density map and count result for the small vehicles obtained by the method of the present invention;
fig. 8(a) is a picture of a large vehicle;
FIG. 8(b) is the true density map and count result for the large vehicle;
FIG. 8(c) is a density map and count results for a large vehicle obtained by the method of the present invention;
FIG. 9(a) is a photograph of a ship;
FIG. 9(b) is a true density map and count results for a ship;
fig. 9(c) is a density map and a count result of the ship obtained by the method of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
The invention aims to accurately estimate the number of dense targets in remote sensing images, such as densely packed adjoining houses, ships docked at a port, or small vehicles and large trucks in a parking lot.
To solve this technical problem, the invention constructs a convolutional neural network framework for target counting based on density estimation. On the basis of the VGG16 network structure, it integrates an attention mechanism (a channel attention module connected in series with a spatial attention module), a scale pyramid module, a deformable convolution module, and related techniques. Specifically, it consists of three cascaded stages: feature extraction by the front-end network, scale fusion by the mid-end network, and density map generation by the back-end network; finally, all pixels in the density map are summed to obtain the number of targets in the remote sensing image. The network flow diagram is shown in fig. 1.
The front-end network takes the first 10 layers of the VGG16 network structure as its backbone and adds an attention module that fully accounts for the correlations between feature map channels and between pixel positions, extracting rich semantic and contextual information. This highlights the target regions of interest and suppresses complex background regions, effectively handling the complex, cluttered background interference in remote sensing images.
Because three max-pooling layers are applied in the network, the image resolution is reduced to 1/64 of the original (1/8 in each dimension). To enlarge the receptive field of the feature map, a Scale Pyramid Module (SPM) is introduced at the mid-end of the network: four parallel dilated convolutions with different dilation rates whose outputs are combined. The SPM captures multi-scale information corresponding to different receptive fields without increasing the number of parameters, thereby effectively handling scale diversity.
A three-layer deformable convolution is adopted in the back-end network. The deformable convolution operation adds learnable offsets to the original standard convolution; thanks to this adaptive position-sampling technique, the learned offsets allow the sampling grid to cover the target well, effectively handling the arbitrary orientation of targets in remote sensing images. A 1×1 convolution layer is used in the last layer of the network to generate a density map, and finally all pixels of the density map are summed to obtain the number of targets. In particular:
1. Feature extraction in the front-end network
Given a remote sensing image of arbitrary size, the VGG16 network structure serves as the backbone, and the first 10 layers of VGG16 are applied. A convolutional block attention module is then added, i.e., a Channel Attention Module (CAM) connected in series with a Spatial Attention Module (SAM), to encode the correlations between feature map channels and between pixel positions and thus gather more salient feature information, highlighting the targets and suppressing the complex, cluttered background.
Channel attention module: in dense scenes, the texture of foreground targets is very similar to that of the background, which makes counting difficult; incorporating a channel attention module, whose architecture is shown in fig. 2(a), alleviates this problem. Specifically, let the feature map of any intermediate layer be $F \in \mathbb{R}^{C \times H \times W}$, where $\mathbb{R}$ denotes real space and C, H, and W denote the number of channels, the height, and the width of the feature map, respectively. First, a 1×1 convolution is applied to F, and two feature maps $C_1$ and $C_2$ are obtained by reshaping and transposing; next, $C_1$ and $C_2$ are multiplied and a softmax (normalized exponential) operation is applied to obtain a channel attention map $C_a$ of size C×C. This process is expressed as:

$$C_a^{ij} = \frac{\exp\left(C_1^i \cdot C_2^j\right)}{\sum_{i=1}^{C} \exp\left(C_1^i \cdot C_2^j\right)}$$

where $C_a^{ij}$ indicates the influence of the i-th channel on the j-th channel in the channel attention map, $C_1^i$ is the i-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size C×HW, and $C_2^j$ is the j-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map, then reshaping and transposing it to size HW×C. Finally, the feature map of size C×H×W weighted by the channel attention module is computed as:

$$F_j^{ca} = \lambda \sum_{i=1}^{C} C_a^{ij}\, C_1^i + F_j$$

where λ is a learnable parameter, which can be learned through a 1×1 convolution operation, $F_j^{ca}$ is the j-th channel of the feature map finally weighted by the channel attention module, $C_1^i$ is as defined above, and $F_j$ is the j-th channel of the original feature map.
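A minimal PyTorch sketch of this channel attention computation, following the formulas above; the module layout is an assumption, since only the single 1×1 convolution and the learnable λ are stated in the text:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: a C x C affinity map reweights the channels of the input."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # the single 1x1 conv
        self.lam = nn.Parameter(torch.zeros(1))  # learnable lambda, initialised to 0

    def forward(self, F):
        B, C, H, W = F.shape
        c1 = self.proj(F).view(B, C, H * W)   # C1: C x HW
        c2 = c1.permute(0, 2, 1)              # C2: HW x C (reshape + transpose)
        # C_a[i, j] = C1^i . C2^j; softmax over i, the summation index
        attn = torch.softmax(torch.bmm(c1, c2), dim=1)  # C x C attention map
        # F_ca^j = lambda * sum_i C_a[i, j] * C1^i + F^j
        out = torch.bmm(attn.permute(0, 2, 1), c1).view(B, C, H, W)
        return self.lam * out + F
```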
Spatial attention module: considering that the feature map has different densities at different pixel positions, long-range dependencies in the spatial dimension are further encoded, so that feature information at spatial positions is captured effectively. The spatial attention module is similar to the channel attention module above; its network architecture is shown in fig. 2(b). The two differ in that: 1) the spatial attention module needs three 1×1 convolution layers, whereas the channel attention module has only one; 2) the channel attention map $C_a$ has size C×C, whereas the spatial attention map $S_a$ has size HW×HW. Specifically,

$$S_a^{kl} = \frac{\exp\left(S_1^k \cdot S_2^l\right)}{\sum_{k=1}^{HW} \exp\left(S_1^k \cdot S_2^l\right)}$$

where $S_a^{kl}$ indicates the influence of the k-th position on the l-th position, $S_1^k$ is the k-th position of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size C×HW, and $S_2^l$ is the l-th position of the feature map obtained by applying a 1×1 convolution to the original feature map, then reshaping and transposing it to size HW×C. Finally, the feature map of size C×H×W weighted by the spatial attention module is computed as:

$$F_l^{sa} = \mu \sum_{k=1}^{HW} S_a^{kl}\, S_1^k + F_l$$

where μ is a learnable parameter, which can be learned through a 1×1 convolution operation, $F_l^{sa}$ is the l-th position of the feature map finally weighted by the spatial attention module, $S_1^k$ is as defined above, and $F_l$ is the l-th position of the original feature map.
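A corresponding PyTorch sketch of the spatial attention module; the text mentions three 1×1 convolutions, so the third one is assumed here to act as a value projection:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: an HW x HW position-affinity map reweights positions."""
    def __init__(self, channels):
        super().__init__()
        self.proj_s1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_s2 = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_val = nn.Conv2d(channels, channels, kernel_size=1)  # assumed value path
        self.mu = nn.Parameter(torch.zeros(1))  # learnable mu, initialised to 0

    def forward(self, F):
        B, C, H, W = F.shape
        s1 = self.proj_s1(F).view(B, C, H * W)  # S1: C x HW, position k = column k
        s2 = self.proj_s2(F).view(B, C, H * W)  # S2 before transpose: C x HW
        # energy[k, l] = S1^k . S2^l; softmax over k, the summation index
        attn = torch.softmax(torch.bmm(s1.permute(0, 2, 1), s2), dim=1)  # HW x HW
        v = self.proj_val(F).view(B, C, H * W)
        out = torch.bmm(v, attn)  # out[:, :, l] = sum_k attn[k, l] * v[:, :, k]
        return self.mu * out.view(B, C, H, W) + F
```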
2. Multi-scale fusion in the mid-end network
Because the front-end network contains three pooling operations, the output feature map is 1/64 of the original size (1/8 in each dimension). To enlarge the receptive field of the feature map while keeping its resolution unchanged, a scale pyramid module is introduced, as shown in fig. 3: several dilated convolutions with different dilation rates operate in parallel and their outputs are combined. Dilated convolution enlarges the receptive field of the feature map without increasing the number of parameters or the complexity, and different dilation rates correspond to receptive fields of different sizes. In this method, the number of dilated convolutions is set to 4, with dilation rates of 2, 4, 8, and 12 respectively; through the scale pyramid module, more multi-scale and detail information is captured, improving the robustness of the model to scale changes.
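A minimal sketch of such a scale pyramid module in PyTorch. The text fixes the dilation rates (2, 4, 8, 12) but not the fusion rule; summing the four parallel branches is assumed here because it keeps the channel count, and hence the parameter count of the following layers, unchanged:

```python
import torch.nn as nn

class ScalePyramidModule(nn.Module):
    """Four parallel 3x3 dilated convolutions; padding = dilation keeps H x W fixed."""
    def __init__(self, channels, rates=(2, 4, 8, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x):
        # Each branch sees a different receptive field; fuse by summation (assumed).
        return sum(branch(x) for branch in self.branches)
```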
3. Density map generation in the back-end network
In the back-end network, three deformable convolution layers are adopted to handle the arbitrary orientation of targets in remote sensing images, and finally a 1×1 convolution layer is added to generate the density map.
Compared with standard convolution, the deformable convolution operation adds a learnable offset to each sampling point of the feature map's receptive field. Learning these offsets allows the convolution layer to cover the whole target regardless of changes in the target's shape. Schematic diagrams of deformable convolution and its position sampling are shown in figs. 4, 5(a), and 5(b).
For a standard convolution with a 3×3 kernel and dilation rate 1, the sampling positions $p_m$ form the regular grid

$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\},$$

and the output feature map at position p is

$$y(p) = \sum_{m=1}^{M} w(p_m)\, x(p + p_m)$$

where w denotes the weight parameters, x denotes the input feature map, $p_m$ is the m-th sampling point, and M is the total number of sampling points. Compared with standard convolution, deformable convolution adds an adaptively learnable offset $\Delta p_m$, obtained through training optimization, on top of the standard convolution; for deformable convolution, the output feature map is

$$y(p) = \sum_{m=1}^{M} w(p_m)\, x(p + p_m + \Delta p_m).$$
Specifically, three layers of deformable convolution with 3×3 kernels are adopted, each followed by a rectified linear unit (ReLU) activation function. Through the dynamic sampling strategy of deformable convolution, the arbitrary target orientations caused by the overhead viewing angle of remote sensing images are handled well. At the end of the network, a 1×1 convolution layer is added to generate the density map, and the final target count is obtained by summing all pixels of the density map.
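A sketch of the back end using torchvision's DeformConv2d. The three 3×3 deformable layers, the ReLU after each, and the final 1×1 layer follow the text; the offset-prediction convolutions and the channel widths (512 to 256 to 128 to 64) are assumptions:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBackend(nn.Module):
    """Three 3x3 deformable conv layers, each followed by ReLU."""
    def __init__(self, in_channels=512, widths=(256, 128, 64)):
        super().__init__()
        self.offset_convs = nn.ModuleList()
        self.deform_convs = nn.ModuleList()
        c_in = in_channels
        for c_out in widths:
            # 18 = 2 offsets (dx, dy) for each of the 9 sampling points of a 3x3 kernel
            self.offset_convs.append(nn.Conv2d(c_in, 18, kernel_size=3, padding=1))
            self.deform_convs.append(DeformConv2d(c_in, c_out, kernel_size=3, padding=1))
            c_in = c_out
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        for offset_conv, deform in zip(self.offset_convs, self.deform_convs):
            x = self.relu(deform(x, offset_conv(x)))  # samples x(p + p_m + delta p_m)
        return x
```

A 1×1 convolution (the density_head in the earlier skeleton) then maps the 64-channel output to the single-channel density map.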
The method counts targets by estimating a density map from the input image, so remote sensing images with manually annotated target center positions must first be converted into ground-truth density maps before training. When training the whole network, an objective (loss) function is optimized to measure the difference between the density map estimated by the network and the ground-truth density map. Finally, in the testing stage, classical evaluation metrics are adopted to assess the effectiveness of the method of the invention. In particular:
for generation of the truth density map: assume a pixel position of xn(target center coordinates) of a target, capable of being operated with a pulse function (x-x)n) For an image containing N objects, this can be expressed as:
Figure BDA0002625051970000074
to generate the density map F, H (x) is convolved with a Gaussian kernel, i.e.
Figure BDA0002625051970000075
Wherein H (x) is a function representing an image containing N targets, F (x) is a truth density map function,
Figure BDA0002625051970000076
is that the variance is sigmanN denotes the nth target, σnDenotes the standard deviation, sets the fixed kernel σn=15。
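A sketch of the ground-truth density map generation with the fixed kernel $\sigma_n = 15$, using NumPy and SciPy; function and argument names are illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ground_truth_density(centers, height, width, sigma=15.0):
    """H(x): unit impulses at annotated target centres; F(x) = H * G_sigma."""
    h = np.zeros((height, width), dtype=np.float32)
    for x, y in centers:  # (x, y) = manually annotated centre of one target
        h[int(round(y)), int(round(x))] = 1.0
    # Convolving with the Gaussian spreads each impulse; the map still sums to ~N.
    return gaussian_filter(h, sigma=sigma)
```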
Regarding the loss function: the Euclidean distance is used as the loss function to measure the difference between the predicted and ground-truth density maps:

$$L(\Theta) = \frac{1}{2B} \sum_{b=1}^{B} \left\| F(X_b;\Theta) - F_b^{GT} \right\|_2^2$$

where B denotes the batch size, $X_b$ denotes the b-th input image, Θ denotes the trainable parameters, and $F(X_b;\Theta)$ and $F_b^{GT}$ denote the estimated density map and the corresponding ground-truth density map, respectively.
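As a sketch, the loss above translates directly into a few lines of PyTorch (pred and gt are batches of estimated and ground-truth density maps):

```python
import torch

def density_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """L(Theta) = 1/(2B) * sum_b || F(X_b; Theta) - F_b^GT ||_2^2."""
    batch_size = pred.size(0)
    return ((pred - gt) ** 2).sum() / (2 * batch_size)
```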
To evaluate the effectiveness of the method of the invention, two evaluation metrics are used: Mean Absolute Error (MAE), which evaluates the accuracy of the model, and Mean Squared Error (MSE), which evaluates its robustness. The two criteria are defined as:

$$MAE = \frac{1}{T}\sum_{t=1}^{T}\left|\hat{y}_t - y_t\right|, \qquad MSE = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t - y_t\right)^2}$$

where T denotes the number of test samples, t indexes the t-th image, and $\hat{y}_t$ and $y_t$ denote the estimated and true target counts, respectively. To facilitate understanding of the above technical scheme, it is described in detail below through specific embodiments.
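Before turning to the examples, a sketch of the two metrics in NumPy; estimated and true are the per-image predicted and ground-truth counts over the T test images (the rooted form of MSE follows the formula above):

```python
import numpy as np

def mae_mse(estimated, true):
    """MAE measures accuracy; MSE (rooted form) measures robustness."""
    est = np.asarray(estimated, dtype=np.float64)
    gt = np.asarray(true, dtype=np.float64)
    mae = np.abs(est - gt).mean()
    mse = np.sqrt(((est - gt) ** 2).mean())
    return mae, mse
```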
Example 1
The method provided by the invention is validated on a dataset comprising 3057 images and 4 target categories: buildings, small vehicles, large vehicles, and ships. Detailed statistics of the dataset are given in Table 1.
Table 1 data set information statistics used to validate the invention
[Table 1 is reproduced as an image in the original document.]
Referring to figs. 6(a)-6(c), 7(a)-7(c), 8(a)-8(c), and 9(a)-9(c), the model of the invention is trained end-to-end: the first 10 layers of the network are fine-tuned from the VGG16 network structure, and the parameters of the other convolutional layers are initialized from a Gaussian distribution with standard deviation 0.01. During training, stochastic gradient descent (SGD) is used with the learning rate set to 1e-5. For the building dataset, the batch size is 32 and training converges within 400 epochs; for the other three categories, i.e. the ship, small-vehicle, and large-vehicle datasets, the batch size is 1 and training likewise runs for 400 epochs.
To augment the training set, 9 image patches are cropped from each image at different positions, each patch being 1/4 the size of the original image: the first 4 patches are non-overlapping, and the remaining 5 are cropped at random positions; the patches are then flipped horizontally. Because the resolution of the images in the ship, small-vehicle, and large-vehicle datasets is higher than that of other conventional datasets, GPU memory is easily exhausted; these images are therefore resized to 1024×768 before data augmentation. The model is implemented in PyTorch and tested on an NVIDIA GTX 2080Ti GPU.
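A sketch of the augmentation described above, using PIL. The crop layout follows the text (four non-overlapping quarter-size patches plus five random ones, each then flipped horizontally), while the function name is illustrative; the corresponding density-map patches must of course be cropped and flipped identically (omitted here):

```python
import random
from PIL import Image

def augment(img):
    """Yield 9 quarter-size crops of a PIL image and their horizontal flips."""
    w, h = img.size
    cw, ch = w // 2, h // 2  # each patch is 1/4 of the original area
    offsets = [(0, 0), (cw, 0), (0, ch), (cw, ch)]  # 4 non-overlapping quadrants
    offsets += [(random.randint(0, w - cw), random.randint(0, h - ch))
                for _ in range(5)]                   # 5 random positions
    for x, y in offsets:
        patch = img.crop((x, y, x + cw, y + ch))
        yield patch
        yield patch.transpose(Image.FLIP_LEFT_RIGHT)
```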
To verify the effectiveness of each module of the model, ablation experiments were conducted on the building dataset. The experiments start from a benchmark and add the three modules to it in succession:
● Benchmark: CSRNet is adopted as the reference method (its front-end network uses the VGG16 network structure as the backbone, and its back-end network uses 6 convolution layers with dilation rate 2);
● Benchmark + attention module: on top of the reference method, the module connecting the channel attention mechanism and the spatial attention mechanism is added;
● Benchmark + attention module + scale pyramid module: the scale pyramid module is added on top of the above;
● Benchmark + attention module + scale pyramid module + deformable convolution module: the method provided by the invention.
The results of the ablation experiments are shown in Table 2; every module in the network of the invention contributes to performance. Specifically, the original benchmark method performs unsatisfactorily on the dataset; after the attention module is added, global and local dependency information of the feature map is gathered and performance improves to a certain extent; after the scale pyramid module is added, performance improves further; finally, after the deformable convolution is incorporated, the proposed model shows the best performance on the dataset.
TABLE 2 ablation experiments on building data sets
[Table 2 is reproduced as an image in the original document.]
Table 3 compares the method of the invention with other methods, including MCNN, CMTL, CSRNet, SFCN, SANet, SPN, and SCAR. The method of the invention achieves the best results on the constructed remote sensing target counting dataset, which also demonstrates its good generalization ability.
TABLE 3 comparison of the Process of the invention with other Processes
[Table 3 is reproduced as an image in the original document.]
In the present invention, the terms "first", "second", "third", and "fourth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" means two or more unless expressly limited otherwise.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A method for counting targets in remote sensing images based on an attention mechanism, characterized by processing an input image in the following three cascaded stages on the basis of the VGG16 network structure:
S1: feature extraction by the front-end network;
for an input image, the first 10 layers of the VGG16 network structure are applied, followed by a convolutional block attention module, i.e., a channel attention module connected in series with a spatial attention module, to encode the correlations between feature map channels and between pixel positions;
S2: multi-scale fusion by the mid-end network;
a scale pyramid module is introduced that combines dilated convolutions with dilation rates of 2, 4, 8, and 12, capturing more multi-scale and detail information;
S3: density map generation by the back-end network;
three deformable convolution layers with 3×3 kernels are applied, each followed by a rectified linear unit (ReLU) activation function, and finally a 1×1 convolution layer is added to generate the density map;
S4: summing all pixels of the density map from step S3 yields the final target count.
CN202010794525.2A 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism Active CN112084868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010794525.2A CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010794525.2A CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Publications (2)

Publication Number Publication Date
CN112084868A true CN112084868A (en) 2020-12-15
CN112084868B CN112084868B (en) 2022-12-23

Family

ID=73736164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010794525.2A Active CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Country Status (1)

Country Link
CN (1) CN112084868B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541459A (en) * 2020-12-21 2021-03-23 山东师范大学 Crowd counting method and system based on multi-scale perception attention network
CN112598059A (en) * 2020-12-22 2021-04-02 深圳集智数字科技有限公司 Worker dressing detection method and device, storage medium and electronic equipment
CN112766123A (en) * 2021-01-11 2021-05-07 山东师范大学 Crowd counting method and system based on criss-cross attention network
CN112906662A (en) * 2021-04-02 2021-06-04 海南长光卫星信息技术有限公司 Method, device and equipment for detecting change of remote sensing image and storage medium
CN112926480A (en) * 2021-03-05 2021-06-08 山东大学 Multi-scale and multi-orientation-oriented aerial object detection method and system
CN113011329A (en) * 2021-03-19 2021-06-22 陕西科技大学 Pyramid network based on multi-scale features and dense crowd counting method
CN113283529A (en) * 2021-06-08 2021-08-20 南通大学 Neural network construction method for multi-modal image visibility detection
CN113554156A (en) * 2021-09-22 2021-10-26 中国海洋大学 Multi-task learning model construction method based on attention mechanism and deformable convolution
CN114022742A (en) * 2021-10-22 2022-02-08 中国科学院长春光学精密机械与物理研究所 Infrared and visible light image fusion method and device and computer storage medium
CN114170188A (en) * 2021-12-09 2022-03-11 同济大学 Target counting method and system for overlook image and storage medium
CN114187275A (en) * 2021-12-13 2022-03-15 贵州大学 Multi-stage and multi-scale attention fusion network and image rain removing method
CN114399728A (en) * 2021-12-17 2022-04-26 燕山大学 Method for counting crowds in foggy day scene
CN115620284A (en) * 2022-12-19 2023-01-17 广东工业大学 Cell apoptosis counting method, system and platform based on convolution attention mechanism
CN116433675A (en) * 2023-06-15 2023-07-14 武汉理工大学三亚科教创新园 Vehicle counting method based on residual information enhancement, electronic device and readable medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084210A (en) * 2019-04-30 2019-08-02 电子科技大学 The multiple dimensioned Ship Detection of SAR image based on attention pyramid network
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
CN110674704A (en) * 2019-09-05 2020-01-10 同济大学 Crowd density estimation method and device based on multi-scale expansion convolutional network
US20200074186A1 (en) * 2018-08-28 2020-03-05 Beihang University Dense crowd counting method and apparatus
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
US20200074186A1 (en) * 2018-08-28 2020-03-05 Beihang University Dense crowd counting method and apparatus
CN110084210A (en) * 2019-04-30 2019-08-02 电子科技大学 The multiple dimensioned Ship Detection of SAR image based on attention pyramid network
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110674704A (en) * 2019-09-05 2020-01-10 同济大学 Crowd density estimation method and device based on multi-scale expansion convolutional network
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541459A (en) * 2020-12-21 2021-03-23 山东师范大学 Crowd counting method and system based on multi-scale perception attention network
CN112598059A (en) * 2020-12-22 2021-04-02 深圳集智数字科技有限公司 Worker dressing detection method and device, storage medium and electronic equipment
CN112766123B (en) * 2021-01-11 2022-07-22 山东师范大学 Crowd counting method and system based on criss-cross attention network
CN112766123A (en) * 2021-01-11 2021-05-07 山东师范大学 Crowd counting method and system based on criss-cross attention network
CN112926480B (en) * 2021-03-05 2023-01-31 山东大学 Multi-scale and multi-orientation-oriented aerial photography object detection method and system
CN112926480A (en) * 2021-03-05 2021-06-08 山东大学 Multi-scale and multi-orientation-oriented aerial object detection method and system
CN113011329A (en) * 2021-03-19 2021-06-22 陕西科技大学 Pyramid network based on multi-scale features and dense crowd counting method
CN113011329B (en) * 2021-03-19 2024-03-12 陕西科技大学 Multi-scale feature pyramid network-based and dense crowd counting method
CN112906662A (en) * 2021-04-02 2021-06-04 海南长光卫星信息技术有限公司 Method, device and equipment for detecting change of remote sensing image and storage medium
CN112906662B (en) * 2021-04-02 2022-07-19 海南长光卫星信息技术有限公司 Method, device and equipment for detecting change of remote sensing image and storage medium
CN113283529A (en) * 2021-06-08 2021-08-20 南通大学 Neural network construction method for multi-modal image visibility detection
CN113554156A (en) * 2021-09-22 2021-10-26 中国海洋大学 Multi-task learning model construction method based on attention mechanism and deformable convolution
CN114022742A (en) * 2021-10-22 2022-02-08 中国科学院长春光学精密机械与物理研究所 Infrared and visible light image fusion method and device and computer storage medium
CN114022742B (en) * 2021-10-22 2024-05-17 中国科学院长春光学精密机械与物理研究所 Infrared and visible light image fusion method and device and computer storage medium
CN114170188A (en) * 2021-12-09 2022-03-11 同济大学 Target counting method and system for overlook image and storage medium
CN114187275A (en) * 2021-12-13 2022-03-15 贵州大学 Multi-stage and multi-scale attention fusion network and image rain removing method
CN114399728B (en) * 2021-12-17 2023-12-05 燕山大学 Foggy scene crowd counting method
CN114399728A (en) * 2021-12-17 2022-04-26 燕山大学 Method for counting crowds in foggy day scene
CN115620284A (en) * 2022-12-19 2023-01-17 广东工业大学 Cell apoptosis counting method, system and platform based on convolution attention mechanism
CN116433675A (en) * 2023-06-15 2023-07-14 武汉理工大学三亚科教创新园 Vehicle counting method based on residual information enhancement, electronic device and readable medium
CN116433675B (en) * 2023-06-15 2023-08-15 武汉理工大学三亚科教创新园 Vehicle counting method based on residual information enhancement, electronic device and readable medium

Also Published As

Publication number Publication date
CN112084868B (en) 2022-12-23

Similar Documents

Publication Publication Date Title
CN112084868B (en) Target counting method in remote sensing image based on attention mechanism
CN111539370B (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN111563508B (en) Semantic segmentation method based on spatial information fusion
CN111639692B (en) Shadow detection method based on attention mechanism
CN107506740B (en) Human body behavior identification method based on three-dimensional convolutional neural network and transfer learning model
CN109165682B (en) Remote sensing image scene classification method integrating depth features and saliency features
CN104408742B (en) A kind of moving target detecting method based on space time frequency spectrum Conjoint Analysis
CN112288627B (en) Recognition-oriented low-resolution face image super-resolution method
Ablavatski et al. Enriched deep recurrent visual attention model for multiple object recognition
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN111126385A (en) Deep learning intelligent identification method for deformable living body small target
CN111582091B (en) Pedestrian recognition method based on multi-branch convolutional neural network
CN114627447A (en) Road vehicle tracking method and system based on attention mechanism and multi-target tracking
CN114005078B (en) Vehicle weight identification method based on double-relation attention mechanism
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN110837786A (en) Density map generation method and device based on spatial channel, electronic terminal and medium
CN113011308A (en) Pedestrian detection method introducing attention mechanism
CN114037640A (en) Image generation method and device
CN112785636A (en) Multi-scale enhanced monocular depth estimation method
CN115908772A (en) Target detection method and system based on Transformer and fusion attention mechanism
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN111062310B (en) Few-sample unmanned aerial vehicle image identification method based on virtual sample generation
CN112766378A (en) Cross-domain small sample image classification model method focusing on fine-grained identification

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant