CN112084868B - Target counting method in remote sensing image based on attention mechanism - Google Patents


Info

Publication number
CN112084868B
CN112084868B · CN202010794525.2A
Authority
CN
China
Prior art keywords
feature map
convolution
channel
size
map
Prior art date
Legal status
Active
Application number
CN202010794525.2A
Other languages
Chinese (zh)
Other versions
CN112084868A (en)
Inventor
Qingjie Liu (刘庆杰)
Guangshuai Gao (高广帅)
Yunhong Wang (王蕴红)
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010794525.2A
Publication of CN112084868A
Application granted
Publication of CN112084868B
Legal status: Active


Classifications

    • G06V 20/176 — Image or video recognition or understanding: scenes; terrestrial scenes; urban or other man-made structures
    • G06N 3/045 — Computing arrangements based on biological models: neural networks; architecture; combinations of networks
    • G06T 7/60 — Image analysis: analysis of geometric attributes
    • G06T 2207/20016 — Indexing scheme for image analysis: hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform


Abstract

The invention discloses an attention-based method for counting targets in remote sensing images. On the basis of VGG16, the method integrates an attention mechanism, a scale pyramid and deformable convolution, and consists of three cascaded stages: feature extraction in the front-end network, multi-scale fusion in the mid-end network, and density-map generation in the back-end network. This technical scheme effectively addresses the target scale diversity, the complex and cluttered background interference, and the arbitrary target orientation encountered in dense-target counting tasks in remote sensing images.

Description

Target counting method in remote sensing image based on attention mechanism
Technical Field
The invention belongs to the technical field of remote sensing images, and particularly relates to a target counting method in a remote sensing image based on an attention mechanism.
Background
In recent decades, driven by the needs of national security and urban planning, estimating the number of targets in complex scenes has received increasing attention. Much work on object counting has therefore been carried out in various fields, such as crowd counting in surveillance video, cell counting under the microscope, animal counting in ecological studies, vehicle counting, and object counting in environmental studies.
Although target counting has advanced greatly in these fields, it has rarely been applied to remote sensing, apart from a few scenarios such as counting vehicles in images taken by drones or counting palm and olive trees. The main ground objects in remote sensing images, such as buildings and ships, have received little attention, yet counting these targets has many practical uses, such as city planning, environmental control, digital city model construction, and disaster response planning.
Compared with target counting in other fields, target counting in remote sensing images presents several challenges: 1) scale diversity: target scales vary widely, and within a single image target sizes range from a few pixels to thousands of pixels; 2) complex and diverse backgrounds: multiple kinds of ground objects usually coexist in a remote sensing image, and especially when targets are small, their detection and counting are greatly limited by complex and cluttered background interference; 3) arbitrary orientation: unlike objects in natural-scene pictures, such as upright pedestrians, objects in remote sensing images appear in arbitrary orientations because of the top-down perspective of satellite-borne or airborne sensors.
The name VGG comes from the Visual Geometry Group in the Department of Engineering Science at the University of Oxford, which published a series of convolutional network models beginning with VGG, from VGG11 to VGG19, applicable to face recognition, image classification and the like. VGG's original purpose in studying convolutional network depth was to find out how depth affects the precision and accuracy of large-scale image classification and recognition. To deepen the network while avoiding excessive parameters, VGG uses small 3×3 convolution kernels in all layers, with the convolution stride set to 1. The input to VGG is a 224×224 RGB image; the RGB mean computed over all training images is subtracted, and the image is then passed through the convolutional network, which uses 3×3 or 1×1 filters with the stride fixed at 1. VGG has 3 fully connected layers, and variants VGG11 to VGG19 differ in the total number of convolutional and fully connected layers: the smallest, VGG11, has 8 convolutional layers and 3 fully connected layers; the largest, VGG19, has 16 convolutional layers and 3 fully connected layers. Moreover, a pooling layer does not follow every convolutional layer; there are 5 pooling layers in total, distributed after different convolutional layers. VGG16 was originally applied to the image classification task and, owing to its simplicity and practicality, quickly became the most popular convolutional neural network model of its time; it is now often applied to a variety of computer vision tasks.
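For orientation only, here is a minimal PyTorch sketch (my own illustration, not code from the patent) of how the first 10 convolutional layers of VGG16 are commonly taken as a counting backbone; the slicing index assumes the layer layout of torchvision (≥ 0.13) `models.vgg16().features`.

```python
import torch
import torch.nn as nn
from torchvision import models

def vgg16_first10(pretrained: bool = False) -> nn.Sequential:
    """First 10 convolutional layers of VGG16 (conv1_1 .. conv4_3).

    Slicing features[:23] keeps 10 Conv2d layers, their ReLUs and
    3 MaxPool2d layers, so an H x W input yields a 512-channel
    feature map of size H/8 x W/8.
    """
    weights = models.VGG16_Weights.IMAGENET1K_V1 if pretrained else None
    features = models.vgg16(weights=weights).features
    return nn.Sequential(*list(features.children())[:23])

if __name__ == "__main__":
    x = torch.randn(1, 3, 224, 224)
    print(vgg16_first10()(x).shape)  # torch.Size([1, 512, 28, 28])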
Disclosure of Invention
To address the scarcity of data sets, the diversity of target scales, complex and cluttered background interference, and the arbitrary orientation of targets in dense-target counting tasks in remote sensing images, the invention provides an attention-based method for counting targets in remote sensing images. On the basis of the VGG16 network structure, it integrates an attention mechanism, a scale pyramid and deformable convolution into a network named ASPDNet, which consists of three cascaded stages: feature extraction in the front-end network, multi-scale fusion in the mid-end network, and density-map generation in the back-end network. The specific technical scheme of the invention is as follows:
A method for counting targets in remote sensing images based on an attention mechanism, characterized in that the input image is processed in the following three cascaded stages on the basis of the VGG16 network structure:
S1: feature extraction in the front-end network;
for an input image, the first 10 layers of the VGG16 network structure are applied, and then a convolutional block attention module is fused in, i.e., a channel attention module connected in series with a spatial attention module, to encode the correlations between feature-map channels and between pixel positions;
S2: multi-scale fusion in the mid-end network;
a scale pyramid module is introduced that combines dilated convolutions with dilation rates of 2, 4, 8 and 12 to capture richer multi-scale and detail information;
S3: density-map generation in the back-end network;
three layers of deformable convolution with 3×3 kernels are adopted, each followed by a rectified linear unit (ReLU) activation, and finally a 1×1 convolutional layer is added to generate the density map;
S4: all pixels of the density map from step S3 are summed to obtain the final target count.
The invention has the beneficial effects that:
1. The front-end network takes the first 10 layers of the VGG16 network as its backbone and adds an attention module, which highlights target regions of interest and suppresses complex background regions, effectively handling the complex and cluttered background interference in remote sensing images.
2. A scale pyramid module is introduced in the middle of the network to capture multi-scale information corresponding to different receptive fields without increasing the number of parameters, effectively handling scale diversity.
3. Three layers of deformable convolution are adopted in the back-end network; the offsets learned in the convolution allow the sampling grid to cover the target well, effectively handling the arbitrary orientation of targets in remote sensing images.
Drawings
To illustrate the embodiments of the invention or the solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings are schematic and should not be understood as limiting the invention in any way; those skilled in the art can derive other drawings from them without inventive effort. In the drawings:
FIG. 1 is a flow diagram of a network architecture of the present invention;
FIG. 2 (a) is a structural diagram of the channel attention module;
FIG. 2 (b) is a structural diagram of the spatial attention module;
FIG. 3 is a schematic diagram of a scale pyramid module;
FIG. 4 is a schematic diagram of a deformable convolution;
FIG. 5 (a) is a visualization of position sampling in standard convolution;
FIG. 5 (b) is a visualization of position sampling in deformable convolution;
FIG. 6 (a) is a picture of buildings;
FIG. 6 (b) is the ground-truth density map and count result for the buildings;
FIG. 6 (c) is the density map and count result for the buildings obtained by the method of the present invention;
FIG. 7 (a) is a picture of small vehicles;
FIG. 7 (b) is the ground-truth density map and count result for the small vehicles;
FIG. 7 (c) is the density map and count result for the small vehicles obtained by the method of the present invention;
FIG. 8 (a) is a picture of large vehicles;
FIG. 8 (b) is the ground-truth density map and count result for the large vehicles;
FIG. 8 (c) is the density map and count result for the large vehicles obtained by the method of the present invention;
FIG. 9 (a) is a picture of ships;
FIG. 9 (b) is the ground-truth density map and count result for the ships;
FIG. 9 (c) is the density map and count result for the ships obtained by the method of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
The invention aims to accurately estimate the number of dense targets in a remote sensing image, such as contiguous dense housing, ships docked at a port, or small and large vehicles parked in a parking lot.
To address these technical problems, the invention constructs a convolutional neural network framework for density-estimation-based target counting. On the basis of the VGG16 network structure, it integrates an attention mechanism (a channel attention module connected in series with a spatial attention module), a scale pyramid module, and a deformable convolution module. The network consists of three cascaded stages: feature extraction in the front-end network, multi-scale fusion in the mid-end network, and density-map generation in the back-end network; finally, all pixels of the density map are summed to obtain the number of targets in the remote sensing image. The overall network flow is shown in FIG. 1.
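As a rough sketch of this three-stage cascade (an illustrative skeleton of my own, not the patent's code), the stages can be composed as follows; the concrete front-end, scale-pyramid and back-end modules are sketched in the sections below.

```python
import torch
import torch.nn as nn

class ASPDNetSkeleton(nn.Module):
    """Three cascaded stages: front-end features -> mid-end multi-scale
    fusion -> back-end density map; the count is the sum over the map."""

    def __init__(self, front_end: nn.Module, mid_end: nn.Module, back_end: nn.Module):
        super().__init__()
        self.front_end = front_end  # VGG16 first 10 layers + attention
        self.mid_end = mid_end      # scale pyramid module
        self.back_end = back_end    # deformable convs + 1x1 conv

    def forward(self, x: torch.Tensor):
        density = self.back_end(self.mid_end(self.front_end(x)))
        counts = density.sum(dim=(1, 2, 3))  # one count per input image
        return density, counts
```

Summing the density map, as in step S4 above, turns the per-pixel estimates into an image-level count.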
The front-end network takes the first 10 layers of the VGG16 network as its backbone and adds an attention module. By fully exploiting the correlations between feature-map channels and between pixel positions, it extracts rich semantic and contextual information, thereby highlighting target regions of interest and suppressing complex background regions; this effectively handles the complex and cluttered background interference in remote sensing images.
Because the front-end network contains three max-pooling layers, the feature map is reduced to 1/8 of the original image resolution along each spatial dimension (1/64 of the original area). To enlarge the receptive field of the feature map, a Scale Pyramid Module (SPM) is introduced in the middle of the network, combining four parallel dilated convolutions with different dilation rates. The SPM captures multi-scale information corresponding to different receptive fields without increasing the number of parameters, effectively handling scale diversity.
A three-layer deformable convolution is adopted in the back-end network. Deformable convolution adds a learnable offset to the standard convolution; thanks to this adaptive position-sampling technique, the learned offsets allow the sampling grid to cover the target well, effectively handling the arbitrary orientation of targets in remote sensing images. A 1×1 convolutional layer in the last layer of the network generates the density map, and finally all pixels of the density map are summed to obtain the target count. Specifically:
1. feature extraction for front-end networks
Given a remote sensing image of arbitrary size, the VGG16 network serves as the backbone and its first 10 layers are applied. A convolutional block attention module, i.e., a Channel Attention Module (CAM) connected in series with a Spatial Attention Module (SAM), is then added to encode the correlations between feature-map channels and between pixel positions, gathering more salient feature information so as to highlight targets and suppress the complex, cluttered background.
Channel attention module: in dense scenes the texture of foreground targets is very similar to that of the background, which makes counting difficult; incorporating the channel attention module, whose architecture is shown in fig. 2 (a), alleviates this problem. Specifically, let the feature map of any intermediate layer be $F \in \mathbb{R}^{C \times H \times W}$, where $\mathbb{R}$ denotes real space and $C$, $H$ and $W$ denote the channels, height and width of the feature map, respectively. First, a 1×1 convolution is applied to the feature map, and two feature maps $C_1$ and $C_2$ are obtained by reshaping and transposing; next, $C_1$ and $C_2$ are multiplied and a softmax is applied to obtain the channel attention map $C_a$ of size $C \times C$. This process is expressed as:

$$C_a^{ij} = \frac{\exp\left(C_1^i \cdot C_2^j\right)}{\sum_{i=1}^{C} \exp\left(C_1^i \cdot C_2^j\right)}$$

where $C_a^{ij}$ indicates the effect of the $i$-th channel on the $j$-th channel in the channel attention map, $C_1^i$ is the $i$-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $C_2^j$ is the $j$-th channel of the feature map obtained by applying a 1×1 convolution, reshaping and transposing to size $HW \times C$. Finally, the feature map of size $C \times H \times W$ weighted by the channel attention module is computed as:

$$E_j = \lambda \sum_{i=1}^{C} \left(C_a^{ij} C_3^i\right) + F_j$$

where $\lambda$ is a learnable parameter obtained through a 1×1 convolution operation, $E_j$ is the $j$-th channel of the feature map finally weighted by the channel attention module, $C_3^i$ is the $i$-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $F_j$ is the $j$-th channel of the original feature map.
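A PyTorch sketch of the channel attention computation above. It assumes a single shared 1×1 convolution whose output is reshaped into $C_1$, $C_2$ and $C_3$, which is one plausible reading of the equations rather than the patent's reference implementation:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.lam = nn.Parameter(torch.zeros(1))  # learnable lambda, starts at 0

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        g = self.conv(f).view(b, c, h * w)                # C1 / C3: B x C x HW
        g_t = g.permute(0, 2, 1)                          # C2:      B x HW x C
        attn = torch.softmax(torch.bmm(g, g_t), dim=-1)   # C_a:     B x C x C
        out = torch.bmm(attn, g).view(b, c, h, w)         # attention-weighted channels
        return self.lam * out + f                         # E = lambda * (C_a C3) + F
```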
Spatial attention module: considering that the feature map has different densities at different pixel positions, long-range dependencies in the spatial dimension are further encoded so that feature information at spatial positions is captured effectively. The spatial attention module is similar to the channel attention module above; its network architecture is shown in fig. 2 (b). The differences are: 1) whereas the channel attention module has only one 1×1 convolutional layer, the spatial attention module needs three; 2) whereas the channel attention map $C_a$ has size $C \times C$, the spatial attention map $S_a$ has size $HW \times HW$. Specifically,

$$S_a^{kl} = \frac{\exp\left(S_1^k \cdot S_2^l\right)}{\sum_{k=1}^{HW} \exp\left(S_1^k \cdot S_2^l\right)}$$

where $S_a^{kl}$ indicates the effect of the $k$-th position on the $l$-th position, $S_1^k$ is the $k$-th position of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $S_2^l$ is the $l$-th position of the feature map obtained by applying a 1×1 convolution, reshaping and transposing to size $HW \times C$. The feature map of size $C \times H \times W$ weighted by the spatial attention module is computed as:

$$E_l = \mu \sum_{k=1}^{HW} \left(S_a^{kl} S_3^k\right) + F_l$$

where $\mu$ is a learnable parameter obtained through a 1×1 convolution operation, $E_l$ is the $l$-th position of the feature map finally weighted by the spatial attention module, $S_3^k$ is the $k$-th position of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $F_l$ is the $l$-th position of the original feature map.
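A corresponding PyTorch sketch of the spatial attention module, with the three separate 1×1 convolutions the text specifies; the channel-reduction factor in the query/key branches is a common implementation choice and an assumption of mine, not given in the patent:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # three separate 1x1 convolutions, as the text specifies
        self.q = nn.Conv2d(channels, channels // reduction, 1)
        self.k = nn.Conv2d(channels, channels // reduction, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.mu = nn.Parameter(torch.zeros(1))  # learnable mu, starts at 0

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        s1 = self.q(f).view(b, -1, h * w)                   # B x C' x HW
        s2 = self.k(f).view(b, -1, h * w).permute(0, 2, 1)  # B x HW x C'
        attn = torch.softmax(torch.bmm(s2, s1), dim=-1)     # S_a: B x HW x HW
        s3 = self.v(f).view(b, c, h * w)                    # B x C x HW
        out = torch.bmm(s3, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.mu * out + f                            # E = mu * (S_a S3) + F
```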
2. Multi-scale fusion of mid-end networks
Since the front-end network contains three pooling operations, the output feature map is 1/8 of the original size along each spatial dimension (1/64 of the original area). To enlarge the receptive field of the feature map while keeping its resolution unchanged, a scale pyramid module is introduced, as shown in fig. 3, combining several dilated convolutions with different dilation rates. Dilated convolution enlarges the receptive field of the feature map without increasing the number of parameters or the complexity, and different dilation rates correspond to receptive fields of different sizes. In this method, four dilated convolutions with dilation rates of 2, 4, 8 and 12 are used; through the scale pyramid module, richer multi-scale and detail information is captured, improving the robustness of the model to scale changes.
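A PyTorch sketch of such a scale pyramid module. Because the text describes the four dilated convolutions both as "cascaded" and as "parallel", this parallel-and-concatenate variant is one plausible reading, not the patent's definitive design:

```python
import torch
import torch.nn as nn

class ScalePyramidModule(nn.Module):
    """Four 3x3 dilated convolutions with dilation rates 2, 4, 8, 12.

    padding == dilation keeps the spatial size of the feature map, so the
    receptive field grows without losing resolution; the branch outputs
    are concatenated back to the input channel count.
    """

    def __init__(self, channels: int = 512):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels // 4, 3, padding=r, dilation=r)
            for r in (2, 4, 8, 12)
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return torch.cat([torch.relu(b(f)) for b in self.branches], dim=1)
```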
3. Generation of a density map of a back-end network
In the back-end network, three layers of deformable convolution are adopted to address the arbitrary orientation of targets in remote sensing images, and finally a 1×1 convolutional layer is added to generate the density map.
Compared with standard convolution, deformable convolution adds a learnable offset at each sampling point of the receptive field of the feature map. The effect of learning these offsets is that the convolutional layer can cover the entire target regardless of its shape. A schematic of deformable convolution and visualizations of the position sampling are shown in figs. 4, 5 (a) and 5 (b).
For a standard convolution with a 3×3 kernel and dilation rate 1, the set of regular sampling points is

$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$

and the output feature map at position $p$ is

$$y(p) = \sum_{m=1}^{M} w(p_m) \cdot x(p + p_m)$$

where $w$ denotes the weight parameters, $x$ denotes the input feature map, $p_m$ denotes the $m$-th sampling point, and $M$ is the total number of sampling points. Compared with standard convolution, deformable convolution adds an adaptively learnable offset $\Delta p_m$, obtained through training optimization; for deformable convolution the output feature map is

$$y(p) = \sum_{m=1}^{M} w(p_m) \cdot x(p + p_m + \Delta p_m)$$

Specifically, three layers of deformable convolution with 3×3 kernels are adopted, each followed by a rectified linear unit (ReLU) activation. The dynamic sampling strategy of deformable convolution handles well the arbitrary target orientations caused by the top-down viewing angle of remote sensing images. At the end of the network, a 1×1 convolutional layer is added to generate the density map, and the final target count is obtained by summing all pixels of the density map.
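A sketch of the back end using torchvision's `DeformConv2d`, where a plain convolution predicts the offsets $\Delta p_m$ (two values per sampling point of the 3×3 kernel); the intermediate channel widths are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLayer(nn.Module):
    """3x3 deformable convolution + ReLU; a plain conv predicts the
    offsets delta-p_m (2 values for each of the 9 sampling points)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.relu(self.deform(f, self.offset(f)))

class BackEnd(nn.Module):
    """Three deformable layers, then a 1x1 conv producing the density map;
    the channel widths 256/128/64 are assumptions, not from the patent."""

    def __init__(self, in_ch: int = 512):
        super().__init__()
        self.body = nn.Sequential(
            DeformableLayer(in_ch, 256),
            DeformableLayer(256, 128),
            DeformableLayer(128, 64),
        )
        self.head = nn.Conv2d(64, 1, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(f))
```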
The method counts targets by estimating a density map from the input image, so remote sensing images with manually annotated target-center positions must first be converted into ground-truth density maps before training. When training the whole network, an objective (loss) function is optimized to measure the difference between the density map estimated by the network and the ground-truth density map. Finally, in the testing stage, classical evaluation metrics are adopted to assess the effectiveness of the method. Specifically:

Generation of the ground-truth density map: a target whose center is at pixel position $x_n$ is represented by an impulse function $\delta(x - x_n)$, so an image containing $N$ targets can be expressed as:

$$H(x) = \sum_{n=1}^{N} \delta(x - x_n)$$

To generate the density map $F$, $H(x)$ is convolved with a Gaussian kernel:

$$F(x) = H(x) * G_{\sigma_n}(x) = \sum_{n=1}^{N} \delta(x - x_n) * G_{\sigma_n}(x)$$

where $H(x)$ represents the image containing $N$ targets, $F(x)$ is the ground-truth density map, $G_{\sigma_n}$ is a Gaussian kernel with standard deviation $\sigma_n$, $n$ indexes the $n$-th target, and a fixed kernel $\sigma_n = 15$ is used.
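A minimal sketch of this conversion using SciPy's Gaussian filter, assuming annotations are given as (x, y) center coordinates inside the image:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def truth_density_map(centers, height, width, sigma=15.0):
    """Ground-truth density map: delta impulses at the annotated target
    centers, blurred by a Gaussian kernel with the fixed sigma = 15."""
    h_map = np.zeros((height, width), dtype=np.float32)
    for x, y in centers:                    # (x, y) target-center pixels
        h_map[int(y), int(x)] += 1.0        # H(x) = sum_n delta(x - x_n)
    return gaussian_filter(h_map, sigma)    # F = H * G_sigma; sums to ~N
```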
With respect to the loss function: the Euclidean distance is used as the loss function to measure the difference between the predicted and ground-truth density maps:

$$L(\Theta) = \frac{1}{2B} \sum_{b=1}^{B} \left\| F(X_b; \Theta) - F_b \right\|_2^2$$

where $B$ is the batch size, $X_b$ is the $b$-th input image, $\Theta$ denotes the trainable parameters, and $F(X_b; \Theta)$ and $F_b$ denote the estimated density map and the corresponding ground-truth density map, respectively.
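A one-function sketch of this loss, assuming `pred` and `gt` are batched density-map tensors of identical shape:

```python
import torch

def density_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """L(Theta) = 1/(2B) * sum_b || F(X_b; Theta) - F_b ||_2^2."""
    batch = pred.shape[0]
    return (pred - gt).pow(2).sum() / (2 * batch)
```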
To evaluate the effectiveness of the method of the invention, two metrics are used: the mean absolute error (MAE), which evaluates the accuracy of the model, and the mean squared error (MSE), which evaluates its robustness. The two criteria are defined as:

$$\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| C_t - \hat{C}_t \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( C_t - \hat{C}_t \right)^2}$$

where $T$ is the number of test samples, $t$ indexes the $t$-th image, and $\hat{C}_t$ and $C_t$ denote the estimated and true target counts, respectively. To aid understanding, the technical solution of the invention is described in detail below through a specific example.
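A minimal sketch of both metrics over per-image counts; note that the "MSE" here is the rooted form defined above:

```python
import numpy as np

def mae_mse(est_counts, true_counts):
    """MAE and MSE over T test images; MSE is the rooted form used
    in the counting literature."""
    err = np.asarray(est_counts, float) - np.asarray(true_counts, float)
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())
```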
Example 1
The method provided by the invention is validated on a data set of 3057 images covering four target categories: buildings, small vehicles, large vehicles and ships. Detailed statistics of the data set are given in Table 1.
Table 1. Statistics of the data set used to validate the invention (the table is reproduced as an image in the original publication).
Referring to figs. 6 (a)-6 (c), 7 (a)-7 (c), 8 (a)-8 (c) and 9 (a)-9 (c), the model of the invention is trained end-to-end. The first 10 layers of the network are fine-tuned from the VGG16 network structure, and the parameters of the other convolutional layers are initialized from a Gaussian distribution with standard deviation 0.01. During training, stochastic gradient descent (SGD) is used with the learning rate set to 1e-5. For the building data set, the batch size is 32 and training converges within 400 epochs; for the other three categories, i.e. the ship, small-vehicle and large-vehicle data sets, the batch size is 1 and training likewise runs for 400 epochs.
To augment the training set, 9 image patches are cropped at different positions of each image, each patch being 1/4 the size of the original image: the first 4 patches are non-overlapping quarters and the remaining 5 are cropped at random positions; the patches are then flipped horizontally. Because the images in the ship, small-vehicle and large-vehicle data sets have higher resolution than common data sets, GPU memory is easily exhausted, so these images are resized to 1024 × 768 before data augmentation. The model is implemented in PyTorch and evaluated on an NVIDIA RTX 2080 Ti GPU.
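A sketch of the nine-patch cropping described above; whether the horizontal flips replace or double the patches is not specified, so this version appends flipped copies:

```python
import random
import numpy as np

def nine_patches(img: np.ndarray):
    """Nine quarter-size patches: 4 non-overlapping quarters plus 5 at
    random positions; flipped copies are appended as augmentation."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    patches = [img[:h, :w], img[:h, w:2 * w],
               img[h:2 * h, :w], img[h:2 * h, w:2 * w]]
    for _ in range(5):
        top = random.randint(0, img.shape[0] - h)
        left = random.randint(0, img.shape[1] - w)
        patches.append(img[top:top + h, left:left + w])
    return patches + [p[:, ::-1] for p in patches]  # add horizontal flips
```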
To verify the effectiveness of each module of the model, ablation experiments were conducted on the building data set. The experiments comprise a baseline and three successive additions to it:
● Baseline: CSRNet is adopted as the baseline method (its front-end network takes the VGG16 network structure as the backbone, and its back-end network uses 6 dilated convolutional layers with dilation rate 2);
● Baseline + attention module: the serially connected channel and spatial attention modules are added to the baseline;
● Baseline + attention module + scale pyramid module: the scale pyramid module is added on top of the above;
● Baseline + attention module + scale pyramid module + deformable convolution module: the full method provided by the invention.
The results of the ablation experiments are shown in Table 2; every module of the network contributes to the performance. Specifically, the original baseline performs poorly on the data set; adding the attention module gathers global and local dependency information of the feature map and improves performance to some extent; adding the scale pyramid module improves performance further; finally, with deformable convolution fused in, the proposed model achieves the best performance on the data set.
Table 2. Ablation experiments on the building data set (the table is reproduced as an image in the original publication).
Table 3 compares the method of the invention with other methods, including MCNN, CMTL, CSRNet, SFCN, SANet, SPN and SCAR. The method of the invention achieves the best results on the constructed remote sensing target-counting data set, demonstrating its good generalization ability.
Table 3. Comparison of the method of the invention with other methods (the table is reproduced as images in the original publication).
In the present invention, the terms "first", "second", "third", and "fourth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" means two or more unless explicitly defined otherwise.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A target counting method in remote sensing images based on an attention mechanism, characterized in that the input image is processed in the following three cascaded stages on the basis of the VGG16 network structure:
S1: feature extraction in the front-end network;
for an input image, the first 10 layers of the VGG16 network structure are applied, and then a convolutional block attention module is fused in, i.e., a channel attention module connected in series with a spatial attention module, to encode the correlations between feature-map channels and between pixel positions;
the structural architecture of the channel attention module is as follows: the feature map of any intermediate layer is denoted $F \in \mathbb{R}^{C \times H \times W}$, where $\mathbb{R}$ denotes real space and $C$, $H$ and $W$ denote the channels, height and width of the feature map, respectively; first, a 1×1 convolution is applied to the feature map, and two feature maps $C_1$ and $C_2$ are obtained by reshaping and transposing; next, $C_1$ and $C_2$ are multiplied and a softmax operation is applied to obtain the channel attention map $C_a$ of size $C \times C$, i.e.

$$C_a^{ij} = \frac{\exp\left(C_1^i \cdot C_2^j\right)}{\sum_{i=1}^{C} \exp\left(C_1^i \cdot C_2^j\right)}$$

where $C_a^{ij}$ indicates the effect of the $i$-th channel on the $j$-th channel in the channel attention map, $C_1^i$ is the $i$-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $C_2^j$ is the $j$-th channel of the feature map obtained by applying a 1×1 convolution, reshaping and transposing to size $HW \times C$; finally, the feature map of size $C \times H \times W$ weighted by the channel attention module is computed as:

$$E_j = \lambda \sum_{i=1}^{C} \left(C_a^{ij} C_3^i\right) + F_j$$

where $\lambda$ is a learnable parameter obtained through a 1×1 convolution operation, $E_j$ is the $j$-th channel of the feature map finally weighted by the channel attention module, $C_3^i$ is the $i$-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $F_j$ is the $j$-th channel of the original feature map;
the network architecture of the spatial attention module is as follows: first, three 1×1 convolution operations are performed on the feature map, and two feature maps $S_1$ and $S_2$ are obtained by reshaping and transposing; next, $S_1$ and $S_2$ are multiplied and a softmax operation is applied to obtain the spatial attention map $S_a$ of size $HW \times HW$, i.e.

$$S_a^{kl} = \frac{\exp\left(S_1^k \cdot S_2^l\right)}{\sum_{k=1}^{HW} \exp\left(S_1^k \cdot S_2^l\right)}$$

where $S_a^{kl}$ indicates the effect of the $k$-th position on the $l$-th position, $S_1^k$ is the $k$-th position of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $S_2^l$ is the $l$-th position of the feature map obtained by applying a 1×1 convolution, reshaping and transposing to size $HW \times C$; the feature map of size $C \times H \times W$ weighted by the spatial attention module is computed as:

$$E_l = \mu \sum_{k=1}^{HW} \left(S_a^{kl} S_3^k\right) + F_l$$

where $\mu$ is a learnable parameter obtained through a 1×1 convolution operation, $E_l$ is the $l$-th position of the feature map finally weighted by the spatial attention module, $S_3^k$ is the $k$-th position of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size $C \times HW$, and $F_l$ is the $l$-th position of the original feature map;
S2: multi-scale fusion in the mid-end network;
a scale pyramid module is introduced that combines dilated convolutions with dilation rates of 2, 4, 8 and 12 to capture richer multi-scale and detail information;
S3: density-map generation in the back-end network;
three layers of deformable convolution with 3×3 kernels are adopted, each followed by a rectified linear unit (ReLU) activation, and finally a 1×1 convolutional layer is added to generate the density map;
wherein, for a sampling position $p_m$ in a convolution with a 3×3 kernel and dilation rate 1, the set of regular sampling points is

$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$

and the output feature map at position $p$ is

$$y(p) = \sum_{m=1}^{M} w(p_m) \cdot x(p + p_m)$$

where $w$ denotes the weight parameters, $x$ denotes the input feature map, $m = 1, \ldots, M$, $p_m$ denotes the $m$-th sampling point and $M$ is the total number of sampling points; the deformable convolution adds an adaptively learnable offset $\Delta p_m$, obtained through training optimization, so that the output feature map becomes

$$y(p) = \sum_{m=1}^{M} w(p_m) \cdot x(p + p_m + \Delta p_m);$$

S4: all pixels of the density map from step S3 are summed to obtain the final target count.
CN202010794525.2A 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism Active CN112084868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010794525.2A CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010794525.2A CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Publications (2)

Publication Number Publication Date
CN112084868A (en) 2020-12-15
CN112084868B (en) 2022-12-23

Family

ID=73736164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010794525.2A Active CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Country Status (1)

Country Link
CN (1) CN112084868B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541459A (en) * 2020-12-21 2021-03-23 山东师范大学 Crowd counting method and system based on multi-scale perception attention network
CN112598059A (en) * 2020-12-22 2021-04-02 深圳集智数字科技有限公司 Worker dressing detection method and device, storage medium and electronic equipment
CN112766123B (en) * 2021-01-11 2022-07-22 山东师范大学 Crowd counting method and system based on criss-cross attention network
CN112926480B (en) * 2021-03-05 2023-01-31 山东大学 Multi-scale and multi-orientation-oriented aerial photography object detection method and system
CN113011329B (en) * 2021-03-19 2024-03-12 陕西科技大学 Multi-scale feature pyramid network-based and dense crowd counting method
CN112906662B (en) * 2021-04-02 2022-07-19 海南长光卫星信息技术有限公司 Method, device and equipment for detecting change of remote sensing image and storage medium
CN113283529B (en) * 2021-06-08 2022-09-06 南通大学 Neural network construction method for multi-modal image visibility detection
CN113554156B (en) * 2021-09-22 2022-01-11 中国海洋大学 Multitask image processing method based on attention mechanism and deformable convolution
CN114170188A (en) * 2021-12-09 2022-03-11 同济大学 Target counting method and system for overlook image and storage medium
CN114399728B (en) * 2021-12-17 2023-12-05 燕山大学 Foggy scene crowd counting method
CN115620284B (en) * 2022-12-19 2023-04-18 广东工业大学 Cell apoptosis counting method, system and platform based on convolution attention mechanism
CN116433675B (en) * 2023-06-15 2023-08-15 武汉理工大学三亚科教创新园 Vehicle counting method based on residual information enhancement, electronic device and readable medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232394B (en) * 2018-03-06 2021-08-10 华南理工大学 Multi-scale image semantic segmentation method
CN109241895B (en) * 2018-08-28 2021-06-04 北京航空航天大学 Dense crowd counting method and device
CN110084210B (en) * 2019-04-30 2022-03-29 电子科技大学 SAR image multi-scale ship detection method based on attention pyramid network
CN110188685B (en) * 2019-05-30 2021-01-05 燕山大学 Target counting method and system based on double-attention multi-scale cascade network
CN110674704A (en) * 2019-09-05 2020-01-10 同济大学 Crowd density estimation method and device based on multi-scale expansion convolutional network
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method

Also Published As

Publication number Publication date
CN112084868A (en) 2020-12-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant