CN112084868A - Target counting method in remote sensing image based on attention mechanism - Google Patents


Info

Publication number
CN112084868A
Authority
CN
China
Prior art keywords
convolution
remote sensing
target
network
scale
Prior art date
Legal status
Granted
Application number
CN202010794525.2A
Other languages
Chinese (zh)
Other versions
CN112084868B (en)
Inventor
刘庆杰 (Liu Qingjie)
高广帅 (Gao Guangshuai)
王蕴红 (Wang Yunhong)
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010794525.2A
Publication of CN112084868A
Application granted
Publication of CN112084868B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/176 Urban or other man-made structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/60 Analysis of geometric attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform


Abstract

The invention discloses an attention-based method for counting targets in remote sensing images. On the basis of VGG16, the method integrates an attention mechanism, a scale pyramid, and deformable convolution, and consists of three cascaded stages: feature extraction by a front-end network, scale fusion by a mid-end network, and density map generation by a back-end network. This technical scheme effectively addresses the target scale diversity, the complex cluttered background interference, and the arbitrary target orientation encountered in dense target counting tasks in remote sensing images.

Description

Target counting method in remote sensing image based on attention mechanism
Technical Field
The invention belongs to the technical field of remote sensing images, and particularly relates to a target counting method in a remote sensing image based on an attention mechanism.
Background
In recent decades, driven by the needs of national security and urban planning, estimating the number of targets in complex scenes has received increasing attention. Accordingly, much work on object counting has been done in various fields, such as crowd counting in surveillance video, cell counting under the microscope, animal counting in ecological studies, vehicle counting, and object counting in environmental studies.
Although object counting has advanced greatly in these fields, it has rarely been addressed in remote sensing, apart from a few scenarios such as counting palm or olive trees or counting vehicles in images taken by drones. The main ground objects in remote sensing images, such as buildings and ships, have received little attention, yet counting them has many practical applications, such as urban planning, environmental control, digital city modeling, and disaster response planning.
Compared with object counting in other fields, object counting in remote sensing images presents several challenges: 1) scale diversity: target scales in remote sensing images vary widely; within a single image, sizes can range from a few pixels to thousands of pixels; 2) complex and diverse backgrounds: multiple kinds of ground objects usually coexist in a remote sensing image, and especially when targets are small, their detection and counting are severely hampered by complex, cluttered background interference; 3) arbitrary orientation: unlike objects in natural scene images, such as pedestrians, which are upright, objects in remote sensing images appear in arbitrary orientations because of the downward-looking perspective of satellite-borne or airborne sensors.
The name VGG comes from the Visual Geometry Group of the Department of Engineering Science at Oxford University, which published a series of convolutional network models, from VGG11 to VGG19, applicable to face recognition, image classification, and other tasks. VGG's original purpose in studying convolutional network depth was to find out how depth affects the accuracy of large-scale image classification and recognition; to deepen the network without introducing excessive parameters, VGG uses small 3×3 convolution kernels in all layers, with the convolution stride set to 1. The input to VGG is a 224×224 RGB image; the RGB mean computed over the training set is subtracted from each image, which is then passed through the convolutional network using 3×3 or 1×1 filters with the stride fixed at 1. All VGG variants have 3 fully connected layers; depending on the total number of convolutional and fully connected layers, the variants range from VGG11 (8 convolutional layers plus 3 fully connected layers) to VGG19 (16 convolutional layers plus 3 fully connected layers). Moreover, a pooling layer does not follow every convolutional layer; there are 5 pooling layers in total, distributed after different convolutional stages. VGG16 was originally applied to image classification tasks and, owing to its simplicity and practicality, quickly became the most popular convolutional neural network model of its time; it is now commonly applied to a variety of computer vision tasks.
Disclosure of Invention
To solve the problems of dataset scarcity, target scale diversity, complex cluttered background interference, and arbitrary target orientation in dense target counting tasks in remote sensing images, the invention provides an attention-based method for counting targets in remote sensing images. On the basis of the VGG16 network structure, the method integrates an attention mechanism, a scale pyramid, and deformable convolution, and is accordingly named ASPDNet; it consists of three cascaded stages: feature extraction by the front-end network, scale fusion by the mid-end network, and density map generation by the back-end network. The specific technical scheme of the invention is as follows:
a method for counting targets in remote sensing images based on an attention mechanism, characterized by processing an input image in the following three cascaded stages on the basis of the VGG16 network structure (a code sketch follows these steps):
S1: feature extraction by the front-end network;
for an input image, the first 10 layers of the VGG16 network structure are applied, followed by a convolutional block attention module, i.e., a channel attention module connected in series with a spatial attention module, to encode the correlations between feature map channels and between pixel positions;
S2: multi-scale fusion by the mid-end network;
a scale pyramid module is introduced that combines dilated convolutions with dilation rates of 2, 4, 8, and 12, capturing more multi-scale and detail information;
S3: density map generation by the back-end network;
three deformable convolution layers with 3×3 kernels are applied, each followed by a rectified linear unit (ReLU) activation function, and finally a 1×1 convolution layer is added to generate the density map;
S4: summing all pixels of the density map from step S3 yields the final target count.
The invention has the beneficial effects that:
1. The front-end network takes the first 10 layers of the VGG16 network structure as its backbone and then adds an attention module, which highlights target regions of interest and suppresses complex background regions, thereby effectively handling the complex, cluttered background interference in remote sensing images.
2. A scale pyramid module is introduced at the mid-end of the network to capture multi-scale information corresponding to different receptive fields without increasing the number of parameters, thereby effectively handling scale diversity.
3. Three deformable convolution layers are adopted in the back-end network; the offsets learned in these convolutions allow the sampling grid to cover the target well, thereby effectively handling the arbitrary orientation of targets in remote sensing images.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below, so that the features and advantages of the present invention can be understood more clearly. The drawings are schematic and should not be construed as limiting the present invention in any way; for a person skilled in the art, other drawings can be obtained from them without inventive effort. In the drawings:
FIG. 1 is a flow chart of the network architecture of the present invention;
FIG. 2(a) is a structural diagram of the channel attention module;
FIG. 2(b) is a structural diagram of the spatial attention module;
FIG. 3 is a schematic diagram of a scale pyramid module;
FIG. 4 is a schematic diagram of a deformable convolution;
FIG. 5(a) is a visualization of position sampling in standard convolution;
FIG. 5(b) is a visualization of position sampling in deformable convolution;
FIG. 6(a) is a building picture;
FIG. 6(b) is a true density map and count results for a building;
FIG. 6(c) is a graph of building density and count results obtained by the method of the present invention;
FIG. 7(a) is a picture of small vehicles;
FIG. 7(b) is the true density map and count result for the small vehicles;
FIG. 7(c) is the density map and count result for the small vehicles obtained by the method of the present invention;
fig. 8(a) is a picture of a large vehicle;
FIG. 8(b) is the true density map and count result for the large vehicle;
FIG. 8(c) is a density map and count results for a large vehicle obtained by the method of the present invention;
FIG. 9(a) is a photograph of a ship;
FIG. 9(b) is a true density map and count results for a ship;
fig. 9(c) is a density map and a count result of the ship obtained by the method of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
The invention aims to accurately estimate the number of dense targets in remote sensing images, such as densely packed adjoining houses, ships docked at a port, or small vehicles and large trucks in a parking lot.
To solve this technical problem, the invention constructs a convolutional neural network framework for target counting based on density estimation. On the basis of the VGG16 network structure, it integrates an attention mechanism (a channel attention module connected in series with a spatial attention module), a scale pyramid module, a deformable convolution module, and related techniques. Specifically, it consists of three cascaded stages: feature extraction by the front-end network, scale fusion by the mid-end network, and density map generation by the back-end network; finally, all pixels in the density map are summed to obtain the number of targets in the remote sensing image. The network flow diagram is shown in fig. 1.
The front-end network takes the first 10 layers of the VGG16 network structure as its backbone and adds an attention module that fully accounts for the correlations between feature map channels and between pixel positions, extracting rich semantic and contextual information. This highlights the target regions of interest and suppresses complex background regions, effectively handling the complex, cluttered background interference in remote sensing images.
Because three max-pooling layers are applied in the network, the image resolution is reduced to 1/64 of the original (1/8 in each dimension). To enlarge the receptive field of the feature map, a Scale Pyramid Module (SPM) is introduced at the mid-end of the network: four parallel dilated convolutions with different dilation rates whose outputs are combined. The SPM captures multi-scale information corresponding to different receptive fields without increasing the number of parameters, thereby effectively handling scale diversity.
A three-layer deformable convolution is adopted in the back-end network. The deformable convolution operation adds learnable offsets to the original standard convolution; thanks to this adaptive position-sampling technique, the learned offsets allow the sampling grid to cover the target well, effectively handling the arbitrary orientation of targets in remote sensing images. A 1×1 convolution layer is used in the last layer of the network to generate a density map, and finally all pixels of the density map are summed to obtain the number of targets. In particular:
1. Feature extraction in the front-end network
Given a remote sensing image of arbitrary size, the VGG16 network structure serves as the backbone, and the first 10 layers of VGG16 are applied. A convolutional block attention module is then added, i.e., a Channel Attention Module (CAM) connected in series with a Spatial Attention Module (SAM), to encode the correlations between feature map channels and between pixel positions and thus gather more salient feature information, highlighting the targets and suppressing the complex, cluttered background.
Channel attention module: in dense scenes, the texture of foreground targets is very similar to that of the background, which makes counting difficult; incorporating a channel attention module, whose architecture is shown in fig. 2(a), alleviates this problem. Specifically, let the feature map of any intermediate layer be $F \in \mathbb{R}^{C \times H \times W}$, where $\mathbb{R}$ denotes real space and C, H, and W denote the number of channels, the height, and the width of the feature map, respectively. First, a 1×1 convolution is applied to F, and two feature maps $C_1$ and $C_2$ are obtained by reshaping and transposing; next, $C_1$ and $C_2$ are multiplied and a softmax (normalized exponential) operation is applied to obtain a channel attention map $C_a$ of size C×C. This process is expressed as:

$$C_a^{ij} = \frac{\exp\left(C_1^i \cdot C_2^j\right)}{\sum_{i=1}^{C} \exp\left(C_1^i \cdot C_2^j\right)}$$

where $C_a^{ij}$ indicates the influence of the i-th channel on the j-th channel in the channel attention map, $C_1^i$ is the i-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size C×HW, and $C_2^j$ is the j-th channel of the feature map obtained by applying a 1×1 convolution to the original feature map, then reshaping and transposing it to size HW×C. Finally, the feature map of size C×H×W weighted by the channel attention module is computed as:

$$F_j^{ca} = \lambda \sum_{i=1}^{C} C_a^{ij}\, C_1^i + F_j$$

where λ is a learnable parameter, which can be learned through a 1×1 convolution operation, $F_j^{ca}$ is the j-th channel of the feature map finally weighted by the channel attention module, $C_1^i$ is as defined above, and $F_j$ is the j-th channel of the original feature map.
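A minimal PyTorch sketch of this channel attention computation, following the formulas above; the module layout is an assumption, since only the single 1×1 convolution and the learnable λ are stated in the text:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: a C x C affinity map reweights the channels of the input."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # the single 1x1 conv
        self.lam = nn.Parameter(torch.zeros(1))  # learnable lambda, initialised to 0

    def forward(self, F):
        B, C, H, W = F.shape
        c1 = self.proj(F).view(B, C, H * W)   # C1: C x HW
        c2 = c1.permute(0, 2, 1)              # C2: HW x C (reshape + transpose)
        # C_a[i, j] = C1^i . C2^j; softmax over i, the summation index
        attn = torch.softmax(torch.bmm(c1, c2), dim=1)  # C x C attention map
        # F_ca^j = lambda * sum_i C_a[i, j] * C1^i + F^j
        out = torch.bmm(attn.permute(0, 2, 1), c1).view(B, C, H, W)
        return self.lam * out + F
```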
Spatial attention module: considering that the feature map has different densities at different pixel positions, long-range dependencies in the spatial dimension are further encoded, so that feature information at spatial positions is captured effectively. The spatial attention module is similar to the channel attention module above; its network architecture is shown in fig. 2(b). The two differ in that: 1) the spatial attention module needs three 1×1 convolution layers, whereas the channel attention module has only one; 2) the channel attention map $C_a$ has size C×C, whereas the spatial attention map $S_a$ has size HW×HW. Specifically,

$$S_a^{kl} = \frac{\exp\left(S_1^k \cdot S_2^l\right)}{\sum_{k=1}^{HW} \exp\left(S_1^k \cdot S_2^l\right)}$$

where $S_a^{kl}$ indicates the influence of the k-th position on the l-th position, $S_1^k$ is the k-th position of the feature map obtained by applying a 1×1 convolution to the original feature map and reshaping it to size C×HW, and $S_2^l$ is the l-th position of the feature map obtained by applying a 1×1 convolution to the original feature map, then reshaping and transposing it to size HW×C. Finally, the feature map of size C×H×W weighted by the spatial attention module is computed as:

$$F_l^{sa} = \mu \sum_{k=1}^{HW} S_a^{kl}\, S_1^k + F_l$$

where μ is a learnable parameter, which can be learned through a 1×1 convolution operation, $F_l^{sa}$ is the l-th position of the feature map finally weighted by the spatial attention module, $S_1^k$ is as defined above, and $F_l$ is the l-th position of the original feature map.
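A corresponding PyTorch sketch of the spatial attention module; the text mentions three 1×1 convolutions, so the third one is assumed here to act as a value projection:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: an HW x HW position-affinity map reweights positions."""
    def __init__(self, channels):
        super().__init__()
        self.proj_s1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_s2 = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_val = nn.Conv2d(channels, channels, kernel_size=1)  # assumed value path
        self.mu = nn.Parameter(torch.zeros(1))  # learnable mu, initialised to 0

    def forward(self, F):
        B, C, H, W = F.shape
        s1 = self.proj_s1(F).view(B, C, H * W)  # S1: C x HW, position k = column k
        s2 = self.proj_s2(F).view(B, C, H * W)  # S2 before transpose: C x HW
        # energy[k, l] = S1^k . S2^l; softmax over k, the summation index
        attn = torch.softmax(torch.bmm(s1.permute(0, 2, 1), s2), dim=1)  # HW x HW
        v = self.proj_val(F).view(B, C, H * W)
        out = torch.bmm(v, attn)  # out[:, :, l] = sum_k attn[k, l] * v[:, :, k]
        return self.mu * out.view(B, C, H, W) + F
```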
2. Multi-scale fusion in the mid-end network
Because the front-end network contains three pooling operations, the output feature map is 1/64 of the original size (1/8 in each dimension). To enlarge the receptive field of the feature map while keeping its resolution unchanged, a scale pyramid module is introduced, as shown in fig. 3: several dilated convolutions with different dilation rates operate in parallel and their outputs are combined. Dilated convolution enlarges the receptive field of the feature map without increasing the number of parameters or the complexity, and different dilation rates correspond to receptive fields of different sizes. In this method, the number of dilated convolutions is set to 4, with dilation rates of 2, 4, 8, and 12 respectively; through the scale pyramid module, more multi-scale and detail information is captured, improving the robustness of the model to scale changes.
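A minimal sketch of such a scale pyramid module in PyTorch. The text fixes the dilation rates (2, 4, 8, 12) but not the fusion rule; summing the four parallel branches is assumed here because it keeps the channel count, and hence the parameter count of the following layers, unchanged:

```python
import torch.nn as nn

class ScalePyramidModule(nn.Module):
    """Four parallel 3x3 dilated convolutions; padding = dilation keeps H x W fixed."""
    def __init__(self, channels, rates=(2, 4, 8, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x):
        # Each branch sees a different receptive field; fuse by summation (assumed).
        return sum(branch(x) for branch in self.branches)
```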
3. Density map generation in the back-end network
In the back-end network, three deformable convolution layers are adopted to handle the arbitrary orientation of targets in remote sensing images, and finally a 1×1 convolution layer is added to generate the density map.
Compared with standard convolution, the deformable convolution operation adds a learnable offset to each sampling point of the feature map's receptive field. Learning these offsets allows the convolution layer to cover the whole target regardless of changes in the target's shape. Schematic diagrams of deformable convolution and its position sampling are shown in figs. 4, 5(a), and 5(b).
For a standard convolution with a 3×3 kernel and dilation rate 1, the sampling positions $p_m$ form the regular grid

$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\},$$

and the output feature map at position p is

$$y(p) = \sum_{m=1}^{M} w(p_m)\, x(p + p_m)$$

where w denotes the weight parameters, x denotes the input feature map, $p_m$ is the m-th sampling point, and M is the total number of sampling points. Compared with standard convolution, deformable convolution adds an adaptively learnable offset $\Delta p_m$, obtained through training optimization, on top of the standard convolution; for deformable convolution, the output feature map is

$$y(p) = \sum_{m=1}^{M} w(p_m)\, x(p + p_m + \Delta p_m).$$
Specifically, three layers of deformable convolution with 3×3 kernels are adopted, each followed by a rectified linear unit (ReLU) activation function. Through the dynamic sampling strategy of deformable convolution, the arbitrary target orientations caused by the overhead viewing angle of remote sensing images are handled well. At the end of the network, a 1×1 convolution layer is added to generate the density map, and the final target count is obtained by summing all pixels of the density map.
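A sketch of the back end using torchvision's DeformConv2d. The three 3×3 deformable layers, the ReLU after each, and the final 1×1 layer follow the text; the offset-prediction convolutions and the channel widths (512 to 256 to 128 to 64) are assumptions:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBackend(nn.Module):
    """Three 3x3 deformable conv layers, each followed by ReLU."""
    def __init__(self, in_channels=512, widths=(256, 128, 64)):
        super().__init__()
        self.offset_convs = nn.ModuleList()
        self.deform_convs = nn.ModuleList()
        c_in = in_channels
        for c_out in widths:
            # 18 = 2 offsets (dx, dy) for each of the 9 sampling points of a 3x3 kernel
            self.offset_convs.append(nn.Conv2d(c_in, 18, kernel_size=3, padding=1))
            self.deform_convs.append(DeformConv2d(c_in, c_out, kernel_size=3, padding=1))
            c_in = c_out
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        for offset_conv, deform in zip(self.offset_convs, self.deform_convs):
            x = self.relu(deform(x, offset_conv(x)))  # samples x(p + p_m + delta p_m)
        return x
```

A 1×1 convolution (the density_head in the earlier skeleton) then maps the 64-channel output to the single-channel density map.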
The method counts targets by estimating a density map from the input image, so remote sensing images with manually annotated target center positions must first be converted into ground-truth density maps before training. When training the whole network, an objective (loss) function is optimized to measure the difference between the density map estimated by the network and the ground-truth density map. Finally, in the testing stage, classical evaluation metrics are adopted to assess the effectiveness of the method of the invention. In particular:
for generation of the truth density map: assume a pixel position of xn(target center coordinates) of a target, capable of being operated with a pulse function (x-x)n) For an image containing N objects, this can be expressed as:
Figure BDA0002625051970000074
to generate the density map F, H (x) is convolved with a Gaussian kernel, i.e.
Figure BDA0002625051970000075
Wherein H (x) is a function representing an image containing N targets, F (x) is a truth density map function,
Figure BDA0002625051970000076
is that the variance is sigmanN denotes the nth target, σnDenotes the standard deviation, sets the fixed kernel σn=15。
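A sketch of the ground-truth density map generation with the fixed kernel $\sigma_n = 15$, using NumPy and SciPy; function and argument names are illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ground_truth_density(centers, height, width, sigma=15.0):
    """H(x): unit impulses at annotated target centres; F(x) = H * G_sigma."""
    h = np.zeros((height, width), dtype=np.float32)
    for x, y in centers:  # (x, y) = manually annotated centre of one target
        h[int(round(y)), int(round(x))] = 1.0
    # Convolving with the Gaussian spreads each impulse; the map still sums to ~N.
    return gaussian_filter(h, sigma=sigma)
```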
Regarding the loss function: the Euclidean distance is used as the loss function to measure the difference between the predicted and ground-truth density maps:

$$L(\Theta) = \frac{1}{2B} \sum_{b=1}^{B} \left\| F(X_b;\Theta) - F_b^{GT} \right\|_2^2$$

where B denotes the batch size, $X_b$ denotes the b-th input image, Θ denotes the trainable parameters, and $F(X_b;\Theta)$ and $F_b^{GT}$ denote the estimated density map and the corresponding ground-truth density map, respectively.
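As a sketch, the loss above translates directly into a few lines of PyTorch (pred and gt are batches of estimated and ground-truth density maps):

```python
import torch

def density_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """L(Theta) = 1/(2B) * sum_b || F(X_b; Theta) - F_b^GT ||_2^2."""
    batch_size = pred.size(0)
    return ((pred - gt) ** 2).sum() / (2 * batch_size)
```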
To evaluate the effectiveness of the method of the invention, two evaluation metrics are used: Mean Absolute Error (MAE), which evaluates the accuracy of the model, and Mean Squared Error (MSE), which evaluates its robustness. The two criteria are defined as:

$$MAE = \frac{1}{T}\sum_{t=1}^{T}\left|\hat{y}_t - y_t\right|, \qquad MSE = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t - y_t\right)^2}$$

where T denotes the number of test samples, t indexes the t-th image, and $\hat{y}_t$ and $y_t$ denote the estimated and true target counts, respectively. To facilitate understanding of the above technical scheme, it is described in detail below through specific embodiments.
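Before turning to the examples, a sketch of the two metrics in NumPy; estimated and true are the per-image predicted and ground-truth counts over the T test images (the rooted form of MSE follows the formula above):

```python
import numpy as np

def mae_mse(estimated, true):
    """MAE measures accuracy; MSE (rooted form) measures robustness."""
    est = np.asarray(estimated, dtype=np.float64)
    gt = np.asarray(true, dtype=np.float64)
    mae = np.abs(est - gt).mean()
    mse = np.sqrt(((est - gt) ** 2).mean())
    return mae, mse
```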
Example 1
The method provided by the invention is validated on a dataset comprising 3057 images and 4 target categories: buildings, small vehicles, large vehicles, and ships. Detailed statistics of the dataset are given in Table 1.
Table 1 data set information statistics used to validate the invention
[Table 1 is reproduced as an image in the original document.]
Referring to figs. 6(a)-6(c), 7(a)-7(c), 8(a)-8(c), and 9(a)-9(c), the model of the invention is trained end-to-end: the first 10 layers of the network are fine-tuned from the VGG16 network structure, and the parameters of the other convolutional layers are initialized from a Gaussian distribution with standard deviation 0.01. During training, stochastic gradient descent (SGD) is used with the learning rate set to 1e-5. For the building dataset, the batch size is 32 and training converges within 400 epochs; for the other three categories, i.e. the ship, small-vehicle, and large-vehicle datasets, the batch size is 1 and training likewise runs for 400 epochs.
To augment the training set, 9 image patches are cropped from each image at different positions, each patch being 1/4 the size of the original image: the first 4 patches are non-overlapping, and the remaining 5 are cropped at random positions; the patches are then flipped horizontally. Because the resolution of the images in the ship, small-vehicle, and large-vehicle datasets is higher than that of other conventional datasets, GPU memory is easily exhausted; these images are therefore resized to 1024×768 before data augmentation. The model is implemented in PyTorch and tested on an NVIDIA GTX 2080Ti GPU.
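A sketch of the augmentation described above, using PIL. The crop layout follows the text (four non-overlapping quarter-size patches plus five random ones, each then flipped horizontally), while the function name is illustrative; the corresponding density-map patches must of course be cropped and flipped identically (omitted here):

```python
import random
from PIL import Image

def augment(img):
    """Yield 9 quarter-size crops of a PIL image and their horizontal flips."""
    w, h = img.size
    cw, ch = w // 2, h // 2  # each patch is 1/4 of the original area
    offsets = [(0, 0), (cw, 0), (0, ch), (cw, ch)]  # 4 non-overlapping quadrants
    offsets += [(random.randint(0, w - cw), random.randint(0, h - ch))
                for _ in range(5)]                   # 5 random positions
    for x, y in offsets:
        patch = img.crop((x, y, x + cw, y + ch))
        yield patch
        yield patch.transpose(Image.FLIP_LEFT_RIGHT)
```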
To verify the effectiveness of each module of the model, ablation experiments were conducted on the building dataset. The experiments start from a benchmark and add the three modules to it in succession:
● Benchmark: CSRNet is adopted as the reference method (its front-end network uses the VGG16 network structure as the backbone, and its back-end network uses 6 convolution layers with dilation rate 2);
● Benchmark + attention module: on top of the reference method, the module connecting the channel attention mechanism and the spatial attention mechanism is added;
● Benchmark + attention module + scale pyramid module: the scale pyramid module is added on top of the above;
● Benchmark + attention module + scale pyramid module + deformable convolution module: the method provided by the invention.
The results of the ablation experiments are shown in Table 2; every module in the network of the invention contributes to performance. Specifically, the original benchmark method performs unsatisfactorily on the dataset; after the attention module is added, global and local dependency information of the feature map is gathered and performance improves to a certain extent; after the scale pyramid module is added, performance improves further; finally, after the deformable convolution is incorporated, the proposed model shows the best performance on the dataset.
TABLE 2 ablation experiments on building data sets
[Table 2 is reproduced as an image in the original document.]
Table 3 compares the method of the invention with other methods, including MCNN, CMTL, CSRNet, SFCN, SANet, SPN, and SCAR. The method of the invention achieves the best results on the constructed remote sensing target counting dataset, which also demonstrates its good generalization ability.
TABLE 3 comparison of the Process of the invention with other Processes
[Table 3 is reproduced as an image in the original document.]
In the present invention, the terms "first", "second", "third", and "fourth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" means two or more unless expressly limited otherwise.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A method for counting targets in remote sensing images based on an attention mechanism, characterized by processing an input image in the following three cascaded stages on the basis of the VGG16 network structure:
S1: feature extraction by the front-end network;
for an input image, the first 10 layers of the VGG16 network structure are applied, followed by a convolutional block attention module, i.e., a channel attention module connected in series with a spatial attention module, to encode the correlations between feature map channels and between pixel positions;
S2: multi-scale fusion by the mid-end network;
a scale pyramid module is introduced that combines dilated convolutions with dilation rates of 2, 4, 8, and 12, capturing more multi-scale and detail information;
S3: density map generation by the back-end network;
three deformable convolution layers with 3×3 kernels are applied, each followed by a rectified linear unit (ReLU) activation function, and finally a 1×1 convolution layer is added to generate the density map;
S4: summing all pixels of the density map from step S3 yields the final target count.
CN202010794525.2A 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism Active CN112084868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010794525.2A CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010794525.2A CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Publications (2)

Publication Number Publication Date
CN112084868A true CN112084868A (en) 2020-12-15
CN112084868B CN112084868B (en) 2022-12-23

Family

ID=73736164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010794525.2A Active CN112084868B (en) 2020-08-10 2020-08-10 Target counting method in remote sensing image based on attention mechanism

Country Status (1)

Country Link
CN (1) CN112084868B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541459A (en) * 2020-12-21 2021-03-23 山东师范大学 Crowd counting method and system based on multi-scale perception attention network
CN112598059A (en) * 2020-12-22 2021-04-02 深圳集智数字科技有限公司 Worker dressing detection method and device, storage medium and electronic equipment
CN112766123A (en) * 2021-01-11 2021-05-07 山东师范大学 Crowd counting method and system based on criss-cross attention network
CN112906662A (en) * 2021-04-02 2021-06-04 海南长光卫星信息技术有限公司 Method, device and equipment for detecting change of remote sensing image and storage medium
CN112926480A (en) * 2021-03-05 2021-06-08 山东大学 Multi-scale and multi-orientation-oriented aerial object detection method and system
CN113011329A (en) * 2021-03-19 2021-06-22 陕西科技大学 Pyramid network based on multi-scale features and dense crowd counting method
CN113283529A (en) * 2021-06-08 2021-08-20 南通大学 Neural network construction method for multi-modal image visibility detection
CN113554156A (en) * 2021-09-22 2021-10-26 中国海洋大学 Multi-task learning model construction method based on attention mechanism and deformable convolution
CN114022742A (en) * 2021-10-22 2022-02-08 中国科学院长春光学精密机械与物理研究所 Infrared and visible light image fusion method and device and computer storage medium
CN114170188A (en) * 2021-12-09 2022-03-11 同济大学 Target counting method and system for overlook image and storage medium
CN114187275A (en) * 2021-12-13 2022-03-15 贵州大学 Multi-stage and multi-scale attention fusion network and image rain removing method
CN114399728A (en) * 2021-12-17 2022-04-26 燕山大学 Method for counting crowds in foggy day scene
CN115620284A (en) * 2022-12-19 2023-01-17 广东工业大学 Cell apoptosis counting method, system and platform based on convolution attention mechanism
CN116433675A (en) * 2023-06-15 2023-07-14 武汉理工大学三亚科教创新园 Vehicle counting method based on residual information enhancement, electronic device and readable medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084210A (en) * 2019-04-30 2019-08-02 电子科技大学 The multiple dimensioned Ship Detection of SAR image based on attention pyramid network
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
CN110674704A (en) * 2019-09-05 2020-01-10 同济大学 Crowd density estimation method and device based on multi-scale expansion convolutional network
US20200074186A1 (en) * 2018-08-28 2020-03-05 Beihang University Dense crowd counting method and apparatus
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
US20200074186A1 (en) * 2018-08-28 2020-03-05 Beihang University Dense crowd counting method and apparatus
CN110084210A (en) * 2019-04-30 2019-08-02 电子科技大学 The multiple dimensioned Ship Detection of SAR image based on attention pyramid network
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110674704A (en) * 2019-09-05 2020-01-10 同济大学 Crowd density estimation method and device based on multi-scale expansion convolutional network
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541459A (en) * 2020-12-21 2021-03-23 山东师范大学 Crowd counting method and system based on multi-scale perception attention network
CN112598059A (en) * 2020-12-22 2021-04-02 深圳集智数字科技有限公司 Worker dressing detection method and device, storage medium and electronic equipment
CN112766123B (en) * 2021-01-11 2022-07-22 山东师范大学 Crowd counting method and system based on criss-cross attention network
CN112766123A (en) * 2021-01-11 2021-05-07 山东师范大学 Crowd counting method and system based on criss-cross attention network
CN112926480B (en) * 2021-03-05 2023-01-31 山东大学 Multi-scale and multi-orientation-oriented aerial photography object detection method and system
CN112926480A (en) * 2021-03-05 2021-06-08 山东大学 Multi-scale and multi-orientation-oriented aerial object detection method and system
CN113011329A (en) * 2021-03-19 2021-06-22 陕西科技大学 Pyramid network based on multi-scale features and dense crowd counting method
CN113011329B (en) * 2021-03-19 2024-03-12 陕西科技大学 Multi-scale feature pyramid network-based and dense crowd counting method
CN112906662A (en) * 2021-04-02 2021-06-04 海南长光卫星信息技术有限公司 Method, device and equipment for detecting change of remote sensing image and storage medium
CN112906662B (en) * 2021-04-02 2022-07-19 海南长光卫星信息技术有限公司 Method, device and equipment for detecting change of remote sensing image and storage medium
CN113283529A (en) * 2021-06-08 2021-08-20 南通大学 Neural network construction method for multi-modal image visibility detection
CN113554156A (en) * 2021-09-22 2021-10-26 中国海洋大学 Multi-task learning model construction method based on attention mechanism and deformable convolution
CN114022742A (en) * 2021-10-22 2022-02-08 中国科学院长春光学精密机械与物理研究所 Infrared and visible light image fusion method and device and computer storage medium
CN114022742B (en) * 2021-10-22 2024-05-17 中国科学院长春光学精密机械与物理研究所 Infrared and visible light image fusion method and device and computer storage medium
CN114170188A (en) * 2021-12-09 2022-03-11 同济大学 Target counting method and system for overlook image and storage medium
CN114187275A (en) * 2021-12-13 2022-03-15 贵州大学 Multi-stage and multi-scale attention fusion network and image rain removing method
CN114399728B (en) * 2021-12-17 2023-12-05 燕山大学 Foggy scene crowd counting method
CN114399728A (en) * 2021-12-17 2022-04-26 燕山大学 Method for counting crowds in foggy day scene
CN115620284A (en) * 2022-12-19 2023-01-17 广东工业大学 Cell apoptosis counting method, system and platform based on convolution attention mechanism
CN116433675A (en) * 2023-06-15 2023-07-14 武汉理工大学三亚科教创新园 Vehicle counting method based on residual information enhancement, electronic device and readable medium
CN116433675B (en) * 2023-06-15 2023-08-15 武汉理工大学三亚科教创新园 Vehicle counting method based on residual information enhancement, electronic device and readable medium

Also Published As

Publication number Publication date
CN112084868B (en) 2022-12-23

Similar Documents

Publication Publication Date Title
CN112084868B (en) Target counting method in remote sensing image based on attention mechanism
CN111539370B (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN111563508B (en) Semantic segmentation method based on spatial information fusion
CN111639692B (en) Shadow detection method based on attention mechanism
CN107506740B (en) Human body behavior identification method based on three-dimensional convolutional neural network and transfer learning model
CN109165682B (en) Remote sensing image scene classification method integrating depth features and saliency features
CN104408742B (en) A kind of moving target detecting method based on space time frequency spectrum Conjoint Analysis
CN112288627B (en) Recognition-oriented low-resolution face image super-resolution method
Ablavatski et al. Enriched deep recurrent visual attention model for multiple object recognition
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN111126385A (en) Deep learning intelligent identification method for deformable living body small target
CN111582091B (en) Pedestrian recognition method based on multi-branch convolutional neural network
CN114627447A (en) Road vehicle tracking method and system based on attention mechanism and multi-target tracking
CN114005078B (en) Vehicle weight identification method based on double-relation attention mechanism
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN110837786A (en) Density map generation method and device based on spatial channel, electronic terminal and medium
CN113011308A (en) Pedestrian detection method introducing attention mechanism
CN114037640A (en) Image generation method and device
CN112785636A (en) Multi-scale enhanced monocular depth estimation method
CN115908772A (en) Target detection method and system based on Transformer and fusion attention mechanism
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN111062310B (en) Few-sample unmanned aerial vehicle image identification method based on virtual sample generation
CN112766378A (en) Cross-domain small sample image classification model method focusing on fine-grained identification

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant