CN112784685A

CN112784685A - Crowd counting method and system based on multi-scale guiding attention mechanism network

Info

Publication number: CN112784685A
Application number: CN202011580568.7A
Authority: CN
Inventors: 吕蕾; 顾玲玉; 谢锦阳
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2021-05-11
Anticipated expiration: 2040-12-28
Also published as: CN112784685B

Abstract

The disclosure provides a crowd counting method and system based on a multi-scale guiding attention mechanism network. The method comprises: acquiring image data to be identified; performing multi-scale feature extraction on the acquired image data to obtain a plurality of feature maps, and fusing all the feature maps to obtain a multi-scale fusion feature map; inputting the feature map of each scale together with the multi-scale fusion feature map into a preset guiding attention mechanism model to obtain attention feature maps at different scales; and fusing the attention feature maps of all scales, performing density regression on the fused feature map to obtain a crowd density map, and obtaining the crowd count from the crowd density map. By adopting a multi-scale guiding attention mechanism, the disclosure captures richer multi-scale contextual feature information, integrates local features with their corresponding global dependencies, adaptively highlights important channel information, and thereby improves the accuracy of crowd counting.

Description

Crowd counting method and system based on multi-scale guiding attention mechanism network

Technical Field

The disclosure relates to the technical field of computer vision image processing, in particular to a crowd counting method and system based on a multi-scale guiding attention mechanism network.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

As technology advances, people's quality of life gradually improves, and people increasingly take part in large-scale events. In such scenes, safety hazards such as crowding and stampedes pose a serious threat to life and property, so security measures in places with high crowd density are key to public safety. Research on crowd counting has therefore become increasingly active: if the crowd density of the current scene can be estimated accurately and rapid changes in the crowd can be detected in time, public transport scheduling can be optimized and corresponding security measures can be arranged, effectively reducing or avoiding such incidents.

In recent years, crowd counting based on computer vision has made great progress. The purpose of crowd counting is to predict the number of people present in an image. Algorithms developed for crowd counting have a variety of applications, such as video and traffic monitoring, agricultural monitoring (plant counting), cell counting, scene understanding, city planning and environmental surveys. The field of computer vision has approached this task in various ways: early work counted people from the output of body or head detectors, or learned a mapping from global or local image features to predicted counts. However, these methods are only suitable for relatively sparse crowds. In crowded scenes, crowd counting remains challenging because of scale variation, occlusion, changing viewing angles, background clutter and similar problems.

The inventors have found that some current methods based on convolutional neural networks (CNNs) attempt to solve these problems, with varying degrees of success. Although CNNs have advanced crowd counting, these models still have drawbacks. First, multi-scale methods introduce information redundancy, because similar low-level features are extracted repeatedly at multiple scales. Techniques such as pyramid pooling and atrous (dilated) convolution pyramids may help capture objects at different scales, but they treat the contextual dependencies of all image regions homogeneously and non-adaptively, ignoring the dependency between local feature representations and contextual information. Second, long-range feature dependencies cannot be extracted efficiently, so the crowd cannot be counted accurately.

Disclosure of Invention

To overcome the defects of the prior art, the disclosure provides a crowd counting method and system based on a multi-scale guiding attention mechanism network. The multi-scale guiding attention mechanism captures richer multi-scale contextual feature information, which overcomes the limitations of conventional convolutional neural network structures, integrates local features with their corresponding global dependencies, and adaptively highlights important channel information. In addition, the additional losses between different modules guide the attention mechanism to ignore irrelevant information and to focus on the crowd regions of the image by emphasizing relevant feature associations, which improves the accuracy of crowd counting.

To achieve the above purpose, the present disclosure adopts the following technical solutions:

A first aspect of the present disclosure provides a crowd counting method based on a multi-scale guiding attention mechanism network.

A crowd counting method based on a multi-scale attention-guiding mechanism network comprises the following steps:

acquiring image data to be identified;

performing multi-scale feature extraction on the acquired image data to obtain a plurality of feature maps, and fusing all the feature maps to obtain a multi-scale fusion feature map;

inputting the acquired feature map of each scale and the multi-scale fusion feature map into a preset attention guiding mechanism model to obtain attention feature maps under different scales;

and fusing the attention feature maps of all scales, performing density regression on the fused feature map to obtain a crowd density map, and obtaining the crowd count from the crowd density map.

In some possible implementations, in the guiding attention mechanism model, weighted attention feature maps at different scales are obtained from the spatial attention and the channel attention.

In some possible implementations, different loss functions are set so that the guiding attention mechanism model adjusts by itself, during training, the feature information that needs attention.

As a further limitation, according to the obtained feature maps and the multi-scale fusion feature map, and in combination with the encoder-decoder and the attention mechanism module of the guiding attention mechanism model, first attention loss functions at each scale are obtained, and the first attention loss functions at all scales are added to obtain a combined guiding loss.

As a further limitation, the output of the encoder-decoder is guided to be consistent or nearly consistent with its input features, the reconstructed feature map and the input feature map are combined to obtain second attention loss functions at each scale, and the second attention loss functions at all scales are added to obtain a combined reconstruction loss.

In some possible implementations, a concatenation operation is performed on the obtained feature maps, followed by a convolution operation, to generate the multi-scale fusion feature map.

In some possible implementations, the pixel values of the crowd density map are summed to obtain the final crowd count value.

A second aspect of the present disclosure provides a crowd counting system based on a multi-scale guided attention mechanism network.

A crowd counting system based on a multi-scale guiding attention mechanism network, comprising:

an image acquisition module configured to: acquire image data to be identified;

a multi-scale feature extraction module configured to: perform multi-scale feature extraction on the acquired image data to obtain a plurality of feature maps, and fuse all the feature maps to obtain a multi-scale fusion feature map;

a guiding attention mechanism module configured to: input the feature map of each scale and the multi-scale fusion feature map into a preset guiding attention mechanism model to obtain attention feature maps at different scales;

a crowd counting module configured to: fuse the attention feature maps of all scales, perform density regression on the fused feature map to obtain a crowd density map, and obtain the crowd count from the crowd density map.

A third aspect of the present disclosure provides a computer-readable storage medium having stored thereon a program which, when executed by a processor, performs the steps of the crowd counting method based on a multi-scale guiding attention mechanism network according to the first aspect of the present disclosure.

A fourth aspect of the present disclosure provides an electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the crowd counting method based on a multi-scale guiding attention mechanism network according to the first aspect of the present disclosure.

Compared with the prior art, the beneficial effects of this disclosure are:

1. The method, system, medium, or electronic device described in this disclosure employs a multi-scale guiding attention mechanism to capture richer multi-scale contextual feature information, thereby overcoming the limitations of existing convolutional neural network structures, integrating local features with their corresponding global dependencies, and adaptively highlighting important channel information.

2. In the method, system, medium, or electronic device described in this disclosure, the additional losses between different modules guide the attention mechanism to ignore irrelevant information and to focus on the crowd regions of the image by emphasizing relevant feature associations, which improves the accuracy of crowd counting.

Advantages of additional aspects of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and, together with the description, serve to explain the disclosure; they are not intended to limit the disclosure.

Fig. 1 is a schematic flowchart of a crowd counting method based on a multi-scale guiding attention mechanism network according to embodiment 1 of the present disclosure.

Fig. 2 is a schematic diagram of a counting method based on a multi-scale guiding attention mechanism network provided in embodiment 1 of the present disclosure.

Fig. 3 is a schematic diagram of a module for guiding attention according to embodiment 1 of the present disclosure.

Detailed Description

The present disclosure is further described with reference to the following drawings and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.

Example 1:

As shown in Fig. 1, Fig. 2 and Fig. 3, embodiment 1 of the present disclosure provides a crowd counting method that uses a multi-scale guiding attention mechanism network for crowd counting.

The multi-scale guiding attention mechanism network comprises a multi-scale feature information extraction module, an attention mechanism module and a guiding attention mechanism module.

First, the multi-scale feature information extraction module captures context information under different receptive fields. Low-level features focus on local information, while high-level features encode global information; the multi-scale approach encourages the attention maps generated under different receptive fields to encode different semantic information. Then, at each scale, the guiding attention mechanism module gradually removes noisy regions and emphasizes regions that carry the semantics of the crowd target. The attention mechanism module contains two independent attention mechanisms, which handle feature dependencies in the spatial and channel dimensions respectively; they extract broader and richer context information and strengthen the dependencies between channels in the feature map, thereby reducing interference from background regions.

Specifically, the technical method comprises the following steps:

S1: Multi-scale feature information extraction

The whole network is a modification of VGG16. Feature maps F₀, F₁, F₂ and F₃ of different scales are generated by Conv1, Conv2, Conv3 and Conv4, and are upsampled to the same size by bilinear interpolation, giving F′_s. The resulting F′₀, F′₁, F′₂, F′₃ are concatenated and then convolved to generate the multi-scale fusion feature map F_MS:

F_MS＝conv([F′₀,F′₁,F′₂,F′₃])

Because the multi-scale feature map F_MS fuses features of different scales, the low-level and high-level features in the extracted multi-scale information complement each other, so the extracted multi-scale context information is richer.
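The fusion in S1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the helper names (`bilinear_resize`, `conv1x1`, `fuse_multiscale`) are hypothetical, the fusing convolution is assumed to be a 1 × 1 convolution without bias (the text only says "convolution"), and the interpolation uses an align-corners convention chosen for simplicity.

```python
from typing import List

Map2D = List[List[float]]   # one channel, H x W
Feature = List[Map2D]       # C x H x W


def bilinear_resize(ch: Map2D, out_h: int, out_w: int) -> Map2D:
    """Resize one channel by bilinear interpolation (align-corners style)."""
    in_h, in_w = len(ch), len(ch[0])
    out: Map2D = []
    for y in range(out_h):
        # Map the output row back to a fractional input row.
        fy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for x in range(out_w):
            fx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            top = ch[y0][x0] * (1 - wx) + ch[y0][x1] * wx
            bot = ch[y1][x0] * (1 - wx) + ch[y1][x1] * wx
            row.append(top * (1 - wy) + bot * wy)
        out.append(row)
    return out


def conv1x1(feat: Feature, weights: List[List[float]]) -> Feature:
    """Assumed 1x1 convolution: each output channel mixes the input channels."""
    h, w = len(feat[0]), len(feat[0][0])
    return [
        [[sum(wk * feat[k][y][x] for k, wk in enumerate(w_out)) for x in range(w)]
         for y in range(h)]
        for w_out in weights
    ]


def fuse_multiscale(feats: List[Feature], weights: List[List[float]]) -> Feature:
    """F_MS = conv([F'_0, ..., F'_3]): upsample, concatenate channels, convolve."""
    out_h = max(len(f[0]) for f in feats)
    out_w = max(len(f[0][0]) for f in feats)
    stacked: Feature = []
    for f in feats:
        stacked.extend(bilinear_resize(ch, out_h, out_w) for ch in f)
    return conv1x1(stacked, weights)
```

Here every scale is resized to the largest spatial size among the inputs, and `weights` holds one row of per-input-channel coefficients for each output channel.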

S2: Attention mechanism

The attention mechanism module explicitly builds a spatial attention mechanism and a channel attention mechanism, in which the feature at each position is extracted by comparison with all other positions.

For an input feature F_i:

Channel attention is calculated first. Since each channel focuses on different features, the channels that focus on the crowd need to be highlighted; the maximum and the mean of each channel are both computed to obtain soft attention:

c(F_i)＝δ¹(F_i)+δ²(F_i)

Here δ denotes softmax normalization; δ¹ is applied to the mean response and δ² to the maximum response. When dealing with low-level features, each response can be regarded as a crowd detector. δ² returns only a single response, concentrating on one discriminative part and ignoring the others, whereas δ¹ encourages the detector to treat all positions equally, which inevitably introduces noise. For this reason, this embodiment computes c(F_i) from both terms to make a soft attention selection.
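As an illustration of c(F_i) = δ¹(F_i) + δ²(F_i), the following sketch computes one attention weight per channel. It assumes that δ¹ is a softmax over the per-channel spatial means and δ² a softmax over the per-channel spatial maxima; the function names are hypothetical, and the exact pooling used in the disclosure may differ.

```python
import math
from typing import List

Feature = List[List[List[float]]]  # C x H x W


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(feat: Feature) -> List[float]:
    """Return c(F) with one weight per channel (assumed mean/max pooling)."""
    means = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    maxes = [max(max(row) for row in ch) for ch in feat]
    # Softmax across channels for each statistic, then add the two terms.
    return [a + b for a, b in zip(softmax(means), softmax(maxes))]
```

Because each softmax sums to 1 across channels, the weights returned by this sketch always sum to 2, and channels with stronger mean or peak responses receive a larger share.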

Spatial attention is then computed, and it consists of two terms. The first term is analogous to channel attention: the spatial mean matrix is calculated and normalized with softmax. The second term, LPPool, captures the similarity of local blocks: the number of channels is reduced to 1 by a pointwise (1 × 1) convolution, and 2 × 2 average pooling yields a representative value for each block. This ensures that the attention feature of each pixel is computed both locally and globally:

s(F_i)＝δ(F_i)+σ(LPPool(F_i))

where σ is the sigmoid function. Note that softmax is used for the spatial average term, whereas sigmoid is used for the local attention term: softmax normalizes the responses over all positions so that they sum to 1, while sigmoid scores each local response on its own.
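A rough sketch of s(F_i) = δ(F_i) + σ(LPPool(F_i)) is given below. Several details are assumptions made only for illustration: the global term averages over channels and applies softmax over all H × W positions, the pointwise convolution is a bias-free weighted sum with hypothetical weights `point_w`, and each 2 × 2 block average is copied back to every pixel of its block.

```python
import math
from typing import List

Feature = List[List[List[float]]]  # C x H x W


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def spatial_attention(feat: Feature, point_w: List[float]) -> List[List[float]]:
    """Return s(F) as an H x W map (global softmax term + local sigmoid term)."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    # Global term: per-pixel mean over channels, softmax over all positions.
    flat = [sum(feat[k][y][x] for k in range(c)) / c
            for y in range(h) for x in range(w)]
    m = max(flat)
    exps = [math.exp(v - m) for v in flat]
    z = sum(exps)
    # Local term: squeeze channels to one map with a pointwise convolution.
    squeezed = [[sum(point_w[k] * feat[k][y][x] for k in range(c))
                 for x in range(w)] for y in range(h)]
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            by, bx = (y // 2) * 2, (x // 2) * 2  # top-left corner of the block
            block = [squeezed[j][i]
                     for j in range(by, min(by + 2, h))
                     for i in range(bx, min(bx + 2, w))]
            local = sum(block) / len(block)      # 2 x 2 average pooling
            row.append(exps[y * w + x] / z + sigmoid(local))
        out.append(row)
    return out
```

For an all-zero 2 × 2 input, for example, every position receives 1/4 from the softmax term and 1/2 from the sigmoid term.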

Finally, the attention-weighted feature is calculated:

where · denotes pixel-wise multiplication, I is a matrix whose entries are all 1, and c(F_i) and s(F_i) are the channel attention and the spatial attention, respectively.

S3: Guiding attention mechanism module

In the guiding attention mechanism module, the attention mechanism described in S2 is used directly, and different loss functions are set so that the model adjusts by itself, during training, the feature information that needs attention.

The feature map of each scale and the multi-scale fusion feature map are input, on the one hand, into an attention mechanism module to generate an attention feature map and, on the other hand, into an encoder-decoder, and the first attention loss is calculated:

where E_i(·) is the encoded representation of the i-th encoder-decoder network,

denotes the attention feature generated by the i-th attention mechanism module, and M is the number of iterations. It should be noted that

semantically guides the features at the input of the attention mechanism module; specifically, it is generated by matrix multiplication of the reconstructed feature map of the first encoder-decoder and the attention feature generated by the first attention mechanism module.

Furthermore, to ensure that the reconstructed features correspond to the features at the input of the attention mechanism module, the output of the encoder-decoder is guided to closely match its input features, and the second attention loss is calculated:

where

is the reconstructed feature map, i.e. the output E_i(F) of the i-th encoder-decoder network.

Since the guiding attention mechanism module is applied at multiple scales, the combined guiding loss over all modules is:

Likewise, the combined reconstruction loss becomes:

where L_Rec1 and L_Rec2 are the reconstruction losses of the first and second encoder-decoder structures in the guiding attention mechanism module.
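The way the per-scale losses are combined can be sketched as below. The loss formulas themselves are not reproduced in this text, so the sketch assumes a mean squared error as the distance measure; only the summation over scales, and the pairing of L_Rec1 with L_Rec2, follows the description. All function and argument names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]  # a flattened feature map or encoded representation


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equally sized vectors (assumed distance)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same size")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_guiding_loss(per_scale: List[Tuple[Vector, Vector]]) -> float:
    """Add the first attention loss of every scale.

    Each entry is a pair of representations compared at that scale.
    """
    return sum(mse(code_a, code_b) for code_a, code_b in per_scale)


def combined_reconstruction_loss(
    per_scale: List[Tuple[Vector, Vector, Vector, Vector]]
) -> float:
    """Add L_Rec1 + L_Rec2 over all scales.

    Each entry is (input_1, reconstruction_1, input_2, reconstruction_2)
    for the first and second encoder-decoder at that scale.
    """
    return sum(mse(i1, r1) + mse(i2, r2) for i1, r1, i2, r2 in per_scale)
```

In training, the two combined values would typically be added to the density regression loss, possibly with weighting factors; the text does not state such weights, so none are shown.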

The generated multi-scale fusion feature map F_MS is concatenated with each of F′₀, F′₁, F′₂, F′₃ in turn, the result is convolved and input into the guiding attention mechanism model, and the attention feature maps A₀, A₁, A₂, A₃ at different scales are obtained:

A_s＝AttMod_s(conv([F′_s,F_MS]))

where A_s denotes the attention feature at scale s and AttMod_s denotes the corresponding module of the guiding attention mechanism. Through the guiding attention mechanism, the additional losses between different modules make the attention mechanism ignore irrelevant information and, by emphasizing crowd-related features, focus on the crowd regions of the image.
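The per-scale wiring A_s = AttMod_s(conv([F′_s, F_MS])) can be sketched as a small higher-order function. Features are represented as lists of channels so that list concatenation stands in for the channel-wise concatenation; `conv` and the entries of `att_mods` are placeholder callables supplied by the caller, not components defined in this text.

```python
from typing import Callable, List

Feature = list  # a list of channels; each channel can be any object


def guided_attention_features(
    upsampled: List[Feature],
    f_ms: Feature,
    conv: Callable[[Feature], Feature],
    att_mods: List[Callable[[Feature], Feature]],
) -> List[Feature]:
    """Compute A_s = AttMod_s(conv([F'_s, F_MS])) for every scale s."""
    if len(upsampled) != len(att_mods):
        raise ValueError("one attention module is required per scale")
    # f_s + f_ms concatenates the channel lists of F'_s and F_MS.
    return [att(conv(f_s + f_ms)) for f_s, att in zip(upsampled, att_mods)]
```

Each scale has its own attention module, while the same fused map F_MS is shared by all scales.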

S4: Density map regression

The four attention feature maps A₀, A₁, A₂, A₃ output by the guiding attention mechanism module are fused, and density regression is performed on the fused feature map to obtain a high-quality crowd density map.

S5: Crowd count

The pixel values of the density map are summed to obtain the final crowd count:

C = Σ_{i=1}^{H} Σ_{j=1}^{W} P_ij

where C is the final estimated number of people, H is the height of the density map, W is the width of the density map, and P_ij is the pixel value at coordinate (i, j) of the density map.
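The counting step in S5 amounts to summing every pixel of the density map. A minimal sketch, with a hypothetical function name and the density map given as a list of rows:

```python
from typing import List


def crowd_count(density_map: List[List[float]]) -> float:
    """Sum all pixel values P_ij of an H x W density map to get the count C."""
    return sum(sum(row) for row in density_map)
```

Because the density map holds fractional values, the result is generally not an integer; whether and how to round it is left to the application.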

Example 2:

Embodiment 2 of the present disclosure provides a crowd counting system based on a multi-scale guiding attention mechanism network, including:

The working method of the system is the same as the crowd counting method based on the multi-scale guiding attention mechanism network provided in embodiment 1, and details are not repeated here.

Example 3:

Embodiment 3 of the present disclosure provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the steps of the crowd counting method based on the multi-scale guiding attention mechanism network according to embodiment 1 of the present disclosure, where the steps are:

acquiring image data to be identified;

The detailed steps are the same as those of the crowd counting method based on the multi-scale guiding attention mechanism network provided in embodiment 1, and are not described herein again.

Example 4:

Embodiment 4 of the present disclosure provides an electronic device, which includes a memory, a processor, and a program stored in the memory and executable on the processor; when executing the program, the processor implements the steps of the crowd counting method based on the multi-scale guiding attention mechanism network according to embodiment 1 of the present disclosure, where the steps are:

acquiring image data to be identified;

As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A crowd counting method based on a multi-scale attention-guiding mechanism network is characterized in that: the method comprises the following steps:

acquiring image data to be identified;

2. The crowd counting method based on the multi-scale attention-guiding mechanism network as claimed in claim 1, wherein:

in the guiding attention mechanism model, weighted attention feature maps under different scales are obtained according to the space attention and the channel attention.

3. The crowd counting method based on the multi-scale attention-guiding mechanism network as claimed in claim 1, wherein:

by setting different loss functions, the attention mechanism model is guided to self-adjust the feature information needing attention in training.

4. The crowd counting method based on the multi-scale attention-guiding mechanism network as claimed in claim 3, wherein:

according to the obtained feature maps and the multi-scale fusion feature map, and in combination with the encoder-decoder and the attention mechanism module of the guiding attention mechanism model, first attention loss functions at each scale are obtained, and the first attention loss functions at all scales are added to obtain a combined guiding loss.

5. The crowd counting method based on the multi-scale attention-guiding mechanism network as claimed in claim 3, wherein:

the output of the encoder-decoder is guided to be consistent or nearly consistent with its input features, the reconstructed feature map and the input feature map are combined to obtain second attention loss functions at each scale, and the second attention loss functions at all scales are added to obtain a combined reconstruction loss.

6. The crowd counting method based on the multi-scale attention-guiding mechanism network as claimed in claim 1, wherein:

a concatenation operation is performed on the obtained feature maps, followed by a convolution operation, to generate the multi-scale fusion feature map.

7. The crowd counting method based on the multi-scale attention-guiding mechanism network as claimed in claim 1, wherein:

the pixel values of the crowd density map are summed to obtain the final crowd count value.

8. A crowd counting system based on a multi-scale guiding attention mechanism network, characterized in that the system comprises:

9. A computer-readable storage medium on which a program is stored, wherein the program, when executed by a processor, carries out the steps of the crowd counting method based on a multi-scale guiding attention mechanism network according to any one of claims 1-7.

10. An electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the crowd counting method based on a multi-scale guiding attention mechanism network according to any one of claims 1-7.