CN112966600B - Self-adaptive multi-scale context aggregation method for crowded population counting - Google Patents

Self-adaptive multi-scale context aggregation method for crowded population counting

Info

Publication number
CN112966600B
CN112966600B CN202110242403.7A CN202110242403A
Authority
CN
China
Prior art keywords
scale
representing
context
feature map
resolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110242403.7A
Other languages
Chinese (zh)
Other versions
CN112966600A (en)
Inventor
赵怀林
梁兰军
张亚妮
周方波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Technology
Original Assignee
Shanghai Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Technology filed Critical Shanghai Institute of Technology
Priority to CN202110242403.7A
Publication of CN112966600A
Application granted
Publication of CN112966600B
Legal status: Active

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The invention provides an adaptive multi-scale context aggregation method for crowd counting, comprising the following steps: inputting a sample picture into a backbone network and extracting a feature map whose resolution is j times that of the input image; feeding the extracted feature map into several multi-scale context aggregation modules in cascade, which extract and adaptively aggregate multi-scale context information to obtain multi-scale context features; processing the generated multi-scale context features with a convolution layer to generate a density map; and integrating (summing) the density map to obtain the predicted head count. The method effectively extracts multi-scale information, alleviates the problem of non-uniform head sizes, adaptively selects and aggregates useful context information through a channel attention mechanism to avoid information redundancy, yields more accurate density estimates in crowded scenes, and is more robust.

Description

Self-adaptive multi-scale context aggregation method for crowded population counting
Technical Field
The invention relates to the technical field of data processing, and in particular to an adaptive multi-scale context aggregation method for crowd counting in crowded scenes.
Background
Crowd counting is a basic task of computer-vision-based crowd analysis, which aims to automatically estimate crowd conditions.
In crowd scenes, however, the task often faces challenging factors such as severe occlusion, scale variation, and diverse crowd distributions; in very crowded scenes especially, estimating crowd density is difficult because foreground and background objects are visually similar and head scales vary.
Networks that directly aggregate context features at different scales already exist, but not all features are useful for the final count, and direct aggregation introduces information redundancy that degrades the performance of the counting network.
Disclosure of Invention
In view of the drawbacks of the prior art, an object of the present invention is to provide an adaptive multi-scale context aggregation method for crowd counting in crowded scenes.
The invention provides an adaptive multi-scale context aggregation method for crowd counting, comprising the following steps:
step 1: inputting a sample picture into a backbone network and extracting a feature map whose resolution is j times that of the input image;
step 2: feeding the extracted feature map into several multi-scale context aggregation modules in cascade, which extract and adaptively aggregate multi-scale context information to obtain multi-scale context features; an up-sampling layer follows each multi-scale context aggregation module to convert the multi-scale context features into a higher-resolution feature map;
step 3: processing the generated multi-scale context features with a convolution layer to generate a density map;
step 4: calculating a loss function between the generated density map and the true-value density map, and optimizing network parameters;
step 5: integrating (summing) the generated density map to obtain the predicted head count.
Optionally, the step 4 includes:
generating a true-value density map of the crowd by Gaussian-kernel convolution from the picture with head mark points, where the density map is computed as:

F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_σ(x)

where F(x) is the true-value density map, x_i the pixel at a person's head, G_σ the Gaussian kernel, δ(·) the Dirac function, σ the standard deviation, N the total number of people in the picture, and x a pixel of the picture.
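The ground-truth density map described above can be sketched in NumPy. The kernel size and σ below are illustrative choices, not values fixed by the patent; kernels clipped at the image border are renormalized so that each annotated head still contributes exactly one to the integral.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """2-D Gaussian kernel, normalized so that it sums to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def density_map(shape, head_points, sigma=4.0, ksize=15):
    """F(x) = sum_i delta(x - x_i) * G_sigma(x): place a Gaussian at each
    annotated head point; the map integrates to the number of heads."""
    h, w = shape
    dense = np.zeros((h, w), dtype=np.float64)
    kernel = gaussian_kernel(ksize, sigma)
    r = ksize // 2
    for (y, x) in head_points:
        # clip the kernel window at the image border
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        ky0, ky1 = y0 - (y - r), ksize - ((y + r + 1) - y1)
        kx0, kx1 = x0 - (x - r), ksize - ((x + r + 1) - x1)
        patch = kernel[ky0:ky1, kx0:kx1]
        # renormalize the clipped kernel so each head still contributes 1
        dense[y0:y1, x0:x1] += patch / patch.sum()
    return dense
```

Summing the resulting map therefore recovers the annotated head count exactly, which is what makes the integral-summation prediction in step 5 consistent with the ground truth.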
Optionally, the step 2 includes:
the multi-scale context aggregation module adaptively selects small-scale context features and aggregates them with large-scale context features; the multi-scale context aggregation module comprises several parallel branches of dilated convolution with different dilation rates;
y_i^j denotes the feature extracted by the dilated convolution of the i-th scale, where i is the dilation rate of the convolution kernel, the superscript j indicates that the resolution is j times that of the input image, and r is the reduction rate of the backbone network; y_i^j ∈ R^{jW×jH×C}, where W×H is the image resolution, C the number of channels, and R^{jW×jH×C} the set of all feature maps at j times the resolution;
the feature maps extracted by the dilated convolutions are input into a channel attention module, which uses a selection function f to adaptively select the useful context feature information in y_i^j and outputs a feature map Y_j ∈ R^{jW×jH×C} that aggregates the context information, where Y_j is defined as:

Y_j = f(⋯ f(f(y_1^j) ⊕ y_2^j) ⊕ ⋯ ⊕ y_{n−1}^j) ⊕ y_n^j

where Y_j is the feature map at j times the resolution extracted by the aggregation module, ⊕ denotes element-wise summation, y_1^j, …, y_n^j are the feature maps extracted at scales 1 through n, and j indicates the resolution is j times that of the input picture.
Optionally, the adaptive selection using the selection function f includes:
pooling each context feature through a global spatial average-pooling layer and outputting the feature information F_avg^i;
processing the feature information F_avg^i with a bottleneck structure composed of two fully connected layers, and normalizing the output to (0, 1) with a sigmoid function; the adaptive output coefficient is computed as:

α_i = Sigmoid(W_2 · ReLU(W_1 · F_avg^i))

where W_1 and W_2 are the weight matrices of the two fully connected layers, a ReLU follows the first fully connected layer, a Sigmoid follows the second, and F_avg^i is the output of y_i^j after the average-pooling layer;
adding a residual connection between the input and output of the channel attention mechanism, the resulting selection function is defined as:

f(y_i^j) = y_i^j ⊕ (α_i ⊗ y_i^j)

where f(y_i^j) is the output of the i-th channel attention module, y_i^j the feature map extracted by the dilated convolution of the i-th scale, and α_i the adaptive coefficient of the i-th channel attention module.
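The selection function can be sketched in NumPy as follows. The weight shapes and the bottleneck reduction are assumptions (the text does not fix them), and the per-channel coefficient is applied by broadcasting.

```python
import numpy as np

def selection(y, w1, w2):
    """Channel-attention selection with residual: f(y) = y + alpha * y.
    y: feature map of shape (C, H, W); w1: (C_r, C) and w2: (C, C_r) are
    the two fully connected layers of the bottleneck (C_r is assumed)."""
    f_avg = y.mean(axis=(1, 2))                    # global spatial average pool -> (C,)
    hidden = np.maximum(w1 @ f_avg, 0.0)           # first FC + ReLU
    alpha = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # second FC + sigmoid, in (0, 1)
    return y + alpha[:, None, None] * y            # residual connection
```

The residual term guarantees that even a channel assigned a near-zero coefficient is not erased entirely, which is the stated motivation for adding the connection.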
Compared with the prior art, the invention has the following beneficial effects:
the self-adaptive multi-scale context aggregation method for crowded counting effectively extracts multi-scale information, solves the problem of non-uniform size of the head of a person, adaptively selects and aggregates useful context information through a channel attention mechanism, avoids redundancy of information, can have more accurate density estimation in crowded scenes, and has higher robustness.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
fig. 1 is a schematic diagram of an adaptive multi-scale context aggregation method for crowd counting according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
The invention provides an adaptive multi-scale context aggregation method for crowd counting, used for crowd density estimation in crowded scenes. The method mainly comprises the following steps: a picture is input; feature information is first extracted through a backbone network, and the extracted feature map is then fed into several multi-scale context aggregation modules in cascade. Each module first extracts multi-scale information with convolution kernels of different dilation rates, then adaptively selects channel context feature information through a channel attention mechanism and aggregates it. After each multi-scale context aggregation module, the feature map is upsampled to a higher resolution; finally, an estimated density map is output through a 1×1 convolution kernel, and the predicted head count is obtained by integral summation. By using several convolution kernels with different dilation rates, the method effectively extracts multi-scale information and alleviates the problem of non-uniform head sizes; by adaptively selecting and aggregating useful context information through a channel attention mechanism, it avoids information redundancy, yields more accurate density estimates in crowded scenes, and is more robust.
Fig. 1 is a schematic diagram of an adaptive multi-scale context aggregation method for crowd counting according to an embodiment of the present invention, as shown in fig. 1, may include the following steps:
step S1: and inputting the sample picture into a backbone network, and extracting a feature map with the size i times of the resolution of the original image.
Step S2: the extracted feature images are input into a plurality of self-adaptive multi-scale context aggregation modules in a cascading mode, multi-scale context information is extracted and self-adaptively aggregated, and an up-sampling layer is arranged behind each module and used for converting multi-scale context features into feature images with higher resolution.
Step S3: the generated multi-scale context features are processed by a 1×1 convolution layer to generate a density map.
Step S4: calculating a loss function between the generated density map and the true value density map, and optimizing network parameters;
step S5: and integrating and summing the density map to obtain the predicted number of people.
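Steps S3 and S5 can be sketched together in NumPy: a 1×1 convolution is a per-pixel linear projection across channels, and the predicted count is the discrete integral (sum) of the resulting single-channel map. The projection weights below are illustrative, not values from the patent.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: a per-pixel linear map across channels.
    x: (C_in, H, W), w: (C_out, C_in) -> output of shape (C_out, H, W)."""
    return np.tensordot(w, x, axes=([1], [0]))

def predicted_count(features, w):
    """Step S3 + S5: project features to a single-channel density map
    with a 1x1 convolution (C_out = 1), then sum it to get the count."""
    density = conv1x1(features, w)
    return float(density.sum())
```

In practice the count would be rounded to the nearest integer; the sum itself is the raw network prediction.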
In this embodiment, a true-value density map of the crowd is generated by Gaussian-kernel convolution from the picture with head mark points; the pixel at a head is denoted x_i and the Gaussian kernel G_σ, so the true-value density map can be expressed as:

F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_σ(x)

where F(x) is the true-value density map, δ(·) the Dirac function, σ the standard deviation, N the total number of people in the picture, and x a pixel of the picture.
Specifically, the adaptive multi-scale context aggregation module in step S2, shown in fig. 1, adaptively selects reliable small-scale context features and aggregates them with large-scale context features. The specific operation is as follows:
the multi-scale context aggregation module comprises several parallel branches of dilated convolution with different dilation rates; y_i^j denotes the feature extracted by the dilated convolution of the i-th scale, where i is the dilation rate of the convolution kernel, the superscript j indicates that the resolution is j times that of the input image, and r is the reduction rate of the backbone network; y_i^j ∈ R^{jW×jH×C}, where W×H is the image resolution, C the number of channels, and R^{jW×jH×C} the set of all feature maps at j times the resolution. The feature maps extracted by the dilated convolutions are then input to a channel attention module (CA), which uses a selection function f to adaptively select the useful context feature information in y_i^j and finally outputs the feature map Y_j ∈ R^{jW×jH×C} that aggregates the context information, defined as:

Y_j = f(⋯ f(f(y_1^j) ⊕ y_2^j) ⊕ ⋯ ⊕ y_{n−1}^j) ⊕ y_n^j

where Y_j is the feature map at j times the resolution extracted by the aggregation module, ⊕ denotes element-wise summation, y_1^j, …, y_n^j are the feature maps extracted at scales 1 through n, and j indicates the resolution is j times that of the input picture.
Illustratively, the selection function f employs a channel attention mechanism to aggregate the multi-scale context information. The specific operation is:
each feature is first pooled by a global spatial average-pooling layer (denoted F_avg), then processed by a bottleneck structure consisting of two fully connected layers, and finally the output is normalized to (0, 1) by a sigmoid function. The adaptive output coefficient can be expressed as:

α_i = Sigmoid(W_2 · ReLU(W_1 · F_avg^i))

where W_1 and W_2 are the weight matrices of the two fully connected layers, a ReLU follows the first fully connected layer, a Sigmoid follows the second, and F_avg^i is the output of y_i^j after the average-pooling layer.
Furthermore, for better optimization, a residual connection is added between the input and output of the channel attention mechanism, and the final selection function is defined as:

f(y_i^j) = y_i^j ⊕ (α_i ⊗ y_i^j)
compared with the existing counting, the embodiment adopts a plurality of convolutions with different void ratios to extract multi-scale information, and self-adaptively selects and aggregates the multi-scale context information through a channel attention mechanism, so that good performance is shown in crowded scenes, and the accuracy of crowd counting is improved.
The technical scheme of the invention is described in more detail below with reference to specific embodiments. Given the pixel values and labels of a picture, the true-value density map of the picture is obtained by Gaussian convolution, which can be expressed as:

F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_σ(x)

where x_i denotes a pixel with a human head, x any pixel, G_σ the Gaussian kernel, δ(·) the Dirac function, σ the standard deviation, and N the total number of people in the picture.
The complex nonlinear mapping from the input image to the crowd estimated density map is then learned by a multi-scale context aggregation network, as follows:
the first ten layers of VGG-16 are selected as a backbone network, the pictures are input into the backbone network, the characteristic information is extracted, and the size of the characteristic diagram is 1\8 of the input image.
The extracted feature map is first convolved with a 3×3 convolution kernel, and the feature information is then sent to the multi-scale context aggregation module. Features at different scales are first extracted through several branches of dilated convolution with different dilation rates; the feature at each scale is denoted y_i^j, with n scales in total.
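The dilated (atrous) branches can be sketched in NumPy. The naive loop below is for clarity rather than efficiency; a 3×3 kernel at dilation rate r covers a (2r+1)×(2r+1) receptive field with the same nine weights, which is how the parallel branches see different scales.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """'Same'-padded 2-D convolution with dilation `rate`, assuming a
    square odd-sized kernel. x: (H, W) single-channel input."""
    kh, kw = kernel.shape
    pad = rate * (kh // 2)
    xp = np.pad(x, pad)                 # zero-pad so output matches input size
    h, w = x.shape
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            # each kernel tap samples the input at a stride of `rate`
            out += kernel[i, j] * xp[i * rate:i * rate + h, j * rate:j * rate + w]
    return out
```

Running the same input through several rates (e.g. 1, 2, 3) yields the scale-indexed features y_1^j, y_2^j, y_3^j that the aggregation module consumes.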
The feature y_1^j is then passed through the attention module, which adaptively aggregates multi-scale context information: the context information is first extracted through a global spatial average-pooling layer, the features are then processed by a bottleneck structure formed by two fully connected layers, and finally the output is normalized to (0, 1) by a sigmoid function. The adaptive output coefficient can be expressed as:

α_1 = Sigmoid(W_2 · ReLU(W_1 · F_avg^1))

Finally, the input and output of the channel attention mechanism are connected directly with a residual, giving the final output:

f(y_1^j) = y_1^j ⊕ (α_1 ⊗ y_1^j)
will beMulti-scale contextual profile selected by attention mechanisms->And the 2 nd scale information->Pixel-by-pixel summing is performed, which can be expressed as: />
Extracting the extractThe feature information is sent to the channel attention mechanism to adaptively select the context information, and the context information and the feature information of the 3 rd scale are subjected to pixel summation, and the like, so that the feature mapping which aggregates the multi-scale context information is finally obtained:
After the multi-scale context information is extracted by a multi-scale context aggregation module, it is converted by upsampling into a higher-resolution feature map, which is then sent to the next multi-scale context aggregation module for feature extraction in the same way. After the three multi-scale context aggregation modules have been processed in sequence, the estimated density map is output through a 1×1 convolution kernel and the loss function L(θ) is computed:

L(θ) = 1/(2M) · Σ_{i=1}^{M} ‖F(I_i; θ) − F_i‖²₂

where F(I_i; θ) is the density map output by the network for training picture I_i, F_i is the corresponding true-value density map, M is the number of training pictures, and θ is the set of parameters the network must optimize; the network continually optimizes θ by gradient descent to find the parameter values that minimize the loss.
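A NumPy sketch of this Euclidean loss; the 1/(2M) batch normalization is an assumption consistent with the pixel-wise L2 loss commonly used for density-map regression, since the garbled original does not show the constant.

```python
import numpy as np

def mse_loss(pred_maps, gt_maps):
    """L(theta) = 1/(2M) * sum_i ||F(I_i; theta) - F_i||_2^2 over a
    batch of M predicted density maps and their ground-truth maps."""
    m = len(pred_maps)
    total = sum(np.sum((p - g) ** 2) for p, g in zip(pred_maps, gt_maps))
    return total / (2.0 * m)
```

In a training loop this scalar would be differentiated with respect to θ (here implicit in `pred_maps`) and minimized by gradient descent.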
It should be noted that, the steps in the adaptive multi-scale context aggregation method for crowd counting provided in the present invention may be implemented by using corresponding modules, devices, units, etc. in the adaptive multi-scale context aggregation system for crowd counting, and those skilled in the art may refer to the technical scheme of the system to implement the step flow of the method, that is, the embodiment in the system may be understood as a preferred embodiment for implementing the method, which is not repeated herein.
Those skilled in the art will appreciate that the invention provides a system and its individual devices that can be implemented entirely by logic programming of method steps, in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the system and its individual devices being implemented in pure computer readable program code. Therefore, the system and various devices thereof provided by the present invention may be considered as a hardware component, and the devices included therein for implementing various functions may also be considered as structures within the hardware component; means for achieving the various functions may also be considered as being either a software module that implements the method or a structure within a hardware component.
The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the invention. The embodiments of the present application and features in the embodiments may be combined with each other arbitrarily without conflict.

Claims (2)

1. A method for adaptive multi-scale context aggregation for crowd counting, comprising:
step 1: inputting a sample picture into a backbone network and extracting a feature map whose resolution is j times that of the input image;
step 2: feeding the extracted feature map into several multi-scale context aggregation modules in cascade, which extract and adaptively aggregate multi-scale context information to obtain multi-scale context features; an up-sampling layer follows each multi-scale context aggregation module to convert the multi-scale context features into a higher-resolution feature map;
step 3: processing the generated multi-scale context features with a convolution layer to generate a density map;
step 4: calculating a loss function between the generated density map and the true-value density map, and optimizing network parameters;
step 5: integrating (summing) the generated density map to obtain the predicted head count;
the step 2 comprises the following steps:
the multi-scale context aggregation module adaptively selects small-scale context features and aggregates them with large-scale context features; the multi-scale context aggregation module comprises several parallel branches of dilated convolution with different dilation rates;
y_i^j denotes the feature extracted by the dilated convolution of the i-th scale, where i is the dilation rate of the convolution kernel, the superscript j indicates that the resolution is j times that of the input image, and r is the reduction rate of the backbone network; y_i^j ∈ R^{jW×jH×C}, where W×H is the image resolution, C the number of channels, and R^{jW×jH×C} the set of all feature maps at j times the resolution;
the feature maps extracted by the dilated convolutions are input into a channel attention module, which uses a selection function f to adaptively select the useful context feature information in y_i^j and outputs a feature map Y_j ∈ R^{jW×jH×C} that aggregates the context information, where Y_j is defined as:

Y_j = f(⋯ f(f(y_1^j) ⊕ y_2^j) ⊕ ⋯ ⊕ y_{n−1}^j) ⊕ y_n^j

where Y_j is the feature map at j times the resolution extracted by the aggregation module, ⊕ denotes element-wise summation, y_1^j, …, y_n^j are the feature maps extracted at scales 1 through n, and j indicates the resolution is j times that of the input picture;
said adaptive selection using the selection function f including:
pooling each context feature through a global spatial average-pooling layer and outputting the feature information F_avg^i;
processing the feature information F_avg^i with a bottleneck structure composed of two fully connected layers, and normalizing the output to (0, 1) with a sigmoid function, the adaptive output coefficient being computed as:

α_i = Sigmoid(W_2 · ReLU(W_1 · F_avg^i))

where W_1 and W_2 are the weight matrices of the two fully connected layers, a ReLU follows the first fully connected layer, a Sigmoid follows the second, and F_avg^i is the output of y_i^j after the average-pooling layer;
adding a residual connection between the input and output of the channel attention mechanism, the resulting selection function being defined as:

f(y_i^j) = y_i^j ⊕ (α_i ⊗ y_i^j)

where f(y_i^j) is the output of the i-th channel attention module, y_i^j the feature map extracted by the dilated convolution of the i-th scale, and α_i the adaptive coefficient of the i-th channel attention module.
2. The adaptive multi-scale context aggregation method for crowd counting according to claim 1, wherein the step 4 comprises:
generating a true-value density map of the crowd by Gaussian-kernel convolution from the picture with head mark points, the density map being computed as:

F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_σ(x)

where F(x) is the true-value density map, x_i the pixel at a person's head, G_σ the Gaussian kernel, δ(·) the Dirac function, σ the standard deviation, N the total number of people in the picture, and x a pixel of the picture.
CN202110242403.7A 2021-03-04 2021-03-04 Self-adaptive multi-scale context aggregation method for crowded population counting Active CN112966600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110242403.7A CN112966600B (en) 2021-03-04 2021-03-04 Self-adaptive multi-scale context aggregation method for crowded population counting


Publications (2)

Publication Number Publication Date
CN112966600A CN112966600A (en) 2021-06-15
CN112966600B true CN112966600B (en) 2024-04-16

Family

ID=76277443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110242403.7A Active CN112966600B (en) 2021-03-04 2021-03-04 Self-adaptive multi-scale context aggregation method for crowded population counting

Country Status (1)

Country Link
CN (1) CN112966600B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120233B (en) * 2021-11-29 2024-04-16 上海应用技术大学 Training method of lightweight pyramid cavity convolution aggregation network for crowd counting

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263849A (en) * 2019-06-19 2019-09-20 合肥工业大学 A kind of crowd density estimation method based on multiple dimensioned attention mechanism
CN111242036A (en) * 2020-01-14 2020-06-05 西安建筑科技大学 Crowd counting method based on encoding-decoding structure multi-scale convolutional neural network
WO2020169043A1 (en) * 2019-02-21 2020-08-27 苏州大学 Dense crowd counting method, apparatus and device, and storage medium
CN111709290A (en) * 2020-05-18 2020-09-25 杭州电子科技大学 Crowd counting method based on coding and decoding-jumping connection scale pyramid network
CN112132023A (en) * 2020-09-22 2020-12-25 上海应用技术大学 Crowd counting method based on multi-scale context enhanced network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Crowd density estimation based on multi-level feature fusion; Chen Peng; Tang Yiping; Wang Liran; He Xia; Journal of Image and Graphics (Issue 08); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant