CN109522857B - People number estimation method based on a generative adversarial network model - Google Patents

People number estimation method based on a generative adversarial network model

Info

Publication number
CN109522857B
CN109522857B (application CN201811415565.0A)
Authority
CN
China
Prior art keywords
image
network
people
convolution
adopting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811415565.0A
Other languages
Chinese (zh)
Other versions
CN109522857A (en)
Inventor
元辉
贺黎恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201811415565.0A priority Critical patent/CN109522857B/en
Publication of CN109522857A publication Critical patent/CN109522857A/en
Application granted granted Critical
Publication of CN109522857B publication Critical patent/CN109522857B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent


Abstract

The invention relates to a people number estimation method based on a generative adversarial network model. It draws on the automatic feature extraction and multiple regression models of deep learning and makes full use of the feature representation capability of generative adversarial networks (GANs). A density map indicating the local crowd density is used as the secondary supervision signal and the number of people in the image as the primary supervision signal; the network is trained with the back-propagation algorithm, and the obtained network parameters are then used to initialize the network and predict the number of people in an unknown image.

Description

People number estimation method based on a generative adversarial network model
Technical Field
The invention relates to a people number estimation method based on a generative adversarial network model, and belongs to the technical field of image processing.
Background
Directly estimating the number of people from an image has long been challenging because of illumination variation, perspective distortion, and disturbance from noisy environments (for example, a background that is a forest, or a wall with strong light reflection). The rise of deep learning techniques in recent years, however, has led researchers and engineers to make extensive use of, and further develop, deep network models. Although automatic people number estimation methods based on deep network models have achieved quite good performance in natural scenes, the existing schemes still have shortcomings, as analysed below.
Zhang et al. [1] propose a multi-column convolutional network, as shown in FIG. 1. The scheme is a single-image crowd counting algorithm based on a multi-column convolutional neural network. The multi-column network comprises three sub-networks with different structures and convolution kernels of different sizes; the input of each sub-network is the same image, and after four convolutions and two poolings the feature maps output by the three sub-networks are concatenated along the channel dimension, after which a 1 × 1 convolution produces the crowd density map. However, the sub-networks are only fused at the high layers of the network, and the multi-scale features in the shallow layers are not fully fused, which causes a loss of geometric features and affects the accuracy of people number estimation. In addition, the scheme requires the three sub-networks to be pre-trained before the whole network is trained, and the training time of each sub-network is no less than ten hours.
Daniel et al. [2] propose a multi-branch convolutional network based on multi-scale blocks, as shown in FIG. 2. Although its input blocks have different scales, the scheme consists of three sub-networks with the same structure, which are likewise only fused at the high layers of the network; the multi-scale features in the shallow layers are not fully fused, which causes a loss of geometric features and affects the accuracy of people number estimation. The scheme also requires the three sub-networks to be pre-trained before the whole network is trained, and the training time of each sub-network is no less than ten hours.
Han et al. [3] propose a method based on the combination of a residual network (ResNet) and a fully connected network, as shown in FIG. 3. In this scheme, a number of overlapping blocks are first sampled from each image, the predicted value of each block is then computed by the residual network, and the block predictions are finally fed into a conditional random field to compute the predicted number of people in the image. The scheme therefore proceeds in separate stages: the residual network must first produce the block predictions before the conditional random field can predict the number of people in the image, and the two stages cannot be merged into one.
Experiments show, however, that training these networks takes a long time, and the training time keeps increasing as the network structure is deepened. A deep network such as that of Han et al. [3] has a very deep structure and many parameters to learn, which not only requires a long training time but also risks over-fitting. The schemes proposed by Zhang et al. [1] and Daniel et al. [2], although not as deep as that of Han et al. [3], increase the breadth of the network, and each sub-network needs to be pre-trained in advance.
Disclosure of Invention
Aiming at the defects of the existing automatic people number estimation techniques based on deep network models, the invention provides a people number estimation method based on a generative adversarial network model.
To reduce the number of network parameters, no convolution kernel larger than 3 × 3 is used in the proposed scheme; to reduce the network width, only a single-column network structure is used; and to guarantee performance, different weights are given to the inputs of the regression network so as to distinguish the importance of different features.
The invention draws on the automatic feature extraction and multiple regression models of deep learning, makes full use of the feature representation capability of generative adversarial networks (GANs), uses a density map indicating the local crowd density as the secondary supervision signal and the number of people in the image as the primary supervision signal, trains the network with the back-propagation algorithm, and then initializes the network with the obtained network parameters to predict the number of people in an unknown image.
Interpretation of terms:
1. batch Normalization (Batch Normalization) process, comprising the following four steps: calculating the mean value of each training batch of data; calculating the variance of each training batch of data; normalizing the training data of the batch by using the obtained mean value and variance, namely subtracting the mean value from each training data of the batch and then dividing the result by the standard deviation; then multiplied by a scaling factor gamma, plus a translation factor beta.
2. Linear commutation (ReLU) activation function, which means that f (x) is max (0, x).
3. The max pooling (i.e., "down-sampling") operation refers to maximizing the feature points within a neighborhood.
4. A sigmoid function (sigmoid) activation function, which means
Figure BDA0001879368540000021
5. The RMSprop optimization algorithm comprises the steps of firstly, calculating the average value of the squares of the gradients of the previous t times; then, dividing the average value of the squares of the gradients of the previous t times by the gradient of the t time to be used as the updating proportion of the learning rate; and finally, obtaining a new learning rate according to the proportion.
6. The Adam optimization algorithm is used for dynamically adjusting the learning rate of each parameter according to the first moment estimation and the second moment estimation of the gradient of each parameter by the loss function.
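For concreteness, a small NumPy sketch of the operations defined in terms 1 to 4 (batch normalization, ReLU, max pooling and sigmoid) is given below; the 2 × 2 pooling window and the epsilon constant are illustrative assumptions, and the optimizers of terms 5 and 6 are not re-implemented here.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization as in term 1: per-batch mean and variance,
    normalize, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def relu(x):
    """Term 2: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Term 4: f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def max_pool_2x2(x):
    """Term 3: maximum within each 2 x 2 neighbourhood of the last two
    axes (the window size is an assumption for this sketch)."""
    h, w = x.shape[-2] // 2 * 2, x.shape[-1] // 2 * 2
    x = x[..., :h, :w]
    return x.reshape(*x.shape[:-2], h // 2, 2, w // 2, 2).max(axis=(-3, -1))

x = np.random.randn(8, 4)
print(batch_norm(x).mean(axis=0).round(6))       # ~0 for each feature
print(max_pool_2x2(np.arange(16.0).reshape(4, 4)))
```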
The technical scheme of the invention is as follows:
A people number estimation method based on a generative adversarial network model.
The generative adversarial network model comprises three sub-networks: a generator network G, a discriminating network D and a regression network R.
The generator network G comprises four consecutive convolution + batch normalization + max pooling stages and one convolution + batch normalization stage.
The discriminating network D comprises four consecutive up-sampling and convolution stages; an estimate of the density map is obtained from the output of the discriminating network D.
The regression network R is a fully connected network. The regression network R has four different inputs: the output of the generator network G after the second convolution + batch normalization + max pooling, the output of the generator network G after the third convolution + batch normalization + max pooling, the output of the generator network G after the fourth convolution + batch normalization + max pooling, and the output of the generator network G after the last convolution + batch normalization. The four different inputs of the regression network R are each passed through a different SE-Net to obtain four re-weighted inputs, and the four re-weighted inputs are fed into a three-layer fully connected network to obtain the predicted value of the number of people.
The generative adversarial network model is inspired by the two-player zero-sum game in game theory, and comprises a generative model (the generator network G) and a discriminative model (the discriminating network D). The generative model captures the distribution of the sample data, while the discriminative model is a binary classifier that judges whether its input is real data or a generated sample. The optimization of the model is a binary minimax game: during training one side is fixed while the parameters of the other model are updated, and the two sides are iterated alternately.
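The binary minimax game described above can be illustrated with the following minimal PyTorch sketch of alternating updates: the discriminator is updated while the generator is held fixed, and vice versa. The toy one-dimensional data, the network sizes and the binary cross-entropy loss are illustrative assumptions and do not reproduce the three-sub-network model of the invention.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator for illustrating alternating updates only.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
D = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(100):
    z = torch.randn(32, 8)            # noise input to the generator
    real = torch.randn(32, 4) + 3.0   # toy "real" samples

    # 1) Fix G, update D: classify real versus generated samples.
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(32, 1)) + \
             bce(D(G(z).detach()), torch.zeros(32, 1))
    d_loss.backward()
    opt_d.step()

    # 2) Fix D, update G: make generated samples look real to D.
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(32, 1))
    g_loss.backward()
    opt_g.step()
```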
The method comprises the following steps:
A. training process
(1) Obtain multi-scale data, where the multi-scale data refers to a multi-scale data training set (I, M, C); each sample is represented as (I_i, M_i, C_i), where I_i denotes image i, M_i denotes the density map of image i, and C_i denotes the number of people in image i;
preferably, in step (1), acquiring multi-scale data includes:
(i) Randomly crop each image in the image database to obtain M image blocks of size a × b and N image blocks of size c × d, where M ranges from 1 to 100, N ranges from 1 to 100, a ranges from 1 to 320, b from 1 to 240, c from 1 to 320 and d from 1 to 240, with a, b, c and d given in pixels;
Further preferably, in step (i), each image in the image database is randomly cropped to obtain 5 image blocks of size 120 × 80 and 5 image blocks of size 150 × 100.
(ii) Adjust the resolution of each image in the image database, and of each image block randomly cropped in step (i), to e × f, where e ranges from 80 to 640 and f from 60 to 480;
Further preferably, in step (ii), the resolution of each image in the image database, and of each image block randomly cropped in step (i), is adjusted to 320 × 240.
(iii) Sequentially apply horizontal flipping, vertical flipping, central symmetric transformation and Gaussian noise addition to each image and each image block in the image database; after these 4 operations a new image set is obtained, denoted I;
(iv) Mark the head positions in each image of the new image set I to obtain the annotation template set of the image set I, denoted L, together with the set C of the numbers of people in all images of the new image set I;
(v) Process each image in the annotation template set L with formula (II) to obtain the density map set of the image set I, denoted M:

M_i(x, y) = Σ_k (1 / (2πσ²)) · exp(−((x − x_k)² + (y − y_k)²) / (2σ²))   (II)

In formula (II), {(x_k, y_k), 0 ≤ k ≤ C_i} denotes the pixel positions of the people marked in image i, C_i denotes the number of people in image i, M_i(x, y) denotes the density map corresponding to image i, accumulated onto an all-zero matrix 0_{e×f} of size e × f, σ is the standard deviation, and i is the index of the image;
More preferably, σ is 3.0.
(vi) Obtain the multi-scale training data set (I, M, C); each sample is represented as (I_i, M_i, C_i), where I_i denotes image i, M_i denotes the density map of image i, and C_i denotes the number of people in image i;
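A sketch of the data preparation in steps (i) to (vi) above, using NumPy and SciPy; the helper names, the grayscale dummy image and the noise amplitude are illustrative assumptions, and the resizing of step (ii) (for instance with cv2.resize) is omitted. The density map follows formula (II): a normalized Gaussian of standard deviation σ = 3 is placed at every marked head position, so the map sums approximately to the number of people C_i.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def random_crop(img, w, h, rng):
    """Step (i): crop a w x h block at a random position."""
    y = rng.integers(0, img.shape[0] - h + 1)
    x = rng.integers(0, img.shape[1] - w + 1)
    return img[y:y + h, x:x + w]

def augment(img, rng):
    """Step (iii): horizontal flip, vertical flip, central symmetry, Gaussian noise."""
    return [np.fliplr(img), np.flipud(img), np.flipud(np.fliplr(img)),
            np.clip(img + rng.normal(0.0, 5.0, img.shape), 0, 255)]

def density_map(heads, width=320, height=240, sigma=3.0):
    """Formula (II): accumulate a unit impulse per head on an all-zero e x f
    matrix and smooth it with a Gaussian of standard deviation sigma."""
    m = np.zeros((height, width), dtype=np.float64)
    for x, y in heads:
        m[int(y), int(x)] += 1.0
    return gaussian_filter(m, sigma)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, (480, 640)).astype(np.float64)  # dummy grayscale image
block = random_crop(image, 150, 100, rng)                    # one 150 x 100 block
views = augment(block, rng)                                  # 4 augmented versions
M_i = density_map([(100.5, 80.2), (200.0, 120.7)])           # dummy head annotations
print(round(M_i.sum(), 2))                                   # ~2.0, i.e. C_i = 2
```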
(2) using generator networks
Figure BDA0001879368540000042
Generating a feature map set of the image:
a. adopting 8 matrixes with the scale of 3 multiplied by 3 and 16 matrixes with the scale of 3 multiplied by 3 as convolution kernels, and adopting a random orthogonal matrix to initialize the convolution kernels, wherein the random orthogonal matrix is formed by [0, 1]The uniformly distributed random number matrix is obtained by SVD (singular value decomposition); respectively adopting different convolution cores to convolute the input image of the new image set I, and respectively and sequentially carrying out batch normalization processing, linear rectification activation function and maximum pooling to obtain an output image set, namely a feature map set
Figure BDA0001879368540000044
b. Adopting 32 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000045
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map set
Figure BDA0001879368540000046
c. Adopting 64 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000047
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map set
Figure BDA0001879368540000048
d. Adopting 128 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000049
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map set Ig
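A PyTorch sketch of the generator network G from steps a to d: four convolution + batch normalization + ReLU + max pooling stages with 16, 32, 64 and 128 kernels of size 3 × 3 and random orthogonal initialization (only the four stages of step (2) are shown). The RGB input, the padding, the 2 × 2 pooling and the reading of step a as an 8-kernel convolution followed by a 16-kernel stage are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def conv_bn_pool(in_ch, out_ch):
    """3x3 convolution + batch normalization + ReLU + 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class Generator(nn.Module):
    """Generator network G: four conv+BN+pool stages producing the feature
    map sets I_g^(1), I_g^(2), I_g^(3) and I_g."""
    def __init__(self):
        super().__init__()
        # Step a is read here as an 8-kernel convolution followed by a
        # 16-kernel stage (an assumption); steps b-d use 32, 64, 128 kernels.
        self.stem = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(inplace=True))
        self.stage1 = conv_bn_pool(8, 16)
        self.stage2 = conv_bn_pool(16, 32)
        self.stage3 = conv_bn_pool(32, 64)
        self.stage4 = conv_bn_pool(64, 128)
        for m in self.modules():              # random orthogonal initialization
            if isinstance(m, nn.Conv2d):
                nn.init.orthogonal_(m.weight)

    def forward(self, x):
        i1 = self.stage1(self.stem(x))        # I_g^(1): 16 channels
        i2 = self.stage2(i1)                  # I_g^(2): 32 channels
        i3 = self.stage3(i2)                  # I_g^(3): 64 channels
        ig = self.stage4(i3)                  # I_g    : 128 channels
        return i1, i2, i3, ig

i1, i2, i3, ig = Generator()(torch.randn(1, 3, 240, 320))
print(ig.shape)  # torch.Size([1, 128, 15, 20])
```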
(3) Use the discriminating network D to generate the estimated density map:
Use 64 matrices of size 3 × 3, 32 matrices of size 3 × 3, 16 matrices of size 3 × 3 and 8 matrices of size 3 × 3 as convolution kernels, and initialize the convolution kernels with random orthogonal matrices. Up-sample the feature map set I_g and convolve the up-sampled feature maps with the different convolution kernels in turn, obtaining the output image, namely the estimated density map corresponding to the input image of the new image set I;
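A companion sketch of the discriminating network D from step (3): four up-sampling + convolution stages with 64, 32, 16 and 8 kernels of size 3 × 3 and orthogonal initialization, mapping the 128-channel feature maps I_g back to an estimated density map. The nearest-neighbour up-sampling mode, the ReLU activations and the final 1-channel projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

def up_conv(in_ch, out_ch):
    """2x up-sampling followed by a 3x3 convolution."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class Discriminator(nn.Module):
    """Discriminating network D: turns the feature map set I_g into an
    estimated density map at the input resolution."""
    def __init__(self):
        super().__init__()
        self.decode = nn.Sequential(
            up_conv(128, 64), up_conv(64, 32), up_conv(32, 16), up_conv(16, 8),
            nn.Conv2d(8, 1, kernel_size=1),    # assumed 1-channel density output
        )
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.orthogonal_(m.weight)  # random orthogonal initialization

    def forward(self, ig):
        return self.decode(ig)

density = Discriminator()(torch.randn(1, 128, 15, 20))
print(density.shape)  # torch.Size([1, 1, 240, 320])
```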
(4) Extract attention features with SE-Net:
e. Apply global average pooling to I_g^(1) to obtain the feature vector v_g^(1); apply global average pooling to I_g^(2) to obtain the feature vector v_g^(2); apply global average pooling to I_g^(3) to obtain the feature vector v_g^(3); apply global average pooling to I_g to obtain the feature vector v_g.
f. Use a multilayer perceptron MLP_g^(1) with 16 neural units at the input, 1 neural unit in the hidden layer and 16 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid (S-function) activation, obtaining a 16-dimensional feature vector v'_g^(1).
Meanwhile, use a multilayer perceptron MLP_g^(2) with 32 neural units at the input, 1 neural unit in the hidden layer and 32 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 32-dimensional feature vector v'_g^(2).
Meanwhile, use a multilayer perceptron MLP_g^(3) with 64 neural units at the input, 1 neural unit in the hidden layer and 64 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 64-dimensional feature vector v'_g^(3).
Meanwhile, use a multilayer perceptron MLP_g with 128 neural units at the input, 1 neural unit in the hidden layer and 128 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 128-dimensional feature vector v'_g.
The extracted attention features comprise: the 16-dimensional feature vector v'_g^(1), the 32-dimensional feature vector v'_g^(2), the 64-dimensional feature vector v'_g^(3) and the 128-dimensional feature vector v'_g.
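A sketch of one SE-Net attention branch from step (4): global average pooling squeezes a C-channel feature map to a C-dimensional vector v, and a two-layer perceptron with a single hidden unit (C → 1 → C), with ReLU and sigmoid activations, produces the attention vector v'. The uniform initialization ranges of the original are not reproduced here and PyTorch defaults are used instead, which is an assumption.

```python
import torch
import torch.nn as nn

class SEBranch(nn.Module):
    """One SE-Net branch: squeeze (global average pooling) plus excitation
    (C -> 1 -> C perceptron with ReLU then sigmoid), as in step (4)."""
    def __init__(self, channels):
        super().__init__()
        self.fc1 = nn.Linear(channels, 1)     # hidden layer with 1 neural unit
        self.fc2 = nn.Linear(1, channels)
        nn.init.zeros_(self.fc1.bias)         # bias terms initialized to 0
        nn.init.zeros_(self.fc2.bias)

    def forward(self, feat):                  # feat: (N, C, H, W)
        v = feat.mean(dim=(2, 3))             # global average pooling -> (N, C)
        v_prime = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))
        return v_prime                        # attention vector v' in [0, 1]^C

# One branch per feature map set: 16-, 32-, 64- and 128-dimensional vectors.
branches = [SEBranch(c) for c in (16, 32, 64, 128)]
v16 = branches[0](torch.randn(1, 16, 120, 160))
print(v16.shape)  # torch.Size([1, 16])
```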
(5) Re-weight the feature maps with the attention features:
Multiply all pixels of each image in the feature map set I_g^(1) by the corresponding component of the feature vector v'_g^(1); the re-weighted feature map set is denoted I'_g^(1). Multiply all pixels of each image in the feature map set I_g^(2) by the corresponding component of the feature vector v'_g^(2); the re-weighted feature map set is denoted I'_g^(2). Multiply all pixels of each image in the feature map set I_g^(3) by the corresponding component of the feature vector v'_g^(3); the re-weighted feature map set is denoted I'_g^(3). Multiply all pixels of each image in the feature map set I_g by the corresponding component of the feature vector v'_g; the re-weighted feature map set is denoted I'_g.
(6) Calculate the number of people in the image with the regression network R:
g. Use a fully connected layer MLP_R with 26400 neural units at the input and 1 neural unit at the output; initialize the weight matrix W_R of the fully connected layer from a uniform distribution between given minimum and maximum values, and initialize the bias term b to 0;
h. Use the fully connected layer MLP_R to process I'_g^(1), I'_g^(2), I'_g^(3) and I'_g simultaneously, and apply the rectified linear (ReLU) activation function to obtain a 1-dimensional scalar ĉ; the scalar ĉ is the number of people in the image;
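A sketch of steps (5) and (6): each feature map set is re-weighted channel-wise by its attention vector, the four re-weighted sets are flattened and concatenated, and a fully connected layer with a single output followed by ReLU gives the predicted count ĉ. How the patent arrives at exactly 26400 input units is not fully specified, so the sketch infers the input dimension from the tensors with a lazy linear layer; the dummy feature map sizes are assumptions.

```python
import torch
import torch.nn as nn

class CountRegressor(nn.Module):
    """Steps (5)-(6): re-weight each feature map set with its attention vector,
    flatten and concatenate, then map to a single people count with a fully
    connected layer and ReLU."""
    def __init__(self):
        super().__init__()
        self.fc = nn.LazyLinear(1)   # input size inferred from the data (see lead-in)

    def forward(self, feats, attns):
        weighted = [f * a.unsqueeze(-1).unsqueeze(-1)      # channel-wise re-weighting,
                    for f, a in zip(feats, attns)]          # I'_g^(k) = I_g^(k) * v'_g^(k)
        flat = torch.cat([w.flatten(1) for w in weighted], dim=1)
        return torch.relu(self.fc(flat)).squeeze(1)         # scalar count per image

# Dummy feature map sets and attention vectors with the channel sizes used above.
feats = [torch.randn(1, c, h, w) for c, (h, w) in
         zip((16, 32, 64, 128), ((120, 160), (60, 80), (30, 40), (15, 20)))]
attns = [torch.rand(1, c) for c in (16, 32, 64, 128)]
print(CountRegressor()(feats, attns).shape)  # torch.Size([1])
```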
(7) network training;
i. Define the loss function, i.e. the objective function to be optimized, as formula (I) (the formula itself is reproduced only as an image in the original publication). In formula (I), Loss denotes the value of the loss function; λ1 denotes the weight of the error produced by the discriminator; G(I_i) denotes the output obtained by passing image I_i through the generator network G; λ2 denotes the weight of the error produced by the generator; D(G(I_i)) denotes the result of passing G(I_i) through the discriminator network D; m denotes the number of samples after the training set has been augmented, i.e. m = 70400; I_i denotes an input image, C_i the number of people in the image, and M_i the density map corresponding to the image; C_i serves as the primary supervision signal and M_i as the secondary supervision signal;
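Because formula (I) is available only as an image, the sketch below gives one hedged reading of the loss: a squared error between the predicted and true people counts C_i (primary supervision) plus a pixel-wise squared error between the estimated density map D(G(I_i)) and M_i (secondary supervision) weighted by λ1. The λ2 generator term is omitted because its exact form cannot be recovered from the text; everything here is an assumption, not the patented formula.

```python
import torch

def formula_I_sketch(pred_count, true_count, est_density, true_density, lambda1=1.0):
    """Hedged reading of formula (I): count regression error (primary signal C_i)
    plus lambda1 times the density-map error (secondary signal M_i). The lambda2
    generator term of the original formula is not reproduced here."""
    regression_error = torch.mean((pred_count - true_count) ** 2)
    density_error = torch.mean((est_density - true_density) ** 2)
    return regression_error + lambda1 * density_error

loss = formula_I_sketch(torch.rand(4), torch.rand(4) * 50,
                        torch.rand(4, 1, 240, 320), torch.rand(4, 1, 240, 320))
print(float(loss) > 0)  # True
```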
j. The generator network G uses the Adam optimization algorithm with initial learning rate g_base_lr; the discriminating network D uses the RMSprop optimization algorithm with initial learning rate d_base_lr; the regression network R uses the Adam optimization algorithm with initial learning rate r_base_lr; g_base_lr ranges from 0.000001 to 1, d_base_lr ranges from 0.000001 to 1, and r_base_lr ranges from 0.000001 to 1;
Further preferably, g_base_lr is 0.00001, d_base_lr is 0.0002, and r_base_lr is 0.0001.
k. The following steps are iterated m times:
① Randomly sample m images {I_1, I_2, …, I_m} from the training set;
② Randomly sample from the training set the density maps {M_1, M_2, …, M_m} corresponding to the m images;
③ Compute the gradient of the discriminating network D, i.e. the gradient ∇θ_d of the training error of the discriminating network D with respect to its parameters θ_d;
④ Update the parameters of the discriminating network D with the RMSprop optimization algorithm;
⑤ Randomly sample m images {I_1, I_2, …, I_m} from the training set;
⑥ Randomly sample from the training set the people-number labels {C_1, C_2, …, C_m} corresponding to the m images;
⑦ Compute the gradient of the generator network G, i.e. the gradient ∇θ_g of the training error of the generator network G with respect to its parameters θ_g;
⑧ Update the parameters of the generator network G with the Adam optimization algorithm;
⑨ Randomly sample m images {I_1, I_2, …, I_m} from the training set;
⑩ Randomly sample from the training set the people-number labels {C_1, C_2, …, C_m} corresponding to the m images;
⑪ Compute the gradient of the regression network R, i.e. the gradient ∇θ_r of the training error of the regression network R with respect to its parameters θ_r;
⑫ Update the parameters of the regression network R with the Adam optimization algorithm.
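A sketch of the alternating procedure of steps j and k: the discriminating network D is updated with RMSprop (d_base_lr = 0.0002), the generator network G with Adam (g_base_lr = 0.00001) and the regression network R with Adam (r_base_lr = 0.0001), each on freshly sampled mini-batches. The stand-in modules, the mean-squared-error losses and the pairing of each loss with a sub-network are assumptions; the real sub-networks are the ones described in steps (2) to (6).

```python
import torch
import torch.nn as nn

# Stand-ins so the loop runs on its own; the real G, D and R are described above.
G = nn.Sequential(nn.Conv2d(3, 128, 3, padding=1), nn.ReLU())
D = nn.Sequential(nn.Conv2d(128, 1, 3, padding=1))
R = nn.Sequential(nn.Flatten(), nn.Linear(128 * 30 * 40, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-5)      # g_base_lr = 0.00001
opt_d = torch.optim.RMSprop(D.parameters(), lr=2e-4)   # d_base_lr = 0.0002
opt_r = torch.optim.Adam(R.parameters(), lr=1e-4)      # r_base_lr = 0.0001
mse = nn.MSELoss()

def sample_batch(batch=2):
    """Stand-in for randomly sampling images, density maps and counts
    from the augmented training set."""
    return (torch.rand(batch, 3, 30, 40),
            torch.rand(batch, 1, 30, 40),
            torch.rand(batch, 1) * 50)

for it in range(3):                          # step k: iterated m times in the patent
    # steps 1-4: sample images and density maps, update the discriminating network D
    imgs, dens, _ = sample_batch()
    opt_d.zero_grad()
    mse(D(G(imgs).detach()), dens).backward()
    opt_d.step()

    # steps 5-8: sample images and labels, update the generator network G
    imgs, dens, _ = sample_batch()
    opt_g.zero_grad()
    mse(D(G(imgs)), dens).backward()         # assumed generator objective
    opt_g.step()

    # steps 9-12: sample images and people-number labels, update the regression network R
    imgs, _, cnts = sample_batch()
    opt_r.zero_grad()
    mse(R(G(imgs).detach()), cnts).backward()
    opt_r.step()
```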
B. the testing process comprises the following steps:
Initialize the network with the network parameters obtained in step (7), take the test image as the input of the network, and the network directly outputs the number of people in the image.
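A minimal sketch of the testing process: the network is initialized with the parameters obtained in step (7) and the test image is fed in, with the count returned directly. The stand-in model, the file name counter.pt and the 320 × 240 RGB input are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for the full counting network; in practice this would wrap the
# generator, the SE branches and the regression network trained in step A.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

torch.save(model.state_dict(), "counter.pt")     # parameters obtained in step (7)

model.load_state_dict(torch.load("counter.pt"))  # B. initialize the network
model.eval()
with torch.no_grad():
    test_image = torch.rand(1, 3, 240, 320)      # a test image as network input
    count = model(test_image).item()             # the network outputs the count
print(count)
```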
The invention has the beneficial effects that:
1. The invention provides a feature extraction algorithm based on a generative adversarial network, which makes full use of the implicit feature representation capability of the generative network and applies multi-task learning to strengthen the generalization capability of the model;
2. The invention uses an attention model so that the adjustment of the network parameters focuses on the features that influence accuracy;
3. The training algorithm of the adversarial regression model provided by the invention adopts alternating training and random sampling, which avoids over-fitting.
Drawings
Figure 1 is an architectural diagram of a multi-column convolutional network proposed by Zhang et al.
Fig. 2 is an architecture diagram of a multi-branch convolutional network based on multi-scale blocks proposed by Daniel et al.
Fig. 3 is an architecture diagram of the combination of a residual network (ResNet), a fully connected network and a Markov random field proposed by Han et al.
Fig. 4 is a structural block diagram of the generative adversarial network model proposed by the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples of the description, but is not limited thereto.
Example 1
A people number estimation method based on a generative adversarial network model, where the generative adversarial network model comprises three sub-networks, as shown in Fig. 4: a generator network G, a discriminating network D and a regression network R.
The generator network G comprises four consecutive convolution + batch normalization + max pooling stages and one convolution + batch normalization stage.
The discriminating network D comprises four consecutive up-sampling and convolution stages; an estimate of the density map is obtained from the output of the discriminating network D.
The regression network R is a fully connected network. The regression network R has four different inputs: the output of the generator network G after the second convolution + batch normalization + max pooling, the output of the generator network G after the third convolution + batch normalization + max pooling, the output of the generator network G after the fourth convolution + batch normalization + max pooling, and the output of the generator network G after the last convolution + batch normalization. The four different inputs of the regression network R are each passed through a different SE-Net to obtain four re-weighted inputs, and the four re-weighted inputs are fed into a three-layer fully connected network to obtain the predicted value of the number of people.
The generative adversarial network model is inspired by the two-player zero-sum game in game theory, and comprises a generative model (the generator network G) and a discriminative model (the discriminating network D). The generative model captures the distribution of the sample data, while the discriminative model is a binary classifier that judges whether its input is real data or a generated sample. The optimization of the model is a binary minimax game: during training one side is fixed while the parameters of the other model are updated, and the two sides are iterated alternately.
The method comprises the following steps:
A. training process
(1) Obtain multi-scale data, where the multi-scale data refers to a multi-scale data training set (I, M, C); each sample is represented as (I_i, M_i, C_i), where I_i denotes image i, M_i denotes the density map of image i, and C_i denotes the number of people in image i; this comprises the following steps:
(i) Randomly crop each image in the image database to obtain M image blocks of size a × b and N image blocks of size c × d, where M ranges from 1 to 100, N ranges from 1 to 100, a ranges from 1 to 320, b from 1 to 240, c from 1 to 320 and d from 1 to 240, with a, b, c and d given in pixels;
(ii) Adjust the resolution of each image in the image database, and of each image block randomly cropped in step (i), to e × f, where e ranges from 80 to 640 and f from 60 to 480;
(iii) Sequentially apply horizontal flipping, vertical flipping, central symmetric transformation and Gaussian noise addition to each image and each image block in the image database; after these 4 operations a new image set is obtained, denoted I;
(iv) Mark the head positions in each image of the new image set I to obtain the annotation template set of the image set I, denoted L, together with the set C of the numbers of people in all images of the new image set I;
(v) Process each image in the annotation template set L with formula (II) to obtain the density map set of the image set I, denoted M:

M_i(x, y) = Σ_k (1 / (2πσ²)) · exp(−((x − x_k)² + (y − y_k)²) / (2σ²))   (II)

In formula (II), {(x_k, y_k), 0 ≤ k ≤ C_i} denotes the pixel positions of the people marked in image i, C_i denotes the number of people in image i, M_i(x, y) denotes the density map corresponding to image i, accumulated onto an all-zero matrix 0_{e×f} of size e × f, σ is the standard deviation, and i is the index of the image; σ is 3.0.
(vi) Obtain the multi-scale training data set (I, M, C); each sample is represented as (I_i, M_i, C_i), where I_i denotes image i, M_i denotes the density map of image i, and C_i denotes the number of people in image i;
(2) using generator networks
Figure BDA0001879368540000102
Generating a feature map set of the image:
a. adopting 8 matrixes with the scale of 3 multiplied by 3 and 16 matrixes with the scale of 3 multiplied by 3 as convolution kernels, and adopting a random orthogonal matrix to initialize the convolution kernels, wherein the random orthogonal matrix is formed by [0, 1]The uniformly distributed random number matrix is obtained by SVD (singular value decomposition); respectively adopting different convolution cores to convolute the input image of the new image set I, and respectively and sequentially carrying out batch normalization processing, linear rectification activation function and maximum pooling to obtain an output image set, namely a feature map set
Figure BDA0001879368540000103
b. Adopting 32 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000104
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map setCombination of Chinese herbs
Figure BDA0001879368540000105
c. Adopting 64 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000106
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map set
Figure BDA0001879368540000107
d. Adopting 128 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000108
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map set Ig
(3) Use the discriminating network D to generate the estimated density map: use 64 matrices of size 3 × 3, 32 matrices of size 3 × 3, 16 matrices of size 3 × 3 and 8 matrices of size 3 × 3 as convolution kernels, and initialize the convolution kernels with random orthogonal matrices; up-sample the feature map set I_g and convolve the up-sampled feature maps with the different convolution kernels in turn, obtaining the output image, namely the estimated density map corresponding to the input image of the new image set I;
(4) Extract attention features with SE-Net:
e. Apply global average pooling to I_g^(1) to obtain the feature vector v_g^(1); apply global average pooling to I_g^(2) to obtain the feature vector v_g^(2); apply global average pooling to I_g^(3) to obtain the feature vector v_g^(3); apply global average pooling to I_g to obtain the feature vector v_g.
f. Use a multilayer perceptron MLP_g^(1) with 16 neural units at the input, 1 neural unit in the hidden layer and 16 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid (S-function) activation, obtaining a 16-dimensional feature vector v'_g^(1).
Meanwhile, use a multilayer perceptron MLP_g^(2) with 32 neural units at the input, 1 neural unit in the hidden layer and 32 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 32-dimensional feature vector v'_g^(2).
Meanwhile, use a multilayer perceptron MLP_g^(3) with 64 neural units at the input, 1 neural unit in the hidden layer and 64 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 64-dimensional feature vector v'_g^(3).
Meanwhile, use a multilayer perceptron MLP_g with 128 neural units at the input, 1 neural unit in the hidden layer and 128 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 128-dimensional feature vector v'_g.
The extracted attention features comprise: the 16-dimensional feature vector v'_g^(1), the 32-dimensional feature vector v'_g^(2), the 64-dimensional feature vector v'_g^(3) and the 128-dimensional feature vector v'_g.
(5) Re-weight the feature maps with the attention features:
Multiply all pixels of each image in the feature map set I_g^(1) by the corresponding component of the feature vector v'_g^(1); the re-weighted feature map set is denoted I'_g^(1). Multiply all pixels of each image in the feature map set I_g^(2) by the corresponding component of the feature vector v'_g^(2); the re-weighted feature map set is denoted I'_g^(2). Multiply all pixels of each image in the feature map set I_g^(3) by the corresponding component of the feature vector v'_g^(3); the re-weighted feature map set is denoted I'_g^(3). Multiply all pixels of each image in the feature map set I_g by the corresponding component of the feature vector v'_g; the re-weighted feature map set is denoted I'_g.
(6) Calculate the number of people in the image with the regression network R:
g. Use a fully connected layer MLP_R with 26400 neural units at the input and 1 neural unit at the output; initialize the weight matrix W_R of the fully connected layer from a uniform distribution between given minimum and maximum values, and initialize the bias term b to 0;
h. Use the fully connected layer MLP_R to process I'_g^(1), I'_g^(2), I'_g^(3) and I'_g simultaneously, and apply the rectified linear (ReLU) activation function to obtain a 1-dimensional scalar ĉ; the scalar ĉ is the number of people in the image;
(7) network training;
i. Define the loss function, i.e. the objective function to be optimized, as formula (I) (the formula itself is reproduced only as an image in the original publication). In formula (I), Loss denotes the value of the loss function; λ1 denotes the weight of the error produced by the discriminator; G(I_i) denotes the output obtained by passing image I_i through the generator network G; λ2 denotes the weight of the error produced by the generator; D(G(I_i)) denotes the result of passing G(I_i) through the discriminator network D; m denotes the number of samples after the training set has been augmented, i.e. m = 70400; I_i denotes an input image, C_i the number of people in the image, and M_i the density map corresponding to the image; C_i serves as the primary supervision signal and M_i as the secondary supervision signal;
j. The generator network G uses the Adam optimization algorithm with initial learning rate g_base_lr; the discriminating network D uses the RMSprop optimization algorithm with initial learning rate d_base_lr; the regression network R uses the Adam optimization algorithm with initial learning rate r_base_lr; g_base_lr ranges from 0.000001 to 1, d_base_lr ranges from 0.000001 to 1, and r_base_lr ranges from 0.000001 to 1;
k. The following steps are iterated m times:
① Randomly sample m images {I_1, I_2, …, I_m} from the training set;
② Randomly sample from the training set the density maps {M_1, M_2, …, M_m} corresponding to the m images;
③ Compute the gradient of the discriminating network D, i.e. the gradient ∇θ_d of the training error of the discriminating network D with respect to its parameters θ_d;
④ Update the parameters of the discriminating network D with the RMSprop optimization algorithm;
⑤ Randomly sample m images {I_1, I_2, …, I_m} from the training set;
⑥ Randomly sample from the training set the people-number labels {C_1, C_2, …, C_m} corresponding to the m images;
⑦ Compute the gradient of the generator network G, i.e. the gradient ∇θ_g of the training error of the generator network G with respect to its parameters θ_g;
⑧ Update the parameters of the generator network G with the Adam optimization algorithm;
⑨ Randomly sample m images {I_1, I_2, …, I_m} from the training set;
⑩ Randomly sample from the training set the people-number labels {C_1, C_2, …, C_m} corresponding to the m images;
⑪ Compute the gradient of the regression network R, i.e. the gradient ∇θ_r of the training error of the regression network R with respect to its parameters θ_r;
⑫ Update the parameters of the regression network R with the Adam optimization algorithm.
B. the testing process comprises the following steps:
Initialize the network with the network parameters obtained in step (7), take the test image as the input of the network, and the network directly outputs the number of people in the image.
Example 2
The people number estimation method based on the generative adversarial network model according to Embodiment 1, wherein:
In step (i), each image in the image database is randomly cropped to obtain 5 image blocks of size 120 × 80 and 5 image blocks of size 150 × 100. This step is applied only to the training set and not to the test set.
In step (ii), the resolution of each image in the image database, and of each image block randomly cropped in step (i), is adjusted to 320 × 240.
The value of g _ base _ lr is 0.00001, the value of d _ base _ lr is 0.0002, and the value of r _ base _ lr is 0.0001.
Algorithm 1 (given as a figure in the original publication) is applied to train the generative adversarial network model.
The method makes full use of the implicit feature representation capability of the generative network and applies multi-task learning, so that the generalization capability of the model is stronger; an attention model is used, so that the adjustment of the network parameters focuses on the features that influence accuracy; and the method adopts alternating training and random sampling, thereby avoiding over-fitting.
The effects of the present invention can be further illustrated by experiments. Table 1 compares the prediction error of the present invention on the MALL test set with the methods of Zhang et al., Daniel et al. and Han et al., where "(calculated using true density maps)" in Table 1 means that the sum of the pixels of the true density map is taken as the true number of people in the image.
TABLE 1 (the table is reproduced as an image in the original publication)
As can be seen from Table 1, the method of the present invention is more accurate than the other four methods.

Claims (10)

1. A people number estimation method based on a generative adversarial network model, characterized in that the generative adversarial network model comprises three sub-networks: a generator network G, a discriminating network D and a regression network R; the generator network G comprises four consecutive convolution + batch normalization + max pooling stages and one convolution + batch normalization stage; the discriminating network D comprises four consecutive up-sampling and convolution stages, and an estimate of the density map is obtained from the output of the discriminating network D; the regression network R is a fully connected network; the regression network R has four different inputs, including: the output of the generator network G after the second convolution + batch normalization + max pooling, the output of the generator network G after the third convolution + batch normalization + max pooling, the output of the generator network G after the fourth convolution + batch normalization + max pooling, and the output of the generator network G after the last convolution + batch normalization; the four different inputs of the regression network R are each passed through a different SE-Net to obtain four re-weighted inputs, and the four re-weighted inputs are fed into a three-layer fully connected network to obtain the predicted value of the number of people; the method comprises the following steps:
A. Training process
(1) Obtaining multi-scale data, where the multi-scale data refers to a multi-scale data training set (I, M, C); each sample is represented as (I_i, M_i, C_i), where I_i denotes image i, M_i denotes the density map of image i, and C_i denotes the number of people in image i;
(2) Using the generator network G to generate the feature map sets of the images;
(3) Using the discriminating network D to generate the estimated density map;
(4) Extracting attention features with SE-Net;
(5) Re-weighting the feature maps with the attention features;
(6) Calculating the number of people in the image with the regression network R;
(7) Network training;
B. The testing process comprises the following steps:
Initializing the network with the network parameters obtained in step (7), taking the test image as the input of the network, and the network directly outputting the number of people in the image.
2. The people number estimation method based on the generative adversarial network model as claimed in claim 1, wherein in step (2), using the generator network G to generate the feature map sets of the images comprises the following steps:
a. Using 8 matrices of size 3 × 3 and 16 matrices of size 3 × 3 as convolution kernels, and initializing the convolution kernels with random orthogonal matrices, where a random orthogonal matrix is obtained by applying SVD to a matrix of random numbers uniformly distributed over [0, 1]; convolving the input images of the new image set I with the different convolution kernels, and sequentially applying batch normalization, the rectified linear activation function and max pooling to obtain the output image set, i.e. the feature map set I_g^(1);
b. Using 32 matrices of size 3 × 3 as convolution kernels, initializing them with random orthogonal matrices, convolving the feature map set I_g^(1) with these kernels, and sequentially applying batch normalization, the rectified linear activation function and max pooling to obtain the output image set, i.e. the feature map set I_g^(2);
c. Using 64 matrices of size 3 × 3 as convolution kernels, initializing them with random orthogonal matrices, convolving the feature map set I_g^(2) with these kernels, and sequentially applying batch normalization, the rectified linear activation function and max pooling to obtain the output image set, i.e. the feature map set I_g^(3);
d. Using 128 matrices of size 3 × 3 as convolution kernels, initializing them with random orthogonal matrices, convolving the feature map set I_g^(3) with these kernels, and sequentially applying batch normalization, the rectified linear activation function and max pooling to obtain the output image set, i.e. the feature map set I_g.
3. The people number estimation method based on the generative confrontation network model as claimed in claim 2, wherein in the step (3), the discrimination network D is used to generate an estimated density map, comprising the following steps:
adopting 64 matrices of size 3 × 3, 32 matrices of size 3 × 3, 16 matrices of size 3 × 3 and 8 matrices of size 3 × 3 as convolution kernels, and initializing the convolution kernels with random orthogonal matrices; upsampling the feature map set I_g, and convolving the upsampled feature map set I_g with the different convolution kernels respectively to obtain an output image, namely the estimated density map corresponding to each input image of the new image set I.
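A companion sketch of the discrimination network D of claim 3, again illustrative rather than claimed: the upsampling factor, the activations between the convolutions and the final projection to a single-channel density map are assumptions added so the example runs end to end, and nn.init.orthogonal_ stands in for the SVD-based initialization described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    # decoder-style discrimination network: upsample I_g, then 64/32/16/8-kernel convolutions
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 8, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, kernel_size=1),          # assumed 1-channel density-map output
        )
        for m in self.convs:
            if isinstance(m, nn.Conv2d):
                nn.init.orthogonal_(m.weight)         # stand-in for the SVD-based init

    def forward(self, i_g, out_size=(240, 320)):
        x = F.interpolate(i_g, size=out_size, mode="bilinear", align_corners=False)
        return self.convs(x)                          # estimated density map

density = Discriminator()(torch.rand(1, 128, 15, 20))
print(tuple(density.shape))                           # (1, 1, 240, 320)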
4. The people number estimation method based on the generative confrontation network model as claimed in claim 2, wherein the step (4) of extracting attention features by using SE-Net comprises the following steps:
e. processing the feature map set I_g1 with global average pooling to obtain a feature vector v_g1; processing the feature map set I_g2 with global average pooling to obtain a feature vector v_g2; processing the feature map set I_g3 with global average pooling to obtain a feature vector v_g3; and processing the feature map set I_g with global average pooling to obtain a feature vector v_g;
f. processing the feature vector v_g1 with a multilayer perceptron MLP_1 having 16 neural units at the input, 1 neural unit in the hidden layer and 16 neural units at the output: initializing the first-layer weight matrix of MLP_1 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying a linear rectification activation function; then initializing the second-layer weight matrix of MLP_1 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying an S (sigmoid) activation function to obtain a 16-dimensional feature vector v'_g1;
meanwhile, processing the feature vector v_g2 with a multilayer perceptron MLP_2 having 32 neural units at the input, 1 neural unit in the hidden layer and 32 neural units at the output: initializing the first-layer weight matrix of MLP_2 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying a linear rectification activation function; then initializing the second-layer weight matrix of MLP_2 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying an S activation function to obtain a 32-dimensional feature vector v'_g2;
meanwhile, processing the feature vector v_g3 with a multilayer perceptron MLP_3 having 64 neural units at the input, 1 neural unit in the hidden layer and 64 neural units at the output: initializing the first-layer weight matrix of MLP_3 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying a linear rectification activation function; then initializing the second-layer weight matrix of MLP_3 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying an S activation function to obtain a 64-dimensional feature vector v'_g3;
meanwhile, processing the feature vector v_g with a multilayer perceptron MLP_g having 128 neural units at the input, 1 neural unit in the hidden layer and 128 neural units at the output: initializing the first-layer weight matrix of MLP_g with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying a linear rectification activation function; then initializing the second-layer weight matrix of MLP_g with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying an S activation function to obtain a 128-dimensional feature vector v'_g;
the extracted attention features comprise: the 16-dimensional feature vector v'_g1, the 32-dimensional feature vector v'_g2, the 64-dimensional feature vector v'_g3 and the 128-dimensional feature vector v'_g.
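The SE-Net step of claim 4 amounts, for each feature map set, to a squeeze-and-excitation block: global average pooling followed by a two-layer perceptron with linear rectification and then a sigmoid, yielding one attention weight per channel. A minimal sketch follows; the single hidden unit mirrors the claim wording, and the uniform-initialization interval is illustrative because the claimed minimum and maximum values are not reproduced in this text.

import torch
import torch.nn as nn

class SEAttention(nn.Module):
    # squeeze (global average pooling) + excitation (2-layer MLP, ReLU then sigmoid)
    def __init__(self, channels, hidden=1):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Linear(channels, hidden)
        self.fc2 = nn.Linear(hidden, channels)
        for fc in (self.fc1, self.fc2):
            nn.init.uniform_(fc.weight, -0.1, 0.1)    # interval is illustrative only
            nn.init.zeros_(fc.bias)                   # bias terms initialized to 0

    def forward(self, x):
        v = self.pool(x).flatten(1)                   # step e: feature vector v_g*
        v = torch.relu(self.fc1(v))                   # first layer + linear rectification
        return torch.sigmoid(self.fc2(v))             # second layer + S function -> v'_g*

v_prime = SEAttention(16)(torch.rand(1, 16, 120, 160))
print(tuple(v_prime.shape))                           # (1, 16): one weight per channel of I_g1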
5. The people number estimation method based on the generative confrontation network model as claimed in claim 4, wherein the step (5) of re-weighting the feature maps with the attention features comprises the following steps:
multiplying all pixels of each image in the feature map set I_g1 by the corresponding component of the feature vector v'_g1 to obtain the re-weighted feature map set I'_g1;
multiplying all pixels of each image in the feature map set I_g2 by the corresponding component of the feature vector v'_g2 to obtain the re-weighted feature map set I'_g2;
multiplying all pixels of each image in the feature map set I_g3 by the corresponding component of the feature vector v'_g3 to obtain the re-weighted feature map set I'_g3;
multiplying all pixels of each image in the feature map set I_g by the corresponding component of the feature vector v'_g to obtain the re-weighted feature map set I'_g.
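The re-weighting of claim 5 is a channel-wise scaling: every pixel of a feature map is multiplied by the attention component of its channel. A one-function sketch with illustrative tensor sizes:

import torch

def reweight(feature_maps, attention):
    # scale every pixel of each channel by that channel's attention component
    # feature_maps: N x C x H x W, attention: N x C
    return feature_maps * attention.unsqueeze(-1).unsqueeze(-1)

i_g1 = torch.rand(1, 16, 120, 160)        # illustrative size for feature map set I_g1
v_g1_prime = torch.rand(1, 16)            # attention vector v'_g1
i_g1_prime = reweight(i_g1, v_g1_prime)   # re-weighted feature map set I'_g1
print(tuple(i_g1_prime.shape))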
6. The people number estimation method based on the generative confrontation network model as claimed in claim 5, wherein in the step (6), the regression network R is used to calculate the number of people in the image, comprising the following steps:
g. using a fully connected layer MLP_R with 26400 neural units at the input and 1 neural unit at the output; initializing the weight matrix W_R of the fully connected layer with a uniform distribution between a preset minimum value and a preset maximum value, and initializing the bias term b to 0;
h. processing I'_g1, I'_g2, I'_g3 and I'_g simultaneously with the fully connected layer MLP_R, and obtaining a 1-dimensional scalar through a linear rectification activation function; the scalar is the number of people in the image.
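Claim 6 flattens and concatenates the re-weighted feature map sets and maps them through a single fully connected layer, 26400 inputs in the patent, followed by linear rectification to obtain the count. In the sketch below the spatial sizes are placeholders, so the layer width differs from 26400, and the uniform-initialization interval is illustrative.

import torch
import torch.nn as nn

# illustrative re-weighted feature map sets I'_g1 ... I'_g
i1 = torch.rand(1, 16, 30, 40)
i2 = torch.rand(1, 32, 15, 20)
i3 = torch.rand(1, 64, 8, 10)
ig = torch.rand(1, 128, 4, 5)

features = torch.cat([t.flatten(1) for t in (i1, i2, i3, ig)], dim=1)

mlp_r = nn.Linear(features.shape[1], 1)        # fully connected layer MLP_R
nn.init.uniform_(mlp_r.weight, -0.01, 0.01)    # interval is illustrative only
nn.init.zeros_(mlp_r.bias)                     # bias term b initialized to 0

count = torch.relu(mlp_r(features))            # linear rectification -> people count
print(float(count))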
7. The people number estimation method based on the generative confrontation network model as claimed in claim 6, wherein in the step (7), the network training comprises the following steps:
i. defining a loss function, i.e. an objective function to be optimized, as shown in equation (II):
[ formula (II): the loss function Loss over the m augmented training samples, combining the error produced by the discriminator, weighted by λ_1, and the error produced by the generator, weighted by λ_2 ]
in the formula (II), Loss represents the value of the loss function, λ_1 represents the weight of the error produced by the discriminator, G(I_i) represents the output of the image I_i passed through the generator network G, λ_2 represents the weight of the error produced by the generator, D(G(I_i)) represents the output of G(I_i) passed through the discrimination network D, m represents the number of samples after the training set has been augmented, I_i represents an input image, c_i represents the number of persons in the image, and M_i represents the density map corresponding to the image;
j. the generator network G adopts the Adam optimization algorithm with an initial learning rate g_base_lr; the discrimination network D adopts the RMSprop optimization algorithm with an initial learning rate d_base_lr; the regression network R adopts the Adam optimization algorithm with an initial learning rate r_base_lr; the value range of g_base_lr is 0.000001-1, the value range of d_base_lr is 0.000001-1, and the value range of r_base_lr is 0.000001-1;
k. the following steps ① to ⑫ are executed, iterating m times:
① randomly sampling m images {I_1, I_2, …, I_m} from the training set;
② randomly sampling the density maps {M_1, M_2, …, M_m} corresponding to the m images from the training set;
③ computing the gradient of the discrimination network D, namely the gradient of the training error of the discrimination network D with respect to its parameters θ_d;
④ updating the parameters of the discrimination network D by using the RMSprop optimization algorithm;
⑤ randomly sampling m images {I_1, I_2, …, I_m} from the training set;
⑥ randomly sampling the density maps {C_1, C_2, …, C_m} corresponding to the m images from the training set;
⑦ computing the gradient of the generator network G, namely the gradient of the training error of the generator network G with respect to its parameters θ_g;
⑧ updating the parameters of the generator network G by using the Adam optimization algorithm;
⑨ randomly sampling m images {I_1, I_2, …, I_m} from the training set;
⑩ randomly sampling the people-number labels {C_1, C_2, …, C_m} corresponding to the m images from the training set;
⑪ computing the gradient of the regression network R, namely the gradient of the training error of the regression network R with respect to its parameters θ_r;
⑫ updating the parameters of the regression network R by using the Adam optimization algorithm.
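The training of claim 7 alternates three updates per iteration: the discrimination network D with RMSprop, the generator network G with Adam, and the regression network R with Adam. Because equation (II) is not reproduced here, the loss terms in the following skeleton are plain mean-squared-error placeholders; G, D and R are assumed to be modules like the ones sketched after claims 2, 3 and 6, and the default learning rates follow claim 10.

import torch
import torch.nn as nn

def make_optimizers(G, D, R, g_base_lr=0.00001, d_base_lr=0.0002, r_base_lr=0.0001):
    # step j: Adam for G and R, RMSprop for D (learning rates as in claim 10)
    return (torch.optim.Adam(G.parameters(), lr=g_base_lr),
            torch.optim.RMSprop(D.parameters(), lr=d_base_lr),
            torch.optim.Adam(R.parameters(), lr=r_base_lr))

def train_iteration(G, D, R, imgs, density_maps, counts, opt_g, opt_d, opt_r):
    mse = nn.MSELoss()

    # steps 1-4: update the discrimination network D with RMSprop
    with torch.no_grad():
        feats = G(imgs)
    d_loss = mse(D(feats[-1]), density_maps)       # placeholder discriminator error
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # steps 5-8: update the generator network G with Adam
    g_loss = mse(D(G(imgs)[-1]), density_maps)     # placeholder generator error
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    # steps 9-12: update the regression network R with Adam against the people counts
    with torch.no_grad():
        feats = G(imgs)
    r_loss = mse(R(feats), counts)                 # placeholder regression error
    opt_r.zero_grad(); r_loss.backward(); opt_r.step()

    return d_loss.item(), g_loss.item(), r_loss.item()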
8. The people number estimation method based on the generative confrontation network model as claimed in claim 1, wherein the step (1) of obtaining multi-scale data comprises:
(i) randomly cropping each image in an image database to obtain M image blocks with the size of a × b and N image blocks with the size of c × d, wherein the value range of M is 1-100, the value range of N is 1-100, the value range of a is 1-320, the value range of b is 1-240, the value range of c is 1-320, the value range of d is 1-240, and the unit of a, b, c and d is pixels;
(ii) adjusting the resolution of each image in the image database and of each image block randomly cropped in the step (i) to e × f, wherein the value range of e is 80-640 and the value range of f is 60-480;
(iii) sequentially performing horizontal flipping, vertical flipping, central-symmetry transformation and Gaussian-noise addition on each image and each image block in the image database to obtain a new image set, denoted as I;
(iv) labeling the head positions in each image of the new image set I to obtain a labeled template image set of the image set I, denoted as L, and a set C of the numbers of people in all the images of the new image set I;
(v) processing each image in the labeled template set L by the formula (I) to obtain a density map set of the image set I, denoted as M:
M_i(x, y) = Σ_k (1/(2πσ²)) · exp( -((x - x_k)² + (y - y_k)²) / (2σ²) ) when C_i > 0, and M_i = 0_{e×f} when C_i = 0    (I)
in the formula (I), {(x_k, y_k), 0 ≤ k ≤ C_i} denotes the pixel positions of the persons marked in the image i, C_i represents the number of persons in the image i, M_i(x, y) represents the density map corresponding to the image i, σ is the standard deviation, i represents the index of the image, and 0_{e×f} represents an all-zero matrix of size e × f;
(vi) obtaining a training set (I, M, C) of multi-scale data, each sample being denoted by (I_i, M_i, C_i), wherein I_i represents the image i, M_i represents the density map of the image i, and C_i represents the number of people in the image i.
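One common reading of formula (I), consistent with the symbols defined above, is a density map built by placing a 2-D Gaussian of standard deviation σ at every labelled head position, with an all-zero e × f matrix for an image containing no people. A NumPy sketch under that reading, with σ = 3.0 as in claim 9 and an illustrative 320 × 240 resolution:

import numpy as np

def density_map(head_positions, height=240, width=320, sigma=3.0):
    # a Gaussian kernel at each labelled head position; 0_{e x f} when no people
    M = np.zeros((height, width), dtype=np.float64)
    if len(head_positions) == 0:
        return M
    ys, xs = np.mgrid[0:height, 0:width]
    for (xk, yk) in head_positions:
        M += np.exp(-((xs - xk) ** 2 + (ys - yk) ** 2) / (2.0 * sigma ** 2)) \
             / (2.0 * np.pi * sigma ** 2)
    return M

heads = [(50, 60), (200, 120), (310, 30)]       # (x, y) pixel positions of marked heads
M = density_map(heads)
print(M.shape, M.sum())                          # the map sums to roughly the head count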
9. The people estimation method based on generative confrontation network model as claimed in claim 8,
in the step (i), each image in the image database is randomly cropped to obtain 5 image blocks with the size of 120 × 80 and the size of 150 × 100;
in the step (ii), the resolution of each image in the image database and each image block randomly intercepted in the step (i) is adjusted to 320 × 240; σ is 3.0.
10. The people number estimation method based on the generative confrontation network model as claimed in claim 7, wherein g _ base _ lr is 0.00001, d _ base _ lr is 0.0002, and r _ base _ lr is 0.0001.
CN201811415565.0A 2018-11-26 2018-11-26 People number estimation method based on generation type confrontation network model Active CN109522857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811415565.0A CN109522857B (en) 2018-11-26 2018-11-26 People number estimation method based on generation type confrontation network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811415565.0A CN109522857B (en) 2018-11-26 2018-11-26 People number estimation method based on generation type confrontation network model

Publications (2)

Publication Number Publication Date
CN109522857A CN109522857A (en) 2019-03-26
CN109522857B true CN109522857B (en) 2021-04-23

Family

ID=65793346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811415565.0A Active CN109522857B (en) 2018-11-26 2018-11-26 People number estimation method based on generation type confrontation network model

Country Status (1)

Country Link
CN (1) CN109522857B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446302A (en) * 2018-01-29 2018-08-24 东华大学 A kind of personalized recommendation system of combination TensorFlow and Spark
CN110008554B (en) * 2019-03-27 2022-10-18 哈尔滨工业大学 Method for optimizing technological parameters and welding tool structure of friction stir welding seam forming prediction based on numerical simulation and deep learning
CN110097185B (en) * 2019-03-29 2021-03-23 北京大学 Optimization model method based on generation of countermeasure network and application
CN109978807B (en) * 2019-04-01 2020-07-14 西北工业大学 Shadow removing method based on generating type countermeasure network
CN110033043B (en) * 2019-04-16 2020-11-10 杭州电子科技大学 Radar one-dimensional range profile rejection method based on condition generation type countermeasure network
CN110120020A (en) * 2019-04-30 2019-08-13 西北工业大学 A kind of SAR image denoising method based on multiple dimensioned empty residual error attention network
CN110335212B (en) * 2019-06-28 2021-01-15 西安理工大学 Defect ancient book Chinese character repairing method based on condition confrontation network
CN110705340B (en) * 2019-08-12 2023-12-26 广东石油化工学院 Crowd counting method based on attention neural network field
CN110503049B (en) * 2019-08-26 2022-05-03 重庆邮电大学 Satellite video vehicle number estimation method based on generation countermeasure network
CN111080501B (en) * 2019-12-06 2024-02-09 中国科学院大学 Real crowd density space-time distribution estimation method based on mobile phone signaling data
CN111429436B (en) * 2020-03-29 2022-03-15 西北工业大学 Intrinsic image analysis method based on multi-scale attention and label loss
CN112326276B (en) * 2020-10-28 2021-07-16 北京航空航天大学 High-speed rail steering system fault detection LSTM method based on generation countermeasure network
CN112818945A (en) * 2021-03-08 2021-05-18 北方工业大学 Convolutional network construction method suitable for subway station crowd counting
CN113421192B (en) * 2021-08-24 2021-11-19 北京金山云网络技术有限公司 Training method of object statistical model, and statistical method and device of target object
CN114972111B (en) * 2022-06-16 2023-01-10 慧之安信息技术股份有限公司 Dense crowd counting method based on GAN image restoration
CN115357218A (en) * 2022-08-02 2022-11-18 北京航空航天大学 High-entropy random number generation method based on chaos prediction antagonistic learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423701A (en) * 2017-07-17 2017-12-01 北京智慧眼科技股份有限公司 The non-supervisory feature learning method and device of face based on production confrontation network
CN107451619A (en) * 2017-08-11 2017-12-08 深圳市唯特视科技有限公司 A kind of small target detecting method that confrontation network is generated based on perception
CN108764085A (en) * 2018-05-17 2018-11-06 上海交通大学 Based on the people counting method for generating confrontation network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474881B2 (en) * 2017-03-15 2019-11-12 Nec Corporation Video retrieval system based on larger pose face frontalization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423701A (en) * 2017-07-17 2017-12-01 北京智慧眼科技股份有限公司 The non-supervisory feature learning method and device of face based on production confrontation network
CN107451619A (en) * 2017-08-11 2017-12-08 深圳市唯特视科技有限公司 A kind of small target detecting method that confrontation network is generated based on perception
CN108764085A (en) * 2018-05-17 2018-11-06 上海交通大学 Based on the people counting method for generating confrontation network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Squeeze-and-Excitation Networks; Jie Hu et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-06-23; pp. 7132-7141 *
Research on pedestrian re-identification algorithms in non-overlapping domains; He Qing et al.; Information Technology; July 2018 (No. 7); pp. 34-38 *

Also Published As

Publication number Publication date
CN109522857A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN109522857B (en) People number estimation method based on generation type confrontation network model
CN112364779B (en) Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN108717568B (en) A kind of image characteristics extraction and training method based on Three dimensional convolution neural network
CN110335261B (en) CT lymph node detection system based on space-time circulation attention mechanism
CN114429156B (en) Radar interference multi-domain characteristic countermeasure learning and detection recognition method
CN106295124B (en) The method of a variety of image detecting technique comprehensive analysis gene subgraph likelihood probability amounts
CN106295694B (en) Face recognition method for iterative re-constrained group sparse representation classification
CN109190537A (en) A kind of more personage's Attitude estimation methods based on mask perceived depth intensified learning
CN109145992A (en) Cooperation generates confrontation network and sky composes united hyperspectral image classification method
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN109002848B (en) Weak and small target detection method based on feature mapping neural network
CN109919241B (en) Hyperspectral unknown class target detection method based on probability model and deep learning
CN110728629A (en) Image set enhancement method for resisting attack
CN114241422A (en) Student classroom behavior detection method based on ESRGAN and improved YOLOv5s
CN109598220A (en) A kind of demographic method based on the polynary multiple dimensioned convolution of input
CN113780242A (en) Cross-scene underwater sound target classification method based on model transfer learning
CN107729926A (en) A kind of data amplification method based on higher dimensional space conversion, mechanical recognition system
CN114428234A (en) Radar high-resolution range profile noise reduction identification method based on GAN and self-attention
CN116482618B (en) Radar active interference identification method based on multi-loss characteristic self-calibration network
CN104778466A (en) Detection method combining various context clues for image focus region
CN115496720A (en) Gastrointestinal cancer pathological image segmentation method based on ViT mechanism model and related equipment
CN113344045A (en) Method for improving SAR ship classification precision by combining HOG characteristics
CN113435276A (en) Underwater sound target identification method based on antagonistic residual error network
CN112800882A (en) Mask face posture classification method based on weighted double-flow residual error network
CN109389101A (en) A kind of SAR image target recognition method based on denoising autoencoder network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant