CN109522857B - People number estimation method based on a generative adversarial network model - Google Patents

People number estimation method based on a generative adversarial network model

Info

Publication number
CN109522857B
CN109522857B (application CN201811415565.0A)
Authority
CN
China
Prior art keywords
image
network
people
convolution
adopting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811415565.0A
Other languages
Chinese (zh)
Other versions
CN109522857A (en)
Inventor
元辉
贺黎恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201811415565.0A priority Critical patent/CN109522857B/en
Publication of CN109522857A publication Critical patent/CN109522857A/en
Application granted granted Critical
Publication of CN109522857B publication Critical patent/CN109522857B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent


Abstract

The invention relates to a people number estimation method based on a generative adversarial network model. It draws on the automatic feature extraction and multiple regression models of deep learning and makes full use of the feature representation capability of generative adversarial networks (GANs). A density map indicating the local crowd density is used as the secondary supervision signal and the number of people in the image as the primary supervision signal; the network is trained with the back-propagation algorithm, and the obtained network parameters are then used to initialize the network and predict the number of people in an unknown image.

Description

People number estimation method based on a generative adversarial network model
Technical Field
The invention relates to a people number estimation method based on a generative adversarial network model, and belongs to the technical field of image processing.
Background
Directly estimating the number of people from an image has long been challenging because of illumination variation, perspective distortion, and disturbance from noisy environments (for example, a background that is a forest, or a wall with strong light reflection). The rise of deep learning techniques in recent years, however, has led researchers and engineers to make extensive use of, and further develop, deep network models. Although automatic people number estimation methods based on deep network models have achieved quite good performance in natural scenes, the existing schemes still have shortcomings, as analysed below.
Zhang et al. [1] propose a multi-column convolutional network, as shown in FIG. 1. The scheme is a single-image crowd counting algorithm based on a multi-column convolutional neural network. The multi-column network comprises three sub-networks with different structures and convolution kernels of different sizes; the input of each sub-network is the same image, and after four convolutions and two poolings the feature maps output by the three sub-networks are concatenated along the channel dimension, after which a 1 × 1 convolution produces the crowd density map. However, the sub-networks are only fused at the high layers of the network, and the multi-scale features in the shallow layers are not fully fused, which causes a loss of geometric features and affects the accuracy of people number estimation. In addition, the scheme requires the three sub-networks to be pre-trained before the whole network is trained, and the training time of each sub-network is no less than ten hours.
Daniel et al. [2] propose a multi-branch convolutional network based on multi-scale blocks, as shown in FIG. 2. Although its input blocks have different scales, the scheme consists of three sub-networks with the same structure, which are likewise only fused at the high layers of the network; the multi-scale features in the shallow layers are not fully fused, which causes a loss of geometric features and affects the accuracy of people number estimation. The scheme also requires the three sub-networks to be pre-trained before the whole network is trained, and the training time of each sub-network is no less than ten hours.
Han et al. [3] propose a method based on the combination of a residual network (ResNet) and a fully connected network, as shown in FIG. 3. In this scheme, a number of overlapping blocks are first sampled from each image, the predicted value of each block is then computed by the residual network, and the block predictions are finally fed into a conditional random field to compute the predicted number of people in the image. The scheme therefore proceeds in separate stages: the residual network must first produce the block predictions before the conditional random field can predict the number of people in the image, and the two stages cannot be merged into one.
Experiments show, however, that training these networks takes a long time, and the training time keeps increasing as the network structure is deepened. A deep network such as that of Han et al. [3] has a very deep structure and many parameters to learn, which not only requires a long training time but also risks over-fitting. The schemes proposed by Zhang et al. [1] and Daniel et al. [2], although not as deep as that of Han et al. [3], increase the breadth of the network, and each sub-network needs to be pre-trained in advance.
Disclosure of Invention
Aiming at the defects of the existing automatic people number estimation techniques based on deep network models, the invention provides a people number estimation method based on a generative adversarial network model.
To reduce the number of network parameters, no convolution kernel larger than 3 × 3 is used in the proposed scheme; to reduce the network width, only a single-column network structure is used; and to guarantee performance, different weights are given to the inputs of the regression network so as to distinguish the importance of different features.
The invention draws on the automatic feature extraction and multiple regression models of deep learning, makes full use of the feature representation capability of generative adversarial networks (GANs), uses a density map indicating the local crowd density as the secondary supervision signal and the number of people in the image as the primary supervision signal, trains the network with the back-propagation algorithm, and then initializes the network with the obtained network parameters to predict the number of people in an unknown image.
Interpretation of terms:
1. batch Normalization (Batch Normalization) process, comprising the following four steps: calculating the mean value of each training batch of data; calculating the variance of each training batch of data; normalizing the training data of the batch by using the obtained mean value and variance, namely subtracting the mean value from each training data of the batch and then dividing the result by the standard deviation; then multiplied by a scaling factor gamma, plus a translation factor beta.
2. Linear commutation (ReLU) activation function, which means that f (x) is max (0, x).
3. The max pooling (i.e., "down-sampling") operation refers to maximizing the feature points within a neighborhood.
4. A sigmoid function (sigmoid) activation function, which means
Figure BDA0001879368540000021
5. The RMSprop optimization algorithm comprises the steps of firstly, calculating the average value of the squares of the gradients of the previous t times; then, dividing the average value of the squares of the gradients of the previous t times by the gradient of the t time to be used as the updating proportion of the learning rate; and finally, obtaining a new learning rate according to the proportion.
6. The Adam optimization algorithm is used for dynamically adjusting the learning rate of each parameter according to the first moment estimation and the second moment estimation of the gradient of each parameter by the loss function.
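For concreteness, a small NumPy sketch of the operations defined in terms 1 to 4 (batch normalization, ReLU, max pooling and sigmoid) is given below; the 2 × 2 pooling window and the epsilon constant are illustrative assumptions, and the optimizers of terms 5 and 6 are not re-implemented here.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization as in term 1: per-batch mean and variance,
    normalize, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def relu(x):
    """Term 2: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Term 4: f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def max_pool_2x2(x):
    """Term 3: maximum within each 2 x 2 neighbourhood of the last two
    axes (the window size is an assumption for this sketch)."""
    h, w = x.shape[-2] // 2 * 2, x.shape[-1] // 2 * 2
    x = x[..., :h, :w]
    return x.reshape(*x.shape[:-2], h // 2, 2, w // 2, 2).max(axis=(-3, -1))

x = np.random.randn(8, 4)
print(batch_norm(x).mean(axis=0).round(6))       # ~0 for each feature
print(max_pool_2x2(np.arange(16.0).reshape(4, 4)))
```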
The technical scheme of the invention is as follows:
A people number estimation method based on a generative adversarial network model.
The generative adversarial network model comprises three sub-networks: a generator network G, a discriminating network D and a regression network R.
The generator network G comprises four consecutive convolution + batch normalization + max pooling stages and one convolution + batch normalization stage.
The discriminating network D comprises four consecutive up-sampling and convolution stages; an estimate of the density map is obtained from the output of the discriminating network D.
The regression network R is a fully connected network. The regression network R has four different inputs: the output of the generator network G after the second convolution + batch normalization + max pooling, the output of the generator network G after the third convolution + batch normalization + max pooling, the output of the generator network G after the fourth convolution + batch normalization + max pooling, and the output of the generator network G after the last convolution + batch normalization. The four different inputs of the regression network R are each passed through a different SE-Net to obtain four re-weighted inputs, and the four re-weighted inputs are fed into a three-layer fully connected network to obtain the predicted value of the number of people.
The generative adversarial network model is inspired by the two-player zero-sum game in game theory, and comprises a generative model (the generator network G) and a discriminative model (the discriminating network D). The generative model captures the distribution of the sample data, while the discriminative model is a binary classifier that judges whether its input is real data or a generated sample. The optimization of the model is a binary minimax game: during training one side is fixed while the parameters of the other model are updated, and the two sides are iterated alternately.
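The binary minimax game described above can be illustrated with the following minimal PyTorch sketch of alternating updates: the discriminator is updated while the generator is held fixed, and vice versa. The toy one-dimensional data, the network sizes and the binary cross-entropy loss are illustrative assumptions and do not reproduce the three-sub-network model of the invention.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator for illustrating alternating updates only.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
D = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(100):
    z = torch.randn(32, 8)            # noise input to the generator
    real = torch.randn(32, 4) + 3.0   # toy "real" samples

    # 1) Fix G, update D: classify real versus generated samples.
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(32, 1)) + \
             bce(D(G(z).detach()), torch.zeros(32, 1))
    d_loss.backward()
    opt_d.step()

    # 2) Fix D, update G: make generated samples look real to D.
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(32, 1))
    g_loss.backward()
    opt_g.step()
```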
The method comprises the following steps:
A. training process
(1) Obtain multi-scale data, where the multi-scale data refers to a multi-scale data training set (I, M, C); each sample is represented as (I_i, M_i, C_i), where I_i denotes image i, M_i denotes the density map of image i, and C_i denotes the number of people in image i;
preferably, in step (1), acquiring multi-scale data includes:
(i) Randomly crop each image in the image database to obtain M image blocks of size a × b and N image blocks of size c × d, where M ranges from 1 to 100, N ranges from 1 to 100, a ranges from 1 to 320, b from 1 to 240, c from 1 to 320 and d from 1 to 240, with a, b, c and d given in pixels;
Further preferably, in step (i), each image in the image database is randomly cropped to obtain 5 image blocks of size 120 × 80 and 5 image blocks of size 150 × 100.
(ii) Adjust the resolution of each image in the image database, and of each image block randomly cropped in step (i), to e × f, where e ranges from 80 to 640 and f from 60 to 480;
Further preferably, in step (ii), the resolution of each image in the image database, and of each image block randomly cropped in step (i), is adjusted to 320 × 240.
(iii) Sequentially apply horizontal flipping, vertical flipping, central symmetric transformation and Gaussian noise addition to each image and each image block in the image database; after these 4 operations a new image set is obtained, denoted I;
(iv) Mark the head positions in each image of the new image set I to obtain the annotation template set of the image set I, denoted L, together with the set C of the numbers of people in all images of the new image set I;
(v) Process each image in the annotation template set L with formula (II) to obtain the density map set of the image set I, denoted M:

M_i(x, y) = Σ_k (1 / (2πσ²)) · exp(−((x − x_k)² + (y − y_k)²) / (2σ²))   (II)

In formula (II), {(x_k, y_k), 0 ≤ k ≤ C_i} denotes the pixel positions of the people marked in image i, C_i denotes the number of people in image i, M_i(x, y) denotes the density map corresponding to image i, accumulated onto an all-zero matrix 0_{e×f} of size e × f, σ is the standard deviation, and i is the index of the image;
More preferably, σ is 3.0.
(vi) Obtain the multi-scale training data set (I, M, C); each sample is represented as (I_i, M_i, C_i), where I_i denotes image i, M_i denotes the density map of image i, and C_i denotes the number of people in image i;
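A sketch of the data preparation in steps (i) to (vi) above, using NumPy and SciPy; the helper names, the grayscale dummy image and the noise amplitude are illustrative assumptions, and the resizing of step (ii) (for instance with cv2.resize) is omitted. The density map follows formula (II): a normalized Gaussian of standard deviation σ = 3 is placed at every marked head position, so the map sums approximately to the number of people C_i.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def random_crop(img, w, h, rng):
    """Step (i): crop a w x h block at a random position."""
    y = rng.integers(0, img.shape[0] - h + 1)
    x = rng.integers(0, img.shape[1] - w + 1)
    return img[y:y + h, x:x + w]

def augment(img, rng):
    """Step (iii): horizontal flip, vertical flip, central symmetry, Gaussian noise."""
    return [np.fliplr(img), np.flipud(img), np.flipud(np.fliplr(img)),
            np.clip(img + rng.normal(0.0, 5.0, img.shape), 0, 255)]

def density_map(heads, width=320, height=240, sigma=3.0):
    """Formula (II): accumulate a unit impulse per head on an all-zero e x f
    matrix and smooth it with a Gaussian of standard deviation sigma."""
    m = np.zeros((height, width), dtype=np.float64)
    for x, y in heads:
        m[int(y), int(x)] += 1.0
    return gaussian_filter(m, sigma)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, (480, 640)).astype(np.float64)  # dummy grayscale image
block = random_crop(image, 150, 100, rng)                    # one 150 x 100 block
views = augment(block, rng)                                  # 4 augmented versions
M_i = density_map([(100.5, 80.2), (200.0, 120.7)])           # dummy head annotations
print(round(M_i.sum(), 2))                                   # ~2.0, i.e. C_i = 2
```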
(2) using generator networks
Figure BDA0001879368540000042
Generating a feature map set of the image:
a. adopting 8 matrixes with the scale of 3 multiplied by 3 and 16 matrixes with the scale of 3 multiplied by 3 as convolution kernels, and adopting a random orthogonal matrix to initialize the convolution kernels, wherein the random orthogonal matrix is formed by [0, 1]The uniformly distributed random number matrix is obtained by SVD (singular value decomposition); respectively adopting different convolution cores to convolute the input image of the new image set I, and respectively and sequentially carrying out batch normalization processing, linear rectification activation function and maximum pooling to obtain an output image set, namely a feature map set
Figure BDA0001879368540000044
b. Adopting 32 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000045
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map set
Figure BDA0001879368540000046
c. Adopting 64 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000047
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map set
Figure BDA0001879368540000048
d. Adopting 128 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000049
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map set Ig
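A PyTorch sketch of the generator network G from steps a to d: four convolution + batch normalization + ReLU + max pooling stages with 16, 32, 64 and 128 kernels of size 3 × 3 and random orthogonal initialization (only the four stages of step (2) are shown). The RGB input, the padding, the 2 × 2 pooling and the reading of step a as an 8-kernel convolution followed by a 16-kernel stage are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def conv_bn_pool(in_ch, out_ch):
    """3x3 convolution + batch normalization + ReLU + 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class Generator(nn.Module):
    """Generator network G: four conv+BN+pool stages producing the feature
    map sets I_g^(1), I_g^(2), I_g^(3) and I_g."""
    def __init__(self):
        super().__init__()
        # Step a is read here as an 8-kernel convolution followed by a
        # 16-kernel stage (an assumption); steps b-d use 32, 64, 128 kernels.
        self.stem = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(inplace=True))
        self.stage1 = conv_bn_pool(8, 16)
        self.stage2 = conv_bn_pool(16, 32)
        self.stage3 = conv_bn_pool(32, 64)
        self.stage4 = conv_bn_pool(64, 128)
        for m in self.modules():              # random orthogonal initialization
            if isinstance(m, nn.Conv2d):
                nn.init.orthogonal_(m.weight)

    def forward(self, x):
        i1 = self.stage1(self.stem(x))        # I_g^(1): 16 channels
        i2 = self.stage2(i1)                  # I_g^(2): 32 channels
        i3 = self.stage3(i2)                  # I_g^(3): 64 channels
        ig = self.stage4(i3)                  # I_g    : 128 channels
        return i1, i2, i3, ig

i1, i2, i3, ig = Generator()(torch.randn(1, 3, 240, 320))
print(ig.shape)  # torch.Size([1, 128, 15, 20])
```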
(3) Use the discriminating network D to generate the estimated density map:
Use 64 matrices of size 3 × 3, 32 matrices of size 3 × 3, 16 matrices of size 3 × 3 and 8 matrices of size 3 × 3 as convolution kernels, and initialize the convolution kernels with random orthogonal matrices. Up-sample the feature map set I_g and convolve the up-sampled feature maps with the different convolution kernels in turn, obtaining the output image, namely the estimated density map corresponding to the input image of the new image set I;
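A companion sketch of the discriminating network D from step (3): four up-sampling + convolution stages with 64, 32, 16 and 8 kernels of size 3 × 3 and orthogonal initialization, mapping the 128-channel feature maps I_g back to an estimated density map. The nearest-neighbour up-sampling mode, the ReLU activations and the final 1-channel projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

def up_conv(in_ch, out_ch):
    """2x up-sampling followed by a 3x3 convolution."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class Discriminator(nn.Module):
    """Discriminating network D: turns the feature map set I_g into an
    estimated density map at the input resolution."""
    def __init__(self):
        super().__init__()
        self.decode = nn.Sequential(
            up_conv(128, 64), up_conv(64, 32), up_conv(32, 16), up_conv(16, 8),
            nn.Conv2d(8, 1, kernel_size=1),    # assumed 1-channel density output
        )
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.orthogonal_(m.weight)  # random orthogonal initialization

    def forward(self, ig):
        return self.decode(ig)

density = Discriminator()(torch.randn(1, 128, 15, 20))
print(density.shape)  # torch.Size([1, 1, 240, 320])
```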
(4) Extract attention features with SE-Net:
e. Apply global average pooling to I_g^(1) to obtain the feature vector v_g^(1); apply global average pooling to I_g^(2) to obtain the feature vector v_g^(2); apply global average pooling to I_g^(3) to obtain the feature vector v_g^(3); apply global average pooling to I_g to obtain the feature vector v_g.
f. Use a multilayer perceptron MLP_g^(1) with 16 neural units at the input, 1 neural unit in the hidden layer and 16 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid (S-function) activation, obtaining a 16-dimensional feature vector v'_g^(1).
Meanwhile, use a multilayer perceptron MLP_g^(2) with 32 neural units at the input, 1 neural unit in the hidden layer and 32 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 32-dimensional feature vector v'_g^(2).
Meanwhile, use a multilayer perceptron MLP_g^(3) with 64 neural units at the input, 1 neural unit in the hidden layer and 64 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 64-dimensional feature vector v'_g^(3).
Meanwhile, use a multilayer perceptron MLP_g with 128 neural units at the input, 1 neural unit in the hidden layer and 128 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 128-dimensional feature vector v'_g.
The extracted attention features comprise: the 16-dimensional feature vector v'_g^(1), the 32-dimensional feature vector v'_g^(2), the 64-dimensional feature vector v'_g^(3) and the 128-dimensional feature vector v'_g.
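A sketch of one SE-Net attention branch from step (4): global average pooling squeezes a C-channel feature map to a C-dimensional vector v, and a two-layer perceptron with a single hidden unit (C → 1 → C), with ReLU and sigmoid activations, produces the attention vector v'. The uniform initialization ranges of the original are not reproduced here and PyTorch defaults are used instead, which is an assumption.

```python
import torch
import torch.nn as nn

class SEBranch(nn.Module):
    """One SE-Net branch: squeeze (global average pooling) plus excitation
    (C -> 1 -> C perceptron with ReLU then sigmoid), as in step (4)."""
    def __init__(self, channels):
        super().__init__()
        self.fc1 = nn.Linear(channels, 1)     # hidden layer with 1 neural unit
        self.fc2 = nn.Linear(1, channels)
        nn.init.zeros_(self.fc1.bias)         # bias terms initialized to 0
        nn.init.zeros_(self.fc2.bias)

    def forward(self, feat):                  # feat: (N, C, H, W)
        v = feat.mean(dim=(2, 3))             # global average pooling -> (N, C)
        v_prime = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))
        return v_prime                        # attention vector v' in [0, 1]^C

# One branch per feature map set: 16-, 32-, 64- and 128-dimensional vectors.
branches = [SEBranch(c) for c in (16, 32, 64, 128)]
v16 = branches[0](torch.randn(1, 16, 120, 160))
print(v16.shape)  # torch.Size([1, 16])
```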
(5) Re-weight the feature maps with the attention features:
Multiply all pixels of each image in the feature map set I_g^(1) by the corresponding component of the feature vector v'_g^(1); the re-weighted feature map set is denoted I'_g^(1). Multiply all pixels of each image in the feature map set I_g^(2) by the corresponding component of the feature vector v'_g^(2); the re-weighted feature map set is denoted I'_g^(2). Multiply all pixels of each image in the feature map set I_g^(3) by the corresponding component of the feature vector v'_g^(3); the re-weighted feature map set is denoted I'_g^(3). Multiply all pixels of each image in the feature map set I_g by the corresponding component of the feature vector v'_g; the re-weighted feature map set is denoted I'_g.
(6) Calculate the number of people in the image with the regression network R:
g. Use a fully connected layer MLP_R with 26400 neural units at the input and 1 neural unit at the output; initialize the weight matrix W_R of the fully connected layer from a uniform distribution between given minimum and maximum values, and initialize the bias term b to 0;
h. Use the fully connected layer MLP_R to process I'_g^(1), I'_g^(2), I'_g^(3) and I'_g simultaneously, and apply the rectified linear (ReLU) activation function to obtain a 1-dimensional scalar ĉ; the scalar ĉ is the number of people in the image;
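A sketch of steps (5) and (6): each feature map set is re-weighted channel-wise by its attention vector, the four re-weighted sets are flattened and concatenated, and a fully connected layer with a single output followed by ReLU gives the predicted count ĉ. How the patent arrives at exactly 26400 input units is not fully specified, so the sketch infers the input dimension from the tensors with a lazy linear layer; the dummy feature map sizes are assumptions.

```python
import torch
import torch.nn as nn

class CountRegressor(nn.Module):
    """Steps (5)-(6): re-weight each feature map set with its attention vector,
    flatten and concatenate, then map to a single people count with a fully
    connected layer and ReLU."""
    def __init__(self):
        super().__init__()
        self.fc = nn.LazyLinear(1)   # input size inferred from the data (see lead-in)

    def forward(self, feats, attns):
        weighted = [f * a.unsqueeze(-1).unsqueeze(-1)      # channel-wise re-weighting,
                    for f, a in zip(feats, attns)]          # I'_g^(k) = I_g^(k) * v'_g^(k)
        flat = torch.cat([w.flatten(1) for w in weighted], dim=1)
        return torch.relu(self.fc(flat)).squeeze(1)         # scalar count per image

# Dummy feature map sets and attention vectors with the channel sizes used above.
feats = [torch.randn(1, c, h, w) for c, (h, w) in
         zip((16, 32, 64, 128), ((120, 160), (60, 80), (30, 40), (15, 20)))]
attns = [torch.rand(1, c) for c in (16, 32, 64, 128)]
print(CountRegressor()(feats, attns).shape)  # torch.Size([1])
```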
(7) network training;
i. Define the loss function, i.e. the objective function to be optimized, as formula (I) (the formula itself is reproduced only as an image in the original publication). In formula (I), Loss denotes the value of the loss function; λ1 denotes the weight of the error produced by the discriminator; G(I_i) denotes the output obtained by passing image I_i through the generator network G; λ2 denotes the weight of the error produced by the generator; D(G(I_i)) denotes the result of passing G(I_i) through the discriminator network D; m denotes the number of samples after the training set has been augmented, i.e. m = 70400; I_i denotes an input image, C_i the number of people in the image, and M_i the density map corresponding to the image; C_i serves as the primary supervision signal and M_i as the secondary supervision signal;
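Because formula (I) is available only as an image, the sketch below gives one hedged reading of the loss: a squared error between the predicted and true people counts C_i (primary supervision) plus a pixel-wise squared error between the estimated density map D(G(I_i)) and M_i (secondary supervision) weighted by λ1. The λ2 generator term is omitted because its exact form cannot be recovered from the text; everything here is an assumption, not the patented formula.

```python
import torch

def formula_I_sketch(pred_count, true_count, est_density, true_density, lambda1=1.0):
    """Hedged reading of formula (I): count regression error (primary signal C_i)
    plus lambda1 times the density-map error (secondary signal M_i). The lambda2
    generator term of the original formula is not reproduced here."""
    regression_error = torch.mean((pred_count - true_count) ** 2)
    density_error = torch.mean((est_density - true_density) ** 2)
    return regression_error + lambda1 * density_error

loss = formula_I_sketch(torch.rand(4), torch.rand(4) * 50,
                        torch.rand(4, 1, 240, 320), torch.rand(4, 1, 240, 320))
print(float(loss) > 0)  # True
```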
j. The generator network G uses the Adam optimization algorithm with initial learning rate g_base_lr; the discriminating network D uses the RMSprop optimization algorithm with initial learning rate d_base_lr; the regression network R uses the Adam optimization algorithm with initial learning rate r_base_lr; g_base_lr ranges from 0.000001 to 1, d_base_lr ranges from 0.000001 to 1, and r_base_lr ranges from 0.000001 to 1;
Further preferably, g_base_lr is 0.00001, d_base_lr is 0.0002, and r_base_lr is 0.0001.
k. The following steps are iterated m times:
① Randomly sample m images {I_1, I_2, …, I_m} from the training set;
② Randomly sample from the training set the density maps {M_1, M_2, …, M_m} corresponding to the m images;
③ Compute the gradient of the discriminating network D, i.e. the gradient ∇θ_d of the training error of the discriminating network D with respect to its parameters θ_d;
④ Update the parameters of the discriminating network D with the RMSprop optimization algorithm;
⑤ Randomly sample m images {I_1, I_2, …, I_m} from the training set;
⑥ Randomly sample from the training set the people-number labels {C_1, C_2, …, C_m} corresponding to the m images;
⑦ Compute the gradient of the generator network G, i.e. the gradient ∇θ_g of the training error of the generator network G with respect to its parameters θ_g;
⑧ Update the parameters of the generator network G with the Adam optimization algorithm;
⑨ Randomly sample m images {I_1, I_2, …, I_m} from the training set;
⑩ Randomly sample from the training set the people-number labels {C_1, C_2, …, C_m} corresponding to the m images;
⑪ Compute the gradient of the regression network R, i.e. the gradient ∇θ_r of the training error of the regression network R with respect to its parameters θ_r;
⑫ Update the parameters of the regression network R with the Adam optimization algorithm.
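A sketch of the alternating procedure of steps j and k: the discriminating network D is updated with RMSprop (d_base_lr = 0.0002), the generator network G with Adam (g_base_lr = 0.00001) and the regression network R with Adam (r_base_lr = 0.0001), each on freshly sampled mini-batches. The stand-in modules, the mean-squared-error losses and the pairing of each loss with a sub-network are assumptions; the real sub-networks are the ones described in steps (2) to (6).

```python
import torch
import torch.nn as nn

# Stand-ins so the loop runs on its own; the real G, D and R are described above.
G = nn.Sequential(nn.Conv2d(3, 128, 3, padding=1), nn.ReLU())
D = nn.Sequential(nn.Conv2d(128, 1, 3, padding=1))
R = nn.Sequential(nn.Flatten(), nn.Linear(128 * 30 * 40, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-5)      # g_base_lr = 0.00001
opt_d = torch.optim.RMSprop(D.parameters(), lr=2e-4)   # d_base_lr = 0.0002
opt_r = torch.optim.Adam(R.parameters(), lr=1e-4)      # r_base_lr = 0.0001
mse = nn.MSELoss()

def sample_batch(batch=2):
    """Stand-in for randomly sampling images, density maps and counts
    from the augmented training set."""
    return (torch.rand(batch, 3, 30, 40),
            torch.rand(batch, 1, 30, 40),
            torch.rand(batch, 1) * 50)

for it in range(3):                          # step k: iterated m times in the patent
    # steps 1-4: sample images and density maps, update the discriminating network D
    imgs, dens, _ = sample_batch()
    opt_d.zero_grad()
    mse(D(G(imgs).detach()), dens).backward()
    opt_d.step()

    # steps 5-8: sample images and labels, update the generator network G
    imgs, dens, _ = sample_batch()
    opt_g.zero_grad()
    mse(D(G(imgs)), dens).backward()         # assumed generator objective
    opt_g.step()

    # steps 9-12: sample images and people-number labels, update the regression network R
    imgs, _, cnts = sample_batch()
    opt_r.zero_grad()
    mse(R(G(imgs).detach()), cnts).backward()
    opt_r.step()
```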
B. the testing process comprises the following steps:
Initialize the network with the network parameters obtained in step (7), take the test image as the input of the network, and the network directly outputs the number of people in the image.
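A minimal sketch of the testing process: the network is initialized with the parameters obtained in step (7) and the test image is fed in, with the count returned directly. The stand-in model, the file name counter.pt and the 320 × 240 RGB input are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for the full counting network; in practice this would wrap the
# generator, the SE branches and the regression network trained in step A.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

torch.save(model.state_dict(), "counter.pt")     # parameters obtained in step (7)

model.load_state_dict(torch.load("counter.pt"))  # B. initialize the network
model.eval()
with torch.no_grad():
    test_image = torch.rand(1, 3, 240, 320)      # a test image as network input
    count = model(test_image).item()             # the network outputs the count
print(count)
```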
The invention has the beneficial effects that:
1. The invention provides a feature extraction algorithm based on a generative adversarial network, which makes full use of the implicit feature representation capability of the generative network and applies multi-task learning to strengthen the generalization capability of the model;
2. The invention uses an attention model so that the adjustment of the network parameters focuses on the features that influence accuracy;
3. The training algorithm of the adversarial regression model provided by the invention adopts alternating training and random sampling, which avoids over-fitting.
Drawings
Figure 1 is an architectural diagram of a multi-column convolutional network proposed by Zhang et al.
Fig. 2 is an architecture diagram of a multi-branch convolutional network based on multi-scale blocks proposed by Daniel et al.
Fig. 3 is an architecture diagram of the combination of a residual network (ResNet), a fully connected network and a Markov random field proposed by Han et al.
Fig. 4 is a structural block diagram of the generative adversarial network model proposed by the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples of the description, but is not limited thereto.
Example 1
A people number estimation method based on a generative adversarial network model, where the generative adversarial network model comprises three sub-networks, as shown in Fig. 4: a generator network G, a discriminating network D and a regression network R.
The generator network G comprises four consecutive convolution + batch normalization + max pooling stages and one convolution + batch normalization stage.
The discriminating network D comprises four consecutive up-sampling and convolution stages; an estimate of the density map is obtained from the output of the discriminating network D.
The regression network R is a fully connected network. The regression network R has four different inputs: the output of the generator network G after the second convolution + batch normalization + max pooling, the output of the generator network G after the third convolution + batch normalization + max pooling, the output of the generator network G after the fourth convolution + batch normalization + max pooling, and the output of the generator network G after the last convolution + batch normalization. The four different inputs of the regression network R are each passed through a different SE-Net to obtain four re-weighted inputs, and the four re-weighted inputs are fed into a three-layer fully connected network to obtain the predicted value of the number of people.
The generative adversarial network model is inspired by the two-player zero-sum game in game theory, and comprises a generative model (the generator network G) and a discriminative model (the discriminating network D). The generative model captures the distribution of the sample data, while the discriminative model is a binary classifier that judges whether its input is real data or a generated sample. The optimization of the model is a binary minimax game: during training one side is fixed while the parameters of the other model are updated, and the two sides are iterated alternately.
The method comprises the following steps:
A. training process
(1) Obtain multi-scale data, where the multi-scale data refers to a multi-scale data training set (I, M, C); each sample is represented as (I_i, M_i, C_i), where I_i denotes image i, M_i denotes the density map of image i, and C_i denotes the number of people in image i; this comprises the following steps:
(i) Randomly crop each image in the image database to obtain M image blocks of size a × b and N image blocks of size c × d, where M ranges from 1 to 100, N ranges from 1 to 100, a ranges from 1 to 320, b from 1 to 240, c from 1 to 320 and d from 1 to 240, with a, b, c and d given in pixels;
(ii) Adjust the resolution of each image in the image database, and of each image block randomly cropped in step (i), to e × f, where e ranges from 80 to 640 and f from 60 to 480;
(iii) Sequentially apply horizontal flipping, vertical flipping, central symmetric transformation and Gaussian noise addition to each image and each image block in the image database; after these 4 operations a new image set is obtained, denoted I;
(iv) Mark the head positions in each image of the new image set I to obtain the annotation template set of the image set I, denoted L, together with the set C of the numbers of people in all images of the new image set I;
(v) Process each image in the annotation template set L with formula (II) to obtain the density map set of the image set I, denoted M:

M_i(x, y) = Σ_k (1 / (2πσ²)) · exp(−((x − x_k)² + (y − y_k)²) / (2σ²))   (II)

In formula (II), {(x_k, y_k), 0 ≤ k ≤ C_i} denotes the pixel positions of the people marked in image i, C_i denotes the number of people in image i, M_i(x, y) denotes the density map corresponding to image i, accumulated onto an all-zero matrix 0_{e×f} of size e × f, σ is the standard deviation, and i is the index of the image; σ is 3.0.
(vi) Obtain the multi-scale training data set (I, M, C); each sample is represented as (I_i, M_i, C_i), where I_i denotes image i, M_i denotes the density map of image i, and C_i denotes the number of people in image i;
(2) using generator networks
Figure BDA0001879368540000102
Generating a feature map set of the image:
a. adopting 8 matrixes with the scale of 3 multiplied by 3 and 16 matrixes with the scale of 3 multiplied by 3 as convolution kernels, and adopting a random orthogonal matrix to initialize the convolution kernels, wherein the random orthogonal matrix is formed by [0, 1]The uniformly distributed random number matrix is obtained by SVD (singular value decomposition); respectively adopting different convolution cores to convolute the input image of the new image set I, and respectively and sequentially carrying out batch normalization processing, linear rectification activation function and maximum pooling to obtain an output image set, namely a feature map set
Figure BDA0001879368540000103
b. Adopting 32 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000104
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map setCombination of Chinese herbs
Figure BDA0001879368540000105
c. Adopting 64 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000106
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map set
Figure BDA0001879368540000107
d. Adopting 128 matrixes with the scale of 3 multiplied by 3 as convolution kernels, adopting a random orthogonal matrix to initialize the convolution kernels, and adopting the convolution kernels to check a feature map set
Figure BDA0001879368540000108
Performing convolution, and sequentially performing batch normalization, linear rectification activation function and maximum pooling to obtain an output image set, i.e. a feature map set Ig
(3) Use the discriminating network D to generate the estimated density map: use 64 matrices of size 3 × 3, 32 matrices of size 3 × 3, 16 matrices of size 3 × 3 and 8 matrices of size 3 × 3 as convolution kernels, and initialize the convolution kernels with random orthogonal matrices; up-sample the feature map set I_g and convolve the up-sampled feature maps with the different convolution kernels in turn, obtaining the output image, namely the estimated density map corresponding to the input image of the new image set I;
(4) Extract attention features with SE-Net:
e. Apply global average pooling to I_g^(1) to obtain the feature vector v_g^(1); apply global average pooling to I_g^(2) to obtain the feature vector v_g^(2); apply global average pooling to I_g^(3) to obtain the feature vector v_g^(3); apply global average pooling to I_g to obtain the feature vector v_g.
f. Use a multilayer perceptron MLP_g^(1) with 16 neural units at the input, 1 neural unit in the hidden layer and 16 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid (S-function) activation, obtaining a 16-dimensional feature vector v'_g^(1).
Meanwhile, use a multilayer perceptron MLP_g^(2) with 32 neural units at the input, 1 neural unit in the hidden layer and 32 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 32-dimensional feature vector v'_g^(2).
Meanwhile, use a multilayer perceptron MLP_g^(3) with 64 neural units at the input, 1 neural unit in the hidden layer and 64 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 64-dimensional feature vector v'_g^(3).
Meanwhile, use a multilayer perceptron MLP_g with 128 neural units at the input, 1 neural unit in the hidden layer and 128 neural units at the output. Initialize the weight matrix of its first layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the rectified linear (ReLU) activation function; then initialize the weight matrix of its second layer from a uniform distribution between given minimum and maximum values, initialize the bias term to 0, and apply the sigmoid activation, obtaining a 128-dimensional feature vector v'_g.
The extracted attention features comprise: the 16-dimensional feature vector v'_g^(1), the 32-dimensional feature vector v'_g^(2), the 64-dimensional feature vector v'_g^(3) and the 128-dimensional feature vector v'_g.
(5) Re-weight the feature maps with the attention features:
Multiply all pixels of each image in the feature map set I_g^(1) by the corresponding component of the feature vector v'_g^(1); the re-weighted feature map set is denoted I'_g^(1). Multiply all pixels of each image in the feature map set I_g^(2) by the corresponding component of the feature vector v'_g^(2); the re-weighted feature map set is denoted I'_g^(2). Multiply all pixels of each image in the feature map set I_g^(3) by the corresponding component of the feature vector v'_g^(3); the re-weighted feature map set is denoted I'_g^(3). Multiply all pixels of each image in the feature map set I_g by the corresponding component of the feature vector v'_g; the re-weighted feature map set is denoted I'_g.
(6) Calculate the number of people in the image with the regression network R:
g. Use a fully connected layer MLP_R with 26400 neural units at the input and 1 neural unit at the output; initialize the weight matrix W_R of the fully connected layer from a uniform distribution between given minimum and maximum values, and initialize the bias term b to 0;
h. Use the fully connected layer MLP_R to process I'_g^(1), I'_g^(2), I'_g^(3) and I'_g simultaneously, and apply the rectified linear (ReLU) activation function to obtain a 1-dimensional scalar ĉ; the scalar ĉ is the number of people in the image;
(7) network training;
i. Define the loss function, i.e. the objective function to be optimized, as formula (I) (the formula itself is reproduced only as an image in the original publication). In formula (I), Loss denotes the value of the loss function; λ1 denotes the weight of the error produced by the discriminator; G(I_i) denotes the output obtained by passing image I_i through the generator network G; λ2 denotes the weight of the error produced by the generator; D(G(I_i)) denotes the result of passing G(I_i) through the discriminator network D; m denotes the number of samples after the training set has been augmented, i.e. m = 70400; I_i denotes an input image, C_i the number of people in the image, and M_i the density map corresponding to the image; C_i serves as the primary supervision signal and M_i as the secondary supervision signal;
j. The generator network G uses the Adam optimization algorithm with initial learning rate g_base_lr; the discriminating network D uses the RMSprop optimization algorithm with initial learning rate d_base_lr; the regression network R uses the Adam optimization algorithm with initial learning rate r_base_lr; g_base_lr ranges from 0.000001 to 1, d_base_lr ranges from 0.000001 to 1, and r_base_lr ranges from 0.000001 to 1;
k. The following steps are iterated m times:
① Randomly sample m images {I_1, I_2, …, I_m} from the training set;
② Randomly sample from the training set the density maps {M_1, M_2, …, M_m} corresponding to the m images;
③ Compute the gradient of the discriminating network D, i.e. the gradient ∇θ_d of the training error of the discriminating network D with respect to its parameters θ_d;
④ Update the parameters of the discriminating network D with the RMSprop optimization algorithm;
⑤ Randomly sample m images {I_1, I_2, …, I_m} from the training set;
⑥ Randomly sample from the training set the people-number labels {C_1, C_2, …, C_m} corresponding to the m images;
⑦ Compute the gradient of the generator network G, i.e. the gradient ∇θ_g of the training error of the generator network G with respect to its parameters θ_g;
⑧ Update the parameters of the generator network G with the Adam optimization algorithm;
⑨ Randomly sample m images {I_1, I_2, …, I_m} from the training set;
⑩ Randomly sample from the training set the people-number labels {C_1, C_2, …, C_m} corresponding to the m images;
⑪ Compute the gradient of the regression network R, i.e. the gradient ∇θ_r of the training error of the regression network R with respect to its parameters θ_r;
⑫ Update the parameters of the regression network R with the Adam optimization algorithm.
B. the testing process comprises the following steps:
Initialize the network with the network parameters obtained in step (7), take the test image as the input of the network, and the network directly outputs the number of people in the image.
Example 2
The people number estimation method based on the generative adversarial network model according to Embodiment 1, wherein:
In step (i), each image in the image database is randomly cropped to obtain 5 image blocks of size 120 × 80 and 5 image blocks of size 150 × 100. This step is applied only to the training set and not to the test set.
In step (ii), the resolution of each image in the image database, and of each image block randomly cropped in step (i), is adjusted to 320 × 240.
The value of g _ base _ lr is 0.00001, the value of d _ base _ lr is 0.0002, and the value of r _ base _ lr is 0.0001.
Algorithm 1 (given as a figure in the original publication) is applied to train the generative adversarial network model.
The method makes full use of the implicit feature representation capability of the generative network and applies multi-task learning, so that the generalization capability of the model is stronger; an attention model is used, so that the adjustment of the network parameters focuses on the features that influence accuracy; and the method adopts alternating training and random sampling, thereby avoiding over-fitting.
The effects of the present invention can be further illustrated by experiments. Table 1 compares the prediction error of the present invention on the MALL test set with the methods of Zhang et al., Daniel et al. and Han et al., where "(calculated using true density maps)" in Table 1 means that the sum of the pixels of the true density map is taken as the true number of people in the image.
TABLE 1 (the table is reproduced as an image in the original publication)
As can be seen from Table 1, the method of the present invention is more accurate than the other four methods.

Claims (10)

1. A people number estimation method based on a generative adversarial network model, characterized in that the generative adversarial network model comprises three sub-networks: a generator network G, a discriminating network D and a regression network R; the generator network G comprises four consecutive convolution + batch normalization + max pooling stages and one convolution + batch normalization stage; the discriminating network D comprises four consecutive up-sampling and convolution stages, and an estimate of the density map is obtained from the output of the discriminating network D; the regression network R is a fully connected network; the regression network R has four different inputs, including: the output of the generator network G after the second convolution + batch normalization + max pooling, the output of the generator network G after the third convolution + batch normalization + max pooling, the output of the generator network G after the fourth convolution + batch normalization + max pooling, and the output of the generator network G after the last convolution + batch normalization; the four different inputs of the regression network R are each passed through a different SE-Net to obtain four re-weighted inputs, and the four re-weighted inputs are fed into a three-layer fully connected network to obtain the predicted value of the number of people; the method comprises the following steps:
A. Training process
(1) Obtaining multi-scale data, where the multi-scale data refers to a multi-scale data training set (I, M, C); each sample is represented as (I_i, M_i, C_i), where I_i denotes image i, M_i denotes the density map of image i, and C_i denotes the number of people in image i;
(2) Using the generator network G to generate the feature map sets of the images;
(3) Using the discriminating network D to generate the estimated density map;
(4) Extracting attention features with SE-Net;
(5) Re-weighting the feature maps with the attention features;
(6) Calculating the number of people in the image with the regression network R;
(7) Network training;
B. The testing process comprises the following steps:
Initializing the network with the network parameters obtained in step (7), taking the test image as the input of the network, and the network directly outputting the number of people in the image.
2. The people number estimation method based on the generative adversarial network model as claimed in claim 1, wherein in step (2), using the generator network G to generate the feature map sets of the images comprises the following steps:
a. Using 8 matrices of size 3 × 3 and 16 matrices of size 3 × 3 as convolution kernels, and initializing the convolution kernels with random orthogonal matrices, where a random orthogonal matrix is obtained by applying SVD to a matrix of random numbers uniformly distributed over [0, 1]; convolving the input images of the new image set I with the different convolution kernels, and sequentially applying batch normalization, the rectified linear activation function and max pooling to obtain the output image set, i.e. the feature map set I_g^(1);
b. Using 32 matrices of size 3 × 3 as convolution kernels, initializing them with random orthogonal matrices, convolving the feature map set I_g^(1) with these kernels, and sequentially applying batch normalization, the rectified linear activation function and max pooling to obtain the output image set, i.e. the feature map set I_g^(2);
c. Using 64 matrices of size 3 × 3 as convolution kernels, initializing them with random orthogonal matrices, convolving the feature map set I_g^(2) with these kernels, and sequentially applying batch normalization, the rectified linear activation function and max pooling to obtain the output image set, i.e. the feature map set I_g^(3);
d. Using 128 matrices of size 3 × 3 as convolution kernels, initializing them with random orthogonal matrices, convolving the feature map set I_g^(3) with these kernels, and sequentially applying batch normalization, the rectified linear activation function and max pooling to obtain the output image set, i.e. the feature map set I_g.
3. The people number estimation method based on the generative confrontation network model as claimed in claim 2, wherein in the step (3), the discrimination network D is used to generate an estimated density map, comprising the following steps:
adopting 64 matrices of size 3 × 3, 32 matrices of size 3 × 3, 16 matrices of size 3 × 3 and 8 matrices of size 3 × 3 as convolution kernels, and initializing the convolution kernels with random orthogonal matrices; upsampling the feature map set I_g, and convolving the upsampled feature map set I_g with the different convolution kernels respectively to obtain an output image, namely the estimated density map corresponding to each input image of the new image set I.
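A companion sketch of the discrimination network D of claim 3, again illustrative rather than claimed: the upsampling factor, the activations between the convolutions and the final projection to a single-channel density map are assumptions added so the example runs end to end, and nn.init.orthogonal_ stands in for the SVD-based initialization described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    # decoder-style discrimination network: upsample I_g, then 64/32/16/8-kernel convolutions
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 8, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, kernel_size=1),          # assumed 1-channel density-map output
        )
        for m in self.convs:
            if isinstance(m, nn.Conv2d):
                nn.init.orthogonal_(m.weight)         # stand-in for the SVD-based init

    def forward(self, i_g, out_size=(240, 320)):
        x = F.interpolate(i_g, size=out_size, mode="bilinear", align_corners=False)
        return self.convs(x)                          # estimated density map

density = Discriminator()(torch.rand(1, 128, 15, 20))
print(tuple(density.shape))                           # (1, 1, 240, 320)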
4. The people number estimation method based on the generative confrontation network model as claimed in claim 2, wherein the step (4) of extracting attention features by using SE-Net comprises the following steps:
e. processing the feature map set I_g1 with global average pooling to obtain a feature vector v_g1; processing the feature map set I_g2 with global average pooling to obtain a feature vector v_g2; processing the feature map set I_g3 with global average pooling to obtain a feature vector v_g3; and processing the feature map set I_g with global average pooling to obtain a feature vector v_g;
f. processing the feature vector v_g1 with a multilayer perceptron MLP_1 having 16 neural units at the input, 1 neural unit in the hidden layer and 16 neural units at the output: initializing the first-layer weight matrix of MLP_1 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying a linear rectification activation function; then initializing the second-layer weight matrix of MLP_1 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying an S (sigmoid) activation function to obtain a 16-dimensional feature vector v'_g1;
meanwhile, processing the feature vector v_g2 with a multilayer perceptron MLP_2 having 32 neural units at the input, 1 neural unit in the hidden layer and 32 neural units at the output: initializing the first-layer weight matrix of MLP_2 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying a linear rectification activation function; then initializing the second-layer weight matrix of MLP_2 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying an S activation function to obtain a 32-dimensional feature vector v'_g2;
meanwhile, processing the feature vector v_g3 with a multilayer perceptron MLP_3 having 64 neural units at the input, 1 neural unit in the hidden layer and 64 neural units at the output: initializing the first-layer weight matrix of MLP_3 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying a linear rectification activation function; then initializing the second-layer weight matrix of MLP_3 with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying an S activation function to obtain a 64-dimensional feature vector v'_g3;
meanwhile, processing the feature vector v_g with a multilayer perceptron MLP_g having 128 neural units at the input, 1 neural unit in the hidden layer and 128 neural units at the output: initializing the first-layer weight matrix of MLP_g with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying a linear rectification activation function; then initializing the second-layer weight matrix of MLP_g with a uniform distribution between a preset minimum value and a preset maximum value, initializing the corresponding bias term to 0, and applying an S activation function to obtain a 128-dimensional feature vector v'_g;
the extracted attention features comprise: the 16-dimensional feature vector v'_g1, the 32-dimensional feature vector v'_g2, the 64-dimensional feature vector v'_g3 and the 128-dimensional feature vector v'_g.
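The SE-Net step of claim 4 amounts, for each feature map set, to a squeeze-and-excitation block: global average pooling followed by a two-layer perceptron with linear rectification and then a sigmoid, yielding one attention weight per channel. A minimal sketch follows; the single hidden unit mirrors the claim wording, and the uniform-initialization interval is illustrative because the claimed minimum and maximum values are not reproduced in this text.

import torch
import torch.nn as nn

class SEAttention(nn.Module):
    # squeeze (global average pooling) + excitation (2-layer MLP, ReLU then sigmoid)
    def __init__(self, channels, hidden=1):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Linear(channels, hidden)
        self.fc2 = nn.Linear(hidden, channels)
        for fc in (self.fc1, self.fc2):
            nn.init.uniform_(fc.weight, -0.1, 0.1)    # interval is illustrative only
            nn.init.zeros_(fc.bias)                   # bias terms initialized to 0

    def forward(self, x):
        v = self.pool(x).flatten(1)                   # step e: feature vector v_g*
        v = torch.relu(self.fc1(v))                   # first layer + linear rectification
        return torch.sigmoid(self.fc2(v))             # second layer + S function -> v'_g*

v_prime = SEAttention(16)(torch.rand(1, 16, 120, 160))
print(tuple(v_prime.shape))                           # (1, 16): one weight per channel of I_g1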
5. The people number estimation method based on the generative confrontation network model as claimed in claim 4, wherein the step (5) of re-weighting the feature maps with the attention features comprises the following steps:
multiplying all pixels of each image in the feature map set I_g1 by the corresponding component of the feature vector v'_g1 to obtain the re-weighted feature map set I'_g1;
multiplying all pixels of each image in the feature map set I_g2 by the corresponding component of the feature vector v'_g2 to obtain the re-weighted feature map set I'_g2;
multiplying all pixels of each image in the feature map set I_g3 by the corresponding component of the feature vector v'_g3 to obtain the re-weighted feature map set I'_g3;
multiplying all pixels of each image in the feature map set I_g by the corresponding component of the feature vector v'_g to obtain the re-weighted feature map set I'_g.
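The re-weighting of claim 5 is a channel-wise scaling: every pixel of a feature map is multiplied by the attention component of its channel. A one-function sketch with illustrative tensor sizes:

import torch

def reweight(feature_maps, attention):
    # scale every pixel of each channel by that channel's attention component
    # feature_maps: N x C x H x W, attention: N x C
    return feature_maps * attention.unsqueeze(-1).unsqueeze(-1)

i_g1 = torch.rand(1, 16, 120, 160)        # illustrative size for feature map set I_g1
v_g1_prime = torch.rand(1, 16)            # attention vector v'_g1
i_g1_prime = reweight(i_g1, v_g1_prime)   # re-weighted feature map set I'_g1
print(tuple(i_g1_prime.shape))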
6. The people number estimation method based on the generative confrontation network model as claimed in claim 5, wherein in the step (6), the regression network R is used to calculate the number of people in the image, comprising the following steps:
g. using a fully connected layer MLP_R with 26400 neural units at the input and 1 neural unit at the output; initializing the weight matrix W_R of the fully connected layer with a uniform distribution between a preset minimum value and a preset maximum value, and initializing the bias term b to 0;
h. processing I'_g1, I'_g2, I'_g3 and I'_g simultaneously with the fully connected layer MLP_R, and obtaining a 1-dimensional scalar through a linear rectification activation function; the scalar is the number of people in the image.
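Claim 6 flattens and concatenates the re-weighted feature map sets and maps them through a single fully connected layer, 26400 inputs in the patent, followed by linear rectification to obtain the count. In the sketch below the spatial sizes are placeholders, so the layer width differs from 26400, and the uniform-initialization interval is illustrative.

import torch
import torch.nn as nn

# illustrative re-weighted feature map sets I'_g1 ... I'_g
i1 = torch.rand(1, 16, 30, 40)
i2 = torch.rand(1, 32, 15, 20)
i3 = torch.rand(1, 64, 8, 10)
ig = torch.rand(1, 128, 4, 5)

features = torch.cat([t.flatten(1) for t in (i1, i2, i3, ig)], dim=1)

mlp_r = nn.Linear(features.shape[1], 1)        # fully connected layer MLP_R
nn.init.uniform_(mlp_r.weight, -0.01, 0.01)    # interval is illustrative only
nn.init.zeros_(mlp_r.bias)                     # bias term b initialized to 0

count = torch.relu(mlp_r(features))            # linear rectification -> people count
print(float(count))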
7. The people number estimation method based on the generative confrontation network model as claimed in claim 6, wherein in the step (7), the network training comprises the following steps:
i. defining a loss function, i.e. an objective function to be optimized, as shown in equation (II):
[ formula (II): the loss function Loss over the m augmented training samples, combining the error produced by the discriminator, weighted by λ_1, and the error produced by the generator, weighted by λ_2 ]
in the formula (II), Loss represents the value of the loss function, λ_1 represents the weight of the error produced by the discriminator, G(I_i) represents the output of the image I_i passed through the generator network G, λ_2 represents the weight of the error produced by the generator, D(G(I_i)) represents the output of G(I_i) passed through the discrimination network D, m represents the number of samples after the training set has been augmented, I_i represents an input image, c_i represents the number of persons in the image, and M_i represents the density map corresponding to the image;
j. the generator network G adopts the Adam optimization algorithm with an initial learning rate g_base_lr; the discrimination network D adopts the RMSprop optimization algorithm with an initial learning rate d_base_lr; the regression network R adopts the Adam optimization algorithm with an initial learning rate r_base_lr; the value range of g_base_lr is 0.000001-1, the value range of d_base_lr is 0.000001-1, and the value range of r_base_lr is 0.000001-1;
k. the following steps ① to ⑫ are executed, iterating m times:
① randomly sampling m images {I_1, I_2, …, I_m} from the training set;
② randomly sampling the density maps {M_1, M_2, …, M_m} corresponding to the m images from the training set;
③ computing the gradient of the discrimination network D, namely the gradient of the training error of the discrimination network D with respect to its parameters θ_d;
④ updating the parameters of the discrimination network D by using the RMSprop optimization algorithm;
⑤ randomly sampling m images {I_1, I_2, …, I_m} from the training set;
⑥ randomly sampling the density maps {C_1, C_2, …, C_m} corresponding to the m images from the training set;
⑦ computing the gradient of the generator network G, namely the gradient of the training error of the generator network G with respect to its parameters θ_g;
⑧ updating the parameters of the generator network G by using the Adam optimization algorithm;
⑨ randomly sampling m images {I_1, I_2, …, I_m} from the training set;
⑩ randomly sampling the people-number labels {C_1, C_2, …, C_m} corresponding to the m images from the training set;
⑪ computing the gradient of the regression network R, namely the gradient of the training error of the regression network R with respect to its parameters θ_r;
⑫ updating the parameters of the regression network R by using the Adam optimization algorithm.
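The training of claim 7 alternates three updates per iteration: the discrimination network D with RMSprop, the generator network G with Adam, and the regression network R with Adam. Because equation (II) is not reproduced here, the loss terms in the following skeleton are plain mean-squared-error placeholders; G, D and R are assumed to be modules like the ones sketched after claims 2, 3 and 6, and the default learning rates follow claim 10.

import torch
import torch.nn as nn

def make_optimizers(G, D, R, g_base_lr=0.00001, d_base_lr=0.0002, r_base_lr=0.0001):
    # step j: Adam for G and R, RMSprop for D (learning rates as in claim 10)
    return (torch.optim.Adam(G.parameters(), lr=g_base_lr),
            torch.optim.RMSprop(D.parameters(), lr=d_base_lr),
            torch.optim.Adam(R.parameters(), lr=r_base_lr))

def train_iteration(G, D, R, imgs, density_maps, counts, opt_g, opt_d, opt_r):
    mse = nn.MSELoss()

    # steps 1-4: update the discrimination network D with RMSprop
    with torch.no_grad():
        feats = G(imgs)
    d_loss = mse(D(feats[-1]), density_maps)       # placeholder discriminator error
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # steps 5-8: update the generator network G with Adam
    g_loss = mse(D(G(imgs)[-1]), density_maps)     # placeholder generator error
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    # steps 9-12: update the regression network R with Adam against the people counts
    with torch.no_grad():
        feats = G(imgs)
    r_loss = mse(R(feats), counts)                 # placeholder regression error
    opt_r.zero_grad(); r_loss.backward(); opt_r.step()

    return d_loss.item(), g_loss.item(), r_loss.item()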
8. The people number estimation method based on the generative confrontation network model as claimed in claim 1, wherein the step (1) of obtaining multi-scale data comprises:
(i) randomly cropping each image in an image database to obtain M image blocks with the size of a × b and N image blocks with the size of c × d, wherein the value range of M is 1-100, the value range of N is 1-100, the value range of a is 1-320, the value range of b is 1-240, the value range of c is 1-320, the value range of d is 1-240, and the unit of a, b, c and d is pixels;
(ii) adjusting the resolution of each image in the image database and of each image block randomly cropped in the step (i) to e × f, wherein the value range of e is 80-640 and the value range of f is 60-480;
(iii) sequentially performing horizontal flipping, vertical flipping, central-symmetry transformation and Gaussian-noise addition on each image and each image block in the image database to obtain a new image set, denoted as I;
(iv) labeling the head positions in each image of the new image set I to obtain a labeled template image set of the image set I, denoted as L, and a set C of the numbers of people in all the images of the new image set I;
(v) processing each image in the labeled template set L by the formula (I) to obtain a density map set of the image set I, denoted as M:
M_i(x, y) = Σ_k (1/(2πσ²)) · exp( -((x - x_k)² + (y - y_k)²) / (2σ²) ) when C_i > 0, and M_i = 0_{e×f} when C_i = 0    (I)
in the formula (I), {(x_k, y_k), 0 ≤ k ≤ C_i} denotes the pixel positions of the persons marked in the image i, C_i represents the number of persons in the image i, M_i(x, y) represents the density map corresponding to the image i, σ is the standard deviation, i represents the index of the image, and 0_{e×f} represents an all-zero matrix of size e × f;
(vi) obtaining a training set (I, M, C) of multi-scale data, each sample being denoted by (I_i, M_i, C_i), wherein I_i represents the image i, M_i represents the density map of the image i, and C_i represents the number of people in the image i.
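One common reading of formula (I), consistent with the symbols defined above, is a density map built by placing a 2-D Gaussian of standard deviation σ at every labelled head position, with an all-zero e × f matrix for an image containing no people. A NumPy sketch under that reading, with σ = 3.0 as in claim 9 and an illustrative 320 × 240 resolution:

import numpy as np

def density_map(head_positions, height=240, width=320, sigma=3.0):
    # a Gaussian kernel at each labelled head position; 0_{e x f} when no people
    M = np.zeros((height, width), dtype=np.float64)
    if len(head_positions) == 0:
        return M
    ys, xs = np.mgrid[0:height, 0:width]
    for (xk, yk) in head_positions:
        M += np.exp(-((xs - xk) ** 2 + (ys - yk) ** 2) / (2.0 * sigma ** 2)) \
             / (2.0 * np.pi * sigma ** 2)
    return M

heads = [(50, 60), (200, 120), (310, 30)]       # (x, y) pixel positions of marked heads
M = density_map(heads)
print(M.shape, M.sum())                          # the map sums to roughly the head count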
9. The people estimation method based on generative confrontation network model as claimed in claim 8,
in the step (i), each image in the image database is randomly cropped to obtain 5 image blocks with the size of 120 × 80 and the size of 150 × 100;
in the step (ii), the resolution of each image in the image database and each image block randomly intercepted in the step (i) is adjusted to 320 × 240; σ is 3.0.
10. The people number estimation method based on the generative confrontation network model as claimed in claim 7, wherein g _ base _ lr is 0.00001, d _ base _ lr is 0.0002, and r _ base _ lr is 0.0001.
CN201811415565.0A 2018-11-26 2018-11-26 People number estimation method based on generation type confrontation network model Active CN109522857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811415565.0A CN109522857B (en) 2018-11-26 2018-11-26 People number estimation method based on generation type confrontation network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811415565.0A CN109522857B (en) 2018-11-26 2018-11-26 People number estimation method based on generation type confrontation network model

Publications (2)

Publication Number Publication Date
CN109522857A CN109522857A (en) 2019-03-26
CN109522857B true CN109522857B (en) 2021-04-23

Family

ID=65793346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811415565.0A Active CN109522857B (en) 2018-11-26 2018-11-26 People number estimation method based on generation type confrontation network model

Country Status (1)

Country Link
CN (1) CN109522857B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446302A (en) * 2018-01-29 2018-08-24 东华大学 A kind of personalized recommendation system of combination TensorFlow and Spark
CN110008554B (en) * 2019-03-27 2022-10-18 哈尔滨工业大学 Method for optimizing technological parameters and welding tool structure of friction stir welding seam forming prediction based on numerical simulation and deep learning
CN110097185B (en) * 2019-03-29 2021-03-23 北京大学 Optimization model method based on generation of countermeasure network and application
CN109978807B (en) * 2019-04-01 2020-07-14 西北工业大学 Shadow removing method based on generating type countermeasure network
CN110033043B (en) * 2019-04-16 2020-11-10 杭州电子科技大学 Radar one-dimensional range profile rejection method based on condition generation type countermeasure network
CN110120020A (en) * 2019-04-30 2019-08-13 西北工业大学 A kind of SAR image denoising method based on multiple dimensioned empty residual error attention network
CN110335212B (en) * 2019-06-28 2021-01-15 西安理工大学 Defect ancient book Chinese character repairing method based on condition confrontation network
CN110705340B (en) * 2019-08-12 2023-12-26 广东石油化工学院 Crowd counting method based on attention neural network field
CN110503049B (en) * 2019-08-26 2022-05-03 重庆邮电大学 Satellite video vehicle number estimation method based on generation countermeasure network
CN111080501B (en) * 2019-12-06 2024-02-09 中国科学院大学 Real crowd density space-time distribution estimation method based on mobile phone signaling data
CN111429436B (en) * 2020-03-29 2022-03-15 西北工业大学 Intrinsic image analysis method based on multi-scale attention and label loss
CN112326276B (en) * 2020-10-28 2021-07-16 北京航空航天大学 High-speed rail steering system fault detection LSTM method based on generation countermeasure network
CN112818945A (en) * 2021-03-08 2021-05-18 北方工业大学 Convolutional network construction method suitable for subway station crowd counting
CN113421192B (en) * 2021-08-24 2021-11-19 北京金山云网络技术有限公司 Training method of object statistical model, and statistical method and device of target object
CN114972111B (en) * 2022-06-16 2023-01-10 慧之安信息技术股份有限公司 Dense crowd counting method based on GAN image restoration
CN115357218A (en) * 2022-08-02 2022-11-18 北京航空航天大学 High-entropy random number generation method based on chaos prediction antagonistic learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423701A (en) * 2017-07-17 2017-12-01 北京智慧眼科技股份有限公司 The non-supervisory feature learning method and device of face based on production confrontation network
CN107451619A (en) * 2017-08-11 2017-12-08 深圳市唯特视科技有限公司 A kind of small target detecting method that confrontation network is generated based on perception
CN108764085A (en) * 2018-05-17 2018-11-06 上海交通大学 Based on the people counting method for generating confrontation network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474881B2 (en) * 2017-03-15 2019-11-12 Nec Corporation Video retrieval system based on larger pose face frontalization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423701A (en) * 2017-07-17 2017-12-01 北京智慧眼科技股份有限公司 The non-supervisory feature learning method and device of face based on production confrontation network
CN107451619A (en) * 2017-08-11 2017-12-08 深圳市唯特视科技有限公司 A kind of small target detecting method that confrontation network is generated based on perception
CN108764085A (en) * 2018-05-17 2018-11-06 上海交通大学 Based on the people counting method for generating confrontation network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Squeeze-and-Excitation Networks; Jie Hu et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-06-23; pp. 7132-7141 *
Research on pedestrian re-identification algorithms in non-overlapping domains; He Qing et al.; Information Technology; July 2018 (No. 7); pp. 34-38 *

Also Published As

Publication number Publication date
CN109522857A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN109522857B (en) People number estimation method based on generation type confrontation network model
CN112364779B (en) Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN108717568B (en) A kind of image characteristics extraction and training method based on Three dimensional convolution neural network
CN110335261B (en) CT lymph node detection system based on space-time circulation attention mechanism
CN114429156B (en) Radar interference multi-domain characteristic countermeasure learning and detection recognition method
CN106295124B (en) The method of a variety of image detecting technique comprehensive analysis gene subgraph likelihood probability amounts
CN106295694B (en) Face recognition method for iterative re-constrained group sparse representation classification
CN109190537A (en) A kind of more personage's Attitude estimation methods based on mask perceived depth intensified learning
CN109145992A (en) Cooperation generates confrontation network and sky composes united hyperspectral image classification method
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN109002848B (en) Weak and small target detection method based on feature mapping neural network
CN109919241B (en) Hyperspectral unknown class target detection method based on probability model and deep learning
CN110728629A (en) Image set enhancement method for resisting attack
CN114241422A (en) Student classroom behavior detection method based on ESRGAN and improved YOLOv5s
CN109598220A (en) A kind of demographic method based on the polynary multiple dimensioned convolution of input
CN113780242A (en) Cross-scene underwater sound target classification method based on model transfer learning
CN107729926A (en) A kind of data amplification method based on higher dimensional space conversion, mechanical recognition system
CN114428234A (en) Radar high-resolution range profile noise reduction identification method based on GAN and self-attention
CN116482618B (en) Radar active interference identification method based on multi-loss characteristic self-calibration network
CN104778466A (en) Detection method combining various context clues for image focus region
CN115496720A (en) Gastrointestinal cancer pathological image segmentation method based on ViT mechanism model and related equipment
CN113344045A (en) Method for improving SAR ship classification precision by combining HOG characteristics
CN113435276A (en) Underwater sound target identification method based on antagonistic residual error network
CN112800882A (en) Mask face posture classification method based on weighted double-flow residual error network
CN109389101A (en) A kind of SAR image target recognition method based on denoising autoencoder network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant