CN112036367A - People number detection method of YOLO convolutional neural network - Google Patents

People number detection method of YOLO convolutional neural network

Info

Publication number
CN112036367A
CN112036367A
Authority
CN
China
Prior art keywords
pedestrian
frame
people
detection
people number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010974637.6A
Other languages
Chinese (zh)
Inventor
陈敏
夏圣奎
吉训生
王文郁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Tiancheng Modern Agricultural Technology Co ltd
Original Assignee
Nantong Tiancheng Modern Agricultural Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong Tiancheng Modern Agricultural Technology Co ltd filed Critical Nantong Tiancheng Modern Agricultural Technology Co ltd
Priority to CN202010974637.6A
Publication of CN112036367A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Abstract

The invention discloses a people number detection method of a YOLO convolutional neural network, comprising a library file creating unit, a feature extraction unit and a people number judgment unit. The library file creating unit is used for creating a standard library file, and the standard library file comprises a plurality of reference convolution features, network parameters and corresponding people numbers; the feature extraction unit is used for receiving the video frames shot by the camera and extracting the convolution features of the video frames, so as to realize people number detection and judgment. According to the people number detection method of the YOLO convolutional neural network, the distribution of pedestrians in an image and the attributes used by the semantic counting method are detected together, and semantic-attribute learning is used to assist pedestrian detection in the image; the influence and interference of the semantic attributes in the image on the pedestrians are suppressed, the detection precision is improved, and the low accuracy of deep learning pedestrian counting methods in target image and video detection scenes is alleviated.

Description

People number detection method of YOLO convolutional neural network
Technical Field
The invention relates to the technical field of people number detection combining a YOLO convolutional neural network and semantic information with an optimized prior frame selection scheme, and in particular to a people number detection method of a YOLO convolutional neural network.
Background
Existing pedestrian detection algorithms using YOLOv3 often suffer from a high missed-detection rate on images in target image detection networks. Considering that pedestrian distribution and the semantic-counting attribute learning method in the target image network show a certain correlation between pedestrians, and that accuracy remains low, a people number detection system based on a YOLO convolutional neural network with an optimized prior frame selection scheme needs to be invented.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a people number detection method of a YOLO convolutional neural network, which has the advantages of optimizing the prior frame selection scheme and considering the IOU-based confidence between pedestrian classifications more comprehensively during prior frame classification selection. It solves the problem that pedestrian detection algorithms using YOLOv3 often suffer from a high missed-detection rate in target image detection networks, and addresses the fact that pedestrian distribution and the semantic-counting attribute learning method show a certain correlation between pedestrians while accuracy remains low.
(II) technical scheme
In order to optimize the prior frame selection scheme and consider the IOU-based confidence between pedestrian classifications more comprehensively during prior frame classification selection, the invention provides the following technical scheme: a people number detection method of a YOLO convolutional neural network comprises the following steps:
s1, creating a standard library file
A standard library file is created from the network parameters, reference convolution features and corresponding people numbers of the YOLO convolutional neural network trained on labeled pedestrian samples.
S2, video frame input
After receiving the video stream shot by the camera, the captured video frames are input.
S3, track differentiation
Track cascade matching is performed on the prediction boxes of YOLOv3 by using DeepSORT, so as to distinguish the different tracks of people across consecutive frame images.
Input: the tracking set Τ = {1, …, N}, the detection set D = {1, …, M}, and the maximum age threshold Amax.
The matrix C = [ci,j] stores the computed distances between every object track i and every object detection j.
The matrix B = [bi,j] stores the 0/1 judgment of whether object track i may be associated with object detection j.
The association set M is initialized to the empty set.
The set U of object detections for which no match has been found is initialized to D.
Loop from the trackers that matched most recently to the trackers that have gone at most Amax frames without a match.
Select the set of trackers satisfying the condition: Τn ← { i ∈ Τ | ai = n }.
Compute the set [xi,j] of successful associations between Τn and the object detections j using a minimum-cost assignment algorithm.
Update M to M ∪ { (i, j) | bi,j · xi,j > 0 }, the successful matches.
Remove the successfully matched object detections j from U.
Repeat steps (3b) to (3f) until all video frames are matched.
Return the two sets M and U.
S4, counting the number of people
A detection line is set, and whether a pedestrian enters or leaves the place is determined according to the direction in which the pedestrian's track crosses the detection line, so as to count the number of people.
S5, outputting the processed video containing the pedestrian mark frame and the number of people
S51, constructing a lightweight model
By using a ShuffleNet network to replace the feature extraction network in the darknet53 network and pruning the model, the model size and computational complexity are significantly optimized. Then a pedestrian detection network with an optimized prior frame selection scheme is used to extract pedestrian target features, and the loss value is calculated using the cross-entropy loss and the bounding-box regression loss; pedestrian detection is realized by combining the detection box positions on the basis of classification; a deep learning pedestrian detection method combining the semantic attributes of pedestrians is provided. The anchor box sizes are optimized: considering that the research target is the pedestrian, whose bounding box generally has a height-to-width ratio between 2:1 and 4:1, and that targets of different scales appear in the scene, the three scales are set to 128, 256 and 512 respectively and the height-to-width ratios to 2:1, 3:1 and 4:1, giving 9 anchor boxes in total. An anchor box whose Intersection Over Union (IOU) with a ground-truth bounding box is greater than 0.7 is marked as a positive sample, and one whose IOU is less than 0.3 is marked as a negative sample.
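The anchor box settings described above (three scales 128, 256 and 512, height-to-width ratios 2:1, 3:1 and 4:1, and the IOU thresholds 0.7/0.3 for positive/negative samples) can be sketched as follows. This is an illustrative sketch only; the function names are assumptions, not the patented implementation, and the scale is interpreted as the square root of the anchor area.

```python
# Illustrative sketch of the 9 anchor boxes and the IoU-based labelling rule.
import itertools
import math

def make_anchors():
    """Return (w, h) for the 9 anchors; scale is taken as sqrt(area)."""
    anchors = []
    for scale, ratio in itertools.product((128, 256, 512), (2, 3, 4)):
        w = scale / math.sqrt(ratio)   # ratio = h / w (tall pedestrian boxes)
        h = scale * math.sqrt(ratio)
        anchors.append((w, h))
    return anchors

def iou(a, b):
    """Intersection over union of boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def label_anchor(anchor_box, gt_box):
    """IoU > 0.7 -> positive, IoU < 0.3 -> negative, otherwise ignored."""
    v = iou(anchor_box, gt_box)
    return 'positive' if v > 0.7 else 'negative' if v < 0.3 else 'ignore'
```

Each anchor keeps the stated area (w·h = scale²) while its height is ratio times its width, matching the tall shape of pedestrian boxes.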
S52, constructing an actual model
An actual model is constructed according to the ShuffleNet channel shuffling principle, with a residual unit comprising 3 layers: a 1×1 convolution followed by a 3×3 depthwise convolution, where the 3×3 convolution is the bottleneck layer, then another 1×1 convolution, and finally a shortcut connection that adds the input directly to the output, reducing computation.
The dense 1×1 convolution is replaced by a 1×1 group convolution, and a channel shuffle operation is added after the first 1×1 convolution. Likewise, no ReLU activation function is used after the 3×3 depthwise convolution.
For the downsampling unit, a 3×3 average pooling with stride = 2 is applied to the original input, and stride = 2 is used in the depthwise convolution so that the two paths have the same size; the resulting feature map is then concatenated with the output instead of added. The computation and parameter size are greatly reduced.
S53 model compression
The scaling factor γ in the BN layer is utilized to measure the importance of each channel during training, and unimportant channels are deleted, so that the model is compressed in size and the running speed is improved. When γ is small, for example within (0.001, 0.003), the corresponding channel is deleted. By cleverly adding γ into the objective function, the effect of training and pruning at the same time is achieved.
Preferably, the steps S1 and S2 are provided with a library file creating unit, a feature extraction unit and a people number judging unit, wherein the library file creating unit is configured to create a standard library file, and the standard library file includes a plurality of reference convolution features, network parameters and corresponding people numbers; the feature extraction unit is configured to receive the video frames shot by the camera and extract the convolution features of the video frames, so as to realize people number detection; and the people number judging unit is configured to obtain, from the standard library file, the reference convolution feature closest to the convolution feature of the video frame, and take the people number corresponding to that reference convolution feature vector as the people number in the current scene.
Preferably, the feature extraction unit is located in a server connected to the control system, and the library file creation unit includes a people number input subunit and a classification learning subunit located in the server, with a Softmax classifier layer at the end of the network; the classification learning subunit takes the convolution features of video frames with different numbers of people and different lighting as reference convolution features, and performs classification learning on these reference features together with the people numbers of the corresponding video frames input by the people number input subunit, so as to generate the standard library file.
Preferably, the camera transmitting terminal is in signal connection with the server receiving terminal, and the server comprises a signal receiving unit for receiving the video shot by the camera from the control system through the internet.
Preferably, the objective function in step S53 is

L = Σ(x,y) l(f(x, W), y) + λ Σγ∈Γ g(γ)

where the first term is the loss generated by model prediction and the second term constrains γ; λ is a hyper-parameter weighing the two terms, typically set to 1e-4 or 1e-5; and g(·) adopts the L1 norm, g(γ) = |γ|.
Preferably, the semantic attributes of the pedestrian in step S51 refer to attributes of items that often appear around and are attached to the pedestrian, such as the pedestrian's hat, bag, etc.
Preferably, in step S51 the IOU is used to establish the relation between the pedestrian and the semantic attributes: the IOU of each semantic attribute prediction box with each pedestrian prediction box is calculated, a larger IOU indicating a higher correlation between the semantic attribute and the pedestrian; all semantic attribute prediction boxes with IOU < μ3 are removed, and if a semantic attribute prediction box overlaps several pedestrian prediction boxes, the attribute is assigned to the pedestrian target with the largest IOU. The final score of a pedestrian target is its original score plus the product of the IOU between the pedestrian target box and its corresponding semantic attribute box and the score of that semantic attribute box.
Preferably, in steps S51 and S52, misclassified samples are relearned and assigned larger weights to improve detection accuracy. First, the previous weak classifier reads the loss values of all prediction boxes and sorts them from large to small to select the prediction box with the largest loss; the gradient is then back-propagated to a new weak classifier, which during learning assigns more weight to the prediction box that had the largest loss in the previous round; the loss values of all prediction boxes are then sorted and back-propagated again. This iteration repeats until the maximum prediction-box loss is less than a small constant, set to 0.05 in this model. The loss function of the model is defined as the sum of the bounding-box regression loss and the cross-entropy loss. S-NMS is used during testing: traditional Non-Maximum Suppression (NMS) directly suppresses bounding boxes near a higher-scoring box by setting their scores to 0, which reduces the redundant computation caused by duplicate boxes, but when high target density produces clusters of high-scoring boxes it suppresses true detections and causes the model to miss targets.
(III) advantageous effects
Compared with the prior art, the invention provides the method for detecting the number of people in the YOLO convolutional neural network, which has the following beneficial effects:
1. According to the people number detection method of the YOLO convolutional neural network, the distribution of pedestrians in an image and the attributes used by the semantic counting method are detected together, and semantic-attribute learning is used to assist pedestrian detection in the image; the influence and interference of the semantic attributes in the image on the pedestrians are suppressed, the detection precision is improved, and the low accuracy of deep learning pedestrian counting methods in target image and video detection scenes is alleviated.
Drawings
FIG. 1 is a block diagram of the steps of the present invention;
FIG. 2 is a schematic diagram showing the basic ResNet lightweight structure (a), the group convolution structure (b) and the channel shuffling structure (c);
FIG. 3 is a diagram of a pruning model;
FIG. 4 is an overall flow diagram of the algorithm;
FIG. 5 is a diagram of the structure of compressed YOLO.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-5, the present invention provides a technical solution: a people number detection method of a YOLO convolutional neural network comprises the following steps:
s1, creating a standard library file
A standard library file is created from the network parameters, reference convolution features and corresponding people numbers of the YOLO convolutional neural network trained on labeled pedestrian samples.
S2, video frame input
After receiving the video stream shot by the camera, the captured video frames are input.
The steps S1 and S2 are provided with a library file creating unit, a feature extraction unit and a people number judgment unit. The library file creating unit is used for creating a standard library file, and the standard library file comprises a plurality of reference convolution features, network parameters and corresponding people numbers. The feature extraction unit is used for receiving the video frames shot by the camera and extracting the convolution features of the video frames, so as to realize people number detection. The people number judgment unit is used for obtaining, from the standard library file, the reference convolution feature closest to the convolution feature of the video frame and taking the people number corresponding to that reference convolution feature vector as the people number in the current scene. The feature extraction unit is located in a server connected to the control system, and the library file creating unit comprises a people number input subunit and a classification learning subunit located in the server, with a Softmax classifier layer at the end of the network. The classification learning subunit takes the convolution features of video frames with different numbers of people and different lighting as reference convolution features, and performs classification learning on these reference features together with the people numbers of the corresponding video frames input by the people number input subunit, so as to generate the standard library file. The transmitting end of the camera is in signal connection with the receiving end of the server, and the server comprises a signal receiving unit that receives, through the internet, the videos shot by the camera of the control system.
S3, track differentiation
Track cascade matching is performed on the prediction boxes of YOLOv3 by using DeepSORT, so as to distinguish the different tracks of people across consecutive frame images.
Input: the tracking set Τ = {1, …, N}, the detection set D = {1, …, M}, and the maximum age threshold Amax.
The matrix C = [ci,j] stores the computed distances between every object track i and every object detection j.
The matrix B = [bi,j] stores the 0/1 judgment of whether object track i may be associated with object detection j.
The association set M is initialized to the empty set.
The set U of object detections for which no match has been found is initialized to D.
Loop from the trackers that matched most recently to the trackers that have gone at most Amax frames without a match.
Select the set of trackers satisfying the condition: Τn ← { i ∈ Τ | ai = n }.
Compute the set [xi,j] of successful associations between Τn and the object detections j using a minimum-cost assignment algorithm.
Update M to M ∪ { (i, j) | bi,j · xi,j > 0 }, the successful matches.
Remove the successfully matched object detections j from U.
Repeat steps (3b) to (3f) until all video frames are matched.
Return the two sets M and U.
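The matching cascade above can be sketched as follows. This is a minimal illustrative sketch: a greedy minimum-cost assignment stands in for the Hungarian algorithm used in practice, and the `cost` and `gate` callables are assumed placeholders for the distance matrix C and the 0/1 association matrix B.

```python
# Hypothetical sketch of the DeepSORT-style matching cascade.
def matching_cascade(tracks, detections, cost, gate, a_max):
    """tracks: {track_id: frames_since_last_match}; detections: list of ids.
    cost(i, j) -> distance c_ij; gate(i, j) -> bool (b_ij)."""
    matches = []                    # association set M, initialised empty
    unmatched = set(detections)     # set U, initialised to all detections D
    for n in range(1, a_max + 1):   # recently matched tracks get priority
        tier = [i for i, age in tracks.items() if age == n]
        # greedy minimum-cost assignment within this tier
        pairs = sorted((cost(i, j), i, j)
                       for i in tier for j in unmatched if gate(i, j))
        used = set()
        for c, i, j in pairs:
            if i not in used and j in unmatched:
                matches.append((i, j))
                used.add(i)
                unmatched.discard(j)
    return matches, unmatched       # the two sets M and U
```

Tracks that stay in `unmatched` after Amax tiers would be handled by track initiation/deletion logic outside this sketch.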
S4, counting the number of people
A detection line is set, and whether a pedestrian enters or leaves the place is determined according to the direction in which the pedestrian's track crosses the detection line, so as to count the number of people.
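The detection-line counting rule can be sketched as follows. The horizontal line and the convention that a downward crossing counts as an entry are illustrative assumptions; each track is taken as a list of (x, y) centroids.

```python
# Illustrative sketch of counting entries/exits at a horizontal detection line.
def count_crossings(track, line_y):
    """+1 for each downward crossing (enter), -1 for each upward (leave)."""
    delta = 0
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        if y0 < line_y <= y1:
            delta += 1   # crossed the line moving down: entered
        elif y1 < line_y <= y0:
            delta -= 1   # crossed the line moving up: left
    return delta

def people_inside(tracks, line_y):
    """Net number of people currently inside, given all tracks so far."""
    return sum(count_crossings(t, line_y) for t in tracks)
```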
S5, outputting the processed video containing the pedestrian mark frame and the number of people
S51, constructing a lightweight model
By using a ShuffleNet network to replace the feature extraction network in the darknet53 network and pruning the model, the model size and computational complexity are significantly optimized. Then a pedestrian detection network with an optimized prior frame selection scheme is used to extract pedestrian target features, and the loss value is calculated using the cross-entropy loss and the bounding-box regression loss; pedestrian detection is realized by combining the detection box positions on the basis of classification; a deep learning pedestrian detection method combining the semantic attributes of pedestrians is provided. The anchor box sizes are optimized: considering that the research target is the pedestrian, whose bounding box generally has a height-to-width ratio between 2:1 and 4:1, and that targets of different scales appear in the scene, the three scales are set to 128, 256 and 512 respectively and the height-to-width ratios to 2:1, 3:1 and 4:1, giving 9 anchor boxes in total. An anchor box whose Intersection Over Union (IOU) with a ground-truth bounding box is greater than 0.7 is marked as a positive sample, and one whose IOU is less than 0.3 is marked as a negative sample. The semantic attributes of the pedestrian refer to attributes of items that often appear around and are attached to the pedestrian, such as the pedestrian's hat or bag, and the IOU is used to establish the relation between the pedestrian and the semantic attributes: the IOU of each semantic attribute prediction box with each pedestrian prediction box is calculated, a larger IOU indicating a higher correlation between the semantic attribute and the pedestrian; all semantic attribute prediction boxes with IOU < μ3 are removed, and if a semantic attribute prediction box overlaps several pedestrian prediction boxes, the attribute is assigned to the pedestrian target with the largest IOU. The final score of a pedestrian target is its original score plus the product of the IOU between the pedestrian target box and its corresponding semantic attribute box and the score of that semantic attribute box.
Overlap ratio: IOU(Bped, Battr) = area(Bped ∩ Battr) / area(Bped ∪ Battr).
Score of the pedestrian target: S′ped = Sped + IOU(Bped, Battr) × Sattr.
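The attribute-to-pedestrian association and rescoring can be sketched as follows. This is one plausible reading of the description: attributes below the IOU threshold μ3 are dropped, an attribute overlapping several pedestrians goes to the one with the largest IOU, and each matched attribute raises the pedestrian score by IOU × attribute score. The helper names and the passed-in `iou` function are assumptions.

```python
# Sketch of attaching semantic-attribute boxes (hat, bag, ...) to pedestrians.
def rescore_pedestrians(peds, attrs, iou, mu3=0.3):
    """peds/attrs: lists of (box, score). Returns updated pedestrian scores."""
    scores = [s for _, s in peds]
    for a_box, a_score in attrs:
        overlaps = [iou(p_box, a_box) for p_box, _ in peds]
        # the attribute belongs to the pedestrian with the largest IoU
        best = max(range(len(peds)), key=lambda k: overlaps[k], default=None)
        if best is not None and overlaps[best] >= mu3:
            scores[best] += overlaps[best] * a_score   # S' = S + IoU * S_attr
    return scores
```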
S52, constructing an actual model
An actual model is constructed according to the ShuffleNet channel shuffling principle, with a residual unit comprising 3 layers: a 1×1 convolution followed by a 3×3 depthwise convolution, where the 3×3 convolution is the bottleneck layer, then another 1×1 convolution, and finally a shortcut connection that adds the input directly to the output, reducing computation.
The dense 1×1 convolution is replaced by a 1×1 group convolution, and a channel shuffle operation is added after the first 1×1 convolution. Likewise, no ReLU activation function is used after the 3×3 depthwise convolution.
For the downsampling unit, a 3×3 average pooling with stride = 2 is applied to the original input, and stride = 2 is used in the depthwise convolution so that the two paths have the same size; the resulting feature map is then concatenated with the output instead of added. The computation and parameter size are greatly reduced.
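The channel shuffle operation can be sketched in pure Python on a flat list of channels; a framework implementation would apply the same reshape-transpose-flatten to the channel axis of a 4-D tensor.

```python
# Pure-Python sketch of ShuffleNet's channel shuffle: view the channels as a
# (groups x channels_per_group) grid, transpose it, and flatten it back, so
# that information mixes across the groups of the next group convolution.
def channel_shuffle(channels, groups):
    """channels: flat list ordered group-by-group; returns shuffled order."""
    n = len(channels)
    assert n % groups == 0
    per = n // groups
    return [channels[g * per + k] for k in range(per) for g in range(groups)]
```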
The misclassified samples are relearned and assigned larger weights to improve detection accuracy. First, the previous weak classifier reads the loss values of all prediction boxes and sorts them from large to small to select the prediction box with the largest loss; the gradient is then back-propagated to a new weak classifier, which during learning assigns more weight to the prediction box that had the largest loss in the previous round; the loss values of all prediction boxes are then sorted and back-propagated again. This iteration repeats until the maximum prediction-box loss is less than a small constant, set to 0.05 in this model. The loss function of the model is defined as the sum of the bounding-box regression loss and the cross-entropy loss. S-NMS is used during testing: traditional Non-Maximum Suppression (NMS) directly suppresses bounding boxes near a higher-scoring box by setting their scores to 0, which reduces the redundant computation caused by duplicate boxes, but when high target density produces clusters of high-scoring boxes it suppresses true detections and causes the model to miss targets.
The formula of the S-NMS is as follows, where M is the bounding box with the highest score, bi is the i-th prediction box around M, and Nt is a threshold, set here to 0.7:

si = si, if IOU(M, bi) < Nt
si = si (1 − IOU(M, bi)), if IOU(M, bi) ≥ Nt
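The S-NMS score decay can be sketched as follows. A linear decay si(1 − IOU) for boxes overlapping the current top box by at least Nt is assumed, and the passed-in `iou` function is a placeholder; boxes are kept and only rescored, rather than suppressed outright.

```python
# Illustrative sketch of linear Soft-NMS (S-NMS): nearby boxes are decayed,
# not zeroed, so dense clusters of true detections survive.
def soft_nms(boxes, scores, iou, n_t=0.7):
    """boxes: hashable ids; returns {box: decayed score}."""
    boxes, scores = list(boxes), list(scores)
    out = {}
    while scores:
        m = max(range(len(scores)), key=scores.__getitem__)
        m_box, m_score = boxes.pop(m), scores.pop(m)
        out[m_box] = m_score                       # keep the current top box M
        for i, b in enumerate(boxes):
            if iou(m_box, b) >= n_t:
                scores[i] *= 1.0 - iou(m_box, b)   # decay instead of suppress
    return out
```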
In the network training process, in order to reduce noise interference during gradient calculation, stochastic gradient descent with weight decay of 0.01, a learning rate of 0.01 and a momentum of 0.9 is used to minimize the loss value and optimize the network parameters.
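The stated optimizer settings (learning rate 0.01, momentum 0.9, weight decay 0.01) can be sketched as a single scalar SGD step. This is an illustrative sketch, not the training code of the invention; weight decay is folded into the gradient as an L2 term, one common convention.

```python
# One-step sketch of SGD with momentum and weight decay on a scalar parameter.
def sgd_step(w, g, v, lr=0.01, momentum=0.9, weight_decay=0.01):
    """w: parameter, g: gradient, v: momentum buffer. Returns (new_w, new_v)."""
    g = g + weight_decay * w    # L2 weight decay folded into the gradient
    v = momentum * v + g        # momentum accumulates a running gradient
    return w - lr * v, v
```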
S53 model compression
The scaling factor γ in the BN layer is utilized to measure the importance of each channel during training, and unimportant channels are deleted, so that the model is compressed in size and the running speed is improved. When γ is small, for example within (0.001, 0.003), the corresponding channel is deleted. By cleverly adding γ into the objective function, the effect of training and pruning at the same time is achieved.
The objective function is

L = Σ(x,y) l(f(x, W), y) + λ Σγ∈Γ g(γ)

where the first term is the loss generated by model prediction and the second term constrains γ; λ is a hyper-parameter weighing the two terms, typically set to 1e-4 or 1e-5; and g(·) adopts the L1 norm, g(γ) = |γ|.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A people number detection method of a YOLO convolutional neural network is characterized by comprising the following steps:
s1, creating a standard library file
Creating a standard library file through the network parameters, the reference convolution features and the corresponding number of people of the YOLO convolution neural network trained by the marked pedestrian sample;
s2, video frame input
After receiving a video frame shot by the camera, inputting the captured video frame;
s3, track differentiation
Performing track cascade matching on the prediction boxes of YOLOv3 by using DeepSORT, distinguishing the different tracks of people across consecutive frame images;
an input tracking set Τ = {1, …, N}, a detection set D = {1, …, M}, and a maximum age threshold Amax;
the matrix C = [ci,j] stores the computed distances between every object track i and every object detection j;
the matrix B = [bi,j] stores the 0/1 judgment of whether object track i may be associated with object detection j;
initializing the association set M to the empty set;
initializing the set U of object detections for which no match has been found as D;
looping from the trackers that matched most recently to the trackers that have gone at most Amax frames without a match;
selecting the set of trackers satisfying the condition: Τn ← { i ∈ Τ | ai = n };
computing the set [xi,j] of successful associations between Τn and the object detections j using a minimum-cost assignment algorithm;
updating M to M ∪ { (i, j) | bi,j · xi,j > 0 }, the successful matches;
removing the successfully matched object detections j from U;
repeating steps (3b) to (3f) until all video frames are matched;
returning the two sets M and U
S4, counting the number of people
Setting a detection line, and determining whether the pedestrian enters the place or leaves the place according to the direction of the pedestrian track passing through the detection line so as to count the number of people;
S5, outputting the processed video containing the pedestrian bounding boxes and the people count
S51, constructing a lightweight model
The feature extraction network in the Darknet-53 network is replaced with a ShuffleNet network and the model is pruned, significantly reducing the model size and computational complexity; the pedestrian detection network with an optimized prior-box selection scheme is then used to extract pedestrian target features, the loss value is computed as the cross-entropy loss plus the bounding-box regression loss, and pedestrian detection is realized by combining detection box localization with classification; a deep-learning pedestrian detection method incorporating pedestrian semantic attributes is proposed; the anchor box sizes are optimized: considering that the height-to-width ratio of a pedestrian bounding box is generally between 2:1 and 4:1 and that targets of different scales appear in the scene, 9 anchor boxes with aspect ratios of 2:1, 3:1, and 4:1 are set at three scales respectively; an anchor box whose Intersection over Union (IoU) with a ground-truth bounding box is greater than 0.7 is labeled a positive sample, and one whose IoU is less than 0.3 is labeled a negative sample;
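The anchor labeling rule of S51 can be sketched as follows, using the thresholds stated in the claim (0.7 positive, 0.3 negative); boxes are assumed to be (x1, y1, x2, y2) corners, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label an anchor by its best IoU against the ground-truth boxes."""
    best = max(iou(anchor, g) for g in gt_boxes)
    if best > pos_thr:
        return 1     # positive sample
    if best < neg_thr:
        return 0     # negative sample
    return -1        # in-between: ignored during training
```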
S52, constructing an actual model
An actual model is constructed according to the ShuffleNet channel-shuffle principle, with a residual unit comprising 3 layers: a 1×1 convolution and a 3×3 depthwise convolution, where the 3×3 depthwise convolution is the bottleneck layer, followed by another 1×1 convolution, and finally a shortcut connection that adds the input directly to the output, reducing the computational cost;
the dense 1×1 convolutions are replaced with grouped 1×1 convolutions, and a channel shuffle operation is added after the first 1×1 convolution; additionally, no ReLU activation function is used after the 3×3 depthwise convolution;
for the downsampling unit, a 3×3 average pooling with stride = 2 is applied to the original input, and stride = 2 is taken in the depthwise convolution so that the two paths have the same size; the resulting feature map is then concatenated with the output instead of added, greatly reducing the computational cost and parameter count;
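The channel shuffle operation of S52 is a reshape-and-transpose of the channel dimension, so that channels produced by different convolution groups are interleaved and information can flow across groups. A minimal sketch on a flat list of channel indices (real implementations operate on a 4-D tensor):

```python
def channel_shuffle(channels, groups):
    """Interleave channels from `groups` convolution groups:
    reshape to (groups, n/groups), transpose, flatten."""
    n = len(channels)
    assert n % groups == 0
    per_group = n // groups
    return [channels[g * per_group + k]
            for k in range(per_group)
            for g in range(groups)]
```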
S53, model compression
The scaling factor γ in the BN layer is used to measure the importance of each channel during training, and unimportant channels are deleted, thereby compressing the model size and improving inference speed;
when γ is small, for example in (0.001, 0.003), the corresponding channel is deleted; cleverly, γ is added to the objective function, achieving the effect of training and pruning simultaneously.
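The pruning rule of S53 can be sketched as below: channels whose BN scaling factor |γ| falls below a threshold are marked for deletion, and an L1 penalty on γ is added to the training objective to push unimportant factors toward zero. The threshold value and function names here are illustrative, not taken from the claim.

```python
def prune_mask(gammas, threshold=0.003):
    """Keep/drop mask over channels from their BN scaling factors gamma;
    channels with |gamma| below the threshold are pruned."""
    return [abs(g) >= threshold for g in gammas]

def l1_sparsity_penalty(gammas, lam=1e-4):
    """The sparsity term lambda * sum(|gamma|) added to the objective."""
    return lam * sum(abs(g) for g in gammas)
```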
2. The people number detection method of the YOLO convolutional neural network of claim 1, wherein steps S1 and S2 are provided with a library file creation unit, a feature extraction unit, and a people number judgment unit; the library file creation unit is used to create the standard library file, which comprises a plurality of reference convolution features, network parameters, and corresponding people counts; the feature extraction unit is used to receive the video frames captured by the camera and extract their convolution features so as to realize people number detection; and the people number judgment unit is used to retrieve from the standard library file the reference convolution feature closest to the convolution feature of the current video frame and take the people count corresponding to that reference convolution feature as the number of people in the current scene.
3. The people number detection method of the YOLO convolutional neural network of claim 2, wherein the feature extraction unit is located in a server connected to a control system; the library file creation unit comprises a people number input subunit and a classification learning subunit located in the server, with a Softmax classifier layer at the end of the network; the classification learning subunit is used to take the convolution features of video frames with different people counts and different lighting conditions as reference convolution features, and to perform classification learning on these reference convolution features together with the people count of the corresponding video frame input by the people number input subunit, thereby generating the standard library file.
4. The people number detection method of the YOLO convolutional neural network of claim 2, wherein the transmitting end of the camera is signal-connected to the receiving end of the server, and the server comprises a signal receiving unit for receiving, via the Internet, the video captured by the camera from the control system.
5. The people number detection method of the YOLO convolutional neural network of claim 1, wherein the objective function of step S53 is
L = Σ_(x,y) l(f(x, W), y) + λ Σ_(γ∈Γ) g(γ)
where the first term is the loss produced by model prediction, the second term constrains the scaling factors γ, λ is a hyperparameter balancing the two terms, typically set to 1e-4 or 1e-5, and g(·) adopts the L1 norm, i.e. g(γ) = |γ|.
6. The people number detection method of the YOLO convolutional neural network of claim 1, wherein the pedestrian semantic attributes in step S51 refer to attributes of objects that frequently appear around and are attached to a pedestrian, such as the pedestrian's hat, bag, etc.
7. The people number detection method of the YOLO convolutional neural network of claim 1, wherein in step S51 the IoU is used to establish the relation between a pedestrian and a semantic attribute: the IoU between each semantic-attribute prediction box and each pedestrian prediction box is computed, a larger IoU indicating a stronger correlation between the semantic attribute and the pedestrian; semantic-attribute prediction boxes with IoU < μ3 are removed; if a semantic-attribute prediction box overlaps several pedestrian prediction boxes, the attribute is assigned to the pedestrian target with the largest IoU; finally, the score of a pedestrian target is its original score plus the product of the IoU between the corresponding semantic-attribute target box and the pedestrian target box and the score of the semantic-attribute target box.
8. The people number detection method of the YOLO convolutional neural network of claim 1, wherein in steps S51 and S52 misclassified samples are relearned and assigned larger weights to improve detection accuracy: the last weak classifier reads in the loss values of all prediction boxes, sorts them from largest to smallest, selects the prediction box with the largest loss, and back-propagates its gradient to a new weak classifier; the new weak classifier learns the weight assigned to the largest-loss prediction box of the previous round, after which the loss values of all prediction boxes are again sorted and back-propagated; this iterates until the maximum prediction-box loss falls below a small constant, at which point iteration ends; the constant in this model is 0.05, and the loss function of the model is defined as the sum of the bounding-box regression loss and the cross-entropy loss; Soft-NMS (S-NMS) is used during testing: conventional non-maximum suppression (NMS) directly suppresses multiple high-scoring bounding boxes by setting the scores of boxes near the highest-scoring box to 0, which reduces the redundant computation caused by duplicate boxes, but when target density is high it also suppresses clusters of genuinely high-scoring boxes, producing missed detections by the model.
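The Soft-NMS behavior described in claim 8 can be sketched as follows: instead of zeroing the scores of boxes overlapping a selected high-scoring box, their scores are decayed by a function of the overlap, so densely packed but distinct pedestrians survive. Linear decay and the thresholds below are illustrative choices, not values from the claim.

```python
def _iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def soft_nms(boxes, scores, iou_thr=0.3, score_thr=0.05):
    """Return indices of kept boxes, in selection order."""
    scores = list(scores)
    idx = sorted(range(len(boxes)), key=lambda i: -scores[i])
    kept = []
    while idx:
        best = idx.pop(0)
        kept.append(best)
        for i in idx:
            o = _iou(boxes[best], boxes[i])
            if o > iou_thr:
                scores[i] *= (1.0 - o)   # decay instead of zeroing
        idx = [i for i in idx if scores[i] >= score_thr]
        idx.sort(key=lambda i: -scores[i])
    return kept
```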
CN202010974637.6A 2020-09-16 2020-09-16 People number detection method of YOLO convolutional neural network Pending CN112036367A (en)

Publications (1)

Publication Number Publication Date
CN112036367A true CN112036367A (en) 2020-12-04


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633162A (en) * 2020-12-22 2021-04-09 重庆大学 Rapid pedestrian detection and tracking method suitable for expressway outfield shielding condition
CN112733679A (en) * 2020-12-31 2021-04-30 南京视察者智能科技有限公司 Case logic reasoning-based early warning system and training method
CN113034548A (en) * 2021-04-25 2021-06-25 安徽科大擎天科技有限公司 Multi-target tracking method and system suitable for embedded terminal
CN113538263A (en) * 2021-06-28 2021-10-22 江苏威尔曼科技有限公司 Motion blur removing method, medium, and device based on improved DeblurgAN model

Citations (5)

Publication number Priority date Publication date Assignee Title
CN107609597A (en) * 2017-09-26 2018-01-19 嘉世达电梯有限公司 A kind of number of people in lift car detecting system and its detection method
CN109271942A (en) * 2018-09-26 2019-01-25 上海七牛信息技术有限公司 A kind of stream of people's statistical method and system
CN109522854A (en) * 2018-11-22 2019-03-26 广州众聚智能科技有限公司 A kind of pedestrian traffic statistical method based on deep learning and multiple target tracking
CN110309717A (en) * 2019-05-23 2019-10-08 南京熊猫电子股份有限公司 A kind of pedestrian counting method based on deep neural network
CN111640135A (en) * 2020-05-25 2020-09-08 台州智必安科技有限责任公司 TOF camera pedestrian counting method based on hardware front end


Non-Patent Citations (7)

Title
WOJKE N 等: ""Simple Online and Realtime Tracking with a Deep Association Metric"", 《IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP)》, 22 February 2018 (2018-02-22), pages 3645 *
XIANGYU ZHANG等: ""ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices"", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》, 16 December 2018 (2018-12-16), pages 6848 - 6856 *
ZHUANG LIU等: ""Learning Efficient Convolutional Networks through Network Slimming"", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》, 25 December 2017 (2017-12-25), pages 2755 - 2763 *
吉训生 等: ""基于优化可形变区域全卷积神经网络的人头检测方法"", 《激光与光电子学进展》, vol. 56, no. 14, 31 July 2019 (2019-07-31), pages 141009 - 1 *
廖晓雯: ""基于多层特征融合的目标检测"", 《中国优秀硕士学位论文全文数据库信息科技辑》, no. 01, 15 January 2020 (2020-01-15), pages 33 - 36 *
赵朵朵 等: ""基于深度学习的实时人流统计方法研究"", 《传感技术学报》, vol. 33, no. 8, 31 August 2020 (2020-08-31), pages 1161 - 1168 *
邓炜 等: ""联合语义的深度学习行人检测"", 《计算机系统应用》, vol. 27, no. 6, 28 May 2018 (2018-05-28), pages 165 - 170 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination