CN112036367A - People number detection method of YOLO convolutional neural network - Google Patents

People number detection method of YOLO convolutional neural network

Info

Publication number
CN112036367A
CN112036367A
Authority
CN
China
Prior art keywords
pedestrian
frame
people
detection
people number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010974637.6A
Other languages
Chinese (zh)
Inventor
陈敏
夏圣奎
吉训生
王文郁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Tiancheng Modern Agricultural Technology Co ltd
Original Assignee
Nantong Tiancheng Modern Agricultural Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong Tiancheng Modern Agricultural Technology Co ltd filed Critical Nantong Tiancheng Modern Agricultural Technology Co ltd
Priority to CN202010974637.6A
Publication of CN112036367A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Abstract

The invention discloses a people number detection method of a YOLO convolutional neural network, comprising a library file creating unit, a feature extraction unit and a people number judgment unit. The library file creating unit is used for creating a standard library file, and the standard library file comprises a plurality of reference convolution features, network parameters and corresponding people numbers; the feature extraction unit is used for receiving the video frames shot by the camera and extracting the convolution features of the video frames, so as to realize people number detection and judgment. According to the people number detection method of the YOLO convolutional neural network, the distribution of pedestrians in an image and the attributes used by the semantic counting method are detected together, and semantic-attribute learning is used to assist pedestrian detection in the image; the influence and interference of the semantic attributes in the image on the pedestrians are suppressed, the detection precision is improved, and the low accuracy of deep learning pedestrian counting methods in target image and video detection scenes is alleviated.

Description

People number detection method of YOLO convolutional neural network
Technical Field
The invention relates to the technical field of people number detection combining a YOLO convolutional neural network and semantic information with an optimized prior frame selection scheme, and in particular to a people number detection method of a YOLO convolutional neural network.
Background
Existing pedestrian detection algorithms using YOLOv3 often suffer from a high missed-detection rate on images in target image detection networks. Considering that pedestrian distribution and the semantic-counting attribute learning method in the target image network show a certain correlation between pedestrians, and that accuracy remains low, a people number detection system based on a YOLO convolutional neural network with an optimized prior frame selection scheme needs to be invented.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a people number detection method of a YOLO convolutional neural network, which has the advantages of optimizing the prior frame selection scheme and considering the IOU-based confidence between pedestrian classifications more comprehensively during prior frame classification selection. It solves the problem that pedestrian detection algorithms using YOLOv3 often suffer from a high missed-detection rate in target image detection networks, and addresses the fact that pedestrian distribution and the semantic-counting attribute learning method show a certain correlation between pedestrians while accuracy remains low.
(II) technical scheme
In order to optimize the prior frame selection scheme and consider the IOU-based confidence between pedestrian classifications more comprehensively during prior frame classification selection, the invention provides the following technical scheme: a people number detection method of a YOLO convolutional neural network comprises the following steps:
s1, creating a standard library file
A standard library file is created from the network parameters, reference convolution features and corresponding people numbers of the YOLO convolutional neural network trained on labeled pedestrian samples.
S2, video frame input
After receiving the video stream shot by the camera, the captured video frames are input.
S3, track differentiation
Track cascade matching is performed on the prediction boxes of YOLOv3 by using DeepSORT, so as to distinguish the different tracks of people across consecutive frame images.
Input: the tracking set Τ = {1, …, N}, the detection set D = {1, …, M}, and the maximum age threshold Amax.
The matrix C = [ci,j] stores the computed distances between every object track i and every object detection j.
The matrix B = [bi,j] stores the 0/1 judgment of whether object track i may be associated with object detection j.
The association set M is initialized to the empty set.
The set U of object detections for which no match has been found is initialized to D.
Loop from the trackers that matched most recently to the trackers that have gone at most Amax frames without a match.
Select the set of trackers satisfying the condition: Τn ← { i ∈ Τ | ai = n }.
Compute the set [xi,j] of successful associations between Τn and the object detections j using a minimum-cost assignment algorithm.
Update M to M ∪ { (i, j) | bi,j · xi,j > 0 }, the successful matches.
Remove the successfully matched object detections j from U.
Repeat steps (3b) to (3f) until all video frames are matched.
Return the two sets M and U.
S4, counting the number of people
A detection line is set, and whether a pedestrian enters or leaves the place is determined according to the direction in which the pedestrian's track crosses the detection line, so as to count the number of people.
S5, outputting the processed video containing the pedestrian mark frame and the number of people
S51, constructing a lightweight model
By using a ShuffleNet network to replace the feature extraction network in the darknet53 network and pruning the model, the model size and computational complexity are significantly optimized. Then a pedestrian detection network with an optimized prior frame selection scheme is used to extract pedestrian target features, and the loss value is calculated using the cross-entropy loss and the bounding-box regression loss; pedestrian detection is realized by combining the detection box positions on the basis of classification; a deep learning pedestrian detection method combining the semantic attributes of pedestrians is provided. The anchor box sizes are optimized: considering that the research target is the pedestrian, whose bounding box generally has a height-to-width ratio between 2:1 and 4:1, and that targets of different scales appear in the scene, the three scales are set to 128, 256 and 512 respectively and the height-to-width ratios to 2:1, 3:1 and 4:1, giving 9 anchor boxes in total. An anchor box whose Intersection Over Union (IOU) with a ground-truth bounding box is greater than 0.7 is marked as a positive sample, and one whose IOU is less than 0.3 is marked as a negative sample.
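The anchor box settings described above (three scales 128, 256 and 512, height-to-width ratios 2:1, 3:1 and 4:1, and the IOU thresholds 0.7/0.3 for positive/negative samples) can be sketched as follows. This is an illustrative sketch only; the function names are assumptions, not the patented implementation, and the scale is interpreted as the square root of the anchor area.

```python
# Illustrative sketch of the 9 anchor boxes and the IoU-based labelling rule.
import itertools
import math

def make_anchors():
    """Return (w, h) for the 9 anchors; scale is taken as sqrt(area)."""
    anchors = []
    for scale, ratio in itertools.product((128, 256, 512), (2, 3, 4)):
        w = scale / math.sqrt(ratio)   # ratio = h / w (tall pedestrian boxes)
        h = scale * math.sqrt(ratio)
        anchors.append((w, h))
    return anchors

def iou(a, b):
    """Intersection over union of boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def label_anchor(anchor_box, gt_box):
    """IoU > 0.7 -> positive, IoU < 0.3 -> negative, otherwise ignored."""
    v = iou(anchor_box, gt_box)
    return 'positive' if v > 0.7 else 'negative' if v < 0.3 else 'ignore'
```

Each anchor keeps the stated area (w·h = scale²) while its height is ratio times its width, matching the tall shape of pedestrian boxes.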
S52, constructing an actual model
An actual model is constructed according to the ShuffleNet channel shuffling principle, with a residual unit comprising 3 layers: a 1×1 convolution followed by a 3×3 depthwise convolution, where the 3×3 convolution is the bottleneck layer, then another 1×1 convolution, and finally a shortcut connection that adds the input directly to the output, reducing computation.
The dense 1×1 convolution is replaced by a 1×1 group convolution, and a channel shuffle operation is added after the first 1×1 convolution. Likewise, no ReLU activation function is used after the 3×3 depthwise convolution.
For the downsampling unit, a 3×3 average pooling with stride = 2 is applied to the original input, and stride = 2 is used in the depthwise convolution so that the two paths have the same size; the resulting feature map is then concatenated with the output instead of added. The computation and parameter size are greatly reduced.
S53 model compression
The scaling factor γ in the BN layer is utilized to measure the importance of each channel during training, and unimportant channels are deleted, so that the model is compressed in size and the running speed is improved. When γ is small, for example within (0.001, 0.003), the corresponding channel is deleted. By cleverly adding γ into the objective function, the effect of training and pruning at the same time is achieved.
Preferably, the steps S1 and S2 are provided with a library file creating unit, a feature extraction unit and a people number judging unit, wherein the library file creating unit is configured to create a standard library file, and the standard library file includes a plurality of reference convolution features, network parameters and corresponding people numbers; the feature extraction unit is configured to receive the video frames shot by the camera and extract the convolution features of the video frames, so as to realize people number detection; and the people number judging unit is configured to obtain, from the standard library file, the reference convolution feature closest to the convolution feature of the video frame, and take the people number corresponding to that reference convolution feature vector as the people number in the current scene.
Preferably, the feature extraction unit is located in a server connected to the control system, and the library file creation unit includes a people number input subunit and a classification learning subunit located in the server, with a Softmax classifier layer at the end of the network; the classification learning subunit takes the convolution features of video frames with different numbers of people and different lighting as reference convolution features, and performs classification learning on these reference features together with the people numbers of the corresponding video frames input by the people number input subunit, so as to generate the standard library file.
Preferably, the camera transmitting terminal is in signal connection with the server receiving terminal, and the server comprises a signal receiving unit for receiving the video shot by the camera from the control system through the internet.
Preferably, the objective function in step S53 is

L = Σ(x,y) l(f(x, W), y) + λ Σγ∈Γ g(γ)

where the first term is the loss generated by model prediction and the second term constrains γ; λ is a hyper-parameter weighing the two terms, typically set to 1e-4 or 1e-5; and g(·) adopts the L1 norm, g(γ) = |γ|.
Preferably, the semantic attributes of the pedestrian in step S51 refer to attributes of items that often appear around and are attached to the pedestrian, such as the pedestrian's hat, bag, etc.
Preferably, in step S51 the IOU is used to establish the relation between the pedestrian and the semantic attributes: the IOU of each semantic attribute prediction box with each pedestrian prediction box is calculated, a larger IOU indicating a higher correlation between the semantic attribute and the pedestrian; all semantic attribute prediction boxes with IOU < μ3 are removed, and if a semantic attribute prediction box overlaps several pedestrian prediction boxes, the attribute is assigned to the pedestrian target with the largest IOU. The final score of a pedestrian target is its original score plus the product of the IOU between the pedestrian target box and its corresponding semantic attribute box and the score of that semantic attribute box.
Preferably, in steps S51 and S52, misclassified samples are relearned and assigned larger weights to improve detection accuracy. First, the previous weak classifier reads the loss values of all prediction boxes and sorts them from large to small to select the prediction box with the largest loss; the gradient is then back-propagated to a new weak classifier, which during learning assigns more weight to the prediction box that had the largest loss in the previous round; the loss values of all prediction boxes are then sorted and back-propagated again. This iteration repeats until the maximum prediction-box loss is less than a small constant, set to 0.05 in this model. The loss function of the model is defined as the sum of the bounding-box regression loss and the cross-entropy loss. S-NMS is used during testing: traditional Non-Maximum Suppression (NMS) directly suppresses bounding boxes near a higher-scoring box by setting their scores to 0, which reduces the redundant computation caused by duplicate boxes, but when high target density produces clusters of high-scoring boxes it suppresses true detections and causes the model to miss targets.
(III) advantageous effects
Compared with the prior art, the invention provides the method for detecting the number of people in the YOLO convolutional neural network, which has the following beneficial effects:
1. According to the people number detection method of the YOLO convolutional neural network, the distribution of pedestrians in an image and the attributes used by the semantic counting method are detected together, and semantic-attribute learning is used to assist pedestrian detection in the image; the influence and interference of the semantic attributes in the image on the pedestrians are suppressed, the detection precision is improved, and the low accuracy of deep learning pedestrian counting methods in target image and video detection scenes is alleviated.
Drawings
FIG. 1 is a block diagram of the steps of the present invention;
FIG. 2 is a schematic diagram showing the basic ResNet lightweight structure (a), the group convolution structure (b) and the channel shuffling structure (c);
FIG. 3 is a diagram of a pruning model;
FIG. 4 is an overall flow diagram of the algorithm;
FIG. 5 is a diagram of the structure of compressed YOLO.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-5, the present invention provides a technical solution: a people number detection method of a YOLO convolutional neural network comprises the following steps:
s1, creating a standard library file
A standard library file is created from the network parameters, reference convolution features and corresponding people numbers of the YOLO convolutional neural network trained on labeled pedestrian samples.
S2, video frame input
After receiving the video stream shot by the camera, the captured video frames are input.
The steps S1 and S2 are provided with a library file creating unit, a feature extraction unit and a people number judgment unit. The library file creating unit is used for creating a standard library file, and the standard library file comprises a plurality of reference convolution features, network parameters and corresponding people numbers. The feature extraction unit is used for receiving the video frames shot by the camera and extracting the convolution features of the video frames, so as to realize people number detection. The people number judgment unit is used for obtaining, from the standard library file, the reference convolution feature closest to the convolution feature of the video frame and taking the people number corresponding to that reference convolution feature vector as the people number in the current scene. The feature extraction unit is located in a server connected to the control system, and the library file creating unit comprises a people number input subunit and a classification learning subunit located in the server, with a Softmax classifier layer at the end of the network. The classification learning subunit takes the convolution features of video frames with different numbers of people and different lighting as reference convolution features, and performs classification learning on these reference features together with the people numbers of the corresponding video frames input by the people number input subunit, so as to generate the standard library file. The transmitting end of the camera is in signal connection with the receiving end of the server, and the server comprises a signal receiving unit that receives, through the internet, the videos shot by the camera of the control system.
S3, track differentiation
Track cascade matching is performed on the prediction boxes of YOLOv3 by using DeepSORT, so as to distinguish the different tracks of people across consecutive frame images.
Input: the tracking set Τ = {1, …, N}, the detection set D = {1, …, M}, and the maximum age threshold Amax.
The matrix C = [ci,j] stores the computed distances between every object track i and every object detection j.
The matrix B = [bi,j] stores the 0/1 judgment of whether object track i may be associated with object detection j.
The association set M is initialized to the empty set.
The set U of object detections for which no match has been found is initialized to D.
Loop from the trackers that matched most recently to the trackers that have gone at most Amax frames without a match.
Select the set of trackers satisfying the condition: Τn ← { i ∈ Τ | ai = n }.
Compute the set [xi,j] of successful associations between Τn and the object detections j using a minimum-cost assignment algorithm.
Update M to M ∪ { (i, j) | bi,j · xi,j > 0 }, the successful matches.
Remove the successfully matched object detections j from U.
Repeat steps (3b) to (3f) until all video frames are matched.
Return the two sets M and U.
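The matching cascade above can be sketched as follows. This is a minimal illustrative sketch: a greedy minimum-cost assignment stands in for the Hungarian algorithm used in practice, and the `cost` and `gate` callables are assumed placeholders for the distance matrix C and the 0/1 association matrix B.

```python
# Hypothetical sketch of the DeepSORT-style matching cascade.
def matching_cascade(tracks, detections, cost, gate, a_max):
    """tracks: {track_id: frames_since_last_match}; detections: list of ids.
    cost(i, j) -> distance c_ij; gate(i, j) -> bool (b_ij)."""
    matches = []                    # association set M, initialised empty
    unmatched = set(detections)     # set U, initialised to all detections D
    for n in range(1, a_max + 1):   # recently matched tracks get priority
        tier = [i for i, age in tracks.items() if age == n]
        # greedy minimum-cost assignment within this tier
        pairs = sorted((cost(i, j), i, j)
                       for i in tier for j in unmatched if gate(i, j))
        used = set()
        for c, i, j in pairs:
            if i not in used and j in unmatched:
                matches.append((i, j))
                used.add(i)
                unmatched.discard(j)
    return matches, unmatched       # the two sets M and U
```

Tracks that stay in `unmatched` after Amax tiers would be handled by track initiation/deletion logic outside this sketch.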
S4, counting the number of people
A detection line is set, and whether a pedestrian enters or leaves the place is determined according to the direction in which the pedestrian's track crosses the detection line, so as to count the number of people.
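The detection-line counting rule can be sketched as follows. The horizontal line and the convention that a downward crossing counts as an entry are illustrative assumptions; each track is taken as a list of (x, y) centroids.

```python
# Illustrative sketch of counting entries/exits at a horizontal detection line.
def count_crossings(track, line_y):
    """+1 for each downward crossing (enter), -1 for each upward (leave)."""
    delta = 0
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        if y0 < line_y <= y1:
            delta += 1   # crossed the line moving down: entered
        elif y1 < line_y <= y0:
            delta -= 1   # crossed the line moving up: left
    return delta

def people_inside(tracks, line_y):
    """Net number of people currently inside, given all tracks so far."""
    return sum(count_crossings(t, line_y) for t in tracks)
```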
S5, outputting the processed video containing the pedestrian mark frame and the number of people
S51, constructing a lightweight model
By using a ShuffleNet network to replace the feature extraction network in the darknet53 network and pruning the model, the model size and computational complexity are significantly optimized. Then a pedestrian detection network with an optimized prior frame selection scheme is used to extract pedestrian target features, and the loss value is calculated using the cross-entropy loss and the bounding-box regression loss; pedestrian detection is realized by combining the detection box positions on the basis of classification; a deep learning pedestrian detection method combining the semantic attributes of pedestrians is provided. The anchor box sizes are optimized: considering that the research target is the pedestrian, whose bounding box generally has a height-to-width ratio between 2:1 and 4:1, and that targets of different scales appear in the scene, the three scales are set to 128, 256 and 512 respectively and the height-to-width ratios to 2:1, 3:1 and 4:1, giving 9 anchor boxes in total. An anchor box whose Intersection Over Union (IOU) with a ground-truth bounding box is greater than 0.7 is marked as a positive sample, and one whose IOU is less than 0.3 is marked as a negative sample. The semantic attributes of the pedestrian refer to attributes of items that often appear around and are attached to the pedestrian, such as the pedestrian's hat or bag, and the IOU is used to establish the relation between the pedestrian and the semantic attributes: the IOU of each semantic attribute prediction box with each pedestrian prediction box is calculated, a larger IOU indicating a higher correlation between the semantic attribute and the pedestrian; all semantic attribute prediction boxes with IOU < μ3 are removed, and if a semantic attribute prediction box overlaps several pedestrian prediction boxes, the attribute is assigned to the pedestrian target with the largest IOU. The final score of a pedestrian target is its original score plus the product of the IOU between the pedestrian target box and its corresponding semantic attribute box and the score of that semantic attribute box.
Overlap ratio: IOU(Bped, Battr) = area(Bped ∩ Battr) / area(Bped ∪ Battr).
Score of the pedestrian target: S′ped = Sped + IOU(Bped, Battr) × Sattr.
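The attribute-to-pedestrian association and rescoring can be sketched as follows. This is one plausible reading of the description: attributes below the IOU threshold μ3 are dropped, an attribute overlapping several pedestrians goes to the one with the largest IOU, and each matched attribute raises the pedestrian score by IOU × attribute score. The helper names and the passed-in `iou` function are assumptions.

```python
# Sketch of attaching semantic-attribute boxes (hat, bag, ...) to pedestrians.
def rescore_pedestrians(peds, attrs, iou, mu3=0.3):
    """peds/attrs: lists of (box, score). Returns updated pedestrian scores."""
    scores = [s for _, s in peds]
    for a_box, a_score in attrs:
        overlaps = [iou(p_box, a_box) for p_box, _ in peds]
        # the attribute belongs to the pedestrian with the largest IoU
        best = max(range(len(peds)), key=lambda k: overlaps[k], default=None)
        if best is not None and overlaps[best] >= mu3:
            scores[best] += overlaps[best] * a_score   # S' = S + IoU * S_attr
    return scores
```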
S52, constructing an actual model
An actual model is constructed according to the ShuffleNet channel shuffling principle, with a residual unit comprising 3 layers: a 1×1 convolution followed by a 3×3 depthwise convolution, where the 3×3 convolution is the bottleneck layer, then another 1×1 convolution, and finally a shortcut connection that adds the input directly to the output, reducing computation.
The dense 1×1 convolution is replaced by a 1×1 group convolution, and a channel shuffle operation is added after the first 1×1 convolution. Likewise, no ReLU activation function is used after the 3×3 depthwise convolution.
For the downsampling unit, a 3×3 average pooling with stride = 2 is applied to the original input, and stride = 2 is used in the depthwise convolution so that the two paths have the same size; the resulting feature map is then concatenated with the output instead of added. The computation and parameter size are greatly reduced.
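The channel shuffle operation can be sketched in pure Python on a flat list of channels; a framework implementation would apply the same reshape-transpose-flatten to the channel axis of a 4-D tensor.

```python
# Pure-Python sketch of ShuffleNet's channel shuffle: view the channels as a
# (groups x channels_per_group) grid, transpose it, and flatten it back, so
# that information mixes across the groups of the next group convolution.
def channel_shuffle(channels, groups):
    """channels: flat list ordered group-by-group; returns shuffled order."""
    n = len(channels)
    assert n % groups == 0
    per = n // groups
    return [channels[g * per + k] for k in range(per) for g in range(groups)]
```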
The misclassified samples are relearned and assigned larger weights to improve detection accuracy. First, the previous weak classifier reads the loss values of all prediction boxes and sorts them from large to small to select the prediction box with the largest loss; the gradient is then back-propagated to a new weak classifier, which during learning assigns more weight to the prediction box that had the largest loss in the previous round; the loss values of all prediction boxes are then sorted and back-propagated again. This iteration repeats until the maximum prediction-box loss is less than a small constant, set to 0.05 in this model. The loss function of the model is defined as the sum of the bounding-box regression loss and the cross-entropy loss. S-NMS is used during testing: traditional Non-Maximum Suppression (NMS) directly suppresses bounding boxes near a higher-scoring box by setting their scores to 0, which reduces the redundant computation caused by duplicate boxes, but when high target density produces clusters of high-scoring boxes it suppresses true detections and causes the model to miss targets.
The formula of the S-NMS is as follows, where M is the bounding box with the highest score, bi is the i-th prediction box around M, and Nt is a threshold, set here to 0.7:

si = si, if IOU(M, bi) < Nt
si = si (1 − IOU(M, bi)), if IOU(M, bi) ≥ Nt
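The S-NMS score decay can be sketched as follows. A linear decay si(1 − IOU) for boxes overlapping the current top box by at least Nt is assumed, and the passed-in `iou` function is a placeholder; boxes are kept and only rescored, rather than suppressed outright.

```python
# Illustrative sketch of linear Soft-NMS (S-NMS): nearby boxes are decayed,
# not zeroed, so dense clusters of true detections survive.
def soft_nms(boxes, scores, iou, n_t=0.7):
    """boxes: hashable ids; returns {box: decayed score}."""
    boxes, scores = list(boxes), list(scores)
    out = {}
    while scores:
        m = max(range(len(scores)), key=scores.__getitem__)
        m_box, m_score = boxes.pop(m), scores.pop(m)
        out[m_box] = m_score                       # keep the current top box M
        for i, b in enumerate(boxes):
            if iou(m_box, b) >= n_t:
                scores[i] *= 1.0 - iou(m_box, b)   # decay instead of suppress
    return out
```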
In the network training process, in order to reduce noise interference during gradient calculation, stochastic gradient descent with weight decay of 0.01, a learning rate of 0.01 and a momentum of 0.9 is used to minimize the loss value and optimize the network parameters.
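The stated optimizer settings (learning rate 0.01, momentum 0.9, weight decay 0.01) can be sketched as a single scalar SGD step. This is an illustrative sketch, not the training code of the invention; weight decay is folded into the gradient as an L2 term, one common convention.

```python
# One-step sketch of SGD with momentum and weight decay on a scalar parameter.
def sgd_step(w, g, v, lr=0.01, momentum=0.9, weight_decay=0.01):
    """w: parameter, g: gradient, v: momentum buffer. Returns (new_w, new_v)."""
    g = g + weight_decay * w    # L2 weight decay folded into the gradient
    v = momentum * v + g        # momentum accumulates a running gradient
    return w - lr * v, v
```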
S53 model compression
The scaling factor γ in the BN layer is utilized to measure the importance of each channel during training, and unimportant channels are deleted, so that the model is compressed in size and the running speed is improved. When γ is small, for example within (0.001, 0.003), the corresponding channel is deleted. By cleverly adding γ into the objective function, the effect of training and pruning at the same time is achieved.
The objective function is

L = Σ(x,y) l(f(x, W), y) + λ Σγ∈Γ g(γ)

where the first term is the loss generated by model prediction and the second term constrains γ; λ is a hyper-parameter weighing the two terms, typically set to 1e-4 or 1e-5; and g(·) adopts the L1 norm, g(γ) = |γ|.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A people number detection method of a YOLO convolutional neural network is characterized by comprising the following steps:
s1, creating a standard library file
Creating a standard library file through the network parameters, the reference convolution features and the corresponding number of people of the YOLO convolution neural network trained by the marked pedestrian sample;
s2, video frame input
After receiving a video frame shot by the camera, inputting the captured video frame;
s3, track differentiation
Performing track cascade matching on the prediction boxes of YOLOv3 by using DeepSORT, distinguishing the different tracks of people across consecutive frame images;
an input tracking set Τ = {1, …, N}, a detection set D = {1, …, M}, and a maximum age threshold Amax;
the matrix C = [ci,j] stores the computed distances between every object track i and every object detection j;
the matrix B = [bi,j] stores the 0/1 judgment of whether object track i may be associated with object detection j;
initializing the association set M to the empty set;
initializing the set U of object detections for which no match has been found as D;
looping from the trackers that matched most recently to the trackers that have gone at most Amax frames without a match;
selecting the set of trackers satisfying the condition: Τn ← { i ∈ Τ | ai = n };
computing the set [xi,j] of successful associations between Τn and the object detections j using a minimum-cost assignment algorithm;
updating M to M ∪ { (i, j) | bi,j · xi,j > 0 }, the successful matches;
removing the successfully matched object detections j from U;
repeating steps (3b) to (3f) until all video frames are matched;
returning the two sets M and U
S4, counting the number of people
Setting a detection line, and determining whether the pedestrian enters the place or leaves the place according to the direction of the pedestrian track passing through the detection line so as to count the number of people;
S5, outputting the processed video containing the pedestrian bounding boxes and the people count
S51, constructing a lightweight model
The feature extraction network in the Darknet-53 network is replaced with a ShuffleNet network and the model is pruned, significantly reducing the model size and computational complexity; the pedestrian detection network with an optimized prior-box selection scheme is then used to extract pedestrian target features, the loss value is computed as the cross-entropy loss plus the bounding-box regression loss, and pedestrian detection is realized by combining detection box localization with classification; a deep-learning pedestrian detection method incorporating pedestrian semantic attributes is proposed; the anchor box sizes are optimized: considering that the height-to-width ratio of a pedestrian bounding box is generally between 2:1 and 4:1 and that targets of different scales appear in the scene, 9 anchor boxes with aspect ratios of 2:1, 3:1, and 4:1 are set at three scales respectively; an anchor box whose Intersection over Union (IoU) with a ground-truth bounding box is greater than 0.7 is labeled a positive sample, and one whose IoU is less than 0.3 is labeled a negative sample;
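The anchor labeling rule of S51 can be sketched as follows, using the thresholds stated in the claim (0.7 positive, 0.3 negative); boxes are assumed to be (x1, y1, x2, y2) corners, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label an anchor by its best IoU against the ground-truth boxes."""
    best = max(iou(anchor, g) for g in gt_boxes)
    if best > pos_thr:
        return 1     # positive sample
    if best < neg_thr:
        return 0     # negative sample
    return -1        # in-between: ignored during training
```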
S52, constructing an actual model
An actual model is constructed according to the ShuffleNet channel-shuffle principle, with a residual unit comprising 3 layers: a 1×1 convolution and a 3×3 depthwise convolution, where the 3×3 depthwise convolution is the bottleneck layer, followed by another 1×1 convolution, and finally a shortcut connection that adds the input directly to the output, reducing the computational cost;
the dense 1×1 convolutions are replaced with grouped 1×1 convolutions, and a channel shuffle operation is added after the first 1×1 convolution; additionally, no ReLU activation function is used after the 3×3 depthwise convolution;
for the downsampling unit, a 3×3 average pooling with stride = 2 is applied to the original input, and stride = 2 is taken in the depthwise convolution so that the two paths have the same size; the resulting feature map is then concatenated with the output instead of added, greatly reducing the computational cost and parameter count;
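The channel shuffle operation of S52 is a reshape-and-transpose of the channel dimension, so that channels produced by different convolution groups are interleaved and information can flow across groups. A minimal sketch on a flat list of channel indices (real implementations operate on a 4-D tensor):

```python
def channel_shuffle(channels, groups):
    """Interleave channels from `groups` convolution groups:
    reshape to (groups, n/groups), transpose, flatten."""
    n = len(channels)
    assert n % groups == 0
    per_group = n // groups
    return [channels[g * per_group + k]
            for k in range(per_group)
            for g in range(groups)]
```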
S53, model compression
The scaling factor γ in the BN layer is used to measure the importance of each channel during training, and unimportant channels are deleted, thereby compressing the model size and improving inference speed;
when γ is small, for example in (0.001, 0.003), the corresponding channel is deleted; cleverly, γ is added to the objective function, achieving the effect of training and pruning simultaneously.
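The pruning rule of S53 can be sketched as below: channels whose BN scaling factor |γ| falls below a threshold are marked for deletion, and an L1 penalty on γ is added to the training objective to push unimportant factors toward zero. The threshold value and function names here are illustrative, not taken from the claim.

```python
def prune_mask(gammas, threshold=0.003):
    """Keep/drop mask over channels from their BN scaling factors gamma;
    channels with |gamma| below the threshold are pruned."""
    return [abs(g) >= threshold for g in gammas]

def l1_sparsity_penalty(gammas, lam=1e-4):
    """The sparsity term lambda * sum(|gamma|) added to the objective."""
    return lam * sum(abs(g) for g in gammas)
```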
2. The people number detection method of the YOLO convolutional neural network of claim 1, wherein steps S1 and S2 are provided with a library file creation unit, a feature extraction unit, and a people number judgment unit; the library file creation unit is used to create the standard library file, which comprises a plurality of reference convolution features, network parameters, and corresponding people counts; the feature extraction unit is used to receive the video frames captured by the camera and extract their convolution features so as to realize people number detection; and the people number judgment unit is used to retrieve from the standard library file the reference convolution feature closest to the convolution feature of the current video frame and take the people count corresponding to that reference convolution feature as the number of people in the current scene.
3. The people number detection method of the YOLO convolutional neural network of claim 2, wherein the feature extraction unit is located in a server connected to a control system; the library file creation unit comprises a people number input subunit and a classification learning subunit located in the server, with a Softmax classifier layer at the end of the network; the classification learning subunit is used to take the convolution features of video frames with different people counts and different lighting conditions as reference convolution features, and to perform classification learning on these reference convolution features together with the people count of the corresponding video frame input by the people number input subunit, thereby generating the standard library file.
4. The people number detection method of the YOLO convolutional neural network of claim 2, wherein the transmitting end of the camera is signal-connected to the receiving end of the server, and the server comprises a signal receiving unit for receiving, via the Internet, the video captured by the camera from the control system.
5. The people number detection method of the YOLO convolutional neural network of claim 1, wherein the objective function of step S53 is
L = Σ_(x,y) l(f(x, W), y) + λ Σ_(γ∈Γ) g(γ)
where the first term is the loss produced by model prediction, the second term constrains the scaling factors γ, λ is a hyperparameter balancing the two terms, typically set to 1e-4 or 1e-5, and g(·) adopts the L1 norm, i.e. g(γ) = |γ|.
6. The people number detection method of the YOLO convolutional neural network of claim 1, wherein the pedestrian semantic attributes in step S51 refer to attributes of objects that frequently appear around and are attached to a pedestrian, such as the pedestrian's hat, bag, etc.
7. The people number detection method of the YOLO convolutional neural network of claim 1, wherein in step S51 the IoU is used to establish the relation between a pedestrian and a semantic attribute: the IoU between each semantic-attribute prediction box and each pedestrian prediction box is computed, a larger IoU indicating a stronger correlation between the semantic attribute and the pedestrian; semantic-attribute prediction boxes with IoU < μ3 are removed; if a semantic-attribute prediction box overlaps several pedestrian prediction boxes, the attribute is assigned to the pedestrian target with the largest IoU; finally, the score of a pedestrian target is its original score plus the product of the IoU between the corresponding semantic-attribute target box and the pedestrian target box and the score of the semantic-attribute target box.
8. The people number detection method of the YOLO convolutional neural network of claim 1, wherein in steps S51 and S52 misclassified samples are relearned and assigned larger weights to improve detection accuracy: the last weak classifier reads in the loss values of all prediction boxes, sorts them from largest to smallest, selects the prediction box with the largest loss, and back-propagates its gradient to a new weak classifier; the new weak classifier learns the weight assigned to the largest-loss prediction box of the previous round, after which the loss values of all prediction boxes are again sorted and back-propagated; this iterates until the maximum prediction-box loss falls below a small constant, at which point iteration ends; the constant in this model is 0.05, and the loss function of the model is defined as the sum of the bounding-box regression loss and the cross-entropy loss; Soft-NMS (S-NMS) is used during testing: conventional non-maximum suppression (NMS) directly suppresses multiple high-scoring bounding boxes by setting the scores of boxes near the highest-scoring box to 0, which reduces the redundant computation caused by duplicate boxes, but when target density is high it also suppresses clusters of genuinely high-scoring boxes, producing missed detections by the model.
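The Soft-NMS behavior described in claim 8 can be sketched as follows: instead of zeroing the scores of boxes overlapping a selected high-scoring box, their scores are decayed by a function of the overlap, so densely packed but distinct pedestrians survive. Linear decay and the thresholds below are illustrative choices, not values from the claim.

```python
def _iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def soft_nms(boxes, scores, iou_thr=0.3, score_thr=0.05):
    """Return indices of kept boxes, in selection order."""
    scores = list(scores)
    idx = sorted(range(len(boxes)), key=lambda i: -scores[i])
    kept = []
    while idx:
        best = idx.pop(0)
        kept.append(best)
        for i in idx:
            o = _iou(boxes[best], boxes[i])
            if o > iou_thr:
                scores[i] *= (1.0 - o)   # decay instead of zeroing
        idx = [i for i in idx if scores[i] >= score_thr]
        idx.sort(key=lambda i: -scores[i])
    return kept
```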
CN202010974637.6A 2020-09-16 2020-09-16 People number detection method of YOLO convolutional neural network Pending CN112036367A (en)

Publications (1)

Publication Number Publication Date
CN112036367A true CN112036367A (en) 2020-12-04


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633162A (en) * 2020-12-22 2021-04-09 重庆大学 Rapid pedestrian detection and tracking method suitable for expressway outfield shielding condition
CN112733679A (en) * 2020-12-31 2021-04-30 南京视察者智能科技有限公司 Case logic reasoning-based early warning system and training method
CN113034548A (en) * 2021-04-25 2021-06-25 安徽科大擎天科技有限公司 Multi-target tracking method and system suitable for embedded terminal
CN113538263A (en) * 2021-06-28 2021-10-22 江苏威尔曼科技有限公司 Motion blur removing method, medium, and device based on improved DeblurgAN model

Citations (5)

Publication number Priority date Publication date Assignee Title
CN107609597A (en) * 2017-09-26 2018-01-19 嘉世达电梯有限公司 A kind of number of people in lift car detecting system and its detection method
CN109271942A (en) * 2018-09-26 2019-01-25 上海七牛信息技术有限公司 A kind of stream of people's statistical method and system
CN109522854A (en) * 2018-11-22 2019-03-26 广州众聚智能科技有限公司 A kind of pedestrian traffic statistical method based on deep learning and multiple target tracking
CN110309717A (en) * 2019-05-23 2019-10-08 南京熊猫电子股份有限公司 A kind of pedestrian counting method based on deep neural network
CN111640135A (en) * 2020-05-25 2020-09-08 台州智必安科技有限责任公司 TOF camera pedestrian counting method based on hardware front end


Non-Patent Citations (7)

Title
WOJKE N 等: ""Simple Online and Realtime Tracking with a Deep Association Metric"", 《IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP)》, 22 February 2018 (2018-02-22), pages 3645 *
XIANGYU ZHANG等: ""ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices"", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》, 16 December 2018 (2018-12-16), pages 6848 - 6856 *
ZHUANG LIU等: ""Learning Efficient Convolutional Networks through Network Slimming"", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》, 25 December 2017 (2017-12-25), pages 2755 - 2763 *
吉训生 等: ""基于优化可形变区域全卷积神经网络的人头检测方法"", 《激光与光电子学进展》, vol. 56, no. 14, 31 July 2019 (2019-07-31), pages 141009 - 1 *
廖晓雯: ""基于多层特征融合的目标检测"", 《中国优秀硕士学位论文全文数据库信息科技辑》, no. 01, 15 January 2020 (2020-01-15), pages 33 - 36 *
赵朵朵 等: ""基于深度学习的实时人流统计方法研究"", 《传感技术学报》, vol. 33, no. 8, 31 August 2020 (2020-08-31), pages 1161 - 1168 *
邓炜 等: ""联合语义的深度学习行人检测"", 《计算机系统应用》, vol. 27, no. 6, 28 May 2018 (2018-05-28), pages 165 - 170 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination