CN113139481A - Classroom people counting method based on yolov3 - Google Patents

Classroom people counting method based on yolov3

Info

Publication number
CN113139481A
Authority
CN
China
Prior art keywords
image
follows
frame
classroom
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110466081.4A
Other languages
Chinese (zh)
Other versions
CN113139481B (en)
Inventor
朱静
潘梓沛
林静旖
何伟聪
薛穗华
李昂
尹邦政
黄仙烨
欧阳淑榆
朱雪冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202110466081.4A priority Critical patent/CN113139481B/en
Publication of CN113139481A publication Critical patent/CN113139481A/en
Application granted granted Critical
Publication of CN113139481B publication Critical patent/CN113139481B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a classroom people counting method based on yolov3, comprising the following steps: S1, acquiring original images of a classroom, through a camera mounted on the classroom ceiling, as the image set for model training; S2, labeling the top view of each student's head in the original images, generating labeling files, and computing the labeling frames in those files; S3, extracting features by performing convolution and downsampling on the input image; S4, establishing a detection model and clustering the data set with a k-means clustering algorithm, the detection model adopting an improved yolov3 network comprising a feature extraction network and a target detection layer; S5, training the detection model; S6, inputting the image to be detected into the detection model for people counting. Based on the yolov3 algorithm, the method can quickly and accurately count the number of students in a classroom at a given moment, so that during a lecture a teacher can quickly discover whether all students have arrived, whether outsiders have mixed in, or whether anyone has left early, thereby curbing lateness, early departure, and class skipping.

Description

Classroom people counting method based on yolov3
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a classroom people counting method based on yolov3.
Background
Under the university's "walking-class" mode of instruction, lateness, early departure, and class skipping among students are commonplace. Having the teacher count the students in every class is time-consuming, and it cannot reveal in real time whether students leave or enter in the middle of class.
Existing people-detection methods, such as single-chip-microcomputer infrared detection, are easily disturbed by the environment, and suffer from low detection accuracy and high energy consumption.
Disclosure of Invention
The invention mainly aims to overcome the defects and shortcomings of the prior art by providing a classroom people counting method based on yolov3. Built on the yolov3 algorithm, the method can quickly and accurately count the number of students in a classroom at a given moment, so that during a lecture a teacher can quickly discover whether all students have arrived, whether outsiders have mixed in, or whether anyone has left early, thereby curbing lateness, early departure, and class skipping.
In order to achieve the purpose, the invention adopts the following technical scheme:
the classroom people counting method based on yolov3 comprises the following steps:
s1, acquiring an original image of a classroom as an image set for model training through a camera arranged on the ceiling of the classroom;
s2, labeling the top view of the head of each student in the original image, generating a labeling file, and calculating a labeling frame in the labeling file;
s3, extracting features, namely performing convolution processing and downsampling on the input image;
s4, establishing a detection model, clustering the data set by using a k-means clustering algorithm, wherein the detection model adopts an improved yolov3 network and comprises a feature extraction network and a target detection layer;
s5, training a detection model;
S6, inputting the image to be detected into the detection model for people counting.
Further, the acquiring of the original image of the classroom specifically includes:
the method comprises the steps of collecting images of students in class through cameras on ceilings of the classrooms, collecting not less than 20 images of the classrooms, collecting not less than 100 frames of images in each classroom, and enabling each frame to be an image at different moments.
Further, the labeling specifically includes:
The top view of each person's head in the collected original images is labeled with the bounding-box image annotation tool LabelImg, all heads being labeled uniformly as a single "head" class; after labeling, an xml labeling file is generated and saved, and the original image is saved as well.
Further, the calculation of the labeling frame in the labeling file specifically includes:
The four parameters of each labeling frame in the xml labeling file are calculated and converted by the following formulas:
w=W*(X1-X2)
h=H*(Y1-Y2)
where X1, Y1, X2 and Y2 are the four parameters of the labeling frame, W and H are the width and height of the original image, and the outputs w and h correspond to the width and height of the labeling frame;
all the (w, h) pairs produced by the conversion are gathered into a wh-data set.
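A minimal Python sketch of this conversion is given below. It assumes LabelImg's PASCAL-VOC xml layout (a <size> element holding <width>/<height>, and one <bndbox> with xmin/ymin/xmax/ymax per labeled head) and that the corner coordinates are stored normalized to [0, 1], playing the roles of X2, Y2, X1, Y1 above; if the xml stores pixel coordinates instead, the W and H factors drop out.

import glob
import xml.etree.ElementTree as ET

def build_wh_dataset(xml_dir="annotations"):
    """Apply w = W*(X1 - X2), h = H*(Y1 - Y2) to every labeling frame
    and gather the outputs into the wh-data set."""
    wh = []
    for path in glob.glob(f"{xml_dir}/*.xml"):
        root = ET.parse(path).getroot()
        W = float(root.find("size/width").text)
        H = float(root.find("size/height").text)
        for box in root.iter("bndbox"):
            x1, x2 = float(box.find("xmax").text), float(box.find("xmin").text)
            y1, y2 = float(box.find("ymax").text), float(box.find("ymin").text)
            wh.append((W * (x1 - x2), H * (y1 - y2)))
    return wh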
Further, the step S3 is specifically:
Features are extracted from the original image using the improved yolov3 feature extraction network, whose algorithm is as follows:
the input image is resized to 416 × 416 and convolved with 16 convolution kernels of size 3 × 3 at a stride of 1;
let the input image size be k x k, the convolution kernel be n x n, and the convolution formula be as follows:
y_ij = Σ_{u=1}^{n} Σ_{v=1}^{n} w_uv · x_{i-u+1, j-v+1}
where y_ij denotes the pixel value at index (i, j) of the convolution output map, w_uv the value at index (u, v) of the corresponding convolution kernel, and x_{i-u+1, j-v+1} the pixel value of image x at (i-u+1, j-v+1);
the net input y^(1) of the first layer is then standard-normalized, as follows:
ŷ^(1) = (y^(1) - E(y^(1))) / √var(y^(1))
where E(y^(1)) and var(y^(1)) denote, under the current parameters, the expectation and variance of each dimension of y^(1) over the entire training set, and ŷ^(1) is the normalized output of the first layer;
the output is then corrected with the Leaky ReLU activation function, where x denotes the input and a is a positive real number:
f(x) = x,   x > 0
f(x) = x/a, x ≤ 0
the feature map output by the above convolution is downsampled 5 times;
the convolution kernels used are, in turn, 32, 64, 128, 256 and 512 kernels of size 3 × 3, all with a convolution stride of 2, each applied to the previous layer's output feature map;
after the 5 downsamplings of the feature map, 4 groups of convolutional residual modules composed of 256 1 × 1 and 512 3 × 3 convolution kernels extract features from the previous layer's output feature map with a stride of 1, and the output is a 26 × 26 feature map.
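The following PyTorch sketch assembles the layers just described: the 16-kernel 3 × 3 stem, standard normalization with Leaky ReLU after each convolution, five stride-2 downsampling convolutions (32 to 512 channels), and four modules of 256 1 × 1 plus 512 3 × 3 kernels, realized here as residual blocks. It is an illustrative reading of the text rather than the patent's reference code; the skip connection in particular is an assumption.

import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k, s):
    """Convolution + standard normalization + Leaky ReLU, the basic unit above."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

class HeadModule(nn.Module):
    """One 256x(1x1) -> 512x(3x3) module with an assumed skip connection."""
    def __init__(self, channels=512):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn_leaky(channels, 256, 1, 1),
            conv_bn_leaky(256, channels, 3, 1),
        )

    def forward(self, x):
        return x + self.body(x)

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        layers = [conv_bn_leaky(3, 16, 3, 1)]  # 16 kernels of 3x3, stride 1
        for c_in, c_out in [(16, 32), (32, 64), (64, 128), (128, 256), (256, 512)]:
            layers.append(conv_bn_leaky(c_in, c_out, 3, 2))  # stride-2 downsampling
        layers += [HeadModule(512) for _ in range(4)]  # 4 groups of modules
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# FeatureExtractor()(torch.zeros(1, 3, 416, 416)).shape -> (1, 512, 13, 13);
# note that five stride-2 stages reduce 416 to 13, whereas the text reports a
# 26 x 26 output, so one stage presumably keeps stride 1 in the actual design.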
Further, the data set is clustered by using a k-means clustering algorithm, specifically, the wh-data set is clustered:
A cluster-center count k = 3 is selected; the k-means clustering algorithm is specifically as follows:
given a data sample X of n objects, X = {X1, X2, X3, ..., Xn}, where each object has attributes of m dimensions;
the aim of the k-means algorithm is to group the n objects into the specified k clusters according to their mutual similarity, each object belonging to exactly one cluster, namely the one whose center is nearest to it;
k cluster centers C = {C1, C2, C3, ..., Ck}, 1 < k ≤ n, are initialized;
the Euclidean distance from each object to each cluster center is calculated, as shown in the following equation:
dis(Xi, Cj) = √( Σ_{t=1}^{m} (Xit - Cjt)² )
where Xi denotes the i-th object (1 ≤ i ≤ n), Cj the j-th cluster center (1 ≤ j ≤ k), Xit the t-th attribute of the i-th object (1 ≤ t ≤ m), and Cjt the t-th attribute of the j-th cluster center;
the distances from each object to the cluster centers are compared in turn, and each object is assigned to the cluster whose center is nearest, yielding k clusters {S1, S2, S3, ..., Sk};
the k-means algorithm defines the prototype of a cluster by its center, which is the mean of all objects in the cluster along each dimension, calculated as:
Cl = (1 / |Sl|) · Σ_{Xi∈Sl} Xi
where Cl denotes the center of the l-th cluster (1 ≤ l ≤ k), |Sl| the number of objects in the l-th cluster, and Xi the i-th object of the l-th cluster (1 ≤ i ≤ |Sl|);
clustering the wh-data set with this k-means algorithm yields the widths and heights of the three prior boxes.
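A compact NumPy sketch of this clustering step follows; it implements exactly the Euclidean-distance k-means described above on the wh-data set (YOLO implementations often substitute an IOU-based distance, but Euclidean distance is what the text specifies).

import numpy as np

def kmeans_wh(wh, k=3, iters=100, seed=0):
    """Cluster the (w, h) pairs of the wh-data set into k prior-box sizes."""
    data = np.asarray(wh, dtype=float)  # shape (n, 2)
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # assign each object to the cluster with the nearest center
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the per-dimension mean of its cluster
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers  # with k = 3: the widths and heights of the three prior boxes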
Further, step S4 further includes the following steps:
The images of the training set and the xml data are input into the improved yolov3 network for training, as follows:
the input image is detected using the three prior boxes obtained above; the specific algorithm is as follows:
the input image is divided into S × S grid cells; if the center of a target falls in a cell, that cell is responsible for detecting the target; each cell predicts 3 bounding boxes, and the output dimensions are:
S×S×B×(4+1+C)
where B denotes the number of prediction boxes per cell, set to 3; 4 corresponds to the center coordinates and width and height (bx, by, bw, bh) of each prediction box; 1 to the confidence; and C denotes the total number of classes, set to 1;
the formulas relating the prediction-box parameters to the real-box parameters are as follows:
bx=σ(tx)+cx
by=σ(ty)+cy
bw=pw·e^(tw)
bh=ph·e^(th)
where cx and cy are the coordinates of the top-left corner of the grid cell; pw and ph are the width and height of the prior box mapped onto the feature map; the outputs bx, by, bw and bh are the center coordinates, width and height of the prediction box; tx, ty, tw and th are the center coordinates, width and height of the real box; and σ(tx), σ(ty) denote compressing tx and ty onto (0, 1) with the Sigmoid function;
the confidence is calculated as follows:
C_i^j = Pr(object) × IOU_pred^truth
where C_i^j denotes the confidence of the j-th prediction box of the i-th grid cell; Pr(object) denotes the probability that the current prediction box contains an object; and IOU_pred^truth denotes the value of the IOU between the prediction box and the real box it best matches;
the IOU is calculated as follows:
IOU = (A ∩ B) / (A ∪ B)
where A is the area of the prediction box and B is the area of the real box.
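A small Python sketch of the box decoding and IOU computation follows; the center-format box layout and the helper names are illustrative assumptions.

import numpy as np

def decode_box(t, cell_xy, prior_wh):
    """Turn regression outputs t = (tx, ty, tw, th) into a prediction box
    (bx, by, bw, bh) via bx = sigma(tx)+cx, by = sigma(ty)+cy,
    bw = pw*e^tw, bh = ph*e^th."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    tx, ty, tw, th = t
    cx, cy = cell_xy
    pw, ph = prior_wh
    return sigmoid(tx) + cx, sigmoid(ty) + cy, pw * np.exp(tw), ph * np.exp(th)

def iou(box_a, box_b):
    """IOU of two center-format boxes (x, y, w, h): intersection over union."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0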
Further, the loss function formula used by the detection model is as follows:
Loss = λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(xi - x̂i)² + (yi - ŷi)²]
+ λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(√wi - √ŵi)² + (√hi - √ĥi)²]
- Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- λnoobj Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^noobj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- Σ_{i=0}^{S²} I_i^obj Σ_{c∈classes} [p̂i(c)·log(pi(c)) + (1 - p̂i(c))·log(1 - pi(c))]
where i denotes the i-th grid cell and j the j-th prediction box predicted by each cell; (xi, yi) are the center coordinates of the prediction box of the i-th cell; wi and hi are the width and height of the prediction box; p(c) is the probability that the target belongs to class c; a circumflex marks the corresponding real-box value; λcoord is a weight coefficient; λnoobj is a penalty weight coefficient; and I_ij^obj indicates whether the j-th prediction box of the i-th cell is responsible for predicting the target, taking the value 0 or 1.
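For concreteness, a PyTorch sketch of this five-term loss is given below. It is an illustrative reconstruction under the reading above (squared error for the coordinate terms, binary cross-entropy for the confidence and class terms); the flat tensor layout and the default λ values are assumptions, not taken from the patent.

import torch

def yolo_loss(pred, truth, obj, noobj, lambda_coord=5.0, lambda_noobj=0.5):
    """pred/truth: (N, 5 + C) rows of [x, y, w, h, conf, p(c)...] for every
    grid-cell/box pair, with x, y, conf and p already Sigmoid-activated and
    w, h non-negative; obj/noobj: (N,) 0/1 float indicator masks."""
    eps = 1e-9  # keeps the logarithms finite
    xy = lambda_coord * (obj * ((pred[:, 0:2] - truth[:, 0:2]) ** 2).sum(1)).sum()
    wh = lambda_coord * (obj * ((pred[:, 2:4].sqrt() - truth[:, 2:4].sqrt()) ** 2).sum(1)).sum()
    bce_conf = -(truth[:, 4] * (pred[:, 4] + eps).log()
                 + (1 - truth[:, 4]) * (1 - pred[:, 4] + eps).log())
    conf = (obj * bce_conf).sum() + lambda_noobj * (noobj * bce_conf).sum()
    bce_cls = -(truth[:, 5:] * (pred[:, 5:] + eps).log()
                + (1 - truth[:, 5:]) * (1 - pred[:, 5:] + eps).log()).sum(1)
    cls = (obj * bce_cls).sum()
    return xy + wh + conf + cls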
Further, step S5 comprises training the network model: the parameters of the original yolov3 network configuration file cfg are set; after setting, training on the training set begins and is stopped once the loss function converges, and the weights of the trained network model are saved.
Further, step S6 is specifically:
The image to be detected is detected with the trained network model, and for each grid cell on the image the largest of the confidences of its three predicted boxes is selected; the threshold is set to 0.75; a confidence below the threshold is marked F, and a confidence above the threshold is marked T; the number of T marks is the detected number of people.
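As a final illustration, the thresholding rule reduces to a few lines; the input name is hypothetical, and a real pipeline would usually also apply non-maximum suppression before counting so that one person is not counted by several neighbouring cells.

import numpy as np

def count_people(best_conf_per_cell, threshold=0.75):
    """best_conf_per_cell: the largest confidence among the three boxes
    predicted by each grid cell. Confidences above the threshold are
    marked T, the rest F; the number of T marks is the people count."""
    marks = np.where(np.asarray(best_conf_per_cell) > threshold, "T", "F")
    return int((marks == "T").sum())

# count_people([0.91, 0.30, 0.88, 0.76, 0.12]) -> 3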
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method uses a convolutional neural network better suited to small-target detection as the feature extraction network of the improved yolov3 algorithm; the size of the output feature map is tailored to the small targets formed by the student head top views shot by the classroom camera, which greatly improves the accuracy of target detection.
2. The loss function used by the method is likewise tailored to small-target detection and effectively reduces the impact of the vanishing Sigmoid gradient, so the detection model converges faster and the detection results are more accurate.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
As shown in FIG. 1, the classroom people counting method based on yolov3 of the invention comprises the following steps:
S1, acquiring original images of a classroom, through a camera arranged on the classroom ceiling, as the data set for model training, specifically:
images of students in class are collected by the ceiling-mounted cameras, from no fewer than 20 classrooms, with no fewer than 100 frames collected in each classroom.
S2, labeling the top view of each student's head in the original images, generating labeling files, and computing the labeling frames in those files, specifically:
the top view of each person's head in the collected original images is labeled with the bounding-box image annotation tool LabelImg, all heads being labeled uniformly as a single "head" class; after labeling, an xml labeling file is generated and saved, and the original image is saved as well;
the specific calculation of the labeling frame in the labeling file is as follows:
the four parameters of each labeling frame in the xml labeling file are calculated and converted by the following formulas:
w=W*(X1-X2)
h=H*(Y1-Y2)
where X1, Y1, X2 and Y2 are the four parameters of the labeling frame, W and H are the width and height of the original image, and the outputs w and h correspond to the width and height of the labeling frame;
all the (w, h) pairs produced by the conversion are gathered into a wh-data set.
S3, feature extraction: convolution processing and downsampling are performed on the input image, and the resulting data set is clustered with the k-means algorithm, specifically:
features are extracted from the original image using the improved yolov3 feature extraction network, whose algorithm is as follows:
the input image is resized to 416 × 416 and convolved with 16 convolution kernels of size 3 × 3 at a stride of 1;
let the input image size be k x k, the convolution kernel be n x n, and the convolution formula be as follows:
y_ij = Σ_{u=1}^{n} Σ_{v=1}^{n} w_uv · x_{i-u+1, j-v+1}
where y_ij denotes the pixel value at index (i, j) of the convolution output map, w_uv the value at index (u, v) of the corresponding convolution kernel, and x_{i-u+1, j-v+1} the pixel value of image x at (i-u+1, j-v+1);
the net input y^(1) of the first layer is then standard-normalized, as follows:
ŷ^(1) = (y^(1) - E(y^(1))) / √var(y^(1))
where E(y^(1)) and var(y^(1)) denote, under the current parameters, the expectation and variance of each dimension of y^(1) over the entire training set, and ŷ^(1) is the normalized output of the first layer;
the output is then corrected with the Leaky ReLU activation function, where x denotes the input and a is a positive real number:
f(x) = x,   x > 0
f(x) = x/a, x ≤ 0
the feature map output by the above convolution is downsampled 5 times;
the convolution kernels used are, in turn, 32, 64, 128, 256 and 512 kernels of size 3 × 3, all with a convolution stride of 2, each applied to the previous layer's output feature map;
after the 5 downsamplings of the feature map, 4 groups of convolutional residual modules composed of 256 1 × 1 and 512 3 × 3 convolution kernels extract features from the previous layer's output feature map with a stride of 1, and the output is a 26 × 26 feature map.
S4, establishing a detection model, clustering the data set by using a k-means clustering algorithm, wherein the detection model adopts an improved yolov3 network and comprises a feature extraction network and a target detection layer, and the method specifically comprises the following steps:
S41, clustering the data set with the k-means clustering algorithm, specifically clustering the wh-data set:
a cluster-center count k = 3 is selected; the k-means clustering algorithm is specifically as follows:
given a data sample X of n objects, X = {X1, X2, X3, ..., Xn}, where each object has attributes of m dimensions;
the aim of the k-means algorithm is to group the n objects into the specified k clusters according to their mutual similarity, each object belonging to exactly one cluster, namely the one whose center is nearest to it;
k cluster centers C = {C1, C2, C3, ..., Ck}, 1 < k ≤ n, are initialized;
the Euclidean distance from each object to each cluster center is calculated, as shown in the following equation:
dis(Xi, Cj) = √( Σ_{t=1}^{m} (Xit - Cjt)² )
where Xi denotes the i-th object (1 ≤ i ≤ n), Cj the j-th cluster center (1 ≤ j ≤ k), Xit the t-th attribute of the i-th object (1 ≤ t ≤ m), and Cjt the t-th attribute of the j-th cluster center;
the distances from each object to the cluster centers are compared in turn, and each object is assigned to the cluster whose center is nearest, yielding k clusters {S1, S2, S3, ..., Sk};
the k-means algorithm defines the prototype of a cluster by its center, which is the mean of all objects in the cluster along each dimension, calculated as:
Cl = (1 / |Sl|) · Σ_{Xi∈Sl} Xi
where Cl denotes the center of the l-th cluster (1 ≤ l ≤ k), |Sl| the number of objects in the l-th cluster, and Xi the i-th object of the l-th cluster (1 ≤ i ≤ |Sl|);
clustering the wh-data set with this k-means algorithm yields the widths and heights of the three prior boxes;
S42, inputting the images of the training set and the xml data into the improved yolov3 network for training, as follows:
the input image is detected using the three prior boxes obtained above; the specific algorithm is as follows:
the input image is divided into S × S grid cells; if the center of a target falls in a cell, that cell is responsible for detecting the target; each cell predicts 3 bounding boxes, and the output dimensions are:
S×S×B×(4+1+C)
where B denotes the number of prediction boxes per cell, set to 3; 4 corresponds to the center coordinates and width and height (bx, by, bw, bh) of each prediction box; 1 to the confidence; and C denotes the total number of classes, set to 1;
the formulas relating the prediction-box parameters to the real-box parameters are as follows:
bx=σ(tx)+cx
by=σ(ty)+cy
bw=pw·e^(tw)
bh=ph·e^(th)
where cx and cy are the coordinates of the top-left corner of the grid cell; pw and ph are the width and height of the prior box mapped onto the feature map; the outputs bx, by, bw and bh are the center coordinates, width and height of the prediction box; tx, ty, tw and th are the center coordinates, width and height of the real box; and σ(tx), σ(ty) denote compressing tx and ty onto (0, 1) with the Sigmoid function;
the confidence is calculated as follows:
C_i^j = Pr(object) × IOU_pred^truth
where C_i^j denotes the confidence of the j-th prediction box of the i-th grid cell; Pr(object) denotes the probability that the current prediction box contains an object; and IOU_pred^truth denotes the value of the IOU between the prediction box and the real box it best matches;
the IOU is calculated as follows:
IOU = (A ∩ B) / (A ∪ B)
where A is the area of the prediction box and B is the area of the real box.
The loss function used by the detection model is as follows:
Loss = λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(xi - x̂i)² + (yi - ŷi)²]
+ λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(√wi - √ŵi)² + (√hi - √ĥi)²]
- Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- λnoobj Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^noobj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- Σ_{i=0}^{S²} I_i^obj Σ_{c∈classes} [p̂i(c)·log(pi(c)) + (1 - p̂i(c))·log(1 - pi(c))]
where i denotes the i-th grid cell and j the j-th prediction box predicted by each cell; (xi, yi) are the center coordinates of the prediction box of the i-th cell; wi and hi are the width and height of the prediction box; p(c) is the probability that the target belongs to class c; a circumflex marks the corresponding real-box value; λcoord is a weight coefficient; λnoobj is a penalty weight coefficient; and I_ij^obj indicates whether the j-th prediction box of the i-th cell is responsible for predicting the target, taking the value 0 or 1;
S5, detection model training, specifically:
the network model is trained: the parameters of the original yolov3 network configuration file cfg are set; after setting, training on the training set begins and is stopped once the loss function converges, and the weights of the trained network model are saved.
S6, inputting the image to be detected into the detection model for people counting, specifically:
the image to be detected is detected with the trained network model, and for each grid cell on the image the largest of the confidences of its three predicted boxes is selected; the threshold is set to 0.75; a confidence below the threshold is marked F, and a confidence above the threshold is marked T; the number of T marks is the detected number of people.
It should also be noted that in this specification, terms such as "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. The classroom people counting method based on yolov3 is characterized by comprising the following steps:
s1, acquiring an original image of a classroom as an image set for model training through a camera arranged on the ceiling of the classroom;
s2, labeling the top view of the head of each student in the original image, generating a labeling file, and calculating a labeling frame in the labeling file;
s3, extracting features, namely performing convolution processing and downsampling on the input image;
s4, establishing a detection model, clustering the data set by using a k-means clustering algorithm, wherein the detection model adopts an improved yolov3 network and comprises a feature extraction network and a target detection layer;
s5, training a detection model;
and S6, inputting the image to be detected into the detection model for people number detection.
2. The yolov3-based classroom people counting method as claimed in claim 1, wherein the original images of the classroom are collected specifically as follows:
images of students in class are collected by the ceiling-mounted cameras: images are gathered from no fewer than 20 classrooms, with no fewer than 100 frames collected in each classroom, each frame being an image taken at a different moment.
3. The yolov3-based classroom people counting method as claimed in claim 1, wherein the labeling is specifically:
the top view of each person's head in the collected original images is labeled with the bounding-box image annotation tool LabelImg, all heads being labeled uniformly as a single "head" class; after labeling, an xml labeling file is generated and saved, and the original image is saved as well.
4. The yolov3-based classroom people counting method as defined in claim 3, wherein the calculation of the labeling frame in the labeling file is specifically:
the four parameters of each labeling frame in the xml labeling file are calculated and converted by the following formulas:
w=W*(X1-X2)
h=H*(Y1-Y2)
where X1, Y1, X2 and Y2 are the four parameters of the labeling frame, W and H are the width and height of the original image, and the outputs w and h correspond to the width and height of the labeling frame;
all the (w, h) pairs produced by the conversion are gathered into a wh-data set.
5. The yolov3-based classroom people counting method as claimed in claim 1, wherein said step S3 specifically comprises:
features are extracted from the original image using the improved yolov3 feature extraction network, whose algorithm is as follows:
the input image is resized to 416 × 416 and convolved with 16 convolution kernels of size 3 × 3 at a stride of 1;
let the input image size be k x k, the convolution kernel be n x n, and the convolution formula be as follows:
y_ij = Σ_{u=1}^{n} Σ_{v=1}^{n} w_uv · x_{i-u+1, j-v+1}
where y_ij denotes the pixel value at index (i, j) of the convolution output map, w_uv the value at index (u, v) of the corresponding convolution kernel, and x_{i-u+1, j-v+1} the pixel value of image x at (i-u+1, j-v+1);
the net input y^(1) of the first layer is then standard-normalized, as follows:
ŷ^(1) = (y^(1) - E(y^(1))) / √var(y^(1))
where E(y^(1)) and var(y^(1)) denote, under the current parameters, the expectation and variance of each dimension of y^(1) over the entire training set, and ŷ^(1) is the normalized output of the first layer;
the output is then corrected with the Leaky ReLU activation function, where x denotes the input and a is a positive real number:
f(x) = x,   x > 0
f(x) = x/a, x ≤ 0
the feature map output by the above convolution is downsampled 5 times;
the convolution kernels used are, in turn, 32, 64, 128, 256 and 512 kernels of size 3 × 3, all with a convolution stride of 2, each applied to the previous layer's output feature map;
after the 5 downsamplings of the feature map, 4 groups of convolutional residual modules composed of 256 1 × 1 and 512 3 × 3 convolution kernels extract features from the previous layer's output feature map with a stride of 1, and the output is a 26 × 26 feature map.
6. The yolov3-based classroom people counting method as defined in claim 4, wherein the data set is clustered using the k-means clustering algorithm, specifically the wh-data set:
a cluster-center count k = 3 is selected; the k-means clustering algorithm is specifically as follows:
given a data sample X of n objects, X = {X1, X2, X3, ..., Xn}, where each object has attributes of m dimensions;
the aim of the k-means algorithm is to group the n objects into the specified k clusters according to their mutual similarity, each object belonging to exactly one cluster, namely the one whose center is nearest to it;
k cluster centers C = {C1, C2, C3, ..., Ck}, 1 < k ≤ n, are initialized;
the Euclidean distance from each object to each cluster center is calculated, as shown in the following equation:
dis(Xi, Cj) = √( Σ_{t=1}^{m} (Xit - Cjt)² )
where Xi denotes the i-th object (1 ≤ i ≤ n), Cj the j-th cluster center (1 ≤ j ≤ k), Xit the t-th attribute of the i-th object (1 ≤ t ≤ m), and Cjt the t-th attribute of the j-th cluster center;
the distances from each object to the cluster centers are compared in turn, and each object is assigned to the cluster whose center is nearest, yielding k clusters {S1, S2, S3, ..., Sk};
the k-means algorithm defines the prototype of a cluster by its center, which is the mean of all objects in the cluster along each dimension, calculated as:
Cl = (1 / |Sl|) · Σ_{Xi∈Sl} Xi
where Cl denotes the center of the l-th cluster (1 ≤ l ≤ k), |Sl| the number of objects in the l-th cluster, and Xi the i-th object of the l-th cluster (1 ≤ i ≤ |Sl|);
clustering the wh-data set with this k-means algorithm yields the widths and heights of the three prior boxes.
7. The yolov3-based classroom people counting method as claimed in claim 1, wherein the step S4 further comprises the following steps:
the images of the training set and the xml data are input into the improved yolov3 network for training, as follows:
the input image is detected using the three prior boxes obtained above; the specific algorithm is as follows:
the input image is divided into S × S grid cells; if the center of a target falls in a cell, that cell is responsible for detecting the target; each cell predicts 3 bounding boxes, and the output dimensions are:
S×S×B×(4+1+C)
where B denotes the number of prediction boxes per cell, set to 3; 4 corresponds to the center coordinates and width and height (bx, by, bw, bh) of each prediction box; 1 to the confidence; and C denotes the total number of classes, set to 1;
the formulas relating the prediction-box parameters to the real-box parameters are as follows:
bx=σ(tx)+cx
by=σ(ty)+cy
bw=pw·e^(tw)
bh=ph·e^(th)
where cx and cy are the coordinates of the top-left corner of the grid cell; pw and ph are the width and height of the prior box mapped onto the feature map; the outputs bx, by, bw and bh are the center coordinates, width and height of the prediction box; tx, ty, tw and th are the center coordinates, width and height of the real box; and σ(tx), σ(ty) denote compressing tx and ty onto (0, 1) with the Sigmoid function;
the confidence is calculated as follows:
C_i^j = Pr(object) × IOU_pred^truth
where C_i^j denotes the confidence of the j-th prediction box of the i-th grid cell; Pr(object) denotes the probability that the current prediction box contains an object; and IOU_pred^truth denotes the value of the IOU between the prediction box and the real box it best matches;
the IOU is calculated as follows:
IOU = (A ∩ B) / (A ∪ B)
where A is the area of the prediction box and B is the area of the real box.
8. The yolov3-based classroom people counting method as claimed in claim 7, wherein the loss function used by the detection model is as follows:
Loss = λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(xi - x̂i)² + (yi - ŷi)²]
+ λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(√wi - √ŵi)² + (√hi - √ĥi)²]
- Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- λnoobj Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^noobj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- Σ_{i=0}^{S²} I_i^obj Σ_{c∈classes} [p̂i(c)·log(pi(c)) + (1 - p̂i(c))·log(1 - pi(c))]
where i denotes the i-th grid cell and j the j-th prediction box predicted by each cell; (xi, yi) are the center coordinates of the prediction box of the i-th cell; wi and hi are the width and height of the prediction box; p(c) is the probability that the target belongs to class c; a circumflex marks the corresponding real-box value; λcoord is a weight coefficient; λnoobj is a penalty weight coefficient; and I_ij^obj indicates whether the j-th prediction box of the i-th cell is responsible for predicting the target, taking the value 0 or 1.
9. The yolov3-based classroom people counting method as claimed in claim 1, wherein step S5 comprises training the network model: the parameters of the original yolov3 network configuration file cfg are set; after setting, training on the training set begins and is stopped once the loss function converges, and the weights of the trained network model are saved.
10. The yolov3-based classroom people counting method as claimed in claim 1, wherein the step S6 is specifically:
the image to be detected is detected with the trained network model, and for each grid cell on the image the largest of the confidences of its three predicted boxes is selected; the threshold is set to 0.75; a confidence below the threshold is marked F, and a confidence above the threshold is marked T; the number of T marks is the detected number of people.
CN202110466081.4A 2021-04-28 2021-04-28 Classroom people counting method based on yolov3 Active CN113139481B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110466081.4A CN113139481B (en) 2021-04-28 2021-04-28 Classroom people counting method based on yolov3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110466081.4A CN113139481B (en) 2021-04-28 2021-04-28 Classroom people counting method based on yolov3

Publications (2)

Publication Number Publication Date
CN113139481A true CN113139481A (en) 2021-07-20
CN113139481B CN113139481B (en) 2023-09-01

Family

ID=76816299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110466081.4A Active CN113139481B (en) 2021-04-28 2021-04-28 Classroom people counting method based on yolov3

Country Status (1)

Country Link
CN (1) CN113139481B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989708A (en) * 2021-10-27 2022-01-28 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN114495003A (en) * 2022-01-24 2022-05-13 上海申视信科技有限公司 People number identification and statistics method and system based on improved YOLOv3 network
CN116563797A (en) * 2023-07-10 2023-08-08 安徽网谷智能技术有限公司 Monitoring management system for intelligent campus
CN117557820A (en) * 2024-01-08 2024-02-13 浙江锦德光电材料有限公司 Quantum dot optical film damage detection method and system based on machine vision

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509860A (en) * 2018-03-09 2018-09-07 西安电子科技大学 Hoh Xil Tibetan antelope detection method based on convolutional neural networks
CN108647587A (en) * 2018-04-23 2018-10-12 腾讯科技(深圳)有限公司 Demographic method, device, terminal and storage medium
CN108717798A (en) * 2018-07-16 2018-10-30 辽宁工程技术大学 A kind of intelligent public transportation system based on Internet of Things pattern
CN108830145A (en) * 2018-05-04 2018-11-16 深圳技术大学(筹) A kind of demographic method and storage medium based on deep neural network
CN110060233A (en) * 2019-03-20 2019-07-26 中国农业机械化科学研究院 A kind of corn ear damage testing method
CN110837795A (en) * 2019-11-04 2020-02-25 防灾科技学院 Teaching condition intelligent monitoring method, device and equipment based on classroom monitoring video
CN111626128A (en) * 2020-04-27 2020-09-04 江苏大学 Improved YOLOv 3-based pedestrian detection method in orchard environment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509860A (en) * 2018-03-09 2018-09-07 西安电子科技大学 Hoh Xil Tibetan antelope detection method based on convolutional neural networks
CN108647587A (en) * 2018-04-23 2018-10-12 腾讯科技(深圳)有限公司 Demographic method, device, terminal and storage medium
CN108830145A (en) * 2018-05-04 2018-11-16 深圳技术大学(筹) A kind of demographic method and storage medium based on deep neural network
CN108717798A (en) * 2018-07-16 2018-10-30 辽宁工程技术大学 A kind of intelligent public transportation system based on Internet of Things pattern
CN110060233A (en) * 2019-03-20 2019-07-26 中国农业机械化科学研究院 A kind of corn ear damage testing method
CN110837795A (en) * 2019-11-04 2020-02-25 防灾科技学院 Teaching condition intelligent monitoring method, device and equipment based on classroom monitoring video
CN111626128A (en) * 2020-04-27 2020-09-04 江苏大学 Improved YOLOv 3-based pedestrian detection method in orchard environment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989708A (en) * 2021-10-27 2022-01-28 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN113989708B (en) * 2021-10-27 2024-06-04 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN114495003A (en) * 2022-01-24 2022-05-13 上海申视信科技有限公司 People number identification and statistics method and system based on improved YOLOv3 network
CN116563797A (en) * 2023-07-10 2023-08-08 安徽网谷智能技术有限公司 Monitoring management system for intelligent campus
CN116563797B (en) * 2023-07-10 2023-10-27 安徽网谷智能技术有限公司 Monitoring management system for intelligent campus
CN117557820A (en) * 2024-01-08 2024-02-13 浙江锦德光电材料有限公司 Quantum dot optical film damage detection method and system based on machine vision
CN117557820B (en) * 2024-01-08 2024-04-16 浙江锦德光电材料有限公司 Quantum dot optical film damage detection method and system based on machine vision

Also Published As

Publication number Publication date
CN113139481B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
CN113139481B (en) Classroom people counting method based on yolov3
JP6892558B2 (en) Theological assistance method and theological assistance system adopting the method
CN110334765B (en) Remote sensing image classification method based on attention mechanism multi-scale deep learning
CN110321361B (en) Test question recommendation and judgment method based on improved LSTM neural network model
CN107292246A (en) Infrared human body target identification method based on HOG PCA and transfer learning
CN109376637A (en) Passenger number statistical system based on video monitoring image processing
CN110889672A (en) Student card punching and class taking state detection system based on deep learning
CN109299707A (en) A kind of unsupervised pedestrian re-identification method based on fuzzy deep clustering
CN106156765A (en) safety detection method based on computer vision
CN110321862B (en) Pedestrian re-identification method based on compact ternary loss
CN108256486B (en) Image identification method and device based on nonnegative low-rank and semi-supervised learning
CN107392251B (en) Method for improving target detection network performance by using classified pictures
CN109902615A (en) A kind of multiple age bracket image generating methods based on confrontation network
CN109784288B (en) Pedestrian re-identification method based on discrimination perception fusion
CN107808376A (en) A kind of hand-raising detection method based on deep learning
CN110163567A (en) Classroom roll calling system based on multitask concatenated convolutional neural network
CN114898460B (en) Teacher nonverbal behavior detection method based on graph convolution neural network
CN111860297A (en) SLAM loop detection method applied to indoor fixed space
CN109190458A (en) A kind of person of low position's head inspecting method based on deep learning
CN107832747A (en) A kind of face identification method based on low-rank dictionary learning algorithm
CN116052211A (en) Knowledge distillation-based YOLOv5s lightweight sheep variety identification method and system
CN114627553A (en) Method for detecting classroom scene student behaviors based on convolutional neural network
CN114299279A (en) Unmarked group rhesus monkey motion amount estimation method based on face detection and recognition
CN108280516A (en) The optimization method of Intelligent evolution is mutually won between a kind of multigroup convolutional neural networks
Pei et al. Convolutional neural networks for class attendance

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant