CN113139481A - Classroom people counting method based on yolov3 - Google Patents

Classroom people counting method based on yolov3

Info

Publication number
CN113139481A
Authority
CN
China
Prior art keywords
image
follows
frame
classroom
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110466081.4A
Other languages
Chinese (zh)
Other versions
CN113139481B (en)
Inventor
朱静
潘梓沛
林静旖
何伟聪
薛穗华
李昂
尹邦政
黄仙烨
欧阳淑榆
朱雪冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202110466081.4A priority Critical patent/CN113139481B/en
Publication of CN113139481A publication Critical patent/CN113139481A/en
Application granted granted Critical
Publication of CN113139481B publication Critical patent/CN113139481B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/50: Context or environment of the image
    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a classroom people counting method based on yolov3, comprising the following steps: S1, acquiring original images of a classroom, through a camera mounted on the classroom ceiling, as the image set for model training; S2, labeling the top view of each student's head in the original images, generating labeling files, and computing the labeling frames in those files; S3, extracting features by performing convolution and downsampling on the input image; S4, establishing a detection model and clustering the data set with a k-means clustering algorithm, the detection model adopting an improved yolov3 network comprising a feature extraction network and a target detection layer; S5, training the detection model; S6, inputting the image to be detected into the detection model for people counting. Based on the yolov3 algorithm, the method can quickly and accurately count the number of students in a classroom at a given moment, so that during a lecture a teacher can quickly discover whether all students have arrived, whether outsiders have mixed in, or whether anyone has left early, thereby curbing lateness, early departure, and class skipping.

Description

Classroom people counting method based on yolov3
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a classroom people counting method based on yolov3.
Background
Under the university's "walking-class" mode of instruction, lateness, early departure, and class skipping among students are commonplace. Having the teacher count the students in every class is time-consuming, and it cannot reveal in real time whether students leave or enter in the middle of class.
Existing people-detection methods, such as single-chip-microcomputer infrared detection, are easily disturbed by the environment, and suffer from low detection accuracy and high energy consumption.
Disclosure of Invention
The invention mainly aims to overcome the defects and shortcomings of the prior art by providing a classroom people counting method based on yolov3. Built on the yolov3 algorithm, the method can quickly and accurately count the number of students in a classroom at a given moment, so that during a lecture a teacher can quickly discover whether all students have arrived, whether outsiders have mixed in, or whether anyone has left early, thereby curbing lateness, early departure, and class skipping.
In order to achieve the purpose, the invention adopts the following technical scheme:
the classroom people counting method based on yolov3 comprises the following steps:
s1, acquiring an original image of a classroom as an image set for model training through a camera arranged on the ceiling of the classroom;
s2, labeling the top view of the head of each student in the original image, generating a labeling file, and calculating a labeling frame in the labeling file;
s3, extracting features, namely performing convolution processing and downsampling on the input image;
s4, establishing a detection model, clustering the data set by using a k-means clustering algorithm, wherein the detection model adopts an improved yolov3 network and comprises a feature extraction network and a target detection layer;
s5, training a detection model;
S6, inputting the image to be detected into the detection model for people counting.
Further, the acquiring of the original image of the classroom specifically includes:
the method comprises the steps of collecting images of students in class through cameras on ceilings of the classrooms, collecting not less than 20 images of the classrooms, collecting not less than 100 frames of images in each classroom, and enabling each frame to be an image at different moments.
Further, the labeling specifically includes:
The top view of each person's head in the collected original images is labeled with the bounding-box image annotation tool LabelImg, all heads being labeled uniformly as a single "head" class; after labeling, an xml labeling file is generated and saved, and the original image is saved as well.
Further, the calculation of the labeling frame in the labeling file specifically includes:
The four parameters of each labeling frame in the xml labeling file are calculated and converted by the following formulas:
w=W*(X1-X2)
h=H*(Y1-Y2)
where X1, Y1, X2 and Y2 are the four parameters of the labeling frame, W and H are the width and height of the original image, and the outputs w and h correspond to the width and height of the labeling frame;
all the (w, h) pairs produced by the conversion are gathered into a wh-data set.
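A minimal Python sketch of this conversion is given below. It assumes LabelImg's PASCAL-VOC xml layout (a <size> element holding <width>/<height>, and one <bndbox> with xmin/ymin/xmax/ymax per labeled head) and that the corner coordinates are stored normalized to [0, 1], playing the roles of X2, Y2, X1, Y1 above; if the xml stores pixel coordinates instead, the W and H factors drop out.

import glob
import xml.etree.ElementTree as ET

def build_wh_dataset(xml_dir="annotations"):
    """Apply w = W*(X1 - X2), h = H*(Y1 - Y2) to every labeling frame
    and gather the outputs into the wh-data set."""
    wh = []
    for path in glob.glob(f"{xml_dir}/*.xml"):
        root = ET.parse(path).getroot()
        W = float(root.find("size/width").text)
        H = float(root.find("size/height").text)
        for box in root.iter("bndbox"):
            x1, x2 = float(box.find("xmax").text), float(box.find("xmin").text)
            y1, y2 = float(box.find("ymax").text), float(box.find("ymin").text)
            wh.append((W * (x1 - x2), H * (y1 - y2)))
    return wh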
Further, the step S3 is specifically:
Features are extracted from the original image using the improved yolov3 feature extraction network, whose algorithm is as follows:
the input image is resized to 416 × 416 and convolved with 16 convolution kernels of size 3 × 3 at a stride of 1;
let the input image size be k x k, the convolution kernel be n x n, and the convolution formula be as follows:
y_ij = Σ_{u=1}^{n} Σ_{v=1}^{n} w_uv · x_{i-u+1, j-v+1}
where y_ij denotes the pixel value at index (i, j) of the convolution output map, w_uv the value at index (u, v) of the corresponding convolution kernel, and x_{i-u+1, j-v+1} the pixel value of image x at (i-u+1, j-v+1);
the net input y^(1) of the first layer is then standard-normalized, as follows:
ŷ^(1) = (y^(1) - E(y^(1))) / √var(y^(1))
where E(y^(1)) and var(y^(1)) denote, under the current parameters, the expectation and variance of each dimension of y^(1) over the entire training set, and ŷ^(1) is the normalized output of the first layer;
the output is then corrected with the Leaky ReLU activation function, where x denotes the input and a is a positive real number:
f(x) = x,   x > 0
f(x) = x/a, x ≤ 0
the feature map output by the above convolution is downsampled 5 times;
the convolution kernels used are, in turn, 32, 64, 128, 256 and 512 kernels of size 3 × 3, all with a convolution stride of 2, each applied to the previous layer's output feature map;
after the 5 downsamplings of the feature map, 4 groups of convolutional residual modules composed of 256 1 × 1 and 512 3 × 3 convolution kernels extract features from the previous layer's output feature map with a stride of 1, and the output is a 26 × 26 feature map.
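The following PyTorch sketch assembles the layers just described: the 16-kernel 3 × 3 stem, standard normalization with Leaky ReLU after each convolution, five stride-2 downsampling convolutions (32 to 512 channels), and four modules of 256 1 × 1 plus 512 3 × 3 kernels, realized here as residual blocks. It is an illustrative reading of the text rather than the patent's reference code; the skip connection in particular is an assumption.

import torch
import torch.nn as nn

def conv_bn_leaky(c_in, c_out, k, s):
    """Convolution + standard normalization + Leaky ReLU, the basic unit above."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

class HeadModule(nn.Module):
    """One 256x(1x1) -> 512x(3x3) module with an assumed skip connection."""
    def __init__(self, channels=512):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn_leaky(channels, 256, 1, 1),
            conv_bn_leaky(256, channels, 3, 1),
        )

    def forward(self, x):
        return x + self.body(x)

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        layers = [conv_bn_leaky(3, 16, 3, 1)]  # 16 kernels of 3x3, stride 1
        for c_in, c_out in [(16, 32), (32, 64), (64, 128), (128, 256), (256, 512)]:
            layers.append(conv_bn_leaky(c_in, c_out, 3, 2))  # stride-2 downsampling
        layers += [HeadModule(512) for _ in range(4)]  # 4 groups of modules
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# FeatureExtractor()(torch.zeros(1, 3, 416, 416)).shape -> (1, 512, 13, 13);
# note that five stride-2 stages reduce 416 to 13, whereas the text reports a
# 26 x 26 output, so one stage presumably keeps stride 1 in the actual design.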
Further, the data set is clustered by using a k-means clustering algorithm, specifically, the wh-data set is clustered:
A cluster-center count k = 3 is selected; the k-means clustering algorithm is specifically as follows:
given a data sample X of n objects, X = {X1, X2, X3, ..., Xn}, where each object has attributes of m dimensions;
the aim of the k-means algorithm is to group the n objects into the specified k clusters according to their mutual similarity, each object belonging to exactly one cluster, namely the one whose center is nearest to it;
k cluster centers C = {C1, C2, C3, ..., Ck}, 1 < k ≤ n, are initialized;
the Euclidean distance from each object to each cluster center is calculated, as shown in the following equation:
dis(Xi, Cj) = √( Σ_{t=1}^{m} (Xit - Cjt)² )
where Xi denotes the i-th object (1 ≤ i ≤ n), Cj the j-th cluster center (1 ≤ j ≤ k), Xit the t-th attribute of the i-th object (1 ≤ t ≤ m), and Cjt the t-th attribute of the j-th cluster center;
the distances from each object to the cluster centers are compared in turn, and each object is assigned to the cluster whose center is nearest, yielding k clusters {S1, S2, S3, ..., Sk};
the k-means algorithm defines the prototype of a cluster by its center, which is the mean of all objects in the cluster along each dimension, calculated as:
Cl = (1 / |Sl|) · Σ_{Xi∈Sl} Xi
where Cl denotes the center of the l-th cluster (1 ≤ l ≤ k), |Sl| the number of objects in the l-th cluster, and Xi the i-th object of the l-th cluster (1 ≤ i ≤ |Sl|);
clustering the wh-data set with this k-means algorithm yields the widths and heights of the three prior boxes.
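A compact NumPy sketch of this clustering step follows; it implements exactly the Euclidean-distance k-means described above on the wh-data set (YOLO implementations often substitute an IOU-based distance, but Euclidean distance is what the text specifies).

import numpy as np

def kmeans_wh(wh, k=3, iters=100, seed=0):
    """Cluster the (w, h) pairs of the wh-data set into k prior-box sizes."""
    data = np.asarray(wh, dtype=float)  # shape (n, 2)
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # assign each object to the cluster with the nearest center
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the per-dimension mean of its cluster
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers  # with k = 3: the widths and heights of the three prior boxes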
Further, step S4 further includes the following steps:
The images of the training set and the xml data are input into the improved yolov3 network for training, as follows:
the input image is detected using the three prior boxes obtained above; the specific algorithm is as follows:
the input image is divided into S × S grid cells; if the center of a target falls in a cell, that cell is responsible for detecting the target; each cell predicts 3 bounding boxes, and the output dimensions are:
S×S×B×(4+1+C)
where B denotes the number of prediction boxes per cell, set to 3; 4 corresponds to the center coordinates and width and height (bx, by, bw, bh) of each prediction box; 1 to the confidence; and C denotes the total number of classes, set to 1;
the formulas relating the prediction-box parameters to the real-box parameters are as follows:
bx=σ(tx)+cx
by=σ(ty)+cy
bw=pw·e^(tw)
bh=ph·e^(th)
where cx and cy are the coordinates of the top-left corner of the grid cell; pw and ph are the width and height of the prior box mapped onto the feature map; the outputs bx, by, bw and bh are the center coordinates, width and height of the prediction box; tx, ty, tw and th are the center coordinates, width and height of the real box; and σ(tx), σ(ty) denote compressing tx and ty onto (0, 1) with the Sigmoid function;
the confidence is calculated as follows:
C_i^j = Pr(object) × IOU_pred^truth
where C_i^j denotes the confidence of the j-th prediction box of the i-th grid cell; Pr(object) denotes the probability that the current prediction box contains an object; and IOU_pred^truth denotes the value of the IOU between the prediction box and the real box it best matches;
the IOU is calculated as follows:
IOU = (A ∩ B) / (A ∪ B)
where A is the area of the prediction box and B is the area of the real box.
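A small Python sketch of the box decoding and IOU computation follows; the center-format box layout and the helper names are illustrative assumptions.

import numpy as np

def decode_box(t, cell_xy, prior_wh):
    """Turn regression outputs t = (tx, ty, tw, th) into a prediction box
    (bx, by, bw, bh) via bx = sigma(tx)+cx, by = sigma(ty)+cy,
    bw = pw*e^tw, bh = ph*e^th."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    tx, ty, tw, th = t
    cx, cy = cell_xy
    pw, ph = prior_wh
    return sigmoid(tx) + cx, sigmoid(ty) + cy, pw * np.exp(tw), ph * np.exp(th)

def iou(box_a, box_b):
    """IOU of two center-format boxes (x, y, w, h): intersection over union."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0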
Further, the loss function formula used by the detection model is as follows:
Loss = λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(xi - x̂i)² + (yi - ŷi)²]
+ λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(√wi - √ŵi)² + (√hi - √ĥi)²]
- Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- λnoobj Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^noobj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- Σ_{i=0}^{S²} I_i^obj Σ_{c∈classes} [p̂i(c)·log(pi(c)) + (1 - p̂i(c))·log(1 - pi(c))]
where i denotes the i-th grid cell and j the j-th prediction box predicted by each cell; (xi, yi) are the center coordinates of the prediction box of the i-th cell; wi and hi are the width and height of the prediction box; p(c) is the probability that the target belongs to class c; a circumflex marks the corresponding real-box value; λcoord is a weight coefficient; λnoobj is a penalty weight coefficient; and I_ij^obj indicates whether the j-th prediction box of the i-th cell is responsible for predicting the target, taking the value 0 or 1.
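For concreteness, a PyTorch sketch of this five-term loss is given below. It is an illustrative reconstruction under the reading above (squared error for the coordinate terms, binary cross-entropy for the confidence and class terms); the flat tensor layout and the default λ values are assumptions, not taken from the patent.

import torch

def yolo_loss(pred, truth, obj, noobj, lambda_coord=5.0, lambda_noobj=0.5):
    """pred/truth: (N, 5 + C) rows of [x, y, w, h, conf, p(c)...] for every
    grid-cell/box pair, with x, y, conf and p already Sigmoid-activated and
    w, h non-negative; obj/noobj: (N,) 0/1 float indicator masks."""
    eps = 1e-9  # keeps the logarithms finite
    xy = lambda_coord * (obj * ((pred[:, 0:2] - truth[:, 0:2]) ** 2).sum(1)).sum()
    wh = lambda_coord * (obj * ((pred[:, 2:4].sqrt() - truth[:, 2:4].sqrt()) ** 2).sum(1)).sum()
    bce_conf = -(truth[:, 4] * (pred[:, 4] + eps).log()
                 + (1 - truth[:, 4]) * (1 - pred[:, 4] + eps).log())
    conf = (obj * bce_conf).sum() + lambda_noobj * (noobj * bce_conf).sum()
    bce_cls = -(truth[:, 5:] * (pred[:, 5:] + eps).log()
                + (1 - truth[:, 5:]) * (1 - pred[:, 5:] + eps).log()).sum(1)
    cls = (obj * bce_cls).sum()
    return xy + wh + conf + cls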
Further, step S5 comprises training the network model: the parameters of the original yolov3 network configuration file cfg are set; after setting, training on the training set begins and is stopped once the loss function converges, and the weights of the trained network model are saved.
Further, step S6 is specifically:
The image to be detected is detected with the trained network model, and for each grid cell on the image the largest of the confidences of its three predicted boxes is selected; the threshold is set to 0.75; a confidence below the threshold is marked F, and a confidence above the threshold is marked T; the number of T marks is the detected number of people.
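As a final illustration, the thresholding rule reduces to a few lines; the input name is hypothetical, and a real pipeline would usually also apply non-maximum suppression before counting so that one person is not counted by several neighbouring cells.

import numpy as np

def count_people(best_conf_per_cell, threshold=0.75):
    """best_conf_per_cell: the largest confidence among the three boxes
    predicted by each grid cell. Confidences above the threshold are
    marked T, the rest F; the number of T marks is the people count."""
    marks = np.where(np.asarray(best_conf_per_cell) > threshold, "T", "F")
    return int((marks == "T").sum())

# count_people([0.91, 0.30, 0.88, 0.76, 0.12]) -> 3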
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method uses a convolutional neural network better suited to small-target detection as the feature extraction network of the improved yolov3 algorithm; the size of the output feature map is tailored to the small targets formed by the student head top views shot by the classroom camera, which greatly improves the accuracy of target detection.
2. The loss function used by the method is likewise tailored to small-target detection and effectively reduces the impact of the vanishing Sigmoid gradient, so the detection model converges faster and the detection results are more accurate.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
As shown in FIG. 1, the classroom people counting method based on yolov3 of the invention comprises the following steps:
S1, acquiring original images of a classroom, through a camera arranged on the classroom ceiling, as the data set for model training, specifically:
images of students in class are collected by the ceiling-mounted cameras, from no fewer than 20 classrooms, with no fewer than 100 frames collected in each classroom.
S2, labeling the top view of each student's head in the original images, generating labeling files, and computing the labeling frames in those files, specifically:
the top view of each person's head in the collected original images is labeled with the bounding-box image annotation tool LabelImg, all heads being labeled uniformly as a single "head" class; after labeling, an xml labeling file is generated and saved, and the original image is saved as well;
the specific calculation of the labeling frame in the labeling file is as follows:
the four parameters of each labeling frame in the xml labeling file are calculated and converted by the following formulas:
w=W*(X1-X2)
h=H*(Y1-Y2)
where X1, Y1, X2 and Y2 are the four parameters of the labeling frame, W and H are the width and height of the original image, and the outputs w and h correspond to the width and height of the labeling frame;
all the (w, h) pairs produced by the conversion are gathered into a wh-data set.
S3, feature extraction: convolution processing and downsampling are performed on the input image, and the resulting data set is clustered with the k-means algorithm, specifically:
features are extracted from the original image using the improved yolov3 feature extraction network, whose algorithm is as follows:
the input image is resized to 416 × 416 and convolved with 16 convolution kernels of size 3 × 3 at a stride of 1;
let the input image size be k x k, the convolution kernel be n x n, and the convolution formula be as follows:
y_ij = Σ_{u=1}^{n} Σ_{v=1}^{n} w_uv · x_{i-u+1, j-v+1}
where y_ij denotes the pixel value at index (i, j) of the convolution output map, w_uv the value at index (u, v) of the corresponding convolution kernel, and x_{i-u+1, j-v+1} the pixel value of image x at (i-u+1, j-v+1);
the net input y^(1) of the first layer is then standard-normalized, as follows:
ŷ^(1) = (y^(1) - E(y^(1))) / √var(y^(1))
where E(y^(1)) and var(y^(1)) denote, under the current parameters, the expectation and variance of each dimension of y^(1) over the entire training set, and ŷ^(1) is the normalized output of the first layer;
the output is then corrected with the Leaky ReLU activation function, where x denotes the input and a is a positive real number:
f(x) = x,   x > 0
f(x) = x/a, x ≤ 0
the feature map output by the above convolution is downsampled 5 times;
the convolution kernels used are, in turn, 32, 64, 128, 256 and 512 kernels of size 3 × 3, all with a convolution stride of 2, each applied to the previous layer's output feature map;
after the 5 downsamplings of the feature map, 4 groups of convolutional residual modules composed of 256 1 × 1 and 512 3 × 3 convolution kernels extract features from the previous layer's output feature map with a stride of 1, and the output is a 26 × 26 feature map.
S4, establishing a detection model, clustering the data set by using a k-means clustering algorithm, wherein the detection model adopts an improved yolov3 network and comprises a feature extraction network and a target detection layer, and the method specifically comprises the following steps:
S41, clustering the data set with the k-means clustering algorithm, specifically clustering the wh-data set:
a cluster-center count k = 3 is selected; the k-means clustering algorithm is specifically as follows:
given a data sample X of n objects, X = {X1, X2, X3, ..., Xn}, where each object has attributes of m dimensions;
the aim of the k-means algorithm is to group the n objects into the specified k clusters according to their mutual similarity, each object belonging to exactly one cluster, namely the one whose center is nearest to it;
k cluster centers C = {C1, C2, C3, ..., Ck}, 1 < k ≤ n, are initialized;
the Euclidean distance from each object to each cluster center is calculated, as shown in the following equation:
dis(Xi, Cj) = √( Σ_{t=1}^{m} (Xit - Cjt)² )
where Xi denotes the i-th object (1 ≤ i ≤ n), Cj the j-th cluster center (1 ≤ j ≤ k), Xit the t-th attribute of the i-th object (1 ≤ t ≤ m), and Cjt the t-th attribute of the j-th cluster center;
the distances from each object to the cluster centers are compared in turn, and each object is assigned to the cluster whose center is nearest, yielding k clusters {S1, S2, S3, ..., Sk};
the k-means algorithm defines the prototype of a cluster by its center, which is the mean of all objects in the cluster along each dimension, calculated as:
Cl = (1 / |Sl|) · Σ_{Xi∈Sl} Xi
where Cl denotes the center of the l-th cluster (1 ≤ l ≤ k), |Sl| the number of objects in the l-th cluster, and Xi the i-th object of the l-th cluster (1 ≤ i ≤ |Sl|);
clustering the wh-data set with this k-means algorithm yields the widths and heights of the three prior boxes;
S42, inputting the images of the training set and the xml data into the improved yolov3 network for training, as follows:
the input image is detected using the three prior boxes obtained above; the specific algorithm is as follows:
the input image is divided into S × S grid cells; if the center of a target falls in a cell, that cell is responsible for detecting the target; each cell predicts 3 bounding boxes, and the output dimensions are:
S×S×B×(4+1+C)
where B denotes the number of prediction boxes per cell, set to 3; 4 corresponds to the center coordinates and width and height (bx, by, bw, bh) of each prediction box; 1 to the confidence; and C denotes the total number of classes, set to 1;
the formulas relating the prediction-box parameters to the real-box parameters are as follows:
bx=σ(tx)+cx
by=σ(ty)+cy
bw=pw·e^(tw)
bh=ph·e^(th)
where cx and cy are the coordinates of the top-left corner of the grid cell; pw and ph are the width and height of the prior box mapped onto the feature map; the outputs bx, by, bw and bh are the center coordinates, width and height of the prediction box; tx, ty, tw and th are the center coordinates, width and height of the real box; and σ(tx), σ(ty) denote compressing tx and ty onto (0, 1) with the Sigmoid function;
the confidence is calculated as follows:
C_i^j = Pr(object) × IOU_pred^truth
where C_i^j denotes the confidence of the j-th prediction box of the i-th grid cell; Pr(object) denotes the probability that the current prediction box contains an object; and IOU_pred^truth denotes the value of the IOU between the prediction box and the real box it best matches;
the IOU is calculated as follows:
IOU = (A ∩ B) / (A ∪ B)
where A is the area of the prediction box and B is the area of the real box.
The loss function used by the detection model is as follows:
Loss = λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(xi - x̂i)² + (yi - ŷi)²]
+ λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(√wi - √ŵi)² + (√hi - √ĥi)²]
- Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- λnoobj Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^noobj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- Σ_{i=0}^{S²} I_i^obj Σ_{c∈classes} [p̂i(c)·log(pi(c)) + (1 - p̂i(c))·log(1 - pi(c))]
where i denotes the i-th grid cell and j the j-th prediction box predicted by each cell; (xi, yi) are the center coordinates of the prediction box of the i-th cell; wi and hi are the width and height of the prediction box; p(c) is the probability that the target belongs to class c; a circumflex marks the corresponding real-box value; λcoord is a weight coefficient; λnoobj is a penalty weight coefficient; and I_ij^obj indicates whether the j-th prediction box of the i-th cell is responsible for predicting the target, taking the value 0 or 1;
S5, detection model training, specifically:
the network model is trained: the parameters of the original yolov3 network configuration file cfg are set; after setting, training on the training set begins and is stopped once the loss function converges, and the weights of the trained network model are saved.
S6, inputting the image to be detected into the detection model for people counting, specifically:
the image to be detected is detected with the trained network model, and for each grid cell on the image the largest of the confidences of its three predicted boxes is selected; the threshold is set to 0.75; a confidence below the threshold is marked F, and a confidence above the threshold is marked T; the number of T marks is the detected number of people.
It should also be noted that in this specification, terms such as "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. The classroom people counting method based on yolov3 is characterized by comprising the following steps:
s1, acquiring an original image of a classroom as an image set for model training through a camera arranged on the ceiling of the classroom;
s2, labeling the top view of the head of each student in the original image, generating a labeling file, and calculating a labeling frame in the labeling file;
s3, extracting features, namely performing convolution processing and downsampling on the input image;
s4, establishing a detection model, clustering the data set by using a k-means clustering algorithm, wherein the detection model adopts an improved yolov3 network and comprises a feature extraction network and a target detection layer;
s5, training a detection model;
and S6, inputting the image to be detected into the detection model for people number detection.
2. The yolov3-based classroom people counting method as claimed in claim 1, wherein the original images of the classroom are collected specifically as follows:
images of students in class are collected by the ceiling-mounted cameras: images are gathered from no fewer than 20 classrooms, with no fewer than 100 frames collected in each classroom, each frame being an image taken at a different moment.
3. The yolov3-based classroom people counting method as claimed in claim 1, wherein the labeling is specifically:
the top view of each person's head in the collected original images is labeled with the bounding-box image annotation tool LabelImg, all heads being labeled uniformly as a single "head" class; after labeling, an xml labeling file is generated and saved, and the original image is saved as well.
4. The yolov3-based classroom people counting method as defined in claim 3, wherein the calculation of the labeling frame in the labeling file is specifically:
the four parameters of each labeling frame in the xml labeling file are calculated and converted by the following formulas:
w=W*(X1-X2)
h=H*(Y1-Y2)
where X1, Y1, X2 and Y2 are the four parameters of the labeling frame, W and H are the width and height of the original image, and the outputs w and h correspond to the width and height of the labeling frame;
all the (w, h) pairs produced by the conversion are gathered into a wh-data set.
5. The yolov3-based classroom people counting method as claimed in claim 1, wherein said step S3 specifically comprises:
features are extracted from the original image using the improved yolov3 feature extraction network, whose algorithm is as follows:
the input image is resized to 416 × 416 and convolved with 16 convolution kernels of size 3 × 3 at a stride of 1;
let the input image size be k x k, the convolution kernel be n x n, and the convolution formula be as follows:
y_ij = Σ_{u=1}^{n} Σ_{v=1}^{n} w_uv · x_{i-u+1, j-v+1}
where y_ij denotes the pixel value at index (i, j) of the convolution output map, w_uv the value at index (u, v) of the corresponding convolution kernel, and x_{i-u+1, j-v+1} the pixel value of image x at (i-u+1, j-v+1);
the net input y^(1) of the first layer is then standard-normalized, as follows:
ŷ^(1) = (y^(1) - E(y^(1))) / √var(y^(1))
where E(y^(1)) and var(y^(1)) denote, under the current parameters, the expectation and variance of each dimension of y^(1) over the entire training set, and ŷ^(1) is the normalized output of the first layer;
the output is then corrected with the Leaky ReLU activation function, where x denotes the input and a is a positive real number:
f(x) = x,   x > 0
f(x) = x/a, x ≤ 0
the feature map output by the above convolution is downsampled 5 times;
the convolution kernels used are, in turn, 32, 64, 128, 256 and 512 kernels of size 3 × 3, all with a convolution stride of 2, each applied to the previous layer's output feature map;
after the 5 downsamplings of the feature map, 4 groups of convolutional residual modules composed of 256 1 × 1 and 512 3 × 3 convolution kernels extract features from the previous layer's output feature map with a stride of 1, and the output is a 26 × 26 feature map.
6. The yolov3-based classroom people counting method as defined in claim 4, wherein the data set is clustered using the k-means clustering algorithm, specifically the wh-data set:
a cluster-center count k = 3 is selected; the k-means clustering algorithm is specifically as follows:
given a data sample X of n objects, X = {X1, X2, X3, ..., Xn}, where each object has attributes of m dimensions;
the aim of the k-means algorithm is to group the n objects into the specified k clusters according to their mutual similarity, each object belonging to exactly one cluster, namely the one whose center is nearest to it;
k cluster centers C = {C1, C2, C3, ..., Ck}, 1 < k ≤ n, are initialized;
the Euclidean distance from each object to each cluster center is calculated, as shown in the following equation:
dis(Xi, Cj) = √( Σ_{t=1}^{m} (Xit - Cjt)² )
where Xi denotes the i-th object (1 ≤ i ≤ n), Cj the j-th cluster center (1 ≤ j ≤ k), Xit the t-th attribute of the i-th object (1 ≤ t ≤ m), and Cjt the t-th attribute of the j-th cluster center;
the distances from each object to the cluster centers are compared in turn, and each object is assigned to the cluster whose center is nearest, yielding k clusters {S1, S2, S3, ..., Sk};
the k-means algorithm defines the prototype of a cluster by its center, which is the mean of all objects in the cluster along each dimension, calculated as:
Cl = (1 / |Sl|) · Σ_{Xi∈Sl} Xi
where Cl denotes the center of the l-th cluster (1 ≤ l ≤ k), |Sl| the number of objects in the l-th cluster, and Xi the i-th object of the l-th cluster (1 ≤ i ≤ |Sl|);
clustering the wh-data set with this k-means algorithm yields the widths and heights of the three prior boxes.
7. The yolov3-based classroom people counting method as claimed in claim 1, wherein the step S4 further comprises the following steps:
the images of the training set and the xml data are input into the improved yolov3 network for training, as follows:
the input image is detected using the three prior boxes obtained above; the specific algorithm is as follows:
the input image is divided into S × S grid cells; if the center of a target falls in a cell, that cell is responsible for detecting the target; each cell predicts 3 bounding boxes, and the output dimensions are:
S×S×B×(4+1+C)
where B denotes the number of prediction boxes per cell, set to 3; 4 corresponds to the center coordinates and width and height (bx, by, bw, bh) of each prediction box; 1 to the confidence; and C denotes the total number of classes, set to 1;
the formulas relating the prediction-box parameters to the real-box parameters are as follows:
bx=σ(tx)+cx
by=σ(ty)+cy
bw=pw·e^(tw)
bh=ph·e^(th)
where cx and cy are the coordinates of the top-left corner of the grid cell; pw and ph are the width and height of the prior box mapped onto the feature map; the outputs bx, by, bw and bh are the center coordinates, width and height of the prediction box; tx, ty, tw and th are the center coordinates, width and height of the real box; and σ(tx), σ(ty) denote compressing tx and ty onto (0, 1) with the Sigmoid function;
the confidence is calculated as follows:
C_i^j = Pr(object) × IOU_pred^truth
where C_i^j denotes the confidence of the j-th prediction box of the i-th grid cell; Pr(object) denotes the probability that the current prediction box contains an object; and IOU_pred^truth denotes the value of the IOU between the prediction box and the real box it best matches;
the IOU is calculated as follows:
IOU = (A ∩ B) / (A ∪ B)
where A is the area of the prediction box and B is the area of the real box.
8. The yolov3-based classroom people counting method as claimed in claim 7, wherein the loss function used by the detection model is as follows:
Loss = λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(xi - x̂i)² + (yi - ŷi)²]
+ λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [(√wi - √ŵi)² + (√hi - √ĥi)²]
- Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- λnoobj Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^noobj [Ĉi·log(Ci) + (1 - Ĉi)·log(1 - Ci)]
- Σ_{i=0}^{S²} I_i^obj Σ_{c∈classes} [p̂i(c)·log(pi(c)) + (1 - p̂i(c))·log(1 - pi(c))]
where i denotes the i-th grid cell and j the j-th prediction box predicted by each cell; (xi, yi) are the center coordinates of the prediction box of the i-th cell; wi and hi are the width and height of the prediction box; p(c) is the probability that the target belongs to class c; a circumflex marks the corresponding real-box value; λcoord is a weight coefficient; λnoobj is a penalty weight coefficient; and I_ij^obj indicates whether the j-th prediction box of the i-th cell is responsible for predicting the target, taking the value 0 or 1.
9. The yolov3-based classroom people counting method as claimed in claim 1, wherein step S5 comprises training the network model: the parameters of the original yolov3 network configuration file cfg are set; after setting, training on the training set begins and is stopped once the loss function converges, and the weights of the trained network model are saved.
10. The yolov3-based classroom people counting method as claimed in claim 1, wherein the step S6 is specifically:
the image to be detected is detected with the trained network model, and for each grid cell on the image the largest of the confidences of its three predicted boxes is selected; the threshold is set to 0.75; a confidence below the threshold is marked F, and a confidence above the threshold is marked T; the number of T marks is the detected number of people.
CN202110466081.4A 2021-04-28 2021-04-28 Classroom people counting method based on yolov3 Active CN113139481B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110466081.4A CN113139481B (en) 2021-04-28 2021-04-28 Classroom people counting method based on yolov3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110466081.4A CN113139481B (en) 2021-04-28 2021-04-28 Classroom people counting method based on yolov3

Publications (2)

Publication Number Publication Date
CN113139481A true CN113139481A (en) 2021-07-20
CN113139481B CN113139481B (en) 2023-09-01

Family

ID=76816299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110466081.4A Active CN113139481B (en) 2021-04-28 2021-04-28 Classroom people counting method based on yolov3

Country Status (1)

Country Link
CN (1) CN113139481B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989708A (en) * 2021-10-27 2022-01-28 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN114495003A (en) * 2022-01-24 2022-05-13 上海申视信科技有限公司 People number identification and statistics method and system based on improved YOLOv3 network
CN116563797A (en) * 2023-07-10 2023-08-08 安徽网谷智能技术有限公司 Monitoring management system for intelligent campus
CN117557820A (en) * 2024-01-08 2024-02-13 浙江锦德光电材料有限公司 Quantum dot optical film damage detection method and system based on machine vision

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509860A (en) * 2018-03-09 2018-09-07 西安电子科技大学 Hoh Xil Tibetan antelope detection method based on convolutional neural networks
CN108647587A (en) * 2018-04-23 2018-10-12 腾讯科技(深圳)有限公司 Demographic method, device, terminal and storage medium
CN108717798A (en) * 2018-07-16 2018-10-30 辽宁工程技术大学 A kind of intelligent public transportation system based on Internet of Things pattern
CN108830145A (en) * 2018-05-04 2018-11-16 深圳技术大学(筹) A kind of demographic method and storage medium based on deep neural network
CN110060233A (en) * 2019-03-20 2019-07-26 中国农业机械化科学研究院 A kind of corn ear damage testing method
CN110837795A (en) * 2019-11-04 2020-02-25 防灾科技学院 Teaching condition intelligent monitoring method, device and equipment based on classroom monitoring video
CN111626128A (en) * 2020-04-27 2020-09-04 江苏大学 Improved YOLOv 3-based pedestrian detection method in orchard environment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509860A (en) * 2018-03-09 2018-09-07 西安电子科技大学 Hoh Xil Tibetan antelope detection method based on convolutional neural networks
CN108647587A (en) * 2018-04-23 2018-10-12 腾讯科技(深圳)有限公司 Demographic method, device, terminal and storage medium
CN108830145A (en) * 2018-05-04 2018-11-16 深圳技术大学(筹) A kind of demographic method and storage medium based on deep neural network
CN108717798A (en) * 2018-07-16 2018-10-30 辽宁工程技术大学 A kind of intelligent public transportation system based on Internet of Things pattern
CN110060233A (en) * 2019-03-20 2019-07-26 中国农业机械化科学研究院 A kind of corn ear damage testing method
CN110837795A (en) * 2019-11-04 2020-02-25 防灾科技学院 Teaching condition intelligent monitoring method, device and equipment based on classroom monitoring video
CN111626128A (en) * 2020-04-27 2020-09-04 江苏大学 Improved YOLOv 3-based pedestrian detection method in orchard environment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989708A (en) * 2021-10-27 2022-01-28 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN113989708B (en) * 2021-10-27 2024-06-04 福州大学 Campus library epidemic prevention and control method based on YOLO v4
CN114495003A (en) * 2022-01-24 2022-05-13 上海申视信科技有限公司 People number identification and statistics method and system based on improved YOLOv3 network
CN116563797A (en) * 2023-07-10 2023-08-08 安徽网谷智能技术有限公司 Monitoring management system for intelligent campus
CN116563797B (en) * 2023-07-10 2023-10-27 安徽网谷智能技术有限公司 Monitoring management system for intelligent campus
CN117557820A (en) * 2024-01-08 2024-02-13 浙江锦德光电材料有限公司 Quantum dot optical film damage detection method and system based on machine vision
CN117557820B (en) * 2024-01-08 2024-04-16 浙江锦德光电材料有限公司 Quantum dot optical film damage detection method and system based on machine vision

Also Published As

Publication number Publication date
CN113139481B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
CN113139481B (en) Classroom people counting method based on yolov3
JP6892558B2 (en) Theological assistance method and theological assistance system adopting the method
CN110334765B (en) Remote sensing image classification method based on attention mechanism multi-scale deep learning
CN110321361B (en) Test question recommendation and judgment method based on improved LSTM neural network model
CN107292246A (en) Infrared human body target identification method based on HOG PCA and transfer learning
CN109376637A (en) Passenger number statistical system based on video monitoring image processing
CN110889672A (en) Student card punching and class taking state detection system based on deep learning
CN109299707A (en) A kind of unsupervised pedestrian re-identification method based on fuzzy deep clustering
CN106156765A (en) safety detection method based on computer vision
CN110321862B (en) Pedestrian re-identification method based on compact ternary loss
CN108256486B (en) Image identification method and device based on nonnegative low-rank and semi-supervised learning
CN107392251B (en) Method for improving target detection network performance by using classified pictures
CN109902615A (en) A kind of multiple age bracket image generating methods based on confrontation network
CN109784288B (en) Pedestrian re-identification method based on discrimination perception fusion
CN107808376A (en) A kind of hand-raising detection method based on deep learning
CN110163567A (en) Classroom roll calling system based on multitask concatenated convolutional neural network
CN114898460B (en) Teacher nonverbal behavior detection method based on graph convolution neural network
CN111860297A (en) SLAM loop detection method applied to indoor fixed space
CN109190458A (en) A kind of person of low position's head inspecting method based on deep learning
CN107832747A (en) A kind of face identification method based on low-rank dictionary learning algorithm
CN116052211A (en) Knowledge distillation-based YOLOv5s lightweight sheep variety identification method and system
CN114627553A (en) Method for detecting classroom scene student behaviors based on convolutional neural network
CN114299279A (en) Unmarked group rhesus monkey motion amount estimation method based on face detection and recognition
CN108280516A (en) The optimization method of Intelligent evolution is mutually won between a kind of multigroup convolutional neural networks
Pei et al. Convolutional neural networks for class attendance

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant