CN112686180A - Method for calculating number of personnel in closed space - Google Patents
- Publication number
- CN112686180A
- Authority
- CN
- China
- Prior art keywords
- people
- training
- data set
- personnel
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a method for calculating the number of people in a closed space, relating to the technical field of computers. The beneficial effects of the invention are: through video acquisition, model training, behavior recognition, and similar means, the people in the closed space are counted automatically from the recognition results.
Description
Technical Field
The invention relates to the technical field of computers, and in particular to a method for calculating the number of people in a closed space.
Background
Transient populations, especially in commercial venues such as online-booked guest rooms and hotels, are characterized by complex composition and short stays. They therefore carry a relatively high epidemic risk and pose a major challenge to current epidemic-control work. One key measure for reducing that risk is to limit gatherings of people.
At present, the number of people in a closed space is determined mainly by manual door-to-door checks or by face-recognition counting. Door-to-door checks consume considerable manpower and material resources and are inefficient, while face-recognition counting is not accurate enough when people generally wear masks.
Disclosure of Invention
In view of the above technical problems, the invention provides a method for calculating the number of people in a closed space.
The method comprises the following steps:
step one, a Yolov5 neural network is trained on photos of people entering and leaving a room, obtaining a person-behavior classification model that classifies a person as entering or leaving;
step two, video acquisition equipment is arranged at the entrance of the closed space to be monitored;
step three, video of people passing through the entrance of the closed space is acquired, and whether each person is entering or leaving is judged with the model from step one;
step four, the number of people in the closed space is incremented or decremented according to the judgment result of step three, yielding the total number of people present.
Preferably, the person-behavior classification model is established as follows:
S1, a training data set is prepared; the data set covers people entering or leaving when the door is opened;
S2, a model is trained, combining a neural network with the training data set obtained in S1;
S3, the loss function is observed, and training stops when the loss value on the training data set converges below a threshold or the set maximum number of iterations is reached, yielding a trained recognition network.
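The stopping rule in S3 can be sketched in a few lines of Python. This is a hedged illustration: the function name, the default convergence threshold, and the averaging window are assumptions, not values from the patent (only the 40000-iteration maximum appears later in S2-1).

```python
def should_stop(loss_history, threshold=0.05, max_iters=40000, window=100):
    """Stop when the mean of the last `window` losses falls below
    `threshold` (convergence), or when `max_iters` iterations have run.
    Mirrors the rule in S3; names and defaults are illustrative."""
    if len(loss_history) >= max_iters:
        return True
    recent = loss_history[-window:]
    return len(recent) == window and sum(recent) / window < threshold
```

Averaging a window of recent losses, rather than testing a single value, avoids stopping on one noisy low reading.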
Preferably, in S1 the training data set is prepared as follows:
S1-1, when the door is opened, a sensor triggers the video acquisition device to record the door-opening time point, and video from a period before and after that point, centered on it, is clipped and stored as training data for the person-behavior analysis model; the clip preferably covers 15 seconds before and 15 seconds after the door opens, 30 seconds in total;
S1-2, frame extraction is performed on the video data from S1-1 with computer equipment, capturing a picture every few frames and storing it in the computer;
S1-3, pictures showing entering or leaving behavior are screened out manually from the pictures obtained in S1-2;
S1-4, the people in the pictures screened in S1-3 are annotated with the LabelImg tool, labeling people who enter and people who leave differently: a person entering is boxed and labeled in, and a person leaving is boxed and labeled out;
S1-5, the pictures annotated in S1-4 are made into a data set following fixed rules and formats, and the data set is randomly divided into a training set and a verification set in a set proportion;
specifically, the pictures annotated in S1-4 are renamed following the Pascal VOC data-set format; a folder named Annotations is created for the xml annotations, a folder named ImageSets for the training and verification splits, and a folder named JPEGImages for the picture set;
S1-6, the data in the data set from S1-5 are normalized and converted into the txt files required by the neural network; each txt file contains the label class, the center-point coordinates of the label box, and the box length and width.
Specifically, the VOC-format xml files are converted into the txt files required by the Yolov5 neural network.
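As a hedged sketch of this conversion step (function names are illustrative; a real pipeline would read the xml files from the Annotations folder), each VOC corner box can be normalized to the class-id, center, width, height format the txt files require:

```python
import xml.etree.ElementTree as ET

def voc_box_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Normalize a VOC corner box to YOLO (cx, cy, w, h) in [0, 1]."""
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return cx, cy, w, h

def voc_xml_to_yolo_lines(xml_text, classes=("in", "out")):
    """Convert one VOC xml annotation into YOLO txt lines:
    '<class_id> <cx> <cy> <w> <h>' per labeled box."""
    root = ET.fromstring(xml_text)
    img_w = int(root.find("size/width").text)
    img_h = int(root.find("size/height").text)
    lines = []
    for obj in root.findall("object"):
        cls_id = classes.index(obj.find("name").text)
        bb = obj.find("bndbox")
        box = voc_box_to_yolo(
            float(bb.find("xmin").text), float(bb.find("ymin").text),
            float(bb.find("xmax").text), float(bb.find("ymax").text),
            img_w, img_h)
        lines.append("%d %.6f %.6f %.6f %.6f" % ((cls_id,) + box))
    return lines
```

The class order ("in", "out") is an assumption matching the two labels defined in S1-4.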
Preferably, the S2 model training is specifically:
S2-1, the training parameters of the Yolov5 neural network are set: learning rate 0.001, 40000 iterations, learning-rate decay at steps 20000 and 30000 with a factor of 0.1 at each step, batch size 64, target-box confidence threshold 0.5, non-maximum-suppression threshold 0.3, network depth multiple 0.33, network width multiple 0.5, input width 608, input height 608, and 3 channels;
S2-2, the data set prepared in S1 is fed to the network for training;
S2-3, pictures input to the Yolov5 neural network are formatted to 608 × 608 pixels, and two feature maps of different sizes, 19 × 19 and 38 × 38, are obtained through convolution, upsampling, residual units, tensor concatenation, and similar operations;
S2-4, prediction-box sizes are chosen for the two feature maps from S2-3 according to the size range of the image region each feature map covers, with bounding boxes of 3 sizes selected per feature map;
S2-5, the prediction boxes from S2-4 are screened by non-maximum suppression (NMS): local maxima are searched, non-maximum elements are suppressed, redundant candidate boxes are eliminated, and the best box positions are found.
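The hyperparameters of S2-1 can be gathered into a configuration sketch. The key names below are illustrative (they are not the official Yolov5 config keys); the step-decay helper shows how the two decay steps and the 0.1 factors act on the base learning rate:

```python
# Hypothetical hyperparameter dictionary collecting the values listed in
# S2-1 (key names are illustrative, not the official Yolov5 config).
TRAIN_CFG = {
    "learning_rate": 0.001,
    "max_iterations": 40000,
    "lr_decay_steps": [20000, 30000],
    "lr_decay_factors": [0.1, 0.1],
    "batch_size": 64,
    "conf_threshold": 0.5,
    "nms_threshold": 0.3,
    "depth_multiple": 0.33,   # Yolov5s depth scaling
    "width_multiple": 0.50,   # Yolov5s width scaling
    "input_size": (608, 608, 3),
}

def lr_at(iteration, cfg=TRAIN_CFG):
    """Learning rate after applying each decay step passed so far."""
    lr = cfg["learning_rate"]
    for step, factor in zip(cfg["lr_decay_steps"], cfg["lr_decay_factors"]):
        if iteration >= step:
            lr *= factor
    return lr
```

Under this schedule the rate is 0.001 until iteration 20000, then drops by a factor of 10 at each of the two decay steps.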
Preferably, the bounding boxes of 3 sizes in S2-4 are specifically:
S2-4-1, the 19 × 19 feature map detects large, close-range human targets; its prediction-box sizes mapped onto the 608 × 608 input image are 142 × 110, 192 × 243, and 459 × 401;
S2-4-2, the 38 × 38 feature map detects medium-sized, medium-distance human targets; its prediction-box sizes mapped onto the 608 × 608 input image are 36 × 75, 76 × 55, and 72 × 146;
Preferably, the selected network model is Yolov5s. Since the targets of this scheme are medium or large, Yolov5s, the model with the smallest depth and feature-map width in the Yolov5 series, is selected, which speeds up model recognition.
The BottleneckCSP modules in the Yolov5 network are replaced with Ghost Bottleneck modules, which apply cheap linear convolutions on top of a small number of feature maps obtained by ordinary convolution to generate additional feature maps; this eliminates redundant features and yields a lighter model that trains faster.
Preferably, a Squeeze-and-Excitation attention module is added to the Yolov5 network. It learns the importance of each feature channel automatically, then boosts useful features and suppresses features of little use to the current task according to that importance;
Preferably, in S2-5 non-maximum suppression is performed with DIoU-NMS. Conventional NMS suppresses redundant detection boxes with the IoU criterion alone, so the overlap area is the only factor, which often causes false suppression when targets occlude one another. DIoU-NMS considers not only the overlap area but also the distance between the center points of the two prediction boxes, improving the model's recognition of occluded, overlapping objects.
IoU is defined as follows:

$$\mathrm{IoU} = \frac{|B_{gt1} \cap B_{gt2}|}{|B_{gt1} \cup B_{gt2}|}$$

where $B_{gt1}$ and $B_{gt2}$ denote the two prediction boxes and IoU is their intersection-over-union ratio.

DIoU is defined as follows:

$$\mathrm{DIoU} = \mathrm{IoU} - \frac{\rho^{2}(b_{gt1}, b_{gt2})}{c^{2}}$$

where $b_{gt1}$ and $b_{gt2}$ denote the center points of the two prediction boxes, $\rho(b_{gt1}, b_{gt2})$ is typically the Euclidean distance between the two center points, and $c$ is the diagonal length of the smallest enclosing region that contains both prediction boxes.
During non-maximum suppression, all prediction boxes are traversed in descending order of confidence, and redundant boxes are deleted by jointly considering the intersection-over-union of two boxes and the distance between their center points, which reduces false suppression of occluded targets.
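A pure-Python sketch of DIoU-NMS under these definitions follows, with boxes as (x1, y1, x2, y2) tuples; the function names and the threshold default are illustrative:

```python
def diou(a, b):
    """DIoU of two boxes: IoU minus the squared center distance over the
    squared diagonal of the smallest region enclosing both boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    iou = inter / union
    # squared distance between the two box centers
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) ** 2
            + ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4.0
    # squared diagonal of the minimum enclosing box
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    return iou - rho2 / (cw ** 2 + ch ** 2)

def diou_nms(boxes, scores, thresh=0.3):
    """Greedy NMS with the DIoU criterion: traverse boxes by descending
    confidence; suppress a box only when its DIoU with a kept box
    exceeds `thresh`. Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if diou(boxes[best], boxes[i]) <= thresh]
    return keep
```

Because the center-distance term subtracts from IoU, a box that overlaps a kept box but whose center lies far from it scores lower and is less likely to be falsely suppressed, which is the occlusion behavior described above.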
When a person enters, he or she passes through the shooting area of the video acquisition device at the entrance, and the trained person-behavior classification model identifies the entering or leaving behavior. When a person opens or closes the door, the door-state sensor is triggered, and the entering and exiting events in that time window are recorded and stored in the computer.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
1. The number of people in the room is calculated automatically, without manual participation;
2. Counting does not depend on faces: even when face information cannot be collected, for example because people are wearing masks, the number of people in the room can still be counted by judging their entering and leaving behaviors.
Drawings
Fig. 1 is a Yolov5 network modification diagram according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a person entering a house according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating a counting process of the personnel in the house according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. Of course, the specific embodiments described herein are merely illustrative of the invention and are not intended to be limiting.
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example 1
Referring to fig. 1 to 3, the invention provides a method for calculating the number of people in a closed space: a door is fitted with a sensor, a video acquisition device is arranged beside the door entrance close to the handle, a fill-light device is configured in the acquisition area of the video acquisition device, and the sensor and the video acquisition device are connected over a network to a computer with a processor and a memory. Through video acquisition, model training, behavior recognition, and similar means, the invention counts the people in the closed space automatically from the recognition results.
In this example the closed space is an online-booked guest room, and the camera of the video acquisition device is a wide-angle camera with a fill light.
The video acquisition device is mounted outside the door frame near the handle, with the lens at a height of 1.55 m and perpendicular to the wall. The bulb in the installation area is no weaker than a 10 W energy-saving lamp or a 5 W LED lamp, and the illuminance in the snapshot area is about 500-1000 lux;
the process of model training is as follows:
S1, a training data set is prepared; the data set covers people entering or leaving when the door is opened;
S1-1, when the door is opened, a sensor triggers the video acquisition device to record the door-opening time point, and video from a period before and after that point, centered on it, is clipped and stored as training data for the person-behavior analysis model. The clip is preferably 15 seconds before and 15 seconds after the door opens, 30 seconds in total;
S1-2, frame extraction is performed on the video data from S1-1 with computer equipment, capturing a picture every 10 frames and storing it in the computer;
S1-3, pictures showing entering or leaving behavior are screened out manually from the pictures obtained in S1-2;
S1-4, the people in the pictures screened in S1-3 are annotated with the LabelImg tool, labeling people who enter and people who leave differently: a person entering is boxed and labeled in, and a person leaving is boxed and labeled out;
S1-5, the pictures annotated in S1-4 are made into a data set following fixed rules and formats, and the data set is randomly divided into a training set and a verification set in a set proportion. The pictures are renamed following the Pascal VOC data-set format; a folder named Annotations is created for the xml annotations, a folder named ImageSets for the training and verification data, and a folder named JPEGImages for the picture set;
S1-6, the data in the data set from S1-5 are normalized and converted into the txt files required by the neural network; each file contains the label class, the center-point coordinates of the label box, and the box length and width.

S2, a model is trained, combining a neural network with the training data set obtained in S1;
Model training uses the Yolov5 neural-network algorithm; since the targets of interest are mainly large, the Yolov5s network, which has the smallest depth, the smallest feature-map width, and the fastest speed, can be selected;
S2-1, the training parameters of the Yolov5 neural network are set, preferably: learning rate 0.001, 40000 iterations, learning-rate decay at steps 20000 and 30000 with a factor of 0.1 at each step, batch size 64, target-box confidence threshold 0.5, network depth multiple 0.33, network width multiple 0.50, input width 608, input height 608, and 3 channels;
s2-2, sending the prepared data set in S1 to a network for training;
the method comprises the steps of replacing a bottleckCSP in a Yolov5 network with a Ghost bottleckP, which is abbreviated as Ghost in figure 1, and performing linear convolution once again on the basis of a small amount of feature graphs obtained by nonlinear convolution to obtain more feature graphs, so that redundant features are eliminated, a lighter model is obtained, and the training speed is higher.
An attention mechanism module of Squeeze-and-Excitation, abbreviated as SELayer in fig. 1, is added to a Yolov5 network, the importance degree of each feature channel is automatically obtained in a learning manner, and then useful features are promoted according to the importance degree and features with little use for the current task are suppressed, so that the performance is improved and the learning speed is accelerated.
And removing the feature map of 76 x 76 in the prediction module in the Yolov5 network, and reducing the sensitivity of the model to small targets.
S2-3, pictures input to the Yolov5 neural network are formatted to 608 × 608 pixels, and two feature maps of different sizes, 19 × 19 and 38 × 38, are obtained through convolution, upsampling, residual units, and tensor concatenation;
S2-4, prediction-box sizes are chosen for the two feature maps from S2-3 according to the size range of the image region each feature map covers, with bounding boxes of 3 sizes selected per feature map. The 19 × 19 feature map detects large, close-range human targets, with prediction-box sizes on the 608 × 608 input image of 142 × 110, 192 × 243, and 459 × 401; the 38 × 38 feature map detects medium-sized, medium-distance human targets, with prediction-box sizes on the 608 × 608 input image of 36 × 75, 76 × 55, and 72 × 146.
S2-5, the prediction boxes from S2-4 are screened by non-maximum suppression: local maxima are searched, non-maximum elements are suppressed, redundant candidate boxes are eliminated, and the best box positions are found. Non-maximum suppression is performed in the DIoU-NMS manner, improving the model's recognition of occluded, overlapping targets.
S2-6, the loss function is observed, and training stops when the loss value on the training data set converges below a threshold or the set maximum number of iterations is reached, yielding a trained recognition network.
The behavior-recognition process is as follows:
S3, the recognition network trained in S2 is combined with the video acquisition device and the sensor to count people entering and leaving the room:
S3-1, if the front of a person is first detected in the shooting area of the video acquisition device and the door-state sensor is then triggered, the number of people detected is added to the total number of people in the room.
S3-2, if the door-state sensor is triggered first and the back of a person is then detected in the shooting area of the video acquisition device, the number of people detected is subtracted from the total number of people in the room.
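The counting bookkeeping behind S3-1 and S3-2 can be sketched as follows, assuming each door event yields the list of behavior labels (in / out) produced by the classification model; the class and method names, and the clamp at zero, are illustrative assumptions:

```python
class RoomCounter:
    """Running total of people inside the room (per S3-1 / S3-2)."""

    def __init__(self):
        self.count = 0

    def update(self, labels):
        """Apply one door event: each 'in' detection adds a person and
        each 'out' detection removes one; the count is clamped at zero
        as a guard against missed entry events (an assumption)."""
        self.count += labels.count("in") - labels.count("out")
        self.count = max(self.count, 0)
        return self.count
```

For example, two people entering followed by one leaving leaves the total at one.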
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent replacements, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (8)
1. A method for calculating the number of people in a closed space, characterized by comprising the following steps:
step one, training a neural network on photos of people entering and leaving a room to obtain a person-behavior classification model that classifies a person as entering or leaving;
step two, arranging video acquisition equipment at the entrance of the closed space to be monitored;
step three, acquiring video of people passing through the entrance of the closed space and judging whether each person is entering or leaving with the model of step one;
step four, incrementing or decrementing the number of people in the closed space according to the judgment result of step three, thereby obtaining the total number of people.
2. The method for calculating the number of people in a closed space according to claim 1, characterized in that the person-behavior classification model is established by a method comprising:
S1, preparing a training data set covering people entering or leaving when the door is opened;
S2, training a model, combining a neural network with the training data set obtained in S1;
S3, observing the loss function and stopping training when the loss value on the training data set converges below a threshold or the set maximum number of iterations is reached, obtaining a trained recognition network.
3. The method for calculating the number of people in a closed space according to claim 2, characterized in that the training data set in S1 is prepared specifically by:
S1-1, when the door is opened, triggering the video acquisition device by a sensor to record the door-opening time point, then clipping and storing video from a period before and after that point, centered on it, as training data for the person-behavior analysis model;
S1-2, performing frame extraction on the video data obtained in S1-1, capturing and storing a picture every few frames;
S1-3, screening out pictures showing entering or leaving behavior from the pictures obtained in S1-2;
S1-4, annotating the people in the pictures screened in S1-3, labeling people who enter and people who leave differently;
S1-5, making the pictures annotated in S1-4 into a data set following fixed rules and formats, and randomly dividing the data set into a training set and a verification set in proportion;
S1-6, normalizing the data in the data set obtained in S1-5 and converting it into the txt files required by the neural network, each txt file containing the label class, the center-point coordinates of the label box, and the box length and width.
4. The method for calculating the number of people in a closed space according to claim 3, characterized in that the S2 model training is specifically:
S2-1, setting the training parameters of the Yolov5 neural network: learning rate, number of iterations, learning-rate decay steps, learning-rate decay factors, batch size, target-box confidence threshold, non-maximum-suppression threshold, and the input width, height, and channel number of the network;
S2-2, feeding the data set prepared in S1 to the network for training;
S2-3, formatting pictures input to the neural network to 608 × 608 pixels, and obtaining two feature maps of different sizes through convolution, upsampling, residual units, and tensor concatenation;
S2-4, choosing prediction-box sizes for the two feature maps obtained in S2-3 according to the size range of the image region each feature map covers, selecting bounding boxes of 3 sizes per feature map;
S2-5, screening the prediction boxes obtained in S2-4 by non-maximum suppression (NMS): searching local maxima, suppressing non-maximum elements, eliminating redundant candidate boxes, and finding the best box positions.
5. The method for calculating the number of people in a closed space according to claim 4, characterized in that the bounding boxes selected from the two feature-map sizes in S2-4 are specifically:
S2-4-1, the 19 × 19 feature map detects large, close-range human targets; its prediction-box sizes mapped onto the 608 × 608 input image are 142 × 110, 192 × 243, and 459 × 401;
S2-4-2, the 38 × 38 feature map detects medium-sized, medium-distance human targets; its prediction-box sizes mapped onto the 608 × 608 input image are 36 × 75, 76 × 55, and 72 × 146.
6. The method for calculating the number of people in a closed space according to claim 4, characterized in that the selected network model is Yolov5s.
The BottleneckCSP in the Yolov5 network is replaced with a Ghost Bottleneck, which performs a further linear convolution on top of a small number of feature maps obtained by nonlinear convolution.
7. The method for calculating the number of people in a closed space according to claim 4, characterized in that a Squeeze-and-Excitation attention module is added to the Yolov5 network; it learns the importance of each feature channel automatically, then boosts useful features and suppresses features of little use to the current task according to that importance.
8. The method for calculating the number of people in a closed space according to claim 4, characterized in that in S2-5 non-maximum suppression is performed with DIoU-NMS, improving the model's recognition of occluded, overlapping targets.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011645491.7A CN112686180A (en) | 2020-12-29 | 2020-12-29 | Method for calculating number of personnel in closed space |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112686180A true CN112686180A (en) | 2021-04-20 |
Family
ID=75456956
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113160062A (en) * | 2021-05-25 | 2021-07-23 | 烟台艾睿光电科技有限公司 | Infrared image target detection method, device, equipment and storage medium |
CN113469073A (en) * | 2021-07-06 | 2021-10-01 | 西安电子科技大学 | SAR image ship detection method and system based on lightweight deep learning |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040179736A1 (en) * | 2001-05-26 | 2004-09-16 | Yin Jia Hong | Automatic classification and/or counting system |
CN1932843A (en) * | 2006-09-11 | 2007-03-21 | 江苏科技大学 | Distributed demographic system based on insertion system |
CN101540892A (en) * | 2009-04-23 | 2009-09-23 | 上海中安电子信息科技有限公司 | Method for people counting in doorway on DSP video gathering device |
CN110008867A (en) * | 2019-03-25 | 2019-07-12 | 五邑大学 | A kind of method for early warning based on personage's abnormal behaviour, device and storage medium |
CN110837795A (en) * | 2019-11-04 | 2020-02-25 | 防灾科技学院 | Teaching condition intelligent monitoring method, device and equipment based on classroom monitoring video |
CN111382708A (en) * | 2020-03-11 | 2020-07-07 | 广东工业大学 | Method and device for detecting burglary behavior of climbing stairs and turning windows in real time |
CN111476205A (en) * | 2020-02-26 | 2020-07-31 | 安徽建筑大学 | Personnel counting method and device based on L STM model |
CN111986228A (en) * | 2020-09-02 | 2020-11-24 | 华侨大学 | Pedestrian tracking method, device and medium based on LSTM model escalator scene |
CN112001228A (en) * | 2020-07-08 | 2020-11-27 | 上海品览数据科技有限公司 | Video monitoring warehouse in-out counting system and method based on deep learning |
-
2020
- 2020-12-29 CN CN202011645491.7A patent/CN112686180A/en active Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040179736A1 (en) * | 2001-05-26 | 2004-09-16 | Yin Jia Hong | Automatic classification and/or counting system |
CN1932843A (en) * | 2006-09-11 | 2007-03-21 | Jiangsu University of Science and Technology | Distributed people-counting system based on an embedded system |
CN101540892A (en) * | 2009-04-23 | 2009-09-23 | Shanghai Zhongan Electronic Information Technology Co., Ltd. | Method for counting people in a doorway on a DSP video-capture device |
CN110008867A (en) * | 2019-03-25 | 2019-07-12 | Wuyi University | Early-warning method, device and storage medium based on abnormal person behaviour |
CN110837795A (en) * | 2019-11-04 | 2020-02-25 | Institute of Disaster Prevention | Intelligent teaching-condition monitoring method, device and equipment based on classroom surveillance video |
CN111476205A (en) * | 2020-02-26 | 2020-07-31 | Anhui Jianzhu University | Personnel counting method and device based on LSTM model |
CN111382708A (en) * | 2020-03-11 | 2020-07-07 | Guangdong University of Technology | Method and device for real-time detection of burglary behaviour such as climbing buildings and entering through windows |
CN112001228A (en) * | 2020-07-08 | 2020-11-27 | Shanghai Pinlan Data Technology Co., Ltd. | Deep-learning-based video surveillance system and method for counting warehouse entries and exits |
CN111986228A (en) * | 2020-09-02 | 2020-11-24 | Huaqiao University | Pedestrian tracking method, device and medium for escalator scenes based on an LSTM model |
Non-Patent Citations (2)
Title |
---|
YE LI et al.: "A Multi-task Joint Framework for Real-time Person Search", arXiv:2012.06418v1, 11 December 2020 (2020-12-11), pages 5-13 * |
PING Jiarong et al.: "Design of a Crowd Counting Model Based on a Lightweight Neural Network", Radio Engineering, no. 06, 19 May 2020 (2020-05-19) * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113160062A (en) * | 2021-05-25 | 2021-07-23 | Yantai IRay Technology Co., Ltd. | Infrared image target detection method, device, equipment and storage medium |
CN113469073A (en) * | 2021-07-06 | 2021-10-01 | Xidian University | SAR image ship detection method and system based on lightweight deep learning |
CN113469073B (en) * | 2021-07-06 | 2024-02-20 | Xidian University | SAR image ship detection method and system based on lightweight deep learning |
Similar Documents
Publication | Title |
---|---|
CN108052925B (en) | Intelligent management method for community personnel files |
CN111598040B (en) | Construction-worker identity recognition and safety-helmet wearing detection method and system |
CN111563557B (en) | Method for detecting targets in a power cable tunnel |
CN109711370A (en) | Data fusion algorithm based on WiFi detection and face clustering |
CN109711309B (en) | Method for automatically identifying whether the eyes in a portrait picture are closed |
CN110688980B (en) | Human body posture classification method based on computer vision |
CN107491728A (en) | Face detection method and device based on an edge computing model |
CN112287827A (en) | Method and system for detecting pedestrian mask wearing in complex environments based on smart lamp poles |
CN112163572A (en) | Method and device for identifying objects |
CN112686180A (en) | Method for calculating number of personnel in closed space |
CN110827432B (en) | Class attendance method and system based on face recognition |
CN112287753A (en) | System and algorithm for improving face recognition accuracy based on machine learning |
CN109117771B (en) | System and method for detecting violent events in images based on anchor nodes |
CN113255804A (en) | Garbage traceability method and device based on image change detection |
CN111754669A (en) | College student management system based on face recognition technology |
Zambanini et al. | Detecting falls at homes using a network of low-resolution cameras |
CN102867214B (en) | Counting management method for people within an area range |
CN111080827A (en) | Attendance system and method |
CN111582195B (en) | Construction method of a Chinese lip-reading monosyllable recognition classifier |
CN112183287A (en) | People-counting method for a mobile robot against complex backgrounds |
CN112541403A (en) | Indoor person fall detection method using an infrared camera |
CN117372956A (en) | Method and device for detecting the state of substation screen-cabinet equipment |
CN112183460A (en) | Method and device for intelligent identification of environmental sanitation conditions |
CN115346169B (en) | Method and system for detecting sleeping-on-duty behaviour |
CN115909144A (en) | Method and system for detecting anomalies in surveillance video based on adversarial learning |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |