CN116958204A - Light-weight efficient classroom student standing and sitting detection method and system - Google Patents

Light-weight efficient classroom student standing and sitting detection method and system

Info

Publication number
CN116958204A
CN116958204A (application CN202310990752.6A)
Authority
CN
China
Prior art keywords
head
detection
frame
tracking
sitting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310990752.6A
Other languages
Chinese (zh)
Inventor
杨悦
陈冠岐
黄正林
王亮
王欢良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Qimengzhe Technology Co ltd
Original Assignee
Suzhou Qimengzhe Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Qimengzhe Technology Co ltd filed Critical Suzhou Qimengzhe Technology Co ltd
Priority to CN202310990752.6A priority Critical patent/CN116958204A/en
Publication of CN116958204A publication Critical patent/CN116958204A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a light-weight and efficient method and system for detecting classroom students standing up and sitting down. The method comprises the following steps: step 1: human head detection; step 2: human head tracking; step 3: post-processing judgment, namely judging whether a standing or sitting action has occurred. For lesson scenes with densely seated students and complex student actions, the method detects the standing and sitting actions of both near and distant students in real time and efficiently, based on an improved lightweight detection algorithm and tracking algorithm, under the limited computing resources of low-power embedded recording-and-broadcasting equipment. A simple post-processing judgment rule effectively prevents false action detection. The method balances light computational load against efficient detection performance; the whole system needs only one camera, greatly saving cost while meeting commercial detection performance, and satisfies the requirement of detecting classroom students standing up and sitting down in real time and efficiently on an embedded terminal.

Description

Light-weight efficient classroom student standing and sitting detection method and system
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a light-weight and efficient classroom student standing and sitting detection method and system.
Background
With the rise of smart classrooms in the education industry, intelligent teaching platforms are widely deployed in them. The recording-and-broadcasting teaching system is an important function, applied mainly in scenes such as public classes, quality courses, class competitions, micro-lecture production, microteaching, campus conferences and activity recording. In the recording-and-broadcasting teaching system, a camera collects information to analyze and locate the dynamics of students in the classroom, which are fed back in real time through a large screen: when a student stands up to answer a question, the large screen shows a close-up of that student; when the student finishes answering and sits down, the large screen switches back to the panoramic picture of the classroom. The recording-and-broadcasting teaching system therefore needs to accurately detect, in real time, both the standing-up and sitting-down actions of students and the position of the student currently performing the action.
Currently, the classroom student standing and sitting detection method with use value comprises the following steps:
Methods based on video frame differencing, such as the inter-frame difference method, the background difference method and the dense optical flow method, identify the students performing actions by comparing the regions where the image changes between preceding and following video frames. However, the difference methods are sensitive to light brightness, are not accurate enough at extracting the target frame region of a standing student, and are prone to false detections in a crowded environment such as a classroom; the dense optical flow method has a very large computation load, depends on a high-performance CPU, and still cannot easily extract the student's target frame region accurately. In addition, such methods often require two cameras, one primary and one auxiliary, which increases deployment cost;
the space-time action detection method based on deep learning is a method with excellent performance, however, the method has quite large operation complexity, the algorithm cannot be directly deployed in the embedded equipment by the calculation power of the current embedded terminal, and if a server is used for operation, the cost and the complexity of installation and management are greatly increased;
Methods based on deep learning that judge face/head positions are more practical by comparison. For example, Chinese patent application CN202011158610.6 proposes using face recognition to judge the positions of students' faces in preceding and following frames. In the crowded environment of a class, however, this requires higher computing power on the one hand, and on the other hand its performance degrades greatly in scenes with lowered heads, side faces and the like; it also requires students to register their face information in advance, which is cumbersome. Chinese patent application CN202211092375.6 proposes using 3D head detection and tracking to identify the position information of students' heads for standing/sitting posture recognition, but performing 3D posture recognition on dozens of heads in a classroom also requires higher computing power, which is a great test of hardware cost and real-time performance.
Therefore, the standing and sitting detection task in a crowd-dense classroom environment needs both light computational load and efficient detection performance. On this basis, the light-weight and efficient classroom student standing and sitting detection method and system of the invention are disclosed.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention aims to provide a light and efficient classroom student standing and sitting detection method and system.
In order to achieve the above purpose and achieve the above technical effects, the invention adopts the following technical scheme:
a light-weight and efficient classroom student standing and sitting detection method comprises the following steps:
step 1: detecting the head of a person;
step 2: tracking the head of a person;
step 3: and (5) post-processing judgment, namely judging whether the standing or sitting action exists.
Further, in step 1, the step of detecting the head of the person includes:
and shooting and collecting continuous classroom panoramic video frames by using a camera, zooming an original image to a size suitable for calculation and capable of detecting a head 10 meters away every time a classroom panoramic image is obtained, and detecting the head by using a lightweight head detector so as to detect all head boundary frames in a current video frame picture.
Further, the lightweight head detector is a head detector suitable for NPU operation, trained from the lightweight face detector SCRFD, and the training steps include:
1.1) Collecting dense head data of classroom and school environments with a 4K wide-angle camera, and building a classroom head detection data set, Classroom Head;
1.2) Constructing the head detection model
Starting from the SCRFD-500M model, the structures unfriendly to NPU operation are replaced: the GroupNorm used in the original model's detection head is replaced with BatchNorm, which is then merged with the convolution layer so that the scale operation of the detection head can be deleted; the computation of the model is controlled to N GFLOPs at a W×H input-picture scale, where N is determined by the model's inference speed on the NPU; a network architecture search yields the depth and width of the model backbone, giving the SCRHD-NG model;
1.3) Pre-training the SCRHD-NG model on the CrowdHuman open-source data set;
1.4) Starting from the SCRHD-NG model pre-trained on CrowdHuman, fine-tuning with Classroom Head to finish model training.
Further, in step 2, the step of tracking the head of the person includes:
All head bounding boxes obtained in step 1 for each video frame are input into a head tracking algorithm, and each head bounding box is assigned a tracking ID, so as to obtain the tracking tracks of all head bounding boxes.
Further, the head tracking algorithm performs head tracking based on an improved lightweight tracking algorithm SORT, and the tracking steps include:
2.1) Initializing the tracking track list; let Tracks be the list of tracking tracks corresponding to all head target bounding boxes output by the head detector. Each Track in Tracks stores the tracking trajectory of one head and the state of that Track's Kalman filter. At frame 1, a corresponding Track is initialized for every Detection, i.e., the detected bounding box is stored in the Track as the initial tracking result and the Kalman filter state is initialized;
2.2 A Kalman filter prediction; predicting the latest tracking result of the Tracks by using a Kalman filter to obtain a prediction frame Preds;
2.3) Matching Preds and Detections; a cost matrix is computed from the Euclidean distances between the center points of Preds and Detections, to adapt to the influence of an insufficient frame rate on the tracking algorithm in embedded equipment. Let the center point coordinates of a Pred be (x1, y1) and those of a Detection be (x2, y2); then the corresponding Euclidean distance is:
dst = sqrt((x1 - x2)^2 + (y1 - y2)^2)
and the Euclidean distance cost matrix is:
Cost[i][j] = dst(Pred_i, Detection_j)
where i is the index of the Pred and j is the index of the Detection;
matching is then performed with the Hungarian algorithm according to the cost matrix;
2.4 Processing the matching result, there are three cases:
1) If the Pred and the Detection are successfully matched, adding the current Detection into the Track, and updating the state of the Kalman filter;
2) If a Detection fails to match, initializing a new Track for that Detection;
3) If a Pred fails to match for max_age consecutive frames, deleting the Track corresponding to that Pred.
Further, in step 3, the step of post-processing the decision includes:
and judging whether the standing or sitting action exists currently or not by using a post-processing rule according to the tracking track change of each head boundary box.
Further, the post-processing rule includes:
1) Adjacent-frame judgment rule; h_dst_frame = |y2 - y1| is the y-axis distance between the center points of the head bounding boxes of the same tracking ID in adjacent frames, where y1 is the y-axis value of the head bounding box center in the previous frame and y2 is that in the next frame. If h_dst_frame is larger than a threshold thresh_frame, a standing-up action may have occurred when y2 < y1, and a sitting-down action may have occurred when y2 > y1;
2) Frame-segment judgment rule; h_dst_seg = |min(Y) - max(Y)| is the difference between the maximum and minimum of the head bounding box center y coordinates of the same tracking ID over all frames within one second, where Y is the set of those y-axis values. If h_dst_seg is larger than a threshold thresh_seg, a standing-up action may have occurred when argmin(Y) > argmax(Y), and a sitting-down action may have occurred when argmin(Y) < argmax(Y), where argmin(Y) denotes the frame number at which the minimum is taken and argmax(Y) the frame number at which the maximum is taken;
3) Multiple-threshold rule; the values of the thresholds thresh_frame and thresh_seg are set in proportion to the height H of the human head bounding box;
4) Anti-walking false-detection rule; if the judgment thresholds thresh_frame and thresh_seg are both satisfied, an additional false-detection-prevention judgment is made.
The invention discloses a light-weight and efficient classroom student standing and sitting detection system, which comprises:
the wide-angle camera is used for collecting classroom panoramic data and is installed so that the heads of all students in normal sitting postures are not blocked by front-row students;
the standing and sitting detection module is used for detecting standing and sitting actions of students in a class;
the video close-up module is used for displaying close-ups of standing students; after a standing action is detected, the picture is cropped by expanding the head bounding box of the current ID, the standing student's close-up picture is extracted and displayed at the upper left of the original panoramic picture, and when there are multiple close-up pictures, they are arranged in order; when a sitting action is detected, the student's close-up is released, and when multiple close-ups exist, the unreleased close-ups are arranged in order.
The invention discloses an electronic device, comprising:
a memory for storing a computer program;
and the processor is used for executing the lightweight and efficient classroom student standing and sitting detection method by calling the computer program.
The invention discloses a readable storage medium, on which a computer program is stored, which when being executed by a processor, implements a lightweight and efficient classroom student standing and sitting detection method as described above.
Compared with the prior art, the invention has the beneficial effects that:
the invention discloses a light-weight and efficient classroom student standing and sitting detection method and system, which are capable of achieving both light weight of calculation power and high efficiency of detection performance, effectively saving calculation amount by utilizing a 2D head detection and head tracking algorithm and detection post-processing, simultaneously meeting commercial detection performance and meeting the requirement of detecting the classroom student standing and sitting in an embedded terminal in real time and efficiently.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a network block diagram of a lightweight human head detector of the present invention;
FIG. 3 is a flow chart of head tracking according to the present invention;
FIG. 4 is a post-processing flow chart of the present invention;
fig. 5 is a close-up effect diagram of embodiment 1 of the present invention.
Detailed Description
The present invention is described in detail below so that advantages and features of the present invention can be more easily understood by those skilled in the art, thereby making clear and unambiguous the scope of the present invention.
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
As shown in figs. 1-5, a light-weight and efficient classroom student standing and sitting detection method, which balances light computational load against efficient detection performance, comprises the following steps:
Step 1: human head detection. A 4K wide-angle camera captures continuous classroom panoramic video frames. Since the farthest seat in a classroom can be about 10 meters away, inputting the original 4K image would make the detection computation very large, while scaling the image down too far would leave too few pixels on distant heads for them to be detected. Therefore, each time a classroom panoramic frame is acquired, the original image is scaled to a size that is suitable for computation yet still allows heads 10 meters away to be detected, and a lightweight head detector is used to detect all head bounding boxes in the current video frame;
step 2: and (5) tracking the head of a person. Inputting all the head boundary boxes corresponding to each frame of video frame obtained in the step 1 into a head tracking algorithm, and endowing each head boundary box with a tracking ID (identification) so as to obtain tracking tracks of all the head boundary boxes;
step 3: and (5) post-processing judgment. And judging whether the person with the current ID has standing or sitting actions or not by using a post-processing rule according to the tracking track change of each head boundary box.
In step 1, the lightweight head detector is a head detector suitable for NPU operation, trained from the lightweight face detector SCRFD (see Guo J, Deng J, Lattas A, et al. Sample and computation redistribution for efficient face detection [J]. arXiv preprint arXiv:2105.04714, 2021);
the network structure is shown in fig. 2:
the human head detector consists of backbone, neck, head: the backbox consists of a stack of 6 basicblocks, each consisting of 1 3x3 convolutional layer and 1x1 convolutional layer stack, for extracting features. The neg extracts and fuses the characteristics of the last 3 basic blocks, and consists of a 1x1 convolution layer, a 3x3 convolution layer, up-sampling and tensor addition operation, and is used for fusing the characteristics of 3 different scales. head consists of a 3x3 convolutional layer and sigmoid for outputting the predicted head coordinate position and score at 3 scales.
The training steps comprise:
1.1) Using a 4K wide-angle camera to collect dense head data of classroom and school environments, and building a classroom head detection data set (Classroom Head);
1.2 Construction of human head detection model
Starting from the SCRFD-500M model (500 MFLOPs at a 640×480 input-picture scale), the structures unfriendly to NPU operation are replaced: the GroupNorm used in the original model's detection head is replaced with BatchNorm, which is then merged with the convolution layer so that the scale operation of the detection head can be deleted (a sketch of this fusion is given after the training steps); the computation of the model is controlled to N GFLOPs at a W×H input-picture scale, where N is determined by the model's inference speed on the NPU, and a Network Architecture Search (NAS) yields the depth and width of the model backbone, giving the SCRHD-NG model;
1.3) Pre-training the SCRHD-NG model on the CrowdHuman open-source data set (see Shao S, Zhao Z, Li B, et al. CrowdHuman: A benchmark for detecting human in a crowd [J]. arXiv preprint arXiv:1805.00123, 2018);
1.4) Starting from the SCRHD-NG model pre-trained on CrowdHuman, fine-tuning with Classroom Head to finish model training.
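As noted in step 1.2), merging BatchNorm into the preceding convolution removes the separate per-channel scale operation from the deployed detection head. A minimal sketch of this standard conv-BatchNorm folding follows; it assumes a plain Conv2d/BatchNorm2d pair without groups or dilation.

import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution.

    BatchNorm computes gamma * (conv(x) - mean) / sqrt(var + eps) + beta;
    folding it into the convolution weights makes the scale op disappear
    from the deployed graph.
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_(bn.bias + (conv_bias - bn.running_mean) * scale)
    return fused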
In step 2, the head tracking algorithm performs head tracking based on an improved version of the lightweight tracking algorithm SORT (see Bewley A, Ge Z, Ott L, et al. Simple online and realtime tracking [C]//2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016: 3464-3468), the tracking steps comprising:
2.1) Initializing the tracking track list. Let Tracks be the list of tracking tracks corresponding to all head target bounding boxes output by the head detector. Each Track in Tracks stores the tracking trajectory of one head and the state of that Track's Kalman filter. At frame 1, a corresponding Track is initialized for every Detection, i.e., the detected bounding box is stored in the Track as the initial tracking result and the Kalman filter state is initialized;
2.2 Kalman filter prediction. Predicting the latest tracking result of the Tracks by using a Kalman filter to obtain a prediction frame Preds;
2.3) Preds and Detections are matched. The original SORT algorithm computes the overlap ratio (IoU, intersection over union) of Preds and Detections and builds an IoU cost matrix. Because the computing resources of embedded equipment are limited, the head-detection input image must be large enough to detect distant small heads, so head detection and tracking can hardly be run on all 30 frames per second and frames must be skipped; the displacement of the same head's Detection between adjacent processed frames therefore becomes large while distant head boxes are small, so a Pred and its Detection may have no intersection at all, and an IoU cost matrix would cause matching to fail. The improved SORT algorithm of the invention instead computes the cost matrix from the Euclidean distance between the center points of Preds and Detections, to adapt to the influence of the insufficient frame rate on the tracking algorithm in embedded equipment (a Python sketch of steps 2.2-2.4 is given after step 2.4). Let the center point coordinates of a Pred be (x1, y1) and those of a Detection be (x2, y2); then the corresponding Euclidean distance is:
dst = sqrt((x1 - x2)^2 + (y1 - y2)^2)
and the Euclidean distance cost matrix is:
Cost[i][j] = dst(Pred_i, Detection_j)
where i is the index of the Pred and j is the index of the Detection;
matching is then performed with the Hungarian algorithm according to the cost matrix;
2.4 Processing the matching result. There are three cases of matching results:
1) If the Pred and the Detection are successfully matched, adding the current Detection into the Track, and updating the state of the Kalman filter;
2) If a Detection fails to match, a new Track is initialized for that Detection;
3) If a Pred fails to match for max_age consecutive frames, the Track corresponding to that Pred is deleted.
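The following is a minimal sketch of steps 2.2-2.4, using SciPy's Hungarian solver (linear_sum_assignment). The constant-velocity Kalman model, the unit noise covariances and the gating distance max_dist are illustrative assumptions; max_age = 3 follows Example 1 below.

import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

class Track:
    """One head trajectory with a constant-velocity Kalman state (cx, cy, vx, vy)."""
    F = np.array([[1., 0., 1., 0.], [0., 1., 0., 1.],
                  [0., 0., 1., 0.], [0., 0., 0., 1.]])
    H = np.array([[1., 0., 0., 0.], [0., 1., 0., 0.]])

    def __init__(self, center, track_id):
        self.id, self.misses = track_id, 0
        self.x = np.array([center[0], center[1], 0.0, 0.0])
        self.P = np.eye(4) * 10.0                        # assumed initial covariance

    def predict(self):
        """Step 2.2: predict the next center, giving this Track's Pred."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + np.eye(4)  # assumed process noise
        return self.x[:2]

    def update(self, center):
        """Fold a matched Detection center back into the filter state."""
        y = np.asarray(center) - self.H @ self.x         # innovation
        S = self.H @ self.P @ self.H.T + np.eye(2)       # assumed measurement noise
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        self.misses = 0

def track_step(tracks, det_centers, next_id, max_dist=80.0, max_age=3):
    """Steps 2.2-2.4: predict, build the Euclidean cost matrix, match, update."""
    preds = np.array([t.predict() for t in tracks]).reshape(-1, 2)
    dets = np.asarray(det_centers, float).reshape(-1, 2)
    # Cost[i][j] = sqrt((x1 - x2)^2 + (y1 - y2)^2) for Pred i, Detection j.
    cost = np.sqrt(((preds[:, None, :] - dets[None, :, :]) ** 2).sum(-1))
    if cost.size:
        rows, cols = linear_sum_assignment(cost)
    else:
        rows = cols = np.array([], dtype=int)
    matched_t, matched_d = set(), set()
    for i, j in zip(rows, cols):
        if cost[i, j] <= max_dist:                       # gate implausible pairs
            tracks[i].update(dets[j])
            matched_t.add(i)
            matched_d.add(j)
    for j in range(len(dets)):                           # unmatched Detection: new Track
        if j not in matched_d:
            tracks.append(Track(dets[j], next_id))
            next_id += 1
    for i in reversed(range(len(preds))):                # unmatched Pred: age out
        if i not in matched_t:
            tracks[i].misses += 1
            if tracks[i].misses >= max_age:
                del tracks[i]
    return next_id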
In step 3, the post-processing rule is:
1) Adjacent-frame judgment rule. h_dst_frame = |y2 - y1| is the y-axis distance between the center points of the head bounding boxes of the same tracking ID in adjacent frames, where y1 is the y-axis value of the head bounding box center in the previous frame and y2 is that in the next frame. If h_dst_frame is greater than the threshold thresh_frame, a standing-up action may have occurred when y2 < y1, and a sitting-down action when y2 > y1;
2) Frame-segment judgment rule. h_dst_seg = |min(Y) - max(Y)| is the difference between the maximum and minimum of the head bounding box center y coordinates of the same tracking ID over all frames within one second, where Y is the set of those y-axis values. If h_dst_seg is greater than the threshold thresh_seg, a standing-up action may have occurred when argmin(Y) > argmax(Y), and a sitting-down action may have occurred when argmin(Y) < argmax(Y), where argmin(Y) denotes the frame number at which the minimum is taken and argmax(Y) the frame number at which the maximum is taken;
3) Multiple-threshold rule. Under the camera's panoramic view, front-row students' heads appear larger in the picture and back-row students' heads smaller. If the thresholds thresh_frame and thresh_seg are set too large, back-row students' actions cannot be detected; if set too small, a front-row student merely raising the head after lying on the desk is also falsely detected. The larger the y-axis coordinate of the head bounding box center, the closer the head is to the camera and the larger the box's width and height in the picture; the smaller the y-axis coordinate, the farther the head and the smaller the box. Therefore the values of thresh_frame and thresh_seg are set in proportion to the head bounding box height H, which effectively adapts the standing and sitting detection to students at different distances.
4) Anti-walking false-detection rule. Under the camera's panoramic view, when a person walks toward or away from the camera, the y-axis coordinate of the head bounding box center also changes, which may satisfy the judgment thresholds thresh_frame and thresh_seg. If both thresholds are satisfied, a false-detection-prevention judgment is made: with the frame numbers argmin(Y) and argmax(Y) known, the areas of the head bounding boxes of this ID at those two moments are computed as area_min_y and area_max_y respectively. When a person walks away from the camera, the head bounding box area gradually shrinks; when a person approaches the camera, it gradually grows; standing up or sitting down in place does not change the area significantly. The ratio h_ratio = area_min_y / area_max_y is therefore computed; if it is smaller than a threshold thresh_ratio, the motion is judged to be walking back and forth rather than standing up or sitting down (a sketch of rules 1-4 follows).
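The following is a minimal sketch of the four rules above. The track-history layout and the thresh_frame coefficient are illustrative assumptions; only thresh_seg = 2H and thresh_ratio = 0.85 are fixed by Example 1 below.

def judge_action(history, fps=15,
                 frame_coef=0.5, seg_coef=2.0, thresh_ratio=0.85):
    """Apply post-processing rules 1)-4) to one track's recent head boxes.

    history is assumed to be a list of (cx, cy, w, h) head boxes, one per
    processed frame, newest last. frame_coef is an assumed proportion for
    rule 3's scaling of thresh_frame with the box height H.
    """
    if len(history) < 2:
        return None
    H = history[-1][3]                            # current head-box height

    # Rule 1: adjacent-frame jump of the center y coordinate.
    y_prev, y_cur = history[-2][1], history[-1][1]
    if abs(y_cur - y_prev) <= frame_coef * H:     # thresh_frame (rule 3 scaling)
        return None
    # Image y grows downward, so a rising head means y decreases.
    candidate = "stand_up" if y_cur < y_prev else "sit_down"

    # Rule 2: max-min spread of center y over the last second of frames.
    recent = history[-fps:]
    ys = [box[1] for box in recent]
    if abs(min(ys) - max(ys)) <= seg_coef * H:    # thresh_seg = 2H
        return None
    i_min, i_max = ys.index(min(ys)), ys.index(max(ys))  # argmin(Y), argmax(Y)
    candidate = "stand_up" if i_min > i_max else "sit_down"

    # Rule 4: reject walking toward/away from the camera by box-area ratio.
    area_min_y = recent[i_min][2] * recent[i_min][3]
    area_max_y = recent[i_max][2] * recent[i_max][3]
    if area_min_y / area_max_y < thresh_ratio:
        return None                               # walking, not standing/sitting
    return candidate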
A light-weight and efficient classroom student standing and sitting detection system, comprising:
a wide-angle camera with 4K resolution (3840×2160) at 30 fps, used to collect classroom panoramic data and installed more than 2 meters above the center of the blackboard, so that the heads of all students in normal sitting postures are not blocked by front-row students;
and the standing and sitting detection module is used for detecting standing and sitting actions of students in the class. Judging the standing and sitting of the student according to the change of the motion trail of the head of the student;
and the video close-up module is used for carrying out close-up display on the standing students. After the standing action is detected, carrying out expansion buckling according to a human head boundary box of the current ID, extracting standing student close-up pictures, displaying the close-up pictures at the upper left of the original panoramic picture, and when a plurality of close-up pictures exist, arranging the close-up pictures in sequence; when the sitting action is detected, the student's close-ups are released, and when a plurality of close-ups are generated, the unreleased close-ups are arranged in sequence.
The invention discloses an electronic device, comprising:
a memory for storing a computer program;
and the processor is used for executing the lightweight and efficient classroom student standing and sitting detection method by calling the computer program.
The invention discloses a readable storage medium, on which a computer program is stored, which when being executed by a processor, implements a lightweight and efficient classroom student standing and sitting detection method as described above.
For lesson scenes with densely seated students and complex student actions, the invention detects the standing and sitting actions of both near and distant students in real time and efficiently, based on an improved lightweight detection algorithm and tracking algorithm, under the limited computing resources of low-power embedded recording-and-broadcasting equipment. A simple post-processing judgment rule effectively prevents false action detection, and the whole system needs only one camera, greatly saving cost.
Example 1
As shown in fig. 1-5, a light-weight and efficient classroom student standing and sitting detection method comprises the following steps:
step 1: human head detection
1.1) Using a 4K wide-angle camera, collecting high-quality classroom head data with different backgrounds and angles under different illumination, from real classroom videos and online classroom photos, labeling the head target boxes, and building the head detection training data set ClassroomHead;
1.2) Starting from the SCRFD-500M model, the model input size is set to 960×544 and NAS is used to search the model backbone. The backbone consists of depth-wise convolutions; each basic block is a stack of conv3x3 and conv1x1, and the blocks are stacked with channels [24, 24, 48, 96, 192, 288] and stacking depths [1, 1, 2, 3, 2, 6], so that the overall computation of the model is controlled to 2 GFLOPs. The NPU-unfriendly structures in the original SCRFD model are replaced: the GroupNorm used in the original detection head is replaced with BatchNorm and merged with the convolution layer so that the scale operation of the detection head is deleted, giving the SCRHD-2G model. Inference on one 540p picture takes only 28 ms on a low-compute NPU of 0.5 TOPS; the network structure is shown in fig. 2;
1.3) Pre-training the SCRHD-2G model on the CrowdHuman open-source data set, so that the model gains better generalization and representation ability;
1.4) Fine-tuning the pre-trained model from step 1.3) with the ClassroomHead training set made in step 1.1), improving the model's detection performance in classroom scenes; the model reaches 98.7% mAP on the ClassroomHead validation set, completing SCRHD-2G model training;
1.5) Compressing the classroom video captured by the 4K/30fps camera and skipping frames down to 540p/15fps, then feeding it into the SCRHD-2G model to obtain the classroom head bounding boxes, as sketched below.
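A minimal OpenCV sketch of this preprocessing, assuming the 960×544 model input size set in step 1.2) and alternate-frame skipping to halve 30 fps to 15 fps:

import cv2

def frames_for_detector(path, size=(960, 544), skip=2):
    """Yield every `skip`-th frame of a 4K/30fps video, downscaled to the
    SCRHD-2G input size; skip=2 halves 30 fps to the 15 fps used here."""
    cap = cv2.VideoCapture(path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % skip == 0:
            yield cv2.resize(frame, size)         # 4K -> 960x544 (540p class)
        index += 1
    cap.release()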
Step 2: human head tracking
2.1) Initializing the tracking track list. Let Tracks be the list of tracking tracks corresponding to all head target bounding boxes output by the head detector. Each Track in Tracks stores the tracking trajectory of one head and the state of that Track's Kalman filter. At frame 1, a corresponding Track is initialized for every Detection, i.e., the detected bounding box is stored in the Track as the initial tracking result and the Kalman filter state is initialized;
2.2 Kalman filter prediction. Predicting the latest tracking result of the Tracks by using a Kalman filter to obtain a prediction frame Preds;
2.3) Preds and Detections are matched. The original SORT algorithm computes the overlap ratio (IoU, intersection over union) of Preds and Detections and builds an IoU cost matrix. Because the computing resources of embedded equipment are limited, the head-detection input image must be large enough to detect distant small heads, so head detection and tracking can hardly be run on all 30 frames per second and frames must be skipped; the displacement of the same head's Detection between adjacent processed frames therefore becomes large while distant head boxes are small, so a Pred and its Detection may have no intersection at all, and an IoU cost matrix would cause matching to fail. The improved SORT algorithm of the invention instead computes the cost matrix from the Euclidean distance between the center points of Preds and Detections, to adapt to the influence of the insufficient frame rate on the tracking algorithm in embedded equipment. Let the center point coordinates of a Pred be (x1, y1) and those of a Detection be (x2, y2); then the corresponding Euclidean distance is:
dst = sqrt((x1 - x2)^2 + (y1 - y2)^2)
and the Euclidean distance cost matrix is:
Cost[i][j] = dst(Pred_i, Detection_j)
where i is the index of the Pred and j is the index of the Detection;
matching is then performed with the Hungarian algorithm according to the cost matrix;
2.4 Processing the matching result. There are three cases of matching results:
1) If the Pred and the Detection are successfully matched, adding the current Detection into the Track, and updating the state of the Kalman filter;
2) If a Detection fails to match, a new Track is initialized for that Detection;
3) If a Pred fails to match, the unmatched count of its corresponding tracking list is increased by 1; once matching has failed for max_age consecutive frames (max_age = 3 in this embodiment, i.e., 3 consecutive frames unmatched), the Track corresponding to that Pred is deleted.
Step 3: post-processing determination, as shown in fig. 4:
3.1) Performing the first-stage inter-frame judgment on all tracking track lists obtained in step 2. The adjacent-frame judgment checks whether the difference h_dst_frame = |y2 - y1| between the y-axis values of the head bounding box center coordinates is greater than the adjacent-frame threshold thresh_frame, which is set in proportion to the head bounding box height H of the current ID, where y2 is the y-axis value of the head bounding box center in the next frame and y1 is that in the previous frame. If 3 consecutive frames satisfy this judgment rule, the second-stage judgment is entered starting from the frame of the third judgment. When y2 < y1, a standing-up action may have occurred; when y2 > y1, a sitting-down action may have occurred. If the judgment rule is not satisfied, judging continues;
3.2) If the judgment of step 3.1) is satisfied, further performing the second-stage frame-segment judgment on all tracking track lists obtained in step 2. For each tracking track list, it is first judged whether the difference h_dst_seg = |min(Y) - max(Y)| between the maximum and minimum of the head bounding box center y coordinates stored over the past 15 frames (1 second at 15 fps) is greater than the frame-segment threshold thresh_seg, which is set to 2H, where Y is the set of y-axis values of the head bounding box centers of the current tracking track over the 15 frames and H is the head bounding box height of the current ID. If argmin(Y) > argmax(Y), a standing-up action may have occurred; if argmin(Y) < argmax(Y), a sitting-down action may have occurred, where argmin(Y) denotes the frame number at which the minimum is taken and argmax(Y) the frame number at which the maximum is taken. If the judgment is not satisfied, the frame-segment judgment continues forward for 12 more frames; if it is still not satisfied within those 12 frames, the process returns to step 3.1);
3.3) To prevent someone walking straight toward or straight away from the camera from satisfying the judgments of steps 3.1) and 3.2), the areas of the head bounding boxes at the two moments given by the frame numbers argmin(Y) and argmax(Y) are computed as area_min_y and area_max_y respectively. When a person in the picture walks away from the camera, the head bounding box area gradually shrinks; when a person approaches the camera, it gradually grows; standing up or sitting down in place does not change the area much. The ratio h_ratio = area_min_y / area_max_y is therefore computed. In this embodiment thresh_ratio is set to 0.85; if h_ratio is smaller than thresh_ratio, the motion is considered walking back and forth rather than standing up or sitting down; otherwise a standing-up or sitting-down message is output.
A light-weight and efficient classroom student standing and sitting detection system, comprising:
a wide-angle camera with 4K resolution (3840×2160) at 30 fps, used to collect classroom panoramic data and installed more than 2 meters above the center of the blackboard, so that, with all students in normal sitting postures, heads are not blocked by front-row students;
and the standing and sitting detection module is used for detecting standing and sitting actions of students in the class. And judging the standing and sitting of the student according to the change of the head position of the student. The module only needs about 50ms for detecting the standing and sitting of students in a 4K picture on the camera module;
and the video close-up module is used for carrying out close-up display on the standing students, as shown in fig. 5. And after the standing action is detected, buckling the picture according to the head boundary box of the current ID, extracting a standing student close-up picture, and expanding the picture by taking the extracted head boundary box as the center to obtain the close-up frame. Let the coordinates of the human head bounding box be expressed as (x 1, y1, x2, y 2) in terms of the upper left corner coordinates plus the lower right corner coordinates, and the width and height of the human head bounding box be (w=x2-x 1, h=y2-y 1), then the extraction frame range is (x 1-2w, y1-h, x2+2w, y2+2h), and then the frame pixels are complemented to 16:9, finally scaling to 720p size, displaying the close-up picture on the original 4K panoramic picture, and placing in the order of nine squares from the upper left. When sitting actions are detected, the student features are released, and the unreleased features are sequentially moved to replace the released features.
An electronic device, comprising:
a memory for storing a computer program;
and the processor is used for executing the lightweight and efficient classroom student standing and sitting detection method by calling the computer program.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the light-weight and efficient classroom student standing and sitting detection method as described above.
Parts or structures of the present invention, which are not specifically described, may be existing technologies or existing products, and are not described herein.
Those skilled in the art can easily derive other variations from the above implementations, such as different head detection models, different tracking algorithms and different parameter judgment rules. Accordingly, the present invention is not limited to the above examples, which are provided merely to illustrate one form of the present invention in detail and by example. All technical solutions obtained by persons skilled in the art through various equivalents of the above specific examples, without departing from the scope of the present invention, shall be included in the scope of the claims and their equivalents.

Claims (10)

1. A light-weight and efficient classroom student standing and sitting detection method, characterized by comprising the following steps:
step 1: detecting the head of a person;
step 2: tracking the head of a person;
step 3: and (5) post-processing judgment, namely judging whether the standing or sitting action exists.
2. The light-weight and efficient classroom student standing and sitting detection method according to claim 1, wherein in step 1, the step of detecting the head of a person comprises:
and shooting and collecting continuous classroom panoramic video frames by using a camera, zooming an original image to a size suitable for calculation and capable of detecting a head 10 meters away every time a classroom panoramic image is obtained, and detecting the head by using a lightweight head detector so as to detect all head boundary frames in a current video frame picture.
3. The light-weight and efficient classroom student standing and sitting detection method according to claim 2, wherein the lightweight head detector is a head detector suitable for NPU operation, trained from the lightweight face detector SCRFD, and the training steps include:
1.1) Collecting dense head data of classroom and school environments with a 4K wide-angle camera, and building a classroom head detection data set, Classroom Head;
1.2) Constructing the head detection model
Starting from the SCRFD-500M model, the structures unfriendly to NPU operation are replaced: the GroupNorm used in the original model's detection head is replaced with BatchNorm, which is then merged with the convolution layer so that the scale operation of the detection head can be deleted; the computation of the model is controlled to N GFLOPs at a W×H input-picture scale, where N is determined by the model's inference speed on the NPU; a network architecture search yields the depth and width of the model backbone, giving the SCRHD-NG model;
1.3) Pre-training the SCRHD-NG model on the CrowdHuman open-source data set;
1.4) Starting from the SCRHD-NG model pre-trained on CrowdHuman, fine-tuning with Classroom Head to finish model training.
4. The light-weight and efficient classroom student standing and sitting detection method according to claim 1, wherein in step 2, the step of head tracking comprises:
All head bounding boxes obtained in step 1 for each video frame are input into a head tracking algorithm, and each head bounding box is assigned a tracking ID, so as to obtain the tracking tracks of all head bounding boxes.
5. The method for detecting standing and sitting of students in a classroom with light weight and high efficiency according to claim 4, wherein the head tracking algorithm performs head tracking based on an improved lightweight tracking algorithm SORT, and the tracking step comprises:
2.1) Initializing the tracking track list; let Tracks be the list of tracking tracks corresponding to all head target bounding boxes output by the head detector. Each Track in Tracks stores the tracking trajectory of one head and the state of that Track's Kalman filter. At frame 1, a corresponding Track is initialized for every Detection, i.e., the detected bounding box is stored in the Track as the initial tracking result and the Kalman filter state is initialized;
2.2 A Kalman filter prediction; predicting the latest tracking result of the Tracks by using a Kalman filter to obtain a prediction frame Preds;
2.3) Matching Preds and Detections; a cost matrix is computed from the Euclidean distances between the center points of Preds and Detections, to adapt to the influence of an insufficient frame rate on the tracking algorithm in embedded equipment. Let the center point coordinates of a Pred be (x1, y1) and those of a Detection be (x2, y2); then the corresponding Euclidean distance is:
dst = sqrt((x1 - x2)^2 + (y1 - y2)^2)
and the Euclidean distance cost matrix is:
Cost[i][j] = dst(Pred_i, Detection_j)
where i is the index of the Pred and j is the index of the Detection;
matching is then performed with the Hungarian algorithm according to the cost matrix;
2.4 Processing the matching result, there are three cases:
1) If the Pred and the Detection are successfully matched, adding the current Detection into the Track, and updating the state of the Kalman filter;
2) If a Detection fails to match, initializing a new Track for that Detection;
3) If a Pred fails to match for max_age consecutive frames, deleting the Track corresponding to that Pred.
6. The light-weight and efficient classroom student standing and sitting detection method according to claim 1, wherein in step 3, the step of post-processing decision comprises:
and judging whether the standing or sitting action exists currently or not by using a post-processing rule according to the tracking track change of each head boundary box.
7. The lightweight, efficient classroom student sit-up detection method of claim 6 wherein the post-processing rules comprise:
1) Adjacent-frame judgment rule; h_dst_frame = |y2 - y1| is the y-axis distance between the center points of the head bounding boxes of the same tracking ID in adjacent frames, where y1 is the y-axis value of the head bounding box center in the previous frame and y2 is that in the next frame. If h_dst_frame is larger than a threshold thresh_frame, a standing-up action may have occurred when y2 < y1, and a sitting-down action may have occurred when y2 > y1;
2) Frame-segment judgment rule; h_dst_seg = |min(Y) - max(Y)| is the difference between the maximum and minimum of the head bounding box center y coordinates of the same tracking ID over all frames within one second, where Y is the set of those y-axis values. If h_dst_seg is larger than a threshold thresh_seg, a standing-up action may have occurred when argmin(Y) > argmax(Y), and a sitting-down action may have occurred when argmin(Y) < argmax(Y), where argmin(Y) denotes the frame number at which the minimum is taken and argmax(Y) the frame number at which the maximum is taken;
3) Multiple-threshold rule; the values of the thresholds thresh_frame and thresh_seg are set in proportion to the height H of the human head bounding box;
4) Anti-walking false-detection rule; if the judgment thresholds thresh_frame and thresh_seg are both satisfied, an additional false-detection-prevention judgment is made.
8. A light-weight and efficient classroom student standing and sitting detection system, comprising:
the wide-angle camera is used for collecting classroom panoramic data and ensuring that the heads of all students in normal sitting postures are not blocked by front-row students;
the standing and sitting detection module is used for detecting standing and sitting actions of students in a class;
the video close-up module is used for displaying close-ups of standing students; after a standing action is detected, the picture is cropped by expanding the head bounding box of the current ID, the standing student's close-up picture is extracted and displayed at the upper left of the original panoramic picture, and when there are multiple close-up pictures, they are arranged in order; when a sitting action is detected, the student's close-up is released, and when multiple close-ups exist, the unreleased close-ups are arranged in order.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the light-weight and efficient classroom student standing and sitting detection method as claimed in any one of claims 1-7 by invoking the computer program.
10. A readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the light-weight and efficient classroom student standing and sitting detection method as claimed in any one of claims 1 to 7.
CN202310990752.6A 2023-08-08 2023-08-08 Light-weight efficient classroom student standing and sitting detection method and system Pending CN116958204A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310990752.6A CN116958204A (en) 2023-08-08 2023-08-08 Light-weight efficient classroom student standing and sitting detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310990752.6A CN116958204A (en) 2023-08-08 2023-08-08 Light-weight efficient classroom student standing and sitting detection method and system

Publications (1)

Publication Number Publication Date
CN116958204A true CN116958204A (en) 2023-10-27

Family

ID=88452920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310990752.6A Pending CN116958204A (en) 2023-08-08 2023-08-08 Light-weight efficient classroom student standing and sitting detection method and system

Country Status (1)

Country Link
CN (1) CN116958204A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination