CN111563452B - Multi-human-body gesture detection and state discrimination method based on instance segmentation - Google Patents

Multi-human-body gesture detection and state discrimination method based on instance segmentation

Info

Publication number
CN111563452B
CN111563452B
Authority
CN
China
Prior art keywords
student
state
class
key point
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010371935.6A
Other languages
Chinese (zh)
Other versions
CN111563452A (en)
Inventor
谢非
章悦
刘益剑
陆飞
汪璠
吴俊
汪壬甲
钱伟行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhenjiang Institute For Innovation And Development Nnu
Nanjing Normal University
Original Assignee
Zhenjiang Institute For Innovation And Development Nnu
Nanjing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhenjiang Institute For Innovation And Development Nnu, Nanjing Normal University filed Critical Zhenjiang Institute For Innovation And Development Nnu
Priority to CN202010371935.6A priority Critical patent/CN111563452B/en
Publication of CN111563452A publication Critical patent/CN111563452A/en
Application granted granted Critical
Publication of CN111563452B publication Critical patent/CN111563452B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a multi-human-body posture detection and state discrimination method based on instance segmentation, which comprises the following steps: collecting the original frame images of a classroom video; segmenting student individuals from non-student regions, marking different student individuals with masks of different colors, simultaneously performing posture detection, extracting the human-body posture key points of each student, and annotating and connecting them; specifically judging each student's class-listening state by recognizing and locating the students' faces and checking whether a frontal face can be detected for each student: if so, the student is preliminarily considered to be listening, and it is further judged whether the student is raising a hand; if a frontal face cannot be detected, it is further judged whether the student is not listening. The class-listening efficiency is then evaluated by combining each student's individual class-listening states. The invention provides a solution for discriminating and analyzing students' class-listening states, and has the advantages of real-time recognition, high recognition accuracy, and strong resistance to interference from complex environments.

Description

Multi-human-body gesture detection and state discrimination method based on instance segmentation
Technical Field
The invention relates to the technical field of machine learning and machine vision, and in particular to a multi-human-body posture detection and state discrimination method based on instance segmentation.
Background
With the advent of the big-data and artificial-intelligence era, the fusion of information technology with school education and teaching has become a focus of research. The smart classroom is an emerging concept that integrates advanced information acquisition and transmission technologies, intelligent sensing technologies, and computer processing technologies into the field of education. In the teaching process, students' in-class states give the most effective feedback on how well students are learning and how well teachers are lecturing. Existing teaching feedback is still mainly analyzed and evaluated manually, which is time-consuming, inefficient, and incomplete. Instance segmentation can segment the pixels of a target object on top of detecting the target, and can distinguish different individuals of the same object class. It has been widely used in fields such as autonomous driving, medical detection, clothing classification, and precision agriculture. With the development of artificial intelligence, instance segmentation can gradually be applied to the smart classroom as well.
Few methods have so far been proposed for identifying and analyzing students' class-listening states, and they mainly rely on face recognition alone, human-posture detection, brain-wave monitoring, and similar techniques. Such methods have unavoidable drawbacks, including low accuracy, poor real-time performance, high cost, and poor user experience. The invention provides a solution for discriminating and analyzing students' class-listening states. It achieves real-time recognition with high accuracy, can complete human-posture detection and classroom-state discrimination of student individuals while segmenting the students from the classroom background, and can output labels for the different class-listening states of student individuals and distinguish students in different class-listening states by masks of different colors. The invention also provides a calculation method for analyzing multi-person class-listening efficiency, which yields each student individual's class-listening efficiency after the detection of one class period is finished, and has the characteristics of high recognition efficiency, good recognition accuracy, and strong resistance to interference from complex environments.
Disclosure of Invention
The invention aims to provide a multi-human-body posture detection and state discrimination method based on instance segmentation, which has strong real-time performance, a high recognition rate, and strong resistance to interference from the background environment.
In order to achieve the above purpose, the invention adopts the following technical scheme: the multi-human-body posture detection and state discrimination method based on instance segmentation comprises the following steps:
step 1: collecting video of students attending class from a frontal angle, and splitting the collected video into frames by extracting one frame every 5 seconds, to obtain all frame images of the classroom video;
step 2: segmenting all student individuals from non-student regions in the original frame images of the classroom video using an instance segmentation model, marking different student individuals with masks of different colors, simultaneously performing posture detection, extracting the 18 key points of each student's human-body posture, and labeling and connecting them, thereby obtaining classroom images annotated with masks of different colors and connected human-body key points;
step 3: detecting the position of each student's frontal face using the dlib model (a minimal detection sketch is given after step 5 below);
step 4: specifically judging each student's class-listening state: if a frontal face can be detected, judging whether the student is in a general listening state or a hand-raising state from the coordinate relations of the extracted human-body key points; if the student's frontal face cannot be detected, judging whether the student is in a head-down state or has turned sideways to whisper from the angles formed by the extracted human-body key points;
step 5: processing all original frame images of the classroom video according to steps 1-4 to obtain annotated frame images labeling each student's posture, outputting the students' classroom states, and performing weighted scoring over the different classroom states to obtain each student's class-listening efficiency percentage over the whole class period.
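As an illustration of the frontal-face check in step 3, a minimal sketch using dlib's frontal face detector is given below; cropping the image to each student's bounding box and the variable names are assumptions for illustration, not details stated in the patent.

```python
# Minimal sketch of step 3: check whether a frontal face is visible for one student.
import dlib

detector = dlib.get_frontal_face_detector()   # HOG-based frontal face detector

def has_frontal_face(gray_image, student_box):
    """gray_image: grayscale numpy array (H, W); student_box: (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = student_box
    crop = gray_image[y0:y1, x0:x1]            # restrict detection to this student
    faces = detector(crop, 1)                  # upsample once to catch small faces
    return len(faces) > 0
```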
Further, the step 1 includes:
step 1.1: recording front-view video of all students over the whole class period, and saving the video to a computer;
step 1.2: performing the framing operation on the saved front-view video of the class period, extracting one to-be-processed frame image every 5 seconds, and outputting and saving the images;
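The framing of step 1.2 can be sketched with OpenCV as follows; the 5-second interval comes from the text above, while the file paths, output format, and frame-rate handling are assumptions made only for illustration.

```python
# Sketch of step 1.2: keep one frame every 5 seconds from the recorded classroom video.
import os
import cv2

def extract_frames(video_path="classroom.mp4", out_dir="frames", interval_s=5):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0        # fall back if FPS metadata is missing
    step = max(1, int(round(fps * interval_s)))    # frames between two kept images
    idx = kept = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{kept:04d}.jpg"), frame)
            kept += 1
        idx += 1
    cap.release()
    return kept
```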
further, the step 2 includes:
step 2.1: inputting all original framing images of the classroom video obtained in the step 1 into a backbone neural network of an example segmentation model for processing so as to obtain a feature map in an input picture, wherein the extracted feature map is used as input for subsequent processing;
step 2.2: inputting the feature map obtained in the step 2.1 into an area generation network (RPN) layer in an example segmentation model, and scanning an image by using a sliding window to find an area with a target, thereby obtaining an area of interest (RoI);
step 2.3: detecting each generated region of interest, and when the type of the region of interest including a person is detected, performing single-heat encoding on the position of each key point on the human body, and generating a mask corresponding to each key point of the human body;
step 2.4: performing alignment operation on an output result RoI of the RPN layer, and extracting a feature corresponding to each RoI on a feature map;
step 2.5: and (3) respectively sending the RoI processed in the step (2.3) into two branches of a Fast-region-based convolutional network Fast R-CNN and a full convolutional neural network FCN in an example segmentation model, wherein the Fast R-CNN carries out gesture classification and bounding box regression on the RoI, and the full convolutional neural network FCN generates a mask for each RoI.
Step 2.6: and extracting coordinates of the gesture key points of the student individuals, and storing the extracted coordinate key point information in a CSV file form.
Further, the step 2.1 includes:
The backbone neural network comprises the residual network ResNet101 and the feature pyramid network FPN.
The residual network ResNet101 has 101 layers in total: an input 7×7×64 convolution, followed by 33 residual blocks (building blocks) of 3 layers each, and finally a fully connected layer FC for classification. Each residual block is expressed as:
x_{n+1} = h(x_n) + F(x_n, W_n)
where x_{n+1} is the output of the residual block, x_n is its input, W_n denotes the convolution operation, F(x_n, W_n) represents the residual part, h(x_n) = W'_n · x_n represents the direct (shortcut) mapping, and W'_n is a 1×1 convolution operation.
The residual network ResNet101 is divided into 5 stages, and the feature pyramid network FPN correspondingly produces 5 feature-map outputs at different scales.
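A minimal PyTorch-style sketch of one 3-layer bottleneck residual block matching the relation x_{n+1} = h(x_n) + F(x_n, W_n) is given below; the channel sizes and batch-normalization layers follow the usual ResNet convention and are assumptions beyond what the text above states.

```python
# Sketch of a 3-layer bottleneck residual block: output = h(x) + F(x, W),
# where F is the 1x1-3x3-1x1 residual branch and h is a 1x1 shortcut projection.
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.residual = nn.Sequential(                     # F(x, W)
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential(                     # h(x) = W' x, a 1x1 convolution
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.residual(x) + self.shortcut(x))
```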
Further, the step 2.2 includes:
step 2.2.1: the region proposal network (RPN) layer generates, at each sliding-window position, 9 target boxes with preset aspect ratios and areas, called anchor boxes; the 9 initial anchor boxes cover three areas (128×128, 256×256, 512×512), and each area in turn covers three aspect ratios (1:1, 1:2, 2:1) (a small anchor-generation sketch is given after step 2.2.2 below);
step 2.2.2: after the generated initial anchor boxes are cropped and filtered, the region proposal network (RPN) layer judges via a Softmax function whether each anchor belongs to the foreground (a student individual) or the background (the classroom background), and additionally performs a first coordinate correction on the anchor boxes belonging to the foreground.
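The anchor grid of step 2.2.1 (three areas times three aspect ratios, giving 9 boxes per position) can be written in a few lines; the (x0, y0, x1, y1) box representation centered on the window position is an implementation choice, not something stated in the text.

```python
# Sketch of step 2.2.1: the 9 anchor boxes (3 areas x 3 aspect ratios) at one position.
import math

AREAS = [128 * 128, 256 * 256, 512 * 512]
RATIOS = [(1, 1), (1, 2), (2, 1)]                  # height : width

def anchors_at(cx, cy):
    """Return 9 anchor boxes (x0, y0, x1, y1) centered at (cx, cy)."""
    boxes = []
    for area in AREAS:
        for rh, rw in RATIOS:
            h = math.sqrt(area * rh / rw)          # solves h/w = rh/rw with h*w = area
            w = area / h
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```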
Further, the step 2.2.2 includes:
The Softmax function is used for multi-class classification: it maps the outputs of multiple neurons into the interval (0, 1) and normalizes them so that their sum is 1, i.e., the probabilities over all classes sum to exactly 1.
The Softmax function is defined as follows:
S_i = e^(V_i) / Σ_{j=1..C} e^(V_j)
where V_i is the output of the i-th unit of the classifier's preceding stage, i denotes the class index, and C denotes the total number of classes. S_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements. The Softmax function thus converts multi-class output values into relative probabilities.
The loss function of Softmax usually takes the cross-entropy form:
Loss = −Σ_i t_i · ln(y_i)
where t_i represents the ground-truth value and y_i represents the value computed by the Softmax function.
For an input sample, only one neuron corresponds to the sample's correct class; according to the formula above, the higher the probability output by that neuron, the smaller the resulting loss, and the lower that probability, the larger the loss. The trained Softmax classifier can then be used to classify feature maps.
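The Softmax and cross-entropy expressions above can be checked numerically with a few lines of NumPy; this is only an illustration of the formulas, not the patent's training code, and the example scores are arbitrary.

```python
# Numerical illustration of the Softmax and cross-entropy formulas above.
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))            # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(t, y):
    return -np.sum(t * np.log(y))        # t: one-hot ground truth, y: Softmax output

scores = np.array([2.0, 0.5, -1.0])      # V_i, e.g. foreground/background scores
probs = softmax(scores)                  # sums to 1
target = np.array([1.0, 0.0, 0.0])       # the correct class is index 0
print(probs, cross_entropy(target, probs))
```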
Further, the step 2.3 includes:
One-hot encoding is an encoding in which exactly one bit is active. During human-posture detection, the human body is treated as the target instance for classification and detection; each body-part key point corresponds to a one-hot code, 18 key points are annotated on each human body, and the key-point annotation scheme follows the human-body key-point annotation of the COCO dataset.
Further, the step 2.4 includes:
step 2.4.1: using the existing VGG16 network with an overall convolution stride of 32, the region of interest is mapped onto the feature map at 1/32 of its original size; if the mapped size on the feature map is a floating-point number, no rounding is performed and the floating-point value is kept;
step 2.4.2: the pooled feature map is fixed at 7×7; the mapped candidate region is n×n, where n denotes its side length, and the n×n candidate region is divided into 49 equally sized small regions, each of size (n/7)×(n/7);
step 2.4.3: the number of sampling points is set to 4, i.e., each (n/7)×(n/7) small region is divided into four equal parts, the pixel at the center of each part is taken, and bilinear interpolation is used to compute the four pixel values;
step 2.4.4: the pixel value of each small region is set to the maximum of the four pixel values computed by bilinear interpolation; proceeding likewise for all regions, the 49 values obtained from the 49 small regions form a feature map of size 7×7.
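Steps 2.4.2 to 2.4.4 can be sketched as follows: each of the 49 bins of an n×n candidate region contributes one bilinearly interpolated sample at the center of each of its four quarters, and the bin keeps the maximum of those four samples. The choice of floor/ceil neighbors inside the bilinear interpolation is an implementation detail assumed here.

```python
# Sketch of steps 2.4.2-2.4.4: pool an n x n candidate region into a 7 x 7 feature map.
import numpy as np

def bilinear(fmap, y, x):
    """Bilinearly interpolate fmap (H, W) at the floating-point location (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, fmap.shape[0] - 1), min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0, x0] * (1 - dy) * (1 - dx) + fmap[y0, x1] * (1 - dy) * dx +
            fmap[y1, x0] * dy * (1 - dx) + fmap[y1, x1] * dy * dx)

def align_7x7(fmap, top, left, n):
    """Pool the n x n region with top-left corner (top, left) into a 7 x 7 output."""
    out = np.zeros((7, 7))
    bin_size = n / 7.0                               # step 2.4.2: 49 equal bins
    for i in range(7):
        for j in range(7):
            by, bx = top + i * bin_size, left + j * bin_size
            samples = [bilinear(fmap, by + sy * bin_size, bx + sx * bin_size)
                       for sy in (0.25, 0.75) for sx in (0.25, 0.75)]   # step 2.4.3
            out[i, j] = max(samples)                 # step 2.4.4: keep the maximum
    return out
```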
Further, the step 4 includes:
step 4.1: if a frontal face can be detected, the class-listening state is further judged from the extracted human-body posture key-point information; whether the student individual is raising a hand is judged from the magnitude of Δh, the height difference between the wrist and shoulder key points, which is computed as:
[Formula image not reproduced: Δh is the height difference between the wrist and shoulder key points, normalized using the nose-tip and shoulder coordinates defined below; one form uses the left-side key points and the alternative form uses the right-side key points.]
where y_left_shoulder is the ordinate of the left-shoulder key point, y_right_shoulder the ordinate of the right-shoulder key point, y_left_wrist the ordinate of the left-wrist key point, y_right_wrist the ordinate of the right-wrist key point, y_nose the ordinate of the nose-tip key point, x_nose the abscissa of the nose-tip key point, x_left_shoulder the abscissa of the left-shoulder key point, and x_right_shoulder the abscissa of the right-shoulder key point;
If the height difference Δh is greater than 0.5, the student individual is judged to be in an attentive listening state (hand raised); otherwise, the student is judged to be in a general listening state.
Step 4.2: if the front face cannot be detected, further judging according to the relation between the extracted human body posture key point information and the front frame image and the rear frame image.
According to common experience and experimental statistics, in the head-down state the angle at the nose-tip key point between the vectors pointing to the left-shoulder and right-shoulder key points falls in the range of 170 to 200 degrees, while in the non-head-down state this angle is concentrated in the range of 90 to 120 degrees; 160 degrees is therefore chosen as the boundary between the head-down and non-head-down states. Also according to common experience and experimental statistics, a student lowers the head only briefly when reading or writing; therefore, if the student is head-down in the previous frame and head-up again in the following frame, the frame is recorded as an attentive listening state (writing), otherwise the student is judged to be in a not-listening state.
If the angle between the vectors from the nose tip to the left and right shoulders is less than 160 degrees, the person is judged not to be in a head-down state, and the student's class-listening state is further judged from the horizontal relative distance between the left-shoulder and right-shoulder key points. According to common experience and experimental statistics, this normalized horizontal distance is less than 1.5 when the student has turned sideways. The normalized horizontal relative distance Δx between the left-shoulder and right-shoulder key points is computed as:
[Formula image not reproduced: Δx is the horizontal distance between the left-shoulder and right-shoulder key points, normalized using the nose-tip and neck key-point coordinates defined below.]
where x_left_shoulder is the abscissa of the left-shoulder key point, x_right_shoulder the abscissa of the right-shoulder key point, y_neck the ordinate of the neck key point, x_neck the abscissa of the neck key point, y_nose the ordinate of the nose-tip key point, and x_nose the abscissa of the nose-tip key point;
If the normalized horizontal distance between the left and right shoulders is less than 1.5, the student is judged to be in a not-listening state (turned sideways to whisper to a neighbor); otherwise, the student is judged to be in a general listening state.
Further, the step 5 includes:
Weighted scoring is performed according to the different class-listening states, and each student's class-listening efficiency percentage over the whole class period is calculated.
For student individuals judged in step 4 to be listening: if the student is in the general listening state, a score of 0.6 is recorded for each detected frame;
for student individuals judged in step 4 to be listening: if the student is in the writing state, a score of 0.8 is recorded for each detected frame, and if the student is in the hand-raising state, a score of 1 is recorded for each detected frame;
for student individuals judged in step 4 to be in a not-listening state, such as head-down distraction or sideways whispering, a score of 0 is recorded for each detected frame.
The class-listening efficiency percentage P of each student individual over the whole class period is computed as:
P = (1 × r + 0.8 × l + 0.6 × s) / N × 100%
where r is the total number of frames in which the student is in the hand-raising state, l is the total number of frames in which the student is in the writing state, s is the total number of frames in which the student is in the general listening state, and N is the total number of consecutive frame images of the classroom video.
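The weighted scoring of step 5 then reduces to a few lines of code; the state names used as dictionary keys below are illustrative labels, not identifiers from the patent.

```python
# Sketch of step 5: per-student class-listening efficiency over N processed frames.
# Per-frame scores: hand-raising 1.0, writing 0.8, general listening 0.6, not listening 0.0.
def listening_efficiency(frame_states):
    """frame_states: one state string per processed frame of this student."""
    weights = {'raise_hand': 1.0, 'writing': 0.8, 'listening': 0.6}
    n = len(frame_states)
    score = sum(weights.get(state, 0.0) for state in frame_states)   # not-listening adds 0
    return 100.0 * score / n if n else 0.0

# Example: 100 frames -> (70*0.6 + 10*0.8 + 5*1.0) / 100 * 100 = 55.0
states = ['listening'] * 70 + ['writing'] * 10 + ['raise_hand'] * 5 + ['not_listening'] * 15
print(listening_efficiency(states))
```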
As can be seen from the above technical scheme, the embodiment of the present invention provides a multi-human-body posture detection and multi-person classroom-state discrimination and analysis method, comprising: step 1: collecting video of students attending class from a frontal angle and splitting the collected video into frames, extracting one frame every 5 seconds, to obtain all original frame images of the classroom video; step 2: segmenting all student individuals from non-student regions in the original frame images of the classroom video using an instance segmentation model, marking different student individuals with masks of different colors, simultaneously performing posture detection, extracting the 18 key points of each student's human-body posture, and labeling and connecting them, thereby obtaining classroom images annotated with masks of different colors and connected human-body key points; step 3: detecting the position of each student's frontal face using the dlib model; step 4: specifically judging each student's class-listening state: if a frontal face can be detected, judging whether the student is in a general listening state or a hand-raising state from the extracted human-body posture key-point information; if the student's frontal face cannot be detected, judging whether the student is in a head-down state or has turned sideways to whisper from the extracted human-body posture key-point information; step 5: processing all original frame images of the classroom video according to steps 1-4 to obtain annotated frame images labeling each student's posture, outputting the students' classroom states, and performing weighted scoring over the different classroom states to obtain each student's class-listening efficiency percentage over the whole class period.
Through implementation of the above technical scheme, the invention has the following beneficial effects: (1) a video framing method with an appropriately chosen time interval is provided, which greatly improves detection efficiency; (2) an instance-segmentation-based method for detecting students' human postures and analyzing their class-listening states is provided, which detects student postures accurately, is efficient, and is suited to complex background environments; (3) combined with face detection, a human-posture discrimination algorithm analyzing the relations among posture key points is provided, so that various specific class-listening states of students can be analyzed and distinguished and class-listening efficiency can be judged accurately; (4) the method has low implementation cost, high recognition efficiency, and strong resistance to interference from complex environments.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
Fig. 1 is a flow chart of the multi-human-body posture detection and state discrimination method based on instance segmentation according to the present invention.
Fig. 2 shows the backbone neural network of the instance segmentation model according to the present invention.
Fig. 3 shows the index numbers and positions of the 18 human-body key points according to the present invention.
Fig. 4 is a graph of the results of face detection output for students according to the present invention.
Fig. 5 is a schematic diagram of human body posture detection and key point labeling for students in an embodiment of the invention.
Fig. 6 is a graph of output results of a class status determination for a student in an embodiment of the invention.
Detailed Description
Examples
In this embodiment, taking the extraction of one frame every 5 seconds as an example, an experimental class-listening video of 100 frame images is used to describe the method for detecting student individuals and autonomously identifying their class-listening states over a complete class period.
referring to fig. 1, a method for detecting and judging states of multiple human body gestures based on example segmentation according to an embodiment of the present invention includes the following steps:
step 1: collecting a video of a student in a class at a front angle, extracting one frame every 5 seconds, and carrying out framing treatment on the collected video to obtain all original framing images of the video of the class;
step 2: using an instance segmentation model [see Zheng Jiangong, Bao Guanjun, Zhang Libin, et al. Metal part recognition combining deep learning and support vector machines [J]. Journal of Image and Graphics, 2019, 24(12): 2233-2242], segmenting student individuals from non-student regions in all original frame images of the classroom video, marking different student individuals with masks of different colors, simultaneously performing posture detection, extracting the 18 key points of each student's human-body posture, and labeling and connecting them, thereby obtaining classroom images annotated with masks of different colors and connected human-body key points;
step 3: the position of each student individual's frontal face is detected using the dlib model (see Chen Meiling. Research on key algorithms of face recognition technology in the security-inspection process [D]. Liaoning University of Technology, 2019).
Step 4: the specific judgment is carried out on the class listening state of the students: if the front face can be detected, judging whether the student is in a general class-listening state or a hand-lifting state according to the extracted human body key point coordinate relation; if the face of the student cannot be detected, judging whether the student is in a low-head or side-body joint lug state according to the extracted key point angles of the human body coordinates;
step 5: and (3) processing all original framing images of the classroom video according to the steps 1-4 to obtain all labeling framing images labeling the individual gestures of students, outputting the classroom states of the students, and performing scoring weighted calculation on different classroom states to obtain the class listening efficiency percentage of each student in the whole classroom period.
The invention will be further described with reference to the drawings and the specific examples.
The embodiment of the invention adopts a multi-human-body posture detection and state discrimination and analysis method based on instance segmentation; the structure of the backbone neural network used for instance segmentation is shown in fig. 2.
In an embodiment of the present invention, the step 1 includes:
step 1.1: recording the front videos of all students in the whole classroom period, and storing the recorded videos to a computer;
step 1.2: carrying out framing operation on all the front videos of students in a stored classroom period, setting to extract a frame of to-be-processed image every 5 seconds, and outputting and storing the image;
in an embodiment of the present invention, the step 2 includes:
step 2.1: inputting all original framing images of the classroom video obtained in the step 1 into a backbone neural network of an example segmentation model for processing so as to obtain a feature map in an input picture, wherein the extracted feature map is used as input for subsequent processing;
step 2.2: inputting the feature map obtained in step 2.1 into the region proposal network (RPN) layer [see Wang Wen, Zhou Chenyi, Xu Yibai, Lu Sha, Zhou Menglan. A multi-scale feature-fusion rust-spot detection algorithm for electricity-meter boxes using cascaded RPN [J]. Computer and Modernization, 2020(01): 117-121], and scanning the image with a sliding window to find regions containing targets, thereby obtaining regions of interest (RoIs);
step 2.3: detecting each generated region of interest; when the class of a region of interest is detected to be a person, one-hot encoding the position of each key point on the human body and generating a mask corresponding to each human-body key point;
step 2.4: performing alignment operation on an output result RoI of the RPN layer, and extracting a feature corresponding to each RoI on a feature map;
step 2.5: the RoIs processed in step 2.3 are fed into the two branches of the instance segmentation model, the Fast R-CNN branch [see Cao Shiyu, Liu Yuehu, Li Xinzhao. Vehicle target detection based on Fast R-CNN [J]. Journal of Image and Graphics, 2017, 22(05): 671-677] and the fully convolutional network (FCN) branch [see Weng Jian. Research and algorithm implementation of omnidirectional scene segmentation based on fully convolutional neural networks [D]. Shandong University, 2017]; Fast R-CNN performs posture classification and bounding-box regression on each RoI, and the FCN generates a mask for each RoI;
step 2.6: extracting the coordinates of each student individual's posture key points, and saving the extracted key-point coordinate information as a CSV file.
In an embodiment of the present invention, the step 2.1 includes:
the backbone neural network is composed of ResNet101 [ can be referenced Ji Yongfeng, marzhong Jade ] multi-loss head pose estimation based on depth residual error network [ J/OL ]. Computer engineering: 1-8[2020-03-18] and feature map pyramid network FPN (Feature Pyramid Networks) [ can be referenced Liu Yun, qian Meiyi, li Hui, wang Chuanxu ]. Multi-scale multi-person object detection method research for deep learning [ J/OL ]. Computer engineering and application: 1-10[2020-03-16].
The residual network of the ResNet101 is a total 101-layer network because the residual network is classified by an input convolution of 7×7×64, then by 33 residual blocks (building blocks), and finally by a full connection layer (fully connected layers, abbreviated as FC), and each residual block is 3 layers. Each residual block may be represented as:
x n+1 =h(x n )+F(x n ,W n )
wherein x is n+1 For the output of each residual block, x n For the input of the residual block, W n Referring to convolution operation, F (x n ,W n ) Representing the residual part, h (x n )=W’ n x n Representing the direct mapped portion, W' n Is a 1 x 1 convolution operation.
The ResNet101 network is divided into 5 stages, and 5 feature images with different scales in the FPN network are correspondingly obtained and output.
In an embodiment of the present invention, the step 2.2 includes:
step 2.2.1: the RPN generates, at each sliding-window position, 9 target boxes of preset aspect ratio and area, also called anchor boxes; the 9 initial anchor boxes cover three areas (128×128, 256×256, 512×512), each of which covers three aspect ratios (1:1, 1:2, 2:1);
step 2.2.2: after the generated initial anchor boxes are cropped and filtered, the RPN judges via a Softmax function [see Jiang Baihua. Research on face recognition based on deep learning [D]. Anhui University, 2019] whether each anchor belongs to the foreground or the background, i.e., a student individual or the classroom background, and additionally performs a first coordinate correction on the anchor boxes belonging to the foreground.
In an embodiment of the present invention, the step 2.3 includes:
the one-hot code is a one-bit efficient code. In human body posture detection, a human body can be used as a target example for classification detection, and key points of each part of the human body correspond to a single thermal code [ can refer to Liang Jie, chen Jiahao, zhang Xueqin, zhou Yue and Lin Gujun ], abnormality detection based on single thermal codes and convolutional neural networks [ J ]. University of Qinghua journal (Nature science edition), 2019, 59 (07): 523-529, labeling 18 key points on each human body, wherein the labeling mode of the key points refers to a COCO data set (a large and rich object detection, segmentation and caption data set, and the processing mode can refer to Zhang Xiangyi). As shown in fig. 3, the reference numerals from 0 to 17 are: nose tip, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right crotch, right knee, right ankle, left crotch, left knee, left ankle, right eye, left eye, right ear, and left ear.
In an embodiment of the present invention, the step 2.4 includes:
step 2.4.1: the existing VGG16 network [see Feng Guohui. Small-scale image classification based on the convolutional neural network VGG model [D]. Lanzhou University, 2018] is used with an overall convolution stride of 32; the region of interest is mapped onto the feature map at 1/32 of its original size, and if the mapped size is a floating-point number, no rounding is performed and the floating-point value is kept;
step 2.4.2: the pooled feature map is fixed at 7×7; the mapped candidate region is assumed to be n×n, where n denotes its side length, and the n×n candidate region is divided into 49 equally sized small regions, each of size (n/7)×(n/7);
step 2.4.3: assuming 4 sampling points, each (n/7)×(n/7) small region is divided into four equal parts, the pixel at the center of each part is taken, and bilinear interpolation [see Xue Yu, Liu Changlu, Hu Jingying. Design of a scaling IP core based on the bilinear interpolation algorithm [J]. Computing Technology and Automation, 2017, 36(01)] is used to compute the four pixel values;
step 2.4.4: the pixel value of each small region is set to the maximum of the four pixel values computed by bilinear interpolation; proceeding likewise for all regions, the 49 values obtained from the 49 small regions form a feature map of size 7×7.
In an embodiment of the present invention, the step 4 includes:
step 4.1: the face-detection output for the student individuals produced by step 3 is shown in fig. 4. If a frontal face can be detected, the class-listening state is further judged from the extracted human-body posture key-point information. The human postures and key-point annotations detected by step 2 are shown in fig. 5. Whether the student individual is raising a hand is judged from the magnitude of Δh, the height difference between the wrist and shoulder key points, which is computed as:
[Formula image not reproduced: Δh is the height difference between the wrist and shoulder key points, normalized using the nose-tip and shoulder coordinates defined below; one form uses the left-side key points and the alternative form uses the right-side key points.]
where y_left_shoulder is the ordinate of the left-shoulder key point, y_right_shoulder the ordinate of the right-shoulder key point, y_left_wrist the ordinate of the left-wrist key point, y_right_wrist the ordinate of the right-wrist key point, y_nose the ordinate of the nose-tip key point, x_nose the abscissa of the nose-tip key point, x_left_shoulder the abscissa of the left-shoulder key point, and x_right_shoulder the abscissa of the right-shoulder key point;
If the height difference Δh between the wrist and shoulder key points is greater than 0.5, the student individual is judged to be in an attentive listening state (hand raised); otherwise, the student is judged to be in a general listening state.
Step 4.2: if the front face cannot be detected, further judging according to the relation between the extracted human body posture key point information and the front frame image and the rear frame image.
According to common experience and experimental statistics, in the head-down state the angle at the nose-tip key point between the vectors pointing to the left-shoulder and right-shoulder key points falls in the range of 170 to 200 degrees, while in the non-head-down state this angle is concentrated in the range of 90 to 120 degrees; 160 degrees is therefore chosen as the boundary between the head-down and non-head-down states.
Also according to common experience and experimental statistics, a student lowers the head only briefly when reading or writing; therefore, if the student is head-down in the previous frame and head-up again in the following frame, the frame is recorded as an attentive listening state (writing), otherwise the student is judged to be in a not-listening state.
If the angle between the vectors from the nose tip to the left and right shoulders is less than 160 degrees, the person is judged not to be in a head-down state, and the student's class-listening state is further judged from the horizontal relative distance between the left-shoulder and right-shoulder key points. According to common experience and experimental statistics, this normalized horizontal distance is less than 1.5 when the student has turned sideways. The normalized horizontal relative distance Δx between the left-shoulder and right-shoulder key points is computed as:
[Formula image not reproduced: Δx is the horizontal distance between the left-shoulder and right-shoulder key points, normalized using the nose-tip and neck key-point coordinates defined below.]
where x_left_shoulder is the abscissa of the left-shoulder key point, x_right_shoulder the abscissa of the right-shoulder key point, y_neck the ordinate of the neck key point, x_neck the abscissa of the neck key point, y_nose the ordinate of the nose-tip key point, and x_nose the abscissa of the nose-tip key point;
If the normalized horizontal distance between the left and right shoulders is less than 1.5, the student is judged to be in a not-listening state (turned sideways to whisper to a neighbor); otherwise, the student is judged to be in a general listening state.
As shown in fig. 6, three classes of student states are detected in the image: student individuals turned sideways to whisper to a neighbor, students in the general listening state, and students with their heads down and distracted (the last labeled "absent-minded") each receive their own label; the hand-raising state and the head-down writing state have their own labels as well (the latter "writing"), but are not shown because they are not detected in fig. 6. The different listening states are also distinguished by masks of different colors.
In an embodiment of the present invention, the step 5 includes:
Weighted scoring is performed according to the different class-listening states, and each student's class-listening efficiency percentage over the whole class period is calculated.
For student individuals judged in step 4 to be listening: if the student is in the general listening state, a score of 0.6 is recorded for each detected frame;
for student individuals judged in step 4 to be listening: if the student is in the writing state, a score of 0.8 is recorded for each detected frame, and if the student is in the hand-raising state, a score of 1 is recorded for each detected frame;
for student individuals judged in step 4 to be in a not-listening state, such as head-down distraction or sideways whispering, a score of 0 is recorded for each detected frame.
the calculation formula of the class listening efficiency percentage P of the whole class period of each student individual is as follows:
P = (1 × r + 0.8 × l + 0.6 × s) / N × 100%
wherein r is the total frame number of the student in the hand lifting state, l is the total frame number of the student in the writing state, s is the total frame number of the student in the general class listening state, and N is the total frame number of the continuous frame images of the classroom video.
The invention provides a multi-human-body posture detection and state discrimination method based on instance segmentation. There are many ways and means to implement this technical scheme, and the above description is only a preferred embodiment of the invention. It should be noted that a person skilled in the art can make improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.

Claims (6)

1. A multi-human-body posture detection and state discrimination method based on instance segmentation, characterized by comprising the following steps:
step 1: collecting a video of a student in a class at a front angle, extracting one frame every 5 seconds, and carrying out framing treatment on the collected video to obtain all original framing images of the video of the class;
step 2: segmenting all student individuals from non-student regions in the original frame images of the classroom video using an instance segmentation model, marking different student individuals with masks of different colors, simultaneously performing posture detection, extracting the 18 key points of each student's human-body posture, and labeling and connecting them, thereby obtaining classroom images annotated with masks of different colors and connected human-body key points;
step 3: detecting the position of the face of each student by using a dlib model;
step 4: specifically judging each student's class-listening state: if a frontal face can be detected, judging whether the student is in a general listening state or a hand-raising state from the extracted human-body posture key-point information; if the student's frontal face cannot be detected, judging whether the student is in a head-down state or has turned sideways to whisper from the extracted human-body posture key-point information;
step 5: processing all original framing images of the classroom video according to the steps 1-4 to obtain all labeling framing images labeling the individual gestures of students, outputting the classroom states of the students, and performing scoring weighted calculation on different classroom states to obtain the class listening efficiency percentage of each student in the whole classroom period;
the step 1 comprises the following steps:
step 1.1: recording and storing videos of the front faces of all students in the whole classroom period;
step 1.2: carrying out framing operation on all the front videos of students in a stored classroom period, setting to extract a frame of to-be-processed image every 5 seconds, and outputting and storing the image;
the step 2 comprises the following steps:
step 2.1: inputting all original framing images of the classroom video obtained in the step 1 into a backbone neural network of an example segmentation model for processing so as to obtain a feature map in an input picture, wherein the extracted feature map is used as input for subsequent processing;
step 2.2: inputting the feature map obtained in step 2.1 into the region proposal network (RPN) layer of the instance segmentation model, and scanning the image with a sliding window to find regions containing targets, thereby obtaining regions of interest (RoIs);
step 2.3: detecting each generated region of interest; when the class of a region of interest is detected to be a person, one-hot encoding the position of each key point on the human body and generating a mask corresponding to each human-body key point;
step 2.4: performing an alignment operation on the RoIs output by the region proposal network (RPN) layer, and extracting the feature corresponding to each RoI from the feature map;
step 2.5: respectively sending the RoI processed in the step 2.3 into two branches of a Fast region-based convolutional network Fast R-CNN and a full convolutional neural network FCN in an example segmentation model, wherein the Fast R-CNN carries out gesture classification and bounding box regression on the RoI, and the full convolutional neural network FCN generates a mask for each RoI;
step 2.6: extracting coordinates of attitude key points of student individuals, and storing the extracted coordinate key point information in a CSV file form;
the step 2.1 comprises the following steps:
the backbone neural network comprises a residual network ResNet101 and a feature map pyramid network FPN;
the residual network res net101 is a total of 101 layers of network because each residual block is 3 layers, and each residual block is represented as:
x_{n+1} = h(x_n) + F(x_n, W_n)
where x_{n+1} is the output of the residual block, x_n is its input, W_n denotes the convolution operation, F(x_n, W_n) represents the residual part, h(x_n) = W'_n · x_n represents the direct (shortcut) mapping, and W'_n is a 1×1 convolution operation;
and dividing the residual network ResNet101 into 5 stages, and correspondingly obtaining 5 feature map outputs with different scales in the feature map pyramid network FPN network.
2. The method according to claim 1, wherein step 2.2 comprises:
step 2.2.1: the region proposal network (RPN) layer generates, at each sliding-window position, 9 target boxes with preset aspect ratios and areas, called anchor boxes, wherein the 9 initial anchor boxes cover three areas (128×128, 256×256, 512×512) and each area covers three aspect ratios (1:1, 1:2, 2:1);
step 2.2.2: after the generated initial anchor boxes are cropped and filtered, the region proposal network (RPN) layer judges via a Softmax function whether each anchor belongs to the foreground or the background, i.e., a student individual or the classroom background, and performs a first coordinate correction on the anchor boxes belonging to the foreground.
3. The method according to claim 2, wherein step 2.3 comprises:
one-hot encoding is an encoding in which exactly one bit is active; during human-posture detection, the human body is treated as the target instance for classification and detection, each body-part key point corresponds to a one-hot code, 18 key points are annotated on each human body, and the key-point annotation scheme follows the human-body key-point annotation of the COCO dataset.
4. A method according to claim 3, wherein step 2.4 comprises:
step 2.4.1: using the existing VGG16 network, selecting a convolution step length of 32, mapping the region of interest passing through the VGG16 network layer to the original 1/32 of the size of the feature map, and if the size mapped to the feature map at the moment is a floating point number, not performing rounding operation and reserving the floating point number;
step 2.4.2: setting the pooled feature map to be fixed at 7×7, and assuming the mapped candidate region is n×n, where n represents its side length; dividing the n×n candidate region into 49 equally sized small regions, each of size (n/7)×(n/7);
step 2.4.3: setting the sampling point number to be 4, namely dividing each small area with the size of (n/7) into four parts in a bisection way, taking the pixel at the center point position of each part, and calculating by adopting a bilinear interpolation method to obtain the pixel values of four points;
step 2.4.4: setting the pixel value of each small region to the maximum of the four pixel values computed by bilinear interpolation, and proceeding likewise for all regions, so that the 49 values obtained from the 49 small regions form a feature map of size 7×7.
5. The method of claim 4, wherein step 4 comprises:
step 4.1: if a frontal face can be detected, further judging the class-listening state from the extracted human-body posture key-point information; whether the student individual is raising a hand is judged from the height difference Δh between the wrist and shoulder key points, which is computed as:
[Formula image not reproduced: Δh is the height difference between the wrist and shoulder key points, normalized using the nose-tip and shoulder coordinates defined below; one form uses the left-side key points and the alternative form uses the right-side key points.]
where y_left_shoulder is the ordinate of the left-shoulder key point, y_right_shoulder the ordinate of the right-shoulder key point, y_left_wrist the ordinate of the left-wrist key point, y_right_wrist the ordinate of the right-wrist key point, y_nose the ordinate of the nose-tip key point, x_nose the abscissa of the nose-tip key point, x_left_shoulder the abscissa of the left-shoulder key point, and x_right_shoulder the abscissa of the right-shoulder key point;
if the height difference Δh between the wrist and shoulder key points is greater than 0.5, the student individual is judged to be in an attentive listening state; otherwise, the student is judged to be in a general listening state;
step 4.2: if a frontal face cannot be detected, further judging from the extracted human-body posture key-point information together with the preceding and following frame images, wherein in the head-down state the angle at the nose-tip key point between the vectors pointing to the left-shoulder and right-shoulder key points falls in the interval of 170 to 200 degrees, in the non-head-down state this angle is concentrated in the interval of 90 to 120 degrees, and 160 degrees is chosen as the boundary between the head-down and non-head-down states;
when the student is in a head-down state, if the student was head-down in the previous frame and is head-up in the following frame, the frame is recorded as an attentive listening state; otherwise, the student is judged to be in a not-listening state;
if the angle between the vectors from the nose tip to the left and right shoulders is less than 160 degrees, judging that the person is not in a head-down state, and continuing to judge the student individual's class-listening state from the horizontal relative distance between the left-shoulder and right-shoulder key points, wherein in the sideways state the normalized horizontal relative distance between the left-shoulder and right-shoulder key points is less than 1.5, the normalized horizontal relative distance Δx being computed as:
[Formula image not reproduced: Δx is the horizontal distance between the left-shoulder and right-shoulder key points, normalized using the nose-tip and neck key-point coordinates defined below.]
where x_left_shoulder is the abscissa of the left-shoulder key point, x_right_shoulder the abscissa of the right-shoulder key point, y_neck the ordinate of the neck key point, x_neck the abscissa of the neck key point, y_nose the ordinate of the nose-tip key point, and x_nose the abscissa of the nose-tip key point;
if the normalized horizontal relative distance between the left-shoulder and right-shoulder key points is less than 1.5, the student is judged to be in a not-listening state of sideways whispering; otherwise, the student individual is judged to be in a general listening state.
6. The method of claim 5, wherein step 5 comprises:
performing weighted scoring according to the different class-listening states, and calculating each student's class-listening efficiency percentage over the whole class period;
for student individuals judged in step 4 to be listening: if the student is in the general listening state, a score of 0.6 is recorded for each detected frame;
for student individuals judged in step 4 to be listening: if the student is in the writing state, a score of 0.8 is recorded for each detected frame, and if the student is in the hand-raising state, a score of 1 is recorded for each detected frame;
for student individuals judged in step 4 to be in a not-listening state, a score of 0 is recorded for each detected frame;
the calculation formula of the class listening efficiency percentage P of the whole class period of each student individual is as follows:
P = (1 × r + 0.8 × l + 0.6 × s) / N × 100%
wherein r is the total frame number of the student in the hand lifting state, l is the total frame number of the student in the writing state, s is the total frame number of the student in the general class listening state, and N is the total frame number of the continuous frame images of the classroom video.
CN202010371935.6A 2020-05-06 2020-05-06 Multi-human-body gesture detection and state discrimination method based on instance segmentation Active CN111563452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010371935.6A CN111563452B (en) 2020-05-06 2020-05-06 Multi-human-body gesture detection and state discrimination method based on instance segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010371935.6A CN111563452B (en) 2020-05-06 2020-05-06 Multi-human-body gesture detection and state discrimination method based on instance segmentation

Publications (2)

Publication Number Publication Date
CN111563452A CN111563452A (en) 2020-08-21
CN111563452B true CN111563452B (en) 2023-04-21

Family

ID=72074457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010371935.6A Active CN111563452B (en) 2020-05-06 2020-05-06 Multi-human-body gesture detection and state discrimination method based on instance segmentation

Country Status (1)

Country Link
CN (1) CN111563452B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750125B (en) * 2021-01-28 2022-04-15 华南理工大学 Glass insulator piece positioning method based on end-to-end key point detection
CN113111747A (en) * 2021-03-31 2021-07-13 新疆爱华盈通信息技术有限公司 Abnormal limb behavior detection method, device, terminal and medium
CN114140282B (en) * 2021-11-19 2023-03-24 武汉东信同邦信息技术有限公司 Method and device for quickly reviewing answers of general teaching classroom based on deep learning
CN114708657A (en) * 2022-03-30 2022-07-05 深圳可视科技有限公司 Student attention detection method and system based on multimedia teaching
CN115311606B (en) * 2022-10-08 2022-12-27 成都华栖云科技有限公司 Classroom recorded video validity detection method
CN116739859A (en) * 2023-08-15 2023-09-12 创而新(北京)教育科技有限公司 Method and system for on-line teaching question-answering interaction
CN116778481B (en) * 2023-08-17 2023-10-31 武汉互创联合科技有限公司 Method and system for identifying blastomere image based on key point detection
CN116968758A (en) * 2023-09-19 2023-10-31 江西五十铃汽车有限公司 Vehicle control method and device based on three-dimensional scene representation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109035089A (en) * 2018-07-25 2018-12-18 重庆科技学院 A kind of Online class atmosphere assessment system and method
CN109284737A (en) * 2018-10-22 2019-01-29 广东精标科技股份有限公司 A kind of students ' behavior analysis and identifying system for wisdom classroom
CN109409371A (en) * 2017-08-18 2019-03-01 三星电子株式会社 The system and method for semantic segmentation for image
CN109740446A (en) * 2018-12-14 2019-05-10 深圳壹账通智能科技有限公司 Classroom students ' behavior analysis method and device
CN110287792A (en) * 2019-05-23 2019-09-27 华中师范大学 A kind of classroom Middle school students ' learning state real-time analysis method in nature teaching environment
CN110598554A (en) * 2019-08-09 2019-12-20 中国地质大学(武汉) Multi-person posture estimation method based on counterstudy
CN111079554A (en) * 2019-11-25 2020-04-28 恒安嘉新(北京)科技股份公司 Method, device, electronic equipment and storage medium for analyzing classroom performance of students

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409371A (en) * 2017-08-18 2019-03-01 三星电子株式会社 The system and method for semantic segmentation for image
CN109035089A (en) * 2018-07-25 2018-12-18 重庆科技学院 A kind of Online class atmosphere assessment system and method
CN109284737A (en) * 2018-10-22 2019-01-29 广东精标科技股份有限公司 A kind of students ' behavior analysis and identifying system for wisdom classroom
CN109740446A (en) * 2018-12-14 2019-05-10 深圳壹账通智能科技有限公司 Classroom students ' behavior analysis method and device
CN110287792A (en) * 2019-05-23 2019-09-27 华中师范大学 A kind of classroom Middle school students ' learning state real-time analysis method in nature teaching environment
CN110598554A (en) * 2019-08-09 2019-12-20 中国地质大学(武汉) Multi-person posture estimation method based on counterstudy
CN111079554A (en) * 2019-11-25 2020-04-28 恒安嘉新(北京)科技股份公司 Method, device, electronic equipment and storage medium for analyzing classroom performance of students

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tan Bin et al. Research on student classroom behavior detection algorithms based on Faster R-CNN. Modern Computer (Professional Edition). 2018, full text. *
Deng Yinong et al. A survey of human pose estimation methods based on deep learning. Computer Engineering and Applications. 2019, full text. *

Also Published As

Publication number Publication date
CN111563452A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN111563452B (en) Multi-human-body gesture detection and state discrimination method based on instance segmentation
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
CN106960202B (en) Smiling face identification method based on visible light and infrared image fusion
Li et al. Robust visual tracking based on convolutional features with illumination and occlusion handing
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
Lin Face detection in complicated backgrounds and different illumination conditions by using YCbCr color space and neural network
CN104063059B (en) A kind of real-time gesture recognition method based on finger segmentation
CN106960181B (en) RGBD data-based pedestrian attribute identification method
CN111402224B (en) Target identification method for power equipment
CN102902986A (en) Automatic gender identification system and method
CN109086659B (en) Human behavior recognition method and device based on multi-channel feature fusion
CN109920538B (en) Zero sample learning method based on data enhancement
Pandey et al. Hand gesture recognition for sign language recognition: A review
CN110991315A (en) Method for detecting wearing state of safety helmet in real time based on deep learning
CN111046732A (en) Pedestrian re-identification method based on multi-granularity semantic analysis and storage medium
CN107808376A (en) A kind of detection method of raising one's hand based on deep learning
CN105069745A (en) face-changing system based on common image sensor and enhanced augmented reality technology and method
CN112487981A (en) MA-YOLO dynamic gesture rapid recognition method based on two-way segmentation
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN116052222A (en) Cattle face recognition method for naturally collecting cattle face image
CN107578015B (en) First impression recognition and feedback system and method based on deep learning
CN104866826A (en) Static gesture language identification method based on KNN algorithm and pixel ratio gradient features
CN113723277B (en) Learning intention monitoring method and system integrated with multi-mode visual information
CN111178201A (en) Human body sectional type tracking method based on OpenPose posture detection
WO2021248814A1 (en) Robust visual supervision method and apparatus for home learning state of child

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant