CN111563452B - Multi-human-body gesture detection and state discrimination method based on instance segmentation - Google Patents

Multi-human-body gesture detection and state discrimination method based on instance segmentation

Info

Publication number
CN111563452B
CN111563452B
Authority
CN
China
Prior art keywords
student
state
class
key point
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010371935.6A
Other languages
Chinese (zh)
Other versions
CN111563452A (en)
Inventor
谢非
章悦
刘益剑
陆飞
汪璠
吴俊
汪壬甲
钱伟行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhenjiang Institute For Innovation And Development Nnu
Nanjing Normal University
Original Assignee
Zhenjiang Institute For Innovation And Development Nnu
Nanjing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhenjiang Institute For Innovation And Development Nnu, Nanjing Normal University filed Critical Zhenjiang Institute For Innovation And Development Nnu
Priority to CN202010371935.6A priority Critical patent/CN111563452B/en
Publication of CN111563452A publication Critical patent/CN111563452A/en
Application granted granted Critical
Publication of CN111563452B publication Critical patent/CN111563452B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a multi-human-body posture detection and state discrimination method based on instance segmentation, which comprises the following steps: collecting the original frame images of a classroom video; segmenting student individuals from non-student regions, marking different student individuals with masks of different colors, simultaneously performing posture detection, extracting the human-body posture key points of each student, and annotating and connecting them; specifically judging each student's class-listening state by recognizing and locating the students' faces and checking whether a frontal face can be detected for each student: if so, the student is preliminarily considered to be listening, and it is further judged whether the student is raising a hand; if a frontal face cannot be detected, it is further judged whether the student is not listening. The class-listening efficiency is then evaluated by combining each student's individual class-listening states. The invention provides a solution for discriminating and analyzing students' class-listening states, and has the advantages of real-time recognition, high recognition accuracy, and strong resistance to interference from complex environments.

Description

Multi-human-body gesture detection and state discrimination method based on instance segmentation
Technical Field
The invention relates to the technical field of machine learning and machine vision, and in particular to a multi-human-body posture detection and state discrimination method based on instance segmentation.
Background
With the advent of the big-data and artificial-intelligence era, the fusion of information technology with school education and teaching has become a focus of research. The smart classroom is an emerging concept that integrates advanced information acquisition and transmission technologies, intelligent sensing technologies, and computer processing technologies into the field of education. In the teaching process, students' in-class states give the most effective feedback on how well students are learning and how well teachers are lecturing. Existing teaching feedback is still mainly analyzed and evaluated manually, which is time-consuming, inefficient, and incomplete. Instance segmentation can segment the pixels of a target object on top of detecting the target, and can distinguish different individuals of the same object class. It has been widely used in fields such as autonomous driving, medical detection, clothing classification, and precision agriculture. With the development of artificial intelligence, instance segmentation can gradually be applied to the smart classroom as well.
Few methods have so far been proposed for identifying and analyzing students' class-listening states, and they mainly rely on face recognition alone, human-posture detection, brain-wave monitoring, and similar techniques. Such methods have unavoidable drawbacks, including low accuracy, poor real-time performance, high cost, and poor user experience. The invention provides a solution for discriminating and analyzing students' class-listening states. It achieves real-time recognition with high accuracy, can complete human-posture detection and classroom-state discrimination of student individuals while segmenting the students from the classroom background, and can output labels for the different class-listening states of student individuals and distinguish students in different class-listening states by masks of different colors. The invention also provides a calculation method for analyzing multi-person class-listening efficiency, which yields each student individual's class-listening efficiency after the detection of one class period is finished, and has the characteristics of high recognition efficiency, good recognition accuracy, and strong resistance to interference from complex environments.
Disclosure of Invention
The invention aims to provide a multi-human-body posture detection and state discrimination method based on instance segmentation, which has strong real-time performance, a high recognition rate, and strong resistance to interference from the background environment.
In order to achieve the above purpose, the invention adopts the following technical scheme: the multi-human-body posture detection and state discrimination method based on instance segmentation comprises the following steps:
step 1: collecting video of students attending class from a frontal angle, and splitting the collected video into frames by extracting one frame every 5 seconds, to obtain all frame images of the classroom video;
step 2: segmenting all student individuals from non-student regions in the original frame images of the classroom video using an instance segmentation model, marking different student individuals with masks of different colors, simultaneously performing posture detection, extracting the 18 key points of each student's human-body posture, and labeling and connecting them, thereby obtaining classroom images annotated with masks of different colors and connected human-body key points;
step 3: detecting the position of each student's frontal face using the dlib model (a minimal detection sketch is given after step 5 below);
step 4: specifically judging each student's class-listening state: if a frontal face can be detected, judging whether the student is in a general listening state or a hand-raising state from the coordinate relations of the extracted human-body key points; if the student's frontal face cannot be detected, judging whether the student is in a head-down state or has turned sideways to whisper from the angles formed by the extracted human-body key points;
step 5: processing all original frame images of the classroom video according to steps 1-4 to obtain annotated frame images labeling each student's posture, outputting the students' classroom states, and performing weighted scoring over the different classroom states to obtain each student's class-listening efficiency percentage over the whole class period.
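As an illustration of the frontal-face check in step 3, a minimal sketch using dlib's frontal face detector is given below; cropping the image to each student's bounding box and the variable names are assumptions for illustration, not details stated in the patent.

```python
# Minimal sketch of step 3: check whether a frontal face is visible for one student.
import dlib

detector = dlib.get_frontal_face_detector()   # HOG-based frontal face detector

def has_frontal_face(gray_image, student_box):
    """gray_image: grayscale numpy array (H, W); student_box: (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = student_box
    crop = gray_image[y0:y1, x0:x1]            # restrict detection to this student
    faces = detector(crop, 1)                  # upsample once to catch small faces
    return len(faces) > 0
```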
Further, the step 1 includes:
step 1.1: recording front-view video of all students over the whole class period, and saving the video to a computer;
step 1.2: performing the framing operation on the saved front-view video of the class period, extracting one to-be-processed frame image every 5 seconds, and outputting and saving the images;
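The framing of step 1.2 can be sketched with OpenCV as follows; the 5-second interval comes from the text above, while the file paths, output format, and frame-rate handling are assumptions made only for illustration.

```python
# Sketch of step 1.2: keep one frame every 5 seconds from the recorded classroom video.
import os
import cv2

def extract_frames(video_path="classroom.mp4", out_dir="frames", interval_s=5):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0        # fall back if FPS metadata is missing
    step = max(1, int(round(fps * interval_s)))    # frames between two kept images
    idx = kept = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{kept:04d}.jpg"), frame)
            kept += 1
        idx += 1
    cap.release()
    return kept
```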
further, the step 2 includes:
step 2.1: inputting all original framing images of the classroom video obtained in the step 1 into a backbone neural network of an example segmentation model for processing so as to obtain a feature map in an input picture, wherein the extracted feature map is used as input for subsequent processing;
step 2.2: inputting the feature map obtained in the step 2.1 into an area generation network (RPN) layer in an example segmentation model, and scanning an image by using a sliding window to find an area with a target, thereby obtaining an area of interest (RoI);
step 2.3: detecting each generated region of interest, and when the type of the region of interest including a person is detected, performing single-heat encoding on the position of each key point on the human body, and generating a mask corresponding to each key point of the human body;
step 2.4: performing alignment operation on an output result RoI of the RPN layer, and extracting a feature corresponding to each RoI on a feature map;
step 2.5: and (3) respectively sending the RoI processed in the step (2.3) into two branches of a Fast-region-based convolutional network Fast R-CNN and a full convolutional neural network FCN in an example segmentation model, wherein the Fast R-CNN carries out gesture classification and bounding box regression on the RoI, and the full convolutional neural network FCN generates a mask for each RoI.
Step 2.6: and extracting coordinates of the gesture key points of the student individuals, and storing the extracted coordinate key point information in a CSV file form.
Further, the step 2.1 includes:
The backbone neural network comprises the residual network ResNet101 and the feature pyramid network FPN.
The residual network ResNet101 has 101 layers in total: an input 7×7×64 convolution, followed by 33 residual blocks (building blocks) of 3 layers each, and finally a fully connected layer FC for classification. Each residual block is expressed as:
x_{n+1} = h(x_n) + F(x_n, W_n)
where x_{n+1} is the output of the residual block, x_n is its input, W_n denotes the convolution operation, F(x_n, W_n) represents the residual part, h(x_n) = W'_n · x_n represents the direct (shortcut) mapping, and W'_n is a 1×1 convolution operation.
The residual network ResNet101 is divided into 5 stages, and the feature pyramid network FPN correspondingly produces 5 feature-map outputs at different scales.
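A minimal PyTorch-style sketch of one 3-layer bottleneck residual block matching the relation x_{n+1} = h(x_n) + F(x_n, W_n) is given below; the channel sizes and batch-normalization layers follow the usual ResNet convention and are assumptions beyond what the text above states.

```python
# Sketch of a 3-layer bottleneck residual block: output = h(x) + F(x, W),
# where F is the 1x1-3x3-1x1 residual branch and h is a 1x1 shortcut projection.
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.residual = nn.Sequential(                     # F(x, W)
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential(                     # h(x) = W' x, a 1x1 convolution
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.residual(x) + self.shortcut(x))
```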
Further, the step 2.2 includes:
step 2.2.1: the region proposal network (RPN) layer generates, at each sliding-window position, 9 target boxes with preset aspect ratios and areas, called anchor boxes; the 9 initial anchor boxes cover three areas (128×128, 256×256, 512×512), and each area in turn covers three aspect ratios (1:1, 1:2, 2:1) (a small anchor-generation sketch is given after step 2.2.2 below);
step 2.2.2: after the generated initial anchor boxes are cropped and filtered, the region proposal network (RPN) layer judges via a Softmax function whether each anchor belongs to the foreground (a student individual) or the background (the classroom background), and additionally performs a first coordinate correction on the anchor boxes belonging to the foreground.
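The anchor grid of step 2.2.1 (three areas times three aspect ratios, giving 9 boxes per position) can be written in a few lines; the (x0, y0, x1, y1) box representation centered on the window position is an implementation choice, not something stated in the text.

```python
# Sketch of step 2.2.1: the 9 anchor boxes (3 areas x 3 aspect ratios) at one position.
import math

AREAS = [128 * 128, 256 * 256, 512 * 512]
RATIOS = [(1, 1), (1, 2), (2, 1)]                  # height : width

def anchors_at(cx, cy):
    """Return 9 anchor boxes (x0, y0, x1, y1) centered at (cx, cy)."""
    boxes = []
    for area in AREAS:
        for rh, rw in RATIOS:
            h = math.sqrt(area * rh / rw)          # solves h/w = rh/rw with h*w = area
            w = area / h
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```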
Further, the step 2.2.2 includes:
The Softmax function is used for multi-class classification: it maps the outputs of multiple neurons into the interval (0, 1) and normalizes them so that their sum is 1, i.e., the probabilities over all classes sum to exactly 1.
The Softmax function is defined as follows:
S_i = e^(V_i) / Σ_{j=1..C} e^(V_j)
where V_i is the output of the i-th unit of the classifier's preceding stage, i denotes the class index, and C denotes the total number of classes. S_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements. The Softmax function thus converts multi-class output values into relative probabilities.
The loss function of Softmax usually takes the cross-entropy form:
Loss = −Σ_i t_i · ln(y_i)
where t_i represents the ground-truth value and y_i represents the value computed by the Softmax function.
For an input sample, only one neuron corresponds to the sample's correct class; according to the formula above, the higher the probability output by that neuron, the smaller the resulting loss, and the lower that probability, the larger the loss. The trained Softmax classifier can then be used to classify feature maps.
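The Softmax and cross-entropy expressions above can be checked numerically with a few lines of NumPy; this is only an illustration of the formulas, not the patent's training code, and the example scores are arbitrary.

```python
# Numerical illustration of the Softmax and cross-entropy formulas above.
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))            # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(t, y):
    return -np.sum(t * np.log(y))        # t: one-hot ground truth, y: Softmax output

scores = np.array([2.0, 0.5, -1.0])      # V_i, e.g. foreground/background scores
probs = softmax(scores)                  # sums to 1
target = np.array([1.0, 0.0, 0.0])       # the correct class is index 0
print(probs, cross_entropy(target, probs))
```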
Further, the step 2.3 includes:
One-hot encoding is an encoding in which exactly one bit is active. During human-posture detection, the human body is treated as the target instance for classification and detection; each body-part key point corresponds to a one-hot code, 18 key points are annotated on each human body, and the key-point annotation scheme follows the human-body key-point annotation of the COCO dataset.
Further, the step 2.4 includes:
step 2.4.1: using the existing VGG16 network with an overall convolution stride of 32, the region of interest is mapped onto the feature map at 1/32 of its original size; if the mapped size on the feature map is a floating-point number, no rounding is performed and the floating-point value is kept;
step 2.4.2: the pooled feature map is fixed at 7×7; the mapped candidate region is n×n, where n denotes its side length, and the n×n candidate region is divided into 49 equally sized small regions, each of size (n/7)×(n/7);
step 2.4.3: the number of sampling points is set to 4, i.e., each (n/7)×(n/7) small region is divided into four equal parts, the pixel at the center of each part is taken, and bilinear interpolation is used to compute the four pixel values;
step 2.4.4: the pixel value of each small region is set to the maximum of the four pixel values computed by bilinear interpolation; proceeding likewise for all regions, the 49 values obtained from the 49 small regions form a feature map of size 7×7.
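Steps 2.4.2 to 2.4.4 can be sketched as follows: each of the 49 bins of an n×n candidate region contributes one bilinearly interpolated sample at the center of each of its four quarters, and the bin keeps the maximum of those four samples. The choice of floor/ceil neighbors inside the bilinear interpolation is an implementation detail assumed here.

```python
# Sketch of steps 2.4.2-2.4.4: pool an n x n candidate region into a 7 x 7 feature map.
import numpy as np

def bilinear(fmap, y, x):
    """Bilinearly interpolate fmap (H, W) at the floating-point location (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, fmap.shape[0] - 1), min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0, x0] * (1 - dy) * (1 - dx) + fmap[y0, x1] * (1 - dy) * dx +
            fmap[y1, x0] * dy * (1 - dx) + fmap[y1, x1] * dy * dx)

def align_7x7(fmap, top, left, n):
    """Pool the n x n region with top-left corner (top, left) into a 7 x 7 output."""
    out = np.zeros((7, 7))
    bin_size = n / 7.0                               # step 2.4.2: 49 equal bins
    for i in range(7):
        for j in range(7):
            by, bx = top + i * bin_size, left + j * bin_size
            samples = [bilinear(fmap, by + sy * bin_size, bx + sx * bin_size)
                       for sy in (0.25, 0.75) for sx in (0.25, 0.75)]   # step 2.4.3
            out[i, j] = max(samples)                 # step 2.4.4: keep the maximum
    return out
```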
Further, the step 4 includes:
step 4.1: if a frontal face can be detected, the class-listening state is further judged from the extracted human-body posture key-point information; whether the student individual is raising a hand is judged from the magnitude of Δh, the height difference between the wrist and shoulder key points, which is computed as:
[Formula image not reproduced: Δh is the height difference between the wrist and shoulder key points, normalized using the nose-tip and shoulder coordinates defined below; one form uses the left-side key points and the alternative form uses the right-side key points.]
where y_left_shoulder is the ordinate of the left-shoulder key point, y_right_shoulder the ordinate of the right-shoulder key point, y_left_wrist the ordinate of the left-wrist key point, y_right_wrist the ordinate of the right-wrist key point, y_nose the ordinate of the nose-tip key point, x_nose the abscissa of the nose-tip key point, x_left_shoulder the abscissa of the left-shoulder key point, and x_right_shoulder the abscissa of the right-shoulder key point;
If the height difference Δh is greater than 0.5, the student individual is judged to be in an attentive listening state (hand raised); otherwise, the student is judged to be in a general listening state.
Step 4.2: if the front face cannot be detected, further judging according to the relation between the extracted human body posture key point information and the front frame image and the rear frame image.
According to common experience and experimental statistics, in the head-down state the angle at the nose-tip key point between the vectors pointing to the left-shoulder and right-shoulder key points falls in the range of 170 to 200 degrees, while in the non-head-down state this angle is concentrated in the range of 90 to 120 degrees; 160 degrees is therefore chosen as the boundary between the head-down and non-head-down states. Also according to common experience and experimental statistics, a student lowers the head only briefly when reading or writing; therefore, if the student is head-down in the previous frame and head-up again in the following frame, the frame is recorded as an attentive listening state (writing), otherwise the student is judged to be in a not-listening state.
If the angle between the vectors from the nose tip to the left and right shoulders is less than 160 degrees, the person is judged not to be in a head-down state, and the student's class-listening state is further judged from the horizontal relative distance between the left-shoulder and right-shoulder key points. According to common experience and experimental statistics, this normalized horizontal distance is less than 1.5 when the student has turned sideways. The normalized horizontal relative distance Δx between the left-shoulder and right-shoulder key points is computed as:
[Formula image not reproduced: Δx is the horizontal distance between the left-shoulder and right-shoulder key points, normalized using the nose-tip and neck key-point coordinates defined below.]
where x_left_shoulder is the abscissa of the left-shoulder key point, x_right_shoulder the abscissa of the right-shoulder key point, y_neck the ordinate of the neck key point, x_neck the abscissa of the neck key point, y_nose the ordinate of the nose-tip key point, and x_nose the abscissa of the nose-tip key point;
If the normalized horizontal distance between the left and right shoulders is less than 1.5, the student is judged to be in a not-listening state (turned sideways to whisper to a neighbor); otherwise, the student is judged to be in a general listening state.
Further, the step 5 includes:
Weighted scoring is performed according to the different class-listening states, and each student's class-listening efficiency percentage over the whole class period is calculated.
For student individuals judged in step 4 to be listening: if the student is in the general listening state, a score of 0.6 is recorded for each detected frame;
for student individuals judged in step 4 to be listening: if the student is in the writing state, a score of 0.8 is recorded for each detected frame, and if the student is in the hand-raising state, a score of 1 is recorded for each detected frame;
for student individuals judged in step 4 to be in a not-listening state, such as head-down distraction or sideways whispering, a score of 0 is recorded for each detected frame.
The class-listening efficiency percentage P of each student individual over the whole class period is computed as:
P = (1 × r + 0.8 × l + 0.6 × s) / N × 100%
where r is the total number of frames in which the student is in the hand-raising state, l is the total number of frames in which the student is in the writing state, s is the total number of frames in which the student is in the general listening state, and N is the total number of consecutive frame images of the classroom video.
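The weighted scoring of step 5 then reduces to a few lines of code; the state names used as dictionary keys below are illustrative labels, not identifiers from the patent.

```python
# Sketch of step 5: per-student class-listening efficiency over N processed frames.
# Per-frame scores: hand-raising 1.0, writing 0.8, general listening 0.6, not listening 0.0.
def listening_efficiency(frame_states):
    """frame_states: one state string per processed frame of this student."""
    weights = {'raise_hand': 1.0, 'writing': 0.8, 'listening': 0.6}
    n = len(frame_states)
    score = sum(weights.get(state, 0.0) for state in frame_states)   # not-listening adds 0
    return 100.0 * score / n if n else 0.0

# Example: 100 frames -> (70*0.6 + 10*0.8 + 5*1.0) / 100 * 100 = 55.0
states = ['listening'] * 70 + ['writing'] * 10 + ['raise_hand'] * 5 + ['not_listening'] * 15
print(listening_efficiency(states))
```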
As can be seen from the above technical scheme, the embodiment of the present invention provides a multi-human-body posture detection and multi-person classroom-state discrimination and analysis method, comprising: step 1: collecting video of students attending class from a frontal angle and splitting the collected video into frames, extracting one frame every 5 seconds, to obtain all original frame images of the classroom video; step 2: segmenting all student individuals from non-student regions in the original frame images of the classroom video using an instance segmentation model, marking different student individuals with masks of different colors, simultaneously performing posture detection, extracting the 18 key points of each student's human-body posture, and labeling and connecting them, thereby obtaining classroom images annotated with masks of different colors and connected human-body key points; step 3: detecting the position of each student's frontal face using the dlib model; step 4: specifically judging each student's class-listening state: if a frontal face can be detected, judging whether the student is in a general listening state or a hand-raising state from the extracted human-body posture key-point information; if the student's frontal face cannot be detected, judging whether the student is in a head-down state or has turned sideways to whisper from the extracted human-body posture key-point information; step 5: processing all original frame images of the classroom video according to steps 1-4 to obtain annotated frame images labeling each student's posture, outputting the students' classroom states, and performing weighted scoring over the different classroom states to obtain each student's class-listening efficiency percentage over the whole class period.
Through implementation of the above technical scheme, the invention has the following beneficial effects: (1) a video framing method with an appropriately chosen time interval is provided, which greatly improves detection efficiency; (2) an instance-segmentation-based method for detecting students' human postures and analyzing their class-listening states is provided, which detects student postures accurately, is efficient, and is suited to complex background environments; (3) combined with face detection, a human-posture discrimination algorithm analyzing the relations among posture key points is provided, so that various specific class-listening states of students can be analyzed and distinguished and class-listening efficiency can be judged accurately; (4) the method has low implementation cost, high recognition efficiency, and strong resistance to interference from complex environments.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
Fig. 1 is a flow chart of the multi-human-body posture detection and state discrimination method based on instance segmentation according to the present invention.
Fig. 2 shows the backbone neural network of the instance segmentation model according to the present invention.
Fig. 3 shows the index numbers and positions of the 18 human-body key points according to the present invention.
Fig. 4 is a graph of the results of face detection output for students according to the present invention.
Fig. 5 is a schematic diagram of human body posture detection and key point labeling for students in an embodiment of the invention.
Fig. 6 is a graph of output results of a class status determination for a student in an embodiment of the invention.
Detailed Description
Examples
In this embodiment, taking the extraction of one frame every 5 seconds as an example, an experimental class-listening video of 100 frame images is used to describe the method for detecting student individuals and autonomously identifying their class-listening states over a complete class period.
referring to fig. 1, a method for detecting and judging states of multiple human body gestures based on example segmentation according to an embodiment of the present invention includes the following steps:
step 1: collecting a video of a student in a class at a front angle, extracting one frame every 5 seconds, and carrying out framing treatment on the collected video to obtain all original framing images of the video of the class;
step 2: using an instance segmentation model [see Zheng Jiangong, Bao Guanjun, Zhang Libin, et al. Metal part recognition combining deep learning and support vector machines [J]. Journal of Image and Graphics, 2019, 24(12): 2233-2242], segmenting student individuals from non-student regions in all original frame images of the classroom video, marking different student individuals with masks of different colors, simultaneously performing posture detection, extracting the 18 key points of each student's human-body posture, and labeling and connecting them, thereby obtaining classroom images annotated with masks of different colors and connected human-body key points;
step 3: the position of each student individual's frontal face is detected using the dlib model (see Chen Meiling. Research on key algorithms of face recognition technology in the security-inspection process [D]. Liaoning University of Technology, 2019).
Step 4: the specific judgment is carried out on the class listening state of the students: if the front face can be detected, judging whether the student is in a general class-listening state or a hand-lifting state according to the extracted human body key point coordinate relation; if the face of the student cannot be detected, judging whether the student is in a low-head or side-body joint lug state according to the extracted key point angles of the human body coordinates;
step 5: and (3) processing all original framing images of the classroom video according to the steps 1-4 to obtain all labeling framing images labeling the individual gestures of students, outputting the classroom states of the students, and performing scoring weighted calculation on different classroom states to obtain the class listening efficiency percentage of each student in the whole classroom period.
The invention will be further described with reference to the drawings and the specific examples.
The embodiment of the invention adopts a multi-human-body posture detection and state discrimination and analysis method based on instance segmentation; the structure of the backbone neural network used for instance segmentation is shown in fig. 2.
In an embodiment of the present invention, the step 1 includes:
step 1.1: recording the front videos of all students in the whole classroom period, and storing the recorded videos to a computer;
step 1.2: carrying out framing operation on all the front videos of students in a stored classroom period, setting to extract a frame of to-be-processed image every 5 seconds, and outputting and storing the image;
in an embodiment of the present invention, the step 2 includes:
step 2.1: inputting all original framing images of the classroom video obtained in the step 1 into a backbone neural network of an example segmentation model for processing so as to obtain a feature map in an input picture, wherein the extracted feature map is used as input for subsequent processing;
step 2.2: inputting the feature map obtained in step 2.1 into the region proposal network (RPN) layer [see Wang Wen, Zhou Chenyi, Xu Yibai, Lu Sha, Zhou Menglan. A multi-scale feature-fusion rust-spot detection algorithm for electricity-meter boxes using cascaded RPN [J]. Computer and Modernization, 2020(01): 117-121], and scanning the image with a sliding window to find regions containing targets, thereby obtaining regions of interest (RoIs);
step 2.3: detecting each generated region of interest; when the class of a region of interest is detected to be a person, one-hot encoding the position of each key point on the human body and generating a mask corresponding to each human-body key point;
step 2.4: performing alignment operation on an output result RoI of the RPN layer, and extracting a feature corresponding to each RoI on a feature map;
step 2.5: the RoIs processed in step 2.3 are fed into the two branches of the instance segmentation model, the Fast R-CNN branch [see Cao Shiyu, Liu Yuehu, Li Xinzhao. Vehicle target detection based on Fast R-CNN [J]. Journal of Image and Graphics, 2017, 22(05): 671-677] and the fully convolutional network (FCN) branch [see Weng Jian. Research and algorithm implementation of omnidirectional scene segmentation based on fully convolutional neural networks [D]. Shandong University, 2017]; Fast R-CNN performs posture classification and bounding-box regression on each RoI, and the FCN generates a mask for each RoI;
step 2.6: extracting the coordinates of each student individual's posture key points, and saving the extracted key-point coordinate information as a CSV file.
In an embodiment of the present invention, the step 2.1 includes:
the backbone neural network is composed of ResNet101 [ can be referenced Ji Yongfeng, marzhong Jade ] multi-loss head pose estimation based on depth residual error network [ J/OL ]. Computer engineering: 1-8[2020-03-18] and feature map pyramid network FPN (Feature Pyramid Networks) [ can be referenced Liu Yun, qian Meiyi, li Hui, wang Chuanxu ]. Multi-scale multi-person object detection method research for deep learning [ J/OL ]. Computer engineering and application: 1-10[2020-03-16].
The residual network of the ResNet101 is a total 101-layer network because the residual network is classified by an input convolution of 7×7×64, then by 33 residual blocks (building blocks), and finally by a full connection layer (fully connected layers, abbreviated as FC), and each residual block is 3 layers. Each residual block may be represented as:
x n+1 =h(x n )+F(x n ,W n )
wherein x is n+1 For the output of each residual block, x n For the input of the residual block, W n Referring to convolution operation, F (x n ,W n ) Representing the residual part, h (x n )=W’ n x n Representing the direct mapped portion, W' n Is a 1 x 1 convolution operation.
The ResNet101 network is divided into 5 stages, and 5 feature images with different scales in the FPN network are correspondingly obtained and output.
In an embodiment of the present invention, the step 2.2 includes:
step 2.2.1: the RPN generates, at each sliding-window position, 9 target boxes of preset aspect ratio and area, also called anchor boxes; the 9 initial anchor boxes cover three areas (128×128, 256×256, 512×512), each of which covers three aspect ratios (1:1, 1:2, 2:1);
step 2.2.2: after the generated initial anchor boxes are cropped and filtered, the RPN judges via a Softmax function [see Jiang Baihua. Research on face recognition based on deep learning [D]. Anhui University, 2019] whether each anchor belongs to the foreground or the background, i.e., a student individual or the classroom background, and additionally performs a first coordinate correction on the anchor boxes belonging to the foreground.
In an embodiment of the present invention, the step 2.3 includes:
the one-hot code is a one-bit efficient code. In human body posture detection, a human body can be used as a target example for classification detection, and key points of each part of the human body correspond to a single thermal code [ can refer to Liang Jie, chen Jiahao, zhang Xueqin, zhou Yue and Lin Gujun ], abnormality detection based on single thermal codes and convolutional neural networks [ J ]. University of Qinghua journal (Nature science edition), 2019, 59 (07): 523-529, labeling 18 key points on each human body, wherein the labeling mode of the key points refers to a COCO data set (a large and rich object detection, segmentation and caption data set, and the processing mode can refer to Zhang Xiangyi). As shown in fig. 3, the reference numerals from 0 to 17 are: nose tip, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right crotch, right knee, right ankle, left crotch, left knee, left ankle, right eye, left eye, right ear, and left ear.
In an embodiment of the present invention, the step 2.4 includes:
step 2.4.1: the existing VGG16 network [see Feng Guohui. Small-scale image classification based on the convolutional neural network VGG model [D]. Lanzhou University, 2018] is used with an overall convolution stride of 32; the region of interest is mapped onto the feature map at 1/32 of its original size, and if the mapped size is a floating-point number, no rounding is performed and the floating-point value is kept;
step 2.4.2: the pooled feature map is fixed at 7×7; the mapped candidate region is assumed to be n×n, where n denotes its side length, and the n×n candidate region is divided into 49 equally sized small regions, each of size (n/7)×(n/7);
step 2.4.3: assuming 4 sampling points, each (n/7)×(n/7) small region is divided into four equal parts, the pixel at the center of each part is taken, and bilinear interpolation [see Xue Yu, Liu Changlu, Hu Jingying. Design of a scaling IP core based on the bilinear interpolation algorithm [J]. Computing Technology and Automation, 2017, 36(01)] is used to compute the four pixel values;
step 2.4.4: the pixel value of each small region is set to the maximum of the four pixel values computed by bilinear interpolation; proceeding likewise for all regions, the 49 values obtained from the 49 small regions form a feature map of size 7×7.
In an embodiment of the present invention, the step 4 includes:
step 4.1: the face-detection output for the student individuals produced by step 3 is shown in fig. 4. If a frontal face can be detected, the class-listening state is further judged from the extracted human-body posture key-point information. The human postures and key-point annotations detected by step 2 are shown in fig. 5. Whether the student individual is raising a hand is judged from the magnitude of Δh, the height difference between the wrist and shoulder key points, which is computed as:
[Formula image not reproduced: Δh is the height difference between the wrist and shoulder key points, normalized using the nose-tip and shoulder coordinates defined below; one form uses the left-side key points and the alternative form uses the right-side key points.]
where y_left_shoulder is the ordinate of the left-shoulder key point, y_right_shoulder the ordinate of the right-shoulder key point, y_left_wrist the ordinate of the left-wrist key point, y_right_wrist the ordinate of the right-wrist key point, y_nose the ordinate of the nose-tip key point, x_nose the abscissa of the nose-tip key point, x_left_shoulder the abscissa of the left-shoulder key point, and x_right_shoulder the abscissa of the right-shoulder key point;
If the height difference Δh between the wrist and shoulder key points is greater than 0.5, the student individual is judged to be in an attentive listening state (hand raised); otherwise, the student is judged to be in a general listening state.
Step 4.2: if the front face cannot be detected, further judging according to the relation between the extracted human body posture key point information and the front frame image and the rear frame image.
According to common experience and experimental statistics, in the head-down state the angle at the nose-tip key point between the vectors pointing to the left-shoulder and right-shoulder key points falls in the range of 170 to 200 degrees, while in the non-head-down state this angle is concentrated in the range of 90 to 120 degrees; 160 degrees is therefore chosen as the boundary between the head-down and non-head-down states.
Also according to common experience and experimental statistics, a student lowers the head only briefly when reading or writing; therefore, if the student is head-down in the previous frame and head-up again in the following frame, the frame is recorded as an attentive listening state (writing), otherwise the student is judged to be in a not-listening state.
If the angle between the vectors from the nose tip to the left and right shoulders is less than 160 degrees, the person is judged not to be in a head-down state, and the student's class-listening state is further judged from the horizontal relative distance between the left-shoulder and right-shoulder key points. According to common experience and experimental statistics, this normalized horizontal distance is less than 1.5 when the student has turned sideways. The normalized horizontal relative distance Δx between the left-shoulder and right-shoulder key points is computed as:
[Formula image not reproduced: Δx is the horizontal distance between the left-shoulder and right-shoulder key points, normalized using the nose-tip and neck key-point coordinates defined below.]
where x_left_shoulder is the abscissa of the left-shoulder key point, x_right_shoulder the abscissa of the right-shoulder key point, y_neck the ordinate of the neck key point, x_neck the abscissa of the neck key point, y_nose the ordinate of the nose-tip key point, and x_nose the abscissa of the nose-tip key point;
If the normalized horizontal distance between the left and right shoulders is less than 1.5, the student is judged to be in a not-listening state (turned sideways to whisper to a neighbor); otherwise, the student is judged to be in a general listening state.
As shown in fig. 6, three classes of student states are detected in the image: student individuals turned sideways to whisper to a neighbor, students in the general listening state, and students with their heads down and distracted (the last labeled "absent-minded") each receive their own label; the hand-raising state and the head-down writing state have their own labels as well (the latter "writing"), but are not shown because they are not detected in fig. 6. The different listening states are also distinguished by masks of different colors.
In an embodiment of the present invention, the step 5 includes:
Weighted scoring is performed according to the different class-listening states, and each student's class-listening efficiency percentage over the whole class period is calculated.
For student individuals judged in step 4 to be listening: if the student is in the general listening state, a score of 0.6 is recorded for each detected frame;
for student individuals judged in step 4 to be listening: if the student is in the writing state, a score of 0.8 is recorded for each detected frame, and if the student is in the hand-raising state, a score of 1 is recorded for each detected frame;
for student individuals judged in step 4 to be in a not-listening state, such as head-down distraction or sideways whispering, a score of 0 is recorded for each detected frame.
the calculation formula of the class listening efficiency percentage P of the whole class period of each student individual is as follows:
P = (1 × r + 0.8 × l + 0.6 × s) / N × 100%
wherein r is the total frame number of the student in the hand lifting state, l is the total frame number of the student in the writing state, s is the total frame number of the student in the general class listening state, and N is the total frame number of the continuous frame images of the classroom video.
The invention provides a multi-human-body posture detection and state discrimination method based on instance segmentation. There are many ways and means to implement this technical scheme, and the above description is only a preferred embodiment of the invention. It should be noted that a person skilled in the art can make improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.

Claims (6)

1. A multi-human-body posture detection and state discrimination method based on instance segmentation, characterized by comprising the following steps:
step 1: collecting a video of a student in a class at a front angle, extracting one frame every 5 seconds, and carrying out framing treatment on the collected video to obtain all original framing images of the video of the class;
step 2: segmenting all student individuals from non-student regions in the original frame images of the classroom video using an instance segmentation model, marking different student individuals with masks of different colors, simultaneously performing posture detection, extracting the 18 key points of each student's human-body posture, and labeling and connecting them, thereby obtaining classroom images annotated with masks of different colors and connected human-body key points;
step 3: detecting the position of the face of each student by using a dlib model;
step 4: specifically judging each student's class-listening state: if a frontal face can be detected, judging whether the student is in a general listening state or a hand-raising state from the extracted human-body posture key-point information; if the student's frontal face cannot be detected, judging whether the student is in a head-down state or has turned sideways to whisper from the extracted human-body posture key-point information;
step 5: processing all original framing images of the classroom video according to the steps 1-4 to obtain all labeling framing images labeling the individual gestures of students, outputting the classroom states of the students, and performing scoring weighted calculation on different classroom states to obtain the class listening efficiency percentage of each student in the whole classroom period;
the step 1 comprises the following steps:
step 1.1: recording and storing videos of the front faces of all students in the whole classroom period;
step 1.2: carrying out framing operation on all the front videos of students in a stored classroom period, setting to extract a frame of to-be-processed image every 5 seconds, and outputting and storing the image;
the step 2 comprises the following steps:
step 2.1: inputting all original framing images of the classroom video obtained in the step 1 into a backbone neural network of an example segmentation model for processing so as to obtain a feature map in an input picture, wherein the extracted feature map is used as input for subsequent processing;
step 2.2: inputting the feature map obtained in step 2.1 into the region proposal network (RPN) layer of the instance segmentation model, and scanning the image with a sliding window to find regions containing targets, thereby obtaining regions of interest (RoIs);
step 2.3: detecting each generated region of interest; when the class of a region of interest is detected to be a person, one-hot encoding the position of each key point on the human body and generating a mask corresponding to each human-body key point;
step 2.4: performing an alignment operation on the RoIs output by the region proposal network (RPN) layer, and extracting the feature corresponding to each RoI from the feature map;
step 2.5: respectively sending the RoI processed in the step 2.3 into two branches of a Fast region-based convolutional network Fast R-CNN and a full convolutional neural network FCN in an example segmentation model, wherein the Fast R-CNN carries out gesture classification and bounding box regression on the RoI, and the full convolutional neural network FCN generates a mask for each RoI;
step 2.6: extracting coordinates of attitude key points of student individuals, and storing the extracted coordinate key point information in a CSV file form;
the step 2.1 comprises the following steps:
the backbone neural network comprises a residual network ResNet101 and a feature map pyramid network FPN;
the residual network res net101 is a total of 101 layers of network because each residual block is 3 layers, and each residual block is represented as:
x_{n+1} = h(x_n) + F(x_n, W_n)
where x_{n+1} is the output of the residual block, x_n is its input, W_n denotes the convolution operation, F(x_n, W_n) represents the residual part, h(x_n) = W'_n · x_n represents the direct (shortcut) mapping, and W'_n is a 1×1 convolution operation;
and dividing the residual network ResNet101 into 5 stages, and correspondingly obtaining 5 feature map outputs with different scales in the feature map pyramid network FPN network.
2. The method according to claim 1, wherein step 2.2 comprises:
step 2.2.1: the region proposal network (RPN) layer generates, at each sliding-window position, 9 target boxes with preset aspect ratios and areas, called anchor boxes, wherein the 9 initial anchor boxes cover three areas (128×128, 256×256, 512×512) and each area covers three aspect ratios (1:1, 1:2, 2:1);
step 2.2.2: after the generated initial anchor boxes are cropped and filtered, the region proposal network (RPN) layer judges via a Softmax function whether each anchor belongs to the foreground or the background, i.e., a student individual or the classroom background, and performs a first coordinate correction on the anchor boxes belonging to the foreground.
3. The method according to claim 2, wherein step 2.3 comprises:
one-hot encoding is an encoding in which exactly one bit is active; during human-posture detection, the human body is treated as the target instance for classification and detection, each body-part key point corresponds to a one-hot code, 18 key points are annotated on each human body, and the key-point annotation scheme follows the human-body key-point annotation of the COCO dataset.
4. A method according to claim 3, wherein step 2.4 comprises:
step 2.4.1: using the existing VGG16 network, selecting a convolution step length of 32, mapping the region of interest passing through the VGG16 network layer to the original 1/32 of the size of the feature map, and if the size mapped to the feature map at the moment is a floating point number, not performing rounding operation and reserving the floating point number;
step 2.4.2: setting the pooled feature map to be fixed at 7×7, and assuming the mapped candidate region is n×n, where n represents its side length; dividing the n×n candidate region into 49 equally sized small regions, each of size (n/7)×(n/7);
step 2.4.3: setting the sampling point number to be 4, namely dividing each small area with the size of (n/7) into four parts in a bisection way, taking the pixel at the center point position of each part, and calculating by adopting a bilinear interpolation method to obtain the pixel values of four points;
step 2.4.4: setting the pixel value of each small region to the maximum of the four pixel values computed by bilinear interpolation, and proceeding likewise for all regions, so that the 49 values obtained from the 49 small regions form a feature map of size 7×7.
5. The method of claim 4, wherein step 4 comprises:
step 4.1: if a frontal face can be detected, further judging the class-listening state from the extracted human-body posture key-point information; whether the student individual is raising a hand is judged from the height difference Δh between the wrist and shoulder key points, which is computed as:
[Formula image not reproduced: Δh is the height difference between the wrist and shoulder key points, normalized using the nose-tip and shoulder coordinates defined below; one form uses the left-side key points and the alternative form uses the right-side key points.]
where y_left_shoulder is the ordinate of the left-shoulder key point, y_right_shoulder the ordinate of the right-shoulder key point, y_left_wrist the ordinate of the left-wrist key point, y_right_wrist the ordinate of the right-wrist key point, y_nose the ordinate of the nose-tip key point, x_nose the abscissa of the nose-tip key point, x_left_shoulder the abscissa of the left-shoulder key point, and x_right_shoulder the abscissa of the right-shoulder key point;
if the height difference Δh between the wrist and shoulder key points is greater than 0.5, the student individual is judged to be in an attentive listening state; otherwise, the student is judged to be in a general listening state;
step 4.2: if a frontal face cannot be detected, further judging from the extracted human-body posture key-point information together with the preceding and following frame images, wherein in the head-down state the angle at the nose-tip key point between the vectors pointing to the left-shoulder and right-shoulder key points falls in the interval of 170 to 200 degrees, in the non-head-down state this angle is concentrated in the interval of 90 to 120 degrees, and 160 degrees is chosen as the boundary between the head-down and non-head-down states;
when the student is in a head-down state, if the student was head-down in the previous frame and is head-up in the following frame, the frame is recorded as an attentive listening state; otherwise, the student is judged to be in a not-listening state;
if the angle between the vectors from the nose tip to the left and right shoulders is less than 160 degrees, judging that the person is not in a head-down state, and continuing to judge the student individual's class-listening state from the horizontal relative distance between the left-shoulder and right-shoulder key points, wherein in the sideways state the normalized horizontal relative distance between the left-shoulder and right-shoulder key points is less than 1.5, the normalized horizontal relative distance Δx being computed as:
[Formula image not reproduced: Δx is the horizontal distance between the left-shoulder and right-shoulder key points, normalized using the nose-tip and neck key-point coordinates defined below.]
where x_left_shoulder is the abscissa of the left-shoulder key point, x_right_shoulder the abscissa of the right-shoulder key point, y_neck the ordinate of the neck key point, x_neck the abscissa of the neck key point, y_nose the ordinate of the nose-tip key point, and x_nose the abscissa of the nose-tip key point;
if the normalized horizontal relative distance between the left-shoulder and right-shoulder key points is less than 1.5, the student is judged to be in a not-listening state of sideways whispering; otherwise, the student individual is judged to be in a general listening state.
6. The method of claim 5, wherein step 5 comprises:
performing weighted scoring according to the different class-listening states, and calculating each student's class-listening efficiency percentage over the whole class period;
for student individuals judged in step 4 to be listening: if the student is in the general listening state, a score of 0.6 is recorded for each detected frame;
for student individuals judged in step 4 to be listening: if the student is in the writing state, a score of 0.8 is recorded for each detected frame, and if the student is in the hand-raising state, a score of 1 is recorded for each detected frame;
for student individuals judged in step 4 to be in a not-listening state, a score of 0 is recorded for each detected frame;
the calculation formula of the class listening efficiency percentage P of the whole class period of each student individual is as follows:
P = (1 × r + 0.8 × l + 0.6 × s) / N × 100%
wherein r is the total frame number of the student in the hand lifting state, l is the total frame number of the student in the writing state, s is the total frame number of the student in the general class listening state, and N is the total frame number of the continuous frame images of the classroom video.
CN202010371935.6A 2020-05-06 2020-05-06 Multi-human-body gesture detection and state discrimination method based on instance segmentation Active CN111563452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010371935.6A CN111563452B (en) 2020-05-06 2020-05-06 Multi-human-body gesture detection and state discrimination method based on instance segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010371935.6A CN111563452B (en) 2020-05-06 2020-05-06 Multi-human-body gesture detection and state discrimination method based on instance segmentation

Publications (2)

Publication Number Publication Date
CN111563452A CN111563452A (en) 2020-08-21
CN111563452B true CN111563452B (en) 2023-04-21

Family

ID=72074457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010371935.6A Active CN111563452B (en) 2020-05-06 2020-05-06 Multi-human-body gesture detection and state discrimination method based on instance segmentation

Country Status (1)

Country Link
CN (1) CN111563452B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750125B (en) * 2021-01-28 2022-04-15 华南理工大学 Glass insulator piece positioning method based on end-to-end key point detection
CN113111747A (en) * 2021-03-31 2021-07-13 新疆爱华盈通信息技术有限公司 Abnormal limb behavior detection method, device, terminal and medium
CN114140282B (en) * 2021-11-19 2023-03-24 武汉东信同邦信息技术有限公司 Method and device for quickly reviewing answers of general teaching classroom based on deep learning
CN114708657A (en) * 2022-03-30 2022-07-05 深圳可视科技有限公司 Student attention detection method and system based on multimedia teaching
CN115311606B (en) * 2022-10-08 2022-12-27 成都华栖云科技有限公司 Classroom recorded video validity detection method
CN116739859A (en) * 2023-08-15 2023-09-12 创而新(北京)教育科技有限公司 Method and system for on-line teaching question-answering interaction
CN116778481B (en) * 2023-08-17 2023-10-31 武汉互创联合科技有限公司 Method and system for identifying blastomere image based on key point detection
CN116968758A (en) * 2023-09-19 2023-10-31 江西五十铃汽车有限公司 Vehicle control method and device based on three-dimensional scene representation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109035089A (en) * 2018-07-25 2018-12-18 重庆科技学院 A kind of Online class atmosphere assessment system and method
CN109284737A (en) * 2018-10-22 2019-01-29 广东精标科技股份有限公司 A kind of students ' behavior analysis and identifying system for wisdom classroom
CN109409371A (en) * 2017-08-18 2019-03-01 三星电子株式会社 The system and method for semantic segmentation for image
CN109740446A (en) * 2018-12-14 2019-05-10 深圳壹账通智能科技有限公司 Classroom students ' behavior analysis method and device
CN110287792A (en) * 2019-05-23 2019-09-27 华中师范大学 A kind of classroom Middle school students ' learning state real-time analysis method in nature teaching environment
CN110598554A (en) * 2019-08-09 2019-12-20 中国地质大学(武汉) Multi-person posture estimation method based on counterstudy
CN111079554A (en) * 2019-11-25 2020-04-28 恒安嘉新(北京)科技股份公司 Method, device, electronic equipment and storage medium for analyzing classroom performance of students

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409371A (en) * 2017-08-18 2019-03-01 三星电子株式会社 The system and method for semantic segmentation for image
CN109035089A (en) * 2018-07-25 2018-12-18 重庆科技学院 A kind of Online class atmosphere assessment system and method
CN109284737A (en) * 2018-10-22 2019-01-29 广东精标科技股份有限公司 A kind of students ' behavior analysis and identifying system for wisdom classroom
CN109740446A (en) * 2018-12-14 2019-05-10 深圳壹账通智能科技有限公司 Classroom students ' behavior analysis method and device
CN110287792A (en) * 2019-05-23 2019-09-27 华中师范大学 A kind of classroom Middle school students ' learning state real-time analysis method in nature teaching environment
CN110598554A (en) * 2019-08-09 2019-12-20 中国地质大学(武汉) Multi-person posture estimation method based on counterstudy
CN111079554A (en) * 2019-11-25 2020-04-28 恒安嘉新(北京)科技股份公司 Method, device, electronic equipment and storage medium for analyzing classroom performance of students

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tan Bin et al. Research on student classroom behavior detection algorithms based on Faster R-CNN. Modern Computer (Professional Edition). 2018, full text. *
Deng Yinong et al. A survey of human pose estimation methods based on deep learning. Computer Engineering and Applications. 2019, full text. *

Also Published As

Publication number Publication date
CN111563452A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN111563452B (en) Multi-human-body gesture detection and state discrimination method based on instance segmentation
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
CN106960202B (en) Smiling face identification method based on visible light and infrared image fusion
Li et al. Robust visual tracking based on convolutional features with illumination and occlusion handing
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
Lin Face detection in complicated backgrounds and different illumination conditions by using YCbCr color space and neural network
CN104063059B (en) A kind of real-time gesture recognition method based on finger segmentation
CN106960181B (en) RGBD data-based pedestrian attribute identification method
CN111402224B (en) Target identification method for power equipment
CN102902986A (en) Automatic gender identification system and method
CN109086659B (en) Human behavior recognition method and device based on multi-channel feature fusion
CN109920538B (en) Zero sample learning method based on data enhancement
Pandey et al. Hand gesture recognition for sign language recognition: A review
CN110991315A (en) Method for detecting wearing state of safety helmet in real time based on deep learning
CN111046732A (en) Pedestrian re-identification method based on multi-granularity semantic analysis and storage medium
CN107808376A (en) A kind of detection method of raising one's hand based on deep learning
CN105069745A (en) face-changing system based on common image sensor and enhanced augmented reality technology and method
CN112487981A (en) MA-YOLO dynamic gesture rapid recognition method based on two-way segmentation
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN116052222A (en) Cattle face recognition method for naturally collecting cattle face image
CN107578015B (en) First impression recognition and feedback system and method based on deep learning
CN104866826A (en) Static gesture language identification method based on KNN algorithm and pixel ratio gradient features
CN113723277B (en) Learning intention monitoring method and system integrated with multi-mode visual information
CN111178201A (en) Human body sectional type tracking method based on OpenPose posture detection
WO2021248814A1 (en) Robust visual supervision method and apparatus for home learning state of child

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant