CN113361352A - Student classroom behavior analysis monitoring method and system based on behavior recognition - Google Patents

Student classroom behavior analysis monitoring method and system based on behavior recognition

Info

Publication number
CN113361352A
Authority
CN
China
Prior art keywords
joints
behavior
information
bones
classroom
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110586729.1A
Other languages
Chinese (zh)
Inventor
徐超
李珊
孟昭鹏
胡静
肖健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202110586729.1A
Publication of CN113361352A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a monitoring method for student classroom behavior analysis based on behavior recognition. The method comprises: collecting video information of human behavior and depth information of the key points of a human skeleton graph; extracting a student behavior skeleton graph from the captured images with a pose recognition algorithm; performing feature extraction on the skeleton graph with a graph convolutional neural network, where, given a sequence of 2D or 3D body-joint coordinates, a naturally connected spatio-temporal graph is constructed with the joints as nodes and the human body structure and time as edges; and, during convolution, representing the skeleton data as a directed graph according to the kinematic dependencies between human joints and bones, extracting information on the joints, the bones and their interrelations, making a prediction from the extracted features, and feeding the result back to the teacher. The invention can analyze a series of learning behaviors of students in the classroom and reduce the classroom burden on teachers. The invention further discloses a monitoring system for student classroom behavior analysis based on behavior recognition.

Description

Student classroom behavior analysis monitoring method and system based on behavior recognition
Technical Field
The invention belongs to the technical field of human behavior recognition, and particularly relates to a monitoring method for student classroom behavior analysis based on behavior recognition.
Background
Human behavior recognition aims to analyze and understand individual behavior, and the interactions among multiple individuals, from video. The technology is widely applied in fields such as security monitoring, intelligent medical monitoring and human-computer interaction, but it is still uncommon in education. This does not mean that behavior recognition lacks applicability to education; on the contrary, with the rapid development of information digitization, human behavior recognition has broad prospects in the educational field.
Classroom teaching plays a key role in modern education, and student behavior in class is a main component of classroom teaching evaluation, with a great influence on teaching quality. In traditional classroom teaching, the ways a teacher obtains classroom feedback are time-consuming and labor-intensive: observing student behavior during class, watching classroom videos after class, or inferring how well students followed the class from their homework. A teacher therefore has to attend to the quality of the teaching content while also supervising whether students are listening attentively, which consumes a great deal of the teacher's time and energy, tires the teacher, and cannot guarantee teaching quality; moreover, the teacher cannot watch every student's attention at every moment.
At present, artificial intelligence for classroom scenarios is mainly applied to expression and speech recognition; recognition of human behavior is rarer, and what exists mostly relies on wearable devices to recognize human actions. Such devices interfere with students' learning to some extent, so the collected data deviate from real classroom data and the recognition results are inaccurate. Moreover, because student behaviors in a classroom are varied and the background is complex, recognition is easily influenced by irrelevant factors such as illumination, clothing and camera viewpoint, and without a large amount of data to support learning it is difficult to recognize student behavior accurately. The monitoring systems widely used on the market today provide only a recording function: teachers can learn about students' classroom behavior only by reviewing footage after class, so continuous observation and supervision during class is impossible.
Disclosure of Invention
One object of the present invention is to address the defects of the prior art by providing a monitoring method for student classroom behavior analysis based on behavior recognition. The method can analyze a series of learning behaviors of students in the classroom and give feedback on their participation, activity level and the like, so that the teacher can grasp how students are following the class more intuitively, adjust teaching strategies, improve teaching modes and reduce the classroom burden.
In order to achieve the purpose, the invention adopts the following technical scheme:
the monitoring method for student classroom behavior analysis based on behavior recognition comprises the following steps:
collecting video information of human behavior and depth information of the key points of a human skeleton graph;
extracting a student behavior skeleton graph from the captured images with a pose recognition algorithm;
performing feature extraction on the skeleton graph with a graph convolutional neural network: given a sequence of 2D or 3D body-joint coordinates, constructing a naturally connected spatio-temporal graph with the joints as nodes and the human body structure and time as edges;
during convolution, representing the skeleton data as a directed graph according to the kinematic dependencies between human joints and bones, extracting information on the joints, the bones and their interrelations, making a prediction from the extracted features, and then storing the analysis data and feeding it back to the teacher.
As an improvement of the monitoring method for student classroom behavior analysis based on behavior recognition, the equipment for capturing human behavior comprises depth cameras and monitoring cameras hung at the front, back, left and right of the classroom ceiling.
As an improvement of the monitoring method for student classroom behavior analysis based on behavior recognition, the pose recognition algorithm is the OpenPose pose recognition algorithm, and the depth camera is a Kinect depth camera.
As an improvement of the monitoring method for student classroom behavior analysis based on behavior recognition, the skeleton graph comprises 25 joint points and 24 bones; each bone is a vector pointing from its source joint to its target joint and contains both length and direction information.
As an improvement of the monitoring method for student classroom behavior analysis based on behavior recognition, the features of the skeleton graph comprise spatial features and temporal features: the spatial features are extracted from joints and bones, where a joint is a 3D coordinate and a bone is the difference between the coordinates of two joints; the temporal features are motion information, which includes displacement, joint direction, motion velocity and acceleration.
Another object of the present invention is to provide a monitoring system for student classroom behavior analysis based on behavior recognition, comprising:
a data acquisition module for collecting video information of human behavior and depth information of the key points of the human skeleton graph;
a data preprocessing module for extracting a student behavior skeleton graph from the captured images with a pose recognition algorithm;
a behavior recognition module that performs feature extraction on the skeleton graph with a graph convolutional neural network: given a sequence of 2D or 3D body-joint coordinates, it constructs a naturally connected spatio-temporal graph with the joints as nodes and the human body structure and time as edges; during convolution it represents the skeleton data as a directed graph according to the kinematic dependencies between human joints and bones, extracts information on the joints, the bones and their interrelations, and makes a prediction from the extracted features.
The beneficial effects are as follows. The method collects video information of human behavior and depth information of the key points of the human skeleton graph; extracts a student behavior skeleton graph from the captured images with a pose recognition algorithm; performs feature extraction on the skeleton graph with a graph convolutional neural network, constructing, from a sequence of 2D or 3D body-joint coordinates, a naturally connected spatio-temporal graph with the joints as nodes and the human body structure and time as edges; and, during convolution, represents the skeleton data as a directed graph according to the kinematic dependencies between human joints and bones, extracts information on the joints, the bones and their interrelations, makes a prediction from the extracted features, and then stores the analysis data and feeds it back to the teacher. By combining behavior recognition with classroom teaching, a system that automatically records students' classroom behavior is built into the classroom: it records and stores class videos on the teacher's demand, analyzes and displays students' classroom behavior afterwards, recognizes typical classroom actions such as raising a hand, sleeping, playing with a mobile phone, writing and eating, prompts the teacher when required, and records and stores the results, forming an interactive behavior measurement system that teachers can use. The invention can analyze a series of learning behaviors of students in the classroom and give feedback on their participation, activity level and the like, so that the teacher can grasp how students are following the class more intuitively, adjust teaching strategies, improve teaching modes and reduce the classroom burden.
Drawings
Features, advantages and technical effects of exemplary embodiments of the present invention will be described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of behavior recognition in accordance with the present invention.
FIG. 2 is a schematic diagram of the Kinect depth camera of the present invention.
Fig. 3 is an overall system framework diagram of the present invention.
Detailed Description
As used in the specification and claims, certain terms refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names; this specification and the claims do not distinguish between components that differ in name but not in function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion and should be interpreted as "including, but not limited to". "Substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem and substantially achieve the technical effect.
Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted", "connected", "secured" and the like are to be construed broadly: for example, fixedly connected, detachably connected or integrally connected; mechanically or electrically connected; directly connected or indirectly connected through intervening media; or an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation.
The present invention will be described in further detail with reference to fig. 1 to 3, but the present invention is not limited thereto.
The monitoring method for student classroom behavior analysis based on behavior recognition comprises the following steps:
collecting video information of human behavior and depth information of the key points of a human skeleton graph;
extracting a student behavior skeleton graph from the captured images with a pose recognition algorithm;
performing feature extraction on the skeleton graph with a graph convolutional neural network: given a sequence of 2D or 3D body-joint coordinates, constructing a naturally connected spatio-temporal graph with the joints as nodes and the human body structure and time as edges;
during convolution, representing the skeleton data as a directed graph according to the kinematic dependencies between human joints and bones, extracting information on the joints, the bones and their interrelations, making a prediction from the extracted features, and then storing the analysis data and feeding it back to the teacher.
The equipment for capturing human behavior comprises depth cameras and monitoring cameras hung at the front, back, left and right of the classroom ceiling.
The pose recognition algorithm is the OpenPose pose recognition algorithm, and the depth camera is a Kinect depth camera.
The skeleton graph comprises 25 joint points and 24 bones; each bone is a vector pointing from its source joint to its target joint, containing both length and direction information.
The features of the skeleton graph comprise spatial features and temporal features: the spatial features are extracted from joints and bones, where a joint is a 3D coordinate and a bone is the difference between the coordinates of two joints; the temporal features are motion information, which includes displacement, joint direction, motion velocity and acceleration.
The Kinect sensor is the most widely used information acquisition device in the field of human behavior recognition. The device consists of three parts: an RGB color camera for capturing RGB video information, an infrared emitter, and a depth sensor. Kinect can collect original video images and depth data; the color image resolution is 1920 x 1080 and the depth image resolution is 512 x 424.
The Kinect sensors are hung from the classroom ceiling; by measuring their coverage and adjusting their depression angle, the occlusion that occurs when students sit densely can be resolved without affecting the teacher's teaching. The Kinect sensors acquire human behavior data: one part of the data is collected, sorted and stored at the back end as statistical information, and the other part is transmitted to the user, where the teacher chooses which student behaviors to view. Student behaviors are calibrated, and a prompt can be sent to the teacher when a student is detected not listening attentively or engaging in classroom interaction.
The overall execution flow is shown in fig. 3; specifically:
step one, data acquisition
Because a classroom holds many people, placing the Kinect depth camera in front of the students would cause occlusion among them and would also disturb classroom teaching. The Kinect depth cameras and monitoring cameras are therefore hung at the front, back, left and right of the classroom ceiling to obtain depth information from the four directions. The advantage is that the detection range of the Kinect depth cameras covers all seats, and even when the classroom is fully seated the cameras are not occluded.
The Kinect depth cameras hung at the front, back, left and right of the classroom ceiling collect the students' classroom behavior information in real time: behaviors of not listening attentively, such as sleeping, eating, playing with a mobile phone and chatting, as well as classroom interaction behaviors such as raising a hand. The acquired data take the form of color images and depth images.
Step two, data preprocessing
In recognizing and analyzing students' classroom behavior, human skeleton information is more robust than raw video: it highlights the key information of a student's action and is unaffected by external factors irrelevant to classroom behavior, such as the student's clothing, the classroom background, the student's posture, the illumination intensity and the camera viewing angle. This reduces the complexity of behavior recognition and increases its accuracy.
The Kinect depth camera collects video information of human behavior and depth information of the key points of the human skeleton graph, and human skeleton information is then extracted from the collected video data. The skeleton information can be extracted with OpenPose, an open-source human pose estimation library developed by Carnegie Mellon University on the Caffe framework using convolutional neural networks and supervised learning; it can estimate human body actions, facial expressions, finger motion and the like. OpenPose detects human skeletal joint points bottom-up: the network first predicts confidence maps for body-part detection to locate the key skeletal points; it then predicts the part affinity fields that associate related parts; finally, a greedy algorithm parses the confidences and affinities to connect the key points, yielding the skeleton graph of the human action.
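For concreteness, the following is a minimal sketch of parsing such output, assuming OpenPose was run with its --write_json option and the BODY_25 model (25 keypoints, matching the 25-joint skeleton used here); the function name and file layout are illustrative assumptions, not part of the patent:

    import json
    import numpy as np

    def load_openpose_keypoints(json_path):
        """Parse one frame of OpenPose --write_json output into an array
        of shape (num_people, 25, 3): x, y, confidence per BODY_25 joint."""
        with open(json_path) as f:
            frame = json.load(f)
        people = []
        for person in frame["people"]:
            kp = np.asarray(person["pose_keypoints_2d"], dtype=np.float32)
            people.append(kp.reshape(25, 3))  # 25 joints x (x, y, confidence)
        return np.stack(people) if people else np.empty((0, 25, 3), np.float32)

Each detected person then yields a (25, 3) array that the later preprocessing steps can consume.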
Humans naturally assess motion from the orientation and position of the bones rather than from joint positions alone, and joints and bones are strongly coupled, so extracting valid spatial and temporal features from the bones also aids action recognition. The first-order information of the skeleton data is the 2D or 3D coordinates of the joints; the length and direction of a bone usually provide more informative and discriminative cues for action recognition. To use this second-order information, which characterizes the bone between two joints, each bone is represented as a vector pointing from its source joint to its target joint. Fusing the first-order and second-order information in the model further improves performance.
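As an illustration of how the two orders of information can be fused, the sketch below assumes that a joint-stream classifier and a bone-stream classifier have each already produced per-class softmax scores; summing weighted scores (late fusion) is one common realization, and the equal weighting is an assumption:

    import numpy as np

    def fuse_two_streams(joint_scores, bone_scores, alpha=0.5):
        """Late fusion of the first-order (joint) and second-order (bone)
        streams: weighted sum of per-class scores, argmax gives the label."""
        fused = alpha * joint_scores + (1.0 - alpha) * bone_scores
        return fused.argmax(axis=-1)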
A student behavior skeleton graph is extracted from the captured images with the OpenPose pose recognition algorithm. The skeleton graph consists of 25 joint points and 24 bone edges and is a natural structural representation of the human joints and bones. The raw skeleton data in each frame are provided as a sequence of vectors, each representing the two- or three-dimensional coordinates of the corresponding human joint.
Each bone edge connects two joints: the joint closer to the center of gravity of the human body is defined as the source joint, and the joint farther from it as the target joint. Each bone is represented as a vector pointing from its source joint to its target joint, containing not only length information but also direction information. For example, given a source joint v1 = (x1, y1, z1) and a target joint v2 = (x2, y2, z2), the bone vector is computed as e_{v1,v2} = v2 - v1 = (x2 - x1, y2 - y1, z2 - z1).
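In code, the bone computation reduces to a vector subtraction over a source-to-target pairing, as in the sketch below; BONE_PAIRS is a hypothetical placeholder, since the actual 24-pair topology depends on the joint ordering of the pose model:

    import numpy as np

    # Hypothetical (source, target) pairs; the real table has 24 entries
    # ordered from the body's center of gravity outward.
    BONE_PAIRS = [(0, 1), (1, 2), (2, 3)]

    def bones_from_joints(joints, pairs=BONE_PAIRS):
        """joints: (25, 3) array of 3D joint coordinates. Returns one
        vector per bone, pointing from source to target joint, which
        encodes both the bone's length (its norm) and its direction."""
        joints = np.asarray(joints, dtype=np.float32)
        return np.stack([joints[t] - joints[s] for s, t in pairs])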
Step three, behavior recognition
Feature extraction is performed on the extracted skeleton graph with a graph convolutional neural network. Given a sequence of 2D or 3D body-joint coordinates, a naturally connected spatio-temporal graph is constructed with the joints as nodes and the human body structure and time as edges, so the input to the model is the set of joint coordinate vectors on the graph nodes. This is analogous to an image-based CNN, where the input is formed by the pixel vectors on a 2D image grid. Multiple layers of spatio-temporal graph convolution are applied to the input data, generating higher-level feature maps on the graph, which a standard softmax classifier then assigns to the corresponding action class. The whole model is trained end-to-end with backpropagation.
The skeleton sequence is constructed as an undirected spatio-temporal graph G = (V, E), which has both intra-frame and inter-frame connections. All joints form the node set V = {v_ti | t = 1, ..., T; i = 1, ..., N}, which is the input to the model; the feature vector on a node v_ti consists of the coordinate vector and the estimated confidence of the i-th joint in frame t. The edge set E consists of two subsets. The first describes the intra-frame skeletal connections of each frame, E_S = {v_ti v_tj | (i, j) ∈ H}, where H is the set of naturally connected human joints. The second contains the inter-frame edges that connect the same joint in consecutive frames, E_F = {v_ti v_(t+1)i}. The graph convolution is

f_out(v_ti) = Σ_{v_tj ∈ B(v_ti)} (1 / Z_ti(v_tj)) f_in(v_tj) · w(l_ti(v_tj)),

where B(v_ti) is the sampling neighborhood of node v_ti, l_ti labels each neighbor with its partition subset, w is a weight function indexed by that label, and the normalization term Z_ti(v_tj) equals the cardinality of the corresponding subset; this term is added to balance the contributions of the different subsets to the output.
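A minimal sketch of the spatial step of this convolution follows, assuming the partition adjacency matrices are already normalized by their subset cardinalities (the Z term) and that w is realized as one 1x1 convolution per partition; all names and shapes are illustrative:

    import torch

    def spatial_graph_conv(x, adj_subsets, convs):
        """x: (batch, channels, T, N) joint features. adj_subsets: list of
        (N, N) normalized adjacency matrices, one per neighbor partition.
        convs: matching list of 1x1 torch.nn.Conv2d layers realizing w."""
        out = 0
        for A, conv in zip(adj_subsets, convs):
            agg = torch.einsum("bctn,nm->bctm", x, A)  # gather neighbor features
            out = out + conv(agg)  # apply the partition-specific weights
        return out

With distance partitioning, for example, one would pass two adjacency matrices and two torch.nn.Conv2d(c_in, c_out, 1) layers.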
During convolution, the skeleton data are represented as a directed graph according to the kinematic dependencies between human joints and bones; the directed graph is used to extract information on the joints, the bones and their interrelations, and a prediction is made from the extracted features. The network propagates information along adjacent joints and bones and updates their attributes layer by layer: a node is updated by combining its own attributes with those of its incoming and outgoing edges, and an edge is updated by combining its own attributes with those of its source and target nodes. The finally extracted features therefore contain not only the information of each joint and bone but also the dependencies between them, which benefits action recognition.
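The update round can be sketched as follows, assuming incidence matrices A_in and A_out (joints x bones) marking each bone's target and source joint are available; the layer widths and activations are assumptions rather than the patent's specification:

    import torch
    import torch.nn as nn

    class DirectedGraphUpdate(nn.Module):
        """One node/edge update on the directed skeleton graph: nodes are
        combined with their incoming and outgoing edge features, then
        edges are combined with their updated source and target nodes."""
        def __init__(self, node_dim, edge_dim, out_dim):
            super().__init__()
            self.node_fc = nn.Linear(node_dim + 2 * edge_dim, out_dim)
            self.edge_fc = nn.Linear(edge_dim + 2 * out_dim, out_dim)

        def forward(self, nodes, edges, A_in, A_out):
            # nodes: (N, node_dim); edges: (M, edge_dim)
            e_in = A_in @ edges    # sum of incoming bone features per joint
            e_out = A_out @ edges  # sum of outgoing bone features per joint
            nodes = torch.relu(self.node_fc(torch.cat([nodes, e_in, e_out], -1)))
            src = A_out.t() @ nodes  # source joint feature of each bone
            tgt = A_in.t() @ nodes   # target joint feature of each bone
            edges = torch.relu(self.edge_fc(torch.cat([edges, src, tgt], -1)))
            return nodes, edges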
The features of the human skeleton include spatial features, extracted from the joints and bones, and temporal features. A joint is represented as a 3D coordinate and a bone as the difference between two joint coordinates. The temporal features are typically represented as motion information, which includes displacement, joint direction, motion velocity, acceleration and so on. Since the skeleton data are represented as joint coordinates, the motion information of a joint is computed as its coordinate difference along the time dimension: the motion of joint v at time t is m_{v,t} = v_{t+1} - v_t. Similarly, the deformation of a bone is the difference between the vectors of the same bone in consecutive frames: m_{e,t} = e_{t+1} - e_t. As with the spatial information, the motion information is modeled as a sequence of directed acyclic graphs, M = {M_t = (M_{v,t}, M_{e,t}) | t = 1, ..., T - 1}, where M_{v,t} and M_{e,t} collect the joint motions and bone deformations of frame t. The motion information is then fed to the model to predict the action label.
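Both joint motion and bone deformation are plain frame-to-frame differences, so a single helper covers both; a sketch, with the (frames, elements, channels) layout assumed for illustration:

    import numpy as np

    def temporal_motion(features):
        """features: (T, K, C) joint coordinates or bone vectors over T
        frames. Returns (T-1, K, C) differences m_t = x_{t+1} - x_t:
        joint motion when fed joints, bone deformation when fed bones."""
        features = np.asarray(features, dtype=np.float32)
        return features[1:] - features[:-1]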
Traditional deep-learning methods manually arrange the skeleton as a sequence of joint coordinate vectors or as a pseudo-image and feed it into a recurrent neural network (RNN) or a convolutional neural network (CNN) to generate a prediction; however, representing the skeleton data as a vector sequence or a 2D grid cannot fully express the dependencies between related joints. The skeleton is therefore constructed naturally in graph form, with the joints as nodes and the bones as edges.
The graph convolutional network (GCN) extends convolution to non-Euclidean data such as the human skeleton graph, enabling human behavior recognition. For an image, a fixed-size convolution kernel applied to the input maps, at each scan center, a pixel neighborhood of the same size as the weight matrix. For the human skeleton graph, the neighborhood of a node must instead be defined by grouping: nodes are partitioned into subsets, and the subsets are assigned separate weights. We define the neighbor set by distance partitioning: according to each node's distance from the root node, the neighbor set is divided into two subsets, one being the root node itself (distance 0) and the other the nodes at distance 1 from the root; different subsets learn different weights.
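A sketch of building the two partition matrices under this distance partitioning follows; dividing each row by the neighbor count stands in for the cardinality normalization, and the edge list is whatever joint topology is in use (illustrative assumption):

    import numpy as np

    def distance_partition(edges, num_joints):
        """Return the two adjacency subsets of distance partitioning:
        the root subset (distance 0, identity) and the distance-1
        neighbor subset, normalized by subset cardinality."""
        A = np.zeros((num_joints, num_joints), dtype=np.float32)
        for i, j in edges:
            A[i, j] = A[j, i] = 1.0
        root = np.eye(num_joints, dtype=np.float32)   # the node itself
        deg = A.sum(axis=1, keepdims=True).clip(min=1.0)
        neighbor = A / deg                            # distance-1 nodes
        return [root, neighbor]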
The algorithm module consists of a graph convolutional neural network with ten basic spatio-temporal convolution blocks. Each block comprises a spatial GCN, a temporal GCN and a dropout layer, with a batch normalization (BN) layer and a ReLU activation layer following both the spatial and the temporal GCN. The output channel numbers across the blocks are 64, 128, 256 and 256. A data BN layer is added at the input to normalize the input data; a global average pooling layer pools the feature maps of different samples to the same size; and the final output is fed to a softmax classifier to obtain the prediction result.
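A hedged sketch of such a backbone follows. The text fixes ten blocks and the channel widths 64, 128, 256 and 256; the per-block distribution used below (4 + 3 + 3), the 9-frame temporal window and the dropout rate are assumptions, and the spatial step is simplified to a 1x1 convolution for brevity:

    import torch
    import torch.nn as nn

    class STBlock(nn.Module):
        """Basic spatio-temporal block: spatial GCN (simplified) -> BN ->
        ReLU -> temporal convolution -> BN -> ReLU -> dropout."""
        def __init__(self, c_in, c_out, dropout=0.5):
            super().__init__()
            self.spatial = nn.Sequential(
                nn.Conv2d(c_in, c_out, 1), nn.BatchNorm2d(c_out), nn.ReLU())
            self.temporal = nn.Sequential(
                nn.Conv2d(c_out, c_out, (9, 1), padding=(4, 0)),
                nn.BatchNorm2d(c_out), nn.ReLU(), nn.Dropout(dropout))

        def forward(self, x):  # x: (batch, C, T, num_joints)
            return self.temporal(self.spatial(x))

    class ClassroomActionNet(nn.Module):
        """Ten blocks, a data BN at the input, global average pooling and
        a softmax classifier, mirroring the description above."""
        def __init__(self, num_classes, c_in=3, num_joints=25):
            super().__init__()
            widths = [64] * 4 + [128] * 3 + [256] * 3  # assumed split
            self.data_bn = nn.BatchNorm1d(c_in * num_joints)
            blocks, prev = [], c_in
            for w in widths:
                blocks.append(STBlock(prev, w))
                prev = w
            self.blocks = nn.Sequential(*blocks)
            self.fc = nn.Linear(prev, num_classes)

        def forward(self, x):  # x: (batch, C, T, num_joints)
            b, c, t, n = x.shape
            x = self.data_bn(x.permute(0, 1, 3, 2).reshape(b, c * n, t))
            x = x.reshape(b, c, n, t).permute(0, 1, 3, 2)
            x = self.blocks(x)
            x = x.mean(dim=[2, 3])  # global average pool over time and joints
            return self.fc(x).softmax(dim=-1)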
The human skeleton graph information and bone information acquired in the first two stages are fed to the behavior recognition module for recognition and analysis, chiefly of classroom interaction behaviors (such as raising a hand or standing up) and of behaviors showing that a student is not listening attentively (such as sleeping, playing with a mobile phone or eating).
The monitoring system for student classroom behavior analysis based on behavior recognition comprises:
a data acquisition module for collecting video information of human behavior and depth information of the key points of the human skeleton graph;
a data preprocessing module for extracting a student behavior skeleton graph from the captured images with a pose recognition algorithm;
a behavior recognition module that performs feature extraction on the skeleton graph with a graph convolutional neural network: given a sequence of 2D or 3D body-joint coordinates, it constructs a naturally connected spatio-temporal graph with the joints as nodes and the human body structure and time as edges; during convolution it represents the skeleton data as a directed graph according to the kinematic dependencies between human joints and bones, extracts information on the joints, the bones and their interrelations, and makes a prediction from the extracted features.
Variations and modifications of the above-described embodiments may occur to those skilled in the art and fall within the scope of the invention as disclosed and taught herein. Therefore, the present invention is not limited to the above embodiments; any obvious improvement, replacement or modification made by those skilled in the art on the basis of the present invention falls within its protection scope. Furthermore, although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims (6)

1. The monitoring method for student classroom behavior analysis based on behavior recognition is characterized by comprising the following steps:
collecting video information of human behavior and depth information of the key points of a human skeleton graph;
extracting a student behavior skeleton graph from the captured images with a pose recognition algorithm;
performing feature extraction on the skeleton graph with a graph convolutional neural network: given a sequence of 2D or 3D body-joint coordinates, constructing a naturally connected spatio-temporal graph with the joints as nodes and the human body structure and time as edges;
during convolution, representing the skeleton data as a directed graph according to the kinematic dependencies between human joints and bones, extracting information on the joints, the bones and their interrelations, making a prediction from the extracted features, and then feeding the prediction back to the teacher.
2. The monitoring method for student classroom behavior analysis based on behavior recognition according to claim 1, wherein the equipment for capturing human behavior comprises depth cameras and monitoring cameras hung at the front, back, left and right of the classroom ceiling.
3. The monitoring method for student classroom behavior analysis based on behavior recognition according to claim 2, wherein the pose recognition algorithm is the OpenPose pose recognition algorithm and the depth camera is a Kinect depth camera.
4. The monitoring method for student classroom behavior analysis based on behavior recognition according to claim 1, wherein the skeleton graph comprises 25 joint points and 24 bones, each bone being a vector pointing from its source joint to its target joint and containing length information and direction information.
5. The monitoring method for student classroom behavior analysis based on behavior recognition according to claim 1, wherein the features of the skeleton graph comprise spatial features and temporal features: the spatial features are extracted from joints and bones, a joint being a 3D coordinate and a bone the difference between the coordinates of two joints; the temporal features are motion information, which includes displacement, joint direction, motion velocity and acceleration.
6. A monitoring system for student classroom behavior analysis based on behavior recognition, characterized by comprising:
a data acquisition module for collecting video information of human behavior and depth information of the key points of the human skeleton graph;
a data preprocessing module for extracting a student behavior skeleton graph from the captured images with a pose recognition algorithm;
a behavior recognition module that performs feature extraction on the skeleton graph with a graph convolutional neural network: given a sequence of 2D or 3D body-joint coordinates, it constructs a naturally connected spatio-temporal graph with the joints as nodes and the human body structure and time as edges; during convolution it represents the skeleton data as a directed graph according to the kinematic dependencies between human joints and bones, extracts information on the joints, the bones and their interrelations, and makes a prediction from the extracted features.
CN202110586729.1A 2021-05-27 2021-05-27 Student classroom behavior analysis monitoring method and system based on behavior recognition Pending CN113361352A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110586729.1A CN113361352A (en) 2021-05-27 2021-05-27 Student classroom behavior analysis monitoring method and system based on behavior recognition


Publications (1)

Publication Number Publication Date
CN113361352A 2021-09-07

Family

ID=77527967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110586729.1A Pending CN113361352A (en) 2021-05-27 2021-05-27 Student classroom behavior analysis monitoring method and system based on behavior recognition

Country Status (1)

Country Link
CN (1) CN113361352A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395945A (en) * 2020-10-19 2021-02-23 北京理工大学 Graph volume behavior identification method and device based on skeletal joint points
CN112633209A (en) * 2020-12-29 2021-04-09 东北大学 Human action recognition method based on graph convolution neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang Zhihua, "Research on human action recognition based on spatio-temporal graph convolutional neural networks", China Master's Theses Full-text Database, Information Science and Technology series *
Zhao Jie et al., "Intelligent Robot Technology: Research and Practice of Security, Patrol and Disposal Police Robots", 31 January 2021 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898460A (en) * 2022-04-22 2022-08-12 华中师范大学 Teacher non-verbal behavior detection method based on graph convolution neural network
CN114898460B (en) * 2022-04-22 2024-04-26 华中师范大学 Teacher nonverbal behavior detection method based on graph convolution neural network
CN116434339A (en) * 2023-04-13 2023-07-14 江南大学 Behavior recognition method based on space-time characteristic difference and correlation of skeleton data
CN116434339B (en) * 2023-04-13 2023-10-27 江南大学 Behavior recognition method based on space-time characteristic difference and correlation of skeleton data
CN117315791A (en) * 2023-11-28 2023-12-29 杭州华橙软件技术有限公司 Bone action recognition method, device and storage medium
CN117315791B (en) * 2023-11-28 2024-02-20 杭州华橙软件技术有限公司 Bone action recognition method, device and storage medium

Similar Documents

Publication Publication Date Title
US11270526B2 (en) Teaching assistance method and teaching assistance system using said method
CN113361352A (en) Student classroom behavior analysis monitoring method and system based on behavior recognition
CN107330437B (en) Feature extraction method based on convolutional neural network target real-time detection model
CN108875708A (en) Behavior analysis method, device, equipment, system and storage medium based on video
CN109919251A (en) A kind of method and device of object detection method based on image, model training
CN103020606B (en) Pedestrian detection method based on spatio-temporal context information
CN108073888A (en) A kind of teaching auxiliary and the teaching auxiliary system using this method
JP7292492B2 (en) Object tracking method and device, storage medium and computer program
CN110826453A (en) Behavior identification method by extracting coordinates of human body joint points
CN104572804A (en) Video object retrieval system and method
CN102609724B (en) Method for prompting ambient environment information by using two cameras
Li et al. Sign language recognition based on computer vision
CN106815563B (en) Human body apparent structure-based crowd quantity prediction method
CN109684969A (en) Stare location estimation method, computer equipment and storage medium
CN110287848A (en) The generation method and device of video
CN112270381A (en) People flow detection method based on deep learning
CN113378649A (en) Identity, position and action recognition method, system, electronic equipment and storage medium
CN113781519A (en) Target tracking method and target tracking device
CN114241379A (en) Passenger abnormal behavior identification method, device and equipment and passenger monitoring system
CN114898460A (en) Teacher non-verbal behavior detection method based on graph convolution neural network
CN107610224A (en) It is a kind of that algorithm is represented based on the Weakly supervised 3D automotive subjects class with clear and definite occlusion modeling
CN116682178A (en) Multi-person gesture detection method in dense scene
CN114639168B (en) Method and system for recognizing running gesture
CN116416678A (en) Method for realizing motion capture and intelligent judgment by using artificial intelligence technology
JP2021033395A (en) Learned model, learning device, learning method, and learning program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210907