CN113723233A

CN113723233A - Student learning participation degree evaluation method based on layered time sequence multi-example learning

Info

Publication number: CN113723233A
Application number: CN202110942289.9A
Authority: CN
Inventors: 李特; 姜新波; 马嘉遥; 秦学英; 顾建军
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2021-08-17
Filing date: 2021-08-17
Publication date: 2021-11-30
Anticipated expiration: 2041-08-17
Also published as: CN113723233B

Abstract

The invention discloses a student learning participation degree evaluation method based on hierarchical time sequence multi-example learning. The method uses three characteristics of head posture, facial expression and body posture extracted from a video and a video-level learning participation label to train an evaluation model, and the model can obtain the video-level learning participation and the learning participation of all video segments. The implementation method is convenient, efficient and simple in calculation, and the learning participation degree evaluation precision is reliably guaranteed.

Description

Student learning participation degree evaluation method based on layered time sequence multi-example learning

Technical Field

The invention belongs to the field of computer vision, artificial intelligence and education, and particularly relates to a student learning participation degree evaluation method based on hierarchical time sequence multi-example learning.

Background

The advent of large-scale open online courses (MOOC) has raised widespread interest and great expectations in the educational community. Despite the broad potential of new educational approaches, low learning completion rates of students are considered to be one of their major problems. In order to overcome the deficiency, dynamic assessment of the participation of individual students during online learning activities can provide timely teaching intervention to improve the learning completion rate and perform personalized learning. Since a large number of students are often seen in an MOOC environment, the cost of manually performing such evaluations is prohibitive. Therefore, research into automated techniques for instantly evaluating student learning participation is receiving increasing attention.

The study of automatic assessment of learning participation has the following problems:

1) due to the fact that the work of segment-by-segment annotation is time-consuming and labor-consuming, the problem of evaluating the learning participation degree of the whole video is only solved by a plurality of previous methods, and attention is lacked to the evaluation of the more meaningful video segment participation degree.

2) The lessons of distance education usually last tens of minutes or even an hour, and the large amount of video data makes evaluation difficult. Therefore, how to obtain effective features that can represent the whole video and each short film becomes a problem to be solved urgently.

Disclosure of Invention

The invention aims to provide a student learning participation degree evaluation method based on hierarchical time sequence multi-example learning, aiming at the defects of the prior art.

The purpose of the invention is realized by the following technical scheme: a student learning participation degree evaluation method based on hierarchical time sequence multi-example learning comprises the following steps:

step 1, extracting image frames from each video, wherein each frame of image forms a video segment, and each video acquires N video segments.

And 2, respectively extracting body posture features, head posture features and facial key point features of each frame image in each video clip by using OpenPose, FSA-NET and PLFD networks for evaluation of learning participation.

Step 3, for each type of characteristic sequence of the video clip, respectively using a Bi-LSTM network to obtain the hidden state of each moment; the hidden state is input into an underlying temporal multi-instance learning module (B-TMIL) to obtain a feature representation of the video segment. And processing the extracted features of all video segments of one video through a full-connection and top-level multi-instance learning module (T-TMIL) to obtain the feature representation of the video. Wherein, the B-TIMIL and the T-TMIL are realized based on a self-attention mechanism, and the structures are the same.

And 4, fusing the three video segment-level features extracted by the B-TMIL in the step 3, and fusing the three video-level features extracted by the T-TMIL in the step 3.

And 5, performing full connection operation on the fusion characteristics of the video clip level to obtain the learning participation of the video clip. And obtaining the learning participation of the video through full-connection operation of the video-level fusion characteristics. And respectively establishing local and global supervision by using the average value of the video segment participation and the learning participation of the video, and training the whole network.

Further, in step 1, extracting the image frames specifically includes reserving one frame every several frames at equal intervals.

Further, in step 2, for the video clip frame v_i,jRespectively extracting head posture characteristics e by using OpenPose, FSA-NET and PLFD networks_i,jBody posture characteristics b_i,jAnd face key point m_i,j. For video clip V_iThen get the head pose sequence E_i＝{e_i,1,e_i,2…,e_i,l}, body posture sequence B_i＝{b_i,1,b_i,2…,b_i,lAnd facial Key Point sequence M_i＝{m_i,1,m_i,2…,m_i,l}。

Further, in step 3, the B-TMIL module acts on the sequence of sampled video frames that constitute a short video segment, where the frames are examples and the segments are packets. A valid representation of the packet needs to be obtained in order to accurately obtain its tag. The characterization of the packet is adaptively obtained by trainable parameters using a multi-instance learning module of an attention-machine mechanism to act on the hidden state at all times of the Bi-LSTM.

Further, with X_iRepresenting one of a head pose sequence, a body pose sequence, a facial keypoint sequence, X_iInputting into Bi-LSTM to obtain hidden state sequence H_i＝{h_i,1,h_i,2…,h_i,l}。X_iCorresponding segment-level aggregation features

Is calculated as

And (3) weighted sum form after dimension reduction:

wherein the content of the first and second substances,

representing one of a segment-level head pose feature sequence, a segment-level body pose feature sequence, and a segment-level face keypoint feature sequence. Delta is the function of the ReLU and is,

is a full join operation for dimensionality reduction.

Is a weight matrix,. alpha.is an element-by-element multiplication,. alpha.is a Sigmoid function,. tau.is Tanh function.

Further, in step 3, T-TMIL acts between video segments, which are considered as examples, and the complete video composed of segments is a packet. The MIL module is applied on the basis of the segment-level features. The dimensionality of the segment-level features is reduced by the fully-join operation to generate a more efficient embedded representation. A weighted combination of video segments is constructed to represent the video.

Further, the air conditioner is provided with a fan,

representing one of a video-level head pose feature sequence, a video-level body pose feature sequence, a video-level facial key point feature sequence:

wherein the content of the first and second substances,

is the weight matrix in T-TMIL.

Representing one of a segment-level head pose feature sequence, a segment-level body pose feature sequence, and a segment-level face keypoint feature sequence.

Further, in step 4, a weighted feature fusion method is adopted to extract the fusion features of the segment level and the video level for evaluation. The weight matrix is the same size as the feature matrix and is composed of trainable parameters. And normalizing each column of the weight matrix through a Softmax function to obtain different proportions of different features on corresponding dimensions, and then weighting and summing to obtain weighted fusion features.

The invention has the beneficial effects that: the invention establishes a layered time sequence multi-example learning model according to the time correlation between examples, wherein the model is composed of a bottom module of a video frame-video fragment and a top module of the video fragment-video. The method uses three characteristics of head posture, facial expression and body posture extracted from the video and the learning participation degree label of video level to train the evaluation model, and the model not only can obtain the learning participation degree of video level, but also can obtain the learning participation degree of all video segments. The implementation method is convenient, efficient and simple in calculation, and the learning participation degree evaluation precision is reliably guaranteed.

Drawings

FIG. 1 is a schematic diagram of a learning engagement assessment framework;

FIG. 2 is a schematic diagram of a feature fusion process.

Detailed Description

As shown in fig. 1, the student learning participation degree evaluation method based on hierarchical time series multi-instance learning of the present invention includes the following steps:

step 1, pretreatment. Extracting 3000 frames of images at equal intervals from each video; each 30 frames of images constitute one video clip, so that 100 video clips can be acquired for one video.

Down-sampling: the body posture, head posture and facial key points of the learner tend to change gradually and slowly during the learning process. Therefore, we downsample each original video by keeping one frame every few frames to achieve more efficient computational processing. In our experiment, 3000 frames per video were reserved for evaluation.

And (3) dividing: since the features of the current moment have little influence on the evaluation of learning engagement of other moments, and the network usually has difficulty in processing a lengthy feature sequence, we segment the input video into short video segments as basic analysis objects. We set the length of the video segment to 30 frames, denoted l-30 in all our experiments, to trade off the result between computational efficiency and accuracy of the proposed method. The number of video clips N extracted from each video is 100.

And 2, extracting the characteristics. And respectively extracting body posture characteristics, head posture characteristics and facial key point characteristics of each frame image in each video clip by using OpenPose, FSA-NET and PLFD networks for the evaluation of learning participation.

According to previous studies, the body language and facial expression have a strong correlation with the learner's learning engagement. Therefore, we use head pose features, body pose features, and facial keypoints as inputs to improve the accuracy and robustness of our model. For video clip frame v_i,jWe use OpenPose, FSA-NET and PLFD networks to extract the head pose features e separately_i,jBody posture characteristics b_i,jAnd face key point m_i,j(ii) a i is 1 to N, and j is 1 to l. Then for video segment V_iWe can get the head pose sequence E_i＝{e_i,1,e_i,2…,e_i,l}, body posture sequence B_i＝{b_i,1,b_i,2…,b_i,lAnd facial Key Point sequence M_i＝{m_i,1,m_i,2…,m_i,l}。

And step 3, layering a time sequence multi-instance learning model (H-TMIL). Based on a frame-segment-video structure with only video-level tags, as shown in fig. 1, we propose a hierarchical temporal multi-instance learning model (H-TMIL) consisting of a bottom-level temporal multi-instance learning module (B-TMIL) and a top-level temporal multi-instance learning module (T-TMIL) that work to learn the potential relationships between segments and their constituent frames and between videos and their constituent segments, respectively. Through this framework, we establish a connection between the underlying video frames and the video-level tags, and can implicitly learn the expressions of the middleware, i.e., the segment-level features useful for evaluating segment-level learning engagement.

And 3.1, a bottom-layer time sequence multi-instance learning module (B-TMIL). For each class of feature sequences of a video segment, we use the Bi-LSTM network to obtain the hidden state at each time. The hidden state is input into an underlying temporal multi-instance learning module (B-TMIL) to obtain a feature representation of the video segment. As shown in the lower left corner of FIG. 1, B-TMIL is implemented based on a self-attention mechanism.

The B-TMIL module operates on a sequence of sampled video frames that constitute a short video segment, where frames are examples and segments are packets. We need to obtain a valid representation of a packet in order to accurately obtain its tag. However, unlike conventional multi-instance learning MILs, there is a strong timing correlation between frames for a short sequence of time frames. The use of Bi-LSTM to capture temporal correlations and use the last layer of hidden states to express sequences results in the loss of early information. To address this problem, we use a multi-instance learning module of an attention-oriented mechanism to act on hidden states at all times of Bi-LSTM, adaptively obtaining the characterization of the packet through trainable parameters. The body posture feature is used as an example for illustration.

First, we will characterize the body posture sequence B_iInputting into Bi-LSTM to obtain hidden state sequence H_i＝{h_i,1,h_i,2…,h_i,l}. Segment level aggregation feature

Can be calculated as

And (3) weighted sum form after dimension reduction:

where, delta refers to the ReLU function,

refers to a full join operation for dimensionality reduction. Weight of

The calculation formula is as follows:

wherein the content of the first and second substances,

the calculation is as follows:

in the formula (I), the compound is shown in the specification,

is a weight matrix,. phi.is an element-by-element multiplication,. sigma.is a Sigmoid function, and. tau.is a Tanh function. Wherein, the Tanh function is used for obtaining the correlation between the characteristics, and the Sigmoid function is used as a door mechanism.

Similar to segment-level body posture feature sequence

Using the B-TMIL module to extract segment-level head pose feature sequences

And segment level face key point feature sequence

I.e. B in the above process_iIs replaced by E_iOr M_i。

And 3.2, a top-level time sequence multi-instance learning module (T-TMIL). And processing the extracted features of all video segments of a video through a full-connection and top-level multi-instance learning module (T-TMIL) to obtain a feature representation of the video. As shown in the lower left corner of FIG. 1, T-TMIL is also implemented based on a self-attention mechanism, similar to B-TIMIL.

Similar to B-TMIL, T-TMIL acts between video clips, which can be considered as examples, where a complete video composed of clips is a packet. However, since the duration of each video is long, we consider that there is no longer a strong temporal relationship between video segments. Many previous works have generally considered that the video level feature can be represented as an average of all segment level features. These methods treat all video segments equally, which adversely affects the accuracy of the final assessment. To obtain a more robust and flexible video representation, we still apply the MIL module on the basis of the slice-level features. And the top module becomes more similar to the conventional multi-instance structure. Here we also describe the process of T-TMIL with body posture characteristics as an example.

The dimensionality of the segment-level features is reduced by the fully-join operation to generate a more efficient embedded representation. We construct a weighted combination of video segments to represent the video, which is calculated as follows:

wherein the content of the first and second substances,

is a video-level body posture characteristic sequence.

The calculation is as follows:

wherein the content of the first and second substances,

is the weight matrix in T-TMIL. Similar to the process described above, we use the T-TMIL module to aggregate video-level head pose feature sequences

And video level face key point feature sequence

I.e. in the above formula

Is replaced by

Or

And 4, fusing the characteristics. And (3) fusing the three types of video clip level features extracted in the step (3.1), and fusing the three types of video level features extracted in the step (3.2). The fusion process is shown in figure 2.

We use three types of features to assess learning engagement, and our hierarchical module processes each type of feature separately. In order to realize advantage complementation among different features and increase judgment information, a weighted feature fusion method is provided, and segment-level and video-level fusion features used for evaluation are extracted. As shown in fig. 2, the weight matrix is the same size as the feature matrix and is composed of trainable parameters. And normalizing each column of the weight matrix through a Softmax function to obtain different proportions of different features on the dimension, and then weighting and summing to obtain weighted fusion features.

And 5, performing full connection operation on the fusion characteristics of the video clip level to obtain the learning participation of the video clip. And obtaining the learning participation of the video through full-connection operation of the video-level fusion characteristics. And respectively establishing local and global supervision by using the average value of the video segment participation and the learning participation of the video, and training the network.

Claims

1. A student learning participation degree evaluation method based on hierarchical time sequence multi-example learning is characterized by comprising the following steps:

2. The student learning participation degree evaluation method based on the hierarchical time series multi-instance learning as claimed in claim 1, wherein in the step 1, the image frames are extracted to be specific to reserving one frame every few frames at equal intervals.

3. The student learning engagement assessment method based on hierarchical time series multi-instance learning as claimed in claim 1, wherein in step 2, v is a video segment frame_i,jRespectively extracting head posture characteristics e by using OpenPose, FSA-NET and PLFD networks_i,jBody posture characteristics b_i,jAnd face key point m_i,j. For video clip V_iThen get the head gesture sequenceE_i＝{e_i,1,e_i,2…,e_i,l}, body posture sequence B_i＝{b_i,1,b_i,2…,b_i,lAnd facial Key Point sequence M_i＝{m_i,1,m_i,2…,m_i,l}。

4. The student learning participation evaluation method based on hierarchical time series multiple instance learning according to claim 1, wherein in step 3, the B-TMIL module acts on a sequence of sample video frames constituting a short video segment, wherein the frames are instances and the segments are packets. A valid representation of the packet needs to be obtained in order to accurately obtain its tag. The characterization of the packet is adaptively obtained by trainable parameters using a multi-instance learning module of an attention-machine mechanism to act on the hidden state at all times of the Bi-LSTM.

5. The student learning engagement assessment method based on hierarchical time series multi-instance learning according to claim 4, characterized in that X is used_iRepresenting one of a head pose sequence, a body pose sequence, a facial keypoint sequence, X_iInputting into Bi-LSTM to obtain hidden state sequence H_i＝{h_i,1,h_i,2…,h_i,l}。X_iCorresponding segment-level aggregation features

Is calculated as

And (3) weighted sum form after dimension reduction:

wherein the content of the first and second substances,

is a full join operation for dimensionality reduction.

Is a weight matrix,. phi.is an element-by-element multiplication,. sigma.is a Sigmoid function, and. tau.is a Tanh function.

6. The student learning participation evaluation method based on the hierarchical time series multi-instance learning as claimed in claim 1, wherein in step 3, T-TMIL is applied between video segments, the video segments are considered as instances, and a complete video composed of the segments is a package. The MIL module is applied on the basis of the segment-level features. The dimensionality of the segment-level features is reduced by the fully-join operation to generate a more efficient embedded representation. A weighted combination of video segments is constructed to represent the video.

7. The student learning engagement assessment method based on hierarchical time series multi-instance learning according to claim 6,

wherein the content of the first and second substances,

is the weight matrix in T-TMIL.

8. The student learning participation evaluation method based on the hierarchical time series multi-instance learning as claimed in claim 1, wherein in step 4, a weighted feature fusion method is adopted to extract segment-level and video-level fusion features for evaluation. The weight matrix is the same size as the feature matrix and is composed of trainable parameters. And normalizing each column of the weight matrix through a Softmax function to obtain different proportions of different features on corresponding dimensions, and then weighting and summing to obtain weighted fusion features.