CN115984956B - Multi-mode visual analysis system for class investment of students through man-machine cooperation - Google Patents


Info

Publication number
CN115984956B
CN115984956B CN202211621966.8A CN202211621966A CN115984956B CN 115984956 B CN115984956 B CN 115984956B CN 202211621966 A CN202211621966 A CN 202211621966A CN 115984956 B CN115984956 B CN 115984956B
Authority
CN
China
Prior art keywords
classroom
input
learning
analysis module
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211621966.8A
Other languages
Chinese (zh)
Other versions
CN115984956A (en)
Inventor
蒋艳双
祁彬斌
包昊罡
黄荣怀
刘德建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Normal University
Original Assignee
Beijing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Normal University filed Critical Beijing Normal University
Priority to CN202211621966.8A priority Critical patent/CN115984956B/en
Publication of CN115984956A publication Critical patent/CN115984956A/en
Application granted granted Critical
Publication of CN115984956B publication Critical patent/CN115984956B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a man-machine collaborative multi-modal visual analysis system for student classroom input. The system comprises a multi-modal data acquisition module, a learning behavior analysis module, a teaching activity fusion analysis module, a classroom input analysis module and a visual feedback module connected in sequence, with an education domain knowledge extraction module adjacent to each of the learning behavior analysis module, the teaching activity fusion analysis module and the classroom input analysis module. With this structure, the system acquires and analyzes students' multi-modal data and, according to established classification standards and indexes related to classroom input, comprehensively evaluates and visually feeds back students' learning engagement in different scenes.

Description

Multi-mode visual analysis system for class investment of students through man-machine cooperation
Technical Field
The invention relates to the technical field of intelligent teaching, and in particular to a man-machine collaborative multi-modal visual analysis system for student classroom input.
Background
Classroom input analysis of students is an important basis in the fields of educational measurement and learning analytics. Existing analysis techniques fall mainly into two classes: observation and evaluation techniques based on structured scales, and coding analysis techniques based on objective behavior measurement. Observation and evaluation techniques use a series of evaluation items directed at students' classroom input as the measurement tool, but differences in observers' attention and experience often make the analysis standard unstable; such techniques are highly subjective, lack an automated basis for large-scale deployment, and suffer from low measurement reliability, high analysis cost and a single mode of result presentation. Coding analysis techniques use a manual behavior coding system, or an automatic one based on computer vision, to decompose the live classroom into discrete sequences of student behaviors, and generally take the proportion of specific first-order or multi-order (transition) behaviors in those sequences as the objective index for input analysis. They are objective enough to support large-scale deployment, but because the analysis standard is limited to single explicit behaviors and lacks human experiential input, measurement efficiency is low, interpretability is insufficient, and the data come from a single modality.
Disclosure of Invention
The invention aims to provide a man-machine collaborative multi-modal visual analysis system for student classroom input which, through the acquisition and analysis of students' modal data, comprehensively evaluates and visually feeds back students' learning engagement in different scenes according to established classification standards and indexes related to classroom input.
To achieve this aim, the invention provides a man-machine collaborative multi-modal visual analysis system for student classroom input, comprising a multi-modal data acquisition module, a learning behavior analysis module, a teaching activity fusion analysis module, a classroom input analysis module and a visual feedback module connected in sequence, with an education domain knowledge extraction module adjacent to each of the learning behavior analysis module, the teaching activity fusion analysis module and the classroom input analysis module;
the multi-modal data acquisition module is used for acquiring the original multi-modal data generated during the classroom process, including classroom two-dimensional video data, classroom depth video data and classroom audio data;
the learning behavior analysis module computes and preliminarily analyzes students' learning behaviors in real time based on the multi-modal data source, specifically by identifying the modal information of students' expressions, actions and language in class through artificial intelligence algorithms;
the teaching activity fusion analysis module is used for generating higher-level activity information from the analyzed student behavior information, specifically by collaboratively representing the information of each modality as a matched learning activity through a multi-modal machine learning method;
the classroom input analysis module is used for performing input-degree analysis and calculation for individual students by combining objective learning activity information within a specific scene, where the specific scene is the background category in which teaching activities occur, including lecturing, exercise and discussion; the specific index dimensions for the input-degree calculation are obtained from the education domain knowledge extraction module; the raw value is obtained by multiplying a scene row matrix of m columns, a weight matrix of m rows and n columns, and an activity column matrix of n rows, with the weight matrix derived from the education domain knowledge extraction module; and the standard value of each index is calculated from the raw value by a zero-mean (z-score) normalization method;
the education domain knowledge extraction module is used for consulting and combining expert opinions to form the theoretical dimensions and indexes at each level related to classroom input, which include: a learning behavior classification standard related to classroom input, a learning activity classification standard related to classroom input, a teaching scene classification standard related to classroom input, classroom input measurement dimensions and indexes, and a weight matrix mapping each learning activity to each measurement index in each teaching scene;
the visual feedback module is used for visually outputting the behavior and activity recognition results and the evaluation index calculation results related to classroom input: it calculates each student's learning engagement index score in each scene during the classroom process and outputs the result as an input-degree change curve, with video and images as the visual output modes.
Preferably, the multi-modal data acquisition module consists of two 4K cameras and one depth camera; the two 4K cameras are arranged at the upper left and upper right corners of the classroom blackboard respectively, the depth camera is arranged at the center of the upper edge of the blackboard, the two 4K cameras capture the students on the left and right halves of the classroom respectively, and the central depth camera captures all students in the classroom from the front.
Preferably, in the learning behavior analysis module, the implementation method for identifying the student modal information in the class through the artificial intelligence algorithm comprises the following steps:
1) Adjusting the confidence threshold through an artificial intelligence algorithm, identifying all visible teacher and student entities through computer vision, and detecting the position, category and confidence of each entity in the two-dimensional frame;
2) Combining the two-dimensional entity position information with the entity depth information, and performing entity-label mapping via a dynamic tracking algorithm whose optimization target is minimizing the inter-frame entity position offset, i.e. minimizing the sum of the Euclidean-distance offsets of all entities between adjacent frames in three-dimensional space;
3) Extracting and aligning language information through a speech recognition algorithm and a Chinese word vector algorithm, converting unstructured language information into 300-dimensional structured vectors using a Chinese word-vector pre-training model based on a public corpus;
4) Recognizing the expression and action states of each teacher and student entity in each frame, using expression and action recognition models trained on public large-scale data sets, according to the behavior classification coding standard reviewed in the education domain knowledge extraction module.
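Step 1) above can be sketched as follows. This is an illustrative example only: the detection tuples, class names and the threshold value of 0.5 are invented for demonstration and do not appear in the patent.

```python
# Sketch of confidence-threshold filtering applied to raw detector output.
# Each detection is (box, category, confidence), where box is (x, y, w, h)
# in two-dimensional image coordinates. All values here are invented.

def filter_detections(detections, conf_threshold=0.5):
    """Keep only entities whose confidence meets the adjustable threshold."""
    return [d for d in detections if d[2] >= conf_threshold]

raw = [
    ((120, 40, 60, 80), "student", 0.91),
    ((300, 42, 58, 79), "student", 0.34),   # low confidence, dropped
    ((510, 38, 61, 82), "teacher", 0.88),
]
kept = filter_detections(raw, conf_threshold=0.5)
```

Raising the threshold trades recall for precision; the patent only states that the threshold is adjustable, not how it is chosen.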
Preferably, in the teaching activity fusion analysis module, the implementation steps of the multi-mode machine learning method include:
1) Mapping the expression, action and language modal information of the student entity into the same feature space x;
2) Based on the learning activity classification coding standard reviewed in the education domain knowledge extraction module, training a classification model from the modal information of expression, action and language to learning activities, and performing automatic activity-matching coding for student entities.
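Step 1) amounts to placing the per-modality features into one feature space x. A minimal sketch follows, assuming simple vector concatenation as the fusion operator (the patent does not specify the operator) and invented feature dimensions:

```python
# Hypothetical sketch: fusing expression, action and language features into
# one sample x by concatenation. The dimensions are illustrative; the patent
# only fixes the language vector at 300 dimensions, shortened here for clarity.

def fuse_features(expression_vec, action_vec, language_vec):
    """Concatenate per-modality feature vectors into one sample x."""
    return expression_vec + action_vec + language_vec

x = fuse_features([0.1, 0.7], [0.0, 1.0, 0.0], [0.2] * 3)
# len(x) is the sum of the per-modality dimensions (2 + 3 + 3)
```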
Preferably, in the education domain knowledge extraction module, the steps for forming each level of theoretical dimension and index related to classroom input are as follows:
1) Interfacing with the learning behavior analysis module to prepare the learning behavior classification coding standard: from the actions and expressions recognizable by current computers, screening out behaviors and expressions highly relevant to classroom teaching, including head postures, body actions, expressions, speech and interpersonal interaction actions;
2) Interfacing with the teaching activity fusion analysis module to prepare the teaching activity classification coding standard: defining and coding 13 student activity states distilled from the classroom, namely listening, speaking, hands-on experiment/practice, note taking, exercise, computer/PAD operation, hand raising, standing, reading, conversing with the teacher, giving feedback to the teacher, peer discussion and hands-on collaboration, and constructing an automatic teaching activity analysis coding table;
3) Interfacing with the classroom input analysis module to prepare the teaching scene classification coding standard, preparing the corresponding scene codes and scene categories for different scene descriptions;
4) Interfacing with the classroom input analysis module to review and verify the evaluation index dimensions related to classroom input;
5) Interfacing with the classroom input analysis module to determine the weight matrix from scenes and activities to the input evaluation indexes, combining subjective and objective approaches.
Therefore, the man-machine collaborative student classroom input multi-modal visual analysis system with this structure achieves interpretability by decomposing the model's calculation process and introducing domain knowledge, so that the analysis results for general scenes can directly interpret the educational behaviors and actions of teachers and students. It adopts non-invasive multi-modal data acquisition and analysis to analyze learning behaviors and learning input, overcoming the insufficient information content of a single modality and its susceptibility to external factors, which is of important value for improving the accuracy of collaborative learning input analysis. Finally, it introduces the scene into the analysis flow, explores how learning input changes within each scene, and characterizes different learning variables of students in different scenes, such as behavior, language and degree of learning input.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a schematic diagram of the man-machine collaborative student classroom input multi-modal visual analysis system;
FIG. 2 is a schematic diagram of the distribution of the multi-modal data acquisition module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of model training with the XGBoost algorithm in the teaching activity fusion analysis module according to an embodiment of the present invention;
FIG. 4 is a live classroom analysis video output by the visual feedback module according to an embodiment of the present invention;
FIG. 5 is an input-degree change curve output by the visual feedback module according to an embodiment of the present invention.
Reference numerals
1. blackboard; 2. 4K camera; 3. depth camera; M1. multi-modal data acquisition module; M2. learning behavior analysis module; M3. teaching activity fusion analysis module; M4. classroom input analysis module; M5. education domain knowledge extraction module; M6. visual feedback module.
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
Examples
As shown in fig. 1, the man-machine collaborative student classroom input multi-modal visual analysis system comprises a multi-modal data acquisition module M1, a learning behavior analysis module M2, a teaching activity fusion analysis module M3, a classroom input analysis module M4 and a visual feedback module M6 connected in sequence, with an education domain knowledge extraction module M5 adjacent to each of the learning behavior analysis module M2, the teaching activity fusion analysis module M3 and the classroom input analysis module M4;
the multi-mode data acquisition module M1 is used for acquiring original multi-mode data generated in a classroom process, wherein the original multi-mode data comprises classroom two-dimensional video data, classroom depth video data and classroom audio data; the multi-mode data acquisition module M1 is composed of 2 4K cameras 2 and 1 depth camera 3, as shown in FIG. 2, the two 4K cameras 2 are respectively arranged at the left upper corner and the right upper corner of the classroom blackboard 1, the depth camera 3 is arranged at the center of the upper edge of the blackboard 1, the two 4K cameras 2 respectively shoot students at the left half side and the right half side in the classroom, and the central depth camera 3 shoots all students in the classroom forwards.
The learning behavior analysis module M2 computes and preliminarily analyzes students' learning behaviors in real time based on the multi-modal data source, specifically by identifying the modal information of students' expressions, actions and language in class through artificial intelligence algorithms. The implementation method comprises the following steps:
1) Adjusting the confidence threshold through an artificial intelligence algorithm and identifying all visible teacher and student entities through computer vision. Entity recognition uses the Yolo-v5 algorithm, an open-source object detection network released under the GPL-3.0 license, which detects the position, category and confidence of each entity in the two-dimensional frame;
2) Combining the two-dimensional entity position information with the entity depth information, performing entity-label mapping via a dynamic tracking algorithm whose optimization target is minimizing the inter-frame entity position offset, i.e. minimizing the sum of the Euclidean-distance offsets of all entities between adjacent frames in three-dimensional (x, y, z) space. The dynamic tracking algorithm can set a maximum offset threshold L; when the offset of a single entity between adjacent frames is greater than L, that entity's recognition is judged abnormal and the frame is skipped;
3) Extracting and aligning language information through a speech recognition algorithm and a Chinese word vector algorithm, converting unstructured language information into 300-dimensional structured vectors using a Chinese word-vector pre-training model based on a public corpus;
4) Recognizing the expression and action states of each teacher and student entity in each frame, according to the behavior classification coding standard reviewed in the education domain knowledge extraction module M5, using expression and action recognition models trained on public large-scale data sets. Expression recognition uses VGGNet16, and action recognition uses the SlowFast algorithm, an open-source video understanding network released under the Apache-2.0 license.
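The entity-label mapping of step 2) can be sketched as an assignment problem that minimizes the total three-dimensional Euclidean offset between adjacent frames, with the maximum offset threshold L flagging anomalies. The sketch below brute-forces the assignment and uses invented coordinates; a real system would use an efficient method such as the Hungarian algorithm, which the patent does not name.

```python
import math
from itertools import permutations

def track_step(prev, curr, max_offset):
    """Map current-frame entities to previous-frame labels by minimizing the
    sum of 3-D Euclidean offsets; an entity whose offset exceeds max_offset
    (the threshold L in the text) is flagged as a recognition anomaly and
    left out of the mapping."""
    best, best_cost = None, float("inf")
    # Brute-force search over assignments (fine for a handful of entities).
    for perm in permutations(range(len(curr))):
        cost = sum(math.dist(prev[i], curr[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    mapping = {}
    for i, j in enumerate(best):
        if math.dist(prev[i], curr[j]) <= max_offset:
            mapping[i] = j          # label i carries over to detection j
    return mapping

prev = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]   # entity positions, frame t-1
curr = [(5.2, 0.0, 0.0), (0.1, 0.0, 0.0)]   # detections, frame t
mapping = track_step(prev, curr, max_offset=1.0)
```

Here the globally cheapest assignment pairs each previous entity with the nearby current detection, and both offsets fall under the threshold, so both labels carry over.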
The teaching activity fusion analysis module M3 is used for generating higher-level activity information from the analyzed student behavior information, specifically by collaboratively representing the information of each modality as a matched learning activity through a multi-modal machine learning method. The implementation steps are:
1) Mapping the expression, action and language modal information of the student entity into the same feature space x;
2) Based on the learning activity classification coding standard reviewed in the education domain knowledge extraction module M5, training a classification model from the modal information of expression, action and language to learning activities, and performing automatic activity-matching coding for student entities. The XGBoost algorithm is used here to implement the activity matching process.
The essence of the algorithm is to grow decision trees by repeatedly splitting on features, with each round learning a tree that fits the residual between the previous round's predicted values and the actual values; the objective function is minimized via a second-order Taylor expansion, as shown in fig. 3. The objective function is:
Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k)
The squared loss between the actual value and the predicted value is:
l(y_i, ŷ_i) = (y_i − ŷ_i)²
The regularization function is (where T is the number of leaves in the decision tree and ‖w‖² is the squared L2 norm of the tree's leaf scores):
Ω(f) = γT + ½ λ‖w‖²
When model training is complete and k decision trees have been obtained, the score of a sample is predicted as follows: according to its features, the sample falls to a corresponding leaf node in each tree, each leaf node carries a score, and the scores from all trees are summed to obtain the sample's predicted value.
The classroom input analysis module M4 is configured to perform input-degree analysis and calculation for individual students by combining objective learning activity information within a specific scene, where the specific scene is the background category in which teaching activities occur, such as lecturing, exercise and discussion. The specific index dimensions for the input-degree calculation come from the education domain knowledge extraction module M5; the raw value is calculated by multiplying a scene row matrix of m columns, a weight matrix of m rows and n columns, and an activity column matrix of n rows, with the weight matrix derived from the education domain knowledge extraction module M5; and the standard value of each index is calculated from the raw value by the z-score (zero-mean normalization) method.
The z-score formula, z = (x − μ) / σ, normalizes the original data set to a data set with mean 0 and variance 1, where μ and σ are the mean and standard deviation of the original data set, respectively.
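The raw-value and standard-value calculations can be sketched as follows, assuming a one-hot scene row vector (1 x m) multiplied by the weight matrix (m x n) and the activity column vector (n x 1) to give a scalar. The weight matrix, scene labels and activity values are invented for illustration:

```python
# Sketch of the engagement computation: scene row vector x weight matrix x
# activity column vector, then z-score normalization. All numbers invented.

def raw_engagement(scene, weights, activity):
    """scene: length-m one-hot list; weights: m x n; activity: length-n."""
    return sum(
        scene[i] * weights[i][j] * activity[j]
        for i in range(len(scene))
        for j in range(len(activity))
    )

def z_scores(values):
    """Zero-mean normalization: (x - mu) / sigma over the data set."""
    mu = sum(values) / len(values)
    sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mu) / sigma for v in values]

W = [[0.9, 0.2, 0.1],    # hypothetical scene "lecturing"
     [0.3, 0.8, 0.4]]    # hypothetical scene "discussion"
scene = [1, 0]           # current scene: lecturing (one-hot)
activity = [1, 0, 0]     # observed activity: listening (one-hot)
raw = raw_engagement(scene, W, activity)
```

With one-hot vectors the product simply selects the weight W[scene][activity]; with fractional activity proportions it becomes a weighted sum, which matches the matrix product described in the text.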
The education domain knowledge extraction module M5 is configured to consult and combine expert opinions to form the theoretical dimensions and indexes at each level related to classroom input. The steps are as follows:
1) Interfacing with the learning behavior analysis module M2 to prepare the learning behavior classification coding standard. From the actions and expressions recognizable by current computers, 59 behaviors and expressions highly relevant to classroom teaching were selected through discussion with education experts, comprising 6 head postures, 31 body actions, 7 expressions, 2 types of speech and 13 interpersonal interaction actions, as shown in table 1.
TABLE 1 behaviors and expressions related to classroom teaching behaviors
2) Interfacing with the teaching activity fusion analysis module M3 to prepare the teaching activity classification coding standard. Based on existing teaching behavior analysis indexes such as the Flanders interaction analysis coding system and S-T coding, the automatic teaching activity analysis coding table was constructed through discussion with education experts, as shown in table 2.
Table 2 teaching activity automatic analysis coding table
3) Interfacing with the classroom input analysis module M4 to prepare the teaching scene classification coding standard, preparing the corresponding scene codes and scene categories for different scene descriptions, as shown in table 3.
Table 3 teaching scene classification coding standard
4) Interfacing with the classroom input analysis module M4 to review the evaluation index dimensions related to classroom input, as shown in table 4.
Table 4 evaluation index dimension related to classroom input
5) Interfacing with the classroom input analysis module M4 to determine the weight matrix from scenes and activities to each input evaluation index, combining subjective and objective approaches. This step may employ the Delphi method, a technique for reaching expert consensus on a specific topic: first, 10 to 30 panel members with professional representativeness and authority are selected; then, through two rounds of index questionnaires and one round of weight determination, the m x n matrix W_k is determined that maps the n classes of learning activities under the m teaching scenes to the corresponding calculation index k, with the elements of the matrix taking values between 0 and 1.
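The aggregation of expert ratings into the weight matrix W_k can be sketched as an element-wise mean over the panel's rating matrices. The patent does not specify the aggregation rule, so this averaging scheme and all rating values are assumptions:

```python
# Hypothetical sketch: each expert submits an m x n matrix of ratings in
# [0, 1]; the panel consensus W_k is taken here as the element-wise mean.

def aggregate_weights(expert_matrices):
    """Element-wise mean of the experts' m x n rating matrices."""
    m, n = len(expert_matrices[0]), len(expert_matrices[0][0])
    return [
        [sum(mat[i][j] for mat in expert_matrices) / len(expert_matrices)
         for j in range(n)]
        for i in range(m)
    ]

panel = [
    [[0.8, 0.2], [0.4, 0.6]],   # expert 1 (values invented)
    [[0.6, 0.4], [0.2, 0.8]],   # expert 2 (values invented)
]
W_k = aggregate_weights(panel)
```

Averaging ratings that each lie in [0, 1] guarantees the aggregated elements also lie in [0, 1], consistent with the range stated above; a real Delphi process would iterate this with feedback between rounds until consensus.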
The visual feedback module M6 is used for visually outputting the behavior and activity recognition results and the evaluation index calculation results related to classroom input. The visual output takes the form of video and images, comprising:
1) The aligned multi-modal data source information, dynamic tracking information, character behavior information and activity matching information, output as a live classroom analysis video, as shown in fig. 4.
2) The learning engagement index scores of each student in each scene during the classroom process, output as input-degree change curves, as shown in fig. 5.
The above system is deployed on a computer device comprising a memory, a processor, a display adapter, a communication interface and a communication bus; the memory stores a computer program executable on the processor which, when executed, implements the steps of the embodiments described above.
Therefore, the man-machine collaborative student class input multi-mode visual analysis system adopting the structure has the following beneficial effects:
1) It integrates domain knowledge to improve the interpretability of the analysis in a man-machine collaborative way. To support an interpretable calculation process, the analysis framework introduces the experience of education domain experts at each key node; the experts are consulted via the Delphi method to obtain coding tables for basic actions, teaching activities, teaching scenes and input states, providing a knowledge base for interpretable calculation.
2) It uses teaching activities as a pivot to enhance the generality of the framework. Teaching activities are the basis and key link of classroom observation and analysis. Grounded in relevant education theory, the analysis framework uses teaching activities as a bridge between the low-level features recognizable by computers and high-level educational semantics: teaching behaviors are analyzed with an end-to-end method, and the high-level semantic concept of learning input is then calculated through expert weighting, realizing fully automatic whole-process analysis.
3) It automatically integrates whole-process multi-modal analysis with scenes. Multi-modal data acquisition, analysis and fusion methods realize comprehensive analysis of teachers' and students' multi-modal teaching behaviors across four aspects: language, action, expression and head posture. At the same time, teaching scenes are automatically identified on the basis of teaching behaviors, realizing scene-aware calculation of learning input.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made to the technical solution of the invention without departing from its spirit and scope.

Claims (3)

1. A man-machine collaborative student classroom input multi-modal visual analysis system, characterized in that: the system comprises a multi-modal data acquisition module, a learning behavior analysis module, a teaching activity fusion analysis module, a classroom input analysis module and a visual feedback module connected in sequence, with an education domain knowledge extraction module adjacent to each of the learning behavior analysis module, the teaching activity fusion analysis module and the classroom input analysis module;
the multi-mode data acquisition module is used for acquiring original multi-mode data generated in the classroom process, wherein the original multi-mode data comprises classroom two-dimensional video data, classroom depth video data and classroom audio data;
the learning behavior analysis module calculates and primarily analyzes the learning behavior of the students on the basis of the multi-mode data source in real time, and is specifically embodied as identifying the modal information of the expressions, actions and languages of the students in the class through an artificial intelligent algorithm;
the teaching activity fusion analysis module is used for generating higher-level activity information based on student behavior information analysis, and is specifically embodied in that each mode of information is cooperatively represented as matched learning activities through a multi-mode machine learning method, and the implementation steps of the multi-mode machine learning method comprise:
1) Mapping the expression, action and language mode information of the student entity into the same feature space x;
2) Training a classification model from expression, action and language modal information to learning activities based on the learning activity classification coding standard examined in the education field knowledge extraction module, and performing automatic student entity activity matching coding;
the classroom input analysis module is used for performing input-degree analysis and calculation for individual students by combining objective learning activity information within a scene, wherein the scene is the background category in which teaching activities occur, including lecturing, exercise and discussion; the specific index dimensions of the input-degree calculation are obtained from the education field knowledge extraction module; the raw value of the input-degree calculation is obtained by multiplying a scene row matrix of m columns, a weight matrix of m rows and n columns, and an activity column matrix of n rows, the weight matrix being obtained from the education field knowledge extraction module; and the standard value of the index is calculated from the raw value by a zero-mean normalization method;
the education field knowledge extraction module is used for inquiring and combining expert opinions to form theoretical dimensions and indexes of each level related to classroom investment, wherein the theoretical dimensions and indexes of each level comprise: a learning behavior classification standard related to classroom input, a learning activity classification standard related to classroom input, a teaching scene classification standard related to classroom input, a classroom input measurement dimension and index, and a weight matrix of each learning activity corresponding to each measurement index in each teaching scene; the steps for forming theoretical dimensions and indexes of each level related to classroom input are as follows:
1) Interfacing with the learning behavior analysis module to prepare a learning behavior classification coding standard: according to the actions and expressions that current computer vision techniques can identify, screening out the behaviors and expressions highly relevant to classroom teaching, including head posture, body movements, expressions, speech and interpersonal interaction;
2) Interfacing with the teaching activity fusion analysis module to prepare a teaching activity classification coding standard: carrying out coding interpretation on the 13 student activity states separated from the classroom (listening and speaking, hands-on experiment/practice, note taking, exercise, computer/PAD operation, hand raising, standing, reading, conversation with the teacher, giving feedback to the teacher, peer discussion and hands-on cooperation), and constructing an automatic teaching activity analysis coding table;
3) Interfacing with the classroom input degree analysis module to prepare a teaching scene classification coding standard, assigning a corresponding scene code and scene category to each scene description;
4) Interfacing with the classroom input degree analysis module, reviewing and verifying the evaluation index dimensions related to classroom input;
5) Interfacing with the classroom input degree analysis module, determining the weight matrix from scenes and activities to the input evaluation indexes by combining subjective and objective methods;
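Step 5) can be sketched as blending an expert-assigned (subjective) weight table with a data-driven (objective) one. The patent only states that the two are combined; the 50/50 blend, the entropy-weight framing, the matrix shapes, and all numbers below are assumptions.

```python
import numpy as np

# expert-assigned (subjective) weights, one row per scene (values assumed)
subjective = np.array([[0.5, 0.3, 0.2],
                       [0.2, 0.5, 0.3]])
# data-driven (objective) weights, e.g. from an entropy-weight calculation
objective = np.array([[0.4, 0.4, 0.2],
                      [0.3, 0.3, 0.4]])

W = 0.5 * subjective + 0.5 * objective  # blend the two sources (ratio assumed)
W = W / W.sum(axis=1, keepdims=True)    # renormalize each scene's row to sum to 1
```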
the visual feedback module is used for visually outputting the behavior and activity recognition results and the calculated evaluation indexes related to classroom input, calculating each student's learning input index score in each scene over the course of the class, outputting the evaluation results as an input degree change curve, and outputting video and images as the visual output.
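The per-student, per-scene scores behind the input degree change curve can be sketched as a grouped mean over frames. The frame counts, scene labels, and synthetic raw values below are illustrative assumptions, not data from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
n_students, n_frames = 4, 12
raw = rng.random((n_students, n_frames))          # raw input degree per frame
scene_of_frame = np.array([0]*4 + [1]*4 + [2]*4)  # scene label of each frame

# per-student, per-scene mean score: the values behind the output curve
per_scene = np.stack(
    [raw[:, scene_of_frame == s].mean(axis=1) for s in range(3)],
    axis=1,
)
```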
2. The human-computer collaborative student class input multi-mode visual analysis system according to claim 1, wherein: the multi-mode data acquisition module consists of two 4K cameras and one depth camera; the two 4K cameras are mounted at the upper-left and upper-right corners of the classroom blackboard and respectively capture the students on the left and right halves of the classroom, and the depth camera is mounted at the center of the blackboard's upper edge and captures all students in the classroom from the front.
3. The human-computer collaborative student class input multi-mode visual analysis system according to claim 1, wherein: in the learning behavior analysis module, the method for identifying student modal information in class through artificial intelligence algorithms comprises the following steps:
1) Adjusting the confidence threshold of the artificial intelligence algorithm and identifying all visible teacher and student entities through computer vision, detecting the position, category and confidence of each entity in the two-dimensional picture;
2) Combining the two-dimensional entity position information with the entity depth information, performing entity-label mapping through a dynamic tracking algorithm whose optimization target is minimizing the inter-frame entity position offset, i.e. minimizing the sum of the Euclidean distances in three-dimensional space between corresponding entities in adjacent frames;
3) Extracting and aligning language information through a speech recognition algorithm and a Chinese word vector algorithm, applying a Chinese word vector pre-training model based on a public corpus to convert unstructured language information into 300-dimensional structured vectors;
4) Recognizing the expression and action states of the teacher and student entities in each frame using expression and action recognition models trained on public large-scale datasets, based on the behavior classification coding standard reviewed in the education field knowledge extraction module.
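Step 2) of the method above is, in effect, an assignment problem: match each entity in the previous frame to a detection in the current frame so that the total 3-D Euclidean offset is minimized. A Hungarian-algorithm solver is used here as a stand-in for the patent's dynamic tracking algorithm, with synthetic positions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# 3-D positions (x, y, depth) of entities in two adjacent frames (synthetic)
prev = np.array([[0.0, 0.0, 2.0],
                 [1.0, 0.0, 2.5],
                 [2.0, 1.0, 3.0]])
curr = np.array([[2.1, 1.0, 3.0],
                 [0.1, 0.0, 2.0],
                 [1.0, 0.1, 2.5]])

cost = cdist(prev, curr)                # pairwise Euclidean distances
row, col = linear_sum_assignment(cost)  # minimize the summed offset
mapping = {int(r): int(c) for r, c in zip(row, col)}  # prev label -> detection
```

Each previous-frame entity keeps its label by inheriting it through `mapping`, which is what lets identities persist across frames.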
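Step 3) can be sketched by averaging pretrained 300-dimensional word vectors over the tokens of a recognized utterance. The toy vocabulary and averaging scheme below are assumptions standing in for a public-corpus Chinese word-vector model; the patent specifies only the 300-dimensional output.

```python
import numpy as np

DIM = 300
rng = np.random.default_rng(2)
# toy vocabulary standing in for a pretrained word-vector model
word_vectors = {w: rng.normal(size=DIM) for w in ["please", "open", "the", "book"]}

def sentence_vector(tokens):
    """Average the word vectors of known tokens; zero vector if none are known."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

v = sentence_vector(["please", "open", "the", "book"])
```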
CN202211621966.8A 2022-12-16 2022-12-16 Multi-mode visual analysis system for class investment of students through man-machine cooperation Active CN115984956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211621966.8A CN115984956B (en) 2022-12-16 2022-12-16 Multi-mode visual analysis system for class investment of students through man-machine cooperation


Publications (2)

Publication Number Publication Date
CN115984956A CN115984956A (en) 2023-04-18
CN115984956B true CN115984956B (en) 2023-08-29

Family

ID=85973230


Country Status (1)

Country Link
CN (1) CN115984956B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351575B (en) * 2023-12-05 2024-02-27 北京师范大学珠海校区 Nonverbal behavior recognition method and nonverbal behavior recognition device based on text-generated graph data enhancement model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805009A (en) * 2018-04-20 2018-11-13 华中师范大学 Classroom learning state monitoring method based on multimodal information fusion and system
CN109697577A (en) * 2019-02-01 2019-04-30 北京清帆科技有限公司 A kind of voice-based Classroom instruction quality evaluation method
CN111275760A (en) * 2020-01-16 2020-06-12 上海工程技术大学 Unmanned aerial vehicle target tracking system and method based on 5G and depth image information
CN114708525A (en) * 2022-03-04 2022-07-05 河北工程大学 Deep learning-based student classroom behavior identification method and system
CN115146975A (en) * 2022-07-08 2022-10-04 华中师范大学 Teacher-machine-student oriented teaching effect evaluation method and system based on deep learning
CN115239527A (en) * 2022-06-27 2022-10-25 重庆市科学技术研究院 Teaching behavior analysis system for teaching characteristic fusion and modeling based on knowledge base

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8682241B2 (en) * 2009-05-12 2014-03-25 International Business Machines Corporation Method and system for improving the quality of teaching through analysis using a virtual teaching device
US20180366013A1 (en) * 2014-08-28 2018-12-20 Ideaphora India Private Limited System and method for providing an interactive visual learning environment for creation, presentation, sharing, organizing and analysis of knowledge on subject matter


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhao Jie et al. Intelligent Robot Technology: Research and Practice of Security, Patrol and Disposal Police Robots. China Machine Press, 2021, pp. 266-268. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant