CN115984956A - Human-machine collaborative multi-modal visual analysis system for student classroom engagement - Google Patents

Human-machine collaborative multi-modal visual analysis system for student classroom engagement

Info

Publication number
CN115984956A
CN115984956A (application CN202211621966.8A)
Authority
CN
China
Prior art keywords
classroom
module
learning
analysis module
activity
Prior art date
Legal status
Granted
Application number
CN202211621966.8A
Other languages
Chinese (zh)
Other versions
CN115984956B (en)
Inventor
蒋艳双
祁彬斌
包昊罡
黄荣怀
刘德建
Current Assignee
Beijing Normal University
Original Assignee
Beijing Normal University
Priority date
Filing date
Publication date
Application filed by Beijing Normal University
Priority to CN202211621966.8A
Publication of CN115984956A
Application granted
Publication of CN115984956B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a human-machine collaborative multi-modal visual analysis system for student classroom engagement. The system comprises a multi-modal data acquisition module, a learning behavior analysis module, a teaching activity fusion analysis module, a classroom engagement analysis module and a visual feedback module connected in sequence; the learning behavior analysis module, the teaching activity fusion analysis module and the classroom engagement analysis module each also connect to an educational domain knowledge extraction module. With this structure, the system collects and analyzes multi-modal student data and, according to established engagement-related classification standards and indices, comprehensively evaluates the learning engagement of students in different scenes and feeds the results back visually.

Description

Human-machine collaborative multi-modal visual analysis system for student classroom engagement
Technical Field
The invention relates to the technical field of intelligent teaching, and in particular to a human-machine collaborative multi-modal visual analysis system for student classroom engagement.
Background
Analysis of student classroom engagement is an important foundation of educational measurement and learning analytics. Existing analysis techniques fall into two categories: observational rating based on structured scales, and coding analysis based on objective behavioral metrics. Observational rating uses a set of engagement rating items as its measurement instrument, but because observers differ in attention and experience, the rating standard is unstable and highly subjective; the approach lacks an automated pipeline for large-scale deployment and suffers from low measurement reliability, high analysis cost and a single mode of result presentation. Coding analysis uses a manual behavior coding scheme, or an automatic one based on computer vision, to decompose a classroom scene into a discrete sequence of student behaviors, and typically takes the proportion of specific first-order or higher-order (transition) behaviors in the sequence as an objective engagement index. It is objective enough to support large-scale deployment, but it is confined to isolated explicit behaviors, excludes human expertise from the loop, and suffers from low measurement efficiency, insufficient interpretability and a single source data modality.
Disclosure of Invention
The invention aims to provide a human-machine collaborative multi-modal visual analysis system for student classroom engagement, which collects and analyzes multi-modal student data and, according to established engagement-related classification standards and indices, comprehensively evaluates and visually feeds back the learning engagement of students in different scenes.
To achieve this aim, the invention provides a human-machine collaborative multi-modal visual analysis system for student classroom engagement, comprising a multi-modal data acquisition module, a learning behavior analysis module, a teaching activity fusion analysis module, a classroom engagement analysis module and a visual feedback module connected in sequence, wherein the learning behavior analysis module, the teaching activity fusion analysis module and the classroom engagement analysis module each also connect to an educational domain knowledge extraction module;
the system comprises a multi-mode data acquisition module, a multi-mode data processing module and a multi-mode data processing module, wherein the multi-mode data acquisition module is used for acquiring original multi-mode data generated in the classroom process, and the original multi-mode data comprises classroom two-dimensional video data, classroom depth video data and classroom audio data;
the learning behavior analysis module calculates and preliminarily analyzes the learning behaviors of the students in real time based on the multi-mode data source, and specifically embodies that the modal information of expressions, actions and languages of the students in a classroom is identified through an artificial intelligence algorithm;
the teaching activity fusion analysis module is used for generating higher-level activity information based on the behavior information analysis of students, and is specifically embodied in that each modal information is cooperatively expressed as matched learning activity by a multi-modal machine learning method;
the classroom investment analysis module is used for analyzing and calculating the investment of individual students in a specific scene in a combined objective learning activity information manner, wherein the specific scene is a background category of occurrence of teaching activities and comprises teaching, practice and discussion, the specific index dimension of the investment analysis and calculation comes from the education field knowledge extraction module, the original value of the specific index dimension is calculated by multiplying a scene row matrix of m columns, a weight matrix of m rows and n columns and an activity column matrix of n rows, the weight matrix comes from the education field knowledge extraction module, and the standard value of the index is calculated by a zero-mean standardization method on the basis of the original value;
the education field knowledge extraction module is used for inquiring and combining expert opinions to form theoretical dimensions and indexes of each level related to classroom input, and the theoretical dimensions and indexes of each level comprise: the learning method comprises the following steps of (1) learning behavior classification standards related to classroom input, learning activity classification standards related to classroom input, teaching scene classification standards related to classroom input, classroom input measurement dimension and index, and weight matrixes of each learning activity corresponding to each measurement index in each teaching scene;
and the visual feedback module is used for visually outputting behaviors and activity recognition results related to classroom investment and evaluation index calculation results, calculating the learning investment index scores of students in various scenes in the classroom process, outputting the evaluation index calculation results as an investment degree change curve, and outputting the visual output mode in a video and image output mode.
Preferably, the multi-modal data acquisition module consists of two 4K cameras and one depth camera: the two 4K cameras are mounted at the upper-left and upper-right corners of the classroom blackboard and film the students in the left and right halves of the classroom respectively, while the depth camera is mounted at the center of the upper edge of the blackboard and films all students in the classroom head-on.
Preferably, in the learning behavior analysis module, the method for identifying the modal information of students in the classroom through artificial intelligence algorithms comprises the following steps:
1) adjusting the confidence threshold of the artificial intelligence algorithm and identifying all visible teacher and student entities with computer vision, detecting the position, category and confidence of each entity in the two-dimensional picture;
2) combining the two-dimensional entity positions with the entity depth information and performing entity-to-label mapping with a dynamic tracking algorithm whose optimization objective is to minimize the inter-frame entity position offset, i.e. the sum over all entities of the Euclidean distances between their positions in adjacent frames in three-dimensional space;
3) extracting and aligning language information through a speech recognition algorithm and a Chinese word vector algorithm, converting unstructured utterances into 300-dimensional structured vectors with a Chinese word vector model pre-trained on a public corpus;
4) identifying the expression and action states of teachers and students in each frame with expression and action recognition models trained on public large-scale datasets, following the action classification coding standard approved in the educational domain knowledge extraction module.
Preferably, in the teaching activity fusion analysis module, the multi-modal machine learning method comprises the following steps:
1) mapping the expression, action and language modal information of each student entity into a common feature space x;
2) training a classification model from the expression, action and language modal information to learning activities, based on the learning activity classification coding standard approved in the educational domain knowledge extraction module, and performing automatic activity matching and coding for each student entity.
Preferably, in the educational domain knowledge extraction module, the theoretical dimensions and indices related to classroom engagement are formed at each level through the following steps:
1) interfacing with the learning behavior analysis module to formulate the learning behavior classification coding standard: from the actions and expressions that current computer vision can recognize, screening out the behaviors and expressions highly related to classroom teaching, covering head poses, body movements, facial expressions, speech and interpersonal interaction;
2) interfacing with the teaching activity fusion analysis module to formulate the teaching activity classification coding standard: coding and annotating 13 activity states, namely listening, hands-on experiment/practice, note taking, doing exercises, operating a computer/PAD, raising a hand, standing up, reading, talking with the teacher, giving feedback to the teacher, discussing with peers, hands-on collaboration and being off-task, and constructing an automatic teaching activity analysis coding table;
3) interfacing with the classroom engagement analysis module to formulate the teaching scene classification coding standard, specifying the scene code and scene category corresponding to each scene description;
4) interfacing with the classroom engagement analysis module to approve the evaluation index dimensions related to classroom engagement;
5) interfacing with the classroom engagement analysis module to determine, by combining subjective and objective judgment, the weight matrices from scenes and activities to each engagement evaluation index.
Accordingly, the human-machine collaborative multi-modal visual analysis system for student classroom engagement with this structure achieves interpretability by decomposing the model computation process and introducing domain knowledge, so that the analysis results of generic scenes translate directly into educational explanations of teacher and student behaviors and actions. By analyzing learning behavior and learning engagement with non-intrusive multi-modal data acquisition and analysis, it overcomes the insufficiency of single-modality information and its susceptibility to external factors, which is of significant value for improving the accuracy of collaborative learning engagement analysis. And by introducing scenes into the analysis process, it explores how learning engagement varies across scenes and characterizes learning variables such as student behavior, language and engagement degree in different scenes.
The technical solution of the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
Drawings
FIG. 1 is a schematic structural diagram of the human-machine collaborative multi-modal visual analysis system for student classroom engagement according to the present invention;
FIG. 2 is a schematic diagram of the physical layout of the multi-modal data acquisition module according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of the model training process of the XGBoost algorithm in the teaching activity fusion analysis module according to the embodiment of the present invention;
FIG. 4 is a classroom live analysis video output by the visual feedback module according to the embodiment of the present invention;
FIG. 5 is an engagement variation curve output by the visual feedback module according to the embodiment of the present invention.
Reference numerals
1. blackboard; 2. 4K camera; 3. depth camera; M1. multi-modal data acquisition module; M2. learning behavior analysis module; M3. teaching activity fusion analysis module; M4. classroom engagement analysis module; M5. educational domain knowledge extraction module; M6. visual feedback module.
Detailed Description
The technical solution of the invention is further explained below with reference to the accompanying drawings and the embodiment.
Embodiment
As shown in fig. 1, a human-machine collaborative multi-modal visual analysis system for student classroom engagement comprises a multi-modal data acquisition module M1, a learning behavior analysis module M2, a teaching activity fusion analysis module M3, a classroom engagement analysis module M4 and a visual feedback module M6 connected in sequence, wherein the learning behavior analysis module M2, the teaching activity fusion analysis module M3 and the classroom engagement analysis module M4 each also connect to an educational domain knowledge extraction module M5;
the multi-mode data acquisition module M1 is used for acquiring original multi-mode data generated in a classroom process, wherein the original multi-mode data comprises classroom two-dimensional video data, classroom depth video data and classroom audio data; the multimode data acquisition module M1 is composed of 2 4K cameras 2 and 1 depth camera 3, as shown in figure 2, two 4K cameras 2 are respectively arranged at the upper left corner and the upper right corner of a blackboard 1 of a classroom, the depth camera 3 is arranged at the center of the upper edge of the blackboard 1, the two 4K cameras 2 respectively shoot students at the left half side and the right half side in the classroom, and the depth camera 3 at the center shoots all students in the classroom forwards.
The learning behavior analysis module M2 computes and preliminarily analyzes student learning behaviors in real time from the multi-modal data sources; concretely, it identifies the expression, action and language modalities of students in the classroom through artificial intelligence algorithms, implemented in the following steps:
1) Adjust the confidence threshold of the artificial intelligence algorithm and identify all visible teacher and student entities with computer vision. Entity detection uses the YOLOv5 algorithm, an open-source object detection network released under the GPL-3.0 license, which detects the position, category and confidence of each entity in the two-dimensional picture;
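By way of illustration, the following is a minimal sketch of how this detection step could be wired up with the open-source YOLOv5 release; the model variant (yolov5s), the threshold value of 0.4 and the image filename are assumptions for the example, not values specified by the patent.

```python
import torch

# Load an open-source YOLOv5 model from the Ultralytics hub
# (the patent names YOLOv5 but not a variant; yolov5s is assumed here).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.conf = 0.4  # adjusted confidence threshold (illustrative value)

# Run detection on one 2D frame captured by a 4K camera.
results = model('classroom_frame.jpg')
detections = results.pandas().xyxy[0]  # xmin, ymin, xmax, ymax, confidence, class, name

# Keep 'person' detections as candidate teacher/student entities.
persons = detections[detections['name'] == 'person']
print(persons[['xmin', 'ymin', 'xmax', 'ymax', 'confidence']])
```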
2) Combine the two-dimensional entity positions with the entity depth information and perform entity-to-label mapping with a dynamic tracking algorithm whose optimization objective is to minimize the inter-frame entity position offset, i.e. the sum over all entities of the Euclidean distances between their positions in adjacent frames in three-dimensional space (x, y, z). The tracking algorithm can set a maximum offset threshold L: when the inter-frame offset of a single entity exceeds L, that entity is judged anomalous and skipped for that frame;
$\min \sum_{i} \left\| p_i^{(t+1)} - p_i^{(t)} \right\|_2, \quad p_i = (x_i, y_i, z_i)$
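The entity-to-label mapping can be read as a linear assignment problem over inter-frame 3D distances. The snippet below is a schematic sketch, not the patent's implementation: it uses scipy's assignment solver to minimize the summed offsets exactly, and the threshold value of 0.5 is an assumed illustration of L.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def track_entities(prev_pos, curr_pos, max_offset):
    """Map entity labels between adjacent frames by minimizing the total
    3D Euclidean offset. prev_pos, curr_pos: (N, 3) arrays of (x, y, z)
    centroids built from 2D detections plus depth. Pairs whose offset
    exceeds max_offset are treated as anomalies and skipped."""
    dist = np.linalg.norm(prev_pos[:, None, :] - curr_pos[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(dist)  # minimizes the summed offsets
    return [(r, c) for r, c in zip(rows, cols) if dist[r, c] <= max_offset]

# Illustrative usage with three entities and an assumed threshold L = 0.5.
prev_frame = np.array([[1.0, 2.0, 4.0], [3.0, 2.0, 5.0], [5.0, 1.0, 6.0]])
curr_frame = np.array([[3.1, 2.0, 5.0], [1.0, 2.1, 4.0], [9.0, 9.0, 9.0]])
print(track_entities(prev_frame, curr_frame, max_offset=0.5))
# -> [(0, 1), (1, 0)]; the third entity moved too far and is skipped.
```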
3) Extract and align language information through a speech recognition algorithm and a Chinese word vector algorithm, converting unstructured utterances into 300-dimensional structured vectors with a Chinese word vector model pre-trained on a public corpus;
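A sketch of this sentence-to-vector step follows; the specific model file, the jieba segmenter and the mean-pooling strategy are assumptions, since the patent only fixes a 300-dimensional word-vector model pre-trained on a public corpus.

```python
import numpy as np
import jieba
from gensim.models import KeyedVectors

# Assumed: a 300-d Chinese word-vector model in word2vec text format,
# e.g. one of the public "Chinese-Word-Vectors" releases (filename assumed).
wv = KeyedVectors.load_word2vec_format('sgns.wiki.word', binary=False)

def utterance_to_vector(text: str) -> np.ndarray:
    """Convert one recognized utterance into a 300-d structured vector by
    mean-pooling the vectors of its in-vocabulary words (pooling strategy
    is an assumption, not specified in the patent)."""
    words = [w for w in jieba.cut(text) if w in wv.key_to_index]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

vec = utterance_to_vector('这道题我们再讨论一下')
print(vec.shape)  # (300,)
```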
4) Identify the expression and action states of teachers and students in each frame with expression and action recognition models trained on public large-scale datasets, following the action classification coding standard approved in the educational domain knowledge extraction module M5. Expression and action recognition use the VGGNet16 and SlowFast algorithms; SlowFast is an open-source video understanding network released under the Apache-2.0 license.
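A sketch of how the two recognizers could be instantiated from public releases; the pre-trained weights, the 7-class expression head and the input shape are illustrative assumptions (training details are not given in the patent).

```python
import torch
import torchvision

# Expression recognizer: VGGNet16 backbone with its final layer replaced
# by a 7-class head (Table 1 lists 7 expressions); fine-tuning omitted.
vgg16 = torchvision.models.vgg16(weights='IMAGENET1K_V1')
vgg16.classifier[6] = torch.nn.Linear(4096, 7)

# Action recognizer: the open-source SlowFast video understanding network.
slowfast = torch.hub.load('facebookresearch/pytorchvideo',
                          'slowfast_r50', pretrained=True)

# One face crop per tracked entity -> per-frame expression logits.
face_crop = torch.randn(1, 3, 224, 224)
print(vgg16(face_crop).shape)  # torch.Size([1, 7])
```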
The teaching activity fusion analysis module M3 generates higher-level activity information from the analyzed student behavior information; concretely, it uses multi-modal machine learning to jointly map the modal cues to the matching learning activity, implemented in the following steps:
1) Map the expression, action and language modal information of each student entity into a common feature space x;
2) Based on the learning activity classification coding standard approved in the educational domain knowledge extraction module M5, train a classification model from the expression, action and language modal information to learning activities, and perform automatic activity matching and coding for each student entity. The XGBoost algorithm is used here to implement the activity matching process.
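A sketch of the activity classifier with the xgboost library; the feature dimension and the synthetic training data are placeholders, while the 13 activity codes follow Table 2.

```python
import numpy as np
from xgboost import XGBClassifier

# Feature space x: fused per-entity expression, action and language features
# (the 340-dimension total is an illustrative placeholder, e.g. 40 + 300).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 340))
y = rng.integers(0, 13, size=2000)   # 13 activity codes from Table 2

# Gradient-boosted trees; each round fits the residual of the previous round.
clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
clf.fit(X, y)
print(clf.predict(X[:1]))            # automatic activity matching code
```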
As shown in fig. 3, the essence of the algorithm is to grow decision trees by repeated feature splitting; each round learns one tree that fits the residual between the previous round's prediction and the actual value, and the objective function is minimized via its second-order Taylor expansion. The objective function is:
$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k)$
where the squared loss between the actual and predicted values is:
$l\left(y_i, \hat{y}_i\right) = \left(y_i - \hat{y}_i\right)^2$
and the regularization term is (where T is the number of leaves in the decision tree and w is the vector of leaf scores, whose squared L2 norm is penalized):
$\Omega(f) = \gamma T + \frac{1}{2} \lambda \left\| w \right\|^2$
when the model training is finished to obtain k decision trees, if the score of a sample is to be predicted, a corresponding leaf node is fallen in each tree according to the characteristics of the sample, each leaf node corresponds to a corresponding score, and finally the scores corresponding to each tree are added to obtain the predicted value of the sample.
The classroom engagement analysis module M4 combines the objective learning activity information to compute the engagement of students in a specific scene, where a scene is the background category in which a teaching activity occurs, such as lecturing, practice or discussion. The index dimensions used in the computation come from the educational domain knowledge extraction module M5; the raw value of each index is the product of a 1×m scene row vector, an m×n weight matrix and an n×1 activity column vector, with the weight matrix also derived from module M5; the standardized value of the index is then obtained from the raw values by the z-score (zero-mean standardization) method.
$z = \frac{x - \mu}{\sigma}$

This formula normalizes the raw data set to one with mean 0 and variance 1, where μ and σ are the mean and standard deviation of the raw data set, respectively.
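The index computation can be illustrated numerically; only the shapes (1×m scene row, m×n weights, n×1 activity column) and the z-score step come from the text above, and all matrix contents below are invented placeholders.

```python
import numpy as np

m, n = 3, 13                                  # scenes x activity states
scene = np.zeros((1, m)); scene[0, 1] = 1.0   # one-hot row: current scene
activity = np.zeros((n, 1)); activity[4, 0] = 1.0  # one-hot column: activity
W = np.full((m, n), 0.5)                      # weight matrix W_k from module M5
                                              # (Delphi-derived; placeholder values)

raw = (scene @ W @ activity).item()           # raw value of index k, a scalar

# Zero-mean standardization over the raw values collected during the session.
raw_series = np.array([0.2, 0.5, 0.5, 0.8, raw])
z = (raw_series - raw_series.mean()) / raw_series.std()
print(raw, round(z[-1], 3))
```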
The educational domain knowledge extraction module M5 consults and consolidates expert opinion to form the theoretical dimensions and indices, at every level, related to classroom engagement, comprising: an engagement-related learning behavior classification standard, an engagement-related learning activity classification standard, an engagement-related teaching scene classification standard, the measurement dimensions and indices of classroom engagement, and the weight matrices mapping each learning activity to each measurement index in each teaching scene. The steps are as follows:
1) Interface with the learning behavior analysis module M2 to formulate the learning behavior classification coding standard. From the actions and expressions that current computer vision can recognize, 59 items highly related to classroom teaching were screened out through discussion with education experts, comprising 6 head poses, 31 body movements, 7 facial expressions, 2 classes of speech and 13 interpersonal interaction actions, as shown in Table 1.
TABLE 1 Actions and expressions related to classroom teaching behaviors (59 items; table not reproduced)
2) Interface with the teaching activity fusion analysis module M3 to formulate the teaching activity classification coding standard. Building on existing classroom teaching behavior analysis instruments such as the Flanders interaction analysis coding system and S-T coding, an automatic teaching activity analysis coding table was constructed through discussion with education experts, as shown in Table 2.
TABLE 2 Automatic teaching activity analysis coding table (table not reproduced)
3) Interface with the classroom engagement analysis module M4 to formulate the teaching scene classification coding standard, specifying the scene code and scene category corresponding to each scene description, as shown in Table 3.
TABLE 3 Teaching scene classification coding standard (table not reproduced)
4) Interface with the classroom engagement analysis module M4 to approve the evaluation index dimensions related to classroom engagement, as shown in Table 4.
TABLE 4 Evaluation index dimensions related to classroom engagement (table not reproduced)
5) Interface with the classroom engagement analysis module M4 to determine, by combining subjective and objective judgment, the weight matrices from scenes and activities to each engagement evaluation index. This step can be performed with the Delphi method, a procedure for reaching expert consensus on a specific topic: first, 10 to 30 professionally representative and authoritative panel members are selected; then, through two rounds of index consultation and one round of weight determination, the m×n matrix W_k is determined, which maps the n types of learning activities under the m teaching scenes to evaluation index k. Its elements take values between 0 and 1:

$W_k = \left(w_{ij}\right)_{m \times n}, \quad 0 \le w_{ij} \le 1$
The visual feedback module M6 visually outputs the engagement-related behavior and activity recognition results and the computed evaluation indices. The visual output takes the form of video and images, produced in the following steps:
1) Align the multi-modal source data, dynamic tracking information, person behavior information and activity matching information, and output them as a classroom live analysis video, as shown in fig. 4.
2) Compute the learning engagement index score of each student in each scene over the course of the class and output it as an engagement variation curve, as shown in fig. 5.
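A sketch of this curve-output step with matplotlib; the per-minute scores below are synthetic, and the sampling interval is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic per-minute standardized engagement scores for one student.
t = np.arange(45)                                         # minutes into class
scores = np.clip(np.cumsum(np.random.default_rng(1).normal(0, 0.3, 45)), -2, 2)

plt.plot(t, scores, label='student 01')
plt.axhline(0.0, linestyle='--', linewidth=0.8)           # session mean (z = 0)
plt.xlabel('time (min)')
plt.ylabel('learning engagement index (z-score)')
plt.title('Engagement variation curve (cf. FIG. 5)')
plt.legend()
plt.savefig('engagement_curve.png')
```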
The above system is deployed on a computer device comprising a memory, a processor, a display adapter, a communication interface and a communication bus. The memory stores a computer program executable on the processor, and the steps of the above embodiment are realized when the processor executes the program.
Hence, the human-machine collaborative multi-modal visual analysis system for student classroom engagement with this structure has the following beneficial effects:
1) Human-machine collaboration: integrating domain knowledge improves the interpretability of the analysis. To keep the computation process interpretable, the analysis framework introduces the experience of education domain experts at every key node; education experts were consulted with the Delphi method to obtain coding tables for basic actions, teaching activities, teaching scenes and engagement states, providing a knowledge base for interpretable computation.
2) Teaching activities as the hinge, enhancing the generality of the framework. Teaching activities are the basis and key link of classroom observation and analysis. Grounded in relevant education theory, the analysis framework uses teaching activities as a bridge between the low-level features a computer can recognize and high-level educational semantics. Teaching behaviors are analyzed end to end, and higher-level semantic concepts such as learning engagement are then computed with expert-derived weights, realizing automatic analysis of the whole process.
3) Whole-process multi-modal analysis automatically fused with scenes. With multi-modal data acquisition, analysis and fusion, the multi-modal teaching behaviors of teachers and students are analyzed comprehensively across four aspects: language, action, expression and head pose. Meanwhile, teaching scenes are recognized automatically from teaching behaviors, realizing scene-aware computation of learning engagement.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the invention. Although the invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art will understand that modifications and equivalent substitutions may be made without departing from the spirit and scope of the invention.

Claims (5)

1. A human-machine collaborative multi-modal visual analysis system for student classroom engagement, characterized in that: the system comprises a multi-modal data acquisition module, a learning behavior analysis module, a teaching activity fusion analysis module, a classroom engagement analysis module and a visual feedback module connected in sequence, wherein the learning behavior analysis module, the teaching activity fusion analysis module and the classroom engagement analysis module each also connect to an educational domain knowledge extraction module;
the system comprises a multi-mode data acquisition module, a multi-mode data processing module and a multi-mode data processing module, wherein the multi-mode data acquisition module is used for acquiring original multi-mode data generated in the classroom process, and the original multi-mode data comprises classroom two-dimensional video data, classroom depth video data and classroom audio data;
the learning behavior analysis module computes and preliminarily analyzes student learning behaviors in real time from the multi-modal data sources; concretely, it identifies the expression, action and language modalities of students in the classroom through artificial intelligence algorithms;
the teaching activity fusion analysis module generates higher-level activity information from the analyzed student behavior information; concretely, it uses multi-modal machine learning to jointly map the modal cues to the matching learning activity;
the classroom engagement analysis module combines the objective learning activity information to compute the engagement of an individual student in a specific scene, where a scene is the background category in which a teaching activity occurs, including lecturing, practice and discussion; the index dimensions used in the computation come from the educational domain knowledge extraction module; the raw value of each index is the product of a 1×m scene row vector, an m×n weight matrix and an n×1 activity column vector, the weight matrix also coming from the educational domain knowledge extraction module; the standardized value of the index is then obtained from the raw values by zero-mean standardization;
the educational domain knowledge extraction module consults and consolidates expert opinion to form the theoretical dimensions and indices, at every level, related to classroom engagement, comprising: an engagement-related learning behavior classification standard, an engagement-related learning activity classification standard, an engagement-related teaching scene classification standard, the measurement dimensions and indices of classroom engagement, and the weight matrices mapping each learning activity to each measurement index in each teaching scene;
and the visual feedback module visually outputs the engagement-related behavior and activity recognition results and the computed evaluation indices; it computes the learning engagement index score of each student in each scene over the course of the class and outputs it as an engagement variation curve, the visual output taking the form of video and images.
2. The human-machine collaborative multi-modal visual analysis system for student classroom engagement according to claim 1, characterized in that: the multi-modal data acquisition module consists of two 4K cameras and one depth camera; the two 4K cameras are mounted at the upper-left and upper-right corners of the classroom blackboard and film the students in the left and right halves of the classroom respectively, while the depth camera is mounted at the center of the upper edge of the blackboard and films all students in the classroom head-on.
3. The human-machine collaborative multi-modal visual analysis system for student classroom engagement according to claim 1, characterized in that: in the learning behavior analysis module, the method for identifying the modal information of students in the classroom through artificial intelligence algorithms comprises the following steps:
1) adjusting the confidence threshold of the artificial intelligence algorithm and identifying all visible teacher and student entities with computer vision, detecting the position, category and confidence of each entity in the two-dimensional picture;
2) combining the two-dimensional entity positions with the entity depth information and performing entity-to-label mapping with a dynamic tracking algorithm whose optimization objective is to minimize the inter-frame entity position offset, i.e. the sum over all entities of the Euclidean distances between their positions in adjacent frames in three-dimensional space;
3) extracting and aligning language information through a speech recognition algorithm and a Chinese word vector algorithm, converting unstructured utterances into 300-dimensional structured vectors with a Chinese word vector model pre-trained on a public corpus;
4) identifying the expression and action states of teachers and students in each frame with expression and action recognition models trained on public large-scale datasets, following the action classification coding standard approved in the educational domain knowledge extraction module.
4. The human-machine collaborative multi-modal visual analysis system for student classroom engagement according to claim 1, characterized in that: in the teaching activity fusion analysis module, the multi-modal machine learning method comprises the following steps:
1) mapping the expression, action and language modal information of each student entity into a common feature space x;
2) training a classification model from the expression, action and language modal information to learning activities, based on the learning activity classification coding standard approved in the educational domain knowledge extraction module, and performing automatic activity matching and coding for each student entity.
5. The human-machine collaborative multi-modal visual analysis system for student classroom engagement according to claim 1, characterized in that: in the educational domain knowledge extraction module, the theoretical dimensions and indices related to classroom engagement are formed at each level through the following steps:
1) interfacing with the learning behavior analysis module to formulate the learning behavior classification coding standard: from the actions and expressions that current computer vision can recognize, screening out the behaviors and expressions highly related to classroom teaching, covering head poses, body movements, facial expressions, speech and interpersonal interaction;
2) interfacing with the teaching activity fusion analysis module to formulate the teaching activity classification coding standard: coding and annotating 13 activity states, namely listening, hands-on experiment/practice, note taking, doing exercises, operating a computer/PAD, raising a hand, standing up, reading, talking with the teacher, giving feedback to the teacher, discussing with peers, hands-on collaboration and being off-task, and constructing an automatic teaching activity analysis coding table;
3) interfacing with the classroom engagement analysis module to formulate the teaching scene classification coding standard, specifying the scene code and scene category corresponding to each scene description;
4) interfacing with the classroom engagement analysis module to approve the evaluation index dimensions related to classroom engagement;
5) interfacing with the classroom engagement analysis module to determine, by combining subjective and objective judgment, the weight matrices from scenes and activities to each engagement evaluation index.
CN202211621966.8A 2022-12-16 2022-12-16 Multi-mode visual analysis system for class investment of students through man-machine cooperation Active CN115984956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211621966.8A CN115984956B (en) 2022-12-16 2022-12-16 Multi-mode visual analysis system for class investment of students through man-machine cooperation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211621966.8A CN115984956B (en) 2022-12-16 2022-12-16 Multi-mode visual analysis system for class investment of students through man-machine cooperation

Publications (2)

Publication Number Publication Date
CN115984956A true CN115984956A (en) 2023-04-18
CN115984956B CN115984956B (en) 2023-08-29

Family

ID=85973230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211621966.8A Active CN115984956B (en) 2022-12-16 2022-12-16 Multi-mode visual analysis system for class investment of students through man-machine cooperation

Country Status (1)

Country Link
CN (1) CN115984956B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351575A (en) * 2023-12-05 2024-01-05 北京师范大学珠海校区 Nonverbal behavior recognition method and nonverbal behavior recognition device based on text-generated graph data enhancement model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100291528A1 (en) * 2009-05-12 2010-11-18 International Business Machines Corporation Method and system for improving the quality of teaching through analysis using a virtual teaching device
CN108805009A (en) * 2018-04-20 2018-11-13 华中师范大学 Classroom learning state monitoring method based on multimodal information fusion and system
US20180366013A1 (en) * 2014-08-28 2018-12-20 Ideaphora India Private Limited System and method for providing an interactive visual learning environment for creation, presentation, sharing, organizing and analysis of knowledge on subject matter
CN109697577A (en) * 2019-02-01 2019-04-30 北京清帆科技有限公司 A kind of voice-based Classroom instruction quality evaluation method
CN111275760A (en) * 2020-01-16 2020-06-12 上海工程技术大学 Unmanned aerial vehicle target tracking system and method based on 5G and depth image information
CN114708525A (en) * 2022-03-04 2022-07-05 河北工程大学 Deep learning-based student classroom behavior identification method and system
CN115146975A (en) * 2022-07-08 2022-10-04 华中师范大学 Teacher-machine-student oriented teaching effect evaluation method and system based on deep learning
CN115239527A (en) * 2022-06-27 2022-10-25 重庆市科学技术研究院 Teaching behavior analysis system for teaching characteristic fusion and modeling based on knowledge base

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100291528A1 (en) * 2009-05-12 2010-11-18 International Business Machines Corporation Method and system for improving the quality of teaching through analysis using a virtual teaching device
US20180366013A1 (en) * 2014-08-28 2018-12-20 Ideaphora India Private Limited System and method for providing an interactive visual learning environment for creation, presentation, sharing, organizing and analysis of knowledge on subject matter
CN108805009A (en) * 2018-04-20 2018-11-13 华中师范大学 Classroom learning state monitoring method based on multimodal information fusion and system
CN109697577A (en) * 2019-02-01 2019-04-30 北京清帆科技有限公司 A kind of voice-based Classroom instruction quality evaluation method
CN111275760A (en) * 2020-01-16 2020-06-12 上海工程技术大学 Unmanned aerial vehicle target tracking system and method based on 5G and depth image information
CN114708525A (en) * 2022-03-04 2022-07-05 河北工程大学 Deep learning-based student classroom behavior identification method and system
CN115239527A (en) * 2022-06-27 2022-10-25 重庆市科学技术研究院 Teaching behavior analysis system for teaching characteristic fusion and modeling based on knowledge base
CN115146975A (en) * 2022-07-08 2022-10-04 华中师范大学 Teacher-machine-student oriented teaching effect evaluation method and system based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhao Jie et al., "Intelligent Robot Technology: Research and Practice of Security, Patrol and Disposal Police Robots", China Machine Press, pages 266-268 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351575A (en) * 2023-12-05 2024-01-05 北京师范大学珠海校区 Nonverbal behavior recognition method and nonverbal behavior recognition device based on text-generated graph data enhancement model
CN117351575B (en) * 2023-12-05 2024-02-27 北京师范大学珠海校区 Nonverbal behavior recognition method and nonverbal behavior recognition device based on text-generated graph data enhancement model

Also Published As

Publication number Publication date
CN115984956B (en) 2023-08-29

Similar Documents

Publication Publication Date Title
CN106503055B (en) A kind of generation method from structured text to iamge description
CN110992741A (en) Learning auxiliary method and system based on classroom emotion and behavior analysis
CN111915148B (en) Classroom teaching evaluation method and system based on information technology
CN111931585A (en) Classroom concentration degree detection method and device
CN112069970B (en) Classroom teaching event analysis method and device
CN105468468A (en) Data error correction method and apparatus facing question answering system
CN107301164B (en) Semantic analysis method and device for mathematical formula
CN109598226B (en) Online examination cheating judgment method based on Kinect color and depth information
CN115146162A (en) Online course recommendation method and system
CN112232276B (en) Emotion detection method and device based on voice recognition and image recognition
CN111524578A (en) Psychological assessment device, method and system based on electronic psychological sand table
CN115984956B (en) Multi-mode visual analysis system for class investment of students through man-machine cooperation
CN107578015B (en) First impression recognition and feedback system and method based on deep learning
CN110245253A (en) A kind of Semantic interaction method and system based on environmental information
CN115719516A (en) Multichannel-based classroom teaching behavior identification method and system
CN116050892A (en) Intelligent education evaluation supervision method based on artificial intelligence
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
KR20180058298A (en) System and method for testing a school readiness of the school-age child
CN114186983B (en) Video interview multidimensional scoring method, system, computer equipment and storage medium
CN115810163B (en) Teaching evaluation method and system based on AI classroom behavior recognition
CN110956142A (en) Intelligent interactive training system
CN111950472A (en) Teacher grinding evaluation method and system
CN117455126B (en) Ubiquitous practical training teaching and evaluation management system and method
CN116226410B (en) Teaching evaluation and feedback method and system for knowledge element connection learner state
CN115455247B (en) Classroom collaborative learning role judgment method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant