CN114970701A - Multi-mode fusion-based classroom interaction analysis method and system - Google Patents

Multi-mode fusion-based classroom interaction analysis method and system

Info

Publication number
CN114970701A
CN114970701A (application CN202210539253.0A)
Authority
CN
China
Prior art keywords
vector
considered
detected
human body
elbow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210539253.0A
Other languages
Chinese (zh)
Inventor
戴志诚
刘三女牙
杨宗凯
王春冉
朱晓亮
赵亮
孙成章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN202210539253.0A
Publication of CN114970701A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70Multimodal biometrics, e.g. combining information from different biometric modalities
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Social Psychology (AREA)
  • Evolutionary Computation (AREA)
  • Psychiatry (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a classroom interaction analysis method and system based on multi-modal fusion. Specifically, the method obtains a classroom teaching video and preprocesses it; processes the video with voiceprint recognition and pose estimation algorithms to determine the identity of each speaker; designs teacher-student interaction behavior categories classified by the subject of the classroom interaction; and finally compiles an MFIAS coding table for interaction analysis, constructs a multi-modal fusion behavior comparison table that maps features to behaviors, and quantitatively analyzes classroom teaching interaction behavior. The invention enables comprehensive, fine-grained analysis of the rich content of teacher-student classroom interaction, and solves the problems of manual classroom-video feedback analysis, such as results heavily influenced by subjective factors and complex procedures.

Description

Multi-mode fusion-based classroom interaction analysis method and system
Technical Field
The invention belongs to the technical field of education informatization, and particularly relates to a classroom interaction analysis method and system based on multi-mode fusion.
Background
A voiceprint, like a fingerprint, facial features, gait or the iris of the eye, is a physiological attribute that can be used to identify a person. Because everyone's vocal organs differ from birth and speaking habits are shaped by individual study and life, each voice has unique characteristics. Voiceprint recognition (VPR), also known as speaker recognition (SR), is a biometric identification technique that collects human speech with a recording device and identifies the speaker through computer processing, computation and analysis. The technique draws on multiple disciplines, including signal processing, pattern recognition and natural language processing, and is widely applied in finance, public safety, intelligent security, intelligent education and other fields.
Pose estimation determines a person's posture by locating body joint points with methods such as computer vision or wearable sensors; from the coordinates of each joint, the resulting skeletal features can be used for tracking, identifying and classifying people. Deep-learning-based pose estimation algorithms currently fall into two categories according to the number of detection targets: single-person pose estimation and multi-person pose estimation. Multi-person pose estimation better matches real scenes and is a research hotspot at many research institutions.
Multi-person pose estimation can in turn be divided into two approaches: top-down (two-step) and bottom-up (part-based). Top-down methods first detect the position of each person in the image and then detect that person's body joint points to determine the pose; bottom-up methods first detect all body joint points in the scene and then connect them according to the body structure. Each approach has advantages and disadvantages. Top-down detection is more accurate, but it depends heavily on the quality of the human bounding boxes, which must be located precisely. Bottom-up methods can complete most pose estimation without such precise localization, but because the joints are connected only after detection, the connections can become ambiguous, and misjudgments or omissions can occur, when two or more people are very close together in the image. In classroom teaching videos people are dense, irregularly distributed and frequently change relative position, and judging classroom behavior requires accurate recognition of each person; bottom-up methods struggle to separate occluded or overlapping bodies, so a top-down detection method is adopted for high-precision classroom pose estimation.
A modality is a particular way in which something is observed or perceived. For example, the content of a lesson can be represented by the speech signals exchanged between teacher and students (sound is one modality), by the pictures in a video (image is another), or by the text of the teacher-student dialogue (text is a third). Multi-modal fusion is also known as multi-sensor information fusion or multi-source information fusion. Fusion strategies are generally divided into three categories according to the level at which fusion takes place: data-level fusion, feature-level fusion and decision-level fusion. Data describing the same object in different modalities are usually heterogeneous, multi-source data; precisely because different modalities differ in representation and characteristics, they describe the object from different angles and complement one another. Collecting data from multiple sources and jointly characterizing them by exploiting their correlations makes multi-modal analysis more comprehensive than single-modality analysis.
Since classroom video analysis emerged in the 1980s, studying the teaching process through video analysis has become a hot topic in education informatization. To suit classroom teaching behavior analysis in information-based teaching environments, coding systems with information-technology characteristics have been proposed continually. Typical classroom interaction analysis methods include the Flanders Interaction Analysis System (FIAS) and S-T classroom teaching analysis. In addition, the Information Technology-based Interaction Analysis System (ITIAS), the improved Flanders Interaction Analysis System (i-FIAS) and the Verbal Interaction Classification System (VICS) provide a degree of support for classroom interaction behavior analysis and classroom effect evaluation.
The FIAS system relies on manual coded observation: it places high demands on observers, its coding of classroom behavior is complicated and labor-intensive, it focuses mainly on verbal analysis, and its coding scheme is incomplete. The i-FIAS system improves on FIAS but still concentrates on verbal interaction in the teaching process and lacks detailed records and analysis of teacher and student behavior. The ITIAS system classifies teacher language item by item, but its codes at the technology level are too simple, recording only whether the teacher or students use technology and ignoring the rich meaning of how the technology is used.
In summary, traditional classroom behavior analysis is mostly carried out manually, without automated methods, and suffers from single-modality data, complicated operation and low efficiency.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a classroom interaction analysis method and system based on multi-modal fusion, to solve the problems that manual feedback analysis of classroom videos is heavily influenced by subjective factors and involves complex procedures.
In order to achieve the above object, in a first aspect, the present invention provides a classroom interaction analysis method based on multi-modal fusion, including the following steps:
determining a classroom teaching video to be analyzed;
extracting voiceprint features in the classroom teaching video, and determining the speaker identity and speaking time corresponding to each section of voiceprint feature through a speaker recognition clustering algorithm;
carrying out target detection on the classroom video to detect a human body; carrying out posture estimation on the detected human body, and determining key action characteristics of the human body; the key action features include: a standing posture, a hand posture, and a face-facing posture;
fusing the voiceprint features with the key human body action features, and analyzing the classroom interaction behavior of the classroom teaching video using a classroom interaction behavior coding system constructed in advance from relevant educational theory and the fused semantic features of teacher-student interaction behavior. The principles referenced in constructing the coding system include the following. Within a given time period, after a student is detected speaking: if no one is detected standing, the student is considered to be answering a question or speaking while seated; if one person is detected standing without hand movement, the student is considered to be answering a question or speaking while standing; if one person is detected standing with hand movement, the student is considered to be explaining content with gestures; if one person is detected standing with an arm raised and the hand pointing aside while facing the front of the lens, the student is considered to be pointing at displayed content or operating the electronic whiteboard; if one person is detected standing with an arm raised and the hand pointing aside while facing the side of the lens, the student is considered to be operating the electronic whiteboard, explaining, or writing on the board; and if several people are detected standing, the teacher is considered to be listening to a student answer a question or speak at the podium. Within a given time period, after the teacher is detected speaking: if no one is detected standing, the teacher is considered to be lecturing or asking a question while seated; if one person is detected standing without hand movement, the teacher is considered to be lecturing or asking a question while standing; if one person is detected standing with hand movement, the teacher is considered to be explaining content with gestures; if one person is detected standing with an arm raised and the hand pointing aside while facing the front of the lens, the teacher is considered to be pointing at the lecture content or board writing; if one person is detected standing with an arm raised and the hand pointing aside while facing the side of the lens, the teacher is considered to be operating the electronic whiteboard, lecturing, or writing on the board; and if several people are detected standing, the teacher is considered to be calling on or gesturing toward a speaking student.
In an optional example, the performing posture estimation on the detected human body to determine key motion features of the human body specifically includes:
selecting a plurality of human body key nodes, wherein the key nodes comprise: a head key node and a body key node; the key nodes of the head include the right ear, the left ear, the right eye, the left eye and the nose, and the key nodes of the body include: neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, and left ankle;
taking a connecting line between the right hip and the neck as a first vector, taking a connecting line between the right hip and the right knee as a second vector, and taking an angle between the first vector and the second vector as an angle between the trunk and the right hip bone;
a connecting line between the left hip and the neck is used as a third vector, a connecting line between the left hip and the left knee is used as a fourth vector, and an angle between the third vector and the fourth vector is used as an angle between the trunk and the left hip bone;
when the angle between the trunk and the right hip bone and the angle between the trunk and the left hip bone are both 160-180 degrees, the human body is considered to be standing; otherwise, the human body is considered to be sitting.
In an alternative example, a line connecting the right shoulder and the left shoulder is taken as a fifth vector, a line connecting the right shoulder and the right elbow is taken as a sixth vector, and an angle between the fifth vector and the sixth vector is taken as a right shoulder elbow angle;
taking a connecting line between the left shoulder and the right shoulder as a seventh vector, taking a connecting line between the left shoulder and the left elbow as an eighth vector, and taking an angle between the seventh vector and the eighth vector as a left shoulder elbow angle;
taking a connecting line between the right elbow and the right shoulder as a ninth vector, taking a connecting line between the right elbow and the right wrist as a tenth vector, and taking an angle between the ninth vector and the tenth vector as a right elbow wrist angle;
taking a connecting line between the left elbow and the left shoulder as an eleventh vector, taking a connecting line between the left elbow and the left wrist as a twelfth vector, and taking an angle between the eleventh vector and the twelfth vector as a left elbow wrist angle;
when the angle of the shoulder and elbow on one side ranges from 120 degrees to 165 degrees, the arm on the one side is considered to be raised;
when the elbow wrist angle of one side ranges between 65-110 degrees, the elbow of the side is considered to be bent;
when one arm is raised and the elbow of the side is bent, the human body is considered to have a hand-lifting gesture;
when one arm is raised and the elbow of the side is not bent, the human body is considered to raise the arm and extend the hand to point aside;
when the arms on the two sides are not lifted, if the elbows on the two sides are bent, the human body is considered to make gestures;
when the arms on the two sides are not lifted, if the elbow on one side is bent, the human body is considered to have hand-lifting gestures;
when the arms on both sides are not lifted, if the elbows on both sides are not bent, the human body is considered to have no action.
In an optional example, an eye vector is determined from the line connecting the left eye and the right eye, and the slope of the perpendicular bisector of the eye vector is calculated; when the absolute value of this slope lies in [1.6, +∞) or the slope cannot be calculated, the human body is considered to be facing the front of the lens;
when the absolute value of the slope lies in [0, 1.6), the human body is considered to be facing the side of the lens; likewise, when only one of the left-ear and right-ear nodes is detected, or only one of the left-eye and right-eye nodes is detected, the key nodes required to calculate the face orientation are judged to be missing and the human body is considered to be facing the side of the lens.
In an optional example, the principle of constructing the specific reference of the classroom interaction behavior coding system further includes: if the speaker is not detected and the key motion characteristics of the human body are not detected within a certain time period, the classroom is considered to be silent or chaotic in the certain time period.
In a second aspect, the present invention provides a classroom interaction analysis system based on multi-modal fusion, including:
the teaching video determining unit is used for determining a classroom teaching video to be analyzed;
the voiceprint feature extraction unit is used for extracting voiceprint features in the classroom teaching video and determining the speaker identity and the speaking time corresponding to each section of voiceprint feature through a speaker recognition clustering algorithm;
the motion characteristic detection unit is used for carrying out target detection on the classroom video and detecting a human body; carrying out posture estimation on the detected human body, and determining key action characteristics of the human body; the key action features include: a standing posture, a hand posture, and a face-facing posture;
the classroom interaction analysis unit is used for fusing the voiceprint features with the key human body action features, and for analyzing the classroom interaction behavior of the classroom teaching video using a classroom interaction behavior coding system constructed in advance from relevant educational theory and the fused semantic features of teacher-student interaction behavior. The principles referenced in constructing the coding system include the following. Within a given time period, after a student is detected speaking: if no one is detected standing, the student is considered to be answering a question or speaking while seated; if one person is detected standing without hand movement, the student is considered to be answering a question or speaking while standing; if one person is detected standing with hand movement, the student is considered to be explaining content with gestures; if one person is detected standing with an arm raised and the hand pointing aside while facing the front of the lens, the student is considered to be pointing at displayed content or operating the electronic whiteboard; if one person is detected standing with an arm raised and the hand pointing aside while facing the side of the lens, the student is considered to be operating the electronic whiteboard, explaining, or writing on the board; and if several people are detected standing, the teacher is considered to be listening to a student answer a question or speak at the podium. Within a given time period, after the teacher is detected speaking: if no one is detected standing, the teacher is considered to be lecturing or asking a question while seated; if one person is detected standing without hand movement, the teacher is considered to be lecturing or asking a question while standing; if one person is detected standing with hand movement, the teacher is considered to be explaining content with gestures; if one person is detected standing with an arm raised and the hand pointing aside while facing the front of the lens, the teacher is considered to be pointing at the lecture content or board writing; if one person is detected standing with an arm raised and the hand pointing aside while facing the side of the lens, the teacher is considered to be operating the electronic whiteboard, lecturing, or writing on the board; and if several people are detected standing, the teacher is considered to be calling on or gesturing toward a speaking student.
In an optional example, the motion feature detection unit selects a plurality of human body key nodes, where the key nodes include: a head key node and a body key node; the head key node comprises a right ear, a left ear, a right eye, a left eye and a nose, and the body key node comprises: neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, and left ankle; taking a connecting line between the right hip and the neck as a first vector, taking a connecting line between the right hip and the right knee as a second vector, and taking an angle between the first vector and the second vector as an angle between the trunk and the right hip bone; a connecting line between the left hip and the neck is used as a third vector, a connecting line between the left hip and the left knee is used as a fourth vector, and an angle between the third vector and the fourth vector is used as an angle between the trunk and the left hip bone; when the angle between the trunk and the right hip bone and the angle between the trunk and the left hip bone are both 160-180 degrees, the human body is considered to be standing; otherwise, the human body is considered to be sitting.
In an optional example, the motion feature detection unit takes a line between the right shoulder and the left shoulder as a fifth vector, a line between the right shoulder and the right elbow as a sixth vector, and an angle between the fifth vector and the sixth vector as a right shoulder elbow angle; taking a connecting line between the left shoulder and the right shoulder as a seventh vector, taking a connecting line between the left shoulder and the left elbow as an eighth vector, and taking an angle between the seventh vector and the eighth vector as a left shoulder elbow angle; taking a connecting line between the right elbow and the right shoulder as a ninth vector, taking a connecting line between the right elbow and the right wrist as a tenth vector, and taking an angle between the ninth vector and the tenth vector as a right elbow wrist angle; taking a connecting line between the left elbow and the left shoulder as an eleventh vector, taking a connecting line between the left elbow and the left wrist as a twelfth vector, and taking an angle between the eleventh vector and the twelfth vector as a left elbow wrist angle; when the angle of the shoulder and elbow on one side ranges from 120 degrees to 165 degrees, the arm on the one side is considered to be raised; when the elbow wrist angle of one side ranges between 65-110 degrees, the elbow of the side is considered to be bent; when one arm is raised and the elbow of the side is bent, the human body is considered to have a hand-lifting gesture; when one arm is raised and the elbow of the side is not bent, the human body is considered to raise the arm and extend the hand to point aside; when the arms on the two sides are not lifted, if the elbows on the two sides are bent, the human body is considered to make gestures; when the arms on the two sides are not lifted, if the elbow on one side is bent, the human body is considered to have hand-lifting gestures; when the arms on both sides are not lifted, if the elbows on both sides are not bent, the human body is considered to have no action.
In an optional example, the motion feature detection unit determines an eye vector from the line connecting the left eye and the right eye and calculates the slope of the perpendicular bisector of the eye vector; when the absolute value of this slope lies in [1.6, +∞) or the slope cannot be calculated, the human body is considered to be facing the front of the lens; when the absolute value of the slope lies in [0, 1.6), the human body is considered to be facing the side of the lens; likewise, when only one of the left-ear and right-ear nodes is detected, or only one of the left-eye and right-eye nodes is detected, the key nodes required to calculate the face orientation are judged to be missing and the human body is considered to be facing the side of the lens.
In an optional example, the principle that the classroom interaction analysis unit constructs the specific reference of the classroom interaction behavior coding system further includes: if the speaker is not detected and the key motion characteristics of the human body are not detected within a certain time period, the classroom is considered to be silent or chaotic in the certain time period.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
the invention provides a classroom interaction analysis method and system based on multi-modal fusion, and provides an intelligent identification method based on multi-modal feature fusion aiming at the problem of intelligent identification and classification of classroom video interaction behaviors. The design can reduce the burden of teachers in classroom feedback analysis after class, simplify the operation by using an intelligent technology and accelerate the analysis speed so as to help teachers to obtain classroom teaching conditions in time.
The invention provides a classroom interaction analysis method and system based on multi-modal fusion, which simultaneously consider image data and audio data, and fuse the multi-modal characteristics of the two different dimensions to fully mine information in a classroom in the process of describing classroom interaction behaviors, refine the classification of the behaviors and improve the accuracy of behavior recognition.
The invention provides a classroom interaction analysis method and system based on multi-mode fusion, aiming at the characteristics of diversified classroom teaching modes and rich interaction behaviors, the invention performs fusion analysis on key information according to technologies such as voiceprint recognition, human body posture estimation and the like, provides a classroom interaction analysis method and system based on multi-mode fusion, compiles an interaction behavior analysis system (MFIAS) based on multi-mode fusion, and solves the practical problem in classroom teaching interaction.
Drawings
Fig. 1 is a flow architecture diagram of a multi-modal fusion-based classroom interaction analysis method provided in an embodiment of the present invention;
FIG. 2 is a diagram of a classroom interaction analysis framework based on multi-modal fusion according to an embodiment of the present invention;
fig. 3 is a flow chart of classroom teaching interaction behavior analysis provided by the embodiment of the present invention;
fig. 4 is a diagram illustrating artificial labeling for classroom behavior recognition and an algorithm recognition effect according to an embodiment of the present invention.
Fig. 5 is an architecture diagram of a multi-modal fusion-based classroom interaction analysis system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In view of the defects and shortcomings of the prior art, the invention provides a classroom interaction analysis method and system based on multi-modal fusion, aiming to solve the problems that manual feedback analysis of classroom videos by teachers is heavily influenced by subjective factors and involves complex procedures. The invention can help teachers analyze teacher-student interaction, classroom structure and other teaching conditions more effectively, quickly and purposefully, improve students' higher-order thinking ability, and promote deep learning.
Fig. 1 is a flow architecture diagram of a multi-modal fusion-based classroom interaction analysis method provided in an embodiment of the present invention; as shown in fig. 1, the method comprises the following steps:
s101, determining a classroom teaching video to be analyzed;
s102, extracting voiceprint features in the classroom teaching video, and determining the speaker identity and the speaking time corresponding to each section of voiceprint feature through a speaker recognition clustering algorithm;
s103, carrying out target detection on the classroom video to detect a human body; carrying out attitude estimation on the detected human body, and determining key action characteristics of the human body; the key action features include: a standing posture, a hand posture, and a face-facing posture;
s104, fusing the voiceprint features with the key human body action features, and analyzing the classroom interaction behavior corresponding to the classroom teaching video using a classroom interaction behavior coding system constructed in advance from relevant educational theory and the fused semantic features of teacher-student interaction behavior. The principles referenced in constructing the coding system include the following. Within a given time period, after a student is detected speaking: if no one is detected standing, the student is considered to be answering a question or speaking while seated; if one person is detected standing without hand movement, the student is considered to be answering a question or speaking while standing; if one person is detected standing with hand movement, the student is considered to be explaining content with gestures; if one person is detected standing with an arm raised and the hand pointing aside while facing the front of the lens, the student is considered to be pointing at displayed content or operating the electronic whiteboard; if one person is detected standing with an arm raised and the hand pointing aside while facing the side of the lens, the student is considered to be operating the electronic whiteboard, explaining, or writing on the board; and if several people are detected standing, the teacher is considered to be listening to a student answer a question or speak at the podium. Within a given time period, after the teacher is detected speaking: if no one is detected standing, the teacher is considered to be lecturing or asking a question while seated; if one person is detected standing without hand movement, the teacher is considered to be lecturing or asking a question while standing; if one person is detected standing with hand movement, the teacher is considered to be explaining content with gestures; if one person is detected standing with an arm raised and the hand pointing aside while facing the front of the lens, the teacher is considered to be pointing at the lecture content or board writing; if one person is detected standing with an arm raised and the hand pointing aside while facing the side of the lens, the teacher is considered to be operating the electronic whiteboard, lecturing, or writing on the board; and if several people are detected standing, the teacher is considered to be calling on or gesturing toward a speaking student.
In a specific embodiment, the present invention provides a classroom interaction analysis method based on multi-modal fusion, as shown in fig. 2, the method includes:
acquiring classroom teaching video data to be identified by using a microphone and a high-definition camera in a classroom;
extracting voiceprint characteristics of the classroom teaching video data to be identified according to different speakers by utilizing a voice activity detection and voiceprint identification technology;
performing human body action recognition on the video data by utilizing a target detection and posture estimation technology, and extracting key action characteristics (hand posture, standing/sitting posture and human body orientation) of a human body in a picture;
according to differences among different behavior classes observed and analyzed in a classroom, an interaction behavior coding table MFIAS for uniformly describing all classroom interaction behaviors is established;
and fusing the obtained voiceprint features and the action features to generate video behavior feature description values arranged according to time, and establishing a mapping relation table according to the description values and the coding table to realize the conversion from the multi-modal features to the interactive behaviors.
And obtaining a final classroom teaching video interaction behavior sequence according to the mapping relation table.
The classroom interaction analysis method and system based on multi-modal fusion solve the problems of manual feedback analysis of classroom videos, such as the strong influence of subjective factors and the complexity of the procedure; they explore and expand classroom teaching behavior analysis research from a new perspective, better support the innovation of teaching activities, and provide support for promoting students' active learning.
Fig. 3 is a flow chart of classroom teaching interaction behavior analysis provided by the embodiment of the present invention, which includes the following specific steps:
step 1: and acquiring classroom teaching video data to be trained. The classroom interaction behavior is analyzed based on classroom teaching videos, the problems of inconsistent formats, content loss and the like exist in the recorded videos, and the integrity and the effectiveness of classroom video files are the basis for observing the classroom interaction behavior of teachers and students. Therefore, before the analysis of the classroom interaction behavior, the classroom teaching video is screened to ensure that the teaching video has consistent format and complete content.
Step 2: obtain key frames of the video data using voiceprint recognition, and divide the video into different classroom teaching activity events according to each video's speech signal and key-frame characteristics.
Step 3: following the slicing convention of classroom behavior analysis coding, a 40-45 minute classroom audio track from step 2 is divided into speech segments 3 seconds long; the segments are framed, windowed and otherwise processed, Mel-frequency cepstral coefficient (MFCC) features are extracted and computed, and the speaker identity is judged from the clustering of the speech signals.
Specifically, speakers with different voiceprint characteristics in a section of classroom video can be distinguished through a speaker clustering recognition algorithm. In the whole processing process, the video is firstly sliced and added into a temporary voiceprint comparison library, and then the similarity of each voiceprint characteristic is compared, so that the identities of different speakers are judged.
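As a concrete illustration of the slicing-and-clustering step described above, the following minimal Python sketch extracts MFCC voiceprint vectors from 3-second audio slices and groups them by similarity. The use of librosa and scikit-learn, the 20-coefficient MFCC setting and the two-speaker default are illustrative assumptions; the patent does not name these libraries or parameters.

    # Illustrative sketch only: library choices and parameters are assumptions.
    import numpy as np
    import librosa
    from sklearn.cluster import AgglomerativeClustering

    def cluster_speakers(audio_path, slice_sec=3.0, n_speakers=2):
        y, sr = librosa.load(audio_path, sr=16000)
        hop = int(slice_sec * sr)
        embeddings, times = [], []
        for start in range(0, len(y) - hop + 1, hop):
            segment = y[start:start + hop]
            # Frame, window and compute Mel-frequency cepstral coefficients,
            # then average over time to get one voiceprint vector per slice.
            mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=20)
            embeddings.append(mfcc.mean(axis=1))
            times.append(start / sr)
        # Group slices whose voiceprint vectors are similar; each cluster label
        # stands for one speaker identity (e.g. teacher vs. student).
        labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(np.stack(embeddings))
        return list(zip(times, labels))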
Step 4: locate the key joint points of the human body using a target detection algorithm and a multi-person pose estimation algorithm, and judge the human posture, including standing posture, hand posture and body orientation.
Step 5: in step 4, because the classroom environment is complex and the relative positions of people change frequently, each individual human body in the image is first extracted preliminarily by the target detection algorithm, and the human pose map corresponding to the original image is then obtained through spatial transformation, optimization and adjustment, and redundancy elimination.
Step 6: in step 4, a plurality of human body key nodes are selected. The head key points comprise the right ear, the left ear, the right eye, the left eye and the nose, and the body key points comprise, in order, the neck (neck), right shoulder (rsho), right elbow (relb), right wrist (rwri), left shoulder (lsho), left elbow (lelb), left wrist (lwri), right hip (rhip), right knee (rkne), right ankle (rank), left hip (lhip), left knee (lkne) and left ankle (lank).
Step 7: in step 6, joint-point vector groups are further calculated to determine the specific human body motion posture. A feature vector is formed by two adjacent human joint points, defined as A(x1, y1) and B(x2, y2); the body-structure vector of this part is expressed as
AB = (x2 − x1, y2 − y1),
and the included angle θ between two such vectors u and v sharing a joint point is obtained from the included-angle formula
θ = arccos( (u · v) / (|u| |v|) ).
Step 8: based on step 7 and the human body structure, the invention establishes the following vector-group included angles according to the posture to be judged: θ_neck-rsho-relb, θ_neck-lsho-lelb, θ_rsho-relb-rwri and θ_lsho-lelb-lwri are used to compute hand motions and correspond, respectively, to the right shoulder angle, left shoulder angle, right elbow angle and left elbow angle; θ_neck-rhip-rkne, θ_neck-lhip-lkne, θ_rhip-rkne-rank and θ_lhip-lkne-lank are used to compute leg motions and correspond, respectively, to the right hip angle, left hip angle, right knee angle and left knee angle.
Step 9: standing-posture feature extraction, i.e. calculating the angles between skeleton vectors with the included-angle formula. Specifically, the line between the right hip and the neck is taken as the first vector, the line between the right hip and the right knee as the second vector, and the angle between them as the angle θ_neck-rhip-rkne between the trunk and the right hip bone; the line between the left hip and the neck is taken as the third vector, the line between the left hip and the left knee as the fourth vector, and the angle between them as the angle θ_neck-lhip-lkne between the trunk and the left hip bone. When a person is standing, the trunk-hip angles θ_neck-rhip-rkne and θ_neck-lhip-lkne lie in the range 160°-180°; the specific range that counts as a standing action is determined by taking the median value M and then matching against a threshold range.
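The included-angle computation of steps 7-9 can be sketched as below; the keypoint abbreviations and the 160°-180° standing range follow the text above, while NumPy and the keypoint-dictionary interface are illustrative assumptions rather than the patent's implementation.

    import numpy as np

    def joint_angle(center, p1, p2):
        """Angle at `center` between the vectors center->p1 and center->p2, in degrees."""
        v1 = np.asarray(p1, dtype=float) - np.asarray(center, dtype=float)
        v2 = np.asarray(p2, dtype=float) - np.asarray(center, dtype=float)
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

    def is_standing(kp):
        """kp maps keypoint names ('neck', 'rhip', 'rkne', 'lhip', 'lkne') to (x, y)."""
        right = joint_angle(kp["rhip"], kp["neck"], kp["rkne"])  # trunk / right hip angle
        left = joint_angle(kp["lhip"], kp["neck"], kp["lkne"])   # trunk / left hip angle
        return 160.0 <= right <= 180.0 and 160.0 <= left <= 180.0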
Step 10: hand-posture feature extraction. Because of the particular nature of classroom activities, a speaker's hand motions generally fall into four categories: explaining with gestures, operating equipment, pointing at content, and no specific action. Similarly to step 9, angle rules are analyzed for the hand postures; since the human body is symmetric, the angles on one side are considered first. For actions such as raising a hand, raising an arm or making gestures, the main task is to determine the positional relationship between the arm joints and the trunk or shoulders.
Specifically, a connecting line between the right shoulder and the left shoulder is taken as a fifth vector, a connecting line between the right shoulder and the right elbow is taken as a sixth vector, and an angle between the fifth vector and the sixth vector is taken as a right shoulder elbow angle; taking a connecting line between the left shoulder and the right shoulder as a seventh vector, taking a connecting line between the left shoulder and the left elbow as an eighth vector, and taking an angle between the seventh vector and the eighth vector as a left shoulder elbow angle; taking a connecting line between the right elbow and the right shoulder as a ninth vector, taking a connecting line between the right elbow and the right wrist as a tenth vector, and taking an angle between the ninth vector and the tenth vector as a right elbow wrist angle; and taking a connecting line between the left elbow and the left shoulder as an eleventh vector, taking a connecting line between the left elbow and the left wrist as a twelfth vector, and taking an angle between the eleventh vector and the twelfth vector as a left elbow and wrist angle.
When the shoulder-elbow angle on one side is between 120° and 165°, the arm on that side is considered to be raised; when the elbow-wrist angle on one side is between 65° and 110°, the elbow on that side is considered to be bent.
As shown in table 1, when one arm is raised and the elbow of the side is bent, the human body is considered to have a hand-lifting gesture; when one arm is raised and the elbow of the side is not bent, the human body is considered to raise the arm and extend the hand to point aside; when the arms on the two sides are not lifted, if the elbows on the two sides are bent, the human body is considered to make gestures; when the arms on the two sides are not lifted, if the elbows on only one side are bent, the human body is considered to have hand-lifting gestures; when the arms on both sides are not lifted, if the elbows on both sides are not bent, the human body is considered to have no action.
TABLE 1
One arm raised, elbow on that side bent: hand-raising gesture
One arm raised, elbow on that side not bent: arm raised, hand pointing aside
Neither arm raised, both elbows bent: making gestures
Neither arm raised, one elbow bent: hand-raising gesture
Neither arm raised, neither elbow bent: no action
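The rules of step 10 and Table 1 translate directly into a small rule-based classifier. The sketch below reuses the joint_angle helper from the sketch after step 9; the returned label strings are illustrative, not the patent's own codes.

    def classify_hand_posture(kp):
        r_shoulder = joint_angle(kp["rsho"], kp["lsho"], kp["relb"])  # right shoulder-elbow angle
        l_shoulder = joint_angle(kp["lsho"], kp["rsho"], kp["lelb"])  # left shoulder-elbow angle
        r_elbow = joint_angle(kp["relb"], kp["rsho"], kp["rwri"])     # right elbow-wrist angle
        l_elbow = joint_angle(kp["lelb"], kp["lsho"], kp["lwri"])     # left elbow-wrist angle

        r_raised, l_raised = 120 <= r_shoulder <= 165, 120 <= l_shoulder <= 165
        r_bent, l_bent = 65 <= r_elbow <= 110, 65 <= l_elbow <= 110

        if r_raised or l_raised:
            # Raised arm: a bent elbow on that side means a hand-raising gesture,
            # otherwise the arm is raised with the hand pointing aside.
            bent_on_raised_side = (r_raised and r_bent) or (l_raised and l_bent)
            return "hand_raised" if bent_on_raised_side else "arm_raised_pointing_aside"
        if r_bent and l_bent:
            return "making_gestures"
        if r_bent or l_bent:
            return "hand_raised"
        return "no_action"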
Step 11: human-body orientation feature extraction. The face orientation is introduced as an additional judgment parameter to further refine the classification of teacher and student interaction with the electronic whiteboard. The overall orientation of the body can be judged by computing the slopes of feature vectors such as the eye line and the shoulder line; because the horizontal edge of the camera frame in a classroom teaching video is parallel to the ground plane, its normal can be taken as the vertical reference direction, and a coordinate system is established to recognize and predict the body orientation.
Specifically, an eye vector is determined from the line connecting the left eye and the right eye, and the slope of the perpendicular bisector of the eye vector is computed; when the absolute value of this slope lies in [1.6, +∞) or the slope cannot be calculated, the human body is considered to be facing the front of the lens;
when the absolute value of the slope lies in [0, 1.6), the human body is considered to be facing the side of the lens; likewise, when only one of the left-ear and right-ear nodes is detected, or only one of the left-eye and right-eye nodes is detected, the key nodes required to calculate the face orientation are judged to be missing and the human body is considered to be facing the side of the lens.
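The orientation rule of step 11 can likewise be written compactly; the keypoint keys ('lear', 'rear', 'leye', 'reye') and the use of image coordinates are illustrative assumptions, while the 1.6 slope threshold and the missing-keypoint rule follow the text above.

    def face_orientation(kp):
        # Only one of the two ears, or only one of the two eyes, detected:
        # the nodes needed for the calculation are missing, so treat as side.
        for pair in (("lear", "rear"), ("leye", "reye")):
            if sum(p in kp for p in pair) == 1:
                return "side"
        if "leye" not in kp or "reye" not in kp:
            return "side"  # both eyes missing: not covered by the text, assumed side here
        (x1, y1), (x2, y2) = kp["leye"], kp["reye"]
        if y1 == y2:
            # Eye line is horizontal, so its perpendicular bisector is vertical and
            # the slope cannot be calculated: facing the front of the lens.
            return "front"
        # Perpendicular bisector slope = -1 / (eye-line slope) = -(x2 - x1) / (y2 - y1)
        slope = -(x2 - x1) / (y2 - y1)
        return "front" if abs(slope) >= 1.6 else "side"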
Step 12: multi-modal feature fusion. On the basis of steps 2 to 11, the image feature data and the voiceprint feature data are fused to turn the extracted features into behavior judgments. Focusing on the subjects of classroom teaching interaction behavior, the invention performs the classroom multi-modal data fusion at the decision level and uses a Bayesian network as the algorithmic basis of the feature fusion.
Step 13: on the basis of step 12, a classroom interaction behavior coding system, MFIAS, is constructed according to relevant educational theories and the fused semantic features of teacher-student interaction behavior, as shown in Table 2:
TABLE 2
[Table 2, the MFIAS classroom interaction behavior coding table, is provided as an image in the original publication.]
Further, on the basis of the classroom interaction behavior coding system, a multi-modal fusion behavior and classroom teaching interaction behavior mapping comparison table is constructed according to the Bayesian causal network theory, as shown in Table 3.
TABLE 3
[Table 3, the mapping between multi-modal fusion features and classroom teaching interaction behaviors, is provided as an image in the original publication.]
The parameters in Table 3 are interpreted as follows. Standing detection: 0 means no one is detected standing, 1 means one person is detected standing, and >1 means several people are detected standing. Hand-motion detection: 0 means no hand motion is detected, 1 means a hand-raising gesture is detected, and 2 means an arm is raised with the hand pointing aside. Orientation detection: 0 means facing the front of the lens, and 1 means facing the side of the lens.
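With the parameter conventions just described, the decision-level mapping of Table 3 can be sketched as a simple rule lookup. The behavior labels below paraphrase the coding principles stated earlier; the numeric MFIAS codes of Table 2 are not reproduced, so this only illustrates the fusion logic.

    def map_to_behavior(speaker, standing, hand, facing_side):
        """speaker: 'teacher', 'student' or None; standing: number of people standing;
        hand: 0 none, 1 hand-raising gesture, 2 arm raised pointing aside;
        facing_side: True if facing the side of the lens."""
        if speaker is None:
            # Simplified: the text also requires that no key action is detected.
            return "silence or confusion"
        if standing == 0:
            return speaker + " speaks while seated"
        if standing > 1:
            return ("teacher listens to a student answering at the podium"
                    if speaker == "student"
                    else "teacher calls on or gestures toward a speaking student")
        # Exactly one person standing.
        if hand == 0:
            return speaker + " speaks while standing"
        if hand == 1:
            return speaker + " explains content with gestures"
        # hand == 2: arm raised, hand pointing aside.
        if not facing_side:
            return ("student points at displayed content" if speaker == "student"
                    else "teacher points at lecture content or board writing")
        return speaker + " operates the electronic whiteboard or writes on the board"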
Step 14: on the basis of step 13, the recognition results for classroom teaching interaction behavior are analyzed. Twenty-eight videos were selected for intelligent analysis, and the same videos were also analyzed by manual annotation to compare the intelligent method with the manual approach.
In terms of overall recognition, the average accuracy across the 28 videos is about 84.6%, and recognition works well for most lessons; Fig. 4 shows the specific recognition results for the "virtual reality and interaction technology" lesson. The main reasons for reduced accuracy in classroom teaching interaction behavior analysis are as follows:
from the perspective of voiceprint recognition, the classroom environment is noisy, the environmental noise is sometimes similar to the recorded speaker characteristics, and the voice information collected by the recording device is interfered by factors such as distance and obstruction. Many factors are added together, so that the accuracy of voiceprint recognition is reduced.
From the perspective of human pose recognition, more prediction errors occur for actions such as a teacher using a stylus to interact with the whiteboard or to point at students, and for teaching activities involving dialogue among several students or oral-language practice.
In conclusion, the intelligent classroom behavior recognition proposed by the invention can substantially shorten classroom behavior analysis and provides strong support for teachers to improve teaching quality, adjust teaching strategies and pursue their own professional development after class.
Fig. 5 is a configuration diagram of a classroom interaction analysis system based on multi-modal fusion according to an embodiment of the present invention. As shown in fig. 5, includes:
a teaching video determining unit 510, configured to determine a classroom teaching video to be analyzed;
a voiceprint feature extraction unit 520, configured to extract voiceprint features in the classroom teaching video, and determine, through a speaker recognition clustering algorithm, a speaker identity and a speaking time corresponding to each section of voiceprint feature;
an action characteristic detection unit 530, configured to perform target detection on the classroom video and detect a human body; carrying out posture estimation on the detected human body, and determining key action characteristics of the human body; the key action features include: a standing posture, a hand posture, and a face-facing posture;
the classroom interaction analysis unit 540 is used for fusing the voiceprint features with the key human body action features, and for analyzing the classroom interaction behavior corresponding to the classroom teaching video using a classroom interaction behavior coding system constructed in advance from relevant educational theory and the fused semantic features of teacher-student interaction behavior. The principles referenced in constructing the coding system include the following. Within a given time period, after a student is detected speaking: if no one is detected standing, the student is considered to be answering a question or speaking while seated; if one person is detected standing without hand movement, the student is considered to be answering a question or speaking while standing; if one person is detected standing with hand movement, the student is considered to be explaining content with gestures; if one person is detected standing with an arm raised and the hand pointing aside while facing the front of the lens, the student is considered to be pointing at displayed content or operating the electronic whiteboard; if one person is detected standing with an arm raised and the hand pointing aside while facing the side of the lens, the student is considered to be operating the electronic whiteboard, explaining, or writing on the board; and if several people are detected standing, the teacher is considered to be listening to a student answer a question or speak at the podium. Within a given time period, after the teacher is detected speaking: if no one is detected standing, the teacher is considered to be lecturing or asking a question while seated; if one person is detected standing without hand movement, the teacher is considered to be lecturing or asking a question while standing; if one person is detected standing with hand movement, the teacher is considered to be explaining content with gestures; if one person is detected standing with an arm raised and the hand pointing aside while facing the front of the lens, the teacher is considered to be pointing at the lecture content or board writing; if one person is detected standing with an arm raised and the hand pointing aside while facing the side of the lens, the teacher is considered to be operating the electronic whiteboard, lecturing, or writing on the board; and if several people are detected standing, the teacher is considered to be calling on or gesturing toward a speaking student.
It can be understood that detailed functional implementation of each unit in fig. 5 can refer to the description of the foregoing method embodiment, and is not described herein again.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A classroom interaction analysis method based on multi-modal fusion is characterized by comprising the following steps:
determining a classroom teaching video to be analyzed;
extracting voiceprint features in the classroom teaching video, and determining the speaker identity and speaking time corresponding to each section of voiceprint feature through a speaker recognition clustering algorithm;
carrying out target detection on the classroom video to detect a human body; carrying out posture estimation on the detected human body, and determining key action characteristics of the human body; the key action features include: a standing posture, a hand posture, and a face-facing posture;
fusing the voiceprint features with the key human body action features, and analyzing the classroom interaction behavior corresponding to the classroom teaching video using a classroom interaction behavior coding system constructed in advance from relevant educational theory and the fused semantic features of teacher-student interaction behavior. The principles referenced in constructing the coding system include the following. Within a given time period, after a student is detected speaking: if no one is detected standing, the student is considered to be answering a question or speaking while seated; if one person is detected standing without hand movement, the student is considered to be answering a question or speaking while standing; if one person is detected standing with hand movement, the student is considered to be explaining content with gestures; if one person is detected standing with an arm raised and the hand pointing aside while facing the front of the lens, the student is considered to be pointing at displayed content or operating the electronic whiteboard; if one person is detected standing with an arm raised and the hand pointing aside while facing the side of the lens, the student is considered to be operating the electronic whiteboard, explaining, or writing on the board; and if several people are detected standing, the teacher is considered to be listening to a student answer a question or speak at the podium. Within a given time period, after the teacher is detected speaking: if no one is detected standing, the teacher is considered to be lecturing or asking a question while seated; if one person is detected standing without hand movement, the teacher is considered to be lecturing or asking a question while standing; if one person is detected standing with hand movement, the teacher is considered to be explaining content with gestures; if one person is detected standing with an arm raised and the hand pointing aside while facing the front of the lens, the teacher is considered to be pointing at the lecture content or board writing; if one person is detected standing with an arm raised and the hand pointing aside while facing the side of the lens, the teacher is considered to be operating the electronic whiteboard, lecturing, or writing on the board; and if several people are detected standing, the teacher is considered to be calling on or gesturing toward a speaking student.
2. The method according to claim 1, wherein the posture estimation is performed on the detected human body to determine the key action features of the human body, specifically:
selecting a plurality of human body key nodes, the key nodes comprising head key nodes and body key nodes; the head key nodes comprise a right ear, a left ear, a right eye, a left eye and a nose, and the body key nodes comprise: a neck, a right shoulder, a right elbow, a right wrist, a left shoulder, a left elbow, a left wrist, a right hip, a right knee, a right ankle, a left hip, a left knee and a left ankle;
taking a connecting line between the right hip and the neck as a first vector, taking a connecting line between the right hip and the right knee as a second vector, and taking an angle between the first vector and the second vector as an angle between the trunk and the right hip bone;
a connecting line between the left hip and the neck is used as a third vector, a connecting line between the left hip and the left knee is used as a fourth vector, and an angle between the third vector and the fourth vector is used as an angle between the trunk and the left hip bone;
when the angle between the trunk and the right hip bone and the angle between the trunk and the left hip bone are both in the range of 160-180 degrees, the human body is considered to be standing; otherwise, the human body is considered to be sitting.
3. The method of claim 2, wherein a line connecting the right shoulder and the left shoulder is taken as a fifth vector, a line connecting the right shoulder and the right elbow is taken as a sixth vector, and an angle between the fifth vector and the sixth vector is taken as a right shoulder elbow angle;
taking a connecting line between the left shoulder and the right shoulder as a seventh vector, taking a connecting line between the left shoulder and the left elbow as an eighth vector, and taking an angle between the seventh vector and the eighth vector as a left shoulder elbow angle;
taking a connecting line between the right elbow and the right shoulder as a ninth vector, taking a connecting line between the right elbow and the right wrist as a tenth vector, and taking an angle between the ninth vector and the tenth vector as a right elbow wrist angle;
taking a connecting line between the left elbow and the left shoulder as an eleventh vector, taking a connecting line between the left elbow and the left wrist as a twelfth vector, and taking an angle between the eleventh vector and the twelfth vector as a left elbow wrist angle;
when the shoulder elbow angle on one side is in the range of 120-165 degrees, the arm on that side is considered to be raised;
when the elbow wrist angle on one side is in the range of 65-110 degrees, the elbow on that side is considered to be bent;
when one arm is raised and the elbow on the same side is bent, the human body is considered to be raising a hand;
when one arm is raised and the elbow on the same side is not bent, the human body is considered to be raising the arm and pointing to the side;
when neither arm is raised, if both elbows are bent, the human body is considered to be making gestures;
when neither arm is raised, if the elbow on one side is bent, the human body is considered to be raising a hand;
when neither arm is raised, if neither elbow is bent, the human body is considered to have no action.
4. The method according to claim 2, wherein an eye vector is determined from the connecting line between the left eye and the right eye, and the slope of the perpendicular bisector of the eye vector is calculated; when the absolute value of the slope of the perpendicular bisector lies within [1.6, +∞), or the slope cannot be calculated, the human body is considered to be facing the front of the lens;
when the absolute value of the slope of the perpendicular bisector lies within [0, 1.6), the human body is considered to be facing the side of the lens; or, when only one of the two nodes of the left ear and the right ear is detected, or only one of the two nodes of the left eye and the right eye is detected, it is determined that a key node required for calculating the face orientation is missing, and the human body is considered to be facing the side of the lens.
5. The method according to any one of claims 1 to 4, wherein the principles specifically referenced in constructing the classroom interaction behavior coding system further comprise: if no speaker is detected and no human body key action features are detected within a given time period, the classroom is considered to be silent or chaotic during that time period.
6. A classroom interaction analysis system based on multi-modal fusion, comprising:
the teaching video determining unit is used for determining a classroom teaching video to be analyzed;
the voiceprint feature extraction unit is used for extracting voiceprint features from the classroom teaching video, and determining the speaker identity and speaking time corresponding to each segment of voiceprint features through a speaker recognition clustering algorithm;
the action feature detection unit is used for carrying out target detection on the classroom teaching video to detect human bodies, carrying out posture estimation on each detected human body, and determining key action features of the human body, wherein the key action features include: a standing posture, a hand posture, and a face orientation;
the classroom interaction analysis unit is used for fusing the voiceprint features and the human body key action features, and analyzing the classroom interaction behaviors corresponding to the classroom teaching video by combining the fused teacher-student interaction behavior semantic features with a classroom interaction behavior coding system constructed in advance according to relevant education theories; wherein the principles specifically referenced in constructing the classroom interaction behavior coding system comprise: within a given time period, after a student is detected to be speaking, if no one is detected standing, the student is considered to be answering a question or speaking while seated; if a person is detected standing without hand movement, the student is considered to be answering a question or speaking while standing; if a person is detected standing with hand movement, the student is considered to be explaining content with gestures; if a person is detected standing with an arm raised and pointing to the side while facing the front of the lens, the student is considered to be pointing at displayed content; if a person is detected standing with an arm raised and pointing to the side while facing the side of the lens, the student is considered to be operating the electronic whiteboard, speaking, or writing on the board; and if a plurality of persons are detected standing, the teacher is considered to be listening on the podium while the student answers a question or speaks; within a given time period, after the teacher is detected to be speaking, if no one is detected standing, the teacher is considered to be lecturing or asking questions while seated; if a person is detected standing without hand movement, the teacher is considered to be lecturing or asking questions while standing; if a person is detected standing with hand movement, the teacher is considered to be explaining content with gestures; if a person is detected standing with an arm raised and pointing to the side while facing the front of the lens, the teacher is considered to be pointing at displayed content while lecturing or writing on the board; if a person is detected standing with an arm raised and pointing to the side while facing the side of the lens, the teacher is considered to be operating the electronic whiteboard, lecturing, or writing on the board; and if a plurality of persons are detected standing, the teacher is considered to be guiding or instructing a speaking student.
7. The system according to claim 6, wherein the action feature detection unit selects a plurality of human body key nodes, the key nodes comprising head key nodes and body key nodes; the head key nodes comprise a right ear, a left ear, a right eye, a left eye and a nose, and the body key nodes comprise: a neck, a right shoulder, a right elbow, a right wrist, a left shoulder, a left elbow, a left wrist, a right hip, a right knee, a right ankle, a left hip, a left knee and a left ankle; the unit takes the connecting line between the right hip and the neck as a first vector, the connecting line between the right hip and the right knee as a second vector, and the angle between the first vector and the second vector as the angle between the trunk and the right hip bone; takes the connecting line between the left hip and the neck as a third vector, the connecting line between the left hip and the left knee as a fourth vector, and the angle between the third vector and the fourth vector as the angle between the trunk and the left hip bone; when the angle between the trunk and the right hip bone and the angle between the trunk and the left hip bone are both in the range of 160-180 degrees, the human body is considered to be standing; otherwise, the human body is considered to be sitting.
8. The system according to claim 7, wherein the action feature detection unit takes the connecting line between the right shoulder and the left shoulder as a fifth vector, the connecting line between the right shoulder and the right elbow as a sixth vector, and the angle between the fifth vector and the sixth vector as a right shoulder elbow angle; takes the connecting line between the left shoulder and the right shoulder as a seventh vector, the connecting line between the left shoulder and the left elbow as an eighth vector, and the angle between the seventh vector and the eighth vector as a left shoulder elbow angle; takes the connecting line between the right elbow and the right shoulder as a ninth vector, the connecting line between the right elbow and the right wrist as a tenth vector, and the angle between the ninth vector and the tenth vector as a right elbow wrist angle; takes the connecting line between the left elbow and the left shoulder as an eleventh vector, the connecting line between the left elbow and the left wrist as a twelfth vector, and the angle between the eleventh vector and the twelfth vector as a left elbow wrist angle; when the shoulder elbow angle on one side is in the range of 120-165 degrees, the arm on that side is considered to be raised; when the elbow wrist angle on one side is in the range of 65-110 degrees, the elbow on that side is considered to be bent; when one arm is raised and the elbow on the same side is bent, the human body is considered to be raising a hand; when one arm is raised and the elbow on the same side is not bent, the human body is considered to be raising the arm and pointing to the side; when neither arm is raised, if both elbows are bent, the human body is considered to be making gestures; when neither arm is raised, if the elbow on one side is bent, the human body is considered to be raising a hand; when neither arm is raised, if neither elbow is bent, the human body is considered to have no action.
9. The system according to claim 7, wherein the action feature detection unit determines an eye vector from the connecting line between the left eye and the right eye, and calculates the slope of the perpendicular bisector of the eye vector; when the absolute value of the slope of the perpendicular bisector lies within [1.6, +∞), or the slope cannot be calculated, the human body is considered to be facing the front of the lens; when the absolute value of the slope of the perpendicular bisector lies within [0, 1.6), the human body is considered to be facing the side of the lens; or, when only one of the two nodes of the left ear and the right ear is detected, or only one of the two nodes of the left eye and the right eye is detected, it is determined that a key node required for calculating the face orientation is missing, and the human body is considered to be facing the side of the lens.
10. The system according to any one of claims 6 to 9, wherein the principles specifically referenced by the classroom interaction analysis unit in constructing the classroom interaction behavior coding system further comprise: if no speaker is detected and no human body key action features are detected within a given time period, the classroom is considered to be silent or chaotic during that time period.
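
The sketches below are illustrative only and are not part of the claims. They assume Python, a 2D pose estimator that returns keypoint coordinates in a dictionary keyed by names such as "neck" or "right_hip"; the function names, keypoint names, and returned labels are invented for illustration, and only the angle thresholds come from the claims. The first sketch covers the standing test of claim 2: a person is treated as standing when both trunk-hip angles fall within 160-180 degrees.

```python
import math

def angle(center, a, b):
    """Angle in degrees at 'center' between the vectors center->a and center->b."""
    v1 = (a[0] - center[0], a[1] - center[1])
    v2 = (b[0] - center[0], b[1] - center[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return 0.0
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def is_standing(kp):
    """Claim 2: standing if both trunk-hip angles lie in [160, 180] degrees.
    kp maps an assumed keypoint name, e.g. "neck", to an (x, y) tuple."""
    trunk_right_hip = angle(kp["right_hip"], kp["neck"], kp["right_knee"])
    trunk_left_hip = angle(kp["left_hip"], kp["neck"], kp["left_knee"])
    return 160 <= trunk_right_hip <= 180 and 160 <= trunk_left_hip <= 180
```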
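A sketch of the hand-posture classification of claim 3, reusing the angle() helper from the previous sketch; the returned labels are assumed names, not terminology from the patent.

```python
def hand_posture(kp):
    """Claim 3: classify the hand posture from the shoulder elbow and elbow wrist angles.
    Returns one of the illustrative labels:
    'hand_raised', 'arm_raised_pointing', 'gesturing', 'none'."""
    shoulder_elbow = {
        "right": angle(kp["right_shoulder"], kp["left_shoulder"], kp["right_elbow"]),
        "left":  angle(kp["left_shoulder"], kp["right_shoulder"], kp["left_elbow"]),
    }
    elbow_wrist = {
        "right": angle(kp["right_elbow"], kp["right_shoulder"], kp["right_wrist"]),
        "left":  angle(kp["left_elbow"], kp["left_shoulder"], kp["left_wrist"]),
    }
    raised = {side: 120 <= a <= 165 for side, a in shoulder_elbow.items()}
    bent = {side: 65 <= a <= 110 for side, a in elbow_wrist.items()}

    for side in ("right", "left"):
        if raised[side]:
            # arm raised: bent elbow -> raising a hand, straight elbow -> pointing to the side
            return "hand_raised" if bent[side] else "arm_raised_pointing"
    if bent["right"] and bent["left"]:
        return "gesturing"          # neither arm raised, both elbows bent
    if bent["right"] or bent["left"]:
        return "hand_raised"        # neither arm raised, one elbow bent
    return "none"
```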
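A sketch of the face-orientation test of claim 4: the slope of the perpendicular bisector of the left-eye-to-right-eye vector separates front-facing from side-facing, and a missing ear or eye key node defaults to the side-facing case (the handling of both eyes missing is an added assumption).

```python
def face_orientation(kp):
    """Claim 4: decide 'front' vs 'side' from the perpendicular bisector of the
    left-eye -> right-eye vector; a missing ear or eye key node means 'side'."""
    for pair in (("left_ear", "right_ear"), ("left_eye", "right_eye")):
        if sum(1 for name in pair if name in kp) == 1:
            return "side"                 # a required key node is missing
    if "left_eye" not in kp or "right_eye" not in kp:
        return "side"                     # assumption: both eyes missing also defaults to 'side'
    (lx, ly), (rx, ry) = kp["left_eye"], kp["right_eye"]
    dx, dy = rx - lx, ry - ly
    if dy == 0:
        return "front"                    # eye line horizontal -> bisector vertical, slope undefined
    bisector_slope = -dx / dy             # slope of the perpendicular bisector of the eye vector
    return "front" if abs(bisector_slope) >= 1.6 else "side"
```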
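Finally, a sketch of the coding rules of claims 1 and 5, assuming the per-window inputs (speaker role from diarization, number of standing people, hand posture, face orientation) have already been produced by components such as those above; the front-of-lens versus side-of-lens split in the pointing rules follows the reading adopted in the claim text above and should be verified against the original Chinese claims.

```python
def code_interaction(speaker, num_standing, hand, facing):
    """Claims 1 and 5 as a rule table (an illustrative reading of the translated text).
    speaker:      'teacher', 'student' or None (speaker diarization result)
    num_standing: number of people detected standing in the time window
    hand:         hand_posture() label for the standing person
    facing:       face_orientation() label for the standing person"""
    if speaker is None and num_standing == 0 and hand == "none":
        return "silence or chaos"                                          # claim 5
    if speaker == "student":
        if num_standing == 0:
            return "student answers or speaks while seated"
        if num_standing >= 2:
            return "teacher listens on the podium while a student answers or speaks"
        if hand == "arm_raised_pointing":
            return ("student points at displayed content" if facing == "front"
                    else "student operates the electronic whiteboard, speaks or writes on the board")
        if hand in ("hand_raised", "gesturing"):
            return "student explains content with gestures"
        return "student answers or speaks while standing"
    if speaker == "teacher":
        if num_standing == 0:
            return "teacher lectures or asks questions while seated"
        if num_standing >= 2:
            return "teacher guides or instructs a speaking student"
        if hand == "arm_raised_pointing":
            return ("teacher points at displayed content while lecturing" if facing == "front"
                    else "teacher operates the electronic whiteboard, lectures or writes on the board")
        if hand in ("hand_raised", "gesturing"):
            return "teacher explains content with gestures"
        return "teacher lectures or asks questions while standing"
    return "uncoded"
```

In use, one such code would be produced per fixed-length analysis window and the resulting sequence aggregated into classroom interaction statistics.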
CN202210539253.0A 2022-05-18 2022-05-18 Multi-mode fusion-based classroom interaction analysis method and system Pending CN114970701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210539253.0A CN114970701A (en) 2022-05-18 2022-05-18 Multi-mode fusion-based classroom interaction analysis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210539253.0A CN114970701A (en) 2022-05-18 2022-05-18 Multi-mode fusion-based classroom interaction analysis method and system

Publications (1)

Publication Number Publication Date
CN114970701A true CN114970701A (en) 2022-08-30

Family

ID=82982475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210539253.0A Pending CN114970701A (en) 2022-05-18 2022-05-18 Multi-mode fusion-based classroom interaction analysis method and system

Country Status (1)

Country Link
CN (1) CN114970701A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600925A (en) * 2022-10-28 2023-01-13 广州宏途数字科技有限公司(Cn) In-class student behavior analysis auxiliary system and method
CN117636219A (en) * 2023-12-04 2024-03-01 浙江大学 Collaborative state analysis method and system in family sibling interaction process

Similar Documents

Publication Publication Date Title
Almeida et al. Feature extraction in Brazilian Sign Language Recognition based on phonological structure and using RGB-D sensors
Sahoo et al. Sign language recognition: State of the art
Asteriadis et al. Estimation of behavioral user state based on eye gaze and head pose—application in an e-learning environment
CN105739688A (en) Man-machine interaction method and device based on emotion system, and man-machine interaction system
CN114970701A (en) Multi-mode fusion-based classroom interaction analysis method and system
Aran et al. A belief-based sequential fusion approach for fusing manual signs and non-manual signals
Camgöz et al. Sign language recognition for assisting the deaf in hospitals
CN112766173A (en) Multi-mode emotion analysis method and system based on AI deep learning
Hupont et al. Sensing facial emotions in a continuous 2D affective space
CN111062356A (en) Method for automatically identifying human body action abnormity from monitoring video
Jazouli et al. Automatic detection of stereotyped movements in autistic children using the Kinect sensor
CN116109455B (en) Language teaching auxiliary system based on artificial intelligence
Zhang et al. Teaching chinese sign language with a smartphone
Krishnaraj et al. A Glove based approach to recognize Indian Sign Languages
Aran et al. Sign language tutoring tool
Yurtsever et al. BabyPose: Real-time decoding of baby’s non-verbal communication using 2D video-based pose estimation
CN115188074A (en) Interactive physical training evaluation method, device and system and computer equipment
Rozaliev et al. Methods and Models for Identifying Human Emotions by Recognition Gestures and Motion
Aran Vision based sign language recognition: modeling and recognizing isolated signs with manual and non-manual components
CN114387678A (en) Method and apparatus for evaluating language readability using non-verbal body symbols
Eshetu et al. A real-time Ethiopian sign language to audio converter
Wali et al. Recent progress in sign language recognition: a review
Zahedi Robust appearance based sign language recognition
Rishi et al. Hardware Implementation of Two Way Sign Language Conversion System
CN116894978B (en) Online examination anti-cheating system integrating facial emotion and behavior multi-characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination