CN118016073A - Classroom coarse granularity sound event detection method based on audio and video feature fusion - Google Patents

Classroom coarse granularity sound event detection method based on audio and video feature fusion Download PDF

Info

Publication number
CN118016073A
Authority
CN
China
Prior art keywords
audio
frame
information
video
classroom
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311820919.0A
Other languages
Chinese (zh)
Inventor
许炜
崔玉蕾
周为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202311820919.0A priority Critical patent/CN118016073A/en
Publication of CN118016073A publication Critical patent/CN118016073A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/18 - Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of intelligent classrooms, and particularly relates to a classroom coarse-granularity sound event detection method based on audio and video feature fusion, which comprises the following steps: performing face detection on the video data frame by frame with a video information processing model and extracting the mouth state information of everyone in each frame; performing human body posture detection on the video data frame by frame and extracting the posture information of everyone in each frame; splicing all the mouth state information and all the posture information in time order to form the video action features; extracting audio features from the audio data frame by frame with an audio information processing model, converting the audio data into text and extracting text features frame by frame; splicing the audio features and the text features in time order to form the audio information features; and, based on the video action features and the audio information features, outputting a detection and classification result of the speaking role for each frame with a feature fusion and classification model, thereby obtaining the coarse-granularity sound event detection result. The invention can improve the detection precision of classroom sound events.

Description

Classroom coarse granularity sound event detection method based on audio and video feature fusion
Technical Field
The invention belongs to the technical field of intelligent classrooms, and particularly relates to a classroom coarse-granularity sound event detection method based on audio and video feature fusion.
Background
Classroom activity detection has long been a hot topic, and experts and scholars continue to research it; by analyzing the behaviors of students and teachers in class and adjusting the content accordingly after class, both the teaching skills of teachers and the learning efficiency of students can be improved.
To realize classroom activity detection, a high-quality, fine-grained record of classroom activity is indispensable: it is necessary to determine whether someone is speaking and who the speaker is, and to record the start and stop time of each utterance of the teacher and of the students in the classroom, i.e., to perform coarse-granularity sound event detection for the classroom. However, separate activity records for the teacher and the students are very difficult to obtain unless a dedicated person records everything that happens in the classroom, or the teacher and every student each wear an independent sound-receiving device; neither is realistic. In practice only one or two sound-receiving devices are available, and they capture a mixture of all the sounds in the classroom.
Some existing research performs classroom sound event detection based on audio alone. When the quality of the collected audio is poor, the classroom environment is complex, or the teacher's voice is very close to that of some students, the accuracy of classroom sound event detection suffers. Moreover, in some situations, such as teacher-student interaction, the speakers switch extremely quickly, which poses a great challenge for audio-only classroom sound event detection: the event switching points often cannot be detected well, resulting in erroneous segmentations.
Disclosure of Invention
Aiming at the defects and improvement demands of the prior art, the invention provides a classroom coarse granularity sound event detection method based on audio and video feature fusion, which aims to improve the coarse granularity sound event detection precision of a classroom.
In order to achieve the above object, according to an aspect of the present invention, there is provided a classroom coarse granularity sound event detection method based on audio and video feature fusion, including:
Acquiring video data and audio data generated in a classroom;
Adopting the constructed video information processing model to perform face detection on video data frame by frame, and extracting state information of all mouths in each frame; carrying out human body posture detection on the video data frame by frame, and extracting posture information of all people in each frame; splicing the state information of all the mouths and the gesture information of all the people according to a time sequence to be used as video action characteristics;
Extracting audio characteristics from audio data frame by adopting the constructed audio information processing model, converting the audio data into text, and extracting text characteristics from the text frame by frame; splicing the audio features and the text features according to a time sequence to serve as audio information features;
Based on the video action features and the audio information features, a constructed feature fusion and classification model based on an attention mechanism outputs a detection and classification result of the speaking role for each frame, so that the classroom coarse-granularity sound event detection result is obtained; each coarse-granularity sound event comprises the event start and stop time and the corresponding speaking role, and the speaking roles are divided into three categories: teacher, student and mixed.
Further, the part of the video information processing model for extracting the state information of all the mouths in each frame is obtained by training and constructing based on MTCNN algorithm.
Further, the state information of each mouth is state information composed of four key points of the left and right mouth corners and the upper and lower lips.
Further, when a mask is recognized, mask information is used as the state information of the corresponding mouth.
Further, the part of the video information processing model for extracting the gesture information of all people in each frame is obtained by training and constructing based on AlphaPose algorithm.
Further, the video data includes teacher video data and student video data;
and splicing the video action features obtained from the teacher video data and the video action features obtained from the student video data in time order to form the total video action features, which are input into the feature fusion and classification model together with the audio information features.
Further, the posture information of the teacher extracted based on the teacher video data is posture information composed of 15 key points of the head, neck, left and right shoulders, left and right elbows, left and right hands, left and right ankles, left and right knees, left and right crotch bones, and torso;
The posture information of the student extracted based on the student video data is posture information composed of eight key points of a head, a neck, left and right shoulders, left and right elbows and left and right hands.
Further, the audio features are mel-frequency cepstral coefficients.
Further, the method for splicing the audio feature and the text feature according to the time sequence is as follows:
And respectively inputting the audio features and the text features into a CNN feature extraction network, splicing the results after feature extraction according to time sequence alignment, and inputting the results into an RNN network to obtain the audio information features combined with the context information.
The invention also provides a computer readable storage medium, which stores a computer program, wherein when the computer program is run by a processor, the equipment where the storage medium is located is controlled to execute the classroom coarse granularity sound event detection method based on the audio and video feature fusion.
In general, through the above technical solutions conceived by the present invention, the following beneficial effects can be obtained:
(1) Based on the multi-modal data produced in a real classroom environment, the invention converts the audio into text and extracts text features, which is equivalent to detecting classroom coarse-granularity sound events by combining data of three modalities. Specifically, three kinds of features are obtained by extracting and analyzing the classroom audio and video information: audio features, text features and video action features. Detecting classroom coarse-granularity sound events with this multi-modal fusion effectively avoids the poor segmentation results that arise in the audio-only mode when the audio quality is poor, the classroom environment is complex, or the teacher's voice is very close to that of some students, and thus improves the detection precision of classroom coarse-granularity sound events.
(2) In the invention the video data comprise teacher video data and student video data, collected separately so that teacher actions and student actions are each captured clearly, which helps to improve the coarse-granularity sound event detection precision.
Drawings
Fig. 1 is a schematic block diagram of a classroom coarse granularity sound event detection method based on audio and video feature fusion according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Example 1
A class coarse granularity sound event detection method based on audio and video feature fusion is shown in fig. 1, and comprises the following steps:
Acquiring video data and audio data generated in a classroom;
Adopting the constructed video information processing model to perform face detection on video data frame by frame, and extracting state information of all mouths in each frame; carrying out human body posture detection on the video data frame by frame, and extracting posture information of all people in each frame; splicing the state information of all the mouth parts and the gesture information of all the people according to a time sequence to be used as video action characteristics;
extracting audio characteristics from audio data frame by adopting the constructed audio information processing model, converting the audio data into text, and extracting text characteristics from the text frame by frame; splicing the audio features and the text features according to a time sequence to serve as audio information features;
Based on the video action features and the audio information features, a constructed feature fusion and classification model based on an attention mechanism outputs a detection and classification result of the speaking role for each frame, so that the classroom coarse-granularity sound event detection result is obtained; each coarse-granularity sound event comprises the event start and stop time and the corresponding speaking role, and the speaking roles are divided into three categories: teacher, student and mixed.
In this embodiment, the classroom coarse-granularity sound event detection model classifies each sound event into one of three classes, teacher, student and babble (i.e., mixed, with several people speaking simultaneously), and outputs the classroom coarse-granularity sound event detection result, including the start and stop time of each event and its corresponding speaking role. The present embodiment combines mouth features, text features and action features to determine the speaking role.
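To make the output format concrete, the following minimal sketch (not taken from the patent text) shows how per-frame speaking-role predictions can be merged into coarse-granularity events with start and stop times; the frame rate, label names and minimum event length are assumptions:

```python
# Minimal sketch: merge per-frame speaking-role predictions into coarse-granularity
# sound events (start time, stop time, role). Frame rate, label names and the
# minimum-run threshold are assumptions, not specified in the patent text.
FRAME_RATE = 25  # frames per second (assumption)
LABELS = {0: "teacher", 1: "student", 2: "babble"}

def frames_to_events(frame_labels, min_frames=5):
    """Collapse per-frame class indices into (start_s, end_s, role) events.

    Runs shorter than `min_frames` are dropped (an assumed post-processing
    choice, echoing the rule that extremely short sounds are left unlabeled).
    """
    events, start, current = [], 0, None
    for i, lab in enumerate(list(frame_labels) + [None]):  # sentinel flushes the last run
        if lab != current:
            if current is not None and i - start >= min_frames:
                events.append((start / FRAME_RATE, i / FRAME_RATE, LABELS[current]))
            start, current = i, lab
    return events

print(frames_to_events([0] * 50 + [2] * 3 + [1] * 40))
# [(0.0, 2.0, 'teacher'), (2.12, 3.72, 'student')]  -- the 3-frame babble run is dropped
```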
Face detection for the teacher and the students in the classroom is performed with an algorithm such as RetinaFace or MTCNN; meanwhile, the facial key points of each detected face are extracted with a facial landmark algorithm (for example, Dlib or MTCNN). As a preferred implementation, the video feature extraction (i.e., the facial key points) is carried out with the MTCNN algorithm, which completes the two tasks of face detection and key point extraction at the same time; if RetinaFace is selected for face detection, the facial key points are extracted with an algorithm such as Dlib.
MTCNN comprises a three-stage network structure: P-Net, which quickly generates candidate windows; R-Net, which filters and selects the candidate windows with higher precision; and O-Net, which generates the final bounding boxes and facial key points. P-Net is a fully convolutional network consisting of three convolution layers with 3x3 kernels; through three output branches it outputs whether the region contains a face, the candidate boxes of possible faces and the corresponding facial key points, and these candidate regions are input into R-Net for further processing. Compared with P-Net, R-Net adds a fully connected layer of dimension 128 after the last convolution layer, which refines the input and discards most of the erroneous candidates; finally, three fully connected layers output the face classification, the bounding box and the facial key points, respectively. The O-Net network is more complex: the information passed from R-Net undergoes four convolutions and three pooling operations, after which a fully connected layer of dimension 256 followed by three fully connected layers outputs the face classification, the bounding box and the facial key points, respectively.
Early face algorithms and their corresponding data sets annotate five key points, the left and right eyes, the nose and the left and right mouth corners, and use these five key points for face recognition and detection. This is not suitable for the scenario of this embodiment, which requires mouth state features and an indication of whether a mask is worn, so the key point information is modified and a mask key point is added. That is, since the mouth features are what matter when detecting whether someone is speaking, the feature point extraction targets are preferably changed to four key points, the left and right mouth corners and the upper and lower lips; and considering that a student or teacher may be wearing a mask, when a mask is recognized during feature point extraction, mask information is output as the mouth key points, where the mask information may, for example, be marked with a dedicated feature map.
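For illustration only, the sketch below runs an off-the-shelf MTCNN (the facenet-pytorch package is an assumption) frame by frame and keeps the two mouth-corner landmarks of every detected face; the retrained four-point mouth model with a mask key point described above is not publicly available, so this is only an approximation of that step:

```python
# Minimal sketch (assumption: the facenet-pytorch MTCNN implementation). The standard
# model returns five landmarks per face (eyes, nose, two mouth corners); only the two
# mouth corners are kept here, approximating the mouth-state extraction described above.
import cv2
import torch
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=True, device="cuda" if torch.cuda.is_available() else "cpu")

def mouth_states_per_frame(video_path):
    cap = cv2.VideoCapture(video_path)
    per_frame = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        boxes, probs, landmarks = mtcnn.detect(rgb, landmarks=True)
        # landmarks: (n_faces, 5, 2) = left eye, right eye, nose, left/right mouth corner
        mouths = [] if landmarks is None else [lm[3:5].flatten() for lm in landmarks]
        per_frame.append(mouths)
    cap.release()
    return per_frame  # one list of mouth-corner coordinate vectors per video frame
```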
For the acquisition of posture information, a human body pose estimation algorithm (such as AlphaPose) is adopted to extract human body key points. AlphaPose is preferred; it is a multi-person pose estimation method that uses a top-down detection strategy, first detecting the bounding box of each person in the picture and then estimating the pose within each bounding box independently. It mainly comprises two networks. (1) The symmetric spatial transformer network, which contains a spatial transformer network (STN), a spatial de-transformer network (SDTN) and a single-person pose estimator (SPPE), with the SPPE located between the STN and the SDTN. A parallel SPPE branch is connected to the STN to optimize the network: the parallel SPPE mainly judges whether the pose extracted by the STN is centered, back-propagates the error to the STN and updates the STN network weights; it does not contribute to the output and is only used to optimize the STN network. (2) The parametric pose non-maximum suppression network: human body detection inevitably produces redundant detection boxes, i.e., one detection target may have several boxes, which in turn produces redundant pose detections, so a pose distance metric is used to compare the similarity of poses and remove similar ones.
In order to collect better and clearer action characteristics, preferably, the video data comprise teacher video data and student video data; for example, a camera may be disposed in front of the classroom to collect video data of students, and a camera may be disposed behind the classroom to collect video data of teachers.
The video action features obtained from the teacher video data and the video action features obtained from the student video data are then spliced in time order as the total video action features, which are input into the feature fusion and classification model together with the audio information features.
As a preferred embodiment, the posture information of the teacher extracted based on the teacher video data is posture information composed of 15 key points of the head, neck, left and right shoulders, left and right elbows, left and right hands, left and right ankles, left and right knees, left and right crotch bones, and torso;
The posture information of the student extracted based on the student video data is posture information composed of eight key points of the head, the neck, the left and right shoulders, the left and right elbows and the left and right hands.
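As an illustration of how the per-frame features described above could be assembled, the sketch below concatenates the teacher pose (15 key points), the student poses (8 key points each) and the mouth state information into one video-action vector per frame; the fixed maximum number of students, the zero-padding for missing detections and the flat coordinate layout are assumptions not specified in the text:

```python
# Minimal sketch: assemble one video-action feature vector per frame by concatenating
# mouth states and pose key points. The fixed student count, zero-padding and layout
# are assumptions, not part of the patent text.
import numpy as np

N_TEACHER_KPTS = 15   # head, neck, shoulders, elbows, hands, ankles, knees, hips, torso
N_STUDENT_KPTS = 8    # head, neck, shoulders, elbows, hands
MAX_STUDENTS = 40     # assumption: pad/truncate to a fixed class size

def frame_action_feature(teacher_pose, student_poses, mouth_states):
    """teacher_pose: (15, 2); student_poses: list of (8, 2); mouth_states: list of (8,)."""
    teacher = np.asarray(teacher_pose, dtype=np.float32).reshape(-1)
    students = np.zeros((MAX_STUDENTS, N_STUDENT_KPTS * 2), dtype=np.float32)
    for i, pose in enumerate(student_poses[:MAX_STUDENTS]):
        students[i] = np.asarray(pose, dtype=np.float32).reshape(-1)
    mouths = np.zeros((MAX_STUDENTS + 1, 8), dtype=np.float32)  # 4 mouth points x 2 coords
    for i, m in enumerate(mouth_states[:MAX_STUDENTS + 1]):
        mouths[i] = np.asarray(m, dtype=np.float32)
    return np.concatenate([teacher, students.reshape(-1), mouths.reshape(-1)])

# Stacking frame_action_feature(...) over time yields the video action feature sequence Xv.
```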
Extraction of text features and audio features: the classroom audio data is transcribed into text data, preferably using a tool such as Whisper or Kaldi; the transcription result contains time stamps corresponding to the text content and is input into a Bert model. Bert uses a multi-layer bidirectional Transformer as its feature extractor to extract word vector features that contain context information. The word vector features are obtained by combining the word embedding vector and the position embedding of the input text, x_{ti} = w_i + p_i^{embedding}, where x_{ti} denotes the word vector feature of the i-th word, w_i the word embedding vector of the i-th word and p_i^{embedding} the position embedding vector of the i-th word; the finally obtained word vector features are Xt = [x_{t1}, x_{t2}, ..., x_{tn}].
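For illustration, contextual word-vector features can be extracted with a BERT encoder as sketched below; the Hugging Face transformers library and the bert-base-chinese checkpoint are assumptions, the text above only specifying that a Bert model is used:

```python
# Minimal sketch (assumption: Hugging Face transformers with a Chinese BERT checkpoint).
# BERT internally sums word and position embeddings, matching x_ti = w_i + p_i above,
# and returns one contextual vector per token.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

def text_features(transcript: str) -> torch.Tensor:
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # (n_tokens, 768) word-vector features Xt
```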
Audio features Xs, such as MFCC or LLD features, are extracted using an audio processing tool. Preferably, this example extracts mel-frequency cepstral coefficients (MFCC features). Let x(n) denote the original audio signal. It is first preprocessed (silence removal, noise reduction, pre-emphasis, etc.), and the preprocessed audio signal is divided into overlapping frames of length N; preferably a 50% overlap ratio is used, so that adjacent frames share an overlapping region and the variation between them is not too large. A window function w(n) is applied to each frame, i.e., s(n) = x(n) * w(n), where s(n) is the windowed signal. A Fast Fourier Transform (FFT) is applied to the windowed signal, S(k) = FFT[s(n)], where S(k) is the frequency-domain signal and k is the frequency index. The energy spectrum of each frame is the squared magnitude of the frequency-domain signal, E(k) = |S(k)|^2. A set of mel filters, evenly distributed on the mel scale, is designed, and the energy spectrum is filtered by this filter bank to obtain the filter-bank output H_m = \sum_{k'} E(k') H(k', m), where m denotes the filter index and H(k', m) is the response of the m-th filter at frequency index k'. Taking the logarithm of the filter-bank output gives the log energy spectrum M_m = \log(H_m), and applying a Discrete Cosine Transform (DCT) to the log energy spectrum gives the cepstral coefficients
Xs_m = \sum_{j=1}^{J} M_j \cos[\pi m (j - 1/2) / J],
where J is the number of mel filters and Xs_m is the m-th cepstral coefficient. Finally, the resulting Xs_m are used as the MFCC feature vector of the audio signal.
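A minimal sketch of the MFCC pipeline described above follows (framing with 50% overlap, windowing, FFT, mel filter bank, logarithm, DCT); the sampling rate, frame length, number of filters and number of coefficients are illustrative assumptions, and librosa is assumed only for building the mel filter bank:

```python
# Minimal sketch of the described MFCC pipeline: frame with 50% overlap, window,
# FFT energy spectrum, mel filter bank, log, DCT. Parameter values are assumptions.
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(x, sr=16000, frame_len=400, n_fft=512, n_mels=26, n_mfcc=13):
    hop = frame_len // 2                                   # 50% overlap between frames
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, n_fft//2+1)
    window = np.hamming(frame_len)
    coeffs = []
    for start in range(0, len(x) - frame_len + 1, hop):
        s = x[start:start + frame_len] * window            # s(n) = x(n) * w(n)
        S = np.fft.rfft(s, n=n_fft)                        # S(k) = FFT[s(n)]
        E = np.abs(S) ** 2                                 # E(k) = |S(k)|^2
        H = mel_fb @ E                                     # mel filter-bank energies H_m
        M = np.log(H + 1e-10)                              # log energy spectrum M_m
        coeffs.append(dct(M, type=2, norm="ortho")[:n_mfcc])  # cepstral coefficients Xs_m
    return np.stack(coeffs)                                # (n_frames, n_mfcc) MFCC matrix Xs
```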
In a preferred embodiment, the method for splicing the audio feature and the text feature according to the time sequence is as follows:
And respectively inputting the audio features and the text features into a CNN feature extraction network, splicing the results after feature extraction according to time sequence alignment, and inputting the results into an RNN network to obtain the audio information features combined with the context information.
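A minimal PyTorch sketch of this arrangement is given below; the use of a GRU as the RNN, the hidden sizes and the bidirectionality are assumptions, since the text only specifies a CNN feature extraction network followed by an RNN:

```python
# Minimal PyTorch sketch of the audio-information branch: per-modality 1-D CNNs,
# frame-wise concatenation along the time axis, then an RNN (a GRU is assumed)
# for context. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AudioInfoBranch(nn.Module):
    def __init__(self, mfcc_dim=13, text_dim=768, hidden=128):
        super().__init__()
        self.audio_cnn = nn.Conv1d(mfcc_dim, hidden, kernel_size=3, padding=1)
        self.text_cnn = nn.Conv1d(text_dim, hidden, kernel_size=3, padding=1)
        self.rnn = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, mfcc, text):
        # mfcc: (B, T, mfcc_dim); text: (B, T, text_dim), already aligned to the same T frames
        a = torch.relu(self.audio_cnn(mfcc.transpose(1, 2))).transpose(1, 2)
        t = torch.relu(self.text_cnn(text.transpose(1, 2))).transpose(1, 2)
        fused = torch.cat([a, t], dim=-1)          # frame-wise concatenation in time order
        Xa, _ = self.rnn(fused)                    # context-aware audio information features
        return Xa                                  # (B, T, 2 * hidden)
```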
The constructed attention-based feature fusion and classification model outputs the detection and classification result of the speaking role for each frame. The video action feature Xv and the audio information feature Xa are input into a fusion model with an attention mechanism, fused, and fed into a fully connected layer to obtain the integrated classroom sound event detection result. Taking the audio information feature as the key and the video action feature as the query, the relevance s_i of Xa_i and Xv_i is computed, where Xa_i denotes the audio information feature at the i-th moment and Xv_i the video action feature at the i-th moment: s_i = V \tanh(W \cdot Xv_i + U \cdot Xa_i), where W and U are weight parameters, tanh is the activation function and V is the multi-modal fusion weight parameter. The normalized attention coefficient \alpha_i is obtained by normalizing s_i, and the final fused feature can be expressed as Y_i = \alpha_i \cdot Xa_i + (1 - \alpha_i) \cdot Xv_i.
And finally, inputting the fusion characteristics Yi into a full-connection layer network for classification to obtain a class coarse-granularity sound event detection result.
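The sketch below illustrates this attention-gated fusion and the fully connected classifier: s_i = V tanh(W·Xv_i + U·Xa_i), a normalized coefficient alpha_i (a sigmoid is assumed here, since the normalization is not spelled out above), Y_i = alpha_i·Xa_i + (1 − alpha_i)·Xv_i, and a linear layer over the three speaking-role classes; projecting both modalities to a common dimension before mixing is also an assumption:

```python
# Minimal PyTorch sketch of attention-gated audio-video fusion and frame-wise
# classification. Sigmoid normalization of s_i and the common-dimension projection
# are assumptions; dimensions are illustrative.
import torch
import torch.nn as nn

class AttentionFusionClassifier(nn.Module):
    def __init__(self, video_dim, audio_dim, dim=256, n_classes=3):  # teacher / student / babble
        super().__init__()
        self.W = nn.Linear(video_dim, dim, bias=False)   # weight W acting on Xv_i
        self.U = nn.Linear(audio_dim, dim, bias=False)   # weight U acting on Xa_i
        self.V = nn.Linear(dim, 1, bias=False)           # multi-modal fusion weight V
        self.proj_v = nn.Linear(video_dim, dim)
        self.proj_a = nn.Linear(audio_dim, dim)
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, Xv, Xa):
        # Xv: (B, T, video_dim) video action features; Xa: (B, T, audio_dim) audio info features
        s = self.V(torch.tanh(self.W(Xv) + self.U(Xa)))  # s_i = V tanh(W Xv_i + U Xa_i)
        alpha = torch.sigmoid(s)                         # normalized attention coefficient
        Y = alpha * self.proj_a(Xa) + (1 - alpha) * self.proj_v(Xv)  # fused feature Y_i
        return self.fc(Y)                                # per-frame speaking-role logits
```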
Regarding the construction of a video information processing model, an audio information processing model and a feature fusion and classification model based on an attention mechanism, the three can be independently trained, wherein:
The loss function L_mouth of the part of the video information processing model used to extract the state information of all mouths in each frame is expressed as:
L_{mouth} = \sum_{t} \sum_{i=1}^{e} l_{bce}(p_{i,t}, \hat{p}_{i,t})
In this expression, i indexes the mouth key points, e is the number of mouth key points, t denotes time, l_{bce} denotes the binary cross-entropy loss, p_{i,t} denotes the true state of the mouth key point p_i at time t, and \hat{p}_{i,t} denotes the predicted probability of the mouth key point p_i at time t.
The loss function L_posture of the part of the video information processing model used to extract the posture information of all persons in each frame is expressed as:
L_{posture} = \sum_{t} \sum_{i=1}^{g} l_{bce}(a_{i,t}, \hat{a}_{i,t})
In this expression, i indexes the human body key points, g is the number of human body key points, t denotes time, a_{i,t} denotes the true state of the posture key point a_i at time t, and \hat{a}_{i,t} denotes the predicted probability of the posture key point a_i at time t.
The loss function L_word of the part of the audio information processing model used to extract text features from the text frame by frame is expressed as:
L_{word} = -\sum_{i=1}^{W} q_i \log \hat{q}_i
In this expression, W denotes the vocabulary size of the speech transcription text, q_i denotes the true result of the i-th word of the transcription result, and \hat{q}_i denotes the predicted result of the i-th word of the transcription result.
The loss function L_fusion used to fuse the video action features and the audio information features in the feature fusion and classification model is expressed as:
L_{fusion} = -\sum_{t} \sum_{i=1}^{z} y_{i,t} \log \hat{y}_{i,t}
In this expression, i indexes the sound detection label types, z is the number of sound detection labels, t denotes time, y_{i,t} denotes the true state of the classroom sound event detection result y_i at time t in the feature fusion model, and \hat{y}_{i,t} denotes the predicted probability of the classroom sound event detection result y_i at time t.
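As a small illustration, the fusion loss written above (cross entropy summed over label types and frames) could be computed as follows; encoding the per-frame target as an integer role label is an assumption:

```python
# Minimal sketch of the frame-wise fusion loss L_fusion: cross entropy summed over
# label types i and frames t. Integer role labels (0..z-1) per frame are assumed.
import torch
import torch.nn.functional as F

def fusion_loss(logits, targets):
    """logits: (T, z) per-frame class scores; targets: (T,) integer role labels."""
    return F.cross_entropy(logits, targets, reduction="sum")  # sums -y_{i,t} log(yhat_{i,t})
```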
During video annotation, when a teacher or student wears a mask, the mask is annotated as the mouth key points. The start and end time point of each event and the corresponding label are marked manually; different students are not distinguished from one another and all share the unified label of student, and when a sound is extremely short it is not given any label. The annotated classroom audio and video data set is obtained according to these labeling rules.
In general, the invention detects classroom coarse-granularity sound events by fusing data of multiple modalities, adds video-modality features of teachers and students, and detects the mouth state and body posture of the current speaker. Because the video modality better reflects the current classroom sound event, this solves the problem that the accuracy of classroom sound event detection suffers when only audio is used, for example when the collected audio quality is poor, the classroom environment is complex, or the teacher's voice is very close to that of some students. At the same time, the audio-video fusion approach also avoids the problem that classroom sound events cannot be detected from the video modality alone when students and teachers wear masks.
Example two
A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor controls a device in which the storage medium is located to perform a class coarse granularity sound event detection method based on audio-video feature fusion as described above.
The related technical solution is the same as the first embodiment, and will not be described herein.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A class coarse granularity sound event detection method based on audio and video feature fusion is characterized by comprising the following steps:
Acquiring video data and audio data generated in a classroom;
Adopting the constructed video information processing model to perform face detection on video data frame by frame, and extracting state information of all mouths in each frame; carrying out human body posture detection on the video data frame by frame, and extracting posture information of all people in each frame; splicing the state information of all the mouths and the gesture information of all the people according to a time sequence to be used as video action characteristics;
Extracting audio characteristics from audio data frame by adopting the constructed audio information processing model, converting the audio data into text, and extracting text characteristics from the text frame by frame; splicing the audio features and the text features according to a time sequence to serve as audio information features;
Based on the video action characteristics and the audio information characteristics, a constructed characteristic fusion and classification model based on an attention mechanism is adopted, and a detection classification result of each frame of speaking roles is output, so that a class coarse-granularity sound event detection result is obtained, each coarse-granularity sound event comprises event start-stop time and a speaking role corresponding to the event start-stop time, and the speaking roles are divided into three categories of teachers, students and mixtures.
2. The method for detecting class coarse-granularity sound events according to claim 1, wherein the part of the video information processing model for extracting the state information of all the mouths in each frame is obtained by training based on MTCNN algorithm.
3. The classroom coarse granularity sound event detection method according to claim 1 or 2, wherein the status information of each mouth is status information composed of four key points of left and right mouth corners and upper and lower lips.
4. The classroom coarse granularity sound event detection method according to claim 1 or 2, wherein when the mask is identified, the state information of the corresponding mouth is identified using mask information.
5. The method for detecting class coarse-granularity sound events according to claim 1, wherein the part of the video information processing model for extracting the gesture information of all people in each frame is obtained by training based on AlphaPose algorithm.
6. The classroom coarse granularity sound event detection method according to any one of claims 1 to 5, wherein the video data includes teacher video data and student video data;
and splicing the video motion characteristics obtained based on the teacher video data and the video motion characteristics obtained based on the student video data according to a time sequence to serve as total video motion characteristics, and inputting the characteristics fusion and classification model with the audio information characteristics.
7. The classroom coarse granularity sound event detection method according to claim 6, wherein the posture information of the teacher extracted based on the teacher video data is posture information composed of 15 key points of the head, neck, left and right shoulders, left and right elbows, left and right hands, left and right ankles, left and right knees, left and right crotch bones, and torso;
The posture information of the student extracted based on the student video data is posture information composed of eight key points of a head, a neck, left and right shoulders, left and right elbows and left and right hands.
8. The classroom coarse granularity sound event detection method according to claim 1, wherein the audio feature is mel-frequency cepstral coefficient.
9. The method for detecting class coarse granularity sound events according to claim 1, wherein the manner of splicing the audio feature and the text feature according to the time sequence is as follows:
And respectively inputting the audio features and the text features into a CNN feature extraction network, splicing the results after feature extraction according to time sequence alignment, and inputting the results into an RNN network to obtain the audio information features combined with the context information.
10. A computer readable storage medium, characterized by a computer program stored in the computer readable storage medium, wherein the computer program, when executed by a processor, controls a device in which the storage medium is located to perform a class coarse granularity sound event detection method based on audio-video feature fusion according to one of claims 1 to 9.
CN202311820919.0A 2023-12-27 2023-12-27 Classroom coarse granularity sound event detection method based on audio and video feature fusion Pending CN118016073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311820919.0A CN118016073A (en) 2023-12-27 2023-12-27 Classroom coarse granularity sound event detection method based on audio and video feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311820919.0A CN118016073A (en) 2023-12-27 2023-12-27 Classroom coarse granularity sound event detection method based on audio and video feature fusion

Publications (1)

Publication Number Publication Date
CN118016073A true CN118016073A (en) 2024-05-10

Family

ID=90941908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311820919.0A Pending CN118016073A (en) 2023-12-27 2023-12-27 Classroom coarse granularity sound event detection method based on audio and video feature fusion

Country Status (1)

Country Link
CN (1) CN118016073A (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160314789A1 (en) * 2015-04-27 2016-10-27 Nuance Communications, Inc. Methods and apparatus for speech recognition using visual information
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN110473548A (en) * 2019-07-31 2019-11-19 华中师范大学 A kind of classroom Internet analysis method based on acoustic signal
CN110807370A (en) * 2019-10-12 2020-02-18 南京摄星智能科技有限公司 Multimode-based conference speaker identity noninductive confirmation method
CN111161715A (en) * 2019-12-25 2020-05-15 福州大学 Specific sound event retrieval and positioning method based on sequence classification
CN111341318A (en) * 2020-01-22 2020-06-26 北京世纪好未来教育科技有限公司 Speaker role determination method, device, equipment and storage medium
CN111785287A (en) * 2020-07-06 2020-10-16 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium
CN112599135A (en) * 2020-12-15 2021-04-02 华中师范大学 Teaching mode analysis method and system
KR20210064018A (en) * 2019-11-25 2021-06-02 광주과학기술원 Acoustic event detection method based on deep learning
CN114282621A (en) * 2021-12-29 2022-04-05 湖北微模式科技发展有限公司 Multi-mode fused speaker role distinguishing method and system
CN114399818A (en) * 2022-01-05 2022-04-26 广东电网有限责任公司 Multi-mode face emotion recognition method and device
WO2022110354A1 (en) * 2020-11-30 2022-06-02 清华珠三角研究院 Video translation method, system and device, and storage medium
WO2022116420A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Speech event detection method and apparatus, electronic device, and computer storage medium
CN115719516A (en) * 2022-11-30 2023-02-28 华中师范大学 Multichannel-based classroom teaching behavior identification method and system


Similar Documents

Publication Publication Date Title
CN112489635B (en) Multi-mode emotion recognition method based on attention enhancement mechanism
CN112259106B (en) Voiceprint recognition method and device, storage medium and computer equipment
Gao et al. Sign language recognition based on HMM/ANN/DP
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
US7636662B2 (en) System and method for audio-visual content synthesis
CN103218842B (en) A kind of voice synchronous drives the method for the three-dimensional face shape of the mouth as one speaks and facial pose animation
CN107972028B (en) Man-machine interaction method and device and electronic equipment
CN113822192A (en) Method, device and medium for identifying emotion of escort personnel based on Transformer multi-modal feature fusion
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
Sahoo et al. Emotion recognition from audio-visual data using rule based decision level fusion
CN116304973A (en) Classroom teaching emotion recognition method and system based on multi-mode fusion
CN117765981A (en) Emotion recognition method and system based on cross-modal fusion of voice text
CN114724224A (en) Multi-mode emotion recognition method for medical care robot
CN1952850A (en) Three-dimensional face cartoon method driven by voice based on dynamic elementary access
CN115188074A (en) Interactive physical training evaluation method, device and system and computer equipment
Pujari et al. A survey on deep learning based lip-reading techniques
CN116883888A (en) Bank counter service problem tracing system and method based on multi-mode feature fusion
Liu et al. Real-time speech-driven animation of expressive talking faces
Bera et al. Identification of mental state through speech using a deep learning approach
Lan et al. Low level descriptors based DBLSTM bottleneck feature for speech driven talking avatar
CN118016073A (en) Classroom coarse granularity sound event detection method based on audio and video feature fusion
Saranya et al. Text Normalization by Bi-LSTM Model with Enhanced Features to Improve Tribal English Knowledge
Sajid et al. Multimodal emotion recognition using deep convolution and recurrent network
Hsu et al. Attentively-coupled long short-term memory for audio-visual emotion recognition
Sharma et al. Classroom student emotions classification from facial expressions and speech signals using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination